This is part 1 of the first instalment of a new series of blogs on Open-Source Spatial technologies. First stop on this tour is a Python library called GeoPandas.
What is GeoPandas?
Before GeoPandas, there was of course Pandas, the adorably named but very powerful Python library of data structure and analysis tools. Pandas is “the Excel of Python”. Its primary object type is the DataFrame, which I think of as an abstract database table, or spreadsheet – you can do things like create queries, index your data, perform grouping and aggregations, and join tables, i.e. all that good database/spreadsheet stuff, but right in a python environment.
Of course, all of these functions are very useful when working with spatial data (I even remember adding Pandas to ArcGIS 10.1 by hand back in 2015 to use it in a toolbox script). The one thing that was missing from Pandas was spatial data typing, in the same way that PostGIS was missing from PostgreSQL.
This is what GeoPandas does – it extends the standard Pandas DataFrame with spatial typings. The advantage of having typings directly on your data object is that you can use a standard set of methods and properties for each type. Basically, GeoPandas adds a geometry column to the DataFrame, not dissimilar to the “geom” column from PostGIS. This geometry column contains Shapely geometries (Shapely is a Python library with geometry types and operations), which means that we can access all the properties and methods available to Shapely objects directly on the DataFrame, or even the feature itself. For example, we can get the CRS of the layer by accessing its `crs` property, or create a buffer simply by calling the `buffer()` method, right on the GeoDataFrame object.
In addition to this core function, GeoPandas also adds a few useful common tools to the package to help buff out the offering a bit, including reading/writing spatial file formats directly into/from GeoDataFrames using the Fiona library. It also adds plotting (i.e. generating map outputs) with Descartes and Matplotlib.
The sum of these parts is a very “Pythonic” and user-friendly experience, which provides a set of useful basic tools for working with your spatial data, even generating map and data outputs, all directly in an open-source Python environment. I will say that DataFrames, as good as they are, can take a little bit of time to wrap ones’ head around, but it is time well spent. My only other “criticism” is that not all the Shapely methods and properties have been bound to the GeoDataFrame yet, but as we see in the examples to come, we can work around this.
Incidentally, the power of Pandas did not go unnoticed by ESRI, who since ArcGIS version 10.4 have also baked the Pandas library into their arcpy releases. They have implemented basically the same premise as the GeoDataFrame (i.e. adding spatial typing to DataFrames), but gave it the more pragmatic title of “Spatially Enabled DataFrame” (SEDF). It’s worth noting though that ArcGIS 10.4 was released in February of 2016, while the first commit to the GeoPandas repository on github was back in June 2013. I’d say that earns GeoPandas the title of OG (“Original GIS”) fair and square.
Worked Example, Part 1
Now for a worked example.
The example I’ve chosen is based on an actual workflow I used for a job last year. I’ve simplified bits of the workflow for the sake of brevity, but the outputs I needed were two datasets:
- a set of addresses as points, and
- a set of rotated minimum bounding rectangles of postcodes that either fall within, or touch the border of Victoria.
And I needed the two tables to be joinable on the postcode value.
In part 1, below, we’ll select out the postcodes and save them to a shapefile, then in part 2 we’ll generate some random points and rotated bounding boxes, and join them.
A quick note – I won’t go through any environment setup in this blog, but you can check out the steps I went through [[ here ]].
And you can download the git repository here, then follow the instructions in the readme to get set up.
Anyway, let’s get started.
import geopandas
Create data frame from shapefile¶
Let's make a GeoDataFrame from our postcode dataset. We can do this directly using the read_file()
geopandas method.
The read_file()
method references the Fiona library's import functions, and can read from any OGR vector source.
aus_poas = geopandas.read_file('aus_poas.shp')
This is what our dataframe looks like - it's a table, but with a fully typed spatial geometry column.
The pandas head()
method returns the first 5 rows with our attributes - POA_NAME, code, state and geometry
aus_poas.head()
Plot our data¶
Basic plotting is fully implemented on our dataframe via MatPlotLib, and we can get a quick and dirty preview of our data using the plot()
method on the dataframe
ax = aus_poas.plot()
Pandas Methods¶
All the common methods for a Pandas dataframe are supported on our GeoPandas dataframe.
We can use the query()
method to select features. For example, this query()
returns the Melbourne postcode
melb = aus_poas.query('code == 3000')
ax = melb.plot()
melb.head()
Let's select all the Victorian postcodes
vic_poas = aus_poas.query('code >= 3000 & code <= 3999')
ax = vic_poas.plot()
vic_poas.head()
Performing Dataframe Operations¶
There are a bunch of useful methods available right on our GeoDataframe, such as dissolve()
.
Let's create a shape of Victoria using dissolve()
, based on the Victoria postcode polygons.
vic_shape = vic_poas.dissolve(by='state')
ax = vic_shape.plot()
vic_shape.head()
note that this has made the 'state' column the index column, on the far left.
Projecting & Buffering¶
Now that we have our Victoria polygon, we want to buffer around it to get adjoining postcodes from the full set.
The vic_shape layer is in GDA94, so let's re-project it to GA Lambert EPSG:3112 so we can calculate in metres. We can do this super quickly and easily using the to_crs()
method.
vic_shape_lamb = vic_shape.to_crs(epsg=3112)
ax = vic_shape_lamb.plot()
Now we can buffer in metres, using the buffer()
method.
The buffer()
method returns a GeoSeries (a single feature geometry), but we want to keep using our data in a GeoDataFrame, so we need to create a new data frame and then add the resulting buffer GeoSeries.
We create a new empty GeoDataFrame using geopandas.GeoDataFrame()
with the CRS we defined earlier on the vic_shape object (which we can reference directly with the .crs
property)
vic_buffer = geopandas.GeoDataFrame(crs=vic_shape_lamb.crs)
Then we can perform the 10km buffer operation with buffer(10000)
, and store it in the 'geometry' column on the GeoDataFrame we just created.
vic_buffer['geometry'] = vic_shape_lamb.buffer(10000)
ax = vic_buffer.plot()
Let's re-project our buffer to get it back to the original CRS, ready to for joining with our full postcode dataset
vic_buffer_gda = vic_buffer.to_crs(aus_poas.crs)
ax = vic_buffer_gda.plot()
Do a Spatial Join to Select All Postcodes Inside Victoria and Within 10kms¶
We can use the 'spatial join' or sjoin()
method (accessible via the geopandas class) to do an intersect analysis, then query out the features that satisfy the intersect relationship.
Performing sjoin()
with the how
parameter set to 'left' keeps all rows from the full set of postcodes, but we can filter them down to just those that intersect by querying the 'index_right' column, which is added as part of the join with the index of the joined feature (in this case our Vic buffer has an index_right id of 'VIC', as noted above)
vic_plus = geopandas.sjoin(aus_poas, vic_buffer_gda, how='left', op='intersects').query('index_right == "VIC"')
ax = vic_plus.plot()
vic_plus.head()
The result is all of the postcodes in Victoria, and the postcodes in the other states which come within 10kms of the border
Exporting Our Data¶
Finally, let's save our data for use in part two of the processing. We can use the to_file()
method directly from the GeoDataFrame to save in our workspace.
vic_plus.to_file('./outputs/vic_poas_plus.shp', driver='ESRI Shapefile')
Let's quickly check our output file...
ax = geopandas.read_file('./outputs/vic_poas_plus.shp').plot()
I'm also going to save out our dissolved Victoria shape for later
vic_shape.to_file('./outputs/vic_shape.shp', driver='ESRI Shapefile')
This ends part 1, in part two let's create some random points, join the postcode attributes to them, and create some minimum bounding boxes.