Difficulty: intermediate
Estimated Time: 20 minutes

Quick Introduction to PostGIS

This short litle class will give you a quick introduction to some of the ways you can use PostGIS, the spatial extension for PostgreSQL. The database has already been started and the spatial data has already been loaded

We loaded some data for U.S. Counties - a sub-state (province) geographic subdivision. We chose U.S. data because so much of it is freely available and released under a permissive license.

All the data and SQL used in the following pieces are found in /data/crunchy_demo_data in your vm.

The two tables you will be working with are county boundaries and the point location of storm events.

Here are the details on the database we are connecting to:

  1. Username: groot
  2. Password: password (same password for the postgres user as well)
  3. A database named: workshop

We will not be going in depth on PostGIS we are just going to cover enough to get whet your appetite.

Final Notes

The container used in this class is available in Dockerhub. As long as you have Docker on your machine you can use the same version of PostgreSQL as the workshop. All the data from the workshop was intentionally chosen from public domain or permissive licenses so that you can use it for commercial and non-commercial purposes. Feel free to download it and play some more at your own pace on your own machine.

I hope you enjoyed learning a little about PostGIS and some new analysis.

And with that, we are finished.

Quick Introduction to PostGIS

Working with Spatial Data

Working with Spatial Data in PostGIS

PostgreSQL has the Gold Standard in spatial extensions for any RDBMs on the market - PostGIS. If you have data that has direct spatial information, like coordinates, or indirect, such as an address, you can leverage the power of spatial analysis to enhance the insights into your dataIn the workshopwe will barely be scratching the surface of what you can do with PostGIS so please don't consider this exhaustive in the slightest.

Final note before we dig in, remember that usually to work with spatial data you need to

CREATE EXTENSION postgis;

in your database to enable all the functionality. We don't have to do it in the workshop because we already enabled the extension when we created the DB in the container.

Spatial Tables

Let's go ahead and log in to our PostgreSQL database:

psql -U groot -h localhost workshop

Remember that the password is the word 'password'.

Now if you do:

\d county_geometry

PostgreSQL will show you a full description of the county_geometry table. To see all the \ commands in PostgreSQL just do \? (though don't do it right now).

You will see two spatial columns:

interior_pnt | geography(Point,4326)        |
the_geom     | geography(MultiPolygon,4326) |

You can tell they are spatial because we declared them as type Geography with the type of spatial feature in the parentheses. The other spatial type is Geometry. I give a brief discussion at the end of the page about the difference between the two types. Just know that Geography is perfect for data coming from most GPSs or if you want to deal with data on a continent scale.

You can also see we made indices for these two columns:

"countygeom_interiorpt_indx" gist (interior_pnt)
"countygeom_the_geom_indx" gist (the_geom)

Making GiST indices on spatial data allows for efficient querying of the data by creating bounding rectangles and putting them in the index. The database can then use these simple rectangles to quickly filter out which features are not in the area of interest because geometric operations on rectangles is much quicker than complex shapes.

Simple spatial query

Let's start with one of the simplest queries, a distance query. Let's find the 3 counties closest to the geographic center of the United States (including Alaska and Hawaii): 44.967244 Latitude, -103.771555 Longitude.

Now let's select the id, county names, and distance from the 3 closest counties using the ST_Distance function:

SELECT id, county_name, ST_Distance('POINT(-103.771555 44.967244)'::geography, the_geom) FROM county_geometry ORDER BY ST_Distance('POINT(-103.771555 44.967244)'::geography, the_geom) LIMIT 3;

This result may take a while to return because we are calculating the distance between all the counties in the U.S. and that point so we can ORDER the results on distance from the point. But I mentioned that PostGIS was the gold standard for a reason. It has a method to handle our use case. It's called a K Nearest Neighbor Search (KNN) with its own operator.

SELECT id, county_name, ST_Distance('POINT(-103.771555 44.967244)'::geography, the_geom) FROM county_geometry ORDER BY the_geom <-> 'POINT(-103.771555 44.967244)'::geography LIMIT 3;

When we have a spatial index on the column, set a relatively small limit (X) on the return, and we use the <-> operator, the database "knows" to find the first X spatially closest results and THEN calculate everything else on them. Spatial indices to the rescue!

Spatial Join

This next query will demonstrate joining data based on spatial co-incidence rather than a shared primary key-foreign key relationship. Our example will be to join the storm location data to the county geometry data, given us the county of the incident eventhough it is not in our original file.

select geo.statefp, geo.county_name, geo.aland, se.event_id, se.location from county_geometry as geo, se_locations as se where ST_Covers(geo.the_geom, se.the_geom) limit 10;

The spatial operator we use is ST_Covers which return a boolean if the second geometry is completely withing the first geometry. We set the limit to 5 because we don't want to wait for all the results to return for over 48K rows. The results also show the state fips code which tells us the state name given the number. This way we can check if their is a county with the name in that state as well as a location and if they two overlap.

Spatial buffer and then select

Finally let's do more complicated query that you could not do without sophisticated spatial operations. Suppose we were trying to put together emergency response centers in counties with high potential for storms. We are going to buffer a 12.5 KM radius (about 8 miles) off a storm even center point and then select all the counties that intersect that buffered circle. We will use a grouping query to do a count of the storms circles per county.

First is the query returning all the counties with 12.5KM of a storm event location:

select geo.statefp, geo.county_name, se.locationid from county_geometry as geo, se_locations as se where ST_intersects(geo.the_geom, ST_Buffer(se.the_geom, 12500.0)) limit 200;

and then do the grouping and counting:

with all_counties as (
           select geo.statefp, geo.county_name, se.locationid from county_geometry as geo, se_locations as se where ST_intersects(geo.the_geom, ST_Buffer(se.the_geom, 12500.0)) limit 200
   )
   select statefp, county_name, count(*) from all_counties group by statefp, county_name order by statefp, count(*) DESC;

The with x as () syntax is called a Common Table Expression (CTE) in PostgreSQL and makes writing subqueries a lot easier. The CTE create a temporary table that exists for just one query. The tradeoff is that they are an optimization boundary. In this case they are fine to use for the workshop but if you use them in future work please dig deeper into the tradeoffs of CTEs.

Final Note

Today we worked with a Geography type for our spatial data because our geographic extent was the entire U.S. and it also made our calculations easier syntax wise as well as being accurate. This is also the data format you get natively from GPS units, such as from your phone.

In the future, if you deal with data of a geographic extent less than a province or state, you are going to need to learn to work with coordinate systems and projections of your coordinates (basically the process to take a globe and make a map).

The way you specify your coordinates would change, since you would now have to give the projection you are using. Local governments, non-profits, and companies will using give you the coordinates in a projected system so you will need to learn how to use it in PostGIS. The ideas above remain exactly the same, it's just the way you store your data will change.

To learn more, there is a great, but slightly outdated, discussion in this other workshop content.