Getting Started With GeoPandas

This notebook serves as a starter guide or template for doing geographical analysis using the GeoPandas library.

The dataset we will be using in this tutorial is from Analyze Boston. Analyze Boston is the City of Boston’s data hub and is a great resource for data sets regarding the city.

We will be working with two datasets. First is the 2020 CENSUS TRACTS IN BOSTON dataset, and second is the WICKED FREE WIFI LOCATIONS dataset. The first dataset contains the geographical boundaries of each census tract in Boston, and the second dataset contains the geographical location of each free wifi location in Boston.

Census Tracts in Boston

Wicked Free Wifi Locations

# Start by importing the necessary packages
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon

# for visualizations : 
import seaborn as sns
import matplotlib.pyplot as plt

When working with geographical data, .shp files are the most common format, however these files are often accompanied by other files with the same name but different extensions. For example, a .shp file might come with a .dbf, .prj, .shx, and .cpg file. These are necessary files and are used in the background. GeoPandas (and other GIS software) typically expect these files to be in the same directory as the .shp file and to share the same base filename.

# read in the census data
census = gpd.read_file('census/Census2020_BlockGroups.shp')
census.head()

	OBJECTID	STATEFP20	COUNTYFP20	TRACTCE20	BLKGRPCE20	GEOID20	NAMELSAD20	MTFCC20	FUNCSTAT20	ALAND20	AWATER20	INTPTLAT20	INTPTLON20	Shape_STAr	Shape_STLe	geometry
0	1	25	025	040600	1	250250406001	Block Group 1	G5030	S	1265377.0	413598.0	+42.3833695	-071.0707743	1.807118e+07	29256.866068	POLYGON ((769378.692 2964626.314, 769385.078 2...
1	2	25	025	051101	1	250250511011	Block Group 1	G5030	S	220626.0	0.0	+42.3882285	-071.0046816	2.374654e+06	9142.174252	POLYGON ((788317.786 2966115.262, 788838.563 2...
2	3	25	025	051101	4	250250511014	Block Group 4	G5030	S	227071.0	270.0	+42.3913407	-071.0020343	2.446949e+06	11579.546171	POLYGON ((789538.125 2967889.427, 789539.214 2...
3	4	25	025	981600	1	250259816001	Block Group 1	G5030	S	586981.0	158777.0	+42.3886205	-070.9934424	8.026752e+06	16626.718823	POLYGON ((790938.417 2966482.118, 790974.619 2...
4	5	25	025	010204	3	250250102043	Block Group 3	G5030	S	145888.0	0.0	+42.3459611	-071.1020344	1.570220e+06	5510.560013	POLYGON ((762928.668 2951612.031, 763063.607 2...

What makes a .shp file unique is the “geometry” column. The geometry column typically contains the coordinates or geometric properties of the features represented in the shapefile, such as points, lines, or polygons. For us, each row in our shapefile represents one census block group, and the geometry column stores a vectorized polygon boundaries of each block group that lets us see a visual representation of the block group.

census['geometry'][0]

svg

We can plot all of these block groups together to get a visual representation of the city of Boston.

census.plot()
plt.title('Boston Census Block Groups')
plt.show()

png

Now let’s read in our Wicked Free Wifi Locations shapefile

# read in the wifi locations shapefile
wfw = gpd.read_file('wifi/Wicked_Free_WiFi_Locations.shp')
wfw

	neighborho	neighbor_1	device_ser	device_con	device_add	device_lat	device_lon	device_tag	etl_update	is_current	org1	org2	inside_out	landmark	ObjectId	geometry
0	N_568579452955534204	Dorchester	Q2CK-H48M-P2E9	DOT-BFD21-AP3	641 Columbia Rd., Dorchester, MA	42.318336	-71.063236	BFD Engine-21 Outside	2023-06-12	1	BFD	NaN	Outside	Engine 21	1	POINT (-7910723.209 5208784.596)
1	N_568579452955537848	South Boston	Q2CK-LEG8-FAFN	SB-HIGHSCHOOL-AP2	95 G St., South Boston, MA	42.332869	-71.044891	recently-added	2023-06-12	1	NaN	NaN	NaN	NaN	2	POINT (-7908681.044 5210972.710)
2	N_568579452955538316	Hyde Park	Q2CK-NERF-4JX7	HP-BPDE18-AP2	1249 Hyde Park Ave., Hyde Park, MA	42.256473	-71.124219	recently-added	2023-06-12	1	NaN	NaN	NaN	NaN	3	POINT (-7917511.803 5199475.676)
3	N_579275502070532581	Roxbury	Q2CK-DSFT-7HTL	ROX-TROTTER-AP3	Humbolt & Waumbeck St., Roxbury, MA	42.313249	-71.089180	BPS Outside Trotter-School	2023-06-12	1	BPS	NaN	Outside	Trotter School	4	POINT (-7913611.287 5208018.690)
4	N_579275502070532581	Roxbury	Q2CK-2D9P-VKGR	ROX-BFDE42-AP1	1870 Columbus Ave., Roxbury, MA	42.318412	-71.097758	BFD Engine-42 Outside	2023-06-12	1	BFD	NaN	Outside	Engine 42	5	POINT (-7914566.172 5208795.947)
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
255	N_568579452955534204	Dorchester	Q2CK-G9M2-VTK4	DOT-BFD21-AP2	641 Columbia Rd., Dorchester, MA	42.318336	-71.063236	BFD Engine-21 Outside	2023-06-12	1	BFD	NaN	Outside	Engine 21	256	POINT (-7910723.209 5208784.596)
256	N_568579452955527921	Parks	Q2CK-MP93-YDN3	PARKS-COMMONEAST-AP1	139 Tremont St, Boston, MA 02111	42.355436	-71.063827	Boston-Common	2023-06-12	1	NaN	NaN	NaN	Boston Common	257	POINT (-7910789.066 5214371.556)
257	L_601230550253962688	Navy Yard	Q2EK-HAXQ-8XW4	NavyYard-AP2	Navy yard Cambridge	42.373720	-71.053272	NaN	2023-06-12	1	NaN	NaN	NaN	NaN	258	POINT (-7909614.112 5217126.360)
258	N_568579452955534204	Dorchester	Q2CK-NJQU-RB55	DOT-MATHER-AP3	24 Parrish St, Dorchester	42.308397	-71.061017	recently-added	2023-06-12	1	NaN	NaN	NaN	NaN	259	POINT (-7910476.256 5207288.280)
259	N_579275502070532581	Roxbury	Q2CK-NEZE-VRYJ	ROX-VINE-AP3	339 Dudley St, Roxbury	42.326811	-71.076707	recently-added	2023-06-12	1	NaN	NaN	NaN	NaN	260	POINT (-7912222.881 5210060.516)

260 rows × 16 columns

This time, our geometry column are single POINTs rather than whole POLYGONs, which makes sense because each row represents a single wifi location. After plotting them all together we see just a bunch of dots, which is not very helpful. We can make this plot more useful by adding the census block groups to the plot as well.

wfw.plot()
plt.title('Wifi Locations')
plt.show()

png

Data Cleaning

print(census.shape)

print(wfw.shape)

(581, 16)
(260, 16)

census.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 581 entries, 0 to 580
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   OBJECTID    581 non-null    int64   
 1   STATEFP20   581 non-null    object  
 2   COUNTYFP20  581 non-null    object  
 3   TRACTCE20   581 non-null    object  
 4   BLKGRPCE20  581 non-null    object  
 5   GEOID20     581 non-null    object  
 6   NAMELSAD20  581 non-null    object  
 7   MTFCC20     581 non-null    object  
 8   FUNCSTAT20  581 non-null    object  
 9   ALAND20     581 non-null    float64 
 10  AWATER20    581 non-null    float64 
 11  INTPTLAT20  581 non-null    object  
 12  INTPTLON20  581 non-null    object  
 13  Shape_STAr  581 non-null    float64 
 14  Shape_STLe  581 non-null    float64 
 15  geometry    581 non-null    geometry
dtypes: float64(4), geometry(1), int64(1), object(10)
memory usage: 72.8+ KB

wfw.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 260 entries, 0 to 259
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   neighborho  260 non-null    object  
 1   neighbor_1  260 non-null    object  
 2   device_ser  260 non-null    object  
 3   device_con  250 non-null    object  
 4   device_add  245 non-null    object  
 5   device_lat  248 non-null    float64 
 6   device_lon  248 non-null    float64 
 7   device_tag  215 non-null    object  
 8   etl_update  260 non-null    object  
 9   is_current  260 non-null    int64   
 10  org1        119 non-null    object  
 11  org2        5 non-null      object  
 12  inside_out  162 non-null    object  
 13  landmark    168 non-null    object  
 14  ObjectId    260 non-null    int64   
 15  geometry    248 non-null    geometry
dtypes: float64(2), geometry(1), int64(2), object(11)
memory usage: 32.6+ KB

Doesn’t look like there’s much to clean in the census data, but in the wifi dataset let’s just drop the organization columns because they contain the most nulls, and we don’t need them for our analysis.

wfw.drop(['org1', 'org2'], axis=1, inplace=True)

Showing the wifi locations on our census block group map

So we can see a map of Boston and a map of dots representing wifi locations separately, but how do I overlay these dots on top of the map of Boston to actuall see where in Boston these wifi locations are? First, check that the Coordinate Reference Systems (CRS) of both geodataframes are the same. The CRS specifies the coordinate system and reference frame used to define the spatial positions of the features represented in the file, and in order to overlay two geodataframes, they must have the same CRS.

# Check the CRS of both GeoDataFrames
print("Census CRS: ", census.crs)
print("Wifi CRS: ", wfw.crs)

Census CRS:  PROJCS["NAD83 / Massachusetts Mainland (ftUS)",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0],UNIT["Degree",0.0174532925199433]],PROJECTION["Lambert_Conformal_Conic_2SP"],PARAMETER["latitude_of_origin",41],PARAMETER["central_meridian",-71.5],PARAMETER["standard_parallel_1",41.7166666666667],PARAMETER["standard_parallel_2",42.6833333333333],PARAMETER["false_easting",656166.666666667],PARAMETER["false_northing",2460625],UNIT["US survey foot",0.304800609601219,AUTHORITY["EPSG","9003"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]
Wifi CRS:  EPSG:3857

Since they’re different, we need to use the to_crs method to make sure they’re all in the same coordinate system. We’ll use the EPSG:4326 coordinate system, which is one of the most common coordinate systems for geographical data. Then, we can overlay the two plots using matplotlib.

# If they're different, reproject the wifi data to match the census data
if census.crs != wfw.crs:
    census = census.to_crs(wfw.crs)

# Create a new figure and axis
fig, ax = plt.subplots(1, 1)

# Plot the census data
census.plot(ax=ax, color='gray')

# Plot the wifi locations
wfw.plot(ax=ax, color='red', markersize=10, alpha=0.2)

plt.title("Boston Census Block Groups and Wifi Locations")
plt.show()

png

If you want to zoom in on a specific area of a map in GeoPandas, you need to add the lines: plt.xlim() and plt.ylim(). These functions take in the xmin and xmax, and ymin and ymax values of the area you want to zoom in on. You can find these values by looking at the x and y axes of the original plot.

# Same as before
if census.crs != wfw.crs:
    census = census.to_crs(wfw.crs)
fig, ax = plt.subplots(1, 1)
census.plot(ax=ax, color='gray')
wfw.plot(ax=ax, color='red', markersize=10, alpha=0.2)
plt.title("Zoomed in Boston Census Block Groups and Wifi Locations")


# Set the x and y limits to zoom in on Boston
plt.xlim([-7.915e6, -7.91e6])  # xmin and xmax
plt.ylim([5.2075e6, 5.2125e6])  # ymin and ymax

plt.show()

png

Make a Choropleth (color each census block according to the number of WiFi locations in it):

Now that both of our GeoDataFrames are on the same CRS, we can do a spatial join to find out how many WiFi locations are in each census tract. A spatial join is a type of join that merges two GeoDataFrames based on the spatial relationship between the geometries of the two GeoDataFrames. In this case, we want to know how many WiFi locations are in each census tract, so we will use the contains operation to find all of the WiFi locations that are within each census tract.

# Perform a spatial join between the census tracts and WiFi locations
wifi_per_tract = gpd.sjoin(census, wfw, predicate='contains').groupby('TRACTCE20').size()
wifi_per_tract.head()

TRACTCE20
000301     1
000603     3
000604     1
000805     1
030302    35
dtype: int64

The “contains” operation (as well as other spatial operations like “intersects”, “within”, etc.) uses geometric calculations based on the shapes’ coordinates (which have the same CRS) to determine spatial relationships.

For the “contains” operation specifically, it checks if all points of the other geometric object are within the current geometric object, and that their boundaries do not coincide.

For example:

from shapely.geometry import Point, Polygon

# Create a 5x5 square (tract)
polygon = Polygon([(0, 0), (5, 0), (5, 5), (0, 5)])

# Create two points (WiFi location)
point1 = Point(3, 3)  # Inside the polygon
point2 = Point(6, 6)  # Outside the polygon

# Check containment
print(polygon.contains(point1))
print(polygon.contains(point2))  

True
False

We can now groupby the unique ‘TRACTCE20’ values and take .size() to get the number of wifi locations in each tract. Then, we can merge this with our census dataframe to get a new column with the number of wifi locations in each census tract.

# Add a new column to the census GeoDataFrame with the count of WiFi locations per tract
census = census.merge(wifi_per_tract.rename('wifi_count'), how='left', left_on='TRACTCE20', right_index=True)

# Replace NaN values (tracts without any WiFi location) with 0
census['wifi_count'].fillna(0, inplace=True)

# Now, create a new figure and axis
fig, ax = plt.subplots(1, 1, figsize=(10, 10))

# Plot the census data, color-coded by the number of WiFi locations
census.plot(column='wifi_count', ax=ax, legend=True, cmap='cividis', 
            legend_kwds={'label': "Number of WiFi Locations per Tract",
                         'orientation': "horizontal"})

plt.title("Number of WiFi Locations per Tract")
plt.show()

png

census[census['wifi_count'] == census['wifi_count'].max()]['wifi_count']

478    35.0
Name: wifi_count, dtype: float64

Looks like there’s 35 free wifi locations in this bright yellow tract, maybe the ‘landmarks’ column will tell us why…

wfw['landmark'].value_counts().head(5).sort_values().plot(kind='barh', figsize=(10, 10))
plt.title('Number of WiFi Locations by Landmark')
plt.xlabel('Number of WiFi Locations')
plt.ylabel('Landmark')
plt.show()

png

So it could be the City Hall, let’s graph it!

city_hall_wifi = wfw[wfw['landmark'] == 'City Hall']

if census.crs != city_hall_wifi.crs:
    city_hall_wifi = city_hall_wifi.to_crs(census.crs)

fig, ax = plt.subplots()
census.plot(ax=ax, color='grey', edgecolor='black')
city_hall_wifi.plot(ax=ax, color='red', markersize=10)
plt.title('WiFi Locations at City Hall')
plt.show()

png

Looks like we’re right! After plotting the subset of our dataframe where Landmark == “City Hall”, we can see that all of these points are clustered in that bright yellow census tract. Looking at the listed address of these WiFi locations, we can see that they are all located at 1 City Hall Square, which is the address of Boston City Hall.

Here are the unique device addresses of these city hall wifi locations:

city_hall_wifi['device_add'].unique()

array(['1 City Hall Square, Boston, MA 02201',
       '1 City Hall Plaza, Pavilion\nBoston, MA  02201',
       'One City Hall Sq, Pavilion Bldg\nBoston, MA 02201', nan,
       '1 City Hall Plaza, \nBoston, MA  02201',
       'One City Hall Sq, Pavilion Bld, IWF Room,\nBoston, MA 02201'],
      dtype=object)

Inside vs Outside Wifi locations

fig, ax = plt.subplots(1, 1)

census.plot(ax=ax, color='gray')

# Plot the wifi locations with color based on "inside_out" column
wfw.plot(ax=ax, column='inside_out', cmap='cool', markersize=10, legend=True, alpha=0.2)

plt.title("Boston Census Block Groups and Wifi Locations")
plt.show()

png

wfw['inside_out'].value_counts().plot(kind='barh')
plt.show()

png

Looks like most of our WiFi locations are outside. There’s probably a lot of “regular” EDA that can be done here, but since this is a GeoPandas/geographical analysis notebook, we’ll just leave it at that.

Closest Wifi location to BU

from geopy.distance import geodesic

# Coordinates of Boston University (accordign to Google)
BU_latitude = 42.3505
BU_longitude = -71.1054

# Create a copy of the WiFi locations and drop rows with missing coordinates
wfw = wfw.copy()
wfw.dropna(subset=['device_lat', 'device_lon'], inplace=True)


# Function to calculate distances to BU
def calculate_distance(row):
    point = (row['device_lat'], row['device_lon'])
    return geodesic(point, (BU_latitude, BU_longitude)).miles

wfw['distance_to_BU'] = wfw.apply(calculate_distance, axis=1)

# Find the WiFi location with the minimum distance
wfw[wfw['distance_to_BU'] == min(wfw['distance_to_BU'])]

	neighborho	neighbor_1	device_ser	device_con	device_add	device_lat	device_lon	device_tag	etl_update	is_current	inside_out	landmark	ObjectId	geometry	distance_to_BU
36	N_601230550253961587	BCYF Tremont	Q2FD-4SE4-JW2S	BCYF-3rd FL GED	1483 Tremont St, Boston, MA	42.332317	-71.09865	BCYF Inside employee	2023-06-12	1	Inside	employee	37	POINT (-7914665.525 5210889.576)	1.301774

In order to find the closest wifi location to BU, we need to find the distance between each wifi location and BU. We can do this using the geodesic function from geopy, which returns the distance between two longitudes/latitudes. Then, we can find the row where the distance is the smallest, which will be the closest wifi location to BU.

The closest wifi location to BU is the one at 1483 Tremont St, Boston, MA. It’s about 1.3 miles away from BU. Let’s plot it!

fig, ax = plt.subplots(1, 1)

# Plot the census data
census.plot(ax=ax, color='gray')

# Plot the wifi locations
wfw.plot(ax=ax, color='blue', markersize=10, alpha=0.2)

# Plot the WiFi location closest to BU in red
wfw.query("device_ser == 'Q2FD-4SE4-JW2S'").plot(ax=ax, color='red', markersize=20)
plt.title("1483 Tremont Street and other Wifi Locations")
plt.show()

png

Wrap Up, Next Steps

There’s a lot more you can do with these datasets, for example, you could try finding the average distance between wifi locations, or incorporate another dataset with demographic information and try to find a relationship between income, education level, population density, etc and wifi locations. but this notebook should serve as a good starting point for your geographical analysis.

Click here to download this Jupyter Notebook and data (make sure you are signed in with your BU email)!