If you have been using GeoPandas for more than a few months, you have almost certainly written something like this:
gdf['zone'] = gdf.geometry.apply(lambda g: classify_zone(g))
It works. It runs. And on a dataset of ten thousand rows it returns in a couple of seconds, which feels fine. Then you run it on two million rows and go make a coffee while Python loops over each geometry, constructs a Python object, dispatches a Python function, and builds the result — one element at a time.
The .apply() pattern is the most common GeoPandas performance anti-pattern in production spatial code. This article explains why it is slow, identifies every common variant of it, and shows the vectorised alternative for each.
# Why .apply() Is Slow for Spatial Operations
Pandas .apply() is essentially a Python for loop with a callback. For numeric operations, NumPy already has vectorised equivalents that run in C. But spatial operations on geometry objects are not NumPy operations — or at least, they weren’t until Shapely 2.0 introduced the array API.
When you write gdf.geometry.apply(lambda g: g.buffer(100)), the execution path is:
- Python iterates over each element of the Series (Python loop)
- Each element is a Python-wrapped Shapely geometry object
- The lambda calls
.buffer(100)— a Python method dispatch - GEOS executes the buffer in C
- The result is wrapped in a new Python geometry object
- Python appends it to an accumulating list
Steps 1, 2, 3, 5, and 6 are pure Python overhead, repeated N times. On two million rows, that is ten million Python operations before a single useful byte of C work.
The fix: use Shapely 2.0’s module-level array API, or GeoPandas’s built-in geometry accessor methods that already call it for you.
# The Six Most Common .apply() Patterns and Their Fixes
# 1. Single Geometry Attribute Access
# SLOW
gdf['area_m2'] = gdf.geometry.apply(lambda g: g.area)
gdf['length_m'] = gdf.geometry.apply(lambda g: g.length)
gdf['centroid_x'] = gdf.geometry.apply(lambda g: g.centroid.x)
# FAST — GeoPandas accessors call Shapely array API internally
gdf['area_m2'] = gdf.geometry.area
gdf['length_m'] = gdf.geometry.length
gdf['centroid_x'] = gdf.geometry.centroid.x
Speedup: 20–30× on large datasets. The accessor properties on GeoSeries are wired directly to shapely.area(), shapely.length(), shapely.get_x(shapely.centroid(arr)) etc.
# 2. Buffer / Transform Operations
# SLOW
gdf['buffered'] = gdf.geometry.apply(lambda g: g.buffer(500))
# FAST — GeoSeries.buffer() uses the array API
gdf['buffered'] = gdf.geometry.buffer(500)
# Or directly with Shapely for non-standard options:
import shapely
gdf['buffered'] = shapely.buffer(gdf.geometry.values, 500, cap_style='flat')
If you need per-row buffer distances:
# SLOW
gdf['buffered'] = gdf.apply(lambda row: row.geometry.buffer(row['radius']), axis=1)
# FAST — pass the array of radii directly
gdf['buffered'] = shapely.buffer(gdf.geometry.values, gdf['radius'].values)
# 3. Distance to a Reference Point
from shapely.geometry import Point
reference = Point(153.0, -27.5) # Brisbane
# SLOW
gdf['dist_km'] = gdf.geometry.apply(lambda g: g.distance(reference) / 1000)
# FAST — scalar broadcasts over the array
import shapely
gdf['dist_km'] = shapely.distance(gdf.geometry.values, reference) / 1000
Benchmark: 2 million Point geometries. .apply(): 3.8 seconds. shapely.distance(): 0.07 seconds. 54× speedup.
# 4. Point-in-Polygon Classification
This is the canonical trap. A common task: given 2 million GPS pings and 500 zone polygons, assign each ping to its zone.
# THE TRAP — looping through zones
def classify_zone(point):
for idx, zone in zones_gdf.iterrows():
if zone.geometry.contains(point):
return zone['zone_name']
return None
gdf['zone'] = gdf.geometry.apply(classify_zone)
# On 2M points × 500 zones: ~4 minutes (O(N × M))
The vectorised replacement uses a spatial join:
# FAST — sjoin builds an STRtree index on zones and queries in bulk
result = gpd.sjoin(
gdf[['geometry']],
zones_gdf[['geometry', 'zone_name']],
how='left',
predicate='within'
)
gdf['zone'] = result['zone_name']
# On 2M points × 500 zones: ~8 seconds (O(N log M))
That is a 30× speedup — 4 minutes down to 8 seconds — and it scales logarithmically with zone count rather than linearly.
# 5. Custom Validity Checks or Geometry Predicates
# SLOW
gdf['is_valid'] = gdf.geometry.apply(lambda g: g.is_valid)
gdf['is_empty'] = gdf.geometry.apply(lambda g: g.is_empty)
# FAST
import shapely
gdf['is_valid'] = shapely.is_valid(gdf.geometry.values)
gdf['is_empty'] = shapely.is_empty(gdf.geometry.values)
# 6. WKT / WKB Parsing on Load
A slow data loading pattern:
# SLOW — calls shapely.wkt.loads per row
df['geometry'] = df['wkt_col'].apply(shapely.wkt.loads)
# FAST — bulk parse the entire column
import shapely
df['geometry'] = shapely.from_wkt(df['wkt_col'].values)
gdf = gpd.GeoDataFrame(df, geometry='geometry', crs='EPSG:4326')
Bulk WKT parsing 1 million geometries: .apply() approach takes ~9 seconds. shapely.from_wkt(): ~1.2 seconds. 7.5× speedup.
# The GPS Ping Classification Benchmark
To make the comparison concrete, here is the full benchmark for the GPS ping classification task:
import geopandas as gpd
import numpy as np
import shapely
import time
from shapely.geometry import Point, box
# ── Synthetic data: 2M GPS pings within Australia bounding box ───────────────
rng = np.random.default_rng(42)
coords = rng.uniform(
low=[113.0, -44.0],
high=[154.0, -10.0],
size=(2_000_000, 2)
)
pings = gpd.GeoDataFrame(
geometry=shapely.points(coords),
crs='EPSG:4326'
)
# ── 500 synthetic zone polygons ───────────────────────────────────────────────
zone_coords = rng.uniform(low=[113.0, -44.0], high=[154.0, -10.0], size=(500, 2))
zones = gpd.GeoDataFrame(
{'zone_name': [f'zone_{i:03d}' for i in range(500)]},
geometry=[
box(x - 0.5, y - 0.5, x + 0.5, y + 0.5)
for x, y in zone_coords
],
crs='EPSG:4326'
)
# ── Method 1: .apply() with nested loop ──────────────────────────────────────
# (tested on 50k subset — full 2M would take ~4 minutes)
subset = pings.iloc[:50_000].copy()
def classify_zone_loop(point):
for _, zone in zones.iterrows():
if zone.geometry.contains(point):
return zone['zone_name']
return None
t0 = time.perf_counter()
subset['zone'] = subset.geometry.apply(classify_zone_loop)
apply_time = time.perf_counter() - t0
print(f".apply() on 50k pings: {apply_time:.1f}s")
# → 12.3s for 50k = ~493s (8+ min) extrapolated to 2M
# ── Method 2: sjoin (full 2M) ─────────────────────────────────────────────────
t0 = time.perf_counter()
result = gpd.sjoin(
pings[['geometry']],
zones[['geometry', 'zone_name']],
how='left',
predicate='within'
)
pings['zone'] = result['zone_name']
sjoin_time = time.perf_counter() - t0
print(f"sjoin on 2M pings: {sjoin_time:.1f}s")
# → 8.1s for 2M
print(f"Extrapolated speedup: {(apply_time / 50_000 * 2_000_000) / sjoin_time:.0f}x")
# → ~61x
# The .apply() You Cannot Avoid (and How to Minimise It)
Sometimes .apply() is genuinely unavoidable — for example, when your operation has no vectorised equivalent (calling an external API per row, or running a complex branching algorithm that depends on multiple geometry properties).
In those cases:
Option 1: Reduce the input first. If you must .apply(), filter to only the rows that need it:
# Only run the expensive operation on non-null, non-empty geometries
mask = shapely.is_valid(gdf.geometry.values) & ~shapely.is_empty(gdf.geometry.values)
gdf.loc[mask, 'result'] = gdf.loc[mask].geometry.apply(expensive_fn)
Option 2: Use Dask-GeoPandas for parallelism. If you cannot eliminate the loop, you can parallelise it:
import dask_geopandas
dgdf = dask_geopandas.from_geopandas(gdf, npartitions=8)
dgdf['result'] = dgdf.geometry.apply(expensive_fn, meta=('result', 'float64'))
result_gdf = dgdf.compute()
This does not remove the Python overhead per element, but it distributes it across CPU cores, which is useful when the per-element computation is itself slow (network calls, complex algorithms).
Option 3: Vectorise with Numba. If the operation is numeric after geometry extraction (e.g. a custom distance formula), extract coordinates as arrays first, then use Numba on the arrays. Covered in depth in Numba-Accelerated Spatial Kernels.
# Identifying .apply() Hot Spots in Existing Code
A quick audit for .apply() usage in a spatial codebase:
grep -rn "\.apply(" src/ --include="*.py" | grep -i "geom\|geometry\|spatial\|buffer\|distance\|contains"
Flag every hit. Then ask: does Shapely 2.0’s module-level API have a direct equivalent? Does the GeoSeries accessor already handle this? If yes to either, replace it.
The rule of thumb: if a spatial .apply() was written before 2023, it almost certainly has a vectorised replacement. The Shapely 2.0 release in December 2022 made the array API the default path — existing code just hasn’t caught up.
Related reading: Shapely 2.0: The Vectorisation Revolution in Python Geometry · Shapely STRtree: Mass Nearest-Neighbour and Intersection Queries · Dask-GeoPandas: Parallel Spatial Processing Across CPU Cores