PostGIS, GeoPandas, and DuckDB: A Decision Framework for Spatial Analytics

A spatial analytics team typically has access to all three tools — PostGIS in the database, GeoPandas in the notebook, DuckDB in the pipeline. The question is not which tool is best overall (none of them is), but which tool is best for a specific task. Using the wrong tool does not always mean getting wrong answers; it means getting correct answers slowly, or in a workflow that is harder to maintain.

This article provides a decision framework based on the structural characteristics of each tool, with performance benchmarks to back up the recommendations.

# The Three Tools in Profile

# PostGIS

PostGIS is a PostgreSQL extension that adds spatial data types, indexing (GiST and SP-GiST), and a function library covering everything from ST_Distance to ST_VoronoiPolygons. It is the most mature spatial SQL system available and the right choice when:

Multiple clients need concurrent read or write access to the same data
Indexed lookups are needed (point-in-polygon queries on a constantly updated dataset)
Transactions are required (spatial data that must be consistent with related non-spatial tables)
Complex SQL with CTEs, window functions, and user-defined functions is your preferred query language
The data is already in Postgres and moving it elsewhere adds friction

The cost: PostgreSQL is a server. It needs provisioning, monitoring, backup, and a connection. In serverless or ephemeral compute environments, this is a meaningful operational burden.

# GeoPandas

GeoPandas is a Python DataFrame library that adds geometry types and Shapely-based spatial operations to Pandas. It is the right choice when:

You are writing Python code and want geometry operations to feel like DataFrame operations
The result is a GeoDataFrame — something you will plot, join to other DataFrames, or pass to Shapely functions
The dataset fits in memory (up to a few hundred million rows on a well-equipped machine)
Exploratory analysis — interactive notebooks where you want .explore() and gdf.plot()
Complex spatial operations that are easier to express in Python than SQL

The cost: everything is in-memory. For very large datasets, memory pressure limits what is possible on a single machine.

# DuckDB Spatial

DuckDB is an in-process OLAP database with a Spatial extension (via GDAL and GEOS). It is the right choice when:

Data is in files (GeoParquet, Parquet, CSV with WKB columns) and you do not want to load it into Python memory
SQL is your preferred query language for aggregation and filtering
Serverless execution — CI/CD, AWS Lambda, dbt, or any environment without a persistent database
Reading from S3 without downloading the full dataset
Multi-file queries — glob patterns over directories of partition files

The cost: limited spatial index support (no persistent GiST index), and geometry-intensive operations (large union aggregates, complex overlay analysis) are slower than PostGIS with GiST.

# Benchmark: Five Representative Tasks

The five most common spatial analytics operations, benchmarked across all three tools on a 1.2 million parcel dataset:

# Task 1: Filter + Aggregate (non-spatial)

Count parcels and sum land value by zone type.

# GeoPandas
result = gdf.groupby('zone_type').agg(
    count=('land_value', 'count'),
    total_value=('land_value', 'sum')
)
# → 0.08s

# DuckDB
result = con.execute("""
    SELECT zone_type, COUNT(*), SUM(land_value)
    FROM read_parquet('parcels.parquet')
    GROUP BY zone_type
""").df()
# → 0.06s

# PostGIS
result = pd.read_sql("""
    SELECT zone_type, COUNT(*), SUM(land_value)
    FROM parcels GROUP BY zone_type
""", engine)
# → 0.12s (includes network round-trip)

Winner for non-spatial aggregation: DuckDB or GeoPandas. PostGIS adds network overhead.

# Task 2: Point-in-Polygon Spatial Join

Assign 2 million GPS pings to their containing zone.

# GeoPandas (STRtree sjoin)
result = gpd.sjoin(pings_gdf, zones_gdf, how='left', predicate='within')
# → 8.2s

# DuckDB Spatial
result = con.execute("""
    SELECT p.ping_id, z.zone_name
    FROM read_parquet('pings.parquet') p
    JOIN read_parquet('zones.parquet') z
    ON ST_Contains(z.geometry, p.geometry)
""").df()
# → 14.8s

# PostGIS (with GiST index on zones)
result = pd.read_sql("""
    SELECT p.ping_id, z.zone_name
    FROM gps_pings p
    JOIN zones z ON ST_Contains(z.geom, p.geom)
""", engine)
# → 6.1s (GiST index makes repeated queries fast)

Winner for spatial joins: PostGIS (with GiST) for repeat queries, GeoPandas for one-off joins.

# Task 3: Bounding Box Subset Read

Read parcels within Brisbane CBD.

bbox = (152.9, -27.6, 153.1, -27.4)

# GeoPandas (load full dataset, then filter)
gdf_cbd = gdf.cx[152.9:153.1, -27.6:-27.4]   # if already loaded
# → 0.02s (if already loaded), 4.1s if loading from scratch

# GeoPandas with bbox= parameter
gdf_cbd = gpd.read_parquet("parcels.parquet", bbox=bbox)
# → 0.7s (uses Parquet row group statistics)

# DuckDB with bbox columns
result = con.execute("""
    SELECT * FROM read_parquet('parcels.parquet')
    WHERE "bbox.xmin" < 153.1 AND "bbox.xmax" > 152.9
      AND "bbox.ymin" < -27.4 AND "bbox.ymax" > -27.6
      AND ST_Intersects(geometry, ST_MakeEnvelope(152.9, -27.6, 153.1, -27.4))
""").df()
# → 0.3s

# PostGIS with GiST index
result = pd.read_sql("""
    SELECT * FROM parcels
    WHERE geom && ST_MakeEnvelope(152.9, -27.6, 153.1, -27.4, 4326)
      AND ST_Intersects(geom, ST_MakeEnvelope(152.9, -27.6, 153.1, -27.4, 4326))
""", engine)
# → 0.08s (GiST index on geom)

Winner for bbox queries: PostGIS (with GiST) by a large margin for indexed data. DuckDB is faster than GeoPandas for unindexed files.

# Task 4: Complex Geometry Operation (Union by Group)

Compute the union of all parcels per zone type (20 zone types → 20 union polygons).

# GeoPandas
result = gdf.groupby('zone_type').geometry.apply(lambda g: g.union_all())
# → 18.4s

# DuckDB
result = con.execute("""
    SELECT zone_type, ST_Union_Agg(geometry) AS union_geom
    FROM read_parquet('parcels.parquet')
    GROUP BY zone_type
""").df()
# → 22.1s

# PostGIS
result = pd.read_sql("""
    SELECT zone_type, ST_Union(geom) AS union_geom
    FROM parcels GROUP BY zone_type
""", engine)
# → 14.2s (GEOS runs server-side)

Winner for complex geometry operations: PostGIS (server-side GEOS with good memory management).

# Task 5: Data Loaded Once, Many Queries

Load once, run 100 different filter + aggregate queries.

# GeoPandas: load once, query from in-memory DataFrame
gdf = gpd.read_parquet("parcels.parquet")   # 4.1s, 100M → memory

for query in 100_queries:
    result = gdf.query(query['filter']).groupby('zone_type')['land_value'].mean()
    # each query: ~0.05s

# DuckDB: each query reads from file (row group caching helps)
for query in 100_queries:
    result = con.execute(f"SELECT zone_type, AVG(land_value) FROM read_parquet('parcels.parquet') WHERE {query['filter']} GROUP BY zone_type").df()
    # each query: ~0.2s (file I/O each time unless cached)

# PostGIS: indexed table, server-side caching
for query in 100_queries:
    result = pd.read_sql(f"SELECT zone_type, AVG(land_value) FROM parcels WHERE {query['filter']} GROUP BY zone_type", engine)
    # each query: ~0.08s

Winner for repeated queries on the same data: GeoPandas (data loaded once in memory). PostGIS is close (indexed, with buffer pool cache). DuckDB is slower per query due to file re-read overhead.

# Decision Matrix

Task	PostGIS	GeoPandas	DuckDB
Non-spatial aggregation	Good	Best	Best
Spatial join (first run)	Good	Best	OK
Spatial join (repeated)	Best (GiST)	Good	OK
Bbox read from file	Good	Good	Best
Complex geometry ops	Best	Good	OK
Many queries, same data	Good	Best	OK
Serverless/no-infra	No	Yes	Yes
Read from S3	Via FDW	No	Yes
Multi-client concurrency	Yes	No	Limited

# Combining the Three Tools

The most performant pipelines use all three tools at their strengths:

# Pattern: DuckDB for initial filtering from S3, GeoPandas for analysis,
# PostGIS for indexed queries in production API

import duckdb
import geopandas as gpd

# 1. DuckDB: read subset from S3 GeoParquet without downloading
result_df = duckdb.execute("""
    SELECT * FROM read_parquet('s3://bucket/national_parcels/*.parquet')
    WHERE state_code = 'QLD' AND zone_type IN ('Residential', 'Commercial')
""").df()

# 2. GeoPandas: in-memory spatial analysis on the filtered subset
import shapely
gdf = gpd.GeoDataFrame(result_df, geometry=shapely.from_wkb(result_df['geometry']), crs='EPSG:4326')
buffered = gdf.copy()
buffered.geometry = gdf.geometry.buffer(50)
result = gpd.sjoin(buffered, flood_zones, how='left', predicate='intersects')

# 3. PostGIS: write final result to production database for API queries
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@host/db")
result.to_postgis("flood_risk_parcels", engine, if_exists='replace', index=False)
# PostGIS now has an indexed spatial table for fast API queries

This pattern — DuckDB for extraction, GeoPandas for analysis, PostGIS for serving — avoids loading the full national dataset into memory, leverages SQL for the initial filter, and produces an indexed result ready for production queries.