For years, the choice for serious spatial SQL analytics was PostGIS. It is excellent: mature, feature-complete, and well-integrated with the Python ecosystem via SQLAlchemy and psycopg2. But PostGIS requires PostgreSQL, which requires a server process, which requires provisioning, maintenance, and a network connection.
DuckDB’s spatial extension changes this equation. It brings a substantial subset of PostGIS’s geometry function library — ST_Contains, ST_Intersects, ST_Distance, ST_Area, spatial joins, geometry construction — to an in-process SQL engine that runs entirely inside your Python process. No server. No port. No connection string. Just a Python library that reads GeoParquet files from disk or S3.
# Setup
import duckdb
con = duckdb.connect() # in-memory database
# or: con = duckdb.connect("analysis.duckdb") # persistent on-disk database
con.execute("INSTALL spatial")
con.execute("LOAD spatial")
On first run, INSTALL spatial downloads the extension from DuckDB’s package registry. Subsequent runs use the cached extension.
# Reading GeoParquet
# Query a GeoParquet file directly — no loading into Python
result = con.execute("""
SELECT COUNT(*), ST_AsText(ST_Centroid(ST_Union_Agg(geometry)))
FROM read_parquet('parcels.parquet')
WHERE land_value > 500000
""").df()
The geometry column in GeoParquet is stored as WKB (Well-Known Binary). DuckDB’s spatial extension automatically deserialises WKB when you access geometry columns in spatial functions.
# Spatial Joins
The canonical spatial join — assign each GPS ping to its containing zone — in DuckDB:
result = con.execute("""
SELECT
p.ping_id,
z.zone_name,
z.zone_code,
ST_Distance(p.geometry, z.geometry) AS dist_to_boundary
FROM read_parquet('gps_pings.parquet') AS p
JOIN read_parquet('zones.parquet') AS z
ON ST_Contains(z.geometry, p.geometry)
LIMIT 10
""").df()
DuckDB executes spatial joins using a nested-loop join with bounding-box pruning — it is not as fast as PostGIS’s GiST index for repeat queries against the same dataset, but for single-pass analytics over files it is highly competitive.
Benchmark: spatial join of 2 million GPS pings against 500 zone polygons:
- DuckDB spatial extension: 12 seconds
- PostGIS (with GiST index, local server): 8 seconds
- GeoPandas
gpd.sjoin(): 9 seconds - Python nested loop: 4 minutes+
DuckDB is within 50% of indexed PostGIS performance for single-pass joins, with zero infrastructure.
# Aggregation with Geometry Operations
# Count features and compute union polygon per zone — pure SQL
result = con.execute("""
SELECT
zone_type,
COUNT(*) AS feature_count,
SUM(land_area_m2) AS total_area_m2,
AVG(land_value) AS mean_land_value,
ST_AsGeoJSON(ST_Union_Agg(geometry)) AS union_geojson
FROM read_parquet('parcels.parquet')
GROUP BY zone_type
ORDER BY feature_count DESC
""").df()
ST_Union_Agg computes the union of all geometries in each group — the SQL equivalent of shapely.union_all(). Running this across 1.2 million polygons grouped into 20 zone types takes approximately 18 seconds in DuckDB, including geometry union computation.
# Bounding Box Pre-filter
DuckDB does not automatically use the GeoParquet bounding box metadata columns as a spatial index (that is DuckDB’s roadmap, not current behaviour). However, you can manually push the bbox filter to the query, which activates Parquet row group pruning:
# If GeoParquet was written with write_covering_bbox=True:
result = con.execute("""
SELECT p.*
FROM read_parquet('parcels.parquet') AS p
WHERE p."bbox.xmin" < 153.1
AND p."bbox.xmax" > 152.9
AND p."bbox.ymin" < -27.4
AND p."bbox.ymax" > -27.6
AND ST_Intersects(
p.geometry,
ST_MakeEnvelope(152.9, -27.6, 153.1, -27.4)
)
""").df()
The first four conditions use Parquet statistics to skip row groups entirely (no geometry deserialization needed). The ST_Intersects condition then runs exact geometry intersection on only the candidate row groups.
# Distance-Based Queries
# Find all parcels within 500 metres of a point of interest
poi = (153.024, -27.468) # Brisbane City Hall
result = con.execute(f"""
SELECT
parcel_id,
address,
land_value,
ST_Distance(
geometry,
ST_Point({poi[0]}, {poi[1]})
) * 111320 AS dist_metres -- degrees to metres (approx)
FROM read_parquet('parcels.parquet')
WHERE ST_DWithin(
geometry,
ST_Point({poi[0]}, {poi[1]}),
0.005 -- ~500m in degrees at this latitude
)
ORDER BY dist_metres
LIMIT 20
""").df()
Note: ST_DWithin in DuckDB Spatial operates in the native coordinate units. For geographic coordinates (EPSG:4326), distances are in degrees. For accurate metre-based distance queries, either reproject to a local projected CRS or use ST_Distance_Sphere:
result = con.execute(f"""
SELECT
parcel_id,
ST_Distance_Sphere(
geometry,
ST_Point({poi[0]}, {poi[1]})
) AS dist_metres
FROM read_parquet('parcels.parquet')
WHERE ST_Distance_Sphere(
geometry,
ST_Point({poi[0]}, {poi[1]})
) < 500
ORDER BY dist_metres
""").df()
# Reading from S3
DuckDB can query GeoParquet directly from S3 without downloading the file:
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# Configure AWS credentials (or use IAM role if running on EC2/ECS)
con.execute("""
SET s3_region = 'ap-southeast-2';
SET s3_access_key_id = 'YOUR_ACCESS_KEY';
SET s3_secret_access_key = 'YOUR_SECRET_KEY';
""")
# Query from S3 directly
result = con.execute("""
SELECT COUNT(*), AVG(land_value)
FROM read_parquet('s3://your-bucket/spatial-data/parcels.parquet')
WHERE zone_type = 'Residential'
""").df()
DuckDB uses HTTP range requests to read only the relevant Parquet row groups from S3 — the same optimisation that makes direct-S3 analytics fast for columnar data.
# Returning to GeoPandas
For most analytical workflows, keep results in Pandas/GeoPandas for downstream processing:
import geopandas as gpd
# Execute query and get result as a GeoDataFrame
result_df = con.execute("""
SELECT parcel_id, address, land_value, ST_AsWKB(geometry) AS geometry_wkb
FROM read_parquet('parcels.parquet')
WHERE zone_type = 'Commercial'
""").df()
# Reconstruct geometry from WKB
import shapely
result_df['geometry'] = shapely.from_wkb(result_df['geometry_wkb'].values)
result_gdf = gpd.GeoDataFrame(
result_df.drop(columns=['geometry_wkb']),
geometry='geometry',
crs='EPSG:4326'
)
# DuckDB vs PostGIS vs GeoPandas: Decision Framework
These three tools are complementary rather than competing:
Use DuckDB when:
- You have GeoParquet files and want SQL analytics without a server
- You are running one-off or scheduled analytics jobs in a serverless or CI environment
- You need SQL as your query language (easier to express than Python method chains for complex aggregations)
- You want to read from S3 without downloading
Use PostGIS when:
- You have repeat queries against the same dataset (GiST indexes amortise their cost)
- Multiple clients need concurrent access
- You need transactional writes
- Query complexity benefits from SQL stored procedures and user-defined functions
Use GeoPandas when:
- You need Shapely geometry objects for downstream manipulation
- You are building a pipeline where the output is a GeoDataFrame, not a summary table
- You want the richest Python integration (plotting,
.explore(), direct Shapely API access)
For batch analytics pipelines that read GeoParquet and produce summary DataFrames, DuckDB is now frequently the fastest and simplest tool in the stack.
Related reading: GeoParquet: The Analytical Format for Vector Geodata · PostGIS, GeoPandas, and DuckDB: A Decision Framework · GeoParquet Spatial Partitioning