DuckDB Spatial Analytics on GeoParquet: In-Process SQL for Geodata

For years, the choice for serious spatial SQL analytics was PostGIS. It is excellent: mature, feature-complete, and well-integrated with the Python ecosystem via SQLAlchemy and psycopg2. But PostGIS requires PostgreSQL, which requires a server process, which requires provisioning, maintenance, and a network connection.

DuckDB’s spatial extension changes this equation. It brings a substantial subset of PostGIS’s geometry function library — ST_Contains, ST_Intersects, ST_Distance, ST_Area, spatial joins, geometry construction — to an in-process SQL engine that runs entirely inside your Python process. No server. No port. No connection string. Just a Python library that reads GeoParquet files from disk or S3.

# Setup

import duckdb

con = duckdb.connect()   # in-memory database
# or: con = duckdb.connect("analysis.duckdb")  # persistent on-disk database

con.execute("INSTALL spatial")
con.execute("LOAD spatial")

On first run, INSTALL spatial downloads the extension from DuckDB’s package registry. Subsequent runs use the cached extension.

# Reading GeoParquet

# Query a GeoParquet file directly — no loading into Python
result = con.execute("""
    SELECT COUNT(*), ST_AsText(ST_Centroid(ST_Union_Agg(geometry)))
    FROM read_parquet('parcels.parquet')
    WHERE land_value > 500000
""").df()

The geometry column in GeoParquet is stored as WKB (Well-Known Binary). DuckDB’s spatial extension automatically deserialises WKB when you access geometry columns in spatial functions.

# Spatial Joins

The canonical spatial join — assign each GPS ping to its containing zone — in DuckDB:

result = con.execute("""
    SELECT
        p.ping_id,
        z.zone_name,
        z.zone_code,
        ST_Distance(p.geometry, z.geometry) AS dist_to_boundary
    FROM read_parquet('gps_pings.parquet') AS p
    JOIN read_parquet('zones.parquet') AS z
        ON ST_Contains(z.geometry, p.geometry)
    LIMIT 10
""").df()

DuckDB executes spatial joins using a nested-loop join with bounding-box pruning — it is not as fast as PostGIS’s GiST index for repeat queries against the same dataset, but for single-pass analytics over files it is highly competitive.

Benchmark: spatial join of 2 million GPS pings against 500 zone polygons:

DuckDB spatial extension: 12 seconds
PostGIS (with GiST index, local server): 8 seconds
GeoPandas gpd.sjoin(): 9 seconds
Python nested loop: 4 minutes+

DuckDB is within 50% of indexed PostGIS performance for single-pass joins, with zero infrastructure.

# Aggregation with Geometry Operations

# Count features and compute union polygon per zone — pure SQL
result = con.execute("""
    SELECT
        zone_type,
        COUNT(*)           AS feature_count,
        SUM(land_area_m2)  AS total_area_m2,
        AVG(land_value)    AS mean_land_value,
        ST_AsGeoJSON(ST_Union_Agg(geometry)) AS union_geojson
    FROM read_parquet('parcels.parquet')
    GROUP BY zone_type
    ORDER BY feature_count DESC
""").df()

ST_Union_Agg computes the union of all geometries in each group — the SQL equivalent of shapely.union_all(). Running this across 1.2 million polygons grouped into 20 zone types takes approximately 18 seconds in DuckDB, including geometry union computation.

# Bounding Box Pre-filter

DuckDB does not automatically use the GeoParquet bounding box metadata columns as a spatial index (that is DuckDB’s roadmap, not current behaviour). However, you can manually push the bbox filter to the query, which activates Parquet row group pruning:

# If GeoParquet was written with write_covering_bbox=True:
result = con.execute("""
    SELECT p.*
    FROM read_parquet('parcels.parquet') AS p
    WHERE p."bbox.xmin" < 153.1
      AND p."bbox.xmax" > 152.9
      AND p."bbox.ymin" < -27.4
      AND p."bbox.ymax" > -27.6
      AND ST_Intersects(
            p.geometry,
            ST_MakeEnvelope(152.9, -27.6, 153.1, -27.4)
          )
""").df()

The first four conditions use Parquet statistics to skip row groups entirely (no geometry deserialization needed). The ST_Intersects condition then runs exact geometry intersection on only the candidate row groups.

# Distance-Based Queries

# Find all parcels within 500 metres of a point of interest
poi = (153.024, -27.468)   # Brisbane City Hall

result = con.execute(f"""
    SELECT
        parcel_id,
        address,
        land_value,
        ST_Distance(
            geometry,
            ST_Point({poi[0]}, {poi[1]})
        ) * 111320 AS dist_metres    -- degrees to metres (approx)
    FROM read_parquet('parcels.parquet')
    WHERE ST_DWithin(
        geometry,
        ST_Point({poi[0]}, {poi[1]}),
        0.005   -- ~500m in degrees at this latitude
    )
    ORDER BY dist_metres
    LIMIT 20
""").df()

Note: ST_DWithin in DuckDB Spatial operates in the native coordinate units. For geographic coordinates (EPSG:4326), distances are in degrees. For accurate metre-based distance queries, either reproject to a local projected CRS or use ST_Distance_Sphere:

result = con.execute(f"""
    SELECT
        parcel_id,
        ST_Distance_Sphere(
            geometry,
            ST_Point({poi[0]}, {poi[1]})
        ) AS dist_metres
    FROM read_parquet('parcels.parquet')
    WHERE ST_Distance_Sphere(
        geometry,
        ST_Point({poi[0]}, {poi[1]})
    ) < 500
    ORDER BY dist_metres
""").df()

# Reading from S3

DuckDB can query GeoParquet directly from S3 without downloading the file:

con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Configure AWS credentials (or use IAM role if running on EC2/ECS)
con.execute("""
    SET s3_region = 'ap-southeast-2';
    SET s3_access_key_id = 'YOUR_ACCESS_KEY';
    SET s3_secret_access_key = 'YOUR_SECRET_KEY';
""")

# Query from S3 directly
result = con.execute("""
    SELECT COUNT(*), AVG(land_value)
    FROM read_parquet('s3://your-bucket/spatial-data/parcels.parquet')
    WHERE zone_type = 'Residential'
""").df()

DuckDB uses HTTP range requests to read only the relevant Parquet row groups from S3 — the same optimisation that makes direct-S3 analytics fast for columnar data.

# Returning to GeoPandas

For most analytical workflows, keep results in Pandas/GeoPandas for downstream processing:

import geopandas as gpd

# Execute query and get result as a GeoDataFrame
result_df = con.execute("""
    SELECT parcel_id, address, land_value, ST_AsWKB(geometry) AS geometry_wkb
    FROM read_parquet('parcels.parquet')
    WHERE zone_type = 'Commercial'
""").df()

# Reconstruct geometry from WKB
import shapely
result_df['geometry'] = shapely.from_wkb(result_df['geometry_wkb'].values)
result_gdf = gpd.GeoDataFrame(
    result_df.drop(columns=['geometry_wkb']),
    geometry='geometry',
    crs='EPSG:4326'
)

# DuckDB vs PostGIS vs GeoPandas: Decision Framework

These three tools are complementary rather than competing:

Use DuckDB when:

You have GeoParquet files and want SQL analytics without a server
You are running one-off or scheduled analytics jobs in a serverless or CI environment
You need SQL as your query language (easier to express than Python method chains for complex aggregations)
You want to read from S3 without downloading

Use PostGIS when:

You have repeat queries against the same dataset (GiST indexes amortise their cost)
Multiple clients need concurrent access
You need transactional writes
Query complexity benefits from SQL stored procedures and user-defined functions

Use GeoPandas when:

You need Shapely geometry objects for downstream manipulation
You are building a pipeline where the output is a GeoDataFrame, not a summary table
You want the richest Python integration (plotting, .explore(), direct Shapely API access)

For batch analytics pipelines that read GeoParquet and produce summary DataFrames, DuckDB is now frequently the fastest and simplest tool in the stack.