The Shapefile format was designed in the early 1990s. GeoJSON was designed for web APIs in 2008. Both remain ubiquitous in spatial data exchange — and both are genuinely poor choices for analytical workloads on large datasets. Shapefile splits data across five files, caps field names at 10 characters, and has no compression. GeoJSON is a text format: verbose, non-seekable, and orders of magnitude larger on disk than a binary equivalent.
GeoParquet solves this. It stores vector geodata in Apache Parquet format — columnar, compressed, and designed for analytical query patterns — with a standardised geometry encoding that any modern spatial tool can read.
# What Makes Parquet Analytical
Apache Parquet is a columnar storage format developed at Twitter and Cloudera. Its design optimises for the workload pattern of analytical queries:
Columnar layout: data is stored column by column rather than row by row. When a query accesses only certain columns (e.g. area and zone_type from a dataset with 50 columns), Parquet reads only those columns from disk. A row-oriented format reads every column of every row, even if most of the data is irrelevant.
Encoding and compression: each column is compressed independently. Numeric columns use delta encoding, dictionary encoding, or run-length encoding before gzip/Snappy/Zstd compression. Real-world GeoParquet files are typically 5–15× smaller than equivalent GeoJSON and 3–8× smaller than Shapefile for the same data.
Row groups and statistics: data is divided into row groups (typically 64MB–256MB). Each row group stores column statistics (min, max, null count). Query engines use these statistics to skip entire row groups that cannot satisfy a filter predicate — without reading the data.
WKB geometry encoding: GeoParquet stores geometry as Well-Known Binary (WKB) in a byte-array column, with optional bounding box columns (bbox.xmin, bbox.xmax, bbox.ymin, bbox.ymax) to enable spatial predicate pushdown.
# Reading and Writing GeoParquet in Python
import geopandas as gpd
# Read — GeoPandas 0.12+ supports GeoParquet natively via pyarrow
gdf = gpd.read_parquet("buildings.parquet")
# Write
gdf.to_parquet("buildings.parquet", compression="zstd")
# Write with bounding box columns (improves spatial filtering in DuckDB/Spark)
gdf.to_parquet(
"buildings.parquet",
compression="zstd",
write_covering_bbox=True # adds bbox.xmin/ymin/xmax/ymax columns
)
For very large files, read only the columns you need:
# Read only geometry + specific attributes — avoids loading unused columns
gdf = gpd.read_parquet(
"large_dataset.parquet",
columns=["geometry", "zone_type", "land_value"]
)
# Format Comparison: Size and Read Speed
Benchmark: 1.2 million cadastral parcel polygons (Queensland, 2024). Same data, different formats.
| Format | File size | GeoPandas read | Sequential read |
|---|---|---|---|
| Shapefile | 1.8 GB (5 files) | 38s | 38s |
| GeoJSON | 4.2 GB | 94s | 94s |
| GeoPackage (SQLite) | 780 MB | 22s | 22s |
| GeoParquet (Snappy) | 210 MB | 4.1s | 4.1s |
| GeoParquet (Zstd L3) | 145 MB | 4.6s | 4.6s |
GeoParquet with Snappy compression is 9× smaller than GeoJSON, reads 23× faster, and the Zstd variant reduces storage cost further with only a small read-time penalty.
The read speed improvement is particularly pronounced when accessing a subset of columns:
# Only geometry + one attribute column
gdf = gpd.read_parquet("parcels.parquet", columns=["geometry", "land_value"])
# → 0.9s for 1.2M parcels (Parquet skips all other columns at the page level)
The equivalent operation from GeoJSON or Shapefile reads and parses all columns regardless.
# Spatial Filtering via Bounding Box
When reading GeoParquet files, you can push a bounding box filter down to the file level:
from shapely.geometry import box
# Define a region of interest
roi = box(152.9, -27.6, 153.1, -27.4) # Brisbane CBD area
# Row group-level filter using pyarrow
import pyarrow.dataset as ds
import geopandas as gpd
dataset = ds.dataset("parcels.parquet")
# Filter using bbox columns (if written with write_covering_bbox=True)
filtered = dataset.to_table(
filter=(
(ds.field("bbox.xmin") < 153.1) &
(ds.field("bbox.xmax") > 152.9) &
(ds.field("bbox.ymin") < -27.4) &
(ds.field("bbox.ymax") > -27.6)
)
)
gdf_filtered = gpd.GeoDataFrame.from_arrow(filtered)
gdf_filtered = gdf_filtered.set_geometry("geometry", crs="EPSG:4326")
print(f"Loaded {len(gdf_filtered):,} parcels in Brisbane CBD")
# Reads only the row groups that overlap the bbox — typically <1% of the file
On a 1.2M parcel dataset with row groups of 100k rows each, a tight bounding box query reads 1–2 row groups instead of all 12, cutting I/O by 80–90%.
# GeoParquet in DuckDB
One of GeoParquet’s most powerful properties is native support in DuckDB’s spatial extension, enabling SQL analytics directly on the file without loading into Python:
import duckdb
con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")
# Query GeoParquet with SQL — DuckDB reads only needed row groups
result = con.execute("""
SELECT
zone_type,
COUNT(*) AS parcel_count,
SUM(land_area_m2) AS total_area_m2,
AVG(land_value) AS mean_land_value
FROM read_parquet('parcels.parquet')
WHERE bbox_xmin > 152.9 AND bbox_xmax < 153.1
AND bbox_ymin > -27.6 AND bbox_ymax < -27.4
GROUP BY zone_type
ORDER BY parcel_count DESC
""").df()
print(result)
This runs entirely in-process, with no server, no data loading into Python memory, and no GeoDataFrame construction — just a SQL query on a file. Execution time on a 1.2M row file: approximately 0.3 seconds.
# Partitioned GeoParquet
For very large datasets (tens of millions of rows), partitioned GeoParquet enables efficient spatial queries by organising files by geographic region:
import geopandas as gpd
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
# Write a spatially partitioned dataset: one file per grid cell
def write_partitioned(gdf: gpd.GeoDataFrame, output_dir: str, n_grid: int = 10):
"""Partition by a regular lon/lat grid and write one file per cell."""
bounds = gdf.total_bounds # (xmin, ymin, xmax, ymax)
x_edges = np.linspace(bounds[0], bounds[2], n_grid + 1)
y_edges = np.linspace(bounds[1], bounds[3], n_grid + 1)
for i in range(n_grid):
for j in range(n_grid):
mask = (
(gdf.geometry.bounds['minx'] >= x_edges[i]) &
(gdf.geometry.bounds['maxx'] < x_edges[i+1]) &
(gdf.geometry.bounds['miny'] >= y_edges[j]) &
(gdf.geometry.bounds['maxy'] < y_edges[j+1])
)
cell = gdf[mask]
if len(cell) > 0:
path = f"{output_dir}/x{i:02d}_y{j:02d}.parquet"
cell.to_parquet(path, compression="zstd")
With spatial partitioning, queries over a small geographic area read only the relevant partition files, potentially accessing 1% of the total dataset.
# Migration from Shapefile / GeoJSON
A simple one-time conversion:
import geopandas as gpd
# From Shapefile
gdf = gpd.read_file("large_dataset.shp")
gdf.to_parquet("large_dataset.parquet", compression="zstd")
# From GeoJSON
gdf = gpd.read_file("large_dataset.geojson")
gdf.to_parquet("large_dataset.parquet", compression="zstd")
# The original files can now be archived
After conversion, the GeoParquet file is the primary working format. Shapefile and GeoJSON are retained only for exchange with systems that cannot read Parquet.
# Limitations
GeoParquet is not the right format for everything:
- Web streaming: Parquet is not designed for partial, bbox-filtered HTTP range requests at the feature level. FlatGeobuf is better for that use case.
- Editing: Parquet is append-friendly but not edit-friendly. For transactional, row-level updates, a database (PostGIS, GeoPackage) is more appropriate.
- Simplicity: GeoJSON remains the best format for simple, human-readable data exchange under ~10MB.
For analytical workloads — data loading, aggregation, join operations, batch processing — GeoParquet is the correct default, and the performance difference over Shapefile or GeoJSON is not incremental. It is categorical.
Related reading: GeoParquet Spatial Partitioning: Optimising for Parallel Reads · DuckDB Spatial Analytics on GeoParquet · FlatGeobuf: Streaming Spatial Data Without a Server