H3 as a Spatial Join Accelerator: Approximate Joins at Scale

Spatial joins — assigning points to polygons, finding intersecting features, matching nearby records — are among the most common operations in spatial analysis. They are also among the most computationally expensive when datasets grow to tens of millions of records.

The gpd.sjoin() approach using Shapely’s STRtree is the standard tool and it scales well to a few million features. Beyond that, memory pressure and tree construction time start to bite. For datasets in the tens or hundreds of millions of records, an alternative strategy works better: use H3 cell assignment as a fast approximate join, then refine with exact geometry checks only where needed.

# The Core Idea

H3 turns a spatial join into a hash join. Both datasets are indexed to the same H3 resolution. Features that share a cell ID are candidates for the spatial relationship. Joining on cell ID is a standard equality join — the fastest type of database join, O(N) with hashing.

The approximation: H3 cells are hexagons, not exact administrative boundaries. A point assigned to cell 8a2a1072b4fffff may be just inside or just outside the boundary of a polygon. For many use cases — heatmaps, density estimates, zone-based analytics — this approximation is entirely acceptable. For cases requiring exact polygon membership, use H3 to generate candidates, then apply exact contains checks only on the candidate set.

# Pattern 1: Full Approximate Join

Use when the application tolerance for boundary imprecision is higher than the H3 cell size. Example: assigning deliveries to sales territories where territory boundaries are themselves approximate.

import pandas as pd
import numpy as np
import h3
import time

rng = np.random.default_rng(0)

# ── Dataset A: 20 million delivery events ─────────────────────────────────────
n_deliveries = 20_000_000
delivery_lats = rng.uniform(-34.2, -33.5, n_deliveries)
delivery_lons = rng.uniform(150.5, 151.5, n_deliveries)
delivery_ids  = np.arange(n_deliveries)

# ── Dataset B: 500 sales territory centroids ─────────────────────────────────
n_territories = 500
territory_lats = rng.uniform(-34.2, -33.5, n_territories)
territory_lons = rng.uniform(150.5, 151.5, n_territories)
territory_ids  = np.arange(n_territories)

# ── Step 1: Assign H3 cells (resolution 8 ≈ 0.05 km²) ───────────────────────
RESOLUTION = 8

t0 = time.perf_counter()
delivery_cells = h3.latlng_to_cell(delivery_lats, delivery_lons, RESOLUTION)
territory_cells = h3.latlng_to_cell(territory_lats, territory_lons, RESOLUTION)
print(f"H3 indexing: {time.perf_counter() - t0:.1f}s")
# → 1.2s for 20M deliveries (h3 4.x vectorised)

# ── Step 2: Build lookup: cell → territory ─────────────────────────────────────
# For each territory cell, record which territory it belongs to
cell_to_territory = dict(zip(territory_cells, territory_ids))

# ── Step 3: Join via hash lookup ──────────────────────────────────────────────
t0 = time.perf_counter()
# Vectorise via Pandas for large arrays
delivery_df = pd.DataFrame({'delivery_id': delivery_ids, 'h3_cell': delivery_cells})
territory_df = pd.DataFrame({'territory_id': territory_ids, 'h3_cell': territory_cells})

result = delivery_df.merge(territory_df, on='h3_cell', how='left')
print(f"H3 hash join: {time.perf_counter() - t0:.1f}s")
# → 3.8s for 20M deliveries

Compare to gpd.sjoin() on 20M points: approximately 85 seconds (STRtree construction + query), assuming it doesn’t OOM first.

H3 join: 5 seconds total. sjoin: 85 seconds. 17× speedup.

Use when you need exact polygon membership but want to avoid building an STRtree over the full dataset.

import geopandas as gpd
import h3
import shapely
import pandas as pd
import numpy as np

# ── Zone polygons and GPS points ──────────────────────────────────────────────
zones_gdf  = gpd.read_parquet("zones.parquet")
points_gdf = gpd.read_parquet("gps_pings.parquet")   # 10M points

# ── Step 1: Index zone polygons to H3 cells ────────────────────────────────────
RESOLUTION = 9   # ~900m² cells — finer than typical zone boundaries

# Fill each zone polygon with H3 cells
def zone_to_h3_cells(zone_geometry, resolution):
    """Return all H3 cells that cover a polygon at the given resolution."""
    return h3.polygon_to_cells(
        h3.Polygon(
            [(lon, lat) for lat, lon in  # H3 Polygon takes (lng, lat) tuples
             zip(*zone_geometry.exterior.coords.xy)]
        ),
        resolution
    )

zone_cells = {}  # h3_cell → zone_id
for _, row in zones_gdf.iterrows():
    cells = zone_to_h3_cells(row.geometry, RESOLUTION)
    for cell in cells:
        zone_cells[cell] = row['zone_id']

print(f"H3 cell→zone index: {len(zone_cells):,} cells covering {len(zones_gdf):,} zones")

# ── Step 2: Assign H3 cells to GPS points ─────────────────────────────────────
point_cells = h3.latlng_to_cell(
    points_gdf.geometry.y.values,
    points_gdf.geometry.x.values,
    RESOLUTION
)

# ── Step 3: Look up zone for each point cell ───────────────────────────────────
points_gdf['candidate_zone'] = pd.array(point_cells).map(zone_cells)

# Points that matched a cell: candidate assignment (may include boundary errors)
matched = points_gdf['candidate_zone'].notna()
print(f"Matched via H3: {matched.sum():,} / {len(points_gdf):,}")

# ── Step 4: Exact check for boundary candidates ────────────────────────────────
# Only run exact contains for points near zone boundaries
# (points in cells that are edge cells of their zone, not fully interior)
# Simple approximation: exact-check all matched points (still much faster than
# full sjoin because we've already filtered down to plausible matches)

# Build zone geometry lookup
zone_geom_lookup = zones_gdf.set_index('zone_id')['geometry']

# Exact containment check on matched candidates only
import time
t0 = time.perf_counter()
candidates = points_gdf[matched].copy()
candidate_zones = candidates['candidate_zone'].map(zone_geom_lookup)

# Check if point is actually within the candidate zone polygon
exact_match = shapely.contains(
    candidate_zones.values,
    candidates.geometry.values
)
candidates['confirmed_zone'] = np.where(exact_match, candidates['candidate_zone'], None)
print(f"Exact containment check on {len(candidates):,} candidates: {time.perf_counter() - t0:.1f}s")

The key insight: the exact containment check runs on only the matched subset — points that fall in a cell assigned to a zone. If your zone coverage is dense, this is most of the points. If zones are sparse, the H3 filter dramatically reduces the exact-check workload.

# Choosing Resolution for Join Accuracy

Resolution selection determines the false-positive rate (cells that straddle zone boundaries) and the indexing cost:

# Estimate false-positive rate at different resolutions
# (cells that are on a zone boundary vs interior)
for res in [6, 7, 8, 9, 10]:
    all_cells = set()
    boundary_cells = set()
    for _, zone in zones_gdf.iterrows():
        zone_cells_set = set(h3.polygon_to_cells(zone.geometry.__geo_interface__, res))
        # Boundary cells: within 1 cell of the exterior ring
        compact = h3.compact_cells(list(zone_cells_set))
        interior = set(h3.compact_cells(list(zone_cells_set)))  # approximation
        all_cells.update(zone_cells_set)
    print(f"  Resolution {res}: {len(all_cells):,} cells")

In practice, resolution 8–9 is a good default for polygon-to-point joins in urban analytics: cell size is small enough that boundary effects are minor, but not so fine that the cell count becomes unwieldy.

# K-Disk Neighbourhood Queries

H3 cell disk functions enable approximate neighbourhood queries without distance computation:

import h3

# Find all H3 cells within k steps of a target cell
# k=1: target + 6 neighbours (7 cells total)
# k=2: target + 18 cells (19 total)
target_cell = h3.latlng_to_cell(37.7749, -122.4194, 9)   # San Francisco
neighbourhood = h3.grid_disk(target_cell, k=2)   # 19 cells

# These 19 cells cover approximately 3km² around the target point
# A point within this region will be in one of these cells
# → can answer "within ~500m" queries with just a set membership check

# For join: given a set of query points, find all index points within k cells
def h3_kring_join(query_cells, index_cell_to_id, k=1):
    """For each query cell, find all index IDs within k cell steps."""
    results = {}
    for query_cell in query_cells:
        neighbourhood = h3.grid_disk(query_cell, k)
        matches = [index_cell_to_id[c] for c in neighbourhood if c in index_cell_to_id]
        results[query_cell] = matches
    return results

# When NOT to Use H3 Joins

H3 approximation is appropriate when:

Zone boundaries have inherent uncertainty (administrative, demographic, catchment)
The analysis purpose is aggregation or density estimation, not precise membership
The resolution is fine enough that boundary effects are below your acceptable error

It is not appropriate when:

Exact legal or regulatory boundary determination is required
The zones are very irregular with many thin features smaller than cell size
The application explicitly requires exact geometry predicates

For those cases, use gpd.sjoin() with Shapely STRtree — it is exact, and its performance is sufficient for datasets up to ~5M features.