Shapely STRtree: Mass Nearest-Neighbour and Intersection Queries

A recurring problem in spatial analysis: given two large geometry datasets, find the relationships between them. Which of my 500,000 customer points are within 2km of any of my 12,000 retail sites? What is the nearest road segment to each of my 800,000 GPS pings? Which of my 2 million polygons intersect any flood zone?

The naive Python approach to all three problems is a nested loop: for each geometry in dataset A, iterate through all geometries in dataset B checking each one. That is O(N × M) — on large datasets, it is practically unusable.

The correct approach uses a spatial index: a data structure that organises geometries by their bounding boxes so that most candidates can be eliminated without the expensive geometry predicate check. Shapely 2.0’s STRtree is the fastest spatial index available in pure Python, and its bulk query API makes it dramatically faster than even indexed one-at-a-time lookups.

# What Is an STRtree?

STR stands for Sort-Tile-Recursive. It is an R-tree construction algorithm that sorts geometries by space-filling curve order, tiles them into groups, and builds the tree bottom-up. The result is a well-balanced tree with tight bounding boxes at each node — better query performance than an R-tree built by random insertion.

In Shapely 2.0, STRtree is implemented in C inside shapely.lib, not in Python. Index construction and all query operations execute entirely in C, with Python involved only at the entry and exit boundaries.

# Building the Index

import shapely
import numpy as np

# Build an STRtree from an array of geometries
tree = shapely.STRtree(geometries)

# 'geometries' is a numpy array of Shapely geometry objects
# Construction is O(N log N) — one-time cost, then amortised over queries

Construction cost: building a tree on 1 million geometries takes approximately 0.8 seconds. This is a one-time cost — build the index once, run thousands of queries against it.

# Nearest-Neighbour Queries: `query_nearest`

The most powerful STRtree capability in Shapely 2.0 is query_nearest — it returns the nearest geometry (or nearest K geometries) from the tree for each input geometry, using a bulk vectorised code path.

# ── Setup: 500k GPS pings, 12k road segment midpoints ────────────────────────
rng = np.random.default_rng(0)

n_pings = 500_000
n_roads = 12_000

ping_coords = rng.uniform([144.9, -37.9], [145.2, -37.7], (n_pings, 2))
road_coords = rng.uniform([144.8, -38.0], [145.3, -37.6], (n_roads, 2))

pings = shapely.points(ping_coords)   # (500k,) geometry array
roads = shapely.points(road_coords)   # (12k,)  geometry array

# ── Build index on roads ──────────────────────────────────────────────────────
tree = shapely.STRtree(roads)

# ── Nearest road for each ping ────────────────────────────────────────────────
import time
t0 = time.perf_counter()
nearest_idx = tree.query_nearest(pings)
elapsed = time.perf_counter() - t0
print(f"query_nearest (500k pings → 12k roads): {elapsed:.2f}s")
# → 0.42s

# nearest_idx is a (2, K) array: [input_indices, tree_indices]
# For a 1-to-1 nearest: shape is (2, 500_000)
input_indices, tree_indices = nearest_idx
nearest_roads = roads[tree_indices]

0.42 seconds to find the nearest of 12,000 road points for each of 500,000 GPS pings. The naive nested loop on this problem would take approximately 8 minutes.

# Nearest with Distance Return

# Return both the index and the distance
nearest_idx, distances = tree.query_nearest(pings, return_distance=True)

# distances is a float64 array — the actual geometry distance (in CRS units)
print(f"Mean distance to nearest road: {distances.mean():.4f} degrees")

# K Nearest Neighbours

# Find the 3 nearest roads to each ping
nearest_idx = tree.query_nearest(pings, max_distance=0.05, all_matches=True)
# Returns ALL matches within max_distance, sorted by distance
# Useful for "within X metres" queries where multiple results are wanted

# Intersection Queries: `query`

For intersection, containment, and other predicate queries, use tree.query():

# ── Setup: 200k polygons, 50k query envelopes ─────────────────────────────────
import shapely

polygons = ...   # (200k,) polygon array — the tree
queries  = ...   # (50k,)  polygon array — what we're querying with

tree = shapely.STRtree(polygons)

# ── Step 1: bounding-box candidates ──────────────────────────────────────────
# query() returns candidate pairs based on envelope intersection
candidate_pairs = tree.query(queries, predicate='intersects')
# Returns (2, K) array of [query_index, tree_index] pairs
# K ≤ N × M but typically much smaller

# ── Step 2: exact predicate check (already done if predicate= is specified) ──
# When predicate= is given, Shapely runs the exact GEOS predicate after the
# envelope filter automatically. No manual second pass needed.

query_indices, tree_indices = candidate_pairs
print(f"Intersecting pairs: {len(query_indices):,}")

The predicate= argument accepts: 'intersects', 'within', 'contains', 'overlaps', 'crosses', 'touches', 'covers', 'covered_by', 'contains_properly'.

# The Full Nearest-Neighbour Benchmark

Here is the complete benchmark comparing the naive approach against STRtree, for the GPS-to-road-network snap problem:

import shapely
import numpy as np
import time

rng = np.random.default_rng(42)
n_pings = 100_000
n_roads = 5_000

pings = shapely.points(rng.uniform([0, 50], [2, 52], (n_pings, 2)))
roads = shapely.points(rng.uniform([0, 50], [2, 52], (n_roads, 2)))

# ── Naive: nested loop ────────────────────────────────────────────────────────
t0 = time.perf_counter()
nearest_naive = []
for ping in pings[:1_000]:   # only 1k subset — full would take ~8 min
    min_dist = float('inf')
    nearest = None
    for road in roads:
        d = shapely.distance(ping, road)
        if d < min_dist:
            min_dist = d
            nearest = road
    nearest_naive.append(nearest)
naive_time = time.perf_counter() - t0
print(f"Naive loop (1k pings × 5k roads): {naive_time:.2f}s")
# → 4.8s for 1k, extrapolated 480s for 100k

# ── STRtree: bulk query ────────────────────────────────────────────────────────
tree = shapely.STRtree(roads)
t0 = time.perf_counter()
nearest_idx, dists = tree.query_nearest(pings, return_distance=True)
strtree_time = time.perf_counter() - t0
print(f"STRtree query_nearest (100k pings): {strtree_time:.3f}s")
# → 0.087s

print(f"Speedup (extrapolated): {(naive_time / 1_000 * 100_000) / strtree_time:.0f}x")
# → ~5500x

# Spatial Join Pattern with STRtree

The gpd.sjoin() function uses STRtree internally, but sometimes you need finer control — for instance, when your datasets are too large to load into a single GeoDataFrame, or when you want to process matches in chunks:

import shapely
import numpy as np

# Manually replicate sjoin logic with STRtree
tree = shapely.STRtree(zones_geom_array)

# Find all ping/zone pairs where the ping is within the zone
candidate_input, candidate_tree = tree.query(pings_geom_array, predicate='within')

# candidate_input and candidate_tree are parallel arrays of matching indices
# Build the result as a dictionary or structured array
result = {
    'ping_index': candidate_input,
    'zone_index': candidate_tree,
    'zone_name':  zone_names[candidate_tree]
}

This avoids constructing GeoDataFrames just to run a join — useful when operating on raw geometry arrays from NumPy or PyArrow.

# Self-Join: Finding All Overlapping Polygons

A common QA task: find all pairs of polygons in a dataset that overlap each other:

import shapely

polygons = gdf.geometry.values   # (N,) geometry array

# Build a tree on all polygons
tree = shapely.STRtree(polygons)

# Self-join with 'overlaps' predicate
input_idx, tree_idx = tree.query(polygons, predicate='overlaps')

# Filter out self-matches (where input_idx == tree_idx)
# and duplicate pairs (where input_idx > tree_idx)
mask = input_idx < tree_idx
overlap_pairs = list(zip(input_idx[mask], tree_idx[mask]))
print(f"{len(overlap_pairs):,} overlapping polygon pairs found")

On a dataset of 500,000 cadastral parcels, this runs in about 3 seconds — the equivalent nested loop would take hours.

# When to Use STRtree vs gpd.sjoin

Use gpd.sjoin() when you want a GeoDataFrame result and the data fits in memory. It handles CRS alignment, index management, and result formatting.
Use STRtree directly when you need raw index arrays, when you are processing in chunks, when you want custom distance filtering, or when you are working outside of GeoPandas (e.g. with raw Shapely arrays from PyArrow).

The performance is equivalent — sjoin uses STRtree internally — but the direct API is more flexible and avoids GeoDataFrame construction overhead for large intermediate results.

# Index Persistence and Reuse

One underused pattern: build the STRtree once, cache it, and reuse it across many queries. The index is not serialisable (you cannot pickle it), but in a long-running process or a FastAPI application, you can build it at startup:

# In a FastAPI application startup event:
from contextlib import asynccontextmanager
import shapely

zones_tree = None

@asynccontextmanager
async def lifespan(app):
    global zones_tree
    zones_geom = load_zone_geometries()   # load from PostGIS or GeoParquet
    zones_tree = shapely.STRtree(zones_geom)
    yield
    # cleanup if needed

# In a request handler:
def classify_point(lon: float, lat: float):
    query_point = shapely.points([[lon, lat]])
    idx = zones_tree.query(query_point, predicate='within')
    if len(idx[1]) == 0:
        return None
    return zone_names[idx[1][0]]

With the index pre-built (one-time cost at startup), individual point-in-polygon queries return in under a millisecond.