Shapely 2.0: The Vectorisation Revolution in Python Geometry

In December 2022, Shapely 2.0 was released. The version bump signalled a genuine architectural break — not just an API refinement, but a fundamental redesign of how Python geometry operations work at the lowest level. For anyone doing spatial analysis in Python, it is the most important library release in a decade.

The headline change: Shapely 2.0 supports array-level geometry operations. Instead of calling a method on a single geometry object inside a Python loop, you can pass a NumPy array of geometries to a function and have the entire computation execute in C, with no Python overhead per element.

The performance implications are dramatic. Operations that took 18 seconds now take 0.3 seconds. Workflows that required Dask parallelism to be tractable are now fast enough to run synchronously. The ceiling on what a single-threaded Python process can do with geometry has risen by two orders of magnitude.

# What Changed Under the Hood

Shapely 1.x wrapped GEOS (the C/C++ geometry engine) with a Python object model where every geometry was a Python object and every operation returned a Python object. Calling point.buffer(100) involved Python attribute lookup, Python function dispatch, a C call into GEOS, and the construction of a new Python object to wrap the result. At thousands of geometries per second, this was adequate for many use cases. At millions, it was a bottleneck.

Shapely 2.0 introduced a fundamentally different approach: the shapely.lib C extension module that can operate on arrays of raw GEOS geometry pointers without Python object construction between operations. A single call to shapely.buffer(array_of_geoms, 100) passes the entire array to C, iterates over the GEOS pointers in a tight C loop, and returns a new NumPy array of result geometries — with Python involved only at the call boundary, not per element.

This is the same technique that makes NumPy fast: eliminate Python-level iteration by pushing the loop into compiled code.

# Benchmarking the Difference

To make this concrete, consider buffering 500,000 point geometries by 100 metres (a common preprocessing step in proximity analysis):

import numpy as np
import shapely
import shapely.geometry
import time

# Generate 500,000 random points
coords = np.random.uniform(low=[0, 50], high=[2, 52], size=(500_000, 2))
points = shapely.points(coords)  # Shapely 2.0 array constructor

# ── Shapely 2.0: vectorised array operation ──────────────────────────────────
start = time.perf_counter()
buffered = shapely.buffer(points, 100)
elapsed_v2 = time.perf_counter() - start
print(f"Shapely 2.0 (vectorised): {elapsed_v2:.3f}s")
# → 0.31s on a standard laptop

# ── Shapely 1.x style: Python loop (still works in 2.0 but slow) ─────────────
start = time.perf_counter()
buffered_loop = np.array([p.buffer(100) for p in points])
elapsed_loop = time.perf_counter() - start
print(f"Python loop style:        {elapsed_loop:.3f}s")
# → 18.4s on the same machine

print(f"Speedup: {elapsed_loop / elapsed_v2:.0f}x")
# → 59x

The 59× speedup is not atypical. For operations with more GEOS computation per geometry (intersection, union, is_valid checks), the speedup can reach 200× because the Python overhead is amortised over more work per element.

# The New Array API

Shapely 2.0 provides a module-level function API that accepts arrays directly. The most commonly used functions:

import shapely
import numpy as np

# Construct geometry arrays
points   = shapely.points(xy_array)                  # (N, 2) array → N Points
lines    = shapely.linestrings(coords_array)          # list of arrays → N LineStrings
polygons = shapely.polygons(ring_coords)              # exterior coords → N Polygons

# Measurement (returns float arrays)
areas    = shapely.area(geom_array)                  # → (N,) float64
lengths  = shapely.length(geom_array)                # → (N,) float64
dists    = shapely.distance(geom_array_a, geom_array_b)  # → (N,) float64

# Predicates (returns bool arrays)
valid    = shapely.is_valid(geom_array)              # → (N,) bool
empty    = shapely.is_empty(geom_array)              # → (N,) bool
contains = shapely.contains(polys, points)           # → (N,) bool

# Constructive operations (returns geometry arrays)
buffered  = shapely.buffer(geom_array, distance)     # scalar or (N,) distance
simplified = shapely.simplify(geom_array, tolerance)
convex    = shapely.convex_hull(geom_array)
union_all = shapely.union_all(geom_array)            # single result

# Binary operations (element-wise, both arrays broadcast)
intersections = shapely.intersection(array_a, array_b)  # → (N,) geometry
differences   = shapely.difference(array_a, array_b)

The broadcasting rules follow NumPy conventions. Passing a scalar distance to shapely.buffer(points, 100) applies 100 to every point. Passing an array shapely.buffer(points, distance_array) applies a different distance per point.

# Coordinate Extraction Without Loops

A common pattern that was slow in Shapely 1.x: extracting coordinates from a large array of geometries. Shapely 2.0 provides shapely.get_coordinates():

# Extract all coordinates from an array of geometries as a (N, 2) numpy array
# Works on Points, LineStrings, Polygons — any geometry type
coords = shapely.get_coordinates(geom_array)
# → (M, 2) array where M ≥ N (polygons contribute multiple coordinate pairs)

# For points specifically, this is equivalent to:
# np.column_stack([shapely.get_x(points), shapely.get_y(points)])

x = shapely.get_x(points)   # (N,) float64
y = shapely.get_y(points)   # (N,) float64

Extracting x/y from 1 million points using Shapely 1.x style ([p.x for p in points]): ~0.8 seconds. Using shapely.get_x(points): ~0.003 seconds — 260× faster.

# Impact on GeoPandas

Since GeoPandas 0.12, the geometry column operations have been updated to use Shapely 2.0’s array API internally. This means that vectorised GeoPandas operations like gdf.geometry.buffer(100) or gdf.geometry.area are already fast without any code changes.

The operations that are not automatically fast are those you write yourself using .apply():

# SLOW — calls Shapely per row inside Python:
gdf['buffered'] = gdf.geometry.apply(lambda g: g.buffer(100))

# FAST — GeoPandas dispatches to Shapely 2.0 array API:
gdf['buffered'] = gdf.geometry.buffer(100)

# SLOW — builds distance matrix element by element:
gdf['dist_to_centroid'] = gdf.geometry.apply(
    lambda g: g.distance(reference_point)
)

# FAST — Shapely vectorised distance:
gdf['dist_to_centroid'] = shapely.distance(
    gdf.geometry.values,           # numpy array of geometries
    reference_point                # scalar broadcasts
)

# Validity Checking and Repair at Scale

A particularly valuable application of the vectorised API is bulk geometry validation and repair — operations that were often skipped in Shapely 1.x workflows because they were too slow to run on large datasets:

import shapely

# Check validity for entire dataset at once
valid_mask = shapely.is_valid(gdf.geometry.values)
n_invalid = (~valid_mask).sum()
print(f"{n_invalid:,} invalid geometries out of {len(gdf):,}")

# Repair all invalid geometries in one call
if n_invalid > 0:
    gdf.loc[~valid_mask, 'geometry'] = shapely.make_valid(
        gdf.geometry.values[~valid_mask]
    )
    print("Repair complete")

With Shapely 1.x, running is_valid on 2 million geometries took around 45 seconds. Shapely 2.0: under 1 second.

# Migration Considerations

Shapely 2.0 maintains backward compatibility at the object level — existing code that calls methods on individual geometry objects continues to work. The migration to the array API is entirely optional and additive.

However, there is one breaking change that affects GeoPandas and downstream code: Shapely 2.0 geometry objects are no longer picklable in the same way as 1.x objects. If you have cached geometry objects (e.g. stored in Redis or in Python multiprocessing queues), you may encounter issues. In practice, geometries should be serialised as WKB or WKT rather than pickled; Shapely 2.0 enforces this discipline.

The shapely.wkb and shapely.wkt modules provide fast bulk serialisation:

# Serialise array to WKB (compact binary)
wkb_array = shapely.to_wkb(geom_array)    # → numpy array of bytes objects
# Deserialise
geom_array = shapely.from_wkb(wkb_array)

# WKT for human-readable text output
wkt_array = shapely.to_wkt(geom_array)    # → numpy array of strings

Bulk WKB serialisation of 1 million geometries: ~0.8 seconds in Shapely 2.0 vs ~12 seconds in 1.x.

# When the Loop Is Still Right

Not every spatial workflow benefits from the array API. Single-geometry operations in application code (e.g. checking if a user-submitted point is within a boundary) are perfectly fine as method calls. The array API matters when you are iterating over a dataset — any situation where geometry operations are in a loop over rows.

The rule of thumb: if you find yourself writing for geom in gdf.geometry or gdf.geometry.apply(lambda g: ...) for a spatial operation, check whether Shapely 2.0 has a module-level function equivalent. In the vast majority of cases, it does — and the speedup will be dramatic.