Coordinate reference system transformation is one of those operations that feels trivial until you are doing it millions of times. Reprojecting a shapefile from WGS84 to a local projected CRS is a standard preprocessing step in almost every spatial workflow. When that workflow runs on 10 million GPS tracks, 50 million building footprints, or a continental point cloud, the reprojection step can quietly become the bottleneck.
This article covers PyProj’s Transformer array API — the correct way to reproject large datasets — and the common slow patterns to replace.
# The Slow Pattern: Per-Object Transforms
In Shapely 1.x and older GeoPandas workflows, reprojection often looked like this:
from pyproj import Transformer
from shapely.ops import transform
# Create transformer once (good)
transformer = Transformer.from_crs("EPSG:4326", "EPSG:28355", always_xy=True)
# But apply it per geometry (bad)
gdf['geometry_projected'] = gdf.geometry.apply(
lambda g: transform(transformer.transform, g)
)
Or even worse, constructing the transformer inside the loop:
# TERRIBLE — creates a new Transformer object per row
gdf['geometry_projected'] = gdf.geometry.apply(
lambda g: transform(
Transformer.from_crs("EPSG:4326", "EPSG:28355", always_xy=True).transform,
g
)
)
The shapely.ops.transform + .apply() pattern calls the transformer once per coordinate, wrapping each in a Python function call. For a polygon with 50 vertices, that is 50 Python function calls per geometry. On a dataset of 100,000 complex polygons, that is 5 million individual Python-level transform calls.
# The Fast Pattern: GeoDataFrame.to_crs()
For GeoPandas users, the first answer is to use GeoDataFrame.to_crs():
# FAST — GeoPandas handles CRS transformation internally using PyProj arrays
gdf_projected = gdf.to_crs("EPSG:28355")
Since GeoPandas 0.7, to_crs() uses PyProj’s vectorised transform path internally. It extracts all coordinates from the geometry array, transforms them in a single call, and reconstructs the geometries. No Python loop over rows.
Benchmark: reprojecting 500,000 polygon geometries (avg 8 vertices each) from WGS84 to GDA2020/MGA Zone 55:
.apply(lambda g: transform(...)): 42 secondsgdf.to_crs("EPSG:7855"): 1.1 seconds- 38× speedup
# Reprojecting Raw Coordinate Arrays
When you are working with raw coordinate arrays (not GeoPandas), Transformer.transform() accepts NumPy arrays directly:
import numpy as np
from pyproj import Transformer
# 10 million (lon, lat) pairs
coords = np.random.uniform(
low=[140.0, -38.0],
high=[150.0, -30.0],
size=(10_000_000, 2)
)
lon = coords[:, 0]
lat = coords[:, 1]
# Create transformer once — reuse for all transforms
transformer = Transformer.from_crs(
"EPSG:4326",
"EPSG:28355", # GDA94 / MGA Zone 55
always_xy=True # ensures lon/lat not lat/lon — critical!
)
import time
t0 = time.perf_counter()
x, y = transformer.transform(lon, lat)
elapsed = time.perf_counter() - t0
print(f"Transformed {len(lon):,} coordinates: {elapsed:.3f}s")
# → 0.34s for 10 million coordinate pairs
Compared to per-point transform: 10 million points via a Python loop takes approximately 95 seconds. The array transform takes 0.34 seconds — a 280× speedup.
# The always_xy=True Flag: Why It Matters
One of the most common sources of bugs in reprojection code: axis order. Geographic standards define (latitude, longitude) order for many CRS definitions, but most GIS software and APIs use (longitude, latitude) / (x, y) order.
PyProj follows the authority definition by default. EPSG:4326 is defined as (latitude, longitude). This means:
# WITHOUT always_xy — follows CRS definition (lat/lon for EPSG:4326)
transformer = Transformer.from_crs("EPSG:4326", "EPSG:28355")
x, y = transformer.transform(lat, lon) # must pass lat first!
# WITH always_xy — always (longitude, latitude) / (x, y) regardless of CRS
transformer = Transformer.from_crs("EPSG:4326", "EPSG:28355", always_xy=True)
x, y = transformer.transform(lon, lat) # natural (x, y) order
Always use always_xy=True unless you are specifically working with CRS definitions that have non-standard axis orders and need to preserve them. It prevents the silent transposition bug where coordinates appear to transform successfully but end up in the wrong hemisphere.
# Transforming Shapely Geometry Arrays Directly
For Shapely geometry arrays (outside of GeoPandas), the fastest approach combines shapely.get_coordinates() with the array transformer:
import shapely
import numpy as np
from pyproj import Transformer
geom_array = gdf.geometry.values # NumPy array of Shapely geometries
# Step 1: Extract all coordinates as (M, 2) array
# include_z=False avoids a sparse Z array for 2D geometries
coords = shapely.get_coordinates(geom_array, include_z=False)
# Step 2: Transform all coordinates in one call
transformer = Transformer.from_crs("EPSG:4326", "EPSG:28355", always_xy=True)
x_proj, y_proj = transformer.transform(coords[:, 0], coords[:, 1])
# Step 3: Put transformed coordinates back into geometry array
coords_proj = np.column_stack([x_proj, y_proj])
geom_array_proj = shapely.set_coordinates(geom_array.copy(), coords_proj)
shapely.set_coordinates() modifies the coordinates of a geometry array in-place (on a copy). Combined with get_coordinates(), this gives you a pure-array reprojection pipeline without constructing GeoPandas objects.
Benchmark: 2 million polygon geometries with an average of 12 vertices each (24M total coordinates):
.apply(transform(...))approach: 4 minutes 10 secondsget_coordinates()+transformer.transform()+set_coordinates(): 6.2 seconds- 40× speedup
# Transformer Caching and Thread Safety
Transformer objects are not cheap to construct — each one initialises the PROJ library context and loads CRS definitions. In a web application or batch job that runs many reprojection operations:
from pyproj import Transformer
from functools import lru_cache
@lru_cache(maxsize=32)
def get_transformer(from_crs: str, to_crs: str) -> Transformer:
"""Cache transformers by CRS pair — construction cost paid once."""
return Transformer.from_crs(from_crs, to_crs, always_xy=True)
# Usage: cached after first call
transformer = get_transformer("EPSG:4326", "EPSG:28355")
x, y = transformer.transform(lon_array, lat_array)
Regarding thread safety: Transformer objects are thread-safe in PyProj 3.x and later for array-based transforms. You can safely share a cached Transformer instance across threads.
# Handling Mixed or Unknown CRS
A common pattern in data pipelines: receiving data from multiple sources with different CRS, needing to normalise everything to a target CRS:
from pyproj import Transformer, CRS
from functools import lru_cache
TARGET_CRS = "EPSG:4326"
@lru_cache(maxsize=64)
def get_transformer_to_wgs84(source_epsg: int) -> Transformer:
return Transformer.from_crs(
f"EPSG:{source_epsg}",
TARGET_CRS,
always_xy=True
)
def normalise_to_wgs84(lon: np.ndarray, lat: np.ndarray, source_epsg: int):
if source_epsg == 4326:
return lon, lat # already in target CRS, no-op
transformer = get_transformer_to_wgs84(source_epsg)
return transformer.transform(lon, lat)
# Integration with Pandas: Reprojecting a Coordinate DataFrame
When coordinates are stored as separate columns rather than geometry objects:
import pandas as pd
import numpy as np
from pyproj import Transformer
# DataFrame with lon/lat columns
df = pd.DataFrame({
'lon': np.random.uniform(140, 154, 1_000_000),
'lat': np.random.uniform(-44, -10, 1_000_000),
'value': np.random.random(1_000_000)
})
transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
# Transform both columns in one call — returns two arrays
df['x_web'], df['y_web'] = transformer.transform(
df['lon'].values,
df['lat'].values
)
One million coordinate pairs transformed in approximately 0.03 seconds — fast enough that this can be run inline during data load rather than in a separate preprocessing step.
# Summary: The Rules
- Never construct
Transformerinside a loop — create once, reuse. - Always use
always_xy=Trueunless you have a specific reason not to. - Pass arrays, not scalars —
transformer.transform(lon_array, lat_array)executes in C. - Use
gdf.to_crs()for GeoPandas GeoDataFrames — it already uses the array path. - Cache transformers with
lru_cachein long-running processes. - For raw geometry arrays, use
shapely.get_coordinates()+transformer.transform()+shapely.set_coordinates().
The patterns are simple, the speedups are dramatic, and the change is entirely backwards compatible — you get the performance improvement by changing how you call the existing library, not by changing libraries.
Related reading: GeoPandas at Scale: Escaping the .apply() Trap · Shapely 2.0: The Vectorisation Revolution in Python Geometry · GeoParquet: The Analytical Format for Vector Geodata