Coordinate reference system transformation is one of those operations that feels trivial until you are doing it millions of times. Reprojecting a shapefile from WGS84 to a local projected CRS is a standard preprocessing step in almost every spatial workflow. When that workflow runs on 10 million GPS tracks, 50 million building footprints, or a continental point cloud, the reprojection step can quietly become the bottleneck.

This article covers PyProj’s Transformer array API — the correct way to reproject large datasets — and the common slow patterns to replace.

# The Slow Pattern: Per-Object Transforms

In Shapely 1.x and older GeoPandas workflows, reprojection often looked like this:

from pyproj import Transformer
from shapely.ops import transform

# Create transformer once (good)
transformer = Transformer.from_crs("EPSG:4326", "EPSG:28355", always_xy=True)

# But apply it per geometry (bad)
gdf['geometry_projected'] = gdf.geometry.apply(
    lambda g: transform(transformer.transform, g)
)

Or even worse, constructing the transformer inside the loop:

# TERRIBLE — creates a new Transformer object per row
gdf['geometry_projected'] = gdf.geometry.apply(
    lambda g: transform(
        Transformer.from_crs("EPSG:4326", "EPSG:28355", always_xy=True).transform,
        g
    )
)

The shapely.ops.transform + .apply() pattern calls the transformer once per coordinate, wrapping each in a Python function call. For a polygon with 50 vertices, that is 50 Python function calls per geometry. On a dataset of 100,000 complex polygons, that is 5 million individual Python-level transform calls.

# The Fast Pattern: GeoDataFrame.to_crs()

For GeoPandas users, the first answer is to use GeoDataFrame.to_crs():

# FAST — GeoPandas handles CRS transformation internally using PyProj arrays
gdf_projected = gdf.to_crs("EPSG:28355")

Since GeoPandas 0.7, to_crs() uses PyProj’s vectorised transform path internally. It extracts all coordinates from the geometry array, transforms them in a single call, and reconstructs the geometries. No Python loop over rows.

Benchmark: reprojecting 500,000 polygon geometries (avg 8 vertices each) from WGS84 to GDA2020/MGA Zone 55:

  • .apply(lambda g: transform(...)): 42 seconds
  • gdf.to_crs("EPSG:7855"): 1.1 seconds
  • 38× speedup

# Reprojecting Raw Coordinate Arrays

When you are working with raw coordinate arrays (not GeoPandas), Transformer.transform() accepts NumPy arrays directly:

import numpy as np
from pyproj import Transformer

# 10 million (lon, lat) pairs
coords = np.random.uniform(
    low=[140.0, -38.0],
    high=[150.0, -30.0],
    size=(10_000_000, 2)
)

lon = coords[:, 0]
lat = coords[:, 1]

# Create transformer once — reuse for all transforms
transformer = Transformer.from_crs(
    "EPSG:4326",
    "EPSG:28355",   # GDA94 / MGA Zone 55
    always_xy=True  # ensures lon/lat not lat/lon — critical!
)

import time
t0 = time.perf_counter()
x, y = transformer.transform(lon, lat)
elapsed = time.perf_counter() - t0
print(f"Transformed {len(lon):,} coordinates: {elapsed:.3f}s")
# → 0.34s for 10 million coordinate pairs

Compared to per-point transform: 10 million points via a Python loop takes approximately 95 seconds. The array transform takes 0.34 seconds — a 280× speedup.

# The always_xy=True Flag: Why It Matters

One of the most common sources of bugs in reprojection code: axis order. Geographic standards define (latitude, longitude) order for many CRS definitions, but most GIS software and APIs use (longitude, latitude) / (x, y) order.

PyProj follows the authority definition by default. EPSG:4326 is defined as (latitude, longitude). This means:

# WITHOUT always_xy — follows CRS definition (lat/lon for EPSG:4326)
transformer = Transformer.from_crs("EPSG:4326", "EPSG:28355")
x, y = transformer.transform(lat, lon)   # must pass lat first!

# WITH always_xy — always (longitude, latitude) / (x, y) regardless of CRS
transformer = Transformer.from_crs("EPSG:4326", "EPSG:28355", always_xy=True)
x, y = transformer.transform(lon, lat)   # natural (x, y) order

Always use always_xy=True unless you are specifically working with CRS definitions that have non-standard axis orders and need to preserve them. It prevents the silent transposition bug where coordinates appear to transform successfully but end up in the wrong hemisphere.

# Transforming Shapely Geometry Arrays Directly

For Shapely geometry arrays (outside of GeoPandas), the fastest approach combines shapely.get_coordinates() with the array transformer:

import shapely
import numpy as np
from pyproj import Transformer

geom_array = gdf.geometry.values  # NumPy array of Shapely geometries

# Step 1: Extract all coordinates as (M, 2) array
# include_z=False avoids a sparse Z array for 2D geometries
coords = shapely.get_coordinates(geom_array, include_z=False)

# Step 2: Transform all coordinates in one call
transformer = Transformer.from_crs("EPSG:4326", "EPSG:28355", always_xy=True)
x_proj, y_proj = transformer.transform(coords[:, 0], coords[:, 1])

# Step 3: Put transformed coordinates back into geometry array
coords_proj = np.column_stack([x_proj, y_proj])
geom_array_proj = shapely.set_coordinates(geom_array.copy(), coords_proj)

shapely.set_coordinates() modifies the coordinates of a geometry array in-place (on a copy). Combined with get_coordinates(), this gives you a pure-array reprojection pipeline without constructing GeoPandas objects.

Benchmark: 2 million polygon geometries with an average of 12 vertices each (24M total coordinates):

  • .apply(transform(...)) approach: 4 minutes 10 seconds
  • get_coordinates() + transformer.transform() + set_coordinates(): 6.2 seconds
  • 40× speedup

# Transformer Caching and Thread Safety

Transformer objects are not cheap to construct — each one initialises the PROJ library context and loads CRS definitions. In a web application or batch job that runs many reprojection operations:

from pyproj import Transformer
from functools import lru_cache

@lru_cache(maxsize=32)
def get_transformer(from_crs: str, to_crs: str) -> Transformer:
    """Cache transformers by CRS pair — construction cost paid once."""
    return Transformer.from_crs(from_crs, to_crs, always_xy=True)

# Usage: cached after first call
transformer = get_transformer("EPSG:4326", "EPSG:28355")
x, y = transformer.transform(lon_array, lat_array)

Regarding thread safety: Transformer objects are thread-safe in PyProj 3.x and later for array-based transforms. You can safely share a cached Transformer instance across threads.

# Handling Mixed or Unknown CRS

A common pattern in data pipelines: receiving data from multiple sources with different CRS, needing to normalise everything to a target CRS:

from pyproj import Transformer, CRS
from functools import lru_cache

TARGET_CRS = "EPSG:4326"

@lru_cache(maxsize=64)
def get_transformer_to_wgs84(source_epsg: int) -> Transformer:
    return Transformer.from_crs(
        f"EPSG:{source_epsg}",
        TARGET_CRS,
        always_xy=True
    )

def normalise_to_wgs84(lon: np.ndarray, lat: np.ndarray, source_epsg: int):
    if source_epsg == 4326:
        return lon, lat   # already in target CRS, no-op
    transformer = get_transformer_to_wgs84(source_epsg)
    return transformer.transform(lon, lat)

# Integration with Pandas: Reprojecting a Coordinate DataFrame

When coordinates are stored as separate columns rather than geometry objects:

import pandas as pd
import numpy as np
from pyproj import Transformer

# DataFrame with lon/lat columns
df = pd.DataFrame({
    'lon': np.random.uniform(140, 154, 1_000_000),
    'lat': np.random.uniform(-44, -10, 1_000_000),
    'value': np.random.random(1_000_000)
})

transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)

# Transform both columns in one call — returns two arrays
df['x_web'], df['y_web'] = transformer.transform(
    df['lon'].values,
    df['lat'].values
)

One million coordinate pairs transformed in approximately 0.03 seconds — fast enough that this can be run inline during data load rather than in a separate preprocessing step.

# Summary: The Rules

  1. Never construct Transformer inside a loop — create once, reuse.
  2. Always use always_xy=True unless you have a specific reason not to.
  3. Pass arrays, not scalarstransformer.transform(lon_array, lat_array) executes in C.
  4. Use gdf.to_crs() for GeoPandas GeoDataFrames — it already uses the array path.
  5. Cache transformers with lru_cache in long-running processes.
  6. For raw geometry arrays, use shapely.get_coordinates() + transformer.transform() + shapely.set_coordinates().

The patterns are simple, the speedups are dramatic, and the change is entirely backwards compatible — you get the performance improvement by changing how you call the existing library, not by changing libraries.


Related reading: GeoPandas at Scale: Escaping the .apply() Trap · Shapely 2.0: The Vectorisation Revolution in Python Geometry · GeoParquet: The Analytical Format for Vector Geodata