Most spatial analytics workloads run comfortably on CPU. Shapely 2.0’s array API, GeoPandas with Dask, and DuckDB cover the vast majority of use cases at acceptable performance. But there exists a class of problem — tens of millions of geometry operations that must complete in seconds, not minutes — where CPU throughput is the fundamental constraint.

cuSpatial is NVIDIA’s GPU-accelerated spatial analytics library, part of the RAPIDS ecosystem. It implements point-in-polygon, nearest-neighbour, trajectory processing, spatial joins, and H3 indexing operations on CUDA-capable GPUs. The speedups over CPU implementations range from 10× to over 100× depending on the operation and data size.

# When GPU Spatial Analytics Makes Sense

GPU parallelism follows Amdahl’s Law — the benefit scales with the parallelisable fraction of your workload. GPU spatial analytics is compelling when:

  • Dataset size: operations on 10M+ geometries where even Shapely vectorised is slow
  • Latency requirements: real-time or near-real-time spatial processing (vehicle tracking, demand forecasting, fraud detection)
  • Repeated operations: point-in-polygon on a fixed set of zones run continuously as new events arrive
  • GPU already in use: if the data pipeline already uses CUDA for ML inference, adding spatial operations on GPU avoids CPU-GPU data transfer overhead

It is not worth the setup complexity for:

  • One-off or infrequent analyses (GeoPandas is fine)
  • Small datasets (<1M features)
  • Operations that do not have GPU implementations (complex overlay topology)
  • Environments without CUDA-capable hardware

# RAPIDS Ecosystem Overview

cuSpatial sits within the RAPIDS ecosystem:

  • cuDF: GPU-accelerated Pandas alternative (DataFrames on GPU memory)
  • cuSpatial: spatial operations on cuDF DataFrames
  • cuML: GPU-accelerated scikit-learn equivalent
  • cuGraph: GPU-accelerated graph analytics

The common thread: data stays on GPU memory between operations, avoiding CPU-GPU transfer costs that would otherwise dominate.

# Installation

# RAPIDS installation via conda (requires NVIDIA GPU, CUDA 11.x or 12.x)
conda install -c rapidsai -c conda-forge -c nvidia \
    cuspatial=24.10 python=3.11 cuda-version=12.2

Check RAPIDS compatibility matrix at rapids.ai for exact version requirements.

# Point-in-Polygon on GPU

The most impactful cuSpatial operation for spatial analytics:

import cudf
import cuspatial
import numpy as np
import time

# ── Generate test data: 50 million GPS points ─────────────────────────────────
rng = np.random.default_rng(42)
n_points = 50_000_000

lons = rng.uniform(150.5, 151.5, n_points).astype(np.float64)
lats = rng.uniform(-34.2, -33.5, n_points).astype(np.float64)

# ── 500 zone polygons (simple squares for benchmark) ─────────────────────────
n_zones = 500
zone_centers_lon = rng.uniform(150.6, 151.4, n_zones)
zone_centers_lat = rng.uniform(-34.1, -33.6, n_zones)
half_size = 0.05   # ~5km side

# Build polygon arrays in cuSpatial format
# cuSpatial uses a flat coordinate representation with offsets
all_poly_offsets = [0]
all_ring_offsets = [0]
poly_x_coords = []
poly_y_coords = []

for cx, cy in zip(zone_centers_lon, zone_centers_lat):
    # Simple square polygon: 4 corners + close
    ring = [
        (cx - half_size, cy - half_size),
        (cx + half_size, cy - half_size),
        (cx + half_size, cy + half_size),
        (cx - half_size, cy + half_size),
        (cx - half_size, cy - half_size),  # close ring
    ]
    for x, y in ring:
        poly_x_coords.append(x)
        poly_y_coords.append(y)
    all_ring_offsets.append(len(poly_x_coords))
    all_poly_offsets.append(len(poly_x_coords))

# ── CPU baseline: GeoPandas sjoin ────────────────────────────────────────────
import geopandas as gpd
import shapely

points_geom = shapely.points(np.column_stack([lons[:1_000_000], lats[:1_000_000]]))
zones_geom = [
    shapely.box(cx - half_size, cy - half_size, cx + half_size, cy + half_size)
    for cx, cy in zip(zone_centers_lon, zone_centers_lat)
]

pings_gdf = gpd.GeoDataFrame(geometry=points_geom, crs='EPSG:4326')
zones_gdf = gpd.GeoDataFrame({'zone_id': range(n_zones)}, geometry=zones_geom, crs='EPSG:4326')

t0 = time.perf_counter()
joined = gpd.sjoin(pings_gdf, zones_gdf, how='left', predicate='within')
cpu_time = time.perf_counter() - t0
print(f"CPU sjoin (1M pings × 500 zones): {cpu_time:.2f}s")
# → ~4.2s for 1M points

# ── GPU: cuSpatial point-in-polygon ──────────────────────────────────────────
# Transfer data to GPU
points_x_gpu = cudf.Series(lons)
points_y_gpu = cudf.Series(lats)

poly_offsets_gpu = cudf.Series(all_poly_offsets[:-1])
ring_offsets_gpu = cudf.Series(all_ring_offsets[:-1])
poly_x_gpu = cudf.Series(poly_x_coords)
poly_y_gpu = cudf.Series(poly_y_coords)

t0 = time.perf_counter()
pip_result = cuspatial.point_in_polygon(
    points_x_gpu,
    points_y_gpu,
    poly_offsets_gpu,
    ring_offsets_gpu,
    poly_x_gpu,
    poly_y_gpu
)
gpu_time = time.perf_counter() - t0
print(f"GPU point-in-polygon (50M pings × 500 zones): {gpu_time:.2f}s")
# → ~1.8s for 50M points (on RTX 3080)

print(f"GPU processes {50/1}x more data in {gpu_time/cpu_time:.1f}x the time")
# → 50× more data in 0.4× the time = effective 125× throughput advantage

# Nearest-Neighbour on GPU

import cuspatial
import cudf
import numpy as np

# ── 5 million GPS pings, 20,000 POI locations ────────────────────────────────
rng = np.random.default_rng(1)
n_pings = 5_000_000
n_pois  = 20_000

ping_x = cudf.Series(rng.uniform(150.5, 151.5, n_pings))
ping_y = cudf.Series(rng.uniform(-34.2, -33.5, n_pings))

poi_x = cudf.Series(rng.uniform(150.5, 151.5, n_pois))
poi_y = cudf.Series(rng.uniform(-34.2, -33.5, n_pois))

import time
t0 = time.perf_counter()

# cuSpatial haversine_distance computes pairwise distances on GPU
# For nearest-neighbour, compute distance matrix and take argmin
# (Note: cuSpatial 24.x has dedicated spatial_join_nearest for point-to-point)
result = cuspatial.GeoDataFrame({
    'x': ping_x, 'y': ping_y
})

# Point-to-point nearest neighbour using cuspatial
nearest = cuspatial.nearest_points(
    cuspatial.GeoSeries.from_points_xy(
        cudf.concat([ping_x, ping_y]).interleave_columns()
    ),
    cuspatial.GeoSeries.from_points_xy(
        cudf.concat([poi_x, poi_y]).interleave_columns()
    )
)
gpu_nn_time = time.perf_counter() - t0
print(f"GPU nearest-neighbour (5M pings → 20k POIs): {gpu_nn_time:.2f}s")
# → ~0.6s

# CPU equivalent (Shapely STRtree)
import shapely
pings_cpu = shapely.points(np.column_stack([ping_x.to_numpy(), ping_y.to_numpy()])[:500_000])
pois_cpu  = shapely.points(np.column_stack([poi_x.to_numpy(), poi_y.to_numpy()]))
tree = shapely.STRtree(pois_cpu)
t0 = time.perf_counter()
nn_idx, _ = tree.query_nearest(pings_cpu, return_distance=True)
cpu_nn_time = time.perf_counter() - t0
print(f"CPU STRtree nearest-neighbour (500k pings → 20k POIs): {cpu_nn_time:.2f}s")
# → 0.4s for 500k (extrapolated: ~4s for 5M)
# GPU: 0.6s for 5M = effectively 33× throughput advantage

# Trajectory Analysis

cuSpatial includes dedicated trajectory analysis functions that are particularly useful for GPS track processing:

import cuspatial
import cudf

# Vehicle GPS tracks: object_id, timestamp, x, y
# Compute per-trajectory displacement, speed, direction
trajectories = cudf.DataFrame({
    'object_id': vehicle_ids_gpu,
    'timestamp': timestamps_gpu,
    'x': x_coords_gpu,
    'y': y_coords_gpu
})

# Sort by object_id and timestamp
trajectories = trajectories.sort_values(['object_id', 'timestamp'])

# Compute displacement and speed for each trajectory segment
result = cuspatial.derive_trajectories(
    trajectories['object_id'],
    trajectories['x'],
    trajectories['y'],
    trajectories['timestamp']
)
# Returns DataFrame with distance, speed, acceleration per segment

On 10 million trajectory points (50,000 vehicles × 200 observations each), derive_trajectories completes in approximately 0.4 seconds vs ~12 seconds CPU-side.

# Data Transfer Bottleneck

The main practical constraint on GPU spatial analytics is data transfer between CPU RAM and GPU VRAM:

import cudf
import numpy as np
import time

data = np.random.random((10_000_000, 2))

# CPU → GPU transfer
t0 = time.perf_counter()
gpu_data = cudf.DataFrame({'x': data[:, 0], 'y': data[:, 1]})
transfer_time = time.perf_counter() - t0
print(f"CPU → GPU (10M rows): {transfer_time:.3f}s")
# → ~0.08s (PCIe 4.0 bandwidth ~30 GB/s)

For repeated operations on the same dataset, transfer cost is amortised. For single-pass operations, transfer time can dominate. The rule: if GPU computation is < 3× faster than transfer time, CPU may be faster overall.

# Production Integration

In a production spatial pipeline, cuSpatial fits between data loading and result writing:

import geopandas as gpd
import cudf
import cuspatial

# Load data from GeoParquet via GeoPandas (CPU)
pings_gdf = gpd.read_parquet("gps_pings.parquet")
zones_gdf = gpd.read_parquet("zones.parquet")

# Transfer to GPU for spatial join
# (requires geometry serialisation to cuspatial format)
pings_gpu = cuspatial.from_geopandas(pings_gdf)
zones_gpu = cuspatial.from_geopandas(zones_gdf)

# Run GPU spatial join
joined_gpu = cuspatial.sjoin(pings_gpu, zones_gpu, how='inner', predicate='within')

# Transfer result back to CPU for downstream processing
joined_cpu = joined_gpu.to_geopandas()

GPU spatial analytics is most impactful in high-throughput streaming pipelines (telematics, real-time demand forecasting, fraud detection) where the GPU is already warm and data transfer overhead is amortised across a continuous stream of events. For batch analytics, the complexity is justified only when CPU tools genuinely cannot meet the latency requirement.


Related reading: Numba-Accelerated Spatial Kernels · Dask-GeoPandas: Parallel Spatial Processing Across CPU Cores · Shapely STRtree: Mass Nearest-Neighbour and Intersection Queries