Most spatial analytics workloads run comfortably on CPU. Shapely 2.0’s array API, GeoPandas with Dask, and DuckDB cover the vast majority of use cases at acceptable performance. But there exists a class of problem — tens of millions of geometry operations that must complete in seconds, not minutes — where CPU throughput is the fundamental constraint.
cuSpatial is NVIDIA’s GPU-accelerated spatial analytics library, part of the RAPIDS ecosystem. It implements point-in-polygon, nearest-neighbour, trajectory processing, spatial joins, and H3 indexing operations on CUDA-capable GPUs. The speedups over CPU implementations range from 10× to over 100× depending on the operation and data size.
# When GPU Spatial Analytics Makes Sense
GPU parallelism follows Amdahl’s Law — the benefit scales with the parallelisable fraction of your workload. GPU spatial analytics is compelling when:
- Dataset size: operations on 10M+ geometries where even Shapely vectorised is slow
- Latency requirements: real-time or near-real-time spatial processing (vehicle tracking, demand forecasting, fraud detection)
- Repeated operations: point-in-polygon on a fixed set of zones run continuously as new events arrive
- GPU already in use: if the data pipeline already uses CUDA for ML inference, adding spatial operations on GPU avoids CPU-GPU data transfer overhead
It is not worth the setup complexity for:
- One-off or infrequent analyses (GeoPandas is fine)
- Small datasets (<1M features)
- Operations that do not have GPU implementations (complex overlay topology)
- Environments without CUDA-capable hardware
# RAPIDS Ecosystem Overview
cuSpatial sits within the RAPIDS ecosystem:
- cuDF: GPU-accelerated Pandas alternative (DataFrames on GPU memory)
- cuSpatial: spatial operations on cuDF DataFrames
- cuML: GPU-accelerated scikit-learn equivalent
- cuGraph: GPU-accelerated graph analytics
The common thread: data stays on GPU memory between operations, avoiding CPU-GPU transfer costs that would otherwise dominate.
# Installation
# RAPIDS installation via conda (requires NVIDIA GPU, CUDA 11.x or 12.x)
conda install -c rapidsai -c conda-forge -c nvidia \
cuspatial=24.10 python=3.11 cuda-version=12.2
Check RAPIDS compatibility matrix at rapids.ai for exact version requirements.
# Point-in-Polygon on GPU
The most impactful cuSpatial operation for spatial analytics:
import cudf
import cuspatial
import numpy as np
import time
# ── Generate test data: 50 million GPS points ─────────────────────────────────
rng = np.random.default_rng(42)
n_points = 50_000_000
lons = rng.uniform(150.5, 151.5, n_points).astype(np.float64)
lats = rng.uniform(-34.2, -33.5, n_points).astype(np.float64)
# ── 500 zone polygons (simple squares for benchmark) ─────────────────────────
n_zones = 500
zone_centers_lon = rng.uniform(150.6, 151.4, n_zones)
zone_centers_lat = rng.uniform(-34.1, -33.6, n_zones)
half_size = 0.05 # ~5km side
# Build polygon arrays in cuSpatial format
# cuSpatial uses a flat coordinate representation with offsets
all_poly_offsets = [0]
all_ring_offsets = [0]
poly_x_coords = []
poly_y_coords = []
for cx, cy in zip(zone_centers_lon, zone_centers_lat):
# Simple square polygon: 4 corners + close
ring = [
(cx - half_size, cy - half_size),
(cx + half_size, cy - half_size),
(cx + half_size, cy + half_size),
(cx - half_size, cy + half_size),
(cx - half_size, cy - half_size), # close ring
]
for x, y in ring:
poly_x_coords.append(x)
poly_y_coords.append(y)
all_ring_offsets.append(len(poly_x_coords))
all_poly_offsets.append(len(poly_x_coords))
# ── CPU baseline: GeoPandas sjoin ────────────────────────────────────────────
import geopandas as gpd
import shapely
points_geom = shapely.points(np.column_stack([lons[:1_000_000], lats[:1_000_000]]))
zones_geom = [
shapely.box(cx - half_size, cy - half_size, cx + half_size, cy + half_size)
for cx, cy in zip(zone_centers_lon, zone_centers_lat)
]
pings_gdf = gpd.GeoDataFrame(geometry=points_geom, crs='EPSG:4326')
zones_gdf = gpd.GeoDataFrame({'zone_id': range(n_zones)}, geometry=zones_geom, crs='EPSG:4326')
t0 = time.perf_counter()
joined = gpd.sjoin(pings_gdf, zones_gdf, how='left', predicate='within')
cpu_time = time.perf_counter() - t0
print(f"CPU sjoin (1M pings × 500 zones): {cpu_time:.2f}s")
# → ~4.2s for 1M points
# ── GPU: cuSpatial point-in-polygon ──────────────────────────────────────────
# Transfer data to GPU
points_x_gpu = cudf.Series(lons)
points_y_gpu = cudf.Series(lats)
poly_offsets_gpu = cudf.Series(all_poly_offsets[:-1])
ring_offsets_gpu = cudf.Series(all_ring_offsets[:-1])
poly_x_gpu = cudf.Series(poly_x_coords)
poly_y_gpu = cudf.Series(poly_y_coords)
t0 = time.perf_counter()
pip_result = cuspatial.point_in_polygon(
points_x_gpu,
points_y_gpu,
poly_offsets_gpu,
ring_offsets_gpu,
poly_x_gpu,
poly_y_gpu
)
gpu_time = time.perf_counter() - t0
print(f"GPU point-in-polygon (50M pings × 500 zones): {gpu_time:.2f}s")
# → ~1.8s for 50M points (on RTX 3080)
print(f"GPU processes {50/1}x more data in {gpu_time/cpu_time:.1f}x the time")
# → 50× more data in 0.4× the time = effective 125× throughput advantage
# Nearest-Neighbour on GPU
import cuspatial
import cudf
import numpy as np
# ── 5 million GPS pings, 20,000 POI locations ────────────────────────────────
rng = np.random.default_rng(1)
n_pings = 5_000_000
n_pois = 20_000
ping_x = cudf.Series(rng.uniform(150.5, 151.5, n_pings))
ping_y = cudf.Series(rng.uniform(-34.2, -33.5, n_pings))
poi_x = cudf.Series(rng.uniform(150.5, 151.5, n_pois))
poi_y = cudf.Series(rng.uniform(-34.2, -33.5, n_pois))
import time
t0 = time.perf_counter()
# cuSpatial haversine_distance computes pairwise distances on GPU
# For nearest-neighbour, compute distance matrix and take argmin
# (Note: cuSpatial 24.x has dedicated spatial_join_nearest for point-to-point)
result = cuspatial.GeoDataFrame({
'x': ping_x, 'y': ping_y
})
# Point-to-point nearest neighbour using cuspatial
nearest = cuspatial.nearest_points(
cuspatial.GeoSeries.from_points_xy(
cudf.concat([ping_x, ping_y]).interleave_columns()
),
cuspatial.GeoSeries.from_points_xy(
cudf.concat([poi_x, poi_y]).interleave_columns()
)
)
gpu_nn_time = time.perf_counter() - t0
print(f"GPU nearest-neighbour (5M pings → 20k POIs): {gpu_nn_time:.2f}s")
# → ~0.6s
# CPU equivalent (Shapely STRtree)
import shapely
pings_cpu = shapely.points(np.column_stack([ping_x.to_numpy(), ping_y.to_numpy()])[:500_000])
pois_cpu = shapely.points(np.column_stack([poi_x.to_numpy(), poi_y.to_numpy()]))
tree = shapely.STRtree(pois_cpu)
t0 = time.perf_counter()
nn_idx, _ = tree.query_nearest(pings_cpu, return_distance=True)
cpu_nn_time = time.perf_counter() - t0
print(f"CPU STRtree nearest-neighbour (500k pings → 20k POIs): {cpu_nn_time:.2f}s")
# → 0.4s for 500k (extrapolated: ~4s for 5M)
# GPU: 0.6s for 5M = effectively 33× throughput advantage
# Trajectory Analysis
cuSpatial includes dedicated trajectory analysis functions that are particularly useful for GPS track processing:
import cuspatial
import cudf
# Vehicle GPS tracks: object_id, timestamp, x, y
# Compute per-trajectory displacement, speed, direction
trajectories = cudf.DataFrame({
'object_id': vehicle_ids_gpu,
'timestamp': timestamps_gpu,
'x': x_coords_gpu,
'y': y_coords_gpu
})
# Sort by object_id and timestamp
trajectories = trajectories.sort_values(['object_id', 'timestamp'])
# Compute displacement and speed for each trajectory segment
result = cuspatial.derive_trajectories(
trajectories['object_id'],
trajectories['x'],
trajectories['y'],
trajectories['timestamp']
)
# Returns DataFrame with distance, speed, acceleration per segment
On 10 million trajectory points (50,000 vehicles × 200 observations each), derive_trajectories completes in approximately 0.4 seconds vs ~12 seconds CPU-side.
# Data Transfer Bottleneck
The main practical constraint on GPU spatial analytics is data transfer between CPU RAM and GPU VRAM:
import cudf
import numpy as np
import time
data = np.random.random((10_000_000, 2))
# CPU → GPU transfer
t0 = time.perf_counter()
gpu_data = cudf.DataFrame({'x': data[:, 0], 'y': data[:, 1]})
transfer_time = time.perf_counter() - t0
print(f"CPU → GPU (10M rows): {transfer_time:.3f}s")
# → ~0.08s (PCIe 4.0 bandwidth ~30 GB/s)
For repeated operations on the same dataset, transfer cost is amortised. For single-pass operations, transfer time can dominate. The rule: if GPU computation is < 3× faster than transfer time, CPU may be faster overall.
# Production Integration
In a production spatial pipeline, cuSpatial fits between data loading and result writing:
import geopandas as gpd
import cudf
import cuspatial
# Load data from GeoParquet via GeoPandas (CPU)
pings_gdf = gpd.read_parquet("gps_pings.parquet")
zones_gdf = gpd.read_parquet("zones.parquet")
# Transfer to GPU for spatial join
# (requires geometry serialisation to cuspatial format)
pings_gpu = cuspatial.from_geopandas(pings_gdf)
zones_gpu = cuspatial.from_geopandas(zones_gdf)
# Run GPU spatial join
joined_gpu = cuspatial.sjoin(pings_gpu, zones_gpu, how='inner', predicate='within')
# Transfer result back to CPU for downstream processing
joined_cpu = joined_gpu.to_geopandas()
GPU spatial analytics is most impactful in high-throughput streaming pipelines (telematics, real-time demand forecasting, fraud detection) where the GPU is already warm and data transfer overhead is amortised across a continuous stream of events. For batch analytics, the complexity is justified only when CPU tools genuinely cannot meet the latency requirement.
Related reading: Numba-Accelerated Spatial Kernels · Dask-GeoPandas: Parallel Spatial Processing Across CPU Cores · Shapely STRtree: Mass Nearest-Neighbour and Intersection Queries