cuSpatial: GPU-Accelerated Spatial Analytics with RAPIDS

Most spatial analytics workloads run comfortably on CPU. Shapely 2.0’s array API, GeoPandas with Dask, and DuckDB cover the vast majority of use cases at acceptable performance. But there exists a class of problem — tens of millions of geometry operations that must complete in seconds, not minutes — where CPU throughput is the fundamental constraint.

cuSpatial is NVIDIA’s GPU-accelerated spatial analytics library, part of the RAPIDS ecosystem. It implements point-in-polygon, nearest-neighbour, trajectory processing, spatial joins, and H3 indexing operations on CUDA-capable GPUs. The speedups over CPU implementations range from 10× to over 100× depending on the operation and data size.

# When GPU Spatial Analytics Makes Sense

GPU parallelism follows Amdahl’s Law — the benefit scales with the parallelisable fraction of your workload. GPU spatial analytics is compelling when:

Dataset size: operations on 10M+ geometries where even Shapely vectorised is slow
Latency requirements: real-time or near-real-time spatial processing (vehicle tracking, demand forecasting, fraud detection)
Repeated operations: point-in-polygon on a fixed set of zones run continuously as new events arrive
GPU already in use: if the data pipeline already uses CUDA for ML inference, adding spatial operations on GPU avoids CPU-GPU data transfer overhead

It is not worth the setup complexity for:

One-off or infrequent analyses (GeoPandas is fine)
Small datasets (<1M features)
Operations that do not have GPU implementations (complex overlay topology)
Environments without CUDA-capable hardware

# RAPIDS Ecosystem Overview

cuSpatial sits within the RAPIDS ecosystem:

cuDF: GPU-accelerated Pandas alternative (DataFrames on GPU memory)
cuSpatial: spatial operations on cuDF DataFrames
cuML: GPU-accelerated scikit-learn equivalent
cuGraph: GPU-accelerated graph analytics

The common thread: data stays on GPU memory between operations, avoiding CPU-GPU transfer costs that would otherwise dominate.

# Installation

# RAPIDS installation via conda (requires NVIDIA GPU, CUDA 11.x or 12.x)
conda install -c rapidsai -c conda-forge -c nvidia \
    cuspatial=24.10 python=3.11 cuda-version=12.2

Check RAPIDS compatibility matrix at rapids.ai for exact version requirements.

# Point-in-Polygon on GPU

The most impactful cuSpatial operation for spatial analytics:

import cudf
import cuspatial
import numpy as np
import time

# ── Generate test data: 50 million GPS points ─────────────────────────────────
rng = np.random.default_rng(42)
n_points = 50_000_000

lons = rng.uniform(150.5, 151.5, n_points).astype(np.float64)
lats = rng.uniform(-34.2, -33.5, n_points).astype(np.float64)

# ── 500 zone polygons (simple squares for benchmark) ─────────────────────────
n_zones = 500
zone_centers_lon = rng.uniform(150.6, 151.4, n_zones)
zone_centers_lat = rng.uniform(-34.1, -33.6, n_zones)
half_size = 0.05   # ~5km side

# Build polygon arrays in cuSpatial format
# cuSpatial uses a flat coordinate representation with offsets
all_poly_offsets = [0]
all_ring_offsets = [0]
poly_x_coords = []
poly_y_coords = []

for cx, cy in zip(zone_centers_lon, zone_centers_lat):
    # Simple square polygon: 4 corners + close
    ring = [
        (cx - half_size, cy - half_size),
        (cx + half_size, cy - half_size),
        (cx + half_size, cy + half_size),
        (cx - half_size, cy + half_size),
        (cx - half_size, cy - half_size),  # close ring
    ]
    for x, y in ring:
        poly_x_coords.append(x)
        poly_y_coords.append(y)
    all_ring_offsets.append(len(poly_x_coords))
    all_poly_offsets.append(len(poly_x_coords))

# ── CPU baseline: GeoPandas sjoin ────────────────────────────────────────────
import geopandas as gpd
import shapely

points_geom = shapely.points(np.column_stack([lons[:1_000_000], lats[:1_000_000]]))
zones_geom = [
    shapely.box(cx - half_size, cy - half_size, cx + half_size, cy + half_size)
    for cx, cy in zip(zone_centers_lon, zone_centers_lat)
]

pings_gdf = gpd.GeoDataFrame(geometry=points_geom, crs='EPSG:4326')
zones_gdf = gpd.GeoDataFrame({'zone_id': range(n_zones)}, geometry=zones_geom, crs='EPSG:4326')

t0 = time.perf_counter()
joined = gpd.sjoin(pings_gdf, zones_gdf, how='left', predicate='within')
cpu_time = time.perf_counter() - t0
print(f"CPU sjoin (1M pings × 500 zones): {cpu_time:.2f}s")
# → ~4.2s for 1M points

# ── GPU: cuSpatial point-in-polygon ──────────────────────────────────────────
# Transfer data to GPU
points_x_gpu = cudf.Series(lons)
points_y_gpu = cudf.Series(lats)

poly_offsets_gpu = cudf.Series(all_poly_offsets[:-1])
ring_offsets_gpu = cudf.Series(all_ring_offsets[:-1])
poly_x_gpu = cudf.Series(poly_x_coords)
poly_y_gpu = cudf.Series(poly_y_coords)

t0 = time.perf_counter()
pip_result = cuspatial.point_in_polygon(
    points_x_gpu,
    points_y_gpu,
    poly_offsets_gpu,
    ring_offsets_gpu,
    poly_x_gpu,
    poly_y_gpu
)
gpu_time = time.perf_counter() - t0
print(f"GPU point-in-polygon (50M pings × 500 zones): {gpu_time:.2f}s")
# → ~1.8s for 50M points (on RTX 3080)

print(f"GPU processes {50/1}x more data in {gpu_time/cpu_time:.1f}x the time")
# → 50× more data in 0.4× the time = effective 125× throughput advantage

# Nearest-Neighbour on GPU

import cuspatial
import cudf
import numpy as np

# ── 5 million GPS pings, 20,000 POI locations ────────────────────────────────
rng = np.random.default_rng(1)
n_pings = 5_000_000
n_pois  = 20_000

ping_x = cudf.Series(rng.uniform(150.5, 151.5, n_pings))
ping_y = cudf.Series(rng.uniform(-34.2, -33.5, n_pings))

poi_x = cudf.Series(rng.uniform(150.5, 151.5, n_pois))
poi_y = cudf.Series(rng.uniform(-34.2, -33.5, n_pois))

import time
t0 = time.perf_counter()

# cuSpatial haversine_distance computes pairwise distances on GPU
# For nearest-neighbour, compute distance matrix and take argmin
# (Note: cuSpatial 24.x has dedicated spatial_join_nearest for point-to-point)
result = cuspatial.GeoDataFrame({
    'x': ping_x, 'y': ping_y
})

# Point-to-point nearest neighbour using cuspatial
nearest = cuspatial.nearest_points(
    cuspatial.GeoSeries.from_points_xy(
        cudf.concat([ping_x, ping_y]).interleave_columns()
    ),
    cuspatial.GeoSeries.from_points_xy(
        cudf.concat([poi_x, poi_y]).interleave_columns()
    )
)
gpu_nn_time = time.perf_counter() - t0
print(f"GPU nearest-neighbour (5M pings → 20k POIs): {gpu_nn_time:.2f}s")
# → ~0.6s

# CPU equivalent (Shapely STRtree)
import shapely
pings_cpu = shapely.points(np.column_stack([ping_x.to_numpy(), ping_y.to_numpy()])[:500_000])
pois_cpu  = shapely.points(np.column_stack([poi_x.to_numpy(), poi_y.to_numpy()]))
tree = shapely.STRtree(pois_cpu)
t0 = time.perf_counter()
nn_idx, _ = tree.query_nearest(pings_cpu, return_distance=True)
cpu_nn_time = time.perf_counter() - t0
print(f"CPU STRtree nearest-neighbour (500k pings → 20k POIs): {cpu_nn_time:.2f}s")
# → 0.4s for 500k (extrapolated: ~4s for 5M)
# GPU: 0.6s for 5M = effectively 33× throughput advantage

# Trajectory Analysis

cuSpatial includes dedicated trajectory analysis functions that are particularly useful for GPS track processing:

import cuspatial
import cudf

# Vehicle GPS tracks: object_id, timestamp, x, y
# Compute per-trajectory displacement, speed, direction
trajectories = cudf.DataFrame({
    'object_id': vehicle_ids_gpu,
    'timestamp': timestamps_gpu,
    'x': x_coords_gpu,
    'y': y_coords_gpu
})

# Sort by object_id and timestamp
trajectories = trajectories.sort_values(['object_id', 'timestamp'])

# Compute displacement and speed for each trajectory segment
result = cuspatial.derive_trajectories(
    trajectories['object_id'],
    trajectories['x'],
    trajectories['y'],
    trajectories['timestamp']
)
# Returns DataFrame with distance, speed, acceleration per segment

On 10 million trajectory points (50,000 vehicles × 200 observations each), derive_trajectories completes in approximately 0.4 seconds vs ~12 seconds CPU-side.

# Data Transfer Bottleneck

The main practical constraint on GPU spatial analytics is data transfer between CPU RAM and GPU VRAM:

import cudf
import numpy as np
import time

data = np.random.random((10_000_000, 2))

# CPU → GPU transfer
t0 = time.perf_counter()
gpu_data = cudf.DataFrame({'x': data[:, 0], 'y': data[:, 1]})
transfer_time = time.perf_counter() - t0
print(f"CPU → GPU (10M rows): {transfer_time:.3f}s")
# → ~0.08s (PCIe 4.0 bandwidth ~30 GB/s)

For repeated operations on the same dataset, transfer cost is amortised. For single-pass operations, transfer time can dominate. The rule: if GPU computation is < 3× faster than transfer time, CPU may be faster overall.

# Production Integration

In a production spatial pipeline, cuSpatial fits between data loading and result writing:

import geopandas as gpd
import cudf
import cuspatial

# Load data from GeoParquet via GeoPandas (CPU)
pings_gdf = gpd.read_parquet("gps_pings.parquet")
zones_gdf = gpd.read_parquet("zones.parquet")

# Transfer to GPU for spatial join
# (requires geometry serialisation to cuspatial format)
pings_gpu = cuspatial.from_geopandas(pings_gdf)
zones_gpu = cuspatial.from_geopandas(zones_gdf)

# Run GPU spatial join
joined_gpu = cuspatial.sjoin(pings_gpu, zones_gpu, how='inner', predicate='within')

# Transfer result back to CPU for downstream processing
joined_cpu = joined_gpu.to_geopandas()

GPU spatial analytics is most impactful in high-throughput streaming pipelines (telematics, real-time demand forecasting, fraud detection) where the GPU is already warm and data transfer overhead is amortised across a continuous stream of events. For batch analytics, the complexity is justified only when CPU tools genuinely cannot meet the latency requirement.