Modern data engineering is dominated by two paradigms: the relational table and the dataframe. Data enters pipelines as rows, gets transformed by SQL or Pandas operations, and exits as more rows. This model is enormously powerful and rightly central to most data workflows.
But it has a blind spot. When the data has a spatial dimension — when you are asking questions about the real world and its geography — there is often a better structure than a table of rows: the raster.
A raster is simply a two-dimensional grid of numerical values, spatially registered to the real world. Think of it as a giant spreadsheet where the row and column indices directly correspond to geographic coordinates. Every cell has a position, and every position has at most one value per band. That is not a limitation — it is the defining property that makes rasters so fast.
Data engineers who spend their careers building row-based pipelines often encounter spatial analysis requirements and reach for the tools they know: PostGIS for spatial queries, complex polygon geometries, spatial joins. In many cases this works fine. In a subset of cases — particularly analysis involving continuous spatial fields, land cover classification, or complex geometry datasets — it is dramatically the wrong tool, and the performance difference is measurable in orders of magnitude.
# What a Raster Actually Is
Before arguing for rasters, it is worth being precise about what they are and what they are not.
A raster is a spatially-referenced regular grid of values. Each cell (pixel) covers a defined geographic area. The spatial extent, cell size, and coordinate reference system are stored as metadata alongside the data arrays. Common raster data types include:
- Satellite imagery (multispectral: blue, green, red, near-infrared bands)
- Elevation models (a single-band grid where each cell stores height above sea level)
- Land cover classifications (each cell stores a class code: urban, forest, water, etc.)
- Continuous environmental variables (temperature, rainfall, air quality, soil moisture)
- Derived analytical products (NDVI, flood extent, fire severity)
The critical properties that make rasters fast for certain queries:
**Random access is O(1).** Given a geographic coordinate, finding the cell value requires only an arithmetic transformation from world coordinates to array indices. No spatial index, no tree traversal, no geometry computation — just `row = (y_max - lat) / cell_size`, `col = (lon - x_min) / cell_size`. This is as fast as array access gets.
**Spatial subsets are trivial.** Extracting a rectangular region from a raster is an array slice operation: `data[row_min:row_max, col_min:col_max]`. The result is a new array of exactly the right size, loaded in one I/O operation if the data is chunked appropriately.
**Area statistics are reductions.** “What is the dominant land cover class within this area?” is a count over `data[mask]`. “What fraction of cells are forest?” is `(data == FOREST_CODE).sum() / data.size`. These are vectorised NumPy operations that run at memory-bandwidth speed.
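These three operations can be sketched on a synthetic grid — the extent, cell size, and class codes below are illustrative, not taken from any real dataset:

```python
import numpy as np

# Illustrative 25 m grid: origin and class codes are made up for the demo
X_MIN, Y_MAX, CELL_SIZE = 500_000.0, 7_000_000.0, 25.0  # projected metres
FOREST_CODE = 3
rng = np.random.default_rng(42)
data = rng.integers(1, 6, size=(400, 400))  # 10 km x 10 km of class codes

# O(1) random access: world coordinate -> array index is pure arithmetic
x, y = 502_512.0, 6_997_387.0
row = int((Y_MAX - y) / CELL_SIZE)
col = int((x - X_MIN) / CELL_SIZE)
value_at_point = data[row, col]

# Spatial subset: a rectangular region is just an array slice
subset = data[100:200, 100:200]  # a 2.5 km x 2.5 km block

# Area statistic: a vectorised reduction over the subset
forest_fraction = (subset == FOREST_CODE).mean()
print(value_at_point, subset.shape, round(float(forest_fraction), 3))
```

No library beyond NumPy is involved: the spatial metadata reduces every query to index arithmetic plus array operations.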
# The Problem with Complex Geometry Datasets
To make the performance argument concrete, consider one of the most challenging real-world vector datasets in Australian spatial data: the Queensland Regional Ecosystems (RE) database.
Queensland’s Regional Ecosystems are a formal classification of the vegetation communities of Queensland, mapped at a fine spatial resolution. The database has approximately 1,200 distinct ecosystem types. The polygon boundaries — derived from detailed field survey, aerial photography interpretation, and satellite image classification — are extraordinarily complex. Many polygons follow natural boundaries: the precise edge of a rainforest patch, the margins of a wetland system, the boundary between distinct soil types expressed in vegetation.
The vector dataset has approximately 180,000 polygons covering an area of 1.85 million km². Many of these polygons have thousands or tens of thousands of vertices. The total vertex count across the dataset runs into the hundreds of millions.
For a query like “what Regional Ecosystem types occur within 25 km of this point?”, PostGIS must:
- Build or traverse a spatial index to identify candidate polygons
- Test each candidate polygon for actual intersection with the buffer
- Execute complex polygon intersection geometry computations across potentially thousands of polygons, many with very high vertex counts
This is not a PostGIS performance problem — it is inherent to the query. Computing intersections of complex polygons is computationally expensive because the geometry is genuinely complex. The polygons faithfully represent reality. Reality is intricate.
# The Raster Alternative
The Queensland RE database also comes in a raster version: a 25-metre resolution grid covering the entire state, where each cell stores the RE type code as an integer. The file is approximately 4 GB as a Cloud-Optimised GeoTIFF.
For the same query — “what RE types occur within 25 km of this point?” — the raster approach is:
- Convert the buffer circle to a raster-space bounding box (arithmetic)
- Load the array slice covering the bounding box (one I/O operation)
- Apply a circular mask (element-wise comparison of distance from centre)
- Return the unique values (`np.unique(data[mask])`)
```python
import numpy as np
import rasterio
from rasterio.windows import from_bounds
from pyproj import Transformer


def re_types_within_radius(lat: float, lon: float, radius_m: float,
                           raster_path: str) -> list[int]:
    """
    Return all Regional Ecosystem codes within radius_m metres of lat/lon.
    ~5ms on a local SSD, ~80ms from S3 for a 25km radius query.
    """
    with rasterio.open(raster_path) as src:
        # Project query point to raster CRS (GDA94 / MGA z54 or similar)
        transformer = Transformer.from_crs("EPSG:4326", src.crs, always_xy=True)
        x, y = transformer.transform(lon, lat)

        # Compute bounding box for the query radius (square approximation for I/O)
        window = from_bounds(
            x - radius_m, y - radius_m,
            x + radius_m, y + radius_m,
            src.transform,
        )

        # Read ONLY this window — typically a few hundred KB
        data = src.read(1, window=window)
        win_transform = src.window_transform(window)

        # Build coordinate grids for the cell centres (half-cell offset)
        nrows, ncols = data.shape
        col_coords = win_transform.c + (np.arange(ncols) + 0.5) * win_transform.a
        row_coords = win_transform.f + (np.arange(nrows) + 0.5) * win_transform.e
        col_grid, row_grid = np.meshgrid(col_coords, row_coords)
        dist = np.sqrt((col_grid - x) ** 2 + (row_grid - y) ** 2)

        # Apply circular mask, excluding nodata cells if defined
        mask = dist <= radius_m
        if src.nodata is not None:
            mask &= data != src.nodata
        return np.unique(data[mask]).tolist()
```
The same function using PostGIS vector data would look like this:
```python
import psycopg2


def re_types_within_radius_postgis(lat: float, lon: float, radius_m: float,
                                   conn_str: str) -> list[int]:
    """
    PostGIS version. For complex Queensland RE polygons, this query
    typically takes 3–30 seconds depending on the spatial index hit rate
    and polygon complexity in the query area.
    """
    with psycopg2.connect(conn_str) as conn:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT DISTINCT re_code
                FROM qld_regional_ecosystems
                WHERE ST_DWithin(
                    geom::geography,
                    ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                    %s
                )
                ORDER BY re_code
            """, (lon, lat, radius_m))
            return [row[0] for row in cur.fetchall()]
```
The performance difference is dramatic. On a representative test (25 km radius, suburban fringe location with mixed ecosystem types):
- Raster version: 5–15 ms (local SSD), 60–120 ms (Cloud-Optimised GeoTIFF on S3)
- PostGIS vector version: 3,000–25,000 ms depending on geometry complexity
That is 100× to 1,000× faster — and the raster version is running entirely in Python with no database.
# Where the Approximation Bites
The raster approach is not perfect. The 25-metre resolution means that cells near polygon boundaries have a discrete representation: a cell is either one RE type or another, not a mix. For a query about which RE types occur in an area, this means:
- Very small RE patches (smaller than one or a few cells) may be missed or over-represented
- Boundary effects at the query radius edge are coarse (cells are in or out based on their centre point)
- Exact area calculations are approximate (each cell counts as exactly 25m × 25m)
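The cell-count area estimate is simple enough to state directly — the class codes and grid size below are illustrative:

```python
import numpy as np

CELL_AREA_M2 = 25 * 25  # each 25 m cell counts as exactly 625 m²

# Illustrative classified window: count cells of one class, convert to area
rng = np.random.default_rng(0)
data = rng.integers(1, 6, size=(1000, 1000))
n_cells = int((data == 2).sum())
area_km2 = n_cells * CELL_AREA_M2 / 1e6

# The estimate is quantised to whole cells: sub-cell slivers of the true
# polygon are rounded in or out, so expect error concentrated at boundaries.
print(f"{n_cells:,} cells ≈ {area_km2:.2f} km²")
```

For a patch with a large area-to-perimeter ratio the boundary error is negligible; for thin riparian strips a few cells wide it can dominate.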
For most analytical questions — “which ecosystems are present?”, “what is the dominant type?”, “how much area is affected?” — these approximations are acceptable. The 25m raster is a deliberate generalisation of a much more complex mapped reality; the boundaries in the vector data are themselves interpretations of gradual ecological transitions.
The cases where rasters genuinely cannot substitute for vectors are those requiring precise legal boundaries (cadastral parcels, administrative definitions with statutory significance) or exact area calculations for regulatory compliance. For exploratory analysis, modelling, and operational applications, the performance advantage of rasters almost always outweighs the loss of exact boundary precision.
# XArray for Raster Analysis at Scale
For analytical workflows involving multiple raster datasets, temporal stacks, or large spatial extents, XArray provides the natural data model. A land cover change analysis using XArray:
```python
import rioxarray  # rioxarray.open_rasterio replaces the removed xr.open_rasterio

# Load two time periods of land cover rasters (lazy, chunked)
lc_2018 = rioxarray.open_rasterio('landcover_qld_2018.tif', chunks={'x': 1024, 'y': 1024})
lc_2023 = rioxarray.open_rasterio('landcover_qld_2023.tif', chunks={'x': 1024, 'y': 1024})

# Detect change — still lazy (builds a computation graph)
changed = lc_2018 != lc_2023

# Compute only for a region of interest
roi = changed.sel(
    x=slice(153.0, 154.0),   # longitude range
    y=slice(-27.0, -28.0),   # latitude range (north to south, matching descending y)
)

# Execute: loads only the relevant chunks from both files
result = roi.compute()

# How many cells changed?
n_changed = int(result.sum())
pct_changed = 100 * n_changed / result.size
print(f"{n_changed:,} cells changed ({pct_changed:.1f}% of area)")
```
This computation reads only the required spatial chunks from both files — a few megabytes even for a large ROI from files that might be tens of gigabytes in total. The same analysis in PostGIS would require loading and intersecting all polygons in the region from both time periods.
# The Right Mental Model
The key insight is that rasters and vectors are not competing tools — they are complementary representations of spatial information suited to different types of queries.
Use vectors when:
- You need exact, legally-defined boundaries
- Features have rich and important attributes
- You are doing network analysis (roads, pipes, streams)
- You need to identify individual features (this specific parcel, this specific building)
Use rasters when:
- The data is inherently continuous (elevation, temperature, coverage)
- You need fast spatial queries against complex geometry (rasterise the complex polygons)
- You are doing zonal statistics (mean value within a polygon)
- You are performing map algebra (combining multiple spatial layers)
- Query speed matters more than exact boundary precision
The pattern that consistently surprises engineers coming to spatial analysis from a pure data engineering background: rasterise the complex polygons once, save the result as a Cloud-Optimised GeoTIFF, and answer subsequent queries against the raster. The upfront cost of rasterisation (minutes, using GDAL) buys orders-of-magnitude speedup for every subsequent query.
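The one-off rasterisation step can be sketched with the GDAL command-line tools — the layer, attribute, and file names below are placeholders, not the real dataset's:

```shell
# Burn the RE type code attribute into a 25 m integer grid
# (layer, attribute, and file names are illustrative placeholders)
gdal_rasterize \
  -a re_code \
  -tr 25 25 \
  -a_nodata 0 \
  -ot Int16 \
  -l regional_ecosystems \
  regional_ecosystems.gpkg re_25m.tif

# Rewrite as a Cloud-Optimised GeoTIFF for fast windowed reads
gdal_translate -of COG re_25m.tif re_25m_cog.tif
```

The `COG` output driver requires GDAL 3.1 or later; older installations can produce an equivalent result with explicit tiling and overview options.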
# PostGIS Raster: Possible but Not Elegant
PostGIS does include raster support — the raster data type, and a library of ST_ functions for raster analysis. It is technically capable. It is not, in most practitioners’ experience, elegant.
Raster tiles are stored as rows in a PostgreSQL table, with each row containing a fixed-size tile as a binary blob. Operations on large rasters involve joining these rows and performing tile-by-tile computation. The query planner was designed for sets of records, not for spatial arrays; the raster extension sits somewhat awkwardly on top of a fundamentally row-oriented engine.
For raster analysis, GDAL (via Rasterio), XArray, and specialised tools like rasterstats are better choices than PostGIS raster. PostGIS is excellent for what it was designed for — vector spatial queries. For rasters, use raster-native tools.
The ecosystem here is mature: GDAL has been doing raster analysis for twenty years; XArray handles the array data model with elegance; Rasterio provides the bridge between the two. The combination is fast, expressive, and well-documented.
Related reading: Querying NetCDF Data via API: XArray and FastAPI for Analytical Speed · The Open Source Geospatial Stack: PostGIS, GDAL, and Beyond · Cloud-Orchestrated Geospatial Workflows