For decades, the standard workflow for distributing scientific raster data has been the bulk file download. A researcher or modeller wanting to work with climate reanalysis data, ocean temperature profiles, or atmospheric model outputs would navigate to a data portal, select a region and time range, wait for a server-side extraction job to complete, download a multi-gigabyte NetCDF file, and then load it locally to answer their question.
This workflow made sense when computation was centralised and networks were slow. It makes much less sense when cloud compute is cheap, network bandwidth is high, and the actual question being asked can often be answered by reading a few kilobytes from a file that is hundreds of gigabytes in total. The legacy download model forces users to fetch the entire haystack to find a single needle.
There is a better architecture: load the NetCDF into memory using XArray’s lazy evaluation model, expose a thin FastAPI layer on top, and let clients “sip” exactly the data they need rather than swallowing the whole file. This approach is already transforming how climate and environmental data is served, and it is well within reach of any Python developer familiar with the scientific stack.
# What is NetCDF?
NetCDF (Network Common Data Form) is a self-describing binary file format designed for storing multi-dimensional array-oriented scientific data. Originally developed by Unidata in the late 1980s, it has become the de facto standard for climate, oceanographic, meteorological, atmospheric, and environmental datasets worldwide.
A NetCDF file is organised around three core concepts:
**Dimensions** define the axes of the data arrays. A global daily temperature dataset might have dimensions (time, latitude, longitude). A vertical profile might add (level). Each dimension has a size.

**Variables** are named multi-dimensional arrays associated with dimensions. The primary data variable might be air_temperature with shape (8760, 721, 1440) — one year of hourly data at 0.25° global resolution. Dimension variables (also called coordinate variables) give the actual values along each axis: the array of timestamps, latitude values, and longitude values.

**Attributes** are metadata: units, long_name, valid_range, missing_value, institution, history. Self-describing data is a significant advantage — you know what the values mean without an external data dictionary.
The format supports several underlying data types (float32, float64, int16, etc.) and compression codecs. Modern NetCDF4 files use HDF5 as their underlying storage format, with support for chunked storage and lossless compression that can reduce file sizes by 5–10× with no loss of precision.
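The three concepts map directly onto XArray's data model. A minimal in-memory sketch (the variable names and attribute values here are illustrative, not from any real dataset):

```python
import numpy as np
import xarray as xr

# Dimensions: (time, latitude, longitude); Variables: one data variable
# plus three coordinate variables; Attributes: metadata at both levels
temps = (15 + 8 * np.random.randn(2, 3, 4)).astype("float32")
ds = xr.Dataset(
    data_vars={"air_temperature": (("time", "latitude", "longitude"), temps)},
    coords={
        "time": np.array(["2023-01-01", "2023-01-02"], dtype="datetime64[ns]"),
        "latitude": np.array([50.0, 50.25, 50.5], dtype="float32"),
        "longitude": np.array([0.0, 0.25, 0.5, 0.75], dtype="float32"),
    },
    attrs={"institution": "Example Institute", "history": "created for illustration"},
)
ds["air_temperature"].attrs["units"] = "degC"

print(ds.sizes["latitude"], ds["air_temperature"].attrs["units"])  # 3 degC
# ds.to_netcdf("tiny.nc") would serialise all of this, self-describing,
# to a NetCDF4/HDF5 file
```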
# Why Legacy Download Patterns Break Down
Consider a concrete example: the ERA5 global atmospheric reanalysis dataset from ECMWF. ERA5 provides hourly estimates of a large number of atmospheric, land, and oceanic climate variables at a global spatial resolution of 0.25°, from 1940 to the present. The total archive is approximately 5 petabytes.
A typical modelling application might need to answer questions like:
- What is the daily maximum 2-metre temperature over a specific river catchment for the past 30 years?
- What are the wind speed and direction profiles at 10m height over a 50km radius for a given day?
- What is the monthly mean sea surface temperature anomaly for a defined ocean box?
Each of these questions touches only a tiny fraction of the total dataset. The spatial extent might be 0.01% of the global domain. The temporal extent might be 5% of the full record. The combination means the data actually needed is often 0.001–0.1% of a full file download.
The legacy pattern — download a regional extract, load locally, query — transfers 100–10,000× more data than necessary, consumes local storage that must be managed, and requires users to re-download when their time window or area of interest changes.
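The claim is easy to check with back-of-envelope arithmetic in the spirit of the ERA5 example (figures are illustrative and assume uncompressed float32 storage):

```python
BYTES = 4  # float32
# One year of hourly data on a 0.25° global grid (time, lat, lon)
full_file = 8760 * 721 * 1440 * BYTES       # ≈ 36.4 GB uncompressed
# The same year over a small catchment covering an 8 × 12 cell box
query = 8760 * 8 * 12 * BYTES               # ≈ 3.4 MB
print(f"ratio: {full_file / query:,.0f}x")  # ratio: 10,815x
```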
# XArray: Lazy, Labelled, Chunked
XArray is a Python library that provides labelled multi-dimensional arrays and datasets built on NumPy and (optionally) Dask. It is designed specifically for scientific datasets that have labelled dimensions and coordinate variables — which makes it a perfect fit for NetCDF.
The key concept for our API architecture is lazy evaluation. When you open a NetCDF file with XArray, no data is actually loaded from disk or network:
```python
import xarray as xr

# Open the file — only reads metadata and coordinate arrays
# The data arrays themselves are NOT loaded into memory yet
ds = xr.open_dataset('era5_2m_temperature_2023.nc', chunks='auto')

print(ds)
# <xarray.Dataset>
# Dimensions:    (time: 8760, latitude: 721, longitude: 1440)
# Coordinates:
#   * time       (time) datetime64[ns] 2023-01-01 ... 2023-12-31T23:00:00
#   * latitude   (latitude) float32 90.0 89.75 ... -89.75 -90.0
#   * longitude  (longitude) float32 0.0 0.25 ... 359.5 359.75
# Data variables:
#     t2m        (time, latitude, longitude) float32 dask.array<chunksize=...>
#
# Total file size: ~12 GB
# Memory used: ~2 MB (only coordinates and metadata)
```
The chunks='auto' argument tells XArray to use Dask for lazy chunked loading. The data variable t2m is now a Dask array — a graph of future computations, not an array of values.
When you apply a selection, XArray computes exactly which chunks need to be read:
```python
# Select a spatial subset — still lazy, no disk I/O yet
# Note: this file's longitudes run 0–360 (see the coordinates above),
# so the slice must use that convention
subset = ds['t2m'].sel(
    latitude=slice(52.0, 50.0),              # descending axis: ~8 latitude values
    longitude=slice(0.0, 3.0),               # ~12 longitude values
    time=slice('2023-06-01', '2023-06-30')   # one month
)

# .load() actually reads ONLY the required data from disk
data = subset.load()
# Reads ~500 KB from a 12 GB file
print(f"Loaded: {data.nbytes / 1024:.0f} KB into memory")
```
This lazy evaluation is what makes the API pattern efficient. The selection expression defines the query; .load() executes it, reading only the required chunks.
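The arithmetic behind that figure is worth making explicit. Using the approximate point counts from the comments above, the in-memory size of the subset is a few hundred kilobytes; the bytes read from disk are somewhat larger because Dask fetches whole chunks:

```python
# In-memory size of a one-month subset (float32 = 4 bytes per value)
lat_pts, lon_pts = 8, 12          # approximate counts from the slices above
hours = 30 * 24                   # one month of hourly steps
nbytes = lat_pts * lon_pts * hours * 4
print(f"{nbytes / 1024:.0f} KB")  # 270 KB in memory; disk reads are larger
                                  # because whole chunks must be fetched
```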
# Building the FastAPI Layer
With XArray handling the data access, FastAPI provides a lightweight HTTP interface. The architecture is:
- Load the NetCDF dataset at startup (lazy — only metadata in memory)
- Each request specifies a spatial/temporal slice via query parameters
- XArray selects and loads the slice — typically a few hundred KB
- Return the result as JSON (small responses) or as a binary array format (larger responses)
```python
from contextlib import asynccontextmanager
from fastapi import FastAPI, Query, HTTPException
from fastapi.middleware.cors import CORSMiddleware
import xarray as xr
import numpy as np

# ── Dataset loaded once at startup ──────────────────────────────────────────
datasets = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    """Load datasets on startup; release on shutdown."""
    source = xr.open_dataset(
        'era5_2m_temperature_2023.nc',
        chunks='auto'
    )
    # ERA5 stores longitude as 0–360; normalise to -180–180 so the
    # query-parameter bounds below match the data
    datasets['era5'] = source.assign_coords(
        longitude=(((source.longitude + 180) % 360) - 180)
    ).sortby('longitude')
    yield
    source.close()

app = FastAPI(
    title="Spatial Data API",
    description="Serve NetCDF slices on demand",
    lifespan=lifespan
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["GET"],
    allow_headers=["*"],
)

# ── Temperature endpoint ─────────────────────────────────────────────────────
@app.get("/temperature/slice")
def get_temperature_slice(
    lat_min: float = Query(..., ge=-90, le=90, description="Southern latitude bound"),
    lat_max: float = Query(..., ge=-90, le=90, description="Northern latitude bound"),
    lon_min: float = Query(..., ge=-180, le=180, description="Western longitude bound"),
    lon_max: float = Query(..., ge=-180, le=180, description="Eastern longitude bound"),
    time_start: str = Query(..., description="Start datetime, ISO format e.g. 2023-06-01"),
    time_end: str = Query(..., description="End datetime, ISO format e.g. 2023-06-30"),
    stat: str = Query("mean", description="Aggregation: mean | max | min | none"),
):
    ds = datasets['era5']
    try:
        # Spatial + temporal selection — lazy
        da = ds['t2m'].sel(
            latitude=slice(lat_max, lat_min),  # ERA5 lat is descending
            longitude=slice(lon_min, lon_max),
            time=slice(time_start, time_end),
        )

        # Apply temporal aggregation if requested
        if stat == "mean":
            da = da.mean(dim="time")
        elif stat == "max":
            da = da.max(dim="time")
        elif stat == "min":
            da = da.min(dim="time")
        # stat == "none" returns the full time series

        # Load ONLY this slice into memory
        data = da.load()

        # Convert to Celsius (ERA5 t2m is in Kelvin)
        data_c = data - 273.15

        return {
            "variable": "2m_air_temperature",
            "units": "celsius",
            "aggregation": stat,
            "shape": list(data_c.shape),
            "dims": list(data_c.dims),
            "lat_values": data_c.latitude.values.tolist() if "latitude" in data_c.dims else None,
            "lon_values": data_c.longitude.values.tolist() if "longitude" in data_c.dims else None,
            "time_values": [str(t) for t in data_c.time.values] if "time" in data_c.dims else None,
            "data": np.round(data_c.values, 2).tolist(),
        }
    except KeyError as e:
        raise HTTPException(status_code=400, detail=f"Selection error: {e}")
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# ── Point time-series endpoint ───────────────────────────────────────────────
@app.get("/temperature/timeseries")
def get_temperature_timeseries(
    lat: float = Query(..., ge=-90, le=90),
    lon: float = Query(..., ge=-180, le=180),
    time_start: str = Query(...),
    time_end: str = Query(...),
):
    """Return a time series for a single lat/lon point."""
    ds = datasets['era5']
    # method="nearest" cannot be combined with slice indexers in a single
    # .sel() call, so select the time window first, then snap to the grid
    da = (
        ds['t2m']
        .sel(time=slice(time_start, time_end))
        .sel(latitude=lat, longitude=lon, method="nearest")
        .load()
    ) - 273.15
    return {
        "lat": float(da.latitude),
        "lon": float(da.longitude),
        "units": "celsius",
        "timestamps": [str(t) for t in da.time.values],
        "values": np.round(da.values, 2).tolist(),
    }

# ── Coverage endpoint — what's available ─────────────────────────────────────
@app.get("/coverage")
def get_coverage():
    ds = datasets['era5']
    return {
        "variables": list(ds.data_vars),
        "lat_range": [float(ds.latitude.min()), float(ds.latitude.max())],
        "lon_range": [float(ds.longitude.min()), float(ds.longitude.max())],
        "time_range": [str(ds.time.min().values), str(ds.time.max().values)],
        "resolution": {
            "lat": float(abs(ds.latitude.diff("latitude").mean())),
            "lon": float(abs(ds.longitude.diff("longitude").mean())),
        },
    }
```
Start it with:
```bash
pip install fastapi uvicorn xarray dask netcdf4 numpy
uvicorn api:app --host 0.0.0.0 --port 8000
```
A request for the June mean temperature over Greater London:
```
GET /temperature/slice?lat_min=51.2&lat_max=51.8&lon_min=-0.5&lon_max=0.3&time_start=2023-06-01&time_end=2023-06-30&stat=mean
```
Reads approximately 80 KB from disk and returns a 2D temperature grid in under 200ms — regardless of the total file size.
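On the client side, consuming the endpoint needs nothing beyond an HTTP library. A hypothetical client sketch (the base URL assumes a local deployment; the `requests` call is shown as a comment):

```python
from urllib.parse import urlencode

# Build the Greater London query used above
base = "http://localhost:8000/temperature/slice"
params = {
    "lat_min": 51.2, "lat_max": 51.8,
    "lon_min": -0.5, "lon_max": 0.3,
    "time_start": "2023-06-01", "time_end": "2023-06-30",
    "stat": "mean",
}
url = f"{base}?{urlencode(params)}"
print(url)

# A real client would then fetch and reshape the grid, e.g.:
# resp = requests.get(url, timeout=30)
# grid = np.array(resp.json()["data"])   # 2D: (latitude, longitude)
```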
# Handling Multiple Variables and Large Datasets
Production datasets are rarely a single file. ERA5 distributes data as separate files per variable per month. A robust API handles this using XArray’s open_mfdataset, which combines multiple files into a single virtual dataset:
```python
import glob
import xarray as xr

# Open an entire year of hourly precipitation data across monthly files
files = sorted(glob.glob('era5_precip_2023_*.nc'))
ds = xr.open_mfdataset(
    files,
    combine='by_coords',   # combine along shared coordinates (time)
    chunks={'time': 24, 'latitude': 100, 'longitude': 100},
    parallel=True          # use Dask to open files in parallel
)
# ds.time now spans the full year, loaded from 12 separate files
# Memory used: still just a few MB for metadata
```
For files hosted in cloud object storage, XArray integrates with fsspec to open remote files with the same lazy semantics:
```python
import fsspec
import xarray as xr

# Open a NetCDF file directly from S3 — no local copy needed
ds = xr.open_dataset(
    fsspec.open('s3://my-bucket/era5/t2m_2023.nc', mode='rb').open(),
    engine='h5netcdf',
    chunks='auto'
)
```
Combined with the range-request capability of modern cloud storage (S3 supports HTTP range requests), XArray reads only the required byte ranges from the remote file — the same efficiency as local disk, over the network.
# Replacing the Legacy Download in Practice
The architectural shift this enables is significant. Instead of:
- User navigates to data portal
- Specifies bounding box and time range
- Server generates extract (minutes to hours for large requests)
- User downloads multi-GB file
- User loads file locally (additional minutes)
- User runs their analysis
The modern pattern is:
- Application calls `/temperature/slice` with bounding box and time range
- API returns JSON in under a second
- Application renders or processes the result
For a modelling application in a web context, this removes the concept of “downloading the data” almost entirely. A catchment hydrology model that needs daily temperature inputs for 30 years over a small area can receive them in a single API call that returns a lightweight JSON response. A visualisation application showing temperature anomalies can stream slices for each frame of an animation without any pre-download step.
The stat=none option preserves the full time dimension for workflows that need it — but even then, the data volume transferred is limited by the spatial extent and time window of the request, not by the size of the entire source file.
# Deployment Considerations
A few practical notes for production deployment:
**Memory management:** XArray’s lazy loading means the server process uses minimal memory in steady state. Peak memory spikes occur when large slices are loaded. Monitor this and set appropriate instance memory for your typical request sizes.
**Caching:** Common requests (the current month, a popular bounding box) benefit from result caching. FastAPI integrates well with Redis or simple in-memory LRU caches. Cache the computed result array, not the raw data.
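A minimal sketch of the in-process variant, keyed on the query parameters (`compute_slice` here is a stand-in for the xarray select/aggregate step; a multi-worker deployment would use Redis instead, since each worker has its own cache):

```python
from functools import lru_cache

calls = 0  # count real computations, not cache hits

@lru_cache(maxsize=256)
def compute_slice(lat_min, lat_max, lon_min, lon_max, time_start, time_end, stat):
    """Stand-in for the xarray selection/aggregation in the endpoint above."""
    global calls
    calls += 1
    # Return an immutable result (tuple, bytes), never a live DataArray
    return (stat, lat_min, lat_max, lon_min, lon_max, time_start, time_end)

compute_slice(51.2, 51.8, -0.5, 0.3, "2023-06-01", "2023-06-30", "mean")
compute_slice(51.2, 51.8, -0.5, 0.3, "2023-06-01", "2023-06-30", "mean")  # cache hit
print(calls)  # 1
```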
**Chunking strategy:** The chunk sizes in `open_mfdataset` affect read performance. For API use, spatial chunks (small lat/lon, large time) are typically better than the default, because most requests select small spatial areas across time.
**Concurrency:** Dask’s `.compute()` / `.load()` is blocking, not async-native. For production use under concurrent load, run multiple Uvicorn workers (e.g. `--workers 4`) rather than relying on FastAPI’s async for data loading.
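Within a single worker, one mitigation is to push the blocking load into a thread so the event loop stays responsive. A stdlib sketch of the idea (the `time.sleep` stands in for Dask reading chunks; in the API this would be `await asyncio.to_thread(da.load)` inside an `async def` endpoint):

```python
import asyncio
import time

def blocking_load():
    """Stand-in for a blocking dask .load() reading chunks from disk."""
    time.sleep(0.1)
    return "data"

async def handler():
    # Offload the blocking call to a worker thread; the event loop
    # remains free to serve other requests in the meantime
    return await asyncio.to_thread(blocking_load)

result = asyncio.run(handler())
print(result)  # data
```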
This API pattern extends beyond NetCDF to any large raster dataset that can be opened with XArray: Zarr archives, GeoTIFF collections, GRIB2 files, HDF5 datasets. Anywhere you have a multi-dimensional numerical array with spatial and/or temporal coordinates, the same lazy-load-and-slice pattern applies.