stackstac: Lazy Satellite Time Series from STAC Catalogues

The traditional satellite data workflow: identify the imagery you need, submit a download order, wait hours for files to arrive, store them locally, process them. For a decade of Landsat data over a study area, that is terabytes of files you download once and never reuse, stored on spinning disks that will eventually fail.

The cloud-native alternative: data stays in the cloud (AWS, Google Cloud, Microsoft Planetary Computer), you query a STAC (SpatioTemporal Asset Catalog) API to find the exact scenes you need, and stackstac materialises only the pixels within your area of interest — lazily, in parallel, at compute time.

# What STAC Is

STAC is a specification for describing geospatial assets — satellite scenes, aerial imagery, DEM tiles — in a consistent JSON structure. A STAC Catalogue is an index of these assets with searchable properties: spatial bounding box, acquisition date, cloud cover, satellite platform, band names.

Public STAC APIs with free access to COG (Cloud-Optimised GeoTIFF) assets:

Microsoft Planetary Computer: Landsat, Sentinel-2, Sentinel-1, MODIS, NAIP
Element84 Earth Search: Landsat Collection 2, Sentinel-2 L2A
NOAA Big Data Program: GOES, NWM, MRMS

These datasets live in cloud object storage. The source files are Cloud-Optimised GeoTIFFs — a variant of GeoTIFF where the overview levels and internal tiles are ordered so that HTTP range requests can fetch just the pixels you need from just the zoom level you need.

# stackstac Workflow

The stackstac workflow has three steps:

Search a STAC API for items matching your spatiotemporal query
Pass the item collection to stackstac.stack() — get back a lazy XArray DataArray
Run XArray operations on the lazy array — only then do any pixels load

import stackstac
import pystac_client
import numpy as np

# ── Step 1: Search STAC API ───────────────────────────────────────────────────
catalog = pystac_client.Client.open(
    "https://earth-search.aws.element84.com/v1"
)

# Search for Sentinel-2 L2A scenes over Brisbane, low cloud cover
items = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[152.7, -27.7, 153.2, -27.3],       # Brisbane metro
    datetime="2022-01-01/2022-12-31",
    query={"eo:cloud_cover": {"lt": 20}}      # <20% cloud cover
).item_collection()

print(f"Found {len(items)} scenes")
# → Found 47 scenes

# ── Step 2: Build lazy DataArray ──────────────────────────────────────────────
da = stackstac.stack(
    items,
    assets=["B04", "B08"],           # Red + NIR for NDVI
    resolution=10,                    # 10m resolution
    bounds_latlon=[152.7, -27.7, 153.2, -27.3],
    chunksize=1024                    # spatial chunk size in pixels
)

print(da)
# <xarray.DataArray 'stackstac-...' (time: 47, band: 2, y: ~5000, x: ~5000)>
# dask.array<...> chunksize=(1, 2, 1024, 1024), dtype=float32
# Coordinates: time, band, y, x, ...

# No data has been loaded yet.

# Computing NDVI: Fully Lazy

import xarray as xr

# Select bands by name
red = da.sel(band="B04")
nir = da.sel(band="B08")

# Compute NDVI — still lazy, no data loaded
ndvi = (nir - red) / (nir + red)
ndvi.name = "ndvi"

# Apply cloud mask using SCL band
# (would require loading SCL as well — see below)

# ── Monthly mean NDVI ─────────────────────────────────────────────────────────
monthly_ndvi = ndvi.resample(time="1MS").mean(skipna=True)

# ── NOW trigger computation — loads only needed pixels via HTTP range requests ─
import time
t0 = time.perf_counter()
result = monthly_ndvi.compute()
elapsed = time.perf_counter() - t0
print(f"Monthly NDVI (47 scenes → 12 months, 50km²): {elapsed:.1f}s")
# → ~18s (limited by network to AWS S3 East)
# A downloaded equivalent would take ~2min read + compute

The key insight: stackstac never loads the entire scene. Each Sentinel-2 L2A scene covers ~100km × 100km at 10m resolution — that is 10,000 × 10,000 pixels = 400MB per band per scene. For 47 scenes × 2 bands, that would be ~37GB. stackstac only fetches the pixels within the bounding box you specified (~50 km²), for a tiny fraction of the data.

# Including Cloud Masking

Sentinel-2 provides a Scene Classification Layer (SCL) band for cloud masking:

# Load with SCL for cloud masking
da_with_scl = stackstac.stack(
    items,
    assets=["B04", "B08", "SCL"],
    resolution=20,   # SCL is 20m native resolution
    bounds_latlon=[152.7, -27.7, 153.2, -27.3],
    chunksize=512
)

red  = da_with_scl.sel(band="B04")
nir  = da_with_scl.sel(band="B08")
scl  = da_with_scl.sel(band="SCL")

# SCL values: 4=vegetation, 5=bare soil, 6=water, 7=unclassified
# 8=cloud medium probability, 9=cloud high probability, 10=cirrus, 11=snow
cloud_mask = scl.isin([8, 9, 10])

ndvi = (nir - red) / (nir + red)
ndvi_masked = ndvi.where(~cloud_mask)   # NaN where cloudy

# Monthly median (more robust than mean to residual cloud contamination)
monthly_median_ndvi = ndvi_masked.resample(time="1MS").median(skipna=True)
result = monthly_median_ndvi.compute()

# Landsat from Microsoft Planetary Computer

import pystac_client
import stackstac
import planetary_computer   # pip install planetary-computer

# Planetary Computer requires token signing for data access
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace  # adds SAS tokens to asset URLs
)

items = catalog.search(
    collections=["landsat-c2-l2"],
    bbox=[152.7, -27.7, 153.2, -27.3],
    datetime="2013-01-01/2023-12-31",   # 10 years of Landsat 8/9
    query={"eo:cloud_cover": {"lt": 15}, "platform": {"in": ["landsat-8", "landsat-9"]}}
).item_collection()

print(f"Found {len(items)} Landsat scenes over 10 years")
# → Found 203 scenes

da = stackstac.stack(
    items,
    assets=["red", "nir08"],   # Planetary Computer uses descriptive band names
    resolution=30,
    bounds_latlon=[152.7, -27.7, 153.2, -27.3],
    chunksize=512
)

ndvi = (da.sel(band="nir08") - da.sel(band="red")) / \
       (da.sel(band="nir08") + da.sel(band="red"))

annual_mean = ndvi.resample(time="AS").mean(skipna=True)
result = annual_mean.compute()
print(result.shape)   # (10, y, x) — 10 years of annual mean NDVI

# Reprojection and Resolution Handling

stackstac handles mixed resolutions and CRS automatically:

# Mixing 10m and 20m bands — stackstac resamples to target resolution
da = stackstac.stack(
    items,
    assets=["B02", "B03", "B04", "B08", "B11"],  # B11 is 20m native
    resolution=10,          # upsample B11 to 10m
    epsg=32755,             # GDA2020 / MGA Zone 55 — local projected CRS
    bounds=[300_000, 6_900_000, 350_000, 6_950_000],  # in metres
    resampling=stackstac.Resampling.bilinear
)

# Dask Configuration for Network-Bound Work

Satellite pixel loading is network-bound (HTTP range requests to S3). Increase Dask thread count for better throughput:

import dask
from dask.distributed import Client

# More threads = more concurrent HTTP requests = faster pixel fetching
client = Client(n_workers=1, threads_per_worker=16, memory_limit="32GB")

# Now .compute() will saturate network bandwidth
result = monthly_ndvi.compute()
client.close()

On a machine with good network bandwidth (1 Gbps+), 16 threads will typically saturate the connection. On slower connections, 4–8 threads are sufficient.

# Writing Results to Zarr for Reuse

If you will run the same computation multiple times (parameter tuning, iterative analysis), write intermediate results to Zarr to avoid re-fetching from S3:

# Fetch once, cache locally
ndvi_masked.to_zarr("brisbane_ndvi_2022.zarr", mode="w")

# Subsequent runs: open from local Zarr (microseconds vs. 20+ seconds from S3)
import xarray as xr
ndvi_cached = xr.open_zarr("brisbane_ndvi_2022.zarr")["ndvi"]
monthly_result = ndvi_cached.resample(time="1MS").mean().compute()

The fetch-once-cache pattern is especially valuable during analysis development where you iterate on processing parameters but the source data does not change.

# When Not to Use stackstac

stackstac excels for regional, multi-date analyses where you want a subset of the available data. It is less well-suited to:

Continental or global mosaics: if you genuinely need all pixels of all scenes, bulk download (AWS Open Data, Google Earth Engine, STAC bulk download tools) may be more efficient than streaming
Very high cadence analysis: if your analysis requires sub-scene update frequency (e.g. hourly GOES data), consider pre-processed products rather than raw COG streaming
Offline environments: stackstac requires live network access to cloud storage

For most regional spatial analysis workflows — studying change over a city, a watershed, a state — stackstac makes cloud-native satellite analysis accessible without any download infrastructure.