The traditional satellite data workflow: identify the imagery you need, submit a download order, wait hours for files to arrive, store them locally, process them. For a decade of Landsat data over a study area, that is terabytes of files you download once and never reuse, stored on spinning disks that will eventually fail.
The cloud-native alternative: data stays in the cloud (AWS, Google Cloud, Microsoft Planetary Computer), you query a STAC (SpatioTemporal Asset Catalog) API to find the exact scenes you need, and stackstac materialises only the pixels within your area of interest — lazily, in parallel, at compute time.
# What STAC Is
STAC is a specification for describing geospatial assets — satellite scenes, aerial imagery, DEM tiles — in a consistent JSON structure. A STAC Catalogue is an index of these assets with searchable properties: spatial bounding box, acquisition date, cloud cover, satellite platform, band names.
Public STAC APIs with free access to COG (Cloud-Optimised GeoTIFF) assets:
- Microsoft Planetary Computer: Landsat, Sentinel-2, Sentinel-1, MODIS, NAIP
- Element84 Earth Search: Landsat Collection 2, Sentinel-2 L2A
- NOAA Big Data Program: GOES, NWM, MRMS
These datasets live in cloud object storage. The source files are Cloud-Optimised GeoTIFFs — a variant of GeoTIFF where the overview levels and internal tiles are ordered so that HTTP range requests can fetch just the pixels you need from just the zoom level you need.
# stackstac Workflow
The stackstac workflow has three steps:
- Search a STAC API for items matching your spatiotemporal query
- Pass the item collection to
stackstac.stack()— get back a lazy XArray DataArray - Run XArray operations on the lazy array — only then do any pixels load
import stackstac
import pystac_client
import numpy as np
# ── Step 1: Search STAC API ───────────────────────────────────────────────────
catalog = pystac_client.Client.open(
"https://earth-search.aws.element84.com/v1"
)
# Search for Sentinel-2 L2A scenes over Brisbane, low cloud cover
items = catalog.search(
collections=["sentinel-2-l2a"],
bbox=[152.7, -27.7, 153.2, -27.3], # Brisbane metro
datetime="2022-01-01/2022-12-31",
query={"eo:cloud_cover": {"lt": 20}} # <20% cloud cover
).item_collection()
print(f"Found {len(items)} scenes")
# → Found 47 scenes
# ── Step 2: Build lazy DataArray ──────────────────────────────────────────────
da = stackstac.stack(
items,
assets=["B04", "B08"], # Red + NIR for NDVI
resolution=10, # 10m resolution
bounds_latlon=[152.7, -27.7, 153.2, -27.3],
chunksize=1024 # spatial chunk size in pixels
)
print(da)
# <xarray.DataArray 'stackstac-...' (time: 47, band: 2, y: ~5000, x: ~5000)>
# dask.array<...> chunksize=(1, 2, 1024, 1024), dtype=float32
# Coordinates: time, band, y, x, ...
# No data has been loaded yet.
# Computing NDVI: Fully Lazy
import xarray as xr
# Select bands by name
red = da.sel(band="B04")
nir = da.sel(band="B08")
# Compute NDVI — still lazy, no data loaded
ndvi = (nir - red) / (nir + red)
ndvi.name = "ndvi"
# Apply cloud mask using SCL band
# (would require loading SCL as well — see below)
# ── Monthly mean NDVI ─────────────────────────────────────────────────────────
monthly_ndvi = ndvi.resample(time="1MS").mean(skipna=True)
# ── NOW trigger computation — loads only needed pixels via HTTP range requests ─
import time
t0 = time.perf_counter()
result = monthly_ndvi.compute()
elapsed = time.perf_counter() - t0
print(f"Monthly NDVI (47 scenes → 12 months, 50km²): {elapsed:.1f}s")
# → ~18s (limited by network to AWS S3 East)
# A downloaded equivalent would take ~2min read + compute
The key insight: stackstac never loads the entire scene. Each Sentinel-2 L2A scene covers ~100km × 100km at 10m resolution — that is 10,000 × 10,000 pixels = 400MB per band per scene. For 47 scenes × 2 bands, that would be ~37GB. stackstac only fetches the pixels within the bounding box you specified (~50 km²), for a tiny fraction of the data.
# Including Cloud Masking
Sentinel-2 provides a Scene Classification Layer (SCL) band for cloud masking:
# Load with SCL for cloud masking
da_with_scl = stackstac.stack(
items,
assets=["B04", "B08", "SCL"],
resolution=20, # SCL is 20m native resolution
bounds_latlon=[152.7, -27.7, 153.2, -27.3],
chunksize=512
)
red = da_with_scl.sel(band="B04")
nir = da_with_scl.sel(band="B08")
scl = da_with_scl.sel(band="SCL")
# SCL values: 4=vegetation, 5=bare soil, 6=water, 7=unclassified
# 8=cloud medium probability, 9=cloud high probability, 10=cirrus, 11=snow
cloud_mask = scl.isin([8, 9, 10])
ndvi = (nir - red) / (nir + red)
ndvi_masked = ndvi.where(~cloud_mask) # NaN where cloudy
# Monthly median (more robust than mean to residual cloud contamination)
monthly_median_ndvi = ndvi_masked.resample(time="1MS").median(skipna=True)
result = monthly_median_ndvi.compute()
# Landsat from Microsoft Planetary Computer
import pystac_client
import stackstac
import planetary_computer # pip install planetary-computer
# Planetary Computer requires token signing for data access
catalog = pystac_client.Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1",
modifier=planetary_computer.sign_inplace # adds SAS tokens to asset URLs
)
items = catalog.search(
collections=["landsat-c2-l2"],
bbox=[152.7, -27.7, 153.2, -27.3],
datetime="2013-01-01/2023-12-31", # 10 years of Landsat 8/9
query={"eo:cloud_cover": {"lt": 15}, "platform": {"in": ["landsat-8", "landsat-9"]}}
).item_collection()
print(f"Found {len(items)} Landsat scenes over 10 years")
# → Found 203 scenes
da = stackstac.stack(
items,
assets=["red", "nir08"], # Planetary Computer uses descriptive band names
resolution=30,
bounds_latlon=[152.7, -27.7, 153.2, -27.3],
chunksize=512
)
ndvi = (da.sel(band="nir08") - da.sel(band="red")) / \
(da.sel(band="nir08") + da.sel(band="red"))
annual_mean = ndvi.resample(time="AS").mean(skipna=True)
result = annual_mean.compute()
print(result.shape) # (10, y, x) — 10 years of annual mean NDVI
# Reprojection and Resolution Handling
stackstac handles mixed resolutions and CRS automatically:
# Mixing 10m and 20m bands — stackstac resamples to target resolution
da = stackstac.stack(
items,
assets=["B02", "B03", "B04", "B08", "B11"], # B11 is 20m native
resolution=10, # upsample B11 to 10m
epsg=32755, # GDA2020 / MGA Zone 55 — local projected CRS
bounds=[300_000, 6_900_000, 350_000, 6_950_000], # in metres
resampling=stackstac.Resampling.bilinear
)
# Dask Configuration for Network-Bound Work
Satellite pixel loading is network-bound (HTTP range requests to S3). Increase Dask thread count for better throughput:
import dask
from dask.distributed import Client
# More threads = more concurrent HTTP requests = faster pixel fetching
client = Client(n_workers=1, threads_per_worker=16, memory_limit="32GB")
# Now .compute() will saturate network bandwidth
result = monthly_ndvi.compute()
client.close()
On a machine with good network bandwidth (1 Gbps+), 16 threads will typically saturate the connection. On slower connections, 4–8 threads are sufficient.
# Writing Results to Zarr for Reuse
If you will run the same computation multiple times (parameter tuning, iterative analysis), write intermediate results to Zarr to avoid re-fetching from S3:
# Fetch once, cache locally
ndvi_masked.to_zarr("brisbane_ndvi_2022.zarr", mode="w")
# Subsequent runs: open from local Zarr (microseconds vs. 20+ seconds from S3)
import xarray as xr
ndvi_cached = xr.open_zarr("brisbane_ndvi_2022.zarr")["ndvi"]
monthly_result = ndvi_cached.resample(time="1MS").mean().compute()
The fetch-once-cache pattern is especially valuable during analysis development where you iterate on processing parameters but the source data does not change.
# When Not to Use stackstac
stackstac excels for regional, multi-date analyses where you want a subset of the available data. It is less well-suited to:
- Continental or global mosaics: if you genuinely need all pixels of all scenes, bulk download (AWS Open Data, Google Earth Engine, STAC bulk download tools) may be more efficient than streaming
- Very high cadence analysis: if your analysis requires sub-scene update frequency (e.g. hourly GOES data), consider pre-processed products rather than raw COG streaming
- Offline environments: stackstac requires live network access to cloud storage
For most regional spatial analysis workflows — studying change over a city, a watershed, a state — stackstac makes cloud-native satellite analysis accessible without any download infrastructure.
Related reading: XArray GroupBy for Temporal Raster Aggregation · XArray and Zarr: Chunked Raster Storage for Cloud Workflows · Rasterio Windowed Reading for Large Rasters