The Measurement Problem
Nature-based restoration in the UK is still young. As a result, ground-truth data from local projects is scarce. For context, as of January 2025, only one peatland project had reached year-5 verification under the Peatland Code (PC). For the Woodland Carbon Code (WCC) the figure stands at around 182 projects; however, even this data is often gated because of privacy concerns.
Moreover, where data is collected, it comes in many shapes and forms. Some projects are shifting to high-resolution drone outputs and investing in LiDAR or photogrammetry-based digital elevation models, while others are still processing 12.5 cm RGB tiles from aerial flights.
This creates a fundamental tension. The carbon market needs measurement that is accurate, auditable, and scalable. But the data landscape is fragmented, labelled examples are scarce, and the diversity of inputs means any AI system that only works on one sensor or one resolution will break the moment it meets a new site.
This reality shapes everything about how we build and ship geospatial AI at New Gradient and Calterra. In this post we walk through the full pipeline: from how we train models that can learn from very little labelled data, to how we turn that general capability into specific, interpretable measurements of individual trees that underpin credible carbon estimates.
What We Need From a Model
First, we need a model that can learn a general understanding of the underlying data distribution from a small number of samples. In other words, learning from tree-count labels on restored Juniper in England should be sufficient for the model to transfer to Caledonian Pine in Scotland. Second, the model must retain performance across diverse source data: different sensors, resolutions, and acquisition conditions.
Our answer: Geospatial Foundation Modelling.
We use a mix of open-source and in-house trained foundation models (FMs) that power task-specific learners. This lets us quickly bootstrap to many downstream applications and generalise performance even with small datasets. The backbone is also a multisource, multiresolution learner, allowing us to handle the various inputs currently used in practice.
Multiple studies show that models that first learn from large pools of unlabeled imagery and only then see a small set of audited examples reach the same, or better, accuracy with far fewer labels than training from scratch. For example, one widely cited study showed that with ~1% of labels, a pretrained model could match or surpass a baseline trained with 100% of labels on a standard land-cover benchmark. For compliance-grade geospatial work, that label-efficiency is decisive.
Building the Foundation
Data and Normalisation
- Sources: UK-wide RGB and depth (DSM/DTM) imagery from drone, aerial, and satellite.
- Geometry: Unify CRS, verify alignment, correct residual ortho issues.
- Depth cleanup: DSM detrending to emphasise local relief; distribution transforms for outlier handling and alignment with RGB data.
- Tiling & indexing: Window scenes into training tiles; retain metadata.
- Augmentations: Flips/rotations, careful to preserve ecology-relevant cues.
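The depth-cleanup and normalisation steps above can be sketched in a few lines. This is a minimal illustration, not our production pipeline: the block-average detrend, window size, and percentile bounds are illustrative assumptions, and real inputs would be georeferenced rasters rather than bare arrays.

```python
import numpy as np

def detrend_dsm(dsm: np.ndarray, window: int = 64) -> np.ndarray:
    """Subtract a coarse local trend so only local relief remains."""
    h, w = dsm.shape
    hb, wb = h // window, w // window
    # Coarse trend: block-average the DSM, then upsample back to full size.
    blocks = dsm[:hb * window, :wb * window].reshape(hb, window, wb, window)
    trend = np.kron(blocks.mean(axis=(1, 3)), np.ones((window, window)))
    return dsm[:hb * window, :wb * window] - trend

def robust_normalise(x: np.ndarray) -> np.ndarray:
    """Clip outliers at the 1st/99th percentile, then scale to [0, 1]."""
    lo, hi = np.percentile(x, [1, 99])
    return (np.clip(x, lo, hi) - lo) / (hi - lo + 1e-6)

# A toy 256x256 DSM tile with values in metres.
tile = robust_normalise(detrend_dsm(np.random.rand(256, 256) * 50))
```

After this transform, the height channel sits in the same value range as the RGB channels, which makes joint pretraining better behaved.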
Self-Supervised Pretraining
To create foundation models we need a learning task that lets models see vast amounts of domain data without any extensive labelling effort. This strategy is called self-supervised learning: we are not supervising the model with labels. The task we use is masked image modelling: hide patches, ask the network to reconstruct them. With RGB + height, the model learns structure (edges, textures, boundaries) and form (micro-topography) without any labels. In effect, the model gains an overarching understanding of Earth Observation (EO) data semantics.
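The mechanics of the masking objective can be sketched as follows. This is a schematic, assuming a 4-channel (RGB + height) tile split into 16×16 patches; the "decoder" here is a trivial placeholder standing in for the actual network.

```python
import numpy as np

rng = np.random.default_rng(0)
tile = rng.random((4, 224, 224))   # RGB + detrended height channels
P = 16                             # patch size
n = (224 // P) ** 2                # 196 patch tokens

# Flatten into patch tokens: (n_patches, channels * P * P).
patches = tile.reshape(4, 14, P, 14, P).transpose(1, 3, 0, 2, 4).reshape(n, -1)

# Mask 75% of patches; the encoder only ever sees the visible 25%.
mask = rng.permutation(n) < int(0.75 * n)
visible = patches[~mask]

# Placeholder decoder: predict the mean of the visible patches everywhere.
pred = np.tile(visible.mean(axis=0), (mask.sum(), 1))

# The reconstruction loss is computed only on the masked patches.
loss = np.mean((pred - patches[mask]) ** 2)
```

The real model replaces the placeholder with a transformer encoder-decoder, but the supervision signal is exactly this: reconstruct what was hidden, from what was seen.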
Feature clustering on frozen embeddings of pretrained models already shows semantic grouping (trees, roads, fields, built areas) even before fine-tuning. The model is itself able to learn underlying semantics of geospatial data, which makes it a much better and quicker learner on downstream geospatial analysis tasks, say, detecting all trees in an image. Hence, it has a learnt "foundation" in geospatial data.
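The kind of probe we run on frozen embeddings can be illustrated with a toy example. The embeddings below are synthetic stand-ins for real patch features, and the clustering is a bare-bones k-means written out for clarity rather than any specific library call.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two synthetic "semantic groups" of patch embeddings (e.g. trees vs fields),
# standing in for features pulled from a frozen pretrained backbone.
emb = np.vstack([rng.normal(0.0, 0.1, (100, 8)),
                 rng.normal(1.0, 0.1, (100, 8))])

def kmeans(x: np.ndarray, k: int = 2, iters: int = 20) -> np.ndarray:
    centers = x[[0, len(x) - 1]].copy()   # deterministic init for this sketch
    for _ in range(iters):
        # Assign each embedding to its nearest center, then recompute centers.
        labels = np.argmin(((x[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.stack([x[labels == j].mean(axis=0) for j in range(k)])
    return labels

labels = kmeans(emb)
# Well-separated embeddings fall cleanly into their two semantic groups.
```

When the same probe is run on real frozen features and the clusters line up with land-cover classes, that is direct evidence the pretraining has learned semantics, not just pixel statistics.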
Scale and Infrastructure
- Compute: 10x NVIDIA A40 GPUs, mixed precision, gradient checkpointing.
- Data scale: ~1.8 million tiles across the UK.
That foundation gives our models a general ability to see: edges, textures, boundaries, micro-topography. But seeing isn’t enough for compliance-grade measurement. To be useful for carbon markets, a model needs to understand: to delineate individual tree crowns, identify what species they are, and measure their structure.
From Foundation to Expert Models
The pretrained encoders are the backbone. On top, we attach focused heads ("experts") with clear, bite-sized jobs. This narrows each model's scope and limits how badly any one component can fail.
Why Small Experts, Not One Big Model?
A single model that simultaneously segments crowns, classifies species, and estimates height would be elegant on paper. In practice, it’s fragile. Each task competes for capacity during training, errors in one objective can degrade another, and when something goes wrong in production it’s hard to isolate where.
Instead, we decompose the problem. Each expert model shares the same pretrained ViT-L/16 backbone but attaches a lightweight, task-specific head. This buys us three things: we can develop, validate, and improve each component independently; we get interpretable intermediate outputs that can be visually inspected and audited; and we avoid the risk of a complex multi-task objective pulling the model in conflicting directions.
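The shared-backbone layout can be sketched schematically. The "encoder" below is a random-feature stand-in for the pretrained ViT-L/16, and the two heads are deliberately trivial; the point is the shape of the design, not the implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(tile: np.ndarray) -> np.ndarray:
    """Frozen, shared backbone: placeholder for the pretrained ViT-L/16."""
    return rng.normal(size=(196, 64))   # one 64-d embedding per patch

# Lightweight task-specific heads attached on top of the same features.
heads = {
    "crown_segmentation": lambda f: f @ rng.normal(size=(64, 1)),  # per-patch mask logits
    "species": lambda f: f.mean(axis=0) @ rng.normal(size=(64, 2)),  # class logits
}

feats = encoder(np.zeros((4, 224, 224)))
outputs = {name: head(feats) for name, head in heads.items()}
```

Because the backbone is shared and frozen, each head can be trained, validated, and swapped independently without touching the others.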
The pipeline flows in stages, with each expert building on the outputs of the one before it.
Crown Segmentation: Finding Each Tree
The first expert answers the most fundamental question: where is each tree, and what is its boundary?
This is an instance segmentation task, i.e., not just "there are trees in this image" but "here is the precise mask of each individual crown." Getting this right is critical because every downstream measurement (species, height, crown area as a proxy for DBH) depends on having accurate per-tree boundaries.
We use a query-based detection approach on top of the foundation backbone. The idea is that the model maintains a set of learned queries, each of which competes to explain one object in the scene. Each query produces a class prediction (tree vs background) and an instance mask. The queries interact with each other, so they learn not to all predict the same tree, and they attend back to the image features to decide which region they should represent.
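The core mechanic of a query-based head can be sketched with plain matrix operations. All weights here are random and the dimensions illustrative; this shows the information flow of a DETR-style decoder layer, not our trained architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, HW = 64, 10, 196   # embed dim, number of queries, number of image tokens

feats = rng.normal(size=(HW, D))    # backbone features for a 14x14 patch grid
queries = rng.normal(size=(N, D))   # learned object queries

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: each query gathers evidence from the image features.
attn = softmax(queries @ feats.T / np.sqrt(D))   # (N, HW)
updated = attn @ feats                           # (N, D) refined queries

# Each query emits a class logit (tree vs background) ...
W_cls = rng.normal(size=(D, 2))
class_logits = updated @ W_cls                   # (N, 2)

# ... and a mask, by dotting its embedding against every patch feature.
mask_logits = updated @ feats.T                  # (N, HW) per-query mask
```

In the full model this block is stacked several times with self-attention between queries, which is what lets queries negotiate so that each one claims a different tree.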
We experimented with two state-of-the-art architectures for the segmentation head and found meaningful performance differences on the BAMFORESTS benchmark, a standard European mixed-forest dataset. Our best model surpassed previously published results on this benchmark, meeting our target threshold and providing both accurate crown boundaries and bounding boxes useful for downstream processing.
Species Classification: Looking Inside the Crown
Once we have individual crown masks, the next expert classifies what each tree is. At this stage, we distinguish between conifer and broadleaved, a binary split, but one that matters significantly for biomass. Conifers and broadleaves follow different allometric relationships: given the same crown dimensions, a Scots Pine and an Oak will have meaningfully different wood densities, branching structures, and therefore carbon stocks. Getting species type right is not optional for credible biomass estimates downstream.
The architecture here is deliberately simple and reuses the same backbone. The key idea is forcing the classifier to look only inside the predicted crown so it isn't distracted by surrounding canopy, shadows, or ground. We convert each pixel-level instance mask into a patch-overlap weight map that tells the model which ViT patches belong to this tree. Those patches are then passed through a small transformer encoder where they exchange information (texture, colour, crown structure) while background patches are masked out. The result is pooled into a single crown embedding and classified.
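The patch-overlap weighting can be sketched as follows, assuming a 14×14 ViT patch grid over a 224-pixel tile; the toy crown mask and weighted mean pooling here simplify the small transformer described above.

```python
import numpy as np

rng = np.random.default_rng(0)
P, G = 16, 14                              # patch size, patch-grid size
patch_feats = rng.normal(size=(G * G, 32)) # backbone features per patch

# Pixel-level crown mask -> fraction of each patch covered by the crown.
crown = np.zeros((G * P, G * P))
crown[60:140, 70:150] = 1.0                # a toy crown instance
overlap = crown.reshape(G, P, G, P).mean(axis=(1, 3)).reshape(-1)  # (196,)

# Weighted pool: background patches (zero overlap) contribute nothing.
weights = overlap / overlap.sum()
crown_embedding = weights @ patch_feats    # (32,) single crown vector

# A linear head then classifies conifer vs broadleaf from this embedding.
```

The weight map also handles partial patches gracefully: a patch half-covered by the crown contributes half as much as a fully covered one.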
This achieves strong performance: over 96% accuracy and 96% macro F1 on the Quebec Trees benchmark, with balanced results across both classes. The main residual error is a small fraction of conifers predicted as broadleaved.
Canopy Height: Fusing ML With National Infrastructure
For canopy height, we made a pragmatic engineering decision: we don’t predict it, we measure it from existing data.
The UK has excellent national-scale LiDAR coverage. The Environment Agency’s composite products cover ~99% of England at 1 m resolution, and Scotland’s Remote Sensing Portal provides comparable data. From these, we compute a standard Canopy Height Model: CHM = DSM - DTM, where the Digital Surface Model captures the top of canopy and the Digital Terrain Model gives bare-earth elevation.
This is a deliberate choice. Where high-quality, survey-grade elevation data already exists at scale, there is no need to train a model to approximate it. Instead, our ML outputs (individual crown masks and bounding boxes) become the spatial queries: we extract per-tree height statistics directly from the CHM within each predicted crown boundary.
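The extraction step is simple enough to sketch directly. This assumes the CHM and crown masks are already co-registered arrays; in practice the inputs are georeferenced rasters, and the statistics chosen (95th percentile, mean) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dsm = rng.random((100, 100)) * 5 + 20   # top-of-canopy surface (metres)
dtm = np.full((100, 100), 20.0)         # bare-earth terrain (metres)
chm = dsm - dtm                         # canopy height model

# One predicted crown mask from the segmentation expert (boolean array).
crown_mask = np.zeros((100, 100), dtype=bool)
crown_mask[40:60, 40:60] = True

# Per-tree height statistics, read straight off the CHM within the crown.
heights = chm[crown_mask]
tree_stats = {
    "height_p95": float(np.percentile(heights, 95)),  # robust top height
    "height_mean": float(heights.mean()),
    "crown_area_px": int(crown_mask.sum()),
}
```

Using a high percentile rather than the maximum makes the top-height estimate robust to isolated noisy CHM pixels inside the crown.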
This approach also demonstrates something we think is important: geospatial AI doesn’t need to replace existing national EO infrastructure. It can plug into it.
What This Pipeline Delivers
At the end of this pipeline, every detected tree has a crown boundary, a species classification, and a height measurement, each derived from an auditable, inspectable intermediate. This is the measurement stack that will underpin interpretability of our biomass estimation. When a stakeholder asks why a particular carbon number was produced, we can show the crown mask, the species call, and the height measurement that validate it.
Zooming out, the combination of foundation modelling and task-specific experts gives us several properties that matter for a growing market:
Rapid adaptation. New domain, new indicators, new regulatory shifts, or new geographies: we adapt robust foundational learners instead of rebuilding from the ground up.
High accuracy even in low-label regimes. High-quality, audited annotations are enough to reach target accuracy, because the foundation has already learned the structure of the data.
Fewer failure modes via task decomposition. We deploy many expert models that each focus on a small task, which isolates errors and improves performance.
Interpretability. Some experts predict intermediate measures (canopy height, species detection, bare-peat extent) to ensure visibility into factors influencing carbon numbers and decision-making around project quality.
Direct indicators. Certain experts also predict direct measures, for example, imagery-to-biomass estimates or wetness indices. We believe models can sometimes create better intermediate abstractions than humans, and these measures, although not yet compliance-ready, should flow alongside more stringent measurement protocols. Deep learning models have already shown an RMSE under 30 on biomass estimation, laying a strong foundation for high-frequency digital monitoring, reporting, and verification (dMRV) that meets upstream reporting timelines.
Where This Goes
UK restoration needs models that are accurate, resilient, and auditable. Foundation models give us stability across fragmented data. Expert models give us the specificity and interpretability that compliance demands. Together, they let us build a measurement system that improves continuously with new data, without starting from scratch, and that can show its working at every step from raw pixel to carbon number.
