Large language models feel like magic. You can instruct them in plain English, they generalise across tasks they were never explicitly trained on, and they seem to get better at everything when you improve them at anything. The AI community has spent the last few years trying to replicate this magic in other domains. In geospatial, we are not there yet. Understanding why tells you something important about where language models get their power, and about the real state of play in geospatial AI.
Language is the cheat code
The standard explanation for why LLMs work so well focuses on scale: more data, more parameters, more compute. That is true but incomplete. The deeper reason is that language is a uniquely powerful modality to learn from.
Language is not raw data. It is a representation space that humans have developed and refined over thousands of years. It exists precisely to compress complex information into a shared, parseable format. When a model predicts the next token in a sequence of text, it is not learning to reproduce raw signals. It is learning to operate in a space that is already information-dense and structured for communication.
This gives next-token prediction a property that other self-supervised objectives lack: a natural curriculum. Early in training, the model can reduce its loss significantly just by learning syntax and common sentence structure. Then it has to start learning conditional distributions: given that this text is about medieval history, certain words and concepts become more likely. Eventually, to squeeze out further improvements, it has to develop something resembling conceptual understanding. Three distinct levels of capability, all driven by the same objective, all in the same representational space.
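All three levels are driven by one scalar: the average cross-entropy the model pays for its next-token predictions. A minimal sketch of that objective (function and variable names are mine, not from any particular library) makes the curriculum concrete: a model that knows nothing pays the log of the vocabulary size at every position, and every layer of structure it learns, from syntax to topic to concept, buys down that same number.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy over a token sequence.

    logits:  (seq_len, vocab_size) unnormalised scores at each position
    targets: (seq_len,) index of the actual next token at each position
    """
    # Log-softmax over the vocabulary at each position (numerically stable).
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Loss is the negative log-probability assigned to the true next token.
    return -log_probs[np.arange(len(targets)), targets].mean()

# A model that has learned nothing scores every token equally,
# so it pays ln(vocab_size) per token: the ceiling the curriculum descends from.
vocab_size, seq_len = 1000, 8
uniform = np.zeros((seq_len, vocab_size))
targets = np.arange(seq_len)
print(next_token_loss(uniform, targets))  # ln(1000) ≈ 6.91
```

Learning syntax, then conditional topic structure, then concepts are just successively harder ways of pushing this one number below the uniform baseline.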
The result is not just a good feature extractor. It is a system whose internal representations are grounded in the same space that humans use to describe tasks, evaluate outputs, and communicate goals. This is why instruction-following works. This is why reinforcement learning on top of language models is so effective: the task, the reasoning, and the output all live in the same medium, so improvements in one area compound and transfer to others. Train a language model to be better at code, and it gets better at reasoning generally, because both are expressed in the same shared space.
Geospatial data does not have this property
Satellite imagery, multispectral data, elevation models, LiDAR point clouds: these are raw, high-dimensional signals. They contain enormous amounts of information. But that information is not pre-structured into a shared representation space that you can predict your way into understanding.
One counterargument, most associated with Yann LeCun, is that visual and spatial modalities contain more information per second than language and should therefore be the path to richer learned representations. This is true at the level of raw information content. But information density is not the same thing as having a shared, externalised, machine-parseable encoding. Humans have learned extraordinarily rich representations from vision. We can glance at a landscape and instantly parse terrain type, vegetation health, drainage patterns, signs of erosion. But those representations are locked inside our nervous systems. There is no written-down data format for what a geospatial expert’s visual cortex knows. Language is unique precisely because it is the one modality where the human-learned representation exists as data you can train on.
This distinction matters practically. For language, the representation space is given before training starts. For geospatial, you have to construct it from raw signals during training. And the approaches we have for doing that, while genuinely useful, produce something qualitatively different from what language pre-training produces.
Good representations, but not magical ones
The geospatial AI community has made real progress on self-supervised pre-training. Masked autoencoders (MAE), DINO, BEiT, and contrastive approaches have all been applied to satellite imagery. Geospatial foundation models like Prithvi, Clay, SatMAE, and CROMA learn transferable feature representations from large Earth observation corpora. Fine-tuning on top of these is meaningfully better than training from scratch. These are not trivial achievements.
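The core move in MAE-style pre-training is simple enough to sketch. The outline below is illustrative, not any model's actual implementation: split a tile into patches, hide most of them, encode only the visible ones, and score the reconstruction on the hidden patches alone. The stand-in encoder and decoder here are trivial placeholders for what would really be transformers.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(tile, patch=16):
    """Split a (H, W, C) tile into flat patches of length patch*patch*C."""
    h, w, c = tile.shape
    tile = tile.reshape(h // patch, patch, w // patch, patch, c)
    return tile.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

def mae_step(tile, encoder, decoder, mask_ratio=0.75, patch=16):
    """One MAE-style training signal: reconstruct only the hidden patches."""
    patches = patchify(tile, patch)
    n = len(patches)
    masked = rng.choice(n, size=int(n * mask_ratio), replace=False)
    visible = np.setdiff1d(np.arange(n), masked)
    # Encode only the visible patches, then predict the rest from the latent.
    latent = encoder(patches[visible])
    predicted = decoder(latent, n)
    # Loss is mean squared error on the masked patches alone.
    return ((predicted[masked] - patches[masked]) ** 2).mean()

# Placeholder encoder/decoder; the real ones are learned transformers.
encoder = lambda p: p.mean(axis=0)            # collapse to one latent vector
decoder = lambda z, n: np.tile(z, (n, 1))     # predict that mean everywhere

tile = rng.random((64, 64, 4)).astype(np.float32)  # e.g. a 4-band image chip
loss = mae_step(tile, encoder, decoder)
```

Note what the objective is grounded in: pixel statistics. Driving this loss down yields transferable features, but nothing in it ties the learned space to human concepts or task descriptions.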
But there is a categorical difference between these representations and what language pre-training produces. Geospatial foundation models give you a good feature backbone. You still have to attach a task-specific head, fine-tune for your specific domain, and accept that the learned features live in an arbitrary embedding space. You cannot instruct them. You cannot describe a new task in words and have the model attempt it. The representation space is useful for feature extraction, but it is not simultaneously the feature space, the instruction space, and the evaluation space. That simultaneous triple role is what makes language representations uniquely powerful, and no other modality has it.
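The "attach a task-specific head" pattern looks like this in miniature. The sketch assumes you already have embeddings from some frozen backbone (here faked with synthetic clustered features) and fits a linear head with ridge regression on one-hot targets as a cheap stand-in for logistic regression; every name in it is mine. The structural point is that this supervised step, with its own labels, has to be repeated per task.

```python
import numpy as np

def linear_probe(embeddings, labels, n_classes, l2=1e-3):
    """Fit a linear head on frozen backbone features (ridge regression
    on one-hot targets, a simple stand-in for logistic regression)."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])  # add bias
    Y = np.eye(n_classes)[labels]                               # one-hot
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)
    return lambda e: np.hstack([e, np.ones((len(e), 1))]) @ W

rng = np.random.default_rng(1)
# Pretend these came from a frozen geospatial foundation model:
class_means = rng.normal(size=(3, 32))
labels = rng.integers(0, 3, size=300)
feats = class_means[labels] + 0.1 * rng.normal(size=(300, 32))

head = linear_probe(feats, labels, n_classes=3)
preds = head(feats).argmax(axis=1)
accuracy = (preds == labels).mean()  # high on well-separated features
```

The head's output classes are whatever your label set says they are. There is no way to hand this system a sentence describing a fourth class and have it try; that capability lives only in the instruction space that language models get for free.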
There are efforts to bridge the gap. RemoteCLIP, GeoRSCLIP, and vision-language models like GeoChat attempt to align satellite imagery with text descriptions, which would in principle bring geospatial representations into language space. But the data foundation is thin. The volume of high-quality text-image pairs for remote sensing is orders of magnitude smaller than what general vision-language models are trained on. A caption like “agricultural land in southern France” carries almost no information relative to the pixel-level detail in the corresponding Sentinel-2 tile. The bridge exists, but it is narrow and low-bandwidth.
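The bridge these models build is a contrastive one: embed images and captions into a shared space and pull matched pairs together. A hedged sketch of the symmetric InfoNCE objective used by CLIP-style models (my own implementation, with random vectors standing in for real encoders):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text pairs."""
    # L2-normalise, then take all-pairs cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    # Each image's positive is the text at the same index, and vice versa.
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(2)
txt = rng.normal(size=(8, 64))
aligned = clip_loss(txt + 0.01 * rng.normal(size=(8, 64)), txt)
random_ = clip_loss(rng.normal(size=(8, 64)), txt)
print(aligned < random_)  # True: matched pairs score lower loss
```

The bandwidth limit is visible in the objective itself: the image embedding can only absorb as much structure as the caption distinguishes. If every tile of farmland gets roughly the same caption, the loss is satisfied long before the pixel-level detail is represented.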
No compounding from RL
This has a direct consequence for the reinforcement learning story. In language, RL is transformative partly because of transfer. You build a reward signal for one capability (coding, mathematical reasoning, instruction-following), and because everything lives in the same representational space, improvements generalise. The model gets broadly better, not just narrowly better at the trained task.
In geospatial, every RL loop is essentially bespoke. You train a model to detect erosion features, and that does not make it better at land cover classification, because there is no shared instruction space through which transfer can flow. Each task is its own island. The labs building frontier models will not invest in these task-specific training environments either, because they require domain-specific data, domain-specific evaluation criteria, and domain-specific expertise. The economics only work for high-volume, general-purpose tasks. Peatland erosion mapping is not one of those.
What this means in practice
The state of the art in geospatial AI is fine-tuning specialised models on top of generally useful but non-instructable embeddings. Domain by domain. Task by task. This is not because the field is behind or because people are not trying. It is because the structural conditions that make language AI so general do not exist for geospatial data: the shared externalised representation space, the natural curriculum, and the triple role of language as feature, instruction, and evaluation medium. And they are not going to emerge from scaling up the current approaches.
LLM-based orchestration does not solve this either. If the language model lacks grounded geospatial understanding, it is a poor judge of what to do and when. It can shuffle API calls, but it cannot make informed decisions about a domain it does not deeply know. General-purpose vision-language models have learned from web-scale image data, not from multispectral Earth observation at the resolution and specificity that real geospatial work demands.
This is not a pessimistic conclusion. If anything, it clarifies where value sits. Geospatial AI requires genuine domain expertise: understanding the data, the physics, the task-specific nuances that no foundation model has absorbed. The companies that can build and fine-tune specialised models for specialised problems, working at the intersection of ML capability and domain knowledge, are the ones that will deliver real results. General-purpose AI magic will come to many industries. Geospatial is not one of them yet. And that is exactly what makes it interesting.
