Why AI Metacognition Requires Hierarchical Random-access Data
This blog post was originally published at V-Nova’s website. It is reprinted here with the permission of V-Nova.
Executive Summary (TL;DR)
Today’s AI systems can reason impressively, but they still struggle to know when to look again. Humans use metacognition as a feedback loop between thought and perception: we scan, form a hypothesis, sense uncertainty, and focus our attention where more detail is needed. AI needs a similar loop between its models and the data they rely on.
But most visual and sensor data is still stored in monolithic formats that force systems to retrieve, decode, and process far more information than the task requires. For agentic, visual, and physical AI, this is not just inefficient. It limits the ability to perceive actively.
The next data architecture for AI must therefore be hierarchical, parallel, and compute-aware: a structure that lets models access the gist first, then selectively query only the regions, planes, or levels of detail needed to resolve uncertainty. In that sense, data becomes an API for AI perception.
The Data Architecture for AI Metacognition
If you want to see a “frontier” AI model sweat, give it Humanity’s Last Exam (HLE).
Released by the Center for AI Safety and collaborators, it is part of a new generation of expert-level benchmarks designed to test models at the edge of human knowledge.
These tests are not just difficult because the answers are obscure. They expose something deeper: even highly capable reasoning models can fail to reliably know when they do not know. Their errors are often not tentative. They can arrive with the same confident “tone” and high internal probability as correct answers.
In contrast, you, a human, have metacognition (thinking about thinking). When you see a question about sesamoid bones in hummingbirds, you instantly scan your memory, find a “File Not Found” error, and feel the sensation of ignorance. You stop and redirect your senses to find more information. You likely act like this anytime you see something unexpected.
Now, the problem is not that AI has no uncertainty mechanism. It is that today’s mainstream architectures do not have a robust, embodied, low-latency perception loop comparable to human active sensing.
AI systems can be engineered to estimate uncertainty, retrieve information, or self-check. But these mechanisms are still brittle and expensive, especially when they require repeatedly reloading, decoding, and preprocessing large volumes of sensory data. What is missing is not only better reasoning, but a cheap physical interface between uncertainty and perception: the ability to look again, selectively.
The Wall: Why AI Can’t “Look Closer”
Why can’t the most advanced “reasoning” models do this? There are three structural reasons:
- Prediction is not the same as Self-Knowledge: Models can generate plausible answers without a reliable internal signal that says, “I need more evidence”.
- Reasoning loops are often detached from sensory loops: Even when models can deliberate, the underlying visual or sensor data is often ingested in a “Single-Pass”, as a fixed representation, rather than queried dynamically.
- The Data Layer is Mostly Static: This is the most overlooked reason. Our current data formats (from PDFs to JPEGs to video) are monolithic, which makes repeated, selective perception expensive. They are “take it or leave it”. If an AI wants to verify a tiny detail in a 4K image, it usually has to move and decode the entire file into memory. It cannot “glance” cheaply and then “foveate” on a detail.
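To make the cost of “take it or leave it” concrete, here is a back-of-the-envelope sketch in Python. All numbers are assumptions picked for illustration (a 4K UHD frame, a gist at 1/8 resolution per axis, a 256×256 full-resolution crop), and the function names are hypothetical. It counts raw samples only, not compressed bytes or decode time, so real savings will depend on the codec and the content.

```python
# Illustrative assumptions, not measurements.
FULL_W, FULL_H = 3840, 2160   # one 4K UHD frame
GIST_FACTOR = 8               # gist is 1/8 resolution on each axis (assumed)
ROI_W, ROI_H = 256, 256       # full-resolution crop around the detail (assumed)


def samples_full() -> int:
    """Samples touched when the whole frame must be decoded."""
    return FULL_W * FULL_H


def samples_selective() -> int:
    """Samples touched for a coarse glance plus one full-resolution crop."""
    gist = (FULL_W // GIST_FACTOR) * (FULL_H // GIST_FACTOR)
    roi = ROI_W * ROI_H
    return gist + roi


ratio = samples_selective() / samples_full()
print(f"full frame:  {samples_full():,} samples")
print(f"gist + crop: {samples_selective():,} samples ({ratio:.1%} of the full frame)")
```

Under these assumptions, the glance plus the crop touches roughly 2.4% of the samples in the full frame. A monolithic format cannot offer that option, because the detail is not addressable without decoding everything around it.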
AI Must Be an Active Participant in its Own Data Flow
To achieve the next level of intelligence, Yann LeCun is working on what he calls “Joint-Embedding Predictive Architecture” (JEPA), and Google on what it describes as Agentic Vision. In simple terms, AI needs a feedback loop between its “brain” and its “senses”. But a real-time system with multiple sensors cannot wait 100ms for a monolithic file to be fully retrieved and decoded every time it wants to check a single detail. It needs instantaneous random access to the data from any of its sensors, including the ability to quickly look at just “the gist of it”.
In humans, you can take a quick scan of what’s around you. Then your brain tells your eyes: “That blurry shape in the corner looks like a threat; look closer”. Your eyes then immediately provide a high-resolution “crop” of just that area.
For an AI to do this, we need a compute-aware, hierarchical data architecture. We need data that behaves like an API for perception, allowing the model to interactively query it. Instead of a “Single-Pass” ingestion, the AI must be able to:
- Retrieve the “Gist” (a tiny fraction of the data) to form a hypothesis.
- Assess its own Confidence.
- Query only the specific high-resolution residuals needed to confirm that hypothesis.
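The three steps above can be sketched as a small loop. This is a toy in plain Python, not the API of VC-6 or any real codec: `LayeredSource`, `gist`, `region`, `perceive`, and `occupancy_confidence` are all hypothetical names. The toy also computes the gist from the full-resolution grid, whereas a hierarchical format would store the coarse layer separately. The `samples_served` counter stands in for what such a format would actually have to deliver.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Grid = List[List[float]]


@dataclass
class LayeredSource:
    """Toy stand-in for hierarchical data: serves a coarse gist or a full-res crop."""
    full: Grid
    samples_served: int = 0

    def gist(self, factor: int) -> Grid:
        """Return a block-averaged view, one value per factor x factor block."""
        h, w = len(self.full), len(self.full[0])
        out = [
            [
                sum(self.full[y + dy][x + dx]
                    for dy in range(factor) for dx in range(factor)) / factor ** 2
                for x in range(0, w - factor + 1, factor)
            ]
            for y in range(0, h - factor + 1, factor)
        ]
        self.samples_served += sum(len(row) for row in out)
        return out

    def region(self, x: int, y: int, w: int, h: int) -> Grid:
        """Return a full-resolution crop of the requested rectangle."""
        crop = [row[x:x + w] for row in self.full[y:y + h]]
        self.samples_served += sum(len(row) for row in crop)
        return crop


def occupancy_confidence(value: float) -> float:
    """Assumed confidence proxy: 1.0 for a clean 0 or 1 cell, 0.0 for a 50/50 mix."""
    return abs(value - 0.5) * 2


def perceive(source: LayeredSource, factor: int, threshold: float,
             confidence_of: Callable[[float], float]):
    """Glance at the gist, then fetch detail only where confidence is low."""
    coarse = source.gist(factor)                      # step 1: retrieve the gist
    refined: List[Tuple[Tuple[int, int], Grid]] = []
    for gy, row in enumerate(coarse):
        for gx, value in enumerate(row):
            if confidence_of(value) < threshold:      # step 2: assess confidence
                crop = source.region(gx * factor, gy * factor, factor, factor)
                refined.append(((gx, gy), crop))      # step 3: query detail
    return coarse, refined


# An 8x8 scene that is empty except for one small object.
scene = [[0.0] * 8 for _ in range(8)]
scene[2][5] = 1.0
source = LayeredSource(scene)
coarse, refined = perceive(source, factor=4, threshold=0.9,
                           confidence_of=occupancy_confidence)
print(len(refined), "region(s) refined;", source.samples_served, "of 64 samples served")
```

In this example only the one gist cell that contains the object falls below the threshold, so the loop serves 4 gist values plus one 16-sample crop instead of all 64 samples. The confidence function is deliberately naive; in practice that signal would come from the model itself.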
The Real-Time Crisis
This architecture may be a “nice-to-have” cost/energy saving for a chatbot, but it is a life-or-death requirement for the next wave of AI: Visual AI and Physical AI, especially in real-time use cases, where you cannot “batch” your way out of inefficiency.
When a system such as NVIDIA Cosmos, an autonomous vehicle, or a robot is navigating a complex environment in real time, the bottleneck isn’t just “smarter models”. It is I/O, memory movement, and data preprocessing, all of which happen before the AI’s tensor processing can even start.
If every sensor (RADAR, LiDAR, 4K Video, MRI, CT scans, pressure maps, thermal maps, etc.) requires full retrieval and full decode before the AI model can even decide if the data is relevant, the system fails, or at the very least becomes slower, more power-hungry, and less scalable.
In a previous article, I argued that visual AI suffers from a “Trillion-Dollar Blind Spot”, the waste created when systems move, decode, and preprocess far more data than they need. But there is a deeper implication. This is not only an efficiency problem. It is also a perception problem, influencing accuracy, latency, and ultimately results.
The trillion-dollar blind spot is the value at stake if we avoid that waste, and instead give AI what it needs, when it needs it.
The Solution: Compute-aware Data (a.k.a. Media as an Interface)
The AI needs a feedback loop between its “brain” and its “senses”. If the “senses” (the data format) only offer a monolithic file, the AI cannot perform that second, targeted look.
This is the class of problem we at V-Nova have been working on for years: making visual data hierarchical, parallel, and selectively accessible, so that applications can retrieve only the levels of quality, regions, or planes they need. Standards such as SMPTE VC-6 and MPEG-5 LCEVC are practical examples of this broader shift from media as a file to media as an interface.
Relying on a monolithic file is like trying to be a detective while looking through a frosted window that can only be cleared by breaking the whole pane of glass.
By moving to hierarchical, parallel data structures like VC-6, we allow AI to “wipe a small circle in the frost”. We give it a digital fovea.
A robot navigating a warehouse may first process a low-quality view of the scene, detect ambiguity around a moving object, and then request only the higher-resolution residuals for that region of interest and sensor. The point is not to see everything better all the time. The point is to see the right thing better at the right moment.
Importantly, the data format does not decide what matters. The model, agent, or application still makes that decision. The role of the data architecture is more modest but fundamental: it makes selective perception cheap enough that the model can afford to ask better questions of its own inputs.
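That division of labor can be illustrated with a minimal sketch, assuming the application already holds an uncertainty score and an estimated retrieval cost for each candidate region. The function name `plan_refinements`, the region labels, and the cost units are all hypothetical. The policy shown is a simple greedy ranking, one of many an application could choose.

```python
from typing import Dict, List


def plan_refinements(uncertainty: Dict[str, float],
                     cost: Dict[str, int],
                     budget: int) -> List[str]:
    """Pick the most uncertain regions whose refinement cost still fits the budget.

    The decision lives entirely in the application; the data layer only has to
    make each requested refinement cheap to serve.
    """
    chosen: List[str] = []
    remaining = budget
    # Visit candidates from most to least uncertain.
    for name in sorted(uncertainty, key=lambda n: uncertainty[n], reverse=True):
        if cost[name] <= remaining:
            chosen.append(name)
            remaining -= cost[name]
    return chosen


# Hypothetical scores and costs (arbitrary units) for one decision cycle.
uncertainty = {"lidar/front": 0.2, "camera/left/roi3": 0.9, "thermal/rear": 0.6}
cost = {"lidar/front": 50, "camera/left/roi3": 40, "thermal/rear": 30}
print(plan_refinements(uncertainty, cost, budget=80))
```

With a budget of 80 units, the sketch selects the camera region and then the thermal map, and skips the low-uncertainty LiDAR refinement because it no longer fits. Swapping in a different policy changes nothing about the data layer, which is the point of keeping the two roles separate.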
The Bottom Line: Efficiency is Intelligence
A reasoning model isn’t truly “intelligent” if it wastes most of its energy processing irrelevant data. Metacognition requires a data architecture that respects the AI’s limited compute budget. By treating data as a queryable interface, we are not just saving power. We are giving AI the ability to focus. And in the next wave of AI, focus may be the ultimate frontier.
Guido Meardi, CEO
V-Nova

