[ Analogy - What LLMs did for language, we’re doing for 3D ]
Google Translate and ChatGPT use similar algorithms - but Google Translate learned the task of translation and only gained partial language knowledge as a by-product. ChatGPT learned the structure and content of language itself, enabling broad generalisation across language tasks.
Similarly, most 3D LLMs and VLMs gain partial 3D knowledge as a by-product of solving specific tasks. The difference: our model is trained to understand objects, spaces, and their relations - the structure, content, and function of 3D environments - enabling broad generalisation across 3D tasks and industries.
[ Analogy - Hearing is music not understanding music, the same applies to 3D ]
Most people hear music as a seamless flow of sound. A composer, though, hears the individual instruments, rhythms, and structures - and understands what they are and how they fit together.
At Spatial Intelligence, our model learns to "hear" 3D spaces like a composer: breaking down images and videos into objects, learning their forms, functions, positions and relationships.
This transforms 3D spaces from a flat soup of pixels into structured, understandable environments - enabling flexible intelligence that can be applied across real-world tasks.
3D AI that can generate a realistic chair, or LLMs that can describe or caption a scene, show some understanding of shape, position, and function - but they lack a grounded understanding of how tall the chair should be to fit under a table, what it is actually used for, or where it could feasibly be placed in a room.
Below are examples of 3D AI with limited spatial awareness. The AI does not truly understand these 3D spaces - it lacks awareness of:
The objects present in the scene [ appearance, function, etc. ]
Each object's size, shape, and position
The relationships between objects [ see the structured-scene sketch below ]
↑ Depth estimation only sees the 3D world as 2D pixels.
[ This is not true understanding - this intelligence does not generalise well. ]
↓ 3D reconstruction only "understands" the real world [ ↑ ] as points in 3D space.
[ This is not true understanding - this intelligence does not generalise well. ]
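To make the missing awareness above concrete, here is a minimal, purely illustrative sketch of a structured scene representation - objects with functions, sizes, and positions, plus explicit relations between them. The class names, fields, and measurements are our own assumptions for illustration, not the actual schema our model produces.

```python
# Hypothetical sketch: a minimal structured-scene representation covering the
# three kinds of awareness listed above -- the objects present, each object's
# size/shape/position, and the relations between objects.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str                               # e.g. "chair"
    function: str                           # what the object is for, e.g. "seating"
    size_m: tuple[float, float, float]      # width, depth, height in metres
    position_m: tuple[float, float, float]  # x, y, z in the room frame

@dataclass
class Relation:
    subject: str    # e.g. "chair"
    predicate: str  # e.g. "fits_under"
    target: str     # e.g. "table"

@dataclass
class Scene:
    objects: list[SceneObject] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)

# The chair-and-table example from above: the "fits_under" relation is only
# meaningful if the model also knows both objects' sizes and positions.
scene = Scene(
    objects=[
        SceneObject("chair", "seating", (0.45, 0.45, 0.45), (1.2, 0.0, 0.0)),  # 0.45 m seat fits under the table top
        SceneObject("table", "work surface", (1.60, 0.80, 0.75), (1.0, 0.0, 0.0)),
    ],
    relations=[Relation("chair", "fits_under", "table")],
)
```

Depth maps and point clouds carry none of this structure - which is exactly the gap shown in the examples above.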
Automating 3D scenes
[ E.g. architecture visualisation, games design, etc. ]
AI struggles to select and position the right assets due to limited object understanding and a lack of spatial reasoning.
Embodied intelligence
[ E.g. robotics ]
AI struggles in uncertain environments because it lacks a robust understanding of individual objects - their sizes, shapes, positions, properties, and functions.
We pre-train our model on RGB video from everyday sources - phones, drones, dashcams, etc. - without requiring manual labels or bespoke data.
Unlike others who rely on proprietary datasets as a moat, we embrace ubiquity. Our models learn purely from visual observation, allowing them to adapt across styles and domains.
For example, by watching Star Wars, our system can learn the visual grammar of sci-fi environments - materials, lighting, motion dynamics - without needing labels.
The same model can then observe real-world footage from an autonomous vehicle or a handheld phone and immediately generalise to interacting with the world - powering robots that can see, reason, and act across realities, simply by watching.
This is the path to universal integration: no manual tuning, no retraining - just observation.
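To make "learning purely from visual observation" concrete, below is a minimal, hypothetical sketch of one common label-free objective - predicting the next video frame from the previous frames - written in PyTorch. The tiny model, clip shapes, and next-frame loss are illustrative assumptions only; they are not our actual architecture or training objective.

```python
# Illustrative sketch of self-supervised pre-training on raw RGB video:
# the model predicts frame t from frames t-4..t-1, so the footage itself is
# the only supervision -- no manual labels, no bespoke datasets.
import torch
import torch.nn as nn

class TinyVideoEncoder(nn.Module):
    """Toy network: stacks past frames along channels and predicts the next frame."""
    def __init__(self, context_frames: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(context_frames * 3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),  # predicted RGB frame
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, context_frames * 3, H, W)
        return self.net(context)

def pretrain_step(model, optimizer, clip: torch.Tensor) -> float:
    """One label-free step on a clip of shape (batch, frames, 3, H, W)."""
    context, target = clip[:, :-1], clip[:, -1]   # past frames, frame to predict
    b, f, c, h, w = context.shape
    context = context.reshape(b, f * c, h, w)     # stack frames along channels

    pred = model(context)
    loss = nn.functional.mse_loss(pred, target)   # supervision comes from the video itself

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = TinyVideoEncoder(context_frames=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    dummy_clip = torch.rand(2, 5, 3, 64, 64)      # stand-in for phone/drone/dashcam footage
    print(f"loss: {pretrain_step(model, optimizer, dummy_clip):.4f}")
```

The point of the sketch is the data flow, not the architecture: because the loss is computed against the footage itself, any RGB source - a film, a dashcam, a handheld phone - can be used without labelling.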
Initially through partnerships in architecture visualisation, gaming, and robotics simulation - industries where 3D asset workflows and simulation are critical pain points.
Our proof of concept with SpaceForm Technologies already demonstrates early value: automating asset selection and placement within 3D scenes, saving up to 40% of workflow time. In robotics, our model can generate diverse, realistic 3D environments for training, reducing the time and cost of manual setup.
We'll expand through further commercial development partnerships across industries.
By learning about the structure, content, and function of the 3D world - not just generating objects or captioning scenes - we enable broader generalisation across tasks and industries.
Our model's flexibility means one core intelligence can automate scene design, generate simulation environments, and enable robotic reasoning - without retraining from scratch, unlike our competitors' task-specific systems. This defensibility compounds as the model scales.
We expect competition - but acting now gives us first-mover advantage, and our novel approach strengthens our moat over time.
As with any deep-tech innovation, there are both technology and market risks - we are operating at the frontier with many unknowns.
But the opportunity is enormous - and foundational models have historically become the platforms that power entire ecosystems.
We believe 3D will follow the same pattern:
Whoever builds the intelligence layer for 3D will shape the future of both digital creativity and embodied AI.
We’re building that future.