3D AI that can generate a realistic chair, or LLMs that can describe or caption one, show some understanding of shape, position, and function - but they lack an understanding of how tall the chair should be to fit under a table, how it is actually used, or where it could feasibly be placed in a room.
Below are examples of 3D AI with limited spatial awareness. AI does not truly understand 3D spaces - it lacks awareness of:
The objects present in the scene [ appearance, function, etc. ]
Each object's size, shape, and position
The relationships between objects
↑ Depth estimation only sees the 3D world as 2D pixels.
[ This is not true understanding - this intelligence does not generalise well. ]
↓ 3D reconstruction only "understands" the real world [ ↑ ] as points in 3D space.
[ This is not true understanding - this intelligence does not generalise well. ]
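The representational gap above can be made concrete with a minimal sketch (hypothetical NumPy example; the array shapes are illustrative, not from any specific model): a depth map is just a 2D grid of per-pixel distances, and a reconstructed point cloud is just a list of unlabeled 3D coordinates - neither carries object identity, size semantics, or relationships.

```python
import numpy as np

# A depth map: one distance value per pixel - a 2D structure with
# no notion of which pixels belong to a "chair" or a "table".
depth_map = np.random.rand(480, 640).astype(np.float32)  # metres, H x W

# A reconstructed point cloud: N unlabeled points in 3D space -
# geometry without object identity, function, or relationships.
point_cloud = np.random.rand(10_000, 3).astype(np.float32)  # x, y, z

# Neither representation can answer "how tall is the chair?" or
# "does it fit under the table?" - there are no objects, only pixels/points.
print(depth_map.shape)    # (480, 640)
print(point_cloud.shape)  # (10000, 3)
```

Any object-level question would require extra structure (segmentation labels, object poses, a scene graph) that these raw outputs simply do not contain.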
Automating 3D scenes
[ E.g. architecture visualisation, game design, etc. ]
AI struggles to select and position the right assets due to limited object understanding and a lack of spatial reasoning.
Embodied intelligence
[ E.g. robotics ]
AI struggles in uncertain environments due to a lack of robust understanding of individual objects - their sizes, shapes, positions, properties, and functions.