II-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting
Jun 1, 2025

Zhimin Liao
Ping Wei*
Ruijie Zhang
Shuaijia Chen
Haoxuan Wang
Ziyang Ren
Abstract
Forecasting the evolution of 3D scenes and generating unseen scenarios through occupancy-based world models offer substantial potential to enhance the safety of autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose $I^{2}$-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike decoder-only GPT-style autoregressive models, $I^{2}$-World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to guide future scene generation. The decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that $I^{2}$-World achieves state-of-the-art performance, surpassing existing approaches by 41.8% in 4D occupancy forecasting with exceptional efficiency: it requires only 2.9 GB of training memory and runs inference in real time at 94.8 FPS.
Publication
International Conference on Computer Vision (ICCV) 2025
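The multi-scale residual quantization mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it works on a 1D feature list instead of a 3D scene, and the average pooling, nearest-neighbour upsampling, scalar codebook, and all function names are assumptions made for illustration. It shows only the general idea: each scale quantizes whatever residual the coarser scales left behind.

```python
# Toy 1D sketch of multi-scale residual quantization.
# All names, the scalar codebook, and the pooling scheme are hypothetical.

def avg_pool(values, n_bins):
    """Average-pool a list down to n_bins equal-width bins."""
    width = len(values) // n_bins
    return [sum(values[i * width:(i + 1) * width]) / width for i in range(n_bins)]

def upsample(values, length):
    """Nearest-neighbour upsample a list back to `length` entries."""
    repeat = length // len(values)
    return [v for v in values for _ in range(repeat)]

def nearest_code(x, codebook):
    """Index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - x))

def encode(feature, scales, codebook):
    """Quantize coarse-to-fine; each scale encodes the remaining residual."""
    residual = list(feature)
    tokens = []
    for n_bins in scales:
        pooled = avg_pool(residual, n_bins)
        ids = [nearest_code(p, codebook) for p in pooled]
        approx = upsample([codebook[k] for k in ids], len(feature))
        residual = [r - a for r, a in zip(residual, approx)]
        tokens.append(ids)
    return tokens, residual

def decode(tokens, codebook, length):
    """Sum the upsampled code values of every scale."""
    recon = [0.0] * length
    for ids in tokens:
        approx = upsample([codebook[k] for k in ids], length)
        recon = [x + a for x, a in zip(recon, approx)]
    return recon

# Usage: an 8-element feature, four scales, and a codebook that contains 0.
feature = [0.9, 1.1, 1.0, 1.2, -0.4, -0.6, 0.1, 0.3]
scales = [1, 2, 4, 8]
codebook = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0]
tokens, residual = encode(feature, scales, codebook)
recon = decode(tokens, codebook, len(feature))
```

Because the toy codebook contains 0, each scale can only shrink or keep the squared residual, so the coarse scales capture the bulk of the signal and the finer scales add detail. A real tokenizer would use learned vector codebooks over 3D feature volumes.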