Infinite-World

Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory

Ruiqi Wu1,2,3*, Xuanhua He4,2*, Meng Cheng2, Tianyu Yang2, Yong Zhang2‡, Chunle Guo1,3†, Chongyi Li1,3, Ming-Ming Cheng1,3

1Nankai University   2Meituan   3NKIARI   4HKUST

*Equal Contribution   †Corresponding Author   ‡Project Leader

Abstract

We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground truth, they lack an effective training paradigm for real-world videos because pose estimates are noisy and viewpoint revisits are scarce.

To bridge this gap, we first introduce a Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into a fixed-budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from being corrupted by noisy trajectories, ensuring robust action-response learning.

Furthermore, guided by insights from a pilot toy study, we employ a Revisit-Dense Finetuning Strategy using a compact, 30-minute dataset to efficiently activate the model’s long-range loop-closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency.

Methodology

Infinite-World Framework

Overview of the Infinite-World architecture.

(a) Hierarchical Pose-free Memory Compressor: The HPMC recursively compresses raw historical latents into a fixed memory budget via hierarchical compression with local and global stages. The compressor is jointly optimized with the DiT backbone to autonomously anchor generations in the distant past with constant computational cost.
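
The fixed-budget idea in (a) can be illustrated with a toy sketch. This is not the authors' compressor: the real HPMC is learned and jointly optimized with the DiT backbone, whereas the sketch below substitutes plain mean pooling so that it stays runnable. The names `mean_pool` and `update_memory`, the `budget` and `local_group` parameters, and the chunked streaming loop are all assumptions introduced for illustration.

```python
from typing import List

Latent = List[float]  # toy stand-in for one frame latent, flattened to a vector


def mean_pool(latents: List[Latent], group: int) -> List[Latent]:
    """Average each run of `group` consecutive latents into a single latent."""
    pooled = []
    for start in range(0, len(latents), group):
        chunk = latents[start:start + group]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled


def update_memory(memory: List[Latent], new_latents: List[Latent],
                  budget: int, local_group: int = 4) -> List[Latent]:
    """Fold a new chunk of latents into memory without exceeding `budget` slots.

    Local stage: pool the incoming chunk (assumed pooling, not a learned module).
    Global stage: if memory plus the pooled chunk overflows the budget,
    pool the whole sequence again so older content becomes progressively coarser.
    """
    merged = memory + mean_pool(new_latents, local_group)
    if len(merged) > budget:
        group = -(-len(merged) // budget)  # ceiling division
        merged = mean_pool(merged, group)
    return merged


# Stream 1000 toy frames in chunks of 40 and track the memory size.
memory: List[Latent] = []
sizes = []
for _ in range(0, 1000, 40):
    chunk = [[1.0, 2.0] for _ in range(40)]
    memory = update_memory(memory, chunk, budget=16)
    sizes.append(len(memory))
```

In this sketch the memory length never grows past `budget` no matter how many frames are streamed, which is the property that keeps the conditioning cost bounded; what a learned compressor chooses to retain in those slots is outside the scope of the example.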

(b) Uncertainty-Aware Action Labeling: Continuous poses are decoupled into translation and rotation primitives. A tri-state logic filters out "Uncertain" motion to ensure robust action-response learning.
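
A minimal sketch of how a tri-state rule like the one in (b) might look, assuming two magnitude thresholds per motion component. The source states only that poses are decoupled into translation and rotation primitives and that "Uncertain" motion is filtered out; the state names (`still`, an active direction, `uncertain`), the threshold values, and the helper names `tri_state`, `label_frame`, and `is_supervised` are hypothetical.

```python
from typing import Tuple

STILL = "still"
UNCERTAIN = "uncertain"


def tri_state(value: float, low: float, high: float, neg: str, pos: str) -> str:
    """Map one signed motion component to one of three states.

    Below `low` the motion is treated as absent, at or above `high` it is a
    confident discrete action, and anything in between is marked uncertain.
    """
    magnitude = abs(value)
    if magnitude < low:
        return STILL
    if magnitude >= high:
        return pos if value > 0 else neg
    return UNCERTAIN


def label_frame(forward: float, yaw_deg: float) -> Tuple[str, str]:
    """Label translation and rotation independently (thresholds are assumed)."""
    translation = tri_state(forward, low=0.02, high=0.10, neg="backward", pos="forward")
    rotation = tri_state(yaw_deg, low=0.5, high=2.0, neg="turn_left", pos="turn_right")
    return translation, rotation


def is_supervised(label: str) -> bool:
    """Uncertain labels would be excluded from the action-conditioning signal."""
    return label != UNCERTAIN
```

Under these assumed thresholds, `label_frame(0.05, -3.0)` yields an uncertain translation and a confident left turn, so only the rotation component would contribute to action-response learning for that frame.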

(c) Data Strategy: Pre-training on open-domain video is followed by finetuning on a revisit-dense dataset to activate 1000-frame memory consistency.

Comparison

Quantitative comparison on VBench (motion smoothness, dynamic degree, aesthetic quality, image quality) and user study (memory, fidelity, action, ELO rating). Best in bold, second best underlined.
                   VBench                                                         User Study
Model              Mot. Smo.↑  Dyn. Deg.↑  Aes. Qual.↑  Img. Qual.↑  Avg. Score↑  Memory↓  Fidelity↓  Action↓  ELO Rating↑
Hunyuan-GameCraft  0.9855      0.9896      0.5380       0.6010       0.7785       2.67     2.49       2.56     1311
Matrix-Game 2.0    0.9788      1.0000      0.5267       0.7215       0.8068       2.98     2.91       1.78     1432
Yume 1.5           0.9861      0.9896      0.5840       0.6969       0.8141       2.43     1.91       2.47     1495
HY-World-1.5       0.9905      1.0000      0.5280       0.6611       0.7949       2.59     2.78       1.50     1542
Infinite-World     0.9876      1.0000      0.5440       0.7159       0.8119       1.92     1.67       1.54     1719
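
The reported Avg. Score values are numerically consistent with an unweighted mean of the four VBench metrics in each row. The page does not state how the average is computed, so the equal weighting below is an inference from the numbers rather than a documented definition; the check simply recomputes the mean from the table values and measures the largest gap.

```python
# Each entry: ([Mot. Smo., Dyn. Deg., Aes. Qual., Img. Qual.], reported Avg. Score)
vbench = {
    "Hunyuan-GameCraft": ([0.9855, 0.9896, 0.5380, 0.6010], 0.7785),
    "Matrix-Game 2.0":   ([0.9788, 1.0000, 0.5267, 0.7215], 0.8068),
    "Yume 1.5":          ([0.9861, 0.9896, 0.5840, 0.6969], 0.8141),
    "HY-World-1.5":      ([0.9905, 1.0000, 0.5280, 0.6611], 0.7949),
    "Infinite-World":    ([0.9876, 1.0000, 0.5440, 0.7159], 0.8119),
}


def mean(values):
    """Unweighted arithmetic mean (assumed aggregation rule)."""
    return sum(values) / len(values)


# Largest absolute difference between the recomputed and reported averages.
max_gap = max(abs(mean(metrics) - reported) for metrics, reported in vbench.values())
print(f"largest gap between recomputed and reported average: {max_gap:.6f}")
```

The gap stays within the four-decimal rounding of the table for every model.
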

BibTeX

@article{wu2026infiniteworld,
  title={Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory},
  author={Wu, Ruiqi and He, Xuanhua and Cheng, Meng and Yang, Tianyu and Zhang, Yong and Kang, Zhuoliang and Cai, Xunliang and Wei, Xiaoming and Guo, Chunle and Li, Chongyi and Cheng, Ming-Ming},
  journal={arXiv preprint arXiv:2602.02393},
  year={2026}
}