Yan
Foundational Interactive Video Generation
Yan Team
Yan: Where Evolution Flows, Realities Unfold.
Overview
We propose Yan, a foundational framework for interactive video generation that comprises three core modules: Yan-Sim, Yan-Gen, and Yan-Edit. Specifically, Yan-Sim enables high-quality simulation of interactive video environments; Yan-Gen uses text and images as prompts to generate interactive video with strong generalizability; and Yan-Edit supports multi-granularity, real-time editing of interactive video content. We choose games as our testbed, regarding them as an infinite interactive video generation scenario that demands both high real-time performance and superior visual quality, to validate the performance of our framework.
AAA-level Simulation
Yan-Sim achieves high-fidelity simulation of interactive game videos at 1080p resolution and a real-time 60 fps. To enable frame-by-frame interactivity, we adapt a diffusion model into a causal architecture trained with the diffusion forcing paradigm, which conditions each frame on previously generated frames and per-frame control signals. To run at high resolution in real time, we implement three key optimizations: (1) a higher-compression VAE with a 32 × 32 × 2 spatiotemporal downsampling ratio reduces the size of the latent representation; (2) DDIM sampling reduces the number of inference steps to 4, enhanced by a shift-window denoising inference technique that processes frames at varying noise levels concurrently, yielding a clean sample at every denoising step; and (3) model pruning combined with quantization further accelerates inference. Together, these optimizations achieve the target performance.
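To make the shift-window idea concrete, the following is a minimal PyTorch sketch of the inference loop: a window of four frame latents is kept at staggered noise levels, each pass denoises the whole window by one level, the least-noisy frame pops out clean, and a fresh noise frame enters at the tail. The ToyCausalDenoiser, the toy noise schedule, and the tensor shapes are illustrative assumptions, not the released Yan-Sim implementation.

```python
import torch
import torch.nn as nn

# Four noise levels -> a 4-frame window, matching the 4 DDIM steps in the text (toy values).
SIGMAS = [0.2, 0.4, 0.6, 0.8]

class ToyCausalDenoiser(nn.Module):
    """Stand-in denoiser: refines one frame latent given a per-frame action
    embedding. The causal conditioning on previously generated frames used by
    Yan-Sim is omitted here for brevity."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, frame_latent, action):
        return self.net(torch.cat([frame_latent, action], dim=-1))

@torch.no_grad()
def shift_window_rollout(model, actions, dim=64):
    """One pass over the window advances every frame by one noise level: the
    head (least noisy) frame becomes clean and is emitted, and a fresh
    pure-noise frame enters at the tail."""
    window = [torch.randn(dim) * s for s in SIGMAS]   # frames at staggered noise levels
    outputs = []
    for action in actions:
        window = [model(x, action) for x in window]   # denoise the whole window concurrently
        outputs.append(window.pop(0))                 # head of the window is now clean
        window.append(torch.randn(dim))               # new fully-noisy frame joins the tail
    return torch.stack(outputs)

model = ToyCausalDenoiser()
actions = torch.randn(16, 64)                         # 16 per-frame control signals
video_latents = shift_window_rollout(model, actions)
print(video_latents.shape)                            # torch.Size([16, 64])
```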

Diverse Scenarios Simulation
Accurate Mechanism Simulation
Infinite-length Simulation
Multi-Modal Generation
Yan-Gen enables versatile generation of diverse interactive video content from multimodal inputs, specializing in adaptive synthesis across varied scenarios. The model integrates text, visual, and action-based controls to dynamically adapt generated content to specific contexts, supporting everything from closed-domain games to open-world scenarios. Architecturally, Yan-Gen employs a multimodal diffusion transformer (DiT) backbone that processes input tokens (text prompts, reference visuals, and action sequences) through specialized encoders (e.g., umt5-xxl for textual understanding and ViT-H-14 for visual feature extraction). Task-specific constraints are injected through cross-attention layers, allowing precise guidance of the interactive mechanics, visual styles, and narrative elements.
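The minimal PyTorch sketch below illustrates how such cross-attention conditioning can be wired into a DiT-style block: video tokens first attend to themselves, then cross-attend to the concatenated text, visual, and action condition tokens. The block structure, dimensions, and the random stand-ins for umt5-xxl / ViT-H-14 / action embeddings are illustrative assumptions rather than Yan-Gen's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttnDiTBlock(nn.Module):
    """Self-attention over video tokens, then cross-attention to the
    concatenated text / visual / action condition tokens, then an MLP."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, cond_tokens):
        x = video_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, cond_tokens, cond_tokens, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))

# Toy stand-ins for encoder outputs (batch of 2):
text_tokens   = torch.randn(2, 32, 256)   # e.g., projected umt5-xxl embeddings
image_tokens  = torch.randn(2, 16, 256)   # e.g., projected ViT-H-14 features
action_tokens = torch.randn(2, 8, 256)    # per-frame control embeddings
video_tokens  = torch.randn(2, 128, 256)  # noisy latent video tokens

block = CrossAttnDiTBlock()
cond = torch.cat([text_tokens, image_tokens, action_tokens], dim=1)
out = block(video_tokens, cond)
print(out.shape)  # torch.Size([2, 128, 256])
```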

Text to Interactive Video
Text-Guided Interactive Video Expansion
Image to Interactive Video
Multi-Modal Cross-Domain Fusion
Multi-Granularity Editing
Yan-Edit enables multi-granularity video content editing via text-based interaction, encompassing both structure editing (e.g., adding interactive objects) and style editing (e.g., altering an object's color and texture). To achieve flexible and controllable editing, we propose a hybrid model consisting of an interactive mechanics simulator and a visual renderer, which learn structure editing and style editing, respectively. We use the depth map as an intermediate state connecting the two modules. The interactive mechanics simulator is built upon Yan-Sim, leveraging its high-fidelity simulation of interactive videos; we integrate structure text prompts into Yan-Sim via text cross-attention layers to achieve structure editing. The visual renderer takes advantage of Yan-Gen's powerful generation capabilities on diverse open-domain visual content; it employs a ControlNet to inject the depth map produced by the interactive mechanics simulator into Yan-Gen and then uses style text prompts to enable versatile style editing.
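The sketch below illustrates the two-stage flow under simplified, assumed interfaces: a simulator stub turns the structure prompt (plus state and action) into a depth map, a ControlNet-style adapter encodes that depth into residual features, and a renderer stub combines them with the style prompt. All class names and shapes are hypothetical stand-ins for Yan-Sim and Yan-Gen, not the released models.

```python
import torch
import torch.nn as nn

class MechanicsSimulator(nn.Module):
    """Stand-in for the Yan-Sim-based simulator: maps the current latent state,
    an action, and a structure-prompt embedding to a single-channel depth map."""
    def __init__(self, dim=64, depth_hw=32):
        super().__init__()
        self.head = nn.Linear(dim * 3, depth_hw * depth_hw)
        self.depth_hw = depth_hw

    def forward(self, state, action, structure_prompt):
        x = torch.cat([state, action, structure_prompt], dim=-1)
        return self.head(x).view(-1, 1, self.depth_hw, self.depth_hw)

class DepthControlAdapter(nn.Module):
    """ControlNet-style branch: encodes the depth map into residual features
    that are added to the renderer's hidden states."""
    def __init__(self, channels=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, depth):
        return self.encoder(depth)

class VisualRenderer(nn.Module):
    """Stand-in for the Yan-Gen-based renderer: fuses a noisy latent frame,
    the depth residual, and a style-prompt embedding into an RGB frame."""
    def __init__(self, channels=32):
        super().__init__()
        self.backbone = nn.Conv2d(1, channels, 3, padding=1)
        self.style_proj = nn.Linear(64, channels)
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, noisy_latent, depth_residual, style_prompt):
        h = self.backbone(noisy_latent) + depth_residual
        h = h + self.style_proj(style_prompt)[:, :, None, None]
        return self.to_rgb(torch.relu(h))

sim, adapter, renderer = MechanicsSimulator(), DepthControlAdapter(), VisualRenderer()
state, action = torch.randn(1, 64), torch.randn(1, 64)
structure_prompt, style_prompt = torch.randn(1, 64), torch.randn(1, 64)

depth = sim(state, action, structure_prompt)                 # structure editing enters here
noisy_latent = torch.randn(1, 1, 32, 32)
frame = renderer(noisy_latent, adapter(depth), style_prompt) # style editing enters here
print(depth.shape, frame.shape)  # torch.Size([1, 1, 32, 32]) torch.Size([1, 3, 32, 32])
```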

Structure Editing
Style Editing
BibTeX
@article{yan,
  title   = {Yan: Foundational Interactive Video Generation},
  author  = {Deheng Ye and Fangyun Zhou and Jiacheng Lv and Jianqi Ma and Jun Zhang and Junyan Lv and Junyou Li and Minwen Deng and Mingyu Yang and Qiang Fu and Wei Yang and Wenkai Lv and Yangbin Yu and Yewen Wang and Yonghang Guan and Zhihao Hu and Zhongbin Fang and Zhongqian Sun},
  journal = {arXiv preprint arXiv:2508.08601},
  year    = {2025}
}