
Yan

Foundational Interactive Video Generation

Yan Team

"Da Yan's number is fifty, but its use is forty-nine." — I Ching
Yan: Where Evolution Flows, Realities Unfold.

Overview

We propose Yan, a foundational framework for interactive video generation comprising three core modules: Yan-Sim, Yan-Gen, and Yan-Edit. Specifically, Yan-Sim enables high-quality simulation of interactive video environments; Yan-Gen uses text and images as prompts to generate interactive video with strong generalizability; and Yan-Edit supports multi-granularity, real-time editing of interactive video content. We choose games as our testbed, regarding them as an infinite interactive video generation scenario that demands both high real-time performance and superior visual quality, to validate the performance of our framework.

AAA-level Simulation

Yan-Sim achieves high-fidelity simulation of interactive game videos at 1080p resolution and a real-time 60 fps. To enable frame-by-frame interactivity, we adapt a diffusion model into a causal architecture trained with the diffusion-forcing paradigm, conditioning each frame on previously generated frames and per-frame control signals. To run at high resolution in real time, we implement three key optimizations: (1) a higher-compression VAE with 32 × 32 × 2 spatiotemporal downsampling shrinks the latent representation; (2) DDIM sampling reduces inference to 4 steps, enhanced by a shift-window denoising technique that processes frames with varying noise levels concurrently, yielding a clean sample at every denoising step; and (3) model pruning combined with quantization further accelerates inference. Together, these optimizations enable the target performance.
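To make the shift-window denoising concrete, below is a minimal sketch of the inference loop. It assumes the window size equals the number of DDIM steps; CausalDenoiser, its signature, and the latent shape are hypothetical stand-ins, not Yan-Sim's actual interface.

import torch

NUM_STEPS = 4                 # DDIM steps per frame, as described above
LATENT_SHAPE = (16, 34, 60)   # illustrative: channels x ~(1080/32) x (1920/32)

class CausalDenoiser(torch.nn.Module):
    """Stand-in for the causal diffusion backbone (hypothetical)."""
    def forward(self, window, noise_levels, action):
        # A real model attends over past frames and the control signal;
        # here we return the window unchanged so the sketch runs end to end.
        return window

def generate(model, actions, steps=NUM_STEPS):
    """Emit one clean latent per model call after a short warm-up.

    The window holds `steps` frames at staggered noise levels; the
    oldest frame (index 0) has been denoised the most. Each call
    denoises every frame by one more level, pops the now-clean oldest
    frame, and appends a fresh fully-noisy frame driven by the next
    action.
    """
    window = [torch.randn(LATENT_SHAPE) for _ in range(steps)]
    levels = torch.arange(steps - 1, -1, -1)   # per-frame noise levels (illustrative)
    for t, action in enumerate(actions):
        stacked = model(torch.stack(window), levels, action)
        window = list(stacked.unbind(0))
        clean = window.pop(0)                      # fully denoised frame
        window.append(torch.randn(LATENT_SHAPE))   # new frame enters as pure noise
        if t >= steps - 1:                         # skip warm-up outputs
            yield clean

model = CausalDenoiser()
frames = list(generate(model, actions=[None] * 8))   # 5 frames emitted

Because the loop emits one clean frame per model call after warm-up, reaching 60 fps requires each call (plus the VAE decode) to finish within roughly 16.7 ms, which is what the pruning and quantization in (3) target.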

[Figure: Game simulation framework diagram]

Diverse Scenario Simulation

Accurate Mechanism Simulation

Infinite-length Simulation

Multi-Modal Generation

Yan-Gen enables versatile generation of diverse interactive video content from multimodal inputs, specializing in adaptive synthesis across varied scenarios. The model integrates text, visual, and action-based controls to dynamically adapt generated content to specific contexts, supporting everything from closed-domain games to open-world scenarios. Architecturally, Yan-Gen employs a multimodal diffusion transformer (DiT) backbone, processing input tokens such as text prompts, reference visuals, and action sequences through specialized encoders (e.g., umt5-xxl for textual understanding, ViT-H-14 for visual feature extraction). Task-specific constraints are injected through cross-attention layers, allowing precise guidance of the interactive mechanics, visual styles, and narrative elements.
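As an illustration of this conditioning scheme, here is a minimal sketch of a DiT block that injects condition tokens via cross-attention. The dimensions, class name, and token counts are illustrative assumptions, and the umt5-xxl and ViT-H-14 encoders are replaced by pre-computed stand-in tensors.

import torch
import torch.nn as nn

class DiTBlockWithCrossAttn(nn.Module):
    """Illustrative DiT block: self-attention over video tokens, then
    cross-attention into the concatenated condition tokens."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, cond):
        # x:    video latent tokens            (B, N, dim)
        # cond: concatenated condition tokens  (B, M, dim),
        #       e.g. text + reference-image + action embeddings
        h = self.n1(x)
        x = x + self.self_attn(h, h, h)[0]
        h = self.n2(x)
        x = x + self.cross_attn(h, cond, cond)[0]   # condition injection
        return x + self.mlp(self.n3(x))

# Usage: project each modality to the shared width, concatenate along
# the token axis, and feed the result as `cond` to every block.
B, dim = 2, 512
text_tok = torch.randn(B, 77, dim)    # stand-in for projected umt5-xxl output
img_tok = torch.randn(B, 257, dim)    # stand-in for ViT-H-14 patch features
act_tok = torch.randn(B, 16, dim)     # embedded action sequence
cond = torch.cat([text_tok, img_tok, act_tok], dim=1)

block = DiTBlockWithCrossAttn(dim)
out = block(torch.randn(B, 1024, dim), cond)   # (B, 1024, dim)

Concatenating all modalities into one condition sequence lets a single cross-attention layer weigh text, visual, and action guidance jointly, which is one common way to realize the multi-source control described above.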

[Figure: Game generation framework diagram]

Text to Interactive Video

Text-Guided Interactive Video Expansion

Image to Interactive Video

Multi-Modal Cross-Domain Fusion

Multi-Granularity Editing

Yan-Edit enables multi-granularity editing of video content via text-based interaction, encompassing both structure editing (e.g., adding interactive objects) and style editing (e.g., altering an object's color and texture). To achieve flexible and controllable editing, we propose a hybrid model consisting of an interactive mechanics simulator and a visual renderer, which learn structure editing and style editing, respectively. A depth map serves as the intermediate state connecting the two modules. The interactive mechanics simulator is built upon Yan-Sim, leveraging its high-fidelity simulation of interactive videos; we inject structure text prompts into Yan-Sim via text cross-attention layers to achieve structure editing. The visual renderer builds on Yan-Gen for its powerful generation of diverse open-domain visual content: a ControlNet injects the depth map produced by the interactive mechanics simulator into Yan-Gen, and style text prompts then enable versatile style editing.
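The following sketch shows this two-stage editing loop under stated assumptions: MechanicsSimulator, DepthControlledRenderer, and their methods are hypothetical stand-ins for the Yan-Sim-based simulator and the Yan-Gen-based renderer, not Yan-Edit's actual interfaces.

import torch

class MechanicsSimulator:
    """Stand-in: causal simulator conditioned on actions + structure text."""
    def step(self, action, structure_prompt):
        # A real simulator would inject `structure_prompt` via text
        # cross-attention; here we emit a dummy depth latent.
        return torch.rand(1, 1, 1080 // 32, 1920 // 32)

class DepthControlledRenderer:
    """Stand-in: diffusion renderer with a ControlNet-style depth branch."""
    def render(self, depth, style_prompt):
        # A real renderer would add ControlNet residuals computed from
        # `depth` to the backbone's features; here we emit a dummy frame.
        return torch.rand(3, 1080, 1920)

def edit_loop(actions, structure_prompt, style_prompt):
    sim, renderer = MechanicsSimulator(), DepthControlledRenderer()
    for action in actions:
        depth = sim.step(action, structure_prompt)    # structure editing
        yield renderer.render(depth, style_prompt)    # style editing

# Usage: changing `structure_prompt` mid-stream adds or removes objects
# in the simulated geometry; changing `style_prompt` re-skins the same
# geometry without touching the interactive mechanics.
frames = list(edit_loop([None] * 4,
                        structure_prompt="add a wooden crate",
                        style_prompt="watercolor style"))

Routing all appearance information through the depth map is what decouples the two granularities: the simulator never sees style prompts, and the renderer never alters the underlying mechanics.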

[Figure: Editing framework diagram]

Structure Editing

Style Editing

BibTeX

@article{yan,
  title   = {Yan: Foundational Interactive Video Generation},
  author  = {Deheng Ye and Fangyun Zhou and Jiacheng Lv and Jianqi Ma and Jun Zhang and Junyan Lv and Junyou Li and Minwen Deng and Mingyu Yang and Qiang Fu and Wei Yang and Wenkai Lv and Yangbin Yu and Yewen Wang and Yonghang Guan and Zhihao Hu and Zhongbin Fang and Zhongqian Sun},
  journal = {arXiv preprint arXiv:2508.08601},
  year    = {2025}
}