Related papers: Code2Video: A Code-centric Paradigm for Educationa…

TeachMaster: Generative Teaching via Code

The scalability of high-quality online education is hindered by the high costs and slow cycles of manual content creation. Despite advancements in video generation, current approaches often fail to ensure pedagogical structure and precise…

Computers and Society · Computer Science · 2026-04-28 · Yuheng Wang, Runde Yang, Lin Wu, Jie Zhang, Jingru Fan, Tianle Zhou, Ruoyu Fu, Huatao Li, Ruijie Shi, Siheng Chen, Weinan E, Chen Qian

When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

Existing codecs are designed to eliminate intrinsic redundancies to create a compact representation for compression. However, strong external priors from Multimodal Large Language Models (MLLMs) have not been explicitly explored in video…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-17 · Pingping Zhang, Jinlong Li, Kecheng Chen, Meng Wang, Long Xu, Haoliang Li, Nicu Sebe, Sam Kwong, Shiqi Wang

Motion Control for Enhanced Complex Action Video Generation

Existing text-to-video (T2V) models often struggle with generating videos with sufficiently pronounced or complex actions. A key limitation lies in the text prompt's inability to precisely convey intricate motion details. To address this,…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-14 · Qiang Zhou, Shaofeng Zhang, Nianzu Yang, Ye Qian, Hao Li

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as…

Artificial Intelligence · Computer Science · 2026-02-13 · Lingyong Yan, Jiulong Wu, Dong Xie, Weixian Shi, Deguo Xia, Jizhou Huang

Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-13 · Ajay Vikram Periasami, Junlin Wang, Bhuwan Dhingra

VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning

Recent text-to-video (T2V) generation methods have seen significant advancements. However, the majority of these works focus on producing short video clips of a single event (i.e., single-scene videos). Meanwhile, recent large language…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-16 · Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal

Code2Worlds: Empowering Coding LLMs for 4D World Generation

Achieving spatial intelligence requires moving beyond visual plausibility to build world simulators grounded in physical laws. While coding LLMs have advanced static 3D scene generation, extending this paradigm to 4D dynamics remains a…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-13 · Yi Zhang, Yunshuang Wang, Zeyu Zhang, Hao Tang

Paper2Video: Automatic Video Generation from Scientific Papers

Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minutes video.…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-10 · Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou

See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation

Large language models can generate executable code for educational animations, but the resulting renders often exhibit visual defects, including element overlap, misalignment, and broken animation continuity. These defects cannot be…

Artificial Intelligence · Computer Science · 2026-05-18 · Yuejia Li, Ke He, Junheng Li, Shutong Chen, Jingkang Xia, Zhiyue Su, Junchi Zhang, Mang Ye

Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation

Text-to-video generation has been dominated by diffusion-based or autoregressive models. These novel models provide plausible versatility, but are criticized for improper physical motion, shading and illumination, camera motion, and…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-06 · Liu He, Yizhi Song, Hejun Huang, Pinxin Liu, Yunlong Tang, Daniel Aliaga, Xin Zhou

Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface

Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for…

Robotics · Computer Science · 2026-03-24 · Yujie Zhao, Hongwei Fan, Di Chen, Shengcong Chen, Liliang Chen, Xiaoqi Li, Guanghui Ren, Hao Dong

World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

Recent advances in Vision-Language Models (VLMs) and the scarcity of high-quality multi-modal alignment data have inspired numerous researches on synthetic VLM data generation. The conventional norm in VLM data construction uses a mixture…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-01 · Jiacong Wang, Bohong Wu, Haiyong Jiang, Xun Zhou, Xin Xiao, Haoyuan Guo, Jun Xiao

Text-Animator: Controllable Visual Text Video Generation

Video generation is a challenging yet pivotal task in various industries, such as gaming, e-commerce, and advertising. One significant unresolved aspect within T2V is the effective visualization of text within generated videos. Despite the…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-26 · Lin Liu, Quande Liu, Shengju Qian, Yuan Zhou, Wengang Zhou, Houqiang Li, Lingxi Xie, Qi Tian

Code2World: A GUI World Model via Renderable Code Generation

Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However,…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-11 · Yuhao Zheng, Li'an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, Kevin Qinghong Lin

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Text-to-video generation models have shown significant progress in the recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects,…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-06 · Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu

Towards A Better Metric for Text-to-Video Generation

Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos.…

Computer Vision and Pattern Recognition · Computer Science · 2024-01-17 · Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou

PhyEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education

Generative AI models, particularly Text-to-Video (T2V) systems, offer a promising avenue for transforming science education by automating the creation of engaging and intuitive visual explanations. In this work, we take a first step toward…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-06 · Megha Mariam K. M, Aditya Arun, Zakaria Laskar, C. V. Jawahar

Video Coding for Machines: A Paradigm of Collaborative Compression and Intelligent Analytics

Video coding, which targets to compress and reconstruct the whole frame, and feature compression, which only preserves and transmits the most critical information, stand at two ends of the scale. That is, one is with compactness and…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-19 · Ling-Yu Duan, Jiaying Liu, Wenhan Yang, Tiejun Huang, Wen Gao

CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation

Large language models (LLMs) have demonstrated strong capabilities in code generation, underscoring the critical need for rigorous and comprehensive evaluation. Existing evaluation approaches fall into three categories, including…

Software Engineering · Computer Science · 2025-10-21 · Xinchen Wang, Pengfei Gao, Chao Peng, Ruida Hu, Cuiyun Gao

PedaCo-Gen: Scaffolding Pedagogical Agency in Human-AI Collaborative Video Authoring

While advancements in Text-to-Video (T2V) generative AI offer a promising path toward democratizing content creation, current models are often optimized for visual fidelity rather than instructional efficacy. This study introduces…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-30 · Injun Baek, Yearim Kim, Nojun Kwak