Related papers: Code2World: A GUI World Model via Renderable Code …

Generative Visual Code Mobile World Models

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual…

Machine Learning · Computer Science 2026-05-26 Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin

How Mobile World Model Guides GUI Agents?

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk…

Artificial Intelligence · Computer Science 2026-05-25 Weikai Xu, Kun Huang, Yunren Feng, Jiaxing Li, Yuhan Chen, Yuxuan Liu, Zhizheng Jiang, Heng Qu, Pengzhi Gao, Wei Liu, Jian Luan, Xiaolin Hu, Bo An

Code2Worlds: Empowering Coding LLMs for 4D World Generation

Achieving spatial intelligence requires moving beyond visual plausibility to build world simulators grounded in physical laws. While coding LLMs have advanced static 3D scene generation, extending this paradigm to 4D dynamics remains a…

Computer Vision and Pattern Recognition · Computer Science 2026-02-13 Yi Zhang, Yunshuang Wang, Zeyu Zhang, Hao Tang

Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

Recent advances in Large Language Model (LLM)-based agents have shown remarkable progress in code generation. However, current agent methods mainly rely on text-output-based feedback (e.g. command-line outputs) for multi-round debugging and…

Software Engineering · Computer Science 2026-04-23 Zhilin Liu, Ye Huang, Ting Xie, Ruizhi Zhang, Wen Li, Lixin Duan

Coding Agent Is Good As World Simulator

World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models…

Artificial Intelligence · Computer Science 2026-05-15 Hongyu Wang, Jingquan Wang, Bocheng Zou, Radu Serban, Dan Negrut

MobileDreamer: Generative Sketch World Model for GUI Agent

Mobile GUI agents have shown strong potential in real-world automation and practical applications. However, most existing agents remain reactive, making decisions mainly from the current screen, which limits their performance on long-horizon…

Artificial Intelligence · Computer Science 2026-01-08 Yilin Cao, Yufeng Zhong, Zhixiong Zeng, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Wenji Mao, Wan Guanglu

ViMo: A Generative Visual GUI World Model for App Agents

App agents, which autonomously operate mobile Apps through Graphical User Interfaces (GUIs), have gained significant interest in real-world applications. Yet, they often struggle with long-horizon planning, failing to find the optimal…

Human-Computer Interaction · Computer Science 2025-05-21 Dezhao Luo, Bohan Tang, Kang Li, Georgios Papoudakis, Jifei Song, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source API with text-rich meta-information (e.g., HTML or…

Computer Vision and Pattern Recognition · Computer Science 2024-11-27 Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation

User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision-Language Models (VLMs) automate UI-to-Code generation, they generate only…

Software Engineering · Computer Science 2025-11-11 Mingde Xu, Zhen Yang, Wenyi Hong, Lihang Pan, Xinyue Fan, Yan Wang, Xiaotao Gu, Bin Xu, Jie Tang

UI2Code^N: UI-to-Code Generation as Interactive Visual Optimization

UI-to-code aims to translate UI screenshots into executable front-end code. Despite progress with vision-language models (VLMs), most existing methods formulate UI-to-code as a single-pass generation, which mismatches real-world UI…

Computer Vision and Pattern Recognition · Computer Science 2026-05-07 Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiale Cheng, Xiaotao Gu, Jie Tang

Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering

Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development in which multimodal large language…

Computation and Language · Computer Science 2025-02-11 Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, Diyi Yang

Code2Video: A Code-centric Paradigm for Educational Video Generation

While recent generative models advance pixel-space video synthesis, they remain limited in producing professional educational videos, which demand disciplinary knowledge, precise visual structures, and coherent transitions, limiting their…

Computer Vision and Pattern Recognition · Computer Science 2025-10-02 Yanzhe Chen, Kevin Qinghong Lin, Mike Zheng Shou

GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding

Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding commands. However, current agents…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun

Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Ajay Vikram Periasami, Junlin Wang, Bhuwan Dhingra

AutoDroid-V2: Boosting SLM-based GUI Agents via Code Generation

Large language models (LLMs) have brought exciting new advances to mobile UI agents, a long-standing research field that aims to complete arbitrary natural language tasks through mobile UI interactions. However, existing UI agents usually…

Artificial Intelligence · Computer Science 2025-05-07 Hao Wen, Shizuo Tian, Borislav Pavlov, Wenjie Du, Yixuan Li, Ge Chang, Shanhui Zhao, Jiacheng Liu, Yunxin Liu, Ya-Qin Zhang, Yuanchun Li

Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs

User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 Houston H. Zhang, Tao Zhang, Baoze Lin, Yuanqi Xue, Yincheng Zhu, Huan Liu, Li Gu, Linfeng Ye, Ziqiang Wang, Xinxin Zuo, Yang Wang, Yuanhao Yu, Zhixiang Chi

Breaking the Data Barrier -- Building GUI Agents Through Task Generalization

Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of…

Artificial Intelligence · Computer Science 2025-04-16 Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He

World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

Recent advances in Vision-Language Models (VLMs) and the scarcity of high-quality multi-modal alignment data have inspired extensive research on synthetic VLM data generation. The conventional norm in VLM data construction uses a mixture…

Computer Vision and Pattern Recognition · Computer Science 2024-10-01 Jiacong Wang, Bohong Wu, Haiyong Jiang, Xun Zhou, Xin Xiao, Haoyuan Guo, Jun Xiao

i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models which lack generative abilities. We propose…

Computation and Language · Computer Science 2023-05-23 Ziyi Yang, Mahmoud Khademi, Yichong Xu, Reid Pryzant, Yuwei Fang, Chenguang Zhu, Dongdong Chen, Yao Qian, Mei Gao, Yi-Ling Chen, Robert Gmyr, Naoyuki Kanda, Noel Codella, Bin Xiao, Yu Shi, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang

Coding the Visual World: From Image to Simulation Using Vision Language Models

The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work…

Computer Vision and Pattern Recognition · Computer Science 2026-01-27 Sagi Eppel