Related papers: Code2World: A GUI World Model via Renderable Code …

200 papers

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual…

Machine Learning · Computer Science 2026-05-26 Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk…

Artificial Intelligence · Computer Science 2026-05-25 Weikai Xu, Kun Huang, Yunren Feng, Jiaxing Li, Yuhan Chen, Yuxuan Liu, Zhizheng Jiang, Heng Qu, Pengzhi Gao, Wei Liu, Jian Luan, Xiaolin Hu, Bo An

Achieving spatial intelligence requires moving beyond visual plausibility to build world simulators grounded in physical laws. While coding LLMs have advanced static 3D scene generation, extending this paradigm to 4D dynamics remains a…

Computer Vision and Pattern Recognition · Computer Science 2026-02-13 Yi Zhang, Yunshuang Wang, Zeyu Zhang, Hao Tang

Recent advances in Large Language Model (LLM)-based agents have shown remarkable progress in code generation. However, current agent methods mainly rely on text-output-based feedback (e.g. command-line outputs) for multi-round debugging and…

Software Engineering · Computer Science 2026-04-23 Zhilin Liu, Ye Huang, Ting Xie, Ruizhi Zhang, Wen Li, Lixin Duan

World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models…

Artificial Intelligence · Computer Science 2026-05-15 Hongyu Wang, Jingquan Wang, Bocheng Zou, Radu Serban, Dan Negrut

Mobile GUI agents have shown strong potential in real-world automation and practical applications. However, most existing agents remain reactive, making decisions mainly from the current screen, which limits their performance on long-horizon…

Artificial Intelligence · Computer Science 2026-01-08 Yilin Cao, Yufeng Zhong, Zhixiong Zeng, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Wenji Mao, Wan Guanglu

App agents, which autonomously operate mobile Apps through Graphical User Interfaces (GUIs), have gained significant interest in real-world applications. Yet, they often struggle with long-horizon planning, failing to find the optimal…

Human-Computer Interaction · Computer Science 2025-05-21 Dezhao Luo, Bohan Tang, Kang Li, Georgios Papoudakis, Jifei Song, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao

Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source API with text-rich meta-information (e.g., HTML or…

Computer Vision and Pattern Recognition · Computer Science 2024-11-27 Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision-Language Models (VLMs) automate UI-to-Code generation, they generate only…

Software Engineering · Computer Science 2025-11-11 Mingde Xu, Zhen Yang, Wenyi Hong, Lihang Pan, Xinyue Fan, Yan Wang, Xiaotao Gu, Bin Xu, Jie Tang

UI-to-code aims to translate UI screenshots into executable front-end code. Despite progress with vision-language models (VLMs), most existing methods formulate UI-to-code as a single-pass generation, which mismatches real-world UI…

Computer Vision and Pattern Recognition · Computer Science 2026-05-07 Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiale Cheng, Xiaotao Gu, Jie Tang

Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development in which multimodal large language…

Computation and Language · Computer Science 2025-02-11 Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, Diyi Yang

While recent generative models advance pixel-space video synthesis, they remain limited in producing professional educational videos, which demand disciplinary knowledge, precise visual structures, and coherent transitions, limiting their…

Computer Vision and Pattern Recognition · Computer Science 2025-10-02 Yanzhe Chen, Kevin Qinghong Lin, Mike Zheng Shou

Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding commands. However, current agents…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun

Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Ajay Vikram Periasami, Junlin Wang, Bhuwan Dhingra

Large language models (LLMs) have brought exciting new advances to mobile UI agents, a long-standing research field that aims to complete arbitrary natural language tasks through mobile UI interactions. However, existing UI agents usually…

Artificial Intelligence · Computer Science 2025-05-07 Hao Wen, Shizuo Tian, Borislav Pavlov, Wenjie Du, Yixuan Li, Ge Chang, Shanhui Zhao, Jiacheng Liu, Yunxin Liu, Ya-Qin Zhang, Yuanchun Li

User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 Houston H. Zhang, Tao Zhang, Baoze Lin, Yuanqi Xue, Yincheng Zhu, Huan Liu, Li Gu, Linfeng Ye, Ziqiang Wang, Xinxin Zuo, Yang Wang, Yuanhao Yu, Zhixiang Chi

Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of…

Artificial Intelligence · Computer Science 2025-04-16 Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He

Recent advances in Vision-Language Models (VLMs) and the scarcity of high-quality multi-modal alignment data have inspired numerous studies on synthetic VLM data generation. The conventional norm in VLM data construction uses a mixture…

Computer Vision and Pattern Recognition · Computer Science 2024-10-01 Jiacong Wang, Bohong Wu, Haiyong Jiang, Xun Zhou, Xin Xiao, Haoyuan Guo, Jun Xiao

The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models which lack generative abilities. We propose…

The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work…

Computer Vision and Pattern Recognition · Computer Science 2026-01-27 Sagi Eppel