Related papers: Coding with Eyes: Visual Feedback Unlocks Reliable…

Code2World: A GUI World Model via Renderable Code Generation

Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However,…

Computer Vision and Pattern Recognition · Computer Science 2026-02-11 Yuhao Zheng, Li'an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, Kevin Qinghong Lin

WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation

User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision-Language Models (VLMs) automate UI-to-Code generation, they generate only…

Software Engineering · Computer Science 2025-11-11 Mingde Xu, Zhen Yang, Wenyi Hong, Lihang Pan, Xinyue Fan, Yan Wang, Xiaotao Gu, Bin Xu, Jie Tang

VDebugger: Harnessing Execution Feedback for Debugging Visual Programs

Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems.…

Computation and Language · Computer Science 2024-10-07 Xueqing Wu, Zongyu Lin, Songyan Zhao, Te-Lin Wu, Pan Lu, Nanyun Peng, Kai-Wei Chang

De-fine: Decomposing and Refining Visual Programs with Auto-Feedback

Visual programming, a modular and generalizable paradigm, integrates different modules and Python operators to solve various vision-language tasks. Unlike end-to-end models that need task-specific data, it advances in performing visual…

Computer Vision and Pattern Recognition · Computer Science 2024-08-06 Minghe Gao, Juncheng Li, Hao Fei, Liang Pang, Wei Ji, Guoming Wang, Zheqi Lv, Wenqiao Zhang, Siliang Tang, Yueting Zhuang

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-05 Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments,…

Computer Vision and Pattern Recognition · Computer Science 2025-05-07 Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar

GUI Knowledge Bench: Revealing the Knowledge Gap of VLMs in GUI Tasks

Vision language models (VLMs) have advanced graphical user interface (GUI) task automation but still lag behind humans. We hypothesize this gap stems from missing core GUI knowledge, which existing training schemes (such as supervised fine…

Artificial Intelligence · Computer Science 2026-02-10 Chenrui Shi, Zedong Yu, Zhi Gao, Ruining Feng, Enqi Liu, Yuwei Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li

WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning

Agent systems powered by large language models (LLMs) have demonstrated impressive performance on repository-level code-generation tasks. However, for tasks such as website codebase generation, which depend heavily on visual effects and…

Computation and Language · Computer Science 2025-09-29 Zimu Lu, Houxing Ren, Yunqiao Yang, Ke Wang, Zhuofan Zong, Junting Pan, Mingjie Zhan, Hongsheng Li

When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback

Programming is a fundamentally interactive process, yet coding assistants are often evaluated using static benchmarks that fail to measure how well models collaborate with users. We introduce an interactive evaluation pipeline to examine…

Human-Computer Interaction · Computer Science 2025-02-26 Jane Pan, Ryan Shar, Jacob Pfau, Ameet Talwalkar, He He, Valerie Chen

Generative Visual Code Mobile World Models

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual…

Machine Learning · Computer Science 2026-05-26 Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source API with text-rich meta-information (e.g., HTML or…

Computer Vision and Pattern Recognition · Computer Science 2024-11-27 Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement

Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Junrong Guo, Shancheng Fang, Yadong Qu, Hongtao Xie

Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction

Autonomous GUI agents based on vision-language models (VLMs) often assume deterministic environment responses, generating actions without verifying whether previous operations succeeded. In real-world settings with network latency,…

Computation and Language · Computer Science 2026-04-08 Yuzhe Zhang, Xianwei Xue, Xingyong Wu, Mengke Chen, Chen Liu, Xinran He, Run Shao, Feiran Liu, Huanmin Xu, Qiutong Pan, Haiwei Wang

CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design

User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision language models (VLMs) with tool invocation suggest these models…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Daeheon Jeong, Seoyeon Byun, Kihoon Son, Dae Hyun Kim, Juho Kim

GUICourse: From General Vision Language Models to Versatile GUI Agents

Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile…

Artificial Intelligence · Computer Science 2025-06-02 Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun

GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents

Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of…

Computer Vision and Pattern Recognition · Computer Science 2025-10-02 Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, Xiaobo Xia

AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Hongxin Li, Xiping Wang, Jingran Su, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang

You Only Look at Screens: Multimodal Chain-of-Action Agents

Autonomous graphical user interface (GUI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. Recent studies have investigated eliciting the capabilities of large language models…

Computation and Language · Computer Science 2024-06-10 Zhuosheng Zhang, Aston Zhang

Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping

Sketches are a natural and accessible medium for UI designers to conceptualize early-stage ideas. However, existing research on UI/UX automation often requires high-fidelity inputs like Figma designs or detailed screenshots, limiting…

Computation and Language · Computer Science 2024-10-22 Ryan Li, Yanzhe Zhang, Diyi Yang

Illuminating LLM Coding Agents: Visual Analytics for Deeper Understanding and Enhancement

Coding agents powered by large language models (LLMs) have gained traction for automating code generation through iterative problem-solving with minimal human involvement. Despite the emergence of various frameworks, e.g., LangChain,…

Machine Learning · Computer Science 2025-08-19 Junpeng Wang, Yuzhong Chen, Menghai Pan, Chin-Chia Michael Yeh, Mahashweta Das