Related papers: Coding with Eyes: Visual Feedback Unlocks Reliable…

200 papers

Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However,…

Computer Vision and Pattern Recognition · Computer Science 2026-02-11 Yuhao Zheng, Li'an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, Kevin Qinghong Lin

User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision-Language Models (VLMs) automate UI-to-Code generation, they generate only…

Software Engineering · Computer Science 2025-11-11 Mingde Xu, Zhen Yang, Wenyi Hong, Lihang Pan, Xinyue Fan, Yan Wang, Xiaotao Gu, Bin Xu, Jie Tang

Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems.…

Computation and Language · Computer Science 2024-10-07 Xueqing Wu, Zongyu Lin, Songyan Zhao, Te-Lin Wu, Pan Lu, Nanyun Peng, Kai-Wei Chang

Visual programming, a modular and generalizable paradigm, integrates different modules and Python operators to solve various vision-language tasks. Unlike end-to-end models that need task-specific data, it advances in performing visual…

Computer Vision and Pattern Recognition · Computer Science 2024-08-06 Minghe Gao, Juncheng Li, Hao Fei, Liang Pang, Wei Ji, Guoming Wang, Zheqi Lv, Wenqiao Zhang, Siliang Tang, Yueting Zhuang

Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-05 Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang

Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments,…

Vision language models (VLMs) have advanced graphical user interface (GUI) task automation but still lag behind humans. We hypothesize this gap stems from missing core GUI knowledge, which existing training schemes (such as supervised fine…

Artificial Intelligence · Computer Science 2026-02-10 Chenrui Shi, Zedong Yu, Zhi Gao, Ruining Feng, Enqi Liu, Yuwei Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li

Agent systems powered by large language models (LLMs) have demonstrated impressive performance on repository-level code-generation tasks. However, for tasks such as website codebase generation, which depend heavily on visual effects and…

Computation and Language · Computer Science 2025-09-29 Zimu Lu, Houxing Ren, Yunqiao Yang, Ke Wang, Zhuofan Zong, Junting Pan, Mingjie Zhan, Hongsheng Li

Programming is a fundamentally interactive process, yet coding assistants are often evaluated using static benchmarks that fail to measure how well models collaborate with users. We introduce an interactive evaluation pipeline to examine…

Human-Computer Interaction · Computer Science 2025-02-26 Jane Pan, Ryan Shar, Jacob Pfau, Ameet Talwalkar, He He, Valerie Chen

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual…

Machine Learning · Computer Science 2026-05-26 Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin

Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source API with text-rich meta-information (e.g., HTML or…

Computer Vision and Pattern Recognition · Computer Science 2024-11-27 Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Junrong Guo, Shancheng Fang, Yadong Qu, Hongtao Xie

Autonomous GUI agents based on vision-language models (VLMs) often assume deterministic environment responses, generating actions without verifying whether previous operations succeeded. In real-world settings with network latency,…

Computation and Language · Computer Science 2026-04-08 Yuzhe Zhang, Xianwei Xue, Xingyong Wu, Mengke Chen, Chen Liu, Xinran He, Run Shao, Feiran Liu, Huanmin Xu, Qiutong Pan, Haiwei Wang

User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision language models (VLMs) with tool invocation suggest these models…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Daeheon Jeong, Seoyeon Byun, Kihoon Son, Dae Hyun Kim, Juho Kim

Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile…

Artificial Intelligence · Computer Science 2025-06-02 Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun

Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of…

Computer Vision and Pattern Recognition · Computer Science 2025-10-02 Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, Xiaobo Xia

Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Hongxin Li, Xiping Wang, Jingran Su, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang

Autonomous graphical user interface (GUI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. Recent studies have investigated eliciting the capabilities of large language models…

Computation and Language · Computer Science 2024-06-10 Zhuosheng Zhang, Aston Zhang

Sketches are a natural and accessible medium for UI designers to conceptualize early-stage ideas. However, existing research on UI/UX automation often requires high-fidelity inputs like Figma designs or detailed screenshots, limiting…

Computation and Language · Computer Science 2024-10-22 Ryan Li, Yanzhe Zhang, Diyi Yang

Coding agents powered by large language models (LLMs) have gained traction for automating code generation through iterative problem-solving with minimal human involvement. Despite the emergence of various frameworks, e.g., LangChain,…

Machine Learning · Computer Science 2025-08-19 Junpeng Wang, Yuzhong Chen, Menghai Pan, Chin-Chia Michael Yeh, Mahashweta Das