Related papers: ScreenAgent: A Vision Language Model-driven Comput…

ScreenLLM: Stateful Screen Schema for Efficient Action Understanding and Prediction

Graphical User Interface (GUI) agents are autonomous systems that interpret and generate actions, enabling intelligent user assistance and automation. Effective training of these agents presents unique challenges, such as sparsity in…

Computation and Language · Computer Science · 2025-03-28 · Yiqiao Jin, Stefano Petrangeli, Yu Shen, Gang Wu

Training a Vision Language Model as Smartphone Assistant

Addressing the challenge of a digital assistant capable of executing a wide array of user tasks, our research focuses on the realm of instruction-based mobile device control. We leverage recent advancements in large language models (LLMs)…

Machine Learning · Computer Science · 2024-04-16 · Nicolai Dorka, Janusz Marecki, Ammar Anwar

CogAgent: A Visual Language Model for GUI Agents

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails,…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-30 · Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang

Towards General Computer Control with Hierarchical Agents and Multi-Level Action Spaces

Controlling desktop applications via software remains a fundamental yet under-served problem. Existing multi-modal large language models (MLLMs) ingest screenshots and task instructions to generate keystrokes and mouse events, but they…

Artificial Intelligence · Computer Science · 2025-09-24 · Zihan Dong, Xinyu Fan, Zixiang Tang, Yunqing Li

V-Agent: An Interactive Video Search System Using Vision-Language Models

We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-08 · SunYoung Park, Jong-Hyeon Lee, Youngjune Kim, Daegyu Sung, Younghyun Yu, Young-rok Cha, Jeongho Ju

AppAgent v2: Advanced Agent for Flexible Mobile Interactions

With the advancement of Multimodal Large Language Models (MLLM), LLM-driven visual agents are increasingly impacting software interfaces, particularly those with graphical user interfaces. This work introduces a novel LLM-based multimodal…

Human-Computer Interaction · Computer Science · 2025-09-18 · Yanda Li, Chi Zhang, Wenjia Jiang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, Yunchao Wei

VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making

Recent research looks to harness the general knowledge and reasoning of large language models (LLMs) into agents that accomplish user-specified goals in interactive environments. Vision-language models (VLMs) extend LLMs to multi-modal data…

Machine Learning · Computer Science · 2025-05-07 · Jake Grigsby, Yuke Zhu, Michael Ryoo, Juan Carlos Niebles

SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems

Human communication is a complex and diverse process that not only involves multiple factors such as language, commonsense, and cultural backgrounds but also requires the participation of multimodal information, such as speech. Large…

Computation and Language · Computer Science · 2024-01-09 · Dong Zhang, Zhaowei Li, Pengyu Wang, Xin Zhang, Yaqian Zhou, Xipeng Qiu

Human-Centered LLM-Agent User Interface: A Position Paper

Large Language Model (LLM)-in-the-loop applications have been shown to effectively interpret the human user's commands, make plans, and operate external tools/systems accordingly. Still, the operation scope of the LLM agent is limited to…

Human-Computer Interaction · Computer Science · 2024-09-24 · Daniel Chin, Yuxuan Wang, Gus Xia

ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World

The rapid progress of large language models (LLMs) has sparked growing interest in building Artificial General Intelligence (AGI) within Graphical User Interface (GUI) environments. However, existing GUI agents based on LLMs or…

Artificial Intelligence · Computer Science · 2025-05-27 · Runliang Niu, Jinglong Ji, Yi Chang, Qi Wang

GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding

Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding commands. However, current agents…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-25 · Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun

Secure and Efficient Access Control for Computer-Use Agents via Context Space

Large language model (LLM)-based computer-use agents represent a convergence of AI and OS capabilities, enabling natural language to control system- and application-level functions. However, due to LLMs' inherent uncertainty issues,…

Cryptography and Security · Computer Science · 2026-01-15 · Haochen Gong, Chenxiao Li, Rui Chang, Wenbo Shen

DPO Learning with LLMs-Judge Signal for Computer Use Agents

Computer use agents (CUA) are systems that automatically interact with graphical user interfaces (GUIs) to complete tasks. CUA have made significant progress with the advent of large vision-language models (VLMs). However, these agents…

Artificial Intelligence · Computer Science · 2025-06-04 · Man Luo, David Cobbley, Xin Su, Shachar Rosenman, Vasudev Lal, Shao-Yen Tseng, Phillip Howard

Synergistic Simulations: Multi-Agent Problem Solving with Large Language Models

Large Language Models (LLMs) have increasingly demonstrated the ability to facilitate the development of multi-agent systems that allow the interpretation of thoughts and actions generated by each individual. Promising advancements have…

Multiagent Systems · Computer Science · 2024-09-24 · Asher Sprigler, Alexander Drobek, Keagan Weinstock, Wendpanga Tapsoba, Gavin Childress, Andy Dao, Lucas Gral

SheetAgent: Towards A Generalist Agent for Spreadsheet Reasoning and Manipulation via Large Language Models

Spreadsheets are ubiquitous across the World Wide Web, playing a critical role in enhancing work efficiency across various domains. Large language models (LLMs) have recently been tried for automatic spreadsheet manipulation but have not…

Artificial Intelligence · Computer Science · 2025-03-04 · Yibin Chen, Yifu Yuan, Zeyu Zhang, Yan Zheng, Jinyi Liu, Fei Ni, Jianye Hao, Hangyu Mao, Fuzheng Zhang

Enabling Uniform Computer Interaction Experience for Blind Users through Large Language Models

Blind individuals, who by necessity depend on screen readers to interact with computers, face considerable challenges in navigating the diverse and complex graphical user interfaces of different computer applications. The heterogeneity of…

Human-Computer Interaction · Computer Science · 2024-07-31 · Satwik Ram Kodandaram, Utku Uckun, Xiaojun Bi, IV Ramakrishnan, Vikas Ashok

GUICourse: From General Vision Language Models to Versatile GUI Agents

Utilizing Graphical User Interfaces (GUIs) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile…

Artificial Intelligence · Computer Science · 2025-06-02 · Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun

RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

This paper focuses on embodied task planning, where an agent acquires visual observations from the environment and executes atomic actions to accomplish a given task. Although recent Vision-Language Models (VLMs) have achieved impressive…

Robotics · Computer Science · 2026-04-10 · Peiran Xu, Jiaqi Zheng, Yadong Mu

PhysiAgent: An Embodied Agent Framework in Physical World

Vision-Language-Action (VLA) models have achieved notable success but often struggle with limited generalizations. To address this, integrating generalized Vision-Language Models (VLMs) as assistants to VLAs has emerged as a popular…

Robotics · Computer Science · 2025-09-30 · Zhihao Wang, Jianxiong Li, Jinliang Zheng, Wencong Zhang, Dongxiu Liu, Yinan Zheng, Haoyi Niu, Junzhi Yu, Xianyuan Zhan

ModelScope-Agent: Building Your Customizable Agent System with Open-source Large Language Models

Large language models (LLMs) have recently demonstrated remarkable capabilities to comprehend human intentions, engage in reasoning, and design planning-like behavior. To further unleash the power of LLMs to accomplish complex tasks, there…

Computation and Language · Computer Science · 2023-09-06 · Chenliang Li, Hehong Chen, Ming Yan, Weizhou Shen, Haiyang Xu, Zhikai Wu, Zhicheng Zhang, Wenmeng Zhou, Yingda Chen, Chen Cheng, Hongzhu Shi, Ji Zhang, Fei Huang, Jingren Zhou