Related papers: ScreenAgent: A Vision Language Model-driven Comput…

200 papers

Graphical User Interface (GUI) agents are autonomous systems that interpret and generate actions, enabling intelligent user assistance and automation. Effective training of these agents presents unique challenges, such as sparsity in…

Computation and Language · Computer Science 2025-03-28 Yiqiao Jin , Stefano Petrangeli , Yu Shen , Gang Wu

Addressing the challenge of a digital assistant capable of executing a wide array of user tasks, our research focuses on the realm of instruction-based mobile device control. We leverage recent advancements in large language models (LLMs)…

Machine Learning · Computer Science 2024-04-16 Nicolai Dorka , Janusz Marecki , Ammar Anwar

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails,…

Computer Vision and Pattern Recognition · Computer Science 2024-12-30 Wenyi Hong , Weihan Wang , Qingsong Lv , Jiazheng Xu , Wenmeng Yu , Junhui Ji , Yan Wang , Zihan Wang , Yuxuan Zhang , Juanzi Li , Bin Xu , Yuxiao Dong , Ming Ding , Jie Tang

Controlling desktop applications via software remains a fundamental yet under-served problem. Existing multi-modal large language models (MLLMs) ingest screenshots and task instructions to generate keystrokes and mouse events, but they…

Artificial Intelligence · Computer Science 2025-09-24 Zihan Dong , Xinyu Fan , Zixiang Tang , Yunqing Li

We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a…

Computer Vision and Pattern Recognition · Computer Science 2026-01-08 SunYoung Park , Jong-Hyeon Lee , Youngjune Kim , Daegyu Sung , Younghyun Yu , Young-rok Cha , Jeongho Ju

With the advancement of Multimodal Large Language Models (MLLM), LLM-driven visual agents are increasingly impacting software interfaces, particularly those with graphical user interfaces. This work introduces a novel LLM-based multimodal…

Human-Computer Interaction · Computer Science 2025-09-18 Yanda Li , Chi Zhang , Wenjia Jiang , Wanqi Yang , Bin Fu , Pei Cheng , Xin Chen , Ling Chen , Yunchao Wei

Recent research looks to harness the general knowledge and reasoning of large language models (LLMs) into agents that accomplish user-specified goals in interactive environments. Vision-language models (VLMs) extend LLMs to multi-modal data…

Machine Learning · Computer Science 2025-05-07 Jake Grigsby , Yuke Zhu , Michael Ryoo , Juan Carlos Niebles

Human communication is a complex and diverse process that not only involves multiple factors such as language, commonsense, and cultural backgrounds but also requires the participation of multimodal information, such as speech. Large…

Computation and Language · Computer Science 2024-01-09 Dong Zhang , Zhaowei Li , Pengyu Wang , Xin Zhang , Yaqian Zhou , Xipeng Qiu

Large Language Model (LLM) -in-the-loop applications have been shown to effectively interpret the human user's commands, make plans, and operate external tools/systems accordingly. Still, the operation scope of the LLM agent is limited to…

Human-Computer Interaction · Computer Science 2024-09-24 Daniel Chin , Yuxuan Wang , Gus Xia

The rapid progress of large language models (LLMs) has sparked growing interest in building Artificial General Intelligence (AGI) within Graphical User Interface (GUI) environments. However, existing GUI agents based on LLMs or…

Artificial Intelligence · Computer Science 2025-05-27 Runliang Niu , Jinglong Ji , Yi Chang , Qi Wang

Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding commands. However, current agents…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Dongping Chen , Yue Huang , Siyuan Wu , Jingyu Tang , Liuyi Chen , Yilin Bai , Zhigang He , Chenlong Wang , Huichi Zhou , Yiqiang Li , Tianshuo Zhou , Yue Yu , Chujie Gao , Qihui Zhang , Yi Gui , Zhen Li , Yao Wan , Pan Zhou , Jianfeng Gao , Lichao Sun

Large language model (LLM)-based computer-use agents represent a convergence of AI and OS capabilities, enabling natural language to control system- and application-level functions. However, due to LLMs' inherent uncertainty issues,…

Cryptography and Security · Computer Science 2026-01-15 Haochen Gong , Chenxiao Li , Rui Chang , Wenbo Shen

Computer use agents (CUA) are systems that automatically interact with graphical user interfaces (GUIs) to complete tasks. CUA have made significant progress with the advent of large vision-language models (VLMs). However, these agents…

Artificial Intelligence · Computer Science 2025-06-04 Man Luo , David Cobbley , Xin Su , Shachar Rosenman , Vasudev Lal , Shao-Yen Tseng , Phillip Howard

Large Language Models (LLMs) have increasingly demonstrated the ability to facilitate the development of multi-agent systems that allow the interpretation of thoughts and actions generated by each individual. Promising advancements have…

Multiagent Systems · Computer Science 2024-09-24 Asher Sprigler , Alexander Drobek , Keagan Weinstock , Wendpanga Tapsoba , Gavin Childress , Andy Dao , Lucas Gral

Spreadsheets are ubiquitous across the World Wide Web, playing a critical role in enhancing work efficiency across various domains. Large language models (LLMs) have recently been applied to automatic spreadsheet manipulation but have not…

Artificial Intelligence · Computer Science 2025-03-04 Yibin Chen , Yifu Yuan , Zeyu Zhang , Yan Zheng , Jinyi Liu , Fei Ni , Jianye Hao , Hangyu Mao , Fuzheng Zhang

Blind individuals, who by necessity depend on screen readers to interact with computers, face considerable challenges in navigating the diverse and complex graphical user interfaces of different computer applications. The heterogeneity of…

Human-Computer Interaction · Computer Science 2024-07-31 Satwik Ram Kodandaram , Utku Uckun , Xiaojun Bi , IV Ramakrishnan , Vikas Ashok

Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile…

Artificial Intelligence · Computer Science 2025-06-02 Wentong Chen , Junbo Cui , Jinyi Hu , Yujia Qin , Junjie Fang , Yue Zhao , Chongyi Wang , Jun Liu , Guirong Chen , Yupeng Huo , Yuan Yao , Yankai Lin , Zhiyuan Liu , Maosong Sun

This paper focuses on embodied task planning, where an agent acquires visual observations from the environment and executes atomic actions to accomplish a given task. Although recent Vision-Language Models (VLMs) have achieved impressive…

Robotics · Computer Science 2026-04-10 Peiran Xu , Jiaqi Zheng , Yadong Mu

Vision-Language-Action (VLA) models have achieved notable success but often struggle with limited generalizations. To address this, integrating generalized Vision-Language Models (VLMs) as assistants to VLAs has emerged as a popular…

Large language models (LLMs) have recently demonstrated remarkable capabilities to comprehend human intentions, engage in reasoning, and design planning-like behavior. To further unleash the power of LLMs to accomplish complex tasks, there…
