Related papers: Generative Visual Code Mobile World Models

200 papers

Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However,…

Computer Vision and Pattern Recognition · Computer Science 2026-02-11 Yuhao Zheng, Li'an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, Kevin Qinghong Lin

World models have shown great utility in improving the task performance of embodied agents. While prior work largely focuses on pixel-space world models, these approaches face practical limitations in GUI settings, where predicting complex…

Artificial Intelligence · Computer Science 2025-12-17 Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Aditya Grover

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk…

Artificial Intelligence · Computer Science 2026-05-25 Weikai Xu, Kun Huang, Yunren Feng, Jiaxing Li, Yuhan Chen, Yuxuan Liu, Zhizheng Jiang, Heng Qu, Pengzhi Gao, Wei Liu, Jian Luan, Xiaolin Hu, Bo An

App agents, which autonomously operate mobile Apps through Graphical User Interfaces (GUIs), have gained significant interest in real-world applications. Yet, they often struggle with long-horizon planning, failing to find the optimal…

Human-Computer Interaction · Computer Science 2025-05-21 Dezhao Luo, Bohan Tang, Kang Li, Georgios Papoudakis, Jifei Song, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao

Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding commands. However, current agents…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun

Large Language Models (LLMs) have shown great ability in generating executable code from natural language, opening the possibility of automatically constructing environments for AI agents. Recent work on Code World Models (CWMs)…

Artificial Intelligence · Computer Science 2026-05-26 Tyrone Serapio, Arjun Prakash, Haoyang Xu, Kevin Wang, Amy Greenwald

The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work…

Computer Vision and Pattern Recognition · Computer Science 2026-01-27 Sagi Eppel

Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has…

World models (WMs) demonstrate strong capabilities in prediction, generation, and planning tasks. Existing WMs primarily focus on unstructured data and cannot leverage the ubiquitous structured data, often represented as graphs, in the…

Machine Learning · Computer Science 2025-07-15 Tao Feng, Yexin Wu, Guanyu Lin, Jiaxuan You

Multimodal large language models (MLLMs) are transforming the capabilities of graphical user interface (GUI) agents, facilitating their transition from controlled simulations to complex, real-world applications across various platforms.…

Artificial Intelligence · Computer Science 2025-06-18 Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su

Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable…

Computer Vision and Pattern Recognition · Computer Science 2025-12-15 Felix O'Mahony, Roberto Cipolla, Ayush Tewari

World models - generative models that simulate environment dynamics conditioned on past observations and actions - are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental…

Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic…

Computer Vision and Pattern Recognition · Computer Science 2026-04-03 Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, Ruihan Yu, Yidan Zhang, Bo Zheng, Yu-Lun Liu, Yung-Yu Chuang, Kaipeng Zhang

Mobile GUI agents have shown strong potential in real-world automation and practical applications. However, most existing agents remain reactive, making decisions mainly from the current screen, which limits their performance on long-horizon…

Artificial Intelligence · Computer Science 2026-01-08 Yilin Cao, Yufeng Zhong, Zhixiong Zeng, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Wenji Mao, Wan Guanglu

Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the…

Computer Vision and Pattern Recognition · Computer Science 2026-05-19 Renjie Gu, Kaichen Zhou, Yan Luo, Mengyu Wang

Trained on internet-scale video data, generative world models are increasingly recognized as powerful world simulators that can generate consistent and plausible dynamics over structure, motion, and physics. This raises a natural question:…

Computer Vision and Pattern Recognition · Computer Science 2025-10-02 Kevin Zhang, Kuangzhi Ge, Xiaowei Chi, Renrui Zhang, Shaojun Shi, Zhen Dong, Sirui Han, Shanghang Zhang

Training robot policies within a learned world model is trending due to the inefficiency of real-world interactions. The established image-based world models and policies have shown prior success, but lack robust geometric information that…

Robotics · Computer Science 2025-09-18 Guanxing Lu, Baoxiong Jia, Puhao Li, Yixin Chen, Ziwei Wang, Yansong Tang, Siyuan Huang

The Graphical User Interface (GUI) is pivotal for human interaction with the digital world, enabling efficient device control and the completion of complex tasks. Recent progress in Large Language Models (LLMs) and Vision Language Models…

Artificial Intelligence · Computer Science 2024-06-14 Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, Kai Yu

Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the…

Computer Vision and Pattern Recognition · Computer Science 2025-07-09 Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh, R. Manmatha, Shabnam Ghadar