Related papers: Vision2Code: A Multi-Domain Benchmark for Evaluati…

200 papers

Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To…

While Vision Language Models (VLMs) have shown promise in Design-to-Code generation, they suffer from a "holistic bottleneck," failing to reconcile high-level structural hierarchy with fine-grained visual details, often resulting in layout…

Computer Vision and Pattern Recognition · Computer Science 2026-04-03 Xinhao Huang, Jinke Yu, Wenhao Xu, Zeyi Wen, Ying Zhou, Junzhuo Liu, Junhao Ji, Zulong Chen

Recent advances in vision-language models (VLMs) have expanded their multimodal code generation capabilities, yet their ability to generate executable visualization code from plots, especially for complex 3D, animated, plot-to-plot…

Human-Computer Interaction · Computer Science 2026-01-21 Yi Zhao, Zhen Yang, Shuaiqi Duan, Wenmeng Yu, Zhe Su, Jibing Gong, Jie Tang

Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable…

Software Engineering · Computer Science 2026-04-09 Yuansheng Ni, Songcheng Cai, Xiangchao Chen, Jiarong Liang, Zhiheng Lyu, Jiaqi Deng, Kai Zou, Ping Nie, Fei Yuan, Xiang Yue, Wenhu Chen

The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their capabilities in turning visual figures into executable code have not…

Computation and Language · Computer Science 2024-05-14 Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo

Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding…

We present Omni-I2C, a comprehensive benchmark designed to evaluate the capability of Large Multimodal Models (LMMs) in converting complex, structured digital graphics into executable code. We argue that this task represents a non-trivial…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Jiawei Zhou, Chi Zhang, Xiang Feng, Qiming Zhang, Haibo Qiu, Lihuo He, Dengpan Ye, Xinbo Gao, Jing Zhang

We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based…

Computer Vision and Pattern Recognition · Computer Science 2024-10-31 Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, Michihiro Yasunaga, Yifan Mai, Percy Liang

Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-05 Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang

Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development in which multimodal large language…

Computation and Language · Computer Science 2025-02-11 Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, Diyi Yang

Programming often involves converting detailed and complex specifications into code, a process during which developers typically utilize visual aids to more effectively convey concepts. While recent developments in Large Multimodal Models…

Computation and Language · Computer Science 2024-09-27 Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, Jing Ma

User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 Houston H. Zhang, Tao Zhang, Baoze Lin, Yuanqi Xue, Yincheng Zhu, Huan Liu, Li Gu, Linfeng Ye, Ziqiang Wang, Xinxin Zuo, Yang Wang, Yuanhao Yu, Zhixiang Chi

Recent advances in Vision-Language Models (VLMs) and the scarcity of high-quality multi-modal alignment data have inspired numerous studies on synthetic VLM data generation. The conventional norm in VLM data construction uses a mixture…

Computer Vision and Pattern Recognition · Computer Science 2024-10-01 Jiacong Wang, Bohong Wu, Haiyong Jiang, Xun Zhou, Xin Xiao, Haoyuan Guo, Jun Xiao

UI-to-code aims to translate UI screenshots into executable front-end code. Despite progress with vision-language models (VLMs), most existing methods formulate UI-to-code as a single-pass generation, which mismatches real-world UI…

Computer Vision and Pattern Recognition · Computer Science 2026-05-07 Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiale Cheng, Xiaotao Gu, Jie Tang

Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a…

Computation and Language · Computer Science 2025-08-14 Lingjie Jiang, Shaohan Huang, Xun Wu, Yixia Li, Dongdong Zhang, Furu Wei

Vision-Language Models (VLMs) are increasingly used in document processing pipelines to convert flowchart images into structured code (e.g., Mermaid). In production, these systems process arbitrary inputs for which no ground-truth code…

Computer Vision and Pattern Recognition · Computer Science 2026-02-17 Giang Son Nguyen, Zi Pong Lim, Sarthak Ketanbhai Modi, Yon Shin Teo, Wenya Wang

Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from…

Computer Vision and Pattern Recognition · Computer Science 2026-02-06 Jie Deng, Kaichun Yao, Libo Zhang

Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical…

Software Engineering · Computer Science 2026-04-02 Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu, Xiaotao Gu, Jie Tang

The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work…

Computer Vision and Pattern Recognition · Computer Science 2026-01-27 Sagi Eppel

Automatically generating webpage code from webpage designs can significantly reduce the workload of front-end developers, and recent Multimodal Large Language Models (MLLMs) have shown promising potential in this area. However, our…

Computer Vision and Pattern Recognition · Computer Science 2025-02-25 Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Yi Su, Bohua Chen, Dongping Chen, Siyuan Wu, Xing Zhou, Wenbin Jiang, Hai Jin, Xiangliang Zhang