Related papers: Learning UI-to-Code Reverse Generator Using Visual…

Reverse Browser: Vector-Image-to-Code Generator

Automating the conversion of user interface design into code (image-to-code or image-to-UI) is an active area of software engineering research. However, the state-of-the-art solutions do not achieve high fidelity to the original design, as…

Software Engineering · Computer Science 2025-09-09 Zoltan Toth-Czifra

VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation

Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from…

Computer Vision and Pattern Recognition · Computer Science 2026-02-06 Jie Deng, Kaichun Yao, Libo Zhang

Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Ajay Vikram Periasami, Junlin Wang, Bhuwan Dhingra

GIT: A Generative Image-to-text Transformer for Vision and Language

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between…

Computer Vision and Pattern Recognition · Computer Science 2022-12-19 Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang

UI2Code^N: UI-to-Code Generation as Interactive Visual Optimization

UI-to-code aims to translate UI screenshots into executable front-end code. Despite progress with vision-language models (VLMs), most existing methods formulate UI-to-code as a single-pass generation, which mismatches real-world UI…

Computer Vision and Pattern Recognition · Computer Science 2026-05-07 Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiale Cheng, Xiaotao Gu, Jie Tang

Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

Using vision-language models (VLMs) in web development presents a promising strategy to increase efficiency and unblock no-code solutions: by providing a screenshot or a sketch of a UI, a VLM could generate the code to reproduce it, for…

Human-Computer Interaction · Computer Science 2024-03-15 Hugo Laurençon, Léo Tronchon, Victor Sanh

UIPress: Bringing Optical Token Compression to UI-to-Code Generation

UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visual token efficiency critical. Existing compression methods either select tokens at…

Computation and Language · Computer Science 2026-04-13 Dasen Dai, Shuoqi Li, Ronghao Chen, Huacan Wang, Biao Wu, Qizhen Lan

Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

Vision-as-inverse-graphics, the concept of reconstructing images into editable programs, remains challenging for Vision-Language Models (VLMs), which inherently lack fine-grained spatial grounding in one-shot settings. To address this, we…

Computer Vision and Pattern Recognition · Computer Science 2026-04-07 Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Chenyang Wang, Xiuyu Li, Michael J. Black, Trevor Darrell, Angjoo Kanazawa, Haiwen Feng

Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach

Websites are critical in today's digital world, with over 1.11 billion currently active and approximately 252,000 new sites launched daily. Converting website layout design into functional UI code is a time-consuming yet indispensable step…

Software Engineering · Computer Science 2025-04-28 Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, Michael R. Lyu

Vision-Guided Iterative Refinement for Frontend Code Generation

Code generation with large language models often relies on multi-stage human-in-the-loop refinement, which is effective but very costly - particularly in domains such as frontend web development where the solution quality depends on…

Artificial Intelligence · Computer Science 2026-04-08 Hannah Sansford, Derek H. C. Law, Wei Liu, Abhishek Tripathi, Niresh Agarwal, Gerrit J. J. van den Burg

Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

Recent advances in Large Language Model (LLM)-based agents have shown remarkable progress in code generation. However, current agent methods mainly rely on text-output-based feedback (e.g. command-line outputs) for multi-round debugging and…

Software Engineering · Computer Science 2026-04-23 Zhilin Liu, Ye Huang, Ting Xie, Ruizhi Zhang, Wen Li, Lixin Duan

Automatically Generating Codes from Graphical Screenshots Based on Deep Autocoder

During software front-end development, converting a Graphical User Interface (GUI) image to the corresponding front-end code is an inevitable and tedious task. There have been some attempts to automate this work. However, the…

Machine Learning · Computer Science 2020-07-08 Xiaoling Huang, Feng Liao

GiT: Towards Generalist Vision Transformer through Universal Language Interface

This paper proposes a simple yet effective framework, called GiT, simultaneously applicable for various vision tasks only with a vanilla ViT. Motivated by the universality of the Multi-layer Transformer architecture (e.g., GPT) widely used…

Computer Vision and Pattern Recognition · Computer Science 2024-03-15 Haiyang Wang, Hao Tang, Li Jiang, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang

UNIT: Unifying Image and Text Recognition in One Vision Encoder

Currently, vision encoder models like Vision Transformers (ViTs) typically excel at image recognition tasks but cannot simultaneously support text recognition like human visual recognition. To address this limitation, we propose UNIT, a…

Computer Vision and Pattern Recognition · Computer Science 2024-09-09 Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu

SeeAction: Towards Reverse Engineering How-What-Where of HCI Actions from Screencasts for UI Automation

UI automation is a useful technique for UI testing, bug reproduction, and robotic process automation. Recording user actions with an application assists rapid development of UI automation scripts, but existing recording techniques are…

Software Engineering · Computer Science 2025-03-18 Dehai Zhao, Zhenchang Xing, Qinghua Lu, Xiwei Xu, Liming Zhu

Contrastive Code Representation Learning

Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like summarizing code in English, these representations should ideally capture program…

Machine Learning · Computer Science 2022-01-10 Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez, Ion Stoica

pix2code: Generating Code from a Graphical User Interface Screenshot

Transforming a graphical user interface screenshot created by a designer into computer code is a typical task conducted by a developer in order to build customized software, websites, and mobile applications. In this paper, we show that…

Machine Learning · Computer Science 2017-09-20 Tony Beltramelli

VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning

Multimodal code generation has garnered significant interest within the research community. Despite the notable success of recent vision-language models (VLMs) on specialized tasks like chart-to-code generation, their reliance on…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Xuanle Zhao, Deyang Jiang, Zhixiong Zeng, Lei Chen, Haibo Qiu, Jing Huang, Yufeng Zhong, Liming Zheng, Yilin Cao, Lin Ma

Renaissance: Investigating the Pretraining of Vision-Language Encoders

In the past several years there has been an explosion of available models for vision-language (VL) tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models.…

Computer Vision and Pattern Recognition · Computer Science 2026-02-26 Clayton Fields, Casey Kennington

Decoupling Vision and Language: Codebook Anchored Visual Adaptation

Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Jason Wu, Tianchen Zhao, Chang Liu, Jiarui Cai, Zheng Zhang, Zhuowei Li, Aaditya Singh, Xiang Xu, Mani Srivastava, Jonathan Wu