Related papers: VSA: Visual-Structural Alignment for UI-to-Code

SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering

Visual Question Answering (VQA) attracts much attention from both industry and academia. As a multi-modality task, it is challenging since it requires not only visual and textual understanding, but also the ability to align cross-modality…

Computer Vision and Pattern Recognition · Computer Science 2022-01-27 Peixi Xiong, Quanzeng You, Pei Yu, Zicheng Liu, Ying Wu

Unpaired Image Translation via Vector Symbolic Architectures

Image-to-image translation has played an important role in enabling synthetic data for computer vision. However, if the source and target domains have a large semantic mismatch, existing techniques often suffer from source content…

Computer Vision and Pattern Recognition · Computer Science 2022-09-07 Justin Theiss, Jay Leverett, Daeil Kim, Aayush Prakash

ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers

In the rapidly evolving fields of natural language processing and computer vision, Visual Word Sense Disambiguation (VWSD) stands as a critical, yet challenging task. The quest for models that can seamlessly integrate and interpret…

Computer Vision and Pattern Recognition · Computer Science 2024-08-13 Aristi Papastavrou, Maria Lymperaiou, Giorgos Stamou

A Pattern Language for Resilient Visual Agents

Integrating multimodal foundation models into enterprise ecosystems presents a fundamental software architecture challenge. Architects must balance competing quality attributes: the high latency and non-determinism of vision language action…

Artificial Intelligence · Computer Science 2026-05-01 Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll

Modular Layout Synthesis (MLS): Front-end Code via Structure Normalization and Constrained Generation

Automated front-end engineering drastically reduces development cycles and minimizes manual coding overhead. While Generative AI has shown promise in translating designs to code, current solutions often produce monolithic scripts, failing…

Information Retrieval · Computer Science 2025-12-23 Chong Liu, Ming Zhang, Fei Li, Hao Zhou, Xiaoshuang Chen, Ye Yuan

Vector Symbolic Architectures as a Computing Framework for Emerging Hardware

This article reviews recent progress in the development of the computing framework vector symbolic architectures (VSA) (also known as hyperdimensional computing). This framework is well suited for implementation in stochastic, emerging…

Hardware Architecture · Computer Science 2023-07-21 Denis Kleyko, Mike Davies, E. Paxon Frady, Pentti Kanerva, Spencer J. Kent, Bruno A. Olshausen, Evgeny Osipov, Jan M. Rabaey, Dmitri A. Rachkovskij, Abbas Rahimi, Friedrich T. Sommer

ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can…

Computer Vision and Pattern Recognition · Computer Science 2025-10-21 Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue

PSD2Code: Automated Front-End Code Generation from Design Files via Multimodal Large Language Models

Design-to-code generation has emerged as a promising approach to bridge the gap between design prototypes and deployable frontend code. However, existing methods often suffer from structural inconsistencies, asset misalignment, and limited…

Software Engineering · Computer Science 2025-11-07 Yongxi Chen, Lei Chen

VLHSA: Vision-Language Hierarchical Semantic Alignment for Jigsaw Puzzle Solving with Eroded Gaps

Jigsaw puzzle solving remains challenging in computer vision, requiring an understanding of both local fragment details and global spatial relationships. While most traditional approaches only focus on visual cues like edge matching and…

Machine Learning · Computer Science 2025-10-01 Zhuoning Xu, Xinyan Liu

VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models

Recent studies on Vision-Language-Action (VLA) models have shifted from the end-to-end action-generation paradigm toward a pipeline involving task planning followed by action generation, demonstrating improved performance on various…

Computer Vision and Pattern Recognition · Computer Science 2025-06-24 Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei, Yiwen Hou, Yuxuan Zhang, Yudi Lin, Zhirui Fang, Zeyu Jiang, Lin Shao

Code Arcades: 3d Visualization of Classes, Dependencies and Software Metrics

Software visualization seeks to represent software artifacts graphically in two or three dimensions, with the goal of enhancing comprehension, analysis, maintenance, and evolution of the source code. In this context, visualizations…

Software Engineering · Computer Science 2025-09-30 Anthony Savidis, Christos Vasilopoulos

VidLA: Video-Language Alignment at Scale

In this paper, we propose VidLA, an approach for video-language alignment at scale. There are two major limitations of previous video-language alignment approaches. First, they do not capture both short-range and long-range temporal…

Computer Vision and Pattern Recognition · Computer Science 2024-03-25 Mamshad Nayeem Rizve, Fan Fei, Jayakrishnan Unnikrishnan, Son Tran, Benjamin Z. Yao, Belinda Zeng, Mubarak Shah, Trishul Chilimbi

A comparison of Vector Symbolic Architectures

Vector Symbolic Architectures combine a high-dimensional vector space with a set of carefully designed operators in order to perform symbolic computations with large numerical vectors. Major goals are the exploitation of their…

Artificial Intelligence · Computer Science 2021-12-17 Kenny Schlegel, Peer Neubert, Peter Protzel

Observing and Controlling Features in Vision-Language-Action Models

Vision-Language-Action Models (VLAs) have shown remarkable progress towards embodied intelligence. While their architecture partially resembles that of Large Language Models (LLMs), VLAs exhibit higher complexity due to their multi-modal…

Robotics · Computer Science 2026-03-06 Hugo Buurmeijer, Carmen Amo Alonso, Aiden Swann, Marco Pavone

See or Say Graphs: Agent-Driven Scalable Graph Structure Understanding with Vision-Language Models

Vision-language models (VLMs) have shown promise in graph structure understanding, but remain limited by input-token constraints, facing scalability bottlenecks and lacking effective mechanisms to coordinate textual and visual modalities.…

Artificial Intelligence · Computer Science 2026-01-12 Shuo Han, Yukun Cao, Zezhong Ding, Zengyi Gao, S Kevin Zhou, Xike Xie

Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization

While Vision-Language-Action (VLA) models show strong promise for generalist robot control, it remains unclear whether -- and under what conditions -- the standard "scale data" recipe translates to robotics, where training data is…

Robotics · Computer Science 2026-02-11 Ye Wang, Sipeng Zheng, Hao Luo, Wanpeng Zhang, Haoqi Yuan, Chaoyi Xu, Haiweng Xu, Yicheng Feng, Mingyang Yu, Zhiyu Kang, Zongqing Lu, Qin Jin

GALA: Multimodal Graph Alignment for Bug Localization in Automated Program Repair

Large Language Model (LLM)-based Automated Program Repair (APR) has shown strong potential on textual benchmarks, yet struggles in multimodal scenarios where bugs are reported with GUI screenshots. Existing methods typically convert images…

Software Engineering · Computer Science 2026-04-10 Zhuoyao Liu, Zhengran Zeng, Shu-Dong Huang, Yang Liu, Shikun Zhang, Wei Ye

TCSA-UDA: Text-Driven Cross-Semantic Alignment for Unsupervised Domain Adaptation in Medical Image Segmentation

Unsupervised domain adaptation for medical image segmentation remains a significant challenge due to substantial domain shifts across imaging modalities, such as CT and MRI. While recent vision-language representation learning methods have…

Computer Vision and Pattern Recognition · Computer Science 2026-04-14 Lalit Maurya, Honghai Liu, Reyer Zwiggelaar

Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System

Vision-Language-Action (VLA) models are a promising paradigm for generalist robotic manipulation by grounding high-level semantic instructions into executable physical actions. However, prevailing approaches typically adopt a monolithic…

Robotics · Computer Science 2026-04-29 Yifei Wei, Linqing Zhong, Yi Liu, Yuxiang Lu, Xindong He, Maoqing Yao, Guanghui Ren

Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment

Vision and Language Pretraining has become the prevalent approach for tackling multimodal downstream tasks. The current trend is to move towards ever larger models and pretraining datasets. This computational headlong rush does not seem…

Computer Vision and Pattern Recognition · Computer Science 2022-10-06 Mustafa Shukor, Guillaume Couairon, Matthieu Cord