Related papers: DOM-LM: Learning Generalizable Representations for…

Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding

The growing prevalence of visually rich documents, such as webpages and scanned/digital-born documents (images, PDFs, etc.), has led to increased interest in automatic document understanding and information extraction across academia and…

Computation and Language · Computer Science 2024-02-29 Hongshen Xu, Lu Chen, Zihan Zhao, Da Ma, Ruisheng Cao, Zichen Zhu, Kai Yu

LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding

This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents. Visually Rich Document Understanding tasks, such as document image classification and information extraction, have gained…

Computation and Language · Computer Science 2024-03-22 Masato Fujitake

"What's important here?": Opportunities and Challenges of Using LLMs in Retrieving Information from Web Interfaces

Large language models (LLMs) that have been trained on a corpus that includes a large amount of code exhibit a remarkable ability to understand HTML code. As web interfaces are primarily constructed using HTML, we design an in-depth study to…

Computation and Language · Computer Science 2023-12-12 Faria Huq, Jeffrey P. Bigham, Nikolas Martelaro

DocLLM: A layout-aware generative language model for multimodal document understanding

Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records, often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a…

Computation and Language · Computer Science 2024-01-03 Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, Xiaomo Liu

DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding

Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-24 Tanveer Hannan, Dimitrios Mallios, Parth Pathak, Faegheh Sardari, Thomas Seidl, Gedas Bertasius, Mohsen Fayyaz, Sunando Sengupta

Understanding HTML with Large Language Models

Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks. Yet, their capabilities for HTML understanding -- i.e., parsing the raw HTML of a webpage, with applications to automation of web-based…

Machine Learning · Computer Science 2023-05-22 Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, Aleksandra Faust

Enhancing Visually-Rich Document Understanding via Layout Structure Modeling

In recent years, the use of multi-modal pre-trained Transformers has led to significant advancements in visually-rich document understanding. However, existing models have mainly focused on features such as text and vision while neglecting…

Computation and Language · Computer Science 2023-08-16 Qiwei Li, Zuchao Li, Xiantao Cai, Bo Du, Hai Zhao

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

Document parsing (DP) transforms unstructured or semi-structured documents into structured, machine-readable representations, enabling downstream applications such as knowledge base construction and retrieval-augmented generation (RAG).…

Multimedia · Computer Science 2026-04-07 Qintong Zhang, Bin Wang, Victor Shea-Jay Huang, Junyuan Zhang, Zhengren Wang, Hao Liang, Conghui He, Wentao Zhang

Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning

Recent approaches in literature have exploited the multi-modal information in documents (text, layout, image) to serve specific downstream document tasks. However, they are limited by their - (i) inability to learn cross-modal…

Computation and Language · Computer Science 2022-01-06 Subhojeet Pramanik, Shashank Mujumdar, Hima Patel

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation, while…

Computation and Language · Computer Science 2020-06-17 Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou

PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network

The number of web pages is growing at an exponential rate, accumulating massive amounts of data on the web. It is one of the key processes to classify webpages in web information mining. Some classical methods are based on manually building…

Computation and Language · Computer Science 2023-05-10 Qiwei Lang, Jingbo Zhou, Haoyi Wang, Shiqi Lyu, Rui Zhang

Doc2Im: document to image conversion through self-attentive embedding

Text classification is a fundamental task in NLP applications. Latest research in this field has largely been divided into two major sub-fields. Learning representations is one sub-field and learning deeper models, both sequential and…

Computation and Language · Computer Science 2018-11-09 Mithun Das Gupta

HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

Retrieval-Augmented Generation (RAG) has been shown to improve knowledge capabilities and alleviate the hallucination problem of LLMs. The Web is a major source of external knowledge used in RAG systems, and many commercial RAG systems have…

Information Retrieval · Computer Science 2025-02-10 Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, Ji-Rong Wen

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding

Recently, leveraging large language models (LLMs) or multimodal large language models (MLLMs) for document understanding has been proven very promising. However, previous works that employ LLMs/MLLMs for document understanding have not…

Computer Vision and Pattern Recognition · Computer Science 2024-04-09 Chuwei Luo, Yufan Shen, Zhaoqing Zhu, Qi Zheng, Zhi Yu, Cong Yao

HTLM: Hyper-Text Pre-Training and Prompting of Language Models

We introduce HTLM, a hyper-text language model trained on a large-scale web crawl. Modeling hyper-text has a number of advantages: (1) it is easily gathered at scale, (2) it provides rich document-level and end-task-adjacent supervision…

Computation and Language · Computer Science 2021-07-16 Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, Luke Zettlemoyer

FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents

Extracting structured data from HTML documents is a long-studied problem with a broad range of applications like augmenting knowledge bases, supporting faceted search, and providing domain-specific experiences for key verticals like…

Computation and Language · Computer Science 2020-10-22 Bill Yuchen Lin, Ying Sheng, Nguyen Vo, Sandeep Tata

Evaluating the Use of LLMs for Automated DOM-Level Resolution of Web Performance Issues

Users demand fast, seamless webpage experiences, yet developers often struggle to meet these expectations within tight constraints. Performance optimization, while critical, is a time-consuming and often manual process. One of the most…

Software Engineering · Computer Science 2026-01-12 Gideon Peters, SayedHassan Khatoonabadi, Emad Shihab

Leveraging Large Language Models for Web Scraping

Large Language Models (LLMs) demonstrate remarkable capabilities in replicating human tasks and boosting productivity. However, their direct application for data extraction presents limitations due to a prioritisation of fluency over…

Computation and Language · Computer Science 2024-06-13 Aman Ahluwalia, Suhrud Wani

Overcoming Vision Language Model Challenges in Diagram Understanding: A Proof-of-Concept with XML-Driven Large Language Models Solutions

Diagrams play a crucial role in visually conveying complex relationships and processes within business documentation. Despite recent advances in Vision-Language Models (VLMs) for various image understanding tasks, accurately identifying and…

Software Engineering · Computer Science 2025-02-10 Shue Shiinoki, Ryo Koshihara, Hayato Motegi, Masumi Morishige

A Simple yet Effective Layout Token in Large Language Models for Document Understanding

Recent methods that integrate spatial layouts with text for document understanding in large language models (LLMs) have shown promising results. A commonly used method is to represent layout information as text tokens and interleave them…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Zhaoqing Zhu, Chuwei Luo, Zirui Shao, Feiyu Gao, Hangdi Xing, Qi Zheng, Ji Zhang