Related papers: PostDoc: Generating Poster from a Long Multimodal …

Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure

Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Théo Gigant, Camille Guinaudeau, Frédéric Dufaux

SelfDoc: Self-Supervised Document Representation Learning

We propose SelfDoc, a task-agnostic pre-training framework for document image understanding. Because documents are multimodal and are intended for sequential reading, our framework exploits the positional, textual, and visual information of…

Computer Vision and Pattern Recognition · Computer Science 2021-06-08 Peizhao Li, Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Rajiv Jain, Varun Manjunatha, Hongfu Liu

M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal contents such as texts, figures, and tables, which are…

Computation and Language · Computer Science 2024-11-12 Yew Ken Chia, Liying Cheng, Hou Pong Chan, Chaoqun Liu, Maojia Song, Sharifah Mahani Aljunied, Soujanya Poria, Lidong Bing

Modular Multimodal Machine Learning for Extraction of Theorems and Proofs in Long Scientific Documents (Extended Version)

We address the extraction of mathematical statements and their proofs from scholarly PDF articles as a multimodal classification problem, utilizing text, font features, and bitmap image renderings of PDFs as distinct modalities. We propose…

Artificial Intelligence · Computer Science 2024-10-14 Shrey Mishra, Antoine Gauquier, Pierre Senellart

Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning

Recent approaches in the literature have exploited the multi-modal information in documents (text, layout, image) to serve specific downstream document tasks. However, they are limited by their (i) inability to learn cross-modal…

Computation and Language · Computer Science 2022-01-06 Subhojeet Pramanik, Shashank Mujumdar, Hima Patel

POSTA: A Go-to Framework for Customized Artistic Poster Generation

Poster design is a critical medium for visual communication. Prior work has explored automatic poster design using deep learning techniques, but these approaches lack text accuracy, user customization, and aesthetic appeal, limiting their…

Graphics · Computer Science 2025-03-20 Haoyu Chen, Xiaojie Xu, Wenbo Li, Jingjing Ren, Tian Ye, Songhua Liu, Ying-Cong Chen, Lei Zhu, Xinchao Wang

PosterSum: A Multimodal Benchmark for Scientific Poster Summarization

Generating accurate and concise textual summaries from multimodal documents is challenging, especially when dealing with visually complex content like scientific posters. We introduce PosterSum, a novel benchmark to advance the development…

Computer Vision and Pattern Recognition · Computer Science 2025-02-26 Rohit Saxena, Pasquale Minervini, Frank Keller

Enhancing Presentation Slide Generation by LLMs with a Multi-Staged End-to-End Approach

Generating presentation slides from a long document with multimodal elements such as text and images is an important task. This is time-consuming and needs domain expertise if done manually. Existing approaches for generating a rich…

Computation and Language · Computer Science 2024-06-12 Sambaran Bandyopadhyay, Himanshu Maheshwari, Anandhavelu Natarajan, Apoorv Saxena

PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

Layout generation is the keystone in achieving automated graphic design, requiring arranging the position and size of various multi-modal design elements in a visually pleasing and constraint-following manner. Previous approaches are either…

Computer Vision and Pattern Recognition · Computer Science 2024-11-27 Tao Yang, Yingmin Luo, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen

Presentations are not always linear! GNN meets LLM for Document-to-Presentation Transformation with Attribution

Automatically generating a presentation from the text of a long document is a challenging and useful problem. In contrast to a flat summary, a presentation needs to have a better and non-linear narrative, i.e., the content of a slide can…

Computation and Language · Computer Science 2024-05-24 Himanshu Maheshwari, Sambaran Bandyopadhyay, Aparna Garimella, Anandhavelu Natarajan

Multi-document Summarization via Deep Learning Techniques: A Survey

Multi-document summarization (MDS) is an effective tool for information aggregation that generates an informative and concise summary from a cluster of topic-related documents. Our survey, the first of its kind, systematically overviews the…

Computation and Language · Computer Science 2021-12-10 Congbo Ma, Wei Emma Zhang, Mingyu Guo, Hu Wang, Quan Z. Sheng

Shaping Political Discourse using multi-source News Summarization

Multi-document summarization is the process of automatically generating a concise summary of multiple documents related to the same topic. This summary can help users quickly understand the key information from a large collection of…

Computation and Language · Computer Science 2023-12-20 Charles Rajan, Nishit Asnani, Shreya Singh

LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding

This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents. Visually Rich Document Understanding tasks, such as document image classification and information extraction, have gained…

Computation and Language · Computer Science 2024-03-22 Masato Fujitake

Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation

People get informed of a daily task plan through diverse media involving both texts and images. However, most prior research only focuses on LLM's capability of textual plan generation. The potential of large-scale models in providing…

Computer Vision and Pattern Recognition · Computer Science 2025-06-16 Xiaoxin Lu, Ranran Haoran Zhang, Yusen Zhang, Rui Zhang

PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks

With the rapid advancement of digitalization, various document images are being applied more extensively in production and daily life, and there is an increasingly urgent need for fast and accurate parsing of the content in document images.…

Computer Vision and Pattern Recognition · Computer Science 2025-06-27 Feng Ni, Kui Huang, Yao Lu, Wenyu Lv, Guanzhong Wang, Zeyu Chen, Yi Liu

Multi-Modal Summary Generation using Multi-Objective Optimization

Significant development of communication technology over the past few years has motivated research in multi-modal summarization techniques. A majority of the previous works on multi-modal summarization focus on text and images. In this…

Information Retrieval · Computer Science 2020-05-20 Anubhav Jangra, Sriparna Saha, Adam Jatowt, Mohammad Hasanuzzaman

Understanding Long Documents with Different Position-Aware Attentions

Despite several successes in document understanding, the practical task for long document understanding is largely under-explored due to several challenges in computation and how to efficiently absorb long multimodal input. Most current…

Computation and Language · Computer Science 2022-08-18 Hai Pham, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang

Text2Poster: Laying out Stylized Texts on Retrieved Images

Poster generation is a significant task for a wide range of applications, which is often time-consuming and requires lots of manual editing and artistic experience. In this paper, we propose a novel data-driven framework, called…

Multimedia · Computer Science 2023-01-09 Chuhao Jin, Hongteng Xu, Ruihua Song, Zhiwu Lu

Investigating the Impact of Text Summarization on Topic Modeling

Topic models are used to identify and group similar themes in a set of documents. Recent advancements in deep-learning-based neural topic models have received significant research interest. In this paper, an approach is proposed that further…

Computation and Language · Computer Science 2024-10-15 Trishia Khandelwal

PosterO: Structuring Layout Trees to Enable Language Models in Generalized Content-Aware Layout Generation

In poster design, content-aware layout generation is crucial for automatically arranging visual-textual elements on the given image. With limited training data, existing work focused on image-centric enhancement. However, this neglects the…

Graphics · Computer Science 2025-05-28 HsiaoYuan Hsu, Yuxin Peng