Related papers: Encoding models for scholarly literature

Preface to the Special Issue of the TAL Journal on Scholarly Document Processing

The rapid growth of scholarly literature makes it increasingly difficult for researchers to keep up with new knowledge. Automated tools are now more essential than ever to help navigate and interpret this vast body of information.…

Digital Libraries · Computer Science 2025-06-05 Florian Boudin , Akiko Aizawa

Questions & Answers for TEI Newcomers

This paper provides an introduction to the Text Encoding Initiative (TEI), focused on bringing in newcomers who have to deal with a digital document project and are looking at the capacity that the TEI environment may have to fulfil his…

Digital Libraries · Computer Science 2009-01-26 Laurent Romary

Diachronic Document Dataset for Semantic Layout Analysis

We present a novel, open-access dataset designed for semantic layout analysis, built to support document recreation workflows through mapping with the Text Encoding Initiative (TEI) standard. This dataset includes 7,254 annotated pages…

Computer Vision and Pattern Recognition · Computer Science 2024-11-18 Thibault Clérice , Juliette Janes , Hugo Scheithauer , Sarah Bénière , Florian Cafiero , Laurent Romary , Simon Gabay , Benoît Sagot

Deep encoding of etymological information in TEI

This paper aims to provide a comprehensive modeling and representation of etymological data in digital dictionaries. The purpose is to integrate in one coherent framework both digital representations of legacy dictionaries, and also…

Computation and Language · Computer Science 2016-12-01 Jack Bowers , Laurent Romary

Mining Scientific Papers for Bibliometrics: a (very) Brief Survey of Methods and Tools

The Open Access movement in scientific publishing and search engines like Google Scholar have made scientific articles more broadly accessible. During the last decade, the availability of scientific papers in full text has become more and…

Digital Libraries · Computer Science 2015-05-07 Iana Atanassova , Marc Bertin , Philipp Mayr

Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs TFIDF

Over the last few years, neural network derived word embeddings became popular in the natural language processing literature. Studies conducted have mostly focused on the quality and application of word embeddings trained on public…

Artificial Intelligence · Computer Science 2021-07-13 H. J. Meijer , J. Truong , R. Karimi

A Practical Approach to expressing digitally signed documents

Initially developed and considered for providing authentication and integrity functions, digital signatures are studied nowadays in relation to electronic documents (edocs) so that they can be considered equivalent to handwritten signatures…

Cryptography and Security · Computer Science 2019-10-21 Diana Berbecaru , Marius Marian

Representing human and machine dictionaries in Markup languages

In this chapter we present the main issues in representing machine-readable dictionaries in XML, and in particular according to the Text Encoding Initiative (TEI) guidelines.

Computation and Language · Computer Science 2009-12-16 Lothar Lemnitzer , Laurent Romary , Andreas Witt

Are e-readers suitable tools for scholarly work?

This paper aims to offer insights into the usability, acceptance and limitations of e-readers with regard to the specific requirements of scholarly text work. To fit into the academic workflow non-linear reading, bookmarking, commenting,…

Digital Libraries · Computer Science 2019-01-15 Siegfried Schomisch , Maria Zens , Philipp Mayr

Extracting Information About Publication Venues Using Citation-Informed Transformers

Scientific document embeddings contain a variety of rich features which can be harnessed for downstream tasks such as recommendation, ranking, and clustering. We explore which tangible insights can be drawn from scientific document…

Digital Libraries · Computer Science 2025-06-11 Brian D. Zimmerman , Joshua Folkins , Olga Vechtomova

Revisiting Framing Codebooks with AI: Employing Large Language Models as Analytical Collaborators in Deductive Content Analysis

Codebooks are central to framing research, providing theoretically grounded criteria for analyzing news content. While traditionally codebooks are built from theoretical frameworks and researchers' knowledge, applying these codebooks to…

Human-Computer Interaction · Computer Science 2026-04-22 Diego Gomez-Zara , Hernán Valdivieso , Jorge Pérez , Denis Parra , Sebastián Valenzuela

Une représentation en graphe pour l'enseignement de XML

XML is currently a widely used format. In computer science teaching, students need to be introduced to this format and, especially, to its ecosystem. We have developed a model to support the teaching of XML. We…

Other Computer Science · Computer Science 2013-11-18 Emmanuel Desmontils

PDF articles metadata harvester

Scientific journals are very important for recording the findings of researchers around the world. The most recent medium for disseminating scientific journals is PDF. One scheme for finding scientific journals over the internet is via metadata.…

Digital Libraries · Computer Science 2013-08-01 Leon Andretti Abdillah

New Datasets and a Benchmark of Document Network Embedding Methods for Scientific Expert Finding

The scientific literature is growing faster than ever. Finding an expert in a particular scientific domain has never been as hard as today because of the increasing amount of publications and because of the ever growing diversity of…

Information Retrieval · Computer Science 2020-04-09 Robin Brochier , Antoine Gourru , Adrien Guille , Julien Velcin

Information Extraction from Visually Rich Documents with Font Style Embeddings

Information extraction (IE) from documents is an intensive area of research with a large set of industrial applications. Current state-of-the-art methods focus on scanned documents with approaches combining computer vision, natural language…

Computation and Language · Computer Science 2022-08-16 Ismail Oussaid , William Vanhuffel , Pirashanth Ratnamogan , Mhamed Hajaiej , Alexis Mathey , Thomas Gilles

Unfolding the Structure of a Document using Deep Learning

Understanding and extracting information from large documents, such as business opportunities, academic articles, medical documents and technical reports, poses challenges not present in short documents. Such large documents may be…

Computation and Language · Computer Science 2019-10-10 Muhammad Mahbubur Rahman , Tim Finin

Writing Style Aware Document-level Event Extraction

Event extraction, the technology that aims to automatically get the structural information from documents, has attracted more and more attention in many fields. Most existing works discuss this issue with the token-level multi-label…

Computation and Language · Computer Science 2022-01-11 Zhuo Xu , Yue Wang , Lu Bai , Lixin Cui

OCR++: A Robust Framework For Information Extraction from Scholarly Articles

This paper proposes OCR++, an open-source framework designed for a variety of information extraction tasks from scholarly articles including metadata (title, author names, affiliation and e-mail), structure (section headings and body text,…

Digital Libraries · Computer Science 2016-09-26 Mayank Singh , Barnopriyo Barua , Priyank Palod , Manvi Garg , Sidhartha Satapathy , Samuel Bushi , Kumar Ayush , Krishna Sai Rohith , Tulasi Gamidi , Pawan Goyal , Animesh Mukherjee

A Supervised Approach to Extractive Summarisation of Scientific Papers

Automatic summarisation is a popular approach to reduce a document to its main arguments. Recent research in the area has focused on neural approaches to summarisation, which can be very data-hungry. However, few large datasets exist and…

Computation and Language · Computer Science 2017-06-14 Ed Collins , Isabelle Augenstein , Sebastian Riedel

Mapping the Web of Science, a large-scale graph and text-based dataset with LLM embeddings

Large text data sets, such as publications, websites, and other text-based media, inherit two distinct types of features: (1) the text itself, its information conveyed through semantics, and (2) its relationship to other texts through…

Computation and Language · Computer Science 2026-02-05 Tim Kunt , Annika Buchholz , Imene Khebouri , Thorsten Koch , Ida Litzel , Thi Huong Vu