Related papers: Encoding models for scholarly literature
The rapid growth of scholarly literature makes it increasingly difficult for researchers to keep up with new knowledge. Automated tools are now more essential than ever to help navigate and interpret this vast body of information.…
This paper provides an introduction to the Text Encoding Initia-tive (TEI), focused at bringing in newcomers who have to deal with a digital document project and are looking at the capacity that the TEI environment may have to fulfil his…
We present a novel, open-access dataset designed for semantic layout analysis, built to support document recreation workflows through mapping with the Text Encoding Initiative (TEI) standard. This dataset includes 7,254 annotated pages…
This paper aims to provide a comprehensive modeling and representation of etymological data in digital dictionaries. The purpose is to integrate in one coherent framework both digital representations of legacy dictionaries, and also…
The Open Access movement in scientific publishing and search engines like Google Scholar have made scientific articles more broadly accessible. During the last decade, the availability of scientific papers in full text has become more and…
Over the last few years, neural network derived word embeddings became popular in the natural language processing literature. Studies conducted have mostly focused on the quality and application of word embeddings trained on public…
Initially developed and considered for providing authentication and integrity functions, digital signatures are studied nowadays in relation to electronic documents (edocs) so that they can be considered equivalent to handwritten signatures…
In this chapter we present the main issues in representing machine readable dictionaries in XML, and in particular according to the Text Encoding Dictionary (TEI) guidelines.
This paper aims to offer insights into the usability, acceptance and limitations of e-readers with regard to the specific requirements of scholarly text work. To fit into the academic workflow non-linear reading, bookmarking, commenting,…
Scientific document embeddings contain a variety of rich features which can be harnessed for downstream tasks such as recommendation, ranking, and clustering. We explore which tangible insights can be drawn from scientific document…
Codebooks are central to framing research, providing theoretically grounded criteria for analyzing news content. While traditionally codebooks are built from theoretical frameworks and researchers' knowledge, applying these codebooks to…
Currently, XML is a format widely used. In the context of computer science teaching, it is necessary to introduce students to this format and, especially, at its eco-system. We have developed a model to support the teaching of XML. We…
Scientific journals are very important in recording the finding from researchers around the world. The recent media to disseminate scientific journals is PDF. On scheme to find the scientific journals over the internet is via metadata.…
The scientific literature is growing faster than ever. Finding an expert in a particular scientific domain has never been as hard as today because of the increasing amount of publications and because of the ever growing diversity of…
Information extraction (IE) from documents is an intensive area of research with a large set of industrial applications. Current state-of-the-art methods focus on scanned documents with approaches combining computer vision, natural language…
Understanding and extracting of information from large documents, such as business opportunities, academic articles, medical documents and technical reports, poses challenges not present in short documents. Such large documents may be…
Event extraction, the technology that aims to automatically get the structural information from documents, has attracted more and more attention in many fields. Most existing works discuss this issue with the token-level multi-label…
This paper proposes OCR++, an open-source framework designed for a variety of information extraction tasks from scholarly articles including metadata (title, author names, affiliation and e-mail), structure (section headings and body text,…
Automatic summarisation is a popular approach to reduce a document to its main arguments. Recent research in the area has focused on neural approaches to summarisation, which can be very data-hungry. However, few large datasets exist and…
Large text data sets, such as publications, websites, and other text-based media, inherit two distinct types of features: (1) the text itself, its information conveyed through semantics, and (2) its relationship to other texts through…