Related papers: Outilex, plate-forme logicielle de traitement de t…
This paper presents a comprehensive overview of the data preparation pipeline developed for the OpenGPT-X project, a large-scale initiative aimed at creating open and high-performance multilingual large language models (LLMs). The project…
XML is a standard and universal language for representing information. XML processing is supported by two key frameworks: DOM and SAX. SAX is efficient, but leaves the developer to encode much of the processing. This paper introduces a…
We present the sTeX+ system, a user-driven advancement of sTeX - a semantic extension of LaTeX that allows for producing high-quality PDF documents for (proof)reading and printing, as well as semantic XML/OMDoc documents for the Web or…
Litrepl is a lightweight text processing tool designed to recognize and evaluate code sections within Markdown or Latex documents. This functionality is useful for both batch document section evaluation and interactive coding within a text…
This paper presents Ellogon, a multi-lingual, cross-platform, general-purpose text engineering environment. Ellogon was designed to aid both researchers in natural language processing and companies that produce language…
Legal technology is currently receiving a lot of attention from various angles. In this contribution we describe the main technical components of a system that is currently under development in the European innovation project Lynx, which…
In this paper, we summarize the work done on the resources of Modern Greek on the Lexicon-Grammar of verbs. We detail the definitional features of each table, and all changes made to the names of features to make them consistent. Through…
To address the challenges associated with data processing at scale, we propose Dataverse, a unified open-source Extract-Transform-Load (ETL) pipeline for large language models (LLMs) with a user-friendly design at its core. Easy addition of…
Large language models (LLMs) are increasingly touted as powerful tools for automating scientific information extraction. However, existing methods and tools often struggle with the realities of scientific literature: long-context documents,…
This paper introduces Llettuce, an open-source tool designed to address the complexities of converting medical terms into OMOP standard concepts. Unlike existing solutions such as the Athena database search and Usagi, which struggle with…
The development of Large Language Models (LLMs) has predominantly focused on high-resource languages, leaving extremely low-resource languages like Irish with limited representation. This work presents UCCIX, a pioneering effort on the…
The paper presents the design and development of an English-Lithuanian-English dictionary-lexicon tool and a lexicon database management system for MT. The system is oriented to support two main requirements: to be open to the user and to…
Lexicon-Grammar tables constitute a large-coverage syntactic lexicon but they cannot be directly used in Natural Language Processing (NLP) applications because they sometimes rely on implicit information. In this paper, we introduce…
In this paper we present a unification-based lexical platform designed for highly inflected languages (like the Romance languages). A formalism is proposed for encoding a lemma-based lexical source, well suited for linguistic generalizations. From…
We report on the current state of development of a document suite and its applications. This collection of tools for the flexible and robust processing of documents in German is based on the use of XML as a unifying formalism for encoding…
Helix is an open-source, extensible, Python-based software framework to facilitate reproducible and interpretable machine learning workflows for tabular data. It addresses the growing need for transparent experimental data analytics…
We present preliminary results on Legistix, a tool we are developing to automatically consolidate French and European law. Legistix is based both on regular expressions used in several compound grammars, similar to the successive…
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), improving the state of the art and exhibiting emergent capabilities across various tasks. However, their application in extracting information from visually rich…
The eXtensible Markup Language (XML) can be used as a data exchange format in different domains. It allows different parties to exchange data by providing a common understanding of the basic concepts in the domain. XML covers the syntactic…
Although many attempts at automated aids for legal drafting have been made, they were based on the construction of a new tool, completely from scratch. This is at least curious, considering that a strong parallelism can be established…