Related papers: Outilex, plate-forme logicielle de traitement de t…

Data Processing for the OpenGPT-X Model Family

This paper presents a comprehensive overview of the data preparation pipeline developed for the OpenGPT-X project, a large-scale initiative aimed at creating open and high-performance multilingual large language models (LLMs). The project…

Computation and Language · Computer Science 2025-08-08 Nicolò Brandizzi , Hammam Abdelwahab , Anirban Bhowmick , Lennard Helmer , Benny Jörg Stein , Pavel Denisov , Qasid Saleem , Michael Fromm , Mehdi Ali , Richard Rutmann , Farzad Naderi , Mohamad Saif Agy , Alexander Schwirjow , Fabian Küch , Luzian Hahn , Malte Ostendorff , Pedro Ortiz Suarez , Georg Rehm , Dennis Wegener , Nicolas Flores-Herr , Joachim Köhler , Johannes Leveling

Processing XML for Domain Specific Languages

XML is a standard and universal language for representing information. XML processing is supported by two key frameworks: DOM and SAX. SAX is efficient, but leaves the developer to encode much of the processing. This paper introduces a…

Formal Languages and Automata Theory · Computer Science 2015-06-11 Tony Clark

sTeX+ - a System for Flexible Formalization of Linked Data

We present the sTeX+ system, a user-driven advancement of sTeX - a semantic extension of LaTeX that allows for producing high-quality PDF documents for (proof)reading and printing, as well as semantic XML/OMDoc documents for the Web or…

Software Engineering · Computer Science 2010-06-24 Andrea Kohlhase , Michael Kohlhase , Christoph Lange

Litrepl: Literate Paper Processor Promoting Transparency More Than Reproducibility

Litrepl is a lightweight text processing tool designed to recognize and evaluate code sections within Markdown or Latex documents. This functionality is useful for both batch document section evaluation and interactive coding within a text…

Software Engineering · Computer Science 2025-01-22 Sergei Mironov

Ellogon: A New Text Engineering Platform

This paper presents Ellogon, a multi-lingual, cross-platform, general-purpose text engineering environment. Ellogon was designed to aid both researchers in natural language processing and companies that produce language…

Computation and Language · Computer Science 2007-05-23 Georgios Petasis , Vangelis Karkaletsis , Georgios Paliouras , Ion Androutsopoulos , Constantine D. Spyropoulos

Orchestrating NLP Services for the Legal Domain

Legal technology is currently receiving a lot of attention from various angles. In this contribution we describe the main technical components of a system that is currently under development in the European innovation project Lynx, which…

Computation and Language · Computer Science 2020-03-31 Julián Moreno-Schneider , Georg Rehm , Elena Montiel-Ponsoda , Víctor Rodriguez-Doncel , Artem Revenko , Sotirios Karampatakis , Maria Khvalchik , Christian Sageder , Jorge Gracia , Filippo Maganza

Construction du lexique LGLex à partir des tables du Lexique-Grammaire des verbes du grec moderne

In this paper, we summarize the work done on the resources of Modern Greek on the Lexicon-Grammar of verbs. We detail the definitional features of each table, and all changes made to the names of features to make them consistent. Through…

Computation and Language · Computer Science 2011-11-15 Kyriaki Ioannidou , Elsa Tolone

Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models

To address the challenges associated with data processing at scale, we propose Dataverse, a unified open-source Extract-Transform-Load (ETL) pipeline for large language models (LLMs) with a user-friendly design at its core. Easy addition of…

Computation and Language · Computer Science 2025-03-05 Hyunbyung Park , Sukyung Lee , Gyoungjin Gim , Yungi Kim , Dahyun Kim , Chanjun Park

Exploring LLMs for Scientific Information Extraction Using The SciEx Framework

Large language models (LLMs) are increasingly touted as powerful tools for automating scientific information extraction. However, existing methods and tools often struggle with the realities of scientific literature: long-context documents,…

Artificial Intelligence · Computer Science 2026-01-26 Sha Li , Ayush Sadekar , Nathan Self , Yiqi Su , Lars Andersland , Mira Chaplin , Annabel Zhang , Hyoju Yang , James B Henderson , Krista Wigginton , Linsey Marr , T. M. Murali , Naren Ramakrishnan

Llettuce: An Open Source Natural Language Processing Tool for the Translation of Medical Terms into Uniform Clinical Encoding

This paper introduces Llettuce, an open-source tool designed to address the complexities of converting medical terms into OMOP standard concepts. Unlike existing solutions such as the Athena database search and Usagi, which struggle with…

Computation and Language · Computer Science 2026-03-13 James Mitchell-White , Reza Omidvar , Benjamin Partridge , Esmond Urwin , Karthikeyan Sivakumar , Ruizhe Li , Andy Rae , Xiaoyan Wang , Theresia Mina , Tom Giles , Diego Garcia-Gil , Tim Beck , John Chambers , Grazziela Figueredo , Philip R Quinlan

UCCIX: Irish-eXcellence Large Language Model

The development of Large Language Models (LLMs) has predominantly focused on high-resource languages, leaving extremely low-resource languages like Irish with limited representation. This work presents UCCIX, a pioneering effort on the…

Computation and Language · Computer Science 2024-05-24 Khanh-Tung Tran , Barry O'Sullivan , Hoang D. Nguyen

Multilingual lexicon design tool and database management system for MT

The paper presents the design and development of an English-Lithuanian-English dictionary-lexicon tool and a lexicon database management system for MT. The system is oriented to support two main requirements: to be open to the user and to…

Computation and Language · Computer Science 2011-05-09 G. Barisevičius , B. Tamulynas

A generic tool to generate a lexicon for NLP from Lexicon-Grammar tables

Lexicon-Grammar tables constitute a large-coverage syntactic lexicon but they cannot be directly used in Natural Language Processing (NLP) applications because they sometimes rely on implicit information. In this paper, we introduce…

Computation and Language · Computer Science 2010-10-08 Matthieu Constant , Elsa Tolone

A framework for lexical representation

In this paper we present a unification-based lexical platform designed for highly inflected languages (like Romance ones). A formalism is proposed for encoding a lemma-based lexical source, well suited for linguistic generalizations. From…

cmp-lg · Computer Science 2016-08-15 José M. Goñi , José C. González

An XML based Document Suite

We report on the current state of development of a document suite and its applications. This collection of tools for the flexible and robust processing of documents in German is based on the use of XML as a unifying formalism for encoding…

Computation and Language · Computer Science 2007-05-23 Dietmar Roesner , Manuela Kunze

Helix 1.0: An Open-Source Framework for Reproducible and Interpretable Machine Learning on Tabular Scientific Data

Helix is an open-source, extensible, Python-based software framework to facilitate reproducible and interpretable machine learning workflows for tabular data. It addresses the growing need for transparent experimental data analytics…

Machine Learning · Computer Science 2025-07-25 Eduardo Aguilar-Bejarano , Daniel Lea , Karthikeyan Sivakumar , Jimiama M. Mase , Reza Omidvar , Ruizhe Li , Troy Kettle , James Mitchell-White , Morgan R Alexander , David A Winkler , Grazziela Figueredo

Towards an Automatic Consolidation of French Law

We present preliminary results about Legistix, a tool we are developing to automatically consolidate the French and European law. Legistix is based both on regular expressions used in several compound grammars, similar to the successive…

Computation and Language · Computer Science 2023-01-18 Georges-André Silber

LMDX: Language Model-based Document Information Extraction and Localization

Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), improving the state of the art and exhibiting emergent capabilities across various tasks. However, their application in extracting information from visually rich…

Computation and Language · Computer Science 2024-06-25 Vincent Perot , Kai Kang , Florian Luisier , Guolong Su , Xiaoyu Sun , Ramya Sree Boppana , Zilong Wang , Zifeng Wang , Jiaqi Mu , Hao Zhang , Chen-Yu Lee , Nan Hua

Automatic Generation of OWL Ontology from XML Data Source

The eXtensible Markup Language (XML) can be used as data exchange format in different domains. It allows different parties to exchange data by providing common understanding of the basic concepts in the domain. XML covers the syntactic…

Digital Libraries · Computer Science 2012-06-05 Nora Yahia , Sahar A. Mokhtar , AbdelWahab Ahmed

A Software Tool for Legal Drafting

Although many attempts at automated aids for legal drafting have been made, they were based on the construction of a new tool, completely from scratch. This is at least curious, considering that a strong parallelism can be established…

Computers and Society · Computer Science 2011-09-14 Daniel Gorín , Sergio Mera , Fernando Schapachnik