Related papers: Stanza: A Python Natural Language Processing Toolk…

Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing

We introduce Trankit, a light-weight Transformer-based Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 pretrained pipelines for 56 languages.…

Computation and Language · Computer Science 2021-10-18 Minh Van Nguyen, Viet Dac Lai, Amir Pouran Ben Veyseh, Thien Huu Nguyen

Biomedical and Clinical English Model Packages in the Stanza Python NLP Library

We introduce biomedical and clinical English model packages for the Stanza Python NLP library. These packages offer accurate syntactic analysis and named entity recognition capabilities for biomedical and clinical text, by combining…

Computation and Language · Computer Science 2020-07-30 Yuhao Zhang, Yuhui Zhang, Peng Qi, Christopher D. Manning, Curtis P. Langlotz

CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline. We describe the main improvements in CLASSLA-Stanza with respect to…

Computation and Language · Computer Science 2023-08-14 Luka Terčon, Nikola Ljubešić

BlaBla: Linguistic Feature Extraction for Clinical Analysis in Multiple Languages

We introduce BlaBla, an open-source Python library for extracting linguistic features with proven clinical relevance to neurological and psychiatric diseases across many languages. BlaBla is a unifying framework for accelerating and…

Computation and Language · Computer Science 2020-05-21 Abhishek Shivkumar, Jack Weston, Raphael Lenain, Emil Fristed

Lupa: A Framework for Large Scale Analysis of the Programming Language Usage

In this paper, we present Lupa - a framework for large-scale analysis of the programming language usage. Lupa is a command line tool that uses the power of the IntelliJ Platform under the hood, which gives it access to powerful static…

Programming Languages · Computer Science 2022-03-30 Anna Vlasova, Maria Tigina, Ilya Vlasov, Anastasiia Birillo, Yaroslav Golubev, Timofey Bryksin

A Tidy Data Model for Natural Language Processing using cleanNLP

The package cleanNLP provides a set of fast tools for converting a textual corpus into a set of normalized tables. The underlying natural language processing pipeline utilizes Stanford's CoreNLP library, exposing a number of annotation…

Computation and Language · Computer Science 2018-05-04 Taylor Arnold

TurkicNLP: An NLP Toolkit for Turkic Languages

Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library…

Computation and Language · Computer Science 2026-05-25 Sherzod Hakimov

GR-NLP-TOOLKIT: An Open-Source NLP Toolkit for Modern Greek

We present GR-NLP-TOOLKIT, an open-source natural language processing (NLP) toolkit developed specifically for modern Greek. The toolkit provides state-of-the-art performance in five core NLP tasks, namely part-of-speech tagging,…

Computation and Language · Computer Science 2024-12-12 Lefteris Loukas, Nikolaos Smyrnioudis, Chrysa Dikonomaki, Spyros Barbakos, Anastasios Toumazatos, John Koutsikakis, Manolis Kyriakakis, Mary Georgiou, Stavros Vassos, John Pavlopoulos, Ion Androutsopoulos

A New Massive Multilingual Dataset for High-Performance Language Technologies

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the…

Computation and Language · Computer Science 2024-03-22 Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann

Yaps: Python Frontend to Stan

Stan is a popular probabilistic programming language with a self-contained syntax and semantics that is close to graphical models. Unfortunately, existing embeddings of Stan in Python use multi-line strings. That approach forces users to…

Programming Languages · Computer Science 2018-12-12 Guillaume Baudart, Martin Hirzel, Kiran Kate, Louis Mandel, Avraham Shinnar

Is Machine Learning Speaking my Language? A Critical Look at the NLP-Pipeline Across 8 Human Languages

Natural Language Processing (NLP) is increasingly used as a key ingredient in critical decision-making systems such as resume parsers used in sorting a list of job candidates. NLP systems often ingest large corpora of human text, attempting…

Computation and Language · Computer Science 2020-07-14 Esma Wali, Yan Chen, Christopher Mahoney, Thomas Middleton, Marzieh Babaeianjelodar, Mariama Njie, Jeanna Neefe Matthews

An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality…

Computation and Language · Computer Science 2025-06-05 Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Nikita Moghe, Amanda Myntti, Dayyán O'Brien, Stephan Oepen, Proyag Pal, Jousia Piha, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Tereza Vojtěchová, Jaume Zaragoza-Bernabeu

Doing Natural Language Processing in A Natural Way: An NLP toolkit based on object-oriented knowledge base and multi-level grammar base

We introduce an NLP toolkit based on an object-oriented knowledge base and a multi-level grammar base. This toolkit focuses on semantic parsing; it can also discover new knowledge and grammar automatically, new discovered knowledge…

Computation and Language · Computer Science 2021-06-09 Yu Guo

Polylingual Wordnet

Princeton WordNet is one of the most important resources for natural language processing, but is only available for English. While it has been translated using the expand approach to many other languages, this is an expensive manual…

Computation and Language · Computer Science 2019-03-05 Mihael Arcan, John McCrae, Paul Buitelaar

Automated Python Translation

Python is one of the most commonly used programming languages in industry and education. Its English keywords and built-in functions/modules allow it to come close to pseudo-code in terms of its readability and ease of writing. However,…

Computation and Language · Computer Science 2025-04-17 Joshua Otten, Antonios Anastasopoulos, Kevin Moran

COMBO: State-of-the-Art Morphosyntactic Analysis

We introduce COMBO - a fully neural NLP system for accurate part-of-speech tagging, morphological analysis, lemmatisation, and (enhanced) dependency parsing. It predicts categorical morphosyntactic features whilst also exposing their vector…

Computation and Language · Computer Science 2021-09-14 Mateusz Klimaszewski, Alina Wróblewska

PyPLN: a Distributed Platform for Natural Language Processing

This paper presents a distributed platform for Natural Language Processing called PyPLN. PyPLN leverages a vast array of NLP and text processing open source tools, managing the distribution of the workload on a variety of configurations:…

Computation and Language · Computer Science 2013-02-20 Flávio Codeço Coelho, Renato Rocha Souza, Álvaro Justen, Flávio Amieiro, Heliana Mello

Learning Language Representations for Typology Prediction

One central mystery of neural NLP is what neural models "know" about their subject matter. When a neural machine translation system learns to translate from one language to another, does it learn the syntax or semantics of the languages?…

Computation and Language · Computer Science 2017-08-01 Chaitanya Malaviya, Graham Neubig, Patrick Littell

TajikNLP: An Open-Source Toolkit for Comprehensive Text Processing of Tajik (Cyrillic Script)

The Tajik language, written in Cyrillic script, remains severely under-resourced in terms of publicly available natural language processing (NLP) toolkits, hindering both linguistic research and applied development. This paper introduces…

Computation and Language · Computer Science 2026-05-29 Mullosharaf K. Arabov

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of…

Computation and Language · Computer Science 2024-02-12 Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, Emad A. Alghamdi, Sebastian Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün, Marzieh Fadaee, Sara Hooker