Related papers: Stanza: A Python Natural Language Processing Toolk…
We introduce Trankit, a light-weight Transformer-based Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 pretrained pipelines for 56 languages.…
We introduce biomedical and clinical English model packages for the Stanza Python NLP library. These packages offer accurate syntactic analysis and named entity recognition capabilities for biomedical and clinical text, by combining…
We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline. We describe the main improvements in CLASSLA-Stanza with respect to…
We introduce BlaBla, an open-source Python library for extracting linguistic features with proven clinical relevance to neurological and psychiatric diseases across many languages. BlaBla is a unifying framework for accelerating and…
In this paper, we present Lupa - a framework for large-scale analysis of the programming language usage. Lupa is a command line tool that uses the power of the IntelliJ Platform under the hood, which gives it access to powerful static…
The package cleanNLP provides a set of fast tools for converting a textual corpus into a set of normalized tables. The underlying natural language processing pipeline utilizes Stanford's CoreNLP library, exposing a number of annotation…
Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library…
We present GR-NLP-TOOLKIT, an open-source natural language processing (NLP) toolkit developed specifically for modern Greek. The toolkit provides state-of-the-art performance in five core NLP tasks, namely part-of-speech tagging,…
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the…
Stan is a popular probabilistic programming language with a self-contained syntax and semantics that is close to graphical models. Unfortunately, existing embeddings of Stan in Python use multi-line strings. That approach forces users to…
Natural Language Processing (NLP) is increasingly used as a key ingredient in critical decision-making systems such as resume parsers used in sorting a list of job candidates. NLP systems often ingest large corpora of human text, attempting…
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality…
We introduce an NLP toolkit based on object-oriented knowledge base and multi-level grammar base. This toolkit focuses on semantic parsing, it also has abilities to discover new knowledge and grammar automatically, new discovered knowledge…
Princeton WordNet is one of the most important resources for natural language processing, but is only available for English. While it has been translated using the expand approach to many other languages, this is an expensive manual…
Python is one of the most commonly used programming languages in industry and education. Its English keywords and built-in functions/modules allow it to come close to pseudo-code in terms of its readability and ease of writing. However,…
We introduce COMBO - a fully neural NLP system for accurate part-of-speech tagging, morphological analysis, lemmatisation, and (enhanced) dependency parsing. It predicts categorical morphosyntactic features whilst also exposes their vector…
This paper presents a distributed platform for Natural Language Processing called PyPLN. PyPLN leverages a vast array of NLP and text processing open source tools, managing the distribution of the workload on a variety of configurations:…
One central mystery of neural NLP is what neural models "know" about their subject matter. When a neural machine translation system learns to translate from one language to another, does it learn the syntax or semantics of the languages?…
The Tajik language, written in Cyrillic script, remains severely under-resourced in terms of publicly available natural language processing (NLP) toolkits, hindering both linguistic research and applied development. This paper introduces…
Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of…