Related papers: Starfish: A Prototype for Universal Preprocessing …

Jellyfish: A Large Language Model for Data Preprocessing

This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format conducive to easy processing. Whereas the use of LLMs has sparked interest in…

Artificial Intelligence · Computer Science 2024-10-30 Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada

Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark

Text embedding methods have become increasingly popular in both industrial and academic fields due to their critical role in a variety of natural language processing tasks. The significance of universal text embeddings has been further…

Information Retrieval · Computer Science 2024-06-21 Hongliu Cao

PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks

Unsupervised text embedding methods, such as Skip-gram and Paragraph Vector, have been attracting increasing attention due to their simplicity, scalability, and effectiveness. However, compared to sophisticated deep learning architectures…

Computation and Language · Computer Science 2015-08-04 Jian Tang, Meng Qu, Qiaozhu Mei

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release…

Computation and Language · Computer Science 2025-12-03 Project Apertus, Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, Badr AlKhamissi, Inés Altemir Mariñas, Mohammad Hossein Amani, Matin Ansaripour, Ilia Badanin, Harold Benoit, Emanuela Boros, Nicholas Browning, Fabian Bösch, Maximilian Böther, Niklas Canova, Camille Challier, Clement Charmillot, Jonathan Coles, Jan Deriu, Arnout Devos, Lukas Drescher, Daniil Dzenhaliou, Maud Ehrmann, Dongyang Fan, Simin Fan, Silin Gao, Miguel Gila, María Grandury, Diba Hashemi, Alexander Hoyle, Jiaming Jiang, Mark Klein, Andrei Kucharavy, Anastasiia Kucherenko, Frederike Lübeck, Roman Machacek, Theofilos Manitaras, Andreas Marfurt, Kyle Matoba, Simon Matrenok, Henrique Mendonça, Fawzi Roberto Mohamed, Syrielle Montariol, Luca Mouchel, Sven Najem-Meyer, Jingwei Ni, Gennaro Oliva, Matteo Pagliardini, Elia Palme, Andrei Panferov, Léo Paoletti, Marco Passerini, Ivan Pavlov, Auguste Poiroux, Kaustubh Ponkshe, Nathan Ranchin, Javi Rando, Mathieu Sauser, Jakhongir Saydaliev, Muhammad Ali Sayfiddinov, Marian Schneider, Stefano Schuppli, Marco Scialanga, Andrei Semenov, Kumar Shridhar, Raghav Singhal, Anna Sotnikova, Alexander Sternfeld, Ayush Kumar Tarun, Paul Teiletche, Jannis Vamvas, Xiaozhe Yao, Hao Zhao, Alexander Ilic, Ana Klimovic, Andreas Krause, Caglar Gulcehre, David Rosenthal, Elliott Ash, Florian Tramèr, Joost VandeVondele, Livio Veraldi, Martin Rajman, Thomas Schulthess, Torsten Hoefler, Antoine Bosselut, Martin Jaggi, Imanol Schlag

The Weaves Reconfigurable Programming Framework

This research proposes a language independent intra-process framework for object based composition of unmodified code modules. Intuitively, the two major programming models, threads and processes, can be considered as extremes along a…

Programming Languages · Computer Science 2007-05-23 Srinidhi Varadarajan

StarSpace: Embed All The Things!

We present StarSpace, a general-purpose neural embedding model that can solve a wide variety of problems: labeling tasks such as text classification, ranking tasks such as information retrieval/web search, collaborative filtering-based or…

Computation and Language · Computer Science 2017-11-22 Ledell Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, Jason Weston

Text embedding models can be great data engineers

Data engineering pipelines are essential - albeit costly - components of predictive analytics frameworks requiring significant engineering time and domain expertise for carrying out tasks such as data ingestion, preprocessing, feature…

Machine Learning · Computer Science 2025-05-22 Iman Kazemian, Paritosh Ramanan, Murat Yildirim

PyTorchPipe: a framework for rapid prototyping of pipelines combining language and vision

Access to vast amounts of data along with affordable computational power stimulated the reincarnation of neural networks. The progress could not be achieved without adequate software tools, lowering the entry bar for the next generations of…

Machine Learning · Computer Science 2019-10-22 Tomasz Kornuta

PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs

Current sentence embedding evaluations typically rely on static test beds like the Massive Text Embedding Benchmark (MTEB). While invaluable, repeated tuning on a fixed suite can inflate reported scores and obscure real-world robustness. We…

Computation and Language · Computer Science 2026-03-02 Manuel Frank, Haithem Afli

MTEB: Massive Text Embedding Benchmark

Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be…

Computation and Language · Computer Science 2023-03-21 Niklas Muennighoff, Nouamane Tazi, Loïc Magne, Nils Reimers

SUPERB: Speech processing Universal PERformance Benchmark

Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art…

Computation and Language · Computer Science 2021-10-19 Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee

RiverText: A Python Library for Training and Evaluating Incremental Word Embeddings from Text Data Streams

Word embeddings have become essential components in various information retrieval and natural language processing tasks, such as ranking, document classification, and question answering. However, despite their widespread use, traditional…

Computation and Language · Computer Science 2025-07-01 Gabriel Iturra-Bocaz, Felipe Bravo-Marquez

Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning

How can we perform computations over natural language representations to solve tasks that require symbolic and numeric reasoning? We propose natural language embedded programs (NLEP) as a unifying framework for addressing math/symbolic…

Computation and Language · Computer Science 2024-04-01 Tianhua Zhang, Jiaxin Ge, Hongyin Luo, Yung-Sung Chuang, Mingye Gao, Yuan Gong, Xixin Wu, Yoon Kim, Helen Meng, James Glass

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has…

Machine Learning · Computer Science 2023-09-20 Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

PEACH: Pretrained-embedding Explanation Across Contextual and Hierarchical Structure

In this work, we propose a novel tree-based explanation technique, PEACH (Pretrained-embedding Explanation Across Contextual and Hierarchical Structure), that can explain how text-based documents are classified by using any pretrained…

Computation and Language · Computer Science 2024-04-23 Feiqi Cao, Caren Han, Hyunsuk Chung

TextSleuth: Towards Explainable Tampered Text Detection

Recently, tampered text detection has attracted increasing attention due to its essential role in information security. Although existing methods can detect the tampered text region, the interpretation of such detection remains unclear,…

Computer Vision and Pattern Recognition · Computer Science 2025-01-16 Chenfan Qu, Jian Liu, Haoxing Chen, Baihan Yu, Jingjing Liu, Weiqiang Wang, Lianwen Jin

GuideWalk: A Novel Graph-Based Word Embedding for Enhanced Text Classification

One of the prime problems of computer science and machine learning is to extract information efficiently from large-scale, heterogeneous data. Text data, with its syntax, semantics, and even hidden information content, possesses an…

Computation and Language · Computer Science 2024-09-10 Sarmad N. Mohammed, Semra Gündüç

ProcessGPT: Transforming Business Process Management with Generative Artificial Intelligence

Generative Pre-trained Transformer (GPT) is a state-of-the-art machine learning model capable of generating human-like text through natural language processing (NLP). GPT is trained on massive amounts of text data and uses deep learning…

Artificial Intelligence · Computer Science 2023-06-06 Amin Beheshti, Jian Yang, Quan Z. Sheng, Boualem Benatallah, Fabio Casati, Schahram Dustdar, Hamid Reza Motahari Nezhad, Xuyun Zhang, Shan Xue

The Design and Implementation of an Extensible System Meta-Programming Language

System programming languages are typically compiled in a linear pipeline process, which is completely opaque and isolated to end-users. This limits the possibilities of performing meta-programming in the same language and environment, and…

Programming Languages · Computer Science 2023-09-28 Ronie Salgado

BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages

We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as testbed, BPEmb performs competitively, and for some languages…

Computation and Language · Computer Science 2017-10-09 Benjamin Heinzerling, Michael Strube