Yanis Labrak — Scifaro

SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation

We present SDialog, an MIT-licensed open-source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end-to-end framework for building and analyzing LLM-based conversational agents. Built…

Computation and Language · Computer Science 2026-05-12 Sergio Burdisso , Séverin Baroudi , Yanis Labrak , David Grunert , Pawel Cyrta , Yiyang Chen , Srikanth Madikeri , Thomas Schaaf , Esaú Villatoro-Tello , Ahmed Hassoon , Ricard Marxer , Petr Motlicek

Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization

Long-context audio reasoning is underserved in both training data and evaluation. Existing benchmarks target short-context tasks, and the open-ended generation tasks most relevant to long-context reasoning pose well-known challenges for…

Sound · Computer Science 2026-04-08 Yanis Labrak , David Grünert , Séverin Baroudi , Jiyun Chun , Pawel Cyrta , Sergio Burdisso , Ahmed Hassoon , David Liu , Adam Rothschild , Reed Van Deusen , Petr Motlicek , Andrew Perrault , Ricard Marxer , Thomas Schaaf

Doctor or Patient? Synergizing Diarization and ASR for Code-Switched Hinglish Medical Conditions Extraction

Extracting patient medical conditions from code-switched clinical spoken dialogues is challenging due to rapid turn-taking and highly overlapped speech. We present a robust system evaluated on the DISPLACE-M dataset of real-world Hinglish…

Audio and Speech Processing · Electrical Eng. & Systems 2026-03-09 Séverin Baroudi , Yanis Labrak , Shashi Kumar , Joonas Kalda , Sergio Burdisso , Pawel Cyrta , Juan Ignacio Alvarez-Trejos , Petr Motlicek , Hervé Bredin , Ricard Marxer

SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation

We present SDialog, an MIT-licensed open-source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end-to-end framework for building and analyzing LLM-based conversational agents. Built…

Artificial Intelligence · Computer Science 2025-12-15 Sergio Burdisso , Séverin Baroudi , Yanis Labrak , David Grunert , Pawel Cyrta , Yiyang Chen , Srikanth Madikeri , Esaú Villatoro-Tello , Thomas Schaaf , Ricard Marxer , Petr Motlicek

Late Fusion and Multi-Level Fission Amplify Cross-Modal Transfer in Text-Speech LMs

Text-Speech Language Models (TSLMs) -- language models trained to jointly process and generate text and speech -- are commonly trained through an early modality fusion/fission approach, in which both modalities are fed and predicted from a…

Computation and Language · Computer Science 2025-10-21 Santiago Cuervo , Adel Moumen , Yanis Labrak , Sameer Khurana , Antoine Laurent , Mickael Rouvier , Phil Woodland , Ricard Marxer

An Empirical Analysis of Discrete Unit Representations in Speech Language Modeling Pre-training

This paper investigates discrete unit representations in Speech Language Models (SLMs), focusing on optimizing speech modeling during continual pre-training. In this paper, we systematically examine how model architecture, data…

Computation and Language · Computer Science 2025-09-09 Yanis Labrak , Richard Dufour , Mickaël Rouvier

Synthetic Lyrics Detection Across Languages and Genres

In recent years, the use of large language models (LLMs) to generate music content, particularly lyrics, has gained in popularity. These advances provide valuable tools for artists and enhance their creative processes, but they also raise…

Computation and Language · Computer Science 2025-04-25 Yanis Labrak , Markus Frohmann , Gabriel Meseguer-Brocal , Elena V. Epure

BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains

Large Language Models (LLMs) have demonstrated remarkable versatility in recent years, offering potential applications across specialized domains such as healthcare and medicine. Despite the availability of various open-source LLMs tailored…

Computation and Language · Computer Science 2024-07-18 Yanis Labrak , Adrien Bazoge , Emmanuel Morin , Pierre-Antoine Gourraud , Mickael Rouvier , Richard Dufour

Zero-Shot End-To-End Spoken Question Answering In Medical Domain

In the rapidly evolving landscape of spoken question-answering (SQA), the integration of large language models (LLMs) has emerged as a transformative development. Conventional approaches often entail the use of separate models for question…

Computation and Language · Computer Science 2024-06-11 Yanis Labrak , Adel Moumen , Richard Dufour , Mickael Rouvier

How Important Is Tokenization in French Medical Masked Language Models?

Subword tokenization has become the prevailing standard in the field of natural language processing (NLP) over recent years, primarily due to the widespread utilization of pre-trained language models. This shift began with Byte-Pair…

Computation and Language · Computer Science 2024-06-11 Yanis Labrak , Adrien Bazoge , Beatrice Daille , Mickael Rouvier , Richard Dufour

DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain

The biomedical domain has sparked a significant interest in the field of Natural Language Processing (NLP), which has seen substantial advancements with pre-trained language models (PLMs). However, comparing these models has proven…

Computation and Language · Computer Science 2024-06-11 Yanis Labrak , Adrien Bazoge , Oumaima El Khettari , Mickael Rouvier , Pacome Constant dit Beaufils , Natalia Grabar , Beatrice Daille , Solen Quiniou , Emmanuel Morin , Pierre-Antoine Gourraud , Richard Dufour

A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks

We evaluate four state-of-the-art instruction-tuned large language models (LLMs) -- ChatGPT, Flan-T5 UL2, Tk-Instruct, and Alpaca -- on a set of 13 real-world clinical and biomedical natural language processing (NLP) tasks in English, such…

Computation and Language · Computer Science 2024-06-11 Yanis Labrak , Mickael Rouvier , Richard Dufour

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich…

Computation and Language · Computer Science 2023-06-28 BigScience Workshop , Teven Le Scao , Angela Fan , Christopher Akiki , Ellie Pavlick , Suzana Ilić , Daniel Hesslow , Roman Castagné , Alexandra Sasha Luccioni , François Yvon , Matthias Gallé , Jonathan Tow , Alexander M. Rush , Stella Biderman , Albert Webson , Pawan Sasanka Ammanamanchi , Thomas Wang , Benoît Sagot , Niklas Muennighoff , Albert Villanova del Moral , Olatunji Ruwase , Rachel Bawden , Stas Bekman , Angelina McMillan-Major , Iz Beltagy , Huu Nguyen , Lucile Saulnier , Samson Tan , Pedro Ortiz Suarez , Victor Sanh , Hugo Laurençon , Yacine Jernite , Julien Launay , Margaret Mitchell , Colin Raffel , Aaron Gokaslan , Adi Simhi , Aitor Soroa , Alham Fikri Aji , Amit Alfassy , Anna Rogers , Ariel Kreisberg Nitzav , Canwen Xu , Chenghao Mou , Chris Emezue , Christopher Klamm , Colin Leong , Daniel van Strien , David Ifeoluwa Adelani , Dragomir Radev , Eduardo González Ponferrada , Efrat Levkovizh , Ethan Kim , Eyal Bar Natan , Francesco De Toni , Gérard Dupont , Germán Kruszewski , Giada Pistilli , Hady Elsahar , Hamza Benyamina , Hieu Tran , Ian Yu , Idris Abdulmumin , Isaac Johnson , Itziar Gonzalez-Dios , Javier de la Rosa , Jenny Chim , Jesse Dodge , Jian Zhu , Jonathan Chang , Jörg Frohberg , Joseph Tobing , Joydeep Bhattacharjee , Khalid Almubarak , Kimbo Chen , Kyle Lo , Leandro Von Werra , Leon Weber , Long Phan , Loubna Ben allal , Ludovic Tanguy , Manan Dey , Manuel Romero Muñoz , Maraim Masoud , María Grandury , Mario Šaško , Max Huang , Maximin Coavoux , Mayank Singh , Mike Tian-Jian Jiang , Minh Chien Vu , Mohammad A. Jauhar , Mustafa Ghaleb , Nishant Subramani , Nora Kassner , Nurulaqilla Khamis , Olivier Nguyen , Omar Espejel , Ona de Gibert , Paulo Villegas , Peter Henderson , Pierre Colombo , Priscilla Amuok , Quentin Lhoest , Rheza Harliman , Rishi Bommasani , Roberto Luis López , Rui Ribeiro , Salomey Osei , Sampo Pyysalo , Sebastian Nagel , Shamik Bose , Shamsuddeen Hassan Muhammad , Shanya Sharma , Shayne Longpre , Somaieh Nikpoor , Stanislav Silberberg , Suhas Pai , Sydney Zink , Tiago Timponi Torrent , Timo Schick , Tristan Thrush , Valentin Danchev , Vassilina Nikoulina , Veronika Laippala , Violette Lepercq , Vrinda Prabhu , Zaid Alyafeai , Zeerak Talat , Arun Raja , Benjamin Heinzerling , Chenglei Si , Davut Emre Taşar , Elizabeth Salesky , Sabrina J. Mielke , Wilson Y. Lee , Abheesht Sharma , Andrea Santilli , Antoine Chaffin , Arnaud Stiegler , Debajyoti Datta , Eliza Szczechla , Gunjan Chhablani , Han Wang , Harshit Pandey , Hendrik Strobelt , Jason Alan Fries , Jos Rozen , Leo Gao , Lintang Sutawika , M Saiful Bari , Maged S. Al-shaibani , Matteo Manica , Nihal Nayak , Ryan Teehan , Samuel Albanie , Sheng Shen , Srulik Ben-David , Stephen H. Bach , Taewoon Kim , Tali Bers , Thibault Fevry , Trishala Neeraj , Urmish Thakker , Vikas Raunak , Xiangru Tang , Zheng-Xin Yong , Zhiqing Sun , Shaked Brody , Yallow Uri , Hadar Tojarieh , Adam Roberts , Hyung Won Chung , Jaesung Tae , Jason Phang , Ofir Press , Conglong Li , Deepak Narayanan , Hatim Bourfoune , Jared Casper , Jeff Rasley , Max Ryabinin , Mayank Mishra , Minjia Zhang , Mohammad Shoeybi , Myriam Peyrounette , Nicolas Patry , Nouamane Tazi , Omar Sanseviero , Patrick von Platen , Pierre Cornette , Pierre François Lavallée , Rémi Lacroix , Samyam Rajbhandari , Sanchit Gandhi , Shaden Smith , Stéphane Requena , Suraj Patil , Tim Dettmers , Ahmed Baruwa , Amanpreet Singh , Anastasia Cheveleva , Anne-Laure Ligozat , Arjun Subramonian , Aurélie Névéol , Charles Lovering , Dan Garrette , Deepak Tunuguntla , Ehud Reiter , Ekaterina Taktasheva , Ekaterina Voloshina , Eli Bogdanov , Genta Indra Winata , Hailey Schoelkopf , Jan-Christoph Kalo , Jekaterina Novikova , Jessica Zosa Forde , Jordan Clive , Jungo Kasai , Ken Kawamura , Liam Hazan , Marine Carpuat , Miruna Clinciu , Najoung Kim , Newton Cheng , Oleg Serikov , Omer Antverg , Oskar van der Wal , Rui Zhang , Ruochen Zhang , Sebastian Gehrmann , Shachar Mirkin , Shani Pais , Tatiana Shavrina , Thomas Scialom , Tian Yun , Tomasz Limisiewicz , Verena Rieser , Vitaly Protasov , Vladislav Mikhailov , Yada Pruksachatkun , Yonatan Belinkov , Zachary Bamberger , Zdeněk Kasner , Alice Rueda , Amanda Pestana , Amir Feizpour , Ammar Khan , Amy Faranak , Ana Santos , Anthony Hevia , Antigona Unldreaj , Arash Aghagol , Arezoo Abdollahi , Aycha Tammour , Azadeh HajiHosseini , Bahareh Behroozi , Benjamin Ajibade , Bharat Saxena , Carlos Muñoz Ferrandis , Daniel McDuff , Danish Contractor , David Lansky , Davis David , Douwe Kiela , Duong A. Nguyen , Edward Tan , Emi Baylor , Ezinwanne Ozoani , Fatima Mirza , Frankline Ononiwu , Habib Rezanejad , Hessie Jones , Indrani Bhattacharya , Irene Solaiman , Irina Sedenko , Isar Nejadgholi , Jesse Passmore , Josh Seltzer , Julio Bonis Sanz , Livia Dutra , Mairon Samagaio , Maraim Elbadri , Margot Mieskes , Marissa Gerchick , Martha Akinlolu , Michael McKenna , Mike Qiu , Muhammed Ghauri , Mykola Burynok , Nafis Abrar , Nazneen Rajani , Nour Elkott , Nour Fahmy , Olanrewaju Samuel , Ran An , Rasmus Kromann , Ryan Hao , Samira Alizadeh , Sarmad Shubber , Silas Wang , Sourav Roy , Sylvain Viguier , Thanh Le , Tobi Oyebade , Trieu Le , Yoyo Yang , Zach Nguyen , Abhinav Ramesh Kashyap , Alfredo Palasciano , Alison Callahan , Anima Shukla , Antonio Miranda-Escalada , Ayush Singh , Benjamin Beilharz , Bo Wang , Caio Brito , Chenxi Zhou , Chirag Jain , Chuxin Xu , Clémentine Fourrier , Daniel León Periñán , Daniel Molano , Dian Yu , Enrique Manjavacas , Fabio Barth , Florian Fuhrimann , Gabriel Altay , Giyaseddin Bayrak , Gully Burns , Helena U. Vrabec , Imane Bello , Ishani Dash , Jihyun Kang , John Giorgi , Jonas Golde , Jose David Posada , Karthik Rangasai Sivaraman , Lokesh Bulchandani , Lu Liu , Luisa Shinzato , Madeleine Hahn de Bykhovetz , Maiko Takeuchi , Marc Pàmies , Maria A Castillo , Marianna Nezhurina , Mario Sänger , Matthias Samwald , Michael Cullan , Michael Weinberg , Michiel De Wolf , Mina Mihaljcic , Minna Liu , Moritz Freidank , Myungsun Kang , Natasha Seelam , Nathan Dahlberg , Nicholas Michio Broad , Nikolaus Muellner , Pascale Fung , Patrick Haller , Ramya Chandrasekhar , Renata Eisenberg , Robert Martin , Rodrigo Canalli , Rosaline Su , Ruisi Su , Samuel Cahyawijaya , Samuele Garda , Shlok S Deshmukh , Shubhanshu Mishra , Sid Kiblawi , Simon Ott , Sinee Sang-aroonsiri , Srishti Kumar , Stefan Schweter , Sushil Bharati , Tanmay Laud , Théo Gigant , Tomoya Kainuma , Wojciech Kusa , Yanis Labrak , Yash Shailesh Bajaj , Yash Venkatraman , Yifan Xu , Yingxin Xu , Yu Xu , Zhe Tan , Zhongli Xie , Zifan Ye , Mathilde Bras , Younes Belkada , Thomas Wolf

DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains

In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general domain data, specialized ones have emerged to more…

Computation and Language · Computer Science 2023-05-08 Yanis Labrak , Adrien Bazoge , Richard Dufour , Mickael Rouvier , Emmanuel Morin , Béatrice Daille , Pierre-Antoine Gourraud

FrenchMedMCQA: A French Multiple-Choice Question Answering Dataset for Medical domain

This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for the medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization…

Computation and Language · Computer Science 2023-04-11 Yanis Labrak , Adrien Bazoge , Richard Dufour , Mickael Rouvier , Emmanuel Morin , Béatrice Daille , Pierre-Antoine Gourraud

BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing

Training and evaluating language models increasingly requires the construction of meta-datasets -- diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization…

Computation and Language · Computer Science 2022-07-01 Jason Alan Fries , Leon Weber , Natasha Seelam , Gabriel Altay , Debajyoti Datta , Samuele Garda , Myungsun Kang , Ruisi Su , Wojciech Kusa , Samuel Cahyawijaya , Fabio Barth , Simon Ott , Matthias Samwald , Stephen Bach , Stella Biderman , Mario Sänger , Bo Wang , Alison Callahan , Daniel León Periñán , Théo Gigant , Patrick Haller , Jenny Chim , Jose David Posada , John Michael Giorgi , Karthik Rangasai Sivaraman , Marc Pàmies , Marianna Nezhurina , Robert Martin , Michael Cullan , Moritz Freidank , Nathan Dahlberg , Shubhanshu Mishra , Shamik Bose , Nicholas Michio Broad , Yanis Labrak , Shlok S Deshmukh , Sid Kiblawi , Ayush Singh , Minh Chien Vu , Trishala Neeraj , Jonas Golde , Albert Villanova del Moral , Benjamin Beilharz

Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations

The COVID-19 pandemic has been severely impacting global society since December 2019. Massive research has been undertaken to understand the characteristics of the virus and design vaccines and drugs. The related findings have been reported…

Digital Libraries · Computer Science 2022-06-07 Qingyu Chen , Alexis Allot , Robert Leaman , Rezarta Islamaj Doğan , Jingcheng Du , Li Fang , Kai Wang , Shuo Xu , Yuefu Zhang , Parsa Bagherzadeh , Sabine Bergler , Aakash Bhatnagar , Nidhir Bhavsar , Yung-Chun Chang , Sheng-Jie Lin , Wentai Tang , Hongtong Zhang , Ilija Tavchioski , Senja Pollak , Shubo Tian , Jinfeng Zhang , Yulia Otmakhova , Antonio Jimeno Yepes , Hang Dong , Honghan Wu , Richard Dufour , Yanis Labrak , Niladri Chatterjee , Kushagri Tandon , Fréjus Laleye , Loïc Rakotoson , Emmanuele Chersoni , Jinghang Gu , Annemarie Friedrich , Subhash Chandra Pujari , Mariia Chizhikova , Naveen Sivadasan , Naveen Sivadasan , Zhiyong Lu