Related papers: Model Equality Testing: Which Model Is This API Se…

Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test

As API access becomes a primary interface to large language models (LLMs), users often interact with black-box systems that offer little transparency into the deployed model. To reduce costs or maliciously alter model behaviors, API…

Cryptography and Security · Computer Science 2026-04-10 Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan, Ajraf Mannan, Jonathan Michala, Raluca Ada Popa, Willie Neiswanger

You've Changed: Detecting Modification of Black-Box Large Language Models

Large Language Models (LLMs) are often provided as a service via an API, making it challenging for developers to detect changes in their behavior. We present an approach to monitor LLMs for changes by comparing the distributions of…

Computation and Language · Computer Science 2025-04-18 Alden Dima, James Foulds, Shimei Pan, Philip Feldman

Statistical Modeling and Uncertainty Estimation of LLM Inference Systems

Large Language Model (LLM) inference systems present significant challenges in statistical performance characterization due to dynamic workload variations, diverse hardware architectures, and complex interactions between model size, batch…

Performance · Computer Science 2025-05-15 Kaustabha Ray, Nelson Mimura Gonzalez, Bruno Wassermann, Rachel Tzoref-Brill, Dean H. Lorenz

Did the Model Change? Efficiently Assessing Machine Learning API Shifts

Machine learning (ML) prediction APIs are increasingly widely used. An ML API can change over time due to model updates or retraining. This presents a key challenge in the usage of the API because it is often not clear to the user if and…

Machine Learning · Statistics 2021-07-30 Lingjiao Chen, Tracy Cai, Matei Zaharia, James Zou

Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking

Large language models (LLMs) increasingly operate as autonomous agents that reason over external APIs to perform complex tasks. However, their reliability and agreement remain poorly characterized. We present a unified benchmarking…

Information Retrieval · Computer Science 2026-04-28 Eyhab Al-Masri

Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models

In the recent past, a popular way of evaluating natural language understanding (NLU) was to consider a model's ability to perform natural language inference (NLI) tasks. In this paper, we investigate if NLI tasks, that are rarely used for…

Computation and Language · Computer Science 2024-11-22 Lovish Madaan, David Esiobu, Pontus Stenetorp, Barbara Plank, Dieuwke Hupkes

Can You Detect the Difference?

The rapid advancement of large language models (LLMs) has raised concerns about reliably detecting AI-generated text. Stylometric metrics work well on autoregressive (AR) outputs, but their effectiveness on diffusion-based models is…

Computation and Language · Computer Science 2025-07-15 İsmail Tarım, Aytuğ Onan

LlamaRestTest: Effective REST API Testing with Small Language Models

Modern web services rely heavily on REST APIs, typically documented using the OpenAPI specification. The widespread adoption of this standard has resulted in the development of many black-box testing tools that generate tests based on…

Software Engineering · Computer Science 2025-04-07 Myeongsoo Kim, Saurabh Sinha, Alessandro Orso

Audit Me If You Can: Query-Efficient Active Fairness Auditing of Black-Box LLMs

Large Language Models (LLMs) exhibit systematic biases across demographic groups. Auditing is proposed as an accountability tool for black-box LLM applications, but suffers from resource-intensive query access. We conceptualise auditing as…

Machine Learning · Computer Science 2026-01-07 David Hartmann, Lena Pohlmann, Lelia Hanslik, Noah Gießing, Bettina Berendt, Pieter Delobelle

Artificial Interrogation for Attributing Language Models

This paper presents solutions to the Machine Learning Model Attribution challenge (MLMAC) collectively organized by MITRE, Microsoft, Schmidt-Futures, Robust-Intelligence, Lincoln-Network, and Huggingface community. The challenge provides…

Computation and Language · Computer Science 2022-11-22 Farhan Dhanani, Muhammad Rafi

Debiasing Algorithm through Model Adaptation

Large language models are becoming the go-to solution for the ever-growing number of tasks. However, with growing capacity, models are prone to rely on spurious correlations stemming from biases and stereotypes present in the training data.…

Computation and Language · Computer Science 2024-05-30 Tomasz Limisiewicz, David Mareček, Tomáš Musil

Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation

Large language models (LLMs) have progressed rapidly in complex reasoning and question answering, yet LLM hallucination remains a central bottleneck that hinders practical deployment, especially for commercial black-box LLMs accessible only…

Computation and Language · Computer Science 2026-05-08 Huizi Cui, Huan Ma, Qilin Wang, Yuhang Gao, Changqing Zhang

Distributed Inference and Fine-tuning of Large Language Models Over The Internet

Large language models (LLMs) are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them…

Machine Learning · Computer Science 2023-12-14 Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel

Applications and Challenges of Fairness APIs in Machine Learning Software

Machine Learning software systems are frequently used in our day-to-day lives. Some of these systems are used in various sensitive environments to make life-changing decisions. Therefore, it is crucial to ensure that these AI/ML systems do…

Machine Learning · Computer Science 2025-08-25 Ajoy Das, Gias Uddin, Shaiful Chowdhury, Mostafijur Rahman Akhond, Hadi Hemmati

LLM BiasScope: A Real-Time Bias Analysis Platform for Comparative LLM Evaluation

As large language models (LLMs) are deployed widely, detecting and understanding bias in their outputs is critical. We present LLM BiasScope, a web application for side-by-side comparison of LLM outputs with real-time bias analysis. The…

Computation and Language · Computer Science 2026-03-30 Himel Ghosh, Nick Elias Werner

Machine Learning Model Attribution Challenge

We present the findings of the Machine Learning Model Attribution Challenge. Fine-tuned machine learning models may derive from other trained models without obvious attribution characteristics. In this challenge, participants identify the…

Machine Learning · Computer Science 2023-02-20 Elizabeth Merkhofer, Deepesh Chaudhari, Hyrum S. Anderson, Keith Manville, Lily Wong, João Gante

Great Models Think Alike and this Undermines AI Oversight

As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as "AI Oversight". We study how…

Machine Learning · Computer Science 2025-06-13 Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping

Trusted Source Alignment in Large Language Models

Large language models (LLMs) are trained on web-scale corpora that inevitably include contradictory factual information from sources of varying reliability. In this paper, we propose measuring an LLM property called trusted source alignment…

Computation and Language · Computer Science 2023-11-14 Vasilisa Bashlovkina, Zhaobin Kuang, Riley Matthews, Edward Clifford, Yennie Jun, William W. Cohen, Simon Baumgartner

Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test?

The legal field already uses various large language models (LLMs) in actual applications, but their quantitative performance and reasons for it are underexplored. We evaluated several open-source and proprietary LLMs -- including…

Computers and Society · Computer Science 2025-09-12 Bhakti Khera, Rezvan Alamian, Pascal A. Scherz, Stephan M. Goetz

Towards Automated Regulatory Compliance Verification in Financial Auditing with Large Language Models

The auditing of financial documents, historically a labor-intensive process, stands on the precipice of transformation. AI-driven solutions have made inroads into streamlining this process by recommending pertinent text passages from…

Computation and Language · Computer Science 2025-07-23 Armin Berger, Lars Hillebrand, David Leonhard, Tobias Deußer, Thiago Bell Felix de Oliveira, Tim Dilmaghani, Mohamed Khaled, Bernd Kliem, Rüdiger Loitz, Christian Bauckhage, Rafet Sifa