Related papers: DEEP: Docker-based Execution and Evaluation Platfo…

DEP: A Decentralized Large Language Model Evaluation Protocol

With the rapid development of Large Language Models (LLMs), a large number of benchmarks have been proposed. However, most benchmarks lack a unified evaluation standard and require the manual implementation of custom scripts, making results…

Computation and Language · Computer Science 2026-03-03 Jianxiang Peng, Junhao Li, Hongxiang Wang, Haocheng Lyu, Hui Guo, Siyi Hao, Zhen Wang, Chuang Liu, Shaowei Zhang, Bojian Xiong, Yue Chen, Zhuowen Han, Ling Shi, Tianyu Dong, Juesi Xiao, Lei Yang, Yuqi Ren, Deyi Xiong

Performance Evaluation of Deep Learning Tools in Docker Containers

With the success of deep learning techniques in a broad range of application domains, many deep learning software frameworks have been developed and are being updated frequently to adapt to new hardware features and software libraries,…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-10 Pengfei Xu, Shaohuai Shi, Xiaowen Chu

Application performance on a Cluster-Booster system

The DEEP projects have developed a variety of hardware and software technologies aimed at improving the efficiency and usability of next-generation high-performance computers. They revolve around an innovative concept for heterogeneous…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-11 Anke Kreuzer, Jorge Amaya, Norbert Eicker, Estela Suarez

HYCEDIS: HYbrid Confidence Engine for Deep Document Intelligence System

Measuring the confidence of AI models is critical for safely deploying AI in real-world industrial systems. One important application of confidence measurement is information extraction from scanned documents. However, there exists no…

Information Retrieval · Computer Science 2022-10-11 Bao-Sinh Nguyen, Quang-Bach Tran, Tuan-Anh Nguyen Dang, Duc Nguyen, Hung Le

Task Alignment: A simple and effective proxy for model merging in computer vision

Efficiently merging several models fine-tuned for different tasks, but stemming from the same pretrained base model, is of great practical interest. Despite extensive prior work, most evaluations of model merging in computer vision are…

Computer Vision and Pattern Recognition · Computer Science 2026-04-15 Pau de Jorge, César Roberto de Souza, Björn Michele, Mert Bülent Sarıyıldız, Philippe Weinzaepfel, Florent Perronnin, Diane Larlus, Yannis Kalantidis

Cross-Stack Workload Characterization of Deep Recommendation Systems

Deep learning based recommendation systems form the backbone of most personalized cloud services. Though the computer architecture community has recently started to take notice of deep recommendation inference, the resulting solutions have…

Hardware Architecture · Computer Science 2020-10-13 Samuel Hsia, Udit Gupta, Mark Wilkening, Carole-Jean Wu, Gu-Yeon Wei, David Brooks

When is an Embedding Model More Promising than Another?

Embedders play a central role in machine learning, projecting any object into numerical representations that can, in turn, be leveraged to perform various downstream tasks. The evaluation of embedding models typically depends on…

Machine Learning · Computer Science 2024-11-19 Maxime Darrin, Philippe Formont, Ismail Ben Ayed, Jackie CK Cheung, Pablo Piantanida

DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation

Recent advances in large language models have enabled deep research systems that generate expert-level reports through multi-step reasoning and evidence-based synthesis. However, evaluating such reports remains challenging: report quality…

Computation and Language · Computer Science 2026-03-11 Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song, Stanley Jungkyu Choi, Moontae Lee, Honglak Lee

Improving Applicability of Deep Learning based Token Classification models during Training

This paper shows that additional evaluation metrics are needed during model training to decide whether a model is applicable at inference time. As an example, a LayoutLM-based model is trained for token classification in documents. The documents are…

Computer Vision and Pattern Recognition · Computer Science 2025-04-03 Anket Mehra, Malte Prieß, Marian Himstedt

Deep Clustering With Consensus Representations

The field of deep clustering combines deep learning and clustering to learn representations that improve both the learned representation and the performance of the considered clustering method. Most existing deep clustering methods are…

Machine Learning · Computer Science 2023-02-22 Lukas Miklautz, Martin Teuffenbach, Pascal Weber, Rona Perjuci, Walid Durani, Christian Böhm, Claudia Plant

InferBench: Understanding Deep Learning Inference Serving with an Automatic Benchmarking System

Deep learning (DL) models have become core modules for many applications. However, deploying these models without careful performance benchmarking that considers both hardware and software's impact often leads to poor service and costly…

Machine Learning · Computer Science 2021-01-06 Huaizheng Zhang, Yizheng Huang, Yonggang Wen, Jianxiong Yin, Kyle Guan

MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process.…

Artificial Intelligence · Computer Science 2026-03-31 Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, Yue Deng, Bin Wang, Yifan Zhang, Liangcai Su, Xinyu Wang, He Zhao, Chen Wei, Qiang Ren, Bryan Hooi, An Bo, Shuicheng Yan, Lidong Bing

Support for Debugging Automatically Parallelized Programs

We describe a system that simplifies the process of debugging programs produced by computer-aided parallelization tools. The system uses relative debugging techniques to compare serial and parallel executions in order to show where the…

Software Engineering · Computer Science 2007-05-23 Robert Hood, Gabriele Jost

DERAIL: Diagnostic Environments for Reward And Imitation Learning

The objective of many real-world tasks is complex and difficult to procedurally specify. This makes it necessary to use reward or imitation learning algorithms to infer a reward or policy directly from human data. Existing benchmarks for…

Machine Learning · Computer Science 2020-12-03 Pedro Freire, Adam Gleave, Sam Toyer, Stuart Russell

Interpretable Deep Clustering for Tabular Data

Clustering is a fundamental learning task widely used as a first step in data analysis. For example, biologists use cluster assignments to analyze genome sequences, medical records, or images. Since downstream analysis is typically…

Machine Learning · Computer Science 2024-06-11 Jonathan Svirsky, Ofir Lindenbaum

A framework for benchmarking clustering algorithms

The evaluation of clustering algorithms can involve running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate…

Machine Learning · Computer Science 2023-10-27 Marek Gagolewski

Eureka: Evaluating and Understanding Large Foundation Models

Rigorous and reproducible evaluation is critical for assessing the state of the art and for guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice for several reasons, including benchmark…

Machine Learning · Computer Science 2024-09-18 Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, Safoora Yousefi

MREC: a fast and versatile framework for aligning and matching point clouds with applications to single cell molecular data

Comparing and aligning large datasets is a pervasive problem occurring across many different knowledge domains. We introduce and study MREC, a recursive decomposition algorithm for computing matchings between data sets. The basic idea is to…

Machine Learning · Statistics 2020-02-24 Andrew J. Blumberg, Mathieu Carriere, Michael A. Mandell, Raul Rabadan, Soledad Villar

Deep Clustering via Joint Convolutional Autoencoder Embedding and Relative Entropy Minimization

Image clustering is one of the most important computer vision applications and has been extensively studied in the literature. However, current clustering methods mostly suffer from a lack of efficiency and scalability when dealing with…

Machine Learning · Computer Science 2017-08-10 Kamran Ghasedi Dizaji, Amirhossein Herandi, Cheng Deng, Weidong Cai, Heng Huang

An Integrated Framework for Process Discovery Algorithm Evaluation

Process mining offers techniques to exploit event data by providing insights and recommendations to improve business processes. The growing number of algorithms for process discovery has raised the question of which algorithms perform best…

Software Engineering · Computer Science 2018-06-20 Toon Jouck, Alfredo Bolt, Benoît Depaire, Massimiliano de Leoni, Wil M. P. van der Aalst