Related papers: jp-evalb: Robust Alignment-based PARSEVAL Measures

RepEval: Effective Text Evaluation with LLM Representation

The era of Large Language Models (LLMs) raises new demands for automatic evaluation metrics, which should be adaptable to various application scenarios while maintaining low cost and effectiveness. Traditional metrics for automatic text…

Computation and Language · Computer Science · 2024-10-29 · Shuqian Sheng, Yi Xu, Tianhang Zhang, Zanwei Shen, Luoyi Fu, Jiaxin Ding, Lei Zhou, Xiaoying Gan, Xinbing Wang, Chenghu Zhou

JELV: A Judge of Edit-Level Validity for Evaluation and Automated Reference Expansion in Grammatical Error Correction

Existing Grammatical Error Correction (GEC) systems suffer from limited reference diversity, leading to underestimated evaluation and restricted model generalization. To address this issue, we introduce the Judge of Edit-Level Validity…

Computation and Language · Computer Science · 2025-12-09 · Yuhao Zhan, Yuqing Zhang, Jing Yuan, Qixiang Ma, Zhiqi Yang, Yu Gu, Zemin Liu, Fei Wu

Unsupervised Parsing via Constituency Tests

We propose a method for unsupervised parsing based on the linguistic notion of a constituency test. One type of constituency test involves modifying the sentence via some transformation (e.g. replacing the span with a pronoun) and then…

Computation and Language · Computer Science · 2020-10-08 · Steven Cao, Nikita Kitaev, Dan Klein

MABEL: Attenuating Gender Bias using Textual Entailment Data

Pre-trained language models encode undesirable social biases, which are further exacerbated in downstream use. To this end, we propose MABEL (a Method for Attenuating Gender Bias using Entailment Labels), an intermediate pre-training…

Computation and Language · Computer Science · 2022-10-28 · Jacqueline He, Mengzhou Xia, Christiane Fellbaum, Danqi Chen

One-Eval: An Agentic System for Automated and Traceable LLM Evaluation

Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation…

Computation and Language · Computer Science · 2026-03-11 · Chengyu Shen, Yanheng Hou, Minghui Pan, Runming He, Zhen Hao Wong, Meiyi Qiang, Zhou Liu, Hao Liang, Peichao Lai, Zeang Sheng, Wentao Zhang

ESCL: Equivariant Self-Contrastive Learning for Sentence Representations

Previous contrastive learning methods for sentence representations often focus on insensitive transformations to produce positive pairs, but neglect the role of sensitive transformations that are harmful to semantic representations.…

Computation and Language · Computer Science · 2023-03-10 · Jie Liu, Yixuan Liu, Xue Han, Chao Deng, Junlan Feng

Evaluating Mathematical Reasoning Beyond Accuracy

The leaderboard of Large Language Models (LLMs) in mathematical tasks has been continuously updated. However, the majority of evaluations focus solely on the final results, neglecting the quality of the intermediate steps. This oversight…

Computation and Language · Computer Science · 2025-01-15 · Shijie Xia, Xuefeng Li, Yixin Liu, Tongshuang Wu, Pengfei Liu

JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation

Reliable evaluation is essential for the development of vision-language models (VLMs). However, Japanese VQA benchmarks have undergone far less iterative refinement than their English counterparts. As a result, many existing benchmarks…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-07 · Issa Sugiura, Koki Maeda, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Naoaki Okazaki

SentAlign: Accurate and Scalable Sentence Alignment

We present SentAlign, an accurate sentence alignment tool designed to handle very large parallel document pairs. Given user-defined parameters, the alignment algorithm evaluates all possible alignment paths in fairly large documents of…

Computation and Language · Computer Science · 2023-11-16 · Steinþór Steingrímsson, Hrafn Loftsson, Andy Way

CLEV: LLM-Based Evaluation Through Lightweight Efficient Voting for Free-Form Question-Answering

Evaluating free-form Question Answering (QA) remains a challenge due to its diverse and open-ended nature. Traditional automatic metrics fail to capture semantic equivalence or accommodate the variability of open-ended responses. Leveraging…

Computation and Language · Computer Science · 2025-11-12 · Sher Badshah, Moamen Moustafa, Hassan Sajjad

PopEval: A Character-Level Approach to End-To-End Evaluation Compatible with Word-Level Benchmark Dataset

The most prevalent scope of interest for OCR applications used to be scanned documents, but it has now shifted towards the natural scene. Despite the change of times, the existing evaluation methods are still based on the old criteria…

Computer Vision and Pattern Recognition · Computer Science · 2019-08-30 · Hong-Seok Lee, Youngmin Yoon, Pil-Hoon Jang, Chankyu Choi

BatchEval: Towards Human-like Text Evaluation

Significant progress has been made in automatic text evaluation with the introduction of large language models (LLMs) as evaluators. However, the current sample-wise evaluation paradigm suffers from the following issues: (1) Sensitive to prompt…

Computation and Language · Computer Science · 2024-01-02 · Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Boyuan Pan, Heda Wang, Kan Li

SentEval: An Evaluation Toolkit for Universal Sentence Representations

We introduce SentEval, a toolkit for evaluating the quality of universal sentence representations. SentEval encompasses a variety of tasks, including binary and multi-class classification, natural language inference and sentence similarity.…

Computation and Language · Computer Science · 2018-03-16 · Alexis Conneau, Douwe Kiela

Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective

We focus on the weakly-supervised audio-visual video parsing task (AVVP), which aims to identify and locate all the events in audio/visual modalities. Previous works only concentrate on video-level overall label denoising across modalities,…

Computer Vision and Pattern Recognition · Computer Science · 2023-10-31 · Yingying Fan, Yu Wu, Bo Du, Yutian Lin

Better Late Than Never: Meta-Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation

Simultaneous speech-to-text translation systems must balance translation quality with latency. Although quality evaluation is well established, latency measurement remains a challenge. Existing metrics produce inconsistent results,…

Computation and Language · Computer Science · 2026-03-09 · Peter Polák, Sara Papi, Luisa Bentivogli, Ondřej Bojar

DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning

Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges: cross-modal misalignment bias…

Computation and Language · Computer Science · 2025-07-02 · Kang He, Yuzhe Ding, Haining Wang, Fei Li, Chong Teng, Donghong Ji

GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given human evaluation is expensive and difficult to scale, automated methods are…

Computer Vision and Pattern Recognition · Computer Science · 2023-10-19 · Dhruba Ghosh, Hanna Hajishirzi, Ludwig Schmidt

Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing

Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-18 · Yaru Chen, Ruohao Guo, Liting Gao, Yang Xiang, Qingyu Luo, Zhenbo Li, Wenwu Wang

REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation

Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-24 · Fulin Shi, Wenyi Xiao, Bin Chen, Liang Din, Leilei Gan

SEval-Ex: A Statement-Level Framework for Explainable Summarization Evaluation

Evaluating text summarization quality remains a critical challenge in Natural Language Processing. Current approaches face a trade-off between performance and interpretability. We present SEval-Ex, a framework that bridges this gap by…

Computation and Language · Computer Science · 2025-05-06 · Tanguy Herserant, Vincent Guigue