Simeng Han — Scifaro

Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models

Large Language Models (LLMs) often struggle with computational efficiency and error propagation in multi-step reasoning tasks. While recent advancements on prompting and post-training have enabled LLMs to perform step-wise reasoning, they…

Artificial Intelligence · Computer Science 2026-05-08 Yuan Sui, Yufei He, Tri Cao, Simeng Han, Yulin Chen, Bryan Hooi

Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics

Evaluating the quality of LLM-generated reasoning traces in expert domains (e.g., law) is essential for ensuring credibility and explainability, yet remains challenging due to the inherent complexity of such reasoning tasks. We introduce…

Artificial Intelligence · Computer Science 2026-05-04 Jinu Lee, Kyoung-Woon On, Simeng Han, Arman Cohan, Julia Hockenmaier

TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots

Advances in AI have introduced several strong models in computational pathology to usher it into the era of multi-modal diagnosis, analysis, and interpretation. However, the current pathology-specific visual language models still lack…

Quantitative Methods · Quantitative Biology 2026-04-08 Tianyu Liu, Weihao Xuan, Hao Wu, Peter Humphrey, Marcello DiStasio, Mohamed Kahila, Alfonso Garcia Tan, Heli Qi, Rui Yang, Simeng Han, Tinglin Huang, Fang Wu, Chen Liu, Qingyu Chen, Nan Liu, Irene Li, Hua Xu, Hongyu Zhao

Advancing AI Research Assistants with Expert-Involved Learning

Large language models (LLMs) and large multimodal models (LMMs) promise to accelerate biomedical discovery, yet their reliability remains unclear. We introduce ARIEL (AI Research Assistant for Expert-in-the-Loop Learning), an open-source…

Artificial Intelligence · Computer Science 2026-04-08 Tianyu Liu, Simeng Han, Hanchen Wang, Xiao Luo, Pan Lu, Biqing Zhu, Yuge Wang, Keyi Li, Jiapeng Chen, Rihao Qu, Yufeng Liu, Xinyue Cui, Aviv Yaish, Yuhang Chen, Minsheng Hao, Chuhan Li, Kexing Li, Yinsheng Lu, Xinyu Wei, Qinzhe Xing, Antonia Panescu, Mengbo Wang, Vibha Annaswamy, Alicia Sanchez, Jack Cloherty, Arman Cohan, Hua Xu, Mark Gerstein, James Zou, Hongyu Zhao

ScratchEval: A Multimodal Evaluation Framework for LLMs in Block-Based Programming

LLMs have achieved strong performance on text-based programming tasks, yet they remain unreliable for block-based languages such as Scratch. Scratch programs exhibit deeply nested, non-linear structures, event-driven concurrency across…

Software Engineering · Computer Science 2026-02-03 Yuan Si, Simeng Han, Daming Li, Hanyuan Shi, Jialu Zhang

GraphIC: A Graph-Based In-Context Example Retrieval Model for Multi-Step Reasoning

In-context learning (ICL) enhances large language models (LLMs) by incorporating demonstration examples, yet its effectiveness heavily depends on the quality of selected examples. Current methods typically use text embeddings to measure…

Artificial Intelligence · Computer Science 2025-12-02 Jiale Fu, Yaqing Wang, Simeng Han, Jiaming Fan, Xu Yang

Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as 'safety' and 'robustness'…

Computation and Language · Computer Science 2025-11-10 Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, Hazel Kim, Hannah Rose Kirk, Fangru Lin, Gabrielle Kaili-May Liu, Lennart Luettgau, Jabez Magomere, Jonathan Rystrøm, Anna Sotnikova, Yushi Yang, Yilun Zhao, Adel Bibi, Antoine Bosselut, Ronald Clark, Arman Cohan, Jakob Foerster, Yarin Gal, Scott A. Hale, Inioluwa Deborah Raji, Christopher Summerfield, Philip H. S. Torr, Cozmin Ududec, Luc Rocher, Adam Mahdi

Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models

Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more…

Artificial Intelligence · Computer Science 2025-10-30 Simeng Han, Howard Dai, Stephen Xia, Grant Zhang, Chen Liu, Lichang Chen, Hoang Huy Nguyen, Hongyuan Mei, Jiayuan Mao, R. Thomas McCoy

Learning to Reason via Mixture-of-Thought for Logical Reasoning

Human beings naturally utilize multiple reasoning modalities to learn and solve logical problems, i.e., different representational formats such as natural language, code, and symbolic logic. In contrast, most existing LLM-based approaches…

Computation and Language · Computer Science 2025-06-11 Tong Zheng, Lichang Chen, Simeng Han, R. Thomas McCoy, Heng Huang

ATEB: Evaluating and Improving Advanced NLP Tasks for Text Embedding Models

Traditional text embedding benchmarks primarily evaluate embedding models' capabilities to capture semantic similarity. However, more advanced NLP tasks require a deeper understanding of text, such as safety and factuality. These tasks…

Computation and Language · Computer Science 2025-03-05 Simeng Han, Frank Palma Gomez, Tu Vu, Zefei Li, Daniel Cer, Hansi Zeng, Chris Tar, Arman Cohan, Gustavo Hernandez Abrego

HYBRIDMIND: Meta Selection of Natural Language and Symbolic Language for Enhanced LLM Reasoning

LLMs approach logical and mathematical reasoning through natural or symbolic languages. While natural language offers human-accessible flexibility but suffers from ambiguity, symbolic reasoning provides precise, machine-executable…

Computation and Language · Computer Science 2025-02-27 Simeng Han, Tianyu Liu, Chuhan Li, Xuyuan Xiong, Arman Cohan

Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems

Benchmarks are critical for measuring Large Language Model (LLM) reasoning capabilities. Some benchmarks have even become the de facto indicator of such capabilities. However, as LLM reasoning capabilities improve, existing widely-used…

Computation and Language · Computer Science 2025-02-26 Stephen Miner, Yoshiki Takashima, Simeng Han, Sam Kouteili, Ferhat Erata, Ruzica Piskac, Scott J Shapiro

P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains

Existing methods for understanding the capabilities of LLMs in logical reasoning rely on binary entailment classification or synthetically derived rationales, which are not sufficient for proper investigation of a model's capabilities. We…

Artificial Intelligence · Computer Science 2024-10-15 Simeng Han, Aaron Yu, Rui Shen, Zhenting Qi, Martin Riddell, Wenfei Zhou, Yujie Qiao, Yilun Zhao, Semih Yavuz, Ye Liu, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Dragomir Radev, Rex Ying, Arman Cohan

FOLIO: Natural Language Reasoning with First-Order Logic

Large language models (LLMs) have achieved remarkable performance on a variety of natural language understanding tasks. However, existing benchmarks are inadequate in measuring the complex logical reasoning capabilities of a model. We…

Computation and Language · Computer Science 2024-10-15 Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, Lucy Sun, Alex Wardle-Solano, Hannah Szabo, Ekaterina Zubova, Matthew Burtell, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Alexander R. Fabbri, Wojciech Kryscinski, Semih Yavuz, Ye Liu, Xi Victoria Lin, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Rex Ying, Arman Cohan, Dragomir Radev

Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

While large language models (LLMs) can already achieve strong performance on standard generic summarization benchmarks, their performance on more complex summarization task settings is less studied. Therefore, we benchmark LLMs on…

Computation and Language · Computer Science 2024-07-15 Yixin Liu, Alexander R. Fabbri, Jiawen Chen, Yilun Zhao, Simeng Han, Shafiq Joty, Pengfei Liu, Dragomir Radev, Chien-Sheng Wu, Arman Cohan

Optimizing Language Model's Reasoning Abilities with Weak Supervision

While Large Language Models (LLMs) have demonstrated proficiency in handling complex queries, much of the past work has depended on extensively annotated datasets by human experts. However, this reliance on fully-supervised annotations…

Computation and Language · Computer Science 2024-05-08 Yongqi Tong, Sizhe Wang, Dawei Li, Yifan Wang, Simeng Han, Zi Lin, Chengsong Huang, Jiaxin Huang, Jingbo Shang

Eliminating Reasoning via Inferring with Planning: A New Framework to Guide LLMs' Non-linear Thinking

Chain-of-Thought (CoT) prompting and its variants explore equipping large language models (LLMs) with high-level reasoning abilities by emulating human-like linear cognition and logic. However, the human mind is complicated and mixed with…

Computation and Language · Computer Science 2023-11-16 Yongqi Tong, Yifan Wang, Dawei Li, Sizhe Wang, Zi Lin, Simeng Han, Jingbo Shang

QTSumm: Query-Focused Summarization over Tabular Data

People primarily consult tables to conduct data analysis or answer specific questions. Text generation systems that can provide accurate table summaries tailored to users' information needs can facilitate more efficient access to relevant…

Computation and Language · Computer Science 2023-11-08 Yilun Zhao, Zhenting Qi, Linyong Nan, Boyu Mi, Yixin Liu, Weijin Zou, Simeng Han, Ruizhe Chen, Xiangru Tang, Yumo Xu, Dragomir Radev, Arman Cohan

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have…

Computation and Language · Computer Science 2023-06-07 Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, Ruilin Han, Simeng Han, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, Dragomir Radev

CREATIVESUMM: Shared Task on Automatic Summarization for Creative Writing

This paper introduces the shared task of summarizing documents in several creative domains, namely literary texts, movie scripts, and television scripts. Summarizing these creative documents requires making complex literary interpretations,…

Computation and Language · Computer Science 2022-12-08 Divyansh Agarwal, Alexander R. Fabbri, Simeng Han, Wojciech Kryściński, Faisal Ladhak, Bryan Li, Kathleen McKeown, Dragomir Radev, Tianyi Zhang, Sam Wiseman