Related papers: Constructing a Portfolio Optimization Benchmark Fr…

MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

There has been a surge in LLM evaluation research to understand LLM capabilities and limitations. However, much of this research has been confined to English, leaving LLM building and evaluation for non-English languages relatively…

Computation and Language · Computer Science · 2024-04-04 · Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

Decision-making is a complex process requiring diverse abilities, making it an excellent framework for evaluating Large Language Models (LLMs). Researchers have examined LLMs' decision-making through the lens of Game Theory. However,…

Artificial Intelligence · Computer Science · 2025-03-07 · Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Michael R. Lyu

An Evaluation Benchmark for Autoformalization in Lean4

Large Language Models (LLMs) hold the potential to revolutionize autoformalization. The introduction of Lean4, a mathematical programming language, presents an unprecedented opportunity to rigorously assess the autoformalization…

Machine Learning · Computer Science · 2024-06-12 · Aryan Gulati, Devanshu Ladsaria, Shubhra Mishra, Jasdeep Sidhu, Brando Miranda

Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

The rapid advancement of Large Language Models (LLMs) in the realm of mathematical reasoning necessitates comprehensive evaluations to gauge progress and inspire future directions. Existing assessments predominantly focus on problem-solving…

Computation and Language · Computer Science · 2024-06-05 · Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, Fuli Feng

Construction of a Japanese Financial Benchmark for Large Language Models

With the recent development of large language models (LLMs), models that focus on certain domains and languages have been discussed for their necessity. There is also a growing need for benchmarks to evaluate the performance of current LLMs…

Computational Finance · Quantitative Finance · 2024-03-25 · Masanori Hirano

Generating Robust Portfolios of Optimization Models using Large Language Models

Mathematical optimization is a powerful tool for structured decision-making across domains such as resource allocation and planning. Formulating optimization models faithful to reality, though, remains a significant bottleneck as it…

Artificial Intelligence · Computer Science · 2026-05-27 · Eleni Straitouri, Cheol Woo Kim, Milind Tambe

Investor risk profiles of large language models

This paper investigates how large language models (LLMs) form and express investor risk profiles, a critical component of retail investment advising. We examine three LLMs (GPT, Gemini, and Llama) and assess their responses to a…

Portfolio Management · Quantitative Finance · 2026-05-28 · Hanyong Cho, Geumil Bae, Jang Ho Kim

Benchmarking Large Language Models for Math Reasoning Tasks

The use of Large Language Models (LLMs) in mathematical reasoning has become a cornerstone of related research, demonstrating the intelligence of these models and enabling potential practical applications through their advanced performance,…

Computation and Language · Computer Science · 2024-12-20 · Kathrin Seßler, Yao Rong, Emek Gözlüklü, Enkelejda Kasneci

Evaluating the Limits of Large Language Models in Multilingual Legal Reasoning

In an era dominated by Large Language Models (LLMs), understanding their capabilities and limitations, especially in high-stakes fields like law, is crucial. While LLMs such as Meta's LLaMA, OpenAI's ChatGPT, Google's Gemini, DeepSeek, and…

Computation and Language · Computer Science · 2025-09-29 · Antreas Ioannou, Andreas Shiamishis, Nora Hollenstein, Nezihe Merve Gürel

Benchmarking Large Language Model Uncertainty for Prompt Optimization

Prompt optimization algorithms for Large Language Models (LLMs) excel in multi-step reasoning but still lack effective uncertainty estimation. This paper introduces a benchmark dataset to evaluate uncertainty metrics, focusing on Answer,…

Machine Learning · Computer Science · 2024-12-30 · Pei-Fu Guo, Yun-Da Tsai, Shou-De Lin

Evaluating Large Language Models in Process Mining: Capabilities, Benchmarks, and Evaluation Strategies

Using Large Language Models (LLMs) for Process Mining (PM) tasks is becoming increasingly essential, and initial approaches yield promising results. However, little attention has been given to developing strategies for evaluating and…

Databases · Computer Science · 2024-07-01 · Alessandro Berti, Humam Kourani, Hannes Hafke, Chiao-Yun Li, Daniel Schuster

Better to Ask in English: Evaluation of Large Language Models on English, Low-resource and Cross-Lingual Settings

Large Language Models (LLMs) are trained on massive amounts of data, enabling their application across diverse domains and tasks. Despite their remarkable performance, most LLMs are developed and evaluated primarily in English. Recently, a…

Computation and Language · Computer Science · 2024-10-18 · Krishno Dey, Prerona Tarannum, Md. Arid Hasan, Imran Razzak, Usman Naseem

Evaluating Large Language Models on Financial Report Summarization: An Empirical Study

In recent years, Large Language Models (LLMs) have demonstrated remarkable versatility across various applications, including natural language understanding, domain-specific knowledge tasks, etc. However, applying LLMs to complex,…

Computation and Language · Computer Science · 2024-11-12 · Xinqi Yang, Scott Zang, Yong Ren, Dingjie Peng, Zheng Wen

Peering Inside the Black Box: Uncovering LLM Errors in Optimization Modelling through Component-Level Evaluation

Large language models (LLMs) are increasingly used to convert natural language descriptions into mathematical optimization formulations. Current evaluations often treat formulations as a whole, relying on coarse metrics like solution…

Machine Learning · Computer Science · 2025-10-21 · Dania Refai, Moataz Ahmed

FinSheet-Bench: From Simple Lookups to Complex Reasoning, Where LLMs Break on Financial Spreadsheets

While Large Language Models (LLMs) can accelerate text-heavy tasks in alternative investment due diligence, a gap remains in their ability to accurately extract and reason over structured tabular data from complex financial spreadsheets.…

Artificial Intelligence · Computer Science · 2026-03-10 · Jan Ravnik, Matjaž Ličen, Felix Bührmann, Bithiah Yuan, Felix Stinson, Tanvi Singh

Decision-informed Neural Networks with Large Language Model Integration for Portfolio Optimization

This paper addresses the critical disconnect between prediction and decision quality in portfolio optimization by integrating Large Language Models (LLMs) with decision-focused learning. We demonstrate both theoretically and empirically…

Portfolio Management · Quantitative Finance · 2025-02-04 · Yoontae Hwang, Yaxuan Kong, Stefan Zohren, Yongjae Lee

Understanding LLM Evaluator Behavior: A Structured Multi-Evaluator Framework for Merchant Risk Assessment

Large Language Models (LLMs) are increasingly used as evaluators of reasoning quality, yet their reliability and bias in payments-risk settings remain poorly understood. We introduce a structured multi-evaluator framework for assessing LLM…

Artificial Intelligence · Computer Science · 2026-02-06 · Liang Wang, Junpeng Wang, Chin-chia Michael Yeh, Yan Zheng, Jiarui Sun, Xiran Fan, Xin Dai, Yujie Fan, Yiwei Cai

Evaluation of LLMs for mathematical problem solving

Large Language Models (LLMs) have shown impressive performance on a range of educational tasks, but are still understudied for their potential to solve mathematical problems. In this study, we compare three prominent LLMs, including GPT-4o,…

Artificial Intelligence · Computer Science · 2025-07-01 · Ruonan Wang, Runxi Wang, Yunwen Shen, Chengfeng Wu, Qinglin Zhou, Rohitash Chandra

Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models

The performance of large language models (LLMs) on existing reasoning benchmarks has significantly improved over the past years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem…

Computation and Language · Computer Science · 2023-10-24 · Daman Arora, Himanshu Gaurav Singh, Mausam

Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models

As large language models (LLMs) increasingly permeate the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. Existing financial benchmarks often suffer from limited language and…

Computation and Language · Computer Science · 2025-12-09 · Xiaojun Wu, Junxi Liu, Huanyi Su, Zhouchi Lin, Yiyan Qi, Chengjin Xu, Jiajun Su, Jiajie Zhong, Fuwei Wang, Saizhuo Wang, Fengrui Hua, Jia Li, Jian Guo