Related papers: Evaluating Large Language Models on Spatial Tasks:…

Correctness Comparison of ChatGPT-4, Gemini, Claude-3, and Copilot for Spatial Tasks

Generative AI including large language models (LLMs) has recently gained significant interest in the geo-science community through its versatile task-solving capabilities including programming, arithmetic reasoning, generation of sample…

Computers and Society · Computer Science 2024-08-14 Hartwig H. Hochmair, Levente Juhasz, Takoda Kemp

Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark

Artificial intelligence (AI) has made remarkable progress across various domains, with large language models like ChatGPT gaining substantial attention for their human-like text-generation capabilities. Despite these achievements, spatial…

Artificial Intelligence · Computer Science 2024-01-11 Fangjun Li, David C. Hogg, Anthony G. Cohn

Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency

As large language models (LLMs) continue to advance, evaluating their comprehensive capabilities becomes significant for their application in various fields. This research study comprehensively evaluates the language, vision, speech, and…

Artificial Intelligence · Computer Science 2024-07-16 Sakib Shahriar, Brady Lund, Nishith Reddy Mannuru, Muhammad Arbab Arshad, Kadhim Hayawi, Ravi Varma Kumar Bevara, Aashrith Mannuru, Laiba Batool

Evaluating Prompting Strategies and Large Language Models in Systematic Literature Review Screening: Relevance and Task-Stage Classification

This study quantifies how prompting strategies interact with large language models (LLMs) to automate the screening stage of systematic literature reviews (SLRs). We evaluate six LLMs (GPT-4o, GPT-4o-mini, DeepSeek-Chat-V3,…

Computation and Language · Computer Science 2025-10-21 Binglan Han, Anuradha Mathrani, Teo Susnjak

On the Planning, Search, and Memorization Capabilities of Large Language Models

The rapid advancement of large language models, such as the Generative Pre-trained Transformer (GPT) series, has had significant implications across various disciplines. In this study, we investigate the potential of the state-of-the-art…

Computation and Language · Computer Science 2023-09-06 Yunhao Yang, Anshul Tomar

Evaluation of LLMs for mathematical problem solving

Large Language Models (LLMs) have shown impressive performance on a range of educational tasks, but are still understudied for their potential to solve mathematical problems. In this study, we compare three prominent LLMs, including GPT-4o,…

Artificial Intelligence · Computer Science 2025-07-01 Ruonan Wang, Runxi Wang, Yunwen Shen, Chengfeng Wu, Qinglin Zhou, Rohitash Chandra

Can Large Language Models be Good Path Planners? A Benchmark and Investigation on Spatial-temporal Reasoning

Large language models (LLMs) have achieved remarkable success across a wide spectrum of tasks; however, they still face limitations in scenarios that demand long-term planning and spatial reasoning. To facilitate this line of research, in…

Computation and Language · Computer Science 2025-02-25 Mohamed Aghzal, Erion Plaku, Ziyu Yao

Is ChatGPT a Biomedical Expert? -- Exploring the Zero-Shot Performance of Current GPT Models in Biomedical Tasks

We assessed the performance of commercial Large Language Models (LLMs) GPT-3.5-Turbo and GPT-4 on tasks from the 2023 BioASQ challenge. In Task 11b Phase B, which is focused on answer generation, both models demonstrated competitive…

Computation and Language · Computer Science 2023-07-25 Samy Ateia, Udo Kruschwitz

Language Models are Few-Shot Learners

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires…

Computation and Language · Computer Science 2020-07-24 Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

Analyzing Prominent LLMs: An Empirical Study of Performance and Complexity in Solving LeetCode Problems

Large Language Models (LLMs) like ChatGPT, Copilot, Gemini, and DeepSeek are transforming software engineering by automating key tasks, including code generation, testing, and debugging. As these models become integral to development…

Software Engineering · Computer Science 2025-08-07 Everton Guimaraes, Nathalia Nascimento, Chandan Shivalingaiah, Asish Nelapati

Performance Comparison of Large Language Models on Advanced Calculus Problems

This paper presents an in-depth analysis of the performance of seven different Large Language Models (LLMs) in solving a diverse set of advanced calculus problems. The study aims to evaluate these models' accuracy, reliability, and…

Computation and Language · Computer Science 2025-03-07 In Hak Moon

The use of GPT-4o and Other Large Language Models for the Improvement and Design of Self-Assessment Scales for Measurement of Interpersonal Communication Skills

OpenAI's ChatGPT (GPT-4 and GPT-4o) and other Large Language Models (LLMs) like Microsoft's Copilot, Google's Gemini 1.5 Pro, and Anthropic's Claude 3.5 Sonnet can be effectively used in various phases of scientific research. Their…

Artificial Intelligence · Computer Science 2024-09-24 Goran Bubaš

Optimizing Multi-Task Learning for Enhanced Performance in Large Language Models

This study aims to explore the performance improvement method of large language models based on GPT-4 under the multi-task learning framework and conducts experiments on two tasks: text classification and automatic summary generation.…

Computation and Language · Computer Science 2024-12-10 Zhen Qi, Jiajing Chen, Shuo Wang, Bingying Liu, Hongye Zheng, Chihang Wang

Few-shot Learning with Multilingual Language Models

Large-scale generative language models such as GPT-3 are competitive few-shot learners. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting…

Computation and Language · Computer Science 2022-11-11 Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li

Go-tuning: Improving Zero-shot Learning Abilities of Smaller Language Models

With increasing scale, large language models demonstrate both quantitative improvement and new qualitative capabilities, especially as zero-shot learners, like GPT-3. However, these results rely heavily on delicate prompt design and large…

Computation and Language · Computer Science 2022-12-21 Jingjing Xu, Qingxiu Dong, Hongyi Liu, Lei Li

Preliminary Explorations with GPT-4o(mni) Native Image Generation

Recently, the visual generation ability by GPT-4o(mni) has been unlocked by OpenAI. It demonstrates a very remarkable generation capability with excellent multimodal condition understanding and varied task instructions. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2025-05-12 Pu Cao, Feng Zhou, Junyi Ji, Qingye Kong, Zhixiang Lv, Mingjian Zhang, Xuekun Zhao, Siqi Wu, Yinghui Lin, Qing Song, Lu Yang

GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks

This paper establishes a benchmark for evaluating tool-calling capabilities of large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. We assess eight commercial LLMs (Claude Sonnet 3.5 and 4,…

Computation and Language · Computer Science 2025-10-23 Varvara Krechetova, Denis Kochedykov

Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers

ChatGPT is a large language model developed by OpenAI. Despite its impressive performance across various tasks, no prior work has investigated its capability in the biomedical domain yet. To this end, this paper aims to evaluate the…

Computation and Language · Computer Science 2023-08-25 Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, Jimmy Huang

Towards Supporting Penetration Testing Education with Large Language Models: an Evaluation and Comparison

Cybersecurity education is challenging and it is helpful for educators to understand Large Language Models' (LLMs') capabilities for supporting education. This study evaluates the effectiveness of LLMs in conducting a variety of penetration…

Cryptography and Security · Computer Science 2026-03-30 Martin Nizon-Deladoeuille, Brynjólfur Stefánsson, Helmut Neukirchen, Thomas Welsh

Evaluating the Performance of Large Language Models for SDG Mapping (Technical Report)

The use of large language models (LLMs) is expanding rapidly, and open-source versions are becoming available, offering users safer and more adaptable options. These models enable users to protect data privacy by eliminating the need to…

Machine Learning · Computer Science 2024-08-06 Hui Yin, Amir Aryani, Nakul Nambiar