Related papers: Benchmarking Generative Models on Computational Th…

Evaluating ChatGPT and GPT-4 for Visual Programming

Generative AI and large language models have the potential to drastically improve the landscape of computing education by automatically generating personalized feedback and content. Recent works have studied the capabilities of these models…

Machine Learning · Computer Science 2023-08-08 Adish Singla

Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment

Large language and multimodal models have shown remarkable success on various benchmarks focused on specific skills such as general-purpose programming, math word problem-solving, and visual question answering. However, it is unclear how…

Artificial Intelligence · Computer Science 2025-10-07 Chao Wen, Jacqueline Staub, Adish Singla

Examining the Usage of Generative AI Models in Student Learning Activities for Software Programming

The rise of Generative AI (GenAI) tools like ChatGPT has created new opportunities and challenges for computing education. Existing research has primarily focused on GenAI's ability to complete educational tasks and its impact on student…

Software Engineering · Computer Science 2025-11-18 Rufeng Chen, Shuaishuai Jiang, Jiyun Shen, AJung Moon, Lili Wei

Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors

Generative AI and large language models hold great promise in enhancing computing education by powering next-generation educational technologies for introductory programming. Recent works have studied these models for different scenarios…

Computers and Society · Computer Science 2023-08-02 Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, Gustavo Soares

Generative Modeling for Multi-task Visual Learning

Generative modeling has recently shown great promise in computer vision, but it has mostly focused on synthesizing visually realistic images. In this paper, motivated by multi-task learning of shareable feature representations, we consider…

Computer Vision and Pattern Recognition · Computer Science 2021-06-28 Zhipeng Bao, Martial Hebert, Yu-Xiong Wang

Benchmarking Large Language Models for Math Reasoning Tasks

The use of Large Language Models (LLMs) in mathematical reasoning has become a cornerstone of related research, demonstrating the intelligence of these models and enabling potential practical applications through their advanced performance,…

Computation and Language · Computer Science 2024-12-20 Kathrin Seßler, Yao Rong, Emek Gözlüklü, Enkelejda Kasneci

DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models

To advance the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM and systematically evaluating its mathematical creativity. This…

Artificial Intelligence · Computer Science 2025-05-14 Xiaoyang Chen, Xinan Dai, Yu Du, Qian Feng, Naixu Guo, Tingshuo Gu, Yuting Gao, Yingyi Gao, Xudong Han, Xiang Jiang, Yilin Jin, Hongyi Lin, Shisheng Lin, Xiangnan Li, Yuante Li, Yixing Li, Zhentao Lai, Zilu Ma, Yingrong Peng, Jiacheng Qian, Hao-Yu Sun, Jianbo Sun, Zirui Wang, Siwei Wu, Zian Wang, Bin Xu, Jianghao Xu, Yiyang Yu, Zichuan Yang, Hongji Zha, Ruichong Zhang

Measuring Vision-Language STEM Skills of Neural Models

We introduce a new challenge to test the STEM skills of neural models. The problems in the real world often require solutions, combining knowledge from STEM (science, technology, engineering, and math). Unlike existing datasets, our dataset…

Computation and Language · Computer Science 2024-05-24 Jianhao Shen, Ye Yuan, Srbuhi Mirzoyan, Ming Zhang, Chenguang Wang

Neural Task Synthesis for Visual Programming

Generative neural models hold great promise in enhancing programming education by synthesizing new content. We seek to design neural models that can automatically generate programming tasks for a given specification in the context of visual…

Machine Learning · Computer Science 2024-01-17 Victor-Alexandru Pădurean, Georgios Tzannetos, Adish Singla

Enhancing Computer Programming Education with LLMs: A Study on Effective Prompt Engineering for Python Code Generation

Large language models (LLMs) and prompt engineering hold significant potential for advancing computer programming education through personalized instruction. This paper explores this potential by investigating three critical research…

Artificial Intelligence · Computer Science 2024-07-09 Tianyu Wang, Nianjun Zhou, Zhixiong Chen

Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation

Generative AI and large language models hold great promise in enhancing programming education by generating individualized feedback and hints for learners. Recent works have primarily focused on improving the quality of generated feedback…

Machine Learning · Computer Science 2025-03-10 Nachiket Kotalwar, Alkis Gotovos, Adish Singla

An Eye for an AI: Evaluating GPT-4o's Visual Perception Skills and Geometric Reasoning Skills Using Computer Graphics Questions

CG (Computer Graphics) is a popular field of CS (Computer Science), but many students find this topic difficult due to it requiring a large number of skills, such as mathematics, programming, geometric reasoning, and creativity. Over the…

Artificial Intelligence · Computer Science 2024-10-23 Tony Haoran Feng, Paul Denny, Burkhard C. Wünsche, Andrew Luxton-Reilly, Jacqueline Whalley

A Symbolic Framework for Evaluating Mathematical Reasoning and Generalisation with Transformers

This paper proposes a methodology for generating and perturbing detailed derivations of equations at scale, aided by a symbolic engine, to evaluate the generalisability of Transformers to out-of-distribution mathematical reasoning problems.…

Computation and Language · Computer Science 2024-04-09 Jordan Meadows, Marco Valentino, Damien Teney, Andre Freitas

Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks

Large Language Models (LLMs) have demonstrated impressive capabilities in natural language and code generation, and are increasingly deployed as automatic judges of model outputs and learning activities. Yet, their behavior on structured…

Computation and Language · Computer Science 2025-11-25 H. M. Shadman Tabib, Jaber Ahmed Deedar

Generative Grading: Near Human-level Accuracy for Automated Feedback on Richly Structured Problems

Access to high-quality education at scale is limited by the difficulty of providing student feedback on open-ended assignments in structured domains like computer programming, graphics, and short response questions. This problem has proven…

Machine Learning · Computer Science 2021-03-25 Ali Malik, Mike Wu, Vrinda Vasavada, Jinpeng Song, Madison Coots, John Mitchell, Noah Goodman, Chris Piech

Are LLMs ready for Visualization?

Generative models have received a lot of attention in many areas of academia and the industry. Their capabilities span many areas, from the invention of images given a prompt to the generation of concrete code to solve a certain programming…

Human-Computer Interaction · Computer Science 2024-03-12 Pere-Pau Vázquez

Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding

Vision-Language Models (VLMs) have demonstrated impressive world knowledge across a wide range of tasks, making them promising candidates for embodied reasoning applications. However, existing benchmarks primarily evaluate the embodied…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Haotian Xue, Yunhao Ge, Yu Zeng, Zhaoshuo Li, Ming-Yu Liu, Yongxin Chen, Jiaojiao Fan

KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models

This paper investigates visual analogical reasoning in large multimodal models (LMMs) compared to human adults and children. A "visual analogy" is an abstract rule inferred from one image and applied to another. While benchmarks exist for…

Computer Vision and Pattern Recognition · Computer Science 2025-12-05 Eunice Yiu, Maan Qraitem, Anisa Noor Majhi, Charlie Wong, Yutong Bai, Shiry Ginosar, Alison Gopnik, Kate Saenko

Generating and Evaluating Tests for K-12 Students with Language Model Simulations: A Case Study on Sentence Reading Efficiency

Developing an educational test can be expensive and time-consuming, as each item must be written by experts and then evaluated by collecting hundreds of student responses. Moreover, many tests require multiple distinct sets of questions…

Computation and Language · Computer Science 2023-10-11 Eric Zelikman, Wanjing Anya Ma, Jasmine E. Tran, Diyi Yang, Jason D. Yeatman, Nick Haber

Investigating the Efficacy of Large Language Models in Reflective Assessment Methods through Chain of Thoughts Prompting

Large Language Models, such as Generative Pre-trained Transformer 3 (aka. GPT-3), have been developed to understand language through the analysis of extensive text data, allowing them to identify patterns and connections between words.…

Computation and Language · Computer Science 2023-10-03 Baphumelele Masikisiki, Vukosi Marivate, Yvette Hlope