Related papers: Data-driven Discovery with Large Generative Models

DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we…

Computation and Language · Computer Science 2024-07-03 Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, Peter Clark

Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs

Given the remarkable performance of Large Language Models (LLMs), an important question arises: Can LLMs conduct human-like scientific research and discover new knowledge, and act as an AI scientist? Scientific discovery is an iterative…

Machine Learning · Computer Science 2025-02-24 Tingting Chen, Srinivas Anumasa, Beibei Lin, Vedant Shah, Anirudh Goyal, Dianbo Liu

GPT in Data Science: A Practical Exploration of Model Selection

There is an increasing interest in leveraging Large Language Models (LLMs) for managing structured data and enhancing data science processes. Despite the potential benefits, this integration poses significant questions regarding their…

Artificial Intelligence · Computer Science 2023-11-21 Nathalia Nascimento, Cristina Tavares, Paulo Alencar, Donald Cowan

DataGen: Unified Synthetic Dataset Generation via Large Language Models

Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges…

Computation and Language · Computer Science 2025-11-18 Yue Huang, Siyuan Wu, Chujie Gao, Dongping Chen, Qihui Zhang, Yao Wan, Tianyi Zhou, Jianfeng Gao, Chaowei Xiao, Lichao Sun, Xiangliang Zhang

A Survey on Open Dataset Search in the LLM Era: Retrospectives and Perspectives

High-quality datasets are typically required for accomplishing data-driven tasks, such as training medical diagnosis models, predicting real-time traffic conditions, or conducting experiments to validate research hypotheses. Consequently,…

Information Retrieval · Computer Science 2025-09-03 Pengyue Li, Sheng Wang, Hua Dai, Zhiyu Chen, Zhifeng Bao, Brian D. Davison

LLM-ODE: Data-driven Discovery of Dynamical Systems with Large Language Models

Discovering the governing equations of dynamical systems is a central problem across many scientific disciplines. As experimental data become increasingly available, automated equation discovery methods offer a promising data-driven…

Machine Learning · Computer Science 2026-04-07 Amirmohammad Ziaei Bideh, Jonathan Gryak

Data-Driven Discovery of Interpretable Kalman Filter Variants through Large Language Models and Genetic Programming

Algorithmic discovery has traditionally relied on human ingenuity and extensive experimentation. Here we investigate whether a prominent scientific computing algorithm, the Kalman Filter, can be discovered through an automated, data-driven,…

Neural and Evolutionary Computing · Computer Science 2025-08-26 Vasileios Saketos, Sebastian Kaltenbach, Sergey Litvinov, Petros Koumoutsakos

A Survey on Hypothesis Generation for Scientific Discovery in the Era of Large Language Models

Hypothesis generation is a fundamental step in scientific discovery, yet it is increasingly challenged by information overload and disciplinary fragmentation. Recent advances in Large Language Models (LLMs) have sparked growing interest in…

Computation and Language · Computer Science 2025-04-09 Atilla Kaan Alkan, Shashwat Sourav, Maja Jablonska, Simone Astarita, Rishabh Chakrabarty, Nikhil Garuda, Pranav Khetarpal, Maciej Pióro, Dimitrios Tanoglidis, Kartheik G. Iyer, Mugdha S. Polimera, Michael J. Smith, Tirthankar Ghosal, Marc Huertas-Company, Sandor Kruk, Kevin Schawinski, Ioana Ciucă

Interpretable Machine Learning for Discovery: Statistical Challenges & Opportunities

New technologies have led to vast troves of large and complex datasets across many scientific domains and industries. People routinely use machine learning techniques to not only process, visualize, and make predictions from this big data,…

Machine Learning · Statistics 2023-08-04 Genevera I. Allen, Luqin Gan, Lili Zheng

A Survey on Generative Recommendation: Data, Model, and Tasks

Recommender systems serve as foundational infrastructure in modern information ecosystems, helping users navigate digital content and discover items aligned with their preferences. At their core, recommender systems address a fundamental…

Information Retrieval · Computer Science 2026-05-12 Min Hou, Le Wu, Yuxin Liao, Yonghui Yang, Zhen Zhang, Yu Wang, Changlong Zheng, Han Wu, Richang Hong

From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery

Large Language Models (LLMs) are catalyzing a paradigm shift in scientific discovery, evolving from task-specific automation tools into increasingly autonomous agents and fundamentally redefining research processes and human-AI…

Computation and Language · Computer Science 2025-09-18 Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, Yangqiu Song

Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents

Materials discovery and design are essential for advancing technology across various industries by enabling the development of application-specific materials. Recent research has leveraged Large Language Models (LLMs) to accelerate this…

Computation and Language · Computer Science 2025-02-11 Shrinidhi Kumbhar, Venkatesh Mishra, Kevin Coutinho, Divij Handa, Ashif Iquebal, Chitta Baral

From keywords to semantics: Perceptions of large language models in data discovery

Current approaches to data discovery match keywords between metadata and queries. This matching requires researchers to know the exact wording that other researchers previously used, creating a challenging process that could lead to missing…

Human-Computer Interaction · Computer Science 2025-10-03 Maura E Halstead, Mark A. Green, Caroline Jay, Richard Kingston, David Topping, Alexander Singleton

Scientific Hypothesis Generation and Validation: Methods, Datasets, and Future Directions

Large Language Models (LLMs) are transforming scientific hypothesis generation and validation by enabling information synthesis, latent relationship discovery, and reasoning augmentation. This survey provides a structured overview of…

Computation and Language · Computer Science 2025-05-09 Adithya Kulkarni, Fatimah Alotaibi, Xinyue Zeng, Longfeng Wu, Tong Zeng, Barry Menglong Yao, Minqian Liu, Shuaicheng Zhang, Lifu Huang, Dawei Zhou

Metadata-based Data Exploration with Retrieval-Augmented Generation for Large Language Models

Developing the capacity to effectively search for requisite datasets is an urgent requirement to assist data users in identifying relevant datasets considering the very limited available metadata. For this challenge, the utilization of…

Information Retrieval · Computer Science 2024-10-08 Teruaki Hayashi, Hiroki Sakaji, Jiayi Dai, Randy Goebel

Toward computational cumulative biology by combining models of biological datasets

A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine…

Quantitative Methods · Quantitative Biology 2015-06-19 Ali Faisal, Jaakko Peltonen, Elisabeth Georgii, Johan Rung, Samuel Kaski

The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4

In recent years, groundbreaking advancements in natural language processing have culminated in the emergence of powerful large language models (LLMs), which have showcased remarkable capabilities across a vast array of domains, including…

Computation and Language · Computer Science 2023-12-11 Microsoft Research AI4Science, Microsoft Azure Quantum

Model-Adaptive Interface Generation for Data-Driven Discovery

Discovery of new knowledge is increasingly data-driven, predicated on a team's ability to collaboratively create, find, analyze, retrieve, and share pertinent datasets over the duration of an investigation. This is especially true in the…

Human-Computer Interaction · Computer Science 2021-10-06 Hongsuda Tangmunarunkit, Aref Shafaeibejestan, Joshua Chudy, Karl Czajkowski, Robert Schuler, Carl Kesselman

Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation

Machine learning (ML) holds great promise for clinical applications but is often hindered by limited access to high-quality data due to privacy concerns, high costs, and long timelines associated with clinical trials. While large language…

Computation and Language · Computer Science 2026-03-27 Zerui Xu, Fang Wu, Yingzhou Lu, Yuanyuan Zhang, Yue Zhao

Large Causal Models for Temporal Causal Discovery

Causal discovery for both cross-sectional and temporal data has traditionally followed a dataset-specific paradigm, where a new model is fitted for each individual dataset. Such an approach limits the potential of multi-dataset pretraining.…

Machine Learning · Computer Science 2026-02-24 Nikolaos Kougioulis, Nikolaos Gkorgkolis, MingXue Wang, Bora Caglayan, Dario Simionato, Andrea Tonon, Ioannis Tsamardinos