Related papers: A Data-Centric Perspective on Evaluating Machine L…

TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks

Advances in machine learning research drive progress in real-world applications. To ensure this progress, it is important to understand the potential pitfalls on the way from a novel method's success on academic benchmarks to its practical…

Machine Learning · Computer Science 2024-10-25 Ivan Rubachev, Nikolay Kartashev, Yury Gorishniy, Artem Babenko

A Closer Look at Deep Learning Methods on Tabular Datasets

Tabular data is prevalent across diverse domains in machine learning. With the rapid progress of deep tabular prediction methods, especially pretrained (foundation) models, there is a growing need to evaluate these methods systematically…

Machine Learning · Computer Science 2025-11-10 Han-Jia Ye, Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, De-Chuan Zhan

Towards Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation

Tabular data is one of the most widely used formats across industries, driving critical applications in areas such as finance, healthcare, and marketing. In the era of data-centric AI, improving data quality and representation has become…

Machine Learning · Computer Science 2025-01-22 Dongjie Wang, Yanyong Huang, Wangyang Ying, Haoyue Bai, Nanxu Gong, Xinyuan Wang, Sixun Dong, Tao Zhe, Kunpeng Liu, Meng Xiao, Pengfei Wang, Pengyang Wang, Hui Xiong, Yanjie Fu

Tabular Data Generation Models: An In-Depth Survey and Performance Benchmarks with Extensive Tuning

The ability to train generative models that produce realistic, safe and useful tabular data is essential for data privacy, imputation, oversampling, explainability or simulation. However, generating tabular data is not straightforward due…

Machine Learning · Computer Science 2025-09-18 G. Charbel N. Kindji, Lina Maria Rojas-Barahona, Elisa Fromont, Tanguy Urvoy

A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning

Academic tabular benchmarks often contain small sets of curated features. In contrast, data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones. To prevent…

Machine Learning · Computer Science 2023-11-13 Valeriia Cherepanova, Roman Levin, Gowthami Somepalli, Jonas Geiping, C. Bayan Bruss, Andrew Gordon Wilson, Tom Goldstein, Micah Goldblum

Comparative Analysis of Transformers for Modeling Tabular Data: A Casestudy using Industry Scale Dataset

We perform a comparative analysis of transformer-based models designed for modeling tabular data, specifically on an industry-scale dataset. While earlier studies demonstrated promising outcomes on smaller public or synthetic datasets, the…

Machine Learning · Computer Science 2023-11-27 Usneek Singh, Piyush Arora, Shamika Ganesan, Mohit Kumar, Siddhant Kulkarni, Salil R. Joshi

Tabular Data: Is Deep Learning all you need?

Tabular data represent one of the most prevalent data formats in applied machine learning, largely because they accommodate a broad spectrum of real-world problems. Existing literature has studied many of the shortcomings of neural…

Machine Learning · Computer Science 2025-10-07 Guri Zabërgja, Arlind Kadra, Christian M. M. Frey, Josif Grabocka

Attention versus Contrastive Learning of Tabular Data -- A Data-centric Benchmarking

Despite groundbreaking success in image and text learning, deep learning has not achieved significant improvements against traditional machine learning (ML) when it comes to tabular data. This performance gap underscores the need for…

Machine Learning · Computer Science 2024-01-10 Shourav B. Rabbani, Ivan V. Medri, Manar D. Samad

Deep Learning within Tabular Data: Foundations, Challenges, Advances and Future Directions

Tabular data remains one of the most prevalent data types across a wide range of real-world applications, yet effective representation learning for this domain poses unique challenges due to its irregular patterns, heterogeneous feature…

Machine Learning · Computer Science 2025-01-08 Weijieying Ren, Tianxiang Zhao, Yuqing Huang, Vasant Honavar

Review of Data-centric Time Series Analysis from Sample, Feature, and Period

Data is essential to performing time series analysis utilizing machine learning approaches, whether for classic models or today's large language models. A good time-series dataset is advantageous for the model's accuracy, robustness, and…

Machine Learning · Computer Science 2024-04-29 Chenxi Sun, Hongyan Li, Yaliang Li, Shenda Hong

A Topological-Framework to Improve Analysis of Machine Learning Model Performance

As both machine learning models and the datasets on which they are evaluated have grown in size and complexity, the practice of using a few summary statistics to understand model performance has become increasingly problematic. This is…

Machine Learning · Computer Science 2021-07-13 Henry Kvinge, Colby Wight, Sarah Akers, Scott Howland, Woongjo Choi, Xiaolong Ma, Luke Gosink, Elizabeth Jurrus, Keerti Kappagantula, Tegan H. Emerson

Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark

Synthetic data serves as an alternative in training machine learning models, particularly when real-world data is limited or inaccessible. However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging…

Machine Learning · Computer Science 2023-10-27 Lasse Hansen, Nabeel Seedat, Mihaela van der Schaar, Andrija Petrovic

A Comprehensive Benchmark of Machine and Deep Learning Across Diverse Tabular Datasets

The analysis of tabular datasets is highly prevalent both in scientific research and real-world applications of Machine Learning (ML). Unlike many other ML tasks, Deep Learning (DL) models often do not outperform traditional methods in this…

Machine Learning · Computer Science 2024-08-28 Assaf Shmuel, Oren Glickman, Teddy Lazebnik

Auto-FP: An Experimental Study of Automated Feature Preprocessing for Tabular Data

Classical machine learning models, such as linear models and tree-based models, are widely used in industry. These models are sensitive to data distribution, thus feature preprocessing, which transforms features from one distribution to…

Machine Learning · Computer Science 2026-04-16 Danrui Qi, Jinglin Peng, Yongjun He, Jiannan Wang

Representation Learning for Tabular Data: A Comprehensive Survey

Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications. Models for learning from tabular data have continuously evolved, with Deep Neural Networks…

Machine Learning · Computer Science 2025-04-24 Jun-Peng Jiang, Si-Yang Liu, Hao-Run Cai, Qile Zhou, Han-Jia Ye

Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective

Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here software engineering needs to be re-thought where data…

Machine Learning · Computer Science 2022-12-27 Steven Euijong Whang, Yuji Roh, Hwanjun Song, Jae-Gil Lee

Deep Neural Networks and Tabular Data: A Survey

Heterogeneous tabular data are the most commonly used form of data and are essential for numerous critical and computationally demanding applications. On homogeneous data sets, deep neural networks have repeatedly shown excellent…

Machine Learning · Computer Science 2023-01-24 Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, Gjergji Kasneci

Homological Convolutional Neural Networks

Deep learning methods have demonstrated outstanding performances on classification and regression tasks on homogeneous data types (e.g., image, audio, and text data). However, tabular data still pose a challenge, with classic machine…

Machine Learning · Computer Science 2023-11-15 Antonio Briola, Yuanrong Wang, Silvia Bartolucci, Tomaso Aste

Data-Centric Machine Learning for Earth Observation: Necessary and Sufficient Features

The availability of temporal geospatial data in multiple modalities has been extensively leveraged to enhance the performance of machine learning models. While efforts on the design of adequate model architectures are approaching a level of…

Machine Learning · Computer Science 2024-08-22 Hiba Najjar, Marlon Nuske, Andreas Dengel

Rethinking Pre-Training in Tabular Data: A Neighborhood Embedding Perspective

Pre-training is prevalent in deep learning for vision and text data, leveraging knowledge from other datasets to enhance downstream tasks. However, for tabular data, the inherent heterogeneity in attribute and label spaces across datasets…

Machine Learning · Computer Science 2025-02-13 Han-Jia Ye, Qi-Le Zhou, Huai-Hong Yin, De-Chuan Zhan, Wei-Lun Chao