Related papers: Create Benchmarks for Data Lakes

Benchmarking Data Lakes Featuring Structured and Unstructured Data with DLBench

In the last few years, the concept of data lake has become trendy for data storage and analysis. Thus, several design alternatives have been proposed to build data lake systems. However, these proposals are difficult to evaluate as there…

Databases · Computer Science 2021-10-05 Pegdwendé Sawadogo , Jérôme Darmont

Data Lakes: A Survey of Functions and Systems

Data lakes are becoming increasingly prevalent for big data management and data analytics. In contrast to traditional 'schema-on-write' approaches such as data warehouses, data lakes are repositories storing raw data in its original formats…

Databases · Computer Science 2023-10-24 Rihan Hai , Christos Koutras , Christoph Quix , Matthias Jarke

LakeBench: Benchmarks for Data Discovery over Data Lakes

Within enterprises, there is a growing need to intelligently navigate data lakes, specifically focusing on data discovery. Of particular importance to enterprises is the ability to find related tables in data repositories. These tables can…

Databases · Computer Science 2023-07-11 Kavitha Srinivas , Julian Dolby , Ibrahim Abdelaziz , Oktie Hassanzadeh , Harsha Kokel , Aamod Khatiwada , Tejaswini Pedapati , Subhajit Chaudhury , Horst Samulowitz

LakeMLB: Data Lake Machine Learning Benchmark

Modern data lakes have emerged as foundational platforms for large-scale machine learning, enabling flexible storage of heterogeneous data and structured analytics through table-oriented abstractions. Despite their growing importance,…

Machine Learning · Computer Science 2026-02-12 Feiyu Pan , Tianbin Zhang , Aoqian Zhang , Yu Sun , Zheng Wang , Lixing Chen , Li Pan , Jianhua Li

Metadata Systems for Data Lakes: Models and Features

Over the past decade, the data lake concept has emerged as an alternative to data warehouses for storing and analyzing big data. A data lake allows storing data without any predefined schema. Therefore, data querying and analysis depend on…

Databases · Computer Science 2019-09-23 Pegdwendé Sawadogo , Etienne Scholly , Cécile Favre , Eric Ferey , Sabine Loudcher , Jérôme Darmont

Optimizing Data Lakes' Queries

Cloud data lakes provide a modern solution for managing large volumes of data. The fundamental principle behind these systems is the separation of compute and storage layers. In this architecture, inexpensive cloud storage is utilized for…

Databases · Computer Science 2025-10-20 Gregory Weintraub

On data lake architectures and metadata management

Over the past two decades, we have witnessed an exponential increase of data production in the world. So-called big data generally come from transactional systems, and even more so from the Internet of Things and social media. They are…

Databases · Computer Science 2021-07-26 Pegdwendé Sawadogo , Jérôme Darmont

A Big Data Lake for Multilevel Streaming Analytics

Large organizations are seeking to create new architectures and scalable platforms to effectively handle data management challenges due to the explosive nature of data rarely seen in the past. These data management challenges are largely…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-29 Ruoran Liu , Haruna Isah , Farhana Zulkernine

Joint Management and Analysis of Textual Documents and Tabular Data within the AUDAL Data Lake

In 2010, the concept of data lake emerged as an alternative to data warehouses for big data management. Data lakes follow a schema-on-read approach to provide rich and flexible analyses. However, although trendy in both the industry and…

Databases · Computer Science 2021-09-06 Pegdwendé Sawadogo , Jérôme Darmont , Camille Noûs

Towards Operationalizing Heterogeneous Data Discovery

Querying and exploring massive collections of data sources, such as data lakes, has been an essential research topic in the database community. Although many efforts have been paid in the field of data discovery and data integration in data…

Databases · Computer Science 2025-04-04 Jin Wang , Yanlin Feng , Chen Shen , Sajjadur Rahman , Eser Kandogan

PageRank Pipeline Benchmark: Proposal for a Holistic System Benchmark for Big-Data Platforms

The rise of big data systems has created a need for benchmarks to measure and compare the capabilities of these systems. Big data benchmarks present unique scalability challenges. The supercomputing community has wrestled with these…

Performance · Computer Science 2016-12-13 Patrick Dreher , Chansup Byun , Chris Hill , Vijay Gadepally , Bradley Kuszmaul , Jeremy Kepner

PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison

The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark…

Machine Learning · Computer Science 2017-03-03 Randal S. Olson , William La Cava , Patryk Orzechowski , Ryan J. Urbanowicz , Jason H. Moore

Dataset Discovery in Data Lakes

Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data…

Databases · Computer Science 2020-11-23 Alex Bogatu , Alvaro A. A. Fernandes , Norman W. Paton , Nikolaos Konstantinou

On Big Data Benchmarking

Big data systems address the challenges of capturing, storing, managing, analyzing, and visualizing big data. Within this context, developing benchmarks to evaluate and compare big data systems has become an active topic for both research…

Performance · Computer Science 2014-02-24 Rui Han , Xiaoyi Lu

Metadata Management for Textual Documents in Data Lakes

Data lakes have emerged as an alternative to data warehouses for the storage, exploration and analysis of big data. In a data lake, data are stored in a raw state and bear no explicit schema. Thence, an efficient metadata system is…

Databases · Computer Science 2019-05-13 Pegdwendé Sawadogo , Tokio Kibata , Jérôme Darmont

Model Lakes

Given a set of deep learning models, it can be hard to find models appropriate to a task, understand the models, and characterize how models are different one from another. Currently, practitioners rely on manually-written documentation to…

Databases · Computer Science 2025-02-24 Koyena Pal , David Bau , Renée J. Miller

Semantic Data Management in Data Lakes

In recent years, data lakes emerged as a way to manage large amounts of heterogeneous data for modern data analytics. One way to prevent data lakes from turning into inoperable data swamps is semantic data management. Some approaches propose…

Databases · Computer Science 2023-10-25 Sayed Hoseini , Johannes Theissen-Lipp , Christoph Quix

Modeling Data Lake Metadata with a Data Vault

With the rise of big data, business intelligence had to find solutions for managing even greater data volumes and variety than in data warehouses, which proved ill-adapted. Data lakes answer these needs from a storage point of view, but…

Databases · Computer Science 2018-07-12 Iuri Nogueira , Maram Romdhane , Jérôme Darmont

Benchmark Data Repositories for Better Benchmarking

In machine learning research, it is common to evaluate algorithms via their performance on standard benchmark datasets. While a growing body of work establishes guidelines for -- and levies criticisms at -- data and benchmarking practices…

Machine Learning · Computer Science 2024-11-01 Rachel Longjohn , Markelle Kelly , Sameer Singh , Padhraic Smyth

BatchBench: Toward a Workload-Aware Benchmark for Autoscaling Policies in Big Data Batch Processing -- A Proposed Framework

Autoscaling has become a baseline expectation for cloud-native big data processing, and the design space has expanded beyond rule-based heuristics to include learned controllers and, most recently, large language model (LLM) agents. Yet…

Information Retrieval · Computer Science 2026-05-13 Venkata Krishna Prasanth Budigi , Siri Chandana Sirigiri