Related papers: PolyFrame: A Retargetable Query-based Approach to …

200 papers

Analyzing the increasingly large volumes of data that are available today, possibly including the application of custom machine learning models, requires the utilization of distributed frameworks. This can result in serious productivity…

Databases · Computer Science 2019-08-20 Phanwadee Sinthong, Michael J. Carey

As data volumes grow across applications, analytics of large amounts of data is becoming increasingly important. Big data processing frameworks such as Apache Hadoop, Apache AsterixDB, and Apache Spark have been built to meet this demand. A…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-15 Avinash Kumar

Dataframes are a popular abstraction to represent, prepare, and analyze data. Despite the remarkable success of dataframe libraries in R and Python, dataframes face performance issues even on moderately large datasets. Moreover, there is…

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta, Hardeo Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag

PhyloFrame is a Python library for phylogenetic computation targeting the gap between specialist, compiler-optimized operations and flexible, script-based workflows -- with emphasis on fast, memory-efficient operations for very large tree…

Populations and Evolution · Quantitative Biology 2026-05-28 Matthew Andres Moreno, Jeet Sukumaran, Luis Zaman, Emily Dolson

The wide use of XML for document management and data exchange has created the need to query large repositories of XML data. To efficiently query such large data collections and take advantage of parallelism, we have implemented Apache…

Databases · Computer Science 2015-04-02 E. Preston Carman, Till Westmann, Vinayak R. Borkar, Michael J. Carey, Vassilis J. Tsotras

This paper introduces a novel approach to schema inference as an on-demand function integrated directly within a DBMS, targeting NoSQL databases where schema flexibility can create challenges. Unlike previous methods relying on external…

Databases · Computer Science 2024-11-21 Calvin Dani, Shiva Jahangiri, Thomas Hütter

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de-facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-09 Alexandru Uta, Bogdan Ghit, Ankur Dave, Jan Rellermeyer, Peter Boncz

Background. Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has been proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-10 Marco Capuccini, Martin Dahlö, Salman Toor, Ola Spjuth

Big Data today is being generated at an unprecedented rate from various sources such as sensors, applications, and devices, and it often needs to be enriched based on other reference information to support complex analytical queries.…

Databases · Computer Science 2020-08-18 Xikui Wang, Michael J. Carey

With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Among many techniques, feature selection has been growing in interest as an important tool to identify relevant features on…

The Apache Spark stack has enabled fast large-scale data processing. Despite a rich library of statistical models and inference algorithms, it does not give domain users the ability to develop their own models. The emergence of…

Databases · Computer Science 2017-10-10 Zhuoyue Zhao, Jialing Pei, Eric Lo, Kenny Q. Zhu, Chris Liu

The use of large-scale machine learning methods is becoming ubiquitous in many applications ranging from business intelligence to self-driving cars. These methods require a complex computation pipeline consisting of various types of…

Databases · Computer Science 2021-11-10 Yongyang Yu, Mingjie Tang, Walid G. Aref

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-30 Bilal Akil, Ying Zhou, Uwe Röhm

Many distributed machine learning frameworks have recently been built to speed up the large-scale data learning process. However, most distributed machine learning used in these frameworks still uses an offline algorithm model which cannot…

Artificial Intelligence · Computer Science 2018-07-19 Mahardhika Pratama, Choiru Za'in, Eric Pardede

The challenges in feature selection, particularly in balancing model accuracy, interpretability, and computational efficiency, remain a critical issue in advancing machine learning methodologies. To address these complexities, this study…

Machine Learning · Computer Science 2026-01-06 Nachiket Kapure, Harsh Joshi, Parul Kumari, Rajeshwari Mistri, Manasi Mali

Nowadays, many scientific areas share the same broad requirements of being able to deal with massive and distributed datasets while, when possible, being integrated with services and applications. In order to solve the growing gap between…

Instrumentation and Methods for Astrophysics · Physics 2011-12-06 M. Brescia, S. Cavuoti, R. D'Abrusco, O. Laurino, G. Longo

Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper…

Databases · Computer Science 2015-07-10 Olivier Curé, Hubert Naacke, Mohamed-Amine Baazizi, Bernd Amann

Applications are increasingly written as dynamic workflows underpinned by an execution framework that manages asynchronous computations across distributed hardware. However, execution frameworks typically offer one-size-fits-all solutions…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-18 J. Gregory Pauloski, Klaudiusz Rydzy, Valerie Hayot-Sasson, Ian Foster, Kyle Chard

Parallel shared-nothing data management systems have been widely used to exploit a cluster of machines for efficient and scalable data processing. When a cluster needs to be dynamically scaled in or out, data must be efficiently rebalanced.…

Databases · Computer Science 2021-05-25 Chen Luo, Michael J. Carey