Related papers: A New Framework for Expressing, Parallelizing and …

Petuum: A New Platform for Distributed Machine Learning on Big Data

What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization…

Machine Learning · Statistics 2015-05-18 Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, Yaoliang Yu

Comparative Analysis of Optimization Strategies for K-means Clustering in Big Data Contexts: A Review

This paper presents a comparative analysis of different optimization techniques for the K-means algorithm in the context of big data. K-means is a widely used clustering algorithm, but it can suffer from scalability issues when dealing with…

Machine Learning · Computer Science 2024-05-21 Ravil Mussabayev, Rustam Mussabayev

An Experimental Survey on Big Data Frameworks

Recently, increasingly large amounts of data are generated from a variety of sources. Existing data processing technologies are not suitable to cope with the huge amounts of generated data. Yet, many research works focus on Big Data, a…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-07 Wissem Inoubli, Sabeur Aridhi, Haithem Mezni, Mondher Maddouri, Engelbert Mephu Nguifo

PageRank Pipeline Benchmark: Proposal for a Holistic System Benchmark for Big-Data Platforms

The rise of big data systems has created a need for benchmarks to measure and compare the capabilities of these systems. Big data benchmarks present unique scalability challenges. The supercomputing community has wrestled with these…

Performance · Computer Science 2016-12-13 Patrick Dreher, Chansup Byun, Chris Hill, Vijay Gadepally, Bradley Kuszmaul, Jeremy Kepner

BigFCM: Fast, Precise and Scalable FCM on Hadoop

Clustering plays an important role in mining big data both as a modeling technique and a preprocessing step in many data mining process implementations. Fuzzy clustering provides more flexibility than non-fuzzy methods by allowing each data…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-26 Nasser Ghadiri, Meysam Ghaffari, Mohammad Amin Nikbakht

High Performance Dataframes from Parallel Processing Patterns

The data science community today has embraced the concept of Dataframes as the de facto standard for data representation and manipulation. Ease of use, massive operator coverage, and popularization of R and Python languages have heavily…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-06 Niranda Perera, Supun Kamburugamuve, Chathura Widanage, Vibhatha Abeykoon, Ahmet Uyar, Kaiying Shan, Hasara Maithree, Damitha Lenadora, Thejaka Amila Kanewala, Geoffrey Fox

ENFrame: A Platform for Processing Probabilistic Data

This paper introduces ENFrame, a unified data processing platform for querying and mining probabilistic data. Using ENFrame, users can write programs in a fragment of Python with constructs such as bounded-range loops, list comprehension,…

Databases · Computer Science 2013-09-03 Sebastiaan J. van Schaik, Dan Olteanu, Robert Fink

Superior Parallel Big Data Clustering through Competitive Stochastic Sample Size Optimization in Big-means

This paper introduces a novel K-means clustering algorithm, an advancement on the conventional Big-means methodology. The proposed method efficiently integrates parallel processing, stochastic sampling, and competitive optimization to…

Machine Learning · Computer Science 2024-03-28 Rustam Mussabayev, Ravil Mussabayev

How to Use K-means for Big Data Clustering?

K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drastically drops when applied to vast amounts of…

Machine Learning · Computer Science 2023-11-27 Rustam Mussabayev, Nenad Mladenovic, Bassem Jarboui, Ravil Mussabayev

Efficient Computation of the Well-Founded Semantics over Big Data

Data originating from the Web, sensor readings and social media result in increasingly huge datasets. The so-called Big Data comes with new scientific and technological challenges while creating new opportunities, hence the increasing…

Artificial Intelligence · Computer Science 2020-02-19 Ilias Tachmazidis, Grigoris Antoniou, Wolfgang Faber

KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics

Modern advanced analytics applications make use of machine learning techniques and contain multiple steps of domain-specific and general-purpose processing with high resource requirements. We present KeystoneML, a system that captures and…

Machine Learning · Computer Science 2016-11-01 Evan R. Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael J. Franklin, Benjamin Recht

AFrame: Extending DataFrames for Large-Scale Modern Data Analysis (Extended Version)

Analyzing the increasingly large volumes of data that are available today, possibly including the application of custom machine learning models, requires the utilization of distributed frameworks. This can result in serious productivity…

Databases · Computer Science 2019-08-20 Phanwadee Sinthong, Michael J. Carey

HEP-Frame: an Efficient Tool for Big Data Applications at the LHC

HEP-Frame is a new C++ package designed to efficiently perform analyses of data sets from a very large number of events, like those available at the Large Hadron Collider (LHC) at CERN, Geneva. It mainly targets high performance servers and…

High Energy Physics - Experiment · Physics 2023-03-10 A. Pereira, A. Onofre, A. Proenca

PRIMEBALL: a Parallel Processing Framework Benchmark for Big Data Applications in the Cloud

In this paper, we draw the specifications of a novel benchmark for comparing parallel processing frameworks in the context of big data applications hosted in the cloud. We aim at filling several gaps in already existing cloud data…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-12-24 Jaume Ferrarons, Mulu Adhana, Carlos Colmenares, Sandra Pietrowska, Fadila Bentayeb, Jérôme Darmont

Automated Document Indexing via Intelligent Hierarchical Clustering: A Novel Approach

With the rising quantity of textual data available in electronic format, the need to organize it has become a highly challenging task. In the present paper, we explore a document organization framework that exploits an intelligent hierarchical…

Information Retrieval · Computer Science 2015-04-02 Rajendra Kumar Roul, Shubham Rohan Asthana, Sanjay Kumar Sahay

High-performance K-means Implementation based on a Simplified Map-Reduce Architecture

The k-means algorithm is one of the most common clustering algorithms and widely used in data mining and pattern recognition. The increasing computational requirement of big data applications makes hardware acceleration for the k-means…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-23 Zhehao Li, Jifang Jin, Lingli Wang

A Framework for Model Search Across Multiple Machine Learning Implementations

Several recently devised machine learning (ML) algorithms have shown improved accuracy for various predictive problems. Model searches, which explore to find an optimal ML algorithm and hyperparameter values for the target problem, play a…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-28 Yoshiki Takahashi, Masato Asahara, Kazuyuki Shudo

Lifting C Semantics for Dataflow Optimization

C is the lingua franca of programming and almost any device can be programmed using C. However, programming modern heterogeneous architectures such as multi-core CPUs and GPUs requires explicitly expressing parallelism as well as…

Programming Languages · Computer Science 2022-05-25 Alexandru Calotoiu, Tal Ben-Nun, Grzegorz Kwasniewski, Johannes de Fine Licht, Timo Schneider, Philipp Schaad, Torsten Hoefler

LLM4Ranking: An Easy-to-use Framework of Utilizing Large Language Models for Document Reranking

Utilizing large language models (LLMs) for document reranking has been a popular and promising research direction in recent years, and many studies are dedicated to improving the performance and efficiency of using LLMs for reranking. Besides,…

Information Retrieval · Computer Science 2025-04-11 Qi Liu, Haozhe Duan, Yiqun Chen, Quanfeng Lu, Weiwei Sun, Jiaxin Mao

Quegel: A General-Purpose Query-Centric Framework for Querying Big Graphs

Pioneered by Google's Pregel, many distributed systems have been developed for large-scale graph analytics. These systems expose the user-friendly "think like a vertex" programming interface to users, and exhibit good horizontal…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-01-26 Da Yan, James Cheng, M. Tamer Özsu, Fan Yang, Yi Lu, John C. S. Lui, Qizhen Zhang, Wilfred Ng