Related papers: A New Framework for Expressing, Parallelizing and …
What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization…
This paper presents a comparative analysis of different optimization techniques for the K-means algorithm in the context of big data. K-means is a widely used clustering algorithm, but it can suffer from scalability issues when dealing with…
Recently, increasingly large amounts of data are generated from a variety of sources. Existing data processing technologies are not suitable to cope with the huge amounts of generated data. Yet, many research works focus on Big Data, a…
The rise of big data systems has created a need for benchmarks to measure and compare the capabilities of these systems. Big data benchmarks present unique scalability challenges. The supercomputing community has wrestled with these…
Clustering plays an important role in mining big data both as a modeling technique and a preprocessing step in many data mining process implementations. Fuzzy clustering provides more flexibility than non-fuzzy methods by allowing each data…
The data science community today has embraced the concept of Dataframes as the de facto standard for data representation and manipulation. Ease of use, massive operator coverage, and popularization of R and Python languages have heavily…
This paper introduces ENFrame, a unified data processing platform for querying and mining probabilistic data. Using ENFrame, users can write programs in a fragment of Python with constructs such as bounded-range loops, list comprehension,…
This paper introduces a novel K-means clustering algorithm, an advancement on the conventional Big-means methodology. The proposed method efficiently integrates parallel processing, stochastic sampling, and competitive optimization to…
K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drastically drops when applied to vast amounts of…
Data originating from the Web, sensor readings and social media result in increasingly huge datasets. The so called Big Data comes with new scientific and technological challenges while creating new opportunities, hence the increasing…
Modern advanced analytics applications make use of machine learning techniques and contain multiple steps of domain-specific and general-purpose processing with high resource requirements. We present KeystoneML, a system that captures and…
Analyzing the increasingly large volumes of data that are available today, possibly including the application of custom machine learning models, requires the utilization of distributed frameworks. This can result in serious productivity…
HEP-Frame is a new C++ package designed to efficiently perform analyses of data sets from a very large number of events, like those available at the Large Hadron Collider (LHC) at CERN, Geneva. It mainly targets high performance servers and…
In this paper, we draw the specifications of a novel benchmark for comparing parallel processing frameworks in the context of big data applications hosted in the cloud. We aim at filling several gaps in already existing cloud data…
With the rising quantity of textual data available in electronic format, the need to organize it become a highly challenging task. In the present paper, we explore a document organization framework that exploits an intelligent hierarchical…
The k-means algorithm is one of the most common clustering algorithms and widely used in data mining and pattern recognition. The increasing computational requirement of big data applications makes hardware acceleration for the k-means…
Several recently devised machine learning (ML) algorithms have shown improved accuracy for various predictive problems. Model searches, which explore to find an optimal ML algorithm and hyperparameter values for the target problem, play a…
C is the lingua franca of programming and almost any device can be programmed using C. However, programming mod-ern heterogeneous architectures such as multi-core CPUs and GPUs requires explicitly expressing parallelism as well as…
Utilizing large language models (LLMs) for document reranking has been a popular and promising research direction in recent years, many studies are dedicated to improving the performance and efficiency of using LLMs for reranking. Besides,…
Pioneered by Google's Pregel, many distributed systems have been developed for large-scale graph analytics. These systems expose the user-friendly "think like a vertex" programming interface to users, and exhibit good horizontal…