Related papers: Scalable Formal Concept Analysis algorithm for lar…

A Big Data Analysis Framework Using Apache Spark and Deep Learning

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta, Hardeo Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag

Nuova frontiera della classificazione testuale: Big data e calcolo distribuito

This document was created in order to study the algorithms for the categorization of phrases and rank them using the facilities provided by the framework Apache Spark. Starting from the study illustrated in the publication "Classifying…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-22 Marco Covelli, Massimiliano Morrelli

Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduce Framework

While many existing formal concept analysis algorithms are efficient, they are typically unsuitable for distributed implementation. Taking the MapReduce (MR) framework as our inspiration we introduce a distributed approach for performing…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-10-10 Biao Xu, Ruairí de Fréin, Eric Robson, Mícheál Ó Foghlú

An Information Theoretic Feature Selection Framework for Big Data under Apache Spark

With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Among many techniques, feature selection has been growing in interest as an important tool to identify relevant features on…

Artificial Intelligence · Computer Science 2016-10-20 Sergio Ramírez-Gallego, Héctor Mouriño-Talín, David Martínez-Rego, Verónica Bolón-Canedo, José Manuel Benítez, Amparo Alonso-Betanzos, Francisco Herrera

Large-scale text processing pipeline with Apache Spark

In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has…

Computation and Language · Computer Science 2019-12-03 Alexey Svyatkovskiy, Kosuke Imai, Mary Kroeger, Yuki Shiraito

Formal Concept Lattice Representations and Algorithms for Hypergraphs

There is increasing focus on analyzing data represented as hypergraphs, which are better able to express complex relationships amongst entities than are graphs. Much of the critical information about hypergraph structure is available only…

Data Structures and Algorithms · Computer Science 2023-07-24 Michael G. Rawson, Audun Myers, Robert Green, Michael Robinson, Cliff Joslyn

On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper…

Databases · Computer Science 2015-07-10 Olivier Curé, Hubert Naacke, Mohamed-Amine Baazizi, Bernd Amann

Large-Scale Network Embedding in Apache Spark

Network embedding has been widely used in social recommendation and network analysis, such as recommendation systems and anomaly detection with graphs. However, most of previous approaches cannot handle large graphs efficiently, due to that…

Social and Information Networks · Computer Science 2025-10-30 Wenqing Lin

Distributed Programming via Safe Closure Passing

Programming systems incorporating aspects of functional programming, e.g., higher-order functions, are becoming increasingly popular for large-scale distributed programming. New frameworks such as Apache Spark leverage functional techniques…

Programming Languages · Computer Science 2016-02-12 Philipp Haller, Heather Miller

Scalable Readability Evaluation for Graph Layouts: 2D Geometric Distributed Algorithms

Graphs, consisting of vertices and edges, are vital for representing complex relationships in fields like social networks, finance, and blockchain. Visualizing these graphs helps analysts identify structural patterns, with readability…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-18 Sanggeon Yun

Modeling Scalability of Distributed Machine Learning

Present day machine learning is computationally intensive and processes large amounts of data. It is implemented in a distributed fashion in order to address these scalability issues. The work is parallelized across a number of computing…

Machine Learning · Computer Science 2017-03-28 Alexander Ulanov, Andrey Simanovsky, Manish Marwah

Declarative Data Pipeline for Large Scale ML Services

Modern distributed data processing systems struggle to balance performance, maintainability, and developer productivity when integrating machine learning at scale. These challenges intensify in large collaborative environments due to high…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-07 Yunzhao Yang, Runhui Wang, Xuanqing Liu, Adit Krishnan, Yefan Tao, Yuqian Deng, Kuangyou Yao, Peiyuan Sun, Henrik Johnson, Aditi sinha, Davor Golac, Gerald Friedland, Usman Shakeel, Daryl Cooke, Joe Sullivan, Madhusudhanan Chandrasekaran, Chris Kong

Intelligent Spark Agents: A Modular LangGraph Framework for Scalable, Visualized, and Enhanced Big Data Machine Learning Workflows

This paper presents a Spark-based modular LangGraph framework, designed to enhance machine learning workflows through scalability, visualization, and intelligent process optimization. At its core, the framework introduces Agent AI, a…

Artificial Intelligence · Computer Science 2024-12-09 Jialin Wang, Zhihua Duan

Understanding the Challenges and Assisting Developers with Developing Spark Applications

To process data more efficiently, big data frameworks provide data abstractions to developers. However, due to the abstraction, there may be many challenges for developers to understand and debug the data processing code. To uncover the…

Software Engineering · Computer Science 2021-03-29 Zehao Wang

A Distributed Automatic Domain-Specific Multi-Word Term Recognition Architecture using Spark Ecosystem

Automatic Term Recognition is used to extract domain-specific terms that belong to a given domain. In order to be accurate, these corpus and language-dependent methods require large volumes of textual data that need to be processed to…

Computation and Language · Computer Science 2023-05-29 Ciprian-Octavian Truică, Neculai-Ovidiu Istrate, Elena-Simona Apostol

Distributed-Memory Vertex-Centric Network Embedding for Large-Scale Graphs

Network embedding is an important step in many different computations based on graph data. However, existing approaches are limited to small or middle size graphs with fewer than a million edges. In practice, web or social network graphs…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-09 Sara Riazi, Boyana Norris

Formalising Concepts as Grounded Abstractions

The notion of concept has been studied for centuries, by philosophers, linguists, cognitive scientists, and researchers in artificial intelligence (Margolis & Laurence, 1999). There is a large literature on formal, mathematical models of…

Artificial Intelligence · Computer Science 2021-01-14 Stephen Clark, Alexander Lerchner, Tamara von Glehn, Olivier Tieleman, Richard Tanburn, Misha Dashevskiy, Matko Bosnjak

Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation

Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-01 Subhadip Mitra

Mobile Big Data Analytics Using Deep Learning and Apache Spark

The proliferation of mobile devices, such as smartphones and Internet of Things (IoT) gadgets, results in the recent mobile big data (MBD) era. Collecting MBD is unprofitable unless suitable analytics and learning methods are utilized for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-16 Mohammad Abu Alsheikh, Dusit Niyato, Shaowei Lin, Hwee-Pink Tan, Zhu Han

Large-Scale Intelligent Microservices

Deploying Machine Learning (ML) algorithms within databases is a challenge due to the varied computational footprints of modern ML algorithms and the myriad of database technologies each with its own restrictive syntax. We introduce an…

Artificial Intelligence · Computer Science 2022-03-17 Mark Hamilton, Nick Gonsalves, Christina Lee, Anand Raman, Brendan Walsh, Siddhartha Prasad, Dalitso Banda, Lucy Zhang, Mei Gao, Lei Zhang, William T. Freeman