Related papers: SODA: A Semantics-Aware Optimization Framework for…

A Survey of Semantics-Aware Performance Optimization for Data-Intensive Computing

We are living in the era of Big Data and witnessing the explosion of data. Given the limitations of CPU and I/O in a single computer, the mainstream approach to scalability is to distribute computations among a large number of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-27 Bingbing Rao, Liqiang Wang

Auto Tuning of Hadoop and Spark parameters

Data of the order of terabytes, petabytes, or beyond is known as Big Data. This data cannot be processed using the traditional database software, and hence there comes the need for Big Data Platforms. By combining the capabilities and…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-05 Tanuja Patanshetti, Ashish Anil Pawar, Disha Patel, Sanket Thakare

SODA: Generating SQL for Business Users

The purpose of data warehouses is to enable business analysts to make better decisions. Over the years the technology has matured and data warehouses have become extremely successful. As a consequence, more and more data has been added to…

Databases · Computer Science 2012-07-03 Lukas Blunschi, Claudio Jossen, Donald Kossman, Magdalini Mori, Kurt Stockinger

DADA: Depth-aware Domain Adaptation in Semantic Segmentation

Unsupervised domain adaptation (UDA) is important for applications where large scale annotation of representative data is challenging. For semantic segmentation in particular, it helps deploy on real "target domain" data models that are…

Computer Vision and Pattern Recognition · Computer Science 2019-08-20 Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, Patrick Pérez

Performance Evaluation of Distributed Computing Environments with Hadoop and Spark Frameworks

Recently, due to the rapid development of information and communication technologies, data are being created and consumed at an avalanche-like rate. Distributed computing creates the preconditions for analyzing and processing such Big Data by distributing…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-30 Vladyslav Taran, Oleg Alienin, Sergii Stirenko, A. Rojbi, Yuri Gordienko

ROSA: R Optimizations with Static Analysis

R is a popular language and programming environment for data scientists. It is increasingly co-packaged with both relational and Hadoop-based data platforms and can often be the most dominant computational component in data analytics…

Programming Languages · Computer Science 2017-07-04 Rathijit Sen, Jianqiao Zhu, Jignesh M. Patel, Somesh Jha

SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer

Diffusion Transformers have become a dominant paradigm in visual generation, yet their low inference efficiency remains a key bottleneck hindering further advancement. Among common training-free techniques, caching offers high acceleration…

Computer Vision and Pattern Recognition · Computer Science 2026-03-25 Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian, Jingyong Su

Literature Study on Operational Data Analytics Frameworks in Large-scale Computing Infrastructures

By 2025, there are zettabytes of data generated every year. The size and complexity of modern large-scale computing infrastructures like High-Performance Computing (HPC) systems continue to evolve and become complex, leaving us wondering…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-20 Shekhar Suman, Xiaoyu Chu, Alexandru Iosup

Does Big Data Require Complex Systems? A Performance Comparison Between Spark and Unicage Shell Scripts

The paradigm of big data is characterized by the need to collect and process data sets of great volume, arriving at the systems with great velocity, in a variety of formats. Spark is a widely used big data processing system that can be…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-29 Duarte M. Nascimento, Miguel Ferreira, Miguel L. Pardal

SODA: Semantic-Oriented Distributional Alignment for Generative Recommendation

Generative recommendation has emerged as a scalable alternative to traditional retrieve-and-rank pipelines by operating in a compact token space. However, existing methods mainly rely on discrete code-level supervision, which leads to…

Information Retrieval · Computer Science 2026-03-03 Ziqi Xue, Dingxian Wang, Yimeng Bai, Shuai Zhu, Jialei Li, Xiaoyan Zhao, Frank Yang, Andrew Rabinovich, Yang Zhang, Pablo N. Mendes

SOFA: An Extensible Logical Optimizer for UDF-heavy Dataflows

Recent years have seen an increased interest in large-scale analytical dataflows on non-relational data. These dataflows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant…

Databases · Computer Science 2013-11-26 Astrid Rheinländer, Arvid Heise, Fabian Hueske, Ulf Leser, Felix Naumann

SCOPE: Scalable Composite Optimization for Learning on Spark

Many machine learning models, such as logistic regression (LR) and support vector machine (SVM), can be formulated as composite optimization problems. Recently, many distributed stochastic optimization (DSO) methods have been proposed to…

Machine Learning · Statistics 2016-12-13 Shen-Yi Zhao, Ru Xiang, Ying-Hao Shi, Peng Gao, Wu-Jun Li

Smaller but Better: Self-Paced Knowledge Distillation for Lightweight yet Effective LCMs

Large code models (LCMs) have remarkably advanced the field of code generation. Despite their impressive capabilities, they still face practical deployment issues, such as high inference costs, limited accessibility of proprietary LCMs, and…

Software Engineering · Computer Science 2025-05-21 Yujia Chen, Yang Ye, Zhongqi Li, Yuchi Ma, Cuiyun Gao

CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning

The emergence of large reasoning models demonstrates that scaling inference-time compute significantly enhances performance on complex tasks. However, it often falls into another trap: overthinking simple problems, where repetitive…

Computation and Language · Computer Science 2026-04-07 Siye Wu, Jian Xie, Yikai Zhang, Yanghua Xiao

On-the-Fly Data Augmentation via Gradient-Guided and Sample-Aware Influence Estimation

Data augmentation has been widely employed to improve the generalization of deep neural networks. Most existing methods apply fixed or random transformations. However, we find that sample difficulty evolves along with the model's…

Machine Learning · Computer Science 2025-10-02 Suorong Yang, Jie Zong, Lihang Wang, Ziheng Qin, Hai Gan, Pengfei Zhou, Kai Wang, Yang You, Furao Shen

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

Long contexts improve capabilities of large language models but pose serious hardware challenges: compute and memory footprints grow linearly with sequence length. Particularly, the decoding phase continuously accesses massive KV cache,…

Hardware Architecture · Computer Science 2026-04-29 Wang Fan, Wei Cao, Xi Zha, Kedi Ma, MingQian Sun, Jialin Chen, Fengzhe Zhang, Fan Zhang

hMDAP: A Hybrid Framework for Multi-paradigm Data Analytical Processing on Spark

We propose hMDAP, a hybrid framework for large-scale data analytical processing on Spark, to support multi-paradigm process (incl. OLAP, machine learning, and graph analysis etc.) in distributed environments. The framework features a…

Databases · Computer Science 2017-01-17 Xiaowang Zhang, Jiahui Zhang, Zhiyong Feng

Towards Automated Data Integration in Software Analytics

Software organizations want to be able to base their decisions on the latest set of available data and the real-time analytics derived from them. In order to support "real-time enterprise" for software organizations and provide information…

Software Engineering · Computer Science 2018-08-17 Silverio Martínez-Fernández, Petar Jovanovic, Xavier Franch, Andreas Jedlitschka

HybridTune: Spatio-temporal Data and Model Driven Performance Diagnosis for Big Data Systems

With tremendously growing interest in Big Data systems, analyzing and facilitating their performance improvement becomes increasingly important. Although there have been many research efforts to improve Big Data system performance, efficiently…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-22 Rui Ren, Jiechao Cheng, Xiwen He, Lei Wang, Chunjie Luo, Jianfeng Zhan

Towards General and Efficient Online Tuning for Spark

The distributed data analytics system Spark is a common choice for processing massive volumes of heterogeneous data, but it is challenging to tune its parameters to achieve high performance. Recent studies try to employ auto-tuning…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-06 Yang Li, Huaijun Jiang, Yu Shen, Yide Fang, Xiaofeng Yang, Danqing Huang, Xinyi Zhang, Wentao Zhang, Ce Zhang, Peng Chen, Bin Cui