PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database

Hui Sun; Yanfeng Ding; Liping Yi; Huidong Ma; Gang Wang; Xiaoguang Liu; Cheng Zhong; Wentong Cai

Machine Learning 2025-07-18 v1 Artificial Intelligence Computation and Language Databases

Abstract

Learning-based lossless compressors play a crucial role in the backup, storage, transmission, and management of large-scale genomic databases. However, their 1) inadequate compression ratio, 2) low compression and decompression throughput, and 3) poor compression robustness limit their adoption in both industry and academia. To address these challenges, we propose a novel Parallel Multi-Knowledge Learning-based Compressor (PMKLC) with four crucial designs: 1) we propose an automated multi-knowledge learning-based compression framework as the compressor's backbone to enhance compression ratio and robustness; 2) we design a GPU-accelerated $(s, k)$-mer encoder to optimize compression throughput and computing resource usage; 3) we introduce data block partitioning and Step-wise Model Passing (SMP) mechanisms for parallel acceleration; 4) we design two compression modes, PMKLC-S and PMKLC-M, to meet complex application scenarios, where the former runs on a resource-constrained single GPU and the latter is multi-GPU accelerated. We benchmark PMKLC-S/M and 14 baselines (7 traditional and 7 learning-based) on 15 real-world datasets covering different species and data sizes. Compared to the baselines on the testing datasets, PMKLC-S and PMKLC-M achieve average compression ratio improvements of up to 73.609% and 73.480%, and average throughput improvements of up to 3.036$\times$ and 10.710$\times$, respectively. PMKLC-S/M also achieve the best robustness and competitive memory cost, indicating greater stability on datasets with different probability distribution perturbations and a strong ability to run on memory-constrained devices.
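The abstract names an $(s, k)$-mer encoder but does not define it. As a rough illustration only, the sketch below assumes that $k$ is the window length in bases and $s$ is the stride between consecutive windows, and that each window is mapped to a single integer token in base 4. The function name `sk_mer_encode`, the `BASE_TO_ID` table, and the restriction to the four bases A, C, G, T are hypothetical choices made for this example. It runs on the CPU and is not the authors' GPU-accelerated implementation.

```python
from typing import List

# Hypothetical 2-bit alphabet; real genomic data may also contain N or other symbols.
BASE_TO_ID = {"A": 0, "C": 1, "G": 2, "T": 3}


def sk_mer_encode(seq: str, s: int, k: int) -> List[int]:
    """Map each length-k window, taken every s bases, to one integer token.

    Assumption: s is the stride and k is the window length. Trailing bases
    that do not fill a complete window are ignored in this sketch.
    """
    if s < 1 or k < 1:
        raise ValueError("s and k must be positive")
    tokens = []
    for start in range(0, len(seq) - k + 1, s):
        token = 0
        for base in seq[start:start + k]:
            # Treat the window as a base-4 number, most significant base first.
            token = token * 4 + BASE_TO_ID[base]
        tokens.append(token)
    return tokens


print(sk_mer_encode("ACGTAC", s=2, k=2))  # [1, 11, 1]
print(sk_mer_encode("ACGT", s=1, k=3))    # [6, 27]
```

Under these assumptions, choosing $s = k$ yields non-overlapping windows, so the token stream passed to a downstream model is roughly $k$ times shorter than the raw sequence, while the vocabulary grows to $4^k$ symbols.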

Keywords

large language model inference processing-in-memory image compression

Cite

@article{arxiv.2507.12805,
  title  = {PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database},
  author = {Hui Sun and Yanfeng Ding and Liping Yi and Huidong Ma and Gang Wang and Xiaoguang Liu and Cheng Zhong and Wentong Cai},
  journal= {arXiv preprint arXiv:2507.12805},
  year   = {2025}
}

Comments

Accepted via KDD-25
