Better Write Amplification for Streaming Data Processing

Distributed, Parallel, and Cluster Computing 2023-06-07 v1

Abstract

Many modern applications must process data in a streaming fashion. Doing so at scale requires a parallel system equipped to handle straggling workers and various kinds of failures. YT is the main driver behind distributed systems at Yandex and is home to its distributed file system, lock service, key-value storage, and internal MapReduce platform. We implement a new component of this system for performing streaming MapReduce operations. It builds on several core YT solutions to achieve fault tolerance and exactly-once semantics while maintaining efficiency and a low write amplification factor.

Cite

@article{arxiv.2306.03272,
  title  = {Better Write Amplification for Streaming Data Processing},
  author = {Andrei Chulkov and Maxim Akhmedov},
  journal = {arXiv preprint arXiv:2306.03272},
  year   = {2023}
}

Comments

YT is now openly available as YTSaurus: see github.com/ytsaurus/ytsaurus
