Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation

Machine Learning 2016-05-17 v2 Artificial Intelligence

Abstract

Traditional graph-based semi-supervised learning (SSL) approaches, even though widely applied, are not suited for massive data and large label scenarios since they scale linearly with the number of edges $|E|$ and distinct labels $m$. To deal with the large label size problem, recent works propose sketch-based methods to approximate the distribution on labels per node, thereby achieving a space reduction from $O(m)$ to $O(\log m)$, under certain conditions. In this paper, we present a novel streaming graph-based SSL approximation that captures the sparsity of the label distribution and ensures the algorithm propagates labels accurately, and further reduces the space complexity per node to $O(1)$. We also provide a distributed version of the algorithm that scales well to large data sizes. Experiments on real-world datasets demonstrate that the new method achieves better performance than existing state-of-the-art algorithms with significant reduction in memory footprint. We also study different graph construction mechanisms for natural language applications and propose a robust graph augmentation strategy trained using state-of-the-art unsupervised deep learning architectures that yields further significant quality gains.
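
The idea of bounding per-node memory by exploiting label sparsity can be illustrated with a toy example. The sketch below is not the paper's algorithm; it is a minimal, single-machine label propagation loop that assumes a simple truncation rule (keep only the `k` highest-scoring labels per node, so storage per node is bounded by the constant `k` rather than by $m$). The function name `propagate_top_k`, its parameters, and the tie-breaking rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def propagate_top_k(edges, seeds, k=2, iterations=5):
    """Toy label propagation that stores at most k labels per node.

    edges: iterable of (u, v, weight) tuples for an undirected graph.
    seeds: dict mapping labeled nodes to {label: score} distributions.
    """
    # Build an undirected weighted adjacency list.
    neighbors = defaultdict(list)
    for u, v, w in edges:
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))

    nodes = set(neighbors) | set(seeds)
    # Seed nodes start with their known labels; all others start empty.
    labels = {node: dict(seeds.get(node, {})) for node in nodes}

    for _ in range(iterations):
        updated = {}
        for node in nodes:
            if node in seeds:
                # Assumption: seed labels are clamped and never overwritten.
                updated[node] = dict(seeds[node])
                continue
            # Accumulate weighted label scores from neighbors.
            scores = defaultdict(float)
            for other, weight in neighbors[node]:
                for label, score in labels[other].items():
                    scores[label] += weight * score
            # Keep only the k best labels (ties broken by label name),
            # then renormalize so the kept scores sum to 1.
            top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
            total = sum(score for _, score in top)
            updated[node] = (
                {label: score / total for label, score in top} if total > 0 else {}
            )
        labels = updated
    return labels


if __name__ == "__main__":
    toy_edges = [("a", "b", 1.0), ("b", "c", 1.0)]
    toy_seeds = {"a": {"x": 1.0}, "c": {"y": 1.0}}
    print(propagate_top_k(toy_edges, toy_seeds, k=2))
```

With `k` fixed, each node holds a constant number of label scores regardless of how many distinct labels exist in the graph, which is the kind of trade-off the abstract describes; how the paper selects and maintains those labels in a streaming, distributed setting is not shown here.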

Cite

@article{arxiv.1512.01752,
  title  = {Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation},
  author = {Sujith Ravi and Qiming Diao},
  journal= {arXiv preprint arXiv:1512.01752},
  year   = {2016}
}

Comments

10 pages
