arXiv:2605.29642

Matching Rates and Optimal Allocation for Federated Probe-Logit Distillation under Heterogeneous Bandwidth Budgets

stat.ML · cs.IT · Machine Learning · math.IT · 2026-05 · v1

Abstract

In federated language modeling, $K$ nodes each hold $n$ samples but cannot pool data or exchange full-precision gradients or weights. We study the minimax rate at which a conditional distribution over $V$ tokens can be estimated when each node may upload at most $B$ bits per query in a public probe set. In federated probe-logit distillation (FPLD), each node transmits a scalar-quantized logit vector on the probe set, and an aggregator distills a global parametric student. Prior work (Dubey and Huo, 2026) establishes a high-probability KL rate $O(d/(Kn) + \rho\sqrt{V \log V / m} + K^{-1} \cdot 2^{-2B/V})$ plus optimization slack, with the bandwidth term in its trace-sharpened form. Whether this bandwidth-term rate is tight, and how the upper bound generalizes to heterogeneous per-node bandwidths, are left open. We close both gaps. First, the dithered FPLD construction has a matching single-round lower bound $\Omega(K^{-1} \cdot 2^{-2B/V})$ under non-degeneracy, pinning the bandwidth-axis rate at $\Theta(K^{-1} \cdot 2^{-2B/V})$. $T$-round sequential refinement with nested/scaled residual quantizers achieves $O(K^{-1} \cdot 2^{-2TB/V})$; vanilla FPLD's $T$-independent bandwidth term is suboptimal for every $T > 1$. Second, we establish a heterogeneous-bandwidth upper bound for per-node budgets $B_i$, paired with a closed-form optimal allocation $B_i^* = B_{\mathrm{tot}}/K + (V/2) \log_2(w_i / \bar{w}_g)$, a log-tilted water-filling rule that is the per-node analogue of reverse water-filling for distortion-rate optimization. A plug-in adaptive variant estimates the weights from a short warm-up phase and attains $1 + O(\sqrt{\log(K/\delta)/(m T_0)})$ relative suboptimality.
Synthetic n-gram simulations confirm that empirical KL is bracketed by the upper and lower bounds and that the optimal allocation strictly dominates uniform and inverse-weighted baselines under heterogeneous clipping.
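
The closed-form allocation stated in the abstract can be illustrated numerically. The sketch below is not the authors' code; it assumes that $\bar{w}_g$ denotes the geometric mean of the positive per-node weights $w_i$ (the only choice under which the budgets $B_i^*$ sum to $B_{\mathrm{tot}}$), and it uses a hypothetical surrogate objective $\sum_i w_i \cdot 2^{-2B_i/V}$ purely to compare allocations. The function names and the example weights are illustrative, and the sketch does not handle the case where the formula yields a negative budget for a very small weight.

```python
import math


def optimal_allocation(weights, b_total, vocab_size):
    """Log-tilted water-filling: B_i = B_tot/K + (V/2) * log2(w_i / gmean(w))."""
    k = len(weights)
    if k == 0 or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty sequence of positive numbers")
    # Assumption: \bar{w}_g is the geometric mean, computed here in log2 space.
    log2_gmean = sum(math.log2(w) for w in weights) / k
    return [
        b_total / k + (vocab_size / 2) * (math.log2(w) - log2_gmean)
        for w in weights
    ]


def bandwidth_objective(weights, budgets, vocab_size):
    """Hypothetical surrogate: sum_i w_i * 2^(-2 * B_i / V)."""
    return sum(w * 2 ** (-2 * b / vocab_size) for w, b in zip(weights, budgets))


# Illustrative values only (not taken from the paper).
weights = [1.0, 2.0, 4.0, 8.0]
V, B_tot = 16, 256

alloc = optimal_allocation(weights, B_tot, V)
uniform = [B_tot / len(weights)] * len(weights)

print("allocation:", alloc)
print("objective (tilted): ", bandwidth_objective(weights, alloc, V))
print("objective (uniform):", bandwidth_objective(weights, uniform, V))
```

With these example weights the rule assigns 52, 60, 68, and 76 bits, which sum to the total budget of 256. Under the assumed surrogate, the tilted allocation equalizes every term $w_i \cdot 2^{-2B_i/V}$ and gives a smaller objective value than the uniform split.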

Cite

@article{arxiv.2605.29642,
  title  = {Matching Rates and Optimal Allocation for Federated Probe-Logit Distillation under Heterogeneous Bandwidth Budgets},
  author = {Prasanjit Dubey and Xiaoming Huo},
  journal= {arXiv preprint arXiv:2605.29642},
  year   = {2026}
}