Related papers: Comprehensive and Efficient Workload Compression
We consider the problem of optimally compressing and caching data across a communication network. Given the data generated at edge nodes and a routing path, our goal is to determine the optimal data compression ratios and caching decisions…
We study a new formulation of the team-formation problem, where the goal is to form teams to work on a given set of tasks requiring different skills. Deviating from the classic problem setting where one is asking to cover all skills of each…
The growing amount of applications that generate vast amount of data in short time scales render the problem of partial monitoring, coupled with prediction, a rather fundamental one. We study the aforementioned canonical problem under the…
Real-world data typically contain repeated and periodic patterns. This suggests that they can be effectively represented and compressed using only a few coefficients of an appropriate basis (e.g., Fourier, Wavelets, etc.). However, distance…
In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to…
We address the problem of efficiently gathering correlated data from a wired or a wireless sensor network, with the aim of designing algorithms with provable optimality guarantees, and understanding how close we can get to the known…
Relational queries, and in particular join queries, often generate large output results when executed over a huge dataset. In such cases, it is often infeasible to store the whole materialized output if we plan to reuse it further down a…
An effective technique for solving optimization problems over massive data sets is to partition the data into smaller pieces, solve the problem on each piece and compute a representative solution from it, and finally obtain a solution…
Modern applications require methods that are computationally feasible on large datasets but also preserve statistical efficiency. Frequently, these two concerns are seen as contradictory: approximation methods that enable computation are…
With the rapid development of crowdsourcing platforms that aggregate the intelligence of Internet workers, crowdsourcing has been widely utilized to address problems that require human cognitive abilities. Considering great dynamics of…
We study the non-uniform capacitated multi-item lot-sizing (\lotsizing) problem. In this problem, there is a set of demands over a planning horizon of $T$ time periods and all demands must be satisfied on time. We can place an order at the…
The problem of column subset selection has recently attracted a large body of research, with feature selection serving as one obvious and important application. Among the techniques that have been applied to solve this problem, the greedy…
Sampling biases in training data are a major source of algorithmic biases in machine learning systems. Although there are many methods that attempt to mitigate such algorithmic biases during training, the most direct and obvious way is…
We present an efficient coresets-based neural network compression algorithm that sparsifies the parameters of a trained fully-connected neural network in a manner that provably approximates the network's output. Our approach is based on an…
Crowdsourcing provides a popular paradigm for data collection at scale. We study the problem of selecting subsets of workers from a given worker pool to maximize the accuracy under a budget constraint. One natural question is whether we…
In this paper, we present long-awaited algorithmic advances toward the efficient construction of near-optimal replenishment policies for a true inventory management classic, the economic warehouse lot scheduling problem. While this paradigm…
Visual analytics have played an increasingly critical role in the Internet of Things, where massive visual signals have to be compressed and fed into machines. But facing such big data and constrained bandwidth capacity, existing…
We consider models of content delivery networks in which the servers are constrained by two main resources: memory and bandwidth. In such systems, the throughput crucially depends on how contents are replicated across servers and how the…
This paper investigates, from information theoretic grounds, a learning problem based on the principle that any regularity in a given dataset can be exploited to extract compact features from data, i.e., using fewer bits than needed to…
Consensus clustering, a fundamental task in machine learning and data analysis, aims to aggregate multiple input clusterings of a dataset, potentially based on different non-sensitive attributes, into a single clustering that best…