Related papers: PREMA: A Predictive Multi-task Scheduling Algorith…
With deep neural networks (DNNs) emerging as the backbone in a multitude of computer vision tasks, their adoption in real-world applications broadens continuously. Given the abundance and omnipresence of smart devices in the consumer…
We study the problem of scheduling VMs (Virtual Machines) in a distributed server platform, motivated by cloud computing applications. The VMs arrive dynamically over time to the system, and require a certain amount of resources (e.g.…
FPGAs are an attractive type of accelerator for all-purpose HPC computing systems due to the possibility of deploying tailored hardware on demand. However, the common tools for programming and operating FPGAs are still complex to use,…
This paper addresses the computational offloading of Deep Neural Networks (DNNs) to nearby devices with similar processing capabilities, to avoid the larger communication delays incurred for cloud offloading. We present a preemption aware…
Large Language Models have revolutionized natural language processing, yet serving them efficiently in data centers remains challenging due to mixed workloads comprising latency-sensitive (LS) and best-effort (BE) jobs. Existing inference…
With the fast development of deep neural networks (DNNs), many real-world applications are adopting multiple models to conduct compound tasks, such as co-running classification, detection, and segmentation models on autonomous vehicles.…
When a computer system schedules jobs there is typically a significant cost associated with preempting a job during execution. This cost can be from the expensive task of saving the memory's state and loading data into and out of memory. It…
We present a new online algorithm for profit-oriented scheduling on multiple speed-scalable processors. Moreover, we provide a tight analysis of the algorithm's competitiveness. Our results generalize and improve upon work by…
Maximizing resource utilization by performing an efficient resource provisioning is a key factor for any cloud provider: commercial actors can maximize their revenues, whereas scientific and non-commercial providers can maximize their…
Hardware accelerators such as GPUs are required for real-time, low-latency inference with Deep Neural Networks (DNN). However, due to the inherent limits to the parallelism they can exploit, DNNs often under-utilize the capacity of today's…
Learning-augmented algorithms have emerged as a powerful paradigm to surpass traditional worst-case lower bounds by integrating potentially noisy predictions. While this framework has seen success in online scheduling, existing work…
Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML…
Ensembles of Deep Neural Networks (DNNs) have achieved qualitative predictions but they are computing and memory intensive. Therefore, the demand is growing to make them answer a heavy workload of requests with available computational…
CPU scheduling is the reason behind the performance of multiprocessing and in time-shared operating systems. Different scheduling criteria are used to evaluate Central Processing Unit Scheduling algorithms which are based on different…
Deployment of real-time ML services on warehouse-scale infrastructures is on the increase. Therefore, decreasing latency and increasing throughput of deep neural network (DNN) inference applications that empower those services have…
An optimal solution to the problem of scheduling real-time tasks on a set of identical processors is derived. The described approach is based on solving an equivalent uniprocessor real-time scheduling problem. Although there are other…
The (Non-Preemptive) Throughput Maximization problem is a natural and fundamental scheduling problem. We are given $n$ jobs, where each job $j$ is characterized by a processing time and a time window, contained in a global interval $[0,T)$,…
In this work we study the problem of scheduling tasks with dependencies in multiprocessor architectures where processors have different speeds. We present the preemptive algorithm "Save-Energy" that given a schedule of tasks it post…
Large language models (LLMs) achieve state-of-the-art accuracy on complex reasoning tasks by generating multiple chain-of-thought (CoT) traces, but using a fixed token budget per query leads to over-computation on easy inputs and…
Motivated by the increasing importance of providing delay-guaranteed services in general computing and communication systems, and the recent wide adoption of learning and prediction in network control, in this work, we consider a general…