Related papers: Codestitcher: Inter-Procedural Basic Block Layout …
Basic block reordering is an important step for profile-guided binary optimization. The state-of-the-art goal for basic block reordering is to maximize the number of fall-through branches. However, we demonstrate that such orderings may…
Fast compilation is important when compilation occurs at runtime, such as query compilers in modern database systems and WebAssembly virtual machines in modern browsers. We present copy-and-patch, an extremely fast compilation technique…
Function layout, also referred to as function reordering or function placement, is one of the most effective profile-guided compiler optimizations. By reordering functions in a binary, compilers are able to greatly improve the performance…
Block-based programming languages like Scratch have become increasingly popular as introductory languages for novices. These languages are intended to be used with a "tinkering" approach which allows learners and teachers to quickly…
Code often suffers from performance bugs. These bugs necessitate the research and practice of code optimization. Traditional rule-based methods rely on manually designing and maintaining rules for specific performance bugs (e.g., redundant…
Coded caching is a technique that generalizes conventional caching and promises significant reductions in traffic over caching networks. However, the basic coded caching scheme requires that each file hosted in the server be partitioned…
Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM…
The blocks editor, such as the editor in Scratch, is widely applied for visual programming languages (VPL) nowadays. Despite it's friendly for non-programmers, it exists three main limitations while displaying block codes: (1) the…
Code optimization is a challenging task requiring a substantial level of expertise from developers. Nonetheless, this level of human capacity is not sufficient considering the rapid evolution of new hardware architectures and software…
Elasticity is offered by cloud service providers to exploit under-utilized computing resources. The low-cost elastic nodes can leave and join any time during the computation cycle. The possibility of elastic events occurring together with…
In recent years, there is a surge on machine learning applications in industry. Many of them are based on popular AI frameworks like Tensorflow, Torch, Caffe, or MxNet, etc, and are enpowered by accelerator platforms such as GPUs. One…
Performance optimization is the art of continuous seeking a harmonious mapping between the application domain and hardware. Recent years have witnessed a surge of deep learning (DL) applications in industry. Conventional wisdom for…
This study investigates the problem of learning linear block codes optimized for Belief-Propagation decoders significantly improving performance compared to the state-of-the-art. Our previous research is extended with an enhanced system…
Performance optimization for large-scale applications has recently become more important as computation continues to move towards data centers. Data-center applications are generally very large and complex, which makes code layout an…
Even in the era of Deep Learning based methods, traditional machine learning methods with large data sets continue to attract significant attention. However, we find an apparent lack of a detailed performance characterization of these…
Mining large graphs for information is becoming an increasingly important workload due to the plethora of graph structured data becoming available. An aspect of graph algorithms that has hitherto not received much interest is the effect of…
Modern high-performance architectures employ large last-level caches (LLCs). While large LLCs can reduce average memory access latency for workloads with a high degree of locality, they can also increase latency for workloads with irregular…
Building on the previous work of Lee et al. and Ferdinand et al. on coded computation, we propose a sequential approximation framework for solving optimization problems in a distributed manner. In a distributed computation system, latency…
We show in this work that memory intensive computations can result in severe performance problems due to off-chip memory access and CPU-GPU context switch overheads in a wide range of deep learning models. For this problem, current…
Traditional optimizing compilers have played an important role in adapting to the growing complexity of modern software systems. The need for efficient parallel programming in current architectures requires strong optimization techniques.…