Related papers: Latency Based Tiling
Traditional compiler optimization theory distinguishes three separate classes of cache miss -- Cold, Conflict and Capacity. Tiling for cache is typically guided by capacity miss counts. Models of cache function have not been effectively…
Contrastive loss is a powerful approach for representation learning, where larger batch sizes enhance performance by providing more negative samples to better distinguish between similar and dissimilar data. However, scaling batch sizes is…
Caching is crucial for system performance, but the delayed hit phenomenon, where requests queue during lengthy fetches after a cache miss, significantly degrades user-perceived latency in modern high-throughput systems. While prior works…
The success of DNNs and their high computational requirements pushed for large codesign efforts aiming at DNN acceleration. Since DNNs can be represented as static computational graphs, static memory allocation and tiling are two crucial…
Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory…
Texturing is a fundamental process in computer graphics. Texture is leveraged to enhance the visualization outcome for a 3D scene. In many cases a texture image cannot cover a large 3D model surface because of its small resolution.…
In modern systems, DRAM-based main memory is significantly slower than the processor. Consequently, processors spend a long time waiting to access data from main memory, making the long main memory access latency one of the most critical…
Distributed storage systems are known to be susceptible to long tails in response time. In modern online storage systems such as Bing, Facebook, and Amazon, the long tails of the service latency are of particular concern. with 99.9th…
This extended abstract describes our solution for the Traffic4Cast Challenge 2019. The task requires modeling both fine-grained (pixel-level) and coarse (region-level) spatial structure while preserving temporal relationships across long…
Mining and exploring databases should provide users with knowledge and new insights. Tiles of data strive to unveil true underlying structure and distinguish valuable information from various kinds of noise. We propose a novel Boolean…
Memory tiering systems seek cost-effective memory scaling by adding multiple tiers of memory. For maximum performance, frequently accessed (hot) data must be placed close to the host in faster tiers and infrequently accessed (cold) data can…
Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a…
Inference-time scaling has emerged as a powerful way to improve large language model (LLM) performance by generating multiple candidate responses and selecting among them. However, existing work on dynamic allocation for test-time compute…
We propose a texture template approach, consisting of a set of virtual minutiae, to improve the overall latent fingerprint recognition accuracy. To compensate for the lack of sufficient number of minutiae in poor quality latent prints, we…
Recently, small models with latent recursion have obtained promising results on complex reasoning tasks. These results are typically explained by the theory that such recursion increases a networks depth, allowing it to compactly emulate…
Memory optimization for deep neural network (DNN) inference gains high relevance with the emergence of TinyML, which refers to the deployment of DNN inference tasks on tiny, low-power microcontrollers. Applications such as audio keyword…
Caches are used to reduce the speed differential between the CPU and memory to improve the performance of modern processors. However, attackers can use contention-based cache timing attacks to steal sensitive information from victim…
Spatial statistical modeling and prediction involve generating and manipulating an n*n symmetric positive definite covariance matrix, where n denotes the number of spatial locations. However, when n is large, processing this covariance…
Differentiable neural architecture search methods became popular in recent years, mainly due to their low search costs and flexibility in designing the search space. However, these methods suffer the difficulty in optimizing network, so…
This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was published in HPCA 2015, and examines the work's significance and future potential. AL-DRAM is a mechanism that optimizes DRAM latency based on the DRAM module and…