Related papers: A Full-Stack Performance Evaluation Infrastructure…
Advances in hybrid bonding and packaging have driven growing interest in 3D DRAM-stacked accelerators with higher memory bandwidth and capacity. As LLMs scale to hundreds of billions or trillions of parameters, distributed inference across…
The rapid growth of large-language models (LLMs) is driving a new wave of specialized hardware for inference. This paper presents the first workload-centric, cross-architectural performance study of commercial AI accelerators, spanning…
Large Language Models (LLMs) have become extremely potent instruments with exceptional capacities for comprehending and producing human-like text in a wide range of applications. However, the increasing size and complexity of LLMs present…
Analyzing non-compilable C/C++ submodules without a resolved build environment remains a critical bottleneck for industrial software evolution. Traditional static analysis tools often fail in these scenarios due to their reliance on…
While Large Language Models (LLMs) have shown remarkable advancements in reasoning and tool use, they often fail to generate optimal, grounded solutions under complex constraints. Real-world travel planning exemplifies these challenges,…
The success of large language models LLMs amplifies the need for highthroughput energyefficient inference at scale. 3DDRAMbased accelerators provide high memory bandwidth and therefore an opportunity to accelerate the bandwidthbound decode…
RAPID-LLM is a unified performance modeling framework for large language model (LLM) training and inference on GPU clusters. It couples a DeepFlow-based frontend that generates hardware-aware, operator-level Chakra execution traces from an…
Transformer architectures have become the standard neural network model for various machine learning applications including natural language processing and computer vision. However, the compute and memory requirements introduced by…
Accelerating applications through the design of hardware accelerators can significantly enhance system performance and energy efficiency. Despite advances, such as high-level synthesis (HLS), designing accelerators for complex applications…
The ATLAS experiment at the Large Hadron Collider has a broad physics programme ranging from precision measurements to direct searches for new particles and new interactions, requiring ever larger and ever more accurate datasets of…
The past year has witnessed the increasing popularity of Large Language Models (LLMs). Their unprecedented scale and associated high hardware cost have impeded their broader adoption, calling for efficient hardware designs. With the large…
To overcome the well-known memory bottleneck of AI chips, 3D stacked architectures that employ advanced packaging technology with high-density through-silicon vias (TSVs) pins have proven to be a promising solution. The 3D-stacked AI chip…
ATLAS is a constraint-guided generation framework for structured engineering artifacts whose outputs must satisfy explicit schemas, domain rules, and audit requirements. Rather than treating a large language model as a standalone generator,…
Deep learning (DL) models are piquing high interest and scaling at an unprecedented rate. To this end, a handful of tiled accelerators have been proposed to support such large-scale training tasks. However, these accelerators often…
Recent work on activation and latent steering has demonstrated that modifying internal representations can effectively guide large language models (LLMs) toward improved reasoning and efficiency without additional training. However, most…
Artificial intelligence (AI) methods have become critical in scientific applications to help accelerate scientific discovery. Large language models (LLMs) are being considered as a promising approach to address some of the challenging…
Large language models (LLMs) have demonstrated exceptional proficiency in understanding and generating human language, but efficient inference on resource-constrained embedded devices remains challenging due to large model sizes and…
The substantial memory bandwidth and computational demands of large language models (LLMs) present critical challenges for efficient inference. To tackle this, the literature has explored heterogeneous systems that combine neural processing…
Recently, large language models (LLMs) have achieved huge success in the natural language processing (NLP) field, driving a growing demand to extend their deployment from the cloud to edge devices. However, deploying LLMs on…
Long-context language models now advertise context windows up to millions of tokens, yet evaluations typically report a single length or a narrow task family, masking two failure modes: performance can collapse as length grows, and strong…