Related papers: Chiplets and the Codelet Model
This paper focuses on the simulation of multi-die System-on-Chip (SoC) architectures using VisualSim, emphasizing chiplet-based system modeling and performance analysis. Chiplet technology presents a promising alternative to traditional…
Fast-evolving artificial intelligence (AI) algorithms such as large language models have been driving the ever-increasing computing demands in today's data centers. Heterogeneous computing with domain-specific architectures (DSAs) brings…
For decades, memory capabilities have scaled up much slower than compute capabilities, leaving memory utilization as a major bottleneck. Prefetching and cache hierarchies mitigate this in applications with easily predictable memory accesses…
Due to reduced manufacturing yields, traditional monolithic chips cannot keep up with the compute, memory, and communication demands of data-intensive applications, such as rapidly growing deep neural network (DNN) models. Chiplet-based…
Vision Transformers (ViTs) have established new performance benchmarks in vision tasks such as image recognition and object detection. However, these advancements come with significant demands for memory and computational resources,…
A chiplet is an integrated circuit that encompasses a well-defined subset of an overall system's functionality. In contrast to traditional monolithic system-on-chips (SoCs), chiplet-based architecture can reduce costs and increase…
To address increasing compute demand from recent multi-model workloads with heavy models like large language models, we propose to deploy heterogeneous chiplet-based multi-chip module (MCM)-based accelerators. We develop an advanced…
Conventional heterogeneous computing systems built on PCIe interconnects suffer from inefficient fine-grained host-device interactions and complex programming models. In recent years, many proprietary and open cache-coherent interconnect…
The advent of chiplet technology introduces cutting-edge opportunities for constructing highly heterogeneous platforms with specialized accelerators. However, the HPC community currently lacks expertise in hardware development, a gap that…
On the advent of the slow death of Moore's law, the silicon industry is moving towards a new era of chiplets. The automotive industry is experiencing a profound transformation towards software-defined vehicles, fueled by the surging demand…
2.5D integration is an important technique to tackle the growing cost of manufacturing chips in advanced technology nodes. This poses the challenge of providing high-performance inter-chiplet interconnects (ICIs). As the number of chiplets…
Heterogeneous chiplets have been proposed for accelerating high-performance computing tasks. Integrated inside one package, CPU and GPU chiplets can share a common interconnection network that can be implemented through the interposer.…
Many modern workloads such as neural network inference and graph processing are fundamentally memory-bound. For such workloads, data movement between memory and CPU cores imposes a significant overhead in terms of both latency and energy. A…
Transformers have revolutionized deep learning and generative modeling, enabling advancements in natural language processing tasks. However, the size of transformer models is increasing continuously, driven by enhanced capabilities across…
Large language models (LLMs) such as OpenAI's ChatGPT and Google's Gemini have demonstrated unprecedented capabilities of autoregressive AI models across multiple tasks triggering disruptive technology innovations around the world. However,…
The trend in industry is towards heterogeneous multicore processors (HMCs), including chips with CPUs and massively-threaded throughput-oriented processors (MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the cores…
Recent works have introduced task-based parallelization schemes to accelerate graph search and sparse data-structure traversal, where some solutions scale up to thousands of processing units (PUs) on a single chip. However parallelizing…
Developing parallel algorithms efficiently requires careful management of concurrency across diverse hardware architectures. C++ executors provide a standardized interface that simplifies the development process, allowing developers to…
In large-scale distributed LLM training, communication between devices becomes the key performance bottleneck. Chiplet technology can integrate multiple dies into a package to scale-up node performance with higher bandwidth. Meanwhile,…
Chiplet architectures are on the rise as they promise to overcome the scaling challenges of monolithic chips. A key component of such architectures is an efficient inter-chiplet interconnect (ICI). The ICI design space is huge as there are…