ISCA 2024
Homepage:
Paper list:
Splitwise: Efficient Generative LLM Inference Using Phase Splitting
Microsoft
Best Paper Award
MECLA: Memory-Compute-Efficient LLM Accelerator with Scaling Sub-matrix Partition
Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization
LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference
ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
MSRA
Pre-gating function alleviates the dynamic nature of sparse expert activation -> addresses the large memory footprint of MoE inference.
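A minimal Python/NumPy sketch of the pre-gating idea: the expert choice for block i+1 is made while block i is still computing, so the weight fetch can overlap with compute. All names (`pre_gate`, `run_block`, `host_experts`) are illustrative, not the paper's API.

```python
import numpy as np

NUM_EXPERTS, TOP_K, D, NUM_BLOCKS = 8, 2, 16, 4
rng = np.random.default_rng(0)

# Expert weights live in host (CPU) memory; only prefetched ones are "on GPU".
host_experts = [rng.standard_normal((D, D)) * 0.01 for _ in range(NUM_EXPERTS)]
gates = [rng.standard_normal((D, NUM_EXPERTS)) for _ in range(NUM_BLOCKS)]

def pre_gate(x, W_gate):
    """Score experts and return the top-k indices to prefetch."""
    scores = x.mean(axis=0) @ W_gate           # (NUM_EXPERTS,)
    return np.argsort(scores)[-TOP_K:]

def run_block(x, experts):
    """Expert compute for the current block, using already-resident weights."""
    return sum(x @ W for W in experts) / len(experts)

x = rng.standard_normal((4, D))
# Bootstrap: block 0's experts come from an ordinary gate decision.
resident = [host_experts[j] for j in pre_gate(x, gates[0])]
for i in range(NUM_BLOCKS):
    if i + 1 < NUM_BLOCKS:
        # Block i+1's gate decision is made early, on block i's input, so the
        # host-to-GPU weight copy can overlap with run_block below instead of
        # serializing after block i finishes.
        next_ids = pre_gate(x, gates[i + 1])
    x = run_block(x, resident)
    if i + 1 < NUM_BLOCKS:
        resident = [host_experts[j] for j in next_ids]   # "prefetched" weights
print("output shape:", x.shape)
```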
Heterogeneous Acceleration Pipeline for Recommendation System Training
UBC & GaTech
Hotline: a runtime framework for heterogeneous recommendation model training.
Utilize CPU main memory for non-popular embeddings and GPUs’ HBM for popular embeddings.
Fragment a mini-batch into popular and non-popular micro-batches (μ-batches).
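A minimal Python/NumPy sketch of the fragmentation step, assuming a simple popularity cutoff over embedding IDs; `fragment` and `popular_ids` are illustrative names, not Hotline's interface.

```python
import numpy as np

NUM_IDS, POPULAR_FRACTION = 1000, 0.05
rng = np.random.default_rng(0)

# Recommendation embedding accesses are heavily skewed; treat the most
# frequently accessed IDs as "popular" (HBM-resident), the rest stay in DRAM.
freq = rng.zipf(1.5, size=NUM_IDS)
popular_ids = set(np.argsort(freq)[-int(NUM_IDS * POPULAR_FRACTION):].tolist())

def fragment(minibatch):
    """Split a mini-batch into popular / non-popular micro-batches."""
    popular, non_popular = [], []
    for sample in minibatch:
        # A sample whose lookups all hit HBM-resident embeddings trains on the
        # GPU-only path; anything else takes the CPU-assisted path.
        if all(int(i) in popular_ids for i in sample):
            popular.append(sample)
        else:
            non_popular.append(sample)
    return popular, non_popular

minibatch = [rng.zipf(1.5, size=4) % NUM_IDS for _ in range(32)]
pop, cold = fragment(minibatch)
print(f"{len(pop)} popular samples (GPU-only), {len(cold)} non-popular (CPU-assisted)")
```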
Cambricon-D: Full-Network Differential Acceleration for Diffusion Models
ICT, CAS
The first processor design targeting diffusion model acceleration.
Mitigates the extra memory accesses introduced by differential computing while preserving its computation savings.
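A minimal Python/NumPy sketch of differential computing across adjacent denoising steps for a single linear layer; it also shows the cached state (`x_prev`, `y_prev`) whose extra memory traffic the paper targets. Names are illustrative.

```python
import numpy as np

D = 64
rng = np.random.default_rng(0)
W = rng.standard_normal((D, D)) * 0.01   # one linear layer of the denoiser

def full_step(x):
    return x @ W

def differential_step(x_t, x_prev, y_prev):
    """Linear ops commute with deltas: y_t = y_prev + (x_t - x_prev) @ W,
    so only the small inter-step difference flows through the datapath."""
    return y_prev + (x_t - x_prev) @ W

x_prev = rng.standard_normal((8, D))
y_prev = full_step(x_prev)
x_t = x_prev + 0.01 * rng.standard_normal((8, D))   # adjacent steps barely differ
y_t = differential_step(x_t, x_prev, y_prev)
print("max error vs. full recompute:", np.abs(y_t - full_step(x_t)).max())
# The catch: x_prev and y_prev must be kept and re-read every step, which is
# exactly the extra memory traffic the accelerator has to mitigate.
```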
DaCapo: Accelerating Continuous Learning in Autonomous Systems for Video Analytics
Intel Accelerator Ecosystem: An SoC-Oriented Perspective
Intel
Industry Session