# ISCA 2025

## Meta Info

Homepage: <https://iscaconf.org/isca2025/>

Paper list: <https://www.iscaconf.org/isca2025/program/>

## Papers

### Large Language Models (LLMs)

* LLM Training
  * Chimera: Communication Fusion for Hybrid Parallelism in Large Language Models \[[Code](https://zenodo.org/records/15104237)]
    * HKUST-GZ
  * MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training \[[Paper](https://iacoma.cs.uiuc.edu/iacoma-papers/isca25_1.pdf)]
    * UIUC
  * Scaling Llama 3 Training with Efficient Parallelism Strategies
    * Industry Track
* LLM Inference
  * H2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference
    * Best Paper Nominee
  * SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting
  * LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
  * AiF: Accelerating On-Device LLM Inference Using In-Flash Processing
  * LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading
    * UIUC
  * Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window
  * WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-based Dynamic Scheduling
  * Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
    * Industry Track
* Retrieval-Augmented Generation (RAG)
  * HeterRAG: Heterogeneous Processing-in-Memory Acceleration for Retrieval-augmented Generation
    * HUST
  * Hermes: Algorithm-System Co-design for Efficient Retrieval Augmented Generation At-Scale
  * RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
* Quantization & Compression
  * Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
  * Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression
* Performance Modeling
  * AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs

### Deep Learning Recommendation Models (DLRMs)

* TRACI: Network Acceleration of Input-Dynamic Communication for Large-Scale Deep Learning Recommendation Model

### Resource Management

* GPU Management
  * Forest: Access-aware GPU UVM Management
  * NetCrafter: Tailoring Network Traffic for Non-Uniform Bandwidth Multi-GPU Systems
  * UGPU: Dynamically Constructing Unbalanced GPUs for Enhanced Resource Efficiency
* Serverless Computing
  * Single-Address-Space FaaS with Jord
* Microservices
  * HardHarvest: Hardware-Supported Core Harvesting for Microservices

### Performance Analysis & Benchmark

* Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel's Gaudi NPU for AI Model Serving
* Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines
  * UIUC
* DCPerf: An Open-Source, Battle-Tested Performance Benchmark Suite for Datacenter Workloads
  * Industry Track

### AI Chip

* Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences
  * Industry Track
