ISCA 2025
Homepage:
Paper list:
LLM Training
Chimera: Communication Fusion for Hybrid Parallelism in Large Language Models
HKUST-GZ
MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training
UIUC
Scaling Llama 3 Training with Efficient Parallelism Strategies
Industry Track
LLM Inference
H2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference
Best Paper Nominee
SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting
LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
AiF: Accelerating On-Device LLM Inference Using In-Flash Processing
LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading
UIUC
Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window
WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-based Dynamic Scheduling
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
Industry Track
Retrieval-Augmented Generation (RAG)
HeterRAG: Heterogeneous Processing-in-Memory Acceleration for Retrieval-augmented Generation
HUST
Hermes: Algorithm-System Co-design for Efficient Retrieval Augmented Generation At-Scale
RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
Quantization & Compression
Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression
Performance Modeling
AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs
TRACI: Network Acceleration of Input-Dynamic Communication for Large-Scale Deep Learning Recommendation Model
GPU Management
Forest: Access-aware GPU UVM Management
NetCrafter: Tailoring Network Traffic for Non-Uniform Bandwidth Multi-GPU Systems
UGPU: Dynamically Constructing Unbalanced GPUs for Enhanced Resource Efficiency
Serverless Computing
Single-Address-Space FaaS with Jord
Microservices
HardHarvest: Hardware-Supported Core Harvesting for Microservices
Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel's Gaudi NPU for AI Model Serving
Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines
UIUC
DCPerf: An Open-Source, Battle-Tested Performance Benchmark Suite for Datacenter Workloads
Industry Track
Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences
Industry Track