# ISCA 2025

## Meta Info

Homepage: <https://iscaconf.org/isca2025/>

Paper list: <https://www.iscaconf.org/isca2025/program/>

## Papers

### Large Language Models (LLMs)

* LLM Training
  * Chimera: Communication Fusion for Hybrid Parallelism in Large Language Models \[[Code](https://zenodo.org/records/15104237)]
    * HKUST-GZ
  * MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training \[[Paper](https://iacoma.cs.uiuc.edu/iacoma-papers/isca25_1.pdf)]
    * UIUC
  * Scaling Llama 3 Training with Efficient Parallelism Strategies
    * Industry Track
* LLM Inference
  * H2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference
    * Best Paper Nominee
  * SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting
  * LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
  * AiF: Accelerating On-Device LLM Inference Using In-Flash Processing
  * LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading
    * UIUC
  * Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window
  * WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-based Dynamic Scheduling
  * Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
    * Industry Track
* Retrieval-Augmented Generation (RAG)
  * HeterRAG: Heterogeneous Processing-in-Memory Acceleration for Retrieval-augmented Generation
    * HUST
  * Hermes: Algorithm-System Co-design for Efficient Retrieval Augmented Generation At-Scale
  * RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
* Quantization & Compression
  * Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
  * Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression
* Performance Modeling
  * AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs

### Deep Learning Recommendation Models (DLRMs)

* TRACI: Network Acceleration of Input-Dynamic Communication for Large-Scale Deep Learning Recommendation Model

### Resource Management

* GPU Management
  * Forest: Access-aware GPU UVM Management
  * NetCrafter: Tailoring Network Traffic for Non-Uniform Bandwidth Multi-GPU Systems
  * UGPU: Dynamically Constructing Unbalanced GPUs for Enhanced Resource Efficiency
* Serverless Computing
  * Single-Address-Space FaaS with Jord
* Microservices
  * HardHarvest: Hardware-Supported Core Harvesting for Microservices

### Performance Analysis & Benchmark

* Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel's Gaudi NPU for AI Model Serving
* Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines
  * UIUC
* DCPerf: An Open-Source, Battle-Tested Performance Benchmark Suite for Datacenter Workloads
  * Industry Track

### AI Chip

* Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences
  * Industry Track
