# ISCA 2025

## Meta Info

Homepage: <https://iscaconf.org/isca2025/>

Paper list: <https://www.iscaconf.org/isca2025/program/>

## Papers

### Large Language Models (LLMs)

* LLM Training
  * Chimera: Communication Fusion for Hybrid Parallelism in Large Language Models \[[Code](https://zenodo.org/records/15104237)]
    * HKUST-GZ
  * MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training \[[Paper](https://iacoma.cs.uiuc.edu/iacoma-papers/isca25_1.pdf)]
    * UIUC
  * Scaling Llama 3 Training with Efficient Parallelism Strategies
    * Industry Track
* LLM Inference
  * H2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference
    * Best Paper Nominee
  * SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting
  * LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
  * AiF: Accelerating On-Device LLM Inference Using In-Flash Processing
  * LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading
    * UIUC
  * Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window
  * WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-based Dynamic Scheduling
  * Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
    * Industry Track
* Retrieval-Augmented Generation (RAG)
  * HeterRAG: Heterogeneous Processing-in-Memory Acceleration for Retrieval-augmented Generation
    * HUST
  * Hermes: Algorithm-System Co-design for Efficient Retrieval Augmented Generation At-Scale
  * RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
* Quantization & Compression
  * Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
  * Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression
* Performance Modeling
  * AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs

### Deep Learning Recommendation Models (DLRMs)

* TRACI: Network Acceleration of Input-Dynamic Communication for Large-Scale Deep Learning Recommendation Model

### Resource Management

* GPU Management
  * Forest: Access-aware GPU UVM Management
  * NetCrafter: Tailoring Network Traffic for Non-Uniform Bandwidth Multi-GPU Systems
  * UGPU: Dynamically Constructing Unbalanced GPUs for Enhanced Resource Efficiency
* Serverless Computing
  * Single-Address-Space FaaS with Jord
* Microservices
  * HardHarvest: Hardware-Supported Core Harvesting for Microservices

### Performance Analysis & Benchmark

* Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel's Gaudi NPU for AI Model Serving
* Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines
  * UIUC
* DCPerf: An Open-Source, Battle-Tested Performance Benchmark Suite for Datacenter Workloads
  * Industry Track

### AI Chip

* Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences
  * Industry Track
