ISCA 2025

Meta Info

Homepage: https://iscaconf.org/isca2025/

Paper list: https://www.iscaconf.org/isca2025/program/

Papers

Large Language Models (LLMs)

  • LLM Training

    • Chimera: Communication Fusion for Hybrid Parallelism in Large Language Models [Code]

      • HKUST-GZ

    • MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training [Paper]

      • UIUC

    • Scaling Llama 3 Training with Efficient Parallelism Strategies

      • Industry Track

  • LLM Inference

    • H2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference

      • Best Paper Nominee

    • SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting

    • LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference

    • AiF: Accelerating On-Device LLM Inference Using In-Flash Processing

    • LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading

      • UIUC

    • Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window

    • WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-based Dynamic Scheduling

    • Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

      • Industry Track

  • Retrieval-Augmented Generation (RAG)

    • HeterRAG: Heterogeneous Processing-in-Memory Acceleration for Retrieval-augmented Generation

      • HUST

    • Hermes: Algorithm-System Co-design for Efficient Retrieval Augmented Generation At-Scale

    • RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving

  • Quantization & Compression

    • Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization

    • Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression

  • Performance Modeling

    • AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs

Deep Learning Recommendation Models (DLRMs)

  • TRACI: Network Acceleration of Input-Dynamic Communication for Large-Scale Deep Learning Recommendation Model

Resource Management

  • GPU Management

    • Forest: Access-aware GPU UVM Management

    • NetCrafter: Tailoring Network Traffic for Non-Uniform Bandwidth Multi-GPU Systems

    • UGPU: Dynamically Constructing Unbalanced GPUs for Enhanced Resource Efficiency

  • Serverless Computing

    • Single-Address-Space FaaS with Jord

  • Microservices

    • HardHarvest: Hardware-Supported Core Harvesting for Microservices

Performance Analysis & Benchmark

  • Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel's Gaudi NPU for AI Model Serving

  • Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines

    • UIUC

  • DCPerf: An Open-Source, Battle-Tested Performance Benchmark Suite for Datacenter Workloads

    • Industry Track

AI Chip

  • Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences

    • Industry Track
