PPoPP 2026

Meta Info

LLM training
- CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
  - ICT, CAS & Ant Group
- COCCL: A Collective Communication Library Supporting Easy Integration and Configuration of Customized Compression for Scalable LLM Training
  - ICT, CAS & CUHK-SZ
- Elastor: Elastic and Efficient Model Partitioning and Checkpointing for Fault-tolerant Distributed Training
  - PKU
- HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism [arXiv] [Code]
  - NUS
LLM inference
- JanusQuant: Accurate and Efficient 2-bit KV Cache Quantization for Long-context Inference
  - WHU
- Laser: Unlocking Layer-Level Scheduling for Efficient Multi-SLO LLM Serving
  - SYSU
- High-Throughput Non-Uniformly Quantized 3-bit LLM Inference
  - CUHK & HKUST
- Accelerating Sparse Transformer Inference on GPU
  - CUP-Beijing & BUAA
Attention
- FlashAttention-T: Towards Fully Tensorized Attention by Exploiting Tensor-Vector Parallelism
  - USTC & ICT, CAS
- MetaAttention: A Unified and Performant Attention Framework Across Hardware Backends [arXiv] [Code]
  - SJTU, IPADS & PKU & MSRA

- Difflow: A Data-Characteristic-Aware Serving System for Diffusion Models
  - THU
- MixFusion: A Patch-Level Parallel Serving System for Mixed-Resolution Diffusion Models
  - UWaterloo & CMU & Rice

- APERTURE: Algorithm-System Co-Optimization for Temporal Graph Network Inference
  - BUAA
- ElasGNN: An Elastic Training Framework for Distributed GNN Training
  - BUAA
- TAC: Cache-based System for Accelerating Billion-Scale GNN Training on Multi-GPU Platform
  - UCAS

- ASM-SpMM: Unleashing the Potential of Arm SME for Sparse Matrix Multiplication Acceleration
  - SYSU
- Exploiting Efficient Mapping and Pipelined Execution for Accelerating SpMV on Tensor Cores
  - BUAA
- VDHA: Vector-Driven Hash Aggregation for Sparse Matrix–Sparse Vector Multiplication on GPUs
  - THU

- RoMeo: Mitigating Dual-dimensional Outliers with Rotated Mixed Precision Quantization [Artifact]
  - THU

- Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds
  - Alibaba Cloud

- Scaling GPU-to-CPU Migration for Efficient Distributed Execution on CPU Clusters
  - GaTech
- zBuffer: Zero-Copy and Metadata-Free Serialization for Fast RPC with Scatter-Gather Reflection
  - XMU & Alibaba & SJTU
