MLSys 2024
Homepage:
Paper list:
LoRA serving
S-LoRA: Serving Thousands of Concurrent LoRA Adapters [] [] []
UC Berkeley
A system to serve many LoRA adapters
Store all adapters in the main memory and fetch the adapters used by the currently running queries to the GPU memory
Unified Paging — a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths (see the sketch at the end of this entry)
Employ a tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation
Built on top of LightLLM
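The Unified Paging bullet above can be made concrete with a small sketch. This is not S-LoRA's code: the pool size, page granularity, and names (UnifiedPool, kv_pages, adapter_pages) are assumptions, used only to show how variable-length KV caches and variable-rank adapter weights can share one pool of fixed-size pages.

```python
import torch

PAGE_NUMEL = 16 * 64   # elements per page (16 tokens x head_dim 64; illustrative values)

class UnifiedPool:
    """One pool of fixed-size pages shared by KV-cache blocks and LoRA adapter weights."""
    def __init__(self, num_pages, device="cpu"):
        self.pages = torch.empty(num_pages, PAGE_NUMEL, dtype=torch.float16, device=device)
        self.free = list(range(num_pages))

    def alloc(self, n_pages):
        assert len(self.free) >= n_pages, "pool exhausted: evict adapters or queue the request"
        return [self.free.pop() for _ in range(n_pages)]

    def release(self, page_ids):
        self.free.extend(page_ids)

def kv_pages(seq_len, tokens_per_page=16):
    return -(-seq_len // tokens_per_page)            # ceil division: pages for one sequence

def adapter_pages(rank, hidden=4096):
    return -(-(2 * rank * hidden) // PAGE_NUMEL)     # LoRA A and B factors, flattened into pages

pool = UnifiedPool(num_pages=256)
kv = pool.alloc(kv_pages(seq_len=100))               # 7 pages for a 100-token sequence
ad = pool.alloc(adapter_pages(rank=16))              # 128 pages for a rank-16 adapter
pool.release(ad)                                     # adapter pages freed once no running query needs it
```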
Punica: Multi-Tenant LoRA Serving [] []
UW & Duke
A system to serve multiple LoRA models in a shared GPU cluster
A CUDA kernel — Segmented Gather Matrix-Vector Multiplication (SGMV); reference semantics are sketched at the end of this entry
Batch GPU operations for concurrent execution of different LoRA models
A GPU only needs to store a single copy of the pre-trained model
A request scheduling mechanism to consolidate multi-tenant LoRA serving workloads
Route the new request to a small set of active GPUs
Allocate additional GPU resources when the existing GPUs are fully utilized
Periodically migrate existing requests for consolidation
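Since SGMV is the core of the batching trick, here is a plain-PyTorch reference of its semantics, written as a loop over adapter segments. Punica fuses this into a single CUDA kernel; the function name, argument layout, and shapes below are illustrative, not Punica's actual API.

```python
import torch

def sgmv_reference(x, adapter_weights, seg_starts, adapter_ids):
    """
    x:               (num_tokens, in_dim)  activations of a mixed batch of requests
    adapter_weights: (num_adapters, in_dim, out_dim)  stacked per-adapter LoRA factors
    seg_starts:      list of length num_segments + 1; tokens in
                     [seg_starts[i], seg_starts[i+1]) all use the same adapter
    adapter_ids:     list of length num_segments; which adapter each segment uses
    """
    out = x.new_zeros(x.shape[0], adapter_weights.shape[-1])
    for i, a in enumerate(adapter_ids):
        s, e = seg_starts[i], seg_starts[i + 1]
        out[s:e] = x[s:e] @ adapter_weights[a]   # gather adapter a, apply it to its segment
    return out

# Example: 5 tokens coming from 3 requests that use adapters 2, 0, and 2.
x = torch.randn(5, 8)
W = torch.randn(4, 8, 16)                        # 4 adapters resident on the GPU
y = sgmv_reference(x, W, seg_starts=[0, 2, 3, 5], adapter_ids=[2, 0, 2])
print(y.shape)                                   # torch.Size([5, 16])
```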
LLM inference
Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference [] []
UBC & d-Matrix
Prompt Cache: Modular Attention Reuse for Low-Latency Inference []
Yale & Google
HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices []
NUS
Vidur: A Large-scale Simulation Framework for LLM Inference [] []
GaTech & MSR India
FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics []
THU & Infinigence-AI
LLM fine-tuning
Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems []
UT-Austin
LLM for data manipulation
UniDM: A Unified Framework for Data Manipulation with Large Language Models []
Alibaba & USTC
MoE training
Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication []
HKU & AWS & Boson AI
MoE inference
QMoE: Sub-1-Bit Compression of Trillion Parameter Models [] []
Institute of Science and Technology Austria
SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
Diffusion model training
DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines []
HKU & AWS
Recommendation model training
Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation []
Meta AI
Compilation
ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time []
CMU
Perform hybrid static+dynamic compiler optimizations and end-to-end tensor code generation
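To convey what auto-batching of dynamic deep learning means, here is a hand-written, run-time toy for a tree-structured model: nodes of several input trees that sit at the same height are evaluated with one batched matmul instead of many tiny ones. ACRoBat derives this kind of batching automatically at compile time via the hybrid static+dynamic analysis noted above; none of the names below come from the paper.

```python
import torch

HIDDEN = 8
leaf_proj = torch.nn.Linear(1, HIDDEN)       # embeds a leaf value
merge = torch.nn.Linear(2 * HIDDEN, HIDDEN)  # combines two child states

class Node:
    def __init__(self, value=None, left=None, right=None):
        self.value, self.left, self.right = value, left, right
        self.height = 0 if left is None else 1 + max(left.height, right.height)

def encode_forest(roots):
    """Encode several binary trees at once, batching all nodes of equal height."""
    groups, state = {}, {}
    stack = list(roots)
    while stack:                              # collect every node, grouped by height
        n = stack.pop()
        groups.setdefault(n.height, []).append(n)
        if n.left is not None:
            stack += [n.left, n.right]
    for h in sorted(groups):                  # children (smaller height) are computed first
        nodes = groups[h]
        if h == 0:                            # all leaves in one batched projection
            x = torch.tensor([[n.value] for n in nodes])
            hs = torch.tanh(leaf_proj(x))
        else:                                 # all internal nodes of this height in one batched merge
            x = torch.stack([torch.cat([state[n.left], state[n.right]]) for n in nodes])
            hs = torch.tanh(merge(x))
        for n, h_n in zip(nodes, hs):
            state[n] = h_n
    return torch.stack([state[r] for r in roots])

trees = [Node(left=Node(1.0), right=Node(2.0)),
         Node(left=Node(left=Node(3.0), right=Node(4.0)), right=Node(5.0))]
print(encode_forest(trees).shape)             # torch.Size([2, 8])
```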
FP8 quantization
Efficient Post-training Quantization with FP8 Formats []
Intel
LLM quantization
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration [] []
MIT
Best Paper Award
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving [] [] [] []
UW
Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache [] []
UT-Texas & Oxford & Eindhoven University of Technology & Lawrence Livermore National Laboratory & CMU
ML training
JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training [] []
AMD
ML for cloud
FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms [Paper] [] []
UIUC & IBM
CloudEval-YAML: A Practical Benchmark for Cloud Native YAML Configuration Generation [] [] [] []
Alibaba Cloud & UMich & UCLA & UC Merced
ML: Machine Learning
LLM: Large Language Model
LoRA: Low-Rank Adaptation
MoE: Mixture-of-Experts