MLSys 2024
Meta Info
Homepage: https://mlsys.org/Conferences/2024
Paper list: https://mlsys.org/Conferences/2024/AcceptedPapers
Papers
Large Language Models (LLMs)
LoRA serving
S-LoRA: Serving Thousands of Concurrent LoRA Adapters [Paper] [arXiv] [Code]
UC Berkeley
A system to serve many LoRA adapters
Store all adapters in main memory; fetch only the adapters used by the currently running queries into GPU memory
Unified Paging — a unified memory pool that manages dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths (sketched after this list)
Employ a tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation
Built on top of LightLLM
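A minimal sketch of the Unified Paging idea, assuming a single flat buffer carved into fixed-size pages; the names (`UnifiedPool`) and sizes are illustrative, not S-LoRA's actual API:

```python
# Sketch of S-LoRA-style Unified Paging (hypothetical names, not the real
# S-LoRA API): KV-cache blocks and LoRA adapter weights are allocated from
# one shared page pool, so their memory is managed jointly.
import torch

PAGE_SIZE = 4096          # elements per page (assumed granularity)
NUM_PAGES = 1024          # pool capacity (assumed)

class UnifiedPool:
    def __init__(self, dtype=torch.float16, device="cpu"):
        # One flat buffer backs both KV cache pages and adapter pages.
        self.pool = torch.empty(NUM_PAGES, PAGE_SIZE, dtype=dtype, device=device)
        self.free = list(range(NUM_PAGES))

    def alloc(self, n_elems):
        """Allocate enough pages to hold n_elems; return their page indices."""
        n_pages = -(-n_elems // PAGE_SIZE)  # ceiling division
        if n_pages > len(self.free):
            raise MemoryError("pool exhausted; evict or swap adapters")
        return [self.free.pop() for _ in range(n_pages)]

    def release(self, pages):
        self.free.extend(pages)

pool = UnifiedPool()
# A rank-16 adapter layer (16 * 4096 elements) and a KV block for 1024
# tokens draw from the same pool despite their different shapes.
adapter_pages = pool.alloc(16 * 4096)
kv_pages = pool.alloc(2 * 1024 * 128)   # K and V, head dim 128
pool.release(adapter_pages)             # freed pages are reusable by either kind
```

Because adapters and KV cache share one free list, memory released by a finished request can immediately back a newly fetched adapter, and vice versa.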
Punica: Multi-Tenant LoRA Serving [arXiv] [Code]
UW & Duke
A system to serve multiple LoRA models in a shared GPU cluster
A custom CUDA kernel — Segmented Gather Matrix-Vector Multiplication (SGMV)
Batch GPU operations so that requests using different LoRA models execute concurrently (see the reference sketch after this list)
A GPU only needs to store a single copy of the pre-trained model
A request scheduling mechanism to consolidate multi-tenant LoRA serving workloads
Route the new request to a small set of active GPUs
Allocate additional GPU resources when the existing GPUs are fully utilized
Periodically migrate existing requests for consolidation
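A pure-PyTorch reference for SGMV's semantics (a sketch, not Punica's kernel, which fuses the gather and the matrix-vector products in a single GPU launch):

```python
# Reference semantics of SGMV (Segmented Gather Matrix-Vector multiplication),
# written as a plain loop; the real Punica CUDA kernel fuses this.
import torch

def sgmv_ref(x, weights, segment_ids):
    """
    x:           (batch, in_dim)   one hidden-state vector per request
    weights:     (num_adapters, in_dim, out_dim)   stacked LoRA matrices
    segment_ids: (batch,)   which adapter each request uses
    """
    out = torch.empty(x.size(0), weights.size(2), dtype=x.dtype)
    for i in range(x.size(0)):
        # Gather this request's adapter, then do a matrix-vector product.
        out[i] = x[i] @ weights[segment_ids[i]]
    return out

# Three requests share a batch while using two different adapters:
x = torch.randn(3, 64)
w = torch.randn(2, 64, 16)        # two rank-16 LoRA "A" matrices, say
ids = torch.tensor([0, 1, 0])
y = sgmv_ref(x, w, ids)           # (3, 16)
```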
LLM inference
Prompt Cache: Modular Attention Reuse for Low-Latency Inference [Paper]
Yale & Google
HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices [Paper]
NUS
FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics [Paper]
THU & Infinigence-AI
LLM fine-tuning
Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems [Paper]
UT-Austin
LLM for data manipulation
UniDM: A Unified Framework for Data Manipulation with Large Language Models [Paper]
Alibaba & USTC
Mixture-of-Experts (MoEs)
MoE training
Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication [Paper]
HKU & AWS & Boson AI
Diffusion Models
DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines [Paper]
HKU & AWS
Deep Learning Recommendation Models (DLRMs)
Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation [Paper]
Meta AI
ML Compilation
ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time [Paper]
CMU
Perform hybrid static+dynamic compiler optimizations and end-to-end tensor code generation (a toy auto-batching illustration follows)
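A toy illustration of what auto-batching buys (hand-written here; ACRoBat discovers and generates such batching automatically): dynamic control flow issues many small identical ops, which can be grouped into one batched launch.

```python
# Toy auto-batching example (not ACRoBat's compiler): ops issued one-by-one
# under dynamic control flow are grouped by signature and launched once.
import torch

def run_unbatched(inputs, weight):
    # Dynamic loop: one small matmul (one kernel launch) per input.
    return [x @ weight for x in inputs]

def run_autobatched(inputs, weight):
    # The ops share a signature, so stack them and launch a single kernel.
    stacked = torch.stack(inputs)            # (n, in_dim)
    return list(torch.unbind(stacked @ weight))

w = torch.randn(64, 64)
xs = [torch.randn(64) for _ in range(32)]
assert all(torch.allclose(a, b, atol=1e-5)
           for a, b in zip(run_unbatched(xs, w), run_autobatched(xs, w)))
```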
Quantization
FP8
Efficient Post-training Quantization with FP8 Formats [Paper]
Intel
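A minimal sketch of per-tensor FP8 (E4M3) post-training quantization (assumes torch >= 2.1 for torch.float8_e4m3fn; the amax-based scale is a common heuristic, not necessarily the paper's recipe):

```python
# Per-tensor FP8 (E4M3) post-training quantization, simulated in PyTorch.
import torch

E4M3_MAX = 448.0  # largest representable magnitude in E4M3

def quantize_fp8(x):
    scale = x.abs().max().clamp(min=1e-12) / E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize_fp8(x_fp8, scale):
    return x_fp8.to(torch.float32) * scale

w = torch.randn(256, 256)
w_fp8, s = quantize_fp8(w)
err = (w - dequantize_fp8(w_fp8, s)).abs().max()
print(f"max abs quantization error: {err:.4f}")
```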
LLM
Model Adaptation
Cloud Configuration Generation
Acronyms
ML: Machine Learning
LLM: Large Language Model
LoRA: Low-Rank Adaptation
MoE: Mixture-of-Experts