Model Serving

Large language models (LLMs) are a fast-moving and diverse area compared with conventional models, so I track LLM-related work in a separate paper list.

I am actively maintaining this list.

Model Serving Systems

Usher: Holistic Interference Avoidance for Resource Optimized ML Inference (OSDI 2024) [Paper] [Code]
- UVA & GaTech
Paella: Low-latency Model Serving with Software-defined GPU Scheduling (SOSP 2023) [Paper]
- UPenn & DBOS, Inc.
Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences (OSDI 2022) [Personal Notes] [Paper] [Code] [Benchmark] [Artifact]
- SJTU
- REEF: GPU kernel preemption; dynamic kernel padding.
INFaaS: Automated Model-less Inference Serving (ATC 2021) [Paper] [Code]
- Stanford
- Best Paper
- Considers model variants.
Clipper: A Low-Latency Online Prediction Serving System (NSDI 2017) [Personal Notes] [Paper] [Code]
- UC Berkeley
- Caching, batching, adaptive model selection.
TensorFlow-Serving: Flexible, High-Performance ML Serving (NIPS 2017 Workshop on ML Systems) [Paper]
- Google

Auto-Configuration for Model Serving

Serving Unseen Deep Learning Models with Near-Optimal Configurations: a Fast Adaptive Search Approach (SoCC 2022) [Personal Notes] [Paper] [Code]
- ISCAS
- Characterize a DL model by its key operators.
Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving (SoCC 2021) [Paper] [Code]
- HKUST & Alibaba
- Meta-learning; Bayesian optimization; Kubernetes.

Survey

A Survey of Multi-Tenant Deep Learning Inference on GPU (MLSys 2022 Workshop on Cloud Intelligence / AIOps) [Paper]
- George Mason & Microsoft & Maryland
A Survey of Large-Scale Deep Learning Serving System Optimization: Challenges and Opportunities (arXiv 2111.14247) [Paper]
- George Mason & Microsoft & Pittsburgh & Maryland

Last updated 1 year ago