Model Serving
Large language models (LLMs) have attracted intense interest and differ considerably from conventional models, so I have classified the LLM-related works on a separate page.
Usher: Holistic Interference Avoidance for Resource Optimized ML Inference (OSDI 2024) [] []
UVA & GaTech
Paella: Low-latency Model Serving with Software-defined GPU Scheduling (SOSP 2023) []
UPenn & DBOS, Inc.
Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences (OSDI 2022) [] [] [] [] []
SJTU
REEF: reset-based GPU kernel preemption; dynamic kernel padding.
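The preemption side builds on the observation that best-effort DNN kernels are largely idempotent, so they can be killed on the spot when a latency-critical request arrives and simply relaunched later. Below is a toy plain-Python sketch of that kill-and-relaunch policy (the job names, slice counts, and timings are made up, the real system operates on GPU kernels rather than Python loops, and the dynamic-kernel-padding half is not shown):

```python
import queue
import time

# Toy simulation of reset-based preemption: best-effort jobs are assumed
# idempotent, so an in-flight job can be discarded and re-queued when a
# real-time request shows up. All names and timings are illustrative.
best_effort = queue.Queue()
for job in ["resnet50-batch", "bert-batch"]:
    best_effort.put(job)

RT_ARRIVAL_S = 0.010              # pretend a latency-critical request lands ~10 ms in
start = time.monotonic()
rt_served = False

while not best_effort.empty():
    job = best_effort.get()
    for kernel in range(100):     # stand-in for the job's sequence of kernels
        if not rt_served and time.monotonic() - start > RT_ARRIVAL_S:
            print(f"kill {job} at kernel {kernel}; run real-time kernel immediately")
            best_effort.put(job)  # relaunch the killed job from scratch later
            rt_served = True
            break
        time.sleep(0.0002)        # stand-in for one kernel's execution time
    else:
        print(f"{job} finished")
```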
INFaaS: Automated Model-less Inference Serving (ATC 2021) [] []
Stanford
Best Paper
Selects among model-variants to meet each query's performance, accuracy, and cost requirements.
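A rough sketch of what model-variant selection can look like: given a registry of profiled variants, pick the cheapest one that satisfies a query's latency and accuracy goals. The registry entries, field names, and numbers below are assumptions for illustration, not INFaaS's actual API.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str            # e.g. a framework / batch-size / hardware combination
    p99_latency_ms: float
    accuracy: float
    cost_per_1k: float   # dollars per 1k queries (made-up unit for the sketch)

# Hypothetical profiled variants of one model family (illustrative numbers).
REGISTRY = [
    Variant("resnet50-trt-gpu-b8",  p99_latency_ms=12, accuracy=0.76, cost_per_1k=0.40),
    Variant("resnet50-onnx-cpu-b1", p99_latency_ms=85, accuracy=0.76, cost_per_1k=0.08),
    Variant("resnet18-onnx-cpu-b1", p99_latency_ms=30, accuracy=0.70, cost_per_1k=0.05),
]

def pick_variant(max_latency_ms: float, min_accuracy: float) -> Variant:
    """Return the cheapest variant that satisfies the query's SLOs."""
    feasible = [v for v in REGISTRY
                if v.p99_latency_ms <= max_latency_ms and v.accuracy >= min_accuracy]
    if not feasible:
        raise ValueError("no variant meets the requested SLO")
    return min(feasible, key=lambda v: v.cost_per_1k)

# A tight latency + accuracy goal falls back to the GPU variant.
print(pick_variant(max_latency_ms=20, min_accuracy=0.75).name)
```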
Clipper: A Low-Latency Online Prediction Serving System (NSDI 2017) [] [] []
UC Berkeley
Caching, batching, adaptive model selection.
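To make the batching and caching ideas concrete, here is a minimal sketch of size- and deadline-bounded batching in front of a model, plus a small prediction cache. The `model` stub, batching parameters, and cache policy are simplified assumptions rather than Clipper's actual implementation.

```python
import time
from collections import OrderedDict

MAX_BATCH = 8          # batch-size cap, chosen for illustration
MAX_WAIT_S = 0.005     # hold requests at most 5 ms while building a batch

cache = OrderedDict()  # tiny prediction cache keyed by raw input

def model(batch):
    # Stand-in for the real model container: return a fake score per input.
    return [hash(x) % 100 / 100.0 for x in batch]

def flush(batch, results):
    for req, out in zip(batch, model(batch)):
        results[req] = cache[req] = out
        if len(cache) > 1024:          # evict the oldest cached prediction
            cache.popitem(last=False)

def serve(requests):
    """Answer cached requests directly; batch the rest by size and deadline."""
    results, pending = {}, []
    deadline = time.monotonic() + MAX_WAIT_S
    for req in requests:
        if req in cache:
            results[req] = cache[req]
            continue
        pending.append(req)
        if len(pending) == MAX_BATCH or time.monotonic() >= deadline:
            flush(pending, results)
            pending, deadline = [], time.monotonic() + MAX_WAIT_S
    if pending:
        flush(pending, results)
    return results

print(serve(["imgA", "imgB"]))  # both go through the model
print(serve(["imgA"]))          # answered from the prediction cache
```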
TensorFlow-Serving: Flexible, High-Performance ML Serving (NIPS 2017 Workshop on ML Systems) []
Serving Unseen Deep Learning Models with Near-Optimal Configurations: a Fast Adaptive Search Approach (SoCC 2022) [] [] []
ISCAS
Characterize a DL model by its key operators.
Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving (SoCC 2021) [] []
HKUST & Alibaba
Meta learning; Bayesian optimization; Kubernetes (a rough configuration-search sketch appears at the end of this page).
A Survey of Multi-Tenant Deep Learning Inference on GPU (MLSys 2022 Workshop on Cloud Intelligence / AIOps) []
George Mason & Microsoft & Maryland
A Survey of Large-Scale Deep Learning Serving System Optimization: Challenges and Opportunities (arXiv 2111.14247) []
George Mason & Microsoft & Pittsburgh & Maryland
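As referenced in the Morphling entry above, its auto-configuration can be pictured as searching a space of serving configurations (GPU share, CPU cores, batch size) for the highest-throughput one under a cost budget. Morphling warm-starts this with meta-learned priors and refines it with Bayesian optimization; the sketch below substitutes plain random sampling and a stubbed benchmark, so every name and number in it is an assumption.

```python
import itertools
import random

# Hypothetical configuration space for one model's serving pod.
SPACE = {
    "gpu_share":  [0.25, 0.5, 1.0],
    "cpu_cores":  [1, 2, 4],
    "batch_size": [1, 4, 8, 16],
}

def benchmark(cfg):
    """Stub for a real load test against a deployed pod; returns RPS.
    A fake response surface keeps the example runnable offline."""
    return (cfg["gpu_share"] * 100 + cfg["cpu_cores"] * 5) * (1 + 0.1 * cfg["batch_size"])

def cost(cfg):
    return cfg["gpu_share"] * 3.0 + cfg["cpu_cores"] * 0.2   # $/hour, made up

def search(budget_per_hour, trials=10, seed=0):
    """Random-sampling stand-in for the meta-learning + Bayesian-optimization loop."""
    rng = random.Random(seed)
    candidates = [dict(zip(SPACE, vals)) for vals in itertools.product(*SPACE.values())]
    best, best_rps = None, -1.0
    for cfg in rng.sample(candidates, min(trials, len(candidates))):
        if cost(cfg) > budget_per_hour:
            continue                      # skip configurations over budget
        rps = benchmark(cfg)
        if rps > best_rps:
            best, best_rps = cfg, rps
    return best, best_rps

print(search(budget_per_hour=2.0))
```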