Model Serving

Large language models (LLMs) are evolving rapidly and differ substantially from conventional models in their serving requirements, so I track LLM-related works in a separate paper list.

I am actively maintaining this list.

Model Serving Systems

  • Paella: Low-latency Model Serving with Software-defined GPU Scheduling (SOSP 2023) [Paper]

    • UPenn & DBOS, Inc.

  • Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences (OSDI 2022) [Personal Notes] [Paper] [Code] [Benchmark] [Artifact]

    • SJTU

    • REEF: reset-based GPU kernel preemption; dynamic kernel padding.

  • INFaaS: Automated Model-less Inference Serving (ATC 2021) [Paper] [Code]

    • Stanford

    • Best Paper

    • Considers model variants; automatically navigates them to meet performance and cost requirements without requiring users to pick a specific model.

  • Clipper: A Low-Latency Online Prediction Serving System (NSDI 2017) [Personal Notes] [Paper] [Code]

    • UC Berkeley

    • Caching, adaptive batching, adaptive model selection (see the batching sketch after this list).

  • TensorFlow-Serving: Flexible, High-Performance ML Serving (NIPS 2017 Workshop on ML Systems) [Paper]

    • Google
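
Clipper's adaptive batching is a good example of the queueing techniques these systems rely on: requests are grouped into batches to raise accelerator throughput, under a delay bound so tail latency stays acceptable. The sketch below is a minimal, hypothetical illustration of that idea, not Clipper's implementation; `DynamicBatcher`, `predict_fn`, and the default parameters are all invented for this example, and Clipper additionally adapts the batch-size cap online (via AIMD) rather than fixing it.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from queue import Empty, Queue


class DynamicBatcher:
    """Group queued requests into batches, bounded by a maximum
    batch size and a maximum queueing delay (the latency budget)."""

    def __init__(self, predict_fn, max_batch_size=8, max_delay_s=0.01):
        self.predict_fn = predict_fn          # runs inference on a list of inputs
        self.max_batch_size = max_batch_size  # upper bound on batch size
        self.max_delay_s = max_delay_s        # how long a request may wait
        self.queue = Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, x):
        """Enqueue one input and block until its result is ready."""
        slot = {"input": x, "output": None, "done": threading.Event()}
        self.queue.put(slot)
        slot["done"].wait()
        return slot["output"]

    def _loop(self):
        while True:
            batch = [self.queue.get()]        # block for the first request
            deadline = time.monotonic() + self.max_delay_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.queue.get(timeout=remaining))
                except Empty:
                    break
            outputs = self.predict_fn([s["input"] for s in batch])
            for slot, out in zip(batch, outputs):
                slot["output"] = out
                slot["done"].set()


# Toy model: doubles each input. Concurrent callers get batched together.
batcher = DynamicBatcher(lambda xs: [2 * x for x in xs])
with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(batcher.submit, range(8))))
```

With several concurrent callers, the worker thread drains up to `max_batch_size` requests per inference call; a single sequential caller simply pays at most the `max_delay_s` budget per request.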

Auto-Configuration for Model Serving

  • Serving Unseen Deep Learning Models with Near-Optimal Configurations: a Fast Adaptive Search Approach (SoCC 2022) [Personal Notes] [Paper] [Code]

    • ISCAS

    • Characterizes a DL model by its key operators.

  • Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving (SoCC 2021) [Paper] [Code]

    • HKUST & Alibaba

    • Meta-learning; Bayesian optimization; Kubernetes (see the configuration-search sketch after this list).
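
Morphling combines a meta-learned performance model with sequential sampling of candidate configurations. The sketch below strips out the meta-learned prior and shows only the generic surrogate-plus-acquisition loop, assuming a plain Gaussian-process surrogate with a UCB rule over a small discrete grid of (CPU cores, batch size); `measure_rps` is a hypothetical placeholder for a real benchmark run.

```python
import numpy as np


def rbf(a, b, ls=0.3):
    # Squared-exponential kernel between two sets of points.
    d = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d / (2 * ls**2))


def gp_posterior(X, y, Xs, noise=1e-6):
    # Zero-mean GP posterior at test points Xs given observations (X, y).
    K_inv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))
    Ks = rbf(X, Xs)
    mu = Ks.T @ K_inv @ y
    var = np.diag(rbf(Xs, Xs) - Ks.T @ K_inv @ Ks)
    return mu, np.sqrt(np.maximum(var, 0.0))


def measure_rps(cfg):
    # Hypothetical stand-in objective: pretend throughput peaks at
    # (cpu=4, batch=16). In practice this is a real benchmark run.
    cpu, batch = cfg
    return -((cpu - 4) ** 2) - 0.05 * (batch - 16) ** 2


# Candidate configs: (CPU cores, batch size), normalized to [0, 1].
cpus = np.array([1, 2, 4, 8])
batches = np.array([1, 4, 16, 64])
grid = np.array([(c, b) for c in cpus for b in batches], dtype=float)
norm = (grid - grid.min(0)) / (grid.max(0) - grid.min(0))

rng = np.random.default_rng(0)
tried = [rng.integers(len(grid))]            # start from one random config
scores = [measure_rps(grid[tried[0]])]

for _ in range(6):                           # small sampling budget
    mu, sigma = gp_posterior(norm[tried], np.array(scores), norm)
    ucb = mu + 2.0 * sigma                   # upper confidence bound
    ucb[tried] = -np.inf                     # skip configs already measured
    nxt = int(np.argmax(ucb))
    tried.append(nxt)
    scores.append(measure_rps(grid[nxt]))

best = tried[int(np.argmax(scores))]
print("best config (cpu, batch):", grid[best])
```

Each iteration fits the surrogate to the configs measured so far and benchmarks the candidate with the highest upper confidence bound, so the small sampling budget concentrates on promising configurations.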

Survey

  • A Survey of Multi-Tenant Deep Learning Inference on GPU (MLSys 2022 Workshop on Cloud Intelligence / AIOps) [Paper]

    • George Mason & Microsoft & Maryland

  • A Survey of Large-Scale Deep Learning Serving System Optimization: Challenges and Opportunities (arXiv:2111.14247) [Paper]

    • George Mason & Microsoft & Pittsburgh & Maryland
