Clipper: A low-latency online prediction serving system

Metadata

Presented at NSDI 2017.

Authors: Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, Ion Stoica

Code: https://github.com/ucbrise/clipper

Understanding the paper

TL;DR

This paper presents Clipper, a general-purpose model serving system. Its modular architecture simplifies model deployment across frameworks and applications, and it uses caching, batching, and adaptive model selection to reduce prediction latency and improve throughput and accuracy.

Three crucial properties of a model serving system

Low latency
High throughput
Improved accuracy

Technical details

Clipper is divided into two layers.
- Model abstraction layer: provides a common interface across machine learning frameworks.
  - Caching: maintains a prediction cache with an LRU eviction policy.
  - Adaptive query batching: batching amortizes per-request overhead and enables the data-parallel optimizations in ML frameworks.
    - Dynamic batch size: tuned with an additive-increase-multiplicative-decrease (AIMD) scheme.
    - Delayed batching: waits up to a batch timeout for more queries to arrive before dispatching a batch.
  - Model containers: each model runs in a separate Docker container, and containers can be replicated to scale out.
- Model selection layer: dispatches each prediction request to one or more models through the model abstraction layer.
  - Single model selection: treated as a multi-armed bandit problem (Exp3 algorithm).
  - Ensemble model selection (Exp4 algorithm): combines predictions from multiple models. Ensembles are exposed to stragglers, so Clipper returns a prediction when the latency deadline is reached even if some ensemble members are missing. This reduces accuracy, but rests on the design choice that rendering a late prediction is worse than rendering an inaccurate one.
  - Personalized model selection (contextualization): instantiates a unique model selection state for each user, context, or session.
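The single model selection policy can be illustrated with a minimal sketch of the loss-based form of Exp3: each model starts with weight 1, a model is sampled with probability proportional to its weight, and the chosen model's weight is multiplied by exp(-eta * loss / p) once feedback arrives. The class name `Exp3Selector`, the default `eta`, and the renormalization step are assumptions made for illustration; this is not Clipper's actual code.

```python
import math
import random


class Exp3Selector:
    """Loss-based Exp3 over a fixed set of models (hypothetical helper)."""

    def __init__(self, model_names, eta=0.1, seed=None):
        self.model_names = list(model_names)
        self.eta = eta  # learning rate; 0.1 is an illustrative default
        self.weights = {name: 1.0 for name in self.model_names}
        self._rng = random.Random(seed)

    def probabilities(self):
        """Return the selection probability of each model."""
        total = sum(self.weights.values())
        return {name: w / total for name, w in self.weights.items()}

    def select(self):
        """Sample one model name according to the current probabilities."""
        probs = self.probabilities()
        p = [probs[name] for name in self.model_names]
        return self._rng.choices(self.model_names, weights=p, k=1)[0]

    def update(self, name, loss):
        """Penalize the model that served the query; loss must be in [0, 1]."""
        if not 0.0 <= loss <= 1.0:
            raise ValueError("loss must be in [0, 1]")
        p = self.probabilities()[name]
        self.weights[name] *= math.exp(-self.eta * loss / p)
        # Assumption: rescale so the largest weight is 1. This keeps the
        # ratios (and therefore the probabilities) unchanged while
        # avoiding floating-point underflow over long runs.
        top = max(self.weights.values())
        for key in self.weights:
            self.weights[key] /= top


selector = Exp3Selector(["model_a", "model_b"], seed=0)
chosen = selector.select()
selector.update(chosen, loss=1.0)  # feedback: the prediction was wrong
```

Dividing the loss by the selection probability compensates for the fact that only the chosen model receives feedback, so rarely selected models are not under-penalized. For personalized selection, one such object could be kept per user or session.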
Clipper uses a cross-language RPC to send the batch of queries to a model container hosting the model in its native machine learning framework.
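How large each of those batches should be is what the AIMD scheme decides. The following is a minimal sketch, assuming the controller sees the measured latency of each batch and compares it with a latency objective. The class `AimdBatchSizer`, the `step` and `backoff` defaults, and the `fake_latency_ms` cost model are hypothetical and exist only to make the example runnable.

```python
class AimdBatchSizer:
    """Additive-increase, multiplicative-decrease batch size controller."""

    def __init__(self, slo_ms, step=1, backoff=0.9, initial=1):
        if not 0.0 < backoff < 1.0:
            raise ValueError("backoff must be in (0, 1)")
        self.slo_ms = slo_ms    # latency objective for one batch
        self.step = step        # additive increase per successful batch
        self.backoff = backoff  # multiplicative decrease factor (assumed)
        self.batch_size = initial

    def observe(self, latency_ms):
        """Update the batch size from the last batch's latency and return it."""
        if latency_ms > self.slo_ms:
            # Objective missed: back off multiplicatively, never below 1.
            self.batch_size = max(1, int(self.batch_size * self.backoff))
        else:
            # Objective met: probe a slightly larger batch next time.
            self.batch_size += self.step
        return self.batch_size


def fake_latency_ms(batch_size):
    # Hypothetical cost model: 2 ms fixed overhead plus 0.5 ms per query.
    return 2.0 + 0.5 * batch_size


sizer = AimdBatchSizer(slo_ms=20.0)
for _ in range(200):
    sizer.observe(fake_latency_ms(sizer.batch_size))
print(sizer.batch_size)
```

Under this cost model the batch size climbs one query at a time until a batch misses the objective, drops by the back-off factor, and then climbs again, so it stays in a narrow band around the largest batch that still meets the objective.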

Comparison

Compared to TensorFlow Serving, Clipper offers a modular architecture and a substantially broader set of features with minimal performance penalty.
Unlike general data-serving systems, where the dominant cost tends to be I/O, the dominant cost in prediction serving is computation.

Limitations

Doesn't optimize the execution of the models; it treats them as black-box components.
Doesn't manage model training or retraining.
