Clipper: A low-latency online prediction serving system
Metadata
Presented in NSDI 2017.
Authors: Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, Ion Stoica
Homepage: http://www.clipper.ai/
Code: https://github.com/ucbrise/clipper
Understanding the paper
TL;DR
This paper presents Clipper, a general-purpose model serving system with a modular architecture that simplifies model deployment across frameworks and applications, using caching, batching, and adaptive model selection to meet latency, throughput, and accuracy goals.
Three crucial properties of a model serving system
Low latency
High throughput
Improved accuracy
Technical details
Clipper is divided into two layers.
Model abstraction layer: provide a common interface across machine learning frameworks.
Caching: maintain a prediction cache with an LRU eviction policy (see the cache sketch after this list).
Adaptive query batching: batching amortizes RPC and system overhead and enables data-parallel optimizations in ML frameworks (both mechanisms are sketched after this list).
Dynamic batch size: additive-increase-multiplicative-decrease (AIMD) scheme.
Delayed batching: wait up to a batch timeout to fill a batch before dispatching it.
Model containers: each model is managed in a separate Docker container, supports replica scaling.
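A minimal sketch of the prediction cache idea, assuming hashable inputs and illustrative names (Clipper's internal cache is not exposed like this):

```python
from collections import OrderedDict

class PredictionCache:
    """Sketch of an LRU prediction cache keyed on (model, input).
    Illustrative only; inputs must be hashable here."""

    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, model, x):
        key = (model, x)
        if key not in self._entries:
            return None  # miss: caller must query the model container
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, model, x, prediction):
        self._entries[(model, x)] = prediction
        self._entries.move_to_end((model, x))
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```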
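And a sketch of the two batching mechanisms, with made-up constants (the paper tunes batch sizes per model against the latency SLO):

```python
import time
from queue import Queue, Empty

def aimd_batch_size(current, observed_latency, slo, add=4, mult=0.9):
    """AIMD control of the batch size: grow additively while batch
    latency stays under the SLO, back off multiplicatively when it
    is exceeded. The constants here are illustrative."""
    if observed_latency <= slo:
        return current + add
    return max(1, int(current * mult))

def collect_batch(queries: Queue, max_batch_size: int, wait_timeout: float):
    """Delayed batching: block for the first query, then wait up to
    wait_timeout for more queries before dispatching the batch."""
    batch = [queries.get()]  # block until at least one query arrives
    deadline = time.monotonic() + wait_timeout
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(queries.get(timeout=remaining))
        except Empty:
            break
    return batch
```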
Model selection layer: dispatch prediction requests to one or more models through the model abstraction layer.
Single model selection: treated as a multi-armed bandit problem and solved with the Exp3 algorithm (see the Exp3 sketch after this list).
Ensemble model selection (Exp4 algorithm): combine predictions from multiple models, and mitigate stragglers by rendering the prediction from whichever models have answered by the deadline, on the view that a late prediction is worse than an inaccurate one; the cost is that missing ensemble members can reduce accuracy (see the straggler-mitigation sketch after this list).
Personalized model selection (Contextualization): instantiate a unique model selection state for each user, context, or session.
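For single model selection, a compact sketch of the standard Exp3 update; the reward definition is an assumption here (e.g., derived from user feedback on the served prediction):

```python
import math
import random

class Exp3:
    """Standard Exp3 bandit over k candidate models: sample a model
    to serve each query, then reweight it from observed feedback."""

    def __init__(self, k, gamma=0.1):
        self.k = k
        self.gamma = gamma
        self.weights = [1.0] * k

    def _probs(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def select(self):
        return random.choices(range(self.k), weights=self._probs())[0]

    def update(self, chosen, reward):
        """reward must lie in [0, 1]."""
        p = self._probs()[chosen]
        x_hat = reward / p  # importance-weighted reward estimate
        self.weights[chosen] *= math.exp(self.gamma * x_hat / self.k)
```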
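And a sketch of straggler mitigation for ensembles, assuming each model exposes a blocking `predict` and that averaging is a sensible combiner for the application:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def ensemble_predict(models, x, slo_seconds, pool: ThreadPoolExecutor):
    """Query every model in parallel and combine whatever answers
    arrive before the latency deadline; stragglers are left out
    (they keep running in the background) rather than waited on."""
    futures = [pool.submit(m.predict, x) for m in models]
    done, _stragglers = wait(futures, timeout=slo_seconds)
    predictions = [f.result() for f in done]
    if not predictions:
        raise TimeoutError("no model answered within the SLO")
    return sum(predictions) / len(predictions)  # illustrative combiner
```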
Clipper uses a cross-language RPC system to send each batch of queries to a model container hosting the model in its native machine learning framework.
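The container side of that RPC reduces to a simple contract, sketched below with generic names (not Clipper's actual container API): receive a serialized batch, run the framework-native batch predict, return one result per input.

```python
from typing import List

class ModelContainer:
    """Generic sketch of the model-container contract: one batch in,
    one prediction per input out, in order. Serialization details
    depend on the RPC layer and are elided here."""

    def __init__(self, model, deserialize, serialize):
        self.model = model              # model in its native framework
        self.deserialize = deserialize  # bytes -> framework input
        self.serialize = serialize      # framework output -> str

    def predict_batch(self, raw_inputs: List[bytes]) -> List[str]:
        inputs = [self.deserialize(b) for b in raw_inputs]
        outputs = self.model.predict(inputs)  # framework-native batch call
        return [self.serialize(o) for o in outputs]
```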
Comparison
Compared to TensorFlow Serving, Clipper offers a modular architecture and a substantially broader feature set with minimal performance penalty.
Different from general data-serving systems: the dominant cost in data serving tends to be I/O, whereas in prediction serving it is computation.
Limitations
Doesn't optimize model execution; treats models as black-box components.
Doesn't manage training or retraining.