# Clipper: A low-latency online prediction serving system

## Metadata

Presented in [NSDI 2017](https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/crankshaw).

Authors: Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, Ion Stoica

Homepage: <http://www.clipper.ai/>

Code: <https://github.com/ucbrise/clipper>

## Understanding the paper

### TL;DR

This paper presents **Clipper**, a general-purpose model serving system whose modular architecture **simplifies model deployment across frameworks and applications**. It applies **caching**, **batching**, and **adaptive model selection** to reduce prediction latency and improve throughput and accuracy.

### Three crucial properties of a model serving system

* Low latency
* High throughput
* Improved accuracy

### Technical details

![The architecture of Clipper](/files/VMr0gzfHyJPdXSsS4uSr)

* Clipper is divided into two layers.
  * **Model abstraction layer**: provides a common interface across machine learning frameworks.
    * **Caching**: maintains a prediction cache with an LRU eviction policy.
    * **Adaptive query-batching**: batching amortizes per-request overhead across queries and enables data-parallel optimizations in ML frameworks.
      * Dynamic batch size: tuned with an additive-increase-multiplicative-decrease (AIMD) scheme.
      * Delayed batching: waits up to a batch wait timeout for more queries to arrive before dispatching a batch.
    * **Model containers**: each model runs in a separate Docker container, which supports replica scaling.
  * **Model selection layer**: dispatches each prediction request to one or more of the models through the model abstraction layer.
    * **Single model selection**: treated as a multi-armed bandit problem (Exp3 algorithm).
    * **Ensemble model selection** (Exp4 algorithm): combines predictions from multiple models. To mitigate stragglers, the prediction is rendered from whichever models have responded by the latency deadline, on the premise that *rendering a late prediction is worse than rendering an inaccurate prediction*. The cost is that some ensemble members may be missing, which can reduce accuracy.
    * **Personalized model selection (Contextualization)**: instantiate a unique model selection state for each user, context or session.
* Clipper uses **a cross-language RPC** to send the batch of queries to a model container hosting the model in its native machine learning framework.
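
The AIMD rule for the dynamic batch size can be sketched in a few lines of Python. This is a minimal illustration rather than Clipper's implementation: the class name `AimdBatchSizer`, the additive step, the backoff factor, and the toy latency model are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class AimdBatchSizer:
    """Tracks a maximum batch size with additive increase, multiplicative decrease."""

    slo_ms: float            # latency objective for one batch (assumed parameter)
    batch_size: int = 1      # current maximum batch size
    additive_step: int = 2   # assumed increase after a batch that met the objective
    backoff: float = 0.9     # assumed multiplicative decrease factor

    def update(self, observed_latency_ms: float) -> int:
        """Adjust the batch size after one batch completes and return the new size."""
        if observed_latency_ms <= self.slo_ms:
            # The batch met the objective: probe a slightly larger batch.
            self.batch_size += self.additive_step
        else:
            # The batch missed the objective: back off multiplicatively.
            self.batch_size = max(1, int(self.batch_size * self.backoff))
        return self.batch_size


def simulated_latency_ms(batch_size: int) -> float:
    """Toy latency model: fixed overhead plus a per-query cost (hypothetical numbers)."""
    return 5.0 + 0.5 * batch_size


sizer = AimdBatchSizer(slo_ms=20.0)
history = []
for _ in range(40):
    latency = simulated_latency_ms(sizer.batch_size)
    history.append(sizer.update(latency))
```

In this toy setup the batch size climbs steadily until the simulated latency crosses the objective, then oscillates in a narrow band just below and above that point, which is the behavior AIMD is meant to produce.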

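Single model selection as a multi-armed bandit can be illustrated with a small Exp3-style selector. The sketch below assumes a loss in [0, 1] for each prediction and uses an importance-weighted exponential weight update; the class name `Exp3Selector`, the learning rate `eta`, and the simulated error rates are hypothetical and are not taken from Clipper's code.

```python
import math
import random


class Exp3Selector:
    """Exp3-style selector: one weight per model, sampled proportionally to weight."""

    def __init__(self, model_names, eta=0.1, seed=0):
        self.model_names = list(model_names)
        self.eta = eta                                   # assumed learning rate
        self.weights = [1.0] * len(self.model_names)     # all models start equal
        self.rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        return [w / total for w in self.weights]

    def select(self):
        """Sample a model index and return it with its selection probability."""
        probs = self.probabilities()
        index = self.rng.choices(range(len(probs)), weights=probs, k=1)[0]
        return index, probs[index]

    def feedback(self, index, prob, loss):
        """Penalize the chosen model; dividing by prob corrects for sampling bias."""
        self.weights[index] *= math.exp(-self.eta * loss / prob)
        # Renormalize so the weights do not drift toward zero over many rounds.
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]


# Simulated feedback loop with two models of different (hypothetical) error rates.
true_error = [0.4, 0.1]
selector = Exp3Selector(["model_a", "model_b"], eta=0.1, seed=0)
env = random.Random(1)
for _ in range(2000):
    index, prob = selector.select()
    loss = 1.0 if env.random() < true_error[index] else 0.0
    selector.feedback(index, prob, loss)

probs = selector.probabilities()
```

Only the model that was actually queried receives feedback, so each round costs a single model evaluation. Over many rounds the selection probability shifts toward the model with the lower error rate.
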
### Comparison

* Compared to TensorFlow Serving, Clipper offers a modular architecture and a substantially broader set of features with minimal performance penalty.
* Unlike general data-serving systems, where the dominant cost tends to be **IO**, the dominant cost in **prediction serving** is **computation**.

### Limitations

1. Clipper does not optimize the execution of the models; it treats them as black-box components.
2. Clipper does not manage model training or retraining.


