# SLAQ: Quality-driven scheduling for distributed machine learning

## Metadata

Presented in [SoCC 2017](https://doi.org/10.1145/3127479.3127490).

Authors: Haoyu Zhang, Logan Stafman, Andrew Or, Michael J. Freedman, *Princeton University*

## Understanding the paper

### TL;DRs

This paper presents **SLAQ**, a cluster scheduling framework for multi-tenant, approximate *ML training jobs* running on shared resources.\
It is a **fine-grained job-level scheduler** that *allocates cluster resources* between competing ML jobs, and does so over *short time intervals* (i.e., hundreds of milliseconds to a few seconds).

### Key Insights

* Leverage the *iterative nature* of ML training algorithms.
  * Collect quality and resource usage information from concurrent jobs.
  * Generate quality-improvement predictions for future iterations.
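The insight above can be sketched in code. The following is a minimal, hypothetical illustration (all names are mine, not from the paper): it estimates each job's expected loss reduction from its recent iteration history, then greedily hands out cores to the job with the largest predicted quality gain, discounting for diminishing returns. SLAQ's actual predictor fits curves to normalized quality metrics; a simple average of recent loss deltas stands in for that here.

```python
# Hypothetical sketch of quality-driven scheduling in the spirit of SLAQ.
# Assumptions: each job reports its recent loss values; marginal gains
# shrink as a job accumulates cores. Names and heuristics are illustrative.
import heapq

def predicted_delta(losses):
    """Estimate next-iteration loss reduction from recent history.
    (An average of recent deltas stands in for SLAQ's curve fitting.)"""
    if len(losses) < 2:
        return float("inf")  # no history yet: prioritize new jobs
    deltas = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
    return max(sum(deltas) / len(deltas), 0.0)

def allocate(jobs, total_cores):
    """Greedily give one core at a time to the job whose predicted
    quality improvement per additional core is currently largest."""
    alloc = {name: 0 for name in jobs}
    # max-heap via negated keys: predicted improvement per core
    heap = [(-predicted_delta(history), name) for name, history in jobs.items()]
    heapq.heapify(heap)
    for _ in range(total_cores):
        gain, name = heapq.heappop(heap)
        alloc[name] += 1
        # diminishing returns: shrink the job's marginal gain as it gains cores
        heapq.heappush(heap, (gain * alloc[name] / (alloc[name] + 1), name))
    return alloc

# A rapidly improving job receives more cores than a slowly converging one:
jobs = {"fast": [10.0, 6.0, 3.0], "slow": [10.0, 8.0, 6.0]}
print(allocate(jobs, 4))  # → {'fast': 3, 'slow': 1}
```

Because allocation runs over loss *deltas* rather than static priorities, resources automatically drain away from jobs whose quality curves have flattened, which is the scheduling behavior the paper targets.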

### Limitation of Existing Work

Existing job-level schedulers (YARN, Mesos, Apollo, the Hadoop Capacity Scheduler, Quincy, etc.) mostly allocate resources based on *resource fairness* or *priorities*.\
For ML training jobs, these schedulers often make *suboptimal* scheduling decisions because they are agnostic to each job's progress (quality improvement).

### Implementation

The system is implemented within the *Apache Spark framework* and utilizes *Spark MLlib*.
