SLAQ: Quality-driven scheduling for distributed machine learning

Metadata

Presented at SoCC 2017.

Authors: Haoyu Zhang, Logan Stafman, Andrew Or, Michael J. Freedman, Princeton University

Understanding the paper

TL;DRs

This paper presents SLAQ, a cluster scheduling framework for multi-tenant approximate ML training jobs running on shared resources. It is a fine-grained job-level scheduler: it allocates cluster resources among competing ML jobs, but does so at short time intervals (i.e., hundreds of milliseconds to a few seconds) rather than once at job submission.

Key Insights

  • Leverage the iterative nature of ML training algorithms.

    • Collect quality and resource usage information from concurrent jobs.

    • Generate quality-improvement predictions for future iterations (a simplified sketch of this step follows below).
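The paper predicts future quality gains from a job's recent loss history, fitting curves online and weighting recent iterations more heavily. The snippet below is a simplified stand-in for that idea, not SLAQ's actual estimator: it predicts the next iteration's loss reduction with an exponentially weighted average of recent per-iteration deltas. The `QualityPredictor` name and the `decay` parameter are illustrative assumptions.

```scala
object QualityPredictor {
  /** Predicted loss reduction for the next iteration.
    *
    * @param lossHistory per-iteration loss values, oldest first
    * @param decay       multiplier favoring recent iterations (assumed 0 < decay < 1)
    */
  def predictNextImprovement(lossHistory: Seq[Double], decay: Double = 0.8): Double = {
    if (lossHistory.size < 2) return 0.0
    // Per-iteration improvements (positive while the loss is still decreasing).
    val deltas = lossHistory.zip(lossHistory.tail).map { case (prev, cur) => prev - cur }
    // Exponential weights: the most recent delta gets weight 1, earlier ones decay.
    val weights = deltas.indices.map(i => math.pow(decay, deltas.size - 1 - i))
    deltas.zip(weights).map { case (d, w) => d * w }.sum / weights.sum
  }
}
```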

Limitation of Existing Work

Existing job-level schedulers (YARN, Mesos, Apollo, the Hadoop Capacity Scheduler, Quincy, etc.) mostly allocate resources based on resource fairness or priorities. For ML training jobs, these schedulers often make suboptimal decisions because they are agnostic to each job's progress, i.e., the quality improvement its iterations are actually delivering.
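To make the contrast concrete, the sketch below shows the kind of policy the paper argues for: instead of splitting cores evenly among jobs, greedily hand each core to the job with the largest predicted marginal quality improvement. The `Job` case class and its `predictImprovement` function are illustrative placeholders for SLAQ's online estimates, not the paper's actual interfaces.

```scala
import scala.collection.mutable

// predictImprovement(n): predicted loss reduction over the next scheduling
// interval if the job runs with n cores (a placeholder for SLAQ's estimates).
case class Job(id: String, predictImprovement: Int => Double)

def allocate(jobs: Seq[Job], totalCores: Int): Map[String, Int] = {
  val cores = mutable.Map(jobs.map(_.id -> 0): _*)
  for (_ <- 1 to totalCores) {
    // Marginal gain of giving each job one more core than it currently has.
    val best = jobs.maxBy { j =>
      j.predictImprovement(cores(j.id) + 1) - j.predictImprovement(cores(j.id))
    }
    cores(best.id) += 1
  }
  cores.toMap
}

// Example: a nearly converged job gains little from extra cores, so the
// scheduler shifts them to a job that is still improving quickly.
val jobs = Seq(
  Job("almost-converged", n => 0.01 * math.log1p(n)),
  Job("early-training",   n => 1.0  * math.log1p(n))
)
println(allocate(jobs, 8)) // most cores go to "early-training"
```

A fairness-based scheduler would give each of these jobs four cores; the quality-driven policy instead concentrates resources where the aggregate quality improvement is highest.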

Implementation

SLAQ is implemented on top of Apache Spark and uses Spark MLlib for the ML training workloads.
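A minimal sketch of how such a job can expose per-iteration quality to a fine-grained scheduler, assuming a plain Spark driver loop. This is illustrative, not SLAQ's code: each short iteration computes the loss over the RDD, and the point where the driver would report `(iter, loss)` to the scheduler is shown as a `println` placeholder.

```scala
import org.apache.spark.sql.SparkSession

object IterativeJobSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("slaq-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Toy 1-D least-squares problem: labels follow y = 3x.
    val data = sc.parallelize(1 to 1000).map(i => (i.toDouble, 3.0 * i))
    val n = data.count()

    var w = 0.0
    for (iter <- 1 to 50) {
      // One short iteration: loss and gradient of (w*x - y)^2 over the RDD.
      val (loss, grad) = data
        .map { case (x, y) => val e = w * x - y; (e * e, 2 * e * x) }
        .reduce { case ((l1, g1), (l2, g2)) => (l1 + l2, g1 + g2) }
      w -= 1e-8 * grad / n // tiny fixed step; a sketch, not a tuned solver
      // Here the driver would report (iter, loss / n) to the scheduler,
      // which is what lets it track quality at sub-second granularity.
      println(s"iter=$iter loss=${loss / n}")
    }
    spark.stop()
  }
}
```

Because Spark jobs are driven iteration by iteration from the driver, the scheduler can adjust each job's executor allocation between iterations, which is what makes the hundreds-of-milliseconds scheduling granularity practical.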
