Interference-aware multiplexing for deep learning in GPU clusters: A middleware approach

Meta Info

Presented at SC 2023.

Understanding the paper

Opportunities in co-locating DL training tasks

  • Tune training configurations (e.g., batch size) across all co-located tasks

  • Choose appropriate tasks to multiplex on a GPU device (both opportunities are illustrated in the sketch after this list)
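
A minimal sketch of the two opportunities combined: enumerating which pair of tasks to multiplex on a GPU while jointly tuning their batch sizes. This is an illustration, not the paper's middleware; Task, estimate_throughput, and pick_pair are hypothetical names, and the toy interference model stands in for the profiling a real system would perform.

```python
from dataclasses import dataclass
from itertools import combinations, product

@dataclass
class Task:
    name: str
    batch_sizes: tuple[int, ...]  # candidate batch-size configurations

def estimate_throughput(bs_a: int, bs_b: int) -> float:
    """Toy interference model: co-located tasks slow each other down,
    with slowdown growing in the product of their batch sizes. A real
    middleware would profile or learn this per GPU."""
    slowdown = 1.0 + (bs_a * bs_b) / 2048
    return (bs_a + bs_b) / slowdown

def pick_pair(tasks: list[Task]):
    """Enumerate task pairs and their batch-size configurations; keep
    the combination with the best estimated aggregate throughput."""
    best = None
    for a, b in combinations(tasks, 2):
        for bs_a, bs_b in product(a.batch_sizes, b.batch_sizes):
            score = estimate_throughput(bs_a, bs_b)
            if best is None or score > best[0]:
                best = (score, a.name, bs_a, b.name, bs_b)
    return best

tasks = [Task("resnet50", (32, 64)), Task("bert", (8, 16)), Task("vgg16", (16, 32))]
print(pick_pair(tasks))
```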

Challenges

  • Trade-off between mitigating interference and accelerating training progress (e.g., a larger batch size speeds up a task's own progress but intensifies contention on the shared GPU) when minimizing overall training time

  • Vast search space of task configurations (see the back-of-the-envelope count after this list)

  • Coupling between adjusting task configurations and designing task placement policies
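
To make the "vast search space" point concrete, here is a back-of-the-envelope count under assumed numbers (my illustration; the parameters are not from the paper): with n tasks partitioned into GPU pairs and C candidate batch sizes per task, placements alone number n!/((n/2)! · 2^(n/2)), and each placement multiplies with C^n joint configuration choices. The coupling in the last bullet is why the two factors cannot be searched independently.

```python
from math import factorial

def pairings(n: int) -> int:
    """Ways to partition n tasks into n/2 unordered GPU pairs:
    n! / ((n/2)! * 2**(n/2))."""
    half = n // 2
    return factorial(n) // (factorial(half) * 2 ** half)

n_tasks, n_configs = 16, 4          # assumed cluster and config sizes
placements = pairings(n_tasks)      # 2,027,025 pairings
configs = n_configs ** n_tasks      # 4^16 ~= 4.3e9 joint batch-size choices
print(f"{placements:,} placements x {configs:,} configs = {placements * configs:,}")
```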
