Interference-aware multiplexing for deep learning in GPU clusters: A middleware approach
Meta Info
Presented at SC 2023.
Understanding the paper
Opportunities in co-locating DL training tasks
Tune training configurations (e.g., batch size) across all co-located tasks
Choose appropriate tasks to multiplex on a GPU device (a toy sketch of both opportunities follows this list)
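As a rough illustration of these two opportunities (not the paper's actual middleware), the Python sketch below greedily pairs tasks using a toy interference estimate and halves batch sizes until a pair fits on one GPU. The Task fields, the interference model, and the threshold are all hypothetical placeholders.

```python
# A toy, illustrative sketch only, not the paper's middleware: the Task fields,
# the interference model, and all thresholds below are hypothetical placeholders.
import copy
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Task:
    name: str
    batch_size: int   # current per-task batch size
    gpu_util: float   # profiled SM utilization, in [0, 1]
    mem_frac: float   # profiled GPU-memory fraction, in [0, 1]


def interference(a: Task, b: Task) -> float:
    """Toy estimate: slowdown grows with how far combined demand exceeds one GPU."""
    return max(0.0, a.gpu_util + b.gpu_util - 1.0) + max(0.0, a.mem_frac + b.mem_frac - 1.0)


def tune_batch_sizes(a: Task, b: Task) -> None:
    """Opportunity 1: halve the heavier task's batch size until the pair fits in memory."""
    while a.mem_frac + b.mem_frac > 1.0 and min(a.batch_size, b.batch_size) > 1:
        victim = a if a.mem_frac >= b.mem_frac else b
        victim.batch_size //= 2   # assume memory use scales roughly with batch size
        victim.mem_frac /= 2
        victim.gpu_util *= 0.9    # smaller batches also ease compute contention


def choose_pairs(tasks: list[Task], threshold: float = 0.2) -> list[tuple[Task, Task]]:
    """Opportunity 2: greedily multiplex the least-interfering task pairs."""
    pairs: list[tuple[Task, Task]] = []
    used: set[str] = set()
    for a, b in sorted(combinations(tasks, 2), key=lambda p: interference(*p)):
        if a.name in used or b.name in used:
            continue
        a, b = copy.copy(a), copy.copy(b)   # tune copies; commit only if accepted
        tune_batch_sizes(a, b)
        if interference(a, b) <= threshold:
            pairs.append((a, b))
            used.update({a.name, b.name})
    return pairs


if __name__ == "__main__":
    tasks = [
        Task("resnet", 128, gpu_util=0.5, mem_frac=0.7),
        Task("bert", 32, gpu_util=0.4, mem_frac=0.6),
        Task("vgg", 64, gpu_util=0.8, mem_frac=0.8),
    ]
    for a, b in choose_pairs(tasks):
        print(f"co-locate {a.name} (bs={a.batch_size}) with {b.name} (bs={b.batch_size})")
```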
Challenges
Trade-off between mitigating interference and sustaining per-task training progress when minimizing the overall training time
Vast search space of task configurations
Coupling between adjusting task configurations and designing task placement policies (a back-of-the-envelope count follows this list)
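To see why the joint search space is vast and why the two decisions are coupled, the snippet below gives a back-of-the-envelope count; the task count, GPU count, and number of candidate batch sizes are made-up example figures, not numbers from the paper.

```python
# Back-of-the-envelope sketch; the task count, GPU count, and number of
# candidate batch sizes are made-up examples, not figures from the paper.
def joint_space_size(num_tasks: int, num_gpus: int, batch_choices: int) -> int:
    """Each task independently picks a GPU and a batch size, so placement and
    configuration choices multiply; and because a task's best batch size depends
    on which task shares its GPU, the two decisions cannot be searched separately."""
    placements = num_gpus ** num_tasks            # which GPU each task lands on
    configurations = batch_choices ** num_tasks   # batch size chosen per task
    return placements * configurations


# 16 tasks, 8 GPUs, 4 candidate batch sizes -> 32**16 ≈ 1.2e24 joint choices,
# far too many to profile or benchmark exhaustively.
print(joint_space_size(16, 8, 4))
```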