VLDB 2024
Homepage:
DL training workloads
Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads
UCSD
Saturn -> SPASE: Select a Parallelism, Allocate resources, and Schedule; the joint SPASE problem is formulated as an MILP (see the sketch below).
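A minimal PuLP sketch of what a joint choose-parallelism / allocate-GPUs / schedule MILP can look like. The job names, runtimes, and the serial-schedule simplification are illustrative assumptions, not Saturn's actual formulation.

```python
# Illustrative SPASE-style MILP (assumed jobs/runtimes, not Saturn's formulation).
import pulp

jobs = ["bert", "gpt2"]
configs = [("ddp", 2), ("ddp", 4), ("fsdp", 2), ("fsdp", 4)]  # (parallelism, #GPUs)
total_gpus = 4
runtime = {  # assumed runtime in hours for each (job, parallelism, #GPUs)
    ("bert", "ddp", 2): 6, ("bert", "ddp", 4): 4,
    ("bert", "fsdp", 2): 5, ("bert", "fsdp", 4): 3,
    ("gpt2", "ddp", 2): 9, ("gpt2", "ddp", 4): 6,
    ("gpt2", "fsdp", 2): 8, ("gpt2", "fsdp", 4): 5,
}

prob = pulp.LpProblem("spase_sketch", pulp.LpMinimize)
# x[(j, p, g)] = 1 iff job j trains with parallelism p on g GPUs
x = {k: pulp.LpVariable(f"x_{k[0]}_{k[1]}_{k[2]}", cat="Binary") for k in runtime}
makespan = pulp.LpVariable("makespan", lowBound=0)
prob += makespan  # objective: minimize makespan

for j in jobs:
    # each job picks exactly one (parallelism, #GPUs) configuration ...
    prob += pulp.lpSum(x[(j, p, g)] for (p, g) in configs) == 1
    # ... and that allocation has to fit on the cluster
    prob += pulp.lpSum(g * x[(j, p, g)] for (p, g) in configs) <= total_gpus

# Scheduling is collapsed to a serial schedule here: makespan >= sum of chosen runtimes.
prob += makespan >= pulp.lpSum(runtime[k] * x[k] for k in runtime)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([k for k in runtime if x[k].value() > 0.5], "makespan =", pulp.value(makespan))
```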
Big data analytic workloads
Intelligent Pooling: Proactive Resource Provisioning in Large-scale Cloud Service
Microsoft
Predict usage patterns using a hybrid ML model; optimize the pool size dynamically.
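A toy sketch of the proactive idea: forecast near-term demand from recent usage and keep the pool at forecast plus headroom. The moving-average predictor and fixed headroom are placeholders for the paper's hybrid ML model and pool-size optimization.

```python
# Toy proactive pool sizing: forecast demand, then keep forecast + headroom warm.
# The moving-average predictor and the fixed headroom are stand-ins for the
# paper's hybrid ML model and its dynamic pool-size optimization.
from collections import deque

class ProactivePool:
    def __init__(self, window=12, headroom=0.2):
        self.history = deque(maxlen=window)  # recent per-interval demand
        self.headroom = headroom

    def observe(self, demand):
        self.history.append(demand)

    def target_size(self):
        if not self.history:
            return 1
        forecast = sum(self.history) / len(self.history)  # stand-in predictor
        return max(1, round(forecast * (1 + self.headroom)))

pool = ProactivePool()
for demand in [3, 4, 6, 5, 7]:
    pool.observe(demand)
    print("pre-provision", pool.target_size(), "warm instances")
```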
Job scheduling
ResLake: Towards Minimum Job Latency and Balanced Resource Utilization in Geo-distributed Job Scheduling
ByteDance
Autoscaling
OptScaler: A Collaborative Framework for Robust Autoscaling in the Cloud
Ant Group
Serverless
Resource Management in Aurora Serverless
AWS (Industry Paper)
Approximate Inference
Biathlon: Harnessing Model Resilience for Accelerating ML Inference Pipelines
CUHK
Approximate input features to accelerate inference pipelines.
Trade-off between latency and accuracy.
Evaluation: all inference pipelines were implemented in Python with scikit-learn and run on CPU servers.
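A toy illustration of the latency/accuracy trade-off from approximating an expensive input feature by sampling. The synthetic feature, sampling rate, and model are assumptions; this is not Biathlon's actual error-control mechanism.

```python
# Toy example: approximate an expensive aggregation feature from a 5% sample and
# accept a small accuracy drop in exchange for much cheaper featurization.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Each example owns 1,000 "transactions"; the feature is their mean amount.
transactions = rng.normal(loc=rng.uniform(0, 5, size=(2000, 1)), scale=1.0, size=(2000, 1000))
labels = (transactions.mean(axis=1) > 2.5).astype(int)

exact_feature = transactions.mean(axis=1, keepdims=True)
model = LogisticRegression().fit(exact_feature, labels)

# Approximate the same feature from a 5% sample of each example's transactions.
sample = transactions[:, rng.choice(1000, size=50, replace=False)]
approx_feature = sample.mean(axis=1, keepdims=True)

print("exact  acc:", model.score(exact_feature, labels))
print("approx acc:", model.score(approx_feature, labels))  # slightly lower, much cheaper
```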
InferDB: In-Database Machine Learning Inference Using Indexes
Hasso Plattner Institute & University of Potsdam & University of Illinois Chicago
Approximate ML inference pipelines using index structures available in the DBMS.
Predictions are preserved in the embedding space; select binned features for indexing.
IMO: Aggressive...
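A rough sketch of index-backed approximate inference: discretize a few features, precompute a prediction per bin combination, and answer queries with a key lookup. A Python dict stands in for a DBMS index, and the sketch bins raw features rather than the embedding-space representation described above.

```python
# Sketch of index-backed approximate inference: bin features, precompute a
# prediction per bin combination, answer queries by lookup with a model fallback.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_classification(n_samples=5000, n_features=4, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

binner = KBinsDiscretizer(n_bins=8, encode="ordinal", strategy="quantile").fit(X)
keys = [tuple(row) for row in binner.transform(X).astype(int)]

# "Index": bin combination -> majority model prediction over training points in that cell.
index = {}
for key, pred in zip(keys, model.predict(X)):
    index.setdefault(key, []).append(pred)
index = {k: int(np.round(np.mean(v))) for k, v in index.items()}

def infer(x):
    key = tuple(binner.transform(x.reshape(1, -1)).astype(int)[0])
    return index.get(key, int(model.predict(x.reshape(1, -1))[0]))  # fallback on index miss

x = X[42]
print("index lookup:", infer(x), " model:", int(model.predict(x.reshape(1, -1))[0]))
```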
Edge Computing
SmartLite: A DBMS-based Serving System for DNN Inference in Resource-constrained Environments
ZJU & Alibaba
SmartLite is a lightweight, DBMS-based serving system for DNN inference:
Store the parameters and structural information of neural networks as database tables.
Implement neural network operators inside the DBMS engine.
Quantize model parameters as binarized values, apply neural pruning techniques to compress the models, and transform tensor manipulations into value lookup operations of the DBMS.
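A minimal sketch, assuming SQLite as the store, of keeping binarized layer weights in a table and evaluating the layer by fetching rows and doing sign dot products. The schema and per-row evaluation are illustrative; SmartLite's in-engine operators and lookup transformations go further.

```python
# Store binarized layer weights as DBMS rows and evaluate the layer by fetching
# rows and computing sign dot products (XNOR-popcount in real BNN kernels).
import sqlite3
import numpy as np

rng = np.random.default_rng(0)
weights = np.sign(rng.normal(size=(16, 64))).astype(np.int8)  # binarized 64 -> 16 layer

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE layer1 (neuron INTEGER PRIMARY KEY, w BLOB)")
con.executemany(
    "INSERT INTO layer1 VALUES (?, ?)",
    [(i, weights[i].tobytes()) for i in range(weights.shape[0])],
)

def binary_linear(x_sign):
    """Evaluate the binarized layer with weights pulled from the database."""
    out = np.empty(16, dtype=np.int32)
    for neuron, blob in con.execute("SELECT neuron, w FROM layer1 ORDER BY neuron"):
        w = np.frombuffer(blob, dtype=np.int8)
        out[neuron] = int(w @ x_sign)
    return np.sign(out)

x = np.sign(rng.normal(size=64)).astype(np.int8)
print(binary_linear(x))
```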
ElasticNotebook: Enabling Live Migration for Computational Notebooks
UIUC & UMich
Live migration via checkpointing/restoration.
Reconstruct all variables from a subset of variables.
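A toy checkpoint/restore of a notebook-like session: persist only a cheap-to-serialize subset of variables plus a replay recipe, and reconstruct the rest by re-executing recorded cells. Which variables to store vs. recompute is hard-coded here; ElasticNotebook decides this automatically from serialization cost and variable lineage.

```python
# Toy session checkpoint: serialize a small subset of variables and rebuild the
# rest on the destination by replaying recorded cell code.
import pickle

session = {}
cells = [
    "import math",                            # imports are replayed, not serialized
    "data = list(range(1_000_000))",          # cheap to recompute, costly to store
    "total = sum(data)",
    "summary = {'n': len(data), 'mean': total / len(data)}",  # small, cheap to store
]
for cell in cells:
    exec(cell, session)

checkpoint = {
    "stored": pickle.dumps({"summary": session["summary"]}),
    "replay": cells[:3],                      # recipe for everything not serialized
}

# Restore ("migration" to a new process): load stored variables, replay the rest.
new_session = {}
new_session.update(pickle.loads(checkpoint["stored"]))
for cell in checkpoint["replay"]:
    exec(cell, new_session)
print(new_session["summary"], new_session["total"])
```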
RALF: Accuracy-Aware Scheduling for Feature Store Maintenance
UC Berkeley
Limitations of existing work:
Feature stores naively apply a one-size-fits-all policy for when/how to update features.
They do not consider query access patterns or the impact on prediction accuracy.
Feature store regret: a metric for how much featurization degrades downstream accuracy.
Leverage downstream error feedback to minimize feature store regret.
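A sketch of regret-aware maintenance under an update budget: score each feature key by its latest downstream-error feedback weighted by how often it is queried, and refresh the highest-scoring keys first. The scoring rule is an illustrative stand-in for RALF's scheduling policy.

```python
# Refresh the feature keys whose staleness hurts downstream predictions the most,
# given a limited number of updates per round.
import heapq

class RegretAwareScheduler:
    def __init__(self, budget):
        self.budget = budget          # max feature updates per round
        self.error = {}               # key -> latest downstream error feedback
        self.queries = {}             # key -> query count since last update

    def record_query(self, key, downstream_error):
        self.queries[key] = self.queries.get(key, 0) + 1
        self.error[key] = downstream_error

    def plan_updates(self):
        # Estimated regret of leaving `key` stale for another round.
        score = lambda key: self.error.get(key, 0.0) * self.queries.get(key, 0)
        chosen = heapq.nlargest(self.budget, self.queries, key=score)
        for key in chosen:            # assume an update resets staleness
            self.queries[key] = 0
            self.error[key] = 0.0
        return chosen

sched = RegretAwareScheduler(budget=2)
for key, err in [("user:1", 0.4), ("user:2", 0.1), ("user:1", 0.5), ("user:3", 0.9)]:
    sched.record_query(key, err)
print(sched.plan_updates())           # -> the two keys with the highest estimated regret
```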
FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation
UNIST
Cooperatively utilizes both CPUs and GPUs to accelerate the data-augmentation (preprocessing) stage of DL training.
Orchestrates preprocessing tasks across CPUs and GPUs while minimizing interference with GPU-based model training.
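A conceptual sketch of the cooperation: augment batches on a CPU worker pool by default and offload to the GPU only when the training loop signals idle GPU cycles. The augment functions and the idle signal are placeholders; FusionFlow's scheduling and memory management are far more involved.

```python
# Dispatch per-batch augmentation to CPU workers, spilling over to the GPU only
# when the trainer reports idle GPU time.
from concurrent.futures import ThreadPoolExecutor

def augment_cpu(batch):
    return [x * 2 for x in batch]              # stand-in CPU augmentation

def augment_gpu(batch):
    return [x * 2 for x in batch]              # stand-in GPU augmentation kernel

def preprocess(batches, gpu_is_idle, cpu_workers=4):
    out = [None] * len(batches)
    with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
        futures = {}
        for step, batch in enumerate(batches):
            if gpu_is_idle(step):
                out[step] = augment_gpu(batch)             # borrow idle GPU cycles
            else:
                futures[step] = pool.submit(augment_cpu, batch)
        for step, fut in futures.items():
            out[step] = fut.result()
    return out

batches = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(preprocess(batches, gpu_is_idle=lambda step: step % 3 == 0))
```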
DLRover: Resource Optimization for Deep Recommendation Models Training at AntGroup
AntGroup & Sichuan University
Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses
UIUC & NVIDIA
GIDS (GPU Initiated Direct Storage Access) -> a data loader that utilizes all hardware resources (CPU memory, storage, and GPU memory).
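A conceptual tiering sketch only: serve feature reads from a GPU-resident cache, then host memory, then storage. Real GIDS issues the storage accesses directly from GPU threads rather than through a CPU-side loader; this just illustrates the memory hierarchy being exploited.

```python
# Tiered feature fetch: GPU cache -> host memory -> storage (conceptual only).
class TieredFeatureLoader:
    def __init__(self, gpu_cache, host_cache, storage):
        self.gpu_cache = gpu_cache      # hottest node features, kept on the GPU
        self.host_cache = host_cache    # warm features in (pinned) CPU memory
        self.storage = storage          # everything else on SSD

    def fetch(self, node_id):
        if node_id in self.gpu_cache:
            return self.gpu_cache[node_id]
        if node_id in self.host_cache:
            return self.host_cache[node_id]
        feature = self.storage[node_id]          # would be a direct SSD read
        self.host_cache[node_id] = feature       # naive promotion policy
        return feature

loader = TieredFeatureLoader(gpu_cache={0: "f0"}, host_cache={1: "f1"},
                             storage={i: f"f{i}" for i in range(10)})
print([loader.fetch(i) for i in (0, 1, 7)])
```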