Data Processing

  • Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement (ATC 2024) [Paper] [Code]

    • ETH & Google

  • Disaggregating ML Input Data Processing at Scale (SoCC 2023)

    • Google & ETH

  • GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning (SIGMOD 2023) [Paper]

    • Alibaba & PKU

  • A case for disaggregation of ML data processing (arXiv 2210.14826) [Paper]

    • Google & ETH

    • tf.data service: disaggregates data preprocessing from ML computation.

  • Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training (ISCA 2022) [Paper]

    • Meta

    • DSI: Meta's data storage and ingestion pipeline

    • Industry track
