# Data Processing

* Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement ([ATC 2024](https://paper.lingyunyang.com/reading-notes/conference/atc-2024)) \[[Paper](https://www.usenix.org/conference/atc24/presentation/graur)] \[[Code](https://github.com/eth-easl/pecan-experiments)]
  * ETH & Google
* Disaggregating ML Input Data Processing at Scale ([SoCC 2023](https://paper.lingyunyang.com/reading-notes/conference/socc-2023))
  * Google & ETH
* GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning (SIGMOD 2023) \[[Paper](https://dl.acm.org/doi/10.1145/3589773)]
  * Alibaba & PKU
* A case for disaggregation of ML data processing (arXiv 2210.14826) \[[Paper](https://arxiv.org/abs/2210.14826)]
  * Google & ETH
  * tf.data service: *disaggregates* input data preprocessing from ML computation, so preprocessing workers can scale independently of the training hosts.
* Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training (ISCA 2022) \[[Paper](https://dl.acm.org/doi/10.1145/3470496.3533044)]
  * Meta
  * DSI: data storage and ingestion
  * Industry track
  * Characterizes Meta's DSI pipeline for recommendation model training.
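
The disaggregation idea noted under the tf.data service entry can be illustrated with a toy sketch. This is not the tf.data service API: it is a standard-library stand-in in which threads play the role of remote preprocessing workers and a bounded queue plays the role of the network channel to the trainer. The names `preprocess`, `worker`, and `run_disaggregated` are hypothetical and chosen only for illustration.

```python
import queue
import threading


def preprocess(x):
    # Hypothetical stand-in for a CPU-side transformation (decode, augment, ...).
    return x * 2


def worker(shard, out_q):
    # Each worker preprocesses its own shard and streams results to the consumer.
    for item in shard:
        out_q.put(preprocess(item))
    out_q.put(None)  # sentinel: this worker has finished its shard


def run_disaggregated(data, num_workers):
    # The bounded queue models backpressure between the worker pool and the trainer.
    out_q = queue.Queue(maxsize=8)
    # Round-robin sharding; the worker count is chosen independently of the consumer.
    shards = [data[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(s, out_q)) for s in shards]
    for t in threads:
        t.start()

    consumed = []
    finished = 0
    while finished < num_workers:
        item = out_q.get()
        if item is None:
            finished += 1
        else:
            consumed.append(item)  # a training step would consume the item here

    for t in threads:
        t.join()
    return consumed


if __name__ == "__main__":
    batch = run_disaggregated(list(range(10)), num_workers=3)
    print(sorted(batch))
```

Because the workers run concurrently, the order in which items reach the consumer is not deterministic. The point of the sketch is only that `num_workers` can be changed without touching the consumer loop, which is the property disaggregation is meant to provide.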
