# SoCC 2023

## Meta Info

Homepage: <https://acmsocc.org/2023/>

Paper list: <https://acmsocc.org/2023/accepted-papers.html>

## Papers

### Resource Allocation

* Lifting the Fog of Uncertainties: Dynamic Resource Orchestration for the Containerized Cloud \[[Paper](https://dl.acm.org/doi/10.1145/3620678.3624646)]
  * UofT
  * Adaptively configure resource parameters
  * Built on contextual bandit techniques
  * Balance between performance and resource cost
* Not All Resources are Visible: Exploiting Fragmented Shadow Resources in Shared-State Scheduler Architecture \[[Paper](https://dl.acm.org/doi/10.1145/3620678.3624650)]
  * SJTU & Huawei
  * Shared-state schedulers: A central state view *periodically* updates the *global* cluster status to *distributed* schedulers
  * Shadow resources: Resources invisible to *shared-state schedulers* until the next view update
  * Resource Miner (RMiner) includes a *shadow resource manager* to manage shadow resources, an *RM filter* to select suitable tasks as RM tasks, an *RM scheduler* to allocate shadow resources to RM tasks
* Gödel: Unified large-scale resource management and scheduling at ByteDance \[[Paper](https://dl.acm.org/doi/10.1145/3620678.3624663)]
  * ByteDance & UVA
  * Industry Paper
  * A unified infrastructure for all business groups to run their diverse workloads
  * Built upon Kubernetes

### Machine Learning

* Anticipatory Resource Allocation for ML Training Clusters \[[Paper](https://dl.acm.org/doi/10.1145/3620678.3624669)]
  * Microsoft Research & UW
  * Schedule based on *predictions of future job arrivals and durations*
  * Deal with prediction errors
* tf.data service: A Case for Disaggregating ML Input Data Processing \[[Paper](https://dl.acm.org/doi/10.1145/3620678.3624666)]
  * Google & ETH
  * Industry Paper
  * A disaggregated input data processing service built on top of tf.data in TensorFlow
  * Horizontally scale out to right-size host resources (CPU/RAM) for data processing in each job
  * Share ephemeral preprocessed data results across jobs
  * Coordinated reads to avoid stragglers
* Is Machine Learning Necessary for Cloud Resource Usage Forecasting? \[[Paper](https://dl.acm.org/doi/10.1145/3620678.3624790)]
  * IMDEA Software Institute
  * Vision Paper
  * Question: Whether *complex machine learning models* are necessary to use?
  * Proposal: Practical memory management systems need to first identify the extent to which simple solutions can be effective.

### Serverless Computing

* Golgi: Performance-Aware, Resource-Efficient Function Scheduling for Serverless Computing \[[Paper](https://dl.acm.org/doi/10.1145/3620678.3624645)]
  * HKUST & WeBank
  * **Best Paper Award!**
  * A scheduling system for serverless functions to *minimize resource provisioning costs* while *meeting the function latency requirements*
  * Overcommit functions based on their past resource usage; Identify nine low-level metrics (e.g., request load, resource allocation, contention on shared resources); Use the Mondrian Forest to predict the function performance
  * Employ a conservative exploration-exploitation strategy for request routing; By default, route requests to non-overcommitted instances; Explore to use overcommitted instances
  * Vertical scaling to dynamically adjust the concurrency of overcommitted instances
* Parrotfish: Parametric Regression for Optimizing Serverless Functions \[[Paper](https://dl.acm.org/doi/10.1145/3620678.3624654)]
  * UBC & UTokyo & INSAT
  * Find optimal configurations through an online learning process
  * Use parametric regression to choose the right memory configurations for serverless functions
* AsyFunc: A High-Performance and Resource-Efficient Serverless Inference System via Asymmetric Functions \[[Paper](https://dl.acm.org/doi/10.1145/3620678.3624664)] \[[Code](https://github.com/peiqiangyu/AsyFunc)]
  * HUST & Huawei & Peng Cheng Laboratory
  * Problem: The time-consuming and resource-hungry model-loading process when scaling out function instances
  * Observation: The sensitivity of each layer to the computing resources is mostly anti-correlated with its memory resource usage
  * Asymmetric Functions
    * The original Body Function loads a complete model to meet stable demands
    * The proposed lightweight Shadow Function only loads a portion of resource-sensitive layers to deal with sudden demands effortlessly
  * AsyFunc — an inference serving system with an auto-scaling and scheduling engine; Built on top of Knative
* Chitu: Accelerating Serverless Workflows with Asynchronous State Replication Pipeline \[[Paper](https://dl.acm.org/doi/10.1145/3620678.3624794)] \[[Code](https://github.com/sigserverless/chitu)]
  * ISCAS & ICT, CAS
  * **Asynchronous State Replication Pipelines (ASRP)** to speed up serverless workflows for general applications
  * Three insights
    * Provide differentiable data types (DDT) at the programming model level to support incremental state sharing and computation
    * Continuously deliver changes of DDT objects in real-time
    * Direct communication and change propagation
  * Built atop OpenFaaS
* How Does It Function? Characterizing Long-term Trends in Production Serverless Workloads \[[Paper](https://dl.acm.org/doi/10.1145/3620678.3624783)] \[[Trace](https://github.com/sir-lab/data-release)]
  * Huawei
  * Industry Paper
  * Two new serverless traces in Huawei Cloud
    * The first trace: Huawei's *internal* workloads; Per-second statistics for 200 functions
    * The second trace: Huawei's public FaaS platform; Per-minute arrival rates for over 5000 functions
  * Characterize resource consumption, cold-start times, programming languages used, periodicity, per-second versus per-minute burstiness, correlations, and popularity.
  * Findings
    * Requests vary by up to 9 orders of magnitude across functions, with some functions executed over 1 billion times per day
    * Scheduling time, execution time and cold-start distributions vary across 2 to 4 orders of magnitude and have very long tails
    * Function invocation counts demonstrate strong periodicity for many individual functions and on an aggregate level
  * The need for further research in *estimating resource reservations and time-series prediction*
* Function as a Function \[[Paper](https://dl.acm.org/doi/10.1145/3620678.3624648)]
  * ETH
  * Vision Paper
  * Dandelion -- a clean state FaaS system; Treat serverless functions as pure functions; Explicitly separate computation and I/O; Hardware acceleration; Enable dataflow-aware function orchestration
* The Gap Between Serverless Research and Real-world Systems \[[Paper](https://dl.acm.org/doi/10.1145/3620678.3624785)]
  * SJTU & Huawei Cloud
  * Vision Paper
  * Five open challenges
    * Optimize cold start latency: Most existing works only consider synchronous starts; Asynchronous start in Industry
    * Declarative approach: Whether Kubernetes is the right system for serverless computing?
    * Scheduling cost
    * Balance different scheduling policies within a serverless system
    * Costs of sidecar

### Sustainable Computing

* Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale \[[Paper](https://dl.acm.org/doi/10.1145/3620678.3624793)]
  * MIT & NEU
  * Significant decreases in both temperature and power draw, reducing power consumption and potentially *improving hardware life-span*, with *minimal impact on job performance*
