SoCC 2023

Meta Info

Homepage: https://acmsocc.org/2023/

Paper list: https://acmsocc.org/2023/accepted-papers.html

Papers

Resource Allocation

  • Lifting the Fog of Uncertainties: Dynamic Resource Orchestration for the Containerized Cloud [Paper]

    • UofT

    • Adaptively configure resource parameters

    • Built on contextual bandit techniques

    • Balance between performance and resource cost
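A minimal sketch of the contextual-bandit idea behind this kind of adaptive resource tuning: discretize the observed load as the context, treat candidate configurations as arms, and learn a reward that trades off performance against resource cost. All names, configurations, and the reward model here are illustrative assumptions, not the paper's actual design.

```python
import random

# Candidate resource configurations (illustrative knobs, not the paper's)
CONFIGS = [{"cpu": c, "mem": m} for c in (0.5, 1.0, 2.0) for m in (256, 512)]

def reward(config, load):
    """Toy reward: under-provisioning hurts performance, over-provisioning costs money."""
    capacity = config["cpu"] * 2.0
    perf = min(1.0, capacity / load)                 # fraction of demand served
    cost = 0.1 * config["cpu"] + 0.0001 * config["mem"]
    return perf - cost

class EpsilonGreedyBandit:
    """Per-context epsilon-greedy: a running-mean value table per (context, arm)."""
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.values = {}     # (context, arm) -> running mean reward
        self.counts = {}

    def choose(self, context):
        if random.random() < self.epsilon:
            return random.randrange(len(CONFIGS))    # explore
        vals = [self.values.get((context, a), 0.0) for a in range(len(CONFIGS))]
        return max(range(len(CONFIGS)), key=vals.__getitem__)  # exploit

    def update(self, context, arm, r):
        key = (context, arm)
        n = self.counts.get(key, 0) + 1
        self.counts[key] = n
        old = self.values.get(key, 0.0)
        self.values[key] = old + (r - old) / n

random.seed(0)
bandit = EpsilonGreedyBandit()
for _ in range(2000):
    load = random.choice((0.5, 1.5, 3.0))            # observed context: request load
    ctx = round(load)
    arm = bandit.choose(ctx)
    bandit.update(ctx, arm, reward(CONFIGS[arm], load))

# After training, the bandit prefers a larger CPU share under high load.
best_for_high_load = max(range(len(CONFIGS)),
                         key=lambda a: bandit.values.get((3, a), -1.0))
print(CONFIGS[best_for_high_load])
```

Under this toy reward, the learned policy picks the 2-CPU configuration for the high-load context, since smaller configurations cannot serve the demand.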

  • Not All Resources are Visible: Exploiting Fragmented Shadow Resources in Shared-State Scheduler Architecture [Paper]

    • SJTU & Huawei

    • Shared-state schedulers: A central state view periodically propagates the global cluster status to distributed schedulers

    • Shadow resources: Resources invisible to shared-state schedulers until the next view update

    • Resource Miner (RMiner) includes a shadow resource manager to manage shadow resources, an RM filter to select suitable tasks as RM tasks, and an RM scheduler to allocate shadow resources to RM tasks
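The core observation can be sketched with a toy timeline: a shared-state scheduler only sees node capacity as of the last periodic view update, so resources freed in between are invisible "shadow resources" that a shadow resource manager can still hand to small RM tasks. Class names and numbers below are illustrative, not RMiner's implementation.

```python
class Node:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

class SharedStateView:
    """Snapshot of free capacity, refreshed only every `period` ticks."""
    def __init__(self, node, period):
        self.node, self.period = node, period
        self.free_in_view = node.capacity - node.used

    def tick(self, t):
        if t % self.period == 0:
            self.free_in_view = self.node.capacity - self.node.used

class ShadowResourceManager:
    """Tracks resources freed since the last view update (invisible to the view)."""
    def shadow_free(self, node, view):
        actual_free = node.capacity - node.used
        return max(0, actual_free - view.free_in_view)

node = Node(capacity=10)
node.used = 8
view = SharedStateView(node, period=5)   # snapshot taken now: 2 units free
node.used = 4                            # 4 units released after the snapshot

srm = ShadowResourceManager()
print("visible free:", view.free_in_view)            # stale view still says 2
print("shadow free:", srm.shadow_free(node, view))   # 4 units the view can't see
```

Until the next view refresh, only the shadow resource manager can allocate those 4 freed units, which is exactly the window RMiner exploits.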

  • Gödel: Unified large-scale resource management and scheduling at ByteDance [Paper]

    • ByteDance & UVA

    • Industry Paper

    • A unified infrastructure for all business groups to run their diverse workloads

    • Built upon Kubernetes

Machine Learning

  • Anticipatory Resource Allocation for ML Training Clusters [Paper]

    • Microsoft Research & UW

    • Schedule based on predictions of future job arrivals and durations

    • Deal with prediction errors
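A minimal sketch of the anticipatory idea: before committing free GPUs to a long job, check whether a predicted near-term arrival would be blocked, and hold capacity only when the prediction is confident enough, so that prediction errors do not strand resources. The policy, threshold, and field names are assumptions for illustration, not the paper's actual scheduler.

```python
def should_hold(free_gpus, long_job_gpus, predicted, confidence, threshold=0.7):
    """Hold capacity for a predicted arrival only if (a) starting the long job
    would block it and (b) the prediction is confident enough; otherwise
    schedule greedily to hedge against mispredictions."""
    if free_gpus - long_job_gpus >= predicted["gpus"]:
        return False                    # both jobs fit; no conflict
    return confidence >= threshold      # only defer on confident predictions

# Confident prediction of an imminent 4-GPU job: hold back the 6-GPU long job.
hold_confident = should_hold(8, 6, {"gpus": 4}, confidence=0.9)
# Low-confidence prediction: schedule the long job greedily.
hold_doubtful = should_hold(8, 6, {"gpus": 4}, confidence=0.3)
print(hold_confident, hold_doubtful)
```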

  • tf.data service: A Case for Disaggregating ML Input Data Processing [Paper]

    • Google & ETH

    • Industry Paper

    • A disaggregated input data processing service built on top of tf.data in TensorFlow

    • Horizontally scale out to right-size host resources (CPU/RAM) for data processing in each job

    • Share ephemeral preprocessed data results across jobs

    • Coordinated reads to avoid stragglers

  • Is Machine Learning Necessary for Cloud Resource Usage Forecasting? [Paper]

    • IMDEA Software Institute

    • Vision Paper

    • Question: Are complex machine learning models actually necessary for cloud resource usage forecasting?

    • Proposal: Practical memory management systems need to first identify the extent to which simple solutions can be effective.
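The "try simple first" proposal can be illustrated with trivial baselines: a last-value forecaster and a moving average over a short usage series. The data and window are made up; the point is that a heavier ML model must beat such baselines by enough to justify its training and serving cost.

```python
def naive_forecast(series):
    """Predict the next value as the last observed value."""
    return series[-1]

def moving_average_forecast(series, window=3):
    """Predict the next value as the mean of the last `window` observations."""
    return sum(series[-window:]) / window

usage = [50, 52, 51, 53, 52, 54, 53]   # e.g., memory usage (GB) per interval
errors_naive, errors_ma = [], []
for t in range(3, len(usage)):
    errors_naive.append(abs(naive_forecast(usage[:t]) - usage[t]))
    errors_ma.append(abs(moving_average_forecast(usage[:t]) - usage[t]))

mae_naive = sum(errors_naive) / len(errors_naive)
mae_ma = sum(errors_ma) / len(errors_ma)
print(mae_naive, mae_ma)
```

On this toy series the moving average already halves the naive error; any ML model would have to clear that bar first.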

Serverless Computing

  • Golgi: Performance-Aware, Resource-Efficient Function Scheduling for Serverless Computing [Paper]

    • HKUST & WeBank

    • Best Paper Award!

    • A scheduling system for serverless functions to minimize resource provisioning costs while meeting the function latency requirements

    • Overcommit functions based on their past resource usage; Identify nine low-level metrics (e.g., request load, resource allocation, contention on shared resources); Use a Mondrian Forest model to predict function performance

    • Employ a conservative exploration-exploitation strategy for request routing; By default, route requests to non-overcommitted instances; Explore to use overcommitted instances

    • Vertical scaling to dynamically adjust the concurrency of overcommitted instances
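The conservative exploration-exploitation routing can be sketched as follows: route to non-overcommitted instances by default, explore overcommitted instances with a small probability, and shrink that probability whenever an SLO violation is observed. The parameters and back-off rule are illustrative assumptions, not Golgi's exact policy.

```python
import random

class Router:
    """Default to safe instances; explore overcommitted ones cautiously."""
    def __init__(self, slo_ms, explore_prob=0.05):
        self.slo_ms = slo_ms
        self.explore_prob = explore_prob

    def route(self, safe_instances, overcommitted):
        if overcommitted and random.random() < self.explore_prob:
            return random.choice(overcommitted)      # cautious exploration
        return random.choice(safe_instances)         # default: safe path

    def feedback(self, latency_ms):
        """Halve exploration after an SLO violation; grow it back slowly."""
        if latency_ms > self.slo_ms:
            self.explore_prob = max(0.01, self.explore_prob / 2)
        else:
            self.explore_prob = min(0.5, self.explore_prob * 1.1)

random.seed(1)
router = Router(slo_ms=100)
target = router.route(["safe-1", "safe-2"], ["over-1"])
router.feedback(250)           # a violation was observed
print(target, router.explore_prob)
```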

  • Parrotfish: Parametric Regression for Optimizing Serverless Functions [Paper]

    • UBC & UTokyo & INSAT

    • Find optimal configurations through an online learning process

    • Use parametric regression to choose the right memory configurations for serverless functions
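A sketch of the parametric-regression approach: fit a simple duration model over a few profiled (memory, duration) samples, then pick the cheapest memory size whose predicted duration meets a latency bound. The `a + b / mem` model form, the price constant, and the latency bound are assumptions for illustration, not Parrotfish's exact formulation.

```python
def fit_inverse_model(samples):
    """Least-squares fit of duration = a + b / mem over (mem_mb, duration_ms) samples."""
    xs = [1.0 / m for m, _ in samples]
    ys = [d for _, d in samples]
    n = len(samples)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def best_memory(samples, candidates, max_duration_ms):
    """Cheapest candidate whose predicted duration stays within the bound."""
    a, b = fit_inverse_model(samples)
    feasible = [m for m in candidates if a + b / m <= max_duration_ms]
    price_per_mb_ms = 1e-9                      # illustrative unit price
    return min(feasible, key=lambda m: price_per_mb_ms * m * (a + b / m))

# A few profiled (memory MB, duration ms) points for a CPU-bound function.
samples = [(128, 8100), (256, 4100), (512, 2100)]
choice = best_memory(samples, [128, 256, 512, 1024, 2048], max_duration_ms=3000)
print(choice)
```

Since billed cost is roughly memory times duration, the cheapest feasible configuration here is the smallest memory that still meets the 3-second bound.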

  • AsyFunc: A High-Performance and Resource-Efficient Serverless Inference System via Asymmetric Functions [Paper] [Code]

    • HUST & Huawei & Peng Cheng Laboratory

    • Problem: The time-consuming and resource-hungry model-loading process when scaling out function instances

    • Observation: The sensitivity of each layer to the computing resources is mostly anti-correlated with its memory resource usage

    • Asymmetric Functions

      • The original Body Function loads a complete model to meet stable demands

      • The proposed lightweight Shadow Function loads only the resource-sensitive layers, so it can absorb sudden demand spikes cheaply

    • AsyFunc — an inference serving system with an auto-scaling and scheduling engine; Built on top of Knative

  • Chitu: Accelerating Serverless Workflows with Asynchronous State Replication Pipeline [Paper] [Code]

    • ISCAS & ICT, CAS

    • Asynchronous State Replication Pipelines (ASRP) to speed up serverless workflows for general applications

    • Three insights

      • Provide differentiable data types (DDT) at the programming model level to support incremental state sharing and computation

      • Continuously deliver changes of DDT objects in real-time

      • Direct communication and change propagation

    • Built atop OpenFaaS
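The DDT idea can be sketched as a dict-like object that records incremental deltas, so a downstream function applies only the changes instead of receiving the whole state. The class and method names are illustrative, not Chitu's actual API.

```python
class DeltaDict:
    """Toy 'differentiable data type': state plus a drainable change log."""
    def __init__(self):
        self._data = {}
        self._pending = []                 # deltas not yet propagated

    def set(self, key, value):
        self._data[key] = value
        self._pending.append(("set", key, value))

    def pop_deltas(self):
        """Drain the change log, e.g. to stream to the next function in a workflow."""
        deltas, self._pending = self._pending, []
        return deltas

    def apply(self, deltas):
        for op, key, value in deltas:
            if op == "set":
                self._data[key] = value

    def snapshot(self):
        return dict(self._data)

upstream, downstream = DeltaDict(), DeltaDict()
upstream.set("words", 10)
downstream.apply(upstream.pop_deltas())    # only the delta crosses functions
upstream.set("words", 25)
downstream.apply(upstream.pop_deltas())    # incremental update, not full state
print(downstream.snapshot())
```

Continuously shipping such deltas is what lets downstream functions start computing before upstream state is complete.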

  • How Does It Function? Characterizing Long-term Trends in Production Serverless Workloads [Paper] [Trace]

    • Huawei

    • Industry Paper

    • Two new serverless traces in Huawei Cloud

      • The first trace: Huawei's internal workloads; Per-second statistics for 200 functions

      • The second trace: Huawei's public FaaS platform; Per-minute arrival rates for over 5000 functions

    • Characterize resource consumption, cold-start times, programming languages used, periodicity, per-second versus per-minute burstiness, correlations, and popularity.

    • Findings

      • Requests vary by up to 9 orders of magnitude across functions, with some functions executed over 1 billion times per day

      • Scheduling time, execution time and cold-start distributions vary across 2 to 4 orders of magnitude and have very long tails

      • Function invocation counts demonstrate strong periodicity for many individual functions and on an aggregate level

    • The need for further research in estimating resource reservations and time-series prediction
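The per-second versus per-minute burstiness finding can be illustrated with a peak-to-mean ratio on a synthetic arrival stream: the same workload looks extremely bursty at second granularity but perfectly smooth per minute. The data below is made up for illustration.

```python
def peak_to_mean(counts):
    """Simple burstiness measure: peak rate over mean rate."""
    return max(counts) / (sum(counts) / len(counts))

# 120 seconds of arrivals: ~1 request/s, plus one 30-request spike each minute.
per_second = [1] * 120
per_second[10] = 30
per_second[70] = 30

# Aggregate the same stream into per-minute counts.
per_minute = [sum(per_second[i:i + 60]) for i in range(0, 120, 60)]

print(peak_to_mean(per_second), peak_to_mean(per_minute))
```

A per-minute trace would hide the spikes entirely, which is why the granularity of a trace matters for autoscaling studies.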

  • Function as a Function [Paper]

    • ETH

    • Vision Paper

    • Dandelion -- a clean-slate FaaS system; Treat serverless functions as pure functions; Explicitly separate computation and I/O; Hardware acceleration; Enable dataflow-aware function orchestration

  • The Gap Between Serverless Research and Real-world Systems [Paper]

    • SJTU & Huawei Cloud

    • Vision Paper

    • Five open challenges

      • Optimize cold start latency: Most existing works only consider synchronous starts, while asynchronous starts are common in industry

      • Declarative approach: Is Kubernetes the right system for serverless computing?

      • Scheduling cost

      • Balance different scheduling policies within a serverless system

      • Costs of sidecars

Sustainable Computing

  • Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale [Paper]

    • MIT & NEU

    • Significant decreases in both temperature and power draw, reducing power consumption and potentially improving hardware life-span, with minimal impact on job performance
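A sketch of the capping decision: from measured throughput at a few candidate caps, pick the lowest power limit whose slowdown stays within a budget. The numbers and the 3% budget are made up; in practice the chosen cap would be applied with `nvidia-smi --power-limit`.

```python
def pick_cap(measurements, max_slowdown=0.03):
    """measurements: {cap_watts: throughput}; baseline is the highest cap.
    Return the lowest cap whose relative slowdown is within the budget."""
    baseline = measurements[max(measurements)]
    ok = [cap for cap, tput in measurements.items()
          if (baseline - tput) / baseline <= max_slowdown]
    return min(ok)

# Measured samples/sec at each candidate power cap for one training job (made up).
measurements = {250: 100.0, 225: 99.5, 200: 98.0, 175: 93.0}
print(pick_cap(measurements))
```

Here a 200 W cap saves 50 W per GPU at a 2% throughput cost, while 175 W would exceed the slowdown budget.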
