# SIGCOMM 2025

## Meta Info

Homepage: <https://conferences.sigcomm.org/sigcomm/2025/>

### Paper List

* <https://conferences.sigcomm.org/sigcomm/2025/accepted-papers/>
* <https://dl.acm.org/doi/proceedings/10.1145/3718958>

### Acceptance Rate

≈ 16% (74 accepted out of ~460 submissions)

## Papers

### Large Language Models (LLMs)

* Infrastructure
  * InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750468)] \[[Video](https://www.youtube.com/watch?v=d4PX36vVDX0)]
    * PKU & StepFun & Lightelligence
    * Key insight: unify connectivity and dynamic switching at the transceiver level using OCS.
    * Realize the transceiver-centric HBD architecture in production → Flexible construction of arbitrarily large ring topologies & improved system resilience
      * Silicon Photonics (SiPh) based OCS transceiver (OCSTrx)
      * Reconfigurable k-hop ring topology → Each node connects to every node within k hops via OCSTrx
      * HBD-DCN orchestration algorithm → Minimize cross-ToR traffic
  * Astral: A Datacenter Infrastructure for Large Language Model Training at Scale \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750521)] \[[Video](https://www.youtube.com/watch?v=Ou389tkXL1I)]
    * NJU & Tencent
* LLM Training
  * DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750472)] \[[Video](https://www.youtube.com/watch?v=O_qdjBsLwE8)] \[[arXiv](https://arxiv.org/abs/2408.04275)]
    * PKU & StepFun
    * Disaggregated model orchestration: separately train the modality encoder (e.g., ViT for images, BEATs for audio), the LLM backbone, and the modality generator (e.g., diffusion models for images, AudioLDM for audio).
    * Disaggregated data preprocessing: decouple data preprocessing from training.
    * Integrated with Megatron-LM.
  * ByteScale: Communication-Efficient Scaling of LLM Training with a 2048K Context Length on 16384 GPUs \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3754352)] \[[Video](https://www.youtube.com/watch?v=cXsMyc7ROyo)]
    * PKU & ByteDance
    * Limitations of existing works
      * The mismatch between data heterogeneity & static mesh → Redundant communication and imbalanced computation.
    * **HDP**: Hybrid Data Parallelism
      * Unify the inter- and intra-data partitioning with a dynamic mesh design.
      * A communication optimizer
        * Eliminate the redundant communication for *short sequences* by *data-aware sharding* and *dynamic communication*.
        * Compress the communication cost for *long sequences* by *selective offloading*.
      * A balance scheduler → Mitigate the imbalanced computation by *parallelism-aware data assignment*.
  * From ATOP to ZCube: Automated Topology Optimization Pipeline and A Highly Cost-Effective Network Topology for Large Model Training \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750503)] \[[Video](https://www.youtube.com/watch?v=7fVx3b1T_dc)]
    * THU & Zhongguancun Laboratory & Harnets.AI & ByteDance
    * ATOP: Automated Topology Optimization Pipeline
      * Model network topology as a set of hyperparameters → Enable the discovery of potential network topologies.
    * A new topology, ZCube, discovered by ATOP.
      * Achieves the highest cost-effectiveness across various GPU scales.
  * SkeletonHunter: Diagnosing and Localizing Network Failures in Containerized Large Model Training \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750513)] \[[Video](https://www.youtube.com/watch?v=s1hPuZ033Nk)]
    * THU & Alibaba Cloud
    * Key idea: reason about the traffic skeleton, which comprises a crucial set of network paths consistently traversed by the training traffic.
* Privacy-preserving LLM Inference
  * SCX: Stateless KV-Cache Encoding for Cloud-Scale Confidential Transformer Serving \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750509)] \[[Video](https://www.youtube.com/watch?v=isix2nPNkyI)] \[[Code](https://github.com/yuanmu97/scx)]
    * CUHK
    * Encode the intermediate key-value cache using user-controlled keys → Ensure that the cloud can neither recover the input nor independently complete the next token prediction.
* LLMOps
  * Intent-Driven Network Management with Multi-Agent LLMs: The Confucius Framework \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750537)] \[[Video](https://www.youtube.com/watch?v=QpCXjseK0HQ)]
    * Meta
    * Model network management workflows as DAGs to aid planning.
    * Integrate LLMs with existing *management tools* to achieve seamless operational integration, employ RAG to improve long-term memory, and establish a set of primitives to systematically support *human/model interaction*.
    * Integrate with existing *network validation* methods and incorporate its own validation framework to prevent regressions.
  * Towards LLM-Based Failure Localization in Production-Scale Networks \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750505)] \[[Video](https://www.youtube.com/watch?v=95NQmZrTwbk)]
    * NJU & Alibaba Cloud
    * BiAn (狴犴), an LLM-based framework for efficient incident investigation.
    * Process monitoring data and generate ranked lists of suspect devices with detailed explanations.
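
The reconfigurable k-hop ring noted under InfiniteHBD is easy to make concrete. A minimal sketch (my own illustration with modular node indexing, not the paper's code):

```python
# Sketch of a k-hop ring topology (illustration only, not InfiniteHBD's code):
# each of the n nodes links to every node at ring distance <= k.

def k_hop_ring(n: int, k: int) -> dict[int, set[int]]:
    """Return adjacency sets of an n-node ring where each node
    connects to all nodes within k hops in either direction."""
    adj: dict[int, set[int]] = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k + 1):
            adj[i].add((i + d) % n)  # d hops clockwise
            adj[i].add((i - d) % n)  # d hops counter-clockwise
    return adj

adj = k_hop_ring(n=8, k=2)
# For n > 2k, every node has exactly 2k neighbors, e.g. node 0 -> {1, 2, 6, 7}.
print(sorted(adj[0]))
```

Reconfiguring k then trades per-node link count against ring diameter, which is the flexibility the OCS transceivers are meant to provide.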

### Mixture-of-Experts (MoEs)

* MoE Training
  * MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750465)] \[[Video](https://www.youtube.com/watch?v=zYVtGsWzAjI)]
    * HKUST
    * Design and implement a regionally reconfigurable HBD that augments existing electrical interconnects using OCS.
* MoE Inference
  * MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with Disaggregated Expert Parallelism \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750506)] \[[Video](https://www.youtube.com/watch?v=eXHC8OpI2hk)] \[[arXiv](https://arxiv.org/abs/2504.02263)]
    * PKU & ByteDance
    * Attention/FFN Disaggregation (AFD)
    * Provide an M2N communication library → Eliminate unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization.
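
Attention/FFN disaggregation implies an M-to-N dispatch of per-token activations from attention workers to FFN (expert) workers. A hedged sketch of the grouping step, with invented names (`dispatch`, `expert_id`) rather than MegaScale-Infer's actual API:

```python
# Illustrative M-to-N token dispatch: M attention workers route each token's
# activation to one of n_ffn FFN workers. Routing rule and names are my own
# simplification, not MegaScale-Infer's code.
from collections import defaultdict

def dispatch(tokens: list[tuple[int, int]], n_ffn: int) -> dict[int, list[int]]:
    """Group token ids by destination FFN worker.
    tokens: (token_id, expert_id) pairs produced by the router/gating."""
    outbox: dict[int, list[int]] = defaultdict(list)
    for token_id, expert_id in tokens:
        outbox[expert_id % n_ffn].append(token_id)  # map expert to its host
    return dict(outbox)

# Routed tokens from attention workers, merged into per-FFN send buffers.
sends = dispatch([(0, 1), (1, 3), (2, 1), (3, 0)], n_ffn=2)
print(sends)  # {1: [0, 1, 2], 0: [3]}
```

The point of a dedicated M2N library is to move these per-destination buffers without the copies and synchronization a generic collective would incur.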

### RDMA

* Reliability
  * Revisiting RDMA Reliability for Lossy Fabrics \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750480)] \[[Video](https://www.youtube.com/watch?v=4CgsjKMJ1Ns)]
    * HKUST & Huawei
    * **Best Student Paper Award (Honorable Mention)**
    * DCP co-designs the switch and the RNIC, comprising DCP-Switch and DCP-RNIC.
      * Header-only-based retransmission.
      * Bitmap-free packet tracking.
    * Prototyped DCP-Switch on a P4 switch and DCP-RNIC on an FPGA.
* Virtualization
  * Software-based Live Migration for RDMA \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750487)] \[[Video](https://www.youtube.com/watch?v=_JTZhc6wYxo)]
    * THU & MSRA
    * MigrRDMA: a software-based RDMA live migration.
    * Provide a software indirection layer to achieve transparent switching to new RDMA communications.
    * Implemented over Mellanox RNICs.
  * ByteDance Jakiro: Enabling RDMA and TCP over Virtual Private Cloud \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750496)] \[[Video](https://www.youtube.com/watch?v=gh13ILZGY1s)]
    * ByteDance
    * Support fundamental VPC features (e.g., QoS, security groups) for both RDMA and TCP streams.
  * Alibaba Stellar: A New Generation RDMA Network for Cloud AI \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750539)]
    * Alibaba Cloud
    * Limitations of existing RDMA virtualization solutions (e.g., SR-IOV)
      * Host-level
        * The number of VFs is static → Cannot dynamically scale the number of VFs.
        * The container must pin all of its memory in the host memory before initiating any RDMA operation → A minute-level start-up delay.
      * PCIe-level
        * LUT in PCIe fabrics is severely limited in size → Only a small number of VFs to enable GDR.
      * RNIC-level
        * No support for strict isolation between RDMA and non-RDMA traffic.
    * Three designs
      * Para-Virtualized Direct Memory Access (**PVDMA**) for on-demand memory pinning → Reduce host memory consumption & mitigate the start-up delay of secure containers.
      * Extended Memory Translation Table (**eMTT**) for optimized GDR performance → Allow the RNIC to bypass unnecessary consultations of memory address mappings in the PCIe fabric.
      * RDMA Packet Spray for efficient multi-path utilization
* Performance Diagnosis
  * Hawkeye: Diagnosing RDMA Network Performance Anomalies with PFC Provenance \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750490)] \[[Video](https://www.youtube.com/watch?v=wp5h3cFMXsk)]
    * THU & BUAA & Infrawaves
    * Three designs
      * A PFC-aware telemetry mechanism → Record the PFC impact on flows
      * An in-network PFC causality analysis and tracing mechanism → Collect causal telemetry for diagnosis
      * A provenance-based diagnosis algorithm → Present the anomaly breakdown, identify the anomaly type and root causes
    * Evaluated in NS-3 simulations and on a Tofino testbed.
* I/O Acceleration
  * CEIO: A Cache-Efficient Network I/O Architecture for NIC-CPU Data Paths \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750488)] \[[Video](https://www.youtube.com/watch?v=yMJG4m56-eo)] \[[Code](https://github.com/axio-project/ceio)]
    * HKUST
    * Limitations of traditional I/O acceleration strategies (e.g., Data Direct I/O (DDIO), RDMA)
      * Inefficient utilization of the LLC.
    * Cache-efficient I/O → Line-rate throughput and µs-scale tail latency
      * Limit I/O Rate → Proactive rate control
      * Limit I/O Capacity → Elastic buffer
    * Implemented on commodity SmartNICs and incorporated into DPDK and RDMA libraries.
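
CEIO's "limit I/O capacity" idea can be pictured as a bounded in-flight budget. A simplified sketch (mine, not the paper's mechanism) that back-pressures admission once outstanding bytes exceed an LLC-sized budget:

```python
# Hypothetical elastic-buffer sketch: admit NIC-delivered data only while the
# outstanding working set fits a cache-sized budget, so I/O data is consumed
# from the LLC instead of spilling to DRAM. Numbers are illustrative.

class ElasticBuffer:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.outstanding = 0

    def try_admit(self, nbytes: int) -> bool:
        if self.outstanding + nbytes > self.capacity:
            return False          # back-pressure instead of evicting hot lines
        self.outstanding += nbytes
        return True

    def complete(self, nbytes: int) -> None:
        self.outstanding -= nbytes  # application consumed the data

buf = ElasticBuffer(capacity_bytes=4096)
print(buf.try_admit(3000), buf.try_admit(2000))  # True False (budget exceeded)
buf.complete(3000)                               # consumption frees budget
print(buf.try_admit(2000))                       # True
```

The companion "limit I/O rate" lever would pace admissions over time rather than capping the instantaneous working set.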

### Hardware Transport

* Falcon: A Reliable, Low Latency Hardware Transport \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3754353)] \[[Video](https://www.youtube.com/watch?v=nu18DsuvnlU)]

  * Google
  * Support multiple Upper Layer Protocols (ULPs) and heterogeneous application workloads in general-purpose Ethernet datacenter environments (with losses and without special switch support).
  * Key designs: delay-based congestion control with multipath load balancing, a layered design with a simple request-response transaction interface for multi-ULP support, hardware-based retransmissions and error-handling for scalability, a programmable engine for flexibility.
  * ![Falcon hardware transport layers](https://819228986-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MkzeiawY8SkBarQBDVm-659326392%2Fuploads%2FIUOhLC2vakmpOtjr2aZG%2FFalcon-hardware-transport-layers.png?alt=media&token=f032855c-bdce-4fe2-9376-176b9e80ba2c)
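
The delay-based congestion control mentioned above can be sketched generically: additive increase while measured delay stays under a target, multiplicative decrease once it overshoots. The constants and update rule below are illustrative, not Falcon's actual algorithm:

```python
# Generic delay-based congestion-control update (sketch, not Falcon's code):
# grow the window gently while RTT is under target, back off when it exceeds it.

def update_cwnd(cwnd: float, rtt_us: float, target_us: float,
                ai: float = 1.0, md: float = 0.8, min_cwnd: float = 1.0) -> float:
    if rtt_us <= target_us:
        return cwnd + ai / cwnd          # additive increase per ack
    return max(min_cwnd, cwnd * md)      # multiplicative decrease on delay

cwnd = 10.0
cwnd = update_cwnd(cwnd, rtt_us=50, target_us=100)   # below target -> grows
cwnd = update_cwnd(cwnd, rtt_us=250, target_us=100)  # above target -> shrinks
print(round(cwnd, 2))  # 8.08
```

Delay as the congestion signal is what lets such a transport run on general-purpose Ethernet without special switch support.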

### Collective Communication

* ResCCL: Resource-Efficient Scheduling for Collective Communication \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750514)] \[[Video](https://www.youtube.com/watch?v=uu594-CfWNE)]
  * NEU & SIAT, CAS & Alibaba Cloud
  * Limitations of existing works (e.g., NCCL, RCCL, MSCCL)
    * Static resource allocation and scheduling mechanisms → Inefficient utilization of bandwidth and SM resources for various collective algorithms
  * Three designs
    * Optimize scheduling at the primitive level (e.g., send and recvReduceCopy).
    * Enable flexible thread block allocation.
    * Generate lightweight communication kernels to minimize runtime overhead.
* SyCCL: Exploiting Symmetry for Efficient Collective Communication Scheduling \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750499)] \[[Video](https://www.youtube.com/watch?v=x1vL9SbcZmE)]
  * Alibaba Cloud & THU
  * Limitations of existing works
    * Existing collective communication libraries (e.g., NCCL, RCCL) rely on fixed schedules and cannot adjust to varying topology and model requirements.
    * Existing collective schedule synthesizers (e.g., TECCL, TACCL) utilize Mixed Integer Linear Program for modeling but encounter scalability challenges.
  * SyCCL, a scalable collective schedule synthesizer → Synthesize near-optimal schedules in tens of minutes.
    * Leverage collective and topology symmetries to decompose the original collective communication demand into smaller sub-demands within smaller topology subsets.
    * Propose efficient search strategies to explore potential sub-demands, synthesize corresponding sub-schedules, and integrate these sub-schedules into complete schedules.
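
To ground what these synthesizers produce, here is the textbook ring all-gather schedule (a classic baseline, not ResCCL's or SyCCL's output): in n − 1 steps, each node forwards the chunk it most recently received to its ring successor.

```python
# Textbook ring all-gather schedule: at step s, node i sends chunk (i - s) mod n
# to node (i + 1) mod n. After n - 1 steps every node holds all n chunks.

def ring_allgather_schedule(n: int) -> list[list[tuple[int, int, int]]]:
    """Return steps of (src, dst, chunk) triples."""
    return [[(i, (i + 1) % n, (i - s) % n) for i in range(n)]
            for s in range(n - 1)]

# Simulate the schedule to check correctness.
n = 4
have = [{i} for i in range(n)]            # node i starts with its own chunk
for step in ring_allgather_schedule(n):
    for src, dst, chunk in step:
        assert chunk in have[src]          # sender must already hold the chunk
        have[dst] = have[dst] | {chunk}
print(all(h == set(range(n)) for h in have))  # True
```

A synthesizer's job is to find schedules like this (and better ones) automatically for irregular topologies, where the symmetric decomposition SyCCL exploits keeps the search tractable.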

### Video Streaming

* Towards User-level QoE: Large-scale Practice in Personalized Optimization of Adaptive Video Streaming \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750526)] \[[Video](https://www.youtube.com/watch?v=vWqnKwmlkwQ)]
  * THU & Kuaishou & SFU
  * **LingXi**, a system for personalized adaptive video streaming.
    * Dynamically optimize the objectives of adaptive video streaming algorithms by analyzing user engagement.
    * Iteratively determine optimal parameters through Monte Carlo sampling and online Bayesian optimization.
* TLadder: QoE-Centric Video Ladder Optimization with Playback Feedback at Billion Scale \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750500)] \[[Video](https://www.youtube.com/watch?v=pKPHFzR3Cww)]
  * ByteDance
  * Jointly consider the video content dimension (i.e., the bitrate-quality tradeoff of candidate representations) and the playback feedback dimension (e.g., network condition, rebuffering time, and playback bitrate).
* ACE: Sending Burstiness Control for High-Quality Real-time Communication \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750520)] \[[Video](https://www.youtube.com/watch?v=_c8fOxVZsAU)]
  * HKUST & ByteDance
  * A dual-control approach that manages both the encoding and transmission burstiness.
    * Sender: dynamically adjust the bucket size of a token-based pacer to control burstiness at the granularity of frame level.
    * Encoder: an adaptive complexity mechanism that smooths frame sizes without sacrificing quality.
* Harnessing WebRTC for Large-Scale Live Streaming \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750535)] \[[Video](https://www.youtube.com/watch?v=Bf_zTLZ2C5w)]
  * ByteDance
  * Focus on optimizing first-frame delay, startup video rebuffering, audio-to-video drift, and per-session CPU usage.
* Scalable Video Conferencing Using SDN Principles \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750489)] \[[Video](https://www.youtube.com/watch?v=_8v9ZTrFkZI)] \[[Code](https://github.com/princeton-cabernet/scallop)]
  * Princeton & UVA
  * **Scallop**, an SDN-inspired SFU (Selective Forwarding Unit)
    * Decouple video-conferencing applications into a hardware-based data plane for latency-sensitive, frequent media operations.
    * A software control plane handles the (infrequent) remaining tasks (e.g., feedback-signal analysis, session management).
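
ACE's sender-side control can be pictured as a token-bucket pacer whose bucket (burst) size is tuned per frame. A hedged sketch; the class, parameters, and numbers are my own illustration, not ACE's implementation:

```python
# Illustrative token-bucket pacer with an adjustable bucket size, in the spirit
# of frame-level burstiness control. Rates and sizes below are made up.

class Pacer:
    def __init__(self, rate_bps: float, bucket_bytes: float):
        self.rate = rate_bps / 8.0      # refill rate in bytes/second
        self.bucket = bucket_bytes      # max burst in bytes
        self.tokens = bucket_bytes
        self.last = 0.0

    def set_bucket(self, bucket_bytes: float) -> None:
        self.bucket = bucket_bytes      # e.g. shrink to cap per-frame bursts
        self.tokens = min(self.tokens, bucket_bytes)

    def try_send(self, now: float, nbytes: int) -> bool:
        elapsed = now - self.last
        self.tokens = min(self.bucket, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens < nbytes:
            return False                # hold the packet: burst budget drained
        self.tokens -= nbytes
        return True

p = Pacer(rate_bps=8_000_000, bucket_bytes=5_000)   # 1 MB/s, 5 KB burst
print(p.try_send(0.0, 5_000), p.try_send(0.0, 1))   # True False: burst drained
print(p.try_send(0.01, 5_000))                       # True: 10 ms refills bucket
```

Shrinking the bucket caps how much of a large encoded frame can leave back-to-back, which is exactly the burstiness the paper targets.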

### CXL

* Understanding and Profiling CXL.mem Using PathFinder \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750479)] \[[Video](https://www.youtube.com/watch?v=PHqrdRGRfD4)] \[[Code](https://github.com/netlab-wisconsin/PathFinder)]
  * UW-Madison & BUAA & Intel
  * Leverage the capabilities of existing PMUs and dissect the `CXL.mem` protocol at adequate granularities.
  * Key idea: view the server processor and its chipset as a multi-stage Clos network, equip each architectural module with a PMU-based telemetry engine, track different `CXL.mem` paths, and apply conventional traffic analysis techniques.
  * Perform snapshot-based path-driven profiling and introduce four techniques: path construction, stall cycle breakdown, interference analyzer, and cross-snapshot analysis.
  * Built atop Linux Perf.
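
The path-driven profiling idea rests on a classic PMU technique: estimating per-stage latency from occupancy and insert counters (Little's law). A hedged sketch; the stage names and counter values are invented, not PathFinder's actual counters:

```python
# Sketch of a per-stage latency breakdown from PMU-style counters, using
# avg latency ~= occupancy_cycles / inserts (Little's law). Hypothetical
# stage names along a CXL.mem path; not PathFinder's real telemetry.

def stage_latencies(counters: dict[str, tuple[int, int]]) -> dict[str, float]:
    """counters: stage -> (occupancy_cycles, inserts).
    Returns average cycles spent per request in each stage."""
    return {stage: occ / max(ins, 1) for stage, (occ, ins) in counters.items()}

snapshot = {
    "core->home_agent": (12_000, 400),
    "home_agent->cxl_port": (90_000, 400),
    "cxl_port->device": (30_000, 400),
}
lat = stage_latencies(snapshot)
bottleneck = max(lat, key=lat.get)
print(bottleneck, lat[bottleneck])  # home_agent->cxl_port 225.0
```

Viewing the processor as a multi-stage Clos network then means applying this per-link analysis stage by stage along each `CXL.mem` path.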

### Network Failures

* SkyNet: Analyzing Alert Flooding from Severe Network Failures in Large Cloud Infrastructures \[[Paper](https://dl.acm.org/doi/10.1145/3718958.3750536)] \[[Video](https://www.youtube.com/watch?v=3NhCnKqgBtM)]
  * Alibaba Cloud
  * Extract scope and severity information from alert floods.
  * Integrate multiple monitoring data sources through a uniform input format.
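
Extracting scope from an alert flood can be illustrated by grouping alerting devices under a shared location prefix. A hedged sketch with an invented device-naming scheme, not SkyNet's actual input format:

```python
# Illustrative scope inference from an alert flood: group alerting devices by
# the first `level` components of a hierarchical name (hypothetical scheme
# like 'pod.rack.tor') and report the dominant scope and its alert share.
from collections import Counter

def infer_scope(devices: list[str], level: int) -> tuple[str, float]:
    scopes = Counter(".".join(d.split(".")[:level]) for d in devices)
    scope, count = scopes.most_common(1)[0]
    return scope, count / len(devices)

alerts = ["pod3.rack12.tor1", "pod3.rack12.tor2",
          "pod3.rack7.tor1", "pod3.rack12.tor1"]
print(infer_scope(alerts, level=1))  # ('pod3', 1.0): whole pod implicated
print(infer_scope(alerts, level=2))  # ('pod3.rack12', 0.75): dominant rack
```

Sweeping `level` from coarse to fine gives a rough severity/scope estimate: the coarsest level at which one scope still dominates the flood.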

## Acronyms

* RDMA: Remote Direct Memory Access
* OCS: Optical Circuit Switching
* HBD: High-Bandwidth Domain
* DCN: Datacenter Network
* ToR: Top-of-Rack
* VPC: Virtual Private Cloud
* SR-IOV: Single-Root Input/Output Virtualization
* GDR: GPUDirect RDMA
* VF: Virtual Function
* LUT: Look-Up Table
* LLC: Last-Level Cache
* CXL: Compute Express Link
* PMU: Performance Monitoring Unit
* WebRTC: Web Real-Time Communications
* DAG: Directed Acyclic Graph
* RAG: Retrieval-Augmented Generation
