SIGCOMM 2025

Meta Info

Homepage: https://conferences.sigcomm.org/sigcomm/2025/

Paper List

Acceptance Rate

≈16% (74 accepted / ~460 submissions)

Papers

Large Language Models (LLMs)

  • Infrastructure

    • InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers [Paper] [Video]

      • PKU & StepFun & Lightelligence

      • Key insight: unify connectivity and dynamic switching at the transceiver level using OCS.

      • Realize the transceiver-centric HBD architecture in production → Flexible construction of arbitrarily large ring topologies & improved system resilience

        • Silicon Photonics (SiPh) based OCS transceiver (OCSTrx)

        • Reconfigurable k-hop ring topology → Each node connects to all other nodes within ≤k hops via OCSTrx

        • HBD-DCN orchestration algorithm → Minimize cross-ToR traffic
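The k-hop ring connectivity above can be sketched in a few lines (an illustrative toy, not the paper's implementation; function and parameter names are ours):

```python
def khop_ring_neighbors(i, n, k):
    """Neighbors of node i in an n-node ring reachable within <= k hops.

    Each node connects to the k nodes on either side, so (for n > 2k)
    every node has exactly 2k neighbors.
    """
    return sorted({(i + d) % n for d in range(-k, k + 1)} - {i})

# e.g., in an 8-node ring with k = 2, node 0 connects to nodes 1, 2, 6, 7
print(khop_ring_neighbors(0, 8, 2))
```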

    • Astral: A Datacenter Infrastructure for Large Language Model Training at Scale [Paper] [Video]

      • NJU & Tencent

  • LLM Training

    • DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models [Paper] [Video] [arXiv]

      • PKU & StepFun

      • Disaggregated model orchestration: separate the training of the modality encoder (ViT for images, BEATs for audio), the LLM backbone, and the modality generator (diffusion models for images, AudioLDM for audio).

      • Disaggregated data preprocessing: decouple data preprocessing from training.

      • Integrated with Megatron-LM.

    • ByteScale: Communication-Efficient Scaling of LLM Training with a 2048K Context Length on 16384 GPUs [Paper] [Video]

      • PKU & ByteDance

      • Limitations of existing works

        • The mismatch between data heterogeneity & static mesh → Redundant communication and imbalanced computation.

      • HDP: Hybrid Data Parallelism

        • Unify the inter- and intra-data partitioning with a dynamic mesh design.

        • A communication optimizer

          • Eliminate the redundant communication for short sequences by data-aware sharding and dynamic communication.

          • Compress the communication cost for long sequences by selective offloading.

        • A balance scheduler → Mitigate the imbalanced computation by parallelism-aware data assignment.
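The balance scheduler's goal of evening out computation across workers can be illustrated with a classic greedy longest-first assignment (a stand-in sketch with names of our own choosing, not ByteScale's actual algorithm):

```python
import heapq

def balanced_assign(seq_lens, num_workers):
    """Greedily assign sequences (longest first) to the least-loaded worker,
    approximating a parallelism-aware, balance-oriented data assignment."""
    # Heap entries are (load, worker_id, assigned_seqs); worker_id breaks ties.
    heap = [(0, w, []) for w in range(num_workers)]
    heapq.heapify(heap)
    for s in sorted(seq_lens, reverse=True):
        load, w, items = heapq.heappop(heap)
        items.append(s)
        heapq.heappush(heap, (load + s, w, items))
    return [items for _, _, items in sorted(heap, key=lambda e: e[1])]

# Heterogeneous sequence lengths end up evenly split across 2 workers:
print(balanced_assign([8, 7, 3, 2], 2))
```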

    • From ATOP to ZCube: Automated Topology Optimization Pipeline and A Highly Cost-Effective Network Topology for Large Model Training [Paper] [Video]

      • THU & Zhongguancun Laboratory & Harnets.AI & ByteDance

      • ATOP: Automated Topology Optimization Pipeline

        • Model network topology as a set of hyperparameters → Enable the discovery of potential network topologies.

      • A new topology, ZCube, discovered by ATOP.

        • Achieves the highest cost-effectiveness across various GPU scales.

    • SkeletonHunter: Diagnosing and Localizing Network Failures in Containerized Large Model Training [Paper] [Video]

      • THU & Alibaba Cloud

      • Key idea: reason about the traffic skeleton, which comprises a crucial set of network paths consistently traversed by the training traffic.

  • Privacy-preserving LLM Inference

    • SCX: Stateless KV-Cache Encoding for Cloud-Scale Confidential Transformer Serving [Paper] [Video] [Code]

      • CUHK

      • Encode the intermediate key-value cache using user-controlled keys → Ensure that the cloud can neither recover the input nor independently complete the next token prediction.
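The idea of encoding the KV cache under a user-controlled key can be illustrated with a toy keyed-mask construction (counter-mode hashing plus XOR; this is our own sketch for intuition, NOT SCX's scheme and not production-grade cryptography):

```python
import hashlib

def keyed_mask(key: bytes, length: int) -> bytes:
    """Derive a deterministic pseudorandom mask from a user-held key
    (toy counter-mode hashing; illustrative only)."""
    out, ctr = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:length]

def encode_kv(cache: bytes, key: bytes) -> bytes:
    """XOR the serialized KV cache with the keyed mask. Without the
    user-held key the cloud cannot invert the encoding; applying the
    same operation again decodes."""
    mask = keyed_mask(key, len(cache))
    return bytes(a ^ b for a, b in zip(cache, mask))
```

Because XOR is its own inverse, `encode_kv(encode_kv(c, k), k) == c`, so the user can round-trip the cache while the cloud only ever sees masked bytes.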

  • LLMOps

    • Intent-Driven Network Management with Multi-Agent LLMs: The Confucius Framework [Paper] [Video]

      • Meta

      • Model network management workflows as DAGs to aid planning.

      • Integrate LLMs with existing management tools to achieve seamless operational integration, employ RAG to improve long-term memory, and establish a set of primitives to systematically support human/model interaction.

      • Integrate with existing network validation methods and incorporate its own validation framework to prevent regressions.

    • Towards LLM-Based Failure Localization in Production-Scale Networks [Paper] [Video]

      • NJU & Alibaba Cloud

      • BiAn (狴犴), an LLM-based framework for efficient incident investigation.

      • Process monitoring data and generate error device rankings with detailed explanations.

Mixture-of-Experts (MoEs)

  • MoE Training

    • MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training [Paper] [Video]

      • HKUST

      • Design and implement a regionally reconfigurable HBD that augments existing electrical interconnects using OCS.

  • MoE Inference

    • MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with Disaggregated Expert Parallelism [Paper] [Video] [arXiv]

      • PKU & ByteDance

      • Attention/FFN Disaggregation (AFD)

      • Provide an M2N communication library → Eliminate unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization.

RDMA

  • Reliability

    • Revisiting RDMA Reliability for Lossy Fabrics [Paper] [Video]

      • HKUST & Huawei

      • Best Student Paper Award (Honorable Mention)

      • DCP co-designs the switch and RNICs, including DCP-Switch and DCP-RNIC.

        • Header-only-based retransmission.

        • Bitmap-free packet tracking.

      • Prototype DCP-Switch using P4 switch and DCP-RNIC using FPGA.

  • Virtualization

    • Software-based Live Migration for RDMA [Paper] [Video]

      • THU & MSRA

      • MigrRDMA: a software-based RDMA live migration.

      • Provide a software indirection layer to achieve transparent switching to new RDMA communications.

      • Implemented over Mellanox RNICs.

    • ByteDance Jakiro: Enabling RDMA and TCP over Virtual Private Cloud [Paper] [Video]

      • ByteDance

      • Support fundamental VPC features (e.g., QoS, security groups) for both RDMA and TCP streams.

    • Alibaba Stellar: A New Generation RDMA Network for Cloud AI [Paper] [Video]

      • Alibaba Cloud

      • Limitations of existing RDMA virtualization solutions (e.g., SR-IOV)

        • Host-level

          • The number of VFs is static and cannot be scaled dynamically.

          • The container must pin all of its memory in the host memory before initiating any RDMA operation → A minute-level start-up delay.

        • PCIe-level

          • LUT in PCIe fabrics is severely limited in size → Only a small number of VFs to enable GDR.

        • RNIC-level

          • No support for strict isolation between RDMA and non-RDMA traffic.

      • Three designs

        • Para-Virtualized Direct Memory Access (PVDMA) for on-demand memory pinning → Reduce host memory consumption & mitigate the start-up delay of secure containers.

        • Extended Memory Translation Table (eMTT) for optimized GDR performance → Allow the RNIC to bypass unnecessary consultations of memory address mappings in the PCIe fabric.

        • RDMA Packet Spray for efficient multi-path utilization

  • Performance Diagnosis

    • Hawkeye: Diagnosing RDMA Network Performance Anomalies with PFC Provenance [Paper] [Video]

      • THU & BUAA & Infrawaves

      • Three designs

        • A PFC-aware telemetry mechanism → Record the PFC impact on flows

        • An in-network PFC causality analysis and tracing mechanism → Collect causal telemetry for diagnosis

        • A provenance-based diagnosis algorithm → Present the anomaly breakdown, identify the anomaly type and root causes

      • Evaluated on both NS-3 simulations and a Tofino testbed.

  • I/O Acceleration

    • CEIO: A Cache-Efficient Network I/O Architecture for NIC-CPU Data Paths [Paper] [Video] [Code]

      • HKUST

      • Limitations of traditional I/O acceleration strategies (e.g., Data Direct I/O (DDIO), RDMA)

        • Inefficient utilization of the LLC.

      • Cache-efficient I/O → Line-rate throughput and µs-scale tail latency

        • Limit I/O Rate → Proactive rate control

        • Limit I/O Capacity → Elastic buffer

      • Implemented on commodity SmartNICs and incorporated into DPDK and RDMA libraries.

Hardware Transport

  • Falcon: A Reliable, Low Latency Hardware Transport [Paper] [Video]

    • Google

    • Support multiple Upper Layer Protocols (ULPs) and heterogeneous application workloads in general-purpose Ethernet datacenter environments (with losses and without special switch support).

    • Key designs: delay-based congestion control with multipath load balancing, a layered design with a simple request-response transaction interface for multi-ULP support, hardware-based retransmissions and error handling for scalability, and a programmable engine for flexibility.

    • Falcon hardware transport layers

Collective Communication

  • ResCCL: Resource-Efficient Scheduling for Collective Communication [Paper] [Video]

    • NEU & SIAT, CAS & Alibaba Cloud

    • Limitations of existing works (e.g., NCCL, RCCL, MSCCL)

      • Static resource allocation and scheduling mechanisms → Inefficient utilization of bandwidth and SM resources for various collective algorithms

    • Three designs

      • Optimize scheduling at the primitive level (e.g., send and recvReduceCopy).

      • Enable flexible thread block allocation.

      • Generate lightweight communication kernels to minimize runtime overhead.

  • SyCCL: Exploiting Symmetry for Efficient Collective Communication Scheduling [Paper] [Video]

    • Alibaba Cloud & THU

    • Limitations of existing works

      • Existing collective communication libraries (e.g., NCCL, RCCL) rely on fixed schedules and cannot adjust to varying topology and model requirements.

      • Existing collective schedule synthesizers (e.g., TECCL, TACCL) utilize Mixed Integer Linear Program for modeling but encounter scalability challenges.

    • SyCCL, a scalable collective schedule synthesizer → Synthesize near-optimal schedules in tens of minutes.

      • Leverage collective and topology symmetries to decompose the original collective communication demand into smaller sub-demands within smaller topology subsets.

      • Propose efficient search strategies to explore potential sub-demands, synthesize corresponding sub-schedules, and integrate these sub-schedules into complete schedules.
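The symmetry-based decomposition can be illustrated with a toy: synthesize a sub-schedule once on a small subgroup, then replicate it across symmetric subgroups by rank offset (our own simplification for intuition, not SyCCL's synthesizer):

```python
def replicate_subschedule(sub_schedule, group_size, num_groups):
    """Replicate a (src, dst, chunk) sub-schedule, synthesized for one
    subgroup of `group_size` ranks, across `num_groups` symmetric
    subgroups by shifting ranks. Exploits topology symmetry so the
    expensive synthesis runs only on the small subgroup."""
    full = []
    for g in range(num_groups):
        base = g * group_size
        for src, dst, chunk in sub_schedule:
            full.append((base + src, base + dst, chunk))
    return full

# A one-step sub-schedule on a 2-rank subgroup, replicated to 2 subgroups:
print(replicate_subschedule([(0, 1, "c0")], group_size=2, num_groups=2))
```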

Video Streaming

  • Towards User-level QoE: Large-scale Practice in Personalized Optimization of Adaptive Video Streaming [Paper] [Video]

    • THU & Kuaishou & SFU

    • LingXi, a system for personalized adaptive video streaming.

      • Dynamically optimize the objectives of adaptive video streaming algorithms by analyzing user engagement.

      • Iteratively determine optimal parameters through Monte Carlo sampling and online Bayesian optimization.
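The sampling-based parameter loop can be sketched with plain random search as a stand-in (LingXi combines Monte Carlo sampling with online Bayesian optimization; this toy, with names of our own invention, only shows the sample-score-keep-best shape):

```python
import random

def monte_carlo_tune(qoe, space, trials=50, seed=0):
    """Sample candidate parameters uniformly from `space`
    ({name: (lo, hi)}), score each with a QoE proxy, keep the best."""
    rng = random.Random(seed)
    best_p, best_q = None, float("-inf")
    for _ in range(trials):
        p = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        q = qoe(p)
        if q > best_q:
            best_p, best_q = p, q
    return best_p, best_q

# Toy QoE proxy peaking at buffer_target = 0.5:
best_p, best_q = monte_carlo_tune(
    lambda p: -(p["buffer_target"] - 0.5) ** 2,
    {"buffer_target": (0.0, 1.0)},
)
```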

  • TLadder: QoE-Centric Video Ladder Optimization with Playback Feedback at Billion Scale [Paper] [Video]

    • ByteDance

    • Jointly consider the video content dimension (i.e., the bitrate-quality tradeoff of candidate representations) and the playback feedback dimension (e.g., network condition, rebuffering time, and playback bitrate).

  • ACE: Sending Burstiness Control for High-Quality Real-time Communication [Paper] [Video]

    • HKUST & ByteDance

    • A dual-control approach that manages both the encoding and transmission burstiness.

      • Sender: dynamically adjust the bucket size of a token-based pacer to control burstiness at frame granularity.

      • Encoder: an adaptive complexity mechanism that smooths frame sizes without sacrificing quality.
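The sender-side mechanism builds on a token-bucket pacer, where the bucket size caps the allowed burst. A minimal sketch (ACE's contribution is adjusting `bucket_size` dynamically per frame; this fixed-size version and its names are ours):

```python
import time

class TokenPacer:
    """Minimal token-bucket pacer: tokens refill at `rate` bytes/sec,
    and `bucket_size` bounds how large a burst may be sent at once."""

    def __init__(self, rate, bucket_size):
        self.rate = rate
        self.bucket_size = bucket_size
        self.tokens = bucket_size  # start with a full bucket
        self.last = time.monotonic()

    def try_send(self, nbytes):
        """Consume tokens for nbytes if available; else defer the send."""
        now = time.monotonic()
        self.tokens = min(self.bucket_size,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

A smaller bucket forces frames to be paced out in smaller pieces, trading a little latency for much lower burstiness on the wire.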

  • Harnessing WebRTC for Large-Scale Live Streaming [Paper] [Video]

    • ByteDance

    • Focus on optimizing first-frame delay, startup video rebuffering, audio-to-video drift, and per-session CPU usage.

  • Scalable Video Conferencing Using SDN Principles [Paper] [Video] [Code]

    • Princeton & UVA

    • Scallop, an SDN-inspired SFU (Selective Forwarding Unit)

      • Decouple the video-conferencing stack into a hardware-based data plane for latency-sensitive, frequent media operations and a software control plane for the (infrequent) remaining tasks (e.g., feedback-signal analysis, session management).

CXL

  • Understanding and Profiling CXL.mem Using PathFinder [Paper] [Video] [Code]

    • UW-Madison & BUAA & Intel

    • Leverage the capabilities of existing PMUs and dissect the CXL.mem protocol at adequate granularities.

    • Key idea: view the server processor and its chipset as a multi-stage Clos network, equip each architectural module with a PMU-based telemetry engine, track different CXL.mem paths, and apply conventional traffic analysis techniques.

    • Perform snapshot-based path-driven profiling and introduce four techniques: path construction, stall cycle breakdown, interference analyzer, and cross-snapshot analysis.

    • Built atop Linux Perf.

Network Failures

  • SkyNet: Analyzing Alert Flooding from Severe Network Failures in Large Cloud Infrastructures [Paper] [Video]

    • Alibaba Cloud

    • Extract scope and severity information from alert floods.

    • Integrate multiple monitoring data sources through a uniform input format.

Acronyms

  • RDMA: Remote Direct Memory Access

  • OCS: Optical Circuit Switching

  • HBD: High-Bandwidth Domain

  • DCN: Datacenter Network

  • ToR: Top-of-Rack

  • VPC: Virtual Private Cloud

  • SR-IOV: Single-Root Input/Output Virtualization

  • GDR: GPUDirect RDMA

  • VF: Virtual Function

  • LUT: Look-Up Table

  • LLC: Last-Level Cache

  • CXL: Compute Express Link

  • PMU: Performance Monitoring Unit

  • WebRTC: Web Real-Time Communications

  • DAG: Directed Acyclic Graph

  • RAG: Retrieval-Augmented Generation
