SIGCOMM 2025
Meta Info
Homepage: https://conferences.sigcomm.org/sigcomm/2025/
Paper List
Acceptance Rate
16% (= 74 / 460, approx.)
Papers
Large Language Models (LLMs)
Infrastructure
InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers [Paper] [Video]
PKU & StepFun & Lightelligence
Key insight: unify connectivity and dynamic switching at the transceiver level using OCS.
Realize the transceiver-centric HBD architecture in production → Flexible construction of arbitrarily large ring topologies & improved system resilience
Silicon Photonics (SiPh) based OCS transceiver (OCSTrx)
Reconfigurable K-hop ring topology → Each node connects to all other nodes within K hops via OCSTrx
HBD-DCN orchestration algorithm → Minimize cross-ToR traffic
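To make the K-hop ring connectivity concrete, here is a minimal sketch that lists each node's direct neighbors in a ring of N nodes. The function name and the reading of "within K hops" as K positions in both ring directions are assumptions for illustration; this is not InfiniteHBD's implementation.

```python
def k_hop_ring_neighbors(num_nodes: int, k: int) -> dict[int, list[int]]:
    """For each node, list the nodes at ring distance 1..k (both directions)."""
    if num_nodes < 1 or k < 0:
        raise ValueError("num_nodes must be >= 1 and k must be >= 0")
    neighbors = {}
    for node in range(num_nodes):
        reach = set()
        for hop in range(1, k + 1):
            reach.add((node + hop) % num_nodes)  # clockwise neighbor
            reach.add((node - hop) % num_nodes)  # counter-clockwise neighbor
        reach.discard(node)  # small rings can wrap around onto the node itself
        neighbors[node] = sorted(reach)
    return neighbors


if __name__ == "__main__":
    # 6 nodes, K = 2: node 0 links to nodes 1 and 2 (forward), 5 and 4 (backward).
    print(k_hop_ring_neighbors(6, 2)[0])
```

With K > 1, a node keeps a direct link past its immediate neighbor, which is one way to see why such a ring can route around a faulty node.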
LLM Training
DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models [Paper] [Video] [arXiv]
PKU & StepFun
Disaggregated model orchestration: separate the training of the modality encoder (ViT for images, BEATs for audio), the LLM backbone, and the modality generator (Diffusion for images, AudioLDM for audio).
Disaggregated data preprocessing: decouple data preprocessing from training.
Integrated with Megatron-LM.
ByteScale: Communication-Efficient Scaling of LLM Training with a 2048K Context Length on 16384 GPUs [Paper] [Video]
PKU & ByteDance
Limitations of existing works
The mismatch between data heterogeneity & static mesh → Redundant communication and imbalanced computation.
HDP: Hybrid Data Parallelism
Unify the inter- and intra-data partitioning with a dynamic mesh design.
A communication optimizer
Eliminate the redundant communication for short sequences by data-aware sharding and dynamic communication.
Compress the communication cost for long sequences by selective offloading.
A balance scheduler → Mitigate the imbalanced computation by parallelism-aware data assignment.
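The imbalance problem can be illustrated with a generic greedy heuristic: sort sequences by estimated cost and always hand the next one to the least-loaded rank. This is a sketch of the general idea only, not ByteScale's parallelism-aware scheduler; the function name and the quadratic cost model (a stand-in for attention cost) are assumptions.

```python
import heapq
from typing import Callable


def assign_sequences(
    seq_lens: list[int],
    num_ranks: int,
    cost: Callable[[int], int] = lambda n: n * n,  # assumed cost model
) -> list[list[int]]:
    """Greedy longest-first assignment of sequences to the least-loaded rank."""
    if num_ranks < 1:
        raise ValueError("num_ranks must be >= 1")
    # Min-heap of (accumulated cost, rank id).
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment: list[list[int]] = [[] for _ in range(num_ranks)]
    for length in sorted(seq_lens, reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(length)
        heapq.heappush(heap, (load + cost(length), rank))
    return assignment


if __name__ == "__main__":
    # One long sequence dominates; the short ones are packed onto the other rank.
    print(assign_sequences([8, 4, 4, 2, 2], num_ranks=2))
```

Assigning by token count alone would split this batch 10/10, yet under the assumed quadratic cost the two ranks would then do very different amounts of work.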
From ATOP to ZCube: Automated Topology Optimization Pipeline and A Highly Cost-Effective Network Topology for Large Model Training [Paper] [Video]
THU & Zhongguancun Laboratory & Harnets.AI & ByteDance
ATOP: Automated Topology Optimization Pipeline
Model network topology as a set of hyperparameters → Enable the discovery of potential network topologies.
A new topology, ZCube, discovered by ATOP.
Reach the highest cost-effectiveness across various GPU scales.
Privacy-preserving LLM Inference
LLMOps
Intent-Driven Network Management with Multi-Agent LLMs: The Confucius Framework [Paper] [Video]
Meta
Model network management workflows as DAGs to aid planning.
Integrate LLMs with existing management tools to achieve seamless operational integration, employ RAG to improve long-term memory, and establish a set of primitives to systematically support human/model interaction.
Integrate with existing network validation methods and incorporate its own validation framework to prevent regressions.
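As a small illustration of modeling a management workflow as a DAG, the sketch below orders steps so that every step runs after its dependencies. The step names are hypothetical and not taken from Confucius; only the standard-library `graphlib` module is used.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
WORKFLOW = {
    "generate_config": {"parse_intent", "fetch_topology"},
    "validate_config": {"generate_config"},
    "deploy": {"validate_config"},
    "verify_deployment": {"deploy"},
}


def plan_steps(workflow: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(workflow).static_order())


if __name__ == "__main__":
    print(plan_steps(WORKFLOW))
```

Because the graph is acyclic, a planner can also see which steps are independent (here `parse_intent` and `fetch_topology`) and could run in parallel.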
Mixture-of-Experts (MoEs)
MoE Inference
MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with Disaggregated Expert Parallelism [Paper] [Video] [arXiv]
PKU & ByteDance
Attention/FFN Disaggregation (AFD)
Provide an M2N communication library → Eliminate unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization.
RDMA
Reliability
Revisiting RDMA Reliability for Lossy Fabrics [Paper] [Video]
HKUST & Huawei
Best Student Paper Award (Honorable Mention)
DCP co-designs the switch and RNICs, including DCP-Switch and DCP-RNIC.
Header-only-based retransmission.
Bitmap-free packet tracking.
Prototype DCP-Switch using P4 switch and DCP-RNIC using FPGA.
Virtualization
Alibaba Stellar: A New Generation RDMA Network for Cloud AI [Paper] [Video]
Alibaba Cloud
Limitations of existing RDMA virtualization solutions (e.g., SR-IOV)
Host-level
The number of VFs is static → Cannot dynamically scale the number of VFs.
The container must pin all of its memory in the host memory before initiating any RDMA operation → A minute-level start-up delay.
PCIe-level
The LUT in PCIe fabrics is severely limited in size → Only a small number of VFs can enable GDR.
RNIC-level
No support for strict isolation between RDMA and non-RDMA traffic.
Three designs
Para-Virtualized Direct Memory Access (PVDMA) for on-demand memory pinning → Reduce host memory consumption & mitigate the start-up delay of secure containers.
Extended Memory Translation Table (eMTT) for optimized GDR performance → Allow the RNIC to bypass unnecessary consultations of memory address mappings in the PCIe fabric.
RDMA Packet Spray for efficient multi-path utilization
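For intuition on packet spraying, the sketch below spreads the packets of a single flow across several paths instead of pinning the flow to one path. Round-robin selection and the function name are assumptions for illustration; these notes do not describe how Stellar actually picks paths.

```python
from itertools import cycle
from typing import Iterable


def spray_packets(packet_ids: Iterable[int], num_paths: int) -> list[tuple[int, int]]:
    """Pair each packet id with a path id, cycling through the paths in order."""
    if num_paths < 1:
        raise ValueError("num_paths must be >= 1")
    paths = cycle(range(num_paths))
    return [(pkt, next(paths)) for pkt in packet_ids]


if __name__ == "__main__":
    # Five packets of one flow over two paths.
    print(spray_packets(range(5), num_paths=2))
```

Since consecutive packets take different paths, they may arrive out of order, so the receiving side has to tolerate reordering.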
Performance Diagnosis
Hawkeye: Diagnosing RDMA Network Performance Anomalies with PFC Provenance [Paper] [Video]
THU & BUAA & Infrawaves
Three designs
A PFC-aware telemetry mechanism → Record the PFC impact on flows
An in-network PFC causality analysis and tracing mechanism → Collect causal telemetry for diagnosis
A provenance-based diagnosis algorithm → Present the anomaly breakdown, identify the anomaly type and root causes
Evaluated on both NS-3 simulations and a Tofino testbed.
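The idea of tracing PFC causality can be pictured as walking "was paused by" edges upstream from an affected port. The sketch below is a heavily simplified illustration under that assumption, with hypothetical port names and a hypothetical function; it is not Hawkeye's in-network mechanism or its diagnosis algorithm.

```python
def trace_pause_chain(paused_by: dict[str, str], start: str) -> tuple[list[str], bool]:
    """Follow 'X was paused by Y' edges upstream from `start`.

    Returns the visited chain and whether the chain closed into a cycle.
    """
    chain = [start]
    seen = {start}
    current = start
    while current in paused_by:
        current = paused_by[current]
        chain.append(current)
        if current in seen:
            return chain, True  # cyclic pause dependency
        seen.add(current)
    return chain, False


if __name__ == "__main__":
    # Hypothetical telemetry: sw1:p3 was paused by sw2:p1, which was paused by sw3:p2.
    edges = {"sw1:p3": "sw2:p1", "sw2:p1": "sw3:p2"}
    print(trace_pause_chain(edges, "sw1:p3"))
```

The last element of an acyclic chain is a candidate origin of the pause propagation, while a cycle indicates a circular pause dependency.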
I/O Acceleration
CEIO: A Cache-Efficient Network I/O Architecture for NIC-CPU Data Paths [Paper] [Video] [Code]
HKUST
Limitations of traditional I/O acceleration strategies (e.g., Data Direct I/O (DDIO), RDMA)
Inefficient utilization of the LLC.
Cache-efficient I/O → Line-rate throughput and µs-scale tail latency
Limit I/O Rate → Proactive rate control
Limit I/O Capacity → Elastic buffer
Implemented on commodity SmartNICs and incorporated into DPDK and RDMA libraries.
Hardware Transport
Falcon: A Reliable, Low Latency Hardware Transport [Paper] [Video]
Google
Support multiple Upper Layer Protocols (ULPs) and heterogeneous application workloads in general-purpose Ethernet datacenter environments (with losses and without special switch support).
Key designs: delay-based congestion control with multipath load balancing, a layered design with a simple request-response transaction interface for multi-ULP support, hardware-based retransmissions and error-handling for scalability, a programmable engine for flexibility.
Falcon hardware transport layers
Collective Communication
ResCCL: Resource-Efficient Scheduling for Collective Communication [Paper] [Video]
NEU & SIAT, CAS & Alibaba Cloud
Limitations of existing works (e.g., NCCL, RCCL, MSCCL)
Static resource allocation and scheduling mechanisms → Inefficient utilization of bandwidth and SM resources for various collective algorithms
Three designs
Optimize scheduling at the primitive level (e.g., send and recvReduceCopy).
Enable flexible thread block allocation.
Generate lightweight communication kernels to minimize runtime overhead.
SyCCL: Exploiting Symmetry for Efficient Collective Communication Scheduling [Paper] [Video]
Alibaba Cloud & THU
Limitations of existing works
Existing collective communication libraries (e.g., NCCL, RCCL) rely on fixed schedules and cannot adjust to varying topology and model requirements.
Existing collective schedule synthesizers (e.g., TECCL, TACCL) model the problem with Mixed Integer Linear Programming (MILP) but encounter scalability challenges.
SyCCL, a scalable collective schedule synthesizer → Synthesize near-optimal schedules in tens of minutes.
Leverage collective and topology symmetries to decompose the original collective communication demand into smaller sub-demands within smaller topology subsets.
Propose efficient search strategies to explore potential sub-demands, synthesize the corresponding sub-schedules, and integrate these sub-schedules into complete schedules.
Video Streaming
Towards User-level QoE: Large-scale Practice in Personalized Optimization of Adaptive Video Streaming [Paper] [Video]
THU & Kuaishou & SFU
LingXi, a system for personalized adaptive video streaming.
Dynamically optimize the objectives of adaptive video streaming algorithms by analyzing user engagement.
Iteratively determine optimal parameters through Monte Carlo sampling and online Bayesian optimization.
TLadder: QoE-Centric Video Ladder Optimization with Playback Feedback at Billion Scale [Paper] [Video]
ByteDance
Jointly consider the video content dimension (i.e., the bitrate-quality tradeoff of candidate representations) and the playback feedback dimension (e.g., network condition, rebuffering time, and playback bitrate).
ACE: Sending Burstiness Control for High-Quality Real-time Communication [Paper] [Video]
HKUST & ByteDance
A dual-control approach that manages both the encoding and transmission burstiness.
Sender: dynamically adjust the bucket size of a token-based pacer to control burstiness at frame-level granularity.
Encoder: an adaptive complexity mechanism that smooths frame sizes without sacrificing quality.
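The sender-side control can be pictured with a plain token-bucket pacer whose bucket size is adjustable at runtime: a larger bucket permits bigger bursts, a smaller one forces smoother sending. This is a minimal sketch under that assumption; the class and method names are hypothetical, and ACE's policy for choosing the bucket size per frame is not reproduced here.

```python
class TokenBucketPacer:
    """Token bucket measured in bytes; the caller supplies timestamps in seconds."""

    def __init__(self, rate_bps: float, bucket_bytes: float) -> None:
        self.rate = rate_bps / 8  # refill rate in bytes per second
        self.bucket_bytes = bucket_bytes
        self.tokens = bucket_bytes  # start with a full bucket
        self.last = 0.0

    def set_bucket_size(self, bucket_bytes: float) -> None:
        """Change the maximum burst size; excess tokens are discarded."""
        self.bucket_bytes = bucket_bytes
        self.tokens = min(self.tokens, bucket_bytes)

    def try_send(self, size: int, now: float) -> bool:
        """Refill for the elapsed time, then send if enough tokens remain."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.bucket_bytes, self.tokens + elapsed * self.rate)
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


if __name__ == "__main__":
    pacer = TokenBucketPacer(rate_bps=8000, bucket_bytes=3000)  # 1000 bytes/s
    # Two 1500-byte packets fit in the initial burst; the third must wait.
    print([pacer.try_send(1500, now=0.0) for _ in range(3)])
```

Shrinking the bucket with `set_bucket_size` before a large frame would spread that frame's packets over more time, at the cost of extra pacing delay.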
Scalable Video Conferencing Using SDN Principles [Paper] [Video] [Code]
Princeton & UVA
Scallop, an SDN-inspired SFU (Selective Forwarding Unit)
Decouple video-conferencing applications into two parts:
A hardware-based data plane for latency-sensitive and frequent media operations.
A software control plane for the remaining (infrequent) tasks (e.g., analyzing feedback signals, session management).
CXL
Understanding and Profiling CXL.mem Using PathFinder [Paper] [Video] [Code]
UW-Madison & BUAA & Intel
Leverage the capabilities of existing PMUs and dissect the CXL.mem protocol at adequate granularities.
Key idea: view the server processor and its chipset as a multi-stage Clos network, equip each architectural module with a PMU-based telemetry engine, track different CXL.mem paths, and apply conventional traffic analysis techniques.
Perform snapshot-based path-driven profiling and introduce four techniques: path construction, stall cycle breakdown, interference analyzer, and cross-snapshot analysis.
Built atop Linux Perf.
Network Failures
Acronyms
RDMA: Remote Direct Memory Access
OCS: Optical Circuit Switching
HBD: High-Bandwidth Domain
DCN: Datacenter Network
ToR: Top-of-Rack
VPC: Virtual Private Cloud
SR-IOV: Single-Root Input/Output Virtualization
GDR: GPUDirect RDMA
VF: Virtual Function
LUT: Look-Up Table
LLC: Last-Level Cache
CXL: Compute Express Link
PMU: Performance Monitoring Unit
WebRTC: Web Real-Time Communications
DAG: Directed Acyclic Graph
RAG: Retrieval-Augmented Generation