
# CAIS 2026

## Meta Info

Homepage: <https://www.caisconf.org>

Paper list: <https://www.caisconf.org/program/2026/papers/>

Proceedings: <https://dl.acm.org/doi/proceedings/10.1145/3786335>

## Papers

### LLM Inference

* XGrammar-2: Dynamic and Efficient Structured Generation Engine for Agentic LLMs \[[Paper](https://dl.acm.org/doi/10.1145/3786335.3813124)]
  * SJTU & CMU
  * Extend XGrammar with dynamic grammar support for efficient structured output generation in agentic LLM workloads (e.g., tool calling with runtime-defined schemas).
* Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators \[[Paper](https://dl.acm.org/doi/10.1145/3786335.3813146)]
  * Stanford
  * Replace the KV cache with spectral Koopman operator estimation to achieve constant-memory associative recall.
* Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints \[[Paper](https://dl.acm.org/doi/10.1145/3786335.3813134)]
  * LinkedIn & MIT
  * Optimize LLM query routing at the batch level under joint cost and capacity constraints for multi-model serving.
* Understanding and Improving Communication Performance in Multi-node LLM Inference \[[Paper](https://dl.acm.org/doi/10.1145/3786335.3813165)]
  * UMD & LLNL
  * Characterize and optimize inter-node communication bottlenecks in multi-node LLM inference deployments.
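
To make the batch-level routing entry above more concrete, here is a minimal greedy sketch of routing a batch under a cost budget and per-model capacity limits. It is not the paper's algorithm. The function `route_batch`, the model names, and the predicted quality scores are all hypothetical and chosen only for illustration.

```python
def route_batch(quality, cost, capacity, budget):
    """Greedy batch routing sketch (illustrative, not the paper's method).

    quality[q][m]: assumed predicted quality of model m on query q
    cost[m]:       per-query cost of model m
    capacity[m]:   max queries model m may receive in this batch
    budget:        total cost allowed for the whole batch
    Returns one model name per query.
    """
    models = sorted(cost, key=cost.get)  # cheapest first
    remaining = dict(capacity)
    assignment, spent = [], 0.0

    # Step 1: feasible baseline, cheapest model that still has capacity.
    for _ in quality:
        m = next((m for m in models if remaining[m] > 0), None)
        if m is None:
            raise ValueError("batch exceeds total capacity")
        remaining[m] -= 1
        spent += cost[m]
        assignment.append(m)
    if spent > budget:
        raise ValueError("budget too small even for the cheapest routing")

    # Step 2: apply the upgrade with the best quality gain per extra cost
    # until no upgrade fits the remaining budget and capacity.
    while True:
        best = None
        for q, cur in enumerate(assignment):
            for m in models:
                gain = quality[q][m] - quality[q][cur]
                extra = cost[m] - cost[cur]
                if gain <= 0 or remaining[m] == 0 or spent + extra > budget:
                    continue
                ratio = gain / extra if extra > 0 else float("inf")
                if best is None or ratio > best[0]:
                    best = (ratio, q, m, extra)
        if best is None:
            return assignment
        _, q, m, extra = best
        remaining[assignment[q]] += 1  # free the slot on the old model
        remaining[m] -= 1
        spent += extra
        assignment[q] = m
```

The point of deciding at the batch level is that the single slot on the expensive model goes to the query that benefits most from it, which a per-query router cannot see. A greedy pass like this is only a heuristic; the joint problem is a constrained assignment problem that could also be handed to an integer programming solver.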

### Diffusion Model Inference

* SwiftFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs \[[Paper](https://dl.acm.org/doi/10.1145/3786335.3813174)]
  * UofT & Amazon & NVIDIA & AWS
  * Introduce scalable sequence parallelism for distributed inference of Diffusion Transformers (DiTs) across multiple GPUs.
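
As background for the entry above, the following is a minimal sketch of the general idea behind sequence parallelism for attention, with workers simulated as list shards in plain Python. It assumes the simplest variant, where every worker gathers all keys and values and attends only from its local queries. This is not SwiftFusion's scheme, and the helper names (`attention`, `shard`, `sequence_parallel_attention`) are hypothetical.

```python
import math

def attention(q_rows, k_rows, v_rows):
    """Plain softmax attention: one output row per query row."""
    d = len(q_rows[0])
    out = []
    for q in q_rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_rows]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j] for w, v in zip(weights, v_rows)) / total
            for j in range(len(v_rows[0]))
        ])
    return out

def shard(rows, world_size):
    """Split the sequence dimension into contiguous, near-equal chunks."""
    step = math.ceil(len(rows) / world_size)
    return [rows[i:i + step] for i in range(0, len(rows), step)]

def sequence_parallel_attention(q, k, v, world_size):
    """Simulate attention with the token sequence sharded across workers."""
    q_shards = shard(q, world_size)
    k_shards = shard(k, world_size)
    v_shards = shard(v, world_size)
    # Simulated all-gather: every worker ends up with all keys and values.
    k_full = [row for s in k_shards for row in s]
    v_full = [row for s in v_shards for row in s]
    # Each worker attends from its local queries only; outputs are
    # concatenated in rank order to rebuild the full sequence.
    out = []
    for local_q in q_shards:
        out.extend(attention(local_q, k_full, v_full))
    return out
```

Because each query row only needs the full set of keys and values, the sharded result matches unsharded attention while each worker computes only its own slice of the sequence. On real GPUs the gather step is the communication cost that sequence-parallel designs try to reduce or overlap with compute.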

### LLM Optimization

* Scaling Textual Gradients via Sampling-Based Momentum \[[Paper](https://dl.acm.org/doi/10.1145/3786335.3813168)]
  * UChicago & UT Austin & Santa Clara & Princeton & MSR & SylphAI
  * Scale textual gradient optimization (TextGrad) via sampling-based momentum for improved convergence.
* optimize\_anything: Unified Text Optimization can Outperform Specialized Systems \[[Paper](https://dl.acm.org/doi/10.1145/3786335.3813167)] \[[Code](https://github.com/gepa-ai/gepa)]
  * UC Berkeley & MIT
  * Propose a unified text optimization framework that handles prompt optimization, agent workflow design, and DSPy-style program synthesis with a single search procedure.

## Acronyms

* DiT: Diffusion Transformer
* KV: Key-Value
* LLM: Large Language Model
* LoRA: Low-Rank Adaptation