# CAIS 2026

## Meta Info

Homepage: <https://www.caisconf.org>

Paper list: <https://www.caisconf.org/program/2026/papers/>

Proceedings: <https://dl.acm.org/doi/proceedings/10.1145/3786335>

## Papers

### LLM Inference

* XGrammar-2: Dynamic and Efficient Structured Generation Engine for Agentic LLMs \[[Paper](https://dl.acm.org/doi/10.1145/3786335.3813124)]
  * SJTU & CMU
  * Extend XGrammar with dynamic grammar support for efficient structured output generation in agentic LLM workloads (e.g., tool calling with runtime-defined schemas).
* Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators \[[Paper](https://dl.acm.org/doi/10.1145/3786335.3813146)]
  * Stanford
  * Replace KV cache with spectral Koopman operator estimation for constant-memory associative recall.
* Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints \[[Paper](https://dl.acm.org/doi/10.1145/3786335.3813134)]
  * LinkedIn & MIT
  * Optimize LLM query routing at the batch level under joint cost and capacity constraints for multi-model serving.
* Understanding and Improving Communication Performance in Multi-node LLM Inference \[[Paper](https://dl.acm.org/doi/10.1145/3786335.3813165)]
  * UMD & LLNL
  * Characterize and optimize inter-node communication bottlenecks in multi-node LLM inference deployments.

### Diffusion Model Inference

* SwiftFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs \[[Paper](https://dl.acm.org/doi/10.1145/3786335.3813174)]
  * UofT & Amazon & NVIDIA & AWS
  * Introduce scalable sequence parallelism for distributed inference of Diffusion Transformers (DiTs) across multiple GPUs.

### LLM Optimization

* Scaling Textual Gradients via Sampling-Based Momentum \[[Paper](https://dl.acm.org/doi/10.1145/3786335.3813168)]
  * UChicago & UT Austin & Santa Clara & Princeton & MSR & SylphAI
  * Scale textual gradient optimization (TextGrad) via sampling-based momentum for improved convergence.
* optimize\_anything: Unified Text Optimization can Outperform Specialized Systems \[[Paper](https://dl.acm.org/doi/10.1145/3786335.3813167)] \[[Code](https://github.com/gepa-ai/gepa)]
  * UC Berkeley & MIT
  * Unify prompt optimization, agent workflow design, and DSPy-style program synthesis under a single text optimization search procedure.

## Acronyms

* DiT: Diffusion Transformer
* KV: Key-Value
* LLM: Large Language Model
* LoRA: Low-Rank Adaptation

