Efficiently programming large language models using SGLang
LLM Inference
Last updated
Was this helpful?
LLM Inference
Last updated
Was this helpful?
Presented in . [ ]
Authors: Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng (UCB, Stanford)
A single generation call = The process of taking a sequence of input tokens and generating a sequence of output tokens
Programming LLMs = The development of computer programs that control the generation processes of LLMs
The paradigm of programming LLMs changes
Single and straightforward generation calls β Multiple and interdependent generation calls interleaved with control flow
User cases
Prompting techniques
Basic: few-shot prompting, chain-of-thought prompting, retrieval-augmented generation
Meta-reasoning
LLM initially generates multiple solutions through chain-of-thought prompting;
It then performs meta-reasoning across these solutions to derive the final resolution.
Skeleton-of-thought
LLM first generates a skeleton of the answer;
It then issues multiple parallel calls to elaborate the skeleton into a detailed explanation.
Tree-of-thought
Integrate LLM calls within tree search algorithms (e.g., BFS).
LLM-based agents
The behavior of the agent is controlled by loops and conditional statements
How to manage the multiple API calls of LLMs?
A good system should include LLM state management, decoding process control, and smooth integration with external tools.
Co-design both the front-end language (programming interface) and the back-end runtime
SGLang
An embedded domain-specific language in Python
Interpreter
Manage the prompt state and execute SGLang programs
Treat a prompt as a stream and submit primitive operations to this stream for asynchronous execution
Compiler
Compiled as a computational graph and enable further optimizations
SGVM
A custom serving engine/runtime
RadixAttention
SGLang Primitives
Enable the manipulation of prompts and generations
gen
: call LLM generation
select
: let the LLM choose the option with the highest probability from a list
extend
or +=
: extend the current prompt
Control of parallelism
fork
: fork the current prompt state
join
: rejoin the forked prompt states
Compilation optimizations
Code movement for improving prefix sharing
Doesn't strictly preserve the original computation ββ aggressive
Prompt GPT-4 to re-order graph nodes
Runtime
RadixAttention
Utilize a radix tree (w/ efficient prefix search, reuse, insertion, eviction)
Least Recently Used (LRU) eviction policy
Cache-aware scheduling β Increase the cache hit rate
Key idea: Sort the requests by matched prefix length
Newly βextendβ CUDA kernel
Approximately 9k lines of Python code
SGVM reuses some components from LightLLM