NeurIPS 2024
Homepage:
Paper list:
Total: 15671
Accept: 25.8% (4037)
Poster: 23.3% (3650)
Spotlight: 2.1% (326)
Oral: 0.4% (61)
LLM Inference
SGLang: Efficient Execution of Structured Language Model Programs [] [] []
Stanford & UC Berkeley
Co-design both the front-end language (programming interface) and the back-end runtime
SGLang Primitives
Enable the manipulation of prompts and generations (a usage sketch follows this list)
gen: call LLM generation
select: let the LLM choose the option with the highest probability from a list
extend or +=: extend the current prompt
Control of parallelism
fork: fork the current prompt state
join: rejoin the forked prompt states
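A minimal usage sketch of these primitives in the Python front-end (call shapes paraphrased from the paper; the released library may differ, e.g. `select` is expressed here through `gen`'s `choices` argument):

```python
import sglang as sgl

@sgl.function
def judge_essay(s, essay):
    # extend / +=: append text to the current prompt state
    s += "Here is an essay:\n" + essay + "\n"
    # fork: branch the prompt state to judge two aspects in parallel
    forks = s.fork(2)
    for f, aspect in zip(forks, ["clarity", "originality"]):
        f += f"Rate the {aspect} of the essay. Verdict: "
        # select: let the LLM pick the highest-probability option
        f += sgl.gen("verdict", choices=["good", "bad"])
        # gen: free-form LLM generation
        f += "\nReason: " + sgl.gen("reason", max_tokens=64)
    # join: fold the forked states back into the parent prompt
    s += "Verdicts: " + ", ".join(f["verdict"] for f in forks)
```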
Compilation optimizations
Code movement for improving prefix sharing
Doesn't strictly preserve the original computation (an aggressive optimization)
Prompt GPT-4 to re-order graph nodes
Runtime
RadixAttention
Utilize a radix tree (w/ efficient prefix search, reuse, insertion, eviction)
LRU eviction policy
Cache-aware scheduling → Increase the cache hit rate
Key idea: Sort the requests by matched prefix length
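A toy sketch of the cache-aware scheduling idea (a plain trie stands in for the radix tree; all data structures here are illustrative, not SGLang's actual implementation):

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # token -> TrieNode

def insert(root, tokens):
    """Record a cached prompt prefix in the trie."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())

def matched_prefix_len(root, tokens):
    """Length of the longest cached prefix shared with `tokens`."""
    node, n = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node, n = node.children[t], n + 1
    return n

def schedule(root, requests):
    """Cache-aware scheduling: serve requests with the longest matched
    prefix first to raise the KV-cache hit rate."""
    return sorted(requests, key=lambda r: matched_prefix_len(root, r), reverse=True)

root = TrieNode()
insert(root, [1, 2, 3, 4])                    # a previously cached prompt
print(schedule(root, [[9, 9], [1, 2, 3, 7], [1, 2]]))
# -> [[1, 2, 3, 7], [1, 2], [9, 9]]
```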
Efficient LLM Scheduling by Learning to Rank [] []
UCSD & THU & Snowflake & UC-Berkeley
Insight: it is possible to predict the relative ranks of output lengths in a batch of requests.
Develop a scheduler for LLM inference that can approximate the shortest-job-first (SJF) schedule better than existing approaches
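The scheduling step then reduces to a sort over predicted rank scores; a sketch with a hypothetical predictor standing in for the paper's learned ranking model:

```python
def schedule_by_predicted_rank(requests, rank_score):
    """Approximate shortest-job-first: `rank_score(req)` is a learned score
    whose ordering tracks relative output lengths (exact lengths are not
    needed, only their ranks)."""
    return sorted(requests, key=rank_score)

# Hypothetical scores from a small ranking model (higher = longer output).
scores = {"say hi": 0.1, "summarize this paragraph": 0.5, "write a novel": 0.9}
print(schedule_by_predicted_rank(list(scores), scores.get))
# -> ['say hi', 'summarize this paragraph', 'write a novel']
```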
Compound AI Systems
Are More LM Calls All You Need? Towards the Scaling Properties of Compound AI Systems [] []
Stanford & UC Berkeley & Princeton
Systematically study how the number of LM calls affects the performance of two natural inference strategies:
Vote: Aggregate LM responses via majority voting
Filter-Vote: Majority voting after filtering results with an LM
Insight
More LM calls lead to higher performance on “easy” queries, but lower performance on “hard” queries, and nonmonotone behavior can emerge when a task contains both types of queries.
Propose an analytical scaling model that predicts the performance of Vote and Filter-Vote systems and finds the optimal number of LM calls to make.
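The two strategies in sketch form (`call_lm` and `passes_filter` are hypothetical stand-ins for the LM call and the LM-based filter):

```python
from collections import Counter

def vote(call_lm, query, n):
    """Vote: issue n LM calls and return the majority answer."""
    answers = [call_lm(query) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def filter_vote(call_lm, passes_filter, query, n):
    """Filter-Vote: majority voting over the answers an LM filter accepts."""
    kept = [a for a in (call_lm(query) for _ in range(n)) if passes_filter(query, a)]
    return Counter(kept).most_common(1)[0][0] if kept else None
```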
Adapter Selection
Stylus: Automatic Adapter Selection for Diffusion Models [] [] []
UC Berkeley & CMU & Google DeepMind
Problem: how to match the prompt to a set of relevant adapters
Stylus
Select and automatically compose task-specific adapters based on a prompt's keywords
Three-stage approach (a retrieval sketch follows this list)
Refiner: Leverage vision-language foundation models (VLMs) to generate semantic descriptions of adapters, then translate them into embeddings
Retriever: Fetch the most relevant adapters over the entirety of the user's prompt using cosine similarity
Composer: Segment the prompt into tasks based on its keywords and assign retrieved adapters to tasks
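A sketch of the Retriever stage under these assumptions (names hypothetical): rank adapters by cosine similarity between the prompt embedding and the pre-computed adapter embeddings.

```python
import numpy as np

def retrieve_adapters(prompt_emb, adapter_embs, k=5):
    """Return indices of the k adapters whose description embeddings are
    most cosine-similar to the prompt embedding."""
    p = prompt_emb / np.linalg.norm(prompt_emb)
    A = adapter_embs / np.linalg.norm(adapter_embs, axis=1, keepdims=True)
    return np.argsort(-(A @ p))[:k]
```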
StylusDocs
An adapter dataset consisting of 75K LoRAs (sourced from Civitai) with pre-computed adapter embeddings
Inference
Reverse Transition Kernel: A Flexible Framework to Accelerate Diffusion Inference []
HKUST & HKU & Salesforce AI Research & UIUC
Develop a general RTK (reverse transition kernel) framework that enables a more balanced subproblem decomposition
Propose leveraging two fast sampling algorithms, the Metropolis-Adjusted Langevin Algorithm (MALA) and Underdamped Langevin Dynamics (ULD), for solving these strongly log-concave subproblems
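For reference, one MALA step on a target π(x) ∝ exp(-f(x)) in its generic textbook form (not the paper's specific instantiation):

```python
import numpy as np

def mala_step(x, log_pi, grad_log_pi, step, rng):
    """One Metropolis-Adjusted Langevin step: Langevin proposal plus a
    Metropolis accept/reject correction."""
    mean_x = x + step * grad_log_pi(x)                      # drift at x
    y = mean_x + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    mean_y = y + step * grad_log_pi(y)                      # drift at y
    log_q_fwd = -np.sum((y - mean_x) ** 2) / (4 * step)     # log q(y|x)
    log_q_bwd = -np.sum((x - mean_y) ** 2) / (4 * step)     # log q(x|y)
    log_alpha = log_pi(y) - log_pi(x) + log_q_bwd - log_q_fwd
    return y if np.log(rng.uniform()) < log_alpha else x
```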
Accelerating Diffusion Models with Parallel Sampling: Inference at Sub-Linear Time Complexity []
Stanford
Propose to divide the sampling process into blocks with parallelizable Picard iterations within each block
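A toy illustration of why Picard iteration parallelizes (simplified to a plain ODE block x'(t) = f(x, t); the paper applies the idea to diffusion sampling):

```python
import numpy as np

def picard_block(f, x0, ts, iters=10):
    """Refine all time points of a block at once:
    x_{k+1}(t_i) = x0 + sum_{j<=i} f(x_k(t_j), t_j) * dt_j.
    Each drift depends only on the previous iterate x_k, so the
    evaluations across the block can run in parallel."""
    xs = np.repeat(x0[None, :], len(ts), axis=0)              # initial guess
    dts = np.diff(ts, prepend=ts[0])
    for _ in range(iters):
        drifts = np.array([f(x, t) for x, t in zip(xs, ts)])  # parallelizable
        xs = x0 + np.cumsum(drifts * dts[:, None], axis=0)
    return xs
```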
Talking Face Video Generation
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time [] []
MSRA
A framework to generate lifelike talking faces with appealing visual affective skills (VAS).
A diffusion-based holistic facial dynamics and head movement generation model that works in a face latent space.
Support the online generation of 512×512 videos at up to 40 FPS.
Facial Parts Swapping
FuseAnyPart: Diffusion-Driven Facial Parts Swapping via Multiple Reference Images [] []
Alibaba
Facial parts from different people are assembled into a complete face in latent space within the Mask-based Fusion Module
The consolidated feature is dispatched to the Addition-based Injection Module for fusion within the UNet of the diffusion model to create novel characters
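A schematic of what mask-based latent fusion could look like (heavily simplified; the actual module's details are in the paper):

```python
import torch

def mask_based_fusion(part_feats, part_masks):
    """Assemble facial-part features into one face feature in latent space:
    each reference face contributes only inside its part mask."""
    fused = torch.zeros_like(part_feats[0])
    for feat, mask in zip(part_feats, part_masks):
        fused = fused + mask * feat   # masks assumed non-overlapping
    return fused
```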
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction [] [] []
PKU & ByteDance
Best Paper Award
VAR: Visual Autoregressive Modeling
Redefine the autoregressive learning on images as coarse-to-fine “next-scale prediction” or “next-resolution prediction”
Multi-scale token maps are autoregressively generated from coarse to fine scales (lower to higher resolutions), with parallel token generation within each scale
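A schematic of the next-scale prediction loop (`model.predict_scale` is a hypothetical placeholder for the transformer forward pass):

```python
def var_generate(model, scales=(1, 2, 4, 8, 16)):
    """Coarse-to-fine generation: each step predicts the full token map at
    the next resolution conditioned on all coarser maps; tokens *within*
    a scale come out in parallel, so one forward pass per scale."""
    token_maps = []
    for side in scales:
        next_map = model.predict_scale(context=token_maps, resolution=side)
        token_maps.append(next_map)
    return token_maps  # decoded to pixels by a multi-scale VQ decoder
```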
Autoregressive Image Generation without Vector Quantization [] [] []
MIT & Google DeepMind & THU
Propose to model the per-token probability distribution using a diffusion procedure
Define a Diffusion Loss function to model the per-token probability
Evaluated across a wide range of cases, including standard autoregressive models and generalized masked autoregressive (MAR) variants
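A simplified PyTorch sketch of the Diffusion Loss (toy noise schedule; `denoiser` is assumed to be a small per-token network conditioned on the AR backbone's output z):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(denoiser, x, z, T=1000):
    """Per-token diffusion loss: instead of a categorical loss over a VQ
    codebook, noise the continuous token x and train a denoiser
    (conditioned on z) to predict the added noise."""
    t = torch.randint(0, T, (x.shape[0],), device=x.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t / T) ** 2      # toy schedule
    noise = torch.randn_like(x)
    x_t = alpha_bar.sqrt()[:, None] * x + (1 - alpha_bar).sqrt()[:, None] * noise
    return F.mse_loss(denoiser(x_t, t, z), noise)
```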
Inference
Fast and Memory-Efficient Video Diffusion Using Streamlined Inference [] []
NEU
Streamlined Inference: Leverage the temporal and spatial properties of video diffusion models
Three core components
Feature Slicer: Partition input features into sub-features
Operator Grouping: Process each sub-feature with a group of consecutive operators
Step Rehash: Accelerate inference by skipping redundant steps (a toy sketch follows)
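A toy version of the step-skipping idea (the similarity criterion and update rule are placeholders, not the paper's mechanism):

```python
def denoise_with_rehash(unet, x, timesteps, similar):
    """Skip redundant steps: when `similar(t_prev, t)` predicts that two
    adjacent timesteps yield nearly identical outputs, reuse the cached
    result instead of running the UNet again."""
    cached, t_prev = None, None
    for t in timesteps:
        if cached is not None and similar(t_prev, t):
            eps = cached                 # rehash: skip the forward pass
        else:
            eps = cached = unet(x, t)    # full denoising step
        x = x - eps                      # placeholder scheduler update
        t_prev = t
    return x
```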
Evaluation
VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models [] [] [] []
UTS & ZJU
1.67M unique text-to-video prompts from real users.
6.69M videos generated by four state-of-the-art diffusion models (Pika, VideoCrafter2, Text2Video-Zero, ModelScope).
Evaluation of Text-to-Video Generation Models: A Dynamics Perspective [] [] []
UCAS & HIT & Adelaide & Baidu
Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content.
DEVIL: An evaluation protocol that centers on the dynamics dimension to evaluate T2V generation models
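As a crude proxy for what a dynamics-centric metric measures (not DEVIL's actual protocol), one can score the magnitude of inter-frame change:

```python
import numpy as np

def dynamics_score(frames):
    """Mean absolute inter-frame difference over a (T, H, W, C) video:
    near zero for static clips, larger for highly dynamic content."""
    frames = np.asarray(frames, dtype=np.float32)
    return float(np.abs(np.diff(frames, axis=0)).mean())
```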
Boosting Text-to-Video Generative Model with MLLMs Feedback []
MSRA
Utilize Multimodal Large Language Models (MLLMs) to perform fine-grained video preference annotations → VideoPrefer (13.5K preference annotations)
VideoRM: The reward model for text-to-video alignment
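Reward models of this kind are typically fit to preference pairs with a Bradley-Terry objective; a generic sketch (not VideoRM's actual training code):

```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, prompt, video_preferred, video_rejected):
    """Bradley-Terry loss: push the reward of the preferred video above
    that of the rejected one for the same prompt."""
    r_pos = reward_model(prompt, video_preferred)
    r_neg = reward_model(prompt, video_rejected)
    return -F.logsigmoid(r_pos - r_neg).mean()
```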