NeurIPS 2024
Meta Info
Homepage: https://neurips.cc/Conferences/2024
Paper list: https://neurips.cc/virtual/2024/papers.html?filter=titles
Acceptance Rate
Total: 15671
Accept: 25.8% (4037)
Poster: 23.3% (3650)
Spotlight: 2.1% (326)
Oral: 0.4% (61)
Papers
Large Language Models (LLMs)
LLM Inference
SGLang: Efficient Execution of Structured Language Model Programs [Paper] [Code] [arXiv]
Stanford & UC Berkeley
Co-design both the front-end language (programming interface) and the back-end runtime
SGLang Primitives
Enable the manipulation of prompts and generations
gen: call LLM generation
select: let the LLM choose the option with the highest probability from a list
extend or +=: extend the current prompt
Control of parallelism
fork: fork the current prompt state
join: rejoin the forked prompt states
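A minimal sketch of how these primitives compose in SGLang's Python frontend. The decorator, `sgl.gen`, `sgl.select`, and `s.fork` follow SGLang's documented interface, but exact signatures may differ across versions, and the prompts here are placeholders:

```python
import sglang as sgl

@sgl.function
def tool_use(s, question):
    # extend / +=: extend the current prompt state
    s += "Question: " + question + "\n"
    # select: let the LLM choose the highest-probability option from a list
    s += "Tool to use: " + sgl.select("tool", choices=["calculator", "search engine"]) + "\n"
    # gen: call LLM generation and store the result under a name
    s += "Answer: " + sgl.gen("answer", max_tokens=64)

@sgl.function
def two_tips(s, topic):
    s += f"Here are two tips about {topic}.\n"
    # fork: branch the prompt state into parallel copies
    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Tip {i + 1}: " + sgl.gen("tip", max_tokens=32)
    # join: reading the forked results rejoins them into the parent state
    s += forks[0]["tip"] + "\n" + forks[1]["tip"]
```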
Compilation optimizations
Code movement for improving prefix sharing
Aggressive optimization: doesn't strictly preserve the original computation
Prompt GPT-4 to re-order graph nodes
Runtime
RadixAttention
Utilize a radix tree (w/ efficient prefix search, reuse, insertion, eviction)
LRU eviction policy
Cache-aware scheduling → Increase the cache hit rate
Key idea: Sort the requests by matched prefix length
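A toy illustration of the RadixAttention idea (not SGLang's actual implementation): a prefix tree caches entries keyed by token prefixes, evicts least-recently-used leaves, and the scheduler sorts waiting requests by matched prefix length:

```python
import time

class Node:
    def __init__(self):
        self.children = {}  # token id -> Node
        self.last_used = time.monotonic()

class PrefixCache:
    """Toy radix-tree KV-prefix cache with LRU leaf eviction."""
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_used = time.monotonic()
            matched += 1
        return matched

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())
            node.last_used = time.monotonic()

    def evict_lru_leaf(self):
        """Evict the least-recently-used leaf (inner nodes are shared prefixes)."""
        def leaves(node, parent, key):
            if not node.children:
                yield node, parent, key
            for k, c in node.children.items():
                yield from leaves(c, node, k)
        victims = list(leaves(self.root, None, None))
        if victims:
            _, parent, key = min(victims, key=lambda v: v[0].last_used)
            if parent is not None:
                del parent.children[key]

def schedule(requests, cache):
    """Cache-aware scheduling: longest matched prefix first."""
    return sorted(requests, key=lambda r: cache.match_prefix(r), reverse=True)
```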
Efficient LLM Scheduling by Learning to Rank [Paper] [Code]
UCSD & THU & Snowflake & UC-Berkeley
Insight: it is possible to predict the relative ranks of output lengths in a batch of requests.
Develop a scheduler for LLM inference that can approximate the shortest-job-first (SJF) schedule better than existing approaches
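A minimal sketch of the scheduling idea, assuming a learned predictor that maps each request to a score whose ordering approximates the ordering of true output lengths (the predictor itself is mocked here; the paper learns it with a ranking objective):

```python
def rank_schedule(waiting, predict_rank):
    """Approximate shortest-job-first: serve requests whose outputs are
    predicted to be shortest first. `predict_rank` is any learned scorer
    whose ordering tracks the true output-length ordering."""
    return sorted(waiting, key=predict_rank)

# Usage with a stand-in scorer (hypothetical; prompt length is NOT the paper's predictor):
requests = ["summarize this book ...", "say hi", "explain attention ..."]
print(rank_schedule(requests, predict_rank=len))
```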
Compound AI Systems
Are More LM Calls All You Need? Towards the Scaling Properties of Compound AI Systems [Paper] [Code]
Stanford & UC Berkeley & Princeton
Systematically study how the number of LM calls affects the performance of two natural inference strategies:
Vote: Aggregate LM responses via majority voting
Filter-Vote: Majority voting after filtering results with an LM
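Schematically, with `call_lm` and `passes_filter` as placeholder hooks for actual LM calls:

```python
from collections import Counter

def vote(query, call_lm, k):
    """Vote: issue k LM calls and return the majority answer."""
    answers = [call_lm(query) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

def filter_vote(query, call_lm, passes_filter, k):
    """Filter-Vote: majority vote over answers an LM-based filter accepts."""
    kept = [a for a in (call_lm(query) for _ in range(k)) if passes_filter(query, a)]
    if not kept:  # if the filter rejects everything, fall back to plain voting
        return vote(query, call_lm, k)
    return Counter(kept).most_common(1)[0][0]
```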
Insight
More LM calls lead to higher performance on “easy” queries, but lower performance on “hard” queries, and nonmonotone behavior can emerge when a task contains both types of queries.
Propose an analytical scaling model that predicts the performance of Vote and Filter-Vote systems and finds the optimal number of LM calls to make.
Diffusion Models
Adapter Selection
Stylus: Automatic Adapter Selection for Diffusion Models [Paper] [Homepage] [Code]
UC Berkeley & CMU & Google DeepMind
Problem: how to match the prompt to a set of relevant adapters
Stylus
Select and automatically compose task-specific adapters based on a prompt's keywords
Three-stage approach
Refiner: Leverage vision-language foundation models (VLMs) to generate semantic descriptions of adapters, then translate them into embeddings
Retriever: Fetch the most relevant adapters over the entirety of the user’s prompt using cosine similarity
Composer: Segment the prompt into tasks from a prompt’s keywords and assign retrieved adapters to tasks
StylusDocs
An adapter dataset consisting of 75K LoRAs (sourced from Civitai) with pre-computed adapter embeddings
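A minimal sketch of the Retriever stage over pre-computed adapter embeddings in the style of StylusDocs; the embedding model and adapter count here are stand-ins:

```python
import numpy as np

def retrieve_adapters(prompt_emb, adapter_embs, top_k=5):
    """Rank adapters by cosine similarity to the prompt embedding."""
    a = adapter_embs / np.linalg.norm(adapter_embs, axis=1, keepdims=True)
    p = prompt_emb / np.linalg.norm(prompt_emb)
    return np.argsort(-(a @ p))[:top_k]  # indices of the most relevant adapters

# Usage with random stand-in embeddings:
embs = np.random.randn(1_000, 768).astype(np.float32)
query = np.random.randn(768).astype(np.float32)
print(retrieve_adapters(query, embs))
```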
Inference
Reverse Transition Kernel: A Flexible Framework to Accelerate Diffusion Inference [Paper]
HKUST & HKU & Salesforce AI Research & UIUC
Develop a general RTK (reverse transition kernel) framework that enables a more balanced subproblem decomposition
Propose leveraging two fast sampling algorithms, the Metropolis-Adjusted Langevin Algorithm (MALA) and Underdamped Langevin Dynamics (ULD), for solving these strongly log-concave subproblems
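A generic MALA sampler for a log-concave target, as a sketch of the subproblem solver; the specific RTK targets and step sizes in the paper differ:

```python
import numpy as np

def mala(log_p, grad_log_p, x0, step, n_steps, rng=np.random.default_rng(0)):
    """Metropolis-Adjusted Langevin Algorithm: Langevin proposal plus
    Metropolis-Hastings correction, efficient for strongly log-concave targets."""
    def log_q(dst, src):  # log proposal density (up to a constant)
        mu = src + step * grad_log_p(src)
        return -np.sum((dst - mu) ** 2) / (4 * step)

    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        prop = x + step * grad_log_p(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
        log_alpha = log_p(prop) - log_p(x) + log_q(x, prop) - log_q(prop, x)
        if np.log(rng.uniform()) < log_alpha:
            x = prop
    return x

# Usage: sample from a standard Gaussian, log p(x) = -||x||^2 / 2.
sample = mala(lambda x: -0.5 * np.sum(x**2), lambda x: -x, np.zeros(2), 0.1, 1000)
```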
Accelerating Diffusion Models with Parallel Sampling: Inference at Sub-Linear Time Complexity [Paper]
Stanford
Talking Face Video Generation
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time [Paper] [Homepage]
MSRA
A framework to generate lifelike talking faces with appealing visual affective skills (VAS).
A diffusion-based holistic facial dynamics and head movement generation model that works in a face latent space.
Support the online generation of 512×512 videos at up to 40 FPS.
Facial Parts Swapping
FuseAnyPart: Diffusion-Driven Facial Parts Swapping via Multiple Reference Images [Paper] [Code (coming...)]
Alibaba
Facial parts from different people are assembled into a complete face in latent space within the Mask-based Fusion Module
The consolidated feature is dispatched to the Addition-based Injection Module for fusion within the UNet of the diffusion model to create novel characters
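A schematic of the mask-based fusion step: per-part latent features are combined under binary part masks; tensor shapes and the UNet injection are placeholders, not FuseAnyPart's exact modules:

```python
import torch

def fuse_parts(part_feats, part_masks):
    """Assemble facial-part features into one face feature in latent space.

    part_feats: list of (C, H, W) latents, one per reference image
    part_masks: list of (1, H, W) binary masks selecting eyes / nose / mouth / etc.
    """
    fused = torch.zeros_like(part_feats[0])
    for feat, mask in zip(part_feats, part_masks):
        fused = fused + feat * mask  # each part contributes only inside its mask
    return fused  # then injected into the diffusion UNet by addition

feats = [torch.randn(4, 64, 64) for _ in range(3)]
masks = [torch.zeros(1, 64, 64) for _ in range(3)]
masks[0][:, :32], masks[1][:, 32:48], masks[2][:, 48:] = 1, 1, 1
print(fuse_parts(feats, masks).shape)  # torch.Size([4, 64, 64])
```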
Autoregressive Image Generation
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction [Paper] [Code] [arXiv]
PKU & ByteDance
Best Paper Award
VAR: Visual Autoregressive Modeling
Redefine the autoregressive learning on images as coarse-to-fine “next-scale prediction” or “next-resolution prediction”
Multi-scale token maps are autoregressively generated from coarse to fine scales (lower to higher resolutions), with parallel token generation within each scale
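A structural sketch of next-scale prediction; `transformer` and `sample_tokens` are hypothetical hooks standing in for VAR's multi-scale VQ tokenizer and decoder:

```python
def var_generate(transformer, sample_tokens, scales=(1, 2, 4, 8, 16)):
    """Coarse-to-fine generation: each step predicts ALL tokens of the next
    scale in parallel, conditioned on every previously generated scale."""
    context = []                                          # token maps so far
    for s in scales:
        logits = transformer(context, target_hw=(s, s))   # attend to coarser scales
        tokens = sample_tokens(logits)                    # (s, s) map, sampled in parallel
        context.append(tokens)
    return context[-1]                                    # finest map -> image decoder
```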
Autoregressive Image Generation without Vector Quantization [Paper] [Code] [arXiv]
MIT & Google DeepMind & THU
Propose to model the per-token probability distribution using a diffusion procedure
Define a Diffusion Loss function to model the per-token probability
Evaluated across a wide range of cases, including standard autoregressive models and generalized masked autoregressive (MAR) variants
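A sketch of the Diffusion Loss: the AR backbone emits a condition vector z per token, and a small denoiser is trained with the standard epsilon-prediction objective; the noise schedule and denoiser are stand-ins:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(denoiser, z, x, alphas_cumprod):
    """Per-token Diffusion Loss (epsilon-prediction form):
        L(z, x) = E_{t, eps} || eps - eps_theta(x_t, t, z) ||^2
    with x the continuous token, z the AR condition vector, and
        x_t = sqrt(a_t) * x + sqrt(1 - a_t) * eps.
    """
    b = x.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x.device)
    a_t = alphas_cumprod[t].unsqueeze(-1)           # (B, 1)
    eps = torch.randn_like(x)
    x_t = a_t.sqrt() * x + (1 - a_t).sqrt() * eps   # forward-noised token
    return F.mse_loss(denoiser(x_t, t, z), eps)     # denoiser is a small MLP head
```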
Text-to-Video Generation
Inference
Fast and Memory-Efficient Video Diffusion Using Streamlined Inference [Paper] [Code]
NEU
Streamlined Inference: Leverage the temporal and spatial properties of video diffusion models
Three core components
Feature Slicer: Partition input features into sub-features
Operator Grouping: Process each sub-feature with a group of consecutive operators
Step Rehash: Accelerate inference through skipping unnecessary steps
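A schematic of Feature Slicer + Operator Grouping: run each slice through the whole group of consecutive operators before the next slice, so peak memory scales with one slice rather than the full feature map (pointwise/per-frame operators are assumed; real video-diffusion ops need care at slice boundaries):

```python
import torch

def sliced_forward(ops, x, num_slices=4, dim=2):
    """Process sub-features one at a time through a group of consecutive
    operators, then reassemble along the sliced axis."""
    outs = []
    for chunk in torch.chunk(x, num_slices, dim=dim):
        for op in ops:          # operator grouping: whole op chain per slice
            chunk = op(chunk)
        outs.append(chunk)
    return torch.cat(outs, dim=dim)

# Usage: slicing a (B, C, T, H, W) video feature along the temporal axis.
ops = [torch.nn.SiLU(), torch.nn.Conv3d(8, 8, kernel_size=1)]
x = torch.randn(1, 8, 16, 32, 32)
print(sliced_forward(ops, x, num_slices=4, dim=2).shape)
```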
Evaluation
Evaluation of Text-to-Video Generation Models: A Dynamics Perspective [Paper] [Homepage] [Code]
UCAS & HIT & Adelaide & Baidu
Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content.
DEVIL: An evaluation protocol that centers on the dynamics dimension to evaluate T2V generation models
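DEVIL's actual protocol scores dynamics at multiple temporal granularities; as a deliberately simplified stand-in (NOT DEVIL's metric), a per-video dynamics score can be illustrated as mean inter-frame change:

```python
import numpy as np

def dynamics_score(frames):
    """Toy dynamics proxy: mean absolute inter-frame difference,
    normalized to [0, 1] for uint8 frames of shape (T, H, W, C)."""
    frames = frames.astype(np.float32) / 255.0
    return float(np.mean(np.abs(np.diff(frames, axis=0))))

video = np.random.randint(0, 256, (16, 64, 64, 3), dtype=np.uint8)
print(dynamics_score(video))
```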
Boosting Text-to-Video Generative Model with MLLMs Feedback [Paper]
MSRA
Utilize Multimodal Large Language Models (MLLMs) to perform fine-grained video preference annotations → VideoPrefer (13.5K preference annotations)
VideoRM: The reward model for text-to-video alignment