# NeurIPS 2024

## Meta Info

Homepage: <https://neurips.cc/Conferences/2024>

Paper list: <https://neurips.cc/virtual/2024/papers.html?filter=titles>

### Acceptance Rate

* Total: 15671
* Accept: 25.8% (4037)
  * Poster: 23.3% (3650)
  * Spotlight: 2.1% (326)
  * Oral: 0.4% (61)

## Papers

### Large Language Models (LLMs)

* LLM Inference
  * SGLang: Efficient Execution of Structured Language Model Programs \[[Paper](https://openreview.net/forum?id=VqkAKQibpq)] \[[Code](https://github.com/sgl-project/sglang)] \[[arXiv](https://arxiv.org/abs/2312.07104)]
    * Stanford & UC Berkeley
    * Co-design both the front-end language (programming interface) and the back-end runtime
    * SGLang Primitives
      * Enable the manipulation of prompts and generations
        * `gen`: call LLM generation
        * `select`: let the LLM choose the option with the highest probability from a list
        * `extend` or `+=`: extend the current prompt
      * Control of parallelism
        * `fork`: fork the current prompt state
        * `join`: rejoin the forked prompt states
    * Compilation optimizations
      * Code movement for improving prefix sharing
        * Doesn't strictly preserve the original computation, so it is an aggressive optimization
        * Prompt GPT-4 to re-order graph nodes
    * Runtime
      * RadixAttention
        * Utilize a radix tree (w/ efficient prefix search, reuse, insertion, eviction)
        * LRU eviction policy
      * Cache-aware scheduling → Increase the cache hit rate
        * Key idea: Sort the requests by matched prefix length
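    A minimal sketch of a program written with these primitives, based on SGLang's public frontend examples (exact APIs may vary across versions). Note how `fork` creates branches that share a common prefix, which is precisely what RadixAttention's radix-tree KV cache reuses:

    ```python
    import sglang as sgl

    # Assumes a backend is set, e.g.:
    # sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

    @sgl.function
    def review(s, text):
        # `+=` extends the current prompt; `gen` invokes LLM generation
        s += "Review the following text.\n" + text + "\n"
        # The `select` primitive: pick the highest-probability option
        s += "Sentiment: " + sgl.gen("sentiment", choices=["positive", "negative"])
        # `fork` branches the prompt state; both branches share the prefix above
        forks = s.fork(2)
        for i, f in enumerate(forks):
            f += f"\nReason {i + 1}: " + sgl.gen("reason", max_tokens=64, stop="\n")
        # Reading the fork results merges them back (the paper's `join`)
        s += "\nSummary: " + forks[0]["reason"] + " " + forks[1]["reason"]

    state = review.run(text="The plot is thin, but the acting is superb.")
    print(state["sentiment"])
    ```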
  * Efficient LLM Scheduling by Learning to Rank \[[Paper](https://openreview.net/forum?id=wlLjYl0Gi6)] \[[Code](https://github.com/hao-ai-lab/vllm-ltr)]
    * UCSD & THU & Snowflake & UC Berkeley
    * Insight: it is possible to *predict the relative ranks of output lengths in a batch of requests*.
    * Develop a scheduler for LLM inference that can approximate the *shortest-job-first* (SJF) schedule better than existing approaches
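    A minimal sketch of the scheduling idea under stated assumptions: `rank_model` is a stand-in for the paper's learned ranker and only needs to order requests by expected output length, not predict exact lengths.

    ```python
    from typing import Callable, List

    def schedule_sjf(prompts: List[str],
                     rank_model: Callable[[str], float]) -> List[str]:
        """Approximate shortest-job-first: serve the requests with the
        shortest predicted outputs first, reducing head-of-line
        blocking and mean latency."""
        # Lower score = shorter predicted output = served earlier
        return sorted(prompts, key=rank_model)

    # Usage with a hypothetical stand-in scorer
    batch = ["Write an essay on entropy.", "2 + 2 = ?"]
    for p in schedule_sjf(batch, rank_model=lambda q: len(q)):
        print(p)
    ```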
* Compound AI Systems
  * Are More LM Calls All You Need? Towards the Scaling Properties of Compound AI Systems \[[Paper](https://openreview.net/forum?id=m5106RRLgx)] \[[Code](https://github.com/lchen001/CompoundAIScalingLaws)]
    * Stanford & UC Berkeley & Princeton
    * Systematically study *how the number of LM calls affects the performance of two natural inference strategy designs*.
      * **Vote**: Aggregate LM responses via majority voting
      * **Filter-Vote**: Majority voting after filtering results with an LM
    * Insight
      * More LM calls lead to higher performance on “easy” queries, but lower performance on “hard” queries, and nonmonotone behavior can emerge when a task contains both types of queries.
    * An analytical scaling model to predict the performance of Vote and Filter-Vote systems and find the optimal number of LM calls to make.
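    A minimal sketch of both strategies, assuming a hypothetical `call_lm` that returns one sampled answer and a hypothetical `passes_filter` LM check:

    ```python
    from collections import Counter
    from typing import Callable, Optional

    def vote(call_lm: Callable[[str], str], query: str, k: int) -> str:
        """Vote: majority answer over k LM calls."""
        answers = [call_lm(query) for _ in range(k)]
        return Counter(answers).most_common(1)[0][0]

    def filter_vote(call_lm: Callable[[str], str],
                    passes_filter: Callable[[str, str], bool],
                    query: str, k: int) -> Optional[str]:
        """Filter-Vote: majority vote over the answers an LM filter accepts."""
        answers = [a for a in (call_lm(query) for _ in range(k))
                   if passes_filter(query, a)]
        return Counter(answers).most_common(1)[0][0] if answers else None
    ```

    This makes the nonmonotonicity intuitive: more calls amplify the majority answer, which helps when per-call accuracy is above 1/2 (easy queries) and hurts when it is below (hard queries).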

### Diffusion Models

* Adapter Selection
  * Stylus: Automatic Adapter Selection for Diffusion Models \[[Paper](https://openreview.net/forum?id=3Odq2tGSpp)] \[[Homepage](https://stylus-diffusion.github.io)] \[[Code](https://github.com/stylus-diffusion/stylus)]
    * UC Berkeley & CMU & Google DeepMind
    * Problem: how to match the prompt to a set of relevant adapters
    * Stylus
      * Select and automatically compose task-specific adapters based on a prompt's keywords
      * Three-stage approach
        1. Refiner: Use a vision-language model (VLM) to generate a semantic description of each adapter, then translate the descriptions into embeddings
        2. Retriever: Fetch the adapters most relevant to the entire user prompt using cosine similarity
        3. Composer: Segment the prompt into tasks based on its keywords and assign retrieved adapters to the tasks
    * StylusDocs
      * An adapter dataset consisting of 75K LoRAs (sourced from Civitai) with pre-computed adapter embeddings
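    A minimal sketch of the Retriever stage under stated assumptions: the adapter descriptions were embedded offline by the Refiner, and `prompt_emb` / `adapter_embs` are hypothetical precomputed vectors.

    ```python
    import numpy as np

    def top_k_adapters(prompt_emb: np.ndarray,
                       adapter_embs: np.ndarray,  # (num_adapters, dim)
                       k: int = 5) -> np.ndarray:
        """Return the indices of the k adapters most similar to the prompt."""
        # Cosine similarity = dot product of L2-normalized vectors
        p = prompt_emb / np.linalg.norm(prompt_emb)
        a = adapter_embs / np.linalg.norm(adapter_embs, axis=1, keepdims=True)
        return np.argsort(-(a @ p))[:k]
    ```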
* Inference
  * Reverse Transition Kernel: A Flexible Framework to Accelerate Diffusion Inference \[[Paper](https://openreview.net/forum?id=C2xCLze1kS)]
    * HKUST & HKU & Salesforce AI Research & UIUC
    * Develop a general RTK (reverse transition kernel) framework that enables a more balanced subproblem decomposition
    * Propose leveraging two fast sampling algorithms, the Metropolis-Adjusted Langevin Algorithm (MALA) and Underdamped Langevin Dynamics (ULD), to solve the resulting strongly log-concave subproblems
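    For concreteness, a minimal NumPy sketch of MALA, one of the two samplers the paper plugs into the RTK framework; `log_p` and `grad_log_p` describe a (strongly log-concave) subproblem target and are assumed given.

    ```python
    import numpy as np

    def mala(log_p, grad_log_p, x0, step, n_steps, seed=0):
        """Metropolis-Adjusted Langevin Algorithm: a Langevin proposal
        plus a Metropolis-Hastings accept/reject correction."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)

        def log_q(x_to, x_from):
            # log N(x_to; x_from + step * grad, 2 * step * I), up to a constant
            mean = x_from + step * grad_log_p(x_from)
            return -np.sum((x_to - mean) ** 2) / (4 * step)

        for _ in range(n_steps):
            prop = (x + step * grad_log_p(x)
                    + np.sqrt(2 * step) * rng.standard_normal(x.shape))
            log_alpha = log_p(prop) + log_q(x, prop) - log_p(x) - log_q(prop, x)
            if np.log(rng.uniform()) < log_alpha:
                x = prop
        return x

    # Usage: sample from a standard 2-D Gaussian
    sample = mala(lambda x: -0.5 * x @ x, lambda x: -x, np.zeros(2), 0.1, 1000)
    ```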
  * Accelerating Diffusion Models with Parallel Sampling: Inference at Sub-Linear Time Complexity \[[Paper](https://openreview.net/forum?id=F9NDzHQtOl)]
    * Stanford
    * Propose to divide the sampling process into $$O(1)$$ blocks with parallelizable Picard iterations within each block
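    A toy NumPy sketch of the Picard-iteration mechanism on a generic ODE $$x'(t) = f(x, t)$$ (an illustration of where the parallelism comes from, not the paper's exact sampler): each iteration reads only the previous iterate, so all time points in a block can be updated in parallel.

    ```python
    import numpy as np

    def picard_block(f, x0, ts, n_iters):
        """x_{k+1}(t_i) = x0 + sum_{j<i} f(x_k(t_j), t_j) * dt_j.
        Every grid point is updated from the previous iterate only, so
        the per-point work is parallelizable (vectorized here as a cumsum)."""
        dts = np.diff(ts)
        xs = np.tile(x0, (len(ts), 1))  # initial guess: constant path
        for _ in range(n_iters):
            drifts = np.array([f(x, t) for x, t in zip(xs[:-1], ts[:-1])])
            xs[1:] = x0 + np.cumsum(drifts * dts[:, None], axis=0)
        return xs

    # Usage: x' = -x from x(0) = 1 approaches exp(-t) on [0, 1]
    path = picard_block(lambda x, t: -x, np.array([1.0]), np.linspace(0, 1, 33), 20)
    ```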
* Talking Face Video Generation
  * VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time \[[Paper](https://openreview.net/forum?id=5zSCSE0k41)] \[[Homepage](https://www.microsoft.com/en-us/research/project/vasa-1/)]
    * MSRA
    * A framework to generate lifelike talking faces with appealing visual affective skills (VAS).
    * A diffusion-based holistic facial dynamics and head movement generation model that works in a face latent space.
    * Support the online generation of 512×512 videos at up to 40 FPS.
* Facial Parts Swapping
  * FuseAnyPart: Diffusion-Driven Facial Parts Swapping via Multiple Reference Images \[[Paper](https://openreview.net/forum?id=X2UMdvcmMo)] \[[Code (coming...)](https://github.com/Thomas-wyh/FuseAnyPart)]
    * Alibaba
    * Facial parts from different people are assembled into a complete face in latent space within the Mask-based Fusion Module
    * The consolidated feature is dispatched to the Addition-based Injection Module for fusion within the UNet of the diffusion model to create novel characters
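    A minimal sketch of the mask-based fusion step under assumed shapes (all names hypothetical): per-part masks pick which reference's latent features fill each spatial location.

    ```python
    import torch

    def fuse_parts(latents: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        """latents: (P, C, H, W) latent features, one per reference face part
        masks:   (P, 1, H, W) part masks (eyes, nose, mouth, ...),
                 assumed to sum to 1 at every spatial location
        Returns a single fused face latent of shape (C, H, W)."""
        return (latents * masks).sum(dim=0)
    ```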

### Autoregressive Image Generation

* Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction \[[Paper](https://openreview.net/forum?id=gojL67CfS8)] \[[Code](https://github.com/FoundationVision/VAR)] \[[arXiv](https://arxiv.org/abs/2404.02905)]
  * PKU & ByteDance
  * **Best Paper Award**
  * **VAR**: Visual Autoregressive Modeling
  * Redefine the autoregressive learning on images as coarse-to-fine “**next-scale prediction**” or “**next-resolution prediction**”
  * Multi-scale token maps are autoregressively generated from coarse to fine scales (**lower to higher resolutions**), with parallel token generation within each scale
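  A schematic sketch of the generation loop (the `transformer` interface and scale list are hypothetical simplifications; the multi-scale quantizer and pixel decoding are omitted):

  ```python
  import torch

  def var_generate(transformer, scales=(1, 2, 4, 8, 16)):
      """Next-scale prediction: autoregressive ACROSS scales,
      parallel WITHIN each scale."""
      context = []  # token maps from all coarser scales so far
      for s in scales:
          # One forward pass predicts all s*s tokens of this scale in
          # parallel, conditioned on every coarser token map.
          logits = transformer(context, target_hw=(s, s))  # (s*s, vocab)
          tokens = torch.distributions.Categorical(logits=logits).sample()
          context.append(tokens.view(s, s))
      return context[-1]  # finest map; a VQVAE decoder maps it to pixels
  ```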
* Autoregressive Image Generation without Vector Quantization \[[Paper](https://openreview.net/forum?id=VNBIF0gmkb)] \[[Code](https://github.com/LTH14/mar)] \[[arXiv](https://arxiv.org/abs/2406.11838)]
  * MIT & Google DeepMind & THU
  * Propose to model the per-token probability distribution using a diffusion procedure
  * Define a *Diffusion Loss* function to model the per-token probability
  * Evaluated across a wide range of cases, including standard autoregressive models and generalized *masked autoregressive* (**MAR**) variants
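  A minimal sketch of the Diffusion Loss under stated assumptions (toy noise schedule; hypothetical `denoiser` signature): each continuous token is supervised by a denoising objective conditioned on the backbone's output vector `z`, replacing the categorical cross-entropy over a VQ codebook.

  ```python
  import torch
  import torch.nn.functional as F

  def diffusion_loss(denoiser, x, z, num_timesteps=1000):
      """x: (B, D) continuous ground-truth tokens
      z: (B, D) conditioning vectors from the autoregressive backbone"""
      t = torch.randint(0, num_timesteps, (x.shape[0],), device=x.device)
      noise = torch.randn_like(x)
      alpha_bar = torch.cos(0.5 * torch.pi * t / num_timesteps) ** 2  # toy schedule
      x_t = alpha_bar.sqrt()[:, None] * x + (1 - alpha_bar).sqrt()[:, None] * noise
      # A small network predicts the noise, conditioned on z and the timestep
      return F.mse_loss(denoiser(x_t, t, z), noise)
  ```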

### Text-to-Video Generation

* Inference
  * Fast and Memory-Efficient Video Diffusion Using Streamlined Inference \[[Paper](https://openreview.net/forum?id=iNvXYQrkpi)] \[[Code](https://github.com/wuyushuwys/FMEDiffusion)]
    * NEU
    * **Streamlined Inference**: Leverage the temporal and spatial properties of video diffusion models
    * Three core components
      * **Feature Slicer**: Partition input features into sub-features
      * **Operator Grouping**: Process each sub-feature with a group of consecutive operators
      * **Step Rehash**: Accelerate inference through skipping unnecessary steps
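    A minimal sketch of the Feature Slicer + Operator Grouping idea (shapes and the independence assumption are mine, not the paper's exact design): only one slice's intermediate activations are alive at a time, so peak memory drops roughly by the slice count.

    ```python
    import torch
    import torch.nn as nn

    def sliced_forward(op_group: nn.Sequential, x: torch.Tensor,
                       num_slices: int, dim: int = 0) -> torch.Tensor:
        """Run a group of consecutive operators slice by slice.
        Valid only if every op in the group acts independently along
        `dim` (e.g., per-frame spatial layers in a video UNet)."""
        outs = [op_group(chunk) for chunk in x.chunk(num_slices, dim=dim)]
        return torch.cat(outs, dim=dim)
    ```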
* Evaluation
  * VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models \[[Paper](https://openreview.net/forum?id=pYNl76onJL)] \[[Homepage](https://vidprom.github.io)] \[[Code](https://github.com/WangWenhao0716/VidProM)] \[[Dataset](https://huggingface.co/datasets/WenhaoWang/VidProM)]
    * UTS & ZJU
    * 1.67M unique text-to-video prompts from real users.
    * 6.69M videos generated by four state-of-the-art diffusion models (Pika, VideoCrafter2, Text2Video-Zero, ModelScope).
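    The dataset is hosted on the Hugging Face Hub, so a standard `datasets` load should work (untested sketch; subset or config names may differ):

    ```python
    from datasets import load_dataset

    # Stream the prompt records without downloading everything up front
    vidprom = load_dataset("WenhaoWang/VidProM", streaming=True)
    ```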
  * Evaluation of Text-to-Video Generation Models: A Dynamics Perspective \[[Paper](https://openreview.net/forum?id=tmX1AUmkl6)] \[[Homepage](https://t2veval.github.io/DEVIL/)] \[[Code](https://github.com/MingXiangL/DEVIL)]
    * UCAS & HIT & Adelaide & Baidu
    * Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely *ignore the dynamics of video content*.
    * **DEVIL**: An evaluation protocol that centers on the dynamics dimension to evaluate T2V generation models
  * Boosting Text-to-Video Generative Model with MLLMs Feedback \[[Paper](https://openreview.net/forum?id=3ivnixHy16)]
    * MSRA
    * Utilize Multimodal Large Language Models (MLLMs) to perform fine-grained video preference annotations → **VideoPrefer** (13.5K preference annotations)
    * **VideoRM**: The reward model for text-to-video alignment
