NeurIPS 2024

Meta Info

Homepage: https://neurips.cc/Conferences/2024

Paper list: https://neurips.cc/virtual/2024/papers.html?filter=titles

Acceptance Rate

  • Total submissions: 15671

  • Accept: 25.8% (4037)

    • Poster: 23.3% (3650)

    • Spotlight: 2.1% (326)

    • Oral: 0.4% (61)

Papers

Large Language Models (LLMs)

  • LLM Inference

    • SGLang: Efficient Execution of Structured Language Model Programs [Paper] [Code] [arXiv]

      • Stanford & UC Berkeley

      • Co-design both the front-end language (programming interface) and the back-end runtime

      • SGLang Primitives

        • Enable the manipulation of prompts and generations

          • gen: call LLM generation

          • select: let the LLM choose the option with the highest probability from a list

          • extend or +=: extend the current prompt

        • Control of parallelism

          • fork: fork the current prompt state

          • join: rejoin the forked prompt states
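
Adapted from the paper's running example, a minimal sketch of how these primitives compose (the exact `sglang` frontend details may differ across versions):

```python
import sglang as sgl

@sgl.function
def tip_suggestion(s):
    # `+=` extends the current prompt state
    s += "Here are two tips for staying healthy: "
    s += "1. Balanced Diet. 2. Regular Exercise.\n"

    # fork: split the prompt state into two parallel branches that
    # share the common prefix's KV cache
    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Now, expand tip {i + 1} into a paragraph:\n"
        # gen: call LLM generation inside each branch
        f += sgl.gen("detailed_tip", max_tokens=256, stop="\n\n")

    # join: merge the forked states back into the main prompt
    s += "Tip 1: " + forks[0]["detailed_tip"] + "\n"
    s += "Tip 2: " + forks[1]["detailed_tip"] + "\n"

    # select: let the LLM pick the highest-probability option
    s += "The more important tip is " + sgl.select("best", choices=["tip 1", "tip 2"])
```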

      • Compilation optimizations

        • Code movement for improving prefix sharing

          • An aggressive optimization: it does not strictly preserve the original computation

          • Prompt GPT-4 to re-order graph nodes

      • Runtime

        • RadixAttention

          • Utilize a radix tree (w/ efficient prefix search, reuse, insertion, eviction)

          • LRU eviction policy

        • Cache-aware scheduling → Increase the cache hit rate

          • Key idea: Sort the requests by matched prefix length
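
A toy sketch of the runtime idea, with all names hypothetical: a one-token-per-edge prefix tree stands in for the radix tree, and the scheduler dequeues the requests with the longest matched prefix first:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    children: dict = field(default_factory=dict)  # token id -> Node
    last_use: int = 0                             # timestamp for LRU eviction

class PrefixTree:
    """Toy stand-in for RadixAttention's radix tree: each matched node
    corresponds to reusable KV-cache entries. When memory fills up, the
    leaves with the smallest last_use would be evicted first (LRU)."""

    def __init__(self):
        self.root, self.clock = Node(), 0

    def match_len(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_use, n = self.clock, n + 1
        return n

    def insert(self, tokens):
        self.clock += 1
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())
            node.last_use = self.clock

def schedule(waiting, tree):
    # Cache-aware scheduling: longest matched prefix first -> higher hit rate
    return sorted(waiting, key=tree.match_len, reverse=True)
```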

    • Efficient LLM Scheduling by Learning to Rank [Paper] [Code]

      • UCSD & THU & Snowflake & UC-Berkeley

      • Insight: It is possible to predict the relative ranks of output lengths in a batch of requests.

      • Develop a scheduler for LLM inference that can approximate the shortest-job-first (SJF) schedule better than existing approaches
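
A minimal sketch of the scheduling idea, assuming a hypothetical learned ranker whose scores preserve the ordering of the true output lengths (faked here with prompt length):

```python
def predicted_length_score(prompt: str) -> float:
    """Stand-in for the learned ranker: any model whose scores order
    requests like their true output lengths works; exact lengths are
    not needed, only relative ranks."""
    return float(len(prompt))  # placeholder heuristic, NOT the paper's model

def rank_schedule(waiting: list[str]) -> list[str]:
    # Approximate shortest-job-first: serve requests predicted to finish
    # soonest first, reducing head-of-line blocking and mean latency.
    return sorted(waiting, key=predicted_length_score)
```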

  • Compound AI Systems

    • Are More LM Calls All You Need? Towards the Scaling Properties of Compound AI Systems [Paper] [Code]

      • Stanford & UC Berkeley & Princeton

      • Systematically study how the number of LM calls affects the performance of two natural inference strategy designs.

        • Vote: Aggregate LM responses via majority voting

        • Filter-Vote: Majority voting after first filtering the responses with an LM judge (both strategies are sketched in code after this entry)

      • Insight

        • More LM calls lead to higher performance on “easy” queries, but lower performance on “hard” queries, and nonmonotone behavior can emerge when a task contains both types of queries.

      • An analytical scaling model to predict the performance of Vote and Filter-Vote systems and find the optimal number of LM calls to make.
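
A minimal sketch of the two strategies, assuming a hypothetical `call_lm` sampler and an LM-based `passes_filter` judge:

```python
from collections import Counter

def vote(query, call_lm, k):
    """Vote: aggregate k independent LM answers by majority."""
    answers = [call_lm(query) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

def filter_vote(query, call_lm, passes_filter, k):
    """Filter-Vote: majority vote over the answers an LM judge accepts."""
    answers = [call_lm(query) for _ in range(k)]
    kept = [a for a in answers if passes_filter(query, a)]
    pool = kept or answers  # one possible fallback if the judge rejects all
    return Counter(pool).most_common(1)[0][0]
```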

Diffusion Models

  • Adapter Selection

    • Stylus: Automatic Adapter Selection for Diffusion Models [Paper] [Homepage] [Code]

      • UC Berkeley & CMU & Google DeepMind

      • Problem: how to match the prompt to a set of relevant adapters

      • Stylus

        • Select and automatically compose task-specific adapters based on a prompt's keywords

        • Three-stage approach

          1. Refiner: Leverage vision-language models (VLMs) to generate semantic descriptions of adapters, then translate these descriptions into embeddings

          2. Retriever: Fetch the most relevant adapters over the entirety of the user’s prompt using cosine similarity (see the sketch after this entry)

          3. Composer: Segment the prompt into tasks from a prompt’s keywords and assign retrieved adapters to tasks

      • StylusDocs

        • An adapter dataset consisting of 75K LoRAs (sourced from Civitai) with pre-computed adapter embeddings
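
A sketch of the Retriever stage, assuming prompt and adapter embeddings live in the same space (as with the pre-computed StylusDocs embeddings) and relevance is plain cosine similarity:

```python
import numpy as np

def retrieve_adapters(prompt_emb, adapter_embs, top_k=5):
    """Rank adapters by cosine similarity to the prompt embedding.
    prompt_emb: (d,); adapter_embs: (num_adapters, d)."""
    a = adapter_embs / np.linalg.norm(adapter_embs, axis=1, keepdims=True)
    p = prompt_emb / np.linalg.norm(prompt_emb)
    return np.argsort(a @ p)[::-1][:top_k]  # indices of the most relevant adapters
```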

  • Inference

    • Reverse Transition Kernel: A Flexible Framework to Accelerate Diffusion Inference [Paper]

      • HKUST & HKU & Salesforce AI Research & UIUC

      • Develop a general reverse transition kernel (RTK) framework that enables a more balanced decomposition of diffusion inference into strongly log-concave sampling subproblems

      • Propose leveraging two fast sampling algorithms, the Metropolis-Adjusted Langevin Algorithm (MALA) and Underdamped Langevin Dynamics (ULD), to solve these subproblems
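
For intuition, a minimal NumPy sketch of MALA on a generic strongly log-concave target; in the RTK framework each subproblem would supply its own log-density and gradient:

```python
import numpy as np

def mala(log_p, grad_log_p, x0, step, n_steps, seed=0):
    """Metropolis-Adjusted Langevin Algorithm: propose a discretized
    Langevin step, then accept/reject so the chain leaves p invariant."""
    rng = np.random.default_rng(seed)

    def log_q(xp, x):  # log density of the proposal N(x + step/2 * grad, step * I)
        diff = xp - x - 0.5 * step * grad_log_p(x)
        return -np.sum(diff ** 2) / (2 * step)

    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        prop = x + 0.5 * step * grad_log_p(x) + np.sqrt(step) * rng.standard_normal(x.shape)
        log_alpha = log_p(prop) + log_q(x, prop) - log_p(x) - log_q(prop, x)
        if np.log(rng.uniform()) < log_alpha:
            x = prop
    return x
```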

    • Accelerating Diffusion Models with Parallel Sampling: Inference at Sub-Linear Time Complexity [Paper]

      • Stanford

      • Propose to divide the sampling process into O(1) blocks with parallelizable Picard iterations within each block
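
A toy illustration of the Picard idea; the drift `f` stands in for the probability-flow ODE's drift. Every grid point is updated from the previous iterate, so each sweep parallelizes across time steps:

```python
import numpy as np

def picard_solve(f, x0, ts, n_iters):
    """Solve x'(t) = f(x, t) on the grid ts by Picard iteration:
    x_{k+1}(t_i) = x(0) + sum_{j < i} f(x_k(t_j), t_j) * dt_j.
    All drift evaluations use the PREVIOUS iterate, so within a sweep
    they are independent and can run in parallel."""
    dt = np.diff(ts)
    X = np.tile(np.asarray(x0, dtype=float), (len(ts), 1))  # states at every t
    for _ in range(n_iters):
        drift = np.array([f(X[j], ts[j]) for j in range(len(ts) - 1)])
        X[1:] = x0 + np.cumsum(drift * dt[:, None], axis=0)
    return X[-1]
```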

  • Talking Face Video Generation

    • VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time [Paper] [Homepage]

      • MSRA

      • A framework to generate lifelike talking faces with appealing visual affective skills (VAS).

      • A diffusion-based holistic facial dynamics and head movement generation model that works in a face latent space.

      • Support the online generation of 512×512 videos at up to 40 FPS.

  • Facial Parts Swapping

    • FuseAnyPart: Diffusion-Driven Facial Parts Swapping via Multiple Reference Images [Paper] [Code (coming...)]

      • Alibaba

      • Facial parts from different people are assembled into a complete face in latent space within the Mask-based Fusion Module

      • The consolidated feature is dispatched to the Addition-based Injection Module for fusion within the UNet of the diffusion model to create novel characters

Autoregressive Image Generation

  • Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction [Paper] [Code] [arXiv]

    • PKU & ByteDance

    • Best Paper Award

    • VAR: Visual Autoregressive Modeling

    • Redefine autoregressive learning on images as coarse-to-fine “next-scale prediction” or “next-resolution prediction”

    • Multi-scale token maps are autoregressively generated from coarse to fine scales (lower to higher resolutions), with parallel token generation within each scale
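
A schematic of the coarse-to-fine loop; `transformer`, `decode`, and their signatures are illustrative stand-ins, not the released VAR API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def var_generate(transformer, decode, scales, cond):
    """Next-scale prediction: autoregress over resolutions, not tokens.
    Each iteration predicts ALL tokens of the next-resolution map in
    parallel, conditioned on every coarser map generated so far."""
    maps = []  # token maps from coarse (e.g. 1x1) to fine (e.g. 16x16)
    for h, w in scales:
        # Context: all coarser maps, upsampled to the current resolution
        ctx = [F.interpolate(m, size=(h, w), mode="nearest") for m in maps]
        logits = transformer(ctx, cond, out_hw=(h, w))        # (1, h, w, vocab)
        tokens = torch.distributions.Categorical(logits=logits).sample()
        maps.append(tokens.unsqueeze(1).float())              # (1, 1, h, w)
    return decode(maps)  # VQ decoder fuses the multi-scale maps into an image
```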

  • Autoregressive Image Generation without Vector Quantization [Paper] [Code] [arXiv]

    • MIT & Google DeepMind & THU

    • Propose to model the per-token probability distribution using a diffusion procedure

    • Define a Diffusion Loss function to model the per-token probability

    • Evaluated across a wide range of cases, including standard autoregressive models and generalized masked autoregressive (MAR) variants
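
A minimal sketch of such a Diffusion Loss, assuming a hypothetical `noise_mlp` denoiser conditioned on the autoregressive backbone's output `z`:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(z, x, noise_mlp, alphas_cumprod):
    """Per-token Diffusion Loss: instead of a softmax over a discrete
    codebook, a small MLP learns to denoise the continuous token x,
    conditioned on the AR model's output z. z: (B, d_cond), x: (B, d_tok),
    alphas_cumprod: (T,) noise schedule."""
    t = torch.randint(0, len(alphas_cumprod), (x.shape[0],), device=x.device)
    a = alphas_cumprod[t].unsqueeze(-1)
    eps = torch.randn_like(x)
    x_t = a.sqrt() * x + (1 - a).sqrt() * eps   # forward diffusion at step t
    eps_pred = noise_mlp(x_t, t, z)             # denoiser conditioned on z
    return F.mse_loss(eps_pred, eps)            # standard noise-prediction loss
```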

Text-to-Video Generation

  • Inference

    • Fast and Memory-Efficient Video Diffusion Using Streamlined Inference [Paper] [Code]

      • NEU

      • Streamlined Inference: Leverage the temporal and spatial properties of video diffusion models

      • Three core components

        • Feature Slicer: Partition input features into sub-features

        • Operator Grouping: Process each sub-feature with a group of consecutive operators

        • Step Rehash: Accelerate inference by skipping unnecessary steps
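
A toy sketch of the first two components (valid only when the grouped operators have no cross-slice dependencies, e.g. pointwise or per-frame ops):

```python
import torch

def sliced_forward(x, op_group, n_slices):
    """Feature Slicer + Operator Grouping: split the input along one axis
    and push each slice through a group of consecutive operators, so only
    one slice's intermediate activations are live at a time. This trades
    a little latency for a much lower peak memory footprint."""
    outs = []
    for chunk in torch.chunk(x, n_slices, dim=2):  # e.g. slice over frames
        for op in op_group:                        # consecutive operators
            chunk = op(chunk)
        outs.append(chunk)
    return torch.cat(outs, dim=2)
```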

  • Evaluation

    • VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models [Paper] [Homepage] [Code] [Dataset]

      • UTS & ZJU

      • 1.67M unique text-to-video prompts from real users.

      • 6.69M videos generated by four state-of-the-art diffusion models (Pika, VideoCraft2, Text2Video-Zero, ModelScope).

    • Evaluation of Text-to-Video Generation Models: A Dynamics Perspective [Paper] [Homepage] [Code]

      • UCAS & HIT & Adelaide & Baidu

      • Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content.

      • DEVIL: An evaluation protocol that centers on the dynamics dimension to evaluate T2V generation models

    • Boosting Text-to-Video Generative Model with MLLMs Feedback [Paper]

      • MSRA

      • Utilize Multimodal Large Language Models (MLLMs) to perform fine-grained video preference annotations → VideoPrefer (13.5K preference annotations)

      • VideoRM: The reward model for text-to-video alignment
