CVPR 2024

Meta Info


Paper list:


Diffusion Models


  • Cache Me if You Can: Accelerating Diffusion Models through Block Caching [Paper] [Homepage]

    • Meta & TUM & MCML & Oxford

    • Block caching

      • Reuse outputs from layer blocks of previous steps to speed up inference.

      • Automatically determines caching schedules based on each block's changes over timesteps.
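
The scheduling idea above can be sketched as follows. This is a minimal illustration, not the paper's exact criterion: the rule "recompute once the accumulated change exceeds a threshold", the function names `caching_schedule` and `run_block`, and the numbers in `changes` are all assumptions made for the example.

```python
def caching_schedule(changes, threshold):
    """Return one boolean per step: True means "recompute the block".

    changes[t] is a (hypothetical, pre-measured) relative change of the
    block's output between step t-1 and step t; changes[0] is unused.
    """
    schedule = [True]            # the first step always has to be computed
    accumulated = 0.0
    for change in changes[1:]:
        accumulated += change
        if accumulated > threshold:
            schedule.append(True)    # cached output drifted too far: recompute
            accumulated = 0.0
        else:
            schedule.append(False)   # reuse the cached output
    return schedule


def run_block(block, inputs, schedule):
    """Call `block` only on scheduled steps and reuse its last output otherwise."""
    outputs, cached = [], None
    for x, recompute in zip(inputs, schedule):
        if recompute:
            cached = block(x)
        outputs.append(cached)
    return outputs


# Made-up per-step changes for one block (illustrative values only).
changes = [0.0, 0.30, 0.05, 0.04, 0.03, 0.25, 0.02]
schedule = caching_schedule(changes, threshold=0.1)
outputs = run_block(lambda x: x * 2, list(range(7)), schedule)
```

A block whose output barely changes between steps ends up with long runs of `False`, so it is evaluated rarely, while a fast-changing block is recomputed almost every step.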

  • CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model [Paper] [Code]

    • TJU & Tencent

    • Start the reverse denoising process from an implicit distribution generated by a pre-trained GAN-based model → fewer sampling steps are needed.
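
One common way to realize this kind of shortened sampling is to noise the GAN output to an intermediate step with the standard DDPM forward process and run the reverse process only from there. The sketch below assumes that approach; the linear beta schedule, the start step of 199, and the helper names are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bar, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def truncated_start(gan_output, alpha_bar, start_step, rng):
    """Forward-noise the GAN output to `start_step`:
    x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps, with eps ~ N(0, 1)."""
    a = alpha_bar[start_step]
    return [
        math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for v in gan_output
    ]


alpha_bar = linear_alpha_bar(1000)
rng = random.Random(0)
gan_output = [0.5, -0.2, 0.1]          # stand-in for a GAN-generated image
x_start = truncated_start(gan_output, alpha_bar, start_step=199, rng=rng)

# The reverse process then only visits steps 199 .. 0 instead of 999 .. 0.
remaining_steps = list(range(199, -1, -1))
```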

  • DeepCache: Accelerating Diffusion Models for Free [Paper] [Code]

    • NUS

    • Utilize the inherent temporal redundancy observed in the sequential denoising steps of diffusion models.

    • Cache and retrieve features across adjacent denoising stages, thereby reducing redundant computations.
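
The caching pattern described above can be sketched with a toy denoiser. Everything here is a simplified assumption: the split into a cheap `shallow` part and an expensive `deep` part, the fixed `cache_interval`, and the additive way the two are combined are chosen only to show the control flow.

```python
class CountingLayer:
    """Wraps a function and counts how often it is evaluated."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.fn(x)


def denoise(x, num_steps, shallow, deep, cache_interval):
    """Run `shallow` every step; refresh the `deep` features only every
    `cache_interval` steps and reuse the cached value in between."""
    cached_deep = None
    for step in range(num_steps):
        h = shallow(x)                  # cheap layers: always computed
        if step % cache_interval == 0:
            cached_deep = deep(h)       # full pass: update the cache
        x = h + cached_deep             # toy combination of both paths
    return x


shallow = CountingLayer(lambda x: 0.9 * x)
deep = CountingLayer(lambda h: 0.1 * h)
result = denoise(1.0, num_steps=10, shallow=shallow, deep=deep, cache_interval=3)
```

With 10 steps and an interval of 3, the expensive part runs on steps 0, 3, 6, and 9 only, while the cheap part still runs on every step.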

  • DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models [Paper] [Homepage] [Code]

    • MIT & Princeton & Lepton AI & NVIDIA

    • Displaced patch parallelism

      • Split the model input into multiple patches and assign each patch to a GPU.

      • Reuse the pre-computed feature maps from the previous timestep to provide context for the current step.

    • DistriFusion → Run a diffusion model across multiple GPUs in parallel.
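
A toy, single-process sketch of displaced patch parallelism follows. It assumes a 1-D "image", a 3-tap averaging layer as a stand-in for any operation that needs spatial context, and a loop iteration in place of each GPU; none of these names or choices come from the DistriFusion code.

```python
def smooth(values, i):
    """3-tap average with edge clamping: a stand-in for a context-dependent layer."""
    left = values[max(i - 1, 0)]
    right = values[min(i + 1, len(values) - 1)]
    return (left + values[i] + right) / 3.0


def displaced_patch_step(fresh, stale, num_patches):
    """Compute one layer patch by patch.

    Each patch sees fresh activations for its own region and stale activations
    (from the previous timestep) everywhere else. Assumes len(fresh) is
    divisible by num_patches.
    """
    size = len(fresh) // num_patches
    out = []
    for p in range(num_patches):        # each iteration stands in for one GPU
        lo, hi = p * size, (p + 1) * size
        view = stale[:lo] + fresh[lo:hi] + stale[hi:]
        out.extend(smooth(view, i) for i in range(lo, hi))
    return out


fresh = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
full = [smooth(fresh, i) for i in range(len(fresh))]

exact = displaced_patch_step(fresh, fresh, num_patches=2)       # stale == fresh
approx = displaced_patch_step(fresh, [0.0] * 8, num_patches=2)  # very stale context
```

When the stale activations equal the fresh ones, the patched result matches the full computation; with outdated context, only values near patch boundaries deviate. The approach relies on adjacent timesteps being similar enough for that deviation to stay small.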

  • SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation [Paper]

    • VinAI Research, Vietnam

    • Knowledge distillation: Distill a pre-trained multi-step text-to-image model into a student network that can generate images in a single inference step.

Support compatibility of add-on modules (ControlNets and LoRAs)

  • X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model [Paper] [Homepage] [Code]

    • NUS & Tencent & FDU

    • Enable pre-trained add-on modules (ControlNet, LoRA) to work with the upgraded diffusion model (SDXL) without further retraining.

Support arbitrary image size

  • ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation [Paper] [Homepage] [Code]

    • Rice University

    • Enable pre-trained text-to-image diffusion models to generate images with various sizes.

    • Decouple the generation trajectory of a pre-trained model into local and global signals.

      • The local signal controls low-level pixel information and can be estimated on local patches.

      • The global signal is used to maintain overall structural consistency and is estimated with a reference image.

Improve image quality

  • FreeU: Free Lunch in Diffusion U-Net [Paper] [Homepage] [Code]

    • NTU

    • Key insight

      • Use two modulation factors to re-weight the feature contributions from the U-Net’s skip connections and backbone.

      • Increasing the backbone scaling factor b significantly enhances image quality.

      • Scaling the skip features directly by s has only a limited influence on image synthesis quality.

    • FreeU

      • Improve the generation quality with only a few lines of code.

      • Only two scaling factors need to be adjusted during inference.
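
The re-weighting idea can be sketched in a few lines. This is a deliberately simplified illustration that scales whole feature vectors uniformly; the paper's actual modulation is more selective, and the function name `freeu_merge` and the example values are assumptions made here.

```python
def freeu_merge(backbone, skip, b=1.0, s=1.0):
    """Re-weight backbone features by b and skip features by s, then
    concatenate them as a U-Net decoder stage would."""
    return [b * v for v in backbone] + [s * v for v in skip]


backbone_features = [1.0, 2.0]
skip_features = [3.0]

plain = freeu_merge(backbone_features, skip_features)             # b = s = 1
reweighted = freeu_merge(backbone_features, skip_features, b=1.5, s=0.5)
```

With `b = s = 1` the merge reduces to the ordinary concatenation, so the change is confined to inference and needs no retraining.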


  • On the Scalability of Diffusion-based Text-to-Image Generation [Paper]

    • AWS AI Labs & Amazon AGI

    • An empirical study of the scaling properties of diffusion-based text-to-image models.

    • Perform ablations on scaling both the denoising backbone and the training set, training scaled U-Net and Transformer variants ranging from 0.4B to 4B parameters on datasets of up to 600M images.

    • Specifically

      • Model scaling

        • The location and amount of cross-attention account for the performance differences among U-Net designs.

        • To improve text-image alignment, increasing the number of transformer blocks is more parameter-efficient than increasing the number of channels.

        • Identify an efficient U-Net variant.

      • Data scaling

        • The quality and diversity of the training set matter more than dataset size alone.

      • Provide scaling functions that predict text-image alignment performance as a function of model size, compute, and dataset size.
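
To show what fitting such a scaling function can look like, the sketch below fits a power law in log-log space. The power-law form is an assumption made for this example (the notes do not state the functional form the paper uses), and the data points are synthetic, generated from a known curve rather than taken from any experiment.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log y = log a + k * log x; returns (a, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    k = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum(
        (u - mx) ** 2 for u in lx
    )
    a = math.exp(my - k * mx)
    return a, k


def predict(a, k, x):
    """Evaluate the fitted power law y = a * x ** k."""
    return a * x ** k


# Synthetic data: model sizes in parameters and a made-up metric that follows
# y = 2 * x ** 0.1 exactly, so the fit should recover a = 2 and k = 0.1.
sizes = [0.4e9, 1.0e9, 2.0e9, 4.0e9]
metric = [2.0 * s ** 0.1 for s in sizes]

a, k = fit_power_law(sizes, metric)
extrapolated = predict(a, k, 8.0e9)     # prediction for an unseen model size
```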
