# CVPR 2024

## Meta Info

Homepage: <https://cvpr.thecvf.com/Conferences/2024>

Paper list: <https://cvpr.thecvf.com/Conferences/2024/AcceptedPapers>

## Papers

### Diffusion Models

#### Acceleration

* Cache Me if You Can: Accelerating Diffusion Models through Block Caching \[[Paper](https://openaccess.thecvf.com/content/CVPR2024/html/Wimbauer_Cache_Me_if_You_Can_Accelerating_Diffusion_Models_through_Block_CVPR_2024_paper.html)] \[[Homepage](https://fwmb.github.io/blockcaching/)]
  * Meta & TUM & MCML & Oxford
  * **Block caching**
    * Reuse outputs from layer blocks of previous steps to speed up inference.
    * Automatically determine caching schedules based on how much each block's output changes over timesteps.
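The caching idea above can be sketched in a few lines (a toy illustration; the wrapper, the schedule, and the block function are all hypothetical, not the paper's implementation):

```python
# Toy sketch of block caching: recompute a block only at scheduled
# timesteps, reuse the cached output everywhere else.

def make_cached_block(block_fn, recompute_steps):
    """Wrap a layer block so it is recomputed only at scheduled timesteps."""
    cache = {"out": None}

    def cached(x, t):
        if t in recompute_steps or cache["out"] is None:
            cache["out"] = block_fn(x, t)  # expensive path
        return cache["out"]                # cheap cached reuse otherwise
    return cached

# Toy block that records how often the expensive path runs.
calls = []
def block(x, t):
    calls.append(t)
    return x + t

cached_block = make_cached_block(block, recompute_steps={9, 5, 0})
outs = [cached_block(1.0, t) for t in range(9, -1, -1)]
print(len(calls))  # the block ran on only 3 of the 10 steps
```

In the real method the schedule is derived automatically from how strongly each block's output changes across timesteps, rather than being fixed by hand as here.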
* CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model \[[Paper](https://openaccess.thecvf.com/content/CVPR2024/html/Zeng_CAT-DM_Controllable_Accelerated_Virtual_Try-on_with_Diffusion_Model_CVPR_2024_paper.html)] \[[Code](https://github.com/zengjianhao/CAT-DM)]
  * TJU & Tencent
  * **CAT-DM**: **C**ontrollable **A**ccelerated virtual **T**ry-on with **D**iffusion **M**odel
  * Initiate the reverse denoising process from an implicit distribution generated by a pre-trained GAN-based model → Reduce the number of sampling steps
* DeepCache: Accelerating Diffusion Models for Free \[[Paper](https://openaccess.thecvf.com/content/CVPR2024/html/Ma_DeepCache_Accelerating_Diffusion_Models_for_Free_CVPR_2024_paper.html)] \[[Code](https://github.com/horseee/DeepCache)]
  * NUS
  * Utilize the inherent temporal redundancy observed in the sequential denoising steps of diffusion models.
  * Cache and retrieve features across adjacent denoising stages, thereby reducing redundant computations.
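This cache-and-reuse pattern across adjacent steps can be illustrated with a toy loop (hypothetical structure, not the DeepCache codebase; the real method caches high-level U-Net features):

```python
# Toy sketch: shallow layers run every step, expensive "deep" features are
# refreshed only every few steps and otherwise retrieved from the cache.

def denoise(x, t, cache, refresh_interval=3):
    shallow = x * 0.9 + 0.1                 # cheap shallow layers: always run
    if t % refresh_interval == 0 or cache["feat"] is None:
        cache["feat"] = shallow ** 2        # expensive deep layers
        cache["computed"] += 1
    return shallow + cache["feat"]          # combine shallow + cached deep

cache = {"feat": None, "computed": 0}
x = 1.0
for t in range(10, 0, -1):
    x = denoise(x, t, cache)
print(cache["computed"])  # deep layers ran on only 4 of 10 steps
```

The closer two denoising steps are in time, the more redundant their deep features, which is why a fixed refresh interval already recovers most of the quality.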
* DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models \[[Paper](https://openaccess.thecvf.com/content/CVPR2024/html/Li_DistriFusion_Distributed_Parallel_Inference_for_High-Resolution_Diffusion_Models_CVPR_2024_paper.html)] \[[Homepage](https://hanlab.mit.edu/projects/distrifusion)] \[[Code](https://github.com/mit-han-lab/distrifuser)]
  * MIT & Princeton & Lepton AI & NVIDIA
  * Displaced patch parallelism
    * Split the model input into multiple patches and assign each patch to a GPU.
    * Reuse the pre-computed feature maps from the previous timestep to provide context for the current step.
  * **DistriFusion** → Enable running diffusion models across multiple GPUs in parallel
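Displaced patch parallelism can be sketched on plain lists (purely illustrative; the real system exchanges feature maps between GPUs, while here each "device" is a loop iteration):

```python
# Toy sketch: each patch is denoised independently, using STALE boundary
# values from its neighbors at the PREVIOUS timestep as context, so no
# synchronization is needed within a step.

def step_patch(patch, left_ctx, right_ctx):
    # Pretend denoising: mix each value with stale neighbor context.
    return [(left_ctx + p + right_ctx) / 3 for p in patch]

def parallel_step(patches, prev_patches):
    new = []
    for i, patch in enumerate(patches):
        left = prev_patches[i - 1][-1] if i > 0 else 0.0
        right = prev_patches[i + 1][0] if i < len(patches) - 1 else 0.0
        new.append(step_patch(patch, left, right))
    return new

patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 patches ~ 3 GPUs
prev = patches
for _ in range(3):
    patches, prev = parallel_step(patches, prev), patches
print(patches)
```

The one-step displacement is the key trade: patches see slightly outdated context from their neighbors, but every patch can be computed in parallel on its own device.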
* SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation \[[Paper](https://openaccess.thecvf.com/content/CVPR2024/html/Nguyen_SwiftBrush_One-Step_Text-to-Image_Diffusion_Model_with_Variational_Score_Distillation_CVPR_2024_paper.html)]
  * VinAI Research, Vietnam
  * **Knowledge distillation:** Distill a pre-trained multi-step text-to-image model to a student network that can generate images with *just a single inference step*.

#### Support compatibility of add-on modules (ControlNets and LoRAs)

* X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model \[[Paper](https://openaccess.thecvf.com/content/CVPR2024/html/Ran_X-Adapter_Adding_Universal_Compatibility_of_Plugins_for_Upgraded_Diffusion_Model_CVPR_2024_paper.html)] \[[Homepage](https://showlab.github.io/X-Adapter/)] \[[Code](https://github.com/showlab/X-Adapter)]
  * NUS & Tencent & FDU
  * Enable pre-trained add-on modules (e.g., ControlNet, LoRA) to work with an upgraded diffusion model (e.g., SDXL) *without further retraining*.

#### Support arbitrary image size

* ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation \[[Paper](https://openaccess.thecvf.com/content/CVPR2024/html/Haji-Ali_ElasticDiffusion_Training-free_Arbitrary_Size_Image_Generation_through_Global-Local_Content_Separation_CVPR_2024_paper.html)] \[[Homepage](https://elasticdiffusion.github.io)] \[[Code](https://github.com/MoayedHajiAli/ElasticDiffusion-official)]
  * Rice University
  * Enable pre-trained text-to-image diffusion models to generate images at arbitrary sizes.
  * Decouple the generation trajectory of a pre-trained model into local and global signals.
    * The local signal controls low-level pixel information and can be estimated on local patches.
    * The global signal is used to maintain overall structural consistency and is estimated with a reference image.
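The global/local decomposition can be illustrated on a raw image array (a hypothetical stand-in: in the paper the signals decompose the *denoising trajectory*, not pixels, and the names below are made up):

```python
import numpy as np

def local_signal(image, patch=4):
    # Low-level detail, estimated independently on each patch.
    out = np.empty_like(image)
    h, w = image.shape
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            blk = image[i:i + patch, j:j + patch]
            out[i:i + patch, j:j + patch] = blk - blk.mean()  # high-freq part
    return out

def global_signal(image, patch=4):
    # Coarse structure: per-patch means, broadcast back to full resolution.
    h, w = image.shape
    coarse = image.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    return np.kron(coarse, np.ones((patch, patch)))

img = np.arange(64, dtype=float).reshape(8, 8)
recon = global_signal(img) + local_signal(img)
print(np.allclose(recon, img))  # the two signals recompose the original
```

The point of the split is that the local part can be computed patch-by-patch at any target size, while the global part is estimated once at the resolution the model was trained on.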

#### Improve image quality

* FreeU: Free Lunch in Diffusion U-Net \[[Paper](https://openaccess.thecvf.com/content/CVPR2024/html/Si_FreeU_Free_Lunch_in_Diffusion_U-Net_CVPR_2024_paper.html)] \[[Homepage](https://chenyangsi.top/FreeU/)] \[[Code](https://github.com/ChenyangSi/FreeU)]
  * NTU
  * Key insight
    * Use two modulation factors to re-weight the feature contributions from the U-Net’s skip connections and backbone.
    * Increasing the backbone scaling factor *b* significantly enhances image quality.
    * Directly scaling the skip features with *s* has limited influence on image synthesis quality.
  * **FreeU**
    * Improve the generation quality with only a few lines of code.
    * Only need to adjust two scaling factors during the inference.
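A minimal sketch of the two scaling factors at a U-Net decoder stage (illustrative tensors; the actual method scales only a subset of backbone channels and applies *s* in the Fourier domain of the skip features):

```python
import numpy as np

def freeu_merge(backbone, skip, b=1.2, s=0.9):
    # Amplify backbone features (semantics), attenuate skip features
    # (high-frequency detail), then concatenate along the channel axis
    # as a U-Net decoder stage would.
    return np.concatenate([b * backbone, s * skip], axis=0)

backbone = np.ones((4, 8, 8))  # (channels, H, W), dummy features
skip = np.ones((4, 8, 8))
merged = freeu_merge(backbone, skip)
print(merged.shape)  # (8, 8, 8)
```

Because only `b` and `s` are tuned at inference, the technique requires no retraining and no extra parameters, which is where the "free lunch" comes from.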

#### Scalability

* On the Scalability of Diffusion-based Text-to-Image Generation \[[Paper](https://openaccess.thecvf.com/content/CVPR2024/html/Li_On_the_Scalability_of_Diffusion-based_Text-to-Image_Generation_CVPR_2024_paper.html)]
  * AWS AI Labs & Amazon AGI
  * An empirical study of the scaling properties of diffusion-based text-to-image models.
  * Perform ablations on scaling both the denoising backbone and the training set, training scaled U-Net and Transformer variants ranging from 0.4B to 4B parameters on datasets of up to 600M images.
  * Specifically
    * Model scaling
      * The location and amount of cross-attention distinguish performance among UNet designs.
      * For improving text-image alignment, increasing the number of transformer blocks is more parameter-efficient than increasing channel counts.
      * Identify an efficient UNet variant.
    * Data scaling
      * The quality and diversity of the training set matter more than sheer dataset size.
    * Provide scaling functions to predict the text-image alignment performance as functions of the scale of model size, compute, and dataset size.
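Scaling functions of this kind are typically power laws fit in log space; a generic sketch (the data points below are synthetic, not the paper's measurements):

```python
import numpy as np

# Fit  metric ≈ a * C**k  to (compute, metric) pairs via linear regression
# in log-log space, then extrapolate to a larger compute budget.

compute = np.array([1e18, 1e19, 1e20, 1e21])
metric = 0.5 * compute ** 0.05          # synthetic "alignment score"

k, log_a = np.polyfit(np.log(compute), np.log(metric), 1)
a = np.exp(log_a)
pred = a * 1e22 ** k                    # extrapolate one order of magnitude
print(round(k, 3))
```

With clean power-law data the fit recovers the exponent exactly; on real measurements the same recipe yields a predictive curve for alignment as a function of model size, compute, or dataset size.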
