Diffusion Models
Image Generation
Diffusion Transformer (DiT)
FLUX.1 [Code]
Black Forest Labs
Text-to-image generation
Models
FLUX.1-schnell: https://huggingface.co/black-forest-labs/FLUX.1-schnell
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (arXiv:2403.03206) [arXiv] [Blog]
Stability AI
Stable Diffusion 3 (SD3)
Multimodal Diffusion Transformer (MMDiT)
Models
Stable Diffusion 3 Medium: https://huggingface.co/stabilityai/stable-diffusion-3-medium
UNet
Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis [Technical Report]
Kuaishou Kolors
Text-to-image generation
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (arXiv:2307.01952) [arXiv]
Stability AI
High-Resolution Image Synthesis with Latent Diffusion Models (CVPR 2022) [Paper] [arXiv] [Code]
LMU Munich & Runway ML
Latent Diffusion Models (LDMs)
Models
Stable-Diffusion-v1-5: https://huggingface.co/runwayml/stable-diffusion-v1-5
Initialized with the weights of the Stable-Diffusion-v1-2 checkpoint and subsequently fine-tuned for 595k steps at resolution 512x512.
Video Generation
Stable Video 4D (SV4D)
Stability AI
Model: https://huggingface.co/stabilityai/sv4d
Generate 40 frames (5 video frames x 8 camera views) at 576x576 resolution, given 5 reference frames of the same size.
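The frame count above is a simple product of video frames and camera views. The following is a minimal sketch of that arithmetic, not SV4D's actual API: the helper name `sv4d_output_grid` and the frame-major ordering are hypothetical, chosen only for illustration.

```python
from itertools import product

# Values taken from the model description above.
NUM_FRAMES = 5      # video frames
NUM_VIEWS = 8       # camera views per frame
HEIGHT = WIDTH = 576


def sv4d_output_grid(num_frames=NUM_FRAMES, num_views=NUM_VIEWS):
    """Return every (frame_index, view_index) pair.

    The frame-major ordering is an assumption for illustration;
    the real model may lay out its outputs differently.
    """
    return list(product(range(num_frames), range(num_views)))


grid = sv4d_output_grid()
total_images = len(grid)                        # 5 x 8 generated images
total_pixels = total_images * HEIGHT * WIDTH    # pixels per channel across all outputs

print(f"{total_images} images of {HEIGHT}x{WIDTH}")
```

Each entry in `grid` identifies one generated image by its time step and camera view, which is a convenient way to index the outputs when arranging them into per-view videos.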
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets (arXiv:2311.15127) [arXiv] [Blog]
Stability AI
Stable Video Diffusion (SVD)
Text-to-video and image-to-video generation
Models
https://huggingface.co/stabilityai/stable-video-diffusion-img2vid
Generate 14 frames at resolution 576x1024 given a context frame of the same size.
https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt
Fine-tuned from SVD-img2vid.
Generate 25 frames at resolution 576x1024 given a context frame of the same size.
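To compare the two SVD checkpoints side by side, the sketch below records their frame counts and resolution and derives the latent tensor shape each would denoise. The 8x spatial downsampling factor and 4 latent channels are assumptions (typical of Stable Diffusion style autoencoders), not values stated in this list, and the `SVDVariant` class and `latent_shape` helper are hypothetical names.

```python
from dataclasses import dataclass

# Assumptions, not stated in the list above: an SD-style autoencoder
# with 8x spatial downsampling and 4 latent channels.
VAE_DOWNSAMPLE = 8
LATENT_CHANNELS = 4


@dataclass(frozen=True)
class SVDVariant:
    """Frame count and resolution for one SVD checkpoint (hypothetical helper)."""
    name: str
    num_frames: int
    height: int = 576
    width: int = 1024


def latent_shape(variant):
    """Return (frames, channels, latent_height, latent_width) under the assumptions above."""
    return (
        variant.num_frames,
        LATENT_CHANNELS,
        variant.height // VAE_DOWNSAMPLE,
        variant.width // VAE_DOWNSAMPLE,
    )


svd = SVDVariant("stable-video-diffusion-img2vid", num_frames=14)
svd_xt = SVDVariant("stable-video-diffusion-img2vid-xt", num_frames=25)

for variant in (svd, svd_xt):
    print(variant.name, latent_shape(variant))
```

Under these assumptions the two checkpoints differ only in the frame dimension, so the XT variant processes roughly 1.8 times as many latent frames per clip.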
Acronyms
LLM: Large Language Model