# Mirror of https://github.com/Stability-AI/stablediffusion.git (synced 2025-01-07 07:01:08 +00:00)
cff-version: 1.2.0
message: If you use this software, please cite it using these metadata.
title: High-Resolution Image Synthesis with Latent Diffusion Models
authors:
  - family-names: Rombach
    given-names: Robin
  - family-names: Blattmann
    given-names: Andreas
  - family-names: Lorenz
    given-names: Dominik
  - family-names: Esser
    given-names: Patrick
  - family-names: Ommer
    given-names: Björn
year: 2021
doi: 10.48550/arXiv.2112.10752
abstract: |
  By decomposing the image formation process into a sequential application of denoising
  autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on
  image data and beyond. Additionally, their formulation allows for a guiding mechanism
  to control the image generation process without retraining. However, since these
  models typically operate directly in pixel space, optimization of powerful DMs often
  consumes hundreds of GPU days and inference is expensive due to sequential
  evaluations. To enable DM training on limited computational resources while retaining
  their quality and flexibility, we apply them in the latent space of powerful
  pretrained autoencoders. In contrast to previous work, training diffusion models on
  such a representation allows for the first time to reach a near-optimal point between
  complexity reduction and detail preservation, greatly boosting visual fidelity.
  By introducing cross-attention layers into the model architecture, we turn diffusion
  models into powerful and flexible generators for general conditioning inputs such as
  text or bounding boxes and high-resolution synthesis becomes possible in a
  convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the
  art for image inpainting and highly competitive performance on various tasks,
  including unconditional image generation, semantic scene synthesis, and
  super-resolution, while significantly reducing computational requirements compared to
  pixel-based DMs.
input:
  - format: arXiv
    id: 2112.10752
    type: article
    url: https://arxiv.org/abs/2112.10752
output:
  - format: PDF
    url: https://arxiv.org/pdf/2112.10752.pdf