mirror of https://github.com/Stability-AI/stablediffusion.git (synced 2024-12-22 15:44:58 +00:00)

commit dab18ab497: 2 changed files with 26 additions and 11 deletions
README.md (27 changes)

@@ -1,4 +1,4 @@
-# Stable Diffusion 2.0
+# Stable Diffusion Version 2
 ![t2i](assets/stable-samples/txt2img/768/merged-0006.png)
 ![t2i](assets/stable-samples/txt2img/768/merged-0002.png)
 ![t2i](assets/stable-samples/txt2img/768/merged-0005.png)

@@ -8,7 +8,16 @@ new checkpoints. The following list provides an overview of all currently availa
 ## News

-**November 2022**
+**December 7, 2022**
+
+*Version 2.1*
+
+- New stable diffusion model (_Stable Diffusion 2.1-v_, [HuggingFace](https://huggingface.co/stabilityai/stable-diffusion-2-1)) at 768x768 resolution and (_Stable Diffusion 2.1-base_, [HuggingFace](https://huggingface.co/stabilityai/stable-diffusion-2-1-base)) at 512x512 resolution, both based on the same number of parameters and architecture as 2.0 and fine-tuned on 2.0, on a less restrictive NSFW filtering of the [LAION-5B](https://laion.ai/blog/laion-5b/) dataset.
+
+**November 24, 2022**
+
+*Version 2.0*

 - New stable diffusion model (_Stable Diffusion 2.0-v_) at 768x768 resolution. Same number of parameters in the U-Net as 1.5, but uses [OpenCLIP-ViT/H](https://github.com/mlfoundations/open_clip) as the text encoder and is trained from scratch. _SD 2.0-v_ is a so-called [v-prediction](https://arxiv.org/abs/2202.00512) model.
 - The above model is finetuned from _SD 2.0-base_, which was trained as a standard noise-prediction model on 512x512 images and is also made available.
 - Added a [x4 upscaling latent text-guided diffusion model](#image-upscaling-with-stable-diffusion).
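The _v-prediction_ parameterization mentioned in the first 2.0 bullet regresses a velocity-like target, v = alpha_t * eps - sigma_t * x0, rather than the added noise itself, which is what a standard noise-prediction model such as _SD 2.0-base_ regresses. A minimal sketch of the target computation, assuming a standard DDPM cumulative-alpha schedule; the tensor names are illustrative, not the repo's exact API:

```
import torch

def v_target(x0: torch.Tensor, noise: torch.Tensor,
             alphas_cumprod: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """v = alpha_t * eps - sigma_t * x0 (https://arxiv.org/abs/2202.00512).
    A plain noise-prediction model would simply regress `noise` instead."""
    a = alphas_cumprod[t].view(-1, 1, 1, 1)   # cumulative alpha at timestep t
    return a.sqrt() * noise - (1.0 - a).sqrt() * x0
```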

@@ -82,11 +91,11 @@ The weights are available via [the StabilityAI organization at Hugging Face](htt

-## Stable Diffusion v2.0
+## Stable Diffusion v2

-Stable Diffusion v2.0 refers to a specific configuration of the model
+Stable Diffusion v2 refers to a specific configuration of the model
 architecture that uses a downsampling-factor 8 autoencoder with an 865M UNet
-and OpenCLIP ViT-H/14 text encoder for the diffusion model. The _SD 2.0-v_ model produces 768x768 px outputs.
+and OpenCLIP ViT-H/14 text encoder for the diffusion model. The _SD 2-v_ model produces 768x768 px outputs.

 Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0,
 5.0, 6.0, 7.0, 8.0) and 50 DDIM sampling steps show the relative improvements of the checkpoints:
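Classifier-free guidance combines an unconditional and a conditional model prediction at each sampling step; the scales listed above set how far the combination is pushed past the conditional prediction. A minimal sketch, assuming `eps_uncond` and `eps_cond` are the UNet outputs for the empty prompt and the actual prompt (names illustrative):

```
import torch

def guided_eps(eps_uncond: torch.Tensor, eps_cond: torch.Tensor,
               scale: float) -> torch.Tensor:
    # scale = 1.0 recovers the plain conditional prediction; the evaluation
    # above sweeps 1.5 through 8.0, trading sample diversity for prompt adherence.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```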

@@ -99,16 +108,16 @@ Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0,
 ![txt2img-stable2](assets/stable-samples/txt2img/merged-0003.png)
 ![txt2img-stable2](assets/stable-samples/txt2img/merged-0001.png)

-Stable Diffusion 2.0 is a latent diffusion model conditioned on the penultimate text embeddings of a CLIP ViT-H/14 text encoder.
+Stable Diffusion 2 is a latent diffusion model conditioned on the penultimate text embeddings of a CLIP ViT-H/14 text encoder.
 We provide a [reference script for sampling](#reference-sampling-script).

 #### Reference Sampling Script

 This script incorporates an [invisible watermarking](https://github.com/ShieldMnt/invisible-watermark) of the outputs, to help viewers [identify the images as machine-generated](scripts/tests/test_watermark.py).
-We provide the configs for the _SD2.0-v_ (768px) and _SD2.0-base_ (512px) model.
+We provide the configs for the _SD2-v_ (768px) and _SD2-base_ (512px) model.

-First, download the weights for [_SD2.0-v_](https://huggingface.co/stabilityai/stable-diffusion-2) and [_SD2.0-base_](https://huggingface.co/stabilityai/stable-diffusion-2-base).
+First, download the weights for [_SD2.1-v_](https://huggingface.co/stabilityai/stable-diffusion-2-1) and [_SD2.1-base_](https://huggingface.co/stabilityai/stable-diffusion-2-1-base).

-To sample from the _SD2.0-v_ model, run the following:
+To sample from the _SD2.1-v_ model, run the following:

 ```
 python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768
 ```
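For the 512px base checkpoint, the analogous call would swap in the noise-prediction config; a sketch, assuming the repo's `configs/stable-diffusion/v2-inference.yaml` and a locally downloaded checkpoint path:

```
python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/512model.ckpt> --config configs/stable-diffusion/v2-inference.yaml --H 512 --W 512
```

The embedded watermark can then be checked on an output image; a minimal sketch using the `invisible-watermark` package's decoder, where the 136-bit payload length and the output path are assumptions (see scripts/tests/test_watermark.py for the repo's own check):

```
import cv2
from imwatermark import WatermarkDecoder

bgr = cv2.imread("outputs/txt2img-samples/grid-0000.png")  # hypothetical output path
decoder = WatermarkDecoder('bytes', 136)   # assumed payload: 17 bytes = 136 bits
print(decoder.decode(bgr, 'dwtDct'))       # prints the embedded byte string
```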

modelcard.md (10 changes)

@@ -80,7 +80,7 @@ Stable Diffusion v2 mirrors and exacerbates biases to such a degree that viewer
 **Training Data**
 The model developers used the following dataset for training the model:

-- LAION-5B and subsets (details below). The training data is further filtered using LAION's NSFW detector, with a "p_unsafe" score of 0.1 (conservative). For more details, please refer to LAION-5B's [NeurIPS 2022](https://openreview.net/forum?id=M3Y74vmsMcY) paper and reviewer discussions on the topic.
+- LAION-5B and subsets (details below). The training data is further filtered using LAION's NSFW detector. For more details, please refer to LAION-5B's [NeurIPS 2022](https://openreview.net/forum?id=M3Y74vmsMcY) paper and reviewer discussions on the topic.

 **Training Procedure**
 Stable Diffusion v2 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,

@@ -90,7 +90,13 @@ Stable Diffusion v2 is a latent diffusion model which combines an autoencoder wi
 - The output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
 - The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet. We also use the so-called _v-objective_, see https://arxiv.org/abs/2202.00512.

-We currently provide the following checkpoints:
+We currently provide the following checkpoints, for various versions:
+
+### Version 2.1
+
+- `512-base-ema.ckpt`: Fine-tuned on `512-base-ema.ckpt` 2.0 with 220k extra steps taken, with `punsafe=0.98` on the same dataset.
+- `768-v-ema.ckpt`: Resumed from `768-v-ema.ckpt` 2.0 with an additional 55k steps on the same dataset (`punsafe=0.1`), and then fine-tuned for another 155k extra steps with `punsafe=0.98`.
+
+### Version 2.0

 - `512-base-ema.ckpt`: 550k steps at resolution `256x256` on a subset of [LAION-5B](https://laion.ai/blog/laion-5b/) filtered for explicit pornographic material, using the [LAION-NSFW classifier](https://github.com/LAION-AI/CLIP-based-NSFW-Detector) with `punsafe=0.1` and an [aesthetic score](https://github.com/christophschuhmann/improved-aesthetic-predictor) >= `4.5`.
 850k steps at resolution `512x512` on the same dataset with resolution `>= 512x512`.
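Putting the Training Procedure bullets together, one update step looks roughly like the sketch below. Every module name (`encoder`, `text_encoder`, `unet`) and the hard-coded constants are illustrative assumptions, not the repo's actual interfaces:

```
import torch
import torch.nn.functional as F

def training_step(images, caption_tokens, encoder, text_encoder, unet, alphas_cumprod):
    z = encoder(images) * 0.18215                  # encode image to (scaled) latent
    t = torch.randint(0, 1000, (z.shape[0],), device=z.device)
    noise = torch.randn_like(z)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a.sqrt() * z + (1 - a).sqrt() * noise    # add noise to the latent
    context = text_encoder(caption_tokens)         # fed to the UNet via cross-attention
    target = a.sqrt() * noise - (1 - a).sqrt() * z # v-objective target
    return F.mse_loss(unet(z_t, t, context=context), target)
```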