mirror of
https://github.com/Stability-AI/stablediffusion.git
synced 2024-12-23 08:04:59 +00:00
commit
cc77f2300d
6 changed files with 212 additions and 16 deletions
165
.gitignore
vendored
Normal file
@@ -0,0 +1,165 @@
+# Generated by project
+outputs/
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# General MacOS
+.DS_Store
+.AppleDouble
+.LSOverride
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+# For a library or package, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+# in version control.
+# https://pdm.fming.dev/#use-with-ide
+.pdm.toml
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# IDEs
+.idea/
+.vscode/
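Note on the glob patterns in this `.gitignore`: a rough way to see what a character-class pattern such as `*.py[cod]` covers is Python's standard `fnmatch` module. This is only an approximation. Git's ignore rules add directory and anchoring semantics (trailing `/`, leading `/`) that `fnmatch` does not model, and the file names and the `is_ignored` helper below are made up for illustration.

```python
from fnmatch import fnmatchcase

# File-name patterns copied from the .gitignore in this commit.
PATTERNS = ["*.py[cod]", "*$py.class", "*.so", "*.egg"]


def is_ignored(filename: str) -> bool:
    """Return True if the bare file name matches any pattern.

    fnmatch-based approximation; it does not implement Git's full rules.
    """
    return any(fnmatchcase(filename, pattern) for pattern in PATTERNS)


if __name__ == "__main__":
    # Hypothetical file names, chosen only to exercise the patterns.
    for name in ["attention.pyc", "attention.py", "Foo$py.class", "ext.so"]:
        print(name, is_ignored(name))
```

The `[cod]` class matches exactly one of `c`, `o`, or `d`, so compiled `.pyc`, `.pyo`, and `.pyd` files are covered while plain `.py` sources are not.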
34
README.md
@@ -1,12 +1,24 @@
-# Stable Diffusion 2.0
+# Stable Diffusion Version 2
 ![t2i](assets/stable-samples/txt2img/768/merged-0006.png)
 ![t2i](assets/stable-samples/txt2img/768/merged-0002.png)
 ![t2i](assets/stable-samples/txt2img/768/merged-0005.png)
 
 This repository contains [Stable Diffusion](https://github.com/CompVis/stable-diffusion) models trained from scratch and will be continuously updated with
 new checkpoints. The following list provides an overview of all currently available models. More coming soon.
 
 ## News
-**November 2022**
+
+**December 7, 2022**
+
+*Version 2.1*
+
+- New stable diffusion model (_Stable Diffusion 2.1-v_, [HuggingFace](https://huggingface.co/stabilityai/stable-diffusion-2-1)) at 768x768 resolution and (_Stable Diffusion 2.1-base_, [HuggingFace](https://huggingface.co/stabilityai/stable-diffusion-2-1-base)) at 512x512 resolution, both based on the same number of parameters and architecture as 2.0 and fine-tuned on 2.0, on a less restrictive NSFW filtering of the [LAION-5B](https://laion.ai/blog/laion-5b/) dataset.
+Per default, the attention operation of the model is evaluated at full precision when `xformers` is not installed. To enable fp16 (which can cause numerical instabilities with the vanilla attention module on the v2.1 model) , run your script with `ATTN_PRECISION=fp16 python <thescript.py>`
+
+**November 24, 2022**
+
+*Version 2.0*
+
 - New stable diffusion model (_Stable Diffusion 2.0-v_) at 768x768 resolution. Same number of parameters in the U-Net as 1.5, but uses [OpenCLIP-ViT/H](https://github.com/mlfoundations/open_clip) as the text encoder and is trained from scratch. _SD 2.0-v_ is a so-called [v-prediction](https://arxiv.org/abs/2202.00512) model.
 - The above model is finetuned from _SD 2.0-base_, which was trained as a standard noise-prediction model on 512x512 images and is also made available.
 - Added a [x4 upscaling latent text-guided diffusion model](#image-upscaling-with-stable-diffusion).
@@ -54,7 +66,7 @@ Installation needs a somewhat recent version of nvcc and gcc/g++, obtain those,
 export CUDA_HOME=/usr/local/cuda-11.4
 conda install -c nvidia/label/cuda-11.4.0 cuda-nvcc
 conda install -c conda-forge gcc
-conda install -c conda-forge gxx_linux-64=9.5.0
+conda install -c conda-forge gxx_linux-64==9.5.0
 ```
 
 Then, run the following (compiling takes up to 30 min).
@@ -80,11 +92,11 @@ The weights are available via [the StabilityAI organization at Hugging Face](htt
 
 
 
-## Stable Diffusion v2.0
+## Stable Diffusion v2
 
-Stable Diffusion v2.0 refers to a specific configuration of the model
+Stable Diffusion v2 refers to a specific configuration of the model
 architecture that uses a downsampling-factor 8 autoencoder with an 865M UNet
-and OpenCLIP ViT-H/14 text encoder for the diffusion model. The _SD 2.0-v_ model produces 768x768 px outputs.
+and OpenCLIP ViT-H/14 text encoder for the diffusion model. The _SD 2-v_ model produces 768x768 px outputs.
 
 Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0,
 5.0, 6.0, 7.0, 8.0) and 50 DDIM sampling steps show the relative improvements of the checkpoints:
@@ -97,16 +109,16 @@ Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0,
 ![txt2img-stable2](assets/stable-samples/txt2img/merged-0003.png)
 ![txt2img-stable2](assets/stable-samples/txt2img/merged-0001.png)
 
-Stable Diffusion 2.0 is a latent diffusion model conditioned on the penultimate text embeddings of a CLIP ViT-H/14 text encoder.
+Stable Diffusion 2 is a latent diffusion model conditioned on the penultimate text embeddings of a CLIP ViT-H/14 text encoder.
 We provide a [reference script for sampling](#reference-sampling-script).
 #### Reference Sampling Script
 
 This script incorporates an [invisible watermarking](https://github.com/ShieldMnt/invisible-watermark) of the outputs, to help viewers [identify the images as machine-generated](scripts/tests/test_watermark.py).
-We provide the configs for the _SD2.0-v_ (768px) and _SD2.0-base_ (512px) model.
+We provide the configs for the _SD2-v_ (768px) and _SD2-base_ (512px) model.
 
-First, download the weights for [_SD2.0-v_](https://huggingface.co/stabilityai/stable-diffusion-2) and [_SD2.0-base_](https://huggingface.co/stabilityai/stable-diffusion-2-base).
+First, download the weights for [_SD2.1-v_](https://huggingface.co/stabilityai/stable-diffusion-2-1) and [_SD2.1-base_](https://huggingface.co/stabilityai/stable-diffusion-2-1-base).
 
-To sample from the _SD2.0-v_ model, run the following:
+To sample from the _SD2.1-v_ model, run the following:
 
 ```
 python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768
@@ -152,7 +164,7 @@ and the diffusion model is then conditioned on the (relative) depth output.
 
 <p align="center">
 <b> depth2image </b><br/>
-<img src=assets/stable-samples/depth2img/d2i.gif/>
+<img src=assets/stable-samples/depth2img/d2i.gif>
 </p>
 
 This model is particularly useful for a photorealistic style; see the [examples](assets/stable-samples/depth2img).

@@ -390,7 +390,7 @@ class DDPM(pl.LightningModule):
         elif self.parameterization == "v":
             target = self.get_v(x_start, noise, t)
         else:
-            raise NotImplementedError(f"Paramterization {self.parameterization} not yet supported")
+            raise NotImplementedError(f"Parameterization {self.parameterization} not yet supported")
 
         loss = self.get_loss(model_out, target, mean=False).mean(dim=[1, 2, 3])
 
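For context on the `"v"` branch touched by this hunk: in the v-prediction formulation (https://arxiv.org/abs/2202.00512, linked from the README), the regression target is v = alpha * noise - sigma * x0, where alpha^2 + sigma^2 = 1. The sketch below is a scalar, pure-Python illustration under that assumption. `v_target`, `noisy_sample`, and `x0_from_v` are hypothetical helper names, not the repository's `get_v`, and `alpha_bar` stands in for the cumulative noise-schedule product at a given timestep.

```python
import math


def _coeffs(alpha_bar: float) -> tuple[float, float]:
    """Return (alpha, sigma) with alpha**2 + sigma**2 == 1."""
    return math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)


def noisy_sample(x0: float, noise: float, alpha_bar: float) -> float:
    """Forward diffusion: z = alpha * x0 + sigma * noise."""
    alpha, sigma = _coeffs(alpha_bar)
    return alpha * x0 + sigma * noise


def v_target(x0: float, noise: float, alpha_bar: float) -> float:
    """v-prediction target: v = alpha * noise - sigma * x0."""
    alpha, sigma = _coeffs(alpha_bar)
    return alpha * noise - sigma * x0


def x0_from_v(z: float, v: float, alpha_bar: float) -> float:
    """Invert the parameterization: x0 = alpha * z - sigma * v."""
    alpha, sigma = _coeffs(alpha_bar)
    return alpha * z - sigma * v


if __name__ == "__main__":
    # Made-up values, only to show that v and z together determine x0.
    x0, noise, alpha_bar = 0.8, -0.3, 0.64
    z = noisy_sample(x0, noise, alpha_bar)
    v = v_target(x0, noise, alpha_bar)
    print(abs(x0_from_v(z, v, alpha_bar) - x0) < 1e-12)
```

Substituting z and v into `x0_from_v` gives (alpha^2 + sigma^2) * x0, which is why a model trained on v can still be used to recover a clean-sample estimate.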

@@ -16,6 +16,9 @@ try:
 except:
     XFORMERS_IS_AVAILBLE = False
 
+# CrossAttn precision handling
+import os
+_ATTN_PRECISION = os.environ.get("ATTN_PRECISION", "fp32")
 
 def exists(val):
     return val is not None
@@ -167,7 +170,14 @@ class CrossAttention(nn.Module):
 
         q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q, k, v))
 
-        sim = einsum('b i d, b j d -> b i j', q, k) * self.scale
+        # force cast to fp32 to avoid overflowing
+        if _ATTN_PRECISION =="fp32":
+            with torch.autocast(enabled=False, device_type = 'cuda'):
+                q, k = q.float(), k.float()
+                sim = einsum('b i d, b j d -> b i j', q, k) * self.scale
+        else:
+            sim = einsum('b i d, b j d -> b i j', q, k) * self.scale
+
         del q, k
 
         if exists(mask):
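The motivation for the fp32 cast in this hunk can be illustrated without a GPU: IEEE half precision cannot represent finite values above 65504, so a large query-key product overflows before the scale factor is applied. The sketch below is a rough emulation, not the repository's code. It uses the standard library's `struct` half-float format as a stand-in for fp16 tensors, Python floats as a stand-in for fp32, and mirrors the `ATTN_PRECISION` lookup shown above. `to_fp16` and `similarity` are hypothetical helpers and the input values are made up.

```python
import math
import os
import struct

# Same lookup as in the hunk above: full precision unless overridden.
_ATTN_PRECISION = os.environ.get("ATTN_PRECISION", "fp32")


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision, mapping overflow to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses finite values that do not fit; fp16 hardware yields inf.
        return math.copysign(math.inf, x)


def similarity(q, k, scale, precision=_ATTN_PRECISION):
    """Scaled dot product of two vectors at the requested precision."""
    if precision == "fp32":
        # Python floats are wider than fp32; used here only as a stand-in.
        return sum(a * b for a, b in zip(q, k)) * scale
    acc = 0.0
    for a, b in zip(q, k):
        # Every intermediate result is rounded to half precision.
        acc = to_fp16(acc + to_fp16(to_fp16(a) * to_fp16(b)))
    return to_fp16(acc * scale)


if __name__ == "__main__":
    q = k = [300.0] * 4  # made-up activations; 300 * 300 exceeds 65504
    print(similarity(q, k, 0.125, "fp32"))
    print(similarity(q, k, 0.125, "fp16"))
```

In the half-precision path the product overflows to infinity before scaling, which is the kind of instability the README note about `ATTN_PRECISION=fp16` warns about.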
10
modelcard.md
@@ -80,7 +80,7 @@ Stable Diffusion v2 mirrors and exacerbates biases to such a degree that viewer
 **Training Data**
 The model developers used the following dataset for training the model:
 
-- LAION-5B and subsets (details below). The training data is further filtered using LAION's NSFW detector, with a "p_unsafe" score of 0.1 (conservative). For more details, please refer to LAION-5B's [NeurIPS 2022](https://openreview.net/forum?id=M3Y74vmsMcY) paper and reviewer discussions on the topic.
+- LAION-5B and subsets (details below). The training data is further filtered using LAION's NSFW detector. For more details, please refer to LAION-5B's [NeurIPS 2022](https://openreview.net/forum?id=M3Y74vmsMcY) paper and reviewer discussions on the topic.
 
 **Training Procedure**
 Stable Diffusion v2 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,
@@ -90,7 +90,13 @@ Stable Diffusion v2 is a latent diffusion model which combines an autoencoder wi
 - The output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
 - The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet. We also use the so-called _v-objective_, see https://arxiv.org/abs/2202.00512.
 
-We currently provide the following checkpoints:
+We currently provide the following checkpoints, for various versions:
 
+### Version 2.1
+
+- `512-base-ema.ckpt`: Fine-tuned on `512-base-ema.ckpt` 2.0 with 220k extra steps taken, with `punsafe=0.98` on the same dataset.
+- `768-v-ema.ckpt`: Resumed from `768-v-ema.ckpt` 2.0 with an additional 55k steps on the same dataset (`punsafe=0.1`), and then fine-tuned for another 155k extra steps with `punsafe=0.98`.
+### Version 2.0
+
 - `512-base-ema.ckpt`: 550k steps at resolution `256x256` on a subset of [LAION-5B](https://laion.ai/blog/laion-5b/) filtered for explicit pornographic material, using the [LAION-NSFW classifier](https://github.com/LAION-AI/CLIP-based-NSFW-Detector) with `punsafe=0.1` and an [aesthetic score](https://github.com/christophschuhmann/improved-aesthetic-predictor) >= `4.5`.
 850k steps at resolution `512x512` on the same dataset with resolution `>= 512x512`.
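To make the `punsafe` values in this model card change concrete: assuming (this is an interpretation, not something the model card spells out) that a training sample is kept when the classifier's predicted unsafe probability is at or below the threshold, `punsafe=0.1` is a strict filter and `punsafe=0.98` a permissive one. The scores and the `keep_samples` helper below are invented for illustration.

```python
def keep_samples(scores: list[float], punsafe: float) -> list[float]:
    """Keep samples whose predicted unsafe probability is <= punsafe.

    The <= comparison is an assumption about how the threshold is applied.
    """
    return [score for score in scores if score <= punsafe]


if __name__ == "__main__":
    # Hypothetical classifier outputs for five images.
    scores = [0.01, 0.05, 0.3, 0.7, 0.99]
    print(len(keep_samples(scores, 0.1)), "kept with punsafe=0.1")
    print(len(keep_samples(scores, 0.98)), "kept with punsafe=0.98")
```

Under this reading, the 2.1 checkpoints see a larger share of the dataset than the 2.0 ones, which matches the README's description of "a less restrictive NSFW filtering".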

@@ -13,4 +13,7 @@ transformers==4.19.2
 webdataset==0.2.5
 open-clip-torch==2.7.0
 gradio==3.11
+kornia==0.6
+invisible-watermark>=0.1.5
+streamlit-drawable-canvas==0.8.0
 -e .
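The three added requirement lines mix exact pins (`==`) with one minimum-version constraint (`>=`). As a minimal sketch of how such lines break down into name, operator, and version, the snippet below uses a deliberately small regular expression. `parse_requirement` is a hypothetical helper that only understands `==` and `>=`; real tooling such as pip implements the full requirement grammar, which this does not attempt.

```python
import re

# Handles only "name==version" and "name>=version"; anything else is skipped.
_REQUIREMENT = re.compile(r"^([A-Za-z0-9_.-]+)\s*(==|>=)\s*(\S+)$")


def parse_requirement(line: str):
    """Return (name, operator, version), or None for lines such as '-e .'."""
    match = _REQUIREMENT.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


if __name__ == "__main__":
    added = [
        "kornia==0.6",
        "invisible-watermark>=0.1.5",
        "streamlit-drawable-canvas==0.8.0",
        "-e .",
    ]
    for line in added:
        print(line, "->", parse_requirement(line))
```

The editable install line `-e .` is not a version constraint, so the sketch reports it as `None` rather than trying to interpret it.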