mirror of
https://github.com/Stability-AI/stablediffusion.git
synced 2024-12-22 23:55:00 +00:00
commit
cc77f2300d
6 changed files with 212 additions and 16 deletions
165
.gitignore
vendored
Normal file
@@ -0,0 +1,165 @@
# Generated by project
outputs/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# General MacOS
.DS_Store
.AppleDouble
.LSOverride

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# IDEs
.idea/
.vscode/
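As a rough illustration of how a few of the file-level patterns in the `.gitignore` above behave, the sketch below uses Python's standard `fnmatch` module. This is only an approximation: Git's ignore rules (directory-only trailing slashes, leading-slash anchoring, negation) are richer than `fnmatch`, and the helper name `is_ignored` and the sample file names are hypothetical.

```python
from fnmatch import fnmatchcase

# A handful of file-level patterns copied from the .gitignore above.
PATTERNS = ["*.py[cod]", "*$py.class", "*.so", ".coverage.*", "*.egg"]


def is_ignored(filename, patterns=PATTERNS):
    """Return True if the bare file name matches any pattern.

    Uses fnmatch semantics only, which approximate (but do not fully
    reproduce) Git's ignore matching.
    """
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical file names, chosen only to exercise the patterns.
    for name in ["module.pyc", "module.py", "ext.so", ".coverage.host1"]:
        print(name, is_ignored(name))
```

The character class in `*.py[cod]` requires exactly one trailing `c`, `o`, or `d`, so plain `.py` sources are not matched by it.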
34
README.md
@@ -1,12 +1,24 @@
# Stable Diffusion 2.0
# Stable Diffusion Version 2
![t2i](assets/stable-samples/txt2img/768/merged-0006.png)
![t2i](assets/stable-samples/txt2img/768/merged-0002.png)
![t2i](assets/stable-samples/txt2img/768/merged-0005.png)

This repository contains [Stable Diffusion](https://github.com/CompVis/stable-diffusion) models trained from scratch and will be continuously updated with
new checkpoints. The following list provides an overview of all currently available models. More coming soon.

## News
**November 2022**

**December 7, 2022**

*Version 2.1*

- New stable diffusion model (_Stable Diffusion 2.1-v_, [HuggingFace](https://huggingface.co/stabilityai/stable-diffusion-2-1)) at 768x768 resolution and (_Stable Diffusion 2.1-base_, [HuggingFace](https://huggingface.co/stabilityai/stable-diffusion-2-1-base)) at 512x512 resolution, both based on the same number of parameters and architecture as 2.0 and fine-tuned on 2.0, on a less restrictive NSFW filtering of the [LAION-5B](https://laion.ai/blog/laion-5b/) dataset.
Per default, the attention operation of the model is evaluated at full precision when `xformers` is not installed. To enable fp16 (which can cause numerical instabilities with the vanilla attention module on the v2.1 model) , run your script with `ATTN_PRECISION=fp16 python <thescript.py>`

**November 24, 2022**

*Version 2.0*

- New stable diffusion model (_Stable Diffusion 2.0-v_) at 768x768 resolution. Same number of parameters in the U-Net as 1.5, but uses [OpenCLIP-ViT/H](https://github.com/mlfoundations/open_clip) as the text encoder and is trained from scratch. _SD 2.0-v_ is a so-called [v-prediction](https://arxiv.org/abs/2202.00512) model.
- The above model is finetuned from _SD 2.0-base_, which was trained as a standard noise-prediction model on 512x512 images and is also made available.
- Added a [x4 upscaling latent text-guided diffusion model](#image-upscaling-with-stable-diffusion).
@@ -54,7 +66,7 @@ Installation needs a somewhat recent version of nvcc and gcc/g++, obtain those,
export CUDA_HOME=/usr/local/cuda-11.4
conda install -c nvidia/label/cuda-11.4.0 cuda-nvcc
conda install -c conda-forge gcc
conda install -c conda-forge gxx_linux-64=9.5.0
conda install -c conda-forge gxx_linux-64==9.5.0
```

Then, run the following (compiling takes up to 30 min).
@@ -80,11 +92,11 @@ The weights are available via [the StabilityAI organization at Hugging Face](htt



## Stable Diffusion v2.0
## Stable Diffusion v2

Stable Diffusion v2.0 refers to a specific configuration of the model
Stable Diffusion v2 refers to a specific configuration of the model
architecture that uses a downsampling-factor 8 autoencoder with an 865M UNet
and OpenCLIP ViT-H/14 text encoder for the diffusion model. The _SD 2.0-v_ model produces 768x768 px outputs.
and OpenCLIP ViT-H/14 text encoder for the diffusion model. The _SD 2-v_ model produces 768x768 px outputs.

Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0,
5.0, 6.0, 7.0, 8.0) and 50 DDIM sampling steps show the relative improvements of the checkpoints:
@@ -97,16 +109,16 @@ Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0,
![txt2img-stable2](assets/stable-samples/txt2img/merged-0003.png)
![txt2img-stable2](assets/stable-samples/txt2img/merged-0001.png)

Stable Diffusion 2.0 is a latent diffusion model conditioned on the penultimate text embeddings of a CLIP ViT-H/14 text encoder.
Stable Diffusion 2 is a latent diffusion model conditioned on the penultimate text embeddings of a CLIP ViT-H/14 text encoder.
We provide a [reference script for sampling](#reference-sampling-script).
#### Reference Sampling Script

This script incorporates an [invisible watermarking](https://github.com/ShieldMnt/invisible-watermark) of the outputs, to help viewers [identify the images as machine-generated](scripts/tests/test_watermark.py).
We provide the configs for the _SD2.0-v_ (768px) and _SD2.0-base_ (512px) model.
We provide the configs for the _SD2-v_ (768px) and _SD2-base_ (512px) model.

First, download the weights for [_SD2.0-v_](https://huggingface.co/stabilityai/stable-diffusion-2) and [_SD2.0-base_](https://huggingface.co/stabilityai/stable-diffusion-2-base).
First, download the weights for [_SD2.1-v_](https://huggingface.co/stabilityai/stable-diffusion-2-1) and [_SD2.1-base_](https://huggingface.co/stabilityai/stable-diffusion-2-1-base).

To sample from the _SD2.0-v_ model, run the following:
To sample from the _SD2.1-v_ model, run the following:

```
python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768
@@ -152,7 +164,7 @@ and the diffusion model is then conditioned on the (relative) depth output.

<p align="center">
<b> depth2image </b><br/>
<img src=assets/stable-samples/depth2img/d2i.gif/>
<img src=assets/stable-samples/depth2img/d2i.gif>
</p>

This model is particularly useful for a photorealistic style; see the [examples](assets/stable-samples/depth2img).
@@ -390,7 +390,7 @@ class DDPM(pl.LightningModule):
        elif self.parameterization == "v":
            target = self.get_v(x_start, noise, t)
        else:
            raise NotImplementedError(f"Paramterization {self.parameterization} not yet supported")
            raise NotImplementedError(f"Parameterization {self.parameterization} not yet supported")

        loss = self.get_loss(model_out, target, mean=False).mean(dim=[1, 2, 3])

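The hunk above only shows the call `self.get_v(x_start, noise, t)`; the body of `get_v` is not part of this diff. As a minimal scalar sketch, assuming the v-prediction definition from the paper linked in the README (arXiv:2202.00512), the target can be written as v = sqrt(alpha_bar) * noise - sqrt(1 - alpha_bar) * x_start. The function names and the `alpha_bar` argument (the cumulative noise-schedule product at step `t`) are hypothetical and stand in for the tensor operations in the real model.

```python
import math


def q_sample(x_start, noise, alpha_bar):
    """Forward-diffuse a clean value: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    return math.sqrt(alpha_bar) * x_start + math.sqrt(1.0 - alpha_bar) * noise


def v_target(x_start, noise, alpha_bar):
    """v-prediction target (assumed form): v = sqrt(a) * eps - sqrt(1 - a) * x_0."""
    return math.sqrt(alpha_bar) * noise - math.sqrt(1.0 - alpha_bar) * x_start


def x0_from_v(x_t, v, alpha_bar):
    """Recover x_0 from x_t and v: x_0 = sqrt(a) * x_t - sqrt(1 - a) * v."""
    return math.sqrt(alpha_bar) * x_t - math.sqrt(1.0 - alpha_bar) * v


if __name__ == "__main__":
    # Arbitrary illustrative numbers, not taken from any checkpoint.
    x0, eps, a = 0.5, -1.2, 0.7
    x_t = q_sample(x0, eps, a)
    v = v_target(x0, eps, a)
    print("x_t:", x_t, "v:", v, "recovered x0:", x0_from_v(x_t, v, a))
```

Under this definition the clean sample can be recovered exactly from `x_t` and `v`, which is why a model trained on `v` can still be used to produce denoised estimates.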
@@ -16,6 +16,9 @@ try:
except:
    XFORMERS_IS_AVAILBLE = False

# CrossAttn precision handling
import os
_ATTN_PRECISION = os.environ.get("ATTN_PRECISION", "fp32")

def exists(val):
    return val is not None
@@ -167,7 +170,14 @@ class CrossAttention(nn.Module):

        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q, k, v))

        # force cast to fp32 to avoid overflowing
        if _ATTN_PRECISION =="fp32":
            with torch.autocast(enabled=False, device_type = 'cuda'):
                q, k = q.float(), k.float()
                sim = einsum('b i d, b j d -> b i j', q, k) * self.scale
        else:
            sim = einsum('b i d, b j d -> b i j', q, k) * self.scale

        del q, k

        if exists(mask):
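To make the motivation for the fp32 cast concrete, here is a small standard-library sketch (no PyTorch required). It mirrors the environment lookup from the hunk above and shows that a query-key dot product can exceed the largest finite half-precision value (65504) even when every input fits comfortably. The helpers `attn_precision`, `fits_in_fp16`, and `dot_product`, and the example vectors, are hypothetical and only illustrate the overflow risk; they are not the repository's code.

```python
import os
import struct


def attn_precision(environ=None):
    """Same lookup as in the diff: ATTN_PRECISION, defaulting to "fp32"."""
    environ = os.environ if environ is None else environ
    return environ.get("ATTN_PRECISION", "fp32")


def fits_in_fp16(value):
    """Return True if `value` is representable as a finite IEEE 754 half float."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
    except (OverflowError, struct.error):
        return False
    return True


def dot_product(q, k):
    """Unscaled similarity between one query row and one key row."""
    return sum(a * b for a, b in zip(q, k))


if __name__ == "__main__":
    # Illustrative vectors: each entry is small, but the 64-term sum is large.
    q = [40.0] * 64
    k = [40.0] * 64
    sim = dot_product(q, k)
    print("precision mode:", attn_precision())
    print("inputs fit in fp16:", all(fits_in_fp16(x) for x in q + k))
    print("similarity fits in fp16:", fits_in_fp16(sim))
```

With the default setting the lookup yields `"fp32"`, so the cast path is taken unless the script is launched with `ATTN_PRECISION=fp16` as described in the README hunk.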
10
modelcard.md
@@ -80,7 +80,7 @@ Stable Diffusion v2 mirrors and exacerbates biases to such a degree that viewer
**Training Data**
The model developers used the following dataset for training the model:

- LAION-5B and subsets (details below). The training data is further filtered using LAION's NSFW detector, with a "p_unsafe" score of 0.1 (conservative). For more details, please refer to LAION-5B's [NeurIPS 2022](https://openreview.net/forum?id=M3Y74vmsMcY) paper and reviewer discussions on the topic.
- LAION-5B and subsets (details below). The training data is further filtered using LAION's NSFW detector. For more details, please refer to LAION-5B's [NeurIPS 2022](https://openreview.net/forum?id=M3Y74vmsMcY) paper and reviewer discussions on the topic.

**Training Procedure**
Stable Diffusion v2 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,
@@ -90,7 +90,13 @@ Stable Diffusion v2 is a latent diffusion model which combines an autoencoder wi
- The output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
- The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet. We also use the so-called _v-objective_, see https://arxiv.org/abs/2202.00512.

We currently provide the following checkpoints:
We currently provide the following checkpoints, for various versions:

### Version 2.1

- `512-base-ema.ckpt`: Fine-tuned on `512-base-ema.ckpt` 2.0 with 220k extra steps taken, with `punsafe=0.98` on the same dataset.
- `768-v-ema.ckpt`: Resumed from `768-v-ema.ckpt` 2.0 with an additional 55k steps on the same dataset (`punsafe=0.1`), and then fine-tuned for another 155k extra steps with `punsafe=0.98`.
### Version 2.0

- `512-base-ema.ckpt`: 550k steps at resolution `256x256` on a subset of [LAION-5B](https://laion.ai/blog/laion-5b/) filtered for explicit pornographic material, using the [LAION-NSFW classifier](https://github.com/LAION-AI/CLIP-based-NSFW-Detector) with `punsafe=0.1` and an [aesthetic score](https://github.com/christophschuhmann/improved-aesthetic-predictor) >= `4.5`.
850k steps at resolution `512x512` on the same dataset with resolution `>= 512x512`.
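The checkpoint notes above contrast `punsafe=0.1` with the less restrictive `punsafe=0.98`. A toy sketch of what such a threshold filter might look like follows. The `Sample` type, the example scores, and the choice of `<=` as the comparison are assumptions for illustration only; the actual LAION filtering pipeline is not shown in this diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical record: an image reference plus a classifier score."""
    url: str
    p_unsafe: float  # estimated probability that the image is unsafe


def filter_by_punsafe(samples, punsafe):
    """Keep samples whose unsafe score does not exceed the threshold (assumed rule)."""
    return [s for s in samples if s.p_unsafe <= punsafe]


if __name__ == "__main__":
    # Made-up scores purely to show how the two thresholds differ.
    data = [
        Sample("img-a", 0.02),
        Sample("img-b", 0.50),
        Sample("img-c", 0.99),
    ]
    print("punsafe=0.1 keeps:", [s.url for s in filter_by_punsafe(data, 0.1)])
    print("punsafe=0.98 keeps:", [s.url for s in filter_by_punsafe(data, 0.98)])
```

A lower threshold discards more borderline samples, which matches the model card's description of `0.1` as the conservative setting.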
@@ -13,4 +13,7 @@ transformers==4.19.2
webdataset==0.2.5
open-clip-torch==2.7.0
gradio==3.11
kornia==0.6
invisible-watermark>=0.1.5
streamlit-drawable-canvas==0.8.0
-e .