Merged
28 commits
b590b3f
log inference time
eolecvk Sep 26, 2022
0673aff
switch from timeit to torch.utils.benchmark.Timer.timeit
eolecvk Sep 27, 2022
136903a
rename benchmark script, add memory log
eolecvk Sep 28, 2022
3eee230
load weights from repo, add n_samples option, run benchmark grid, sav…
eolecvk Sep 28, 2022
ecd7282
draft benchmarking readme
eolecvk Sep 29, 2022
90bf85f
load model from huggingface
eolecvk Sep 29, 2022
53ef997
add benchmark instruction to README.md
eolecvk Sep 29, 2022
a94ffb0
updated benchmark.csv with device name
eolecvk Sep 29, 2022
8cf9a0b
append to benchmark, add device col
eolecvk Sep 29, 2022
f54cfbb
csvpath relative to benchmark script
eolecvk Sep 29, 2022
72e2377
sync README benchmark w .csv outptut
eolecvk Sep 29, 2022
656c4d1
benchmark: 1) Get precisons working 2) Add cpu support
chuanli11 Oct 1, 2022
e86d69e
benchmark: 1) Add onnx 2) Add arguments to benchmark script
chuanli11 Oct 1, 2022
03df267
benchmark: update requirements.txt so it uses torch cu116 build and o…
chuanli11 Oct 1, 2022
91542b2
benchmark: 1) Catch CUDA OOM error more gracefully 2) Add samples (ba…
chuanli11 Oct 1, 2022
e873162
benchmark: use float instead of str for logging
chuanli11 Oct 1, 2022
254536f
benchmark: handles onnx runtime error
chuanli11 Oct 2, 2022
c6598a5
benchmark: more results
chuanli11 Oct 2, 2022
37b5afd
benchmark: use autocast in a more elegant way. remove onnx from gpu b…
chuanli11 Oct 2, 2022
2469d61
benchmark: add revsiion when loading pretrianed models for half preci…
chuanli11 Oct 3, 2022
cc81843
benchmark: Add readme to benchmark
chuanli11 Oct 3, 2022
7f6ced3
update benchmark.md prose and prettify bar graphs
eolecvk Oct 4, 2022
437689b
typos
eolecvk Oct 4, 2022
0c9fbc7
update latency graph rtx8000, update readme.md results section
eolecvk Oct 4, 2022
d56c0c5
rm deprecated graphs .svg
eolecvk Oct 4, 2022
5245735
benchmark: change latency to speed/secs to finish in figure title
chuanli11 Oct 5, 2022
645fc33
saving logs to benchmark_tmp.csv instead of benchmark.csv and gitigno…
eolecvk Oct 5, 2022
2b5da79
missing csv header 'runtime'
eolecvk Oct 5, 2022
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,5 +1,6 @@
model_zoo/
outputs/
*benchmark_tmp.csv

# Byte-compiled / optimized / DLL files
__pycache__/
@@ -130,6 +131,7 @@ venv/
ENV/
env.bak/
venv.bak/
.venv*/

# Spyder project settings
.spyderproject
@@ -160,4 +162,4 @@ cython_debug/
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
#.idea/
3 changes: 3 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,3 @@
{
"python.formatting.provider": "black"
}
26 changes: 26 additions & 0 deletions README.md
@@ -98,6 +98,32 @@ for idx, im in enumerate(images):
im.save(f"{idx:06}.png")
```

## Benchmarking inference

Detailed benchmark documentation can be found [here](./docs/benchmark.md).

### Setup

Before running the benchmark, make sure you have completed the repository [installation steps](#installation).

You will then need to set the Hugging Face access token:
1. Create a user account on Hugging Face and generate an access token.
2. Set your Hugging Face access token as the `ACCESS_TOKEN` environment variable:
```
export ACCESS_TOKEN=<hf_...>
```

### Usage

Launch the benchmark script to append benchmark results to the existing [benchmark.csv](./benchmark.csv) results file:
```
python ./scripts/benchmark.py
```

### Results

<img src="./docs/pictures/pretty_benchmark_sd_txt2img_latency.png" alt="Stable Diffusion Text2Image Latency (seconds)" width="850"/>

## Links

- [Captioned Pokémon dataset](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions)
58 changes: 58 additions & 0 deletions benchmark.csv
@@ -0,0 +1,58 @@
Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz,single,pytorch,1,458.97,0.0
Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz,single,onnx,1,286.13,0.0
NVIDIA GeForce RTX 3090,single,pytorch,1,7.96,7.72
NVIDIA GeForce RTX 3090,half,pytorch,1,4.83,4.54
NVIDIA GeForce RTX 3090,single,pytorch,2,14.49,11
NVIDIA GeForce RTX 3090,half,pytorch,2,8.42,8.75
NVIDIA GeForce RTX 3090,single,pytorch,4,27.94,17.69
NVIDIA GeForce RTX 3090,half,pytorch,4,15.87,15.36
NVIDIA GeForce RTX 3090,single,pytorch,8,-1.0,-1.0
NVIDIA GeForce RTX 3090,half,pytorch,8,-1.0,-1.0
NVIDIA RTX A5500,single,pytorch,1,8.55,7.69
NVIDIA RTX A5500,half,pytorch,1,5.05,4.58
NVIDIA RTX A5500,single,pytorch,2,15.71,11
NVIDIA RTX A5500,half,pytorch,2,9.37,8.8
NVIDIA RTX A5500,single,pytorch,4,30.51,17.69
NVIDIA RTX A5500,half,pytorch,4,16.97,15.33
NVIDIA RTX A5500,single,pytorch,8,-1.0,-1.0
NVIDIA RTX A5500,half,pytorch,8,-1.0,-1.0
AMD EPYC 7352 24-Core Processor,single,pytorch,1,529.93,0.0
AMD EPYC 7352 24-Core Processor,single,onnx,1,223.19,0.0
NVIDIA GeForce RTX 3080,single,pytorch,4,-1.0,-1.0
NVIDIA GeForce RTX 3080,half,pytorch,4,-1.0,-1.0
NVIDIA GeForce RTX 3080,single,pytorch,1,-1.0,-1.0
NVIDIA GeForce RTX 3080,half,pytorch,1,5.59,4.52
NVIDIA GeForce RTX 3080,single,pytorch,2,-1.0,-1.0
NVIDIA GeForce RTX 3080,half,pytorch,2,-1.0,-1.0
NVIDIA A100 80GB PCIe,single,pytorch,1,6.39,7.75
NVIDIA A100 80GB PCIe,half,pytorch,1,3.74,4.55
NVIDIA A100 80GB PCIe,single,pytorch,2,11.12,11.05
NVIDIA A100 80GB PCIe,half,pytorch,2,5.72,8.77
NVIDIA A100 80GB PCIe,single,pytorch,4,20.18,17.63
NVIDIA A100 80GB PCIe,half,pytorch,4,10.04,15.34
NVIDIA A100 80GB PCIe,single,pytorch,8,38.88,30.88
NVIDIA A100 80GB PCIe,half,pytorch,8,18.68,28.47
NVIDIA A100 80GB PCIe,single,pytorch,16,76.92,57.46
NVIDIA A100 80GB PCIe,half,pytorch,16,36.67,54.73
NVIDIA A100 80GB PCIe,half,pytorch,28,63.88,78.78
NVIDIA RTX A6000,single,pytorch,1,8.09,7.75
NVIDIA RTX A6000,half,pytorch,1,5.03,4.53
NVIDIA RTX A6000,single,pytorch,2,14.86,10.98
NVIDIA RTX A6000,half,pytorch,2,9.03,8.79
NVIDIA RTX A6000,single,pytorch,4,27.92,17.62
NVIDIA RTX A6000,half,pytorch,4,17.0,15.34
NVIDIA RTX A6000,single,pytorch,8,53.95,30.88
NVIDIA RTX A6000,half,pytorch,8,32.57,28.51
NVIDIA RTX A6000,half,pytorch,16,63.16,46.11
Quadro RTX 8000,single,pytorch,1,12.3,7.71
Quadro RTX 8000,half,pytorch,1,5.93,4.52
Quadro RTX 8000,single,pytorch,2,24.42,9.16
Quadro RTX 8000,half,pytorch,2,10.92,7.02
Quadro RTX 8000,single,pytorch,4,42.56,15.58
Quadro RTX 8000,half,pytorch,4,21.24,12.39
Quadro RTX 8000,single,pytorch,8,76.96,23.11
Quadro RTX 8000,half,pytorch,8,40.52,20.98
Quadro RTX 8000,single,pytorch,16,152.55,42.47
Quadro RTX 8000,half,pytorch,16,80.31,38.18
Quadro RTX 8000,single,pytorch,32,-1.0,-1.0
Quadro RTX 8000,half,pytorch,32,-1.0,-1.0
112 changes: 112 additions & 0 deletions docs/benchmark.md
@@ -0,0 +1,112 @@
# Benchmarking Diffuser Models

We present a benchmark of [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion) model inference. This text2image model uses a text prompt as input and outputs an image of resolution `512x512`.

Our experiments analyze inference performance in terms of speed, memory consumption, throughput, and quality of the output images. We look at how different choices in hardware (GPU model, GPU vs CPU) and software (single vs half precision, pytorch vs onnxruntime) affect inference performance.

For reference, we will be providing benchmark results for the following GPU devices: A100 80GB PCIe, RTX3090, RTXA5500, RTXA6000, RTX3080, RTX8000. Please refer to the ["Reproducing the experiments"](#reproducing-the-experiments) section for details on running these experiments in your own environment.


## Inference speed

The figure below shows the latency at inference when using different hardware and precision for generating a single image using the (arbitrary) text prompt: *"a photo of an astronaut riding a horse on mars"*.

<img src="./pictures/pretty_benchmark_sd_txt2img_latency.png" alt="Stable Diffusion Text2Image Latency (seconds)" width="800"/>


We find that:
* With half precision, the inference latencies range from `3.74` to `5.56` seconds across our tested Ampere GPUs, from the consumer 3080 card to the flagship A100 80GB card.
* Half-precision reduces the latency by about `40%` for Ampere GPUs, and by `52%` for the previous generation `RTX8000` GPU.
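
The percentages above can be rechecked from the batch-size-one rows of `benchmark.csv`. The snippet below is a minimal sketch of that arithmetic, not part of the benchmark scripts: the dictionary and the `latency_reduction` helper are names made up for illustration, and the RTX 3080 is left out because its single-precision run has no valid latency in the CSV.

```python
# Batch-size-1 latencies in seconds, copied from benchmark.csv as (single, half).
LATENCY_S = {
    "NVIDIA A100 80GB PCIe": (6.39, 3.74),
    "NVIDIA GeForce RTX 3090": (7.96, 4.83),
    "NVIDIA RTX A5500": (8.55, 5.05),
    "NVIDIA RTX A6000": (8.09, 5.03),
    "Quadro RTX 8000": (12.3, 5.93),
}


def latency_reduction(single_s: float, half_s: float) -> float:
    """Fraction of the single-precision latency saved by half precision."""
    return 1.0 - half_s / single_s


reductions = {gpu: latency_reduction(s, h) for gpu, (s, h) in LATENCY_S.items()}

for gpu, reduction in reductions.items():
    print(f"{gpu}: {reduction:.0%} lower latency with half precision")
```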

We believe Ampere GPUs enjoy a relatively smaller speedup from half-precision because their single-precision baseline already uses `TF32`. For readers who are not familiar with it, `TF32` is a [`19-bit` format](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) that has been used as the default single-precision data type on Ampere GPUs for major deep learning frameworks such as PyTorch and TensorFlow. On hardware whose baseline is `FP32`, a true `32-bit` format, one can expect half-precision's speedup to be bigger.
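
To make the `19-bit` figure concrete: `TF32` keeps the sign bit and the 8 exponent bits of `FP32` but only 10 of its 23 mantissa bits (1 + 8 + 10 = 19). The sketch below emulates that loss of precision in pure Python by zeroing the 13 low mantissa bits of a 32-bit float. The function name is hypothetical, and truncation is a simplifying assumption; the rounding performed by real hardware may differ.

```python
import struct


def truncate_to_tf32(x: float) -> float:
    """Emulate TF32 precision by dropping the 13 low mantissa bits of an FP32 value.

    Assumption: plain truncation, used here only to illustrate the bit layout.
    """
    # Reinterpret the value as the 32 raw bits of an IEEE 754 single.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Keep sign (1 bit), exponent (8 bits), and the top 10 mantissa bits.
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# 2**-10 is the smallest step above 1.0 that survives; 2**-11 is lost.
print(truncate_to_tf32(1.0 + 2**-10) == 1.0 + 2**-10)
print(truncate_to_tf32(1.0 + 2**-11) == 1.0)
```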


We run these same inference jobs on CPU devices to put in perspective the inference speed observed on GPU.

<img src="./pictures/pretty_benchmark_sd_txt2img_gpu_vs_cpu.png" alt="Stable Diffusion Text2Image GPU v CPU" width="700"/>


We note that:
* GPUs are significantly faster, by one or two orders of magnitude depending on the precision.
* `onnxruntime` can reduce the latency for CPU by about `40%` to `50%`, depending on the type of CPUs.
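
As a rough check of the "one or two orders of magnitude" claim, the sketch below divides the CPU PyTorch latencies by two of the GPU latencies recorded in `benchmark.csv` for a single image. The dictionary names and the choice of which GPU rows to compare are illustrative assumptions, not something the benchmark scripts compute.

```python
import math

# Single-image latencies in seconds, copied from benchmark.csv.
CPU_PYTORCH_S = {
    "Intel Core i7-6850K": 458.97,
    "AMD EPYC 7352": 529.93,
}
GPU_PYTORCH_S = {
    "RTX 3090, single precision": 7.96,
    "A100 80GB PCIe, half precision": 3.74,
}

# Speedup of every listed GPU configuration over every listed CPU.
speedups = {
    (cpu, gpu): cpu_s / gpu_s
    for cpu, cpu_s in CPU_PYTORCH_S.items()
    for gpu, gpu_s in GPU_PYTORCH_S.items()
}

for (cpu, gpu), speedup in speedups.items():
    orders = math.log10(speedup)
    print(f"{gpu} vs {cpu}: {speedup:.0f}x faster ({orders:.1f} orders of magnitude)")
```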

ONNX currently does not have [stable support](https://github.com/huggingface/diffusers/issues/489) for Hugging Face diffusers.
We will investigate `onnxruntime-gpu` in future benchmarks.




## Memory

We also measure the memory consumption of running stable diffusion inference.

<img src="./pictures/pretty_benchmark_sd_txt2img_mem.png" alt="Stable Diffusion Text2Image Memory (GB)" width="640"/>

Memory usage is observed to be consistent across all tested GPUs:
* It takes about `7.7 GB` GPU memory to run single-precision inference with batch size one.
* It takes about `4.5 GB` GPU memory to run half-precision inference with batch size one.




## Throughput

Latency measures how quickly a _single_ input can be processed, which is critical to online applications that don't tolerate even the slightest delay. However, some (offline) applications may focus on "throughput", which measures the total volume of data processed in a fixed amount of time.


Our throughput benchmark pushes the batch size to the maximum for each GPU, and measures the number of images they can process per minute. The reason for maximizing the batch size is to keep tensor cores busy so that computation can dominate the workload, avoiding any non-computational bottlenecks.

We run a series of throughput experiments in pytorch with half-precision, using the maximum batch size that fits on each GPU:

<img src="./pictures/pretty_benchmark_sd_txt2img_throughput.png" alt="Stable Diffusion Text2Image Throughput (images/minute)" width="390"/>

We note:
* Once again, A100 80GB is the top performer and has the highest throughput.
* The gap between A100 80GB and other cards in terms of throughput can be explained by the larger maximum batch size that can be used on this card.


As a concrete example, the chart below shows how A100 80GB's throughput increases by `64%` when we change the batch size from 1 to 28 (the largest that does not cause an out-of-memory error). It is also interesting to see that the increase is not linear and flattens out once the batch size reaches a certain value, at which point the tensor cores on the GPU are saturated and any new data in the GPU memory has to be queued up before getting its own computing resources.
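
The `64%` figure follows from the A100 80GB half-precision rows of `benchmark.csv` if throughput is taken as `60 * batch_size / latency` images per minute. The snippet below is a small sketch of that conversion; the helper and variable names are hypothetical and the formula is our reading of how the chart was derived.

```python
# (batch_size, latency in seconds) for A100 80GB PCIe, half precision,
# copied from benchmark.csv.
A100_HALF = [(1, 3.74), (2, 5.72), (4, 10.04), (8, 18.68), (16, 36.67), (28, 63.88)]


def images_per_minute(batch_size: int, latency_s: float) -> float:
    """Throughput when one batch of `batch_size` images takes `latency_s` seconds."""
    return 60.0 * batch_size / latency_s


throughput = {b: images_per_minute(b, t) for b, t in A100_HALF}

for batch_size, ipm in throughput.items():
    print(f"batch size {batch_size:>2}: {ipm:.1f} images/minute")

# Relative gain from batch size 1 to the largest batch that fits in memory.
gain = throughput[28] / throughput[1] - 1.0
print(f"throughput gain from batch size 1 to 28: {gain:.0%}")
```

Most of the gain appears between batch sizes 1 and 8; from 16 to 28 the throughput barely moves, which matches the flattening described above.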

<img src="./pictures/pretty_benchmark_sd_txt2img_batchsize_vs_throughput.png" alt="Stable Diffusion Text2Image Batch size vs Throughput (images/minute)" width="380"/>


## Precision

We are curious about whether half-precision introduces degradations to the quality of the output images. To test this out, we fixed the text prompt as well as the "latent" input vector and fed them to the single-precision model and the half-precision model. We ran the inference for 100 steps and saved both models' outputs at each step, as well as the difference map:

![Evolution of precision v degradation across 100 steps](./pictures/benchmark_sd_precision_history.gif)

Our observation is that there are indeed visible differences between the single-precision output and the half-precision output, especially in the early steps. The differences often decrease with the number of steps, but might not always vanish.

Interestingly, such a difference does not necessarily imply artifacts in the half-precision output. For example, at step 70, the picture below shows that the half-precision output lacks an artifact that is present in the single-precision output (an extra front leg):

![Precision v Degradation at step 70](./pictures/benchmark_sd_precision_step_70.png)

---

## Reproducing the experiments

You can use this [Lambda Diffusers](https://github.com/LambdaLabsML/lambda-diffusers) repository to reproduce the results presented in this article.

### Setup

Before running the benchmark, make sure you have completed the repository [installation steps](../README.md#installation).

You will then need to set the Hugging Face access token:
1. Create a user account on Hugging Face and generate an access token.
2. Set your Hugging Face access token as the `ACCESS_TOKEN` environment variable:
```
export ACCESS_TOKEN=<hf_...>
```

### Usage

Launch the `benchmark.py` script to append benchmark results to the existing [benchmark.csv](../benchmark.csv) results file:
```
python ./scripts/benchmark.py
```
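
Once results have been appended, the CSV can be inspected with the standard library. The sketch below assumes the six columns are, in order, device, precision, runtime, number of samples, latency in seconds, and memory in GB; these field names are our guess from the data and the commit messages, since the file shown here has no header row. It also assumes that a latency of `-1.0` marks a run that did not complete (for example, an out-of-memory error).

```python
import csv
import io

# Assumed column order; benchmark.csv as shown has no header row.
FIELDS = ["device", "precision", "runtime", "n_samples", "latency", "memory"]

# A few rows copied from benchmark.csv, used here in place of the real file.
SAMPLE = """\
NVIDIA GeForce RTX 3090,half,pytorch,4,15.87,15.36
NVIDIA GeForce RTX 3090,half,pytorch,8,-1.0,-1.0
NVIDIA A100 80GB PCIe,half,pytorch,28,63.88,78.78
"""


def load_results(handle):
    """Parse benchmark rows and skip runs whose latency is negative."""
    results = []
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        latency = float(row["latency"])
        if latency < 0:  # assumed marker for a failed run
            continue
        results.append(
            {
                "device": row["device"],
                "precision": row["precision"],
                "runtime": row["runtime"],
                "n_samples": int(row["n_samples"]),
                "latency": latency,
                "memory": float(row["memory"]),
            }
        )
    return results


results = load_results(io.StringIO(SAMPLE))
for r in results:
    print(r["device"], r["precision"], r["n_samples"], r["latency"], r["memory"])
```

To read the real file, replace `io.StringIO(SAMPLE)` with `open("benchmark.csv", newline="")`.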

Launch the `benchmark_quality.py` script to compare the output of single-precision and half-precision models:
```
python ./scripts/benchmark_quality.py
```
Binary file added docs/pictures/FreeMono.ttf
Binary file not shown.
Binary file added docs/pictures/benchmark_sd_precision_history.gif
Binary file added docs/pictures/benchmark_sd_precision_step_70.png
16 changes: 9 additions & 7 deletions requirements.txt
@@ -1,7 +1,9 @@
torch
torchvision
transformers
ftfy
Pillow
diffusers
-e .
--extra-index-url https://download.pytorch.org/whl/cu116 torch

transformers==4.22.1
ftfy==6.1.1
Pillow==9.2.0
diffusers==0.3.0
onnxruntime==1.12.1
scikit-image==0.19.3
-e .