[bitsandbytes] allow directly CUDA placements of pipelines loaded with bnb components #9840
Conversation
SunMarc left a comment
Thanks for the PR! Left a suggestion.
@SunMarc WDYT now?
SunMarc left a comment
Thanks for adding this! LGTM! I'll merge the PR on accelerate as well.
Have run the integration tests and they are passing.
@SunMarc yes, on

No, I read that as a question, my bad ;)
```python
pipeline_has_bnb = any(
    (_check_bnb_status(module)[1] or _check_bnb_status(module)[-1]) for _, module in self.components.items()
)
```
IMO cleaner.
```diff
-pipeline_has_bnb = any(
-    (_check_bnb_status(module)[1] or _check_bnb_status(module)[-1]) for _, module in self.components.items()
-)
+pipeline_has_bnb = any(
+    any((_check_bnb_status(module))) for _, module in self.components.items()
+)
```
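For illustration, here is a minimal, self-contained sketch of why the two spellings agree. It assumes (via the hypothetical stand-in below, not the actual diffusers helper) that `_check_bnb_status` returns a tuple of booleans whose first element is the OR of the 4-bit and 8-bit flags:

```python
# Hypothetical stand-in for diffusers' _check_bnb_status, for illustration only.
def _check_bnb_status(module):
    is_4bit = getattr(module, "is_loaded_in_4bit", False)
    is_8bit = getattr(module, "is_loaded_in_8bit", False)
    # (any bnb quantization, 4-bit flag, 8-bit flag)
    return (is_4bit or is_8bit, is_4bit, is_8bit)


class FakeQuantizedModule:
    is_loaded_in_4bit = True


components = {"transformer": FakeQuantizedModule(), "vae": object()}

# Original spelling: inspect the 4-bit and 8-bit flags explicitly.
explicit = any(
    (_check_bnb_status(m)[1] or _check_bnb_status(m)[-1]) for _, m in components.items()
)
# Suggested spelling: any truthy element of the returned tuple.
compact = any(any(_check_bnb_status(m)) for _, m in components.items())

assert explicit == compact  # equivalent under the assumption above
```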
If this check is placed after the sequential offloading check, placement would still fail right?
Running the test gives:
```
E ValueError: It seems like you have activated sequential model offloading by calling `enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline `.to('cpu')` or consider removing the move altogether if you use sequential offloading.
src/diffusers/pipelines/pipeline_utils.py:417: ValueError
```
> If this check is placed after the sequential offloading check, placement would still fail right?
I have modified the placement of the logic. Could you check again?
Re. tests, I just ran `pytest tests/quantization/bnb/test_4bit.py::SlowBnb4BitTests` and `pytest tests/quantization/bnb/test_mixed_int8.py::SlowBnb8bitTests` and everything passed.
You need this PR huggingface/accelerate#3223 for this to work.
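As a quick sanity check (illustrative snippet, not part of the PR), you can verify which accelerate version is installed locally; the `1.1.0.dev0` bound matches the version gate used later in this thread:

```python
# Illustrative: print the installed accelerate version and whether it is new enough.
import importlib.metadata

from packaging import version

installed = version.parse(importlib.metadata.version("accelerate"))
print(installed, installed >= version.parse("1.1.0.dev0"))
```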
```python
device = device or device_arg
pipeline_has_bnb = any(any((_check_bnb_status(module))) for _, module in self.components.items())
```
it seems to have some overlapping logic with the code just a little bit below this, no?
Good point.
However, the LoC you pointed out is relevant when we're transferring an 8-bit quantized model from one device to another. It's a log to let users know that this model has already been placed on a GPU and will remain so; requesting to put it on a CPU will be ineffective.
We call `self.to("cpu")` when doing `enable_model_cpu_offload()`.
So, this kind of log becomes informative in the context of using `enable_model_cpu_offload()`, for example.
This PR, however, allows users to move an entire pipeline to a GPU when the memory permits. Previously it wasn't possible.
So, maybe this apparent overlap is justified. LMK.
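As a toy illustration of the kind of informational log meant here (made-up names, not the actual diffusers code), the idea is that an 8-bit bnb module keeps its GPU placement even when a CPU move is requested:

```python
import logging

logger = logging.getLogger(__name__)


def request_device(module_name: str, is_loaded_in_8bit: bool, current_device: str, requested: str) -> str:
    """Toy illustration: an 8-bit quantized module ignores a CPU move request."""
    if is_loaded_in_8bit and requested == "cpu" and current_device == "cuda":
        logger.warning(
            "%s is 8-bit quantized and already on the GPU; the requested move to 'cpu' will have no effect.",
            module_name,
        )
        return current_device
    return requested


# e.g. what the internal self.to("cpu") call in enable_model_cpu_offload() would run into:
print(request_device("text_encoder_2", True, "cuda", "cpu"))  # stays on "cuda"
```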
> This PR, however, allows users to move an entire pipeline to a GPU when the memory permits. Previously it wasn't possible.
Did I miss something?
This PR adds a check which throws a ValueError under a certain condition - it doesn't enable a new use case like you described here, no?
Well, the enablement comes from the accelerate fix huggingface/accelerate#3223 and this PR adds a check for that as you described. Sorry for the wrong order of words 😅
If you have other comments on the PR happy to address them.
My previous comment stands: it has overlapping logic with other checks you have below and is very, very confusing.
You're not enabling a new use case here; this PR corrects a previously wrong error message and allows the user to take the correct action. I would simply update the warning message here to add the other possible scenario: they are trying to call `to("cuda")` on a quantized model without offloading, and they need to upgrade accelerate in order to do that.
> this PR corrects a previously wrong error message
What was the wrong error message?
IIUC, the line you're pointing to has nothing to do with the changes introduced in this PR and has been in the codebase for quite a while.
The problem line (fixed by the accelerate PR) was this:
So, what I have done in 1779093 is as follows:
Updated the condition of the error message
"You are trying to call `.to('cuda')` on a pipeline that has models quantized with `bitsandbytes`. Your current `accelerate` installation does not support it. Please upgrade the installation."
to:

```python
if (
    not pipeline_is_offloaded
    and not pipeline_is_sequentially_offloaded
    and pipeline_has_bnb
    and torch.device(device).type == "cuda"
    and is_accelerate_version("<", "1.1.0.dev0")
):
```

This now also considers when the pipeline is not offloaded. Additionally, the check now also considers if the pipeline is not offloaded.
```python
    and torch.device(device).type == "cuda"
    and is_accelerate_version("<", "1.1.0.dev0")
):
    raise ValueError(
```
the error message you want to throw against this scenario, no?

- accelerate < 1.1.0.dev0
- you call `pipeline.to("cuda")` on a pipeline that has bnb

but if these 2 conditions are met (older accelerate version + bnb): `not pipeline_is_sequentially_offloaded` will be `False` here and you will not reach the value error - you will reach this check first and get an error message - this is the wrong error message I was talking about
```python
if (
    not pipeline_is_offloaded
    and not pipeline_is_sequentially_offloaded
    and pipeline_has_bnb
    and torch.device(device).type == "cuda"
    and is_accelerate_version("<", "1.1.0.dev0")
):
```
Yeah this makes a ton of sense. Thanks for the elaborate clarification. I have reflected this in my latest commits.
I have also run most of the SLOW tests and they are passing. This is to ensure existing functionality doesn't break with the current changes.
LMK.
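To make the ordering discussed above concrete, here is a simplified sketch (not the actual `pipeline_utils.py` code; names are illustrative) of the intended priority: the bnb-plus-old-accelerate case is reported before the generic sequential-offloading error.

```python
def validate_cuda_placement(
    pipeline_is_sequentially_offloaded: bool,
    pipeline_has_bnb: bool,
    accelerate_supports_bnb_placement: bool,
) -> None:
    # Simplified sketch of the decision flow discussed in this thread.
    if pipeline_has_bnb and not accelerate_supports_bnb_placement:
        raise ValueError(
            "You are trying to call `.to('cuda')` on a pipeline that has models quantized "
            "with `bitsandbytes`. Your current `accelerate` installation does not support it. "
            "Please upgrade the installation."
        )
    if pipeline_is_sequentially_offloaded and not pipeline_has_bnb:
        raise ValueError(
            "It seems like you have activated sequential model offloading by calling "
            "`enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU."
        )
```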
| "It seems like you have activated sequential model offloading by calling `enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline `.to('cpu')` or consider removing the move altogether if you use sequential offloading." | ||
| ) | ||
| if device and torch.device(device).type == "cuda": | ||
| if pipeline_is_sequentially_offloaded and not pipeline_has_bnb: |
My previous comments apply almost exactly here, so I will just repeat them:
#9840
the error message you want to throw against this scenario:

- accelerate < 1.1.0.dev0
- you call `pipeline.to("cuda")` on a pipeline that has bnb

if these 2 conditions are met (older accelerate version + bnb), it will not reach the error message you intended; it will be caught here at this first check, and the error message is the same as before this PR (about offloading)
can you do this? #9840 (comment)
If not, please remove the changes to pipeline_utils.py and we can merge (I will work on it in a separate PR). I think the added tests are fine without the changes: if the accelerate version is new, it is not affected by the changes in this PR; if it is not, it throws a different error, that's all.
ok I was wrong! will merge
Sure, that works, but here's my last try.
> if these 2 conditions are met (older accelerate version + bnb), it will not reach the error message you intended; it will be caught here at this first check, and the error message is the same as before this PR (about offloading)
When you have:
model_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
t5_4bit = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder_2")
transformer_4bit = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer")
pipeline_4bit = DiffusionPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
text_encoder_2=t5_4bit,
transformer=transformer_4bit,
torch_dtype=torch.float16,
)in if pipeline_is_sequentially_offloaded and not pipeline_has_bnb, pipeline_is_sequentially_offloaded will be True (older accelerate version), however, not pipeline_has_bnb will be False (as expected). So, the following error won't be raised:
"It seems like you have activated sequential model offloading by calling `enable_sequential_cpu_offload`, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline `.to('cpu')` or consider removing the move altogether if you use sequential offloading."And it will hit the else.
To test, you can run the following with accelerate 1.0.1:
```python
from diffusers import DiffusionPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel
import torch

model_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
t5_4bit = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder_2")
transformer_4bit = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer")
pipeline_4bit = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=t5_4bit,
    transformer=transformer_4bit,
    torch_dtype=torch.float16,
).to("cuda")
```

It throws:

```
ValueError: You are trying to call `.to('cuda')` on a pipeline that has models quantized with `bitsandbytes`. Your current `accelerate` installation does not support it. Please upgrade the installation.
```

Isn't this what we expect or am I missing something?
Yeah, I missed that `not pipeline_has_bnb` in the statement; it works.
Saw your comment. Thanks for beating it with me :)
[bitsandbytes] allow directly CUDA placements of pipelines loaded with bnb components (#9840)

* allow device placement when using bnb quantization.
* warning.
* tests
* fixes
* docs.
* require accelerate version.
* remove print.
* revert to()
* tests
* fixes
* fix: missing AutoencoderKL lora adapter (#9807)
* fixes
* fix condition test
* updates
* updates
* remove is_offloaded.
* fixes
* better
* empty

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Emmanuel Benazera <emmanuel.benazera@jolibrain.com>
What does this PR do?
When a pipeline is loaded with models that have a quantization config, we should still be able to call `to("cuda")` on the pipeline object. For GPUs whose memory allows it (such as a 4090), this has performance benefits (as demonstrated below).

Benchmark: Flux.1 Dev, steps: 30
Currently, calling `to("cuda")` is not possible because the existing sequential offloading check catches it.
This is why this line complains:
diffusers/src/diffusers/pipelines/pipeline_utils.py, line 413 (at c10f875)
This PR fixes that behavior.
Benchmarking code:
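The actual benchmarking script is collapsed in the PR page and not reproduced here. Below is a minimal sketch of what such a benchmark could look like, reusing the model IDs from earlier in the thread; the prompt, warmup, and timing approach are assumptions.

```python
import time

import torch
from diffusers import DiffusionPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel

model_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
t5_4bit = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder_2")
transformer_4bit = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer")
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=t5_4bit,
    transformer=transformer_4bit,
    torch_dtype=torch.float16,
).to("cuda")  # direct CUDA placement; needs accelerate with huggingface/accelerate#3223

prompt = "a photo of an astronaut riding a horse on mars"

# Warmup run, then time a 30-step generation (matching "Flux.1 Dev, steps: 30").
_ = pipeline(prompt, num_inference_steps=5)
torch.cuda.synchronize()
start = time.perf_counter()
_ = pipeline(prompt, num_inference_steps=30)
torch.cuda.synchronize()
print(f"30 steps took {time.perf_counter() - start:.2f} s")
```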