Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
sayakpaul
left a comment
Thanks a lot for quickly getting this up 🔥
My comments are mostly minor; the major one is adding `hf_quantizer` to the allocator function.
Additionally, for a potentially better user experience, it would be helpful to rethink the `to()` method of `DiffusionPipeline`. I mean the following.
Currently, from what I understand, we have to first initialize the denoiser with `device_map` and then the rest of the components. If a user calls `.to()` on a `DiffusionPipeline`, we could consider using `device_map="cuda"` to dispatch the model-level components to CUDA. I don't immediately see a downside to it.
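For illustration, a minimal sketch of what that could look like inside `DiffusionPipeline.to()` (hypothetical; `dispatch_with_device_map` is an assumed helper, not existing diffusers API):

```python
import torch

# Hypothetical sketch of the proposed behavior, not the actual implementation.
def to(self, device: str):
    for name, component in self.components.items():
        if isinstance(component, torch.nn.Module):
            # Proposed: dispatch model-level components as if they had been
            # loaded with device_map=device, instead of a plain component.to(device).
            dispatch_with_device_map(component, device)  # assumed helper
        # Non-module components (tokenizers, schedulers) are left untouched.
    return self
```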
```python
    return parsed_parameters


def _find_mismatched_keys(
```
Taken out of here: `diffusers/src/diffusers/models/modeling_utils.py`, line 1509 in `9f4d997`.
```python
if device_type is None:
    device_type = get_device()
device_mod = getattr(torch, device_type, torch.cuda)
device_mod.synchronize()
```
I guess all the different backends ought to have this method. Just flagging.
AFAIK, `synchronize` should be available on all devices. Only the `empty_cache` function required a special check, because it would fail if the device was CPU.
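For context, a minimal sketch of the kind of guard being described (the `clear_cache` name is taken from the commit log's `_empty_cache -> clear_cache` rename; the exact diffusers implementation may differ):

```python
import torch

def clear_cache(device_type: str) -> None:
    # CPU has no device cache; calling empty_cache there would fail,
    # hence the special check mentioned above.
    if device_type == "cpu":
        return
    # synchronize/empty_cache are available on cuda, xpu, mps, etc.
    device_mod = getattr(torch, device_type, torch.cuda)
    device_mod.empty_cache()
```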
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* update
* update
* update
* pin accelerate version
* add comment explanations
* update docstring
* make style
* non_blocking does not matter for dtype cast
* _empty_cache -> clear_cache
* update
* Update src/diffusers/models/model_loading_utils.py

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* Update src/diffusers/models/model_loading_utils.py

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
All thanks to @Cyrilvallez's PR: huggingface/transformers#36380
The accelerate PR is required because we end up calling `clear_device_cache` in a loop (over the sharded files). This is bad. Without it, you'll see no speedup. Another small optimization is using `non_blocking` everywhere and syncing just before returning control to the user; this is slightly faster.
Sister PR in accelerate required to obtain speedup: huggingface/accelerate#3674
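Roughly, the pattern looks like this (a simplified sketch of the idea, not the actual diffusers/accelerate code; `shard_files` and the parameter-assignment details are illustrative):

```python
import torch

def load_shards(model: torch.nn.Module, shard_files, device: str):
    # Copy each shard's tensors asynchronously; crucially, there is no
    # per-shard clear_device_cache call inside this loop.
    for shard in shard_files:
        state_dict = torch.load(shard, map_location="cpu")
        for name, tensor in state_dict.items():
            param = model.get_parameter(name)
            param.data = tensor.to(device, non_blocking=True)
    # Synchronize once, just before returning control to the user.
    device_type = torch.device(device).type
    device_mod = getattr(torch, device_type, torch.cuda)
    device_mod.synchronize()
```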
| Before | After |
|---|---|
| 16.765s | 4.521s |