Conversation
fp8 quantization currently limited to tensors with shapes where both dimensions are divisible by 16.
Hello @Narsil,
when I was using fp8 quantization on an H100, I got an error saying the size is not divisible by 16. See `filter_out_small_unaligned_layers` (https://github.com/pytorch-labs/float8_experimental/blob/ac065d09a6259574a85027edc84eb647dc6c90c2/float8_experimental/float8_linear_utils.py#L82-L93): they check the layer sizes. The issue is that when a user request is not batched to a multiple of 16, this PR pads the batch with dummy requests.
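For readers following along, here is a minimal sketch of the padding idea being described; it is not the PR's actual code. `pad_to_multiple` is a hypothetical helper, and the plain matmul stands in for the real fp8 kernel:

```python
import torch

def pad_to_multiple(x: torch.Tensor, multiple: int = 16) -> tuple[torch.Tensor, int]:
    """Append zero rows so the batch dimension is divisible by `multiple`.

    Returns the padded tensor and the original number of rows, so the
    dummy rows can be sliced off after the matmul.
    """
    rows = x.shape[0]
    remainder = rows % multiple
    if remainder == 0:
        return x, rows
    padding = torch.zeros(multiple - remainder, *x.shape[1:],
                          dtype=x.dtype, device=x.device)
    return torch.cat([x, padding], dim=0), rows

# A batch of 13 requests does not satisfy the divisible-by-16 constraint.
x = torch.randn(13, 4096)
weight = torch.randn(1024, 4096)

padded, n = pad_to_multiple(x)   # padded.shape == (16, 4096)
out = padded @ weight.t()        # the real path would quantize to fp8 here
out = out[:n]                    # drop the dummy rows again
```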
Hi @dongs0104, this depends on your torch version: torch nightly (I think 2.2.2 as well) does not require the padding. Adding extra padding KILLS performance by a huge factor (the current implementation is still slower than fp16 for some reason, but at least comparable).
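To make the version dependence concrete, a hedged sketch of how a caller could gate the workaround on the installed torch version; the 2.2.2 threshold comes from the comment above and is an assumption, not a verified cutoff:

```python
import torch
from packaging.version import Version  # pip install packaging if needed

# Assumption: torch nightly (and, per the comment above, 2.2.2) no longer
# requires divisible-by-16 shapes, so only older versions need the padding.
NEEDS_PADDING = Version(torch.__version__.split("+")[0]) < Version("2.2.2")

def maybe_pad_rows(x: torch.Tensor, multiple: int = 16) -> torch.Tensor:
    """Pad the batch dimension only when the installed torch requires it."""
    remainder = x.shape[0] % multiple
    if not NEEDS_PADDING or remainder == 0:
        return x
    # (0, 0) pads the last dim by nothing; (0, pad) appends rows to dim -2.
    return torch.nn.functional.pad(x, (0, 0, 0, multiple - remainder))
```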
@Narsil I agree with you that padding hurts performance. I was also using TGI v2.0.1, which uses torch 2.1.1, so this will be solved once #1730 is merged, and this PR can be closed. :)
What does this PR do?
Fixes # (issue)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@Narsil