Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
self.assertTrue(isinstance(weight, AffineQuantizedTensor))
self.assertEqual(weight.quant_min, 0)
self.assertEqual(weight.quant_max, 15)
self.assertTrue(isinstance(weight.layout_type, TensorCoreTiledLayoutType))
```
`layout_type` has become an internal private attribute called `_layout` now, so it does not need to be tested here and can be removed. The layout class is also now called `TensorCoreTiledLayout` instead of `TensorCoreTiledLayoutType`.
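A minimal sketch of what the simplified check could look like, assuming `AffineQuantizedTensor` is importable from `torchao.dtypes` in torchao >= 0.7 (the helper name is illustrative):

```python
from torchao.dtypes import AffineQuantizedTensor


def check_int4_weight(weight):
    # The weight of a quantized linear layer should be an AffineQuantizedTensor
    # covering the int4 range [0, 15]. The layout (now called
    # TensorCoreTiledLayout) lives on the private `_layout` attribute, so it is
    # intentionally not asserted on.
    assert isinstance(weight, AffineQuantizedTensor)
    assert weight.quant_min == 0
    assert weight.quant_max == 15
```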
```python
size_quantized_with_not_convert = get_model_size_in_bytes(quantized_model_with_not_convert)
size_quantized = get_model_size_in_bytes(quantized_model)

self.assertTrue(size_quantized < size_quantized_with_not_convert)
```
Not related to bumping the version, but it makes for a more meaningful test; see the sketch below for the intent.
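To make the comparison concrete, here is a hedged sketch of how the two models in the snippet above could be produced; the checkpoint, quant type, and module names are illustrative, assuming `TorchAoConfig` accepts a `modules_to_not_convert` argument:

```python
import torch
from diffusers import FluxTransformer2DModel, TorchAoConfig
from torchao.utils import get_model_size_in_bytes

common = dict(subfolder="transformer", torch_dtype=torch.bfloat16)

# Fully quantized model.
quantized_model = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=TorchAoConfig("int8_weight_only"),
    **common,
)

# Same model, but with one module (illustrative name) kept in bfloat16.
quantized_model_with_not_convert = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=TorchAoConfig("int8_weight_only", modules_to_not_convert=["proj_out"]),
    **common,
)

# Skipping modules keeps them at full precision, so the partially quantized
# model must come out strictly larger.
assert get_model_size_in_bytes(quantized_model) < get_model_size_in_bytes(quantized_model_with_not_convert)
```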
```python
for param in module.parameters():
    if param.__class__.__name__ == "AffineQuantizedTensor":
        data, scale, zero_point = param.layout_tensor.get_plain()
```
Same reason as above for removing this: `layout_tensor` is an internal private attribute, meaning we shouldn't access it because it could change without warning in the future.
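A hedged sketch of a check that stays on the public surface instead, relying only on the tensor subclass type (the helper name is illustrative):

```python
from torchao.dtypes import AffineQuantizedTensor


def module_is_quantized(module):
    # Detect quantization via the public tensor subclass instead of reaching
    # into internals such as `layout_tensor`, which torchao may rename or
    # remove without notice.
    return any(isinstance(param, AffineQuantizedTensor) for param in module.parameters())
```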
```python
self.assertTrue(total_int8wo < total_bf16 < total_int4wo_gs32)
# int4 with default group size quantized very few linear layers compared to a smaller group size of 32
self.assertTrue(quantized_int4wo < quantized_int4wo_gs32 and unquantized_int4wo > unquantized_int4wo_gs32)
total_int4wo = get_model_size_in_bytes(transformer_int4wo)
```
We now use the utility provided by torchao for this instead.
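For reference, a short sketch of the utility in question, assuming it is exposed as `torchao.utils.get_model_size_in_bytes` (the model variable names mirror the snippet above and are otherwise illustrative):

```python
from torchao.utils import get_model_size_in_bytes

# Footprints of the differently quantized transformers used in the comparisons above.
total_bf16 = get_model_size_in_bytes(transformer_bf16)
total_int8wo = get_model_size_in_bytes(transformer_int8wo)
total_int4wo = get_model_size_in_bytes(transformer_int4wo)
total_int4wo_gs32 = get_model_size_in_bytes(transformer_int4wo_gs32)
```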
```python
def _test_quant_type(self, quantization_config, expected_slice):
    components = self.get_dummy_components(quantization_config)
    pipe = FluxPipeline(**components).to(dtype=torch.bfloat16)
```
I think this was the incorrect thing to do here, and it slipped past us in a previous PR. We should not be calling `.to(dtype)` on the pipeline directly if one of its models has been quantized.
The GGUF PR introduced a check in `modeling_utils.py` here that catches this behaviour.
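A minimal sketch of the safer pattern in the style of the test above, assuming the quantized component already carries the right dtype and only the device is changed afterwards:

```python
components = self.get_dummy_components(quantization_config)

# Don't do: FluxPipeline(**components).to(dtype=torch.bfloat16)
# A blanket dtype cast would also touch the quantized model, which is what the
# check added to modeling_utils.py by the GGUF PR now flags.
pipe = FluxPipeline(**components)
pipe.to("cuda")  # moving devices is fine; just avoid casting the dtype
```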
Gentle ping @DN6 |
* bump min torchao version to 0.7.0
* update
Context: https://huggingface.slack.com/archives/C065E480NN9/p1734425021147699
cc @yiyixuxu