
Conversation

@liangel-02 (Contributor)

Context

This PR is a follow-up to #40735 and #41138. Previously, we enabled safetensors in torchao for a single shard file. This PR fixes some errors introduced in #41138 and handles checkpoints that are sharded across more than one file, including the edge case where a single quantized tensor (i.e., a Float8Tensor) is split across two different files (i.e., qdata in one and scale in another).
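A minimal sketch of that edge case, assuming flattened component keys of the form `<module>._weight_<attr>` (the naming is inferred from the `"_weight_"` check in the code under review; the file names, keys, and shapes below are illustrative only):

```python
import torch

# Hypothetical shard layout: the two components of one Float8Tensor end up in
# different safetensors files.
shard_1 = {  # e.g. model-00001-of-00002.safetensors
    "model.layers.0.mlp.down_proj._weight_qdata": torch.zeros(16, 16),  # quantized payload
}
shard_2 = {  # e.g. model-00002-of-00002.safetensors
    "model.layers.0.mlp.down_proj._weight_scale": torch.ones(16, 1),    # matching scale
}

# Neither file alone is enough to rebuild the Float8Tensor behind
# "model.layers.0.mlp.down_proj.weight", so reconstruction has to wait until
# every shard has been read.
```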

Summary

If we are loading a component of a tensor subclass in create_quantized_param(), called by _load_state_dict_into_meta_model(), we add it to the model as a new parameter. Then, after all parameters are loaded, we unflatten the state_dict and reassign the model parameters.
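A toy, self-contained sketch of that two-phase idea; ToyModule and rebuild_weight are made up for illustration, and a plain dequantized tensor stands in for the real Float8Tensor that the actual code rebuilds via unflatten_tensor_state_dict:

```python
import torch
import torch.nn as nn


class ToyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # Phase 1: flattened components (possibly arriving from different shard
        # files) are parked on the module as ordinary parameters under their
        # flattened names.
        self.register_parameter("_weight_qdata", nn.Parameter(torch.zeros(4, 4), requires_grad=False))
        self.register_parameter("_weight_scale", nn.Parameter(torch.ones(4, 1), requires_grad=False))


def rebuild_weight(module: nn.Module) -> None:
    # Phase 2: once everything is loaded, drop the flattened components and
    # register the recombined weight. Here the "subclass" is faked by a plain
    # product; the real code reassigns an actual tensor subclass.
    qdata, scale = module._weight_qdata, module._weight_scale
    delattr(module, "_weight_qdata")
    delattr(module, "_weight_scale")
    module.register_parameter("weight", nn.Parameter(qdata * scale, requires_grad=False))


m = ToyModule()
rebuild_weight(m)
print([name for name, _ in m.named_parameters()])  # ['weight']
```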

Testing

Modified unit tests to cover all tensor subclasses:
python tests/quantization/torchao_integration/test_torchao.py -k TorchAoSafeSerializationTest

@liangel-02 marked this pull request as draft on November 3, 2025 17:43
@liangel-02 force-pushed the torchao-safetensors-sharding branch from 8b6b802 to eeb8451 on November 3, 2025 17:54
@liangel-02 marked this pull request as ready for review on November 3, 2025 18:32
@github-actions bot requested review from MekkCyber and SunMarc on November 3, 2025 18:33
@liangel-02 force-pushed the torchao-safetensors-sharding branch 2 times, most recently from a431b9a to 5a62843, on November 3, 2025 21:12
@jerryzh168 (Contributor) left a comment


thanks, looks good mostly, had one more inline comment

@liangel-02 force-pushed the torchao-safetensors-sharding branch from 5a62843 to 1a020ed on November 3, 2025 21:19
@SunMarc (Member) left a comment


Thanks for your work! Left a couple of comments. Btw, we will soon refactor how quantization is applied as we move to dynamic weight loading like vllm. This should help with getting support for features like TP.

Comment on lines 245 to 268
if TORCHAO_VERSION >= version.parse("0.14.0") and is_metadata_torchao(self.metadata):
    updated_state_dict = unflatten_tensor_state_dict(model.state_dict(), metadata)

    weights_to_register = set(updated_state_dict.keys())

    for name, param in list(model.named_parameters()):
        module_fqn, weight_name = name.rsplit(".", 1)
        module = model.get_submodule(module_fqn)
        weight = getattr(module, weight_name)

        device = weight.device
        requires_grad = weight.requires_grad

        if "_weight_" in weight_name:
            delattr(module, weight_name)

        if name in weights_to_register:
            new_param_value = updated_state_dict[name]
            new_param = torch.nn.Parameter(new_param_value.to(device), requires_grad=requires_grad)
            module.register_parameter(weight_name, new_param)

            weights_to_register.remove(name)

    model.load_state_dict(updated_state_dict, strict=False)
Member


so instead of performing unflatten_tensor_state_dict in create_quantized_param, we do it here at the very end and we just store the flattened weights in the module?

@liangel-02 (Contributor, Author)


yeah, we don't want to do it in create_quantized_param since we'd only have access to at most one shard file there, and we want to handle the case where tensor subclass attributes are split across multiple files

we call unflatten_tensor_state_dict at the very end to get the recovered state dict, and then iterate through the model and replace the weights that represent the tensor attributes with the entire tensor subclass.
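For illustration, a toy version of that final regrouping step (the key naming is assumed; the real reconstruction is done by torchao's unflatten_tensor_state_dict using the full state dict plus the safetensors metadata, and it rebuilds actual tensor subclasses such as Float8Tensor):

```python
from collections import defaultdict

import torch

flat_state_dict = {
    "proj._weight_qdata": torch.zeros(4, 4),  # component loaded from shard 1
    "proj._weight_scale": torch.ones(4, 1),   # component loaded from shard 2
    "norm.weight": torch.ones(4),             # ordinary, non-quantized tensor
}

grouped = defaultdict(dict)
for key, value in flat_state_dict.items():
    module_fqn, leaf = key.rsplit(".", 1)
    if leaf.startswith("_weight_"):
        # collect the attributes of one tensor subclass under its original parameter name
        grouped[f"{module_fqn}.weight"][leaf.removeprefix("_weight_")] = value
    else:
        grouped[key] = value  # passes through unchanged

print({k: type(v).__name__ for k, v in grouped.items()})
# {'proj.weight': 'dict', 'norm.weight': 'Tensor'}
```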

@liangel-02 force-pushed the torchao-safetensors-sharding branch from 1a020ed to 7cdb0c6 on November 4, 2025 15:58
@liangel-02 requested a review from SunMarc on November 4, 2025 15:59
@liangel-02 force-pushed the torchao-safetensors-sharding branch from 7cdb0c6 to e4773a5 on November 4, 2025 18:15

github-actions bot commented Nov 4, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: torchao_integration

@liangel-02 force-pushed the torchao-safetensors-sharding branch 2 times, most recently from 0e3c0d5 to a2df2ec, on November 4, 2025 18:22
@liangel-02 force-pushed the torchao-safetensors-sharding branch from a2df2ec to 3b60f7c on November 4, 2025 18:29