Optimize `to_py_obj` for python-native numeric lists and scalars #36885

n0gu-furiosa · 2025-03-21T12:58:24Z

What does this PR do?

This PR optimizes to_py_obj by adding early returns for python-native numeric scalars and 1D lists/tuples of numbers. This avoids unnecessary recursive conversions, which can significantly impact performance of decode().

Fixes #36872

In the provided example from #36872, the runtime decreased from approximately 11 seconds to 0.8 seconds.

Before submitting

- This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1 @ArthurZucker

github-actions · 2025-03-21T12:58:35Z

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

Rocketknight1 · 2025-03-21T14:07:20Z

This seems like a good idea, but I worry that there might be some edge cases where it fails because it only tests the first element!
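
To illustrate the concern, here is a minimal sketch of what a first-element check can miss. Both helper names are hypothetical and are not the PR's actual code; the first one only mimics the idea of deciding based on `obj[0]`.

```python
def first_element_is_number(obj):
    """Hypothetical shortcut: decide based on obj[0] only."""
    return (
        isinstance(obj, (list, tuple))
        and len(obj) > 0
        and isinstance(obj[0], (int, float))
    )


def all_elements_are_numbers(obj):
    """Slower but exact: inspect every element."""
    return isinstance(obj, (list, tuple)) and all(
        isinstance(o, (int, float)) for o in obj
    )


flat = [1, 2, 3]
mixed = [1, [2, 3], (4, 5)]  # starts with a number, but is not a flat list

# The shortcut accepts both inputs; the exact check rejects the mixed one.
print(first_element_is_number(flat), all_elements_are_numbers(flat))
print(first_element_is_number(mixed), all_elements_are_numbers(mixed))
```

An early return taken on `mixed` would skip the conversion of the nested elements, which is the edge case raised here.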

One idea would be to use a conversion like np.array() as a test, without actually returning its output. For example, if:

- np.array() succeeds
- The returned array has 1 dimension
- The returned array has int/float dtype

Then we can guarantee that the input list/tuple was a flat list of numbers and just return the original list. It would guarantee correctness without needing to recursively call to_py_obj(). WDYT?
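
The three conditions above could be sketched roughly as follows. The helper name `is_flat_numeric_list` is an assumption made for illustration, and this is not necessarily how the merged code is structured.

```python
import numpy as np


def is_flat_numeric_list(obj):
    """Hypothetical helper: True if obj is a flat list/tuple of ints or floats."""
    if not isinstance(obj, (list, tuple)):
        return False
    try:
        # Used only as a test; the array itself is not returned.
        arr = np.array(obj)
    except Exception:
        # Ragged or otherwise unconvertible input.
        return False
    is_number_dtype = np.issubdtype(arr.dtype, np.integer) or np.issubdtype(
        arr.dtype, np.floating
    )
    return arr.ndim == 1 and is_number_dtype


print(is_flat_numeric_list([1, 2, 3]))
print(is_flat_numeric_list([[1, 2], [3, 4]]))
print(is_flat_numeric_list([1, "a"]))
```

When the check passes, the caller can return the original list directly; otherwise it falls back to the recursive path.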

n0gu-furiosa · 2025-03-21T15:09:26Z

@Rocketknight1 Great idea. Thanks for the suggestion!

I ran some benchmarks comparing different approaches (using the same example code, but with an increased number of iterations). Here are the results:

- Baseline before this PR: ~29.5s per 3000 iterations
- Initial optimization (63bfca8): ~2.3s per 3000 iterations
- Initial optimization without the obj[0] bypass hack (code as follows): ~3.0s per 3000 iterations
```
elif isinstance(obj, (list, tuple)):
    return [to_py_obj(o) for o in obj]
```
  This version was tested to serve as a baseline for approaches that traverse all elements.
- And finally, the suggested np.array approach (4d4baa7): ~2.7s per 3000 iterations

As you can see, the np.array-based check seems to offer a nice balance between type safety and performance.

Also, instead of checking whether the array has 1 dimension and returning obj (a), I opted to return arr.tolist() regardless of the array's dimension (b). This allows the same optimization to apply to native multi-dimensional python lists as well. I benchmarked both options using the same example code, and the results were not significantly different (~2.47s for (a) and ~2.49s for (b)). Since this test used a 1D array - which favors (a) - I believe (b) is generally a more flexible and equally performant option.
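
For readers comparing the two options, here is a rough sketch of (a) and (b) side by side. The function names are hypothetical, and returning `None` to signal "shortcut not applicable" is an assumption made to keep the example small; it is not taken from the PR.

```python
import numpy as np


def _has_number_dtype(arr):
    """True for integer or floating-point arrays (bool and object are excluded)."""
    return np.issubdtype(arr.dtype, np.integer) or np.issubdtype(
        arr.dtype, np.floating
    )


def shortcut_option_a(obj):
    """(a): accept only 1D numeric input and return the original object."""
    try:
        arr = np.array(obj)
    except Exception:
        return None
    if arr.ndim == 1 and _has_number_dtype(arr):
        return obj
    return None  # caller would fall back to the recursive path


def shortcut_option_b(obj):
    """(b): accept numeric input of any dimension and return arr.tolist()."""
    try:
        arr = np.array(obj)
    except Exception:
        return None
    if _has_number_dtype(arr):
        return arr.tolist()
    return None  # caller would fall back to the recursive path


nested = [[1, 2], (3, 4)]
print(shortcut_option_a(nested))  # not 1D, so the shortcut does not apply
print(shortcut_option_b(nested))  # nested tuples come back as lists
```

One difference worth keeping in mind: because (b) round-trips through NumPy, a list that mixes ints and floats comes back with every element as a float, whereas (a) hands back the input untouched.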

Let me know if you have any feedback or further suggestions.

Rocketknight1

Yes, this seems good now! I made one small suggestion, so let me know what you think and then we can merge this.

Rocketknight1 · 2025-03-24T15:20:49Z

src/transformers/utils/generic.py

    """
    Convert a TensorFlow tensor, PyTorch tensor, Numpy array or python list to a python list.
    """
+    if is_py_number(obj):


Suggested change

-    if is_py_number(obj):
+    if isinstance(obj, (int, float)):

Since we're only using the function once, we can just inline it directly here. I think it's understandable!

If you merge this you should also delete is_py_number()

n0gu-furiosa · 2025-03-25

Applied in a04e338. Also added some test code for to_py_obj here.

Rocketknight1 · 2025-03-24T15:27:46Z

Also, one more thought: I think this should still work even if obj is a list/tuple containing e.g. Torch/TF arrays. However, we should be careful around that case, since I think np.array() will convert lists of those too.
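
A small sketch of the behaviour being described, using NumPy arrays as a stand-in for framework tensors. That substitution is an assumption: whether `np.array()` handles a list of Torch/TF tensors the same way depends on the framework and device, and is not verified here.

```python
import numpy as np

# A Python list whose elements are array objects rather than plain numbers.
rows = [np.array([1, 2]), np.array([3, 4])]

# np.array() stacks the elements into a single 2D numeric array...
arr = np.array(rows)
print(arr.shape)

# ...so a dtype-based check passes, and tolist() yields nested Python ints.
as_list = arr.tolist()
print(as_list)
print(all(type(v) is int for row in as_list for v in row))
```

In this stand-in case the result is still a plain nested list, which matches the expectation that the shortcut should keep working for such inputs.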

Rocketknight1 · 2025-03-26T16:48:59Z

This looks good to me now! Ping me whenever you're ready for me to merge it @n0gu-furiosa

n0gu-furiosa · 2025-03-27T02:01:59Z

@Rocketknight1 Everything’s ready on my end. Please feel free to merge whenever you get a chance. Thanks in advance!

ArthurZucker · 2025-03-27T13:15:58Z

thanks 🤗

…gingface#36885) * Optimize to_py_obj for python-native numeric lists and scalars * Fix bug that tuple is not converted to list * Try np.array for more robust type checking * Apply review and add tests for to_py_obj

Optimize to_py_obj for python-native numeric lists and scalars

1503a4f

github-actions bot marked this pull request as draft March 21, 2025 12:58

n0gu-furiosa marked this pull request as ready for review March 21, 2025 13:03

github-actions bot requested review from ArthurZucker and Rocketknight1 March 21, 2025 13:04

Fix bug that tuple is not converted to list

63bfca8

Try np.array for more robust type checking

4d4baa7

Rocketknight1 approved these changes Mar 24, 2025

n0gu-furiosa and others added 3 commits March 25, 2025 16:56

Apply review and add tests for to_py_obj

a04e338

Merge branch 'main' into n0gu-fix

2fc4e4b

Merge branch 'main' into n0gu-fix

620d1ad

ArthurZucker approved these changes Mar 27, 2025

ArthurZucker merged commit d1eafe8 into huggingface:main Mar 27, 2025
18 checks passed

n0gu-furiosa deleted the n0gu-fix branch March 27, 2025 13:17

Rocketknight1 mentioned this pull request Apr 1, 2025

Add performance-optimized version of to_py_obj that avoids redundant … #37167

Closed

5 tasks

22quinn mentioned this pull request Jun 24, 2025

[PERF] Use faster way of decode in tokenizer: avoid useless list-to-list conversion vllm-project/vllm#20000

Merged

2 tasks
