Skip to content

Conversation

@cyyever
Copy link
Collaborator

@cyyever cyyever commented Feb 8, 2024

This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex.

@pytorch-bot
Copy link

pytorch-bot bot commented Feb 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119449

Note: Links to docs will display an error until the docs builds have been completed.

❌ 9 New Failures

As of commit 6531b55 with merge base 6e10f7d (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@cyyever cyyever changed the title [Environment Variable][1] Use thread-safe getenv in c10 [Environment Variable][1/N] Use thread-safe getenv in c10 Feb 8, 2024
@cyyever cyyever requested a review from albanD February 8, 2024 10:57
Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool!

Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please write a better description at what you are trying to achieve here?
I haven't read the standard, but being familiar with the implementation details, I don't see why getenv on its own is not thread safe.

It is probably unsafe when there are a concurrent putenv calls, but you are not adding that one to your env utils.

[Edit] To quote https://en.cppreference.com/w/cpp/utility/program/getenv looks like it is part of the standard:

This function is thread-safe (calling it from multiple threads does not introduce a data race) as long as no other function modifies the host environment. In particular, the POSIX functions setenv(), unsetenv(), and putenv() would introduce a data race if called without synchronization.

[Edit2] Is this about suppressing some clang-tidy warnings or solving a real problem? If later, then solution is to provide thread-safe extern "C" implementation of those methods in the code, that would take over libc ones (all libc functions are weak symbols by design so that one can app can choose to override them) and provide thread safety from getenv calls fro say oneDNN, libcuda etc

@albanD
Copy link
Collaborator

albanD commented Feb 8, 2024

I would expect that a user setting something on os.env in python would trigger such a race? That was my assumption here.

@malfet
Copy link
Contributor

malfet commented Feb 8, 2024

I would expect that a user setting something on os.env in python would trigger such a race? That was my assumption here.

Yes, but the current implementation would do nothing to avoid it, as putenv call from CPython would not be serialized against getenv calls made from libtorch

@drisspg drisspg added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Feb 8, 2024
@cyyever
Copy link
Collaborator Author

cyyever commented Feb 9, 2024

I would expect that a user setting something on os.env in python would trigger such a race? That was my assumption here.

A specific reproducing example may be hard to create. Also note that Torch may be used in a pure multi-threading C++ environment.

@cyyever
Copy link
Collaborator Author

cyyever commented Feb 9, 2024

@malfet @albanD set_env was added.

@cyyever cyyever force-pushed the th_safe_env branch 3 times, most recently from fa9f374 to b760fdf Compare February 9, 2024 03:59
@cyyever cyyever changed the title [Environment Variable][1/N] Use thread-safe getenv in c10 [Environment Variable][1/N] Use thread-safe env wrappers in c10 Feb 9, 2024
@cyyever cyyever changed the title [Environment Variable][1/N] Use thread-safe env wrappers in c10 [Environment Variable][1/N] Use thread-safe env variable wrappers in c10 Feb 9, 2024
@cyyever cyyever changed the title [Environment Variable][1/N] Use thread-safe env variable wrappers in c10 [Environment Variable][1/N] Use thread-safe env variable API in c10 Feb 9, 2024
@cyyever
Copy link
Collaborator Author

cyyever commented Feb 9, 2024

Can you please write a better description at what you are trying to achieve here? I haven't read the standard, but being familiar with the implementation details, I don't see why getenv on its own is not thread safe.

It is probably unsafe when there are a concurrent putenv calls, but you are not adding that one to your env utils.

[Edit] To quote https://en.cppreference.com/w/cpp/utility/program/getenv looks like it is part of the standard:

This function is thread-safe (calling it from multiple threads does not introduce a data race) as long as no other function modifies the host environment. In particular, the POSIX functions setenv(), unsetenv(), and putenv() would introduce a data race if called without synchronization.

[Edit2] Is this about suppressing some clang-tidy warnings or solving a real problem? If later, then solution is to provide thread-safe extern "C" implementation of those methods in the code, that would take over libc ones (all libc functions are weak symbols by design so that one can app can choose to override them) and provide thread safety from getenv calls fro say oneDNN, libcuda etc

The purpose is to suppress clang-tidy warnings and solve possible future problems by providing thread-safe getenv and setenv. Overwriting libc implementations may not be a good idea because we have to support MSVC and other no GNU platforms so that I prefer to use explicit c10 API.

@cyyever
Copy link
Collaborator Author

cyyever commented Feb 9, 2024

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Feb 9, 2024
@cyyever
Copy link
Collaborator Author

cyyever commented Feb 9, 2024

@pytorchbot label ciflow/trunk

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 9, 2024
@github-actions
Copy link
Contributor

github-actions bot commented Apr 9, 2024

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Apr 9, 2024
@cyyever cyyever removed the Stale label Apr 10, 2024
@cyyever cyyever force-pushed the th_safe_env branch 2 times, most recently from 5e05925 to e66b7e7 Compare April 10, 2024 06:36
@cyyever cyyever requested a review from albanD April 10, 2024 07:30
@eqy
Copy link
Collaborator

eqy commented Sep 15, 2024

commenting that this is a real issue we've seen in practice pytorch/kineto#984

but problematic interactions might be more likely between PyTorch and third party libraries as above

@cyyever cyyever removed the Stale label Sep 15, 2024
@cyyever
Copy link
Collaborator Author

cyyever commented Sep 15, 2024

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased th_safe_env onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout th_safe_env && git pull --rebase)

@cyyever
Copy link
Collaborator Author

cyyever commented Oct 1, 2024

Rewrite code to avoid changing existing API.

@cyyever
Copy link
Collaborator Author

cyyever commented Oct 1, 2024

@pytorchmergebot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

@cyyever
Copy link
Collaborator Author

cyyever commented Oct 1, 2024

@pytorchbot rebase -b main

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased th_safe_env onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout th_safe_env && git pull --rebase)

@cyyever
Copy link
Collaborator Author

cyyever commented Oct 1, 2024

@pytorchbot merge -f "Unrelated failures"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

AnantGulati pushed a commit to AnantGulati/pytorch that referenced this pull request Oct 2, 2024
…ytorch#119449)

This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex.
Pull Request resolved: pytorch#119449
Approved by: https://github.com/malfet, https://github.com/albanD, https://github.com/eqy
pytorchmergebot pushed a commit that referenced this pull request Oct 4, 2024
This follows #119449 to make setenv thread-safe.

Pull Request resolved: #124485
Approved by: https://github.com/eqy
@cyyever cyyever deleted the th_safe_env branch May 3, 2025 00:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/binaries Trigger all binary build and upload jobs on the PR ciflow/trunk Trigger trunk jobs on your pull request Merged open source Reverted topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants