Skip to content

Conversation

@bdhirsh
Copy link
Contributor

@bdhirsh bdhirsh commented Jul 23, 2024

cpp->python exceptions can be slow - *especially if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.

Stack from ghstack (oldest at bottom):

cc @albanD @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames

@pytorch-bot
Copy link

pytorch-bot bot commented Jul 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131481

Note: Links to docs will display an error until the docs builds have been completed.

❌ 42 New Failures, 1 Cancelled Job, 14 Unrelated Failures

As of commit 09b09fb with merge base 932ae13 (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




[ghstack-poisoned]
TORCH_API void runJITCPPTests();
#endif

static thread_local uint64_t cpp_to_python_translated_exception_count{0};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought process here was:

(1) I figured this can't be a plain uint64_t, since multiple threads may be raising exceptions and can increment the counter.

(2) I originally made it a std::atomic<uint64_t>, but then I didn't want to risk the increment code being very slow, and exacerbating slowness problems in cases where there are many cpp-to-python exceptions. Alternatively, we could make it slow but gate it behind a config / env var, but this seems like a nice metric to be able to track in an always-on way.

(3) thread_local will under-count if multiple threads are all raising exceptions. My hope is that we are very unlikely to hit a situation where the main thread raises no cpp-to-python exception, but a side thread raises a large number (although if this happens, then... we will not be able to track it in metrics).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't worry about (2), you're already in slow pain with exceptions, the atomic is a rounding error.

…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




[ghstack-poisoned]
@bdhirsh
Copy link
Contributor Author

bdhirsh commented Jul 23, 2024

I gave up looking for a way to grab the output from CI, so I just ran a script that adds the extra column to every existing csv we check in (set to zero - if there are any exceptions, I'll expect CI to fail)

…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




[ghstack-poisoned]
@bdhirsh bdhirsh requested a review from a team as a code owner July 24, 2024 14:32
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




[ghstack-poisoned]
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




[ghstack-poisoned]
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




[ghstack-poisoned]
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




[ghstack-poisoned]
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




[ghstack-poisoned]
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




cc albanD voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




cc albanD voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




cc albanD voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Sep 14, 2024
…ark suite

ghstack-source-id: c82209e
Pull Request resolved: #131481
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




cc albanD voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Sep 18, 2024
…ark suite

ghstack-source-id: c7432b9
Pull Request resolved: #131481
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




cc albanD voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Sep 18, 2024
…ark suite

ghstack-source-id: 0766364
Pull Request resolved: #131481
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




cc albanD voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Oct 14, 2024
…ark suite

ghstack-source-id: 7f70663
Pull Request resolved: #131481
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




cc albanD voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




cc albanD voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Oct 14, 2024
…ark suite

ghstack-source-id: 5f62345
Pull Request resolved: #131481
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




cc albanD voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




cc albanD voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
…pile benchmark suite"

cpp->python exceptions can be slow - **especially* if you are running pytorch with infra that tries to symbolize C++ stacktraces.

This PR attempts to:
(1) add a counter every time pybind catches a C++ exception that enters python
(2) track the value of this counter in our benchmark suite, so we can hopefully drive it to zero.

It might be the case that we have very few (or even zero) of these cases showing up in torchbench. Emprically from internal, it also appears that this can show up when torch.compiling custom ops. So if few-or-zero exceptions are encountered in torchbench, we should also add some tests for avoiding C++ exceptions in the common path of torch.compile usage with custom ops.




cc albanD voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Oct 16, 2024
…ark suite

ghstack-source-id: f7c0c11
Pull Request resolved: #131481
@github-actions
Copy link
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Dec 28, 2024
@github-actions github-actions bot closed this Jan 27, 2025
@github-actions github-actions bot deleted the gh/bdhirsh/588/head branch February 28, 2025 02:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants