
Conversation

@mrshenli
Contributor

@mrshenli mrshenli commented Nov 20, 2022

Stack from ghstack (oldest at bottom):

closes #35643

This PR is mostly borrowed from #82042. Thanks @Padarn for implementing
the first version and debugging the errors.

Based on the discussion in #82042, this PR adds a `with_kwargs`
argument to the `register_forward_pre_hook` and `register_forward_hook`
methods. When the argument is set to `True`, the provided hook must accept
keyword arguments. Under the hood, this PR adds a
`_forward_pre_hooks_with_kwargs` and a `_forward_hooks_with_kwargs`
set to keep track of which hooks accept kwargs.

Differential Revision: D41431111

cc @albanD @mruberry @jbschlosser @walterddr @kshitij12345 @saketh-are

closes #35643

This PR is mostly copied from #82042. Thanks @Padarn for implementing
the first version and debugging the errors.

Based on the discussion in #82042, this PR adds a `with_kwargs`
argument to `register_forward_pre_hook` and `register_forward_hook`
methods. When the argument is set to `True`, the provided hook must accept
keyword arguments. Under the hood, the hook is wrapped into `_ForwardHook`
and `_ForwardPreHook` types to avoid backward compatibility issues.

[ghstack-poisoned]
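
For reference, a minimal usage sketch of the new argument as described above. The `Scale` module and hook bodies are illustrative, and the hook signatures (`hook(module, args, kwargs)` for pre-hooks, `hook(module, args, kwargs, output)` for forward hooks) are assumed from this description rather than copied from the diff:

```python
import torch
import torch.nn as nn

def pre_hook(module, args, kwargs):
    # Returning (args, kwargs) replaces the inputs; returning None keeps them.
    new_kwargs = dict(kwargs)
    new_kwargs["scale"] = 2.0
    return args, new_kwargs

def post_hook(module, args, kwargs, output):
    # Returning a value replaces the output; returning None keeps it.
    return output + 1

class Scale(nn.Module):
    def forward(self, x, scale=1.0):
        return x * scale

m = Scale()
m.register_forward_pre_hook(pre_hook, with_kwargs=True)
m.register_forward_hook(post_hook, with_kwargs=True)
out = m(torch.ones(3), scale=1.0)  # pre-hook bumps scale to 2.0, post-hook adds 1
```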
@pytorch-bot

pytorch-bot bot commented Nov 20, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89389

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 Failures

As of commit d940db0:

The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the oncall: quantization Quantization support in PyTorch label Nov 20, 2022
@mrshenli mrshenli added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 20, 2022
@mrshenli mrshenli added the module: nn Related to torch.nn label Nov 20, 2022
@mrshenli mrshenli added the topic: improvements topic category label Nov 20, 2022
@Padarn
Contributor

Padarn commented Nov 21, 2022

Thanks @mrshenli! I've closed my PR. I like the changes you've made 👍

self.module = weakref.ref(state["module"])


# N.B.: This class is NOT deriving from `_WrappedHook`, because pre- and post-
Contributor

Yeah, I had this same realisation. It does feel like they should be mergeable, but I couldn't see a simple way to do so.

@mrshenli
Contributor Author

@mrshenli has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor

@soulitzer soulitzer left a comment


Thanks for the update, mostly looks good! Mainly just a comment about cleaning up the new dict on removal.

@mrshenli
Contributor Author

Thanks @awgu and @soulitzer for the review! Before updating this PR, I will let CI run for a bit longer while waiting for comments from other reviewers.

# Marks whether the corresponding _forward_hooks accept kwargs or not.
# As JIT does not support Set[int], this dict is used as a set, where all
# hooks represented in this dict accept kwargs.
_forward_hooks_with_kwargs: Dict[int, bool]
Collaborator


Yes, the `handle.remove()` method should remove the entry in the `_forward_hooks_with_kwargs` dict.
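
An illustrative check of that behavior, assuming the `with_kwargs` API from this PR and the private `_forward_hooks_with_kwargs` bookkeeping dict shown above (touched here only for illustration):

```python
import torch
import torch.nn as nn

def hook(module, args, kwargs, output):
    return None  # observe only, keep the original output

m = nn.Linear(2, 2)
handle = m.register_forward_hook(hook, with_kwargs=True)
assert handle.id in m._forward_hooks
assert handle.id in m._forward_hooks_with_kwargs

handle.remove()
assert handle.id not in m._forward_hooks
assert handle.id not in m._forward_hooks_with_kwargs  # entry cleaned up on removal
```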


def __init__(self, hooks_dict: Any) -> None:
def __init__(
    self, hooks_dict: Any, *, kwargs_dict: Dict[int, bool] = None
Collaborator


Not sure what `kwargs_dict` means in this context. `extra_dict` sounds better.
Also, the typing here should be the same as `hooks_dict`, no?

Even better might be to just do `*args` and take as many dictionaries as the user wants.

Contributor Author


Let me do `extra_dict` for now. We can add support for an arbitrary number of dicts in follow-ups if necessary.
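
To make the `extra_dict` direction concrete, here is a minimal, simplified sketch of the idea (illustrative names, not the actual `torch.utils.hooks.RemovableHandle` implementation): the handle weakly references one extra bookkeeping dict and clears its entry from both dicts on removal.

```python
import weakref
from collections import OrderedDict
from typing import Any, Callable

class ExtraDictHandle:
    """Simplified, illustrative handle: remove() drops its entry from the hooks
    dict and, if one was given, from an extra bookkeeping dict as well."""

    _next_id: int = 0

    def __init__(self, hooks_dict: Any, *, extra_dict: Any = None) -> None:
        # Weak references, so the handle does not keep the module's dicts alive.
        self.hooks_dict_ref = weakref.ref(hooks_dict)
        self.extra_dict_ref = None if extra_dict is None else weakref.ref(extra_dict)
        self.id = ExtraDictHandle._next_id
        ExtraDictHandle._next_id += 1

    def remove(self) -> None:
        hooks_dict = self.hooks_dict_ref()
        if hooks_dict is not None and self.id in hooks_dict:
            del hooks_dict[self.id]
        if self.extra_dict_ref is not None:
            extra_dict = self.extra_dict_ref()
            if extra_dict is not None and self.id in extra_dict:
                del extra_dict[self.id]

# Usage: OrderedDict (not a plain dict) so the dicts can be weakly referenced.
hooks: "OrderedDict[int, Callable]" = OrderedDict()
with_kwargs_markers: "OrderedDict[int, bool]" = OrderedDict()
h = ExtraDictHandle(hooks, extra_dict=with_kwargs_markers)
hooks[h.id] = lambda *a, **kw: None
with_kwargs_markers[h.id] = True
h.remove()
assert h.id not in hooks and h.id not in with_kwargs_markers
```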

self.kwargs_dict_ref = (
    None
    if len(state) < 3
    else weakref.ref(dict() if state[2] is None else state[2])
Collaborator


Why `dict()` and not `OrderedDict()` like the one above?

Contributor Author


I thought we didn't need ordering for this one. Let me change it to `OrderedDict()` to keep things consistent.

Collaborator


Same as with the other one tbh.
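
A side note (general Python behavior, not something raised in this thread): the container choice matters for the `weakref.ref(...)` call above, because plain `dict` instances cannot be weakly referenced while `OrderedDict` instances can:

```python
import weakref
from collections import OrderedDict

weakref.ref(OrderedDict())  # works: OrderedDict supports weak references

try:
    weakref.ref(dict())
except TypeError as err:
    print(err)  # cannot create weak reference to 'dict' object
```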

@mrshenli
Contributor Author

@mrshenli has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mrshenli
Contributor Author

@mrshenli has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Collaborator

@albanD albanD left a comment


Small nit about the error, but SGTM otherwise.
Will let @soulitzer take a final look as well!

mrshenli added a commit that referenced this pull request Nov 22, 2022
closes #35643

This PR is mostly borrowed from #82042. Thanks @Padarn for implementing
the first version and debugging the errors.

Based on the discussion in #82042, this PR adds a `with_kwargs`
argument to the `register_forward_pre_hook` and `register_forward_hook`
methods. When the argument is set to `True`, the provided hook must accept
keyword arguments. Under the hood, this PR adds a
`_forward_pre_hooks_with_kwargs` and a `_forward_hooks_with_kwargs`
set to keep track of which hooks accept kwargs.

ghstack-source-id: 1f40df4
Pull Request resolved: #89389
@mrshenli
Contributor Author

@mrshenli has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor

@soulitzer soulitzer left a comment


Looks good, thanks!

@mrshenli
Contributor Author

Regarding the errors:

  1. periodic / linux-bionic-cuda11.6-py3-gcc7-slow-gradcheck https://github.com/pytorch/pytorch/actions/runs/3527608027/jobs/5917112620
2022-11-22T23:52:48.9571582Z ======================================================================
2022-11-22T23:52:48.9571873Z ERROR [0.032s]: test_index_select_exhaustive_index_large_cuda_float64 (__main__.TestSparseCUDA)
2022-11-22T23:52:48.9572149Z ----------------------------------------------------------------------
2022-11-22T23:52:48.9572285Z Traceback (most recent call last):
2022-11-22T23:52:48.9572640Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2054, in wrapper
2022-11-22T23:52:48.9572757Z     method(*args, **kwargs)
2022-11-22T23:52:48.9573122Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 391, in instantiated_test
2022-11-22T23:52:48.9573224Z     raise rte
2022-11-22T23:52:48.9573605Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_device_type.py", line 378, in instantiated_test
2022-11-22T23:52:48.9573745Z     result = test(self, **param_kwargs)
2022-11-22T23:52:48.9574094Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3472, in wrapped
2022-11-22T23:52:48.9574241Z     f(self, *args, **kwargs, coalesced=True)
2022-11-22T23:52:48.9574488Z   File "/var/lib/jenkins/workspace/test/test_sparse.py", line 1135, in test_index_select_exhaustive_index_large
2022-11-22T23:52:48.9574702Z     self._test_index_select_exhaustive_index((100, 50, 3, 3), (2, 3), device, dtype, coalesced)
2022-11-22T23:52:48.9574925Z   File "/var/lib/jenkins/workspace/test/test_sparse.py", line 1101, in _test_index_select_exhaustive_index
2022-11-22T23:52:48.9575116Z     t_sparse = t.to_sparse().coalesce() if coalesced else t.to_sparse()
2022-11-22T23:52:48.9575684Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.44 GiB total capacity; 355.50 KiB already allocated; 320.00 KiB free; 2.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I couldn't reproduce the OOM. Based on the error message, could it be that CUDA contexts from concurrent CI tests ate up the CUDA memory? cc @huydhn

  2. pull / linux-focal-py3.7-clang7-asan / test : https://github.com/pytorch/pytorch/actions/runs/3527607142/jobs/5917038510

This failure is unrelated to this PR and already occurs on master. See https://github.com/pytorch/pytorch/actions/runs/3518138536/jobs/5896929592

/usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr_base.h:911:14: runtime error: member call on address 0x60b000b767b0 which does not point to an object of type 'std::_Sp_counted_base<__gnu_cxx::_S_atomic>'
0x60b000b767b0: note: object has a possibly invalid vptr: abs(offset to top) too big
 dd 0f 00 2a  28 1a 80 3f f1 7f 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^~~~~~~~~~~~~~~~~~~~~~~
              possibly invalid vptr
    #0 0x7ff139848219 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count(std::__weak_count<(__gnu_cxx::_Lock_policy)2> const&, std::nothrow_t) (/opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so+0xdf2e219)
    #1 0x7ff14e7972f8 in libkineto::(anonymous namespace)::configFactories() (/opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so+0x22e7d2f8)
    #2 0x7ff14e797f79 in libkineto::Config::Config() (/opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so+0x22e7df79)
    #3 0x7ff14e7b129d in libkineto::ConfigLoader::updateConfigThread() (/opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so+0x22e9729d)
    #4 0x7ff16eb2ade3  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xd6de3)
    #5 0x7ff16ee32608 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x8608)
    #6 0x7ff16ed57132 in clone (/lib/x86_64-linux-gnu/libc.so.6+0x11f132)

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr_base.h:911:14 in
FINISHED PRINTING LOG FILE of test_jit (/var/lib/jenkins/workspace/test/test-reports/test_jit_ansb04hr)

test_jit failed!
Traceback (most recent call last):
  File "test/run_test.py", line 1339, in <module>
    main()
  File "test/run_test.py", line 1303, in main
    raise RuntimeError("\n".join(failure_messages))
RuntimeError: test_jit failed!

@mrshenli
Contributor Author

One more failure, on win-vs2019-cuda11.7-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu): https://github.com/pytorch/pytorch/actions/runs/3527608027/jobs/5917374444

======================================================================
FAIL [0.003s]: test_ref_small_input_prod_cuda_int8 (__main__.TestReductionsCUDA)
Compares op against reference for small input tensors
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 378, in instantiated_test
    result = test(self, **param_kwargs)
  File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 828, in test_wrapper
    return test(*args, **kwargs)
  File "C:\actions-runner\_work\pytorch\pytorch\test\test_reductions.py", line 378, in test_ref_small_input
    self._test_ref(op, t)
  File "C:\actions-runner\_work\pytorch\pytorch\test\test_reductions.py", line 365, in _test_ref
    self.assertEqual(result, expected, exact_dtype=False)
  File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_utils.py", line 2561, in assertEqual
    assert_equal(
  File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_comparison.py", line 1118, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Scalars are not equal!

Absolute difference: 68719476736
Relative difference: inf

I checked the test source code; it is an op-only test and doesn't touch module hooks at all. Also, I couldn't reproduce this locally.

@ops(filter(lambda op: op.ref is not None, reduction_ops),
     allowed_dtypes=all_types_and_complex_and(torch.half, torch.bool))
def test_ref_small_input(self, device, dtype, op: ReductionOpInfo):
    """Compares op against reference for small input tensors"""
    t = make_tensor((5, 3, 4, 2), dtype=dtype, device=device, low=-2, high=2, exclude_zero=True)
    self._test_ref(op, t)
    for dim in [0, 1, 3] + ([[0, 2], [1, 3]] if op.supports_multiple_dims else []):
        self._test_ref(op, t, dim=dim)

@mrshenli
Contributor Author

4th failure: cuda11.6-py3.10-gcc7-sm86 / test https://github.com/pytorch/pytorch/actions/runs/3527607833/jobs/5918576014

This one is also unrelated to this PR and already occurs on master. See https://github.com/pytorch/pytorch/actions/runs/3528028962/jobs/5917920193

2022-11-23T02:10:11.0209152Z Generating XML reports...
2022-11-23T02:10:11.0209491Z Generated XML report: test-reports/python-unittest/test_jit_cuda_fuser/TEST-TestCudaFuser-20221123015624.xml
2022-11-23T02:10:11.0209824Z Generated XML report: test-reports/python-unittest/test_jit_cuda_fuser/TEST-TestEnableDisableCudaFuser-20221123015624.xml
2022-11-23T02:10:11.0210162Z Generated XML report: test-reports/python-unittest/test_jit_cuda_fuser/TEST-jit.test_fuser_common.TestFuserCommon-20221123015624.xml
2022-11-23T02:10:11.0210472Z Generated XML report: test-reports/python-unittest/test_jit_cuda_fuser/TEST-TestCudaFuserOpInfoCUDA-20221123015624.xml
2022-11-23T02:10:11.0210766Z FINISHED PRINTING LOG FILE of test_jit_cuda_fuser (/var/lib/jenkins/workspace/test/test-reports/test_jit_cuda_fuser_gemnwwrr)
2022-11-23T02:10:11.0210772Z 
2022-11-23T02:10:11.0210877Z Traceback (most recent call last):
2022-11-23T02:10:11.0211031Z   File "/var/lib/jenkins/workspace/test/run_test.py", line 1339, in <module>
2022-11-23T02:10:11.0211105Z     main()
2022-11-23T02:10:11.0211255Z   File "/var/lib/jenkins/workspace/test/run_test.py", line 1314, in main
2022-11-23T02:10:11.0211359Z     raise RuntimeError(err_message)
2022-11-23T02:10:11.0211504Z RuntimeError: test_jit_cuda_fuser failed! Received signal: SIGIOT

@mrshenli
Contributor Author

One more, same as the 3rd one: win-vs2019-cuda11.6-py3 / test, https://github.com/pytorch/pytorch/actions/runs/3527607833/jobs/5918576620#logs

======================================================================
FAIL [0.003s]: test_ref_small_input_prod_cuda_int8 (__main__.TestReductionsCUDA)
Compares op against reference for small input tensors
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 378, in instantiated_test
    result = test(self, **param_kwargs)
  File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 828, in test_wrapper
    return test(*args, **kwargs)
  File "C:\actions-runner\_work\pytorch\pytorch\test\test_reductions.py", line 378, in test_ref_small_input
    self._test_ref(op, t)
  File "C:\actions-runner\_work\pytorch\pytorch\test\test_reductions.py", line 365, in _test_ref
    self.assertEqual(result, expected, exact_dtype=False)
  File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_utils.py", line 2561, in assertEqual
    assert_equal(
  File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_comparison.py", line 1118, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Scalars are not equal!

Absolute difference: 68719476736
Relative difference: inf

@mrshenli
Contributor Author

@pytorchbot merge -f "all test failures are irrelevant, see comments above"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022

closes pytorch#35643

This PR is mostly borrowed from pytorch#82042. Thanks @Padarn for implementing
the first version and debugging the errors.

Based on the discussion in pytorch#82042, this PR adds a `with_kwargs`
argument to the `register_forward_pre_hook` and `register_forward_hook`
methods. When the argument is set to `True`, the provided hook must accept
keyword arguments. Under the hood, this PR adds a
`_forward_pre_hooks_with_kwargs` and a `_forward_hooks_with_kwargs`
set to keep track of which hooks accept kwargs.

Differential Revision: [D41431111](https://our.internmc.facebook.com/intern/diff/D41431111)

Pull Request resolved: pytorch#89389
Approved by: https://github.com/soulitzer
@facebook-github-bot facebook-github-bot deleted the gh/mrshenli/340/head branch June 8, 2023 18:02

Labels

ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: nn (Related to torch.nn)
release notes: nn (release notes category)
topic: improvements (topic category)
