Tags: pytorch/pytorch

trunk/89165c0a2b5d3c147c19a492437291c8ff18aa7f

Update triton to 3.5.1 release (#166968)

This includes the sm103 fix (triton-lang/triton#8485).

Pull Request resolved: #166968
Approved by: https://github.com/Lucaskabela, https://github.com/njriasan

trunk/59563dfe56a086a4a95025f0ccfe373bc1fd3759

Refactor out headeronly ArrayRef (#164991)

Differential Revision: [D85091961](https://our.internmc.facebook.com/intern/diff/D85091961)
Pull Request resolved: #164991
Approved by: https://github.com/swolchok

trunk/39160dba0c5120c65705a44e556c8c4af243e573

shrink_group implementation to expose ncclCommShrink API (#164518)

Closes #164529

This exposes the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch.

This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.

For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator)
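
A hedged sketch of what calling this from Python might look like. The entry point name `shrink_group` comes from this commit's title; the module path and signature shown here are assumptions, not a confirmed API:

```python
# Hypothetical usage sketch: carve failed ranks out of a NCCL process group
# without tearing down and re-initializing the whole job.
import torch
import torch.distributed as dist

dist.init_process_group("nccl")

failed_ranks = [3]  # e.g. a rank whose GPU hit an unrecoverable fault
if dist.get_rank() not in failed_ranks:
    # Assumed wrapper around ncclCommShrink; see the PR for the real signature.
    shrunk_pg = dist.distributed_c10d.shrink_group(failed_ranks)

    # Subsequent collectives run only over the surviving ranks.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t, group=shrunk_pg)
```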

Pull Request resolved: #164518
Approved by: https://github.com/kwen2501

trunk/14956eaef4a14901a95a6d0779d99db11fd7406b

[ROCm][CI] revert ROCm magma commit hash to last known good (#167044)

PR #166693 updated the magma commit hash, but the new hash has been linked to ROCm 7.1 CI failures. Revert to the last known working magma version.

Pull Request resolved: #167044
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>

trunk/6052a01b71277eb767d87daf47d109f8e0edd5c0

[BE][Typing][Dynamo] Type torch/_dynamo/variables/dicts.py (#167022)

Provides type coverage to torch/_dynamo/variables/dicts.py

Coverage report:
`mypy torch/_dynamo/variables/dicts.py --linecount-report /tmp/coverage_log`

Comparing before and after: coverage goes from 0 lines and 0 functions to 1547 lines and 89 functions.

Pull Request resolved: #167022
Approved by: https://github.com/Skylion007

trunk/5863ba1b2e4de9ea0ae16a663465ec5d3d6f9f52

[12/N] Apply ruff UP035 rule (#166929)

This PR continues applying the ruff UP035 rule to test code and some remaining torch files.
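
For context, UP035 flags imports from deprecated locations, most commonly `typing` aliases that now have builtin or `collections.abc` replacements (companion rules such as UP006 rewrite the annotations themselves). An illustrative before/after:

```python
# Before: UP035 flags these deprecated typing imports.
from typing import Callable, Dict, List

def apply(fns: List[Callable[[int], int]], xs: Dict[str, int]) -> None: ...

# After: Callable moves to collections.abc; builtin generics replace the aliases.
from collections.abc import Callable

def apply(fns: list[Callable[[int], int]], xs: dict[str, int]) -> None: ...
```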

Pull Request resolved: #166929
Approved by: https://github.com/Lucaskabela

trunk/4271ffe91849335ffbcc2014c948694f8ec107fd

don't produce invalid grid configs (#166974)

This is the proper fix for #164048; it also fixes gather and reverts #164049.
Pull Request resolved: #166974
Approved by: https://github.com/eqy

trunk/658c5f879c37142b1df51c7eb6c5a5bb06318597

[Inductor][Grouped Gemm] Add Blackwell CuTeDSL Kernel (#167003)

Summary: This is a reland of #165036, which previously contained a minor bug in the logic that determined whether the kernel should be enabled. As a result, it was incorrectly activated on non-Blackwell GPUs.

Test Plan:
Inductor test (fbcode):
`INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 TORCHINDUCTOR_CACHE_DIR=~/cutetest buck2 run mode/opt //caffe2/test/inductor:cutedsl_grouped_mm -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1"`

Tritonbench (fbcode):
`clear; CUDA_VISIBLE_DEVICES=7 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/opt //pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1" -- --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_cute_grouped_mm --precision bf16  --num-inputs 1 --metrics tflops,accuracy`

Tritonbench(oss):
`clear; CUDA_VISIBLE_DEVICES=2 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_triton_grouped_mm --precision bf16  --num-inputs 1 --metrics tflops,accuracy`

Unit Tests(oss):
`clear; python test/inductor/test_cutedsl_grouped_mm.py`
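
For orientation, a hedged sketch of the workload these tests exercise: a grouped GEMM compiled with max-autotune, the path on which Inductor can select the CuTeDSL candidate on Blackwell. The `torch._grouped_mm` call and its layout are assumptions inferred from the `aten_grouped_mm` baseline above, not an API confirmed by this commit:

```python
# Hedged sketch (assumed API): grouped GEMM under torch.compile max-autotune.
import torch

def grouped(a, b, offs):
    # Assumption: torch._grouped_mm multiplies row blocks of `a` (split at
    # `offs`) against the matching group's matrix in `b`.
    return torch._grouped_mm(a, b, offs=offs)

compiled = torch.compile(grouped, mode="max-autotune")

g, k, n = 4, 256, 128
offs = torch.tensor([64, 128, 192, 256], dtype=torch.int32, device="cuda")
a = torch.randn(256, k, device="cuda", dtype=torch.bfloat16)   # stacked LHS rows
b = torch.randn(g, k, n, device="cuda", dtype=torch.bfloat16)  # one RHS per group
out = compiled(a, b, offs)                                     # shape (256, n)
```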

Differential Revision: D86231180

Pull Request resolved: #167003
Approved by: https://github.com/jananisriram

trunk/641de23c96e2c0d2848a7aa2aacb2f77843177a5

ci: Add aarch64 docker builds for modern clang (#166416)

This should enable builds that use some Arm optimizations only available in the
newest versions of clang.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: #166416
Approved by: https://github.com/malfet

trunk/431dfe8692f3f927c19c739884054d7f1d42a33d

[dynamo] extend `collections.defaultdict` support with `*args`, `**kwargs` and custom `default_factory` (#166793)

Fixes #166238

Extend `collections.defaultdict` to accept `*args` and `**kwargs` in the constructor, and also support a custom `default_factory`, such as `dd.default_factory` (a `GetAttrVariable`).
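
An illustrative sketch of the newly traceable pattern (function and names here are hypothetical, not from the PR):

```python
# A defaultdict built with a custom default_factory plus constructor
# args/kwargs inside a compiled function: the combination this PR
# teaches dynamo to trace.
import collections
import torch

def zeros_factory():
    return torch.zeros(())

@torch.compile(fullgraph=True)
def bucket_sums(values):
    dd = collections.defaultdict(zeros_factory, {"bias": torch.ones(())})
    for i, v in enumerate(values):
        key = "even" if i % 2 == 0 else "odd"
        dd[key] = dd[key] + v   # missing keys come from zeros_factory
    return dd["even"] + dd["bias"], dd["odd"]

vals = [torch.tensor(float(i)) for i in range(4)]
print(bucket_sums(vals))  # (tensor(3.), tensor(4.))
```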

Pull Request resolved: #166793
Approved by: https://github.com/guilhermeleobas