[Inductor] optimize scalar welford_reduce #162709

jiayisunx wants to merge 12 commits into gh/jiayisunx/77/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162709

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (5 Unrelated Failures)

As of commit a0b880b with merge base 8cf0bdd:

- FLAKY - The following job failed but was likely due to flakiness present on trunk.
- BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
- UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch/csrc/inductor/cpp_prefix.h (Outdated)

```diff
 }

-template <typename T, uint64_t kChunkSize>
+template <typename T, typename S, uint64_t kChunkSize>
```
Can we use IsVecType to help get the scalar type and avoid using a new typename S?
Done, thanks for your comment!
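For reference, the trait-based idea looks roughly like this (a minimal, self-contained sketch; the trait members and helper names below are illustrative rather than the actual cpp_prefix.h definitions):

```cpp
// Illustrative only: a trait that reports whether T is a "vector" type and
// what its underlying scalar type is, so a template can recover the scalar
// type from T instead of taking a separate `typename S` parameter.
#include <cstdint>

template <typename T>
struct IsVecType {
  static constexpr bool value = false;
  using scalar_type = T;  // a plain scalar is its own scalar type
};

// Stand-in for a SIMD wrapper such as at::vec::Vectorized<T> (hypothetical).
template <typename T>
struct FakeVectorized {
  using value_type = T;
};

template <typename T>
struct IsVecType<FakeVectorized<T>> {
  static constexpr bool value = true;
  using scalar_type = T;
};

// A chunk-size template can now derive the scalar type internally.
template <typename T, uint64_t kChunkSize>
struct WelfordHelperSketch {
  using scalar_t = typename IsVecType<T>::scalar_type;
  scalar_t weight_recp = 0;  // e.g. a cached reciprocal of the chunk weight
};
```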
```python
# acc helper is not used for scalar welford_reduce
if reduction_type == "welford_reduce":
    return not use_scalar
return True
```
Should we also determine this based on the size of welford_reduce?
Thanks for your comment! WelfordHelper helps with two things: 1. It saves the reciprocal of the weights to avoid redundant divisions. 2. It saves the Welford stack, which is used to combine the Welford reduction with cascade summation. Because the first is beneficial for performance, WelfordHelper is used by default. Furthermore, this PR only optimizes the scalar welford_reduce implementation; I don't intend to change the behavior of the vectorized welford_reduce.
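Roughly, the interaction between the Welford accumulator and the cascade-style stack can be sketched as follows (a standalone illustration with assumed names, not the actual WelfordHelper/welford_combine from cpp_prefix.h; the reciprocal-of-weights caching mentioned above is omitted):

```cpp
// Standalone sketch: a scalar Welford accumulator plus a helper that bounds
// the length of any single accumulation chain by flushing a partial result
// every kChunkSize elements and merging the partials at the end.
#include <cstdint>
#include <vector>

template <typename T>
struct WelfordAcc {
  T mean = 0;
  T m2 = 0;
  int64_t weight = 0;
};

// Online (Welford) update with a single new value.
template <typename T>
WelfordAcc<T> welford_update(WelfordAcc<T> acc, T value) {
  acc.weight += 1;
  T delta = value - acc.mean;
  acc.mean += delta / acc.weight;
  acc.m2 += delta * (value - acc.mean);
  return acc;
}

// Merge two partial accumulators (Chan et al. parallel combination).
template <typename T>
WelfordAcc<T> welford_merge(const WelfordAcc<T>& a, const WelfordAcc<T>& b) {
  if (a.weight == 0) return b;
  if (b.weight == 0) return a;
  WelfordAcc<T> out;
  out.weight = a.weight + b.weight;
  T delta = b.mean - a.mean;
  out.mean = a.mean + delta * b.weight / out.weight;
  out.m2 = a.m2 + b.m2 + delta * delta * a.weight * b.weight / out.weight;
  return out;
}

// Cascade-style driver: never lets one running accumulator grow past
// kChunkSize elements, which is what keeps long scalar reductions
// numerically stable.
template <typename T, int64_t kChunkSize = 4096>
struct CascadeWelford {
  WelfordAcc<T> current;
  std::vector<WelfordAcc<T>> partials;

  void add(T value) {
    current = welford_update(current, value);
    if (current.weight == kChunkSize) {
      partials.push_back(current);
      current = WelfordAcc<T>();
    }
  }

  WelfordAcc<T> finalize() const {
    WelfordAcc<T> acc = current;
    for (const auto& p : partials) {
      acc = welford_merge(acc, p);
    }
    return acc;
  }
};
```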
test/inductor/test_cpu_repro.py (Outdated)

```python
actual = compiled_m(x)
self.assertEqual(expected, actual)

# test scalar welford_reduce
```
It's better to reuse the code of this test.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
ghstack-source-id: 985346a
Pull Request resolved: pytorch/pytorch#162709
**Summary:**
Optimize scalar welford_reduce implementation, combining the Welford algorithm with cascade summation to improve numerical stability. Specifically:
1. Use the Welford algorithm to compute mean and variance.
2. Use cascade summation when computing the sum over the input for both mean and variance.

**Example:**
Take pytorch#141541 as an example:
```
import torch
import torch.nn as nn

torch.manual_seed(0)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.gn = nn.GroupNorm(num_groups=32, num_channels=32)

    def forward(self, x):
        return self.gn(x)

model = Model().eval()
x = torch.randn(1, 32, 128, 128, 128)

with torch.no_grad():
    output = model(x)
    with torch._inductor.config.patch({"cpp.simdlen": 0}):
        c_model = torch.compile(model)
        c_output = c_model(x)

print(torch.max(torch.abs(output - c_output)))
print(torch.allclose(output, c_output, 1.3e-6, 1e-5))
```

**logs**
- before
```
tensor(0.0005)
False
```
- After
```
tensor(1.4305e-06)
True
```

**Generated code:**
- before
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['float*', 'float*', 'const float*', 'const float*', 'const float*', 'float*'], '''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(float* in_out_ptr0,
                       float* in_out_ptr1,
                       const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr2)
{
    auto out_ptr1 = in_out_ptr0;
    auto out_ptr0 = in_out_ptr1;
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<float> tmp_acc0_arr[4];
                for (int i = 0; i < 4; i++)
                {
                    tmp_acc0_arr[i] = Welford<float>();
                }
                #pragma omp parallel num_threads(4)
                {
                    int tid = omp_get_thread_num();
                    Welford<float> tmp_acc0_local = Welford<float>();
                    #pragma omp for
                    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                    {
                        {
                            {
                                auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                                tmp_acc0_local = welford_combine(tmp_acc0_local, tmp0);
                            }
                        }
                    }
                    tmp_acc0_arr[tid] = tmp_acc0_local;
                }
                for (int tid = 0; tid < 4; tid++)
                {
                    tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]);
                }
                in_out_ptr1[static_cast<int64_t>(x0)] = tmp_acc0.mean;
                in_out_ptr0[static_cast<int64_t>(x0)] = tmp_acc0.m2;
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                {
                    auto tmp0 = out_ptr1[static_cast<int64_t>(x0)];
                    auto tmp6 = in_ptr1[static_cast<int64_t>(x0)];
                    auto tmp8 = out_ptr0[static_cast<int64_t>(x0)];
                    auto tmp11 = in_ptr2[static_cast<int64_t>(x0)];
                    auto tmp1 = static_cast<float>(2097152.0);
                    auto tmp2 = tmp0 / tmp1;
                    auto tmp3 = static_cast<float>(1e-05);
                    auto tmp4 = float(tmp2 + tmp3);
                    auto tmp5 = 1 / std::sqrt(tmp4);
                    auto tmp7 = float(tmp5 * tmp6);
                    auto tmp9 = decltype(tmp8)(-tmp8);
                    auto tmp10 = float(tmp9 * tmp7);
                    auto tmp12 = float(tmp10 + tmp11);
                    in_out_ptr0[static_cast<int64_t>(x0)] = tmp7;
                    in_out_ptr1[static_cast<int64_t>(x0)] = tmp12;
                }
            }
        }
    }
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
            {
                #pragma GCC ivdep
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                            auto tmp1 = in_out_ptr0[static_cast<int64_t>(x0)];
                            auto tmp3 = in_out_ptr1[static_cast<int64_t>(x0)];
                            auto tmp2 = float(tmp0 * tmp1);
                            auto tmp4 = float(tmp2 + tmp3);
                            out_ptr2[static_cast<int64_t>(x1 + 2097152L*x0)] = tmp4;
                        }
                    }
                }
            }
        }
    }
}
''')
async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, arg1_1, arg2_1 = args
        args.clear()
        assert_size_stride(arg0_1, (32, ), (1, ))
        assert_size_stride(arg1_1, (32, ), (1, ))
        assert_size_stride(arg2_1, (1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1))
        buf0 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf1 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf3 = reinterpret_tensor(buf1, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf1  # reuse
        buf4 = reinterpret_tensor(buf0, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf0  # reuse
        buf5 = empty_strided_cpu((1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1), torch.float32)
        # [Provenance debug handles] cpp_fused_native_group_norm_0:1
        cpp_fused_native_group_norm_0(buf3, buf4, arg2_1, arg0_1, arg1_1, buf5)
        del arg0_1
        del arg1_1
        del arg2_1
        return (buf5, )
```
- After
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['float*', 'float*', 'const float*', 'const float*', 'const float*', 'float*'], '''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(float* in_out_ptr0,
                       float* in_out_ptr1,
                       const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr2)
{
    auto out_ptr1 = in_out_ptr0;
    auto out_ptr0 = in_out_ptr1;
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<float> tmp_acc0_arr[4];
                for (int i = 0; i < 4; i++)
                {
                    tmp_acc0_arr[i] = Welford<float>();
                }
                #pragma omp parallel num_threads(4)
                {
                    int tid = omp_get_thread_num();
                    WelfordHelper<float, float, 4096> scalar_welford_helper0(static_cast<int64_t>(524288L));
                    Welford<float> tmp_acc0_local = Welford<float>();
                    #pragma omp for
                    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                    {
                        {
                            {
                                auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                                tmp_acc0_local = welford_combine(tmp_acc0_local, tmp0, &scalar_welford_helper0);
                            }
                        }
                    }
                    tmp_acc0_local = welford_combine(tmp_acc0_local, &scalar_welford_helper0);
                    tmp_acc0_arr[tid] = tmp_acc0_local;
                }
                for (int tid = 0; tid < 4; tid++)
                {
                    tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]);
                }
                in_out_ptr1[static_cast<int64_t>(x0)] = tmp_acc0.mean;
                in_out_ptr0[static_cast<int64_t>(x0)] = tmp_acc0.m2;
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                {
                    auto tmp0 = out_ptr1[static_cast<int64_t>(x0)];
                    auto tmp6 = in_ptr1[static_cast<int64_t>(x0)];
                    auto tmp8 = out_ptr0[static_cast<int64_t>(x0)];
                    auto tmp11 = in_ptr2[static_cast<int64_t>(x0)];
                    auto tmp1 = static_cast<float>(2097152.0);
                    auto tmp2 = tmp0 / tmp1;
                    auto tmp3 = static_cast<float>(1e-05);
                    auto tmp4 = float(tmp2 + tmp3);
                    auto tmp5 = 1 / std::sqrt(tmp4);
                    auto tmp7 = float(tmp5 * tmp6);
                    auto tmp9 = decltype(tmp8)(-tmp8);
                    auto tmp10 = float(tmp9 * tmp7);
                    auto tmp12 = float(tmp10 + tmp11);
                    in_out_ptr0[static_cast<int64_t>(x0)] = tmp7;
                    in_out_ptr1[static_cast<int64_t>(x0)] = tmp12;
                }
            }
        }
    }
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
            {
                #pragma GCC ivdep
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                            auto tmp1 = in_out_ptr0[static_cast<int64_t>(x0)];
                            auto tmp3 = in_out_ptr1[static_cast<int64_t>(x0)];
                            auto tmp2 = float(tmp0 * tmp1);
                            auto tmp4 = float(tmp2 + tmp3);
                            out_ptr2[static_cast<int64_t>(x1 + 2097152L*x0)] = tmp4;
                        }
                    }
                }
            }
        }
    }
}
''')
async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, arg1_1, arg2_1 = args
        args.clear()
        assert_size_stride(arg0_1, (32, ), (1, ))
        assert_size_stride(arg1_1, (32, ), (1, ))
        assert_size_stride(arg2_1, (1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1))
        buf0 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf1 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf3 = reinterpret_tensor(buf1, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf1  # reuse
        buf4 = reinterpret_tensor(buf0, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf0  # reuse
        buf5 = empty_strided_cpu((1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1), torch.float32)
        # [Provenance debug handles] cpp_fused_native_group_norm_0:1
        cpp_fused_native_group_norm_0(buf3, buf4, arg2_1, arg0_1, arg1_1, buf5)
        del arg0_1
        del arg1_1
        del arg2_1
        return (buf5, )
```

Pull Request resolved: pytorch#162709
Approved by: https://github.com/CaoE, https://github.com/jansel
Stack from ghstack (oldest at bottom):
Summary:
Optimize the scalar welford_reduce implementation, combining the Welford algorithm with cascade summation to improve numerical stability. Specifically: 1. Use the Welford algorithm to compute mean and variance. 2. Use cascade summation when computing the sum over the input for both mean and variance.
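A rough standalone illustration of why the cascade-style accumulation order matters at this reduction size (not Inductor output; exact error values vary by platform):

```cpp
// Summing ~2M small floats in one long chain vs. in fixed-size chunks.
// The chunked (cascade) order keeps each partial sum small, so far less
// precision is lost when the running total becomes large.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  const int64_t n = 1 << 21;  // ~2M elements, as in the group-norm example
  std::vector<float> data(n, 0.1f);

  // Single long accumulation chain.
  float naive = 0.f;
  for (float v : data) naive += v;

  // Cascade-style: accumulate in chunks, then sum the chunk totals.
  const int64_t chunk = 4096;
  float cascade = 0.f;
  for (int64_t i = 0; i < n; i += chunk) {
    float partial = 0.f;
    for (int64_t j = i; j < i + chunk && j < n; ++j) partial += data[j];
    cascade += partial;
  }

  const double exact = 0.1 * n;
  std::printf("naive error:   %g\n", double(naive) - exact);
  std::printf("cascade error: %g\n", double(cascade) - exact);
}
```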
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @mlazos