[Inductor][CPU] GEMM template: add an AVX512-VNNI-based micro kernel by Xia-Weiwen · Pull Request #166846 · pytorch/pytorch

Xia-Weiwen · 2025-11-03T08:17:45Z

Summary
This PR adds an AVX512-VNNI-based micro kernel for u8s8s32 to the CPP GEMM template. It can be selected to build a GEMM kernel on hardware that supports AVX512_VNNI but not AMX. Without it, u8s8s32 GEMM on such hardware can only use the aten qlinear op (or fall back to the reference implementation, which is very slow). The new micro kernel outperforms the aten qlinear op when M is small (see the performance data below).
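For readers unfamiliar with the u8s8s32 notation, the sketch below spells out the arithmetic: unsigned 8-bit activations times signed 8-bit weights, accumulated in 32-bit integers, with K consumed four elements at a time as the AVX512-VNNI `vpdpbusd` instruction does per 32-bit lane. It is a pure-Python reference under stated assumptions, not the generated C++ kernel: the function names are illustrative, K is assumed to be a multiple of 4, and int32 wraparound is not modeled because Python integers are unbounded.

```python
def vpdpbusd_lane(acc, a4, b4):
    """One 32-bit lane of a VNNI-style dot product: acc + sum(a4[i] * b4[i])."""
    assert len(a4) == len(b4) == 4
    assert all(0 <= a <= 255 for a in a4)      # unsigned 8-bit activations
    assert all(-128 <= b <= 127 for b in b4)   # signed 8-bit weights
    return acc + sum(a * b for a, b in zip(a4, b4))


def gemm_u8s8s32(A, B):
    """Reference C[M][N] = A[M][K] @ B[K][N], consuming K in groups of 4."""
    M, K, N = len(A), len(A[0]), len(B[0])
    assert K % 4 == 0, "this sketch assumes K is a multiple of 4"
    C = [[0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for k in range(0, K, 4):
                a4 = A[m][k:k + 4]                    # 4 consecutive u8 values of one row
                b4 = [B[k + i][n] for i in range(4)]  # 4 s8 values of one column
                C[m][n] = vpdpbusd_lane(C[m][n], a4, b4)
    return C


A = [[1, 2, 3, 4], [255, 0, 0, 1]]            # u8, shape 2 x 4
B = [[1, -1], [2, -2], [3, -3], [-128, 127]]  # s8, shape 4 x 2
C = gemm_u8s8s32(A, B)
print(C)  # [[-498, 494], [127, -128]]
```

A real kernel processes 16 such lanes per 512-bit register and keeps the weights pre-packed so that each group of four K values is contiguous in memory.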
On platforms that support both AVX512_VNNI and AMX, AMX is preferred regardless of input shape, which ensures there is no performance regression on such platforms. Heuristics for choosing between AVX512_VNNI and AMX can be added in the future if needed.
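The preference order described above can be pictured as a simple priority check. This page does not show the actual selection code in `cpp_micro_gemm.py`, so the function below is a hypothetical illustration only: the function name and return labels are made up, and the flag strings follow the names Linux reports in `/proc/cpuinfo`.

```python
def pick_u8s8s32_micro_kernel(cpu_flags):
    """Return a label for the micro kernel a selector with this policy would use.

    Hypothetical helper: it mirrors the policy stated in the PR summary,
    not the real Inductor implementation.
    """
    flags = set(cpu_flags)
    if "amx_tile" in flags and "amx_int8" in flags:
        return "amx"           # preferred whenever available, for every input shape
    if "avx512_vnni" in flags:
        return "avx512_vnni"   # the case this PR adds
    return "fallback"          # e.g. the aten qlinear op


print(pick_u8s8s32_micro_kernel(["avx512f", "avx512_vnni"]))
print(pick_u8s8s32_micro_kernel(["avx512_vnni", "amx_tile", "amx_int8"]))
```

Because the AMX branch is checked first and ignores M, N and K, adding the VNNI branch cannot change which kernel an AMX-capable machine picks.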
Note that this PR only adds a new micro kernel. It does not change the outer loops or the blocking of the CPP GEMM template.

Experiments showed that block_m=6 and block_n=64 perform best. oneDNN uses the same blocking strategy.
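The PR reports this blocking as an experimental result and does not explain it. One plausible reading, offered here as an assumption, is register pressure: AVX-512 has 32 vector registers of 512 bits, and a 6 x 64 tile of int32 accumulators fits in them with room left for operands. The helper below is illustrative and only does the arithmetic.

```python
ZMM_REGISTERS = 32   # AVX-512 architectural vector registers
ZMM_BITS = 512       # width of one register


def accumulator_registers(block_m, block_n, acc_bits=32):
    """Registers needed to hold a block_m x block_n tile of accumulators."""
    lanes = ZMM_BITS // acc_bits          # 16 int32 lanes per register
    regs_per_row = -(-block_n // lanes)   # ceiling division
    return block_m * regs_per_row


used = accumulator_registers(6, 64)
spare = ZMM_REGISTERS - used
print(used, spare)  # 24 8
```

With 24 registers holding accumulators, 8 remain for loading weights and broadcasting activations. How the generated kernel actually allocates registers is not shown on this page.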

Performance benchmark
We collected performance data on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with the following script.

import torch
import torchao
import copy
import os
import itertools

import torch._inductor.config as config
config.freezing = True
config.max_autotune_gemm_backends = "CPP"
config.cpp_wrapper = True
config.cpp.enable_kernel_profile = True

from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
import torchao.quantization.pt2e.quantizer.x86_inductor_quantizer as xiq
from torchao.quantization.pt2e.quantizer.x86_inductor_quantizer import X86InductorQuantizer
from torchao.quantization.pt2e import move_exported_model_to_eval

def pt2e_ptq(m, example_inputs):
    m = m.eval()
    exported_model = torch.export.export(m, example_inputs, strict=True).module()
    quantizer = X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())

    with torch.no_grad():
        prepared_model = prepare_pt2e(exported_model, quantizer)
        _ = prepared_model(*example_inputs)
        converted_model = convert_pt2e(prepared_model)
        move_exported_model_to_eval(converted_model)
        optimized_model = torch.compile(converted_model)
        optimized_model(*example_inputs)
        return optimized_model

def benchmark(model, inputs):
    import time
    warmup, active = 100, 1000
    with torch.no_grad():
        for i in range(warmup):
            model(*inputs)
        t0 = time.time()
        for i in range(active):
            model(*inputs)
        te = time.time() - t0
        print("Time per iteration:", round(te * 1000 / active, 3), "ms")

in1, out1 = 1024, 1024
in2, out2 = 1024, 1024

class Mod(torch.nn.Module):
    def __init__(self, bias=True):
        super().__init__()
        self.linear1 = torch.nn.Linear(in1, out1, bias=bias)
        self.relu = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(in2, out2, bias=(not bias))

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

    def get_example_input(self, M=1):
        return torch.randn(M, in1)

if __name__ == "__main__":
    use_max_autotune_list = [True, False]
    M_list = [1, 4, 32, 128, 256]
    cases = itertools.product(use_max_autotune_list, M_list)
    for use_max_autotune, M in cases:
        config.max_autotune = use_max_autotune
        model_fp = Mod().eval()
        data = model_fp.get_example_input(M)
        inputs = (data,)
        m = pt2e_ptq(copy.deepcopy(model_fp), inputs)
        print("[TEST INFO] Using GEMM template:", use_max_autotune, ", M:", M, ", num of cores:", len(os.sched_getaffinity(0)))
        benchmark(m, inputs)

Command to run:

# 1 core
numactl -C0 python benchmark_vnni_microkernel.py
# 4 cores
numactl -C0-3 python benchmark_vnni_microkernel.py

Results:

Num of Cores	M	ATEN (ms)	CPP (ms)	Improvement
1	1	0.2	0.128	36.00%
1	4	0.202	0.132	34.65%
1	32	0.358	0.298	16.76%
1	128	0.898	0.882	1.78%
1	256	1.613	1.656	-2.67%
4	1	0.108	0.058	46.30%
4	4	0.112	0.054	51.79%
4	32	0.165	0.109	33.94%
4	128	0.352	0.306	13.07%
4	256	0.59	0.573	2.88%
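The last column is consistent with the relative time saved, (ATEN - CPP) / ATEN, so a negative value means the CPP template was slower. The snippet below recomputes the column from the timings in the table. The formula is inferred from the numbers rather than stated in the PR, and the helper name is illustrative.

```python
# (cores, M, aten_ms, cpp_ms, reported_improvement_pct) copied from the table above
rows = [
    (1, 1, 0.2, 0.128, 36.00),
    (1, 4, 0.202, 0.132, 34.65),
    (1, 32, 0.358, 0.298, 16.76),
    (1, 128, 0.898, 0.882, 1.78),
    (1, 256, 1.613, 1.656, -2.67),
    (4, 1, 0.108, 0.058, 46.30),
    (4, 4, 0.112, 0.054, 51.79),
    (4, 32, 0.165, 0.109, 33.94),
    (4, 128, 0.352, 0.306, 13.07),
    (4, 256, 0.59, 0.573, 2.88),
]


def improvement_pct(aten_ms, cpp_ms):
    """Relative time saved by the CPP template, as a percentage of the ATEN time."""
    return round((aten_ms - cpp_ms) / aten_ms * 100, 2)


recomputed = [improvement_pct(aten, cpp) for _, _, aten, cpp, _ in rows]
for (cores, m, _, _, reported), value in zip(rows, recomputed):
    print(f"cores={cores} M={m}: reported {reported:.2f}%, recomputed {value:.2f}%")
```

The recomputed values match the reported ones for all ten rows, including the single regression at 1 core and M=256.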

Test plan

python -m pytest  -sv test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @chenyang78

…for u8s8s32

pytorch-bot · 2025-11-03T08:17:48Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166846


✅ You can merge normally! (1 Unrelated Failure)

As of commit e1fa8c9 with merge base d9cb8a7:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / linux-jammy-rocm-py3.10 / test (default, 4, 6, linux.rocm.gpu.gfx942.1) (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Xia-Weiwen · 2025-11-10T01:24:32Z

Hi @CaoE @mingfeima Could you please review this PR? Thanks.

mingfeima · 2025-11-20T02:04:30Z

@CaoE

torch/_inductor/codegen/cpp_micro_gemm.py

torch/_inductor/cpu_vec_isa.py

Xia-Weiwen · 2025-12-04T02:10:19Z

Hi @jansel Could you please review this PR? Thanks.

Xia-Weiwen · 2025-12-05T01:40:23Z

Hi @jansel Could you please review this PR? Thanks.

Xia-Weiwen · 2025-12-07T15:56:01Z

@pytorchbot merge

pytorchmergebot · 2025-12-07T15:58:41Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


…ytorch#166846)

Pull Request resolved: pytorch#166846
Approved by: https://github.com/CaoE, https://github.com/jansel


[Inductor][CPU] GEMM template: add an AVX512-VNNI-based micro kernel …

fa8345d

…for u8s8s32

pytorch-bot bot added the ciflow/inductor and module: inductor labels Nov 3, 2025

Xia-Weiwen added the intel and topic: not user facing labels Nov 3, 2025

pytorchbot added the open source label Nov 3, 2025

Xia-Weiwen added 2 commits November 6, 2025 10:49

Update register blocking

558e3c8

Merge branch 'main' into gemm_int8_vnni

81a2203

Xia-Weiwen requested review from CaoE and mingfeima November 6, 2025 03:34

Fix UT failures and update heuristic

ea2f66a

Merge branch 'main' into gemm_int8_vnni

f673662

mingfeima requested review from CaoE and removed request for CaoE November 20, 2025 02:04

CaoE reviewed Nov 21, 2025

torch/_inductor/codegen/cpp_micro_gemm.py (outdated, resolved)

CaoE reviewed Nov 21, 2025

torch/_inductor/codegen/cpp_micro_gemm.py (outdated, resolved)

CaoE reviewed Nov 21, 2025

torch/_inductor/codegen/cpp_micro_gemm.py (outdated, resolved)

CaoE reviewed Nov 21, 2025

torch/_inductor/cpu_vec_isa.py (resolved)

Refine code

f8a4d0b

Xia-Weiwen requested a review from CaoE November 21, 2025 06:57

Fix UT

7066bf5

CaoE approved these changes Dec 2, 2025


Merge branch 'main' into gemm_int8_vnni

e1fa8c9

Xia-Weiwen marked this pull request as ready for review December 2, 2025 03:13

Xia-Weiwen requested a review from jansel December 2, 2025 08:11

jansel approved these changes Dec 6, 2025


pytorch-bot bot added the ciflow/trunk label Dec 7, 2025

pytorchmergebot added the merging label Dec 7, 2025

pytorchmergebot added the Merged label Dec 7, 2025

pytorchmergebot closed this in af4458c Dec 7, 2025

pytorchmergebot removed the merging label Dec 7, 2025

Xia-Weiwen wants to merge 8 commits into pytorch:main from Xia-Weiwen:gemm_int8_vnni

6 participants
