[ROCm] Add aiter tkw1 kernel for Llama4 fp8 #16727
vllm-bot merged 14 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; instead, only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Force-pushed from 88e60fb to 6659b99.
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
vllm/envs.py (outdated)
```python
VLLM_ROCM_USE_AITER_LINEAR: bool = True
VLLM_ROCM_USE_AITER_MOE: bool = True
VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE: bool = False
VLLM_ROCM_USE_AITER_FP8_CHANNEL_SCALED_MOE: bool = False
```
Can we make the env name align more closely with the kernel name, in this case by including tkw1 in the name?
```python
def is_rocm_aiter_channel_scaled_moe_enabled() -> bool:
    return is_rocm_aiter_moe_enabled() and \
        envs.VLLM_ROCM_USE_AITER_FP8_CHANNEL_SCALED_MOE
```
Does this tkw1 enablement need to depend on is_rocm_aiter_moe_enabled()?
In this enablement we follow the block_scaled_moe case, using VLLM_ROCM_USE_AITER_MOE as the master switch for enabling MoE ops, to stay consistent with the other aiter kernels.
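For illustration, a minimal sketch of the master switch itself, assuming the env names shown in this thread (vLLM's actual definition may differ):

```python
import vllm.envs as envs

def is_rocm_aiter_moe_enabled() -> bool:
    # VLLM_ROCM_USE_AITER is the global aiter switch;
    # VLLM_ROCM_USE_AITER_MOE gates all aiter MoE ops beneath it.
    # Kernel-specific flags only take effect when both are set.
    return envs.VLLM_ROCM_USE_AITER and envs.VLLM_ROCM_USE_AITER_MOE
```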
```python
if activation_str == "silu":
    activation = ActivationType.Silu
elif activation_str == "gelu":
    activation = ActivationType.Gelu
else:
    activation = ActivationType.Silu
```
Can this be simplified to a one-liner?
Suggested change:
```python
activation = ActivationType.Gelu if activation_str == "gelu" else ActivationType.Silu
```
Do we need an additional wrapper for the _tkw1 kernel, given that it's just a kernel call plus an activation type conversion? The activation type conversion could also be used by other branches / kernel calls.
We are wrapping the kernel call because a future PR enabling torch.compile for aiter MoE kernels will use these wrappers to register the aiter ops, so we are leaving the wrapper in place for now.
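For context, a hedged sketch of the kind of registration such a wrapper enables, using PyTorch's torch.library.custom_op API; the op name, signature, and body here are illustrative placeholders, not vLLM's actual registration code:

```python
import torch

@torch.library.custom_op("vllm_rocm::asm_moe_tkw1", mutates_args=())
def asm_moe_tkw1(hidden_states: torch.Tensor,
                 topk_weights: torch.Tensor,
                 topk_ids: torch.Tensor) -> torch.Tensor:
    # The real wrapper would dispatch to aiter's tkw1 kernel here.
    return torch.empty_like(hidden_states)

@asm_moe_tkw1.register_fake
def _(hidden_states, topk_weights, topk_ids):
    # Shape/dtype inference so torch.compile can trace through the op.
    return torch.empty_like(hidden_states)
```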
```python
# # All AITER Fused MoE kernels are expecting the following datatypes
# topk_weights = topk_weights.to(torch.float32)
# topk_ids = topk_ids.to(torch.int32)
```
Suggested change: remove these commented-out lines.
```python
# topk_weights = topk_weights.to(torch.float32)
# topk_ids = topk_ids.to(torch.int32)

return rocm_aiter_asm_moe_tkw1(hidden_states,
```
Let's assert apply_router_weight_on_input=True, or do the if-branch check, when calling the _tkw1 kernel. Also, we should add comments illustrating the difference between the _tkw1 kernel and the other aiter kernels: the difference is whether topk_weights is applied to the output of the first GEMM or the second GEMM.
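For illustration, a minimal sketch of the requested guard, assuming the rocm_aiter_asm_moe_tkw1 wrapper from the diff above is in scope (names follow this thread, not necessarily the merged code):

```python
def _call_tkw1(hidden_states, apply_router_weight_on_input, **kwargs):
    # tkw1 folds topk_weights into the output of the first GEMM, unlike the
    # other aiter fused-MoE kernels, which apply them after the second GEMM,
    # so it is only valid when router weights are applied on the input.
    if not apply_router_weight_on_input:
        raise ValueError("rocm_aiter_asm_moe_tkw1 only supports "
                         "apply_router_weight_on_input=True")
    return rocm_aiter_asm_moe_tkw1(hidden_states, **kwargs)
```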
```python
        and layer.activation == "silu" and layer.expert_map is None):
    return CompressedTensorsW8A8Fp8MoECutlassMethod(quant_config)
elif quant_config._is_fp8_w8a8(weight_quant, input_quant):
    if is_rocm_aiter_channel_scaled_moe_enabled():
```
tkw1 is not general support for FP8 fused-MoE channel / rowwise scaling; it only supports the case where apply_router_weight_on_input=True.
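A sketch of the narrower dispatch this comment implies, assuming the enablement helper from the diff above and an apply_router_weight_on_input attribute (both illustrative here):

```python
# Route to the tkw1 path only for the case it supports; otherwise fall
# back to the default FP8 W8A8 MoE method.
if (is_rocm_aiter_channel_scaled_moe_enabled()
        and layer.apply_router_weight_on_input):
    moe_method = "aiter_tkw1"       # illustrative marker, not vLLM's name
else:
    moe_method = "default_fp8_w8a8"
```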
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
…E_AITER_FP8_BLOCK_SCALED_MOE and VLLM_ROCM_USE_AITER_FP8_TKW1_MOE Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
```python
)

if envs.VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE and use_fp8_w8a8:
    # TODO: verify this code path for DeepSeekV3
```
Can we verify before landing?
Verified; will remove the TODO comment.
2025-04-18:10:35:16 INFO [loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=30000,gpu_memory_utilization=0.8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | 0.9492 | ± 0.006 |
| gsm8k | 3 | strict-match | 5 | exact_match ↑ | 0.9500 | ± 0.006 |
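For reference, a hedged sketch of reproducing this check through lm-evaluation-harness's Python API rather than the CLI; the model_args string mirrors the configuration line above:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=("pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,"
                "max_model_len=30000,gpu_memory_utilization=0.8,"
                "trust_remote_code=True"),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])  # exact_match for both filters
```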
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
SageMoore left a comment:
Looks reasonable. Just a few nits.
```python
layer.w2_weight = torch.nn.Parameter(shuffled_w2,
                                     requires_grad=False)

if self.use_rocm_aiter_moe:
```
Nit: Can you merge these into one if statement?
Will do. Thanks for pointing this out.
```python
    is_rocm_aiter_moe_enabled)

# Property to determine if AITER is used
self.use_rocm_aiter_moe = is_rocm_aiter_moe_enabled()
```
Nit: Do you need to store this in the class? It doesn't look like you are using it outside of this function.
You're right. Updated this along with the merged if statement.
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: Frieda (Jingying) Huang <jingyingfhuang@gmail.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
This PR enables aiter's tkw1 quantized MoE kernel to improve the inference performance of compressed-tensors Llama4 models quantized with FP8. We have also revamped aiter's MoE kernel dispatching to automatically choose a suitable AITER fused MoE kernel without needing to set flags for kernel selection. Users only need to specify `VLLM_ROCM_USE_AITER=1` and `VLLM_ROCM_USE_AITER_MOE=1` to activate aiter's MoE kernels, and the `VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE` flag is removed.

Note: torch.compile isn't supported in this PR yet, and the performance numbers were obtained with V1 eager mode. Enabling V1 torch.compile for aiter MoE kernels will be addressed in a separate PR.
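For illustration, the resulting user-facing setup, sketched in Python (setting the variables in the shell before launching vLLM works the same way; the model name matches the benchmarks below):

```python
import os

# The only two switches needed after this PR; kernel selection below
# them is automatic.
os.environ["VLLM_ROCM_USE_AITER"] = "1"
os.environ["VLLM_ROCM_USE_AITER_MOE"] = "1"

from vllm import LLM  # import after the env vars are set

llm = LLM(model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
          tensor_parallel_size=4, enforce_eager=True)
```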
Llama4 Maverick FP8 throughput benchmarks
Llama4 Maverick FP8 latency benchmarks
Text Generation Response
lm_eval Results
V1 without aiter, eager mode
vllm (pretrained=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,tensor_parallel_size=4,max_model_len=30000,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
V1 with aiter, eager mode
vllm (pretrained=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,tensor_parallel_size=4,max_model_len=30000,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
Reduce complexity of selecting AITER Fused MoE kernel
As the number of AITER flags has increased, we have revamped the conditions for picking the AITER fused MoE kernel so that no kernel-specific flags are needed, and `VLLM_ROCM_USE_AITER_FP8_BLOCK_SCALED_MOE` is removed. Users only need to specify `VLLM_ROCM_USE_AITER=1` and `VLLM_ROCM_USE_AITER_MOE=1` (a sketch of the resulting dispatch appears at the end of this description). We have validated the code paths of other models with the latest AITER fused MoE selection logic:
mistralai_Mixtral-8x7B-Instruct-v0.1_V0
vllm (pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=1,max_model_len=30000,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
mistralai_Mixtral-8x7B-Instruct-v0.1_FP8_V0
vllm (pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=1,max_model_len=30000,quantization=fp8,kv_cache_dtype=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
deepseek-ai_DeepSeek-V3
vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=30000,gpu_memory_utilization=0.8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
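For illustration, a hedged sketch of the automatic kernel selection described above; the branch conditions are inferred from this PR's discussion and simplified, so the merged dispatch code may differ:

```python
def select_aiter_fused_moe_kernel(use_fp8_w8a8: bool,
                                  block_shape,
                                  apply_router_weight_on_input: bool) -> str:
    if use_fp8_w8a8 and block_shape is not None:
        # Block-scaled FP8 path, e.g. DeepSeek-V3.
        return "fp8_block_scaled_moe"
    if use_fp8_w8a8 and apply_router_weight_on_input:
        # Channel/rowwise-scaled FP8 with router weights applied on the
        # input, e.g. Llama4 FP8: the tkw1 kernel added in this PR.
        return "asm_moe_tkw1"
    # Default aiter fused-MoE path, e.g. Mixtral.
    return "fused_moe"
```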