
Update/add to qr_ks_vs_whole_k_prefetch pipeline #3485

Open

qianfengz wants to merge 25 commits into develop from whole_k_prefetch_n0loop
Conversation


@qianfengz qianfengz commented Dec 24, 2025

About qr_ks_vs_whole_k_prefetch pipeline

The qr_ks_vs_whole_k_prefetch pipeline is mainly intended for situations where the total number of work-groups is not enough to occupy the CUs. When the work-group count is low, using an MTile size (kM0) of 64 rather than 128 can improve CU occupancy. With kM0=64, fewer registers are consumed to hold P and O, so enough vgprs are left to prefetch the whole k_tile of the next iteration in the main loop, which improves performance compared to the usual approach of using kM0=128.
Besides prefetching the whole k tile when kM0=64, the pipeline also has a kM0=128 path, in which 1/2 of the n0_loop slices of the k tile are prefetched for the next iteration. The kM0=128 path can be used as a replacement for the qr_ks_vs_async pipeline.

What this PR does

  1. Update the pipeline policy to ensure the best MFMA instructions are used on MI350
  2. Add the qr_ks_vs_whole_k_prefetch_trload pipeline instance so that V can be loaded using transposed loading on MI350 (avoiding the need for many shuffle instructions)
  3. Use an n0_loop to implement Gemm0 instead of the commonly used k0_loop. The n0_loop brings the benefits of fewer move_tile_window() calls and removes the need for clear_tile(s_acc) in the main loop
  4. Complete support for naive tile loading for hdim96 and hdim160, i.e. loading hdim96/hdim160 tiles without having to pad them to hdim128/hdim256
  5. Other fine-grained improvements (e.g. using an explicit partition_index to guarantee warp_id is allocated in a vgpr for store_tile/load_tile to/from an LDS tile_window)

Performance results

  1. For attention shapes which lead to kM0=64, qr_ks_vs_async_whole_k_prefetch_trload shows much better performance than qr_ks_vs_async_trload on the same case (execution time 41.02ms with whole_k_prefetch_trload vs 58.50ms with async_load)
  2. For attention shapes which lead to kM0=128, qr_ks_vs_async_whole_k_prefetch_trload shows slightly better performance than qr_ks_vs_async on MI350 (execution time 104.50ms with whole_k_prefetch_trload vs 106.50ms with qr_ks_vs_async), and the two show completely on-par performance on MI300

Test/Verify

  1. Use the ROCm xformers branch test_whole_k_prefetch_n0loop to test/verify the qr_ks_vs_whole_k_prefetch pipeline, since this pipeline cannot be used by the ck_tile fmha example so far
  2. Use the following command line for building/testing xformers
#> git clone -b test_whole_k_prefetch_n0loop https://github.com/ROCm/xformers
#> cd xformers
#> git submodule update --init --recursive
#> pip install --no-build-isolation -e ./
#> pytest tests/test_mem_eff_attention.py::test_forward
  3. Any script that can run on xformers can be used to evaluate the qr_ks_vs_whole_k_prefetch pipeline. Use the following two environment variables to switch between pipelines
#> export FMHA_DISABLE_SPECIAL_TREATMENT=1   # disable the FAV3 and qr_ks_vs_async_trload pipelines
#> export FMHA_DISABLE_ASYNC_PIPELINE=1      # disable the qr_ks_vs_async pipeline

Discussion

Copilot AI left a comment
Pull request overview

This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to improve performance on MI350 GPUs through better MFMA instruction usage, transposed V-loading support, and N0-loop implementation. The pipeline targets scenarios where work-group counts are low, enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs 128) while prefetching entire K tiles.

Changes:

  • Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload) to reduce shuffle instructions on MI350
  • Implements N0-loop based Gemm0 to reduce tile window movement overhead and eliminate clear_tile calls
  • Adds full support for hdim96/hdim160 without padding requirements
  • Updates MFMA instruction selection to ensure optimal choices for MI350

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| block_gemm_areg_bsmem_trload_creg_v2_prefetch_n.hpp | New GEMM block implementation supporting transposed V-loading with N-dimension prefetching |
| block_gemm_areg_bsmem_creg_v2_prefetch_n.hpp | N-dimension prefetching GEMM implementation for standard (non-transposed) loading |
| block_gemm_areg_bsmem_creg_v2_prefetch_k.hpp | K-dimension prefetching GEMM implementation |
| tile_fmha_shape.hpp | Adds kN0Sub field and relaxes static assertion for N0-loop support |
| block_fmha_pipeline_qr_ks_vs_whole_k_prefetch_trload.hpp | New pipeline variant with transposed V-loading |
| block_fmha_pipeline_qr_ks_vs_whole_k_prefetch_default_policy.hpp | Comprehensive policy updates for LDS management, alignment, and MFMA selection |
| block_fmha_pipeline_qr_ks_vs_whole_k_prefetch.hpp | Core pipeline updated with N0-loop implementation and simplified memory management |
| block_fmha_pipeline_problem.hpp | Adds utility functions for calculating optimal vector sizes |
| fmha_fwd_kernel.hpp | Kernel updates to support N0-loop pipelines and naive hdim loading |
| fmha.hpp | Includes new trload pipeline header |


load_tile(b_warp_windows(number<nIter + 1>{})(kIter));
};

__builtin_amdgcn_sched_barrier(0x0000001);

Copilot AI Jan 15, 2026


The scheduling barrier mask has an incorrect value. Line 140 uses 0x0000001 (7 digits) while line 127 correctly uses 0x00000001 (8 digits). This should be 0x00000001 to match the proper 32-bit mask format.

Suggested change
__builtin_amdgcn_sched_barrier(0x0000001);
__builtin_amdgcn_sched_barrier(0x00000001);

}

const auto bias_tile = load_tile(bias_dram_window); // load bias tile
__builtin_amdgcn_sched_barrier(0x000000001);

Copilot AI Jan 15, 2026


The scheduling barrier mask has an incorrect value. This uses 0x000000001 (9 digits) when it should be 0x00000001 (8 digits) to match the proper 32-bit hexadecimal format.

Suggested change
__builtin_amdgcn_sched_barrier(0x000000001);
__builtin_amdgcn_sched_barrier(0x00000001);

static constexpr index_t kM0 = BlockTile::at(number<0>{}); // tile size along q seqlen
static constexpr index_t kN0 = BlockTile::at(number<1>{}); // tile size along k seqlen
static constexpr index_t kK0 = BlockTile::at(number<2>{}); // tile size along qk gemm unroll
static constexpr index_t kN0Sub = BlockTile::at(number<2>{}); // tile size for dividing kN0

Copilot AI Jan 15, 2026


The variable kN0Sub is assigned from BlockTile index 2, which is the same index as kK0 (line 52). This appears to be intentional based on the assertion at line 59, but the naming is confusing since kN0Sub suggests it's related to kN0, not kK0. Consider renaming to better reflect its relationship to both dimensions, or add a clarifying comment.

Suggested change
static constexpr index_t kN0Sub = BlockTile::at(number<2>{}); // tile size for dividing kN0
static constexpr index_t kN0Sub = BlockTile::at(number<2>{}); // same index as kK0; used as subdivision factor when dividing kN0

{
if(num_total_loop <= 0)
// assuming no random values need be saved, this is true when the pipeline is called from
// xformers, since we have a separate kernel to generated randomm values

Copilot AI Jan 15, 2026


Corrected spelling of 'generated' from 'generated randomm values' to 'generate random values'.

Suggested change
// xformers, since we have a separate kernel to generated randomm values
// xformers, since we have a separate kernel to generate random values

template <typename T>
static inline constexpr bool is_naive_hdim_load_v = has_naive_hdim_load_flag<T>::value;

// A helper struct for detechting kUseTrLoad

Copilot AI Jan 15, 2026


Corrected spelling of 'detechting' to 'detecting'.

Suggested change
// A helper struct for detechting kUseTrLoad
// A helper struct for detecting kUseTrLoad

Comment on lines +40 to +41
else
static_assert(false, "The data type is not supported!");

Copilot AI Jan 15, 2026


Using static_assert(false, ...) directly can cause compilation issues with some compilers even when the branch is not taken. Consider using a type-dependent false condition like static_assert(sizeof(DataType) == 0, ...) or static_assert(!std::is_same_v<DataType, DataType>, ...).

@asleepzzz
Contributor

we found async can beat whole_k_prefetch with a new config, will discuss with qianfeng

@ammallya
Contributor

ammallya commented Feb 3, 2026

Error importing due to merge conflicts – please reopen the PR on ROCm/rocm-libraries
