
Conversation

@xu-shawn
Contributor

Rewrite the CUDA feature transformer forward/backward kernels in Triton. This eliminates graph breaks and allows GPU-specific optimization through triton.autotune.

CUDA
Epoch 0:  15% 900/6104 [03:30<20:17,  4.28it/s, v_num=46]

Triton
Epoch 0:  15% 900/6104 [03:13<18:36,  4.66it/s, v_num=45]
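
For readers unfamiliar with the approach, here is a minimal sketch of what a Triton feature-transformer forward kernel can look like. This is not the actual kernel from this PR: the name `feature_transformer_fwd`, the -1 padding convention for inactive feature slots, and the autotune configs are illustrative assumptions. It shows the two ideas the description refers to: a gather-and-accumulate over active feature indices, and `triton.autotune` selecting a block size per GPU.

```python
import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE': 64}, num_warps=2),
        triton.Config({'BLOCK_SIZE': 128}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 256}, num_warps=8),
    ],
    key=['out_dim'],  # re-tune when the accumulator width changes
)
@triton.jit
def feature_transformer_fwd(
    indices_ptr,  # (batch, max_active) int32; -1 marks padding slots (assumption)
    weight_ptr,   # (num_features, out_dim) float32 accumulator weights
    bias_ptr,     # (out_dim,) float32
    out_ptr,      # (batch, out_dim) float32
    max_active,
    out_dim,
    BLOCK_SIZE: tl.constexpr,
):
    batch_id = tl.program_id(0)  # one program per sample
    block_id = tl.program_id(1)  # one program per slice of the accumulator
    cols = block_id * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    col_mask = cols < out_dim

    # Start from the bias, then gather-accumulate one weight row per active feature.
    acc = tl.load(bias_ptr + cols, mask=col_mask, other=0.0)
    for i in range(max_active):
        feat = tl.load(indices_ptr + batch_id * max_active + i)
        # Masked load: padding slots (feat < 0) contribute nothing and are never dereferenced.
        row = tl.load(weight_ptr + feat * out_dim + cols,
                      mask=col_mask & (feat >= 0), other=0.0)
        acc += row

    tl.store(out_ptr + batch_id * out_dim + cols, acc, mask=col_mask)
```

A hypothetical launch, with illustrative shapes rather than the trainer's real ones; the grid depends on the autotuned block size via the `meta` dict:

```python
batch, num_features, out_dim, max_active = 1024, 40960, 512, 32
indices = torch.randint(0, num_features, (batch, max_active),
                        dtype=torch.int32, device='cuda')
weight = torch.randn(num_features, out_dim, device='cuda')
bias = torch.randn(out_dim, device='cuda')
out = torch.empty(batch, out_dim, device='cuda')

grid = lambda meta: (batch, triton.cdiv(out_dim, meta['BLOCK_SIZE']))
feature_transformer_fwd[grid](indices, weight, bias, out, max_active, out_dim)
```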

@vondele
Member

vondele commented Jun 28, 2025

Quite a bit slower locally:

Reported after epoch 3 / end of the run (2 runs).

               big                            small
master     84.99it/s, 84.57it/s      210.57it/s, 214.03it/s
triton     36.86it/s, 36.90it/s       90.97it/s,  91.75it/s

@Disservin
Member

Yeah, slowdown for me as well; not as big as in vondele's example, but pretty noticeable.

@Disservin
Member

You might be able to rewrite the kernel using tilelang without sacrificing performance:
https://github.com/tile-ai/tilelang
