
Conversation

@ikawrakow (Owner)

This PR adds FA support for models where the K and V head sizes are different, such as DeepSeek-R1 and DeepSeek-Lite. It only works with the standard attention mechanism; I have yet to look into FA with MLA.

We get a nice speedup for PP that increases with context length, but TG is not faster. I want to play with it some more, but I'm throwing it out there in case someone wants to try. It definitely allows longer contexts to be processed, as `-ctk q8_0 -ctv q8_0` seems perfectly adequate.
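For anyone who wants to try it, a minimal sketch of such a run is below. The model file name, context size, and prompt are placeholders (not from this PR); `-fa`, `-ctk`, and `-ctv` are the usual llama.cpp-style switches.

```bash
# Hypothetical example: run DeepSeek-Lite with FA enabled and both K and V caches
# quantized to Q8_0, so a 16k-token context stays affordable in memory.
./bin/llama-cli -m deepseek-lite-iq4_xs.gguf \
    -c 16384 \
    -fa \
    -ctk q8_0 -ctv q8_0 \
    -p "Write a haiku about attention heads." -n 128
```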

Iwan Kawrakow added 3 commits February 10, 2025 17:41

- This is relevant for DeepSeek models.
- At this point ggml CPU FA works. Now I need to go and change iqk FA to make it work with Dk != Dv.
- To not have compilation time explode, just Dk = 192, Dv = 128 for now (DeepSeek).
@saood06 mentioned this pull request on Feb 10, 2025.
@ikawrakow (Owner, Author)

I did get some minor FA speed improvements for TG, but I don't see what else could be done, so I'll merge it.

Here is a performance comparison between baseline (Q8_0 K-cache, no FA, no MLA), MLA (Q8_0 K-cache), and FA (Q8_0 K and V caches) for DeepSeek-Lite running on a Ryzen-7950X CPU. Both graphs show MLA and FA performance as a ratio to baseline.

The first graph shows prompt processing speed. FA gives a ~40% performance boost over baseline at 16k tokens, while MLA is 2X slower than baseline and 2.8X slower than FA at 16k tokens.

[Figure ds2_pp: PP speed relative to baseline vs. prompt length]
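For reference, the FA points in this sweep can be reproduced with something along these lines. This is a sketch, not the exact command used: the model path and thread count are placeholders, and the baseline run would simply drop `-fa 1` (the MLA run would additionally use whatever MLA switch the build exposes, not shown here).

```bash
# Hypothetical sketch of the PP sweep for the FA configuration:
# prompt lengths from 512 to 16k tokens, no generation, Q8_0 K and V caches.
./bin/llama-bench -m deepseek-lite-iq4_xs.gguf \
    -p 512,1024,2048,4096,8192,16384 -n 0 \
    -fa 1 -ctk q8_0 -ctv q8_0 \
    -t 16
```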

The second graph shows token generation speed (TG-64) after a prompt of a given length (i.e., TG speed as a function of the number of tokens in the KV cache). FA does give some performance gains for very long prompts (~10% at 16k tokens), but not nearly as much as MLA: MLA is 1.57X faster than baseline and 1.43X faster than FA at 16k tokens.

[Figure ds2_tg: TG-64 speed relative to baseline vs. number of tokens in the KV cache]
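The TG-64-after-a-prompt points can be measured with llama-bench's "generation after prompt" test. The sketch below assumes a `-gp <pp,tg>` option is available in this repo's llama-bench; that flag name is my assumption, not something stated in this PR, so adjust if your build names it differently.

```bash
# Hypothetical sketch: measure TG-64 after prompts of increasing length (FA run).
# -p 0 -n 0 skips the standalone PP/TG tests; -gp is assumed to run "tg after pp".
for npp in 1024 2048 4096 8192 16384; do
    ./bin/llama-bench -m deepseek-lite-iq4_xs.gguf \
        -p 0 -n 0 -gp ${npp},64 \
        -fa 1 -ctk q8_0 -ctv q8_0 -t 16
done
```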

@ikawrakow (Owner, Author) commented on Feb 11, 2025

Recently I read somewhere that for the "common enterprise workflow" (whatever that means) the number of generated tokens is typically only about 10% of the prompt tokens. I don't know if that is true, but for the sake of argument let's assume for a moment that it is. In that case the best way to measure overall model performance is `llama-bench -pg Npp,Ntg`, where `Npp` is the number of prompt tokens and `Ntg = 0.1*Npp` is the number of generated tokens.

The following graph shows PG performance as a function of prompt length. The black symbols are mainline llama.cpp build b9ab0a4d (4687) (the most current version as of today), the red symbols are baseline ik_llama.cpp (no FA, no MLA), the green symbols are MLA, and the blue symbols are FA from this PR. The model is DeepSeek-Lite quantized with IQ4_XS. All runs use Q8_0 for the K cache, FA uses Q8_0 for the V cache as well, and everything runs on a Ryzen-7950X CPU.

If we buy the claim that `Ntg ~ 0.1*Npp` is the "typical enterprise workflow", then there is no benefit from MLA over baseline, while FA is ~26% better for long prompts. Mainline llama.cpp is, as usual, slower: 1.45X slower for short prompts, increasing to 1.7X slower for prompts with 16k tokens.

[Figure ds2_pg: PG (Npp prompt + 0.1*Npp generated tokens) speed vs. prompt length]
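Concretely, with `Ntg = 0.1*Npp` the sweep amounts to something like the following. This is a sketch of the FA configuration only; the model file name and thread count are placeholders, and the other configurations change only the attention-related switches.

```bash
# Hypothetical sketch of the PG sweep with Ntg ~ 0.1*Npp (FA configuration).
# -p 0 -n 0 disables the standalone PP/TG tests; each -pg adds one combined test.
./bin/llama-bench -m deepseek-lite-iq4_xs.gguf \
    -p 0 -n 0 \
    -pg 1024,102 -pg 2048,205 -pg 4096,410 -pg 8192,819 -pg 16384,1638 \
    -fa 1 -ctk q8_0 -ctv q8_0 -t 16
```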
