
Conversation

@ikawrakow (Owner) commented Oct 9, 2024

It is similar to IQ4_K, with the following differences (see the layout sketch after this list):

  • Blocks of 32 instead of blocks of 16
  • A row-wise float scale instead of a per-super-block ggml_half scale
  • 7-bit block scales instead of 6-bit, needed to ensure enough precision when using a per-row float scale
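A minimal layout sketch of what the bullet points above add up to. The type and field names here are hypothetical (the actual definitions live in the source); the sizes follow directly from the list:

```c
#include <stdint.h>

#define QK_K 256   // super-block size used throughout ggml

// Illustration only: eight blocks of 32 per super-block, one 7-bit scale
// per block (stored in a byte), and two 4-bit quants per byte. That is
// 8 + 128 = 136 bytes per 256 weights, i.e. exactly 4.25 bpw, with the
// row-wise float scale adding a negligible amount on top.
typedef struct {
    uint8_t scales[QK_K/32];   // 7-bit block scales
    uint8_t qs[QK_K/2];        // 4-bit quants, two per byte
} block_iq4_ks_sketch;

// A quantized row is then laid out as:
//   [ float row_scale | super-block 0 | super-block 1 | ... ]
```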

It ends up being 4.25 bpw, so the same as IQ4_XS. Why add it then? Because it has a lower quantization error than IQ4_XS, and for some models the difference is quite significant. The following table gives some examples. Quantization error is defined as Qerr = PPL(Q)/PPL(f16) - 1.

| Model | Qerr(IQ4_XS) | Qerr(IQ4_KS) |
|---|---|---|
| LLaMA-3.1-8B | 2.82% | 2.68% |
| LLaMA-3.1-8B-Instruct | 2.54% | 1.85% |
| LLaMA-3.2-3B-Instruct | 2.45% | 2.13% |
| Qwen-2.5-7B-Instruct | 2.31% | 1.62% |
| Qwen-2.5-32B-Instruct | 2.17% | 1.82% |
| Nemo-Instruct-2407 | 1.592% | 1.579% |
| Gemma-2-9B | 1.33% | 0.92% |
| Gemma-2-27B-Instruct | 1.23% | 0.72% |
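For concreteness, the definition in code (a trivial helper, not part of the PR; the absolute PPL values below are made up, chosen only so the ratio matches the IQ4_KS entry for LLaMA-3.1-8B in the table):

```c
#include <stdio.h>

// Quantization error as defined above: Qerr = PPL(Q)/PPL(f16) - 1.
static double quantization_error(double ppl_q, double ppl_f16) {
    return ppl_q / ppl_f16 - 1.0;
}

int main(void) {
    // A ratio of ~1.0268 corresponds to the 2.68% entry above.
    printf("Qerr = %.2f%%\n", 100.0 * quantization_error(6.674, 6.4998));
    return 0;
}
```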

Performance is similar to IQ4_XS or even slightly better, except for TG on the M2-Max GPU, where it is ~2% slower (Apple Silicon does not like non-sequential memory access, but having the row scale stored at the beginning of the row causes an additional memory jump in the dot product kernel).
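A scalar sketch of the access pattern in question, reusing the hypothetical `block_iq4_ks_sketch` layout from above. The scale encoding and nibble ordering are assumptions, and a real kernel would map the 4-bit values through a non-linear lookup table; the point is only that the first load comes from the start of the row before execution jumps ahead to the super-blocks:

```c
// Scalar sketch: the load of 'd' from the row start is the extra
// non-sequential access mentioned above.
static float vec_dot_iq4_ks_sketch(int n, const void * vx, const float * y) {
    const float d = *(const float *)vx;   // row scale at the start of the row
    const block_iq4_ks_sketch * x =
        (const block_iq4_ks_sketch *)((const char *)vx + sizeof(float));
    float sum = 0.0f;
    for (int ib = 0; ib < n/QK_K; ++ib) {
        for (int j = 0; j < QK_K/32; ++j) {
            // assumed 7-bit scale encoding: unsigned value with an offset of 64
            const float dl = (float)(x[ib].scales[j] & 127) - 64.0f;
            const float * yb = y + ib*QK_K + 32*j;
            float bsum = 0.0f;
            for (int k = 0; k < 16; ++k) {
                const uint8_t q = x[ib].qs[16*j + k];
                // assumed nibble order: low nibbles cover the first 16
                // weights of the block, high nibbles the next 16
                bsum += (float)(q & 0xf) * yb[k] + (float)(q >> 4) * yb[k + 16];
            }
            sum += dl * bsum;
        }
    }
    return d * sum;
}
```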

The PR also adds a new quantization mix, IQ3_KL (L for "large"). It fills the gap between IQ3_K and IQ4_K (and now IQ4_KS). The following graph illustrates where this new mix sits for LLaMA-3.1-8B-Instruct.

[Graph il31_8B: where IQ3_KL sits for LLaMA-3.1-8B-Instruct]

Iwan Kawrakow added 13 commits October 8, 2024 10:52
  • At the same 4.25 bpw, PPL is always better, for some models significantly better. I'll rename to iq4_ks and keep it.
  • We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0.
  • Also fix the breakage I caused with the dedicated work buffer quantization portion when the multiplication is not done via iqk_mul_mat.
  • I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2). Again the same mistake of packing int32_t back to int16_t, which overflows occasionally (just occasionally, which is why the result doesn't look completely wrong, so I didn't notice). See the sketch below.
  • After all, it is a smaller variant of iq4_k.
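The int16_t packing mistake from the iq4_xs commit above can be reproduced in isolation. A minimal, self-contained illustration (not the actual kernel code):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // A block dot product over int8 values can exceed the int16_t range:
    // 32 products of magnitude 127*127 sum to 516128, while int16_t only
    // holds [-32768, 32767].
    int32_t acc = 0;
    for (int i = 0; i < 32; ++i) acc += 127 * 127;
    int16_t packed = (int16_t)acc;   // narrows with wrap-around: corrupted
    printf("acc = %d, packed = %d\n", acc, (int)packed);
    // With typical inputs the sum usually fits, which is why the results
    // only look slightly off rather than completely wrong.
    return 0;
}
```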
@ikawrakow merged commit b30c9e1 into main Oct 9, 2024