New SOTA quantization: 4.25 bpw IQ4_KS #83

ikawrakow · 2024-10-09T09:31:40Z

It is similar to IQ4_K, with the following differences:

- Blocks of 32 instead of blocks of 16
- A row-wise float scale instead of a per-super-block ggml_half scale
- 7-bit block scales instead of 6-bit, needed to ensure enough precision when using a per-row float scale
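The 4.25 bpw figure in the title can be checked with a little arithmetic. The sketch below assumes each 7-bit block scale occupies a full byte (the description does not say how the eighth bit is used) and that the row scale is a 32-bit float; the function name and its parameters are illustrative only.

```python
def bits_per_weight(block_size, quant_bits, block_scale_bits,
                    row_scale_bits=0, row_len=None):
    """Average storage cost per weight, in bits."""
    # Bits used by one block: the packed quants plus the block scale.
    block_bits = quant_bits * block_size + block_scale_bits
    bpw = block_bits / block_size
    # Optionally amortize the per-row scale over the whole row.
    if row_len:
        bpw += row_scale_bits / row_len
    return bpw


# Blocks of 32 four-bit quants, one scale byte per block (assumed).
print(bits_per_weight(32, 4, 8))
# Same, including a 32-bit row scale for a hypothetical 4096-wide row.
print(bits_per_weight(32, 4, 8, row_scale_bits=32, row_len=4096))
```

Under these assumptions the per-row float adds well under 0.01 bpw for typical row lengths, which is why it does not change the headline number.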

It ends up being 4.25 bpw, the same as IQ4_XS. Why add it then? Because it has a lower quantization error than IQ4_XS, and for some models the difference is quite significant. The following table gives some examples. The quantization error Qerr is defined as PPL(Q)/PPL(f16) - 1.

| Model | Qerr(IQ4_XS) | Qerr(IQ4_KS) |
| --- | --- | --- |
| LLaMA-3.1-8B | 2.82% | 2.68% |
| LLaMA-3.1-8B-Instruct | 2.54% | 1.85% |
| LLaMA-3.2-3B-Instruct | 2.45% | 2.13% |
| Qwen-2.5-7B-Instruct | 2.31% | 1.62% |
| Qwen-2.5-32B-Instruct | 2.17% | 1.82% |
| Nemo-Instruct-2407 | 1.592% | 1.579% |
| Gemma-2-9B | 1.33% | 0.92% |
| Gemma-2-27B-Instruct | 1.23% | 0.72% |
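For reference, the Qerr definition above is easy to reproduce. The following is a minimal sketch; the perplexity inputs are made-up placeholders (the PR reports only the resulting percentages), and `relative_reduction` is a hypothetical helper for comparing two Qerr columns.

```python
def qerr(ppl_q, ppl_f16):
    """Quantization error as defined above: PPL(Q)/PPL(f16) - 1."""
    return ppl_q / ppl_f16 - 1.0


def relative_reduction(qerr_old, qerr_new):
    """Fraction by which qerr_new improves on qerr_old."""
    return (qerr_old - qerr_new) / qerr_old


# Placeholder perplexities, not measurements from the PR.
print(f"{qerr(10.2, 10.0):.2%}")

# Table row for LLaMA-3.1-8B-Instruct: 2.54% (IQ4_XS) vs 1.85% (IQ4_KS).
print(f"{relative_reduction(2.54, 1.85):.1%}")
```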

Performance is similar to IQ4_XS or even slightly better, except for TG on the M2-Max GPU, where it is ~2% slower (Apple Silicon does not like non-sequential memory access, and having the row scale stored at the beginning of the row causes an additional memory jump in the dot product kernel).
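To make the "additional memory jump" concrete, here is a rough sketch of a row whose float scale sits in front of the block data. The per-block layout (one scale byte followed by 16 bytes of packed 4-bit quants) and all names are assumptions for illustration; the actual struct in the PR may group scales and quants differently.

```python
import struct

ROW_SCALE_BYTES = 4                    # one 32-bit float per row
BLOCK_SIZE = 32                        # weights per block
BLOCK_BYTES = 1 + BLOCK_SIZE // 2      # assumed: scale byte + 4-bit quants


def row_bytes(n_cols):
    """Total bytes for one quantized row of n_cols weights."""
    assert n_cols % BLOCK_SIZE == 0
    return ROW_SCALE_BYTES + (n_cols // BLOCK_SIZE) * BLOCK_BYTES


def block_offset(i):
    """Byte offset of block i, which always lies past the row scale."""
    return ROW_SCALE_BYTES + i * BLOCK_BYTES


def read_row_scale(row):
    """The row scale is read from offset 0, away from the block data."""
    return struct.unpack_from("<f", row, 0)[0]


# A dummy 64-column row: scale 0.5 followed by zeroed block bytes.
row = struct.pack("<f", 0.5) + bytes(row_bytes(64) - ROW_SCALE_BYTES)
print(read_row_scale(row), block_offset(0), block_offset(1))
```

A kernel working on a block in the middle of the row has to fetch the scale from offset 0 as well as the block itself, which is the non-sequential access mentioned above.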

The PR also adds a new quantization mix, IQ3_KL (L for "large"). It fills the gap between IQ3_K and IQ4_K (and now IQ4_KS). The following graph illustrates where this new mix sits for LLaMA-3.1-8B-Instruct.

Iwan Kawrakow added 13 commits October 8, 2024 10:52

iq4_k_xxs: basics

4c76471

WIP + adding iq3_kl quantization mix

1dd6c40

iq4_xxs: this looks very viable compared to iq4_xs

975292b

At the same 4.25 bpw PPL is always better, for some models significantly better. I'll rename to iq4_ks and keep it.

iq4_xxs: CUDA dot product

81bd332

We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0.

iq4_xxs: scalar CPU dot product

834af69

Also fix the breakage I caused with the dedicated work buffer quantization portion when the multiplication is not done via iqk_mul_mat.

iq4_xxs: Zen4

c24ad0d

I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2). Again the same mistake of packing int32_t back to int16_t, which overflows occasionally (just occasionally, that's why the result doesn't look completely wrong, so I didn't notice).
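The failure mode described in this commit message can be illustrated with plain integers. This is a sketch, not the kernel code: it assumes the pack step saturates to the int16 range (as the x86 signed pack instructions do), and the quant and activation values are hypothetical worst-case and typical inputs.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def pack_saturate_i16(v):
    """Clamp an int32 value to the int16 range (assumed pack semantics)."""
    return max(INT16_MIN, min(INT16_MAX, v))


def dot_i32(a, b):
    """Exact integer dot product, as accumulated in 32 bits."""
    return sum(x * y for x, y in zip(a, b))


# Typical case: small magnitudes, the block sum fits in int16.
small = dot_i32([13] * 32, [5] * 32)
print(small, pack_saturate_i16(small))

# Rare case: large magnitudes across a block of 32, the sum does not fit.
exact = dot_i32([113] * 32, [127] * 32)
packed = pack_saturate_i16(exact)
print(exact, packed)
```

Because most blocks stay within range, the packed result usually matches the exact one, which is consistent with the output looking only slightly off rather than completely wrong.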

Fix iq4_xs (Zen4)

ee59051

iq4_xxs: AVX2

9a3f445

iq4_xxs: ARM_NEON

5865c98

iq4_xxs: Metal

bb522fb

iq4_xxs: slightly faster TG on Metal

0e12b29

iq4_xxs: rename to iq4_ks

bb6eab8

After all, it is a smaller variant of iq4_k.

iq3_kl: use iq4_ks instead of iq4_k/iq4_xs

f61c379

ikawrakow merged commit b30c9e1 into main Oct 9, 2024
