Add KVzipPress #93

Merged

maxjeblick merged 13 commits into NVIDIA:main from Janghyun1230:main on Jul 25, 2025

Add KVzipPress#93
maxjeblick merged 13 commits intoNVIDIA:mainfrom
Janghyun1230:main

Conversation


@Janghyun1230 (Contributor) commented on Jul 8, 2025

PR description

Hi! I've tried to add KVzip, a recent work on query-agnostic KV cache eviction.

KVzip achieves near-lossless compression at eviction ratios of up to 80% on RULER-4k with LLaMA3.1-8B (evaluated using the evaluation.py script from this repository). I've uploaded the result JSON files to Drive.

| Compression ratio | 0 | 0.1 | 0.25 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|
| Average performance | 95.7 | 95.5 | 95.5 | 95.5 | 95.5 | 95.3 | 94.9 | 90.5 |

KVzip introduces compression overhead (2× prefilling time, with negligible memory overhead). The original KVzip repository also provides a version without compression overhead at the cost of performance, using DuoAttention-style head-level eviction.

I tried to make minimal changes to this repository, but I had to make some additions in pipeline.py. I followed the fake compression strategy used for AdaKV in this repository, whereas the original KVzip repository provides optimized code that improves decoding speed by 2×.
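
For context: fake compression keeps the KV cache at its full size and merely neutralizes "evicted" positions, so accuracy can be measured without the memory or speed benefit of real eviction. A minimal sketch of the idea (a hypothetical helper, not the actual implementation in this repository or in KVzip):

    import torch

    def fake_eviction_bias(scores: torch.Tensor, compression_ratio: float) -> torch.Tensor:
        """Build an additive attention bias that masks out the lowest-scoring KV
        positions per head, emulating eviction without changing the cache shape.

        scores: (batch, num_heads, seq_len) importance scores per KV pair.
        Returns a (batch, num_heads, 1, seq_len) bias; -inf entries drive the
        attention weights of "evicted" positions to zero after softmax.
        """
        n_evict = int(scores.shape[-1] * compression_ratio)
        evict_idx = scores.topk(n_evict, dim=-1, largest=False).indices
        bias = torch.zeros_like(scores)
        bias.scatter_(-1, evict_idx, float("-inf"))
        return bias.unsqueeze(2)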

Please review and let me know if there are any issues or if everything looks fine. Truly appreciate your great repository!

Checklist

  • Tests are working (make test)
  • Code is formatted correctly (make style, on errors try fix with make format)
  • Copyright header is included
  • All commits are signed-off using git commit -s
  • (new press) mypress_press.py is in the presses directory
  • (new press) MyPress is in __init__.py
  • (new press) README.md is updated with a one-liner about the new press in the Available presses section
  • (new press) New press is in the default_presses list in tests/default_presses.py
  • (new press) A docstring is provided that follows the same structure as the existing ones

@maxjeblick self-requested a review on July 10, 2025 12:20

@maxjeblick (Collaborator) commented on Jul 13, 2025

Thanks a lot for your PR; the results look very promising!
I went through the PR and have the following suggestion:
Instead of modifying pipeline.py, try to move the logic into the press's __call__ context manager method.
I'll leave a code stub below that exemplifies this. (I haven't run the stub myself; it serves more as a template.)
In particular:

  • The context input ids and question suffix are fetched/computed in the press itself (answer_prefix isn't part of the question suffix; I don't know if this affects performance).
  • Instead of using a .do_compress attribute, the forward hooks are only registered after the initial forward pass of the model (upon exiting the context manager).
  • The model forward loop is part of the __call__ method.

Feel free to discuss these proposed changes here.

    # Imports assumed for this stub; SUPPORTED_MODELS and logger are
    # module-level names in the press module.
    from contextlib import contextmanager
    from typing import Generator

    import torch
    from torch import nn
    from transformers import (AutoTokenizer, Gemma3ForCausalLM,
                              PreTrainedModel, PreTrainedTokenizer)

    @contextmanager
    def __call__(self, model: PreTrainedModel) -> Generator:
        """
        Context manager that handles both initial prefilling and KVzip scoring/compression.
        
        This overrides the base class __call__ method to implement the full KVzip algorithm:
        1. First yield: allows initial prefilling with context
        2. After yield: performs KVzip scoring and compression using context reconstruction
        """
        if not isinstance(model, SUPPORTED_MODELS):
            logger.warning(f"Model {type(model)} not tested, supported models: {SUPPORTED_MODELS}")

        if isinstance(model, Gemma3ForCausalLM):
            logger.warning("Compression in Gemma3 is only applied to layer without sliding window attention")

        # Load the tokenizer to compute the question suffix ids
        tokenizer = AutoTokenizer.from_pretrained(model.config.name_or_path)

        # Get suffix_ids directly using tokenizer's chat template (do this once, not in hook)
        if tokenizer.chat_template is None:
            suffix_text = "\n"  # Default suffix for models without chat template
        else:
            # Use a dummy context to extract the question suffix from chat template
            dummy_context = "dummy context"
            separator = "\n" + "#" * len(dummy_context)
            temp_context = tokenizer.apply_chat_template(
                [{"role": "user", "content": dummy_context + separator}],
                add_generation_prompt=True,
                tokenize=False
            )
            _, suffix_text = temp_context.split(separator)
        
        # Tokenize suffix directly to ids
        self._suffix_ids = tokenizer.encode(suffix_text, return_tensors="pt", add_special_tokens=False)

        # Register embedding hook to capture context information
        hooks = []
        try:
            # First yield: Initial prefilling phase (no compression hooks yet)
            embedding_hook = model.model.embed_tokens.register_forward_hook(self._forward_hook_embedding,
                                                                            with_kwargs=True)
            yield
            # Remove embedding hook since we no longer need it
            embedding_hook.remove()

            # After yield: KVzip scoring and compression phase
            if self.compression_ratio > 0 and self._context_ids is not None:
                # Now register attention hooks for compression
                for layer in model.model.layers:
                    if isinstance(model, Gemma3ForCausalLM) and layer.is_sliding:
                        continue
                    layer.self_attn.rotary_emb = model.model.rotary_emb
                    hooks.append(layer.self_attn.register_forward_hook(self.forward_hook, with_kwargs=True))

                self._perform_kvzip_compression(model, tokenizer)
        finally:
            for hook in hooks:
                hook.remove()

    def _forward_hook_embedding(self, module: nn.Module, input: tuple, kwargs: dict, output: torch.Tensor):
        """
        Hook on the embedding layer that captures the context ids and the KV cache
        from the initial forward pass.
        """
        self._context_ids = input[0]
        self._cache = ...  # fetch the past_key_values cache from kwargs

        return output


    def _perform_kvzip_compression(self, model: PreTrainedModel, tokenizer: PreTrainedTokenizer):
        """
        Perform the KVzip scoring and compression algorithm.
        """
        context_length = self._context_ids.shape[1]
        self.context_length = context_length

        # Prepare chunked inputs for context reconstruction
        input_ids = self.prepare(model, tokenizer, context_length)

        # Reset start_idx for scoring
        self.start_idx = 0

        # Perform scoring through context reconstruction
        # Use the stored cache from the initial forward pass
        for prefill_ids, repeat_ids in input_ids:
            self.end_idx = self.start_idx + prefill_ids.shape[1]
            # Pass the cache that was used in the initial forward pass
            model(
                input_ids=repeat_ids.to(model.device),
                past_key_values=self._cache,
                num_logits_to_keep=1,
            )
            self.start_idx = self.end_idx

        # Verify tokenization consistency
        assert self.end_idx == context_length, "Tokenization is not consistent"

        # Perform final compression
        self.compress_post(model)

@Janghyun1230 force-pushed the main branch 2 times, most recently from cd1e722 to 2799cab on July 14, 2025 08:37

@Janghyun1230 (Contributor, Author) commented on Jul 14, 2025

Following your guidelines, I've moved the modifications from pipeline.py into the press's __call__ context manager method. I also merged the latest upstream commits into my branch and updated the code accordingly.

I ran some tests and confirmed that the current version maintains performance. (The prefilling and compression time has slightly increased.) Please review and let me know if you have any further suggestions!

@maxjeblick requested a review from alessiodevoto on July 15, 2025 09:19
@maxjeblick (Collaborator)

Thanks a lot for the quick updates!
We will review the PR; please expect this to take a few days.

@maxjeblick (Collaborator) left a comment

Thanks a lot for the extensive refactoring of the press!

I've tested the press on the RULER-4k benchmark, and the results look very nice!
We will add your press to our benchmark once it is merged.

For the press to be merged, I kindly ask you to:

  • Add a warning in the press post-init method, informing the user that the press uses multiple forward passes (see the sketch after this list).
  • Refactor the press implementation in several places. I left some comments in the code; there are other places where refactoring can help as well. Please also add more comments/docstrings to help users.
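
A minimal sketch of such a warning, assuming the press is a dataclass with a __post_init__ hook like the other presses in this repository:

    import logging

    logger = logging.getLogger(__name__)

    # Hypothetical addition to KVzipPress:
    def __post_init__(self):
        # Inform the user about the extra prefilling cost of KVzip scoring.
        logger.warning(
            "KVzipPress performs multiple forward passes over the context to score "
            "KV pairs, which roughly doubles prefilling time."
        )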

@Janghyun1230 (Contributor, Author)

I appreciate your detailed feedback! Your comments improve the clarity and interpretability of the code. I've incorporated all of your suggestions, leaving comments on some specific points.

One thing I'd like to mention is that the current KVzipPress implementation is not compatible with ComposedPress. This is because KVzipPress follows slightly different logic and adopts fake compression like AdaKV, which is also incompatible with ComposedPress.

This incompatibility causes an error in make test (tests/presses/test_presses.py, line 86), where the test invokes ComposedPress with KVzipPress. Aside from this, I found no other issues during testing.

@maxjeblick (Collaborator) left a comment

Thanks a lot for the refactoring, code looks good!

> One thing I'd like to mention is that the current KVzipPress implementation is not compatible with ComposedPress. This is because KVzipPress follows slightly different logic and adopts fake compression like AdaKV, which is also incompatible with ComposedPress.

Ok, thanks for the notice. For now, you can add an if statement in the test, skipping that combination. I'll merge the PR once the tests pass.
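
A minimal sketch of such a skip (the parametrization and names in tests/presses/test_presses.py are assumed, not copied):

    import pytest
    from kvpress import ComposedPress, KVzipPress

    def test_composed_press(press_cls):
        # Skip the combination that is known to be incompatible.
        if press_cls is KVzipPress:
            pytest.skip("KVzipPress uses fake compression and is incompatible with ComposedPress")
        press = ComposedPress(presses=[press_cls()])
        # ... rest of the existing test body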

@Janghyun1230 (Contributor, Author)

Thank you for the review! I've added an if statement in tests/presses/test_presses.py to skip that combination.

@maxjeblick merged commit fb93b31 into NVIDIA:main on Jul 25, 2025
3 checks passed