[TRTLLM-7078][chore] optimal kvcache transfer for VWSA #7952

chuangz0 · 2025-09-24T07:44:36Z

Enhance the cache transceiver to support VSWA.

The basic idea is to transfer only the blocks in the following range:

auto && blocks=blockscacheManager.getSequence(mRequestId).getCacheBlockIds(windowSize).at(kFIRST_AND_ONLY_BEAM)
blocks[blocks.size() - (windowSize/tokensPerBlock + 1), blocks.size())

Transfer (windowSize/tokensPerBlock + 1) blocks for each window.
Limitations:

These blocks are still transferred even if both the context and generation sides enable cache reuse.
partial_reuse is not supported.
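
The block range described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `tailBlocksForWindow` is a hypothetical helper, block IDs are plain `int`s, and the clamp for sequences that hold fewer blocks than the window needs is an assumption added so the sketch is safe to run. The extra block presumably covers a window that straddles a block boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of TensorRT-LLM): returns the trailing
// (windowSize / tokensPerBlock + 1) block IDs of a sequence, or all block
// IDs when the sequence holds fewer blocks than that (assumed clamp).
std::vector<int> tailBlocksForWindow(std::vector<int> const& blocks, int windowSize, int tokensPerBlock)
{
    assert(tokensPerBlock > 0 && windowSize >= 0);
    std::size_t const wanted = static_cast<std::size_t>(windowSize / tokensPerBlock) + 1;
    std::size_t const count = std::min(wanted, blocks.size());
    // Copy the last `count` entries of the block list.
    return std::vector<int>(blocks.end() - static_cast<std::ptrdiff_t>(count), blocks.end());
}
```

With eight blocks, `windowSize = 64`, and `tokensPerBlock = 32`, the helper selects the last three blocks, so five of the eight blocks are never sent.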

Summary by CodeRabbit

New Features
- Added per-window KV cache handling, enabling window-based block transfers.
- Introduced env flag TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW to force full-window transfers.
API Changes
- Updated buffer pre-allocation to require a tokens_per_block parameter (C++ and Python bindings).
- Request/block-hash interfaces now use per-window mappings instead of a flat list.
Tests
- Added disaggregated serving tests for GPT-OSS with block reuse on/off.
- Updated existing tests to reflect per-window behavior and new API signatures.

Description

For VSWA, the cache transceiver sends only the blocks that fall inside each attention window.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

chuangz0 · 2025-09-24T07:45:23Z

/bot run

tensorrt-cicd · 2025-09-24T07:51:00Z

PR_Github #19774 [ run ] triggered by Bot

coderabbitai · 2025-09-24T07:53:23Z

📝 Walkthrough

Introduces per-window KV-cache handling across APIs and implementations: new BlockRangeForWindow, BlockRange now holds per-window data, request and transceiver paths carry per-window block hashes, cache formatting and MLA formatting iterate by window, buffer sizing accounts for tokensPerBlock and an env flag, bindings and tests updated accordingly, and new integration tests added.

Changes

Cohort / File(s)	Summary
KV-cache range APIs `cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h`	Added BlockRangeForWindow; refactored BlockRange to per-window data (block IDs, pools); new factories and accessors; iterator updated to window-range; includes expanded.
Request API (per-window hashes) `cpp/include/tensorrt_llm/batch_manager/llmRequest.h`	GenericLlmRequest now stores/returns requested block hashes per window via unordered_map; old flat vector removed.
Cache formatter (window-aware) `cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp`	Sending/receiving logic changed to per-window ranges; env flag to transfer all blocks per window; detailed logging; layer-wise and non-layer-wise paths updated.
MLA cache formatter `cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp`	Iteration switched from pools to per-window; assertions on window count; per-window block collection for format/unformat.
Data transceiver (per-window hashes, serialization) `cpp/tensorrt_llm/batch_manager/dataTransceiver.h`, `cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp`	RequestInfo now holds per-window hash map; constructor, accessors, equality, (de)serialization, and usage updated; sender/receiver paths accept and propagate per-window hashes; include added.
Transfer buffer sizing (tokensPerBlock) `cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h`, `cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp`, `cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp`	preAllocBufferSize signature adds tokensPerBlock; logic uses windowSize + tokensPerBlock with env override; call sites updated.
Bindings for new signature `cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp`, `cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp`	Python bindings updated to include tokens_per_block parameter for preAllocBufferSize.
Env flag `cpp/tensorrt_llm/common/envUtils.h`, `cpp/tensorrt_llm/common/envUtils.cpp`	Added getEnvKVCacheTransferAllBlocksForWindow for env-driven behavior.
Unit tests (range, buffer, multi-GPU) `cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp`, `cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp`, `cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp`	Tests migrated to BlockRangeForWindow; buffer preAlloc tests updated for new arg; multi-GPU tests refactored to per-window loops and helper signatures; max sequences adjusted.
Integration tests and list `tests/integration/defs/accuracy/test_disaggregated_serving.py`, `tests/integration/test_lists/test-db/l0_dgx_h200.yml`	Switched overlap_scheduler param to block_reuse; set max_attention_window and enable_block_reuse; added new TestGPTOSS class and two new test entries in DGX H200 list.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Model as Model/Executor
  participant Formatter as CacheFormatter
  participant Range as BlockRange (multi-window)
  participant Tx as DataTransceiver::Sender
  participant Rx as DataTransceiver::Receiver
  participant Req as LlmRequest

  Model->>Formatter: getBlockRangeForSending(requestId)
  Formatter->>Range: compute per-window blockIds/pools
  Range-->>Formatter: windowSizes + BlockRangeForWindow
  Formatter->>Tx: sendRequestInfo(RequestId, blockHashesPerWindow)
  Tx->>Tx: Serialize map<windowSize, hashes>
  Tx-->>Rx: Transmit RequestInfo
  Rx->>Req: setRequestedBlockHashesPerWindow(map)
  Note over Rx,Req: Per-window hashes stored on request

  Model->>Formatter: getBlockRangeForReceiving(requestId)
  Formatter->>Range: build per-window ranges (env-aware)
  Range-->>Formatter: BlockRangeForWindow per window
  loop for each windowSize
    Formatter->>Rx: receive blocks for window
    Rx-->>Formatter: blocks for window
    Formatter->>Formatter: unformat into per-window buffers
  end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 3.70% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Description Check	⚠️ Warning	The pull request description is incomplete and lacks critical information required by the template. The Description section provides only a brief, technical explanation without clearly articulating the issue being solved or the reasoning behind the solution. The Test Coverage section is entirely empty with no tests listed to validate the changes. The PR Checklist is largely unchecked, with only a single generic box marked without verification of the substantive items like coding guidelines compliance, test case provision, or documentation updates. While the description is not off-topic, the overall submission is significantly incomplete and does not meet the template's structural requirements.	The author should provide a more detailed Description section that clearly explains what problem is being solved and why this solution was chosen. The Test Coverage section must be filled with specific test cases that validate the per-window KV cache transfer changes, particularly for VWSA support. The PR Checklist items should be explicitly reviewed and checked as appropriate, including verification that the code follows TRT-LLM coding guidelines and that relevant documentation has been updated. Consider adding more context about the performance implications and any backward compatibility concerns.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title is concise, follows the repository tag format with a JIRA ticket and type, and accurately summarizes the main intent of the change—optimizing KV cache transfer for VWSA/windowed transfers—so it is clear and relevant to the changes in the diff.

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)

cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (2)
76-83: Fix typo: py::classh -> py::class_. This won’t compile as-is.

Apply this diff:
-    py::classh<tb::BaseCacheTransceiver, PyCacheTransceiver>(m, "BaseCacheTransceiver")
+    py::class_<tb::BaseCacheTransceiver, PyCacheTransceiver>(m, "BaseCacheTransceiver")
88-96: Fix all occurrences of py::classh → py::class_ in pybind registrations

Replace every py::classh< with py::class_< (typo). Matches found:

cpp/tensorrt_llm/pybind/runtime/bindings.cpp:202

cpp/tensorrt_llm/pybind/runtime/bindings.cpp:224

cpp/tensorrt_llm/pybind/runtime/bindings.cpp:229

cpp/tensorrt_llm/pybind/bindings.cpp:78

cpp/tensorrt_llm/pybind/bindings.cpp:107

cpp/tensorrt_llm/pybind/bindings.cpp:398

cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp:76

cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp:88

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp:336

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp:477

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp:497

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp:517

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp:524

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp:99

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp:262

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp:393

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp:401

Example (apply similarly everywhere):
-    py::classh<tb::CacheTransceiver, tb::BaseCacheTransceiver>(m, "CacheTransceiver")
+    py::class_<tb::CacheTransceiver, tb::BaseCacheTransceiver>(m, "CacheTransceiver")
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)
247-271: Same clamping bug in preAllocBufferSize()

Mirror the fix for the pre-allocation path.
-            auto validTokenNum = (static_cast<size_t>(windowSize) < maxNumTokens.value()
-                    ? static_cast<size_t>(windowSize) + tokensPerBlock
-                    : maxNumTokens.value());
+            auto validTokenNum = std::min(
+                maxNumTokens.value(),
+                static_cast<size_t>(windowSize) + static_cast<size_t>(tokensPerBlock));
             if (common::getEnvKVCacheTransferAllBlocksForWindow())
             {
                 validTokenNum = maxNumTokens.value();
             }
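
A small worked comparison shows why the ternary can overshoot. This is a sketch only: `validTokensUnclamped` and `validTokensClamped` are hypothetical free functions that isolate the two expressions from the diff, with all values taken as `std::size_t`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Expression being replaced: tokensPerBlock is added whenever the window is
// smaller than maxNumTokens, so the sum itself may exceed maxNumTokens.
std::size_t validTokensUnclamped(std::size_t windowSize, std::size_t tokensPerBlock, std::size_t maxNumTokens)
{
    return windowSize < maxNumTokens ? windowSize + tokensPerBlock : maxNumTokens;
}

// Suggested expression: the sum is always capped at maxNumTokens.
std::size_t validTokensClamped(std::size_t windowSize, std::size_t tokensPerBlock, std::size_t maxNumTokens)
{
    return std::min(maxNumTokens, windowSize + tokensPerBlock);
}
```

For `windowSize = 120`, `tokensPerBlock = 32`, and `maxNumTokens = 128`, the first form yields 152 tokens while the second yields 128. The two forms agree whenever `windowSize + tokensPerBlock <= maxNumTokens` or `windowSize >= maxNumTokens`.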
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)
436-438: Buffer index mismatch in chunked send path

Using processIdx instead of bufferIdx corrupts the slice logic when TP×PP > 1. Use bufferIdx consistently.
-                size_t remainSendSize = outputSplitCaches[processIdx]->getSize();
-                size_t needSendSize = outputSplitCaches[processIdx]->getSize();
+                size_t remainSendSize = outputSplitCaches[bufferIdx]->getSize();
+                size_t needSendSize = outputSplitCaches[bufferIdx]->getSize();
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
878-902: Bug: CP over-allocation check uses global blockIdx instead of per-window index.

isBlockOverallocated(blockIdx, numTotalBlocks) compares a global index against a per-window count. Use blockIdInWindow for correctness within each window.

Apply this diff:
-                if (isBlockOverallocated(blockIdx, numTotalBlocks))
+                if (isBlockOverallocated(blockIdInWindow, numTotalBlocks))
                 {
                     TLLM_LOG_INFO(
                         "[generationVerifyKVCache] Skipping over-allocated block for request id %d (rank %d, blockIdx "
                         "%d, numTotalBlocks %d)",
-                        llmRequest->mRequestId, mRank, blockIdx, numTotalBlocks);
+                        llmRequest->mRequestId, mRank, blockIdInWindow, numTotalBlocks);
                     break;
                 }

🧹 Nitpick comments (11)

cpp/tensorrt_llm/common/envUtils.h (1)

119-120: Add Doxygen comment for the new env getter.

Document the env var name and semantics for consistency with our header guidelines.

Apply this diff:
- bool getEnvKVCacheTransferAllBlocksForWindow();
+ //! Returns true when environment TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW is set to "1".
+ bool getEnvKVCacheTransferAllBlocksForWindow();
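
The semantics stated in the suggested comment (true when the variable equals "1") can be sketched with the standard `std::getenv`. `envFlagIsOne` is a hypothetical stand-in; the actual `getEnvKVCacheTransferAllBlocksForWindow` in envUtils.cpp is not shown in this thread and may parse or cache the value differently.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for an env getter: returns true only when the
// variable exists and its value is exactly "1". Unset or any other value
// is treated as false. The value is re-read on every call (no caching).
bool envFlagIsOne(char const* name)
{
    char const* value = std::getenv(name);
    return value != nullptr && std::strcmp(value, "1") == 0;
}
```

A call such as `envFlagIsOne("TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW")` would then gate the full-window transfer path.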

cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (1)

442-443: Add a unit test for TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW.

Exercise the env‑flag path in preAllocBufferSize to prevent regressions.

Apply this diff to append a new test:

+TEST_F(CacheTransBufferTest, TestPreAlloc_AllBlocksForWindowEnv)
+{
+    pid_t pid = fork();
+    ASSERT_NE(pid, -1) << "Fork failed";
+    if (pid == 0)
+    {
+        // Child
+        SizeType32 maxBlocksPerSeq = 10;
+        SizeType32 tokensPerBlock = 8;
+        const size_t maxNumTokens = static_cast<size_t>(maxBlocksPerSeq) * tokensPerBlock;
+        SetUpCacheTransBuffer(4, 2, 64, tokensPerBlock, CacheType::kSELFKONLY, maxNumTokens, maxBlocksPerSeq);
+        const size_t cacheSizeBytesPerToken = kvCacheSizePerToken(4, 2, 64, CacheType::kSELFKONLY);
+        std::map<SizeType32, SizeType32> perWindow{
+            {maxBlocksPerSeq * tokensPerBlock, cacheSizeBytesPerToken}};
+        tensorrt_llm::executor::CacheTransceiverConfig cfg{
+            tensorrt_llm::executor::CacheTransceiverConfig::BackendType::UCX, maxNumTokens};
+
+        // Baseline (flag not set): window + tokensPerBlock
+        unsetenv("TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW");
+        size_t sz1 = CacheTransBufferManager::preAllocBufferSize(perWindow, tokensPerBlock, cfg);
+
+        // With flag set: uses maxNumTokens for the window
+        setenv("TRTLLM_KVCACHE_TRANSFER_ALL_BLOCKS_FOR_WINDOW", "1", 1);
+        size_t sz2 = CacheTransBufferManager::preAllocBufferSize(perWindow, tokensPerBlock, cfg);
+        EXPECT_GT(sz2, sz1);
+        exit(testing::Test::HasFailure() ? 1 : 0);
+    }
+    int status;
+    ASSERT_NE(-1, waitpid(pid, &status, 0)) << "waitpid failed";
+    ASSERT_TRUE(WIFEXITED(status));
+    ASSERT_EQ(0, WEXITSTATUS(status));
+}

cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1)

63-65: Add brief Doxygen to the updated API; consider clamping logic in the implementation.

Document tokensPerBlock purpose.
In the .cpp, consider clamping (windowSize + tokensPerBlock) to maxTokens to avoid oversizing when windowSize is close to maxTokens.

Apply this diff to document the API:

-    static size_t preAllocBufferSize(std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
-        SizeType32 tokensPerBlock,
-        std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig = std::nullopt);
+    //! Pre-compute total bytes to pre-allocate for KV cache transfer buffers.
+    //! cacheSizeBytesPerTokenPerWindow: map of window_size -> bytes-per-token across local attention layers.
+    //! tokensPerBlock: tokens per cache block (used for block-aligned sizing).
+    //! cacheTransceiverConfig: when absent or without backend, returns 0.
+    static size_t preAllocBufferSize(std::map<SizeType32, SizeType32> const& cacheSizeBytesPerTokenPerWindow,
+        SizeType32 tokensPerBlock,
+        std::optional<executor::CacheTransceiverConfig> const& cacheTransceiverConfig = std::nullopt);

For the .cpp implementation, consider:

-            auto validTokenNum = (static_cast<size_t>(windowSize) < maxNumTokens.value()
-                    ? static_cast<size_t>(windowSize) + tokensPerBlock
-                    : maxNumTokens.value());
+            auto validTokenNum = static_cast<size_t>(windowSize) + static_cast<size_t>(tokensPerBlock);
+            validTokenNum = std::min(validTokenNum, maxNumTokens.value());

cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2)

132-132: Remove stray merge marker

Leftover "diff end" marker should be removed.
-    // diff end
+
141-145: Deterministic window ordering

getWindowSizes() draws keys from an unordered_map; ordering is not guaranteed and can desync zcopy send/recv iteration order. Sort windowSizes before iterating.
-    auto const& windowSizes = blockRange.getWindowSizes();
+    auto windowSizes = blockRange.getWindowSizes();
+    std::sort(windowSizes.begin(), windowSizes.end());
Apply similarly in unformat().

Also applies to: 350-352

cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)

2048-2049: Add Doxygen for new per-window APIs

Headers require Doxygen docs for new interfaces. Please document setRequestedBlockHashes and getRequestedBlockHashesPerWindow.

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)

271-277: Deterministic window ordering

Sort windowSizes; current getWindowSizes() draws from an unordered_map.

-        auto const& windowSizes = blockRange.getWindowSizes();
+        auto windowSizes = blockRange.getWindowSizes();
+        std::sort(windowSizes.begin(), windowSizes.end());
@@
-    auto windowSizes = blockRange.getWindowSizes();
+    auto windowSizes = blockRange.getWindowSizes();
+    std::sort(windowSizes.begin(), windowSizes.end());

Also applies to: 547-549

cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (2)

122-130: Make getWindowSizes deterministic

Return a sorted list of window sizes to stabilize iteration order across sender/receiver.

-    std::vector<SizeType32> getWindowSizes() const
-    {
-        std::vector<SizeType32> windowSizes;
-        for (auto const& [windowSize, _] : mPoolsPerWindow)
-        {
-            windowSizes.push_back(windowSize);
-        }
-        return windowSizes;
-    }
+    std::vector<SizeType32> getWindowSizes() const
+    {
+        std::vector<SizeType32> windowSizes;
+        windowSizes.reserve(mPoolsPerWindow.size());
+        for (auto const& [windowSize, _] : mPoolsPerWindow)
+        {
+            windowSizes.push_back(windowSize);
+        }
+        std::sort(windowSizes.begin(), windowSizes.end());
+        return windowSizes;
+    }

Add include outside this hunk:

#include <algorithm>

27-56: Header API additions: please add Doxygen

New BlockRangeForWindow class and modified BlockRange API need Doxygen per project guidelines.

Also applies to: 58-66, 80-96, 97-111, 113-121, 132-136, 137-166, 187-193, 230-233, 250-253, 255-258

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)

218-224: Serialize unordered_map deterministically (optional).

Current serialization iterates an unordered_map; correctness is fine since deserialization restores a map. For reproducible byte streams/logs, consider ordering by windowSize when serializing.

Apply this diff:
-    su::serialize(requestInfo.mBlockHashesPerWindow.size(), os);
-    for (auto const& [windowSize, blockHashes] : requestInfo.mBlockHashesPerWindow)
+    su::serialize(requestInfo.mBlockHashesPerWindow.size(), os);
+    // Optional: stable order for reproducible serialization
+    std::vector<SizeType32> keys;
+    keys.reserve(requestInfo.mBlockHashesPerWindow.size());
+    for (auto const& kv : requestInfo.mBlockHashesPerWindow) { keys.push_back(kv.first); }
+    std::sort(keys.begin(), keys.end());
+    for (auto const& windowSize : keys)
     {
-        su::serialize(windowSize, os);
-        su::serialize(blockHashes, os);
+        su::serialize(windowSize, os);
+        su::serialize(requestInfo.mBlockHashesPerWindow.at(windowSize), os);
     }
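
The round trip behind this comment can be illustrated with a self-contained sketch. It is not the project's `su::serialize` format: `HashesPerWindow`, `writeHashes`, and `readHashes` are hypothetical names, and a whitespace-separated text stream is used only to keep the example readable. Writing keys in sorted order makes the output reproducible, while reading back into an `unordered_map` restores the same contents regardless of order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical alias: window size -> block hashes for that window.
using HashesPerWindow = std::unordered_map<std::int32_t, std::vector<std::size_t>>;

// Writes the entry count, then (windowSize, hashCount, hashes...) for each
// window in ascending windowSize order, so equal maps give equal output.
void writeHashes(HashesPerWindow const& hashes, std::ostream& os)
{
    std::vector<std::int32_t> keys;
    keys.reserve(hashes.size());
    for (auto const& entry : hashes)
    {
        keys.push_back(entry.first);
    }
    std::sort(keys.begin(), keys.end());
    os << keys.size();
    for (auto key : keys)
    {
        auto const& blockHashes = hashes.at(key);
        os << ' ' << key << ' ' << blockHashes.size();
        for (auto hash : blockHashes)
        {
            os << ' ' << hash;
        }
    }
}

// Reads the format produced by writeHashes back into a map.
HashesPerWindow readHashes(std::istream& is)
{
    HashesPerWindow hashes;
    std::size_t numWindows = 0;
    is >> numWindows;
    for (std::size_t i = 0; i < numWindows; ++i)
    {
        std::int32_t windowSize = 0;
        std::size_t numHashes = 0;
        is >> windowSize >> numHashes;
        std::vector<std::size_t> blockHashes(numHashes);
        for (auto& hash : blockHashes)
        {
            is >> hash;
        }
        hashes.emplace(windowSize, std::move(blockHashes));
    }
    return hashes;
}
```

The extra key vector costs one allocation and a sort per request, which is small next to the block transfer itself; whether reproducible bytes are worth it is the judgment call the comment leaves open.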
433-446: Passing per-window hashes via LlmRequest mutation — acceptable, but consider removing the TODO.

The TODO suggests passing hashes directly; that would reduce hidden coupling. For a future refactor, thread the hashes through the TransferSession/formatter API instead of mutating LlmRequest.

I can draft an API sketch to pass block hashes into BaseCacheFormatter::format without touching LlmRequest if helpful.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 946ffcd and af1be23.

📒 Files selected for processing (18)

cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (4 hunks)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (2 hunks)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (6 hunks)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (3 hunks)
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1 hunks)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (9 hunks)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (4 hunks)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1 hunks)
cpp/tensorrt_llm/common/envUtils.cpp (1 hunks)
cpp/tensorrt_llm/common/envUtils.h (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp (1 hunks)
cpp/tests/unit_tests/batch_manager/cacheTransBufferTest.cpp (2 hunks)
cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp (2 hunks)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (14 hunks)
tests/integration/defs/accuracy/test_disaggregated_serving.py (3 hunks)
tests/integration/test_lists/test-db/l0_dgx_h200.yml (1 hunks)

🧰 Additional context used

📓 Path-based instructions (8)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}