Skip to content

tentacle: erasure-code/isa: fix cache collision causing buffer overflow#68956

Open
tchaikov wants to merge 1 commit into
ceph:tentaclefrom
tchaikov:wip-74943-tentacle
Open

tentacle: erasure-code/isa: fix cache collision causing buffer overflow#68956
tchaikov wants to merge 1 commit into
ceph:tentaclefrom
tchaikov:wip-74943-tentacle

Conversation

@tchaikov
Copy link
Copy Markdown
Contributor

backport tracker: https://tracker.ceph.com/issues/74943


backport of #66894
parent tracker: https://tracker.ceph.com/issues/74382

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

@tchaikov tchaikov requested a review from a team as a code owner May 17, 2026 02:48
@tchaikov tchaikov added this to the tentacle milestone May 17, 2026
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Backport Parity Review - Backport Deviation Alert

A conflict or unapproved deviation was detected during the simulation of this backport. The code in this PR does not match a clean cherry-pick of the upstream commits.

Deviation in Backport 8800f99b (cherry-pick of 7e668715)

Affected File(s)

Range Diff

Click to expand
--- Original (7e668715)
+++ Backport (8800f99b)
@@ -116,3 +116,4 @@
 Fixes: https://tracker.ceph.com/issues/74382
 
 Signed-off-by: Kefu Chai <k.chai@proxmox.com>
+(cherry picked from commit 7e668715b0eb98b3f9ae9f520f6b54fcf908d9ee)

================================================================================
RANGE DIFF
================================================================================

1:  7e668715b0e ! 1:  8800f99b7a2 erasure-code/isa: fix cache collision causing buffer overflow
    @@ Commit message
         Fixes: https://tracker.ceph.com/issues/74382
     
         Signed-off-by: Kefu Chai <k.chai@proxmox.com>
    +    (cherry picked from commit 7e668715b0eb98b3f9ae9f520f6b54fcf908d9ee)
     
      ## src/erasure-code/isa/ErasureCodeIsa.cc ##
     @@ src/erasure-code/isa/ErasureCodeIsa.cc: ErasureCodeIsaDefault::isa_decode(int *erasures,
    @@ src/erasure-code/isa/ErasureCodeIsaTableCache.cc: ErasureCodeIsaTableCache::putD
        dout(12) << "[ put table    ] = " << signature << dendl;
     
      ## src/test/erasure-code/TestErasureCodePlugins.cc ##
    -@@ src/test/erasure-code/TestErasureCodePlugins.cc: TEST_P(PluginTest, CRCEncodeDecodeSupport) {
    -       uint32_t decoded_crc =
    -           read_crc_from_bufferlist(out_bls[missing_shard_id]);
    -       uint32_t original_crc =
    --          read_crc_from_bufferlist(hashes_bl, missing_shard_id.id * chunk_size);
    -+          read_crc_from_bufferlist(hashes_bl, missing_raw_shard_id.id * chunk_size);
    - 
    -       different = different | (decoded_crc != original_crc);
    -     }
    +@@ src/test/erasure-code/TestErasureCodePlugins.cc: INSTANTIATE_TEST_SUITE_P(
    +  *   valgrind --tool=memcheck ./unittest_erasure_code_plugins \
    +  *      --gtest_filter=*.* --log-to-stderr=true --debug-osd=20"
    +  * End:
    +- */
    + \ No newline at end of file
    ++ */

How to proceed:

  • Authors (Genuine Conflicts): If this is a genuine conflict requiring manual resolution, ensure your resolution is correct. You must explain the conflict resolution in the commit message (e.g., leave the standard Git Conflicts: block intact) and include an explanation for changes.
  • Authors (Need Help?): Reach out to the Component Lead for technical guidance on complex code conflicts.
  • Component Leads (Review): Please review the Range Diff(s) above to verify the author's manual conflict resolution is correct for this release branch. If the deviation is intentional, documented, and approved then the component lead or @ceph/ceph-release-manager can bypass this check by commenting /audit override.

Be familiar with the rules and guidelines for writing backports.


Commit Parity Visualizer

BACKPORT PR #68956 SOURCE PR SOURCE STATUS
8800f99 erasure-code/isa: fix cache collision causing buffer overflow PR #66894 7e66871 erasure-code/isa: fix cache collision causing buffer overflow

🛟 Need Help?

If you need technical help resolving these issues, please consult with the Component Lead. If you need administrative overrides, please see the #ceph-upstream-releases channel on Slack and request a review from the @ceph/ceph-release-manager.

📋 Component Lead / Release Manager

To override the audit failure, apply releng-audit-override label or comment /audit override.


⚠️ Note: Automated audit checks will be suspended on future pushes to prevent comment spam while you work.

When you are ready for a new audit, please remove the releng-audit-fail label or comment /audit retest.

CI Run Log: View Workflow Details

@github-actions github-actions Bot added the releng-audit-fail Release engineering: failed backport verification audit. label May 17, 2026
This commit fixes a critical cache key collision bug in the ISA erasure
code plugin that could lead to heap-buffer-overflow and silent data
corruption.

Problem:
--------
The decoding table cache was indexed only by matrix type and erasure
signature (available/missing chunk pattern), but did NOT include the
(k,m) erasure code configuration parameters. This caused different EC
configurations with similar erasure patterns to collide in the cache,
leading to incorrectly-sized cached buffers being reused.

AddressSanitizer Report:
------------------------
==4904==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x5160001397b8 at pc 0x5de8e415296b bp 0x7ffc82260310 sp 0x7ffc8225fad0
READ of size 576 at 0x5160001397b8 thread T0
    #0 __asan_memcpy
    #1 ErasureCodeIsaTableCache::getDecodingTableFromCache()
       .../ErasureCodeIsaTableCache.cc:260:5
    #2 ErasureCodeIsaDefault::isa_decode()
       .../ErasureCodeIsa.cc:490:15

0x5160001397b8 is located 0 bytes after 568-byte region
[0x516000139580,0x5160001397b8) allocated by:
    #0 posix_memalign
    #1 ceph::buffer::raw_combined::alloc_data_n_controlblock()
    #2 ErasureCodeIsaTableCache::putDecodingTableToCache()
       .../ErasureCodeIsaTableCache.cc:319:18

Root Cause:
-----------
Scenario illustrating the bug:
1. First decode operation: k=2, m=1, erasure pattern "+0+2-1"
   - Creates cache entry with key "+0+2-1"
   - Allocates buffer: 2*(1+2)*32 = 192 bytes
2. Second decode operation: k=3, m=3, same erasure pattern "+0+2-1"
   - Looks up cache with key "+0+2-1" → COLLISION
   - Retrieves 192-byte buffer but needs 3*(3+3)*32 = 576 bytes
   - Result: Heap-buffer-overflow (reads 384 bytes beyond allocation)

Worse scenario (silent corruption):
1. First decode: k=3, m=3 → caches 576-byte table
2. Second decode: k=2, m=1 → retrieves wrong table
   - Uses incorrect decoding matrix
   - Result: Silent data corruption during recovery

Solution:
---------
Include k and m parameters in cache signature
 - Old format: "+0+2+3-1-4"
 - New format: "k3m2a+0+2+3e-1-4"

Test Fix:
---------
Also fixes a buffer overflow in TestErasureCodePlugins.cc where
hashes_bl offset was calculated using chunk_size instead of sizeof(uint32_t),
causing reads beyond the CRC buffer.

Production Impact:
------------------

Backward Compatibility: FULLY COMPATIBLE
- Cache is ephemeral (in-memory only, not persisted)
- Cache cleared on process restart
- Rolling upgrades safe - each OSD restart gets fixed code
- Old cache entries automatically invalidated on upgrade
- No wire protocol or on-disk format changes
- No configuration changes required
- No breaking changes

Data Integrity:
- Eliminates silent data corruption risk
- Eliminates heap-buffer-overflow crashes
- Cache now correctly isolated by (k,m) configuration
- Correct decoding tables always used for recovery
- No risk of corrupting user data from the fix itself

Why Users Haven't Complained:
------------------------------

Several factors likely prevented widespread reports:

1. Low probability conditions required:
   - Need multiple EC pools with DIFFERENT (k,m) configurations
   - Need similar erasure patterns across pools
   - Need cache collision to occur during actual recovery operations
   - Recovery operations are relatively rare in healthy clusters

2. Crash vs silent corruption detection:
   - Buffer overflows (easier to detect) occur when k2,m2 > k1,m1
   - Silent corruption (harder to detect) occurs when k2,m2 < k1,m1
   - Crashes might be attributed to other causes
   - Data corruption only detected during scrub or data verification

3. Common deployment patterns:
   - Many deployments use single EC configuration cluster-wide
   - Default EC configurations (k=2,m=1 or k=4,m=2) reduce collision space
   - Erasure pattern variety may be insufficient for collisions

4. ISA plugin usage:
   - Not universally deployed (requires Intel ISA-L library)
   - Some sites use jerasure plugin instead
   - Plugin selection depends on hardware and configuration

5. Detection difficulty:
   - ASan not enabled in production builds
   - Silent corruption only appears during:
     * Degraded reads with recovery
     * Scrub operations
     * Deep-scrub verification
   - Corrupted data might not be immediately accessed

Fixes: https://tracker.ceph.com/issues/74382

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
(cherry picked from commit 7e66871)

Conflicts:
        src/test/erasure-code/TestErasureCodePlugins.cc
        - dropped the change in test, as the updated test is not inclued
        in tentacle.
@tchaikov tchaikov force-pushed the wip-74943-tentacle branch from 8800f99 to bbd5af6 Compare May 17, 2026 04:29
@tchaikov tchaikov removed the releng-audit-fail Release engineering: failed backport verification audit. label May 17, 2026
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Backport Parity Review - Backport Deviation Alert

A conflict or unapproved deviation was detected during the simulation of this backport. The code in this PR does not match a clean cherry-pick of the upstream commits.

Deviation in Backport bbd5af64 (cherry-pick of 7e668715)

Affected File(s)

Range Diff

Click to expand
--- Original (7e668715)
+++ Backport (bbd5af64)
@@ -116,3 +116,9 @@
 Fixes: https://tracker.ceph.com/issues/74382
 
 Signed-off-by: Kefu Chai <k.chai@proxmox.com>
+(cherry picked from commit 7e668715b0eb98b3f9ae9f520f6b54fcf908d9ee)
+
+Conflicts:
+        src/test/erasure-code/TestErasureCodePlugins.cc
+        - dropped the change in test, as the updated test is not inclued
+        in tentacle.

================================================================================
RANGE DIFF
================================================================================

1:  7e668715b0e ! 1:  bbd5af64724 erasure-code/isa: fix cache collision causing buffer overflow
    @@ Commit message
         Fixes: https://tracker.ceph.com/issues/74382
     
         Signed-off-by: Kefu Chai <k.chai@proxmox.com>
    +    (cherry picked from commit 7e668715b0eb98b3f9ae9f520f6b54fcf908d9ee)
    +
    +    Conflicts:
    +            src/test/erasure-code/TestErasureCodePlugins.cc
    +            - dropped the change in test, as the updated test is not inclued
    +            in tentacle.
     
      ## src/erasure-code/isa/ErasureCodeIsa.cc ##
     @@ src/erasure-code/isa/ErasureCodeIsa.cc: ErasureCodeIsaDefault::isa_decode(int *erasures,
    @@ src/erasure-code/isa/ErasureCodeIsaTableCache.cc: ErasureCodeIsaTableCache::putD
        dout(12) << "[ put table    ] = " << signature << dendl;
     
      ## src/test/erasure-code/TestErasureCodePlugins.cc ##
    -@@ src/test/erasure-code/TestErasureCodePlugins.cc: TEST_P(PluginTest, CRCEncodeDecodeSupport) {
    -       uint32_t decoded_crc =
    -           read_crc_from_bufferlist(out_bls[missing_shard_id]);
    -       uint32_t original_crc =
    --          read_crc_from_bufferlist(hashes_bl, missing_shard_id.id * chunk_size);
    -+          read_crc_from_bufferlist(hashes_bl, missing_raw_shard_id.id * chunk_size);
    - 
    -       different = different | (decoded_crc != original_crc);
    -     }
    +@@ src/test/erasure-code/TestErasureCodePlugins.cc: INSTANTIATE_TEST_SUITE_P(
    +  *   valgrind --tool=memcheck ./unittest_erasure_code_plugins \
    +  *      --gtest_filter=*.* --log-to-stderr=true --debug-osd=20"
    +  * End:
    +- */
    + \ No newline at end of file
    ++ */

How to proceed:

  • Authors (Genuine Conflicts): If this is a genuine conflict requiring manual resolution, ensure your resolution is correct. You must explain the conflict resolution in the commit message (e.g., leave the standard Git Conflicts: block intact) and include an explanation for changes.
  • Authors (Need Help?): Reach out to the Component Lead for technical guidance on complex code conflicts.
  • Component Leads (Review): Please review the Range Diff(s) above to verify the author's manual conflict resolution is correct for this release branch. If the deviation is intentional, documented, and approved then the component lead or @ceph/ceph-release-manager can bypass this check by commenting /audit override.

Be familiar with the rules and guidelines for writing backports.


Commit Parity Visualizer

BACKPORT PR #68956 SOURCE PR SOURCE STATUS
bbd5af6 erasure-code/isa: fix cache collision causing buffer overflow PR #66894 7e66871 erasure-code/isa: fix cache collision causing buffer overflow

🛟 Need Help?

If you need technical help resolving these issues, please consult with the Component Lead. If you need administrative overrides, please see the #ceph-upstream-releases channel on Slack and request a review from the @ceph/ceph-release-manager.

📋 Component Lead / Release Manager

To override the audit failure, apply releng-audit-override label or comment /audit override.


⚠️ Note: Automated audit checks will be suspended on future pushes to prevent comment spam while you work.

When you are ready for a new audit, please remove the releng-audit-fail label or comment /audit retest.

CI Run Log: View Workflow Details

@github-actions github-actions Bot added the releng-audit-fail Release engineering: failed backport verification audit. label May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core releng-audit-fail Release engineering: failed backport verification audit. tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant