tentacle: erasure-code/isa: fix cache collision causing buffer overflow#68956
tentacle: erasure-code/isa: fix cache collision causing buffer overflow#68956tchaikov wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Automated Backport Parity Review - Backport Deviation Alert
A conflict or unapproved deviation was detected during the simulation of this backport. The code in this PR does not match a clean cherry-pick of the upstream commits.
Deviation in Backport 8800f99b (cherry-pick of 7e668715)
Affected File(s)
src/test/erasure-code/TestErasureCodePlugins.cc
Range Diff
Click to expand
--- Original (7e668715)
+++ Backport (8800f99b)
@@ -116,3 +116,4 @@
Fixes: https://tracker.ceph.com/issues/74382
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
+(cherry picked from commit 7e668715b0eb98b3f9ae9f520f6b54fcf908d9ee)
================================================================================
RANGE DIFF
================================================================================
1: 7e668715b0e ! 1: 8800f99b7a2 erasure-code/isa: fix cache collision causing buffer overflow
@@ Commit message
Fixes: https://tracker.ceph.com/issues/74382
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
+ (cherry picked from commit 7e668715b0eb98b3f9ae9f520f6b54fcf908d9ee)
## src/erasure-code/isa/ErasureCodeIsa.cc ##
@@ src/erasure-code/isa/ErasureCodeIsa.cc: ErasureCodeIsaDefault::isa_decode(int *erasures,
@@ src/erasure-code/isa/ErasureCodeIsaTableCache.cc: ErasureCodeIsaTableCache::putD
dout(12) << "[ put table ] = " << signature << dendl;
## src/test/erasure-code/TestErasureCodePlugins.cc ##
-@@ src/test/erasure-code/TestErasureCodePlugins.cc: TEST_P(PluginTest, CRCEncodeDecodeSupport) {
- uint32_t decoded_crc =
- read_crc_from_bufferlist(out_bls[missing_shard_id]);
- uint32_t original_crc =
-- read_crc_from_bufferlist(hashes_bl, missing_shard_id.id * chunk_size);
-+ read_crc_from_bufferlist(hashes_bl, missing_raw_shard_id.id * chunk_size);
-
- different = different | (decoded_crc != original_crc);
- }
+@@ src/test/erasure-code/TestErasureCodePlugins.cc: INSTANTIATE_TEST_SUITE_P(
+ * valgrind --tool=memcheck ./unittest_erasure_code_plugins \
+ * --gtest_filter=*.* --log-to-stderr=true --debug-osd=20"
+ * End:
+- */
+ \ No newline at end of file
++ */How to proceed:
- Authors (Genuine Conflicts): If this is a genuine conflict requiring manual resolution, ensure your resolution is correct. You must explain the conflict resolution in the commit message (e.g., leave the standard Git
Conflicts:block intact) and include an explanation for changes. - Authors (Need Help?): Reach out to the Component Lead for technical guidance on complex code conflicts.
- Component Leads (Review): Please review the Range Diff(s) above to verify the author's manual conflict resolution is correct for this release branch. If the deviation is intentional, documented, and approved then the component lead or @ceph/ceph-release-manager can bypass this check by commenting
/audit override.
Be familiar with the rules and guidelines for writing backports.
Commit Parity Visualizer
| BACKPORT PR #68956 | SOURCE PR | SOURCE STATUS |
|---|---|---|
| 8800f99 erasure-code/isa: fix cache collision causing buffer overflow | PR #66894 | 7e66871 erasure-code/isa: fix cache collision causing buffer overflow |
🛟 Need Help?
If you need technical help resolving these issues, please consult with the Component Lead. If you need administrative overrides, please see the #ceph-upstream-releases channel on Slack and request a review from the @ceph/ceph-release-manager.
📋 Component Lead / Release Manager
To override the audit failure, apply releng-audit-override label or comment /audit override.
When you are ready for a new audit, please remove the releng-audit-fail label or comment /audit retest.
CI Run Log: View Workflow Details
This commit fixes a critical cache key collision bug in the ISA erasure
code plugin that could lead to heap-buffer-overflow and silent data
corruption.
Problem:
--------
The decoding table cache was indexed only by matrix type and erasure
signature (available/missing chunk pattern), but did NOT include the
(k,m) erasure code configuration parameters. This caused different EC
configurations with similar erasure patterns to collide in the cache,
leading to incorrectly-sized cached buffers being reused.
AddressSanitizer Report:
------------------------
==4904==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x5160001397b8 at pc 0x5de8e415296b bp 0x7ffc82260310 sp 0x7ffc8225fad0
READ of size 576 at 0x5160001397b8 thread T0
#0 __asan_memcpy
#1 ErasureCodeIsaTableCache::getDecodingTableFromCache()
.../ErasureCodeIsaTableCache.cc:260:5
#2 ErasureCodeIsaDefault::isa_decode()
.../ErasureCodeIsa.cc:490:15
0x5160001397b8 is located 0 bytes after 568-byte region
[0x516000139580,0x5160001397b8) allocated by:
#0 posix_memalign
#1 ceph::buffer::raw_combined::alloc_data_n_controlblock()
#2 ErasureCodeIsaTableCache::putDecodingTableToCache()
.../ErasureCodeIsaTableCache.cc:319:18
Root Cause:
-----------
Scenario illustrating the bug:
1. First decode operation: k=2, m=1, erasure pattern "+0+2-1"
- Creates cache entry with key "+0+2-1"
- Allocates buffer: 2*(1+2)*32 = 192 bytes
2. Second decode operation: k=3, m=3, same erasure pattern "+0+2-1"
- Looks up cache with key "+0+2-1" → COLLISION
- Retrieves 192-byte buffer but needs 3*(3+3)*32 = 576 bytes
- Result: Heap-buffer-overflow (reads 384 bytes beyond allocation)
Worse scenario (silent corruption):
1. First decode: k=3, m=3 → caches 576-byte table
2. Second decode: k=2, m=1 → retrieves wrong table
- Uses incorrect decoding matrix
- Result: Silent data corruption during recovery
Solution:
---------
Include k and m parameters in cache signature
- Old format: "+0+2+3-1-4"
- New format: "k3m2a+0+2+3e-1-4"
Test Fix:
---------
Also fixes a buffer overflow in TestErasureCodePlugins.cc where
hashes_bl offset was calculated using chunk_size instead of sizeof(uint32_t),
causing reads beyond the CRC buffer.
Production Impact:
------------------
Backward Compatibility: FULLY COMPATIBLE
- Cache is ephemeral (in-memory only, not persisted)
- Cache cleared on process restart
- Rolling upgrades safe - each OSD restart gets fixed code
- Old cache entries automatically invalidated on upgrade
- No wire protocol or on-disk format changes
- No configuration changes required
- No breaking changes
Data Integrity:
- Eliminates silent data corruption risk
- Eliminates heap-buffer-overflow crashes
- Cache now correctly isolated by (k,m) configuration
- Correct decoding tables always used for recovery
- No risk of corrupting user data from the fix itself
Why Users Haven't Complained:
------------------------------
Several factors likely prevented widespread reports:
1. Low probability conditions required:
- Need multiple EC pools with DIFFERENT (k,m) configurations
- Need similar erasure patterns across pools
- Need cache collision to occur during actual recovery operations
- Recovery operations are relatively rare in healthy clusters
2. Crash vs silent corruption detection:
- Buffer overflows (easier to detect) occur when k2,m2 > k1,m1
- Silent corruption (harder to detect) occurs when k2,m2 < k1,m1
- Crashes might be attributed to other causes
- Data corruption only detected during scrub or data verification
3. Common deployment patterns:
- Many deployments use single EC configuration cluster-wide
- Default EC configurations (k=2,m=1 or k=4,m=2) reduce collision space
- Erasure pattern variety may be insufficient for collisions
4. ISA plugin usage:
- Not universally deployed (requires Intel ISA-L library)
- Some sites use jerasure plugin instead
- Plugin selection depends on hardware and configuration
5. Detection difficulty:
- ASan not enabled in production builds
- Silent corruption only appears during:
* Degraded reads with recovery
* Scrub operations
* Deep-scrub verification
- Corrupted data might not be immediately accessed
Fixes: https://tracker.ceph.com/issues/74382
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
(cherry picked from commit 7e66871)
Conflicts:
src/test/erasure-code/TestErasureCodePlugins.cc
- dropped the change in test, as the updated test is not inclued
in tentacle.
8800f99 to
bbd5af6
Compare
There was a problem hiding this comment.
Automated Backport Parity Review - Backport Deviation Alert
A conflict or unapproved deviation was detected during the simulation of this backport. The code in this PR does not match a clean cherry-pick of the upstream commits.
Deviation in Backport bbd5af64 (cherry-pick of 7e668715)
Affected File(s)
src/test/erasure-code/TestErasureCodePlugins.cc
Range Diff
Click to expand
--- Original (7e668715)
+++ Backport (bbd5af64)
@@ -116,3 +116,9 @@
Fixes: https://tracker.ceph.com/issues/74382
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
+(cherry picked from commit 7e668715b0eb98b3f9ae9f520f6b54fcf908d9ee)
+
+Conflicts:
+ src/test/erasure-code/TestErasureCodePlugins.cc
+ - dropped the change in test, as the updated test is not inclued
+ in tentacle.
================================================================================
RANGE DIFF
================================================================================
1: 7e668715b0e ! 1: bbd5af64724 erasure-code/isa: fix cache collision causing buffer overflow
@@ Commit message
Fixes: https://tracker.ceph.com/issues/74382
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
+ (cherry picked from commit 7e668715b0eb98b3f9ae9f520f6b54fcf908d9ee)
+
+ Conflicts:
+ src/test/erasure-code/TestErasureCodePlugins.cc
+ - dropped the change in test, as the updated test is not inclued
+ in tentacle.
## src/erasure-code/isa/ErasureCodeIsa.cc ##
@@ src/erasure-code/isa/ErasureCodeIsa.cc: ErasureCodeIsaDefault::isa_decode(int *erasures,
@@ src/erasure-code/isa/ErasureCodeIsaTableCache.cc: ErasureCodeIsaTableCache::putD
dout(12) << "[ put table ] = " << signature << dendl;
## src/test/erasure-code/TestErasureCodePlugins.cc ##
-@@ src/test/erasure-code/TestErasureCodePlugins.cc: TEST_P(PluginTest, CRCEncodeDecodeSupport) {
- uint32_t decoded_crc =
- read_crc_from_bufferlist(out_bls[missing_shard_id]);
- uint32_t original_crc =
-- read_crc_from_bufferlist(hashes_bl, missing_shard_id.id * chunk_size);
-+ read_crc_from_bufferlist(hashes_bl, missing_raw_shard_id.id * chunk_size);
-
- different = different | (decoded_crc != original_crc);
- }
+@@ src/test/erasure-code/TestErasureCodePlugins.cc: INSTANTIATE_TEST_SUITE_P(
+ * valgrind --tool=memcheck ./unittest_erasure_code_plugins \
+ * --gtest_filter=*.* --log-to-stderr=true --debug-osd=20"
+ * End:
+- */
+ \ No newline at end of file
++ */How to proceed:
- Authors (Genuine Conflicts): If this is a genuine conflict requiring manual resolution, ensure your resolution is correct. You must explain the conflict resolution in the commit message (e.g., leave the standard Git
Conflicts:block intact) and include an explanation for changes. - Authors (Need Help?): Reach out to the Component Lead for technical guidance on complex code conflicts.
- Component Leads (Review): Please review the Range Diff(s) above to verify the author's manual conflict resolution is correct for this release branch. If the deviation is intentional, documented, and approved then the component lead or @ceph/ceph-release-manager can bypass this check by commenting
/audit override.
Be familiar with the rules and guidelines for writing backports.
Commit Parity Visualizer
| BACKPORT PR #68956 | SOURCE PR | SOURCE STATUS |
|---|---|---|
| bbd5af6 erasure-code/isa: fix cache collision causing buffer overflow | PR #66894 | 7e66871 erasure-code/isa: fix cache collision causing buffer overflow |
🛟 Need Help?
If you need technical help resolving these issues, please consult with the Component Lead. If you need administrative overrides, please see the #ceph-upstream-releases channel on Slack and request a review from the @ceph/ceph-release-manager.
📋 Component Lead / Release Manager
To override the audit failure, apply releng-audit-override label or comment /audit override.
When you are ready for a new audit, please remove the releng-audit-fail label or comment /audit retest.
CI Run Log: View Workflow Details
backport tracker: https://tracker.ceph.com/issues/74943
backport of #66894
parent tracker: https://tracker.ceph.com/issues/74382
this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh