Replace platform-specific SIMD with portable wide crate#127
Open
lilith wants to merge 3 commits intoImageOptim:mainfrom
Open
Replace platform-specific SIMD with portable wide crate#127lilith wants to merge 3 commits intoImageOptim:mainfrom
lilith wants to merge 3 commits intoImageOptim:mainfrom
Conversation
Replace hand-written SSE2 (x86_64) and NEON (aarch64) intrinsics in f_pixel::diff() with safe, portable SIMD using the wide crate's f32x4. - Direct bytemuck cast from ARGBF to f32x4 (both are Pod) - Single implementation works on all platforms (x86, ARM, WASM, etc.) - Includes scalar reference and brute-force comparison test - ~70 lines removed, eliminates all unsafe in this function
The union was used to reinterpret the same memory as either mc_sort_value (u32) during median cut sorting, or likely_palette_index (PalIndex) after. Since PalIndex fits in u32, we can simply store u32 and cast on read. This eliminates unsafe union field access with no runtime cost.
This sorts 3 pivot indices and accesses them 3 times total. The bounds check overhead is negligible compared to the partitioning work that follows. Safe indexing is simpler and equally fast.
Author
Follow-up: MaybeUninit removal feasibilityI've prototyped making Benchmark results (512×512 image, 3 runs each)
The difference is within noise (~1-3%). The zero-initialization cost is negligible compared to the actual quantization work. Changes required
Would you be interested in this approach for the main repo? It would allow |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
f_pixel::diff()with portablewide::f32x4HistSortTmpunion in favor of plainu32(union was unnecessary sincePalIndexfits inu32)get_uncheckedinqsort_pivot(3 bounds checks is negligible overhead)Details
The
diff()function had three separate implementations using unsafe intrinsics:#[cfg(target_arch = "x86_64")]with SSE2#[cfg(target_arch = "aarch64")]with NEONNow there's a single portable implementation using
wide::f32x4that compiles to equivalent SIMD on both architectures. A scalar reference implementation is retained under#[cfg(test)]with a brute-force comparison test that verifies both implementations match across edge cases.Test plan
diff_simd_matches_scalartest verifies SIMD matches scalar referencecargo clippypasses