Add AVX-512 FP16 implementation of halfvec distance functions #531
lucagiac81 wants to merge 5 commits into pgvector:master
Conversation
@nathan-bossart Would love your feedback on this one.
nathan-bossart left a comment

> Performance results will be shared soon.

Looking forward to these!
src/halfutils.c (outdated)

```c
#ifdef HAVE_AVX512FP16
TARGET_XSAVE static bool
SupportsAvx512Fp16()
{
	unsigned int exx[4] = {0, 0, 0, 0};
	unsigned int feature = (1 << 23);

#if defined(HAVE__GET_CPUID)
	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUID)
	__cpuid(exx, 7, 0);
#endif

	return (exx[3] & feature) == feature;
}
#endif
```
I think this is missing a couple steps, such as checking for osxsave and verifying the ZMM registers are enabled. See SupportsAvx512Popcount() for an example.
Thanks for the reference. I'll add those checks (OSXSAVE and XCR0 control register).
src/halfutils.c (outdated)

```c
	for (; i < dim; i++)
		distance += HalfToFloat4(ax[i]) * HalfToFloat4(bx[i]);
```
Is this auto-vectorized? (Same question for HalfvecL2SquaredDistanceAvx512Fp16().)
I checked L2SquaredDistance and InnerProduct, and the compiler uses AVX scalar instructions, at least with gcc-12. We'll try masked vector instructions to handle the loop remainder.
The latest update includes masked vector instructions for the loop remainder.
src/halfutils.c (outdated)

```c
#ifdef HAVE_AVX512FP16
	if (SupportsAvx512Fp16())
	{
		HalfvecL2SquaredDistance = HalfvecL2SquaredDistanceAvx512Fp16;
		HalfvecInnerProduct = HalfvecInnerProductAvx512Fp16;
		HalfvecCosineSimilarity = HalfvecCosineSimilarityAvx512Fp16;
	}
#endif
```
nitpick: This might not need to be nested in the HALFVEC_DISPATCH block.
You're right. Currently it takes advantage of the OSXSAVE check included with the other features, but I'll separate that.
I'll kick off some local benchmark runs to see the diffs. I have an r7i at the ready.
@lucagiac81 I'm having issues compiling on an EC2 r7i. This is using gcc-12 and clang-15. Here is some truncated output:
@jkatz I think clang is not applying […]. I tested on an m7i instance (where […]) with gcc-12.3, and I got no errors in all cases. Can you try adding […]?
(force-pushed from e390649 to 85ba2dc)
Rebased on latest master.
src/halfutils.c (outdated)

```c
SupportsAvx512Fp16()
{
	unsigned int exx[4] = {0, 0, 0, 0};
	unsigned int feature = (1 << 23);
```
nit: `feature` can be defined using `#define CPU_FEATURE_AVX512FP16`
src/halfutils.c (outdated)

```c
	__cpuid(exx, 7, 0);
#endif

	/* Check OS supports XSAVE */
```
nit: update the comment to reflect OSXSAVE
src/halfutils.c (outdated)

```c
		return false;

	/* Check XMM, YMM, and ZMM registers are enabled */
	if ((_xgetbv(0) & 0xe6) != 0xe6)
```
@nathan-bossart shouldn't this be `(_xgetbv(0) & 0xe6) == 0xe6`? Similar comment on L187 in bitutils.c per the discussion [0]
[0] : https://www.postgresql.org/message-id/20240418210158.GA3776258%40nathanxps13
This looks alright as-is to me. If this check fails, we return false, so != looks correct.
(force-pushed from 85ba2dc to 915d6eb)
While collecting data with ANN-Benchmarks, we noticed a degradation in recall for some datasets (such as sift-128) when computing distances in half precision. Other datasets (such as gist-960) are not affected, and recall matches the existing distance functions. The existing functions (*F16c) first convert halfvec elements to single precision and execute the distance computation in single precision. So, enabling the FP16 distance functions may not be desirable in all cases. The latest update to the PR provides two implementations of the distance functions with AVX-512: one using single precision and one using half precision.
@lucagiac81 Thanks for the continued work. Per @nathan-bossart's comment earlier, it'd be helpful to see the actual performance results. I'll try to get this to build again; last I checked I didn't have avx512fp16 available on my instance class.
Here are some initial results. With the gist-960-euclidean dataset, so far we observe […]

It'd be great if you could reproduce these numbers with your setup. Please let me know if you still run into compilation issues. We'd also like to collect data with dbpedia-openai-1000k-angular as well (higher dimensions, different distance metric), but we're running into a 403 error when downloading the dataset (similar to this report). Do you have any advice on how to run with that dataset?
Hi @lucagiac81, thanks for the PR, and sorry for the delay. Based on the numbers above, I'm not sure the benefit justifies the complexity. For the […]
Thanks @ankane. The issue with the dbpedia dataset was the inability to create it locally. After updating the datasets package to v2.19.1, the issue is resolved. We'll share the results with that dataset as well. Regarding the complexity, is the additional parameter to select the precision of the distance computation the main concern?
Sounds good. The parameter doubles my concern, but there's still a lot of complexity without it. I'm not sure either choice is great based on the numbers above: the single-precision version provides little performance benefit, while the half-precision version provides some benefit but reduces precision and range. For comparison, here are the f16c + fma numbers: #311 (comment).
(force-pushed from 915d6eb to cd84cd6)
The latest update eliminates the need to manually enable FP16 computation. Computation starts in FP16 and switches to FP32 in case of overflow.
(force-pushed from cd84cd6 to 30d408c)
(force-pushed from 30d408c to 4e3859a)
The latest update introduces an AVX-512 FP16 implementation of vector_to_halfvec conversion (in a separate commit), as that function has a noticeable contribution in VectorDBBench benchmarks. It is also rebased on pgvector v0.8.0. Sharing additional performance measurements with ANN-Benchmarks and VectorDBBench below. For ANN-Benchmarks, we used a setup similar to the previously shared data. With the latest changes, the manual selection of FP16/FP32 computation is removed. We also include measurements for index build time (with 8 parallel workers). Performance gains are relative to the existing F16c implementation.

For sift-128-euclidean, we observe an 11-12% index build time reduction with recall matched within +/-0.1%, but no significant qps/p99 gain. This confirms that the recall degradation previously observed with FP16 computation is resolved and that there are gains for certain metrics even at lower dimensions. For VectorDBBench, we focused on larger datasets and varying search concurrency (range 1-40 on an r7i.12xlarge instance). Below are initial results with two datasets:
(force-pushed from 4e3859a to 8c39ae2)
The last update fixes some issues reported by CI:
I tested the latest revision a1e3ead using VectorDBBench on an r7i.8xlarge instance and achieved qps improvements ranging from 7% to 23%, depending on the dataset.

Tests were performed using:

Note: gcc >= 12 and binutils >= 2.38 are required for this change to have any effect. Upgrading gcc without binutils can cause pgvector's make to fail.
@greenhal thanks for sharing these. These are great improvements.
(force-pushed from a1e3ead to bfaa33a)
We'd like to propose a refactoring to make platform-specific optimizations easier to maintain. In the latest update, AVX-512 functions are moved to separate files and included via conditional compilation (enabled by default). In this way, AVX-512 implementations are all in one place, and they can easily be disabled in the build if desired. This approach extends easily to future contributions and different architectures. It will allow pgvector to benefit from targeted optimizations while keeping the "core" code uncluttered, and it will clarify where particular expertise is needed for maintenance and improvements. @ankane do you see this as a viable approach to integrate these optimizations? Please let us know your thoughts.
Hi @lucagiac81, thanks again for the PR, but I don't think it's a good fit for pgvector. Changing the accumulation precision can reduce recall (as @greenhal's tests show), and there is already a separate lever for trading recall for speed ([…]).
This PR adds implementations of halfvec distance functions based on the AVX-512 FP16 instruction set. The instruction set was introduced with 4th Gen Intel® Xeon® Scalable processors and supports 32 FP16 operations per instruction with 512-bit registers.

Compiler support for the new instructions was added in gcc-12 and clang-14. Those versions are the minimum requirements for the AVX-512 FP16 functions to be compiled (controlled by conditional compilation). Support for the instruction set is also detected at runtime using CPUID. If it is not supported, the existing default or F16c functions are used.
Building was tested with
Execution of a binary compiled with gcc-12 (which includes the AVX-512 FP16 functions) was tested on