
Add AVX-512 FP16 implementation of halfvec distance functions#531

Closed
lucagiac81 wants to merge 5 commits into pgvector:master from intel-staging:halfvec_avx512_fp16

Conversation


@lucagiac81 commented Apr 24, 2024

This PR adds implementations of halfvec distance functions based on the AVX-512 FP16 instruction set, which was introduced with 4th Gen Intel® Xeon® Scalable processors. It supports 32 FP16 operations per instruction using 512-bit registers.

Compiler support for the new instructions was added in gcc-12 and clang-14. Those versions are minimum requirements for the AVX-512 FP16 functions to be compiled (controlled by conditional compilation). Support for the instruction set is also detected at runtime using CPUID. If not supported, the existing default or F16c functions are used.

Building was tested with

  • gcc-11/clang-13 (no AVX-512 FP16 support)
  • gcc-12/gcc-13/clang-14 (with AVX-512 FP16 support)

Execution of a binary compiled with gcc-12 (which includes the AVX-512 FP16 functions) was tested on

  • 4th Gen Intel® Xeon® Scalable processor (with AVX-512 FP16 support): AVX-512 FP16 functions are used
  • 3rd Gen Intel® Xeon® Scalable processor (no AVX-512 FP16 support): existing F16c functions are used


jkatz commented Apr 24, 2024

@nathan-bossart Would love your feedback on this one.

@nathan-bossart left a comment

> Performance results will be shared soon.

Looking forward to these!

src/halfutils.c Outdated
Comment on lines 331 to 719
#ifdef HAVE_AVX512FP16
TARGET_XSAVE static bool
SupportsAvx512Fp16()
{
	unsigned int exx[4] = {0, 0, 0, 0};
	unsigned int feature = (1 << 23);

#if defined(HAVE__GET_CPUID)
	__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUID)
	__cpuid(exx, 7, 0);
#endif

	return (exx[3] & feature) == feature;
}
#endif
Contributor

I think this is missing a couple steps, such as checking for osxsave and verifying the ZMM registers are enabled. See SupportsAvx512Popcount() for an example.

Contributor Author (lucagiac81)

Thanks for the reference. I'll add those checks (OSXSAVE and XCR0 control register).

src/halfutils.c Outdated
Comment on lines 174 to 178
	for (; i < dim; i++)
		distance += HalfToFloat4(ax[i]) * HalfToFloat4(bx[i]);
Contributor

Is this auto-vectorized? (Same question for HalfvecL2SquaredDistanceAvx512Fp16().)

Contributor Author (lucagiac81)

I checked L2SquaredDistance and InnerProduct, and they use AVX scalar instructions, at least with gcc-12. We'll try masked vector instructions to handle the loop remainder.

Contributor Author (lucagiac81)

The latest update includes masked vector instructions for the loop remainder.
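
For background on the technique: with 32 FP16 lanes per 512-bit register, the trailing dim % 32 elements can be read with a masked load whose mask covers only the remainder lanes. A minimal Python sketch of the mask computation (hypothetical helper for illustration only; the actual PR uses AVX-512 masked load intrinsics in C):

```python
def masked_chunks(dim, width=32):
    """Yield (offset, lane_mask) pairs covering dim elements in width-lane chunks.

    The final mask has a 1 bit only for the remainder lanes, mirroring how an
    AVX-512 masked (maskz-style) load would read a partial vector.
    """
    for off in range(0, dim, width):
        rem = min(width, dim - off)
        yield off, (1 << rem) - 1
```

For dim = 70 with 32 lanes this yields full masks for the first two chunks and a 6-bit mask for the 6-element tail, so no scalar remainder loop is needed.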

src/halfutils.c Outdated
Comment on lines 367 to 374
#ifdef HAVE_AVX512FP16
	if (SupportsAvx512Fp16())
	{
		HalfvecL2SquaredDistance = HalfvecL2SquaredDistanceAvx512Fp16;
		HalfvecInnerProduct = HalfvecInnerProductAvx512Fp16;
		HalfvecCosineSimilarity = HalfvecCosineSimilarityAvx512Fp16;
	}
#endif
Contributor

nitpick: This might not need to be nested in the HALFVEC_DISPATCH block.

Contributor Author (lucagiac81)

You're right. Currently, it is taking advantage of the OSXSAVE check included with the other features, but I'll separate that.


jkatz commented Apr 24, 2024

I'll kick off some local benchmark runs to see the diffs. I have an r7i at the ready.


jkatz commented Apr 24, 2024

@lucagiac81 I'm having issues compiling on an EC2 r7i. This is using gcc12 and clang-15. Here is some truncated output:

/usr/bin/clang-15 -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -Xclang -no-opaque-pointers -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -O2  -I. -I./ -I/usr/include/postgresql/16/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2  -flto=thin -emit-llvm -c -o src/halfutils.bc src/halfutils.c
src/halfutils.c:92:9: error: expected ';' after expression
        __m512h         dist = _mm512_setzero_ph();
               ^
               ;
src/halfutils.c:92:2: error: use of undeclared identifier '__m512h'
        __m512h         dist = _mm512_setzero_ph();
        ^
src/halfutils.c:92:11: error: use of undeclared identifier 'dist'
        __m512h         dist = _mm512_setzero_ph();
                        ^
src/halfutils.c:92:18: warning: call to undeclared function '_mm512_setzero_ph'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
        __m512h         dist = _mm512_setzero_ph();
                               ^
src/halfutils.c:95:10: error: expected ';' after expression
                __m512h axi = _mm512_loadu_ph(ax+i);
                       ^
                       ;
src/halfutils.c:95:3: error: use of undeclared identifier '__m512h'
                __m512h axi = _mm512_loadu_ph(ax+i);
                ^
src/halfutils.c:95:11: error: use of undeclared identifier 'axi'; did you mean 'ax'?
                __m512h axi = _mm512_loadu_ph(ax+i);
                        ^~~
                        ax
src/halfutils.c:87:52: note: 'ax' declared here
HalfvecL2SquaredDistanceAvx512Fp16(int dim, half * ax, half * bx)


lucagiac81 commented Apr 25, 2024

@jkatz I think clang is not applying __attribute__((target("avx512fp16"))).

I tested on an m7i instance (where -march=native includes -mavx512fp16) with clang-15

  • Build pgvector with the default Makefile (which has -march=native in OPTFLAGS): no error
  • Remove -march=native from Makefile: compilation errors (similar to your report)
  • Replace -march=native with -mavx512fp16 in Makefile: no error

With gcc-12.3, I got no errors in all cases.

Can you try adding -march=native or -mavx512fp16 to your flags as a temporary solution?

@lucagiac81 force-pushed the halfvec_avx512_fp16 branch from e390649 to 85ba2dc on April 25, 2024 at 19:53
@lucagiac81

  • Rebased on latest master
  • Added checks for OSXSAVE and ZMM registers enabled
  • Added L1 distance AVX-512 FP16 implementation

@lucagiac81 marked this pull request as ready for review on April 25, 2024 at 19:54
src/halfutils.c Outdated
SupportsAvx512Fp16()
{
unsigned int exx[4] = {0, 0, 0, 0};
unsigned int feature = (1 << 23);
@akashsha1 commented Apr 26, 2024

nit: feature can be defined using #define CPU_FEATURE_AVX512FP16

src/halfutils.c Outdated
__cpuid(exx, 7, 0);
#endif

/* Check OS supports XSAVE */


nit: update the comment to reflect OSXSAVE

src/halfutils.c Outdated
return false;

/* Check XMM, YMM, and ZMM registers are enabled */
if ((_xgetbv(0) & 0xe6) != 0xe6)
@akashsha1 commented Apr 27, 2024

@nathan-bossart shouldn't this be (_xgetbv(0) & 0xe6) == 0xe6? Similar comment on L187 in bitutils.c per the discussion [0]

[0] : https://www.postgresql.org/message-id/20240418210158.GA3776258%40nathanxps13

Contributor

This looks alright as-is to me. If this check fails, we return false, so != looks correct.
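
As background for the 0xe6 value: it is the OR of the XCR0 state-component bits that the OS must enable before AVX-512 instructions can safely be used (bit meanings per the Intel SDM). A quick Python check of the bit arithmetic (illustration only; the PR reads XCR0 via _xgetbv in C):

```python
# XCR0 state-component bits that must be OS-enabled for AVX-512 (Intel SDM)
XCR0_SSE       = 1 << 1  # XMM registers
XCR0_AVX       = 1 << 2  # upper halves of YMM registers
XCR0_OPMASK    = 1 << 5  # AVX-512 k0-k7 mask registers
XCR0_ZMM_HI256 = 1 << 6  # upper halves of ZMM0-ZMM15
XCR0_HI16_ZMM  = 1 << 7  # ZMM16-ZMM31

REQUIRED = XCR0_SSE | XCR0_AVX | XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM
assert REQUIRED == 0xe6

# "(_xgetbv(0) & 0xe6) != 0xe6 -> return false" is the same predicate as
# requiring all of these bits set before proceeding:
xcr0 = 0b11100110  # example value as returned by _xgetbv(0)
assert (xcr0 & REQUIRED) == REQUIRED
```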

@lucagiac81 force-pushed the halfvec_avx512_fp16 branch from 85ba2dc to 915d6eb on May 14, 2024 at 03:08
@lucagiac81

While collecting data with ANN benchmarks, we noticed a degradation in recall for some datasets (such as sift-128) when computing distances in half precision. Other datasets (such as gist-960) are not affected, and recall is matched to the existing distance functions. The existing functions (*F16c) first convert halfvec elements to single precision and execute the distance computation in single precision.

So, enabling the FP16 distance functions may not be desirable in all cases. The latest update to the PR provides two implementations of the distance functions with AVX-512: one using single precision and one using half precision.

  • The single-precision functions are used by default, as they're the most generically applicable.
  • The user can decide to use the half-precision functions by setting a variable (halfvec.use_fp16_compute). The decision should be based on recall vs performance tradeoff for a specific dataset (this implementation handles 2x the vector elements per iteration compared to the single-precision one).
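
To illustrate the recall concern (a NumPy sketch with made-up data, not the PR's code): FP16 has a maximum value of about 65504 and an 11-bit significand, so accumulating a squared distance or inner product over many dimensions in half precision can overflow or lose low-order contributions:

```python
import numpy as np

# FP16 max is ~65504; accumulating a squared distance over many
# dimensions in half precision can overflow to infinity
a = np.full(1024, 12.0, dtype=np.float16)   # hypothetical 1024-dim halfvec

s16 = np.sum(a * a, dtype=np.float16)       # half-precision accumulator
s32 = np.sum(a.astype(np.float32) ** 2)     # single-precision accumulator

assert np.isinf(s16)      # overflows: 1024 * 144 = 147456 > 65504
assert s32 == 147456.0    # exact in single precision
```

The *F16c functions avoid this by converting each element to FP32 before accumulating, which is why they don't show the degradation.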


jkatz commented May 14, 2024

@lucagiac81 Thanks for the continued work. Per @nathan-bossart's comment earlier, it'd be helpful to see the actual performance results.

I'll try to get this to build again - last I checked I didn't have avx512fp16 available on my instance class.

@lucagiac81

Here are some initial results

  • We made a few changes to ANN benchmarks to support halfvec (following your post) and used the default conditions.
  • The tests were run on an m7i.metal-24xl instance. pgvector is compiled with gcc-12.
  • We measure query performance for the AVX-512 half-precision and single-precision distance implementations, and compare with the existing F16c implementation. To enable the half-precision functions, we use SET halfvec.use_fp16_compute = true in set_query_arguments, as described in the previous comment.

With the gist-960-euclidean dataset, so far we observe

  • qps increase of 9.6%-12.9% for half-precision and 1.4%-3.6% for single-precision
  • p99 reduction of 8.3%-12.8% for half-precision and 0-4% for single-precision
  • recall is matched to within +/-1% for both implementations

It'd be great if you could reproduce these numbers with your setup. Please let me know if you still run into compilation issues. We'd also like to collect data with dbpedia-openai-1000k-angular (higher dimensions, different distance metric), but we're running into a 403 error when downloading the dataset (similar to this report). Do you have any advice on how to run with that dataset?


ankane commented Sep 24, 2024

Hi @lucagiac81, thanks for the PR, and sorry for the delay. Based on the numbers above, I'm not sure the benefit justifies the complexity.

For the dbpedia-openai-1000k-angular dataset, there's a create_dataset.py script in ann-benchmarks, fwiw.

@lucagiac81

Thanks @ankane. The issue with the dbpedia dataset was the inability to create the dataset locally. After updating the datasets package to v2.19.1, the issue is resolved. We'll share the results with that dataset as well.

Regarding the complexity, is the additional parameter to select the precision of the distance computation the main concern?


ankane commented Sep 25, 2024

Sounds good.

The parameter doubles my concern, but there's still a lot of complexity without it. I'm not sure either choice is great based on the numbers above: the single-precision version provides little performance benefit, while the half-precision version provides some benefit but reduces precision and range.

For comparison, here are the f16c + fma numbers: #311 (comment).

@lucagiac81

The latest update eliminates the need to manually enable FP16 computation. Computation starts with FP16 and switches to FP32 in case of overflow.
With this approach, the recall degradation we observed with FP16 for certain datasets (sift-128-euclidean) is resolved. Datasets that performed well with FP16 computation (gist-960-euclidean) show similar performance as with FP16 enabled in the previous version of the code.
We will share a more detailed report, but this is a step towards reducing complexity (especially not exposing it to the end user).
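
Conceptually, the fallback strategy can be modeled like this (a NumPy sketch under stated assumptions, not the PR's C implementation, which uses AVX-512 FP16 intrinsics and its own overflow check):

```python
import numpy as np

def l2_squared_with_fallback(ax, bx):
    """Model of the strategy: compute in FP16 first; on overflow, redo in FP32."""
    d16 = ax.astype(np.float16) - bx.astype(np.float16)
    dist = np.sum(d16 * d16, dtype=np.float16)
    if not np.isfinite(dist):                      # FP16 overflowed to +inf
        d32 = ax.astype(np.float32) - bx.astype(np.float32)
        dist = np.sum(d32 * d32)                   # single-precision fallback
    return float(dist)
```

Small distances stay on the fast FP16 path; only vectors whose intermediate values exceed the FP16 range pay for the second pass.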

@lucagiac81

lucagiac81 commented Nov 12, 2024

The latest update introduces an AVX512_FP16 implementation of vector_to_halfvec conversion (in separate commit), as that function has a noticeable contribution in VectorDBBench benchmarks. It is also rebased on pgvector v0.8.0. Sharing additional performance measurements with ANN-Benchmarks and VectorDBBench below.

For ANN-Benchmarks, we used a similar setup as for the previously shared data. With the latest changes, the manual selection of FP16/FP32 computation is removed. We also include measurements for index build time (with 8 parallel workers). Performance gains are relative to the existing F16c implementation.
For dbpedia-openai-1000k-angular dataset, we observe

  • qps gain of 9.4%-14.2%
  • p99 reduction of 9.0%-12.9%
  • index build time reduction of 3.8% (m=16) and 11.6% (m=24)
  • recall matched within +/-0.3%

For sift-128-euclidean, we observe 11-12% index build time reduction with recall matched within +/-0.1%, but no significant qps/p99 gain. This confirms that the recall degradation previously observed with FP16 computation is resolved and there are gains for certain metrics even at lower dimensions.

For VectorDBBench, we focused on larger datasets and varying search concurrency (1-40 range on an r7i.12xlarge instance). Below are initial results with two datasets:

  • Performance1536D5M (OpenAI, 5M vectors, 1536 dimensions): 4-8% qps gain
  • Performance768D10M (Cohere, 10M vectors, 768 dimensions): 5-6% qps gain

@lucagiac81

The last update fixes some issues reported by CI:

  • Remove unnecessary variable initialization
  • Add AVX512DQ check (required by one instruction)
  • Increase clang minimum version requirement from 14 to 16. Version 16 solves the issue reported in a previous comment. I'll keep checking for a solution with earlier versions.


greenhal commented Jan 6, 2025

@lucagiac81

I tested the latest revision a1e3ead using vectordbbench on a r7i.8xlarge instance and achieved qps improvements ranging from 7% to 23%, depending on the dataset.

| Dataset | qps | recall | qps with AVX-512 | qps improvement | recall with AVX-512 | recall impact |
|---|---|---|---|---|---|---|
| Performance1536D50K | 10644.763 | 0.9691 | 12594.4903 | 18.32% | 0.9626 | -0.67% |
| Performance1536D5M | 6810.1588 | 0.9338 | 7332.7366 | 7.67% | 0.927 | -0.73% |
| Performance768D10M | 7916.9702 | 0.8975 | 9799.3695 | 23.78% | 0.8949 | -0.29% |
| Performance768D1M | 9847.2079 | 0.926 | 11043.1437 | 12.14% | 0.9262 | 0.02% |

Tests were performed using:

  • ef_construction 128
  • ef_search 128
  • k 100
  • full vector stored in the table
  • Amazon linux2
  • gcc 12.4.0
  • binutils 2.39

Note: gcc >= 12 and binutils >= 2.38 are required for this change to have any effect. Upgrading gcc without also upgrading binutils can cause pgvector's make to fail.

@akashsha1


@greenhal thanks for sharing these. These are great improvements.
@jkatz, @nathan-bossart, @ankane - a 23% improvement seems like a solid gain on AWS instances to consider. The PR has also been simplified to take advantage of the new AVX-512 instructions automatically. Would you be able to provide feedback here and help merge this change?

@lucagiac81 force-pushed the halfvec_avx512_fp16 branch from a1e3ead to bfaa33a on April 10, 2025 at 20:00
@lucagiac81

We'd like to propose a refactoring to make platform-specific optimizations easier to maintain.

In the latest update, AVX-512 functions are moved to separate files. The functions are also included by conditional compilation (enabled by default). In this way, AVX-512 implementations are all in one place, and they can be easily disabled in the build if desired.
AVX-512 feature detection is currently in the same files. In the future, we could also move detection to a common location if duplication becomes a concern.

This approach is easily extended to future contributions and different architectures. This will allow pgvector to benefit from targeted optimizations while keeping the "core" code uncluttered. It will also clarify where particular expertise is needed for maintenance and improvements.

@ankane do you see this as a viable approach to integrate these optimizations? Please let us know your thoughts.


ankane commented Apr 10, 2025

Hi @lucagiac81, thanks again for the PR, but I don't think it's a good fit for pgvector. Changing the accumulation precision can reduce recall (as @greenhal's tests show), and there is already a separate lever for trading recall for speed (ef_search).

@ankane ankane closed this Apr 10, 2025