S32 support for AEC #8237

cujomalainey · 2023-09-20T22:06:00Z

No description provided.

replaced by out_fmt Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>

Move to float API which consumes deinterleaved samples, this is done in preparation for more formats. As a result the copy function is also abstracted. Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>

since we are using the float api we just convert everything to float and we are good Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>

lgirdwood

@singalsu @andrula-song pls review.

singalsu · 2023-09-26T07:42:11Z

src/audio/google/google_rtc_audio_processing.c

+#if CONFIG_FORMAT_S16LE
+static void render_s16(struct google_rtc_audio_processing_comp_data *cd)
+{
+	struct comp_buffer *buffer_c = buffer_acquire(cd->aec_reference);


Buffer_acquire() removal applies also for this usage, right?

yep, we can remove after if this merges first.

this would break compilation, wouldn't it?

lgirdwood

LGTM. @andrula-song pls also review.

lgirdwood · 2023-09-26T14:17:17Z

src/audio/google/google_rtc_audio_processing.c

+#if CONFIG_FORMAT_S16LE
+static void render_s16(struct google_rtc_audio_processing_comp_data *cd)
+{
+	struct comp_buffer *buffer_c = buffer_acquire(cd->aec_reference);


yep, we can remove after if this merges first.

ranj063

looks good. @cujomalainey could you please undraft the PR

lgirdwood

@cujomalainey good to merge. Some conflicts though.

cujomalainey · 2023-09-29T18:36:57Z

@cujomalainey good to merge. Some conflicts though.

Negative, tests reported broken audio, working on debugging, just had to fight my tooling super hard this week

btian1

any performance difference with float api?

cujomalainey · 2023-10-09T14:25:52Z

any performance difference with float api?

From what I understand, none, since the s16 is essentially a frontend for the f32 api

cujomalainey · 2023-10-09T14:27:57Z

This is blocked by #8258. The AEC xruns the pipe on ll

marcinszkudlinski

Please do not merge, i'll include those changes in AEC DP processing PR

btian1 · 2023-10-30T13:40:09Z

src/audio/google/google_rtc_audio_processing.c

+		for (i = 0; i < n; i++) {
+			j = cd->raw_mic_buffer_frame_index * cd->num_capture_channels;
+			for (channel = 0; channel < cd->num_capture_channels; channel++) {
+				cd->raw_mic_buffer[j] = Q_CONVERT_QTOF(src[channel], 31);


those Q_CONVERT_FLOAT(QTOF) macros normally used with fixed numbers, if you use here, you will have very negative performance impact, this is not proposed to use this macro, please find other way to convert to fix point data.

If there is a problem with the standard macros then they should be fixed or an alternative provided. The problem should not be addressed here if there is no other alternative currently.

Yeah, unless I'm looking at the wrong code, it looks to me like this macro is a little pessimal, dividing (!) a float by a constructed 64 bit (!!) quantity (so at runtime this is going to emit a double precision division instead of the single precision multiply we want). From format.h:

#define Q_CONVERT_QTOF(x, ny) ((float)(x) / ((int64_t)1 << (ny)))

Swapping that to something like this should generate better code for sure, though edge cases at the boundaries are always hard to reason about: #define S32_TO_F(x) ((x) * (1.0f / BIT(31)))

andyross

Notes and nitpicks. I guess one thing to note somewhere is that this is going to be highly dependent on platforms with a working FPU/HiFi engine and attendant toolchain and OS.

I made the point a few times back in the day, but note that Zephyr has Xtensa FPU support in thread contexts (when enabled, IIRC none of the intel_adsp devices I was working on had the hardware?) but NOT HiFi. If the thread doing this work gets context-switched out to something else that also needs to use the SIMD registers, things will break.

Design consensus at the time was that this "wasn't going to be a problem" because SOF pins everything and segregates SIMD workloads across preemption boundaries. But it's a maintenance risk for sure, and frankly we should probably add HiFi integration to the arch layer sooner rather than later if we want to move to "all float" AEC.

andyross · 2023-11-06T18:02:25Z

src/audio/google/google_rtc_audio_processing.c

 	bzero(cd->raw_mic_buffer, cd->num_frames * cd->num_capture_channels * sizeof(cd->raw_mic_buffer[0]));
 	cd->raw_mic_buffer_frame_index = 0;
+	for (channel = 0; channel < cd->num_capture_channels; channel++)
+		cd->raw_mic_buffer_ptrs[channel] = &cd->raw_mic_buffer[channel * cd->num_frames];


Looks like there's a !aec_reference_buffer early exit case just below this, probably best to put this after that out of general "do the cheap checks first" hygiene. It also allows the compiler (maybe) or you to roll this into the other num_capture_channels loop you're adding below. I don't think I see any other branching here that would prevent that, though it's hard to tell in a review UI.

andyross · 2023-11-06T18:05:43Z

src/audio/google/google_rtc_audio_processing.c

+	bzero(cd->aec_reference_buffer,
+	      cd->num_frames *
+	      cd->num_aec_reference_channels *
+	      sizeof(cd->aec_reference_buffer[0]));


Style nitpick: this is a whitespace-only change, and it faked me out because I thought for a second that this was changing the bzero() region without updating the size of the allocation.

IMHO the whole "write the size expression in the alloc and bzero steps" pattern is fragile. I guess my aesthetics would say that if we're going to change the source code bytes that compute this number, we should fix it so the size is computed exactly once and used as an immutable value in both the alloc/clear.

andyross · 2023-11-06T18:12:08Z

src/audio/google/google_rtc_audio_processing.c

 	cd->output_buffer_frame_index = 0;
+	for (channel = 0; channel < cd->num_capture_channels; channel++)
+		cd->aec_reference_buffer_ptrs[channel] =
+			&cd->aec_reference_buffer[channel * cd->num_frames];


Confused: this is looping over the same array as the num_aec_reference_channels loop above and initializing the same indexes to the same values. Looks like semantically it's identical to looping once over the maximum of the two counts? Or maybe it's supposed to be initializing a different array?

very likely a copy paste error, this was never tested fully as we discovered the DP scheduling bug

maybe it should be :
for (channel = 0; channel < cd->num_capture_channels; channel++)
cd->output_buffer_ptrs[channel] = &cd->output_buffer[channel * cd->num_frames];

(marcin: sorry for accidental edit)

andyross · 2023-11-06T18:21:26Z

src/audio/google/google_rtc_audio_processing.c

+		for (i = 0; i < n; i++) {
+			j = cd->raw_mic_buffer_frame_index * cd->num_capture_channels;
+			for (channel = 0; channel < cd->num_capture_channels; channel++) {
+				cd->raw_mic_buffer[j] = Q_CONVERT_QTOF(src[channel], 31);


Yeah, unless I'm looking at the wrong code, it looks to me like this macro is a little pessimal, dividing (!) a float by a constructed 64 bit (!!) quantity (so at runtime this is going to emit a double precision division instead of the single precision multiply we want). From format.h:

#define Q_CONVERT_QTOF(x, ny) ((float)(x) / ((int64_t)1 << (ny)))

Swapping that to something like this should generate better code for sure, though edge cases at the boundaries are always hard to reason about: #define S32_TO_F(x) ((x) * (1.0f / BIT(31)))

cujomalainey · 2023-11-07T07:47:08Z

Notes and nitpicks. I guess one thing to note somewhere is that this is going to be highly dependent on platforms with a working FPU/HiFi engine and attendant toolchain and OS.

I don't think this makes a difference since this is essentially a no-op as the conversion to FP was already done inside the lib, this just makes its easily handed for multiple formats outside.

cujomalainey · 2023-11-07T07:48:05Z

No idea why github won't let me reply to #8237 (comment) but for some reason there is no reply box.

this is a question for @ShriramShastry and @singalsu

marcinszkudlinski · 2023-11-07T09:55:09Z

@andyross @cujomalainey @singalsu @nashif
If its true that Zephyr does not save HiFi registers in preemption, it is a huge gap and must be fixed, DP processing is designed as a set of threads with deadlines, preempting each other (and being preempted by LL tasks). Each of the threads may, and most likely will, use HiFi

marcinszkudlinski · 2023-11-07T10:32:53Z

As to slow float->int conversion, the answer is:

int16_t CONVERT(float data)
{
	const xtfloat ratio = 2 << 14;
	xtfloat x0 = data;
	xtfloat x1;
	int16_t x;

	x1 = XT_MUL_S(x0, ratio);
	x = XT_TRUNC_S(x1, 0);

	return x;
}

converting of 10ms of 4channel data from float to int16 took 1.1ms (!!!!) using Q_CONVERT_FLOAT, 120us using CONVERT as above
Q_CONVERT_QTOF, however, is quite fast, conversion from 10ms/4chanels fom int16 to float takes 140us

andyross · 2023-11-07T17:46:17Z

Regarding HiFi registers: I'm guessing @dcpleung might have some WIP on saving the SIMD registers, if not I can take a crack at it. It's not particularly difficult, though it does have some stack/performance tradeoffs to worry about (it's a ton of data to push on the stacks, but then stacks are cached where e.g. the thread struct can't be). And IIRC (i.e. I might be wrong, would have to look up) Xtensa doesn't have a way to lazy-evaluate (e.g. trap on use) the context switch, we'd likely pay the cost for every interrupt/switch.

And as far as how to implement it: my recommendation would be to stay away from the explicit TIE macros as they defeat the optimizer (and obviously aren't portable anyway). Checking quickly, the loop below (very roughly what the code in question is doing) already optimizes to a pleasingly unrolled loop over the scalar FPU with xcc 2021.6 on TGL. With xt-clang and the ACE config, it actually emits some pretty scarily optimized (but quite fast looking) well-bundled HiFi SIMD loops. The TIE macros are definitely in third place by my accounting (though still ahead of gcc: the Zephyr SDK for TGL won't emit FPU code at all and calls into libgcc -- pretty sure FPU support is supposed to work, might be a compiler build goof).

andyross · 2023-11-07T17:48:13Z

Checking quickly, the loop below [...]

Oops, forgot loop:

void f_to_s32(float *in, int *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] * (1.0f / INT_MAX);
}

andyross · 2023-11-08T01:46:28Z

And IIRC (i.e. I might be wrong, would have to look up) Xtensa doesn't have a way to lazy-evaluate (e.g. trap on use) the context switch

I looked it up. It does. It gets complicated with SMP in the presence of SCHED_PIN_ONLY=n because you need to shoot down the state on another core when threads migrate, but that's not a SOF configuration so we can defer optimizing that case.

andrula-song · 2023-11-08T03:06:19Z

src/audio/google/google_rtc_audio_processing.c

 	cd->output_buffer_frame_index = 0;
+	for (channel = 0; channel < cd->num_capture_channels; channel++)
+		cd->aec_reference_buffer_ptrs[channel] =
+			&cd->aec_reference_buffer[channel * cd->num_frames];


maybe it should be :
for (channel = 0; channel < cd->num_capture_channels; channel++)
cd->output_buffer_ptrs[channel] = &cd->output_buffer[channel * cd->num_frames];

(marcin: sorry for accidental edit)

andrula-song · 2023-11-08T03:20:16Z

src/audio/google/google_rtc_audio_processing.c

+				j += cd->num_frames;
+			}
+
+			ref += audio_stream_get_channels(&buffer_c->stream);


can we use a variable such like ref_chn = audio_stream_get_channels(&buffer_c->stream); to reduce calculate the channels in the for loop?

andrula-song · 2023-11-08T05:05:31Z

src/audio/google/google_rtc_audio_processing.c

+			}
+
+			src += audio_stream_get_channels(&mic_buf->stream);
+			dst += audio_stream_get_channels(&output_buf->stream);


ShriramShastry · 2023-11-08T05:23:13Z

No idea why github won't let me reply to #8237 (comment) but for some reason there is no reply box.

this is a question for @ShriramShastry and @singalsu

Q_CONVERT_QTOF

Hi @cujomalainey ,

I think you can use 'Q_CONVERT_QTOF' without any problems for the time being. In cmocka wrappers, Q_CONVERT_QTOF is frequently utilized, and there is no appreciable loss of accuracy in terms of Total Harmonic Distortion or Unit in last place.

If performance is an issue, I'm not sure what the expectations are—what the cycle and benefit are expected to be, how long they will last, and so on.

Yes, it is true that the #define Q_CONVERT_QTOF(x, ny) ((float)(x) / ((int64_t)1 << (ny))) division operator may be an argument. In that case, the implementation might be made quickly for any compiler by introducing the inverse #define Q_CONVERT_QTOF(x, ny) ((float)(x) * inv((int64_t)1 << (ny))) and employing a leading-to-zeros technique with a LUT to prevent loops.

Thank you!

marcinszkudlinski · 2023-11-08T08:29:06Z

I'm finishing with DP venison of AEC and the loop looks like:

	for (int i = 0; i < cd->num_frames; i++) {
		for (channel = 0; channel < cd->num_capture_channels; channel++)
			cd->process_buffer_ptrs[channel][i] =
					Q_CONVERT_QTOF(src[channel], 15);

		src += cd->num_capture_channels;
		if (src >= src_buf_end)
			src = src_buf_start;
	}

execution time for the loop above - for 10ms 16bit 4 channels - 140us

please do not merge this commit - once DP version is stable - hopefully very soon - i'll put a new PR for this

marcinszkudlinski · 2023-11-09T17:35:42Z

I';m closing this PR,
all changes for 32bit are here: #8469

cujomalainey added 3 commits September 20, 2023 12:46

audio: google: rtc: remove unused var from ipc4 conversion

12fdcf6

replaced by out_fmt Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>

audio: google: rtc: convert to float api

00f25c8

Move to float API which consumes deinterleaved samples, this is done in preparation for more formats. As a result the copy function is also abstracted. Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>

audio: google: rtc: add support for s32

b634061

since we are using the float api we just convert everything to float and we are good Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>

cujomalainey requested review from andyross, bzhg, eddy1021, johnylin76, lkoenig and perahgren September 20, 2023 22:06

cujomalainey self-assigned this Sep 20, 2023

lgirdwood reviewed Sep 25, 2023

View reviewed changes

singalsu approved these changes Sep 26, 2023

View reviewed changes

lgirdwood reviewed Sep 26, 2023

View reviewed changes

singalsu mentioned this pull request Sep 26, 2023

Audio: Convert Google RTC audio processing to module API #8258

Merged

1 task

ranj063 approved these changes Sep 26, 2023

View reviewed changes

lgirdwood approved these changes Sep 29, 2023

View reviewed changes

btian1 reviewed Oct 9, 2023

View reviewed changes

marcinszkudlinski requested changes Oct 18, 2023

View reviewed changes

btian1 reviewed Oct 30, 2023

View reviewed changes

andyross reviewed Nov 6, 2023

View reviewed changes

andrula-song reviewed Nov 8, 2023

View reviewed changes

marcinszkudlinski closed this Nov 9, 2023

S32 support for AEC #8237

S32 support for AEC #8237

Uh oh!

Conversation

cujomalainey commented Sep 20, 2023

Uh oh!

lgirdwood left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lgirdwood left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ranj063 left a comment

Choose a reason for hiding this comment

Uh oh!

lgirdwood left a comment

Choose a reason for hiding this comment

Uh oh!

cujomalainey commented Sep 29, 2023

Uh oh!

btian1 left a comment

Choose a reason for hiding this comment

Uh oh!

cujomalainey commented Oct 9, 2023

Uh oh!

cujomalainey commented Oct 9, 2023

Uh oh!

marcinszkudlinski left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andyross left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andyross Nov 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cujomalainey Nov 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andrula-song Nov 8, 2023 • edited by marcinszkudlinski Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cujomalainey commented Nov 7, 2023

Uh oh!

cujomalainey commented Nov 7, 2023

Uh oh!

marcinszkudlinski commented Nov 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marcinszkudlinski commented Nov 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andyross commented Nov 7, 2023

Uh oh!

andyross commented Nov 7, 2023

Uh oh!

andyross commented Nov 8, 2023

andyross Nov 6, 2023 •

edited

Loading

cujomalainey Nov 7, 2023 •

edited

Loading

andrula-song Nov 8, 2023 •

edited by marcinszkudlinski

Loading

marcinszkudlinski commented Nov 7, 2023 •

edited

Loading

marcinszkudlinski commented Nov 7, 2023 •

edited

Loading

andrula-song Nov 8, 2023 •

edited by marcinszkudlinski

Loading

marcinszkudlinski commented Nov 8, 2023 •

edited

Loading