-
Notifications
You must be signed in to change notification settings - Fork 349
S32 support for AEC #8237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
S32 support for AEC #8237
Conversation
replaced by out_fmt Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>
Move to float API which consumes deinterleaved samples, this is done in preparation for more formats. As a result the copy function is also abstracted. Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>
since we are using the float api we just convert everything to float and we are good Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>
lgirdwood
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@singalsu @andrula-song pls review.
| #if CONFIG_FORMAT_S16LE | ||
| static void render_s16(struct google_rtc_audio_processing_comp_data *cd) | ||
| { | ||
| struct comp_buffer *buffer_c = buffer_acquire(cd->aec_reference); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Buffer_acquire() removal applies also for this usage, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, we can remove after if this merges first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would break compilation, wouldn't it?
lgirdwood
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @andrula-song pls also review.
| #if CONFIG_FORMAT_S16LE | ||
| static void render_s16(struct google_rtc_audio_processing_comp_data *cd) | ||
| { | ||
| struct comp_buffer *buffer_c = buffer_acquire(cd->aec_reference); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, we can remove after if this merges first.
ranj063
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. @cujomalainey could you please undraft the PR
lgirdwood
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cujomalainey good to merge. Some conflicts though.
Negative, tests reported broken audio, working on debugging, just had to fight my tooling super hard this week |
btian1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any performance difference with float api?
From what I understand, none, since the s16 is essentially a frontend for the f32 api |
|
This is blocked by #8258. The AEC xruns the pipe on ll |
marcinszkudlinski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not merge, i'll include those changes in AEC DP processing PR
| for (i = 0; i < n; i++) { | ||
| j = cd->raw_mic_buffer_frame_index * cd->num_capture_channels; | ||
| for (channel = 0; channel < cd->num_capture_channels; channel++) { | ||
| cd->raw_mic_buffer[j] = Q_CONVERT_QTOF(src[channel], 31); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those Q_CONVERT_FLOAT(QTOF) macros normally used with fixed numbers, if you use here, you will have very negative performance impact, this is not proposed to use this macro, please find other way to convert to fix point data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a problem with the standard macros then they should be fixed or an alternative provided. The problem should not be addressed here if there is no other alternative currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, unless I'm looking at the wrong code, it looks to me like this macro is a little pessimal, dividing (!) a float by a constructed 64 bit (!!) quantity (so at runtime this is going to emit a double precision division instead of the single precision multiply we want). From format.h:
#define Q_CONVERT_QTOF(x, ny) ((float)(x) / ((int64_t)1 << (ny)))
Swapping that to something like this should generate better code for sure, though edge cases at the boundaries are always hard to reason about: #define S32_TO_F(x) ((x) * (1.0f / BIT(31)))
andyross
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notes and nitpicks. I guess one thing to note somewhere is that this is going to be highly dependent on platforms with a working FPU/HiFi engine and attendant toolchain and OS.
I made the point a few times back in the day, but note that Zephyr has Xtensa FPU support in thread contexts (when enabled, IIRC none of the intel_adsp devices I was working on had the hardware?) but NOT HiFi. If the thread doing this work gets context-switched out to something else that also needs to use the SIMD registers, things will break.
Design consensus at the time was that this "wasn't going to be a problem" because SOF pins everything and segregates SIMD workloads across preemption boundaries. But it's a maintenance risk for sure, and frankly we should probably add HiFi integration to the arch layer sooner rather than later if we want to move to "all float" AEC.
| bzero(cd->raw_mic_buffer, cd->num_frames * cd->num_capture_channels * sizeof(cd->raw_mic_buffer[0])); | ||
| cd->raw_mic_buffer_frame_index = 0; | ||
| for (channel = 0; channel < cd->num_capture_channels; channel++) | ||
| cd->raw_mic_buffer_ptrs[channel] = &cd->raw_mic_buffer[channel * cd->num_frames]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there's a !aec_reference_buffer early exit case just below this, probably best to put this after that out of general "do the cheap checks first" hygiene. It also allows the compiler (maybe) or you to roll this into the other num_capture_channels loop you're adding below. I don't think I see any other branching here that would prevent that, though it's hard to tell in a review UI.
| bzero(cd->aec_reference_buffer, | ||
| cd->num_frames * | ||
| cd->num_aec_reference_channels * | ||
| sizeof(cd->aec_reference_buffer[0])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style nitpick: this is a whitespace-only change, and it faked me out because I thought for a second that this was changing the bzero() region without updating the size of the allocation.
IMHO the whole "write the size expression in the alloc and bzero steps" pattern is fragile. I guess my aesthetics would say that if we're going to change the source code bytes that compute this number, we should fix it so the size is computed exactly once and used as an immutable value in both the alloc/clear.
| cd->output_buffer_frame_index = 0; | ||
| for (channel = 0; channel < cd->num_capture_channels; channel++) | ||
| cd->aec_reference_buffer_ptrs[channel] = | ||
| &cd->aec_reference_buffer[channel * cd->num_frames]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confused: this is looping over the same array as the num_aec_reference_channels loop above and initializing the same indexes to the same values. Looks like semantically it's identical to looping once over the maximum of the two counts? Or maybe it's supposed to be initializing a different array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very likely a copy paste error, this was never tested fully as we discovered the DP scheduling bug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it should be :
for (channel = 0; channel < cd->num_capture_channels; channel++)
cd->output_buffer_ptrs[channel] = &cd->output_buffer[channel * cd->num_frames];
(marcin: sorry for accidental edit)
| for (i = 0; i < n; i++) { | ||
| j = cd->raw_mic_buffer_frame_index * cd->num_capture_channels; | ||
| for (channel = 0; channel < cd->num_capture_channels; channel++) { | ||
| cd->raw_mic_buffer[j] = Q_CONVERT_QTOF(src[channel], 31); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, unless I'm looking at the wrong code, it looks to me like this macro is a little pessimal, dividing (!) a float by a constructed 64 bit (!!) quantity (so at runtime this is going to emit a double precision division instead of the single precision multiply we want). From format.h:
#define Q_CONVERT_QTOF(x, ny) ((float)(x) / ((int64_t)1 << (ny)))
Swapping that to something like this should generate better code for sure, though edge cases at the boundaries are always hard to reason about: #define S32_TO_F(x) ((x) * (1.0f / BIT(31)))
I don't think this makes a difference since this is essentially a no-op as the conversion to FP was already done inside the lib, this just makes its easily handed for multiple formats outside. |
|
No idea why github won't let me reply to #8237 (comment) but for some reason there is no reply box. this is a question for @ShriramShastry and @singalsu |
|
@andyross @cujomalainey @singalsu @nashif |
|
As to slow float->int conversion, the answer is: converting of 10ms of 4channel data from float to int16 took 1.1ms (!!!!) using Q_CONVERT_FLOAT, 120us using CONVERT as above |
|
Regarding HiFi registers: I'm guessing @dcpleung might have some WIP on saving the SIMD registers, if not I can take a crack at it. It's not particularly difficult, though it does have some stack/performance tradeoffs to worry about (it's a ton of data to push on the stacks, but then stacks are cached where e.g. the thread struct can't be). And IIRC (i.e. I might be wrong, would have to look up) Xtensa doesn't have a way to lazy-evaluate (e.g. trap on use) the context switch, we'd likely pay the cost for every interrupt/switch. And as far as how to implement it: my recommendation would be to stay away from the explicit TIE macros as they defeat the optimizer (and obviously aren't portable anyway). Checking quickly, the loop below (very roughly what the code in question is doing) already optimizes to a pleasingly unrolled loop over the scalar FPU with xcc 2021.6 on TGL. With xt-clang and the ACE config, it actually emits some pretty scarily optimized (but quite fast looking) well-bundled HiFi SIMD loops. The TIE macros are definitely in third place by my accounting (though still ahead of gcc: the Zephyr SDK for TGL won't emit FPU code at all and calls into libgcc -- pretty sure FPU support is supposed to work, might be a compiler build goof). |
Oops, forgot loop: void f_to_s32(float *in, int *out, int n)
{
for (int i = 0; i < n; i++)
out[i] = in[i] * (1.0f / INT_MAX);
} |
I looked it up. It does. It gets complicated with SMP in the presence of SCHED_PIN_ONLY=n because you need to shoot down the state on another core when threads migrate, but that's not a SOF configuration so we can defer optimizing that case. |
| cd->output_buffer_frame_index = 0; | ||
| for (channel = 0; channel < cd->num_capture_channels; channel++) | ||
| cd->aec_reference_buffer_ptrs[channel] = | ||
| &cd->aec_reference_buffer[channel * cd->num_frames]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it should be :
for (channel = 0; channel < cd->num_capture_channels; channel++)
cd->output_buffer_ptrs[channel] = &cd->output_buffer[channel * cd->num_frames];
(marcin: sorry for accidental edit)
| j += cd->num_frames; | ||
| } | ||
|
|
||
| ref += audio_stream_get_channels(&buffer_c->stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use a variable such like ref_chn = audio_stream_get_channels(&buffer_c->stream); to reduce calculate the channels in the for loop?
| } | ||
|
|
||
| src += audio_stream_get_channels(&mic_buf->stream); | ||
| dst += audio_stream_get_channels(&output_buf->stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
Hi @cujomalainey , I think you can use If performance is an issue, I'm not sure what the expectations are—what the cycle and benefit are expected to be, how long they will last, and so on. Yes, it is true that the Thank you! |
|
I'm finishing with DP venison of AEC and the loop looks like: execution time for the loop above - for 10ms 16bit 4 channels - 140us please do not merge this commit - once DP version is stable - hopefully very soon - i'll put a new PR for this |
|
I';m closing this PR, |
No description provided.