Skip to content

Conversation

@cujomalainey
Copy link
Contributor

No description provided.

replaced by out_fmt

Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>
Move to float API which consumes deinterleaved samples, this is done in
preparation for more formats. As a result the copy function is also
abstracted.

Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>
since we are using the float api we just convert everything to float and
we are good

Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>
Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singalsu @andrula-song pls review.

#if CONFIG_FORMAT_S16LE
static void render_s16(struct google_rtc_audio_processing_comp_data *cd)
{
struct comp_buffer *buffer_c = buffer_acquire(cd->aec_reference);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buffer_acquire() removal applies also for this usage, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, we can remove after if this merges first.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would break compilation, wouldn't it?

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @andrula-song pls also review.

#if CONFIG_FORMAT_S16LE
static void render_s16(struct google_rtc_audio_processing_comp_data *cd)
{
struct comp_buffer *buffer_c = buffer_acquire(cd->aec_reference);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, we can remove after if this merges first.

Copy link
Collaborator

@ranj063 ranj063 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. @cujomalainey could you please undraft the PR

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cujomalainey good to merge. Some conflicts though.

@cujomalainey
Copy link
Contributor Author

@cujomalainey good to merge. Some conflicts though.

Negative, tests reported broken audio, working on debugging, just had to fight my tooling super hard this week

Copy link
Contributor

@btian1 btian1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any performance difference with float api?

@cujomalainey
Copy link
Contributor Author

any performance difference with float api?

From what I understand, none, since the s16 is essentially a frontend for the f32 api

@cujomalainey
Copy link
Contributor Author

This is blocked by #8258. The AEC xruns the pipe on ll

Copy link
Contributor

@marcinszkudlinski marcinszkudlinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not merge, i'll include those changes in AEC DP processing PR

for (i = 0; i < n; i++) {
j = cd->raw_mic_buffer_frame_index * cd->num_capture_channels;
for (channel = 0; channel < cd->num_capture_channels; channel++) {
cd->raw_mic_buffer[j] = Q_CONVERT_QTOF(src[channel], 31);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those Q_CONVERT_FLOAT(QTOF) macros normally used with fixed numbers, if you use here, you will have very negative performance impact, this is not proposed to use this macro, please find other way to convert to fix point data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a problem with the standard macros then they should be fixed or an alternative provided. The problem should not be addressed here if there is no other alternative currently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, unless I'm looking at the wrong code, it looks to me like this macro is a little pessimal, dividing (!) a float by a constructed 64 bit (!!) quantity (so at runtime this is going to emit a double precision division instead of the single precision multiply we want). From format.h:

#define Q_CONVERT_QTOF(x, ny) ((float)(x) / ((int64_t)1 << (ny)))

Swapping that to something like this should generate better code for sure, though edge cases at the boundaries are always hard to reason about: #define S32_TO_F(x) ((x) * (1.0f / BIT(31)))

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notes and nitpicks. I guess one thing to note somewhere is that this is going to be highly dependent on platforms with a working FPU/HiFi engine and attendant toolchain and OS.

I made the point a few times back in the day, but note that Zephyr has Xtensa FPU support in thread contexts (when enabled, IIRC none of the intel_adsp devices I was working on had the hardware?) but NOT HiFi. If the thread doing this work gets context-switched out to something else that also needs to use the SIMD registers, things will break.

Design consensus at the time was that this "wasn't going to be a problem" because SOF pins everything and segregates SIMD workloads across preemption boundaries. But it's a maintenance risk for sure, and frankly we should probably add HiFi integration to the arch layer sooner rather than later if we want to move to "all float" AEC.

bzero(cd->raw_mic_buffer, cd->num_frames * cd->num_capture_channels * sizeof(cd->raw_mic_buffer[0]));
cd->raw_mic_buffer_frame_index = 0;
for (channel = 0; channel < cd->num_capture_channels; channel++)
cd->raw_mic_buffer_ptrs[channel] = &cd->raw_mic_buffer[channel * cd->num_frames];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's a !aec_reference_buffer early exit case just below this, probably best to put this after that out of general "do the cheap checks first" hygiene. It also allows the compiler (maybe) or you to roll this into the other num_capture_channels loop you're adding below. I don't think I see any other branching here that would prevent that, though it's hard to tell in a review UI.

bzero(cd->aec_reference_buffer,
cd->num_frames *
cd->num_aec_reference_channels *
sizeof(cd->aec_reference_buffer[0]));
Copy link
Contributor

@andyross andyross Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nitpick: this is a whitespace-only change, and it faked me out because I thought for a second that this was changing the bzero() region without updating the size of the allocation.

IMHO the whole "write the size expression in the alloc and bzero steps" pattern is fragile. I guess my aesthetics would say that if we're going to change the source code bytes that compute this number, we should fix it so the size is computed exactly once and used as an immutable value in both the alloc/clear.

cd->output_buffer_frame_index = 0;
for (channel = 0; channel < cd->num_capture_channels; channel++)
cd->aec_reference_buffer_ptrs[channel] =
&cd->aec_reference_buffer[channel * cd->num_frames];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confused: this is looping over the same array as the num_aec_reference_channels loop above and initializing the same indexes to the same values. Looks like semantically it's identical to looping once over the maximum of the two counts? Or maybe it's supposed to be initializing a different array?

Copy link
Contributor Author

@cujomalainey cujomalainey Nov 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very likely a copy paste error, this was never tested fully as we discovered the DP scheduling bug

Copy link
Contributor

@andrula-song andrula-song Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it should be :
for (channel = 0; channel < cd->num_capture_channels; channel++)
cd->output_buffer_ptrs[channel] = &cd->output_buffer[channel * cd->num_frames];

(marcin: sorry for accidental edit)

for (i = 0; i < n; i++) {
j = cd->raw_mic_buffer_frame_index * cd->num_capture_channels;
for (channel = 0; channel < cd->num_capture_channels; channel++) {
cd->raw_mic_buffer[j] = Q_CONVERT_QTOF(src[channel], 31);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, unless I'm looking at the wrong code, it looks to me like this macro is a little pessimal, dividing (!) a float by a constructed 64 bit (!!) quantity (so at runtime this is going to emit a double precision division instead of the single precision multiply we want). From format.h:

#define Q_CONVERT_QTOF(x, ny) ((float)(x) / ((int64_t)1 << (ny)))

Swapping that to something like this should generate better code for sure, though edge cases at the boundaries are always hard to reason about: #define S32_TO_F(x) ((x) * (1.0f / BIT(31)))

@cujomalainey
Copy link
Contributor Author

Notes and nitpicks. I guess one thing to note somewhere is that this is going to be highly dependent on platforms with a working FPU/HiFi engine and attendant toolchain and OS.

I don't think this makes a difference since this is essentially a no-op as the conversion to FP was already done inside the lib, this just makes its easily handed for multiple formats outside.

@cujomalainey
Copy link
Contributor Author

No idea why github won't let me reply to #8237 (comment) but for some reason there is no reply box.

this is a question for @ShriramShastry and @singalsu

@marcinszkudlinski
Copy link
Contributor

marcinszkudlinski commented Nov 7, 2023

@andyross @cujomalainey @singalsu @nashif
If its true that Zephyr does not save HiFi registers in preemption, it is a huge gap and must be fixed, DP processing is designed as a set of threads with deadlines, preempting each other (and being preempted by LL tasks). Each of the threads may, and most likely will, use HiFi

@marcinszkudlinski
Copy link
Contributor

marcinszkudlinski commented Nov 7, 2023

As to slow float->int conversion, the answer is:

int16_t CONVERT(float data)
{
	const xtfloat ratio = 2 << 14;
	xtfloat x0 = data;
	xtfloat x1;
	int16_t x;

	x1 = XT_MUL_S(x0, ratio);
	x = XT_TRUNC_S(x1, 0);

	return x;
}

converting of 10ms of 4channel data from float to int16 took 1.1ms (!!!!) using Q_CONVERT_FLOAT, 120us using CONVERT as above
Q_CONVERT_QTOF, however, is quite fast, conversion from 10ms/4chanels fom int16 to float takes 140us

@andyross
Copy link
Contributor

andyross commented Nov 7, 2023

Regarding HiFi registers: I'm guessing @dcpleung might have some WIP on saving the SIMD registers, if not I can take a crack at it. It's not particularly difficult, though it does have some stack/performance tradeoffs to worry about (it's a ton of data to push on the stacks, but then stacks are cached where e.g. the thread struct can't be). And IIRC (i.e. I might be wrong, would have to look up) Xtensa doesn't have a way to lazy-evaluate (e.g. trap on use) the context switch, we'd likely pay the cost for every interrupt/switch.

And as far as how to implement it: my recommendation would be to stay away from the explicit TIE macros as they defeat the optimizer (and obviously aren't portable anyway). Checking quickly, the loop below (very roughly what the code in question is doing) already optimizes to a pleasingly unrolled loop over the scalar FPU with xcc 2021.6 on TGL. With xt-clang and the ACE config, it actually emits some pretty scarily optimized (but quite fast looking) well-bundled HiFi SIMD loops. The TIE macros are definitely in third place by my accounting (though still ahead of gcc: the Zephyr SDK for TGL won't emit FPU code at all and calls into libgcc -- pretty sure FPU support is supposed to work, might be a compiler build goof).

@andyross
Copy link
Contributor

andyross commented Nov 7, 2023

Checking quickly, the loop below [...]

Oops, forgot loop:

void f_to_s32(float *in, int *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] * (1.0f / INT_MAX);
}

@andyross
Copy link
Contributor

andyross commented Nov 8, 2023

And IIRC (i.e. I might be wrong, would have to look up) Xtensa doesn't have a way to lazy-evaluate (e.g. trap on use) the context switch

I looked it up. It does. It gets complicated with SMP in the presence of SCHED_PIN_ONLY=n because you need to shoot down the state on another core when threads migrate, but that's not a SOF configuration so we can defer optimizing that case.

cd->output_buffer_frame_index = 0;
for (channel = 0; channel < cd->num_capture_channels; channel++)
cd->aec_reference_buffer_ptrs[channel] =
&cd->aec_reference_buffer[channel * cd->num_frames];
Copy link
Contributor

@andrula-song andrula-song Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it should be :
for (channel = 0; channel < cd->num_capture_channels; channel++)
cd->output_buffer_ptrs[channel] = &cd->output_buffer[channel * cd->num_frames];

(marcin: sorry for accidental edit)

j += cd->num_frames;
}

ref += audio_stream_get_channels(&buffer_c->stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use a variable such like ref_chn = audio_stream_get_channels(&buffer_c->stream); to reduce calculate the channels in the for loop?

}

src += audio_stream_get_channels(&mic_buf->stream);
dst += audio_stream_get_channels(&output_buf->stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

@ShriramShastry
Copy link
Contributor

No idea why github won't let me reply to #8237 (comment) but for some reason there is no reply box.

this is a question for @ShriramShastry and @singalsu

Q_CONVERT_QTOF

Hi @cujomalainey ,

I think you can use 'Q_CONVERT_QTOF' without any problems for the time being. In cmocka wrappers, Q_CONVERT_QTOF is frequently utilized, and there is no appreciable loss of accuracy in terms of Total Harmonic Distortion or Unit in last place.

If performance is an issue, I'm not sure what the expectations are—what the cycle and benefit are expected to be, how long they will last, and so on.

Yes, it is true that the #define Q_CONVERT_QTOF(x, ny) ((float)(x) / ((int64_t)1 << (ny))) division operator may be an argument. In that case, the implementation might be made quickly for any compiler by introducing the inverse #define Q_CONVERT_QTOF(x, ny) ((float)(x) * inv((int64_t)1 << (ny))) and employing a leading-to-zeros technique with a LUT to prevent loops.

Thank you!

@marcinszkudlinski
Copy link
Contributor

marcinszkudlinski commented Nov 8, 2023

I'm finishing with DP venison of AEC and the loop looks like:

	for (int i = 0; i < cd->num_frames; i++) {
		for (channel = 0; channel < cd->num_capture_channels; channel++)
			cd->process_buffer_ptrs[channel][i] =
					Q_CONVERT_QTOF(src[channel], 15);

		src += cd->num_capture_channels;
		if (src >= src_buf_end)
			src = src_buf_start;
	}

execution time for the loop above - for 10ms 16bit 4 channels - 140us

please do not merge this commit - once DP version is stable - hopefully very soon - i'll put a new PR for this

@marcinszkudlinski
Copy link
Contributor

I';m closing this PR,
all changes for 32bit are here: #8469

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants