Skip to content

Conversation

@andrula-song
Copy link
Contributor

Optimize the eq_fir_2x_sx functions by remove the duplicated usage of circular buffer.

@andrula-song
Copy link
Contributor Author

SOFCI TEST

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whats the before/after MCPS ?

@andrula-song
Copy link
Contributor Author

the result from my xtensa testbench is below:
Screenshot from 2023-04-20 14-25-38
the main calculation is in fir_32x16_2x_hifi3( did not find any where to optimize for this function), and this is depend on the coefficient. in this test, I used the mid type, which use 40 taps. Compared with the original version:
for 16 bit format save about 14% cycles for the whole calculation of eq_fir_2x_s16 and save about 37% cycles for eq_fir_2x_s16 without fir_32x16_2x_hifi3

for 24 bit format save about 8% cycles for the whole calculation of eq_fir_2x_s24 and save about 21% cycles for eq_fir_2x_s16 without fir_32x16_2x_hifi3

for 32 bit format save about 9% cycles for the whole calculation of eq_fir_2x_s32 and save about 26% cycles for eq_fir_2x_s16 without fir_32x16_2x_hifi3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this processing is still based on channel data, i.e, try to handle one channel data in this line 123 for loop, then handle another channel, so:

  1. Possible handle data based on sample? take 2 channel as example.
    handle L1R1L2R2
    current L1 L2 --> R1 R2
    proposed: L1R1L2R2 --> L3R3L4R4.
    you can still handle 2 samples for each channel, this will make load/store fully aligned with 64 bit operation.

another disadvantage with current handle is: once the buffer is big, take deep buffer for example.
the cache miss will cost much more cycles than real, image big buffer have to write back and load new data in cache.

Copy link
Contributor Author

@andrula-song andrula-song May 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

each channel can have its own coefficient, and it's delay depends on the statistics in this channel, processing by sample will cost much more cycles.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worrying about change of processing order. A run in real device with long FIR would give more confidence that this is better. Maybe also xt testbench could be enough because it has pipelines and scheduler framework from SOF. You could add a test response where left and right channel have different coefficients and make them long. See the FIR EQ examples generator how it's done.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I didn't realize that HIFI4 build does not have HIFI3 set while they are mostly compatible. I wonder if there's other similar issues in code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually in our config, we enabled HIFI3 while HIFI4 build. Add || XCHAL_HAVE_HIFI4 == 1 is because I'm not sure if the HIFI3 will still enabled while we have HIFI5 or other newer version(but might HIFI4 is enabled). I can check the other code.

@singalsu
Copy link
Collaborator

I got with TGL HiFi3 build of xt testbench these MCPS numbers. It includes overhead of scheduler and pipeline, but excludes file components. Command is "process_test('eq-fir', 32, 32, 48000, 0, 'xt-run');".

	git main	pr7499
16	14.10		13.59
24	14.43		13.86
32	14.15		13.64

Copy link
Collaborator

@singalsu singalsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. The performance seems to be similar as with previous patch version while this is quite different now.

@andrula-song andrula-song marked this pull request as ready for review May 24, 2023 06:16
@kv2019i
Copy link
Collaborator

kv2019i commented May 24, 2023

@andrula-song Can you check the CI fails? Otherwise seems ready to go.

Optimize the eq_fir_2x_sx functions by remove the duplicated usage
of circular buffer.

Signed-off-by: Andrula Song <andrula.song@intel.com>
@kv2019i
Copy link
Collaborator

kv2019i commented May 25, 2023

Mandatory CIs pass, known fails in https://sof-ci.01.org/sofpr/PR7499/build8482/devicetest/index.html and https://sof-ci.01.org/sofpr/PR7499/build8481/devicetest/index.html. Fuzzer has issues now, please do @marc-hb @cujomalainey @andyross ping me when I should again start gating items for the fuzzer.

@kv2019i kv2019i merged commit f796e0a into thesofproject:main May 25, 2023
@marc-hb
Copy link
Collaborator

marc-hb commented May 30, 2023

Fuzzer has issues now, please do @marc-hb @cujomalainey @andyross ping me when I should again start gating items for the fuzzer.

Main branch is clean, still not clear (to me) what to do about stable-v2.2, see discussion in #7675

sof-docs also needs an update.

@cujomalainey
Copy link
Contributor

@marc-hb I made #7709 so we can stop hijacking this PR :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants