
Conversation

marcinszkudlinski (Contributor) commented Nov 9, 2023

This PR contains 3 commits; to avoid merge conflicts, each commit is a separate bugfix for DP processing.

EDIT:
FIX1: #8508
FIX2: #8511

lyakh (Collaborator) previously requested changes Nov 10, 2023
audio_stream_get_source(&buffer->stream);
struct sof_source *data_src = dp_queue_get_source(dp_queue);
uint32_t to_copy = MIN(sink_get_free_size(data_sink),
uint32_t to_copy = MIN(source_get_min_available(following_mod_data_source),
Collaborator:

Shouldn't we fix the offending LL modules one by one, as soon as you generate the respective topologies?

marcinszkudlinski (Contributor Author) Nov 14, 2023:

Absolutely we do.
But at the moment we need a working WA for the Google AEC topology, and we need it ASAP.
EDIT: my idea is to verify/fix this when converting each module to the sink/src interface. All others will be protected by this WA.

Member:

OK, since this is a temp WA, we probably want to create an issue to track it and link it to the comments here. We can then check all the other modules as we build topologies and remove the WA.
I think it also makes sense to put the WA in an ifdef MODULE_WA_FOR_BLAH so that it's easy for anyone to fix all modules and cleanly remove it (and test module updates).
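
A minimal sketch of what such a guard could look like; the Kconfig symbol here is made up (not a real SOF option), and the clamp mirrors the WA visible in the diff above:

#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* CONFIG_DP_MODULE_IBS_COPY_WA is a hypothetical symbol, not a real SOF option */
static uint32_t wa_clamp_to_copy(uint32_t sink_free, uint32_t following_mod_ibs)
{
#ifdef CONFIG_DP_MODULE_IBS_COPY_WA
	/* WA active: hand the following LL module at most one IBS per LL cycle */
	return MIN(sink_free, following_mod_ibs);
#else
	/* WA removed: copy as much as the sink can take */
	return sink_free;
#endif
}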

Collaborator:

This is at least in the dp_copy_queues() function, so it is specific to the DP-LL interface, and the comment explains what is done here, so it could be worse.

marcinszkudlinski (Contributor Author):

@lgirdwood I don't think it is necessary.
dp_copy_queues is in fact a kind of WA itself. It is a double-buffer copy from LL buffers to the pipeline 2.0 sink/src interface. It makes a DP module look, from the other components' point of view, (almost) exactly like a "normal" LL module.

Once all modules use sink/src and communication between them works in the pipeline 2.0 manner (I'm preparing an RFC for this), dp_copy_queues will be removed entirely (well, hopefully together with the module adapter).

Member:

@marcinszkudlinski ok, and we should be able to track this in the RFC GH issue, i.e. we can mention all the PRs - sounds like we target v2.9 for this too.

Comment on lines 1076 to 1081
* a trick is needed there as a workaround
* DP may produce a huge chunk of output data (i.e. 10 LL
* cycles), and the following module should be able to consume it in 1 cycle chunks
*
* unfortunately some modules are not prepared to work when there's more than
* 1 data portion available in the buffer and are draining buffers with data loss
Member:

Isn't this configurable in topology, i.e. to make the downstream LL module consume 10 cycles? @ranj063 did you mention to me there was a way to do this now?

marcinszkudlinski (Contributor Author) Nov 14, 2023:

It is not up to the topology; it is up to each module separately.
E.g. DC block processes all data from its input, no matter how much data is there:

static int dcblock_process(struct processing_module *mod,
			   struct input_stream_buffer *input_buffers,
			   int num_input_buffers,
			   struct output_stream_buffer *output_buffers,
			   int num_output_buffers)
{
	struct comp_data *cd = module_get_private_data(mod);
	struct audio_stream *source = input_buffers[0].data;
	struct audio_stream *sink = output_buffers[0].data;

	uint32_t frames = input_buffers[0].size;	/* <-- processes everything available */

	comp_dbg(mod->dev, "dcblock_process()");

	cd->dcblock_func(cd, source, sink, frames);

	module_update_buffer_position(&input_buffers[0], &output_buffers[0], frames);
	return 0;
}

It processes all the data in the buffer.
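
For reference, a hedged sketch of the direction the "required fix" points at - clamping to one LL period instead of draining everything. The helper below is hypothetical, not an existing SOF API:

#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* hypothetical sketch: process at most one LL period per call instead of
 * draining the whole buffer, leaving the remainder for later cycles */
static uint32_t clamp_frames(uint32_t frames_available, uint32_t frames_per_ll_period)
{
	return MIN(frames_available, frames_per_ll_period);
}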

Member:

@singalsu @andrula-song fyi - let's make sure we check all modules so they align with this use case.

Contributor:

Then maybe we have to use audio_stream_avail_frames instead of audio_stream_avail_frames_aligned for all components, to get the exact data size to process so that every component processes the same data size; or we force all components to use the same alignment setting.

marcinszkudlinski (Contributor Author):

@andrula-song do we need alignment at all? Especially for LL modules?

  1. all buffers are aligned to the cacheline size (64 bytes)
  2. in LL processing a module always gets a 1ms data portion from input and puts 1ms to output

that means each buffer between 2 LL modules should

  • be no larger than 1ms of data (max data rate in case of variable data rate)
  • at each LL cycle, be filled with 1ms AND drained by 1ms (that means the IBS and OBS of connected modules should match)

so

  • always aligned to 64 bytes,
  • the buffer is always drained to empty
  • data always starts from the buffer's beginning (64-byte aligned)
  • data is always linear

so what is the alignment useful for?

The real alignment is in IBSes and OBSes, and it is up to the module to set it correctly - meaning that IBS/OBS should be aligned to the frame size (in most cases bytes_per_sample * channels * rate_per_1ms * processing_period, i.e. 32 bit / 48000 Hz / 2 channels = 384 bytes for 1ms). And in LL a module is expected to consume/produce the declared amount of bytes in each cycle. Any mismatch here means underrun/overrun (with a small exception during pipeline startup).
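
As a worked example of that sizing (a sketch, not SOF code):

#include <stdint.h>

/* IBS/OBS in bytes for one processing period, per the formula above */
static uint32_t ibs_bytes(uint32_t bytes_per_sample, uint32_t channels,
			  uint32_t rate_hz, uint32_t period_ms)
{
	uint32_t frames_per_ms = rate_hz / 1000;	/* e.g. 48 frames per 1ms at 48kHz */

	return bytes_per_sample * channels * frames_per_ms * period_ms;
}

/* ibs_bytes(4, 2, 48000, 1) == 384 - the 384 bytes per 1ms quoted above */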

marcinszkudlinski (Contributor Author) Nov 16, 2023:

ok, so we can use aligned for now, but this corner case should be handled somehow

off topic begin
Why: we're wasting a lot of memory on buffers between LL modules - a separate buffer for each bind, empty most of the time. At the moment there's enough memory to cover this, but not for long.
Required optimization: use a single memory space as a reusable buffer, connected dynamically to the in/out of the module that is currently processing.

As a side effect there won't be a place to keep a frame to be processed in the next cycle.
off topic end

Collaborator:

i.e. dc block - process all data from input, no matter how much data is there

That's not correct to say. Unless the module adapter has changed, input_buffers[0].size in legacy audio stream mode used to contain a safe amount to process, derived from source available and sink free, and also following the defined data alignment constraints. The struct field name "size" seems wrong though.

I've noticed this has changed in the sink/source API; we need a new way for modules to know the processing frames count.

Member:

@marcinszkudlinski re the off-topic: @btian1 is looking at this now, targeting the low-hanging fruit 1st.

Member:

@marcinszkudlinski @andrula-song @singalsu re alignment - yes, we need to always process on aligned SIMD sizes, otherwise:

  1. We need extra logic/MCPS to deal with unaligned leading data
  2. We need extra logic/MCPS to deal with unaligned trailing data
  3. an additional impact of 1 & 2 is more complexity, larger TEXT, and decreased I$ perf.

Contributor:

@singalsu, @marcinszkudlinski, I searched the current code; it seems only the SRC type uses process. SRC may generate more than an LL period of data per output, so indeed the SRC OBS needs to be set separately.

Do we plan to change most modules' processing type from audio_stream to process?

* first, set all parameters by calling "module prepare" with pointers to
* "main" audio_stream buffers
*/
list_init(&mod->dp_queue_ll_to_dp_list);
Member:

We can merge this one straight away if the WA patch alignment takes a long time.

DP may produce a huge chunk of output data (i.e. 10 LL
cycles), and the following module should be able to
consume it in 1 cycle chunks, one by one

unfortunately some modules are not prepared to
work when there's more than 1 data portion available
in the buffer and are draining buffers with data loss

a workaround: copy only the following module's IBS in each LL cycle

required fix: all modules using the sink/src interface must take care to
process only the data they need, not forcefully drain a buffer

Signed-off-by: Marcin Szkudlinski <marcin.szkudlinski@intel.com>
In case module prepare fails, the dp_queue lists were left
uninitialized.
This may lead to a crash if a module reset is performed later.

Signed-off-by: Marcin Szkudlinski <marcin.szkudlinski@intel.com>
Let's assume DP with a 10ms period (a.k.a. a deadline).
It starts and finishes earlier, i.e. in 2ms, providing 10ms of data.
LL starts consuming data in 1ms chunks and will drain the
10ms buffer in 10ms, expecting a new portion of data on the 11th ms.

BUT - the DP module's deadline is still 10ms,
regardless of whether it had finished earlier, and it is completely fine
that processing in the next cycle takes the full 10ms - as long as it
fits into the deadline.

It may lead to underruns:

LL1 (1ms) ---> DP (10ms) -->LL2 (1ms)

ticks 0..9 -> LL1 is producing 1ms data portions,
             DP is waiting, LL2 is waiting
tick 10 - DP has enough data to run, it starts processing
tick 12 - DP finishes earlier, LL2 starts consuming,
          LL1 is producing data
ticks 13-19 LL1 is producing data,
            LL2 is consuming data (both in 1ms chunks)
tick 20 - DP starts processing a new portion of 10ms data,
          having 10ms to finish
          !!!! but LL2 has already consumed 8ms !!!!
tick 22 - LL2 is consuming the last 1ms data chunk
tick 23 - DP is still processing, LL2 has no data to process
          !!! UNDERRUN !!!!
tick 29 - DP finishes properly within its deadline

Solution: even if DP finishes before its deadline,
the data must be held till deadline time, so LL2 may
start processing no earlier than tick 20

Signed-off-by: Marcin Szkudlinski <marcin.szkudlinski@intel.com>
@kv2019i (Collaborator) left a comment:

I think the code commentary explains the solutions well enough. I do think you should not remove the comment added in patch 1. The next-module-IBS logic is special and needs to show up in the code (I like the comment; Liam was asking for a more verbose WA definition in code).


*/
/* output - we need to copy data from dp_queue (as source)
* to audio_stream (as sink)
*/
Collaborator:

I don't think we should drop this comment (from the first patch in the series). The logic to not feed more than one LL period is still in place, right, and it should be explained.

lyakh dismissed their stale review November 16, 2023 07:19

comments addressed, thanks

@lgirdwood (Member) left a comment:

I think patch 2/3 is more urgent and could be merged now.
@marcinszkudlinski I'm not sure I'm following 100%, but will DP queues

  1. align the queue to SIMD alignment sizes, OR
  2. allow modules to process in SIMD alignment and hold any trailing samples until the next iteration (where they will be aligned on SIMD size)?


* Solution: even if DP finishes before its deadline, the data must be held till
* deadline time, so LL2 may start processing no earlier than tick 20
*/
bool dp_startup_delay;
Member:

Thinking aloud - would this be better as an int, i.e. a counter that could be set to any delay value needed and decremented on each LL tick?
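
A rough sketch of that counter variant (all names below are made up, not the actual SOF field):

#include <stdbool.h>
#include <stdint.h>

/* hypothetical gate: set to the DP deadline in LL ticks, decremented each
 * LL tick; the DP output is released only once it reaches zero */
struct dp_delay_gate {
	uint32_t remaining_ticks;
};

static bool dp_delay_gate_elapsed(struct dp_delay_gate *gate)
{
	if (gate->remaining_ticks) {
		gate->remaining_ticks--;
		return false;	/* keep holding the DP output */
	}
	return true;		/* deadline reached - the LL consumer may run */
}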

@btian1 (Contributor) left a comment:

@marcinszkudlinski, did you check with DTS or Cadence whether a 20ms delay is acceptable?
Any concerns from them?

Or if DP is 20ms, then the delay will be 40ms; this is not a small number. At the least, HiFi audio may have concerns about such a big delay.


marcinszkudlinski (Contributor Author):

@btian1 it is up to the driver/topology to run a module as DP or as LL; some modules - like SRC - may be run either as DP or as LL, depending on what has been requested in the init instance IPC.

This is how DP works; delay is part of this kind of processing. If it processes data in 10ms chunks, it's expected to deliver data within a 10ms deadline. It works asynchronously to the LL tick, and may be preempted by another DP that has a shorter deadline (EDF scheduling).

In contrast, we have Low Latency, which processes 1ms chunks and introduces a 1ms end-to-end delay (+DMA buffers in DAIs).

btian1 (Contributor) commented Nov 21, 2023:

"Solution: even if DP finishes before its deadline,
the data must be held till deadline time, so LL2 may
start processing no earlier than tick 20"

here LL2 already holds data for 20ms; according to my past experience:

  1. it is too long, since the codec itself also has delays.
  2. A/V sync also has requirements on delay times.

marcinszkudlinski (Contributor Author):

@btian1 as I said - it is up to the creator of the topology. Nobody says "you MUST use long chunks". If, for example, you want to offload an SRC module to a secondary core but still have low latency, you can use 1ms DP processing.

As I said, I cannot guarantee that even if a 10ms DP module once finished processing in 2ms, it will do the same in the next cycle. Such a module may - and will - be preempted and must wait till other DPs with shorter deadlines are ready, and so will finish later. The timings may also change when a new pipeline starts, etc.

DP processing with long chunks is generally NOT designed for main audio streams, but for special cases:

  • echo cancellation (the current AEC case requires processing in 10ms chunks; 20ms of speech delay does not matter, as normal speech latency is usually much longer - close to 300ms)
  • speech recognition - no audio stream output at all, just events like "somebody said a keyphrase" - very long chunks, sometimes 300-500ms, with the deadline set almost to eternity ;) from the LL point of view. Even a few seconds does not matter.

marcinszkudlinski (Contributor Author):

I'm changing this PR to DRAFT; I will split it into several PRs and close this one.

Also, while reading the review comments, some potentially better solutions for the problems came to my mind.

@marcinszkudlinski marcinszkudlinski marked this pull request as draft November 21, 2023 11:59
@marcinszkudlinski marcinszkudlinski deleted the dp-fixes branch June 21, 2024 07:29