
Conversation

marcinszkudlinski (Contributor) commented Nov 9, 2023

This PR contains 3 commits; to avoid merge conflicts, each commit is a separate bugfix for DP processing.

EDIT:
FIX1: #8508
FIX2: #8511

lyakh (Collaborator) previously requested changes Nov 10, 2023
audio_stream_get_source(&buffer->stream);
struct sof_source *data_src = dp_queue_get_source(dp_queue);
uint32_t to_copy = MIN(sink_get_free_size(data_sink),
uint32_t to_copy = MIN(source_get_min_available(following_mod_data_source),
Collaborator:

Shouldn't we fix the offending LL modules one by one, as soon as you generate the respective topologies?

marcinszkudlinski (Contributor Author) Nov 14, 2023:

Absolutely we do.
But at the moment we need a working WA for the Google AEC topology, and we need it ASAP.
EDIT: my idea is to verify/fix this when converting each module to the sink/src interface. All others will be protected by this WA.

Member:

OK, since this is a temp WA, we probably want to create an issue to track it and link it to the comments here. We can then check all the other modules as we build topologies and remove the WA.
I think it also makes sense to put the WA in an ifdef MODULE_WA_FOR_BLAH so that it's easy for anyone to fix all modules and cleanly remove it (and test module updates).
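
A minimal sketch of what such a guard could look like; the Kconfig symbol here is made up (not a real SOF option), and the clamp mirrors the WA visible in the diff above:

#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* CONFIG_DP_MODULE_IBS_COPY_WA is a hypothetical symbol, not a real SOF option */
static uint32_t wa_clamp_to_copy(uint32_t sink_free, uint32_t following_mod_ibs)
{
#ifdef CONFIG_DP_MODULE_IBS_COPY_WA
	/* WA active: hand the following LL module at most one IBS per LL cycle */
	return MIN(sink_free, following_mod_ibs);
#else
	/* WA removed: copy as much as the sink can take */
	return sink_free;
#endif
}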

Collaborator:

This is at least in the dp_copy_queues() function, so it is specific to the DP-LL interface, and the comment explains what is done here, so it could be worse.

marcinszkudlinski (Contributor Author):

@lgirdwood I don't think it is necessary.
dp_copy_queues is in fact a kind of WA itself. It is a double-buffer copy from LL buffers to the pipeline 2.0 sink/src interface. It makes a DP module look, from the other components' point of view, (almost) exactly like a "normal" LL module.

Once all modules use sink/src and communication between them works in the pipeline 2.0 manner (I'm preparing an RFC for this), dp_copy_queues will be removed entirely (well, hopefully together with the module adapter).

Member:

@marcinszkudlinski ok, and we should be able to track this in the RFC GH issue, i.e. we can mention all the PRs - sounds like we target v2.9 for this too.

Comment on lines 1076 to 1081
* a trick is needed there as a workaround
* DP may produce a huge chunk of output data (i.e. 10 LL
* cycles), and the following module should be able to consume it in 1 cycle chunks
*
* unfortunately some modules are not prepared to work when there's more than
* 1 data portion available in the buffer and are draining buffers with data loss
Member:

Isn't this configurable in topology, i.e. to make the downstream LL module consume 10 cycles? @ranj063 did you mention to me there was a way to do this now?

marcinszkudlinski (Contributor Author) Nov 14, 2023:

It is not up to the topology; it is up to each module separately.
E.g. DC block processes all data from its input, no matter how much data is there:

static int dcblock_process(struct processing_module *mod,
			   struct input_stream_buffer *input_buffers,
			   int num_input_buffers,
			   struct output_stream_buffer *output_buffers,
			   int num_output_buffers)
{
	struct comp_data *cd = module_get_private_data(mod);
	struct audio_stream *source = input_buffers[0].data;
	struct audio_stream *sink = output_buffers[0].data;

	uint32_t frames = input_buffers[0].size;	/* <-- processes everything available */

	comp_dbg(mod->dev, "dcblock_process()");

	cd->dcblock_func(cd, source, sink, frames);

	module_update_buffer_position(&input_buffers[0], &output_buffers[0], frames);
	return 0;
}

It processes all the data in the buffer.
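
For reference, a hedged sketch of the direction the "required fix" points at - clamping to one LL period instead of draining everything. The helper below is hypothetical, not an existing SOF API:

#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* hypothetical sketch: process at most one LL period per call instead of
 * draining the whole buffer, leaving the remainder for later cycles */
static uint32_t clamp_frames(uint32_t frames_available, uint32_t frames_per_ll_period)
{
	return MIN(frames_available, frames_per_ll_period);
}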

Member:

@singalsu @andrula-song fyi - let's make sure we check all modules so they align with this use case.

Contributor:

Then maybe we have to use audio_stream_avail_frames instead of audio_stream_avail_frames_aligned for all components, to get the exact data size to process so that every component processes the same data size; or we force all components to use the same alignment setting.

marcinszkudlinski (Contributor Author):

@andrula-song do we need alignment at all? Especially for LL modules?

  1. all buffers are aligned to the cacheline size (64 bytes)
  2. in LL processing a module always gets a 1ms data portion from input and puts 1ms to output

that means each buffer between 2 LL modules should

  • be no larger than 1ms of data (max data rate in case of variable data rate)
  • at each LL cycle, be filled with 1ms AND drained by 1ms (that means the IBS and OBS of connected modules should match)

so

  • always aligned to 64 bytes,
  • the buffer is always drained to empty
  • data always starts from the buffer's beginning (64-byte aligned)
  • data is always linear

so what is the alignment useful for?

The real alignment is in IBSes and OBSes, and it is up to the module to set it correctly - meaning that IBS/OBS should be aligned to the frame size (in most cases bytes_per_sample * channels * rate_per_1ms * processing_period, i.e. 32 bit / 48000 Hz / 2 channels = 384 bytes for 1ms). And in LL a module is expected to consume/produce the declared amount of bytes in each cycle. Any mismatch here means underrun/overrun (with a small exception during pipeline startup).
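
As a worked example of that sizing (a sketch, not SOF code):

#include <stdint.h>

/* IBS/OBS in bytes for one processing period, per the formula above */
static uint32_t ibs_bytes(uint32_t bytes_per_sample, uint32_t channels,
			  uint32_t rate_hz, uint32_t period_ms)
{
	uint32_t frames_per_ms = rate_hz / 1000;	/* e.g. 48 frames per 1ms at 48kHz */

	return bytes_per_sample * channels * frames_per_ms * period_ms;
}

/* ibs_bytes(4, 2, 48000, 1) == 384 - the 384 bytes per 1ms quoted above */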

marcinszkudlinski (Contributor Author) Nov 16, 2023:

ok, so we can use aligned for now, but this corner case should be handled somehow

off topic begin
Why: we're wasting a lot of memory on buffers between LL modules - a separate buffer for each bind, empty most of the time. At the moment there's enough memory to cover this, but not for long.
Required optimization: use a single memory space as a reusable buffer, connected dynamically to the in/out of the module that is currently processing.

As a side effect there won't be a place to keep a frame to be processed in the next cycle.
off topic end

Collaborator:

i.e. dc block - process all data from input, no matter how much data is there

That's not correct to say. Unless the module adapter has changed, input_buffers[0].size in legacy audio stream mode used to contain a safe amount to process, derived from source available and sink free, and also following the defined data alignment constraints. The struct field name "size" seems wrong though.

I've noticed this has changed in the sink/source API; we need a new way for modules to know the processing frames count.

Member:

@marcinszkudlinski re the off-topic: @btian1 is looking at this now, targeting the low-hanging fruit 1st.

Member:

@marcinszkudlinski @andrula-song @singalsu re alignment - yes, we need to always process on aligned SIMD sizes, otherwise:

  1. We need extra logic/MCPS to deal with unaligned leading data
  2. We need extra logic/MCPS to deal with unaligned trailing data
  3. an additional impact of 1 & 2 is more complexity, larger TEXT, and decreased I$ perf.

Contributor:

@singalsu, @marcinszkudlinski, I searched the current code; it seems only the SRC type uses process. SRC may generate more than an LL period of data per output, so indeed the SRC OBS needs to be set separately.

Do we plan to change most modules' processing type from audio_stream to process?

* first, set all parameters by calling "module prepare" with pointers to
* "main" audio_stream buffers
*/
list_init(&mod->dp_queue_ll_to_dp_list);
Member:

We can merge this one straight away if the WA patch alignment takes a long time.

DP may produce a huge chunk of output data (i.e. 10 LL
cycles), and the following module should be able to
consume it in 1 cycle chunks, one by one

unfortunately some modules are not prepared to
work when there's more than 1 data portion available
in the buffer and are draining buffers with data loss

a workaround: copy only the following module's IBS in each LL cycle

required fix: all modules using the sink/src interface must take care to
process only the data they need, not forcefully drain a buffer

Signed-off-by: Marcin Szkudlinski <marcin.szkudlinski@intel.com>
In case module prepare fails, the dp_queue lists were left
uninitialized.
This may lead to a crash if a module reset is performed later.

Signed-off-by: Marcin Szkudlinski <marcin.szkudlinski@intel.com>
Let's assume DP with a 10ms period (a.k.a. a deadline).
It starts and finishes earlier, i.e. in 2ms, providing 10ms of data.
LL starts consuming data in 1ms chunks and will drain the
10ms buffer in 10ms, expecting a new portion of data on the 11th ms.

BUT - the DP module's deadline is still 10ms,
regardless of whether it had finished earlier, and it is completely fine
that processing in the next cycle takes the full 10ms - as long as it
fits into the deadline.

It may lead to underruns:

LL1 (1ms) ---> DP (10ms) -->LL2 (1ms)

ticks 0..9 -> LL1 is producing 1ms data portions,
             DP is waiting, LL2 is waiting
tick 10 - DP has enough data to run, it starts processing
tick 12 - DP finishes earlier, LL2 starts consuming,
          LL1 is producing data
ticks 13-19 LL1 is producing data,
            LL2 is consuming data (both in 1ms chunks)
tick 20 - DP starts processing a new portion of 10ms data,
          having 10ms to finish
          !!!! but LL2 has already consumed 8ms !!!!
tick 22 - LL2 is consuming the last 1ms data chunk
tick 23 - DP is still processing, LL2 has no data to process
          !!! UNDERRUN !!!!
tick 29 - DP finishes properly within its deadline

Solution: even if DP finishes before its deadline,
the data must be held till deadline time, so LL2 may
start processing no earlier than tick 20

Signed-off-by: Marcin Szkudlinski <marcin.szkudlinski@intel.com>
@kv2019i (Collaborator) left a comment:

I think the code commentary explains the solutions well enough. I do think you should not remove the comment added in patch 1. The next-module-IBS logic is special and needs to show up in the code (I like the comment; Liam was asking for a more verbose WA definition in code).


*/
/* output - we need to copy data from dp_queue (as source)
* to audio_stream (as sink)
*/
Collaborator:

I don't think we should drop this comment (from the first patch in the series). The logic to not feed more than one LL period is still in place, right, and it should be explained.

lyakh dismissed their stale review November 16, 2023 07:19

comments addressed, thanks

@lgirdwood (Member) left a comment:

I think patch 2/3 is more urgent and could be merged now.
@marcinszkudlinski I'm not sure I'm following 100%, but will DP queues

  1. align the queue to SIMD alignment sizes, OR
  2. allow modules to process in SIMD alignment and hold any trailing samples until the next iteration (where they will be aligned on SIMD size)?


* Solution: even if DP finishes before its deadline, the data must be held till
* deadline time, so LL2 may start processing no earlier than tick 20
*/
bool dp_startup_delay;
Member:

Thinking aloud - would this be better as an int, i.e. a counter that could be set to any delay value needed and decremented on each LL tick?
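
A rough sketch of that counter variant (all names below are made up, not the actual SOF field):

#include <stdbool.h>
#include <stdint.h>

/* hypothetical gate: set to the DP deadline in LL ticks, decremented each
 * LL tick; the DP output is released only once it reaches zero */
struct dp_delay_gate {
	uint32_t remaining_ticks;
};

static bool dp_delay_gate_elapsed(struct dp_delay_gate *gate)
{
	if (gate->remaining_ticks) {
		gate->remaining_ticks--;
		return false;	/* keep holding the DP output */
	}
	return true;		/* deadline reached - the LL consumer may run */
}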

@btian1 (Contributor) left a comment:

@marcinszkudlinski, did you check with DTS or Cadence whether a 20ms delay is acceptable?
Any concerns from them?

Or if DP is 20ms, then the delay will be 40ms; this is not a small number. At the least, HiFi audio may have concerns about such a big delay.


marcinszkudlinski (Contributor Author):

@btian1 it is up to the driver/topology to run a module as DP or as LL; some modules - like SRC - may be run either as DP or as LL, depending on what has been requested in the init instance IPC.

This is how DP works; delay is part of this kind of processing. If it processes data in 10ms chunks, it's expected to deliver data within a 10ms deadline. It works asynchronously to the LL tick, and may be preempted by another DP that has a shorter deadline (EDF scheduling).

In contrast, we have Low Latency, which processes 1ms chunks and introduces a 1ms end-to-end delay (+DMA buffers in DAIs).

btian1 (Contributor) commented Nov 21, 2023:

"Solution: even if DP finishes before its deadline,
the data must be held till deadline time, so LL2 may
start processing no earlier than tick 20"

here LL2 already holds data for 20ms; according to my past experience:

  1. it is too long, since the codec itself also has delays.
  2. A/V sync also has requirements on delay times.

marcinszkudlinski (Contributor Author):

@btian1 as I said - it is up to the creator of the topology. Nobody says "you MUST use long chunks". If, for example, you want to offload an SRC module to a secondary core but still have low latency, you can use 1ms DP processing.

As I said, I cannot guarantee that even if a 10ms DP module once finished processing in 2ms, it will do the same in the next cycle. Such a module may - and will - be preempted and must wait till other DPs with shorter deadlines are ready, and so will finish later. The timings may also change when a new pipeline starts, etc.

DP processing with long chunks is generally NOT designed for main audio streams, but for special cases:

  • echo cancellation (the current AEC case requires processing in 10ms chunks; 20ms of speech delay does not matter, as normal speech latency is usually much longer - close to 300ms)
  • speech recognition - no audio stream output at all, just events like "somebody said a keyphrase" - very long chunks, sometimes 300-500ms, with the deadline set almost to eternity ;) from the LL point of view. Even a few seconds does not matter.

marcinszkudlinski (Contributor Author):

I'm changing this PR to DRAFT; I will split it into several PRs and close this one.

Also, while reading the review comments, some potentially better solutions for the problems came to my mind.

@marcinszkudlinski marcinszkudlinski marked this pull request as draft November 21, 2023 11:59
@marcinszkudlinski marcinszkudlinski deleted the dp-fixes branch June 21, 2024 07:29