Bunch of fixes to DP processing #8467
Conversation
Force-pushed cef67aa to 867cccc
  audio_stream_get_source(&buffer->stream);
  struct sof_source *data_src = dp_queue_get_source(dp_queue);
- uint32_t to_copy = MIN(sink_get_free_size(data_sink),
+ uint32_t to_copy = MIN(source_get_min_available(following_mod_data_source),
shouldn't we fix offending LL modules one by one - as soon as you generate respective topologies?
Absolutely we do.
But at the moment we need a working WA for the Google AEC topology - and we need it ASAP.
EDIT: my idea is to verify/fix this when converting each module to the sink/src interface. All others will be protected by this WA.
ok, since this is a temp WA, we probably want to create an issue to track it and link it to the comments here. We can then check all the other modules as we build topologies and remove the WA.
I think it also makes sense to put the WA in an ifdef MODULE_WA_FOR_BLAH so that it's easy for anyone to fix all modules and cleanly remove it (and test module updates).
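A minimal sketch of what such a guard could look like around the copy-size computation in dp_copy_queues(); the config symbol name is only a placeholder (echoing MODULE_WA_FOR_BLAH above), and the second MIN() operand shown here is an assumption since the quoted diff truncates it:

#if CONFIG_MODULE_ADAPTER_DP_LL_WA
	/* WA: expose at most the following module's IBS per LL cycle */
	uint32_t to_copy = MIN(source_get_min_available(following_mod_data_source),
			       source_get_data_available(data_src));
#else
	/* target behavior once all modules cope with large data portions */
	uint32_t to_copy = MIN(sink_get_free_size(data_sink),
			       source_get_data_available(data_src));
#endif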
This is at least in the dp_copy_queues() function, so it's specific to the DP-LL interface, and the comment explains what is done here, so it could be worse.
@lgirdwood I don't think it is necessary.
dp_copy_queues is in fact a kind of WA itself. It is a double-buffer copy from LL buffers to the pipeline 2.0 sink/src interface. It makes a DP module look, from the other components' point of view, (almost) exactly like a "normal" LL module.
Once all modules use sink/src and communication between them is done in the pipeline 2.0 manner (I'm preparing an RFC for this), dp_copy_queues will be removed as a whole (well, hopefully together with the module adapter).
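To make the double-buffering idea concrete, here is a toy, self-contained model of the per-LL-tick hand-off on the output side - illustration only, none of these names are SOF code:

#include <stdint.h>
#include <string.h>

/* Toy model, not SOF code: the DP module's output queue is drained into the
 * downstream LL buffer once per LL tick, and the WA limits each copy to one
 * downstream IBS even if the DP module produced many periods at once.
 */
struct toy_buf {
	uint8_t data[4096];
	size_t avail;		/* bytes currently held */
};

static size_t toy_ll_tick_copy(struct toy_buf *dp_out, struct toy_buf *ll_buf,
			       size_t downstream_ibs)
{
	size_t free_space = sizeof(ll_buf->data) - ll_buf->avail;
	size_t n = dp_out->avail;

	if (n > free_space)
		n = free_space;
	if (n > downstream_ibs)		/* the WA: one IBS per LL cycle */
		n = downstream_ibs;

	memcpy(ll_buf->data + ll_buf->avail, dp_out->data, n);
	memmove(dp_out->data, dp_out->data + n, dp_out->avail - n);
	dp_out->avail -= n;
	ll_buf->avail += n;

	return n;
}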
@marcinszkudlinski ok, and we should be able to track this in the RFC GH issue, i.e. we can mention all the PRs - sounds like we target v2.9 for this too.
 * a trick is needed there as a workaround
 * DP may produce a huge chunk of output data (i.e. 10 LL
 * cycles), and the following module should be able to consume it in 1 cycle chunks
 *
 * unfortunately some modules are not prepared to work when there's more than
 * 1 data portion available in the buffer and are draining buffers with data loss
Isn't this configurable in topology, i.e. to make the downstream LL module consume 10 cycles? @ranj063 did you mention to me there was a way to do this now?
It is not up to the topology, it is up to each module separately.
E.g. DC block - it processes all data from the input, no matter how much data is there:
static int dcblock_process(struct processing_module *mod,
			   struct input_stream_buffer *input_buffers,
			   int num_input_buffers,
			   struct output_stream_buffer *output_buffers,
			   int num_output_buffers)
{
	struct comp_data *cd = module_get_private_data(mod);
	struct audio_stream *source = input_buffers[0].data;
	struct audio_stream *sink = output_buffers[0].data;
	uint32_t frames = input_buffers[0].size;	/* <-- takes everything available */

	comp_dbg(mod->dev, "dcblock_process()");

	cd->dcblock_func(cd, source, sink, frames);
	module_update_buffer_position(&input_buffers[0], &output_buffers[0], frames);

	return 0;
}
i.e. it processes all the data in the buffer.
@singalsu @andrula-song fyi - let's make sure we check that all modules align with this use case.
Then maybe we have to use audio_stream_avail_frames() instead of audio_stream_avail_frames_aligned() for all components, to get the exact data size to process so that every component processes the same data size - or we force all components to use the same alignment setting.
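For context, the "aligned" variant essentially rounds the available frame count down to a multiple of the component's processing alignment; a generic sketch (not the SOF implementation):

#include <stdint.h>

/* Generic illustration, not the SOF implementation: round the raw available
 * frame count down to a multiple of the processing alignment, so a SIMD
 * kernel never sees a partial block.
 */
static inline uint32_t avail_frames_rounded(uint32_t avail_frames,
					    uint32_t frame_align)
{
	return frame_align ? avail_frames - (avail_frames % frame_align)
			   : avail_frames;
}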
@andrula-song do we need alignment at all? Especially for LL modules?
- all buffers are aligned to the cacheline size (64 bytes)
- in LL processing a module always gets a 1ms data portion from the input and puts 1ms to the output
That means each buffer between 2 LL modules should:
- be no larger than 1ms of data (the max data rate in case of a variable data rate)
- at each LL cycle, be filled with 1ms AND drained by 1ms (that means the IBS and OBS of connected modules should match)
So:
- always aligned to 64 bytes,
- always drained to empty,
- always starting from the buffer's beginning (64-byte aligned),
- always linear data.
So what is the alignment useful for?
The real alignment is in the IBSes and OBSes, and it is up to the module to set it correctly - meaning that IBS/OBS should be aligned to the frame size (in most cases bytes_per_sample * channels * rate_per_1ms * processing_period, i.e. 32 bit * 2 ch * 48 frames/ms * 1 ms = 384 bytes for 1ms). And in LL a module is expected to consume/produce the declared amount of bytes in each cycle. Any mismatch here means an underrun/overrun (with a small exception during pipeline startup).
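As a quick sanity check of the 384-byte example (a sketch assuming 32-bit containers, 2 channels, 48 kHz and a 1 ms period):

#include <stdint.h>

/* bytes_per_sample * channels * frames_per_ms * period_ms,
 * e.g. 4 bytes * 2 ch * 48 frames/ms * 1 ms = 384 bytes
 */
static inline uint32_t period_bytes(uint32_t bytes_per_sample, uint32_t channels,
				    uint32_t rate_hz, uint32_t period_ms)
{
	return bytes_per_sample * channels * (rate_hz / 1000) * period_ms;
}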
ok, so we can use aligned for now, but this corner case should be handled somehow

off topic begin
Why: we're wasting a lot of memory on buffers between LL modules - a separate buffer for each bind, empty most of the time. At the moment there's enough memory to cover this, but not for long.
Required optimization: use a single memory space as a reusable buffer, connected dynamically to the input/output of the module that is currently processing.
As a side effect there won't be a place to keep a frame to be processed in the next cycle.
off topic end
"i.e. dc block - process all data from input, no matter how much data is there"
That's not correct to say. Unless the module adapter has changed, input_buffers[0].size in the legacy audio-stream mode used to contain a safe amount to process, derived from the source's available data and the sink's free space, and also following the defined data alignment constraints. The struct field name "size" seems wrong though.
I've noticed this has changed in the sink/source API; we need a new way for modules to know the processing frames count.
@marcinszkudlinski re off-topic - @btian1 is looking at this now, targeting the low-hanging fruit first.
@marcinszkudlinski @andrula-song @singalsu re alignment - yes, we need to always process on aligned SIMD sizes, otherwise:
- we need extra logic/MCPS to deal with unaligned leading data
- we need extra logic/MCPS to deal with unaligned trailing data
- an additional impact of 1 & 2 is more complexity, larger TEXT, and decreased I$ performance.
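A schematic illustration of points 1 and 2 (generic C, not an SOF kernel): with unaligned data a kernel needs scalar lead-in/lead-out loops around the vectorizable body, which is exactly the extra logic, MCPS and text size mentioned above.

#include <stdint.h>

#define BLOCK 4	/* stands in for the SIMD width in samples */

/* Schematic only: apply a Q1.31 gain. If buffers were always aligned and a
 * multiple of BLOCK, only the middle loop would be needed.
 */
static void gain_q31(int32_t *buf, int n, int32_t gain)
{
	int i = 0;

	/* scalar lead-in for unaligned leading data (extra logic/MCPS) */
	while (i < n && ((uintptr_t)&buf[i] % (BLOCK * sizeof(int32_t)))) {
		buf[i] = (int32_t)(((int64_t)buf[i] * gain) >> 31);
		i++;
	}

	/* aligned, vectorizable body */
	for (; i + BLOCK <= n; i += BLOCK)
		for (int k = 0; k < BLOCK; k++)
			buf[i + k] = (int32_t)(((int64_t)buf[i + k] * gain) >> 31);

	/* scalar lead-out for trailing data shorter than one block (extra logic/MCPS) */
	for (; i < n; i++)
		buf[i] = (int32_t)(((int64_t)buf[i] * gain) >> 31);
}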
@singalsu, @marcinszkudlinski, I searched the current code and it seems only the SRC type uses "process"; for SRC it may generate more than one LL period of data per output, so indeed the SRC OBS needs to be set separately.
Do we plan to change most modules' processing type from audio_stream to process?
 * first, set all parameters by calling "module prepare" with pointers to
 * "main" audio_stream buffers
 */
list_init(&mod->dp_queue_ll_to_dp_list);
We can merge this one straight away if the WA patch alignment takes a long time.
DP may produce a huge chunk of output data (i.e. 10 LL cycles), and the following module should be able to consume it in 1-cycle chunks, one by one.
Unfortunately some modules are not prepared to work when there's more than 1 data portion available in the buffer and are draining buffers with data loss.
A workaround: copy only the following module's IBS in each LL cycle.
Required fix: all modules using the sink/src interface must be aware to process only the data they need, not forcefully draining a buffer.
Signed-off-by: Marcin Szkudlinski <marcin.szkudlinski@intel.com>
In case module prepare fails, the dp_queue lists were left uninitialized. It may lead to a crash if a module reset is performed later.
Signed-off-by: Marcin Szkudlinski <marcin.szkudlinski@intel.com>
Let's assume DP with a 10ms period (a.k.a. a deadline).
It starts and finishes earlier, i.e. in 2ms, providing 10ms of data.
LL starts consuming data in 1ms chunks and will drain the
10ms buffer in 10ms, expecting a new portion of data on the 11th ms.
BUT - the DP module's deadline is still 10ms,
regardless of whether it had finished earlier, and it is completely fine
that processing in the next cycle takes the full 10ms - as long as it
fits into the deadline.
It may lead to underruns:
LL1 (1ms) ---> DP (10ms) -->LL2 (1ms)
ticks 0..9 -> LL1 is producing 1ms data portions,
DP is waiting, LL2 is waiting
tick 10 - DP has enough data to run, it starts processing
tick 12 - DP finishes earlier, LL2 starts consuming,
LL1 is producing data
ticks 13-19 LL1 is producing data,
LL2 is consuming data (both in 1ms chunks)
tick 20 - DP starts processing a new portion of 10ms data,
having 10ms to finish
!!!! but LL2 has already consumed 8ms !!!!
tick 22 - LL2 is consuming the last 1ms data chunk
tick 23 - DP is still processing, LL2 has no data to process
!!! UNDERRUN !!!!
tick 29 - DP finishes properly within the deadline time
Solution: even if DP finishes before its deadline,
the data must be held till deadline time, so LL2 may
start processing no earlier than tick 20
Signed-off-by: Marcin Szkudlinski <marcin.szkudlinski@intel.com>
Force-pushed 867cccc to b84b9e8
kv2019i
left a comment
I think the code commentary explains the solutions well enough. I do think you should not remove the comment added in patch 1. The next-module-IBS logic is special and needs to show up in the code (I like the comment; Liam was asking for a more verbose WA definition in code).
 */
/* output - we need to copy data from dp_queue (as source)
 * to audio_stream (as sink)
 */
I don't think we should drop this comment (from the first patch in the series). The logic to not feed more than one LL cycle's worth of data is still in place, right, and it should be explained.
I think patch 2/3 is more urgent and could be merged now.
@marcinszkudlinski I'm not sure I'm following 100%, but will DP queues
- align the queue to SIMD alignment sizes, OR
- allow modules to process in SIMD alignment and hold any trailing samples until the next iteration (where they will be aligned on SIMD size)?
 * Solution: even if DP finishes before its deadline, the data must be held till
 * deadline time, so LL2 may start processing no earlier than tick 20
 */
bool dp_startup_delay;
Thinking aloud - would this be better as an int, i.e. a counter that could be set to any delay value needed and decremented on each LL tick?
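A rough sketch of that counter variant (only dp_startup_delay comes from the quoted diff; the struct, field and helper below are assumptions for illustration):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counter form of the startup delay: preload with the number of
 * LL ticks the data must be held (e.g. 10 for a 10 ms DP period and a 1 ms
 * LL tick), decrement on every LL tick, release the output when it hits zero.
 */
struct dp_delay {
	uint32_t ll_ticks_to_hold;
};

static bool dp_delay_elapsed(struct dp_delay *d)
{
	if (d->ll_ticks_to_hold) {
		d->ll_ticks_to_hold--;
		return false;	/* keep holding the DP module's output */
	}
	return true;		/* deadline reached - LL2 may start consuming */
}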
btian1
left a comment
@marcinszkudlinski, did you check with DTS or Cadence whether a 20 ms delay is acceptable?
Any concerns from them?
Also, if DP is 20ms then the delay will be 40ms; this is not a small number - at the least, HiFi audio may have concerns about such a big delay.
@btian1 it is up to the driver/topology to run a module as DP or as LL; some modules - like SRC - may be run either as DP or as LL, depending on what has been requested in the init instance IPC. That is how DP works; delay is a part of this kind of processing. If it processes data in 10ms chunks, it is expected to deliver data within a 10ms deadline. It works asynchronously to the LL tick and may be preempted by another DP that has a shorter deadline (EDF scheduling). In contrast we have Low Latency, which processes 1ms chunks and introduces a 1ms end-to-end delay (+DMA buffers in DAIs).
"Solution: even if DP finishes before its deadline, here LL2 already hold data for 20ms, according my past experience:
@btian1 as I said - it is up to the creator of the topology. Nobody says "you MUST use long chunks". If, for example, you want to offload an SRC module to a secondary core but still have low latency - you can use 1ms DP processing. As I said - I cannot guarantee that even if a 10ms DP module has once finished processing in 2ms it will do the same in the next cycle. Such a module may - and will - be preempted and must wait till other DPs with shorter deadlines are ready, and will finish later. The timings may also change when another pipeline starts, etc. DP processing with long chunks is generally NOT designed for main audio streams, but for special cases.
I'm changing this PR to DRAFT; I will split it into several PRs and close this one. Also, while reading the review comments, some potentially better solutions for these problems came to my mind.
The PR contains 3 commits; to avoid merge conflicts, each commit is a separate bugfix for DP processing.
EDIT:
FIX1: #8508
FIX2: #8511