Skip to content

Conversation

@andyross
Copy link
Contributor

@andyross andyross commented Jun 7, 2024

Some API refactoring, no behavioral changes.

  • The comp_buffer change is required to expose the pipeline ID to source/sinker users (see Google AEC rework: dynamic formats, legacy platform support #8919).
  • The processing_module backpointer fix is just a dangerous pattern I stumbled on during that work. Typesafety is good. Overloading untyped "private" data when you aren't the defined owner is bad. This would have certainly blown up on someone at some point even if it hasn't yet.

@marcinszkudlinski marcinszkudlinski dismissed their stale review June 11, 2024 13:01

missed one detail. Not a blocker - just nice to have

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marcinszkudlinski good to merge ? it looks the the get by ID call can be added incrementally ?

@lgirdwood
Copy link
Member

@lrudyX @wszypelt good to merge or real bug reported by internal CI ?

@marcinszkudlinski
Copy link
Contributor

@lgirdwood it can be added later, also in #8919, so we can merge

@lgirdwood
Copy link
Member

@andyross also a build failure reported by CI (testbench build) but maybe in others too.

[ 34%] Building C object CMakeFiles/sof.dir/src/audio/crossover/crossover_ipc3.c.o
/home/runner/work/sof/sof/src/audio/crossover/crossover_ipc3.c: In function ‘crossover_check_sink_assign’:
/home/runner/work/sof/sof/src/audio/crossover/crossover_ipc3.c:49:35: error: ‘struct comp_buffer’ has no member named ‘pipeline_id’
   49 |                 pipeline_id = sink->pipeline_id;
      |                                   ^~
gmake[5]: *** [CMakeFiles/sof.dir/build.make:762: CMakeFiles/sof.dir/src/audio/crossover/crossover_ipc3.c.o] Error 1

@wszypelt
Copy link

wszypelt commented Jun 11, 2024

@lgirdwood
step: _Internal Intel CI System/merge/build — Tests failed
there are issues in kd and kpb

@lgirdwood
Copy link
Member

@lgirdwood step: _Internal Intel CI System/merge/build — Tests failed there are issues in kd and kpb

Thanks @wszypelt can you share here with @andyross Thanks !

@andyross
Copy link
Contributor Author

The build failure is real, just something that doesn't build in any of the targets I poked (FWIW: is there a simple way to "run CI" on a local build host?). Need to swap hardware this morning (juggling AEC and a big array of MTK boards for Zephyr work and the desk is running out of space) and will have a fix up, along with the API additions requested.

@cujomalainey
Copy link
Contributor

cujomalainey commented Jun 11, 2024

(FWIW: is there a simple way to "run CI" on a local build host?)

https://github.com/nektos/act

Although i wouldnt use it on a corp machine

@andyross
Copy link
Contributor Author

https://github.com/nektos/act

Oh sure, but that requires knowing the right github workflow to run. And I can read .github/workflows anyway. Both of those are ~= in complexity to just reading the build failure in CI anyway. What I meant was something in the SOF source tree like "scripts/run-ci.sh" or whatever that developers can use to easily check a PR before submission.

Like, in Zephyr, almost[1] everything[2] you want to test is exercised by just running "scripts/twister" with no arguments. Would be nice to have a similar smoke test in SOF, though the needed tooling setup would be a lot of complexity.

[1] Well, except for stuff that requires tooling not distributed in the SDK, so it doesn't do the doc build or renode runs or clang builds. There's an increasingly long list of "Oh, yeah, that" cases in CI failures, but really that's a bug in the SDK strategy and not the tooling.

[2] With the caveat that a bare twister run now takes upwards of an hour and a half on a 64 core build box. That too is a real bug, but not with the tooling. Enumerating the cartesian product of "Tests X Platforms" is running into severe scaling problems right now.

@cujomalainey
Copy link
Contributor

https://github.com/nektos/act

Oh sure, but that requires knowing the right github workflow to run. And I can read .github/workflows anyway. Both of those are ~= in complexity to just reading the build failure in CI anyway. What I meant was something in the SOF source tree like "scripts/run-ci.sh" or whatever that developers can use to easily check a PR before submission.

Like, in Zephyr, almost[1] everything[2] you want to test is exercised by just running "scripts/twister" with no arguments. Would be nice to have a similar smoke test in SOF, though the needed tooling setup would be a lot of complexity.

[1] Well, except for stuff that requires tooling not distributed in the SDK, so it doesn't do the doc build or renode runs or clang builds. There's an increasingly long list of "Oh, yeah, that" cases in CI failures, but really that's a bug in the SDK strategy and not the tooling.

[2] With the caveat that a bare twister run now takes upwards of an hour and a half on a 64 core build box. That too is a real bug, but not with the tooling. Enumerating the cartesian product of "Tests X Platforms" is running into severe scaling problems right now.

@marc-hb is there a script for something like this? I know a lot of the CI just invokes scripts, might not be the worst to have a user ready one

@andyross andyross force-pushed the comp-mod-refactor branch 3 times, most recently from 5350c7a to 35d25b1 Compare June 11, 2024 18:58
@andyross
Copy link
Contributor Author

Think this should be good now? The checkpatch failure is a false positive, it's complaining (legitimately, though the failure was already present) about multiply expanded arguments in a macro that merely got a term refactored. The sparse warnings seem to be entirely about upstream Zephyr code, I don't think I see anything specific to this PR and am pretty sure that's just how they look. The "zmain" failures are indeed screaming about the change to the fatal error handling that just merged; that does need to be fixed but obviously not in this PR.

@wszypelt
Copy link

wszypelt commented Jun 12, 2024

@andyross in Internal Intel CI System/merge/builds, When we try to bind peakvol with kpb we get FW exception

Module: 5
Instance: 0
Core: 0
Pipeline: 0
Proc_domain: LL
Extended_init: False
2024-06-10 14:34:46,569 | Response InitInstance (E0000000, 00000022) received!
2024-06-10 14:34:46,569 | Sending InitInstance (C000000E, 00000011)
Payload:
00002328 00000100 00000040 00000000
00003E80 00000020 FFFF3210 00000000
00000000 00001804 00003E80 00000020
FFFF3210 00000000 00000000 00001804
00000010
Module: 14
Instance: 0
Core: 0
Pipeline: 0
Proc_domain: LL
Extended_init: False
2024-06-10 14:34:46,569 | Response InitInstance (E0000000, 00000011) received!
2024-06-10 14:34:46,569 | Sending BindModules (C5000004, 00000005)
2024-06-10 14:34:46,569 | Response BindModules (E5000000, 00000005) received!
2024-06-10 14:34:46,569 | Sending BindModules (C5000005, 0000000E)
2024-06-10 14:34:46,601 | Notification FwExceptionNotification (9B0A0000, 00000000)
Payload:
00000000 00000004 00000000 00000001
00000004 00000003 00000002 00000004
00001000 00000009 00000004 00020000
00000003 00000004 00000400 00000007
00000004 00000016 00000006 00000004
00000028 00000008 00000004 00000000
00000020 0000000A 00000004 00000003
0000000B 00000004 00000008 0000000D
00000004 00000100 0000000E 00000004
00000010 0000000F 00000004 00000008
00000010 00000004 00000010 0000000C
00000004 00000005 00000011 00000004
00000010 00000012 00000014 00000001
00000001 00000001 00000000 00000000
0000001D 00000004 00000001 00000002
00000004 00010000 00000015 00000004 received!

@andyross
Copy link
Contributor Author

I don't know how to interpret that? Is there a stack trace or fault address or something you can match to the output binary? Is there a way I can reproduce the test, or at least see how it's built? Is it built entirely from code in the tree or is there more source code in your test I should be looking for?

Really this shouldn't be changing anything behaviorally. App-visible changes are limited to:

  • The size of the comp_dev structure increased by a word, so if you have test code making assumptions about it it will need porting.
  • The struct processing_module pointer in a dev is now in its own field (the new word) and no longer in the drvdata. Likewise if you have code using this it will need porting.

@andyross
Copy link
Contributor Author

Yeah, I'm a little worried here. Looking at that dump, none of the notable strings (e.g. "FwExceptionNotification", "Proc_domain", "Extended_init") appear in the SOF tree at all. I'm worried this is not just a test, but a big blob of out-of-tree source code?

@marcinszkudlinski
Copy link
Contributor

this is an exception CPU 0 EXCCAUSE 13 (load/store PIF data error)

FW log from this test:

[    0.000000] <inf> init: print_version_banner: FW ABI 0x301d000 DBG ABI 0x5003000 tags SOF:v2.5-stable-branch-2433-gc2194e7afd54 zephyr:v3.6.0-5059-g53ddff639562 src hash 0x3fba48f8 (ref hash 0x3fba48f8)
[    0.000000] <inf> clock: clock_set_freq: clock 0 set freq 20000000Hz freq_idx 0
[    0.000000] <inf> clock: clock_set_freq: clock 1 set freq 20000000Hz freq_idx 0
[    0.000000] <inf> clock: clock_set_freq: clock 2 set freq 20000000Hz freq_idx 0
*** Booting Zephyr OS build v3.6.0-5059-g53ddff639562 ***
[    0.000000] <inf> ipc: telemetry_init: Telemetry enabled. May affect performance
[    0.000000] <inf> main: sof_app_main: SOF on intel_adsp
[    0.000000] <inf> main: sof_app_main: SOF initialized
[    0.000008] <inf> ipc: ipc_cmd: rx	: 0x43000000|0x30701000
[    0.000010] <inf> ipc: ipc_cmd: rx	: 0x43000000|0x30801000
[    0.000010] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x3070000c
[    0.000010] <wrn> basefw_intel: basefw_set_fw_config: returning success for Set FW_CONFIG without handling it
[    0.000011] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x31400008
[    0.001531] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x30600064
[    0.006800] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x3070000c
[    0.006896] <wrn> basefw_intel: basefw_set_fw_config: returning success for Set FW_CONFIG without handling it
[    0.033406] <inf> ipc: ipc_cmd: rx	: 0x11000005|0x0
[    0.033445] <inf> pipe: pipeline_new: pipeline new pipe_id 0 priority 0
[    0.035891] <inf> ipc: ipc_cmd: rx	: 0x40000004|0x15
[    0.036230] <inf> dma: dma_get: dma_get() ID 0 sref = 1 busy channels 0
[    0.039350] <inf> ipc: ipc_cmd: rx	: 0x40000005|0x22
[    0.040905] <inf> ipc: ipc_cmd: rx	: 0x4000000e|0x11
[    0.041018] <inf> kpb: kpb_new: kpb_new()
[    0.042283] <inf> ipc: ipc_cmd: rx	: 0x45000004|0x5
[    0.042345] <inf> ipc: buffer_new: buffer new size 0x200 id 0.0 flags 0x0
[    0.042493] <inf> pipe: pipeline_connect: comp:0 0x4 connect buffer 0 as sink
[    0.042523] <inf> pipe: pipeline_connect: comp:0 0x5 connect buffer 0 as source
[    0.043981] <inf> ipc: ipc_cmd: rx	: 0x45000005|0xe
[    0.044055] <inf> ipc: buffer_new: buffer new size 0x200 id 0.0 flags 0x0
[    0.044175] <err> os: print_fatal_exception:  ** FATAL EXCEPTION
[    0.044225] <err> os: print_fatal_exception:  ** CPU 0 EXCCAUSE 13 (load/store PIF data error)
[    0.044250] <err> os: print_fatal_exception:  **  PC 0xa0063344 VADDR 0x94
[    0.044276] <err> os: print_fatal_exception:  **  PS 0x60420
[    0.044303] <err> os: print_fatal_exception:  **    (INTLEVEL:0 EXCM: 0 UM:1 RING:0 WOE:1 OWB:4 CALLINC:2)
[    0.044331] <err> os: xtensa_dump_stack:  **  A0 0xa006210c  SP 0xa0103fd0  A2 (nil)  A3 0xa01040c4
[    0.044360] <err> os: xtensa_dump_stack:  **  A4 0xa0103fe4  A5 0x40110540  A6 0x5  A7 0xa0103fd0
[    0.044385] <err> os: xtensa_dump_stack:  **  A8 0xa0063341  A9 0xa0103fb0 A10 0xa0111280 A11 0x100
[    0.044411] <err> os: xtensa_dump_stack:  ** A12 0x8 A13 0x400f7bcc A14 0x200 A15 0xa0103f60
[    0.044440] <err> os: xtensa_dump_stack:  ** LBEG 0xa007cbf9 LEND 0xa007cc05 LCOUNT (nil)
[    0.044465] <err> os: xtensa_dump_stack:  ** SAR 0x20
[    0.044491] <err> os: xtensa_dump_stack:  **  THREADPTR 0x200




Backtrace:0xa0063341:0xa0103fd0 0xa0062109:0xa01040c0 0xa003c0be:0xa01040f0 0xa006405a:0xa0104120 0xa0063afc:0xa0104140 0xa00676d2:0xa0104160 0xa0057e21:0xa0104180 0xa005b67b:0xa01041b0 



[    0.045170] <err> os: z_fatal_error: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[    0.045216] <err> os: z_fatal_error: Current thread: 0x4010a7c0 (unknown)
[    0.055186] <err> coredump_error: coredump_mem_window_backend_start: #CD:BEGIN#
[    0.056478] <err> coredump_error: coredump_mem_window_backend_end: #CD:END#
[    0.056915] <err> zephyr: k_sys_fatal_error_handler: Halting system

zephyr.map.gz

map file attached for backtrace decoding

@andyross
Copy link
Contributor Author

That PC is in ipc_comp_connect, but the map file has no disassembly. Seems reasonable that it's instantiating a kpb component given the logs. Is that not exercised by any tests I can run locally?

And can you repeat that build with CONFIG_OUTPUT_DISASSEMBLY=y and provide the resulting zephyr.lst file? (I hate that Zephyr turned this off by default, it takes time in CI but is irreplaceable for stuff like this). You can also get it with "objdump -D" on zephyr.elf (the objdump that matches the toolchain in use, obviously).

@andyross
Copy link
Contributor Author

Also when you guys have bandwidth, it would be helpful to try to try the test with each of the final two patches in this PR, which both stand alone, to see which is causing the failure (the first is just dead code removal and surely uninvolved -- it would cause a build failure, not a runtime failure, if something was using it).

@marcinszkudlinski
Copy link
Contributor

marcinszkudlinski commented Jun 13, 2024

@andyross
maybe not the rootcause but "a hint" I found today:
image

after your change
image

dstmod is NULL, and is being dereferenced in the very next line

EDIT:
as I see you set mod field in module_adapter_new. Problem is that not all modules use module adapter, some are not converted yet - like KPB module.

This would have certainly blown up on someone at some point even if it hasn't yet.

well... it just has. It was working only because KPB use dev->priv_data as it was originally intended to be used - as a container for a private data set and the pointer wasn't null. And was mapped to struct processing_module :(

@marcinszkudlinski
Copy link
Contributor

@andyross I'll provide a fix for this.

@andyross
Copy link
Contributor Author

andyross commented Jun 13, 2024

Ouch. So if I'm understanding you:

  • The test is instantiating a kpb component via some path other than a processing_module creation.
  • It's connecting it to another module via an IPC4 MOD_BIND command.
  • The module backpointer is (now) NULL, so this crashes.
  • But it used to "work", because the drvdata field being used as the module backpointer (which was actually a kpb-specific data pointer and NOT a processing_module) was non-NULL.

Ouch, again. So it's actually a nice catch for that patch, I guess, the only issue being that the test that detected the failure was out-of-tree. This doesn't seem like something that should be in a vendor test? Any chance you can commit that somewhere so it can be exercised by other folks? I'd totally have fixed this myself if I'd seen it, I promise. :)

Regardless: any possibility of merging this ahead of the kpb (or test? seems like the real bug is wherever that kpb is getting created, not in the device code itself) fix and filtering/whitelisting the failure on your end, given that it's been root caused and isn't an in-tree test?

@marcinszkudlinski
Copy link
Contributor

marcinszkudlinski commented Jun 14, 2024

@andyross
084c718
I can merge it in a separate PR, but since it makes conflict with your code it would be faster and more convenient to add my commit to your PR.

andyross added 4 commits June 14, 2024 07:28
This is dead code: just a wrapper around module_adapter_new(), which
is a public API already in use.  Presumably a forgotten relic.
Remove.

Signed-off-by: Andy Ross <andyross@google.com>
The current SOF architecture splits a "module" across two structs, the
legacy comp_dev and a newer struct processing_module.  The latter
contains a pointer to the former, but in many places existing code has
needed a backpointer to recover the module from a component.

So far this has abused the drvdata mechanism to store this as a void*,
but that's fragile as many components are already (!) using drvdata
for other purposes and will clobber the setting.  The fact that it
worked is seeming just by luck.  That pointer is for the exclusive use
of the comp_driver code associated with a component to store its own
data.  We can't be touching it from the global module code.

Just give the pointer a properly-typed field of its own and make sure
the two are initialized in tandem.  Longer term, SOF really needs to
fix this bifurcation and unify the two structs.

Signed-off-by: Andy Ross <andyross@google.com>
The pipeline_id has historically been part of the comp_buffer struct,
but that is being deprecated as a public API.  So move it down into
the contained sof_audio_stream_params struct where it can be found by
new style sink/source code.

Note that the actual value of the pipeline ID is a little ambiguous:
on IPC3, the buffer is defined by the user in the .tplg file as part
of a specific pipeline with a known ID.  With IPC4 topology, the
buffers are implicitly created and will be assigned the ID of their
source (!)  component.  It is legal to define a connection across two
pipelines, and there's no ability here to recover both pipeline IDs.

Signed-off-by: Andy Ross <andyross@google.com>
Expose the new pipeline_id field in sof_audio_stream_params to a
cleaner sink/source API for use by module code.  Longer term this may
want to be indirected by newer backends.

Signed-off-by: Andy Ross <andyross@google.com>
@andyross andyross force-pushed the comp-mod-refactor branch from 35d25b1 to 029758c Compare June 14, 2024 14:28
@andyross
Copy link
Contributor Author

Easy enough, seems clean to me in cyclic review. Everyone else should check the new patch too.

Note that I left it in place at the end of the series rather than re-refactor it twice, which is technically a bisection issue because the crash is introduced by an earlier patch. I argue that's OK since the crash isn't visible using in-tree tests, and because the root cause bug was pre-existing, it just wasn't a crash.

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Some typos, but nothing to warrant an update of the patch series, so if nobody objects, let's merge this today.

Legacy API that is not using module_adapter is now depreciated,
but there are still some modules that use it. So all common code
must work properly with both types of modules.
This commit is fixing crash in bind operation when binding
a legacy module. There also some comments added in potentially
similar places, but where legacy modules cannot be used.

Signed-off-by: Marcin Szkudlinski <marcin.szkudlinski@intel.com>
@andyross andyross force-pushed the comp-mod-refactor branch from 029758c to d38f9cd Compare June 17, 2024 19:55
@andyross
Copy link
Contributor Author

Took the liberty of fixing up @marcinszkudlinski 's comment per review above; hope that's OK. Also rebased.

@kv2019i kv2019i merged commit 5f5fdb6 into thesofproject:main Jun 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants