-
Notifications
You must be signed in to change notification settings - Fork 349
Fix chain dma task handling issue and re-enable it. #7073
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ujfalusi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kv2019i
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's wait for CI results.
|
Something fishy: |
|
@jsarha TGL HDA has quite a few failures. Can you please check? |
Yes, indeed the multiple suspend resume now fails in FW crash. I can reproduce the problem on my upx-i11, but there is not much information to go on to close up on the problem. However, its quite clear that we should not merge this PR yet. |
|
Let's test this again in CI (I'll trigger this) @jsarha @makarukp The latest run did not have the Linux kernel fix to get last DSP logs to mtrace. This looks like a FW crash and if that's the case, we should get a clean DSP crash dump with a backtrace to mtrace. (https://sof-ci.01.org/sofpr/PR7073/build3851/devicetest/index.html does not have it) |
|
SOFCI TEST |
|
@jsarha @makarukp This will get published later, but the CI now capture a full DSP panic. I'll copy an early peek here: |
|
@kv2019i nice to see the mtrace, but it looks like its missing |
74f6cc7 to
cd64e68
Compare
|
I think I found the fix for: #7084 |
When schedule_task_free() is called to a scheduled zephyr_ll task it sometimes causes a crash. Canceling the task before freeing appears to fix the problem. Signed-off-by: Jyri Sarha <jyri.sarha@intel.com>
The known issues with the pipeline-free chain DMA implementation has now been solved, so we can re-enable chain DMA usage. This reverts commit c3807ae. Signed-off-by: Jyri Sarha <jyri.sarha@intel.com>
cd64e68 to
cf2115d
Compare
It got published later on the public site but then @jsarha force-pushed a new version so we'll never find the complete results. Always save links to test results before force-pushing. |
Sorry about that. Just fixed a stupid typo in the commit message. |
| if (!ret) | ||
| ret = ret2; | ||
|
|
||
| schedule_task_cancel(&cd->chain_task); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had several locks and prints in places, when debugging this on Friday. Based on those prints it looked like the crash is caused by the task being executed after its been freed. I am pretty sure this a correct fix. Before the fix the multiple-paise-resume crashed in a matter of seconds, after it was started. To pass the test it needs run something like 30mins without a glitch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ujfalusi yes, I think it's needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would something like this can be added instead for a generic fix?
diff --git a/src/include/sof/schedule/schedule.h b/src/include/sof/schedule/schedule.h
index 5d069443ab07..136f3ffbc8ef 100644
--- a/src/include/sof/schedule/schedule.h
+++ b/src/include/sof/schedule/schedule.h
@@ -363,8 +363,15 @@ static inline int schedule_task_free(struct task *task)
list_for_item(slist, &schedulers->list) {
sch = container_of(slist, struct schedule_data, list);
- if (task->type == sch->type)
+ if (task->type == sch->type) {
+ /* Make sure that the task is not scheduled, cancel it first */
+ int ret = sch->ops->schedule_task_cancel(sch->data, task);
+
+ if (ret)
+ return ret;
+
return sch->ops->schedule_task_free(sch->data, task);
+ }
}
return -ENODEV;probably the ret check is not needed? I'm not sure what is the return if the task is not scheduled, my guess would be 0, but if not, then the ret check is not good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That, or add clear documentation to schedule_task_free() that one should not call free to a task that may still be scheduled. E.g. keep the API symmetric schedule_task_init() and schedule_task_free(), vs. schedule_task() and schedule_task_cancel().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I try to think how the task can be scheduled after being freed, I can not put my finger on it. But since I have seen it, I am pretty sure there can be some unfortunate serializations zephyr_ll_run() and zephyr_ll_task_free() where this could happen. But indeed it starts to look like there is a bug in zephyr_ll.c.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do the zephyr_ll_lock() = zephyr arch_irq_lock() and zephyr_ll_unlock() = zephyr arch_irq_unlock(), do anything about the assignments actually being flushed to the memory? Should they? It looks to me that their implementation on xtensa reduce to a single instruction:
static ALWAYS_INLINE unsigned int arch_irq_lock(void)
{
unsigned int key;
__asm__ volatile("rsil %0, %1"
: "=r"(key) : "i"(XCHAL_EXCM_LEVEL) : "memory");
return key;
}
static ALWAYS_INLINE void arch_irq_unlock(unsigned int key)
{
__asm__ volatile("wsr.ps %0; rsync"
:: "r"(key) : "memory");
}
Even if the threads run on the same core, the functions trust the values are immediately readable from another thread, before the function exits. How do we know the values are actually written to memory instead of them being still held in the registers, if there is no explicit memory barriers or even volatile qualifiers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why I came to the conclusion, that schedule_task_cancel() has to be called before schedule_task_free() with Zephyr LL scheduler. Looking at zephyr_ll_task_free() now it should handle freeing tasks in any state, including running. Note, that as described there, it must only be called on the same core as where the task is running and only from a thread context. Is this all the case? Have we decoded the crash? @jsarha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The crash always happens at rsil instruction when taking the lock (= forbidding interrupts?) in zephyr_ll_run() after do_task_run() call. At which point - according to my debug prints - the task has already been freed. After thinking different serializations of the functions for some time, I can not think any other reason for that to happen than the "locking" not working as it should (e.g. not acting as explicit memory barrier). Maybe a more correct way to fix this particular problem would be not to init() and free() the task all the time, but just to schedule() and cancel() it. But it still appears that there is something fishy how the zephyr_ll works.
So the crash always happens here:
Line 208 in 7fdc623
| zephyr_ll_lock(sch, &flags); |
|
SOFCI TEST |
|
Double test to be sure. |
lyakh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's a bug in Zephyr LL scheduler and I tried to "quickly" fix it in #7138 and it didn't work. We can merge this for now and I'll debug this more and will fix the scheduler properly when I get some time. In fact this fix shouldn't be needed, but it's needed to compensate for that scheduler bug
The previous URL was not saved so we lost the previous test results again. The new 01.org lacks directory listing. |
|
I do not know why this version works, but it appears to be the only one that does. #7138 does much better job in theorizing what goes wrong with the original version, and fixes that, but unfortunately that version does not pass the test. |
The known issues with the pipeline-free chain DMA implementation has now been solved, so we can re-enable chain DMA usage.
This reverts commit c3807ae.
Signed-off-by: Jyri Sarha jyri.sarha@intel.com