1 change: 1 addition & 0 deletions src/audio/chain_dma.c
@@ -338,6 +338,7 @@ static int chain_task_pause(struct comp_dev *dev)
 	if (!ret)
 		ret = ret2;
 
+	schedule_task_cancel(&cd->chain_task);
Contributor
@jsarha, 'appears to fix' implies that we don't know what the real issue is? Or must we stop the task before freeing it unconditionally, because that is the correct way?

@lyakh, @kv2019i, can you clarify this?

Contributor Author

I had several locks and prints in place when debugging this on Friday. Based on those prints it looked like the crash was caused by the task being executed after it had been freed. I am pretty sure this is a correct fix. Before the fix, the multiple-pause-resume test crashed within seconds of being started; to pass, the test needs to run for something like 30 minutes without a glitch.

Collaborator

@ujfalusi yes, I think it's needed

ujfalusi (Contributor) Feb 20, 2023

Could something like this be added instead, as a generic fix?

diff --git a/src/include/sof/schedule/schedule.h b/src/include/sof/schedule/schedule.h
index 5d069443ab07..136f3ffbc8ef 100644
--- a/src/include/sof/schedule/schedule.h
+++ b/src/include/sof/schedule/schedule.h
@@ -363,8 +363,15 @@ static inline int schedule_task_free(struct task *task)
 
 	list_for_item(slist, &schedulers->list) {
 		sch = container_of(slist, struct schedule_data, list);
-		if (task->type == sch->type)
+		if (task->type == sch->type) {
+			/* Make sure that the task is not scheduled, cancel it first */
+			int ret = sch->ops->schedule_task_cancel(sch->data, task);
+
+			if (ret)
+				return ret;
+
 			return sch->ops->schedule_task_free(sch->data, task);
+		}
 	}
 
 	return -ENODEV;

Probably the ret check is not needed? I'm not sure what the return value is if the task is not scheduled; my guess would be 0, but if it's not, then the ret check is not good.

jsarha (Contributor Author) Feb 20, 2023

That, or add clear documentation to schedule_task_free() stating that one should not call free on a task that may still be scheduled. That is, keep the API symmetric: schedule_task_init() pairs with schedule_task_free(), and schedule_task() pairs with schedule_task_cancel().
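
To make the pairing concrete, a minimal sketch (chain_task_teardown() is a hypothetical helper, not existing SOF code; it uses only the schedule_task_cancel() and schedule_task_free() wrappers seen elsewhere in this thread):

#include <sof/schedule/schedule.h>
#include <sof/schedule/task.h>

/* Hypothetical helper: undo schedule_task() with schedule_task_cancel()
 * before undoing schedule_task_init() with schedule_task_free(), so the
 * scheduler can no longer pick up the task once it has been freed.
 */
static int chain_task_teardown(struct task *task)
{
	int ret;

	/* Inverse of schedule_task(): stop the task from being run again. */
	ret = schedule_task_cancel(task);
	if (ret)
		return ret;

	/* Inverse of schedule_task_init(): release the scheduler's data. */
	return schedule_task_free(task);
}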

Collaborator

@kv2019i do you mean waiting on a semaphore in zephyr_ll_task_free()? Well, that's just one particular case. I think that so far we have what @jsarha described as a symmetric API. It could in principle be changed, I guess, but we'd need to verify all the cases: XTOS, Zephyr, LL, EDF, ...

jsarha (Contributor Author) Feb 20, 2023

Now that I try to think of how the task could be scheduled after being freed, I cannot put my finger on it. But since I have seen it happen, I am pretty sure there can be some unfortunate serialization of zephyr_ll_run() and zephyr_ll_task_free() where this could occur. But indeed, it starts to look like there is a bug in zephyr_ll.c.

jsarha (Contributor Author) Feb 20, 2023

Do zephyr_ll_lock() (= Zephyr arch_irq_lock()) and zephyr_ll_unlock() (= Zephyr arch_irq_unlock()) do anything to ensure that assignments are actually flushed to memory? Should they? It looks to me that their implementations on Xtensa reduce to a single instruction each:

static ALWAYS_INLINE unsigned int arch_irq_lock(void)
{
	unsigned int key;

	__asm__ volatile("rsil %0, %1"
			 : "=r"(key) : "i"(XCHAL_EXCM_LEVEL) : "memory");
	return key;
}

static ALWAYS_INLINE void arch_irq_unlock(unsigned int key)
{
	__asm__ volatile("wsr.ps %0; rsync"
			 :: "r"(key) : "memory");
}

Even if the threads run on the same core, the callers trust that the values are immediately readable from another thread before the function exits. How do we know the values are actually written to memory rather than still being held in registers, if there are no explicit memory barriers or even volatile qualifiers?
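
For what it's worth, the "memory" clobber in both asm statements above is a compiler barrier: GCC must assume the asm reads and writes arbitrary memory, so values cached in registers are stored back before the instruction and reloaded after it. On a single in-order core, that plus the interrupt masking done by rsil is generally enough for ordering against code on the same core; visibility across cores would need real hardware barriers. A minimal illustration (not SOF code):

static int flag;

void writer(void)
{
	flag = 1;
	/* "memory" clobber: the compiler must emit the store to flag before
	 * this statement and may not keep flag cached in a register past it.
	 */
	__asm__ volatile("" ::: "memory");
	/* Anything running later on this same core observes flag == 1. */
}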

Collaborator

Not sure why I came to the conclusion that schedule_task_cancel() has to be called before schedule_task_free() with the Zephyr LL scheduler. Looking at zephyr_ll_task_free() now, it should handle freeing tasks in any state, including running. Note that, as described there, it must only be called on the same core the task is running on, and only from a thread context. Is that all the case here? Have we decoded the crash? @jsarha

jsarha (Contributor Author) Feb 21, 2023

The crash always happens at the rsil instruction when taking the lock (= masking interrupts?) in zephyr_ll_run(), after the do_task_run() call. At that point, according to my debug prints, the task has already been freed. After thinking through different serializations of the functions for some time, I cannot come up with any reason for that to happen other than the "locking" not working as it should (e.g. not acting as an explicit memory barrier). Maybe a more correct way to fix this particular problem would be not to init() and free() the task all the time, but just to schedule() and cancel() it. Still, it appears there is something fishy in how zephyr_ll works.

So the crash always happens here:

zephyr_ll_lock(sch, &flags);
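
A hedged reconstruction of the suspected interleaving (the zephyr_ll_run() structure below is paraphrased from the description above, not quoted from zephyr_ll.c):

/* Suspected race window, same core, another thread preempting the LL work:
 *
 *   zephyr_ll_run():
 *       zephyr_ll_lock(sch, &flags);
 *       ...pick task from the scheduler list...
 *       zephyr_ll_unlock(sch, &flags);
 *       do_task_run(task);            <- lock dropped while the task runs
 *                                     <- preempting thread calls
 *                                        schedule_task_free(&cd->chain_task)
 *                                        here, so the task is gone
 *       zephyr_ll_lock(sch, &flags);  <- reported crash site (the rsil)
 *       ...task/list entry dereferenced: use-after-free...
 */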

 	schedule_task_free(&cd->chain_task);
 	pm_policy_state_lock_put(PM_STATE_RUNTIME_IDLE, PM_ALL_SUBSTATES);
 	k_spin_unlock(&drivers->lock, key);
8 changes: 4 additions & 4 deletions tools/topology/topology2/avs-tplg/CMakeLists.txt
@@ -3,17 +3,17 @@
# Array of "input-file-name;output-file-name;comma separated pre-processor variables"
set(TPLGS
# CAVS HDMI only topology with passthrough pipelines
"sof-hda-generic\;sof-hda-generic-idisp\;USE_CHAIN_DMA=false,DEEPBUFFER_FW_DMA_MS=100"
"sof-hda-generic\;sof-hda-generic-idisp\;USE_CHAIN_DMA=true,DEEPBUFFER_FW_DMA_MS=100"
# CAVS HDA topology with mixer-based pipelines for HDA and passthrough pipelines for HDMI
"sof-hda-generic\;sof-hda-generic\;HDA_CONFIG=mix,USE_CHAIN_DMA=false,DEEPBUFFER_FW_DMA_MS=100"
"sof-hda-generic\;sof-hda-generic\;HDA_CONFIG=mix,USE_CHAIN_DMA=true,DEEPBUFFER_FW_DMA_MS=100"
# If the alsatplg plugins for NHLT are not available, the NHLT blobs will not be added to the
# topologies below.
"sof-hda-generic\;sof-hda-generic-2ch\;\
-HDA_CONFIG=mix,NUM_DMICS=2,PREPROCESS_PLUGINS=nhlt,NHLT_BIN=nhlt-sof-hda-generic-2ch.bin,USE_CHAIN_DMA=false,\
+HDA_CONFIG=mix,NUM_DMICS=2,PREPROCESS_PLUGINS=nhlt,NHLT_BIN=nhlt-sof-hda-generic-2ch.bin,USE_CHAIN_DMA=true,\
DEEPBUFFER_FW_DMA_MS=100"
"sof-hda-generic\;sof-hda-generic-4ch\;\
HDA_CONFIG=mix,NUM_DMICS=4,PDM1_MIC_A_ENABLE=1,PDM1_MIC_B_ENABLE=1,\
-PREPROCESS_PLUGINS=nhlt,NHLT_BIN=nhlt-sof-hda-generic-4ch.bin,USE_CHAIN_DMA=false,DEEPBUFFER_FW_DMA_MS=100"
+PREPROCESS_PLUGINS=nhlt,NHLT_BIN=nhlt-sof-hda-generic-4ch.bin,USE_CHAIN_DMA=true,DEEPBUFFER_FW_DMA_MS=100"
# CAVS SDW topology with passthrough pipelines
"cavs-sdw\;cavs-sdw\;DEEPBUFFER_FW_DMA_MS=100,\
PREPROCESS_PLUGINS=nhlt,NHLT_BIN=nhlt-cavs-sdw.bin"