Increase timeout in ServiceInfo.request to handle loaded systems#895
Conversation
Codecov Report
@@ Coverage Diff @@
## master #895 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 22 22
Lines 2338 2339 +1
Branches 390 390
=========================================
+ Hits 2338 2339 +1
Continue to review full report at Codecov.
|
|
The CI run failed because it took https://github.com/jstasiak/python-zeroconf/pull/895/checks?check_run_id=2963841776 |
|
2nd run failed on mac 3.9 These runners are definitely known to be loaded https://github.com/jstasiak/python-zeroconf/pull/895/checks?check_run_id=2963895631 |
|
|
|
Still taking > 300ms |
|
I’m seeing GitHub runners being up to 150ms behind, but never a full second, but thats just running the tests. I think its a safe assumption that home-assistant/core#52384 is going to be much more behind because of Home Assistant startup. We've seen loops be behind by 3-4s on RPi3, but usually not on a system with that kind of specs. It only has to get behind by 1001ms for the exception to happen though. |
|
Can I measure easily for you? |
|
So its definitely working as designed, there can be up to 150ms delay when the system is loaded though on github runners. Its likely the target system is much higher. |
|
@hmmbob Can you checkout this PR and install with |
I'll give it a try in the morning (Europe). Sorry for noob question, but how do I check out the PR from inside the container?
|
|
Restart |
|
I'll give it a run in the morning! I'm still a pit puzzled by the fact that this might be due to a "loaded system". I don't see CPU spiking above 20%-ish if I check proxmox system status, and the box is quite powerfull.... Other vitals are also fine 🤷🏻♂️ |
The event loop can only use one cpu core so the single core performance is the metric that matters here. In practice this is almost never a problem except for when Home Assistant is starting up because there is so much competition for cpu time as integrations are setting up. For a bit more in-depth on the GIL and why only one thread runs at a time: https://opensource.com/article/17/4/grok-gil |
Yes, issue is gone with this change. Log clear, all cast devices available upon restart |
It can take a few seconds for a loaded system to run the
async_requestcoroutine when the event loop is busy or the system is CPU bound (example being Home Assistant startup). We now addan additional
_LOADED_SYSTEM_TIMEOUT(10s) to therun_coroutine_threadsafecalls to ensure the coroutine has the total amount of time to run up to its internal timeout (default of 3000ms).Ten seconds is a bit large of a timeout; however, its only used in cases where we wrap other timeouts. We now expect the only instance the
run_coroutine_threadsaferesult timeout will happen in a production circumstance is when someone is running aServiceInfo.request()in a thread and another thread callsZeroconf.close()at just the right moment that the future is never completed unless the system is so loaded that it is nearly unresponsive.The timeout for
run_coroutine_threadsafeis the maximum time a thread can cleanly shut down when zeroconf is closed out in another thread, which should always be longer than the underlying thread operation.