Skip to content

Increase timeout in ServiceInfo.request to handle loaded systems#895

Merged
bdraco merged 10 commits into
python-zeroconf:masterfrom
bdraco:verify_get_service_info_timeout
Jul 4, 2021
Merged

Increase timeout in ServiceInfo.request to handle loaded systems#895
bdraco merged 10 commits into
python-zeroconf:masterfrom
bdraco:verify_get_service_info_timeout

Conversation

@bdraco

@bdraco bdraco commented Jul 1, 2021

Copy link
Copy Markdown
Member

It can take a few seconds for a loaded system to run the async_request coroutine when the event loop is busy or the system is CPU bound (example being Home Assistant startup). We now add
an additional _LOADED_SYSTEM_TIMEOUT (10s) to the run_coroutine_threadsafe calls to ensure the coroutine has the total amount of time to run up to its internal timeout (default of 3000ms).

Ten seconds is a bit large of a timeout; however, its only used in cases where we wrap other timeouts. We now expect the only instance the run_coroutine_threadsafe result timeout will happen in a production circumstance is when someone is running a ServiceInfo.request() in a thread and another thread calls Zeroconf.close() at just the right moment that the future is never completed unless the system is so loaded that it is nearly unresponsive.

The timeout for run_coroutine_threadsafe is the maximum time a thread can cleanly shut down when zeroconf is closed out in another thread, which should always be longer than the underlying thread operation.

@codecov-commenter

codecov-commenter commented Jul 1, 2021

Copy link
Copy Markdown

Codecov Report

Merging #895 (04f1d66) into master (82ff150) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            master      #895   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           22        22           
  Lines         2338      2339    +1     
  Branches       390       390           
=========================================
+ Hits          2338      2339    +1     
Impacted Files Coverage Δ
zeroconf/_services/info.py 100.00% <ø> (ø)
zeroconf/_core.py 100.00% <100.00%> (ø)
zeroconf/_utils/asyncio.py 100.00% <100.00%> (ø)
zeroconf/const.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 82ff150...04f1d66. Read the comment docs.

@bdraco

bdraco commented Jul 1, 2021

Copy link
Copy Markdown
Member Author

The CI run failed because it took 127ms to complete instead of 100ms

https://github.com/jstasiak/python-zeroconf/pull/895/checks?check_run_id=2963841776

        end_time = current_time_millis()
        await aiozc.async_close()
>       assert (end_time - start_time) < 120
E       assert (1625150945234.4329 - 1625150945107.269) < 120

@bdraco

bdraco commented Jul 1, 2021

Copy link
Copy Markdown
Member Author

2nd run failed on mac 3.9

These runners are definitely known to be loaded

https://github.com/jstasiak/python-zeroconf/pull/895/checks?check_run_id=2963895631

>>> 1625151297625.597 - 1625151297475.3542
150.24267578125

@bdraco bdraco changed the title Add tests to verify get_service_info timeout get_service_info can throw TimeoutError when the event loop is behind or the system is cpu bound Jul 1, 2021
@bdraco

bdraco commented Jul 1, 2021

Copy link
Copy Markdown
Member Author

1625152680448.731 - 1625152680139.38 it took 300ms on one run

@bdraco bdraco changed the title get_service_info can throw TimeoutError when the event loop is behind or the system is cpu bound Add timeout buffer to ServiceInfo.request to handle the event loop being behind or the system is cpu bound Jul 1, 2021
@bdraco

bdraco commented Jul 1, 2021

Copy link
Copy Markdown
Member Author

Still taking > 300ms
E assert (1625154137795.9521 - 1625154137454.514) < (200 + 100)

@bdraco

bdraco commented Jul 1, 2021

Copy link
Copy Markdown
Member Author

I’m seeing GitHub runners being up to 150ms behind, but never a full second, but thats just running the tests. I think its a safe assumption that home-assistant/core#52384 is going to be much more behind because of Home Assistant startup. We've seen loops be behind by 3-4s on RPi3, but usually not on a system with that kind of specs. It only has to get behind by 1001ms for the exception to happen though.

@hmmbob

hmmbob commented Jul 1, 2021

Copy link
Copy Markdown

Can I measure easily for you?

@bdraco

bdraco commented Jul 1, 2021

Copy link
Copy Markdown
Member Author

So its definitely working as designed, there can be up to 150ms delay when the system is loaded though on github runners. Its likely the target system is much higher.

@bdraco

bdraco commented Jul 1, 2021

Copy link
Copy Markdown
Member Author

@hmmbob Can you checkout this PR and install with pip3 install --upgrade . inside the home assistant container, then restart to see if it resolves the issue for you?

@bdraco bdraco requested a review from emontnemery July 1, 2021 16:54
Comment thread tests/utils/test_asyncio.py Outdated
Comment thread zeroconf/const.py
@bdraco bdraco changed the title Add timeout buffer to ServiceInfo.request to handle the event loop being behind or the system is cpu bound Increase timeout in ServiceInfo.request to handle loaded systems Jul 1, 2021
@hmmbob

hmmbob commented Jul 1, 2021

Copy link
Copy Markdown

@hmmbob Can you checkout this PR and install with pip3 install --upgrade . inside the home assistant container, then restart to see if it resolves the issue for you?

I'll give it a try in the morning (Europe). Sorry for noob question, but how do I check out the PR from inside the container?

docker exec -it homeassistant /bin/bash, and when I the container I need to run...... :)

@bdraco

bdraco commented Jul 1, 2021

Copy link
Copy Markdown
Member Author

docker exec -it homeassistant /bin/bash
git clone https://github.com/bdraco/python-zeroconf
cd python-zeroconf
git checkout -t origin/verify_get_service_info_timeout
pip3 install --upgrade .

Restart

@hmmbob

hmmbob commented Jul 1, 2021

Copy link
Copy Markdown

I'll give it a run in the morning!

I'm still a pit puzzled by the fact that this might be due to a "loaded system". I don't see CPU spiking above 20%-ish if I check proxmox system status, and the box is quite powerfull.... Other vitals are also fine 🤷🏻‍♂️

@bdraco

bdraco commented Jul 1, 2021

Copy link
Copy Markdown
Member Author

I'll give it a run in the morning!

I'm still a pit puzzled by the fact that this might be due to a "loaded system". I don't see CPU spiking above 20%-ish if I check proxmox system status, and the box is quite powerfull....

The event loop can only use one cpu core so the single core performance is the metric that matters here. In practice this is almost never a problem except for when Home Assistant is starting up because there is so much competition for cpu time as integrations are setting up.

For a bit more in-depth on the GIL and why only one thread runs at a time: https://opensource.com/article/17/4/grok-gil

@bdraco bdraco marked this pull request as ready for review July 1, 2021 22:39
@hmmbob

hmmbob commented Jul 2, 2021

Copy link
Copy Markdown

@hmmbob Can you checkout this PR and install with pip3 install --upgrade . inside the home assistant container, then restart to see if it resolves the issue for you?

Yes, issue is gone with this change. Log clear, all cast devices available upon restart

@emontnemery emontnemery left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bdraco bdraco mentioned this pull request Jul 4, 2021
7 tasks
@bdraco bdraco merged commit 56c7d69 into python-zeroconf:master Jul 4, 2021
@bdraco bdraco deleted the verify_get_service_info_timeout branch July 4, 2021 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants