Split queries to avoid overflowing device buffers #502

cobryan05 · 2023-08-25T21:22:30Z

Several KASA devices seem to have pretty strict buffer size limitations on incoming/outgoing data transfers.

Testing on KL125-US and HL103 has shown that sending a request size larger than about ~768 bytes will immediately crash the device. Additionally, a query that generates a response larger than ~4096 bytes will crash the KL125-US. I was unable to generate such a large response to test the HL103.

The KL125-US will only return such large queries when its monthly usage stats have been populated. This means that a new bulb would work fine, but after a month of data collection the bulb would break the 4K limit and start to crash.

To work around this issue, an estimated worst-case response size is calculated before sending a request by summing up all modules estimated response size. If the estimated size is greater than the device's max_response_payload_size then the query will be split into multiple queries.

This PR implements splitting queries expected to have large responses and also removes the module 'skip list' which was a previous workaround to the crash (which worked by simply reducing the number of modules queried, which prevented the overflow) since it is no longer necessary.

This PR does not attempt to address the "input buffer size limit." Thus far this limit has not been an issue.

codecov-commenter · 2023-08-25T21:24:55Z

Codecov Report

Patch coverage is 100.00% of modified lines.

Files Changed	Coverage
kasa/modules/module.py	`100.00%`
kasa/modules/usage.py	`100.00%`
kasa/smartbulb.py	`100.00%`
kasa/smartdevice.py	`100.00%`

📢 Thoughts on this report? Let us know!.

rytilahti · 2023-08-26T16:34:39Z

I don't think it's a typo as the usage data is actually provided by the schedule module. Could you try enabling the debug mode (kasa -d) and check if you get response data for the usage (get_day_stat and get_month_stat in the response payloads for schedule module) with and without your PR, even when it crashes?

Alternatively, you could use the script in #404 . Here's how the schedule fixture looked for my KL130(EU) when using that script:

{
    "get_rules": {
        "rule_list": [
            {
                "id": "7AF2491F6F5284EB35928F7612FAB4E7",
                "name": "Schedule Rule",
                "enable": 0,
                "wday": [
                    1,
                    1,
                    1,
                    1,
                    1,
                    1,
                    1
                ],
                "stime_opt": 0,
                "smin": 985,
                "sact": 2,
                "s_light": {
                    "on_off": 1,
                    "mode": "customize_preset",
                    "hue": 0,
                    "saturation": 0,
                    "color_temp": 3638,
                    "brightness": 73
                },
                "etime_opt": -1,
                "emin": 0,
                "eact": -1,
                "repeat": 1
            }
        ],
        "enable": 0,
        "err_code": 0
    },
    "get_next_action": {
        "type": -1,
        "err_code": 0
    },
    "get_realtime": {
        "err_code": -2000,
        "err_msg": "Method not support"
    },
    "get_daystat": {
        "day_list": [
            {
                "year": 2022,
                "month": 3,
                "day": 1,
                "time": 21
            }
        ],
        "err_code": 0
    },
    "get_monthstat": {
        "month_list": [
            {
                "year": 2022,
                "month": 1,
                "time": 15835
            },
            {
                "year": 2022,
                "month": 2,
                "time": 13859
            },
            {
                "year": 2022,
                "month": 3,
                "time": 21
            }
        ],
        "err_code": 0
    }
}

Also, what is the exact model for your device? The US variant has been known to have some issues (see #451), maybe we need to add the schedule to the skiplist, too?

cobryan05 · 2023-08-26T18:53:15Z

My exact model is KL125(US). I have two bulbs that I am testing with. One of the bulbs (x.x.x.64) becomes unresponsivemuch easier than the other (x.x.x.73).

I've attached create_module_fixture.py outputs for the following:

64, 73 with PR change
64, 73 without PR change

and the 'kasa -d' output for each bulb as well.
module_fixtures.zip
64_debug_output.txt
73_debug_output.txt

I notice that:

64 becomes unresponsive much easier than 73.
64's 'emeter' has many more entries in emeter 'get_daystat', since '73' has been unplugged for a while and '64' has still been in use. Pretty sure I swapped out 73 a month or so ago because it crashed so often, and now it seems fine... So more 'heavily used' bulbs become unresponsive quicker?
My change breaks the 'usage' module
My change breaks the 'schedule' module
Blacklisting 'usage' and 'schedule' (but not just one) appears to prevent the crash similar to the module name change.

Here's some debug output without my change. The bulb becomes unresponsive very quickly.

DEBUG:kasa.smartdevice:The device has emeter, querying its information along sysinfo
DEBUG:kasa.smartdevice:Module emeter already registered, ignoring...
DEBUG:kasa.modules.module:Initial update, so consider supported: smartlife.iot.common.schedule
DEBUG:kasa.smartdevice:Adding query for <Module Schedule (smartlife.iot.common.schedule) for 192.168.3.64>: {'smartlife.iot.common.schedule': {'get_rules': None, 'get_next_action': None}}
DEBUG:kasa.modules.module:Initial update, so consider supported: smartlife.iot.common.schedule
DEBUG:kasa.smartdevice:Adding query for <Module Usage (smartlife.iot.common.schedule) for 192.168.3.64>: {'smartlife.iot.common.schedule': {'get_realtime': None, 'get_daystat': {'year': 2023, 'month': 8}, 'get_monthstat': {'year': 2023}}}
DEBUG:kasa.modules.module:Initial update, so consider supported: smartlife.iot.common.anti_theft
DEBUG:kasa.smartdevice:Adding query for <Module Antitheft (smartlife.iot.common.anti_theft) for 192.168.3.64>: {'smartlife.iot.common.anti_theft': {'get_rules': None, 'get_next_action': None}}
DEBUG:kasa.modules.module:Initial update, so consider supported: smartlife.iot.common.timesetting
DEBUG:kasa.smartdevice:Adding query for <Module Time (smartlife.iot.common.timesetting) for 192.168.3.64>: {'smartlife.iot.common.timesetting': {'get_time': None, 'get_timezone': None}}
DEBUG:kasa.modules.module:Initial update, so consider supported: smartlife.iot.common.emeter
DEBUG:kasa.smartdevice:Adding query for <Module Emeter (smartlife.iot.common.emeter) for 192.168.3.64>: {'smartlife.iot.common.emeter': {'get_realtime': None, 'get_daystat': {'year': 2023, 'month': 8}, 'get_monthstat': {'year': 2023}}}
DEBUG:kasa.modules.module:Initial update, so consider supported: countdown
DEBUG:kasa.smartdevice:Adding query for <Module Countdown (countdown) for 192.168.3.64>: {'countdown': {'get_rules': None, 'get_next_action': None}}
DEBUG:kasa.modules.module:Initial update, so consider supported: smartlife.iot.common.cloud
DEBUG:kasa.smartdevice:Module <Module Cloud (smartlife.iot.common.cloud) for 192.168.3.64> is excluded for KL125(US), skipping
DEBUG:kasa.protocol:192.168.3.64 >> {"system": {"get_sysinfo": null}, "smartlife.iot.common.schedule": {"get_rules": null, "get_next_action": null, "get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "smartlife.iot.common.anti_theft": {"get_rules": null, "get_next_action": null}, "smartlife.iot.common.timesetting": {"get_time": null, "get_timezone": null}, "smartlife.iot.common.emeter": {"get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "countdown": {"get_rules": null, "get_next_action": null}}
DEBUG:kasa.protocol:Unable to query the device 192.168.3.64, retrying: 
DEBUG:kasa.protocol:192.168.3.64 >> {"system": {"get_sysinfo": null}, "smartlife.iot.common.schedule": {"get_rules": null, "get_next_action": null, "get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "smartlife.iot.common.anti_theft": {"get_rules": null, "get_next_action": null}, "smartlife.iot.common.timesetting": {"get_time": null, "get_timezone": null}, "smartlife.iot.common.emeter": {"get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "countdown": {"get_rules": null, "get_next_action": null}}
DEBUG:kasa.protocol:Unable to query the device 192.168.3.64, retrying: 
DEBUG:kasa.protocol:192.168.3.64 >> {"system": {"get_sysinfo": null}, "smartlife.iot.common.schedule": {"get_rules": null, "get_next_action": null, "get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "smartlife.iot.common.anti_theft": {"get_rules": null, "get_next_action": null}, "smartlife.iot.common.timesetting": {"get_time": null, "get_timezone": null}, "smartlife.iot.common.emeter": {"get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "countdown": {"get_rules": null, "get_next_action": null}}
DEBUG:kasa.protocol:Unable to query the device 192.168.3.64, retrying: [Errno 104] Connection reset by peer
DEBUG:kasa.protocol:192.168.3.64 >> {"system": {"get_sysinfo": null}, "smartlife.iot.common.schedule": {"get_rules": null, "get_next_action": null, "get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "smartlife.iot.common.anti_theft": {"get_rules": null, "get_next_action": null}, "smartlife.iot.common.timesetting": {"get_time": null, "get_timezone": null}, "smartlife.iot.common.emeter": {"get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "countdown": {"get_rules": null, "get_next_action": null}}
DEBUG:kasa.protocol:Giving up on 192.168.3.64 after 3 retries
Got error: SmartDeviceException('Unable to query the device 192.168.3.64:9999:  Connection reset by peer')

rytilahti · 2023-08-27T11:06:58Z

Looking at the response payload for your 64, it is very close to being 4K, so your guess that it has something to do with usage times may very well be the reason for this behavior...

Could you try the linked PR just to confirm that, it doesn't change any behavior but will print out the expected and read payload lengths.

cobryan05 · 2023-08-27T17:29:34Z

The returned buffer I am getting is 4961 bytes. It is definitely the 'fetch all' query that crashes it. So we're overflowing their output buffer asking for so much info at once.

I pushed up a PR that splits the module update list to avoid overflowing the buffer. This seems to be working well for me so far (though I haven't updated test cases).

The KL-125(US) bulb appears to crash if a request is made to it which produces an output large than 4096 bytes. This commit splits up the module update based on the expected response size to avoid making a request which will overflow the device's output buffer.

cobryan05 · 2023-08-28T17:15:56Z

I've been trying to probe buffer sizes before crash and have noticed a couple things while trying to construct queries to generate different sized responses. These are all about request sizes, not the responses.

Sending requests over 1K to KL125-US is immediate crash. Queries over 256 bytes have high chance of it becoming unresponsive. With enough repeated queries, the bulbs always stop responding eventually no matter how small the request.
Query sizes above 350 bytes quickly (though not immediately) crash my two KL125s and my KL130.
Query sizes above ~850 bytes immediately crash my HL103

rytilahti · 2023-08-28T17:35:59Z

Thanks for the update, interesting findings which will make the fix easier I suppose. Apparently it does not affect all devices, but I suppose that capping the request sizes for bulbs is much nicer solution than trying to estimate the potential response sizes.

On the repeated requests, do they also cause the device drop from the network? If not, that might be just to protect the device from DoS.

cobryan05 · 2023-08-28T18:13:25Z

I suppose that capping the request sizes for bulbs is much nicer solution than trying to estimate the potential response sizes.

Unfortunately I think the 4k output buffer is still a thing. The 'input buffer' is just another way it becomes unresponsive.

On the repeated requests, do they also cause the device drop from the network? If not, that might be just to protect the device from DoS.

This seems to differ as well. If I overflow the output buffer on the KL-125 it still responds to pings but I get 'connection reset by peer' or simply time out when trying to connect to it. Spamming it with lots of connections quickly takes it off the network.

When querying 1/s, it's pretty random when they stop responding. Sometimes I feel like I see pattern, but then more testing breaks the pattern. 1/s usually doesn't take it off the network.

cobryan05 · 2023-08-29T21:35:27Z

More testing. It ends up that my test query itself was causing issues. I was sending queries like

{ "0" : None,  "1": None }

padding the request with these until the target size. It ends up that the bulb doesn't like this that much. After processing approximately 720 "None" entries it crashes. I could send { "0": None } approx 720 times before crashing, { "0": None, "1": None" } approx 360 times before crashing. I wonder if maybe it is keeping some sort of internal error log that gets overflowed?

Padding out my requests to { "0": { "0": None } } helps, a lot. I can spam it indefinitely without a crash.

Takeaways:

Requests above 870 bytes crash my KL125 and my HL103.
The KL125 will crash if it generates an output of > 4096 bytes
Given the Request size limit, I can't easily generate arbitrarily large outputs. I have been unable to generate an output above 4096 bytes on anything except the KL-125 to test if they would crash.

I think that the 'output estimate/split requests' is the only thing that is going to be needed for this. I'll try to get it cleaned up and tests updated in the next few days

cobryan05 · 2023-08-30T18:11:24Z

I'm hoping with this change that several of the other 'fixes' may end up unnecessary. I might try another PR later to try removing previous possibly-related fixes for testing.

brianthedavis · 2023-09-05T03:00:40Z

I tested this on my KL125(US) bulbs (1.0.13 Build 220714 Rel.162126 and 1.0.7 Build 211009 Rel.172044) by removing the MODEL_MODULE_SKIPLIST and none of my bulbs crashed; this seems like a more sustainable way to prevent crashes than a hard-coded skip list.

cobryan05 · 2023-09-06T19:36:38Z

I think this patchset is good to go from my perspective.

rytilahti

LGTM with minor changes, glad to hear it's working for @brianthedavis too!

@cobryan05 would you mind adding some explicit tests for the new SmartDevice and Module API properties, and removing the skiplist functionality? Also, please update the PR description also to include the reasoning and findings leading to this change (if not even adding a word about it to the docs) for future developers.

kasa/modules/usage.py

kasa/smartdevice.py

kasa/smartbulb.py

The KL-125(US) bulb appears to crash if a request is made to it which produces an output large than 4096 bytes. This commit splits up the module update based on the expected response size to avoid making a request which will overflow the device's output buffer.

Move maximum response size from a 'Protocol' property to a 'Device' property, defaulting to 16K. Some bulbs, eg KL-125US, are known to have smaller buffers, so bulbs are set to 4k.

Run linter, fix response deep-copying making tests unhappy, updated call count in emeter test.

This commit removes the module skip list as it seems this workaround is no longer necessary when splitting up large queries.

rytilahti

Great, thanks @cobryan05! 🥇

rytilahti mentioned this pull request Aug 27, 2023

Log expected & read payload lengths #503

Closed

cobryan05 force-pushed the moduleTypoFix branch from fc76bdc to 1a38b97 Compare August 27, 2023 17:27

cobryan05 force-pushed the moduleTypoFix branch from a4cd72a to 2c108b3 Compare August 30, 2023 18:08

rytilahti requested changes Sep 13, 2023

View reviewed changes

kasa/modules/usage.py Outdated Show resolved Hide resolved

kasa/smartdevice.py Outdated Show resolved Hide resolved

kasa/smartbulb.py Outdated Show resolved Hide resolved

rytilahti added the bug Something isn't working label Sep 13, 2023

rytilahti changed the title ~~Fix KL-125 (and others?) crashes - Fix typo in module lists~~ Split queries to avoid overflowing device buffers Sep 13, 2023

cobryan05 added 4 commits September 14, 2023 10:58

Make maximum response size a 'device' property

4ff74ac

Move maximum response size from a 'Protocol' property to a 'Device' property, defaulting to 16K. Some bulbs, eg KL-125US, are known to have smaller buffers, so bulbs are set to 4k.

Lint and fix tests

8cf4a36

Run linter, fix response deep-copying making tests unhappy, updated call count in emeter test.

Account for initial sysinfo size

c851c16

cobryan05 force-pushed the moduleTypoFix branch from 2c108b3 to 775bd5a Compare September 14, 2023 16:26

cobryan05 added 2 commits September 14, 2023 11:43

Address review comments

188f64a

Remove module skip list

2712af8

This commit removes the module skip list as it seems this workaround is no longer necessary when splitting up large queries.

cobryan05 force-pushed the moduleTypoFix branch from 775bd5a to 2712af8 Compare September 14, 2023 16:43

rytilahti added enhancement New feature or request and removed bug Something isn't working labels Sep 14, 2023

rytilahti approved these changes Sep 14, 2023

View reviewed changes

rytilahti merged commit a2444da into python-kasa:master Sep 14, 2023

rytilahti mentioned this pull request Oct 4, 2023

TP Link - Temperature should be between 2500 and 9000, was 9009 home-assistant/core#100618

Closed

rytilahti mentioned this pull request Oct 30, 2023

contributing KasaCam support #530

Closed

rytilahti mentioned this pull request Dec 31, 2023

Add known smart requests to dump_devinfo #597

Merged

Uh oh!

Split queries to avoid overflowing device buffers #502

Split queries to avoid overflowing device buffers #502

Uh oh!

Conversation

cobryan05 commented Aug 25, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Aug 25, 2023 • edited by codecov bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

rytilahti commented Aug 26, 2023

Uh oh!

cobryan05 commented Aug 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rytilahti commented Aug 27, 2023

Uh oh!

cobryan05 commented Aug 27, 2023

Uh oh!

cobryan05 commented Aug 28, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rytilahti commented Aug 28, 2023

Uh oh!

cobryan05 commented Aug 28, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cobryan05 commented Aug 29, 2023

Uh oh!

cobryan05 commented Aug 30, 2023

Uh oh!

brianthedavis commented Sep 5, 2023

Uh oh!

cobryan05 commented Sep 6, 2023

Uh oh!

rytilahti left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rytilahti left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

cobryan05 commented Aug 25, 2023 •

edited

Loading

codecov-commenter commented Aug 25, 2023 •

edited by codecov bot

Loading

cobryan05 commented Aug 26, 2023 •

edited

Loading

cobryan05 commented Aug 28, 2023 •

edited

Loading

cobryan05 commented Aug 28, 2023 •

edited

Loading

rytilahti left a comment •

edited

Loading