Skip to content

Conversation

@cobryan05
Copy link
Contributor

@cobryan05 cobryan05 commented Aug 25, 2023

Several KASA devices seem to have pretty strict buffer size limitations on incoming/outgoing data transfers.

Testing on KL125-US and HL103 has shown that sending a request size larger than about ~768 bytes will immediately crash the device. Additionally, a query that generates a response larger than ~4096 bytes will crash the KL125-US. I was unable to generate such a large response to test the HL103.

The KL125-US will only return such large queries when its monthly usage stats have been populated. This means that a new bulb would work fine, but after a month of data collection the bulb would break the 4K limit and start to crash.

To work around this issue, an estimated worst-case response size is calculated before sending a request by summing up all modules estimated response size. If the estimated size is greater than the device's max_response_payload_size then the query will be split into multiple queries.

This PR implements splitting queries expected to have large responses and also removes the module 'skip list' which was a previous workaround to the crash (which worked by simply reducing the number of modules queried, which prevented the overflow) since it is no longer necessary.

This PR does not attempt to address the "input buffer size limit." Thus far this limit has not been an issue.

@codecov-commenter
Copy link

codecov-commenter commented Aug 25, 2023

Codecov Report

Patch coverage is 100.00% of modified lines.

Files Changed Coverage
kasa/modules/module.py 100.00%
kasa/modules/usage.py 100.00%
kasa/smartbulb.py 100.00%
kasa/smartdevice.py 100.00%

📢 Thoughts on this report? Let us know!.

@rytilahti
Copy link
Member

I don't think it's a typo as the usage data is actually provided by the schedule module. Could you try enabling the debug mode (kasa -d) and check if you get response data for the usage (get_day_stat and get_month_stat in the response payloads for schedule module) with and without your PR, even when it crashes?

Alternatively, you could use the script in #404 . Here's how the schedule fixture looked for my KL130(EU) when using that script:

{
    "get_rules": {
        "rule_list": [
            {
                "id": "7AF2491F6F5284EB35928F7612FAB4E7",
                "name": "Schedule Rule",
                "enable": 0,
                "wday": [
                    1,
                    1,
                    1,
                    1,
                    1,
                    1,
                    1
                ],
                "stime_opt": 0,
                "smin": 985,
                "sact": 2,
                "s_light": {
                    "on_off": 1,
                    "mode": "customize_preset",
                    "hue": 0,
                    "saturation": 0,
                    "color_temp": 3638,
                    "brightness": 73
                },
                "etime_opt": -1,
                "emin": 0,
                "eact": -1,
                "repeat": 1
            }
        ],
        "enable": 0,
        "err_code": 0
    },
    "get_next_action": {
        "type": -1,
        "err_code": 0
    },
    "get_realtime": {
        "err_code": -2000,
        "err_msg": "Method not support"
    },
    "get_daystat": {
        "day_list": [
            {
                "year": 2022,
                "month": 3,
                "day": 1,
                "time": 21
            }
        ],
        "err_code": 0
    },
    "get_monthstat": {
        "month_list": [
            {
                "year": 2022,
                "month": 1,
                "time": 15835
            },
            {
                "year": 2022,
                "month": 2,
                "time": 13859
            },
            {
                "year": 2022,
                "month": 3,
                "time": 21
            }
        ],
        "err_code": 0
    }
}

Also, what is the exact model for your device? The US variant has been known to have some issues (see #451), maybe we need to add the schedule to the skiplist, too?

@cobryan05
Copy link
Contributor Author

cobryan05 commented Aug 26, 2023

My exact model is KL125(US). I have two bulbs that I am testing with. One of the bulbs (x.x.x.64) becomes unresponsivemuch easier than the other (x.x.x.73).

I've attached create_module_fixture.py outputs for the following:

64, 73 with PR change
64, 73 without PR change

and the 'kasa -d' output for each bulb as well.
module_fixtures.zip
64_debug_output.txt
73_debug_output.txt

I notice that:

  • 64 becomes unresponsive much easier than 73.
  • 64's 'emeter' has many more entries in emeter 'get_daystat', since '73' has been unplugged for a while and '64' has still been in use. Pretty sure I swapped out 73 a month or so ago because it crashed so often, and now it seems fine... So more 'heavily used' bulbs become unresponsive quicker?
  • My change breaks the 'usage' module
  • My change breaks the 'schedule' module
  • Blacklisting 'usage' and 'schedule' (but not just one) appears to prevent the crash similar to the module name change.

Here's some debug output without my change. The bulb becomes unresponsive very quickly.

DEBUG:kasa.smartdevice:The device has emeter, querying its information along sysinfo
DEBUG:kasa.smartdevice:Module emeter already registered, ignoring...
DEBUG:kasa.modules.module:Initial update, so consider supported: smartlife.iot.common.schedule
DEBUG:kasa.smartdevice:Adding query for <Module Schedule (smartlife.iot.common.schedule) for 192.168.3.64>: {'smartlife.iot.common.schedule': {'get_rules': None, 'get_next_action': None}}
DEBUG:kasa.modules.module:Initial update, so consider supported: smartlife.iot.common.schedule
DEBUG:kasa.smartdevice:Adding query for <Module Usage (smartlife.iot.common.schedule) for 192.168.3.64>: {'smartlife.iot.common.schedule': {'get_realtime': None, 'get_daystat': {'year': 2023, 'month': 8}, 'get_monthstat': {'year': 2023}}}
DEBUG:kasa.modules.module:Initial update, so consider supported: smartlife.iot.common.anti_theft
DEBUG:kasa.smartdevice:Adding query for <Module Antitheft (smartlife.iot.common.anti_theft) for 192.168.3.64>: {'smartlife.iot.common.anti_theft': {'get_rules': None, 'get_next_action': None}}
DEBUG:kasa.modules.module:Initial update, so consider supported: smartlife.iot.common.timesetting
DEBUG:kasa.smartdevice:Adding query for <Module Time (smartlife.iot.common.timesetting) for 192.168.3.64>: {'smartlife.iot.common.timesetting': {'get_time': None, 'get_timezone': None}}
DEBUG:kasa.modules.module:Initial update, so consider supported: smartlife.iot.common.emeter
DEBUG:kasa.smartdevice:Adding query for <Module Emeter (smartlife.iot.common.emeter) for 192.168.3.64>: {'smartlife.iot.common.emeter': {'get_realtime': None, 'get_daystat': {'year': 2023, 'month': 8}, 'get_monthstat': {'year': 2023}}}
DEBUG:kasa.modules.module:Initial update, so consider supported: countdown
DEBUG:kasa.smartdevice:Adding query for <Module Countdown (countdown) for 192.168.3.64>: {'countdown': {'get_rules': None, 'get_next_action': None}}
DEBUG:kasa.modules.module:Initial update, so consider supported: smartlife.iot.common.cloud
DEBUG:kasa.smartdevice:Module <Module Cloud (smartlife.iot.common.cloud) for 192.168.3.64> is excluded for KL125(US), skipping
DEBUG:kasa.protocol:192.168.3.64 >> {"system": {"get_sysinfo": null}, "smartlife.iot.common.schedule": {"get_rules": null, "get_next_action": null, "get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "smartlife.iot.common.anti_theft": {"get_rules": null, "get_next_action": null}, "smartlife.iot.common.timesetting": {"get_time": null, "get_timezone": null}, "smartlife.iot.common.emeter": {"get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "countdown": {"get_rules": null, "get_next_action": null}}
DEBUG:kasa.protocol:Unable to query the device 192.168.3.64, retrying: 
DEBUG:kasa.protocol:192.168.3.64 >> {"system": {"get_sysinfo": null}, "smartlife.iot.common.schedule": {"get_rules": null, "get_next_action": null, "get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "smartlife.iot.common.anti_theft": {"get_rules": null, "get_next_action": null}, "smartlife.iot.common.timesetting": {"get_time": null, "get_timezone": null}, "smartlife.iot.common.emeter": {"get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "countdown": {"get_rules": null, "get_next_action": null}}
DEBUG:kasa.protocol:Unable to query the device 192.168.3.64, retrying: 
DEBUG:kasa.protocol:192.168.3.64 >> {"system": {"get_sysinfo": null}, "smartlife.iot.common.schedule": {"get_rules": null, "get_next_action": null, "get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "smartlife.iot.common.anti_theft": {"get_rules": null, "get_next_action": null}, "smartlife.iot.common.timesetting": {"get_time": null, "get_timezone": null}, "smartlife.iot.common.emeter": {"get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "countdown": {"get_rules": null, "get_next_action": null}}
DEBUG:kasa.protocol:Unable to query the device 192.168.3.64, retrying: [Errno 104] Connection reset by peer
DEBUG:kasa.protocol:192.168.3.64 >> {"system": {"get_sysinfo": null}, "smartlife.iot.common.schedule": {"get_rules": null, "get_next_action": null, "get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "smartlife.iot.common.anti_theft": {"get_rules": null, "get_next_action": null}, "smartlife.iot.common.timesetting": {"get_time": null, "get_timezone": null}, "smartlife.iot.common.emeter": {"get_realtime": null, "get_daystat": {"year": 2023, "month": 8}, "get_monthstat": {"year": 2023}}, "countdown": {"get_rules": null, "get_next_action": null}}
DEBUG:kasa.protocol:Giving up on 192.168.3.64 after 3 retries
Got error: SmartDeviceException('Unable to query the device 192.168.3.64:9999:  Connection reset by peer')

@rytilahti
Copy link
Member

Looking at the response payload for your 64, it is very close to being 4K, so your guess that it has something to do with usage times may very well be the reason for this behavior...

Could you try the linked PR just to confirm that, it doesn't change any behavior but will print out the expected and read payload lengths.

@cobryan05
Copy link
Contributor Author

The returned buffer I am getting is 4961 bytes. It is definitely the 'fetch all' query that crashes it. So we're overflowing their output buffer asking for so much info at once.

I pushed up a PR that splits the module update list to avoid overflowing the buffer. This seems to be working well for me so far (though I haven't updated test cases).

rytilahti referenced this pull request Aug 27, 2023
The KL-125(US) bulb appears to crash if a request is made to it which
produces an output large than 4096 bytes. This commit splits up the
module update based on the expected response size to avoid making a
request which will overflow the device's output buffer.
@cobryan05
Copy link
Contributor Author

cobryan05 commented Aug 28, 2023

I've been trying to probe buffer sizes before crash and have noticed a couple things while trying to construct queries to generate different sized responses. These are all about request sizes, not the responses.

  1. Sending requests over 1K to KL125-US is immediate crash. Queries over 256 bytes have high chance of it becoming unresponsive. With enough repeated queries, the bulbs always stop responding eventually no matter how small the request.
  2. Query sizes above 350 bytes quickly (though not immediately) crash my two KL125s and my KL130.
  3. Query sizes above ~850 bytes immediately crash my HL103

@rytilahti
Copy link
Member

Thanks for the update, interesting findings which will make the fix easier I suppose. Apparently it does not affect all devices, but I suppose that capping the request sizes for bulbs is much nicer solution than trying to estimate the potential response sizes.

On the repeated requests, do they also cause the device drop from the network? If not, that might be just to protect the device from DoS.

@cobryan05
Copy link
Contributor Author

cobryan05 commented Aug 28, 2023

I suppose that capping the request sizes for bulbs is much nicer solution than trying to estimate the potential response sizes.

Unfortunately I think the 4k output buffer is still a thing. The 'input buffer' is just another way it becomes unresponsive.

On the repeated requests, do they also cause the device drop from the network? If not, that might be just to protect the device from DoS.

This seems to differ as well. If I overflow the output buffer on the KL-125 it still responds to pings but I get 'connection reset by peer' or simply time out when trying to connect to it. Spamming it with lots of connections quickly takes it off the network.

When querying 1/s, it's pretty random when they stop responding. Sometimes I feel like I see pattern, but then more testing breaks the pattern. 1/s usually doesn't take it off the network.

@cobryan05
Copy link
Contributor Author

More testing. It ends up that my test query itself was causing issues. I was sending queries like

{ "0" : None,  "1": None }

padding the request with these until the target size. It ends up that the bulb doesn't like this that much. After processing approximately 720 "None" entries it crashes. I could send { "0": None } approx 720 times before crashing, { "0": None, "1": None" } approx 360 times before crashing. I wonder if maybe it is keeping some sort of internal error log that gets overflowed?

Padding out my requests to { "0": { "0": None } } helps, a lot. I can spam it indefinitely without a crash.

Takeaways:

  • Requests above 870 bytes crash my KL125 and my HL103.
  • The KL125 will crash if it generates an output of > 4096 bytes
  • Given the Request size limit, I can't easily generate arbitrarily large outputs. I have been unable to generate an output above 4096 bytes on anything except the KL-125 to test if they would crash.

I think that the 'output estimate/split requests' is the only thing that is going to be needed for this. I'll try to get it cleaned up and tests updated in the next few days

@cobryan05
Copy link
Contributor Author

I'm hoping with this change that several of the other 'fixes' may end up unnecessary. I might try another PR later to try removing previous possibly-related fixes for testing.

@brianthedavis
Copy link
Contributor

I tested this on my KL125(US) bulbs (1.0.13 Build 220714 Rel.162126 and 1.0.7 Build 211009 Rel.172044) by removing the MODEL_MODULE_SKIPLIST and none of my bulbs crashed; this seems like a more sustainable way to prevent crashes than a hard-coded skip list.

@cobryan05
Copy link
Contributor Author

I think this patchset is good to go from my perspective.

Copy link
Member

@rytilahti rytilahti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor changes, glad to hear it's working for @brianthedavis too!

@cobryan05 would you mind adding some explicit tests for the new SmartDevice and Module API properties, and removing the skiplist functionality? Also, please update the PR description also to include the reasoning and findings leading to this change (if not even adding a word about it to the docs) for future developers.

@rytilahti rytilahti added the bug Something isn't working label Sep 13, 2023
@rytilahti rytilahti changed the title Fix KL-125 (and others?) crashes - Fix typo in module lists Split queries to avoid overflowing device buffers Sep 13, 2023
The KL-125(US) bulb appears to crash if a request is made to it which
produces an output large than 4096 bytes. This commit splits up the
module update based on the expected response size to avoid making a
request which will overflow the device's output buffer.
Move maximum response size from a 'Protocol' property to a
'Device' property, defaulting to 16K.

Some bulbs, eg KL-125US, are known to have smaller buffers, so
bulbs are set to 4k.
Run linter, fix response deep-copying making tests
unhappy, updated call count in emeter test.
This commit removes the module skip list as it seems this workaround
is no longer necessary when splitting up large queries.
@rytilahti rytilahti added enhancement New feature or request and removed bug Something isn't working labels Sep 14, 2023
Copy link
Member

@rytilahti rytilahti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks @cobryan05! 🥇

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants