fix: Fix BasicCrawler statistics persistence#1490
Conversation
55e7316 to
ebc350d
Compare
vdusek
left a comment
There was a problem hiding this comment.
I'm surprised we use the SDK_CRAWLER_STATISTICS_... key for state persistence. Why is the SDK prefix in Crawlee? Also, since this is internal, we use a double-underscore prefix (__STORAGE_ALIASES_MAPPING, __RQ_STATE_...) for other cases. Could we update the key name, please?
tests/unit/test_configuration.py
Outdated
| crawler = HttpCrawler( | ||
| configuration=configuration, | ||
| storage_client=storage_client, | ||
| ) | ||
| service_locator.set_configuration(configuration) | ||
| service_locator.set_storage_client(storage_client) | ||
|
|
||
| crawler = HttpCrawler() |
There was a problem hiding this comment.
This is because RecoverableState of statistics persists to/recovers from global storage_client. And since statistics is persisted by default now, it will try to persist to default global service_client, which is FileSystem... regardless of the crawler-specific storage_client
Mentioned here:
#1438 (comment)
I am open to discussion about this.
There was a problem hiding this comment.
Couldn't we use the storage client passed to the crawler?
There was a problem hiding this comment.
We could, but do we want to? I had an inconclusive discussion about this with @janbuchar
I am still not sure about this.
There was a problem hiding this comment.
Honestly, I'm kinda disappointed with the amount of edge cases that arose from having a separate service locator for crawlers.
From a "common sense" perspective, the RecoverableState is owned by the crawler and it doesn't make much sense to put the serialized state in a different storage (the global one). Then again, there's a good chance that the crawler-wide storage client will be a memory storage, which is not a great fit for RecoverableState.
But, unless I'm missing something, it should be super rare that somebody will do this intentionally. In my opinion, we should pick one of these options and just show a warning if both the global and crawler-specific storage client are configured.
TODO: Figure out reason for stats difference in request_total_finished_duration
BasicCrawler statistics persistanceBasicCrawler statistics persistence
janbuchar
left a comment
There was a problem hiding this comment.
LGTM. Please look at my comment about service locators and do whatever you deem appropriate.
7ba7190 to
00e65ec
Compare
Description
BasicCrawleris persisting statistics by default.BasicCrawleris recovering existing statistics by default ifConfiguration.purge_on_startis False.BasicCrawleremitEvent.PERSIST_STATEwhen finishing.Issues
Testing