fix: Fix `KeyValueStore.auto_saved_value` failing in some scenarios by Pijukatel · Pull Request #1438 · apify/crawlee-python

Pijukatel · 2025-09-30T12:30:56Z

Description

Reduce the amount of global side effects in service_locator by using an explicit KVS factory in RecoverableState.
Fix KeyValueStore.auto_saved_value not working properly if the global storage_client was different from the current kvs storage client.
Improve test isolation.

Issues

Closes: All crawlee persistence names should be valid names for Apify platform #1354

Testing

Added tests for some edge cases.

Explicit kvs to RecoverableState

Pijukatel · 2025-09-30T12:31:52Z

This PR replaces #1368. That PR included many changes that were already released in Crawlee v1.0, so continuing there would have made the review harder to follow.

vdusek

I wouldn't be afraid to call it a fix if it's (according to the description) fixing something, and then it should be included in the changelog.

But also, I'm not 100% sure I understand why we're doing this:

Reduce the amount of global side effects in service_locator by using an explicit KVS factory in RecoverableState.

What side effects?

Fix KeyValueStore.auto_saved_value not working properly if the global storage_client was different from the current kvs storage client.

👍

vdusek · 2025-10-01T12:14:07Z

src/crawlee/storage_clients/_file_system/_request_queue_client.py

                    metadata=metadata,
                    path_to_rq=path_to_rq,
                    lock=asyncio.Lock(),
+                    recoverable_state=await cls._create_recoverable_state(id=metadata.id, configuration=configuration),


Instead of creating it three times, we can create it once, store it in a variable, and just pass it where needed.

Creating it requires the metadata (to get the RQ id), and each of the three branches obtains the metadata in a different way, so the call has to be repeated in each branch.

janbuchar · 2025-10-01T14:49:13Z

src/crawlee/storage_clients/_file_system/_request_queue_client.py

+            from crawlee.storage_clients import FileSystemStorageClient  # noqa: PLC0415 avoid circular import
+            from crawlee.storages import KeyValueStore  # noqa: PLC0415 avoid circular import
+
+            return await KeyValueStore.open(storage_client=FileSystemStorageClient(), configuration=configuration)


Creating a fresh filesystem storage client to open a request queue feels wrong - at this point, we can be pretty sure that another one already exists. Is there a specific reason to do this or is it just because you don't have access to the existing one?

At that point, we cannot be sure it exists. The request queue client could have been created through a class method without any storage client, and even when it was created with the help of a storage client, that client is out of scope here:
await FileSystemRequestQueueClient.open(...)

And why not open the KVS through such a class method as well? Because that way, you bypass the storage instance manager - and that is generally something we do not want.

FileSystemStorageClient is just a helper factory class, which is mainly for convenience and for registering the storage instance manager.
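The point about bypassing the storage instance manager can be illustrated with a minimal sketch. `ToyQueueClient` and `ToyInstanceManager` are hypothetical stand-ins written for this example only; they are not crawlee's classes, and the real manager is async and keyed differently. The sketch only shows why opening through a manager and opening through a bare class method behave differently.

```python
from typing import Dict


class ToyQueueClient:
    """Hypothetical stand-in for a storage client opened via a class method."""

    def __init__(self, name: str) -> None:
        self.name = name

    @classmethod
    def open(cls, name: str) -> 'ToyQueueClient':
        # Every direct call builds a brand new, untracked instance.
        return cls(name)


class ToyInstanceManager:
    """Hypothetical stand-in for an instance manager that caches by name."""

    def __init__(self) -> None:
        self._cache: Dict[str, ToyQueueClient] = {}

    def open(self, name: str) -> ToyQueueClient:
        # Reuse the cached instance if one was already opened under this name.
        if name not in self._cache:
            self._cache[name] = ToyQueueClient.open(name)
        return self._cache[name]


manager = ToyInstanceManager()
via_manager_a = manager.open('default')
via_manager_b = manager.open('default')
direct = ToyQueueClient.open('default')
```

Here `via_manager_a` and `via_manager_b` are the same object, while `direct` is a separate instance that the manager knows nothing about, which is the kind of bypass the reply argues against.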

janbuchar · 2025-10-01T14:51:33Z

tests/unit/storage_clients/_file_system/test_fs_rq_client.py

            assert request_data['url'].startswith('https://example.com/')


+async def test_opening_rq_does_not_have_side_effect_on_service_locator(


Can you add some explanation here? Also, inlining the rq_client fixture could make the code more readable.

janbuchar · 2025-10-01T14:55:11Z

tests/unit/conftest.py


+        # Reset global class variables to ensure test isolation.
+        KeyValueStore._autosaved_values = {}
+        Statistics._Statistics__next_id = 0  # type:ignore[attr-defined] # Mangled attribute


How does this contribute towards test isolation? Is there anything that depends on the persist state key that is derived from the ID?

The answer is that I am not sure. But we have so many tests that I think it is best to restore everything we can to the same state at the beginning of each test. This reduces the chance of weird behavior that depends on the order of test execution.
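A minimal sketch of the kind of reset discussed here, using toy classes rather than crawlee's own. `ToyStatistics`, `ToyStore`, `reset_class_state` and `run_isolated` are hypothetical names introduced for illustration. The sketch shows how class-level state leaks between tests and how a double-underscore class attribute is reached from outside through its mangled name, as in the `Statistics._Statistics__next_id` line of the diff.

```python
from typing import Any, Callable, Dict


class ToyStatistics:
    """Hypothetical stand-in: hands out ids from a class-level counter."""

    # Inside the class body this name is mangled to _ToyStatistics__next_id.
    __next_id = 0

    def __init__(self) -> None:
        self.id = ToyStatistics.__next_id
        ToyStatistics.__next_id += 1


class ToyStore:
    """Hypothetical stand-in: keeps a class-level cache shared by all users."""

    _autosaved_values: Dict[str, Any] = {}


def reset_class_state() -> None:
    # From outside the class, the counter is only reachable via the mangled name.
    ToyStatistics._ToyStatistics__next_id = 0
    ToyStore._autosaved_values = {}


def run_isolated(test_body: Callable[[], Any]) -> Any:
    # Mimics a fixture that resets global class state before each test.
    reset_class_state()
    return test_body()


first = run_isolated(lambda: ToyStatistics().id)
second = run_isolated(lambda: ToyStatistics().id)
leaked = ToyStatistics().id  # no reset before this one
```

With the reset in place both `first` and `second` see the same starting id, while `leaked` observes the counter left behind by the previous run. In a pytest suite the reset would typically live in a fixture; that wiring is left out here.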

Pijukatel · 2025-10-02T07:50:16Z

I wouldn't be afraid to call it a fix if it's (according to the description) fixing something, and then it should be included in the changelog.

But also, I'm not 100% sure I understand why we're doing this:

Reduce the amount of global side effects in service_locator by using an explicit KVS factory in RecoverableState.

What side effects?

Fix KeyValueStore.auto_saved_value not working properly if the global storage_client was different from the current kvs storage client.

👍

Each time you use RecoverableState without passing an explicit persist_state_kvs_factory, the service_locator is used to get the current global storage_client. If no global storage_client has been set yet, one is created and set, and this is the side effect I want to minimize. In many cases, a specific RecoverableState should not use the global storage_client at all, because doing so makes no sense there.

There are other cases where I am not sure, for example Statistics, so I did not touch those; I changed only the ones that I am 100% sure about.
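The side effect described above can be sketched with toy classes. Everything in this block (`ToyServiceLocator`, `ToyRecoverableState`, `ToyStore`, `open_file_system_store`, and the client names) is a hypothetical stand-in chosen for illustration; it is not crawlee's implementation, and the locator is passed in as an argument only to keep the example repeatable.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple


class ToyStore:
    """Hypothetical stand-in for a key-value store bound to one client."""

    def __init__(self, client_name: str) -> None:
        self.client_name = client_name


class ToyServiceLocator:
    """Hypothetical stand-in for the global service locator."""

    def __init__(self) -> None:
        self.storage_client: Optional[str] = None

    def get_storage_client(self) -> str:
        # Lazily creating and registering a default is the global side effect.
        if self.storage_client is None:
            self.storage_client = 'default-client'
        return self.storage_client


class ToyRecoverableState:
    """Hypothetical stand-in: uses an explicit factory if one is given."""

    def __init__(
        self,
        locator: ToyServiceLocator,
        kvs_factory: Optional[Callable[[], Awaitable[ToyStore]]] = None,
    ) -> None:
        self._locator = locator
        self._kvs_factory = kvs_factory

    async def open_store(self) -> ToyStore:
        if self._kvs_factory is not None:
            # Explicit factory: the locator is never consulted.
            return await self._kvs_factory()
        # Fallback: touches the locator and may set its storage client.
        return ToyStore(self._locator.get_storage_client())


async def open_file_system_store() -> ToyStore:
    return ToyStore('file-system-client')


async def demo() -> Tuple[str, Optional[str], str, Optional[str]]:
    locator = ToyServiceLocator()
    explicit = await ToyRecoverableState(locator, open_file_system_store).open_store()
    client_after_explicit = locator.storage_client
    implicit = await ToyRecoverableState(locator).open_store()
    client_after_implicit = locator.storage_client
    return (
        explicit.client_name,
        client_after_explicit,
        implicit.client_name,
        client_after_implicit,
    )
```

Running `asyncio.run(demo())` in this sketch leaves the locator untouched after the explicit-factory call and populated after the implicit one, which mirrors the distinction drawn in the comment.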

vdusek

LGTM

Draft of minimizing side effects

ac9c95f

Explicit kvs to RecoverableState

github-actions bot assigned Pijukatel Sep 30, 2025

github-actions bot added this to the 124th sprint - Tooling team milestone Sep 30, 2025

github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Sep 30, 2025

Pijukatel mentioned this pull request Sep 30, 2025

feat: Update RecoverableState and StorageInstanceManager to ensure proper persistence #1368

Closed

Use factory method instead of the explicit kvs

537ed1a

Pijukatel force-pushed the minimize-global-service-locator-side-effects branch from 97b11d4 to 537ed1a Compare September 30, 2025 12:33

Pijukatel requested review from janbuchar and vdusek September 30, 2025 12:36

Pijukatel marked this pull request as ready for review September 30, 2025 13:14

vdusek reviewed Oct 1, 2025


vdusek mentioned this pull request Oct 1, 2025

fix: Fix memory leak in PlaywrightCrawler on browser context creation #1446

Merged

janbuchar reviewed Oct 1, 2025


Make test more readable

ab62d6c

Pijukatel changed the title ~~refactor: Minimize global service locator side effects~~ fix: Fix KeyValueStore.auto_saved_value failing in some scenarios Oct 2, 2025

Pijukatel requested review from janbuchar and vdusek October 2, 2025 07:52

vdusek approved these changes Oct 2, 2025


Merge branch 'master' into minimize-global-service-locator-side-effects

faec6e0

Pijukatel merged commit b35dee7 into master Oct 16, 2025
19 checks passed

Pijukatel deleted the minimize-global-service-locator-side-effects branch October 16, 2025 07:29

Pijukatel mentioned this pull request Oct 17, 2025

fix: Fix BasicCrawler statistics persistence #1490

Merged

Pijukatel merged 4 commits into master from minimize-global-service-locator-side-effects
