feat: Add RedisStorageClient based on Redis v8.0+ #1406
Conversation
Performance test.

1 client

Code to run:

```python
import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.storage_clients import RedisStorageClient

CONNECTION = 'redis://localhost:6379'


async def main() -> None:
    storage_client = RedisStorageClient(connection_string=CONNECTION)
    http_client = HttpxHttpClient()

    crawler = ParselCrawler(
        storage_client=storage_client,
        http_client=http_client,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing URL: {context.request.url}...')
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

[ParselCrawler] INFO Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished │ 2363 │
│ requests_failed │ 0 │
│ retry_histogram │ [2363] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ 358.4ms │
│ requests_finished_per_minute │ 3545 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 14min 6.8s │
│ requests_total │ 2363 │
│ crawler_runtime │ 39.99s │
└───────────────────────────────┴────────────┘

3 clients

Code to run:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

from crawlee import ConcurrencySettings, service_locator
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.storage_clients import RedisStorageClient
from crawlee.storages import RequestQueue

CONNECTION = 'redis://localhost:6379'


async def run(queue_name: str) -> None:
    storage_client = RedisStorageClient(connection_string=CONNECTION)
    service_locator.set_storage_client(storage_client)

    queue = await RequestQueue.open(name=queue_name)
    http_client = HttpxHttpClient()

    crawler = ParselCrawler(
        http_client=http_client,
        request_manager=queue,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing URL: {context.request.url}...')
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


def process_run(queue_name: str) -> None:
    asyncio.run(run(queue_name))


def multi_run(queue_name: str = 'multi') -> None:
    workers = 3
    with ProcessPoolExecutor(max_workers=workers) as executor:
        executor.map(process_run, [queue_name for _ in range(workers)])


if __name__ == '__main__':
    multi_run()
```

[ParselCrawler] INFO Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished │ 779 │
│ requests_failed │ 0 │
│ retry_histogram │ [779] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ 356.9ms │
│ requests_finished_per_minute │ 2996 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 4min 38.0s │
│ requests_total │ 779 │
│ crawler_runtime │ 15.60s │
└───────────────────────────────┴────────────┘
[ParselCrawler] INFO Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished │ 762 │
│ requests_failed │ 0 │
│ retry_histogram │ [762] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ 360.0ms │
│ requests_finished_per_minute │ 2931 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 4min 34.3s │
│ requests_total │ 762 │
│ crawler_runtime │ 15.60s │
└───────────────────────────────┴────────────┘
[ParselCrawler] INFO Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished │ 822 │
│ requests_failed │ 0 │
│ retry_histogram │ [822] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ 342.2ms │
│ requests_finished_per_minute │ 3161 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 4min 41.3s │
│ requests_total │ 822 │
│ crawler_runtime │ 15.60s │
└───────────────────────────────┴────────────┘
Since a Bloom filter is a probabilistic data structure, the final size of the structure is affected by the error probability used. Memory consumption for records in the format 'https://crawlee.dev/{i}' (record size doesn't matter for Bloom filters):

Redis Bloom filter:
Redis set:
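For a concrete sense of the comparison above, here is a minimal sketch (not from the PR) of both deduplication approaches via redis-py; the key names and parameters are illustrative, and the BF.* commands require Redis 8+ (or the RedisBloom module on older servers).

```python
import asyncio

from redis.asyncio import Redis


async def main() -> None:
    redis = Redis.from_url('redis://localhost:6379')

    # Probabilistic: reserve a Bloom filter with an error rate and expected
    # capacity; BF.ADD returns 1 only when the item was (probably) unseen.
    await redis.execute_command('BF.RESERVE', 'dedup:bloom', 0.001, 1_000_000)
    is_new = await redis.execute_command('BF.ADD', 'dedup:bloom', 'https://crawlee.dev/1')

    # Exact: a plain set. SADD also returns 1 for unseen members, but every
    # URL is stored in full, so memory grows linearly with unique URLs.
    is_new_exact = await redis.sadd('dedup:set', 'https://crawlee.dev/1')

    print(is_new, is_new_exact)
    await redis.aclose()


asyncio.run(main())
```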
Discussion about whether it's worth pursuing this approach is welcome!
I haven't read the PR yet, but I did look into Bloom filters for request deduplication in the past, and what you wrote piqued my interest 🙂 I am a little worried about the chance of dropping a URL completely, even with a very small probability. Perhaps we should default to a solution that tolerates some percentage of the "opposite" errors and allows a URL to get processed multiple times in rare cases. A fixed-size hash table is an example of such a data structure. I don't know if anything more sophisticated exists. But maybe I have an irrational fear of probabilistic stuff 🙂
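For illustration, a hedged sketch of that idea (entirely hypothetical, not part of the PR): a fixed-size table that stores the URL itself in each slot, so a collision can only evict an entry and cause a re-crawl, never a silent drop.

```python
import hashlib


class FixedSizeSeenTable:
    """Deduplication with bounded memory that errs toward re-processing."""

    def __init__(self, size: int = 1 << 20) -> None:
        self._slots: list[str | None] = [None] * size
        self._size = size

    def check_and_add(self, url: str) -> bool:
        """Return True if the URL should be processed (possibly again)."""
        digest = hashlib.sha1(url.encode()).digest()
        slot = int.from_bytes(digest[:8], 'big') % self._size
        if self._slots[slot] == url:
            return False  # exact match stored in the slot: definitely seen
        # Either empty or a colliding URL: overwrite. The evicted URL may be
        # processed again later, but no URL is ever falsely marked as seen.
        self._slots[slot] = url
        return True
```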
Yes, I agree that this may be a little unsettling, and if we go down this route, it will need to be highlighted separately for the user. But perhaps I am not sufficiently afraid of probabilistic structures, as I have used them before. 🙂
Since we have already added the ability to parameterize queue behavior in the SDK (…)
It also works with:
Thanks for checking that out! It's a real surprise to me that the Redis client is fully compatible with these.
Pijukatel left a comment:
Just some small comments and questions. I am really looking forward to using this. Good work.
janbuchar left a comment:
I think we should reach out to somebody who knows Redis better to review this for us. Or do you feel confident @Pijukatel? 😁
I think we should mention that Redis persistence is unlike that of filesystem or SQL storage and link to https://redis.io/docs/latest/operate/oss_and_stack/management/persistence/
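As context for that doc note, a hedged illustration of the standard persistence knobs the linked page describes, set here via redis-py for brevity (production setups would normally configure these in redis.conf instead):

```python
from redis import Redis

r = Redis.from_url('redis://localhost:6379')

# Append-only file: a replay log, fsynced every second.
r.config_set('appendonly', 'yes')
r.config_set('appendfsync', 'everysec')

# RDB snapshots: dump after 60s if at least 1000 keys changed.
r.config_set('save', '60 1000')
```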
```python
return await response if isinstance(response, Awaitable) else response
```

```python
def read_lua_script(file_path: Path) -> str:
```
Couldn't this accept just the file name and prepend the path to the lua_scripts directory automatically?
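A hedged sketch of what that suggestion might look like (the directory constant and layout are assumptions, not the PR's actual code):

```python
from pathlib import Path

# Assumed layout: Lua scripts shipped in a `lua_scripts` dir next to this module.
_LUA_SCRIPTS_DIR = Path(__file__).parent / 'lua_scripts'


def read_lua_script(file_name: str) -> str:
    """Read a bundled Lua script by file name only."""
    return (_LUA_SCRIPTS_DIR / file_name).read_text(encoding='utf-8')
```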
```python
# Call the notification only once
warnings.warn(
    'The RedisStorageClient is experimental and may change or be removed in future releases.',
```
I don't think we'll want to remove it 🙂 The storage "schema" could change, though - perhaps we should mention that.
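One possible rewording along those lines (illustrative phrasing, not the PR's final text):

```python
import warnings

warnings.warn(
    'The RedisStorageClient is experimental; its storage schema may change in future releases.',
    stacklevel=2,
)
```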
```python
    Returns:
        An instance for the opened or created storage client.
    """
    internal_name = name or alias or cls._DEFAULT_NAME
```
I don't understand this - does it mean that there is no difference in the behavior of named and aliased storages?
For an alias, the metadata object has name=None.
internal_name is used to form the key prefix, just as FileSystemStorageClient uses it to form the folder name.
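To restate that behavior as a hedged sketch (the helper and default are hypothetical stand-ins): both name and alias feed the key prefix, while only name is persisted in metadata.

```python
DEFAULT_NAME = 'default'


def internal_storage_name(name: str | None, alias: str | None) -> str:
    # Either a name or an alias selects the Redis key prefix...
    return name or alias or DEFAULT_NAME


# ...but an aliased storage would still carry name=None in its metadata.
assert internal_storage_name(None, 'my-alias') == 'my-alias'
assert internal_storage_name('persistent', None) == 'persistent'
assert internal_storage_name(None, None) == DEFAULT_NAME
```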
Feel free to invite someone more experienced with Redis to review; I was mainly focusing on the Python part in the review. Anyway, I tried running it a little and did not see anything wrong. It is also optional and experimental, so I would not be afraid to release it and improve it as we go. No existing user should be affected by this, even if there is some hidden bug.
vdusek left a comment:
Looks great! I tested it locally with a few simple crawlers using a local Redis instance in Docker, and everything seems to be working... I just have a few comments.
@JuanGalilea did you have a chance to look into this?
@Mantisus Could you please resolve the conflicts? Once that's done, we'll merge it.
Description

This PR implements a storage client, RedisStorageClient, based on Redis v8+. The minimum version requirement exists because all the data structures used are available in Redis Open Source only starting from version 8, without any additional extensions.

Testing

fakeredis is used for testing.
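A minimal sketch of what testing against fakeredis can look like (fixture and test names are illustrative, and an async test runner such as pytest-asyncio is assumed):

```python
import pytest
from fakeredis import FakeAsyncRedis


@pytest.fixture
async def redis_client() -> FakeAsyncRedis:
    # In-memory stand-in for a real Redis server; no Docker or daemon needed.
    return FakeAsyncRedis()


async def test_set_get_roundtrip(redis_client: FakeAsyncRedis) -> None:
    await redis_client.set('key', 'value')
    assert await redis_client.get('key') == b'value'
```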