PyRu - High-performance Python web scraper

CI · OpenSSF Scorecard · PyPI · License: MIT · Python 3.15+ · Rust 1.88+

Quick Start

pip install pyru-scraper
pyru scrape "https://example.com" -s "h1" -o json

Table of contents

Design · Comparison · Install · Features · Usage · API · Benchmarks · Security · Develop · License

Design

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#4a90d9', 'primaryTextColor': '#fff', 'primaryBorderColor': '#4a90d9', 'lineColor': '#666', 'secondaryColor': '#f5f5f5', 'tertiaryColor': '#fff'}}}%%
flowchart LR
  CLI["Python CLI<br/>(argparse, stdlib only)"] -->|await| Bridge["PyO3 bridge<br/>pyru._native"]
  Bridge -->|tokio::spawn| Fetch["reqwest<br/>rustls · HTTP/1.1 + HTTP/2<br/>gzip · brotli · zstd"]
  Bridge -->|semaphore| Pool[["Concurrency gate<br/>Tokio JoinSet + Semaphore"]]
  Fetch --> Body["body bytes"]
  Body -->|spawn_blocking| Parse["scraper<br/>CSS selectors over html5ever"]
  Parse --> Out["(elements, errors, latency_ms)"]
  Out -->|Bound&lt;'py, PyAny&gt;| CLI
  style CLI fill:#e8f4f8,stroke:#4a90d9,stroke-width:2px
  style Bridge fill:#4a90d9,stroke:#333,stroke-width:2px,color:#fff
  style Fetch fill:#d4edda,stroke:#28a745,stroke-width:2px
  style Pool fill:#fff3cd,stroke:#ffc107,stroke-width:2px
  style Parse fill:#f8d7da,stroke:#dc3545,stroke-width:2px
  style Out fill:#d1ecf1,stroke:#17a2b8,stroke-width:2px
  • One async Python entry point. The CLI hands a batch of URLs to a single await scrape_urls_concurrent(...) call. Everything below the await lives in Rust.
  • Per-URL error reporting. A failing URL emits an error string in its slot; the rest of the batch keeps flowing. No early-exit semantics (see the sketch after this list).
  • Tuned reqwest client. TCP_NODELAY, 30s keep-alive, bounded idle pool, configurable timeouts, 10-redirect cap, HTTP/2 negotiated when the peer supports it.
  • Parse off the reactor. HTML goes through scraper inside tokio::task::spawn_blocking so the async runtime stays responsive even when a page is a 5MB <div> soup.
  • MiMalloc global allocator. ~10–20% fewer allocations on the hot path vs glibc's default.
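
The per-URL error contract is easiest to see from Python. Below is a minimal sketch: the unroutable host is deliberate, and it assumes (as the API example further down also does) that a successful slot holds an empty error string.

import asyncio
from pyru import scrape_urls_concurrent

async def main() -> None:
    # One good URL, one deliberately unroutable host (.invalid never resolves).
    urls = ["https://example.com/", "https://no-such-host.invalid/"]
    elements, errors, latency_ms = await scrape_urls_concurrent(
        urls=urls, selector="h1", concurrency=2, timeout_ms=5_000
    )
    # The failing URL leaves an error string in its slot; the good URL still
    # returns its matched elements. The batch never raises or exits early.
    for url, err in zip(urls, errors, strict=True):
        print(f"{url}  {'ok' if not err else err}")

asyncio.run(main())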

Comparison

Aspect           PyRu             httpx + selectolax   Scrapy
Language         Rust + Python    Python               Python
Dependencies     0 (stdlib only)  5+                   20+
Async            Native (tokio)   asyncio              Twisted
Binary size      ~2 MB            ~5 MB                ~30 MB
Latency          <1 ms            5–10 ms              10–50 ms
Memory overhead  MiMalloc         Standard             Highest

PyRu is ideal when you need speed, minimal dependencies, or a Python wrapper around a Rust async core.

Install

pip install pyru-scraper

Note

Pre-built abi3 wheels are available for x86_64 Linux (manylinux 2.28), Intel and Apple Silicon macOS, and Windows. Building from the source distribution requires a Rust toolchain and CPython 3.15+.

Features

  • High throughput — async Rust core with tokio, sub-millisecond latency
  • Zero Python deps — stdlib-only CLI, no runtime bloat
  • Concurrency control — adjustable semaphore-capped parallelism
  • Retry logic — exponential backoff on failure (see the combined example after this list)
  • robots.txt support — optional compliance checking
  • Caching — ETag/Last-Modified conditional requests
  • Proxy support — HTTP/HTTPS proxy configuration
  • Custom headers — per-request header injection
  • SSL toggle — optionally skip certificate verification
  • Multiple output modes — JSON, text, file, stats
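
Most of these knobs are plain keyword arguments on the Python API (see the signature under Python API). A minimal sketch combining several of them; the proxy URL and header values are placeholders, and only the keyword names are taken from that signature.

import asyncio
from pyru import scrape_urls_concurrent

async def main() -> None:
    elements, errors, latency_ms = await scrape_urls_concurrent(
        urls=["https://example.com/"],
        selector="h1",
        retries=3,                      # exponential backoff on failure
        respect_robots_txt=True,        # check robots.txt before fetching
        use_cache=True,                 # ETag/Last-Modified conditional requests
        proxy="http://127.0.0.1:8888",  # placeholder proxy URL
        headers=[("Accept-Language", "en"), ("X-Demo", "1")],  # placeholder headers
    )
    print(elements, errors, latency_ms)

asyncio.run(main())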

Usage

pyru scrape [OPTIONS] URL [URL...]
Flag                       Default   Description
-s, --selector TEXT        required  CSS selector applied to each fetched page.
-o, --output {json,text}   text      Output format.
-c, --concurrency INT      50        Maximum in-flight requests (1–10,000).
-u, --user-agent TEXT      built-in  Override the default User-Agent.
--timeout-ms INT           10000     Total per-request timeout (milliseconds).
--connect-timeout-ms INT   5000      TCP/TLS connect timeout (milliseconds).
-r, --retries INT          0         Retry attempts on failure (0–10).
--respect-robots-txt       false     Obey robots.txt rules before fetching.
--cache                    false     Use ETag/Last-Modified for conditional requests.
--proxy URL                none      HTTP/HTTPS proxy URL.
-H, --header TEXT          none      Custom header (repeatable).
--insecure                 false     Skip SSL verification.
--output-file FILE         stdout    Write output to a file.
--stats                    false     Print summary statistics.
-q, --quiet                false     Suppress informational output.

pyru scrape "https://books.toscrape.com/" -s "h3 > a" -c 200 -o json
Worked example: 50 pages, JSON to jq
pyru scrape $(seq 1 50 | xargs -I{} echo "https://books.toscrape.com/catalogue/page-{}.html") \
  -s "h3 > a" -c 25 -o json \
  | jq -s '[.[] | {url, latency_ms, n: (.elements | length)}]'
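
The same pipeline driven from Python with only the stdlib. This sketch assumes the JSON mode emits one object per line carrying the url, latency_ms, and elements fields used in the jq filter above; that line-per-object framing is an assumption here, not a documented guarantee.

import json
import subprocess

urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 51)]
proc = subprocess.run(
    ["pyru", "scrape", *urls, "-s", "h3 > a", "-c", "25", "-o", "json", "-q"],
    capture_output=True, text=True, check=True,
)
for line in proc.stdout.splitlines():
    if line.strip():  # skip blank lines, parse one JSON object per line
        page = json.loads(line)
        print(f"{page['url']}: {len(page['elements'])} titles, {page['latency_ms']} ms")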

Python API

import asyncio
from pyru import scrape_urls_concurrent

async def main() -> None:
    urls = ["https://example.com/", "https://example.org/"]
    elements, errors, latency_ms = await scrape_urls_concurrent(
        urls=urls,
        selector="h1",
        concurrency=8,
        timeout_ms=5_000,
    )
    for url, els, err, lat in zip(urls, elements, errors, latency_ms, strict=True):
        marker = "ok" if not err else f"err={err!r}"
        print(f"{url}  {lat:>4} ms  {marker}  -> {els}")

asyncio.run(main())

The _native module exports a single async function:

from collections.abc import Sequence

async def scrape_urls_concurrent(
    urls: Sequence[str],
    selector: str,
    concurrency: int = 50,
    user_agent: str | None = None,
    timeout_ms: int = 10_000,
    connect_timeout_ms: int = 5_000,
    retries: int = 0,
    respect_robots_txt: bool = False,
    use_cache: bool = False,
    proxy: str | None = None,
    headers: Sequence[tuple[str, str]] | None = None,
    insecure: bool = False,
) -> tuple[list[list[str]], list[str], list[int]]: ...
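
For code that is not otherwise async, a thin synchronous wrapper is all that is needed. scrape_sync below is a hypothetical helper, not part of PyRu:

import asyncio
from collections.abc import Sequence

from pyru import scrape_urls_concurrent

def scrape_sync(
    urls: Sequence[str], selector: str, **kwargs
) -> tuple[list[list[str]], list[str], list[int]]:
    # Drive the single async entry point from synchronous code.
    return asyncio.run(scrape_urls_concurrent(urls=urls, selector=selector, **kwargs))

elements, errors, latency_ms = scrape_sync(["https://example.com/"], "h1", concurrency=4)
print(elements[0], latency_ms[0])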

Benchmarks

A self-contained harness under benchmarks/ spins up a local aiohttp server and compares PyRu against the most commonly cited pure-Python alternative (httpx + selectolax). Numbers vary with hardware, kernel tunables, and test concurrency — run it yourself and publish the raw output before quoting comparisons.

uv sync --group benchmarks
uv run python benchmarks/real_world_benchmark.py

Warning

Benchmark results published before running the harness on your hardware should be treated as anecdotes, not data.
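
For orientation, the sketch below shows the shape of such a harness: a local aiohttp server plus one timed PyRu run. The real harness is benchmarks/real_world_benchmark.py; the port, page count, and route pattern here are illustrative only.

import asyncio
import time

from aiohttp import web  # pulled in by `uv sync --group benchmarks`, not a runtime dep
from pyru import scrape_urls_concurrent

HTML = "<html><body><h1>bench</h1></body></html>"

async def page(request: web.Request) -> web.Response:
    return web.Response(text=HTML, content_type="text/html")

async def main() -> None:
    # Serve locally so the numbers measure the client stack, not the network.
    app = web.Application()
    app.router.add_get("/{tail:.*}", page)
    runner = web.AppRunner(app)
    await runner.setup()
    await web.TCPSite(runner, "127.0.0.1", 8089).start()

    urls = [f"http://127.0.0.1:8089/page-{i}" for i in range(500)]
    t0 = time.perf_counter()
    _, errors, _ = await scrape_urls_concurrent(urls=urls, selector="h1", concurrency=100)
    elapsed = time.perf_counter() - t0

    ok = sum(1 for err in errors if not err)
    print(f"{ok}/{len(urls)} pages in {elapsed:.2f}s ({ok / elapsed:.0f} req/s)")
    await runner.cleanup()

asyncio.run(main())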

Security

See SECURITY.md for responsible disclosure and supply-chain posture:

  • Zero runtime Python dependencies. The CLI is stdlib-only.
  • In-tree PEP 517 build backend at _build/ — no setuptools, no hatchling, no maturin (see the hook sketch after this list).
  • Rust audit (cargo deny check advisories bans sources) runs on every CI build.
  • Trusted publishing (OIDC) to PyPI — no API tokens on file.
  • Signed commits are enforced on main.
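
For reference, the public surface of an in-tree PEP 517 backend is just the standard hooks below. The names and signatures come from PEP 517 itself, not from _build/, and the actual build logic is elided.

# Shape of a stdlib-only in-tree PEP 517 backend. Wired up in pyproject.toml
# via build-backend plus backend-path = ["_build"]; the logic is elided here.

def build_sdist(sdist_directory: str, config_settings: dict | None = None) -> str:
    """Build a .tar.gz sdist into sdist_directory and return its filename."""
    raise NotImplementedError("the real logic lives in _build/")

def build_wheel(
    wheel_directory: str,
    config_settings: dict | None = None,
    metadata_directory: str | None = None,
) -> str:
    """Build a .whl into wheel_directory and return its filename."""
    raise NotImplementedError("the real logic lives in _build/")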

Develop

uv python install 3.15
uv sync --group dev
uv pip install -e .

Then the full check chain:

uv run ruff check
uv run ruff format --check
uv run python -m unittest discover -s tests -v

cargo fmt --manifest-path native/Cargo.toml --all -- --check
cargo clippy --manifest-path native/Cargo.toml --all-targets -- -D warnings
cargo test --manifest-path native/Cargo.toml --all-targets

Layout at a glance:

.
├─ pyru/         Python package (argparse CLI + type stubs)
├─ native/       Rust crate (pyo3, tokio, reqwest, scraper)
├─ _build/       PEP 517 build backend (stdlib only)
├─ tests/        stdlib unittest suite
└─ benchmarks/   opt-in comparison harness

License

MIT — see LICENSE. Copyright © Andreas Fahl.
