PyRu - High-performance Python web scraper

CI · OpenSSF Scorecard · PyPI · License: MIT · Python 3.15+ · Rust 1.88+

Quick Start

pip install pyru-scraper
pyru scrape "https://example.com" -s "h1" -o json

Table of contents

Design · Comparison · Install · Features · Usage · API · Benchmarks · Security · Develop · License

Design

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#4a90d9', 'primaryTextColor': '#fff', 'primaryBorderColor': '#4a90d9', 'lineColor': '#666', 'secondaryColor': '#f5f5f5', 'tertiaryColor': '#fff'}}}%%
flowchart LR
  CLI["Python CLI<br/>(argparse, stdlib only)"] -->|await| Bridge["PyO3 bridge<br/>pyru._native"]
  Bridge -->|tokio::spawn| Fetch["reqwest<br/>rustls · HTTP/1.1 + HTTP/2<br/>gzip · brotli · zstd"]
  Bridge -->|semaphore| Pool[["Concurrency gate<br/>Tokio JoinSet + Semaphore"]]
  Fetch --> Body["body bytes"]
  Body -->|spawn_blocking| Parse["scraper<br/>CSS selectors over html5ever"]
  Parse --> Out["(elements, errors, latency_ms)"]
  Out -->|Bound&lt;'py, PyAny&gt;| CLI
  style CLI fill:#e8f4f8,stroke:#4a90d9,stroke-width:2px
  style Bridge fill:#4a90d9,stroke:#333,stroke-width:2px,color:#fff
  style Fetch fill:#d4edda,stroke:#28a745,stroke-width:2px
  style Pool fill:#fff3cd,stroke:#ffc107,stroke-width:2px
  style Parse fill:#f8d7da,stroke:#dc3545,stroke-width:2px
  style Out fill:#d1ecf1,stroke:#17a2b8,stroke-width:2px
  • One async Python entry point. The CLI hands a batch of URLs to a single await scrape_urls_concurrent(...) call. Everything below the await lives in Rust.
  • Per-URL error reporting. A failing URL emits an error string in its slot; the rest of the batch keeps flowing. No early-exit semantics (see the sketch after this list).
  • Tuned reqwest client. TCP_NODELAY, 30s keep-alive, bounded idle pool, configurable timeouts, 10-redirect cap, HTTP/2 negotiated when the peer supports it.
  • Parse off the reactor. HTML goes through scraper inside tokio::task::spawn_blocking so the async runtime stays responsive even when a page is a 5MB <div> soup.
  • MiMalloc global allocator. ~10–20% fewer allocations on the hot path vs glibc's default.
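
The per-URL error contract is easiest to see from Python. Below is a minimal sketch: the unroutable host is deliberate, and it assumes (as the API example further down also does) that a successful slot holds an empty error string.

import asyncio
from pyru import scrape_urls_concurrent

async def main() -> None:
    # One good URL, one deliberately unroutable host (.invalid never resolves).
    urls = ["https://example.com/", "https://no-such-host.invalid/"]
    elements, errors, latency_ms = await scrape_urls_concurrent(
        urls=urls, selector="h1", concurrency=2, timeout_ms=5_000
    )
    # The failing URL leaves an error string in its slot; the good URL still
    # returns its matched elements. The batch never raises or exits early.
    for url, err in zip(urls, errors, strict=True):
        print(f"{url}  {'ok' if not err else err}")

asyncio.run(main())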

Comparison

Aspect           PyRu             httpx + selectolax   Scrapy
Language         Rust + Python    Python               Python
Dependencies     0 (stdlib only)  5+                   20+
Async            Native (tokio)   asyncio              Twisted
Binary size      ~2 MB            ~5 MB                ~30 MB
Latency          <1 ms            5–10 ms              10–50 ms
Memory overhead  MiMalloc         Standard             Highest

PyRu is ideal when you need speed, minimal dependencies, or a Python wrapper around a Rust async core.

Install

pip install pyru-scraper

Note

Pre-built abi3 wheels are available for x86_64 Linux (manylinux 2.28), Intel and Apple Silicon macOS, and Windows. Building from the source distribution requires a Rust toolchain and CPython 3.15+.

Features

  • High throughput — async Rust core with tokio, sub-millisecond latency
  • Zero Python deps — stdlib-only CLI, no runtime bloat
  • Concurrency control — adjustable semaphore-capped parallelism
  • Retry logic — exponential backoff on failure (see the combined example after this list)
  • robots.txt support — optional compliance checking
  • Caching — ETag/Last-Modified conditional requests
  • Proxy support — HTTP/HTTPS proxy configuration
  • Custom headers — per-request header injection
  • SSL toggle — optionally skip certificate verification
  • Multiple output modes — JSON, text, file, stats
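
Most of these knobs are plain keyword arguments on the Python API (see the signature under Python API). A minimal sketch combining several of them; the proxy URL and header values are placeholders, and only the keyword names are taken from that signature.

import asyncio
from pyru import scrape_urls_concurrent

async def main() -> None:
    elements, errors, latency_ms = await scrape_urls_concurrent(
        urls=["https://example.com/"],
        selector="h1",
        retries=3,                      # exponential backoff on failure
        respect_robots_txt=True,        # check robots.txt before fetching
        use_cache=True,                 # ETag/Last-Modified conditional requests
        proxy="http://127.0.0.1:8888",  # placeholder proxy URL
        headers=[("Accept-Language", "en"), ("X-Demo", "1")],  # placeholder headers
    )
    print(elements, errors, latency_ms)

asyncio.run(main())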

Usage

pyru scrape [OPTIONS] URL [URL...]
Flag                       Default   Description
-s, --selector TEXT        required  CSS selector applied to each fetched page.
-o, --output {json,text}   text      Output format.
-c, --concurrency INT      50        Maximum in-flight requests (1–10,000).
-u, --user-agent TEXT      built-in  Override the default User-Agent.
--timeout-ms INT           10000     Total per-request timeout (milliseconds).
--connect-timeout-ms INT   5000      TCP/TLS connect timeout (milliseconds).
-r, --retries INT          0         Retry attempts on failure (0–10).
--respect-robots-txt       false     Obey robots.txt rules before fetching.
--cache                    false     Use ETag/Last-Modified for conditional requests.
--proxy URL                none      HTTP/HTTPS proxy URL.
-H, --header TEXT          none      Custom header (repeatable).
--insecure                 false     Skip SSL verification.
--output-file FILE         stdout    Write output to a file.
--stats                    false     Print summary statistics.
-q, --quiet                false     Suppress informational output.

pyru scrape "https://books.toscrape.com/" -s "h3 > a" -c 200 -o json
Worked example: 50 pages, JSON to jq
pyru scrape $(seq 1 50 | xargs -I{} echo "https://books.toscrape.com/catalogue/page-{}.html") \
  -s "h3 > a" -c 25 -o json \
  | jq -s '[.[] | {url, latency_ms, n: (.elements | length)}]'
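
The same pipeline driven from Python with only the stdlib. This sketch assumes the JSON mode emits one object per line carrying the url, latency_ms, and elements fields used in the jq filter above; that line-per-object framing is an assumption here, not a documented guarantee.

import json
import subprocess

urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 51)]
proc = subprocess.run(
    ["pyru", "scrape", *urls, "-s", "h3 > a", "-c", "25", "-o", "json", "-q"],
    capture_output=True, text=True, check=True,
)
for line in proc.stdout.splitlines():
    if line.strip():  # skip blank lines, parse one JSON object per line
        page = json.loads(line)
        print(f"{page['url']}: {len(page['elements'])} titles, {page['latency_ms']} ms")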

Python API

import asyncio
from pyru import scrape_urls_concurrent

async def main() -> None:
    urls = ["https://example.com/", "https://example.org/"]
    elements, errors, latency_ms = await scrape_urls_concurrent(
        urls=urls,
        selector="h1",
        concurrency=8,
        timeout_ms=5_000,
    )
    for url, els, err, lat in zip(urls, elements, errors, latency_ms, strict=True):
        marker = "ok" if not err else f"err={err!r}"
        print(f"{url}  {lat:>4} ms  {marker}  -> {els}")

asyncio.run(main())

The _native module exports a single async function:

from collections.abc import Sequence

async def scrape_urls_concurrent(
    urls: Sequence[str],
    selector: str,
    concurrency: int = 50,
    user_agent: str | None = None,
    timeout_ms: int = 10_000,
    connect_timeout_ms: int = 5_000,
    retries: int = 0,
    respect_robots_txt: bool = False,
    use_cache: bool = False,
    proxy: str | None = None,
    headers: Sequence[tuple[str, str]] | None = None,
    insecure: bool = False,
) -> tuple[list[list[str]], list[str], list[int]]: ...
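
For code that is not otherwise async, a thin synchronous wrapper is all that is needed. scrape_sync below is a hypothetical helper, not part of PyRu:

import asyncio
from collections.abc import Sequence

from pyru import scrape_urls_concurrent

def scrape_sync(
    urls: Sequence[str], selector: str, **kwargs
) -> tuple[list[list[str]], list[str], list[int]]:
    # Drive the single async entry point from synchronous code.
    return asyncio.run(scrape_urls_concurrent(urls=urls, selector=selector, **kwargs))

elements, errors, latency_ms = scrape_sync(["https://example.com/"], "h1", concurrency=4)
print(elements[0], latency_ms[0])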

Benchmarks

A self-contained harness under benchmarks/ spins up a local aiohttp server and compares PyRu against the most commonly cited pure-Python alternative (httpx + selectolax). Numbers vary with hardware, kernel tunables, and test concurrency — run it yourself and publish the raw output before quoting comparisons.

uv sync --group benchmarks
uv run python benchmarks/real_world_benchmark.py

Warning

Benchmark results published before running the harness on your hardware should be treated as anecdotes, not data.
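
For orientation, the sketch below shows the shape of such a harness: a local aiohttp server plus one timed PyRu run. The real harness is benchmarks/real_world_benchmark.py; the port, page count, and route pattern here are illustrative only.

import asyncio
import time

from aiohttp import web  # pulled in by `uv sync --group benchmarks`, not a runtime dep
from pyru import scrape_urls_concurrent

HTML = "<html><body><h1>bench</h1></body></html>"

async def page(request: web.Request) -> web.Response:
    return web.Response(text=HTML, content_type="text/html")

async def main() -> None:
    # Serve locally so the numbers measure the client stack, not the network.
    app = web.Application()
    app.router.add_get("/{tail:.*}", page)
    runner = web.AppRunner(app)
    await runner.setup()
    await web.TCPSite(runner, "127.0.0.1", 8089).start()

    urls = [f"http://127.0.0.1:8089/page-{i}" for i in range(500)]
    t0 = time.perf_counter()
    _, errors, _ = await scrape_urls_concurrent(urls=urls, selector="h1", concurrency=100)
    elapsed = time.perf_counter() - t0

    ok = sum(1 for err in errors if not err)
    print(f"{ok}/{len(urls)} pages in {elapsed:.2f}s ({ok / elapsed:.0f} req/s)")
    await runner.cleanup()

asyncio.run(main())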

Security

See SECURITY.md for responsible disclosure and supply-chain posture:

  • Zero runtime Python dependencies. The CLI is stdlib-only.
  • In-tree PEP 517 build backend at _build/ — no setuptools, no hatchling, no maturin (see the hook sketch after this list).
  • Rust audit (cargo deny check advisories bans sources) runs on every CI build.
  • Trusted publishing (OIDC) to PyPI — no API tokens on file.
  • Signed commits are enforced on main.
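
For reference, the public surface of an in-tree PEP 517 backend is just the standard hooks below. The names and signatures come from PEP 517 itself, not from _build/, and the actual build logic is elided.

# Shape of a stdlib-only in-tree PEP 517 backend. Wired up in pyproject.toml
# via build-backend plus backend-path = ["_build"]; the logic is elided here.

def build_sdist(sdist_directory: str, config_settings: dict | None = None) -> str:
    """Build a .tar.gz sdist into sdist_directory and return its filename."""
    raise NotImplementedError("the real logic lives in _build/")

def build_wheel(
    wheel_directory: str,
    config_settings: dict | None = None,
    metadata_directory: str | None = None,
) -> str:
    """Build a .whl into wheel_directory and return its filename."""
    raise NotImplementedError("the real logic lives in _build/")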

Develop

uv python install 3.15
uv sync --group dev
uv pip install -e .

Then the full check chain:

uv run ruff check
uv run ruff format --check
uv run python -m unittest discover -s tests -v

cargo fmt --manifest-path native/Cargo.toml --all -- --check
cargo clippy --manifest-path native/Cargo.toml --all-targets -- -D warnings
cargo test --manifest-path native/Cargo.toml --all-targets

Layout at a glance:

.
├─ pyru/         Python package (argparse CLI + type stubs)
├─ native/       Rust crate (pyo3, tokio, reqwest, scraper)
├─ _build/       PEP 517 build backend (stdlib only)
├─ tests/        stdlib unittest suite
└─ benchmarks/   opt-in comparison harness

License

MIT — see LICENSE. Copyright © Andreas Fahl.
