```shell
pip install pyru-scraper
pyru scrape "https://example.com" -s "h1" -o json
```

| Design · Comparison · Features · Install · Usage · API · Benchmarks · Security · Develop · License |
|---|
## Design

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#4a90d9', 'primaryTextColor': '#fff', 'primaryBorderColor': '#4a90d9', 'lineColor': '#666', 'secondaryColor': '#f5f5f5', 'tertiaryColor': '#fff'}}}%%
flowchart LR
    CLI["Python CLI<br/>(argparse, stdlib only)"] -->|await| Bridge["PyO3 bridge<br/>pyru._native"]
    Bridge -->|tokio::spawn| Fetch["reqwest<br/>rustls · HTTP/1.1 + HTTP/2<br/>gzip · brotli · zstd"]
    Bridge -->|semaphore| Pool[["Concurrency gate<br/>Tokio JoinSet + Semaphore"]]
    Fetch --> Body["body bytes"]
    Body -->|spawn_blocking| Parse["scraper<br/>CSS selectors over html5ever"]
    Parse --> Out["(elements, errors, latency_ms)"]
    Out -->|Bound<'py, PyAny>| CLI
    style CLI fill:#e8f4f8,stroke:#4a90d9,stroke-width:2px
    style Bridge fill:#4a90d9,stroke:#333,stroke-width:2px,color:#fff
    style Fetch fill:#d4edda,stroke:#28a745,stroke-width:2px
    style Pool fill:#fff3cd,stroke:#ffc107,stroke-width:2px
    style Parse fill:#f8d7da,stroke:#dc3545,stroke-width:2px
    style Out fill:#d1ecf1,stroke:#17a2b8,stroke-width:2px
```
- One async Python entry point. The CLI hands a batch of URLs to a single `await scrape_urls_concurrent(...)` call. Everything below the await lives in Rust.
- Per-URL error reporting. A failing URL emits an error string in its slot; the rest of the batch keeps flowing. No early-exit semantics.
- Tuned `reqwest` client. TCP_NODELAY, 30 s keep-alive, bounded idle pool, configurable timeouts, 10-redirect cap, HTTP/2 negotiated when the peer supports it.
- Parse off the reactor. HTML goes through `scraper` inside `tokio::task::spawn_blocking`, so the async runtime stays responsive even when a page is a 5 MB `<div>` soup.
- MiMalloc global allocator. ~10–20% fewer allocations on the hot path vs glibc's default.
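The Tokio `JoinSet` + `Semaphore` gate described above has a direct asyncio analogue. A minimal sketch of the same pattern in pure Python (this is not PyRu's code; `fetch` is a stand-in for the real network call):

```python
import asyncio


async def fetch(url: str) -> str:
    # Stand-in for the real HTTP fetch; replace with actual I/O.
    await asyncio.sleep(0)
    return f"body of {url}"


async def scrape_all(urls: list[str], concurrency: int) -> list[str]:
    gate = asyncio.Semaphore(concurrency)  # caps in-flight requests

    async def bounded(url: str) -> str:
        async with gate:  # acquire a permit; release on exit
            return await fetch(url)

    # gather preserves input order, mirroring PyRu's parallel result lists.
    return await asyncio.gather(*(bounded(u) for u in urls))


bodies = asyncio.run(scrape_all(["https://a.test", "https://b.test"], concurrency=2))
```

The semaphore bounds concurrency without an explicit worker pool: every URL gets a task, but only `concurrency` of them hold a permit at once.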
## Comparison

| Aspect | PyRu | httpx + selectolax | Scrapy |
|---|---|---|---|
| Language | Rust + Python | Python | Python |
| Dependencies | 0 (stdlib only) | 5+ | 20+ |
| Async | Native (tokio) | asyncio | Twisted |
| Binary size | ~2 MB | ~5 MB | ~30 MB |
| Latency | <1ms | 5-10ms | 10-50ms |
| Memory overhead | MiMalloc | Standard | Highest |
PyRu is ideal when you need speed, minimal dependencies, or a Python wrapper around a Rust async core.
## Install

```shell
pip install pyru-scraper
```

> [!NOTE]
> Pre-built abi3 wheels are available for x86_64 Linux (manylinux 2.28), Intel and Apple Silicon macOS, and Windows. Building from the source distribution requires a Rust toolchain and CPython 3.15+.
## Features

- High throughput — async Rust core with tokio, sub-millisecond latency
- Zero Python deps — stdlib-only CLI, no runtime bloat
- Concurrency control — adjustable semaphore-capped parallelism
- Retry logic — exponential backoff on failure
- robots.txt support — optional compliance checking
- Caching — ETag/Last-Modified conditional requests
- Proxy support — HTTP/HTTPS proxy configuration
- Custom headers — per-request header injection
- SSL toggle — optionally skip certificate verification
- Multiple output modes — JSON, text, file, stats
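The caching feature relies on standard HTTP conditional requests. A hedged sketch of the bookkeeping involved (the class and field names here are illustrative, not PyRu's internals):

```python
class ConditionalCache:
    """Remembers ETag/Last-Modified validators per URL and turns them
    into conditional request headers on the next fetch."""

    def __init__(self) -> None:
        self._validators: dict[str, dict[str, str]] = {}

    def request_headers(self, url: str) -> dict[str, str]:
        saved = self._validators.get(url, {})
        headers: dict[str, str] = {}
        if "etag" in saved:
            headers["If-None-Match"] = saved["etag"]
        if "last_modified" in saved:
            headers["If-Modified-Since"] = saved["last_modified"]
        return headers

    def store(self, url: str, response_headers: dict[str, str]) -> None:
        saved = {}
        if "ETag" in response_headers:
            saved["etag"] = response_headers["ETag"]
        if "Last-Modified" in response_headers:
            saved["last_modified"] = response_headers["Last-Modified"]
        if saved:
            self._validators[url] = saved


cache = ConditionalCache()
cache.store("https://example.com/", {"ETag": '"abc123"'})
```

A server that still has the same representation answers the conditional request with `304 Not Modified` and an empty body, which is what makes `--cache` cheap on repeat scrapes.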
## Usage

```shell
pyru scrape [OPTIONS] URL [URL...]
```

| Flag | Default | Description |
|---|---|---|
| `-s, --selector TEXT` | required | CSS selector applied to each fetched page. |
| `-o, --output {json,text}` | `text` | Output format. |
| `-c, --concurrency INT` | `50` | Maximum in-flight requests (1–10,000). |
| `-u, --user-agent TEXT` | built-in | Override the default User-Agent. |
| `--timeout-ms INT` | `10000` | Total per-request timeout (milliseconds). |
| `--connect-timeout-ms INT` | `5000` | TCP/TLS connect timeout (milliseconds). |
| `-r, --retries INT` | `0` | Retry attempts on failure (0–10). |
| `--respect-robots-txt` | `false` | Obey robots.txt rules before fetching. |
| `--cache` | `false` | Use ETag/Last-Modified for conditional requests. |
| `--proxy URL` | none | HTTP/HTTPS proxy URL. |
| `-H, --header TEXT` | none | Custom header (repeatable). |
| `--insecure` | `false` | Skip SSL verification. |
| `--output-file FILE` | stdout | Write output to file. |
| `--stats` | `false` | Print summary statistics. |
| `-q, --quiet` | `false` | Suppress informational output. |
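The exact backoff schedule behind `--retries` is internal to the Rust core; a typical exponential scheme with doubling delays and a cap looks like this (the base and cap values below are illustrative assumptions, not PyRu's defaults):

```python
def backoff_delays_ms(retries: int, base_ms: int = 250, cap_ms: int = 10_000) -> list[int]:
    """Delay before each retry attempt: base * 2**attempt, capped at cap_ms."""
    return [min(base_ms * (2**attempt), cap_ms) for attempt in range(retries)]


# With 3 retries, the attempts are spaced 250, 500, then 1000 ms apart.
schedule = backoff_delays_ms(3)
```

Capping matters: without it, `-r 10` would end with a multi-minute final wait instead of a bounded one.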
```shell
pyru scrape "https://books.toscrape.com/" -s "h3 > a" -c 200 -o json
```

Worked example: 50 pages, JSON to jq:

```shell
pyru scrape $(seq 1 50 | xargs -I{} echo "https://books.toscrape.com/catalogue/page-{}.html") \
  -s "h3 > a" -c 25 -o json \
  | jq -s '[.[] | {url, latency_ms, n: (.elements | length)}]'
```

## API

```python
import asyncio

from pyru import scrape_urls_concurrent


async def main() -> None:
    elements, errors, latency_ms = await scrape_urls_concurrent(
        urls=["https://example.com/", "https://example.org/"],
        selector="h1",
        concurrency=8,
        timeout_ms=5_000,
    )
    for url, els, err, lat in zip(
        ["https://example.com/", "https://example.org/"],
        elements, errors, latency_ms, strict=True,
    ):
        marker = "ok" if not err else f"err={err!r}"
        print(f"{url} {lat:>4} ms {marker} -> {els}")


asyncio.run(main())
```

The `_native` module exports a single async function:
```python
async def scrape_urls_concurrent(
    urls: Sequence[str],
    selector: str,
    concurrency: int = 50,
    user_agent: str | None = None,
    timeout_ms: int = 10_000,
    connect_timeout_ms: int = 5_000,
    retries: int = 0,
    respect_robots_txt: bool = False,
    use_cache: bool = False,
    proxy: str | None = None,
    headers: Sequence[tuple[str, str]] | None = None,
    insecure: bool = False,
) -> tuple[list[list[str]], list[str], list[int]]: ...
```

## Benchmarks

A self-contained harness under `benchmarks/` spins up a local aiohttp server and compares PyRu against the most commonly cited pure-Python alternative (httpx + selectolax). Numbers vary with hardware, kernel tunables, and test concurrency — run it yourself and publish the raw output before quoting comparisons.
```shell
uv sync --group benchmarks
uv run python benchmarks/real_world_benchmark.py
```

> [!WARNING]
> Benchmark results published before running the harness on your hardware should be treated as anecdotes, not data.
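If you do run your own comparison, warm up first and report percentiles rather than a single mean. A minimal timing skeleton (generic advice, not PyRu's harness; the run counts are arbitrary):

```python
import statistics
import time


def measure(fn, *, warmup: int = 3, runs: int = 30) -> dict[str, float]:
    """Time fn() over several runs and summarize with median and p95."""
    for _ in range(warmup):  # discard cold-start effects (caches, DNS, JIT)
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)  # ms
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
    }


stats = measure(lambda: sum(range(10_000)))
```

Median and p95 together expose tail latency, which a single mean hides.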
## Security

See SECURITY.md for responsible disclosure and supply-chain posture:

- Zero runtime Python dependencies. The CLI is stdlib-only.
- In-tree PEP 517 build backend at `_build/` — no setuptools, no hatchling, no maturin.
- Rust audit (`cargo deny check advisories bans sources`) runs on every CI build.
- Trusted publishing (OIDC) to PyPI — no API tokens on file.
- Signed commits are enforced on `main`.
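PyRu's actual backend in `_build/` is not shown here, but the mandatory PEP 517 surface is small: two hooks, implementable with `zipfile` from the stdlib. A toy sketch (package name and contents are illustrative):

```python
import os
import tempfile
import zipfile


def build_sdist(sdist_directory: str, config_settings=None) -> str:
    raise NotImplementedError  # the second mandatory hook; omitted in this sketch


def build_wheel(wheel_directory: str, config_settings=None,
                metadata_directory=None) -> str:
    """Write a minimal wheel (a zip with dist-info) and return its filename."""
    name = "toy_pkg-0.1.0-py3-none-any.whl"
    path = os.path.join(wheel_directory, name)
    with zipfile.ZipFile(path, "w") as wheel:
        wheel.writestr("toy_pkg/__init__.py", "")
        wheel.writestr(
            "toy_pkg-0.1.0.dist-info/METADATA",
            "Metadata-Version: 2.1\nName: toy-pkg\nVersion: 0.1.0\n",
        )
        wheel.writestr(
            "toy_pkg-0.1.0.dist-info/WHEEL",
            "Wheel-Version: 1.0\nGenerator: toy\nRoot-Is-Purelib: true\nTag: py3-none-any\n",
        )
        wheel.writestr("toy_pkg-0.1.0.dist-info/RECORD", "")
    return name


with tempfile.TemporaryDirectory() as tmp:
    wheel_name = build_wheel(tmp)
    built_ok = zipfile.is_zipfile(os.path.join(tmp, wheel_name))
```

Keeping the backend in-tree removes an entire class of build-time supply-chain dependencies, which is the point of the bullet above.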
## Develop

```shell
uv python install 3.15
uv sync --group dev
uv pip install -e .
```

Then the full check chain:

```shell
uv run ruff check
uv run ruff format --check
uv run python -m unittest discover -s tests -v
cargo fmt --manifest-path native/Cargo.toml --all -- --check
cargo clippy --manifest-path native/Cargo.toml --all-targets -- -D warnings
cargo test --manifest-path native/Cargo.toml --all-targets
```

Layout at a glance:
```text
.
├─ pyru/        Python package (argparse CLI + type stubs)
├─ native/      Rust crate (pyo3, tokio, reqwest, scraper)
├─ _build/      PEP 517 build backend (stdlib only)
├─ tests/       stdlib unittest suite
└─ benchmarks/  opt-in comparison harness
```
## License

MIT — see LICENSE. Copyright © Andreas Fahl.
