Python SDK

The Python SDK provides a handy abstraction for interacting with the Scrapfly API. It covers all Scrapfly features and adds many convenient shortcuts:

  • Automatic base64 encoding of JavaScript snippets
  • Error Handling
  • JSON encoding of the body when Content-Type: application/json is set
  • URL encoding of the body and Content-Type: application/x-www-form-urlencoded set automatically when no content type is specified
  • Conversion of binary responses into a Python BytesIO object

Step by Step Introduction

For a hands-on introduction, see our Scrapfly SDK introduction page!


The full Python API specification is available here: https://scrapfly.github.io/python-scrapfly/docs/scrapfly

For more on using the Python SDK with Scrapfly, select the "Python SDK" option in the Scrapfly docs top bar.

Installation

The source code of the Python SDK is available on Github, and the scrapfly-sdk package is available through PyPI.

You can also install the extra package scrapfly[speedups] to benefit from brotli compression and msgpack serialization.

You can also install scrapfly[all] to get all optional Scrapfly features without any extra impact on performance.
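Assuming the PyPI package name scrapfly-sdk, the install commands look like:

```shell
# base install
pip install scrapfly-sdk

# with brotli + msgpack speedups
pip install "scrapfly-sdk[speedups]"

# with all optional features
pip install "scrapfly-sdk[all]"
```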

Scrape

If you plan to scrape protected websites, make sure to enable Anti Scraping Protection.

Discover the full Python specification:

Using Context

How to configure Scrape Query

You can check the ScrapeConfig implementation to see all available options.

All parameters listed in this documentation can be used when constructing the scrape config object.

Download Binary Response

Error Handling

Error handling is a big part of scraping, so we designed a system that reflects what went wrong so you can handle failures properly from your scraper. Here is a simple snippet to handle errors on your own:

Errors, with their related codes and explanations, are documented here if you want to know more.

By default, if the upstream website that you scrape responds with a bad HTTP code, the SDK will raise UpstreamHttpClientError or UpstreamHttpServerError depending on the HTTP status code. You can disable this behavior by setting the raise_on_upstream_error attribute to False: ScrapeConfig(raise_on_upstream_error=False)

If you want to report errors to your application for monitoring or tracking purposes, check out the reporter feature.

Account

You can retrieve your account information:

Keep Alive HTTP Session

Take advantage of Keep-Alive connections:

Concurrency out of the box

You can run scrapes concurrently out of the box. We use asyncio for that.

In Python, there are many ways to achieve concurrency. You can also check:

First of all, ensure you have installed the concurrency module:


Webhook Server

The Scrapfly Python SDK offers a built-in webhook server feature, allowing developers to easily set up and handle webhooks for receiving notifications and data from Scrapfly services. This documentation provides an overview of the create_server function within the SDK, along with an example of its usage.

Example Usage

To expose the local server to the internet we use ngrok; you need a free ngrok account to run the example.

Below is an example demonstrating how to use the create_server function to set up a webhook server:

  1. Install dependencies: pip install ngrok flask scrapfly
  2. Export your ngrok auth token in your terminal: export NGROK_AUTHTOKEN=MY_NGROK_TOKEN
  3. Create a webhook on your Scrapfly dashboard with any endpoint (for example from https://webhook.site). Since the ngrok endpoint is only known at runtime and is random on each run, we will edit the endpoint once ngrok advertises it in the next step.
  4. Retrieve your webhook signing secret
  5. Run the command python webhook_server.py --signing-secret=MY_SIGNING_SECRET
  6. Once the server is running, copy the exposed URL advertised below the log line "====== LISTENING ON ======"
  7. Edit your webhook URL and replace it with the advertised URL.
With the ngrok free plan, a new random tunnel URL is assigned on each server start, so you need to edit the webhook URL every time.
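A sketch of webhook_server.py following the steps above; the exact create_server signature and import path are assumptions, so check the SDK reference:

```python
# webhook_server.py
import argparse

import ngrok  # pip install ngrok flask scrapfly-sdk
from scrapfly.webhook import create_server  # assumed import path


def webhook_callback(payload):
    # handle the incoming webhook payload from Scrapfly
    print("received webhook:", payload)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--signing-secret", required=True)
    args = parser.parse_args()

    # signature is an assumption: a Flask app that verifies the signing secret
    app = create_server(callback=webhook_callback, signing_secrets=(args.signing_secret,))

    listener = ngrok.forward(5000)  # expose localhost:5000 through ngrok
    print("====== LISTENING ON ======")
    print(listener.url())

    app.run(port=5000)
```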

In this example, the webhook server is set up using create_server, with a callback function webhook_callback defined to handle incoming webhook payloads. The signing secret is provided as a command-line argument, and ngrok is used for exposing the local server to the internet for testing.

Screenshot API

The Screenshot API captures full-page or viewport screenshots with headless browsers. It supports custom resolution, format, capture region, rendering options, caching and webhooks.

See the Screenshot API documentation for the full parameter reference.

Basic Screenshot

Screenshot with Options

Control quality, resolution, rendering wait, dark mode and more:

Extraction API

The Extraction API parses HTML/text and extracts structured data using templates, predefined AI models, or free-form LLM prompts.

See the Extraction API documentation for all extraction models and template syntax.

Predefined AI Model

Extract product data using a built-in model — no template required:

LLM Free-Form Prompt

Named Template

Crawler API

The Crawler API recursively crawls a website starting from a seed URL. It handles URL discovery, deduplication, rate limiting, robots.txt compliance, sitemap parsing, content extraction and webhook callbacks.

The public test target web-scraping.dev is used in the examples below - it accepts automated crawls.

See the Crawler API documentation for the full parameter reference and webhook payload examples.

Basic Crawl
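A sketch; CrawlConfig and client.crawl() below are illustrative names, check the Crawler API reference for the exact SDK interface:

```python
from scrapfly import ScrapflyClient
from scrapfly.crawl import CrawlConfig  # assumed import path

if __name__ == "__main__":
    client = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")
    crawl = client.crawl(CrawlConfig(
        url="https://web-scraping.dev/",  # seed URL
        limit=25,                         # stop after 25 pages (assumed parameter)
    ))
    print(crawl.status)
```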

Crawl with Compliance Options

Control robots.txt respect, nofollow handling, and subdomain following:
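A sketch with illustrative parameter names (the real names live in the Crawler API reference):

```python
from scrapfly import ScrapflyClient
from scrapfly.crawl import CrawlConfig  # assumed import path

if __name__ == "__main__":
    client = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")
    crawl = client.crawl(CrawlConfig(
        url="https://web-scraping.dev/",
        respect_robots_txt=True,  # honor robots.txt rules (assumed name)
        follow_nofollow=False,    # skip rel="nofollow" links (assumed name)
        follow_subdomains=False,  # stay on the seed host (assumed name)
    ))
    print(crawl.status)
```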

Crawl with Sitemap Discovery
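Sitemaps can seed URL discovery; the flag name below is an assumption:

```python
from scrapfly import ScrapflyClient
from scrapfly.crawl import CrawlConfig  # assumed import path

if __name__ == "__main__":
    client = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")
    crawl = client.crawl(CrawlConfig(
        url="https://web-scraping.dev/",
        use_sitemap=True,  # parse sitemap.xml to discover URLs (assumed name)
    ))
    print(crawl.status)
```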

Crawl with Webhooks

Receive real-time events as the crawler visits, discovers and finishes URLs:
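Attach a webhook registered in your dashboard; the parameter name below is an assumption:

```python
from scrapfly import ScrapflyClient
from scrapfly.crawl import CrawlConfig  # assumed import path

if __name__ == "__main__":
    client = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")
    crawl = client.crawl(CrawlConfig(
        url="https://web-scraping.dev/",
        webhook="my-crawler-webhook",  # webhook name from your dashboard (assumed)
    ))
```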

List Crawled URLs

Stream the list of URLs the crawler visited, skipped or failed on. The endpoint is paginated; iterate by incrementing page until the response is empty.
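A pagination sketch; the urls() helper name is an assumption:

```python
def list_crawled_urls(crawl):
    # `crawl` is the object returned by the (assumed) client.crawl() call
    page = 1
    while True:
        urls = crawl.urls(page=page)  # assumed paginated helper
        if not urls:
            break  # an empty page ends the iteration
        for entry in urls:  # visited, skipped or failed URLs
            print(entry)
        page += 1
```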

Read a Single Page's Content

Use Crawl.read() to fetch one page in plain mode (no JSON envelope) — the returned CrawlContent wraps the raw bytes plus the originating URL. Returns None if the URL was not part of this crawl.
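A sketch using the documented Crawl.read() / CrawlContent names; the attribute names on CrawlContent are assumptions:

```python
def read_page(crawl, url: str) -> None:
    # `crawl` is the object returned by the (assumed) client.crawl() call
    content = crawl.read(url)  # plain mode: raw bytes, no JSON envelope
    if content is None:
        print(url, "was not part of this crawl")
        return
    print(content.url)            # originating URL (assumed attribute name)
    print(content.content[:200])  # first raw bytes (assumed attribute name)
```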

Download WARC and HAR Artifacts

WARC archives every HTTP exchange (request + response + body) as it happened on the wire. HAR captures network timings, headers and the response body in a JSON-friendly format. The Python SDK ships WarcParser and HarArchive so you don't need a third-party library.
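A sketch using the documented WarcParser / HarArchive names; the download helpers and the import path are assumptions:

```python
from scrapfly.crawl import WarcParser, HarArchive  # assumed import path

def inspect_artifacts(crawl) -> None:
    # `crawl` is the object returned by the (assumed) client.crawl() call
    warc_bytes = crawl.warc()  # assumed download helper
    for record in WarcParser(warc_bytes):  # one record per HTTP exchange
        print(record)

    har_bytes = crawl.har()  # assumed download helper
    har = HarArchive(har_bytes)  # timings, headers and bodies, JSON-friendly
    print(har)
```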

Cancel a Running Crawl

Stop a crawler before it reaches its natural end (e.g. on a runaway crawl, a budget cap, or user navigation away). The status will transition to CANCELLED with state.stop_reason="user_cancelled".
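A sketch; the cancel() helper name is an assumption, while the status values follow the wording above:

```python
def cancel_crawl(crawl) -> None:
    # `crawl` is the object returned by the (assumed) client.crawl() call
    crawl.cancel()  # assumed helper hitting the cancel endpoint
    # after cancellation the crawl reports:
    #   status == "CANCELLED"
    #   state.stop_reason == "user_cancelled"
    print(crawl.status)
```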

Handle Webhook Events

Use webhook_from_payload() to parse incoming webhook bodies into typed dataclasses. The four lifecycle events (started/stopped/cancelled/finished) share CrawlerLifecycleWebhook; the four URL events have their own classes. Field names match the wire format and the scrape-engine source of truth.
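A sketch using the documented webhook_from_payload() / CrawlerLifecycleWebhook names; the import path is an assumption:

```python
from scrapfly.crawl import (  # assumed import path
    CrawlerLifecycleWebhook,
    webhook_from_payload,
)

def handle_webhook(body: bytes) -> None:
    event = webhook_from_payload(body)  # parse into a typed dataclass
    if isinstance(event, CrawlerLifecycleWebhook):
        # started / stopped / cancelled / finished
        print("lifecycle event:", event)
    else:
        # one of the four URL event classes
        print("url event:", event)
```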

Cloud Browser API

The Cloud Browser API provides a fully managed remote browser that bypasses anti-bot protection (Cloudflare, DataDome, Imperva, etc.) and hands you a live Playwright/Puppeteer-compatible WebSocket connection.

Install the extra dependency first:
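Assuming Playwright as the client library:

```shell
pip install playwright
```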

Basic Session
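A sketch connecting Playwright over CDP; the websocket URL format is an assumption, check the Cloud Browser docs for the real endpoint:

```python
from playwright.sync_api import sync_playwright

WS_URL = "wss://browser.scrapfly.io/?key=YOUR_SCRAPFLY_KEY"  # assumed format

if __name__ == "__main__":
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(WS_URL)  # attach to the remote browser
        page = browser.new_page()
        page.goto("https://web-scraping.dev/products")
        print(page.title())
        browser.close()
```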

Session with Country
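The same connection with a proxy-country hint; the query parameter is an assumption:

```python
from playwright.sync_api import sync_playwright

# country=us is an assumed query parameter; check the Cloud Browser docs
WS_URL = "wss://browser.scrapfly.io/?key=YOUR_SCRAPFLY_KEY&country=us"

if __name__ == "__main__":
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(WS_URL)
        page = browser.new_page()
        page.goto("https://web-scraping.dev/products")
        browser.close()
```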

External Integration

LlamaIndex

LlamaIndex, formerly known as GPT Index, is a data framework designed to facilitate the connection between large language models (LLMs) and a wide variety of data sources. It provides tools to effectively ingest, index, and query data within these models.

Integrate Scrapfly with LlamaIndex
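A sketch using the ScrapflyReader shipped in the llama-index-readers-web package:

```python
# pip install llama-index llama-index-readers-web
from llama_index.readers.web import ScrapflyReader

reader = ScrapflyReader(
    api_key="YOUR_SCRAPFLY_KEY",
    ignore_scrape_failures=True,  # skip URLs that fail instead of raising
)

if __name__ == "__main__":
    documents = reader.load_data(urls=["https://web-scraping.dev/products"])
    print(documents[0].text[:200])
```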

Langchain

LangChain is a robust framework designed for developing applications powered by language models. It focuses on enabling the creation of applications that can leverage the capabilities of large language models (LLMs) for a variety of use cases.

Integrate Scrapfly with Langchain
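A sketch using the ScrapflyLoader shipped in langchain-community:

```python
# pip install langchain langchain-community
from langchain_community.document_loaders import ScrapflyLoader

loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="YOUR_SCRAPFLY_KEY",
)

if __name__ == "__main__":
    docs = loader.load()
    print(docs[0].page_content[:200])
```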

Summary