Python SDK
The Python SDK provides a convenient abstraction for interacting with the Scrapfly API.
It covers all Scrapfly features and adds several convenient shortcuts:
- Automatic base64 encoding of JS snippets
- Error handling
- JSON encoding of the body if Content-Type: application/json is set
- URL encoding of the body, with Content-Type: application/x-www-form-urlencoded set if no content type is specified
- Conversion of binary responses into a Python BytesIO object
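The encoding shortcuts listed above can be pictured with a few standard-library calls. This is a minimal sketch, not the SDK's actual implementation: the helper names `encode_js` and `encode_body` are hypothetical, and the URL-safe base64 alphabet is an assumption.

```python
import base64
import json
from typing import Any, Dict, Optional, Tuple
from urllib.parse import urlencode


def encode_js(snippet: str) -> str:
    """Base64-encode a JavaScript snippet (URL-safe alphabet is an assumption)."""
    return base64.urlsafe_b64encode(snippet.encode("utf-8")).decode("ascii")


def encode_body(data: Dict[str, Any], content_type: Optional[str] = None) -> Tuple[str, str]:
    """Return (encoded_body, content_type) following the rules listed above."""
    if content_type == "application/json":
        return json.dumps(data), content_type
    if content_type is None:
        # No content type given: fall back to a URL-encoded form body.
        return urlencode(data), "application/x-www-form-urlencoded"
    raise ValueError(f"unsupported content type in this sketch: {content_type}")
```

With the SDK these conversions happen for you; the sketch only shows what kind of work is being saved.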
Step by Step Introduction
For a hands-on introduction see our Scrapfly SDK introduction page!
The full Python API specification is available here: https://scrapfly.github.io/python-scrapfly/docs/scrapfly
For more on using the Python SDK with Scrapfly, select the "Python SDK" option in the top bar of the Scrapfly docs.
Installation
The source code of the Python SDK is available on GitHub.
The scrapfly-sdk package is available through PyPI.
pip install 'scrapfly-sdk'
You can also install the scrapfly-sdk[speedups] extra to benefit from
brotli compression and msgpack serialization.
pip install 'scrapfly-sdk[speedups]'
You can also install scrapfly-sdk[all] to get all optional Scrapfly features without any extra impact on performance.
pip install 'scrapfly-sdk[all]'
Scrape
If you plan to scrape a protected website, make sure to enable Anti Scraping Protection.
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='{{ YOUR_API_KEY }}')
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/anything'))
# Automatically retry errors marked "retryable", waiting the recommended delay before each retry
api_response:ScrapeApiResponse = scrapfly.resilient_scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/anything'))
# Automatically retry based on the status code
api_response:ScrapeApiResponse = scrapfly.resilient_scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/status/500'), retry_on_status_code=[500])
# scrape result, content, iframes, response headers, response cookies states, screenshots, ssl, dns etc
print(api_response.scrape_result)
# html content
print(api_response.scrape_result['content'])
# Context of scrape, session, webhook, asp, cache, debug
print(api_response.context)
# raw api result
print(api_response.content)
# True if the scrape responded with an HTTP status >= 200 and < 300
print(api_response.success)
# API status code. Warning: this is not the status code of the scraped website!
print(api_response.status_code)
# Upstream website status code
print(api_response.upstream_status_code)
# Convert API Scrape Result into well known requests.Response object
print(api_response.upstream_result_into_response())
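The status-code-based retry shown above with resilient_scrape can be approximated as follows. This is a minimal sketch, assuming a hypothetical `fetch` callable that returns an HTTP status code; `retry_on_status`, `max_attempts`, `delay` and `sleep` are illustrative names rather than SDK parameters, and the SDK's own retry policy may differ.

```python
import time
from typing import Callable, Iterable


def retry_on_status(
    fetch: Callable[[], int],
    retry_codes: Iterable[int],
    max_attempts: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch() until it returns a status outside retry_codes or attempts run out."""
    codes = set(retry_codes)
    status = fetch()
    for _ in range(max_attempts - 1):
        if status not in codes:
            break
        sleep(delay)  # wait before the next attempt
        status = fetch()
    return status
```

The `sleep` argument is injectable so the helper can be exercised without actually waiting.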
Using Context
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='{{ YOUR_API_KEY }}')
with scrapfly as scraper:
    response: ScrapeApiResponse = scraper.scrape(ScrapeConfig(url='https://httpbin.dev/anything', country='fr'))
How to configure Scrape Query
You can check the ScrapeConfig implementation, available here, to see all available options.
All parameters listed in this documentation can be used when constructing the ScrapeConfig object.
Download Binary Response
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='{{ YOUR_API_KEY }}')
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url='https://www.intel.com/content/www/us/en/ethernet-controllers/82599-10-gbe-controller-datasheet.html'))
scrapfly.sink(api_response) # you can specify path and name via named arguments
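Conceptually, sinking a binary response means writing an in-memory buffer to disk. The following is a minimal sketch using only the standard library; `save_binary` and its `path` and `name` arguments are hypothetical stand-ins for illustration and do not reproduce the behaviour of scrapfly.sink.

```python
import io
from pathlib import Path


def save_binary(buffer: io.BytesIO, path: str, name: str) -> Path:
    """Write the full content of an in-memory buffer to path/name and return the file path."""
    target_dir = Path(path)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(buffer.getvalue())
    return target
```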
Error Handling
Error handling is a big part of scraping, so the SDK is designed to report what went wrong
so that you can handle it properly in your scraper. Here is a simple snippet for handling errors yourself:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse, UpstreamHttpClientError, \
    ScrapflyScrapeError, UpstreamHttpServerError
scrapfly = ScrapflyClient(key='{{ YOUR_API_KEY }}')
try:
    api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(
        url='https://httpbin.dev/status/404',
    ))
except UpstreamHttpClientError as e: # HTTP 4xx (>= 400 and < 500)
    print(e.api_response.scrape_result['error'])
    raise e
except UpstreamHttpServerError as e: # HTTP >= 500
    print(e.api_response.scrape_result['error'])
    raise e
# UpstreamHttpError can be used to catch all errors related to the upstream website
except ScrapflyScrapeError as e:
    print(e.message)
    print(e.code)
    raise e
Errors, with their codes and explanations, are documented and available here
if you want to know more.
error.message # Message
error.code # Error code
error.retry_delay # Recommended wait time before retrying, if retryable
error.retry_times # Recommended number of retries, if retryable
error.resource # Related resource, Proxy, ASP, Webhook, Spider
error.is_retryable # True or False
error.documentation_url # Documentation explaining the error in details
error.api_response # Api Response object
error.http_status_code # Http code
By default, if the upstream website you scrape responds with a bad HTTP status code, the SDK raises
UpstreamHttpClientError or UpstreamHttpServerError, depending on the HTTP status code.
You can disable this behavior by setting the raise_on_upstream_error attribute to False: ScrapeConfig(raise_on_upstream_error=False)
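The mapping from upstream status code to exception can be sketched as below. The exception classes here are local stand-ins that only mirror the SDK's names, and `check_upstream` is a hypothetical helper; the real SDK logic may differ, for instance in how it treats 3xx responses.

```python
class UpstreamHttpError(Exception):
    """Stand-in base class for upstream website errors."""

    def __init__(self, status_code: int):
        super().__init__(f"upstream responded with HTTP {status_code}")
        self.http_status_code = status_code


class UpstreamHttpClientError(UpstreamHttpError):
    """Stand-in for 4xx responses."""


class UpstreamHttpServerError(UpstreamHttpError):
    """Stand-in for 5xx responses."""


def check_upstream(status_code: int, raise_on_upstream_error: bool = True) -> bool:
    """Return True on 2xx; on 4xx/5xx raise, or return False when raising is disabled."""
    if 200 <= status_code < 300:
        return True
    if raise_on_upstream_error:
        if 400 <= status_code < 500:
            raise UpstreamHttpClientError(status_code)
        if status_code >= 500:
            raise UpstreamHttpServerError(status_code)
    return False
```

With raising disabled, the caller is expected to inspect the status itself, which is the trade-off behind raise_on_upstream_error=False.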
If you want to report errors to your own app for monitoring or tracking purposes, check out the reporter
feature.
Account
You can retrieve account information
from scrapfly import ScrapflyClient
scrapfly = ScrapflyClient(key='{{ YOUR_API_KEY }}')
print(scrapfly.account())
Keep Alive HTTP Session
Take advantage of Keep-Alive connections by using the client as a context manager:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='{{ YOUR_API_KEY }}')
with scrapfly as client:
    api_response:ScrapeApiResponse = client.scrape(scrape_config=ScrapeConfig(
        url='https://news.ycombinator.com/',
        render_js=True,
        screenshots={
            'main': 'fullpage'
        }
    ))
    # more scrape calls
Concurrency out of the box
You can run scrapes concurrently out of the box. The SDK uses asyncio for that.
In Python, there are many ways to achieve concurrency.
First, make sure you have installed the concurrency extra:
pip install 'scrapfly-sdk[concurrency]'
import asyncio
import logging as logger
from sys import stdout
scrapfly_logger = logger.getLogger('scrapfly')
scrapfly_logger.setLevel(logger.DEBUG)
# Attach the handler so that debug logs are actually written to stdout
scrapfly_logger.addHandler(logger.StreamHandler(stdout))
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='{{ YOUR_API_KEY }}', max_concurrency=2)
async def main():
    targets = [
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True),
        ScrapeConfig(url='https://httpbin.dev/anything', render_js=True)
    ]
    async for result in scrapfly.concurrent_scrape(scrape_configs=targets):
        print(result)
asyncio.run(main())
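To see what a concurrency limit means in asyncio terms, here is a minimal, SDK-independent sketch of bounded concurrency with results yielded as they complete. `bounded_concurrent`, `fake_scrape` and `demo` are hypothetical names, and this is not how concurrent_scrape is necessarily implemented.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List


async def bounded_concurrent(
    jobs: List[Callable[[], Awaitable[str]]],
    max_concurrency: int = 2,
) -> AsyncIterator[str]:
    """Run jobs with at most max_concurrency in flight, yielding results as they finish."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(job: Callable[[], Awaitable[str]]) -> str:
        async with semaphore:
            return await job()

    tasks = [asyncio.ensure_future(run(job)) for job in jobs]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def demo() -> List[str]:
    async def fake_scrape(index: int) -> str:
        await asyncio.sleep(0.01)  # stands in for network latency
        return f"result-{index}"

    jobs = [lambda i=i: fake_scrape(i) for i in range(5)]
    return [result async for result in bounded_concurrent(jobs, max_concurrency=2)]
```

Because results are yielded in completion order, callers should not rely on them matching the order of the input list.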
Webhook Server
The Scrapfly Python SDK offers a built-in webhook server feature, allowing developers to easily set up and handle webhooks for receiving notifications and data from Scrapfly services.
This documentation provides an overview of the create_server function within the SDK, along with an example of its usage.
Example Usage
To expose the local server to the internet, the example uses ngrok; you need a free ngrok account to run it.
Below is an example demonstrating how to use the create_server function to set up a webhook server:
- Install dependencies:
pip install ngrok flask scrapfly-sdk
- Export your ngrok auth token in your terminal:
export NGROK_AUTHTOKEN=MY_NGROK_TOKEN
- Create a webhook on your Scrapfly dashboard with any endpoint
(for example, one from https://webhook.site). Since the ngrok endpoint is random on each run and only known at runtime,
you will edit the endpoint once ngrok has advertised it, in a later step.
- Retrieve your webhook signing secret
- Run the command:
python webhook_server.py --signing-secret=MY_SIGNING_SECRET
- Once the server is running, copy the exposed URL advertised below the log line
"====== LISTENING ON ======"
- Edit your webhook URL and replace it with the advertised URL
With the ngrok free plan, a new random tunnel URL is assigned on each start of the server, so you need to edit the webhook each time.
import argparse
from typing import Dict
import flask
import ngrok
from scrapfly import webhook
from scrapfly.webhook import ResourceType
# Define the webhook callback function
def webhook_callback(data: Dict, resource_type: ResourceType, request: flask.Request):
    if resource_type == ResourceType.SCRAPE.value:
        # Process scrape result
        upstream_response = data['result']
        print(upstream_response)
    else:
        # Process other resource types
        print(data)
# Set up ngrok listener for tunneling
listener = ngrok.werkzeug_develop()
# Parse command-line arguments
parser = argparse.ArgumentParser(description="Webhook server with signing secret")
parser.add_argument("--signing-secret", required=True, help="Signing secret to verify webhook payload integrity")
args = parser.parse_args()
# Create Flask application and set up webhook server
app = flask.Flask("Scrapfly Webhook Server")
webhook.create_server(signing_secrets=(args.signing_secret,), callback=webhook_callback, app=app)
# Start the server and print the webhook endpoint URL
print("====== LISTENING ON ======")
print(listener.url() + "/webhook")
print("==========================")
app.run()
In this example, the webhook server is set up using create_server, with a callback function webhook_callback defined to handle incoming webhook payloads.
The signing secret is provided as a command-line argument, and ngrok is used for exposing the local server to the internet for testing.
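To illustrate what verifying a payload against a signing secret involves, here is a minimal sketch. It assumes a hex-encoded HMAC-SHA256 over the raw request body, which may not match the exact scheme Scrapfly uses; `sign_payload` and `verify_signature` are hypothetical helpers, not SDK functions.

```python
import hashlib
import hmac


def sign_payload(payload: bytes, signing_secret: str) -> str:
    """Compute a hex HMAC-SHA256 signature (the scheme is an assumption)."""
    return hmac.new(signing_secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()


def verify_signature(payload: bytes, signature: str, signing_secrets: tuple) -> bool:
    """Return True if the signature matches any of the accepted secrets."""
    return any(
        # compare_digest avoids leaking information through comparison timing
        hmac.compare_digest(sign_payload(payload, secret), signature.lower())
        for secret in signing_secrets
    )
```

Accepting a tuple of secrets, as create_server does with signing_secrets, allows a secret to be rotated without rejecting payloads signed with the previous one.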
Screenshot API
The Screenshot API captures full-page or viewport screenshots with headless browsers.
It supports custom resolution, format, capture region, rendering options, caching and webhooks.
See the Screenshot API documentation
for the full parameter reference.
Basic Screenshot
from scrapfly import ScrapflyClient, ScreenshotConfig
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
result = client.screenshot(ScreenshotConfig(
    url="https://web-scraping.dev/",
    format="jpg",
    capture="fullpage",
))
with open("screenshot.jpg", "wb") as f:
    f.write(result.image)
Screenshot with Options
Control quality, resolution, rendering wait, dark mode and more:
from scrapfly import ScrapflyClient, ScreenshotConfig
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
result = client.screenshot(ScreenshotConfig(
    url="https://web-scraping.dev/",
    format="png",
    capture="fullpage",
    resolution="1440x900", # tablet: "768x1024", mobile: "375x812"
    rendering_wait=2000, # wait 2s after page load
    options=["dark_mode", "block_banners"],
    country="us",
))
with open("screenshot.png", "wb") as f:
    f.write(result.image)
Extraction API
The Extraction API parses HTML/text and extracts structured data using templates,
predefined AI models, or free-form LLM prompts.
See the Extraction API documentation
for all extraction models and template syntax.
Extract product data using a built-in model (no template required):
from scrapfly import ScrapflyClient, ExtractionConfig
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
result = client.extract(ExtractionConfig(
    body="...Orange Chocolate Box $9.99...",
    content_type="text/html",
    extraction_model="product", # or: product_listing, article, review_list, ...
))
print(result.data)
Extract data using a free-form LLM prompt:
from scrapfly import ScrapflyClient, ExtractionConfig
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
result = client.extract(ExtractionConfig(
    body="...The GPU operates at 2.5 GHz with 24 GB VRAM...",
    content_type="text/html",
    extraction_prompt="Extract GPU name, clock speed and VRAM in GB as JSON",
))
print(result.data)
Extract data using a saved extraction template:
from scrapfly import ScrapflyClient, ExtractionConfig
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
# Use a saved extraction template by name
result = client.extract(ExtractionConfig(
    body="...$9.99...",
    content_type="text/html",
    extraction_template="my-product-template",
))
print(result.data)
Crawler API
The Crawler API recursively crawls a website starting from a seed URL.
It handles URL discovery, deduplication, rate limiting, robots.txt compliance,
sitemap parsing, content extraction and webhook callbacks.
The public test target web-scraping.dev
is used in the examples below - it accepts automated crawls.
See the Crawler API documentation
for the full parameter reference and webhook payload examples.
Basic Crawl
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
crawl = Crawl(
    client,
    CrawlerConfig(
        url="https://web-scraping.dev/products",
        page_limit=10,
        content_formats=["markdown"],
    ),
)
crawl.crawl()
crawl.wait()
status = crawl.status()
print(f"Visited {status.state.urls_visited} pages")
Crawl with Compliance Options
Control robots.txt respect, nofollow handling, and subdomain following:
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
crawl = Crawl(
    client,
    CrawlerConfig(
        url="https://web-scraping.dev/",
        page_limit=50,
        respect_robots_txt=True,
        ignore_no_follow=False, # honour rel=nofollow links
        follow_internal_subdomains=False,
        content_formats=["markdown", "page_metadata"],
    ),
)
crawl.crawl()
crawl.wait()
Crawl with Sitemap Discovery
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
crawl = Crawl(
    client,
    CrawlerConfig(
        url="https://web-scraping.dev/",
        use_sitemaps=True, # discover URLs from sitemap.xml
        page_limit=100,
        max_depth=3,
        content_formats=["html", "markdown"],
    ),
)
crawl.crawl()
crawl.wait()
print(f"Visited {crawl.status().state.urls_visited} pages")
Crawl with Webhooks
Receive real-time events as the crawler visits, discovers and finishes URLs:
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
crawl = Crawl(
    client,
    CrawlerConfig(
        url="https://web-scraping.dev/products",
        page_limit=50,
        # Replace with the name of a webhook you registered in your dashboard
        # at https://scrapfly.io/dashboard/webhook
        webhook_name="your-webhook-name",
        webhook_events=[
            "crawler_url_visited",
            "crawler_finished",
        ],
        content_formats=["markdown"],
    ),
)
crawl.crawl()
# The webhook receives events, so there is no need to poll if you only need events
List Crawled URLs
Stream the list of URLs the crawler visited, skipped or failed on. The endpoint
is paginated; iterate by incrementing page until the response is empty.
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
crawl = Crawl(client, CrawlerConfig(url="https://web-scraping.dev/products", page_limit=10))
crawl.crawl()
crawl.wait()
# Stream visited URLs (default status filter is 'visited')
visited = crawl.urls(status="visited", page=1, per_page=100)
for entry in visited:
    print(entry.url)
# Failed URLs include the reason as a CSV-style suffix
for entry in crawl.urls(status="failed"):
    print(entry.url, "->", entry.reason)
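The "increment the page until the response is empty" pattern described above can be sketched generically. This assumes a hypothetical `fetch_page` callable that returns a list of items for a given page number; `iter_pages` is an illustrative helper and not part of the SDK.

```python
from typing import Callable, Iterator, List


def iter_pages(fetch_page: Callable[[int], List[str]], start: int = 1) -> Iterator[str]:
    """Yield items page by page until fetch_page returns an empty list."""
    page = start
    while True:
        items = fetch_page(page)
        if not items:
            return
        yield from items
        page += 1
```

Note that this always issues one extra request, the one that comes back empty and ends the loop.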
Read a Single Page's Content
Use Crawl.read() to fetch one page in plain mode (no JSON envelope). The
returned CrawlContent wraps the raw bytes plus the originating URL.
Crawl.read() returns None if the URL was not part of this crawl.
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
crawl = Crawl(client, CrawlerConfig(url="https://web-scraping.dev/products", page_limit=5,
                                    content_formats=["markdown"]))
crawl.crawl()
crawl.wait()
content = crawl.read("https://web-scraping.dev/products", format="markdown")
if content is not None:
    print(content.content[:200])
# For multiple URLs in one round-trip:
batch = crawl.read_batch(
    urls=["https://web-scraping.dev/products", "https://web-scraping.dev/product/1"],
    formats=["markdown"],
)
for url, formats in batch.items():
    print(url, "->", len(formats["markdown"]), "chars")
Download WARC and HAR Artifacts
WARC archives every HTTP exchange (request + response + body) as it happened on the
wire. HAR captures network timings, headers and the response body in a JSON-friendly
format. The Python SDK ships WarcParser and HarArchive so you
don't need a third-party library.
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
crawl = Crawl(client, CrawlerConfig(url="https://web-scraping.dev/products", page_limit=10))
crawl.crawl()
crawl.wait()
# WARC: iterate response records — content, headers, status code, URL
warc = crawl.warc()
for record in warc.iter_responses():
    print(record.status_code, record.url, len(record.content), "bytes")
warc.save("crawl.warc.gz")
# HAR: high-level filters for status / content-type / URL.
# `crawl.har()` returns a CrawlerArtifactResponse; its `.parser` is a HarArchive.
har = crawl.har()
for entry in har.parser.filter_by_status(200):
    print(entry.method, entry.url, entry.content_type)
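Since HAR is plain JSON, a status filter like the one above can also be understood with the standard library alone. The sketch below relies on the standard HAR 1.2 layout (log.entries, each with request and response objects); `entries_with_status` is a hypothetical helper, the sample data is made up, and this is not how HarArchive is implemented.

```python
import json
from typing import Any, Dict, Iterator


def entries_with_status(har_text: str, status: int) -> Iterator[Dict[str, Any]]:
    """Yield method, URL and MIME type for HAR entries whose response status matches."""
    har = json.loads(har_text)
    for entry in har["log"]["entries"]:
        response = entry["response"]
        if response["status"] == status:
            yield {
                "method": entry["request"]["method"],
                "url": entry["request"]["url"],
                "content_type": response.get("content", {}).get("mimeType", ""),
            }


# Made-up sample with one successful and one failed exchange
SAMPLE_HAR = json.dumps({
    "log": {
        "version": "1.2",
        "entries": [
            {"request": {"method": "GET", "url": "https://web-scraping.dev/products"},
             "response": {"status": 200, "content": {"mimeType": "text/html"}}},
            {"request": {"method": "GET", "url": "https://web-scraping.dev/missing"},
             "response": {"status": 404, "content": {"mimeType": "text/html"}}},
        ],
    }
})
```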
Cancel a Running Crawl
Stop a crawler before it reaches its natural end (e.g. on a runaway crawl, a budget
cap, or user navigation away). The status will transition to CANCELLED
with state.stop_reason="user_cancelled".
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
crawl = Crawl(client, CrawlerConfig(url="https://web-scraping.dev/products", page_limit=1000))
crawl.crawl()
# ... later, from another worker / signal handler / UI ...
crawl.cancel()
# Pass allow_cancelled=True so wait() returns normally on the cancellation
# we just triggered ourselves, instead of raising ScrapflyCrawlerError.
crawl.wait(allow_cancelled=True)
status = crawl.status()
assert status.is_cancelled
print(f"stop_reason={status.state.stop_reason}")
Handle Webhook Events
Use webhook_from_payload() to parse incoming webhook bodies into typed
dataclasses. The four lifecycle events (started/stopped/cancelled/finished) share
CrawlerLifecycleWebhook; the four URL events have their own classes.
Field names match the wire format and the scrape-engine source of truth.
from flask import Flask, request
from scrapfly import (
    webhook_from_payload,
    CrawlerLifecycleWebhook,
    CrawlerUrlVisitedWebhook,
    CrawlerUrlFailedWebhook,
    CrawlerWebhookEvent,
)
app = Flask(__name__)
SIGNING_SECRETS = ("your-hex-secret",)
@app.route("/webhook", methods=["POST"])
def crawler_webhook():
    wh = webhook_from_payload(
        request.json,
        signing_secrets=SIGNING_SECRETS,
        signature=request.headers.get("X-Scrapfly-Webhook-Signature"),
    )
    # Common fields on every event
    print(f"[{wh.event}] {wh.crawler_uuid} "
          f"visited={wh.state.urls_visited}/{wh.state.urls_extracted}")
    if isinstance(wh, CrawlerLifecycleWebhook):
        if wh.event == CrawlerWebhookEvent.CRAWLER_FINISHED.value:
            print(f" finished — credits={wh.state.api_credit_used}")
    elif isinstance(wh, CrawlerUrlVisitedWebhook):
        print(f" visited {wh.url} [{wh.scrape.status_code}]")
    elif isinstance(wh, CrawlerUrlFailedWebhook):
        print(f" failed {wh.url}: {wh.error}")
    return "", 200
Cloud Browser API
The Cloud Browser API provides a fully managed remote browser that bypasses
anti-bot protection (Cloudflare, DataDome, Imperva, etc.) and hands you
a live Playwright/Puppeteer-compatible WebSocket connection.
Install the extra dependency first:
pip install 'scrapfly-sdk[all]' playwright && playwright install chromium
Basic Session
from scrapfly import ScrapflyClient
from playwright.sync_api import sync_playwright
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
result = client.cloud_browser_unblock(
    url="https://web-scraping.dev/product/1",
)
session_id = result["session_id"]
with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp(result["ws_url"])
    page = browser.contexts[0].pages[0]
    print("Title:", page.title())
    print("URL:", page.url)
    browser.close()
Session with Country
from scrapfly import ScrapflyClient
from playwright.sync_api import sync_playwright
client = ScrapflyClient(key="{{ YOUR_API_KEY }}")
result = client.cloud_browser_unblock(
    url="https://web-scraping.dev/product/1",
    country="us",
)
with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp(result["ws_url"])
    page = browser.contexts[0].pages[0]
    # Navigate within the same session
    page.goto("https://web-scraping.dev/products")
    page.wait_for_selector(".product")
    products = page.query_selector_all(".product-name")
    for product in products:
        print(product.inner_text())
    browser.close()
External Integration
LlamaIndex
LlamaIndex, formerly known as GPT Index, is a data framework designed to facilitate the connection between large language models (LLMs) and a wide variety of data sources. It provides tools to effectively ingest, index, and query data within these models.
Integrate Scrapfly with LlamaIndex
Langchain
LangChain is a robust framework designed for developing applications powered by language models. It focuses on enabling the creation of applications that can leverage the capabilities of large language models (LLMs) for a variety of use cases.
Integrate Scrapfly with Langchain