---
pcx_content_type: how-to
title: /crawl - Crawl web content
sidebar:
order: 11
---

import { Render } from "~/components";

The `/crawl` endpoint automates scraping content from webpages, starting from a single URL and following links up to a specified page limit or depth. The response can be returned as HTML, Markdown, or JSON.

The `/crawl` endpoint respects the directives of `robots.txt` files, including `crawl-delay` and [`content-signal`](https://contentsignals.org/). All URLs that `/crawl` is directed not to crawl are listed in the response with `"status": "disallowed"`.
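
For example, a page blocked by `robots.txt` might appear in the crawl results as an entry like the following (an illustrative sketch; the full entry may include additional fields):

```json
{
  "url": "https://example.com/private/",
  "status": "disallowed"
}
```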

## Endpoint

```txt
https://api.cloudflare.com/client/v4/accounts/<account_id>/browser-rendering/crawl
```

## Required fields

You must provide `url`:

- `url` (string)

## Common use cases

- Scraping online content to build a knowledge base of up-to-date information
- Converting online content into LLM-friendly formats to train [Retrieval-Augmented Generation (RAG) applications](/reference-architecture/diagrams/ai/ai-rag/) and other AI systems

## Basic usage

Using the `/crawl` endpoint involves two separate steps:
1. [Initiate the crawl job](/browser-rendering/rest-api/crawl-endpoint/#initiate-the-crawl-job) — A `POST` request where you initiate the crawl and receive a response with a job `id`.
2. [Request results of the crawl job](/browser-rendering/rest-api/crawl-endpoint/#request-results-of-the-crawl-job) — A `GET` request where you request the status or results of the crawl.

:::note[Free plan limitation]
If you are on a Workers Free plan, your crawl may fail if it hits the [limit of 10 minutes per day](/browser-rendering/platform/pricing/). To avoid this, you can either [upgrade to a Workers Paid plan](/workers/platform/pricing/) or [set limits on timeouts](/browser-rendering/reference/timeouts/) to get the most out of the 10 minutes available to your crawl request.
:::

### Initiate the crawl job

Here are the basic parameters you can use to initiate your crawl job:
- `url` — (Required) Starts crawling from this URL
- `limit` — (Optional) Maximum number of pages to crawl (default is 10, maximum is 100,000)
- `depth` — (Optional) Maximum link depth to crawl from the starting URL
- `formats` — (Optional) Response format (default is HTML, other options are Markdown and JSON)

The API will respond immediately with a job `id` you will use to retrieve the status and results of the crawl job.

See the [advanced usage section below](/browser-rendering/rest-api/crawl-endpoint/#advanced-usage) for additional parameters.

Here is an example that uses the basic parameters:
```bash
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
-H 'Authorization: Bearer <apiToken>' \
-H 'Content-Type: application/json' \
-d '{
  "url": "https://developers.cloudflare.com/workers/",
  "limit": 50,
  "depth": 2,
  "formats": ["markdown"]
}'
```

Here is an example of the response, which includes a job `id`:

```json output
{
"result": {
"id": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e"
},
"success": true
}
```
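
If you are scripting against the API, you can capture the job `id` directly from this response. Here is a minimal sketch in bash, assuming `jq` is installed and that `ACCOUNT_ID` and `API_TOKEN` are set in your environment:

```bash
# Initiate the crawl and store the job id for later status checks.
JOB_ID=$(curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/browser-rendering/crawl" \
-H "Authorization: Bearer ${API_TOKEN}" \
-H 'Content-Type: application/json' \
-d '{"url": "https://developers.cloudflare.com/workers/", "limit": 50}' | jq -r '.result.id')
echo "Crawl job started: ${JOB_ID}"
```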

### Request results of the crawl job

Here is an example of how you would check the status or request the results of your crawl job with the job `id` you were provided:

```bash
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/result/c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e' \
-H 'Authorization: Bearer <apiToken>'
```

Here is an example response:

```json output
{
  "result": {
    "id": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e",
    "status": "complete",
    "browserTimeSpent": 134.7,
    "total": 50,
    "completed": 50,
    "entries": [
      {
        "url": "https://developers.cloudflare.com/workers/",
        "status": "completed",
        "markdown": "# Cloudflare Workers\nBuild and deploy serverless applications...",
        "html": null,
        "metadata": {
          "title": "Cloudflare Workers · Cloudflare Workers docs",
          "language": "en-US"
        }
      },
      {
        "url": "https://developers.cloudflare.com/workers/get-started/quickstarts/",
        "status": "completed",
        "markdown": "## Quickstarts\nGet up and running with a simple 'Hello World'...",
        "html": null,
        "metadata": {
          "title": "Quickstarts · Cloudflare Workers docs",
          "language": "en-US"
        }
      }
      // ... 48 more entries omitted for brevity
    ]
  },
  "success": true
}
```
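
If you are scripting against the API, you can poll the result endpoint until the job reports `"status": "complete"`. Here is a minimal sketch in bash, assuming `jq` is installed and that `ACCOUNT_ID`, `API_TOKEN`, and `JOB_ID` (captured as in the sketch above) are set in your environment:

```bash
# Poll the crawl job until it finishes.
# "complete" is the finished state shown in the example response above;
# depending on the job, you may also want to handle cancelled or failed states.
while true; do
  STATUS=$(curl -s "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/browser-rendering/crawl/result/${JOB_ID}" \
  -H "Authorization: Bearer ${API_TOKEN}" | jq -r '.result.status')
  echo "Job status: ${STATUS}"
  if [ "${STATUS}" = "complete" ]; then
    break
  fi
  sleep 5
done
```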

### Cancel a crawl job

If you need to cancel a job that is currently in progress, send a `DELETE` request with the job `id` you were provided:

```bash
curl -X DELETE 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/result/c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e' \
-H 'Authorization: Bearer <apiToken>'
```

A successful cancellation returns a `200 OK` status code, and the job status is updated to `cancelled`.

## Advanced usage

The `/crawl` endpoint has many parameters you can use to customize your crawl. For the full list, check the [API docs](https://developers.cloudflare.com/api/resources/browser_rendering/).

Here is an example that uses the additional parameters currently available, in addition to the [basic parameters shown in the example above](/browser-rendering/rest-api/crawl-endpoint/#initiate-the-crawl-job) and the [`render` parameter below](/browser-rendering/rest-api/crawl-endpoint/#choose-when-to-render-javascript):

```bash
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
-H 'Authorization: Bearer <apiToken>' \
-H 'Content-Type: application/json' \
-d '{
  // Required: The URL to start crawling from
  "url": "https://www.exampledocs.com/docs/",

  // Optional: The maximum age of a cached resource that can be returned (in seconds)
  "maxAge": 7200,

  "options": {
    // Optional: If true, follows links to external domains (default is false)
    "includeExternalLinks": true,

    // Optional: If true, follows links to subdomains of the starting URL (default is false)
    "includeSubdomains": true,

    // Optional: Only visits URLs that match one of these patterns
    "includePatterns": [".*/api/v1/.*"],

    // Optional: Does not visit URLs that match any of these patterns
    "excludePatterns": [".*/learning-paths/.*"]
  }
}'
```

### Choose when to render JavaScript

Use the `render` parameter to control whether the `/crawl` endpoint spins up a headless browser and executes page JavaScript. The default is `render: true`. Set `render: false` to do a fast HTML fetch without executing JavaScript.

Use `render: true` when the page builds content in the browser. Use `render: false` when the content you need is already in the initial HTML response.

Crawls with `render: false` are billed under [Workers pricing](/workers/platform/pricing/), while crawls with `render: true` use a headless browser and are billed under typical [Browser Rendering pricing](/browser-rendering/platform/pricing/).

Here is an example of a request that uses the `render` parameter:

```bash
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
-H 'Authorization: Bearer <apiToken>' \
-H 'Content-Type: application/json' \
-d '{
  // Required: The URL to start crawling from
  "url": "https://developers.cloudflare.com/workers/",

  // Optional: If false, only does a simple HTML fetch crawl (default is true)
  "render": false
}'
```

<Render
file="setting-custom-user-agent"
product="browser-rendering"
/>

<Render
file="faq"
product="browser-rendering"
/>