Commit abd8379

[Browser Rendering] Crawl endpoint
1 parent f2d0e86 commit abd8379

1 file changed: +196 -0 lines changed
@@ -0,0 +1,196 @@
---
pcx_content_type: how-to
title: /crawl - Crawl web content
sidebar:
  order: 11
---

import { Render } from "~/components";

The `/crawl` endpoint scrapes content from webpages, starting from a single URL and following links up to the depth you specify. Results can be returned as HTML, Markdown, or JSON.

The `/crawl` endpoint respects the directives of `robots.txt` files, such as `crawl-delay` and [`content-signal`](https://contentsignals.org/). All URLs that `/crawl` is directed not to crawl are listed in the response with `"status": "disallowed"`.

## Endpoint

```txt
https://api.cloudflare.com/client/v4/accounts/<account_id>/browser-rendering/crawl
```

## Required fields

You must provide `url`:

- `url` (string)

## Common use cases

- Scraping online content to build a knowledge base of up-to-date information
- Converting online content into LLM-friendly formats to train AI systems or to supply Retrieval-Augmented Generation (RAG) applications

## Basic usage

Because a crawl can take some time to complete, the workflow is split into two requests:

1. A `POST` request that initiates the crawl and returns a response containing the job ID (`result.id`).
2. A `GET` request that returns the status or results of the crawl.

:::note[Free plan limitation]
If you are on a Workers Free plan, your crawl may fail if it reaches the [limit of 10 minutes per day](/browser-rendering/platform/pricing/). To avoid this, either [upgrade to a Workers Paid plan](/workers/platform/pricing/) or [set timeouts](/browser-rendering/reference/timeouts/) to make the most of those 10 minutes.
:::

### Initiate the crawl job

```bash
# Body fields:
#   url     (required) Starts crawling from this URL
#   limit   (optional) Maximum number of pages to crawl (default is 10, maximum is 100,000)
#   depth   (optional) Maximum link depth to crawl from the starting URL
#   formats (optional) Response format (default is HTML, other options are Markdown and JSON)
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/<account_id>/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://developers.cloudflare.com/workers/",
    "limit": 50,
    "depth": 2,
    "formats": ["markdown"]
  }'
```

```json output
{
  "result": {
    "id": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e"
  },
  "success": true
}
```
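
To script the follow-up request, the job ID has to be captured from this response. The following is a minimal sketch, assuming the response has the shape shown above. `extract_job_id` is a hypothetical helper name, and the `grep`/`sed` pipeline is a convenience for this small response rather than a general JSON parser.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the first "id" value found in JSON read from stdin.
# Assumes the response shape shown above: {"result": {"id": "..."}, "success": true}
extract_job_id() {
  grep -o '"id": *"[^"]*"' | head -n 1 | sed 's/.*: *"\(.*\)"/\1/'
}

# Sample response copied from the output above.
response='{"result": {"id": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e"}, "success": true}'
job_id=$(printf '%s\n' "$response" | extract_job_id)
echo "$job_id"
```

If `jq` is installed, `jq -r '.result.id'` is a more robust way to read the same field.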

### Request results of the crawl job

```bash
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/<account_id>/browser-rendering/crawl/result/c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e' \
  -H 'Authorization: Bearer <apiToken>'
```

```json output
{
  "result": {
    "id": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e",
    "status": "complete",
    "browserTimeSpent": 134.7,
    "total": 50,
    "completed": 50,
    "entries": [
      {
        "url": "https://developers.cloudflare.com/workers/",
        "status": "completed",
        "markdown": "# Cloudflare Workers\nBuild and deploy serverless applications...",
        "html": null,
        "metadata": {
          "title": "Cloudflare Workers · Cloudflare Workers docs",
          "language": "en-US"
        }
      },
      {
        "url": "https://developers.cloudflare.com/workers/get-started/quickstarts/",
        "status": "completed",
        "markdown": "## Quickstarts\nGet up and running with a simple 'Hello World'...",
        "html": null,
        "metadata": {
          "title": "Quickstarts · Cloudflare Workers docs",
          "language": "en-US"
        }
      }
    ]
  },
  "success": true
}
```
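
Because the crawl runs asynchronously, the `GET` request usually has to be repeated until the job finishes. The following is a rough sketch of such a polling loop. It assumes `ACCOUNT_ID` and `API_TOKEN` are set in the environment, that the job-level `status` appears before the per-entry ones (as in the sample output), and that `complete` and `cancelled` are the terminal states. The function names, attempt count, and delay are illustrative choices, not part of the API.

```bash
#!/usr/bin/env bash
# Sketch: poll a crawl job until it reports a terminal status.
# ACCOUNT_ID and API_TOKEN are assumed to be set in the environment.

API_BASE='https://api.cloudflare.com/client/v4/accounts'

# Build the result URL used in the GET example above.
crawl_result_url() {
  printf '%s/%s/browser-rendering/crawl/result/%s\n' "$API_BASE" "$1" "$2"
}

# Print the first "status" value in JSON read from stdin. In the sample output
# the job-level status comes before the per-entry statuses (an assumption here).
first_status() {
  grep -o '"status": *"[^"]*"' | head -n 1 | sed 's/.*: *"\(.*\)"/\1/'
}

# Usage: wait_for_crawl <job_id> [max_attempts] [delay_seconds]
# Returns 0 on "complete", 1 on "cancelled", 2 if max_attempts is exhausted.
wait_for_crawl() {
  local job_id="$1" max_attempts="${2:-30}" delay="${3:-10}"
  local attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$(curl -sS "$(crawl_result_url "$ACCOUNT_ID" "$job_id")" \
      -H "Authorization: Bearer $API_TOKEN" | first_status)
    case "$status" in
      complete) echo "complete"; return 0 ;;
      cancelled) echo "cancelled"; return 1 ;;
    esac
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "timed out waiting for job $job_id" >&2
  return 2
}
```

For example, `wait_for_crawl c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e 60 5` would check every 5 seconds for up to 60 attempts. Any other status value, including an empty one from a failed request, is treated as "still running" in this sketch.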

### Cancel a crawl job

```bash
curl -X DELETE 'https://api.cloudflare.com/client/v4/accounts/<account_id>/browser-rendering/crawl/result/c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e' \
  -H 'Authorization: Bearer <apiToken>'
```

A successful cancellation returns a `200 OK` status code, and the job status is updated to `cancelled`.
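
In a script, the status code can be checked directly. The following sketch assumes `ACCOUNT_ID` and `API_TOKEN` are set in the environment; `cancel_crawl` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: cancel a crawl job and confirm the API answered with HTTP 200.
# ACCOUNT_ID and API_TOKEN are assumed to be set in the environment.
cancel_crawl() {
  local job_id="$1" http_code
  # -o /dev/null discards the body; -w '%{http_code}' prints only the status code.
  http_code=$(curl -sS -o /dev/null -w '%{http_code}' -X DELETE \
    "https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/browser-rendering/crawl/result/$job_id" \
    -H "Authorization: Bearer $API_TOKEN")
  if [ "$http_code" = "200" ]; then
    echo "cancelled $job_id"
  else
    echo "cancel failed for $job_id (HTTP $http_code)" >&2
    return 1
  fi
}
```

Calling `cancel_crawl <job_id>` prints a confirmation on success and returns a non-zero exit code otherwise, so it can be used in `if` or `&&` chains.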

## Advanced usage

:::note[Looking for more parameters?]
Visit the [Browser Rendering API reference](/api/resources/browser_rendering/) for all available parameters.
:::

Here are...

```bash
# Body fields:
#   url                          (required) The URL to start crawling from
#   maxAge                       (optional) The maximum age of a cached resource that can be returned (in seconds)
#   options.includeExternalLinks (optional) If true, follows links to external domains (default is false)
#   options.includeSubdomains    (optional) If true, follows links to subdomains of the starting URL (default is false)
#   options.includePatterns      (optional) Only visits URLs that match one of these patterns
#   options.excludePatterns      (optional) Does not visit URLs that match any of these patterns
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/<account_id>/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://www.exampledocs.com/docs/",
    "maxAge": 7200,
    "options": {
      "includeExternalLinks": true,
      "includeSubdomains": true,
      "includePatterns": [
        ".*/api/v1/.*"
      ],
      "excludePatterns": [
        ".*/learning-paths/.*"
      ]
    }
  }'
```
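
Before starting a crawl, it can help to check locally which URLs a pair of include and exclude patterns would select. The sketch below assumes the patterns are regular expressions, which their `.*` syntax suggests but this page does not state, and that an exclude match overrides an include match. It approximates the matching with `grep -E`, so the service's actual rules may differ. `url_allowed` is a hypothetical helper, and the sample URLs are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the URL matches the include pattern and does
# not match the exclude pattern. Patterns are treated as extended regular
# expressions, which is an assumption about how the API interprets them.
url_allowed() {
  local url="$1" include="$2" exclude="$3"
  printf '%s\n' "$url" | grep -Eq -- "$include" || return 1
  if printf '%s\n' "$url" | grep -Eq -- "$exclude"; then
    return 1
  fi
  return 0
}

# Patterns copied from the request above.
include='.*/api/v1/.*'
exclude='.*/learning-paths/.*'

# Illustrative URLs only; they are not taken from a real crawl.
for url in \
  'https://www.exampledocs.com/docs/api/v1/users' \
  'https://www.exampledocs.com/docs/learning-paths/api/v1/intro' \
  'https://www.exampledocs.com/docs/guides/'; do
  if url_allowed "$url" "$include" "$exclude"; then
    echo "visit $url"
  else
    echo "skip  $url"
  fi
done
```

Under these assumptions only the first URL is visited: the second matches the exclude pattern, and the third matches no include pattern.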

### Render

```bash
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/<account_id>/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://developers.cloudflare.com/workers/",
    "render": false
  }'
```

<Render
  file="setting-custom-user-agent"
  product="browser-rendering"
/>

<Render
  file="faq"
  product="browser-rendering"
/>
