docs: Add guide "HttpCrawler with custom parser" #1622

Mantisus · 2025-12-16T20:08:13Z

Description

Add guide "HttpCrawler with custom parser".

Issues

Closes: Add scrapling as a parser #1392
Closes: Integrate an HTML parser with XPath 2 support #702

codecov · 2025-12-16T20:17:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.49%. Comparing base (7f17a43) to head (bbc0157).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1622      +/-   ##
==========================================
- Coverage   92.50%   92.49%   -0.01%     
==========================================
  Files         157      157              
  Lines       10437    10439       +2     
==========================================
+ Hits         9655     9656       +1     
- Misses        782      783       +1

Flag	Coverage Δ
unit	`92.49% <100.00%> (-0.01%)`	⬇️
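
The percentages in the report follow from the Hits and Lines rows. A minimal sketch of that arithmetic is below. It assumes the figures are truncated to two decimals rather than rounded, which reproduces the numbers above but is not confirmed as Codecov's actual rule. The helper name `coverage_percent` is made up for illustration.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage, truncated (assumption) to two decimals."""
    # Integer arithmetic avoids float rounding before the truncation step.
    return (hits * 10_000 // lines) / 100


# Values taken from the Hits and Lines rows of the coverage diff.
base = coverage_percent(9655, 10437)  # master
head = coverage_percent(9656, 10439)  # this PR
delta = round(head - base, 2)

print(f"base={base:.2f}% head={head:.2f}% delta={delta:+.2f}%")
```

One of the two added lines is counted as a miss, so hits grow by 1 while lines grow by 2, which is enough to move the truncated figure down by 0.01.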

janbuchar

Thanks for the guide Max. While the code contained in there works, the idea was to use AbstractHttpCrawler along with a custom implementation of AbstractHttpParser - similarly to how BeautifulSoupCrawler and company are implemented.

The cool thing about this is that you can then use the parser class in AdaptivePlaywrightCrawler as well.
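
The separation described here can be sketched with the standard library alone. Every name below (`SketchParser`, `StdlibParser`, `SketchHttpCrawler`, `SketchBrowserCrawler`) is hypothetical and does not reproduce crawlee's actual `AbstractHttpParser` or `AbstractHttpCrawler` interfaces. The sketch only illustrates why keeping parsing behind its own class lets more than one crawler reuse it.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser
from typing import Generic, TypeVar

TParsed = TypeVar("TParsed")


class SketchParser(ABC, Generic[TParsed]):
    """Hypothetical parser interface (NOT crawlee's AbstractHttpParser)."""

    @abstractmethod
    def parse(self, body: bytes) -> TParsed:
        """Turn a raw response body into a parsed document."""

    @abstractmethod
    def find_links(self, parsed: TParsed) -> list[str]:
        """Return the href values found in the parsed document."""


class LinkCollector(HTMLParser):
    """Parsed-document stand-in that records the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class StdlibParser(SketchParser[LinkCollector]):
    """Concrete parser built on html.parser, standing in for a custom backend."""

    def parse(self, body: bytes) -> LinkCollector:
        collector = LinkCollector()
        collector.feed(body.decode("utf-8", errors="replace"))
        collector.close()
        return collector

    def find_links(self, parsed: LinkCollector) -> list[str]:
        return list(parsed.links)


class SketchHttpCrawler(Generic[TParsed]):
    """Hypothetical crawler that only knows the parser interface."""

    def __init__(self, parser: SketchParser[TParsed]) -> None:
        self._parser = parser

    def handle_response(self, body: bytes) -> list[str]:
        return self._parser.find_links(self._parser.parse(body))


class SketchBrowserCrawler(Generic[TParsed]):
    """Second hypothetical consumer that reuses the same parser on rendered HTML."""

    def __init__(self, parser: SketchParser[TParsed]) -> None:
        self._parser = parser

    def handle_rendered_html(self, html: str) -> list[str]:
        return self._parser.find_links(self._parser.parse(html.encode("utf-8")))


parser = StdlibParser()
page = '<a href="/a">A</a><a>no href</a><a href="/b">B</a>'
http_links = SketchHttpCrawler(parser).handle_response(page.encode("utf-8"))
browser_links = SketchBrowserCrawler(parser).handle_rendered_html(page)
```

Both crawlers depend only on the parser interface, so swapping in a different backend means writing one new parser class and leaving the crawlers untouched.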

Mantisus · 2025-12-17T13:15:43Z

> While the code contained in there works, the idea was to use AbstractHttpCrawler along with a custom implementation of AbstractHttpParser - similarly to how BeautifulSoupCrawler and company are implemented.

Got it. What do you think about expanding this guide with another section on implementation based on AbstractHttpCrawler?

janbuchar · 2025-12-17T15:00:04Z

> > While the code contained in there works, the idea was to use AbstractHttpCrawler along with a custom implementation of AbstractHttpParser - similarly to how BeautifulSoupCrawler and company are implemented.
>
> Got it. What do you think about expanding this guide with another section on implementation based on AbstractHttpCrawler?

Well, just adding it to the bottom won't cut it, but you can make the guide into two parts - quick and dirty solution and the "native" way. You can also explain the benefits of each approach.

janbuchar

This is shaping up real nice, thank you!

docs/guides/code_examples/http_crawlers/selectolax_crawler.py

docs/guides/crawler_custom_parser.mdx

vdusek

Looks great!

I'm also updating the description to note that this resolves #702 as well, due to saxonche.

However, I would say this content would fit better in the HTTP crawlers guide. I'd suggest merging it there.

"Using HttpCrawler with a custom parser" could be the next section after "HttpCrawler", and "Creating a custom crawler" is essentially the same as "Creating a custom HTTP crawler".

docs/guides/code_examples/crawler_custom_parser/selectolax_adaptive_run.py

docs/guides/code_examples/http_crawlers/selectolax_adaptive_run.py

docs/guides/crawler_custom_parser.mdx

vdusek

This is really good, thanks!

Just a few minor things... Mostly regarding more concise titles and consistent grammatical forms.

docs/guides/http_crawlers.mdx

pyproject.toml

vdusek · 2025-12-20T09:55:38Z

src/crawlee/crawlers/__init__.py

    'AdaptivePlaywrightCrawler',
    'AdaptivePlaywrightCrawlingContext',
    'AdaptivePlaywrightPreNavCrawlingContext',
+    'AdaptivePlaywrightCrawlerStatisticState',


If this was private until now, we should expose it in a separate "feat:" PR, as it extends the public interface.

Mantisus · 2025-12-20

Sure. #1635

add docs "HttpCrawler with custom parser"

e5aff86

Mantisus requested review from janbuchar and vdusek December 16, 2025 20:08

Mantisus self-assigned this Dec 16, 2025

fix

247ef79

janbuchar requested changes Dec 17, 2025

Mantisus added 2 commits December 18, 2025 03:24

add AbstractHttpCrawler section

758424b

del extra file

7a9e092

Mantisus requested a review from janbuchar December 18, 2025 03:27

janbuchar reviewed Dec 18, 2025

docs/guides/code_examples/http_crawlers/selectolax_crawler.py

docs/guides/crawler_custom_parser.mdx (outdated)

docs/guides/crawler_custom_parser.mdx (outdated)

add AdaptivePlaywrightCrawler example

a895901

Mantisus requested a review from janbuchar December 19, 2025 00:12

vdusek requested changes Dec 19, 2025

janbuchar reviewed Dec 19, 2025

Mantisus and others added 2 commits December 19, 2025 15:27

Update docs/guides/code_examples/crawler_custom_parser/selectolax_ada…

5844f4b

…ptive_run.py Co-authored-by: Jan Buchar <Teyras@gmail.com>

integrate to HTTP crawlers guide

7fe669e

Mantisus requested review from janbuchar and vdusek December 19, 2025 15:25

vdusek requested changes Dec 20, 2025

Mantisus and others added 8 commits December 20, 2025 16:20

Update pyproject.toml

2bc5967

Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>

Update docs/guides/http_crawlers.mdx

2b1f41f

Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>

Update docs/guides/http_crawlers.mdx

08ee00c

Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>

Update docs/guides/http_crawlers.mdx

8195397

Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>

Update docs/guides/http_crawlers.mdx

a5be06d

Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>

Update docs/guides/http_crawlers.mdx

e22346b

Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>

Update docs/guides/http_crawlers.mdx

3cfacf0

Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>

Update docs/guides/http_crawlers.mdx

ac427cc

Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>

Merge branch 'apify:master' into custom-http-parser

bbc0157
