feat: Add transform_request_function parameter for SitemapRequestLoader#1525
feat: Add transform_request_function parameter for SitemapRequestLoader#1525vdusek merged 5 commits intoapify:masterfrom
transform_request_function parameter for SitemapRequestLoader#1525Conversation
| exclude: list[re.Pattern[Any] | Glob] | None = None, | ||
| max_buffer_size: int = 200, | ||
| persist_state_key: str | None = None, | ||
| transform_request_function: Callable[[RequestOptions], RequestOptions | RequestTransformAction] | None = None, |
There was a problem hiding this comment.
Buuut... shouldn't this also receive an URL of the origin sitemap?
There was a problem hiding this comment.
I don't think that makes sense. A sitemap cannot contain links to another domain.
This way, users can easily create a mapping between the original link to the sitemap and the link inside transform_request_function. From my point of view, the most valuable thing that adding transform_request_function gives is the ability to add a label so that the request is processed by the appropriate handler.
There was a problem hiding this comment.
That makes a lot of sense, thanks. But I'm afraid that this won't "click" for a lot of people. Perhaps we could add an example that showcases this?
| def transform_request( | ||
| request_options: RequestOptions, | ||
| ) -> RequestOptions | RequestTransformAction: | ||
| request_host = URL(request_options['url']).host |
There was a problem hiding this comment.
Maybe we should mention that a sitemap should only contain links to the same host here
There was a problem hiding this comment.
Good point. Added.
transform_request_function parameter for SitemapRequestLoadertransform_request_function parameter for SitemapRequestLoader
Description
This PR is inspired by this discussion.
Add support
transform_request_functionforSitemapRequestLoader, which works the same way as inEnqueueLinksFunction. This can be useful for settinglabelfor correct routing oruser_datawith custom data.