Add browser User-Agent header to HTTP requests to support bot-protected APIs #1467

@mstykow

Problem

When converting documents from URLs with convert_to_markdown, some APIs return errors (404 or 403) to requests that do not carry a browser User-Agent header, even though the same URLs work in a web browser.

Example

API endpoint: https://api.abfall.io/?key=ba5c0a03ba41d81479797313161ced08&mode=export&idhousenumber=607&wastetypes=51,1146,1554,17,1553,627&timeperiod=20250101-20251231&showinactive=true&type=pdf

With markitdown-mcp:

{
  "error": "404 Client Error: Not Found for url: https://api.abfall.io/?key=..."
}

Same URL in browser: Works perfectly, downloads a 465KB PDF file

Root Cause

The API checks the User-Agent header and blocks requests that don't appear to be from a browser. This is a common pattern for APIs that want to prevent automated scraping but still allow legitimate browser access.

Reproduction

# Without a browser User-Agent (curl sends its default "curl/<version>") - returns 404
curl -I "https://api.abfall.io/?key=ba5c0a03ba41d81479797313161ced08&mode=export&idhousenumber=607&wastetypes=51,1146,1554,17,1553,627&timeperiod=20250101-20251231&showinactive=true&type=pdf"
# HTTP/2 404

# With browser User-Agent - returns 200 OK with PDF
curl -I -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" "https://api.abfall.io/?key=ba5c0a03ba41d81479797313161ced08&mode=export&idhousenumber=607&wastetypes=51,1146,1554,17,1553,627&timeperiod=20250101-20251231&showinactive=true&type=pdf"
# HTTP/2 200
# content-type: application/pdf
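
The same comparison can be sketched in Python with only the standard library. This is an illustration, not markitdown code: the helper names `build_request` and `status_for` are hypothetical, and the User-Agent string is the one from the curl command above. Note that urllib, like curl, sends its own default User-Agent (`Python-urllib/<version>`) when none is supplied.

```python
import urllib.error
import urllib.request

# Browser User-Agent copied from the curl reproduction above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build a HEAD request; override the User-Agent only if one is given."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers, method="HEAD")


def status_for(url, user_agent=None, timeout=10):
    """Return the HTTP status code for url, including 4xx/5xx responses."""
    try:
        request = build_request(url, user_agent)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on error statuses; the code is still what we want.
        return err.code
```

If the diagnosis in this issue holds, `status_for(url)` and `status_for(url, BROWSER_UA)` should return different codes for the endpoint in the example.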

Proposed Solution

Add a browser-like User-Agent header when making HTTP requests in markitdown. For example:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

This would:

  • ✅ Fix compatibility with APIs that check User-Agent
  • ✅ Better mimic legitimate browser behavior
  • ✅ Still be transparent about the tool's purpose
  • ✅ Not require any breaking changes to the API
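
One way the proposed header could be applied to every outgoing request is a shared `requests.Session` with a default User-Agent. The sketch below assumes the `requests` library is what issues the HTTP calls (the "404 Client Error" message above has that shape); `make_browser_session` is a hypothetical helper, not an existing markitdown function.

```python
import requests

# User-Agent string from the proposal above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def make_browser_session(user_agent: str = BROWSER_UA) -> requests.Session:
    """Return a session whose requests all carry the given User-Agent."""
    session = requests.Session()
    # Session-level headers are merged into every request the session sends.
    session.headers.update({"User-Agent": user_agent})
    return session
```

If the installed markitdown version accepts a caller-supplied session (worth verifying before relying on it), a session built this way could also serve as a workaround until a default header is added upstream.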

Impact

This issue affects any user trying to convert documents from:

  • Public APIs with bot protection
  • Websites that check User-Agent headers
  • Content management systems with basic access control
  • Government/municipal services (like the waste management API above)

Related Issues

Similar to #1196, but with a different error code (404 vs 403) and with the root cause confirmed through curl testing.

Environment

  • markitdown-mcp version: 1.8.1
  • Tested via: MCP HTTP server
  • Confirmed the issue is specifically User-Agent related through curl testing
