Description
Problem
When attempting to convert documents from URLs using convert_to_markdown, some APIs return errors (404, 403) when accessed without a browser User-Agent header, even though the same URLs work perfectly in a web browser.
Example
API endpoint: https://api.abfall.io/?key=ba5c0a03ba41d81479797313161ced08&mode=export&idhousenumber=607&wastetypes=51,1146,1554,17,1553,627&timeperiod=20250101-20251231&showinactive=true&type=pdf
With markitdown-mcp:
{
"error": "404 Client Error: Not Found for url: https://api.abfall.io/?key=..."
}
Same URL in browser: Works perfectly, downloads a 465KB PDF file
Root Cause
The API checks the User-Agent header and blocks requests that don't appear to be from a browser. This is a common pattern for APIs that want to prevent automated scraping but still allow legitimate browser access.
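To illustrate the kind of check described above, here is a minimal local sketch using only the Python standard library. It is not the real API's logic: the `UserAgentGate` handler, the `"Mozilla/"` substring test, and the `start_server` / `fetch_status` helpers are all assumptions made up for this demo. It only shows how a server-side User-Agent gate can turn the same URL into a 404 for one client and a 200 for another.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class UserAgentGate(BaseHTTPRequestHandler):
    """Toy handler (hypothetical): 200 for browser-looking User-Agents, 404 otherwise."""

    def do_GET(self):
        user_agent = self.headers.get("User-Agent", "")
        if "Mozilla/" in user_agent:
            body = b"%PDF-1.4 placeholder"
            self.send_response(200)
            self.send_header("Content-Type", "application/pdf")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def start_server():
    """Start the toy server on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), UserAgentGate)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_status(url, user_agent=None):
    """Return the HTTP status for url, optionally overriding the User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    # Bypass any configured proxy so the local demo server is reached directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
```

Calling `fetch_status(url)` against this toy server sends urllib's default `Python-urllib/...` User-Agent and gets the 404 branch, while passing a `Mozilla/5.0 ...` string gets the 200 branch, which mirrors the curl results in the Reproduction section.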
Reproduction
# Without a browser User-Agent (curl sends its default curl/<version> value) - returns 404
curl -I "https://api.abfall.io/?key=ba5c0a03ba41d81479797313161ced08&mode=export&idhousenumber=607&wastetypes=51,1146,1554,17,1553,627&timeperiod=20250101-20251231&showinactive=true&type=pdf"
# HTTP/2 404
# With browser User-Agent - returns 200 OK with PDF
curl -I -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" "https://api.abfall.io/?key=ba5c0a03ba41d81479797313161ced08&mode=export&idhousenumber=607&wastetypes=51,1146,1554,17,1553,627&timeperiod=20250101-20251231&showinactive=true&type=pdf"
# HTTP/2 200
# content-type: application/pdf
Proposed Solution
Add a browser-like User-Agent header when making HTTP requests in markitdown. For example:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}
This would:
- ✅ Fix compatibility with APIs that check User-Agent
- ✅ Better mimic legitimate browser behavior
- ✅ Still be transparent about the tool's purpose
- ✅ Not require any breaking changes to the API
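As a rough sketch of what the proposal amounts to, the snippet below attaches the suggested header to an outgoing request using the standard library's `urllib.request`. This is not markitdown's actual implementation, and I have not checked how markitdown issues its HTTP requests internally; `BROWSER_USER_AGENT`, `build_request`, and `download` are hypothetical names used only for illustration.

```python
import urllib.request

# Hypothetical constant; the value is the User-Agent string proposed in this issue.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, user_agent=BROWSER_USER_AGENT, timeout=30):
    """Fetch url and return the raw response body as bytes."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping the User-Agent as a parameter with a default, rather than hard-coding it, would let callers override it for servers that expect something else, so existing call sites would not need to change.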
Impact
This issue affects any user trying to convert documents from:
- Public APIs with bot protection
- Websites that check User-Agent headers
- Content management systems with basic access control
- Government/municipal services (like the waste management API above)
Related Issues
Similar to #1196 but with a different error code (404 vs 403) and confirmed root cause through testing.
Environment
- markitdown-mcp version: 1.8.1
- Tested via: MCP HTTP server
- Confirmed the issue is specifically User-Agent related through curl testing