A Python script that generates llms.txt and llms-full.txt files for any website using the Firecrawl and OpenAI APIs.
llms.txt is a standardized format for making website content more accessible to Large Language Models (LLMs). This script produces two files:
- llms.txt: A concise index of all pages with titles and descriptions
- llms-full.txt: Complete content of all pages for comprehensive access
- 🗺️ Website Mapping: Automatically discovers all URLs on a website using Firecrawl's map endpoint
- 📄 Content Scraping: Extracts markdown content from each page
- 🤖 AI Summaries: Uses OpenAI's GPT-4o-mini to generate concise titles and descriptions
- ⚡ Parallel Processing: Processes multiple URLs concurrently for faster generation
- 🎯 Configurable Limits: Set maximum number of URLs to process
- 📁 Flexible Output: Choose to generate both files or just llms.txt
- Python 3.7+
- Firecrawl API key (get one at https://firecrawl.dev)
- OpenAI API key (get one at https://platform.openai.com)
- Clone the repository:
git clone <repository-url>
cd <repository-directory>
- Install dependencies:
pip install -r requirements.txt
- Set up API keys (choose one method):
Option A: Using .env file (recommended)
cp env.example .env
# Edit .env and add your API keys

Option B: Using environment variables
export FIRECRAWL_API_KEY="your-firecrawl-api-key"
export OPENAI_API_KEY="your-openai-api-key"

Option C: Using command line arguments (see usage examples below)
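For Option A, the .env file copied from env.example presumably contains just the two keys named above; the values below are placeholders:

```
FIRECRAWL_API_KEY=fc-your-firecrawl-api-key
OPENAI_API_KEY=sk-your-openai-api-key
```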
Generate llms.txt and llms-full.txt for a website:
python generate-llmstxt.py https://example.com

# Limit to 50 URLs
python generate-llmstxt.py https://example.com --max-urls 50
# Save to specific directory
python generate-llmstxt.py https://example.com --output-dir ./output
# Only generate llms.txt (skip full text)
python generate-llmstxt.py https://example.com --no-full-text
# Enable verbose logging
python generate-llmstxt.py https://example.com --verbose
# Specify API keys via command line
python generate-llmstxt.py https://example.com \
--firecrawl-api-key "fc-..." \
--openai-api-key "sk-..."

Command line arguments (an illustrative parser sketch follows this list):
- url (required): The website URL to process
- --max-urls: Maximum number of URLs to process (default: 20)
- --output-dir: Directory to save output files (default: current directory)
- --firecrawl-api-key: Firecrawl API key (defaults to .env file or FIRECRAWL_API_KEY env var)
- --openai-api-key: OpenAI API key (defaults to .env file or OPENAI_API_KEY env var)
- --no-full-text: Only generate llms.txt, skip llms-full.txt
- --verbose: Enable verbose logging for debugging
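The flag set above maps naturally onto argparse. Here is an illustrative parser matching the documented defaults; the script's actual parser may differ in wording and internals:

```python
# Illustrative argparse setup mirroring the documented flags.
import argparse

parser = argparse.ArgumentParser(
    description="Generate llms.txt and llms-full.txt for a website")
parser.add_argument("url", help="The website URL to process")
parser.add_argument("--max-urls", type=int, default=20,
                    help="Maximum number of URLs to process")
parser.add_argument("--output-dir", default=".",
                    help="Directory to save output files")
parser.add_argument("--firecrawl-api-key",
                    help="Overrides the FIRECRAWL_API_KEY environment variable")
parser.add_argument("--openai-api-key",
                    help="Overrides the OPENAI_API_KEY environment variable")
parser.add_argument("--no-full-text", action="store_true",
                    help="Only generate llms.txt, skip llms-full.txt")
parser.add_argument("--verbose", action="store_true",
                    help="Enable verbose logging")
args = parser.parse_args()
```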
Example llms.txt:

# https://example.com llms.txt
- [Page Title](https://example.com/page1): Brief description of the page content here
- [Another Page](https://example.com/page2): Another concise description of page content
Example llms-full.txt:

# https://example.com llms-full.txt
<|firecrawl-page-1-lllmstxt|>
## Page Title
Full markdown content of the page...
<|firecrawl-page-2-lllmstxt|>
## Another Page
Full markdown content of another page...
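A minimal sketch of how entries and sections matching the formats above could be assembled; these helper names are illustrative, not the script's own:

```python
# Build one llms.txt index line and one llms-full.txt section.
def format_index_entry(url, title, description):
    # One llms.txt line: - [Title](url): description
    return f"- [{title}]({url}): {description}"

def format_full_section(index, title, markdown):
    # One llms-full.txt section, delimited per page as shown above
    return f"<|firecrawl-page-{index}-lllmstxt|>\n## {title}\n{markdown}\n"

print(format_index_entry("https://example.com/page1", "Page Title",
                         "Brief description of the page content here"))
```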
How it works:
- Website Mapping: Uses Firecrawl's /map endpoint to discover all URLs on the website
- Batch Processing: Processes URLs in batches of 10 for efficiency
- Content Extraction: Scrapes each URL to extract markdown content
- AI Summarization: For each page, GPT-4o-mini generates:
  - A 3-4 word title
  - A 9-10 word description
- File Generation: Creates formatted llms.txt and llms-full.txt files (a condensed sketch of these steps follows)
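The sketch below condenses the mapping, scraping, and summarization steps. It assumes Firecrawl's v1 REST endpoints (/v1/map, /v1/scrape) and the v1 OpenAI Python SDK; the script itself may use the firecrawl-py SDK instead, and response shapes may differ:

```python
import os
import requests
from openai import OpenAI

FIRECRAWL_HEADERS = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def map_site(url, limit=20):
    # Step 1: discover URLs via the /map endpoint
    resp = requests.post("https://api.firecrawl.dev/v1/map",
                         json={"url": url}, headers=FIRECRAWL_HEADERS)
    resp.raise_for_status()
    return resp.json().get("links", [])[:limit]

def scrape_markdown(url):
    # Step 3: extract markdown content for one page
    resp = requests.post("https://api.firecrawl.dev/v1/scrape",
                         json={"url": url, "formats": ["markdown"]},
                         headers=FIRECRAWL_HEADERS)
    resp.raise_for_status()
    return resp.json()["data"]["markdown"]

def summarize(markdown):
    # Step 4: ask GPT-4o-mini for a short title and description
    chat = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": ("Give a 3-4 word title and a 9-10 word description "
                        "for this page:\n\n" + markdown[:4000]),
        }],
    )
    return chat.choices[0].message.content
```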
Error handling:
- Failed URL scrapes are logged and skipped (sketched below)
- If no URLs are found, the script exits with an error
- API errors are logged with details for debugging
- Rate limiting is handled with delays between batches
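Illustrative per-URL error handling consistent with the list above, where a failed scrape is logged and the URL is skipped; the function name is hypothetical:

```python
import logging

logger = logging.getLogger(__name__)

def scrape_safely(scrape_fn, url):
    """Return markdown for url, or None if the scrape fails."""
    try:
        return scrape_fn(url)
    except Exception as exc:  # network errors, API errors, bad responses
        logger.warning("Skipping %s: %s", url, exc)
        return None
```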
Performance notes (a batching sketch follows this list):
- Processing time depends on the number of URLs and response times
- Default batch size is 10 URLs processed concurrently
- Small delays between batches prevent rate limiting
- For large websites, consider using --max-urls to limit processing
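One possible shape for the batching described above: 10 concurrent workers per batch with a short pause between batches. The batch size is the documented default; the delay value here is an assumption:

```python
import time
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 10
DELAY_BETWEEN_BATCHES = 1.0  # seconds (assumed value)

def process_in_batches(urls, worker):
    results = []
    for start in range(0, len(urls), BATCH_SIZE):
        batch = urls[start:start + BATCH_SIZE]
        with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
            results.extend(pool.map(worker, batch))
        time.sleep(DELAY_BETWEEN_BATCHES)  # back off between batches
    return results
```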
Examples:

# Small blog
python generate-llmstxt.py https://small-blog.com --max-urls 20

# Documentation site, with verbose logging
python generate-llmstxt.py https://docs.example.com --max-urls 100 --verbose

# Index only, capped at 50 URLs
python generate-llmstxt.py https://example.com --no-full-text --max-urls 50

The script checks for API keys in this order (sketched after the list):
1. Command line arguments (--firecrawl-api-key, --openai-api-key)
2. .env file in the current directory
3. Environment variables (FIRECRAWL_API_KEY, OPENAI_API_KEY)
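A sketch of that resolution order, assuming python-dotenv; whether the script passes override=True (which makes .env win over already-exported variables, matching step 2 before step 3) is an assumption:

```python
import os
from dotenv import load_dotenv

# override=True lets values from .env take precedence over variables
# already exported in the shell (assumed behavior, matching the order above).
load_dotenv(override=True)

def resolve_key(cli_value, env_name):
    # Step 1: an explicit CLI argument beats both .env and exported variables.
    return cli_value or os.environ.get(env_name)

firecrawl_key = resolve_key(None, "FIRECRAWL_API_KEY")  # no CLI flag passed
```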
If the script reports missing API keys, ensure you've either:
- Created a .env file with your API keys (copy from env.example)
- Set environment variables: FIRECRAWL_API_KEY and OPENAI_API_KEY
- Passed them via command line arguments
If you encounter rate limits (see the backoff sketch after this list):
- Reduce concurrent workers in the code
- Add longer delays between batches
- Process fewer URLs at once
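One generic way to add the longer delays suggested above is to retry a request with exponential backoff when the API answers HTTP 429; this is a common pattern, not code taken from the script:

```python
import time
import requests

def post_with_backoff(url, retries=5, **kwargs):
    for attempt in range(retries):
        resp = requests.post(url, **kwargs)
        if resp.status_code != 429:
            return resp
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    resp.raise_for_status()  # still rate-limited after all retries
```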
For very large websites:
- Use --max-urls to limit the number of pages
- Process in smaller batches
- Use --no-full-text to skip full content generation
MIT License - see LICENSE file for details