PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
Updated May 15, 2026 - Java
A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.
Fast Rust library for PDF inspection, classification, and text extraction. Intelligently detects scanned vs text-based PDFs to enable smart routing decisions.
Free open-source web software for signing PDFs (alone or with others), organizing pages, editing metadata, and compressing PDFs
Use TradeRepublic in the terminal and mass-download all documents
JavaScript bindings for MuPDF
Fast GPU OCR server. 270 img/s on FUNSD. TensorRT FP16, PP-OCRv5, HTTP + gRPC.
Convert your PDFs and EPUBs into audiobooks effortlessly. Features intelligent text extraction, customizable text-to-speech settings, and efficient processing for low-resource systems.
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
Java PDF table extraction & OCR library. Extract structured tables from text-based and scanned PDFs using stream, lattice (OpenCV-style grid detection), and hybrid parsing.
Claude Code and Codex SKILLs for PDF, Excel, Word, and PowerPoint manipulation — extraction, forms, formulas, tracked changes, adapted from Anthropic skills.
PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.
A professional CLI workflow for PhD students to extract, analyze, and visualize academic papers into structured Markdown and Obsidian Canvas.
Translate many large PDF reports for free using Python.
Web content extraction engine backed by Qt WebEngine.
Find what gets buried in the data room. 13 AI agents analyze every contract across 9 domains (Legal, Finance, Commercial, Tech, Cyber, HR, Tax, Regulatory, ESG), cross-reference findings, and trace each to exact page & quote. Interactive chat, Excel/Word export, knowledge that compounds across runs.
🚀 Simplify your research workflow with Claude Scholar, the complete configuration for Claude Code in data science, AI, and academic writing.
Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.
Scrape and ingest HKEx (Hong Kong Stock Exchange) regulatory filings into SurrealDB with full-text extraction and graph linking.
Fast pure-Rust PDF extraction library and CLI by Clark Labs Inc. — 10–50x faster than pdfplumber for text, word, table, layout, image, and metadata extraction.