pdf-extraction

Here are 241 public repositories matching this topic...

opendataloader-project / opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

Updated May 15, 2026
Java

kreuzberg-dev / kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.

Updated May 17, 2026
Rust

firecrawl / pdf-inspector

Fast Rust library for PDF inspection, classification, and text extraction. Intelligently detects scanned vs text-based PDFs to enable smart routing decisions.

nodejs python markdown rust pdf text-extraction pdf-parser pdf-extraction ocr-routing pdf-classification

Updated May 15, 2026
Rust

24eme / signaturepdf

Free open-source web software for signing PDFs (alone or with others), organizing pages, editing metadata, and compressing PDFs.

php pdf js signature pdf-manipulation pdf-merge pdf-format pdf-rotate pdf-merger pdf-meta-editor pdf-tools pdf-signature pdf-compression pdf-editor pdf-sign pdf-extraction pdf-signer pdf-metadata pdf-compressor

Updated Apr 30, 2026
JavaScript

pytr-org / pytr

Use TradeRepublic in terminal and mass download all documents

portfolio finance terminal-app portfolio-performance pdf-extraction traderepublic-statements traderepublic

Updated May 10, 2026
Python

ArtifexSoftware / mupdf.js

JavaScript bindings for MuPDF

javascript pdf typescript wasm mupdf pdf-viewer pdf-extraction

Updated May 5, 2026

aiptimizer / TurboOCR

Fast GPU OCR server. 270 img/s on FUNSD. TensorRT FP16, PP-OCRv5, HTTP + gRPC.

ocr grpc nvidia text-recognition text-detection inference-server fp16 tensorrt rag fastapi pdf-extraction paddleocr easyocr document-ai document-parsing qwen-vl gpu-ocr

Updated May 14, 2026
C++

mateogon / pdf-narrator

Convert your PDFs and EPUBs into audiobooks effortlessly. Features intelligent text extraction, customizable text-to-speech settings, and efficient processing for low-resource systems.

pdf text-to-speech audiobook tts epub low-resource pdf-extraction pdf-to-audiobook immersive-reading kokoro-tts audiobook-generator pdf-audiobook

Updated Feb 26, 2026
Python

iamarunbrahma / pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

python information-retrieval document-conversion pdf-converter text-extraction pdf-parsing document-processing rag pdf-extraction retrieval-augmented-generation pdf-to-markdown

Updated Nov 22, 2024
Python

ExtractPDF4J / ExtractPDF4J

Java PDF table extraction & OCR library. Extract structured tables from text-based and scanned PDFs using stream, lattice (OpenCV-style grid detection), and hybrid parsing.

java cli ocr maven pdf-document pdf-extractor ocr-recognition document-processing pdf-processor pdf-document-processor pdf-extraction java17

Updated Mar 15, 2026
Java

appautomaton / document-SKILLs

Claude Code and Codex SKILLs for PDF, Excel, Word, and PowerPoint manipulation — extraction, forms, formulas, tracked changes, adapted from Anthropic skills.

excel docx pptx codex ai-agents document-processing pdf-extraction agent-skills claude-code claude-skills

Updated Mar 26, 2026
Python

NameetP / pdfmux

PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.

python pdf ocr mcp self-healing structured-extraction rag pdf-to-json pdf-extraction ai-agent llm document-parsing pdf-to-markdown docling opendataloader

Updated May 5, 2026
Python

heleninsights-dot / phd-deepread-workflow

A professional CLI workflow for PhD students to extract, analyze, and visualize academic papers into structured Markdown and Obsidian Canvas.

python pdf workflow research academic obsidian literature-review pdf-extraction

Updated May 15, 2026
Python

pcschreiber1 / PDF_Extraction-Translation

Translate many large PDF reports for free using Python.

python pdf-extraction pdf-translation

Updated Dec 31, 2022
Jupyter Notebook

wszqkzqk / qt-web-extractor

Web content extraction engine backed by Qt WebEngine.

mcp chromium web-scraping qtwebengine content-extraction headless-browser pdf-extraction pyside6 open-webui mcp-server

Updated Apr 21, 2026
Python

zoharbabin / due-diligence-agents

Find what gets buried in the data room. 13 AI agents analyze every contract across 9 domains (Legal, Finance, Commercial, Tech, Cyber, HR, Tax, Regulatory, ESG), cross-reference findings, and trace each to exact page & quote. Interactive chat, Excel/Word export, knowledge that compounds across runs.

Updated May 17, 2026
Python

jessevanwyk1 / claude-scholar

🚀 Simplify your research workflow with Claude Scholar, the complete configuration for Claude Code in data science, AI, and academic writing.

search mcp academic pubmed summarization research-tool reading-list arxiv ai-safety literature-review scientific-literature semantic-scholar pdf-extraction streamlit academic-papers academic-research research-tools mcp-server claude-code

Updated May 17, 2026
TeX

aidalinfo / extract-kit

Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.

pdf document-processing ai-sdk pdf-extraction vision-llm

Updated Sep 14, 2025
TypeScript

simonplmak-cloud / hkex-filing-scraper

Scrape and ingest HKEx (Hong Kong Stock Exchange) regulatory filings into SurrealDB with full-text extraction and graph linking.

python open-source scraper etl web-scraping stock-market graph-database financial-data data-pipeline hong-kong hkex pdf-extraction surrealdb regulatory-filings hong-kong-stock-exchange

Updated Feb 15, 2026
Python

clark-labs-inc / pdfsink-rs

Fast pure-Rust PDF extraction library and CLI by Clark Labs Inc. — 10–50x faster than pdfplumber for text, word, table, layout, image, and metadata extraction.

rust pdf text-extraction rust-library pdf-to-text rust-crate table-extraction pdf-parser document-processing layout-analysis pdf-to-json pdf-extraction pdfplumber document-ai clark-labs

Updated Apr 27, 2026
Rust
