Merged
54 changes: 47 additions & 7 deletions handling-pdf-files/pdf-compressor/README.md
@@ -1,8 +1,48 @@
# [How to Compress PDF Files in Python](https://www.thepythoncode.com/article/compress-pdf-files-in-python)
To run this:
- `pip3 install -r requirements.txt`
- To compress `bert-paper.pdf` file:
```
$ python pdf_compressor.py bert-paper.pdf bert-paper-min.pdf
```
This will spawn a new compressed PDF file under the name `bert-paper-min.pdf`.

This directory contains two approaches:

- Legacy (commercial): `pdf_compressor.py` uses PDFTron/PDFNet. PDFNet now requires a license key and the old pip package is not freely available, so this may not work without a license.
- Recommended (open source): `pdf_compressor_ghostscript.py` uses Ghostscript to compress PDFs.

## Ghostscript method (recommended)

Prerequisite: Install Ghostscript

- macOS (Homebrew):
  - `brew install ghostscript`
- Ubuntu/Debian:
  - `sudo apt-get update && sudo apt-get install -y ghostscript`
- Windows:
  - Download and install from https://ghostscript.com/releases/
  - Ensure `gswin64c.exe` (or `gswin32c.exe`) is in your PATH.

No Python packages are required for this method, only Ghostscript.
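
Before running the script, it can help to confirm that a Ghostscript binary is actually reachable. The following is a minimal sketch, assuming the same executable names the script looks for (`gs`, `gswin64c`, `gswin32c`). The `find_ghostscript` and `ghostscript_version` helpers are illustrative and not part of this repository.

```python
import shutil
import subprocess
from typing import Optional

# Executable names to probe: Linux/macOS first, then the Windows console builds.
GS_NAMES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript() -> Optional[str]:
    """Return the path of the first Ghostscript executable found on PATH."""
    for name in GS_NAMES:
        path = shutil.which(name)
        if path:
            return path
    return None


def ghostscript_version(gs_path: str) -> str:
    """Ask Ghostscript for its version string via `--version`."""
    result = subprocess.run(
        [gs_path, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    gs = find_ghostscript()
    if gs is None:
        print("Ghostscript not found on PATH")
    else:
        print(f"Found {gs} (version {ghostscript_version(gs)})")
```

If this prints "Ghostscript not found on PATH", revisit the installation steps above before trying the compressor.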

### Usage

To compress `bert-paper.pdf` into `bert-paper-min.pdf` with default quality (`power=2`):

```
python pdf_compressor_ghostscript.py bert-paper.pdf bert-paper-min.pdf
```

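The optional `[power]` argument sets the compression/quality tradeoff by selecting a Ghostscript `-dPDFSETTINGS` preset:

- 0 = `/screen` (smallest files, lowest quality; images downsampled to about 72 dpi)
- 1 = `/ebook` (medium quality; about 150 dpi)
- 2 = `/printer` (high quality; about 300 dpi) [default]
- 3 = `/prepress` (very high quality; about 300 dpi, color-preserving)
- 4 = `/default` (Ghostscript's general-purpose preset; may produce larger files)

Any other integer falls back to `/printer`.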

Example:

```
python pdf_compressor_ghostscript.py bert-paper.pdf bert-paper-min.pdf 1
```

In testing, `bert-paper.pdf` (~757 KB) compressed to ~407 KB with `power=1`.
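
The script reports this reduction as a compression ratio, computed as `1 - compressed_size / initial_size`. A quick check of that arithmetic with the approximate sizes quoted above (the `compression_ratio` helper is illustrative and not part of the script):

```python
def compression_ratio(initial_size: float, compressed_size: float) -> float:
    """Fraction of the original size that was saved (same formula as the script)."""
    return 1 - (compressed_size / initial_size)


# Approximate sizes from the test run above, in KB.
ratio = compression_ratio(757, 407)
print(f"{ratio:.1%}")  # roughly 46% smaller
```

Because the quoted sizes are rounded, the figure printed by the script for the real files may differ slightly.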

## Legacy PDFNet method (requires license)

If you have a valid license and the PDFNet SDK installed, you can use the original `pdf_compressor.py` script. Note that the previously referenced `PDFNetPython3` pip package is not freely available and may not install via pip. Refer to the vendor's documentation for installation and licensing.
103 changes: 103 additions & 0 deletions handling-pdf-files/pdf-compressor/pdf_compressor_ghostscript.py
@@ -0,0 +1,103 @@
import os
import sys
import subprocess
import shutil


def get_size_format(b, factor=1024, suffix="B"):
    """Scale a byte count to a human-readable string, e.g. 1253656 -> '1.20MB'."""
    for unit in ["", "K", "M", "G", "T", "P", "E", "Z"]:
        if b < factor:
            return f"{b:.2f}{unit}{suffix}"
        b /= factor
    return f"{b:.2f}Y{suffix}"


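def find_ghostscript_executable():
    """Return the path of the first Ghostscript binary found on PATH, or None."""
    # 'gs' is the Linux/macOS name; 'gswin64c'/'gswin32c' are the Windows console builds.
    candidates = [
        shutil.which('gs'),
        shutil.which('gswin64c'),
        shutil.which('gswin32c'),
    ]
    for c in candidates:
        if c:
            return c
    return None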


def compress_file(input_file: str, output_file: str, power: int = 2):
    """Compress PDF using Ghostscript.

    power:
        0 -> /screen (lowest quality, highest compression)
        1 -> /ebook (good quality)
        2 -> /printer (high quality) [default]
        3 -> /prepress (very high quality)
        4 -> /default (Ghostscript default)

    Returns True on success, False if Ghostscript exits with an error.
    """
    if not os.path.exists(input_file):
        raise FileNotFoundError(f"Input file not found: {input_file}")
    if not output_file:
        output_file = input_file

    # Ghostscript must not write to the file it is still reading, so an
    # in-place request goes to a temporary sibling file that replaces the
    # input only after Ghostscript succeeds.
    in_place = os.path.abspath(output_file) == os.path.abspath(input_file)
    target_file = f"{output_file}.tmp" if in_place else output_file

    initial_size = os.path.getsize(input_file)

    gs = find_ghostscript_executable()
    if not gs:
        raise RuntimeError(
            "Ghostscript not found. Install it and ensure 'gs' (Linux/macOS) "
            "or 'gswin64c'/'gswin32c' (Windows) is in PATH."
        )

    settings_map = {
        0: '/screen',
        1: '/ebook',
        2: '/printer',
        3: '/prepress',
        4: '/default',
    }
    # Unknown power values fall back to /printer.
    pdfsettings = settings_map.get(power, '/printer')

    cmd = [
        gs,
        '-sDEVICE=pdfwrite',
        '-dCompatibilityLevel=1.4',
        f'-dPDFSETTINGS={pdfsettings}',
        '-dNOPAUSE',
        '-dBATCH',
        '-dQUIET',
        f'-sOutputFile={target_file}',
        input_file,
    ]

    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Ghostscript failed: {e}")
        # Do not leave a partial temporary file behind.
        if in_place and os.path.exists(target_file):
            os.remove(target_file)
        return False

    if in_place:
        os.replace(target_file, output_file)

    compressed_size = os.path.getsize(output_file)
    ratio = 1 - (compressed_size / initial_size)
    summary = {
        "Input File": input_file,
        "Initial Size": get_size_format(initial_size),
        "Output File": output_file,
        "Compressed Size": get_size_format(compressed_size),
        "Compression Ratio": f"{ratio:.3%}",
    }

    print("## Summary ########################################################")
    for k, v in summary.items():
        print(f"{k}: {v}")
    print("###################################################################")
    return True


if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Usage: python pdf_compressor_ghostscript.py <input.pdf> <output.pdf> [power 0-4]")
        sys.exit(1)
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    power = int(sys.argv[3]) if len(sys.argv) > 3 else 2
    ok = compress_file(input_file, output_file, power)
    sys.exit(0 if ok else 2)
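
The `sys.argv` handling above raises an unhandled `ValueError` when `power` is not numeric, and out-of-range values silently fall back to `/printer`. One possible alternative, shown only as a sketch and not part of this PR, is `argparse` with `choices`. The `build_parser` helper name is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical replacement for the manual sys.argv parsing (not in the PR)."""
    parser = argparse.ArgumentParser(
        prog="pdf_compressor_ghostscript.py",
        description="Compress a PDF with Ghostscript.",
    )
    parser.add_argument("input_file", help="path to the source PDF")
    parser.add_argument("output_file", help="path for the compressed PDF")
    parser.add_argument(
        "power",
        nargs="?",          # optional positional, as in the original usage string
        type=int,
        choices=range(5),   # rejects anything outside 0-4
        default=2,
        help="0=/screen, 1=/ebook, 2=/printer (default), 3=/prepress, 4=/default",
    )
    return parser


if __name__ == "__main__":
    # Demo on a fixed argument list instead of the real command line.
    args = build_parser().parse_args(["bert-paper.pdf", "bert-paper-min.pdf", "1"])
    print(args.input_file, args.output_file, args.power)
```

With this approach an invalid `power` makes `argparse` print a usage message and exit with status 2, which happens to match the exit code the script already uses for a failed compression.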
8 changes: 7 additions & 1 deletion handling-pdf-files/pdf-compressor/requirements.txt
@@ -1 +1,7 @@
PDFNetPython3==8.1.0
# No Python dependencies required for Ghostscript-based compressor.
# System dependency: Ghostscript
# - macOS: brew install ghostscript
# - Debian: sudo apt-get install -y ghostscript
# - Windows: https://ghostscript.com/releases/
#
# The legacy script (pdf_compressor.py) depends on PDFNet (commercial) and a license key.