extract vector charts from PDF

Question

I can't find a way to isolate or extract vector charts and graphs (ones that are not images) from a PDF.

I have tried extracting them directly, but I realized that it is not that straightforward. I was using PyMuPDF. The script below extracts and saves only embedded images, but I need to save charts that are not images. Apparently a PDF stores those differently.

import io
import os

import fitz  # PyMuPDF
from PIL import Image

pdf_path = 'path to pdf'
output_folder = 'your output folder'
os.makedirs(output_folder, exist_ok=True)

doc = fitz.open(pdf_path)
chart_count = 0

# Only embedded raster images on the first page are listed here;
# vector drawings are not returned by get_images().
page = doc.load_page(0)
img_list = page.get_images(full=True)
for img in img_list:
    xref = img[0]  # the first entry of each item is the image's xref
    base_image = doc.extract_image(xref)
    image_bytes = base_image["image"]
    image = Image.open(io.BytesIO(image_bytes))
    image_path = os.path.join(output_folder, f"chart_{chart_count + 1}.png")
    image.save(image_path)
    chart_count += 1

This works well only for image objects in the PDF, not for vector charts. Do you have any suggestions or solutions?

Sample PDF file (where you can see that not all charts are being extracted)

K J · Accepted Answer · 2025-02-03 23:31:36Z

As you have correctly described, a PDF page is made up of different components. Some are areas of colour, others are text and perhaps JPEG images, so once we strip the background paper colours the first 6 pages match that description well.

The chart-like pages consist of floating images and floating text characters. Any page colours or linework are totally separate sub-page objects.

Moving on to the ones you hoped would behave differently: we can see these are either images or simply parts of a page, and thus not independent graphics that can be extracted.

Thus, to extract the objects in an area, you must either gather them by their coordinates within your Region of Interest (ROI) or redact everything else from the page.
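One possible way to find candidate regions by coordinates is sketched below, assuming that each chart is a cluster of vector paths lying close together. The helper names (`merge_boxes`, `find_vector_regions`) and the `gap` and `min_size` thresholds are hypothetical and will need tuning per document. Only `Page.get_drawings()` and the `"rect"` key of each returned path come from PyMuPDF.

```python
def merge_boxes(boxes, gap=5.0):
    """Merge (x0, y0, x1, y1) boxes that overlap or lie within `gap` points."""
    merged = []
    for box in boxes:
        box = tuple(box)
        changed = True
        while changed:  # keep absorbing neighbours until the box stops growing
            changed = False
            for other in merged:
                near = (box[0] - gap <= other[2] and other[0] - gap <= box[2]
                        and box[1] - gap <= other[3] and other[1] - gap <= box[3])
                if near:
                    merged.remove(other)
                    box = (min(box[0], other[0]), min(box[1], other[1]),
                           max(box[2], other[2]), max(box[3], other[3]))
                    changed = True
                    break
        merged.append(box)
    return merged


def find_vector_regions(pdf_path, page_number=0, gap=5.0, min_size=50.0):
    """Return bounding boxes of clustered vector drawings on one page."""
    import fitz  # PyMuPDF, imported here so merge_boxes() works without it

    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_number)
        page_area = page.rect.width * page.rect.height
        boxes = []
        for path in page.get_drawings():
            rect = path["rect"]
            # Assumption: a path covering almost the whole page is background
            # paper colour, not part of a chart, so it is skipped.
            if rect.width * rect.height < 0.9 * page_area:
                boxes.append(tuple(rect))
    # Drop tiny clusters such as rules, bullets or underlines.
    return [b for b in merge_boxes(boxes, gap)
            if b[2] - b[0] >= min_size and b[3] - b[1] >= min_size]
```

Each returned box can then serve as an ROI. Text labels are not vector paths, so axis titles and legends may fall slightly outside a box; padding the box by a few points is a reasonable first adjustment.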

PyMuPDF is good at redaction, so trim away all of the page outside the Region of Interest using X and Y redaction boxes.
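A minimal sketch of that trimming step follows, assuming the ROI is already known as `(x0, y0, x1, y1)` in PyMuPDF page coordinates (origin at the top left). The helper names `strips_outside` and `redact_outside` are made up for illustration. How `apply_redactions()` treats vector graphics that merely touch a redaction box depends on the PyMuPDF version and its optional arguments, so check the result on your own file.

```python
def strips_outside(page_box, roi):
    """Return up to four boxes that cover the page area outside the ROI."""
    px0, py0, px1, py1 = page_box
    x0, y0, x1, y1 = roi
    candidates = [
        (px0, py0, px1, y0),  # full-width strip above the ROI
        (px0, y1, px1, py1),  # full-width strip below the ROI
        (px0, y0, x0, y1),    # strip to the left of the ROI
        (x1, y0, px1, y1),    # strip to the right of the ROI
    ]
    # Skip empty strips, e.g. when the ROI touches a page edge.
    return [c for c in candidates if c[2] > c[0] and c[3] > c[1]]


def redact_outside(pdf_path, out_path, roi, page_number=0):
    """Redact everything outside `roi` and save that page as its own PDF."""
    import fitz  # PyMuPDF, imported here so strips_outside() works without it

    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_number)
        for strip in strips_outside(tuple(page.rect), roi):
            page.add_redact_annot(fitz.Rect(strip))
        # Default settings are used; objects straddling the ROI border
        # may be removed or kept depending on the library version.
        page.apply_redactions()
        doc.select([page_number])  # keep only the edited page
        doc.save(out_path)
```

Four strips are used rather than one large box because a redaction box cannot have a hole in it; the strips tile the page around the ROI without overlapping it.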

Then, once all the surrounding data is deleted, make sure the remaining text is a single colour for ease of viewing.

The end result of editing with MuPDF can thus be a single-page PDF of the retained and edited area.

Finally, you can reduce the page size to whatever you want it to be.
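As a rough illustration of ending up with a page sized to the chart, the sketch below places the ROI of the source page onto a new page of exactly that size with `Page.show_pdf_page()` and its `clip` argument. This is an alternative to shrinking the original page, not necessarily what was done by hand here. `clamp_box`, `export_region` and the optional `png_path` are hypothetical names. Note that clipping only hides content outside the ROI; it may still be present in the output file, so redaction is still needed if the surrounding data must really be gone.

```python
def clamp_box(box, limits):
    """Clip (x0, y0, x1, y1) `box` so that it lies inside `limits`."""
    x0 = max(box[0], limits[0])
    y0 = max(box[1], limits[1])
    x1 = min(box[2], limits[2])
    y1 = min(box[3], limits[3])
    if x1 <= x0 or y1 <= y0:
        raise ValueError("region does not overlap the page")
    return (x0, y0, x1, y1)


def export_region(pdf_path, out_path, roi, page_number=0, png_path=None):
    """Write `roi` of one page to a single-page PDF, still as vector data."""
    import fitz  # PyMuPDF, imported here so clamp_box() works without it

    with fitz.open(pdf_path) as src:
        src_page = src.load_page(page_number)
        clip = fitz.Rect(clamp_box(roi, tuple(src_page.rect)))

        out = fitz.open()  # new, empty PDF
        page = out.new_page(width=clip.width, height=clip.height)
        # Draw the clipped source region so that it fills the new page.
        page.show_pdf_page(page.rect, src, page_number, clip=clip)
        out.save(out_path)
        out.close()

        if png_path is not None:
            # Optional raster copy of the same region, e.g. for a PNG chart.
            src_page.get_pixmap(clip=clip, dpi=200).save(png_path)
```

Clamping the ROI first avoids asking for an area that extends past the page, which can happen when a detected box has been padded.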

The code would be too large for me to write a custom editor for each page, so I simply cut and paste using mutool and Notepad, which is far easier.
