I can't find a way to isolate or extract vector charts and graphs (ones that are not images) from a PDF.
I tried extracting them directly, but it turned out not to be that straightforward. I was using PyMuPDF. The script below extracts and saves only embedded images, but I need to save the charts that are not images. Apparently vector graphics are stored differently in a PDF.
import io
import os

import fitz  # PyMuPDF
from PIL import Image

pdf_path = 'path to pdf'
output_folder = 'your output folder'
os.makedirs(output_folder, exist_ok=True)

doc = fitz.open(pdf_path)
chart_count = 0
page = doc.load_page(0)

# get_images() lists only embedded raster images, not vector drawings
img_list = page.get_images(full=True)
for img_index, img in enumerate(img_list):
    base_image = doc.extract_image(img[0])  # img[0] is the image xref
    image_bytes = base_image["image"]
    image = Image.open(io.BytesIO(image_bytes))
    image_path = os.path.join(output_folder, f"chart_{chart_count + 1}.png")
    image.save(image_path)
    chart_count += 1
This only works well for image-type content in the PDF, not for vector charts. Do you have any suggestions or solutions?
Sample PDF file (where you can see that not all charts are extracted)
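One direction I am considering, in case it helps frame an answer: as far as I understand, PyMuPDF exposes vector paths through page.get_drawings(), where each entry carries a bounding "rect". The sketch below groups nearby path rectangles into larger regions and renders each region to a PNG with page.get_pixmap(clip=...). The helper names (merge_rects, save_vector_regions) and the gap, min_size and zoom values are my own assumptions, not anything from the library, and I have not verified this on the sample file.

```python
import os


def merge_rects(rects, gap=5.0):
    """Merge (x0, y0, x1, y1) boxes that overlap or lie within `gap` points."""
    merged = []
    for x0, y0, x1, y1 in rects:
        changed = True
        while changed:  # keep absorbing neighbours as the box grows
            changed = False
            for other in merged:
                ox0, oy0, ox1, oy1 = other
                near = (x0 <= ox1 + gap and ox0 <= x1 + gap and
                        y0 <= oy1 + gap and oy0 <= y1 + gap)
                if near:
                    x0, y0 = min(x0, ox0), min(y0, oy0)
                    x1, y1 = max(x1, ox1), max(y1, oy1)
                    merged.remove(other)
                    changed = True
                    break
        merged.append((x0, y0, x1, y1))
    return merged


def save_vector_regions(pdf_path, output_folder, page_number=0,
                        min_size=50.0, zoom=2.0):
    """Render each cluster of vector drawings on one page to a PNG file."""
    import fitz  # PyMuPDF; imported here so merge_rects works without it

    os.makedirs(output_folder, exist_ok=True)
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)

    # Each drawing dict has a "rect" with the bounding box of one vector path
    boxes = [tuple(d["rect"]) for d in page.get_drawings()]

    saved = []
    for x0, y0, x1, y1 in merge_rects(boxes):
        # Assumption: tiny regions are rules or table borders, not charts
        if x1 - x0 < min_size or y1 - y0 < min_size:
            continue
        clip = fitz.Rect(x0, y0, x1, y1) & page.rect  # stay inside the page
        if clip.is_empty:
            continue
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=clip)
        path = os.path.join(output_folder, f"vector_chart_{len(saved) + 1}.png")
        pix.save(path)
        saved.append(path)

    doc.close()
    return saved
```

Calling save_vector_regions(pdf_path, output_folder) should write one PNG per detected region. Two caveats I am aware of: this rasterizes the chart instead of keeping it as vectors, and axis labels are text rather than drawings, so they may fall outside the clip unless the rectangle is padded. A full-page background fill would also merge everything into one region.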






