I am trying to use requests.get() to download files from the assist.org website for a research project. Specifically, the website has a box for articulation agreements. It would be awesome to find a way to go through all the drop-down menus (Academic Year, Institution, Agreements with Other Institutions), view the agreement for each combination, and download them all, but for now I need help with an even simpler step.

After clicking through to one of the agreements, I found that the reports are served at URLs of the form https://assist.org/transfer/report/XXXXXXX, where the X's are digits. Here is an example.

Clicking the link in my browser (Safari) opens the PDF, and I can click the download button. But the following sample Python code only gives me a corrupt .pdf file. I am not that well acquainted with HTML and websites, so I am not sure how to adjust the code to get the PDF file from the above link.

import requests

def download_file(file_number):
    url = f"https://assist.org/transfer/report/{file_number}"
    response = requests.get(url)
    
    if response.status_code == 200:
        with open(f"report_{file_number}.pdf", "wb") as file:
            file.write(response.content)
        print(f"File 'report_{file_number}.pdf' downloaded successfully!")
    else:
        print(f"Failed to download the file. HTTP status code: {response.status_code}")

file_number = "26917146"
download_file(file_number)

I tried the above piece of code, and all I got was a file that nominally has the extension .pdf but fails to open in Preview on macOS.

I have also looked in the source code of the website but cannot find any references to a .pdf file.

Furthermore, contacting the people behind the webpage doesn't help much, as they cannot readily send all the PDF files yet (they are doing some restructuring).

  • I'm getting an HTML document when I curl your example URL. Since you're on a Mac, try the file command, e.g. file report_26917146.pdf, to see what you're getting. It is likely not a PDF. Commented Sep 27, 2023 at 18:38
  • If you looked at the contents of the file that was downloaded, you'd see that it is HTML, not PDF. JavaScript code loaded by the page is what actually retrieves and displays the PDF. I don't see any way to get the PDF file directly. Commented Sep 27, 2023 at 18:39
  • Right. That link brings up a document viewer app, not a downloaded PDF. Commented Sep 27, 2023 at 18:40
  • Why would you need a viewer app? Can't all browsers show PDFs directly? Maybe it's not a PDF! Commented Sep 27, 2023 at 18:45
  • Thank you all! Indeed, when running file report_26917146.pdf I get that it is an HTML document. When clicking the link in my browser (Safari), I can download the report as a PDF (if I hover my mouse towards the bottom of the screen, I get an option to download the file, which then downloads a PDF). Commented Sep 27, 2023 at 18:50
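
The check suggested in the comments (inspecting what was actually downloaded) can also be done in Python before anything is written to disk. Below is a minimal sketch, not a fix for the underlying problem: it only detects that the server returned an HTML page instead of a PDF, so the script does not save HTML under a .pdf name. The helper names describe_payload and download_report are hypothetical, and the only facts relied on are that PDF files begin with the bytes %PDF- and that requests exposes the body as response.content and the headers as response.headers.

```python
PDF_MAGIC = b"%PDF-"  # every PDF file starts with this signature


def describe_payload(data: bytes, content_type: str = "") -> str:
    """Classify a downloaded body as 'pdf', 'html', or 'unknown'."""
    if data.startswith(PDF_MAGIC):
        return "pdf"
    # Look only at the start of the body, ignoring leading whitespace and case.
    head = data[:512].lstrip().lower()
    if "html" in content_type.lower() or head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


def download_report(file_number: str) -> bool:
    """Save the report only if the response body really is a PDF."""
    # Imported here so describe_payload can be used without the dependency.
    import requests

    url = f"https://assist.org/transfer/report/{file_number}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    kind = describe_payload(response.content, response.headers.get("Content-Type", ""))
    if kind != "pdf":
        # For the URL in the question, the comments indicate this branch is taken.
        print(f"Expected a PDF but got {kind} content; nothing was saved.")
        return False

    with open(f"report_{file_number}.pdf", "wb") as file:
        file.write(response.content)
    return True
```

Calling download_report("26917146") would then report the mismatch rather than writing an unreadable file. Retrieving the real PDF would still require finding the request that the page's JavaScript makes, which this sketch does not attempt.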
