I have code that iterates over a list of links. For each one, it fetches the page and calls response_html.find('relative-time') on the parsed HTML:

import pandas as pd
import requests_html

def main():
    df_links = pd.read_csv('links2.csv', index_col=0)

    session = requests_html.HTMLSession()

    try:
        for i in range(0, len(df_links.index)):
            url = df_links.iloc[i]['hyperlink']
            print(f"[{i}/{len(df_links.index)}]: {url}", flush=True)
            response = session.get(url)
            status_code = response.status_code
            if status_code == 200:
                response_html = response.html
                dateList = response_html.find('relative-time')
    except Exception as e:
        print("Something went wrong...", flush=True)

if __name__ == "__main__":
    main()

However, with 15,000 links, the code mysteriously stops halfway through execution and gets stuck in the middle of the for loop. What could be causing this?

I've asked friends to simulate it, and the same thing happened to them.

For testing purposes, the repository with the CSV is available here: https://github.com/carloseduardobanjar/nvd-linked-content-crawler-bug

When I comment out the line dateList = response_html.find('relative-time'), the code runs smoothly until completion. It seems the issue lies within that line.

P.S. I know the code may seem nonsensical, but it's just a minimal example to illustrate the problem.

  • Does it always stop at the same URL? Can you do that one by itself? Commented Apr 27, 2024 at 17:22
  • "I know the code may seem nonsensical, but it's just an example." I'm guessing the true problem is somewhere in the code you removed in order to make the example. Commented Apr 27, 2024 at 17:31
  • @TimRoberts Yes, it always stops at the same URL, but if I start running from that URL, it works. Commented Apr 29, 2024 at 22:31
  • @JohnGordon I tested the example code and I reproduced the error. Commented Apr 29, 2024 at 22:34
  • One thing I noticed is that the response object is never closed. The get function of a Session in requests-html returns a requests.Response, which has a close function. The Session also has a close function, so you could try closing and recreating it periodically. Commented May 1, 2024 at 0:40
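
The suggestion in the last comment could be tried with something like the sketch below. It is untested against the real link list and is not a confirmed fix. The names crawl, make_session and recycle_every are hypothetical, and the interval of 500 is an arbitrary guess. Each response is closed as soon as it has been searched, and the session is closed and recreated every recycle_every links.

```python
def crawl(urls, make_session=None, recycle_every=500):
    """Fetch each URL and count its 'relative-time' elements.

    Hypothetical helper: closes every response after use and replaces
    the session every `recycle_every` links (an arbitrary interval).
    """
    if make_session is None:
        # Third-party package from the question, imported lazily so the
        # function can also be exercised with a stand-in session class.
        import requests_html
        make_session = requests_html.HTMLSession

    counts = {}
    session = make_session()
    try:
        for i, url in enumerate(urls):
            # Periodically drop the old session and start a fresh one.
            if i > 0 and i % recycle_every == 0:
                session.close()
                session = make_session()

            response = session.get(url)
            try:
                if response.status_code == 200:
                    counts[url] = len(response.html.find('relative-time'))
            finally:
                # Release the connection even if parsing raises.
                response.close()
    finally:
        session.close()
    return counts
```

With the CSV from the question, this could be called as crawl(df_links['hyperlink']). If the loop still stalls at the same link, that would suggest the cause is something other than unclosed responses or a long-lived session.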
