I have this code that iterates through several links; for each one, it retrieves the HTML response and runs response_html.find('relative-time'):
import pandas as pd
import requests_html

def main():
    df_links = pd.read_csv('links2.csv', index_col=0)
    session = requests_html.HTMLSession()
    try:
        for i in range(0, len(df_links.index)):
            url = df_links.iloc[i]['hyperlink']
            print(f"[{i}/{len(df_links.index)}]: {url}", flush=True)
            response = session.get(url)
            status_code = response.status_code
            if status_code == 200:
                response_html = response.html
                dateList = response_html.find('relative-time')
    except Exception as e:
        print("Something went wrong...", flush=True)

if __name__ == "__main__":
    main()
However, with 15,000 links, the code mysteriously stops partway through: it gets stuck in the middle of the for loop and never finishes. What could be causing this?
I've asked friends to reproduce it, and the same thing happened to them.
For testing purposes, the repository with the CSV is available here: https://github.com/carloseduardobanjar/nvd-linked-content-crawler-bug
When I comment out the line dateList = response_html.find('relative-time'), the code runs smoothly to completion, so the issue seems to lie in that line.
PS: I know the code may seem nonsensical, but it's just a minimal example to illustrate the problem.
The response object is being left unclosed. The get function of a Session in requests-html returns a requests.Response, which has a close method, so close each response once you're done with it. The Session also supports close; you might additionally try closing and recreating the session periodically.
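A minimal sketch of that fix, with a few choices of my own that go beyond the answer above: a per-request timeout, a try/except moved inside the loop so one bad URL doesn't abort the whole run, and an arbitrary session-recycling interval of 500 requests:

import pandas as pd
import requests_html

def main():
    df_links = pd.read_csv('links2.csv', index_col=0)
    session = requests_html.HTMLSession()
    for i in range(len(df_links.index)):
        url = df_links.iloc[i]['hyperlink']
        print(f"[{i}/{len(df_links.index)}]: {url}", flush=True)
        response = None
        try:
            # timeout is an assumption: it stops one hung request
            # from blocking the loop forever.
            response = session.get(url, timeout=10)
            if response.status_code == 200:
                date_list = response.html.find('relative-time')
        except Exception as e:
            print(f"Request failed for {url}: {e}", flush=True)
        finally:
            # Always release the underlying connection back to the
            # pool, even if parsing raised.
            if response is not None:
                response.close()
        # Recycle the session periodically (500 is an arbitrary interval).
        if i > 0 and i % 500 == 0:
            session.close()
            session = requests_html.HTMLSession()

if __name__ == "__main__":
    main()

The idea is that unclosed responses can keep pooled connections tied up; once the pool has no free connection left, the next get blocks instead of raising, which would match the "stuck in the middle of the loop" symptom.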