add option to --no-check-certs use at own risk #89
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@SuperKogito it looks like one of your previously working escape sequences is now considered invalid syntax. And then it's not detecting the URLs and some tests fail, I think?
  "|[a-z0-9.\\-]+[.](?:%s)/)" % domain_extensions,
  "(?:",
- "[^\\s()<>\[\\]]+|\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
+ r"[^\\s()<>\[\\]]+|\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
Everything other than this looks good to me. I will check this in a couple of hours from now and see if I can come up with a quick fix.
Thank you! This was my effort to fix the warning: turning the literal into a raw string (what I did above) helps sometimes. But the tests are still failing, and it's not even detecting URLs in many cases, so we have a larger issue on our hands.
I appreciate your help @SuperKogito !
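For anyone following along, here is a minimal sketch of why the literal warns and why a raw string is only a partial fix. It compiles small snippets of source at runtime so the warning can be captured; the helper name `literal_warnings` is made up for this illustration and is not part of urlchecker. On Python 3.6 to 3.11 the category is `DeprecationWarning` (hidden by default), and from 3.12 it is `SyntaxWarning`.

```python
import re
import warnings

BACKSLASH = chr(92)  # build the test literals without writing an invalid escape here


def literal_warnings(source: str) -> list:
    """Compile one expression and return the names of the warning categories raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<snippet>", "eval")
    return [w.category.__name__ for w in caught]


plain = '"' + BACKSLASH + '["'        # source text: "\["   (invalid escape)
doubled = '"' + BACKSLASH * 2 + '["'  # source text: "\\["  (escaped backslash)
raw = 'r"' + BACKSLASH + '["'         # source text: r"\["  (raw string)

print(literal_warnings(plain))     # one warning category, depending on the Python version
print(literal_warnings(doubled))   # []
print(literal_warnings(raw))       # []
print(eval(doubled) == eval(raw))  # True

# Adding the r prefix without also removing doubled backslashes changes the regex:
print(bool(re.fullmatch(r"[^\s]+", "a b")))   # False: \s is the whitespace class
print(bool(re.fullmatch(r"[^\\s]+", "a b")))  # True: the class now excludes "\" and "s"
```

The last two lines are the part that is easy to miss: once a literal is raw, every `\\` that used to mean a single backslash has to be reduced to `\`, otherwise the pattern silently matches different text.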
So the good news: you had the right fix for the warning, and we have some good tests.
The bad news: we get a lot of deprecation warnings, and some URLs are tough to check, e.g. https://codepen.io/rootwork/ cannot be checked from my machine although the link is there.
Also check this: in the same test, https://groups.drupal.org/node/278968 causes the failure in the first run and then works in the second. Zero code changed in between, just timing I guess.


My suggestions are the following:
- Use your fix (raw strings) throughout URL_REGEX instead of doubled escapes. This is what worked for me:
URL_REGEX = r"".join(
    (
        r"(?i)\b(",
        r"(?:",
        r"https?:(?:/{1,3}|[a-z0-9%]",
        r")",
        r"|[a-z0-9.\-]+[.](?:%s)/)" % domain_extensions,
        r"(?:",
        r"[^\s()<>\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)",
        r"|\([^\s]+?\)",
        r")",
        r"+",
        r"(?:",
        r"\([^\s()]*?\([^\s()]+\)[^\s()]*?\)",
        r"|\([^\s]+?\)",
        r"|[^\s`!()\[\];:'\".,<>?«»“”‘’]",
        r")",
        r"|",
        r"(?:",
        r"(?<!@)[a-z0-9]",
        r"+(?:[.\-][a-z0-9]+)*[.]",
        r"(?:%s)\b/?(?!@)" % domain_extensions,
        r"))",
    )
)
- Comment out the difficult links causing an issue, for now, until we figure out a better way to check them. (This is not a regex issue imo; I can only point the finger at the driver atm :/ ) In my case, commenting out "https://groups.drupal.org/node/298298" and "https://codepen.io/rootwork/" made the tests pass for Python 3.9 and 3.12.
- A key point is to extend test.yml to test against different versions of Python (the escape-character warning, among others, only shows up on Python 3.12 and not on 3.9). In this regard, we need to decide which Python versions we want to support and which ones we want to gradually drop. I think the test section of our test.yml should look a bit more like this one: https://github.com/librosa/librosa/blob/main/.github/workflows/ci.yml
  This way we cover more versions.
*I did not make any direct changes to the branch since I don't have a better solution and your input on these matters is very important.
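A quick way to sanity-check a raw-string pattern like the one suggested above is to run it over a sentence with trailing punctuation. This is only a sketch under two assumptions: `domain_extensions` is replaced by a tiny stand-in list (the project's real list is much longer), and the whitespace class is written as `\s` throughout.

```python
import re

# Assumption: a short stand-in for the project's full list of domain extensions.
domain_extensions = "com|org|io"

URL_REGEX = r"".join(
    (
        r"(?i)\b(",
        r"(?:",
        r"https?:(?:/{1,3}|[a-z0-9%]",
        r")",
        r"|[a-z0-9.\-]+[.](?:%s)/)" % domain_extensions,
        r"(?:",
        r"[^\s()<>\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)",
        r"|\([^\s]+?\)",
        r")",
        r"+",
        r"(?:",
        r"\([^\s()]*?\([^\s()]+\)[^\s()]*?\)",
        r"|\([^\s]+?\)",
        r"|[^\s`!()\[\];:'\".,<>?«»“”‘’]",
        r")",
        r"|",
        r"(?:",
        r"(?<!@)[a-z0-9]",
        r"+(?:[.\-][a-z0-9]+)*[.]",
        r"(?:%s)\b/?(?!@)" % domain_extensions,
        r"))",
    )
)

text = "See https://groups.drupal.org/node/278968 and https://codepen.io/rootwork/."

# The pattern has a single capturing group, so findall returns the full URLs.
urls = re.findall(URL_REGEX, text)
print(urls)
# ['https://groups.drupal.org/node/278968', 'https://codepen.io/rootwork/']
```

Both URLs are extracted and the sentence-ending period is left out, which supports the point that detection itself is not where the failing tests come from.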
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@SuperKogito could you please open a PR against the PR branch here (so it can also be tested)?
#90, but I am getting the same failure :/
Okay, I found the issue. It has nothing to do with our regex or with requests: an update to the selenium webdriver deprecated some of the logic we were using. As a result, creating the driver was failing and returning None, and since the driver is our primary means of getting a lot of these URLs (e.g., when the initial requests response is not allowed), many checks were failing. This is becoming more common with websites; logically, they don't want people scraping. But they can't prevent a selenium webdriver from doing so. I'm finishing up local tests now and will push the fixes shortly.
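To make the failure mode concrete, here is a stdlib-only sketch of how a driver that silently becomes None turns into many failed checks. All names (`make_driver`, `check_url`, and the stand-in callables) are hypothetical and chosen for illustration; they are not urlchecker's functions, and no selenium or requests calls are made.

```python
from typing import Callable, Optional


def make_driver(factory: Callable[[], object]) -> Optional[object]:
    """Build a driver; swallow any error and return None (the silent failure mode)."""
    try:
        return factory()
    except Exception:
        return None


def check_url(url: str, plain_check, driver_check, driver: Optional[object]) -> bool:
    """Try the plain HTTP check first, then retry with the driver if one exists."""
    if plain_check(url):
        return True
    if driver is None:
        # No browser available: a site that refuses plain requests is reported as broken.
        return False
    return driver_check(driver, url)


# Stand-ins for illustration only (no network access, no selenium import).
def blocked_by_site(url: str) -> bool:
    return False  # pretend the site rejects non-browser clients


def browser_ok(driver: object, url: str) -> bool:
    return True  # pretend a real browser session can load the page


def outdated_factory() -> object:
    raise TypeError("unexpected keyword argument")  # e.g. an argument the library removed


def working_factory() -> object:
    return object()


url = "https://codepen.io/rootwork/"
print(check_url(url, blocked_by_site, browser_ok, make_driver(outdated_factory)))  # False
print(check_url(url, blocked_by_site, browser_ok, make_driver(working_factory)))   # True
```

In this sketch nothing about the URL or the regex changes between the two calls; only the driver construction differs, which is why the failures looked unrelated to the library update at first.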
fb4dcd8 to 2e53fd5
Note for myself: we will need to update the driver in the Dockerfile as well, once we find the one that matches GH Actions.
The current failures are a result of an update to selenium: the instantiation of our driver fails and returns None, and then all the checks are done with requests alone. As the web matures (and sites do not want scraping) it is less likely this approach will work; we need the driver. This change will update the selenium UI to ensure the driver works and restore functionality. I will follow up with any tweaks needed for the CI (working locally for me).
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
2e53fd5 to 0b41f0e
That green is sure beautiful :) 🍏 https://github.com/urlstechie/urlchecker-python/actions/runs/7769722953/job/21189125531?pr=89
Just pushed the update for the container; we should be able to merge and release soon, and then test with the action.
38d057e to 35a382e
35a382e to a19bddd


This will address urlstechie/urlchecker-action#105. After it is tested by the person who opened the issue, we will merge, release, and update the action.