Skip to content

Conversation

@Markus-Rost
Copy link
Collaborator

Fix #2310

  • Add --javaScript option with values "none", "trusted" or "all" (default being "trusted").
  • Add --addModules option for additional ResourceLoader modules

@Markus-Rost
Copy link
Collaborator Author

This draft is currently made as a proof of concept with a disregard for any other renderer it might be breaking.

@Markus-Rost
Copy link
Collaborator Author

Markus-Rost commented Aug 18, 2025

I created some example ZIM files including no JavaScript, all JavaScript or only JavaScript trusted not to request external content. Please take a look at them and let me know how you feel about them.

  • All JavaScript disabled
  • Only trusted JavaScript enabled
  • Trusted JavaScript enabled, as well as some additional user created scripts not requesting external content
    • wikipedia_en_10_JS_trusted_addModules_2025-08.zip
    • mwoffliner --mwUrl=https://en.wikipedia.org/ --articleList=http://download.openzim.org/wp1/enwiki/tops/10.tsv --javaScript=trusted --addModules=site,ext.gadget.ReferenceTooltips,oojs,oojs-ui,ext.gadget.switcher,ext.gadget.Calculator,@wikimedia/codex
  • All JavaScript enabled, including scripts requesting external content

@benoit74
Copy link
Contributor

I tried PoC ZIMs.

For some reasons, the wikipedia_en_10_JS_trusted_addModules_2025-08.zim is causing a significant problem to my kiwix-serve / kiwix-apple readers. Both simply fails to load the ZIM main article (or any other article in kiwix-serve). Surprisingly I rebuild a ZIM with same command and few other articles and it works fine. Don't know what is broken, but something is.

All javascripts version is obviously causing problems in the web console ... but nothing really visible in the articles.

The "trusted javascript" works very well.

I'm not sufficiently export to assess the list of trusted modules, or the one of "addModules" version, but this is more a detail we can refine over the years anyway.

The only things I can say is that I confirm that the "trusted javascript with additional modules" is required/enough to solve #1491 and #1911 articles. Only "trusted javascript" is not enough.

I've tested with JS disabled in Chrome and I confirm the ZIM degrades nicely (i.e. it looks like we get the same content than currently without this PR merged).

All points above makes me consider that we might want to change default scraper behavior (compared to current PR behavior), so that scraper creates "near-perfect" ZIM for "newbies", without needing additional flags. Especially important for people willing to ZIM wikipedia which is somehow our primary audience:

  • we should probably have a --nojavascript option instead of a --javascript option, for those willing to not have any JS (even if ZIM with JS seems to degrade nicely in browsers with JS disabled, pretty sure some folk would prefer to not have any JS code in their ZIMs)
  • we should maybe add more scripts to the trusted list

@Markus-Rost Thanks a lot anyway, very impressive change. Glad you made it work! I did not had any look at the code since you specified it is only a PoC. Let me know when it is worth it.

@Markus-Rost
Copy link
Collaborator Author

For some reasons, the wikipedia_en_10_JS_trusted_addModules_2025-08.zim is causing a significant problem to my kiwix-serve / kiwix-apple readers. Both simply fails to load the ZIM main article (or any other article in kiwix-serve). Surprisingly I rebuild a ZIM with same command and few other articles and it works fine. Don't know what is broken, but something is.

Oh, I'm getting that as well. Not sure what happened there. I tested it locally using 100.tsv which worked fine and created the 10.tsv version to fit the GitHub upload limit, but didn't actually test that one again.

The only things I can say is that I confirm that the "trusted javascript with additional modules" is required/enough to solve #1491 and #1911 articles. Only "trusted javascript" is not enough.

Yes, those tables need the site module.

All points above makes me consider that we might want to change default scraper behavior (compared to current PR behavior), so that scraper creates "near-perfect" ZIM for "newbies", without needing additional flags. Especially important for people willing to ZIM wikipedia which is somehow our primary audience:

Do you mean having all JS included by default? That doesn't seem like a good idea as some JS not expecting a request to fail can easily break stuff on the page.

  • we should probably have a --nojavascript option instead of a --javascript option, for those willing to not have any JS (even if ZIM with JS seems to degrade nicely in browsers with JS disabled, pretty sure some folk would prefer to not have any JS code in their ZIMs)

I've already added --javaScript=none, does no js really need it's own option?

  • we should maybe add more scripts to the trusted list

I've looked through basically all the modules I could find loaded on Wikipedia articles as well as a few other wikis and those are all the modules which won't try to load any non-scraped content. Sadly all three MathJax extensions either load external JS or from an extension path we can't easily rewrite or guess ahead of time. @benoit74 let me know about any specific scripts to add you have in mind and I'll take a look at them.

@benoit74
Copy link
Contributor

OK, then what I meant is having --javascript=trusted as default. If we have --javascript=none, this is OK.

And what I meant by having more modules was having everything needed for wikipedia (having the ancestry table in mind). But if this comes from the site modules, and since (correct me if I'm wrong) we could have anything in the site module, then we can't do that. Or maybe we could add the site modules automatically only for wikimedia projects... not sure it is such a good idea.

@Markus-Rost
Copy link
Collaborator Author

OK, then what I meant is having --javascript=trusted as default. If we have --javascript=none, this is OK.

--javaScript=trusted is already the default in this PR

And what I meant by having more modules was having everything needed for wikipedia (having the ancestry table in mind). But if this comes from the site modules, and since (correct me if I'm wrong) we could have anything in the site module, then we can't do that. Or maybe we could add the site modules automatically only for wikimedia projects... not sure it is such a good idea.

Yeah, the site module can contain anything and change at any time. We also can't just include it for all Wikimedia projects, as on some wikis it might load external content (like on es.wikipedia.org).

@Markus-Rost
Copy link
Collaborator Author

@Markus-Rost Thanks a lot anyway, very impressive change. Glad you made it work! I did not had any look at the code since you specified it is only a PoC. Let me know when it is worth it.

@benoit74 Oh please take a look at the code. I assume that I've broken all renderers besides ActionParse, though I'm also not familiar at all with those renderers and have no idea how badly broken they are. I'm certainly gonna need some help with getting those issues fixed.

@Jaifroid
Copy link
Collaborator

Jaifroid commented Aug 22, 2025

Apologies for the delay. I've now tested these archives in Kiwix JS Browser Extension, with the same results as @benoit74 reported.

  • The JS_all shows errors in console relating to failures to fetch some scripts (but I'm not actually seeing blocked calls to external resources).
  • JS_trusted shows a bunch of failures on the landing page, but not on other pages I tested.
  • JS_none also shows failures on the landing page (clearly references to modules not included in the archive have not been removed), and also on article pages a failure to find C/_mw_/startup.js.
  • JS_trusted_addModules fails to find the landing page, but article pages are fine.

I don't see any differences in formatting, suggesting to me that at least for the tested pages JS_none is as good as other formats, and would be faster to load on older devices due to fewer resources needing to be extracted. However, it would be necessary to remove references to scripts not in the archive to avoid the multiple console errors from attempts to load these. However, JS_trusted seems like a good compromise to me, so long as calls to untrusted resources are removed from the landing page.

Regarding the core issue of calling external scripts or resources #2310, it seems none of the test pages I noticed had that problem in any of these test archives, but possibly I didn't test deeply enough.

@benoit74
Copy link
Contributor

Regarding the core issue of calling external scripts or resources #2310, it seems none of the test pages I noticed had that problem in any of these test archives, but possibly I didn't test deeply enough.

Currently the scraper has indeed more or less solved #2310 by removing all JS, so that we de-facto do not load anymore external scripts. But this is not really a long term solution. This is why we kept #2310 open saying this was more an interim fix and the medium term solution is this PR, adding back "just enough" scripts.

@Markus-Rost
Copy link
Collaborator Author

  • The JS_all shows errors in console relating to failures to fetch some scripts (but I'm not actually seeing blocked calls to external resources).

Regarding the core issue of calling external scripts or resources #2310, it seems none of the test pages I noticed had that problem in any of these test archives, but possibly I didn't test deeply enough.

The calling of external scripts seen in #2310 is mostly just the attempted loading of JS modules. By rewriting startup.js to load those modules from C/_mw_/MODULE.js, they are now the failures to fetch some scripts you noticed in JS_all (though a lot of the called scripts are included in the ZIM).

Of course JS_all can still have scripts attempting to load external resources, just that en.wikipedia is a bad example for that because they have very little JS doing that in the first place. Getting JS_all on es.wikipedia for example will load external scripts through the site module as mentioned in an earlier comment.

  • JS_trusted shows a bunch of failures on the landing page, but not on other pages I tested.
  • JS_none also shows failures on the landing page (clearly references to modules not included in the archive have not been removed), and also on article pages a failure to find C/_mw_/startup.js.
  • JS_trusted_addModules fails to find the landing page, but article pages are fine.

I have the suspicion that my filtering of JS modules stopped the already existing landing page JS from being included 😅 Ah, I'll fix JS_none including startup.

I don't see any differences in formatting, suggesting to me that at least for the tested pages JS_none is as good as other formats, and would be faster to load on older devices due to fewer resources needing to be extracted. However, it would be necessary to remove references to scripts not in the archive to avoid the multiple console errors from attempts to load these. However, JS_trusted seems like a good compromise to me, so long as calls to untrusted resources are removed from the landing page.

You should notice small things like the navboxes at the bottom of articles being collapsed by default now or the video/audio player popup on "Michael Jackson", but yes there aren't a lot of differences in formatting as en.wikipedia does not rely as heavily on JS as other wikis might.

@@ -0,0 +1,7 @@
<script>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline scripts is taboo in ZIMs. Experienced showed it could cause problem on some readers around security constraints. It seems to be more or less resolved, but given the price to pay is not so high, we should transform this into JS scripts of their own.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content of the script is different for every article, so that will be one script file for each article in the zim.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't realized that ... even if you already explained it in the past ... sorry.

This is indeed going to a be a bit "painful" (more a technical constraint anyway than a real usability problem).

@kelson42 @Jaifroid I would appreciate your experience on that point.

Which alternative is best:

  • have few lines of inline JS in all articles (and only 1 HTML file per article)
  • or have 1 HTML and 1 JS file per article but no inline JS at all (more or less doubling the number of entries in the ZIM)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kelson42 ping

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've discussed about this topic with @kelson42 and he confirmed it feels way better to avoid inline JS at all prices, especially for very popular ZIMs like the ones made by mwoffliner.

He however came up with a new idea which could avoid to have 1 HTML + 1 JS file per article. It lookt like it could be possible to include variables we currently add in the JS (RLCONF, RLSTATE, RLPAGEMODULES) inside the HTML instead (with some "hidden" DOM properties, e.g <meta name="RLCONF" content="..." />, but I'm open to other ideas) and load these variables inside the appropriate JS object at page load. With this "workaround" we will need the same JS code on all articles, hence only one JS file per ZIM and no more inline JS.

@Markus-Rost WDYT, do you think this would be feasible?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! I was more referring to any CSP that you might want to add to Wikipedia articles in the ZIM (for readers that don't enforce one). Indeed, a CSP can't in fact be loosened by a second CSP, it can only be made stricter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that @Markus-Rost point is correct: we probably shouldn't mind about using inline JS in mwoffliner ZIMs because we have a nice no-JS fallback present in the ZIM anyway. When the browser context does not allows to load inline JS, then JS will anyway not load at all and we are back to the no-JS fallback (which is not a dirty fallback at all since this is what is currently used on all production ZIMs).

I just tested test ZIMs from #2310 (comment) on Kiwix JS browser extension in all modes (restricted, with/without service worker, ...) and it looked fine. When in restricted mode, JS did not loaded but display was still fine. Of course there was some errors in the console, but I don't think we should really care. @Jaifroid do you confirm?

Copy link
Contributor

@kelson42 kelson42 Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If inline JS is now allowed, please open a dedicated issue in openzim/overview to manage this, because this is everything but clear why now it should be OK, whereas it has been forbidden during years. We can not change the rules, just because it is convenient to get a PR merged - as it.

Copy link
Collaborator

@Jaifroid Jaifroid Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benoit74 On my side, I had to develop a few years ago workarounds for lack of inline JS support in some contexts, notably Chromium extension (even in ServiceWorker mode). In such contexts the user is effectively redirected to a PWA from the extension (where there is no such restriction). Restricted mode is now effectively a no-JS mode that is kept for compatibility with old systems and as a security/screening measure when opening a new untrusted ZIM for the first time.

This should be an Org decision (@kelson42 ). There is absolutely no need to hold things up due to Kiwix JS or the PWA any more. There are independent arguments for avoiding inline JS in ZIMs where we know that JS is not needed, and for trying to keep our baseline, bread-and-butter Wikipedia ZIMs as simple as possible and compatible with even the oldest frameworks. With a bit of work, any inline script snippet can be converted into an attached script loaded at runtime (allowed by all frameworks of course), especially when we know the contents of inline JS script tags in a given context. Nevertheless, it seems reasonable to allow ZIM creators to take that decision themselves, if they need inline JS for their Mediawiki ZIMs, but for us to act conservatively vis-à-vis official Kiwix Wikipedia ZIMs.

As you say, Wikipedia ZIMs provide a good NOSCRIPT default in any case for now.. It should be possible to provide the long-awaited ZIM JS API #87 independently of whether or not inline JS is allowed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If inline JS is now allowed, please open a dedicated issue in openzim/overview to manage this, because this is everything but clear why now it should be OK, whereas it has been forbidden during years. We can not change the rules, just because it is convenient to get a PR merged - as it.

@kelson42 As far as I can tell, the reason for inline scripts to be disallowed is that some readers are/were blocking inline scripts from executing, resulting in (partially) unusable ZIMs. At least that was the case in all issues regarding inline scripts I could find (#324, #1578, #2096, kiwix/kiwix-js#355, kiwix/kiwix-js#789, kiwix/kiwix-js#865, openzim/overview#52), please let me know if there are additional reasons against inline scripts that I missed.

So avoiding inline scripts was to ensure anything needing those inline scripts would work properly in all ZIM readers. Now in this specific case here, avoiding inline scripts won't actually make the feature work in those readers as they are also blocking the eval based feature check in the startup JS and thus will always fall back to a noscript behavior. This means we have nothing to gain here from avoiding inline scripts.

See also openzim/overview#52 for a similar situation, through with a lot less graceful noscript behavior.

@benoit74
Copy link
Contributor

benoit74 commented Oct 2, 2025

@kelson42 we need your decision on this one as well, it sat idle for way too long.

I again see no reason to be worried about the inline scripts, because as stated by @Markus-Rost the ZIM will anyway fallback to no-JS even if we remove these inline scripts (because startup.js also requires eval which is blocked in same context than inline scripts).

I tried the wikipedia_en_10_JS_trusted_2025-08.zip ZIM in kiwix-desktop (it was your main concern) and it works like a charm, including all JS-based functions (like show/hide at the bottom of "Michael Jackson" page).

Taking measures to remove these inline scripts without providing any added value to the end users looks like a waste of our precious resources and a risk to introduce bugs.

The no-JS fallback is working exactly as-of today (without this PR merged), i.e. the ZIM will still be high-quality enough.

I understand you would prefer to provide same ZIM behavior on all readers, but at some point I feel like it is important to recognize that the effort is not worth it.

I've also opened openzim/overview#62 to discuss the general policy for openZIM

@benoit74
Copy link
Contributor

benoit74 commented Oct 2, 2025

I also tried the wikipedia_en_10_JS_trusted_2025-08.zip ZIM in kiwix-js and kiwix-pwa in Restricted mode and I confirm it fallbacks very nice to no-JS mode (i.e. the ZIM looks just like a currently prod ZIM not supporting JS at all).

@Jaifroid
Copy link
Collaborator

Jaifroid commented Oct 2, 2025

I also tried the wikipedia_en_10_JS_trusted_2025-08.zip ZIM in kiwix-js and kiwix-pwa in Restricted mode and I confirm it fallbacks very nice to no-JS mode (i.e. the ZIM looks just like a currently prod ZIM not supporting JS at all).

That one has a bunch of violations in ServiceWorkerLocal mode in Chrome, with the CSP violations looking pretty rough:

image

But more worryingly, in ServiceWorker mode, we get a bunch of attempts to access an external PHP server (wikipedia.org/w/load.php). At the very least those calls should be removed ISTM! -

image

@Markus-Rost
Copy link
Collaborator Author

Markus-Rost commented Oct 2, 2025

I also tried the wikipedia_en_10_JS_trusted_2025-08.zip ZIM in kiwix-js and kiwix-pwa in Restricted mode and I confirm it fallbacks very nice to no-JS mode (i.e. the ZIM looks just like a currently prod ZIM not supporting JS at all).

That one has a bunch of violations in ServiceWorkerLocal mode in Chrome, with the CSP violations looking pretty rough:

image

I'm only seeing the two CSP violations from the two inline scripts existing. These are expected (and can be reduced to just one).

But more worryingly, in ServiceWorker mode, we get a bunch of attempts to access an external PHP server (wikipedia.org/w/load.php). At the very least those calls should be removed ISTM! -

image

This is worrying, as only the startup module does those requests and has been modified in this PR to make JS work properly in the first place. The only way I can think of for you to get these errors in the PR ZIM again is if your service worker is somehow using a startup.js file cached from a different Wikipedia ZIM, which seems like a major security issue. @Jaifroid

@Markus-Rost
Copy link
Collaborator Author

But more worryingly, in ServiceWorker mode, we get a bunch of attempts to access an external PHP server (wikipedia.org/w/load.php). At the very least those calls should be removed ISTM! -

image

Oh, I found where this is coming from. It's the sourceMappingURL for modules loaded from local storage. Should be fairly easy to just remove that annotation, though the source map is only loaded by the browser dev tools anyway.

@Jaifroid
Copy link
Collaborator

Jaifroid commented Oct 2, 2025

This is worrying, as only the startup module does those requests and has been modified in this PR to make JS work properly in the first place. The only way I can think of for you to get these errors in the PR ZIM again is if your service worker is somehow using a startup.js file cached from a different Wikipedia ZIM, which seems like a major security issue. @Jaifroid

@Markus-Rost Just to confirm that I only do these tests in the Browser Extension, which takes a purist approach and only uses the resources supplied in the ZIM except (currently) for an added dark CSS if the user selects that (to be changed soon to native dark). (Even in the PWA, I never supply JS, only CSS for some other transforms.)

@Markus-Rost
Copy link
Collaborator Author

I'm having less time to work on this now, but I would like to avoid this PR becoming stale. @benoit74 can you review the PR further (besides the inline script existing), so that I have some pointers to work on whenever I'm available?

@Markus-Rost Markus-Rost requested a review from benoit74 October 10, 2025 17:02
Copy link
Contributor

@benoit74 benoit74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about DO_PROPAGATION, ALL_READY_FUNCTION and LOAD_PHP? Looks like these were "useful" things at some point, but indeed never called since years since the module name was not correct so startup.js was never fixed.

The failing unit test is about inline scripts, it will be easily resolved once discussion about this has settled.

I've added a comment about what looks like the root cause of all failing e2e tests.

And this PR obviously misses a significant share of unit and e2e tests to ensure we will not break too easily this very important feature in the future.

@Markus-Rost
Copy link
Collaborator Author

What about DO_PROPAGATION, ALL_READY_FUNCTION and LOAD_PHP? Looks like these were "useful" things at some point, but indeed never called since years since the module name was not correct so startup.js was never fixed.

I'm unsure what DO_PROPAGATION was supposed to solve. It removed a short timeout before each modules got loaded. I suspect the reason was to avoid using mw.requestIdleCallback because the mw class wasn't properly initialized before. Regarding the reason of the timeout, the MW script has a comment about it:

 		// Yield for two reasons:
		// * Allow successive calls to mw.loader.impl() from the same
		//   load.php response, or from the same asyncEval() to be in the
		//   propagation batch.
		// * Allow the browser to breathe between the reception of
		//   module source code and the execution of it.
		//
		// Use a high priority because the user may be waiting for interactions
		// to start being possible. But, first provide a moment (up to 'timeout')
		// for native input event handling (e.g. scrolling/typing/clicking).
		mw.requestIdleCallback( doPropagation, { timeout: 1 } );

ALL_READY_FUNCTION just disabled the check if all dependency modules are loaded. Again likely existed because the resource loader didn't work fully, but is no longer needed as the resource loader now works as intended inside the ZIM.

No idea what LOAD_PHP was intended to do, it does not match anything on any recent startup module (besides possibly nuking half the minified module code due to an overly agressive regex)

@benoit74
Copy link
Contributor

OK then this PR should completely remove DO_PROPAGATION, ALL_READY_FUNCTION and LOAD_PHP for the time being, no need to keep in codebase stuff we do not really understand and does not seem to provide much added value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Wikipedia articles scraped with ActionParse API contain scripts that attempt to load external content

5 participants