User Details
- User Since: Mar 31 2015, 8:12 PM
- Availability: Available
- LDAP User: Unknown
- MediaWiki User: Jberkel
Sep 24 2025
The situation has improved somewhat, but there are still caching issues: if a page A transcludes template B, and B is updated (but not A), then the HTML dumps will still show the old content of B in A until A itself is updated.
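A crude way to spot these stale transclusions is to diff the HTML stored in the dump against the live Parsoid HTML from the REST API. A rough sketch (the dump field names `name` and `article_body.html` are assumptions on my part):

```python
# Rough staleness check: compare a page's HTML from the dump with the live Parsoid
# HTML. If the only differences come from template B, the dump is serving stale
# transcluded content for page A.
import difflib
from urllib.parse import quote

import requests

REST_HTML = "https://en.wiktionary.org/api/rest_v1/page/html/{title}"

def diff_against_live(dump_record: dict) -> str:
    """dump_record is one NDJSON object from the HTML dump (field names assumed)."""
    title = quote(dump_record["name"], safe="")
    live = requests.get(REST_HTML.format(title=title), timeout=30).text
    dumped = dump_record["article_body"]["html"]
    return "\n".join(difflib.unified_diff(dumped.splitlines(), live.splitlines(), lineterm=""))
```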
Sep 14 2025
@Fenakhay: I think we'll have to solve this ourselves for now. The Reconstruction namespace is a lot smaller, so it should be doable to produce dumps for it.
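Roughly what I have in mind, as a sketch: stream the regular pages-articles XML dump (which is reliable) and keep only the Reconstruction: pages. The export schema URI and the file name are examples and may need adjusting.

```python
# Minimal sketch: extract Reconstruction:* pages from the pages-articles XML dump.
import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.11/}"  # adjust to the dump's schema version

def reconstruction_pages(path):
    """Yield (title, wikitext) for every page in the Reconstruction: namespace."""
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title", "")
                if title.startswith("Reconstruction:"):
                    yield title, elem.findtext(f"{NS}revision/{NS}text", "")
                elem.clear()  # keep memory usage flat while streaming

for title, _ in reconstruction_pages("enwiktionary-latest-pages-articles.xml.bz2"):
    print(title)
```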
May 29 2025
Confirmed! Thanks.
May 25 2025
Also, it looks like there's another bug where the SDK doesn't handle the request limit situation properly and keeps on retrying the requests in quick succession before seemingly getting rate-limited for the retries:
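For comparison, this is roughly the retry behaviour I'd expect on a 429 (a generic sketch, not the SDK's actual code): honour Retry-After if present, otherwise back off exponentially, instead of re-sending immediately:

```python
# Sketch of sane 429 handling: sleep before retrying instead of hammering the API.
import time

import requests

def get_with_backoff(url, headers=None, max_retries=5):
    resp = None
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, timeout=60)
        if resp.status_code != 429:
            return resp
        retry_after = resp.headers.get("Retry-After", "")
        wait = int(retry_after) if retry_after.isdigit() else 2 ** attempt
        time.sleep(wait)  # wait before the next attempt instead of retrying immediately
    return resp
```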
May 24 2025
@creynolds Thanks for investigating. With those 1500 requests I was only able to download 35 of the 72 chunks of the Wiktionary dump. I suspect this can be explained by the SDK making several requests for each chunk: I had the transfer size set to 5 MB (the default is 25 MB), and each chunk is ~200 MB, so that makes (72 * 200) / 5 = 2880 requests. With the default transfer size of 25 MB it would only be ~576 requests, still within the free tier. Given that the transfer size is configurable, it seems odd to meter usage by the number of requests sent; maybe it would make more sense to cap by the amount of data transferred instead.
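The back-of-the-envelope numbers, spelled out (assuming one request per transfer-sized part):

```python
# Requests needed to fetch the whole dump, as a function of the configured transfer size.
chunks = 72        # files in the Wiktionary dump
chunk_mb = 200     # approximate size of each chunk in MB
for transfer_mb in (5, 25):  # my setting vs. the SDK default
    print(f"{transfer_mb} MB parts -> ~{chunks * chunk_mb // transfer_mb} requests")
# 5 MB parts -> ~2880 requests
# 25 MB parts -> ~576 requests
```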
How many free chunk requests are there? I'm now getting 429 responses on chunk downloads, and the API dashboard confusingly says: "0 / 0 Chunk requests left" (from a free account outside WMCS).
Confirmed, works now, thanks.
May 2 2025
Will these requests work when sent from WMCS?
Apr 22 2025
I was pointed to this ticket on T390839. I'd like to raise some concerns regarding the reduced accessibility and visibility of HTML dumps following the removal of the mirroring.
Apr 21 2025
This also affects the HTML dumps, right? There are none in https://dumps.wikimedia.org/other/enterprise_html/runs/20250401/ or https://dumps.wikimedia.org/other/enterprise_html/runs/20250420/.
Jul 9 2024
Done some testing with the latest (20240701) dumps (allowing for some tolerance around the moment of dump generation):
Jun 4 2024
That's good news. I've done some tests, and it's looking much better now. The XML dumps haven't been released yet (due to T365501), so there's no baseline to do more detailed testing.
May 26 2024
Latest HTML enwikt dump (20240520) vs XML dump:
May 23 2024
It's probably just the new content, with the baseline still being incomplete. I'll check with the XML dumps.
Apr 18 2024
The HTML dumps are pretty much useless until T351712 is fixed.
Mar 25 2024
Can anyone clarify though? It seems that the new sub-tasks are now stuck again.
Mar 18 2024
It probably means the investigation has been "resolved". The main task is now T351712 + subtasks.
Feb 19 2024
I'll add a command to automatically clear the tmp storage; that should help.
I've deleted tmp and other unused stuff; it's now down to 16 GB. Is that acceptable?
Feb 5 2024
Could you explain a bit more what this means, please?
Jan 26 2024
Latest enwikt dump is now at 9.6 GB, still some way to go to the 13 GB of the 20230701 dump (also incomplete, but still useful as a baseline).
Jan 9 2024
OK. I think it might be worth putting a disclaimer somewhere, perhaps on https://dumps.wikimedia.org/other/enterprise_html/, to warn users that the dumps are incomplete.
@REsquito-WMF thanks! So this means the next dumps will have more data, but will still be incomplete until this other bug is fixed?
Dec 11 2023
@REsquito-WMF not sure if the changes were already in place, but the current enwiktionary NS0 dump is still at 3.5 GB (compared to 13 GB on 20230701).
Nov 6 2023
Is anything going to be done about this? The enterprise dumps have been in full failure mode for a few months now and are absolutely unusable. I really don't know how an obvious total failure of a service can stay in triage hell for such a long time. I understand WMF resources are limited, but then at least let volunteers help out with this. My question above about the code generating the dumps is still unanswered. The transparency/communication on this whole issue has been miserable.
Oct 27 2023
We don't really need to keep all the old dumps around, so I've started deleting all dump files from before 2023. Different files are needed for different purposes: for the stats, and for the "wanted entries" on Wiktionary. After the dumps are generated, all the data "lives" on Wiktionary, except for the raw data, which is hosted on ~tools.digero/www and shouldn't be deleted. Right now it uses about 1.3 GB.
Oct 24 2023
@tstarling Thanks for unblocking this! 🙌
Oct 5 2023
Some random guessing: perhaps the error handling code is borked, and it just finishes the dump and closes the file (without failing the process)? But why would so many repositories then hit errors at the same time? All the 7-20 dumps seem to be affected; maybe there were some site-wide network/server problems which weren't handled properly?
Oct 4 2023
I've suggested this previously: the XML dumps have been around for a very long time, and compared to the HTML version they are very reliable, so why not simply base the HTML dumps off them? Then you could easily compare the counts of the two files; it would also help users who consume both types of data.
Sep 20 2023
Any idea why this would affect primarily non-wikipedia instances? Is the code which generates these dumps available somewhere?
Sep 15 2023
Weirdly, there seems to be less variation in file sizes for Wikipedia dumps:
Aug 24 2023
File sizes from the most recent enwikt HTML dumps (NS0):
Jul 24 2023
Hasn't been fixed yet; data is still missing.
Jul 21 2023
OK, I hope this can be rolled out quickly; it can't get much worse than the current state.
I just checked the latest dumps (2023-07-20), and it's now worse: there are around 2.5 million pages missing from the HTML dump (using the XML dump as a baseline).
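In case it's useful, this is roughly how I count the missing pages, using the XML dump as the baseline (a sketch; the Enterprise dump field name "name", the file names, and the export schema URI are assumptions and may need adjusting):

```python
# Compare NS0 titles in the pages-articles XML dump against the Enterprise HTML dump.
import bz2
import json
import tarfile
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.11/}"  # adjust to the dump's schema version

def xml_titles(path):
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                if elem.findtext(NS + "ns") == "0":
                    yield elem.findtext(NS + "title")
                elem.clear()

def html_titles(path):
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            if member.isfile():
                for line in tar.extractfile(member):
                    yield json.loads(line)["name"]

missing = set(xml_titles("enwiktionary-pages-articles.xml.bz2")) \
          - set(html_titles("enwiktionary-NS0-HTML.json.tar.gz"))
print(f"{len(missing)} pages missing from the HTML dump")
```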
Jul 12 2023
Why was this already marked as resolved? New dumps haven't even been published yet, so it's impossible to verify.
Jun 12 2023
Closing this; maybe it'll be useful for future reference. I haven't added documentation to Wikitech; not sure where it should go.
I'll see if I can prebuild the binaries and then just launch the commands without Gradle, to avoid this issue (so the locks are only held during building, not execution).
Jun 10 2023
There are ~150 entries missing from the HTML dump (compared to 2200 earlier):
It looks like the situation has improved with the latest dump (20230601, enwikt):
Jun 9 2023
Looks like the files have finally been synced to Toolforge!
Jun 5 2023
The rsync, which copies the files over to the NFS share accessible to Toolforge, is still in progress.
Jun 2 2023
Looks like the data was copied successfully this time! I've downloaded the enwiktionary-NS0 dump and the checksum matches.
May 29 2023
It might be the case that we are just serving the checksum of the previous dump.
Meaning: we are grabbing the checksum before the upload has finished.
@ArielGlenn if the API side isn't fixed until the June run would it be possible to ignore the checksums and copy the files regardless? We've been dump-less for 2 months now…
May 25 2023
@Protsack.stephan Where are the checksums calculated? Can you re-index the metadata of the dump files on the API side so that they match the actual file content? It looks like they might be calculated before the file is fully processed, or from a different version of the file (as you indicated in your comment).
@ArielGlenn Is the downloaded data usable, i.e. can you decompress the files without error? If the files are OK, maybe it's a problem with the checksum generation: if the checksums are off only for some files, it could be related to file size. Perhaps some sort of overflow in the code where the hashes are calculated?
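For reference, this is roughly the check I run locally (a sketch; it computes SHA-256, but substitute whichever hash the API actually publishes): hash the downloaded file, then stream through the tarball, which will raise if the gzip stream is corrupt or truncated.

```python
# Integrity check for a downloaded dump: compute the file hash and make sure the
# compressed tarball can be read end to end.
import hashlib
import tarfile

def verify(path, expected=None):
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            sha.update(block)
    with tarfile.open(path, "r:gz") as tar:  # raises on a corrupt/truncated stream
        for _ in tar:
            pass
    digest = sha.hexdigest()
    if expected and digest != expected:
        raise ValueError(f"checksum mismatch: got {digest}, expected {expected}")
    return digest
```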
May 16 2023
Another question: where are the enterprise dumps stored on Toolforge now? They seem to have stopped updating in October last year.
Thanks for moving this one forward!
May 11 2023
Perhaps the same underlying issue as T305407.
May 8 2023
The files haven't materialized; I guess something is still amiss…
May 4 2023
Yes, that's what I meant, thanks 🤞
Ok, so the files have been generated, but not copied? Can they be recovered?
May 2 2023
Thanks! Is there any way to check the HTML dump progress/state "from the outside"? The XML dumps have a status page plus the machine-readable dumpstatus.json.
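For comparison, this is what the machine-readable side looks like for the XML dumps (a sketch; the dumpstatus.json layout with a top-level "jobs" map is what dumps.wikimedia.org serves today). Something similar for the HTML dumps would already help:

```python
# Poll the XML dump status "from the outside"; nothing equivalent exists for the HTML dumps.
import requests

def xml_dump_status(wiki, date):
    url = f"https://dumps.wikimedia.org/{wiki}/{date}/dumpstatus.json"
    status = requests.get(url, timeout=30).json()
    # One entry per job, e.g. "articlesdump": "done"
    return {job: info.get("status") for job, info in status["jobs"].items()}

print(xml_dump_status("enwiktionary", "20230501"))
```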
Apr 11 2023
Related to T318371
Mar 24 2023
Ok, let me know once you have dumps available with the new infra and I'll re-generate them.
On the English Wiktionary we now use the HTML dumps to generate our stats. Some of our content is not in the mainspace and is therefore not reflected in the statistics. There are also problems generating information related to proto-languages, which live in the Reconstruction: namespace.
Thanks, are you referring to the deprecation of restbase/MCS? On the English Wiktionary, we're relying more and more on these dumps for statistics and maintenance tasks, and many editors have noticed problems with data derived from these dumps.
Mar 22 2023
Another cache-failure-related ticket, though probably not related: T226931
Mar 16 2023
Looks like T122934 is relevant and would help with this. Unfortunately, there's been no movement on that task recently.
Dec 5 2022
It works when adding -t latest.
Dec 4 2022
I've been looking at submitting a patch for this myself, but while building the Docker images from https://gerrit.wikimedia.org/g/operations/docker-images/toollabs-images I get the following error:
Nov 9 2022
I have disabled all gadgets and beta features (except "Visual Editing" and "New wikitext mode"), still the same result.
I've also tried it with Safari (see screenshot).
Oct 10 2022
The stats now have a correct timestamp, but there's still missing data. Can you please fix this? With this unpredictable mix of old and new data they're useless for most purposes right now; you might as well not generate them at all.
Oct 7 2022
Hmm, dumps are still not available…