Introduction
The Wikibase API is a recommended way to import entities in bulk into a Wikibase instance. However, the current performance of entity creation via the Wikibase API and its wrappers is roughly 0.5-20 items per second. There is no published comparison, but a few values were mentioned in the Wikibase Community Telegram group in March 2021: 0.55 (Andra), 5 (Myst), 18 (Adam) and 20 (Jeroen). I usually managed to create about 5 items per second using the Wikibase API or one of its wrappers. That performance is fine for years-long collaborative knowledge graph construction, but for short projects with 5-100 million entities it would be great to reach 100 items per second or more, at least for the initial upload of data.
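For reference, creating an item through the API boils down to one wbeditentity call per entity. The minimal sketch below (Python with requests) times such calls; the endpoint URL is an assumption for a local instance, and a real setup would authenticate first (e.g. with a bot account), since a default Wikibase does not allow anonymous item creation.

```python
# Minimal timing sketch for API-based item creation against an assumed local Wikibase.
# A real setup would log in first (e.g. action=login or OAuth) before fetching the token.
import json
import time
import requests

WIKIBASE_API = "http://localhost/w/api.php"  # assumed endpoint of the Wikibase instance

session = requests.Session()

# Fetch a CSRF (edit) token for this session.
r = session.get(WIKIBASE_API, params={
    "action": "query", "meta": "tokens", "type": "csrf", "format": "json"})
csrf_token = r.json()["query"]["tokens"]["csrftoken"]

def create_item(label_text):
    """Create one item with a single English label via action=wbeditentity."""
    data = {"labels": {"en": {"language": "en", "value": label_text}}}
    r = session.post(WIKIBASE_API, data={
        "action": "wbeditentity",
        "new": "item",
        "data": json.dumps(data),
        "token": csrf_token,
        "format": "json",
    })
    return r.json()

n = 100
start = time.time()
for i in range(n):
    create_item(f"test item {i}")
print(f"{n / (time.time() - start):.1f} items per second")
```

A sequential loop like this is the kind of client that yields the single-digit rates mentioned above.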
As a consequence of this performance issue, third-party Wikibase users are looking for bulk-import workarounds. For example, RaiseWikibase inserts entities and wikitexts directly into roughly ten tables of the SQL database. However, filling four secondary tables (needed to show entity labels in the Wikibase frontend) and building the CirrusSearch index are delegated to the Wikibase maintenance scripts (see the building_indexing function). The direct insert into the SQL database boosts performance to 280 items per second, but filling the secondary tables and CirrusSearch indexing also perform poorly, as Aidan Hogan pointed out in the Wikibase Community mailing list. Update on 04.08.2021: the secondary tables are now filled on the fly as well (see the commit).
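For illustration, here is a rough sketch of the RaiseWikibase usage pattern described above. The imports and the signatures of entity, label and batch are assumptions based on the project's README; only building_indexing is named in the text, so everything here should be verified against the RaiseWikibase repository.

```python
# Rough usage sketch of the RaiseWikibase direct-insert approach (signatures assumed).
from RaiseWikibase.datamodel import entity, label
from RaiseWikibase.raiser import batch, building_indexing

# Build minimal item documents (English labels only) and write them in one batch
# directly into the MediaWiki/Wikibase SQL tables, bypassing the Wikibase API.
items = [entity(labels=label(value=f'test item {i}'), etype='item')
         for i in range(1000)]
batch('wikibase-item', items)

# Run the wrapped Wikibase/CirrusSearch maintenance scripts to fill the secondary
# tables and build the search index (after the 04.08.2021 update mentioned above,
# the secondary tables are already filled on the fly).
building_indexing()
```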
Adam Shorland took part in all those discussions. This resulted in the performance benchmarking tool wikibase-profile, the post "What happens in Wikibase when you make a new Item?" and the related ticket T285987.
Problem
In short, the Wikibase API does many things, and some of them are not needed for a bulk import performed by admins.
Adam mentioned six levels that could be improved and optimized for a bulk-import use case. Below I list some things that could probably be ignored during the initial bulk import by admins.
- The API
  - parameter validation
  - permission checks for the user
- The Business
  - the edit token is validated
  - edit permissions are checked
  - rate limits are also checked
  - edit filter hooks run
- Wikibase persistence time
  - ?
- MediaWiki PageUpdater
  - some more permission checks
- Derived data updates
  - ?
- Secondary data updates
  - ?
Apart from that, we need some kind of benchmark. The wikibase-profile tool is a good starting point for that.
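Independently of wikibase-profile, a tiny client-side harness like the sketch below can at least put different importers (API-based, direct SQL insert) on a common items-per-second footing; the importer callables in the usage comments are placeholders.

```python
# Minimal benchmark harness sketch: time any bulk-import callable and report items/s.
import time

def benchmark(import_batch, n_items=1000, name="importer"):
    """Run import_batch(n_items) and report the achieved rate."""
    start = time.time()
    import_batch(n_items)
    elapsed = time.time() - start
    print(f"{name}: {n_items} items in {elapsed:.1f} s "
          f"({n_items / elapsed:.1f} items/s)")

# Example usage (both callables are placeholders for real importers):
# benchmark(api_import_batch, 1000, name="Wikibase API")
# benchmark(raisewikibase_import_batch, 1000, name="RaiseWikibase direct insert")
```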
Possible tasks & solutions
- Benchmark
- Discuss what is not needed in the Wikibase API for bulk import.
- Check the code described in the post and find performance bottlenecks. See T285987 as an example.
Predicted impact
- Faster data import.
- More users of Wikibase.
- Faster growth of the Wikibase Ecosystem.