User Details
- User Since
- Nov 12 2020, 6:16 PM (266 w, 16 h)
- Availability
- Available
- LDAP User
- Fabian Kaelin
- MediaWiki User
- FKaelin (WMF)
Thu, Dec 4
This HTML stream (or rather the events table that will be ingested into the data lake via Gobblin) will be the first step towards a production HTML dataset in the data lake, followed by other challenges such as reconciliation and backfilling. Since the complexity of this stream is limited (it is almost identical to the page content change stream), having this initial building block in place is, in my opinion, a high priority from an essential-work perspective. We have to get started somewhere, and once this stream is in place it will also be easier to scope and plan the other pieces that need to follow.
Tue, Nov 25
Great to hear. @GGoncalves-WMF your educated guess is indeed correct. Here is some additional context
Mon, Nov 24
It would be great if you could pick this up, @JMonton-WMF.
Wed, Nov 19
Resolving this - feedback and next steps will be discussed in T398071#11362486
The dataset has been updated with data from September and October 2025, available in the table fab.content_diff_edit_types_index. Resolving this.
Nov 10 2025
Weekly updates
- started implementation of commons-utils as a project in ml-pipeline. Initial focus is on CI integration to publish a wheel, and on adding it as a dependency to another project.
Nov 3 2025
Oct 31 2025
List of tables with a non-fully-qualified URI location (leaving out the tables in users' personal databases):
wmde.campaign_banner_impressions_quarter_hourly wmde.tmp_mwh_wiki_editor_activity_flags_monthly wmde.tmp_wdqs_normalized_queries_and_metadata wmde.tmp_wdqs_query_segments wmde.wd_action_api_metrics_monthly wmde.wd_action_api_request_metadata_monthly wmde.wd_article_placeholder_metrics_daily wmde.wd_changes_preference_usage_by_wiki_monthly wmde.wd_changes_preference_usage_distinct_monthly wmde.wd_coeditors_by_wiki_monthly wmde.wd_coeditors_distinct_monthly wmde.wd_device_type_edits_monthly wmde.wd_dump_metrics_monthly wmde.wd_dump_request_metadata_monthly wmde.wd_entity_schema_namespace_metrics_daily wmde.wd_entity_usage_by_wiki_monthly wmde.wd_entity_usage_distinct_monthly wmde.wd_item_sitelink_segments_weekly wmde.wd_query_segments_daily wmde.wd_reliability_metrics_daily wmde.wd_rest_api_metrics_monthly wmde.wd_rest_api_request_metadata_monthly wmde.wd_rollback_editors_monthly wmde.wd_special_entity_data_metrics_daily wmde.wd_special_entity_schema_text_metrics_daily wmde.wdqs_metrics_daily wmde.wdqs_metrics_monthly wmde.wiki_editor_activity_levels_monthly wmde.wiki_page_wd_entity_usage_monthly wmde.wit_docs_pageview_metrics_monthly wmde.wlb_commons_video_metrics_daily wmde.wlb_commons_video_metrics_monthly wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 wmf_content.mediawiki_content_history_v1_old wmf_contributors.commons_category_metrics_snapshot wmf_contributors.commons_edits wmf_contributors.commons_media_file_metrics_snapshot wmf_contributors.commons_pageviews_per_category_monthly wmf_contributors.commons_pageviews_per_media_file_monthly wmf_contributors.editor_month wmf_contributors.new_editor wmf_data_ops.data_quality_alerts wmf_data_ops.data_quality_metrics wmf_experiments.experiment_results_v1 wmf_experiments.experiments_registry_v1 wmf_experiments.metrics_catalog_v1 wmf_product.automoderator_activity_snapshot_monthly wmf_product.automoderator_config wmf_product.automoderator_monitoring_snapshot_daily wmf_product.automoderator_potential_vandalism_reverted wmf_product.citation_needed_clickthroughs_daily wmf_product.citation_needed_searches_daily wmf_product.commons_deletions_monthly wmf_product.commons_uploads_monthly wmf_product.commons_uploadwizard_deletions_monthly wmf_product.cx_abuse_filter_daily wmf_product.cx_corpora wmf_product.cx_deletion_stats_monthly wmf_product.cx_draft_translations_daily wmf_product.cx_exclude_users wmf_product.cx_key_metrics_monthly wmf_product.cx_mt_default_service_comparison_monthly wmf_product.cx_mt_service_availability wmf_product.cx_mt_service_usage_monthly wmf_product.cx_published_translations_daily wmf_product.cx_suggestions_menu_interactions_daily wmf_product.cx_translations wmf_product.cx_translators wmf_product.moderation_flagged_revisions_pending_hourly wmf_product.moderation_patrolled_recentchanges_daily wmf_product.moderation_unpatrolled_recentchanges_daily wmf_product.moderation_vandal_pageviews_monthly wmf_product.trust_safety_admin_action_daily wmf_product.trust_safety_admin_action_monthly wmf_product.trust_safety_admin_monthly wmf_product.trust_safety_admin_request_monthly wmf_product.trust_safety_block_daily wmf_product.trust_safety_block_monthly wmf_product.trust_safety_new_admin_monthly wmf_readership.unique_devices_per_domain_daily wmf_readership.unique_devices_per_domain_monthly wmf_readership.unique_devices_per_project_family_daily wmf_readership.unique_devices_per_project_family_monthly wmf_traffic.aqs_hourly wmf_traffic.browser_general wmf_traffic.interlanguage_navigation wmf_traffic.referrer_daily 
wmf_traffic.session_length
Oct 30 2025
@EBernhardson Thank you for the background!
Oct 29 2025
This is fixed.
The fix has been deployed, the job has been backfilled, and the schema has been updated with the following commands.
ALTER TABLE research.mediawiki_content_diff DROP COLUMN revision_text_sha1;
ALTER TABLE research.mediawiki_content_diff ADD COLUMN user_central_id BIGINT COMMENT 'Global cross-wiki user ID. See: https://www.mediawiki.org/wiki/Manual:Central_ID' AFTER user_id;
Oct 28 2025
Oct 24 2025
Oct 20 2025
Related question regarding the flow of data, based on the comment from the thread you linked.
The wikitext -> HTML conversion happens inside the MediaWiki application using the default MediaWiki parser. I'm not sure what exactly happens under the hood; I expect it's a full PHP parser that runs in-process, but I haven't paid enough attention to exactly what they do. This is indeed quite expensive, we are running hundreds of pages a second through the parser. Part of the reason I suggest we could do this is because we already parse this flow of data. Even at this high rate, it still takes a long time to get through everything. We have a loop that re-renders everything even if not edited, but it works on 16-week cycles.
Weekly updates
Oct 16 2025
This work has concluded.
Oct 3 2025
Nice, this works now. Thank you.
Thank you for the update. I think /etc/sudoers.d/ also has to be updated, i.e. sudo -u analytics-research ls still asks for a password.
$ ls /etc/sudoers.d/analytics-*
/etc/sudoers.d/analytics-admins  /etc/sudoers.d/analytics-ml-users  /etc/sudoers.d/analytics-privatedata-users
/etc/sudoers.d/analytics-product-users  /etc/sudoers.d/analytics-search-users  /etc/sudoers.d/analytics-wmde-users
Oct 2 2025
Semantic search prototypes updates:
- An updated semantic-search prototype is available on https://semantic-search.wmcloud.org, hosted on CloudVPS (it is very slow)
- There is a dropdown to choose which index to use, with the following options
- section level search for the following wikis: simple, en_space (en pages that exist in simple), Turkish (tr) and Greek (el)
- paragraph level search for simple and en_space
- paragraph level results return a paragraph as "section_text", while the hyperlink points to the section the paragraph is in. The index number refers to the paragraph, so there can be multiple results pointing to the same section but different paragraphs.
- Performance:
- The prototype on CloudVPS is very slow, and even slower if multiple people use it. It can take >10s per query. The instance is I/O bound, i.e. the data is read at the speed of a spinning disk (there is a phab task requesting an increase in the quota), but ideally there would be more RAM. Another option is to host the service on the DSE cluster.
- The prototype runs much faster on a stat machine, http://stat1010.eqiad.wmnet:8000/ (requires an ssh tunnel). That instance also hosts additional indices, e.g. en lead sections for all pages, de all sections, etc. There are 185GB of index files, vs 65GB on CloudVPS.
- Observations:
- The embeddings were recomputed so that the page title and section name are always prepended to the embedding text; this makes a difference for the paragraph level embeddings.
- I did not experiment at all with models other than e5 large instruct, nor with prompts other than f"""Instruct: Given a natural language query, retrieve relevant sections of wikipedia articles that answer the query Query: {query}""".
- These prototypes are hopefully useful for the next phase of product discussions; I think we are still some way from a concrete product / user story that makes use of semantic search / embeddings.
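For reference, a minimal sketch of how such a query prompt could be applied before embedding, assuming a sentence-transformers setup with the e5 large instruct model; the model identifier and helper function below are illustrative, not the prototype's actual code.
```python
# Hypothetical sketch, not the prototype code: embed a query with the
# instruct prompt quoted above, assuming sentence-transformers and the
# multilingual e5 large instruct model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

def embed_query(query: str):
    prompt = (
        "Instruct: Given a natural language query, retrieve relevant sections "
        "of wikipedia articles that answer the query\n"
        f"Query: {query}"
    )
    # Section/paragraph texts would be embedded without the instruct prefix,
    # but with page title and section name prepended (per the note above).
    return model.encode(prompt, normalize_embeddings=True)

query_vector = embed_query("history of the metric system")
```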
Sep 15 2025
- Summary of the outputs:
- Discussion with product management
- Moderator tools WE1.3 (Sam Walton). Editor history/discovery is a need, but there are limited tools. Related to content investigations (activity around certain terms/topics), spam links (also see T221397), and sockpuppet detection. Moderator hub: providing patrollers with a central location that suggests things that they could do. There is a need for data sources that can help inform those recommendations. Editor history could be such a data source; we need to bridge the gap between the existing data and product needs.
- Anti abuse WE4.3 (Kosta/Madalina). The work on "suggested investigations" focuses on signals that are not visible through other means. Patterns/sequences of events that are interesting to check-users; can we consolidate them into a risk signal to be displayed? Examples: shared email for signup, suspicious hCaptcha activity. Editor history is a good candidate for such signals. Example queries discussed (see the sketch after this list): 1. the top X editors that have added the most external links in the last Y days, per wiki. 2. for the top X external links, the list of IPs that added these links.
- Next steps
- Define one or two signals for suggested investigations teams, current candidates
- Propose a hypothesis for the next phase of this work, in discussion with product teams
- PMs will share the prototype for feedback with community members
- Decide if edit types should use the HTML or wikitext (cc @Isaac ). T378617
- Start process with DPE to have a productionized edit types dataset T351225
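As a rough illustration of example query 1 above (top X editors by external links added in the last Y days, per wiki), here is a hedged PySpark sketch; the table name, column names, and edit type label are assumptions about the schema, not the actual dataset definition.
```python
# Illustrative only: top X editors by external links added in the last Y days,
# per wiki. Table name, column names, and the edit type label are assumptions.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
X, Y = 10, 30

links_added = (
    spark.table("fab.content_diff_edit_types_index")          # assumed table
    .where(F.col("dt") >= F.date_sub(F.current_date(), Y))    # last Y days
    .where(F.col("edit_type") == "external_link_insert")      # assumed label
    .groupBy("wiki_db", "user_text")
    .agg(F.count("*").alias("links_added"))
)

w = Window.partitionBy("wiki_db").orderBy(F.desc("links_added"))
top_editors = (
    links_added
    .withColumn("rank", F.row_number().over(w))
    .where(F.col("rank") <= X)
)
top_editors.show()
```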
Sep 10 2025
Sep 5 2025
I agree with @BWojtowicz-WMF and prefer storing all predictions in a single value. The thresholds are "external" to the model predictions, i.e. they are more of a product question, and determining/changing thresholds should also be done outside the generation/storage of predictions. The increased flexibility also allows queries such as "give me the top 10 topics" independent of the threshold.
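A minimal sketch of that design, assuming predictions are stored one row per (page, topic, score) in a hypothetical topic_predictions table; thresholds and top-k are then applied at query time rather than at generation time.
```python
# Illustrative sketch: store every topic score, decide thresholds downstream.
# The table name and columns (page_id, topic, score) are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
preds = spark.table("topic_predictions")  # hypothetical: page_id, topic, score

# Product-side choice applied at query time, not baked into the stored data.
above_threshold = preds.where(F.col("score") >= 0.5)

# "Give me the top 10 topics" per page, independent of any threshold.
top10 = (
    preds
    .groupBy("page_id")
    .agg(
        F.slice(
            F.sort_array(F.collect_list(F.struct("score", "topic")), asc=False),
            1, 10,
        ).alias("top_topics")
    )
)
```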
Sep 4 2025
The knowledge gap pipeline (which is snapshot based) aggregates the historical pageviews too, as the pageview_hourly dataset is sizable for larger time ranges. The knowledge_gaps.pageviews_daily table contains a subset of pages (wikipedia, namespace 0, agent_type=user), has a minimal schema, and is partitioned by date (i.e. day); one day of data is ~4GB compared to ~25GB for pageview_hourly. The code is here. The dag runs weekly.
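A rough sketch of that daily aggregation, with the filters mentioned above (wikipedia projects, namespace 0, agent_type=user); the source table follows wmf.pageview_hourly conventions, but the exact column names and output table are assumptions, not the pipeline's actual code.
```python
# Illustrative only: aggregate one day of pageview_hourly into a minimal
# daily table. Column names and the output table are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
day = "2025-07-01"

daily = (
    spark.table("wmf.pageview_hourly")
    .where("year = 2025 AND month = 7 AND day = 1")
    .where("project LIKE '%wikipedia' AND namespace_id = 0 AND agent_type = 'user'")
    .groupBy("project", "page_id", "page_title")
    .agg(F.sum("view_count").alias("view_count"))
    .withColumn("date", F.to_date(F.lit(day)))
)

daily.writeTo("knowledge_gaps.pageviews_daily").append()  # assumed output table
```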
Aug 28 2025
- Added an example notebook showing how to calculate the geography X gender gap intersection for standard quality articles.
Aug 26 2025
This work is in this repo. It includes extensible ways to compute metrics (including evaluation and latency/performance); future needs for new measurements/environments can use this tooling.
Closing this, the geography model pipeline is part of T387041
Related planning doc from ML team
Aug 15 2025
Weekly updates:
- a prototype for querying editor history is deployed on cloud vps for 1 month of data (August 2024)
- a design plan draft that describes the approach
- ongoing discussion for how and to whom to present this work in product management
Aug 4 2025
Weekly update for edit diffs:
- Notebook for exploring query patterns.
- Switched to using edit types instead of tokenizing the added/removed words using the mwtokenizer. The structured output of edit-types is much preferable to imposing structure on the aggregated raw tokens. Running edit-types at scale is computationally challenging due the memory hungriness of the parsing library (mwparserfromhell). The approach to have a daily dag that appends to an iceberg table seems to work, likely because the suspected memory leaks don't have "enough time" to explode the job, as each batch only computes one day of data. This will be helpful for T351225.
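A minimal sketch of that daily-append pattern, assuming a hypothetical compute_edit_types helper wrapping the edit-types library, an assumed source table/partition column, and an Iceberg output table; per-batch memory stays bounded because each run only parses one day of revisions.
```python
# Illustrative sketch of the daily batch described above. The source table,
# partition column, output table, and compute_edit_types helper are
# assumptions, not the actual DAG code.
from pyspark.sql import DataFrame, SparkSession

def compute_edit_types(revisions: DataFrame) -> DataFrame:
    # Hypothetical: would apply the edit-types / mwparserfromhell diffing per
    # revision, e.g. via a pandas UDF.
    raise NotImplementedError

def run_daily_batch(day: str) -> None:
    spark = SparkSession.builder.getOrCreate()
    revisions = (
        spark.table("wmf_content.mediawiki_content_history_v1")
        .where(f"to_date(revision_dt) = '{day}'")   # one day per run
    )
    edit_types = compute_edit_types(revisions)
    # Append only this day's results; earlier partitions stay untouched.
    edit_types.writeTo("fab.content_diff_edit_types_index").append()
```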
Jul 25 2025
If there are no changes required on the airflow dag itself, it is preferable to not have to do another review/deploy just to start using the new gitlab artifact. The pipeline python project is already versioned (and by consequence so are the artifacts in the gitlab package registry), so the extra versioning on the airflow-dags side is not needed from our side (only for the caching of the artifact on hdfs).
Jul 24 2025
Research does not bump versions for minor things, e.g. the version remains the same but gitlab built a new package. Does blunderbuss update the artifact cache if there is a new conda environment but no change to the yaml file? With scap deploy, if there are no airflow changes that would trigger an artifact sync anyway, you could use -f to force a sync.
Jul 21 2025
I would like to keep this folder until the end of Q2 in case there is a need to dig deeper into the recent projects Muniza was working on.
Jul 10 2025
That creates a substantial dataset as part of the base features dataset (the wikitext and parent wikitext for all revisions in these 10+ years), which could be around ~10TB of data. Let's first try with a beefier spark config, e.g. spark.sql.shuffle.partitions=4000, maxExecutors=129, executor-cores=4, executor-memory=24G. There are also timeout configs to play with, but that is not a fun place to be.
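For reference, one way to apply those settings when building the session; the values are taken from the suggestion above, the config keys are standard Spark settings, and whether these limits are allowed depends on the cluster.
```python
# Beefier Spark config per the suggestion above; adjust to cluster limits.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("base-features-wikitext")  # illustrative app name
    .config("spark.sql.shuffle.partitions", 4000)
    .config("spark.dynamicAllocation.maxExecutors", 129)
    .config("spark.executor.cores", 4)
    .config("spark.executor.memory", "24g")
    .getOrCreate()
)
```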
Jul 9 2025
Weekly updates
- pipeline to aggregate an editor-centric view of the content diff dataset. For every wiki_db and editor (user name, IP, or temp name), we aggregate information about each token the editor acted on (see the sketch after this list). The information includes the action (added/removed), the count of how many times that action happened, and the affected revision_ids and page titles.
- the mwtokenizer library is used to tokenize the diff
- dataset for 1 year of data, for the wikis "simplewiki", "arwiki", "dewiki", "enwiki", is available on hdfs at `/user/fab/content_diff_index/`
- evaluation of database options for prototype
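A hedged sketch of that editor-centric aggregation (see the first bullet above); the input table and column names are assumptions about the content diff dataset, not the actual pipeline code.
```python
# Illustrative only: aggregate, per (wiki_db, editor, token, action), how often
# the action happened and the affected revisions/pages. Input table and
# columns are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# assumed schema: wiki_db, editor, token, action ('added'/'removed'),
# revision_id, page_title
diff_tokens = spark.table("research.mediawiki_content_diff_tokens")

editor_index = (
    diff_tokens
    .groupBy("wiki_db", "editor", "token", "action")
    .agg(
        F.count("*").alias("action_count"),
        F.collect_set("revision_id").alias("revision_ids"),
        F.collect_set("page_title").alias("page_titles"),
    )
)

editor_index.write.mode("overwrite").parquet("/user/fab/content_diff_index/")
```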
Looking at the logs, the job seems to fail with timeouts and workers being removed from the pool - which often indicates that there are not enough resources available.
25/07/06 10:16:57 WARN TaskSetManager: Lost task 20.0 in stage 2.3 (TID 99049) (an-worker1117.eqiad.wmnet executor 281): FetchFailed(BlockManagerId(47, an-worker1174.eqiad.wmnet, 7337, None), shuffleId=1, mapIndex=399, mapId=8301, reduceId=391, message= org.apache.spark.shuffle.FetchFailedException: Connecting to an-worker1174.eqiad.wmnet/10.64.165.4:7337 failed in the last 9500 ms, fail this connection directly
Jul 8 2025
Closing this in favor of T396495, which will provide the scaffolding needed. Research can use this work as a starting point when there is a specific service to deploy on the DSE cluster.
Jul 4 2025
Jul 3 2025
Jul 2 2025
Jun 23 2025
This work was merged with MR
Jun 17 2025
Resolving this, the remaining T344625 is not relevant anymore.
Jun 12 2025
Jun 9 2025
The data directories can be removed.
Jun 2 2025
I see these wikis were above the threshold in these results. There might be something wrong with the embeddings or the snapshot I use. @fkaelin do you get similar results and which snapshot do you use? The results are not deterministic but similar.
These results should be deterministic. At first glance it could be that there are multiple test/evaluation files, and when loading the data the order in which they are read is not guaranteed. Note, as mentioned in our discussion, this evaluation code has not been migrated to a spark/pipeline oriented design. I will have a closer look.
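A small sketch of the suspected issue and fix, assuming the evaluation examples are spread over multiple files; the paths and format below are hypothetical. Sorting the file list before reading gives a stable order.
```python
# Hypothetical illustration: without sorting, glob/listing order can differ
# between runs, which makes downstream results look non-deterministic.
import json
from pathlib import Path

def load_eval_examples(data_dir: str) -> list:
    examples = []
    for path in sorted(Path(data_dir).glob("*.json")):  # stable, sorted order
        with open(path) as f:
            examples.extend(json.load(f))
    return examples
```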
May 29 2025
- kickoff meeting with ML team
- choice of model: language agnostic revert risk model
- output of work will be report / notebook
- code will be added to research-datasets
May 27 2025
Final asana report:
May 19 2025
Weekly updates
- Updates have been completed, all sub tasks are resolved
- This task will be resolved when wmf_content.mediawiki_content_current_v1 is in prod (T391279)
May 16 2025
- prod
- model on prod; one model per language.
- research-datasets
- supports single model for multiple languages. Not used on prod yet.
correct
- uses the research.article_topics table for embeddings rather than wikipedia2vec. However, this table seems to be empty. (I think this table was populated recently, we have data now. Do you know how to populate data in this table? Do we have any pipeline that we can trigger later?)
The table was not configured for the prod instance - the table is available now.
- Can we summarize the improvements in this repo compared to the prod one? It seems it's this list and maybe some more.
- airflow-dags
- uses research datasets with a language group config. Not used on prod yet.
- I've just triggered this pipeline. Curious about the evaluation scores. I hope it's not a problem and does not affect anything.
- research-mwaddlink-gitlab: Should we ignore this repo as it's in a user's personal space?
The linked meta page is a good summary by Martin. The described effort contains both changes to the modelling approach (joint models, link embeddings instead of w2v, ...) and changes to how the pipeline is executed (airflow migration with research engineering support). The latter was essentially required by the former (i.e. the pain points for retraining apply to the researchers too), but it also means that the two are connected. If (theoretically) you/the ML team had started this investigation without the model improvement work done by Aisha/Martin, migrating the current training pipeline to an airflow dag with the same modelling approach (i.e. the output of the model should be equivalent, no joint model etc) would have been a reasonable first step, and only then iterating on the modelling and serving components. In that sense, for the linked repos this means:
- Aisha's research-mwaddlink-gitlab implements the model improvement changes
- The research-datasets pipeline used by the airflow dag is based on Aisha's repo, but it substantially changes the logic to a spark-focused design.
Done with MR
Done with MR
Done with MR
May 12 2025
May 10 2025
Weekly updates
- Airflow dag updates were tested and deployed to production
- The affected pipelines were backfilled and are now up-to-date
- The remaining open tasks all depend on the content current dataset (wmf_content.mediawiki_content_current_v1). The pipelines have been tested with the wip dataset. When DE promotes this dataset to production, there will be an airflow sensor that the dags can await.
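As an illustration of the sensor idea, a sketch using the generic Hive partition sensor from the Airflow Hive provider; the actual airflow-dags setup may use a different custom sensor, so the task id, connection id, and partition spec here are assumptions.
```python
# Illustrative only: wait for a partition of the content current dataset
# before running downstream tasks. The sensor class exists in the Airflow
# Hive provider, but the conn id and partition spec are assumptions.
from airflow.providers.apache.hive.sensors.hive_partition import HivePartitionSensor

wait_for_content_current = HivePartitionSensor(
    task_id="wait_for_mediawiki_content_current_v1",
    table="wmf_content.mediawiki_content_current_v1",
    partition="snapshot='{{ ds }}'",      # assumed partition layout
    metastore_conn_id="metastore_default",
    poke_interval=60 * 60,                # check hourly
)
```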
@leila thank you for helping to coordinate.
May 6 2025
Apr 28 2025
Weekly updates
