User Details
- User Since
- Nov 12 2020, 6:16 PM (266 w, 16 h)
- Availability
- Available
- LDAP User
- Fabian Kaelin
- MediaWiki User
- FKaelin (WMF)
Thu, Dec 4
This HTML stream (or rather the events table that will be ingested into the data lake via Gobblin) will be the first step towards a production HTML dataset in the data lake, followed by other challenges such as reconciliation and backfilling. Since the complexity of this stream is limited (it is almost identical to the page content change stream), having this initial building block in place is, in my opinion, a high priority from an essential-work perspective. We have to get started somewhere, and once this stream is in place it will also be easier to scope and plan the other pieces that need to follow.
Tue, Nov 25
Great to hear. @GGoncalves-WMF your educated guess is indeed correct. Here is some additional context
Mon, Nov 24
It would be great if you could pick this up, @JMonton-WMF.
Wed, Nov 19
Resolving this - feedback and next steps will be discussed in T398071#11362486
The dataset has been updated with data from September and October 2025, available in the table fab.content_diff_edit_types_index. Resolving this.
Nov 10 2025
Weekly updates
- started implementation of commons-utils as a project in ml-pipeline. Initial focus is on CI integration to publish a wheel, and on adding it as a dependency to another project.
Nov 3 2025
Oct 31 2025
List of tables with a non-fully-qualified URI location (leaving out the tables in users' personal databases):
wmde.campaign_banner_impressions_quarter_hourly wmde.tmp_mwh_wiki_editor_activity_flags_monthly wmde.tmp_wdqs_normalized_queries_and_metadata wmde.tmp_wdqs_query_segments wmde.wd_action_api_metrics_monthly wmde.wd_action_api_request_metadata_monthly wmde.wd_article_placeholder_metrics_daily wmde.wd_changes_preference_usage_by_wiki_monthly wmde.wd_changes_preference_usage_distinct_monthly wmde.wd_coeditors_by_wiki_monthly wmde.wd_coeditors_distinct_monthly wmde.wd_device_type_edits_monthly wmde.wd_dump_metrics_monthly wmde.wd_dump_request_metadata_monthly wmde.wd_entity_schema_namespace_metrics_daily wmde.wd_entity_usage_by_wiki_monthly wmde.wd_entity_usage_distinct_monthly wmde.wd_item_sitelink_segments_weekly wmde.wd_query_segments_daily wmde.wd_reliability_metrics_daily wmde.wd_rest_api_metrics_monthly wmde.wd_rest_api_request_metadata_monthly wmde.wd_rollback_editors_monthly wmde.wd_special_entity_data_metrics_daily wmde.wd_special_entity_schema_text_metrics_daily wmde.wdqs_metrics_daily wmde.wdqs_metrics_monthly wmde.wiki_editor_activity_levels_monthly wmde.wiki_page_wd_entity_usage_monthly wmde.wit_docs_pageview_metrics_monthly wmde.wlb_commons_video_metrics_daily wmde.wlb_commons_video_metrics_monthly wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 wmf_content.mediawiki_content_history_v1_old wmf_contributors.commons_category_metrics_snapshot wmf_contributors.commons_edits wmf_contributors.commons_media_file_metrics_snapshot wmf_contributors.commons_pageviews_per_category_monthly wmf_contributors.commons_pageviews_per_media_file_monthly wmf_contributors.editor_month wmf_contributors.new_editor wmf_data_ops.data_quality_alerts wmf_data_ops.data_quality_metrics wmf_experiments.experiment_results_v1 wmf_experiments.experiments_registry_v1 wmf_experiments.metrics_catalog_v1 wmf_product.automoderator_activity_snapshot_monthly wmf_product.automoderator_config wmf_product.automoderator_monitoring_snapshot_daily wmf_product.automoderator_potential_vandalism_reverted wmf_product.citation_needed_clickthroughs_daily wmf_product.citation_needed_searches_daily wmf_product.commons_deletions_monthly wmf_product.commons_uploads_monthly wmf_product.commons_uploadwizard_deletions_monthly wmf_product.cx_abuse_filter_daily wmf_product.cx_corpora wmf_product.cx_deletion_stats_monthly wmf_product.cx_draft_translations_daily wmf_product.cx_exclude_users wmf_product.cx_key_metrics_monthly wmf_product.cx_mt_default_service_comparison_monthly wmf_product.cx_mt_service_availability wmf_product.cx_mt_service_usage_monthly wmf_product.cx_published_translations_daily wmf_product.cx_suggestions_menu_interactions_daily wmf_product.cx_translations wmf_product.cx_translators wmf_product.moderation_flagged_revisions_pending_hourly wmf_product.moderation_patrolled_recentchanges_daily wmf_product.moderation_unpatrolled_recentchanges_daily wmf_product.moderation_vandal_pageviews_monthly wmf_product.trust_safety_admin_action_daily wmf_product.trust_safety_admin_action_monthly wmf_product.trust_safety_admin_monthly wmf_product.trust_safety_admin_request_monthly wmf_product.trust_safety_block_daily wmf_product.trust_safety_block_monthly wmf_product.trust_safety_new_admin_monthly wmf_readership.unique_devices_per_domain_daily wmf_readership.unique_devices_per_domain_monthly wmf_readership.unique_devices_per_project_family_daily wmf_readership.unique_devices_per_project_family_monthly wmf_traffic.aqs_hourly wmf_traffic.browser_general wmf_traffic.interlanguage_navigation wmf_traffic.referrer_daily 
wmf_traffic.session_length
Oct 30 2025
@EBernhardson Thank you for the background!
Oct 29 2025
This is fixed.
The fix has been deployed, the job has been backfilled, and the schema has been updated with the following commands.
ALTER TABLE research.mediawiki_content_diff DROP COLUMN revision_text_sha1;
ALTER TABLE research.mediawiki_content_diff ADD COLUMN user_central_id BIGINT COMMENT 'Global cross-wiki user ID. See: https://www.mediawiki.org/wiki/Manual:Central_ID' AFTER user_id;
Oct 28 2025
Oct 24 2025
Oct 20 2025
Related question regarding the flow of data, based on the comment from the thread you linked.
The wikitext -> HTML conversion happens inside the MediaWiki application using the default MediaWiki parser. I'm not sure what exactly happens under the hood; I expect it's a full PHP parser that runs in-process, but I haven't paid enough attention to exactly what they do. This is indeed quite expensive, we are running hundreds of pages a second through the parser. Part of the reason I suggest we could do this is because we already parse this flow of data. Even at this high rate, it still takes a long time to get through everything. We have a loop that re-renders everything even if not edited, but it works on 16-week cycles.
Weekly updates
Oct 16 2025
This work has concluded.
Oct 3 2025
Nice, this works now. Thank you.
Thank you for the update. I think /etc/sudoers.d/ also has to be updated, i.e. sudo -u analytics-research ls still asks for a password.
$ ls /etc/sudoers.d/analytics-*
/etc/sudoers.d/analytics-admins  /etc/sudoers.d/analytics-ml-users  /etc/sudoers.d/analytics-privatedata-users
/etc/sudoers.d/analytics-product-users  /etc/sudoers.d/analytics-search-users  /etc/sudoers.d/analytics-wmde-users
Oct 2 2025
Semantic search prototypes updates:
- An updated semantic-search prototype is available on https://semantic-search.wmcloud.org, hosted on CloudVPS (it is very slow)
- There is a dropdown to choose which index to use, with the following options
- section level search for the following wikis: simple, en_space (en pages that exist in simple), Turkish (tr) and Greek (el)
- paragraph level search for simple and en_space
- paragraph level results return a paragraph as "section_text", while the hyperlink points to the section the paragraph is in. The index number refers to the paragraph, so there can be multiple results pointing to the same section but different paragraphs.
- Performance:
- The prototype on CloudVPS is very slow, and even slower if multiple people use it. It can take >10s per query. The instance is I/O bound, i.e. the data is read at the speed of a spinning disk (there is a phab task requesting an increase in the quota), but ideally there would be more RAM. Another option is to host the service on the DSE cluster.
- The prototype runs much faster on a stat machine, http://stat1010.eqiad.wmnet:8000/ (requires an ssh tunnel). That instance also hosts additional indices, e.g. en lead sections for all pages, de all sections, etc. There are 185GB of index files, vs 65GB on CloudVPS.
- Observations:
- The embeddings were recomputed so that the page title and section name are always prepended to the embedding text; this makes a difference for the paragraph level embeddings.
- I did not experiment at all with models other than e5 large instruct, nor with prompts other than f"""Instruct: Given a natural language query, retrieve relevant sections of wikipedia articles that answer the query Query: {query}""".
- These prototypes are hopefully useful for the next phase of product discussions; I think we are still some way from a concrete product / user story that makes use of semantic search / embeddings.
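For reference, a minimal sketch of how such a query prompt could be applied before embedding, assuming a sentence-transformers setup with the e5 large instruct model; the model identifier and helper function below are illustrative, not the prototype's actual code.
```python
# Hypothetical sketch, not the prototype code: embed a query with the
# instruct prompt quoted above, assuming sentence-transformers and the
# multilingual e5 large instruct model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

def embed_query(query: str):
    prompt = (
        "Instruct: Given a natural language query, retrieve relevant sections "
        "of wikipedia articles that answer the query\n"
        f"Query: {query}"
    )
    # Section/paragraph texts would be embedded without the instruct prefix,
    # but with page title and section name prepended (per the note above).
    return model.encode(prompt, normalize_embeddings=True)

query_vector = embed_query("history of the metric system")
```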
Sep 15 2025
- Summary of the outputs:
- Discussion with product management
- Moderator tools WE1.3 (Sam Walton). Editor history/discovery is a need, but there are limited tools. Related to content investigations (activity around certain terms/topics), spam links (also see T221397), and sockpuppet detection. Moderator hub: providing patrollers with a central location that suggests things that they could do. There is a need for data sources that can help inform those recommendations. Editor history could be such a data source; we need to bridge the gap between the existing data and product needs.
- Anti abuse WE4.3 (Kosta/Madalina). The work on "suggested investigations" focuses on signals that are not visible through other means. Patterns/sequences of events that are interesting to check-users; can we consolidate them into a risk signal to be displayed? Examples: shared email for signup, suspicious hCaptcha activity. Editor history is a good candidate for such signals. Example queries discussed (see the sketch after this list): 1. the top X editors that have added the most external links in the last Y days, per wiki. 2. for the top X external links, the list of IPs that added these links.
- Next steps
- Define one or two signals for suggested investigations teams, current candidates
- Propose a hypothesis for the next phase of this work, in discussion with product teams
- PMs will share the prototype for feedback with community members
- Decide if edit types should use the HTML or wikitext (cc @Isaac ). T378617
- Start process with DPE to have a productionized edit types dataset T351225
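As a rough illustration of example query 1 above (top X editors by external links added in the last Y days, per wiki), here is a hedged PySpark sketch; the table name, column names, and edit type label are assumptions about the schema, not the actual dataset definition.
```python
# Illustrative only: top X editors by external links added in the last Y days,
# per wiki. Table name, column names, and the edit type label are assumptions.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
X, Y = 10, 30

links_added = (
    spark.table("fab.content_diff_edit_types_index")          # assumed table
    .where(F.col("dt") >= F.date_sub(F.current_date(), Y))    # last Y days
    .where(F.col("edit_type") == "external_link_insert")      # assumed label
    .groupBy("wiki_db", "user_text")
    .agg(F.count("*").alias("links_added"))
)

w = Window.partitionBy("wiki_db").orderBy(F.desc("links_added"))
top_editors = (
    links_added
    .withColumn("rank", F.row_number().over(w))
    .where(F.col("rank") <= X)
)
top_editors.show()
```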
Sep 10 2025
Sep 5 2025
I agree with @BWojtowicz-WMF and prefer storing all predictions in a single value. The thresholds are "external" to the model predictions, i.e. they are more of a product question, and determining/changing thresholds should also be done outside the generation/storage of predictions. The increased flexibility also allows queries such as "give me the top 10 topics" independent of the threshold.
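A minimal sketch of that design, assuming predictions are stored one row per (page, topic, score) in a hypothetical topic_predictions table; thresholds and top-k are then applied at query time rather than at generation time.
```python
# Illustrative sketch: store every topic score, decide thresholds downstream.
# The table name and columns (page_id, topic, score) are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
preds = spark.table("topic_predictions")  # hypothetical: page_id, topic, score

# Product-side choice applied at query time, not baked into the stored data.
above_threshold = preds.where(F.col("score") >= 0.5)

# "Give me the top 10 topics" per page, independent of any threshold.
top10 = (
    preds
    .groupBy("page_id")
    .agg(
        F.slice(
            F.sort_array(F.collect_list(F.struct("score", "topic")), asc=False),
            1, 10,
        ).alias("top_topics")
    )
)
```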
Sep 4 2025
The knowledge gap pipeline (which is snapshot based) aggregates the historical pageviews too, as the pageview_hourly dataset is sizable for larger time ranges. The knowledge_gaps.pageviews_daily table contains a subset of pages (wikipedia, namespace 0, agent_type=user), has a minimal schema, and is partitioned by date (i.e. day); one day of data is ~4GB compared to ~25GB for pageview_hourly. The code is here. The dag runs weekly.
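A rough sketch of that daily aggregation, with the filters mentioned above (wikipedia projects, namespace 0, agent_type=user); the source table follows wmf.pageview_hourly conventions, but the exact column names and output table are assumptions, not the pipeline's actual code.
```python
# Illustrative only: aggregate one day of pageview_hourly into a minimal
# daily table. Column names and the output table are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
day = "2025-07-01"

daily = (
    spark.table("wmf.pageview_hourly")
    .where("year = 2025 AND month = 7 AND day = 1")
    .where("project LIKE '%wikipedia' AND namespace_id = 0 AND agent_type = 'user'")
    .groupBy("project", "page_id", "page_title")
    .agg(F.sum("view_count").alias("view_count"))
    .withColumn("date", F.to_date(F.lit(day)))
)

daily.writeTo("knowledge_gaps.pageviews_daily").append()  # assumed output table
```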
Aug 28 2025
- Added an example notebook showing how to calculate the geography X gender gap intersection for standard quality articles.
Aug 26 2025
This work is in this repo. It includes extensible ways to compute metrics (including evaluation and latency/performance); future needs for new measurements/environments can use this tooling.
Closing this, the geography model pipeline is part of T387041
Related planning doc from ML team
Aug 15 2025
Weekly updates:
- a prototype for querying editor history is deployed on cloud vps for 1 month of data (August 2024)
- a design plan draft that describes the approach
- ongoing discussion for how and to whom to present this work in product management
Aug 4 2025
Weekly update for edit diffs:
- Notebook for exploring query patterns.
- Switched to using edit types instead of tokenizing the added/removed words using the mwtokenizer. The structured output of edit-types is much preferable to imposing structure on the aggregated raw tokens. Running edit-types at scale is computationally challenging due the memory hungriness of the parsing library (mwparserfromhell). The approach to have a daily dag that appends to an iceberg table seems to work, likely because the suspected memory leaks don't have "enough time" to explode the job, as each batch only computes one day of data. This will be helpful for T351225.
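A minimal sketch of that daily-append pattern, assuming a hypothetical compute_edit_types helper wrapping the edit-types library, an assumed source table/partition column, and an Iceberg output table; per-batch memory stays bounded because each run only parses one day of revisions.
```python
# Illustrative sketch of the daily batch described above. The source table,
# partition column, output table, and compute_edit_types helper are
# assumptions, not the actual DAG code.
from pyspark.sql import DataFrame, SparkSession

def compute_edit_types(revisions: DataFrame) -> DataFrame:
    # Hypothetical: would apply the edit-types / mwparserfromhell diffing per
    # revision, e.g. via a pandas UDF.
    raise NotImplementedError

def run_daily_batch(day: str) -> None:
    spark = SparkSession.builder.getOrCreate()
    revisions = (
        spark.table("wmf_content.mediawiki_content_history_v1")
        .where(f"to_date(revision_dt) = '{day}'")   # one day per run
    )
    edit_types = compute_edit_types(revisions)
    # Append only this day's results; earlier partitions stay untouched.
    edit_types.writeTo("fab.content_diff_edit_types_index").append()
```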
Jul 25 2025
If there are no changes required on the airflow dag itself, it is preferable to not have to do another review/deploy just to start using the new gitlab artifact. The pipeline python project is already versioned (and by consequence so are the artifacts in the gitlab package registry), so the extra versioning on the airflow-dags side is not needed from our side (only for the caching of the artifact on hdfs).
Jul 24 2025
Research does not bump versions for minor things, e.g. the version remains the same but gitlab built a new package. Does blunderbuss update the artifact cache if there is a new conda environment but no change to the yaml file? With scap deploy, if there are no airflow changes that would trigger an artifact sync anyway, you could use -f to force a sync.
Jul 21 2025
I would like to keep this folder until the end of Q2 in case there is a need to dig deeper into the recent projects Muniza was working on.
Jul 10 2025
That creates a substantial dataset as part of the base features dataset (the wikitext and parent wikitext for all revisions in these 10+ years), which could be around ~10TB of data. Let's first try with a beefier spark config, e.g. spark.sql.shuffle.partitions=4000, maxExecutors=129, executor-cores=4, executor-memory=24G. There are also timeout configs to play with, but that is not a fun place to be.
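For reference, one way to apply those settings when building the session; the values are taken from the suggestion above, the config keys are standard Spark settings, and whether these limits are allowed depends on the cluster.
```python
# Beefier Spark config per the suggestion above; adjust to cluster limits.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("base-features-wikitext")  # illustrative app name
    .config("spark.sql.shuffle.partitions", 4000)
    .config("spark.dynamicAllocation.maxExecutors", 129)
    .config("spark.executor.cores", 4)
    .config("spark.executor.memory", "24g")
    .getOrCreate()
)
```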
Jul 9 2025
Weekly updates
- pipeline to aggregate an editor-centric view of the content diff dataset. For every wiki_db and editor (user name, IP, or temp name), we aggregate information about each token the editor acted on (see the sketch after this list). The information includes the action (added/removed), the count of how many times that action happened, and the affected revision_ids and page titles.
- the mwtokenizer library is used to tokenize the diff
- dataset for 1 year of data, for the wikis "simplewiki", "arwiki", "dewiki", "enwiki", is available on hdfs at `/user/fab/content_diff_index/`
- evaluation of database options for prototype
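A hedged sketch of that editor-centric aggregation (see the first bullet above); the input table and column names are assumptions about the content diff dataset, not the actual pipeline code.
```python
# Illustrative only: aggregate, per (wiki_db, editor, token, action), how often
# the action happened and the affected revisions/pages. Input table and
# columns are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# assumed schema: wiki_db, editor, token, action ('added'/'removed'),
# revision_id, page_title
diff_tokens = spark.table("research.mediawiki_content_diff_tokens")

editor_index = (
    diff_tokens
    .groupBy("wiki_db", "editor", "token", "action")
    .agg(
        F.count("*").alias("action_count"),
        F.collect_set("revision_id").alias("revision_ids"),
        F.collect_set("page_title").alias("page_titles"),
    )
)

editor_index.write.mode("overwrite").parquet("/user/fab/content_diff_index/")
```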
Looking at the logs, the job seems to fail with timeouts and workers being removed from the pool - which often indicates that there are not enough resources available.
25/07/06 10:16:57 WARN TaskSetManager: Lost task 20.0 in stage 2.3 (TID 99049) (an-worker1117.eqiad.wmnet executor 281): FetchFailed(BlockManagerId(47, an-worker1174.eqiad.wmnet, 7337, None), shuffleId=1, mapIndex=399, mapId=8301, reduceId=391, message= org.apache.spark.shuffle.FetchFailedException: Connecting to an-worker1174.eqiad.wmnet/10.64.165.4:7337 failed in the last 9500 ms, fail this connection directly
Jul 8 2025
Closing this in favor of T396495, which will provide the scaffolding needed. Research can use this work as a starting point when there is a specific service to deploy on the DSE cluster.
Jul 4 2025
Jul 3 2025
Jul 2 2025
Jun 23 2025
This work was merged with MR
Jun 17 2025
Resolving this, the remaining T344625 is not relevant anymore.
Jun 12 2025
Jun 9 2025
The data directories can be removed.
Jun 2 2025
I see these wikis were above the threshold in these results. There might be something wrong with the embeddings or the snapshot I use. @fkaelin do you get similar results and which snapshot do you use? The results are not deterministic but similar.
These results should be deterministic. At first glance it could be that there are multiple test/evaluation files, and when loading the data the order in which they are read is not guaranteed. Note, as mentioned in our discussion, this evaluation code has not been migrated to a spark/pipeline oriented design. I will have a closer look.
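A small sketch of the suspected issue and fix, assuming the evaluation examples are spread over multiple files; the paths and format below are hypothetical. Sorting the file list before reading gives a stable order.
```python
# Hypothetical illustration: without sorting, glob/listing order can differ
# between runs, which makes downstream results look non-deterministic.
import json
from pathlib import Path

def load_eval_examples(data_dir: str) -> list:
    examples = []
    for path in sorted(Path(data_dir).glob("*.json")):  # stable, sorted order
        with open(path) as f:
            examples.extend(json.load(f))
    return examples
```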
May 29 2025
- kickoff meeting with ML team
- choice of model: language agnostic revert risk model
- output of work will be report / notebook
- code will be added to research-datasets
May 27 2025
Final asana report:
May 19 2025
Weekly updates
- Updates have been completed, all sub tasks are resolved
- This task will be resolved when wmf_content.mediawiki_content_current_v1 is in prod (T391279)
May 16 2025
- prod
- model on prod; one model per language.
- research-datasets
- supports single model for multiple languages. Not used on prod yet.
correct
- uses the research.article_topics table for embeddings rather than wikipedia2vec. However, this table seems to be empty. (I think this table was populated recently, we have data now. Do you know how to populate data in this table? Do we have any pipeline that we can trigger later?)
The table was not configured for the prod instance - the table is available now.
- Can we summarize the improvements in this repo compared to the prod one? It seems it's this list and maybe some more.
- airflow-dags
- uses research datasets with a language group config. Not used on prod yet.
- I've just triggered this pipeline. Curious about the evaluation scores. I hope it's not a problem and does not affect anything.
- research-mwaddlink-gitlab: Should we ignore this repo as it's in a user's personal space?
The linked meta page is a good summary by Martin. The described effort contains both changes to the modelling approach (joint models, link embeddings instead of w2v, ...) and changes to how the pipeline is executed (airflow migration with research engineering support). The latter was essentially required by the former (i.e. the pain points for retraining apply to the researchers too), but it also means that the two are connected. If (theoretically) you/the ML team had started this investigation without the model improvement work done by Aisha/Martin, migrating the current training pipeline to an airflow dag with the same modelling approach (i.e. the output of the model should be equivalent, no joint model etc) would have been a reasonable first step, and only then iterating on the modelling and serving components. In that sense, for the linked repos this means:
- Aisha's research-mwaddlink-gitlab implements the model improvement changes
- The research-datasets pipeline used by the airflow dag is based on Aisha's repo, but it substantially changes the logic to a spark-focused design.
Done with MR
Done with MR
Done with MR
May 12 2025
May 10 2025
Weekly updates
- Airflow dag updates were tested and deployed to production
- The affected pipelines were backfilled and are now up-to-date
- The remaining open tasks all depend on the content current dataset (wmf_content.mediawiki_content_current_v1). The pipelines have been tested with the wip dataset. When DE promotes this dataset to production, there will be an airflow sensor that the dags can await.
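As an illustration of the sensor idea, a sketch using the generic Hive partition sensor from the Airflow Hive provider; the actual airflow-dags setup may use a different custom sensor, so the task id, connection id, and partition spec here are assumptions.
```python
# Illustrative only: wait for a partition of the content current dataset
# before running downstream tasks. The sensor class exists in the Airflow
# Hive provider, but the conn id and partition spec are assumptions.
from airflow.providers.apache.hive.sensors.hive_partition import HivePartitionSensor

wait_for_content_current = HivePartitionSensor(
    task_id="wait_for_mediawiki_content_current_v1",
    table="wmf_content.mediawiki_content_current_v1",
    partition="snapshot='{{ ds }}'",      # assumed partition layout
    metastore_conn_id="metastore_default",
    poke_interval=60 * 60,                # check hourly
)
```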
@leila thank you for helping to coordinate.
May 6 2025
Apr 28 2025
Weekly updates
