Growth's "add image" suggestions rely on a DAG whose output is stored in a database. From there it is ingested by a weekly batch job that publishes Kafka events, which are in turn consumed by the Search Update Pipeline (SUP) and transformed into weighted tags.
This ticket is about replacing the temporary storage + batch job with publishing Kafka events directly from the DAG.
Options are:
- Using an HTTP client to publish events via EventGate (no need to handle schemas client-side)
- Writing a schema-aware Kafka client to publish events directly to Kafka (potentially better batching/compression performance; see the sketch below)
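A minimal PySpark sketch of the second option, assuming Spark's built-in Kafka sink (spark-sql-kafka) is available on the cluster; the source table, broker, topic, and event fields are placeholders, not the actual stream configuration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("image-suggestions-to-kafka").getOrCreate()

# Hypothetical source: the DAG output table holding the image suggestions.
suggestions = spark.read.table("analytics.image_suggestions")

# Serialize each suggestion as a JSON event; field names are placeholders,
# not the real event schema.
events = suggestions.select(
    F.col("page_id").cast("string").alias("key"),
    F.to_json(F.struct(
        F.col("wiki_id"),
        F.col("page_id"),
        F.col("tags"),
    )).alias("value"),
)

# Write directly to Kafka via Spark's built-in sink. Options prefixed with
# "kafka." are passed through to the underlying producer, which is where
# batching/compression tuning happens.
(events.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-jumbo1001.eqiad.wmnet:9092")  # placeholder brokers
    .option("kafka.compression.type", "snappy")
    .option("topic", "mediawiki.image_suggestions.v1")  # placeholder topic
    .save())
```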
Open questions:
- Are we OK to produce from Spark to kafka-main, or should we prefer kafka-jumbo for this kind of input?
- What's the best way to make a client usable in a job not owned by the Search team (usable in PySpark and Spark)? See T374341.
  - We went with an event-stream client wrapper around Kafka that can be used from PySpark and Spark.
- How do we enforce the necessary rate limiting to protect the SUP from being flooded?
  - @Cparle, would you know the rate of updates we'd have to expect?
    - 90k events should be fine. If we really have to slow things down on the client side, we could introduce a mapping step that sleeps; in combination with configuring Kafka to linger long enough before sending a batch, that would give us an 'effective' way of reducing the rate (see the first sketch after this list).
- How do we handle retries (of whole Spark stages) to avoid duplicates?
  - To avoid duplicates, consider splitting/partitioning datasets so that only the failing chunks need to be retried.
  - Separate the creation of event-stream-schema-compliant dataframes from writing/saving them (to Kafka) by caching, persisting, checkpointing, or creating two separate jobs (see the second sketch below).
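A sketch of the client-side throttling idea, reusing the `events` dataframe from the sketch above; the rate and the `kafka.linger.ms` value are illustrative, not tuned numbers:

```python
import time
from pyspark.sql import DataFrame

def throttle(df: DataFrame, max_rows_per_sec_per_partition: int) -> DataFrame:
    """Roughly cap the emit rate by sleeping inside a map step.

    Client-side approximation only: each partition sleeps independently, so the
    overall rate is roughly (number of partitions x max_rows_per_sec_per_partition).
    """
    delay = 1.0 / max_rows_per_sec_per_partition

    def slow_iter(rows):
        for row in rows:
            time.sleep(delay)
            yield row

    return df.rdd.mapPartitions(slow_iter).toDF(df.schema)

# Throttle to ~100 rows/sec per partition, then let the producer linger so
# batches still get reasonably large despite the reduced input rate.
throttled = throttle(events, max_rows_per_sec_per_partition=100)
(throttled.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-jumbo1001.eqiad.wmnet:9092")  # placeholder brokers
    .option("kafka.linger.ms", "500")
    .option("topic", "mediawiki.image_suggestions.v1")  # placeholder topic
    .save())
```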
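And a sketch of separating event construction from the Kafka write via checkpointing, so that a retried write stage does not recompute the events; `build_event_dataframe` and the checkpoint path are hypothetical:

```python
# Materializing the event dataframe before the write means a retried write
# stage re-reads the checkpointed data instead of recomputing it (a partial
# write can still emit some duplicates, so idempotent consumers still help).
spark.sparkContext.setCheckpointDir("hdfs:///tmp/image_suggestions_checkpoints")  # placeholder path

events = build_event_dataframe(suggestions)  # hypothetical helper producing schema-compliant rows
events = events.checkpoint(eager=True)       # or: events.persist(); events.count()

(events.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-jumbo1001.eqiad.wmnet:9092")  # placeholder brokers
    .option("topic", "mediawiki.image_suggestions.v1")  # placeholder topic
    .save())
```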