
Migrate image recommendation to use page_weighted_tags_changed stream
Closed, Resolved · Public

Description

Growth's "add image" suggestions rely on a DAG whose output is stored in a database. From there, the data is ingested by a weekly batch job that publishes kafka events, which are in turn consumed by the search update pipeline (SUP) and transformed into weighted tags.

This ticket is about replacing the temporary storage and batch job by publishing kafka events directly from the DAG.

Options are:

  • Using an HTTP client to publish events via EventGate (no need to worry about schemas); see the sketch after this list
  • Writing a schema-safe kafka client to publish events directly to kafka (potentially better batching/compression performance)
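
A minimal sketch of the first option, POSTing events to EventGate over HTTP. The endpoint URL, stream name, schema URI and event fields below are placeholders for illustration, not the values the DAG would actually use:

```python
# Sketch only: publish weighted-tag events to EventGate via HTTP.
# All URLs, stream/schema names and event fields are placeholders.
from datetime import datetime, timezone

import requests

EVENTGATE_URL = "https://eventgate.example.org/v1/events"  # placeholder endpoint
STREAM = "cirrussearch.page_weighted_tags_change"          # placeholder stream name


def build_event(wiki_id: str, page_id: int, tags: dict) -> dict:
    """Assemble one event; EventGate validates it against the schema server-side."""
    return {
        "$schema": "/cirrussearch/page_weighted_tags_change/1.0.0",  # placeholder schema URI
        "meta": {
            "stream": STREAM,
            "dt": datetime.now(timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        },
        "wiki_id": wiki_id,
        "page_id": page_id,
        "weighted_tags": tags,
    }


def publish(events: list[dict]) -> None:
    # Assumes the endpoint accepts a JSON array of events per request, so batching is straightforward.
    response = requests.post(EVENTGATE_URL, json=events, timeout=30)
    response.raise_for_status()
```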

Open questions:

  • Are we OK to produce from spark to kafka-main, or should we prefer kafka-jumbo for this kind of input?
    • There might be a preference for jumbo already (check with @Ottomata/@gmodena/serviceops).
  • What's the best way to make a client usable in a job not owned by the search team (usable from pyspark and spark)? See T374341.
    • We went with an event-stream client wrapper around kafka that can be used from pyspark and spark.
  • How do we enforce the necessary rate limiting to protect the SUP from being flooded?
    • @Cparle, would you know the rate of updates we'd have to expect?
    • 90k events should be fine. If we really have to slow things down on the client side, we could introduce a mapping step that sleeps; combined with configuring kafka to linger long enough before sending a batch, that would give us an effective way of reducing the rate.
NOTE: the update size can grow up to 8.6 M for commonswiki, see also T372912#10163750.
  • How do we handle retries (of whole stages in spark) to avoid duplicates?
    • To avoid duplicates, consider splitting/partitioning datasets so that only the failing chunks need to be retried.
    • Separate the creation of event-stream-schema-compliant dataframes from writing/saving them to kafka, by caching, persisting, checkpointing, or creating two separate jobs (see the sketch after this list).
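
To make the last point concrete, here is a rough PySpark sketch of separating event construction from the kafka write, so that a retried write stage does not recompute (and re-send) upstream work. It uses the plain Spark kafka sink rather than the event-utilities wrapper that was eventually chosen, and the table, broker, topic and path names are made up:

```python
# Sketch: build the events once, pin them down, then write in a separate step.
# Requires the spark-sql-kafka-0-10 package; all names below are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/weighted_tags_checkpoints")  # placeholder path

# 1. Build the event-stream-schema-compliant dataframe and truncate its lineage.
src = spark.read.table("analytics.image_suggestions")  # placeholder source table
events_df = (
    src.select(F.to_json(F.struct(*src.columns)).alias("value"))  # kafka sink expects a 'value' column
    .repartition(64)   # smaller partitions -> smaller units of retry
    .checkpoint()      # materialize; a retried write stage rereads this instead of recomputing
)

# 2. Write the pinned dataframe to kafka (could also live in a second, separate job).
(
    events_df.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-main-host.example:9092")  # placeholder brokers
    .option("topic", "eqiad.cirrussearch.page_weighted_tags_change")    # placeholder topic
    .save()
)
```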

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Search: allow longer execution delta when waiting for ALIS/SLIS | repos/data-engineering/airflow-dags!1644 | pfischer | T372912-weighted-tags-via-kafka-fix-wait | main
Allow per parameter default values that are treated none-ish | repos/search-platform/discolytics!54 | pfischer | refined-arg-parsing | main
Pass through arbitrary kafka properties. | repos/search-platform/discolytics!53 | pfischer | kafka-config | main
Added `publish_page_change_weighted_tags.py` | repos/search-platform/discolytics!52 | pfischer | T372912-weighted-tags-via-kafka | main

Event Timeline


Potential Search TODOs:

  • check the volumes produced over the last (three) weeks
  • derive a proper pacing (client-side rate limiting) for kafka writes via spark (via event utilities); see the pacing sketch after this list
  • implement spark-to-kafka writes
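
A rough sketch of what client-side pacing could look like, assuming the events dataframe already carries a string `value` column for the kafka sink; the broker/topic names, rates and batch sizes are invented for illustration:

```python
# Sketch: throttle spark -> kafka writes by sleeping between per-partition
# batches, and let the producer linger so records are still batched efficiently.
import time

from pyspark.sql import DataFrame


def write_paced(events_df: DataFrame, records_per_second: int = 500) -> None:
    """Write events_df to kafka at a crude, per-executor limited rate."""

    def paced(rows):
        batch, batch_size = [], 1000
        for row in rows:
            batch.append(row)
            if len(batch) >= batch_size:
                yield from batch
                batch = []
                time.sleep(batch_size / records_per_second)  # aggregate rate scales with executor count
        yield from batch

    paced_df = events_df.rdd.mapPartitions(paced).toDF(events_df.schema)
    (
        paced_df.write
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka-main-host.example:9092")  # placeholder brokers
        .option("topic", "eqiad.cirrussearch.page_weighted_tags_change")    # placeholder topic
        .option("kafka.linger.ms", "100")       # let the producer wait up to 100 ms to fill batches
        .option("kafka.batch.size", "1048576")  # 1 MiB producer batches
        .save()
    )
```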

Change #1165577 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[wikimedia-event-utilities@master] Provide jar-with-dependencies for eventutilities-spark.

https://gerrit.wikimedia.org/r/1165577

Change #1165577 merged by jenkins-bot:

[wikimedia-event-utilities@master] Provide jar-with-dependencies for eventutilities-spark.

https://gerrit.wikimedia.org/r/1165577

Change #1167297 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[wikimedia-event-utilities@master] Provide shaded uber-jar for eventutilities-spark.

https://gerrit.wikimedia.org/r/1167297

Change #1167297 merged by jenkins-bot:

[wikimedia-event-utilities@master] Provide shaded uber-jar for eventutilities-spark

https://gerrit.wikimedia.org/r/1167297

dcausse changed the task status from Stalled to In Progress.Aug 4 2025, 2:27 PM

Change #1176190 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: add kafka-main-{eqiad,codfw}-external to the common connections

https://gerrit.wikimedia.org/r/1176190

Change #1176190 merged by Brouberol:

[operations/deployment-charts@master] airflow: add kafka-main-{eqiad,codfw}-external to the common connections

https://gerrit.wikimedia.org/r/1176190

I somehow missed the bit about producing to kafka main from the Data Lake. Hm.

This needs to be done very carefully. I think it is okay, but it might be something we should vet / run by the SRE ServiceOps team?

@pfischer I can't recall, but does the spark writer have throttling support?

Also, I assume this is only going to write to the eqiad-prefixed topic in kafka main-eqiad?

Ah, I see your latest comment: T372912#10857120 talks about throttling too.

Change #1178889 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Add AirFlow connection configuration for kafka_test_eqiad_external

https://gerrit.wikimedia.org/r/1178889

Change #1178889 merged by Brouberol:

[operations/deployment-charts@master] Add AirFlow connection configuration for kafka_test_eqiad_external

https://gerrit.wikimedia.org/r/1178889

Change #1185053 had a related patch set uploaded (by DCausse; author: DCausse):

[wikimedia-event-utilities@master] Support more time formats

https://gerrit.wikimedia.org/r/1185053

Change #1185053 merged by jenkins-bot:

[wikimedia-event-utilities@master] Support more time formats

https://gerrit.wikimedia.org/r/1185053

The SUP producer is failing with:

Caused by: java.time.format.DateTimeParseException: Text '2025-09-05T07:51:31.430+0000' could not be parsed, unparsed text found at index 23
	at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2049)
	at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1874)
	at org.wikimedia.eventutilities.flink.formats.json.JsonRowDeserializationSchema.convertToLocalDateTime(JsonRowDeserializationSchema.java:547)
	at org.wikimedia.eventutilities.flink.formats.json.JsonRowDeserializationSchema.convertToInstant(JsonRowDeserializationSchema.java:574)
	at org.wikimedia.eventutilities.flink.formats.json.JsonRowDeserializationSchema.lambda$wrapIntoNullableConverter$520f64f0$1(JsonRowDeserializationSchema.java:326)
	at org.wikimedia.eventutilities.flink.formats.json.JsonRowDeserializationSchema.convertField(JsonRowDeserializationSchema.java:692)
	at org.wikimedia.eventutilities.flink.formats.json.JsonRowDeserializationSchema.lambda$null$4(JsonRowDeserializationSchema.java:619)
	at org.wikimedia.eventutilities.flink.formats.json.JsonRowDeserializationSchema.lambda$assembleRowConverter$b8bcd6df$1(JsonRowDeserializationSchema.java:658)
	at org.wikimedia.eventutilities.flink.formats.json.JsonRowDeserializationSchema.lambda$wrapIntoNullableConverter$520f64f0$1(JsonRowDeserializationSchema.java:326)
	at org.wikimedia.eventutilities.flink.formats.json.JsonRowDeserializationSchema.convert(JsonRowDeserializationSchema.java:207)
	at org.wikimedia.eventutilities.flink.formats.json.JsonRowDeserializationSchema.deserialize(JsonRowDeserializationSchema.java:187)

The event utilities patch should support this format.
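
For reference, the unparsed text at index 23 is the '+0000' offset. A tiny sketch of formatting the timestamp on the producing side as ISO-8601 UTC with a 'Z' suffix instead, assuming that is the form the deserializer already accepts:

```python
from datetime import datetime, timezone


def event_dt(ts: datetime) -> str:
    # 2025-09-05T07:51:31.430+0000  ->  2025-09-05T07:51:31.430Z
    return ts.astimezone(timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z")


print(event_dt(datetime(2025, 9, 5, 7, 51, 31, 430000, tzinfo=timezone.utc)))
# 2025-09-05T07:51:31.430Z
```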

Change #1185075 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] cirrus-streaming-updater: use new image

https://gerrit.wikimedia.org/r/1185075

Change #1185075 merged by jenkins-bot:

[operations/deployment-charts@master] cirrus-streaming-updater: use new image

https://gerrit.wikimedia.org/r/1185075

The flink job finally got restarted after numerous issues:

  • had to fix the parsing of timestamps coming from the image recommendation tags
  • hit by a nasty problem: after upgrading to flink 1.20 we forgot to update eventutilities to 1.4, which caused the RowSerializer to be used in the flink state and the job upgrade to fail. The reason is that the createSerializer function changed its signature, so running flink 1.20 without the new EventRowTypeInfo from eventutilities 1.4 made flink fall back to the parent class RowTypeInfo, which returns a RowSerializer. This is particularly fragile and we should probably stop extending RowTypeInfo.
  • hit by a flink-k8s-operator bug that prevents the job from restarting from a savepoint (T403838)

None of these are directly related to this task, but they were all initially triggered by the new timestamp format we saw.

I'm leaving this task open because I think there are still some cleanups we need to make to the DAGs.

pfischer updated the task description.
pfischer set Final Story Points to 21.

@dcausse, I created T404597 to continue work. Would moving the DAG be a priority or shall we start writing to kafka directly from ALIS/SLIS?

> Would moving the DAG be a priority or shall we start writing to kafka directly from ALIS/SLIS?

I don't know, but I think that at the very least we should remove the hardcoded {{ data_interval_start | start_of_current_week | ts }} from https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/search/shared/publish_page_change_weighted_tags.py#L34.
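
One possible direction, purely as an illustration (the actual helper lives in airflow-dags and its signature will differ; the names below are made up): let callers pass the templated timestamp instead of hardcoding the start-of-week expression in the shared module.

```python
# Hypothetical sketch only; function and parameter names are illustrative.
def publish_page_change_weighted_tags_task(dag, snapshot_ts: str = "{{ data_interval_start | ts }}"):
    """Build the task that publishes weighted-tag events for the given snapshot.

    Callers that need start-of-week semantics pass their own template, e.g.
    "{{ data_interval_start | start_of_current_week | ts }}", instead of the
    shared module hardcoding it for everyone.
    """
    # ... construct the Spark task here, passing snapshot_ts through to the job ...
```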