Growth's "add image" suggestions rely on a DAG whose output is stored in a database. From there it is ingested by a weekly batch job that publishes Kafka events, which are in turn consumed by the Search Update Pipeline (SUP) and transformed into weighted tags.
This ticket is about replacing the temporary storage + batch job with publishing Kafka events directly from the DAG.
Options are:
- Using an HTTP client to publish events via EventGate (no need to handle schemas client-side)
- Writing a schema-aware Kafka client to publish events directly to Kafka (potentially better batching/compression performance; see the sketch below)
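A minimal PySpark sketch of the second option, assuming Spark's built-in Kafka sink (spark-sql-kafka) is available on the cluster; the source table, broker, topic, and event fields are placeholders, not the actual stream configuration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("image-suggestions-to-kafka").getOrCreate()

# Hypothetical source: the DAG output table holding the image suggestions.
suggestions = spark.read.table("analytics.image_suggestions")

# Serialize each suggestion as a JSON event; field names are placeholders,
# not the real event schema.
events = suggestions.select(
    F.col("page_id").cast("string").alias("key"),
    F.to_json(F.struct(
        F.col("wiki_id"),
        F.col("page_id"),
        F.col("tags"),
    )).alias("value"),
)

# Write directly to Kafka via Spark's built-in sink. Options prefixed with
# "kafka." are passed through to the underlying producer, which is where
# batching/compression tuning happens.
(events.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-jumbo1001.eqiad.wmnet:9092")  # placeholder brokers
    .option("kafka.compression.type", "snappy")
    .option("topic", "mediawiki.image_suggestions.v1")  # placeholder topic
    .save())
```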
Open questions:
- Are we OK to produce from Spark to kafka-main, or should we prefer kafka-jumbo for this kind of input?
- What's the best way to make a client usable in a job not owned by the Search team (usable in PySpark and Spark)? See T374341.
  - We went with an event-stream client wrapper around Kafka that can be used from PySpark and Spark.
- How do we enforce the necessary rate limiting to protect the SUP from being flooded?
  - @Cparle, would you know the rate of updates we'd have to expect?
    - 90k events should be fine. If we really have to slow things down on the client side, we could introduce a mapping step that sleeps; in combination with configuring Kafka to linger long enough before sending a batch, that would give us an 'effective' way of reducing the rate (see the first sketch after this list).
- How do we handle retries (of whole Spark stages) to avoid duplicates?
  - To avoid duplicates, consider splitting/partitioning datasets so that only the failing chunks need to be retried.
  - Separate the creation of event-stream-schema-compliant dataframes from writing/saving them (to Kafka) by caching, persisting, checkpointing, or creating two separate jobs (see the second sketch below).
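A sketch of the client-side throttling idea, reusing the `events` dataframe from the sketch above; the rate and the `kafka.linger.ms` value are illustrative, not tuned numbers:

```python
import time
from pyspark.sql import DataFrame

def throttle(df: DataFrame, max_rows_per_sec_per_partition: int) -> DataFrame:
    """Roughly cap the emit rate by sleeping inside a map step.

    Client-side approximation only: each partition sleeps independently, so the
    overall rate is roughly (number of partitions x max_rows_per_sec_per_partition).
    """
    delay = 1.0 / max_rows_per_sec_per_partition

    def slow_iter(rows):
        for row in rows:
            time.sleep(delay)
            yield row

    return df.rdd.mapPartitions(slow_iter).toDF(df.schema)

# Throttle to ~100 rows/sec per partition, then let the producer linger so
# batches still get reasonably large despite the reduced input rate.
throttled = throttle(events, max_rows_per_sec_per_partition=100)
(throttled.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-jumbo1001.eqiad.wmnet:9092")  # placeholder brokers
    .option("kafka.linger.ms", "500")
    .option("topic", "mediawiki.image_suggestions.v1")  # placeholder topic
    .save())
```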
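And a sketch of separating event construction from the Kafka write via checkpointing, so that a retried write stage does not recompute the events; `build_event_dataframe` and the checkpoint path are hypothetical:

```python
# Materializing the event dataframe before the write means a retried write
# stage re-reads the checkpointed data instead of recomputing it (a partial
# write can still emit some duplicates, so idempotent consumers still help).
spark.sparkContext.setCheckpointDir("hdfs:///tmp/image_suggestions_checkpoints")  # placeholder path

events = build_event_dataframe(suggestions)  # hypothetical helper producing schema-compliant rows
events = events.checkpoint(eager=True)       # or: events.persist(); events.count()

(events.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-jumbo1001.eqiad.wmnet:9092")  # placeholder brokers
    .option("topic", "mediawiki.image_suggestions.v1")  # placeholder topic
    .save())
```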