
Repeated replication lag pages for db1206
Closed, Resolved. Public

Description

The ongoing enwiki dumps run appears to be causing high replication lag on db1206 and repeated pages.

I am going to set a downtime for the MariaDB Replica Lag: s1 service on this host, at least through this upcoming weekend. We may want to extend this further if this continues into the global holiday.

The risk of doing this, of course, is that replication lag becomes significantly worse and we are unaware. Ultimately, given that the affected clients are largely limited to dumps and those using the vslow group (so, effectively, non-interactive), this seems like a reasonable risk to take.

Event Timeline

I've downtimed the service through 8:00 UTC on Monday 12/23.

FYI @akosiaris and @MoritzMuehlenhoff as next business-hours rotation oncallers.

This is impacting bots:

{
  "error": {
    "code": "maxlag",
    "info": "Waiting for 10.64.16.89: 1922.889304 seconds lagged.",
    "host": "10.64.16.89",
    "lag": 1922.889304,
    "type": "db",
    "*": "See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."
  },
  "servedby": "mw-api-ext.eqiad.main-69ff9b76fd-44pvs"
}
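
For bot operators hitting this error, a minimal sketch of how a client might turn the response above into a wait time before retrying. The function name `maxlag_retry_delay` and the `floor` and `cap` values are assumptions for illustration, not an official policy or any particular bot framework's behaviour; the only fields it relies on are `error.code` and `error.lag` as they appear in the response above.

```python
def maxlag_retry_delay(payload, floor=5.0, cap=300.0):
    """Return seconds to wait if payload is a maxlag error, else None.

    floor and cap are hypothetical client-side limits: wait at least
    `floor` seconds, and never more than `cap` seconds per attempt,
    however large the reported lag is.
    """
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict) or error.get("code") != "maxlag":
        return None  # not a maxlag error; caller handles it normally
    try:
        lag = float(error.get("lag", floor))
    except (TypeError, ValueError):
        lag = floor  # lag field missing or malformed; use the minimum wait
    return max(floor, min(lag, cap))


# Trimmed copy of the response pasted above.
response = {
    "error": {
        "code": "maxlag",
        "host": "10.64.16.89",
        "lag": 1922.889304,
        "type": "db",
    }
}
delay = maxlag_retry_delay(response)
```

With a reported lag of roughly 32 minutes, the delay is clamped to the cap rather than sleeping for the full lag, so the bot re-checks periodically and resumes soon after the replica catches up.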

@AntiCompositeNumber - We can't do much at the moment, as this is a recurring issue with dumps. Even if we depool the host, the dumps processes will pick another random host in s1.

As I've said before, dumps should be disabled until they no longer cause db lag. Causing db lag for 6 months is unacceptable.

@JJMC89 I agree. Read the parent task - I've made a comment there a bit earlier today.

As per the parent task, I have interrupted the currently running enwiki dump and deferred the start of the dump that was scheduled for Jan 1st.
Replication lag for db1206 is now down to zero again, so I will resolve this ticket.


We will be working in the next quarter on moving the database traffic for dumps from the core db servers (db*) to the analytics replica servers (dbstore*), which will improve the situation regarding dumps affecting production traffic.
However, this is no small task.

BTullis claimed this task.
BTullis triaged this task as High priority.
BTullis moved this task from Triage to Done on the DBA board.