
Monitoring/Alerting for Wikipedia mobile app errors due to timeouts in wikifeeds
Open, Medium, Public

Description

We had an incident in which the Wikipedia mobile app homepage did not load because of timeouts in wikifeeds [1]. This was not alerted on properly, so SREs did not know that the mobile app was affected until it was manually verified.
I'll "start" this task on the mobile app side (the "symptom" side), although the problem was in the services used by the mobile app.

[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20200714-termbox_and_wikifeeds_timeouts

Event Timeline

Hey @JMeybohm, looking to get a little more information on what happened and what sort of mitigation you have in mind. Is the ultimate goal to monitor conditions on the mobile services backend, the app, or both? I assume by "starting this task on the mobile apps side", you mean detecting if the mobile app homepage fails to load (whether or not the server returns a timeout status or whatever)? Is there an existing strategy for logging runtime errors from the mobile apps?

I don't have anything particular in mind. I think the problem we should try to solve is that nobody was aware that the mobile app had issues during the outage. AIUI that led to the assumption that the very low number (in terms of rps) of timed-out requests was not critical.
As I don't really know the architecture and the services involved in the mobile apps' content, I thought it would be good to start the discussion here. The outcome could very well be that wikifeeds/termbox need more or better alerting (or just alerts that page) because the mobile app critically relies on them (that's what I meant by "starting on the mobile apps side"). Sorry for not being clearer about this.
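To make the point about low absolute rps concrete, here is a minimal sketch of judging a service by the share of its requests that time out rather than by the raw count. Everything in it is hypothetical: the names (`WindowStats`, `should_page`), the 50% threshold, and the minimum request count are illustrative assumptions, not existing alerting configuration for wikifeeds or termbox.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Request counts for one service over one evaluation window (hypothetical shape)."""
    total_requests: int
    timed_out_requests: int


def timeout_ratio(stats: WindowStats) -> float:
    """Share of requests in the window that timed out; 0.0 for an empty window."""
    if stats.total_requests == 0:
        return 0.0
    return stats.timed_out_requests / stats.total_requests


def should_page(stats: WindowStats,
                ratio_threshold: float = 0.5,   # assumed threshold, tune per service
                min_requests: int = 20) -> bool:  # assumed floor against noisy windows
    """Page when a large share of a service's requests time out,
    even if the absolute number of failing requests is small."""
    if stats.total_requests < min_requests:
        return False
    return timeout_ratio(stats) >= ratio_threshold
```

With these assumed defaults, 90 timeouts out of 100 requests to a low-traffic service would page, while the same 90 timeouts out of 100,000 requests would not.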

Adding #Product-Infrastructure-Team-Backlog because the Better Use Of Data/Product-Data-Infrastructure project tags were archived; this way the open task has an active project tag and can still be found.

MSantos renamed this task from Monitoring/Alerting for Wikipedia mobile app errors to Monitoring/Alerting for Wikipedia mobile app errors due to timeouts in wikifeeds. Sep 10 2021, 2:09 PM
MSantos removed a project: Mobile-Content-Service.

A couple of notes around this ticket:

  • The endpoint monitoring tests (service-checker) couldn't, and still can't, spot this kind of issue, which affected internal connections specifically. AFAIK service-checker uses the swagger spec to run its tests without passing through restbase. cc/ @Pchelolo and @Clarakosi to correct me in case I'm wrong.
  • mobileapps (Page Content Service) does not rely on wikifeeds. The apps (Android / iOS) on the other hand query wikifeeds directly, but they go through restbase.

Maybe it's an opportunity to restore these tests in restbase? https://github.com/wikimedia/restbase/blame/ecef17bda6f4efc0d6e187fb05b1eeb389bf7120/v1/feed.yaml#L52
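For illustration, a check that exercises the same path the apps use (through restbase rather than against the service directly) could be sketched as below. This is not how service-checker or the removed restbase tests work. The URL template is my assumption about the public featured-feed path, and the User-Agent string, the 5 second timeout, and the function names are all hypothetical.

```python
import time
import urllib.error
import urllib.request
from datetime import date
from typing import Callable, Tuple

# Assumption: public REST path for the featured feed, served through restbase.
FEED_URL_TEMPLATE = "https://en.wikipedia.org/api/rest_v1/feed/featured/{day}"
# Hypothetical User-Agent; replace with something that identifies the real check.
USER_AGENT = "feed-probe-sketch/0.1 (example only)"

Fetch = Callable[[str, float], Tuple[int, float]]


def http_fetch(url: str, timeout_s: float) -> Tuple[int, float]:
    """Return (HTTP status, elapsed seconds); status 0 means timeout or connection failure."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error status
    except OSError:
        status = 0  # covers URLError and socket timeouts
    return status, time.monotonic() - start


def check_feed(day: date, fetch: Fetch = http_fetch,
               timeout_s: float = 5.0) -> Tuple[bool, str]:
    """Probe the featured feed for one day and report (ok, human-readable detail)."""
    url = FEED_URL_TEMPLATE.format(day=day.strftime("%Y/%m/%d"))
    status, elapsed = fetch(url, timeout_s)
    if status == 0:
        return False, f"no response from {url} within {timeout_s}s"
    if status != 200:
        return False, f"HTTP {status} from {url}"
    return True, f"HTTP 200 in {elapsed:.2f}s"
```

Calling `check_feed(date.today())` from a periodic job and alerting on a failed result would catch the symptom the apps see, regardless of which backend behind restbase is timing out. The `fetch` parameter is injectable so the logic can be tested without network access.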

Removing inactive assignee from this open task. (Please update assignees on open tasks after offboarding. Thanks.)