
Deployments/Emergencies

From Wikitech
If you're looking for help with an emergency situation, please first try to contact Release Engineering & SRE on libera.chat in #wikimedia-operations. If that fails, it may be appropriate to use Klaxon.

Emergency deployments happen when something needs fixing right now, at a time when deployments are not normally happening.

How to

🚨 Step-by-step – to do an emergency release you must:

  • Understand the purpose of an emergency deploy:
    • An emergency deploy is strictly meant to fix something that will be broken over the weekend, or something that needs to be fixed to prevent our sites from being broken over the weekend.
    • What constitutes an emergency deploy? Please see the Reasons for an emergency deploy section below, or ask us on IRC (see next).
  • Join #wikimedia-operations on libera.chat
  • Get positive confirmation from SRE before deployment by messaging the SREs listed as on call in the #wikimedia-operations topic, and inform Release Engineering that you need to deploy (see the template below)
  • Have someone able to deploy your change

Ways to find a deployer:

IRC message template

I need an emergency deploy for https://gerrit.wikimedia.org/r/1234 -- context is T1234, are SRE ok with a deployment? (cc: thcipriani [INSERT WEEKLY TRAIN CONDUCTOR NAME] [INSERT SRE ON CALL #wikimedia-operations NAME(s)]). I (already have|need) someone to deploy.
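
If you send this message often, the template above can be filled in programmatically. The following is a minimal sketch, not an official tool: the function name `emergency_deploy_message` and its parameters are hypothetical, and the nicknames in the usage example are placeholders, not real people. It only builds the text; you still post it in the channel yourself.

```python
def emergency_deploy_message(change_url, task, train_conductor, sre_on_call, have_deployer):
    """Fill in the IRC message template for an emergency deploy request.

    All parameter names are illustrative assumptions:
      change_url      -- Gerrit change URL
      task            -- Phabricator task ID giving context, e.g. "T1234"
      train_conductor -- IRC nick of this week's train conductor
      sre_on_call     -- list of IRC nicks from the channel topic
      have_deployer   -- True if someone is already lined up to deploy
    """
    # The template always cc's thcipriani, then the conductor, then on-call SREs.
    cc = " ".join(["thcipriani", train_conductor, *sre_on_call])
    deployer = "already have" if have_deployer else "need"
    return (
        f"I need an emergency deploy for {change_url} -- context is {task}, "
        f"are SRE ok with a deployment? (cc: {cc}). "
        f"I {deployer} someone to deploy."
    )


if __name__ == "__main__":
    # Placeholder nicks; replace with the names from the channel topic.
    print(emergency_deploy_message(
        "https://gerrit.wikimedia.org/r/1234", "T1234",
        "conductor_nick", ["sre_nick1", "sre_nick2"], have_deployer=False,
    ))
```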

Reasons for an emergency deploy

  • Address security issues
    For example, a misconfiguration once meant that a private wiki and all of its content were accidentally made public.
  • Avoid data loss / corruption
    For example, a coding error meant that newly-painted pages were being cached in a corrupted form; the longer it went, the more of the site was wrong.
  • Maintain availability
    For example, a new feature proved much more popular than planned and the extra load it was causing was threatening to take down the site, so it was temporarily disabled over a holiday, until people were back at work.
  • Prevent abuse
    For example, a massive content scraping run from a search engine wasn't responding to automated HTTP 429 speed bumps and so had to be manually blocked until they could adjust their code.
  • Major loss of functionality / appearance
    For example, a code efficiency change broke the visual appearance and usability of parts of the sites for a large number of logged-out users, and so the change was reverted out of production until it could be fixed.

For deployers

  • Rollback first, fix later; maintaining an overall service to our users is the most important focus.
  • Prioritise general availability over that of new features; we have a billion readers and only a few users of your new tool, no matter how cool.
  • Make on-wiki edits rarely, and only when you really have to; each wiki's editing community expects autonomy.