
Scott_French (Scott French)
User

User Details

User Since
Jan 18 2024, 5:33 PM (110 w, 5 d)
Availability
Available
LDAP User
Scott French
MediaWiki User
SFrench-WMF

Recent Activity

Fri, Feb 27

Scott_French added a comment to T416756: Release OpenTelemetry integration for service-utils.

service-utils v2.0.0 was released earlier today, so I've updated the example instrumentation script on Wikitech to reflect the new @wikimedia/service-utils/otel subpath export.

Fri, Feb 27, 1:47 AM · Patch-For-Review, service-utils, ServiceOps-SharedInfra, ServiceOps new
Scott_French moved T395682: Enable WikimediaDebug support for noc.wikimedia.org from Needs Info / Blocked to Backlog on the ServiceOps new board.

Thanks, @Krinkle - That helps clarify. For noc.wikimedia.org, presence of the x-wikimedia-debug header affects caching behavior and response header cleanup, but has no role in backend selection at ATS. Instead, requests always route to mw-misc, which also happens to be updated in the testservers stage, so the "feedback loop" for testing merged code is sort of equivalent.

Fri, Feb 27, 12:05 AM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new, WikimediaDebug, MediaWiki-Platform-Team (Radar), noc.wikimedia.org

Thu, Feb 26

Scott_French placed T397075: Package Wikimedia's PHP 8.3 component for bookworm up for grabs.
Thu, Feb 26, 10:23 PM · ServiceOps new, ServiceOps-Mediawiki
Scott_French moved T397075: Package Wikimedia's PHP 8.3 component for bookworm from Needs Info / Blocked to Backlog on the ServiceOps new board.
Thu, Feb 26, 10:23 PM · ServiceOps new, ServiceOps-Mediawiki
Scott_French renamed T397075: Package Wikimedia's PHP 8.3 component for bookworm from Package Wikimedia's PHP 8.1 component for bookworm to Package Wikimedia's PHP 8.3 component for bookworm.
Thu, Feb 26, 10:22 PM · ServiceOps new, ServiceOps-Mediawiki
Scott_French added a comment to T397075: Package Wikimedia's PHP 8.3 component for bookworm.

Thank you very much for doing so, @Jdforrester-WMF. Great, I'll refocus this on the upcoming work to migrate to 8.3-on-bookworm.

Thu, Feb 26, 10:20 PM · ServiceOps new, ServiceOps-Mediawiki
Scott_French added a comment to T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis.

Now that I think about it, interacting with envoy via UDS is going to be a bit of a pain without pulling in additional dependencies that we probably don't want in practice.

Thu, Feb 26, 8:27 PM · Patch-For-Review, ServiceOps new, MW-on-K8s, MediaWiki-Recent-changes
Scott_French added a comment to T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis.

The drain-envoy.sh hook script assumes the envoy admin interface is available via a TCP listener, but in our production envoy config, it's only available via UDS (at /var/run/envoy/admin.sock). The script needs to be updated to support that.

Thu, Feb 26, 8:05 PM · Patch-For-Review, ServiceOps new, MW-on-K8s, MediaWiki-Recent-changes
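
For illustration, the envoy admin interface can be reached over the UDS mentioned above using only the Python standard library. This is a minimal sketch, not the actual drain-envoy.sh change: UnixHTTPConnection, drain_path, and drain_listeners are hypothetical names, and using the graceful and inboundonly flags on envoy's /drain_listeners admin endpoint is an assumption about what the hook would want. Whether Python is available in the relevant container is not stated here.

```python
import http.client
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=5.0):
        # The host is only used for the Host header; no TCP connection is made.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def drain_path(graceful=True, inbound_only=True):
    """Build the admin request path (flag choice is an assumption)."""
    flags = []
    if graceful:
        flags.append("graceful")
    if inbound_only:
        flags.append("inboundonly")
    return "/drain_listeners" + ("?" + "&".join(flags) if flags else "")


def drain_listeners(socket_path="/var/run/envoy/admin.sock"):
    """POST the drain request over the UDS; returns (status, body)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("POST", drain_path())
        resp = conn.getresponse()
        return resp.status, resp.read().decode()
    finally:
        conn.close()
```

Calling drain_listeners() only triggers the drain; any sleep to let in-flight requests finish would still be the caller's responsibility.
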
Scott_French added a comment to T397075: Package Wikimedia's PHP 8.3 component for bookworm.

@Jdforrester-WMF - Thanks for clarifying. I've reverted the task title and description to reflect that.

Thu, Feb 26, 6:22 PM · ServiceOps new, ServiceOps-Mediawiki
Scott_French renamed T397075: Package Wikimedia's PHP 8.3 component for bookworm from Package Wikimedia's PHP 8.3 component for bookworm to Package Wikimedia's PHP 8.1 component for bookworm.
Thu, Feb 26, 6:06 PM · ServiceOps new, ServiceOps-Mediawiki
Scott_French renamed T397075: Package Wikimedia's PHP 8.3 component for bookworm from Package Wikimedia's PHP 8.1 component for bookworm to Package Wikimedia's PHP 8.3 component for bookworm.
Thu, Feb 26, 5:57 PM · ServiceOps new, ServiceOps-Mediawiki
Scott_French moved T354853: Service mesh envoy does not treat incoming connections as local from Inbox to Backlog on the ServiceOps new board.

From a quick spot check in puppet and deployment-charts, it doesn't look like we've subsequently done this widely, though there are limited use cases in which use_remote_address: true is set.

Thu, Feb 26, 5:51 PM · ServiceOps-SharedInfra, ServiceOps new
Scott_French triaged T354853: Service mesh envoy does not treat incoming connections as local as Medium priority.
Thu, Feb 26, 5:43 PM · ServiceOps-SharedInfra, ServiceOps new
Scott_French moved T395682: Enable WikimediaDebug support for noc.wikimedia.org from Inbox to Needs Info / Blocked on the ServiceOps new board.

So, the way this currently works is that mw-misc (home of noc.wikimedia.org) is updated during the testservers deployment stage, just like mw-debug. Meaning, hitting noc during that stage should observe the new changes immediately.

Thu, Feb 26, 5:36 PM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new, WikimediaDebug, MediaWiki-Platform-Team (Radar), noc.wikimedia.org
Scott_French triaged T395682: Enable WikimediaDebug support for noc.wikimedia.org as Low priority.
Thu, Feb 26, 5:27 PM · MW-on-K8s, ServiceOps-Mediawiki, ServiceOps new, WikimediaDebug, MediaWiki-Platform-Team (Radar), noc.wikimedia.org
Scott_French moved T395870: Remove docker related parts from kubernetes puppet code from Inbox to Backlog on the ServiceOps new board.

@JMeybohm - Do you happen to have an estimate of how much effort this represents, and whether it might be something we can (or must) do soon? Thanks!

Thu, Feb 26, 5:16 PM · ServiceOps new, Prod-Kubernetes, Kubernetes
Scott_French triaged T395870: Remove docker related parts from kubernetes puppet code as Medium priority.
Thu, Feb 26, 5:15 PM · ServiceOps new, Prod-Kubernetes, Kubernetes
Scott_French claimed T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis.

The work planned for yesterday was preempted, but should be able to move forward again today.

Thu, Feb 26, 4:59 PM · Patch-For-Review, ServiceOps new, MW-on-K8s, MediaWiki-Recent-changes

Wed, Feb 25

Scott_French added a comment to T418262: deploy2003 implementation tracking.

[...]
IIRC, another blocker for getting rid of PHP on the deployment host is T377497: Functional replacement for importImages.php on Kubernetes

Wed, Feb 25, 3:21 PM · ServiceOps new (Next quarter), ServiceOps-Upgrades-Hardware
Scott_French added a comment to T416756: Release OpenTelemetry integration for service-utils.

Many thanks to @CDanis for pointing me to the explanation for:

Wed, Feb 25, 3:08 AM · Patch-For-Review, service-utils, ServiceOps-SharedInfra, ServiceOps new
Scott_French added a comment to T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis.

With the changes to the drain-hook script now live, I'll move forward with the deployment-charts changes during tomorrow's UTC-late infra window.

Wed, Feb 25, 12:23 AM · Patch-For-Review, ServiceOps new, MW-on-K8s, MediaWiki-Recent-changes

Tue, Feb 24

Scott_French updated subscribers of T418262: deploy2003 implementation tracking.

Current state

Tue, Feb 24, 5:58 PM · ServiceOps new (Next quarter), ServiceOps-Upgrades-Hardware

Mon, Feb 23

Scott_French added a comment to T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis.

We've decided to move ahead with experimental changes to (re)introduce envoy drain in the MediaWiki deployments, before moving ahead with the sidecar container changes. While that adds confusion around what exactly the drain hook is responsible for (i.e., it now needs to both trigger the drain and sleep; details in https://gerrit.wikimedia.org/r/1242462), IMO that's okay for now.

Mon, Feb 23, 11:55 PM · Patch-For-Review, ServiceOps new, MW-on-K8s, MediaWiki-Recent-changes

Sat, Feb 21

Scott_French added a comment to T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis.

I don't understand what the specific proposal is regarding sidecar pods. Is the idea to kill envoy before you kill the main app? That would still not be all that graceful since it would prevent the app from returning responses.

Sat, Feb 21, 1:22 AM · Patch-For-Review, ServiceOps new, MW-on-K8s, MediaWiki-Recent-changes

Fri, Feb 20

Scott_French updated the task description for T376519: Steady-state sizing of mw-web and mw-api-ext.
Fri, Feb 20, 3:23 PM · ServiceOps new, Datacenter-Switchover
Scott_French closed T376519: Steady-state sizing of mw-web and mw-api-ext as Resolved.

Thanks, @MLechvien-WMF - Indeed, the docs are updated, and it has now been a week with no voiced objections to closing this out, so let's do so :)

Fri, Feb 20, 3:18 PM · ServiceOps new, Datacenter-Switchover
Scott_French edited projects for T364245: Recentchanges and cu_changes tables are occasionally missing revisions on multiple wikis, added: ServiceOps new; removed serviceops-deprecated.

The example in T364245#11634977 indeed looks like a request arriving at a pod after it has started terminating at 00:59:11 (as part of this scap deployment) - i.e., during the ~ 8s window where we wait for in-flight requests to finish (glossing over the details of how that's currently implemented). Now that we're on a version of k8s that directly supports the concept of sidecar containers, we're looking at how that can be used to improve container shutdown order: T417800: Implement proper sidecar container support in mediawiki pods.

Fri, Feb 20, 2:46 AM · Patch-For-Review, ServiceOps new, MW-on-K8s, MediaWiki-Recent-changes

Thu, Feb 19

Scott_French added a comment to T416756: Release OpenTelemetry integration for service-utils.

I've made a number of updates to:

over the last two days, which cover many of the points discussed in this task in at least some detail.

Thu, Feb 19, 2:41 AM · Patch-For-Review, service-utils, ServiceOps-SharedInfra, ServiceOps new

Tue, Feb 17

Scott_French added a comment to T417704: [FY25-26 WE6.1.4] Establish Pretrain production design for MVP.

Thanks, @bd808 - Nah, this is good, and basically identical to what I would have opened soon anyway :)

Tue, Feb 17, 10:27 PM · ServiceOps new, Goal, Release-Engineering-Team (Doing 😎)
Scott_French moved T417704: [FY25-26 WE6.1.4] Establish Pretrain production design for MVP from Inbox to Scheduled (this Q) on the ServiceOps new board.
Tue, Feb 17, 10:25 PM · ServiceOps new, Goal, Release-Engineering-Team (Doing 😎)
Scott_French added a comment to T386246: Migrate parsoidtest functionality to kubernetes.

Thank you both!

Tue, Feb 17, 4:45 PM · ServiceOps-Services-Oids, ServiceOps new, Content-Transform-Team, OKR-Work
Scott_French added a comment to T359375: make better use of spicerack's service_catalog().

So I'm now thinking that instead of an exclude_from_switchover boolean, we maybe should have gone for a switchover_type: {none,day1,day2} type enum ?

Tue, Feb 17, 4:03 PM · User-jijiki, Serviceops-easywins, ServiceOps new, Datacenter-Switchover
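
The enum idea floated above could look roughly like the following. This is a sketch only: SwitchoverType and from_legacy are hypothetical names, the meaning of day1 and day2 is not defined in this comment, and defaulting non-excluded services to day1 is an assumption made purely so the legacy boolean has something to map onto.

```python
from enum import Enum


class SwitchoverType(Enum):
    """Three-way replacement for the exclude_from_switchover boolean."""

    NONE = "none"
    DAY1 = "day1"
    DAY2 = "day2"


def from_legacy(exclude_from_switchover, default=SwitchoverType.DAY1):
    """Map the existing boolean onto the enum (the default is an assumption)."""
    return SwitchoverType.NONE if exclude_from_switchover else default
```

The appeal of the enum is that "excluded" becomes one value among several, rather than a flag that needs a second field to say when an included service is switched.
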

Fri, Feb 13

Scott_French added a comment to T416756: Release OpenTelemetry integration for service-utils.

Sampling decisions

Fri, Feb 13, 9:43 PM · Patch-For-Review, service-utils, ServiceOps-SharedInfra, ServiceOps new
Scott_French added a comment to T386246: Migrate parsoidtest functionality to kubernetes.

A couple of thoughts while reviewing yesterday's patch series, that don't really fit on one specific patch:

Fri, Feb 13, 3:40 PM · ServiceOps-Services-Oids, ServiceOps new, Content-Transform-Team, OKR-Work

Thu, Feb 12

Scott_French added a comment to T416756: Release OpenTelemetry integration for service-utils.

From some additional digging into auto-instrumentation today, there's at least one additional point we should highlight in our documentation:

Thu, Feb 12, 9:10 PM · Patch-For-Review, service-utils, ServiceOps-SharedInfra, ServiceOps new
Scott_French added a comment to T376519: Steady-state sizing of mw-web and mw-api-ext.

I've updated the relevant portion of the preparation section (now Verify MediaWiki serving capacity). Edits for clarity welcome :)

Thu, Feb 12, 6:45 PM · ServiceOps new, Datacenter-Switchover
Scott_French updated subscribers of T376519: Steady-state sizing of mw-web and mw-api-ext.

@jijiki - So, the script in serviceops-kitchensink made the analysis repeatable / less-manual, and agreed that we should augment the Check capacity in the destination datacentre step of the switchover documentation to encourage its use - I can do that later today. That would close this out for the purposes of the Datacenter-Switchover.

Thu, Feb 12, 4:49 PM · ServiceOps new, Datacenter-Switchover
Scott_French added a comment to T359375: make better use of spicerack's service_catalog().

One critical difference between exclude_from_switchover (which supersedes EXCLUDED_SERVICES) and MEDIAWIKI_SERVICES / MEDIAWIKI_RO_SERVICES is that the former applies at the service-catalog service level, while the latter applies at the DNS discovery service level. The relationship between these concepts is 1:N (where N >= 0 in general, but for anything we care about in the switchover, it'll be N >= 1).

Thu, Feb 12, 3:55 PM · User-jijiki, Serviceops-easywins, ServiceOps new, Datacenter-Switchover
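
To make the 1:N relationship concrete, here is a small sketch of how an exclusion at the service-catalog level fans out to DNS discovery records. CatalogService, discovery_records_to_switch, and the service names are all hypothetical; the real spicerack service_catalog() objects are not shown in this comment and may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogService:
    """One service-catalog entry owning N >= 0 DNS discovery records."""

    name: str
    exclude_from_switchover: bool = False
    discovery: list = field(default_factory=list)


def discovery_records_to_switch(services):
    """Flatten to discovery records, skipping excluded catalog services."""
    return [
        record
        for svc in services
        if not svc.exclude_from_switchover
        for record in svc.discovery
    ]


catalog = [
    CatalogService("svc-a", discovery=["svc-a", "svc-a-ro"]),  # N = 2
    CatalogService("svc-b", exclude_from_switchover=True, discovery=["svc-b"]),
    CatalogService("svc-c"),  # N = 0: nothing to switch
]
```

With this catalog, excluding svc-b removes every discovery record it owns, while a list keyed on discovery names would have to enumerate each record individually.
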
Scott_French updated the task description for T416756: Release OpenTelemetry integration for service-utils.
Thu, Feb 12, 1:43 AM · Patch-For-Review, service-utils, ServiceOps-SharedInfra, ServiceOps new

Wed, Feb 11

Scott_French added a project to T416757: Develop a service-mesh HTTP client for service-utils: service-utils.
Wed, Feb 11, 10:41 PM · service-utils, ServiceOps-SharedInfra, ServiceOps new
Scott_French added a project to T416756: Release OpenTelemetry integration for service-utils: service-utils.

Expanding a bit on my comment in T416756#11599776 about error-prone SDK init, I noticed something curious today:

Wed, Feb 11, 10:40 PM · Patch-For-Review, service-utils, ServiceOps-SharedInfra, ServiceOps new
Scott_French added a comment to T350565: Switch conftool to use the version 3 etcd datastore.

No objections moving this to backlog, as it's not currently prioritized for either Q3 or Q4.

Wed, Feb 11, 5:28 PM · ServiceOps-SharedInfra, ServiceOps new, Patch-For-Review, conftool, Data-Persistence, Traffic
Scott_French added a comment to T356296: confd setup left without configuration doesn't stop confd.

Thank you both for beating me to assembling a similar command myself, heh.

Wed, Feb 11, 12:16 AM · ServiceOps new, Infrastructure-Foundations, SRE

Tue, Feb 10

Scott_French moved T356296: confd setup left without configuration doesn't stop confd from Backlog to Radar on the ServiceOps new board.
Tue, Feb 10, 10:00 PM · ServiceOps new, Infrastructure-Foundations, SRE
Scott_French moved T356296: confd setup left without configuration doesn't stop confd from Radar to Backlog on the ServiceOps new board.
Tue, Feb 10, 10:00 PM · ServiceOps new, Infrastructure-Foundations, SRE
Scott_French moved T356296: confd setup left without configuration doesn't stop confd from Inbox to Radar on the ServiceOps new board.
Tue, Feb 10, 9:58 PM · ServiceOps new, Infrastructure-Foundations, SRE
Scott_French edited projects for T356296: confd setup left without configuration doesn't stop confd, added: ServiceOps new; removed serviceops-deprecated.

Indeed, there's currently no mechanism I'm aware of to automatically absent a confd instance from a host when there are no longer templates configured.

Tue, Feb 10, 9:57 PM · ServiceOps new, Infrastructure-Foundations, SRE
Scott_French moved T355237: Update cache.mrouter modules in deployment-charts from Inbox to Needs Info / Blocked on the ServiceOps new board.
Tue, Feb 10, 9:33 PM · Prod-Kubernetes, ServiceOps-Datastores, ServiceOps new
Scott_French assigned T355237: Update cache.mrouter modules in deployment-charts to jijiki.

@jijiki - Is there anything here that still needs to be done (e.g., around cleanup of duplicate definitions of cache.mcrouter.deployment) or is this finished? If work is still required, it would be greatly appreciated if you could update the priority and timeline.

Tue, Feb 10, 9:33 PM · Prod-Kubernetes, ServiceOps-Datastores, ServiceOps new
Scott_French edited projects for T358936: Kubernetes apiserver probe failures on restart, added: ServiceOps new; removed serviceops-deprecated.
Tue, Feb 10, 9:25 PM · ServiceOps new, Prod-Kubernetes, SRE
Scott_French moved T359423: Migrate charts to Calico Network Policies from Inbox to Backlog on the ServiceOps new board.
Tue, Feb 10, 9:19 PM · ServiceOps new, Data-Platform-SRE, Prod-Kubernetes, Kubernetes
Scott_French edited projects for T359423: Migrate charts to Calico Network Policies, added: ServiceOps new; removed serviceops-deprecated.

Checking in, it looks like we still have a couple of services on Wikikube that require migration, most notably MediaWiki (or, IIUC, cleanup of the older duplicate network policies, as in the tegola-vector-tiles case).

Tue, Feb 10, 9:19 PM · ServiceOps new, Data-Platform-SRE, Prod-Kubernetes, Kubernetes
Scott_French updated the task description for T359423: Migrate charts to Calico Network Policies.
Tue, Feb 10, 9:13 PM · ServiceOps new, Data-Platform-SRE, Prod-Kubernetes, Kubernetes
Scott_French moved T361724: scap should check if it is running within a tmux/screen from Inbox to Radar on the ServiceOps new board.
Tue, Feb 10, 8:28 PM · ServiceOps new, Release-Engineering-Team (Priority Backlog 📥), Sustainability (Incident Followup), Scap
Scott_French edited projects for T361724: scap should check if it is running within a tmux/screen, added: ServiceOps new; removed serviceops-deprecated.

Just to confirm, the terminal-multiplexer check feature works as expected, and all that remains is docs updates and comms to ops@lists before it's reenabled, correct? (i.e., the description remains accurate)

Tue, Feb 10, 8:27 PM · ServiceOps new, Release-Engineering-Team (Priority Backlog 📥), Sustainability (Incident Followup), Scap
Scott_French closed T416932: Occasional pymysql.err.OperationalError "MySQL server has gone away" on first load as Resolved.

This should have been fixed (going the pre-ping route) earlier today in https://gitlab.wikimedia.org/repos/sre/hiddenparma/-/merge_requests/142. Thanks, @Joe!

Tue, Feb 10, 3:32 PM · Hiddenparma
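
As a general illustration of the pre-ping approach mentioned above: before handing out a cached connection, run a trivial query and reconnect if it fails. This sketch is not the code from the linked merge request, which is not shown here. PrePingPool is a hypothetical name, and sqlite3 stands in for MySQL only so that the example runs with the standard library.

```python
import sqlite3


class PrePingPool:
    """Single-connection holder that validates the connection on checkout."""

    def __init__(self, connect):
        self._connect = connect  # zero-argument factory returning a connection
        self._conn = None

    def checkout(self):
        if self._conn is not None:
            try:
                # The "pre-ping": a cheap query that fails on a dead connection.
                self._conn.execute("SELECT 1")
            except sqlite3.Error:
                self._conn = None  # discard the stale connection
        if self._conn is None:
            self._conn = self._connect()
        return self._conn
```

The cost is one extra round trip per checkout, in exchange for not surfacing a stale-connection error on the first real query.
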
Scott_French moved T376516: Develop and validate a model of thumbor capacity to enable single-DC serving from Needs Info / Blocked to Backlog on the ServiceOps new board.
Tue, Feb 10, 3:27 PM · ServiceOps-Services-Oids, ServiceOps new, Datacenter-Switchover
Scott_French placed T376516: Develop and validate a model of thumbor capacity to enable single-DC serving up for grabs.
Tue, Feb 10, 3:27 PM · ServiceOps-Services-Oids, ServiceOps new, Datacenter-Switchover
Scott_French triaged T376516: Develop and validate a model of thumbor capacity to enable single-DC serving as Medium priority.
Tue, Feb 10, 3:27 PM · ServiceOps-Services-Oids, ServiceOps new, Datacenter-Switchover
Scott_French updated the task description for T376516: Develop and validate a model of thumbor capacity to enable single-DC serving.
Tue, Feb 10, 3:26 PM · ServiceOps-Services-Oids, ServiceOps new, Datacenter-Switchover
Scott_French added a comment to T376516: Develop and validate a model of thumbor capacity to enable single-DC serving.

@MLechvien-WMF - No, because the swift.discovery.wmnet service continues to be excluded from the switchover. This is work we should do, but it is not urgent to take on specifically before this switchover.

Tue, Feb 10, 3:24 PM · ServiceOps-Services-Oids, ServiceOps new, Datacenter-Switchover
Scott_French added a comment to T416756: Release OpenTelemetry integration for service-utils.
  1. Production-appropriate NodeSDK initialization - This was a pain-point during integration by Abstract Wikipedia, [...]
Tue, Feb 10, 12:11 AM · Patch-For-Review, service-utils, ServiceOps-SharedInfra, ServiceOps new

Mon, Feb 9

Scott_French moved T416932: Occasional pymysql.err.OperationalError "MySQL server has gone away" on first load from Backlog to Bugs on the Hiddenparma board.
Mon, Feb 9, 10:10 PM · Hiddenparma
Scott_French created T416932: Occasional pymysql.err.OperationalError "MySQL server has gone away" on first load.
Mon, Feb 9, 10:09 PM · Hiddenparma
Scott_French added a member for Hiddenparma: Scott_French.
Mon, Feb 9, 9:56 PM
Scott_French added a comment to T368096: mediawiki: migrate from image-suggestion to data-gateway.

@MLechvien-WMF - It looks like I have all of the patches ready, so it's just a question of rebasing and finding a couple of hours. Which is to say, I think it should be easy to (finally) get this done.

Mon, Feb 9, 3:53 PM · ServiceOps-Services-Oids, ServiceOps new, MW-1.45-notes (1.45.0-wmf.15; 2025-08-19), Patch-For-Review, Growth-Team, Cassandra

Sat, Feb 7

Scott_French added a comment to T416757: Develop a service-mesh HTTP client for service-utils.

A reasonably complete proof of concept can be found in the feature/axois-mesh-client branch.

Sat, Feb 7, 1:44 AM · service-utils, ServiceOps-SharedInfra, ServiceOps new
Scott_French moved T416757: Develop a service-mesh HTTP client for service-utils from Inbox to Scheduled (this Q) on the ServiceOps new board.
Sat, Feb 7, 1:24 AM · service-utils, ServiceOps-SharedInfra, ServiceOps new
Scott_French triaged T416757: Develop a service-mesh HTTP client for service-utils as Medium priority.
Sat, Feb 7, 1:24 AM · service-utils, ServiceOps-SharedInfra, ServiceOps new
Scott_French created T416757: Develop a service-mesh HTTP client for service-utils.
Sat, Feb 7, 1:23 AM · service-utils, ServiceOps-SharedInfra, ServiceOps new
Scott_French moved T416752: WE6.2.9: Adopt node.js service-utils from Inbox to In Progress on the ServiceOps new board.
Sat, Feb 7, 12:48 AM · ServiceOps-SharedInfra, Epic, ServiceOps new
Scott_French moved T416756: Release OpenTelemetry integration for service-utils from Inbox to In Progress on the ServiceOps new board.
Sat, Feb 7, 12:48 AM · Patch-For-Review, service-utils, ServiceOps-SharedInfra, ServiceOps new
Scott_French triaged T416756: Release OpenTelemetry integration for service-utils as Medium priority.
Sat, Feb 7, 12:47 AM · Patch-For-Review, service-utils, ServiceOps-SharedInfra, ServiceOps new
Scott_French triaged T416752: WE6.2.9: Adopt node.js service-utils as Medium priority.
Sat, Feb 7, 12:47 AM · ServiceOps-SharedInfra, Epic, ServiceOps new
Scott_French changed the status of T416756: Release OpenTelemetry integration for service-utils, a subtask of T416752: WE6.2.9: Adopt node.js service-utils, from Open to In Progress.
Sat, Feb 7, 12:47 AM · ServiceOps-SharedInfra, Epic, ServiceOps new
Scott_French changed the status of T416756: Release OpenTelemetry integration for service-utils from Open to In Progress.
Sat, Feb 7, 12:47 AM · Patch-For-Review, service-utils, ServiceOps-SharedInfra, ServiceOps new
Scott_French changed the status of T416752: WE6.2.9: Adopt node.js service-utils from Open to In Progress.
Sat, Feb 7, 12:47 AM · ServiceOps-SharedInfra, Epic, ServiceOps new
Scott_French created T416756: Release OpenTelemetry integration for service-utils.
Sat, Feb 7, 12:46 AM · Patch-For-Review, service-utils, ServiceOps-SharedInfra, ServiceOps new

Fri, Feb 6

Scott_French created T416752: WE6.2.9: Adopt node.js service-utils.
Fri, Feb 6, 11:14 PM · ServiceOps-SharedInfra, Epic, ServiceOps new
Scott_French added a comment to T397685: helmfile/scap does not reliably bootstrap mediawiki.

So, I'd say the main problem is really that we've introduced tight coupling between releases, which makes bootstrapping challenging since it forces sequencing. Investing in loosening that coupling (best possible solution), or ensuring that the appropriate tooling understands those constraints, seems like the right path here.

Fri, Feb 6, 6:39 PM · ServiceOps-Mediawiki, ServiceOps new, MW-on-K8s, Release-Engineering-Team, Scap

Wed, Feb 4

Scott_French added a comment to T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13.

Roughly 9h after switching back to node 22 with --max-old-space-size=4096 and --max-semi-space-size=16, we're seeing some interesting results:

Wed, Feb 4, 2:46 AM · Wikimedia-production-error, ServiceOps new, Wikipedia-Android-App-Backlog, Content-Transform-Team, Wikifeeds

Tue, Feb 3

Scott_French closed T403220: Introduce known-client identity objects and integrate with requestctl as Resolved.

Yes, I believe we can mark this resolved, now that the feature is in wider use and seems to be working as expected.

Tue, Feb 3, 3:06 PM · Patch-For-Review, Traffic, Hiddenparma
Scott_French closed T403220: Introduce known-client identity objects and integrate with requestctl, a subtask of T400100: FY 25/26 WE 5.4.2: Known bots / clients, as Resolved.
Tue, Feb 3, 3:06 PM · Epic, ServiceOps new, SRE
Scott_French added a comment to T397685: helmfile/scap does not reliably bootstrap mediawiki.

Alright, I've updated the task description on T405703: Update wikikube eqiad to kubernetes 1.31 to reflect two points:

  • During the "Deploy mediawiki" phase, the sequencing constraints we've discussed here, together with example commands for bringing up the support releases.
  • During the "Deploy all the services" phase, charlie in its current form will operate on all mediawiki services as well, which is probably not what we want in practice if we want to do that via scap (or, if we do want to operate on them in this phase, that should be possible if we move the support-release bring-up earlier to ensure it happens first).
Tue, Feb 3, 1:11 AM · ServiceOps-Mediawiki, ServiceOps new, MW-on-K8s, Release-Engineering-Team, Scap
Scott_French updated the task description for T405703: Update wikikube eqiad to kubernetes 1.31.
Tue, Feb 3, 12:58 AM · Discovery-Search (2025.09.26 - 2025.10.17), Data-Platform-SRE (2025.09.26 - 2025.10.17), Patch-For-Review, collaboration-services, Kubernetes, Prod-Kubernetes, serviceops-deprecated
Scott_French updated the task description for T405703: Update wikikube eqiad to kubernetes 1.31.
Tue, Feb 3, 12:46 AM · Discovery-Search (2025.09.26 - 2025.10.17), Data-Platform-SRE (2025.09.26 - 2025.10.17), Patch-For-Review, collaboration-services, Kubernetes, Prod-Kubernetes, serviceops-deprecated

Mon, Feb 2

Scott_French added a comment to T412951: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph.

Thank you, @elukey!

Mon, Feb 2, 10:56 PM · Epic, Kubernetes, ServiceOps new, Release-Engineering-Team (Radar), Ceph, SRE-swift-storage
Scott_French added a comment to T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13.

Thanks for the additional investigation, @Jgiannelos.

Mon, Feb 2, 5:58 PM · Wikimedia-production-error, ServiceOps new, Wikipedia-Android-App-Backlog, Content-Transform-Team, Wikifeeds

Feb 2 2026

Scott_French moved T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13 from In Progress to Radar on the ServiceOps new board.
Feb 2 2026, 3:27 PM · Wikimedia-production-error, ServiceOps new, Wikipedia-Android-App-Backlog, Content-Transform-Team, Wikifeeds

Jan 23 2026

Scott_French added a comment to T390251: docker-registry.wikimedia.org keeps serving bad blobs.

I've now also merged T406392, for the same reason.

Jan 23 2026, 2:18 AM · ServiceOps new, Patch-For-Review
Scott_French merged T406392: failed to push docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.10.5-py3.11-2025-10-03-192132-3003d4328df66a0086a350fdd2ba1dbd80a235c5: unknown: blob upload invalid into T390251: docker-registry.wikimedia.org keeps serving bad blobs.
Jan 23 2026, 12:56 AM · ServiceOps new, Patch-For-Review
Scott_French merged task T406392: failed to push docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.10.5-py3.11-2025-10-03-192132-3003d4328df66a0086a350fdd2ba1dbd80a235c5: unknown: blob upload invalid into T390251: docker-registry.wikimedia.org keeps serving bad blobs.
Jan 23 2026, 12:56 AM · Kubernetes, ServiceOps new, GitLab (CI & Job Runners)
Scott_French added a comment to T406392: failed to push docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.10.5-py3.11-2025-10-03-192132-3003d4328df66a0086a350fdd2ba1dbd80a235c5: unknown: blob upload invalid.

Since this is fundamentally the same class of failure mode as already tracked in T390251, I am going to duplicate this into the latter as canonical.

Jan 23 2026, 12:56 AM · Kubernetes, ServiceOps new, GitLab (CI & Job Runners)

Jan 22 2026

Scott_French added a comment to T410296: Significant increase in wikifeeds latency and mobileapps error rate since 2025/11/13.

A couple of hours after @Jgiannelos set --max-old-space-size (and deployed the new node 22-based image), we're again seeing cyclic latency excursions (as measured from Envoy's view on the Wikifeeds side) that seem to correlate with CPU and memory bumps (note: these are totals, not per-pod behavior).

Jan 22 2026, 10:59 PM · Wikimedia-production-error, ServiceOps new, Wikipedia-Android-App-Backlog, Content-Transform-Team, Wikifeeds
Scott_French added a comment to T390251: docker-registry.wikimedia.org keeps serving bad blobs.

I've merged T412265: Pushing to the docker registry fails with 500 Internal Server Error into this task, as we believe it's another manifestation of the same class of failure modes discussed here.

Jan 22 2026, 4:18 PM · ServiceOps new, Patch-For-Review
Scott_French merged T412265: Pushing to the docker registry fails with 500 Internal Server Error into T390251: docker-registry.wikimedia.org keeps serving bad blobs.
Jan 22 2026, 4:10 PM · ServiceOps new, Patch-For-Review
Scott_French merged task T412265: Pushing to the docker registry fails with 500 Internal Server Error into T390251: docker-registry.wikimedia.org keeps serving bad blobs.
Jan 22 2026, 4:09 PM · ServiceOps-SharedInfra, ServiceOps new, SRE, MW-on-K8s
Scott_French added a comment to T412265: Pushing to the docker registry fails with 500 Internal Server Error.

Since this is fundamentally the same class of failure mode as already tracked and reported in T390251, I am going to duplicate this into the latter as canonical.

Jan 22 2026, 4:09 PM · ServiceOps-SharedInfra, ServiceOps new, SRE, MW-on-K8s
Scott_French assigned T398592: Review the behaviour of foreachwikiindblist in mw-cron to Urbanecm_WMF.

@Urbanecm_WMF - Could you please confirm whether #1 from T398592#11539714 is correct or not? We'd like to confirm that the immediate-term need has been met. Longer term, we would try to prioritize #2 once that functionality exists in the relevant maintenance scripts. Please unassign once you have responded.

Jan 22 2026, 4:00 PM · ServiceOps-Mediawiki, ServiceOps new, MW-on-K8s
Scott_French removed projects from T415169: Transcode jobs failing with Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit'): ServiceOps new, serviceops-deprecated.
Jan 22 2026, 3:26 PM · Reader Growth Team, TimedMediaHandler, MW-Interfaces-Team, Wikimedia-production-error