User Details
- User Since: Jan 18 2024, 5:33 PM (110 w, 5 d)
- Availability: Available
- LDAP User: Scott French
- MediaWiki User: SFrench-WMF
Fri, Feb 27
service-utils v2.0.0 was released earlier today, so I've updated the example instrumentation script on Wikitech to reflect the new @wikimedia/service-utils/otel subpath export.
Thanks, @Krinkle - That helps clarify. For noc.wikimedia.org, presence of the x-wikimedia-debug header affects caching behavior and response header cleanup, but has no role in backend selection at ATS. Instead, requests always route to mw-misc, which also happens to be updated in the testservers stage, so the "feedback loop" for testing merged code is sort of equivalent.
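As a toy model of the behavior described above (function and header names are illustrative, this is not the actual ATS/Lua configuration): the header gates cacheability, while backend selection for noc is unconditional.

```python
def select_backend(host, headers):
    """Backend selection for noc.wikimedia.org: always mw-misc,
    regardless of X-Wikimedia-Debug (toy model, not real ATS config)."""
    if host == "noc.wikimedia.org":
        return "mw-misc"
    raise ValueError(f"no route modeled for {host}")

def is_cacheable(headers):
    """Presence of X-Wikimedia-Debug disables caching (toy model)."""
    return "x-wikimedia-debug" not in {k.lower() for k in headers}
```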
Thu, Feb 26
Thank you very much for doing so, @Jdforrester-WMF. Great, I'll refocus this on the upcoming work to migrate to 8.3-on-bookworm.
Now that I think about it, interacting with envoy via UDS is going to be a bit of a pain without pulling in additional dependencies that we probably don't want in practice.
The drain-envoy.sh hook script assumes the envoy admin interface is available via a TCP listener, but in our production envoy config, it's only available via UDS (at /var/run/envoy/admin.sock). The script needs to be updated to reflect / support that.
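For illustration, a minimal sketch of what talking to the admin interface over that socket involves (a hypothetical helper, not the actual script; in shell this would be roughly `curl --unix-socket /var/run/envoy/admin.sock`):

```python
import socket

def envoy_admin_post(path, sock_path="/var/run/envoy/admin.sock"):
    """POST to envoy's admin interface over a unix domain socket and
    return the HTTP status line. Hypothetical helper sketching what the
    drain hook would need to do, e.g. for /drain_listeners?graceful."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        req = (
            f"POST {path} HTTP/1.1\r\n"
            "Host: localhost\r\n"
            "Content-Length: 0\r\n"
            "Connection: close\r\n\r\n"
        )
        s.sendall(req.encode())
        resp = b""
        while chunk := s.recv(4096):
            resp += chunk
    # First line is the status line, e.g. "HTTP/1.1 200 OK"
    return resp.split(b"\r\n", 1)[0].decode()

# Example (requires a running envoy with a UDS admin listener):
#   envoy_admin_post("/drain_listeners?graceful")
```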
@Jdforrester-WMF - Thanks for clarifying. I've reverted the task title and description to reflect that.
From a quick spot check in puppet and deployment-charts, it doesn't look like we've subsequently done this widely, though there are limited use cases in which use_remote_address: true is set.
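For context on what that flag changes, a simplified model of envoy's downstream client-address logic (assuming xff_num_trusted_hops = 0; names are illustrative):

```python
def client_address(conn_remote_addr, xff, use_remote_address):
    """Simplified model: with use_remote_address true, envoy trusts the
    direct connection address (and appends it to X-Forwarded-For); with
    it false, it trusts the last entry of an incoming X-Forwarded-For
    header if present. Not the full envoy algorithm."""
    if use_remote_address or not xff:
        return conn_remote_addr
    return xff.split(",")[-1].strip()
```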
So, the way this currently works is that mw-misc (home of noc.wikimedia.org) is updated during the testservers deployment stage, just like mw-debug. Meaning, hitting noc during that stage should observe the new changes immediately.
@JMeybohm - Do you happen to have an estimate of how much effort this represents, and whether it might be something we can (or must) do soon? Thanks!
The work planned for yesterday was preempted, but it should be able to move forward again today.
Wed, Feb 25
Many thanks to @CDanis for pointing me to the explanation for:
With the changes to the drain-hook script now live, I'll move forward with the deployment-charts changes during tomorrow's UTC-late infra window.
Tue, Feb 24
Current state
Mon, Feb 23
We've decided to move ahead with experimental changes to (re)introduce envoy drain in the MediaWiki deployments, before moving ahead with the sidecar container changes. While that adds confusion around what exactly the drain hook is responsible for (i.e., it now needs to both trigger the drain and sleep; details in https://gerrit.wikimedia.org/r/1242462), IMO that's okay for now.
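To make the dual responsibility concrete, the combined hook now has roughly this shape (a sketch only; `trigger_drain` stands in for the POST to envoy's admin interface, and the grace period shown is illustrative, not the production value):

```python
import time

def drain_hook(trigger_drain, grace_seconds=8):
    """Shape of the combined drain hook: first trigger envoy's drain,
    then sleep so in-flight requests can finish before pod teardown.
    Names and the default grace period are illustrative."""
    trigger_drain()            # 1. tell envoy to stop accepting new requests
    time.sleep(grace_seconds)  # 2. hold the hook open while in-flight work drains
```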
Sat, Feb 21
I don't understand what the specific proposal is regarding sidecar pods. Is the idea to kill envoy before you kill the main app? That would still not be all that graceful since it would prevent the app from returning responses.
Fri, Feb 20
Thanks, @MLechvien-WMF - Indeed, the docs are updated, and it has now been a week with no voiced objections to closing this out, so let's do so :)
The example in T364245#11634977 indeed looks like a request arriving at a pod after it has started terminating at 00:59:11 (as part of this scap deployment) - i.e., during the ~ 8s window where we wait for in-flight requests to finish (glossing over the details of how that's currently implemented). Now that we're on a version of k8s that directly supports the concept of sidecar containers, we're looking at how that can be used to improve container shutdown order: T417800: Implement proper sidecar container support in mediawiki pods.
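As a simplified model of why native sidecars help here (container names are illustrative): the kubelet terminates app containers first, then sidecars in reverse start order, so a mesh proxy can keep serving responses while the app drains.

```python
def shutdown_order(app_containers, sidecar_containers):
    """Simplified model of kubelet termination with native sidecars
    (initContainers with restartPolicy: Always): app containers stop
    first, then sidecars in reverse start order. Not the kubelet
    implementation; container names are illustrative."""
    return list(app_containers) + list(reversed(sidecar_containers))
```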
Thu, Feb 19
I've made a number of updates to:
- https://wikitech.wikimedia.org/wiki/Distributed_tracing
- https://wikitech.wikimedia.org/wiki/Distributed_tracing/Propagating_tracing_context
- https://wikitech.wikimedia.org/wiki/Distributed_tracing/Tutorial/Instrumenting_your_own_application
over the last two days, which cover many of the points discussed in this task in at least some detail.
Tue, Feb 17
Thanks, @bd808 - Nah, this is good, and basically identical to what I would have opened soon anyway :)
Thank you both!
Fri, Feb 13
Sampling decisions
A couple of thoughts while reviewing yesterday's patch series, that don't really fit on one specific patch:
Thu, Feb 12
From some additional digging into auto-instrumentation today, there's at least one additional point we should highlight in our documentation:
I've updated the relevant portion of the preparation section (now Verify MediaWiki serving capacity). Edits for clarity welcome :)
@jijiki - So, the script in serviceops-kitchensink made the analysis repeatable / less-manual, and agreed that we should augment the Check capacity in the destination datacentre step of the switchover documentation to encourage its use - I can do that later today. That would close this out for the purposes of the Datacenter-Switchover.
One critical difference between exclude_from_switchover (which supersedes EXCLUDED_SERVICES) and MEDIAWIKI_SERVICES / MEDIAWIKI_RO_SERVICES is that the former applies at the service-catalog service level, while the latter applies at the DNS discovery service level. The relationship between these concepts is 1:N (where N >= 0 in general, but for anything we care about in the switchover, it'll be N >= 1).
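A sketch of that 1:N expansion (the mapping below is hypothetical, not the real service catalog; swift-ro is an invented example record):

```python
# Hypothetical catalog-service -> DNS-discovery-record mapping (N >= 1
# for anything relevant to the switchover).
CATALOG_TO_DISCOVERY = {
    "swift": ["swift.discovery.wmnet", "swift-ro.discovery.wmnet"],  # N = 2
    "mw-web": ["mw-web.discovery.wmnet"],                            # N = 1
}

def discovery_records_excluded(excluded_catalog_services):
    """exclude_from_switchover applies per service-catalog service;
    expand it to the DNS discovery records it implies."""
    return sorted(
        rec
        for svc in excluded_catalog_services
        for rec in CATALOG_TO_DISCOVERY.get(svc, [])
    )
```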
Wed, Feb 11
Expanding a bit on my comment in T416756#11599776 about error-prone SDK init, I noticed something curious today:
No objections to moving this to backlog, as it's not currently prioritized for either Q3 or Q4.
Thank you both for beating me to assembling a similar command myself, heh.
Tue, Feb 10
Indeed, there's currently no mechanism I'm aware of to automatically absent a confd instance from a host when there are no longer templates configured.
@jijiki - Is there anything here that still needs to be done (e.g., around cleanup of duplicate definitions of cache.mcrouter.deployment) or is this finished? If work is still required, updating the priority and timeline would be greatly appreciated.
Checking in, it looks like we still have a couple of services on Wikikube that require migration, most notably MediaWiki (or, IIUC, cleanup of the older duplicate network policies, as in the tegola-vector-tiles case).
Just to confirm, the terminal-multiplexer check feature works as expected, and all that remains is docs updates and comms to ops@lists before it's reenabled, correct? (i.e., the description remains accurate)
This should have been fixed (going the pre-ping route) earlier today in https://gitlab.wikimedia.org/repos/sre/hiddenparma/-/merge_requests/142. Thanks, @Joe!
@MLechvien-WMF - No, because the swift.discovery.wmnet service continues to be excluded from the switchover. This is work we should do, but it is not urgent to take on specifically before this switchover.
- Production-appropriate NodeSDK initialization - This was a pain-point during integration by Abstract Wikipedia, [...]
Mon, Feb 9
@MLechvien-WMF - It looks like I have all of the patches ready, so it's just a question of rebasing and finding a couple of hours. Which is to say, I think it should be easy to (finally) get this done.
Sat, Feb 7
A reasonably complete proof of concept can be found in the feature/axois-mesh-client branch.
Fri, Feb 6
So, I'd say the main problem is really that we've introduced tight coupling between releases, which makes bootstrapping challenging since it forces sequencing. Investing in loosening that coupling (best possible solution), or ensuring that the appropriate tooling understands those constraints, seems like the right path here.
Wed, Feb 4
Roughly 9h after switching back to node 22 with --max-old-space-size=4096 and --max-semi-space-size=16, we're seeing some interesting results:
Tue, Feb 3
Yes, I believe we can mark this resolved, now that the feature is in wider use and seems to be working as expected.
Alright, I've updated the task description on T405703: Update wikikube eqiad to kubernetes 1.31 to reflect two points:
- During the "Deploy mediawiki" phase, the sequencing constraints we've discussed here, together with example commands for bringing up the support releases.
- During the "Deploy all the services" phase, charlie in its current form will operate on all mediawiki services as well, which is probably not what we want in practice if we want to do that via scap (or, if we do want to operate on them in this phase, that should be possible if we move the support-release bring-up earlier to ensure it happens first).
Mon, Feb 2
Thank you, @elukey!
Thanks for the additional investigation, @Jgiannelos.
Feb 2 2026
Jan 23 2026
I've now also merged T406392, for the same reason.
Since this is fundamentally the same class of failure mode as already tracked in T390251, I am going to duplicate this into the latter as canonical.
Jan 22 2026
A couple of hours after @Jgiannelos set --max-old-space-size (and deployed the new node 22-based image), we're again seeing cyclic latency excursions (as measured from Envoy's view on the Wikifeeds side) that seem to correlate with CPU and memory (note: these are totals, not per-pod behavior) bumps.
I've merged T412265: Pushing to the docker registry fails with 500 Internal Server Error into this task, as we believe it's another manifestation of the same class of failure modes discussed here.
Since this is fundamentally the same class of failure mode as already tracked and reported in T390251, I am going to duplicate this into the latter as canonical.
@Urbanecm_WMF - Could you please confirm whether #1 from T398592#11539714 is correct or not? We'd like to confirm that the immediate-term need has been met. Longer term, we would try to prioritize #2 once that functionality exists in the relevant maintenance scripts. Please unassign once you've responded.