core: extend userns gid_map with SupplementaryGroups= IDs#42106
core: extend userns gid_map with SupplementaryGroups= IDs#42106rocker-zhang wants to merge 1 commit into
Conversation
Claude review of PR #42106 (12b4c39)Suggestions
Nits
|
| &outside_uid, | ||
| &outside_gid, | ||
| gids_to_enforce, | ||
| ngids_to_enforce, |
There was a problem hiding this comment.
Claude: suggestion: ngids_to_enforce is declared as int (line 6039) but setup_private_users expects size_t n_supplementary_gids. While this is safe in practice (since the negative/error case is handled earlier), it may produce a -Wsign-conversion warning on strict builds. Consider either casting explicitly — (size_t) ngids_to_enforce — or declaring ngids_to_enforce as size_t (storing the merge_gid_lists result in a temporary int for the error check).
| &gid, | ||
| &outside_uid, | ||
| &outside_gid, | ||
| gids_to_enforce, |
There was a problem hiding this comment.
Claude: suggestion: Same signed-to-unsigned implicit conversion: int ngids_to_enforce is passed where size_t n_supplementary_gids is expected. Same fix as the other call site would apply here.
When PrivateUsers=yes (PRIVATE_USERS_SELF), the user-namespace gid_map
only maps the service manager's GID and the unit's GID. Any GIDs from
SupplementaryGroups= (or added by PAM) fall outside the map, so the
kernel substitutes them with the namespace's overflow GID and the unit
sees its supplementary groups collapsed to "nobody"/65534 — even though
the groups were correctly applied via setgroups() in the host namespace
before the userns was created.
Extend the gid_map under PRIVATE_USERS_SELF to also identity-map each
SupplementaryGroups=/PAM-derived GID so they remain visible inside the
namespace. Multi-line gid_map writes require CAP_SETGID in the parent
userns, which the existing two-line case already gates on; we reuse the
same gate.
The kernel limits a single gid_map write to 5 extents on pre-v5.10
kernels (raised to 340 in commit 6397fac4915a "userns: bump idmap
limits to 340"). systemd's baseline is v5.4, so cap the supplementary
GIDs appended here at 3 (leaving 2 for the baseline) to stay within
the older limit. Excess GIDs continue to appear as the overflow GID,
which is exactly the pre-fix behavior — so the cap never regresses
startup.
Add a TEST-07-PID1.private-users.sh regression check using GID 2
("bin"), which is stable across the mkosi test distros.
Fixes: systemd#41994
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
2747942 to
12b4c39
Compare
|
Pushed a fixup: changed the new Rationale per the bot's two suggestions on
|
| * append here so that, combined with the up-to-two baseline lines, the total stays | ||
| * within the pre-v5.10 limit. Excess groups remain visible as the overflow GID | ||
| * inside the namespace — the same as pre-fix behavior. */ | ||
| if (can_map_more) { |
There was a problem hiding this comment.
Claude: nit: The fixed cap of 3 supplementary GIDs is safe and well-reasoned for the common case (baseline + saved_gid line = 2 lines already used). However, when gid_is_valid(*gid) is false or *gid == saved_gid, the baseline gid_map has only 1 line, meaning up to 4 supplementary GIDs could fit within the pre-v5.10 limit of 5 extents. You could compute the remaining budget dynamically (5 - baseline_lines) to map one more GID in that corner case. Minor optimization — the fixed cap is correct and never regresses.
| # user namespace rather than being collapsed to the kernel's overflow GID | ||
| # (nogroup). Picks a stable, low-numbered system group ("bin", GID 2) so | ||
| # the assertion does not depend on /etc/group contents. | ||
| systemd-run -p PrivateUsers=yes -p DynamicUser=yes -p SupplementaryGroups=2 --wait \ |
There was a problem hiding this comment.
Claude: nit: The test assumes GID 2 exists across all mkosi test distros. Since SupplementaryGroups= with numeric GIDs doesn't require a /etc/group entry for setgroups() to succeed, this likely works everywhere. But if a future test environment lacks GID 2 in its group database, id -G output validation could behave unexpectedly. Consider adding a brief guard (getent group 2 >/dev/null 2>&1 || exit 0) or using GID 0 which is universally present.
Summary
Fixes #41994. When a unit is run with
PrivateUsers=yesandSupplementaryGroups=, the supplementary groups currently appear as the overflow GID (nobody/65534) inside the user namespace instead of as themselves.Reporter's repro:
Root cause
The
PRIVATE_USERS_SELFbranch insetup_private_users()writes a gid_map that contains onlysaved_gid → saved_gidand*gid → *gid. The supplementary GIDs were already applied viasetgroups()in the host namespace before the userns was created — but since the kernel can't map them through the namespace's gid_map, it substitutes the overflow GID for display in/proc/self/status,getgroups(), etc.Fix
Extend the gid_map under
PRIVATE_USERS_SELFto identity-map eachSupplementaryGroups=/PAM-derived GID. Multi-line gid_map writes already require CAP_SETGID in the parent userns (the existing two-line case gates on this); we reuse that gate.To avoid a startup regression on the older end of systemd's supported kernel range, the number of appended supplementary GIDs is capped (
USERNS_SUPPLEMENTARY_GIDS_MAX = 3):6397fac4915a).log_debugrecords the truncation.Test
Added to
test/units/TEST-07-PID1.private-users.sh:id -Ginside aPrivateUsers=yes(andPrivateUsersEx=self)DynamicUser=yesunit listsSupplementaryGroups=2./proc/self/gid_mapcontains the expected identity line for GID 2.Verified locally:
meson test -C build --print-errorlogs test-user-utilpasses; fullmeson compilesucceeds. The new TEST-07 case runs in the integration VM.Notes / caveats
setup_private_userscall (line ~6116) also passes the supplementary list through but itscan_map_moregate is false without CAP_SETGID, so behavior there is unchanged (andenforce_groupslater in the deny-setgroups userns is the pre-existing unprivileged-path limitation, not addressed here).PRIVATE_USERS_IDENTITYandPRIVATE_USERS_FULLalready cover all 32-bit GIDs, so they need no change.saved_gid/*gidprevents writing duplicate gid_map lines, which the kernel would reject.Closes: #41994