Releases: Varashi/gpu-node-vsphere-maintenance-controller
v0.4.4
Fixed
- `get_vm_names_on_host` and `get_inventory_snapshot` no longer abort the
  reconcile loop when a VM becomes inaccessible mid-iteration. vCenter
  evacuates/recreates vCLS agent VMs the moment a host enters maintenance,
  and the controller's service account typically has no `System.View` on
  the vCLS folder, so the doomed MoRef returned `vim.fault.NoPermission`
  from `.name` and propagated up as an unhandled error. Per-VM access now
  catches `ManagedObjectNotFound` + `NoPermission` and skips the entry.
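
The per-VM guard described above can be sketched roughly as follows. This is an illustration only, not the controller's actual code: the two exception classes are local stand-ins for pyVmomi's `vmodl.fault.ManagedObjectNotFound` and `vim.fault.NoPermission`, and `safe_vm_names` and `FakeVM` are hypothetical names introduced for the example.

```python
class ManagedObjectNotFound(Exception):
    """Stand-in for pyVmomi's vmodl.fault.ManagedObjectNotFound."""


class NoPermission(Exception):
    """Stand-in for pyVmomi's vim.fault.NoPermission."""


# Faults that mean "this VM vanished or is hidden from us",
# as opposed to "the whole inventory walk is broken".
SKIPPABLE_FAULTS = (ManagedObjectNotFound, NoPermission)


def safe_vm_names(vms):
    """Return the names of VMs that are still readable, skipping the rest."""
    names = []
    for vm in vms:
        try:
            # Attribute access can raise for each VM individually.
            names.append(vm.name)
        except SKIPPABLE_FAULTS:
            continue
    return names


class FakeVM:
    """Hypothetical test double whose .name may raise on access."""

    def __init__(self, name=None, fault=None):
        self._name = name
        self._fault = fault

    @property
    def name(self):
        if self._fault is not None:
            raise self._fault
        return self._name
```

Catching the two faults inside the loop, rather than around it, is what keeps one unreadable vCLS VM from discarding the names already collected for the same host.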
v0.4.3
Docs / CI follow-up from post-v0.4.2 review. No controller code change.
Changed
- README no longer suggests `helm pull --verify`. The flag looks for a
  PGP `.prov` file produced by `helm package --sign`, but the release
  workflow signs with cosign keyless, which is a different mechanism.
  Verify charts with `cosign verify` instead.
- CI `chart` job: drop `fetch-depth: 0` on checkout. `ct lint --all`
  does not use git history to pick charts, so the shallow default
  suffices and saves clone time.
v0.4.2
No controller code change. Supply-chain and CI polish only.
Added
- README "Verifying a release" section with `cosign verify`,
  `cosign verify-attestation`, `gh attestation verify`, and
  `helm pull --verify` snippets for the published image and chart.
- Release workflow now cosign-keyless-signs the published Helm chart OCI
  artifact as well as the image, against the digest returned by
  `helm push`.
- CI `chart` job runs `helm/chart-testing` `ct lint` in addition to
  `helm lint` and `helm template`, catching SemVer and metadata drift
  that plain `helm lint` misses.
Changed
- Release workflow signs the image digest once rather than once per tag.
  All tags resolve to the same digest, so per-tag signing only recorded
  duplicate signatures against the same subject.
- Release workflow disables the buildx-embedded SBOM (`sbom: false` on
  `docker/build-push-action`). `anchore/sbom-action` remains the single
  source of the SPDX SBOM and the only input to the cosign SBOM
  attestation, so image consumers no longer see two SBOMs referencing
  the same digest.
- Dockerfile `pip install` now passes `--disable-pip-version-check` to
  pre-empt hadolint `DL3042` and trim startup noise.
v0.4.1
Added
- `VCENTER_TLS_VERIFY` (bool, default `false`) for vCenter certificates
  issued by a public CA. When `true` and `VCENTER_CA_BUNDLE` is unset,
  the controller uses `ssl.create_default_context()` with no `cafile`,
  which falls back to OpenSSL's system trust store shipped in the
  container image. Chart exposes it as `vcenter.tlsVerify`.
- README "TLS verification modes" section covering the three supported
  cases: self-signed, private/self-hosted CA, public CA.
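
The three modes could be selected along these lines. This is a hedged sketch, not the controller's implementation: `build_ssl_context` and `parse_bool` are hypothetical names, the accepted truthy strings are an assumption, and giving `VCENTER_CA_BUNDLE` precedence over `VCENTER_TLS_VERIFY` is inferred from the wording above rather than confirmed.

```python
import os
import ssl


def parse_bool(value, default=False):
    """Interpret common truthy strings (hypothetical helper)."""
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def build_ssl_context(env=os.environ):
    """Pick a TLS mode from VCENTER_CA_BUNDLE and VCENTER_TLS_VERIFY."""
    ca_bundle = env.get("VCENTER_CA_BUNDLE")
    if ca_bundle:
        # Private or self-hosted CA: verify against the supplied bundle.
        return ssl.create_default_context(cafile=ca_bundle)
    if parse_bool(env.get("VCENTER_TLS_VERIFY")):
        # Public CA: no cafile, so the system trust store is loaded.
        return ssl.create_default_context()
    # Self-signed: verification is switched off (insecure).
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before CERT_NONE is set
    ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

Passing a plain dict instead of `os.environ` makes each mode easy to exercise, for example `build_ssl_context({"VCENTER_TLS_VERIFY": "true"})`.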
v0.4.0
Added
- Optional TLS verification against vCenter via `VCENTER_CA_BUNDLE`.
  When set, uses `ssl.create_default_context(cafile=...)`; otherwise falls
  back to the previous unverified behaviour and logs a warning.
- `reconcile_pending_drains(host_states)` runs every poll. Picks up GPU
  nodes on in/entering-maintenance hosts that still have no state
  annotation. This covers the case where `MAX_CONCURRENT_DRAINS` throttled
  the edge-trigger and the skipped host would otherwise never be retried.
- `get_inventory_snapshot()` emits both `host_states` and a
  `vm_host_map` from a single `HostSystem` view walk.
  `reconcile_powered_off` now consults the map instead of making a
  per-node `get_vm_host` round-trip to vCenter on every poll.
- Minimal Helm chart under `chart/`, published as OCI to
  `ghcr.io/varashi/charts/gpu-node-vsphere-maintenance-controller`.
- GitHub Actions: `ci.yaml` (ruff, hadolint, helm lint, buildx smoke build)
  on pull requests; `release.yaml` on `v*.*.*` tag push builds multi-arch
  images (amd64, arm64), cosign-signs keyless via OIDC, attaches SBOM and
  build-provenance attestations, packages and pushes the Helm chart, and
  creates a GitHub Release with the body extracted from this file.
- This `CHANGELOG.md`, seeded from the previous README "Version history".
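
The level-triggered retry behind `reconcile_pending_drains` can be illustrated with a small selection function. Everything here is an assumption made for the sketch: the state labels, the annotation key, the dict-based node shape, and the `select_pending_drains` name are hypothetical and do not come from the controller.

```python
MAINTENANCE_STATES = {"in_maintenance", "entering_maintenance"}  # assumed labels
STATE_ANNOTATION = "example.com/maintenance-state"  # hypothetical key


def select_pending_drains(host_states, nodes, active_drains, max_concurrent):
    """Return the node names that should start draining on this poll.

    host_states: dict of ESXi host name -> state string
    nodes: list of dicts with "name", "host" and "annotations" keys
    """
    slots = max(0, max_concurrent - active_drains)
    pending = []
    for node in nodes:
        if len(pending) >= slots:
            break  # throttled; the node is reconsidered on the next poll
        if host_states.get(node["host"]) not in MAINTENANCE_STATES:
            continue
        if STATE_ANNOTATION in node["annotations"]:
            continue  # already tracked by the state machine
        pending.append(node["name"])
    return pending
```

Because the function looks at current state on every poll instead of reacting only to a state change, a node skipped by the concurrency limit is selected again as soon as a slot frees up.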
Changed
- Dockerfile pinned to `python:3.13-slim`. `pyVmomi==8.0.3.0.1` predates
  Python 3.14 and has not been tested against it upstream.
- `startup_reconcile` delegates its "host already in maintenance at boot"
  branch to `reconcile_pending_drains` so the two paths share one
  implementation.
- Example Deployment in the README sets `strategy.type: Recreate`. With
  `replicas: 1` this closes the brief double-run window that was previously
  only partially mitigated by idempotency at the state-machine level.
Fixed
- Concurrent power-off race: a `PowerOff()` landing on an already-off VM
  previously bubbled an `InvalidPowerState` error through the generic
  exception catch and aborted the cycle mid-transition. Now treated as
  success, symmetric to the existing power-on handling.
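
A minimal sketch of the idempotent power-off described above. The `InvalidPowerState` class is a local stand-in for pyVmomi's `vim.fault.InvalidPowerState`, and `power_off_idempotent` and `FakeVM` are hypothetical names. With pyVmomi the fault may surface when the returned task is awaited rather than at the call itself; the sketch collapses both into one step.

```python
class InvalidPowerState(Exception):
    """Stand-in for pyVmomi's vim.fault.InvalidPowerState."""


def power_off_idempotent(vm):
    """Power off a VM, treating "already off" as success.

    Returns True if this call performed the power-off, False if the VM
    was already off. Any other exception still propagates.
    """
    try:
        vm.PowerOff()
    except InvalidPowerState:
        return False
    return True


class FakeVM:
    """Hypothetical test double that mimics the power-state check."""

    def __init__(self, powered_on):
        self.powered_on = powered_on

    def PowerOff(self):
        if not self.powered_on:
            raise InvalidPowerState()
        self.powered_on = False
```

Catching only the one fault keeps genuine failures visible while letting two racing power-off attempts both end in the desired state.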
Removed
- `policy/poddisruptionbudgets` verb from the example ClusterRole. The
  controller never calls the PDB API. PDB-blocked evictions are handled
  via the 429 response on `pods/eviction`.
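
Handling PDB-blocked evictions from the response status alone might look like the sketch below. The `EvictionResult` enum and `classify_eviction` function are hypothetical; only the meaning of 429 on `pods/eviction` comes from the text above, and the treatment of 404 and other codes is an assumption.

```python
from enum import Enum


class EvictionResult(Enum):
    EVICTED = "evicted"
    BLOCKED_BY_PDB = "blocked_by_pdb"
    GONE = "gone"


def classify_eviction(status_code):
    """Map the HTTP status of a pods/eviction POST to an outcome."""
    if 200 <= status_code < 300:
        return EvictionResult.EVICTED
    if status_code == 429:
        # The eviction is not allowed right now; retry on a later poll.
        return EvictionResult.BLOCKED_BY_PDB
    if status_code == 404:
        # Assumed: the pod is already gone, so there is nothing to evict.
        return EvictionResult.GONE
    raise RuntimeError(f"unexpected eviction status {status_code}")
```

Since the decision needs nothing beyond the eviction response, the ClusterRole does not have to grant read access to PodDisruptionBudget objects.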