Releases: Varashi/gpu-node-vsphere-maintenance-controller

v0.4.4

01 May 11:40

Fixed

  • get_vm_names_on_host and get_inventory_snapshot no longer abort the
    reconcile loop when a VM becomes inaccessible mid-iteration. vCenter
    evacuates or recreates vCLS agent VMs the moment a host enters
    maintenance, and the controller's service account typically lacks
    System.View on the vCLS folder, so reading .name on the doomed MoRef
    raised vim.fault.NoPermission, which propagated up as an unhandled
    error. Per-VM access now catches ManagedObjectNotFound and
    NoPermission and skips the entry.
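
The skip-on-fault pattern described above can be sketched roughly as follows. This is an illustration, not the controller's actual code: the two exception classes are local stand-ins for pyVmomi's vmodl.fault.ManagedObjectNotFound and vim.fault.NoPermission so the snippet runs without a vCenter, and safe_vm_names and FakeVM are hypothetical names.

```python
class ManagedObjectNotFound(Exception):
    """Stand-in for pyVmomi's vmodl.fault.ManagedObjectNotFound."""


class NoPermission(Exception):
    """Stand-in for pyVmomi's vim.fault.NoPermission."""


def safe_vm_names(vms):
    """Return the names of all VMs whose .name is still readable."""
    names = []
    for vm in vms:
        try:
            # In pyVmomi a property read like this can hit vCenter,
            # so it can fail even though the object was listed a moment ago.
            names.append(vm.name)
        except (ManagedObjectNotFound, NoPermission):
            # The VM vanished or became invisible mid-iteration
            # (e.g. a vCLS agent VM being evacuated): skip it.
            continue
    return names


class FakeVM:
    """Hypothetical test double whose .name may raise a fault."""

    def __init__(self, name=None, error=None):
        self._name = name
        self._error = error

    @property
    def name(self):
        if self._error is not None:
            raise self._error
        return self._name
```

Catching per VM rather than around the whole loop means one unreadable entry costs only that entry; any other exception type still propagates.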

v0.4.3

21 Apr 09:23

Docs / CI follow-up from post-v0.4.2 review. No controller code change.

Changed

  • README no longer suggests helm pull --verify. The flag looks for a
    PGP .prov file produced by helm package --sign, but the release
    workflow signs with cosign keyless — a different mechanism. Verify
    charts with cosign verify instead.
  • CI chart job: drop fetch-depth: 0 on checkout. ct lint --all
    does not use git history to pick charts, so the shallow default
    suffices and saves clone time.

v0.4.2

21 Apr 09:07

No controller code change. Supply-chain and CI polish only.

Added

  • README "Verifying a release" section with cosign verify, cosign
    verify-attestation, gh attestation verify, and helm pull --verify
    snippets for the published image and chart.
  • Release workflow now cosign-keyless-signs the published Helm chart OCI
    artifact as well as the image, against the digest returned by
    helm push.
  • CI chart job runs helm/chart-testing ct lint in addition to
    helm lint and helm template, catching SemVer and metadata drift
    that plain helm lint misses.

Changed

  • Release workflow signs the image digest once rather than once per tag —
    all tags resolve to the same digest, so per-tag signing only recorded
    duplicate signatures against the same subject.
  • Release workflow disables the buildx-embedded SBOM (sbom: false on
    docker/build-push-action). anchore/sbom-action remains the single
    source of the SPDX SBOM and the only input to the cosign SBOM
    attestation, so image consumers no longer see two SBOMs referencing
    the same digest.
  • Dockerfile pip install now passes --disable-pip-version-check to
    pre-empt hadolint DL3042 and trim startup noise.

v0.4.1

21 Apr 07:49

Added

  • VCENTER_TLS_VERIFY (bool, default false) for vCenter certificates
    issued by a public CA. When true and VCENTER_CA_BUNDLE is unset,
    the controller uses ssl.create_default_context() with no cafile,
    which falls back to OpenSSL's system trust store shipped in the
    container image. Chart exposes it as vcenter.tlsVerify.
  • README "TLS verification modes" section covering the three supported
    cases: self-signed, private/self-hosted CA, public CA.
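
As a rough sketch of how the three modes could map onto Python's ssl module: the function name build_ssl_context and its parameters are hypothetical, standing in for however the controller reads VCENTER_CA_BUNDLE and VCENTER_TLS_VERIFY. Giving the bundle precedence over the flag is inferred from the "when true and VCENTER_CA_BUNDLE is unset" wording.

```python
import logging
import ssl

log = logging.getLogger(__name__)


def build_ssl_context(ca_bundle=None, tls_verify=False):
    """Pick an SSLContext for the vCenter connection.

    ca_bundle  -- path to a PEM bundle (private / self-hosted CA case)
    tls_verify -- verify against the system trust store (public CA case)
    """
    if ca_bundle:
        # Private CA: trust the certificates in the supplied bundle.
        return ssl.create_default_context(cafile=ca_bundle)
    if tls_verify:
        # Public CA: no cafile, so the system trust store is loaded.
        return ssl.create_default_context()
    # Self-signed: no verification at all. check_hostname must be
    # disabled before verify_mode can be set to CERT_NONE.
    log.warning("vCenter TLS verification is disabled")
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

The resulting context would then be handed to whatever opens the vCenter session.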

v0.4.0

21 Apr 07:39

Added

  • Optional TLS verification against vCenter via VCENTER_CA_BUNDLE.
    When set, uses ssl.create_default_context(cafile=...); otherwise falls
    back to the previous unverified behaviour and logs a warning.
  • reconcile_pending_drains(host_states) runs every poll. Picks up GPU
    nodes on in/entering-maintenance hosts that still have no state
    annotation — covers the case where MAX_CONCURRENT_DRAINS throttled the
    edge-trigger and the skipped host would otherwise never be retried.
  • get_inventory_snapshot() emits both host_states and a
    vm_host_map from a single HostSystem view walk. reconcile_powered_off
    now consults the map instead of making a per-node get_vm_host round-trip
    to vCenter on every poll.
  • Minimal Helm chart under chart/, published as OCI to
    ghcr.io/varashi/charts/gpu-node-vsphere-maintenance-controller.
  • GitHub Actions: ci.yaml (ruff, hadolint, helm lint, buildx smoke build)
    on pull requests; release.yaml on v*.*.* tag push builds multi-arch
    images (amd64, arm64), cosign-signs keyless via OIDC, attaches SBOM and
    build-provenance attestations, packages and pushes the Helm chart, and
    creates a GitHub Release with the body extracted from this file.
  • This CHANGELOG.md, seeded from the previous README "Version history".
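
To illustrate the level-triggered retry that reconcile_pending_drains adds, here is a minimal sketch over plain dictionaries. It is not the controller's implementation: the function name pending_drains, its extra parameters, the state labels, and the assumption that node names equal VM names are all hypothetical.

```python
# Assumed labels for the two maintenance-related host states.
MAINTENANCE_STATES = {"in_maintenance", "entering_maintenance"}


def pending_drains(host_states, vm_host_map, node_annotations,
                   active_drains, max_concurrent):
    """Return the GPU nodes to start draining on this poll.

    host_states      -- host name -> state label
    vm_host_map      -- node (VM) name -> host name
    node_annotations -- node name -> state annotation, absent if unset
    active_drains    -- drains already in flight
    max_concurrent   -- cap, in the role of MAX_CONCURRENT_DRAINS
    """
    budget = max(0, max_concurrent - active_drains)
    picked = []
    for node, host in sorted(vm_host_map.items()):
        if len(picked) >= budget:
            break  # throttled; the remainder is retried next poll
        if host_states.get(host) not in MAINTENANCE_STATES:
            continue
        if node_annotations.get(node) is not None:
            continue  # the state machine already owns this node
        picked.append(node)
    return picked
```

Because the selection is recomputed from current state on every poll, a node skipped by the cap is picked up again as soon as a slot frees up, which an edge-triggered handler alone would not guarantee.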

Changed

  • Dockerfile pinned to python:3.13-slim. pyVmomi==8.0.3.0.1 predates
    Python 3.14 and has not been tested against it upstream.
  • startup_reconcile delegates its "host already in maintenance at boot"
    branch to reconcile_pending_drains so the two paths share one
    implementation.
  • Example Deployment in the README sets strategy.type: Recreate. With
    replicas: 1 this closes the brief double-run window that was previously
    only partially mitigated by idempotency at the state-machine level.

Fixed

  • Concurrent power-off race: a PowerOff() landing on an already-off VM
    previously bubbled an InvalidPowerState error through the generic
    exception catch and aborted the cycle mid-transition. Now treated as
    success, symmetric to the existing power-on handling.
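
The idempotent power-off handling might look roughly like this. It is a sketch under assumptions: InvalidPowerState is a local stand-in for pyVmomi's vim.fault.InvalidPowerState, and power_off_idempotent with its two callables is a hypothetical shape chosen so the example runs without a vCenter.

```python
class InvalidPowerState(Exception):
    """Stand-in for pyVmomi's vim.fault.InvalidPowerState."""


def power_off_idempotent(power_off, wait):
    """Power off a VM, treating 'already off' as success.

    power_off -- callable that starts the operation and returns a task
    wait      -- callable that blocks until the given task completes
    Returns "powered_off" or "already_off"; other errors propagate.
    """
    try:
        wait(power_off())
    except InvalidPowerState:
        # Another actor got there first. The desired end state holds,
        # so the cycle can continue instead of aborting mid-transition.
        return "already_off"
    return "powered_off"
```

Only the one fault that signals "nothing left to do" is swallowed; everything else still reaches the generic exception handling.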

Removed

  • policy/poddisruptionbudgets rule from the example ClusterRole. The
    controller never calls the PDB API; PDB-blocked evictions are handled
    via the 429 response on pods/eviction.
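
A minimal sketch of handling a PDB-blocked eviction through the response code alone, without reading any PodDisruptionBudget object. ApiError and try_evict are hypothetical names; with the official Kubernetes Python client the equivalent exception would be ApiException, which exposes the HTTP code as .status.

```python
class ApiError(Exception):
    """Stand-in for an HTTP error raised by a Kubernetes API client."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def try_evict(evict, pod):
    """Attempt one eviction and report the outcome.

    evict -- callable that posts an Eviction for the given pod
    Returns "evicted", or "blocked" when the API answers 429, which is
    how the eviction endpoint signals that a PDB currently forbids the
    disruption. Any other error propagates to the caller.
    """
    try:
        evict(pod)
    except ApiError as exc:
        if exc.status == 429:
            return "blocked"  # retry on a later poll
        raise
    return "evicted"
```

Since the eviction endpoint enforces the budget server-side, the controller only needs create permission on pods/eviction.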