Comparing changes

base repository: allora-network/.github
base: main
head repository: allora-network/.github
compare: devop-579-networkpolicy-rollout
  • 8 commits
  • 1 file changed
  • 4 contributors

Commits on May 22, 2026

  1. docs: add DEVOP-579 NetworkPolicy egress rollout plan

    NetworkPolicy egress hardening is a 3-engineer-week project that
    must NOT be rushed — `default-deny-egress` silently breaks every
    workload that has an un-enumerated outbound dependency. The bulk of
    the work is discovery (7 days of baseline flow logs per namespace),
    not deployment.
    
    This doc captures the staged rollout plan so subsequent loop runs
    (or whoever picks up execution) don't redo the planning work. Covers:
    
    - Phase 0: pre-flight (CNI compat, flow log enablement).
    - Phase 1: discovery (per-namespace egress enumeration).
    - Phase 2: allowlist authoring.
    - Phase 3: staged rollout (1 staging → 1 prod → fan out).
    - Phase 4: steady-state (Kyverno schema enforcement, monthly review).
    
    Dependencies:
    - DEVOP-589 (Harbor proxy-cache) must land before Phase 2 or the
      allowlists will churn.
    - DEVOP-588 (Kyverno on all clusters) is a soft dep for Phase 4.
    
    This PR adds the doc only. No NetworkPolicy is deployed.
    
    Linear: https://linear.app/alloralabs/issue/DEVOP-579
    
    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
    srt0422 and claude committed May 22, 2026
    3607f95
  2. DEVOP-579: address cubic review — flag suspect egress, 48h soak, runbook hook, ingress in scope
    
    Four findings from cubic addressed:
    
    1. tickets/devop-579-network-policy-rollout.md:33 (P2) — Phase 1
       discovery checklist now explicitly enumerates suspect egress
       destinations to flag for incident review (webhook receivers,
       pastebins, ngrok/tunnel services, 169.254.169.254 / cloud
       metadata, residential dynamic-DNS). Each flagged destination
       gets an owner-review gate before allowlist inclusion.
    
    2. tickets/devop-579-network-policy-rollout.md:52 (P1) — Phase 3
       staged rollout soak windows changed from 24h to the 48h spec'd
       by DEVOP-579, and now require a clean soak before advancing.
    
    3. tickets/devop-579-network-policy-rollout.md:64 (P2) — Phase 4
       steady-state now mandates documenting the rollout, allowlist
       layout, rollback command, and on-call escalation path in
       SECURITY-RUNBOOK.md (DEVOP-571).
    
    4. tickets/devop-579-network-policy-rollout.md:74 (P2) — Ingress
       default-deny is no longer out-of-scope. Added a dedicated
       section laying out the parallel ingress cohort (same Phases 0–4
       shape with ingress-specific discovery, allowlist patterns,
       slower production rollout because ingress blast-radius is
       higher, and Kyverno asserting both directions in Phase 4).
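
The suspect-destination flagging from finding 1 could be sketched as a small classifier. This is a minimal illustration, not the checklist's actual tooling: the function name `flag_destination` and every domain suffix below are hypothetical placeholders, and a real review would maintain its own lists.

```python
import ipaddress

# Hypothetical, illustrative suffix lists (assumptions, not from the rollout doc).
SUSPECT_SUFFIXES = {
    "webhook-receiver": ("webhook.site",),
    "pastebin": ("pastebin.com",),
    "tunnel": ("ngrok.io", "ngrok-free.app"),
    "dynamic-dns": ("duckdns.org", "no-ip.org"),
}

# 169.254.169.254 (cloud metadata) sits inside the IPv4 link-local range.
LINK_LOCAL = ipaddress.ip_network("169.254.0.0/16")


def flag_destination(dest):
    """Return a review category if dest (IP or FQDN) looks suspect, else None."""
    try:
        ip = ipaddress.ip_address(dest)
    except ValueError:
        ip = None  # not an IP literal, treat as a DNS name

    if ip is not None:
        return "cloud-metadata" if ip in LINK_LOCAL else None

    name = dest.lower().rstrip(".")
    for category, suffixes in SUSPECT_SUFFIXES.items():
        for suffix in suffixes:
            # Match the domain itself or any subdomain of it.
            if name == suffix or name.endswith("." + suffix):
                return category
    return None
```

Anything that returns a category would go to the owner-review gate before it can be added to an allowlist; a `None` result only means the destination did not match these example patterns.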
    
    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
    srt0422 and claude committed May 22, 2026
    7b93a18

Commits on May 30, 2026

  1. DEVOP-579: address @gh-allora — L3/L4 flow logs don't carry FQDNs; add DNS-log enablement + join step
    
    @gh-allora flagged that Hubble/Calico egress flow logs are L3/L4 only,
    so the Phase 1 line "enumerate destination CIDRs, DNS names, and ports"
    can't be satisfied from flow logs alone. Confirmed: Hubble flow records
    and Calico flow logs surface src/dst IP, port, and protocol — DNS names
    require either a CoreDNS query log feed or Cilium's L7 DNS visibility
    (which routes pod DNS through the proxy and records resolved FQDNs).
    
    Fix is structural, not cosmetic:
    
    - Phase 0 now has an explicit "enable verbose DNS query logging" step
      alongside flow log enablement, with concrete options for CoreDNS
      (`log` plugin) and Cilium (L7 DNS via `hubble observe --type=dns`),
      plus a retention check so the 7-day baseline is actually queryable
      before Phase 1 starts.
    - Phase 1 line 33 is split into two checklist items: enumerate CIDRs +
      ports from flow logs (the only fields they carry), then resolve to
      FQDNs by joining flow records against the Phase 0 DNS logs on
      (srcPodIP, dstIP) within a short window. Destinations with no DNS
      match (hard-coded IPs, 169.254.169.254, raw cloud-metadata) are
      carried through as IP-only and fall into the existing suspect-
      destination review.
    
    review-fix-loop iteration 1
    reviewer(s): gh-allora (human PR thread)
    file: tickets/devop-579-network-policy-rollout.md:17,33
    
    Co-Authored-By: Claude Opus 4.7 (review-fix-loop) <noreply@anthropic.com>
    srt0422 and claude committed May 30, 2026
    5fb3be4
  2. fix(correctness): Phase 0 CNI/flow-log commands were wrong; replace with real per-CNI enablement
    
    Two problems in the Phase 0 checklist that would have wasted an
    engineer's day before they figured out the doc was wrong:
    
    1. `network-policy-engine (Calico)` and `Cilium's native NPL` are not
       real component names. Felix is Calico's per-node policy enforcer;
       Cilium ships NetworkPolicy enforcement built in (no separate "NPL"
       — NPL means NodePort Local in Antrea/Calico, unrelated to
       NetworkPolicy). The flannel-fallback bullet now correctly says the
       only path forward on flannel-without-policy is a CNI migration to
       Calico or Cilium, since flannel itself cannot enforce
       NetworkPolicies.
    
    2. `calicoctl flow logs enable` is not a calicoctl subcommand. Calico
       OSS flow logs are turned on via the FelixConfiguration CR
       (`spec.flowLogsFileEnabled: true`), and the resulting files land
       under `/var/log/calico/flowlogs/` on each node. Also called out
       that OSS file-based flow logs cover allow/deny only — for richer
       flow context the team needs Calico Enterprise / Calico Cloud, and
       the recommendation is to prefer the Cilium staging cluster for
       baseline capture if the option exists. Antrea enablement (Flow
       Exporter feature gate + flow-aggregator) added for completeness
       since one of our clusters is on Antrea.
    
    review-fix-loop iteration 1
    reviewer(s): review-fix-loop (correctness lens)
    file: tickets/devop-579-network-policy-rollout.md:15-17
    
    Co-Authored-By: Claude Opus 4.7 (review-fix-loop) <noreply@anthropic.com>
    srt0422 and claude committed May 30, 2026
    ef121fa
  3. fix(reliability): pin NetworkPolicy naming convention; rollback runbook now matches actual resource names
    
    The rollback runbook command `kubectl delete networkpolicy default-deny
    -n <ns>` would no-op (NotFound) once ingress lands, because the ingress
    section calls the ingress policy `default-deny-ingress` while the egress
    section never pinned the egress resource name. So:
    
    - An engineer authoring `default-deny.yaml` could legitimately name the
      resource `default-deny-egress`, `egress-default-deny`, or anything
      else. The runbook would silently fail to delete it in an incident.
    - Once both directions are deployed, the runbook needs both rollback
      commands, not one.
    - The Phase 4 Kyverno asserter needs to grep on a deterministic
      resource name to enforce "every namespace has both default-deny
      policies".
    
    Fix is structural: Phase 2 now contains a pinned naming convention
    table that the rollback runbook (Phase 3) and the Kyverno asserter
    (Phase 4) both reference by exact `metadata.name`. As a side effect of
    pinning, also split the egress baseline allows (DNS/NTP) into a
    separate generated policy (`egress-baseline-allow`) so the per-namespace
    `egress-allowlist` only contains workload-specific rules — resolves
    the Phase 2 ambiguity over which baseline rules live in default-deny
    vs allowlist.
    
    Changes:
    - New Phase 2 naming-convention table mapping filename ↔ metadata.name
      ↔ purpose for all five policy kinds (3 egress + 2 ingress).
    - Rollback runbook now lists both `default-deny-egress` and
      `default-deny-ingress` commands and calls out drift as an incident.
    - Phase 4 SECURITY-RUNBOOK hook now references both rollback commands.
    - Phase 4 Kyverno bullet now matches by exact metadata.name from the
      pinned table.
    - Ingress section's Phase 2 substitution now references the same table
      for both file name and resource name.
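
The "every namespace has both default-deny policies" assertion can be modelled in a few lines. This is a Python sketch of the check only, not the Kyverno policy itself; the function name and the input shape (namespace mapped to a list of NetworkPolicy `metadata.name` values) are assumptions made for illustration.

```python
# The two pinned names come from the commit message above; matching is by
# exact metadata.name, which is why the naming convention has to be fixed.
REQUIRED_POLICY_NAMES = frozenset({"default-deny-egress", "default-deny-ingress"})


def missing_default_denies(policies_by_namespace):
    """Map each non-compliant namespace to the required policy names it lacks."""
    gaps = {}
    for namespace, names in policies_by_namespace.items():
        missing = sorted(REQUIRED_POLICY_NAMES - set(names))
        if missing:
            gaps[namespace] = missing
    return gaps
```

A namespace whose egress deny was named `egress-default-deny` would be reported as missing `default-deny-egress`, which is the drift the pinned table is meant to surface.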
    
    review-fix-loop iteration 1
    reviewer(s): review-fix-loop (reliability lens)
    file: tickets/devop-579-network-policy-rollout.md:52,80,87,112,122
    
    Co-Authored-By: Claude Opus 4.7 (review-fix-loop) <noreply@anthropic.com>
    srt0422 and claude committed May 30, 2026
    5281060
  4. fix(correctness): CoreDNS log plugin doesn't carry response IPs — switch to dnstap (full)
    
    cubic flagged that my iter-1 Phase 0 DNS-log instruction was broken:
    the CoreDNS `log` plugin emits client IP + query name + response code
    but NOT the answer-section A/AAAA IPs, so the
    `(srcPodIP, dstIP)` join described in Phase 1 has nothing on the DNS
    side to match `dstIP` against. Confirmed — `log`'s format is per the
    CoreDNS docs, and resolved IPs only appear in the actual DNS message
    response (the answer section).
    
    Fix is to use the `dnstap` plugin with the `full` flag, which streams
    wire-format DNS messages (request + response, including the answer
    section) to a Unix socket or TCP collector. A dnstap collector
    (`golang-dnstap`, `dnstap-receiver`) decodes those into
    `(timestamp, client_pod_ip, query_name, response_ips[])` records that
    can actually be joined against flow-log destinations. The Cilium
    `hubble observe --type=dns` path was already correct because Hubble
    records FQDN and answer IPs together.
    
    Changes:
    - Phase 0 DNS-capture bullet now specifies `dnstap ... full` for
      CoreDNS, names the collector requirement, and calls out explicitly
      that the query-only `log` plugin is insufficient (so a future
      reader who has read the old docs doesn't reach for it).
    - Phase 1 resolve-to-FQDN bullet now describes the join key
      accurately: `srcPodIP == DNS client IP, dstIP ∈ DNS response answer
      IPs`, instead of pretending `log` output has the answer IPs.
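
The join described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record field names mirror the `(timestamp, client_pod_ip, query_name, response_ips[])` tuple from the commit message, but the dict layout, the function name `resolve_fqdns`, and the 5-minute default window are hypothetical (the doc only says "a short window").

```python
from datetime import datetime, timedelta


def resolve_fqdns(flows, dns_records, window=timedelta(minutes=5)):
    """Attach FQDNs to flow records.

    flows: dicts with ts, src_pod_ip, dst_ip, dst_port.
    dns_records: dicts with ts, client_pod_ip, query_name, response_ips.
    A DNS record matches a flow when the DNS client IP equals the flow's
    source pod IP, the flow's destination IP is among the answer IPs, and
    the DNS response precedes the flow by at most `window`.
    """
    # Index DNS answers by (client pod IP, answer IP) for constant-time lookup.
    index = {}
    for rec in dns_records:
        for ip in rec["response_ips"]:
            index.setdefault((rec["client_pod_ip"], ip), []).append(rec)

    resolved = []
    for flow in flows:
        candidates = index.get((flow["src_pod_ip"], flow["dst_ip"]), [])
        names = sorted({
            rec["query_name"]
            for rec in candidates
            if timedelta(0) <= flow["ts"] - rec["ts"] <= window
        })
        # Flows with no DNS match are carried through as IP-only.
        resolved.append({**flow, "fqdns": names, "ip_only": not names})
    return resolved
```

Flows that come back with `ip_only` set (hard-coded IPs, 169.254.169.254) are the ones that fall into the suspect-destination review.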
    
    review-fix-loop iteration 2
    reviewer(s): cubic-dev-ai (PR thread PRRT_kwDOLZ5Xss6F4Gnj)
    file: tickets/devop-579-network-policy-rollout.md:18-21,38
    
    Co-Authored-By: Claude Opus 4.7 (review-fix-loop) <noreply@anthropic.com>
    srt0422 and claude committed May 30, 2026
    ee15fee

Commits on Jun 5, 2026

  1. 1a5cafe
  2. fix(correctness): rollback runbook must reflect K8s NetworkPolicy isolation rule
    
    `kubectl delete networkpolicy default-deny-egress -n <ns>` alone does NOT
    restore all egress: per the Kubernetes isolation rule, any policy whose
    `policyTypes` includes Egress and whose `podSelector` matches the pod
    keeps that pod isolated, with the union of allow rules across all
    selecting policies forming the entire allow set. Since this rollout
    puts a per-namespace `egress-allowlist` (and `egress-baseline-allow`)
    in every namespace by design, deleting just the deny leaves the
    allowlist as a strict constraint — the pod stays isolated and only
    the allowlist's allows are permitted.
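
The isolation rule can be illustrated with a toy model. This is a deliberately simplified sketch, not how any CNI evaluates policy: selectors are plain `matchLabels` dicts, destinations are opaque strings, and the names `selects`, `egress_allowed`, and `ALLOW_ALL` are hypothetical.

```python
ALLOW_ALL = "*"  # stands in for an empty egress rule that matches everything


def selects(pod_selector, pod_labels):
    """An empty selector matches every pod; otherwise all labels must match."""
    return all(pod_labels.get(k) == v for k, v in pod_selector.items())


def egress_allowed(pod_labels, policies, destination):
    """Apply the isolation rule to a single pod and destination."""
    selecting = [
        p for p in policies
        if "Egress" in p["policy_types"] and selects(p["pod_selector"], pod_labels)
    ]
    if not selecting:
        return True  # no selecting policy: the pod is not isolated for egress
    # Isolated: only the union of allows across selecting policies is permitted.
    return any(
        ALLOW_ALL in p["egress_allow"] or destination in p["egress_allow"]
        for p in selecting
    )


deny = {"name": "default-deny-egress", "pod_selector": {},
        "policy_types": ["Egress"], "egress_allow": set()}
allowlist = {"name": "egress-allowlist", "pod_selector": {},
             "policy_types": ["Egress"], "egress_allow": {"db.internal:5432"}}
```

With these two sample policies, removing `deny` from the list changes nothing for a destination outside the allowlist: `allowlist` still selects the pod, so the pod stays isolated.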
    
    Rewrites the Phase 3 rollback section to teach the isolation rule
    first, then gives two named rollback shapes:
    
      1) primary: apply an additive `emergency-allow-all-egress` override
         (`policyTypes: [Egress]`, `egress: [{}]`) — works regardless of
         how many other policies select the pod, leaves the deny /
         allowlist / baseline in place for audit.
      2) fallback: delete every egress-direction policy in the namespace
         (for cases where the policy framework itself is broken).
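
The primary rollback shape could be generated like this. It is a sketch, not the runbook's actual manifest: `policyTypes: [Egress]` and `egress: [{}]` come from the commit message, while the empty `podSelector` (every pod in the namespace) and the helper name are assumptions.

```python
import json


def emergency_allow_all_egress(namespace):
    """Build the additive allow-all egress override for one namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "emergency-allow-all-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},          # assumption: select every pod in the namespace
            "policyTypes": ["Egress"],
            "egress": [{}],             # one empty rule allows all egress traffic
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so the output can be piped to `kubectl apply -f -`.
    print(json.dumps(emergency_allow_all_egress("example-namespace"), indent=2))
```

Because allow rules are unioned across selecting policies, applying this override opens egress without touching the deny, allowlist, or baseline policies, which stay in place for audit.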
    
    Same correction applied to ingress. Phase 4 SECURITY-RUNBOOK.md
    checklist updated so the steady-state runbook documents the same
    isolation rule plus both rollback shapes — closes the gap that would
    have on-call following the old "delete the deny, you're done" advice.
    
    Findings ref: int-correctness-1 (self-review iter 1).
    Scott Terry authored and srt0422 committed Jun 5, 2026
    1e77f19