Skip to content

Fix cmfg reg#6213

Closed
jyejare wants to merge 292 commits intofeast-dev:masterfrom
opendatahub-io:fix_cmfg_reg
Closed

Fix cmfg reg#6213
jyejare wants to merge 292 commits intofeast-dev:masterfrom
opendatahub-io:fix_cmfg_reg

Conversation

@jyejare
Copy link
Copy Markdown
Collaborator

@jyejare jyejare commented Apr 1, 2026

What this PR does / why we need it:

Fixes issue in LTS nightly run :

e2e tests failing with latest changes as 'FeatureStore' is not in 'Ready' state.

After Investigating looks like hitting this error in feast instance condition Error: [roles.rbac.authorization.k8s.io](http://roles.rbac.authorization.k8s.io/) "feast-configs-registry-reader" already exists

Which issue(s) this PR fixes:

Checks

  • I've made sure the tests are passing.
  • My commits are signed off (git commit -s)
  • My PR title follows conventional commits format

Testing Strategy

  • Unit tests
  • Integration tests
  • Manual tests
  • Testing is not required for this change

Misc


Open with Devin

dchourasia and others added 30 commits June 24, 2025 00:16
moulalis and others added 27 commits March 16, 2026 16:25
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
Fix: Feast configmap failure due to missing notebook CRD timing
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
- Add label-filtered informer cache (app.kubernetes.io/managed-by=feast-operator)
  for ConfigMaps to prevent cluster-wide cache flooding OOM
- Add GOMEMLIMIT=230MiB (90% of 256Mi limit) as GC safety net
- Add JSON Patch test ops to validate env indices before replace
- Strip managedFields from cached objects via DefaultTransform
- Use shared constants for label keys across operator and tests
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
feat: Auto-Discovery of feast client configmaps to the users/groups defined in feast permissions policies
… apply

Signed-off-by: Ani Lynch <anixlynch@gmail.com>
Signed-off-by: Vanshika Vanshika <vvanshik@redhat.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
…ast-dev#6126)

* feat: Add ServiceMonitor auto-generation for Prometheus discovery

Signed-off-by: ntkathole <nikhilkathole2683@gmail.com>

* fix: Server-side apply

Signed-off-by: ntkathole <nikhilkathole2683@gmail.com>

---------

Signed-off-by: ntkathole <nikhilkathole2683@gmail.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
feat: Manually Add commits from upstream as auto-merge failed
fix: Harden informer cache with label selectors and memory optimizations
fix: Failure of AutoMerge Remote tracking branch upstream master
Fixed the "already exists" error when multiple FeatureStore instances
try to create the same "feast-configs-registry-reader" Role. Changed
from CreateOrUpdate to a check-then-create pattern to avoid conflicts
when the shared Role already exists.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
@jyejare jyejare requested a review from a team as a code owner April 1, 2026 15:15
@jyejare jyejare closed this Apr 1, 2026
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 8 additional findings in Devin Review.

Open in Devin Review

Comment on lines +239 to +266
func (r *FeatureStoreReconciler) countOtherFeatureStoresInNamespace(ctx context.Context, namespace, excludeName string) int {
var list feastdevv1.FeatureStoreList
if err := r.List(ctx, &list, client.InNamespace(namespace)); err != nil {
return -1
}
count := 0
for i := range list.Items {
if list.Items[i].Name != excludeName && list.Items[i].DeletionTimestamp == nil {
count++
}
}
return count
}

func (r *FeatureStoreReconciler) countOtherFeatureStoresInCluster(ctx context.Context, namespace, excludeName string) int {
var list feastdevv1.FeatureStoreList
if err := r.List(ctx, &list); err != nil {
return -1
}
count := 0
for i := range list.Items {
item := &list.Items[i]
if (item.Name != excludeName || item.Namespace != namespace) && item.DeletionTimestamp == nil {
count++
}
}
return count
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 countOtherFeatureStores returning -1 on error causes premature cleanup of shared resources

When the Kubernetes API List call fails (e.g., transient API server error), countOtherFeatureStoresInNamespace and countOtherFeatureStoresInCluster return -1. This value is passed to RemoveNamespaceLabelIfLast and CleanupDiscoverClusterRoleIfLast at infra/feast-operator/internal/controller/access/access.go:61 and infra/feast-operator/internal/controller/access/rbac.go:222, both of which guard cleanup with otherFeatureStoreCount > 0. Since -1 > 0 is false, the guard is bypassed and the namespace label / shared ClusterRole are deleted even though other FeatureStores may still depend on them. This breaks namespace discovery for dashboard applications and removes RBAC for all remaining FeatureStores in the cluster.

Downstream impact in access.go and rbac.go

RemoveNamespaceLabelIfLast (access.go:61) and CleanupDiscoverClusterRoleIfLast (rbac.go:222) both use if otherFeatureStoreCount > 0 { return nil } — a sentinel value of -1 passes this guard and triggers deletion.

Prompt for agents
In infra/feast-operator/internal/controller/featurestore_controller.go, the functions countOtherFeatureStoresInNamespace (lines 239-251) and countOtherFeatureStoresInCluster (lines 253-266) return -1 on List errors. The callers in cleanupOnDeletion (lines 203-222) pass this value directly to RemoveNamespaceLabelIfLast and CleanupDiscoverClusterRoleIfLast, which interpret any value <= 0 as 'no other stores exist' and proceed with deletion.

Fix: Either (a) change the return type to (int, error) so callers can skip cleanup on error, or (b) return a large positive sentinel value (e.g., math.MaxInt32) on error so the > 0 guard in the downstream functions prevents accidental cleanup. For approach (a), update cleanupOnDeletion at lines 205 and 212 to check the error and skip the corresponding cleanup call when the count cannot be determined.

Also update the callers:
- Line 206: access.RemoveNamespaceLabelIfLast should only be called when otherCount was successfully retrieved
- Line 213: access.CleanupDiscoverClusterRoleIfLast should only be called when clusterCount was successfully retrieved
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.