
Add composite indexes for alert decision filters #4468

Closed

jimstrang wants to merge 1 commit into crowdsecurity:master from jimstrang:perf/alert-decision-composite-indexes

Conversation

@jimstrang

Overview

This PR addresses the /v1/alerts slowdowns and "database is locked" reports in #4464 by adding a few composite indexes on the decisions table. The reports started showing up after the 1.7.8 decision-stream chunking changes, but I do not think the chunked stream is the slow query itself. My read is that it made the SQLite/LAPI workload more likely to overlap with already-expensive alert-list queries, which is where the lock symptoms surface.

The slow path I was able to reproduce directly is an alert-list query that combines decision filters such as decision_type, origin, and has_active_decision. Today those filters are translated into separate EXISTS checks against decisions. With only the existing alert_decisions index, SQLite can find the decisions attached to an alert, but it still has to walk all of them to check type, origin, or until.

That gets painful when an alert has a large number of linked decisions. In the database I used for validation, one alert was linked to roughly 15,000 decisions (CAPI community blocklist), and the original query shape could take tens of seconds.
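To make the slow shape concrete, here is a minimal, self-contained sketch of the query pattern described above, using a deliberately simplified schema (the real tables have many more columns; the table and column names here just mirror the ones mentioned in this PR). Each decision filter becomes its own correlated EXISTS subquery, and with only a single-column index on alert_decisions, SQLite has to walk every linked decision row to check type, origin, or until:

```python
import sqlite3

# Simplified stand-in for the crowdsec schema (hypothetical, for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE alerts (id INTEGER PRIMARY KEY);
CREATE TABLE decisions (
    id INTEGER PRIMARY KEY,
    alert_decisions INTEGER REFERENCES alerts(id),
    type TEXT, origin TEXT, until DATETIME
);
-- Only the existing single-column index, as before this PR.
CREATE INDEX alert_decisions_idx ON decisions (alert_decisions);
""")
conn.execute("INSERT INTO alerts VALUES (1)")
# One alert with many linked decisions, like the CAPI blocklist case.
conn.executemany(
    "INSERT INTO decisions (alert_decisions, type, origin, until) VALUES (1, ?, ?, ?)",
    [("ban", "CAPI", "2099-01-01")] * 1000,
)

# The query shape: one correlated EXISTS per decision filter. The index
# narrows the subquery to the alert's decisions, but every matching row
# still has to be read to evaluate type/origin/until.
rows = conn.execute("""
SELECT alerts.id FROM alerts
WHERE EXISTS (SELECT 1 FROM decisions d
              WHERE d.alert_decisions = alerts.id AND d.type = 'ban')
  AND EXISTS (SELECT 1 FROM decisions d
              WHERE d.alert_decisions = alerts.id AND d.origin = 'CAPI')
  AND EXISTS (SELECT 1 FROM decisions d
              WHERE d.alert_decisions = alerts.id AND d.until > datetime('now'))
""").fetchall()
print(rows)  # [(1,)]
```

At 1,000 rows this is instant; the point is only the shape — at ~15,000 decisions per alert the per-row checks are what blow up.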

What Changed

This adds composite indexes for the columns used by the alert decision filters:

  • (alert_decisions, type)
  • (alert_decisions, origin)
  • (alert_decisions, until)

The schema change is in pkg/database/ent/schema/decision.go, with the generated migration schema updated in pkg/database/ent/migrate/schema.go. Existing installations should get the indexes through the normal Ent auto-migration path on startup.
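The SQL equivalent of what the migration produces can be sketched as follows (the index names match the query-plan output shown later in this PR; the schema is again a simplified stand-in). Once the composite indexes exist, SQLite can answer each EXISTS probe from the index alone:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE decisions (
    id INTEGER PRIMARY KEY,
    alert_decisions INTEGER, type TEXT, origin TEXT, until DATETIME
);
-- SQL equivalents of the three composite indexes added by this PR.
CREATE INDEX decision_alert_decisions_type   ON decisions (alert_decisions, type);
CREATE INDEX decision_alert_decisions_origin ON decisions (alert_decisions, origin);
CREATE INDEX decision_alert_decisions_until  ON decisions (alert_decisions, until);
""")

# EXPLAIN QUERY PLAN on one of the correlated probes: both referenced
# columns live in the index, so SQLite can use it as a covering index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT 1 FROM decisions WHERE alert_decisions = ? AND type = ?",
    (1, "ban"),
).fetchall()
print(plan[0][3])  # detail column mentions the covering index
```

The exact wording of the detail column varies by SQLite version, but it should name decision_alert_decisions_type as a covering index for this probe.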

Why This Approach

There is another possible fix that keeps everything in code: combine the decision-related alert filters into one HasDecisionsWith(...) call so Ent emits a single EXISTS subquery instead of several. That also performs well.

I did not take that route here because it can change the meaning of /v1/alerts filters.

The current query matches alerts that have at least one linked decision for each filter, even if those filters match different decision rows. The combined-predicate version would require the same decision row to satisfy all filters at once. That may be the intended behavior, but it is a semantic decision rather than just a performance fix, and I did not see existing tests or documentation that clearly pin it either way.

If maintainers can clarify the intended filter semantics here, I think that code-only predicate change would still be worth exploring. In other words, should multiple decision filters on /v1/alerts mean "find alerts where one linked decision matches all of these conditions" or "find alerts where the linked decisions collectively satisfy these conditions"?
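The semantic difference can be shown concretely with a tiny constructed example (hypothetical data, simplified schema): one alert whose two decisions each satisfy a different filter. The current separate-EXISTS shape matches it; the combined-predicate shape does not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE alerts (id INTEGER PRIMARY KEY);
CREATE TABLE decisions (
    id INTEGER PRIMARY KEY,
    alert_decisions INTEGER, type TEXT, origin TEXT
);
INSERT INTO alerts VALUES (1);
-- Two decisions on the same alert: each satisfies ONE of the two filters.
INSERT INTO decisions (alert_decisions, type, origin) VALUES (1, 'ban', 'cscli');
INSERT INTO decisions (alert_decisions, type, origin) VALUES (1, 'captcha', 'CAPI');
""")

# Current behavior: one EXISTS per filter; different rows may satisfy them.
separate = conn.execute("""
SELECT id FROM alerts
WHERE EXISTS (SELECT 1 FROM decisions d
              WHERE d.alert_decisions = alerts.id AND d.type = 'ban')
  AND EXISTS (SELECT 1 FROM decisions d
              WHERE d.alert_decisions = alerts.id AND d.origin = 'CAPI')
""").fetchall()

# Combined-predicate version: a single EXISTS requires one decision row
# to satisfy all filters at once.
combined = conn.execute("""
SELECT id FROM alerts
WHERE EXISTS (SELECT 1 FROM decisions d
              WHERE d.alert_decisions = alerts.id
                AND d.type = 'ban' AND d.origin = 'CAPI')
""").fetchall()

print(separate, combined)  # [(1,)] []
```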

The index-only change keeps the current behavior and just gives the database a better access path.

Validation

I tested this against a SQLite database from an affected setup with:

  • 107 alerts
  • 30,101 decisions
  • one alert linked to about 15,000 decisions

For the original /v1/alerts decision-filter query shape:

Case                   Runtime
Before these indexes   ~39s
After these indexes    <1ms

The row count stayed the same before and after the indexes, so this is a performance-only change.

After the change, SQLite uses covering indexes for the correlated decision lookups, for example:

SEARCH decisions EXISTS USING COVERING INDEX decision_alert_decisions_until  (alert_decisions=? AND until>?)
SEARCH decisions EXISTS USING COVERING INDEX decision_alert_decisions_type   (alert_decisions=? AND type=?)
SEARCH decisions EXISTS USING COVERING INDEX decision_alert_decisions_origin (alert_decisions=? AND origin=?)

On the same snapshot, creating the three indexes took about 125ms total for ~30k decisions. Larger databases will take longer, but this should be a one-time migration cost.

Refs #4464

@github-actions

@jimstrang: There is no 'kind' label on this PR. You need a 'kind' label to generate the release automatically.

  • /kind feature
  • /kind enhancement
  • /kind refactoring
  • /kind fix
  • /kind chore
  • /kind dependencies

I am a bot created to help the crowdsecurity developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the BirthdayResearch/oss-governance-bot repository.

@github-actions

@jimstrang: There are no area labels on this PR. You can add as many areas as you see fit.

  • /area agent
  • /area local-api
  • /area cscli
  • /area appsec
  • /area security
  • /area configuration

@jimstrang
Author

/kind fix
/area local-api

@jimstrang
Author

Small update: I’m moving this to draft for now.

The indexes still look useful, and they do improve the access path for /v1/alerts decision filters in isolation. But after more testing against a copy of an affected SQLite DB, I can still reproduce the Homepage-style bans query timing out under concurrent LAPI activity on current master, including with WAL enabled.

So I don’t want to present this as the confirmed fix for #4464 yet. It may still be part of the fix, or a worthwhile performance improvement on its own, but I’d like to verify the concurrent case before asking for a full review.

