ROX-28326: expose vulnerability metrics#15058
Conversation
Images are ready for the commit at ee180cf.
Codecov Report - Attention: Patch coverage is

@@            Coverage Diff             @@
##           master   #15058      +/-   ##
==========================================
+ Coverage   48.74%   48.84%   +0.09%
==========================================
  Files        2585     2599      +14
  Lines      190093   190898     +805
==========================================
+ Hits        92670    93252     +582
- Misses      90147    90337     +190
- Partials     7276     7309      +33
Hey @parametalol - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider exploring the mentioned optimization opportunities (e.g., narrower queries) to mitigate potential performance impacts from frequent database reads.
- Ensure sac.WithAllAccess is the appropriate context for the background telemetry task accessing deployment data.
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟡 Testing: 2 issues found
- 🟡 Complexity: 2 issues found
- 🟢 Documentation: all looks good
central/cve/telemetry/singleton.go
Outdated
metricExpressions := parseAggregationExpressions(env.AggregateVulnMetrics.Setting())
if metricExpressions == nil {
    return
}
instance = &vulnerabilityMetricsImpl{
    ds:                deploymentDS.Singleton(),
    metricExpressions: metricExpressions,
    trackFunc:         metrics.SetAggregatedVulnCount,
}
for metricName, expressions := range instance.metricExpressions {
    labels := getMetricLabels(expressions)
    metrics.RegisterVulnAggregatedMetric(metricName, labels)
}
How I understand it is that we give users the power to define a set of metrics and labels themselves, in the text of the env var.
First, I would rather predefine a set of metrics for the users based on business scenarios and allow them to filter and aggregate freely in external tools.
Provided we must do it the way it is done in the code, we must have validation for the user input. A wrongly defined metric can cause panics and collisions with existing metrics. Moreover, we won't have any control over the naming scheme, and for each customer we would need to do additional checks on the env vars to support them with their issues regarding this functionality.
As the PR description does not explain why we want to give such freedom to the users, I'd like to ask you to add that text with some examples.
Your understanding is correct: we want to give the user the possibility to configure the exposed metrics by selecting the aggregation labels and by filtering the resources.
I updated the description.
central/cve/telemetry/singleton.go
Outdated
func (h *vulnerabilityMetricsImpl) Stop() {
    if h != nil {
        close(h.stopSignal)
    }
}
Calling Stop twice would cause a panic. Maybe you could use concurrency.Signal or context.Context instead.
This is unfortunately the common central pattern. We do not call Stop or Start twice.
I am working on a better approach for handling periodic tasks, but that's a separate topic. I didn't want to introduce anything new in this PR.
central/cve/telemetry/metrics_map.go
Outdated
        }
    }
}
result.WriteString("_total")
You propose the name to have a _total suffix while the type of the prometheus metric is a gauge. The naming convention uses _total for accumulating types (counters). Here we work with a gauge, so I would rather expect something similar to "num_something" or "something_current".
The metrics report the total number of vulnerabilities, aggregated by some set of properties. So _total makes perfect sense.
Maybe, but then choosing the gauge as a type makes no sense, as it is not a total by definition but a current value: gauges represent a snapshot of something, not a time span.
When explaining this metric to a person you would say something like: "For this cluster (or any other filter), we currently have ### CVEs affecting all the images in the system."
I think that:
- keeping it as a gauge is fine when renamed,
- changing it to a counter and keeping the _total suffix is also fine, but it would require a different interpretation of the metric.
A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.
Gauge is the right metric type to represent the current total number of vulnerabilities, IMHO.
I think what Piotr is saying (and I would agree) is that by Prometheus conventions, all counter metrics shall be suffixed by _total. Since the metric here is a gauge, we should choose a different suffix to comply with Prometheus naming conventions.
Ah, I see now! What about _vulnerabilities?
For example:
- Cluster_vulnerabilities
- Cluster_Namespace_vulnerabilities
@vikin91 , @stehessel , please also suggest the better way of making the metric names, as they may collide:
Both Deployment=*3 and Deployment=?3 expressions produce the same metric name
Deployment_eq__3_vulnerabilities.
I see two ways:
- keep the expressions as is, but escape the non-alphanumeric characters (requires prometheus >=v3.0.0);
- replace the known characters with some mnemonics, e.g.: * to _x_, ? to _y_, etc.
My suggestion would be to scrap the labels and filters from the metric name altogether. So something like rox_central_vulnerability_count{cluster="...", namespace="...", severity="..."}.
My idea, that I tried to share in the meeting on Wednesday, is that we should not use env vars for the configuration, but instead have a configmap. This configmap would allow the users to specify the name of metric and configure it as they like (i.e., set the labels in the first phase). Example:
# not necessarily a valid yaml syntax
- vulnMetrics:
  - name: "whatever1"
    labels: ["cluster", "namespace"]
  - name: "beautiful weather"
    labels: ["cluster", "namespace", "severity"]
  # later, we could have here filters: "namespace='abc*'", "severity>2"

Next, we would read that config map and create the following metrics with our prefix and suffix:
- rox_central_num_vulnerabilities_whatever1_current
- rox_central_num_vulnerabilities_beautiful_weather_current
That would allow us to:
- have deterministic metric names and clear expectations regarding labels,
- easily validate user input (and limit it if needed), which also solves the duplicates problem,
- have as many metrics as we want without waiting for a new release,
- use our own prefix (for filtering those metrics from the diag-bundle),
- use the correct suffix to stick to the naming convention.
Moreover, going forward into the future, the grammar for the filtering would be much simpler as it would be divided into parts - one for each label.
I like your idea, and I will try to implement it, though not with a ConfigMap but with a storage configuration object.
central/cve/telemetry/metrics_map.go
Outdated
// "Namespace=abc,Severity": map[metricName][]expression{
//     "Namespace_eq_abc_Severity_total": {"Namespace=abc", "Severity"},
I am not fully sure what is being aggregated here, but it feels that prometheus provides the constructs to achieve this.
group (metric_name{namespace="abc"}) by (severity) or sum (metric_name{namespace="abc"}) by (severity)
I would appreciate an explanation why we cannot do it this way.
We do not want to expose everything for later aggregation in prometheus because of the data volume. We do want, though, to aggregate something according to the user configuration. So this is a mixed approach: prometheus will aggregate on top of what is aggregated here.
central/cve/telemetry/expression.go
Outdated
    "github.com/stackrox/rox/pkg/glob"
)

var ops = []string{"!=", "=", "<=", ">=", "<", ">"} // The order matters!
Explain how and why it matters.
)

// Label is an alias because the central/metrics package cannot import it.
type Label = string
Why not use the prometheus type for the label?
Which one? I see only type Labels map[string]string.
central/metrics/central.go
Outdated
Namespace: metrics.PrometheusNamespace,
Subsystem: metrics.CentralSubsystem.String(),
Name:      name,
Help:      "The discovered CVEs",
Sorry, but this needs to be more explanatory. From the examples in the PR description, I see that you offer a positive number as the value of the metric, so it must rather be "the number of discovered CVEs". But that is still imprecise (does it contain duplicates? how large is the time window?), so I'd like to ask you to improve this help message.
I added a more detailed description.
The new description example:
# HELP rox_central_Cluster_Namespace_Severity_total The total number of discovered CVEs aggregated by Cluster,Namespace,Severity and gathered every 1h0m0s
# TYPE rox_central_Cluster_Namespace_Severity_total gauge
vikin91
left a comment
In its current state I can only guess what this PR does and why it was implemented like this. I must ask you to improve the PR description so that it is clear:
- what is the business use case (use examples)
- how those metrics would help the users and how frequently they will be looked at
- how the users will consume those metrics (we don't have constant names, so I am super curious about that)
- why we cannot use aggregation built-in prometheus and grafana and we must offer our own aggregation language for this
I also left detailed comments but it is difficult to evaluate when the goal of the entire PR is unclear.
central/cve/telemetry/metrics_map.go
Outdated
// Example:
//
// "Namespace=abc,Severity": map[metricName][]expression{
//     "Namespace_eq_abc_Severity_total": {"Namespace=abc", "Severity"},
Why add the keys in the metric name as well as the labels? Wouldn't a hard coded metric name with dynamic labels be sufficient? That would also make it easier to create Grafana dashboards.
Let's consider this configuration:
Cluster=*prod,IsFixable=true,Severity=CRITICAL_VULNERABILITY_SEVERITY|Cluster=*prod,IsFixable=true,Severity=HIGH_VULNERABILITY_SEVERITY|Severity.
A single hardcoded metric would not work, because the expectation is to get two different aggregations. But indeed, the expression arguments add complexity for further aggregation. The only disadvantage of not including the arguments is, probably, a potential confusion of implicit filtering, as the metric name won't tell about the applied filter. But the metric HELP will, so removing the arguments from the metric name should be fine. Thanks.
But why do we need to define multiple metrics here? In this example, the user could just define a single base metric
Cluster=*prod,IsFixable=true,Severity
and then use Prometheus to get the other derived metrics.
To filter out the LOW and MODERATE vulns for one metric and include them for the other metric.
Another example: a coarse metric for all clusters (like the default setting), and a fine metric for some specific cluster, which includes components. Cluster,Namespace,Severity|Cluster=reservation,Namespace,ComponentName=*apache*,Severity,CVSS.
But are these real world use cases or are they contrived? Note that if push comes to shove, one can also use relabelling rules in Prometheus to drop metrics. So I don't see a big issue offering just a single metric that is configurable via labels - seems flexible enough to avoid crashing Prometheus, and yet simple enough to explain to customers.
Another option to avoid names like rox_central_Cluster_Namespace_Severity_total would be to add an identifier label that is derived from the query config like rox_central_vulnerability_count{identifier="<ID>", ...}.
Metric name is not provided by the user.
central/cve/telemetry/expression.go
Outdated
    "github.com/stackrox/rox/pkg/glob"
)

var ops = []string{"=", "!=", "<=", ">=", "<", ">"} // The order matters, as the
I think we should point out here that the filtering logic is a performance optimization. I wonder if it is a premature one: it seems to complicate the code significantly, and the metric volume could otherwise be controlled by the labels.
I'm not saying we should not implement filtering, especially considering it's already done now, but I would not have implemented it myself, and I don't think it provides much additional value over the label selection alone.
I still have this PR on the radar, but I see that it grew to 5k lines added 😅 That may be difficult to review. What is your plan for reviewing it (it is okay if you don't have one yet)?
    })
}

func (ar *aggregatorRunner) run(tracker common.Tracker) {
Move this to the tracker. Make the ticker part of the tracker as well.
vikin91
left a comment
I reviewed only parts of the code: I skipped the expressions and the arithmetic parts, and also all tests.
Unfortunately, the code is pretty demanding to read and follow, and I am afraid that it will be difficult to maintain. I am not sure yet how we can bring it to the finish line in its current form. I will think about it and give it another look after you consider some simplifications.
central/config/service/service.go
Outdated
matcher.Singleton().SetRegexes(regexes)
go reprocessor.Singleton().RunReprocessor()

if err := aggregator.Singleton().Reconfigure(req.GetConfig().GetPrivateConfig().GetPrometheusMetricsConfig()); err != nil {
Is there any reason why the aggregator cannot be a member of s (serviceImpl)?
The only reason is the pattern of the code around.
gcp.Singleton().Start()
administrationEventHandler.Singleton().Start()

customMetrics.Singleton().Start()
Something does not fit right here. The aggregator is a part of the config service, so I would expect the config service to be started here, not the aggregator. Why can't we keep the hierarchy here?
The aggregator is not a part of the config service. The latter allows for reconfiguring the former in runtime.
// aggregatedRecord is a single gauge metric record.
type aggregatedRecord struct {
    labels prometheus.Labels
    total  int
}
Can't we reuse a type that is built into the prometheus library?
Please suggest a type.
func makeAggregator(mcfg MetricsConfiguration, labelOrder map[Label]int) *aggregator {
    aggregated := make(map[MetricName]map[aggregationKey]*aggregatedRecord)
    for metric := range mcfg {
        aggregated[metric] = make(map[aggregationKey]*aggregatedRecord)
    }
    return &aggregator{aggregated, mcfg, labelOrder}
}
Why is this not a constructor, like New(...) *aggregator?
Because the package is called common.
// aggregator computes the aggregation result.
type aggregator struct {
    result map[MetricName]map[aggregationKey]*aggregatedRecord
    mcfg   MetricsConfiguration
This is type MetricsConfiguration map[MetricName]map[Label]Expression, and it has its own type alias, but the other one does not. Why can't we use map[MetricName]map[Label]Expression as the type here?
Because MetricsConfiguration has methods and is used in other places as such.
    ar.registry,
    iv.GetFilter(),
    iv.GetMetrics(),
    time.Minute*time.Duration(iv.GetGatheringPeriodMinutes())); err != nil {
Why GetGatheringPeriodMinutes() cannot return time.Duration?
This is a storage field getter, the storage type is uint32.
        "failed to parse a condition for metric %q with label %q: %v",
        metric, label, err)
}
expr = append(expr, condition)
There are many temp variables here just to collect the result. What you could do is:

expr = append(expr, condition)
result[MetricName(metric)][Label(label)] = expr

and drop the labelExpression and related length checks.
This would assign to a nil map and panic. A metric with no labels should not be added to the result.
// ...
// This key is used to uniquely identify and aggregate findings that match the same
// set of label expressions.
package common
I think the name does not represent what this pkg is doing.
The package is an assembly of tools for trackers to use. Please suggest a better name, and/or how to split the package.
Closing for splitting.
Description
This is a non-optimized approach to exposing vulnerability metrics, aggregated and filtered according to the configuration provided via the configuration API.
Problem statement
The vulnerabilities identified by the scanner are exposed only as a snapshot of the current state. We need to allow users to monitor the state of the vulnerabilities in their workloads.
We do not have an exact list of use cases for how users will aggregate the available data, but we can already identify:
- the SRE cluster case, where the folks have developed their custom GraphQL gatherer to expose the vulnerabilities at a very detailed granularity, like in
- for larger customers, this configuration would be too heavy to handle;
- the gathering should have an option to be disabled.
We could also imagine a need for aggregation only by namespace or by cluster and deployment name, or by the vulnerable image component name.
Suggested implementation
- prometheusMetricsConfig section in the config.privateConfig object;
- /metrics path on the API endpoint: secured, authorized for admins only.

The list of the available labels needs to be documented. It includes Cluster, Namespace, CVE, ComponentName and other image vulnerability properties.
Other considerations
Potential optimization opportunity:
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Unit tests.
Manual test:
curl https://localhost:8000/v1/config -u admin:password -k | jq '.privateConfig.prometheusMetricsConfig'

{
  "gatheringPeriodHours": 1,
  "metrics": [
    {
      "name": "my_cluster_severity_vulns",
      "labels": [
        { "name": "Cluster", "expression": null },
        { "name": "Severity", "expression": null }
      ]
    },
    {
      "name": "my_severity_vulns",
      "labels": [
        {
          "name": "Severity",
          "expression": { "operator": "=", "argument": "{HIGH,CRITICAL}*" }
        }
      ]
    }
  ]
}

Prometheus User Workload CVEs query example:


VS Central UI:
Current dependencies on/for this PR: