
ROX-28326: expose vulnerability metrics #15058

Closed
parametalol wants to merge 66 commits into master from michael/ROX-28326-expose-cve-metrics

Conversation

@parametalol
Contributor

@parametalol parametalol commented Apr 23, 2025

Description

This is a non-optimized approach to exposing vulnerability metrics, aggregated and filtered according to a configuration provided via the configuration API.

Problem statement

The vulnerabilities identified by the scanner are exposed only as a snapshot of the current state. We need to allow users to monitor the state of the vulnerabilities in their workloads.

We do not have an exact list of use cases describing how users will aggregate the available data, but we can already identify the following:

  • the SRE cluster case, where the team has developed a custom GraphQL gatherer to expose the vulnerabilities at very fine granularity, e.g.:

    acs_cvss{cluster="control-cluster",cve="CVE-2022-23648",deploymentCount="2",imageId="sha256:6220c898c05f6eeb0abc18c14d380c50e1e6a5ca0e12508c811c0e00bbef96a8",imageRegistry="quay.io",imageRemote="openshift-release-dev/ocp-v4.0-art-dev",imageTag="",isFixable="true",namespace="openshift-ovn-kubernetes",operatingSystem="rhel:9",severity="IMPORTANT_VULNERABILITY_SEVERITY"} 7.5
    
  • for larger customers, this configuration would be too heavy to handle;

  • the gathering should have an option to be disabled.

We could also imagine a need for aggregation only by namespace or by cluster and deployment name, or by the vulnerable image component name.

Suggested implementation

  • prometheusMetricsConfig section in the config.privateConfig object;
  • /metrics path on the API endpoint: secured, authorized for admins only.

The list of available labels needs to be documented. It includes Cluster, Namespace, CVE, ComponentName and other image vulnerability properties.

Other considerations

Potential optimization opportunity:

  • Build a narrower deployment and/or image query according to the known resource filters.

User-facing documentation

  • CHANGELOG is updated OR update is not needed
  • documentation PR is created and is linked above OR is not needed

Testing and quality

  • the change is production ready: the change is GA or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Unit tests.

Manual test:

curl https://localhost:8000/v1/config -u admin:password -k | jq '.privateConfig.prometheusMetricsConfig'

{
  "gatheringPeriodHours": 1,
  "metrics": [
    {
      "name": "my_cluster_severity_vulns",
      "labels": [
        {
          "name": "Cluster",
          "expression": null
        },
        {
          "name": "Severity",
          "expression": null
        }
      ]
    },
    {
      "name": "my_severity_vulns",
      "labels": [
        {
          "name": "Severity",
          "expression": {
            "operator": "=",
            "argument": "{HIGH,CRITICAL}*"
          }
        }
      ]
    }
  ]
}
sh$ kubectl logs deploy/central | grep Registered
cve/telemetry: 2025/05/06 19:00:48.973820 singleton.go:46: Info: Registered Prometheus metric "my_cluster_severity_vulns"
cve/telemetry: 2025/05/06 19:00:48.973992 singleton.go:46: Info: Registered Prometheus metric "my_severity_vulns"

sh$ curl https://localhost:8000/metrics -u admin:password -k
# HELP rox_central_my_cluster_severity_vulns The total number of discovered CVEs aggregated by Cluster,Severity and gathered every 1h0m0s
# TYPE rox_central_my_cluster_severity_vulns gauge
rox_central_my_cluster_severity_vulns{Cluster="production",Severity="CRITICAL_VULNERABILITY_SEVERITY"} 409
rox_central_my_cluster_severity_vulns{Cluster="production",Severity="IMPORTANT_VULNERABILITY_SEVERITY"} 885
rox_central_my_cluster_severity_vulns{Cluster="production",Severity="LOW_VULNERABILITY_SEVERITY"} 1490
rox_central_my_cluster_severity_vulns{Cluster="production",Severity="MODERATE_VULNERABILITY_SEVERITY"} 886
rox_central_my_cluster_severity_vulns{Cluster="security",Severity="IMPORTANT_VULNERABILITY_SEVERITY"} 1
rox_central_my_cluster_severity_vulns{Cluster="security",Severity="LOW_VULNERABILITY_SEVERITY"} 483
rox_central_my_cluster_severity_vulns{Cluster="security",Severity="MODERATE_VULNERABILITY_SEVERITY"} 179
# HELP rox_central_my_severity_vulns The total number of discovered CVEs aggregated by Severity and gathered every 1h0m0s
# TYPE rox_central_my_severity_vulns gauge
rox_central_my_severity_vulns{Severity="CRITICAL_VULNERABILITY_SEVERITY"} 409

Prometheus User Workload CVEs query example:
[screenshot: Prometheus query result]
VS Central UI:
[screenshot: Central UI]

Current dependencies on/for this PR:

@openshift-ci

openshift-ci bot commented Apr 23, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@parametalol parametalol changed the title Michael/ROX-28326-expose-cve-metrics ROX-28326: expose CVSS metrics Apr 23, 2025
@rhacs-bot
Contributor

rhacs-bot commented Apr 23, 2025

Images are ready for the commit at ee180cf.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.8.x-990-gee180cf6c5.

@codecov

codecov bot commented Apr 23, 2025

Codecov Report

Attention: Patch coverage is 71.92118% with 114 lines in your changes missing coverage. Please review.

Project coverage is 48.84%. Comparing base (948299e) to head (67f6772).
Report is 64 commits behind head on master.

Files with missing lines Patch % Lines
central/metrics/aggregator/singleton.go 5.00% 38 Missing ⚠️
...etrics/aggregator/image_vulnerabilities/tracker.go 36.73% 28 Missing and 3 partials ⚠️
...entral/metrics/aggregator/common/tracker_config.go 75.00% 20 Missing and 4 partials ⚠️
central/metrics/aggregator/common/condition.go 87.50% 8 Missing and 2 partials ⚠️
central/config/service/service.go 0.00% 4 Missing ⚠️
central/metrics/aggregator/common/common.go 83.33% 4 Missing ⚠️
central/metrics/aggregator/common/metrics_map.go 92.50% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #15058      +/-   ##
==========================================
+ Coverage   48.74%   48.84%   +0.09%     
==========================================
  Files        2585     2599      +14     
  Lines      190093   190898     +805     
==========================================
+ Hits        92670    93252     +582     
- Misses      90147    90337     +190     
- Partials     7276     7309      +33     
Flag Coverage Δ
go-unit-tests 48.84% <71.92%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@parametalol parametalol changed the title ROX-28326: expose CVSS metrics ROX-28326: expose vulnerability metrics Apr 24, 2025
@parametalol parametalol force-pushed the michael/ROX-28326-expose-cve-metrics branch from 8a1b007 to 0d4c015 Compare April 24, 2025 22:27
@parametalol parametalol changed the base branch from master to michael/refactor-pkg-glob April 24, 2025 22:30
@parametalol parametalol force-pushed the michael/ROX-28326-expose-cve-metrics branch from 0d4c015 to 3cf1411 Compare April 24, 2025 22:56
@parametalol parametalol requested a review from stehessel April 25, 2025 22:02
@parametalol parametalol marked this pull request as ready for review April 25, 2025 22:02
Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @parametalol - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider exploring the mentioned optimization opportunities (e.g., narrower queries) to mitigate potential performance impacts from frequent database reads.
  • Ensure sac.WithAllAccess is the appropriate context for the background telemetry task accessing deployment data.
Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟡 Testing: 2 issues found
  • 🟡 Complexity: 2 issues found
  • 🟢 Documentation: all looks good


Base automatically changed from michael/refactor-pkg-glob to master April 28, 2025 10:56
@parametalol parametalol force-pushed the michael/ROX-28326-expose-cve-metrics branch from 0668090 to e13db2f Compare April 28, 2025 15:00
Comment on lines +24 to +36
metricExpressions := parseAggregationExpressions(env.AggregateVulnMetrics.Setting())
if metricExpressions == nil {
	return
}
instance = &vulnerabilityMetricsImpl{
	ds:                deploymentDS.Singleton(),
	metricExpressions: metricExpressions,
	trackFunc:         metrics.SetAggregatedVulnCount,
}
for metricName, expressions := range instance.metricExpressions {
	labels := getMetricLabels(expressions)
	metrics.RegisterVulnAggregatedMetric(metricName, labels)
}
Contributor

How I understand this is that we give users the power to define a set of metrics and labels themselves. The users would do that in the text of the env var.

First, I would rather predefine a set of metrics for the users based on business scenarios and allow them to filter and aggregate freely in external tools.

Provided we must do it the way it is done in the code, we must have validation for the user input. A wrongly defined metric can cause panics and collisions with existing metrics. Moreover, we won't have any control over the naming scheme, and for each customer we would need to do additional checks on the env vars to support them with their issues regarding this functionality.

As the PR description does not explain why we want to give such freedom to the users, I'd like to ask you to add that text with some examples.

Contributor Author

Your understanding is correct: we want to provide the user with the ability to configure the exposed metrics by selecting the aggregation labels and filtering the resources.

I updated the description.

Comment on lines +63 to +107
func (h *vulnerabilityMetricsImpl) Stop() {
	if h != nil {
		close(h.stopSignal)
	}
}
Contributor

Calling Stop twice would cause a panic. Maybe you could use concurrency.Signal or context.Context instead.

Contributor Author

This is, unfortunately, the common Central pattern. We do not call Stop or Start twice.

I am working on a better approach for handling periodic tasks, but that's a separate topic. I didn't want to introduce anything new in this PR.

}
}
}
result.WriteString("_total")
Contributor

You propose the name to have a _total suffix while the type of the Prometheus metric is a gauge. The naming convention uses _total for accumulating types (counters). Here we work with a gauge, so I would rather expect something similar to "num_something" or "something_current".

Contributor Author

The metrics report the total number of vulnerabilities, aggregated by some set of properties. So _total makes perfect sense.

Contributor

Maybe, but then choosing gauge as the type makes no sense: it is not a total by definition but a current value, as gauges represent a snapshot of something, not a time span.

When explaining this metric to a person, you would say something like: "For this cluster (or any other filter), we currently have ### CVEs affecting all the images in the system".

I think that:

  • keeping it as a gauge is fine when renamed,
  • changing it to a counter and keeping the _total suffix is also fine, but it would require a different interpretation of the metric.

Contributor Author

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.

Gauge is the right metric type to represent the current total number of vulnerabilities, IMHO.

Collaborator

I think what Piotr is saying (and I would agree) is that by Prometheus conventions, all counter metrics shall be suffixed by _total. Since the metric here is a gauge, we should choose a different suffix to comply with Prometheus naming conventions.

Contributor Author

Ah, I see now! What about _vulnerabilities?
For example:

  • Cluster_vulnerabilities
  • Cluster_Namespace_vulnerabilities

Contributor Author

@vikin91 , @stehessel , please also suggest a better way of constructing the metric names, as they may collide: both Deployment=*3 and Deployment=?3 expressions produce the same metric name Deployment_eq__3_vulnerabilities.
I see two ways:

  • keep the expressions as is, but escape the non-alphanumeric characters (requires prometheus >=v3.0.0);
  • replace the known characters with some mnemonics, e.g.: * to _x_, ? to _y_ etc.
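The second option can be sketched with a strings.Replacer; mapping "=" to _eq_ follows the existing name scheme, and the exact mnemonics are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// exprReplacer sketches the mnemonic option from the discussion: "*" becomes
// "_x_" and "?" becomes "_y_", so "Deployment=*3" and "Deployment=?3" no
// longer collide in the generated metric name.
var exprReplacer = strings.NewReplacer(
	"*", "_x_",
	"?", "_y_",
	"=", "_eq_",
)

// sanitizeExpr rewrites one label expression into a metric-name-safe form.
func sanitizeExpr(expr string) string {
	return exprReplacer.Replace(expr)
}

func main() {
	fmt.Println(sanitizeExpr("Deployment=*3")) // Deployment_eq__x_3
	fmt.Println(sanitizeExpr("Deployment=?3")) // Deployment_eq__y_3
}
```

Unlike escaping, this keeps the names valid for any Prometheus version, at the cost of slightly less readable names.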

Collaborator

My suggestion would be to scrap the labels and filters from the metric name all together. So something like rox_central_vulnerability_count{cluster="...", namespace="...", severity="..."}.

Contributor

@vikin91 vikin91 Apr 30, 2025

My idea, that I tried to share in the meeting on Wednesday, is that we should not use env vars for the configuration, but instead have a configmap. This configmap would allow the users to specify the name of metric and configure it as they like (i.e., set the labels in the first phase). Example:

# not necessarily a valid yaml syntax
- vulnMetrics:
   -  name: "whatever1"
       labels: ["cluster", "namespace"]
   - name: "beautiful weather"
        labels: ["cluster", "namespace", "severity"]
        # later, we could have here filters: "namespace='abc*'", "severity>2"

Next, we would read that config map and create the following metrics with our prefix and suffix:

  • rox_central_num_vulnerabilities_whatever1_current
  • rox_central_num_vulnerabilities_beautiful_weather_current

That would allow us to:

  • have deterministic metric names and clear expectations regarding labels,
  • easily validate user input (and limit it if needed), which also solves the duplicates problem,
  • have as many metrics as we want without waiting for a new release,
  • use our own prefix (for filtering those metrics from the diag-bundle),
  • use the correct suffix to stick to the naming convention.

Moreover, going forward, the grammar for the filtering would be much simpler, as it would be divided into parts, one for each label.

Contributor Author

I like your idea, and I will try to implement it, though not with a ConfigMap but with a stored configuration object.

Comment on lines +45 to +46
// "Namespace=abc,Severity": map[metricName][]expression{
// "Namespace_eq_abc_Severity_total": {"Namespace=abc", "Severity"},
Contributor

I am not fully sure what is being aggregated here, but it feels like Prometheus provides the constructs to achieve this:

group (metric_name{namespace="abc"}) by (severity) or sum (metric_name{namespace="abc"}) by (severity)

I would appreciate an explanation of why we cannot do it this way.

Contributor Author

We do not want to expose everything for later aggregation in Prometheus because of the data volume. We do want, though, to aggregate something according to the user configuration. So this is a mixed approach: Prometheus will aggregate on top of what is aggregated here.
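This mixed approach can be sketched as a simple in-process pre-aggregation; finding and the label-key function below are hypothetical simplifications of the PR's tracker, not its actual types:

```go
package main

import "fmt"

// finding is a hypothetical, simplified vulnerability record.
type finding struct {
	Cluster, Namespace, Severity string
}

// aggregate counts findings grouped by a caller-selected label key:
// Central pre-aggregates according to the configuration, and Prometheus
// can then aggregate further on top of the reduced series.
func aggregate(findings []finding, key func(finding) string) map[string]int {
	counts := make(map[string]int)
	for _, f := range findings {
		counts[key(f)]++
	}
	return counts
}

func main() {
	findings := []finding{
		{"production", "payments", "CRITICAL"},
		{"production", "payments", "LOW"},
		{"production", "web", "CRITICAL"},
	}
	// Aggregate by Cluster and Severity only, dropping Namespace.
	byClusterSeverity := aggregate(findings, func(f finding) string {
		return f.Cluster + "," + f.Severity
	})
	fmt.Println(byClusterSeverity["production,CRITICAL"]) // 2
}
```

Dropping a label before export is what keeps the series count (and hence the scrape volume) bounded for large customers.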

"github.com/stackrox/rox/pkg/glob"
)

var ops = []string{"!=", "=", "<=", ">=", "<", ">"} // The order matters!
Contributor

Explain how and why it matters.
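One reason the order can matter in a naive scan: "<" is a prefix of "<=". The sketch below uses its own illustrative ordering (not necessarily the PR's exact list) to show the failure mode:

```go
package main

import (
	"fmt"
	"strings"
)

// ops places two-character operators before their one-character prefixes:
// if "<" were tried before "<=", a scan would split "Severity<=3" into
// operator "<" and argument "=3". This ordering is illustrative.
var ops = []string{"!=", "<=", ">=", "=", "<", ">"}

// splitCondition is a hypothetical sketch of parsing "label<op>argument".
func splitCondition(cond string) (label, op, arg string) {
	for _, o := range ops {
		if i := strings.Index(cond, o); i >= 0 {
			return cond[:i], o, cond[i+len(o):]
		}
	}
	return cond, "", ""
}

func main() {
	label, op, arg := splitCondition("Severity<=3")
	fmt.Printf("label=%s op=%s arg=%s\n", label, op, arg) // label=Severity op=<= arg=3
}
```

The same argument applies to "=" versus "!=": whichever operator is a substring of another must be tried after it.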

)

// Label is an alias because the central/metrics package cannot import it.
type Label = string
Contributor

Why not use the Prometheus type for labels?

Contributor Author

Which one? I see only type Labels map[string]string.

Namespace: metrics.PrometheusNamespace,
Subsystem: metrics.CentralSubsystem.String(),
Name: name,
Help: "The discovered CVEs",
Contributor

Sorry, but this needs to be more explanatory. From the examples in the PR description, I see that you offer a positive number as the value of the metric, so it must rather be "the number of discovered CVEs". But that is still imprecise (does it contain duplicates? how large is the time window?), so I'd like to ask you to improve this help message.

Contributor Author

I added a more detailed description.

Contributor Author

The new description example:

# HELP rox_central_Cluster_Namespace_Severity_total The total number of discovered CVEs aggregated by Cluster,Namespace,Severity and gathered every 1h0m0s
# TYPE rox_central_Cluster_Namespace_Severity_total gauge

Contributor

@vikin91 vikin91 left a comment

At the current state I can only guess what this PR does and why it was implemented like this. I must ask you to improve the PR description so that it is clear:

  • what the business use case is (use examples)
  • how those metrics would help the users and how frequently they will be looked at
  • how the users will consume those metrics (we don't have constant names, so I am super curious about that)
  • why we cannot use the aggregation built into Prometheus and Grafana and must offer our own aggregation language for this

I also left detailed comments, but it is difficult to evaluate them when the goal of the entire PR is unclear.

// Example:
//
// "Namespace=abc,Severity": map[metricName][]expression{
// "Namespace_eq_abc_Severity_total": {"Namespace=abc", "Severity"},
Collaborator

Why add the keys in the metric name as well as the labels? Wouldn't a hard coded metric name with dynamic labels be sufficient? That would also make it easier to create Grafana dashboards.

Contributor Author

Let's consider this configuration:
Cluster=*prod,IsFixable=true,Severity=CRITICAL_VULNERABILITY_SEVERITY|Cluster=*prod,IsFixable=true,Severity=HIGH_VULNERABILITY_SEVERITY|Severity.

A single hardcoded metric would not work, because the expectation is to get two different aggregations. But indeed, the expression arguments add complexity for further aggregation. The only disadvantage of not including the arguments is, probably, a potential confusion of implicit filtering, as the metric name won't tell about the applied filter. But the metric HELP will, so removing the arguments from the metric name should be fine. Thanks.

Collaborator

But why do we need to define multiple metrics here? In this example, the user could just define a single base metric

Cluster=*prod,IsFixable=true,Severity

and then use Prometheus to get the other derived metrics.

Contributor Author

To filter out the LOW and MODERATE vulns for one metric and include them for the other metric.

Another example: a coarse metric for all clusters (like the default setting), and a fine metric for some specific cluster, which includes components. Cluster,Namespace,Severity|Cluster=reservation,Namespace,ComponentName=*apache*,Severity,CVSS.

Collaborator

But are these real-world use cases, or are they contrived? Note that if push comes to shove, one can also use relabelling rules in Prometheus to drop metrics. So I don't see a big issue with offering just a single metric that is configurable via labels: it seems flexible enough to avoid crashing Prometheus, and yet simple enough to explain to customers.

Another option to avoid names like rox_central_Cluster_Namespace_Severity_total would be to add an identifier label that is derived from the query config, like rox_central_vulnerability_count{identifier="<ID>", ...}.

Contributor Author

Metric name is not provided by the user.

"github.com/stackrox/rox/pkg/glob"
)

var ops = []string{"=", "!=", "<=", ">=", "<", ">"} // The order matters, as the
Collaborator

I think we should point out here that the filtering logic is a performance optimization. I wonder if it is a premature one: it seems to complicate the code significantly, and the metric volume could otherwise be controlled by the labels.

Not saying we should not implement filtering, especially considering it's already done now, but I would not have implemented it myself, and I don't think it provides much additional value over the label selection alone.

@parametalol parametalol force-pushed the michael/ROX-28326-expose-cve-metrics branch from db52b78 to be8a8cf Compare June 4, 2025 08:57
@vikin91
Contributor

vikin91 commented Jun 5, 2025

I still have this PR on the radar, but I see that it grew to 5k lines added 😅 That may be difficult to review. What is your plan for reviewing that (it is okay if you don't have one yet)?

})
}

func (ar *aggregatorRunner) run(tracker common.Tracker) {
Contributor Author

Move this to the tracker. Make the ticker part of the tracker as well.

Contributor

@vikin91 vikin91 left a comment

I reviewed only parts of the code: I skipped the expressions and the arithmetic parts, and also all tests.
Unfortunately, the code is pretty demanding to read and follow, and I am afraid that it will be difficult to maintain. I am not sure yet how we can bring that to the finish line in its current form. I will think about it and give it another look after you consider some simplifications.

matcher.Singleton().SetRegexes(regexes)
go reprocessor.Singleton().RunReprocessor()

if err := aggregator.Singleton().Reconfigure(req.GetConfig().GetPrivateConfig().GetPrometheusMetricsConfig()); err != nil {
Contributor

Is there any reason why the aggregator cannot be a member of s (serviceImpl)?

Contributor Author

The only reason is the pattern of the code around.

gcp.Singleton().Start()
administrationEventHandler.Singleton().Start()

customMetrics.Singleton().Start()
Contributor

Something does not fit right here. The aggregator is a part of the config service, so I would expect the config service to be started here, not the aggregator. Why can't we keep the hierarchy here?

Contributor Author

The aggregator is not a part of the config service. The latter allows reconfiguring the former at runtime.

Comment on lines +7 to +11
// aggregatedRecord is a single gauge metric record.
type aggregatedRecord struct {
	labels prometheus.Labels
	total  int
}
Contributor

Can't we reuse a type that is built into the Prometheus library?

Contributor Author

Please suggest a type.

Comment on lines +20 to +26
func makeAggregator(mcfg MetricsConfiguration, labelOrder map[Label]int) *aggregator {
	aggregated := make(map[MetricName]map[aggregationKey]*aggregatedRecord)
	for metric := range mcfg {
		aggregated[metric] = make(map[aggregationKey]*aggregatedRecord)
	}
	return &aggregator{aggregated, mcfg, labelOrder}
}
Contributor

Why this is not a constructor, like New(...) *aggregator?

Contributor Author

Because the package is called common.

// aggregator computes the aggregation result.
type aggregator struct {
	result map[MetricName]map[aggregationKey]*aggregatedRecord
	mcfg   MetricsConfiguration
Contributor

This is type MetricsConfiguration map[MetricName]map[Label]Expression and it has its own type alias, but the other one does not. Why can't we use map[MetricName]map[Label]Expression as the type here?

Contributor Author

Because MetricsConfiguration has methods and is used in other places as such.

ar.registry,
iv.GetFilter(),
iv.GetMetrics(),
time.Minute*time.Duration(iv.GetGatheringPeriodMinutes())); err != nil {
Contributor

Why GetGatheringPeriodMinutes() cannot return time.Duration?

Contributor Author

This is a storage field getter, the storage type is uint32.

"failed to parse a condition for metric %q with label %q: %v",
metric, label, err)
}
expr = append(expr, condition)
Contributor

There are many temp variables here just to collect the result. What you could do is:

Suggested change:

    expr = append(expr, condition)
    result[MetricName(metric)][Label(label)] = expr

and drop the labelExpression and related length checks.

Contributor Author

This would assign to a nil map and panic. A metric with no labels should not be added to the result.
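The nil-map panic is easy to demonstrate; setCount below is a hypothetical helper showing the guard that avoids it:

```go
package main

import "fmt"

// setCount shows why the inner map must be allocated before assignment:
// writing into a nil inner map panics with "assignment to entry in nil map".
func setCount(result map[string]map[string]int, metric, label string, n int) {
	if result[metric] == nil {
		result[metric] = make(map[string]int)
	}
	result[metric][label] = n
}

func main() {
	result := map[string]map[string]int{}
	// result["metric"]["label"] = 1 // would panic: the inner map is nil
	setCount(result, "metric", "label", 1)
	fmt.Println(result["metric"]["label"]) // 1
}
```

This is why the suggested one-liner cannot simply drop the length checks: the inner map only exists once at least one label expression has parsed.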

//
// This key is used to uniquely identify and aggregate findings that match the same
// set of label expression.
package common
Contributor

I think the name does not represent what this pkg is doing.

Contributor Author

The package is an assembly of tools for trackers to use. Please suggest a better name, and/or how to split the package.

@openshift-ci

openshift-ci bot commented Jun 12, 2025

@parametalol: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ocp-4-12-ui-e2e-tests 317054e link false /test ocp-4-12-ui-e2e-tests
ci/prow/gke-operator-e2e-tests 67f6772 link false /test gke-operator-e2e-tests
ci/prow/ocp-4-12-nongroovy-e2e-tests 67f6772 link false /test ocp-4-12-nongroovy-e2e-tests
ci/prow/ocp-4-12-qa-e2e-tests 67f6772 link false /test ocp-4-12-qa-e2e-tests
ci/prow/ocp-4-18-qa-e2e-tests 67f6772 link false /test ocp-4-18-qa-e2e-tests
ci/prow/ocp-4-18-ui-e2e-tests 67f6772 link false /test ocp-4-18-ui-e2e-tests
ci/prow/ocp-4-18-nongroovy-e2e-tests 67f6772 link false /test ocp-4-18-nongroovy-e2e-tests
ci/prow/gke-scanner-v4-install-tests 67f6772 link false /test gke-scanner-v4-install-tests
ci/prow/ocp-4-12-operator-e2e-tests 67f6772 link false /test ocp-4-12-operator-e2e-tests
ci/prow/ocp-4-18-operator-e2e-tests 67f6772 link false /test ocp-4-18-operator-e2e-tests
ci/prow/ocp-4-12-scanner-v4-install-tests 67f6772 link false /test ocp-4-12-scanner-v4-install-tests
ci/prow/ocp-4-18-scanner-v4-install-tests 67f6772 link false /test ocp-4-18-scanner-v4-install-tests
ci/prow/gke-ui-e2e-tests ee180cf link true /test gke-ui-e2e-tests
ci/prow/gke-upgrade-tests ee180cf link false /test gke-upgrade-tests
ci/prow/gke-nongroovy-e2e-tests ee180cf link true /test gke-nongroovy-e2e-tests
ci/prow/gke-qa-e2e-tests ee180cf link false /test gke-qa-e2e-tests

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@parametalol
Contributor Author

Closing for splitting.

@parametalol parametalol deleted the michael/ROX-28326-expose-cve-metrics branch September 17, 2025 10:04

6 participants