-
Notifications
You must be signed in to change notification settings - Fork 886
add support for publishing percentile time series for the histogram m… #1689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for publishing percentile time series for the histogram m… #1689
Conversation
…etrics cql-requests, cql-messages and throttling delay. Motivation: Histogram metrics is generating too many metrics overloading the promethous servers. if application has 500 Vms and 1000 cassandra nodes, The histogram metrics generates 100*500*1000 = 50,000,000 time series every 30 seconds. This is just too much metrics. Let us say we can generate percentile 95 timeseries for for every cassandra nodes, then we only have 1*500 = 500 metrics and in applciation side, we can ignore the _bucket time series. This way there will be very less metrics. Modifications: add configurable pre-defined percentiles to Micrometer Timer.Builder.publishPercentiles. This change is being added to cql-requests, cql-messages and throttling delay. Result: Based on the configuration, we will see additonal quantile time series for cql-requests, cql-messages and throttling delay histogram metrics.
…etrics cql-requests, cql-messages and throttling delay. Motivation: Histogram metrics is generating too many metrics overloading the promethous servers. if application has 500 Vms and 1000 cassandra nodes, The histogram metrics generates 100*500*1000 = 50,000,000 time series every 30 seconds. This is just too much metrics. Let us say we can generate percentile 95 timeseries for for every cassandra nodes, then we only have 1*500 = 500 metrics and in applciation side, we can ignore the _bucket time series. This way there will be very less metrics. Modifications: add configurable pre-defined percentiles to Micrometer Timer.Builder.publishPercentiles. This change is being added to cql-requests, cql-messages and throttling delay. Result: Based on the configuration, we will see additonal quantile time series for cql-requests, cql-messages and throttling delay histogram metrics.
|
Thanks for the PR @nparaddi-walmart ! Have you signed the Contributor License Agreement for contributions to DataStax open source projects? If not you can find it at https://cla.datastax.com/. Thanks! |
|
i already signed the CLA Username nparaddi-walmart Nagappa Paraddi nagappa.paraddi@walmart.com Walmart Associates Inc Completed |
|
@absurdfarce you may be already aware. Just FYI. Snyk is failed with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well done @nparaddi-walmart ... this is a good, solid PR!
A few smaller things, but the main issue to be addressed here is whether it makes sense to compute the percentile histogram if we know for sure it's going to be ignored.
| if (metric == DefaultNodeMetric.CQL_MESSAGES) { | ||
| return builder | ||
| builder | ||
| .publishPercentileHistogram() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we're only getting at half the solution if we leave this method call in unmodified.
I agree with your point that you can avoid the _bucket vals if you specify something in publish-percentiles but in that case you're (a) asking Micrometer to collect data for a histogram that you know up front you'll never use and (b) having to build logic in to select only certain tags for scraping. Neither of those seem great.
You can address the second point by changing the export process to remove the histogram data... say by pruning out any line that contains the literal string "_bucket". But (c) you're now introducing a lot of string processing every time you scrape metrics and (d) you haven't really addressed point (a) above; there's really no reason for Micrometer to collect this data if you aren't going to use it.
So really I think you're asking for two things here:
- The ability to specify percentiles for timers (and to have those percentiles published)
- The ability to shut off the percentile histogram if desired
Your PR implements the first, but it sure seems like you should have the second as well to really solve this problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding second problem. we can always whitelist what kind of metrics we allow. That is what we are doing in Walmart. so we already have solution for second. So also this is configurable and adding new 3 metrics series.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also am ok to remove publishPercentileHistogram if percentiles are defined. let me know if that is the path we want to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @absurdfarce - if you're trying to turn off the metrics produced by publishPercentileHistogram because it's producing too many values there will be others in the same situation too. As pointed out earlier there is an efficiency benefit of not having those produced in the first place if they are not needed, and additionally I think there is a usability benefit of having that configured directly in the driver, rather than directing users to another configuration system to strip out these the metrics they don't want.
I also wonder that since micrometer supports publishing both agregable histogram and local histogram metrics simultaneously that having the driver configure only one or the other might be too restrictive. Instead, perhaps you could add a new config entry for toggling publishPercentileHistogram in each scenario, maybe named something along the lines of "publish aggregable histogram" to contrast with the publishPercentiles settings which could be re-named "publish local percentiles"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. because too many metrics, none of the walmart teams are enabled the metrics.
As i understand correctly, you would like two config one for "publish local percentiles" and another one for ""publish aggregable histogram"
i am ok to that.. correct me if I am wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another general clarification i would like to see that is this going to be merged before handing to over apache foundation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some more thought and discussion on our side I think we've landed on one new configuration option which switches aggregable histogram generation on/off for all metric flavors [default=on], in addition to the per-metric publishPercentile settings you have already pulled together in this PR. If you could add that to this PR and tackle some of the other newly added feedback around doc strings then I think we'll be good to go here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. added addtional configuration option for aggregable histogram generation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note could not add junit cases for this new configuration as micrometer SimpleMeterRegistry does not generate _bucket metrics . I have changes ready to add junit test cases but it requires PrometheusMeterRegistry( prometheus dependency). I can add those changes this PR if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have this junit test case id different branch. It needs micrometer-registry-prometheus dependecy. I can pull junit cases in this PR if thats ok
https://github.com/nparaddi-walmart/java-driver/blob/f5a9b39ff37831a61b8dcd0f4ba5e871fa2ccf8c/metrics/micrometer/src/test/java/com/datastax/oss/driver/internal/metrics/micrometer/MicrometerNodeMetricUpdaterTest.java#L294
| # An optional list of percentiles to be published by Micrometer. | ||
| # | ||
| # If defined, the histogram is guaranteed to contain these percentiles alongside other | ||
| # buckets used to generate aggregable percentile approximations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably be very explicit here in saying that these percentiles can NOT be aggregated across multiple nodes in the way that percentile histograms can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks. added comment
| .stream() | ||
| .mapToDouble(Double::doubleValue) | ||
| .toArray()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's already a lot of repetition in this file but let's start to minimize it by abstracting this out to a helper method:
private double[] toDoubleArray(List<Double> doubleList) {
return doubleList.stream().mapToDouble(Double::doubleValue).toArray();
} I would've thought we could have avoided some of the intermediate object creation here by just calling publishPercentiles() multiple times on the builder:
if (profile.isDefined(DseDriverOption.METRICS_SESSION_GRAPH_REQUESTS_PUBLISH_PERCENTILES)) {
profile
.getDoubleList(DseDriverOption.METRICS_SESSION_GRAPH_REQUESTS_PUBLISH_PERCENTILES)
.forEach(builder::publishPercentiles);
} But weirdly that didn't work. With this change the unit tests fail to pass, claiming that this assert:
assertThat(snapshot.percentileValues()).hasSize(3);is only finding one item rather than three... which suggests that percentiles aren't just added together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. added helper method. let me know anything else needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional: it would be nice not to have to repeat the driver option name in the conditional and in the getter, particularly with these long similar names, consider breaking the repeated pattern of if(isDefined(option)) { ... profile.getDoubleList(option) ... into a method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created separate static method
…art/java-driver into add_percentile_cql_request
hhughes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @nparaddi-walmart, thanks for your patience on this one! Biggest concern here is we'd like to provide users a way to turn off the large-volumed metrics in addition to providing config to add new ones. I've also added a few other pieces of optional feedback. Please hit me back if you have any concerns! Thanks
| new TypedDriverOption<>( | ||
| DefaultDriverOption.METRICS_SESSION_CQL_REQUESTS_SLO, | ||
| GenericType.listOf(GenericType.DURATION)); | ||
| /** Optional pre-defined percentile of cql resquests to publish, as a list of percentiles . */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: resquests is mis-spelled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed it
| "advanced.metrics.node.cql-messages.publish-percentiles"), | ||
| /** | ||
| * List of percentiles in double to be published by the throttling delay metric. 95th percentile | ||
| * to represented as 0.95. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rework the javadoc descriptions for enums in this class and in DseDriverOption.java? They are a little hard to understand. For example I don't think we need to repeat that the value type as this is stated in the Value type: section of the description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the java doc
| if (metric == DefaultNodeMetric.CQL_MESSAGES) { | ||
| return builder | ||
| builder | ||
| .publishPercentileHistogram() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @absurdfarce - if you're trying to turn off the metrics produced by publishPercentileHistogram because it's producing too many values there will be others in the same situation too. As pointed out earlier there is an efficiency benefit of not having those produced in the first place if they are not needed, and additionally I think there is a usability benefit of having that configured directly in the driver, rather than directing users to another configuration system to strip out these the metrics they don't want.
I also wonder that since micrometer supports publishing both agregable histogram and local histogram metrics simultaneously that having the driver configure only one or the other might be too restrictive. Instead, perhaps you could add a new config entry for toggling publishPercentileHistogram in each scenario, maybe named something along the lines of "publish aggregable histogram" to contrast with the publishPercentiles settings which could be re-named "publish local percentiles"?
| return builder.publishPercentileHistogram(); | ||
| } | ||
|
|
||
| protected double[] toDoubleArray(List<Double> doubleList) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this could be static
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to static
| .stream() | ||
| .mapToDouble(Double::doubleValue) | ||
| .toArray()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional: it would be nice not to have to repeat the driver option name in the conditional and in the getter, particularly with these long similar names, consider breaking the repeated pattern of if(isDefined(option)) { ... profile.getDoubleList(option) ... into a method
| assertThat(timer.count()).isEqualTo(10); | ||
| HistogramSnapshot snapshot = timer.takeSnapshot(); | ||
| assertThat(snapshot.histogramCounts()).hasSize(2); | ||
| assertThat(snapshot.percentileValues()).hasSize(3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it would provide better coverage if you could assert the exact set of percentiles you expect here instead of just the number of items (i.e. [0.75, 0.95, 0.99]) - same feedback to the other test in MicrometerSessionMetricUpdaterTest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added the validation
| when(profile.isDefined(percentiles)).thenReturn(true); | ||
| when(profile.getDoubleList(percentiles)) | ||
| .thenReturn( | ||
| Arrays.asList(Double.valueOf(0.75), Double.valueOf(0.95), Double.valueOf(0.99))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unneccessary boxing of doubles (and again in the other tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed boxing
…ion on/off for all metric flavors [default=on]
hhughes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the changes! LGTM
|
One unrelated test-failure (ChannelPoolResizeTest#should_grow_during_reconnection / JAVA-3113) |
…etrics cql-requests, cql-messages and throttling delay.
Motivation:
Histogram metrics is generating too many metrics overloading the promethous servers. if application has 500 Vms and 1000 cassandra nodes, The histogram metrics generates 1005001000 = 50,000,000 time series every 30 seconds. This is just too much metrics. Let us say we can generate percentile 95 timeseries for for every cassandra nodes, then we only have 1*500 = 500 metrics and in applciation side, we can ignore the _bucket time series. This way there will be very less metrics.
Modifications:
add configurable pre-defined percentiles to Micrometer Timer.Builder.publishPercentiles. This change is being added to cql-requests, cql-messages and throttling delay.
Also this PR adds configuration option which switches aggregable histogram generation on/off for all metric flavors [default=on]
Result:
Based on the configuration, we will see additonal quantile time series for cql-requests, cql-messages and throttling delay histogram metrics.
Also there will be configuration option available to disable aggregable histogram generation if not needed