Skip to content

Conversation

@nparaddi-walmart
Copy link
Contributor

@nparaddi-walmart nparaddi-walmart commented Jul 26, 2023

…etrics cql-requests, cql-messages and throttling delay.

Motivation:

Histogram metrics is generating too many metrics overloading the promethous servers. if application has 500 Vms and 1000 cassandra nodes, The histogram metrics generates 1005001000 = 50,000,000 time series every 30 seconds. This is just too much metrics. Let us say we can generate percentile 95 timeseries for for every cassandra nodes, then we only have 1*500 = 500 metrics and in applciation side, we can ignore the _bucket time series. This way there will be very less metrics.

Modifications:
add configurable pre-defined percentiles to Micrometer Timer.Builder.publishPercentiles. This change is being added to cql-requests, cql-messages and throttling delay.
Also this PR adds configuration option which switches aggregable histogram generation on/off for all metric flavors [default=on]

Result:
Based on the configuration, we will see additonal quantile time series for cql-requests, cql-messages and throttling delay histogram metrics.
Also there will be configuration option available to disable aggregable histogram generation if not needed

Nagappa Paraddi and others added 4 commits July 18, 2023 15:31
…etrics cql-requests, cql-messages and throttling delay.

Motivation:

Histogram metrics is generating too many metrics overloading the promethous servers. if application has 500 Vms
and 1000 cassandra nodes, The histogram metrics generates 100*500*1000 = 50,000,000 time series every 30 seconds.
This is just too much metrics.  Let us say we can generate percentile 95 timeseries for for every cassandra nodes,
then we only have 1*500 = 500 metrics and in applciation side, we can ignore the _bucket time series. This way there
will be very less metrics.

Modifications:
add configurable pre-defined percentiles to Micrometer Timer.Builder.publishPercentiles. This change is being added to
cql-requests, cql-messages and throttling delay.

Result:
Based on the configuration, we will see additonal quantile time series for cql-requests, cql-messages and throttling delay
histogram metrics.
…etrics cql-requests, cql-messages and throttling delay.

Motivation:

Histogram metrics is generating too many metrics overloading the promethous servers. if application has 500 Vms
and 1000 cassandra nodes, The histogram metrics generates 100*500*1000 = 50,000,000 time series every 30 seconds.
This is just too much metrics.  Let us say we can generate percentile 95 timeseries for for every cassandra nodes,
then we only have 1*500 = 500 metrics and in applciation side, we can ignore the _bucket time series. This way there
will be very less metrics.

Modifications:
add configurable pre-defined percentiles to Micrometer Timer.Builder.publishPercentiles. This change is being added to
cql-requests, cql-messages and throttling delay.

Result:
Based on the configuration, we will see additonal quantile time series for cql-requests, cql-messages and throttling delay
histogram metrics.
@absurdfarce
Copy link
Contributor

Thanks for the PR @nparaddi-walmart !

Have you signed the Contributor License Agreement for contributions to DataStax open source projects? If not you can find it at https://cla.datastax.com/. Thanks!

@nparaddi-walmart
Copy link
Contributor Author

i already signed the CLA

Username

nparaddi-walmart
Name

Nagappa Paraddi
Email

nagappa.paraddi@walmart.com
Company

Walmart Associates Inc
Status

Completed

@absurdfarce absurdfarce self-requested a review August 15, 2023 20:56
@nparaddi-walmart
Copy link
Contributor Author

@absurdfarce you may be already aware. Just FYI. Snyk is failed with
"snyk requires an authenticated account. Please run snyk auth and try again."

Copy link
Contributor

@absurdfarce absurdfarce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done @nparaddi-walmart ... this is a good, solid PR!

A few smaller things, but the main issue to be addressed here is whether it makes sense to compute the percentile histogram if we know for sure it's going to be ignored.

if (metric == DefaultNodeMetric.CQL_MESSAGES) {
return builder
builder
.publishPercentileHistogram()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we're only getting at half the solution if we leave this method call in unmodified.

I agree with your point that you can avoid the _bucket vals if you specify something in publish-percentiles but in that case you're (a) asking Micrometer to collect data for a histogram that you know up front you'll never use and (b) having to build logic in to select only certain tags for scraping. Neither of those seem great.

You can address the second point by changing the export process to remove the histogram data... say by pruning out any line that contains the literal string "_bucket". But (c) you're now introducing a lot of string processing every time you scrape metrics and (d) you haven't really addressed point (a) above; there's really no reason for Micrometer to collect this data if you aren't going to use it.

So really I think you're asking for two things here:

  1. The ability to specify percentiles for timers (and to have those percentiles published)
  2. The ability to shut off the percentile histogram if desired

Your PR implements the first, but it sure seems like you should have the second as well to really solve this problem.

Copy link
Contributor Author

@nparaddi-walmart nparaddi-walmart Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding second problem. we can always whitelist what kind of metrics we allow. That is what we are doing in Walmart. so we already have solution for second. So also this is configurable and adding new 3 metrics series.

Copy link
Contributor Author

@nparaddi-walmart nparaddi-walmart Aug 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also am ok to remove publishPercentileHistogram if percentiles are defined. let me know if that is the path we want to go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @absurdfarce - if you're trying to turn off the metrics produced by publishPercentileHistogram because it's producing too many values there will be others in the same situation too. As pointed out earlier there is an efficiency benefit of not having those produced in the first place if they are not needed, and additionally I think there is a usability benefit of having that configured directly in the driver, rather than directing users to another configuration system to strip out these the metrics they don't want.

I also wonder that since micrometer supports publishing both agregable histogram and local histogram metrics simultaneously that having the driver configure only one or the other might be too restrictive. Instead, perhaps you could add a new config entry for toggling publishPercentileHistogram in each scenario, maybe named something along the lines of "publish aggregable histogram" to contrast with the publishPercentiles settings which could be re-named "publish local percentiles"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. because too many metrics, none of the walmart teams are enabled the metrics.
As i understand correctly, you would like two config one for "publish local percentiles" and another one for ""publish aggregable histogram"
i am ok to that.. correct me if I am wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another general clarification i would like to see that is this going to be merged before handing to over apache foundation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some more thought and discussion on our side I think we've landed on one new configuration option which switches aggregable histogram generation on/off for all metric flavors [default=on], in addition to the per-metric publishPercentile settings you have already pulled together in this PR. If you could add that to this PR and tackle some of the other newly added feedback around doc strings then I think we'll be good to go here :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. added addtional configuration option for aggregable histogram generation

Copy link
Contributor Author

@nparaddi-walmart nparaddi-walmart Sep 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note could not add junit cases for this new configuration as micrometer SimpleMeterRegistry does not generate _bucket metrics . I have changes ready to add junit test cases but it requires PrometheusMeterRegistry( prometheus dependency). I can add those changes this PR if needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have this junit test case id different branch. It needs micrometer-registry-prometheus dependecy. I can pull junit cases in this PR if thats ok
https://github.com/nparaddi-walmart/java-driver/blob/f5a9b39ff37831a61b8dcd0f4ba5e871fa2ccf8c/metrics/micrometer/src/test/java/com/datastax/oss/driver/internal/metrics/micrometer/MicrometerNodeMetricUpdaterTest.java#L294

# An optional list of percentiles to be published by Micrometer.
#
# If defined, the histogram is guaranteed to contain these percentiles alongside other
# buckets used to generate aggregable percentile approximations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably be very explicit here in saying that these percentiles can NOT be aggregated across multiple nodes in the way that percentile histograms can.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. added comment

.stream()
.mapToDouble(Double::doubleValue)
.toArray());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's already a lot of repetition in this file but let's start to minimize it by abstracting this out to a helper method:

private double[] toDoubleArray(List<Double> doubleList) {                                          
  return doubleList.stream().mapToDouble(Double::doubleValue).toArray();               
} 

I would've thought we could have avoided some of the intermediate object creation here by just calling publishPercentiles() multiple times on the builder:

       if (profile.isDefined(DseDriverOption.METRICS_SESSION_GRAPH_REQUESTS_PUBLISH_PERCENTILES)) {
         profile          
           .getDoubleList(DseDriverOption.METRICS_SESSION_GRAPH_REQUESTS_PUBLISH_PERCENTILES)
           .forEach(builder::publishPercentiles);
       } 

But weirdly that didn't work. With this change the unit tests fail to pass, claiming that this assert:

    assertThat(snapshot.percentileValues()).hasSize(3);

is only finding one item rather than three... which suggests that percentiles aren't just added together.

Copy link
Contributor Author

@nparaddi-walmart nparaddi-walmart Aug 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. added helper method. let me know anything else needed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: it would be nice not to have to repeat the driver option name in the conditional and in the getter, particularly with these long similar names, consider breaking the repeated pattern of if(isDefined(option)) { ... profile.getDoubleList(option) ... into a method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created separate static method

Copy link
Contributor

@hhughes hhughes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nparaddi-walmart, thanks for your patience on this one! Biggest concern here is we'd like to provide users a way to turn off the large-volumed metrics in addition to providing config to add new ones. I've also added a few other pieces of optional feedback. Please hit me back if you have any concerns! Thanks

new TypedDriverOption<>(
DefaultDriverOption.METRICS_SESSION_CQL_REQUESTS_SLO,
GenericType.listOf(GenericType.DURATION));
/** Optional pre-defined percentile of cql resquests to publish, as a list of percentiles . */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: resquests is mis-spelled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed it

"advanced.metrics.node.cql-messages.publish-percentiles"),
/**
* List of percentiles in double to be published by the throttling delay metric. 95th percentile
* to represented as 0.95.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rework the javadoc descriptions for enums in this class and in DseDriverOption.java? They are a little hard to understand. For example I don't think we need to repeat that the value type as this is stated in the Value type: section of the description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the java doc

if (metric == DefaultNodeMetric.CQL_MESSAGES) {
return builder
builder
.publishPercentileHistogram()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @absurdfarce - if you're trying to turn off the metrics produced by publishPercentileHistogram because it's producing too many values there will be others in the same situation too. As pointed out earlier there is an efficiency benefit of not having those produced in the first place if they are not needed, and additionally I think there is a usability benefit of having that configured directly in the driver, rather than directing users to another configuration system to strip out these the metrics they don't want.

I also wonder that since micrometer supports publishing both agregable histogram and local histogram metrics simultaneously that having the driver configure only one or the other might be too restrictive. Instead, perhaps you could add a new config entry for toggling publishPercentileHistogram in each scenario, maybe named something along the lines of "publish aggregable histogram" to contrast with the publishPercentiles settings which could be re-named "publish local percentiles"?

return builder.publishPercentileHistogram();
}

protected double[] toDoubleArray(List<Double> doubleList) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this could be static

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to static

.stream()
.mapToDouble(Double::doubleValue)
.toArray());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: it would be nice not to have to repeat the driver option name in the conditional and in the getter, particularly with these long similar names, consider breaking the repeated pattern of if(isDefined(option)) { ... profile.getDoubleList(option) ... into a method

assertThat(timer.count()).isEqualTo(10);
HistogramSnapshot snapshot = timer.takeSnapshot();
assertThat(snapshot.histogramCounts()).hasSize(2);
assertThat(snapshot.percentileValues()).hasSize(3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it would provide better coverage if you could assert the exact set of percentiles you expect here instead of just the number of items (i.e. [0.75, 0.95, 0.99]) - same feedback to the other test in MicrometerSessionMetricUpdaterTest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the validation

when(profile.isDefined(percentiles)).thenReturn(true);
when(profile.getDoubleList(percentiles))
.thenReturn(
Arrays.asList(Double.valueOf(0.75), Double.valueOf(0.95), Double.valueOf(0.99)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unneccessary boxing of doubles (and again in the other tests)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed boxing

Copy link
Contributor

@hhughes hhughes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the changes! LGTM

@hhughes hhughes removed the request for review from absurdfarce September 11, 2023 15:25
@hhughes
Copy link
Contributor

hhughes commented Sep 11, 2023

One unrelated test-failure (ChannelPoolResizeTest#should_grow_during_reconnection / JAVA-3113)

@hhughes hhughes merged commit 06947df into apache:4.x Sep 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants