add support for publishing percentile time series for the histogram m… #1689

nparaddi-walmart · 2023-07-26T17:09:23Z

…etrics cql-requests, cql-messages and throttling delay.

Motivation:

Histogram metrics is generating too many metrics overloading the promethous servers. if application has 500 Vms and 1000 cassandra nodes, The histogram metrics generates 1005001000 = 50,000,000 time series every 30 seconds. This is just too much metrics. Let us say we can generate percentile 95 timeseries for for every cassandra nodes, then we only have 1*500 = 500 metrics and in applciation side, we can ignore the _bucket time series. This way there will be very less metrics.

Modifications:
add configurable pre-defined percentiles to Micrometer Timer.Builder.publishPercentiles. This change is being added to cql-requests, cql-messages and throttling delay.
Also this PR adds configuration option which switches aggregable histogram generation on/off for all metric flavors [default=on]

Result:
Based on the configuration, we will see additonal quantile time series for cql-requests, cql-messages and throttling delay histogram metrics.
Also there will be configuration option available to disable aggregable histogram generation if not needed

…etrics cql-requests, cql-messages and throttling delay. Motivation: Histogram metrics is generating too many metrics overloading the promethous servers. if application has 500 Vms and 1000 cassandra nodes, The histogram metrics generates 100*500*1000 = 50,000,000 time series every 30 seconds. This is just too much metrics. Let us say we can generate percentile 95 timeseries for for every cassandra nodes, then we only have 1*500 = 500 metrics and in applciation side, we can ignore the _bucket time series. This way there will be very less metrics. Modifications: add configurable pre-defined percentiles to Micrometer Timer.Builder.publishPercentiles. This change is being added to cql-requests, cql-messages and throttling delay. Result: Based on the configuration, we will see additonal quantile time series for cql-requests, cql-messages and throttling delay histogram metrics.

absurdfarce · 2023-08-15T20:00:27Z

Thanks for the PR @nparaddi-walmart !

Have you signed the Contributor License Agreement for contributions to DataStax open source projects? If not you can find it at https://cla.datastax.com/. Thanks!

nparaddi-walmart · 2023-08-15T20:37:19Z

i already signed the CLA

Username

nparaddi-walmart
Name

Nagappa Paraddi
Email

nagappa.paraddi@walmart.com
Company

Walmart Associates Inc
Status

Completed

nparaddi-walmart · 2023-08-16T03:33:24Z

@absurdfarce you may be already aware. Just FYI. Snyk is failed with
"snyk requires an authenticated account. Please run snyk auth and try again."

absurdfarce

Well done @nparaddi-walmart ... this is a good, solid PR!

A few smaller things, but the main issue to be addressed here is whether it makes sense to compute the percentile histogram if we know for sure it's going to be ignored.

absurdfarce · 2023-08-18T16:57:52Z

...in/java/com/datastax/oss/driver/internal/metrics/micrometer/MicrometerNodeMetricUpdater.java

    if (metric == DefaultNodeMetric.CQL_MESSAGES) {
-      return builder
+      builder
          .publishPercentileHistogram()


I feel like we're only getting at half the solution if we leave this method call in unmodified.

I agree with your point that you can avoid the _bucket vals if you specify something in publish-percentiles but in that case you're (a) asking Micrometer to collect data for a histogram that you know up front you'll never use and (b) having to build logic in to select only certain tags for scraping. Neither of those seem great.

You can address the second point by changing the export process to remove the histogram data... say by pruning out any line that contains the literal string "_bucket". But (c) you're now introducing a lot of string processing every time you scrape metrics and (d) you haven't really addressed point (a) above; there's really no reason for Micrometer to collect this data if you aren't going to use it.

So really I think you're asking for two things here:

The ability to specify percentiles for timers (and to have those percentiles published)

The ability to shut off the percentile histogram if desired

Your PR implements the first, but it sure seems like you should have the second as well to really solve this problem.

Regarding second problem. we can always whitelist what kind of metrics we allow. That is what we are doing in Walmart. so we already have solution for second. So also this is configurable and adding new 3 metrics series.

also am ok to remove publishPercentileHistogram if percentiles are defined. let me know if that is the path we want to go.

I agree with @absurdfarce - if you're trying to turn off the metrics produced by publishPercentileHistogram because it's producing too many values there will be others in the same situation too. As pointed out earlier there is an efficiency benefit of not having those produced in the first place if they are not needed, and additionally I think there is a usability benefit of having that configured directly in the driver, rather than directing users to another configuration system to strip out these the metrics they don't want.

I also wonder that since micrometer supports publishing both agregable histogram and local histogram metrics simultaneously that having the driver configure only one or the other might be too restrictive. Instead, perhaps you could add a new config entry for toggling publishPercentileHistogram in each scenario, maybe named something along the lines of "publish aggregable histogram" to contrast with the publishPercentiles settings which could be re-named "publish local percentiles"?

Yes. because too many metrics, none of the walmart teams are enabled the metrics.
As i understand correctly, you would like two config one for "publish local percentiles" and another one for ""publish aggregable histogram"
i am ok to that.. correct me if I am wrong.

Another general clarification i would like to see that is this going to be merged before handing to over apache foundation?

After some more thought and discussion on our side I think we've landed on one new configuration option which switches aggregable histogram generation on/off for all metric flavors [default=on], in addition to the per-metric publishPercentile settings you have already pulled together in this PR. If you could add that to this PR and tackle some of the other newly added feedback around doc strings then I think we'll be good to go here :)

Thank you. added addtional configuration option for aggregable histogram generation

Note could not add junit cases for this new configuration as micrometer SimpleMeterRegistry does not generate _bucket metrics . I have changes ready to add junit test cases but it requires PrometheusMeterRegistry( prometheus dependency). I can add those changes this PR if needed

i have this junit test case id different branch. It needs micrometer-registry-prometheus dependecy. I can pull junit cases in this PR if thats ok
https://github.com/nparaddi-walmart/java-driver/blob/f5a9b39ff37831a61b8dcd0f4ba5e871fa2ccf8c/metrics/micrometer/src/test/java/com/datastax/oss/driver/internal/metrics/micrometer/MicrometerNodeMetricUpdaterTest.java#L294

absurdfarce · 2023-08-18T17:08:27Z

core/src/main/resources/reference.conf

+        # An optional list of percentiles to be published by Micrometer.
+        #
+        # If defined, the histogram is guaranteed to contain these percentiles alongside other
+        # buckets used to generate aggregable percentile approximations.


We should probably be very explicit here in saying that these percentiles can NOT be aggregated across multiple nodes in the way that percentile histograms can.

thanks. added comment

absurdfarce · 2023-08-18T17:14:31Z

...in/java/com/datastax/oss/driver/internal/metrics/micrometer/MicrometerNodeMetricUpdater.java

+                .stream()
+                .mapToDouble(Double::doubleValue)
+                .toArray());
+      }


There's already a lot of repetition in this file but let's start to minimize it by abstracting this out to a helper method:

private double[] toDoubleArray(List<Double> doubleList) { return doubleList.stream().mapToDouble(Double::doubleValue).toArray(); }

I would've thought we could have avoided some of the intermediate object creation here by just calling publishPercentiles() multiple times on the builder:

if (profile.isDefined(DseDriverOption.METRICS_SESSION_GRAPH_REQUESTS_PUBLISH_PERCENTILES)) { profile .getDoubleList(DseDriverOption.METRICS_SESSION_GRAPH_REQUESTS_PUBLISH_PERCENTILES) .forEach(builder::publishPercentiles); }

But weirdly that didn't work. With this change the unit tests fail to pass, claiming that this assert:

assertThat(snapshot.percentileValues()).hasSize(3);

is only finding one item rather than three... which suggests that percentiles aren't just added together.

Thank you. added helper method. let me know anything else needed

optional: it would be nice not to have to repeat the driver option name in the conditional and in the getter, particularly with these long similar names, consider breaking the repeated pattern of if(isDefined(option)) { ... profile.getDoubleList(option) ... into a method

created separate static method

…art/java-driver into add_percentile_cql_request

hhughes

Hi @nparaddi-walmart, thanks for your patience on this one! Biggest concern here is we'd like to provide users a way to turn off the large-volumed metrics in addition to providing config to add new ones. I've also added a few other pieces of optional feedback. Please hit me back if you have any concerns! Thanks

hhughes · 2023-09-06T03:45:34Z

core/src/main/java/com/datastax/oss/driver/api/core/config/TypedDriverOption.java

      new TypedDriverOption<>(
          DefaultDriverOption.METRICS_SESSION_CQL_REQUESTS_SLO,
          GenericType.listOf(GenericType.DURATION));
+  /** Optional pre-defined percentile of cql resquests to publish, as a list of percentiles . */


nit: resquests is mis-spelled

hhughes · 2023-09-06T17:54:14Z

core/src/main/java/com/datastax/oss/driver/api/core/config/DefaultDriverOption.java

+      "advanced.metrics.node.cql-messages.publish-percentiles"),
+  /**
+   * List of percentiles in double to be published by the throttling delay metric. 95th percentile
+   * to represented as 0.95.


Can you rework the javadoc descriptions for enums in this class and in DseDriverOption.java? They are a little hard to understand. For example I don't think we need to repeat that the value type as this is stated in the Value type: section of the description.

updated the java doc

hhughes · 2023-09-06T23:09:31Z

...in/java/com/datastax/oss/driver/internal/metrics/micrometer/MicrometerNodeMetricUpdater.java

    if (metric == DefaultNodeMetric.CQL_MESSAGES) {
-      return builder
+      builder
          .publishPercentileHistogram()


I agree with @absurdfarce - if you're trying to turn off the metrics produced by publishPercentileHistogram because it's producing too many values there will be others in the same situation too. As pointed out earlier there is an efficiency benefit of not having those produced in the first place if they are not needed, and additionally I think there is a usability benefit of having that configured directly in the driver, rather than directing users to another configuration system to strip out these the metrics they don't want.

I also wonder that since micrometer supports publishing both agregable histogram and local histogram metrics simultaneously that having the driver configure only one or the other might be too restrictive. Instead, perhaps you could add a new config entry for toggling publishPercentileHistogram in each scenario, maybe named something along the lines of "publish aggregable histogram" to contrast with the publishPercentiles settings which could be re-named "publish local percentiles"?

hhughes · 2023-09-06T23:10:38Z

...c/main/java/com/datastax/oss/driver/internal/metrics/micrometer/MicrometerMetricUpdater.java

    return builder.publishPercentileHistogram();
  }
+
+  protected double[] toDoubleArray(List<Double> doubleList) {


nit: this could be static

moved to static

hhughes · 2023-09-06T23:30:08Z

...in/java/com/datastax/oss/driver/internal/metrics/micrometer/MicrometerNodeMetricUpdater.java

+                .stream()
+                .mapToDouble(Double::doubleValue)
+                .toArray());
+      }


optional: it would be nice not to have to repeat the driver option name in the conditional and in the getter, particularly with these long similar names, consider breaking the repeated pattern of if(isDefined(option)) { ... profile.getDoubleList(option) ... into a method

hhughes · 2023-09-07T00:11:16Z

...ava/com/datastax/oss/driver/internal/metrics/micrometer/MicrometerNodeMetricUpdaterTest.java

    assertThat(timer.count()).isEqualTo(10);
    HistogramSnapshot snapshot = timer.takeSnapshot();
    assertThat(snapshot.histogramCounts()).hasSize(2);
+    assertThat(snapshot.percentileValues()).hasSize(3);


nit: it would provide better coverage if you could assert the exact set of percentiles you expect here instead of just the number of items (i.e. [0.75, 0.95, 0.99]) - same feedback to the other test in MicrometerSessionMetricUpdaterTest

added the validation

hhughes · 2023-09-07T00:13:18Z

...ava/com/datastax/oss/driver/internal/metrics/micrometer/MicrometerNodeMetricUpdaterTest.java

+    when(profile.isDefined(percentiles)).thenReturn(true);
+    when(profile.getDoubleList(percentiles))
+        .thenReturn(
+            Arrays.asList(Double.valueOf(0.75), Double.valueOf(0.95), Double.valueOf(0.99)));


nit: unneccessary boxing of doubles (and again in the other tests)

removed boxing

…ion on/off for all metric flavors [default=on]

hhughes

Thank you for the changes! LGTM

hhughes · 2023-09-11T17:07:56Z

One unrelated test-failure (ChannelPoolResizeTest#should_grow_during_reconnection / JAVA-3113)

Nagappa Paraddi and others added 4 commits July 18, 2023 15:31

Merge branch '4.x' into add_percentile_cql_request

97b12f3

Merge branch '4.x' into add_percentile_cql_request

2eb4cf7

absurdfarce added the cla-missing label Aug 15, 2023

absurdfarce self-requested a review August 15, 2023 20:56

absurdfarce removed the cla-missing label Aug 15, 2023

absurdfarce requested changes Aug 18, 2023

View reviewed changes

Nagappa Paraddi added 2 commits August 19, 2023 11:31

Merge branch 'add_percentile_cql_request' of github.com:nparaddi-walm…

8f022b9

…art/java-driver into add_percentile_cql_request

using helper method as suggested in review

60a2908

nparaddi-walmart requested a review from absurdfarce August 22, 2023 16:56

hhughes suggested changes Sep 7, 2023

View reviewed changes

Nagappa Paraddi added 4 commits September 8, 2023 23:12

fixes as per review comments

69d201f

add configuration option which switches aggregable histogram generat…

31a1166

…ion on/off for all metric flavors [default=on]

updating java doc

aa44347

rename method to publishPercentilesIfDefined

fd78374

nparaddi-walmart requested a review from hhughes September 9, 2023 23:41

renmae method

eec409a

hhughes approved these changes Sep 11, 2023

View reviewed changes

hhughes removed the request for review from absurdfarce September 11, 2023 15:25

hhughes merged commit 06947df into apache:4.x Sep 11, 2023

add support for publishing percentile time series for the histogram m… #1689

add support for publishing percentile time series for the histogram m… #1689

Uh oh!

Conversation

nparaddi-walmart commented Jul 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

absurdfarce commented Aug 15, 2023

Uh oh!

nparaddi-walmart commented Aug 15, 2023

Uh oh!

nparaddi-walmart commented Aug 16, 2023

Uh oh!

absurdfarce left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nparaddi-walmart Aug 18, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nparaddi-walmart Aug 19, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nparaddi-walmart Sep 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nparaddi-walmart Aug 19, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hhughes left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nparaddi-walmart commented Jul 26, 2023 •

edited

Loading

absurdfarce left a comment •

edited

Loading

nparaddi-walmart Aug 18, 2023 •

edited

Loading

nparaddi-walmart Aug 19, 2023 •

edited

Loading

nparaddi-walmart Sep 9, 2023 •

edited

Loading

nparaddi-walmart Aug 19, 2023 •

edited

Loading