
I have a metric covering about 900K users, and I can't generate a distribution that roughly reproduces the existing one. I need this to quickly generate samples for A/A and A/B tests (with a guaranteed uplift).

Here are the summary statistics of the data:


count    953086.000000
mean        483.013657
std        1410.598133
min           0.000000
25%          33.000000
50%         125.000000
75%         421.000000
max      151074.000000


Here's what a sample of 10K users looks like for this metric:

[Image: a sample of 10K users for this metric]

How can I determine the distribution, at least qualitatively? I tried the fit methods of statistical packages, but they were not very helpful.

I would like help reproducing the distribution accurately enough that the range from 0 to 100 units matches closely; the tail can be more random, as is expected.

  • It is indeed impossible to analyze without the actual data. Could you post a link to the dataset so that we have a minimal reproducible example? Your distribution is certainly highly asymmetric. Commented Jul 1, 2024 at 17:14
  • How did you try the fit methods? Which statistical packages? Show some code, please. Those fit methods are very efficient, even exact, in the sense that they provide the parameters that fit best: not a trial, but the exact set of best possible parameters. But, of course, you have to understand those parameters and choose which statistical distribution you want to fit. Commented Jul 2, 2024 at 14:37
  • I add here another part of the discussion: you need to define what you mean by "same distribution". Do you mean (as some answers and comments tried) finding a standard distribution that fits as closely as possible? Do you mean something that has roughly the same histogram? If so, how much difference from your histogram is acceptable? None? Or do you even mean the exact same distribution (not just its discretization, the histogram)? In that case, shuffling the data is the only option. Commented Jul 3, 2024 at 9:06
  • But you need to define which features of the distribution you want to keep. To use an unrealistic example (one that illustrates that only you can tell what "same distribution" means): if what you want to keep is the exact same number of 5s at the 7th decimal place, then none of the attempts so far even tries to do that, because everybody assumed (probably rightly, but in theory we can't know unless you tell us) that it wasn't part of what "same distribution" means. Commented Jul 3, 2024 at 9:09

2 Answers


Although it's very difficult to know without the exact data, this looks like a Weibull distribution with a shape parameter below 1:

import numpy as np
import pandas as pd

n = 10_000
s = pd.Series(np.random.weibull(0.4, size=n))
s.plot.hist(bins=100)

[Image: Weibull distribution]

You might need to rescale the values.
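As a rough illustration of that rescaling, here is a minimal sketch that picks the scale so the Weibull median lands on the median reported in the question (125). The shape 0.4 is simply the value used above, not a fitted one, and matching the median rather than another statistic is an assumption made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

shape = 0.4            # shape used in the snippet above (not a fitted value)
target_median = 125.0  # median reported in the question's summary statistics

# A Weibull with scale 1 has median ln(2) ** (1 / shape),
# so dividing the target by it gives the scale that matches the median.
scale = target_median / np.log(2) ** (1 / shape)

sample = scale * rng.weibull(shape, size=100_000)
sample_median = float(np.median(sample))
print(f"scale={scale:.1f}, sample median={sample_median:.1f}")
```

Matching one statistic this way says nothing about the others (mean, std, quartiles), so they still have to be checked against the data.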


7 Comments

docs.google.com/spreadsheets/d/… found a way to share samples from 100K. If you can, please take a look, and I'll take a closer look at your code and distribution
@Roman it's impossible to know for sure, but this looks quite similar. Here are 10 random samples with parameter 0.35: counts and cumulative counts.
Now that you have provided the data: @mozway was absolutely right in suggesting that this looks like a Weibull distribution. You should be able to fit it.
Everything looks like a Weibull anyway. That is the main point of Weibull: you can make it look like anything. But it could also be Zipf, Pareto, or exponential. Hard to say with this resolution. EDIT: well, no, not exponential anyway. The mean and std of an exponential are supposed to be identical, and we know from this data that they are not.
@jlandercy agreed. That is why those are just comments, and I did not answer. OP hasn't said what they want. In the absence of any information other than "same distribution as", the only thing we can say is "then use the same numbers" (shuffle the numbers, for example). We could also smooth a CDF or something like that, or try a random-plus-adjustment regression of some sort to fit all the printed statistics (4 quantiles, mean, std, ...). But I wouldn't without first knowing what "same distribution" means in OP's mind.

Q-Q Plot

When trying to match an empirical distribution to a known model, comparing histograms is often misleading.

In such a situation, a Q-Q plot is the tool you need. It compares the empirical quantiles of your data with the theoretical quantiles of a chosen model.

Here is a procedure to draw such a diagram:

  1. Compute the ECDF of the dataset to get the empirical quantiles;
  2. Choose a theoretical model, which you may fit to your data (e.g. using MLE);
  3. Compute the theoretical quantiles using the quantile function (or PPF) of your theoretical model;
  4. Create pairs of quantiles (theoretical, empirical).

If the model agrees with the data, the Q-Q points will lie on the identity line x=y.

The following snippet implements this procedure:

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

def qqplot(data, law_factory, axe=None):
    
    if axe is None:
        fig, axe = plt.subplots()
    
    # Compute ECDF from data:
    ecdf = stats.ecdf(data)
    
    # Check whether the law is already frozen (parameters fixed):
    if isinstance(law_factory, stats._distn_infrastructure.rv_continuous_frozen):
        law = law_factory
    # Fit using MLE if not the case:
    else:
        parameters = law_factory.fit(data)
        law = law_factory(*parameters)

    # Compute theoretical quantiles:
    quantiles = law.ppf(ecdf.cdf.probabilities)
    
    axe.scatter(quantiles, ecdf.cdf.quantiles, marker=".")
    axe.loglog(quantiles, quantiles, "--", color="black")
    axe.set_title("Q-Q Plot: %s\n args=%s, kwargs=%s" % (law.dist.name, law.args, law.kwds))
    axe.set_xlabel("Theoretical Quantile")
    axe.set_ylabel("Empirical Quantile")
    axe.grid()
    
    return axe

Inspecting some distributions

Let's try Weibull and Log-Normal to see how well they match:

data = pd.read_excel("metric.xlsx")
x = data["metric"].values

laws = [
    stats.weibull_min,                   # Actual weibull name in scipy.stats
    stats.weibull_min(0.515, 0., 255),   # Weibull adjusted by chrslg
    stats.lognorm,
]

for law in laws:
    qqplot(x, law)

The result for a Weibull adjusted by MLE is:

[Image: Q-Q plot for the Weibull adjusted by MLE]

The result for the Weibull proposed by chrslg is:

[Image: Q-Q plot for the Weibull proposed by chrslg]

The result for a Log-Normal adjusted by MLE is:

[Image: Q-Q plot for the Log-Normal adjusted by MLE]

You can try any other distributions in a systematic way using this script.
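To give an idea of what "systematic" could look like numerically, here is a minimal sketch that fits a few candidate laws by MLE and ranks them by their Kolmogorov-Smirnov statistic. Everything in it is an assumption made for illustration: the data is a synthetic stand-in (the real metric is not reproduced here), the candidate list is arbitrary, the location is pinned at 0 with `floc=0` to keep the fits stable (which requires strictly positive values, unlike the real metric whose minimum is 0), and the KS statistic is only one possible ranking criterion. It complements the Q-Q plots rather than replacing them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, strictly positive stand-in for the metric (an assumption).
x = stats.lognorm(1.5, scale=125).rvs(size=5_000, random_state=rng)

# Arbitrary candidate laws, chosen only for this illustration.
candidates = {
    "weibull_min": stats.weibull_min,
    "lognorm": stats.lognorm,
    "gamma": stats.gamma,
}

results = {}
for name, dist in candidates.items():
    params = dist.fit(x, floc=0)  # MLE with the location fixed at 0
    # Largest gap between the ECDF and the fitted CDF: smaller is closer.
    results[name] = stats.kstest(x, dist.cdf, args=params).statistic

for name, ks in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:12s} KS statistic = {ks:.4f}")

best = min(results, key=results.get)
```

A single number like this hides where the mismatch occurs (low quantiles or tail), which is exactly what the Q-Q plot shows.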

Comparing the histograms and CDFs (a comparison that is not as good as a Q-Q plot), we have:

[Image: histograms and CDFs of the data and the fitted models]

What we can say so far is:

  • The Log-Normal adjusted by MLE agrees better with your data in the low quantiles (which was a requirement in your question) but has a fatter tail than expected. This means low values will be reproduced more faithfully by the model, but statistics affected by outliers, such as the mean and std, will be greater than expected. In this case the median is preserved;
  • As chrslg showed in the comments, it is possible to adjust a Weibull by hand so that its standard statistics (mean, median, std) are close to those of your dataset, but the model's low and high quantiles will then drift away from your dataset.
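To check such statements numerically, a frozen law can be asked for its own statistics and compared with the figures printed in the question. The sketch below does this for the hand-adjusted Weibull from the script above; the `observed` dictionary simply copies the values from the question's summary, and the choice of statistics to compare is an assumption.

```python
from scipy import stats

# Summary statistics copied from the question.
observed = {"mean": 483.013657, "median": 125.0, "std": 1410.598133}
observed_quartiles = [33.0, 125.0, 421.0]

# Hand-adjusted Weibull used in the script above.
law = stats.weibull_min(0.515, 0.0, 255)

model = {"mean": law.mean(), "median": law.median(), "std": law.std()}
for key in observed:
    print(f"{key:6s} observed={observed[key]:10.1f} model={model[key]:10.1f}")

# The quartiles show how the bulk of the model compares with the data.
model_quartiles = law.ppf([0.25, 0.50, 0.75])
for obs, mod in zip(observed_quartiles, model_quartiles):
    print(f"quartile observed={obs:8.1f} model={mod:8.1f}")
```

The same comparison works for any frozen `scipy.stats` law, for example a Log-Normal built from fitted parameters.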

Conclusions

Neither Weibull nor Log-Normal is a perfect model for your dataset, although Log-Normal behaves better in the low quantiles.

Finding the right model is not a trivial task and may require a bit more research, or tailoring a specific law for it.
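If no standard law is satisfactory, one possible workaround (not part of the analysis above, just a sketch) is to skip the parametric model and sample directly from the empirical quantiles by inverse-transform sampling. The helper `make_sampler`, the grid size, the synthetic stand-in data and the 5% multiplicative uplift are all hypothetical choices made for this example.

```python
import numpy as np

def make_sampler(data, n_grid=1001):
    """Build a sampler from a piecewise-linear inverse ECDF (hypothetical helper)."""
    probs = np.linspace(0.0, 1.0, n_grid)
    quantiles = np.quantile(data, probs)  # empirical quantile function on a grid

    def sample(size, rng):
        # Inverse-transform sampling: uniform draws mapped through the quantiles.
        u = rng.uniform(0.0, 1.0, size=size)
        return np.interp(u, probs, quantiles)

    return sample

rng = np.random.default_rng(0)

# Synthetic stand-in for the real metric (an assumption for this example).
observed = 255 * rng.weibull(0.515, size=50_000)

sampler = make_sampler(observed)
control = sampler(10_000, rng)           # A group
treatment = sampler(10_000, rng) * 1.05  # B group, hypothetical 5% uplift

print(np.median(observed), np.median(control), np.median(treatment))
```

The uplift here holds for the underlying distribution, not for every pair of finite samples, and the extreme tail is only as rich as the quantile grid allows, so draws never exceed the observed maximum.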

When comparing a distribution to a model, a Q-Q plot is a powerful tool to check at a glance that the whole behaviour of the distribution matches the model you are inspecting. Always prefer it to a simple histogram comparison.
