Q-Q Plot
When trying to match an empirical distribution to a known model, comparing histograms is often misleading.
In such a situation, a Q-Q plot is the tool you need: it compares the empirical quantiles of your data with the theoretical quantiles of a chosen model.
Here is a procedure to draw such a diagram:
- Compute the ECDF of the dataset to get the empirical quantiles;
- Choose a theoretical model, which you may fit against your data (e.g. using MLE);
- Compute the theoretical quantiles using the quantile function (or PPF) of your theoretical model;
- Create pairs of quantiles (theoretical, empirical).
If the model agrees with the data, the Q-Q points will lie on the identity line x=y.
The following snippet implements this procedure:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

def qqplot(data, law_factory, axe=None):
    if axe is None:
        fig, axe = plt.subplots()
    # Compute ECDF from data:
    ecdf = stats.ecdf(data)
    # Check if law is already parameterized (frozen):
    if isinstance(law_factory, stats._distn_infrastructure.rv_continuous_frozen):
        law = law_factory
    # Fit using MLE if not the case:
    else:
        parameters = law_factory.fit(data)
        law = law_factory(*parameters)
    # Compute theoretical quantiles:
    quantiles = law.ppf(ecdf.cdf.probabilities)
    # Q-Q points, then the identity line drawn on log-log axes:
    axe.scatter(quantiles, ecdf.cdf.quantiles, marker=".")
    axe.loglog(quantiles, quantiles, "--", color="black")
    axe.set_title("Q-Q Plot: %s\n args=%s, kwargs=%s" % (law.dist.name, law.args, law.kwds))
    axe.set_xlabel("Theoretical Quantile")
    axe.set_ylabel("Empirical Quantile")
    axe.grid()
    return axe
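If you want to check the procedure without plotting, the Q-Q pairs can also be computed directly. The sketch below is only an illustration: the synthetic sample, the helper name `qq_pairs` and the choice of pinning the location at zero are assumptions, not part of your dataset or of the function above. It uses midpoint plotting positions `(i - 0.5) / n` instead of the ECDF probabilities, because the last ECDF probability is 1, where the PPF of a distribution with unbounded support is infinite.

```python
import numpy as np
from scipy import stats


def qq_pairs(data, law):
    """Return (theoretical, empirical) quantile pairs for a frozen distribution."""
    empirical = np.sort(np.asarray(data, dtype=float))
    n = empirical.size
    # Midpoint plotting positions stay strictly inside (0, 1),
    # so law.ppf never returns an infinite quantile.
    probabilities = (np.arange(1, n + 1) - 0.5) / n
    theoretical = law.ppf(probabilities)
    return theoretical, empirical


# Synthetic sample (an assumption, NOT the dataset discussed here):
rng = np.random.default_rng(12345)
sample = stats.lognorm(s=0.8, scale=50.0).rvs(size=2000, random_state=rng)

# Fit by MLE with the location pinned at zero (an assumption), then freeze:
shape, loc, scale = stats.lognorm.fit(sample, floc=0)
fitted = stats.lognorm(shape, loc, scale)

theoretical, empirical = qq_pairs(sample, fitted)

# Correlation of the log-quantiles: close to 1 when points hug the identity line.
correlation = np.corrcoef(np.log(theoretical), np.log(empirical))[0, 1]
print(f"Q-Q correlation (log scale): {correlation:.4f}")
```

Since the sample is drawn from the same family that is fitted, the pairs should stay close to the identity line. With real data the interesting part is where they leave it.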
Inspecting some distributions
Let's try Weibull and Log-Normal to see how well they match:
data = pd.read_excel("metric.xlsx")
x = data["metric"].values

laws = [
    stats.weibull_min,                   # Actual weibull name in scipy.stats
    stats.weibull_min(0.515, 0., 255),   # Weibull adjusted by chrslg
    stats.lognorm,
]

for law in laws:
    qqplot(x, law)
The result for a Weibull adjusted by MLE is:

The result for the Weibull proposed by chrslg is:

The result for a Log-Normal adjusted by MLE is:

You can try any other distribution in a systematic way using this script.
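One possible way to make that systematic trial quantitative is to score each candidate by how far its Q-Q points sit from the identity line. The sketch below is a hedged illustration: the helper `qq_score`, the RMS gap on log-quantiles, the list of candidates, the location pinned at zero and the synthetic Weibull sample are all assumptions made for the example, not something taken from your data.

```python
import numpy as np
from scipy import stats


def qq_score(data, law_factory):
    """Fit law_factory by MLE and return the RMS gap between log-quantiles."""
    empirical = np.sort(np.asarray(data, dtype=float))
    n = empirical.size
    # Midpoint plotting positions avoid p = 0 and p = 1 (infinite quantiles).
    probabilities = (np.arange(1, n + 1) - 0.5) / n
    # Location pinned at zero (an assumption suited to positive data):
    law = law_factory(*law_factory.fit(empirical, floc=0))
    theoretical = law.ppf(probabilities)
    return np.sqrt(np.mean((np.log(theoretical) - np.log(empirical)) ** 2))


# Synthetic positive sample (an assumption, NOT the dataset discussed here):
rng = np.random.default_rng(7)
sample = stats.weibull_min(0.8, scale=100.0).rvs(size=3000, random_state=rng)

# Hypothetical shortlist of candidates, extend it as needed:
candidates = {
    "weibull_min": stats.weibull_min,
    "lognorm": stats.lognorm,
    "gamma": stats.gamma,
}

scores = {name: qq_score(sample, factory) for name, factory in candidates.items()}

# Smaller score means Q-Q points closer to the identity line (on log axes).
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:>12}: {score:.4f}")
```

Such a score only ranks candidates. It does not replace looking at the plot, which tells you in which quantiles the model departs from the data.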
Comparing the histograms and CDFs (a comparison that is less informative than the Q-Q plot), we have:

What we can say so far is:
- The Log-Normal adjusted by MLE agrees better with your data in the low quantiles (which was a requirement in your OP) but has a fatter tail than expected. This means low values will be reproduced more faithfully by the model, but statistics sensitive to outliers, such as the mean and std, will be greater than expected. In this case the median is preserved;
- As chrslg showed in the comments, it is possible to adjust a Weibull by hand so that its standard statistics (mean, median, std) are close to those of your dataset, but the low and high quantiles of the model will then depart from your dataset.
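To see how a fitted model shifts the standard statistics, you can put the sample statistics next to those of the frozen distribution. The sketch below assumes a synthetic Weibull sample as a stand-in for your dataset and a Log-Normal fit with the location pinned at zero. It shows how to make the comparison and does not reproduce the numbers for your data.

```python
import numpy as np
from scipy import stats

# Synthetic heavy-tailed sample (an assumption, NOT the dataset discussed here):
rng = np.random.default_rng(2024)
sample = stats.weibull_min(0.6, scale=250.0).rvs(size=5000, random_state=rng)

# MLE fit of a Log-Normal with location pinned at zero (an assumption):
law = stats.lognorm(*stats.lognorm.fit(sample, floc=0))

# Pairs of (observed on the sample, implied by the fitted model):
summary = {
    "median": (np.median(sample), law.median()),
    "mean": (np.mean(sample), law.mean()),
    "std": (np.std(sample, ddof=1), law.std()),
}

for name, (observed, modelled) in summary.items():
    print(f"{name:>6}: data={observed:12.2f}  model={modelled:12.2f}")
```

On this synthetic sample the fitted Log-Normal has a heavier upper tail than the data, so its mean and std come out larger than the sample values, which is the same kind of effect described above.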
Conclusions
Neither Weibull nor Log-Normal is a perfect model for your dataset, although Log-Normal behaves better in the low quantiles.
Finding the right model is not a trivial task and may require a bit more research or tailoring a specific law for it.
When comparing a distribution to a model, a Q-Q plot is a powerful tool to check at a glance that the whole behaviour of the distribution matches the model you are inspecting. Always prefer it to a simple histogram comparison.
fit methods? Which statistical packages? Show some code, please. Those fit methods are very efficient, even exact, in the sense that they return the parameters that fit best: not a trial, but the exact set of best possible parameters. But, of course, you have to understand those parameters and choose which statistical distribution you want to fit.