How To Determine The Probability Distribution Type for Data

by Contributor; Updated September 26, 2017

Once you have collected data on a system or process, the next step is to determine which type of probability distribution the data follow. Common candidates include the discrete uniform, Bernoulli, binomial, negative binomial, Poisson, geometric, continuous uniform, normal (bell curve), exponential, gamma and beta distributions. Ruling out even a few of these candidates makes it much faster to find the one with the highest R squared value.

Items you will need

  • Graphing software
  • Means of calculating the R squared value (best fit analysis)
Step 1

Plot the data, for example as a histogram, to get a visual picture of how the values are distributed.
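If graphing software is not at hand, a rough text histogram is enough for a first look. The following is a minimal sketch using only the Python standard library; the function names, the bin count of 10 and the randomly generated sample are illustrative assumptions, not part of the original procedure.

```python
import random

def histogram(data, bins=10):
    """Return (bin_edges, counts) for equal-width bins over the data range."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for x in data:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

def print_histogram(data, bins=10, max_bar=40):
    """Print one row of '#' marks per bin, scaled to at most max_bar marks."""
    edges, counts = histogram(data, bins)
    scale = max_bar / max(counts)
    for i, c in enumerate(counts):
        bar = "#" * round(c * scale)
        print(f"{edges[i]:8.2f} to {edges[i + 1]:8.2f} | {bar} {c}")

# Hypothetical measurements, generated here only to have something to plot.
random.seed(0)
sample = [random.gauss(50, 10) for _ in range(500)]
print_histogram(sample)
```

The number of bins changes how many peaks appear, so it is worth trying a few values before drawing conclusions from the shape.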

Step 2

One of the first steps in determining which distribution the data follow, and thus which type of equation to use to model the data, is to rule out what it cannot be.

  • If there are any peaks in the data set, it cannot be a discrete uniform distribution.
  • If the data has more than one peak, it is not Poisson or binomial.
  • If it has a single peak, no secondary peaks, and a gradual slope on each side, it may be a Poisson or a gamma distribution, but it cannot be a discrete uniform distribution.
  • If the data is symmetric, with no skew toward one side, a gamma or Weibull distribution is unlikely, since both are usually skewed.
  • If the data is spread evenly or has a peak in the middle of the graphed results, it is not a geometric or an exponential distribution, because both of these are highest at the low end and fall away from there.
  • If the rate at which an event occurs varies with an environmental variable, it is probably not a Poisson distribution, which assumes a constant average rate.
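The shape rules above can be turned into a rough automatic screen. The sketch below is one possible reading of those rules, not a standard test: the peak counter, the flatness tolerance of 0.25, the skewness limit of 0.5 and all function names are assumptions chosen for illustration, and noisy data can easily produce spurious peaks.

```python
import random
import statistics

def bin_counts(data, bins=10):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for x in data:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    return counts

def looks_flat(counts, tolerance=0.25):
    """Treat the histogram as flat if all bins are within tolerance of the tallest."""
    return (max(counts) - min(counts)) <= tolerance * max(counts)

def count_peaks(counts, min_fraction=0.1):
    """Count local maxima that reach at least min_fraction of the tallest bin."""
    floor = min_fraction * max(counts)
    padded = [-1] + list(counts) + [-1]
    peaks = 0
    for i in range(1, len(padded) - 1):
        if padded[i] >= floor and padded[i] > padded[i - 1] and padded[i] >= padded[i + 1]:
            peaks += 1
    return peaks

def skewness(data):
    """Moment-based skewness: zero for symmetric data, positive for a long right tail."""
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / len(data)
    m3 = sum((x - mean) ** 3 for x in data) / len(data)
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

def screen(data, bins=10, skew_limit=0.5):
    """Return the distribution types that the rules of thumb rule out."""
    counts = bin_counts(data, bins)
    flat = looks_flat(counts)
    peaks = 0 if flat else count_peaks(counts)
    ruled_out = set()
    if not flat:
        ruled_out.add("discrete uniform")           # any peak rules out uniform
    if peaks > 1:
        ruled_out.update({"Poisson", "binomial"})   # more than one peak
    if abs(skewness(data)) < skew_limit:
        ruled_out.update({"gamma", "Weibull"})      # symmetric, so unlikely (rule of thumb)
    mode_bin = counts.index(max(counts))
    if flat or 0 < mode_bin < bins - 1:
        ruled_out.update({"geometric", "exponential"})  # even spread or peak in the middle
    return sorted(ruled_out)

# Hypothetical right-skewed sample, generated only for demonstration.
random.seed(0)
print(screen([random.expovariate(1.0) for _ in range(1000)]))
```

A screen like this only narrows the list; the remaining candidates still need to be compared in the next step.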

Step 3

After the list of possible distribution types has been narrowed down, fit each remaining candidate to the data and calculate its R squared value. The candidate with the highest R squared value is the most likely match.
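The article does not say how the R squared value should be computed, so the sketch below makes its own assumptions: each candidate's parameters are estimated from the sample mean, standard deviation or range, and R squared is measured between the empirical cumulative distribution of the data and the candidate's cumulative distribution function. Only three continuous candidates are included, and the function names are hypothetical.

```python
import math
import random
import statistics

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def exponential_cdf(x, rate):
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def uniform_cdf(x, lo, hi):
    return min(max((x - lo) / (hi - lo), 0.0), 1.0)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_obs = statistics.fmean(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot

def score_candidates(data):
    """Return {candidate name: R squared of its CDF against the empirical CDF}."""
    xs = sorted(data)
    if xs[0] == xs[-1]:
        raise ValueError("all values are equal; nothing to fit")
    n = len(xs)
    # Empirical CDF evaluated at each sorted point (midpoint plotting positions).
    ecdf = [(i + 0.5) / n for i in range(n)]
    mu, sigma = statistics.fmean(xs), statistics.pstdev(xs)
    candidates = {
        "normal": lambda x: normal_cdf(x, mu, sigma),
        "continuous uniform": lambda x: uniform_cdf(x, xs[0], xs[-1]),
    }
    # The exponential distribution only applies to non-negative data.
    if xs[0] >= 0 and mu > 0:
        candidates["exponential"] = lambda x: exponential_cdf(x, 1.0 / mu)
    return {name: r_squared(ecdf, [cdf(x) for x in xs])
            for name, cdf in candidates.items()}

# Hypothetical sample, generated only for demonstration.
random.seed(1)
sample = [random.expovariate(0.5) for _ in range(2000)]
scores = score_candidates(sample)
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name:20s} R squared = {scores[name]:.4f}")
```

Comparing cumulative curves avoids having to choose a bin width, but R squared values from different candidates can be close, so a small difference should not be treated as decisive.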

Step 4

Remove one outlier data point, then recalculate the R squared value for each candidate. If the same probability distribution type is still the closest match, you can have more confidence that it is the correct distribution to use for the data set.
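This stability check can be sketched as follows. The choice of outlier (the point farthest from the median) is an assumption, since the article does not say which point to remove, and the scoring is a compact, hypothetical version that compares only a normal and an exponential candidate by R squared against the empirical cumulative distribution. The sketch assumes the data are not all identical.

```python
import math
import random
import statistics

def cdf_r_squared(xs, cdf):
    """R squared between the empirical CDF of sorted xs and a model CDF."""
    n = len(xs)
    ecdf = [(i + 0.5) / n for i in range(n)]
    ss_tot = sum((e - 0.5) ** 2 for e in ecdf)  # these positions average exactly 0.5
    ss_res = sum((e - cdf(x)) ** 2 for e, x in zip(ecdf, xs))
    return 1.0 - ss_res / ss_tot

def best_fit(data):
    """Name of the candidate with the highest R squared (normal vs. exponential)."""
    xs = sorted(data)
    mu, sigma = statistics.fmean(xs), statistics.pstdev(xs)
    scores = {
        "normal": cdf_r_squared(
            xs, lambda x: 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))),
    }
    if xs[0] >= 0 and mu > 0:  # exponential needs non-negative data
        scores["exponential"] = cdf_r_squared(xs, lambda x: 1 - math.exp(-x / mu))
    return max(scores, key=scores.get)

def drop_most_extreme(data):
    """Return a copy of data without the single point farthest from the median."""
    med = statistics.median(data)
    extreme = max(data, key=lambda x: abs(x - med))
    trimmed = list(data)
    trimmed.remove(extreme)
    return trimmed

def is_stable(data):
    """True if the best-fitting candidate is unchanged after removing one outlier."""
    return best_fit(data) == best_fit(drop_most_extreme(data))

# Hypothetical sample with one planted outlier, for demonstration only.
random.seed(2)
sample = [random.gauss(100, 5) for _ in range(500)] + [160.0]
print(best_fit(sample), is_stable(sample))
```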

Tips

  • If the data shows multiple peaks or a broad scatter, it is possible that two separate processes are going on or that the product being sampled is mixed. Collect the data again and then re-analyze.

Warnings

  • Validate the equations generated against later data sets to confirm that they are still accurate. Environmental factors and process drift may have made the current equations and models incorrect.
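One way to run such a validation is to keep the fitted parameters fixed and score them against each later data set. The sketch below assumes a normal model, an R squared measured against the empirical cumulative distribution, and an acceptance threshold of 0.95; all three are illustrative assumptions rather than recommendations from the article.

```python
import math
import random
import statistics

def fit_normal(data):
    """Estimate (mean, standard deviation) from the baseline data."""
    return statistics.fmean(data), statistics.pstdev(data)

def r_squared_against(data, mu, sigma):
    """R squared of a fixed normal model against the empirical CDF of new data."""
    xs = sorted(data)
    n = len(xs)
    ecdf = [(i + 0.5) / n for i in range(n)]
    model = [0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) for x in xs]
    ss_tot = sum((e - 0.5) ** 2 for e in ecdf)
    ss_res = sum((e - m) ** 2 for e, m in zip(ecdf, model))
    return 1 - ss_res / ss_tot

def still_valid(later_data, mu, sigma, threshold=0.95):
    """True if the old model still describes the later data well enough."""
    return r_squared_against(later_data, mu, sigma) >= threshold

# Hypothetical baseline and later batches, generated only for demonstration.
random.seed(4)
baseline = [random.gauss(20, 2) for _ in range(1000)]
mu, sigma = fit_normal(baseline)

same_process = [random.gauss(20, 2) for _ in range(1000)]
drifted = [random.gauss(23, 2) for _ in range(1000)]  # mean has shifted
print(still_valid(same_process, mu, sigma), still_valid(drifted, mu, sigma))
```

The parameters are deliberately not refitted on the later data, because refitting would hide exactly the drift this check is meant to reveal.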