Probability Distributions in Statistical Analysis of Big Data

By Alan Anderson, David Semmelroth

Probability distributions are among the many statistical tools you can use to analyze data and find useful patterns. You use a probability distribution to compute the probabilities associated with the elements of a dataset. Commonly used distributions include the following:

  • Binomial distribution: You would use the binomial distribution to analyze variables that can assume only one of two values. It gives the probability of a given number of "successes" in a fixed number of independent trials. For example, you could determine the probability that a given percentage of members at a sports club are left-handed.

  • Poisson distribution: You would use the Poisson distribution to describe the likelihood of a given number of events occurring over an interval of time. For example, it could be used to describe the probability of a specified number of hits on a website over the coming hour.

  • Normal distribution: The normal distribution is the most widely used probability distribution in many disciplines, including economics, finance, marketing, biology, and psychology. One of its characteristic features is symmetry: the probability of a variable being a given distance below the mean of the distribution equals the probability of it being the same distance above the mean.

    For example, if the mean height of all men in the United States is 70 inches, and heights are normally distributed, a randomly chosen man is as likely to be between 68 and 70 inches tall as he is to be between 70 and 72 inches tall.

    The normal distribution works well in many applications. For example, it’s often used in the field of finance to describe the returns on financial assets. Because it’s easy to interpret and implement, the normal distribution is sometimes used even when the assumption of normality is only approximately correct.

  • Student’s t-distribution: The Student’s t-distribution is similar to the normal distribution, but it has heavier tails, so extremely small or extremely large values are more likely to occur. It’s often used when a variable exhibits too much variation to be consistent with the normal distribution, as is the case when the properties of small samples are being analyzed. With small samples, the variation among samples is likely to be considerable, so the normal distribution shouldn’t be used to describe their properties.

    The Student’s t-distribution was developed by W.S. Gosset while employed at the Guinness brewing company. He was attempting to describe the properties of small sample means.

  • Chi-square distribution: The chi-square distribution is appropriate for several types of applications. For example, you can use it to determine whether a population follows a particular probability distribution (a goodness-of-fit test). You can also use it to test whether the variance of a population equals a specified value, and to test whether two categorical variables are independent.

  • F-distribution: The F-distribution is derived from the chi-square distribution: it describes the ratio of two independent chi-square variables, each divided by its degrees of freedom. You use it to test whether the variances of two populations equal each other. The F-distribution is also useful in applications such as regression analysis.
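
Several of the probabilities described above can be computed directly. The following is a minimal Python sketch that uses only the standard library. The specific numbers (50 club members, a 10 percent chance of being left-handed, an average of 20 website hits per hour, a 3-inch standard deviation for height, and samples of size 5) are illustrative assumptions, not figures from this article, and the helper names are hypothetical. The last part uses simulation, rather than a formula, to suggest why the normal distribution understates the variation in small samples.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """Probability of exactly k events when `rate` events are expected.

    Computed on the log scale so large k does not overflow.
    """
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))


def t_statistic(sample, mu):
    """Distance of the sample mean from mu, in estimated standard errors."""
    return (mean(sample) - mu) / (stdev(sample) / math.sqrt(len(sample)))


# Binomial: chance that exactly 10% (5 of 50) of club members are
# left-handed, assuming each is left-handed with probability 0.10.
p_five_lefties = binomial_pmf(5, 50, 0.10)

# Poisson: chance of exactly 25 website hits in the coming hour,
# assuming an average of 20 hits per hour.
p_25_hits = poisson_pmf(25, 20)

# Normal: symmetry around a mean height of 70 inches.
# The standard deviation of 3 inches is an assumption.
heights = NormalDist(mu=70, sigma=3)
p_68_to_70 = heights.cdf(70) - heights.cdf(68)
p_70_to_72 = heights.cdf(72) - heights.cdf(70)

# Small samples: draw many samples of size 5 from a standard normal
# population and count how often the t-statistic falls beyond +/-1.96.
# If the statistic were normally distributed, this would be about 5%.
rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 10_000
extreme = sum(
    abs(t_statistic([rng.gauss(0, 1) for _ in range(5)], 0)) > 1.96
    for _ in range(trials)
)
share_extreme = extreme / trials

if __name__ == "__main__":
    print(f"P(5 of 50 left-handed)  = {p_five_lefties:.4f}")
    print(f"P(25 hits in an hour)   = {p_25_hits:.4f}")
    print(f"P(68 to 70 inches)      = {p_68_to_70:.4f}")
    print(f"P(70 to 72 inches)      = {p_70_to_72:.4f}")
    print(f"Share of |t| above 1.96 = {share_extreme:.4f}")
```

The two height probabilities come out equal, which is the symmetry property at work. The simulated share of extreme t-statistics lands well above 5 percent, which is the extra variation that the Student’s t-distribution is meant to capture.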