David Semmelroth

David Semmelroth has two decades of experience translating customer data into actionable insights across the financial services, travel, and entertainment industries. David has consulted for Cedar Fair, Wachovia, National City, and TD Bank.

Articles From David Semmelroth

Statistics for Big Data For Dummies Cheat Sheet

Cheat Sheet / Updated 03-10-2022

Summary statistical measures represent the key properties of a sample or population as a single numerical value. This has the advantage of providing important information in a very compact form. It also simplifies comparing multiple samples or populations. Summary statistical measures can be divided into three types: measures of central tendency, measures of central dispersion, and measures of association.

Discrete and Continuous Probability Distributions

Article / Updated 03-26-2016

The two basic types of probability distributions are known as discrete and continuous. Discrete distributions describe the properties of a random variable for which every individual outcome is assigned a positive probability. (A random variable is actually a function; it assigns numerical values to the outcomes of a random process.) Continuous distributions describe the properties of a random variable for which individual probabilities equal zero; positive probabilities can only be assigned to ranges of values, or intervals.

Two of the most widely used discrete distributions are the binomial and the Poisson. You use the binomial distribution when a random process consists of a sequence of independent trials, each of which has only two possible outcomes, and the probabilities of those outcomes are constant from trial to trial. For example, you could use the binomial distribution to determine the probability that a specified number of defaults will take place in a portfolio of bonds (if you can assume that the bonds are independent of each other). You use the Poisson distribution when a random process consists of events occurring over a given interval of time. For example, you could use the Poisson distribution to determine the likelihood that three stocks in an investor's portfolio pay dividends over the coming year.

Some of the most widely used continuous probability distributions are the normal distribution, the Student's t-distribution, the lognormal distribution, the chi-square distribution, and the F-distribution.

The normal distribution is one of the most widely used distributions in many disciplines, including economics, finance, biology, physics, psychology, and sociology. It is often illustrated as a bell-shaped curve, or bell curve, which indicates that the distribution is symmetrical about its mean, and it is defined for all values from negative infinity to positive infinity. Many real-world variables seem to follow the normal distribution (at least approximately), which accounts for its popularity. For example, it's often assumed that returns to financial assets are normally distributed (although this isn't entirely correct).

For situations in which the normal distribution is not appropriate, the Student's t-distribution is often used in its place. The Student's t-distribution shares several properties with the normal distribution; the most important difference is that it is more "spread out" about the mean. The Student's t-distribution is often used for analyzing the properties of small samples.

The lognormal distribution is closely related to the normal distribution: if Y = ln X and X is lognormally distributed, then Y is normally distributed; and if X = e^Y and Y is normally distributed, then X is lognormally distributed. For example, if returns to financial assets are normally distributed, then their prices are lognormally distributed. Unlike the normal distribution, the lognormal distribution is defined only for non-negative values, and instead of being symmetrical, it is positively skewed.

The chi-square distribution is characterized by its degrees of freedom, is defined only for non-negative values, and is also positively skewed. You can use the chi-square distribution for several applications, including testing hypotheses about the variance of a population, testing whether a population follows a specified probability distribution, and determining whether two populations are independent of each other.

The F-distribution is characterized by two different degrees of freedom: numerator and denominator. It's defined only for non-negative values and is positively skewed. You can use the F-distribution to determine whether the variances of two populations are equal. You can also use it in regression analysis to determine whether a group of slope coefficients is statistically significant.
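If you want to experiment with these distributions, the sketch below is a minimal example assuming SciPy is installed; the portfolio sizes, default rate, and dividend rate are made-up numbers for illustration. It computes a few binomial and Poisson probabilities and contrasts the tails of the normal and Student's t-distributions.

```python
# Minimal sketch of the discrete and continuous distributions discussed above.
# All parameter values are hypothetical.
from scipy.stats import binom, poisson, norm, t

# Binomial: probability of k defaults in a portfolio of n independent bonds,
# each defaulting with probability p.
n, p = 20, 0.05                  # hypothetical portfolio of 20 bonds, 5% default rate
print(binom.pmf(2, n, p))        # P(exactly 2 defaults)
print(1 - binom.cdf(2, n, p))    # P(more than 2 defaults)

# Poisson: probability of k events over an interval, given an average rate lam.
lam = 3.0                        # hypothetical average of 3 dividend payments per year
print(poisson.pmf(3, lam))       # P(exactly 3 events)

# Normal vs. Student's t: the t-distribution has heavier tails, so extreme
# values are more likely than under the normal.
print(norm.sf(3))                # P(Z > 3) under the standard normal
print(t.sf(3, df=5))             # P(T > 3) with 5 degrees of freedom (larger)
```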

10 Key Concepts in Hypothesis Testing

Article / Updated 03-26-2016

Hypothesis testing is a statistical technique that is used in a variety of situations. Though the technical details differ from situation to situation, all hypothesis tests use the same core set of terms and concepts. The following descriptions of common terms and concepts refer to a hypothesis test in which the means of two populations are being compared.

Null hypothesis
The null hypothesis is a clear statement about the relationship between two (or more) statistical objects. These objects may be measurements, distributions, or categories. Typically, the null hypothesis, as the name implies, states that there is no relationship. In the case of two population means, the null hypothesis might state that the means of the two populations are equal.

Alternative hypothesis
Once the null hypothesis has been stated, it is easy to construct the alternative hypothesis: it is essentially the statement that the null hypothesis is false. In this example, the alternative hypothesis would be that the means of the two populations are not equal.

Significance
The significance level is a measure of the statistical strength of the hypothesis test. It is often characterized as the probability of incorrectly concluding that the null hypothesis is false. The significance level is something that you should specify up front. In applications, the significance level is typically one of three values: 10%, 5%, or 1%. A 1% significance level represents the most stringent test of the three, because it allows only a 1% chance of incorrectly rejecting the null hypothesis.

Power
Related to significance, the power of a test measures the probability of correctly rejecting the null hypothesis when it is false. Power is not something that you can choose. It is determined by several factors, including the significance level you select and the size of the difference between the things you are trying to compare. Unfortunately, significance and power pull in opposite directions: making the significance level more stringent reduces power. This makes it difficult to design experiments that are both very strict and very powerful.

Test statistic
The test statistic is a single measure that captures the statistical nature of the relationship between the observations you are dealing with. The test statistic depends fundamentally on the number of observations that are being evaluated, and it differs from situation to situation.

Distribution of the test statistic
The whole notion of hypothesis testing rests on the ability to specify (exactly or approximately) the distribution that the test statistic follows. In the case of this example, the difference between the sample means will be approximately normally distributed (assuming there is a relatively large number of observations).

One-tailed vs. two-tailed tests
Depending on the situation, you may want (or need) to employ a one- or two-tailed test. These tails refer to the right and left tails of the distribution of the test statistic. A two-tailed test allows for the possibility that the test statistic is either very large or very small (where negative counts as small); a one-tailed test allows for only one of these possibilities. In an example where the null hypothesis states that the two population means are equal, you need to allow for the possibility that either one could be larger than the other, so the test statistic could be either positive or negative, and you employ a two-tailed test. The hypotheses might instead have been directional, for example testing whether the mean of population 1 is larger than the mean of population 2. In that case, you don't need to account statistically for the situation where the first mean is smaller than the second, so you would employ a one-tailed test.

Critical value
The critical value in a hypothesis test is based on two things: the distribution of the test statistic and the significance level. The critical value(s) mark the point(s) in the test statistic's distribution beyond which the tail area (meaning probability) is exactly equal to the significance level that was chosen.

Decision
Your decision to reject or accept the null hypothesis is based on comparing the test statistic to the critical value. If the test statistic exceeds the critical value, you reject the null hypothesis; in this case, you would say that the difference between the two population means is significant. Otherwise, you accept the null hypothesis.

P-value
The p-value of a hypothesis test gives you another way to evaluate the null hypothesis. The p-value is the lowest significance level at which your particular test statistic would justify rejecting the null hypothesis. For example, if you have chosen a significance level of 5% and the p-value turns out to be .03 (or 3%), you would be justified in rejecting the null hypothesis.
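As a concrete illustration, here is a minimal sketch of a two-tailed comparison of two population means using a two-sample t-test from SciPy. The simulated samples and the 5% significance level are assumptions made only for this example.

```python
# Two-tailed test of the difference between two population means.
# The samples are simulated purely for illustration.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
sample1 = rng.normal(loc=100, scale=15, size=50)   # hypothetical sample from population 1
sample2 = rng.normal(loc=105, scale=15, size=50)   # hypothetical sample from population 2

alpha = 0.05                                       # significance level chosen up front
t_stat, p_value = ttest_ind(sample1, sample2)      # test statistic and two-tailed p-value

if p_value <= alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: reject the null (the means differ)")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: fail to reject the null")
```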

Overview of Graphical Techniques

Article / Updated 03-26-2016

Several different types of graphs may be useful for analyzing data. These include stem-and-leaf plots, scatter plots, box plots, histograms, quantile-quantile (QQ) plots, and autocorrelation plots.

A stem-and-leaf plot consists of a "stem" that reflects the categories in a data set and a "leaf" that shows each individual value in the data set.

A scatter plot consists of a series of points that reflect paired observations from two data sets; the plot shows the relationship between the two data sets.

A box plot shows summary measures for a data set. The plot takes the form of a box that spans the quartiles of the data, with lines (whiskers) extending to values such as the minimum and the maximum.

A histogram shows the distribution of a data set as a series of vertical bars. Each bar represents a category (usually a numerical value or a range of numerical values) found in the data set, and the height of each bar represents the frequency of values in that category. Histograms are often used to identify the distribution a data set follows.

A QQ (quantile-quantile) plot compares the distribution of a data set with an assumed distribution.

An autocorrelation plot is used to show how closely related the elements of a time series are to their own past values.
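For readers who want to try a few of these plots, the following sketch assumes NumPy, SciPy, and matplotlib are available and uses simulated data; it draws a histogram, a scatter plot, a box plot, and a QQ plot against the normal distribution.

```python
# Minimal sketch of several graphical techniques, using simulated data.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=500)          # hypothetical data set
y = 2 * x + rng.normal(size=500)  # second data set related to the first

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].hist(x, bins=30)                      # histogram: distribution of x
axes[0, 0].set_title("Histogram")

axes[0, 1].scatter(x, y, s=5)                    # scatter plot: relationship between x and y
axes[0, 1].set_title("Scatter plot")

axes[1, 0].boxplot(x)                            # box plot: summary measures of x
axes[1, 0].set_title("Box plot")

stats.probplot(x, dist="norm", plot=axes[1, 1])  # QQ plot against the normal distribution
axes[1, 1].set_title("QQ plot")

plt.tight_layout()
plt.show()
```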

Overview of Hypothesis Testing

Article / Updated 03-26-2016

One important way to draw conclusions about the properties of a population is with hypothesis testing. You can use hypothesis tests to compare a population measure to a specified value, compare measures for two populations, determine whether a population follows a specified probability distribution, and so forth. Hypothesis testing is conducted as a six-step procedure:

1. Null hypothesis
2. Alternative hypothesis
3. Level of significance
4. Test statistic
5. Critical value
6. Decision

The null hypothesis is a statement that's assumed to be true unless there's strong evidence against it. The alternative hypothesis is a statement that is accepted if the null hypothesis is rejected. The level of significance specifies the likelihood of rejecting the null hypothesis when it's true; this is known as a Type I error. The test statistic is a numerical measure you compute from sample data to determine whether or not the null hypothesis should be rejected. The critical value is used as a benchmark to determine whether the test statistic is too extreme to be consistent with the null hypothesis. The decision as to whether or not the null hypothesis should be rejected is made as follows: if the absolute value of the test statistic exceeds the absolute value of the critical value, the null hypothesis is rejected; otherwise, the null hypothesis fails to be rejected.
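The decision rule in the last step can be illustrated with a small sketch. The example below assumes a two-tailed z-test of a single population mean with a known standard deviation; all of the numbers are hypothetical.

```python
# Critical-value decision rule for a two-tailed z-test of a population mean.
# All values are hypothetical.
import math
from scipy.stats import norm

mu_0 = 50          # value of the mean under the null hypothesis
sigma = 10         # assumed (known) population standard deviation
n = 36             # sample size
x_bar = 53.2       # observed sample mean
alpha = 0.05       # level of significance

# Test statistic: standardized distance between the sample mean and mu_0.
z = (x_bar - mu_0) / (sigma / math.sqrt(n))

# Critical value for a two-tailed test at the chosen significance level.
z_crit = norm.ppf(1 - alpha / 2)

if abs(z) > z_crit:
    print(f"|z| = {abs(z):.2f} > {z_crit:.2f}: reject the null hypothesis")
else:
    print(f"|z| = {abs(z):.2f} <= {z_crit:.2f}: fail to reject the null hypothesis")
```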

Measures of Association

Article / Updated 03-26-2016

Measures of association quantify the strength and the direction of the relationship between two data sets. The two most commonly used measures of association are covariance and correlation. Both measures show how closely two data sets are related to each other; the main difference between them is the units in which they are measured. The correlation is defined so that it always takes values between –1 and 1, which makes it very easy to interpret.

Covariance
The covariance between two samples is computed as follows:

s_xy = Σ(x_i − x̄)(y_i − ȳ) / (n − 1)

where x̄ and ȳ are the sample means and n is the number of paired observations. The covariance between two populations is computed as follows:

σ_xy = Σ(x_i − μ_x)(y_i − μ_y) / N

where μ_x and μ_y are the population means and N is the population size.

Correlation
The correlation between two samples is computed like this:

r_xy = s_xy / (s_x s_y)

where s_x and s_y are the sample standard deviations. The correlation between two populations is computed like this:

ρ_xy = σ_xy / (σ_x σ_y)

where σ_x and σ_y are the population standard deviations.
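As a quick illustration, the sketch below assumes NumPy is available and uses simulated data. Note that np.cov and np.corrcoef return matrices; the pairwise covariance and correlation sit in the off-diagonal entries.

```python
# Sample covariance and correlation with NumPy, on simulated data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)   # hypothetical related series

# np.cov divides by n - 1 by default, matching the sample covariance formula.
cov_xy = np.cov(x, y)[0, 1]

# np.corrcoef returns the correlation matrix; the off-diagonal entry is r.
corr_xy = np.corrcoef(x, y)[0, 1]

print(f"sample covariance:  {cov_xy:.4f}")
print(f"sample correlation: {corr_xy:.4f}")     # always between -1 and 1
```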

Measures of Central Tendency

Article / Updated 03-26-2016

Measures of central tendency show the center of a data set. Three of the most commonly used measures of central tendency are the mean, median, and mode.

Mean
Mean is another word for average. Here is the formula for computing the mean of a sample:

x̄ = (Σ x_i) / n

With this formula, you compute the sample mean by simply adding up all the elements in the sample and then dividing by the number of elements in the sample. Here is the corresponding formula for computing the mean of a population:

μ = (Σ x_i) / N

Although the notation is slightly different, the procedure for computing a population mean is the same as the procedure for computing a sample mean. Greek letters are used to describe populations, whereas Roman letters are used to describe samples.

Median
The median of a data set is a value that divides the data into two equal halves. In other words, half of the elements of a data set are less than the median, and the remaining half are greater than the median. The procedure for computing the median is the same for both samples and populations.

Mode
The mode of a data set is the most commonly observed value in the data set. You determine the mode in the same way for a sample and a population.
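Here is a minimal sketch of these three measures using Python's built-in statistics module, on a small made-up sample.

```python
# Mean, median, and mode with the standard-library statistics module.
import statistics

sample = [2, 3, 3, 5, 7, 8, 8, 8, 10]   # made-up sample data

print(statistics.mean(sample))    # sum of elements divided by the count -> 6
print(statistics.median(sample))  # middle value of the sorted data -> 7
print(statistics.mode(sample))    # most frequently observed value -> 8
```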

Measures of Central Dispersion

Article / Updated 03-26-2016

Measures of central dispersion show how "spread out" the elements of a data set are from the mean. Three of the most commonly used measures of central dispersion are the range, the variance, and the standard deviation.

Range
The range of a data set is the difference between the largest value and the smallest value. You compute it the same way for both samples and populations.

Variance
You can think of the variance as the average squared difference between the elements of a data set and the mean. The formulas for computing a sample variance and a population variance are slightly different. Here is the formula for computing sample variance:

s² = Σ(x_i − x̄)² / (n − 1)

And here is the formula for computing population variance:

σ² = Σ(x_i − μ)² / N

Standard deviation
The standard deviation is simply the square root of the variance. It's more commonly used as a measure of dispersion than the variance because it's measured in the same units as the elements of the data set, whereas the variance is measured in squared units.
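Here is a minimal sketch of these measures using Python's statistics module on made-up data; it highlights the sample versus population versions of the variance and standard deviation.

```python
# Range, variance, and standard deviation, contrasting the sample formulas
# (divide by n - 1) with the population formulas (divide by N).
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]        # made-up data set

print(max(data) - min(data))           # range -> 7
print(statistics.variance(data))       # sample variance (n - 1 in the denominator)
print(statistics.pvariance(data))      # population variance (N in the denominator) -> 4
print(statistics.stdev(data))          # sample standard deviation
print(statistics.pstdev(data))         # population standard deviation -> 2
```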

Are the Elements in the Dataset Uncorrelated?

Article / Updated 03-26-2016

For a dataset that consists of observations taken at different points in time (that is, time series data), it's important to determine whether or not the observations are correlated with each other, because many techniques for modeling time series data are based on the assumption that the observations are uncorrelated with each other (independent).

One graphical technique you can use to check for correlation is the autocorrelation function. The autocorrelation function shows the correlation between observations in a time series at different lags. For example, the correlation between observations with lag 1 refers to the correlation between each individual observation and its previous value.

Figure: Autocorrelation function of daily returns to ExxonMobil stock in 2013.

Each "spike" in the autocorrelation function represents the correlation between observations with a given lag. The autocorrelation at lag 0 always equals 1, because it represents the correlation of the observations with themselves. On the graph, the dashed lines represent the lower and upper limits of a confidence interval. If a spike rises above the upper limit of the confidence interval or falls below the lower limit, the correlation at that lag isn't 0, which is evidence against the independence of the elements in the dataset. In this case, there is only one statistically significant spike (at lag 8); with just one marginally significant spike among all the lags shown, the ExxonMobil returns may still be independent. A more formal statistical test would show whether that is true or not.
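The sketch below shows one way to compute a sample autocorrelation function and compare each spike against approximate 95% confidence limits. The return series is simulated and stands in for real daily returns.

```python
# Sample autocorrelation function with approximate 95% confidence limits.
# The returns are simulated; with real data, substitute the stock's daily returns.
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(scale=0.01, size=252)   # hypothetical year of daily returns

n = len(returns)
demeaned = returns - returns.mean()
denom = np.sum(demeaned ** 2)

def acf(lag):
    # Correlation of the series with a copy of itself shifted back by `lag` observations.
    return np.sum(demeaned[lag:] * demeaned[:n - lag]) / denom

bound = 1.96 / np.sqrt(n)   # approximate 95% confidence limits around zero
for lag in range(1, 11):
    r = acf(lag)
    flag = "significant" if abs(r) > bound else ""
    print(f"lag {lag:2d}: {r:+.3f} {flag}")
```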

Datasets That Include Dates

Article / Updated 03-26-2016

You very rarely run across a dataset that does not include dates. Purchase dates, birthdates, update dates, quote dates, and the list goes on. In almost every context, some sort of date is required to get a full picture of the situation you are trying to analyze. Dealing with dates can be a bit tricky, partly because of the variety of ways to store them, but also because, depending on what you're trying to do, you may only need part of the date. Here are a few common situations to look out for.

Dealing with datetime formats
For starters, most database management systems have an extremely precise way of storing dates internally: They use a datetime. This is exactly what it sounds like: a mashup of the date and the time. For example, a common format looks like this:

2014-11-24 14:25:44

That means 25 minutes and 44 seconds past 2 p.m. on November 24, 2014. The seemingly excessive detail here is rarely fully utilized. By far the most common user of the full detail is the database management system itself. It is a common practice for databases to put a datetime stamp on every record to indicate when the record was created and when it was last updated. The New York Stock Exchange systems actually keep track of trade time stamps to even greater precision.

For most analytic applications, however, this is more detail than you want. If you are analyzing a stock's closing price over time, you won't be interested in more than the day, or maybe the month, associated with each closing price. If you are doing a demographic analysis of age distributions, the year of birth may be all that's relevant. Birthdates provide a good example of something you may encounter with datetime data: even though the data may be stored in a datetime field, only part of the field may really be used. Birthdates typically have the time portion defaulted to 00:00:00 for every record. Luckily, both database systems and analytic software have built-in functions that allow you to extract only the portion of the datetime that is relevant to you. You can choose to extract only the date part, only the month and year, only the year, and so forth. In fact, this is often done for you before you ever see the data.

Taking geography into account
In the brave new world of the global economy, you will likely encounter data that has been collected from many different locations. Anyone who has ever tried to schedule an international conference call is well aware of the logistics involved in dealing with multiple time zones. More and more common nowadays are post-midnight conference calls with India.

One typical big data example involves supply chain management. Supply chain management is the ongoing process of trying to manage raw materials, inventories, distribution, and any other relevant aspect of a company's business. It's how Walmart keeps shelves stocked, how UPS keeps track of packages, and how Amazon manages to deliver almost anything imaginable almost anywhere. In these examples, the analysis that underlies supply chain management needs to take into account that data is coming from different time zones.

When faced with situations like this, datetime data must be dealt with carefully. Suppose a package is shipped from California at 10 a.m. on Wednesday and is delivered to its final destination in New York on Thursday at 10 a.m. If you are interested in analyzing delivery times, you need to take into account the time zone change. In this example, the delivery time is actually 21 hours, not 24. When dealing with datetime data collected from different time zones, you can't simply compare different data points based on the raw data. You first need to make sure that all datetimes are represented in a common time zone. Which time zone you use is somewhat arbitrary, as long as all the data points use the same one.

There is one other geographically related (or, to be more accurate, culturally related) fact that you need to be aware of: not all countries represent dates in the same way. The U.S. is somewhat unusual in representing dates as month/day/year. Canada and most of Europe prefer the convention day/month/year. You may also run across variations beginning with the year.

How your software thinks about dates
Dates are used in a variety of ways in data analysis. Sometimes, as with stock price analysis, their primary function is to put the observations in order from earliest to latest. But in other cases, they are used to measure time intervals. In engineering, particularly in quality control applications, a key statistic is mean time to failure. This is simply the average life span of a part or product. For long-lived products, like car parts and light bulbs, this calculation requires the comparison of dates.

On the face of it, August 15, 2013 minus January 1, 2010 doesn't make much sense mathematically. We all know what is meant by this, but it takes some thinking to get the answer. For this reason, many statistical packages, when confronted with dates, immediately convert them into a number in order to facilitate comparisons. They do this by picking some starting point and calculating the number of days between that starting point and the date that is being converted. For example, one large statistical software maker, SAS, uses January 1, 1960 as its starting point. This date has the value 0, and every date is stored as the number of days it is away from this starting point. Thus, SAS thinks of January 1, 1961 as 366 (remember, 1960 was a leap year, and January 1, 1960 is day 0, not day 1).

The starting point is arbitrary, and different software makers use different starting points, but the idea is the same. One odd consequence of this convention is that if you look at the raw data, not only are all the dates integers, but they don't even have to be positive integers. In the SAS example, January 1, 1959 would be represented as –365. In any case, this way of handling dates facilitates calculations. By converting the date to a number on input, the system avoids having to jump through hoops every time a calculation involving that date is performed.
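The datetime manipulations described above can be illustrated with the Python standard library. The sketch below parses a datetime string, extracts only the date portion, computes the California-to-New York delivery time with time zone awareness, and reproduces the SAS-style day counts from a January 1, 1960 origin; the shipment dates are hypothetical, and zoneinfo requires Python 3.9 or later.

```python
# Minimal sketch of common datetime manipulations, standard library only.
from datetime import datetime, date
from zoneinfo import ZoneInfo

# Parsing a datetime string and extracting only the parts you need.
stamp = datetime.strptime("2014-11-24 14:25:44", "%Y-%m-%d %H:%M:%S")
print(stamp.date())   # 2014-11-24
print(stamp.year)     # 2014

# Comparing timestamps across time zones: make both timezone-aware first.
shipped   = datetime(2014, 11, 19, 10, 0, tzinfo=ZoneInfo("America/Los_Angeles"))
delivered = datetime(2014, 11, 20, 10, 0, tzinfo=ZoneInfo("America/New_York"))
elapsed = delivered - shipped
print(elapsed.total_seconds() / 3600)     # 21.0 hours, not 24

# How a package like SAS stores dates: days relative to a fixed origin.
origin = date(1960, 1, 1)
print((date(1961, 1, 1) - origin).days)   # 366 (1960 was a leap year)
print((date(1959, 1, 1) - origin).days)   # -365
```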
