David Semmelroth has two decades of experience translating customer data into actionable insights across the financial services, travel, and entertainment industries. David has consulted for Cedar Fair, Wachovia, National City, and TD Bank.
Summary statistical measures represent the key properties of a sample or population as a single numerical value. This has the advantage of providing important information in a very compact form. It also simplifies comparing multiple samples or populations. Summary statistical measures can be divided into three types: measures of central tendency, measures of central dispersion, and measures of association.
Many different techniques have been designed to forecast the future value of a variable. Two of these are time series regression models and simulation models.
Time series regression models
A time series regression model is used to estimate the trend followed by a variable over time, using regression techniques.
For a dataset that consists of observations taken at different points in time (that is, time series data), it's important to determine whether or not the observations are correlated with each other. This is because many techniques for modeling time series data are based on the assumption that the data is uncorrelated with each other (independent).
Healthcare is one area where big data has the potential to make dramatic improvements in the quality of life. The increasing availability of massive amounts of data and rapidly increasing computer power could enable researchers to make breakthroughs, such as the following:
Predicting outbreaks of diseases
Gaining a better understanding of the effectiveness and side effects of drugs
Developing customized treatments based on patient histories
Reducing the cost of developing new treatments
One of the biggest challenges facing the use of big data in healthcare is that much of the data is stored in independent "silos.
Before you apply statistical techniques to a dataset, it's important to examine the data to understand its basic properties. You can use a series of techniques that are collectively known as Exploratory Data Analysis (EDA) to analyze a dataset. EDA helps ensure that you choose the correct statistical techniques to analyze and forecast the data.
Big data has made possible the development of highly capable online search engines. A search engine finding web pages based on search terms requires sophisticated algorithms and the ability to process a staggering number of requests. Here are four of the most widely used search engines:
Google
Microsoft Bing
Yahoo!
The two basic types of probability distributions are known as discrete and continuous. Discrete distributions describe the properties of a random variable for which every individual outcome is assigned a positive probability.
A random variable is actually a function; it assigns numerical values to the outcomes of a random process.
Hypothesis testing is a statistical technique that is used in a variety of situations. Though the technical details differ from situation to situation, all hypothesis tests use the same core set of terms and concepts. The following descriptions of common terms and concepts refer to a hypothesis test in which the means of two populations are being compared.
Several different types of graphs may be useful for analyzing data. These include stem-and-leaf plots, scatter plots, box plots, histograms, quantile-quantile (QQ) plots, and autocorrelation plots.
A stem-and-leaf plot consists of a “stem” that reflects the categories in a data set and a “leaf” that shows each individual value in the data set.
One important way to draw conclusions about the properties of a population is with hypothesis testing. You can use hypothesis tests to compare a population measure to a specified value, compare measures for two populations, determine whether a population follows a specified probability distribution, and so forth.