Published: August 31, 2015

Statistics for Big Data For Dummies

Overview

The fast and easy way to make sense of statistics for big data

Does the subject of data analysis make you dizzy? You've come to the right place! Statistics For Big Data For Dummies breaks this often-overwhelming subject down into easily digestible parts, offering new and aspiring data analysts the foundation they need to be successful in the field. Inside, you'll find an easy-to-follow introduction to exploratory data analysis, the lowdown on collecting, cleaning, and organizing data, everything you need to know about interpreting data using common software and programming languages, plain-English explanations of how to make sense of data in the real world, and much more.

Data has never been easier to come by, and the tools students and professionals need to enter the world of big data are based on applied statistics. While the word "statistics" alone can evoke feelings of anxiety in even the most confident student or professional, it doesn't have to. Written in the familiar and friendly tone that has defined the For Dummies brand for more than twenty years, Statistics For Big Data For Dummies takes the intimidation out of the subject, offering clear explanations and tons of step-by-step instruction to help you make sense of data mining—without losing your cool.

  • Helps you to identify valid, useful, and understandable patterns in data
  • Provides guidance on extracting previously unknown information from large databases
  • Shows you how to discover patterns available in big data
  • Gives you access to the latest tools and techniques for working in big data

If you're a student enrolled in a related Applied Statistics course or a professional looking to expand your skillset, Statistics For Big Data For Dummies gives you access to everything you need to succeed.


About The Author

Alan Anderson, PhD, is a teacher of finance, economics, statistics, and math at Fordham and Fairfield universities, as well as at Manhattanville and Purchase colleges. Outside of the academic environment he has many years of experience working as an economist, risk manager, and fixed income analyst. Alan received his PhD in economics from Fordham University and an MS in financial engineering from Polytechnic University.

Sample Chapters

Statistics For Big Data For Dummies Cheat Sheet

Summary statistical measures represent the key properties of a sample or population as a single numerical value. This has the advantage of providing important information in a very compact form. It also simplifies comparing multiple samples or populations. Summary statistical measures can be divided into three types: measures of central tendency, measures of central dispersion, and measures of association.


Articles from the Book

Hypothesis testing is a statistical technique that is used in a variety of situations. Though the technical details differ from situation to situation, all hypothesis tests use the same core set of terms and concepts. The following descriptions of common terms and concepts refer to a hypothesis test in which the means of two populations are being compared.
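As a minimal sketch of such a comparison (the sample values below are invented for illustration, not taken from the book), a two-sample t-test in Python with SciPy might look like this:

```python
# Sketch: a two-sample t-test comparing the means of two populations.
# The sample values are made up for illustration.
from scipy import stats

sample_a = [23.1, 25.4, 22.8, 24.9, 26.0, 23.7]
sample_b = [27.2, 26.8, 28.1, 25.9, 27.5, 26.4]

# Null hypothesis: the two population means are equal.
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# A small p-value (for example, below 0.05) leads to rejecting the null hypothesis.
```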
Statistical software packages are extremely powerful these days, but they cannot overcome poor-quality data. Following is a checklist of things you need to do before you go off building statistical models. First, check data formats: your analysis always starts with a raw data file, and raw data files come in many different shapes and sizes.
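A first pass at checking formats can be as simple as the following sketch with pandas; the file name "sales.csv" is a hypothetical stand-in for your own raw data file:

```python
# Sketch: a first look at a raw data file with pandas.
# "sales.csv" is a hypothetical file name.
import pandas as pd

df = pd.read_csv("sales.csv")
print(df.head())    # inspect the first few records
print(df.dtypes)    # check how each column was interpreted (text, number, date)
print(df.shape)     # verify the record and column counts
```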
For a dataset that consists of observations taken at different points in time (that is, time series data), it's important to determine whether or not the observations are correlated with each other. This is because many techniques for modeling time series data are based on the assumption that the observations are uncorrelated with each other (independent).
An autocorrelation plot shows the properties of a type of data known as a time series. A time series refers to observations of a single variable over a specified time horizon. For example, the daily price of Microsoft stock during the year 2013 is a time series. Cross-sectional data refers to observations on many variables at a single point in time.
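A quick way to produce such a plot is sketched below; the simulated price series stands in for real data such as the Microsoft example mentioned above:

```python
# Sketch: an autocorrelation plot for a daily price series.
# The simulated prices are a stand-in for real data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

prices = pd.Series(100 + np.cumsum(np.random.normal(0, 1, 250)))

autocorrelation_plot(prices)
plt.title("Autocorrelation of daily prices")
plt.show()
```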
One area where big data has made an impact on electric utilities is the development of smart meters. Smart meters provide a more accurate measure of energy usage by giving far more frequent readings than traditional meters. A smart meter may give several readings a day, not just once a month or once a quarter.
One area of the finance industry that has been dramatically affected by big data is the trading activities of banks and other financial institutions. An example is high-frequency trading (HFT), a relatively new mode of trading that depends on the ability to execute massive volumes of trades in extremely short time intervals.
Healthcare is one area where big data has the potential to make dramatic improvements in the quality of life. The increasing availability of massive amounts of data and rapidly increasing computer power could enable researchers to make breakthroughs such as predicting outbreaks of diseases, gaining a better understanding of the effectiveness and side effects of drugs, developing customized treatments based on patient histories, and reducing the cost of developing new treatments. One of the biggest challenges facing the use of big data in healthcare is that much of the data is stored in independent "silos."
Big data is making dramatic changes in the field of education. One area that has shown particular promise is computerized learning programs, which provide instant feedback to educators. The data gathered from these programs can provide key information that identifies students who need extra help, students who are ready for more advanced material, topics that students are finding especially difficult, and different learning styles. This information enables educators to identify problem areas and come up with alternative methods for presenting material.
The insurance industry couldn't survive without the ability to gather and process substantial quantities of data. In order to determine the appropriate premiums for their policies, insurance companies must be able to analyze the risks that policyholders face and to determine the likelihood of these risks actually materializing.
Retailers collect and maintain sales records for large numbers of customers. The challenge has always been to put this data to good use. Ideally, a retailer would like to understand the demographic characteristics of its customers and what types of goods and services they are interested in buying. The continued improvement in computing capacity has made it possible to sift through huge volumes of data in order to find patterns that can be used to forecast demand for different products, based on customer characteristics.
Big data has made possible the development of highly capable online search engines. Finding web pages based on search terms requires sophisticated algorithms and the ability to process a staggering number of requests. Some of the most widely used search engines include Google, Microsoft Bing, and Yahoo!
Social media wouldn't be possible without big data. Social media websites let people share photos, videos, personal data, commentary, and so forth. Some of the best examples of social media websites include Facebook, Twitter, LinkedIn, and Instagram. Facebook was created in 2004 by Harvard students. It has since grown into the largest social media site in the world.
Weather forecasting has always been extremely challenging, given the number of variables involved and the complex interactions between those variables. Dramatic increases in the ability to gather and process data have greatly enhanced the ability of weather forecasters to pinpoint the timing and severity of hurricanes, floods, snowstorms, and other weather events.
A box plot is designed to show several key statistics for a dataset in the form of a vertical rectangle or box. The statistics it can show include the minimum value, the maximum value, the first quartile (Q1), the second quartile (Q2), the third quartile (Q3), and the interquartile range (IQR). The first quartile of a dataset is a numerical measure that divides the data into two parts: the smallest 25 percent of the observations and the largest 75 percent of the observations.
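A minimal sketch of computing those quartile statistics and drawing the box, using an invented dataset:

```python
# Sketch: a box plot and the quartile statistics it summarizes.
# The dataset is invented for illustration.
import numpy as np
import matplotlib.pyplot as plt

data = np.array([12, 15, 17, 19, 22, 24, 25, 28, 31, 35, 41])

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print(f"Q1={q1}, Q2 (median)={q2}, Q3={q3}, IQR={iqr}")

plt.boxplot(data)
plt.title("Box plot of the sample data")
plt.show()
```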
You very rarely run across a dataset that does not include dates. Purchase dates, birthdates, update dates, quote dates, and the list goes on. In almost every context, some sort of date is required to get a full picture of the situation you are trying to analyze. Dealing with dates can be a bit tricky, partly because of the variety of ways to store them.
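One common chore is converting dates stored in different text formats into a single representation; the sketch below assumes two hypothetical source formats and uses pandas to reconcile them:

```python
# Sketch: parsing dates stored in different text formats with pandas.
# The format strings and sample dates are assumptions for illustration.
import pandas as pd

iso_style = pd.Series(["2015-08-31", "2015-09-01", "2015-09-02"])
us_style = pd.Series(["08/31/2015", "09/01/2015", "09/02/2015"])

dates_iso = pd.to_datetime(iso_style, format="%Y-%m-%d")
dates_us = pd.to_datetime(us_style, format="%m/%d/%Y")

print(dates_iso.equals(dates_us))  # True: both now share one datetime representation
```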
The two basic types of probability distributions are known as discrete and continuous. Discrete distributions describe the properties of a random variable for which every individual outcome is assigned a positive probability. A random variable is actually a function; it assigns numerical values to the outcomes of a random process.
Most datasets come with some sort of metadata, which is essentially a description of the data in the file. Metadata typically includes descriptions of the formats, some indication of what values are in each data field, and what these values mean. When you are faced with a new dataset, never take the metadata at face value.
For time series data, it's important to know whether the observations continue to have the same mean over time and whether the variance of the data is changing over time. Many statistical tests and forecasting techniques depend on this assumption. The figure shows a time series plot of ExxonMobil's daily returns throughout 2013.
There are several Exploratory Data Analysis (EDA) techniques you can use to test assumptions about a dataset. These include the run sequence plot, lag plot, histogram, and normal probability plot. Many statistical techniques are based on the assumption that the data being analyzed has the following properties: independent variables, variables drawn from a common probability distribution, and variables with common parameters (for example, mean and standard deviation). A run sequence plot tests whether the data conforms to these assumptions.
Before you apply statistical techniques to a dataset, it's important to examine the data to understand its basic properties. You can use a series of techniques that are collectively known as Exploratory Data Analysis (EDA) to analyze a dataset. EDA helps ensure that you choose the correct statistical techniques to analyze and forecast the data.
Many different techniques have been designed to forecast the future value of a variable. Two of these are time series regression models and simulation models. A time series regression model is used to estimate the trend followed by a variable over time, using regression techniques.
EDA is based heavily on graphical techniques. You can use graphical techniques to identify the most important properties of a dataset. Some of the more widely used graphical techniques are box plots, histograms, normal probability plots, and scatter plots. You use box plots to show some of the most important features of a dataset, such as the minimum value, the maximum value, and the quartiles. Quartiles separate a dataset into four equal sections.
Identifying data outliers isn't a cut-and-dried matter. There can be disagreement about what does and does not qualify as an outlier. The definition of an outlier depends on the assumed probability distribution of a population. For example, if a population really is normally distributed, the graph of a dataset should have the same signature bell shape — if it doesn't, that could be a sign that there are outliers in the data.
A histogram is a graph that represents the probability distribution of a dataset. A histogram has a series of vertical bars where each bar represents a single value or a range of values for a variable. The heights of the bars indicate the frequencies or probabilities for the different values or ranges of values.
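A short sketch of drawing such a histogram, using simulated daily returns as stand-in data:

```python
# Sketch: a histogram of simulated daily returns; bar heights show how
# often returns fall in each range.
import numpy as np
import matplotlib.pyplot as plt

returns = np.random.normal(loc=0.0005, scale=0.01, size=250)  # simulated data

plt.hist(returns, bins=20, edgecolor="black")
plt.xlabel("Daily return")
plt.ylabel("Frequency")
plt.show()
```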
When working with big data statistics, you identify the spread of a dataset from the center with several different summary measures: variance, standard deviation, quartiles, and the interquartile range (IQR). Variance is the average squared deviation between the elements of the dataset and the mean. For a sample of data, the variance is computed by summing the squared deviations of the elements from the sample mean and dividing by n − 1, where xi is the value of a single element in the sample and n is the number of elements in the sample.
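A minimal sketch of computing these spread measures with NumPy, using a small invented sample:

```python
# Sketch: sample dispersion measures with NumPy; the data is invented.
import numpy as np

x = np.array([4.0, 7.0, 2.0, 9.0, 5.0, 6.0])

variance = np.var(x, ddof=1)           # sample variance (divides by n - 1)
std_dev = np.std(x, ddof=1)            # sample standard deviation
q1, q3 = np.percentile(x, [25, 75])    # first and third quartiles
iqr = q3 - q1                          # interquartile range

print(variance, std_dev, q1, q3, iqr)
```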
Data is stored in different ways in different systems. So it's no surprise that when collecting and consolidating data from various sources, it's possible that duplicates pop up. In particular, what makes an individual record unique is different for different systems. An investment account summary is attached to an account number.
Several formal statistical tests are designed to detect data outliers. Three of these take the form of hypothesis tests. A hypothesis test is a procedure for determining whether a proposition can be rejected based on sample data. Hypothesis tests always involve comparing a test statistic from the data to an appropriate distribution to determine whether a given hypothesis is supported by the data.
In data analysis, the relationship between the mean and the median can be used to determine whether a distribution is skewed. The histogram shows that most of the returns are close to the mean, which is 0.000632 (0.0632 percent). The median is −0.0001179. If the mean is greater than the median, the distribution is positively skewed; if the mean is less than the median, it is negatively skewed. In this case, the mean exceeds the median, so the distribution of returns to ExxonMobil stock is positively skewed.
The properties of a time series may be modeled in terms of the following components or factors. Most time series contain one or more of the following: a trend component, a seasonal component, a cyclical component, and an irregular component. A trend is a long-run increase or decrease in a time series.
Measures of association quantify the strength and the direction of the relationship between two data sets. The two most commonly used measures of association are covariance and correlation. Both measures are used to show how closely two data sets are related to each other. The main difference between them is the units in which they are measured.
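As a minimal sketch with two invented data sets, covariance and correlation can be computed like this:

```python
# Sketch: covariance and correlation between two invented data sets.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

covariance = np.cov(x, y, ddof=1)[0, 1]    # measured in the units of x times y
correlation = np.corrcoef(x, y)[0, 1]      # unitless, always between -1 and 1

print(covariance, correlation)
```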
Measures of central dispersion show how "spread out" the elements of a data set are from the mean. Three of the most commonly used measures of central dispersion are the range, the variance, and the standard deviation. The range of a data set is the difference between the largest value and the smallest value.
Measures of central tendency show the center of a data set. Three of the most commonly used measures of central tendency are the mean, median, and mode. Mean is another word for average. The mean of a sample is the sum of the elements divided by the number of elements: you compute the sample mean by simply adding up all the elements in the sample and then dividing by the number of elements in the sample.
One of the most frequent and messiest data problems to deal with is missing data. Files can be incomplete because records were dropped or a storage device filled up. Or certain data fields might contain no data for some records. The first of these problems can be diagnosed by simply verifying record counts for files.
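A quick diagnostic sketch with pandas; the file name and the "balance" column are hypothetical:

```python
# Sketch: diagnosing missing data with pandas.
# "accounts.csv" and the "balance" column are hypothetical names.
import pandas as pd

df = pd.read_csv("accounts.csv")

print(len(df))                   # verify the record count against expectations
print(df.isna().sum())           # count missing values in each field
print(df[df["balance"].isna()])  # inspect records missing a specific field
```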
Several different types of graphs may be useful for analyzing data. These include stem-and-leaf plots, scatter plots, box plots, histograms, quantile-quantile (QQ) plots, and autocorrelation plots. A stem-and-leaf plot consists of a “stem” that reflects the categories in a data set and a “leaf” that shows each individual value in the data set.
One important way to draw conclusions about the properties of a population is with hypothesis testing. You can use hypothesis tests to compare a population measure to a specified value, compare measures for two populations, determine whether a population follows a specified probability distribution, and so forth.
Probability distributions are among the many statistical techniques that can be used to analyze data to find useful patterns. You use a probability distribution to compute the probabilities associated with the elements of a dataset. Binomial distribution: you would use the binomial distribution to analyze variables that can assume only one of two values.
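A minimal sketch of the binomial distribution; the number of trials and success probability below are invented for illustration:

```python
# Sketch: binomial probabilities for a variable with only two outcomes.
# n = 10 trials, p = 0.6 chance of success (numbers chosen for illustration).
from scipy.stats import binom

n, p = 10, 0.6
print(binom.pmf(7, n, p))   # probability of exactly 7 successes
print(binom.cdf(7, n, p))   # probability of 7 or fewer successes
```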
A quantile-quantile plot (also known as a QQ-plot) is another way you can determine whether a dataset matches a specified probability distribution. QQ-plots are often used to determine whether a dataset is normally distributed. Graphically, the QQ-plot is very different from a histogram. As the name suggests, the horizontal and vertical axes of a QQ-plot are used to show quantiles.
Although EDA is mainly based on graphical techniques, it also consists of a few quantitative techniques. This article discusses two of these: interval estimation and hypothesis testing. Interval estimation is a technique that's used to construct a range of values within which a variable is likely to fall.
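As one example of interval estimation, the sketch below builds a 95 percent confidence interval for a population mean from an invented sample, using the t distribution:

```python
# Sketch: a 95 percent confidence interval for a population mean,
# based on an invented sample and the t distribution.
import numpy as np
from scipy import stats

sample = np.array([9.8, 10.2, 10.5, 9.6, 10.1, 10.4, 9.9, 10.0])

mean = sample.mean()
sem = stats.sem(sample)                                      # standard error of the mean
low, high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```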
Regression analysis is used to estimate the strength and direction of the relationship between variables that are linearly related to each other. Two variables X and Y are said to be linearly related if the relationship between them can be written in the form Y = mX + b, where m is the slope (the change in Y due to a given change in X) and b is the intercept (the value of Y when X = 0). As an example of regression analysis, suppose a corporation wants to determine whether its advertising expenditures are actually increasing profits, and if so, by how much.
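A minimal sketch of estimating m and b by least squares; the advertising and profit figures are invented, not the book's data:

```python
# Sketch: estimating Y = mX + b by least squares with SciPy.
# The advertising (X) and profit (Y) figures are invented for illustration.
from scipy import stats

advertising = [10, 12, 15, 18, 20, 25]
profit = [102, 108, 116, 125, 130, 142]

result = stats.linregress(advertising, profit)
print(f"slope m = {result.slope:.2f}, intercept b = {result.intercept:.2f}")
print(f"r-squared = {result.rvalue ** 2:.3f}")
```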
A statistic is said to be robust if it isn't strongly influenced by the presence of outliers. For example, the mean is not robust because it can be strongly affected by the presence of outliers. On the other hand, the median is robust — it isn't affected by outliers. For example, suppose the following data represents a sample of household incomes in a small town (measured in thousands of dollars per year): 32, 47, 20, 25, 56. You compute the sample mean as the sum of the five observations divided by five: (32 + 47 + 20 + 25 + 56) / 5 = 36. The sample mean is $36,000 per year.
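To illustrate the point, the sketch below adds one extreme income (an invented value of 500) to that sample and shows how the mean shifts while the median barely moves:

```python
# Sketch: how an outlier moves the mean but not the median.
# The extra value of 500 is invented to play the role of an outlier.
import statistics

incomes = [32, 47, 20, 25, 56]
print(statistics.mean(incomes), statistics.median(incomes))    # 36 and 32

incomes_with_outlier = incomes + [500]
print(statistics.mean(incomes_with_outlier),                   # jumps to about 113.3
      statistics.median(incomes_with_outlier))                 # only moves to 39.5
```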
Unlike a stem-and-leaf plot, a scatter plot is intended to show the relationship between two variables. It may be difficult to see whether there's a relationship between two variables just by looking at the raw data, but with a scatter plot, any patterns that exist in the data become much easier to see. A scatter plot consists of a series of points; each point shows a single value for two different variables.
A stem-and-leaf plot is a graphical device in which the distribution of a dataset is organized by the numerical value of the observations in the dataset. The diagram consists of a "stem," showing the different categories in the data, and a "leaf," which shows the values of the individual observations in the dataset.
A time series is a set of observations of a single variable collected over time. With time series analysis, you can use the statistical properties of a time series to predict the future values of a variable. There are many types of models that may be developed to explain and predict the behavior of a time series.
Decomposition methods are based on an analysis of the individual components of a time series. The strength of each component is estimated separately and then substituted into a model that explains the behavior of the time series. Two of the more important decomposition methods are multiplicative decomposition and additive decomposition. The multiplicative decomposition model is expressed as the product of the four components of a time series: yt = TRt × St × Ct × It, where yt is the value of the time series at time t, TRt is the trend at time t, St is the seasonal component at time t, Ct is the cyclical component at time t, and It is the irregular component at time t. Each component has a subscript t to indicate a specific time period.
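A minimal sketch of a multiplicative decomposition using statsmodels; the simulated monthly series (a rising trend times a seasonal factor) is a stand-in for real data, and statsmodels estimates trend, seasonal, and residual parts rather than a separate cyclical component:

```python
# Sketch: multiplicative decomposition of a simulated monthly series.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

index = pd.date_range("2010-01-01", periods=48, freq="MS")
trend = np.linspace(100, 160, 48)
seasonal = 1 + 0.1 * np.sin(2 * np.pi * np.arange(48) / 12)
series = pd.Series(trend * seasonal, index=index)

result = seasonal_decompose(series, model="multiplicative", period=12)
print(result.trend.dropna().head())    # estimated trend component
print(result.seasonal.head())          # estimated seasonal factors
```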
Prior to performing any type of statistical analysis, understanding the nature of the data being analyzed is essential. You can use EDA to identify the properties of a dataset to determine the most appropriate statistical methods to apply to the data. You can investigate several types of properties with EDA techniques, including the center of the data, the spread among the members of the data, the skewness of the data, the probability distribution the data follows, the correlation among the elements in the dataset, whether or not the parameters of the data are constant over time, and the presence of outliers in the data. Another key question EDA answers is "Does the data conform to our assumptions?"
One technique you can use to identify the distribution a dataset follows is the QQ-plot (QQ stands for quantile-quantile). You can use the QQ-plot to compare a dataset to a large number of different probability distributions. Often, data is compared to the normal distribution because many statistical tests assume normally distributed data.
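A short sketch of a QQ-plot against the normal distribution, using simulated data in place of a real dataset:

```python
# Sketch: a QQ-plot comparing simulated data to the normal distribution.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.random.normal(loc=0, scale=1, size=200)   # simulated sample

stats.probplot(data, dist="norm", plot=plt)
plt.title("QQ-plot against the normal distribution")
plt.show()
```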
You identify the center of a dataset with several different summary measures. These include the big three: mean, median, and mode. You calculate the mean of a dataset by adding up the values of all the elements and dividing by the total number of elements. For example, suppose a small dataset consists of the number of days required to receive a package by the residents of an apartment complex: 1, 2, 2, 4, 7, 9, 10. The mean of this dataset would be (1 + 2 + 2 + 4 + 7 + 9 + 10) / 7 = 35 / 7 = 5. The average length of time for the residents to receive a package is 5 days.
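Using the package-delivery numbers from the example above, the three measures of center can be computed with Python's standard-library statistics module:

```python
# Sketch: mean, median, and mode for the package-delivery example.
import statistics

days = [1, 2, 2, 4, 7, 9, 10]

print(statistics.mean(days))     # 5 -> (1 + 2 + 2 + 4 + 7 + 9 + 10) / 7
print(statistics.median(days))   # 4 -> the middle value
print(statistics.mode(days))     # 2 -> the most frequent value
```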
