Data Science For Dummies Cheat Sheet

From Data Science For Dummies, 2nd Edition

By Lillian Pierson

“Big data” is definitely the big buzzword these days, and most folks who have come across the term realize that big data is a powerful force that is in the process of revolutionizing scores of major industries. Not many folks, however, are aware of the range of tools currently available that are designed to help businesses big and small take advantage of the big data revolution. This Cheat Sheet gives you a peek at these tools and shows you how they fit into the broader context of data science.

Seeing What You Need to Know When Getting Started in Data Science

Traditionally, big data is the term for data that has incredible volume, velocity, and variety. Traditional database technologies aren’t capable of handling big data — more innovative data-engineered solutions are required. To evaluate your project for whether it qualifies as a big data project, consider the following criteria:

  • Volume: Between 1 terabyte/year and 10 petabytes/year

  • Velocity: Between 30 kilobytes/second and 30 gigabytes/second

  • Variety: Combined sources of unstructured, semi-structured, and structured data

Data science and data engineering are not the same

Hiring managers tend to confuse the roles of data scientist and data engineer. While it is possible to find someone who does a little of both, each field is incredibly complex. It’s unlikely that you’ll find someone with robust skills and experience in both areas. For this reason, it’s important to be able to identify what type of specialist is most appropriate for helping you achieve your specific goals. The descriptions below should help you do that.

  • Data scientists: Data scientists use coding, quantitative methods (mathematical, statistical, and machine learning), and highly specialized expertise in their study area to derive solutions to complex business and scientific problems.

  • Data engineers: Data engineers use skills in computer science and software engineering to design systems for, and solve problems with, handling and manipulating big data sets.

Data science and business intelligence are also not the same

Business-centric data scientists and business analysts who do business intelligence are like cousins. Both types of specialist use data to achieve the same business goals, but their approaches, technologies, and functions are different. The descriptions below spell out the differences between the two roles.

  • Business intelligence (BI): BI solutions are generally built using datasets generated internally — from within an organization rather than from without, in other words. Common tools and technologies include online analytical processing, extract transform and load, and data warehousing. Although BI sometimes involves forward-looking methods like forecasting, these methods are based on simple mathematical inferences from historical or current data.

  • Business-centric data science: Business-centric data science solutions are built using datasets that are both internal and external to an organization. Common tools, technologies, and skillsets include cloud-based analytics platforms, statistical and mathematical programming, machine learning, data analysis using Python and R, and advanced data visualization. Business-centric data scientists use advanced mathematical or statistical methods to analyze and generate predictions from vast amounts of business data.

Looking at the Basics of Statistics, Machine Learning, and Mathematical Methods in Data Science

If statistics has been described as the science of deriving insights from data, then what’s the difference between a statistician and a data scientist? Good question! While many tasks in data science require a fair bit of statistical know-how, the scope and breadth of a data scientist’s knowledge and skill base are distinct from those of a statistician. The core distinctions are outlined below.

  • Subject matter expertise: One of the core features of data scientists is that they offer a sophisticated degree of expertise in the area to which they apply their analytical methods. Data scientists need this so that they’re able to truly understand the implications and applications of the data insights they generate. A data scientist should have enough subject matter expertise to be able to identify the significance of their findings and independently decide how to proceed in the analysis.

    In contrast, statisticians usually have an incredibly deep knowledge of statistics, but very little expertise in the subject matters to which they apply statistical methods. Most of the time, statisticians are required to consult with external subject matter experts to truly get a firm grasp on the significance of their findings, and to be able to decide the best way to move forward in an analysis.

  • Mathematical and machine learning approaches: Statisticians rely mostly on statistical methods and processes when deriving insights from data. In contrast, data scientists are required to pull from a wide variety of techniques to derive data insights. These include statistical methods, but also approaches that aren’t based in statistics, such as mathematical methods, clustering, classification, and non-statistical machine learning.

Seeing the importance of statistical know-how

You don’t need to go out and get a degree in statistics to practice data science, but you should at least get familiar with some of the more fundamental methods that are used in statistical data analysis. These include:

  • Linear regression: Linear regression is useful for modeling the relationships between a dependent variable and one or several independent variables. The purpose of linear regression is to discover (and quantify the strength of) important correlations between dependent and independent variables.

  • Time-series analysis: Time series analysis involves analyzing a collection of data on attribute values over time, in order to predict future instances of the measure based on the past observational data.

  • Monte Carlo simulations: The Monte Carlo method is a simulation technique you can use to test hypotheses, generate parameter estimates, predict scenario outcomes, and validate models. The method is powerful because you can use it to very quickly simulate anywhere from 1 to 10,000 (or more) samples of any process you’re trying to evaluate.

  • Statistics for spatial data: One fundamental and important property of spatial data is that it’s not random. It’s spatially dependent and autocorrelated. When modeling spatial data, avoid statistical methods that assume your data is random. Kriging and the related krige method are statistical approaches you can use to model spatial data. These methods enable you to produce predictive surfaces for entire study areas based on sets of known points in geographic space.
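To make a couple of these methods concrete, here’s a minimal sketch in NumPy (using synthetic, randomly generated data, not anything from a real project) of a linear regression fit and a tiny Monte Carlo simulation that estimates pi:

```python
import numpy as np

rng = np.random.default_rng(42)

# --- Linear regression: quantify the relationship y ~ slope*x + intercept ---
x = np.arange(50, dtype=float)
y = 2.0 * x + 5.0 + rng.normal(scale=3.0, size=50)  # noisy synthetic linear data
slope, intercept = np.polyfit(x, y, deg=1)          # least-squares line fit
print(f"fitted slope={slope:.2f}, intercept={intercept:.2f}")

# --- Monte Carlo simulation: estimate pi from 100,000 random points ---
pts = rng.random((100_000, 2))            # random points in the unit square
inside = (pts ** 2).sum(axis=1) < 1.0     # which points fall in the quarter circle
print(f"pi is approximately {4 * inside.mean():.3f}")
```

Because the data is generated with a true slope of 2, the fitted slope should land near 2, and the Monte Carlo estimate drifts closer to pi as you add more samples.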

Working with clustering, classification, and machine learning methods

Machine learning is the application of computational algorithms to learn from (or deduce patterns in) raw datasets. Clustering is a particular type of machine learning — unsupervised machine learning, to be precise — meaning that the algorithms must learn from unlabeled data, and as such, they must use inferential methods to discover correlations.

Classification, on the other hand, is called supervised machine learning, meaning that the algorithms learn from labeled data. The following descriptions introduce some of the more basic clustering and classification approaches:

  • k-means clustering: You generally deploy k-means algorithms to subdivide the data points of a dataset into clusters based on nearest mean values; the algorithm finds the division of the data points into k clusters that minimizes the distance between the points in each cluster and that cluster’s mean.

  • Nearest neighbor algorithms: The purpose of a nearest neighbor analysis is to search for and locate either a nearest point in space or a nearest numerical value, depending on the attribute you use for the basis of comparison.

  • Kernel density estimation: An alternative way to identify clusters in your data is to use a density smoothing function. Kernel density estimation (KDE) works by placing a kernel (a weighting function that’s useful for quantifying density) on each data point in the dataset, and then summing the kernels to generate a kernel density estimate for the overall region.
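As a rough illustration of the nearest-mean idea behind k-means, here’s a bare-bones sketch in NumPy. The two “blobs” of points are synthetic, and in practice you’d reach for a tested implementation such as scikit-learn’s KMeans rather than rolling your own:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Bare-bones k-means: assign each point to its nearest mean,
    recompute each mean, and repeat."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]  # random init
    for _ in range(iters):
        # distance from every point to every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)            # nearest-mean assignment
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# two well-separated synthetic blobs should come back as two clusters
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
labels, centers = kmeans(np.vstack([blob_a, blob_b]), k=2)
```

With data this cleanly separated, each blob ends up in its own cluster and the two recovered centers sit near the blob means.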

Keeping mathematical methods in the mix

Lots gets said about the value of statistics in the practice of data science, but applied mathematical methods are seldom mentioned. To be frank, mathematics is the basis of all quantitative analyses, and its importance should not be underestimated. The two following mathematical methods are particularly useful in data science.

  • Multi-criteria decision making (MCDM): MCDM is a mathematical decision modeling approach that you can use when you have several criteria or alternatives that you must simultaneously evaluate when making a decision.

  • Markov chains: A Markov chain is a mathematical method that chains together a series of randomly generated variables that represent the present state in order to model how changes in present state variables affect future states.
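A Markov chain is easy to sketch in a few lines of code. The two-state “weather” transition matrix below is an invented example for illustration, not anything drawn from real data:

```python
import numpy as np

# Toy two-state weather Markov chain: today's state alone determines
# the probabilities for tomorrow (assumed transition matrix below).
states = ["sunny", "rainy"]
P = np.array([[0.9, 0.1],    # sunny -> sunny / rainy
              [0.5, 0.5]])   # rainy -> sunny / rainy

rng = np.random.default_rng(7)
state = 0                    # start on a sunny day
chain = [state]
for _ in range(10_000):
    state = rng.choice(2, p=P[state])   # next state depends only on the current one
    chain.append(state)

# The long-run share of sunny days approaches the stationary distribution,
# which for this matrix works out to 5/6 of days being sunny.
print(sum(s == 0 for s in chain) / len(chain))
```

Simulating forward like this is exactly how Markov chains let you model the effect of the present state on future states.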

Using Visualization Techniques to Communicate Data Science Insights

All of the information and insight in the world is useless if it can’t be communicated. If data scientists cannot clearly communicate their findings to others, potentially valuable data insights may remain unexploited.

Following clear and specific best practices in data visualization design can help you develop visualizations that communicate in a way that’s highly relevant and valuable to the stakeholders for whom you’re working. The following is a brief summary of some of the more important best practices in data visualization design.

  • Know thy audience: Since data visualizations are designed for a whole spectrum of different audiences, different purposes, and different skill levels, the first step to designing a great data visualization is to know your audience. Since each audience is composed of a unique class of consumers, each with unique data visualization needs, it’s essential to clarify exactly for whom you’re designing.

  • Choose appropriate design styles: After considering your audience, choosing the most appropriate design style is also critical. If your goal is to entice your audience into taking a deeper, more analytical dive into the visualization, then use a design style that induces a calculating and exacting response in its viewers. If you want your data visualization to fuel your audience’s passion, use an emotionally compelling design style instead.

  • Choose smart data graphic types: Lastly, make sure to pick graphic types that dramatically display the data trends you’re seeking to reveal. You can display the same data trend in many ways, but some methods deliver a visual message more effectively than others. Pick the graphic type that most directly delivers a clear, comprehensive visual message.

Looking at your coding toolset

D3.js is the perfect JavaScript library for building dynamic, interactive web-based visualizations. If you’re already a web programmer, or if you don’t mind taking the time required to get up to speed on the basics of HTML, CSS, and JavaScript, then it’s a no-brainer: using D3.js to design interactive web-based data visualizations is sure to be the perfect solution to many of your visualization problems.

Working with web-based applications

If you don’t have the time or energy to get into coding up your own custom-made data visualization, fear not — there are some amazing online applications available to help you get the job done in no time. The following list details some excellent alternatives.

  • Watson Analytics: Watson Analytics is the first full-scale data science and analytics solution that’s been made available as a 100% cloud-based offering. Watson Analytics was built for the purpose of democratizing the power of data science. It’s a platform where users of all skill levels can go to access, refine, discover, visualize, report, and collaborate on data-driven insights.

  • CartoDB: For non-programmers or non-cartographers, CartoDB is about the most powerful map-making solution that’s available online. It’s used for digital visual communications by people from all sorts of industries — including information services, software engineering, media and entertainment, and urban development.

  • Piktochart: The Piktochart web application provides an easy-to-use interface for creating beautiful infographics. The application offers a very large selection of attractive, professionally-designed templates. With Piktochart, you can make either static or dynamic infographics.

Going with analytics dashboards

When the word “dashboard” comes up, many people associate it with old-fashioned business intelligence solutions. This association is faulty. A dashboard is just another way of using visualization methods to communicate data insights.

While it’s true that you can use a dashboard to communicate findings generated from business intelligence, you can also use one to communicate and deliver valuable insights derived from business-centric data science. Dashboards may have been around awhile, but that’s no reason to disregard them as effective tools for communicating valuable data insights.

Leveraging Geographic Information Systems (GIS) software

Geographic information systems (GIS) software is another underused resource in data science. When you need to discover and quantify location-based trends in your dataset, GIS is the perfect solution for the job. Maps are one form of spatial data visualization that you can generate using GIS, but GIS software is also good for more advanced forms of analysis and visualization. The two most popular GIS solutions are detailed below.

  • ArcGIS for Desktop: Proprietary ArcGIS for Desktop is the most widely used map-making application.

  • QGIS: If you don’t have the money to invest in ArcGIS for Desktop, you can use open-source QGIS to accomplish most of the same goals for free.

Choosing the Best Programming Languages for Data Science

Coding is one of the primary skills in a data scientist’s toolbox. Some incredibly powerful applications have successfully done away with the need to code in some data-science contexts, but you’re never going to be able to use those applications for custom analysis and visualization. For advanced tasks, you’re going to have to code things up for yourself, using either the Python programming language or the R programming language.

Using Python for data science

Python is an easy-to-learn, human-readable programming language that you can use for advanced data munging, analysis, and visualization. It’s incredibly easy to install and set up, and it’s easier to learn than the R programming language. Python runs on Mac, Windows, and UNIX.

IPython offers a very user-friendly coding interface for people who don’t like coding from the command line. If you download and install the Anaconda Python distribution, you get your IPython/Jupyter environment, as well as the NumPy, SciPy, Matplotlib, Pandas, and scikit-learn libraries (among others) that you’ll likely need in your data sense-making procedures.

The base NumPy package is the basic facilitator for scientific computing in Python. It provides container/array structures that you can use to do computations with both vectors and matrices (as in R). SciPy builds on NumPy and is the Python library most commonly used for scientific and technical computing; it offers tons of mathematical algorithms that simply aren’t available in other Python libraries, with popular functionalities that include linear algebra, matrix math, sparse matrix functionality, and statistics. Pandas, for its part, is the library of choice for data munging. Matplotlib is Python’s premier data visualization library.

Lastly, the scikit-learn library is useful for machine learning, data pre-processing, and model evaluation.
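As a tiny taste of this stack, the sketch below (with made-up numbers) uses NumPy for matrix math and Pandas for a simple group-and-aggregate data munging step:

```python
import numpy as np
import pandas as pd

# NumPy arrays support vector and matrix math directly
v = np.array([1.0, 2.0, 3.0])
M = np.eye(3) * 2.0          # a 3x3 matrix that doubles each component
print(M @ v)                 # matrix-vector product

# Pandas DataFrames handle labeled, tabular data munging
df = pd.DataFrame({"region": ["north", "south", "north"],
                   "sales": [100, 250, 175]})
print(df.groupby("region")["sales"].sum())   # total sales per region
```

Matplotlib would typically take over from here to plot the aggregated results, and scikit-learn to model them.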

Using R for data science

R is another popular programming language that’s used for statistical and scientific computing. Writing analysis and visualization routines in R is known as R scripting. R has been developed specifically for statistical computing, and consequently it offers a more plentiful selection of open-source statistical computing packages than Python does.

Also, R’s data visualization capabilities are somewhat more sophisticated than Python’s, and its visualizations are generally easier to produce. That said, as a language, Python is a fair bit easier for beginners to learn.

R has a very large and extremely active user community. Developers are coming up with (and sharing) new packages all the time — to mention just a few, the forecast package, the ggplot2 package, and the statnet/igraph packages.

If you want to do predictive analysis and forecasting in R, the forecast package is a good place to start. This package offers the ARMA, AR, and exponential smoothing methods.

For data visualization, you can use the ggplot2 package, which has all the standard data graphic types, plus a lot more.

Lastly, R’s network analysis packages are pretty special as well. For example, you can use igraph and statnet for social network analysis, genetic mapping, traffic planning, and even hydraulic modeling.