
Python for Data Science For Dummies

Authors:
John Paul Mueller,
Luca Massaron
Published: February 27, 2019

Overview

The fast and easy way to learn Python programming and statistics

Python is a general-purpose programming language created in the late 1980s—and named after Monty Python—that's used by thousands of people to do things from testing microchips at Intel, to powering Instagram, to building video games with the PyGame library. 

Python For Data Science For Dummies is written for people who are new to data analysis, and discusses the basics of Python data analysis programming and statistics. The book also discusses Google Colab, which makes it possible to write Python code in the cloud.

  • Get started with data science and Python
  • Visualize information
  • Wrangle data
  • Learn from data

The book provides the statistical background needed to get started in data science programming, including probability, random distributions, hypothesis testing, confidence intervals, and building regression models for prediction.


Python for Data Science For Dummies Cheat Sheet

Python is an incredible programming language that you can use to perform data science tasks with a minimum of effort. The huge number of available libraries means that the low-level code you normally need to write is likely already available from some other source. All you need to focus on is getting the job done. With that in mind, this cheat sheet helps you access the most commonly needed reminders for making your programming experience fast and easy.

Articles From The Book


What is Google Colaboratory?

Google Colaboratory, sometimes called Colab for short, is a Google cloud-based service that replicates Jupyter Notebook in the cloud. You don’t have to install anything on your system to use it. In most respects, you use Colaboratory as you would a desktop installation of Jupyter Notebook. Google Colaboratory is primarily for those readers who use something other than a standard desktop setup to work through the examples.

The most important thing to remember is that Colaboratory isn’t a replacement for Jupyter Notebook, and the examples aren’t specifically tested to work with it. However, you can try it on your alternative device if you want to follow the examples.

To use Colaboratory, you must have a Google account and access Colaboratory using that account; otherwise, most of the Colaboratory features won’t work. As with Jupyter Notebook, you can use Colaboratory to perform specific tasks in a cell-oriented paradigm. If you’ve used Jupyter Notebook before, you’ll notice a strong resemblance between Notebook and Colaboratory. Of course, you also want to perform other sorts of tasks, such as creating various cell types and using them to build notebooks that look like those you create with Notebook.

Defining Google Colaboratory

Google Colaboratory is the cloud version of Jupyter Notebook. In fact, the Welcome page makes this fact apparent. It even uses IPython (the previous name for Jupyter) Notebook (.ipynb) files for the site. That's right, you’re viewing a Jupyter Notebook right there in your browser. Even though the two applications are similar and they both use .ipynb files, they do have some differences that you need to know about.

Understanding what Google Colaboratory does

You can use Colaboratory to perform many tasks, such as writing and running code, creating its associated documentation, and displaying graphics, just as you do with Jupyter Notebook. The techniques you use are similar, but there are small differences between the two. Jupyter Notebook is a localized application in that you use local resources with it. You could potentially use other sources, but doing so could prove inconvenient or impossible in some cases. For example, according to GitHub, your Jupyter Notebook files appear as static HTML pages when you store them in a GitHub repository, and some features won’t work at all. Colaboratory enables you to fully interact with your Jupyter Notebook files using GitHub as a repository. In fact, Colaboratory supports a number of online storage options, so you can regard Colaboratory as your online partner in creating Python code.

The other reason to know about Colaboratory is that you can use it with your alternative device. During the writing process, some of the example code was tested on an Android-based tablet (an ASUS ZenPad 3S 10). The target tablet has Chrome installed and executes the code well enough to follow the examples. All this said, you likely won’t want to write code on a tablet of that size: the text is incredibly small, for one thing, and the lack of a keyboard can be a problem, too. The point is that you don’t absolutely have to have a Windows, Linux, or OS X system to try the code, but the alternatives might not provide quite the performance you expect.

Google Colaboratory generally doesn’t work with browsers other than Chrome or Firefox. In most cases, you see an error message and no other display if you try to start Colaboratory in a browser that it doesn’t support. Your copy of Firefox may also need some configuration to work properly. The amount of configuration that you perform depends on which Colaboratory features you choose to use. Many examples work fine in Firefox without any modification.

The online coding difference between Google Colaboratory and Jupyter Notebook

For the most part, you use Colaboratory just as you would Jupyter Notebook. However, some features work differently. For example, to execute the code within a cell, you select that cell and click its run button (a right-facing arrow). The current cell remains selected, which means that you must initiate the selection of the next cell as a separate action. A block next to the output lets you clear just that output without affecting any other cell. Hovering the mouse over the block tells you when the content was last executed. On the right side of the cell, you see a vertical ellipsis that you can click to see a menu of options for that cell. The result is the same as when using Notebook, but the process for achieving the result is different.

The actual process for working with the code also differs from Notebook. You still type the code as you always have, and it executes just as it would in Jupyter Notebook. The difference is in the way you can manage the code. You can upload code from your local drive as desired and then save it to Google Drive or GitHub. The code then becomes accessible from any device by accessing those same sources. All you need to do is load Colaboratory to access it.

If you use Chrome when working with Colaboratory and choose to sync your copy of Chrome among various devices, all your code becomes available on any device you choose to work with. Syncing transfers your choices to all your devices as long as those devices are also set to synchronize their settings. Consequently, you can write code on your desktop, test it on your tablet, and then review it on your smart phone. It’s all the same code, all the same repository, and the same Chrome setup, just a different device.

What you may find, however, is that all this flexibility comes at the price of speed and ergonomics. In reviewing the various options, a local copy of Jupyter Notebook executes code faster than a copy of Colaboratory using any of the available configurations (even when working with a local copy of the .ipynb file). So, you trade speed for flexibility when working with Colaboratory. In addition, viewing the source code on a tablet is hard; viewing it on a smart phone is nearly impossible. If you make the text large enough to see, you can’t see enough of the code to make any sort of reasonable editing possible. At best, you could review the code one line at a time to determine how it works.

Using Jupyter Notebook has other benefits, too. For example, when working with Colaboratory, you can download your source files only as .ipynb or .py files; Colaboratory doesn’t include the other download options that Notebook offers, including (but not limited to) HTML, LaTeX, and PDF. Consequently, your options for creating presentations from the online content are somewhat limited. In short, Colaboratory and Jupyter Notebook provide different coding experiences to some degree. They’re not mutually exclusive, however, because they share file formats, so switching between the two as needed is possible.

One thing to consider when using Jupyter Notebook and Colaboratory is that the two products use most of the same terminology and many of the same features, but they're not completely the same. The methods used to perform tasks differ, and some of the terminology does as well. For example, a Markdown cell in Jupyter Notebook is a Text cell in Colaboratory.

Using local runtime support

The only time you really need local runtime support is when you want to work within a team environment and you need the speed or resource access advantage offered by a local runtime. Using a local runtime normally produces better speed than relying on the cloud. In addition, a local runtime enables you to access files on your machine and gives you control over the version of Jupyter Notebook used to execute code.

You need to consider several issues when determining the need for local runtime support. The most obvious is that you need a local runtime, which means that this option won’t work with a laptop or tablet unless the device runs Windows, Linux, or OS X and has the appropriate version of Jupyter Notebook installed. The device will also need an appropriate browser; Internet Explorer is almost guaranteed to cause problems, assuming that it works at all.

The most important consideration when using a local runtime, however, is that your machine is now open to possible infection from Jupyter Notebook code. You need to trust the party supplying the code. The local runtime option doesn’t open your machine to others that you share code with, however; they must either use their own local runtimes or rely on the cloud to execute code.

When using local runtime support with Colaboratory and Firefox, you must perform some special setups. Make sure to read the Browser Specific Setups section on the Local Runtimes page to ensure that you have Firefox configured correctly, and always verify your setup. Firefox may appear to work correctly with Colaboratory, but if a configuration issue exists, Colaboratory shows error messages when you perform tasks, saying that the code didn’t execute (or something else that isn’t particularly helpful).


Working with Google Colaboratory Notebooks

As with Jupyter Notebook, the notebook forms the basis of interactions with Google Colaboratory. In fact, Colab is built on notebooks. When you place the mouse on certain parts of the Welcome page, you see opportunities for interacting with the page by adding either code or text entries (which you can use for notes as needed). These entries are active, so you can interact with them. You can also move cells around and copy the resulting material to your Google Drive. Interacting with the Welcome page is both unexpected and fun, but the real purpose of the information below is to describe how to perform basic notebook-related tasks with Colab.

Creating a new Google Colaboratory notebook

To create a new notebook, choose File → New Python 3 Notebook. You see a new Python 3 notebook like the one shown below. The new notebook looks similar to, but not precisely the same as, those found in Notebook. However, all the same functionality exists. You can also create a new Python 2 notebook if desired. The notebook shown here lets you change the filename by clicking on it, just as you do when working in Notebook. Some features work differently but provide the same results. For example, to run the code in a particular cell, you click the right-pointing arrow on the left side of that cell. In contrast to Notebook, the cell focus doesn’t change to the next cell, so you must choose the next cell directly or by clicking the Next Cell or Previous Cell buttons on the toolbar.

Opening existing Google Colaboratory notebooks

You can open existing notebooks found in local storage, on Google Drive, or on GitHub. You can also open any of the Colab examples or upload files from sources that you can access, such as a network drive on your system. In all cases, you begin by choosing File → Open Notebook. You see the dialog box here. The default view shows all the files you opened recently, regardless of location. The files appear in alphabetical order. You can filter the number of items displayed by typing a string into the Filter Notebooks field. Across the top are other options for opening notebooks.

Even if you’re not logged in, you can still access the Colab example projects. These projects help you understand Colab but won’t allow you to do anything with your own projects. Even so, you can still experiment with Colab without logging into Google first. The following information discusses these options in more detail.

Using Google Drive for existing Colab notebooks

Google Drive is the default location for many operations in Colab, and you can always choose it as a destination. When working with Drive, you see a list of files similar to those shown above. To open a particular file, you click its link in the dialog box. The file opens in the current tab of your browser.

Using GitHub for existing Colab notebooks

When working with GitHub, you initially need to provide the location of the source code online. The location must point to a public project; you can’t use Colab to access private projects. After you make the connection to GitHub, you see two lists: repositories, which are containers for code related to a particular project; and branches, a particular implementation of the code. Selecting a repository and branch displays a list of notebook files that you can load into Colab. Simply click the required link and it loads as if you were using a Google Drive.

Using local storage for existing Colab notebooks

If you want to open a notebook stored on a local drive, you select the Upload tab of the dialog box. In the center is a single button, Choose File. Clicking this button opens the File Open dialog box for your browser. You locate the file you want to upload, just as you normally would when opening any file.

Selecting a file and clicking Open uploads the file to Google Drive. If you make changes to the file, those changes appear on Google Drive, not on your local drive. Depending on your browser, you usually see a new window open with the code loaded. However, you could also simply see a success message, in which case you must open the file using the same technique as you would when using Google Drive. In some cases, your browser asks whether you want to leave the current page; you should tell the browser to do so.
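If you prefer to upload a file from code instead of the dialog box, Colab also exposes a small helper module. The following is a minimal sketch, assuming you run it inside a Colab notebook (the google.colab package isn’t available in a desktop Jupyter Notebook):

from google.colab import files

# Opens a browser file picker; returns a dict of filename -> bytes.
uploaded = files.upload()
for name, data in uploaded.items():
    print('Uploaded %s (%d bytes)' % (name, len(data)))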

The File → Upload Notebook command also uploads a file to Google Drive. In fact, uploading a notebook works like uploading any other kind of file, and you see the same dialog box. If all you want to do is upload a notebook, however, using the File → Upload Notebook command is likely faster.

Saving Google Colaboratory notebooks

Colab provides a significant number of options for saving your notebook. However, none of these options works with your local drive. After you upload content from your local drive to Google Drive or GitHub, Colab manages the content in the cloud and not on your local drive.

Using Drive to save Colab notebooks

The default location for storing your data is Google Drive. When you choose File → Save, the content you create goes to the root directory of your Google Drive. If you want to save the content to a different folder, you need to select that folder in Google Drive.
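If you would rather reach Drive from code than from menus, you can mount your Drive as a folder in the Colab runtime. This is a minimal sketch, again assuming it runs inside Colab; the example path and filename are hypothetical:

from google.colab import drive

# Prompts for authorization, then exposes Drive under /content/drive.
drive.mount('/content/drive')

# Hypothetical example: write a file into the root of your Drive.
with open('/content/drive/My Drive/example.txt', 'w') as f:
    f.write('Saved from Colab')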

Colab tracks the versions of your project as you perform saves. However, as these revisions age, Colab removes them. To save a version that won’t age, you use the File → Save and Pin Revision command. To see the revisions for your project, choose File → Revision History. You see the output shown below. Notice that the first entry is pinned. You can also pin entries by checking the entry in the History list. The revision history also shows you the modification date, who made the revision, and the size of the resulting file. You can use this list to restore a previous revision or download the revision to your local drive.

You can also save a copy of your project by choosing File → Save a Copy In Drive. The copy receives the word Copy as part of its name. Of course, you can rename it later. Colab stores the copy in the current Google Drive folder.

Using GitHub to save Colab notebooks

GitHub provides an alternative to Google Drive for saving content. It offers an organized method of sharing code for the purpose of discussion, review, and distribution.

You may use only public repositories when working with GitHub from Colab, even though GitHub also supports private repositories. To save a file to GitHub, choose File → Save a Copy in GitHub. If you aren’t already signed into GitHub, Colab displays a window that requests your sign-in information. After you sign in, you see a dialog box similar to the one you see below.

Notice that this account doesn’t currently have a repository. You must either create a new repository or choose an existing repository in which to store your data. After you save the file, it appears in the GitHub repository of your choice. The repository includes a link to open the file in Colab by default, unless you choose not to include this feature.

Using GitHub Gists to save Colab notebooks

You use GitHub Gists as a means of sharing single files or other resources with other people. Some people use them for full projects as well, but the idea is that you have a concept that you want to share — something that isn’t quite fully formed and doesn’t represent a usable application. As with GitHub, Gists come in both public and secret forms. You can access both public and secret Gists from Colab, but Colab automatically keeps your files secret. To save your current project as a Gist, you choose File → Save a Copy as a GitHub Gist. Unlike GitHub, you don’t need to create a repository or do anything fancy in this case. The file saves as a Gist without any extra effort. The resulting entry always contains a View in Colaboratory link, as shown below.

Downloading Google Colaboratory notebooks

Colab supports two methods for downloading notebooks to your local drive: .ipynb files (using File → Download .ipynb) and .py files (using File → Download .py). In both cases, the file appears in the default download directory for your browser; Colab doesn't offer a method for downloading the file to a specific directory.
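A programmatic alternative also exists for files that live in the Colab runtime's working directory. This minimal sketch assumes a file named my_notebook.ipynb exists there and that the code runs inside Colab:

from google.colab import files

# Triggers a browser download of the named file from the runtime.
files.download('my_notebook.ipynb')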


Python Programming: Making Machine Learning Accessible with the Random Forest Algorithm

Random Forest is a classification and regression algorithm developed by Leo Breiman and Adele Cutler that uses a large number of decision tree models to provide precise predictions by reducing both the bias and variance of the estimates. When you aggregate many models together to produce a single prediction, the result is an ensemble of models. Random Forest isn't just an ensemble model; it’s also a simple and effective out-of-the-box algorithm, which makes machine learning accessible to nonexperts. The Random Forest algorithm uses these steps to perform its predictions:

  1. Create a large number of decision trees, each one different from the other, based on different subsets of observations and variables.
  2. Bootstrap the dataset of observations for each tree (sampled from the original data with replacement). The same observation can appear multiple times in the same dataset.
  3. Randomly select and use only a part of the variables for each tree.
  4. Estimate the performance for each tree using the observations excluded by sampling (the Out Of Bag, or OOB, estimate).
  5. Obtain the final prediction, which is the average for regression estimates or the most frequent class for classification, after all the trees have been fitted and used for prediction.
You can reduce bias by using these steps, because the decision trees fit the data well and, by relying on complex splits, can approximate even the most complex relationships between predictors and predicted outcome. Decision trees can produce a great variance of estimates, but you reduce this variance by averaging many trees. Noisy predictions, due to variance, tend to distribute evenly above and below the correct value that you want to predict; when averaged together, they tend to cancel each other out, leaving a more correct average prediction.

Leo Breiman derived the idea for Random Forest from the bagging technique. Scikit-learn has a bagging class for both regression (BaggingRegressor) and classification (BaggingClassifier) that you can use with any other predictor you prefer to pick from Scikit-learn modules. The max_samples and max_features parameters let you decide the proportion of cases and variables to sample (not bootstrapped, but sampled, so you can use a case only once) for building each model of the ensemble. The n_estimators parameter decides the total number of models in the ensemble. Here's an example that loads the handwritten digit dataset (used for demonstrations later with other ensemble algorithms) and then fits the model by bagging:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

digit = load_digits()
X, y = digit.data, digit.target

tree_classifier = DecisionTreeClassifier(random_state=0)
crossvalidation = KFold(n_splits=5, shuffle=True,
                        random_state=1)
bagging = BaggingClassifier(tree_classifier,
                            max_samples=0.7,
                            max_features=0.7,
                            n_estimators=300)
scores = np.mean(cross_val_score(bagging, X, y,
                                 scoring='accuracy',
                                 cv=crossvalidation))
print('Accuracy: %.3f' % scores)
Here's the cross-validated accuracy for the bagging applied to the handwritten dataset:
Accuracy: 0.967
In bagging, as in Random Forest, the more models in the ensemble, the better. You run little risk of overfitting because every model is different from the others, and errors tend to spread around the real value. Adding more models just adds stability to the result. Another characteristic of the algorithm is that it permits estimation of variable importance while taking the presence of all the other predictors into account. In this way, you can determine which feature is important for predicting a target given the set of features that you have; also, you can use the importance estimate as a guideline for variable selection.
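As a quick illustration of the Out Of Bag estimate from Step 4 earlier, scikit-learn can report it directly when you set oob_score=True. Here's a minimal sketch that reuses the digit dataset loaded above:

from sklearn.ensemble import RandomForestClassifier

# oob_score=True scores each tree on the observations that its
# bootstrap sample excluded, giving a built-in performance estimate.
RF_oob = RandomForestClassifier(n_estimators=300,
                                oob_score=True,
                                random_state=1)
RF_oob.fit(X, y)
print('OOB accuracy: %.3f' % RF_oob.oob_score_)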

In contrast to single decision trees, you can’t easily visualize or understand Random Forest, making it act as a black box (a black box is a transformation that doesn’t reveal its inner workings; all you see are its inputs and outputs). Given its opacity, importance estimation is the only way to understand how the algorithm works with respect to the features.

Importance estimation in a Random Forest is obtained in a straightforward way. After building each tree, the algorithm fills each variable in turn with junk data and records how much the predictive power decreases. If the variable is important, crowding it with random data harms the prediction; otherwise, the predictions are left almost unchanged, and the variable is deemed unimportant.
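The Random Forest examples below rely on scikit-learn's built-in feature_importances_ attribute, which is computed from impurity reductions rather than by the shuffling procedure just described. If you want the shuffle-based estimate itself, recent scikit-learn versions (0.22 and later) provide permutation_importance; this sketch assumes such a version is installed:

from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier

RF_imp = RandomForestClassifier(n_estimators=300, random_state=1)
RF_imp.fit(X, y)

# Shuffle each feature in turn and record the average accuracy drop.
result = permutation_importance(RF_imp, X, y, n_repeats=5,
                                random_state=1, scoring='accuracy')
print('Most important feature: %d'
      % result.importances_mean.argmax())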

Working with a Random Forest classifier

The example Random Forest classifier keeps using the previously loaded digit dataset:
X, y = digit.data, digit.target
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

crossvalidation = KFold(n_splits=5, shuffle=True,
                        random_state=1)
RF_cls = RandomForestClassifier(n_estimators=300,
                                random_state=1)
score = np.mean(cross_val_score(RF_cls, X, y,
                                scoring='accuracy',
                                cv=crossvalidation))
print('Accuracy: %.3f' % score)
The cross-validated accuracy reported by this code for the Random Forest is an improvement over the bagging method:
Accuracy: 0.977
Just setting the number of estimators is sufficient for most problems you encounter, and setting it correctly is a matter of using the highest number possible given the time and resource constraints of the host computer. You can demonstrate this by calculating and drawing a validation curve for the algorithm.
from sklearn.model_selection import validation_curve
import matplotlib.pyplot as plt

param_range = [10, 50, 100, 200, 300, 500, 800, 1000, 1500]
crossvalidation = KFold(n_splits=3,
                        shuffle=True,
                        random_state=1)
RF_cls = RandomForestClassifier(n_estimators=300,
                                random_state=0)
train_scores, test_scores = validation_curve(RF_cls, X, y,
                                             param_name='n_estimators',
                                             param_range=param_range,
                                             cv=crossvalidation,
                                             scoring='accuracy')
mean_test_scores = np.mean(test_scores, axis=1)
plt.plot(param_range, mean_test_scores,
         'bD-.', label='CV score')
plt.grid()
plt.xlabel('Number of estimators')
plt.ylabel('accuracy')
plt.legend(loc='lower right', numpoints=1)
plt.show()
The image below shows the results provided by the preceding code. The more estimators, the better the results; however, at a certain point the additional gain becomes minimal.

Working with a Random Forest regressor

RandomForestRegressor works in a similar way to the Random Forest for classification, using exactly the same parameters. This example uses the Boston house-prices dataset (loaded here with load_boston so that the snippet stands on its own):
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

boston = load_boston()  # the Boston house-prices dataset
X, y = boston.data, boston.target

RF_rg = RandomForestRegressor(n_estimators=300,
                              random_state=1)
crossvalidation = KFold(n_splits=5, shuffle=True,
                        random_state=1)
score = np.mean(cross_val_score(RF_rg, X, y,
                                scoring='neg_mean_squared_error',
                                cv=crossvalidation))
print('Mean squared error: %.3f' % abs(score))
Here is the resulting cross-validated mean squared error:
Mean squared error: 12.028

The Random Forest uses decision trees. Decision trees segment the dataset into small partitions, called leaves, when estimating regression values. The Random Forest takes the average of the values in each leaf to create a prediction. Using this procedure causes extreme high and low values to disappear from predictions: the averaging applied to each leaf of the forest damps them, producing moderated values instead of much higher or much lower ones. (For more on decision trees, see https://www.dummies.com/programming/big-data/data-science/how-to-create-classification-and-regression-trees-in-python-for-data-science/.)
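To see this damping effect in action, you can compare the range of predictions from a single deep tree with the range from a forest on the same noisy data. The following sketch uses synthetic data invented purely for illustration:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(1)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo.ravel()) + rng.normal(scale=0.5, size=200)

tree = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)
forest = RandomForestRegressor(n_estimators=300,
                               random_state=1).fit(X_demo, y_demo)

# The unpruned tree reproduces the noisy extremes of the training
# data; averaging the forest's leaves pulls predictions inward.
print('Tree range:   %.2f to %.2f'
      % (tree.predict(X_demo).min(), tree.predict(X_demo).max()))
print('Forest range: %.2f to %.2f'
      % (forest.predict(X_demo).min(), forest.predict(X_demo).max()))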

Optimizing a Random Forest

Random Forest models are out-of-the-box algorithms that can work quite well without optimization or worry about overfitting. (The more estimators you use, the better the output, depending on your resources.) You can always improve performance by removing redundant and less informative variables, fixing a minimum leaf size, and defining a sampling number that avoids having too many correlated predictors in the sample. The following example shows how to perform these tasks:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = digit.data, digit.target
crossvalidation = KFold(n_splits=5, shuffle=True,
                        random_state=1)
RF_cls = RandomForestClassifier(random_state=1)
scorer = 'accuracy'
Using the handwritten digit dataset and a first default classifier, you can optimize both max_features and min_samples_leaf. When optimizing max_features, you use its preconfigured options (auto, or the sqrt or log2 functions applied to the number of features) and supplement them with an explicit small feature number, here one third of the features. Selecting the right number of features to sample tends to reduce the number of times that correlated and similar variables are picked together, thus increasing the predictive performance. There is a statistical reason to optimize min_samples_leaf: using leaves with few cases often corresponds to overfitting to very specific data combinations. You need at least 30 observations to achieve a minimal statistical confidence that data patterns correspond to real and general rules:
from sklearn.model_selection import GridSearchCV

max_features = [X.shape[1] // 3, 'sqrt', 'log2', 'auto']
min_samples_leaf = [1, 10, 30]
n_estimators = [50, 100, 300]
search_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'min_samples_leaf': min_samples_leaf}
search_func = GridSearchCV(estimator=RF_cls,
                           param_grid=search_grid,
                           scoring=scorer,
                           cv=crossvalidation)
search_func.fit(X, y)
best_params = search_func.best_params_
best_score = search_func.best_score_
print('Best parameters: %s' % best_params)
print('Best accuracy: %.3f' % best_score)
The best parameters and best accuracy are then reported, highlighting that the parameter to act on is the number of trees:
Best parameters: {'max_features': 'sqrt',
                  'min_samples_leaf': 1,
                  'n_estimators': 100}
Best accuracy: 0.978
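Because GridSearchCV refits the best configuration on the whole dataset by default (refit=True), you can use the fitted search object directly for prediction; a brief sketch:

# best_estimator_ holds the Random Forest retrained with the
# winning parameters found by the grid search above.
best_rf = search_func.best_estimator_
print(best_rf.predict(X[:5]))  # predictions for the first five digits
print(y[:5])                   # the true labels, for comparison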