
Python for Data Science For Dummies

By: John Paul Mueller and Luca Massaron. Published: 02-27-2019

The fast and easy way to learn Python programming and statistics

Python is a general-purpose programming language created in the late 1980s—and named after Monty Python—that's used by thousands of people to do things from testing microchips at Intel, to powering Instagram, to building video games with the PyGame library. 

Python For Data Science For Dummies is written for people who are new to data analysis, and discusses the basics of Python data analysis programming and statistics. The book also discusses Google Colab, which makes it possible to write Python code in the cloud.

  • Get started with data science and Python
  • Visualize information
  • Wrangle data
  • Learn from data

The book provides the statistical background needed to get started in data science programming, including probability, random distributions, hypothesis testing, confidence intervals, and building regression models for prediction.

Articles From Python for Data Science For Dummies

10 results
Python for Data Science For Dummies Cheat Sheet

Cheat Sheet / Updated 02-24-2022

Python is an incredible programming language that you can use to perform data science tasks with a minimum of effort. The huge number of available libraries means that the low-level code you normally need to write is likely already available from some other source. All you need to focus on is getting the job done. With that in mind, this cheat sheet helps you access the most commonly needed reminders for making your programming experience fast and easy.
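For example, a task that would otherwise require hand-written parsing and looping code often collapses to a few library calls. The snippet below is only a minimal sketch of that idea (the file name and column names are hypothetical, used purely for illustration):

    import pandas as pd

    df = pd.read_csv('sales.csv')                    # hypothetical data file
    print(df.describe())                             # summary statistics in one call
    print(df.groupby('region')['revenue'].sum())     # one-line aggregation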

What is Google Colaboratory?

Article / Updated 07-10-2019

Google Colaboratory (Colab for short) is a Google cloud-based service that replicates Jupyter Notebook in the cloud. You don't have to install anything on your system to use it, and in most respects you use Colaboratory as you would a desktop installation of Jupyter Notebook. Google Colaboratory is primarily for readers who use something other than a standard desktop setup to work through the examples. The most important thing to remember is that Colaboratory isn't a replacement for Jupyter Notebook and the examples aren't tested specifically to work with it, but you can try it with your alternative device if you want to follow the examples.

To use Colaboratory, you must have a Google account and then access Colaboratory using that account; otherwise, most of the Colaboratory features won't work. As with Jupyter Notebook, you use Colaboratory to perform specific tasks in a cell-oriented paradigm. If you've used Jupyter Notebook before, you'll notice a strong resemblance between Notebook and Colaboratory. Of course, you also want to perform other sorts of tasks, such as creating various cell types and using them to create notebooks that look like those you create with Notebook.

Defining Google Colaboratory

Google Colaboratory is the cloud version of Jupyter Notebook. In fact, the Welcome page makes this fact apparent. It even uses IPython (the previous name for Jupyter) Notebook (.ipynb) files for the site. That's right: you're viewing a Jupyter Notebook right there in your browser. Even though the two applications are similar and both use .ipynb files, they do have some differences that you need to know about.

Understanding what Google Colaboratory does

You can use Colaboratory to perform many tasks, such as writing and running code, creating its associated documentation, and displaying graphics, just as you do with Jupyter Notebook. The techniques you use are similar to those for Jupyter Notebook, but there are small differences between the two.

Jupyter Notebook is a localized application in that you use local resources with it. You could potentially use other sources, but doing so could prove inconvenient or impossible in some cases. For example, according to GitHub, your Jupyter Notebook files appear as static HTML pages when you use a GitHub repository, and some features won't work at all. Colaboratory enables you to fully interact with your Jupyter Notebook files using GitHub as a repository. In fact, Colaboratory supports a number of online storage options, so you can regard Colaboratory as your online partner in creating Python code.

The other reason you need to know about Colaboratory is that you can use it with your alternative device. During the writing process, some of the example code was tested on an Android-based tablet (an ASUS ZenPad 3S 10). The target tablet has Chrome installed and executes the code well enough to follow the examples. All this said, you likely won't want to write code using a tablet of that size: the text is incredibly small, for one thing, and the lack of a keyboard can be a problem, too. The point is that you don't absolutely have to have a Windows, Linux, or OS X system to try the code, but the alternatives might not provide quite the performance you expect.

Google Colaboratory generally doesn't work with browsers other than Chrome or Firefox. In most cases, you see an error message and no other display if you try to start Colaboratory in a browser that it doesn't support. Your copy of Firefox may also need some configuration to work properly. The amount of configuration you perform depends on which Colaboratory features you choose to use; many examples work fine in Firefox without any modification.

The online coding difference between Google Colaboratory and Jupyter Notebook

For the most part, you use Colaboratory just as you would Jupyter Notebook. However, some features work differently. For example, to execute the code within a cell, you select that cell and click the run button (right-facing arrow) for that cell. The current cell remains selected, which means that you must initiate the selection of the next cell as a separate action. A block next to the output lets you clear just that output without affecting any other cell, and hovering the mouse over the block tells you when someone executed the content. On the right side of the cell, you see a vertical ellipsis that you can click to see a menu of options for that cell. The result is the same as when using Notebook, but the process for achieving the result is different.

The actual process for working with the code also differs from Notebook. Yes, you still type the code as you always have, and the resulting code executes without problem in Jupyter Notebook. The difference is in the way you can manage the code. You can upload code from your local drive as desired and then save it to Google Drive or GitHub. The code becomes accessible from any device at this point by accessing those same sources; all you need to do is load Colaboratory to access it.

If you use Chrome when working with Colaboratory and choose to sync your copy of Chrome among various devices, all your code becomes available on any device you choose to work with. Syncing transfers your choices to all your devices as long as those devices are also set to synchronize their settings. Consequently, you can write code on your desktop, test it on your tablet, and then review it on your smartphone. It's all the same code, all the same repository, and the same Chrome setup, just a different device.

What you may find, however, is that all this flexibility comes at the price of speed and ergonomics. In reviewing the various options, a local copy of Jupyter Notebook executes code faster than a copy of Colaboratory using any of the available configurations (even when working with a local copy of the .ipynb file). So you trade speed for flexibility when working with Colaboratory. In addition, viewing the source code on a tablet is hard, and viewing it on a smartphone is nearly impossible. If you make the text large enough to see, you can't see enough of the code to make any sort of reasonable editing possible; at best, you could review the code one line at a time to determine how it works.

Using Jupyter Notebook has other benefits, too. For example, when working with Colaboratory, you can download your source files only as .ipynb or .py files. Colaboratory doesn't include the other download options, including (but not limited to) HTML, LaTeX, and PDF. Consequently, your options for creating presentations from the online content are also limited to some extent. In short, Colaboratory and Jupyter Notebook provide somewhat different coding experiences. They're not mutually exclusive, however, because they share file formats, so switching between the two as needed is possible.

One thing to consider when using Jupyter Notebook and Colaboratory is that the two products use most of the same terminology and many of the same features, but they're not completely the same. The methods used to perform tasks differ, and some of the terminology does as well. For example, a Markdown cell in Jupyter Notebook is a Text cell in Colaboratory.

Using local runtime support

The only time you really need local runtime support is when you want to work within a team environment and you need the speed or resource-access advantage offered by a local runtime. Using a local runtime normally produces better speed than you obtain when relying on the cloud. In addition, a local runtime enables you to access files on your machine, and it gives you control over the version of Jupyter Notebook used to execute code. You can read more about local runtime support on the Local Runtimes page.

You need to consider several issues when determining the need for local runtime support. The most obvious is that you need a local runtime, which means that this option won't work with your laptop or tablet unless that device runs Windows, Linux, or OS X and has the appropriate version of Jupyter Notebook installed. Your laptop or tablet will also need an appropriate browser; Internet Explorer is almost guaranteed to cause problems, assuming that it works at all.

The most important consideration when using a local runtime, however, is that your machine is now open to possible infection from Jupyter Notebook code, so you need to trust the party supplying the code. The local runtime option doesn't open your machine to others with whom you share code; they must either use their own local runtimes or rely on the cloud to execute the code.

When working with Colaboratory using local runtime support and Firefox, you must perform some special setups. Make sure to read the Browser Specific Setups section on the Local Runtimes page to ensure that you have Firefox configured correctly, and always verify your setup. Firefox may appear to work correctly with Colaboratory, but a configuration issue can arise when you perform tasks with it, and Colaboratory then shows error messages saying that the code didn't execute (or something else that isn't particularly helpful).

Working with Google Colaboratory Notebooks

Article / Updated 07-05-2019

As with Jupyter Notebook, the notebook forms the basis of interactions with Google Colaboratory. In fact, Colab is built on notebooks. When you place the mouse on certain parts of the Welcome page, you see opportunities for interacting with the page by adding either code or text entries (which you can use for notes as needed). These entries are active, so you can interact with them. You can also move cells around and copy the resulting material to your Google Drive. Of course, while interacting with the Welcome page is both unexpected and fun, the real purpose of the information below is to describe how to perform basic notebook-related tasks with Colab.

Creating a new Google Colaboratory notebook

To create a new notebook, choose File → New Python 3 Notebook. You see a new Python 3 notebook like the one shown below. The new notebook looks similar to, but not precisely the same as, those found in Notebook; however, all the same functionality exists. (You can also create a new Python 2 notebook if desired.) The notebook shown here lets you change the filename by clicking on it, just as you do when working in Notebook.

Some features work differently but provide the same results. For example, to run the code in a particular cell, you click the right-pointing arrow on the left side of that cell. In contrast to Notebook, the cell focus doesn't change to the next cell, so you must choose the next cell directly or by clicking the Next Cell or Previous Cell buttons on the toolbar.

Opening existing Google Colaboratory notebooks

You can open existing notebooks found in local storage, on Google Drive, or on GitHub. You can also open any of the Colab examples or upload files from sources that you can access, such as a network drive on your system. In all cases, you begin by choosing File → Open Notebook. You see the dialog box shown here.

The default view shows all the files you opened recently, regardless of location, in alphabetical order. You can filter the number of items displayed by typing a string into the Filter Notebooks field. Across the top are other options for opening notebooks.

Even if you're not logged in, you can still access the Colab example projects. These projects help you understand Colab but won't allow you to do anything with your own projects. Even so, you can still experiment with Colab without logging into Google first. The following information discusses these options in more detail.

Using Google Drive for existing Colab notebooks

Google Drive is the default location for many operations in Colab, and you can always choose it as a destination. When working with Drive, you see a list of files similar to those shown above. To open a particular file, you click its link in the dialog box, and the file opens in the current tab of your browser.

Using GitHub for existing Colab notebooks

When working with GitHub, you initially need to provide the location of the source code online. The location must point to a public project; you can't use Colab to access private projects. After you make the connection to GitHub, you see two lists: repositories, which are containers for code related to a particular project, and branches, which are particular implementations of the code. Selecting a repository and branch displays a list of notebook files that you can load into Colab. Simply click the required link and it loads as if you were using Google Drive.

Using local storage for existing Colab notebooks

If you want to use the downloadable source from any local source, you select the Upload tab of the dialog box. In the center is a single button, Choose File. Clicking this button opens the File Open dialog box for your browser. You locate the file you want to upload, just as you normally would for opening any file. Selecting a file and clicking Open uploads the file to Google Drive. If you make changes to the file, those changes appear on Google Drive, not on your local drive.

Depending on your browser, you usually see a new window open with the code loaded. However, you could also simply see a success message, in which case you must open the file using the same technique as you would when using Google Drive. In some cases, your browser asks whether you want to leave the current page; you should tell the browser to do so.

The File → Upload Notebook command also uploads a file to Google Drive. In fact, uploading a notebook works like uploading any other kind of file, and you see the same dialog box. If you want to upload other kinds of files, using the File → Upload Notebook command is likely faster.

Saving Google Colaboratory notebooks

Colab provides a significant number of options for saving your notebook. However, none of these options works with your local drive. After you upload content from your local drive to Google Drive or GitHub, Colab manages the content in the cloud and not on your local drive.

Using Drive to save Colab notebooks

The default location for storing your data is Google Drive. When you choose File → Save, the content you create goes to the root directory of your Google Drive. If you want to save the content to a different folder, you need to select that folder in Google Drive.

Colab tracks the versions of your project as you perform saves. However, as these revisions age, Colab removes them. To save a version that won't age, you use the File → Save and Pin Revision command. To see the revisions for your project, choose File → Revision History. You see the output shown below. Notice that the first entry is pinned; you can also pin entries by checking the entry in the History list. The revision history also shows you the modification date, who made the revision, and the size of the resulting file. You can use this list to restore a previous revision or download the revision to your local drive.

You can also save a copy of your project by choosing File → Save a Copy in Drive. The copy receives the word Copy as part of its name, and you can rename it later. Colab stores the copy in the current Google Drive folder.

Using GitHub to save Colab notebooks

GitHub provides an alternative to Google Drive for saving content. It offers an organized method of sharing code for the purpose of discussion, review, and distribution. You may use only public repositories when working with GitHub from Colab, even though GitHub also supports private repositories.

To save a file to GitHub, choose File → Save a Copy in GitHub. If you aren't already signed in to GitHub, Colab displays a window that requests your sign-in information. After you sign in, you see a dialog box similar to the one shown below. Notice that this account doesn't currently have a repository; you must either create a new repository or choose an existing repository in which to store your data. After you save the file, it appears in the GitHub repository of your choice. The repository includes a link to open the data in Colab by default, unless you choose not to include this feature.

Using GitHub Gists to save Colab notebooks

You use GitHub Gists as a means of sharing single files or other resources with other people. Some people use them for full projects as well, but the idea is that you have a concept that you want to share: something that isn't quite fully formed and doesn't represent a usable application. You can read more about Gists on the GitHub site.

As with GitHub, Gists come in both public and secret form. You can access both public and secret Gists from Colab, but Colab automatically keeps your files secret. To save your current project as a Gist, you choose File → Save a Copy as a GitHub Gist. Unlike GitHub, you don't need to create a repository or do anything fancy in this case; the file saves as a Gist without any extra effort. The resulting entry always contains a View in Colaboratory link, as shown below.

Downloading Google Colaboratory notebooks

Colab supports two methods for downloading notebooks to your local drive: .ipynb files (using File → Download .ipynb) and .py files (using File → Download .py). In both cases, the file appears in the default download directory for your browser; Colab doesn't offer a method for downloading the file to a specific directory.
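Beyond the File menu, you can also reach Google Drive from code running inside a Colab notebook. The following minimal sketch isn't part of the original article; it uses the google.colab helper module to mount your Drive so that notebooks and data files become visible to Python (the mount point shown is the conventional one):

    from google.colab import drive   # available only inside Colaboratory

    drive.mount('/content/drive')    # prompts for authorization the first time
    # Your Drive files then appear under /content/drive (for example, MyDrive).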

Python Programming: Making Machine Learning Accessible with the Random Forest Algorithm

Article / Updated 07-05-2019

Random Forest is a classification and regression algorithm developed by Leo Breiman and Adele Cutler that uses a large number of decision tree models to provide precise predictions by reducing both the bias and variance of the estimates. When you aggregate many models together to produce a single prediction, the result is an ensemble of models. Random Forest isn't just an ensemble model, it's also a simple and effective out-of-the-box algorithm, which makes machine learning accessible to nonexperts. The Random Forest algorithm uses these steps to perform its predictions:

  • Create a large number of decision trees, each one different from the others, based on different subsets of observations and variables.
  • Bootstrap the dataset of observations for each tree (sampled from the original data with replacement). The same observation can appear multiple times in the same dataset.
  • Randomly select and use only a part of the variables for each tree.
  • Estimate the performance for each tree using the observations excluded by sampling (the Out Of Bag, or OOB, estimate).
  • After all the trees have been fitted and used for prediction, obtain the final prediction, which is the average for regression estimates or the most frequent class for classification.

These steps reduce bias, because the decision trees fit the data well and, by relying on complex splits, can approximate even the most complex relationships between predictors and predicted outcome. Decision trees can produce a great variance of estimates, but you reduce this variance by averaging many trees. Noisy predictions, due to variance, tend to distribute evenly above and below the correct value that you want to predict, and when averaged together they tend to cancel each other out, leaving a more correct average prediction.

Leo Breiman derived the idea for Random Forest from the bagging technique. Scikit-learn has a bagging class for both regression (BaggingRegressor) and classification (BaggingClassifier) that you can use with any other predictor you prefer to pick from the Scikit-learn modules. The max_samples and max_features parameters let you decide the proportion of cases and variables to sample (not bootstrapped, but sampled, so you can use a case only once) for building each model of the ensemble. The n_estimators parameter decides the total number of models in the ensemble. Here's an example that loads the handwritten digit dataset (used for demonstrations later with other ensemble algorithms) and then fits the model by bagging:

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import KFold

    digit = load_digits()
    X, y = digit.data, digit.target

    tree_classifier = DecisionTreeClassifier(random_state=0)
    crossvalidation = KFold(n_splits=5, shuffle=True, random_state=1)
    bagging = BaggingClassifier(tree_classifier,
                                max_samples=0.7,
                                max_features=0.7,
                                n_estimators=300)
    scores = np.mean(cross_val_score(bagging, X, y,
                                     scoring='accuracy',
                                     cv=crossvalidation))
    print('Accuracy: %.3f' % scores)

Here's the cross-validated accuracy for bagging applied to the handwritten digit dataset:

    Accuracy: 0.967

In bagging, as in Random Forest, the more models in the ensemble, the better. You run little risk of overfitting because every model is different from the others, and errors tend to spread around the real value. Adding more models just adds stability to the result.

Another characteristic of the algorithm is that it permits estimation of variable importance while taking the presence of all the other predictors into account. In this way, you can determine which feature is important for predicting a target given the set of features that you have; you can also use the importance estimate as a guideline for variable selection. In contrast to single decision trees, you can't easily visualize or understand a Random Forest, which makes it act as a black box (a black box is a transformation that doesn't reveal its inner workings; all you see are its inputs and outputs). Given its opacity, importance estimation is the only way to understand how the algorithm works with respect to the features.

Importance estimation in a Random Forest is obtained in a straightforward way. After building each tree, the code fills each variable in turn with junk data and records how much the predictive power decreases. If the variable is important, crowding it with junk data harms the prediction; otherwise, the predictions are left almost unchanged and the variable is deemed unimportant.

Working with a Random Forest classifier

The example Random Forest classifier keeps using the previously loaded digit dataset:

    X, y = digit.data, digit.target
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import KFold

    crossvalidation = KFold(n_splits=5, shuffle=True, random_state=1)
    RF_cls = RandomForestClassifier(n_estimators=300, random_state=1)
    score = np.mean(cross_val_score(RF_cls, X, y,
                                    scoring='accuracy',
                                    cv=crossvalidation))
    print('Accuracy: %.3f' % score)

The cross-validated accuracy reported by this code for the Random Forest is an improvement over the bagging method:

    Accuracy: 0.977

Just setting the number of estimators is sufficient for most problems you encounter, and setting it correctly is a matter of using the highest number possible given the time and resource constraints of the host computer. You can demonstrate this by calculating and drawing a validation curve for the algorithm:

    import matplotlib.pyplot as plt
    from sklearn.model_selection import validation_curve

    param_range = [10, 50, 100, 200, 300, 500, 800, 1000, 1500]
    crossvalidation = KFold(n_splits=3, shuffle=True, random_state=1)
    RF_cls = RandomForestClassifier(n_estimators=300, random_state=0)
    train_scores, test_scores = validation_curve(
        RF_cls, X, y,
        param_name='n_estimators',
        param_range=param_range,
        cv=crossvalidation,
        scoring='accuracy')
    mean_test_scores = np.mean(test_scores, axis=1)

    plt.plot(param_range, mean_test_scores, 'bD-.', label='CV score')
    plt.grid()
    plt.xlabel('Number of estimators')
    plt.ylabel('accuracy')
    plt.legend(loc='lower right', numpoints=1)
    plt.show()

The resulting plot shows that the more estimators you use, the better the results. However, at a certain point the gain becomes minimal indeed.
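As a quick illustration of the OOB estimate mentioned in the algorithm steps (this snippet isn't part of the original article), Scikit-learn can compute that estimate for you when you set oob_score=True; X and y are the digit data loaded earlier:

    RF_oob = RandomForestClassifier(n_estimators=300,
                                    oob_score=True,
                                    random_state=1)
    RF_oob.fit(X, y)
    # Accuracy estimated on the observations each tree never saw during fitting
    print('OOB accuracy: %.3f' % RF_oob.oob_score_)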
Working with a Random Forest regressor

RandomForestRegressor works in a similar way as the Random Forest for classification, using exactly the same parameters. This example uses the Boston housing dataset:

    from sklearn.datasets import load_boston
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import KFold

    boston = load_boston()
    X, y = boston.data, boston.target
    RF_rg = RandomForestRegressor(n_estimators=300, random_state=1)
    crossvalidation = KFold(n_splits=5, shuffle=True, random_state=1)
    score = np.mean(cross_val_score(RF_rg, X, y,
                                    scoring='neg_mean_squared_error',
                                    cv=crossvalidation))
    print('Mean squared error: %.3f' % abs(score))

Here is the resulting cross-validated mean squared error:

    Mean squared error: 12.028

The Random Forest uses decision trees. Decision trees (see https://www.dummies.com/programming/big-data/data-science/how-to-create-classification-and-regression-trees-in-python-for-data-science/) segment the dataset into small partitions, called leaves, when estimating regression values. The Random Forest takes the average of the values in each leaf to create a prediction. Using this procedure causes extreme and high values to disappear from predictions because of the averaging used for each leaf of the forest, producing damped values instead of much higher or much lower values.

Optimizing a Random Forest

Random Forest models are out-of-the-box algorithms that can work quite well without optimization or worrying about overfitting. (The more estimators you use, the better the output, depending on your resources.) You can always improve performance by removing redundant and less informative variables, fixing a minimum leaf size, and defining a sampling number that avoids having too many correlated predictors in the sample. The following example shows how to perform these tasks:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import KFold

    X, y = digit.data, digit.target
    crossvalidation = KFold(n_splits=5, shuffle=True, random_state=1)
    RF_cls = RandomForestClassifier(random_state=1)
    scorer = 'accuracy'

Using the handwritten digit dataset and a first default classifier, you can optimize both max_features and min_samples_leaf. When optimizing max_features, you use the preconfigured options (auto for all the features, or the sqrt or log2 functions applied to the number of features) and integrate them with a small feature number and a value of one third of the features. Selecting the right number of features to sample tends to reduce the number of times that correlated and similar variables are picked together, thus increasing the predictive performance.

There is a statistical reason to optimize min_samples_leaf. Using leaves with few cases often corresponds to overfitting to very specific data combinations. You need at least 30 observations to achieve a minimal statistical confidence that data patterns correspond to real and general rules:

    from sklearn.model_selection import GridSearchCV

    max_features = [X.shape[1]//3, 'sqrt', 'log2', 'auto']
    min_samples_leaf = [1, 10, 30]
    n_estimators = [50, 100, 300]
    search_grid = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'min_samples_leaf': min_samples_leaf}
    search_func = GridSearchCV(estimator=RF_cls,
                               param_grid=search_grid,
                               scoring=scorer,
                               cv=crossvalidation)
    search_func.fit(X, y)
    best_params = search_func.best_params_
    best_score = search_func.best_score_
    print('Best parameters: %s' % best_params)
    print('Best accuracy: %.3f' % best_score)

The best parameters and best accuracy are then reported, highlighting that the main parameter to act on is the number of trees:

    Best parameters: {'max_features': 'sqrt', 'min_samples_leaf': 1, 'n_estimators': 100}
    Best accuracy: 0.978
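To inspect the importance estimates discussed earlier, a fitted Scikit-learn forest exposes an impurity-based feature_importances_ attribute (recent Scikit-learn versions also offer sklearn.inspection.permutation_importance for the shuffle-based approach the article describes). The following short sketch isn't part of the original article; it ranks the digit features, with X and y still being the digit data from the previous snippet:

    import numpy as np

    RF_cls = RandomForestClassifier(n_estimators=300, random_state=1)
    RF_cls.fit(X, y)
    ranking = np.argsort(RF_cls.feature_importances_)[::-1]
    print('Most informative pixel indexes:', ranking[:5])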

Playing with Scikit-Learn and Neural Networks

Article / Updated 07-05-2019

Starting with the idea of reverse-engineering how a brain processes signals, researchers based neural networks on biological analogies and their components, using brain terms such as neurons and axons as names. However, you'll discover that neural networks resemble nothing more than a sophisticated kind of linear regression because they are a summation of coefficients multiplied by numeric inputs. You'll also find that neurons are just where such summations happen.

Even if neural networks don't mimic a brain (they're arithmetic), these algorithms are extraordinarily effective against complex problems such as image and sound recognition, or machine language translation. They also execute quickly when predicting, if you use the right hardware. Well-devised neural networks use the name deep learning and are behind powerful tools like Siri and other digital assistants, along with more astonishing machine learning applications as well. Running deep learning requires special hardware (a computer with a GPU) and installing special frameworks such as TensorFlow, MXNet, PyTorch, or Chainer. This book doesn't delve into complex neural networks but does explore a simpler implementation offered by Scikit-learn instead, which allows you to create neural networks quickly and compare them to other machine learning algorithms.

Understanding neural networks

The core neural network algorithm is the neuron (also called a unit). Many neurons arranged in an interconnected structure make up the layers of a neural network, with each neuron linking to the inputs and outputs of other neurons. Thus, a neuron can input features from examples or from the results of other neurons, depending on its location in the neural network.

Contrary to other algorithms, which have a fixed pipeline that determines how algorithms receive and process data, neural networks require you to decide how information flows by fixing the number of units (the neurons) and their distribution in layers. For this reason, setting up neural networks is more an art than a science; you learn from experience how to arrange neurons into layers and obtain the best predictions.

In a more detailed view, neurons in a neural network take many weighted values as inputs, sum them, and provide the summation as the result. A neural network can process only numeric, continuous information; it can't process qualitative variables (for example, labels indicating a quality such as red, blue, or green in an image). You can process qualitative variables by transforming them into a continuous numeric value, such as a series of binary values.

Neurons also provide a more sophisticated transformation of the summation. In observing nature, scientists noticed that neurons receive signals but don't always release a signal of their own; it depends on the amount of signal received. When a neuron in a brain acquires enough stimuli, it fires an answer; otherwise, it remains silent. In a similar fashion, neurons in a neural network, after receiving weighted values, sum them and use an activation function to evaluate the result, which transforms it in a nonlinear way. For instance, the activation function can release a zero value unless the input achieves a certain threshold, or it can dampen or enhance a value by nonlinearly rescaling it, thus transmitting a rescaled signal.

Each neuron in the network receives inputs from the previous layers (when starting, it connects directly with data), weights them, sums them all, and transforms the result using the activation function. After activation, the computed output becomes the input for other neurons or the prediction of the network. Consequently, given a neural network made of a certain number of neurons and layers, what makes this structure efficient in its predictions is the weights used by each neuron for its inputs. Such weights aren't different from the coefficients of a linear regression, and the network learns their values by repeated passes (iterations or epochs) over the examples of the dataset.

Classifying and regressing with neurons using Scikit-learn

If you plan to work with neural networks and Python, you'll need Scikit-learn. Scikit-learn offers two functions for neural networks:

  • MLPClassifier: Implements a multilayer perceptron (MLP) for classification. Its outputs (one or many, depending on how many classes you have to predict) are intended as probabilities of the example being of a certain class.
  • MLPRegressor: Implements MLP for regression problems. All its outputs (because it can predict multiple target values at one time) are intended as estimates of the measures to predict.

Because both functions have exactly the same parameters, the example delves into a single classification case, using the handwritten digits as an example of multiclass classification with an MLP. The example starts by importing the necessary packages, loading the dataset into memory, and splitting it into a training and a test set:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import MinMaxScaler
    from sklearn import datasets
    from sklearn.neural_network import MLPClassifier

    digits = datasets.load_digits()
    X, y = digits.data, digits.target
    X_tr, X_t, y_tr, y_t = train_test_split(X, y,
                                            test_size=0.3,
                                            random_state=0)

Preprocessing the data to feed to the neural network is an important aspect because the operations that neural networks perform under the hood are sensitive to the scale and distribution of the data. Consequently, it's good practice to normalize the data by putting its mean to zero and its variance to one, or to rescale it by fixing the minimum and maximum between -1 and +1 or 0 and +1. Experimentation shows which transformation works better for your data, though most people find that rescaling between -1 and +1 works better. This example rescales all the values between -1 and +1:

    scaling = MinMaxScaler(feature_range=(-1, 1)).fit(X_tr)
    X_tr = scaling.transform(X_tr)
    X_t = scaling.transform(X_t)

It's good practice to define the preprocessing transformations on the training data alone and then apply the learned procedure to the test data. Only in this way can you correctly test how your model works with different data.

To define an MLP, you must consider that there are quite a few parameters, and if you don't tune them correctly, the results may be disappointing. (MLP is not an algorithm that works out of the box.) For an MLP to work properly, you should first define the architecture of the neurons, setting how many to use for each layer and how many layers to create. (You state the number of neurons for each layer in the hidden_layer_sizes parameter.) Then you have to determine the right solver among:

  • L-BFGS: Use for small datasets.
  • Adam: Use for large datasets.
  • SGD: Excels at most problems if you correctly set some special parameters.

SGD's parameters are the learning rate, which reflects the learning speed, and momentum (or Nesterov's momentum), a value that helps the neural network to avoid less useful solutions. When specifying the learning rate, you have to define its starting value (learning_rate_init, which is usually around 0.001, but it can be even less) and how the speed changes during training (the learning_rate parameter, which can be 'constant', 'invscaling', or 'adaptive'). Given the complexity of setting the parameters for an SGD solver, you can determine how they work on your data only by testing them in a hyperparameter optimization; most people prefer to start with an L-BFGS or Adam solver.

Another critical hyperparameter is max_iter, the number of iterations, which can lead to completely different results if you set it too low or too high. The default is 200 iterations, but it's always better, after having fixed the other parameters, to try increasing or decreasing its number. Finally, shuffling the data (shuffle=True) and setting a random_state for reproducibility of results are also important. The example code sets 512 nodes on a single layer, relies on the Adam solver, and uses the standard number of iterations (200):

    nn = MLPClassifier(hidden_layer_sizes=(512, ),
                       activation='relu',
                       solver='adam',
                       shuffle=True,
                       tol=1e-4,
                       random_state=1)
    cv = cross_val_score(nn, X_tr, y_tr, cv=10)
    test_score = nn.fit(X_tr, y_tr).score(X_t, y_t)
    print('CV accuracy score: %0.3f' % np.mean(cv))
    print('Test accuracy score: %0.3f' % (test_score))

Using this code, the example successfully classifies handwritten digits by running an MLP whose CV and test scores are

    CV accuracy score: 0.978
    Test accuracy score: 0.981

The results obtained are a little better than SVC's, yet the increase involves tuning quite a few parameters correctly as well. When using nonlinear algorithms, you can't expect any no-brainer approach, apart from a few decision-tree based solutions.
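The hyperparameter search mentioned above can be automated with GridSearchCV. The following sketch is only an illustration, not part of the original article, and the grid values are assumptions rather than recommendations; it reuses the rescaled training data from the example:

    from sklearn.model_selection import GridSearchCV

    param_grid = {'hidden_layer_sizes': [(128,), (512,), (256, 128)],
                  'alpha': [1e-4, 1e-3],
                  'solver': ['adam', 'lbfgs']}
    search = GridSearchCV(MLPClassifier(max_iter=400, shuffle=True,
                                        random_state=1),
                          param_grid, scoring='accuracy', cv=3)
    search.fit(X_tr, y_tr)                     # rescaled training data
    print(search.best_params_)
    print('Best CV accuracy: %.3f' % search.best_score_)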

Tips for Using Jupyter Notebook for Python Programming

Article / Updated 07-03-2019

The Jupyter Notebook Integrated Development Environment (IDE) is a part of the Anaconda suite of tools for Python programming and can do lots of things for you. The following information helps you understand some of the interesting things that Jupyter Notebook (often simply called Notebook) can help you do.

Working with styles in Jupyter Notebook

Here's one of the ways in which Jupyter Notebook excels over just about any other IDE that you'll ever use: It helps you to create nice-looking output. Rather than have a screen full of plain-old code, you can use Jupyter Notebook to create sections and add styles so that the output is nicely formatted. What you can end up with is a good-looking report that just happens to contain executable code. The reason for this improved output is the use of styles.

When you type code into Jupyter Notebook, you place the code in a cell. Each section of code that you create goes into a separate cell. When you need to create a new cell, you click Insert Cell Below (the button with a plus sign) on the toolbar. Likewise, when you decide that you no longer need a cell, you select it and then click Cut Cell (the button with a scissors) to place the deleted cell on the Clipboard, or choose Edit → Delete Cell to remove it completely.

The default style for a cell is Code. However, when you click the down arrow next to the Code entry, you see a listing of styles. The various styles help you format content in various ways, and the Markdown style is most definitely used to separate the various entries. To try it for yourself, choose Markdown from the drop-down list, type the heading # Using Jupyter Notebook in the first cell, and then click Run. The content changes to a heading; the single hash (#) tells Notebook that this is a first-level heading. Notice that clicking Run automatically adds a new cell and places the cursor in it. To add a second-level heading, choose Markdown from the drop-down list, type ## Working with styles, and click Run. The two entries are indeed headings, and the second entry is smaller than the first.

The Markdown style also lets you add HTML content, which can contain anything a web page contains with regard to standard HTML tags. Another way to create a first-level heading is to define the cell type as Markdown, type <h1>Using Jupyter Notebook</h1>, and then click Run. In general, you use HTML to provide documentation and links to outside material. Relying on HTML tags makes it possible to include things like lists or even pictures. In short, you can actually include an HTML document fragment as part of your notebook, which makes Jupyter Notebook much more than a simple means of writing down code.

The use of the Raw NBConvert formatting option is outside the scope of what is discussed here. However, it provides you with the means for including information that shouldn't be modified by the notebook converter (NBConvert). You can output notebooks in a variety of formats, and NBConvert performs this task for you. The goal of the Raw NBConvert style is to allow you to include special content, such as Lamport TeX (LaTeX) content. The LaTeX document system isn't tied to a particular editor; it's simply a means of encoding scientific documents.

Restarting the kernel in your Jupyter Notebook

Every time you perform a task in your Jupyter notebook, you create variables, import modules, and perform a wealth of other tasks that corrupt the environment. At some point, you can't really be sure that something is working as it should. To overcome this problem, you click Restart Kernel (the button with an open circle and an arrow at one end) after saving your document by clicking Save and Checkpoint (the button containing a floppy disk symbol). You can then run your code again to ensure that it does work as you thought it would.

Sometimes an error also causes the kernel to crash. Your document starts acting oddly, updates slowly, or shows other signs of corruption. Again, the answer is to restart the kernel to ensure that you have a clean environment and that the kernel is running as it should for optimal Python programming. Whenever you click Restart Kernel, you see a warning message. Make certain that you pay attention to the warning because you could lose temporary changes during a kernel restart. Always save your document before you restart the kernel.

Restoring a checkpoint in Jupyter Notebook

At some point, you may find that you made a mistake. Jupyter Notebook is notably missing an Undo button: you won't find one anywhere. Instead, you create checkpoints each time you finish a task. Creating checkpoints when your document is stable and working properly helps you recover faster from mistakes.

To restore your setup to the condition contained in a checkpoint, choose File → Revert to Checkpoint. You see a listing of available checkpoints; simply select the one you want to use. When you select the checkpoint, you see a warning message. When you click Revert, any old information is gone and the information found in the checkpoint becomes the current information.

Scikit-Learn Method Summary

Article / Updated 01-27-2019

Scikit-learn is a focal point for data science work with Python, so it pays to know which methods you need most. The following table provides a brief overview of the most important methods used for data analysis.

Syntax | Usage | Description
model_selection.cross_val_score | Cross-validation phase | Estimate the cross-validation score
model_selection.KFold | Cross-validation phase | Divide the dataset into k folds for cross-validation
model_selection.StratifiedKFold | Cross-validation phase | Stratified validation that takes into account the distribution of the classes you predict
model_selection.train_test_split | Cross-validation phase | Split your data into training and test sets
decomposition.PCA | Dimensionality reduction | Principal component analysis (PCA)
decomposition.RandomizedPCA | Dimensionality reduction | Principal component analysis (PCA) using randomized SVD
feature_extraction.FeatureHasher | Preparing your data | The hashing trick, allowing you to accommodate a large number of features in your dataset
feature_extraction.text.CountVectorizer | Preparing your data | Convert text documents into a matrix of count data
feature_extraction.text.HashingVectorizer | Preparing your data | Directly convert your text using the hashing trick
feature_extraction.text.TfidfVectorizer | Preparing your data | Creates a dataset of TF-IDF features
feature_selection.RFECV | Feature selection | Automatic feature selection
model_selection.GridSearchCV | Optimization | Exhaustive search in order to maximize a machine learning algorithm
linear_model.LinearRegression | Prediction | Linear regression
linear_model.LogisticRegression | Prediction | Linear logistic regression
metrics.accuracy_score | Solution evaluation | Accuracy classification score
metrics.f1_score | Solution evaluation | Compute the F1 score, balancing accuracy and recall
metrics.mean_absolute_error | Solution evaluation | Mean absolute error regression error
metrics.mean_squared_error | Solution evaluation | Mean squared error regression error
metrics.roc_auc_score | Solution evaluation | Compute Area Under the Curve (AUC) from prediction scores
naive_bayes.MultinomialNB | Prediction | Multinomial Naïve Bayes
neighbors.KNeighborsClassifier | Prediction | K-Neighbors classification
preprocessing.Binarizer | Preparing your data | Create binary variables (feature values set to 0 or 1)
preprocessing.Imputer | Preparing your data | Missing values imputation
preprocessing.MinMaxScaler | Preparing your data | Create variables bound by a minimum and maximum value
preprocessing.OneHotEncoder | Preparing your data | Transform categorical integer features into binary ones
preprocessing.StandardScaler | Preparing your data | Variable standardization by removing the mean and scaling to unit variance
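To show how several of the methods in the table combine, here is a minimal sketch (not part of the original table) of a typical workflow: splitting the data, preparing it, fitting a predictor, and evaluating the solution. The iris dataset is used purely for illustration:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    scaler = StandardScaler().fit(X_train)             # preparing your data
    model = LogisticRegression().fit(scaler.transform(X_train), y_train)

    print(cross_val_score(model, scaler.transform(X_train), y_train, cv=5))
    print(accuracy_score(y_test, model.predict(scaler.transform(X_test))))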

Common IPython Magic Functions

Article / Updated 01-27-2019

It's kind of amazing to think that IPython provides you with magic, but that's precisely what you get with the magic functions. A magic function begins with either a % or %% sign. Those with a % sign work within the environment, and those with a %% sign work at the cell level. Note that the magic functions work best with Jupyter Notebook. People using alternatives, such as Google Colab, may find that some magic functions fail to provide the desired result. The following list gives you a few of the most common magic functions and their purpose. To obtain a full list, type %quickref and press Enter in the IPython console or check out the full list.

Magic Function | Type Alone Provides Status? | Description
%%timeit | No | Calculates the best time performance for all the instructions in a cell, apart from the one placed on the same cell line as the cell magic (which could therefore be an initialization instruction).
%%writefile | No | Writes the contents of a cell to the specified file.
%alias | Yes | Assigns or displays an alias for a system command.
%autocall | Yes | Enables you to call functions without including the parentheses. The settings are Off, Smart (default), and Full. The Smart setting applies the parentheses only if you include an argument with the call.
%automagic | Yes | Enables you to call the line magic functions without including the % sign. The settings are False (default) and True.
%cd | Yes | Changes directory to a new storage location. You can also use this command to move through the directory history or to change directories to a bookmark.
%cls | No | Clears the screen.
%colors | No | Specifies the colors used to display text associated with prompts, the information system, and exception handlers. You can choose between NoColor (black and white), Linux (default), and LightBG.
%config | Yes | Enables you to configure IPython.
%dhist | Yes | Displays a list of directories visited during the current session.
%file | No | Outputs the name of the file that contains the source code for the object.
%hist | Yes | Displays a list of magic function commands issued during the current session.
%install_ext | No | Installs the specified extension.
%load | No | Loads application code from another source, such as an online example.
%load_ext | No | Loads a Python extension using its module name.
%lsmagic | Yes | Displays a list of the currently available magic functions.
%matplotlib | Yes | Sets the backend processor used for plots. Using the inline value displays the plot within the cell for an IPython Notebook file. The possible values are 'gtk', 'gtk3', 'inline', 'nbagg', 'osx', 'qt', 'qt4', 'qt5', 'tk', and 'wx'.
%paste | No | Pastes the content of the clipboard into the IPython environment.
%pdef | No | Shows how to call the object (assuming that the object is callable).
%pdoc | No | Displays the docstring for an object.
%pinfo | No | Displays detailed information about the object (often more than provided by help alone).
%pinfo2 | No | Displays extra detailed information about the object (when available).
%reload_ext | No | Reloads a previously installed extension.
%source | No | Displays the source code for the object (assuming that the source is available).
%timeit | No | Calculates the best performance time for an instruction.
%unalias | No | Removes a previously created alias from the list.
%unload_ext | No | Unloads the specified extension.
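As a quick illustration (not part of the original table), here is how the two timing magics look in practice. The first line goes in one Notebook cell; the %%timeit block goes at the top of its own, separate cell, where the statement on the %%timeit line runs once as setup and isn't timed:

    %timeit sum(range(1000))           # line magic: times a single statement

    %%timeit data = list(range(1000))
    total = 0                          # the loop below is what gets timed
    for value in data:
        total += value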

Line Plot Styles

Article / Updated 01-27-2019

Whenever you create a plot, you need to identify the sources of information using more than just the lines. Creating a plot that uses differing line types and data point symbols makes the plot much easier for other people to use. The following tables list the line plot styles.

Color Code | Line Color
b | blue
g | green
r | red
c | cyan
m | magenta
y | yellow
k | black
w | white

Marker Code | Marker Style
. | point
o | circle
x | x-mark
+ | plus
* | star
s | square
d | diamond
v | down triangle
^ | up triangle
< | left triangle
> | right triangle
p | 5-point star
h | 6-point star

Line Code | Line Style
- | solid
: | dotted
-. | dash dot
-- | dashed
(none) | no line

Remember that you can also use these styles with other kinds of plots. For example, a scatter plot can use these styles to define each of the data points. When in doubt, try the styles to see whether they'll work with your particular plot.
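Here's a brief sketch (not part of the original table) that combines color, marker, and line-style codes in matplotlib format strings:

    import matplotlib.pyplot as plt

    x = range(1, 6)
    plt.plot(x, [v ** 2 for v in x], 'go--', label='green, circles, dashed')
    plt.plot(x, [v * 3 for v in x], 'bs:', label='blue, squares, dotted')
    plt.legend()
    plt.show()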

The 8 Most Common Python Programming Errors

Article / Updated 01-27-2019

Every developer on the planet makes mistakes. However, knowing about common mistakes will save you time and effort later. The following list tells you about the most common errors that developers experience when working with Python:

  • Using the incorrect indentation: Many Python features rely on indentation. For example, when you create a new class, everything in that class is indented under the class declaration. The same is true for decision, loop, and other structural statements. If you find that your code is executing a task when it really shouldn't be, start reviewing the indentation you're using.
  • Relying on the assignment operator instead of the equality operator: When performing a comparison between two objects or values, you use the equality operator (==), not the assignment operator (=). The assignment operator places an object or value within a variable and doesn't compare anything.
  • Placing function calls in the wrong order when creating complex statements: Python always executes functions from left to right. So the statement MyString.strip().center(21, "*") produces a different result than MyString.center(21, "*").strip(). When you encounter a situation in which the output of a series of concatenated functions is different from what you expected, check the function order to ensure that each function is in the correct place.
  • Misplacing punctuation: You can put punctuation in the wrong place and create an entirely different result. Remember that you must include a colon at the end of each structural statement. In addition, the placement of parentheses is critical. For example, (1 + 2) * (3 + 4), 1 + ((2 * 3) + 4), and 1 + (2 * (3 + 4)) all produce different results.
  • Using the incorrect logical operator: Most of the operators don't present developers with problems, but the logical operators do. Remember to use and when both operands must be True and or when either of the operands can be True.
  • Creating count-by-one errors on loops: Remember that a loop doesn't count the last number you specify in a range. So, if you specify the range [1:11], you actually get output for values between 1 and 10.
  • Using the wrong capitalization: Python is case sensitive, so MyVar is different from myvar and MYVAR. Always check capitalization when you find that you can't access a value you expected to access.
  • Making a spelling mistake: Even seasoned developers suffer from spelling errors at times. Ensuring that you use a common approach to naming variables, classes, and functions does help. However, even a consistent naming scheme won't always prevent you from typing MyVer when you meant to type MyVar.
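A few of these pitfalls are easy to see in action. This short sketch isn't part of the original article; it demonstrates the function-order, count-by-one, and comparison points:

    my_string = '  Python  '
    print(my_string.strip().center(21, '*'))   # strip first, then pad with *
    print(my_string.center(21, '*').strip())   # pad first: the inner spaces remain

    print(list(range(1, 11)))                  # prints 1 through 10; 11 is excluded

    my_var = 3
    print(my_var == 3)                         # comparison (==), not assignment (=)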
