How to Prepare Data for Predictive Analysis - dummies

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

When you’re learning a new programming language, it’s customary to write a “hello world” program. In machine learning and predictive analytics, the equivalent is building a model that classifies the Iris dataset. The example is simple, but it’s very effective in teaching the basics of machine learning and predictive analytics.

How to get the sample dataset

To create the predictive model, you’ll need the sample Iris dataset. This dataset is freely available from many sources, especially academic institutions that have machine-learning departments. Fortunately, the folks at scikit-learn were nice enough to include some sample datasets and data-loading functions along with their package. For the purposes of these examples, you’ll only need to run a couple of simple lines of code to load the data.
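As a minimal sketch of that loading step, assuming the package in question is scikit-learn and that it is installed, the bundled copy of the dataset can be loaded with load_iris:

```python
from sklearn.datasets import load_iris

# Load the copy of the Iris dataset that ships with scikit-learn.
iris = load_iris()

# iris.data holds the measurements; iris.target holds the numeric class labels.
print(iris.data.shape)     # rows are flowers, columns are features
print(iris.target.shape)   # one label per flower
```

The variable name iris is only a convention; any name works.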

How to label your data

Here is one observation and its features from each class of the Iris Flower dataset.

Sepal Length   Sepal Width   Petal Length   Petal Width   Target Class/Label
5.1            3.5           1.4            0.2           Setosa (0)
7.0            3.2           4.7            1.4           Versicolor (1)
6.3            3.3           6.0            2.5           Virginica (2)

The Iris Flower dataset is a real multivariate dataset of three classes of the Iris flower (Iris setosa, Iris virginica, and Iris versicolor) introduced by Ronald Fisher in his 1936 article, “The Use of Multiple Measurements in Taxonomic Problems.” This dataset is best known for its extensive use in academia for machine learning and statistics.

The dataset consists of 150 instances, 50 from each of the 3 classes of the Iris flower. Each instance has 4 features (also commonly called attributes): the length and width of the sepal and the length and width of the petal, all measured in centimeters.

The interesting part of this dataset is that the three classes are only partly linearly separable. On a scatter plot of the features, the Setosa class can be separated from the other two classes by a straight line. The Virginica and Versicolor classes can’t be perfectly separated by a straight line, although it’s close. This makes the dataset a good candidate for classification analysis but a weaker one for clustering analysis, which has no labels to help tell the two overlapping classes apart.
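The separability claim can be checked directly. The sketch below assumes scikit-learn’s bundled copy of the data and compares a single feature, petal length (column index 2), across the classes; the choice of feature is illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]   # column 2 is petal length
y = iris.target                  # 0 = setosa, 1 = versicolor, 2 = virginica

setosa_max = petal_length[y == 0].max()
others_min = petal_length[y != 0].min()
# A gap between these values means one threshold on petal length
# separates Setosa from the other two classes.
setosa_separable = setosa_max < others_min

versicolor_max = petal_length[y == 1].max()
virginica_min = petal_length[y == 2].min()
# Overlapping ranges mean no single threshold on this feature
# splits Versicolor from Virginica perfectly.
ranges_overlap = versicolor_max >= virginica_min

print(setosa_separable, ranges_overlap)
```

This checks only one feature, so it illustrates the claim rather than proving it for all four measurements taken together.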

The sample data is already labeled. The rightmost column (Target Class/Label) of the table above shows the name of each class of the Iris flower. The class name is called a label or a target; it’s usually assigned to a variable named y. It is the outcome that the model is trying to predict.

In statistics and modeling, it is often referred to as the dependent variable. It depends on the inputs that correspond to sepal length and width and to petal length and width.
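In code, this split between inputs and outcome is conventionally written as X and y. A minimal sketch, assuming scikit-learn’s load_iris (the name X is a common convention, not something specified here):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # independent variables: the four measurements per flower
y = iris.target    # dependent variable: the numeric class label per flower

# Map the numeric label of the first observation back to its class name.
first_label_name = iris.target_names[y[0]]
print(X[0], y[0], first_label_name)
```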

You may also want to know how the scikit preprocessed Iris dataset differs from the original dataset. To find out, you need to obtain the original data file. You can search the web for “iris dataset” and download or view the file from any one of the academic institutions that host it.

The result that usually comes up first is the University of California Irvine’s (UCI) machine-learning repository of datasets. The Iris dataset in its original state from the UCI machine-learning repository can be found on the UCI website.

If you download it, you should be able to view it with any text editor. Upon viewing the data in the file, you’ll notice that there are five columns in each row. The first four columns are the measurements (referred to as the features) and the last column is the label. The label differs between the original and scikit versions of the Iris dataset.
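To illustrate the five-column layout, the sketch below parses three rows written in the original file’s comma-separated format. The rows are the sample observations shown in this article, embedded as a string so that nothing needs to be downloaded; the variable names are illustrative.

```python
import csv
import io

# Three rows in the original file's layout: four measurements, then a text label.
sample = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

features = []
labels = []
for row in csv.reader(io.StringIO(sample)):
    features.append([float(value) for value in row[:4]])  # first four columns
    labels.append(row[4])                                  # last column

print(features[0], labels[0])
```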

Another difference is the first row of the data file. The scikit version includes a header row that its data-loading function uses; the header has no effect on the algorithms themselves.

Converting text values, such as these labels, to numbers rather than keeping them as text makes the data easier for the algorithms to process, and it’s much more memory-efficient. This is especially evident when you run very large datasets with many features, which is often the case in real scenarios.
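A minimal sketch of that text-to-number conversion, using only the standard library. The mapping is built by sorting the distinct label strings, which happens to reproduce the 0, 1, 2 numbering used for the three Iris classes; it illustrates the idea and is not scikit’s own loading code.

```python
# Text labels as they appear in the original data file.
text_labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

# Assign each distinct label an integer, in alphabetical order.
label_to_id = {name: i for i, name in enumerate(sorted(set(text_labels)))}

# Replace every text label with its integer code.
numeric_labels = [label_to_id[name] for name in text_labels]
print(label_to_id)
print(numeric_labels)
```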

Here is sample data from both files. All the data columns are the same except for Col5. Note that the scikit version uses numerical labels in place of class names; the original file has text labels.

Source     Col1   Col2   Col3   Col4   Col5
scikit     5.1    3.5    1.4    0.2    0
original   5.1    3.5    1.4    0.2    Iris-setosa
scikit     7.0    3.2    4.7    1.4    1
original   7.0    3.2    4.7    1.4    Iris-versicolor
scikit     6.3    3.3    6.0    2.5    2
original   6.3    3.3    6.0    2.5    Iris-virginica