## Articles From Anasse Bari


Cheat Sheet / Updated 04-27-2022

A predictive analytics project combines execution of details with big-picture thinking. These handy tips and checklists will help keep your project on the rails and out of the woods.

Article / Updated 04-26-2017

After you build your first classification predictive model for analysis of the data, creating more models like it is a straightforward task in scikit-learn. The only real difference from one model to the next is that you may have to tune the parameters from algorithm to algorithm.

#### How to load your data

This code listing will load the iris dataset into your session:

```python
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
```

#### How to create an instance of the classifier

The following two lines of code create an instance of the classifier. The first line imports the linear-model module, which contains logistic regression. The second line creates an instance of the logistic regression algorithm.

```python
>>> from sklearn import linear_model
>>> logClassifier = linear_model.LogisticRegression(C=1, random_state=111)
```

Notice the C parameter (the regularization parameter) in the constructor. Regularization is used to prevent overfitting; in scikit-learn, C is the inverse of the regularization strength, so smaller values mean stronger regularization. The parameter isn’t strictly necessary (the constructor will work fine without it because it will default to C=1). Creating a logistic regression classifier using C=150 creates a better plot of the decision surface. You can see both plots below.

#### How to run the training data

You’ll need to split the dataset into training and test sets before you can train the logistic regression classifier. The following code will accomplish that task. (In scikit-learn 0.20 and later, train_test_split lives in the model_selection module; the older cross_validation module has been removed.)

```python
>>> from sklearn import model_selection
>>> X_train, X_test, y_train, y_test = model_selection.train_test_split(iris.data, iris.target, test_size=0.10, random_state=111)
>>> logClassifier.fit(X_train, y_train)
```

Line 1 imports the library that allows you to split the dataset into two parts. Line 2 calls the function from the library that splits the dataset into two parts and assigns the now-divided datasets to two pairs of variables. Line 3 takes the instance of the logistic regression classifier you just created and calls the fit method to train the model with the training dataset.
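As a rough illustration of what a 90/10 split does, here is a minimal standard-library sketch. The helper name `split_indices` is hypothetical, and the code does not reproduce scikit-learn's shuffling, so the rows it selects will differ from those chosen by `train_test_split` even with the same seed value.

```python
import random

def split_indices(n_samples, test_size=0.10, seed=111):
    """Shuffle row indices and split them into (train, test) lists.

    Hypothetical helper for illustration only; it is not scikit-learn's
    algorithm and will not pick the same rows as train_test_split.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)     # reproducible shuffle
    n_test = round(n_samples * test_size)    # 10 percent of the rows
    return indices[n_test:], indices[:n_test]

# The iris dataset has 150 rows, so a 10 percent test set holds 15 of them.
train_idx, test_idx = split_indices(150)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two lists, which is the property that keeps the test data unseen during training.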
#### How to visualize the classifier

Looking at the decision surface area on the plot, it looks like some tuning has to be done. If you look near the middle of the plot, you can see that many of the data points belonging to the middle area (Versicolor) are lying in the area to the right side (Virginica).

This image shows the decision surface with a C value of 150. It visually looks better, so choosing to use this setting for your logistic regression model seems appropriate.

#### How to run the test data

In the following code, the first line feeds the test dataset to the model, the second line asks for the result, and the third line displays the output:

```python
>>> predicted = logClassifier.predict(X_test)
>>> predicted
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
```

#### How to evaluate the model

You can cross-reference the output from the prediction against the y_test array. As a result, you can see that it predicted all the test data points correctly. Here’s the code:

```python
>>> from sklearn import metrics
>>> predicted
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> y_test
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> metrics.accuracy_score(y_test, predicted)  # 1.0 is 100 percent accuracy
1.0
>>> predicted == y_test
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True], dtype=bool)
```

So how does the logistic regression model with parameter C=150 compare to that? Well, you can’t beat 100 percent. Here is the code to create and evaluate the logistic classifier with C=150:

```python
>>> logClassifier_2 = linear_model.LogisticRegression(C=150, random_state=111)
>>> logClassifier_2.fit(X_train, y_train)
>>> predicted = logClassifier_2.predict(X_test)
>>> metrics.accuracy_score(y_test, predicted)
0.93333333333333335
>>> metrics.confusion_matrix(y_test, predicted)
array([[5, 0, 0],
       [0, 2, 0],
       [0, 1, 7]])
```

We expected better, but it was actually worse. There was one error in the predictions. The result is the same as that of the Support Vector Machine (SVM) model.
Here is the full listing of the code to create and evaluate a logistic regression classification model with the default parameters:

```python
>>> from sklearn.datasets import load_iris
>>> from sklearn import linear_model
>>> from sklearn import model_selection  # named cross_validation before scikit-learn 0.18
>>> from sklearn import metrics
>>> iris = load_iris()
>>> X_train, X_test, y_train, y_test = model_selection.train_test_split(iris.data, iris.target, test_size=0.10, random_state=111)
>>> logClassifier = linear_model.LogisticRegression(random_state=111)
>>> logClassifier.fit(X_train, y_train)
>>> predicted = logClassifier.predict(X_test)
>>> predicted
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> y_test
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> metrics.accuracy_score(y_test, predicted)  # 1.0 is 100 percent accuracy
1.0
>>> predicted == y_test
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True], dtype=bool)
```
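To make the evaluation step less of a black box, the sketch below recomputes the accuracy and the confusion matrix by hand from the two label arrays printed in the listing. The function names are illustrative stand-ins written in plain Python, not the scikit-learn implementations; the row/column layout (rows are true classes, columns are predicted classes) follows the scikit-learn convention.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix: rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Label arrays copied from the listing (default-parameter model).
y_test    = [0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2]
predicted = [0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2]

print(accuracy(y_test, predicted))             # 1.0
print(confusion_matrix(y_test, predicted, 3))  # [[5, 0, 0], [0, 2, 0], [0, 0, 8]]
```

Any count off the main diagonal is a misclassification, which is why a perfect model produces a purely diagonal matrix.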

Article / Updated 03-24-2017

Another task in predictive analytics is to classify new data by predicting what class a target item of data belongs to, given a set of independent variables. You can, for example, classify a customer by type (say, as a high-value customer, a regular customer, or a customer who is ready to switch to a competitor) by using a decision tree.

To see some useful information about the R classification model, type in the following code:

```r
> summary(model)
    Length      Class       Mode
         1 BinaryTree         S4
```

The Class column tells you that you’ve created a decision tree. To see how the splits are being determined, you can simply type in the name of the variable to which you assigned the model, in this case model, like this:

```r
> model

	 Conditional inference tree with 6 terminal nodes

Response:  seedType
Inputs:  area, perimeter, compactness, length, width, asymmetry, length2
Number of observations:  147

1) area <= 16.2; criterion = 1, statistic = 123.423
  2) area <= 13.37; criterion = 1, statistic = 63.549
    3) length2 <= 4.914; criterion = 1, statistic = 22.251
      4)*  weights = 11
    3) length2 > 4.914
      5)*  weights = 45
  2) area > 13.37
    6) length2 <= 5.396; criterion = 1, statistic = 16.31
      7)*  weights = 33
    6) length2 > 5.396
      8)*  weights = 8
1) area > 16.2
  9) length2 <= 5.877; criterion = 0.979, statistic = 8.764
    10)*  weights = 10
  9) length2 > 5.877
    11)*  weights = 40
```

Even better, you can visualize the model by creating a plot of the decision tree with this code:

```r
> plot(model)
```

The plot is a graphical representation of a decision tree. You can see that the overall shape mimics that of a real tree. It’s made of nodes (the circles and rectangles) and links or edges (the connecting lines). The very first node (starting at the top) is called the root node and the nodes at the bottom of the tree (rectangles) are called terminal nodes. There are five decision nodes and six terminal nodes. At each node, the model makes a decision based on the criteria in the circle and the links, and chooses a way to go.
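The printed splits can also be read as nested conditions. The sketch below is a hand transcription of those thresholds into a Python function (the name `terminal_node` is made up for illustration). It returns the number of the terminal node an observation lands in rather than a seed type, because the printout shows only the node weights, not the class attached to each terminal node.

```python
def terminal_node(area, length2):
    """Route one observation through the printed splits; return the terminal node number."""
    if area <= 16.2:                                  # node 1
        if area <= 13.37:                             # node 2
            return 4 if length2 <= 4.914 else 5       # node 3
        return 7 if length2 <= 5.396 else 8           # node 6
    return 10 if length2 <= 5.877 else 11             # node 9

# Example seed with an area of 14.88 and a length2 of 4.956.
print(terminal_node(area=14.88, length2=4.956))  # 7
```

Only `area` and `length2` appear as arguments because they are the only two inputs the printed tree actually splits on.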
When the model reaches a terminal node, a verdict or a final decision is reached. In this particular case, two attributes, length2 and area, are used to decide whether a given seed type is in class 1, 2, or 3. For example, take observation #2 from the dataset. It has a length2 of 4.956 and an area of 14.88. You can use the tree you just built to decide which particular seed type this observation belongs to. Here’s the sequence of steps:

1. Start at the root node, which is node 1 (the number is shown in the small square at the top of the circle). Decide based on the area attribute: Is the area of observation #2 less than or equal to (denoted by <=) 16.2? The answer is yes, so move along the path to node 2.
2. At node 2, the model asks: Is the area <= 13.37? The answer is no, so try the next link, which asks: Is the area > 13.37? The answer is yes, so move along the path to node 6.
3. At this node the model asks: Is the length2 <= 5.396? It is, so you move to terminal node 7, and the verdict is that observation #2 is of seed type 1. And it is, in fact, seed type 1.

The model does that process for all other observations to predict their classes.

To find out whether you trained a good model, check it against the training data. You can view the results in a table with the following code:

```r
> table(predict(model), trainSet$seedType)

     1  2  3
  1 45  4  3
  2  3 47  0
  3  1  0 44
```

The results show that the error (or misclassification rate) is 11 out of 147, or 7.48 percent.

With the results calculated, the next step is to read the table. Rows are the predicted seed types and columns are the actual seed types, so the correct predictions are the ones that show the column and row numbers as the same. Those results show up as a diagonal line from top-left to bottom-right; for example, [1,1], [2,2], [3,3] are the number of correct predictions for that class. So for predictions of seed type 1, the model was correct 45 times and wrong 7 times (4 of those seeds were actually type 2, and 3 were actually type 3). For predictions of seed type 2, the model was correct 47 times and wrong 3 times.
For predictions of seed type 3, the model was correct 44 times and wrong only once. This shows that this is a good model.

So now you evaluate it with the test data. Here is the code that uses the test data to predict and store it in a variable (testPrediction) for later use:

```r
> testPrediction <- predict(model, newdata=testSet)
```

To evaluate how the model performed with the test data, view it in a table and calculate the error, for which the code looks like this:

```r
> table(testPrediction, testSet$seedType)

testPrediction  1  2  3
             1 23  2  1
             2  1 19  0
             3  1  0 17
```

The results show that the error is 5 out of 64, or 7.81 percent. This is consistent with the training data.
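The two error figures can be checked directly from the tables: everything off the main diagonal is a misclassification. A small sketch of that arithmetic follows; it is written in Python rather than R, the function name is illustrative, and the counts are copied from the two tables.

```python
def misclassification_rate(table):
    """Share of observations that fall off the main diagonal of a square count table."""
    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(table)))
    return (total - correct) / total

# Rows are predicted seed types, columns are actual seed types.
train_table = [[45, 4, 3], [3, 47, 0], [1, 0, 44]]
test_table = [[23, 2, 1], [1, 19, 0], [1, 0, 17]]

print(f"{misclassification_rate(train_table):.2%}")  # 7.48%
print(f"{misclassification_rate(test_table):.2%}")   # 7.81%
```

The training and test rates are close to each other, which is the consistency the text points to.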

Article / Updated 11-14-2016

Think of predictive analytics as a bright bulb powered by your data. The light (insight) from predictive analytics can empower your strategy, streamline your operations, and improve your bottom line. The following four recommendations can help you ensure success for your predictive analytics initiatives.

#### Foster a culture of change

Predictive analytics should be adopted across the organization as a whole. The organization should embrace change. Business stakeholders should be ready to incorporate recommendations and adopt findings derived from the predictive analytics projects. The outcomes of a predictive analytics project are only valuable if the business leaders are willing to act on them.

#### Create a data-science team

Hire a data-science team whose sole job is to establish and support your predictive analytics solutions. This team of talented professionals (comprising business analysts, data scientists, and information technologists) is better equipped to work on the project full-time. Including a range of professional backgrounds can bring valuable insights to the team from other domains. Selecting team members from different departments in your organization can help ensure widespread buy-in.

#### Use visualization tools effectively

Visualization is a powerful way to convey complex ideas efficiently. Using visualization effectively can help you initially explore and understand the data you’re working with. Visual aids such as charts can also help you evaluate the model’s output or compare the performance of predictive models.

#### Use predictive analytics tools

Powerful predictive analytics tools are available as software packages in the marketplace. They’re designed to make the whole process a lot easier. Without the use of such tools, building a model from scratch quickly becomes time-intensive. Using a good predictive analytics tool enables you to run multiple scenarios and instantaneously compare the results, all with a few clicks. A tool can quickly automate many of the time-consuming steps required to build and evaluate one or more models.

Article / Updated 11-14-2016

Data for a predictive analytics project can come from many different sources. Some of the most common sources are within your own organization; other common sources include data purchased from outside vendors.

Internal data sources include

- Transactional data, such as customer purchases
- Customer profiles, such as user-entered information from registration forms
- Campaign histories, including whether customers responded to advertisements
- Clickstream data, including the patterns of customers’ web clicks
- Customer interactions, such as those from e-mails, chats, surveys, and customer-service calls
- Machine-generated data, such as that from telematics, sensors, and smart meters

External data sources include

- Social media such as Facebook, Twitter, and LinkedIn
- Subscription services such as Bloomberg, Thomson Reuters, Esri, and Westlaw

By combining data from several disparate data sources in your predictive models, you may get a better overall view of your customer, and thus a more accurate model.

Article / Updated 11-14-2016

A successful predictive analytics project is executed step by step. As you immerse yourself in the details of the project, watch for these major milestones:

#### Defining Business Objectives

The project starts with a well-defined business objective. The model is supposed to address a business question. Clearly stating that objective will allow you to define the scope of your project, and will provide you with the exact test to measure its success.

#### Preparing Data

You’ll use historical data to train your model. The data is usually scattered across multiple sources and may require cleansing and preparation. Data may contain duplicate records and outliers; depending on the analysis and the business objective, you decide whether to keep or remove them. Also, the data could have missing values, may need to undergo some transformation, and may be used to generate derived attributes that have more predictive power for your objective. Overall, the quality of the data indicates the quality of the model.

#### Sampling Your Data

You’ll need to split your data into two sets: training and test datasets. You build the model using the training dataset. You use the test dataset to verify the accuracy of the model’s output. Doing so is absolutely crucial. Otherwise you run the risk of overfitting your model: training the model with a limited dataset, to the point that it picks up all the characteristics (both the signal and the noise) that are only true for that particular dataset. A model that’s overfitted for a specific dataset will perform miserably when you run it on other datasets. A test dataset ensures a valid way to accurately measure your model’s performance.

#### Building the Model

Sometimes the data or the business objectives lend themselves to a specific algorithm or model. Other times the best approach is not so clear-cut. As you explore the data, run as many algorithms as you can; compare their outputs. Base your choice of the final model on the overall results. Sometimes you’re better off running an ensemble of models simultaneously on the data and choosing a final model by comparing their outputs.

#### Deploying the Model

After building the model, you have to deploy it in order to reap its benefits. That process may require coordination with other departments. Aim at building a deployable model. Also be sure you know how to present your results to the business stakeholders in an understandable and convincing way so they adopt your model. After the model is deployed, you’ll need to monitor its performance and continue improving it. Most models decay after a certain period of time. Keep your model up to date by refreshing it with newly available data.

Article / Updated 03-26-2016

Various statistical, data-mining, and machine-learning algorithms are available for use in your predictive analysis model. You’re in a better position to select an algorithm after you’ve defined the objectives of your model and selected the data you’ll work on. Some of these algorithms were developed to solve specific business problems, enhance existing algorithms, or provide new capabilities, which may make some of them more appropriate for your purposes than others. You can choose from a range of algorithms to address business concerns such as the following:

- For customer segmentation and/or community detection in the social sphere, for example, you’d need clustering algorithms.
- For customer retention or to develop a recommender system, you’d use classification algorithms.
- For credit scoring or predicting the next outcome of time-driven events, you’d use a regression algorithm.

As time and resources permit, you should run as many algorithms of the appropriate type as you can. Comparing different runs of different algorithms can bring surprising findings about the data or the business intelligence embedded in the data. Doing so gives you more detailed insight into the business problem, and helps you identify which variables within your data have predictive power.

Some predictive analytics projects succeed best by building an ensemble model, a group of models that operate on the same data. An ensemble model uses a predefined mechanism to gather outcomes from all its component models and provide a final outcome for the user.

Models can take various forms: a query, a collection of scenarios, a decision tree, or an advanced mathematical analysis. In addition, certain models work best for certain data and analyses. You can (for example) use classification algorithms that employ decision rules to decide the outcome of a given scenario or transaction, addressing questions like these:

- Is this customer likely to respond to our marketing campaign?
- Is this money transfer likely to be part of a money-laundering scheme?
- Is this loan applicant likely to default on the loan?

You can use unsupervised clustering algorithms to find what relationships exist within your dataset. You can use these algorithms to find different groupings among your customers, determine what services can be grouped together, or decide, for example, which products can be upsold.

Regression algorithms can be used to forecast continuous data, such as predicting the trend for a stock movement given its past prices.

Data and business objectives aren’t the only factors to consider when you’re selecting an algorithm. The expertise of your data scientists is of tremendous value at this point; picking an algorithm that will get the job done is often a tricky combination of science and art. The art part comes from experience and proficiency in the business domain, which also plays a critical role in identifying a model that can serve business objectives accurately.

Article / Updated 03-26-2016

To be able to test the predictive analysis model you built, you need to split your dataset into two sets: training and test datasets. These datasets should be selected at random and should be a good representation of the actual population.

- Similar data should be used for both the training and test datasets.
- Normally the training dataset is significantly larger than the test dataset.
- Using the test dataset helps you avoid errors such as overfitting.
- The trained model is run against test data to see how well the model will perform.

Some data scientists prefer to have a third dataset that has characteristics similar to those of the first two: a validation dataset. The idea is that if you’re actively using your test data to refine your model, you should use a separate (third) set to check the accuracy of the model. Having a validation dataset that wasn’t used as part of the development process of your model helps ensure a neutral estimation of the model’s accuracy and efficacy.

If you’ve built multiple models using various algorithms, the validation sample can also help you evaluate which model performs best.

Make sure you double-check your work developing and testing the model. In particular, be skeptical if the performance or the accuracy of the model seems too good to be true. Errors can happen where you least expect them. Incorrectly calculating dates for time series data, for example, can lead to erroneous results.

#### How to employ cross-validation

Cross-validation is a popular technique you can use to evaluate and validate your model. The same principle of using separate datasets for testing and training applies here: The training data is used to build the model; the model is run against the testing set to predict data it hasn’t seen before, which is one way to evaluate its accuracy.

In cross-validation, the historical data is split into X subsets. Each time a subset is chosen to be used as test data, the rest of the subsets are used as training data. Then, on the next run, the former test set becomes one of the training sets and one of the former training sets becomes the test set. The process continues until every one of the X subsets has been used as a test set.

For example, imagine you have a dataset that you have divided into 5 sets numbered 1 to 5. In the first run, you use set 1 as the test set and use sets 2, 3, 4, and 5 as the training set. Then, on the second run, you use set 2 as the test set and sets 1, 3, 4, and 5 as the training set. You continue this process until every one of the 5 sets has been used as a test set.

Cross-validation allows you to use every data point in your historical data for both training and testing. This technique is more effective than just splitting your historical data into two sets, using the set with the most data for training, using the other set for testing, and leaving it at that.

When you cross-validate your data, you’re protecting yourself against randomly picking test data that’s too easy to predict, which would give you the false impression that your model is accurate. Or, if you happen to pick test data that’s too hard to predict, you might falsely conclude that your model isn’t performing as you had hoped.

Cross-validation is widely used not only to validate the accuracy of models but also to compare the performance of multiple models.

#### How to balance bias and variance

Bias and variance are two sources of errors that can take place as you’re building your analytical model.

Bias is the result of building a model that significantly simplifies the representation of the relationships among data points in the historical data used to build the model.

Variance is the result of building a model that is explicitly specific to the data used to build the model.

Achieving a balance between bias and variance (by reducing the variance and tolerating some bias) can lead to a better predictive model. This trade-off usually leads to building less complex predictive models.

Many data-mining algorithms have been created to take into account this trade-off between bias and variance.

#### How to troubleshoot ideas

When you’re testing your model and you find yourself going nowhere, here are a few ideas to consider that may help you get back on track:

- Always double-check your work. You may have overlooked something you assumed was correct but isn’t. Such flaws could show up (for example) among the values of a predictive variable in your dataset, or in the preprocessing you applied to the data.
- If the algorithm you chose isn’t yielding any results, try another algorithm. For example, you can try several of the available classification algorithms; depending on your data and the business objectives of your model, one of them may perform better than the others.
- Try selecting different variables or creating new derived variables. Always be on the lookout for variables that have predictive power.
- Frequently consult with the business domain experts who can help you make sense of the data, select variables, and interpret the model’s results.
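The five-subset rotation described in the cross-validation section can be sketched in a few lines. This is a minimal standard-library illustration, assuming the rows are already in random order; `k_fold_splits` is a hypothetical helper name, and the ten-row dataset is made up purely to keep the printout short.

```python
def k_fold_splits(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Assumes the rows are already shuffled; each fold serves as the
    test set exactly once while the remaining folds form the training set.
    """
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        stop = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:stop])
        start = stop
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

# Ten hypothetical rows split into 5 subsets of 2 rows each.
splits = list(k_fold_splits(10, k=5))
for run, (train, test) in enumerate(splits, start=1):
    print(f"run {run}: test={test} train={train}")
```

In practice you would train and score a model inside the loop and average the five scores; that part is omitted here because it depends on the algorithm you are evaluating.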

Article / Updated 03-26-2016

K is an input to the algorithm for predictive analysis; it stands for the number of groupings that the algorithm must extract from a dataset, expressed algebraically as k. A K-means algorithm divides a given dataset into k clusters. The algorithm performs the following operations:

1. Pick k random items from the dataset and label them as cluster representatives.
2. Associate each remaining item in the dataset with the nearest cluster representative, using a Euclidean distance calculated by a similarity function.
3. Recalculate the new clusters’ representatives.
4. Repeat Steps 2 and 3 until the clusters do not change.

A representative of a cluster is the mathematical mean (average) of all items that belong to the same cluster. This representative is also called a cluster centroid. For instance, consider three items from the fruits dataset where

- Type 1 corresponds to bananas.
- Type 2 corresponds to apples.
- Color 2 corresponds to yellow.
- Color 3 corresponds to green.

Assuming that these items are assigned to the same cluster, the centroid of these three items is calculated.

| Item | Feature#1 Type | Feature#2 Color | Feature#3 Weight (Ounces) |
|---|---|---|---|
| 1 | 1 | 2 | 5.33 |
| 2 | 2 | 3 | 9.33 |
| 3 | 1 | 2 | 2.1 |

Here are the calculations of a cluster representative of three items that belong to the same cluster. The cluster representative is a vector of three attributes. Its attributes are the average of the attributes of the items in the cluster in question.

| Item | Feature#1 Type | Feature#2 Color | Feature#3 Weight (Ounces) |
|---|---|---|---|
| 1 | 1 | 2 | 5.33 |
| 2 | 2 | 3 | 9.33 |
| 3 | 1 | 2 | 2.1 |
| Cluster Representative (Centroid Vector) | (1+2+1)/3 = 1.33 | (2+3+2)/3 = 2.33 | (5.33+9.33+2.1)/3 = 5.59 |

The dataset shown next consists of seven customers’ ratings of two products, A and B. The rating represents the number of points (between 0 and 10) that each customer has given to a product: the more points given, the higher the product is ranked. Using a K-means algorithm and assuming that k is equal to 2, the dataset will be partitioned into two groups.
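The centroid calculation for the three fruit items is just a column-by-column average. A minimal sketch, with the tuples copied from the fruit table as (type, color, weight) and `centroid` as an illustrative function name:

```python
def centroid(items):
    """Return the feature-wise mean of a list of equal-length numeric tuples."""
    n = len(items)
    return tuple(sum(column) / n for column in zip(*items))

# (type, color, weight in ounces) for the three items in the same cluster.
fruits = [(1, 2, 5.33), (2, 3, 9.33), (1, 2, 2.1)]

print([round(value, 2) for value in centroid(fruits)])  # [1.33, 2.33, 5.59]
```

Note that averaging coded categories such as type and color follows the worked example; the resulting 1.33 and 2.33 are positions in feature space rather than real fruit types or colors.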
The rest of the procedure looks like this:

Pick two random items from the dataset and label them as cluster representatives.

The following shows the initial step of selecting random centroids from which the K-means clustering process begins. The initial centroids are selected randomly from the data that you are about to analyze. In this case, you’re looking for two clusters, so two data items are randomly selected: Customers 1 and 5. At first, the clustering process builds two clusters around those two initial (randomly selected) cluster representatives. Then the cluster representatives are recalculated; the calculation is based on the items in each cluster.

| Customer ID | Customer Ratings of Product A | Customer Ratings of Product B |
|---|---|---|
| 1 | 2 | 2 |
| 2 | 3 | 4 |
| 3 | 6 | 8 |
| 4 | 7 | 10 |
| 5 | 10 | 14 |
| 6 | 9 | 10 |
| 7 | 7 | 9 |

Inspect every other item (customer) and assign it to the cluster representative to which it is most similar. Use the Euclidean distance to calculate how similar an item is to a group of items (the smaller the distance, the more similar the item):

$$\text{Similarity of Item } I \text{ to Cluster } X = \sqrt{(f_1 - x_1)^2 + (f_2 - x_2)^2 + \cdots + (f_n - x_n)^2}$$

The values $f_1, f_2, \ldots, f_n$ are the numerical values of the features that describe the item in question. The values $x_1, x_2, \ldots, x_n$ are the features (mean values) of the cluster representative (centroid), assuming that each item has n features.

For instance, consider the item called Customer 2 (3, 4): The customer’s rating for Product A was 3 and the rating for Product B was 4. The cluster representative feature is (2, 2).
The similarity of Customer 2 to Cluster 1 is calculated as follows:

$$\text{Similarity of Item 2 to Cluster 1} = \sqrt{(3 - 2)^2 + (4 - 2)^2} \approx 2.24$$

Here’s what the same process looks like with Cluster 2:

$$\text{Similarity of Item 2 to Cluster 2} = \sqrt{(3 - 10)^2 + (4 - 14)^2} \approx 12.21$$

Comparing these results, you assign Item 2 (that is, Customer 2) to Cluster 1 because the smaller distance says Item 2 is more similar to Cluster 1.

Apply the same similarity analysis to every other item in the dataset. Every time a new member joins a cluster, you must recalculate the cluster representative.

This depicts the results of the first iteration of the K-means algorithm. Notice that k equals 2, so you’re looking for two clusters, which divides a set of customers into two meaningful groups. Every customer is analyzed separately and is assigned to one of the clusters on the basis of the customer's similarity to each of the current cluster representatives.

Iterate the dataset again, going through every element; compute the similarity between each element and its current cluster representative.

Notice that Customer 3 has moved from Cluster 1 to Cluster 2. This is because Customer 3 is closer to the cluster representative of Cluster 2 than to the cluster representative of Cluster 1.

| | Cluster Representative (Centroid Vector) |
|---|---|
| Cluster 1 | Customer ID#1 (2, 2) |
| Cluster 2 | Customer ID#5 (10, 14) |

Iteration#1

| Customer to be examined | Customer IDs belonging to Cluster 1 | Cluster 1 Representative | Customer IDs belonging to Cluster 2 | Cluster 2 Representative |
|---|---|---|---|---|
| | 1 | (2, 2) | 5 | (10, 14) |
| 2 | 1, 2 | (2.5, 3) | 5 | (10, 14) |
| 3 | 1, 2, 3 | (3.6, 4.6) | 5 | (10, 14) |
| 4 | 1, 2, 3 | (3.6, 4.6) | 4, 5 | (8.5, 12) |
| 6 | 1, 2, 3 | (3.6, 4.6) | 4, 5, 6 | (8.6, 11.4) |
| 7 | 1, 2, 3 | (3.6, 4.6) | 4, 5, 6, 7 | (8.2, 10.8) |

Here is a second iteration of the K-means algorithm on customer data. Each customer is being re-analyzed.
Customer 2 is being assigned to Cluster 1 because Customer 2 is closer to the representative of Cluster 1 than to that of Cluster 2. The same scenario applies to Customer 4. Notice that a cluster representative is being recalculated each time a new member is assigned to a cluster.

Iteration#2

| Customer to be examined | Customer IDs belonging to Cluster 1 | Cluster 1 Representative | Customer IDs belonging to Cluster 2 | Cluster 2 Representative |
|---|---|---|---|---|
| 1 | 1 | (3.6, 4.6) | 5 | (8.2, 10.8) |
| 2 | 1, 2 | (2.5, 3) | 5 | (8.2, 10.8) |
| 3 | 1, 2 | (2.5, 3) | 3, 5 | (7.8, 10.2) |
| 4 | 1, 2 | (2.5, 3) | 3, 4, 5 | (7.8, 10.2) |
| 6 | 1, 2 | (2.5, 3) | 3, 4, 5, 6 | (7.8, 10.2) |
| 7 | 1, 2 | (2.5, 3) | 3, 4, 5, 6, 7 | (7.8, 10.2) |
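For readers who want to replay the customer example, here is a minimal standard-library sketch. It uses the common batch form of K-means (assign every customer, then recalculate both representatives), which is a different update schedule from the walkthrough above, where a representative is recalculated each time a member joins. The function names are illustrative, and Customers 1 and 5 are used as starting representatives, as in the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Feature-wise mean of a list of tuples."""
    return tuple(sum(column) / len(points) for column in zip(*points))

def k_means(data, initial_ids, max_iter=100):
    """Batch K-means: repeat assignment and recalculation until nothing changes."""
    centroids = [data[i] for i in initial_ids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each item goes to its nearest representative.
        new_assignment = {
            cid: min(range(len(centroids)),
                     key=lambda c: euclidean(point, centroids[c]))
            for cid, point in data.items()
        }
        if new_assignment == assignment:   # clusters did not change: stop
            break
        assignment = new_assignment
        # Update step: recalculate each representative as the cluster mean.
        for c in range(len(centroids)):
            members = [data[cid] for cid, a in assignment.items() if a == c]
            if members:
                centroids[c] = mean(members)
    clusters = [sorted(cid for cid, a in assignment.items() if a == c)
                for c in range(len(centroids))]
    return clusters, centroids

# Customer ID -> (rating of Product A, rating of Product B)
ratings = {1: (2, 2), 2: (3, 4), 3: (6, 8), 4: (7, 10),
           5: (10, 14), 6: (9, 10), 7: (7, 9)}

clusters, centroids = k_means(ratings, initial_ids=[1, 5])
print(clusters)    # [[1, 2], [3, 4, 5, 6, 7]]
print(centroids)   # [(2.5, 3.0), (7.8, 10.2)]
```

Despite the different update schedule, this run ends with the same two groups as the second iteration table: Customers 1 and 2 in one cluster and Customers 3 through 7 in the other.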

Article / Updated 03-26-2016

The support vector machine (SVM) is a predictive analysis data-classification algorithm that assigns new data elements to one of a set of labeled categories. SVM is, in most cases, a binary classifier; it assumes that the data in question contains two possible target values. Another version of the SVM algorithm, multiclass SVM, augments SVM to be used as a classifier on a dataset that contains more than two classes (groupings or categories). SVM has been successfully used in many applications such as image recognition, medical diagnosis, and text analytics.

Suppose you’re designing a predictive analytics model that will automatically recognize and predict the name of an object in a picture. This is essentially the problem of image recognition or, more specifically, face recognition: You want the classifier to recognize the name of a person in a photo.

Well, before tackling that level of complexity, consider a simpler version of the same problem: Suppose you have pictures of individual pieces of fruit and you’d like your classifier to predict what kind of fruit appears in the picture. Assume you have only two types of fruit: apples and pears, one per picture.

Given a new picture, you’d like to predict whether the fruit is an apple or a pear, without looking at the picture. You want the SVM to classify each picture as apple or pear. As with all other algorithms, the first step is to train the classifier.

Suppose you have 200 pictures of different apples, and 200 pictures of pears. The learning step consists of feeding those pictures to the classifier so it learns what an apple looks like and what a pear looks like. Before getting into this first step, you need to transform each image into a data matrix, using (say) the R statistical package.

A simple way to represent an image as numbers in a matrix is to look for geometric forms within the image (such as circles, lines, squares, or rectangles) and also the positions of each instance of each geometric form. Those numbers can also represent coordinates of those objects within the image, as plotted in a coordinate system.

As you might imagine, representing an image as a matrix of numbers is not exactly a straightforward task. A whole distinct area of research is devoted to image representation.

The following shows how a support vector machine can predict the class of a fruit (labeling it mathematically as apple or pear), based on what the algorithm has learned in the past.

Suppose you’ve converted all the images into data matrices. Then the support vector machine takes two main inputs:

- Previous (training) data: This set of matrices corresponds to previously seen images of apples and pears.
- New (unseen) data: This consists of an image converted to a matrix. The purpose is to predict automatically what is in the picture (an apple or a pear).

The support vector machine uses a mathematical function, often called a kernel function, that matches the new data to the best image from the training data in order to predict the unknown picture’s label (apple or pear).

In comparison to other classifiers, support vector machines produce robust, accurate predictions, are less affected by noisy data, and are less prone to overfitting. Keep in mind, however, that support vector machines are most suitable for binary classification, when you have only two categories (such as apple or pear).
