Python for Data Science: Developing a Multivariate Approach to Find Outliers

By John Paul Mueller, Luca Massaron

Python is a data scientist’s friend. Working on single variables allows you to spot a large number of outlying observations. However, outliers do not necessarily display values far outside the norm. Sometimes outliers are unusual combinations of values across multiple variables. These combinations are rare but influential, and they can especially trick machine learning algorithms.

In such cases, inspecting every single variable on its own won’t suffice to rule out anomalous cases from your dataset. Only a few selected techniques, which consider several variables at a time, will manage to reveal problems in your data.

The presented techniques approach the problem from different points
of view:

  • Dimensionality reduction

  • Density clustering

  • Nonlinear distribution modeling

Using these techniques allows you to compare their results and take note of the signals that recur for particular cases, some already located by the univariate exploration and some as yet unknown. (A simple cross-check of the three sets of results appears at the end of this article.)

Using principal component analysis

Principal component analysis can completely restructure the data, removing redundancies and ordering the newly obtained components according to the amount of the original variance that they express. This type of analysis offers a synthetic and complete view of the data distribution, making multivariate outliers particularly evident.

The first two components, being the most informative in terms of variance, can depict the general distribution of the data if visualized. The output provides a good hint at possible evident outliers.

The last two components, being the residual ones, depict all the information that the PCA method could not otherwise fit. They can also suggest possible, but less evident, outliers.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# X is the predictor matrix loaded earlier in the analysis (10 features here)
dim_reduction = PCA()
Xc = dim_reduction.fit_transform(scale(X))
print('variance explained by the first 2 components: %0.1f%%' %
      (sum(dim_reduction.explained_variance_ratio_[:2] * 100)))
print('variance explained by the last 2 components: %0.1f%%' %
      (sum(dim_reduction.explained_variance_ratio_[-2:] * 100)))
df = pd.DataFrame(Xc, columns=['comp_' + str(j + 1) for j in range(10)])
first_two = df.plot(kind='scatter', x='comp_1', y='comp_2', c='DarkGray', s=50)
last_two = df.plot(kind='scatter', x='comp_9', y='comp_10', c='DarkGray', s=50)

Look at these two scatterplots of the first and last components. Pay particular attention to the data points that lie far from the bulk of the distribution along either axis. You can see a possible threshold to use for separating regular data from suspect data.

The first two and last two components of the principal component analysis.

Using the last two components, you can locate a few points to investigate using the threshold of –0.3 for the tenth component and of –1.0 for the ninth. All cases below these values are possible outliers.

outlying = (Xc[:, -1] < -0.3) | (Xc[:, -2] < -1.0)
print(df[outlying])

Using cluster analysis

Outliers are isolated points in the space of variables, and DBSCAN is a clustering algorithm that links dense parts of the data together and marks the points in overly sparse regions as noise (class –1). DBSCAN is therefore an ideal tool for an automated exploration of your data for possible outliers to verify.

from collections import Counter
from sklearn.cluster import DBSCAN

# DBSCAN has no random_state parameter; eps and min_samples
# fully determine the result
DB = DBSCAN(eps=2.5, min_samples=25)
DB.fit(Xc)
print(Counter(DB.labels_), '\n')
print(df[DB.labels_ == -1])

The output reports the count of cases per cluster and then the cases labeled as noise (–1):

Counter({0: 414, -1: 28})
       0     1     2     3     4     5     6     7     8     9
15 -0.05  0.05 -0.02  0.08  0.09  0.11 -0.04  0.11  0.04 -0.04
23  0.05  0.05  0.06  0.03  0.03 -0.05 -0.05  0.07  0.13  0.14
29  0.07  0.05 -0.01  0.06 -0.04 -0.10  0.05 -0.08  0.06  0.05
... (results partially omitted)
[28 rows x 10 columns]

However, DBSCAN requires two parameters, eps and min_samples. Finding the right values for them usually takes multiple tries, which makes tuning a little tricky.

Start with a low value of min_samples and try growing the value of eps from 0.1 upward. After every trial with modified parameters, check the situation by counting the number of observations in class –1 in the labels_ attribute, and stop when the number of outliers seems reasonable for a visual inspection.

There will always be points on the fringe of the dense parts of the distribution, so it is hard to provide a threshold for the number of cases that might end up in the –1 class. Normally, outliers should not account for more than 5 percent of the cases, so use this indication as a generic rule of thumb. The sketch that follows shows one way to run the trials.
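The following is a minimal sketch of that trial procedure: it grows eps over an arbitrary grid while keeping min_samples fixed at the value used above, and it reports the share of cases labeled –1 at each step. The grid values and the 5 percent stopping point are illustrative assumptions to adapt to your own data.

import numpy as np
from sklearn.cluster import DBSCAN

for eps in np.arange(0.1, 3.1, 0.3):
    labels = DBSCAN(eps=eps, min_samples=25).fit(Xc).labels_
    n_noise = int(np.sum(labels == -1))
    share = n_noise / float(len(labels))
    print('eps=%0.1f -> %i cases in class -1 (%0.1f%%)'
          % (eps, n_noise, share * 100))
    # Stop growing eps once the share of class -1 cases falls to a level
    # you can inspect visually (roughly 5 percent or less).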

Automating outlier detection with SVM

Support Vector Machines (SVM) is a powerful machine learning technique. OneClassSVM is an algorithm that specializes in learning the expected distribution of a dataset. OneClassSVM is especially useful as a novelty detector if you can first provide data cleaned of outliers; otherwise, it’s effective as a detector of multivariate outliers. To have OneClassSVM work properly, you have two key parameters to fix:

  • gamma, telling the algorithm how closely to follow the dataset distribution. gamma must be a positive number: larger values make the RBF kernel follow the data points closely, while smaller values make it approximate the overall distribution with a smoother boundary. For novelty detection, larger values (follow the distribution) work better; for outlier detection, smaller values (approximate the distribution) are preferred.

  • nu, which can be calculated by the following formula: nu_estimate = 0.95 * f + 0.05, where f is the fraction of expected outliers (a number from 0 to 1). If your purpose is novelty detection, f will be 0.

By executing the following script, you get OneClassSVM working as an outlier detection system:

from sklearn import svm

outliers_fraction = 0.01  # expected fraction of outliers in the data
nu_estimate = 0.95 * outliers_fraction + 0.05
auto_detection = svm.OneClassSVM(kernel="rbf", gamma=0.01, degree=3,
                                 nu=nu_estimate)
auto_detection.fit(Xc)
evaluation = auto_detection.predict(Xc)
print(df[evaluation == -1])

       0     1     2     3     4     5     6     7     8     9
10 -0.10 -0.04 -0.08  0.01 -0.10 -0.09 -0.01 -0.08 -0.06 -0.03
23  0.05  0.05  0.06  0.03  0.03 -0.05 -0.05  0.07  0.13  0.14
32  0.03  0.05  0.13  0.03 -0.05 -0.01 -0.10  0.11  0.00  0.03
... (results partially omitted)
[25 rows x 10 columns]

OneClassSVM, like the whole SVM family, works better if you rescale your variables using the sklearn.preprocessing scale function or the StandardScaler class.
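As a closing cross-check of the three techniques, here is a minimal sketch that reuses the outlying, DB.labels_, and evaluation variables created above and counts how many detectors flag each case. The two-vote threshold is just an illustrative choice; cases flagged by more than one technique are the first candidates for inspection.

# Combine the flags produced by PCA, DBSCAN, and OneClassSVM
pca_flag = outlying                # extreme scores on the last PCA components
dbscan_flag = (DB.labels_ == -1)   # cases left outside any dense cluster
svm_flag = (evaluation == -1)      # cases outside the learned boundary

votes = pca_flag.astype(int) + dbscan_flag.astype(int) + svm_flag.astype(int)
print(df[votes >= 2])              # cases flagged by at least two techniques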