Machine Learning For Dummies
Packages such as caret (R) and NumPy (Python) are good, versatile tools you can use to begin your machine learning journey. Still, it’s important to have more than a few tools in your toolbox, which is where the suggestions found here come into play.

Cloudera Oryx

Cloudera Oryx is a machine learning project for Apache Hadoop that provides you with a basis for performing machine learning tasks. It emphasizes the use of live data streaming. This product helps you add security, governance, and management functionality that’s missing from Hadoop so that you can create enterprise-level applications with greater ease.

The functionality provided by Oryx builds on Apache Kafka and Apache Spark. Common tasks for this product are real-time spam filters and recommendation engines.


CUDA-Convnet

GPUs enable you to perform machine learning tasks significantly faster. You can add Accelerate to Anaconda to provide basic GPU support for that environment. Caffe is a separate product that you can use to process images using Python or MATLAB.

If you need to perform serious image processing, you will almost certainly want a GPU to do it. The CUDA-Convnet library provides specific support for NVIDIA’s CUDA-capable GPUs, which means that it can provide faster processing at the cost of platform flexibility (you must have a CUDA-capable GPU in your system). For the most part, this library sees use in neural-network applications.


ConvNetJS

As described for CUDA-Convnet, being able to recognize objects in images is an important machine learning task, but getting the job done without a good library can prove difficult or impossible. While CUDA-Convnet provides support for heavy-duty desktop applications, ConvNetJS provides image-processing support for JavaScript applications. The important feature of this library is that it works asynchronously.

When you make a call, the application continues to work. An asynchronous response lets the application know when tasks, such as training, complete so that the user doesn’t feel as if the browser has frozen (become unresponsive in some way). Given that these tasks can take a long time to complete, the asynchronous call support is essential.
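The non-blocking pattern described above can be sketched in a few lines. ConvNetJS itself is a JavaScript library, so this sketch uses Python’s asyncio purely to illustrate the idea and calls no ConvNetJS API. The names `train` and `ui_loop` and the `"model-ready"` result are hypothetical stand-ins.

```python
import asyncio


async def train(epochs, log):
    """Hypothetical stand-in for a long training job that yields between epochs."""
    for epoch in range(epochs):
        await asyncio.sleep(0)  # hand control back to the event loop
        log.append(f"trained epoch {epoch}")
    return "model-ready"


async def ui_loop(ticks, log):
    """Hypothetical stand-in for the application staying responsive."""
    for tick in range(ticks):
        log.append(f"ui tick {tick}")
        await asyncio.sleep(0)


async def main():
    log = []
    training = asyncio.create_task(train(3, log))
    # The callback plays the role of the asynchronous "training finished" response.
    training.add_done_callback(lambda task: log.append(f"done: {task.result()}"))
    await ui_loop(3, log)   # the application keeps working in the meantime
    await training          # wait for training if it is still running
    await asyncio.sleep(0)  # give the completion callback a chance to run
    return log


event_log = asyncio.run(main())
```

The log ends up containing interleaved "ui tick" and "trained epoch" entries followed by the completion notice, which is the behavior the text describes: the caller is never blocked while training runs.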


e1071

The R library e1071, developed by the TU Wien E1071 group on probability theory, provides support for support vector machines (SVMs). Behind its R command interface runs LIBSVM, an external C++ library (with a C API to interface with other languages) developed at the National Taiwan University. The LIBSVM project offers more on SVM classification and regression, together with plenty of datasets, tutorials, and even a practical guide for getting more from SVMs.

In addition, you get support functions for latent class analysis, short-time Fourier transform, fuzzy clustering, shortest-path computation, bagged clustering, and Naïve Bayes classifiers.


gbm

The gradient boosting machines (GBM) algorithm uses gradient descent optimization to determine the right weights for learning in the ensemble. The resulting performance increase is impressive, making GBM one of the most powerful predictive tools that you can learn to use in machine learning. The gbm package adds GBM support to R.

This package also includes regression methods for least squares, absolute loss, t-distribution loss, quantile regression, logistic, multinomial logistic, Poisson, Cox proportional hazards partial likelihood, AdaBoost exponential loss, Huberized hinge loss, and Learning to Rank measures (LambdaMART).

The package also provides convenient functions for cross-validation, which help you tune the number of trees, a crucial hyperparameter of the algorithm, without overfitting.
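To make the two ideas in this section concrete (boosting by fitting each new tree to the current residuals, and cross-validating the number of trees), here is a minimal pure-Python sketch. It is not the gbm package’s implementation: it assumes squared-error loss, one-split trees (stumps), a single feature, and made-up data, and every function name is hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a stump) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def staged_validation_error(train_x, train_y, valid_x, valid_y, n_trees, rate):
    """Boost on the training data; return validation MSE after each added tree."""
    base = sum(train_y) / len(train_y)
    train_pred = [base] * len(train_x)
    valid_pred = [base] * len(valid_x)
    errors = []
    for _ in range(n_trees):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(train_y, train_pred)]
        stump = fit_stump(train_x, residuals)
        train_pred = [p + rate * stump_predict(stump, x)
                      for p, x in zip(train_pred, train_x)]
        valid_pred = [p + rate * stump_predict(stump, x)
                      for p, x in zip(valid_pred, valid_x)]
        errors.append(sum((y - p) ** 2 for y, p in zip(valid_y, valid_pred))
                      / len(valid_y))
    return errors


def cross_validate_n_trees(xs, ys, k=4, n_trees=60, rate=0.1):
    """Average the staged validation error over k folds; pick the best tree count."""
    n = len(xs)
    mean_error = [0.0] * n_trees
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        valid_x = [xs[i] for i in sorted(held_out)]
        valid_y = [ys[i] for i in sorted(held_out)]
        errors = staged_validation_error(train_x, train_y, valid_x, valid_y,
                                         n_trees, rate)
        mean_error = [m + e / k for m, e in zip(mean_error, errors)]
    best_round = min(range(n_trees), key=lambda r: mean_error[r])
    return best_round + 1, mean_error


# Made-up data: a step function plus a small deterministic wiggle.
xs = list(range(16))
ys = [(1.0 if x < 8 else 3.0) + 0.1 * ((x * 7) % 5 - 2) for x in xs]
best_n_trees, cv_error = cross_validate_n_trees(xs, ys)
```

Picking the tree count that minimizes held-out error, rather than training error, is what guards against overfitting: training error keeps falling as trees are added, while validation error eventually flattens or rises.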


Gensim

Gensim is a Python library that can perform natural language processing (NLP) and unsupervised learning on textual data. It offers a wide range of algorithms to choose from: TF-IDF, random projections, latent Dirichlet allocation, latent semantic analysis, and two semantic algorithms: word2vec and doc2vec.
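As a rough illustration of the first algorithm in that list, the sketch below computes one common TF-IDF variant by hand (term frequency times the natural log of inverse document frequency). It does not call Gensim, and Gensim’s own weighting and normalization may differ; the function name and the three sample documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
weights = tf_idf(docs)
```

A word such as "the", which appears in every document, receives a weight of zero, while a word found in only one document ("bird") scores highest. That is the point of the scheme: it downweights terms that do not help tell documents apart.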

Word2vec is based on neural networks (shallow networks, not deep learning networks) and it transforms words into vectors of coordinates that you can manipulate in a semantically meaningful way. For instance, taking the vector representing Paris, subtracting the vector France, and then adding the vector Italy results in a vector close to Rome, demonstrating how you can use mathematics and the right Word2vec model to perform semantic operations on text.
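The arithmetic behind that example can be shown with toy numbers. The vectors below are hand-made for illustration and are not the output of any trained model; real Word2vec vectors have hundreds of dimensions, and the result is only approximately equal to the target word. With a trained Gensim model, a comparable query is typically made through the `most_similar` method of its word vectors, passing positive and negative word lists.

```python
import math

# Hypothetical hand-made vectors, NOT the output of a trained Word2vec model.
# First coordinate: "is a capital city"; second coordinate: which country.
toy_vectors = {
    "France": (0.0, 1.0), "Paris": (1.0, 1.0),
    "Italy": (0.0, 2.0), "Rome": (1.0, 2.0),
    "Germany": (0.0, 3.0), "Berlin": (1.0, 3.0),
}


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(start, minus, plus, vectors):
    """Return the word closest to vectors[start] - vectors[minus] + vectors[plus]."""
    target = tuple(s - m + p for s, m, p in
                   zip(vectors[start], vectors[minus], vectors[plus]))
    # Exclude the query words themselves, as analogy queries usually do.
    candidates = (w for w in vectors if w not in (start, minus, plus))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("Paris", "France", "Italy", toy_vectors)
```

Here `result` is "Rome": subtracting France removes the country component of Paris, and adding Italy puts a different country back while keeping the "capital" component.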


glmnet

Regularization is an effective, fast, and easy solution to use when you have many features and want to reduce the variance of the estimates due to multicollinearity between your predictors. One form of regularization is Lasso, which is one of the forms of support you get from glmnet (with the other being elastic-net). This package fits linear, logistic and multinomial, Poisson, and Cox regression models.
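To show what the Lasso penalty does to coefficients, the following sketch minimizes (1/(2n))·‖y − Xβ‖² + λ·‖β‖₁ by cyclic coordinate descent with soft-thresholding. It is a simplified illustration rather than glmnet’s code: it assumes no intercept, no feature standardization, a fixed number of sweeps, and made-up data, and the function names are hypothetical.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1, one coefficient at a time."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / scale
    return beta


# Made-up data: y = 2 * x1 + 0.1 * x2, so the second feature matters very little.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y = [2.1, 3.9, 6.1, 7.9]

unpenalized = lasso_coordinate_descent(X, y, lam=0.0)
penalized = lasso_coordinate_descent(X, y, lam=0.5)
```

With `lam=0.0` the procedure recovers coefficients near 2 and 0.1. With `lam=0.5` the weak second coefficient is set exactly to zero and the first shrinks slightly, which is why Lasso is often used for feature selection as well as variance reduction.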

You can also use glmnet to perform prediction, plotting, and K-fold cross-validation. Professor Rob Tibshirani, the creator of L1 (also known as Lasso) regularization, also helped develop this package. Returning briefly to Gensim, described earlier: it provides multiprocessing and out-of-core capabilities, allowing you to speed up the processing of algorithms and handle textual data larger than the available RAM.


randomForest

You can improve a decision tree by replicating it many times and averaging results to get a more general solution. The R open source package for performing this task is randomForest. You can use it to perform classification and regression tasks based on a forest of trees using random inputs. The Python version of this package appears as RandomForestClassifier and RandomForestRegressor, both of which are found in Scikit-learn.
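The "replicate and average" idea can be sketched with bagging (bootstrap aggregation) of one-split trees. This is a deliberately reduced illustration, not the randomForest or Scikit-learn implementation: real random forests grow deeper trees and also sample a random subset of features at each split, which is omitted here because the made-up data has a single feature. All function names are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree by least squares."""
    thresholds = sorted(set(xs))[:-1]
    if not thresholds:  # every x in the sample is identical: predict the mean
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    best = None
    for threshold in thresholds:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return forest


def forest_predict(forest, x):
    """Average the predictions of all trees."""
    return sum(stump_predict(stump, x) for stump in forest) / len(forest)


# Made-up data: low values map to 0, high values map to 10.
xs = list(range(10))
ys = [0.0 if x < 5 else 10.0 for x in xs]
forest = fit_bagged_stumps(xs, ys)
low, high = forest_predict(forest, 0), forest_predict(forest, 9)
```

Because each tree sees a slightly different resampling of the data, individual trees place their split at slightly different points; averaging smooths those differences out, which is the source of the "more general solution" mentioned above.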


SciPy

The SciPy stack contains a host of other libraries that you can also download separately. These libraries provide support for mathematics, science, and engineering. When you obtain SciPy, you get a set of libraries designed to work together to create applications of various sorts. These libraries are:
  • NumPy
  • SciPy
  • matplotlib
  • IPython
  • SymPy
  • pandas
The SciPy library itself focuses on numerical routines, such as routines for numerical integration and optimization. SciPy is a general-purpose library that provides functionality for multiple problem domains. It also provides support for domain-specific libraries, such as Scikit-learn, Scikit-image, and statsmodels. The SciPy site contains many lectures and tutorials on SciPy’s functions.
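For a sense of what a numerical integration routine computes, here is a small hand-written composite Simpson’s rule. It does not use SciPy; in practice you would more likely call a SciPy routine such as `scipy.integrate.quad`, which is adaptive and also returns an error estimate. The function name `simpson` and the choice of integrand are just for illustration.

```python
import math


def simpson(f, a, b, n=100):
    """Approximate the integral of f over [a, b] with n (even) subintervals."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate between weight 4 (odd) and weight 2 (even).
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


# The exact integral of sin(x) from 0 to pi is 2.
area = simpson(math.sin, 0.0, math.pi)
```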


XGBoost

Other types of gradient boosting machines exist that are based on a slightly different set of optimization approaches and cost functions. The XGBoost package enables you to apply GBM to any problem, thanks to its wide choice of objective functions and evaluation metrics. It operates with a variety of languages, including Python, R, Java, and C++.

Although GBM is a sequential algorithm (and thus slower than others that can take advantage of modern multicore computers), XGBoost uses multithreaded processing to search in parallel for the best splits among the features. Multithreading helps XGBoost deliver very strong performance when compared to other GBM implementations, both in R and Python. Because of all that it contains, the full package name is eXtreme Gradient Boosting (or XGBoost for short).
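The structure of a per-feature parallel split search can be sketched as follows. This is only an outline of the idea and not XGBoost’s implementation: it assumes squared-error loss on made-up data, the function names are hypothetical, and Python threads will not actually speed up this CPU-bound loop, so the code shows how the work is divided rather than the performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_squared_error(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_for_feature(rows, targets, feature):
    """Scan one feature; return (error, feature, threshold) for its best split."""
    column = [row[feature] for row in rows]
    best = (float("inf"), feature, None)
    for threshold in sorted(set(column))[:-1]:
        left = [t for v, t in zip(column, targets) if v <= threshold]
        right = [t for v, t in zip(column, targets) if v > threshold]
        error = sum_squared_error(left) + sum_squared_error(right)
        if error < best[0]:
            best = (error, feature, threshold)
    return best


def best_split(rows, targets, n_threads=2):
    """Evaluate every feature in its own task, then keep the overall best split."""
    n_features = len(rows[0])
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(
            lambda feature: best_split_for_feature(rows, targets, feature),
            range(n_features),
        ))
    return min(results, key=lambda result: result[0])


# Made-up data: the target depends only on the second feature (index 1).
rows = [(5, 1), (3, 2), (8, 3), (1, 4), (7, 5), (2, 6)]
targets = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
error, feature, threshold = best_split(rows, targets)
```

The trees themselves are still built one after another, as boosting requires; the parallelism sits inside the construction of each tree, where candidate splits on different features can be scored independently.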

About This Article

This article is from the book: Machine Learning For Dummies

About the book authors:

John Mueller has produced 114 books and more than 600 articles on topics ranging from functional programming techniques to working with Amazon Web Services (AWS). Luca Massaron, a Google Developer Expert (GDE), interprets big data and transforms it into smart data through simple and effective data mining and machine learning techniques.
