Using the Python Ecosystem for Data Science

By John Paul Mueller, Luca Massaron

Most data science work in Python relies on libraries that you load into your code. Here’s an overview of the libraries you’re most likely to use, each of which can perform multiple functions for the data scientist.

Accessing scientific tools using SciPy

The SciPy stack is a collection of libraries that you can also download separately. These libraries provide support for mathematics, science, and engineering. When you obtain the SciPy stack, you get a set of libraries designed to work together to create applications of various sorts. These libraries are:

  • NumPy

  • SciPy

  • matplotlib

  • IPython

  • SymPy

  • pandas

The SciPy library itself focuses on numerical routines, such as routines for numerical integration and optimization. SciPy is a general-purpose library that provides functionality for multiple problem domains. It also serves as a foundation for domain-specific libraries, such as Scikit-learn, Scikit-image, and statsmodels.
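
As a rough illustration of the integration and optimization routines mentioned above, the following minimal sketch uses scipy.integrate.quad and scipy.optimize.minimize_scalar. The integrand and the objective function are made up for this example; they are not taken from any particular application.

```python
import math

from scipy import integrate, optimize

# Numerical integration: integrate sin(x) over [0, pi].
# The exact answer is 2, so the estimate should be very close to it.
area, abs_error = integrate.quad(math.sin, 0.0, math.pi)


def objective(x):
    """Illustrative objective: a parabola whose minimum sits at x = 3."""
    return (x - 3.0) ** 2 + 1.0


# Optimization: find the x that minimizes the objective.
result = optimize.minimize_scalar(objective)

print(f"Integral estimate: {area:.6f}")
print(f"Minimum found near x = {result.x:.4f}")
```

The quad function returns both the estimate and an upper bound on its absolute error, which is handy when you need to judge how far to trust the result.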

Performing fundamental scientific computing using NumPy

The NumPy library provides the means for performing n-dimensional array manipulation, which is critical for data science work. Working with n-dimensional arrays in plain Python is slow and awkward; NumPy makes it practical and adds functions for linear algebra, Fourier transforms, and random-number generation.
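
To make these capabilities concrete, here is a short sketch that touches each one: an n-dimensional array, a linear algebra solve, a Fourier transform, and random-number generation. The array sizes, the equations, and the seed are arbitrary choices for illustration.

```python
import numpy as np

# An n-dimensional array: 24 values arranged as 2 x 3 x 4.
cube = np.arange(24).reshape(2, 3, 4)
collapsed = cube.sum(axis=0)  # sum over the first axis, leaving a 3 x 4 array

# Linear algebra: solve the system 2x + y = 3 and x + 3y = 5.
coefficients = np.array([[2.0, 1.0], [1.0, 3.0]])
constants = np.array([3.0, 5.0])
solution = np.linalg.solve(coefficients, constants)

# Fourier transform: a sine wave with 5 cycles across 64 samples
# should show its strongest component in frequency bin 5.
samples = np.arange(64) / 64
signal = np.sin(2 * np.pi * 5 * samples)
spectrum = np.fft.rfft(signal)
peak_bin = int(np.argmax(np.abs(spectrum)))

# Random-number generation with a fixed seed for repeatable results.
rng = np.random.default_rng(seed=0)
noise = rng.normal(size=(3, 4))

print(collapsed.shape, solution, peak_bin, noise.shape)
```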

Performing data analysis using pandas

The pandas library provides data structures, chiefly the Series and the DataFrame, along with data analysis tools that operate on them. The library is optimized to perform common tasks on labeled, tabular data quickly and efficiently. The basic principle behind pandas is to give Python data analysis and modeling support similar to what other languages, such as R, offer.
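
The following sketch shows the kind of analysis pandas is designed for: it builds a small table and summarizes it by group. The city names and temperature readings are invented sample data.

```python
import pandas as pd

# A small table of invented temperature readings.
readings = pd.DataFrame(
    {
        "city": ["Oslo", "Oslo", "Rome", "Rome"],
        "temp_c": [3.0, 5.0, 14.0, 16.0],
    }
)

# Group the rows by city and compute the mean temperature for each group.
mean_by_city = readings.groupby("city")["temp_c"].mean()

# Filter rows with a Boolean condition, much as you would subset a data frame in R.
warm_rows = readings[readings["temp_c"] > 10.0]

print(mean_by_city)
print(warm_rows)
```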

Implementing machine learning using Scikit-learn

The Scikit-learn library is one of a number of Scikit libraries that build on the capabilities provided by NumPy and SciPy to allow Python developers to perform domain-specific tasks. In this case, the library focuses on data mining and data analysis. It provides access to the following sorts of functionality:

  • Classification

  • Regression

  • Clustering

  • Dimensionality reduction

  • Model selection

  • Preprocessing
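
As one possible illustration of how several of these pieces fit together, the sketch below combines preprocessing (StandardScaler), classification (LogisticRegression), and a model selection helper (train_test_split) on the iris sample dataset that ships with the library. The choice of classifier, the split size, and the random seed are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset: 150 flowers, 4 measurements each, 3 species.
features, labels = load_iris(return_X_y=True)

# Hold out a quarter of the data to check the model on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)

# Chain preprocessing and classification so both are fit on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the classifier in a single pipeline keeps information from the test set out of the preprocessing step, which is a common source of overly optimistic results.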

Plotting the data using matplotlib

The matplotlib library provides you with a MATLAB-like interface for creating data presentations of the analysis you perform. The library focuses mainly on 2D output, although its mplot3d toolkit supports basic 3D plots, and it provides you with the means to express graphically the patterns you see in the data you analyze. Without a plotting library of this sort, it would be much harder to create output that people outside the data science community can easily understand.
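
A minimal plotting sketch might look like the following. It selects the non-interactive Agg backend so that it also runs on machines without a display, and it writes the image to an in-memory buffer; both choices, along with the sine-curve data, are assumptions for this example. In interactive use you would more likely call plt.show() or pass a file name to savefig.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no window is opened

import matplotlib.pyplot as plt
import numpy as np

# Sample data: one period of a sine curve.
x_values = np.linspace(0, 2 * np.pi, 100)
y_values = np.sin(x_values)

# Build the figure, label the axes, and add a legend.
fig, ax = plt.subplots()
ax.plot(x_values, y_values, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.set_title("A simple 2D line plot")
ax.legend()

# Render the figure as PNG data in memory instead of on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```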

Parsing HTML documents using Beautiful Soup

The Beautiful Soup library is available from the Python Package Index (PyPI) as the beautifulsoup4 package. This library provides the means for parsing HTML or XML data in a manner that Python understands. It allows you to work with tree-based data.

Besides providing a means for working with tree-based data, Beautiful Soup takes a lot of the work out of working with HTML documents. For example, it automatically converts incoming documents to Unicode, whatever encoding (the manner in which characters are stored in a document) they originally used, and it converts outgoing documents to UTF-8. A Python developer would normally need to worry about things like encoding, but with Beautiful Soup, you can focus on your code instead.
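
To show what working with the parsed tree looks like, here is a small sketch that uses Python’s built-in html.parser backend. The HTML snippet and its example.com links are invented for illustration.

```python
from bs4 import BeautifulSoup

# An invented HTML document to parse.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p class="intro">Hello, <b>world</b>!</p>
    <a href="https://example.com/a">First link</a>
    <a href="https://example.com/b">Second link</a>
  </body>
</html>
"""

# Parse the markup into a tree of tags and text nodes.
soup = BeautifulSoup(html, "html.parser")

# Navigate the tree by tag name, search it, and extract plain text.
title = soup.title.string
links = [anchor["href"] for anchor in soup.find_all("a")]
intro_text = soup.find("p", class_="intro").get_text()

print(title)
print(links)
print(intro_text)
```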