Data Science: How to Use Python to Manage Data from Relational Databases

By John Paul Mueller, Luca Massaron

Python is great for managing data. Data scientists know that databases come in all sorts of forms. For example, AskSam is a kind of free-form textual database. However, the vast majority of data used by organizations rely on relational databases because these databases provide the means for organizing massive amounts of complex data in an organized manner that makes the data easy to manipulate.

The goal of a database manager is to make data easy to manipulate. The focus of most data storage is to make data easy to retrieve.

Relational databases accomplish both the manipulation and data retrieval objectives with relative ease. However, because data storage needs come in all shapes and sizes for a wide range of computing platforms, there are many different relational database products. In fact, for the data scientist, the proliferation of different Database Management Systems (DBMSs) using various data layouts is one of the main problems you encounter with creating a comprehensive dataset for analysis.

The one common denominator between many relational databases is that they all rely on a form of the same language to perform data manipulation, which does make the data scientist’s job easier. The Structured Query Language (SQL) lets you perform all sorts of management tasks in a relational database, retrieve data as needed, and even shape it in a particular way so that the need to perform additional shaping is unnecessary.

Creating a connection to a database can be a complex undertaking. For one thing, you need to know how to connect to that particular database. However, you can divide the process into smaller pieces. The first step is to gain access to the database engine. You use two lines of code similar to the following code (but the code presented here is not meant to execute and perform a task):

from sqlalchemy import create_engine
engine = create_engine(‘sqlite:///:memory:’)

After you have access to an engine, you can use the engine to perform tasks specific to that DBMS. The output of a read method is always a DataFrame object that contains the requested data. To write data, you must create a DataFrame object or use an existing DataFrame object. You normally use these methods to perform most tasks:

  • read_sql_table(): Reads data from a SQL table to a DataFrame object

  • read_sql_query(): Reads data from a database using a SQL query to a DataFrame object

  • read_sql(): Reads data from either a SQL table or query to a DataFrame object

  • DataFrame.to_sql(): Writes the content of a DataFrame object to the specified tables in the database

The sqlalchemy library provides support for a broad range of SQL databases. The following list contains just a few of them:

  • SQLite

  • MySQL

  • PostgreSQL

  • SQL Server

  • Other relational databases, such as those you can connect to using Open Database Connectivity (ODBC)

Check here to discover more about working with databases.