Big Data For Dummies

Big data is becoming an important element in the way organizations leverage high-volume data at the right speed to solve specific data problems. Relational database management systems (RDBMSs) remain important in this picture because big data does not live in isolation: to be effective, companies often need to combine the results of big data analysis with the data that already exists within the business.

Big data basics: RDBMS and persistent data

One of the most important services provided by operational databases (also called data stores) is persistence. Persistence guarantees that the data stored in a database won't be changed without permission and that it will be available for as long as it is important to the business. What good is a database if it cannot be trusted to protect the data you put in it?
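
The availability half of that guarantee can be sketched in a few lines. This is only an illustration, assuming SQLite (through Python's built-in sqlite3 module) as a small stand-in for a full RDBMS; the orders table and the write_then_reopen helper are hypothetical names, not something from the book.

```python
import os
import sqlite3
import tempfile

def write_then_reopen(db_path):
    """Store one row, close the connection, then read it back from a new one."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, item TEXT)")
    conn.execute("INSERT INTO orders (item) VALUES (?)", ("widget",))
    conn.commit()  # committing is what makes the change durable
    conn.close()

    # A brand-new connection still sees the committed row.
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT id, item FROM orders").fetchall()
    conn.close()
    return rows

# Use a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    persisted_rows = write_then_reopen(os.path.join(tmp, "demo.db"))
```

The row outlives the connection that wrote it because it was committed to storage, which is the behavior the business depends on. Permission control, the other half of the guarantee, is handled by the database's own access rules and is not shown here.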

Given this most important requirement, you must then think about what kind of data you want to persist, how you can access and update it, and how you can use it to make business decisions. At this most fundamental level, the choice of your database engines is critical to your overall success with your big data implementation.

The underlying technology has been around for quite some time, yet many RDBMSs remain in operation today because the businesses they support are highly dependent on the data. To replace them would be akin to changing the engines of an airplane on a transoceanic flight.

Big data basics: RDBMS and tables

Relational databases are built on one or more relations, which are represented by tables. These tables are defined by their columns, and the data is stored in the rows. The primary key, which uniquely identifies each row, is often the first column in the table. The consistency of the database and much of its value are achieved by "normalizing" the data. Normalized data has been converted from its native format into a shared, agreed-upon format.
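
To make the table, column, row, and primary key vocabulary concrete, here is a minimal sketch. It assumes SQLite through Python's built-in sqlite3 module as a stand-in for any RDBMS, and the customers table with its column names is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The columns define the table; customer_id is the primary key.
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        telephone   TEXT
    )
""")

# Each tuple becomes one row.
conn.executemany(
    "INSERT INTO customers (customer_id, name, telephone) VALUES (?, ?, ?)",
    [(1, "Ada", "555-010-0001"), (2, "Grace", "555-010-0002")],
)

# The primary key must be unique, so reusing the value 1 is rejected.
try:
    conn.execute("INSERT INTO customers (customer_id, name) VALUES (1, 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

customer_names = [
    row[0] for row in conn.execute("SELECT name FROM customers ORDER BY customer_id")
]
```

The rejected insert shows why the primary key matters: the database itself, not the application, keeps each row uniquely identifiable.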

For example, in one database you might have "telephone" as XXX-XXX-XXXX, while in another it might be XXXXXXXXXX. To achieve a consistent view of the information, the field needs to be normalized to a single form. Five levels of normalization, known as normal forms (first through fifth), are commonly described. The choice of normal form is often left to the database designer. The collection of tables, keys, elements, and so on is known as the database schema.
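
The telephone example can be sketched as a small function. This is one possible approach, assuming 10-digit numbers and assuming XXX-XXX-XXXX as the agreed-upon target format; normalize_phone is a hypothetical helper name.

```python
import re

def normalize_phone(raw):
    """Return a 10-digit number as XXX-XXX-XXXX, or None if it is not 10 digits."""
    digits = re.sub(r"\D", "", raw)  # drop dashes, spaces, and parentheses
    if len(digits) != 10:
        return None  # leave unexpected values for a person to review
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

# The same number as it might arrive from three different source systems.
normalized = [
    normalize_phone(p) for p in ("555-010-0001", "5550100001", "(555) 010 0001")
]
```

All three inputs come out in the same form, which is what lets records from different databases be compared or joined reliably.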

Over the years, Structured Query Language (SQL) has evolved in lockstep with RDBMS technology and is the most widely used mechanism for creating, querying, maintaining, and operating relational databases.

In companies both small and large, most important operational information is probably stored in RDBMSs. Many companies have different RDBMSs for different areas of their business. Transactional data might be stored in one vendor's database, while customer information could be stored in another.

You are not likely to use RDBMSs for the core of your big data implementation, but you will need to rely on the data stored in RDBMSs to create the highest level of value to the business with big data.
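
Combining the two worlds might look something like the sketch below. Everything in it is assumed for illustration: analysis_results stands in for the output of a big data job, the customers table stands in for operational data, and SQLite (via Python's sqlite3 module) stands in for the RDBMS.

```python
import sqlite3

# Hypothetical output of a big data job: customer_id -> number of page views.
analysis_results = {1: 1200, 2: 85, 3: 430}

# Operational data as it might live in an RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "East"), (2, "Grace", "West")],
)

def enrich(conn, results):
    """Attach customer details to each analysis result that has a matching record."""
    enriched = []
    query = "SELECT customer_id, name, region FROM customers ORDER BY customer_id"
    for customer_id, name, region in conn.execute(query):
        if customer_id in results:
            enriched.append(
                {"name": name, "region": region, "page_views": results[customer_id]}
            )
    return enriched

combined = enrich(conn, analysis_results)
```

Customer 3 appears in the analysis but has no operational record, so it is left out. The page-view counts only become actionable once they are tied to a name and a region held in the relational database.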

PostgreSQL, an open source relational database

During your big data implementation, you'll likely come across PostgreSQL, a widely used, open source relational database. Several factors contribute to the popularity of PostgreSQL. As an RDBMS with support for the SQL standard, it does all the things expected of a database product, and its longevity and wide usage have made it "battle tested." It is also available on just about every variety of operating system, from PCs to mainframes.

Providing the basics and doing so reliably are only part of the story. PostgreSQL also supports many features more often associated with expensive proprietary RDBMSs, including the following:

  • Capability to directly handle "objects" within the relational schema

  • Foreign keys (referencing keys from one table in another)

  • Triggers (events used to automatically start a stored procedure)

  • Complex queries (subqueries and joins across discrete tables)

  • Transactional integrity

  • Multiversion concurrency control
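
Several items in this list can be seen working together in a short sketch. Note the assumptions: the code runs against SQLite through Python's sqlite3 module rather than PostgreSQL, so the syntax is only approximately what PostgreSQL would accept (for example, a PostgreSQL trigger calls a separately defined function, while the SQLite trigger here runs its statement inline). The table names are invented, and object handling and multiversion concurrency control are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Foreign key: every order must point at an existing customer.
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
    CREATE TABLE audit_log (order_id INTEGER, note TEXT);

    -- Trigger: record every new order automatically.
    CREATE TRIGGER log_order AFTER INSERT ON orders
    BEGIN
        INSERT INTO audit_log (order_id, note) VALUES (NEW.order_id, 'order created');
    END;

    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Transactional integrity: both inserts succeed together or neither is kept.
try:
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO orders VALUES (13, 2, 5.0)")
        conn.execute("INSERT INTO orders VALUES (14, 99, 5.0)")  # no customer 99
    transaction_committed = True
except sqlite3.IntegrityError:
    transaction_committed = False

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
audit_count = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]

# Complex query: a join plus a subquery, listing customers whose total
# spending exceeds the average order amount.
big_spenders = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    HAVING SUM(o.amount) > (SELECT AVG(amount) FROM orders)
    ORDER BY c.name
""").fetchall()
```

Because the second insert violates the foreign key, the whole transaction is rolled back, including order 13 and the audit row its trigger wrote. The order and audit tables therefore still hold only the three original entries.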

The real power of PostgreSQL is its extensibility. Users and database programmers can add new capabilities without affecting the fundamental operation or reliability of the database. Possible extensions include

  • Data types

  • Operators

  • Functions

  • Indexing methods

  • Procedural languages
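
As a loose analogy for the "Functions" item, the sketch below registers a user-defined function and calls it from SQL. It is only an analogy: it uses SQLite through Python's sqlite3 module, whereas in PostgreSQL a new function would be defined with a CREATE FUNCTION statement. The digits_only helper and the contacts table are hypothetical.

```python
import sqlite3

def digits_only(value):
    """Hypothetical helper: strip everything except digits from a text value."""
    if value is None:
        return None
    return "".join(ch for ch in value if ch.isdigit())

conn = sqlite3.connect(":memory:")

# Register the Python function under a name that SQL statements can call.
conn.create_function("digits_only", 1, digits_only)

conn.execute("CREATE TABLE contacts (name TEXT, telephone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [("Ada", "555-010-0001"), ("Grace", "(555) 010 0002")],
)

cleaned = conn.execute(
    "SELECT name, digits_only(telephone) FROM contacts ORDER BY name"
).fetchall()
```

The point is that the new capability is added alongside the built-in ones and is used in queries like any other function, without changing how the rest of the database operates.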

This high level of customization makes PostgreSQL desirable when rigid, proprietary products won't get the job done, because nearly every layer of the database, from data types to procedural languages, can be extended.

Finally, the PostgreSQL license permits modification and distribution in any form, open or closed source. Any modifications can be kept private or shared with the community as you wish.

About This Article

This article is from the book: Big Data For Dummies

About the book authors:

Judith Hurwitz is an expert in cloud computing, information management, and business strategy. Alan Nugent has extensive experience in cloud-based big data solutions. Dr. Fern Halper specializes in big data and analytics. Marcia Kaufman specializes in cloud infrastructure, information management, and analytics.
