Basics of Data Types and Structures in R Programming for Predictive Analytics - dummies

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

In R programming for predictive analytics, data types are sometimes confused with data structures. Each variable in program memory has a data type. A program with only a handful of variables stays manageable, but that approach breaks down when you have hundreds (or thousands) of them, because you have to give every variable a name so you can access it.

It’s more efficient to store all those variables in a logical collection.

Data types

Like other full-fledged programming languages, R offers many data types and data structures. You don’t need to declare the type of a variable; the interpreter infers it from the value you assign. You can still convert a value to another type when the need arises; this is called casting (R’s own documentation calls it coercion). Three basic data types are as follows:

  • Numerical: These are your typical decimal numbers. Other languages call them floats (short for floating-point numbers) or doubles.

  • Characters: These are your strings formed with combinations of letters, digits, and other symbols. They are not meant to have any numerical meaning. These are called strings in other languages.

  • Logical: TRUE or FALSE. Always capitalize these values in R. These values are called Booleans in other languages.
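Casting can be sketched with base R’s as.*() conversion functions. This is a minimal illustration, not an exhaustive list; the variable names and values are made up for the example.

```r
# Explicit conversion ("casting") with base R's as.*() functions.
x <- "3.14"                # character
y <- as.numeric(x)         # character -> numeric
z <- as.character(42)      # numeric   -> character
b <- as.logical("TRUE")    # character -> logical

class(x)   # "character"
class(y)   # "numeric"
class(z)   # "character"
class(b)   # "logical"

# A string that does not look like a number cannot be converted:
# the result is NA (R also emits a warning, suppressed here).
bad <- suppressWarnings(as.numeric("ten"))
is.na(bad) # TRUE
```

The class() function reports the type of a value, which makes it a quick way to check that a conversion did what you expected.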

Comparing a string of digits to a number makes the interpreter convert the number into a character string and then do a string comparison. The result is TRUE only when the string matches the text that R produces for the number.

Examples of data types are as follows:

> i <- 10       # numeric 
> j <- 10.0     # numeric
> k <- "10"     # character
> m <- i == j   # logical
> n <- i == k   # logical

After you execute those lines of code, you can find out their values and types by using the str() function. That operation looks like this:

> str(i)
 num 10
> str(j)
 num 10
> str(k)
 chr "10"
> str(m)
 logi TRUE
> str(n)
 logi TRUE

The expression in the n assignment is an example of the interpreter temporarily converting a data type to do the evaluation between numeric i and character k: it converts i to the character string "10" and then compares that string with k.
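When a number is compared with a string, R’s comparison operators turn the number into text and compare the two strings. The sketch below reuses i and k from the example above and adds one hypothetical value, "10.0", to show where this can surprise you.

```r
i <- 10
k <- "10"

# The number is converted to a string, so this compares "10" with "10".
i == k                    # TRUE
as.character(i)           # "10"

# Same numeric value, different text: the string comparison fails.
i == "10.0"               # FALSE

# Converting the string explicitly gives a true numeric comparison.
i == as.numeric("10.0")   # TRUE
```

If a column of numbers arrives as text, converting it with as.numeric() before comparing is the safer habit.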

Data structures

R needs a place to store groups of values so that it can work with them efficiently. These places are called data structures.

A real-life example of this concept is a parking garage: It’s a structure that stores automobiles efficiently. It’s designed to park as many automobiles as possible, and allows for automobiles to efficiently enter and exit the structure. Also, no other objects besides automobiles should be parked in a parking structure.

Data structures include:

  • Vectors: A vector stores a set of values of a single data type. Think of it as a weekly pillbox: each compartment can hold only one kind of object. After you put pills in one compartment, every other compartment must either hold pills too or stay empty.

    You can’t put coins in that same box; you have to use a different “pill box” (vector) for that. Likewise, once you store a number in a vector, all future values should also be numbers. If you add a character value, the interpreter converts all your numbers to characters.

  • Matrices: A matrix looks like an Excel spreadsheet: Essentially it’s a table consisting of rows and columns. The data fills the cells in row order or column order, whichever you specify when you create the matrix.

    All cells of a matrix must have the same data type.

  • Data frames: A data frame is similar to a matrix, except a data frame’s columns can contain different data types. The datasets used in predictive modeling are loaded into data frames and stored there for use in the model.

  • Factors: A factor is like a vector with a limited number of distinct values. The distinct values are referred to as its levels. You can use factors to treat a column that has a limited and known number of values as categorical values. In R versions before 4.0.0, character data was loaded into data frames as factors by default; from 4.0.0 on, it stays as character data unless you ask for the conversion (for example, with stringsAsFactors = TRUE).
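The four structures above can be sketched in a few lines of base R. The names and values here (mixed, vehicles, size, and so on) are hypothetical and chosen only for illustration.

```r
# Vector: every element has the same type.
v <- c(1.5, 2, 3)
class(v)                       # "numeric"

# Mixing in a string converts the whole vector to character.
mixed <- c(1.5, 2, "three")
class(mixed)                   # "character"

# Matrix: values fill column by column unless you set byrow = TRUE.
by_col <- matrix(1:6, nrow = 2)
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_col[1, 2]                   # 3
by_row[1, 2]                   # 2

# Data frame: each column can hold a different type.
vehicles <- data.frame(
  model     = c("A", "B", "C"),
  mpg       = c(30.1, 22.4, 27.8),
  automatic = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE     # keep model as character in any R version
)
str(vehicles)

# Factor: a vector restricted to a known set of levels.
size <- factor(c("small", "large", "small", "medium"))
levels(size)                   # "large" "medium" "small"
nlevels(size)                  # 3
```

Setting stringsAsFactors explicitly, as in the data.frame() call, makes the script behave the same way regardless of which R version runs it.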

You access vectors, matrices, and data frames by using array notation. For example, you would type v[5] to access the fifth element of vector v (R counts positions starting at 1). For a two-dimensional matrix or data frame, you put the row number and column number, separated by a comma, inside the square brackets. For example, you type m[2,3] to access the second row, third column value for matrix m.
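Here is a short sketch of that notation. The contents of v, m, and the data frame scores are made up for the example; leaving one side of the comma empty, shown in the last matrix lines, is a standard R shortcut for selecting a whole row or column.

```r
v <- c(10, 20, 30, 40, 50)
v[5]                          # 50: the fifth element

m <- matrix(1:12, nrow = 3)   # 3 rows, 4 columns, filled by column
m[2, 3]                       # 8: second row, third column
m[2, ]                        # 2 5 8 11: the whole second row
m[, 3]                        # 7 8 9: the whole third column

scores <- data.frame(id = 1:3, score = c(0.9, 0.4, 0.7))
scores[2, 2]                  # 0.4: second row, second column
scores$score                  # a column can also be selected by name
```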

Data structures are an advanced subject in computer science. For now, we’re sticking to the practical. Just remember that data structures were built to store specific types of data and they have functions for data insertion, deletion, and retrieval.
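As a rough illustration of insertion, deletion, and retrieval, the sketch below uses only base R functions on a vector and a data frame. The data and the column name passed are hypothetical.

```r
# Vector
v <- c(10, 20, 30)
v <- append(v, 25, after = 2)   # insert 25 after position 2: 10 20 25 30
v <- v[-1]                      # delete the first element:   20 25 30
v[2]                            # retrieve the second element: 25

# Data frame
df <- data.frame(id = 1:2, score = c(0.9, 0.4))
df <- rbind(df, data.frame(id = 3L, score = 0.7))  # insert a row
df$passed <- df$score > 0.5                         # insert a column
df <- df[df$id != 2, ]                              # delete the row with id 2
df$score                                            # retrieve a column: 0.9 0.7
```

Each of these operations returns a modified copy, which is why every result is assigned back to the same name.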