Formatting Data Properly

By Meta S. Brown

Humans use experience when they interpret the data they see, but computers can’t. Your data-mining software will do its best to identify the kind of data in each column, but data types are often ambiguous.

When you see a list of ZIP Codes, you don’t try to add and subtract them. You know that they represent places. You understand this because you have lots of experience seeing and recognizing ZIP Codes. A computer might interpret a ZIP Code as an integer or continuous measure. In the end, it’s up to you to define the proper format.
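
To make the ZIP Code problem concrete, here is a minimal Python sketch. The sample data, the column names, and the `naive_parse` helper are assumptions made up for illustration; the helper only mimics the kind of type guessing an application might do and is not how any particular product works.

```python
import csv
import io

# Hypothetical sample data; column names are invented for this example.
RAW = "zip_code,income\n02134,52000\n90210,148000\n"


def naive_parse(value):
    """Mimic automatic type guessing: anything that looks numeric becomes a number."""
    try:
        return int(value)
    except ValueError:
        return value


rows = list(csv.DictReader(io.StringIO(RAW)))

# Guessed formats: the ZIP Code is treated as an integer.
guessed = [{name: naive_parse(value) for name, value in row.items()} for row in rows]

# Declared formats: the ZIP Code stays a text label, income becomes a number.
FORMATS = {"zip_code": str, "income": int}
declared = [{name: FORMATS[name](value) for name, value in row.items()} for row in rows]

print(guessed[0]["zip_code"])   # 2134 (the leading zero is gone)
print(declared[0]["zip_code"])  # 02134
```

Guessing turns the label 02134 into the number 2134, which no longer identifies the place. Declaring the format up front keeps the ZIP Code intact.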

Functions for setting data formats and roles (such as designating the dependent variable for modeling) can be buried in a variety of places in your data-mining application. You might define the formats and roles of variables within a data file before you even open a data-mining application (the native data formats for Orange and Weka allow this), during import, or at some later point in the process.
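
Wherever these settings live, the idea is the same: each variable gets a format and a role. The sketch below records both in plain Python. `ColumnSpec`, `SCHEMA`, the column names, and the two helper functions are hypothetical and do not correspond to the interface of Orange, Weka, or any other application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    kind: str  # "categorical", "continuous", or "text"
    role: str  # "input", "target", or "ignore"


# Hypothetical schema: one entry per column in the data set.
SCHEMA = {
    "zip_code": ColumnSpec("categorical", "input"),
    "income": ColumnSpec("continuous", "input"),
    "churned": ColumnSpec("categorical", "target"),  # the dependent variable
    "customer_id": ColumnSpec("text", "ignore"),
}


def target_column(schema):
    """Return the name of the single column marked as the target."""
    targets = [name for name, spec in schema.items() if spec.role == "target"]
    if len(targets) != 1:
        raise ValueError(f"expected exactly one target column, found {len(targets)}")
    return targets[0]


def input_columns(schema):
    """Return the names of the columns marked as model inputs."""
    return [name for name, spec in schema.items() if spec.role == "input"]


print(target_column(SCHEMA))
print(input_columns(SCHEMA))
```

Keeping format and role together in one place makes it easy to check, before modeling, that exactly one dependent variable has been designated.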

image0.jpg

You may have tools built for this purpose, like the tools shown in the following figures, or you may define these properties within other procedures.

image1.jpg
image2.jpg

Each data-mining application has its own set of variable types and its own limits on how each type can be used. Some of these limits are grounded in theory: for example, you can add and subtract numbers but not letters. Others are simply a matter of how the application was designed.
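
The theory-based limits can be illustrated with a small sketch: an average is meaningful for a continuous variable, while a categorical variable supports only operations such as finding the most common value. The `summarize` function and the type names "continuous" and "categorical" are assumptions chosen for this example, not the vocabulary of any specific application.

```python
import statistics


def summarize(values, kind):
    """Pick a summary that is valid for the declared variable type."""
    if kind == "continuous":
        # Arithmetic is meaningful for numeric measures.
        return statistics.mean(values)
    if kind == "categorical":
        # Categories can be counted, but not added or averaged.
        return statistics.mode(values)
    raise ValueError(f"unknown variable type: {kind}")


incomes = [52000, 148000, 61000]
zip_codes = ["02134", "90210", "02134"]

print(summarize(incomes, "continuous"))     # average income
print(summarize(zip_codes, "categorical"))  # most common ZIP Code
```

If the ZIP Codes were declared as continuous instead, the function would try to average text labels and fail, which is the kind of restriction that comes from theory rather than from software design.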

You may find, for instance, that a particular modeling tool in one application can predict both categorical and continuous variables, while a similar tool in another application can model only one or the other.
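
One way to picture these design-based limits is as a lookup of what each tool accepts as a prediction target. The tool names and capabilities in this sketch are invented for illustration and do not describe any real product.

```python
# Hypothetical capability table: which target types each modeling tool accepts.
SUPPORTED_TARGETS = {
    "tool_a": {"categorical", "continuous"},
    "tool_b": {"categorical"},
}


def can_model(tool, target_kind):
    """Return True if the named tool accepts a target of the given type."""
    return target_kind in SUPPORTED_TARGETS.get(tool, set())


print(can_model("tool_a", "continuous"))
print(can_model("tool_b", "continuous"))
```

Checking a tool's documentation for this kind of restriction before you prepare the data can save you from reformatting the dependent variable later.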