How to Collect Categorical Data in Biostatistics

Biology Essentials For Dummies

Setting up your data collection forms and database tables for categorical data requires more thought than you may expect. Everyone assumes he knows how to record and enter categorical data — you just type what that data is (for example, Male, White, Diabetes, or Headache), right? Bad assumption!

Carefully coding categories

The first issue is how to "code" the categories (how to represent them in the database). Do you want to enter Gender as Male or Female, M or F, 1 (if male) or 2 (if female), or in some other manner? Most modern statistical software can analyze categorical data with any of these representations, but some older software needs the categories coded as consecutive numbers: 1, 2, 3, and so on.

Some software lets you specify a correspondence between number and text (1=Male, 2=Female, for instance); then you can type it in either way, and you can choose to display it in either the numeric or textual form.

Nothing is worse than having to deal with a data set in which Gender has been coded as 1 or 2, with no indication of which is which, when the person who created the file is long gone. So it's probably best to enter the category values as short, meaningful text abbreviations like M or F, or Male or Female, which are self-evident and, therefore, self-documenting.

Excel doesn't care what you type in, and this characteristic is one of its biggest drawbacks when it's used as a data repository. You can enter Gender as M for the first subject, Male for the second, male for the third, 2 for the fourth, and m for the fifth, and Excel couldn't care less. But most statistics programs consider each of these to be a completely different category!

Even worse, you may inadvertently type one or more blank spaces before and/or after the text. You may never notice it, but some statistics programs consider M~ to be different from ~M, ~M~, and M~~ (where ~ indicates a blank space). In Excel, it's a good idea to enable AutoComplete for cell values (in the Advanced section of the Options dialog box, located in the File menu). Then when you start typing something in a cell, it suggests something that's already present in that column and begins with the same letter or letters that you typed. This will help you avoid having several slightly different representations for the same category.

When you're checking your data for mistakes, it's helpful if you can have the software produce a tally for each categorical variable, showing how many occurrences of each distinct category it found in the data. This will show you if any mistyped categories were entered (including ones with those pesky leading or trailing blank characters).

Dealing with more than two levels in a category

When a categorical variable has more than two levels, things get even more interesting. First, you have to ask yourself, "Is this variable a Choose only one (single-choice) or Choose all that apply (multiple-choice) variable?" The required coding in your computerized data file is completely different for these two kinds of variables. For example, suppose you are going to record the results of a microbiology lab assay of a blood specimen. And suppose, for simplicity's sake, that this lab analyzes for the presence of only five classes of bacteria: coccus, bacillus, rickettsia, mycoplasma, and spirillum. To reduce the chance of misspellings, you might decide to use short abbreviations such as: coc, bac, ric, myc, and spi.

Suppose you decide (or assume) that no blood specimen will ever have more than one kind of infection present. This would be a single-choice situation. Then the lab result could be held in a single variable (a single column in your data file), which you might call BacteriaType (or BT if you wanted to keep your variable names short). The results for each specimen (in each cell of your table) would be coded as a single value: coc, or bac, or ric, or myc, or spi. You'd also have to allow for another category (perhaps called none) if no infection at all was found.

But you might find that your assumption of at most one infection per specimen was unrealistic. A person could have two (or more) infections present simultaneously. How would you handle that in your database? If the lab reported that both coccus and bacillus infections were present, would you try to stuff both abbreviations (coc and bac) into the cell at the same time? Most statistical software would allow only one category to be specified in one cell of a categorical variable. Unfortunately, Excel would happily let you cram as much as you liked into the cell, even though this would be the wrong way to do it.

If multiple-choices are possible (even if you think they're unlikely to happen in your study), you need to set up your database differently. You should define separate variables in the database (separate columns in Excel) — one for each possible category value. In this case, because there are five possible types of bacteria, you would have to define five separate variables, perhaps called BTcoc, BTbac, BTric, BTmyc, and BTspi (where the BT stands for bacteria type). Each variable would be represented as a two-value category (perhaps with values Pres/Abs standing for present and absent, or Yes/No, or 1 or 0).

So, if Subject 101's specimen was found to contain coccus, Subject 102's specimen had bacillus and mycoplasma present, and Subject 103's specimen has no bacteria at all, the information could be coded as shown in the following table.

Subject	BTcoc	BTbac	BTric	BTmyc	BTspi
101	Yes	No	No	No	No
102	No	Yes	No	Yes	No
103	No	No	No	No	No

Regardless of whether you code a categorical variable as single-choice or multiple choice, you'll need to allow for the possibility of missing values. You can handle missing values by leaving the cell blank, but an even better way is to add a category called Missing to the regular categories of that variable. If you need several different flavors of Missing (like not collected yet, don't know, other, refused to answer, or not applicable), just add them to the set of permissible categories for that variable. The important thing is to make sure that you can always enter something for that variable.

Never try to cram multiple choices into one column — don't enter "coc,bac" into a cell in the BacType column. If you do, the resulting column will be almost impossible to analyze statistically, and you'll have to take the time later to painstakingly split your single multi-valued column into separate yes/no columns. So why not do it right the first time?

About This Article

About the book author:

John C. Pezzullo, PhD, has held faculty appointments in the departments of biomathematics and biostatistics, pharmacology, nursing, and internal medicine at Georgetown University. He is semi-retired and continues to teach biostatistics and clinical trial design online to Georgetown University students.