How to Record Numerical Data for Biostatistics - dummies

By John Pezzullo

For numerical data, the main question is how much precision to record. Recording a numerical variable to as many decimal places as you have available is usually best.

For example, if a scale can measure body weight to the nearest 1/10 of a kilogram, record it in the database to that degree of precision. You can always round it off to the nearest kilogram later if you want, but you can never “unround” a number to recover digits you didn’t record in the first place.

But don’t go overboard in this direction — don’t record a person’s body mass index (BMI) as 28.648832 kilograms/square meter, even if your calculator produced the result to such ridiculous precision.

Along the same lines, don’t group numerical data into intervals when recording it. If you know a person’s age in years, then record it as the actual number of years; don’t record it in 10-year intervals (0 to 9, 10 to 19, and so on). You can always have the computer do that kind of interval grouping later, but you can never recover the age in years if all you recorded was the decade.
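As a rough illustration of why the detailed values are worth keeping, the sketch below derives rounded weights and 10-year age groups from exact recorded values. The sample numbers and the names `weights_kg`, `ages`, and `decade_label` are made up for this example.

```python
# Hypothetical values recorded at full precision (made up for illustration).
weights_kg = [72.4, 65.7, 80.1]   # recorded to the nearest 1/10 kg
ages = [34, 47, 9, 61, 20]        # recorded as exact years

# Rounding to the nearest kilogram can always be done later.
rounded_weights = [round(w) for w in weights_kg]

def decade_label(age):
    """Return a label such as '30 to 39' for the 10-year interval containing age."""
    low = (age // 10) * 10
    return f"{low} to {low + 9}"

# Grouping into 10-year intervals can also be done later.
age_groups = [decade_label(a) for a in ages]

print(rounded_weights)
print(age_groups)
```

Both derived lists come from the detailed data. Going the other way, from "30 to 39" back to 34, is not possible.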

Some programs let you choose between several ways of internally representing the number in the computer. The program may refer to these different storage modes using arcane terms like short, long, or very long integers (whole numbers) or single-precision (short) or double-precision (long) floating-point (fractional) numbers. Each type has its own limits, which may vary from one program to another or from one kind of computer to another.

For example, a short integer might be able to represent only whole numbers within the range from –32,768 to +32,767, whereas a double-precision floating-point number could easily handle a number like 1.23456789012345 × 10^250.
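The limits described above can be checked with a short sketch like the following. It uses Python's standard `struct` module to imitate a 16-bit short integer and a single-precision number. This is only an illustration, since your statistics program may store numbers differently.

```python
import struct
import sys

# A 16-bit signed ("short") integer holds -32,768 through +32,767.
struct.pack("<h", 32767)          # the upper limit fits
try:
    struct.pack("<h", 32768)      # one past the upper limit does not
    short_overflowed = False
except struct.error:
    short_overflowed = True

# Python's own float type is double precision on typical platforms.
big = 1.23456789012345e250
largest_double = sys.float_info.max    # roughly 1.8 x 10^308
reliable_digits = sys.float_info.dig   # decimal digits preserved reliably

# The same value is too large for a single-precision number.
try:
    struct.pack("<f", big)
    single_overflowed = False
except (OverflowError, struct.error):
    single_overflowed = True

print(short_overflowed, single_overflowed, reliable_digits)
```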

In the old days, the judicious choice of storage modes for your variables could produce smaller files and let the program work with more subjects or more variables. Nowadays, storage is much less of an issue than it used to be, so pinching pennies this way offers little benefit.

Go for the most general numeric representation available — usually double-precision floating point, which can represent just about any number you may ever encounter in your research.

Here are a couple of things to watch out for when entering numerical data into Excel:

  • Don’t put two numbers (such as a blood pressure reading of 135/85 mmHg) into one column of data. Excel won’t complain about it, but it will treat it as text because of the embedded “/”, rather than as numerical data. Instead, create two separate variables — such as the systolic and diastolic pressures (perhaps called BPS for blood pressure systolic and BPD for blood pressure diastolic) — and enter each number into the appropriate variable.

  • In an obstetrical database, don’t enter 6w2d for a gestational age of 6 weeks and 2 days; even worse, don’t enter it as 6.2, which the computer would interpret as 6.2 weeks. Either enter it as 44 days, or create two variables (perhaps GAwks for gestational age weeks and GAdays for gestational age days), to hold the values 6 and 2, respectively.

    The computer can easily combine them later into the number of days or the number of weeks (and fractions of a week).
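A minimal sketch of both points follows, using the variable names suggested above in lowercase (`bps`, `bpd`, `ga_wks`, `ga_days`). The text value `"135/85"` stands in for a cell that was entered the wrong way and has to be split afterward.

```python
# A combined blood pressure entry is text, not a number, and must be split.
bp_text = "135/85"
bps, bpd = (int(part) for part in bp_text.split("/"))   # systolic, diastolic

# Gestational age stored as two separate variables: 6 weeks and 2 days.
ga_wks, ga_days = 6, 2

# Combining them later is simple arithmetic.
ga_total_days = ga_wks * 7 + ga_days     # 44 days
ga_total_weeks = ga_total_days / 7       # about 6.29 weeks, not 6.2

print(bps, bpd, ga_total_days, round(ga_total_weeks, 2))
```

The last line shows why entering 6.2 would be wrong: 6 weeks and 2 days is about 6.29 weeks.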

There’s one important exception to this “don’t cram two things into one column” rule: if you are recording both the date and time of a single event (like “born on February 15, 2006, at 8:56 in the evening”), record them together as a single date-and-time variable. See the article on Entering Date and Time Data for more details.
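One reason a single date-and-time value is convenient is that intervals between events fall out of a simple subtraction. The sketch below uses Python's standard `datetime` type. The second event, `first_exam`, is a hypothetical value added only to show the calculation.

```python
from datetime import datetime

# Born on February 15, 2006, at 8:56 in the evening, stored as one value.
born = datetime(2006, 2, 15, 20, 56)

# Hypothetical later event, made up for illustration.
first_exam = datetime(2006, 2, 16, 6, 26)

# Subtracting the two values gives the elapsed time directly.
elapsed = first_exam - born
elapsed_hours = elapsed.total_seconds() / 3600

print(born.isoformat(), elapsed_hours)
```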

Missing numerical data requires a little more thought than missing categorical data. Some researchers use 99 (or 999, or 9999) to indicate a missing value. If you use that technique, you have to make sure that all your analyses ignore those values. Fortunately, many statistics programs let you specify what the missing value indicator is for each variable, and the programs exclude those values from all analyses.

But can you really be sure you’ll never have that value pop up as a real value for some very atypical subject? (Some people are 99 years old, and some people can have a blood glucose value of 999 mg/dL.) Simply leaving the cell blank may be best; almost all programs treat blank cells as missing data and ignore them in the calculations.
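The risk can be seen in a small sketch. It assumes four hypothetical subjects, one of whom really is 99 years old and one whose age is unknown. Python's `None` plays the role of a blank cell here.

```python
from statistics import mean

# Missing age coded as 99: the real 99-year-old looks the same as the missing value.
ages_coded = [34, 99, 99, 61]
kept_coded = [a for a in ages_coded if a != 99]   # drops BOTH 99s
mean_coded = mean(kept_coded)

# Missing age left blank (None): only the truly missing value is excluded.
ages_blank = [34, 99, None, 61]
kept_blank = [a for a in ages_blank if a is not None]
mean_blank = mean(kept_blank)

print(len(kept_coded), mean_coded)
print(len(kept_blank), round(mean_blank, 1))
```

With the 99 code, the analysis silently loses a real subject and the mean age drops from about 64.7 to 47.5.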