Biostatistics For Dummies
Book image
Explore Book Buy On Amazon

When you enter numerical data into your computer, don't combine two numbers into a single variable (such as 145/85 for systolic and diastolic blood pressure). When it comes to dates and times, however, exactly the opposite is true!

Most statistical software can represent dates and times as a single variable (an "instant" on a continuous timeline), so take advantage of that if you can — enter the date and time as one variable (for example, 07/15/2010 08:23), not as a date variable and a time variable.

This method is especially useful when dealing with events that take place over a short time interval (like events occurring during labor and delivery).

Most statistical programs store date and time internally as a number, specifying the number of days (and fractions of days) from some arbitrary "zero date." Here are the zero dates for a few common programs:

  • Excel: Midnight at the start of December 31, 1899 (this is also the earliest date that Excel can store). So November 21, 2012, at 6:00 p.m., is stored internally as 41,234.75 (the .75 is because 6 p.m. is 3/4 of the way through that day).

  • SPSS: October 14, 1582 (the date the Gregorian calendar was adopted to replace the Julian calendar).

  • SAS: 01/01/1960 (a totally arbitrary date).

Some programs may store a date and time as a Julian Date, whose zero occurred at noon, Greenwich mean time, on Jan. 1, 4713 BC. (Nothing special happened on that date; 'it was originally chosen purely a numerical convenience.)

What if you don't know the day of the month? This happens a lot with medical history items; you hear something like "I got the flu in September 2004." Most software insists that a date variable be a complete date and won't accept just a month and a year.

In this case, an argument can be made for setting the day to 15 (around mid-month), on the grounds that the error is equally likely to be on either side and therefore tends to cancel out, on average. Similarly, if both the month and day are missing, you can set them to June 30 or July 1 (around mid-year) to achieve the same kind of average error cancellation.

If only some records have partial dates, you may want to create another variable to indicate whether the date is complete or partial, so you can tell, if you need to, whether 09/15/2004 really means September 15, 2004, or just September 2004.

Completely missing dates should usually just be left blank; most statistical software treats blank cells as missing data.

Because of the way most statistics programs store dates and times, they can easily calculate intervals between any two points in time by what's called "calendar arithmetic", which you might indicate as a simple subtraction. So it's usually easier and safer to enter dates and times and let the computer calculate the intervals between them than to calculate the intervals yourself.

For example, if you create variables for date of birth (DOB) and a visit date (VisDt) in Excel, you can often have Excel calculate a very accurate age at the time of the visit with this simple formula:

Age = (VisDt DOB)/365.25

Similarly, in cancer studies, you can easily and accurately calculate intervals from diagnosis or treatment to remission and recurrence, as well as total survival time, from the dates of the corresponding events.

About This Article

This article is from the book:

About the book author:

John C. Pezzullo, PhD, has held faculty appointments in the departments of biomathematics and biostatistics, pharmacology, nursing, and internal medicine at Georgetown University. He is semi-retired and continues to teach biostatistics and clinical trial design online to Georgetown University students.

This article can be found in the category: