Collecting and Validating Clinical Study Data

By John Pezzullo

If the case report form (CRF) has been carefully and logically designed, entering each subject’s data in the right place on the CRF should be straightforward. Then you need to get this data into a computer for analysis.

You can enter your data directly into the statistics software you plan to use for the majority of the analysis, or you can enter it into a general database program such as MS Access or a spreadsheet program like Excel.

The structure of a computerized database usually reflects the structure of the CRF. If a study is simple enough that a single data sheet can hold all the data, then a single data file (called a table) or a single Excel worksheet will suffice.

But for most studies, a more complicated database is required, consisting of a set of tables or Excel worksheets (one for each kind of data collection sheet in the CRF). If the design of the database is consistent with the structure of the CRF, entering the data from each CRF sheet into the corresponding data table shouldn’t be difficult.
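As a rough illustration of the "one table per kind of CRF sheet" layout, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and values are hypothetical and are not taken from any particular study. The sketch assumes that every sheet carries a subject ID that links its rows back to the same subject.

```python
import sqlite3

# In-memory database for illustration only; a real study would use a file
# or a dedicated data-management system.
conn = sqlite3.connect(":memory:")

# One table per kind of CRF sheet (hypothetical sheets: demographics, lab results).
conn.executescript("""
CREATE TABLE demographics (
    subject_id TEXT PRIMARY KEY,
    birth_year INTEGER,
    sex        TEXT
);
CREATE TABLE lab_results (
    subject_id TEXT    NOT NULL REFERENCES demographics(subject_id),
    visit      INTEGER NOT NULL,
    hemoglobin REAL,
    hematocrit REAL,
    PRIMARY KEY (subject_id, visit)
);
""")

# Made-up example rows: one subject with two lab visits.
conn.execute("INSERT INTO demographics VALUES (?, ?, ?)", ("S01", 1960, "F"))
conn.executemany(
    "INSERT INTO lab_results VALUES (?, ?, ?, ?)",
    [("S01", 1, 12.4, 37.0), ("S01", 2, 12.9, 38.5)],
)

# The shared subject_id lets data from different sheets be combined for analysis.
rows = conn.execute("""
    SELECT d.subject_id, d.sex, l.visit, l.hemoglobin
    FROM demographics AS d
    JOIN lab_results AS l ON l.subject_id = d.subject_id
    ORDER BY l.visit
""").fetchall()
print(rows)
```

The same layout carries over to Excel: one worksheet per CRF sheet, each with a subject ID column.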

You must retain all the original source documents (lab reports, the examining physician’s notes, original questionnaire sheets, and so forth) in case questions about data accuracy arise later.

Before you can analyze your data (see the next section), you must complete one more crucial task: check the data thoroughly for errors. And there will be errors. They can arise when data is transcribed from the source documents onto the CRF or when data is entered from the CRFs into the computer. Consider some of the following error-checking techniques:

  • After the data has been entered into the computer, have one person read the data aloud from the original source documents or CRFs while another checks it against the data in the computer. Ideally, do this for all data for all subjects.

  • Have the computer display the smallest and largest values of each variable. Better yet, have the computer display a sorted list of the values for each variable. Typing errors often produce very large or very small values (like an impossibly large hemoglobin value of 124 instead of 12.4).

  • Prepare scatter charts for pairs of variables that ought to be closely correlated (such as hemoglobin vs. hematocrit). The points should lie fairly closely along a line (or a curve), while points containing a variable that was mistyped will often stick out like a sore thumb. This technique can often catch data entry errors that might not be caught by just looking at the lowest and highest values of each variable.

  • A more extreme approach, but one that’s sometimes done for crucially important studies, is to have two people enter all the data into separate copies of the database; then have the computer automatically compare every single data item between the two databases.
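
Three of the checks above can be automated with very little code. The following is a minimal Python sketch, not a definitive implementation. The subject IDs, variable names, and values are made up for illustration. The ratio check is a simple numeric stand-in for eyeballing a scatter chart: it assumes the two variables are roughly proportional, and the 25 percent tolerance is an arbitrary choice.

```python
from statistics import median

# Hypothetical data keyed by subject ID; all values are illustrative only.
entry_a = {
    "S01": {"hemoglobin": 13.1, "hematocrit": 39.5},
    "S02": {"hemoglobin": 124.0, "hematocrit": 37.0},  # typing error for 12.4
    "S03": {"hemoglobin": 14.8, "hematocrit": 44.0},
    "S04": {"hemoglobin": 11.9, "hematocrit": 36.1},
}
# A second, independent entry of the same CRFs (double data entry).
entry_b = {
    "S01": {"hemoglobin": 13.1, "hematocrit": 39.5},
    "S02": {"hemoglobin": 12.4, "hematocrit": 37.0},
    "S03": {"hemoglobin": 14.8, "hematocrit": 44.0},
    "S04": {"hemoglobin": 11.9, "hematocrit": 36.1},
}

def sorted_values(records, variable):
    """Return (value, subject_id) pairs sorted by value so extremes stand out."""
    return sorted((rec[variable], sid) for sid, rec in records.items())

def ratio_outliers(records, x, y, tolerance=0.25):
    """Flag subjects whose y/x ratio is far from the median ratio.

    Assumes x is never zero and that y is roughly proportional to x.
    """
    ratios = {sid: rec[y] / rec[x] for sid, rec in records.items()}
    typical = median(ratios.values())
    return sorted(
        sid for sid, r in ratios.items()
        if abs(r - typical) / typical > tolerance
    )

def compare_entries(first, second):
    """List (subject, variable, value_1, value_2) wherever two entries disagree."""
    diffs = []
    for sid in sorted(set(first) | set(second)):
        rec1, rec2 = first.get(sid, {}), second.get(sid, {})
        for var in sorted(set(rec1) | set(rec2)):
            if rec1.get(var) != rec2.get(var):
                diffs.append((sid, var, rec1.get(var), rec2.get(var)))
    return diffs

values = sorted_values(entry_a, "hemoglobin")
print("smallest:", values[0], "largest:", values[-1])
print("ratio outliers:", ratio_outliers(entry_a, "hemoglobin", "hematocrit"))
print("entry mismatches:", compare_entries(entry_a, entry_b))
```

In this made-up data set, all three checks point to subject S02, whose hemoglobin was typed as 124 instead of 12.4. A flagged value is only a candidate error. Always confirm it against the original source documents before changing anything.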