How to Create a Data Dictionary to Describe Your Biostatistics Data

By John Pezzullo

Every research database, large or small, simple or complicated, should be accompanied by a data dictionary that describes the variables contained in the database. The dictionary is invaluable when the person who created the database is no longer around to answer questions. A data dictionary is itself a data file, containing one record for every variable in the database.

For each variable, the dictionary should contain most of the following information (sometimes referred to as metadata, which means “data about data”):

  • A short variable name (usually no more than eight or ten characters) that’s used when telling the software which variables you want it to use in an analysis

  • A longer verbal description of the variable (up to 50 or 100 characters)

  • The type of data (text, categorical, numeric, date/time, and so on)

    • If numeric: Information about how that number is displayed (how many digits are before and after the decimal point)

    • If date/time: How it’s formatted (for example, 12/25/13 10:50pm or 25Dec2013 22:50)

    • If categorical: What the permissible categories are

  • How missing values are represented in the database (99, 999, “NA,” and so on)
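
The items above map naturally onto one record per variable. The following is a minimal sketch in Python, not a standard layout: the field names (display_format, missing_code, and so on), the "|" separator for categories, and the four example variables are all assumptions made up for illustration.

```python
import csv
import io
from dataclasses import dataclass, asdict


@dataclass
class VariableEntry:
    """One data-dictionary record describing a single variable."""
    name: str                 # short variable name used in analyses
    description: str          # longer verbal description
    data_type: str            # "text", "categorical", "numeric", or "date/time"
    display_format: str = ""  # digits for numeric, pattern for date/time
    categories: str = ""      # permissible categories, separated by "|"
    missing_code: str = ""    # how a missing value is represented


# Hypothetical variables, invented for this example.
DICTIONARY = [
    VariableEntry("subj_id", "Subject identifier", "text"),
    VariableEntry("sex", "Sex of subject", "categorical",
                  categories="M|F", missing_code="NA"),
    VariableEntry("weight_kg", "Body weight at enrollment, in kilograms",
                  "numeric", display_format="3.1", missing_code="999"),
    VariableEntry("visit_dt", "Date and time of first visit", "date/time",
                  display_format="25Dec2013 22:50", missing_code="NA"),
]


def dictionary_to_csv(entries):
    """Return the dictionary as CSV text, one row per variable."""
    buffer = io.StringIO()
    fieldnames = list(asdict(entries[0]).keys())
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()


if __name__ == "__main__":
    print(dictionary_to_csv(DICTIONARY))
```

Because the dictionary is itself a data file, saving it as CSV means it can be opened in the same software that holds the data.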

Many statistical packages allow (or require) you to specify this information when you’re creating the file anyway, so they can generate the data dictionary for you automatically.

But Excel lets you enter anything anywhere, without formally defining variables, so you need to create the dictionary yourself. A convenient place for it is another worksheet (which you can call “Data Dictionary”) in the same Excel file that has the data, so that the data dictionary always stays with the data.
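
When you build the dictionary by hand, a draft with one row per column can save some typing. The sketch below assumes the data worksheet has been exported to CSV with variable names in the first row. The sample data, the set of missing-value codes, and the guess_type helper are hypothetical. The guess only separates numeric from text, so categories, formats, and descriptions still have to be filled in by someone who knows the data.

```python
import csv
import io

# Hypothetical CSV export of the data worksheet (header row = variable names).
SAMPLE_EXPORT = """subj_id,sex,weight_kg
S001,M,72.5
S002,F,999
S003,NA,64.0
"""

# Assumed missing-value conventions; replace with the codes your data uses.
MISSING_CODES = {"NA", "999", ""}


def guess_type(values):
    """Guess 'numeric' if every non-missing value parses as a number, else 'text'."""
    observed = [v for v in values if v not in MISSING_CODES]
    if not observed:
        return "unknown"
    try:
        for value in observed:
            float(value)
    except ValueError:
        return "text"
    return "numeric"


def draft_dictionary(csv_text):
    """Build a draft dictionary: one record per column, description left blank."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return [
        {
            "name": column,
            "description": "",  # to be written by hand
            "data_type": guess_type([row[column] for row in rows]),
        }
        for column in reader.fieldnames
    ]


if __name__ == "__main__":
    for record in draft_dictionary(SAMPLE_EXPORT):
        print(record)
```

A guessed type is only a starting point: a column of numeric codes such as 1 and 2 would be labeled numeric here even if it is really categorical.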