What is Categorical Data and How is It Summarized?
What is categorical data? Basically, it is data in which individuals are placed into groups or categories, such as gender, region, or type of movie. Summarizing categorical data means boiling down all the information into just a few numbers that tell its basic story.
Because categorical data involves pieces of data that belong in categories, you have to look at how many individuals fall into each group and summarize the numbers appropriately. Here, you learn about making, interpreting, and evaluating frequency and relative frequency tables for categorical data.
Counting on the frequency
One way to summarize categorical data is to simply count, or tally up, the number of individuals that fall into each category. The number of individuals in any given category is called the frequency (or count) for that category. If you list all the possible categories along with the frequency for each, you create a frequency table. The total of all the frequencies should equal the size of the sample (because you place each individual in exactly one category).
See the following for an example of summarizing data by using a frequency table.
Suppose that you take a sample of 10 people and ask them all whether they own a cellphone. Each person falls into one of two categories: yes or no. The data is shown in the following table.
|Person #||Cellphone||Person #||Cellphone|
Data summaries boil down the data quickly and clearly.
A data summary allows you to see patterns in the data, which aren’t clear if you look only at the original data.
|Own a Cellphone?||Frequency|
|Yes||7|
|No||3|
|Total||10|
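The tallying described in this section can be sketched in a few lines of Python. This is only an illustration: the individual answers in `responses` are invented, and only the totals (7 yes and 3 no out of 10 people) match the cellphone example. The variable names are assumptions, not part of the original exercise.

```python
from collections import Counter

# Hypothetical individual answers. The order is made up for illustration;
# only the totals (7 "Yes", 3 "No") match the cellphone example.
responses = ["Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes"]

# Counter tallies how many times each category appears.
frequency_table = Counter(responses)

for category, count in frequency_table.items():
    print(f"{category}: {count}")

# The frequencies should add up to the sample size,
# because each person lands in exactly one category.
total = sum(frequency_table.values())
print(f"Total: {total}")
```

Comparing `total` with `len(responses)` is a quick way to confirm that nobody was counted twice or left out.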
Relating categorical data with percentages
Another way to summarize categorical data is to show the percentage of individuals who fall into each category, thereby creating a relative frequency. The relative frequency of a given category is the frequency (number of individuals in that category) divided by the total sample size, multiplied by 100 to get the percentage. For example, if you survey 50 people and 10 are in favor of a certain issue, the relative frequency of the “in-favor” category is 10 / 50 = 0.20; multiplying 0.20 by 100 gives you 20 percent. If you list all the possible categories along with their relative frequencies, you create a relative frequency table. The total of all the relative frequencies should equal 100 percent (subject to possible round-off error).
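As a rough sketch, the same calculation can be written as a small Python function that turns a frequency table into a relative frequency table. The function name `relative_frequencies` and the “not in favor” label for the other 40 people are assumptions made for this illustration; the text only says that 10 of the 50 people surveyed are in favor.

```python
def relative_frequencies(frequency_table):
    """Convert the count for each category into a percentage of the total."""
    total = sum(frequency_table.values())
    # 100 * count / total is the same as (count / total) * 100.
    return {category: 100 * count / total
            for category, count in frequency_table.items()}


# 50 people surveyed, 10 in favor. The label for the remaining 40
# is an assumption made for this example.
survey = {"in favor": 10, "not in favor": 40}

percentages = relative_frequencies(survey)
print(percentages["in favor"])    # 20.0
print(sum(percentages.values()))  # 100.0
```

Passing the cellphone counts (7 yes, 3 no) to the same function gives the 70 percent and 30 percent used later in this section.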
See the following for an example of summarizing data by using a relative frequency table.
Using the cellphone data from the following table, make a relative frequency table and interpret the results.
|Person #||Cellphone||Person #||Cellphone|
The following table shows a relative frequency table for the cellphone data. Seventy percent of the people sampled reported owning cellphones, and 30 percent admitted to being technologically behind the times.
|Own a Cellphone?||Relative Frequency|
|Yes||70%|
|No||30%|
You get the 70 percent by taking 7 / 10 × 100, and you calculate the 30 percent by taking 3 / 10 × 100.
Watch for total sample sizes when given a relative frequency table. Don’t be misled by percentages alone, thinking they’re always based on large sample sizes, because many are not.
Interpreting counts and percents with caution
Not all summaries of categorical data are fair and accurate. Knowing what to look for can help you keep your eyes open for misleading and incomplete information.
Instructors often ask you to “interpret the results.” In this case, your instructor wants you to use the statistics available to talk about how they relate to the given situation. In other words, what do the results mean to the person who collects the data?
See the following for an example of critiquing a data summary.
You watch a commercial where the manufacturer of a new cold medicine (“Nocold”) compares it to the leading brand. The results are shown in the following table.
|How Nocold Compares||Percentage|
|Much better||47%|
|At least as good||18%|
The table about “Nocold” does “Nogood.”
This table is an incomplete relative frequency table. The remaining category is “not as good” for the Nocold brand, and the advertiser doesn’t show it. But you can do the math and see that 100% – (47% + 18%) = 35% of the people say that the leading brand is better.
If you put the two groups together, 65% of the patients say that Nocold is at least as good as the leading brand, and almost half of the patients say Nocold is much better.
What’s missing? The remaining percentage (to keep all possible results in perspective). But more importantly, the total sample size is missing. You don’t know whether the surveyors sampled 10 people, 100 people, or 1,000 people. This means that the precision of the results is unknown. (Precision means how consistent the results will be from sample to sample; it’s related to sample size.)
With relative frequency tables, don’t forget to check whether all categories sum to 1 or 100 percent (subject to round-off error), and remember to look for some indicator as to total sample size.
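The two checks on the Nocold percentages (what is left over, and whether the table sums to 100 percent) can be sketched in Python as follows. The function names and the half-point round-off tolerance are assumptions chosen for illustration, and the category labels are taken from the discussion of the commercial.

```python
def missing_percentage(shown):
    """Percentage not accounted for by the categories that are shown."""
    return 100 - sum(shown.values())


def sums_to_100(table, tolerance=0.5):
    """True if the percentages add up to 100, allowing for round-off.

    The tolerance of half a percentage point is an assumption.
    """
    return abs(sum(table.values()) - 100) <= tolerance


# The two categories the advertiser reports.
nocold_shown = {"Much better": 47, "At least as good": 18}

print(missing_percentage(nocold_shown))  # 35
print(sums_to_100(nocold_shown))         # False, so a category is missing

# Adding the unreported category completes the table.
nocold_full = dict(nocold_shown)
nocold_full["Not as good"] = missing_percentage(nocold_shown)
print(sums_to_100(nocold_full))          # True
```

Note that no calculation of this kind can recover the total sample size; that still has to come from whoever collected the data.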
If you’re interested in knowing how to represent categorical data in a graph, see “How to Summarize and Graph Categorical Data.”