Summarizing Categorical Data in Statistics
Categorical data capture qualities or characteristics of an individual, such as a person's eye color, gender, political party, or opinion on some issue (using categories such as agree, disagree, or no opinion). Categorical data tend to fall into groups or categories pretty naturally. "Political party," for example, typically has four groups in U.S. surveys: Democrat, Republican, Independent, and other. Categorical data often come from surveys, but they can also be collected in experiments. For example, in an experimental test of a new medical treatment, researchers may use three categories to assess the outcome of the experiment: Did the patient get better, get worse, or stay the same while undergoing the treatment?
Categorical data are often summarized by reporting the percentage of individuals falling into each category. For example, pollsters may report the percentage of Republicans, Democrats, Independents, and others who took part in a survey. To calculate the percentage of individuals in a certain category, find the number of individuals in that category, divide by the total number of people in the study, and then multiply by 100%. For example, if a survey of 2,000 teenagers included 1,200 females and 800 males, the resulting percentages would be (1,200 ÷ 2,000) × 100% = 60% female and (800 ÷ 2,000) × 100% = 40% male.
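The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name category_percentages is made up for this example, and the counts are the ones from the teenager survey.

```python
def category_percentages(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no individuals to summarize")
    # Divide each category count by the total, then multiply by 100.
    return {category: 100 * n / total for category, n in counts.items()}


# Counts from the survey of 2,000 teenagers described in the text.
survey_counts = {"female": 1200, "male": 800}
percentages = category_percentages(survey_counts)

for category, pct in percentages.items():
    print(f"{category}: {pct:.0f}%")
```

The same function works for any number of categories, as long as every individual is counted in exactly one of them.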
You can further break down categorical data by creating crosstabs. Crosstabs (also called two-way tables) are tables with rows and columns. They summarize the information from two categorical variables at once, such as gender and political party, so you can see (or easily calculate) the percentage of individuals in each combination of categories. For example, if you had data about the gender and political party of your respondents, you would be able to look at the percentage of Republican females, Republican males, Democratic females, Democratic males, and so on. In this example, the total number of possible combinations in your table would be 2 × 4 = 8: the number of gender categories times the number of party affiliation categories.
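As a rough sketch of how a crosstab is built, the Python snippet below counts each gender-and-party combination and converts the counts to percentages of all respondents. The eight responses are invented purely for illustration and do not come from any real survey.

```python
from collections import Counter

# Hypothetical responses: one (gender, party) pair per respondent.
responses = [
    ("female", "Democrat"), ("female", "Republican"), ("male", "Democrat"),
    ("male", "Independent"), ("female", "Democrat"), ("male", "Republican"),
    ("female", "other"), ("male", "Republican"),
]

genders = ["female", "male"]
parties = ["Democrat", "Republican", "Independent", "other"]

# Count how many respondents fall into each combination of categories.
cell_counts = Counter(responses)
total = len(responses)

# Percentage of all respondents in each cell of the two-way table.
# Combinations that never occur get a count (and percentage) of zero.
crosstab = {
    gender: {party: 100 * cell_counts[(gender, party)] / total for party in parties}
    for gender in genders
}

n_cells = len(genders) * len(parties)  # 2 gender categories x 4 party categories
print(f"{n_cells} possible combinations")
for gender in genders:
    row = "  ".join(f"{party}: {crosstab[gender][party]:.1f}%" for party in parties)
    print(f"{gender:<7} {row}")
```

Here the percentages are taken out of all respondents; dividing by a row total or a column total instead would answer a different question, such as what percentage of females are Democrats.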
The U.S. government calculates and summarizes loads of categorical data using crosstabs. The U.S. Census Bureau doesn't just count the population; it also collects and summarizes data from a subset of all Americans (those who fill out the long form) on various demographic characteristics, such as gender and age. Typical age and gender data, reported by the U.S. Census Bureau for a survey conducted in 2001, are shown in Table 1. (Normally, age would be considered a numerical variable, but the way the U.S. government reports it, age is broken down into categories, making it a categorical variable. See the following section for more on numerical data.)
Table 1: U.S. Population, Broken Down by Age and Gender (2001)
Age | Total | % of Total | # Males | % of Males | # Females | % of Females |
Under 5 years | 19,369,341 | 6.80 | 9,905,282 | 7.08 | 9,464,059 | 6.53 |
5 to 9 years | 20,184,052 | 7.09 | 10,336,616 | 7.39 | 9,847,436 | 6.79 |
10 to 14 years | 20,881,442 | 7.33 | 10,696,244 | 7.65 | 10,185,198 | 7.03 |
15 to 19 years | 20,267,154 | 7.12 | 10,423,173 | 7.46 | 9,843,981 | 6.79 |
20 to 24 years | 19,681,213 | 6.91 | 10,061,983 | 7.20 | 9,619,230 | 6.63 |
25 to 29 years | 18,926,104 | 6.65 | 9,592,895 | 6.86 | 9,333,209 | 6.44 |
30 to 34 years | 20,681,202 | 7.26 | 10,420,677 | 7.45 | 10,260,525 | 7.08 |
35 to 39 years | 22,243,146 | 7.81 | 11,104,822 | 7.94 | 11,138,324 | 7.68 |
40 to 44 years | 22,775,521 | 8.00 | 11,298,089 | 8.08 | 11,477,432 | 7.92 |
45 to 49 years | 20,768,983 | 7.29 | 10,224,864 | 7.31 | 10,544,119 | 7.27 |
50 to 54 years | 18,419,209 | 6.47 | 9,011,221 | 6.45 | 9,407,988 | 6.49 |
55 to 59 years | 14,190,116 | 4.98 | 6,865,439 | 4.91 | 7,324,677 | 5.05 |
60 to 64 years | 11,118,462 | 3.90 | 5,288,527 | 3.78 | 5,829,935 | 4.02 |
65 to 69 years | 9,532,702 | 3.35 | 4,409,658 | 3.15 | 5,123,044 | 3.53 |
70 to 74 years | 8,780,521 | 3.08 | 3,887,793 | 2.78 | 4,892,728 | 3.37 |
75 to 79 years | 7,424,947 | 2.61 | 3,057,402 | 2.19 | 4,367,545 | 3.01 |
80 to 84 years | 5,149,013 | 1.81 | 1,929,315 | 1.38 | 3,219,698 | 2.22 |
85 to 89 years | 2,887,943 | 1.01 | 926,654 | 0.66 | 1,961,289 | 1.35 |
90 to 94 years | 1,175,545 | 0.41 | 303,927 | 0.22 | 871,618 | 0.60 |
95 to 99 years | 291,844 | 0.10 | 58,667 | 0.04 | 233,177 | 0.16 |
100 years and over | 48,427 | 0.02 | 9,860 | 0.01 | 38,567 | 0.03 |
Total all ages | 284,796,887 | 100 | 139,813,108 | 100 | 144,983,779 | 100 |
You can examine many different facets of the population by working with different numbers from Table 1. Looking at gender, notice that women slightly outnumber men: the population in 2001 was 51% female (divide the total number of females by the total population size and multiply by 100%) and 49% male (divide the total number of males by the total population size and multiply by 100%). You can also look at age: The percentage of the entire population under age 5 was 6.8%; the largest group was the 40- to 44-year-olds, who made up 8% of the population. Next, you can explore a possible relationship between gender and age by comparing various parts of the table. You can compare, for example, females and males in the 80-and-over age group. Because these data are reported in five-year increments, you have to do a little math to get your answer. The percentages in the gender columns are percentages of each gender, not of the whole population, so the percentage of females who were aged 80 and over is 2.22% + 1.35% + 0.60% + 0.16% + 0.03% = 4.36%. The percentage of males who were aged 80 and over is 1.38% + 0.66% + 0.22% + 0.04% + 0.01% = 2.31%. Because the total numbers of males and females are nearly equal, this shows that the 80-and-over age group contained almost twice as many women as men (about 6.3 million women versus 3.2 million men). These data seem to confirm the notion that women tend to live longer than men.
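The 80-and-over comparison can also be worked directly from the counts in Table 1 rather than from the rounded percentages. The sketch below uses the counts from the five oldest age groups and the "Total all ages" row; the variable names are chosen just for this example.

```python
# Counts for the five oldest age groups in Table 1
# (80 to 84, 85 to 89, 90 to 94, 95 to 99, 100 and over).
males_80_plus = [1_929_315, 926_654, 303_927, 58_667, 9_860]
females_80_plus = [3_219_698, 1_961_289, 871_618, 233_177, 38_567]

# "Total all ages" row of Table 1.
total_males = 139_813_108
total_females = 144_983_779

# Share of each gender that is aged 80 and over.
pct_males_80_plus = 100 * sum(males_80_plus) / total_males
pct_females_80_plus = 100 * sum(females_80_plus) / total_females

# Number of women per man within the 80-and-over group.
women_per_man = sum(females_80_plus) / sum(males_80_plus)

print(f"Males aged 80 and over:   {pct_males_80_plus:.2f}% of all males")
print(f"Females aged 80 and over: {pct_females_80_plus:.2f}% of all females")
print(f"Women per man, 80 and over: {women_per_man:.2f}")
```

Working from the counts avoids the small rounding errors that can creep in when already-rounded percentages are added together.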
If you're given the number of individuals in each group, you can always calculate your own percentages. But if you're given only percentages without the total number in the group, you can never retrieve the original number of individuals in each group. For example, you could hear that 80% of the people surveyed prefer Cheesy cheese crackers over Crummy cheese crackers. But how many were surveyed? It could be only 10 people, for all you know, because 8 out of 10 is 80%, just as 800 out of 1,000 is 80%. These two fractions (8 out of 10 and 800 out of 1,000) have different meanings for statisticians, because in the first case, the statistic is based on very little data, and in the second case, it's based on a lot of data.
After you have the crosstabs that show the breakdown of two categorical variables, you can conduct statistical tests to determine whether a significant relationship or link between the two variables exists.
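The text does not say which test to use. One common choice for a two-way table of counts is the chi-square test of independence, and the sketch below computes its test statistic by hand. The counts are hypothetical, and the function name chi_square_statistic is made up for this example.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are two genders,
# columns are agree / disagree / no opinion.
observed = [
    [30, 50, 20],
    [45, 35, 20],
]

statistic = chi_square_statistic(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"chi-square statistic: {statistic:.3f}")
print(f"degrees of freedom:   {degrees_of_freedom}")
```

To decide whether the relationship is significant, the statistic is compared with a chi-square distribution that has the degrees of freedom shown. A statistics package or a printed chi-square table can supply that comparison.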















