R Formulas and an Example

By Joseph Schmuller

R formulas are useful for multiple reasons. Suppose you’re interested in how the temperature varies with the month. Having lived through many Mays through Septembers in one place, you might guess that the temperature in the airquality data frame generally increases from month to month. Is that the case?

Answering that question gets into the area of statistical analysis, and at a fairly esoteric level. A good place to start is an R capability that much of that analysis relies on: the formula.

In this example, let’s say that Temperature depends on Month. Another way to say this is that Temperature is the dependent variable and Month is the independent variable.

An R formula incorporates these concepts and serves as the basis for many of R’s statistical functions and graphing functions. This is the basic structure of an R formula:

function(dependent_var ~ independent_var, data = data.frame)

Read the tilde operator (~) as “depends on.”
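A formula is also an ordinary R object, so you can create one on its own and inspect it before handing it to a function. The short sketch below illustrates that idea; the variable name f is chosen here only for illustration.

```r
# A formula can be stored in a variable like any other R object.
# Creating it does not require Temp or Month to exist yet;
# the names are looked up later, when a function uses the formula.
f <- Temp ~ Month      # read as "Temp depends on Month"

class(f)               # the object's class is "formula"
all.vars(f)            # the variable names in the formula: "Temp" "Month"
```

Passing f as the first argument, as in lm(f, data = airquality), should be equivalent to typing the formula directly inside the call.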

Here’s how you can address the relationship between Temp and Month:

> analysis <- lm(Temp ~ Month, data=airquality)

The name of the function lm() is an abbreviation for linear model. This means that you expect the temperature to increase linearly (at a constant rate) from month to month. To see the results of the analysis, you can use summary():

> summary(analysis)

Call:
lm(formula = Temp ~ Month, data = airquality)

Residuals:
     Min       1Q   Median       3Q      Max
-20.5263  -6.2752   0.9121   6.2865  17.9121

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  58.2112     3.5191  16.541  < 2e-16 ***
Month         2.8128     0.4933   5.703 6.03e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.614 on 151 degrees of freedom
Multiple R-squared:  0.1772, Adjusted R-squared:  0.1717
F-statistic: 32.52 on 1 and 151 DF,  p-value: 6.026e-08

The Estimate for Month indicates that temperature increases at an average rate of 2.8128 degrees Fahrenheit per month between May and September. Along with the Estimate for (Intercept), you can summarize the relationship between Temp and Month as

Temp = 58.2112 + (2.8128 × Month)

where Month is a number from 5 to 9.
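To see the equation at work, you can plug the month numbers into it yourself and compare the result with what R's predict() function returns for the same model. This is a minimal sketch; the names b, months, by_hand, and by_predict are made up for the illustration.

```r
# Refit the model so the example is self-contained.
analysis <- lm(Temp ~ Month, data = airquality)

# coef() returns the fitted intercept and slope as a named vector.
b <- coef(analysis)

# Apply Temp = intercept + slope * Month by hand for May through September.
months  <- 5:9
by_hand <- b[["(Intercept)"]] + b[["Month"]] * months

# Ask R for the same predictions.
by_predict <- predict(analysis, newdata = data.frame(Month = months))

round(by_hand, 1)                        # predicted Temp for months 5 to 9
all.equal(by_hand, unname(by_predict))   # the two approaches should agree
```

With the coefficients shown in the output above, the predicted temperature comes to about 72.3 for May and about 83.5 for September.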

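You might remember from algebra class that when you graph this kind of equation, you get a straight line, hence the term linear model. Is the linear model a good way to summarize these data? The p-value in the bottom line of the output (6.026e-08) says that the linear trend is statistically significant, although the R-squared of 0.1772 in the line above it shows that Month accounts for only about 18 percent of the variation in Temp. I won’t go into the details here.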
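One way to judge the fit by eye is to draw the data and the fitted line together. The sketch below reuses the same formula in plot(), one of the graphing functions that accept formulas, and then adds the line with abline(). The axis labels are my own wording, not part of the original example.

```r
# Refit the model so the example is self-contained.
analysis <- lm(Temp ~ Month, data = airquality)

# Scatterplot of Temp against Month, specified with the same formula.
plot(Temp ~ Month, data = airquality,
     xlab = "Month (5 = May, 9 = September)",
     ylab = "Temperature (degrees Fahrenheit)")

# Draw the fitted straight line, using the intercept and slope from the model.
abline(analysis)
```

If the points bend away from the line, rising through midsummer and falling off afterward, for example, that suggests a straight line is only a rough summary of these data.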

The output of summary() (and other statistical functions in R) is a list. So if you want to refer to the Estimate for Month, that’s

> s <- summary(analysis)
> s$coefficients[2,1]
[1] 2.812789
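Because the summary is a list, other pieces of the printed output can be pulled out in the same way. The following sketch shows a few of them. Indexing the coefficient table by row and column name rather than by position is a choice made here for readability, not something the original example requires.

```r
# Refit the model so the example is self-contained.
analysis <- lm(Temp ~ Month, data = airquality)
s <- summary(analysis)

names(s)                               # the components stored in the list

# Same value as s$coefficients[2,1], selected by name instead of position.
s$coefficients["Month", "Estimate"]

s$r.squared                            # the Multiple R-squared from the output
s$fstatistic                           # the F value and its degrees of freedom

# coef() on the model itself is a shortcut to the estimates.
coef(analysis)[["Month"]]
```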