# R Formulas and an Example

R formulas are useful for multiple reasons. Suppose you’re interested in how the temperature varies with the month. Having lived through many Mays through Septembers in one place, you might guess that the temperature in this data frame generally increases from month to month. Is that the case?

This gets into the area of statistical analysis, and at a fairly esoteric level. So let’s take a look at an R capability — the *formula*.

In this example, let’s say that Temperature depends on Month. Another way to say this is that Temperature is the *dependent variable* and Month is the *independent variable*.

An R formula incorporates these concepts and serves as the basis for many of R’s statistical functions and graphing functions. This is the basic structure of an R formula:

`function(dependent_var ~ independent_var, data = data.frame)`

Read the tilde operator (`~`) as “depends on.”
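Because a formula is an ordinary R object, you can store it in a variable and inspect it before handing it to a function. Here is a minimal sketch, assuming the built-in `airquality` data frame used in this example; the variable names `f` and `monthly_means` are only illustrative:

```r
# Store the formula in a variable; nothing is computed yet
f <- Temp ~ Month

class(f)      # "formula"
all.vars(f)   # "Temp" "Month"

# The same formula can be passed to other functions that accept one.
# Here aggregate() computes the mean Temp for each Month.
monthly_means <- aggregate(f, data = airquality, FUN = mean)
print(monthly_means)
```

The table of monthly means is a quick, informal way to check the guess about rising temperatures before fitting any model.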

Here’s how you can address the relationship between `Temp` and `Month`:

`> analysis <- lm(Temp ~ Month, data=airquality)`

The name of the function `lm()` is an abbreviation for *l*inear *m*odel. This means that you expect the temperature to increase linearly (at a constant rate) from month to month. To see the results of the analysis, you can use `summary()`:

    > summary(analysis)

    Call:
    lm(formula = Temp ~ Month, data = airquality)

    Residuals:
         Min       1Q   Median       3Q      Max
    -20.5263  -6.2752   0.9121   6.2865  17.9121

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  58.2112     3.5191  16.541  < 2e-16 ***
    Month         2.8128     0.4933   5.703 6.03e-08 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 8.614 on 151 degrees of freedom
    Multiple R-squared:  0.1772,    Adjusted R-squared:  0.1717
    F-statistic: 32.52 on 1 and 151 DF,  p-value: 6.026e-08

The `Estimate` for `Month` indicates that temperature increases at a rate of 2.8128 degrees per month between May and September. Along with the `Estimate` for `(Intercept)`, you can summarize the relationship between `Temp` and `Month` as

*Temp* = 58.2112 + (2.8128 × *Month*)

where *Month* is a number from 5 to 9.
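To see the equation at work, you can plug the month numbers into it yourself and compare the result with what R’s `predict()` function returns. This is only a sketch: the names `b`, `months`, `by_hand`, and `by_predict` are made up for illustration.

```r
analysis <- lm(Temp ~ Month, data = airquality)

# Compute the fitted line by hand from the two estimates
b <- coef(analysis)               # named vector: (Intercept), Month
months <- 5:9
by_hand <- unname(b["(Intercept)"] + b["Month"] * months)

# predict() does the same calculation for new values of Month
by_predict <- unname(predict(analysis, newdata = data.frame(Month = months)))

# Show both sets of predicted temperatures next to each other
print(data.frame(Month = months, by_hand = by_hand, by_predict = by_predict))
```

The two columns should agree, because `predict()` applies the same intercept and slope to each value of `Month`.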

You might remember from algebra class that when you graph this kind of equation, you get a straight line — hence the term *linear model*. Is the linear model a good way to summarize these data? The numbers in the bottom line of the output say that it is, but I won’t go into the details.

The output of `summary()` (and other statistical functions in R) is a list. So if you want to refer to the `Estimate` for `Month`, that’s

`> s <- summary(analysis)`

`> s$coefficients[2,1]`

`[1] 2.812789`
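Numeric positions like `[2,1]` are easy to mix up, so you might prefer to pull values out by name. The following sketch assumes the same `analysis` model as above; the variable name `slope` is just for illustration.

```r
analysis <- lm(Temp ~ Month, data = airquality)
s <- summary(analysis)

# The coefficient table is a matrix with row and column names
slope <- s$coefficients["Month", "Estimate"]
print(slope)                 # same value as s$coefficients[2, 1]

# coef() on the fitted model returns the estimates as a named vector
print(coef(analysis)["Month"])

# Other pieces of the summary are list components too
print(s$r.squared)           # Multiple R-squared
print(s$sigma)               # residual standard error
```

Running `names(s)` lists every component that the summary holds.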