Before you begin with regression analysis, you need to identify the population regression function (PRF). The PRF defines reality (or your perception of it) as it relates to your topic of interest. To identify it, you need to determine your dependent and independent variables (and how they’ll be measured) as well as the mathematical function describing how the variables are related.

After you narrow down your topic or question of interest, you’re ready to develop your model using the following steps:

1. Provide the general mathematical specification of your model.

The general specification denotes your dependent variable and all the independent (or explanatory) variables that you believe affect the dependent variable in your population of interest.

Suppose that three variables affect the dependent variable. The general specification will look something like Y = f(X1,X2,X3), where Y is the dependent variable and the Xs represent the independent variables, which you believe directly affect (or cause) fluctuations in the Y variable.

Unless the reasoning is obvious, provide some justification for the variables chosen as independent variables and for the functional form of the specification (see Step 2). Doing so helps you avoid misspecification, which occurs if you omit important variables or include irrelevant variables.

2. Derive the econometric specification of your model.

In this step, you take the variables identified in Step 1 and develop a function that can be estimated with data. This functional form is known as the population regression function (PRF). You’re also acknowledging that the relationship you hypothesized in Step 1 is expected to hold on average, not for every single observation.

Assume you have reason to believe that the model is linear. It will look like this:

$$E(Y \mid X_1, X_2, X_3) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$$

In this function, the conditional mean operator $E(Y \mid X_1, X_2, X_3)$ indicates that the relationship is expected to hold, on average, for given values of the independent variables. The intercept term $\beta_0$, also called the constant, is the expected mean value of Y when all Xs are equal to zero. The other betas represent the partial slopes (effects). Each partial slope tells you how much the expected value of your dependent variable changes when you change one independent variable by one unit but hold the values of the other independent variables constant.

(This idea of changing one thing and keeping the rest the same is the ceteris paribus, or all else equal, condition that you’re familiar with from your introductory economics courses.)

Depending on the particular phenomenon you’re analyzing, a nonlinear relationship using squared terms, logs, or another method may be more appropriate than the linear function $E(Y \mid X_1, X_2, X_3) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$.

The specification you choose is assumed to describe the “true” relationship, so be sure to justify it using sound economic theory and common sense.
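As a rough illustration of the linear specification in Step 2, the Python sketch below evaluates a conditional mean and shows that a partial slope is the change in that mean when one variable rises by one unit and the others stay fixed. The coefficient values and the function name `conditional_mean` are invented for this example; they are assumptions, not estimates from any data.

```python
# Hypothetical population parameters (assumed for illustration only).
BETA = {"b0": 2.0, "b1": 0.5, "b2": -1.5, "b3": 3.0}

def conditional_mean(x1, x2, x3, beta=BETA):
    """E(Y | X1, X2, X3) under a linear PRF."""
    return beta["b0"] + beta["b1"] * x1 + beta["b2"] * x2 + beta["b3"] * x3

# Intercept: the expected value of Y when all Xs equal zero.
intercept = conditional_mean(0, 0, 0)

# Partial slope of X1: raise X1 by one unit, hold X2 and X3 fixed.
base = conditional_mean(4, 1, 2)
bumped = conditional_mean(5, 1, 2)
partial_effect_x1 = bumped - base

print(intercept, partial_effect_x1)
```

With these made-up numbers, the difference between the two conditional means equals the coefficient on X1, which is the all-else-equal interpretation of a partial slope.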

3. Specify the random nature of your model.

This step clarifies that the relationship you’ve assumed in Steps 1 and 2 holds on average but may contain errors when a specific observation is chosen at random from the population. This is known as the stochastic population regression function and is written as

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i$$

where the i subscripts denote any randomly chosen observation and $\varepsilon_i$ represents the stochastic (or random) error term associated with that observation. Note that stochastic is simply statistics jargon for random.

Regardless of how you choose to represent the PRF, the random error term represents the difference between the observed value of your dependent variable and the conditional mean of the dependent variable derived from your model. This value is positive if the observed value is above the conditional mean and negative if it is below.
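To make the role of the error term concrete, here is a minimal simulation sketch of a stochastic PRF. It assumes made-up coefficients, Xs drawn uniformly between 0 and 10, and normally distributed errors with mean zero; none of these choices come from the text, and they serve only to show that each error is the observed value minus the conditional mean.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical PRF coefficients (assumed, not estimated).
b0, b1, b2, b3 = 2.0, 0.5, -1.5, 3.0

def conditional_mean(x1, x2, x3):
    """E(Y | X1, X2, X3) under the assumed linear PRF."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x3

observations = []
for _ in range(1000):
    x1, x2, x3 = (rng.uniform(0, 10) for _ in range(3))
    eps = rng.gauss(0, 1)  # stochastic error term (assumed normal)
    y = conditional_mean(x1, x2, x3) + eps
    observations.append((x1, x2, x3, y))

# Recover each error as observed Y minus the conditional mean.
errors = [y - conditional_mean(x1, x2, x3) for x1, x2, x3, y in observations]
above = sum(e > 0 for e in errors)   # observations above the conditional mean
below = len(errors) - above          # observations at or below it
mean_error = sum(errors) / len(errors)

print(above, below, round(mean_error, 3))
```

Individual errors are positive or negative depending on whether the observation lies above or below its conditional mean, while their average stays close to zero under these assumptions.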

The random error can result from one or more of the following factors:

• Insufficient or incorrectly measured data

• A lack of theoretical insights to fully account for all the factors that affect the dependent variable

• Applying an incorrect functional form; for example, assuming the relationship is linear when it’s quadratic

• Unobservable characteristics

• Unpredictable elements of behavior
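The functional-form item in the list above can be sketched with a toy example. The data below are invented: Y is exactly quadratic in a single X, with no random noise at all, and a straight line is fitted by ordinary least squares (used here simply as a convenient way to impose the linear form). Whatever is left over comes entirely from choosing the wrong shape.

```python
# True relationship (assumed for illustration): Y = 1 + X^2, with no random noise.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [1 + x ** 2 for x in xs]

# Impose a linear form by fitting a least-squares line to these points.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

# "Errors" produced by the wrong functional form follow a pattern in X.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(residuals)
```

The leftover terms are positive at both ends of the X range and negative in the middle, so part of what looks like error is really the omitted curvature.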

If you have several explanatory variables, you can save time by writing the econometric model using some mathematical shorthand. With algebraic notation, it would look like one of the following two functions: