Before you tackle econometrics, it helps to review the statistical concepts that matter most for success in the subject. Specifically, you need to be comfortable with probability distributions, the calculation of descriptive statistics, and hypothesis tests.

Your ability to accurately quantify economic relationships depends not only on your econometric model-building skills but also on the quality of the data you’re using for analysis and your capacity to adopt the appropriate strategies for estimating models that are likely to violate a statistical assumption. The data must be derived from a reliable collection process, but you should also be aware of any additional limitations or challenges.

These limitations may include, but aren’t limited to, the following:

• Aggregation of data: Information that may have originated at a household, individual, or firm level is being measured at a city, county, state, or country level in your data.

• Statistically correlated but economically irrelevant variables: Some datasets contain an abundance of information, but many of the variables may have nothing to do with the economic question you’re hoping to address, even when they happen to be correlated with the outcome you’re studying.

• Qualitative data: Rich datasets typically include qualitative variables (geographic information, race, and so on), but this information requires special treatment before you can use it in an econometric model.

• Classical linear regression model (CLRM) assumption failure: The legitimacy of your econometric approach always rests on a set of statistical assumptions, but you’re likely to find that at least one of these assumptions doesn’t hold (meaning it isn’t true for your data).

Econometricians differentiate themselves from statisticians by emphasizing violations of statistical assumptions that are often taken for granted. The most common technique for estimating an econometric model is ordinary least squares (OLS). However, a number of CLRM assumptions must hold in order for the OLS technique to provide reliable estimates. In practice, the assumptions that are most likely to fail depend on your data and specific application.

Recognizing the importance of data type, frequency, and aggregation

The data that you use to estimate and test your econometric model is typically classified into one of three possible types:

• Cross sectional: This type of data consists of measurements for individual observations (persons, households, firms, counties, states, countries, and so on) at a given point in time.

• Time series: This type of data consists of measurements on one or more variables (such as gross domestic product, interest rates, or unemployment rates) over time in a given space (like a specific country or state).

• Panel or longitudinal: This type of data consists of a time series for each cross-sectional unit in the sample. The data contains measurements for individual observations (persons, households, firms, counties, states, countries, and so on) over a period of time (days, months, quarters, or years).
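
As a rough illustration of how the three types differ in shape, the sketch below stores each one in plain Python structures. The firm names, years, and revenue figures are hypothetical, and the helper functions (including the "balanced" check, which simply asks whether every unit is observed in every period) are assumptions added for illustration.

```python
# Hypothetical revenue figures; only the shape of each structure matters.

# Cross-sectional: many units observed at one point in time (here, 2020).
cross_section = {"firm_a": 120.0, "firm_b": 95.5, "firm_c": 143.2}

# Time series: one unit (firm_a) observed over several periods.
time_series = {2018: 101.3, 2019: 110.8, 2020: 120.0}

# Panel: a time series for each cross-sectional unit, keyed by (unit, period).
panel = {
    ("firm_a", 2018): 101.3, ("firm_a", 2019): 110.8, ("firm_a", 2020): 120.0,
    ("firm_b", 2018): 88.0, ("firm_b", 2019): 91.2, ("firm_b", 2020): 95.5,
    ("firm_c", 2018): 130.4, ("firm_c", 2019): 137.9, ("firm_c", 2020): 143.2,
}


def cross_section_from_panel(panel_data, period):
    """Slice a panel at one period to get cross-sectional data."""
    return {unit: value for (unit, p), value in panel_data.items() if p == period}


def time_series_from_panel(panel_data, unit):
    """Follow one unit through time to get a time series."""
    return {p: value for (u, p), value in panel_data.items() if u == unit}


def panel_is_balanced(panel_data):
    """Return True if every unit is observed in every period."""
    units = {unit for unit, _ in panel_data}
    periods = {period for _, period in panel_data}
    return len(panel_data) == len(units) * len(periods)
```

Slicing the panel at a single year recovers a cross section, and following a single firm recovers a time series, which is why panel data can be thought of as combining the other two types.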

The type of data you’re using may influence how you estimate your econometric model. In particular, specialized techniques are usually required to deal with time-series and panel data.

You can anticipate common econometric problems because certain CLRM assumption failures are more likely with particular types of data. Two typical cases of CLRM assumption failures involve heteroskedasticity (which occurs frequently in models using cross-sectional data) and autocorrelation (which tends to be present in models using time-series data).
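
One common way to look for autocorrelation in time-series residuals is the Durbin-Watson statistic, which this section doesn’t cover in detail. The sketch below is a minimal pure-Python version, assuming a simple one-variable regression; the data are made up so that the errors stay positive for ten periods and then negative for ten. Values near 2 suggest no first-order autocorrelation, whereas values well below 2 point toward positive autocorrelation.

```python
def ols_fit(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def durbin_watson(residuals):
    """Sum of squared changes in the residuals over the sum of squared residuals."""
    changes = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    return changes / sum(e ** 2 for e in residuals)


# Hypothetical time series: a linear trend plus errors that persist over time.
periods = list(range(1, 21))
errors = [1.0] * 10 + [-1.0] * 10
outcome = [2.0 + 0.5 * t + u for t, u in zip(periods, errors)]

intercept, slope = ols_fit(periods, outcome)
residuals = [y - (intercept + slope * t) for t, y in zip(periods, outcome)]
dw = durbin_watson(residuals)
print(f"Durbin-Watson statistic: {dw:.2f}")
```

Because neighboring residuals in this made-up series tend to share the same sign, the statistic comes out well below 2. A formal test would compare it against tabulated critical values, which this sketch doesn’t attempt.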

In addition to knowing the type of data you’re working with, make sure you’re always aware of the following information:

• The level of aggregation used in measuring the variables: The level of aggregation refers to the unit of analysis at which the information was collected. In other words, the variable measurements may originate at a lower level of aggregation (like an individual, household, or firm) or at a higher level of aggregation (like a city, county, or state).

• The frequency with which the data is captured: The frequency refers to the rate at which measurements are obtained. Time-series data may be captured at a higher frequency (like hourly, daily, or weekly) or at a lower frequency (like monthly, quarterly, or yearly).
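
To make the idea of frequency concrete, here is a small sketch that converts a higher-frequency series into a lower-frequency one by averaging. The monthly figures are hypothetical, and averaging is only one possible rule (sums or end-of-period values may suit other variables better).

```python
from statistics import mean

# Hypothetical monthly unemployment rates (percent) for one year.
monthly_rate = [5.0, 5.2, 5.4, 5.1, 4.9, 4.7, 4.8, 4.8, 4.8, 5.0, 5.3, 5.6]


def to_quarterly(monthly):
    """Average each block of three months into one quarterly value."""
    if len(monthly) % 3 != 0:
        raise ValueError("expected a whole number of quarters")
    return [mean(monthly[i:i + 3]) for i in range(0, len(monthly), 3)]


quarterly_rate = to_quarterly(monthly_rate)
print([round(q, 2) for q in quarterly_rate])
```

Twelve monthly observations collapse into four quarterly ones, so any month-to-month movement within a quarter is no longer visible in the lower-frequency series.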

All the data in the world won’t allow you to produce convincing results if the level of aggregation or frequency isn’t appropriate for your problem. For example, if you’re interested in determining how spending per pupil affects academic achievement, state-level data probably won’t be appropriate because spending and pupil characteristics have so much variation across cities within states that your results are likely to be misleading.
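
The following sketch illustrates the point with made-up numbers. It assumes hypothetical district-level records for two states, labeled A and B, and shows that averaging up to the state level can erase variation that is plainly visible within each state.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical district records: (state, spending per pupil, average test score).
districts = [
    ("A", 8000, 60), ("A", 12000, 70), ("A", 16000, 80),
    ("B", 9000, 65), ("B", 12000, 70), ("B", 15000, 75),
]


def group_by_state(rows):
    """Collect the (spending, score) pairs belonging to each state."""
    grouped = defaultdict(list)
    for state, spending, score in rows:
        grouped[state].append((spending, score))
    return grouped


grouped = group_by_state(districts)

# State-level data: one (mean spending, mean score) pair per state.
state_means = {
    state: (mean(s for s, _ in pairs), mean(sc for _, sc in pairs))
    for state, pairs in grouped.items()
}

# Within-state spread in spending that the state averages hide.
spending_range = {
    state: max(s for s, _ in pairs) - min(s for s, _ in pairs)
    for state, pairs in grouped.items()
}

print(state_means)
print(spending_range)
```

In this contrived example both states end up with identical averages, so state-level data would show no variation at all, even though spending and scores move together across districts within each state.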

Avoiding the data-mining trap

As you acquire more data-analysis tools, you may be inclined to search the data for relationships between variables, using your knowledge of statistics to find models that fit your data quite well. This practice is known as data mining, and you don’t want to be seduced by it!

Although data mining can be useful in fields where the underlying mechanism generating the outcomes isn’t important, most economists don’t view this approach favorably. In econometrics, building a model that makes sense and is reproducible by others is far more important than searching for a model that has a perfect fit.
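
A small simulation can show why a good fit found by searching isn’t persuasive on its own. The sketch below is only an illustration under assumed settings (20 observations, 50 candidate variables, a fixed random seed): every series is pure noise, so any correlation the search turns up is spurious by construction.

```python
import random


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n_obs, n_candidates = 20, 50

# The outcome and every candidate explanatory variable are unrelated noise.
outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
candidates = [
    [rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_candidates)
]

abs_corrs = [abs(correlation(c, outcome)) for c in candidates]
best = max(abs_corrs)                            # what a search would report
typical = sorted(abs_corrs)[n_candidates // 2]   # a middle-of-the-pack candidate

print(f"strongest correlation found by searching: {best:.2f}")
print(f"typical correlation among candidates:     {typical:.2f}")
```

The strongest correlation found by searching is, by construction, at least as large as the typical one, and with enough candidates it can look impressive despite reflecting nothing but chance. That is one reason a model grounded in economic reasoning is easier for others to reproduce than one selected purely for fit.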

Incorporating quantitative and qualitative information

Economic outcomes can be affected by both quantitative (numeric) and qualitative (non-numeric) factors. Generally, quantitative information has a straightforward application and interpretation in econometric models.

Qualitative variables are associated with characteristics that have no natural numeric representation, although your raw data may code qualitative characteristics with a numeric value. For example, a U.S. region may be coded with a 1 for West, 2 for South, 3 for Midwest, and 4 for Northeast. However, the assignment of the specific values is arbitrary and carries no special significance.

In order to utilize the information contained in qualitative variables, you’ll usually convert them into dummy variables — dichotomous variables that take on a value of 1 if a particular characteristic is present and 0 otherwise.
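
As a minimal sketch of that conversion, the code below turns the region codes from the example above into dummy variables. The function name and the `reference` parameter are hypothetical. Leaving out one category as a reference group is a common practice when the model includes an intercept, because including a dummy for every category would make the dummies perfectly collinear with the intercept.

```python
# Region coding from the example above; the numbers themselves are arbitrary.
REGION_LABELS = {1: "West", 2: "South", 3: "Midwest", 4: "Northeast"}


def make_dummies(codes, labels, reference=None):
    """Return one dict of 0/1 dummy variables per observation.

    If `reference` names a category, no dummy is created for it, so
    observations in that category have zeros for every dummy.
    """
    categories = [name for name in labels.values() if name != reference]
    return [
        {name: int(labels[code] == name) for name in categories}
        for code in codes
    ]


region_codes = [1, 3, 4, 2]  # hypothetical observations
rows = make_dummies(region_codes, REGION_LABELS, reference="Northeast")
for code, row in zip(region_codes, rows):
    print(REGION_LABELS[code], row)
```

With Northeast as the reference group, an observation from the Northeast has a 0 for all three dummies, and each remaining dummy’s coefficient would be interpreted relative to the Northeast.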

Sometimes the economic outcome itself is qualitative or contains restricted values. For example, your dependent variable could measure whether or not a firm fails (goes bankrupt) in a given year, with various firm characteristics serving as independent variables. Although standard techniques are sometimes acceptable with qualitative and noncontinuous dependent variables, they usually result in assumption violations, so you typically need an alternative econometric approach.
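
To see one way a standard technique can go wrong with a 0/1 outcome, the sketch below fits a straight line by OLS to a binary "failed" variable. The leverage scores and outcomes are hypothetical, and the single-regressor setup is an assumption made to keep the example short.

```python
def ols_fit(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical firms: a leverage score and whether the firm failed (1) or not (0).
leverage = [1, 2, 3, 4, 5, 6, 7, 8]
failed = [0, 0, 0, 0, 1, 1, 1, 1]

intercept, slope = ols_fit(leverage, failed)
fitted = [intercept + slope * x for x in leverage]

# Fitted values are meant to be read as probabilities of failure,
# so anything below 0 or above 1 can't be interpreted sensibly.
out_of_range = [round(p, 3) for p in fitted if p < 0 or p > 1]
print(out_of_range)
```

In this example the fitted values for the lowest and highest leverage scores fall outside the 0-to-1 range, which is one symptom of the problems that lead econometricians to use models designed for qualitative dependent variables.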