Statistics for Big Data For Dummies

You very rarely run across a dataset that does not include dates. Purchase dates, birthdates, update dates, quote dates, and the list goes on. In almost every context, some sort of date is required to get a full picture of the situation you are trying to analyze.

Dealing with dates can be a bit tricky, partly because of the variety of ways to store them and partly because, depending on what you're trying to do, you may need only part of the date. Here are a few common situations to look out for.

Dealing with datetime formats

For starters, most database management systems have an extremely precise way of storing dates internally: They use a datetime. This is exactly what it sounds like: a mashup of the date and the time. For example, a common format looks like this:

2014-11-24 14:25:44

That means 25 minutes and 44 seconds past 2 p.m. on November 24, 2014.
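The snippets in this article use Python purely for illustration; your own database or statistical package will have equivalent functions. As a minimal sketch, assuming the "YYYY-MM-DD HH:MM:SS" layout shown above, a datetime stored as text is typically parsed into a proper datetime value before you do anything else with it:

```python
from datetime import datetime

# Parse a text value laid out as "YYYY-MM-DD HH:MM:SS" into a real
# datetime object. The format string is an assumption about how the
# value is stored; adjust it to match your own data.
raw_value = "2014-11-24 14:25:44"
stamp = datetime.strptime(raw_value, "%Y-%m-%d %H:%M:%S")

print(stamp)        # 2014-11-24 14:25:44
print(stamp.hour)   # 14, that is, 2 p.m.
```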

All this detail in a datetime is rarely used in full. By far the most common user of the full detail is the database management system itself. It is common practice for databases to put a datetime stamp on every record to indicate when the record was created and when it was last updated. The New York Stock Exchange systems actually keep track of trade time stamps to even greater precision.

For most analytic applications, however, this is more detail than you want.

If you are analyzing a stock's closing price over time, you won't need anything beyond the day, or perhaps the month, associated with each closing price. If you are doing a demographic analysis of age distributions, the year of birth may be all that's relevant.

Birthdates provide a good example of something that you may encounter with datetime data. Even though data may be stored in a datetime field, it may be the case that only part of the field is really being used. Birthdates typically have the time portion defaulted to 00:00:00 for every record.

Luckily, both database systems and analytic software have built-in functions that allow you to extract only the portion of the datetime that is relevant to you. You can choose to extract only the date part, only the month and year, only the year, and so forth. And in fact, this is often done for you before you ever see the data.
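In SQL you would typically do this with functions such as EXTRACT; the Python sketch below shows the same idea, assuming a birthdate stored as a full datetime with the time portion defaulted to midnight (the date itself is made up):

```python
from datetime import datetime

# A birthdate stored as a full datetime, with the unused time portion
# defaulted to midnight.
birth = datetime(1975, 6, 3, 0, 0, 0)

date_only  = birth.date()               # 1975-06-03: drop the time portion
month_year = (birth.year, birth.month)  # (1975, 6): month-level grouping
year_only  = birth.year                 # 1975: enough for an age distribution

print(date_only, month_year, year_only)
```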

Taking geography into account

In the brave new world of the global economy, you will likely encounter data that has been collected from many different locations. Anyone who has ever tried to schedule an international conference call is well aware of the logistics involved in dealing with multiple time zones. More and more common nowadays are post-midnight conference calls with India.

One typical big data example involves supply chain management. Supply chain management is the ongoing process of trying to manage raw materials, inventories, distribution, and any other relevant aspect of a company's business. It's how Walmart keeps shelves stocked, how UPS keeps track of packages, and how Amazon manages to deliver almost anything imaginable almost anywhere.

In these examples, the analysis that underlies supply chain management needs to take into account that data is coming from different time zones. In situations like this, datetime data must be handled carefully.

Suppose a package is shipped from California at 10 a.m. on Wednesday and is delivered to its final destination in New York on Thursday at 10 a.m. If you are interested in analyzing delivery times, you need to take into account the time zone change. In this example, the delivery time is actually 21 hours, not 24.

When dealing with datetime data collected from different time zones, you can't simply compare different data points based on the raw data. You need to first make sure that all datetimes are represented in a common time zone. Which time zone you use is somewhat arbitrary, so long as all the data points are using the same one.
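As a rough sketch of the shipping example, here is how you might make that comparison in Python using time-zone-aware datetimes. The specific dates are made up, and the zoneinfo module is in the standard library from Python 3.9 on:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# Shipped from California at 10 a.m. Wednesday, delivered in New York
# at 10 a.m. Thursday (illustrative dates).
shipped   = datetime(2014, 11, 26, 10, 0, tzinfo=ZoneInfo("America/Los_Angeles"))
delivered = datetime(2014, 11, 27, 10, 0, tzinfo=ZoneInfo("America/New_York"))

# Subtracting two time-zone-aware datetimes puts them on a common
# reference before differencing, so the result is the true elapsed time.
elapsed = delivered - shipped
print(elapsed)  # 21:00:00, that is, 21 hours rather than 24
```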

There is one other geographic, or more accurately cultural, fact that you need to be aware of. Not all countries represent dates in the same way. The U.S. is actually fairly unusual in representing dates as month/day/year. Canada and most of Europe prefer the convention day/month/year. You may also run across variations that begin with the year.
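One practical consequence is that an ambiguous string such as "03/04/2014" should always be parsed with an explicit format. A quick Python illustration, using a made-up date:

```python
from datetime import datetime

raw = "03/04/2014"  # March 4 in the U.S., April 3 in most of Europe

us_style = datetime.strptime(raw, "%m/%d/%Y")  # month/day/year
eu_style = datetime.strptime(raw, "%d/%m/%Y")  # day/month/year

print(us_style.date())  # 2014-03-04
print(eu_style.date())  # 2014-04-03
```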

How your software thinks about dates

Dates are used in a variety of ways in data analysis. Sometimes, as with stock price analysis, their primary function is to put the observations in order from earliest to latest. But in other cases, they are used to measure time intervals.

In engineering, particularly in quality control applications, a key statistic is mean time to failure. This is simply the average life span of a part or product. For long-lived products, like car parts and light bulbs, this calculation requires the comparison of dates.

On the face of it, August 15, 2013 minus January 1, 2010 doesn't make much sense mathematically. We all know what is meant by this, but it takes some thinking to get the answer. For this reason, many statistical packages, when confronted with dates, immediately convert them into a number in order to facilitate comparisons. They do this by picking some starting point and calculating the number of days between that starting point and the date that is being converted.
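Languages like Python do this conversion behind the scenes, so subtracting two date values directly gives you the elapsed number of days, which is exactly what a mean-time-to-failure calculation needs. A minimal sketch, using the dates from the example above:

```python
from datetime import date

# Once dates are treated as numbers, "August 15, 2013 minus
# January 1, 2010" becomes a simple count of elapsed days.
lifespan = date(2013, 8, 15) - date(2010, 1, 1)
print(lifespan.days)  # 1322 days
```

The choice of starting point that sits behind this arithmetic varies from package to package.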

For example, one large statistical software maker, SAS, uses the date January 1, 1960 as its starting point. This date has the value 0. It stores every date as the number of days it is away from this starting point. Thus, SAS thinks of January 1, 1961 as 366 (remember, 1960 was a leap year, and January 1 is day 0, not day 1). The starting point is arbitrary and different software makers use different starting points, but the idea is the same.

One odd consequence of this convention is that if you look at the raw data, not only are all the dates integers, but they don't even have to be positive integers. In the SAS example, January 1, 1959 would be represented as –365.
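You can reproduce this convention yourself. The sketch below mimics the SAS starting point of January 1, 1960, though any other starting point would work the same way:

```python
from datetime import date

# Mimic the "days since an arbitrary starting point" storage scheme.
# SAS's starting point is January 1, 1960, which gets the value 0.
EPOCH = date(1960, 1, 1)

def to_day_number(d):
    """Days from the starting point to date d (negative for earlier dates)."""
    return (d - EPOCH).days

print(to_day_number(date(1960, 1, 1)))  # 0
print(to_day_number(date(1961, 1, 1)))  # 366, because 1960 was a leap year
print(to_day_number(date(1959, 1, 1)))  # -365
```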

In any case, this way of handling dates facilitates calculations. By converting the date to a number on input, the system avoids having to jump through hoops every time a calculation involving that date is performed.

About the book authors:

Alan Anderson, PhD, is a professor of economics and finance at Fordham University and New York University. He's a veteran economist, risk manager, and fixed income analyst.

David Semmelroth is an experienced data analyst, trainer, and statistics instructor who consults on customer databases and database marketing.
