Y2K Problems Still Haunt Us
By Jeffrey McArthur
One of the challenges of building a data warehouse is the problem of dirty data. While trying to import a flat file containing over 1.3 million records I ran into several Y2K problems. The file was over 425 Meg in size. The size of the file makes it difficult to edit.
The file was the set of transactions for a website that covering several years. The first problem with the import was seven records that had a year of 0004. Obviously, the records should have had the year of 2004, since there were no websites running in the year 0004. One limitation of Microsoft's Data Transformation Services (DTS) is that sometimes the error reporting is not as complete as you would like. The file was quote delimited and used the concat bar "|" as a field separator. The quick solution is to edit the file. For this, Epsilon from Lugaru (www.lugaru.com) is an invaluable tool. Epsilon was able to open the 425 Meg file, though even on a fast machine it takes a few seconds to open a file that large. The dates in the file followed the standard "MM/DD/YYYY" format. The regular expression capability of Epsilon solved the problem by replacing:
([0-9][0-9]/[0-9][0-9]/)0([0-9][0-9][0-9])
with
#1/2#2
This changed dates like 03/01/0004 to 03/01/2004. The file now imported using DTS. However, there were still other problems. One quality check is to select all the distinct date values from the database and sort them. The SQL statement to select the distinct dates looked something like this:
select distinct effectivedate from transactionfile order by effectivedate
That showed some problems. The more interesting query was:
select distinct effectivedate from transactionfile where year(effectivedate) < 1970 or year(effectivedate) > 2010 order by effectivedate
That SQL statement generated the following results:
Effectivedate |
03/01/1900 |
01/01/1920 |
01/01/2030 |
01/27/2035 |
10/10/2040 |
12/31/2040 |
12/31/2041 |
01/01/2049 |
02/17/2049 |
02/18/2049 |
02/28/2049 |
07/31/2049 |
12/30/2049 |
12/31/2049 |
12/31/2050 |
01/10/2051 |
12/31/2054 |
04/11/2055 |
05/19/2058 |
12/30/2060 |
07/19/2098 |
08/18/2106 |
10/31/2203 |
08/22/2301 |
12/31/2999 |
This was a transaction log for a website. Websites were not available in 1920 and earlier. The year 1900 might be a mistake and should be 2000. The year 2999 probably should be 1999. Both are Y2K problems.
It is impossible to determine from the data how dates 2030 to 2301 ended up in the transaction file. Errors occur. The key is to take a total quality management approach to resolving the errors. It is impossible to determine the proper values. However, it should be possible to prevent those errors from happening in the future. The TQM approach is continual improvement. Deming stated in point five, "Improve constantly and forever every process for planning, production, and service." By using TQM, someday we will eventually solve the hidden Y2K problems.
The author, Jeffrey McArthur, is President of JSM Software, a company focused on affordable business intelligence solution using Microsoft SQL Server. You can reach Jeffrey at jeffmcarthur@jsmsoftware.com. For more information on JSM Software, please see http://www.jsmsoftware.com or contact the sales department.