Y2K Problems Still Haunt Us

By Jeffrey McArthur

One of the challenges of building a data warehouse is the problem of dirty data. While trying to import a flat file containing over 1.3 million records I ran into several Y2K problems. The file was over 425 Meg in size. The size of the file makes it difficult to edit.

The file was the set of transactions for a website that covering several years. The first problem with the import was seven records that had a year of 0004. Obviously, the records should have had the year of 2004, since there were no websites running in the year 0004. One limitation of Microsoft's Data Transformation Services (DTS) is that sometimes the error reporting is not as complete as you would like. The file was quote delimited and used the concat bar "|" as a field separator. The quick solution is to edit the file. For this, Epsilon from Lugaru (www.lugaru.com) is an invaluable tool. Epsilon was able to open the 425 Meg file, though even on a fast machine it takes a few seconds to open a file that large. The dates in the file followed the standard "MM/DD/YYYY" format. The regular expression capability of Epsilon solved the problem by replacing:

([0-9][0-9]/[0-9][0-9]/)0([0-9][0-9][0-9])

with

#1/2#2

This changed dates like 03/01/0004 to 03/01/2004. The file now imported using DTS. However, there were still other problems. One quality check is to select all the distinct date values from the database and sort them. The SQL statement to select the distinct dates looked something like this:

select distinct
   effectivedate
from
   transactionfile
order by
   effectivedate

That showed some problems. The more interesting query was:

select distinct
   effectivedate
from
   transactionfile
where
   year(effectivedate) < 1970
   or
   year(effectivedate) > 2010
order by
   effectivedate

That SQL statement generated the following results:

Effectivedate

03/01/1900

01/01/1920

01/01/2030

01/27/2035

10/10/2040

12/31/2040

12/31/2041

01/01/2049

02/17/2049

02/18/2049

02/28/2049

07/31/2049

12/30/2049

12/31/2049

12/31/2050

01/10/2051

12/31/2054

04/11/2055

05/19/2058

12/30/2060

07/19/2098

08/18/2106

10/31/2203

08/22/2301

12/31/2999

This was a transaction log for a website. Websites were not available in 1920 and earlier. The year 1900 might be a mistake and should be 2000. The year 2999 probably should be 1999. Both are Y2K problems.

It is impossible to determine from the data how dates 2030 to 2301 ended up in the transaction file. Errors occur. The key is to take a total quality management approach to resolving the errors. It is impossible to determine the proper values. However, it should be possible to prevent those errors from happening in the future. The TQM approach is continual improvement. Deming stated in point five, "Improve constantly and forever every process for planning, production, and service." By using TQM, someday we will eventually solve the hidden Y2K problems.

The author, Jeffrey McArthur, is President of JSM Software, a company focused on affordable business intelligence solution using Microsoft SQL Server. You can reach Jeffrey at jeffmcarthur@jsmsoftware.com. For more information on JSM Software, please see http://www.jsmsoftware.com or contact the sales department.

White Papers Calculating Deciles Valid PHP Email Form Database Frequency Time Management Using Excel Y2K Problems Still Haunt Us
Windows Mobile 6.5