Friday 1 April 2011

No Such Things As Data Rules?

I found an interesting blog entry by John Owens titled "There Are No Such Things As Data Rules".  I find John's articles compelling and much of his material I agree with, but not quite all of it.  This article of his is in the second category for me.

I think I know where John is coming from in his approach to this subject.   Data quality analysts who spend all of their time with their heads buried in the data making decisions on what records are "right" or "wrong" are not what data quality improvement is about.  This approach can create as many new data problems as it resolves.  However, I don't believe we solve this problem by going to the other extreme and asserting that simply looking at the data for problems accomplishes nothing.

I want to focus on a few of John's specific points, some of which are found in his follow-up comments after his initial article:

a.   "There are no data rules …. [There are] business function rules."
      "The only reason any attribute would have rules ... would be to support a known, documented business function."

In order for these statements to be true, every last data rule would have to be documented as part of each business function that uses it.  For instance, every business function that required an email address would have to restate that all email addresses must have one and only one "@" symbol somewhere in the middle.  Why would any business want to duplicate that definition all over its business functional documents?  What value does it create?

Further, there is no business need that requires email addresses to have one and only one "@" symbol.  The business requirement might be to communicate with a client quickly and electronically, and nothing more.  The fact that the chosen solution to meet that need is email does not introduce a new business rule.   It introduces a technical rule specific to that solution, namely that email standards must be followed in order to complete a successful email communication.  If a different technical solution is chosen (such as SMS text messaging via wireless phones), the technical standards would change accordingly, but the business requirements and rules would not change.

Shared technical standards are best documented and evaluated centrally, and calling them "data rules" may not be the best name, but it's certainly more accurate than calling them business functional rules.
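To make the "document it once, centrally" idea concrete, here is a minimal Python sketch. The function names and the two business functions are hypothetical, and the check covers only the single-"@" rule discussed above; it is not a full email validator.

```python
def has_single_at_sign(email: str) -> bool:
    """Shared technical rule: exactly one '@', with text on both sides."""
    local, sep, domain = email.partition("@")
    return sep == "@" and "@" not in domain and bool(local) and bool(domain)


# Two hypothetical business functions reuse the shared rule
# instead of restating it in their own documentation.
def can_email_invoice(email: str) -> bool:
    return has_single_at_sign(email)


def can_email_newsletter(email: str) -> bool:
    return has_single_at_sign(email)
```

If the technical solution changed (say, to SMS), only the shared rule would be swapped out; the business functions that call it would stay the same.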

b.   "If you want to understand the data, you must understand the business functions."  
      "Data only has meaning ... when viewed in the context of supporting the business functions of the enterprise."
      "Data can only be said to be erroneous if it fails to meet the needs of the business function(s) that utilise it."

These statements are partially true.  There are certain data questions that require business context in order to arrive at the right answers.  For example, one can never answer the question, "Is this data precise enough?" without putting it into a business context.  However, not all data questions require business context.  Meaning can sometimes actually exist in the data alone.  More importantly, meaninglessness can sometimes be determined from the data alone.

For example, postal codes are a mandatory part of any Canadian mailing address, such as K1A 0A6 (the Prime Minister's office in Ottawa).  The business purpose for capturing a customer's postal code may be to send the customer's bill to them by mail, or it may be to assign the customer to a geographical location for a mapping analysis to set regional boundaries for salespeople.  For the first purpose, all 6 characters of the postal code are required or the bill doesn't arrive at the address.  For the second purpose, the sales department only uses the first 3 characters of the postal code (called the Forward Sortation Area), which provide sufficient location accuracy for assigning sales region boundaries.  Therefore, a partial entry of K1A in the Customer Postal Code field would be insufficiently precise for mailing purposes, but sufficiently precise for assigning regional sales boundaries.  Answering the precision question requires business context.
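The precision question can be sketched as a check that takes the business purpose as an input. This is only an illustration: the purpose labels "mailing" and "sales_region" and the function name are made up for the example, and the patterns assume the letter-digit-letter, digit-letter-digit layout seen in K1A 0A6.

```python
import re

# Assumed layout: letter-digit-letter, optional space, digit-letter-digit.
FULL_CODE = re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d")
# Forward Sortation Area: the first three characters only.
FSA_PREFIX = re.compile(r"[A-Z]\d[A-Z]")


def precise_enough(postal_code: str, purpose: str) -> bool:
    """Answer 'is this precise enough?' for a given (hypothetical) purpose."""
    code = postal_code.strip().upper()
    if purpose == "mailing":
        # Mailing needs the complete six-character code.
        return FULL_CODE.fullmatch(code) is not None
    if purpose == "sales_region":
        # Sales regions only need the first three characters.
        return FSA_PREFIX.match(code) is not None
    raise ValueError(f"unknown purpose: {purpose}")
```

The same entry, K1A, fails the first purpose and passes the second, which is exactly why the function cannot answer without being told the purpose.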

However, because Canada Post has defined that no postal codes shall begin with certain letters, including I, O, and D (to avoid confusion with the numbers 1 and 0), we can be sure any entry in the Customer Postal Code field beginning with I or O or D is meaningless, regardless of the business purpose.  Answering the validity question sometimes requires no knowledge of the business purpose whatsoever.
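By contrast, a validity check of this kind needs no purpose argument at all. A minimal sketch, with a hypothetical function name, and with the set limited to the letters named above rather than Canada Post's full specification:

```python
# Only the first letters named in the text; illustrative, not a complete list.
FORBIDDEN_FIRST_LETTERS = {"I", "O", "D"}


def starts_with_forbidden_letter(postal_code: str) -> bool:
    """True if the entry cannot be a real postal code, whatever it is used for."""
    code = postal_code.strip().upper()
    return bool(code) and code[0] in FORBIDDEN_FIRST_LETTERS
```

Note that the signature has no business-context parameter: the answer is the same for mailing, mapping, or any other use.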

c.   "If a business function requires [data] to be collected ..., then it must enforce all applicable data rules at the point of creation."

I wish this could be true in the real world, but in many situations it simply cannot be implemented successfully.

Implementing it technically is easy -- design your system to force the salesperson to enter a valid customer postal code before moving to the next screen.  However, if the customer who is standing with the salesperson recently moved and hasn't memorized his new postal code, does that mean you should prevent that sale from occurring?  Is a valid postal code more important to the company than a sale?  Probably not, so the employee will enter a fake (but valid!) postal code just to get to the next screen and complete the sale.  It doesn't take long for frustrated employees to realize that Santa Claus has a completely valid postal code -- H0H 0H0.  Suddenly a lot of the company's customers seem to be living at the North Pole!  Therefore, a rule enforced at the point of capture may improve the validity of postal code data, but it does not guarantee improved quality of the postal code information.  It may instead just frustrate both employees and customers.

The situation is a bit riskier if your organization is a hospital, and the postal code being captured is the patient's at the point of admission in an Emergency Room.  Do you really want to prevent a patient from being taken off the ambulance because he is unconscious and can't tell you his postal code?  If the registration clerk cannot leave the patient postal code blank in the registration system, then you can be sure a fake one will be entered so the intake process is not delayed.  The hospital's postal code is both valid and familiar to the registration clerks, so a lot of patients will appear to live at the hospital.  Oh, and Santa has lots of ER admissions too!
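Because these fake entries are perfectly valid, they can only be caught by looking at the data after the fact. One possible approach, sketched below, is to flag codes that are known placeholders or that account for an implausibly large share of records. The function name, the `max_share` threshold, and the placeholder set are all assumptions for illustration; a real threshold would depend on the organization's actual customer or patient geography.

```python
from collections import Counter


def suspicious_postal_codes(codes, known_placeholders, max_share=0.05):
    """Return {code: count} for known placeholders or over-represented codes.

    known_placeholders is expected in normalized form (uppercase, no spaces).
    """
    # Normalize so "H0H 0H0" and "h0h0h0" count as the same code.
    normalized = [c.replace(" ", "").upper() for c in codes if c]
    counts = Counter(normalized)
    total = len(normalized)
    flagged = {}
    for code, n in counts.items():
        if code in known_placeholders or n / total > max_share:
            flagged[code] = n
    return flagged
```

For a hospital, the placeholder set might hold Santa's H0H0H0 plus the hospital's own postal code. A flagged code is a prompt for follow-up rather than proof of an error, since some patients really may live next door.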

Given the complexity of technical data standards that have no direct relation to the business processes that use the data, and given the challenge of preventing erroneous data at the point of capture, data rules can play a helpful role in managing an organization's information assets.
