Monday, 21 July 2014

Data Quality in Datawarehouse



Data quality in datawarehouse is usually ignored .


Medical Records. All of us are affected by the quality of data in medical records. We want the diagnosis and the procedure to be correct and readable, both for our medical safety and for billing and insurance purposes. A major headache in the medical insurance industry is the correct identification of the provider physician, clinic, or hospital. This is another variant of the customer matching application I discussed previously

You want to find the same customer when the customer buys a second or a third product from you. Customer matching is a major issue in banking and healthcare, where separate customer (or patient) encounters are often listed separately. The average bank has great difficulty listing all of the separate accounts of a given individual, although the reverse process of listing all the individuals in a specific account causes no trouble

One way to implement this is by use of Quality screens

  1. Column Screens
  2. Structure screens
  1. Business rule screens

Column Screens

  1. Value must not be null , Should be of fixed length, Should not be within list of excluded values, Should confirm spell check

Structure screen
They gurantee that entire data has come to DW, Some data has not been missed. Like each patient information should have source of data .  Each data source present in patient is there in the data source master table. Making sure entire structure has some and some data is not misssed

Business Rules screens
Business rules can be applied to data to confirm that data is consistent with business rules. Like ordered quantity cannot be less than minimum quantity. Or put a busineess rule for checking that number of orders per month is not increased by more than 20% . The idea is to check whether we have inserted duplicates in system may be due to source system changes and that is causing all wrong data to show up

Example - Checking the number of patient plans . They should not exceed by huge number than the number of patients. If this occurs it means we might be duplicating data somewhere. They are like sanity checks after each load

How are Errors from Column Screens Treated

Usually best idea is to write them to log files and at the end tracking them into a fact table . But this is questionable because we are missing some data best idea is to tag the data saying its bad

Creation of Audit tables to track each record of Fact table ( For this the error logic has to be done real time rather than saving in log files )

We can keep column like completleness , validation, Version number saying which version of Screen this record belongs to .
Points to think are - What happens when updation are handled for the same record

1 comment: