Monday 21 July 2014

Data Quality in the Data Warehouse



Data quality in the data warehouse is usually ignored.


Medical Records. All of us are affected by the quality of data in medical records. We want the diagnosis and the procedure to be correct and readable, both for our medical safety and for billing and insurance purposes. A major headache in the medical insurance industry is the correct identification of the provider (physician, clinic, or hospital). This is another variant of the customer matching application I discussed previously.

You want to find the same customer when the customer buys a second or a third product from you. Customer matching is a major issue in banking and healthcare, where separate customer (or patient) encounters are often listed separately. The average bank has great difficulty listing all of the separate accounts of a given individual, although the reverse process of listing all the individuals in a specific account causes no trouble.
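To make the matching problem concrete, here is a rough Python sketch that groups account rows on a naive normalized key (name plus date of birth). The field names are made up, and real matching engines use fuzzy comparison on address, phone and identifiers plus survivorship rules, so treat this only as an illustration.

# Minimal sketch of grouping separate accounts that appear to belong to
# one person, using a crude normalized key. Field names are hypothetical.
from collections import defaultdict

def match_key(record):
    """Lowercased, punctuation-free name plus date of birth."""
    name = "".join(ch for ch in record["name"].lower() if ch.isalnum())
    return (name, record["dob"])

def group_customers(records):
    """Group account rows sharing the same naive match key."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)
    return groups

accounts = [
    {"name": "John A. Smith", "dob": "1970-01-01", "account": "CHK-100"},
    {"name": "john a smith", "dob": "1970-01-01", "account": "SAV-200"},
]
for key, rows in group_customers(accounts).items():
    print(key, [r["account"] for r in rows])
# ('johnasmith', '1970-01-01') ['CHK-100', 'SAV-200']

A key this naive would still fail to merge a row spelled "SMITH, JOHN A", since the reordered name tokens produce a different key, which is exactly why real customer matching needs fuzzier comparison.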

One way to implement data quality checks in the warehouse is with quality screens:

  1. Column screens
  2. Structure screens
  3. Business rule screens

Column Screens

  1. The value must not be null.
  2. The value must be of the expected length.
  3. The value must not appear in a list of excluded values.
  4. The value must pass spell checks.
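A minimal sketch of such a column screen in Python, assuming rows arrive as plain dictionaries; the field name patient_id and the concrete rules are only illustrative, not from any particular ETL tool.

# A column screen applies per-column validity rules to every incoming row.
# The rules mirror the list above (not null, expected length, not in an
# excluded-value list); the field name "patient_id" is hypothetical.
EXCLUDED_IDS = {"UNKNOWN", "N/A", "TEST"}

def column_screen(row):
    """Return a list of column-level errors for one staged row."""
    errors = []
    pid = row.get("patient_id")
    if pid is None:
        errors.append("patient_id is null")
    else:
        if len(pid) != 10:
            errors.append("patient_id is not 10 characters long")
        if pid.upper() in EXCLUDED_IDS:
            errors.append("patient_id is an excluded value")
    return errors

print(column_screen({"patient_id": "UNKNOWN"}))
# ['patient_id is not 10 characters long', 'patient_id is an excluded value']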

Structure Screens
They guarantee that the entire expected data set has arrived in the data warehouse and that nothing has been silently dropped. For example, every patient record should carry a source-of-data code, and every source code present in the patient data should exist in the data-source master table. The point is to make sure the overall structure is complete and no slice of data is missing.
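To sketch what a structure screen might look like in practice (the table and column names here are assumptions, not a real schema): check that every source-of-data code seen in the patient feed exists in the data-source master table, so a whole slice of data cannot silently go missing.

# A structure screen checks that the load is structurally complete, e.g.
# every source_system code in the patient staging rows must exist in the
# data-source master table. All names below are illustrative.
def structure_screen(patient_rows, source_master_rows):
    """Return source codes present in the feed but absent from the master."""
    known = {s["source_code"] for s in source_master_rows}
    seen = {p["source_system"] for p in patient_rows}
    return seen - known

patients = [{"patient_id": "P1", "source_system": "EMR"},
            {"patient_id": "P2", "source_system": "LAB"}]
master = [{"source_code": "EMR"}, {"source_code": "CLAIMS"}]
print(structure_screen(patients, master))   # {'LAB'}  -> structural error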

Business Rule Screens
Business rules can be applied to the data to confirm that it is consistent with them. For example, the ordered quantity cannot be less than the minimum quantity, or the number of orders per month should not increase by more than 20%. The idea is to detect whether we have inserted duplicates into the system, perhaps because of a source-system change, which would cause wrong numbers to show up downstream.

Example: check the number of patient plans. It should not exceed the number of patients by a huge margin; if it does, we are probably duplicating data somewhere. These screens act as sanity checks after each load.
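Here is a rough sketch of such a sanity check in Python, using the 20% month-over-month rule mentioned above; the monthly counts would come from the warehouse itself, and the threshold is just the example figure.

# Business-rule screen run after each load: flag a month whose order count
# grew by more than 20% over the previous month, which often signals
# accidental duplication caused by a source-system change.
def order_growth_screen(monthly_counts, max_growth=0.20):
    """monthly_counts: list of (month, order_count) in chronological order."""
    alerts = []
    for (prev_m, prev_n), (cur_m, cur_n) in zip(monthly_counts, monthly_counts[1:]):
        if prev_n and (cur_n - prev_n) / prev_n > max_growth:
            alerts.append(f"{cur_m}: orders grew {(cur_n - prev_n) / prev_n:.0%} over {prev_m}")
    return alerts

print(order_growth_screen([("2014-05", 1000), ("2014-06", 1050), ("2014-07", 1400)]))
# ['2014-07: orders grew 33% over 2014-06']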

How Are Errors from Column Screens Treated?

The usual approach is to write them to log files and, at the end of the load, track them in an error-event fact table. But simply dropping the offending rows is questionable, because then we are missing data; a better idea is to load the data anyway and tag it as bad.

Another option is to create audit tables that track each record of the fact table (for this, the error logic has to run in real time during the load rather than being saved to log files).

We can keep columns such as completeness, validation, and a version number saying which version of the quality screens this record was checked against.
A point to think about: what happens when updates are applied to the same record?
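A small sketch of the tagging idea, with the audit columns from above attached to each fact row as it is screened; the column values and screen version label are illustrative, and the handling of later updates to the same record is left open, as noted.

# Tag each fact row with audit columns instead of silently dropping it.
# The audit fields (completeness, validation, screen_version) follow the
# columns suggested above; SCREEN_VERSION records which version of the
# quality screens produced the tags.
SCREEN_VERSION = "v1"

def tag_row(row, column_errors):
    """Attach audit columns to a fact row based on quality-screen results."""
    row["completeness"] = "COMPLETE" if row.get("patient_id") else "MISSING_KEY"
    row["validation"] = "PASS" if not column_errors else "FAIL: " + "; ".join(column_errors)
    row["screen_version"] = SCREEN_VERSION
    return row

fact = {"patient_id": None, "plan_id": "PLAN-7"}
print(tag_row(fact, ["patient_id is null"]))
# {'patient_id': None, 'plan_id': 'PLAN-7', 'completeness': 'MISSING_KEY',
#  'validation': 'FAIL: patient_id is null', 'screen_version': 'v1'}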

Tuesday 15 July 2014

PGA size for Data Warehouse ETL loads






Having a large db_cache_size does not benefit parallel large-table full-table scans, because those require memory in the pga_aggregate_target region instead. With Oracle, you can use the multiple data buffer features to segregate and cache your dimension tables and indexes while still providing sufficient RAM for the full scans. When the processing mode changes (evening ETL and rollups), Oracle AMM will automatically detect the change in data access and re-allocate the RAM regions to accommodate the current processing.

64-bit servers have a larger word size that allows up to 2^64 bytes (roughly 18 billion GB) of addressable RAM, so you may be tempted to create a super-large RAM data buffer. But data warehouse systems tend to bypass the data buffers because of parallel full-table scans, so maximizing disk I/O throughput, not buffer size, is the most critical concern.

As we know, the choice between a hash join, star transformation join, and a nested loop join depends on several factors:

  1. The relative number of rows in each table
  2. The presence of indexes on the key values
  3. The settings for static parameters such as index_caching and cpu_costing
  4. The current setting and available memory in pga_aggregate_target
As we know, hash joins do not use indexes and perform full-table scans (often using parallel query). Hence, hash joins with parallel full-table scans tend to drive up CPU consumption.


Be aware that in Oracle 9i, when you set pga_aggregate_target, no single hash join may consume more than five percent of that area. If you have specialized hash joins that require more hash area, the _PGA_MAX_SIZE parameter discussed below becomes relevant.

The _PGA_MAX_SIZE parameter does not control the total PGA memory used by the whole system.

_PGA_MAX_SIZE controls the maximum PGA memory used per process, which by default is 200 MB.
A work area is a large extent of the memory available in the Program Global Area (PGA).
Each time a cursor is executed, a new runtime area is created for that cursor in the PGA memory region of the server process executing it.

One process can have many work areas.

The size of a single work area is limited by default to a maximum of 100 MB (_PGA_MAX_SIZE/2) for serial operations. That cap applies when PGA_AGGREGATE_TARGET >= 2 GB; otherwise a work area will use no more than 5% of PGA_AGGREGATE_TARGET. In other words, the limit is min(5% of PGA_AGGREGATE_TARGET, 100 MB).
The size for a single work area is limited to 30% of the PGA_AGGREGATE_TARGET/DOP for parallel operations, where DOP is Degree of Parallelism.

If, for example, you have set PGA_AGGREGATE_TARGET to 1 GB, then Oracle cannot give more than about 50 MB to a single work area (for serial operations).
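As a quick sanity check of that arithmetic, here is a small Python sketch of the rule of thumb described above (5% of PGA_AGGREGATE_TARGET capped at _PGA_MAX_SIZE/2 for serial work areas, 30% of PGA_AGGREGATE_TARGET/DOP for parallel ones). It only reproduces the rule from this post, not the exact behaviour of the Oracle kernel.

# Rough calculation of the default per-work-area caps described above.
# Serial:   min(5% of PGA_AGGREGATE_TARGET, _PGA_MAX_SIZE / 2)
# Parallel: 30% of PGA_AGGREGATE_TARGET / DOP
MB = 1024 * 1024
GB = 1024 * MB

def serial_work_area_limit(pga_aggregate_target, pga_max_size=200 * MB):
    return min(0.05 * pga_aggregate_target, pga_max_size / 2)

def parallel_work_area_limit(pga_aggregate_target, dop):
    return 0.30 * pga_aggregate_target / dop

print(serial_work_area_limit(1 * GB) / MB)           # 51.2  -> roughly the 50 MB above
print(serial_work_area_limit(4 * GB) / MB)           # 100.0 -> capped at _PGA_MAX_SIZE/2
print(parallel_work_area_limit(1 * GB, dop=4) / MB)  # 76.8  per parallel slave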

If you want to use more than 100 MB for a single work area, then raising _PGA_MAX_SIZE may help you achieve better results for your sort tasks and let the biggest ones complete entirely in memory.