Wednesday 13 November 2013

ETL Target DB Optimization Techniques


Hi Guys,

I happened to read about Informatica target DB optimization a few days back. It mentioned five points to speed up the load, so I went through each of them, and below are my thoughts.

1) Optimizing the DB by storing indexes and tables in different tablespaces

I read Ask Tom and I disagree. We have RAID now, which stripes data across the disks and gives consistent I/O; keeping indexes and tables apart was a 1980s strategy for balancing disk I/O.

A datafile is a physical file and a tablespace is a logical structure. A datafile contains data belonging to only one tablespace, and with RAID the datafile is striped across the disks anyway.

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1463404632043
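For illustration, this is roughly what that old advice looks like in DDL; the tablespace and object names below are made up:

    -- Old-style advice: put the table and its index in separate tablespaces.
    -- With RAID striping underneath, both tablespaces typically sit on the
    -- same set of disks anyway, which is why this adds little today.
    CREATE TABLE sales_fact (
        sale_id   NUMBER,
        sale_date DATE,
        amount    NUMBER
    ) TABLESPACE data_ts;

    CREATE INDEX sales_fact_ix ON sales_fact (sale_date)
        TABLESPACE index_ts;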


2) Consider that your loads frequently fail and you need to delete the data that was partially loaded into the fact. In such cases, if it is possible (that is, you are doing a truncate and load of the fact based on your source system), you can go for a table swap: create a second fact table, populate it, and when the load is complete swap it in (a small SQL sketch of the swap is under point 12 further below).

This type of arrangement is typically required when data is deleted from the source system. Typical change capture does not help in such cases; change capture generally works for inserts and updates in the source, not for deletes.

Below are a few questions on loading strategies

        1) What is the load strategy followed for staging?
Sub Q1 - What if the source is SQL Server and the target is Oracle? What is the best technique? Usually an unload to a file and a reload can be considered if the data is on a remote server.
Sub Q2 - What if we are loading from files (Oracle SQL*Loader, Teradata FastLoad)?
Sub Q3 - If we are loading from source to target through transformations, does the data always reside on the intermediate ETL server?

Why do we need to use a utility like SQL*Loader?
Sub Q4 - What does the configuration (control) file for SQL*Loader look like?
Sub Q5 - Why is SQL*Loader faster than using an ETL tool?
http://cognossimplified.blogspot.com/2013/11/bulk-loading-of-data-and-performance.html

Note - You can't roll back a bulk load, and you should not have a primary key on the table while loading.
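As a rough illustration of that note (object names are made up), the primary key and indexes can be disabled before the bulk load and put back afterwards:

    -- Disable the constraint and mark indexes unusable before the bulk load ...
    ALTER TABLE stg_patient DISABLE CONSTRAINT stg_patient_pk;
    ALTER INDEX stg_patient_ix UNUSABLE;

    -- ... run the bulk load here ...

    -- ... then rebuild and re-enable once the load is complete.
    ALTER INDEX stg_patient_ix REBUILD;
    ALTER TABLE stg_patient ENABLE CONSTRAINT stg_patient_pk;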


Sub Q1 Solution - Consider the target is Oracle: the load of the file can be done by SQL*Loader, and when the source is SQL Server the unload can be done by a SQL Server utility. So we unload to a file using the utility and then load that file into Oracle with the loader. Usually an ETL tool can also be used to unload from SQL Server and load into Oracle.
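To make Sub Q4 concrete, a minimal SQL*Loader control file might look like the sketch below; the file, table and column names are assumptions for illustration (on the SQL Server side the unload would typically be done with bcp):

    -- patient.ctl : minimal SQL*Loader control file
    LOAD DATA
    INFILE 'patient.csv'
    TRUNCATE
    INTO TABLE stg_patient
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    (patient_id,
     patient_name,
     state_name,
     last_update_ts DATE "YYYY-MM-DD HH24:MI:SS")

    -- invoked roughly as:
    -- sqlldr userid=stg_user/stg_pwd control=patient.ctl log=patient.log direct=true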

Note - We usually truncate and load the staging area. We pick the latest records based on the update timestamp. The DW always maintains history, so there is no need to maintain history in staging.
The idea of going for a stage in scenarios where both source and target are Oracle is to minimise the load on the source system if the load fails and we have to run it again.
We load the file from the source system to the Pentaho server and then from the Pentaho server into Oracle using SQL*Loader. In our case this might not be required since both source and target are Oracle, but where the source is SQL Server and the target is Oracle this approach works better.
Point to ponder - While loading the staging patient table we do not directly copy the patient table from the source system. The source system is always in third normal form, so we resolve the lookup values at this point. For example, if the source patient table has a state key, we fetch the state value here instead of just carrying the key and joining it later while loading the data warehouse.
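A rough sketch of that idea, with made-up table and column names:

    -- Resolve the state key to its value while loading staging, so the DW
    -- load does not have to join back to the source lookup table later.
    INSERT INTO stg_patient (patient_id, patient_name, state_name, last_update_ts)
    SELECT p.patient_id,
           p.patient_name,
           s.state_name,                 -- the value, not the source key
           p.last_update_ts
    FROM   src_patient p
    JOIN   src_state   s ON s.state_key = p.state_key
    WHERE  p.last_update_ts > :last_extract_ts;   -- latest records only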

    1) Load strategy for the DW from staging
    2) Sub Q1 - How are staging tables merged to form the DW tables?
  
    3) On what criteria are tables designed in the DW?
    4) Do you want to keep all the keys from the source system in your data warehouse? The source system is in third normal form.
  
    5) Why are we loading msr val?
    6) Why is staging joined with the DW?
    7) Why is the DW joined with the DM?
  
    8) How to handle it if the fact table load fails in the middle.
    Ans - If a dimension load fails in the middle there is no worry; we can just restart the load, because it always compares data with the source (change data capture). But in the case of the fact we are not checking with the source: we just insert new records, and we never update the fact because the huge number of records makes updates very time consuming.
  
    So if the fact load fails, we need to delete today's data based on the timestamp and insert the new data, roughly as sketched below.
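    A hedged sketch of that recovery (the fact table and timestamp column names are assumptions):

        -- Remove whatever was partially loaded today, then re-run the insert.
        DELETE FROM fact_visit
        WHERE  load_timestamp >= TRUNC(SYSDATE);
        COMMIT;
        -- ... re-run the normal fact insert for today ...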
  
  
    9) How to handle deletes from the source in the fact
    Ans - In the fact load we never compare against the fact to see whether a relationship has been terminated. Consider an employee dimension and a work location dimension: the only connection between employee and work location is through the fact, so if the relationship ends in the source table there is no way of ending it in the fact. It has to be done by deleting the entry from the fact using delete scripts, or by creating a factless fact.
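    One possible shape of such a delete script, assuming made-up table names and that the fact still carries keys that can be compared back to the source:

        -- Remove fact rows whose employee / work-location relationship
        -- no longer exists in the source.
        DELETE FROM fact_assignment f
        WHERE NOT EXISTS (
            SELECT 1
            FROM   src_assignment s
            WHERE  s.employee_id      = f.employee_id
            AND    s.work_location_id = f.work_location_id
        );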
  
    10) How will you find out how much data (in MB) is loaded every day?
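    One rough way to estimate it, assuming a load timestamp on the fact and reasonably fresh optimizer statistics, is rows loaded today multiplied by the average row length:

        -- Rows loaded today * average row length (from statistics), in MB.
        SELECT COUNT(*) *
               (SELECT avg_row_len FROM user_tables WHERE table_name = 'FACT_VISIT')
               / 1024 / 1024 AS mb_loaded_today
        FROM   fact_visit
        WHERE  load_timestamp >= TRUNC(SYSDATE);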
  
    11) Sequence values in the procedure take time to generate
    Ans - Make sure you cache enough sequence keys, otherwise the inserts will take time. You will be able to find such root causes by making use of the session views in Oracle. Check out the below link for details.
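    For example (the sequence name is an assumption), bumping the cache is a one-liner:

        -- A bigger cache means fewer dictionary updates while rows are inserted;
        -- heavy waits such as 'enq: SQ - contention' in v$session_event can point here.
        ALTER SEQUENCE fact_visit_seq CACHE 1000;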
  
  
    12) Using a table swap for the fact load is better than a direct fact load
  
    Ans - This idea can be considered when you are doing a truncate and load of the fact due to the source system's inability to point out new records. Caution - there will be a lag of a few seconds while you rename the tables, and you have to be sure that this is agreeable.
    (Example case - some of the source system tables are populated by ETL from another system, so the keys are not primary keys but surrogate keys, and only the active records are truncated and loaded in the source table.)
  
    Advantage - Usually, whenever we truncate and load the fact, the system is not available to users until we finish the fact load. But in this case yesterday's fact is still available for use, so the severity of the issue is reduced.
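    A minimal sketch of the swap itself, with assumed table names:

        -- Load the new copy while users keep querying the current fact,
        -- then swap with two renames (expect a brief lag while they run).
        ALTER TABLE fact_sales     RENAME TO fact_sales_old;
        ALTER TABLE fact_sales_new RENAME TO fact_sales;
        -- fact_sales_old still holds yesterday's data as a fallback;
        -- views or synonyms that reference the table by name may need a check.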
  
    13) If you use the wrong data type in the predicate, will the index be used (e.g. a = '1')?
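    A sketch of what the question is getting at, with assumed names and types - Oracle applies its implicit conversion either to the indexed column (which defeats a plain index) or to the literal (which does not):

        -- Assume patient_code is VARCHAR2 with a normal B-tree index.
        -- Oracle rewrites this as TO_NUMBER(patient_code) = 1, wrapping the
        -- indexed column in a function, so the index is not used.
        SELECT * FROM stg_patient WHERE patient_code = 1;

        -- With matching data types the index can be used as expected.
        SELECT * FROM stg_patient WHERE patient_code = '1';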
  
    14) Can change capture detect deleted records?
    Ans - No. The change capture we use works for dimensions and not for facts, and as noted under point 2 above it picks up inserts and updates in the source, not deletes.
  
    15) How can you do masking of data?

16) How does change data capture happen, and how can we speed up the data lookup in such cases?
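The change capture described in this post is timestamp based; here is a sketch of the lookup with assumed names, and the usual way to speed it up (an index on the update timestamp column):

    -- Pull only the rows changed since the last successful extract.
    SELECT *
    FROM   src_patient
    WHERE  last_update_ts > :last_extract_ts;

    -- The lookup stays fast only if the timestamp column is indexed.
    CREATE INDEX src_patient_upd_ix ON src_patient (last_update_ts);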





