Saturday 9 November 2013

Bulk loading of data and Performance improvements

Hi

ETL tools like inforamtica when they say bulk loading they use sql loader/Fast load teradata and so on. sql loader will read data from a file and insert it into database. Now everytime we use a informatica transformation it will read the data from source and prepare a file on server and when you select bulk load it will read from this cache file to target. now i know bulk load can give a good speed but consider this . For going from oracle OLTP into datawarehouse when both are on same server you are first goint to ETL tool server with data and then to datawarehouse this will slow you down. Even with bulk inserts

When loading data its best to use a procedure oracle has facilities like bulk collection which will help you to insert data much faster than other techniques . Drawback you might need a db link to move data from one schema to other which your DBA might not allow.

For SQL loader you need a control file in case of informatica it creates this control file to read data from its cache.

Note - You cant rollback bulk load and you should not have primary key on the table while loading.

Below is a link for 

Note - Sometime rather than calling Sql Loader/Fast load teradata from ETL tool ( diff tools give a different name to it like informatinca calls bulk load) its faster to do load using procedure as data does not have to travel to ETL tool server. This is true in cases when the time taken for data to move to ETL server is large


Bulk loading using procedure( without using sql loader utility)  -- The FORALL(used for bulk loading) statement is not a loop; it is a declarative statement to the PL/SQL engine: “Generate all the DML statements that would have been executed one row at a time, and send them all across to the SQL engine with one context switch.”

http://www.oracle.com/technetwork/issue-archive/2012/12-sep/o52plsql-1709862.html

Best practices for bulk collect

http://www.oracle.com/technetwork/issue-archive/2008/08-mar/o28plsql-095155.html

Below is a link for writing control file for SQL loader

Usually in tools like pentaho there is no direct bulk load option. You need to write the control file





 A Thought that got triggered.... which is faster ETL tool or Stored procedures

That is not a logical question in first place. ETL tools are not used because they are faster than procedure in some cases they can be actually slower. But there are some very good reason for them to be there apart from speed. Speed is the least importance when considering advantage.

Ok. You have to have 2 similar system to measure that. We do not use ETL tools because they give the best of performance which you cant get using store procedure. You can write a good procedure which can be better than ETL tool . ( Because if you have both source and target on same db server its much faster to move the data rather than moving it to ETL tool server and then going to target db) so depends on situation.

Below is good article from Ralph kimbel site. do read


http://www.kimballgroup.com/2008/04/06/should-you-use-an-etl-tool/

Point he mentiones are 

1)Visual flow and self-documentation
2)Structured system design
3)Operational resilience.
4)Data-lineage and data-dependency functionality
5)Advanced data cleansing functionality.
6)Performance ( last point -- as its not the most important point to be considered)

Assumption -- both source and datawarehouse are on same RDMS (Oracle, Sql server etc)


By going through this i got a thought which architecture is good

1) Load all the data from Source to Staging. I am actually thinking of some way we can directly dump entire source system with minimum time as possible

2) Have Staging area 2 to keep history and all

Good Article on Using cursors in PL/SQL

http://www.oracle.com/technetwork/issue-archive/2013/13-mar/o23plsql-1906474.html



No comments:

Post a Comment