Saturday, 12 April 2014

ETL load and Linux server performance

Hi Guys,

One of the things that comes into the picture is the speed at which the ETL server creates the data file that has to be loaded. If the data file is around 20 GB because you are pulling from one database such as Oracle and loading the data into Teradata, then the speed at which the ETL server can pull the data from the source becomes very important.

We noticed that our server was taking 1 hour to build a 1 GB data file, which was very slow, so we decided to look into the server's bandwidth. In a typical IT setup you need to contact the Linux server team to get this information, but knowing a few basic system activity commands helps you see first hand what is happening on the Linux server.


SAR ( System Activity Reporter )



Note - we tried using SAR but it was not giving us a clear picture of how much data was coming in over the network. Try using NMON instead.
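
For reference, a minimal sketch of the commands in question (assuming the sysstat and nmon packages are installed on the server):

# sar: report per-interface network throughput (rxkB/s, txkB/s) every 5 seconds, 6 times
sar -n DEV 5 6

# nmon: start the interactive monitor, then press 'n' for network and 'd' for disk
nmon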

Long running Loads and one Inactive session

Hi Guys,

Usually by looking at V$SESSION and active session history we are able to find what is happening with a load and which events it is waiting on, such as reading from temp space or updating an index. However, this was one long running load where for almost half a day I was not able to find any information on what was happening, as if the load was not running at all.

Below are 3 of my favourite links from Oracle which will help you understand your ETL loads better than ever. Read them as generic examples; the ideas can be applied to SQL Server, Teradata, anything.



 
Below is my analysis of a session which Active Session History did not capture.

Note - Active Session History captures a session only while it is active. In the case of long loads the session can show as inactive, so Active Session History will not capture it; look at DBA_HIST_ACTIVE_SESS_HISTORY instead.
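
Before going to the history views it helps to pick up the sql_id and session details of the load itself, since the master query further down filters on a sql_id. A minimal sketch using standard V$SESSION columns (ETL_USER is just a placeholder for whatever user runs the load):

select sid, serial#, status, sql_id, event, last_call_et  -- last_call_et = seconds spent in the current state
from v$session
where username = 'ETL_USER';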


-- Look into this to see which sessions were captured
select * from DBA_HIST_ACTIVE_SESS_HISTORY;

select event, p1text, p2text, wait_class, blocking_session_status, current_obj#,
       pga_allocated, temp_space_allocated
from DBA_HIST_ACTIVE_SESS_HISTORY;

MASTER QUERY

select count(event), event, current_obj#, object_name
from DBA_HIST_ACTIVE_SESS_HISTORY ash, dba_objects obj
where ash.sql_id = 'ABCCFDFDDD'
  and ash.current_obj# = obj.object_id
  and ash.sample_time >= to_timestamp('27-MAR-14 05.00.02.829000000 AM', 'DD-MON-RR HH.MI.SS.FF AM')
group by event, current_obj#, object_name
order by 1 desc;






 

Data pump in a data warehousing environment

Hi Guys,

Basic requirement for using Data Pump -- to use Data Pump across 2 schemas or 2 database instances, both databases should be on the same physical server (same machine), because you will be using a local directory (a local folder on that machine) and this folder should be accessible from both schemas. Data Pump can also run over a db link across the network, but that is considered very slow: you are dependent on the speed of the db link, which is nowhere close to the speed Data Pump is otherwise capable of.



According to me, to dump the file both schemas must be on the same server, where they can see the folder in which the dump file is saved.

  1. Create a directory on the Oracle database server which will hold the dump file
  2. Grant read/write permission on this directory to the users involved
  3. Command to import the dump file from the directory on the Linux server (the directory setup and the matching export are sketched below the list):
impdp username/password DIRECTORY=dpump_dir1 DUMPFILE=scott.dmp TABLES=scott.emp REMAP_SCHEMA=scott:jim
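
A minimal sketch of steps 1 and 2, plus the export that creates scott.dmp in the first place (the directory path and the scott account are only examples, adjust to your environment):

/* Step 1: create the directory object, run as a DBA (the path is an example) */
CREATE OR REPLACE DIRECTORY dpump_dir1 AS '/u01/app/oracle/dpump';

/* Step 2: grant read/write on it to the users doing the export/import */
GRANT READ, WRITE ON DIRECTORY dpump_dir1 TO scott;

/* Export command run from the OS prompt, producing the dump file used by impdp above */
expdp scott/password DIRECTORY=dpump_dir1 DUMPFILE=scott.dmp LOGFILE=scott_exp.log TABLES=scott.emp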

Easy Way To Export/Import Using External Tables ( Added 21 Aug 2014)

impdp and expdp are the newer Data Pump commands (introduced in 10g) that replace imp and exp. Instead of using them directly, we can make use of an external table to make things much simpler.

/* Creates a directory */

CREATE OR REPLACE DIRECTORY test_dir AS '/app1/ADMIN/product/11.2.0/dbhome_1/demo/';

/* Unloads the table through an external table using the ORACLE_DATAPUMP driver. The external table can then be queried directly. (External tables can be read with both ORACLE_LOADER and ORACLE_DATAPUMP, but unloading like this works only with ORACLE_DATAPUMP.) */

create table emp_unload
organization external
( type oracle_datapump
  default directory test_dir
  location( 'emp.dat' )
)
as
select * from emp;

/* We can select directly from the external table, even though it is not an actual heap table */

select * from emp_unload;

/* Loading data into emp3 from the external table */

create table emp3 as select * from emp_unload;
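
On the target database side the same emp.dat can be read back through an external table. A minimal sketch, assuming the file has been copied to a directory visible to the target instance and that emp has the standard SCOTT.EMP columns (the directory path and column list are examples, adjust to your table):

/* Directory on the target server pointing at the folder holding emp.dat */
CREATE OR REPLACE DIRECTORY unload_dir AS '/app1/ADMIN/transfer/';

/* External table over the Data Pump file; the column list must match the unloaded table */
create table emp_load
( empno    number(4),
  ename    varchar2(10),
  job      varchar2(9),
  mgr      number(4),
  hiredate date,
  sal      number(7,2),
  comm     number(7,2),
  deptno   number(2)
)
organization external
( type oracle_datapump
  default directory unload_dir
  location( 'emp.dat' )
);

/* Then load it like any other table */
create table emp_copy as select * from emp_load;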





How to find Hostnames

select host_name from v$instance

TNSNAMES.ora file

ABC =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = bbbbbb)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = kkkkk.iiiii.com)
    )
  )
Below are the steps for using Data Pump.
  1. Check whether you have read/write access to a Data Pump directory. I think you have access to the default directory DATA_PUMP_DIR, so the commands below are based on the default directory; please check the name while running the commands (a query to verify this is sketched after the list).
  2. Open SQLPLUS by logging in as ABC, which has access to the Data Pump directory.
  3. Place the attached tnsnames file at C:\oracle32\product\11.2.0\client_1\network\admin
  4. Login to sqlplus with username : ABC@ABC ( note - ABC is the alias in tnsnames.ora )
  5. Command to export the table (run it with ESTIMATE_ONLY=YES and no DUMPFILE first if you only want a size estimate):
expdp ABC DIRECTORY=data_pump_dir DUMPFILE=data_pump_dir:ref_type.dmp LOGFILE=data_pump_dir:ref_type.log TABLES=ABC.ref_bill_type
  6. Command to import the table (SCHEMAS and TABLES cannot be combined in one impdp run, so the schema change is done with REMAP_SCHEMA):
impdp KLJ_STAGE DIRECTORY=data_pump_dir DUMPFILE=data_pump_dir:ref_type.dmp TABLES=ABC.ref_bill_type REMAP_SCHEMA=ABC:KLJUSER LOGFILE=data_pump_dir:exp_ref_bill_type.log
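
A quick way to confirm step 1, i.e. that the directory exists and that your user has been granted access to it (standard dictionary views; DATA_PUMP_DIR is the default directory name):

/* Directories visible to your user, with their paths on the database server */
select directory_name, directory_path from all_directories;

/* Read/write grants on the directory for the current user */
select grantee, table_name as directory_name, privilege
from user_tab_privs
where table_name = 'DATA_PUMP_DIR';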




Tuesday, 1 April 2014

Update in ETL without primary key in Pentaho


Redo Log and Undo tablespace


Hi All,

Redo logs allow you to replay your transactions, as the name suggests: Re-Do.

To understand redo logs you need to understand how the buffer cache works. When we insert data into Oracle it is not written directly to disk. It goes into the buffer cache first and is written to disk later, which makes Oracle look faster because a physical write to disk takes far more time than a write to memory. (Read the lines from the Oracle book at the end.)

There are 3 things of interest with redo: the redo log buffer (in the SGA), the online redo logs (redo file groups on disk) and the archived redo logs (after a group is full it is archived). There are at least 2 online redo log groups on disk; as one fills up, the log writer switches to the other and the full one becomes eligible for archiving. Redo that has reached the online redo files on disk is the only redo that can be used for recovery; whatever is still sitting in the redo log buffer is lost if the system fails.
The online redo logs generally tell us what is in the buffer cache (I use the word generally because not everything is covered, it depends on how recently the redo buffer was flushed).

Consider the scenario of an insert

An insert will generate undo in the undo segment and redo in the redo log buffer; the redo also covers the undo segment changes. The important point to remember is that the inserted rows are first written to the buffer cache, which is part of the SGA (RAM), and the database writer flushes them to disk later, for example when the buffer cache fills up or when a redo log switch forces a checkpoint. The redo log buffer itself is flushed to the online redo files when it fills up, when a log switch happens and, most importantly, when we commit.

How redo handles a system crash

  1. If the system crashes now (before anything is flushed), the entire insert is not recoverable.
  2. If the buffer cache fills up first, the database has to write some of its content to disk to accept new rows. Before doing this, the redo buffer is written to disk by the log writer. Remember that the redo log contains the changes for the undo segment as well, so if the system crashes immediately after the redo is written but before the data blocks reach disk, we can still bring the database back to a consistent state.

So the basic rule about redo is: "redo is always written to disk first, before the changed data blocks are written, or more specifically before the buffer cache is flushed to disk". This is what allows consistency.

Note about APPEND -- a direct-path (APPEND) insert bypasses the buffer cache and writes the blocks straight to disk, bypassing the SGA. Combined with NOLOGGING it generates only minimal redo.
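
A minimal sketch of how to see this for yourself by watching the 'redo size' statistic of your own session (v$mystat / v$statname are standard views; big_table and big_copy are placeholder names):

/* Redo generated so far by this session, in bytes */
select n.name, s.value
from v$mystat s, v$statname n
where s.statistic# = n.statistic#
  and n.name = 'redo size';

/* Conventional insert: full redo */
insert into big_copy select * from big_table;
commit;

/* Re-run the redo size query, then compare with a direct-path, nologging insert */
alter table big_copy nologging;
insert /*+ append */ into big_copy select * from big_table;
commit;

/* Re-run the redo size query again: the increase should be far smaller this time
   (assuming the database is not in FORCE LOGGING mode) */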

What happens when we commit

When we commit, the redo log buffer is flushed to the online redo log files on disk; the modified blocks in the buffer cache are written to disk later by the database writer. So everything needed to reproduce the data you entered is safely on disk. For transactions involving a lot of data, the redo log buffer is regularly flushed to disk as it fills up, and the same goes for the buffer cache.

The point to remember is that redo allows the system to be recovered only up to the last commit.

Lines from an Oracle book

To understand how online redo logs are used, you'll need to know something about checkpointing, how the database buffer cache works, and what a process called Database Block Writer (DBWn) does. The database buffer cache and DBWn are covered in more detail later on, but we'll skip ahead a little anyway and touch on them now.

The database buffer cache is where database blocks are stored temporarily. This is a structure in Oracle's SGA. As blocks are read, they are stored in this cache, hopefully so we won't have to physically reread them later. The buffer cache is first and foremost a performance-tuning device. It exists solely to make the very slow process of physical I/O appear to be much faster than it is. When we modify a block by updating a row on it, these modifications are done in memory to the blocks in the buffer cache.

Enough information to redo this modification is stored in the redo log buffer, another SGA data structure. When we COMMIT our modifications, making them permanent, Oracle does not go to all of the blocks we modified in the SGA and write them to disk. Rather, it just writes the contents of the redo log buffer out to the online redo logs. As long as that modified block is in the buffer cache and not on disk, we need the contents of that online redo log in case the database fails.

If, at the instant after we committed, the power was turned off, the database buffer cache would be wiped out. If this happens, the only record of our change is in that redo log file. Upon restart of the database, Oracle will actually replay our transaction, modifying the block again in the same way we did and committing it for us. So, as long as that modified block is cached and not written to disk, we can't reuse that redo log file.

This is where DBWn comes into play. This Oracle background process is responsible for making space in the buffer cache when it fills up and, more important, for performing checkpoints. A checkpoint is the writing of dirty (modified) blocks from the buffer cache to disk. Oracle does this in the background for us. Many things can cause a checkpoint to occur, the most common being a redo log switch. As we filled up log file 1 and switched to log file 2, Oracle initiated a checkpoint. At this point, DBWn started writing to disk all of the dirty blocks that are protected by log file group 1. Until DBWn flushes all of these blocks protected by that log file, we can't reuse it.
 



The undo segment is the rollback segment: the information needed to reverse the changes you make to tables and indexes with inserts, updates or deletes is stored in the undo segment. In earlier releases the undo segment was called the rollback segment.

The redo log's primary purpose is instance recovery. There are usually at least 2 redo log groups, and they guarantee the state of the database in case the instance fails.

Redo log files are filled with redo records. A redo record, also called a redo entry, is made up of a group of change vectors, each of which is a description of a change made to a single block in the database. For example, if you change a salary value in an employee table, you generate a redo record containing change vectors that describe changes to the data segment block for the table, the undo segment data block, and the transaction table of the undo segments.

Redo entries record data that you can use to reconstruct all changes made to the database, including the undo segments. Therefore, the redo log also protects rollback data. When you recover the database using redo data, the database reads the change vectors in the redo records and applies the changes to the relevant blocks.

Suppose we are copying data from one table to another: normally this generates undo and therefore also redo. By declaring the operation unrecoverable we get rid of most of that redo and undo.

create table new_table unrecoverable as select * from old_table where ....;
drop table old_table;
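
For reference, UNRECOVERABLE is the older spelling of this option; today the same thing is usually written with the NOLOGGING keyword (a sketch with the same placeholder names, and the where clause left as in the original):

/* CTAS with NOLOGGING, the modern spelling of UNRECOVERABLE */
create table new_table nologging as select * from old_table where ....;

/* If the target table already exists, a direct-path insert gives a similar effect */
alter table new_table nologging;
insert /*+ append */ into new_table select * from old_table where ....;
commit;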

SQL loader does not perform Bulk load when using Pentaho

Hi Guys,

We were trying to bulk load 200 million rows from the source into a staging table using a SQL*Loader direct path load (bulk load). When we looked at the explain plan of the SQL fired by SQL*Loader, it showed a hint called SYS_DL_CURSOR and appeared to be doing a conventional load.

Speeds worth mentioning --- SQL*Loader loads around 180k rows per second, which is about 11 million per minute.
A db link can load directly at around 3.5 million rows per minute; it loaded 180 million in 60 minutes, with some of that time spent on the SQL read.

Reason -- we were using an option in Pentaho called load on the fly, which reads the data from the SQL and starts loading it into Oracle as the rows come in. The problem is that you never have enough rows available for SQL*Loader to go for a bulk load, so it falls back to a conventional load.

Solution -- use Automatic load at the end, or any option which first creates a data file on the ETL server and then lets SQL*Loader load from this data file. This allows SQL*Loader to go for a direct load.

Note -- the SYS_DL_CURSOR hint does not matter, even if the explain plan suggests a conventional load. Look into the SQL*Loader log file instead; it tells you whether the load was direct or conventional. I had a lot of confusion because the explain plan was showing a conventional load.
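
For example, a quick way to check the log (the file name is the one from the sqlldr command further below; the "Path used" line states the load method):

grep -i "path used" archive/Trial_logg.log
#   Path used:      Direct          <-- what you want for a bulk load
#   Path used:      Conventional    <-- what you get when rows only trickle in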

Some of my analysis along the way was flawed, and since you might go down the same route I am noting it down: even the attempt below did not work, because there was never enough data available to do a direct load.


Note - I tried manually creating the SQL*Loader control file and triggering it from the Linux server; it still adds the SYS_DL_CURSOR hint.

SQL loader control file (.ctl)
load data
 infile '/archive/Load_trial.txt'
 into table ABC
 fields terminated by "," optionally enclosed by '"'
( SRC,
CD ,
SRC_VAL_ID,
GG_CD ,
NMT_IND,
DENOM_IND,
NMT_XCLS_IND,
DENOM_XCLS_IND,
DENOM_XCPT_IND,
RSLT_ANS,
SR_EPSD_ID,
XCLS_RSN_CD,
VAL_TYP_CD,
CREAT_USER_ID,
CREAT_DTM DATE 'mm/dd/yy')

Command to load using SQL loader

sqlldr userid=username/password@servicename control=archive/Trial_load1.ctl log=archive/Trial_logg.log DIRECT=TRUE

 Alternate solution 

Create a procedure that uses a db link to load data from the source schema into the target schema (a sketch follows below). The procedure will perform well because you eliminate the ETL server in the middle, but you will be limited by the speed of the db link connection. The DBAs may not agree, because db links are a security concern as they expose data over the network, but keep the procedure ready as a contingency measure in case the load fails and you need an urgent reload.
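
A minimal sketch of such a procedure, with placeholder names for the staging table, the source table and the db link:

create or replace procedure load_stage_from_source as
begin
  execute immediate 'truncate table stage_table';   -- start from an empty staging table
  insert /*+ append */ into stage_table             -- direct-path insert over the db link
  select * from source_table@source_db_link;
  commit;
end;
/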

Alternate Solution 2 

Data Pump. Pull the required tables from the source schema to the target schema using Data Pump, and write a procedure in the target schema to load the data from these pulled source tables. Incredible speed can be achieved: 200 million records can be inserted in less than 1 hour in Oracle.
Data Pump is fast and can easily give you a speed of 20 million rows per minute when creating a dump file for export/import.