Saturday, 30 November 2013

Appliances and In-Memory Databases

Hi All,

Appliances are software running on certified hardware, like SAP HANA or Teradata (for example the Teradata 5600). Teradata designs and tests its own hardware stack, down to the connecting LAN cables, so it sells you the entire rack rather than just the software. Teradata runs best on its own rack because the optimization has been taken care of down to the last detail. SAP HANA likewise runs only on its own certified hardware.

SAP HANA is a database (like Oracle or MySQL), but it has a very large main memory (on the order of 1 TB of RAM) and multiple CPUs, and instead of disk drives with moving parts it uses solid-state storage. Disk access is orders of magnitude slower than RAM (disk seeks are measured in milliseconds, RAM access in nanoseconds), so with HANA users can expect very fast querying.

Because the data is stored compressed, 1 TB of RAM in HANA is said to accommodate about 40 TB of uncompressed data.

100 TB of RAM can accommodate about 500 TB of uncompressed data. Note that no single machine has this 100 TB of RAM; it is distributed across the servers in the cluster. Each server sits in its own rack unit with its own CPUs and memory, and when they are all added up the total comes to 100 TB.

All the databases below are competitors of each other. Some use an in-memory engine and some use other approaches, but they offer similar query speeds at similar costs.

Netezza, Teradata, SQL Server Parallel Data Warehouse and, to some extent, Oracle Exadata are MPP (shared-nothing) systems: each processor has its own dedicated main memory (RAM). In an SMP system, like the i5 processor in your laptop, there is a single RAM shared by all four cores inside.





 

Difference between Reporting and Analytics

Hi Guys,

If you are working in BI reporting you must have seen the functions for standard deviation, variance and mean, but as BI report developers we hardly ever use them. So why are they there, and who uses them? That was a question in my mind.
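For reference, those statistics are just built-in aggregate functions in SQL. A minimal sketch, assuming a hypothetical sales table with product_number and quantity columns:

-- descriptive statistics of quantity sold, per product (hypothetical table and columns)
SELECT product_number,
       AVG(quantity)      AS mean_qty,
       VARIANCE(quantity) AS variance_qty,
       STDDEV(quantity)   AS stddev_qty
FROM sales
GROUP BY product_number;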

Reporting and analytics are different things. A BI report developer is not an analyst, and an analyst does not have to know BI technology in detail; many manage with simple Excel.

I know people use the terms interchangeably, and that is what makes the difference between the two confusing.

Reporting 

The navigation is structured and people know what they are looking for. Even ad hoc queries by users fall under reporting.
Generally we are reporting what is already there, like how much quantity was sold. We are not building a model that projects a trend to predict expected sales in the years to come.

Analytics 

Here the analyst makes a hypothesis and uses data to confirm or reject it. It involves ad hoc querying (not to be confused with ad hoc reporting in BI). Data analyst roles are generally separate from BI developer roles. Suppose the business has a pressing question that could be answered by the gut feeling and experience of its managers; it can still decide to go for data analysis, and an analyst can build a model to test whether the managers' hypothesis is correct.

This requires the analyst to prepare the model and fine-tune it so that the analysis properly tests the hypothesis. In many cases the data will not support it, which tells you the hypothesis was wrong.

Reporting is generally done on structured data. Analytics can be done on structured data too, but the chances of finding a pattern are higher in unstructured data like web logs.

Well-known tools for analytics are SPSS, SAS and R.

I know Cognos offers SPSS integration from Cognos 10 onwards, but to use it to business advantage you need a data analyst, or a developer with good knowledge of statistical models and the ability to form a hypothesis and test it against data.

There are a lot of videos on statistical analysis on YouTube. I would recommend going through them irrespective of the tool, because the tool only helps analysts do their job; the thought process has to come from the person.

Analytics is best done on unstructured data, like network traffic data for a telecom company: logs from the switches can be loaded into HDFS (Hadoop) and analysed for patterns.

Difference between SAS and Cognos

SAS has an ETL part with it. (I did not know that :-) )

Some comments I picked up from other sites:

SAS in my view is still fundamentally a data mining software vendor comparable to SPSS. Data mining sorts through data to identify patterns and establish relationships (Association, Sequence & path analysis, Classifications, Clustering, Forecasting).

The other vendors you mention are OLAP vendors (Cognos and BO are cube-based MOLAP solutions, MicroStrategy ROLAP). Online analytical processing (OLAP) for the most part allows users to derive information and business intelligence from data warehouse systems by providing tools for querying and analyzing the information in the warehouse from different points of view. OLAP can be used for data mining or the discovery of previously undiscerned relationships between data items.

Generally, Cognos can be used as a data presentation layer that includes descriptive statistics and OLAP. SAS has a substantially more powerful ETL capability, so I would use SAS for data prep processes. Finally, SAS handles the inferential statistics and data mining, including predictive and optimization modeling.
To summarize, I've used SAS to integrate and perform ETL on data, Cognos to present the data, then SAS to analytically crunch the data. The analytical results may then be reported back through Cognos.


Big data and Analytics 

With the coming of Hadoop we have the ability to analyse big data: web logs, bioinformatics data (DNA sequencing), online gaming data (which user is likely to play which games), geo mappings. These data sets can run to many terabytes, so analysts now have a huge data set they can query (for example with HiveQL on Hive) to build predictive models or verify a hypothesis.

In a data warehouse the scope for an analyst to test a hypothesis is smaller because the data is already structured and organised. The chances of an analyst finding a new pattern are higher in unstructured data.
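As a taste of that kind of ad hoc querying, a minimal HiveQL sketch against a hypothetical weblogs table (table and column names are assumptions):

-- sessions per user for one day of web logs
SELECT user_id, COUNT(DISTINCT session_id) AS sessions
FROM weblogs
WHERE log_date = '2013-11-30'
GROUP BY user_id
ORDER BY sessions DESC
LIMIT 20;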







Sunday, 24 November 2013

PL/SQL Notes

Hi All,

Just noting down some PL/SQL procedures.



Example 1 

create or replace procedure load_data3 is

   -- explicit cursor over the source table
   cursor tab_order_detail is select * from order_details;

begin
   -- cursor FOR loop: fetch each order detail row and insert it into the stage table
   for l_detail in tab_order_detail loop
      insert into order_details_stage
      values (order_detail_seq.nextval, l_detail.order_detail_code, l_detail.order_number,
              l_detail.ship_date, l_detail.product_number, l_detail.promotion_code,
              l_detail.quantity, l_detail.unit_cost, l_detail.unit_price, l_detail.unit_sale_price);
   end loop;
end;
/

Some Basic Procedure Samples 

DECLARE
  CURSOR employee_id_cur
  IS
    SELECT employee_id FROM plch_employees ORDER BY salary ASC;

  l_employee_id employee_id_cur%ROWTYPE;
BEGIN
  OPEN employee_id_cur;

  LOOP
    FETCH employee_id_cur INTO l_employee_id;

    EXIT WHEN employee_id_cur%NOTFOUND;

    -- do something with the fetched row
    DBMS_OUTPUT.put_line (l_employee_id.employee_id);
  END LOOP;

  CLOSE employee_id_cur;
END;

----------------------------------------------------------------------------

The same thing can be written with a cursor FOR loop. Notice there is no need to declare employee_rec: Oracle recognises it automatically and implicitly declares it with the cursor's %ROWTYPE.

DECLARE
   CURSOR employees_in_10_cur
   IS
      SELECT *
        FROM employees
       WHERE department_id = 10;
BEGIN
   FOR employee_rec
   IN employees_in_10_cur
   LOOP
      DBMS_OUTPUT.put_line (
         employee_rec.last_name);
   END LOOP;
END;
-------------------------------------------------------------------------------------------------

The same thing can also be written as below. Notice that here we did not even name the cursor; Oracle still handles it and creates employee_rec with the matching record type. Best of all, Oracle Database automatically optimizes cursor FOR loops to perform similarly to BULK COLLECT queries.

BEGIN
   FOR employee_rec IN (
        SELECT *
          FROM employees
         WHERE department_id = 10)
   LOOP
      DBMS_OUTPUT.put_line (
         employee_rec.last_name);
   END LOOP;
END;

---------------------------------------------------------------------------------------------------
Difference between a function and a procedure

A function must return a value (and can therefore be called inside a SQL expression), whereas a procedure just performs an action and is invoked as a standalone statement. The fragment I had started here (create or replace function trial (a IN number) return ...) was left incomplete; a minimal sketch follows.
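A minimal sketch completing that thought (the NUMBER return type and the bodies are assumptions, since my fragment stopped at the RETURN clause):

-- a function must declare a return type and return a value
create or replace function trial (a in number) return number is
begin
   return a * 2;
end;
/

-- a procedure just performs an action; it has no return type
create or replace procedure trial_proc (a in number) is
begin
   dbms_output.put_line('value passed in: ' || a);
end;
/

-- a function can be called inside a query; a procedure cannot
select trial(21) from dual;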


Saturday, 23 November 2013

SQL Trace and TKProf

Hi All,

When you enable SQL trace on your user session, Oracle creates a trace file on the server. The trace file gives you all the statistics of the SQL, like consistent gets, CPU utilization and memory usage. Consider you have a procedure that deletes 35 million records and it takes a huge amount of time. Now you can do some thinking and find out what is going wrong.

1) Is the SELECT statement you run inside the procedure to identify the records the part that is taking the time? For example, if you are using select * from t1 where exists (select * from t2 where t1.a = t2.a) and t1 is big, you will be in a lot of trouble, with a lot of consistent gets and high CPU usage. If you did not know this up front and had to investigate what is going wrong, you need SQL trace.

Autotrace and explain plan will not work for a procedure, so they cannot tell you where the bottleneck is. SQL trace is like autotrace for procedures: it gives all the same information, such as consistent gets and CPU utilization. If you are not familiar with autotrace you can read my article on reading explain plans Here

ALTER SESSION SET sql_trace = true;

run the procedure for which you want to find the trace

ALTER SESSION SET sql_trace = false;
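For example, "run the procedure" could be a simple anonymous block, and afterwards you can ask the database where your session's trace file went (load_data3 is the procedure from my PL/SQL notes; the v$diag_info query assumes 11g or later):

BEGIN
   load_data3;   -- the code you want to profile
END;
/

-- where did my trace file go? (11g and later)
SELECT value FROM v$diag_info WHERE name = 'Default Trace File';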

Now for formatting the output of SQL trace, because the raw trace file is quite unreadable.

Oracle has a formatting utility for SQL trace called tkprof. You can't run it from SQL Developer; run it at the command prompt:

tkprof C:\Dummy_folder_for_informatica_textfiles\orcl_ora_4432.trc C:\Dummy_folder_for_informatica_textfiles\output.prf EXPLAIN=DWBUILD/DWBUILD SYS=NO

Again, a good article on Ask Tom on understanding tkprof:

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:969160000346108326


Below are some questions that Ask Tom asks us to consider:


Now, we have a bunch of FACTS

fact: we did a lot of physical IO
fact: we did a lot of logical IO
fact: we used about 100 cpu seconds
fact: it took about 262 seconds elapsed time, therefore we waited lots of seconds for something
fact: we waited for db file scattered read for a long time, this is the read of a full scan
fact: we full scanned a huge table
fact: we can see our query
fact: we have the plan that was used

Now, what can we do with these facts? We can use our knowledge of

a) the data
b) how oracle works

to "tune". What are some obvious things to think about there?

Well, we see the query result was about 700,000 rows out of 128,000,000, and that gives us another fact or two.


PARSE

Translates the SQL statement into an execution plan, including checks for proper security authorization and checks for the existence of tables, columns, and other referenced objects.

EXECUTE

Actual execution of the statement by Oracle. For INSERT, UPDATE, and DELETE statements, this modifies the data. For SELECT statements, this identifies the selected rows.

FETCH

Retrieves rows returned by a query. Fetches are only performed for SELECT statements.

query --- your consistent gets (logical reads from the buffer cache)

disk -- physical reads

db file sequential read --- single-block I/O, typically index reads

db file scattered read -- multiblock I/O, typically full table scans

When we say sequential reads are taking time, we should think about how long it takes to load that data from disk into the SGA, in short into RAM.

A high number of consistent gets is usually accompanied by a lot of time spent on db file sequential read -- I need to read more about why so much time goes into sequential reads.

http://asktom.oracle.com/pls/asktom/f?p=100:11:0:::%3AP11_QUESTION_ID:6265095774206


Alternative to tkprof (poor developer's version)

You can use V$SQL to get some of the information, like disk reads and buffer gets. Autotrace can also be used for some of it.

SELECT sql_id,
  Elapsed_time,
  Cpu_time,
  Fetches,
  executions,
  Buffer_gets,
  disk_reads,
  Direct_writes,
  Physical_read_bytes,
  Physical_write_bytes,
  concurrency_wait_time,
  user_io_wait_time,
  rows_processed,
  optimizer_mode,
  ((IO_interconnect_bytes)/1024)/1024 Interconnect_MB
FROM v$sql
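To narrow this down to the statements you care about, you might filter and sort, for example (the sql_text filter is just an illustration; elapsed_time is in microseconds):

SELECT sql_id, buffer_gets, disk_reads, elapsed_time/1e6 AS elapsed_seconds
FROM v$sql
WHERE lower(sql_text) LIKE '%order_details_stage%'
ORDER BY elapsed_time DESC;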


http://docs.oracle.com/cd/B10501_01/server.920/a96533/autotrac.htm

db file sequential read is due to INDEXED reads by the way -- it is single block IO, it is not the result of a full scan. So.... you may well be looking at the wrong place. The p1, p2, p3 info in v$session_wait can be used to determine the actual object being waited on. A sequential read is usually a single-block read, although it is possible to see sequential reads for more than one block (see P3). This wait may also be seen for reads from datafile headers (P2=1 indicates a file header read).


Block reads are fairly inevitable so the aim should be to minimise un-necessary IO. This is best achieved by good application design and efficient execution plans. Changes to execution plans can yield orders of magnitude changes in performance. Tweaking at system level usually only achieves percentage gains. The following points may help:

    * Check for SQL using unselective index scans

    * A larger buffer cache can (not will, "might") help

    * A less obvious issue which can affect the IO rates is how well data is clustered physically. Eg: Assume that you frequently fetch rows from a table where a column is between two values via an index scan. If there are 100 rows in each index block then the two extremes are:
         1. Each of the table rows is in a different physical block (100 blocks need to be read for each index block)
         2. The table rows are all located in the few adjacent blocks (a handful of blocks need to be read for each index block)
      Pre-sorting or re-organising data can help to tackle this in severe situations.

    * See if partitioning can be used to reduce the amount of data you need to look at.



a db file sequential read is a physical IO - not a read from the buffer cache. it will be *followed* by a read from the buffer cache - but the wait for a db file sequential read is a wait for a physical IO. 
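As a hedged sketch of the P1/P2/P3 technique mentioned above (the bind values are placeholders, and DBA_EXTENTS needs the right privileges):

-- which file/block is the session currently waiting on?
SELECT sid, event, p1 AS file_id, p2 AS block_id, p3
FROM v$session_wait
WHERE event = 'db file sequential read';

-- map that file/block back to the segment being read
SELECT owner, segment_name, segment_type
FROM dba_extents
WHERE file_id = :file_id
  AND :block_id BETWEEN block_id AND block_id + blocks - 1;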



http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:6071779300346659903

Wednesday, 13 November 2013

ETL target db Optimization techniques


Hi Guys,

I happened to read the Informatica target database optimization tips a few days back. Five points were mentioned to speed up the load, so I went through each one of them and below are my thoughts.

1) Optimizing the database by storing indexes and tables in different tablespaces...

I read Ask Tom and I disagree. We have RAID now, which stripes data across the disks and gives consistent I/O; separating tables and indexes is a 1980s strategy for balancing disk I/O.

A datafile is a physical file and a tablespace is a logical structure. A datafile contains data belonging to only one tablespace, and with RAID the datafile is striped across the disks.

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1463404632043


2) Consider that your loads are failing frequently and you need to delete the data that was partially loaded into the fact. In such cases, if it is possible (that is, you are doing a truncate-and-load of the fact based on your source system), you can go for a table swap: create a second fact table and populate it, and when the load is complete, swap it in, as shown in the sketch below.

This type of arrangement is typically required when data is deleted from the source system. Typical change capture cannot help in such cases; change capture generally works for inserts and updates in the source, not for deletes.
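A minimal sketch of the swap itself (table names are hypothetical; there is a brief moment during the renames when the fact name does not resolve):

-- load fact_sales_new completely and verify it, then swap the names
ALTER TABLE fact_sales RENAME TO fact_sales_old;
ALTER TABLE fact_sales_new RENAME TO fact_sales;

-- keep fact_sales_old as a fallback, or drop it once the swap is confirmed
-- DROP TABLE fact_sales_old PURGE;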

Below are a few questions on loading strategies:

        1) What is the load strategy followed for staging?
Sub Q1 - What if the source is SQL Server and the target is Oracle? What is the best technique? Usually unloading to a file and loading from the file can be considered if the data is on a remote server.
Sub Q2 - What if we are loading from files? (Oracle SQL*Loader, Teradata FastLoad)
Sub Q3 - If we are loading from source to target through transformations, does the data always pass through the intermediate server?

Why do we need to use a utility like SQL*Loader?
Sub Q4 - Configuration (control) file for SQL*Loader -- a sketch appears after the note below
Sub Q5 - Why is SQL*Loader faster than using an ETL tool?
http://cognossimplified.blogspot.com/2013/11/bulk-loading-of-data-and-performance.html

Note - You can't roll back a bulk load, and you should not have a primary key on the table while loading.
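For Sub Q4, a minimal sketch of a SQL*Loader control file for a staging load (the file name, date format and column list are assumptions; the table matches my earlier PL/SQL example):

-- orders.ctl
LOAD DATA
INFILE 'orders.dat'
APPEND
INTO TABLE order_details_stage
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(order_detail_code,
 order_number,
 ship_date DATE "YYYY-MM-DD",
 product_number,
 quantity,
 unit_cost,
 unit_price)

It would then be run from the command line with something like: sqlldr userid=dwbuild/dwbuild control=orders.ctl direct=true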


Sub Q1 Solution -- Consider the target is Oracle: the load from a file can be done by SQL*Loader, and when the source is SQL Server the unload can be done by a SQL Server utility (such as bcp). So we unload to a file using the utility and then load the file into Oracle with the loader. Usually an ETL tool can also be used to read from SQL Server and load into Oracle.

Note - We usually truncate and load the staging area. We pick the latest records based on the update timestamp. The DW always maintains history, so there is no need to maintain history in staging.
The idea of going for staging in scenarios like this, when both source and target are Oracle, is to minimise the load on the source system if the load fails and we have to run it again.
We load the file from the source system to the Pentaho server and then from the Pentaho server into Oracle using SQL*Loader. In our case this idea might not be required because both source and target are Oracle, but where the source is SQL Server and the target is Oracle this idea works better.
Point to ponder -- While loading the staging patient table we are not directly copying the patient table from the source system. The source system is always in third normal form, so we resolve the lookup values up front. For example, if the source patient table has a state key, we fetch the state value itself instead of just carrying the key and joining it later while loading the data warehouse.

    1) Load strategy for the DW from staging
    2) Sub Q1 -- How are staging tables merged to form the DW tables?
  
    3) On what criteria are tables designed in the DW?
    4) Do you want to keep all the keys from the source system in your data warehouse? The source system is in third normal form.
  
    5) Why are we loading measure values?
    6) Why is staging joined with the DW?
    7) Why is the DW joined with the data mart?
  
    8) How do we handle a fact table load that fails in the middle?
    Ans - If a dimension load fails in the middle there is no worry; we can simply restart the load, because it always compares data with the source (change data capture). For the fact we are not checking against the source: we only insert new records and never update the fact, because the huge number of records makes updates very time-consuming.
  
    So if the fact load fails we need to delete today's data based on the timestamp and insert the new data.
  
  
    9) How do we handle deletes from the source in the fact?
    Ans -- In the fact load we never compare against the fact to see whether a relationship has been terminated. Consider an employee dimension and a work-location dimension: the only connection between employee and work location is through the fact, and if the relationship ends in the source table there is no way of ending it in the fact. It has to be handled by deleting the entry from the fact using delete scripts, or by creating a factless fact.
  
    10) How will you find out how much data (in MB) is loaded every day?
  
    11) Sequence values in a procedure take time to generate
    Ans - Make sure you cache enough sequence values, otherwise the inserts will take time (see the sketch after this list). You will be able to find such root causes by making use of the session tables in Oracle. Check out the link below for details.
  
  
    12) Using the table-swap idea for the fact load is better than a direct fact load
  
    Ans - This idea can be considered when you are doing a truncate-and-load of the fact because the source system cannot point you to only the new records. Caution - there will be a lag of a few seconds while the tables are renamed, and you have to be sure that this is agreeable.
    (Example case - some of the source system's tables are populated by ETL from another system, so the keys are not primary keys but surrogate keys, and only the active records are truncated and loaded in the source table.)
  
    Advantage - Normally, whenever we truncate and load the fact, the system is not available to users until we finish the fact load. With the swap, users still have yesterday's fact to work with, so the severity of the issue is reduced.
  
    13) If you use the wrong data type in a predicate, will the index still be used? For example a = '1' against a numeric indexed column.
  
    14) Can change capture detect deleted records?
    Ans - No, it can't, because change capture works for dimensions and not for facts.
  
    15) How can you do masking of data?

    16) How does change data capture happen? How can we speed up the data lookup in such cases?
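For question 11, a minimal sketch of sequence caching (ORDER_DETAIL_SEQ is the sequence from my PL/SQL notes; the cache size of 1000 is only an illustration):

-- pre-allocate 1000 values in memory so NEXTVAL does not hit the data dictionary on every insert
ALTER SEQUENCE order_detail_seq CACHE 1000;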






Sunday, 10 November 2013

Some Oracle functions which we do not use daily

Hi Guys,

Below is a post on some oracle functions which most people would not have tried

1) PIVOT
2) CONNECT BY
3) LEVEL
4) GROUPING SETS
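As a quick taste of CONNECT BY and LEVEL, here is the classic row-generator trick (just a minimal sketch):

-- generate the numbers 1 to 10 without any table
SELECT LEVEL AS n
FROM dual
CONNECT BY LEVEL <= 10;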


ADD_MONTHS

Consider you are at 15 March and you want to go back to 15 February. You can't just subtract 30 days, because February may have 28 or 29 days; you need ADD_MONTHS to go back exactly one month.
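A minimal sketch (the literal date is only an example):

-- go back exactly one calendar month
SELECT ADD_MONTHS(DATE '2013-03-15', -1) AS one_month_back
FROM dual;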

MAX and MIN functions

 select emp_name, max(hiring_date) from emp group by emp_name

This will not give you the employee with the maximum hiring date; it gives the max hiring date next to each employee. If you want the employee with the latest hiring date, use an analytic rank function and pick the first rank, as in the sketch below.
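A minimal sketch with RANK (the emp table and column names are assumptions):

-- employee(s) hired most recently
SELECT emp_name, hiring_date
FROM (SELECT emp_name, hiring_date,
             RANK() OVER (ORDER BY hiring_date DESC) AS rnk
      FROM emp)
WHERE rnk = 1;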



Below is a good article on Parallel hints 

http://www.oracle.com/technetwork/articles/database-performance/geist-parallel-execution-2-1872405.html

Note -- parallel hints work best with full table scans.

Parallel hints only help in two cases:

1) your disks can provide enough I/O for all of the processors to use

2) the query is so complex that it saturates the CPU, so bringing in another CPU helps

In most cases the CPU is waiting for I/O, and in such cases parallel hints will not help.

We still need an example here where using parallel hints has led to an increase in query speed; for now, a syntax sketch is below.
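A syntax-only sketch, with no timings claimed (the sales table and the degree of 4 are assumptions):

-- ask the optimizer to full-scan the table with a parallel degree of 4
SELECT /*+ FULL(s) PARALLEL(s, 4) */ COUNT(*)
FROM sales s;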