Thursday, 18 December 2014

Change Data Capture from Redo Logs




Change data capture (CDC) has two approaches. The first is synchronous CDC, in which a trigger on the Oracle table copies each change into a change capture table; this was the approach introduced in Oracle 9i. Oracle 10g brought asynchronous CDC, which instead reads the changes from the redo log files. Asynchronous CDC has two modes:

  1. HotLog mode -- this relies on the online redo logs, so changes are captured almost as soon as they are written
  2. AutoLog mode -- this kicks in after a log switch, so it reads from an archived redo log file (or at least a file which is no longer the active redo log file)


Open question when using CDC: how do you build a real-time data warehouse?

What is the other way of handling CDC? Using application-level queuing.

Supplemental logging must be enabled for CDC to work, as shown below.
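A minimal sketch of switching it on, assuming a source table called orders (the table name is only an illustration):

ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;

ALTER TABLE orders ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

-- confirm minimal supplemental logging is enabled
SELECT supplemental_log_data_min FROM v$database;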



Data types in Oracle



Most of the time we have ASCII as the character set, where 7 bits are used to represent a character.
Unicode (character set) with UTF-8 (encoding) is what can be used for storing Chinese, English and so on, and is a multi-byte character set. Specifying the length in characters rather than bytes is better in the case of multi-byte character sets.

Unicode (character set) allows you to store data in any language in Oracle. There are two character sets in play: 1) the default character set of the database, and 2) the national character set, which a column can use instead. This is made possible by NCHAR and NVARCHAR2. So if you are trying to save Hindi, French or Chinese data in Oracle, use the NCHAR, NVARCHAR2 and NCLOB data types (sketched below).
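A small sketch of a table mixing the two; the table and column names are made up for illustration:

CREATE TABLE customer_names (
  id        NUMBER,
  name_db   VARCHAR2(100 CHAR),  -- stored in the default database character set
  name_any  NVARCHAR2(100),      -- stored in the national (Unicode) character set
  bio_any   NCLOB                -- large Unicode text
);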

A Short note on Character set and Encoding

A character set is a list of characters with unique numbers (these numbers are sometimes referred to as "code points"). For example, in the Unicode character set the code point for A is 65 (0x41). ASCII is another character set.
An encoding, on the other hand, is an algorithm that translates that list of numbers to binary so it can be stored on disk. For example, UTF-8 would translate the code point sequence 1, 2, 3, 4 into the single bytes 01 02 03 04.


Char --- the data type can have its length specified either in bytes or in characters.

Char(2 char) or Char(2 byte). For ASCII it does not make much difference, since 7 bits are enough to represent a character. Whether a bare length defaults to bytes or characters depends on the database parameter NLS_LENGTH_SEMANTICS.
Char is a fixed-length data type and the rest of the value is padded with blanks.

Varchar2 does not pad with blanks and only uses space for the characters actually stored. For all practical purposes use VARCHAR2; VARCHAR is just the legacy name. A small sketch of the length semantics follows.
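A minimal sketch of byte versus character length semantics; the table name is made up:

CREATE TABLE semantics_demo (
  c_bytes CHAR(2 BYTE),       -- room for 2 bytes, can truncate multi-byte characters
  c_chars CHAR(2 CHAR),       -- room for 2 characters regardless of byte width
  v_name  VARCHAR2(20 CHAR)   -- variable length, no blank padding
);

-- the default (BYTE or CHAR) when no qualifier is given
SELECT value FROM v$parameter WHERE name = 'nls_length_semantics';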

Long --- it is only there for backward compatibility; for all new development use a LOB (CLOB in the case of character data).

BLOB, CLOB and NCLOB -- CLOB stands for Character Large Object; as the name suggests, it lets you store character data in the character set of your database. NCLOB lets you store Unicode data irrespective of the character set of the database. BLOB is used to store binary data; when we say the data is binary it means Oracle will not interpret it as characters from any character set.


Numeric Data types

When you specify a column as NUMBER with a precision but no scale, the scale is taken as zero, that is, it cannot store anything after the decimal point.
The important point is that a plain NUMBER can store values of practically any size; the only restriction is on precision, and the maximum precision is 38 digits. A short sketch follows.
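A quick sketch of precision and scale, with illustrative names and values:

CREATE TABLE number_demo (
  n_plain NUMBER,        -- up to 38 significant digits, decimals allowed
  n_whole NUMBER(5),     -- no scale given, so scale is 0 and decimals are rounded away
  n_money NUMBER(10,2)   -- up to 10 digits, 2 of them after the decimal point
);

INSERT INTO number_demo VALUES (123.456, 123.45, 12345678.99);
-- n_plain keeps 123.456, n_whole is stored as 123 because its scale is 0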

RAW and LONG RAW -- they are used for binary data that is not to be interpreted. Nowadays LONG RAW is only there for backward compatibility, and it is better to use BLOB.


Date and timestamp

Points to remember: both DATE and TIMESTAMP contain a time part. DATE is precise to the second, while TIMESTAMP carries fractional seconds (up to nanosecond precision). If you are comparing dates in a WHERE clause, either supply them in the exact format given by your NLS settings or, better, use TO_DATE with an explicit format mask, as in the sketch below.
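A hedged sketch of a safe date comparison; the table and column names are illustrative:

SELECT *
FROM   sales
WHERE  sale_date >= TO_DATE('2014-12-01', 'YYYY-MM-DD')
AND    sale_date <  TO_DATE('2015-01-01', 'YYYY-MM-DD');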

Things that are not native Oracle types: FLOAT, DECIMAL, BYTE, SMALLINT, INT. NUMBER is the one numeric type that covers everything in Oracle. Oracle does let you write INT (and the other ANSI names) in a CREATE TABLE statement, but the column is simply created as a NUMBER, as the sketch below shows.
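A small sketch of this, again with made-up names:

CREATE TABLE ansi_demo (a INT, b SMALLINT, c DECIMAL(8,2));
-- describing ansi_demo shows the columns as NUMBER types
-- (for example NUMBER(38) for the INT and NUMBER(8,2) for the DECIMAL)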

Very good article on the topic


Sunday, 12 October 2014

What most BI reports contain

Hi All,

After reading articles by Ralph Kimball, the two major themes in BI are
1) Comparisons
2) Sequential computations

In this article I am planning to capture some of them from the default Cognos package reports. I have never categorized reports at this level before, so I had a look at them. I am hoping this analysis will give me a new perspective on what is expected when designing reports or doing dimensional modelling.




Employee Dimension Modelling in Cognos

Hi All,

This article covers the modelling of the Employee dimension. Though the task looks simple, it took me quite some time to think it over. Ralph Kimball has written five good design tips on this which you can go through. I have found one implementation done by IBM in the sample that comes with Cognos, so I am discussing that implementation here.

Fixed-depth hierarchies have issues when implementing type 2, since the entire dimension needs to be end-dated when the CEO changes. But considering we are implementing type 1, the idea of keeping this information in separate columns is best.

It has the further disadvantage of not being able to filter directly on employee name. You first need to find the designation, i.e. whether the employee is manager 1 or manager 2, and only then can you filter on it.

We will now look at the basic ETL logic to populate this Employee table from the source table to get the required structure; a sketch follows.
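A minimal sketch of that ETL logic using an Oracle hierarchical query; the source table and column names (src_employee, emp_id, mgr_id, emp_name) are assumptions, not the actual Cognos sample schema:

SELECT emp_id,
       emp_name,
       LEVEL                              AS depth,
       SYS_CONNECT_BY_PATH(emp_name, '/') AS mgmt_path
FROM   src_employee
START WITH mgr_id IS NULL                 -- the CEO has no manager
CONNECT BY PRIOR emp_id = mgr_id;

-- the mgmt_path string can then be split into manager_1 .. manager_n columns
-- while loading the fixed-depth employee dimension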





























The other point to be considered is how to implement security so that when a user logs in he can see only his subordinates' expenses and not the expenses of others; this has to be implemented in Cognos. A hedged sketch of the subordinate lookup follows.
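One way to get the list of subordinates on the database side is another hierarchical query; this is only a sketch, and the names (src_employee, emp_id, mgr_id) plus the bind variable are assumptions:

SELECT emp_id
FROM   src_employee
START WITH emp_id = :logged_in_emp_id   -- supplied at run time, e.g. from the Cognos session
CONNECT BY PRIOR emp_id = mgr_id;

-- the expense fact is then filtered to the employees returned by this query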












Saturday, 11 October 2014

Resolving Many to Many relationships

The article should contain:

How to identify many-to-many relationships between your dimensions
Why many-to-many relationships cause a problem for your data warehouse, for example the case of customer and account

How to resolve many-to-many relationships

How to build the combination sequence table and how to build the bridge table by accessing this table (a sketch of a bridge table follows)
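A hedged sketch of what a customer-account bridge table could look like; the names and the weighting column are illustrative, not a prescribed design:

CREATE TABLE account_customer_bridge (
  account_group_key NUMBER,       -- surrogate key stored on the fact row
  customer_key      NUMBER,       -- one row per customer in the group
  weighting_factor  NUMBER(5,4)   -- e.g. 0.5 when two customers share an account
);

-- the fact joins to the bridge on account_group_key and the bridge joins to the
-- customer dimension; the weighting factor prevents double counting additive measures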

Thursday, 9 October 2014

How to get execution plans of SQL statements that were run as part of ETL

Hi All,

This might look simple, but these queries come in handy. Always remember that the plan you get from EXPLAIN PLAN is not the actual plan, only an estimated one. The queries below identify the heavy statements from the AWR history, and the plan actually used can then be pulled from AWR (see the DBMS_XPLAN sketch after the queries).


select a.*, to_char(substr(b.sql_text, 1, 400)) as sql_text
from (
      select sql_id,
             sum(executions_delta)                 as executions,
             sum(elapsed_time_delta) / 1000 / 1000 as elapsed_secs
      from   dba_hist_sqlstat
      where  snap_id >= 13222 - 240
      group  by sql_id
     ) a,
     dba_hist_sqltext b
where  a.sql_id = b.sql_id
order  by a.elapsed_secs desc;

select *
from   dba_hist_osstat
where  stat_name like '%LOAD%'
and    snap_id >= 14672 - 240
order  by 6 desc;

select *
from   dba_hist_snapshot
where  snap_id in (1273, 1173, 1151, 11218, 12351);
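Once an interesting SQL_ID is known from the queries above, the plan actually captured in AWR can be displayed with DBMS_XPLAN; the sql_id below is just a placeholder:

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_AWR('&sql_id'));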

Understanding Hive and Pig


Hi All,

Hive for data analysis, due to its structural similarity to SQL
Pig for data loading and transformation, due to its similarity to a procedural language
Sqoop - bulk data movement
Flume - aggregating streaming data
  
Steps for loading data into Hadoop

Using Hadoop and Pig from the data warehousing perspective, below are the tasks which can be done easily.

Important note – Pig has a script-like structure which is not very friendly for querying data, whereas Hive offers a familiar SQL interface much like relational SQL. Pig scripts read like a procedural language and make it easy for users to code transformations.

  1. Take data from the OLTP system as dump files, either a delta dump or a full dump
  2. Transform the data using Pig scripts and save it into tables


Using Hive to Query data from the Dump file

  1. Take a dump file from the OLTP system, or load the data into external tables
  2. Put the dump file into the HDFS system
  3. Register the file with HCatalog as a new table; when doing this you specify field delimiters, much the same as defining a flat file in DataStage (see the sketch below)
  4. HCatalog separates the schema and metadata information from the query; without it you would need to write the full file locations and layouts in every query
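A minimal HiveQL sketch of that registration step, assuming a pipe-delimited dump already copied into HDFS; the path and column names are illustrative:

CREATE EXTERNAL TABLE stg_orders (
  order_id  INT,
  cust_id   INT,
  order_amt DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/dumps/orders';

-- the file can then be queried like any other table
SELECT cust_id, SUM(order_amt) FROM stg_orders GROUP BY cust_id;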

Using Pig to transform the data

  1. The table information is stored in HCatalog
  2. We have to write a Pig script to transform this data; once the script is executed, the data is stored into another table in Hadoop


A site with a very good tutorial on Hive and Pig


Business Intelligence as a Service



Hi All,

Recently I have heard a lot about this topic of BI as a service, so here are some common examples and their uses.

Consider a system detecting retail fraud. In BI we have the tools to build an analytical engine capable of mining the potentially huge amounts of transactional data for patterns resulting in a list of suspicious credit cards. The adoption of SO principles allows us to offer a service providing the details of credit cards in use in our store as they are swiped. The SoBI architecture allows this service to be consumed by the BI component of the platform and analyze it against the known list of suspicious cards, and therefore to respond immediately should a potentially suspicious transaction be detected….. Very good Example. It allows applications to use BI in real time

Why does this need to be a service? Why not directly write SQL against the data warehouse from the Java/.NET application?
 
A simple case would be that the application developers do not need to know the logic for implementing the DW checks. Maintenance is also easy: no code change needs to be made in the application, since the service code can be changed at any time without application downtime.


Uses of Services in Business intelligence

Aggregation of transactional and historical data as a service. Given a service exposed through the SoBI framework that can now seamlessly aggregate current transactional data and historical warehouse data, a new breed of business services can be supported. An example would be slowly changing dimensions, which is where a value such as a customer name changes over time. Obviously this information is available within the data warehouse, so if there is a requirement to expose an entity service that gives a single view of the customer, this can be achieved more accurately and easily.
Brings interface abstraction patterns to BI. The ability to use the interface abstraction pattern over BI functionality makes the functionality and data more accessible to line-of-business applications and provides the capability to expose complex business rules usually buried in the ETL layer. 

My note - you can keep master data management here. In the data warehouse you have a single version of the truth for cleaned data, such as a customer's address, and that can be requested as a service from the application.

Cleansing and consolidation. Data will be changed for purposes of consistency and integrity. Where this change involves a mapping operation, that mapping will be made available to the architecture as a service. Where this change involves a correction to data, details of that correction will be fed back to the system of record through a request for change to the owning service, that is, the ETL process cannot change the data as it is not the owner. In turn, this process obviously drives improvement in the quality of data.


Obtain a cross-system, consistent view of the product. During the ETL stage data of the same product (or entity) may require transforming in order that it may be stored in a consistent manner. By service enabling access to the data the organization can expose a single common view of a product.

Mappings available as a service. As previously stated, mapping is a fundamental requirement within the ETL stage. The SoBI framework enables the mapping functionality to be exposed as a service for other uses within the organization. Such uses include EAI and enterprise reference scenarios. The availability of this service can also be used to promote best-of-breed transformations.

Calculation. The data warehouse is often used to store precalculated values to support the requirements of BI. For instance, sales and forecasting data may be held in different physical systems. The consolidation of the data from these systems into the data warehouse allows us to calculate and store actual versus forecast figures to support more performant analysis and reporting. The business logic used to define such calculations is often interesting to other parts of the business so the calculation to support this invention of data in the data warehouse will be made available to the SoBI architecture as a service.

Aggregation. To support fast response times, data in the data warehouse is often preaggregated. For instance, the data warehouse may contain data relating to sales at an individual transaction level, but the majority of management reporting may require seeing the totals at the month level. In this case, it is cost effective to roll the (potentially millions of) individual transactions up to a level more appropriate for the known queries and to store the results in an aggregation, avoiding the need for known queries to perform the aggregation action at query time. Where such aggregations are created, they will be made available to the architecture as a service.

Provides a road map for integration. It is believed that one of the outcomes for the SoBI framework is an ability, at an architectural level, to provide a framework for future integration scenarios.

Compliance/audit. Application of the SoBI framework requires adherence to a formal governance process. Examples include the identification of the system of record or operational data owner, and the definition of the messages that describe the data and functional requirements. Given only the owner of the data can make a change to that data, other systems simply make a request for change, and auditing can be carried out at a single point.

A Case of Factless fact


 


Challenge


When we group by doctor to get the total amount and the total patient count under that doctor, the total amount is correct because there is an entry in the fact table for it. But the total patient count under the doctor is incorrect, because not all patients have a bill amount. For this we need to design a factless fact table.

Factless facts are of two types -- EVENT TRACKING and COVERAGE. An event-tracking fact only says who took part in the event, while a coverage fact also allows you to answer who did not take part.

The answer is simple -- create a factless (coverage) fact table which has an entry for each doctor-patient combination. Sum up its rows per provider and you get the number of patients. Then sum up the billing fact details by provider. Join the two aggregated facts on the provider key, and only then join them with the dimension.

The reason for joining with dimensions later is size of the tables for joining would increase and it would require additional hash joins. Hash joins are CPU intensive and will slow down your system also consider your dimension is 1GB with 50000 doctors and 200 million records for fact you are repeating the doctor name for each of 200 million records and doctor name is 100 KB field then you are adding that much size so your temp size the amount of memory for PGA everything will increase this will cause a slowdown.