Sunday 8 December 2013

Dimensional Modelling Notes 1

Hi All,

Below are notes i have prepared while going through Ralph Kimball case studies they are mostly question that are discussed in the case study or question i have asked myself..

Question on Retail case study

 Note -- Scripts are at the end to create the tables if you want to replicate and understand case by case 

Queries

you want to check discount for each sku by location and department and that sku is not sold


SELECT SKU_ID,DISCOUNT,STATE FROM DIM_PRODUCT DP
INNER JOIN FACT_NO_BILLING FB
ON FB.DIM_PROD_SEQ_KEY = DP.PROD_SEQ_KEY
INNER JOIN DIM_PROMOTION DPM
ON DPM.PROM_SEQ_KEY = DIM_PROM_KEY
INNER JOIN DIM_LOCATION DL
ON DL.LOC_SEQ_KEY = FB.DIM_LOC_SEQ_KEY
ORDER BY 1

suppose a product is terminated only for one store how will you handle it

Three approaches

Note -- the first thought that would come in anyones mind is we should add termination date in product or location dimension but the issue is then you are terminating the product for all the location which is not true. the product has been terminated only for a particular location.


1) Join with the source table in datawarehouse and then join it with fact. That is suppose source says sku 1 is terminated for location 1 then you join this with fact. This is simplest of idea but it might slow your query down because you are joining with dw tables assumption ( you are having staging , datawarehouse and datamart and data mart is where you fact and dimension are)

2) Add a key in fact saying relationship valid or not. Issue with this .... suppose a product for store gets terminated you need to update keys for 10 million billing entries , updates takes huge time so idea is bad design

3) Create a factless fact table with entries for product that are terminated


Next thought is what will be what will be the promotion and time key values. In this case promotion will be null but there will be time key.


CREATE TABLE FACT_TERM_PROD
( dim_prod_seq_key int,
dim_loc_seq_key int,
dim_prom_key int,
dim_time_key int,
  constraint fk_prod_T foreign key (dim_prod_seq_key) references dim_product(prod_seq_key) ,
  constraint fk_loc_T foreign key (dim_loc_seq_key) references dim_location(loc_seq_key) ,
   constraint fk_promo_T foreign key (dim_prom_key) references dim_promotion(prom_seq_key) ,
    constraint fk_time_T foreign key (dim_time_key) references dim_time(time_seq_key) )
   
insert into FACT_TERM_PROD values ( 1,1,1,1)

insert into FACT_TERM_PROD values ( 1,2,1,1)

Now we have below query Which gives cross join because we are joining only on one key dim_prod_seq_key


SELECT SKU_ID,DISCOUNT,STATE,FTP.DIM_PROD_SEQ_KEY FROM DIM_PRODUCT DP
INNER JOIN FACT_BILLING FB
ON FB.DIM_PROD_SEQ_KEY = DP.PROD_SEQ_KEY
INNER JOIN DIM_PROMOTION DPM
ON DPM.PROM_SEQ_KEY = DIM_PROM_KEY
INNER JOIN DIM_LOCATION DL
ON DL.LOC_SEQ_KEY = FB.DIM_LOC_SEQ_KEY
LEFT OUTER JOIN FACT_TERM_PROD FTP
ON FTP.DIM_PROD_SEQ_KEY = FB.DIM_PROD_SEQ_KEY
and FTP.DIM_LOC_SEQ_KEY = FB.DIM_LOC_SEQ_KEY
WHERE FTP.DIM_PROD_SEQ_KEY IS NULL
order by 1


Factless fact table does this mean that for every fact where we are capturing process we need another fact to capture where the process was not satisfied ???


Should we snowflake a Dimesion

Ans -- As much as possible we should not do snowflaking we have option like bitmap index to handle low cardinality values (frequently occuring values like male female those are having low uniquenss , primary key has high cardinality)

Cases where bitmap index will not work -- Normally we will be having partition and in that partition we are having bitmap index. In practical life we never have easy scenario like male and female and even if its there the distribution of data is such that in one partition 2013 we are having 80% males and 20% female so oracle will not think of going for index. It will prefer full scan of the table. (if you are picking more than 50% of data using bitmap oracle may consider full scan economical compared to bitmap . 20% for B tree)

Even if you are picking less than 50% of data using bitmap oracle might not still opt for bitmap index due to the distribution of data in your table. High clustering factor of data.Below is a indepth article on clustering factor of index  

Link for Article on Clustering factor of index

Can you do market basket analysis using datawarehouse. Question --which product sell well together like chips and cola . so that they can be stacked together in super market. .. this kind of analysis can not be done using normal datawarehouse but will come into scope of analytical tools .. I will go through it and write down the logic used




Below are scripts to create a dummy case study to understand how to design tables


create table dim_product
( prod_seq_key int,
  sku_id int,
  dept_name varchar2(20),
  category_name varchar2(20),
  brand_name varchar2(20),
  box_type varchar2(20),
  storage_type varchar2(20),
  shelf_life varchar2(20) )


create table dim_promotion
( prom_seq_key int,
  promotion_type varchar2(20),
  discount int )

create table dim_location
( loc_seq_key int,
  state varchar2(20),
  store_number int)


create table dim_time
( time_seq_key int,
  day date,
  month date)

  alter table dim_location add constraint loc_pk primary key (loc_seq_key)
    
  alter table dim_product add constraint prod_pk primary key (prod_seq_key)
     
  alter table dim_promotion add constraint prom_pk primary key (prom_seq_key)
      
  alter table dim_time add constraint time_pk primary key (time_seq_key)
 


CREATE TABLE FACT_BILLING
( dim_prod_seq_key int,
dim_loc_seq_key int,
dim_prom_key int,
dim_time_key int,
amount int ,
  constraint fk_prod foreign key (dim_prod_seq_key) references dim_product(prod_seq_key) ,
  constraint fk_loc foreign key (dim_loc_seq_key) references dim_location(loc_seq_key) ,
   constraint fk_promo foreign key (dim_prom_key) references dim_promotion(prom_seq_key) ,
    constraint fk_time foreign key (dim_time_key) references dim_time(time_seq_key) )


CREATE TABLE FACT_NO_BILLING
( dim_prod_seq_key int,
dim_loc_seq_key int,
dim_prom_key int,
dim_time_key int,
  constraint fk_prod_N foreign key (dim_prod_seq_key) references dim_product(prod_seq_key) ,
  constraint fk_loc_N foreign key (dim_loc_seq_key) references dim_location(loc_seq_key) ,
   constraint fk_promo_N foreign key (dim_prom_key) references dim_promotion(prom_seq_key) ,
    constraint fk_time_N foreign key (dim_time_key) references dim_time(time_seq_key) )

Dummy scripts for inserting data into retail case study tables

---------------------------------------------------------------------------------------------------------  



insert into dim_product values ( 1,1,'GROCERY','GRAINS','HERITAGE','REFILL','ROOM TEMP','25 MONTHS')

insert into dim_product values ( 2,2,'SPICES','TURMERIC','HERITAGE','REFILL','ROOM TEMP','25 MONTHS')

insert into dim_product values ( 3,3,'COSMETIC','GRAINS','HERITAGE','REFILL','ROOM TEMP','25 MONTHS')

insert into dim_product values ( 4,4,'SNACKS','GRAINS','HERITAGE','REFILL','ROOM TEMP','25 MONTHS')

insert into dim_product values ( 5,5,'FRUITS','GRAINS','HERITAGE','REFILL','ROOM TEMP','25 MONTHS')
------------------------------------------------------------------------------------------------------


INSERT INTO DIM_PROMOTION VALUES ( 1,'ADS',10)

INSERT INTO DIM_PROMOTION VALUES ( 2,'AISLE',20)

INSERT INTO DIM_PROMOTION VALUES ( 3,'SHELF',50)

INSERT INTO DIM_PROMOTION VALUES ( 4,'POSTER',10)

----------------------------------------------------------------

INSERT INTO DIM_LOCATION VALUES ( 1,'AP',10)

INSERT INTO DIM_LOCATION VALUES ( 2,'MP',10)

INSERT INTO DIM_LOCATION VALUES ( 3,'UP',10)

INSERT INTO DIM_LOCATION VALUES ( 4,'RAJASTHAN',10)

INSERT INTO DIM_LOCATION VALUES ( 5,'BIHAR',10)

----------------------------------------------------------

INSERT INTO DIM_TIME VALUES ( 1,SYSDATE,TRUNC(SYSDATE))

INSERT INTO DIM_TIME VALUES ( 2,SYSDATE-30,TRUNC(SYSDATE))

INSERT INTO DIM_TIME VALUES ( 3,SYSDATE-60,TRUNC(SYSDATE))

INSERT INTO DIM_TIME VALUES ( 4,SYSDATE-90,TRUNC(SYSDATE))

----------------------------------------------------------------------------


insert into fact_billing values ( 1,1,1,1,10)

insert into fact_billing values ( 1,2,1,1,10)

insert into fact_billing values ( 1,2,2,1,10)

insert into fact_billing values ( 1,3,3,1,10)

insert into fact_billing values ( 2,1,1,1,10)

insert into fact_billing values ( 2,2,1,1,10)

insert into fact_billing values ( 2,2,2,1,10)

insert into fact_billing values ( 2,2,2,2,10)

insert into fact_billing values ( 3,1,1,1,10)

----------------------------------------------------------------------------

insert into fact_NO_billing values ( 4,1,1,1)

insert into fact_NO_billing values ( 4,2,1,1)

insert into fact_NO_billing values ( 4,3,1,1)

insert into fact_NO_billing values ( 4,4,1,1)

insert into fact_NO_billing values ( 5,1,1,1)

insert into fact_NO_billing values ( 5,2,1,1)

insert into fact_NO_billing values ( 5,3,1,1)

insert into fact_NO_billing values ( 5,4,1,1)

No comments:

Post a Comment