Hi All,
Consider tables as files. The fact table is a big file worth 50 GB and the dimension is a small file worth 1 GB. Now, what mechanisms can be used to find the data that matches between the big and the small file? Whenever the CPU has to read data, that data has to be in RAM. For now do not consider Oracle specifics like the PGA, SGA, or buffer cache; just think in terms of plain RAM.
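To make the "data has to be in RAM" idea concrete, here is a rough Python sketch (the file name and chunk size are made up for illustration) of streaming a file through RAM one fixed-size chunk at a time; every mechanism below is just a different way of repeating this pattern:

# Read a file in fixed-size chunks so it never has to fit in RAM all at once.
CHUNK = 1 * 1024 * 1024 * 1024  # pretend we can spare 1 GB of RAM at a time

def read_in_chunks(path, chunk_size=CHUNK):
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Usage (hypothetical file): stream the 50 GB fact file 1 GB at a time.
# for chunk in read_in_chunks("fact.dat"):
#     process(chunk)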
1) Read the small 1 GB file into RAM first and then search the 50 GB file for the matching rows. Say the 1 GB file has 10 rows and only full scans are allowed, so you end up reading the full 50 GB table to find each row (assume the 10 rows you want are at the end). You cannot read the full 50 GB into RAM in one go, so you read 1 GB at a time; if the row is not found, you clear RAM and load the next 1 GB. For 10 rows that is roughly 500 GB of reads, which is a BAD idea.
2) Or flip it: read the 50 GB table 1 GB at a time and, for each fact row, scan the 1 GB smaller table for a match. If the 50 GB fact has 1,000 rows, you end up doing about 1 TB of I/O, which is also BAD. This is your nested loop. It works only if there is an index on the 1 GB table that lets you pinpoint the correct row without a full table scan, and ideally an index on the 50 GB table as well so you are not selecting the entire 50 GB.
3) So now let us look at some more efficient algorithms for finding matching rows.
4) Sort merge joins --- you sort both tables on the join key. It then works a bit like a nested loop: you take one sorted output and probe the second sorted output until you hit a row that no longer matches. So record 1 from source 1 only has to scan its matching range in source 2; there is no need to look further, because the data is sorted and you will not find any more matches. A minimal sketch of a nested loop and a sort merge join is shown right after this list.
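To make the nested loop and sort merge ideas concrete, here is a minimal in-memory sketch in Python (made-up rows, not how Oracle implements it internally):

def nested_loop_join(dim_rows, fact_rows, key):
    # For each row of one input, rescan the other input looking for matches.
    out = []
    for d in dim_rows:                       # outer row source
        for f in fact_rows:                  # inner row source, scanned again for every outer row
            if d[key] == f[key]:
                out.append({**d, **f})
    return out

def sort_merge_join(dim_rows, fact_rows, key):
    # Sort both inputs on the join key, then merge them in a single pass each.
    left = sorted(dim_rows, key=lambda r: r[key])
    right = sorted(fact_rows, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            # Emit every right row sharing this key, then stop looking:
            # the sort guarantees no further matches for left[i] later on.
            k = j
            while k < len(right) and right[k][key] == left[i][key]:
                out.append({**left[i], **right[k]})
                k += 1
            i += 1
    return out

dim  = [{"department_id": 10}, {"department_id": 20}]
fact = [{"department_id": 20, "amount": 5}, {"department_id": 10, "amount": 7}]
print(sort_merge_join(dim, fact, "department_id"))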
HASH JOINS
To illustrate a hash table, assume that the database hashes hr.departments in a join of departments and employees. The join key column is department_id. The first 5 rows of departments are as follows:
SQL> select * from departments where rownum < 6;
DEPARTMENT_ID DEPARTMENT_NAME                MANAGER_ID LOCATION_ID
------------- ------------------------------ ---------- -----------
           10 Administration                        200        1700
           20 Marketing                             201        1800
           30 Purchasing                            114        1700
           40 Human Resources                       203        2400
           50 Shipping                              121        1500
The database applies the hash function to each department_id in the table, generating a hash value for each. For this illustration, the hash table has 5 slots (it could have more or fewer). Because n is 5, the possible hash values range from 1 to 5. The hash function might generate the following values for the department IDs:
f(10) = 4
f(20) = 1
f(30) = 4
f(40) = 2
f(50) = 5
Note that the hash function happens to generate the same hash value of 4 for departments 10 and 30. This is known as a hash collision. In this case, the database puts the records for departments 10 and 30 in the same slot, using a linked list. Conceptually, the hash table looks as follows:
1 20,Marketing,201,1800
2 40,Human Resources,203,2400
3
4 10,Administration,200,1700 -> 30,Purchasing,114,1700
5 50,Shipping,121,1500
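As a rough sketch of how that hash table could be built in code, the following Python hardcodes the hash values from the illustration above (a real hash function would compute them) and chains the colliding rows for departments 10 and 30 in the same slot:

departments = [
    (10, "Administration",  200, 1700),
    (20, "Marketing",       201, 1800),
    (30, "Purchasing",      114, 1700),
    (40, "Human Resources", 203, 2400),
    (50, "Shipping",        121, 1500),
]

f = {10: 4, 20: 1, 30: 4, 40: 2, 50: 5}           # hash values from the text above

hash_table = {slot: [] for slot in range(1, 6)}   # 5 slots, numbered 1 to 5
for row in departments:
    hash_table[f[row[0]]].append(row)             # colliding rows chain in the same slot

for slot, rows in hash_table.items():
    print(slot, " -> ".join(str(r) for r in rows))
# Slot 4 ends up holding both department 10 and department 30.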
Hash Join: Basic Steps
A hash join of two row sources uses the following basic steps:
The database performs a full scan of the smaller data set, and then applies a hash function to the join key in each row to build a hash table in the PGA.
The database probes the second data set, using whichever access mechanism has the lowest cost.
Typically, the database performs a full scan of both the smaller and larger data set. (A sketch of the whole build-and-probe algorithm is shown after the steps below.)
For each row retrieved from the larger data set, the database does the following:
Applies the same hash function to the join column or columns to calculate the number of the relevant slot in the hash table.
For example, to probe the hash table for department ID 30, the database applies the hash function to 30, which generates the hash value 4.
Probes the hash table to determine whether any rows exist in the slot.
If no rows exist, then the database processes the next row in the larger data set. If rows exist, then the database proceeds to the next step.
Checks the join column or columns for a match. If a match occurs, then the database either reports the rows or passes them to the next step in the plan, and then processes the next row in the larger data set.
If multiple rows exist in the hash table slot, the database walks through the linked list of rows, checking each one. For example, if department 30 hashes to slot 4, then the database checks each row until it finds 30.
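Putting the build and probe phases together, here is a minimal Python sketch (made-up employee rows and a simple stand-in hash function, not Oracle's internal one):

N_SLOTS = 5

def h(key):
    # Stand-in hash function just for this sketch.
    return (key // 10) % N_SLOTS

departments = [(10, "Administration"), (20, "Marketing"), (30, "Purchasing")]
employees   = [(101, "Adams", 10), (102, "Baer", 20), (103, "Chen", 30)]   # made-up rows

# Build phase: scan the smaller data set, hash the join key, store each row in its slot.
hash_table = {}
for dept in departments:
    slot = h(dept[0])
    hash_table.setdefault(slot, []).append(dept)

# Probe phase: scan the larger data set, hash each join key with the same
# function, and check only the rows chained in that one slot.
for emp in employees:
    slot = h(emp[2])
    for dept in hash_table.get(slot, []):    # walk the chain in the slot
        if dept[0] == emp[2]:                # re-check the join key (collisions are possible)
            print(emp[1], "->", dept[1])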