Saturday 5 July 2014

Database Architectures (Teradata vs Oracle vs Vertica)

Hi All,

Link to my earlier article, which was written a year back. The earlier article covers the basics of the different architectures; my knowledge has matured over the year, and this one goes more in depth on the topic.

Very good article on shared nothing by an MIT professor: Link. I urge everyone to read the original article on the subject. Below are the notes I have taken from the article.

Using Parallelism to achieve better performance 

Basically, we can either use multiple processors in parallel to do a task, multiple disks accessed in parallel to do a task, or a combination of multiple processors and multiple disks in parallel to do a task.
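
To make this concrete, here is a minimal Python sketch (my own illustration, not from the article) of the first option: splitting one scan-and-aggregate task across multiple processors with a process pool.

# Minimal sketch (my own illustration, not from the article): split a
# scan-and-aggregate task across multiple processors with a process pool.
from multiprocessing import Pool

def scan_chunk(rows):
    # Each worker scans its own chunk of rows in parallel.
    return sum(r for r in rows if r % 2 == 0)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]   # 4 chunks for 4 workers
    with Pool(processes=4) as pool:
        partials = pool.map(scan_chunk, chunks)
    print(sum(partials))                      # same answer as a serial scan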

Shared memory (shared RAM, shared disk) - All processors share the same RAM and the same disk. Examples: Oracle (non-RAC), MS SQL Server. Advantages of these systems: locking and commits are easier because the buffer pool and the lock table are maintained in the same main memory (RAM).
Disadvantages - They do not scale well because of the cost of keeping the L2 cache consistent across processors, so SMP systems of more than 16 processors are rare. The bandwidth of the bus connecting the CPUs to RAM is easily exhausted, so it becomes the bottleneck.
Clarification: L1 cache is inside the processor (8KB - 64KB); L2 sits between the processor and RAM (64KB to 4MB). The scalability issue with SMP computers is L2 cache coherence, that is, keeping the data in every processor's L2 cache current. Remember that L2 sits between RAM and the processor. So SMP systems usually do not exceed 16 processors. System bus bottleneck - all 16 processors use the same system bus to connect to RAM.
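
To see why the shared bus saturates, here is a back-of-the-envelope sketch in Python; the bandwidth numbers are assumptions I picked only to show the shape of the problem.

# Back-of-the-envelope sketch with assumed numbers, just to show why the
# shared bus becomes the bottleneck as SMP processor counts grow.
BUS_BANDWIDTH_GBPS = 12.8          # assumed shared CPU<->RAM bus capacity
DEMAND_PER_CPU_GBPS = 1.0          # assumed per-processor memory traffic

for cpus in (2, 4, 8, 16, 32):
    demand = cpus * DEMAND_PER_CPU_GBPS
    print(f"{cpus:2d} CPUs need {demand:5.1f} GB/s "
          f"-> bus {'saturated' if demand > BUS_BANDWIDTH_GBPS else 'ok'}")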



Shared disk - Each processor has its own memory (RAM) but the disk is shared. The shared disk is usually a SAN (storage area network) or NAS (network attached storage).
Disadvantages - 1) The interconnect between the CPUs and the shared disk system becomes the problem.
2) Maintaining locks on objects is an issue since there is no common RAM; software has to implement the same cache consistency that hardware implements in SMP. These two issues compound as we try to scale the system.
Oracle example - It requires that each system have its own lock table and buffer pool, which must be synchronized with their peers. This synchronization is painful and has serious performance problems, which limit the scalability of shared disk implementations. An observation you can make on your own system: watch the interconnect bytes and see whether they are really climbing.
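
Here is a toy Python sketch of that synchronization cost (my own simplification, not Oracle's actual protocol): every lock a node takes must be broadcast to its peers, so the message traffic grows with the number of nodes.

# Toy sketch of shared-disk lock synchronization (my simplification, not
# Oracle's actual protocol): each node keeps its own lock table, and every
# acquire must be broadcast to all peers so their tables stay in sync.
class Node:
    def __init__(self, name):
        self.name = name
        self.peers = []             # other nodes in the cluster
        self.lock_table = {}        # block_id -> owning node
        self.messages_sent = 0

    def acquire(self, block_id):
        self.lock_table[block_id] = self.name
        for peer in self.peers:     # sync traffic grows with cluster size
            peer.lock_table[block_id] = self.name
            self.messages_sent += 1

nodes = [Node(f"node{i}") for i in range(4)]
for n in nodes:
    n.peers = [p for p in nodes if p is not n]

nodes[0].acquire("block42")
print(nodes[0].messages_sent)   # 3 messages for one lock on a 4-node cluster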

How shared cache works - 1) The processor first checks if the page exists in its own local buffer cache.
2) If not, it checks the shared cache.
3) If not, it reads from disk.
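
A minimal Python sketch of this three-step lookup (illustrative only; real buffer managers are far more involved):

# Minimal sketch of the three-step lookup (illustrative only):
# local buffer cache -> shared cache -> disk.
def read_page(page_id, local_cache, shared_cache, disk):
    if page_id in local_cache:              # 1) check local buffer cache
        return local_cache[page_id]
    if page_id in shared_cache:             # 2) check shared cache
        page = shared_cache[page_id]
    else:                                   # 3) fall back to a disk read
        page = disk[page_id]
        shared_cache[page_id] = page        # populate shared cache
    local_cache[page_id] = page             # populate local cache
    return page

disk = {7: "page-7-bytes"}
local, shared = {}, {}
print(read_page(7, local, shared, disk))    # miss, miss -> disk
print(read_page(7, local, shared, disk))    # now served from local cache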

Why shared disk does not work well for data warehouses (Oracle RAC is shared disk)

1) Most fact table reads require reading from disk, because the fact table is too large to fit in the buffer cache. (The way Oracle reads is to check the SGA to see whether the block exists in the buffer cache, and if not, read it from disk. Also, apart from direct path reads, everything is first read into the buffer cache and then used by the processors.) Since the chance of finding the data in the buffer cache is low, this check is overhead; see the arithmetic sketch after this list.
2) The bus between the CPUs and the disk fills up in certain situations.
3) Resource contention for the same blocks can occur in certain situations.
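
Quick arithmetic on point 1, with sizes I have assumed purely for illustration, shows why the buffer cache check is almost pure overhead for a large fact table:

# Quick arithmetic with assumed sizes: the chance of finding a random fact
# table block already in the buffer cache is tiny, so checking is overhead.
FACT_TABLE_GB = 2000     # assumed fact table size
BUFFER_CACHE_GB = 64     # assumed buffer cache size

hit_ratio = BUFFER_CACHE_GB / FACT_TABLE_GB
print(f"best-case hit ratio: {hit_ratio:.1%}")   # ~3.2%, even if the cache
                                                 # held nothing but this table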




Shared nothing - Each processor has its own RAM and disk. The data is horizontally partitioned across nodes: for example, if a table is spread across 4 nodes, each row lives on the disk of exactly one node (it's just a simple example). Data is pre-partitioned before being loaded into the database.
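
A minimal Python sketch of that pre-partitioning step, reusing the 4-node example (the modulo-on-key scheme is just one simple choice):

# Minimal sketch of pre-partitioning rows across 4 nodes before loading
# (the modulo-on-key scheme here is just one simple choice).
NUM_NODES = 4
partitions = {n: [] for n in range(NUM_NODES)}

rows = [{"order_id": i, "amount": i * 10} for i in range(10)]
for row in rows:
    node = row["order_id"] % NUM_NODES   # each row lands on exactly one node
    partitions[node].append(row)

for node, part in partitions.items():
    print(f"node {node} stores {len(part)} rows")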


SQL Server, MySQL (shared memory)
Oracle RAC (shared disk)
Teradata, Netezza, Vertica (shared nothing)

In shared nothing, the machines are connected only by the network. The "shared nothing" architecture still shares network bandwidth, because it must transfer data over the network from the machines doing the Map tasks to the machines doing the Reduce tasks.

In a shared nothing system, CPU cores are solely responsible for individual data sets, and the only way to access a specific piece of data is through the CPU core that owns that subset of the data; such systems are also commonly known as MPP (massively parallel processing) systems. In order to achieve good workload distribution, MPP systems have to use a hash algorithm to distribute (partition) data evenly across the available CPU cores.
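
Here is a Python sketch of that routing rule (the hash function and core count are placeholders I chose): every access for a given key must go through the one core that owns it.

# Sketch of MPP routing (hash function and core count are placeholders):
# the only way to reach a row is through the core that owns its hash bucket.
import hashlib

NUM_CORES = 8

def owning_core(key):
    # Stable hash so the same key always maps to the same core.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_CORES

for key in ("cust-1001", "cust-1002", "cust-1003"):
    print(f"{key} -> core {owning_core(key)}")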

As a result, MPP systems introduce mandatory, fixed parallelism in order to perform operations that involve table scans; this fixed parallelism relies entirely on static data partitioning done at database or object creation time. Most non-Oracle data warehouse systems are MPP systems.
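
To make "fixed parallelism" concrete, a small sketch with assumed counts: a full scan fans out to exactly the partitions created up front, so extra workers beyond that count sit idle.

# Sketch: parallelism for a scan is fixed by the partition count chosen at
# creation time; workers beyond that count cannot help.
PARTITIONS_AT_CREATION = 4    # assumed, decided when the table was created
AVAILABLE_WORKERS = 16        # assumed, the hardware available today

useful_parallelism = min(PARTITIONS_AT_CREATION, AVAILABLE_WORKERS)
print(f"scan runs {useful_parallelism}-way parallel; "
      f"{AVAILABLE_WORKERS - useful_parallelism} workers idle")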






