Sunday 21 February 2016

Part 1. NUMA Architecture



What is NUMA?

Short for Non-Uniform Memory Access, NUMA is a parallel processing architecture in which each processor has its own local memory but can also access memory owned by other processors. It is called non-uniform because memory access is faster when a processor uses its own memory than when it borrows memory from another processor. NUMA computers offer the scalability of MPP (massively parallel processing) and the programming ease of SMP (symmetric multiprocessing).

To understand this concept, let’s understand parallel memory architecture.

Parallel memory architectures are of two types:
  • Shared memory
  • Distributed Memory

Shared memory: All CPUs share the same memory and treat it as one global address space (refer to the diagram below). Cache coherence is the main issue in this kind of architecture. This architecture is normally used for general-purpose CPUs, for example in laptops and desktops.

Note: Cache coherence is the consistency of shared data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, inconsistent data can arise, which is particularly the case with CPUs in a multiprocessing system.




Distributed memory: In this architecture each CPU has its own local memory, and the CPUs are connected over a network. Because each CPU has its own local memory, there is no cache coherence issue and no global address space in this scenario. When a CPU wants to access memory associated with another CPU, it must communicate explicitly, and this communication costs many memory cycles. This kind of architecture is used in clusters, where the cluster nodes are connected to each other over a network.


Shared memory comes in two types:
  • UMA (Uniform Memory Access)
  • NUMA (Non-Uniform Memory Access)
In UMA, all CPUs access memory uniformly, which means equal latency for all. As we discussed, cache coherence (CC) is the main issue in shared memory systems, so UMA-based systems with cache-coherence hardware are called CC-UMA, and bus-based CC-UMA systems are called SMP (Symmetric Multiprocessor) systems.

Below is a diagram of an SMP system. In this diagram we can see that memory is shared by the CPUs, and the CPUs are connected over a single bus. This kind of architecture runs into bus contention beyond roughly 8 to 12 CPUs, since they are all sharing the same bus.









Why is NUMA needed?

CPU counts and clock speeds keep increasing, while memory latency is very difficult to reduce; together with the bus contention described above, this limits how far SMP can scale.
These issues are addressed by the NUMA architecture. Let's understand the architecture of NUMA.

In this architecture, a set of CPUs shares the same memory and I/O. Such a set of CPUs, memory, and I/O is called a NUMA node. NUMA nodes are connected to each other over a scalable network. In this scenario a CPU can access memory associated with another node in a cache-coherent way; however, it takes longer when a CPU accesses memory on another node, and it is of course faster when a node accesses its own local memory. This is why it is called NUMA (Non-Uniform Memory Access).
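On Linux, the kernel exposes this node layout under /sys, so you can see which CPUs belong to which NUMA node. A minimal sketch (Linux-only; on a single-node machine it prints just node0):

```python
import glob
import os

# Linux exposes each NUMA node as /sys/devices/system/node/nodeN.
# Each node directory contains a "cpulist" file naming the CPUs on that node.
nodes = sorted(glob.glob("/sys/devices/system/node/node[0-9]*"))
for node in nodes:
    with open(os.path.join(node, "cpulist")) as f:
        print(os.path.basename(node), "CPUs:", f.read().strip())
```

On a two-node machine this would print something like `node0 CPUs: 0-7` and `node1 CPUs: 8-15`.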







Terminology Used in NUMA:
  • Local memory
  • Foreign memory
  • NUMA ratio (if the NUMA ratio is 1, the system is SMP)

If the system is running a thread on a node A CPU, then the memory attached to node A is local memory for that thread. If a CPU of node A accesses the memory of node B, that memory is called remote or foreign memory. The NUMA ratio is the ratio of the cost of accessing foreign memory to the cost of accessing local memory: the greater the ratio, the greater the cost of accessing foreign memory. If the NUMA ratio is 1, the system is effectively an SMP.
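As a tiny worked example, suppose we measure local access at 100 ns and remote access at 150 ns on a two-node system (both numbers hypothetical):

```python
# Hypothetical measured latencies on a two-node system, in nanoseconds.
local_latency = 100    # node A CPU accessing node A memory
remote_latency = 150   # node A CPU accessing node B memory

numa_ratio = remote_latency / local_latency
print(numa_ratio)  # → 1.5
```

A ratio of 1.0 would mean foreign memory costs the same as local memory, i.e. the system behaves as an SMP.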

Note: Intel's NUMA implementations use the MESIF protocol for cache coherence (other vendors use related protocols such as MOESI).

Software optimizations to improve performance on NUMA-aware systems

Two measures should be considered to improve the performance of systems supporting the NUMA architecture:

  • Processor affinity
  • Data placements
   
Processor affinity: in a multithreaded system, the scheduler assigns resources to threads and may switch a thread between cores to ensure timely execution. On a NUMA system, however, migrating a thread from node A to node B makes its memory accesses slower. For example, if a thread starts on node A and later switches to node B, the memory it allocated on node A becomes foreign to it, and every access to that memory takes longer. The system is therefore responsible for keeping a thread running within a single NUMA node whenever possible.
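This pinning can also be requested from user space. A minimal sketch using Python's `os.sched_setaffinity` (Linux-only; the CPU number here is an assumption — on a real NUMA box you would pass the full CPU list of one node):

```python
import os

# Pin the current process to CPU 0 so the scheduler cannot migrate it.
# On a NUMA system you would pass the CPU set of one whole node instead,
# keeping the thread and the memory it allocates on the same node.
os.sched_setaffinity(0, {0})
print(os.sched_getaffinity(0))  # the allowed CPU set, now {0}
```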

Data placement: the same reasoning applies to data. If the system can keep a thread's data local for as long as possible, it increases the throughput of the system.
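The effect of data placement can be sketched with a toy latency model (all numbers hypothetical): average access latency grows linearly with the fraction of accesses that go to foreign memory.

```python
def avg_latency(local_ns, numa_ratio, local_fraction):
    """Average memory latency when local_fraction of accesses hit local memory."""
    remote_ns = local_ns * numa_ratio
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns

# 100 ns local latency, NUMA ratio 1.5:
print(avg_latency(100, 1.5, 1.0))  # → 100.0 (all accesses local)
print(avg_latency(100, 1.5, 0.5))  # → 125.0 (half the accesses are foreign)
```

Keeping `local_fraction` close to 1 — by allocating data on the node where the thread runs — is exactly what good data placement buys you.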


OS/database products supporting the NUMA architecture:

  • Microsoft Windows 7, Windows Vista, etc.
  • Oracle 8i, Oracle 10g, SQL Server 2008, etc.



