Tuesday 9 January 2018

Oracle Database 11g R2 clusterware startup sequence - Interview


SHORT INTRO TO ORACLE REAL APPLICATION CLUSTERS

Oracle Database with the Oracle Real Application Clusters (RAC) option allows multiple instances running on different servers to access the same physical database stored on shared storage. The database spans multiple hardware systems yet appears as a single unified database to the application.
This enables the use of commodity hardware to reduce total cost of ownership and to provide a scalable computing environment that supports various application workloads.
If additional computing capacity is needed, customers can add nodes instead of replacing their existing servers. The only requirement is that all servers in the cluster run the same operating system and the same version of Oracle.
They do not have to be of the same capacity, which saves on capital expenditure: customers can buy servers with the latest CPUs and memory and use them alongside their existing servers.
This architecture also provides high availability, as RAC instances running on different nodes protect against a node failure.
Customers' requirements for database availability and scalability continue to increase, and they cannot afford any downtime in their environments. These requirements are not limited to the database; they include other critical components such as servers, the network, and client connections. Furthermore, there is a need for an intelligent resource manager that can dynamically redirect incoming workloads to nodes that are idle or, in some cases, more capable in terms of computing power and memory.
These features can be broadly classified as features that provide:
» Better Scalability
» Better Availability
» More Efficiency and Management of a pool of Clusters

CLUSTER STARTUP SEQUENCE : 

Beginning with version 11g Release 2, the ASM spfile is stored automatically in the first disk group created during Grid Infrastructure installation. Since the voting disk and OCR are stored in ASM, ASM needs to be started on the node. To start ASM, its spfile is needed, but that spfile is itself located in an ASM disk group. How does the clusterware resolve this chicken-and-egg problem?

When a node of an Oracle Clusterware cluster restarts, OHASD (the Oracle High Availability Services daemon) is started by platform-specific means.
The image below shows all the processes spawned by OHASD.

OHASD accesses the OLR (Oracle Local Registry), stored on the local file system, to get the data needed to complete its initialization.
  • OHASD brings up GPNPD and CSSD.
  • CSSD accesses the GPnP profile stored on the local file system, which contains the following vital bootstrap data:
a. the ASM_DISKSTRING parameter (if specified), used to locate the disks on which ASM is configured
b. the ASM spfile location : the name of the disk group containing the ASM spfile
c. the location of the voting files : ASM
  • CSSD scans the headers of all ASM disks (as indicated by ASM_DISKSTRING in the GPnP profile) to identify the disks containing the voting files. Using the pointers in the ASM disk headers, CSSD accesses the voting file locations on the ASM disks, completes its initialization, and starts or joins an existing cluster.

  • To read the ASM spfile during ASM instance startup, it is not necessary to open the disk group; all information needed to access it is stored in the device's header. OHASD reads the header of the ASM disk containing the ASM spfile (as recorded in the GPnP profile) and, using the pointers in the disk header, reads the contents of the spfile. The ASM instance is then started.
  • With the ASM instance operating and its disk groups mounted, the clusterware's OCR, stored in an ASM disk group, becomes accessible.
  • OHASD starts CRSD, which accesses the OCR in the ASM disk group.
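The lower-stack resources that OHASD manages can be listed on a running node (as root or the Grid owner; the exact resource list varies by setup):
# crsctl stat res -t -init
This typically shows ora.asm, ora.cssd, ora.crsd, ora.gpnpd, ora.evmd and the other OHASD-managed resources together with their current state.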
If you have read the above startup sequence, you can see that there are 5 important files:
Note : every file is associated with a process.
FILE 1 : OLR ( ORACLE LOCAL REGISTRY )        ------------------> OHASD process
FILE 2 : GPNP PROFILE ( GRID PLUG AND PLAY )  ------------------> GPNPD process
FILE 3 : VOTING DISK                          ------------------> CSSD process
FILE 4 : ASM SPFILE                           ------------------> OHASD process
FILE 5 : OCR ( ORACLE CLUSTER REGISTRY )      ------------------> CRSD process

Let's study these files in detail.

FILE 1 : OLR : ORACLE LOCAL REGISTRY :
It is the very first file accessed to start up the clusterware when the OCR is stored on ASM. The OCR must be readable to find out which resources need to be started on a node, but if the OCR is on ASM, it cannot be read until ASM (which is itself a node resource, and this information is stored in the OCR) is up. To resolve this problem, information about the resources that need to be started on a node is stored in an operating system file called the Oracle Local Registry, or OLR.
Since the OLR is an operating system file, it can be read and written by the various processes on the node regardless of whether the clusterware is up or down. Hence, when a node joins the cluster, the OLR on that node is read and the various resources, including ASM, are started on the node. Once ASM is up, the OCR becomes accessible and is used from then on to manage all the clusterware resources. If the OLR is missing or corrupted, the clusterware cannot be started on that node.
The OLR stores data such as:
– ORA_CRS_HOME
– localhost version
– active version
– GPnP details
– OCR latest backup time and location
– information about OCR daily and weekly backup locations
– node name, etc.
This information stored in the OLR is needed by OHASD to start or join a cluster.
Location of the OLR :
$GRID_HOME/cdata/<hostname>.olr
To check the OLR location :
cat /etc/oracle/olr.loc
To check the OLR from the command line :
$ ocrcheck -local
To check the OLR backups :
$ ocrconfig -local -showbackup
To take an OLR backup (root user) :
# ocrconfig -local -manualbackup
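If the OLR is lost or corrupted, the clusterware cannot start on that node, so the OLR has to be restored from one of the backups listed above. A minimal sketch as root (the backup file name below is only an example; use whatever ocrconfig -local -showbackup reports on your node):
# crsctl stop crs -f
# ocrconfig -local -restore $GRID_HOME/cdata/host01/backup_20180109_120000.olr
# crsctl start crs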
FILE 2 : GPNP PROFILE : GRID PLUG AND PLAY :

WHAT IS GPNP PROFILE?

The GPnP profile is a small XML file located in GRID_HOME/gpnp/<hostname>/profiles/peer under the name profile.xml. It is used to establish the correct global personality of a node. Each node maintains a local copy of the GPnP profile, which is maintained by the GPnP daemon (GPnPD).

WHAT DOES GPNP PROFILE CONTAIN?

The GPnP profile stores information required for the startup of Oracle Clusterware, such as the SPFILE location and the ASM disk string (its attributes are listed in detail below).
Location of the GPnP profile : $GRID_HOME/gpnp/<hostname>/profiles/peer/profile.xml
Command to view the GPnP profile : gpnptool get
It contains various attributes defining the node personality:
– Cluster name
– Network classifications (public/private)
– Storage to be used for CSS
– Storage to be used for ASM : SPFILE location, ASM disk string, etc.
– Digital signature information : the profile is security sensitive. It might identify the storage to be used as the root partition of a machine; hence, it contains digital signature information of the provisioning authority.
Here is the GPnP profile of my RAC setup.
gpnptool can be used to read and edit the GPnP profile.
[root@host01 peer]# gpnptool get
<?xml version="1.0" encoding="UTF-8"?><gpnp:GPnP-Profile Version="1.0" xmlns="http://www.grid-pnp.org/2005/11/gpnp-profile" xmlns:gpnp="http://www.grid-pnp.org/2005/11/gpnp-profile" xmlns:orcl="http://www.oracle.com/gpnp/2005/11/gpnp-profile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.grid-pnp.org/2005/11/gpnp-profile gpnp-profile.xsd" ProfileSequence="7" ClusterUId="14cddaccc0464f92bfc703ec1004a386" ClusterName="cluster01" PALocation=""><gpnp:Network-Profile><gpnp:HostNetwork id="gen" HostName="*"><gpnp:Network id="net1" IP="192.9.201.0" Adapter="eth0" Use="public"/><gpnp:Network id="net2" IP="10.0.0.0" Adapter="eth1" Use="cluster_interconnect"/></gpnp:HostNetwork></gpnp:Network-Profile><orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/><orcl:ASM-Profile id="asm" DiscoveryString="" SPFile="+DATA/cluster01/asmparameterfile/registry.253.783619911"/><ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#"><ds:SignedInfo><ds:CanonicalizationMethod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/><ds:SignatureMethod Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-sha1"/><ds:Reference URI=""><ds:Transforms><ds:Transform Algorithm="http://www.w3.org/2000/09/xmldsig#enveloped-signature"/><ds:Transform Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"> <InclusiveNamespaces xmlns="http://www.w3.org/2001/10/xml-exc-c14n#" PrefixList="gpnp orcl xsi"/></ds:Transform></ds:Transforms><ds:DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/><ds:DigestValue>4VMorzxVNa+FeOx2SCk1unVBpfU=</ds:DigestValue></ds:Reference></ds:SignedInfo><ds:SignatureValue>bbzV04n2zSGTtUEvqqB+pjw1vH7i8MOEUqkhXAyloX0a41T2FkDEA++ksc0BafndAk7tR+6LGdppE1aOsaJUtYxQqaHJdpVsJF+sj2jN7LPJlT5NBt+K7b08TLjDID92Se6vEiDAeeKlEbpVWKMUIvQvp6LrYK8cDB/YjUnXuGU=</ds:SignatureValue></ds:Signature></gpnp:GPnP-Profile>
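Once ASM is up, the SPFile attribute recorded in the profile above can be cross-checked with asmcmd (just a verification step, not part of the original startup walkthrough); it should print the same value as the SPFile entry shown in the profile:
$ asmcmd spget
+DATA/cluster01/asmparameterfile/registry.253.783619911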

FILE 3 : VOTING DISK :  
A voting disk is a file that manages information about node membership.
What is stored in the voting disk?
Voting disks contain static and dynamic data.
Static data : information about the nodes in the cluster.
Dynamic data : disk heartbeat logging. It maintains important details about cluster node membership, such as:
– which nodes are part of the cluster,
– which node is joining the cluster, and
– which node is leaving the cluster.
Backing up the voting disk
In previous versions of Oracle Clusterware you needed to back up the voting disks with the dd command. Starting with Oracle Clusterware 11g Release 2 you no longer need to back up the voting disks: they are automatically backed up as part of the OCR. In fact, Oracle explicitly indicates that you should not use a backup tool like dd to back up or restore voting disks; doing so can lead to the loss of the voting disk.
In those earlier releases, although the voting disk contents did not change frequently, you needed to back up the voting disk file every time you added or removed a node from the cluster, or immediately after you configured or upgraded a cluster.
A node in the cluster must be able to access more than half of the voting disks at any time in order to be able to tolerate a failure of n voting disks. Therefore, it is strongly recommended that you configure an odd number of voting disks such as 3, 5, and so on.
Check the location of the voting disks :
$ crsctl query css votedisk
How can we add and remove multiple voting disks?
If we have multiple voting disks, then we can remove the voting disks and add them back into our environment using the following commands, where path is the complete path of the location where the voting disk resides:
crsctl delete css votedisk path
crsctl add css votedisk path
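When the voting files are stored in ASM, as in 11g R2 here, they are not added or deleted individually with the commands above; instead, the whole set is relocated to another disk group in one step. A minimal sketch as root, assuming a disk group named +CRS already exists:
# crsctl replace votedisk +CRS
$ crsctl query css votedisk      <<< verify the new locations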
FILE 4 : ASM INSTANCE : ASM SPFILE :
An Oracle ASM instance is built on the same technology as an Oracle Database instance. An Oracle ASM instance has a System Global Area (SGA) and background processes that are similar to those of Oracle Database.
However, because Oracle ASM performs fewer tasks than a database, an Oracle ASM SGA is much smaller than a database SGA.
In addition, Oracle ASM has a minimal performance effect on a server. Oracle ASM instances mount disk groups to make Oracle ASM files available to database instances; Oracle ASM instances do not mount databases.
Oracle ASM is installed in the Oracle Grid Infrastructure home before Oracle Database is installed in a separate Oracle home. Oracle ASM and database instances require shared access to the disks in a disk group. Oracle ASM instances manage the metadata of the disk group and provide file layout information to the database instances.
Oracle ASM metadata is the information that Oracle ASM uses to control a disk group and the metadata resides within the disk group. Oracle ASM metadata includes the following information:
– The disks that belong to a disk group
– The amount of space that is available in a disk group
– The file names of the files in a disk group
– The location of disk group data file extents
– A redo log that records information about atomically changing metadata blocks
– Oracle ADVM volume information
What init.ora parameters does a user need to configure for ASM instances?
The default parameter settings work perfectly for ASM. The only parameters needed for 11g ASM:
• PROCESSES
• ASM_DISKSTRING*
• ASM_DISKGROUPS*
• INSTANCE_TYPE*
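For illustration only (these parameter values are assumptions, not taken from this setup), a minimal ASM parameter file could look like the following; in 11g R2 the spfile itself normally lives in the disk group recorded in the GPnP profile:
instance_type  = asm
asm_diskstring = '/dev/oracleasm/disks/*'
asm_diskgroups = 'DATA','FRA'
processes      = 100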
FILE 5 : OCR : ORACLE CLUSTER REGISTRY :

Oracle Cluster Registry (OCR) — Maintains cluster configuration information as well as configuration information about any cluster database within the cluster. The OCR must reside on shared disk that is accessible by all of the nodes in your cluster.
OCR manages Oracle Clusterware and Oracle RAC database configuration information.
Default location of the automatic OCR backups : <GRID_HOME>/cdata/<cluster name>
To check the OCR location : cat /etc/oracle/ocr.loc
Commands :
ocrcheck                    <<< to check the OCR location and integrity
ocrconfig -showbackup       <<< to check the OCR backup locations
ocrconfig -manualbackup     <<< to take a manual backup of the OCR (root user only)
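If the OCR is lost or corrupted, it can be restored as root from one of those automatic or manual backups; a minimal sketch (the backup file name is only an example, and the clusterware must be stopped on all nodes before restoring):
# crsctl stop crs -f        <<< on all nodes
# ocrconfig -restore /u01/app/11.2.0/grid/cdata/cluster01/backup00.ocr
# crsctl start crs          <<< on all nodes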
SPLIT BRAIN SYNDROME IN ORACLE RAC :
In an Oracle RAC environment, all the instances/servers communicate with each other over high-speed interconnects on the private network. This private network interface, or interconnect, is redundant and is used only for inter-instance Oracle data block transfers.
With respect to Oracle RAC, split brain occurs when the instances in the cluster fail to ping/connect to each other via this private interconnect while the servers are all physically up and running and the database instance on each of them is also running. These individual nodes are running fine and can conceptually accept user connections and work independently. Because of the lack of communication, each instance thinks that the instance it cannot reach is down and that it needs to do something about the situation. The problem is that if we leave these instances running, the same block might be read and updated in the individual instances, causing a data integrity issue: blocks changed in one instance would not be locked and could be overwritten by another instance. Oracle has implemented an efficient check for the split-brain syndrome.


What does RAC do when a node becomes inactive?
In RAC, if any node becomes inactive, or if the other nodes are unable to ping/connect to a node, then the node that first detects that one of the nodes is not accessible will evict that node from the RAC group. For example, if there are 4 nodes in a RAC cluster and node 3 becomes unavailable, and node 1 tries to connect to node 3 and finds it not responding, then node 1 will evict node 3 from the RAC group, leaving only node 1, node 2 and node 4 in the group to continue functioning.
The split-brain scenario can become more complicated in large RAC setups. For example, suppose there are 10 RAC nodes in a cluster and 4 of them are unable to communicate with the other 6, so two groups are formed in this 10-node cluster (one group of 4 nodes and another of 6 nodes). The nodes quickly try to affirm their membership by locking the control file; the node that locks the control file then checks the votes of the other nodes. The group with the larger number of active nodes gets preference and the other group is evicted. That said, I have only seen node eviction issues where a single node gets evicted while the rest keep functioning, so I cannot really testify from experience that this is exactly how it works, but this is the theory behind it. When a node is evicted, Oracle RAC will usually reboot that node and perform a cluster reconfiguration to bring the evicted node back in. You will see the Oracle error ORA-29740 when there is a node eviction in RAC. There are many reasons for a node eviction, such as the heartbeat not being recorded in the control file or being unable to communicate with the clusterware.
A good My Oracle Support (MetaLink) note on understanding node eviction and how to address it is Note ID 219361.1.
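Related to evictions, the CSS heartbeat thresholds in effect on a cluster can be checked with crsctl (a quick illustrative check, not part of the note above):
$ crsctl get css misscount       <<< network heartbeat timeout, in seconds
$ crsctl get css disktimeout     <<< voting disk I/O timeout, in seconds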

A node must be able to access more than half of the voting disks :  
A node must be able to access more than half of the voting disks at any time.
For example, consider a two-node cluster with an even number of voting disks, say 2. Suppose node 1 is able to access only voting disk 1 and node 2 is able to access only voting disk 2. This means there is no common file where the clusterware can check the heartbeat of both nodes. Hence, if we have 2 voting disks, all the nodes in the cluster must be able to access both of them.
If we have 3 voting disks and both nodes are able to access more than half, i.e. 2 voting disks, there will be at least one disk that is accessible by both nodes, and the clusterware can use that disk to check the heartbeat of both nodes. Hence, each node must be able to access more than half the number of voting disks; a node that cannot do so has to be evicted from the cluster to maintain the integrity of the cluster. After the cause of the failure has been corrected and access to the voting disks has been restored, you can instruct Oracle Clusterware to recover the failed node and restore it to the cluster.
Loss of more than half your voting disks will cause the entire cluster to fail !!
Why an odd number of voting disks :
When you have 1 voting disk and it goes bad, the cluster stops functioning.
When you have 2 and 1 goes bad, the same happens because the nodes realize they can only write to half of the original disks (1 out of 2), violating the rule that they must be able to write > half (yes, the rule says >, not >=).
When you have 3 and 1 goes bad, the cluster runs fine because the nodes know they can access more than half of the original voting disks (2/3 > half).
When you have 4 and 1 goes bad, the same, because (3/4 > half).
When you have 3 and 2 go bad, the cluster stops because the nodes can only access 1/3 of the voting disks, not > half.
When you have 4 and 2 go bad, the same, because the nodes can only access half, not > half.
So 4 voting disks have the same fault tolerance as 3, but you waste one disk without gaining anything. The recommendation of an odd number of voting disks simply saves a little on hardware.
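To summarize the six cases above in one rule of thumb: with n voting disks, the cluster survives the loss of at most ceil(n/2) - 1 of them, so n = 3 and n = 4 both tolerate exactly one failed voting disk, while n = 5 tolerates two.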

What is CACHE FUSION :
  • CACHE FUSION is one of the most important and interesting concepts in a RAC setup. As the name suggests, CACHE FUSION is the amalgamation of the caches from each node/instance participating in the RAC, but it is not a physically separate memory component that can be configured, unlike the usual buffer cache (or other SGA components), which is local to each node/instance.
  • We know that every instance of the RAC database has its own local buffer cache, which performs the usual cache functions for that instance. Now there could be occasions when a transaction/user on instance A needs to access a data block that is owned/locked by another instance B. In such cases, instance A requests the data block from instance B and accesses it through the interconnect mechanism. This concept is known as CACHE FUSION: one instance can work on or access a data block in another instance's cache via the high-speed interconnect.

  • Cache Fusion uses a high-speed IPC interconnect to provide cache-to-cache transfers of data blocks between instances in a cluster. This data block shipping eliminates the disk I/O and optimizes read/write concurrency.
  • Now the question is how the integrity of the data is maintained in a RAC environment if there are concurrent requests for the same data block. Here too, Oracle uses locking and queuing mechanisms to coordinate lock resources, data, and inter-instance data requests.
  • Cache Fusion is implemented by a controlling mechanism called the Global Cache Service (GCS), which is responsible for block transfers between instances.
The Global Cache Service and Global Enqueue Services maintain the Global Resource Directory (GRD) to keep track of the resources in the cluster. There is no true concept of a master node in Oracle RAC; instead, each instance in the cluster becomes the resource master for a subset of resources.
Global Enqueue Services (GES) :
A single-instance database relies on enqueues (locks) to prevent two processes from simultaneously modifying the same row. Similarly, we need enqueues in Oracle RAC, but since the buffer cache is now global, the enqueues on the resources must be global as well.
It should be no surprise that the Global Enqueue Services (GES) is responsible for managing locks across the cluster. As a side note, GES was previously called the Distributed Lock Manager (DLM).
The processes running to support an Oracle RAC instance include:
• LMS : the Global Cache Service (GCS) process; it used to be called the Lock Manager Server.
• LMON : the Lock Monitor; this process is the GES master process.
• LMD : the Lock Manager Daemon; this process manages incoming lock requests.
• LCK0 : the instance enqueue process; it manages lock requests for library cache objects.
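As a quick OS-level illustration (generic, not specific to this setup), the GCS/GES background processes of a running RAC instance can be seen with ps; each line corresponds to one of the processes listed above (LMS0, LMON, LMD0, LCK0, ...):
$ ps -ef | egrep 'ora_(lms|lmon|lmd|lck)' | grep -v grep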
