Tuesday, 17 July 2018

Quick RAC Interview Question and Answer

Oracle DBA question and answer
Important link for interview question

Oracle Database interview questions and answers: http://oracle-dba-help.blogspot.com/search/label/Interview%20Questions


Understanding Offline Processes in Oracle Grid Infrastructure

After the installation of Oracle Grid Infrastructure, some components may be listed as OFFLINE. 
Oracle Grid Infrastructure activates these resources when you choose to add them.

Oracle Grid Infrastructure provides required resources for various Oracle products and components. 
Some of those products and components are optional, so you can install and enable them after installing Oracle Grid Infrastructure. 
To simplify postinstallation additions, Oracle Grid Infrastructure preconfigures and registers all required resources for these products and components, 
but only activates them when you choose to add them. As a result, some components may be listed as OFFLINE after the installation of Oracle Grid Infrastructure.
 Run the following command to view the status of any resource:

$crsctl status resource resource_name -t 

crsctl status resource ora.easydb.easy_srv.svc -l
crsctl stat res -t -w "TYPE = ora.database.type"
srvctl status service -d easydb1
crsctl stat res -f -w "(TYPE = ora.service.type)"
crsctl stat res -t -w '((TARGET != ONLINE) or (STATE != ONLINE))'


Resources listed as TARGET:OFFLINE and STATE:OFFLINE do not need to be monitored. 
They represent components that are registered, but not enabled, so they do not use any system resources. 
If an Oracle product or component is installed on the system, and it requires a particular resource to be online, then the software prompts you to activate the required offline resource.

RAC Investigation

Environment description

2 Node RAC with Oracle 11.2.0.2.2
Oracle Linux 5.6 

Conducted tests

easy_srv is a service that has both instances (the ones running on node1 and node2) as preferred instances.
On node1 the service was manually stopped.

[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
 NAME=ora.mydb.easy_srv.svc
 TYPE=ora.service.type
 CARDINALITY_ID=1
 DEGREE_ID=1
 TARGET=OFFLINE
 STATE=OFFLINE
 CARDINALITY_ID=2
 DEGREE_ID=1
 TARGET=ONLINE
 STATE=ONLINE on node2

Issue a “shutdown abort” on the instance running on node2

[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

CARDINALITY_ID=2
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1

Start the instance again:

[grid@node1 ~]$ srvctl start instance -d mydb -i mydb2
 
[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
 
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1

The service is now running on both instances, although before the crash the service was set offline on node1.

Same test, but this time the service is stopped on all instances

[grid@node1 ~]$ srvctl stop service -d mydb -s easy_srv
 
[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
 
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
 
[grid@node1 ~]$ srvctl stop instance -d mydb -i mydb2 -o abort
 
[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
 
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

This time the service stays offline on both instances.
But what happens if we start the instance again?

[grid@node1 ~]$ srvctl start instance -d mydb -i mydb2
 
[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
 
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

Now the service has started again on the restarted instance.

The explanation is that the service was configured to come up automatically with the instance, which is why the service is started on the restarted node.

As for the failover, this seems to be expected behaviour, as it is the same as what would happen with a preferred / available configuration.

For the third test, we will reconfigure the service to have a preferred and an available node

[grid@node1 ~]$ srvctl stop service -d mydb -s easy_srv
[grid@node1 ~]$ srvctl modify service -d mydb -s easy_srv -n -i mydb2 -a mydb1
 
[grid@node1 ~]$ srvctl config service -d mydb -s easy_srv
Service name: easy_srv
Service is enabled
Server pool: mydb_easy_srv
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: NONE
Edition:
Preferred instances: mydb2
Available instances: mydb1
 
[grid@node1 ~]$ srvctl start service -d mydb -s easy_srv -i mydb2
[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

The service is running on its preferred instance, which we will now crash

[grid@node1 ~]$ srvctl stop instance -d mydb -i mydb2 -o abort
 
[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=OFFLINE

 I actually expected a relocation here…

As I have other services which have a preferred / available configuration, I know this service should failover.

[grid@node1 ~]$ srvctl status service -d mydb -s easy_srv
Service easy_srv is not running.
 
[grid@node1 ~]$ srvctl config service -d mydb -s easy_srv
Service name: easy_srv
Service is enabled
Server pool: mydb_easy_srv
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: NONE
Edition:
Preferred instances: mydb2
Available instances: mydb1
 
[grid@node1 ~]$ srvctl status database -d mydb
Instance mydb1 is running on node node1
Instance mydb2 is not running on node node2


I could find no clues in the different cluster log files as to why the relocation did not occur.

More testing will be necessary.

Also note that the output of crsctl status resource does not show on which node or instance the service is expected to be online.
By using the -v flag, however, we can see the LAST_SERVER attribute:

[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -v
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
LAST_SERVER=node2
STATE=OFFLINE
TARGET=ONLINE
CARDINALITY_ID=1
CREATION_SEED=137
RESTART_COUNT=0
FAILURE_COUNT=0
FAILURE_HISTORY=
ID=ora.mydb.easy_srv.svc 1 1
INCARNATION=5
LAST_RESTART=08/10/2011 16:32:53
LAST_STATE_CHANGE=08/10/2011 16:34:03
STATE_DETAILS=
INTERNAL_STATE=STABLE

After starting the instance again, the service was available again:

[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

A second run of this test gave the same result.
Manually relocating the service did work though:

[grid@node1 ~]$ srvctl relocate service -d mydb -s easy_srv -i mydb1 -t mydb2
[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

What if I removed the service and recreated it directly as preferred / available:

[grid@node1 ~]$ srvctl stop service -d mydb -s easy_srv
 
[grid@node1 ~]$ srvctl remove service -d mydb -s easy_srv
 
[grid@node1 ~]$ srvctl add service -d mydb -s easy_srv -r mydb2 -a mydb1 -y AUTOMATIC -P BASIC -e SELECT
PRCD-1026 : Failed to create service easy_srv for database mydb
PRKH-1014 : Current user grid is not the same as oracle owner orauser of oracle home /opt/oracle/orauser/product/11.2.0.2/dbhome_1.
Would it make a difference?
Let us test it, this time as the database software owner:

[grid@node1 ~]$ su - orauser
Password:
 
[orauser@node1 ~]$ srvctl add service -d mydb -s easy_srv -r mydb1,mydb2 -y AUTOMATIC -P BASIC -e SELECT
 
[orauser@node1 ~]$ srvctl config service -d mydb -s easy_srv
Service name: easy_srv
Service is enabled
Server pool: mydb_easy_srv
Cardinality: 2
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SELECT
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: BASIC
Edition:
Preferred instances: mydb1,mydb2
Available instances:
 
[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
 
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

Now modify it:

[orauser@node1 ~]$ srvctl modify service -d mydb -s easy_srv -n -i mydb2 -a mydb1
 
[orauser@node1 ~]$ srvctl config service -d mydb -s easy_srv
Service name: easy_srv
Service is enabled
Server pool: mydb_easy_srv
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SELECT
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: BASIC
Edition:
Preferred instances: mydb2
Available instances: mydb1
 
[orauser@node1 ~]$ srvctl start service -d mydb -s easy_srv -i mydb2
 
[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
 
[orauser@node1 ~]$ srvctl stop instance -d mydb -i mydb2 -o abort
 
[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=OFFLINE

Nope, the user modifying the service has nothing to do with it.

I also tested the scenario where I directly created a preferred / available service, but in this case the failover also did not work.

But after some more testing I found the reason.

During the first test I had shut down the instance via SQL*Plus, not via srvctl. And the other services I talked about had failed over during that test (I never did a failback).
After doing the shutdown abort via SQL*Plus again, the failover worked again.

[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
 
[orauser@node2 ~]$ export ORACLE_SID=mydb2
[orauser@node2 ~]$ sqlplus / as sysdba
 
SQL*Plus: Release 11.2.0.2.0 Production on Wed Aug 10 18:28:29 2011
 
Copyright (c) 1982, 2010, Oracle.  All rights reserved.
 
Connected to:
Oracle Database 11g Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
 
SQL> shutdown abort
ORACLE instance shut down.
 
[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1
 
SQL> startup
ORACLE instance started.
 
Total System Global Area 3140026368 bytes
Fixed Size                  2230600 bytes
Variable Size            1526728376 bytes
Database Buffers         1593835520 bytes
Redo Buffers               17231872 bytes
Database mounted.
Database opened.
 
[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1

As expected, starting the instance again did not trigger a failback of the service.

The question now is whether the failover not happening when the shutdown is issued via srvctl is expected behaviour or not.

For this, one probably would have to open a service request, answer a couple of questions not important for this issue, escalate, and still have to wait for several months.
Do I sound bitter now?

Conclusion:

  • When restarting an instance, an offline service that has this instance listed as a preferred node will be started (management policy = automatic).
  • When an instance on which a service was running fails, the service is started on at least one other preferred instance.
  • The service will remain running on this instance, even when the original instance is started again (in which case the service will run on both instances).
  • When a service has a preferred / available configuration, the service will failover to the available instance, but not failback afterwards.
  • Failover in a preferred / available configuration does not happen when the instance was stopped via “srvctl stop instance -d <db_unique_name> -i <instance_name> -o abort”
Questions remaining:

  • What if there were more than 2 nodes, with a service that has all three or more nodes listed as preferred, but is currently running on only one node?
  • If the instance on which that service is running fails, would the service then be started on all preferred nodes or on only one of them?
  • What if, in the above case, the service was running on 2 nodes?
  • Would it still be started on other nodes?
  • And what if one of the nodes was configured as available and not as preferred? Would the service on the preferred node still be started, or the one on the available instance, or both?
  • And last but not least, is the srvctl shutdown behaviour a bug or not?


How many IPs are required before installation of Clusterware / Grid

If you do not enable GNS, the public and virtual IP addresses for each node must be static IP addresses, configured before installation for each node but not currently in use. Public and virtual IP addresses must be on the same subnet. Oracle Clusterware manages private IP addresses in the private subnet on interfaces you identify as private during the installation process.
The cluster must have the following addresses configured:
A public IP address for each node, with the following characteristics:
Static IP address
Configured before installation for each node, and resolvable to that node before installation
On the same subnet as all other public IP, VIP, and SCAN addresses
A virtual IP address for each node, with the following characteristics:
Static IP address
Configured before installation for each node, but not currently in use
On the same subnet as all other public IP addresses, VIP addresses, and SCAN addresses

A Single-Client Access Name (SCAN) for the cluster, with the following characteristics:
Three static IP addresses configured on the domain name server (DNS) before installation, so that the three IP addresses are associated with the name provided as the SCAN, and all three addresses are returned in random order by the DNS to the requestor
Configured before installation in the DNS to resolve to addresses that are not currently in use
Given a name that does not begin with a numeral
On the same subnet as all other public IP addresses, VIP addresses, and SCAN addresses
A private IP address for each node, with the following characteristics:
Static IP address
Configured before installation, but on a separate private network, with its own subnet, that is not resolvable except by other cluster member nodes to improve the interconnect performance
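
As a minimal sketch of such an address plan for a two-node cluster (all host names and addresses below are made-up assumptions; the SCAN must be resolved by DNS rather than by /etc/hosts, since a hosts file would only ever return one address):

# /etc/hosts - public and virtual IPs (same subnet), private IPs (separate subnet)
192.168.1.101   node1.example.com      node1
192.168.1.102   node2.example.com      node2
192.168.1.111   node1-vip.example.com  node1-vip
192.168.1.112   node2-vip.example.com  node2-vip
10.0.0.1        node1-priv
10.0.0.2        node2-priv
# SCAN: three addresses, defined only in DNS, e.g.
# cluster-scan.example.com -> 192.168.1.121, 192.168.1.122, 192.168.1.123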

12c RAC Installation


  • Hardware Requirements.
Each node must have at least two network interface cards (NIC), or network adapters. One adapter is for the public network interface and the other adapter is for the private network interface (interconnect).

Public interface names must be the same for all nodes. If the public interface on one node uses the network adapter eth0, then you must configure eth0 as the public interface on all nodes.

You should configure the same private interface names for all nodes as well. If eth1 is the private interface name for the first node, then eth1 should be the private interface name for your second node.

The private network adapters must support the user datagram protocol (UDP) using high-speed network adapters and a network switch that supports TCP/IP (Gigabit Ethernet or better). Oracle recommends that you use a dedicated network switch.
  • IP Address Requirements.
You must have a DNS server in order to make the SCAN listeners work. So, before you proceed with the installation, prepare your DNS server. You must add the following entries manually in your DNS server.


i)  A public IP address for each node

ii) A virtual IP address for each node

iii) Three single client access name (SCAN) addresses for the cluster


During installation a SCAN for the cluster is configured, which is a domain name that resolves to all the SCAN addresses allocated for the cluster. The IP addresses used for the SCAN addresses must be on the same subnet as the VIP addresses. 
The SCAN name must be unique within your network. The SCAN addresses should not respond to ping commands before installation.
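
A quick pre-installation check is to resolve the SCAN a few times and verify that DNS hands back all three addresses (the name and addresses below are made-up assumptions):

$ nslookup cluster-scan.example.com
Name:    cluster-scan.example.com
Address: 192.168.1.121
Name:    cluster-scan.example.com
Address: 192.168.1.122
Name:    cluster-scan.example.com
Address: 192.168.1.123
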
Interface          Type      Resolution
Public             Static    DNS
Private            Static    Not required
ASM                Static    Not required
Node Virtual IP    Static    Not required
SCAN virtual IP    Static    Not required
  • OS and software Requirements
  • Preparing Shared Storage

Database Volume Type/Purpose    Number of Volumes    Volume Size
OCR/VOTE                        3                    50GB each
DATA                            4                    250GB each
REDO                            2                    At least 50GB each
FRA                             1                    100GB
TEMP                            1                    100GB
  • Preparing the server to install Grid Infrastructure.
Create OS groups using the command below. 
    Enter these commands as the ‘root’ user:
 
    #/usr/sbin/groupadd -g 501 oinstall

    #/usr/sbin/groupadd -g 502 dba

    #/usr/sbin/groupadd -g 504 asmadmin

    #/usr/sbin/groupadd -g 506 asmdba

    #/usr/sbin/groupadd -g 507 asmoper

Create the users that will own the Oracle software using the commands:

#/usr/sbin/useradd -u 501 -g oinstall -G asmadmin,asmdba,asmoper grid

#/usr/sbin/useradd -u 502 -g oinstall -G dba,asmdba oracle
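
To verify the users and their group memberships, the id command can be used; given the groupadd/useradd commands above, the expected output would look roughly like this:

# id grid
uid=501(grid) gid=501(oinstall) groups=501(oinstall),504(asmadmin),506(asmdba),507(asmoper)

# id oracle
uid=502(oracle) gid=501(oinstall) groups=501(oinstall),502(dba),506(asmdba)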


vi /etc/sysctl.conf

kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
fs.file-max = 6553600
net.ipv4.ip_local_port_range = 9000 65500
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
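
To make the new kernel parameters effective immediately without a reboot, reload them as root:

# /sbin/sysctl -p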

 Add or edit the following line in the /etc/pam.d/login file, if it does not already exist:

session required pam_limits.so


     Create the Oracle Inventory Directory

To create the Oracle Inventory directory, enter the following commands as the root user:

    # mkdir -p /u01/app/oraInventory

    # chown -R grid:oinstall /u01/app/oraInventory

    # chmod -R 775 /u01/app/oraInventory

    Creating the Oracle Grid Infrastructure Home Directory

    To create the Grid Infrastructure home directory, enter the following commands as the root user:

    # mkdir -p /u01/11.2.0/grid

    # chown -R grid:oinstall /u01/11.2.0/grid

             # chmod -R 775 /u01/11.2.0/grid

               Creating the Oracle Base Directory

To create the Oracle Base directory, enter the following commands as the root user:

    # mkdir -p /u01/app/oracle

    # mkdir /u01/app/oracle/cfgtoollogs   (needed to ensure that dbca is able to run after the RDBMS installation)

    # chown -R oracle:oinstall /u01/app/oracle

    # chmod -R 775 /u01/app/oracle


How to increase the performance of the interconnect


1) Link aggregation, also known as "NIC teaming" or "NIC bonding": it can be used to increase redundancy for higher availability with an Active/Standby configuration, or to increase bandwidth for performance with an Active/Active configuration. The latter arrangement involves simultaneous use of the bonded physical network interface cards in parallel to achieve a higher bandwidth beyond the limit of any single network card. It is very important that if 802.3ad is used at the NIC layer, the switch must also support and be configured for 802.3ad; misconfiguration results in poor performance and interface resets or "port flapping".
2) An alternative is to consider a single network interface card with a higher bandwidth, such as 10Gb Ethernet instead of 1Gb Ethernet. InfiniBand can also be used for the interconnect.
3) UDP socket buffer (rx):
Default settings are adequate for the majority of customers. It may be necessary to increase the allocated buffer size
when the:
– MTU size has been increased
– netstat command reports errors
– ifconfig command reports dropped packets or overflow
The maximum UDP socket receive buffer size varies according to the operating system; the upper limit may be as small as 128 KB or as large as 1 GB. In most cases the default settings are adequate, but this is one of the first settings to consider if you are receiving lost blocks.
Three significant conditions that indicate it may be necessary to change the UDP socket receive buffer size are: the MTU size has been increased, excessive fragmentation and/or reassembly of packets is observed, or dropped packets or overflows are observed.
4) Jumbo frames: jumbo frames are not an Institute of Electrical and Electronics Engineers (IEEE) standard, they are not a requirement for Oracle Clusterware, and they are not configured by default. The use of jumbo frames is supported; however, special care must be taken because this is not an IEEE standard and there are significant variances among network devices and switches, especially from different manufacturers. The typical frame size for jumbo frames is 9 KB, but again, this can vary. It is necessary that all devices in the communication path be set to the same value.
A jumbo frame is an Ethernet frame with a payload greater than the standard maximum transmission unit (MTU) of 1,500 bytes. By default each network packet can carry 1500 bytes of data (also referred to as the packet’s payload). Any payload larger than 1500 bytes sent over the network will be split into more than one packet. If we enable jumbo frames we reduce the number of packets sent over the network when sending large amounts of data.When we enable “jumbo frames” we are telling our network devices that we want to send more than 1500 bytes in each packet. 
Most commonly, jumbo frames means setting the MTU (maximum transmission unit) to enable a payload of 9000 bytes.
The obvious advantage of using jumbo frames is that more data is transferred in fewer packets.
They can speed up your overall network speed, provide better interaction between some applications, 
and reduce strain on your network.  
If you’re considering implementing Jumbo Frames, it’s important to do your homework first. 

MTU = 1500 (default Ethernet) or 9000 (jumbo frames)

Let’s assume we need to transfer 20 gigabytes (21,474,836,480 bytes) of data as quickly as possible. 
With a standard 1500 byte MTU that will take 14,316,558 packets,
but with an MTU of 9000 we are sending 2,386,093 packets. That’s a difference of 11,930,465 packets.
Note: For Oracle Clusterware, the Maximum Transmission Unit (MTU) needs to be the same on all nodes. If it is not set to the same value, an error message will be sent to the Clusterware alert logs.
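
To check the current MTU of an interface, and to raise it if jumbo frames are chosen, something like the following can be used (eth1 is an assumed name for the private interface; the same MTU must be set on all nodes and on the switch ports):

# ip link show eth1 | grep mtu
2: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000

# ifconfig eth1 mtu 9000                        (takes effect immediately, lost at reboot)
# vi /etc/sysconfig/network-scripts/ifcfg-eth1  (add MTU=9000 to make it persistent on RHEL/OEL)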

What is HAIP

HAIP, High Availability IP, is the Oracle-based solution for load balancing and failover of private interconnect traffic. Typically, host-based solutions such as bonding (Linux) or trunking (Solaris) are used to implement high availability for the private interconnect, but HAIP is an Oracle solution for high availability.
In earlier releases, to minimize node evictions due to frequent private NIC down events, bonding, trunking, teaming, or similar technology was required to make use of redundant network connections between the nodes. Oracle Clusterware now provides an integrated solution which ensures "Redundant Interconnect Usage" as it supports IP failover.

Multiple private network adapters can be defined either during the installation phase or afterwards using oifcfg. The ora.cluster_interconnect.haip resource will pick up a highly available virtual IP (the HAIP) from the link-local IP range (169.254.0.0/16) and assign one to each private network. With HAIP, by default, interconnect traffic is load balanced across all active interconnect interfaces. If a private interconnect interface fails or becomes non-communicative, Clusterware transparently moves the corresponding HAIP address to one of the remaining functional interfaces.
Grid Infrastructure can activate a maximum of four private network adapters at a time, even if more are defined. The number of HAIP addresses is decided by how many private network adapters are active when Grid comes up on the first node in the cluster. If there is only one active private network, Grid will create one; if two, Grid will create two, and so on. The number of HAIPs won't increase beyond four even if more private network adapters are activated. A restart of Clusterware on all nodes is required for new adapters to become effective.
This functionality is available starting with Oracle Database 11g Release 2 (11.2.0.2). If you use the Oracle Clusterware Redundant Interconnect feature, you must use IPv4 addresses for the interfaces.
When you define multiple interfaces, Oracle Clusterware creates from one to four highly available IP (HAIP) addresses. Oracle RAC and Oracle Automatic Storage Management (Oracle ASM) instances use these interface addresses to ensure highly available, load-balanced interface communication between nodes. The installer enables Redundant Interconnect Usage to provide a high-availability private network.
By default, Oracle Grid Infrastructure software uses all of the HAIP addresses for private network communication, providing load-balancing across the set of interfaces you identify for the private network. If a private interconnect interface fails or becomes noncommunicative,
Oracle Clusterware transparently moves the corresponding HAIP address to one of the remaining functional interfaces.
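
To see which interfaces are registered and whether the HAIP resource is up, the following commands can be used from the Grid home (interface names and subnets below are assumptions):

$ oifcfg getif
eth0  192.168.1.0  global  public
eth1  10.0.0.0  global  cluster_interconnect

$ crsctl stat res ora.cluster_interconnect.haip -init -t

A second private interface could be registered with something like "oifcfg setif -global eth2/10.0.1.0:cluster_interconnect", followed by the Clusterware restart on all nodes mentioned above.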

What is advantage of Single-Client Access

The single-client access name (SCAN) is the address used by clients connecting to the cluster. The SCAN is a fully qualified host name (host name + domain) registered to three IP addresses. If you use GNS and have DHCP support, then GNS will assign addresses dynamically to the SCAN.
If  you do not use GNS, the SCAN should be defined in the DNS to resolve to the three addresses assigned to that name.

This should be done before you install Oracle Grid Infrastructure.

  • The SCAN and its associated IP addresses provide a stable name for clients to use for connections, independent of the nodes that make up the cluster.
  • SCANs function like a cluster alias. However, SCANs are resolved on any node in the cluster, so unlike a VIP address for a node, clients connecting to the SCAN no longer require updated VIP addresses as nodes are added to or removed from the cluster. Because the SCAN addresses resolve to the cluster, rather than to a node address in the cluster, nodes can be added to or removed from the cluster without affecting the SCAN address configuration.
  • During installation, listeners are created on each node for the SCAN IP addresses. Oracle Clusterware routes application requests on the cluster SCAN to the least loaded instance providing the service. SCAN listeners can run on any node in the cluster. SCANs provide location independence for databases, so that the client configuration does not have to depend on which nodes run a particular database. Instances register with SCAN listeners only as remote listeners. Upgraded databases register with SCAN listeners as remote listeners, and also continue to register with all other listeners.
  • If you specify a GNS domain during installation, the SCAN defaults to clustername-scan.GNS_domain. If a GNS domain is not specified at installation, the SCAN defaults to clustername-scan.current_domain.

How SCAN and Local Listeners work

When a client submits a connection request, the SCAN listener listening on a SCAN IP address and the SCAN port are contacted on the client’s behalf. Because all services on the cluster are registered with the SCAN listener, the SCAN listener replies with the address of the local listener on the least-loaded node where the service is currently being offered. Finally, the client establishes a connection to the service through the listener on the node where service is offered. All these actions take place transparently to the client without any explicit configuration required in the client.
During installation, listeners are created on the nodes for the SCAN IP addresses.
Oracle Net Services routes application requests to the least loaded instance providing the service. Because SCAN addresses resolve to the cluster, rather than to a node address in the cluster, nodes can be added to or removed from the cluster without affecting the SCAN address configuration.
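
To verify the SCAN configuration and see on which nodes the SCAN VIPs and SCAN listeners are currently running:

$ srvctl config scan
$ srvctl config scan_listener
$ srvctl status scan
$ srvctl status scan_listener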

What is Node eviction and its advantage

An important service provided by Oracle Clusterware is node fencing. Node fencing is a technique used by clustered environments to evict nonresponsive or malfunctioning hosts from the cluster; allowing affected nodes to remain in the cluster increases the probability of data corruption.
Traditionally, Oracle Clusterware uses a STONITH (Shoot The Other Node In The Head)-comparable fencing algorithm to ensure data integrity in cases in which cluster integrity is endangered and split-brain scenarios need to be prevented.
For Oracle Clusterware this means that a local process enforces the removal of one or more nodes from the cluster (fencing). This approach traditionally involved a forced "fast" reboot of the offending node. A fast reboot is a shutdown and restart procedure that does not wait for any I/O to finish or for file systems to synchronize on shutdown. Starting with Oracle Clusterware 11g Release 2 (11.2.0.2), this mechanism has been changed to prevent such a reboot as much as possible by introducing rebootless node fencing.
Now, when a decision is made to evict a node from the cluster, Oracle Clusterware will first attempt to shut down all resources on the machine that was chosen to be the subject of an eviction. Specifically, I/O generating processes are killed and Oracle Clusterware ensures that those processes are completely stopped before continuing.
If all resources can be stopped and all I/O generating processes can be killed, Oracle Clusterware will shut itself down on the respective node, but will attempt to restart after the stack has been stopped.
If, for some reason, not all resources can be stopped or I/O generating processes cannot be stopped completely, Oracle Clusterware will still perform a reboot.
In addition to this traditional fencing approach, Oracle Clusterware now supports a new fencing mechanism based on remote node termination. The concept uses an external mechanism capable of restarting a problem node without cooperation either from Oracle Clusterware or from the
operating system running on that node. To provide this capability, Oracle Clusterware supports the Intelligent Management Platform Interface specification (IPMI), a standard management protocol.
To use IPMI and to be able to remotely fence a server in the cluster, the server must be equipped with a Baseboard Management Controller (BMC) which supports IPMI over a local area network (LAN). After this hardware is in place in every server of the cluster, IPMI can be activated either during the installation of the Oracle Grid Infrastructure or afterwards, as a postinstallation management task, by using CRSCTL.
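
A sketch of the postinstallation path (the administrator name and BMC address below are placeholders; crsctl prompts for the IPMI password, and the address is set per node):

$ crsctl set css ipmiadmin bmcadmin
$ crsctl set css ipmiaddr 192.168.10.45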

What is the difference between a oracle global index and a local index?

When using Oracle partitioning, you can specify the “global” or “local” parameter in the create index syntax:

Local index (local partitioned indexes): A local index is a one-to-one mapping between an index partition and a table partition.
The partitioning of the indexes is transparent to all SQL queries. The great benefit is that the Oracle query engine will scan only the index partition that is required to service the query, thus speeding up the query significantly. In addition, the Oracle parallel query engine will sense that the index is partitioned and will fire simultaneous queries to scan the indexes.
Local partitioned indexes allow the DBA to take individual partitions of a table and indexes offline for maintenance (or reorganization) without affecting the other partitions and indexes in the table.

  • A local index on a partitioned table is created where the index is partitioned in exactly the same manner as the underlying partitioned table. That is, the local index inherits the partitioning method of the table. This is known as equi-partitioning.
  • For local indexes, the index keys within the index will refer only to the rows stored in the single underlying table partition. A local index is created by specifying the LOCAL attribute, and can be created as UNIQUE or NON-UNIQUE.
  • The table and local index are partitioned in exactly the same manner and have the same partition key. Because local indexes are automatically maintained, they can offer higher availability.
  • As the Oracle database ensures that the index partitions are synchronized with their corresponding table partitions, it follows that the database automatically maintains the index partition whenever any maintenance operation is performed on the underlying tables, for example, when partitions are added, dropped, or merged.
  • A local index is prefixed if the partition key of the table and the index key are the same; otherwise it is a local non-prefixed index.


In a local partitioned index, the number of index partitions matches the number of partitions in the base table, and each index partition contains keys only for its corresponding table partition.

To create a range-partitioned table:

  CREATE TABLE EASY_INVOICE (id number, item_id number, name varchar2(20))
  PARTITION BY RANGE (id, item_id)
  (partition EASYINVOICE_1 values less than (10, 100),
  partition EASYINVOICE_2 values less than (20, 200),
  partition EASYINVOICE_3 values less than (30, 300),
  partition EASYINVOICE_4 values less than (40, 400));
 Table created

 CREATE INDEX test_idx ON EASY_INVOICE(id, item_id)
  LOCAL
  (partition test_idx_1,
  partition test_idx_2,
  partition test_idx_3,
  partition test_idx_4);

 Index created.


SELECT index_name, partition_name, status
  FROM user_ind_partitions where index_name='TEST_IDX'
  ORDER BY index_name, partition_name;
E.g.:

1) Check the partitioned table:


SELECT distinct table_name
FROM dba_tab_partitions where table_name like '%EASY_INVOICE%' ORDER BY table_name;

set long 9999999
set pagesize 0
set linesize 120 

SELECT DBMS_METADATA.GET_DDL('TABLE','EASY_INVOICE','EASYOWNER') FROM DUAL;


COLUMN table_name FORMAT A30
COLUMN partition_name FORMAT A30
COLUMN high_value FORMAT A20

SELECT table_name,partition_name,high_value,num_rows
FROM  dba_tab_partitions where table_name='EASY_INVOICE' ORDER BY table_name, partition_name;

select max(partition_position) from dba_tab_partitions where table_owner='EASYOWNER' and table_name='EASY_INVOICE';


select PARTITION_NAME from dba_tab_partitions where table_owner='EASYOWNER' and table_name='EASY_INVOICE' and partition_position=15;


Select TABLE_OWNER, table_name, PARTITION_NAME, last_analyzed from dba_tab_partitions where table_owner='EASYOWNER' and table_name='EASY_INVOICE'
and partition_name='EASYPAR_100';

SQL>  select a.index_name, partition_name,table_name from all_ind_partitions a inner join all_part_indexes b on a.index_name = b.INDEX_NAME
where  status = 'UNUSABLE' and (a.INDEX_NAME like '%ESM%' or a.INDEX_NAME like '%SSM%')
  2    3  /

SQL> CREATE UNIQUE INDEX "EASYOWNER"."PKC_EBA_T_BMS_OUT_AGG_1SK" ON "MLY_UC_DAT"."EBA_T_BMS_OUT_AGG" ("START_DATE", "CONTROL_POINT_ID", "EVENT_TYPE_ID", "CALLING_NO_GRP_ID", "CALLED_NO_GRP_ID", "SUBSCRIBER_TYPE_ID", "ROAMING_TYPE_ID", "CALL_TYPE_ID","NE")
  2  PCTFREE 10 INITRANS 2 MAXTRANS 255

STORAGE(
BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
LOCAL
(PARTITION "P_FIRST",PARTITION "SYS_P3525304",PARTITION "SYS_P3537238",
PARTITION "SYS_P3548658",
PARTITION "SYS_P3560119",PARTITION "SYS_P3573892",PARTITION "SYS_P3579810",
PARTITION "SYS_P3591451",PARTITION "SYS_P3603371",PARTITION "SYS_P3620692",
PARTITION "SYS_P3636416",PARTITION "SYS_P3640544",PARTITION "SYS_P3646479",
PARTITION "SYS_P3658676",PARTITION "SYS_P3672966",PARTITION "SYS_P368  3  6620",
PARTITION "SYS_P3708768",PARTITION "SYS_P3717108",PARTITION "SYS_P3726469",
PARTITION "SYS_P3737761",PARTITION "SYS_P3753479",PARTITION "SYS_P3766044",
PARTITION "SYS_P3778936",PARTITION "SYS_P3785475",PARTITION "SYS_P3796607",
PARTITION "SYS_P3802547",PARTITION "SYS_P3812703",PARTITION "SYS_P3824595",
PARTITION "SYS_P3828487",PARTITION "SYS_P3840822"
  4    5    6    7    8    9   10   11  ,PARTITION "SYS_P3844642",PARTITION "SYS_P3854731",PARTITION "SYS_P3864643"
,PARTITION "SYS_P3875729",PARTITION "SYS_P3886228",PARTITION "SYS_P3895740",
PARTITION "SYS_P3905488",PARTITION "SYS_P3916783",PARTITION "SYS_P3919643",
PARTITION "SYS_P3928320",PARTITION "SYS_P3939494",PARTITION "SYS_P3954794",
PARTITION "SYS_P3963313",PARTITION "SYS_P3974561",PARTITION "SYS_P3984998",
PARTITION "SYS_P3998992",PARTITION "SYS_P4008979",PARTITION "SYS_P4013778",
PARTITION "SYS_P4023967",PARTITION "SYS_P4035561", PAR 12  TITION "SYS_P4047192",
PARTITION "SYS_P4052024",PARTITION "SYS_P4073615",PARTITION "SYS_P4077074",
 13   14   15   16   17   18   19   20   21   22   23   24   25  PARTITION "SYS_P4080872"
 26  PCTFREE 10 INITRANS 2 MAXTRANS 125 LOGGING
STORAGE(
BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
TABLESPACE "RAS_UC_DAT_TAB") ; 27   28   29


Index created.

To rebuild partition index

set time on;
set timing on;
ALTER INDEX EASYOWNER.EASYINVOICE REBUILD PARTITION EAB_P1_2015;
ALTER INDEX EASYOWNER.EASYINVOICE REBUILD PARTITION SYS_P10766;
ALTER INDEX EASYOWNER.EASYINVOICE REBUILD PARTITION SYS_P12025;
ALTER INDEX EASYOWNER.EASYINVOICE REBUILD PARTITION SYS_P13245;
ALTER INDEX EASYOWNER.EASYINVOICE REBUILD PARTITION SYS_P14403;
ALTER INDEX EASYOWNER.EASYINVOICE REBUILD PARTITION SYS_P1520;

Oracle will automatically use equal partitioning of the index based upon the number of partitions in the indexed table. For example, if we created a local index on all_fact specifying four partitions while the table itself has a different number of partitions, the CREATE INDEX would fail since the partitions do not match. This equal partitioning also makes index maintenance easier, since a single partition can be taken offline and the index rebuilt without affecting the other partitions in the table.

Global partitioned indexes

A global index is a one-to-many relationship, allowing one index partition to map to many table partitions. A global partitioned index can be partitioned independently of the table, on a key other than the table partition key. Global partitioned indexes are common in OLTP (online transaction processing) applications, where fewer index probes are required than with local partitioned indexes. In the global index partition scheme, the index is harder to maintain since the index may span partitions in the base table.
For example, when a table partition is dropped as part of a reorganization, the entire global index will be affected. When defining a global partitioned index, the DBA has complete freedom to specify as many partitions for the index as desired.

  • A global partitioned index is an index on a partitioned or non-partitioned table that is partitioned independently, i.e. using a different partitioning key from the table. Global-partitioned indexes can be range or hash partitioned.
  • Global partitioned indexes are more difficult to maintain than local indexes. However, they do offer a more efficient access method to any individual record.
  • During table or index interaction during partition maintenance, all partitions in a global index will be affected.
  • When the underlying table partition has any SPLIT, MOVE, DROP, or TRUNCATE maintenance operations performed on it, both global indexes and global partitioned indexes will be marked as unusable. It therefore follows that partition independence is not possible for global indexes.
  • Depending on the type of operation performed on a table partition, the indexes on the table will be affected. When altering a table partition, the UPDATE INDEXES clause can be specified; this automatically maintains the affected global indexes and partitions (see the sketch after this list).
  • The advantages of using this option are that the index will remain online and available throughout the operation and does not have to be rebuilt once the operation has completed.
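
A small illustration of the UPDATE INDEXES clause, reusing the EASY_INVOICE example table from the local-index section (without the clause, a global index on the table would be left UNUSABLE after the drop):

SQL> ALTER TABLE EASY_INVOICE DROP PARTITION EASYINVOICE_1 UPDATE INDEXES;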


Now that we understand the concept, let’s examine the Oracle CREATE INDEX syntax for a globally partitioned index:

CREATE INDEX item_idx
ON all_fact (item_nbr)
GLOBAL PARTITION BY RANGE (item_nbr)
(PARTITION city_idx1 VALUES LESS THAN (100),
 PARTITION city_idx2 VALUES LESS THAN (200),
 PARTITION city_idx3 VALUES LESS THAN (300),
 PARTITION city_idx4 VALUES LESS THAN (400),
 PARTITION city_idx5 VALUES LESS THAN (MAXVALUE));

Here, we see that the item index has been defined with five partitions, each containing a subset of the index range values. Note that it is irrelevant that the base table is in three partitions. In fact, it is acceptable to create a global partitioned index on a table that does not have any partitioning.


Making Failover Seamless

In addition to adding database instances to mitigate node failure, Oracle RAC offers a number of technologies to make a node failover seamless to the application (and subsequently, to the end user),
including the following:

Transparent Application Failover
Fast Connect Failover

Transparent Application Failover (TAF) is a client-side feature. The term refers to the failover/reestablishment of sessions in case of instance or node failures. TAF is not limited to RAC configurations; active/passive clusters can benefit equally from it. TAF can be defined through local naming in the client’s tnsnames.ora file or, alternatively, as attributes to a RAC database service. The
latter is the preferred way of configuring it. Note that this feature requires the use of the OCI libraries, so thin-client only applications won’t be able to benefit from it. With the introduction of the Oracle Instant client, this problem can be alleviated somewhat by switching to the correct driver.

TAF can operate in two ways:

It can either restore a session or re-execute a select statement in the event of a node failure.
While this feature has been around for a long time,
Oracle's Net Manager configuration assistant doesn't provide support for setting up client-side TAF. Also, TAF isn't the most elegant way of handling node failures, because any in-flight transactions will be rolled back;
TAF can resume running select statements only.

The fast connection failover feature provides a different way of dealing with node failures and other types of events published by the RAC high availability framework (also known as the Fast Application Notification, or FAN). It is more flexible than TAF.
Fast Connection Failover offers a driver-independent way for your JDBC application to take advantage of the connection failover facilities introduced in 10g Release 1 (10.1).
    When a RAC service failure is propagated to the JDBC application, the database has already rolled back the local transaction. The cache manager then cleans up all invalid connections. When an application holding an invalid connection tries to do work through that connection, it receives a SQLException ORA-17008, Closed Connection. The application has to handle the exception and reconnect.

What are the different types of failover mechanisms available

  1. JDBC-THIN driver supports Fast Connection Failover (FCF)
  2. JDBC-OCI driver supports Transparent Application Failover (TAF)
  3. JDBC-THIN 11gR2 supports Single Client Access Name (SCAN)

Can I use FCF and TAF together?
No. Only one of them should be used at a time.
 

Is FCF provided by Oracle JDBC 9i drivers?

No. FCF is built on the pooling feature known as 'Implicit Connection Caching' and this is available only with JDBC 10g or higher versions.
Please also note that in version 11gR2, FCF configured through the Implicit Connection Cache is deprecated along with the Implicit Connection Cache itself, in favor of using the Universal Connection Pool (UCP).
 

What is SCAN? Which version of JDBC supports SCAN?

SCAN or Single Client Access Name is a new Oracle Real Application Clusters (RAC) 11g Release 2 feature that provides a single name for clients to access an Oracle Database running in a cluster.
The benefit is that clients using SCAN do not need to be changed if you add or remove nodes in the cluster. Having a single name to access the cluster allows clients to use EZConnect and the simple JDBC thin URL to access any database running in the cluster, independently of which server(s) in the cluster the database is active on. SCAN provides load balancing and failover of client connections to the database. The SCAN works as an IP alias for the cluster.
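
As an illustration (cluster name, domain and service are made-up assumptions, reusing the easy_srv service from earlier in this post), an EZConnect connection and a JDBC thin URL through the SCAN would look like:

$ sqlplus system@//cluster-scan.example.com:1521/easy_srv

jdbc:oracle:thin:@//cluster-scan.example.com:1521/easy_srv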



  What exactly is the use of FCF?
  
  FCF provides very fast notification of the failure and the ability to reconnect immediately using the same URL. When a RAC node fails, the application will receive an exception. The application has to handle the exception and reconnect.
    The JDBC driver does not re-target existing connections. If a node fails, the application must close the existing connection and get a new one. The way the application knows that the node failed is by getting an exception. There is more than one ORA error that can be thrown when a node fails, and the application must be able to deal with them all.
    An application may call the isFatalConnectionError() API on the OracleConnectionCacheManager to determine if the SQLException caught is fatal.

    If the return value of this API is true, we need to retry the getConnection on the DataSource.


  How do we use FCF with JDBC driver?

    In order to use FCF with JDBC, the following things must be done:
  •         Configure and start ONS. If ONS is not correctly set up, implicit connection cache creation fails and an ONSException is thrown at the first getConnection() request (a quick cluster-side check is sketched after this list).
  • See Oracle® Universal Connection Pool for JDBC Developer's Guide in the section Configuring ONS located in Using Fast Connection Failover
  • FCF is now configured through a pool-enabled data source and is tightly integrated with UCP.  The FCF enabled through the Implicit Connection Cache as was used in 10g and 11g R1 is now deprecated.
  • Set the FastConnectionFailoverEnabled property before making the first getConnection() request to an OracleDataSource. When Fast Connection Failover is enabled, the failover applies to all connections in the pool.
  • Use a service name rather than a SID when setting the OracleDataSource url property.
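
A quick cluster-side check that ONS is configured and running (exact output varies by version):

$ srvctl status nodeapps
$ srvctl config nodeapps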

What is Transparent Application Failover

Transparent Application Failover (TAF), or simply Application Failover, is a feature of the OCI driver. It enables you to automatically reconnect to a database if the database instance to which the connection is made goes down. In this case, the active transactions roll back; the rollback restores the state as of the last committed transaction. The new database connection, though created by a different node, is identical to the original. This is true regardless of how the connection was lost.
TAF is always active and does not have to be set.
TAF cannot be used with thin driver.

Failover Modes

Transparent Application Failover can be configured to work in two modes, or it can be deactivated. If we count deactivated as a mode, TAF can be assigned the following three options:
  • Session failover
  • Select failover
  • None (default)

Failover Type Events

The following are possible failover types in the OracleOCIFailover interface:

    FO_SESSION
    Is equivalent to FAILOVER_MODE=SESSION in the tnsnames.ora file CONNECT_DATA flags. This means that only the user session is re-authenticated on the server-side while open cursors in the OCI application need to be re-executed.

    FO_SELECT
    Is equivalent to FAILOVER_MODE=SELECT in tnsnames.ora file CONNECT_DATA flags. This means that not only the user session is re-authenticated on the server-side, but open cursors in the OCI can continue fetching. This implies that the client-side logic maintains fetch-state of each open cursor.

    FO_NONE
    Is equivalent to FAILOVER_MODE=NONE in the tnsnames.ora file CONNECT_DATA flags. This is the default, in which no failover functionality is used. This can also be explicitly specified to prevent failover from happening. Additionally, FO_TYPE_UNKNOWN implies that a bad failover type was returned from the OCI driver.

Failover Methods
With the failover mode specified, users can further define a method that dictates exactly how TAF will re-establish the session on the other instance. A failover method can be defined independently of the
failover type. 
The failover method determines how the failover works; the following options are available:
  • Basic
  • Preconnect

 The basic option instructs the client to establish a new connection only after the node has failed. This can potentially lead to a large number of new connection requests to the surviving instance. In the case of a two-node RAC, this might cause performance degradation until all user
connections are re-established. If you consider using this approach, you should test for potential performance degradation during the design stage.
The preconnect option is slightly more difficult to configure. When you specify the preconnect parameter, the client is instructed to preconnect a session to a backup instance to speed up session failover. You need to bear in mind that these preconnections increase the number of sessions to the cluster. In addition, you also need to define what the backup connection should be.
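
Rather than hard-coding TAF in every client tnsnames.ora, the failover type, method, retries and delay can be attached to the service definition itself. A sketch against the mydb / easy_srv example used earlier in this post (the retry and delay values are arbitrary):

$ srvctl modify service -d mydb -s easy_srv -P BASIC -e SELECT -m BASIC -z 5 -w 10
$ srvctl config service -d mydb -s easy_srv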

What are huge pages and Large Pages

HugePages is a feature integrated into the Linux kernel since release 2.6. It basically provides an alternative to the 4 KB page size (16 KB for IA64) by offering bigger pages, which is useful when working with very large memory.
If you run an Oracle Database on a Linux server with more than 16 GB physical memory and your System Global Area (SGA) is greater than 8 GB, you should configure HugePages. Oracle promises more performance by doing this. A HugePages configuration means that the Linux kernel can handle "large pages", as Oracle generally calls them: instead of the standard 4 KB on x86 and x86_64 or 16 KB on IA64 systems, pages of 4 MB on x86, 2 MB on x86_64 and 256 MB on IA64 systems. Bigger pages mean that the system uses fewer page table entries and manages fewer mappings, which reduces the effort for their management and access.

However, there is a limitation by Oracle, because Automatic Memory Management (AMM) does not support HugePages. If you already use AMM and MEMORY_TARGET is set, you have to disable it and switch back to Automatic Shared Memory Management (ASMM), that is, set SGA_TARGET and PGA_AGGREGATE_TARGET. There is another feature called Transparent HugePages (THP) which should be disabled as well; it has been delivered since Red Hat Enterprise Linux 6 and its derivatives. Oracle as well as Red Hat recommend disabling Transparent HugePages.

Why Do You Need HugePages?

HugePages is crucial for faster Oracle database performance on Linux if you have a large RAM and SGA. If your combined database SGA is large (more than 8 GB, say; it can even be important for smaller ones), you will need HugePages configured. Note that the size of the SGA matters. Advantages of HugePages are:

  • Larger Page Size and Less # of Pages: Default page size is 4K whereas the HugeTLB size is 2048K. That means the system would need to handle 512 times less pages.
  • Reduced Page Table Walking: Since a HugePage covers greater contiguous virtual address range than a regular sized page, a probability of getting a TLB hit per TLB entry with HugePages are higher than with regular pages. This reduces the number of times page tables are walked to obtain physical address from a virtual address.
  • Less Overhead for Memory Operations: On virtual memory systems (any modern OS) each memory operation is actually two abstract memory operations. With HugePages, since there are fewer pages to work on, the possible bottleneck on page table access is clearly avoided.
  • Less Memory Usage: From the Oracle Database perspective, with HugePages the Linux kernel will use less memory to create page tables to maintain virtual-to-physical mappings for the SGA address range, in comparison to regular size pages. This leaves more memory available for process-private computations or PGA usage.
  • No Swapping: We must avoid swapping on the Linux OS at all costs (Document 1295478.1). HugePages are not swappable (whereas regular pages are), therefore there is no page replacement mechanism overhead. HugePages are universally regarded as pinned.
  • No 'kswapd' Operations: kswapd will get very busy if there is a very large area to be paged (e.g. 13 million page table entries for 50GB memory) and will use an incredible amount of CPU resource. When HugePages are used, kswapd is not involved in managing them. See also Document 361670.1.


Troubleshooting

Some of the common problems and how to troubleshoot them are listed in the following table:

Symptom: System is running out of memory or swapping
Possible cause: Not enough HugePages to cover the SGA(s), so the area reserved for HugePages is wasted and the SGAs are allocated through regular pages
Action: Review your HugePages configuration to make sure that all SGA(s) are covered

Symptom: Databases fail to start
Possible cause: memlock limits are not set properly
Action: Make sure the settings in limits.conf apply to the database owner account

Symptom: One of the databases fails to start while another is up
Possible cause: The SGA of the specific database could not find available HugePages and the remaining RAM is not enough
Action: Make sure that the RAM and HugePages are enough to cover all your database SGAs

Symptom: Cluster Ready Services (CRS) fail to start
Possible cause: HugePages configured too large (maybe larger than the installed RAM)
Action: Make sure the total SGA is less than the installed RAM and re-calculate HugePages

Symptom: HugePages_Total = HugePages_Free
Possible cause: HugePages are not used at all; no database instances are up, or they are using AMM
Action: Disable AMM and make sure that the database instances are up (see Doc ID 1373255.1)

Symptom: Database started successfully but the performance is slow
Possible cause: The SGA of the specific database could not find available HugePages, so the SGA is handled by regular pages, which leads to slow performance
Action: Make sure that there are enough HugePages to cover all your database SGAs
So let's get started and come to the 7 steps:

1. Check Physical Memory

First we should check our "physical" available memory. In the example we have about 128 GB of RAM. SGA_TARGET and PGA_AGGREGATE_TARGET together should not be more than the available memory. Besides that, there should be enough space left for the OS processes themselves:

grep MemTotal /proc/meminfo

MemTotal: 132151496 kB

2. Check Database Parameter

Second, check your database parameters. First of all: is AMM disabled? MEMORY_TARGET and MEMORY_MAX_TARGET should be set to 0:

SQL> select value from v$parameter where name = 'memory_target';

VALUE
---------------------------
0

How big is our SGA? In this example, about 40 GB. Important: in the following query we directly convert to kB (value/1024), so that we can continue calculating with that value:

SQL> select value/1024 from v$parameter where name = 'sga_target';

VALUE
---------------------------
41943040
Finally, by default the parameter use_large_pages should be set to TRUE:

SQL> select value from v$parameter where name = 'use_large_pages';

VALUE
---------------------------
TRUE

3. Check Hugepagesize

In our example we use a x86_64 Red Hat Enterprise Linux Server. So by default hugepagesize should be set to 2 MB:

grep Hugepagesize /proc/meminfo

Hugepagesize:       2048 kB

4. Calculate Hugepages

For the calculation of the number of HugePages there is an easy formula:

SGA / Hugepagesize = Number Hugepages
Following our example:

41943040 / 2048 = 20480

If you run more than one database on your server, you should include the SGA of all of your instances into the calculation:

( SGA 1. Instance + SGA 2. Instance + … etc. ) / Hugepagesize = Number Hugepages
In My Oracle Support you can find a script (Doc ID 401749.1) called hugepages_settings.sh, which does the calculation for you. It also checks your kernel version and the shared memory segments actually allocated for the SGA. Note that the calculation only considers the SGAs that are currently allocated: if your second instance is down, it will not be taken into account. So adjust your SGA settings and restart your databases first, then run the script. The result should be a line like the following; you can also do your own calculation and compare it with the script:

Recommended setting: vm.nr_hugepages = 20480
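If you want to verify the arithmetic yourself, here is a minimal sketch (only an illustration, not the MOS script; it assumes a single instance and takes the SGA size in kB from the example above):

# Hugepagesize in kB, taken from /proc/meminfo (2048 on x86_64 RHEL by default)
HPG_SZ=$(grep Hugepagesize /proc/meminfo | awk '{print $2}')
# SGA size in kB (41943040 kB = 40 GB in our example)
SGA_KB=41943040
# Number of HugePages = SGA (kB) / Hugepagesize (kB)
echo $(( SGA_KB / HPG_SZ ))
20480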

5. Change Server Configuration

The next step is to enter the number of HugePages in the server configuration file; for that you need root permissions. On Red Hat Enterprise Linux 6 this is /etc/sysctl.conf.

vi /etc/sysctl.conf

vm.nr_hugepages=20480
If inserted correctly, the following result should show up:

grep vm.nr_hugepages /etc/sysctl.conf

vm.nr_hugepages=20480 
The next parameters are the hard and soft memlock limits in /etc/security/limits.conf for our oracle user. This value (in kB) should be smaller than the installed memory, but it must be at least as large as the total HugePages allocation, so that 100 percent of our HugePages can be locked. Hence the following calculation:

Number Hugepages * Hugepagesize = minimum Memlock
Following our example:

20480 * 2048 = 41943040
vi /etc/security/limits.conf

oracle               soft    memlock 41943040
oracle               hard    memlock 41943040
If inserted correctly, the following result should show up:

grep oracle /etc/security/limits.conf

...
oracle               soft    memlock 41943040
oracle               hard    memlock 41943040
As mentioned before, we have to disable Transparent HugePages, which are enabled by default from Red Hat Enterprise Linux 6 onwards:

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
[always] madvise never

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
always madvise [never]
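Note that these echo commands do not survive a reboot. A common approach (shown here only as an assumption; adapt the paths to your distribution) is to re-apply them at boot via /etc/rc.local, or to add the setting to the kernel boot line:

# /etc/rc.local (re-applied at every boot)
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

# alternatively, append to the kernel line in the boot loader configuration
transparent_hugepage=never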

6. Server Reboot

Once all parameters are set, reboot your server completely. As an alternative you can reload the kernel parameters with sysctl -p; note, however, that on a running system the full number of HugePages may not be allocated if memory is already fragmented, so a reboot is the safer option.

7. Check Configuration

Memlock correct?

ulimit -l

41943040 

HugePages correctly configured and in use?

grep Huge /proc/meminfo

AnonHugePages:    538624 kB 
HugePages_Total:    20480
HugePages_Free:     12292
HugePages_Rsvd:      8188
HugePages_Surp:        0
Hugepagesize:       2048 kB
Transparent Hugepages disabled?

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled

always madvise [never]
Does the database actually use HugePages? To verify, take a look at the alert log. After "Starting ORACLE instance (normal)" the "Large Pages Information" entry gives us the answer:

************************ Large Pages Information *******************

Per process system memlock (soft) limit = 100 GB
Total Shared Global Region in Large Pages = 40 GB (100%) 
Large Pages used by this instance: 20481 (40 GB)
Large Pages unused system wide = 0 (0 KB)
Large Pages configured system wide = 20481 (40 GB)
Large Page size = 2048 KB

********************************************************************
If your configuration is incorrect, Oracle prints a recommendation for the right settings here. In the following example exactly one page is missing, i.e. 2048 kB of memlock, to get 100% of the SGA into HugePages:

************************ Large Pages Information *******************
...
...

RECOMMENDATION:

Total System Global Area size is 40 GB. For optimal performance,
prior to the next instance restart:
1. Increase the number of unused large pages by
at least 1 (page size 2048 KB, total size 2048 KB) system wide to
get 100% of the System Global Area allocated with large pages
2. Large pages are automatically locked into physical memory.
Increase the per process memlock (soft) limit to at least 40 GB to lock
100% System Global Area's large pages into physical memory

********************************************************************
Done!


Why the OLR is used and its significance at clusterware startup

An additional cluster configuration file has been introduced with Oracle 11.2, the so-called Oracle Local Registry (OLR). Each node has its own copy of the file in the Grid Infrastructure software home.
The OLR stores important security contexts used by the Oracle High Availability Service early in the start sequence of Clusterware. The information in the OLR and the Grid Plug and Play configuration file are needed to locate the voting disks. If they are stored in ASM, the discovery string in the GPnP profile will be used by the cluster synchronization daemon to look them up. Later in the Clusterware boot sequence, the ASM instance will be started by the cssd process to access the OCR files; however, their location is stored in the /etc/oracle/ocr.loc file, just as it is in RAC 11.1. Of course, if the voting files and OCR are on a shared cluster file system, then an ASM instance is not needed and won’t be started unless a different resource depends on ASM.
Storing Information in the Oracle Local Registry
The Oracle Local Registry is the OCR’s local counterpart and a new feature introduced with Grid Infrastructure. The information stored in the OLR is needed by the Oracle High Availability Services daemon (OHASD) to start; this includes data about GPnP wallets, Clusterware configuration, and version information. Comparing the OCR with the OLR reveals that the OLR has far fewer keys;
for example, ocrdump reported 704 different keys for the OCR vs. 526 keys for the OLR on our installation.
If you compare the keys themselves, you will notice that the majority of keys in the OLR deal with the OHASD process, whereas the majority of keys in the OCR deal with CRSD. This confirms what we said earlier: you need the OLR (along with the GPnP profile) to start the High Availability Services stack.
In contrast, the OCR is used extensively by CRSD. The OLR is maintained by the same command-line utilities as the OCR, with the appended -local option. Interestingly, the OLR is automatically backed up during an upgrade to Grid Infrastructure, whereas the OCR is not.
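To look at the OLR yourself, the usual OCR utilities accept a -local flag; for example (output not shown, values differ per installation):

# check integrity and location of the Oracle Local Registry
ocrcheck -local

# dump the OLR contents to a text file for inspection
ocrdump -local /tmp/olr_dump.txt

# take a manual backup of the OLR (run as root)
ocrconfig -local -manualbackup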

What is Grid Infrastructure Agents

In Oracle 11gR2 and later, there are two new types of agent processes: the Oracle Agent and the Oracle Root Agent. These processes interface between Oracle Clusterware and managed resources.
In previous versions of Oracle Clusterware, this functionality was provided by the RACG family of scripts and processes.
To slightly complicate matters, there are two sets of Oracle Agents and Oracle Root Agents, one for the High Availability Services stack and one for the Cluster Ready Services stack.
The Oracle Agent and Oracle Root Agent that belong to the High Availability Services stack are started by ohasd daemon. The Oracle Agent and Oracle Root Agent pertaining to the Cluster Ready
Services stack are started by the crsd daemon. In systems where the Grid Infrastructure installation is not owned by Oracle—and this is probably the majority of installations—there is a third Oracle Agent
created as part of the Cluster Ready Services stack. Similarly, the Oracle Agent spawned by OHAS is owned by the Grid Infrastructure software owner.
In addition to these two processes, there are agents responsible for managing and monitoring the CSS daemon, called CSSDMONITOR and CSSDAGENT. CSSDAGENT, the agent process responsible for spawning CSSD is created by the OHAS daemon. CSSDMONITOR, which monitors CSSD and the overall node health (jointly with the CSSDAGENT), is also spawned by OHAS.
You might wonder how CSSD, which is required to start the clustered ASM instance, can be started if voting disks are stored in ASM? This sounds like a chicken-and-egg problem: without access to the voting disks there is no CSS, hence the node cannot join the cluster. But without being part of the
cluster, CSSD cannot start the ASM instance. To solve this problem the ASM disk headers have new metadata in 11.2: you can use kfed to read the header of an ASM disk containing a voting disk. The kfdhdb.vfstart and kfdhdb.vfend fields tell CSS where to find the voting file. This does not require the ASM instance to be up. Once the voting disks are located, CSS can access them and joins the cluster.
The high availability stack’s Oracle Agent runs as the owner of the Grid Infrastructure stack in a clustered environment, as either the oracle or grid users. It is spawned by OHAS directly as part of the cluster startup sequence, and it is responsible for starting resources that do not require root privileges.

The list of processes Oracle Agent starts includes the following:

  • EVMD and EVMLOGGER
  • the gipc daemon
  • the gpnp daemon
  • The mDNS daemon
The Oracle Root Agent that is spawned by OHAS in turn starts all daemons that require root privileges to perform their programmed tasks. Such tasks include the following:
  • CRS daemon
  • CTSS daemon
  • Disk Monitoring daemon
  • ACFS drivers
Once CRS is started, it will create another Oracle Agent and Oracle Root Agent. If Grid Infrastructure is owned by the grid account, a second Oracle Agent is created. The grid Oracle Agent(s) will be responsible for:
  • Starting and monitoring the local ASM instance
  • ONS and eONS daemons
  • The SCAN listener, where applicable
  • The node listener
There can be a maximum of three SCAN listeners in the cluster at any given time. If you have more than three nodes, then you can end up without a SCAN listener on some nodes. Likewise, in the extreme example where there is only one node in the cluster, you could end up with three SCAN listeners on that node. The oracle Oracle Agent will only spawn the database resource if account separation is used. If not—i.e., if you did not install Grid Infrastructure with a different user than the RDBMS binaries—then the oracle Oracle Agent will also perform the tasks listed previously for the grid Oracle Agent.
The Oracle Root Agent finally will create the following background processes:
  • GNS, if configured
  • GNS VIP if GNS enabled
  • ACFS Registry
  • Network
  • SCAN VIP, if applicable
  • Node VIP
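On a running node you can see all of these agents as operating system processes; a quick, illustrative check is:

# agents spawned by OHASD and CRSD (paths and owners will vary per installation)
ps -ef | egrep 'oraagent|orarootagent|cssdagent|cssdmonitor' | grep -v grep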

The functionality provided by the Oracle Agent process in Oracle 11gR2

Clusterware startup sequence

Here is the brief explanation that how the clusterware brings up step by step .

  1.  When a node of an Oracle Clusterware cluster restarts, OHASD is started by platform-specific means. OHASD is the root for bringing up Oracle Clusterware. OHASD has access to the OLR (Oracle Local Registry) stored on the local file system. OLR provides needed data to complete OHASD initialization.
  2. OHASD brings up GPNPD and CSSD. CSSD has access to the GPnP Profile stored on the local file system. This profile contains the following vital bootstrap data:
        a. ASM Diskgroup Discovery String 
        b. ASM SPFILE location (Diskgroup name) 
        c. Name of the ASM Diskgroup containing the Voting Files 

  3. The Voting Files locations on ASM Disks are accessed by CSSD with well-known pointers in the ASM Disk headers, and CSSD is able to complete initialization and start or join an existing cluster.

  4. OHASD starts an ASM instance and ASM can now operate with CSSD initialized and operating. The ASM instance uses special code to locate the contents of the ASM SPFILE, assuming it is stored in a Diskgroup.

  5. With an ASM instance operating and its Diskgroups mounted, access to Clusterware’s OCR is available to CRSD.
  6. OHASD starts CRSD with access to the OCR in an ASM Diskgroup.
  7. Clusterware completes initialization and brings up other services under its control.

or

12c Oracle Clusterware Startup Sequence - Oracle Clusterware starts automatically when a RAC node boots. The startup sequence runs through different levels; in the figure below you can see how the multi-level startup process brings up the full Grid Infrastructure stack and the resources that Clusterware manages.

This tutorial will describe startup sequence of oracle 12c RAC clusterware which is installed on Unix / Linux platform.

Oracle 12c RAC Clusterware Startup Sequence


Once the operating system finishes its bootstrap process, the init daemon (init/init.d) reads /etc/inittab; the entry there is what triggers the Oracle High Availability Services daemon.

$cat /etc/inittab | grep init.d | grep -v grep
h1:35:respawn:/etc/init.d/init.ohasd run  >/dev/null  2>&1 </dev/null
Oracle Linux 6.x and Red Hat Enterprise Linux 6.x have deprecated inittab. There, init.ohasd is configured to start in /etc/init/oracle-ohasd.conf:

    $cat /etc/init/oracle-ohasd.conf

    start on runlevel [35]
    stop on runlevel [!35]
    respawn
    exec /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null

    This starts up "init.ohasd run", which in turn starts the ohasd.bin background process:

    $ps  -ef  | grep  ohasd  | grep  -v grep
    root  4056  1  1  Feb19   ?     01:54:34 /u01/app/12.1.0/grid/bin/ohasd.bin  reboot
    root  2715  1   0 Feb19  ?     00:00:00  /bin/sh   /etc/init.d/init.ohasd  run

    OHASD (Oracle High Availability Services daemon) - on a standalone, single-node installation this stack is also known as Oracle Restart.

    First /etc/init triggers OHASD. Once OHASD is started at Level 0, it is responsible for starting the rest of the Clusterware stack and the resources that Clusterware manages, directly or indirectly, through Levels 1-4.

    Level 1 - OHASD on its own triggers four agent processes:

    cssdmonitor : CSS Monitor
    OHASD orarootagent : High Availability Service stack Oracle root agent
    OHASD oraagent : High Availability Service stack Oracle Agent
    cssdagent : CSS Agent


    Level 2 - On this level, the OHASD oraagent triggers five processes:

    mDNSD : mDNS daemon process
    GIPCD : Grid Interprocess Communication daemon
    GPnPD : GPnP Profile daemon
    EVMD : Event Monitor daemon
    ASM : Resources for monitoring ASM instances

    Then, the OHASD orarootagent triggers the following processes:
    CRSD : CRS daemon
    CTSSD : Cluster Time Synchronisation Service daemon
    Diskmon : Disk Monitor daemon (Exadata Storage Server)
    ACFS : (ASM Cluster File System) drivers

    Next, the cssdagent starts the CSSD (CSS daemon) process.

Level 3 - The CRSD spawns two CRSD agents: CRSD orarootagent and CRSD oraagent.
Level 4 - On this level, the CRSD orarootagent is responsible for starting the following resources:
Network resource : for the public network
SCAN VIPs
Node VIPs : VIPs for each node
ACFS Registry
GNS VIP : VIP for GNS if you use the GNS option
Then, the CRSD oraagent is responsible for starting the rest of the resources, as follows:
ASM Resource : ASM instance(s) resource
Diskgroup : Used for managing / monitoring ASM diskgroups
DB Resource : Used for managing and monitoring the database and instances
SCAN Listener : Listener for SCAN, listening on the SCAN VIP
Listener : Node listener, listening on the node VIP
Services : Database services
ONS
eONS : Enhanced ONS
GSD : For 9i backward compatibility
GNS : Performs name resolution (optional)
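The resources managed by the lower (High Availability Services) stack and the overall stack health can be checked with crsctl, for example:

# resources of the High Availability Services stack (ohasd-managed)
crsctl stat res -t -init

# overall health of the complete stack on all nodes
crsctl check cluster -all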


How the Database interacts with ASM


The file creation process provides a fine illustration of the interactions that take place between
database instances and ASM. The file creation process occurs as follows:

1. The database requests file creation.
2. An ASM foreground process creates a Continuing Operation Directory (COD) entry and
allocates space for the new file across the disk group.
3. The ASMB database process receives an extent map for the new file.
4. The file is now open and the database process initializes the file directly.
5. After initialization, the database process requests that the file creation is committed. This
causes the ASM foreground process to clear the COD entry and mark the file as created.
6. Acknowledgement of the file commit implicitly closes the file. The database instance will
need to reopen the file for future I/O.  
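As a simple illustration of this interaction (the tablespace and disk group names below are assumptions, not taken from the examples above), create a file in an ASM disk group from the database instance and then look at the connected clients on the ASM instance:

-- on the database instance: create a datafile in the +DATA disk group
SQL> CREATE TABLESPACE demo_ts DATAFILE '+DATA' SIZE 100M;

-- on the ASM instance: list the database instances connected through ASMB
SQL> SELECT instance_name, db_name, status FROM v$asm_client;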


What is GPnP profile and its importance


The GPnP profile is an XML file located at $GRID_HOME/gpnp/<hostname>/profiles/peer/profile.xml. Each node of the cluster maintains a local copy of this profile, and it is maintained by the GPnP daemon along with the mDNS daemon.
Before looking at why Oracle came up with the GPnP profile, we need to focus on what it contains. The GPnP profile holds a node’s metadata about the network interfaces for the public and private interconnect, the ASM server parameter file, and the CSS voting disks. The profile is protected by a wallet against modification. If you have to modify the profile manually, it must first be unsigned with $GRID_HOME/bin/gpnptool, modified, and then signed again with the same utility; however, there is very little chance you would ever be required to do so.

Now we’ll use the gpnptool with get option to dump this xml file into standard output. Below is the formatted output for the ease of readability.

<?xml version="1.0" encoding="UTF-8"?>
<gpnp:GPnP-Profile Version="1.0" xmlns="http://www.xyz/gpnp-profile"
xmlns:gpnp="http://xyz/gpnp-profile"
xmlns:orcl="http://xyz/gpnp-profile"
xmlns:xsi="http://xyz/XMLSchema-instance"
xsi:schemaLocation="http://xyz/gpnp-profile gpnp-profile.xsd"
ProfileSequence="3" ClusterUId="002c207a71cvaljgkcea7bea5b3a49"
ClusterName="Cluster01" PALocation="">
<gpnp:Network-Profile>
<gpnp:HostNetwork id="gen" HostName="*">
<gpnp:Network id="net1" IP="xxx.xx.x.x" Adapter="bond0" Use="public"/>
<gpnp:Network id="net2" IP="xxx.xxx.x.x" Adapter="bond1"
Use="cluster_interconnect"/>
</gpnp:HostNetwork>
</gpnp:Network-Profile>
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400" />
<orcl:ASM-Profile id="asm" DiscoveryString=""
SPFile="+DATA/prod/asmparameterfile/registry.253.699915959" />
<ds:Signature...>...</ds:Signature>
</gpnp:GPnP-Profile>

So from the above dump we can see that the GPnP profile contains the following information:

1) Cluster Name
2) Network Profile
3) CSS-Profile tag
4) ASM-Profile tag
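For reference, the same pieces of information can be cross-checked with standard utilities:

# network profile: public and cluster_interconnect interfaces
oifcfg getif

# CSS profile: location of the voting files
crsctl query css votedisk

# ASM profile: location of the ASM server parameter file
asmcmd spget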

Now that we have understood the content of the GPnP profile, we need to understand how Clusterware uses this information to start. From 11gR2 you have the option of storing the OCR and voting disks in ASM; but Clusterware needs the OCR and voting disks to start CRSD and CSSD, and both of these files are in ASM, which is itself a resource on the node. So how does Clusterware start, and which files does it access to get the information it needs? To resolve this, Oracle came up with two local operating system files: the OLR and the GPnP profile.
When a node of an Oracle Clusterware cluster restarts, OHASD is started by platform-specific means.OHASD has access to the OLR (Oracle Local Registry) stored on the local file system. OLR provides needed data (Would explain in another post) to complete OHASD initialization
OHASD brings up GPnP Daemon and CSS Daemon. CSS Daemon has access to the GPNP Profile stored on the local file system.
The Voting Files locations on ASM Disks are accessed by CSSD with well-known pointers in the ASM Disk headers and CSSD is able to complete initialization and start or join an existing cluster.
OHASD starts an ASM instance and ASM can now operate with CSSD initialized and operating. The ASM instance uses special code to locate the contents of the ASM SPFILE, assuming it is stored in a Diskgroup.
With an ASM instance operating and its Diskgroups mounted, access to Clusterware’s OCR is available to CRSD. OHASD starts CRSD with access to the OCR in an ASM Diskgroup, and thus Clusterware completes initialization and brings up the other services under its control.
Thus, with the information stored in the GPnP profile, along with the information in the OLR, several tasks have been automated or eased for administrators.
I hope the above information helps you in understanding the Grid Plug and Play profile, its content, its usage and why it was required. Please comment below if you need more information on GPnP, such as the complete dump of the profile, how the GPnP daemon and mDNS daemon communicate to keep the profile up to date on all nodes, or how utilities such as oifcfg, crsctl and asmcmd use IPC to alter the contents of these files.


What is voting disk

OCR is used to store the cluster configuration details. It stores the information about the resources that Oracle Clusterware controls. The resources include the Oracle RAC database and instances, listeners, and virtual IPs (VIPs) such as SCAN VIPs and local VIPs.

The voting disk (VD) stores the cluster membership information. Oracle Clusterware uses the VD to determine which nodes are members of a cluster. Oracle Cluster Synchronization Service daemon (OCSSD) on each cluster node updates the VD with the current status of the node every second. The VD is used to determine which RAC nodes are still in the cluster should the interconnect heartbeat between the RAC nodes fail.
CSS is the service which determines which nodes in the cluster are available and provides cluster group membership and simple locking services to other processes. CSS typically determines node availability via communication through a dedicated private network, with a voting disk used as a secondary communication mechanism. This is done by sending heartbeat messages through the network and the voting disk.
The voting disk is a file on a clustered file system that is accessible to all nodes in the cluster. Its primary purpose is to help in situations where the private network communication fails. The voting disk is then used to communicate the node state information needed to determine which nodes go offline. Without the voting disk, it can be difficult for an isolated node to determine whether it is experiencing a network failure or whether the other nodes are no longer available. It would then be possible for the cluster to enter a state where multiple sub-clusters of nodes have unsynchronized access to the same database files.
The voting disk contains information about the nodes in the cluster and their disk heartbeats: the CSSD process of each node registers information about its node in the voting disk using a pwrite() system call at a specific offset, and then uses a pread() system call to read the status of the other CSSD processes. Because the information about the nodes is also kept in the OCR/OLR, and these heartbeat writes are independent of previous calls, no permanent data is kept in the voting disk. So if you lose the voting disks, you can simply add them back without losing any data; of course, losing the voting disks can lead to node reboots. If you lose all voting disks, you have to keep the CRS daemons down, and only then can you add the voting disks back. Now that we have understood both heartbeats, which are the most important part, we will dig deeper into the voting disk: what is stored inside it, why Clusterware needs it, how many voting disks are required, and so on.
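For reference, the voting disk configuration can be listed, and if it is stored in ASM it can be moved to another disk group (the +DATA disk group name is only an example):

# list the configured voting files and the disks they reside on
crsctl query css votedisk

# move/replace the voting files into another ASM disk group (run as root)
crsctl replace votedisk +DATA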
Now, finally, to understand the whole concept of the voting disk we need to know what split brain syndrome, I/O fencing and the simple majority rule are.
Split Brain Syndrome: In an Oracle RAC environment all the instances/servers communicate with each other using high-speed interconnects on the private network. This private network interface, or interconnect, is redundant and is used only for inter-instance Oracle data block transfers. In the RAC context, split brain occurs when the instance members in a RAC cluster fail to ping/connect to each other via this private interconnect, but the servers are all physically up and running and the database instance on each of these servers is also running. These individual nodes are running fine and can conceptually accept user connections and work independently. So, basically, due to the lack of communication each instance thinks that the other instance it is unable to reach is down, and that it needs to do something about the situation. The problem is that if we leave these instances running, the same block might get read and updated in the individual instances and there would be a data integrity issue, as blocks changed in one instance are not locked and could be overwritten by another instance. This situation is termed Split Brain Syndrome.
I/O Fencing: There are situations where leftover write operations from failed database instances (the cluster stack failed on the nodes, but the nodes are still running at the OS level) reach the storage system after the recovery process starts. Since these write operations are no longer in the proper serial order, they can damage the consistency of the stored data. Therefore, when a cluster node fails, the failed node needs to be fenced off from all the shared disk devices or disk groups. This methodology is called I/O fencing or failure fencing.
Simple Majority Rule: According to Oracle, "An absolute majority of voting disks configured (more than half) must be available and responsive at all times for Oracle Clusterware to operate." This means that to survive the loss of 'N' voting disks, you must configure at least '2N+1' voting disks.
Now we are in a state to understand the use of voting disk in case of heartbeat failure.

Example 1: Suppose that in a 3-node cluster with 3 voting disks a network heartbeat fails between Node 1 and Node 3 and between Node 2 and Node 3, whereas Node 1 and Node 2 can still communicate via the interconnect, and from the voting disk CSSD notices that all the nodes are still able to write to the voting disks - a split-brain. The healthy nodes, Node 1 and Node 2, update the kill block in the voting disk for Node 3. During the next pread() system call of Node 3's CSSD, it sees the self-kill flag set and thus the CSSD of Node 3 evicts itself. Then I/O fencing takes place, and finally OHASD will attempt to restart the stack after a graceful shutdown.

Example 2: Suppose that in a 2-node cluster with 3 voting disks a disk heartbeat fails such that Node 1 can see 2 voting disks and Node 2 can see only 1 voting disk. (If the number of voting disks were not odd, both nodes might have concluded that the other node should be killed, making it difficult to avoid a split-brain.) Based on the simple majority rule, the CSSD process of Node 1 (2 voting disks) sends a kill request to the CSSD process of Node 2 (1 voting disk); Node 2 evicts itself, I/O fencing takes place, and finally OHASD will attempt to restart the stack after a graceful shutdown.
Thus the voting disk plays a role in both heartbeat failures, and hence it is a very important file for node eviction and I/O fencing in case of a split-brain situation.

I hope the above information helps you in understanding the concept of the voting disk, its purpose, what it contains, and when it is used. Please comment below if you need more information, for example split-brain examples for a bigger cluster, how Oracle executes STONITH internally, which processes are involved in the complete node eviction process, how to identify the cause of a node eviction, or how a node is evicted, rebooted and then joined back into the cluster.


What is node eviction and its troubleshooting steps

Clusterware will evict one or more nodes from the cluster if a critical problem is detected. These problems include:

- A node not responding via network or disk heartbeat
- A hung node
- A hung ocssd.bin process

The purpose of this is to maintain the overall health of the cluster by removing the suspect node.

In Grid Infrastructure a cluster is made up of two or more nodes, and there are two heartbeats: the voting disk heartbeat and the network heartbeat.

The network heartbeat goes across the interconnect: every second, a sending thread of CSSD sends a network TCP heartbeat to itself and to all other nodes, and a receiving thread of CSSD receives the heartbeats. If network packets are dropped or contain errors, the error correction mechanism of TCP retransmits the packet; Oracle does not retransmit in this case. In the CSSD log you will see a WARNING message about a missing heartbeat if a node does not receive a heartbeat from another node for 15 seconds (50% of misscount). Another warning is reported in the CSSD log if the same node is missing for 22 seconds (75% of misscount), and similarly at 90% of misscount; when the heartbeat has been missing for 100% of misscount (i.e. 30 seconds by default), the node is evicted.
The disk heartbeat is between the cluster nodes and the voting disk. The CSSD process on each RAC node maintains a heartbeat in a block of one OS block size, at a specific offset, using read/write system calls (pread/pwrite) on the voting disk. In addition to maintaining its own disk block, each CSSD process also monitors the disk blocks maintained by the CSSD processes running on the other cluster nodes. The written block has a header area with the node name and a counter, which is incremented with every beat (pwrite) from the other nodes. The disk heartbeat is maintained in the voting disk by the CSSD processes, and if a node has not written a disk heartbeat within the I/O timeout, the node is declared dead. Nodes whose state is unknown, i.e. which cannot definitively be said to be dead, and which are not in the group of nodes designated to survive, are evicted: the node’s kill block is updated to indicate that it has been evicted.

Thus, summarizing the heartbeats: the network heartbeat is sent every second, and nodes must respond within the CSS misscount time or face eviction. Similarly for the disk heartbeat: each node pings (reads/writes) the voting disk every second and must receive a response within the (long/short) disk timeout.
Both heartbeats have a threshold. For the network heartbeat it is misscount, 30 seconds by default, and for the disk heartbeat it is disktimeout, 200 seconds by default. If the nodes are not able to communicate with each other within the threshold time, one of the nodes will be evicted.
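Both thresholds can be checked with crsctl (changing them should only be done under the guidance of Oracle Support):

# network heartbeat threshold (default 30 seconds)
crsctl get css misscount

# disk heartbeat threshold (default 200 seconds)
crsctl get css disktimeout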


Why the number of voting disks should be odd

All nodes must be able to access a majority of the voting disks. For example, if we have three voting disks and one fails, we still have two, so Clusterware keeps functioning, because the rule is that at any given time every node must be able to access more than 50 percent of the voting disks.

other examples

When you have 1 voting disk and it goes bad, the cluster stops functioning.

When you have 2 and 1 goes bad, the same happens because the nodes realize they can only write to half of the original disks (1 out of 2), violating the rule that they must be able to write > half (yes, the rule says >, not >=).

When you have 3 and 1 goes bad, the cluster runs fine because the nodes know they can access more than half of the original voting disks (2/3 > half).

When you have 4 and 1 goes bad, the same, because (3/4 > half).

When you have 3 and 2 go bad, the cluster stops because the nodes can only access 1/3 of the voting disks, not > half.

When you have 4 and 2 go bad, the same, because the nodes can only access half, not > half.


So you see that 4 voting disks have the same fault tolerance as 3, but you waste 1 disk without gaining anything. The recommendation of an odd number of voting disks helps save a little on hardware requirements.


Node Eviction Troubleshoot steps

 The node eviction process is reported as Oracle error ORA-29740 in the alert log and LMON trace files

1. Look at the cssd.log files on both nodes; usually you will get more information from the surviving node if the other node was evicted. Also take a look at the crsd.log file.

2. The evicted node will have core dump file generated and system reboot info.

3. Find out whether there was a node reboot, and whether it was caused by CRS or something else; check the system reboot time.

4. If you see "Polling" keywords with decreasing percentage values in the cssd.log file, the eviction is probably due to the network.
If you see "Diskpingout" or something related to -DISK-, then the eviction is because of a disk timeout.

5. After determining whether it is a network or a disk issue, start going into depth.

6. Now it is time to collect NMON/OSW/RDA reports to confirm whether it was a disk or a network issue.

7. If the reports show memory contention/paging, collect an AWR report to see what load/SQL was running during that period.

8. If the network was the issue, check whether any NICs were down or whether link switching happened, and check that the private interconnect is working between the nodes.

9. Sometimes an eviction can also be due to an OS problem where the system is in a halted state for a while, memory is over-committed, or the CPU is 100% used.

10. Check OS /system logfiles to get more information.

11. What got changed recently? Ask your coworker to open up a ticket with Oracle and upload logs

12. Check the health of clusterware, db instances, asm instances, uptime of all hosts and all the logs – ASM logs, Grid logs, CRS and ocssd.log,
HAS logs, EVM logs, DB instances logs, OS logs, SAN logs for that particular timestamp.

13. Check health of interconnect if error logs guide you in that direction.

14. Check the OS memory, CPU usage if error logs guide you in that direction.

15. Check the storage error logs if the other logs guide you in that direction.

16. Run TFA and OSWATCHER, NETSTAT, IFCONFIG settings etc based on error messages during your RCA.

17. Node eviction can occur because iptables has been enabled; after iptables was turned off, everything went back to normal.
Avoid enabling firewalls between the nodes.
ACLs can open the ports on the interconnect, as we did, but we still experienced all kinds of issues
(unable to start CRS, unable to stop CRS, and node eviction).
We also had a problem with the voting disk caused by presenting LDEVs using business copies / ShadowImage, which made RAC less than happy.

18. Verify user equivalence between the cluster nodes.
19. Verify that the switch is used only for the interconnect. Do NOT use the same switch for other network operations.
20. Verify that all nodes have 100% the same configuration; sometimes there are network or configuration differences that are not obvious.
Look for hangs in the logs and in monitoring tools like NAGIOS to see whether a node ran out of RAM or became unresponsive.
21. A major reason for node evictions in our cluster was the patch levels not being equal across the two nodes.

Nodes sometimes died completely, without any error whatsoever. It turned out to be a bug in the installer of the 11.1.0.7.1 PSU.
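When an eviction has occurred, a convenient way to gather the relevant Clusterware and OS logs from all nodes for Oracle Support is TFA; for example (the time window below is just a placeholder, adjust it to the eviction time):

# collect diagnostics from all nodes for the last four hours
$GRID_HOME/bin/tfactl diagcollect -since 4h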


What is Undo retention and retention guarantee and its importance


UNDO_RETENTION is a parameter introduced in Oracle 9i.

This parameter is used to support the "flashback query" feature. However this parameter can potentially cause the Ora-1555 "snapshot too old" error.  The value for this parameter is specified in seconds. This parameter determines the lower threshold value of undo retention. The system retains undo for at least the time specified in this parameter.

Oracle 10g/11g automatically tunes undo retention to reduce the chances of "snapshot too old" errors during long-running queries.  In the event of any undo space constraints the system will prioritize DML operations over undo retention meaning the low threshold may not be achieved.  If the undo retention threshold must be guaranteed, even at the expense of DML operations, the RETENTION GUARANTEE clause can be set against the undo tablespace during or after creation:

-- Reset the undo low threshold.
ALTER SYSTEM SET UNDO_RETENTION = 2400;

-- Guarantee the minimum threshold is maintained.
ALTER TABLESPACE undotbs RETENTION GUARANTEE;

-- Check the value in data dictionary view.
SELECT tablespace_name, retention FROM dba_tablespaces;

TABLESPACE_NAME RETENTION
------------------------------ -----------
SYSTEM NOT APPLY
UNDOTBS GUARANTEE
SYSAUX NOT APPLY
TEMP NOT APPLY
USERS NOT APPLY

5 rows selected.

-- Switch back to the default mode.
ALTER TABLESPACE undotbs1 RETENTION NOGUARANTEE;

-- Check the value in data dictionary view.
SELECT tablespace_name, retention FROM dba_tablespaces;

TABLESPACE_NAME RETENTION
------------------------------ -----------
SYSTEM NOT APPLY
UNDOTBS NOGUARANTEE
SYSAUX NOT APPLY
TEMP NOT APPLY
USERS NOT APPLY

5 rows selected.

As the name suggests, the NOT APPLY value is assigned to non-undo tablespaces for which this functionality does not apply.

You can enable the guarantee option for the undo tablespace when it is created by either the CREATE DATABASE or CREATE UNDO TABLESPACE statement or at a later period using the ALTER TABLESPACE statement.

Automatic Tuning of Undo Retention Common Issues


Automatic Tuning of Undo Retention Feature

Oracle 10g/11g and higher version automatically tunes undo retention to reduce the chances of "snapshot too old" errors during long-running queries.  In the event of any undo space constraints, the system will prioritize DML operations over undo retention. In such situations, the low threshold may not be achieved and tuned_undoretention can go below undo_retention. If the undo retention threshold must be guaranteed, even at the expense of DML operations, the RETENTION GUARANTEE clause can be set against the undo tablespace during or after creation.

-- Set/Reset the undo low threshold.
ALTER SYSTEM SET UNDO_RETENTION = 900;

-- Guarantee the minimum threshold is maintained.
ALTER TABLESPACE undotbs RETENTION GUARANTEE;

You can enable the guarantee option for the undo tablespace when it is created by either the CREATE DATABASE or CREATE UNDO TABLESPACE statement or at a later period using the ALTER TABLESPACE statement.

Thus, tuned_undoretention can be less than undo_retention specified in the parameter file.

Common Issues
 1- Space Related Issues/ Undo Tablespace is Full
Many UNEXPIRED undo extents can be seen when selecting from the dba_undo_extents view, although no ORA-1555 or ORA-30036 errors are reported.

a. Expected behavior/ Concepts Misunderstanding:

When the UNDO tablespace is created with NO AUTOEXTEND, the below allocation algorithm is being followed:

1. If the current extent has more free blocks then the next free block is allocated.
2. Otherwise, if the next extent expired then wrap in the next extent and return the first block.
3. If the next extent is not expired then get space from the UNDO tablespace. If a free extent is available then allocate it to the undo segment and return the first block in the new extent.
4. If there is no free extent available, then steal expired extents from offline undo segments. De-allocate the expired extent from the offline undo segment and add it to the undo segment. Return the first free block of the extent.
5. If no expired extents are available in offline undo segments, then steal from online undo segments and add the new extents to the current undo segment.  Return the first free block of the extent.
6. Extend the file in the UNDO tablespace. If the file can be extended then add an extent to the current undo segment and then return the block.
7. Tune down retention in decrements of 10% and steal extents that were unexpired, but now expired with respect to the lower retention value.
8. Steal unexpired extents from any offline undo segments.
9. Try to reuse unexpired extents from its own undo segment. If all extents are currently busy (they contain uncommitted information) go to step 10. Otherwise, wrap into the next extent.
10. Try to steal unexpired extents from any online undo segment.
11. If all the above fails then return ORA-30036 unable to extend segment by %s in undo tablespace '%s'

For a fixed-size UNDO tablespace (NO AUTOEXTEND), starting with 10.2 Oracle provides the maximum retention possible given the fixed undo space, which is set to a value based on the UNDO tablespace size.
This means that even if undo_retention is set to a certain number of seconds (900 by default), the fixed UNDO tablespace may support a much bigger undo retention interval (e.g. 36 hours) based on the tablespace size, which keeps the undo extents in the UNEXPIRED state. This does not mean that there are no undo extents available when a transaction runs in the database, as the UNEXPIRED undo extents will be reused.

Here is a small test case for making it clearer:
Before starting any transaction in the database:

SQL> select count(status) from dba_undo_extents where status = 'UNEXPIRED';
COUNT(STATUS)
-------------
463

SQL> select count(status) from dba_undo_extents where status = 'EXPIRED';
COUNT(STATUS)
-------------
20

SQL> select count(status) from dba_undo_extents where status = 'ACTIVE';
COUNT(STATUS)
-------------
21

Space available reported by dba_free_space:
SUM(BYTES)/(1024*1024) TABLESPACE_NAME
---------------------- ---------------------
      3                  UNDOTBS1
      58.4375            SYSAUX
      3                  USERS3
      4.3125             SYSTEM
      103.9375           USERS04

When the transactions run:
SUM(BYTES)/(1024*1024) TABLESPACE_NAME
---------------------- ----------------
      58.25              SYSAUX
      98                 USERS3
      4.3125             SYSTEM
      87.9375            USERS04
 

b. wrong calculation of the tuned undo retention value
For DB version lower than 10.2.0.4

it is caused by Bug:5387030 - This bug only affects db version lower than 10.2.0.4

To investigate the issue, check the following:

1- Undo automatically managed by the database

SQL> show parameter undo_

2- Type of undo tablespace (fixed, auto extensible)

SQL> SELECT autoextensible
     FROM dba_data_files
     WHERE tablespace_name='<UNDO_TABLESPACE_NAME>'
 This returns "NO" for all the undo tablespace datafiles.

3- The undo tablespace is already sized such that it always has more than enough space to store all the undo generated within the undo_retention time, and the in-use undo space never exceeds the undo tablespace warning alert threshold (see below for the query to show the thresholds).


4- The tablespace threshold alerts recommend that the DBA add more space to the undo tablespace:

SQL> SELECT creation_time, metric_value, message_type, reason, suggested_action
     FROM dba_outstanding_alerts
     WHERE object_name='<UNDO_TABLESPACE_NAME>';
This returns a suggested action of: "Add space to the tablespace".
Or,
This recommendation has been reported in the past but the condition has now cleared:

SQL> SELECT creation_time, metric_value, message_type, reason, suggested_action, resolution
     FROM dba_alert_history
     WHERE object_name='<UNDO_TABLESPACE_NAME>';
 
5- The undo tablespace in-use space exceeded the warning alert threshold at some point in time. To see the warning alert percentage threshold, issue:

SQL> SELECT object_type, object_name, warning_value, critical_value
    FROM dba_thresholds
    WHERE object_type='TABLESPACE';
 To see the (current) undo tablespace percent of space in use:

SQL> SELECT
             ((SELECT (NVL(SUM(bytes),0))
               FROM dba_undo_extents
               WHERE tablespace_name='<UNDO_TABLESPACE_NAME>'
               AND status IN ('ACTIVE','UNEXPIRED')) * 100)
             /
             (SELECT SUM(bytes)
              FROM dba_data_files
              WHERE tablespace_name='<UNDO_TABLESPACE_NAME>')
             "PCT_INUSE"
         FROM dual;
To solve the issue, you can either apply patch:5387030 (This bug only affects db version lower than 10.2.0.4) OR use any of the following workarounds:

1- Set the AUTOEXTEND and MAXSIZE attributes of each datafile of the undo tablespace in such a way that they are autoextensible and the MAXSIZE is equal to the current size (so the undo tablespace now has the AUTOEXTEND attribute but does not autoextend):

SQL> ALTER DATABASE DATAFILE '<datafile_flename>' AUTOEXTEND ON MAXSIZE <current_size>
2- Set the following instance parameter (Contact Oracle Support before setting it):

_undo_autotune = false
With this setting, V$UNDOSTAT (and therefore V$UNDOSTAT.TUNED_UNDORETENTION) is not maintained and the undo retention used is based on the UNDO_RETENTION instance parameter. This means that you lose all the advantages of automatic undo management, so it is not an ideal long-term fix.
 

Even with the patch fix installed, the autotuned retention can still grow under certain circumstances. The fix attempts to throttle back how aggressive that autotuning will be. Options 2 and 3 may be needed to get around this aggressive growth in some environments.
For DB version 12.2 or higher with Local Undo Mode enabled

It is probably caused by bug 27543971, check note 27543971.8 for details

2- Undo Remains Unexpired and TUNED_UNDORETENTION is high
This matches Bug 9650380 where WRH$_UNDOSTAT table shows that TUNED_UNDORETENTION remains high based on previous workload after restarting the instance.
Bug 9650380 is closed as duplicate of Bug 9681444. It is caused by heavy undo generation right before an instance shutdown influencing the calculation of the tuned undo retention after instance startup.


Possible workarounds for this issue are:


1) disable automatic tuning of undo by setting _undo_autotune=false (Contact Support before setting this parameter)

This option requires manual tuning of the UNDO_RETENTION instance parameter to avoid ORA-1555 errors.

2) turn on autoextensibility of the undo tablespace datafiles and set the MAXSIZE to the actual size of the all the datafiles of the undo tablespace, or set _smu_debug_mode=33554432.

This option allows tuned undo retention, but bases the calculation on MAXQUERYLEN instead of the free space in the undo tablespace. This option also requires a sensible value for UNDO_RETENTION being set.

WARNING: do NOT set MAXSIZE > actual size! Otherwise the algorithm to calculate the tuned undo will again work based on the free space.

3) allocate more space to the undo tablespace and/or reduce the threshold level used for computation of the tuned undo retention value.

You can control the extra space available for undo growth using the tablespace warning threshold:

begin
DBMS_SERVER_ALERT.SET_THRESHOLD(
metrics_id => dbms_server_alert.tablespace_pct_full,
warning_operator => dbms_server_alert.operator_ge,
warning_value => '50', /* <<<<<<<<<<<<<<<<<<<<<<<< */
critical_operator => dbms_server_alert.operator_ge,
critical_value => '90',
observation_period => 1,
consecutive_occurrences => 1,
instance_name => NULL,
object_type => dbms_server_alert.object_type_tablespace,
object_name => 'UNDOTBS1');
end;
/
The above example provides more headroom for undo by setting the warning threshold to 50% of the tablespace size, thereby letting the tuned undo retention algorithm use 50% - 10% = 40% of the tablespace
size for its calculations. By default the algorithm uses 70% of the undo tablespace in the tuned undo retention calculation, leaving 30% headroom.

4) set the _first_spare_parameter (depending on version, check note 742035.1 for more details) or _highthreshold_undoretention (depending on version, check note 742035.1 for more details) instance parameter to a value limiting the tuned undo retention value. See Note 742035.1 for more details.

If this value is set too high, you still will encounter problems with tuned undo retention. If this value is set too low, ORA-1555 errors are bound to happen.

Next to this, monitor the ACTIVEBLKS + UNEXPIREDBLKS values in V$UNDOSTAT to make sure that not too large a portion of the undo tablespace is allocated to these block types. Otherwise, stealing of undo blocks will occur, which might again result in ORA-1555 errors. If these blocks take up too high a percentage of the undo tablespace, consider adding more space to the tablespace, or tune down the undo retention.

When available, download Patch 9681444 to resolve this issue.

Further Diagnostics
If none of the above addressed your issue, please feel free to log an SR with Oracle Support providing the following information:

1- Alert.log file

2- output of script in Doc 1579035.1.

3- Trace Files generated at the time of the issue


How to Determine the Value Of UNDO_RETENTION Parameter to Avoid ORA-1555

SYMPTOMS
The objective of this note is to explain how to set the UNDO_RETENTION parameter and to clarify how the ORA-1555 error can be generated by a wrong setting of the UNDO_RETENTION parameter value.


CAUSE
undo_retention sizing

SOLUTION
You need to tune the UNDO_RETENTION parameter, increasing it to an optimum value.
The value for this parameter is specified in seconds.
This is important for systems running long queries.

This can be tuned by checking maxquerylen in v$undostat.

The UNDO_RETENTION value should at least be equal to the length of longest running query on a
given database instance.

This can be determined by querying V$UNDOSTAT view once the database has been running for a while.

SQL> select max(maxquerylen) from v$undostat;

This needs to be captured when the system has been running for a while and is fully used.

The following two columns are enough to check whether you are hitting an out-of-space error and/or an ORA-1555:

SSOLDERRCNT - The number of ORA-1555 errors that occurred during the interval
NOSPACEERRCNT - The number of Out-of-Space errors

The following note, Note 262066.1: How To Size UNDO Tablespace For Automatic Undo Management, explains how to size the undo tablespace correctly to guarantee undo retention.

When this option is enabled the database never overwrites unexpired undo data that is, undo data
whose age is less than the undo retention period.

The storage and used space for undo is then a direct consequence of your undo_retention configuration.

The recommended value for undo_retention is at least the length of the longest-running query on a given database instance.

If you see a message in the trace file like "Query Duration=5095", it means that the query had been running for 5095 seconds when the error occurred.

Note that the UNDO_RETENTION parameter works best if the current undo tablespace has enough space for the active transactions.
If an active transaction needs undo space and the undo tablespace does not have any free space,
then the system will start reusing undo space that would have been retained.
This may cause long queries to fail.
Be sure to allocate enough space in the undo tablespace to satisfy the space requirement for the current setting of this parameter.
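For completeness, here is a sizing query along the lines of Note 262066.1 (a sketch only; it assumes the undo tablespace is called UNDOTBS1 and uses the peak undo generation rate observed in v$undostat):

-- required undo space (bytes) = undo_retention * peak undo blocks per second * block size
SELECT (ur.value * us.undo_blks_per_sec * ts.block_size) AS required_undo_bytes
FROM   (SELECT value FROM v$parameter WHERE name = 'undo_retention') ur,
       (SELECT MAX(undoblks/((end_time - begin_time) * 86400)) AS undo_blks_per_sec
        FROM   v$undostat) us,
       (SELECT block_size FROM dba_tablespaces WHERE tablespace_name = 'UNDOTBS1') ts;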





How to recover Table data Using the Flashback Table Feature

PURPOSE
-------

The purpose of this document is to show how to restore table data that was accidentally deleted or modified.

 
Recovering Tables Using the Flashback Table Feature:
-----------------------------------------------------
The FLASHBACK TABLE statement enables users to recover a table to a previous
point in time. It provides a fast, online solution for recovering a table that has been
accidentally modified or deleted by a user or application. 
Flashback Table is substantially faster than other recovery mechanisms that
can be used in this situation, such as point-in-time recovery, and does not lead to any loss of recent transactions or downtime.

Restores all data in a specified table to a previous point in time described by a
timestamp or SCN. An exclusive DML lock is held on a table while it is being
restored.

Performs the restore operation online.

Note: You must be using automatic undo management to use the
flashback table feature. It is based on undo information stored in an
undo tablespace.

Automatically restores all of the table attributes, such as indexes, triggers, and
the like, that are necessary for an application to function with the flashed back
table.

Maintains any remote state in a distributed environment. For example, all of the
table modifications required by replication if a replicated table is flashed back.

Maintains data integrity as specified by constraints. Tables are flashed back
provided none of the table constraints are violated. This includes any referential
integrity constraints specified between a table included in the FLASHBACK
TABLE statement and another table that is not included in the FLASHBACK
TABLE statement.

Even after a flashback, the data in the original table is not lost. You can later
revert to the original state.

To use the FLASHBACK TABLE statement you must have been granted the
FLASHBACK ANY TABLE system privilege or you must have the FLASHBACK object
privilege on the table. Additionally, you must have SELECT, INSERT, DELETE, and
UPDATE privileges on the table. The table that you are performing the flashback
operation on must have row movement enabled.

Example:

SQL>alter tablespace UNDOTBS1 retention guarantee;

SQL>select tablespace_name,retention from dba_tablespaces;

TABLESPACE_NAME                RETENTION
------------------------------ -----------
SYSTEM                         NOT APPLY
UNDOTBS1                       GUARANTEE
SYSAUX                         NOT APPLY
TEMP                           NOT APPLY
EXAMPLE                        NOT APPLY
USERS                          NOT APPLY
HISTORY                        NOT APPLY

7 rows selected.

SQL> ALTER TABLE flash_test_table enable row movement;

Table altered.


SQL> select * from flash_test_table;

     EMPNO EMPNAME
---------- ------------------------------
         1 Kiran
         2 Scott
         3 Tiger
         4 Jeff

SQL> select current_scn from v$database;

CURRENT_SCN
----------------
332348



SQL> connect scott/tiger
Connected.
SQL> insert into flash_test_table values(5,'Jane');

1 row created.

SQL> insert into flash_test_table values(6,'John');

1 row created.

SQL> commit;

Commit complete.

SQL> connect / as sysdba
Connected.
SQL> select current_scn from v$database;

CURRENT_SCN
----------------
332376

SQL> connect scott/tiger
Connected.

SQL> select * from flash_test_table;

     EMPNO EMPNAME
---------- ------------------------------
         1 Kiran
         2 Scott
         3 Tiger
         4 Jeff
         5 Jane
         6 John

6 rows selected.

SQL> flashback table flash_test_table to scn 332348;

Flashback complete.

SQL> select * from flash_test_table;

     EMPNO EMPNAME
---------- ------------------------------
         1 Kiran
         2 Scott
         3 Tiger
         4 Jeff

SQL> flashback table flash_test_table to scn 332376;

Flashback complete.

SQL> select * from flash_test_table;

     EMPNO EMPNAME
---------- ------------------------------
         1 Kiran
         2 Scott
         3 Tiger
         4 Jeff
         5 Jane
         6 John

6 rows selected.



Additional comment:
------------------------
Adding an example of using flashback table with a timestamp (to_timestamp):
SQL> flashback table xxx to timestamp to_timestamp('2012-09-01 11:00:00', 'YYYY-MM-DD HH24:MI:SS') ;

--

Note: When Using Dataguard

While performing a flashback table operation in a Data Guard configuration, you only need to create a guaranteed restore point on the primary. You can perform the FLASHBACK TABLE to the restore point on the primary database, and the data changes are automatically applied on the physical standby and logical standby databases.
You do not have to create a guaranteed restore point on the standby database, and no action needs to be performed on the standby while performing the flashback table.

Defragmenting Objects with Alter Shrink Method

Solution
------

Pre-requisites

With respect to indexes, an online rebuild might be more efficient than defragmentation, but this is not necessarily the case.

Online index rebuild has been available since Oracle 9i, but only in the Enterprise Edition.
To eliminate or reduce fragmentation, you can rebuild or coalesce/shrink the index. Before you perform either task, weigh the costs and benefits of each option and choose the one that works best for your situation.
Following table is a comparison of the costs and benefits associated with rebuilding and coalescing indexes.


Rebuild index:
- Quickly moves the index to another tablespace
- Higher costs: requires more disk space
- Creates a new tree, shrinks height if applicable
- Enables you to quickly change storage and tablespace parameters without having to drop the original index

Coalesce index:
- Cannot move the index to another tablespace
- Lower costs: does not require more disk space
- Coalesces leaf blocks within the same branch of the tree
- Quickly frees up index leaf blocks for use


Procedure
Prerequisites:  What privileges are needed?

In the scope of this document, ALTER TABLE SHRINK and CREATE JOB will be executed.

To alter the table, the table must be in your own schema, or you must have ALTER object privilege on the table, or you must have ALTER ANY TABLE system privilege.
To run a DBMS Scheduler job, you must have either:
CREATE JOB privilege: This privilege enables you to create jobs, chains, schedules, and programs in your own schema. You will always be able to alter and drop jobs, schedules and programs in your own schema, even if you do not have the CREATE JOB privilege. In this case, the job must have been created in your schema by another user with the CREATE ANY JOB privilege.
CREATE ANY JOB privilege: This privilege enables you to create, alter, and drop jobs, chains, schedules, and programs in any schema except SYS. This privilege is very powerful and should be used with care because it allows the grantee to execute code as any other user.
For reference, the commands in this document are executed as SYSDBA.

Check tablespace usage
Use the following SQL command to check tablespace usage:

 SELECT /*+ RULE */ T.TABLESPACE_NAME||' '||
        (100 - ROUND((((T.TOT_AVAIL - NVL(F.TOT_USED,0))*100)/TOT_AVAIL),0))
 FROM (SELECT /*+ RULE */ TABLESPACE_NAME, SUM(BYTES) TOT_USED
       FROM SYS.DBA_EXTENTS GROUP BY TABLESPACE_NAME) F,
      (SELECT /*+ RULE */ TABLESPACE_NAME, COUNT(1) FILE_COUNT,
              SUM(DECODE(maxbytes,0,BYTES,maxbytes)) TOT_AVAIL
       FROM SYS.DBA_DATA_FILES
       GROUP BY TABLESPACE_NAME
       UNION
       SELECT /*+ RULE */ TABLESPACE_NAME,COUNT(1) FILE_COUNT,
              SUM(DECODE(maxbytes,0,BYTES,maxbytes)) TOT_AVAIL
       FROM SYS.DBA_TEMP_FILES
       GROUP BY TABLESPACE_NAME) T,
       SYS.DBA_TABLESPACES D
 WHERE T.TABLESPACE_NAME = F.TABLESPACE_NAME(+)
 AND D.TABLESPACE_NAME = T.TABLESPACE_NAME
 ORDER BY ROUND((((T.TOT_AVAIL - NVL(F.TOT_USED,0))*100)/TOT_AVAIL),0);

On SMS and VWSs, we are interested in the following tablespaces:

On SMS:

TABLESPACE_NAME                PERCENT USED
------------------------------ ------------
CCS_VOUCHERS                             93
CCS_VOUCHERS_I                           66

On VWS:

TABLESPACE_NAME                PERCENT USED
------------------------------ ------------
CCS_VOUCHERS                             89
CCS_VOUCHERS_I                           82
BE_VOUCHERS                              80
BE_VOUCHERS_I                            56
 

Find segments to defragment

SQL> set lines 300
SQL> select segment_name, sum(bytes)
     from dba_segments
     where tablespace_name = '<tablespace_name_from_above>'
     group by segment_name;

On SMS and VWSs, we are interested in the following segments:

On SMS:

SEGMENT_NAME                      extents
------------------------------ ----------
CCS_VOUCHER_REFERENCE_PK               20
CCS_VOUCHER_REFERENCE_IX_01            23
CCS_VOUCHER_REFERENCE_IXR              25
CCS_VOUCHER_REFERENCE_UQ               37

On VWS:

SEGMENT_NAME                      extents
------------------------------ ----------
CCS_VOUCHER_REFERENCE_PK               20
CCS_VOUCHER_REFERENCE_UQ               46
BE_VOUCHER_PK                        1128
 

Finding Candidates for Shrinking


Why use shrink instead of coalesce?

Coalesce is designed specifically to reduce fragmentation within an index but not to deallocate any freed up blocks which are placed on the freelist and recycled by subsequent block splits.
Shrink is designed specifically to reduce the overall size of an index segment, resetting the High Water Mark (HWM) and releasing any excess storage as necessary.

The key difference being that Shrink must reorganise the index leaf blocks in such a way that all the freed up, now empty blocks are all grouped together at one end of the index segment. All these blocks can then be deallocated and removed from the index segment. This means that specific leaf block entries must be removed from these specific blocks, in order to free up the leaf blocks in this manner.

Basically, we are interested in decreasing the tablespace usage (the customer will want that in particular, to get rid of monitoring-system alarms saying the tablespace is full), therefore we will use shrink.
Before performing an online shrink, you may want to find the biggest bang for the buck by identifying the segments that can be most fully compressed. Simply use the built-in function verify_shrink_candidate in the package dbms_space.
Execute the following PL/SQL block for each segment mentioned above to test whether the segment can be shrunk to the desired size.

In the following example, the segment "BE_VOUCHER_PK" is used with the desired value 1GB:


declare
   x char(1);
begin
   if (dbms_space.verify_shrink_candidate
         ('E2BE_ADMIN','BE_VOUCHER_PK','INDEX',1073741824)  -- Shrink to 1GB here
   ) then
       x := 'T';
   else
       x := 'F';
   end if;
   dbms_output.put_line(' Result: '||x);
end;
/
 

Shrink!
After having found candidates, the desired objects can be shrunk (table or index).

Some best-practices:

From experience, a 60-million-row table from which 30 million vouchers were deleted (i.e. the table is heavily fragmented) can be shrunk in about 7 hours.
It is therefore recommended to run the shrink in a background Oracle DBMS Scheduler job.

Shrinking is best executed in two steps:
alter table <table> shrink space compact;
alter table <table> shrink space;
The first step takes a long time and does not change the HWM. The second step is much faster (since the segment is already compacted) and moves the HWM.

The following example is the procedure for the table CCS_VOUCHER_REFERENCE with the "shrink space compact" method:

SQL> alter table CCS_ADMIN.CCS_VOUCHER_REFERENCE  enable row movement;

SQL>
begin
  DBMS_SCHEDULER.create_job(
    job_name   => 'NEW_SHRINK_CCS_VOUCHER',
    job_type   => 'PLSQL_BLOCK',
    job_action => 'BEGIN EXECUTE IMMEDIATE ''alter table CCS_ADMIN.CCS_VOUCHER_REFERENCE shrink space compact''; END;',
    start_date => to_date('20110511000000', 'YYYYMMDDHH24MISS'),
    auto_drop  => FALSE,
    enabled    => TRUE,
    comments   => 'ccs_voucher_reference defragmentation');
  commit;
end;
/
You can kick off the job manually:

SQL> exec DBMS_SCHEDULER.run_job('NEW_SHRINK_CCS_VOUCHER');
If you no longer need the job:

SQL> exec DBMS_SCHEDULER.drop_job('NEW_SHRINK_CCS_VOUCHER');
To show the job's running history:


col LOG_DATE format a20
col JOB_NAME format a20
col STATUS format a15
col REQ_START_DATE format a20
col ACTUAL_START_DATE format a20
col RUN_DURATION format a20

select log_date
,      job_name
,      status
,      req_start_date
,      actual_start_date
,      run_duration
from   dba_scheduler_job_run_details
where
job_name = 'NEW_SHRINK_CCS_VOUCHER';
To check out the job running schedule:

col START_DATE format a20
col END_DATE format a20
col LAST_START_DATE format a20
select JOB_NAME,START_DATE,END_DATE, LAST_START_DATE from dba_scheduler_jobs;
At this stage the shrink compact has been performed, but the HWM has not yet been reset. Recreate the job with the following change to the job_action in the procedure:

job_action   => 'BEGIN EXECUTE IMMEDIATE ''alter table CCS_ADMIN.CCS_VOUCHER_REFERENCE shrink space compact''; END;',

to

job_action   => 'BEGIN EXECUTE IMMEDIATE ''alter table CCS_ADMIN.CCS_VOUCHER_REFERENCE shrink space''; END;',
Finally, execute the same procedure for all objects that need to be defragmented.


Troubleshooting

Shrinking is an expensive operation, especially when the table is big, because the compaction phase of segment shrink is performed as insert/delete pairs: every row that is moved involves both an insert and a delete. It is therefore no surprise to see very large redo and undo volumes.

Therefore you could face the following issue while shrinking:

IMP-00058: ORACLE error 30036 encountered
ORA-30036: unable to extend segment by 8 in undo tablespace 'UNDO'
There are 2 options to avoid it:

Compare the table size with the UNDO tablespace size. If the table requires more undo than the UNDO tablespace can hold, you will need to increase the UNDO tablespace. Another tip is to check whether RETENTION GUARANTEE is off.
Example:
SQL> select RETENTION from dba_tablespaces where tablespace_name = 'UNDOTBS1';

RETENTION
-----------
NOGUARANTEE
Logging can be temporarily disabled for the table and its tablespace by issuing the following commands:
SQL> ALTER TABLE TABLE_NAME NOLOGGING;
SQL> ALTER TABLESPACE TABLESPACE_NAME NOLOGGING;
Please remember to set it back to LOGGING after shrinking is performed.

Also consider taking a backup first; otherwise you will not be able to recover the tablespace, because no archived logs will be generated for these changes during the operation. Taking a database backup is out of scope of this document.




What is incremental checkpointing? Checkpoint & SCN

A checkpoint is a data structure that indicates the "checkpoint position", determined by the oldest dirty buffer in the database buffer cache. In terms of Oracle's clock, this position is the SCN in the redo stream where instance recovery must begin. The checkpoint position acts as a pointer into the redo stream and is stored in the control file and in each data file header. When we say a checkpoint happened, we mean that modified database buffers in the database buffer cache were written to disk. A successful checkpoint guarantees that all database changes up to the checkpoint SCN have been recorded in the datafiles, and the SCNs recorded in the file headers guarantee that all changes made to database blocks prior to that SCN are already written to disk. As a result, only those changes made after the checkpoint need to be applied during recovery.
Checkpoints are triggered by the following conditions:
§  Every 3 seconds (Incremental Checkpoint)
§  When Logswitch happened
§  When instance shutdown normal/transactional/immediate
§  Whenever Alter Tablespace [Offline Normal| Read Only| Begin Backup]
§  Controlled by internal checkpoint forced by recovery related parameters i.e. Fast_Start_MTTR_Target etc.

Purpose of Checkpoints

Oracle Database uses checkpoints to achieve the following goals:
§  Reduce the time required for recovery in case of an instance or media failure
§  Ensure that dirty buffers in the buffer cache are written to disk regularly
§  Ensure that all committed data is written to disk during a consistent shutdown

When Oracle Database Initiates Checkpoints

The checkpoint process (CKPT) is responsible for writing checkpoints to the data file headers and control file. Performing a full checkpoint every time would be a costly operation and a major bottleneck for concurrency, so Oracle uses different types of checkpoints for different purposes:

§  Full checkpoint: Writes block images to the database for all dirty buffers from all instances. Controlfile and datafile headers are updated during this checkpoint. Up to Oracle 8 a log switch also caused a full checkpoint; this was changed from 8i onwards for performance reasons. Occurs in the following situations:
§  Alter system checkpoint global
§  Alter database begin backup
§  Alter database close
§  Shutdown Immediate/Transactional
§  Thread checkpoints: The database writes to disk all buffers modified by redo in a specific thread before a certain target. The set of thread checkpoints on all instances in a database is a database checkpoint. Controlfile and datafile headers are updated during this checkpoint. Occurs in the following situations:
§  Consistent database shutdown
§  Alter system checkpoint local
§  Online redo log switch
§  Tablespace and Datafile Checkpoint: Writes block images to the database for all dirty buffers for all files of a tablespace from all instances. Controlfile and datafile headers are updated during this checkpoint. Occurs in following situations
§  Alter tablespace … offline
§  Alter tablespace … begin backup
§  Alter tablespace … read only
§  Alter database datafile resize ( while shrinking a data file)
§  Parallel Query Checkpoint: Writes block images to the database for all dirty buffers belonging to objects accessed by the query from all instances. It’s mandatory to maintain consistency. Occurs in following situations
§  Parallel Query
§  Parallel Query component of PDML or PDDL.
§  Incremental checkpoints: An incremental checkpoint is a type of thread checkpoint partly intended to avoid writing large numbers of blocks at online redo log switches. DBWn checks at least every 3 seconds to determine whether it has work to do. When DBWn writes dirty buffers, it advances the checkpoint position, causing CKPT to write the checkpoint position to the control file, but not to the data file headers.
§  Object Checkpoint: Writes block images to the database for all dirty buffers belonging to an object from all instances. Occurs in following situations
§  Drop table
§  Drop table … purge
§  Truncate table
§  Drop Index
§  Log Switch Checkpoint: Writes the contents of “some” dirty buffers to the database. Controlfile and datafile headers are updated with checkpoint_change#.
§  Instance Recovery Checkpoint: Writes recovered blocks back to the datafiles. Triggered as soon as SMON is done with instance recovery.
§  RBR Checkpoint: It’s actually Reuse Block Range checkpoint, usually appears post index rebuild operations.
§  Multiple Object Checkpoint: Triggered whenever a single operation causes checkpoints on multiple objects i.e. dropping partitioned table or index.
Whenever anything happens in the database, Oracle has an SCN that must be updated in various places. We can classify SCNs into the following major categories:
§  System (checkpoint) SCN: After a checkpoint completes, Oracle stores the system checkpoint SCN in the control file. We can check that in checkpoint_change# of v$database view.
SQL> select checkpoint_change# from v$database;

CHECKPOINT_CHANGE#
------------------
1677903

SQL> alter system checkpoint;

System altered.

SQL> select checkpoint_change# from v$database;

CHECKPOINT_CHANGE#
------------------
1679716
§  DataFile (checkpoint) SCN: After a checkpoint completes, Oracle stores the SCN individually in the control file for each datafile. The following SQL shows the datafile checkpoint SCN for a datafile in the control file:
SQL> select name,checkpoint_change# from v$datafile where name like '%system01%';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                1679716
§  Partial (Checkpoint) SCN: Operational non-full checkpoints for a subset of the system (e.g. a tablespace or a datafile) set the checkpoint only for the affected entities.
SQL> select name,checkpoint_change# from v$datafile_header where name like '%01.dbf';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                 1685610
/u02/app/oracle/oradata/mask11g/sysaux01.dbf                 1685610
/u02/app/oracle/oradata/mask11g/undotbs01.dbf                1685610
/u02/app/oracle/oradata/mask11g/users01.dbf                  1685610

SQL> alter tablespace users read only;

Tablespace altered.

SQL> select name,checkpoint_change# from v$datafile_header where name like '%01.dbf';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                  1685610
/u02/app/oracle/oradata/mask11g/sysaux01.dbf                  1685610
/u02/app/oracle/oradata/mask11g/undotbs01.dbf                 1685610
/u02/app/oracle/oradata/mask11g/users01.dbf                   1685618

SQL> alter tablespace users read write;

Tablespace altered.

SQL> select name,checkpoint_change# from v$datafile_header where name like '%01.dbf';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                  1685610
/u02/app/oracle/oradata/mask11g/sysaux01.dbf                  1685610
/u02/app/oracle/oradata/mask11g/undotbs01.dbf                 1685610
/u02/app/oracle/oradata/mask11g/users01.dbf                   1685642
§  Start (Checkpoint) SCN: Oracle stores the checkpoint SCN value in the header of each datafile. This is referred to as the start SCN because it is used at instance startup time to check if recovery is required. The following SQL shows the checkpoint SCN in the datafile header for a single datafile.
SQL> select name,checkpoint_change# from v$datafile_header where name like '%system01%';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                  1657172
§  End (checkpoint) SCN: The stop SCN or Termination is held in the control file for each datafile. The following SQL shows the stop SCN for a single datafile when the database is open for normal use.
SQL> select distinct LAST_CHANGE# from v$datafile;

LAST_CHANGE#
------------

SQL> alter database close;

Database altered.

SQL> select distinct LAST_CHANGE# from v$datafile;

LAST_CHANGE#
------------
2125206

SQL> select distinct CHECKPOINT_CHANGE# from v$datafile_header ;

CHECKPOINT_CHANGE#
------------------
2125206
During normal database operation, the stop SCN is NULL for all datafiles that are online in read-write mode.
SCN values while the database is up: following a checkpoint while the database is up and open for use, the system checkpoint SCN in the control file, the datafile checkpoint SCN in the control file, and the start SCN in each datafile header all match. The stop SCN for each datafile in the control file is NULL.
SCN after a clean shutdown: after a clean database shutdown resulting from a SHUTDOWN IMMEDIATE or SHUTDOWN NORMAL, followed by STARTUP MOUNT, a checkpoint has been performed and the stop SCN for each datafile has been set to the start SCN from the datafile header. Upon startup, Oracle checks the start SCN in the file header against the datafile checkpoint SCN in the control file. If they match, Oracle checks the start SCN in the datafile header against the datafile stop SCN in the control file. If they match, the database can be opened, because all block changes have been applied, no changes were lost on shutdown, and therefore no recovery is required on startup. After the database is opened, the datafile stop SCN in the control file changes back to NULL to indicate that the datafile is open for normal use.


 An incremental checkpoint

An incremental checkpoint is a type of thread checkpoint partly intended to avoid writing large numbers of blocks at online redo log switches. DBWn checks at least every three seconds to determine whether it has work to do. When DBWn writes dirty buffers, it advances the checkpoint position, causing CKPT to write the checkpoint position to the control file, but not to the data file headers.
During instance recovery, the database must apply the changes that occurred between the checkpoint position and the end of the redo thread. Some changes may already have been written to the data files; however, only changes with SCNs lower than the checkpoint position are guaranteed to be on disk.
Can you explain incremental checkpoints in plain English?
Answer: An incremental checkpoint is sort of like when you are sitting on the toilet taking a large dump and you flush multiple times to prevent clogging the toilet.

The "fast start" recovery (and the fast_start_mttr_target) is directly related to the incremental checkpoint.  By reducing the checkpoint time to be more frequent than a log switch, Oracle will recover and re-start faster in case of an instance crash. 
The docs note that a DBWR writes buffers to disk in advance the checkpoint position, writing the "oldest" blocks first to preserve integrity.
A "checkpoint" is the event that triggers writing of dirty blocks to the disks and a "normal" checkpoint only occurs with every redo log file switch. 
In a nutshell, an "incremental" directs the CKPT process to search for "dirty" blocks that need to be written by the DBWR process. thereby advancing the SCN to the control file.
The DBWR wakes up every 3 seconds, seeking dirty blocks and sleeps if he finds no blocks.  This prevents a "burst" of writing when a redo log switches.
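You can observe the effect of incremental checkpointing through V$INSTANCE_RECOVERY. A minimal sketch (values will obviously differ per system):

SQL> select target_mttr, estimated_mttr, actual_redo_blks, target_redo_blks
  2  from v$instance_recovery;

ACTUAL_REDO_BLKS shows how much redo would currently have to be applied if the instance crashed right now; as DBWn advances the incremental checkpoint, this value is kept close to TARGET_REDO_BLKS.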



How ASM communicates with the database

A database that stores data on ASM volumes has two additional background processes: RBAL and ASMB. RBAL performs global opens of the disks. ASMB connects to the +ASMn instance to communicate information such as file creation and deletion.
The ASMB process communicates with the CSS daemon on the node and receives file extent map information from the ASM instance. ASMB is also responsible for providing I/O statistics to the ASM instance.
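As a quick, hedged check on a database instance that uses ASM storage, you can look up these background processes and their descriptions in the data dictionary:

SQL> select name, description from v$bgprocess where name in ('ASMB','RBAL');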

How a connection is established when DML is run

The user process first communicates with a listener process that creates a server process in a dedicated environment.
Oracle Database creates server processes to handle the requests of user processes connected to the instance. The user process represents the application or tool that connects to the Oracle database.
Server processes created on behalf of each user’s application can perform one or more of the following:

  • Parse and run SQL statements issued through the application.
  • Read necessary data blocks from data files on disk into the shared database buffers of the SGA (if the blocks are not already present in the SGA).
  • Return results in such a way that the application can process the information.
When a user starts a transaction, for example a DML operation, the before image of the data is written from the buffer cache to the undo tablespace, and the details of the new changes are recorded in the redo log files.

What is Single instance recovery and RAC instance recovery

If an instance of an open database fails, either because of a SHUTDOWN ABORT statement or abnormal termination, the following situations can result:

Data blocks committed by a transaction are not written to the data files and appear only in the online redo log. These changes must be reapplied to the database.The data files contain changes that had not been committed when the instance failed. These changes must be rolled back to ensure transactional consistency. Instance recovery uses only online redo log files and current online data files to synchronize the data files and ensure that they are consistent.

Understanding Instance Recovery

Automatic instance or crash recovery:
  • Is caused by attempts to open a database whose files are not synchronized on shutdown
  • Uses information stored in redo log groups to synchronize files
  • Involves two distinct operations:
  • Rolling forward: Redo log changes (both committed and uncommitted) are applied to data files.
  • Rolling back: Changes that are made but not committed returned to their original state.
The Oracle database automatically recovers from instance failure. All that needs to happen is for the instance to be started normally. If Oracle Restart is enabled and configured to monitor this database, then this happens automatically. The instance mounts the control files and then attempts to open the data files. When it discovers that the data files have not been synchronized during shutdown, the instance uses information contained in the redo log groups to roll the data files forward to the time of shutdown. Then the database is opened and any uncommitted transactions are rolled back.

Phases of Instance Recovery

  1. Startup instance (data files are out of sync)
  2. Roll forward (redo)
  3. Committed and uncommitted data in files
  4. Database opened
  5. Roll back (undo)
  6. Committed data in files

For an instance to open a datafile, the system change number (SCN) contained in the data file's header must match the current SCN that is stored in the database's control files.
If the numbers do not match, the instance applies redo data from the online redo logs, sequentially “redoing” transactions until the data files are up to date. After all data files have been synchronized with the control files, the database is opened and users can log in.
When redo logs are applied, all transactions are applied to bring the database up to the state as of the time of failure. This usually includes transactions that are in progress but have not yet been committed. After the database has been opened, those uncommitted transactions are rolled back.
At the end of the rollback phase of instance recovery, the data files contain only committed data.

Tuning Instance Recovery

  • During instance recovery, the transactions between the checkpoint position and the end of the redo log must be applied to data files.
  • You tune instance recovery by controlling the difference between the checkpoint position and the end of the redo log.
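A hedged sketch of how this is typically done (the 60-second target is only an example value): set FAST_START_MTTR_TARGET so Oracle manages the incremental checkpoint aggressiveness, then check the advisory columns in V$INSTANCE_RECOVERY.

SQL> alter system set fast_start_mttr_target = 60 scope=both;
SQL> select recovery_estimated_ios, target_mttr, estimated_mttr from v$instance_recovery;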

 Why does Oracle recommend 3 voting disks when you have 2 nodes?

When you have 1 voting disk and it goes bad, the cluster stops functioning.
When you have 2 and 1 goes bad, the same happens because the nodes realize they can only write to half of the original disks (1 out of 2), violating the rule that they must be able to write > half (yes, the rule says >, not >=).
When you have 3 and 1 goes bad, the cluster runs fine because the nodes know they can access more than half of the original voting disks (2/3 > half).
When you have 4 and 1 goes bad, the same, because (3/4 > half).
When you have 3 and 2 go bad, the cluster stops because the nodes can only access 1/3 of the voting disks, not > half.
When you have 4 and 2 go bad, the same, because the nodes can only access half, not > half.
So you see that 4 voting disks have the same fault tolerance as 3, but you waste 1 disk without gaining anything. The recommendation for an odd number of voting disks helps save a little on hardware requirements.
All the above assume the nodes themselves are fine.
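A hedged sketch of the related commands (the disk group name +OCR_VOTE is only an example; on 11.2 with voting files in ASM, the number of voting files follows the disk group redundancy):

$ crsctl query css votedisk
$ crsctl replace votedisk +OCR_VOTE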

How a big table is loaded into the buffer cache

Before John can explain the new feature, everyone in the office wants to know why a full table scan doesn’t use the buffer cache. When a session connected to the Oracle Database instance selects data from a table, John elaborates, the database server process reads the appropriate data blocks from the disk and puts them into the buffer cache by default. Each block goes into a buffer in the buffer cache. The reason is simple: if another session wants some data from those blocks, it can be served from those cached blocks much faster than being served from disk. The buffer cache is limited and usually smaller than the entire database, so when the cache is full and a new database block comes in, Oracle Database forces old buffers that have not been accessed in a long time out of the cache to make room for the new blocks coming in.
However, John continues, consider the case of a full table scan query that selects all the blocks of the table. If that table is large, its blocks will consume a large portion of the buffer cache, forcing out the blocks of other tables. It’s unlikely that all the blocks of a large table will be accessed regularly, so having those blocks in the cache does not actually help performance. But forcing out the blocks of other tables, especially popular blocks, degrades the overall performance of the applications running against the database. That is why, John explains, Oracle Database does not load the blocks into the buffer cache for full table scans.
How is a connection established and how are SQL statements run internally?
How does the database communicate with ASM?
How does hot backup work?

How instance recovery happens in single-instance and RAC databases

Instance recovery occurs when an instance goes down abruptly, either via a SHUTDOWN ABORT, a killing of a background process, or a crash of a node or the instance itself. After an ungraceful shutdown, it is necessary for the database to go through the process of rolling forward all information in the redo logs and rolling back any transactions that had not yet been committed. This process is known as instance recovery and is usually automatically performed by the SMON process.
The redo logs for all RAC instances are located either on an OCFS shared disk asset or on a RAW file system that is visible to all the other RAC instances. This allows any other node to recover for a failed RAC node in the event of instance failure.
There are basically two types of failure in a RAC environment: instance and media. Instance failure involves the loss of one or more RAC instances, whether due to node failure or connectivity failure. Media failure involves the loss of one or more of the disk assets used to store the database files themselves.

  1. All nodes available.
  2. One or more RAC instances fail.
  3. Node failure is detected by any one of the remaining instances.
  4. Global Resource Directory(GRD) is reconfigured and distributed among surviving nodes.
  5. The instance that first detected the failed instance reads the failed instance's redo logs to determine the redo that needs to be recovered.
     This task is done by the SMON process of the instance that detected the failure.
  6. Until this time database activity is frozen. SMON issues recovery requests for all the blocks that are needed for recovery.
     Once all those blocks are available, the other blocks, which are not needed for recovery, become available for normal processing.
  7. Oracle performs a roll forward operation against the blocks that were modified by the failed instance but were not written to disk,
     using the transactions recorded in the redo logs.
  8. Once the redo logs are applied, uncommitted transactions are rolled back using the undo tablespace.
  9. The database on the RAC is now fully available.

Or

INSTANCE RECOVERY IN RAC DATABASE

I will discuss how instance recovery takes place in 11g R2 RAC. Instance recovery aims at
– writing all committed changes to the datafiles
– undoing all the uncommitted changes from the datafiles
– Incrementing the checkpoint no. to the SCN till which changes have been written to datafiles.
In a single instance database, before the instance crashes,
– some committed changes are in the redo log files but have not been written to the datafiles
– some uncommitted changes have made their way to datafiles
– some uncommitted changes are in the redo log buffer
After  the instance crashes in a single instance database
– all uncommitted changes in the redo log buffer are wiped out
– Online redo log files are read to identify the blocks that need to be recovered
– Identified blocks are read from the datafiles
– During roll forward phase, all the changes (committed/uncommitted) in redo log files are applied to them
– During rollback phase, all uncommitted changes are rolled back after reading undo from undo tablespace.
– The checkpoint # is incremented in the control file / data file headers
In a RAC database there can be two scenarios :
– Only one instance crashes
– Multiple instances crash
We will discuss these cases one by one.
Single instance crash in RAC database
In this case the scenario is quite similar to an instance crash in a single instance database, but there is a slight difference as well.
Let us consider a 3-node setup and a data block B1 with one column and 4 records in it. The column contains the values 100, 200, 300 and 400 in the 4 records. Initially the block is on disk. The following chart shows the update operations on the block on the various nodes and the corresponding state of the block (XCUR, PI or CR) on each node:
SCN# 1 : Node1 updates 100 -> 101
  Node1 (XCUR): 101, 200, 300, 400
  Node2       : -
  Node3       : -
  Disk        : 100, 200, 300, 400

SCN# 2 : Node2 updates 200 -> 201
  Node1 (PI)  : 101, 200, 300, 400
  Node2 (XCUR): 101, 201, 300, 400
  Node3       : -
  Disk        : 100, 200, 300, 400

SCN# 3 : Node3 updates 300 -> 301
  Node1 (PI)  : 101, 200, 300, 400
  Node2 (PI)  : 101, 201, 300, 400
  Node3 (XCUR): 101, 201, 301, 400
  Disk        : 100, 200, 300, 400

SCN# 4 : CRASH (Node2)
  Block states remain as at SCN# 3.
It is assumed that no incremental checkpointing has taken place on any of the nodes in the meanwhile.
Before crash status of block on various nodes is as follows:
– PI at SCN# 2 on Node1
– PI at SCN# 3 on Node2
– XCUR on Node3

Redo logs at various nodes are
Node1 : B1: 100 -> 101, SCN# 1
Node2 : B1:200 -> 201, SCN# 2
Node3 : B1:300 -> 301, SCN# 3
After the crash,
– Redo logs of crashed node (Node2) is analyzed and it is identified that block B1 needs to be recovered.
– It is also identified that role of the block is global as its different versions are available in Node1 and Node3
– It is identified that there is a PI on node1 whose SCN# (2) is earlier than the SCN# of crash (4)
– Changes from redo logs of Node2 are applied to the PI on Node1 and the block is written to disk
– Checkpoint # of node1 is incremented.
– a BWR is placed in the redo log of Node1 to indicate that the block has been written to disk and need not be recovered in case Node1 crashes
Here it can be readily seen that there are certain differences from instance recovery in a single instance database:
The role of the block is checked.
If the role is local, then the block is read from disk and the changes from the redo logs of Node2 are applied, i.e. just like in a single instance database.
If the role is global, it is checked whether a PI of the block at an SCN# earlier than the SCN# of the crash is available.
If a PI is available, then the changes in the redo logs of Node2 are applied to the PI instead of reading the block from disk.
If a PI is not available (it has been flushed to disk due to incremental checkpointing, either on the owner node of the PI or on any of the nodes at an SCN# later than the PI holder), the block is read from disk and the changes from the redo logs of Node2 are applied, just as it used to happen in OPS.
Hence, it can be inferred that a PI, if available, speeds up instance recovery, as the need to read the block from disk is eliminated. If a PI is not available, the block is read from disk just like in OPS.
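As a hedged illustration of how these buffer states can be observed (the object name below is only the earlier example table; substitute whatever table you are interested in), GV$BH exposes the xcur/pi/cr state of each cached copy across the RAC instances:

SQL> select inst_id, status, count(*)
  2  from gv$bh
  3  where objd = (select data_object_id from dba_objects
  4                where owner = 'SCOTT' and object_name = 'FLASH_TEST_TABLE')
  5  group by inst_id, status;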
Multiple instance crash in RAC database
Let us consider a 4-node setup and a data block B1 with one column and 4 records in it. The column contains the values 100, 200, 300 and 400 in the 4 records. Initially the block is on disk. It can be represented as follows (the on-disk values from SCN# 4 onwards reflect the local checkpoint described below):

SCN# 1 : Node1 updates 100 -> 101
  Node1 (XCUR): 101, 200, 300, 400
  Disk        : 100, 200, 300, 400

SCN# 2 : Node2 updates 200 -> 201
  Node1 (PI)  : 101, 200, 300, 400
  Node2 (XCUR): 101, 201, 300, 400
  Disk        : 100, 200, 300, 400

SCN# 3 : Node3 updates 300 -> 301
  Node1 (PI)  : 101, 200, 300, 400
  Node2 (PI)  : 101, 201, 300, 400
  Node3 (XCUR): 101, 201, 301, 400
  Disk        : 100, 200, 300, 400

SCN# 4 : CKPT (local checkpoint on Node2)
  Node1 (CR)  : 101, 200, 300, 400
  Node2 (CR)  : 101, 201, 300, 400
  Node3 (XCUR): 101, 201, 301, 400
  Disk        : 101, 201, 300, 400

SCN# 5 : Node4 updates 400 -> 401
  Node1 (CR)  : 101, 200, 300, 400
  Node2 (CR)  : 101, 201, 300, 400
  Node3 (PI)  : 101, 201, 301, 400
  Node4 (XCUR): 101, 201, 301, 401
  Disk        : 101, 201, 300, 400

SCN# 6 : Node1 updates 401 -> 402
  Node1 (CR)  : 101, 200, 300, 400   and (XCUR): 101, 201, 301, 402
  Node2 (CR)  : 101, 201, 300, 400
  Node3 (PI)  : 101, 201, 301, 400
  Node4 (PI)  : 101, 201, 301, 401
  Disk        : 101, 201, 300, 400

SCN# 7 : CRASH (Node2 and Node3)
  Node1 (CR)  : 101, 200, 300, 400   and (XCUR): 101, 201, 301, 402
  Node4 (PI)  : 101, 201, 301, 401
  Disk        : 101, 201, 300, 400
Explanation:
SCN#1 – Node1 reads the block from disk and updates 100 to 101 in  record. It holds the block in XCUR mode
SCN#2 – Node2  requests the same block for update. Node1 keeps the PI and Node2 holds the block in XCUR mode
SCN#3 – Node3  requests the same block for update. Node2 keeps the PI and Node3 holds the block in XCUR mode . Now we have two PIs
– On Node1 with SCN# 2
– On Node2 with SCN# 3
SCN# 4 – Local checkpointing takes place on Node2. PI on this node has SCN# 3.
It is checked if any of the other nodes has a PI at an earlier SCN# than this. Node1 has PI at SCN# 2.
Changes in the redo log of Node2 are applied to its PI and the block is flushed to disk.
A BWR is placed in the redo log of Node2 to indicate that the block has been written to disk and need not be recovered in case Node2 crashes.
The PI at Node2 is discarded, i.e. its state changes to CR, which cannot be used to serve remote nodes.
The PI at Node1 is discarded, i.e. its state changes to CR, which cannot be used to serve remote nodes.
A BWR is placed in the redo log of Node1 to indicate that the block has been written to disk and need not be recovered in case Node1 crashes.
Now on disk version of block contains changes of both Node1 and Node2.
SCN# 5 – Node4  requests the same block for update. Node3 keeps the PI and Node4 holds the block in XCUR mode .Node1 and Node2 have the CR’s.
SCN# 6 – Node1 again requests the same block for update. Node4 keeps the PI and Node1 holds the block in XCUR mode. Now Node1 has both the same block in CR and XCUR mode. Node3 has PI at SCN# 5.
SCN# 7 – Node2 and Node3 crash.
It is assumed that no incremental checkpointing has taken place on any of the nodes in the meanwhile.
Before crash status of block on various nodes is as follows:
– CR at SCN# 2 on Node1, XCUR on Node1
– CR at SCN# 3 on Node2
– PI  at SCN# 5 on Node3
– PI at SCN# 6 on Node4
Redo logs at various nodes are
Node1 : B1: 100 -> 101, SCN# 1, BWR for B1 , B1:401->402 at SCN#6
Node2 : B1:200 -> 201, SCN# 2, BWR for B1
Node3 : B1:300 -> 301, SCN# 3
Node4 : B1:400->401 at SCN# 5
After the crash,
– Redo logs of crashed node (Node2) are analyzed and it is identified that block B1 has been flushed to disk as of SCN# 4 and need not be recovered as no changes have been made to it from Node2.
– No Redo log entry from Node2  needs to be applied
– Redo logs of crashed node (Node3) are analyzed and it is identified that block B1 needs to be recovered
– It is also identified that the role of the block is global, as different versions of it were/are available on Node1 (XCUR), Node2 (crashed) and Node4 (PI)
– Changes from Node3 have to be applied . It is checked if any PI is available which is earlier than the SCN# of the change on node3 which needs to be applied i.e. SCN# 3.
– It is identified that no PI is available  whose SCN is earlier  than the  SCN# (3). Hence, block is read from the disk.
– Redo log entry which needs to be applied is : B1:300 -> 301, SCN# 3
–  Redo is applied to the block read from the disk and the block is written to disk so that on disk version contains changes made by Node3 also.
– Checkpoint # of node2 and Node3 are incremented.
After instance recovery :
Node1 : holds CR and XCUR
Node2 :
Node3 :
Node4 : holds PI
On disk version  of the block is:
101
201
301
400



4) What happens when we put the database in begin backup mode?
5) How does SQL execute and how is the connection with the database made?


How SQL Statements are Processed in the Oracle Architecture


SQL Statements are processed differently depending on whether the statement is a query, data manipulation language (DML) to update, insert, or delete a row, or data definition language (DDL) to write information to the data dictionary.

Connect to an instance using:
  • User process
  • Server process
The Oracle server components that are used depend on the type of SQL statement:
  • Queries return rows
  • DML statements log changes
  • Commit ensures transaction recovery
Some Oracle server components do not participate in SQL statement processing.

Processing a query:
Parse:
o       Search for identical statement in the Shared SQL Area.
o       Check syntax, object names, and privileges.
o       Lock objects used during parse.
o       Create and store execution plan.
Bind: Obtains values for variables.
Execute: Process statement.
Fetch: Return rows to user process.
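A minimal SQL*Plus sketch of these stages using a bind variable (this assumes the standard SCOTT.EMP demo table is present; any query of your own works the same way):

SQL> variable dno number
SQL> exec :dno := 10
SQL> select empno, ename from scott.emp where deptno = :dno;

The statement is parsed once, the value 10 is bound to :dno, the statement is executed, and the matching rows are fetched back to the user process.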

Processing a DML statement:

Parse: Same as the parse phase used for processing a query.
Bind: Same as the bind phase used for processing a query.
Execute:
o       If the data and undo blocks are not already in the Database Buffer Cache, the server process reads them from the datafiles into the Database Buffer Cache.
o       The server process places locks on the rows that are to be modified. The undo block is used to store the before image of the data, so that the DML statements can be rolled back if necessary.
o       The data blocks record the new values of the data.
o       The server process records the before image to the undo block and updates the data block.  Both of these changes are made in the Database Buffer Cache.  Any changed blocks in the Database Buffer Cache are marked as dirty buffers.  That is, buffers that are not the same as the corresponding blocks on the disk.
o       The processing of a DELETE or INSERT command uses similar steps.  The before image for a DELETE contains the column values in the deleted row, and the before image of an INSERT contains the row location information.

Processing a DDL statement:

The execution of DDL (Data Definition Language) statements differs from the execution of DML (Data Manipulation Language) statements and queries, because the success of a DDL statement requires write access to the data dictionary.
For these statements, parsing actually includes parsing, data dictionary lookup, and execution. Transaction management, session management, and system management SQL statements are processed using the parse and execute stages. To re-execute them, simply perform another execute.


Stage 2: Parse the Statement
During parsing, the SQL statement is passed from the user process to Oracle and a parsed representation of the SQL statement is loaded into a shared SQL area. Many errors can be caught during this phase of statement processing. Parsing is the process of
  • translating a SQL statement, verifying it to be a valid statement
  • performing data dictionary lookups to check table and column definitions
  • acquiring parse locks on required objects so that their definitions do not change during the statement’s parsing
  • checking privileges to access referenced schema objects
  • determining the optimal execution plan for the statement
  • loading it into a shared SQL area
  • for distributed statements, routing all or part of the statement to remote nodes that contain referenced data
A SQL statement is parsed only if a shared SQL area for an identical SQL statement does not exist in the shared pool. In this case, a new shared SQL area is allocated and the statement is parsed. For more information about shared SQL, refer to Chapter 10, “Managing SQL and Shared PL/SQL Areas“.
The parse phase includes processing requirements that need to be done only once no matter how many times the statement is executed. Oracle translates each SQL statement only once, re-executing that parsed statement during subsequent references to the statement.
Although the parsing of a SQL statement validates that statement, parsing only identifies errors that can be found before statement execution. Thus, certain errors cannot be caught by parsing. For example, errors in data conversion or errors in data (such as an attempt to enter duplicate values in a primary key) and deadlocks are all errors or situations that can only be encountered and reported during the execution phase.
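A hedged way to see shared SQL reuse in action (the filter text below matches the earlier bind-variable example; adapt it to whatever statement you ran): V$SQL shows how many times a cursor was parsed versus executed.

SQL> select sql_id, parse_calls, executions, sql_text
  2  from v$sql
  3  where sql_text like 'select empno, ename from scott.emp%';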
Module 1 – Oracle Architecture
Objectives
These notes introduce the Oracle server architecture.  The architecture includes physical components, memory components, processes, and logical structures.
Primary Architecture Components
The figure shown above details the Oracle architecture.
Oracle server:  An Oracle server includes an Oracle Instance and an Oracle database.
An Oracle database includes several different types of files:  datafiles, control files, redo log files and archive redo log files.  The Oracle server also accesses parameter files and password files.
This set of files has several purposes.
One is to enable system users to process SQL statements.
Another is to improve system performance.
Still another is to ensure the database can be recovered if there is a software/hardware failure.
The database server must manage large amounts of data in a multi-user environment.
The server must manage concurrent access to the same data.
The server must deliver high performance.  This generally means fast response times.
Oracle instance:  An Oracle Instance consists of two different sets of components:
The first component set is the set of background processes (PMON, SMON, RECO, DBW0, LGWR, CKPT, D000 and others).
These will be covered later in detail – each background process is a computer program.
These processes perform input/output and monitor other Oracle processes to provide good performance and database reliability.
The second component set includes the memory structures that comprise the Oracle instance.
When an instance starts up, a memory structure called the System Global Area (SGA) is allocated.
At this point the background processes also start.
An Oracle Instance provides access to one and only one Oracle database.
Oracle database: An Oracle database consists of files.
Sometimes these are referred to as operating system files, but they are actually database files that store the database information that a firm or organization needs in order to operate.
The redo log files are used to recover the database in the event of application program failures, instance failures and other minor failures.
The archived redo log files are used to recover the database if a disk fails.
Other files not shown in the figure include:
The required parameter file that is used to specify parameters for configuring an Oracle instance when it starts up.
The optional password file authenticates special users of the database – these are termed privileged users and include database administrators.
Alert and Trace Log Files – these files store information about errors and actions taken that affect the configuration of the database.
User and server processes:  The processes shown in the figure are called user and server processes.  These processes are used to manage the execution of SQL statements.
A Shared Server Process can share memory and variable processing for multiple user processes.
A Dedicated Server Process manages memory and variables for a single user process.
Connecting to an Oracle Instance – Creating a Session
System users can connect to an Oracle database through SQLPlus or through an application program like the Internet Developer Suite (the program becomes the system user).  This connection enables users to execute SQL statements.
The act of connecting creates a communication pathway between a user process and an Oracle Server.  As is shown in the figure above, the User Process communicates with the Oracle Server through a Server Process.  The User Process executes on the client computer.  The Server Process executes on the server computer, and actually executes SQL statements submitted by the system user.
The figure shows a one-to-one correspondence between the User and Server Processes.  This is called a Dedicated Server connection.  An alternative configuration is to use a Shared Server where more than one User Process shares a Server Process.
Sessions:  When a user connects to an Oracle server, this is termed a session.  The session starts when the Oracle server validates the user for connection.  The session ends when the user logs out (disconnects) or if the connection terminates abnormally (network failure or client computer failure).
A user can typically have more than one concurrent session, e.g., the user may connect using SQLPlus and also connect using Internet Developer Suite tools at the same time.  The limit of concurrent session connections is controlled by the DBA.
If a system user attempts to connect and the Oracle Server is not running, the system user receives the Oracle Not Available error message.
Physical Structure – Database Files
As was noted above, an Oracle database consists of physical files.  The database itself has:
Datafiles – these contain the organization’s actual data.
Redo log files – these contain a record of changes made to the database, and enable recovery when failures occur.
Control files – these are used to synchronize all database activities and are covered in more detail in a later module.
Other key files as noted above include:
Parameter file – there are two types of parameter files.
The init.ora file (also called the PFILE) is a static parameter file.  It contains parameters that specify how the database instance is to start up.  For example, some parameters will specify how to allocate memory to the various parts of the system global area.
The spfile.ora is a dynamic parameter file.  It also stores parameters to specify how to startup a database; however, its parameters can be modified while the database is running.
Password file – specifies which *special* users are authenticated to startup/shut down an Oracle Instance.
Archived redo log files – these are copies of the redo log files and are necessary for recovery in an online, transaction-processing environment in the event of a disk failure.
Memory Structure
The memory structures include two areas of memory:
System Global Area (SGA) – this is allocated when an Oracle Instance starts up.
Program Global Area (PGA) – this is allocated when a Server Process starts up.
System Global Area
The SGA is an area in memory that stores information shared by all database processes and by all users of the database (sometimes it is called the Shared Global Area).
This information includes both organizational data and control information used by the Oracle Server.
The SGA is allocated in memory and virtual memory.
The size of the SGA can be established by a DBA by assigning a value to the parameter SGA_MAX_SIZE in the parameter file—this is an optional parameter.
The SGA is allocated when an Oracle instance (database) is started up based on values specified in the initialization parameter file (either PFILE or SPFILE).
The SGA has the following mandatory memory structures:
Shared Pool – includes two components:
Library Cache
Data Dictionary Cache
Database Buffer Cache
Redo Log Buffer
Other structures (for example, lock and latch management, statistical data)
Additional optional memory structures in the SGA include:
Large Pool
Java Pool
Streams Pool
The SHOW SGA SQL command will show you the SGA memory allocations.  This is a recent clip of the SGA for the Oracle database at SIUE.  In order to execute SHOW SGA you must be connected with the special privilege SYSDBA (which is only available to user accounts that are members of the DBA Linux group).
SQL> connect / as sysdba
Connected.
SQL> show sga
Total System Global Area 1610612736 bytes
Fixed Size                  2084296 bytes
Variable Size             385876536 bytes
Database Buffers         1207959552 bytes
Redo Buffers               14692352 bytes
Oracle 8i and earlier versions of the Oracle Server used a Static SGA.  This meant that if modifications to memory management were required, the database had to be shutdown, modifications were made to the init.ora parameter file, and then the database had to be restarted.
Oracle 9i and 10g use a Dynamic SGA.   Memory configurations for the system global area can be made without shutting down the database instance.  The advantage is obvious.  This allows the DBA to resize the Database Buffer Cache and Shared Pool dynamically.
Several initialization parameters are set that affect the amount of random access memory dedicated to the SGA of an Oracle Instance.  These are:
SGA_MAX_SIZE:  This optional parameter is used to set a limit on the amount of virtual memory allocated to the SGA – a typical setting might be 1 GB; however, if the value for SGA_MAX_SIZE in the initialization parameter file or server parameter file is less than the sum the memory allocated for all components, either explicitly in the parameter file or by default, at the time the instance is initialized, then the database ignores the setting for SGA_MAX_SIZE.
DB_CACHE_SIZE:  This optional parameter is used to tune the amount memory allocated to the Database Buffer Cache in standard database blocks.  Block sizes vary among operating systems.  The DBORCL database uses 8 KB blocks.  The total blocks in the cache defaults to 48 MB on LINUX/UNIX and 52 MB on Windows operating systems.
LOG_BUFFER:   This optional parameter specifies the number of bytes allocated for the Redo Log Buffer.
SHARED_POOL_SIZE:  This optional parameter specifies the number of bytes of memory allocated to shared SQL and PL/SQL.  The default is 16 MB.  If the operating system is based on a 64 bit configuration, then the default size is 64 MB.
LARGE_POOL_SIZE:  This is an optional memory object – the size of the Large Pool defaults to zero.  If the init.ora parameter PARALLEL_AUTOMATIC_TUNING is set to TRUE, then the default size is automatically calculated.
JAVA_POOL_SIZE:   This is another optional memory object.  The default is 24 MB of memory.
The size of the SGA cannot exceed the parameter SGA_MAX_SIZE minus the combination of the size of the additional parameters, DB_CACHE_SIZE, LOG_BUFFER, SHARED_POOL_SIZE, LARGE_POOL_SIZE, and JAVA_POOL_SIZE.
Memory is allocated to the SGA as contiguous virtual memory in units termed granules.  Granule size depends on the estimated total size of the SGA, which as was noted above, depends on the SGA_MAX_SIZE parameter.  Granules are sized as follows:
If the SGA is less than 128 MB in total, each granule is 4 MB.
If the SGA is greater than 128 MB in total, each granule is 16 MB.
Granules are assigned to the Database Buffer Cache and Shared Pool, and these two memory components can dynamically grow and shrink.  Using contiguous memory improves system performance.  The actual number of granules assigned to one of these memory components can be determined by querying the database view named V$BUFFER_POOL.
Granules are allocated when the Oracle server starts a database instance in order to provide memory addressing space to meet the SGA_MAX_SIZE parameter.  The minimum is 3 granules:  one each for the fixed SGA, Database Buffer Cache, and Shared Pool.  In practice, you’ll find the SGA is allocated much more memory than this.  The SELECT statement shown below shows a current_size of 1,152 granules.
SELECT name, block_size, current_size, prev_size, prev_buffers
FROM v$buffer_pool;
NAME                 BLOCK_SIZE CURRENT_SIZE  PREV_SIZE PREV_BUFFERS
-------------------- ---------- ------------ ---------- ------------
DEFAULT                    8192         1152          0            0
For additional information on the dynamic SGA sizing, enroll in Oracle’s Oracle10g Database Performance Tuning course.
Automatic Shared Memory Management
Prior to Oracle 10G, a DBA had to manually specify SGA Component sizes through the initialization parameters, such as SHARED_POOL_SIZE, DB_CACHE_SIZE, JAVA_POOL_SIZE, and LARGE_POOL_SIZE parameters.
Automatic Shared Memory Management enables a DBA to specify the total SGA memory available through the SGA_TARGET initialization parameter.  The Oracle Database automatically distributes this memory among various subcomponents to ensure most effective memory utilization.
The DBORCL database SGA_TARGET is set in the initDBORCL.ora file:
sga_target=1610612736
With automatic SGA memory management, the different SGA components are flexibly sized to adapt  to the SGA available.
Setting a single parameter simplifies the administration task – the DBA only specifies the amount of SGA memory available to an instance – the DBA can forget about the sizes of individual components. No out of memory errors are generated unless the system has actually run out of memory.  No manual tuning effort is needed.
The SGA_TARGET initialization parameter reflects the total size of the SGA and includes memory for the following components:
Fixed SGA and other internal allocations needed by the Oracle Database instance
The log buffer
The shared pool
The Java pool
The buffer cache
The keep and recycle buffer caches (if specified)
Nonstandard block size buffer caches (if specified)
The Streams Pool
If SGA_TARGET is set to a value greater than SGA_MAX_SIZE at startup, then the SGA_MAX_SIZE value is bumped up to accommodate SGA_TARGET.  After startup, SGA_TARGET can be decreased or increased dynamically. However, it cannot exceed the value of SGA_MAX_SIZE that was computed at startup.
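For example (a minimal sketch; the size is arbitrary and assumes SGA_MAX_SIZE is large enough), SGA_TARGET can be resized dynamically and the automatically tuned components inspected through V$SGA_DYNAMIC_COMPONENTS:

SQL> ALTER SYSTEM SET sga_target = 1200M;
SQL> SELECT component, current_size FROM v$sga_dynamic_components;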
When you set a value for SGA_TARGET, Oracle Database 10g automatically sizes the most commonly configured components, including:
The shared pool (for SQL and PL/SQL execution)
The Java pool (for Java execution state)
The large pool (for large allocations such as RMAN backup buffers)
The buffer cache
There are a few SGA components whose sizes are not automatically adjusted. The DBA must specify the sizes of these components explicitly, if they are needed by an application. Such components are:
Keep/Recycle buffer caches (controlled by DB_KEEP_CACHE_SIZE and DB_RECYCLE_CACHE_SIZE)
Additional buffer caches for non-standard block sizes (controlled by DB_nK_CACHE_SIZE, n = {2, 4, 8, 16, 32})
Streams Pool (controlled by the new parameter STREAMS_POOL_SIZE)
Shared Pool
The Shared Pool is a memory structure that is shared by all system users.  It consists of both fixed and variable structures.  The variable component grows and shrinks depending on the demands placed on memory size by system users and application programs.
Memory can be allocated to the Shared Pool by the parameter SHARED_POOL_SIZE in the parameter file.  You can alter the size of the shared pool dynamically with the ALTER SYSTEM SET command; an example command is shown below.  Keep in mind that the total memory allocated to the SGA is set by the SGA_TARGET parameter (and may also be limited by SGA_MAX_SIZE if it is set), and since the Shared Pool is part of the SGA, you cannot exceed the maximum size of the SGA.
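For example (the 128M value is arbitrary and must fit within the SGA limits described above):

SQL> ALTER SYSTEM SET shared_pool_size = 128M;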
The Shared Pool stores the most recently executed SQL statements and used data definitions.  This is because some system users and application programs will tend to execute the same SQL statements often.  Saving this information in memory can improve system performance.
The Shared Pool includes the Library Cache and Data Dictionary Cache.
Library Cache
Memory is allocated to the Library Cache whenever an SQL statement is parsed or a program unit is called.  This enables storage of the most recently used SQL and PL/SQL statements.
If the Library Cache is too small, the Library Cache must purge statement definitions in order to have space to load new SQL and PL/SQL statements.  Actual management of this memory structure is through a Least-Recently-Used (LRU) algorithm.  This means that the SQL and PL/SQL statements that are oldest and least recently used are purged when more storage space is needed.
The Library Cache is composed of two memory subcomponents:
Shared SQL:  This stores/shares the execution plan and parse tree for SQL statements.  If a system user executes an identical statement, then the statement does not have to be parsed again in order to execute the statement.
Shared PL/SQL Procedures and Packages:  This stores/shares the most recently used PL/SQL statements such as functions, packages, and triggers.
Data Dictionary Cache
The Data Dictionary Cache is a memory structure that caches data dictionary information that has been recently used.  This includes user account information, datafile names, table descriptions, user privileges, and other information.
The database server manages the size of the Data Dictionary Cache internally and the size depends on the size of the Shared Pool in which the Data Dictionary Cache resides.  If the size is too small, then the data dictionary tables that reside on disk must be queried often for information and this will slow down performance.
Buffer Caches
A number of buffer caches are maintained in memory in order to improve system response time.
Database Buffer Cache
The Database Buffer Cache is a fairly large memory object that stores the actual data blocks that are retrieved from datafiles by system queries and other data manipulation language commands.
A query causes a Server Process to first look in the Database Buffer Cache to determine if the requested information happens to already be located in memory – thus the information would not need to be retrieved from disk and this would speed up performance.  If the information is not in the Database Buffer Cache, the Server Process retrieves the information from disk and stores it to the cache.
Keep in mind that information read from disk is read a block at a time, not a row at a time, because a database block is the smallest addressable storage space on disk.
Database blocks are kept in the Database Buffer Cache according to a Least Recently Used (LRU) algorithm and are aged out of memory if a buffer cache block is not used in order to provide space for the insertion of newly needed database blocks.
The buffers in the cache are organized in two lists:
the write list and,
the least recently used (LRU) list.
The write list holds dirty buffers – these are buffers that hold that data that has been modified, but the blocks have not been written back to disk.
The LRU list holds free buffers, pinned buffers, and dirty buffers that have not yet been moved to the write list.  Free buffers do not contain any useful data and are available for use.  Pinned buffers are currently being accessed.
When an Oracle process accesses a buffer, the process moves the buffer to the most recently used (MRU) end of the LRU list – this causes dirty buffers to age toward the LRU end of the LRU list.
When an Oracle user process needs a data row, it searches for the data in the database buffer cache because memory can be searched more quickly than hard disk can be accessed.  If the data row is already in the cache (a cache hit), the process reads the data from memory; otherwise a cache miss occurs and data must be read from hard disk into the database buffer cache.
Before reading a data block into the cache, the process must first find a free buffer. The process searches the LRU list, starting at the LRU end of the list.  The search continues until a free buffer is found or until the search reaches the threshold limit of buffers.
Each time the user process finds a dirty buffer as it searches the LRU, that buffer is moved to the write list and the search for a free buffer continues.
When the process finds a free buffer, it reads the data block from disk into the buffer and moves the buffer to the MRU end of the LRU list.
If an Oracle user process searches the threshold limit of buffers without finding a free buffer, the process stops searching the LRU list and signals the DBW0 background process to write some of the dirty buffers to disk.  This frees up some buffers.
The block size for a database is set when a database is created and is determined by the init.ora parameter file parameter named DB_BLOCK_SIZE.  Typical block sizes are 2KB, 4KB, 8KB, 16KB, and 32KB.  The size of blocks in the Database Buffer Cache matches the block size for the database.  The DBORCL database uses an 8KB block size.
Because tablespaces that store Oracle tables can use different (non-standard) block sizes, there can be more than one Database Buffer Cache allocated to match block sizes in the cache with the block sizes in the non-standard tablespaces.
The size of the Database Buffer Caches can be controlled by the parameters DB_CACHE_SIZE and DB_nK_CACHE_SIZE to dynamically change the memory allocated to the caches without restarting the Oracle instance.
You can dynamically change the size of the Database Buffer Cache with the ALTER SYSTEM command like the one shown here:
ALTER SYSTEM SET DB_CACHE_SIZE = 96M;
You can have the Oracle Server gather statistics about the Database Buffer Cache to help you size it to achieve an optimal workload for the memory allocation.  This information is displayed from the V$DB_CACHE_ADVICE view.   In order for statistics to be gathered, you can dynamically alter the system by using the ALTER SYSTEM SET DB_CACHE_ADVICE (OFF, ON, READY) command.  However, gathering statistics on system performance always incurs some overhead that will slow down system performance.
SQL> ALTER SYSTEM SET db_cache_advice = ON;
System altered.
SQL> DESC V$DB_cache_advice;
Name                                      Null?    Type
----------------------------------------- -------- -------------
ID                                                 NUMBER
NAME                                               VARCHAR2(20)
BLOCK_SIZE                                         NUMBER
ADVICE_STATUS                                      VARCHAR2(3)
SIZE_FOR_ESTIMATE                                  NUMBER
SIZE_FACTOR                                        NUMBER
BUFFERS_FOR_ESTIMATE                               NUMBER
ESTD_PHYSICAL_READ_FACTOR                          NUMBER
ESTD_PHYSICAL_READS                                NUMBER
ESTD_PHYSICAL_READ_TIME                            NUMBER
ESTD_PCT_OF_DB_TIME_FOR_READS                      NUMBER
ESTD_CLUSTER_READS                                 NUMBER
ESTD_CLUSTER_READ_TIME                             NUMBER
SQL> SELECT name, block_size, advice_status FROM v$db_cache_advice;
NAME                 BLOCK_SIZE ADV
-------------------- ---------- ---
DEFAULT                    8192 ON
<more rows will display>
21 rows selected.
SQL> ALTER SYSTEM SET db_cache_advice = OFF;
System altered.
KEEP Buffer Pool
This pool retains blocks in memory (data from tables) that are likely to be reused throughout daily processing.  An example might be a table containing user names and passwords or a validation table of some type.
The DB_KEEP_CACHE_SIZE parameter sizes the KEEP Buffer Pool.
RECYCLE Buffer Pool
This pool is used to store table data that is unlikely to be reused throughout daily processing – thus the data is quickly recycled.
The DB_RECYCLE_CACHE_SIZE parameter sizes the RECYCLE Buffer Pool.
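A short sketch of sizing these pools and assigning a table to the KEEP pool (the sizes and the table name emp_lookup are illustrative only):

SQL> ALTER SYSTEM SET db_keep_cache_size = 32M;
SQL> ALTER SYSTEM SET db_recycle_cache_size = 16M;
SQL> ALTER TABLE emp_lookup STORAGE (BUFFER_POOL KEEP);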
Redo Log Buffer
The Redo Log Buffer memory object stores images of all changes made to database blocks.  As you know, database blocks typically store several table rows of organizational data.  This means that if a single column value from one row in a block is changed, the image is stored.  Changes include INSERT, UPDATE, DELETE, CREATE, ALTER, or DROP.
Think of the Redo Log Buffer as a circular buffer that is reused over and over.  As the buffer fills up, copies of the images are stored to the Redo Log Files that are covered in more detail in a later module.
Large Pool
The Large Pool is an optional memory structure that primarily relieves the memory burden placed on the Shared Pool.  The Large Pool is used for the following tasks if it is allocated:
Allocating space for session memory requirements from the User Global Area (part of the Server Process) where a Shared Server is in use.
Transactions that interact with more than one database, e.g., a distributed database scenario.
Backup and restore operations by the Recovery Manager (RMAN) process.
RMAN uses this only if the BACKUP_DISK_IO = n and BACKUP_TAPE_IO_SLAVE = TRUE parameters are set.
If the Large Pool is too small, memory allocation for backup will fail and memory will be allocated from the Shared Pool.
Parallel execution message buffers for parallel server operations.  The PARALLEL_AUTOMATIC_TUNING = TRUE parameter must be set.
The Large Pool size is set with the LARGE_POOL_SIZE parameter – this is not a dynamic parameter.  It does not use an LRU list to manage memory.
Java Pool
The Java Pool is an optional memory object, but is required if the database has Oracle Java installed and in use for Oracle JVM (Java Virtual Machine).  The size is set with the JAVA_POOL_SIZE parameter that defaults to 24MB.
The Java Pool is used for memory allocation to parse Java commands.
Storing Java code and data in the Java Pool is analogous to SQL and PL/SQL code cached in the Shared Pool.
Streams Pool
This cache is new to Oracle 10g.  It is sized with the parameter STREAMS_POOL_SIZE.
This pool stores data and control structures to support the Oracle Streams feature of Oracle Enterprise Edition.  Oracle Streams manages sharing of data and events in a distributed environment.
If STREAMS_POOL_SIZE is not set or is zero, memory for Oracle Streams operations is allocated from up to 10% of the Shared Pool memory.
Program Global Area
The Program Global Area is also termed the Process Global Area (PGA) and is a part of memory allocated that is outside of the Oracle Instance.  The PGA stores data and control information for a single Server Process or a single Background Process.  It is allocated when a process is created and the memory is scavenged by the operating system when the process terminates.  This is NOT a shared part of memory – one PGA to each process only.
The content of the PGA varies, but generally includes the following:
Private SQL Area:  Data for binding variables and runtime memory allocations.  A user session issuing SQL statements has a Private SQL Area that may be associated with a Shared SQL Area if the same SQL statement is being executed by more than one system user.  This often happens in OLTP environments where many users are executing and using the same application program.
Dedicated Server environment – the Private SQL Area is located in the Program Global Area.
Shared Server environment – the Private SQL Area is located in the System Global Area.
Session Memory:  Memory that holds session variables and other session information.
SQL Work Area:  Memory allocated for sort, hash-join, bitmap merge, and bitmap create types of operations.
Oracle 9i and later versions enable automatic sizing of the SQL Work Areas by setting the WORKAREA_SIZE_POLICY = AUTO parameter (this is the default!) and PGA_AGGREGATE_TARGET = n (where n is some amount of memory established by the DBA).  However, the DBA can let Oracle 10g determine the appropriate amount of memory.
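For example (a sketch; the 512 MB figure is arbitrary), the DBA can set the PGA target and then monitor PGA usage through V$PGASTAT:

SQL> ALTER SYSTEM SET pga_aggregate_target = 512M;
SQL> SELECT name, value FROM v$pgastat WHERE name LIKE '%PGA%';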
Oracle 8i and earlier required the DBA to set the following parameters to control SQL Work Area memory allocations:
SORT_AREA_SIZE.
HASH_AREA_SIZE.
BITMAP_MERGE_AREA_SIZE.
CREATE_BITMAP_AREA_SIZE.
Software Code Area
Software code areas store Oracle executable files running as part of the Oracle instance.
These code areas are static in nature and are located in privileged memory that is separate from other user programs.
The code can be installed sharable when multiple Oracle instances execute on the same server with the same software release level.
Processes
You need to understand three different types of Processes:
User Process:  Starts when a database user requests to connect to an Oracle Server.
Server Process:  Establishes the Connection to an Oracle Instance when a User Process requests connection – makes the connection for the User Process.
Background Processes:  These start when an Oracle Instance is started up.
User Process
In order to use Oracle, you must obviously connect to the database.  This must occur whether you’re using SQLPlus, an Oracle tool such as Designer or Forms, or an application program.
This generates a User Process (a memory object) that generates programmatic calls through your user interface (SQLPlus, Integrated Developer Suite, or application program) that creates a session and causes the generation of a Server Process that is either dedicated or shared.
Server Process
As you have seen, the Server Process is the go-between for a User Process and the Oracle Instance.   In a Dedicated Server environment, there is a single Server Process to serve each User Process.  In a Shared Server environment, a Server Process can serve several User Processes, although with some performance reduction.  Allocation of server process in a dedicated environment versus a shared environment is covered in further detail in the Oracle10g Database Performance Tuning course offered by Oracle Education.
Background Processes
As is shown here, there are both mandatory and optional background processes that are started whenever an Oracle Instance starts up.  These background processes serve all system users.  We will cover the mandatory processes in detail.
Optional Background Process Definition:
ARCn: Archiver – One or more archiver processes copy the online redo log files to archival storage when they are full or a log switch occurs.
CJQ0:  Coordinator Job Queue – This is the coordinator of job queue processes for an instance. It monitors the JOB$ table (the table of jobs in the job queue) and starts job queue processes (Jnnn) as needed to execute jobs.  The Jnnn processes execute job requests created by the DBMS_JOBS package.
Dnnn:  Dispatcher number “nnn”, for example, D000 would be the first dispatcher process – Dispatchers are optional background processes, present only when the shared server configuration is used. Shared server is discussed in your readings on the topic “Configuring Oracle for the Shared Server”.
RECO:  Recoverer – The Recoverer process is used to resolve distributed transactions that are pending due to a network or system failure in a distributed database.  At timed intervals, the local RECO attempts to connect to remote databases and automatically complete the commit or rollback of the local portion of any pending distributed transactions.  For information about this process and how to start it, see your readings on the topic “Managing Distributed Transactions”.
Of these, the ones you’ll use most often are ARCn (archiver) when you automatically archive redo log file information (covered in a later module), and RECO for recovery where the database is distributed on two or more separate physical Oracle servers, perhaps a UNIX machine and an NT machine.
DBWn (also called DBWR in earlier Oracle Versions)
The Database Writer writes modified blocks from the database buffer cache to the datafiles. Although one database writer process (DBW0) is sufficient for most systems, you can configure up to 20 DBWn processes (DBW0 through DBW9 and DBWa through DBWj) in order to improve write performance for a system that modifies data heavily.
The initialization parameter DB_WRITER_PROCESSES specifies the number of DBWn processes.
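As a sketch (the value 4 is arbitrary; DB_WRITER_PROCESSES is a static parameter, so the change requires SCOPE=SPFILE and takes effect at the next instance restart):

SQL> ALTER SYSTEM SET db_writer_processes = 4 SCOPE=SPFILE;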
The purpose of DBWn is to improve system performance by caching writes of database blocks from the Database Buffer Cache back to datafiles.  Blocks that have been modified and that need to be written back to disk are termed “dirty blocks.”  The DBWn also ensures that there are enough free buffers in the Database Buffer Cache to service Server Processes that may be reading data from datafiles into the Database Buffer Cache.  Performance improves because by delaying writing changed database blocks back to disk, a Server Process may find the data that is needed to meet a User Process request already residing in memory!
DBWn writes to the datafiles when certain events occur, such as a checkpoint, when the number of dirty buffers reaches a threshold, or when no free buffers remain.
LGWR
The Log Writer (LGWR) writes contents from the Redo Log Buffer to the Redo Log File that is in use.  These are sequential writes since the Redo Log Files record database modifications based on the actual time that the modification takes place.  LGWR actually writes before the DBWn writes and only confirms that a COMMIT operation has succeeded when the Redo Log Buffer contents are successfully written to disk.  LGWR can also call the DBWn to write contents of the Database Buffer Cache to disk.  The events that trigger LGWR writes (a commit, the buffer becoming one-third full, a three-second timeout, or a DBWn write request) are listed later in this document.
SMON
The System Monitor (SMON) is responsible for instance recovery by applying entries in the online redo log files to the datafiles.  It also performs other activities as outlined below.
If an Oracle Instance fails, all information in memory not written to disk is lost.  SMON is responsible for recovering the instance when the database is started up again.  It does the following:
Rolls forward to recover data that was recorded in a Redo Log File, but that had not yet been recorded to a datafile by DBWn.  SMON reads the Redo Log Files and applies the changes to the data blocks.  This recovers all transactions that were committed because these were written to the Redo Log Files prior to system failure.
Opens the database to allow system users to logon.
Rolls back uncommitted transactions.
SMON also does limited space management.  It combines (coalesces) adjacent areas of free space in the database’s datafiles for tablespaces that are dictionary managed.
It also deallocates temporary segments to create free space in the datafiles.
PMON
The Process Monitor (PMON) is a cleanup type of process that cleans up after failed processes, such as the dropping of a user connection due to a network failure or the abend of a user application program.  Its tasks are described in more detail later in this document.
CKPT
The Checkpoint (CKPT) process writes information to the database control files that identifies the point in time, with regard to the Redo Log Files, where instance recovery is to begin should it be necessary.  This is done at a minimum once every three seconds.
Think of a checkpoint record as a starting point for recovery.  DBWn will have completed writing all buffers from the Database Buffer Cache to disk prior to the checkpoint, thus those records will not require recovery.  This does the following:
Ensures modified data blocks in memory are regularly written to disk – CKPT can call the DBWn process in order to ensure this and does so when writing a checkpoint record.
Reduces Instance Recovery time by minimizing the amount of work needed for recovery since only Redo Log File entries processed since the last checkpoint require recovery.
Causes all committed data to be written to datafiles during database shutdown.
If a Redo Log File fills up and a switch is made to a new Redo Log File (this is covered in more detail in a later module), the CKPT process also writes checkpoint information into the headers of the datafiles.
Checkpoint information written to control files includes the system change number (the SCN is a number stored in the control file and in the headers of the database files that are used to ensure that all files in the system are synchronized), location of which Redo Log File is to be used for recovery, and other information.
CKPT does not write data blocks or redo blocks to disk – it calls DBWn and LGWR as necessary.
ARCn
We cover the Archiver (ARCn) optional background process in more detail because it is almost always used for production systems storing mission critical information.   The ARCn process must be used to recover from loss of a physical disk drive for systems that are “busy” with lots of transactions being completed.
When a Redo Log File fills up, Oracle switches to the next Redo Log File.  The DBA creates several of these and the details of creating them are covered in a later module.  If all Redo Log Files fill up, then Oracle switches back to the first one and uses them in a round-robin fashion by overwriting ones that have already been used – it should be obvious that the information stored on the files, once overwritten, is lost forever.
If the database is in what is termed ARCHIVELOG mode, then as the Redo Log Files fill up, they are individually written to Archived Redo Log Files and LGWR does not overwrite a Redo Log File until archiving has completed.  Thus, committed data is not lost forever and can be recovered in the event of a disk failure.  Only the contents of the SGA will be lost if an Instance fails.
In NOARCHIVELOG mode, the Redo Log Files are overwritten and not archived.  Recovery can only be made to the last full backup of the database files.  All committed transactions after the last full backup are lost, and you can see that this could cost the firm a lot of $$$.
When running in ARCHIVELOG mode, the DBA is responsible to ensure that the Archived Redo Log Files do not consume all available disk space!  Usually after two complete backups are made, any Archived Redo Log Files for prior backups are deleted.
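A sketch of checking and enabling ARCHIVELOG mode (run as SYSDBA on a single-instance database; the database must be mounted, not open, for the ALTER DATABASE ARCHIVELOG step):

SQL> SELECT log_mode FROM v$database;
SQL> ARCHIVE LOG LIST
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT
SQL> ALTER DATABASE ARCHIVELOG;
SQL> ALTER DATABASE OPEN;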
Logical Structure
It is helpful to understand how an Oracle database is organized in terms of a logical structure that is used to organize physical objects.
Tablespace:  An Oracle 10g database must always consist of at least two tablespaces (SYSTEM and SYSAUX), although a typical Oracle database will have multiple tablespaces.
A tablespace is a logical storage facility (a logical container) for storing objects such as tables, indexes, sequences, clusters, and other database objects.
Each tablespace has at least one physical datafile that actually stores the tablespace at the operating system level.  A large tablespace may have more than one datafile allocated for storing objects assigned to that tablespace.
A tablespace belongs to only one database.
Tablespaces can be brought online and taken offline for purposes of backup and management, except for the SYSTEM tablespace that must always be online.
Tablespaces can be in either read-only or read-write status.
Datafile:  Tablespaces are stored in datafiles which are physical disk objects.
A datafile can only store objects for a single tablespace, but a tablespace may have more than one datafile – this happens when a disk drive device fills up and a tablespace needs to be expanded, then it is expanded to a new disk drive.
The DBA can change the size of a datafile to make it smaller or larger (see the sketch after this section).  The file can also grow in size dynamically as the tablespace grows.
Segment:  When logical storage objects are created within a tablespace, for example, an employee table, a segment is allocated to the object.
Obviously a tablespace typically has many segments.
A segment cannot span tablespaces but can span datafiles that belong to a single tablespace.
Extent:  Each object has one segment which is a physical collection of extents.
Extents are simply collections of contiguous disk storage blocks.  A logical storage object such as a table or index always consists of at least one extent – ideally the initial extent allocated to an object will be large enough to store all data that is initially loaded.
As a table or index grows, additional extents are added to the segment.
A DBA can add extents to segments in order to tune performance of the system.
An extent cannot span a datafile.
Block:  The Oracle Server manages data at the smallest unit in what is termed a block or data block.  Data are actually stored in blocks.
A physical block is the smallest addressable location on a disk drive for read/write operations.
An Oracle data block consists of one or more physical blocks (operating system blocks) so the data block, if larger than an operating system block, should be an even multiple of the operating system block size, e.g., if the Linux operating system block size is 2K or 4K, then the Oracle data block should be 2K, 4K, 8K, 16K, etc in size.  This optimizes I/O.
The data block size is set at the time the database is created and cannot be changed.  It is set with the DB_BLOCK_SIZE parameter.  The maximum data block size depends on the operating system.
Thus, the Oracle database architecture includes both logical and physical structures as follows:
Physical:  Control files; Redo Log Files; Datafiles; Operating System Blocks.
Logical:  Tablespaces; Segments; Extents; Data Blocks.
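As a small illustration of how these structures are created and resized (a sketch; the tablespace name app_data, file paths, and sizes are made up for this example):

SQL> CREATE TABLESPACE app_data DATAFILE '/u01/oradata/DBORCL/app_data01.dbf' SIZE 100M;
SQL> ALTER DATABASE DATAFILE '/u01/oradata/DBORCL/app_data01.dbf' RESIZE 200M;
SQL> ALTER TABLESPACE app_data ADD DATAFILE '/u01/oradata/DBORCL/app_data02.dbf' SIZE 100M;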
SQL Statement Processing
SQL Statements are processed differently depending on whether the statement is a query, data manipulation language (DML) to update, insert, or delete a row, or data definition language (DDL) to write information to the data dictionary.
Processing a query:
Parse:
Search for identical statement in the Shared SQL Area.
Check syntax, object names, and privileges.
Lock objects used during parse.
Create and store execution plan.
Bind: Obtains values for variables.
Execute: Process statement.
Fetch: Return rows to user process.
Processing a DML statement:
Parse: Same as the parse phase used for processing a query.
Bind: Same as the bind phase used for processing a query.
Execute:
If the data and undo blocks are not already in the Database Buffer Cache, the server process reads them from the datafiles into the Database Buffer Cache.
The server process places locks on the rows that are to be modified. The undo block is used to store the before image of the data, so that the DML statements can be rolled back if necessary.
The data blocks record the new values of the data.
The server process records the before image to the undo block and updates the data block.  Both of these changes are made in the Database Buffer Cache.  Any changed blocks in the Database Buffer Cache are marked as dirty buffers.  That is, buffers that are not the same as the corresponding blocks on the disk.
The processing of a DELETE or INSERT command uses similar steps.  The before image for a DELETE contains the column values in the deleted row, and the before image of an INSERT contains the row location information.
Processing a DDL statement:  The execution of DDL (Data Definition Language) statements differs from the execution of DML (Data Manipulation Language) statements and queries, because the success of a DDL statement requires write access to the data dictionary.
For these statements, parsing actually includes parsing, data dictionary lookup, and execution.  Transaction management, session management, and system management SQL statements are processed using the parse and execute stages.  To re-execute them, simply perform another execute.


Instance recovery is performed in two steps, i.e., roll forward and roll back.
Cache Recovery or Rollforward:
Here, the changes recorded in the redo log files are applied to the affected blocks. This includes both committed and uncommitted data. Since undo data is protected by redo, the roll forward also regenerates the undo images. The time required for this is proportional to the changes made in the database after the last successful checkpoint. After cache recovery, the database is consistent to the point when the crash occurred. The database can then be opened and users can start connecting to it. The parameter RECOVERY_PARALLELISM specifies the number of processes that participate in instance or crash recovery, and we can thus speed up the roll forward.
Transaction Recovery or Rollback
The uncommitted data in the database will now be rolled back. This is coordinated by SMON, which rolls back sets of transactions in parallel (by default) using multiple server processes. SMON automatically decides when to begin parallel rollback and disperses the work among several parallel processes: process one rolls back one transaction, process two rolls back a second transaction, and so on. If a transaction is huge, Oracle begins intra-transaction recovery by dispersing the huge transaction among the slave processes: process one takes one part, process two takes another part, and so on. Parallel mode is the default and is controlled by the parameter FAST_START_PARALLEL_ROLLBACK. We can either turn it off (for serial recovery) or increase the degree of parallelism. If you change the value of FAST_START_PARALLEL_ROLLBACK, then transaction recovery is stopped and restarted with the new implied degree of parallelism.
As mentioned earlier, user sessions are allowed to connect even before the transaction recovery is completed. If a user attempts to access a row that is locked by a terminated transaction, Oracle rolls back only those changes necessary to complete the transaction; in other words, it rolls them back on demand. Consequently, new transactions do not have to wait until all parts of a long transaction are rolled back.
This transaction recovery is required and has to be completed. We can disable transaction recovery temporarily but at some point this has to be completed. We can monitor the progress of fast-start parallel rollback by examining the V$FAST_START_SERVERS and V$FAST_START_TRANSACTIONS views.
The Fast-Start Fault Recovery feature reduces the time required for cache recovery, and makes the recovery bounded and predictable by limiting the number of dirty buffers and the number of redo records generated between the most recent redo record and the last checkpoint. With the Fast-Start Fault Recovery feature, the FAST_START_MTTR_TARGET initialization parameter simplifies the configuration of recovery time from instance or system failure. FAST_START_MTTR_TARGET specifies a target for the expected mean time to recover (MTTR), that is, the time (in seconds) that it should take to start up the instance and perform cache recovery. After FAST_START_MTTR_TARGET is set, the database manages incremental checkpoint writes in an attempt to meet that target.
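A brief sketch (the 60-second target is only an example) of setting the MTTR target and checking Oracle's current estimate:

SQL> ALTER SYSTEM SET fast_start_mttr_target = 60;
SQL> SELECT target_mttr, estimated_mttr FROM v$instance_recovery;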
If SMON is busy performing transaction recovery, you should never attempt a shutdown abort followed by a restart of the database; all the recovery work done up to that point would have to be done again.
There are different modes in which you can open the database, e.g., migrate, read-only, and restricted modes.

What is Active Data Guard


It enables a physical standby database to be open for read access while media recovery is being performed on it to keep it synchronized with the production database. The physical standby database is open in read-only mode while redo transport and standby apply are both active.
Active Data Guard automatically repairs corrupt blocks online by using the active standby database.
Normally, queries executed on active standby databases return up-to-date results. Due to potential delays in redo transport or standby apply, a standby database may “fall behind” its primary, which can cause results of queries on the standby to be out of date.
Active Data Guard sessions can be configured with a maximum query delay (in seconds). If the standby exceeds the delay, Active Data Guard returns an error to the application, which can then retry the query or transparently redirect the query to the primary, as required.
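On an 11g physical standby, the read-only open with ongoing apply is typically achieved with commands along these lines (a sketch, run on the standby; STANDBY_MAX_DATA_DELAY is the session setting for the maximum query delay mentioned above, and 30 seconds is just an example):

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
SQL> ALTER DATABASE OPEN READ ONLY;
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;
SQL> ALTER SESSION SET STANDBY_MAX_DATA_DELAY = 30;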

What is logical standby database

In a logical standby database configuration, Data Guard SQL Apply uses redo information shipped from the primary system. However, instead of using media recovery to apply the changes (as in the physical standby database configuration), the redo data is transformed into equivalent SQL statements by using LogMiner technology. These SQL statements are then applied to the logical standby database. The logical standby database is open in read/write mode and is available for reporting capabilities.
The logical standby database can offer protection at database level, schema level, or even object level.
A logical standby database can be used to perform rolling database upgrades, thereby minimizing down time when upgrading to new database patch sets or full database releases.
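On the logical standby itself, SQL Apply is started and stopped with statements like the following (a sketch):

SQL> ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
SQL> ALTER DATABASE STOP LOGICAL STANDBY APPLY;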




Oracle Back ground process

When a buffer in the database buffer cache is modified, it is marked as a dirty buffer and added to the head of the checkpoint queue, which is kept in system change number (SCN) order. This order therefore matches the order of the redo that is written to the redo logs for these changed buffers.

When the number of available buffers in the buffer cache falls below an internal threshold (to the extent that server processes find it difficult to obtain available buffers), DBWn writes infrequently used, modified (dirty) buffers to the data files from the tail of the LRU list so that processes can replace buffers when they need them. DBWn also writes from the tail of the checkpoint queue to keep the checkpoint advancing.
The SGA contains a memory structure that has the redo byte address (RBA) of the position in the redo stream where recovery should begin in the case of an instance failure. This structure acts as a pointer into the redo and is written to the control file by the CKPT process once every three seconds. Because the DBWn writes dirty buffers in SCN order, and because the
redo is in SCN order, every time DBWn writes dirty buffers from the LRU list, it also advances the pointer held in the SGA memory structure so that instance recovery (if required) begins reading the redo from approximately the correct location and avoids unnecessary I/O. This is known as incremental checkpointing.
In all cases, DBWn performs batched (multiblock) writes to improve efficiency. The number of blocks written in a multiblock write varies by operating system.


LGWR is responsible for redo log buffer management, writing redo log buffer entries to a redo log file on disk. LGWR writes all redo entries that have been copied into the buffer since the last time it wrote.
The redo log buffer is a circular buffer. When LGWR writes redo entries from the redo log buffer to a redo log file, server processes can then copy new entries over the entries in the redo log buffer that have been written to disk. LGWR normally writes fast enough to ensure that space is always available in the buffer for new entries, even when access to the redo log is heavy. LGWR writes one contiguous portion of the buffer to disk.
LGWR writes:

When a user process commits a transaction
When the redo log buffer is one-third full
Before a DBWn process writes modified buffers to disk (if necessary)
Every three seconds

Before DBWn can write a modified buffer, all redo records that are associated with the changes to the buffer must be written to disk (the write-ahead protocol). If DBWn finds that some redo records have not been written, it signals LGWR to write the redo records to disk and waits for LGWR to complete writing the redo log buffer before it can write out the data buffers. LGWR writes to the current log group. If one of the files in the group is damaged or unavailable, LGWR continues writing to other files in the group and logs an error in the LGWR trace file and in the system alert log. If all files in a group are damaged, or if the group is unavailable because it has not been archived, LGWR cannot continue to function.
When a user issues a COMMIT statement, LGWR puts a commit record in the redo log buffer and writes it to disk immediately, along with the transaction’s redo entries. The corresponding changes to data blocks are deferred until it is more efficient to write them. This is called a fast commit mechanism. The atomic write of the redo entry containing the transaction’s commit
record is the single event that determines whether the transaction has committed. Oracle Database returns a success code to the committing transaction, although the data buffers have not yet been written to disk.

What is checkpoint and its types

A checkpoint is both a concept and a mechanism. There are different types of checkpoints; the most important ones for this discussion are the full checkpoint and the incremental checkpoint.
The checkpoint position defines at which system change number (SCN) in the redo thread instance recovery would need to begin.
The SCN at which a full checkpoint occurred is stored in both the data file headers and the control file.
The SCN at which the last incremental checkpoint occurred is only stored in the control file (in a structure known as the checkpoint progress record).
The CKPT process updates the control files and the headers of all data files to record the details of the checkpoint. The CKPT process does not write blocks to disk; DBWn always performs that work. The SCNs recorded in the file headers guarantee that all changes made to database blocks prior to that SCN have been written to disk.
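A quick way to compare the checkpoint SCN recorded in the control file with the SCNs in the data file headers (a sketch):

SQL> SELECT checkpoint_change# FROM v$database;
SQL> SELECT file#, checkpoint_change# FROM v$datafile_header;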
What is Process Monitor and System Monitor Process

The System Monitor Process (SMON) performs recovery at instance startup, if necessary. SMON is also responsible for cleaning up temporary segments that are no longer in use. If any terminated transactions were skipped during instance recovery because of file-read or offline errors, SMON recovers them when the tablespace or file is brought back online. SMON checks regularly to see whether it is needed. Other processes can call SMON if they detect a need for it.

The Process Monitor Process (PMON) performs process recovery when a user process fails. PMON is responsible for cleaning up the database buffer cache and freeing resources that the user process was using. For example, it resets the status of the active transaction table, releases locks, and removes the process ID from the list of active processes.
PMON periodically checks the status of dispatcher and server processes, and restarts any that have stopped running (but not any that Oracle Database has terminated intentionally). Like SMON, PMON checks regularly to see whether it is needed. It can be called if another process detects the need for it.
What is retention policy

A retention policy describes which backups will be kept and for how long.
You can set the value of the retention policy by using the RMAN CONFIGURE command.
Recovery Window Retention Policy
The best practice is to establish a period of time during which it will be possible to discover logical errors and fix the affected objects by doing a point-in-time recovery to just before the error occurred. This period of time is called the recovery window. The policy is specified as a number of days. For each data file, there must always exist at least one backup that satisfies the following condition:
SYSDATE – backup_checkpoint_time >= recovery_window

You can use the following command syntax to configure a recovery window retention policy:

RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF <days> DAYS;
where <days> is the size of the recovery window.
If you are not using a recovery catalog, you should keep the recovery window time period less than or equal to the value of the CONTROL_FILE_RECORD_KEEP_TIME parameter to prevent
the record of older backups from being overwritten in the control file. If you are using a recovery catalog, then make sure the value of CONTROL_FILE_RECORD_KEEP_TIME is greater than the
time period between catalog resynchronizations. Resynchronizations happen when you:
Create a backup. In this case, the synchronization is done implicitly.
Execute the RESYNC CATALOG command.
Recovery catalogs are covered in more detail in the lesson titled “Using the RMAN Recovery
Catalog.”



Redundancy Retention Policy

If you require a certain number of backups to be retained, you can set the retention policy on the basis of the redundancy option. This option requires that a specified number of backups be cataloged before any backup is identified as obsolete. The default retention policy has a redundancy of 1, which means that only one backup of a file must exist at any given time. A
backup is deemed obsolete when a more recent version of the same file has been backed up.
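For example (a sketch; the redundancy count of 2 is arbitrary), a redundancy-based policy is configured as follows:

RMAN> CONFIGURE RETENTION POLICY TO REDUNDANCY 2;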


What is obsolete and expired backup

Expired backup: When the CROSSCHECK command is used to determine whether backups recorded in the repository still exist on disk or tape, and RMAN cannot locate a backup, it updates that backup's record in the RMAN repository to EXPIRED status.

Obsolete Backup: One of the key questions every backup strategy must address is how long you want to keep a backup. Although you can specify that a backup be kept forever without becoming obsolete, it is not common to follow such a strategy unless you are doing it for a special reason. Instead, backups become obsolete according to the retention policy you adopt. You can select the retention duration of backups in RMAN in two ways.
In the first method, you specify backup retention based on a recovery window. That is, all backups necessary to perform a point-in-time recovery to a specified past point in time will be retained by RMAN. If a backup is older than the point in time you chose, that backup becomes obsolete according to the backup retention rules.
The second way to specify the retention duration is to use a redundancy-based retention policy, under which you specify the number of backups of a file that must be kept on disk. Any backups of a data file greater than that number will be considered obsolete. Because obsolete backups are automatically deleted when space is needed for fresh files, you will not run the risk of accidentally deleting necessary files.

Obsolete backups are any backups that you do not need to satisfy a configured retention policy. You may also delete obsolete backups according to any retention policy you specify as an option to the DELETE OBSOLETE command. The DELETE OBSOLETE command removes the deleted files from the backup media and marks those backups as deleted in both the control file and the recovery catalog.

The report obsolete command reports on any obsolete backups. Always run the crosscheck command first in order to update the status of the backups in the RMAN repository to that on disk and tape.
In the following example, the report obsolete command shows no obsolete
backups:
RMAN> crosscheck backup;
RMAN> report obsolete;
RMAN retention policy will be applied to the command
RMAN retention policy is set to redundancy 1
no obsolete backups found
The following execution of the report obsolete command shows that there are both obsolete backup sets and obsolete archived redo log backups. Again, run the crosscheck command before issuing the report obsolete command.
RMAN> crosscheck backup;
RMAN> report obsolete;

What is Clusterware Startup Sequence

The Oracle High Availability Services daemon (ohasd) is responsible for starting, monitoring, and restarting, in the proper order, the other local Oracle Clusterware daemons, up through the crsd daemon, which in turn manages clusterwide resources.

When a cluster node boots, or Clusterware is started on a running clusterware node, the init process starts ohasd. The ohasd process then initiates the startup of the processes in the lower, or Oracle High Availability (OHASD) stack.

The cssdagent process is started, which in turn starts ocssd. The ocssd process discovers the voting disk, either in ASM or on shared storage, and then joins the cluster. The cssdagent process monitors the cluster and provides I/O fencing. This service was formerly provided by the Oracle Process Monitor Daemon (oprocd). A cssdagent failure may result in Oracle Clusterware restarting the node.
The orarootagent is started. This process is a specialized oraagent process that helps crsd start and manage resources owned by root, such as the network and the grid virtual IP address.
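To inspect the lower-stack resources managed by OHASD, a command like the following can be used (a sketch; run from the Grid Infrastructure home as root or the grid owner):

$ crsctl stat res -t -init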





Node vip: The node vip is a node application (nodeapp) responsible for eliminating response delays (TCP timeouts) to client programs requesting a connection to the database. Each node vip is assigned an unused IP address.
This is usually done via DHCP but can be manually assigned. There is initially one node vip per cluster node at Clusterware startup. When a cluster node becomes unreachable, the node vip is failed over to a surviving node and redirects connection requests made to the unreachable node to a surviving node.

SCAN vip: SCAN vips or Single Client Access Name vips are part of a connection framework that eliminates dependencies on static cluster node names. This framework allows nodes to be added to or removed from the cluster without affecting the ability of clients to connect to the database. If GNS is used in the cluster, three SCAN vips are started on the member nodes using the IP addresses assigned by the DHCP server. If GNS is not used, SCAN vip addresses for the cluster can be defined in the DNS server used by the cluster nodes.

SCAN Listener: Three SCAN Listeners are started on the cluster nodes where the SCAN VIPs are started. Oracle Database 11g Release 2 and later instances register with SCAN listeners only as remote listeners.
GNS VIP: If GNS is used to resolve client requests for the cluster, a single GNS VIP for the cluster is started. The IP address is assigned in the GNS server used by the cluster nodes.


SCAN and Local Listeners

When a client submits a connection request, the SCAN listener listening on a SCAN IP address and the SCAN port are contacted on the client’s behalf. Because all services on the cluster are registered with the SCAN listener, the SCAN listener replies with the address of the local listener on the least-loaded node where the service is currently being offered.
Finally, the client establishes a connection to the service through the listener on the node where service is offered. All these actions take place transparently to the client without any explicit configuration required in the client.

During installation, SCAN listeners are created on the nodes for the SCAN IP addresses.
Oracle Net Services routes application requests to the least-loaded instance providing the service. Because the SCAN addresses resolve to the cluster, rather than to a node address in the cluster, nodes can be added to or removed from the cluster without affecting the SCAN address configuration.
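The SCAN configuration can be checked with commands such as these (a sketch; the output depends on the cluster):

$ srvctl config scan
$ srvctl config scan_listener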



Failing to Start OHAS

The first daemon to start in a Grid Infrastructure environment is OHAS. This process relies on the init process to invoke /etc/init.d/init.ohasd, which starts /etc/rc.d/init.d/ohasd, which in turn executes $GRID_HOME/ohasd.bin. Without a properly working ohasd.bin process, none of the other stack
components will start. The entry in /etc/inittab defines that /etc/init.d/init.ohasd is started at runlevels 3 and 5. Runlevel 3 in Linux usually brings the system up in networked, multi-user mode;
however, it doesn’t start X11. Runlevel 5 is normally used for the same purpose, but it also starts the graphical user interface. If the system is at a runlevel other than 3 or 5, then ohasd.bin cannot be started, and you need to use a call to init to change the runlevel to either 3 or 5. You can check
/var/log/messages for output from the scripts under /etc/rc.d/init.d/; ohasd.bin logs information into the default log file destination at $GRID_HOME/log/hostname in the ohasd/ohasd.log subdirectory.
The administrator has the option to disable the start of the High Availability Services stack by calling crsctl disable crs. This call updates a flag in /etc/oracle/scls_scr/hostname/root/ohasdstr. The file contains only one word, either enable or disable, and no carriage return. If set to disable, then
/etc/rc.d/init.d/ohasd will not proceed with the startup. Call crsctl start crs to start the cluster stack manually in that case. Many Grid Infrastructure background processes rely on sockets created in /var/tmp/.oracle. You
can check which socket is used by a process by listing the contents of the /proc/pid/fd directory, where pid is the process id of the program you are looking at. In some cases, permissions on the sockets can become garbled; in our experience, moving the .oracle directory to a safe location and rebooting solved the cluster communication problems.
Another reason ohasd.bin might fail to start: the file system for $GRID_HOME could be either corrupt or otherwise not mounted. Earlier, it was noted that ohasd.bin lives in $GRID_HOME/bin. If $GRID_HOME isn’t mounted, then it is not possible to start the daemon.
We introduced the OLR as an essential file for starting Grid Infrastructure. If the OLR has become corrupt or is otherwise not accessible, then ohasd.bin cannot start. Successful initialization of the OLR is recorded in the ohasd.log, as in the following example (the timestamps have been removed for the sake
of clarity):
[ default][3046704848] OHASD Daemon Starting. Command string :reboot
[ default][3046704848] Initializing OLR
[ OCRRAW][3046704848]proprioo: for disk 0
(/u01/app/crs/cdata/london1.olr),
id match (1), total id sets, (1) need recover (0), my votes (0),
total votes (0), commit_lsn (15), lsn (15)
[ OCRRAW][3046704848]proprioo: my id set: (2018565920, 1028247821, 0, 0, 0)
[ OCRRAW][3046704848]proprioo: 1st set: (2018565920, 1028247821, 0, 0, 0)
[ OCRRAW][3046704848]proprioo: 2nd set: (0, 0, 0, 0, 0)
[ CRSOCR][3046704848] OCR context init CACHE Level: 0xaa4cfe0
[ default][3046704848] OHASD running as the Privileged user

Interestingly, the errors pertaining to the local registry have the same numbers as those for the OCR; however, they have been prefixed by PROCL. The L can easily be missed, so check carefully! If the OLR cannot be read, then you will see the error messages immediately under the Initializing OLR line. This chapter has covered two causes so far: the OLR is missing or the OLR is corrupt. The first case is much easier to diagnose because, in that case, OHAS will not start:
[root@london1 ~]# crsctl check crs
CRS-4639: Could not contact Oracle High Availability Services
In the preceding example, ohasd.log will contain an error message similar to this one:
[ default][1381425744] OHASD Daemon Starting. Command string :restart
[ default][1381425744] Initializing OLR
[ OCROSD][1381425744]utopen:6m’:failed in stat OCR file/disk
/u01/app/crs/cdata/london1.olr,
errno=2, os err string=No such file or directory
[ OCROSD][1381425744]utopen:7:failed to open any OCR file/disk, errno=2,
os err string=No such file or directory
[ OCRRAW][1381425744]proprinit: Could not open raw device
[ OCRAPI][1381425744]a_init:16!: Backend init unsuccessful : [26]
[ CRSOCR][1381425744] OCR context init failure. Error: PROCL-26: Error
while accessing the physical storage Operating System
error [No such file or directory] [2]
[ default][1381425744] OLR initalization failured, rc=26
[ default][1381425744]Created alert : (:OHAS00106:) : Failed to initialize
Oracle Local Registry
[ default][1381425744][PANIC] OHASD exiting; Could not init OLR
In this case, you should restore the OLR, which you will learn how to do in the “Maintaining Voting
Disk and OCR/OLR” section.

If the OLR is corrupted, then you will see slightly different errors. OHAS tries to read the OLR; while it succeeds for some keys, it fails for others. Long hex dumps will appear in the ohasd.log, indicating a problem. You should perform an ocrcheck -local in this case, which can help you determine the root cause. The following output has been taken from a system where the OLR was corrupt:
[root@london1 ohasd]# ocrcheck -local
Status of Oracle Local Registry is as follows :
         Version                  :          3
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       2232
         Available space (kbytes) :     259888
         ID                       : 1022831156
         Device/File Name         : /u01/app/crs/cdata/london1.olr
                                    Device/File integrity check failed
         Local registry integrity check failed
         Logical corruption check bypassed


If the utility confirms that the OLR is corrupted, then you have no option but to restore it.


Testing ASM Disk Failure Scenario and disk_repair_time

External Redundancy -- every extent is stored only once. If five drives are allocated, data is spread across all five; if we lose one drive, data is lost and must be restored from backup (unless the storage array itself provides RAID protection).

Normal Redundancy -- requires two or more ASM disks (failure groups) and protects against a single drive outage, because every extent is mirrored on a second disk. For example, with three disks:
     Disk1: A  C
     Disk2: A  B
     Disk3: B  C
If we lose one drive, we still have the data, since each extent has a mirror copy on another disk.

High Redundancy -- requires three or more ASM disks (failure groups) and protects against a double drive outage, because every extent is written three times across the disk group.

If you pull a drive from a NORMAL or HIGH redundancy disk group and shut the machine down abruptly, the disk group will not mount automatically on restart, because it is incomplete (one disk is missing). It can only be mounted with the FORCE option, as shown in the ORA-15042 example later in this section.
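As a minimal sketch, the redundancy level is chosen when the diskgroup is created (the diskgroup and disk names below are illustrative):

SQL> create diskgroup data_ext  external redundancy disk 'ORCL:DISK1';
SQL> create diskgroup data_norm normal redundancy disk 'ORCL:DISK2','ORCL:DISK3';
SQL> create diskgroup data_high high redundancy disk 'ORCL:DISK4','ORCL:DISK5','ORCL:DISK6';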

 
 
  
For more details  https://youtu.be/GNcU0NudlJ0


When a disk failure occurs for an ASM disk, the behavior of ASM differs based on the kind of redundancy in use for the diskgroup. If the diskgroup has EXTERNAL REDUNDANCY, it keeps working as long as you have redundancy at the external RAID level. If there is no RAID at the external level, the diskgroup is immediately dismounted, the disk needs to be repaired or replaced, the diskgroup might need to be dropped and re-created, and the data on it would require recovery.

For NORMAL and HIGH redundancy diskgroups, the behavior is a little different. When a disk gets corrupted or goes missing in a NORMAL/HIGH redundancy diskgroup, an error is reported in the alert log file and the disk becomes OFFLINE, as we can see in the output of the query below, captured after I started my test of an ASM disk failure. I simply pulled out a disk belonging to a NORMAL redundancy ASM diskgroup from the storage.
col name format a8
col header_status format a7
set lines 2000
col path format a10
select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH        STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ----------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1  NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2  NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3  NORMAL   MEMBER             0 ONLINE  CACHED
DATA4                NORMAL   UNKNOWN         1200 OFFLINE MISSING
  
Here we see a value of "1200" under the REPAIR_TIMER column; this is the time in seconds after which the disk will be dropped automatically. The time is derived from a diskgroup attribute called DISK_REPAIR_TIME, which I will discuss below.

In 10g, if a disk went missing, it was dropped immediately and a REBALANCE operation kicked in right away, whereby ASM started redistributing the extents across the remaining disks in the diskgroup to restore redundancy.

DISK_REPAIR_TIME
Starting with 11g, Oracle provides a diskgroup attribute called DISK_REPAIR_TIME, with a default value of 3.6 hours. This means that if a disk goes missing, it is not dropped immediately; ASM waits for the disk to come back online or be replaced. This feature helps in scenarios where a disk is plugged out accidentally, or a storage server/SAN gets disconnected or rebooted, leaving an ASM diskgroup without one or more disks. While the disk(s) remain unavailable, ASM keeps track of the extents that would have been written to the missing disk(s) and writes them as soon as the disk(s) come back online (this feature is called fast mirror resync). If the disk(s) do not come back online within the DISK_REPAIR_TIME threshold, they are dropped and a rebalance starts.

FAILGROUP_REPAIR_TIME
Starting with 12c, another attribute can be set for the diskgroup: FAILGROUP_REPAIR_TIME, with a default value of 24 hours. It is similar to DISK_REPAIR_TIME, but is applied to the whole failgroup. In Exadata, all disks belonging to one storage server can belong to one failgroup (to avoid a mirror copy of an extent being written to a disk in the same storage server), so this attribute is quite handy in Exadata environments when a complete storage server is taken down for maintenance or some other reason.
In the following we can see how to set values for the diskgroup attributes explained above.
SQL> col name format a30
SQL> select name,value from v$asm_attribute where group_number=3 and name like '%repair_time%';

NAME                           VALUE
------------------------------ --------------------
disk_repair_time               3.6h
failgroup_repair_time          24.0h

SQL> alter diskgroup data set attribute 'disk_repair_time'='1h';

Diskgroup altered.

SQL>  alter diskgroup data set attribute  'failgroup_repair_time'='10h';

Diskgroup altered.

SQL> select name,value from v$asm_attribute where group_number=3 and name like '%repair_time%';

NAME                           VALUE
------------------------------ --------------------
disk_repair_time               1h
failgroup_repair_time          10h

ORA-15042
If a disk is offline/missing from an ASM diskgroup, ASM may not mount the diskgroup automatically during instance restart. In this case, we might need to mount the diskgroup manually, with FORCE option.
SQL> alter diskgroup data mount;
alter diskgroup data mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "3" is missing from group number "2"

SQL> alter diskgroup data mount force;

Diskgroup altered.

Monitoring the REPAIR_TIME
After a disk goes offline, the clock starts ticking and the REPAIR_TIMER value can be monitored to see how much time remains to make the disk available again before it is automatically dropped.
SQL> select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH        STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ----------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1  NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2  NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3  NORMAL   MEMBER             0 ONLINE  CACHED
DATA4                NORMAL   UNKNOWN          649 OFFLINE MISSING

--We can confirm that no rebalance has started yet by using following query
SQL> select * from v$asm_operation;

no rows selected

If we are able to make this disk available again (or replace it) before DISK_REPAIR_TIME elapses, we can bring the disk back online. Note that we need to bring it ONLINE manually.
SQL> alter diskgroup data online disk data4;

Diskgroup altered.

select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH        STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ----------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1  NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2  NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3  NORMAL   MEMBER             0 ONLINE  CACHED
DATA4                NORMAL   UNKNOWN          465 SYNCING CACHED

--Syncing is in progress, and hence no rebalance would occur.

SQL> select * from v$asm_operation;

no rows selected
-- After some time, everything would become normal.

select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH        STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ----------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1  NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2  NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3  NORMAL   MEMBER             0 ONLINE  CACHED
DATA4    ORCL:DATA4  NORMAL   MEMBER             0 ONLINE  CACHED


 

If the same disk cannot be made available or replaced, either ASM will auto-drop the disk after DISK_REPAIR_TIME has elapsed, or we drop the ASM disk manually. A rebalance occurs after the disk is dropped.
Since the disk status is OFFLINE, we need to use the FORCE option to drop it. After dropping the disk, the rebalance starts and can be monitored from the V$ASM_OPERATION view.
SQL> alter diskgroup data drop disk data4;
alter diskgroup data drop disk data4
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15084: ASM disk "DATA4" is offline and cannot be dropped.


SQL> alter diskgroup data drop disk data4 force;

Diskgroup altered.

select group_number,operation,pass,state,power,sofar,est_work from v$asm_operation;

GROUP_NUMBER OPERA PASS      STATE      POWER      SOFAR   EST_WORK
------------ ----- --------- ----- ---------- ---------- ----------
           2 REBAL RESYNC    DONE           9          0          0
           2 REBAL REBALANCE DONE           9         42         42
           2 REBAL COMPACT   RUN            9          1          0

Later we can replace the faulty disk and add the new disk back into this diskgroup. Adding the disk back initiates a rebalance once again.
SQL> alter diskgroup data add disk 'ORCL:DATA4';

Diskgroup altered.

SQL> select * from v$asm_operation;

select group_number,operation,pass,state,power,sofar,est_work from v$asm_operation;

GROUP_NUMBER OPERA PASS      STATE      POWER      SOFAR   EST_WORK
------------ ----- --------- ----- ---------- ---------- ----------
           2 REBAL RESYNC    DONE           9          0          0
           2 REBAL REBALANCE RUN            9         37       2787
           2 REBAL COMPACT   WAIT           9          1          0

https://www.oraclenext.com/2018/01/testing-asm-disk-failure-scenario-and.html



How to add disk to ASM
Posted on October 12, 2012 by Sher khan

How to add a disk to ASM for a database running on a production server
We have a database running on ASM; after two years we faced the problem of running out of space.
Now we plan to add a disk to the ASM diskgroup DATAGROUP.
SQL> @asm
NAME                   TOTAL_GB    FREE_GB
-------------------- ---------- ----------
DATAGROUP            249.995117 15.2236328
IDXGROUP              149.99707 10.4892578
Steps are below
1) Create a partition on disk /dev/sdm, the new LUN we got from storage
[root@rac-node1 ~]# fdisk -l /dev/sdm
Disk /dev/sdm: 85.8 GB, 85899345920 bytes
255 heads, 63 sectors/track, 10443 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/sdm doesn't contain a valid partition table
[root@rac-node1 ~]# fdisk /dev/sdm
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

The number of cylinders for this disk is set to 10443.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
 (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Command (m for help): n
Command action
 e extended
 p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-10443, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-10443, default 10443):
Using default value 10443
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
[root@rac-node1 ~]# fdisk -l /dev/sdm
Disk /dev/sdm: 85.8 GB, 85899345920 bytes
255 heads, 63 sectors/track, 10443 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdm1 1 10443 83883366 83 Linux
[root@rac-node1 ~]#
2) Configure the disk /dev/sdm1 for ASM, giving it the label DATA5
[root@rac-node1 ~]# /etc/init.d/oracleasm createdisk DATA5 /dev/sdm1
Marking disk "DATA5" as an ASM disk: [ OK ]
[root@rac-node1 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks: [ OK ]
[root@rac-node1 ~]# /etc/init.d/oracleasm listdisks
DATA3
DATA4
DATA5
DISK1
DISK2
INDEX2
INDEX5
[root@rac-node1 ~]#
Scandisks on RAC -node2
[root@rac-node2 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks: [ OK ]
[root@rac-node2 ~]# /etc/init.d/oracleasm listdisks
DATA3
DATA4
DATA5
DISK1
DISK2
INDEX2
INDEX5
Add the disk to /etc/rawdevices
[root@rac-node2 bin]vi /etc/sysconfig/rawdevices
/dev/raw/raw6 /dev/sdm1    ==> add this to rawdevices file
And add this to /etc/rc.local so the permissions are set again on reboot
[root@rac-node2 bin]#vi /etc/rc.local

chmod 660 /dev/raw/raw6

Check the disk status 
SQL> set linesize 9999
SQL> SELECT NVL(a.name, '[CANDIDATE]') disk_group_name
     , b.path disk_file_path
     , b.name disk_file_name
     , b.failgroup disk_file_fail_group
     FROM v$asm_diskgroup a RIGHT OUTER JOIN v$asm_disk b USING (group_number)
     ORDER BY a.name;
DISK_GROUP_NAME DISK_FILE_PATH DISK_FILE_NAME DISK_FILE_FAIL_GROUP
--------------- -------------- -------------- --------------------
DATAGROUP       ORCL:DISK1     DISK1          DISK1
DATAGROUP       ORCL:INDEX5    INDEX5         INDEX5
DATAGROUP       ORCL:DATA4     DATA4          DATA4
DATAGROUP       ORCL:DATA3     DATA3          DATA3
IDXGROUP        ORCL:DISK2     DISK2          DISK2
IDXGROUP        ORCL:INDEX2    INDEX2         INDEX2
[CANDIDATE]     ORCL:DATA5                             ==> this is the new disk

7 rows selected.
3) Add disk DATA5 to diskgroup DATAGROUP
SQL> alter diskgroup DATAGROUP ADD DISK 'ORCL:DATA5' ;
Diskgroup altered.
Check disk status again
SQL> SELECT NVL(a.name, '[CANDIDATE]') disk_group_name
     , b.path disk_file_path
     , b.name disk_file_name
     , b.failgroup disk_file_fail_group
     FROM v$asm_diskgroup a RIGHT OUTER JOIN v$asm_disk b USING (group_number)
     ORDER BY a.name;
DISK_GROUP_NAME DISK_FILE_PATH DISK_FILE_NAME DISK_FILE_FAIL_GROUP
--------------- -------------- -------------- --------------------
DATAGROUP       ORCL:INDEX5    INDEX5         INDEX5
DATAGROUP       ORCL:DATA4     DATA4          DATA4
DATAGROUP       ORCL:DISK1     DISK1          DISK1
DATAGROUP       ORCL:DATA5     DATA5          DATA5
DATAGROUP       ORCL:DATA3     DATA3          DATA3
IDXGROUP        ORCL:INDEX2    INDEX2         INDEX2
IDXGROUP        ORCL:DISK2     DISK2          DISK2

7 rows selected. DATA5 is no longer a candidate; it is now a member of DATAGROUP.

SQL> host cat script/asm.sql
select name,TOTAL_MB/1024 total_gb,free_mb/1024 FREE_GB from v$asm_diskgroup;
NAME                   TOTAL_GB    FREE_GB
-------------------- ---------- ----------
DATAGROUP            329.992188   95.21875
IDXGROUP              149.99707 10.4892578
SQL>
 
 
 
Configuring three IPs for SCAN listener in Oracle 11gR2 
SCAN
 
"The benefit of using the SCAN is that the connection information of the client does not need to change if you add or remove nodes in the cluster."
"Having a single name to access the cluster enables the client to use the EZConnect client and the simple JDBC thin URL to access any Database running in the cluster, independent of the active servers in the cluster. The SCAN provides load balancing and failover for client connections to the Database. The SCAN works as a cluster alias for Databases in the cluster."
"...provide information on what services are being provided by the instance, the current load, and a recommendation on how many incoming connections should be directed to the instance."
https://docs.oracle.com/database/121/JJDBC/scan.htm#JJDBC29160

"each pair of resources (SCAN VIP and Listener) will be started on a different server in the cluster, assuming the cluster consists of three or more nodes."
"In case, a 2-node-cluster is used (for which 3 IPs are still recommended for simplification reasons), one server in the cluster will host two sets of SCAN resources under normal operations.
 
 
How to configure SCAN listener with DNS?

About SCAN( Single Client Access Name) Listener in Oracle 11gR2 RAC:
Single Client Access Name (SCAN) is a new Oracle Real Application Clusters (RAC) 11g Release 2 feature that provides a single name for clients to access Oracle Databases running in a cluster. The benefit is that the client’s connect information does not need to change if you add or remove nodes in the cluster. Having a single name to access the cluster allows clients to use the EZConnect client and the simple JDBC thin URL to access any database running in the cluster, independently of which server(s) in the cluster the database is active. SCAN provides load balancing and failover for client connections to the database. The SCAN works as a cluster alias for databases in the cluster.
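As a minimal sketch, a client only needs the SCAN name in its connect string; the SCAN host, port and service name below are illustrative:

sqlplus scott/tiger@//hrdbscan.hrapplication.com:1521/hrsrv          (EZConnect)

jdbc:oracle:thin:@//hrdbscan.hrapplication.com:1521/hrsrv            (JDBC thin URL)

HRSRV =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = hrdbscan.hrapplication.com)(PORT = 1521))
    (CONNECT_DATA = (SERVER = DEDICATED)(SERVICE_NAME = hrsrv))
  )                                                                  (tnsnames.ora entry)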
Why three IPs to be configured for SCAN listener through DNS (Domain Name Server):

If we configure 3 IPs for the SCAN listener through DNS, then in case of a failure of any SCAN IP, failover happens to another running IP. Another benefit is that client access is resolved through DNS as well.
Failover becomes a bit of a problem if there is only one SCAN Listener. Let's assume the node where the one SCAN Listener is running on dies. Grid Infrastructure on a surviving node will start the SCAN Listener on some surviving node. It does take some time for GI to detect the failure and start it on some node. During this time, applications will not be able to connect so you'd lose high availability. Also consider the scenario where the SCAN Listener fails for some reason but the node it was running on is still operational. In that case, the SCAN Listener will not be restarted anywhere. You want more than one for high availability.

There are three ways we can configure our RAC environment:
1) Non-DNS (an IP-based RAC configuration, where only 1 SCAN IP will work)
2) DNS (a minimum of 3 SCAN IPs is recommended)
3) GNS (which uses DHCP)
 

Before Configuration:

$ srvctl config scan
SCAN name: prddbscan, Network: 1/101.10.1.1/255.255.255.192/en0
SCAN VIP name: scan1, IP: /prddbscan/101.10.1.4
$

Starting Configuration:
Step:1  - add three IPs in DNS , e.g.,

101.10.1.5            hrdbscan.hrapplication.com
101.10.1.6            hrdbscan.hrapplication.com
101.10.1.7            hrdbscan.hrapplication.com

Step:2 - Stop all node scan listeners

$ srvctl stop scan_listener

Step:3 - Create the file below in the /etc location, adding the domain name

# vi /etc/resolv.conf
search domain hrapplication.com
nameserver      101.10.9.9

Note: The DNS/AD domain is "hrapplication.com" and the name server IP is 101.10.9.9

Step: 4 - Verify with nslookup in all nodes - all node should show configured three IPs

# nslookup hrdbscan.hrapplication.com
Server:         101.10.9.9
Address:        101.10.9.9#53

Name:   hrdbscan.hrapplication.com
Address: 101.10.1.7
Name:   hrdbscan.hrapplication.com
Address: 101.10.1.6
Name:   hrdbscan.hrapplication.com
Address: 101.10.1.5

Note: If your DNS server does not return a set of 3 IPs as shown above, or does not round-robin, ask your network administrator to enable such a setup. DNS using a round-robin algorithm on its own does not ensure failover of connections. However, the Oracle Client typically handles this. It is therefore recommended that the minimum version of the client used is the Oracle Database 11g Release 2 client.

Step: 5 - modify scan
#./srvctl modify scan -n hrdbscan.hrapplication.com
#./srvctl modify scan_listener -u

-- again verify

# ./srvctl config scan 

Step: 6 - start the scan listener

#./srvctl start scan_listener
#./srvctl status scan_listener

Step: 7 - Now stop cluster services and start it again to effect

./crsctl stop crs -- one by one node

./crsctl start crs

./crsctl stat res -t
./crsctl check crs

Step: 8 - check the services

./crsctl stat res -t

HOW CONNECTION LOAD BALANCING WORKS USING SCAN

For clients connecting using Oracle SQL*Net 11g Release 2, three IP addresses will be received by the client by resolving the SCAN name through DNS as discussed. The client will then go through the list it receives from the DNS and try connecting through one of the IPs received. If the client receives an error, it will try the other addresses before returning an error to the user or application. This is similar to how client connection failover works in previous releases when an address list is provided in the client connection string.

When a SCAN Listener receives a connection request, the SCAN Listener will check for the least loaded instance providing the requested service. It will then re-direct the connection request to the local listener on the node where the least loaded instance is running. Subsequently, the client will be given the address of the local listener. The local listener will finally create the connection to the database instance.
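This redirection works because every instance registers its services with both the SCAN listeners and its local listener. A minimal sketch of the relevant parameters follows (the SCAN name, port, VIP and SID are illustrative; a standard Grid Infrastructure installation sets these automatically):

SQL> alter system set remote_listener='hrdbscan.hrapplication.com:1521' scope=both sid='*';
SQL> alter system set local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=node1-vip)(PORT=1521))' scope=both sid='hrdb1';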

 
This document may help you. You can refer Oracle support documents for more clarification.

Case -1: Unable to start database instances after starting all cluster services:

When I tried to start database instances after starting all node cluster services, I found below error.

SQL> startup nomount;
ORA-00119: invalid specification for system parameter REMOTE_LISTENER
ORA-00132: syntax error or unresolved network name 'hrapplication.com:1521'

After some verification, I found that somebody had disabled the options below during a RAM upgrade.

# vi /etc/resolv.conf
#search domain hrapplication.com
#nameserver      101.10.9.9

I uncommented them as shown below, then started the database services and everything was fine.

# vi /etc/resolv.conf
search domain hrapplication.com
nameserver      101.10.9.9
 

Approach to Troubleshoot an Abended OGG Process


Oracle GoldenGate is a heterogeneous replication tool. It is very easy to install and configure. The real challenge comes when a process ABENDs. Sometimes it is easy to detect the problem, but sometimes we really do not know how to proceed or how to troubleshoot the issue.

This article explains,

1. Levels of Failure in Oracle GoldenGate.
2. The approach to Troubleshoot Oracle GoldenGate.
3. How to identify the issue.
4. What are the files to be looked for Troubleshooting Oracle GoldenGate.
5. Tools to Monitor and Troubleshoot Oracle GoldenGate.

1. LEVELS OF FAILURE

Oracle GoldenGate can abend or fail at different levels, and there can be many reasons for the process failures. The different levels at which the Oracle GoldenGate processes fail or abend are:

1. Database Level
2. Network Level
3. Storage Level
4. User Level

a. DATABASE LEVEL OF FAILURE

Oracle GoldenGate also fails if you have issues at the Database Level. Below are some of the issues listed.,

Tablespace filled
Redo log corruption
Archive log destination filled
No Primary Key or Unique Index on the tables
Archive log Mode not enabled
Reset Log performed
Memory Problem with Streams_Pool_Size 
Database Hung

b. NETWORK LEVEL OF FAILURE

Network plays a vital role in Oracle GoldenGate replication. For each command you execute at the GGSCI prompt, the Manager process opens a port. There should be a reliable, fast network between the source and target sides. Some network-level failures are listed below:

Network Fails
Network slow
Ports Unavailability
Firewall Enabled

c. STORAGE LEVEL OF FAILURE

There should be sufficient storage space available for the Oracle GoldenGate to keep the Trail files. Even at Oracle Database level, there should be sufficient space to retain the Archive Log files and also space for tablespaces. Proper privileges should be given to the file system so that, Oracle GoldenGate Processes creates the trail files in the location.

File System fills
File System corruption
No Proper privileges given to the File System
Connection Problem between Database Server and Storage
No Free Disks Available in Storage

d. USER LEVEL OF FAILURE

Of course, we users make mistakes too. Some user-level failures are listed below:

Mistakenly Delete the GoldenGate Admin User at Database level.
Manually Performing Operations like Insert, Delete and Update at Target Side.
Manually deleting / removing the Trail Files either from Source server or Target server.
Forcefully Stopping any Oracle GoldenGate Processes like Manager, Extract, Pump, Collector or Replicat.
Killing the Oracle GoldenGate Processes at OS level.
Performing an ETROLLOVER at Extract / Pump / Replicat Processes.

So we have seen the different levels of failure in Oracle GoldenGate. How do you proceed when you face these failures in day-to-day work? What is the approach to identify and solve the issue?

2. HOW TO APPROACH?

Below are the steps for approaching the problem. If the environment is a known one, you can skip some of them.

Learn and Understand the Environment
Operating System provider and version
Database Provider and Database Version
Is it a Cluster, Active / Passive?
Oracle GoldenGate UniDirectional or Bi-Directional
If Oracle, then is it a Single Instance or RAC – Real Application Clusters
Is it a Homogeneous or Heterogeneous Environment Replication
Network Flow, Ports Used and Firewalls configured
Components used in Oracle GoldenGate like Extract, Pump, Replicat processes and Trails files.

After covering the prerequisites such as studying the environment, check whether the processes are up and running. INFO ALL is the command to check the status of the processes (a sample output is sketched after the status list below). A process can be in one of the following statuses.

RUNNING
The Process has started and running normally.

STOPPED
The Process has stopped either normally (Controlled Manner) or due to an error.

STARTING
The Process is starting.

ABENDED
The Process has been stopped in an uncontrolled manner. Abnormal End is known as ABEND.
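A minimal sketch of INFO ALL output showing these statuses (group names, lag values and the exact column layout vary by GoldenGate version and are illustrative here):

GGSCI> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt
MANAGER     RUNNING
EXTRACT     RUNNING     EXT1        00:00:00      00:00:07
EXTRACT     ABENDED     PMP1        00:00:00      02:13:45
REPLICAT    RUNNING     REP1        00:00:00      00:00:03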

Of the statuses above, RUNNING, STOPPED and ABENDED are common. But what is STARTING? What actually happens when an Oracle GoldenGate process is in this state?

Whenever you start an abended Extract process, it takes some time to get started, because the process is recovering from its last abend point. To recover its processing state, the Extract process searches back through the online redo log files or archive log files to find the first log record of the transactions that were open when it crashed. The further back the Extract process has to search, the longer it takes to recover and start. So the startup time depends on how far back in the redo logs or archive logs the oldest open transaction is.

To check the status of the Extract process, and to check whether it is recovering properly, issue the commands sketched below.
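A minimal sketch of those commands (the group name EXT1 is illustrative):

GGSCI> info extract EXT1, showch
GGSCI> send extract EXT1, status

The first shows checkpoint and recovery details; the second reports whether the process is still "In recovery" or "Recovery complete".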

THREE BASIC FILES

There are many files that need to be checked whenever you face an issue in Oracle GoldenGate. Oracle GoldenGate logs its activity mainly in the following files:

1. Error Log File – ggserr.log
2. Report File
3. Discard File
4. Trace File
5. Alert Log File
6. DDL Trace File

The first three are the most basic, and can be called the major files to be looked into whenever there are problems in Oracle GoldenGate. Below is the explanation of these three files.

What is Error Log file – ggserr.log?

This file is created during the installation of Oracle GoldenGate, in the Oracle GoldenGate home directory, with the name ggserr.log. Each installation has its own ggserr.log in its respective directory. The file is updated by all Oracle GoldenGate processes, and the following information is logged in it:

 Start and Stop of the Oracle GoldenGate Processes.
 Processing Information like Bounded Recovery operations.
 Error Messages.
 Informational messages like normal operations happening in Oracle GoldenGate.
 WARNING Messages like Long Running Transactions.
 Commands executed in GGSCI Prompt.

The format in which the Oracle GoldenGate processes log information into ggserr.log is sketched below. You can also view this file from the GGSCI prompt with the command VIEW GGSEVT, but it is usually better to view it with an OS tool, because the file can grow very large.
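A minimal sketch of what ggserr.log entries look like (the path, timestamps, message IDs and group names are illustrative):

$ tail -f /u01/app/ogg/ggserr.log
2018-07-17 10:15:32  INFO    OGG-00987  Oracle GoldenGate Command Interpreter for Oracle:  GGSCI command (oracle): start extract ext1.
2018-07-17 10:16:05  ERROR   OGG-00446  Oracle GoldenGate Capture for Oracle, ext1.prm:  Could not find archived log for sequence 1234.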

So with the ggserr.log file you can basically identify:

 What the error is
 When the error occurred
 How frequently it occurred
 What operations were performed before the error occurred

What is Report File?

A report file is a process-specific log file. Each process has its own report file, created when the process is instantiated. The file is stored in the directory /dirrpt with the extension .rpt, and it is automatically renamed on the next instantiation of the process. Once a process starts, all log entries for that process are written to its respective report file.

Let’s consider a process called EXT and the report file during instantiation of this process is called as EXT.rpt. If this process is stopped and started again, existing file EXT.rpt will be automatically renamed to EXT0.rpt and a new file will be generated with the name EXT.rpt and this occurs recursively till the value of the sequence reaches 9. If the last report file name for the process EXT is created as EXT9, now during the new file generation, the last file EXT9.rpt will be removed and EXT8.rpt will be renamed as EXT9.rpt. So, the report file with the lower sequence value will be the latest and younger one when compared with older sequence valued report file.

The REPORTROLLOVER parameter is used to force creation of a new report file for a process. To view the current report of a process, and to get its runtime statistics, use the commands sketched below.
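A minimal sketch (EXT1 is an illustrative group name):

GGSCI> view report EXT1
GGSCI> send extract EXT1, report
GGSCI> stats extract EXT1

VIEW REPORT shows the current report file, SEND ... REPORT writes interim runtime statistics into it, and STATS prints operation counts per table.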

The below information can be seen in the report file of a particular process.,

 Oracle GoldenGate Product Version and Release
 Operating Version, Release, Machine Type, Hostname and Ulimit settings of the respective process
 Memory Utilized by the respective process
 Configured Parameters of the respective Oracle GoldenGate Process
 Database Provider, Version and Release
 Trail files Information
 Mapping of Tables
 Informational messages with respect to a particular process
 Warning messages with respect to a particular process
 Error messages with respect to a particular process
 All DDL Operations performed.
 All the Discarded Errors and Ignored Operations
 Crash dumps
 Any commands which are performed on that particular process.

Below is an outline of the Report file, which I have split into parts for clarity.

1. Oracle GoldenGate Product Version and Release. Operating Version, Release, Machine Type, Hostname and Ulimit settings of the respective process

2. Configured Parameters of the respective Oracle GoldenGate Process

3. Database Provider, Version, Release and Trail File information.

4. Mapping of tables and Informational messages with respect to the Process.

5. Crash dump and Error messages of the respective process.

The above clearly shows the contents of a Report file. So with the help of a Report file, the following can be determined:

 In which trail file the process abends
 Whether the trail file is moving forward
 Whether the process keeps failing on the same trail file
 What operations were performed before the process abended
 Whether there are any errors in the parameter configuration
 Whether the MAP statements have the correct table names

What is Discard File?

The discard file logs failed operations of the Oracle GoldenGate processes and is mainly used for data errors. In Oracle GoldenGate 11g this file is not created by default; you have to specify the DISCARDFILE keyword to enable discard file logging. From Oracle GoldenGate 12c onwards, the file is generated by default when the process is instantiated.

The naming format of the log file is ., but the file can be named manually when enabling it. The extension of this file is .DSC and it is located in the directory /dirrpt.

The PURGE and APPEND keywords are used in the process parameter files to manually maintain the discard file. Similar to the report file, the discard file can also be rolled over, using the DISCARDROLLOVER parameter. The DISCARDFILE options are described below.

file_name
The relative or fully qualified name of the discard file, including the actual file name.

APPEND
Adds new content to existing content if the file already exists.

PURGE
Purges the file before writing new content.

MAXBYTES n | MEGABYTES n
File size in bytes or megabytes. For a size in bytes the valid range is 1 to 2147483646 (default 50000000). For a size in megabytes the valid range is 1 to 2147 (default 50 MB). If the specified size is exceeded, the process abends.

NODISCARDFILE
When using this parameter, there will be no discard file creation. It prevents generating the Discard file.

Below is a sketch of the discard file parameter as used in a Replicat process parameter file.
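A minimal sketch (the group name, login and file size are illustrative):

REPLICAT rep1
USERID ggadmin, PASSWORD *****
DISCARDFILE ./dirrpt/rep1.dsc, APPEND, MEGABYTES 100
MAP scott.emp, TARGET scott.emp;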

The discard file is mainly used on the target side. Each and every Replicat process should have its own discard file; this is mandatory.

As an example, a Replicat process can abend with OCI Error ORA-01403: no data found; the failing SQL statement and the record that could not be applied are written to its discard file.

So we have seen the three basic and important files where the Oracle GoldenGate processes log information. There is also a tool used to troubleshoot Oracle GoldenGate during data or trail file corruption; it is mainly used when data errors occur.

The tool is called LOGDUMP. It is a very useful tool that allows a user to navigate through a trail file and compare its contents with the data extracted and replicated by the processes. The following can be seen in the trail file using the LOGDUMP utility:

 Transactions Information
 Operation type and Time when the Record written.
 Source Object name
 Image type, whether it is a Before Image or After Image.
 Column information with data and sequence information.
 Record length, Record data in ASCII format.
 RBA Information.

Logdump displays these trail file contents record by record. Some of the Logdump commands, with descriptions, are below. To get to the Logdump prompt, just run the logdump program from the Oracle GoldenGate home directory.

Logdump 1> GHDR ON – To view the Record Header.

Logdump 2> DETAIL ON – To view the column information.

Logdump 3> DETAIL DATA – To view the Hex and ASCII values of the Column.

Logdump 4> USERTOKEN ON – User-defined information specified in the TABLE or MAP statements. This information is stored in the trail file.

Logdump 4> GGSTOKEN ON – Oracle GoldenGate generated tokens. These tokens contain the Transaction ID, Row ID, etc.

Logdump 5> RECLEN length – Manually control the length of the record.

Logdump 6> OPEN file_name – To open a Trail file.

Logdump 7> NEXT – To move to the next File record. In short, you can use the letter N.

Logdump 8> POS rba – To position to a particular RBA.

Logdump 9> POS FIRST – To go to the position of the first record in the file.

Logdump 10> POS 0 – This is the alternate command for the POS FIRST. Either of this can be used.

Logdump 11> SCANFORENDTRANS – To go to the end of the transaction.

Logdump 12> HELP – To get the online help.

Logdump 13> EXIT – To exit from the Logdump prompt. You can also use QUIT alternatively.

Hopefully this gives a clear view of how to approach an Oracle GoldenGate problem, find out who stopped an Oracle GoldenGate process, and determine the reason behind it.

 

-----------------------Standby--------------



Comparative study of Standby database from 7.3 to latest version

12c Data Guard Agenda
- Physical Standby
- Logical Standby
- Snapshot Standby
- Active Data Guard
- Data Guard Broker
- Architecture
- Configurations
- Standby Creations using Commands and OEM
- 12c NFs (Far Sync, Fast Sync, etc.)
- Far Sync Lab
- Data Protection Modes
- Role Transitions
- Flashback Database
- Fast-Start Failover (Observer Software)
- Backup and Recovery in DG
- Patching
- Optimization DG



High Availability Solutions from Oracle
- RAC
- RAC ONE Node
- Data Guard
- Golden Gate
- Oracle Streams

History

Version 7.3
- keeping duplicate DB in a separate server
- can be synchronized with Primary Database
- was constantly in Recovery Mode
- NOT able to automate the transfer of Archive Redo Logs
and Apply
- DBAs had to find their own way to transfer
and apply Archive Redo Logs
- aim was disaster recovery


Version 8i
- Archive log shipping and apply process automatic
- which is now called
- managed standby environment (log shipping)
- managed recovery (apply process)
- was not possible to set a DELAY in the managed recovery mode
- possible to open a Standby with read-only mode
for reporting purpose
- when we added a Data File or created a tablespace on Primary,
these changes were NOT being replicated to the Standby
- when we opened the Primary with resetlogs
or restored a backup control file,
we had to re-create the Standby



Version 9i
- Oracle 8i Standby was renamed to Oracle 9i Data Guard
- introduced Data Guard Broker
- ZERO Data Loss on Failover was guaranteed
- Switchover was introduced (primary <> standby)
- Gap resolution (missing logs are detected
and transmitted automatically)
- DELAY option was added
- parallel recovery increases recovery performance on the Standby
- Logical Standby was introduced



Version 10g
- Real-Time Apply (provides faster switchover and failover)
- Flashback Database support was introduced
- if we open a Primary with resetlogs,
it was NOT required to re-create the Standby
- Standby was able to recover through resetlogs
- Rolling Upgrades of Primary Database
- Fast-Start-Failover (Observer Software)

Version 11g
- Active Data Guard
- Snapshot Standby (possible with 10g R2
guaranteed restore point)
- continuous archived log shipping with snapshot standby
- compress REDO when resolving Gaps => 11g R1
- compress of all REDO => 11g R2
- possible to include different O/S in DataGuard
- recovery of Block corruptions automatic
for Active Data Guard
- "Block Change Tracking" can be run on Active Data Guard

Version 12c
- Far Sync
- Fast Sync
- Session Sequence
- temp as UNDO
- Rolling Upgrade using PL/SQL Package (DBMS_Rolling)


LNS  = Log Writer Network Server
RFS  = Remote File Server
MRP  = Managed Recovery Process
LSP  = Logical Standby Process
DMON = Data Guard Broker Monitor Process
NSS  = Network Server SYNC


MRP = coordinates the read and apply of REDO on the Physical Standby

RFS = responsible for receiving the REDO data
sent by the Primary to the Standby

LGWR and SYNC
- REDO is read and sent to the Standby directly
from the log buffer by the LNS process
- an acknowledgement is needed from the standby (RFS to LNS
and LNS to LGWR) before the COMMIT ack is sent
to the database user


LGWR and ASYNC
- NO ack is needed from the standby before the COMMIT ack
is sent to the Primary database user
- redo is read and sent to the standby from the redo log buffer
or from the online redo logs by the LNS process
- if the redo log buffer is recycled before LNS has read it,
LNS automatically reads and sends the redo data
from the online redo logs instead
- committed transactions that were not yet shipped
to the standby may be lost in a Failover




FAL_CLIENT
- no longer required from 11g R2
- primary DB will obtain the Client Service Name
from the related LOG_ARCHIVE_DEST_n



How to start REDO Apply as Foreground Process?

alter database recover managed standby database;


How to start REDO Apply as Background Process?

use DISCONNECT FROM SESSION option,


alter database recover managed standby database
disconnect from session;


How to cancel REDO Apply?

alter database recover managed standby database cancel;


I found this... check "4.7.10.1 Data Guard Status" in http://docs.oracle.com/cd/E11857_01/em.111/e16285/oracle_database.htm#CHDBEAFG


How to Resolve Primary/Standby Log GAP In Case of Deleting Archivelogs From Primary?
I will write about resolving the Primary/Standby log gap in the case where some archive log files were deleted from the primary. Suppose we do not have a backup of the deleted archive files. Normally we (DBAs) should never allow such a situation, but it can happen. In this case, we need to find the current SCN numbers of the Primary and Standby databases.

1- Find the current SCN with the following query on the Primary.
SQL> select current_scn from v$database;

CURRENT_SCN
-----------
 1289504966

2- Find the current SCN with the following query on the Standby.
SQL> select current_scn from v$database;

CURRENT_SCN
-----------
 1289359962

Using the function scn_to_timestamp(SCN_NUMBER) you can check the time difference between primary and standby.
3- Stop apply process on the Standby database.
SQL> alter database recover managed standby database cancel;
4– Shutdown the Standby database.
SQL> shutdown immediate;
5- Take incremental backup from the latest SCN number of the Standby database on the Primary database. And copy backup to the standby server.
RMAN> backup incremental from scn 1289359962 database;
# scp /backup_ISTANBUL/dun52q66_1_1 oracle@192.168.2.3:/oracle/ora11g
6- Create new standby control file on the Primary database. And copy this file to standby server.
SQL> alter database create standby controlfile as '/oracle/ora11g/standby.ctl';
# scp /oracle/ora11g/standby.ctl oracle@192.168.2.3:/oracle/ora11g
7- Open the Standby database on NOMOUNT state to learn control files location.
SQL> startup nomount
SQL> show parameter control_files
8- Replace new standby control file with old files.
# cp /oracle/ora11g/standby.ctl /oracle/ora11g/ISTANBUL/data1/control01.ctl
# cp /oracle/ora11g/standby.ctl /oracle/ora11g/ISTANBUL/data2/control02.ctl
9- Open the Standby database on MOUNT state.
SQL> alter database mount standby database;
10- Connect to the RMAN and register backup to catalog.
# rman target /
RMAN> catalog start with '/oracle/ora11g';
It will ask for confirmation; answer YES.
11- Now, you can recover the Standby database. Start recover database.
RMAN> recover database;
When the database recovery is finished, RMAN searches for the next archived log and reports an ORA-00334 error. Don't worry about it in this case. Exit from RMAN and start the apply process on the standby database.
SQL> alter database recover managed standby database disconnect from session;
We solved the Primary/Standby log gap with an RMAN incremental backup. When we are faced with such a situation, we do not need to think about rebuilding the standby database from scratch, because time is very valuable to us.

SRL – standby redo log


How the standby apply process works


First consider a system where SRLs are not configured on the standby database. Redo travels along this path:

A transaction writes redo records into the Log Buffer in the System Global Area (SGA).
The Log Writer process (LGWR) writes redo records from the Log Buffer to the Online Redo Logs (ORLs).
When the ORL switches to the next log sequence (normally when the ORL fills up), the Archiver process (ARC0) will copy the ORL to the Archived Redo Log.
Because a standby database exists, a second Archiver process (ARC1) will read from a completed Archived Redo Log and transmit the redo over the network to the Remote File Server (RFS) process running for the standby instance.
RFS sends the redo stream to the local Archiver process (ARCn).
ARCn then writes the redo to the archived redo log location on the standby server.
Once the archived redo log is completed, the Managed Recovery Process (MRP0) sends the redo to the standby instance for applying the transaction.



With SRLs, not only are more components involved, we also have different choices, i.e. different paths to get redo from the primary to the standby. The first choice is whether we are configured for Max Protect or Max Performance; I will discuss its impact below.

Just like without SRLs, a transaction generates redo in the Log Buffer in the SGA.
The LGWR process writes the redo to the ORL.
Are we in Max Protect or Max Performance mode?
If Max Protect, then we are performing SYNC redo transport. The Network Server SYNC process (NSSn) is a slave process to LGWR. It ships redo to the RFS process on the standby server.
If Max Performance mode, then we are performing ASYNC redo transport. The Network Server ASYNC process (NSAn) reads from the ORL and transports the redo to the RFS process on the standby server.
RFS on the standby server simply writes the redo stream directly to the SRLs.
How the redo gets applied depends if we are using Real Time Apply or not.
If we are using Real Time Apply, MRP0 will read directly from the SRLs and apply the redo to the standby database.
If we are not using Real Time Apply, MRP0 will wait for the SRL’s contents to be archived and then once archived and once the defined delay has elapsed, MRP0 will apply the redo to the standby database.
Best Practices

I’ve already covered a few best practices concerning SRLs. I’ll recap what I have already covered and include a few more in this section.

Make sure your ORL groups all have the same exact size.  You want every byte in the ORL to have a place in its corresponding SRL.
Create the SRLs with the same exact byte size as the ORL groups. If they can’t be the same exact size, make sure they are bigger than the ORLs.
Do not assign the SRLs to any specific thread.  That way, the SRLs can be used by any thread, even with Oracle RAC primary databases.
When you create SRLs in the standby, create SRLs in the primary. They will normally never be used. But one day you may perform a switchover operation. When you do switchover, you want the old primary, now a standby database, to have SRLs. Create them at the same time.
For an Oracle RAC primary database, create the number of SRLs equal to the number of ORLs in all primary instances. For example, if you have a 3-node RAC database with 4 ORLs in each thread, create 12 SRLs (3x4) in your standby. No matter how many instances are in your standby, the standby needs enough SRLs to support all ORLs in the primary, for all instances
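A minimal sketch of creating SRLs that follow these best practices (file paths, sizes and the number of groups are illustrative; no THREAD clause is used, so any thread can use them):

SQL> alter database add standby logfile ('/u01/oradata/orcl/srl10.log') size 512m;
SQL> alter database add standby logfile ('/u01/oradata/orcl/srl11.log') size 512m;
SQL> alter database add standby logfile ('/u01/oradata/orcl/srl12.log') size 512m;
SQL> alter database add standby logfile ('/u01/oradata/orcl/srl13.log') size 512m;
SQL> select group#, thread#, bytes/1024/1024 mb, status from v$standby_log;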

DB_NAME – both the primary and the physical standby database will have the same database name. After all, the standby is an exact copy of the primary and its name needs to be the same.
DB_UNIQUE_NAME – While both the primary and the standby have the same database name, they have a unique name. In the primary, DB_UNIQUE_NAME equals DB_NAME. In the standby, DB_UNIQUE_NAME does not equal DB_NAME.  The DB_UNIQUE_NAME will match the ORACLE_SID for that environment.
LOG_ARCHIVE_FORMAT – Because we have to turn on archive logging, we need to specify the file format of the archived redo logs.
LOG_ARCHIVE_DEST_1 – The location on the primary where the ORLs are archived. This is called the archive log destination.
LOG_ARCHIVE_DEST_2 – The location of the standby database.
REMOTE_LOGIN_PASSWORDFILE – We need a password file because the primary will sign on to the standby remotely as SYSDBA.
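A minimal sketch of these primary-side settings (database names, paths and the standby service name are illustrative):

*.db_name='orcl'
*.db_unique_name='orcl_prim'
*.log_archive_format='arch_%t_%s_%r.arc'
*.log_archive_dest_1='LOCATION=/u01/arch VALID_FOR=(ALL_LOGFILES,ALL_ROLES) DB_UNIQUE_NAME=orcl_prim'
*.log_archive_dest_2='SERVICE=orcl_stby ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=orcl_stby'
*.remote_login_passwordfile='EXCLUSIVE'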


Create the Standby Database

This is the second major task to complete. In this section, we will create a backup of the primary and use it to create the standby database. We need to create a special standby control file. A password file, a parameter file, and a similar TNS alias will complete the setup. The subtasks are outlined below.

Create a backup of the primary.
Create a standby controlfile.
Copy the password file.
Create a parameter file.
Create a TNSNAMES.ORA file.
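A minimal sketch of those subtasks from the primary side (paths, SID and host names are illustrative; the TNS alias is then added on both sides):

RMAN> backup database plus archivelog;
SQL> alter database create standby controlfile as '/tmp/standby.ctl';
SQL> create pfile='/tmp/initorcl_stby.ora' from spfile;
$ scp $ORACLE_HOME/dbs/orapworcl standby-host:$ORACLE_HOME/dbs/
$ scp /tmp/standby.ctl /tmp/initorcl_stby.ora standby-host:/tmp/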


How to Speed up and Troubleshooting MRP (Log Apply Rate of a Standby Database) Stuck Issues
To Speed up MRP on Standby database, Check for

1) parallel_execution_message_size - this is an OS dependent parameter
2) recovery_parallelism - this will be dictated by the numbers of CPU's and your ability to handle IO
3) Consider increasing sga_target parameter, if it's set to low.
4) Check for Disk I/O. Move you I/O intensive files to faster disks including Online Redo log and Standby redo log files.

I've came across below these links, which I've found very useful in this regards,


Dataguard Performance
Edit (07,2013):
The following information is important about Physical Data Guard Redo Apply performance:
11g Media Recovery performance improvements include:
•More parallelism by default
•More efficient asynchronous redo read, parse, and apply
•Fewer synchronization points in the parallel apply algorithm
•The media recovery checkpoint at a redo log boundary no longer blocks the apply of the next log

In 11g, when tuning redo apply consider following:

•By default recovery parallelism = CPU Count-1. Do not use any other values.
•Keep PARALLEL_EXECUTION_MESSAGE_SIZE >= 8192
•Keep DB_CACHE_SIZE >= Primary value
•Keep DB_BLOCK_CHECKING = FALSE (if you have to)
•System Resources Needs to be assessed
•Query what MRP process is waiting
select a.event, a.wait_time, a.seconds_in_wait
  from gv$session_wait a, gv$session b
 where a.sid = b.sid
   and b.sid = (select sid from v$session
                 where paddr = (select paddr from v$bgprocess where name = 'MRP0'));

Check: Active Data Guard 11g Best Practices Oracle Maximum Availability Architecture White Paper


When tuning redo transport service, consider following:

1 - Tune LOG_ARCHIVE_MAX_PROCESSES parameter on the primary.
•Specifies the parallelism of redo transport
•Default value is 2 in 10g, 4 in 11g
•Increase if there is high redo generation rate and/or multiple standbys
•Must be increased up to 30 in some cases.
•Significantly increases redo transport rate.
2 - Consider using Redo Transport Compression:
•In 11.2.0.2 redo transport compression can be always on
•Use if network bandwidth is insufficient
•and CPU power is available


Also consider:
3 - Configuring TCP Send / Receive Buffer Sizes (RECV_BUF_SIZE / SEND_BUF_SIZE)
4 - Increasing SDU Size
5 - Setting TCP.NODELAY to YES
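A minimal sketch of the transport-side settings mentioned above (values and the standby service name are illustrative; redo transport compression requires the Advanced Compression option):

SQL> alter system set log_archive_max_processes=8 scope=both sid='*';
SQL> alter system set log_archive_dest_2='SERVICE=orcl_stby ASYNC COMPRESSION=ENABLE VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=orcl_stby' scope=both sid='*';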



Check: Redo Transport Services Best Practices Chapter of Oracle® Database High Availability Best Practices 11g Release 1
-------------------------------------------------------------------
Original Post:
Problem: The recovery service had stopped for a while and a gap formed between the primary and standby sides. After the recovery process was started again, the standby side was not able to catch up with the primary because of low log apply performance. Disk I/O and memory utilization on the standby server were nearly 100%.

Solution:
1 – Rebooting the standby server reduced memory utilization a little.
2 – ALTER DATABASE RECOVER MANAGED STANDBY DATABASE PARALLEL 8 DISCONNECT FROM SESSION;
In general, using the parallel recovery option is most effective at reducing recovery time when several datafiles on several different disks are being recovered concurrently. The performance improvement from the parallel recovery option is also dependent upon whether the operating system supports asynchronous I/O. If asynchronous I/O is not supported, the parallel recovery option can dramatically reduce recovery time. If asynchronous I/O is supported, the recovery time may be only slightly reduced by using parallel recovery.
3 – SQL>alter system Set PARALLEL_EXECUTION_MESSAGE_SIZE = 4096 scope = spfile;
Set PARALLEL_EXECUTION_MESSAGE_SIZE = 4096
When using parallel media recovery or parallel standby recovery, increasing the PARALLEL_EXECUTION_MESSAGE_SIZE database parameter to 4K (4096) can improve parallel recovery by as much as 20 percent. Set this parameter on both the primary and standby databases in preparation for switchover operations. Increasing this parameter requires more memory from the shared pool by each parallel execution slave process.
4 – Kernel parameters that changed in order to reduce file system cache size.
dbc_max_pct 10 10 Immed
dbc_min_pct 3 3 Immed
5 – For secure path (HP) load balancing, SQL Shortest Queue Length is chosen.
autopath set -l 6005-08B4-0007-4D25-0000-D000-025F-0000 -b SQL







Oracle 12c multiple physical standby setup and consideration






I have a question regarding having multiple physical standby DBs at the same time (redo apply/log shipping) on the same server: is it possible? If so, what would be the DB_NAME for my second standby, based on the values of the parameters below?

Primary
*.DB_NAME=CHICAGO
*.db_unique_name='CHICAGO'
*.log_archive_config='DG_CONFIG=(CHICAGO, BOSTON, TORONTO)'
*.log_archive_dest_1='location=/adjarch/CHICAGO reopen=60'
*.log_archive_dest_2='SERVICE=BOSTON NOAFFIRM ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=BOSTON'
*.log_archive_dest_3='SERVICE=TORONTO NOAFFIRM ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=TORONTO'
*.log_archive_dest_state_2='DEFER'
*.log_archive_dest_state_3='DEFER'
*.log_archive_format='archCHICAGO_%t_%s_%r.log'

target1:
*.DB_NAME=CHICAGO
*.db_unique_name='BOSTON'
*.log_archive_config='DG_CONFIG=(CHICAGO, BOSTON)'
*.log_archive_dest_1='location=/adjarch/CHICAGO reopen=60'
*.log_archive_dest_2='SERVICE=CHICAGO NOAFFIRM ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=CHICAGO'

Target2:
*.DB_NAME=????
*.db_unique_name='TORONTO'
*.log_archive_config='DG_CONFIG=(CHICAGO,TORONTO)'
*.log_archive_dest_1='location=/adjarch/CHICAGO reopen=60'
*.log_archive_dest_3='SERVICE=CHICAGO NOAFFIRM ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=CHICAGO'
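A minimal sketch of the missing Target2 value, following the DB_NAME rule discussed earlier in this section (a physical standby keeps the primary's DB_NAME; only DB_UNIQUE_NAME differs):

Target2:
*.DB_NAME=CHICAGO
*.db_unique_name='TORONTO'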



In this paper, I thought I would provide a few things that you may not know about SRLs. Some of this information was covered in the previous article, but it’s good to have all of this in one location. I write these items in no particular order.

Do not assign SRLs to a specific thread. – There is a temptation for DBAs who work on Oracle RAC databases to assign a SRL to a specific thread of redo.  The RAC DBA is already familiar with creating Online Redo Logs (ORLs) for a specific thread, one for each instance in the RAC database. So they must do similarly for SRLs, correct? The answer is no. Do not assign SRLs to a specific thread. If the SRL is assigned to a specific thread, then it can only be used by that thread and no other. If the SRL is not assigned to a thread, it can be used by any thread.

SRLs do not rotate like ORLs. – Most DBAs are used to seeing Online Redo Logs rotate. If there are three ORL groups, redo will be written to group 1, then group 2, and then group 3, and then back to group 1 again. Many DBAs working with SRLs for the first time assume SRLs rotate the same way, but they do not. If you have SRL groups 10, 11, 12, and 13 then the first redo transport stream will be written to SRL group 10. The next one will be written to SRL group 11. If group 10 becomes available again, the third redo stream will be written to SRL group 10. It is possible that SRL group 13 never gets used.

You should have one more SRL group than ORL groups – If you go back to the article I linked at the start, there is a second diagram showing the flow of redo when SRLs are in place. Either MRP0 or ARCn is reading from the SRL and applying redo or creating an archived redo log. No matter which route is taken for the redo, the process can take some time. It is a good idea to have an extra SRL in case the redo writes from the SRLs take extra time. Remember, for Oracle RAC primary databases, to count the groups from all primary threads.

SRLs provide a near-zero data loss solution, even in Max Performance mode. – As I stated in the previous article, SRLs are great for achieving a near-zero data loss solution even in Max Performance mode. If you look at the second diagram in that article, you can see that the primary will transport redo to the SRL in near real time. You would use Max Protect mode when you absolutely cannot afford data loss, but SRLs get you close to near-zero data loss which is often good enough for most people. Here is a production database (name changed to protect the innocent) as seen in the DG Broker. We can see the configuration is Max Performance mode.

DGMGRL> show configuration

Configuration - orcl_orcls

  Protection Mode: MaxPerformance
DGMGRL> show database orcls

Database - orcls

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          4 hours 29 minutes 21 seconds (computed 0 seconds ago)
  Average Apply Rate: 5.02 MByte/s

We can also see a transport lag of 0 seconds. I typically see between 0 and 2 seconds of transport lag on this system. One might say that this system does not generate a lot of redo, which is why I included the Average Apply Rate in the output. This standby is applying 5 megabytes per second on average, which means the primary is generating lots of redo. This is not an idle system and I'm still seeing a near-zero data loss implementation.

For RAC standby, any node can receive redo to the SRLs. – If you have an Oracle RAC primary and an Oracle RAC standby, then any node in the primary can transport redo to the SRLs on any node in the standby. This is why I use VIPs for my LOG_ARCHIVE_DEST_2 parameter settings in my configuration. I now have high availability for my redo transport. You may see SRLs active on any node in your RAC standby. That being said, in 12.1.0.2 and earlier, only one node will perform the redo apply.
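As a hedged example of what that can look like (host names, the TNS alias STBY, and the DB_UNIQUE_NAME are hypothetical), the destination points at a TNS alias that lists both standby VIPs:

STBY =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = stbnode1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = stbnode2-vip)(PORT = 1521)))
    (CONNECT_DATA = (SERVICE_NAME = STBY)))

SQL> alter system set log_archive_dest_2='SERVICE=STBY ASYNC NOAFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=STBY' scope=both sid='*';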

If you use an Apply Delay, redo apply won’t take place until the log switch occurs. – If you look at my output from the DG Broker above, you’ll see a 0 second transport lag but an apply lag of almost 4.5 hours. I have set an Apply Delay of 4 hours in this standby database. Because the redo is written to the SRL, the redo from that SRL is not available to be applied until the log switch completes and that log switch passes the same apply delay. In my primary databases, I often set ARCHIVE_LAG_TARGET to 1 hour so that I have ORL switches at least once per hour. The apply lag in the standby will often be between Apply Delay and Apply Delay+ARCHIVE_LAG_TARGET. With my configuration, my apply lag is often between 4 and 5 hours. If you use an apply delay, then it’s a good idea to set the ARCHIVE_LAG_TARGET as well. Too many people miss this point and assume the apply lag should be very close to the Apply Delay setting.
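A minimal sketch of that combination, assuming the broker configuration shown above (the standby is named orcls) and a 4-hour delay:

-- In the broker, set the apply delay on the standby (minutes)
DGMGRL> edit database orcls set property DelayMins=240;

-- On the primary, force a log switch at least once per hour
SQL> alter system set archive_lag_target=3600 scope=both sid='*';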

Redo Transport does not use ARCn if SRLs are in place. – Too often, I see a posting in some forum where it is assumed that ARCn is the only process that transports redo to the standby. If SRLs are in place, NSAn is used instead, or LNS if prior to 12c. Refer back to the diagrams in my earlier paper for details on how log transport works.
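A quick way to see which processes are actually involved (exact process names vary by release, so this is only a sketch):

-- On the standby: RFS receives redo into the SRLs, MRP0 applies it
SQL> select process, status, thread#, sequence# from v$managed_standby;

-- On a 12c primary, the asynchronous transport slaves typically show up as ora_nsa* processes
$ ps -ef | grep ora_nsa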

SRLs should be the same size and at least as large as the biggest ORL. – I try to keep the ORL and SRL groups all set to the same exact size. But if there are mixed ORL sizes, then make sure the SRLs are sized to the largest of the ORL groups. Redo transport can be written to any of the SRLs, so all of the SRLs need to be sized to handle redo from any ORL.
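To check this, compare the sizes of the ORL and SRL groups, for example:

-- ORL group sizes (all threads)
SQL> select thread#, group#, bytes/1024/1024 mb from v$log order by thread#, group#;

-- SRL group sizes; every SRL should be at least as large as the biggest ORL
SQL> select thread#, group#, bytes/1024/1024 mb from v$standby_log order by thread#, group#;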



Why are jobs running slow on a particular node?
How to improve performance of the managed recovery process?

What is an Oracle wallet?
If the extract process fails, how do you troubleshoot it?

How to set up maximum protection and maximum availability databases?

setup:

    There is a Primary Database in Maximum Protection Mode having at least two associated Standby Databases.
    Both Standby Databases are serving the Maximum Protection Mode, i.e. Log Transport Services to these Standby Databases use 'LGWR SYNC AFFIRM'
    One or both Standby Databases are Physical Standby Databases in Active Data Guard Mode or at least open 'READ ONLY'


Behaviour:

If we now try to shut down such a Standby Database which is open READ ONLY, it fails with

ORA-01154: database busy. Open, close, mount, and dismount not allowed

although the remaining Standby Databases are serving the Maximum Protection Mode, too.

In the ALERT.LOG we can find Entries like this:

Attempt to shut down Standby Database
Standby Database operating in NO DATA LOSS mode
Detected primary database alive, shutdown primary first, shutdown aborted

Cause
If the Primary Database is in Maximum Protection Mode, all associated Standby Databases serving this Protection Mode are considered 'No Data Loss' Standby Databases and so cannot be shut down as long as the Primary Database is in this Protection Mode or still alive.
Solution
If you want to shutdown this Standby Database only, there are two Possibilities:

1. Use 'shutdown abort', which will force the shutdown of the Standby Database. Typically this should not harm the Standby Database; however, ensure that Log Apply Services (Managed Recovery) are stopped before you issue this command. So you can use:

SQL> alter database recover managed standby database cancel;
SQL> shutdown abort

2. Set the state of the corresponding log_archive_dest_n serving this Standby Database to 'defer' on the Primary Database (and perform a log switch to make this change effective). You can then shut down the Standby Database in any way after the RFS processes have terminated on it (if they do not terminate in a timely manner, you can also kill them using the OS kill command).

On the Primary set the State to 'defer', eg. for log_archive_dest_2
SQL> alter system set log_archive_dest_state_2='defer' scope=memory;
SQL> alter system switch logfile;

Then on the Standby you can shut down (e.g. shutdown immediate)
SQL> shutdown immediate;

To find out about still alive RFS-Processes and their PID you can use this Query:
SQL> select process, pid, status from v$managed_standby

If you have to kill RFS Processes you can do this using OS Kill-Command:
$ kill -9 <pid>

For both Cases ensure there is at least one surviving Standby Database available still serving the Maximum Protection Mode.
=======================

What will happen if EVMD goes down, and how do you troubleshoot the clusterware?

This note gives the output of the 'ps' command on pre-11gR2 releases of Oracle CRS and shows all clusterware processes running. It also helps to diagnose the state of the clusterware based on the 'ps' output. note:1050908.1 explains the same for 11gR2 onwards.
Solution

Introduction

All the clusterware processes are normally retrieved via OS commands like:
ps -ef | grep -E 'init|d.bin|ocls|sleep|evmlogger|oprocd|diskmon|PID'

There are general processes, i.e. processes that need to be started on all platforms/releases
and specific processes, i.e. processes that need to be started on some CRS versions/platforms

a. the general processes are

ocssd.bin
evmd.bin
evmlogger.bin
crsd.bin

b. the specific processes are

oprocd: run on Unix when vendor Clusterware is not running. On Linux, only starting with 10.2.0.4.
oclsvmon.bin: normally run when a third party clusterware is running
oclsomon.bin: check program of the ocssd.bin (starting in 10.2.0.1)
diskmon.bin: new 11.1.0.7 process for exadata
oclskd.bin: new 11.1.0.6 process to reboot nodes in case rdbms instances are hanging

There are three fatal processes, i.e. processes whose abnormal halt or kill will provoke a node reboot (see note:265769.1):
1. the ocssd.bin
2. the oprocd.bin
3. the oclsomon.bin
The other processes are automatically restarted when they go away.

Overview of the 'ps' output

A. When all clusterware processes are started

1. CRS 10.2.0.3 on Solaris without third party clusterware

ps -ef | /usr/xpg4/bin/grep -E 'init|d.bin|ocls|sleep|evmlogger|UID'
     UID   PID  PPID   C    STIME TTY         TIME CMD
    root     1     0   0   Aug 25 ?          43:22 /sbin/init
    root   799     1   1   Aug 25 ?        1447:06 /bin/sh /etc/init.d/init.cssd fatal
    root   797     1   0   Aug 25 ?           0:00 /bin/sh /etc/init.d/init.evmd run
    root   801     1   0   Aug 25 ?           0:00 /bin/sh /etc/init.d/init.crsd run
    root  1144   799   0   Aug 25 ?           0:00 /bin/sh /etc/init.d/init.cssd daemon
    root  1091   799   0   Aug 25 ?           0:00 /bin/sh /etc/init.d/init.cssd oprocd
    root  1107   799   0   Aug 25 ?           0:00 /bin/sh /etc/init.d/init.cssd oclsomon
  oracle  1342  1144   0   Aug 25 ?         687:50 /u01/app/oracle/crs/10.2/bin/ocssd.bin
    root  1252  1091   0   Aug 25 ?          25:45 /u01/app/oracle/crs/10.2/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90
  oracle  1265  1107   0   Aug 25 ?           0:00 /bin/sh -c cd /u01/app/oracle/crs/10.2/log/artois1/cssd/oclsomon; ulimit -c unl
  oracle  1266  1265   0   Aug 25 ?         125:34 /u01/app/oracle/crs/10.2/bin/oclsomon.bin
    root 22137   799   0 07:10:38 ?           0:00 /bin/sleep 1
  oracle  1041   797   0   Aug 25 ?          68:01 /u01/app/oracle/crs/10.2/bin/evmd.bin
  oracle  1464  1041   0   Aug 25 ?           2:58 /u01/app/oracle/crs/10.2/bin/evmlogger.bin -o /u01/app/oracle/crs/10.2/evm/log/
    root  1080   801   0   Aug 25 ?        2299:04 /u01/app/oracle/crs/10.2/bin/crsd.bin reboot

2. CRS 10.2.0.3 on HP/UX with HP Serviceguard
ps -ef | /usr/bin/grep -E 'init|d.bin|ocls|sleep|evmlogger|UID'
     UID   PID  PPID  C    STIME TTY       TIME COMMAND
    root     1     0  0  Nov 13  ?        12:58 init
    root 17424     1  0  Dec 17  ?        136:39 /bin/sh /sbin/init.d/init.cssd fatal
    root 17425     1  0  Dec 17  ?         0:00 /bin/sh /sbin/init.d/init.crsd run
    root 17624 17424  0  Dec 17  ?         0:00 /bin/sh /sbin/init.d/init.cssd daemon
   haclu 17821 17624  0  Dec 17  ?        268:13 /haclu/64bit/app/oracle/product/crs102/bin/ocssd.bin
    root 17621 17424  0  Dec 17  ?         0:00 /bin/sh /sbin/init.d/init.cssd oclsvmon
   haclu 17688 17621  0  Dec 17  ?         0:00 /bin/sh -c cd /haclu/64bit/app/oracle/product/crs102/log/cehpclu7/cssd/oclsvmon; ulimit -c unlimited; /haclu/64bit/app/oracle/p
   haclu 17689 17688  0  Dec 17  ?         8:04 /haclu/64bit/app/oracle/product/crs102/bin/oclsvmon.bin
    root 17623 17424  0  Dec 17  ?         0:00 /bin/sh /sbin/init.d/init.cssd oclsomon
   haclu 17744 17623  0  Dec 17  ?         0:00 /bin/sh -c cd /haclu/64bit/app/oracle/product/crs102/log/cehpclu7/cssd/oclsomon; ulimit -c unlimited; /haclu/64bit/app/oracle/p
   haclu 17750 17744  0  Dec 17  ?        158:34 /haclu/64bit/app/oracle/product/crs102/bin/oclsomon.bin
    root 11530 17424  1 14:13:28 ?         0:00 /bin/sleep 1
   haclu  5727     1  0 13:49:56 ?         0:00 /haclu/64bit/app/oracle/product/crs102/bin/evmd.bin
   haclu  5896  5727  0 13:49:59 ?         0:00 /haclu/64bit/app/oracle/product/crs102/bin/evmlogger.bin -o /haclu/64bit/app/oracle/product/crs102/evm/log/evmlogger.info -l /h
    root 17611 17425  0  Dec 17  ?        163:50 /haclu/64bit/app/oracle/product/crs102/bin/crsd.bin reboot

3. CRS 10.2.0.4 on AIX with HACMP installed
# ps -ef | grep -E 'init|d.bin|ocls|sleep|evmlogger|UID'
     UID    PID   PPID   C    STIME    TTY  TIME CMD
    root      1      0   0   Dec 23      -  0:56 /etc/init
    root 106718      1   0   Jan 05      - 25:01 /bin/sh /etc/init.cssd fatal
    root 213226      1   0   Jan 05      -  0:00 /bin/sh /etc/init.crsd run
    root 278718      1   0   Jan 05      -  0:00 /bin/sh /etc/init.evmd run
    root 258308 106718   0   Jan 05      -  0:00 /bin/sh /etc/init.cssd daemon
   haclu 299010 348438   0   Jan 05      - 12:24 /haclu/64bit/app/oracle/product/crs102/bin/ocssd.bin
    root 315604 106718   0   Jan 05      -  0:00 /bin/sh /etc/init.cssd oclsomon
   haclu 303300 315604   0   Jan 05      -  0:00 /bin/sh -c cd /haclu/64bit/app/oracle/product/crs102/log/celaixclu3/cssd/oclsomon; ulimit -c unlimited; /haclu/64bit/app/oracle/product/crs102/bin/oclsomon  || exit $?
   haclu 278978 303300   0   Jan 05      -  2:36 /haclu/64bit/app/oracle/product/crs102/bin/oclsomon.bin
    root 250352 106718   0 13:56:56      -  0:00 /bin/sleep 1
   haclu 323672 278718   0   Jan 05      -  0:58 /haclu/64bit/app/oracle/product/crs102/bin/evmd.bin
   haclu 311416 323672   0   Jan 05      -  0:01 /haclu/64bit/app/oracle/product/crs102/bin/evmlogger.bin -o /haclu/64bit/app/oracle/product/crs102/evm/log/evmlogger.info -l /haclu/64bit/app/oracle/product/crs102/evm/log/evmlogger.log
    root 287166 213226   2   Jan 05      - 84:56 /haclu/64bit/app/oracle/product/crs102/bin/crsd.bin reboot

4. CRS 11.1.0.7 on Linux 32bit
[root@haclulnx1 init.d]# ps -ef | grep -E 'init|d.bin|ocls|oprocd|diskmon|evmlogger|PID'
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 16:55 ?        00:00:00 init [5]                       
root      5412     1  0 16:56 ?        00:00:00 /bin/sh /etc/init.d/init.evmd run
root      5413     1  0 16:56 ?        00:00:03 /bin/sh /etc/init.d/init.cssd fatal
root      5416     1  0 16:56 ?        00:00:00 /bin/sh /etc/init.d/init.crsd run
root      7690  5413  0 16:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd daemon
oracle    8465  7690  0 16:57 ?        00:00:01 /orasoft/red4u2/crs/bin/ocssd.bin
root      7648  5413  0 16:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
root      8372  7648  0 16:56 ?        00:00:00 /orasoft/red4u2/crs/bin/oprocd run -t 1000 -m 500 -f
root      7672  5413  0 16:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd diskmon
oracle    8255  7672  0 16:56 ?        00:00:00 /orasoft/red4u2/crs/bin/diskmon.bin -d -f
root      7658  5413  0 16:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd oclsomon
root      8384  7658  0 16:56 ?        00:00:00 /sbin/runuser -l oracle -c /bin/sh -c 'cd /orasoft/red4u2/crs/log/haclulnx1/cssd/oclsomon; ulimit -c unlimited; /orasoft/red4u2/crs/bin/oclsomon  || exit $?'
oracle    8385  8384  0 16:56 ?        00:00:00 /bin/sh -c cd /orasoft/red4u2/crs/log/haclulnx1/cssd/oclsomon; ulimit -c unlimited; /orasoft/red4u2/crs/bin/oclsomon  || exit $?
oracle    8418  8385  0 16:56 ?        00:00:01 /orasoft/red4u2/crs/bin/oclsomon.bin
root      9746     1  0 17:00 ?        00:00:00 /orasoft/red4u2/crs/bin/oclskd.bin
oracle   10537     1  0 17:01 ?        00:00:00 /orasoft/red4u2/crs/bin/oclskd.bin
oracle    7606  7605  0 16:56 ?        00:00:00 /orasoft/red4u2/crs/bin/evmd.bin
oracle    9809  7606  0 17:00 ?        00:00:00 /orasoft/red4u2/crs/bin/evmlogger.bin -o /orasoft/red4u2/crs/evm/log/evmlogger.info -l /orasoft/red4u2/crs/evm/log/evmlogger.log
root      7585  5416  0 16:56 ?        00:00:08 /orasoft/red4u2/crs/bin/crsd.bin reboot

B. When the clusterware is not allowed to start on boot

This state is reached when:
1. 'crsctl stop crs' has been issued and the clusterware is stopped
or
2. the automatic startup of the clusterware has been disabled and the node has been rebooted, e.g.
./init.crs disable
Automatic startup disabled for system boot.

The 'ps' command only shows the three inittab processes with spawned sleep processes in a 30-second loop
ps -ef | grep -E 'init|d.bin|ocls|oprocd|diskmon|evmlogger|sleep|PID'
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 16:55 ?        00:00:00 init [5]                       
root     19770     1  0 18:00 ?        00:00:00 /bin/sh /etc/init.d/init.evmd run
root     19854     1  0 18:00 ?        00:00:00 /bin/sh /etc/init.d/init.crsd run
root     19906     1  0 18:00 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root     22143 19770  0 18:02 ?        00:00:00 /bin/sleep 30
root     22255 19854  0 18:02 ?        00:00:00 /bin/sleep 30
root     22266 19906  0 18:02 ?        00:00:00 /bin/sleep 30

The clusterware can be re-enabled by running './init.crs enable' and/or 'crsctl start crs'

C. When the clusterware is allowed to start on boot, but can't start because some prerequisites are not met

This state is reached when the node has rebooted and some prerequisites are missing, e.g.
1. OCR is not accessible
2. cluster interconnect can't accept tcp connections
3. CRS_HOME is not mounted
...
and 'crsctl check boot' (run as oracle) shows errors, e.g.

$ crsctl check boot
Oracle Cluster Registry initialization failed accessing Oracle Cluster Registry device:
PROC-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]

The three inittab processes are sleeping for 60 seconds in a loop in 'init.cssd startcheck'
ps -ef | grep -E 'init|d.bin|ocls|oprocd|diskmon|evmlogger|sleep|PID'
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 18:28 ?        00:00:00 init [5]                       
root      4969     1  0 18:29 ?        00:00:00 /bin/sh /etc/init.d/init.evmd run
root      5060     1  0 18:29 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root      5064     1  0 18:29 ?        00:00:00 /bin/sh /etc/init.d/init.crsd run
root      5405  4969  0 18:29 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      5719  5060  0 18:29 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      5819  5064  0 18:29 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      6986  5405  0 18:30 ?        00:00:00 /bin/sleep 60
root      6987  5819  0 18:30 ?        00:00:00 /bin/sleep 60
root      7025  5719  0 18:30 ?        00:00:00 /bin/sleep 60

Once 'crsctl check boot' returns nothing (no more error messages), the clusterware processes will start.
Oracle RAC crsctl and srvctl commands

CRSCTL Commands :-
Cluster Related Commands
crs_stat -t Shows HA resource status (hard to read)
crsstat Output of crs_stat -t formatted nicely
crsctl check crs Checks that CSS, CRS and EVM appear healthy
crsctl stop crs Stops CRS and all other services
crsctl disable crs Prevents CRS from starting on reboot
crsctl enable crs Enables CRS start on reboot
crs_stop -all Stops all registered resources
crs_start -all Starts all registered resources
crsctl stop cluster -all Stops the cluster on all nodes
crsctl start cluster -all Starts the cluster on all nodes

SRVCTL Commands :-
Database Related Commands
srvctl start instance -d <db_name>  -i <inst_name> Starts an instance
srvctl start database -d <db_name> Starts all instances
srvctl stop database -d <db_name> Stops all instances, closes database
srvctl stop instance -d <db_name> -i <inst_name> Stops an instance
srvctl start service -d <db_name> -s <service_name> Starts a service
srvctl stop service -d <db_name> -s <service_name> Stops a service
srvctl status service -d <db_name> Checks status of a service
srvctl status instance -d <db_name> -i <inst_name> Checks an individual instance
srvctl status database -d  <db_name> Checks status of all instances
srvctl start nodeapps -n  <node_name> Starts gsd, vip, listener, and ons
srvctl stop nodeapps -n  <node_name> Stops gsd, vip and listener
srvctl status scan Status of scan listener
srvctl config scan Configuration of scan listener
srvctl status asm Status of ASM instance
How to interpret explain plan and what is cost in explain plan
The cost is a number that represents the estimated resource usage for each step. It is just a number, an internal unit, used to compare different plans.

Estimating
The estimator generates three types of measures:
• Selectivity
• Cardinality
• Cost


Cardinality represents the number of rows in a row source.
Cost represents the units of work or resource that are used.

Cardinality represents the number of rows in a row source; here, the row source can be a base table, a view, or the result of a join or GROUP BY operator. If a select from a table is performed, the table is the row source and the cardinality is the number of rows in that table.
A higher cardinality means you are going to fetch more rows and do more work, so the query will take longer. Thus the cost is (usually) higher.

All other things being equal, a query with a higher cost will use more resources and thus take longer to run. But all things are rarely equal: a lower-cost query can run faster than a higher-cost one!
Cost represents the number of units of work (or resource) that are used. The query optimizer uses disk I/O, CPU usage, and memory usage as units of work. So the cost used by the query optimizer represents an estimate of the number of disk I/Os and the amount of CPU and memory used in performing an operation. The operation can be scanning a table, accessing rows from a table by using an index, joining two tables together, or sorting a row source.
The COST shown in the plan is the final output of the cost-based optimizer (CBO): an estimate of the work required so that the query can be executed and its result produced.

The access path determines the number of units of work that are required to get data from a
base table. The access path can be a table scan, a fast full index scan, or an index scan.
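For example, a minimal way to display a plan with its cost, cardinality and access paths (the EMP table here is just the demo table used later in this document):

SQL> explain plan for
  2  select empno, ename, job from emp where deptno = 10;

SQL> select * from table(dbms_xplan.display);
-- The COST and ROWS (cardinality) columns, and access paths such as
-- "TABLE ACCESS BY INDEX ROWID" or "INDEX RANGE SCAN", appear in this output.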

Changing optimizer behavior

The optimizer is influenced by:
• SQL statement construction
• Data structure
• Statistics
• SQL Plan Management options
• Session parameters
• System parameters
• Hints


Adaptive Execution Plans

A query plan changes during execution because runtime conditions indicate that optimizer estimates are inaccurate.
All adaptive execution plans rely on statistics that are collected during query execution.
The two adaptive plan techniques are:
– Dynamic plans  -- A dynamic plan chooses among subplans during statement execution.
For dynamic plans, the optimizer must decide which subplans to include in a dynamic
plan, which statistics to collect to choose a subplan, and thresholds for this choice.
– Re-optimization -- In contrast, re-optimization changes a plan for executions after the
current execution. For re-optimization, the optimizer must decide which statistics to
collect at which points in a plan and when re-optimization is feasible.
Operations that Retrieve Rows (Access Paths)
As I mentioned earlier, some operations retrieve rows from data sources, and in those cases, the object_name column shows the name of the data source, which can be a table, a view, etc. However, the optimizer might choose to use different techniques to retrieve the data depending on the information it has available from the database statistics. These different techniques that can be used to retrieve data are usually called access paths, and they are displayed in the operations column of the plan, usually enclosed in parentheses.
Below is a list of the most common access paths with a small explanation of each. I will not cover them all because I don’t want to bore you. I’m sure that after reading the ones included here you will have a very good understanding of what access paths are and how they can affect the performance of your queries.
Full Table Scan
A full table scan reads all rows from a table, and then filters out those rows that do not meet the selection criteria (if there is one). Contrary to what one could think, full table scans are not necessarily a bad thing. There are situations where a full table scan would be more efficient than retrieving the data using an index.
Table Access by Rowid
A rowid is an internal representation of the storage location of data. The rowid of a row specifies the data file and data block containing the row and the location of the row in that block. Locating a row by specifying its rowid is the fastest way to retrieve a single row because it specifies the exact location of the row in the database.
In most cases, the database accesses a table by rowid after a scan of one or more indexes.
Index Unique Scan
An index unique scan returns at most 1 rowid, and thus, after an index unique scan you will typically see a table access by rowid (if the desired data is not available in the index). Index unique scans can be used when a query predicate references all of the columns of a unique index, by using the equality operator.
Index Range Scan
An index range scan is an ordered scan of values, and it is typically used when a query predicate references some of the leading columns of an index, or when for any reason more than one value can be retrieved by using an index key. These predicates can include equality and non-equality operators (=, <, >, etc.).
Index Full Scan
An index full scan reads the entire index in order, and can be used in several situations, including cases in which there is no predicate, but certain conditions would allow the index to be used to avoid a separate sorting operation.
Index Fast Full Scan
An index fast full scan reads the index blocks in unsorted order, as they exist on disk. This method is used when all of the columns the query needs to retrieve are in the index, so the optimizer uses the index instead of the table.
Index Join Scan
An index join scan is a hash join of multiple indexes that together return all columns requested by a query. The database does not need to access the table because all data is retrieved from the indexes.
Operations that Manipulate Data
As I mentioned before, besides the operations that retrieve data from the database, there are some other types of operations you may see in an execution plan, which do not retrieve data, but operate on data that was retrieved by some other operation. The most common operations in this group are sorts and joins.
Sorts
A sort operation is performed when the rows coming out of the step need to be returned in some specific order. This can be necessary to comply with the order requested by the query, or to return the rows in the order in which the next operation needs them to work as expected, for example, when the next operation is a sort merge join.
Joins
When you run a query that includes more than one table in the FROM clause the database needs to perform a join operation, and the job of the optimizer is to determine the order in which the data sources should be joined, and the best join method to use in order to produce the desired results in the most efficient way possible.
Both of these decisions are made based on the available statistics.
Here is a small explanation for the different join methods the optimizer can decide to use:
Nested Loops Joins
When this method is used, for each row in the first data set that matches the single-table predicates, the database retrieves all rows in the second data set that satisfy the join predicate. As the name implies, this method works as if you had 2 nested for loops in a procedural programming language, in which for each iteration of the outer loop the inner loop is traversed to find the rows that satisfy the join condition.
As you can imagine, this join method is not very efficient on large data sets, unless the rows in the inner data set can be accessed efficiently (through an index).
In general, nested loops joins work best on small tables with indexes on the join conditions.
Hash Joins
The database uses a hash join to join larger data sets. In summary, the optimizer creates a hash table (what is a hash table?) from one of the data sets (usually the smallest one) using the columns used in the join condition as the key, and then scans the other data set applying the same hash function to the columns in the join condition to see if it can find a matching row in the hash table built from the first data set.
You don’t really need to understand how a hash table works. In general, what you need to know is that this join method can be used when you have an equi-join, and that it can be very efficient when the smaller of the data sets can be put completely in memory.
On larger data sets, this join method can be much more efficient than a nested loop.
Sort Merge Joins
A sort merge join is a variation of a nested loops join. The main difference is that this method requires the 2 data sources to be ordered first, but the algorithm to find the matching rows is more efficient.
This method is usually selected when joining large amounts of data when the join uses an inequality condition, or when a hash join would not be able to put the hash table for one of the data sets completely in memory.
What is incremental statistics?
Incremental statistics maintenance was introduced in Oracle Database 11g to improve the performance of gathering statistics on large partitioned tables. When incremental statistics maintenance is enabled for a partitioned table, Oracle accurately generates global-level statistics by aggregating partition-level statistics.
By default, incremental maintenance does not use the staleness status to decide when to update statistics. This scenario is covered in an earlier blog post for Oracle Database 11g. If a partition or sub-partition is subject to even a single DML operation, statistics will be re-gathered, the appropriate synopsis will be updated and the global-level statistics will be re-calculated from the synopses. This behavior can be changed in Oracle Database 12c, allowing you to use the staleness threshold to define when incremental statistics will be re-calculated. This is covered in Staleness and DML thresholds, below.
Implementation
Enabling synopses

To enable the creation of synopses, a table must be configured to use incremental maintenance. This feature is switched on using a DBMS_STATS preference called ‘INCREMENTAL’. For example:

EXEC dbms_stats.set_table_prefs(null,'SALES','INCREMENTAL','TRUE')

Checking that incremental maintenance is enabled

The value of the DBMS_STATS preference can be checked as follows:

SELECT dbms_stats.get_prefs(pname=>'INCREMENTAL',
                            tabname=>'SALES') 
FROM dual;

Staleness and DML thresholds

As mentioned above, Optimizer statistics are considered stale when the number of changes made to data exceeds a certain threshold. This threshold is expressed as a percentage of row changes for a table, partition or subpartition and is set using a DBMS_STATS preference called STALE_PERCENT. The default value for stale percent is 10 so, for example, a partition containing 100 rows would be marked stale if more than 10 rows are updated, added or deleted. Here is an example of setting and inspecting the preference:

EXEC dbms_stats.set_table_prefs(null, 'SALES', 'STALE_PERCENT','5')

select dbms_stats.get_prefs('STALE_PERCENT',null,'SALES') from dual;

It is easy to check if a table or partition has been marked as stale:

select partition_name,
       subpartition_name,
       stale_stats               /* YES or NO */
from   dba_tab_statistics
where  table_name = 'SALES';

The database tracks DML operations to measure when data change has caused a table to exceed its staleness threshold. If you want to take a look at this information, bear in mind that the statistics are approximate and they are automatically flushed to disk periodically. If you want to see the figures change immediately during your tests then you will need to flush them manually (you must have 'ANALYZE ANY' system privilege), like this:

EXEC dbms_stats.flush_database_monitoring_info
                
select  *
from    dba_tab_modifications
where   table_name = 'SALES';

Remember that if you are using incremental statistics in Oracle Database 11g, a single DML operation on a partition or sub-partition will make it a target for a statistics refresh, even if it is not marked stale. In other words, we might update one row in a partition containing 1 million rows. The partition won't be marked stale (if we assume a 10% staleness threshold) but fresh statistics will be gathered. Oracle Database 12c exhibits the same behavior by default, but this release gives you the option to allow multiple DML changes to occur against a partition or sub-partition before it is a target for incremental refresh. You can enable this behavior by changing the DBMS_STATS preference INCREMENTAL_STALENESS from its default value (NULL) to 'USE_STALE_PERCENT'. For example:

exec dbms_stats.set_global_prefs('INCREMENTAL_STALENESS', 'USE_STALE_PERCENT')

Once this preference is set, a table’s STALE_PERCENT value will be used to define the threshold of DML change in the context of incremental maintenance. In other words, statistics will not be re-gathered for a partition if the number of DML changes is below the STALE_PERCENT threshold.
Locking statistics

Incremental statistics does work with locked partition statistics as long as no DML occurs on the locked partitions. However, if DML does occur on the locked partitions then we can no longer guarantee that the global statistics built from the locked statistics will be accurate, so the database will fall back to using the non-incremental approach when gathering global statistics. However, if for some reason you must lock the partition-level statistics and still want to take advantage of incremental statistics gathering, you can set the 'INCREMENTAL_STALENESS' preference to include 'USE_LOCKED_STATS'. Once set, the locked partition/subpartition stats are NOT considered stale as long as they have synopses, regardless of DML changes.

Note that ‘INCREMENTAL_STALENESS’ accepts multiple values, such as:

BEGIN
   dbms_stats.set_table_prefs(
      ownname=>null, 
      tabname=>'SALES', 
      pname =>'INCREMENTAL_STALENESS', 
      pvalue=>'USE_STALE_PERCENT, USE_LOCKED_STATS');
END;
/

Checking for staleness

You can check for table/partition/subpartition staleness very easily using the statistics views. For example:

EXEC dbms_stats.flush_database_monitoring_info  

select partition_name,subpartition_name,stale_stats
from   dba_tab_statistics
where  table_name = 'SALES'
order by partition_position, subpartition_position;

Database monitoring information is used to identify stale statistics, so you’ll need to call FLUSH_DATABASE_MONITORING_INFO if you’re testing this out and you want to see immediately how the staleness status is affected by data change.
Gathering statistics

How do you gather statistics on a table using incremental maintenance? Keep things simple! Let the Oracle Database work out how best to do it. Use these procedures:

                       EXEC dbms_stats.gather_table_stats(null,'SALES')       
or                     EXEC dbms_stats.gather_schema_stats(…)
or, even better        EXEC dbms_stats.gather_database_stats()

For the DBMS_STATS.GATHER... procedures you must use ESTIMATE_PERCENT set to AUTO_SAMPLE_SIZE. Since this is the default, that is what will be used in the examples above unless you have overridden it. If you use a percentage value for ESTIMATE_PERCENT, incremental maintenance will not kick in.
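A quick way to verify (and, if needed, reset) the preference for a table, here assuming the SALES table used in the earlier examples:

SQL> select dbms_stats.get_prefs('ESTIMATE_PERCENT', null, 'SALES') from dual;

SQL> exec dbms_stats.set_table_prefs(null, 'SALES', 'ESTIMATE_PERCENT', 'DBMS_STATS.AUTO_SAMPLE_SIZE')
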
Regathering statistics when data hasn’t changed

From time-to-time you might notice that statistics are gathered on partitions that have not been subject to any DML changes. Why is this? There are a number of reasons:

    Statistics have been unlocked.
    Table column usage has changed (this is explained below).
    New columns are added. This includes hidden columns created from statistics extensions such as column groups, column expressions.
    Synopses are not in sync with the column statistics. It is possible that you have gathered statistics in incremental mode at time T1. Then you disable incremental and regather statistics at time T2. Then the synopses’ timestamp T1 is out of sync with the basic column statistics’ timestamp T2.
    Unusual cases such as column statistics have been deleted using delete_column_statistics.

Bullet point "2" has some implications. The database tracks how columns are used in query predicates and stores this information in the data dictionary (sys.col_usage$). It uses this information to help it figure out which columns will benefit from a histogram to improve query cardinality estimates and, as a result, improve SQL execution plans. If column usage changes, then the database might choose to re-gather statistics and create a new histogram.
Locally partitioned index statistics

For locally partitioned index statistics, we first check their corresponding table partitions (or subpartitions). If the table (sub)partitions have fresh statistics and the index statistics have been gathered after the table (sub)partition-level statistics, then they are considered fresh and their statistics are not regathered.
Composite partitioned tables

Statistics at the subpartition level are gathered and stored by the database, but note that synopses are created at the partition level only. This means that if the statistics for a subpartition become stale due to data changes, then the statistics (and synopsis) for the parent partition will be refreshed by examining all of its subpartitions. The database only regathers subpartition-level statistics on subpartitions that are stale.
More information

 
 
What's the difference between SQL Profiles and SQL Plan Baselines?


SQL Profiles were designed to correct optimizer behavior when the underlying data no longer fit its statistics. Their goal is to create the absolute best execution plan for the SQL by giving very precise data to the optimizer. A SQL Profile helps the optimizer minimize mistakes and thus makes it more likely to select the best plan.
A SQL Profile is a correction to statistics that helps the optimizer generate a more efficient execution plan.

select name, type, status, sql_text from dba_sql_profiles;

You can also disable the profile if the SQL is working properly after accepting the profile.


COLUMN category FORMAT a10
COLUMN sql_text FORMAT a20

SELECT NAME, SQL_TEXT, CATEGORY, STATUS FROM   DBA_SQL_PROFILES;

BEGIN
  DBMS_SQLTUNE.DROP_SQL_PROFILE ( 
    name => 'my_sql_profile' 
);
END;
/

Change SQL Profile
a. To disable a SQL profile:
exec dbms_sqltune.alter_sql_profile('<profile_name>', 'STATUS', 'DISABLED');

b. To add description to a SQL profile:
exec DBMS_SQLTUNE.alter_sql_profile('sqlprofile_name','DESCRIPTION','this is a test sql profile');

10. To delete SQL Profile:
exec dbms_sqltune.drop_sql_profile('SYS_SQLPROF_0132f8432cbc0000');




A SQL plan baseline for a SQL statement consists of a set of accepted plans. When the statement is parsed, the optimizer will only select the best plan from among this set. If a different plan is found using the normal cost-based selection process, the optimizer will add it to the plan history but this plan will not be used until it is verified to perform better than the existing accepted plan and is evolved.
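As a sketch of how plans get into a baseline and how unaccepted plans are verified (the SQL_ID and SQL handle are taken from the example later in this section):

-- Load the current plan(s) for a statement from the cursor cache into a baseline
SQL> variable cnt number
SQL> exec :cnt := dbms_spm.load_plans_from_cursor_cache(sql_id => '55utxfrbncds3');

-- Verify/evolve any unaccepted plans for that statement's baseline
SQL> select dbms_spm.evolve_sql_plan_baseline(sql_handle => 'SQL_f4cffabc89b4ae3c') from dual;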

In Oracle a SQL Profile creates extra information about a particular SQL that the optimizer can use at run time to select the optimal plan to ensure best performance. In essence the SQL Profile enables dynamic behavior where the optimizer has multiple plans to choose from at run time based on run time bind variables etc. When you run the SQL Tuning Advisor, the list of recommendations specifies whether a SQL can be improved by creating a SQL Profile or a SQL Baseline. It is preferable to choose a SQL Profile simply because it allows the optimizer to pick the best execution plan at run time.


A SQL Baseline, on the other hand, is more of a brute-force method, where you simply marry a particular SQL to a specific SQL execution plan. So no matter what the run time bind variables are for a given SQL, the optimizer will always try to use the SQL Baseline plan. This may work fine in most cases, but where data skew is high it is preferable to pick more efficient plans based on the bind variable values passed at run time instead of always picking the same plan as generated by the SQL Baseline.

SQL query to get SPM baseline.

SQL> select count(*) from dba_sql_plan_baselines where parsing_schema_name='$1';

1. Baselines know what plan they are trying to recreate and SQL Profiles do not.
SQL Profiles will blindly apply any hints it has and what you get is what you get.
Baselines will apply the hints and if the optimizer gets the plan it was expecting, it uses the plan.
If it doesn’t come up with the expected plan, the hints are thrown away and the optimizer tries again (possibly with the hints from another accepted Baseline).

2.
Profiles have a “force matching” capability that allows them to be applied to multiple statements that differ only in the values of literals.
Think of it as a just in time cursor sharing feature. Baselines do not have this ability to act on multiple statements.
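A hedged sketch of accepting a profile with force matching (the task name and profile name are hypothetical):

-- force_match => TRUE makes the profile apply to statements that differ only in literal values
SQL> exec dbms_sqltune.accept_sql_profile(task_name => 'my_tuning_task', name => 'my_sql_profile', force_match => TRUE);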

Comments from Kerry Osborne January 25th, 2012 – 16:38

I have seen Baselines be disregarded, even without such extreme conditions as a specified index having been removed.

The reason for this is that Baselines attempt to apply enough hints to limit the choices the optimizer has to a single plan,
but there are situations where the set of hints is not sufficient to actually force the desired plan.

What I mean is that the hints will eliminate virtually all possibility but there still may be a few that are valid and so it’s possible to get a different plan.

In fact, I have even seen situations where the act of creating a Baseline causes the plan to change.


SQL> select empno,ename,job from emp t1 where deptno = 10;

     EMPNO ENAME      JOB
---------- ---------- ---------
      7782 CLARK      MANAGER
      7839 KING       PRESIDENT
      7934 MILLER     CLERK

SQL_ID    55utxfrbncds3, child number 0
-------------------------------------
select empno,ename,job from emp t1 where deptno = 10

Plan hash value: 1614352715

------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |          |       |       |     2 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED     | EMP      |     3 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                       | DEPT_EMP |     3 |       |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("DEPTNO"=10)

Note
-----
   - SQL plan baseline SQL_PLAN_g9mzurk4v9bjw9da07b3a used for this statement


We can also query the Baseline to gather Information about the stored query.

select sql_handle,sql_text,origin,enabled,accepted,adaptive 
  from dba_sql_plan_baselines 
 where plan_name = 'SQL_PLAN_g9mzurk4v9bjw9da07b3a';

SQL_HANDLE            SQL_TEXT                    ORIGIN        ENABLED ACCEPTED ADAPTIVE
--------------------- --------------------------- ------------- ------- -------- --------
SQL_f4cffabc89b4ae3c  select empno,ename,job      AUTO-CAPTURE  YES     YES      NO
                      from emp where deptno = 10 
Trouble with lost wallet
 
We had an 11.2.0.4 instance with an Oracle wallet created, but after some issues with the server the master key file got lost.

Because this was a development server there was no available backup but the data wasn't so important. We created a new wallet with a new master key that is now open and "working".

For some queries to work we used the alter table <table_name> rekey command, and tested the environment by doing a select * over all the tables with encrypted columns (we only have encrypted columns in some tables, not encrypted tablespaces) and it worked. But with certain more complicated queries (between several tables) we are getting ORA-28362.

Is it possible to recover from the loss of the previous wallet, and could this error have to do with that? Would we need to recreate the tables?

Summary: No.  In order to restore a backup with encrypted data, the correct TDE wallet file must be available, else the restore/recover cannot be done. 

If all copies of the current ewallet.p12 file (the encryption wallet or TDE wallet, used to store the master encryption keys needed by the database) are lost -- whether deleted or corrupted -- then the database cannot be restored.  Oracle Support cannot assist in restoring the database if the correct TDE wallet is missing.

The wallet password is not the same as the database master key.  Knowing the password will not help, because this is only used to open the ewallet.p12 file.

The ewallet.p12 file is a critical component of the database's ability to function when TDE has been implemented.  There is no way to substitute another wallet, or decrypt the data, without having the correct TDE wallet file. 

Treat the ewallet.p12 file accordingly, and make sure to protect it against loss.

Solution:
--------
So what you have to do is the following :

1) decrypt all the encrypted columns (the mkeyid query shown below should then return no rows)

2) remove the current wallet

3) reimplement TDE ( a new key with a new wallet )

4) re-encrypt those columns with the new key (see the sketch below)

Finally the previous queries will return the same TDE master key and you will not have any issues.
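A minimal sketch of steps 1, 3 and 4 (the table ORDERS, column CARD_NO and the wallet password are illustrative; 11g column-encryption syntax assumed):

-- Step 1: decrypt the columns that were encrypted under the lost master key
SQL> alter table orders modify (card_no decrypt);

-- Step 3: create a new wallet and master key (11g syntax)
SQL> alter system set encryption key identified by "new_wallet_password";

-- Step 4: re-encrypt the columns under the new master key
SQL> alter table orders modify (card_no encrypt);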

or

To know what you have to do / how you can recover from this we need to see the output of these queries :

select ts#, masterkeyid,  utl_raw.cast_to_varchar2( utl_encode.base64_encode('01'||substr(masterkeyid,1,4))) || utl_raw.cast_to_varchar2( utl_encode.base64_encode(substr(masterkeyid,5,length(masterkeyid)))) masterkeyid_base64  FROM v$encrypted_tablespaces;

select  utl_raw.cast_to_varchar2( utl_encode.base64_encode('01'||substr(mkeyid,1,4))) || utl_raw.cast_to_varchar2( utl_encode.base64_encode(substr(mkeyid,5,length(mkeyid)))) masterkeyid_base64  FROM (select RAWTOHEX(mkid) mkeyid from x$kcbdbk);

select mkeyid from enc$;

-------------------------

How to change from 8k db block size to 16k
---------------------------------------------

Summary : I think my first thoughts would be to create a new empty database on an 8KB block size, add a 4KB cache, and use transportable tablespaces to move the 4KB database to the 8KB database, 
          then move objects from 4KB to 8KB over time

SOLUTION:
A) Creating an 8k tablespace alone will not solve your problem; remember that you need db_8k_cache_size defined if you are going to create an 8k tablespace in a 4k database. The only way to define the default block size is at the creation of your database, so you'll be mixing different cache sizes if you need them.

Try expdp with parallelism; that may help shorten your times.

1. create a new Oracle 12.2 database

2. use your export file to populate it.

B) You can't simply create an 8k block buffer, and restore the database, because the blocks will still be 4k in size.

To create the 8k buffer you can do it dynamically with:

ALTER SYSTEM SET DB_8K_CACHE_SIZE=512M;

Now you can move objects like this:

alter table .. move tablespace tbs_app8k;

alter index .. rebuild tablespace tbs_app8k;

----- 
Step: 1
As a first step, ensure that a valid full backup of the database exists; if one is not found, it is strongly recommended to take one.

Step: 2
A new database instance with 8k block size is created on the same server with identical character sets.

Step: 3
All application-specific schemas are listed in consultation with the application team, and the tablespaces holding application schema data are identified.
Once the schemas and tablespaces are identified, the corresponding tablespaces are created with an 8k block size and an '8k' suffix, i.e. if the tablespace name is USERS, the 8k block size tablespace is created as USERS8K. Before creating 8k tablespaces we need to set the db_8k_cache_size parameter, otherwise ORA-29339 is signaled while creating tablespaces with the 8k block size.
As the database has only one application user, HR, and only one tablespace, USERS, we will only create the USERS8K tablespace with an 8k block size.


Move DP schema from 8k db block size to 16k db block size

1. Collect data prior to this exercise.
Capture OOR and CF prior to REORG process. Refer CF_OOR_Query.sql file.

2. Take export backup of DP schema
Shutdown all demantra services

$ expdp system/manager DUMPFILE=<DUMP_FILE_NAME>.dmp LOGFILE=<LOG_FILE_NAME>.log DIRECTORY=EXPDPBKP SCHEMAS=DP EXCLUDE=STATISTICS PARALLEL=8
 

3. Test that export backup is good enough by importing into a new schema

Add more data files to existing default tablespace of DP for this test then run impdp as below.

$ impdp system/manager DIRECTORY=EXPDPBKP DUMPFILE=<EXPORT_DUMP_NAME>.dmp REMAP_SCHEMA=DP:DPTEST TRANSFORM=OID:N LOGFILE=<LOG_FILENAME>.log
After successful import, drop schema DPTEST.
SQL> drop user DPTEST cascade;

4. Drop schema DP

SQL> drop user DP cascade;

5. Drop tablespaces of DP with 8k block size i.e.
1. APPS_TS_TX_DEMT_DP
2. APPS_TS_TX_DEMT_SALES_DATA
3. APPS_TS_TX_DEMT_SALES_INDX
4. APPS_TS_TX_DEMT_SIM_DATA
5. APPS_TS_TX_DEMT_SIM_INDX
6. APPS_TS_TX_DEMT_SALE_ENG_DATA
7. APPS_TS_TX_DEMT_SALE_ENG_INDX

SQL> set timing on
SQL> drop tablespace APPS_TS_TX_DEMT_DP including contents and datafiles; 
SQL> drop tablespace APPS_TS_TX_DEMT_SALES_DATA including contents and datafiles; 
SQL> drop tablespace APPS_TS_TX_DEMT_SALES_INDX including contents and datafiles;
SQL> drop tablespace APPS_TS_TX_DEMT_SIM_DATA including contents and datafiles;
SQL> drop tablespace APPS_TS_TX_DEMT_SIM_INDX including contents and datafiles; 
SQL> drop tablespace APPS_TS_TX_DEMT_SALE_ENG_DATA including contents and datafiles;
SQL> drop tablespace APPS_TS_TX_DEMT_SALE_ENG_INDX including contents and datafiles;

 

6. Set SGA size accordingly and restart database
SQL> alter system set sga_max_size=<SIZE_IN_MB> scope=spfile;
SQL> alter system set sga_target=<SIZE_IN_MB> scope=spfile;

Restart database.

*Set sga_max_size and sga_target to double the current value as we are setting db_16k_cache_size now.

7. Set db_16K_cache_size

SQL> alter system set db_16K_cache_size=<SIZE_IN_MB> scope=both; 
 *Set this to half of the sga_max_size value.
 

8. Create new tablespaces as below with db block size as 16k.

1. APPS_TS_TX_DEMT_DP
2. APPS_TS_TX_DEMT_SALES_DATA
3. APPS_TS_TX_DEMT_SALES_INDX
4. APPS_TS_TX_DEMT_SIM_DATA
5. APPS_TS_TX_DEMT_SIM_INDX
6. APPS_TS_TX_DEMT_SALE_ENG_DATA
7. APPS_TS_TX_DEMT_SALE_ENG_INDX

*Replace datafiles name and location as per instances in below commands.
* Consider the current size of the DP schema and accordingly create and size the default tablespace, i.e. APPS_TS_TX_DEMT_DP.

Create tablespace APPS_TS_TX_DEMT_DP
datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_dp_001.dbf' 
size 8G
SEGMENT SPACE MANAGEMENT AUTO
EXTENT MANAGEMENT LOCAL 
UNIFORM SIZE 1M
BLOCKSIZE 16k;

alter tablespace APPS_TS_TX_DEMT_DP add datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_dp_002.dbf' size 8192M;

Create tablespace APPS_TS_TX_DEMT_SALES_DATA
datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_sales_data_001.dbf' 
size 5G
SEGMENT SPACE MANAGEMENT AUTO
EXTENT MANAGEMENT LOCAL 
UNIFORM SIZE 1M
BLOCKSIZE 16k;

Create tablespace APPS_TS_TX_DEMT_SALES_INDX
datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_sales_indx_001.dbf' 
size 5G
SEGMENT SPACE MANAGEMENT AUTO
EXTENT MANAGEMENT LOCAL 
UNIFORM SIZE 1M
BLOCKSIZE 16k;

Create tablespace APPS_TS_TX_DEMT_SIM_DATA
datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_sim_data_001.dbf' 
size 2G
SEGMENT SPACE MANAGEMENT AUTO
EXTENT MANAGEMENT LOCAL 
UNIFORM SIZE 1M
BLOCKSIZE 16k;

Create tablespace APPS_TS_TX_DEMT_SIM_INDX
datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_sim_indx_001.dbf' 
size 2G
SEGMENT SPACE MANAGEMENT AUTO
EXTENT MANAGEMENT LOCAL 
UNIFORM SIZE 1M
BLOCKSIZE 16k;

Create tablespace APPS_TS_TX_DEMT_SALE_ENG_DATA
datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_sale_eng_data_001.dbf' 
size 2G
SEGMENT SPACE MANAGEMENT AUTO
EXTENT MANAGEMENT LOCAL 
UNIFORM SIZE 1M
BLOCKSIZE 16k;

Create tablespace APPS_TS_TX_DEMT_SALE_ENG_INDX
datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_sale_eng_indx_001.dbf' 
size 2G
SEGMENT SPACE MANAGEMENT AUTO
EXTENT MANAGEMENT LOCAL 
UNIFORM SIZE 1M
BLOCKSIZE 16k;

** Consider the sizes of existing tablespaces and accordingly create new tablespaces.
 

9. Import the exported dump file

$ impdp system/***** DIRECTORY=EXPDPBKP DUMPFILE=<EXPORTED_DUMP_FILENAME>.dmp LOGFILE=<LOGFILE_NAME>.log
10. Give GRANTS to DP user. Execute grant scripts attached i.e. “grantsOnDP.zip” with APPS, GEPSVCP, etc. users as per scripts.

11. Recompile INVALIDS using utlrp.sql

12. Change INITRANS

SQL> ALTER TABLE DP.SALES_DATA INITRANS 20;
SQL> ALTER TABLE DP.MDP_MATRIX INITRANS 20;
13. Gather stats on tables identified for REORG.

SQL> execute DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DP', tabname => 'SALES_DATA', estimate_percent=> 40, method_opt=>'for all columns size 1, for all indexed columns size auto', degree=> 4);

SQL> execute DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DP', tabname => 'MDP_MATRIX', estimate_percent=> 40, method_opt=>'for all columns size 1, for all indexed columns size auto', degree=> 4);
 

14. Reorg SALES_DATA and MDP_MATRIX.

1. Execute below script as SYS to GRANT all required privileges to DP user for running REORG
SQL> @grant_table_reorg.sql
(Script is kept on pwercd01vn015 at /home/orapnmtvd02/. Also, available on Remote Windows Server of Demantra)

2. Run reorg as DP user
SQL> exec table_reorg.reorg('DP','SALES_DATA','C',30,1);
SQL> exec table_reorg.reorg('DP','MDP_MATRIX','C',30,1);

3. Execute script to REVOKE all given privileges from DP user. Run as SYS.
SQL> @revoke_table_reorg.sql
(Script is kept on pwercd01vn015 at /home/orapnmtvd02/. Also, available on Remote Windows Server of Demantra)

4. Verify success of the Reorg from entries in the table DP.LOG_TABLE_REORG.
SQL> SELECT * FROM DP.LOG_TABLE_REORG ORDER BY LOG_TIME DESC;

15. Delete table stats of SALES_DATA and MDP_MATRIX and then Gather Stats on schema with 80%

SQL> execute DBMS_STATS.DELETE_TABLE_STATS(ownname => 'DP', tabname => 'SALES_DATA');
SQL> execute DBMS_STATS.DELETE_TABLE_STATS(ownname => 'DP', tabname => 'MDP_MATRIX');
SQL> exec dbms_stats.GATHER_SCHEMA_STATS(OWNNAME=>'DP', estimate_percent=>80, DEGREE=>10);

16. Rebuild indexes on table SALES_DATA & MDP_MATRIX in new tablespaces
*Researching this further to confirm whether it is actually required every time we do a reorg.

17. Recheck OOR & CF as was done in step 1.





Top 5 issues that may prevent the successful startup of the Grid Infrastructure (GI) stack

To determine the status of GI, please run the following commands:

1. $GRID_HOME/bin/crsctl check crs
2. $GRID_HOME/bin/crsctl stat res -t -init
3. $GRID_HOME/bin/crsctl stat res -t
4. ps -ef | egrep 'init|d.bin'


Issue #1: CRS-4639: Could not contact Oracle High Availability Services, ohasd.bin not running or ohasd.bin is running but no init.ohasd or other processes
Symptoms:

1. Command '$GRID_HOME/bin/crsctl check crs' returns error:
     CRS-4639: Could not contact Oracle High Availability Services
2. Command 'ps -ef | grep init' does not show a line similar to:
     root 4878 1 0 Sep12 ? 00:00:02 /bin/sh /etc/init.d/init.ohasd run
3. Command 'ps -ef | grep d.bin' does not show a line similar to:
     root 21350 1 6 22:24 ? 00:00:01 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
    Or it may only show "ohasd.bin reboot" process without any other processes
4. ohasd.log report:
       2013-11-04 09:09:15.541: [ default][2609911536] Created alert : (:OHAS00117:) :  TIMED OUT WAITING FOR OHASD MONITOR
5. ohasOUT.log report:
       2013-11-04 08:59:14
       Changing directory to /u01/app/11.2.0/grid/log/lc1n1/ohasd
       OHASD starting
       Timed out waiting for init.ohasd script to start; posting an alert
6. ohasd.bin keeps restarting, ohasd.log report:
     2014-08-31 15:00:25.132: [  CRSSEC][733177600]{0:0:2} Exception: PrimaryGroupEntry constructor failed to validate group name with error: 0 groupId: 0x7f8df8022450 acl_string: pgrp:spec:r-x
     2014-08-31 15:00:25.132: [  CRSSEC][733177600]{0:0:2} Exception: ACL entry creation failed for: pgrp:spec:r-x
     2014-08-31 15:00:25.132: [    INIT][733177600]{0:0:2} Dump State Starting ...
7. Only the ohasd.bin is runing, but there is nothing written in ohasd.log. OS /var/log/messages shows:
     2015-07-12 racnode1 logger: autorun file for ohasd is missing


Possible Causes:

1. For OL5/RHEL5/under and other platform, the file '/etc/inittab' does not contain the line similar to the following (platform dependent) :
      h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
     For OL6/RHEL6+, upstart is not configured properly. For RHEL7/OEL7, systemd is not configured correctly
2. runlevel 3 has not been reached, some rc3 script is hanging
3. the init process (pid 1) did not spawn the process defined in /etc/inittab (h1) or a bad entry before init.ohasd like xx:wait:<process> blocked the start of init.ohasd
4. CRS autostart is disabled
5. The Oracle Local Registry ($GRID_HOME/cdata/<node>.olr) is missing or corrupted (check as root user via "ocrdump -local /tmp/olr.log", the /tmp/olr.log should contain all GI daemon processes related information, compare with a working cluster to verify)
6. root user was in group "spec" before but now the group "spec" has been removed, the old group for root user is still recorded in the OLR, this can be verified in OLR dump
7. HOSTNAME was null when init.ohasd started especially after a node reboot

Solutions:

1. For OL5/RHEL5 and under, add the following line to /etc/inittab
    h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
   and then run "init q" as the root user.
   For Linux OL6/RHEL6, please refer to Note 1607600.1
2. Run command 'ps -ef | grep rc' and kill any remaining rc3 scripts that appear to be stuck.
3. Remove the bad entry before init.ohasd. Consult with OS vendor if "init q" does not spawn "init.ohasd run" process. As a workaround,
   start the init.ohasd manually, eg: as root user, run "/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null &"
4. Enable CRS autostart:
   # crsctl enable crs
   # crsctl start crs
5. Restore OLR from backup, as root user: (refer to Note 1193643.1)
   # crsctl stop crs -f
   # touch <GRID_HOME>/cdata/<node>.olr
   # chown root:oinstall <GRID_HOME>/cdata/<node>.olr
   # ocrconfig -local -restore <GRID_HOME>/cdata/<node>/backup_<date>_<num>.olr
   # crsctl start crs

If an OLR backup does not exist for any reason, a deconfig and rerun of root.sh is required to recreate the OLR; as root user:
   # <GRID_HOME>/crs/install/rootcrs.pl -deconfig -force
   # <GRID_HOME>/root.sh
6. Reinitializing/recreating the OLR is required, using the same commands as for recreating the OLR above
7. Restart the init.ohasd process or add "sleep 30" in init.ohasd to allow the hostname to be populated correctly before starting Clusterware, refer to Note 1427234.1
8. If the above does not help, check OS messages for the ohasd.bin logger message and manually execute the crswrapexece.pl command mentioned in the OS message, with LD_LIBRARY_PATH set to <GRID_HOME>/lib, to continue debugging.
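
   As a rough sketch for step 8: the exact crswrapexece.pl command must be copied from your own OS messages file; the paths below are only an illustration assuming GRID_HOME is /u01/app/11.2.0/grid and the node name is racnode1 (as root):

   # export LD_LIBRARY_PATH=/u01/app/11.2.0/grid/lib
   # /u01/app/11.2.0/grid/perl/bin/perl -I/u01/app/11.2.0/grid/perl/lib /u01/app/11.2.0/grid/bin/crswrapexece.pl /u01/app/11.2.0/grid/crs/install/s_crsconfig_racnode1_env.txt /u01/app/11.2.0/grid/bin/ohasd.bin "reboot"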
 

Issue #2: CRS-4530: Communications failure contacting Cluster Synchronization Services daemon, ocssd.bin is not running

Symptoms:

1. Command '$GRID_HOME/bin/crsctl check crs' returns errors:
    CRS-4638: Oracle High Availability Services is online
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
    CRS-4534: Cannot communicate with Event Manager
2. Command 'ps -ef | grep d.bin' does not show a line similar to:
    oragrid 21543 1 1 22:24 ? 00:00:01 /u01/app/11.2.0/grid/bin/ocssd.bin
3. ocssd.bin is running but abort with message "CLSGPNP_CALL_AGAIN" in ocssd.log
4. ocssd.log shows:

   2012-01-27 13:42:58.796: [ CSSD][19]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 223132864, wrtcnt, 1112, LATS 783238209, 
   lastSeqNo 1111, uniqueness 1327692232, timestamp 1327693378/787089065

5. For clusters of 3 or more nodes, 2 nodes form the cluster fine but the 3rd node fails when joining; ocssd.log shows:

   2012-02-09 11:33:53.048: [ CSSD][1120926016](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 2 nodes with leader 2, racnode2, is smaller than  
   cohort of 2 nodes led by node 1, racnode1, based on map type 2
   2012-02-09 11:33:53.048: [ CSSD][1120926016]###################################
   2012-02-09 11:33:53.048: [ CSSD][1120926016]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread

6. ocssd.bin startup times out after 10 minutes

   2012-04-08 12:04:33.153: [    CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1333911873
   ......
   2012-04-08 12:14:31.994: [    CSSD][5]clssgmShutDown: Received abortive shutdown request from client.
   2012-04-08 12:14:31.994: [    CSSD][5]###################################
   2012-04-08 12:14:31.994: [    CSSD][5]clssscExit: CSSD aborting from thread GMClientListener
   2012-04-08 12:14:31.994: [    CSSD][5]###################################
   2012-04-08 12:14:31.994: [    CSSD][5](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

7. alert<node>.log shows:
2014-02-05 06:16:56.815
[cssd(3361)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/bdprod2/cssd/ocssd.log
...
2014-02-05 06:27:01.707
[ohasd(2252)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'bdprod2'.
2014-02-05 06:27:02.075
[ohasd(2252)]CRS-2771:Maximum restart attempts reached for resource 'ora.cssd'; will not restart.

Possible Causes:

1. Voting disk is missing or inaccessible
2. Multicast is not working for private network for 11.2.0.2.x (expected behavior) or 11.2.0.3 PSU5/PSU6/PSU7 or 12.1.0.1 (due to Bug 16547309)
3. private network is not working, ping or traceroute <private host> shows destination unreachable; or the firewall is enabled for the private network while ping/traceroute work fine
4. gpnpd does not come up, stuck in dispatch thread, Bug 10105195
5. too many disks discovered via asm_diskstring or slow scan of disks due to Bug 13454354 on Solaris 11.2.0.3 only
6. In some cases, a known bug can prevent the 2nd node's ocssd.bin from joining the cluster after the private network issue is fixed, refer to Note 1479380.1

Solutions:

1. restore the voting disk access by checking storage access,  disk permissions etc.
   If the disk is not accessible at OS level, please engage system administrator to restore the disk access.
   If the voting disk is missing from the OCR ASM diskgroup, start CRS in exclusive mode and recreate the voting disk:
   # crsctl start crs -excl
   # crsctl replace votedisk <+OCRVOTE diskgroup>
2. Refer to Document 1212703.1 for multicast test and fix. For 11.2.0.3 PSU5/PSU6/PSU7 or 12.1.0.1, either enable multicast for private network or apply patch 16547309 or latest PSU.
3. Consult with the network administrator to restore private network access or disable firewall for private network (for Linux, check service iptables status and service ip6tables status)
4. Kill the gpnpd.bin process on surviving node, refer Document 10105195.8
   Once above issues are resolved, restart Grid Infrastructure stack.
   If ping/traceroute works fine for the private network but a failed 11.2.0.1 to 11.2.0.2 upgrade has happened, please check out
   Bug 13416559 for the workaround
5. Limit the number of ASM disks scan by supplying a more specific asm_diskstring, refer to bug 13583387
   For Solaris 11.2.0.3 only, please apply patch 13250497, see Note 1451367.1.
6. Refer to the solution and workaround in Note 1479380.1
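
As a quick sanity check after applying the fixes above (a sketch; <private hostname> is a placeholder for your interconnect host name), verify voting disk visibility and private network connectivity:

   $GRID_HOME/bin/crsctl query css votedisk
   ping -c 3 <private hostname>
   traceroute <private hostname>
   # service iptables status      (Linux only; the firewall should not block the private network)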
 

Issue #3: CRS-4535: Cannot communicate with Cluster Ready Services, crsd.bin is not running
Symptoms:

1. Command '$GRID_HOME/bin/crsctl check crs' returns errors:
    CRS-4638: Oracle High Availability Services is online
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4529: Cluster Synchronization Services is online
    CRS-4534: Cannot communicate with Event Manager
2. Command 'ps -ef | grep d.bin' does not show a line similar to:
    root 23017 1 1 22:34 ? 00:00:00 /u01/app/11.2.0/grid/bin/crsd.bin reboot
3. Even if the crsd.bin process exists, command 'crsctl stat res -t -init' shows:
    ora.crsd
        1    ONLINE     INTERMEDIATE

Possible Causes:

1. ocssd.bin is not running or resource ora.cssd is not ONLINE
2. +ASM<n> instance cannot start up due to various reasons
3. OCR is inaccessible
4. Network configuration has been changed causing gpnp profile.xml mismatch
5. $GRID_HOME/crs/init/<host>.pid file for crsd has been removed or renamed manually, crsd.log shows: 'Error3 -2 writing PID to the file'
6. ocr.loc content mismatch with other cluster nodes. crsd.log shows: 'Shutdown CacheLocal. my hash ids don't match'
7. private network is pingable with the normal ping command but not with jumbo frame size (eg: ping -s 8900 <private ip>) when jumbo frames are enabled (MTU: 9000+); or some cluster nodes have jumbo frames set (MTU: 9000) while the problem node does not (MTU: 1500)
8. On AIX 6.1 TL08 SP01 and AIX 7.1 TL02 SP01, due to truncation of multicast packets.
9. udp_sendspace is set to default 9216 on AIX platform

Solutions:

1. Check the solution for Issue 2, ensure ocssd.bin is running and ora.cssd is ONLINE
2. For 11.2.0.2+, ensure that the resource ora.cluster_interconnect.haip is ONLINE, refer to Document 1383737.1 for ASM startup issues related to HAIP.
   Check if GRID_HOME/bin/oracle binary is linked with RAC option Document 284785.1
3. Ensure the OCR disk is available and accessible. If the OCR is lost for any reason, refer to Document 1062983.1 on how to restore the OCR.
4. Restore network configuration to be the same as interface defined in $GRID_HOME/gpnp/<node>/profiles/peer/profile.xml, refer to Document 283684.1 for private network modification.
5. touch the file with <host>.pid under $GRID_HOME/crs/init.
   For 11.2.0.1, the file is owned by <grid> user.
   For 11.2.0.2, the file is owned by root user.
6. Use the ocrconfig -repair command to fix the ocr.loc content:
   for example, as root user:
# ocrconfig -repair -add +OCR2 (to add an entry)
# ocrconfig -repair -delete +OCR2 (to remove an entry)
ohasd.bin needs to be up and running in order for the above command to run.

Once the above issues are resolved, either restart the GI stack or start crsd.bin via:
   # crsctl start res ora.crsd -init
7. Engage network admin to enable jumbo frame from switch layer if it is enabled at the network interface. If jumbo frame is not required, change MTU to 1500 for the private network on all nodes, then restart GI stack on all nodes.
8. On AIX 6.1 TL08 SP01 and AIX 7.1 TL02 SP01, apply AIX patch per Document 1528452.1 AIX 6.1 TL8 or 7.1 TL2: 11gR2 GI Second Node Fails to Join the Cluster as CRSD and EVMD are in INTERMEDIATE State
9. Increase udp_sendspace to recommended value, refer to Document 1280234.1
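
A hedged verification sketch for causes 3, 6 and 7 above: confirm that ocr.loc matches across nodes, that the OCR is readable, and that jumbo-frame-sized packets pass on the private network (the ocr.loc path shown is the usual Linux location; it differs on other platforms):

   cat /etc/oracle/ocr.loc            (compare the content across all cluster nodes)
   # ocrcheck                         (as root; verifies the OCR devices are accessible)
   ping -s 8900 <private ip>          (only relevant when jumbo frames / MTU 9000+ are configured)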
 

Issue #4: Agent or mdnsd.bin, gpnpd.bin, gipcd.bin not running
Symptoms:

1. orarootagent not running. ohasd.log shows:
2012-12-21 02:14:05.071: [    AGFW][24] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /grid/11.2.0/grid_2/bin/orarootagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/grid/11.2.0/grid_2/bin/orarootagent]
2. mdnsd.bin, gpnpd.bin or gipcd.bin not running, here is a sample for mdnsd log file:
2012-12-31 21:37:27.601: [  clsdmt][1088776512]Creating PID [4526] file for home /u01/app/11.2.0/grid host lc1n1 bin mdns to /u01/app/11.2.0/grid/mdns/init/
2012-12-31 21:37:27.602: [  clsdmt][1088776512]Error3 -2 writing PID [4526] to the file []
2012-12-31 21:37:27.602: [  clsdmt][1088776512]Failed to record pid for MDNSD
or
2012-12-31 21:39:52.656: [  clsdmt][1099217216]Creating PID [4645] file for home /u01/app/11.2.0/grid host lc1n1 bin mdns to /u01/app/11.2.0/grid/mdns/init/
2012-12-31 21:39:52.656: [  clsdmt][1099217216]Writing PID [4645] to the file [/u01/app/11.2.0/grid/mdns/init/lc1n1.pid]
2012-12-31 21:39:52.656: [  clsdmt][1099217216]Failed to record pid for MDNSD
3. oraagent or appagent not running, crsd.log shows:
2012-12-01 00:06:24.462: [    AGFW][1164069184] {0:2:27} Created alert : (:CRSAGF00130:) :  Failed to start the agent /u01/app/grid/11.2.0/bin/appagent_oracle

Possible Causes:

1. orarootagent missing execute permission
2. the process-associated <node>.pid file is missing, or the file has wrong ownership or permission
3. wrong permission/ownership within GRID_HOME
4. GRID_HOME disk space 100% full

Solutions:

1. Either compare the permission/ownership with a good node GRID_HOME and make correction accordingly or as root user:
   # cd <GRID_HOME>/crs/install
   # ./rootcrs.pl -unlock
   # ./rootcrs.pl -patch
This will stop the clusterware stack, set permission/ownership to root for the required files and restart the clusterware stack.
2. If the corresponding <node>.pid does not exist, touch the file with correct ownership and permission, otherwise correct the <node>.pid ownership/permission as required, then restart the clusterware stack.
Here is the list of <node>.pid files under <GRID_HOME>, owned by root:root, permission 644:
  ./ologgerd/init/<node>.pid
  ./osysmond/init/<node>.pid
  ./ctss/init/<node>.pid
  ./ohasd/init/<node>.pid
  ./crs/init/<node>.pid
Owned by <grid>:oinstall, permission 644:
  ./mdns/init/<node>.pid  
  ./evm/init/<node>.pid
  ./gipc/init/<node>.pid
  ./gpnp/init/<node>.pid

3. For cause 3, please refer to solution 1.
4. Please clean up the disk space from GRID_HOME, particularly clean up old files under <GRID_HOME>/log/<node>/client/, <diag dest>/tnslsnr/<node>/<listener name>/alert/
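
For solution 2, a minimal sketch of recreating a missing pid file, assuming GRID_HOME is /u01/app/11.2.0/grid, the node is racnode1 and the grid owner is grid (adjust the owner per the list above, then restart the stack), as root:

   # touch /u01/app/11.2.0/grid/mdns/init/racnode1.pid
   # chown grid:oinstall /u01/app/11.2.0/grid/mdns/init/racnode1.pid
   # chmod 644 /u01/app/11.2.0/grid/mdns/init/racnode1.pid
   # /u01/app/11.2.0/grid/bin/crsctl stop crs -f
   # /u01/app/11.2.0/grid/bin/crsctl start crs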
 

Issue #5: ASM instance does not start, ora.asm is OFFLINE
Symptoms:

1. Command 'ps -ef | grep asm' shows no ASM processes
2. Command 'crsctl stat res -t -init' shows:
         ora.asm
               1    ONLINE    OFFLINE


Possible Causes:

1. ASM spfile is corrupted
2. ASM discovery string is incorrect and therefore voting disk/OCR cannot be discovered
3. ASMlib configuration problem
4. ASM instances are using different cluster_interconnects; HAIP being OFFLINE on one node prevents the 2nd ASM instance from starting

Solutions:

1. Create a temporary pfile to start ASM instance, then recreate spfile, see Document 1095214.1 for more details.
2. Refer to Document 1077094.1 to correct the ASM discovery string.
3. Refer to Document 1050164.1 to fix ASMlib configuration.
4. Refer to Document 1383737.1 for solution. For more information about HAIP, please refer to Document 1210883.1
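
For solution 1, a minimal sketch of starting ASM with a temporary pfile (all names here - the grid home path, the SID +ASM1, the disk string and the diskgroup OCRVOTE - are placeholders; follow Document 1095214.1 for the supported procedure):

   $ cat /tmp/init+ASM1.ora
   instance_type=asm
   asm_diskstring='/dev/oracleasm/disks/*'
   asm_diskgroups='OCRVOTE'

   $ export ORACLE_HOME=/u01/app/11.2.0/grid
   $ export ORACLE_SID=+ASM1
   $ sqlplus / as sysasm
   SQL> startup pfile='/tmp/init+ASM1.ora';
   SQL> create spfile='+OCRVOTE' from pfile='/tmp/init+ASM1.ora';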



Troubleshoot Grid Infrastructure Startup Issues


Start up sequence:

In a nutshell, the operating system starts ohasd, ohasd starts agents to start up daemons (gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, evmd, asm etc), and crsd starts agents that start user resources (database, SCAN, listener etc).

For detailed Grid Infrastructure clusterware startup sequence, please refer to note 1053147.1


Cluster status

To find out cluster and daemon status:

$GRID_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

$GRID_HOME/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac1                  Started
ora.crsd
      1        ONLINE  ONLINE       rac1
ora.cssd
      1        ONLINE  ONLINE       rac1
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1
ora.ctssd
      1        ONLINE  ONLINE       rac1                  OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       rac1
ora.drivers.acfs
      1        ONLINE  ONLINE       rac1
ora.evmd
      1        ONLINE  ONLINE       rac1
ora.gipcd
      1        ONLINE  ONLINE       rac1
ora.gpnpd
      1        ONLINE  ONLINE       rac1
ora.mdnsd
      1        ONLINE  ONLINE       rac1

For 11.2.0.2 and above, there will be two more processes:

ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac1
ora.crf
      1        ONLINE  ONLINE       rac1
For 11.2.0.3 onward in non-Exadata, ora.diskmon will be offline:

ora.diskmon
      1        OFFLINE  OFFLINE       rac1

For 12c onward, ora.storage is introduced: 

ora.storage
      1        ONLINE  ONLINE       racnode1              STABLE



To start an offline daemon - if ora.crsd is OFFLINE:

$GRID_HOME/bin/crsctl start res ora.crsd -init
 

Case 1: OHASD does not start

As ohasd.bin is responsible for starting up all other clusterware processes directly or indirectly, it needs to start up properly for the rest of the stack to come up. If ohasd.bin is not up, CRS-4639 (Could not contact Oracle High Availability Services) will be reported when checking its status; if ohasd.bin is already up, CRS-4640 will be reported if another startup attempt is made; and if it fails to start, the following will be reported:


CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.


Automatic ohasd.bin start up depends on the following:

1. OS is at appropriate run level:

The OS needs to be at the specified run level before CRS will try to start up.

To find out at which run level the clusterware needs to come up:

cat /etc/inittab|grep init.ohasd
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
Note: Oracle Linux 6 (OL6) or Red Hat Linux 6 (RHEL6) has deprecated inittab; instead, init.ohasd is configured via upstart in /etc/init/oracle-ohasd.conf, however, the process "/etc/init.d/init.ohasd run" should still be up. Oracle Linux 7 (and Red Hat Linux 7) uses systemd to manage start/stop services (example: /etc/systemd/system/oracle-ohasd.service)

The above example shows CRS is supposed to run at run levels 3 and 5; please note that, depending on the platform, CRS comes up at a different run level.

To find out current run level:

who -r


2. "init.ohasd run" is up

On Linux/UNIX, as "init.ohasd run" is configured in /etc/inittab, process init (pid 1, /sbin/init on Linux, Solaris and hp-ux, /usr/sbin/init on AIX) will start and respawn "init.ohasd run" if it fails. Without "init.ohasd run" up and running, ohasd.bin will not start:


ps -ef|grep init.ohasd|grep -v grep
root      2279     1  0 18:14 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
Note: Oracle Linux 6 (OL6) or Red Hat Linux 6 (RHEL6) has deprecated inittab; instead, init.ohasd is configured via upstart in /etc/init/oracle-ohasd.conf, however, the process "/etc/init.d/init.ohasd run" should still be up.

If any rc Snn script (located in rcN.d, example S98gcstartup) is stuck, the init process may not start "/etc/init.d/init.ohasd run"; please engage the OS vendor to find out why the relevant Snn script is stuck.

Error "[ohasd(<pid>)] CRS-0715:Oracle High Availability Service has timed out waiting for init.ohasd to be started." may be reported if init.ohasd fails to start on time.

If the SA cannot identify the reason why init.ohasd is not starting, the following can be a very short-term workaround:

 cd <location-of-init.ohasd>
 nohup ./init.ohasd run &


3. Clusterware auto start is enabled - it's enabled by default

By default CRS is enabled for auto start upon node reboot, to enable:

$GRID_HOME/bin/crsctl enable crs

To verify whether it's currently enabled or not:

$GRID_HOME/bin/crsctl config crs

If the following is in OS messages file

Feb 29 16:20:36 racnode1 logger: Oracle Cluster Ready Services startup disabled.
Feb 29 16:20:36 racnode1 logger: Could not access /var/opt/oracle/scls_scr/racnode1/root/ohasdstr

The reason is that the file does not exist or is not accessible; the cause can be that someone modified it manually, or that the wrong opatch was used to apply a GI patch (i.e. opatch for Solaris X64 used to apply a patch on Linux).



4. syslogd is up and OS is able to execute init script S96ohasd

The OS may get stuck on some other Snn script while the node is coming up and thus never get a chance to execute S96ohasd; if that's the case, the following message will not be in OS messages:

Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.

If you don't see the above message, the other possibility is that syslogd (/usr/sbin/syslogd) is not fully up. Grid may fail to come up in that case as well. This may not apply to AIX.

To find out whether OS is able to execute S96ohasd while node is coming up, modify S96ohasd:

From:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        $LOGERR "Oracle HA daemon is enabled for autostart."

To:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        /bin/touch /tmp/ohasd.start."`date`"
        $LOGERR "Oracle HA daemon is enabled for autostart."

After a node reboot, if you don't see /tmp/ohasd.start.timestamp get created, it means the OS got stuck on some other Snn script. If you do see /tmp/ohasd.start.timestamp but not "Oracle HA daemon is enabled for autostart" in messages, likely syslogd is not fully up. In both cases, you will need to engage the System Administrator to find out the issue at the OS level. For the latter case, the workaround is to "sleep" for about 2 minutes; modify ohasd:

From:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        $LOGERR "Oracle HA daemon is enabled for autostart."

To:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        /bin/sleep 120
        $LOGERR "Oracle HA daemon is enabled for autostart."

5. The file system that GRID_HOME resides on is online when init script S96ohasd is executed; once S96ohasd is executed, the following messages should be in the OS messages file:

Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.
..
Jan 20 20:46:57 rac1 logger: exec /ocw/grid/perl/bin/perl -I/ocw/grid/perl/lib /ocw/grid/bin/crswrapexece.pl /ocw/grid/crs/install/s_crsconfig_rac1_env.txt /ocw/grid/bin/ohasd.bin "reboot"


If you see the first line, but not the last line, likely the filesystem containing the GRID_HOME was not online when S96ohasd was executed.


6. Oracle Local Registry (OLR, $GRID_HOME/cdata/${HOSTNAME}.olr) is accessible and valid


ls -l $GRID_HOME/cdata/*.olr
-rw------- 1 root  oinstall 272756736 Feb  2 18:20 rac1.olr

If the OLR is inaccessible or corrupted, ohasd.log will likely have messages similar to the following:


..
2010-01-24 22:59:10.470: [ default][1373676464] Initializing OLR
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:6m':failed in stat OCR file/disk /ocw/grid/cdata/rac1.olr, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.473: [  OCRRAW][1373676464]proprinit: Could not open raw device
2010-01-24 22:59:10.473: [  OCRAPI][1373676464]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 22:59:10.473: [  CRSOCR][1373676464] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2010-01-24 22:59:10.473: [ default][1373676464] OLR initalization failured, rc=26
2010-01-24 22:59:10.474: [ default][1373676464]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR

OR


..
2010-01-24 23:01:46.275: [  OCROSD][1228334000]utread:3: Problem reading buffer 1907f000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprioini: all disks are not OCR/OLR formatted
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprinit: Could not open raw device
2010-01-24 23:01:46.275: [  OCRAPI][1228334000]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 23:01:46.276: [  CRSOCR][1228334000] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage
2010-01-24 23:01:46.276: [ default][1228334000] OLR initalization failured, rc=26
2010-01-24 23:01:46.276: [ default][1228334000]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 23:01:46.277: [ default][1228334000][PANIC] OHASD exiting; Could not init OLR

OR


..
2010-11-07 03:00:08.932: [ default][1] Created alert : (:OHAS00102:) : OHASD is not running as privileged user
2010-11-07 03:00:08.932: [ default][1][PANIC] OHASD exiting: must be run as privileged user

OR


ohasd.bin comes up but the output of "crsctl stat res -t -init" shows no resource, and "ocrconfig -local -manualbackup" fails

OR


..
2010-08-04 13:13:11.102: [   CRSPE][35] Resources parsed
2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has been registered with the PE data model
2010-08-04 13:13:11.103: [   CRSPE][35] STARTUPCMD_REQ = false:
2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has changed state from [Invalid/unitialized] to [VISIBLE]
2010-08-04 13:13:11.103: [  CRSOCR][31] Multi Write Batch processing...
2010-08-04 13:13:11.103: [ default][35] Dump State Starting ...
..
2010-08-04 13:13:11.112: [   CRSPE][35] SERVERS:
:VISIBLE:address{{Absolute|Node:0|Process:-1|Type:1}}; recovered state:VISIBLE. Assigned to no pool

------------- SERVER POOLS:
Free [min:0][max:-1][importance:0] NO SERVERS ASSIGNED

2010-08-04 13:13:11.113: [   CRSPE][35] Dumping ICE contents...:ICE operation count: 0
2010-08-04 13:13:11.113: [ default][35] Dump State Done.


The solution is to restore a good backup of OLR with "ocrconfig -local -restore <ocr_backup_name>".
By default, OLR will be backed up to $GRID_HOME/cdata/$HOST/backup_$TIME_STAMP.olr once installation is complete.
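
A short sketch of checking and restoring the OLR from that default backup location, as root (the node name racnode1 and the backup file name below are illustrative; list $GRID_HOME/cdata/<host> to find the real one):

   # ocrcheck -local
   # ls -l $GRID_HOME/cdata/racnode1/
   # crsctl stop crs -f
   # ocrconfig -local -restore $GRID_HOME/cdata/racnode1/backup_20100126_175649.olr
   # crsctl start crs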

7. ohasd.bin is able to access network socket files:


2010-06-29 10:31:01.570: [ COMMCRS][1206901056]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))

2010-06-29 10:31:01.571: [  OCRSRV][1217390912]th_listen: CLSCLISTEN failed clsc_ret= 3, addr= [(ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))]
2010-06-29 10:31:01.571: [  OCRSRV][3267002960]th_init: Local listener did not reach valid state

In Grid Infrastructure cluster environment, ohasd related socket files should be owned by root, but in Oracle Restart environment, they should be owned by grid user, refer to "Network Socket File Location, Ownership and Permission" section for example output.
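
A hedged way to spot-check the socket files (on Linux they normally live under /var/tmp/.oracle; the ohasd-related sockets, e.g. those whose names contain "procr_local_conn", should be owned by root in a cluster environment):

   ls -l /var/tmp/.oracle | grep -i procr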

8. ohasd.bin is able to access log file location:

OS messages/syslog shows:

Feb 20 10:47:08 racnode1 OHASD[9566]: OHASD exiting; Directory /ocw/grid/log/racnode1/ohasd not found.

Refer to "Log File Location, Ownership and Permission" section for example output, if the expected directory is missing, create it with proper ownership and permission.

9. ohasd may fail to start on SUSE Linux after a node reboot, refer to note 1325718.1 - OHASD not Starting After Reboot on SLES

10. OHASD fails to start, "ps -ef| grep ohasd.bin" shows ohasd.bin is started, but nothing in $GRID_HOME/log/<node>/ohasd/ohasd.log for many minutes, truss shows it is looping to close non-opened file handles:


..
15058/1:         0.1995 close(2147483646)                               Err#9 EBADF
15058/1:         0.1996 close(2147483645)                               Err#9 EBADF
..

Call stack of ohasd.bin from pstack shows the following:

_close  sclssutl_closefiledescriptors  main ..

The cause is bug 11834289 which is fixed in 11.2.0.3 and above; other symptoms of the bug are that clusterware processes may fail to start with the same call stack and truss output (looping on OS call "close"). If the bug happens when trying to start other resources, "CRS-5802: Unable to start the agent process" could show up as well.

11. Other potential causes/solutions listed in note 1069182.1 - OHASD Failed to Start: Inappropriate ioctl for device

12. ohasd.bin started fine, however, "crsctl check crs" shows only the following and nothing else:

CRS-4638: Oracle High Availability Services is online
And "crsctl stat res -p -init" shows nothing

The cause is that OLR is corrupted, refer to note 1193643.1 to restore.

13. On EL7/OL7: note 1959008.1 - Install of Clusterware fails while running root.sh on OL7 - ohasd fails to start 

14. For EL7/OL7, patch 25606616 is needed: TRACKING BUG TO PROVIDE GI FIXES FOR OL7

15. If ohasd still fails to start, refer to ohasd.log in <grid-home>/log/<nodename>/ohasd/ohasd.log and ohasdOUT.log
 



Case 2: OHASD Agents do not start

OHASD.BIN will spawn four agents/monitors to start resources:

  oraagent: responsible for ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd etc
  orarootagent: responsible for ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs etc
  cssdagent / cssdmonitor: responsible for ora.cssd(for ocssd.bin) and ora.cssdmonitor(for cssdmonitor itself)

If ohasd.bin cannot start any of the above agents properly, the clusterware will not come to a healthy state.

1. Common causes of agent failure are that the log file or log directory for the agents don't have proper ownership or permission.

Refer to below section "Log File Location, Ownership and Permission" for general reference.

One example is that "rootcrs.pl -patch/-postpatch" wasn't executed while patching manually, resulting in agent start failure:

2015-02-25 15:43:54.350806 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/orarootagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/orarootagent]

2015-02-25 15:43:54.382154 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/cssdagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdagent]

2015-02-25 15:43:54.384105 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/cssdmonitor Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdmonitor]


The solution is to execute the missed steps.
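
A sketch of executing the missed step manually, using the same commands shown for Issue #4 solution 1 earlier in this note, as root (this stops the stack, resets ownership/permission of the required files and restarts the stack):

   # cd <GRID_HOME>/crs/install
   # ./rootcrs.pl -patch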



2. If agent binary (oraagent.bin or orarootagent.bin etc) is corrupted, agent will not start resulting in related resources not coming up:

2011-05-03 11:11:13.189
[ohasd(25303)]CRS-5828:Could not start agent '/ocw/grid/bin/orarootagent_grid'. Details at (:CRSAGF00130:) {0:0:2} in /ocw/grid/log/racnode1/ohasd/ohasd.log.


2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Created alert : (:CRSAGF00130:) :  Failed to start the agent /ocw/grid/bin/orarootagent_grid
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_START[ora.diskmon 1 1] ID 4098:403
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Can not stop the agent: /ocw/grid/bin/orarootagent_grid because pid is not initialized
..
2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} Fatal Error from AGFW Proxy: Unable to start the agent process
2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} CRS-2674: Start of 'ora.diskmon' on 'racnode1' failed

..

2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdagent]
2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00126:) :  Agent start failed
..
2011-06-27 22:34:57.806: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdmonitor Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdmonitor]

The solution is to compare agent binary with a "good" node, and restore a good copy.

 truss/strace of ohasd shows agent binary is corrupted
32555 17:38:15.953355 execve("/ocw/grid/bin/orarootagent.bin",
["/opt/grid/product/112020/grid/bi"...], [/* 38 vars */]) = 0
..
32555 17:38:15.954151 --- SIGBUS (Bus error) @ 0 (0) ---  

3. Agent may fail to start due to bug 11834289 with error "CRS-5802: Unable to start the agent process", refer to Section "OHASD does not start"  #10 for details.

4. Refer to: note 1964240.1 - CRS-5823:Could not initialize agent framework

 

Case 3: OCSSD.BIN does not start

Successful ocssd.bin startup depends on the following:

1. GPnP profile is accessible - gpnpd needs to be fully up to serve profile

If ocssd.bin is able to get the profile successfully, ocssd.log will likely have messages similar to the following:

2010-02-02 18:00:16.251: [    GPnP][408926240]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "ipc://GPNPD_rac1", try 4 of 500...
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileVerifyForCall: [at clsgpnp.c:1867] Result: (87) CLSGPNP_SIG_VALPEER. Profile verified.  prf=0x165160d0
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileGetSequenceRef: [at clsgpnp.c:841] Result: (0) CLSGPNP_OK. seq of p=0x165160d0 is '6'=6
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2186] Result: (0) CLSGPNP_OK. Successful get-profile CALL to remote "ipc://GPNPD_rac1" disco ""

Otherwise messages like the following will show in ocssd.log:

2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1100] GIPC gipcretConnectionRefused (29) gipcConnect(ipc-ipc://GPNPD_rac1)
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1101] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "ipc://GPNPD_rac1"
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnp_getProfileEx: [at clsgpnp.c:546] Result: (13) CLSGPNP_NO_DAEMON. Can't get GPnP service profile from local GPnP daemon
2010-02-03 22:26:17.057: [ default][3852126240]Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2010-02-03 22:26:17.057: [    CSSD][3852126240]clsgpnp_getProfile failed, rc(13)
The solution is to ensure gpnpd is up and running properly.
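
A quick sketch of confirming gpnpd before retrying, as the grid user:

   ps -ef | grep gpnpd.bin | grep -v grep
   $GRID_HOME/bin/crsctl stat res ora.gpnpd -init -t
   $GRID_HOME/bin/crsctl start res ora.gpnpd -init      (only if it is not ONLINE)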


2. Voting Disk is accessible

In 11gR2, ocssd.bin discovers voting disks using the setting from the GPnP profile; if not enough voting disks can be identified, ocssd.bin will abort itself.

2010-02-03 22:37:22.212: [    CSSD][2330355744]clssnmReadDiscoveryProfile: voting file discovery string(/share/storage/di*)
..
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvDiskVerify: Successful discovery of 0 disks
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvFindInitialConfigs: No voting files found
2010-02-03 22:37:22.228: [    CSSD][1145538880]###################################
2010-02-03 22:37:22.228: [    CSSD][1145538880]clssscExit: CSSD signal 11 in thread clssnmvDDiscThread

ocssd.bin may fail to come up with the following error if all nodes failed while a voting file change was in progress:

2010-05-02 03:11:19.033: [    CSSD][1197668093]clssnmCompleteInitVFDiscovery: Detected voting file add in progress for CIN 0:1134513465:0, waiting for configuration to complete 0:1134513098:0

The solution is to start ocssd.bin in exclusive mode with note 1364971.1
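
A sketch of the exclusive-mode recovery (the diskgroup name +OCRVOTE is a placeholder; note 1364971.1 has the full procedure), as root:

   # crsctl start crs -excl
   # crsctl query css votedisk
   # crsctl replace votedisk +OCRVOTE
   # crsctl stop crs -f
   # crsctl start crs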


If the voting disk is located on a non-ASM device, ownership and permissions should be:

-rw-r----- 1 ogrid oinstall 21004288 Feb  4 09:13 votedisk1

3. Network is functional and name resolution is working:

If ocssd.bin can't bind to any network, the ocssd.log will likely have messages like the following:

2010-02-03 23:26:25.804: [GIPCXCPT][1206540320]gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet, ret gipcretFail (1)
2010-02-03 23:26:25.804: [GIPCGMOD][1206540320]gipcmodGipcPassInitializeNetwork: EXCEPTION[ ret gipcretFail (1) ]  failed to determine host from clsinet, using default
..
2010-02-03 23:26:25.810: [    CSSD][1206540320]clsssclsnrsetup: gipcEndpoint failed, rc 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssnmOpenGIPCEndp: failed to listen on gipc addr gipc://rac1:nm_eotcs- ret 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssscmain: failed to open gipc endp


If there's a connectivity issue on the private network (including multicast being off), the ocssd.log will likely have messages like the following:

2010-09-20 11:52:54.014: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 453, LATS 328297844, lastSeqNo 452, uniqueness 1284979488, timestamp 1284979973/329344894
2010-09-20 11:52:54.016: [    CSSD][1078421824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
..  >>>> after a long delay
2010-09-20 12:02:39.578: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 1037, LATS 328883434, lastSeqNo 1036, uniqueness 1284979488, timestamp 1284980558/329930254
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmExecuteClientRequest: MAINT recvd from proc 2 (0xe1ad870)
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmShutDown: Received abortive shutdown request from client.
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssscExit: CSSD aborting from thread GMClientListener
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################

To validate network, please refer to note 1054902.1
Please also check whether the network interface name matches the gpnp profile definition ("gpnptool get") for cluster_interconnect if CSSD could not start after a network change.

In 11.2.0.1, ocssd.bin may bind to public network if private network is unavailable
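
A hedged check of the interconnect definition against the OS: dump the profile with gpnptool and compare the cluster_interconnect adapter and subnet with what the OS actually has:

   $GRID_HOME/bin/gpnptool get 2>/dev/null | grep -i cluster_interconnect
   /sbin/ifconfig -a      (or "ip addr" on newer Linux; the interface name and subnet must match the profile)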

4. Vendor clusterware is up (if using vendor clusterware)

Grid Infrastructure provides full clusterware functionality and doesn't need vendor clusterware to be installed; but if you happen to have Grid Infrastructure on top of vendor clusterware in your environment, then the vendor clusterware needs to come up fully before CRS can be started. To verify, as the grid user:

$GRID_HOME/bin/lsnodes -n
racnode1    1
racnode1    0

If vendor clusterware is not fully up, ocssd.log will likely have messages similar to the following:

2010-08-30 18:28:13.207: [    CSSD][36]clssnm_skgxninit: skgxncin failed, will retry
2010-08-30 18:28:14.207: [    CSSD][36]clssnm_skgxnmon: skgxn init failed
2010-08-30 18:28:14.208: [    CSSD][36]###################################
2010-08-30 18:28:14.208: [    CSSD][36]clssscExit: CSSD signal 11 in thread skgxnmon

Before the clusterware is installed, execute the command below as grid user:

$INSTALL_SOURCE/install/lsnodes -v
 

One issue on hp-ux: note 2130230.1 - Grid infrastructure startup fails due to vendor Clusterware did not start (HP-UX Service guard)

 

5. Command "crsctl" being executed from wrong GRID_HOME

Command "crsctl" must be executed from correct GRID_HOME to start the stack, or similar message will be reported:

2012-11-14 10:21:44.014: [    CSSD][1086675264]ASSERT clssnm1.c 3248
2012-11-14 10:21:44.014: [    CSSD][1086675264](:CSSNM00056:)clssnmvStartDiscovery: Terminating because of the release version(11.2.0.2.0) of this node being lesser than the active version(11.2.0.3.0) that the cluster is at
2012-11-14 10:21:44.014: [    CSSD][1086675264]###################################
2012-11-14 10:21:44.014: [    CSSD][1086675264]clssscExit: CSSD aborting from thread clssnmvDDiscThread#
 

Case 4: CRSD.BIN does not start
If the "crsctl stat res -t -init" shows that ora.crsd is in intermediate state and if this is not the first node where crsd is starting, then a likely cause is that the csrd.bin is not able to talk to the master crsd.bin.
In this case, the master crsd.bin is likely having a problem, so killing the master crsd.bin is a likely solution. 
Issue "grep MASTER crsd.trc" to find out the node where the master crsd.bin is running.  Kill the crsd.bin on that master node.
The crsd.bin will automatically respawn although the master will be transferred to crsd.bin on another node.
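
A sketch of that sequence (depending on version, the CRSD trace is crsd.trc in the 12c trace location or crsd.log under $GRID_HOME/log/<node>/crsd in 11.2), as root on each node:

   grep MASTER <crsd trace/log file>
   ps -ef | grep crsd.bin | grep -v grep
   kill <pid of crsd.bin on the master node>      (crsd.bin respawns automatically)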


Successful crsd.bin startup depends on the following:

1. ocssd is fully up

If ocssd.bin is not fully up, crsd.log will show messages like the following:

2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clssscConnect: gipc request failed with 29 (0x16)
2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clsssInitNative: connect failed, rc 29
2010-02-03 22:37:51.639: [  CRSRTI][1548456880] CSS is not ready. Received status 3 from CSS. Waiting for good status ..


2. OCR is accessible

If the OCR is located on ASM, the ora.asm resource (ASM instance) must be up and the diskgroup for the OCR must be mounted; if not, crsd.log will likely show messages like:

2010-02-03 22:22:55.186: [  OCRASM][2603807664]proprasmo: Error in open/create file in dg [GI]
[  OCRASM][2603807664]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup

2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: kgfoCheckMount returned [7]
2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: The ASM instance is down
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: Failed to open [+GI]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: No OCR/OLR devices are usable
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprinit: Could not open raw device
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRAPI][2603807664]a_init:16!: Backend init unsuccessful : [26]
2010-02-03 22:22:55.190: [  CRSOCR][2603807664] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
] [7]
2010-02-03 22:22:55.190: [    CRSD][2603807664][PANIC] CRSD exiting: Could not init OCR, code: 26

Note: in 11.2 ASM starts before crsd.bin, and brings up the diskgroup automatically if it contains the OCR.
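
A hedged check that ASM is up and the OCR diskgroup is mounted (the SID +ASM1 is a placeholder), as the grid user:

   $GRID_HOME/bin/crsctl stat res ora.asm -init -t
   export ORACLE_SID=+ASM1; $GRID_HOME/bin/asmcmd lsdg      (the OCR diskgroup should be listed as mounted)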

If the OCR is located on a non-ASM device, expected ownership and permissions are:

-rw-r----- 1 root  oinstall  272756736 Feb  3 23:24 ocr

If the OCR is located on a non-ASM device and it's unavailable, crsd.log will likely show messages similar to the following:

2010-02-03 23:14:33.583: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:33.583: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:33.583: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:6m':failed in stat OCR file/disk /share/storage/ocr, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:34.587: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:35.589: [    CRSD][2346668976][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26


If the OCR is corrupted, likely crsd.log will show messages like the following:

2010-02-03 23:19:38.417: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]propriogid:1_2: INVALID FORMAT
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprioini: all disks are not OCR/OLR formatted
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprinit: Could not open raw device
2010-02-03 23:19:39.429: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:40.432: [    CRSD][3360863152][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26


If the owner or group of the grid user got changed, even though ASM is available, crsd.log will likely show the following:

2010-03-10 11:45:12.510: [  OCRASM][611467760]proprasmo: Error in open/create file in dg [SYSTEMDG]
[  OCRASM][611467760]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges

2010-03-10 11:45:12.528: [  OCRASM][611467760]proprasmo: kgfoCheckMount returned [7]
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmo: The ASM instance is down
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: Failed to open [+SYSTEMDG]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: No OCR/OLR devices are usable
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprinit: Could not open raw device
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRAPI][611467760]a_init:16!: Backend init unsuccessful : [26]
2010-03-10 11:45:12.530: [  CRSOCR][611467760] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges
] [7]


If the oracle binary in GRID_HOME has wrong ownership or permission (regardless of whether ASM is up and running), or if the grid user cannot write in ORACLE_BASE, crsd.log will likely show the following:

2012-03-04 21:34:23.139: [  OCRASM][3301265904]proprasmo: Error in open/create file in dg [OCR]
[  OCRASM][3301265904]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=12547, loc=kgfokge

2012-03-04 21:34:23.139: [  OCRASM][3301265904]ASM Error Stack : ORA-12547: TNS:lost contact

2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: kgfoCheckMount returned [7]
2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: The ASM instance is down
2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: Failed to open [+OCR]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: No OCR/OLR devices are usable
2012-03-04 21:34:23.635: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL
2012-03-04 21:34:23.636: [    GIPC][3301265904] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5326]
2012-03-04 21:34:23.639: [ default][3301265904]clsvactversion:4: Retrieving Active Version from local storage.
2012-03-04 21:34:23.643: [  OCRRAW][3301265904]proprrepauto: The local OCR configuration matches with the configuration published by OCR Cache Writer. No repair required.
2012-03-04 21:34:23.645: [  OCRRAW][3301265904]proprinit: Could not open raw device
2012-03-04 21:34:23.646: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL
2012-03-04 21:34:23.650: [  OCRAPI][3301265904]a_init:16!: Backend init unsuccessful : [26]
2012-03-04 21:34:23.651: [  CRSOCR][3301265904] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage
ORA-12547: TNS:lost contact

2012-03-04 21:34:23.652: [ CRSMAIN][3301265904] Created alert : (:CRSD00111:) :  Could not init OCR, error: PROC-26: Error while accessing the physical storage
ORA-12547: TNS:lost contact

2012-03-04 21:34:23.652: [    CRSD][3301265904][PANIC] CRSD exiting: Could not init OCR, code: 26

The expected ownership and permission of oracle binary in GRID_HOME should be:

-rwsr-s--x 1 grid oinstall 184431149 Feb  2 20:37 /ocw/grid/bin/oracle

If the OCR or its mirror is unavailable (it could be that ASM is up, but the diskgroup for the OCR/mirror is unmounted), crsd.log will likely show the following:

2010-05-11 11:16:38.578: [  OCRASM][18]proprasmo: Error in open/create file in dg [OCRMIR]
[  OCRASM][18]SLOS : SLOS: cat=8, opn=kgfoOpenFile01, dep=15056, loc=kgfokge
ORA-17503: ksfdopn:DGOpenFile05 Failed to open file +OCRMIR.255.4294967295
ORA-17503: ksfdopn:2 Failed to open file +OCRMIR.255.4294967295
ORA-15001: diskgroup "OCRMIR
..
2010-05-11 11:16:38.647: [  OCRASM][18]proprasmo: kgfoCheckMount returned [6]
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmo: The ASM disk group OCRMIR is not found or not mounted
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmdvch: Failed to open OCR location [+OCRMIR] error [26]
2010-05-11 11:16:38.648: [  OCRRAW][18]propriodvch: Error  [8] returned device check for [+OCRMIR]
2010-05-11 11:16:38.648: [  OCRRAW][18]dev_replace: non-master could not verify the new disk (8)
[  OCRSRV][18]proath_invalidate_action: Failed to replace [+OCRMIR] [8]
[  OCRAPI][18]procr_ctx_set_invalid_no_abort: ctx set to invalid
..
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91: Comparing device hash ids between local and master failed
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Local dev (1862408427, 1028247821, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Master dev (1862408427, 1859478705, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:9: Shutdown CacheLocal. my hash ids don't match
[  OCRAPI][19]procr_ctx_set_invalid_no_abort: ctx set to invalid
[  OCRAPI][19]procr_ctx_set_invalid: aborting...
2010-05-11 11:16:46.587: [    CRSD][19] Dump State Starting ...


3. crsd.bin pid file exists and points to running crsd.bin process

If the pid file does not exist, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have messages similar to the following:


2010-02-14 17:40:57.927: [ora.crsd][1243486528] [check] PID FILE doesn't exist.
..
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Creating PID [30269] file for home /ocw/grid host racnode1 bin crs to /ocw/grid/crs/init/
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Error3 -2 writing PID [30269] to the file []
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Failed to record pid for CRSD
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Terminating process
2010-02-14 17:41:57.927: [ default][1092499776] CRSD exiting on stop request from clsdms_thdmai

The solution is to create a dummy pid file ($GRID_HOME/crs/init/$HOST.pid) manually as the grid user with the "touch" command and restart resource ora.crsd

If the pid file does exist and the PID in this file references a running process which is NOT the crsd.bin process, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have messages similar to the following:

2011-04-06 15:53:38.777: [ora.crsd][1160390976] [check] PID will be looked for in /ocw/grid/crs/init/racnode1.pid
2011-04-06 15:53:38.778: [ora.crsd][1160390976] [check] PID which will be monitored will be 1535                               >> 1535 is output of "cat /ocw/grid/crs/init/racnode1.pid"
2011-04-06 15:53:38.965: [ COMMCRS][1191860544]clsc_connect: (0x2aaab400b0b0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD))
[  clsdmc][1160390976]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD)) with status 9
2011-04-06 15:53:38.966: [ora.crsd][1160390976] [check] Error = error 9 encountered when connecting to CRSD
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Calling PID check for daemon
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Trying to check PID = 1535
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] PID check returned ONLINE CLSDM returned OFFLINE
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] DaemonAgent::check returned 5
2011-04-06 15:53:39.203: [    AGFW][1160390976] check for resource: ora.crsd 1 1 completed with status: FAILED
2011-04-06 15:53:39.203: [    AGFW][1170880832] ora.crsd 1 1 state changed from: UNKNOWN to: FAILED
..
2011-04-06 15:54:10.511: [    AGFW][1167522112] ora.crsd 1 1 state changed from: UNKNOWN to: CLEANING
..
2011-04-06 15:54:10.513: [ora.crsd][1146542400] [clean] Trying to stop PID = 1535
..
2011-04-06 15:54:11.514: [ora.crsd][1146542400] [clean] Trying to check PID = 1535


To verify on OS level:

ls -l /ocw/grid/crs/init/*pid
-rwxr-xr-x 1 ogrid oinstall 5 Feb 17 11:00 /ocw/grid/crs/init/racnode1.pid
cat /ocw/grid/crs/init/*pid
1535
ps -ef| grep 1535
root      1535     1  0 Mar30 ?        00:00:00 iscsid                  >> Note process 1535 is not crsd.bin

The solution is to create an empty pid file and to restart the resource ora.crsd, as root:


# > $GRID_HOME/crs/init/<racnode1>.pid
# $GRID_HOME/bin/crsctl stop res ora.crsd -init
# $GRID_HOME/bin/crsctl start res ora.crsd -init


4. Network is functional and name resolution is working:

If the network is not fully functioning, ocssd.bin may still come up, but crsd.bin may fail and the crsd.log will show messages like:


2010-02-03 23:34:28.412: [    GPnP][2235814832]clsgpnp_Init: [at clsgpnp0.c:837] GPnP client pid=867, tl=3, f=0
2010-02-03 23:34:28.428: [  OCRAPI][2235814832]clsu_get_private_ip_addresses: no ip addresses found.
..
2010-02-03 23:34:28.434: [  OCRAPI][2235814832]a_init:13!: Clusterware init unsuccessful : [44]
2010-02-03 23:34:28.434: [  CRSOCR][2235814832] OCR context init failure.  Error: PROC-44: Error in network address and interface operations Network address and interface operations error [7]
2010-02-03 23:34:28.434: [    CRSD][2235814832][PANIC] CRSD exiting: Could not init OCR, code: 44

Or:


2009-12-10 06:28:31.974: [  OCRMAS][20]proath_connect_master:1: could not connect to master  clsc_ret1 = 9, clsc_ret2 = 9
2009-12-10 06:28:31.974: [  OCRMAS][20]th_master:11: Could not connect to the new master
2009-12-10 06:29:01.450: [ CRSMAIN][2] Policy Engine is not initialized yet!
2009-12-10 06:29:31.489: [ CRSMAIN][2] Policy Engine is not initialized yet!

Or:


2009-12-31 00:42:08.110: [ COMMCRS][10]clsc_receive: (102b03250) Error receiving, ns (12535, 12560), transport (505, 145, 0)

To validate the network, please refer to note 1054902.1

5. crsd executable (crsd.bin and crsd in GRID_HOME/bin) has correct ownership/permission and hasn't been manually modified; a simple way to check is to compare the output of "ls -l <grid-home>/bin/crsd <grid-home>/bin/crsd.bin" with a "good" node.


6. crsd may not start due to the following:

note 1552472.1 -CRSD Will Not Start Following a Node Reboot: crsd.log reports: clsclisten: op 65 failed and/or Unable to get E2E port
note 1684332.1 - GI crsd Fails to Start: clsclisten: op 65 failed, NSerr (12560, 0), transport: (583, 0, 0)

 

7. To troubleshoot further, refer to note 1323698.1 - Troubleshooting CRSD Start up Issue
 

Case 5: GPNPD.BIN does not start
1. Name Resolution is not working

gpnpd.bin fails with following error in gpnpd.log:


2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "tcp://node2:9393", try 1 of 3...
2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1015] ENTRY
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1066] GIPC gipcretFail (1) gipcConnect(tcp-tcp://node2:9393)
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1067] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "tcp://node2:9393"

In the above example, please make sure the current node is able to ping "node2", and that there is no firewall between them.
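
A quick name-resolution and connectivity sketch for the example above (run from the current node; "node2" is the peer reported in the error):

   ping -c 3 node2
   nslookup node2            (or: getent hosts node2, to check /etc/hosts as well)
   # service iptables status      (Linux only; ensure no firewall blocks tcp port 9393 between the nodes)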

2. Bug 10105195

Due to Bug 10105195, gpnp dispatch is single-threaded and can be blocked by network scanning etc. The bug is fixed in 11.2.0.2 GI PSU2, 11.2.0.3 and above; refer to note 10105195.8 for more details.


Case 6: Various other daemons do not start
Common causes:

1. Log file or directory for the daemon doesn't have appropriate ownership or permission

If the log file or log directory for the daemon doesn't have proper ownership or permissions, usually there is no new info in the log file and the timestamp remains the same while the daemon tries to come up.

Refer to below section "Log File Location, Ownership and Permission" for general reference.


2. Network socket file doesn't have appropriate ownership or permission

In this case, the daemon log will show messages like:

2010-02-02 12:55:20.485: [ COMMCRS][1121433920]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))

2010-02-02 12:55:20.485: [  clsdmt][1110944064]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))



3. OLR is corrupted

In this case, the daemon log will show messages like (this is a case that ora.ctssd fails to start):

2012-07-22 00:15:16.565: [ default][1]clsvactversion:4: Retrieving Active Version from local storage.
2012-07-22 00:15:16.575: [    CTSS][1]clsctss_r_av3: Invalid active version [] retrieved from OLR. Returns [19].
2012-07-22 00:15:16.585: [    CTSS][1](:ctss_init16:): Error [19] retrieving active version. Returns [19].
2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS init failed [19]
2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS daemon aborting [19].
2012-07-22 00:15:16.585: [    CTSS][1]CTSS daemon aborting


 
The solution is to restore a good copy of the OLR, see note 1193643.1

 

4.  Other cases:

note 1087521.1 - CTSS Daemon Aborting With "op 65 failed, NSerr (12560, 0), transport: (583, 0, 0)" 

 

Case 7: CRSD Agents do not start

CRSD.BIN will spawn two agents to start up user resources - the two agents share the same names and binaries as the ohasd.bin agents:

  orarootagent: responsible for ora.netn.network, ora.nodename.vip, ora.scann.vip and  ora.gns
  oraagent: responsible for ora.asm, ora.eons, ora.ons, listener, SCAN listener, diskgroup, database, service resource etc

To find out the user resource status:

$GRID_HOME/bin/crsctl stat res -t


If crsd.bin can not start any of the above agents properly, user resources may not come up. 

1. A common cause of agent failure is that the log file or log directory for the agents don't have proper ownership or permissions.

Refer to below section "Log File Location, Ownership and Permission" for general reference.

2. Agent may fail to start due to bug 11834289 with error "CRS-5802: Unable to start the agent process", refer to Section "OHASD does not start"  #10 for details.


Case 8: HAIP does not start
HAIP may fail to start with various errors, e.g.

[ohasd(891)]CRS-2807:Resource 'ora.cluster_interconnect.haip' failed to start automatically.
Refer to note 1210883.1 for more details of HAIP 

Network and Naming Resolution Verification

CRS depends on a fully functional network and name resolution. If the network or name resolution is not fully functioning, CRS may not come up successfully.

To validate network and name resolution setup, please refer to note 1054902.1


Log File Location, Ownership and Permission

Appropriate ownership and permission of sub-directories and files in $GRID_HOME/log is critical for CRS components to come up properly.

In Grid Infrastructure cluster environment:
Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and two separate RDBMS owner rdbmsap and rdbmsar, here's what it looks like under $GRID_HOME/log in cluster environment:


drwxrwxr-x 5 grid oinstall 4096 Dec  6 09:20 log
  drwxr-xr-x  2 grid oinstall 4096 Dec  6 08:36 crs
  drwxr-xr-t 17 root   oinstall 4096 Dec  6 09:22 rac1
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:20 admin
    drwxrwxr-t 4 root   oinstall  4096 Dec  6 09:20 agent
      drwxrwxrwt 7 root    oinstall 4096 Jan 26 18:15 crsd
        drwxr-xr-t 2 grid  oinstall 4096 Dec  6 09:40 application_grid
        drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 oraagent_grid
        drwxr-xr-t 2 rdbmsap oinstall 4096 Jan 26 18:15 oraagent_rdbmsap
        drwxr-xr-t 2 rdbmsar oinstall 4096 Jan 26 18:15 oraagent_rdbmsar
        drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 ora_oc4j_type_grid
        drwxr-xr-t 2 root    root     4096 Jan 26 20:09 orarootagent_root
      drwxrwxr-t 6 root oinstall 4096 Dec  6 09:24 ohasd
        drwxr-xr-t 2 grid oinstall 4096 Jan 26 18:14 oraagent_grid
        drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdagent_root
        drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdmonitor_root
        drwxr-xr-t 2 root   root     4096 Jan 26 18:14 orarootagent_root    
    -rw-rw-r-- 1 root root     12931 Jan 26 21:30 alertrac1.log
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:44 client
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 crsd
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:24 cssd
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 ctssd
    drwxr-x--- 2 grid oinstall  4096 Jan 26 18:14 diskmon
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:25 evmd     
    drwxr-x--- 2 grid oinstall  4096 Jan 26 21:20 gipcd     
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:20 gnsd      
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:58 gpnpd    
    drwxr-x--- 2 grid oinstall  4096 Jan 26 21:19 mdnsd    
    drwxr-x--- 2 root oinstall  4096 Jan 26 21:20 ohasd     
    drwxrwxr-t 5 grid oinstall  4096 Dec  6 09:34 racg       
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgmain
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:57 srvm        
Please note that most log files in a sub-directory inherit the ownership of the parent directory; the listing above is just a general reference to tell whether there have been unexpected recursive ownership and permission changes inside the CRS home. If you have a working node with the same version, use that node as a reference.
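As a sketch, the following can flag anything under $GRID_HOME/log that is not owned by the expected users or group (rac1, grid and oinstall are the example names used above; adjust to your environment):

find $GRID_HOME/log/rac1 ! -user grid ! -user root -ls
find $GRID_HOME/log/rac1 ! -group oinstall ! -group root -ls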


In an Oracle Restart environment:
Here's what it looks like under $GRID_HOME/log in an Oracle Restart environment:

drwxrwxr-x 5 grid oinstall 4096 Oct 31  2009 log
  drwxr-xr-x  2 grid oinstall 4096 Oct 31  2009 crs
  drwxr-xr-x  3 grid oinstall 4096 Oct 31  2009 diag
  drwxr-xr-t 17 root   oinstall 4096 Oct 31  2009 rac1
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 admin
    drwxrwxr-t 4 root   oinstall  4096 Oct 31  2009 agent
      drwxrwxrwt 2 root oinstall 4096 Oct 31  2009 crsd
      drwxrwxr-t 8 root oinstall 4096 Jul 14 08:15 ohasd
        drwxr-xr-x 2 grid oinstall 4096 Aug  5 13:40 oraagent_grid
        drwxr-xr-x 2 grid oinstall 4096 Aug  2 07:11 oracssdagent_grid
        drwxr-xr-x 2 grid oinstall 4096 Aug  3 21:13 orarootagent_grid
    -rwxr-xr-x 1 grid oinstall 13782 Aug  1 17:23 alertrac1.log
    drwxr-x--- 2 grid oinstall  4096 Nov  2  2009 client
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 crsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 cssd
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 ctssd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 diskmon
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 evmd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 gipcd
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 gnsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 gpnpd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 mdnsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 ohasd
    drwxrwxr-t 5 grid oinstall  4096 Oct 31  2009 racg
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgmain
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 srvm
 

For 12.1.0.2 onward, refer to note 1915729.1 - Oracle Clusterware Diagnostic and Alert Log Moved to ADR
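For those releases the clusterware traces can be browsed with adrci; a sketch, assuming the usual diag/crs/<node>/crs home path under the ADR base (the node name is an example):

adrci exec="show homes"
adrci exec="set home diag/crs/rac1/crs; show alert -tail 50"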

 
Network Socket File Location, Ownership and Permission

Network socket files can be located in /tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle

When a socket file has unexpected ownership or permissions, the daemon log file (e.g. evmd.log) will usually show the following:


2011-06-18 14:07:28.545: [ COMMCRS][772]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_EVMD))

2011-06-18 14:07:28.545: [  clsdmt][515]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=lexxxDBG_EVMD))
2011-06-18 14:07:28.545: [  clsdmt][515]Terminating process
2011-06-18 14:07:28.559: [ default][515] EVMD exiting on stop request from clsdms_thdmai


And the following error may be reported:


CRS-5017: The resource action "ora.evmd start" encountered the following error:
CRS-2674: Start of 'ora.evmd' on 'racnode1' failed
..

The solution is to stop GI as root (crsctl stop crs -f), clean up socket files and restart GI.
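A minimal cleanup sketch (run as root on the affected node only, with the stack fully stopped; the directory is /var/tmp/.oracle here but may be /tmp/.oracle or /usr/tmp/.oracle on your platform, and moving the files aside is safer than deleting them outright):

crsctl stop crs -f
mkdir /var/tmp/oracle_socket_backup
mv /var/tmp/.oracle/* /var/tmp/oracle_socket_backup/
crsctl start crs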


Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and cluster name eotcs.

In a Grid Infrastructure cluster environment:
Below is an example output from a cluster environment:


drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle

./.oracle:
drwxrwxrwt 2 root  oinstall 4096 Feb  2 21:25 .
srwxrwx--- 1 grid oinstall    0 Feb  2 18:00 master_diskmon
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 mdnsd
-rw-r--r-- 1 grid oinstall    5 Feb  2 18:00 mdnsd.pid
prw-r--r-- 1 root  root        0 Feb  2 13:33 npohasd
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 ora_gipc_GPNPD_rac1
-rw-r--r-- 1 grid oinstall    0 Feb  2 13:34 ora_gipc_GPNPD_rac1_lock
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sAevm
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sCevm
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_UI_SOCKET
srwxrwxrwx 1 root  root        0 Feb  2 21:25 srac1DBG_CRSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_CSSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_CTSSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_EVMD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GIPCD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GPNPD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_MDNSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN3
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs_lock
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1_lock
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sora_crsqs
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROC
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROL
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sSYSTEM.evm.acceptor.auth
 

In an Oracle Restart environment:
Below is an example output from an Oracle Restart environment:


drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle

./.oracle:
srwxrwx--- 1 grid oinstall 0 Aug  1 17:23 master_diskmon
prw-r--r-- 1 grid oinstall 0 Oct 31  2009 npohasd
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.1
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.2
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.1
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.2
srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.1
srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.2
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.1
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.2
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.1
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.2
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sCRSD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_CSSD
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sEXTPROC1521
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost_lock
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1_lock
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sprocr_local_conn_0_PROL



Diagnostic file collection

If the issue can't be identified with this note, run $GRID_HOME/bin/diagcollection.sh as root on all nodes, and upload all of the .gz files it generates in the current directory.
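A minimal run sketch (the archive names vary by version; this assumes a writable working directory such as /tmp):

# As root, on every node
cd /tmp
$GRID_HOME/bin/diagcollection.sh
ls -l /tmp/*.gz      # collect and upload the generated archives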


REFERENCES
NOTE:969254.1 - How to Proceed from Failed Upgrade to 11gR2 Grid Infrastructure on Linux/Unix
BUG:10105195 - PROC-32 ACCESSING OCR; CRS DOES NOT COME UP ON NODE
NOTE:1325718.1 - OHASD not Starting After Reboot on SLES
NOTE:1427234.1 - autorun file for ohasd is missing
NOTE:1077094.1 - How to fix the "DiscoveryString" in profile.xml or "asm_diskstring" in ASM if set wrongly
BUG:11834289 - OHASD FAILED TO START TIMELY
NOTE:1053970.1 - Troubleshooting 11.2 or 12.1 Grid Infrastructure root.sh Issues
NOTE:1564555.1 - 11.2.0.3 PSU5/PSU6/PSU7 or 12.1.0.1 CSSD Fails to Start if Multicast Fails on Private Network
NOTE:1068835.1 - What to Do if 11gR2 Grid Infrastructure is Unhealthy
NOTE:1323698.1 - Troubleshooting CRSD Start up Issue
NOTE:1915729.1 - 12.1.0.2 Grid Infrastructure Oracle Clusterware Diagnostic (traces) and Alert Log Moved to ADR
NOTE:1054902.1 - How to Validate Network and Name Resolution Setup for the Clusterware and RAC
NOTE:1069182.1 - OHASD Failed to Start: Inappropriate ioctl for device
NOTE:942166.1 - How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation
NOTE:10105195.8 - Bug 10105195 - Clusterware fails to start after reboot due to gpnpd fails to start
NOTE:1053147.1 - 11gR2 Clusterware and Grid Home - What You Need to Know
-------------------


How to Change Various IPs in a RAC Cluster


1) Shutdown Oracle Clusterware stack
2) Modify the IP address at network layer, DNS and /etc/hosts file to reflect the change or modify the MAC address at network layer
3) Restart Oracle Clusterware stack
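A minimal command sketch of those three steps (run as root on every node; the OS-level change itself depends on the platform and is shown later in this article):

# 1) Stop the clusterware stack
crsctl stop crs

# 2) Change the address at the network layer, in DNS and in /etc/hosts
#    (edit the interface configuration and /etc/hosts, then restart the network service)

# 3) Start the clusterware stack again
crsctl start crs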

Note: Network information (interface, subnet and role of each interface) for Oracle Clusterware is managed by 'oifcfg', but the actual IP address of each interface is not; 'oifcfg' cannot update IP address information. 'oifcfg getif' can be used to find out the interfaces currently configured in the OCR.
The 'public' network is for database client communication (the VIP also uses this network, though it is stored in the OCR as a separate entry), whereas the 'cluster_interconnect' network is for RDBMS/ASM cache fusion. Starting with 11gR2, the cluster_interconnect is also used for clusterware heartbeats. This is a significant change compared to prior releases, which used the private node names specified at installation time for clusterware heartbeats.

In this article we are going to see how to change the public, private, VIP and SCAN IPs.

First we need to change the RAC configuration before making any changes to the server-level IPs.

Changing Public and Private IPs in RAC Cluster.

Get the public and private Interconnect information

[grid@racnode1 ~]$ oifcfg getif
eth1  192.168.56.0  global  public
eth2  10.0.0.0  global  cluster_interconnect


Current IP Configurations

192.168.56.10   racnode1     
192.168.56.11   racnode2     
10.10.0.10      racnode1-priv
10.10.0.11      racnode2-priv
192.168.56.111   racnode1-vip 
192.168.56.112   racnode2-vip 
192.168.56.121   racdb-scan   
192.168.56.122   racdb-scan   
192.168.56.123   racdb-scan  

New IP configuration to be applied

10.1.1.30       racnode1     
10.1.1.31       racnode2     
20.20.20.21     racnode1-priv
20.20.20.22     racnode2-priv
10.1.1.40       racnode1-vip 
10.1.1.41       racnode2-vip 
10.1.1.50       racdb-scan   
10.1.1.51       racdb-scan   
10.1.1.52       racdb-scan 

Check the current status of the cluster

crsctl stat res -t
[grid@racnode1 ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       racnode1                 STABLE
               ONLINE  ONLINE       racnode2                 STABLE
ora.PROD.dg
               ONLINE  ONLINE       racnode1                 STABLE
               ONLINE  ONLINE       racnode2                 STABLE
ora.asm
               ONLINE  ONLINE       racnode1                 Started,STABLE
               ONLINE  ONLINE       racnode2                 Started,STABLE
ora.net1.network
               ONLINE  ONLINE       racnode1                 STABLE
               ONLINE  ONLINE       racnode2                 STABLE
ora.ons
               ONLINE  ONLINE       racnode1                 STABLE
               ONLINE  ONLINE       racnode2                 STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       racnode2                 STABLE
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       racnode1                 STABLE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       racnode1                 STABLE
ora.MGMTLSNR
      1        ONLINE  ONLINE       racnode1                 169.254.112.86 10.10
                                                             .0.10,STABLE
ora.cvu
      1        ONLINE  ONLINE       racnode1                 STABLE
ora.mgmtdb
      1        ONLINE  ONLINE       racnode1                 Open,STABLE
ora.oc4j
      1        ONLINE  ONLINE       racnode1                 STABLE
ora.prod.db
      1        ONLINE  ONLINE       racnode1                 Open,STABLE
      2        ONLINE  ONLINE       racnode2                 Open,STABLE
ora.racnode1.vip
      1        ONLINE  ONLINE       racnode1                 STABLE
ora.racnode2.vip
      1        ONLINE  ONLINE       racnode2                 STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       racnode2                 STABLE
ora.scan2.vip
      1        ONLINE  ONLINE       racnode1                 STABLE
ora.scan3.vip
      1        ONLINE  ONLINE       racnode1                 STABLE
--------------------------------------------------------------------------------



Stop the services below. Make sure they are stopped on all RAC nodes.
>srvctl stop database -d prod
>srvctl stop  mgmtdb
>srvctl stop MGMTLSNR
>srvctl stop nodeapps -f


Delete the previous public IP configuration

[grid@racnode1 ~]$ oifcfg getif
eth1  192.168.56.0  global  public
eth2  10.0.0.0  global  cluster_interconnect


 oifcfg  delif -global eth1

Redefine Public IP

oifcfg  setif -global eth1/10.1.1.0:public

Stop Cluster Services
crsctl stop cluster -all

Change the public IP at the OS level. Make the corresponding changes in /etc/hosts or DNS, as configured.
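A sketch of the OS-level change, assuming an Oracle Linux / RHEL style network configuration (the file names, the eth1 interface and the 10.1.1.x addresses are taken from this example and will differ in your environment):

# On each node, as root
vi /etc/sysconfig/network-scripts/ifcfg-eth1   # set the new IPADDR/NETMASK
vi /etc/hosts                                  # update the racnode1/racnode2 public entries
service network restart

With the OS-level change in place on all nodes, start the cluster again: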
crsctl start cluster -all

Verify the changes.
[grid@racnode1 ~]$ oifcfg getif
eth2  10.0.0.0  global  cluster_interconnect
eth1  10.1.1.0  global  public



Changing Private IP 

Note: Don't change the private and public IPs together.

After 11.2.0.2 we cannot delete the private interconnect directly. We need to add a new private interconnect first, restart the clusterware after the system-level IPs are changed, and then remove the old private interconnect.

Redefine Private IP

# oifcfg  setif -global eth2/20.20.20.0:cluster_interconnect

Check that the new entry has been added

[grid@racnode1 ~]$ oifcfg  getif
eth2  10.0.0.0  global  cluster_interconnect--Old Entry
eth1  10.1.1.0  global  public
eth2  20.20.20.0  global  cluster_interconnect--New Entry

crsctl stop cluster -all

Change the private interconnect IP at the OS level. Make the corresponding changes in /etc/hosts or DNS, as configured.
crsctl start cluster -all


Delete Old Private Interconnect

# oifcfg  delif -global eth2/10.0.0.0

[root@racnode1 grid]# oifcfg getif
eth2  20.20.20.0  global  cluster_interconnect
eth1  10.1.1.0  global  public


Changing the VIPs of the RAC cluster

Current VIP details:
[grid@racnode1 ~]$ srvctl config nodeapps
Network 1 exists
Subnet IPv4: 192.168.56.0/255.255.255.0/eth1, static
Subnet IPv6: 
Ping Targets: 
Network is enabled
Network is individually enabled on nodes: 
Network is individually disabled on nodes: 
VIP exists: network number 1, hosting node racnode1
VIP Name: racnode1-vip
VIP IPv4 Address: 192.168.56.111
VIP IPv6 Address: 
VIP is enabled.
VIP is individually enabled on nodes: 
VIP is individually disabled on nodes: 
VIP exists: network number 1, hosting node racnode2
VIP Name: racnode2-vip
VIP IPv4 Address: 192.168.56.112
VIP IPv6 Address: 
VIP is enabled.
VIP is individually enabled on nodes: 
VIP is individually disabled on nodes: 
ONS exists: Local port 6100, remote port 6200, EM port 2016, Uses SSL false
ONS is enabled
ONS is individually enabled on nodes: 
ONS is individually disabled on nodes: 

Make sure the nodeapps services are stopped.
Make the changes in /etc/hosts or DNS.


Run the commands below as the root user to change the VIPs

# srvctl modify nodeapps -n racnode1 -A 10.1.1.40/255.255.255.0/eth1
# srvctl modify nodeapps -n racnode2 -A 10.1.1.41/255.255.255.0/eth1


Verify the changes
[root@racnode1 grid]# srvctl config nodeapps
Network 1 exists
Subnet IPv4: 10.1.1.0/255.255.255.0/eth1, static
Subnet IPv6: 
Ping Targets: 
Network is enabled
Network is individually enabled on nodes: 
Network is individually disabled on nodes: 
VIP exists: network number 1, hosting node racnode1
VIP Name: racnode1-vip
VIP IPv4 Address: 10.1.1.40
VIP IPv6 Address: 
VIP is enabled.
VIP is individually enabled on nodes: 
VIP is individually disabled on nodes: 
VIP exists: network number 1, hosting node racnode2
VIP Name: racnode2-vip
VIP IPv4 Address: 10.1.1.41
VIP IPv6 Address: 
VIP is enabled.
VIP is individually enabled on nodes: 
VIP is individually disabled on nodes: 
ONS exists: Local port 6100, remote port 6200, EM port 2016, Uses SSL false
ONS is enabled
ONS is individually enabled on nodes: 
ONS is individually disabled on nodes: 

Start nodeapps
srvctl start nodeapps
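Optionally confirm the VIPs are now running on the new addresses, e.g. (the -n form is the pre-12c syntax; newer releases also accept -node):

srvctl status nodeapps
srvctl status vip -n racnode1
srvctl status vip -n racnode2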




Changing SCAN IP in RAC Cluster


Check the current SCAN IP addresses in DNS
[grid@racnode1 ~]$ nslookup racdb-scan
Server:  192.168.56.101
Address: 192.168.56.101#53

Name: racdb-scan.himvirtualdns.lab
Address: 192.168.56.121
Name: racdb-scan.himvirtualdns.lab
Address: 192.168.56.122
Name: racdb-scan.himvirtualdns.lab
Address: 192.168.56.123



Check the current SCAN VIP configuration in the cluster

cd $GRID_HOME/bin

[grid@racnode1 ~]$ srvctl config scan
SCAN name: racdb-scan, Network: 1
Subnet IPv4: 192.168.56.0/255.255.255.0/eth1, static
Subnet IPv6: 
SCAN 0 IPv4 VIP: 192.168.56.121
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes: 
SCAN VIP is individually disabled on nodes: 
SCAN 1 IPv4 VIP: 192.168.56.122
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes: 
SCAN VIP is individually disabled on nodes: 
SCAN 2 IPv4 VIP: 192.168.56.123
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes: 
SCAN VIP is individually disabled on nodes: 


Update the new SCAN IP addresses in the DNS server, then verify the change:

[root@racnode1 grid]# nslookup racdb-scan
Server:  192.168.56.101
Address: 192.168.56.101#53

Name: racdb-scan.himvirtualdns.lab
Address: 10.1.1.51
Name: racdb-scan.himvirtualdns.lab
Address: 10.1.1.52
Name: racdb-scan.himvirtualdns.lab
Address: 10.1.1.50



Stop the SCAN resources before modifying the cluster configuration.

[root@racnode1 grid]# srvctl stop scan_listener
[root@racnode1 grid]# srvctl stop scan
[root@racnode1 grid]# srvctl status scan
SCAN VIP scan1 is enabled
SCAN VIP scan1 is not running
SCAN VIP scan2 is enabled
SCAN VIP scan2 is not running
SCAN VIP scan3 is enabled
SCAN VIP scan3 is not running


[root@racnode1 grid]# srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is not running
SCAN Listener LISTENER_SCAN2 is enabled
SCAN listener LISTENER_SCAN2 is not running
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is not running


Modify the SCAN configuration so it picks up the new IP addresses from DNS:

# srvctl modify scan -n racdb-scan

Verify the change

# srvctl config scan

[root@racnode1 grid]# srvctl config scan
SCAN name: racdb-scan, Network: 1
Subnet IPv4: 10.1.1.0/255.255.255.0/eth1, static
Subnet IPv6: 
SCAN 0 IPv4 VIP: 10.1.1.50
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes: 
SCAN VIP is individually disabled on nodes: 
SCAN 1 IPv4 VIP: 10.1.1.51
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes: 
SCAN VIP is individually disabled on nodes: 
SCAN 2 IPv4 VIP: 10.1.1.52
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes: 
SCAN VIP is individually disabled on nodes: 


Start SCAN and the SCAN listener

# srvctl start scan
# srvctl start scan_listener
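Then confirm the SCAN VIPs and listeners are back online with the new addresses, e.g.:

srvctl status scan
srvctl status scan_listener
srvctl config scan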
