Tuesday 17 July 2018

Quick RAC Interview Question and Answer

Oracle DBA question and answer
Important link for interview questions:

Oracle Database interview questions and answers:
http://oracle-dba-help.blogspot.com/search/label/Interview%20Questions


Understanding Offline Processes in Oracle Grid Infrastructure

After the installation of Oracle Grid Infrastructure, some components may be listed as OFFLINE. 
Oracle Grid Infrastructure activates these resources when you choose to add them.

Oracle Grid Infrastructure provides required resources for various Oracle products and components. 
Some of those products and components are optional, so you can install and enable them after installing Oracle Grid Infrastructure. 
To simplify post-install additions, Oracle Grid Infrastructure preconfigures and registers all required resources for these optional products and components, but only activates them when you choose to add them. As a result, some components may be listed as OFFLINE after the installation of Oracle Grid Infrastructure.
Run the following command to view the status of any resource:

$ crsctl status resource resource_name -t

crsctl status resource ora.easydb.easy_srv.svc -l
crsctl stat res -t -w "TYPE = ora.database.type"
srvctl status service -d easydb1
crsctl stat res -f -w "(TYPE = ora.service.type)"
crsctl stat res -t -w '((TARGET != ONLINE) or (STATE != ONLINE))'


Resources listed as TARGET:OFFLINE and STATE:OFFLINE do not need to be monitored. 
They represent components that are registered, but not enabled, so they do not use any system resources. 
If an Oracle product or component is installed on the system, and it requires a particular resource to be online, then the software prompts you to activate the required offline resource.

RAC Investigation

Environment description

2 Node RAC with Oracle 11.2.0.2.2
Oracle Linux 5.6 

Conducted tests

easy_srv is a service that has the instances running on node1 and node2 as its preferred instances.
On node1 the service was manually stopped.

[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
 NAME=ora.mydb.easy_srv.svc
 TYPE=ora.service.type
 CARDINALITY_ID=1
 DEGREE_ID=1
 TARGET=OFFLINE
 STATE=OFFLINE
 CARDINALITY_ID=2
 DEGREE_ID=1
 TARGET=ONLINE
 STATE=ONLINE on node2

Issue a “shutdown abort” on the instance running on node2

[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

CARDINALITY_ID=2
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1

Start the instance again:

[grid@node1 ~]$ srvctl start instance -d mydb -i mydb2
 
[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
 
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1

The service is now running on both instances, although before the crash the service was set offline on node1.

Same test, but this time the service is stopped on all instances

[grid@node1 ~]$ srvctl stop service -d mydb -s easy_srv
 
[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
 
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
 
[grid@node1 ~]$ srvctl stop instance -d mydb -i mydb2 -o abort
 
[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
 
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

This time the service stays offline on both instances.
But what happens if we start the instance again?

[grid@node1 ~]$ srvctl start instance -d mydb -i mydb2
 
[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
 
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

Now the service has started again on the restarted instance.

The explanation is that the service was configured to come up automatically with the instance (management policy AUTOMATIC), which is why the service is started on the restarted node.

The failover itself looks like expected behaviour to me, as it is the same as what would happen with a preferred / available configuration.
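For completeness, the management policy can be checked and changed with srvctl. A minimal sketch, assuming the -y option of srvctl modify service works as it does for srvctl add service (used later in this post); the MANUAL value is shown only to illustrate the alternative:

[grid@node1 ~]$ srvctl config service -d mydb -s easy_srv | grep -i "management policy"
Management policy: AUTOMATIC
[grid@node1 ~]$ srvctl modify service -d mydb -s easy_srv -y MANUAL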

For the third test, we will reconfigure the service to have a preferred and an available instance:

[grid@node1 ~]$ srvctl stop service -d mydb -s easy_srv
[grid@node1 ~]$ srvctl modify service -d mydb -s easy_srv -n -i mydb2 -a mydb1
 
[grid@node1 ~]$ srvctl config service -d mydb -s easy_srv
Service name: easy_srv
Service is enabled
Server pool: mydb_easy_srv
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: NONE
Edition:
Preferred instances: mydb2
Available instances: mydb1
 
[grid@node1 ~]$ srvctl start service -d mydb -s easy_srv -i mydb2
[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

The service is running on its preferred instance, which we will now crash:

[grid@node1 ~]$ srvctl stop instance -d mydb -i mydb2 -o abort
 
[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=OFFLINE

 I actually expected a relocation here…

As I have other services which have a preferred / available configuration, I know this service should failover.

[grid@node1 ~]$ srvctl status service -d mydb -s easy_srv
Service easy_srv is not running.
 
[grid@node1 ~]$ srvctl config service -d mydb -s easy_srv
Service name: easy_srv
Service is enabled
Server pool: mydb_easy_srv
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: NONE
Edition:
Preferred instances: mydb2
Available instances: mydb1
 
[grid@node1 ~]$ srvctl status database -d mydb
Instance mydb1 is running on node node1
Instance mydb2 is not running on node node2


I could find no clues in the different cluster log files as to why the relocation did not occur.

More testing will be necessary.

Also note that the output of crsctl status resource does not show on which node or instance the service is expected to be online.
By using the -v flag, however, we can see the LAST_SERVER attribute:

[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -v
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
LAST_SERVER=node2
STATE=OFFLINE
TARGET=ONLINE
CARDINALITY_ID=1
CREATION_SEED=137
RESTART_COUNT=0
FAILURE_COUNT=0
FAILURE_HISTORY=
ID=ora.mydb.easy_srv.svc 1 1
INCARNATION=5
LAST_RESTART=08/10/2011 16:32:53
LAST_STATE_CHANGE=08/10/2011 16:34:03
STATE_DETAILS=
INTERNAL_STATE=STABLE

After starting the instance again, the service was available again:

[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

A second run of this test gave the same result.
Manually relocating the service did work though:

[grid@node1 ~]$ srvctl relocate service -d mydb -s easy_srv -i mydb1 -t mydb2
[grid@node1 ~]$ crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2

What if I removed the service and recreated it directly as preferred / available:

[grid@node1 ~]$ srvctl stop service -d mydb -s easy_srv
 
[grid@node1 ~]$ srvctl remove service -d mydb -s easy_srv
 
[grid@node1 ~]$ srvctl add service -d mydb -s easy_srv -r mydb2 -a mydb1 -y AUTOMATIC -P BASIC -e SELECT
PRCD-1026 : Failed to create service easy_srv for database mydb
PRKH-1014 : Current user grid is not the same as oracle owner orauser of oracle home /opt/oracle/orauser/product/11.2.0.2/dbhome_1.
Would recreating it directly make a difference? Let us test it, this time as the database software owner:

[grid@node1 ~]$ su - orauser
Password:
 
[orauser@node1 ~]$ srvctl add service -d mydb -s easy_srv -r mydb1,mydb2 -y AUTOMATIC -P BASIC -e SELECT
 
[orauser@node1 ~]$ srvctl config service -d mydb -s easy_srv
Service name: easy_srv
Service is enabled
Server pool: mydb_easy_srv
Cardinality: 2
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SELECT
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: BASIC
Edition:
Preferred instances: mydb1,mydb2
Available instances:
 
[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE
 
CARDINALITY_ID=2
DEGREE_ID=1
TARGET=OFFLINE
STATE=OFFLINE

Now modify it to a preferred / available configuration:

[orauser@node1 ~]$ srvctl modify service -d mydb -s easy_srv -n -i mydb2 -a mydb1
 
[orauser@node1 ~]$ srvctl config service -d mydb -s easy_srv
Service name: easy_srv
Service is enabled
Server pool: mydb_easy_srv
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SELECT
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: BASIC
Edition:
Preferred instances: mydb2
Available instances: mydb1
 
[orauser@node1 ~]$ srvctl start service -d mydb -s easy_srv -i mydb2
 
[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
 
[orauser@node1 ~]$ srvctl stop instance -d mydb -i mydb2 -o abort
 
[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=OFFLINE

Nope, the user who creates or modifies the service has nothing to do with it.

I also tested the scenario where I directly created a preferred / available service, but in this case the failover also did not work.

But after some more testing I found the reason.

During the first test I had shut down the instance via SQL*Plus, not via srvctl. And the other services I mentioned had failed over during that test (I never did a failback).
After doing the shutdown abort via SQL*Plus again, the failover worked again.

[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node2
 
[orauser@node2 ~]$ export ORACLE_SID=mydb2
[orauser@node2 ~]$ sqlplus / as sysdba
 
SQL*Plus: Release 11.2.0.2.0 Production on Wed Aug 10 18:28:29 2011
 
Copyright (c) 1982, 2010, Oracle.  All rights reserved.
 
Connected to:
Oracle Database 11g Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
 
SQL> shutdown abort
ORACLE instance shut down.
 
[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1
 
SQL> startup
ORACLE instance started.
 
Total System Global Area 3140026368 bytes
Fixed Size                  2230600 bytes
Variable Size            1526728376 bytes
Database Buffers         1593835520 bytes
Redo Buffers               17231872 bytes
Database mounted.
Database opened.
 
[orauser@node1 ~]$ /opt/grid/11.2.0.2/bin/crsctl status resource ora.mydb.easy_srv.svc -l
NAME=ora.mydb.easy_srv.svc
TYPE=ora.service.type
CARDINALITY_ID=1
DEGREE_ID=1
TARGET=ONLINE
STATE=ONLINE on node1

As expected, starting the instance again did not trigger a failback of the service.

The question now is whether the failover not happening when the shutdown is issued via srvctl is expected behaviour or not.

For this, one would probably have to open a service request, answer a couple of questions not important for this issue, escalate, and still wait for several months.
Do I sound bitter now?

Conclusion:

  • When restarting an instance, an offline service that has this instance listed as a preferred instance will be started (management policy = AUTOMATIC).
  • When an instance on which a service was running fails, the service is started on at least one other preferred instance.
  • The service will remain running on this instance, even when the original instance is started again (in which case the service will run on both instances).
  • When a service has a preferred / available configuration, the service will failover to the available instance, but not failback afterwards.
  • Failover in a preferred / available configuration does not happen when the instance is stopped via "srvctl stop instance -d <db_unique_name> -i <instance_name> -o abort".
Questions remaining:

  • What if there were more than two nodes, with a service that has all three or more nodes listed as preferred, but is currently running on only one node?
  • If the instance on which that service is running fails, would the service then be started on all preferred nodes or on only one of them?
  • What if, in the above case, the service was running on two nodes?
  • Would it still be started on other nodes?
  • And what if one of the nodes was configured as available and not as preferred? Would the service on the preferred node still be started, or the one on the available instance, or both?
  • And last but not least, is the srvctl shutdown behaviour a bug or not?


How many IPs are required before installation of Clusterware / Grid

If you do not enable GNS, the public and virtual IP addresses for each node must be static IP addresses, configured before installation for each node but not currently in use. Public and virtual IP addresses must be on the same subnet. Oracle Clusterware manages private IP addresses in the private subnet on interfaces you identify as private during the installation process.
The cluster must have the following addresses configured:
A public IP address for each node, with the following characteristics:
Static IP address
Configured before installation for each node, and resolvable to that node before installation
On the same subnet as all other public IP, VIP, and SCAN addresses
A virtual IP address for each node, with the following characteristics:
Static IP address
Configured before installation for each node, but not currently in use
On the same subnet as all other public IP addresses, VIP addresses, and SCAN addresses

A Single-Client Access Name (SCAN) for the cluster, with the following characteristics:
Three Static IP addresses configured on the domain name server (DNS)
before installation so that the three IP addresses are associated with the name provided as the SCAN, and all three addresses are returned in random order by the DNS to the requestor
Configured before installation in the DNS to resolve to addresses that are not currently in use
Given a name that does not begin with a numeral
On the same subnet as all other public IP addresses, VIP addresses, and SCAN addresses
A private IP address for each node, with the following characteristics:
Static IP address
Configured before installation, but on a separate private network, with its own subnet, that is not resolvable except by other cluster member nodes to improve the interconnect performance
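As an illustration, a minimal name resolution layout for a hypothetical two-node cluster could look like the sketch below; all host names, the domain and all addresses are made up:

# /etc/hosts (public, VIP and private entries; the SCAN should be resolved by DNS only)
192.168.10.11   node1.example.com        node1
192.168.10.12   node2.example.com        node2
192.168.10.21   node1-vip.example.com    node1-vip
192.168.10.22   node2-vip.example.com    node2-vip
10.0.0.11       node1-priv
10.0.0.12       node2-priv
# In DNS: cluster01-scan.example.com -> 192.168.10.31, 192.168.10.32, 192.168.10.33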

12c RAC Installation


  • Hardware Requirements.
Each node must have at least two network interface cards (NIC), or network adapters. One adapter is for the public network interface and the other adapter is for the private network interface (interconnect).

Public interface names must be the same for all nodes. If the public interface on one node uses the network adapter eth0, then you must configure eth0 as the public interface on all nodes.

You should configure the same private interface names for all nodes as well. If eth1 is the private interface name for the first node, then eth1 should be the private interface name for your second node.

The private network adapters must support the user datagram protocol (UDP) using high-speed network adapters and a network switch that supports TCP/IP (Gigabit Ethernet or better). Oracle recommends that you use a dedicated network switch.
  • IP Address Requirements.
You must have a DNS server in order to make the SCAN listeners work. So, before you proceed with the installation, prepare your DNS server. You must add the following entries manually in your DNS server:


i)  A public IP address for each node

ii) A virtual IP address for each node

iii) Three single client access name (SCAN) addresses for the cluster


During installation a SCAN for the cluster is configured, which is a domain name that resolves to all the SCAN addresses allocated for the cluster. The IP addresses used for the SCAN addresses must be on the same subnet as the VIP addresses. 
The SCAN name must be unique within your network. The SCAN addresses should not respond to ping commands before installation.
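To illustrate, a SCAN that is correctly set up in DNS returns all three addresses; the name and addresses below are hypothetical:

$ nslookup cluster01-scan.example.com
Name:    cluster01-scan.example.com
Address: 192.168.10.31
Name:    cluster01-scan.example.com
Address: 192.168.10.32
Name:    cluster01-scan.example.com
Address: 192.168.10.33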
 Interface          Type      Resolution
 Public             Static    DNS
 Private            Static    Not required
 ASM                Static    Not required
 Node Virtual IP    Static    DNS
 SCAN virtual IP    Static    DNS
  • OS and software Requirements
 Preparing Shared Storage

Volume Type/Purpose    Number of Volumes    Volume Size
OCR/VOTE               3                    50 GB each
DATA                   4                    250 GB each
REDO                   2                    At least 50 GB each
FRA                    1                    100 GB
TEMP                   1                    100 GB
  • Preparing the server to install Grid Infrastructure.
Create OS groups using the commands below.
    Enter these commands as the ‘root’ user:
 
    #/usr/sbin/groupadd -g 501 oinstall

    #/usr/sbin/groupadd -g 502 dba

    #/usr/sbin/groupadd -g 504 asmadmin

    #/usr/sbin/groupadd -g 506 asmdba

    #/usr/sbin/groupadd -g 507 asmoper

Create the users that will own the Oracle software using the commands:

#/usr/sbin/useradd -u 501 -g oinstall -G asmadmin,asmdba,asmoper grid

#/usr/sbin/useradd -u 502 -g oinstall -G dba,asmdba oracle


Set the required kernel parameters by editing /etc/sysctl.conf:

vi /etc/sysctl.conf

kernel.shmmni = 4096

kernel.sem = 250 32000 100 128

fs.file-max = 6553600

net.ipv4.ip_local_port_range = 9000 65500

net.core.rmem_default = 262144

net.core.rmem_max = 4194304

net.core.wmem_default = 262144

net.core.wmem_max = 1048576
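After editing the file, the new kernel parameters can be loaded without a reboot (as the root user):

# /sbin/sysctl -p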

 Add or edit the following line in the /etc/pam.d/login file, if it does not already exist:

session required pam_limits.so
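The pam_limits module enforces the shell limits defined in /etc/security/limits.conf. The entries below are only a sketch with the commonly documented values for the grid and oracle users; adjust them to your environment and release:

grid   soft   nproc    2047
grid   hard   nproc    16384
grid   soft   nofile   1024
grid   hard   nofile   65536
oracle soft   nproc    2047
oracle hard   nproc    16384
oracle soft   nofile   1024
oracle hard   nofile   65536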


     Create the Oracle Inventory Directory

To create the Oracle Inventory directory, enter the following commands as the root user:

    # mkdir -p /u01/app/oraInventory

    # chown -R grid:oinstall /u01/app/oraInventory

    # chmod -R 775 /u01/app/oraInventory

    Creating the Oracle Grid Infrastructure Home Directory

    To create the Grid Infrastructure home directory, enter the following commands as the root user:

    # mkdir -p /u01/11.2.0/grid

    # chown -R grid:oinstall /u01/11.2.0/grid

             # chmod -R 775 /u01/11.2.0/grid

               Creating the Oracle Base Directory

To create the Oracle Base directory, enter the following commands as the root user:

    # mkdir -p /u01/app/oracle

    # mkdir /u01/app/oracle/cfgtoollogs      # needed to ensure that dbca is able to run after the RDBMS installation

    # chown -R oracle:oinstall /u01/app/oracle

    # chmod -R 775 /u01/app/oracle


How to increase the performance of the interconnect


1) Link aggregation. Also known as NIC teaming or NIC bonding, it can be used to increase redundancy for higher availability with an Active/Standby configuration, or to increase bandwidth for performance with an Active/Active configuration. The Active/Active arrangement involves the simultaneous use of both bonded physical network interface cards in parallel to achieve a higher bandwidth than the limit of any single network card. It is very important that if 802.3ad is used at the NIC layer, the switch must also support and be configured for 802.3ad; misconfiguration results in poor performance and interface resets or "port flapping".
2) An alternative is to consider a single network interface card with a higher bandwidth, such as 10 Gb Ethernet instead of 1 Gb Ethernet. InfiniBand can also be used for the interconnect.
3) UDP socket buffer (rx): the default settings are adequate for the majority of customers. The maximum UDP socket receive buffer size varies according to the operating system; the upper limit may be as small as 128 KB or as large as 1 GB. This is one of the first settings to consider if you are seeing lost blocks. It may be necessary to increase the allocated buffer size when:
– the MTU size has been increased
– the netstat command reports errors, or excessive fragmentation and/or reassembly of packets is observed
– the ifconfig command reports dropped packets or overflows
(Example commands follow after this list.)
4) Jumbo frames: jumbo frames are not an Institute of Electrical and Electronics Engineers (IEEE) standard, they are not a requirement for Oracle Clusterware, and they are not configured by default. Their use is supported; however, special care must be taken because this is not an IEEE standard and there are significant variances among network devices and switches, especially from different manufacturers. The typical frame size for jumbo frames is 9 KB, but again, this can vary. All devices in the communication path must be set to the same value.
A jumbo frame is an Ethernet frame with a payload greater than the standard maximum transmission unit (MTU) of 1,500 bytes. By default each network packet can carry 1500 bytes of data (also referred to as the packet’s payload). Any payload larger than 1500 bytes sent over the network will be split into more than one packet. If we enable jumbo frames we reduce the number of packets sent over the network when sending large amounts of data.When we enable “jumbo frames” we are telling our network devices that we want to send more than 1500 bytes in each packet. 
Most commonly, jumbo frames means setting the MTU (maximum transmission unit) to enable a payload of 9000 bytes.
The obvious advantage of using jumbo frames is more data is transferred in less packets.
They can speed up your overall network speed, provide better interaction between some applications, 
and reduce strain on your network.  
If you’re considering implementing Jumbo Frames, it’s important to do your homework first. 

MTU=Default Ethernet 1500 or Jumbo Frames 9000

Let’s assume we need to transfer 20 gigabytes (21,474,836,480 bytes) of data as quickly as possible. 
With a standard 1500 byte MTU that will take 14,316,558 packets,
but with an MTU of 9000 we are sending 2,386,093 packets. That’s a difference of 11,930,465 packets.
Note: For Oracle Clusterware, the Maximum Transmission Unit (MTU) needs to be the same on all nodes. If it is not set to the same value, an error message will be sent to the Clusterware alert logs.
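The following commands are a sketch of how these checks could be done on Linux; eth1 is assumed to be the private interconnect interface and the buffer value is only an example:

# current and maximum socket receive buffer sizes
/sbin/sysctl net.core.rmem_default net.core.rmem_max
# UDP errors and packet reassembly failures
netstat -su
# dropped packets / overruns on the private interface
/sbin/ifconfig eth1
# increase the receive buffer limit if required (example value)
/sbin/sysctl -w net.core.rmem_max=4194304
# check the MTU on the private interface (must be identical on all nodes and on the switch)
/sbin/ip link show eth1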

What is HAIP

HAIP, High Availability IP, is the Oracle-based solution for load balancing and failover of private interconnect traffic. Typically, host-based solutions such as bonding (Linux) or trunking (Solaris) are used to implement high availability for the private interconnect; HAIP is Oracle's own solution for this.
In earlier releases, to minimize node evictions due to frequent private NIC down events, bonding, trunking, teaming, or similar technology was required to make use of redundant network connections between the nodes. Oracle Clusterware now provides an integrated solution which ensures “Redundant Interconnect Usage” as it supports IP failover .

Multiple private network adapters can be defined either during the installation phase or afterwards using oifcfg. The ora.cluster_interconnect.haip resource will pick a highly available virtual IP (the HAIP) from the "link-local" IP range (169.254.0.0/16) and assign it to each private network. With HAIP, by default, interconnect traffic is load-balanced across all active interconnect interfaces. If a private interconnect interface fails or becomes non-communicative, Clusterware transparently moves the corresponding HAIP address to one of the remaining functional interfaces.
Grid Infrastructure can activate a maximum of four private network adapters at a time, even if more are defined. The number of HAIP addresses is decided by how many private network adapters are active when Grid Infrastructure comes up on the first node in the cluster: if there is only one active private network, Grid will create one; if two, Grid will create two, and so on. The number of HAIPs will not increase beyond four even if more private network adapters are activated; a restart of Clusterware on all nodes is required for new adapters to become effective.
This functionality is available starting with Oracle Database 11g Release 2 (11.2.0.2). If you use the Oracle Clusterware Redundant Interconnect feature, you must use IPv4 addresses for the interfaces.
When you define multiple interfaces, Oracle Clusterware creates from one to four highly available IP (HAIP) addresses. Oracle RAC and Oracle Automatic Storage Management (Oracle ASM) instances use these interface addresses to ensure highly available, load-balanced interface communication between nodes. The installer enables Redundant Interconnect Usage to provide a high-availability private network.
By default, Oracle Grid Infrastructure software uses all of the HAIP addresses for private network communication, providing load-balancing across the set of interfaces you identify for the private network. If a private interconnect interface fails or becomes noncommunicative,
Oracle Clusterware transparently moves the corresponding HAIP address to one of the remaining functional interfaces.
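A quick way to see HAIP in action is sketched below (run as the grid user; eth1 is again an assumed private interface name):

# interfaces registered with Clusterware
oifcfg getif
# status of the HAIP resource (it is part of the lower, -init, stack)
crsctl stat res ora.cluster_interconnect.haip -init -t
# the link-local 169.254.x.x HAIP addresses appear on the private NICs
/sbin/ifconfig | grep 169.254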

What is the advantage of the Single Client Access Name (SCAN)

The single client access name (SCAN) is the address used by clients connecting to the cluster. The SCAN is a fully qualified host name (host name + domain) registered to three IP addresses. If you use GNS and have DHCP support, then GNS will assign addresses dynamically to the SCAN.
If you do not use GNS, the SCAN should be defined in the DNS to resolve to the three addresses assigned to that name.

This should be done before you install Oracle Grid Infrastructure.

  • The SCAN and its associated IP addresses provide a stable name for clients to use for connections, independent of the nodes that make up the cluster.
  • SCANs function like a cluster alias. However, SCANs are resolved on any node in the cluster, so unlike a VIP address for a node, clients connecting to the SCAN no longer require updated VIP addresses as nodes are added to or removed from the cluster. Because the SCAN addresses resolve to the cluster, rather than to a node address in the cluster, nodes can be added to or removed from the cluster without affecting the SCAN address configuration.
  • During installation, listeners are created on each node for the SCAN IP addresses. Oracle Clusterware routes application requests to the cluster SCAN to the least loaded instance providing the service.SCAN listeners can run on any node in the cluster. SCANs provide location independence for databases so that the client configuration does not have to depend on which nodes run a particular database.Instances register with SCAN listeners only as remote listeners. Upgraded databases register with SCAN listeners as remote listeners, and also continue to register with all other listeners.
  • If you specify a GNS domain during installation, the SCAN defaults to clustername-scan.GNS_domain. If a GNS domain is not specified at installation, the SCAN defaults to clustername-scan.current_domain.

How SCAN and Local Listeners work

When a client submits a connection request, the SCAN listener listening on a SCAN IP address and the SCAN port are contacted on the client’s behalf. Because all services on the cluster are registered with the SCAN listener, the SCAN listener replies with the address of the local listener on the least-loaded node where the service is currently being offered. Finally, the client establishes a connection to the service through the listener on the node where service is offered. All these actions take place transparently to the client without any explicit configuration required in the client.
During installation, listeners are created for the SCAN IP addresses. Oracle Net Services routes application requests to the least loaded instance providing the service. Because SCAN addresses resolve to the cluster, rather than to a node address in the cluster, nodes can be added to or removed from the cluster without affecting the SCAN address configuration.
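On a running instance this registration can be seen in the listener-related parameters. The values below are only an illustration with made-up names:

SQL> show parameter remote_listener
NAME              TYPE    VALUE
remote_listener   string  cluster01-scan.example.com:1521

SQL> show parameter local_listener
NAME              TYPE    VALUE
local_listener    string  (ADDRESS=(PROTOCOL=TCP)(HOST=192.168.10.21)(PORT=1521))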

What is Node eviction and its advantage

An important service provided by Oracle Clusterware is node fencing. Node fencing is a technique used by clustered environments to evict nonresponsive or malfunctioning hosts from the cluster; allowing affected nodes to remain in the cluster increases the probability of data corruption.
Traditionally, Oracle Clusterware uses a STONITH (Shoot The Other Node In The Head)-comparable fencing algorithm to ensure data integrity in cases in which cluster integrity is endangered and split-brain scenarios need to be prevented.
For Oracle Clusterware this means that a local process enforces the removal of one or more nodes from the cluster (fencing). This approach traditionally involved a forced "fast" reboot of the offending node. A fast reboot is a shutdown and restart procedure that does not wait for any I/O to finish or for file systems to synchronize on shutdown. Starting with Oracle Clusterware 11g Release 2 (11.2.0.2), this mechanism has been changed to avoid such a reboot as much as possible by introducing rebootless node fencing.
Now, when a decision is made to evict a node from the cluster, Oracle Clusterware will first attempt to shut down all resources on the machine that was chosen to be the subject of an eviction. Specifically, I/O generating processes are killed and Oracle Clusterware ensures that those processes are completely stopped before continuing.
If all resources can be stopped and all I/O generating processes can be killed, Oracle Clusterware will shut itself down on the respective node, but will attempt to restart after the stack has been stopped.
If, for some reason, not all resources can be stopped or I/O generating processes cannot be stopped completely, Oracle Clusterware will still perform a reboot.
In addition to this traditional fencing approach, Oracle Clusterware now supports a fencing mechanism based on remote node termination. The concept uses an external mechanism capable of restarting a problem node without cooperation either from Oracle Clusterware or from the operating system running on that node. To provide this capability, Oracle Clusterware supports the Intelligent Platform Management Interface (IPMI) specification, a standard management protocol.
To use IPMI and to be able to remotely fence a server in the cluster, the server must be equipped with a Baseboard Management Controller (BMC) which supports IPMI over a local area network (LAN). Once this hardware is in place in every server of the cluster, IPMI can be activated either during the installation of the Oracle Grid Infrastructure or afterwards, as a post-installation management task, by using CRSCTL.
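A sketch of how IPMI-based fencing can be enabled with CRSCTL after the installation; the administrator name and the BMC address are placeholders, and the commands assume the crsctl set css ipmiadmin / ipmiaddr options of 11g Release 2:

$ crsctl set css ipmiadmin bmcadmin        # prompts for the BMC password
$ crsctl set css ipmiaddr 192.168.20.11    # BMC address of the local node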

What is the difference between an Oracle global index and a local index?

When using Oracle partitioning, you can specify the “global” or “local” parameter in the create index syntax:

Local index (local partitioned indexes): A local index is a one-to-one mapping between an index partition and a table partition.
The partitioning of the indexes is transparent to all SQL queries. The great benefit is that the Oracle query engine will scan only the index partition that is required to service the query, thus speeding up the query significantly. In addition, the Oracle parallel query engine will sense that the index is partitioned and will fire simultaneous queries to scan the indexes.
Local partitioned indexes allow the DBA to take individual partitions of a table and indexes offline for maintenance (or reorganization) without affecting the other partitions and indexes in the table.

  • A local index on a partitioned table is created where the index is partitioned in exactly the same manner as the underlying partitioned table. That is, the local index inherits the partitioning method of the table. This is known as equi-partitioning.
  • For local indexes, the index keys within the index will refer only to the rows stored in the single underlying table partition. A local index is created by specifying the LOCAL attribute, and can be created as UNIQUE or NON-UNIQUE.
  • The table and local index are either partitioned in exactly the same manner or have the same partition key. Because local indexes are automatically maintained, they can offer higher availability.
  • As the Oracle database ensures that the index partitions are synchronized with their corresponding table partitions, the database automatically maintains the index partitions whenever any maintenance operation is performed on the underlying table, for example when partitions are added, dropped, or merged.
  • A local index is prefixed if the partition key of the table and the index key are the same; otherwise it is a local non-prefixed index


In a local partitioned index, the number of index partitions matches the number of partitions in the base table, and each index partition contains key values only for its corresponding table partition.

To create a range-partitioned table and a local index on it:

  CREATE TABLE EASY_INVOICE (id number, item_id number, name varchar2(20))
  PARTITION BY RANGE (id, item_id)
  (partition EASYINVOICE_1 values less than (10, 100),
  partition EASYINVOICE_2 values less than (20, 200),
  partition EASYINVOICE_3 values less than (30, 300),
  partition EASYINVOICE_4 values less than (40, 400));
 Table created

 CREATE INDEX test_idx ON EASY_INVOICE(id, item_id)
  LOCAL
  (partition test_idx_1,
  partition test_idx_2,
  partition test_idx_3,
  partition test_idx_4);

 Index created.


SELECT index_name, partition_name, status
  FROM user_ind_partitions where index_name='TEST_IDX'
  ORDER BY index_name, partition_name;
Examples:

1) Check the partitioned table:


SELECT distinct table_name
FROM dba_tab_partitions where table_name like '%EASY_INVOICE%' ORDER BY table_name;

set long 9999999
set pagesize 0
set linesize 120 

SELECT DBMS_METADATA.GET_DDL('TABLE','EASY_INVOICE','EASYOWNER') FROM DUAL;


COLUMN table_name FORMAT A30
COLUMN partition_name FORMAT A30
COLUMN high_value FORMAT A20

SELECT table_name,partition_name,high_value,num_rows
FROM  dba_tab_partitions where table_name='EASY_INVOICE' ORDER BY table_name, partition_name;

select max(partition_position) from dba_tab_partitions where table_owner='EASYOWNER' and table_name='EASY_INVOICE';


select PARTITION_NAME from dba_tab_partitions where table_owner='EASYOWNER' and table_name='EASY_INVOICE' and partition_position=15;


Select TABLE_OWNER, table_name, PARTITION_NAME, last_analyzed from dba_tab_partitions where table_owner='EASYOWNER' and table_name='EASY_INVOICE'
and partition_name= 'EASYPAR_100';

SQL> select a.index_name, partition_name, table_name
     from all_ind_partitions a
     inner join all_part_indexes b on a.index_name = b.index_name
     where status = 'UNUSABLE'
       and (a.index_name like '%ESM%' or a.index_name like '%SSM%');

SQL> CREATE UNIQUE INDEX "EASYOWNER"."PKC_EBA_T_BMS_OUT_AGG_1SK" ON "MLY_UC_DAT"."EBA_T_BMS_OUT_AGG"
("START_DATE", "CONTROL_POINT_ID", "EVENT_TYPE_ID", "CALLING_NO_GRP_ID", "CALLED_NO_GRP_ID",
 "SUBSCRIBER_TYPE_ID", "ROAMING_TYPE_ID", "CALL_TYPE_ID", "NE")
PCTFREE 10 INITRANS 2 MAXTRANS 255
STORAGE(
BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
LOCAL
(PARTITION "P_FIRST",PARTITION "SYS_P3525304",PARTITION "SYS_P3537238",
PARTITION "SYS_P3548658",
PARTITION "SYS_P3560119",PARTITION "SYS_P3573892",PARTITION "SYS_P3579810",
PARTITION "SYS_P3591451",PARTITION "SYS_P3603371",PARTITION "SYS_P3620692",
PARTITION "SYS_P3636416",PARTITION "SYS_P3640544",PARTITION "SYS_P3646479",
PARTITION "SYS_P3658676",PARTITION "SYS_P3672966",PARTITION "SYS_P3686620",
PARTITION "SYS_P3708768",PARTITION "SYS_P3717108",PARTITION "SYS_P3726469",
PARTITION "SYS_P3737761",PARTITION "SYS_P3753479",PARTITION "SYS_P3766044",
PARTITION "SYS_P3778936",PARTITION "SYS_P3785475",PARTITION "SYS_P3796607",
PARTITION "SYS_P3802547",PARTITION "SYS_P3812703",PARTITION "SYS_P3824595",
PARTITION "SYS_P3828487",PARTITION "SYS_P3840822",
PARTITION "SYS_P3844642",PARTITION "SYS_P3854731",PARTITION "SYS_P3864643",
PARTITION "SYS_P3875729",PARTITION "SYS_P3886228",PARTITION "SYS_P3895740",
PARTITION "SYS_P3905488",PARTITION "SYS_P3916783",PARTITION "SYS_P3919643",
PARTITION "SYS_P3928320",PARTITION "SYS_P3939494",PARTITION "SYS_P3954794",
PARTITION "SYS_P3963313",PARTITION "SYS_P3974561",PARTITION "SYS_P3984998",
PARTITION "SYS_P3998992",PARTITION "SYS_P4008979",PARTITION "SYS_P4013778",
PARTITION "SYS_P4023967",PARTITION "SYS_P4035561",PARTITION "SYS_P4047192",
PARTITION "SYS_P4052024",PARTITION "SYS_P4073615",PARTITION "SYS_P4077074",
PARTITION "SYS_P4080872"
PCTFREE 10 INITRANS 2 MAXTRANS 125 LOGGING
STORAGE(
BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
TABLESPACE "RAS_UC_DAT_TAB");


Index created.

To rebuild index partitions:

set time on;
set timing on;
ALTER INDEX EASYOWNER.EASYINVOICE REBUILD PARTITION EAB_P1_2015;
ALTER INDEX EASYOWNER.EASYINVOICE REBUILD PARTITION SYS_P10766;
ALTER INDEX EASYOWNER.EASYINVOICE REBUILD PARTITION SYS_P12025;
ALTER INDEX EASYOWNER.EASYINVOICE REBUILD PARTITION SYS_P13245;
ALTER INDEX EASYOWNER.EASYINVOICE REBUILD PARTITION SYS_P14403;
ALTER INDEX EASYOWNER.EASYINVOICE REBUILD PARTITION SYS_P1520;

Oracle will automatically use equal partitioning of the index based upon the number of partitions in the indexed table. For example, if the table had five partitions and we specified only four index partitions in the CREATE INDEX, the statement would fail because the partition counts do not match. This equipartitioning also makes index maintenance easier, since a single partition can be taken offline and its index partition rebuilt without affecting the other partitions in the table.

Global partitioned indexes

A global index is a one-to-many relationship, allowing one index partition to map to many table partitions. A global partitioned index is used for all other indexes except for the one that is used as the table partition key. Global partitioned indexes are common in OLTP (online transaction processing) applications, where fewer index probes are required than with local partitioned indexes. In the global index partition scheme the index is harder to maintain, since the index may span partitions in the base table.
For example, when a table partition is dropped as part of a reorganization, the entire global index will be affected. When defining a global partitioned index, the DBA has complete freedom to specify as many partitions for the index as desired.

  • A global partitioned index is an index on a partitioned or non-partitioned table that is partitioned independently, i.e. using a different partitioning key from the table. Global-partitioned indexes can be range or hash partitioned.
  • Global partitioned indexes are more difficult to maintain than local indexes. However, they do offer a more efficient access method to any individual record.
  • During table or index interaction during partition maintenance, all partitions in a global index will be affected.
  • When the underlying table partition has any SPLIT, MOVE, DROP, or TRUNCATE maintenance operations performed on it, both global indexes and global partitioned indexes will be marked as unusable. It therefore follows that partition independence is not possible for global indexes.
  • Depending on the type of operation performed on a table partition, the indexes on the table will be affected. When altering a table partition, the UPDATE INDEXES clause can be specified; this automatically maintains the affected global indexes and partitions (see the example after this list).
  • The advantages of using this option are that the index will remain online and available throughout the operation and does not have to be rebuilt once the operation has completed.
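A minimal sketch of the UPDATE INDEXES clause mentioned in the list above; the table and partition names are hypothetical:

ALTER TABLE all_fact DROP PARTITION sales_q1 UPDATE INDEXES;

Without the clause, a global index on the table would be left UNUSABLE after the DROP and would have to be rebuilt.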


Now that we understand the concept, let’s examine the Oracle CREATE INDEX syntax for a globally partitioned index:

CREATE INDEX item_idx
   ON all_fact (item_nbr)
   GLOBAL PARTITION BY RANGE (item_nbr)
   (PARTITION city_idx1 VALUES LESS THAN (100),
    PARTITION city_idx2 VALUES LESS THAN (200),
    PARTITION city_idx3 VALUES LESS THAN (300),
    PARTITION city_idx4 VALUES LESS THAN (400),
    PARTITION city_idx5 VALUES LESS THAN (MAXVALUE));

Here we see that the item index has been defined with five partitions, each containing a subset of the index range values; note that the highest partition of a global index must be bounded by MAXVALUE. It is irrelevant how many partitions the base table has; in fact, it is acceptable to create a global partitioned index on a table that is not partitioned at all.


Making Failover Seamless

In addition to adding database instances to mitigate node failure, Oracle RAC offers a number of technologies to make a node failover seamless to the application (and subsequently, to the end user),
including the following:

Transparent Application Failover
Fast Connect Failover

Transparent Application Failover (TAF) is a client-side feature. The term refers to the failover/reestablishment of sessions in case of instance or node failures. TAF is not limited to RAC configurations; active/passive clusters can benefit equally from it. TAF can be defined through local naming in the client’s tnsnames.ora file or, alternatively, as attributes to a RAC database service. The
latter is the preferred way of configuring it. Note that this feature requires the use of the OCI libraries, so thin-client only applications won’t be able to benefit from it. With the introduction of the Oracle Instant client, this problem can be alleviated somewhat by switching to the correct driver.

TAF can operate in two ways:

It can either restore a session or re-execute a select statement in the event of a node failure. While this feature has been around for a long time, Oracle's Net Manager configuration assistant doesn't provide support for setting up client-side TAF. Also, TAF isn't the most elegant way of handling node failures, because any in-flight transactions will be rolled back; TAF can resume running select statements only.

The fast connection failover feature provides a different way of dealing with node failures and other types of events published by the RAC high availability framework (also known as the Fast Application Notification, or FAN). It is more flexible than TAF.
Fast Connection Failover offers a driver-independent way for your JDBC application to take advantage of the connection failover facilities introduced in 10g Release 1 (10.1).
    When a RAC service failure is propagated to the JDBC application, the database has already rolled back the local transaction. The cache manager then cleans up all invalid connections. When an application holding an invalid connection tries to do work through that connection, it receives a SQLException ORA-17008, Closed Connection. The application has to handle the exception and reconnect.

What are the different types of failover mechanisms available

  1. JDBC-THIN driver supports Fast Connection Failover (FCF)
  2. JDBC-OCI driver supports Transparent Application Failover (TAF)
  3. JDBC-THIN 11gR2 supports Single Client Access Name (SCAN)

Can I use FCF and TAF together ?
No. Only one of them should be used at a time.
 

Is FCF provided by Oracle JDBC 9i drivers ?

No. FCF is built on the pooling feature known as 'Implicit Connection Caching' and this is available only with JDBC 10g or higher versions.
Please also note that in 11gR2, FCF through the Implicit Connection Cache is deprecated along with the Implicit Connection Cache itself, in favor of using the Universal Connection Pool (UCP).
 

What is SCAN ?  Which version of JDBC supports SCAN ?

SCAN or Single Client Access Name is a new Oracle Real Application Clusters (RAC) 11g Release 2 feature that provides a single name for clients to access an Oracle Database running in a cluster.
 The benefit is clients using SCAN do not need to change if you add or remove nodes in the cluster.  Having a single name to access the cluster allows clients to use the EZConnect client and the simple JDBC thin URL to access any database running in the clusters independently of which server(s) in the cluster the database is active. SCAN provides load balancing and failover of client connections to the database. The SCAN works as an IP alias for the cluster.
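For illustration, both access methods mentioned above, using a made-up SCAN name and the easy_srv service from earlier in this post:

sqlplus system@//cluster01-scan.example.com:1521/easy_srv
jdbc:oracle:thin:@//cluster01-scan.example.com:1521/easy_srv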



  What exactly is the use of FCF?
  
  FCF provides very fast notification of the failure and the ability to reconnect immediately using the same URL. When a RAC node fails, the application will receive an exception; the application has to handle the exception and reconnect.
    The JDBC driver does not re-target existing connections. If a node fails the application must close the existing connection and get a new one. The way the application knows that the node failed is by getting an exception. There is more than one ORA error that can be thrown when a node fails,the application must be able to deal with them all.
    An application may call isFatalConnectionError() API on the OracleConnectionCacheManager to determine if the SQLException caught is fatal.

    If the return value of this API is true, we need to retry the getConnection on the DataSource.


  How do we use FCF with JDBC driver?

    In order to use FCF with JDBC, the following things must be done:
  • Configure and start ONS. If ONS is not correctly set up, implicit connection cache creation fails and an ONSException is thrown at the first getConnection() request.
  • See Oracle® Universal Connection Pool for JDBC Developer's Guide in the section Configuring ONS located in Using Fast Connection Failover
  • FCF is now configured through a pool-enabled data source and is tightly integrated with UCP.  The FCF enabled through the Implicit Connection Cache as was used in 10g and 11g R1 is now deprecated.
  • Set the FastConnectionFailoverEnabled property before making the first getConnection() request to an OracleDataSource. When Fast Connection Failover is enabled, the failover applies to all connections in the pool.
  • Use a service name rather than a SID when setting the OracleDataSource url property.

What is Transparent Application Failover

Transparent Application Failover (TAF) or simply Application Failover is a feature of the OCI driver. It enables you to automatically reconnect to a database if the database instance to which the connection is made goes down. In this case, the active transactions roll back, returning the session to the state of the last committed transaction. The new database connection, though created by a different node, is identical to the original. This is true regardless of how the connection was lost.
TAF is always active and does not have to be set.
TAF cannot be used with the JDBC thin driver.

Failover Modes

Transparent Application Failover can be configured to work in two modes, or it can be deactivated. If we count deactivated as a mode, it means TAF can be assigned the following three options:
  • Session failover
  • Select failover
  • None (default)

Failover Type Events

The following are possible failover types in the OracleOCIFailover interface:

    FO_SESSION
    Is equivalent to FAILOVER_MODE=SESSION in the tnsnames.ora file CONNECT_DATA flags. This means that only the user session is re-authenticated on the server-side while open cursors in the OCI application need to be re-executed.

    FO_SELECT
    Is equivalent to FAILOVER_MODE=SELECT in tnsnames.ora file CONNECT_DATA flags. This means that not only the user session is re-authenticated on the server-side, but open cursors in the OCI can continue fetching. This implies that the client-side logic maintains fetch-state of each open cursor.

    FO_NONE
    Is equivalent to FAILOVER_MODE=NONE in the tnsnames.ora file CONNECT_DATA flags. This is the default, in which no failover functionality is used. This can also be explicitly specified to prevent failover from happening. Additionally, FO_TYPE_UNKNOWN implies that a bad failover type was returned from the OCI driver.

Failover Methods
With the failover mode specified, users can further define a method that dictates exactly how TAF will re-establish the session on the other instance. A failover method can be defined independently of the
failover type. 
The failover method determines how the failover works; the following options are available:
  • Basic
  • Preconnect

The BASIC option instructs the client to establish a new connection only after the node has failed. This can potentially lead to a large number of new connection requests to the surviving instance. In the case of a two-node RAC, this might cause performance degradation until all user
connections are re-established. If you consider using this approach, you should test for potential performance degradation during the design stage.
The preconnect option is slightly more difficult to configure. When you specify the preconnect parameter, the client is instructed to preconnect a session to a backup instance to speed up session failover. You need to bear in mind that these preconnections increase the number of sessions to the cluster. In addition, you also need to define what the backup connection should be.
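A sketch of a client-side TAF entry in tnsnames.ora; the host name is made up, and TYPE and METHOD correspond to the failover types and methods described above:

EASY_SRV_TAF =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = cluster01-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = easy_srv)
      (FAILOVER_MODE =
        (TYPE = SELECT)
        (METHOD = BASIC)
        (RETRIES = 180)
        (DELAY = 5)
      )
    )
  )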

What are HugePages and Large Pages

HugePages is a feature integrated into the Linux kernel with release 2.6. It provides an alternative to the standard 4K page size (16K on IA64), allowing larger pages, which is useful when working with very large amounts of memory.
If you run an Oracle Database on a Linux server with more than 16 GB physical memory and your System Global Area (SGA) is greater than 8 GB, you should configure HugePages; Oracle promises better performance by doing this. A HugePages configuration means that the Linux kernel can handle "large pages", as Oracle generally calls them: instead of the standard 4 KB on x86 and x86_64 (16 KB on IA64), pages are 4 MB on x86, 2 MB on x86_64 and 256 MB on IA64. Bigger pages mean that the system uses fewer page tables and manages fewer mappings, which reduces the effort for their management and access.

However, there is a limitation by Oracle, because Automatic Memory Management (AMM) does not support HugePages. If you already use AMM and MEMORY_TARGET is set, you have to disable it and switch back to Automatic Shared Memory Management (ASMM), i.e. set SGA_TARGET and PGA_AGGREGATE_TARGET. There is another feature called Transparent HugePages (THP), delivered since Red Hat Enterprise Linux 6 and its derivatives, which should be disabled as well; Oracle as well as Red Hat recommend disabling Transparent HugePages.

Why Do You Need HugePages?

HugePages is crucial for faster Oracle database performance on Linux if you have a large RAM and SGA. If the combined size of your database SGAs is large (more than 8 GB; it can even be important for smaller sizes), you will need HugePages configured. Note that the size of the SGA matters. Advantages of HugePages are:

  • Larger Page Size and Less # of Pages: Default page size is 4K whereas the HugeTLB size is 2048K. That means the system would need to handle 512 times less pages.
  • Reduced Page Table Walking: Since a HugePage covers greater contiguous virtual address range than a regular sized page, a probability of getting a TLB hit per TLB entry with HugePages are higher than with regular pages. This reduces the number of times page tables are walked to obtain physical address from a virtual address.
  • Less Overhead for Memory Operations: On virtual memory systems (any modern OS) each memory operation is actually two abstract memory operations. With HugePages, since there are less number of pages to work on, the possible bottleneck on page table access is clearly avoided.
  • Less Memory Usage: From the Oracle Database perspective, with HugePages, the Linux kernel will use less memory to create pagetables to maintain virtual to physical mappings for SGA address range, in comparison to regular size pages. This makes more memory to be available for process-private computations or PGA usage.
  • No Swapping: Swapping must be avoided on the Linux OS at all costs (see Document 1295478.1). HugePages are not swappable (whereas regular pages are). Therefore there is no page replacement mechanism overhead. HugePages are universally regarded as pinned.
  • No 'kswapd' Operations: kswapd will get very busy if there is a very large area to be paged (i.e. 13 million page table entries for 50GB memory) and will use an incredible amount of CPU resource. When HugePages are used, kswapd is not involved in managing them. See also Document 361670.1


Troubleshooting

Some of the common problems and how to troubleshoot them are listed in the following table:

  • Symptom: System is running out of memory or swapping. Possible cause: not enough HugePages to cover the SGA(s), so the area reserved for HugePages is wasted while the SGAs are allocated through regular pages. Action: review your HugePages configuration to make sure that all SGAs are covered.
  • Symptom: Databases fail to start. Possible cause: memlock limits are not set properly. Action: make sure the settings in limits.conf apply to the database owner account.
  • Symptom: One database fails to start while another is up. Possible cause: the SGA of the specific database could not find available HugePages and the remaining RAM is not enough. Action: make sure that the RAM and HugePages are enough to cover all your database SGAs.
  • Symptom: Cluster Ready Services (CRS) fail to start. Possible cause: HugePages configured too large (maybe larger than the installed RAM). Action: make sure the total SGA is less than the installed RAM and re-calculate HugePages.
  • Symptom: HugePages_Total = HugePages_Free. Possible cause: HugePages are not used at all; no database instances are up, or they are using AMM. Action: disable AMM and make sure that the database instances are up. See Doc ID 1373255.1.
  • Symptom: Database started successfully but performance is slow. Possible cause: the SGA of the specific database could not find available HugePages, so the SGA is handled by regular pages, which leads to slow performance. Action: make sure that there are enough HugePages to cover all your database SGAs.
So let's get started and come to the 7 steps:

1. Check Physical Memory

First we should check our physically available memory. In the example we have about 128 GB of RAM. SGA_TARGET and PGA_AGGREGATE_TARGET together should not be more than the available memory, and there should also be enough room left for the OS processes themselves:

grep MemTotal /proc/meminfo

MemTotal: 132151496 kB

2. Check Database Parameter

Second, check your database parameters. Is AMM disabled? MEMORY_TARGET and MEMORY_MAX_TARGET should be set to 0:

SQL> select value from v$parameter where name = 'memory_target';

VALUE
---------------------------
0
How big is our SGA? In this example, about 40 GB. Important: in the following query we convert the value directly into kB (value/1024), so that we can continue the calculation with it:

SQL> select value/1024 from v$parameter where name = 'sga_target';

VALUE
---------------------------
41943040
Finally, the parameter use_large_pages should be enabled (it is TRUE by default):

SQL> select value from v$parameter where name = 'use_large_pages';

VALUE
---------------------------
TRUE

3. Check Hugepagesize

In our example we use an x86_64 Red Hat Enterprise Linux server, so by default the Hugepagesize should be 2 MB:

grep Hugepagesize /proc/meminfo

Hugepagesize:       2048 kB

4. Calculate Hugepages

Calculating the number of HugePages is straightforward:

SGA / Hugepagesize = Number Hugepages
Following our example:

41943040 / 2048 = 20480

If you run more than one database on your server, you should include the SGA of all of your instances into the calculation:

( SGA 1. Instance + SGA 2. Instance + … etc. ) / Hugepagesize = Number Hugepages
In My Oracle Support you can find a script called hugepages_settings.sh (Doc ID 401749.1) which does this calculation. It also checks your kernel version and the shared memory segments actually in use by the SGAs. Keep in mind that the script only sees SGAs that are currently allocated: if a second instance is down, it is not taken into account. So adjust your SGA settings and restart your databases first, then run the script. The result should be a line like the one below; you can also do the calculation by hand (as in the sketch below) and compare it with the script's output:

Recommended setting: vm.nr_hugepages = 20480
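If you want to see the arithmetic for yourself, the following is a minimal shell sketch of the same idea; it is NOT the My Oracle Support script, and it assumes the instances whose SGAs you care about are currently up, since it sums the shared memory segments reported by ipcs and divides by the HugePage size:

#!/bin/bash
# Sketch only: estimate vm.nr_hugepages from the currently allocated
# shared memory segments (the running SGAs).
HPG_SZ=$(grep Hugepagesize /proc/meminfo | awk '{print $2}')   # HugePage size in kB
NUM_PG=0
for SEG_BYTES in $(ipcs -m | awk '/^0x/ {print $5}'); do
    PAGES=$(( SEG_BYTES / (HPG_SZ * 1024) ))
    [ $PAGES -gt 0 ] && NUM_PG=$(( NUM_PG + PAGES + 1 ))
done
echo "Recommended setting: vm.nr_hugepages = $NUM_PG"

Run it while all database instances are up and compare the result with your manual calculation.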

5. Change Server Configuration

The next step is to enter the number of HugePages in the kernel parameter file, which requires root permissions. On Red Hat Linux 6 this is /etc/sysctl.conf:

vi /etc/sysctl.conf

vm.nr_hugepages=20480
Correctly inserted, following result should show up:

grep vm.nr_hugepages /etc/sysctl.conf

vm.nr_hugepages=20480 
The next parameters are the hard and soft memlock limits in /etc/security/limits.conf for the oracle user. This value (in kB) should be less than the installed memory but not less than the HugePages allocation, so that the HugePages fit into it completely. The minimum value is calculated as follows:

Number Hugepages * Hugepagesize = minimum Memlock
Following our example:

20480 * 2048 = 41943040
vi /etc/security/limits.conf

oracle               soft    memlock 41943040
oracle               hard    memlock 41943040
Correctly inserted, following result should show up:

grep oracle /etc/security/limits.conf

...
oracle               soft    memlock 41943040
oracle               hard    memlock 41943040
As mentioned before, Transparent HugePages must be disabled on Red Hat Linux 6 and later:

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
[always] madvise never

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
always madvise [never]
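Note that the echo commands above do not survive a reboot. A common way to make the setting permanent on RHEL 6 (an assumption; adapt the path and method to your OS version) is to repeat them in /etc/rc.local:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag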

6. Server Reboot

Once all parameters are set, reboot the server completely. As an alternative you can reload the kernel parameters with sysctl -p, but on a system that has been running for a while the kernel may not be able to allocate all requested HugePages without a reboot.

7. Check Configuration

Memlock correct?

ulimit -l

41943040 

HugePages correctly configured and in use?

grep Huge /proc/meminfo

AnonHugePages:    538624 kB 
HugePages_Total:    20480
HugePages_Free:     12292
HugePages_Rsvd:      8188
HugePages_Surp:        0
Hugepagesize:       2048 kB
Transparent Hugepages disabled?

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled

always madvise [never]
Does the database use HugePages? To verify, take a look into the alert log. After "Starting ORACLE instance (normal)" the "Large Pages Information" entry tells us:

************************ Large Pages Information *******************

Per process system memlock (soft) limit = 100 GB
Total Shared Global Region in Large Pages = 40 GB (100%) 
Large Pages used by this instance: 20481 (40 GB)
Large Pages unused system wide = 0 (0 KB)
Large Pages configured system wide = 20481 (40 GB)
Large Page size = 2048 KB

********************************************************************
If your configuration is incorrect, Oracle prints a recommendation for the right setting here. In the following example exactly one page (2048 kB) is missing, and the memlock limit also needs to grow, before 100% of the SGA can be allocated with HugePages:

************************ Large Pages Information *******************
...
...

RECOMMENDATION:

Total System Global Area size is 40 GB. For optimal performance,
prior to the next instance restart:
1. Increase the number of unused large pages by
at least 1 (page size 2048 KB, total size 2048 KB) system wide to
get 100% of the System Global Area allocated with large pages
2. Large pages are automatically locked into physical memory.
Increase the per process memlock (soft) limit to at least 40 GB to lock
100% System Global Area's large pages into physical memory

********************************************************************
Done!


Why the OLR is used and its significance at Clusterware startup

An additional cluster configuration file has been introduced with Oracle 11.2, the so-called Oracle Local Registry (OLR). Each node has its own copy of the file in the Grid Infrastructure software home.
The OLR stores important security contexts used by the Oracle High Availability Service early in the start sequence of Clusterware. The information in the OLR and the Grid Plug and Play configuration file are needed to locate the voting disks. If they are stored in ASM, the discovery string in the GPnP profile will be used by the cluster synchronization daemon to look them up. Later in the Clusterware boot sequence,the ASM instance will be started by the cssd process to access the OCR files; however, their location is stored in the /etc/ocr.loc file, just as it is in RAC 11.1. Of course, if the voting files and OCR are on a shared cluster file system, then an ASM instance is not needed and won’t be started unless a different resource depends on ASM.
Storing Information in the Oracle Local Registry
The Oracle Local Registry is the OCR’s local counterpart and a new feature introduced with Grid Infrastructure. The information stored in the OLR is needed by the Oracle High Availability Services daemon (OHASD) to start; this includes data about GPnP wallets, Clusterware configuration, and version information. Comparing the OCR with the OLR reveals that the OLR has far fewer keys;
for example,
ocrdump reported 704 different keys for the OCR vs. 526 keys for the OLR on our installation.
If you compare only the keys again, you will notice that the majority of keys in the OLR deal with the OHASD process, whereas the majority of keys in the OCR deal with the CRSD. This confirms what we said earlier: you need the OLR (along with the GPnP profile) to start the High Availability Services stack.
In contrast, the OCR is used extensively by CRSD. The OLR is maintained by the same command-line utilities as the OCR, with the appended -local option. Interestingly, the OLR is automatically backed up during an upgrade to Grid Infrastructure, whereas the OCR is not.
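The OLR can be inspected and backed up with the same utilities as the OCR by adding the -local option, for example (the dump file path is only an example; run as root or the Grid Infrastructure owner):

ocrcheck -local
ocrdump -local /tmp/olr_dump.txt
ocrconfig -local -manualbackup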

What is Grid Infrastructure Agents

In Oracle 11gR2 and later, there are two new types of agent processes: the Oracle Agent and the Oracle Root Agent. These processes interface between Oracle Clusterware and managed resources.
In previous versions of Oracle Clusterware, this functionality was provided by the RACG family of scripts and processes.
To slightly complicate matters, there are two sets of Oracle Agents and Oracle Root Agents, one for the High Availability Services stack and one for the Cluster Ready Services stack.
The Oracle Agent and Oracle Root Agent that belong to the High Availability Services stack are started by ohasd daemon. The Oracle Agent and Oracle Root Agent pertaining to the Cluster Ready
Services stack are started by the crsd daemon. In systems where the Grid Infrastructure installation is not owned by Oracle—and this is probably the majority of installations—there is a third Oracle Agent
created as part of the Cluster Ready Services stack. Similarly, the Oracle Agent spawned by OHAS is owned by the Grid Infrastructure software owner.
In addition to these two processes, there are agents responsible for managing and monitoring the CSS daemon, called CSSDMONITOR and CSSDAGENT. CSSDAGENT, the agent process responsible for spawning CSSD is created by the OHAS daemon. CSSDMONITOR, which monitors CSSD and the overall node health (jointly with the CSSDAGENT), is also spawned by OHAS.
You might wonder how CSSD, which is required to start the clustered ASM instance, can be started if voting disks are stored in ASM? This sounds like a chicken-and-egg problem: without access to the voting disks there is no CSS, hence the node cannot join the cluster. But without being part of the
cluster, CSSD cannot start the ASM instance. To solve this problem the ASM disk headers have new metadata in 11.2: you can use kfed to read the header of an ASM disk containing a voting disk. The kfdhdb.vfstart and kfdhdb.vfend fields tell CSS where to find the voting file. This does not require the ASM instance to be up. Once the voting disks are located, CSS can access them and join the cluster.
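For instance, you can verify these pointers with kfed; the device path below is only an example for an ASMLib disk, so substitute a disk that actually holds a voting file:

kfed read /dev/oracleasm/disks/DATA1 | grep vf

Non-zero kfdhdb.vfstart and kfdhdb.vfend values mark the begin and end of the voting file area on that disk.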
The high availability stack’s Oracle Agent runs as the owner of the Grid Infrastructure stack in a clustered environment, as either the oracle or grid users. It is spawned by OHAS directly as part of the cluster startup sequence, and it is responsible for starting resources that do not require root privileges.

The list of processes Oracle Agent starts includes the following:

  • EVMD and EVMLOGGER
  • the gipc daemon
  • the gpnp daemon
  • The mDNS daemon
The Oracle Root Agent that is spawned by OHAS in turn starts all daemons that require root privileges to perform their programmed tasks. Such tasks include the following:
  • CRS daemon
  • CTSS daemon
  • Disk Monitoring daemon
  • ACFS drivers
Once CRS is started, it will create another Oracle Agent and Oracle Root Agent. If Grid Infrastructure is owned by the grid account, a second Oracle Agent is created. The grid Oracle Agent(s) will be responsible for:
  • Starting and monitoring the local ASM instance
  • The ONS and eONS daemons
  • The SCAN listener, where applicable
  • The node listener
There can be a maximum of three SCAN listeners in the cluster at any given time. If you have more than three nodes, then you can end up without a SCAN listener on a node. Likewise, in the extreme example where there is only one node in the cluster, you could end up with three SCAN listeners on that node. The oracle Oracle Agent will only spawn the database resource if account separation is used. If not—i.e., if you didn't install Grid Infrastructure with a different user than the RDBMS binaries—then the oracle Oracle Agent will also perform the tasks listed previously for the grid Oracle Agent.
The Oracle Root Agent finally will create the following background processes:
  • GNS, if configured
  • GNS VIP if GNS enabled
  • ACFS Registry
  • Network
  • SCAN VIP, if applicable
  • Node VIP
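You can list the High Availability Services stack resources, and see the state of the daemons the agents manage, with crsctl on the local node, for example:

crsctl stat res -init -t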

The functionality provided by the Oracle Agent process in Oracle 11gR2

Clusterware startup sequence

Here is a brief explanation of how the Clusterware stack is brought up, step by step.

  1.  When a node of an Oracle Clusterware cluster restarts, OHASD is started by platform-specific means. OHASD is the root for bringing up Oracle Clusterware. OHASD has access to the OLR (Oracle Local Registry) stored on the local file system. OLR provides needed data to complete OHASD initialization.
  2. OHASD brings up GPNPD and CSSD. CSSD has access to the GPNP Profile stored on the local file system. This profile contains the following vital bootstrap data; 
        a. ASM Diskgroup Discovery String 
        b. ASM SPFILE location (Diskgroup name) 
        c. Name of the ASM Diskgroup containing the Voting Files 

  3. The Voting File locations on ASM disks are accessed by CSSD using well-known pointers in the ASM disk headers, so CSSD is able to complete initialization and start or join an existing cluster.

  4. OHASD starts an ASM instance, and ASM can now operate with CSSD initialized and operating. The ASM instance uses special code to locate the contents of the ASM SPFILE, assuming it is stored in a diskgroup.

  5. With an ASM instance operating and its diskgroups mounted, access to Clusterware's OCR is available to CRSD.
  6. OHASD starts CRSD with access to the OCR in an ASM diskgroup.
  7. Clusterware completes initialization and brings up other services under its control.

or

12c Oracle Clusterware Startup Sequence - Oracle Clusterware starts automatically when a RAC node boots. The startup sequence runs through several levels; below you can see how the multi-level startup process brings up the full Grid Infrastructure stack and the resources that Clusterware manages.

This tutorial describes the startup sequence of Oracle 12c RAC Clusterware installed on a Unix/Linux platform.

[Figure: Oracle 12c RAC Clusterware startup sequence]


Once the operating system finishes its bootstrap process, the init daemon processes /etc/inittab. The entry in /etc/inittab is what triggers the Oracle High Availability Services daemon:

$cat /etc/inittab | grep init.d | grep -v grep
h1:35:respawn:/etc/init.d/init.ohasd run  >/dev/null  2>&1 </dev/null
Oracle Linux 6.x and Red Hat Linux 6.x have deprecated inittab; there init.ohasd is configured to start through Upstart in /etc/init/oracle-ohasd.conf:

    $cat /etc/init/oracle-ohasd.conf

    start on runlevel [35]
    stop on runlevel [!35]
    respawn
    exec /etc/init.d/init.ohasd run > /dev/null 2>&1 </dev/null

    This starts up "init.ohasd run", which in turn starts the ohasd.bin background process:

    $ps -ef | grep ohasd | grep -v grep
    root  4056  1  1  Feb19   ?     01:54:34 /u01/app/12.1.0/grid/bin/ohasd.bin reboot
    root  2715  1  0  Feb19   ?     00:00:00 /bin/sh /etc/init.d/init.ohasd run

    OHASD (Oracle High Availability Services daemon) - the same stack is also known as Oracle Restart on single-instance systems.

    First, init triggers OHASD at Level 0; once OHASD is started, it is responsible for starting the rest of the Clusterware stack and the resources that Clusterware manages, directly or indirectly, through Levels 1-4.

    Level 1 - OHASD on its own triggers four agent processes:

    cssdmonitor : CSS Monitor
    OHASD orarootagent : High Availability Service stack Oracle root agent
    OHASD oraagent : High Availability Service stack Oracle Agent
    cssdagent : CSS Agent


    Level 2 - On this level, the OHASD oraagent triggers five processes:

    mDNSD : mDNS daemon process
    GIPCD : Grid Interprocess Communication daemon
    GPnPD : GPnP Profile daemon
    EVMD : Event Monitor daemon
    ASM : Resource for monitoring ASM instances

    Then, the OHASD orarootagent triggers the following processes:
    CRSD : CRS daemon
    CTSSD : Cluster Time Synchronisation Service daemon
    Diskmon : Disk Monitor daemon (Exadata Storage Server)
    ACFS : ASM Cluster File System drivers

    Next, the cssdagent starts the CSSD (CSS daemon) process.

Level 3 - The CRSD spawns two CRSD agents : CRSD orarootagent and CRSD oracleagent.
Level 4 - On this level, the CRSD orarootagent is responsible for starting the following resources:
Network resource : for the public network
SCAN VIPs
Node VIPs : VIPs for each node
ACFS Registry
GNS VIP : VIP for GNS if you use the GNS option
Then, the CRSD oraagent is responsible for starting the rest of the resources, as follows:
ASM Resources : ASM instance(s) resource
Diskgroup : Used for managing / monitoring ASM diskgroups
DB Resource : Used for managing and monitoring the database and instances
SCAN Listener : Listener for SCAN, listening on the SCAN VIP
Listener : Node listener listening on the node VIP
Services : Database services
ONS
eONS : Enhanced ONS
GSD : For 9i backward compatibility
GNS : Performs name resolution (optional)
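Once the stack is up, a quick sanity check is to look for the daemons from the different levels at the operating system level, for example:

ps -ef | grep -E 'ohasd|cssd|crsd|evmd|gpnpd|gipcd' | grep -v grep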


How Database interact with ASM


The file creation process provides a fine illustration of the interactions that take place between
database instances and ASM. The file creation process occurs as follows:

1. The database requests file creation.
2. An ASM foreground process creates a Continuing Operation Directory (COD) entry and
allocates space for the new file across the disk group.
3. The ASMB database process receives an extent map for the new file.
4. The file is now open and the database process initializes the file directly.
5. After initialization, the database process requests that the file creation is committed. This
causes the ASM foreground process to clear the COD entry and mark the file as created.
6. Acknowledgement of the file commit implicitly closes the file. The database instance will
need to reopen the file for future I/O.  
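From the database side this interaction is transparent. For example, a simple datafile creation in an ASM diskgroup (the diskgroup name +DATA is only an assumed example) drives steps 1 through 6 above:

SQL> CREATE TABLESPACE example_ts DATAFILE '+DATA' SIZE 100M;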


What is GPnP profile and its importance


The GPnP profile is an XML file located at $GRID_HOME/gpnp/<hostname>/profiles/peer/profile.xml. Each node of the cluster maintains a local copy of this profile, which is maintained by the GPnP daemon together with the mDNS daemon.
Before understanding why Oracle came up with the GPnP profile, we need to focus on what it contains. The GPnP profile defines a node's metadata about the network interfaces for the public and private interconnect, the ASM server parameter file, and the CSS voting disks. This profile is protected by a wallet against modification. If you have to modify the profile manually, it must first be unsigned with $GRID_HOME/bin/gpnptool, modified, and then signed again with the same utility; however, there is only a very slight chance you will ever be required to do so.

Now we'll use gpnptool with the get option to dump this XML file to standard output; below is the output, formatted for readability.
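The command itself, run as the Grid Infrastructure owner, is simply:

$GRID_HOME/bin/gpnptool get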

<?xml version="1.0" encoding="UTF-8"?>
<gpnp:GPnP-Profile Version="1.0" xmlns="http://www.xyz/gpnp-profile"
xmlns:gpnp="http://xyz/gpnp-profile"
xmlns:orcl="http://xyz/gpnp-profile"
xmlns:xsi="http://xyz/XMLSchema-instance"
xsi:schemaLocation="http://xyz/gpnp-profile gpnp-profile.xsd"
ProfileSequence="3" ClusterUId="002c207a71cvaljgkcea7bea5b3a49"
ClusterName="Cluster01" PALocation="">
<gpnp:Network-Profile>
<gpnp:HostNetwork id="gen" HostName="*">
<gpnp:Network id="net1" IP="xxx.xx.x.x" Adapter="bond0" Use="public"/>
<gpnp:Network id="net2" IP="xxx.xxx.x.x" Adapter="bond1"
Use="cluster_interconnect"/>
</gpnp:HostNetwork>
</gpnp:Network-Profile>
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400" />
<orcl:ASM-Profile id="asm" DiscoveryString=""
SPFile="+DATA/prod/asmparameterfile/registry.253.699915959" />
<ds:Signature...>...</ds:Signature>
</gpnp:GPnP-Profile>

So from the above dump we can see that GPnP profile contains following information:-

1) Cluster Name
2) Network Profile
3) CSS-Profile tag
4) ASM-Profile tag

Now that we have understood the content of a GPnP profile, we need to understand how Clusterware uses this information to start. From 11gR2 you have the option of storing the OCR and voting disks in ASM, but Clusterware needs the OCR and voting disks to start crsd and cssd, and both of these files live in ASM, which is itself a resource of the node. So how does Clusterware start, and which files does it access to get the information it needs? To resolve this, Oracle came up with two local operating system files: the OLR and the GPnP profile.
When a node of an Oracle Clusterware cluster restarts, OHASD is started by platform-specific means. OHASD has access to the OLR (Oracle Local Registry) stored on the local file system. The OLR provides the data needed (explained in another post) to complete OHASD initialization.
OHASD brings up GPnP Daemon and CSS Daemon. CSS Daemon has access to the GPNP Profile stored on the local file system.
The Voting Files locations on ASM Disks are accessed by CSSD with well-known pointers in the ASM Disk headers and CSSD is able to complete initialization and start or join an existing cluster.
OHASD starts an ASM instance and ASM can now operate with CSSD initialized and operating. The ASM instance uses special code to locate the contents of the ASM SPFILE, assuming it is stored in a Diskgroup.
With an ASM instance operating and its diskgroups mounted, access to Clusterware's OCR is available to CRSD. OHASD starts CRSD with access to the OCR in an ASM diskgroup, and thus Clusterware completes initialization and brings up the other services under its control.
Thus, with the information stored in the GPnP profile, together with the information in the OLR, several startup tasks have been automated or made easier for administrators.
I hope the above information helps you understand the Grid Plug and Play profile, its content, its usage, and why it was required. Please comment below if you need more information on GPnP, such as the complete dump of the profile, how the GPnP daemon and the mDNS daemon communicate to keep the profile up to date on all nodes, or how oifcfg, crsctl, asmcmd, and other utilities use IPC to alter the content of these files.


What is voting disk

OCR is used to store the cluster configuration details. It stores the information about the resources that Oracle Clusterware controls. The resources include the Oracle RAC database and instances, listeners, and virtual IPs (VIPs) such as SCAN VIPs and local VIPs.

The voting disk (VD) stores the cluster membership information. Oracle Clusterware uses the VD to determine which nodes are members of a cluster. Oracle Cluster Synchronization Service daemon (OCSSD) on each cluster node updates the VD with the current status of the node every second. The VD is used to determine which RAC nodes are still in the cluster should the interconnect heartbeat between the RAC nodes fail.
CSS is the service that determines which nodes in the cluster are available and provides cluster group membership and simple locking services to other processes. CSS typically determines node availability via communication through a dedicated private network, with a voting disk used as a secondary communication mechanism. This is done by sending heartbeat messages through both the network and the voting disk.
 The voting disk is a file on a clustered file system that is accessible to all nodes in the cluster. Its primary purpose is to help in situations where the private network communication fails. The voting disk is then used to communicate the node state information used to determine which nodes go offline. Without the voting disk, it can be difficult for isolated nodes to determine whether they are experiencing a network failure or whether the other nodes are no longer available. It would then be possible for the cluster to enter a state in which multiple subclusters of nodes would have unsynchronized access to the same database files.
It contains information about the nodes in the cluster and their disk heartbeats: the CSSD process of each node registers its node's status in the voting disk with a pwrite() system call at a specific offset, and reads the status of the other CSSD processes with a pread() system call. But since the information about the nodes is also in the OCR/OLR, and these system calls have nothing to do with previous calls, there isn't any irreplaceable data kept in the voting disk. So, if you lose voting disks, you can simply add them back without losing any data. But, of course, losing voting disks can lead to node reboots, and if you lose all voting disks you will have to keep the CRS daemons down before you can add the voting disks back. Now that we have understood both heartbeats, which were the most important part, we will dig deeper into the voting disk: what is stored inside it, why Clusterware needs it, how many voting disks are required, and so on.
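As a quick reference, these are the standard crsctl commands to view and manage voting disks (the diskgroup name +DATA and the raw device path are only examples):

crsctl query css votedisk              # list the voting disks and their state
crsctl replace votedisk +DATA          # move the voting disks to another diskgroup (ASM)
crsctl add css votedisk /dev/raw/raw1  # add a voting disk (only when not stored in ASM)
crsctl delete css votedisk <FUID>      # remove a voting disk by its File Universal Id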
Now finally to understand the whole concept of voting disk we need to know what is split brain syndrome, I/O Fencing and simple majority rule.
Split Brain Syndrome: In an Oracle RAC environment all the instances/servers communicate with each other using high-speed interconnects on the private network. This private interconnect is redundant and is used only for inter-instance Oracle data block transfers. In the context of Oracle RAC, split brain occurs when the instances in a RAC fail to ping/connect to each other via this private interconnect, but the servers are all physically up and running and the database instance on each of these servers is also running. The individual nodes are running fine and can conceptually accept user connections and work independently. Due to the lack of communication, each instance thinks that the instances it cannot reach are down and that it needs to do something about the situation. The problem is that if we leave these instances running, the same block might get read and updated in the individual instances and there would be data integrity issues, as the blocks changed in one instance will not be locked and could be overwritten by another instance. This situation is termed Split Brain Syndrome.
I/O Fencing: There will be situations where the leftover write operations from failed database instances (the cluster functions failed on the nodes, but the nodes are still running at OS level) reach the storage system after the recovery process starts. Since these write operations are no longer in the proper serial order, they can damage the consistency of the stored data. Therefore, when a cluster node fails, the failed node needs to be fenced off from all the shared disk devices or diskgroups. This methodology is called I/O fencing, or failure fencing.
Simple Majority Rule: According to Oracle, "An absolute majority of voting disks configured (more than half) must be available and responsive at all times for Oracle Clusterware to operate." This means that to survive the loss of 'N' voting disks, you must configure at least '2N+1' voting disks.
Now we are in a state to understand the use of voting disk in case of heartbeat failure.

Example 1: Suppose in a 3-node cluster with 3 voting disks the network heartbeat fails between Node 1 and Node 3 and between Node 2 and Node 3, whereas Node 1 and Node 2 can still communicate via the interconnect. From the voting disk, CSSD notices that all the nodes can still write to the voting disks, so this is a split brain: the healthy nodes Node 1 and Node 2 update the kill block in the voting disk for Node 3. During the next pread() system call of Node 3's CSSD, it sees the self-kill flag set, and the CSSD of Node 3 evicts itself. Then I/O fencing takes place, and OHASD will finally attempt to restart the stack after a graceful shutdown.

Example 2: Suppose in a 2-node cluster with 3 voting disks a disk heartbeat fails such that Node 1 can see 2 voting disks and Node 2 can see only 1 (if the number of voting disks had not been odd, both nodes would have thought the other should be killed, making split brain difficult to avoid). Based on the simple majority rule, the CSSD process of Node 1 (2 voting disks) sends a kill request to the CSSD process of Node 2 (1 voting disk), so Node 2 evicts itself, I/O fencing takes place, and OHASD finally attempts to restart the stack after a graceful shutdown.
Thus the voting disk plays a role in both types of heartbeat failure, and is hence a very important file for node eviction and I/O fencing in a split brain situation.

I hope the above information helps you understand the concept of the voting disk, its purpose, what it contains, and when it is used. Please comment below if you need more information, such as split brain examples for a bigger cluster, how Oracle executes STONITH internally, which processes are involved in the complete node eviction process, how to identify the cause of a node eviction, and how a node is evicted, rebooted, and rejoined to the cluster.


What is node eviction and its troubleshooting steps

Clusterware will evict one or more nodes from the cluster if
a critical problem is detected. These problems include:

- A node not responding via network or disk heartbeat
- A hung node
- A hung ocssd.bin process

The purpose of this is to maintain the overall health of the cluster
by removing the suspected node.

In Grid Infrastructure a cluster is made up of two or more nodes, and there are two heartbeats: the voting disk heartbeat and the network heartbeat.

The network heartbeat goes across the interconnect: every second a sending thread of CSSD sends a network TCP heartbeat to itself and to all other nodes, and a receiving thread of CSSD receives the heartbeats. If network packets are dropped or contain errors, TCP's error correction mechanism retransmits the packet; Oracle does not re-transmit in this case. In the CSSD log you will see a WARNING message about a missing heartbeat if a node does not receive a heartbeat from another node for 15 seconds (50% of misscount). Another warning is reported in the CSSD log if the same node is missing for 22 seconds (75% of misscount), and similarly at 90% of misscount; when the heartbeat has been missing for 100% of the misscount (i.e. 30 seconds by default), the node is evicted.
The disk heartbeat is between the cluster nodes and the voting disk. The CSSD process on each RAC node maintains a heartbeat in a block of one OS block size, at a specific offset in the voting disk, using read/write system calls (pread/pwrite). In addition to maintaining its own disk block, each CSSD process also monitors the disk blocks maintained by the CSSD processes running on the other cluster nodes. The written block has a header area with the node name and a counter which is incremented with every beat (pwrite) from the other nodes. The disk heartbeat is maintained in the voting disk by the CSSD processes, and if a node has not written a disk heartbeat within the I/O timeout, the node is declared dead. Nodes that are in an unknown state, i.e. cannot definitively be said to be dead, and are not in the group of nodes designated to survive, are evicted: the node's kill block is updated to indicate that it has been evicted.

Thus, summarizing the heartbeats: the network heartbeat is sent every second and nodes must respond within the CSS misscount time, otherwise a node eviction follows. Similarly, for the disk heartbeat each node pings (reads/writes) the voting disk every second and must receive a response within the (long/short) disk timeout time.
Both heartbeats have threshold levels: for the network heartbeat the CSS misscount, which defaults to 30 seconds, and for the disk heartbeat the disktimeout, which defaults to 200 seconds. If the nodes cannot communicate with each other within the threshold time, one of the nodes will be evicted. The current thresholds can be checked as shown below.
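For example:

crsctl get css misscount
crsctl get css disktimeout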


Why voting disks should be configured in odd numbers

All nodes must be able to vote on the voting disks. For example, if we have three voting disks and one of them fails, we still have two voting disks, so Clusterware will not stop functioning, because the rule is that at any given time every node must be able to access more than 50 percent of the voting disks.

other examples

When you have 1 voting disk and it goes bad, the cluster stops functioning.

When you have 2 and 1 goes bad, the same happens because the nodes realize they can only write to half of the original disks (1 out of 2), violating the rule that they must be able to write > half (yes, the rule says >, not >=).

When you have 3 and 1 goes bad, the cluster runs fine because the nodes know they can access more than half of the original voting disks (2/3 > half).

When you have 4 and 1 goes bad, the same, because (3/4 > half).

When you have 3 and 2 go bad, the cluster stops because the nodes can only access 1/3 of the voting disks, not > half.

When you have 4 and 2 go bad, the same, because the nodes can only access half, not > half.


So you see that 4 voting disks have the same fault tolerance as 3, but you waste one disk without gaining anything. The recommendation to use an odd number of voting disks therefore also saves a little on hardware requirements.


Node Eviction Troubleshoot steps

 The node eviction process is reported as Oracle error ORA-29740 in the alert log and LMON trace files

1. Look at the cssd.log files on both nodes; usually we will get more information on the second node if the first node is evicted. Also take a look at the crsd.log file.

2. The evicted node will have core dump file generated and system reboot info.

3. Find out whether there was a node reboot, whether it was caused by CRS or by something else, and check the system reboot time.

4. If you see "Polling" keywords with decreasing percentage values in the cssd.log file, the eviction is probably due to the network.
If you see "Diskpingout" or other -DISK- related messages, the eviction is because of a disk timeout.

5. After determining whether it is a network or a disk issue, start digging deeper.

6. Now it is time to collect NMON/OSWatcher/RDA reports to confirm whether it was a disk or a network issue.

7. If we see heavy memory contention/paging in the reports, collect an AWR report to see what load/SQL was running during that period.

8. If the network was the issue, check whether any NIC was down or whether link switching happened, and verify that the private interconnect is working between the nodes.

9. Sometimes an eviction can also be due to an OS problem where the system is in a halted state for a while, memory is overcommitted, or the CPU is 100% used.

10. Check OS /system logfiles to get more information.

11. What got changed recently? Ask your coworker to open up a ticket with Oracle and upload logs

12. Check the health of clusterware, db instances, asm instances, uptime of all hosts and all the logs – ASM logs, Grid logs, CRS and ocssd.log,
HAS logs, EVM logs, DB instances logs, OS logs, SAN logs for that particular timestamp.

13. Check health of interconnect if error logs guide you in that direction.

14. Check the OS memory, CPU usage if error logs guide you in that direction.

15. Check storage error logs guide you in that direction.

16. Run TFA and review OSWatcher, netstat, and ifconfig output, etc., based on the error messages during your RCA (see the TFA example after this list).

17. Node evictions can occur because iptables has been enabled; after iptables was turned off, everything went back to normal.
Avoid enabling firewalls between the nodes. An ACL can open the required ports on the interconnect, as we did, but we still experienced all kinds of issues
(unable to start CRS, unable to stop CRS, and node evictions).
We also had a problem with the voting disks caused by presenting LDEVs using business copies / ShadowImage, which made RAC less than happy.

18. Verify user equivalence between the cluster nodes.
19. Verify the switch is used only for the interconnect. DO NOT use the same switch for other network operations.
20. Verify all nodes have exactly the same configuration; sometimes there are network or configuration differences that are not obvious.
Look for hangs in the logs and in monitoring tools like Nagios to see whether a node ran out of RAM or became unresponsive.
21. A major reason for node evictions in our cluster, however, was that the patch levels were not equal across the two nodes.

Nodes sometimes died completely, without any error whatsoever; it turned out to be a bug in the installer of the 11.1.0.7.1 PSU.
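When you open a service request for an eviction, Oracle will usually ask for a TFA collection covering the eviction window (as mentioned in step 16 above). A typical invocation looks like the following; the timestamps are placeholders and the exact date format depends on your TFA version:

tfactl diagcollect -from "Jul/17/2018 09:00:00" -to "Jul/17/2018 11:00:00"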


What is Undo retention and retention guarantee and its importance


UNDO_RETENTION is a parameter introduced in Oracle 9i.

This parameter is used to support the "flashback query" feature. However, a wrong setting of this parameter can contribute to ORA-1555 "snapshot too old" errors. The value for this parameter is specified in seconds. This parameter determines the lower threshold value of undo retention: the system retains undo for at least the time specified in this parameter.

Oracle 10g/11g automatically tunes undo retention to reduce the chances of "snapshot too old" errors during long-running queries.  In the event of any undo space constraints the system will prioritize DML operations over undo retention meaning the low threshold may not be achieved.  If the undo retention threshold must be guaranteed, even at the expense of DML operations, the RETENTION GUARANTEE clause can be set against the undo tablespace during or after creation:

-- Reset the undo low threshold.
ALTER SYSTEM SET UNDO_RETENTION = 2400;

-- Guarantee the minimum threshold is maintained.
ALTER TABLESPACE undotbs RETENTION GUARANTEE;

-- Check the value in data dictionary view.
SELECT tablespace_name, retention FROM dba_tablespaces;

TABLESPACE_NAME RETENTION
------------------------------ -----------
SYSTEM NOT APPLY
UNDOTBS GUARANTEE
SYSAUX NOT APPLY
TEMP NOT APPLY
USERS NOT APPLY

5 rows selected.

-- Switch back to the default mode.
ALTER TABLESPACE undotbs RETENTION NOGUARANTEE;

-- Check the value in data dictionary view.
SELECT tablespace_name, retention FROM dba_tablespaces;

TABLESPACE_NAME RETENTION
------------------------------ -----------
SYSTEM NOT APPLY
UNDOTBS NOGUARANTEE
SYSAUX NOT APPLY
TEMP NOT APPLY
USERS NOT APPLY

5 rows selected.

As the name suggests, the NOT APPLY value is assigned to non-undo tablespaces for which this functionality does not apply.

You can enable the guarantee option for the undo tablespace when it is created by either the CREATE DATABASE or CREATE UNDO TABLESPACE statement or at a later period using the ALTER TABLESPACE statement.

Automatic Tuning of Undo Retention Common Issues


Automatic Tuning of Undo Retention Feature

Oracle 10g/11g and higher version automatically tunes undo retention to reduce the chances of "snapshot too old" errors during long-running queries.  In the event of any undo space constraints, the system will prioritize DML operations over undo retention. In such situations, the low threshold may not be achieved and tuned_undoretention can go below undo_retention. If the undo retention threshold must be guaranteed, even at the expense of DML operations, the RETENTION GUARANTEE clause can be set against the undo tablespace during or after creation.

-- Set/Reset the undo low threshold.
ALTER SYSTEM SET UNDO_RETENTION = 900;

-- Guarantee the minimum threshold is maintained.
ALTER TABLESPACE undotbs RETENTION GUARANTEE;

You can enable the guarantee option for the undo tablespace when it is created by either the CREATE DATABASE or CREATE UNDO TABLESPACE statement or at a later period using the ALTER TABLESPACE statement.

Thus, tuned_undoretention can be less than undo_retention specified in the parameter file.
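To see how the automatically tuned value compares with your setting, you can query V$UNDOSTAT, for example:

SQL> SELECT MAX(tuned_undoretention), MAX(maxquerylen) FROM v$undostat;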

Common Issues
 1- Space Related Issues/ Undo Tablespace is Full
Many UNEXPIRED undo extents can be seen when selecting from the dba_undo_extents view, although no ORA-1555 or ORA-30036 errors are reported.

a. Expected behavior/ Concepts Misunderstanding:

When the UNDO tablespace is created with NO AUTOEXTEND, the following allocation algorithm is followed:

1. If the current extent has more free blocks then the next free block is allocated.
2. Otherwise, if the next extent expired then wrap in the next extent and return the first block.
3. If the next extent is not expired then get space from the UNDO tablespace. If a free extent is available then allocate it to the undo segment and return the first block in the new extent.
4. If there is no free extent available, then steal expired extents from offline undo segments. De-allocate the expired extent from the offline undo segment and add it to the undo segment. Return the first free block of the extent.
5. If no expired extents are available in offline undo segments, then steal from online undo segments and add the new extents to the current undo segment.  Return the first free block of the extent.
6. Extend the file in the UNDO tablespace. If the file can be extended then add an extent to the current undo segment and then return the block.
7. Tune down retention in decrements of 10% and steal extents that were unexpired, but now expired with respect to the lower retention value.
8. Steal unexpired extents from any offline undo segments.
9. Try to reuse unexpired extents from its own undo segment. If all extents are currently busy (they contain uncommitted information) go to step 10. Otherwise, wrap into the next extent.
10. Try to steal unexpired extents from any online undo segment.
11. If all the above fails then return ORA-30036 unable to extend segment by %s in undo tablespace '%s'

For a fixed-size UNDO tablespace (NO AUTOEXTEND), starting with 10.2, Oracle provides the maximum retention possible given the fixed undo space, set to a value based on the UNDO tablespace size.
This means that even if undo_retention is set to a number of seconds (900 by default), the fixed UNDO tablespace may support a much bigger retention interval (e.g. 36 hours) based on the tablespace size, which makes the undo extents stay UNEXPIRED. This does not indicate, however, that there will be no undo extents available when a transaction runs in the database, as the UNEXPIRED undo extents will be reused.

Here is a small test case for making it clearer:
Before starting any transaction in the database:

SQL> select count(status) from dba_undo_extents where status = 'UNEXPIRED';
COUNT(STATUS)
-------------
463

SQL> select count(status) from dba_undo_extents where status = 'EXPIRED';
COUNT(STATUS)
-------------
20

SQL> select count(status) from dba_undo_extents where status = 'ACTIVE';
COUNT(STATUS)
-------------
21

Space available reported by dba_free_space:
SUM(BYTES)/(1024*1024) TABLESPACE_NAME
---------------------- ---------------------
      3                  UNDOTBS1
      58.4375            SYSAUX
      3                  USERS3
      4.3125             SYSTEM
      103.9375           USERS04

When the transactions run:
SUM(BYTES)/(1024*1024) TABLESPACE_NAME
---------------------- ----------------
      58.25              SYSAUX
      98                 USERS3
      4.3125             SYSTEM
      87.9375            USERS04
 

b. Wrong calculation of the tuned undo retention value
For DB versions lower than 10.2.0.4

This is caused by Bug 5387030 (this bug only affects DB versions lower than 10.2.0.4).

To investigate the issue, check the following:

1- Undo automatically managed by the database

SQL> show parameter undo_

2- Type of undo tablespace (fixed, auto extensible)

SQL> SELECT autoextensible
     FROM dba_data_files
     WHERE tablespace_name='<UNDO_TABLESPACE_NAME>'
 This returns "NO" for all the undo tablespace datafiles.

3- The undo tablespace is already sized such that it always has more than enough space to store all the undo generated within the undo_retention time, and the in-use undo space never exceeds the undo tablespace warning alert threshold (see below for the query to show the thresholds).


4- The tablespace threshold alerts recommend that the DBA add more space to the undo tablespace:

SQL> SELECT creation_time, metric_value, message_type, reason, suggested_action
     FROM dba_outstanding_alerts
     WHERE object_name='<UNDO_TABLESPACE_NAME>';
This returns a suggested action of: "Add space to the tablespace".
Or,
This recommendation has been reported in the past but the condition has now cleared:

SQL> SELECT creation_time, metric_value, message_type, reason, suggested_action, resolution
     FROM dba_alert_history
     WHERE object_name='<UNDO_TABLESPACE_NAME>';
 
5- The undo tablespace in-use space exceeded the warning alert threshold at some point in time. To see the warning alert percentage threshold, issue:

SQL> SELECT object_type, object_name, warning_value, critical_value
    FROM dba_thresholds
    WHERE object_type='TABLESPACE';
 To see the (current) undo tablespace percent of space in use:

SQL> SELECT
             ((SELECT (NVL(SUM(bytes),0))
               FROM dba_undo_extents
               WHERE tablespace_name='<UNDO_TABLESPACE_NAME>'
               AND status IN ('ACTIVE','UNEXPIRED')) * 100)
             /
             (SELECT SUM(bytes)
              FROM dba_data_files
              WHERE tablespace_name='<UNDO_TABLESPACE_NAME>')
             "PCT_INUSE"
         FROM dual;
To solve the issue, you can either apply patch:5387030 (This bug only affects db version lower than 10.2.0.4) OR use any of the following workarounds:

1- Set the AUTOEXTEND and MAXSIZE attributes of each datafile of the undo tablespace in such a way that they are autoextensible and the MAXSIZE is equal to the current size (so the undo tablespace now has the AUTOEXTEND attribute but does not autoextend):

SQL> ALTER DATABASE DATAFILE '<datafile_flename>' AUTOEXTEND ON MAXSIZE <current_size>
2- Set the following instance parameter (Contact Oracle Support before setting it):

_undo_autotune = false
With this setting, V$UNDOSTAT (and therefore V$UNDOSTAT.TUNED_UNDORETENTION) is not maintained and the undo retention used is based on the UNDO_RETENTION instance parameter. This means that you will lose all the advantages of automatic undo management, so it is not an ideal long-term fix.
 

Even with the patch fix installed, the autotuned retention can still grow under certain circumstances. The fix attempts to throttle back how aggressive that autotuning will be. Options 2 and 3 may be needed to get around this aggressive growth in some environments.
For DB version 12.2 or higher and has Local Undo Mode enabled

It is probably caused by bug 27543971, check note 27543971.8 for details

2- Undo Remains Unexpired and TUNED_UNDORETENTION is high
This matches Bug 9650380 where WRH$_UNDOSTAT table shows that TUNED_UNDORETENTION remains high based on previous workload after restarting the instance.
Bug 9650380 is closed as duplicate of Bug 9681444. It is caused by heavy undo generation right before an instance shutdown influencing the calculation of the tuned undo retention after instance startup.


Possible workarounds for this issue are:


1) disable automatic tuning of undo by setting _undo_autotune=false (Contact Support before setting this parameter)

This option requires manual tuning of the UNDO_RETENTION instance parameter to avoid ORA-1555 errors.

2) turn on autoextensibility of the undo tablespace datafiles and set the MAXSIZE to the actual size of the all the datafiles of the undo tablespace, or set _smu_debug_mode=33554432.

This option allows tuned undo retention, but bases the calculation on MAXQUERYLEN instead of the free space in the undo tablespace. This option also requires a sensible value for UNDO_RETENTION being set.

WARNING: do NOT set MAXSIZE > actual size! Otherwise the algorithm to calculate the tuned undo will again work based on the free space.

3) allocate more space to the undo tablespace and/or reduce the threshold level used for computation of the tuned undo retention value.

You can control the extra space available for undo growth using the tablespace warning threshold:

begin
DBMS_SERVER_ALERT.SET_THRESHOLD(
metrics_id => dbms_server_alert.tablespace_pct_full,
warning_operator => dbms_server_alert.operator_ge,
warning_value => '50', /* <<<<<<<<<<<<<<<<<<<<<<<< */
critical_operator => dbms_server_alert.operator_ge,
critical_value => '90',
observation_period => 1,
consecutive_occurrences => 1,
instance_name => NULL,
object_type => dbms_server_alert.object_type_tablespace,
object_name => 'UNDOTBS1');
end;
/
The above example provides more headroom for undo by setting the warning threshold to 50% of the tablespace size, thereby letting the tuned undo retention algorithm use 50% - 10% = 40% of the tablespace
size for its calculations. By default the algorithm uses 70% of the undo tablespace in the tuned undo retention calculation, leaving 30% headroom.

4) set the _first_spare_parameter (depending on version, check note 742035.1 for more details) or _highthreshold_undoretention (depending on version, check note 742035.1 for more details) instance parameter to a value limiting the tuned undo retention value. See Note 742035.1 for more details.

If this value is set too high, you still will encounter problems with tuned undo retention. If this value is set too low, ORA-1555 errors are bound to happen.

In addition, monitor the ACTIVEBLKS + UNEXPIREDBLKS values in V$UNDOSTAT to be sure that these block types do not take up too large a portion of the undo tablespace. Otherwise stealing of undo blocks will occur, which might again result in ORA-1555 errors. If these blocks take up too high a percentage of the blocks of the undo tablespace, consider adding more space to the tablespace, or tune down the undo retention.

When available, download Patch 9681444 to resolve this issue.

Further Diagnostics
If none of the above addressed your issue, please feel free to log an SR with Oracle Support providing the following information:

1- Alert.log file

2- output of script in Doc 1579035.1.

3- Trace Files generated at the time of the issue


How to Determine the Value Of UNDO_RETENTION Parameter to Avoid ORA-1555

SYMPTOMS
The objective of this note is to explain how to set the UNDO_RETENTION parameter and to clarify how ORA-1555 errors can be generated by a wrong setting of the UNDO_RETENTION value.


CAUSE
undo_retention sizing

SOLUTION
You need to tune the UNDO_RETENTION parameter, increasing it to an optimum value.
The value for this parameter is specified in seconds.
This is important for systems running long queries.

That could be tuned by checking the maxquerylen from v$undostat;

The UNDO_RETENTION value should at least be equal to the length of longest running query on a
given database instance.

This can be determined by querying V$UNDOSTAT view once the database has been running for a while.

SQL> select max(maxquerylen) from v$undostat;

This needs to be captured when the system has been running for a while and is fully used.

The following two columns are enough to check whether you are hitting out-of-space and/or ORA-1555 errors:

SSOLDERRCNT - The number of ORA-1555 errors that occurred during the interval
NOSPACEERRCNT - The number of Out-of-Space errors

Note 262066.1 (How To Size UNDO Tablespace For Automatic Undo Management)
explains how to size the undo tablespace correctly to guarantee undo retention.

When this option (retention guarantee) is enabled, the database never overwrites unexpired undo data, that is, undo data
whose age is less than the undo retention period.

The storage and space used for undo is then a direct consequence of your undo_retention configuration.

The recommended value for undo_retention is the length of the longest-running query on a given
database instance.

A message in the trace file like "Query Duration=5095" means that the query had been running for 5095 seconds when the error occurred.

Note that the UNDO_RETENTION parameter works best if the current undo tablespace has enough space for the active transactions.
If an active transaction needs undo space and the undo tablespace does not have any free space,
then the system will start reusing undo space that would have been retained.
This may cause long queries to fail.
Be sure to allocate enough space in the undo tablespace to satisfy the space requirement for the current setting of this parameter.
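To relate retention to tablespace size, Note 262066.1 works from the undo generation rate: the undo space needed is roughly UNDO_RETENTION multiplied by the peak undo blocks generated per second multiplied by the database block size. The following query is a sketch of that approach:

SQL> SELECT (UR * (UPS * DBS)) / (1024*1024) AS "Undo space needed (MB)"
     FROM (SELECT value AS UR FROM v$parameter WHERE name = 'undo_retention'),
          (SELECT MAX(undoblks/((end_time - begin_time)*86400)) AS UPS FROM v$undostat),
          (SELECT value AS DBS FROM v$parameter WHERE name = 'db_block_size');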





How to recover Table data Using the Flashback Table Feature

PURPOSE
-------

The purpose of this document is to show how to restore table data that was accidentally deleted or modified.

 
Recovering Tables Using the Flashback Table Feature:
-----------------------------------------------------
The FLASHBACK TABLE statement enables users to recover a table to a previous
point in time. It provides a fast, online solution for recovering a table that has been
accidentally modified or deleted by a user or application. 
Flashback Table is substantially faster than other recovery mechanisms that
can be used in this situation, such as point-in-time recovery, and does not lead to any loss of recent transactions or downtime.

Restores all data in a specified table to a previous point in time described by a
timestamp or SCN. An exclusive DML lock is held on a table while it is being
restored.

Performs the restore operation online.

Note: You must be using automatic undo management to use the
flashback table feature. It is based on undo information stored in an
undo tablespace.

Automatically restores all of the table attributes, such as indexes, triggers, and
the like, that are necessary for an application to function with the flashed-back
table.

Maintains any remote state in a distributed environment. For example, all of the
table modifications required by replication if a replicated table is flashed back.

Maintains data integrity as specified by constraints. Tables are flashed back
provided none of the table constraints are violated. This includes any referential
integrity constraints specified between a table included in the FLASHBACK
TABLE statement and another table that is not included in the FLASHBACK
TABLE statement.

Even after a flashback, the data in the original table is not lost. You can later
revert to the original state.

To use the FLASHBACK TABLE statement you must have been granted the
FLASHBACK ANY TABLE system privilege or you must have the FLASHBACK object
privilege on the table. Additionally, you must have SELECT, INSERT, DELETE, and
UPDATE privileges on the table. The table that you are performing the flashback
operation on must have row movement enabled.

Example:

SQL>alter tablespace UNDOTBS1 retention guarantee;

SQL>select tablespace_name,retention from dba_tablespaces;

TABLESPACE_NAME                RETENTION
------------------------------ -----------
SYSTEM                         NOT APPLY
UNDOTBS1                       GUARANTEE
SYSAUX                         NOT APPLY
TEMP                           NOT APPLY
EXAMPLE                        NOT APPLY
USERS                          NOT APPLY
HISTORY                        NOT APPLY

7 rows selected.

SQL> ALTER TABLE flash_test_table enable row movement;

Table altered.


SQL> select * from flash_test_table;

     EMPNO EMPNAME
---------- ------------------------------
         1 Kiran
         2 Scott
         3 Tiger
         4 Jeff

SQL> select current_scn from v$database;

CURRENT_SCN
----------------
332348



SQL> connect scott/tiger
Connected.
SQL> insert into flash_test_table values(5,'Jane');

1 row created.

SQL> insert into flash_test_table values(6,'John');

1 row created.

SQL> commit;

Commit complete.

SQL> connect / as sysdba
Connected.
SQL> select current_scn from v$database;

CURRENT_SCN
----------------
332376

SQL> connect scott/tiger
Connected.

SQL> select * from flash_test_table;

     EMPNO EMPNAME
---------- ------------------------------
         1 Kiran
         2 Scott
         3 Tiger
         4 Jeff
         5 Jane
         6 John

6 rows selected.

SQL> flashback table flash_test_table to scn 332348;

Flashback complete.

SQL> select * from flash_test_table;

     EMPNO EMPNAME
---------- ------------------------------
         1 Kiran
         2 Scott
         3 Tiger
         4 Jeff

SQL> flashback table flash_test_table to scn 332376;

Flashback complete.

SQL> select * from flash_test_table;

     EMPNO EMPNAME
---------- ------------------------------
         1 Kiran
         2 Scott
         3 Tiger
         4 Jeff
         5 Jane
         6 John

6 rows selected.



Additional comment:
------------------------
Adding an example of using flashback table with a timestamp (TO_TIMESTAMP):
.
SQL> flashback table xxx to timestamp to_timestamp('2012-09-01 11:00:00', 'YYYY-MM-DD HH24:MI:SS') ;

--

Note: When Using Dataguard

When performing a flashback table, you only need to create a guaranteed restore point on the primary. You can then flash the table back
to the restore point on the primary database, and the data changes will automatically be applied on the physical and logical standby databases.
You do not have to create a guaranteed restore point on the standby, and no action is required on the standby while performing the flashback
table.

Defragmenting Objects with Alter Shrink Method

Solution
------

Pre-requisites

With respect to indexes, one might assume that an online rebuild is always more efficient than defragmentation; this is not necessarily the case.

Rebuilding indexes online has been available since Oracle 9i, but only in Enterprise Edition.
To eliminate or reduce fragmentation, you can rebuild or coalesce/shrink the index. But before you perform either task weigh the costs and benefits of each option and choose the one that works best for your situation. 
Following table is a comparison of the costs and benefits associated with rebuilding and coalescing indexes.


Rebuild index:
- Quickly moves the index to another tablespace
- Higher costs: requires more disk space
- Creates a new tree, shrinks height if applicable
- Enables you to quickly change storage and tablespace parameters without having to drop the original index

Coalesce index:
- Cannot move the index to another tablespace
- Lower costs: does not require more disk space
- Coalesces leaf blocks within the same branch of the tree
- Quickly frees up index leaf blocks for use


Procedure
Prerequisites:  What privileges are needed?

In the scope of this document, ALTER TABLE SHRINK and CREATE JOB will be executed.

To alter the table, the table must be in your own schema, or you must have ALTER object privilege on the table, or you must have ALTER ANY TABLE system privilege.
To run a DBMS Scheduler job, you must have one of the following:
CREATE JOB privilege: This privilege enables you to create jobs, chains, schedules, and programs in your own schema. You will always be able to alter and drop jobs, schedules and programs in your own schema, even if you do not have the CREATE JOB privilege. In this case, the job must have been created in your schema by another user with the CREATE ANY JOB privilege.
CREATE ANY JOB privilege: This privilege enables you to create, alter, and drop jobs, chains, schedules, and programs in any schema except SYS. This privilege is very powerful and should be used with care because it allows the grantee to execute code as any other user.
For reference, the commands in this document are executed as SYSDBA.

Check tablespace usage
Use the following SQL command to check tablespaces usage:

 SELECT /*+ RULE */ T.TABLESPACE_NAME||' '||
        (100 - ROUND((((T.TOT_AVAIL - NVL(F.TOT_USED,0))*100)/TOT_AVAIL),0))
 FROM (SELECT /*+ RULE */ TABLESPACE_NAME, SUM(BYTES) TOT_USED
       FROM SYS.DBA_EXTENTS GROUP BY TABLESPACE_NAME) F,
      (SELECT /*+ RULE */ TABLESPACE_NAME, COUNT(1) FILE_COUNT,
              SUM(DECODE(maxbytes,0,BYTES,maxbytes)) TOT_AVAIL
       FROM SYS.DBA_DATA_FILES
       GROUP BY TABLESPACE_NAME
       UNION
       SELECT /*+ RULE */ TABLESPACE_NAME,COUNT(1) FILE_COUNT,
              SUM(DECODE(maxbytes,0,BYTES,maxbytes)) TOT_AVAIL
       FROM SYS.DBA_TEMP_FILES
       GROUP BY TABLESPACE_NAME) T,
       SYS.DBA_TABLESPACES D
 WHERE T.TABLESPACE_NAME = F.TABLESPACE_NAME(+)
 AND D.TABLESPACE_NAME = T.TABLESPACE_NAME
 ORDER BY ROUND((((T.TOT_AVAIL - NVL(F.TOT_USED,0))*100)/TOT_AVAIL),0);

On SMS and VWSs, we are interested in the following tablespaces:

On SMS:

TABLESPACE_NAME                PERCENT USED
------------------------------ ------------
CCS_VOUCHERS                             93
CCS_VOUCHERS_I                           66

On VWS:

TABLESPACE_NAME                PERCENT USED
------------------------------ ------------
CCS_VOUCHERS                             89
CCS_VOUCHERS_I                           82
BE_VOUCHERS                              80
BE_VOUCHERS_I                            56
 

Find segments to defragment

SQL> set lines 300
SQL> select segment_name, sum(bytes) from dba_segments where tablespace_name='<tablespace_name_from_above>' group by segment_name;

On SMS and VWSs, we are interested in the following segments:

On SMS:

SEGMENT_NAME                      extents
------------------------------ ----------
CCS_VOUCHER_REFERENCE_PK               20
CCS_VOUCHER_REFERENCE_IX_01            23
CCS_VOUCHER_REFERENCE_IXR              25
CCS_VOUCHER_REFERENCE_UQ               37

On VWS:

SEGMENT_NAME                      extents
------------------------------ ----------
CCS_VOUCHER_REFERENCE_PK               20
CCS_VOUCHER_REFERENCE_UQ               46
BE_VOUCHER_PK                        1128
 

Finding Candidates for Shrinking


Why use shrink instead of coalesce?

Coalesce is designed specifically to reduce fragmentation within an index but not to deallocate any freed up blocks which are placed on the freelist and recycled by subsequent block splits.
Shrink is designed specifically to reduce the overall size of an index segment, resetting the High Water Mark (HWM) and releasing any excess storage as necessary.

The key difference being that Shrink must reorganise the index leaf blocks in such a way that all the freed up, now empty blocks are all grouped together at one end of the index segment. All these blocks can then be deallocated and removed from the index segment. This means that specific leaf block entries must be removed from these specific blocks, in order to free up the leaf blocks in this manner.

Basically, we are interested in decreasing the tablespace usage (the customer will want that in particular, to get rid of monitoring system alarms saying the tablespace is full), therefore we will use shrink.
Before performing an online shrink, you may want to find out the biggest bang-for-the-buck by identifying the segments that can be most fully compressed. Simply use the built-in function verify_shrink_candidate in the package dbms_space.
Execute this PL/SQL code for each segment mentioned above to test if the segment can be shrunk to a desired value.

In the following example, the segment "BE_VOUCHER_PK" is used with the desired value 1GB:
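Remember to enable server output first so that the result line is displayed:

SQL> set serveroutput on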


declare
   x char(1);
begin
   if (dbms_space.verify_shrink_candidate
         ('E2BE_ADMIN','BE_VOUCHER_PK','INDEX',1073741824)  -- Shrink to 1GB here
   ) then
       x := 'T';
   else
       x := 'F';
   end if;
   dbms_output.put_line(' Result: '||x);
end;
/
 

Shrink!
After having found candidates, the desired objects can be shrunk (table or index).

Some best-practices:

From experience, a 60 million row table (from which 30 million vouchers were deleted, i.e. the table is heavily fragmented) can be shrunk in about 7 hours.
It is therefore recommended to run the shrink in a background Oracle DBMS Scheduler job.

Shrinking is preferred to be executed in two steps:
alter table <table> shrink space compact;
alter table <table> shrink space;
The first will take a long time and will not change the HWM. The second will be faster (since the object is already shrunk) and will move the HWM.

The following example is the procedure for the table CCS_VOUCHER_REFERENCE with the "shrink space compact" method:

SQL> alter table CCS_ADMIN.CCS_VOUCHER_REFERENCE  enable row movement;

SQL>
begin
DBMS_SCHEDULER.create_job(
job_name => 'NEW_SHRINK_CCS_VOUCHER',
job_type     => 'PLSQL_BLOCK',
job_action   => 'BEGIN EXECUTE IMMEDIATE ''alter table CCS_ADMIN.CCS_VOUCHER_REFERENCE shrink space compact''; END;',
start_date   => to_date('20110511000000', 'YYYYMMDDHH24MISS'),
auto_drop    => FALSE,
enabled      => TRUE,
Comments => 'ccs_voucher_reference defragmentation');
commit;
end;
/
You can kick off the job manually:

SQL> exec DBMS_SCHEDULER.run_job('NEW_SHRINK_CCS_VOUCHER');
If you no longer need the job:

SQL> exec DBMS_SCHEDULER.drop_job('NEW_SHRINK_CCS_VOUCHER');
To show the job's running history:


col LOG_DATE format a20
col JOB_NAME format a20
col STATUS format a15
col REQ_START_DATE format a20
col ACTUAL_START_DATE format a20
col RUN_DURATION format a20

select log_date
,      job_name
,      status
,      req_start_date
,      actual_start_date
,      run_duration
from   dba_scheduler_job_run_details
where
job_name = 'NEW_SHRINK_CCS_VOUCHER';
To check out the job running schedule:

col START_DATE format a20
col END_DATE format a20
col LAST_START_DATE format a20
select JOB_NAME,START_DATE,END_DATE, LAST_START_DATE from dba_scheduler_jobs;
At this stage the shrink compact has been performed but the HWM is not yet reset. Recreate the job with the following change in the procedure:

job_action   => 'BEGIN EXECUTE IMMEDIATE ''alter table CCS_ADMIN.CCS_VOUCHER_REFERENCE shrink space compact''; END;',

to

job_action   => 'BEGIN EXECUTE IMMEDIATE ''alter table CCS_ADMIN.CCS_VOUCHER_REFERENCE shrink space''; END;',
Finally, execute the same procedure for all objects that need to be defragmented.


Troubleshooting

Shrinking is an expensive operation, especially when the table is big, because the compaction phase of a segment shrink is performed as insert/delete pairs.
That means every row involves an insert and a delete, so it is no surprise to see such large
redo and undo volumes.

Therefore you could face the following issue while shrinking:

IMP-00058: ORACLE error 30036 encountered
ORA-30036: unable to extend segment by 8 in undo tablespace 'UNDO'
There are 2 options to avoid it:

Compare the table size with the UNDO tablespace size. If the table requires more than the UNDO size, then we will need to increase the UNDO tablespace. Another tip is to check whether RETENTION GUARANTEE is off.
Example:
SQL> select RETENTION from dba_tablespaces where tablespace_name = 'UNDOTBS1';

RETENTION
-----------
NOGUARANTEE
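If it is set to GUARANTEE, it can be switched off for the duration of the shrink, assuming UNDOTBS1 is the undo tablespace as above (switch it back afterwards if it was originally guaranteed):

SQL> alter tablespace UNDOTBS1 retention noguarantee;

Tablespace altered.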
Logging can be temporarily disabled for this table and its tablespace by issuing the following commands:
SQL> ALTER TABLE TABLE_NAME NOLOGGING;
SQL> ALTER TABLESPACE TABLESPACE_NAME NOLOGGING;
Please remember to set it back to LOGGING after shrinking is performed.
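For example, to revert once the shrink has finished:

SQL> ALTER TABLE TABLE_NAME LOGGING;
SQL> ALTER TABLESPACE TABLESPACE_NAME LOGGING;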

Also consider taking a backup; otherwise you will not be able to recover the tablespace, because no redo will be logged for these operations. Taking a database backup is out of scope of this document.




What is incremental checkpointing? Checkpoint & SCN

A checkpoint is a data structure that indicates the "checkpoint position", determined by the oldest dirty buffer in the database buffer cache. In terms of Oracle's clock, this position is the SCN in the redo stream where instance recovery must begin. The checkpoint position acts as a pointer into the redo stream and is stored in the control file and in each data file header. Whenever we say a checkpoint happened, we mean the writing of modified database buffers in the database buffer cache to disk. A successful checkpoint guarantees that all database changes up to the checkpoint SCN have been recorded in the datafiles, and the SCNs recorded in the file headers guarantee that all changes made to database blocks prior to that SCN are already written to disk. As a result, only those changes made after the checkpoint need to be applied during recovery.
Checkpoints are triggered by the following conditions:
§  Every 3 seconds (incremental checkpoint)
§  When a log switch happens
§  On instance shutdown normal/transactional/immediate
§  On ALTER TABLESPACE [OFFLINE NORMAL | READ ONLY | BEGIN BACKUP]
§  Internal checkpoints forced by recovery-related parameters, i.e. FAST_START_MTTR_TARGET etc.

Purpose of Checkpoints

Oracle Database uses checkpoints to achieve the following goals:
§  Reduce the time required for recovery in case of an instance or media failure
§  Ensure that dirty buffers in the buffer cache are written to disk regularly
§  Ensure that all committed data is written to disk during a consistent shutdown

When Oracle Database Initiates Checkpoints

The checkpoint process (CKPT) is responsible for writing checkpoints to the data file headers and control file. Implementing a full checkpoint every time would be a costly operation and a major bottleneck for concurrency, so Oracle uses different types of checkpoints for different purposes:

§  Full checkpoint: Writes block images to the database for all dirty buffers from all instances. The control file and datafile headers are updated during this checkpoint. Up to Oracle 8 a log switch also caused a full checkpoint; this changed from 8i onwards for performance reasons. Occurs in the following situations:
§  Alter system checkpoint global
§  Alter database begin backup
§  Alter database close
§  Shutdown Immediate/Transactional
§  Thread checkpoints: The database writes to disk all buffers modified by redo in a specific thread before a certain target. The set of thread checkpoints on all instances in a database is a database checkpoint. The control file and datafile headers are updated during this checkpoint. Occurs in the following situations:
§  Consistent database shutdown
§  Alter system checkpoint local
§  Online redo log switch
§  Tablespace and Datafile Checkpoint: Writes block images to the database for all dirty buffers for all files of a tablespace from all instances. Controlfile and datafile headers are updated during this checkpoint. Occurs in following situations
§  Alter tablespace … offline
§  Alter tablespace … begin backup
§  Alter tablespace … read only
§  Alter database datafile resize ( while shrinking a data file)
§  Parallel Query Checkpoint: Writes block images to the database for all dirty buffers belonging to objects accessed by the query from all instances. It’s mandatory to maintain consistency. Occurs in following situations
§  Parallel Query
§  Parallel Query component of PDML or PDDL.
§  Incremental checkpoints: An incremental checkpoint is a type of thread checkpoint partly intended to avoid writing large numbers of blocks at online redo log switches. DBWn checks at least every 3 seconds to determine whether it has work to do. When DBWn writes dirty buffers, it advances the checkpoint position, causing CKPT to write the checkpoint position to the control file, but not to the data file headers.
§  Object Checkpoint: Writes block images to the database for all dirty buffers belonging to an object from all instances. Occurs in following situations
§  Drop table
§  Drop table … purge
§  Truncate table
§  Drop Index
§  Log Switch Checkpoint: Writes the contents of “some” dirty buffers to the database. Controlfile and datafile headers are updated with checkpoint_change#.
§  Instance Recovery Checkpoint: Writes recovered block back to datafiles. Trigger as soon as SMON is done with instance recovery.
§  RBR Checkpoint: It’s actually Reuse Block Range checkpoint, usually appears post index rebuild operations.
§  Multiple Object Checkpoint: Triggered whenever a single operation causes checkpoints on multiple objects i.e. dropping partitioned table or index.
Whenever anything happened in database, Oracle has a SCN number which has to update into various places. We can classify SCN into following major categories:
§  System (checkpoint) SCN: After a checkpoint completes, Oracle stores the system checkpoint SCN in the control file. We can check that in checkpoint_change# of v$database view.
SQL> select checkpoint_change# from v$database;

CHECKPOINT_CHANGE#
------------------
1677903

SQL> alter system checkpoint;

System altered.

SQL> select checkpoint_change# from v$database;

CHECKPOINT_CHANGE#
------------------
1679716
§  DataFile (checkpoint) SCN: After a checkpoint completes, Oracle stores the SCN individually in the control file for each datafile. The following SQL shows the datafile checkpoint SCN for a datafile in the control file:
SQL> select name,checkpoint_change# from v$datafile where name like '%system01%';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                1679716
§  Partial (checkpoint) SCN: Operational, non-full checkpoints for a subset of the system (i.e. a tablespace, a datafile, etc.) set the checkpoint only for the affected entities.
SQL> select name,checkpoint_change# from v$datafile_header where name like '%01.dbf';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                 1685610
/u02/app/oracle/oradata/mask11g/sysaux01.dbf                 1685610
/u02/app/oracle/oradata/mask11g/undotbs01.dbf                1685610
/u02/app/oracle/oradata/mask11g/users01.dbf                  1685610

SQL> alter tablespace users read only;

Tablespace altered.

SQL> select name,checkpoint_change# from v$datafile_header where name like '%01.dbf';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                  1685610
/u02/app/oracle/oradata/mask11g/sysaux01.dbf                  1685610
/u02/app/oracle/oradata/mask11g/undotbs01.dbf                 1685610
/u02/app/oracle/oradata/mask11g/users01.dbf                   1685618

SQL> alter tablespace users read write;

Tablespace altered.

SQL> select name,checkpoint_change# from v$datafile_header where name like '%01.dbf';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                  1685610
/u02/app/oracle/oradata/mask11g/sysaux01.dbf                  1685610
/u02/app/oracle/oradata/mask11g/undotbs01.dbf                 1685610
/u02/app/oracle/oradata/mask11g/users01.dbf                   1685642
§  Start (Checkpoint) SCN: Oracle stores the checkpoint SCN value in the header of each datafile. This is referred to as the start SCN because it is used at instance startup time to check if recovery is required. The following SQL shows the checkpoint SCN in the datafile header for a single datafile.
SQL> select name,checkpoint_change# from v$datafile_header where name like '%system01%';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                  1657172
§  End (checkpoint) SCN: The stop SCN or Termination is held in the control file for each datafile. The following SQL shows the stop SCN for a single datafile when the database is open for normal use.
SQL> select distinct LAST_CHANGE# from v$datafile;

LAST_CHANGE#
------------

SQL> alter database close;

Database altered.

SQL> select distinct LAST_CHANGE# from v$datafile;

LAST_CHANGE#
------------
2125206

SQL> select distinct CHECKPOINT_CHANGE# from v$datafile_header ;

CHECKPOINT_CHANGE#
------------------
2125206
During normal database operation, the stop SCN is NULL for all datafiles that are online in read-write mode.
SCN values while the database is up: following a checkpoint while the database is up and open for use, the system checkpoint SCN in the control file, the datafile checkpoint SCN in the control file, and the start SCN in each datafile header all match. The stop SCN for each datafile in the control file is NULL.
SCN values after a clean shutdown: after a clean database shutdown resulting from a SHUTDOWN IMMEDIATE or SHUTDOWN NORMAL, followed by STARTUP MOUNT, the previous queries on v$database and v$datafile return matching SCN values. During a clean shutdown, a checkpoint is performed and the stop SCN for each datafile is set to the start SCN from the datafile header. Upon startup, Oracle compares the start SCN in the file header with the datafile checkpoint SCN in the control file. If they match, Oracle then compares the start SCN in the datafile header with the datafile stop SCN in the control file. If they also match, the database can be opened because all block changes have been applied, no changes were lost on shutdown, and therefore no recovery is required on startup. After the database is opened, the datafile stop SCN in the control file is set back to NULL to indicate that the datafile is open for normal use.


 An incremental checkpoint

An incremental checkpoint is a type of thread checkpoint partly intended to avoid writing large numbers of blocks at online redo log switches. DBWn checks at least every three seconds to determine whether it has work to do. When DBWn writes dirty buffers, it advances the checkpoint position, causing CKPT to write the checkpoint position to the control file, but not to the data file headers.
During instance recovery, the database must apply the changes that occur between the checkpoint position and the end of the redo thread. Some changes may already have been written to the data files. However, only changes with SCNs lower than the checkpoint position are guaranteed to be on disk."
Can you explain incremental checkpoints in plain English?
Answer: An incremental checkpoint is a bit like flushing several times during a big job instead of waiting until the end: by writing dirty buffers out a little at a time, no single checkpoint ever has to handle everything at once.

The "fast start" recovery (and the fast_start_mttr_target) is directly related to the incremental checkpoint.  By reducing the checkpoint time to be more frequent than a log switch, Oracle will recover and re-start faster in case of an instance crash. 
The docs note that a DBWR writes buffers to disk in advance the checkpoint position, writing the "oldest" blocks first to preserve integrity.
A "checkpoint" is the event that triggers writing of dirty blocks to the disks and a "normal" checkpoint only occurs with every redo log file switch. 
In a nutshell, an "incremental" checkpoint directs the CKPT process to track the "dirty" blocks that need to be written by the DBWR process, thereby advancing the checkpoint SCN recorded in the control file.
The DBWR process wakes up every 3 seconds, looks for dirty blocks to write, and sleeps if it finds none.  This prevents a "burst" of writing when a redo log switches.
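
As a rough illustration, the current MTTR estimates and the amount of redo that would need to be applied after a crash can be checked in V$INSTANCE_RECOVERY (exact columns may vary slightly between versions):

SQL> select target_mttr, estimated_mttr, recovery_estimated_ios
  2  from v$instance_recovery;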



How ASM communicates with the database

A database that stores data on ASM volumes has two additional background processes: RBAL and ASMB. RBAL performs global opens of the disks. ASMB connects to the +ASMn instance to communicate information such as file creation and deletion.
The ASMB process communicates with the CSS daemon on the node and receives file extent map information from the ASM instance. ASMB is also responsible for providing I/O statistics to the ASM instance.
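
On the database side, these background processes can be confirmed from V$BGPROCESS, for example (a minimal sketch):

SQL> select name, description from v$bgprocess where name in ('RBAL','ASMB');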

How a connection is established when DML is run

The user process first communicates with a listener process, which creates a server process in a dedicated environment.
Oracle Database creates server processes to handle the requests of user processes connected to the instance. The user process represents the application or tool that connects to the Oracle database.
Server processes created on behalf of each user’s application can perform one or more of the following:

  • Parse and run SQL statements issued through the application.
  • Read necessary data blocks from data files on disk into the shared database buffers of the SGA (if the blocks are not already present in the SGA).
  • Return results in such a way that the application can process the information.
When a user starts a transaction, for example a DML operation, the old data (the before image) is written from the buffer cache to the undo tablespace and the new change details are recorded in the redo log files.

What is single-instance recovery and RAC instance recovery?

If an instance of an open database fails, either because of a SHUTDOWN ABORT statement or abnormal termination, the following situations can result:

Data blocks committed by a transaction are not written to the data files and appear only in the online redo log. These changes must be reapplied to the database. The data files contain changes that had not been committed when the instance failed. These changes must be rolled back to ensure transactional consistency. Instance recovery uses only the online redo log files and the current online data files to synchronize the data files and ensure that they are consistent.

Understanding Instance Recovery

Automatic instance or crash recovery:
  • Is caused by attempts to open a database whose files are not synchronized on shutdown
  • Uses information stored in redo log groups to synchronize files
  • Involves two distinct operations:
  • Rolling forward: Redo log changes (both committed and uncommitted) are applied to data files.
  • Rolling back: Changes that were made but not committed are returned to their original state.
The Oracle database automatically recovers from instance failure. All that needs to happen is for the instance to be started normally. If Oracle Restart is enabled and configured to monitor this database, then this happens automatically. The instance mounts the control files and then attempts to open the data files. When it discovers that the data files were not synchronized during shutdown, the instance uses the information contained in the redo log groups to roll the data files forward to the time of failure. Then the database is opened and any uncommitted transactions are rolled back.

Phases of Instance Recovery

  1. Startup instance (data files are out of sync)
  2. Roll forward (redo)
  3. Committed and uncommitted data in files
  4. Database opened
  5. Roll back (undo)
  6. Committed data in files

For an instance to open a data file, the system change number (SCN) contained in the data file's header must match the current SCN that is stored in the database's control files.
If the numbers do not match, the instance applies redo data from the online redo logs, sequentially “redoing” transactions until the data files are up to date. After all data files have been synchronized with the control files, the database is opened and users can log in.
When redo logs are applied, all transactions are applied to bring the database up to the state as of the time of failure. This usually includes transactions that are in progress but have not yet been committed. After the database has been opened, those uncommitted transactions are rolled back.
At the end of the rollback phase of instance recovery, the data files contain only committed data.

Tuning Instance Recovery

  • During instance recovery, the transactions between the checkpoint position and the end of the redo log must be applied to data files.
  • You tune instance recovery by controlling the difference between the checkpoint position and the end of the redo log, typically via the FAST_START_MTTR_TARGET parameter (see the sketch below).
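
A minimal sketch of that tuning, assuming an spfile is in use and a 60-second recovery target is wanted; the effect can then be verified in V$INSTANCE_RECOVERY as shown earlier:

SQL> alter system set fast_start_mttr_target = 60 scope=both;

System altered.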

 Why does Oracle recommend 3 voting disks when you have 2 nodes?

When you have 1 voting disk and it goes bad, the cluster stops functioning.
When you have 2 and 1 goes bad, the same happens because the nodes realize they can only write to half of the original disks (1 out of 2), violating the rule that they must be able to write > half (yes, the rule says >, not >=).
When you have 3 and 1 goes bad, the cluster runs fine because the nodes know they can access more than half of the original voting disks (2/3 > half).
When you have 4 and 1 goes bad, the same, because (3/4 > half).
When you have 3 and 2 go bad, the cluster stops because the nodes can only access 1/3 of the voting disks, not > half.
When you have 4 and 2 go bad, the same, because the nodes can only access half, not > half.
So you see 4 voting disks have the same fault tolerance as 3, but you waste 1 disk, without gaining anything. The recommendation for odd number of voting disks helps save a little on hardware requirement.
All the above assume the nodes themselves are fine.
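
The configured voting disks and their state can be listed from any node, for example:

[grid@node1 ~]$ crsctl query css votedisk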

How big tables are loaded into the buffer cache

Before John can explain the new feature, everyone in the office wants to know why a full table scan doesn’t use the buffer cache. When a session connected to the Oracle Database instance selects data from a table, John elaborates, the database server process reads the appropriate data blocks from the disk and puts them into the buffer cache by default. Each block goes into a buffer in the buffer cache. The reason is simple: if another session wants some data from those blocks, it can be served from those cached blocks much faster than being served from disk. The buffer cache is limited and usually smaller than the entire database, so when the cache is full and a new database block comes in, Oracle Database forces old buffers that have not been accessed in a long time out of the cache to make room for the new blocks coming in.
However, John continues, consider the case of a full table scan query that selects all the blocks of the table. If that table is large, its blocks will consume a large portion of the buffer cache, forcing out the blocks of other tables. It’s unlikely that all the blocks of a large table will be accessed regularly, so having those blocks in the cache does not actually help performance. But forcing out the blocks of other tables, especially popular blocks, degrades the overall performance of the applications running against the database. That is why, John explains, Oracle Database does not load the blocks into the buffer cache for full table scans.
How is a connection established and how does SQL run internally?
How does the database communicate with ASM?
How does hot backup work?

How instance recovery happens in single-instance and RAC databases

Instance recovery occurs when an instance goes down abruptly, either via a SHUTDOWN ABORT, a killing of a background process, or a crash of a node or the instance itself. After an ungraceful shutdown, it is necessary for the database to go through the process of rolling forward all information in the redo logs and rolling back any transactions that had not yet been committed. This process is known as instance recovery and is usually automatically performed by the SMON process.
The redo logs for all RAC instances are located either on an OCFS shared disk asset or on a RAW file system that is visible to all the other RAC instances. This allows any other node to recover for a failed RAC node in the event of instance failure.
There are basically two types of failure in a RAC environment: instance and media. Instance failure involves the loss of one or more RAC instances, whether due to node failure or connectivity failure. Media failure involves the loss of one or more of the disk assets used to store the database files themselves.

  1. All nodes available.
  2. One or more RAC instances fail.
  3. Node failure is detected by any one of the remaining instances.
  4. The Global Resource Directory (GRD) is reconfigured and distributed among the surviving nodes.
  5. The instance that first detected the failed instance reads the failed instance's redo logs to determine the logs that are needed for recovery. This task is done by the SMON process of the instance that detected the failure.
  6. Until this time database activity is frozen. SMON issues recovery requests for all the blocks that are needed for recovery. Once all those blocks are available, the other blocks, which are not needed for recovery, become available for normal processing.
  7. Oracle performs a roll-forward operation against the blocks that were modified by the failed instance but were not written to disk, using the transactions recorded in the redo logs.
  8. Once the redo logs are applied, uncommitted transactions are rolled back using the undo tablespace.
  9. The database on the RAC is now fully available.

Or

INSTANCE RECOVERY IN RAC DATABASE

I will discuss how instance recovery takes place in 11g R2 RAC. Instance recovery aims at
– writing all committed changes to the datafiles
– undoing all the uncommitted changes from the datafiles
– Incrementing the checkpoint no. to the SCN till which changes have been written to datafiles.
In a single instance database, before the instance crashes,
– some committed changes are in the redo log files but have not been written to the datafiles
– some uncommitted changes have made their way to datafiles
– some uncommitted changes are in the redo log buffer
After  the instance crashes in a single instance database
– all uncommitted changes in the redo log buffer are wiped out
– Online redo log files are read to identify the blocks that need to be recovered
– Identified blocks are read from the datafiles
– During roll forward phase, all the changes (committed/uncommitted) in redo log files are applied to them
– During rollback phase, all uncommitted changes are rolled back after reading undo from undo tablespace.
– The checkpoint SCN (CKPT#) is incremented in the control file/data file headers.
In a RAC database there can be two scenarios :
– Only one instance crashes
– Multiple instances crash
We will discuss these cases one by one.
Single instance crash in RAC database
In this case the scenario is quite similar to an instance crash in a single-instance database, but there are slight differences as well.
Let us consider a 3-node setup and a data block B1 with one column and 4 records. The column contains the values 100, 200, 300 and 400 in the 4 records. Initially the block is on disk. The following chart summarises the update operations on the block on the various nodes and the corresponding state of the block (CR, PI or XCUR) on each node:

SCN#  Update operation     Node1                          Node2                          Node3                          Disk
1     Node1: 100->101      XCUR (101,200,300,400)         -                              -                              (100,200,300,400)
2     Node2: 200->201      PI at SCN#2 (101,200,300,400)  XCUR (101,201,300,400)         -                              (100,200,300,400)
3     Node3: 300->301      PI at SCN#2                    PI at SCN#3 (101,201,300,400)  XCUR (101,201,301,400)         (100,200,300,400)
4     CRASH (Node2)        PI at SCN#2                    crashed                        XCUR (101,201,301,400)         (100,200,300,400)
It is assumed that no incremental checkpointing has taken place on any of the nodes in the meanwhile.
Before crash status of block on various nodes is as follows:
– PI at SCN# 2 on Node1
– PI at SCN# 3 on Node2
– XCUR on Node3

Redo logs at various nodes are
Node1 : B1: 100 -> 101, SCN# 1
Node2 : B1:200 -> 201, SCN# 2
Node3 : B1:300 -> 301, SCN# 3
After the crash,
– the redo logs of the crashed node (Node2) are analyzed and it is identified that block B1 needs to be recovered.
– It is also identified that the role of the block is global, as different versions of it are available on Node1 and Node3.
– It is identified that there is a PI on Node1 whose SCN# (2) is earlier than the SCN# of the crash (4).
– The changes from the redo logs of Node2 are applied to the PI on Node1 and the block is written to disk.
– The checkpoint # of Node1 is incremented.
– A BWR (block written record) is placed in the redo log of Node1 to indicate that the block has been written to disk and need not be recovered in case Node1 crashes.
Here it can readily be seen that there are certain differences from instance recovery in a single-instance database:
The role of the block is checked.
If the role is local, the block is read from disk and the changes from the redo logs of Node2 are applied, just like in a single-instance database.
If the role is global,
it is checked whether a PI of the block at an SCN# earlier than the SCN# of the crash is available.
If a PI is available, the changes in the redo logs of Node2 are applied to the PI instead of reading the block from disk.
If a PI is not available (it has been flushed to disk due to incremental checkpointing
on the owner node of the PI, or
on any of the nodes at an SCN# later than the PI holder),
the block is read from disk and the changes from the redo logs of Node2 are applied, just like it used to happen in OPS.
Hence, it can be inferred that a PI, if available, speeds up instance recovery because the need to read the block from disk is eliminated. If a PI is
not available, the block is read from disk just like in OPS.
Multiple instance crash in RAC database
Let us consider a 4-node setup and a data block B1 with one column and 4 records. The column contains the values 100, 200, 300 and 400 in the 4 records. Initially the block is on disk. The sequence of events (block states shown as CR, PI or XCUR) can be represented as:

SCN#  Operation               Node1                           Node2                    Node3                    Node4                    Disk
1     Node1: 100->101         XCUR (101,200,300,400)          -                        -                        -                        (100,200,300,400)
2     Node2: 200->201         PI at SCN#2                     XCUR (101,201,300,400)   -                        -                        (100,200,300,400)
3     Node3: 300->301         PI at SCN#2                     PI at SCN#3              XCUR (101,201,301,400)   -                        (100,200,300,400)
4     CKPT on Node2           CR                              CR                       XCUR (101,201,301,400)   -                        (101,201,300,400)
5     Node4: 400->401         CR                              CR                       PI at SCN#5              XCUR (101,201,301,401)   (101,201,300,400)
6     Node1: 401->402         CR + XCUR (101,201,301,402)     CR                       PI at SCN#5              PI at SCN#6              (101,201,300,400)
7     CRASH (Node2, Node3)    CR + XCUR (101,201,301,402)     crashed                  crashed                  PI at SCN#6              (101,201,300,400)
Explanation:
SCN#1 – Node1 reads the block from disk and updates 100 to 101 in a record. It holds the block in XCUR mode.
SCN#2 – Node2 requests the same block for update. Node1 keeps a PI and Node2 holds the block in XCUR mode.
SCN#3 – Node3 requests the same block for update. Node2 keeps a PI and Node3 holds the block in XCUR mode. Now we have two PIs:
– on Node1 at SCN# 2
– on Node2 at SCN# 3
SCN#4 – Local checkpointing takes place on Node2. The PI on this node has SCN# 3.
It is checked whether any of the other nodes has a PI at an earlier SCN# than this. Node1 has a PI at SCN# 2.
The changes in the redo log of Node2 are applied to its PI and it is flushed to disk.
A BWR is placed in the redo log of Node2 to indicate that the block has been written to disk and need not be recovered in case Node2 crashes.
The PI on Node2 is discarded, i.e. its state changes to CR, which cannot be used to serve remote nodes.
The PI on Node1 is discarded, i.e. its state changes to CR, which cannot be used to serve remote nodes.
A BWR is placed in the redo log of Node1 to indicate that the block has been written to disk and need not be recovered in case Node1 crashes.
Now the on-disk version of the block contains the changes of both Node1 and Node2.
SCN#5 – Node4 requests the same block for update. Node3 keeps a PI and Node4 holds the block in XCUR mode. Node1 and Node2 hold CRs.
SCN#6 – Node1 again requests the same block for update. Node4 keeps a PI and Node1 holds the block in XCUR mode. Now Node1 has the same block in both CR and XCUR mode. Node3 has a PI at SCN# 5.
SCN#7 – Node2 and Node3 crash.
It is assumed that no incremental checkpointing has taken place on any of the nodes in the meanwhile.
Before crash status of block on various nodes is as follows:
– CR at SCN# 2 on Node1, XCUR on Node1
– CR at SCN# 3 on Node2
– PI  at SCN# 5 on Node3
– PI at SCN# 6 on Node4
Redo logs at various nodes are
Node1 : B1: 100 -> 101, SCN# 1, BWR for B1 , B1:401->402 at SCN#6
Node2 : B1:200 -> 201, SCN# 2, BWR for B1
Node3 : B1:300 -> 301, SCN# 3
Node4 : B1:400->401 at SCN# 5
After the crash,
– the redo logs of the crashed node Node2 are analyzed, and it is identified that block B1 was flushed to disk as of SCN# 4 and need not be recovered, as no further changes were made to it by Node2.
– No redo log entry from Node2 needs to be applied.
– The redo logs of the crashed node Node3 are analyzed, and it is identified that block B1 needs to be recovered.
– It is also identified that the role of the block is global, as different versions of it were/are available on Node1 (XCUR), Node2 (crashed) and Node4 (PI).
– The changes from Node3 have to be applied. It is checked whether any PI is available that is earlier than the SCN# of the change on Node3 that needs to be applied, i.e. SCN# 3.
– It is identified that no PI is available whose SCN is earlier than that SCN# (3). Hence, the block is read from disk.
– The redo log entry that needs to be applied is: B1: 300 -> 301, SCN# 3.
– The redo is applied to the block read from disk and the block is written to disk, so that the on-disk version also contains the changes made by Node3.
– The checkpoint # of Node2 and Node3 is incremented.
After instance recovery:
Node1 : holds CR and XCUR
Node2 : holds nothing (crashed)
Node3 : holds nothing (crashed)
Node4 : holds PI
The on-disk version of the block is: 101, 201, 301, 400



4) What happens when we put the database in begin backup mode?
5) How does SQL execute and how is the connection to the database made?


How SQL Statements Are Processed in the Oracle Architecture


SQL Statements are processed differently depending on whether the statement is a query, data manipulation language (DML) to update, insert, or delete a row, or data definition language (DDL) to write information to the data dictionary.

Connecting to an instance uses:
- a user process
- a server process
The Oracle server components that are used depend on the type of SQL statement:
- Queries return rows
- DML statements log changes
- COMMIT ensures transaction recovery
Some Oracle server components do not participate in SQL statement processing.

Processing a query:
Parse:
o       Search for identical statement in the Shared SQL Area.
o       Check syntax, object names, and privileges.
o       Lock objects used during parse.
o       Create and store execution plan.
Bind: Obtains values for variables.
Execute: Process statement.
Fetch: Return rows to user process.
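
A small SQL*Plus sketch of these phases, using the classic SCOTT.EMP demo table purely as an example: the SELECT is parsed, the bind variable supplies the value at bind time, execution builds the result, and the fetch returns the rows.

SQL> variable dno number
SQL> exec :dno := 10

PL/SQL procedure successfully completed.

SQL> select ename from emp where deptno = :dno;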

Processing a DML statement:

Parse: Same as the parse phase used for processing a query.
Bind: Same as the bind phase used for processing a query.
Execute:
o       If the data and undo blocks are not already in the Database Buffer Cache, the server process reads them from the datafiles into the Database Buffer Cache.
o       The server process places locks on the rows that are to be modified. The undo block is used to store the before image of the data, so that the DML statements can be rolled back if necessary.
o       The data blocks record the new values of the data.
o       The server process records the before image to the undo block and updates the data block.  Both of these changes are made in the Database Buffer Cache.  Any changed blocks in the Database Buffer Cache are marked as dirty buffers.  That is, buffers that are not the same as the corresponding blocks on the disk.
o       The processing of a DELETE or INSERT command uses similar steps.  The before image for a DELETE contains the column values in the deleted row, and the before image of an INSERT contains the row location information.

Processing a DDL statement:

The execution of DDL (Data Definition Language) statements differs from the execution of DML (Data Manipulation Language) statements and queries, because the success of a DDL statement requires write access to the data dictionary.
For these statements, parsing actually includes parsing, data dictionary lookup, and execution. Transaction management, session management, and system management SQL statements are processed using the parse and execute stages. To re-execute them, simply perform another execute.


Stage 2: Parse the Statement
During parsing, the SQL statement is passed from the user process to Oracle and a parsed representation of the SQL statement is loaded into a shared SQL area. Many errors can be caught during this phase of statement processing. Parsing is the process of
  • translating a SQL statement, verifying it to be a valid statement
  • performing data dictionary lookups to check table and column definitions
  • acquiring parse locks on required objects so that their definitions do not change during the statement’s parsing
  • checking privileges to access referenced schema objects
  • determining the optimal execution plan for the statement
  • loading it into a shared SQL area
  • for distributed statements, routing all or part of the statement to remote nodes that contain referenced data
A SQL statement is parsed only if a shared SQL area for an identical SQL statement does not exist in the shared pool. In this case, a new shared SQL area is allocated and the statement is parsed. For more information about shared SQL, refer to Chapter 10, “Managing SQL and Shared PL/SQL Areas“.
The parse phase includes processing requirements that need to be done only once no matter how many times the statement is executed. Oracle translates each SQL statement only once, re-executing that parsed statement during subsequent references to the statement.
Although the parsing of a SQL statement validates that statement, parsing only identifies errors that can be found before statement execution. Thus, certain errors cannot be caught by parsing. For example, errors in data conversion or errors in data (such as an attempt to enter duplicate values in a primary key) and deadlocks are all errors or situations that can only be encountered and reported during the execution phase.
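Whether a statement was parsed once and then re-executed from the shared SQL area can be observed in V$SQL, for example (the statement text below is only illustrative):

SQL> select sql_text, parse_calls, executions
  2  from v$sql
  3  where sql_text like 'select ename from emp%';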
Module 1 – Oracle Architecture
Objectives
These notes introduce the Oracle server architecture.  The architecture includes physical components, memory components, processes, and logical structures.
Primary Architecture Components
The figure shown above details the Oracle architecture.
Oracle server:  An Oracle server includes an Oracle Instance and an Oracle database.
An Oracle database includes several different types of files:  datafiles, control files, redo log files and archive redo log files.  The Oracle server also accesses parameter files and password files.
This set of files has several purposes.
One is to enable system users to process SQL statements.
Another is to improve system performance.
Still another is to ensure the database can be recovered if there is a software/hardware failure.
The database server must manage large amounts of data in a multi-user environment.
The server must manage concurrent access to the same data.
The server must deliver high performance.  This generally means fast response times.
Oracle instance:  An Oracle Instance consists of two different sets of components:
The first component set is the set of background processes (PMON, SMON, RECO, DBW0, LGWR, CKPT, D000 and others).
These will be covered later in detail – each background process is a computer program.
These processes perform input/output and monitor other Oracle processes to provide good performance and database reliability.
The second component set includes the memory structures that comprise the Oracle instance.
When an instance starts up, a memory structure called the System Global Area (SGA) is allocated.
At this point the background processes also start.
An Oracle Instance provides access to one and only one Oracle database.
Oracle database: An Oracle database consists of files.
Sometimes these are referred to as operating system files, but they are actually database files that store the database information that a firm or organization needs in order to operate.
The redo log files are used to recover the database in the event of application program failures, instance failures and other minor failures.
The archived redo log files are used to recover the database if a disk fails.
Other files not shown in the figure include:
The required parameter file that is used to specify parameters for configuring an Oracle instance when it starts up.
The optional password file authenticates special users of the database – these are termed privileged users and include database administrators.
Alert and Trace Log Files – these files store information about errors and actions taken that affect the configuration of the database.
User and server processes:  The processes shown in the figure are called user and server processes.  These processes are used to manage the execution of SQL statements.
A Shared Server Process can share memory and variable processing for multiple user processes.
A Dedicated Server Process manages memory and variables for a single user process.
Connecting to an Oracle Instance – Creating a Session
System users can connect to an Oracle database through SQLPlus or through an application program like the Internet Developer Suite (the program becomes the system user).  This connection enables users to execute SQL statements.
The act of connecting creates a communication pathway between a user process and an Oracle Server.  As is shown in the figure above, the User Process communicates with the Oracle Server through a Server Process.  The User Process executes on the client computer.  The Server Process executes on the server computer, and actually executes SQL statements submitted by the system user.
The figure shows a one-to-one correspondence between the User and Server Processes.  This is called a Dedicated Server connection.  An alternative configuration is to use a Shared Server where more than one User Process shares a Server Process.
Sessions:  When a user connects to an Oracle server, this is termed a session.  The session starts when the Oracle server validates the user for connection.  The session ends when the user logs out (disconnects) or if the connection terminates abnormally (network failure or client computer failure).
A user can typically have more than one concurrent session, e.g., the user may connect using SQLPlus and also connect using Internet Developer Suite tools at the same time.  The limit of concurrent session connections is controlled by the DBA.
If a system user attempts to connect and the Oracle Server is not running, the system user receives the Oracle Not Available error message.
Physical Structure – Database Files
As was noted above, an Oracle database consists of physical files.  The database itself has:
Datafiles – these contain the organization’s actual data.
Redo log files – these contain a record of changes made to the database, and enable recovery when failures occur.
Control files – these are used to synchronize all database activities and are covered in more detail in a later module.
Other key files as noted above include:
Parameter file – there are two types of parameter files.
The init.ora file (also called the PFILE) is a static parameter file.  It contains parameters that specify how the database instance is to start up.  For example, some parameters will specify how to allocate memory to the various parts of the system global area.
The spfile.ora is a dynamic parameter file.  It also stores parameters to specify how to startup a database; however, its parameters can be modified while the database is running.
Password file – specifies which *special* users are authenticated to startup/shut down an Oracle Instance.
Archived redo log files – these are copies of the redo log files and are necessary for recovery in an online, transaction-processing environment in the event of a disk failure.
Memory Structure
The memory structures include two areas of memory:
System Global Area (SGA) – this is allocated when an Oracle Instance starts up.
Program Global Area (PGA) – this is allocated when a Server Process starts up.
System Global Area
The SGA is an area in memory that stores information shared by all database processes and by all users of the database (sometimes it is called the Shared Global Area).
This information includes both organizational data and control information used by the Oracle Server.
The SGA is allocated in memory and virtual memory.
The size of the SGA can be established by a DBA by assigning a value to the parameter SGA_MAX_SIZE in the parameter file—this is an optional parameter.
The SGA is allocated when an Oracle instance (database) is started up based on values specified in the initialization parameter file (either PFILE or SPFILE).
The SGA has the following mandatory memory structures:
Shared Pool – includes two components:
Library Cache
Data Dictionary Cache
Database Buffer Cache
Redo Log Buffer
Other structures (for example, lock and latch management, statistical data)
Additional optional memory structures in the SGA include:
Large Pool
Java Pool
Streams Pool
The SHOW SGA SQL command will show you the SGA memory allocations.  This is a recent clip of the SGA for the Oracle database at SIUE.  In order to execute SHOW SGA you must be connected with the special privilege SYSDBA (which is only available to user accounts that are members of the DBA Linux group).
SQL> connect / as sysdba
Connected.
SQL> show sga
Total System Global Area 1610612736 bytes
Fixed Size                  2084296 bytes
Variable Size             385876536 bytes
Database Buffers         1207959552 bytes
Redo Buffers               14692352 bytes
Oracle 8i and earlier versions of the Oracle Server used a Static SGA.  This meant that if modifications to memory management were required, the database had to be shutdown, modifications were made to the init.ora parameter file, and then the database had to be restarted.
Oracle 9i and 10g use a Dynamic SGA.   Memory configurations for the system global area can be made without shutting down the database instance.  The advantage is obvious.  This allows the DBA to resize the Database Buffer Cache and Shared Pool dynamically.
Several initialization parameters are set that affect the amount of random access memory dedicated to the SGA of an Oracle Instance.  These are:
SGA_MAX_SIZE:  This optional parameter is used to set a limit on the amount of virtual memory allocated to the SGA – a typical setting might be 1 GB; however, if the value for SGA_MAX_SIZE in the initialization parameter file or server parameter file is less than the sum of the memory allocated for all components, either explicitly in the parameter file or by default, at the time the instance is initialized, then the database ignores the setting for SGA_MAX_SIZE.
DB_CACHE_SIZE:  This optional parameter is used to tune the amount of memory allocated to the Database Buffer Cache in standard database blocks.  Block sizes vary among operating systems.  The DBORCL database uses 8 KB blocks.  The total blocks in the cache defaults to 48 MB on LINUX/UNIX and 52 MB on Windows operating systems.
LOG_BUFFER:   This optional parameter specifies the number of bytes allocated for the Redo Log Buffer.
SHARED_POOL_SIZE:  This optional parameter specifies the number of bytes of memory allocated to shared SQL and PL/SQL.  The default is 16 MB.  If the operating system is based on a 64 bit configuration, then the default size is 64 MB.
LARGE_POOL_SIZE:  This is an optional memory object – the size of the Large Pool defaults to zero.  If the init.ora parameter PARALLEL_AUTOMATIC_TUNING is set to TRUE, then the default size is automatically calculated.
JAVA_POOL_SIZE:   This is another optional memory object.  The default is 24 MB of memory.
The combined size of the SGA components (DB_CACHE_SIZE, LOG_BUFFER, SHARED_POOL_SIZE, LARGE_POOL_SIZE, and JAVA_POOL_SIZE, plus the fixed SGA) cannot exceed the SGA_MAX_SIZE parameter.
Memory is allocated to the SGA as contiguous virtual memory in units termed granules.  Granule size depends on the estimated total size of the SGA, which as was noted above, depends on the SGA_MAX_SIZE parameter.  Granules are sized as follows:
If the SGA is less than 128 MB in total, each granule is 4 MB.
If the SGA is greater than 128 MB in total, each granule is 16 MB.
Granules are assigned to the Database Buffer Cache and Shared Pool, and these two memory components can dynamically grow and shrink.  Using contiguous memory improves system performance.  The actual number of granules assigned to one of these memory components can be determined by querying the database view named V$BUFFER_POOL.
Granules are allocated when the Oracle server starts a database instance in order to provide memory addressing space to meet the SGA_MAX_SIZE parameter.  The minimum is 3 granules:  one each for the fixed SGA, Database Buffer Cache, and Shared Pool.  In practice, you’ll find the SGA is allocated much more memory than this.  The SELECT statement shown below shows a current_size of 1,152 granules.
SELECT name, block_size, current_size, prev_size, prev_buffers
FROM v$buffer_pool;
NAME                 BLOCK_SIZE CURRENT_SIZE  PREV_SIZE PREV_BUFFERS
-------------------- ---------- ------------ ---------- ------------
DEFAULT                    8192         1152          0            0
For additional information on the dynamic SGA sizing, enroll in Oracle’s Oracle10g Database Performance Tuning course.
Automatic Shared Memory Management
Prior to Oracle 10G, a DBA had to manually specify SGA Component sizes through the initialization parameters, such as SHARED_POOL_SIZE, DB_CACHE_SIZE, JAVA_POOL_SIZE, and LARGE_POOL_SIZE parameters.
Automatic Shared Memory Management enables a DBA to specify the total SGA memory available through the SGA_TARGET initialization parameter.  The Oracle Database automatically distributes this memory among various subcomponents to ensure most effective memory utilization.
The DBORCL database SGA_TARGET is set in the initDBORCL.ora file:
sga_target=1610612736
With automatic SGA memory management, the different SGA components are flexibly sized to adapt  to the SGA available.
Setting a single parameter simplifies the administration task – the DBA only specifies the amount of SGA memory available to an instance – the DBA can forget about the sizes of individual components. No out of memory errors are generated unless the system has actually run out of memory.  No manual tuning effort is needed.
The SGA_TARGET initialization parameter reflects the total size of the SGA and includes memory for the following components:
Fixed SGA and other internal allocations needed by the Oracle Database instance
The log buffer
The shared pool
The Java pool
The buffer cache
The keep and recycle buffer caches (if specified)
Nonstandard block size buffer caches (if specified)
The Streams Pool
If SGA_TARGET is set to a value greater than SGA_MAX_SIZE at startup, then the SGA_MAX_SIZE value is bumped up to accommodate SGA_TARGET.  After startup, SGA_TARGET can be decreased or increased dynamically. However, it cannot exceed the value of SGA_MAX_SIZE that was computed at startup.
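For example, with automatic shared memory management in place, the DBA can resize the total SGA online; the 1200M figure below is illustrative only and must stay within SGA_MAX_SIZE:
SQL> ALTER SYSTEM SET SGA_TARGET = 1200M;
System altered.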
When you set a value for SGA_TARGET, Oracle Database 10g automatically sizes the most commonly configured components, including:
The shared pool (for SQL and PL/SQL execution)
The Java pool (for Java execution state)
The large pool (for large allocations such as RMAN backup buffers)
The buffer cache
There are a few SGA components whose sizes are not automatically adjusted. The DBA must specify the sizes of these components explicitly, if they are needed by an application. Such components are:
Keep/Recycle buffer caches (controlled by DB_KEEP_CACHE_SIZE and DB_RECYCLE_CACHE_SIZE)
Additional buffer caches for non-standard block sizes (controlled by DB_nK_CACHE_SIZE, n = {2, 4, 8, 16, 32})
Streams Pool (controlled by the new parameter STREAMS_POOL_SIZE)
Shared Pool
The Shared Pool is a memory structure that is shared by all system users.  It consists of both fixed and variable structures.  The variable component grows and shrinks depending on the demands placed on memory size by system users and application programs.
Memory can be allocated to the Shared Pool by the parameter SHARED_POOL_SIZE in the parameter file.  You can alter the size of the shared pool dynamically with the ALTER SYSTEM SET command; an example is shown below.  You must keep in mind that the total memory allocated to the SGA is set by the SGA_TARGET parameter (and may also be limited by SGA_MAX_SIZE if it is set), and since the Shared Pool is part of the SGA, you cannot exceed the maximum size of the SGA.
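For example (the 256M value is illustrative only, and the new size must still fit within the SGA limits mentioned above):
SQL> ALTER SYSTEM SET SHARED_POOL_SIZE = 256M;
System altered.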
The Shared Pool stores the most recently executed SQL statements and used data definitions.  This is because some system users and application programs will tend to execute the same SQL statements often.  Saving this information in memory can improve system performance.
The Shared Pool includes the Library Cache and Data Dictionary Cache.
Library Cache
Memory is allocated to the Library Cache whenever an SQL statement is parsed or a program unit is called.  This enables storage of the most recently used SQL and PL/SQL statements.
If the Library Cache is too small, the Library Cache must purge statement definitions in order to have space to load new SQL and PL/SQL statements.  Actual management of this memory structure is through a Least-Recently-Used (LRU) algorithm.  This means that the SQL and PL/SQL statements that are oldest and least recently used are purged when more storage space is needed.
The Library Cache is composed of two memory subcomponents:
Shared SQL:  This stores/shares the execution plan and parse tree for SQL statements.  If a system user executes an identical statement, then the statement does not have to be parsed again in order to execute the statement.
Shared PL/SQL Procedures and Packages:  This stores/shares the most recently used PL/SQL statements such as functions, packages, and triggers.
Data Dictionary Cache
The Data Dictionary Cache is a memory structure that caches data dictionary information that has been recently used.  This includes user account information, datafile names, table descriptions, user privileges, and other information.
The database server manages the size of the Data Dictionary Cache internally and the size depends on the size of the Shared Pool in which the Data Dictionary Cache resides.  If the size is too small, then the data dictionary tables that reside on disk must be queried often for information and this will slow down performance.
Buffer Caches
A number of buffer caches are maintained in memory in order to improve system response time.
Database Buffer Cache
The Database Buffer Cache is a fairly large memory object that stores the actual data blocks that are retrieved from datafiles by system queries and other data manipulation language commands.
A query causes a Server Process to first look in the Database Buffer Cache to determine if the requested information happens to already be located in memory – thus the information would not need to be retrieved from disk and this would speed up performance.  If the information is not in the Database Buffer Cache, the Server Process retrieves the information from disk and stores it to the cache.
Keep in mind that information read from disk is read a block at a time, not a row at a time, because a database block is the smallest addressable storage space on disk.
Database blocks are kept in the Database Buffer Cache according to a Least Recently Used (LRU) algorithm and are aged out of memory if a buffer cache block is not used in order to provide space for the insertion of newly needed database blocks.
The buffers in the cache are organized in two lists:
the write list and,
the least recently used (LRU) list.
The write list holds dirty buffers – these are buffers that hold that data that has been modified, but the blocks have not been written back to disk.
The LRU list holds free buffers, pinned buffers, and dirty buffers that have not yet been moved to the write list.  Free buffers do not contain any useful data and are available for use.  Pinned buffers are currently being accessed.
When an Oracle process accesses a buffer, the process moves the buffer to the most recently used (MRU) end of the LRU list – this causes dirty buffers to age toward the LRU end of the LRU list.
When an Oracle user process needs a data row, it searches for the data in the database buffer cache because memory can be searched more quickly than hard disk can be accessed.  If the data row is already in the cache (a cache hit), the process reads the data from memory; otherwise a cache miss occurs and data must be read from hard disk into the database buffer cache.
Before reading a data block into the cache, the process must first find a free buffer. The process searches the LRU list, starting at the LRU end of the list.  The search continues until a free buffer is found or until the search reaches the threshold limit of buffers.
Each time the user process finds a dirty buffer as it searches the LRU, that buffer is moved to the write list and the search for a free buffer continues.
When the process finds a free buffer, it reads the data block from disk into the buffer and moves the buffer to the MRU end of the LRU list.
If an Oracle user process searches the threshold limit of buffers without finding a free buffer, the process stops searching the LRU list and signals the DBW0 background process to write some of the dirty buffers to disk.  This frees up some buffers.
The block size for a database is set when a database is created and is determined by the init.ora parameter file parameter named DB_BLOCK_SIZE.  Typical block sizes are 2KB, 4KB, 8KB, 16KB, and 32KB.  The size of blocks in the Database Buffer Cache matches the block size for the database.  The DBORCL database uses an 8KB block size.
Because tablespaces that store oracle tables can use different (non-standard) block sizes, there can be more than one Database Buffer Cache allocated to match block sizes in the cache with the block sizes in the non-standard tablespaces.
The size of the Database Buffer Caches can be controlled by the parameters DB_CACHE_SIZE and DB_nK_CACHE_SIZE to dynamically change the memory allocated to the caches without restarting the Oracle instance.
You can dynamically change the size of the Database Buffer Cache with the ALTER SYSTEM command like the one shown here:
ALTER SYSTEM SET DB_CACHE_SIZE = 96M;
You can have the Oracle Server gather statistics about the Database Buffer Cache to help you size it to achieve an optimal workload for the memory allocation.  This information is displayed from the V$DB_CACHE_ADVICE view.   In order for statistics to be gathered, you can dynamically alter the system by using the ALTER SYSTEM SET DB_CACHE_ADVICE (OFF, ON, READY) command.  However, gathering statistics on system performance always incurs some overhead that will slow down system performance.
SQL> ALTER SYSTEM SET db_cache_advice = ON;
System altered.
SQL> DESC V$DB_cache_advice;
Name                                      Null?    Type
----------------------------------------- -------- -------------
ID                                                 NUMBER
NAME                                               VARCHAR2(20)
BLOCK_SIZE                                         NUMBER
ADVICE_STATUS                                      VARCHAR2(3)
SIZE_FOR_ESTIMATE                                  NUMBER
SIZE_FACTOR                                        NUMBER
BUFFERS_FOR_ESTIMATE                               NUMBER
ESTD_PHYSICAL_READ_FACTOR                          NUMBER
ESTD_PHYSICAL_READS                                NUMBER
ESTD_PHYSICAL_READ_TIME                            NUMBER
ESTD_PCT_OF_DB_TIME_FOR_READS                      NUMBER
ESTD_CLUSTER_READS                                 NUMBER
ESTD_CLUSTER_READ_TIME                             NUMBER
SQL> SELECT name, block_size, advice_status FROM v$db_cache_advice;
NAME                 BLOCK_SIZE ADV
-------------------- ---------- ---
DEFAULT                    8192 ON
<more rows will display>
21 rows selected.
SQL> ALTER SYSTEM SET db_cache_advice = OFF;
System altered.
KEEP Buffer Pool
This pool retains blocks in memory (data from tables) that are likely to be reused throughout daily processing.  An example might be a table containing user names and passwords or a validation table of some type.
The DB_KEEP_CACHE_SIZE parameter sizes the KEEP Buffer Pool.
RECYCLE Buffer Pool
This pool is used to store table data that is unlikely to be reused throughout daily processing – thus the data is quickly recycled.
The DB_RECYCLE_CACHE_SIZE parameter sizes the RECYCLE Buffer Pool.
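As a sketch of how these pools are used (the sizes and the table name app_lookup are illustrative assumptions, not from the original text):
SQL> ALTER SYSTEM SET DB_KEEP_CACHE_SIZE = 64M;
SQL> ALTER SYSTEM SET DB_RECYCLE_CACHE_SIZE = 32M;
SQL> ALTER TABLE app_lookup STORAGE (BUFFER_POOL KEEP);
The last command directs future reads of the app_lookup table's blocks into the KEEP pool.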
Redo Log Buffer
The Redo Log Buffer memory object stores images of all changes made to database blocks.  As you know, database blocks typically store several table rows of organizational data.  This means that if a single column value from one row in a block is changed, the image is stored.  Changes include INSERT, UPDATE, DELETE, CREATE, ALTER, or DROP.
Think of the Redo Log Buffer as a circular buffer that is reused over and over.  As the buffer fills up, copies of the images are stored to the Redo Log Files that are covered in more detail in a later module.
Large Pool
The Large Pool is an optional memory structure that primarily relieves the memory burden placed on the Shared Pool.  The Large Pool is used for the following tasks if it is allocated:
Allocating space for session memory requirements from the User Global Area (part of the Server Process) where a Shared Server is in use.
Transactions that interact with more than one database, e.g., a distributed database scenario.
Backup and restore operations by the Recovery Manager (RMAN) process.
RMAN uses this only if the BACKUP_DISK_IO = n and BACKUP_TAPE_IO_SLAVES = TRUE parameters are set.
If the Large Pool is too small, memory allocation for backup will fail and memory will be allocated from the Shared Pool.
Parallel execution message buffers for parallel server operations.  The PARALLEL_AUTOMATIC_TUNING = TRUE parameter must be set.
The Large Pool size is set with the LARGE_POOL_SIZE parameter – this is not a dynamic parameter.  It does not use an LRU list to manage memory.
Java Pool
The Java Pool is an optional memory object, but is required if the database has Oracle Java installed and in use for Oracle JVM (Java Virtual Machine).  The size is set with the JAVA_POOL_SIZE parameter that defaults to 24MB.
The Java Pool is used for memory allocation to parse Java commands.
Storing Java code and data in the Java Pool is analogous to SQL and PL/SQL code cached in the Shared Pool.
Streams Pool
This cache is new to Oracle 10g.  It is sized with the parameter STREAMS_POOL_SIZE.
This pool stores data and control structures to support the Oracle Streams feature of Oracle Enterprise Edition.  Oracle Streams manages sharing of data and events in a distributed environment.
If STREAMS_POOL_SIZE is not set or is zero, memory for Oracle Streams operations is allocated from up to 10% of the Shared Pool memory.
Program Global Area
The Program Global Area is also termed the Process Global Area (PGA) and is a part of memory allocated that is outside of the Oracle Instance.  The PGA stores data and control information for a single Server Process or a single Background Process.  It is allocated when a process is created and the memory is scavenged by the operating system when the process terminates.  This is NOT a shared part of memory – one PGA to each process only.
The content of the PGA varies, but generally includes the following:
Private SQL Area:  Data for binding variables and runtime memory allocations.  A user session issuing SQL statements has a Private SQL Area that may be associated with a Shared SQL Area if the same SQL statement is being executed by more than one system user.  This often happens in OLTP environments where many users are executing and using the same application program.
Dedicated Server environment – the Private SQL Area is located in the Program Global Area.
Shared Server environment – the Private SQL Area is located in the System Global Area.
Session Memory:  Memory that holds session variables and other session information.
SQL Work Area:  Memory allocated for sort, hash-join, bitmap merge, and bitmap create types of operations.
Oracle 9i and later versions enable automatic sizing of the SQL Work Areas by setting the WORKAREA_SIZE_POLICY = AUTO parameter (this is the default!) and PGA_AGGREGATE_TARGET = n (where n is some amount of memory established by the DBA).  However, the DBA can let Oracle 10g determine the appropriate amount of memory.
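For example, automatic SQL Work Area sizing could be configured as follows (the 512M target is an illustrative value, not a recommendation):
SQL> ALTER SYSTEM SET WORKAREA_SIZE_POLICY = AUTO;
SQL> ALTER SYSTEM SET PGA_AGGREGATE_TARGET = 512M;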
Oracle 8i and earlier required the DBA to set the following parameters to control SQL Work Area memory allocations:
SORT_AREA_SIZE.
HASH_AREA_SIZE.
BITMAP_MERGE_AREA_SIZE.
CREATE_BITMAP_AREA_SIZE.
Software Code Area
Software code areas store Oracle executable files running as part of the Oracle instance.
These code areas are static in nature and are located in privileged memory that is separate from other user programs.
The code can be installed sharable when multiple Oracle instances execute on the same server with the same software release level.
Processes
You need to understand three different types of Processes:
User Process:  Starts when a database user requests to connect to an Oracle Server.
Server Process:  Establishes the Connection to an Oracle Instance when a User Process requests connection – makes the connection for the User Process.
Background Processes:  These start when an Oracle Instance is started up.
User Process
In order to use Oracle, you must obviously connect to the database.  This must occur whether you’re using SQLPlus, an Oracle tool such as Designer or Forms, or an application program.
This generates a User Process (a memory object) that generates programmatic calls through your user interface (SQLPlus, Integrated Developer Suite, or application program) that creates a session and causes the generation of a Server Process that is either dedicated or shared.
Server Process
As you have seen, the Server Process is the go-between for a User Process and the Oracle Instance.   In a Dedicated Server environment, there is a single Server Process to serve each User Process.  In a Shared Server environment, a Server Process can serve several User Processes, although with some performance reduction.  Allocation of server process in a dedicated environment versus a shared environment is covered in further detail in the Oracle10g Database Performance Tuning course offered by Oracle Education.
Background Processes
As is shown here, there are both mandatory and optional background processes that are started whenever an Oracle Instance starts up.  These background processes serve all system users.  We will cover the mandatory processes in detail.
Optional Background Process Definition:
ARCn: Archiver – One or more archiver processes copy the online redo log files to archival storage when they are full or a log switch occurs.
CJQ0:  Coordinator Job Queue – This is the coordinator of job queue processes for an instance. It monitors the JOB$ table (table of jobs in the job queue) and starts job queue processes (Jnnn) as needed to execute jobs.  The Jnnn processes execute job requests created by the DBMS_JOBS package.
Dnnn:  Dispatcher number “nnn”, for example, D000 would be the first dispatcher process – Dispatchers are optional background processes, present only when the shared server configuration is used. Shared server is discussed in your readings on the topic “Configuring Oracle for the Shared Server”.
RECO:  Recoverer – The Recoverer process is used to resolve distributed transactions that are pending due to a network or system failure in a distributed database.  At timed intervals, the local RECO attempts to connect to remote databases and automatically complete the commit or rollback of the local portion of any pending distributed transactions.  For information about this process and how to start it, see your readings on the topic “Managing Distributed Transactions”.
Of these, the ones you’ll use most often are ARCn (archiver) when you automatically archive redo log file information (covered in a later module), and RECO for recovery where the database is distributed on two or more separate physical Oracle servers, perhaps a UNIX machine and an NT machine.
DBWn (also called DBWR in earlier Oracle Versions)
The Database Writer writes modified blocks from the database buffer cache to the datafiles. Although one database writer process (DBW0) is sufficient for most systems, you can configure up to 20 DBWn processes (DBW0 through DBW9 and DBWa through DBWj) in order to improve write performance for a system that modifies data heavily.
The initialization parameter DB_WRITER_PROCESSES specifies the number of DBWn processes.
The purpose of DBWn is to improve system performance by caching writes of database blocks from the Database Buffer Cache back to datafiles.  Blocks that have been modified and that need to be written back to disk are termed “dirty blocks.”  The DBWn also ensures that there are enough free buffers in the Database Buffer Cache to service Server Processes that may be reading data from datafiles into the Database Buffer Cache.  Performance improves because by delaying writing changed database blocks back to disk, a Server Process may find the data that is needed to meet a User Process request already residing in memory!
DBWn writes to datafiles when a checkpoint occurs, when a server process cannot find a free buffer after scanning a threshold number of buffers, and on a timeout (approximately every three seconds).
LGWR
The Log Writer (LGWR) writes contents from the Redo Log Buffer to the Redo Log File that is in use.  These are sequential writes since the Redo Log Files record database modifications based on the actual time that the modification takes place.  LGWR actually writes before the DBWn writes and only confirms that a COMMIT operation has succeeded when the Redo Log Buffer contents are successfully written to disk.  LGWR can also call the DBWn to write contents of the Database Buffer Cache to disk.  LGWR writes when a transaction commits, when the Redo Log Buffer is one-third full, every three seconds, and before DBWn writes modified buffers to disk.
SMON
The System Monitor (SMON) is responsible for instance recovery by applying entries in the online redo log files to the datafiles.  It also performs other activities, as outlined below.
If an Oracle Instance fails, all information in memory not written to disk is lost.  SMON is responsible for recovering the instance when the database is started up again.  It does the following:
Rolls forward to recover data that was recorded in a Redo Log File, but that had not yet been recorded to a datafile by DBWn.  SMON reads the Redo Log Files and applies the changes to the data blocks.  This recovers all transactions that were committed because these were written to the Redo Log Files prior to system failure.
Opens the database to allow system users to logon.
Rolls back uncommitted transactions.
SMON also does limited space management.  It combines (coalesces) adjacent areas of free space in the database’s datafiles for tablespaces that are dictionary managed.
It also deallocates temporary segments to create free space in the datafiles.
PMON
The Process Monitor (PMON) is a cleanup type of process that cleans up after failed processes, such as the dropping of a user connection due to a network failure or the abend of a user application program.  When a process fails, PMON rolls back the failed transaction, releases its locks, and frees the other resources the process was using.
CKPT
The Checkpoint (CKPT) process writes information to the database control files that identifies the point in time with regard to the Redo Log Files where instance recovery is to begin should it be necessary.  This is done, at a minimum, once every three seconds.
Think of a checkpoint record as a starting point for recovery.  DBWn will have completed writing all buffers from the Database Buffer Cache to disk prior to the checkpoint, thus those records will not require recovery.  This does the following:
Ensures modified data blocks in memory are regularly written to disk – CKPT can call the DBWn process in order to ensure this and does so when writing a checkpoint record.
Reduces Instance Recovery time by minimizing the amount of work needed for recovery since only Redo Log File entries processed since the last checkpoint require recovery.
Causes all committed data to be written to datafiles during database shutdown.
If a Redo Log File fills up and a switch is made to a new Redo Log File (this is covered in more detail in a later module), the CKPT process also writes checkpoint information into the headers of the datafiles.
Checkpoint information written to control files includes the system change number (the SCN is a number stored in the control file and in the headers of the database files that are used to ensure that all files in the system are synchronized), location of which Redo Log File is to be used for recovery, and other information.
CKPT does not write data blocks or redo blocks to disk – it calls DBWn and LGWR as necessary.
ARCn
We cover the Archiver (ARCn) optional background process in more detail because it is almost always used for production systems storing mission critical information.   The ARCn process must be used to recover from loss of a physical disk drive for systems that are “busy” with lots of transactions being completed.
When a Redo Log File fills up, Oracle switches to the next Redo Log File.  The DBA creates several of these and the details of creating them are covered in a later module.  If all Redo Log Files fill up, then Oracle switches back to the first one and uses them in a round-robin fashion by overwriting ones that have already been used – it should be obvious that the information stored on the files, once overwritten, is lost forever.
If ARCn is in what is termed ARCHIVELOG mode, then as the Redo Log Files fill up, they are individually written to Archived Redo Log Files and LGWR does not overwrite a Redo Log File until archiving has completed.  Thus, committed data is not lost forever and can be recovered in the event of a disk failure.  Only the contents of the SGA will be lost if an Instance fails.
In NOARCHIVELOG mode, the Redo Log Files are overwritten and not archived.  Recovery can only be made to the last full backup of the database files.  All committed transactions after the last full backup are lost, and you can see that this could cost the firm a lot of $$$.
When running in ARCHIVELOG mode, the DBA is responsible to ensure that the Archived Redo Log Files do not consume all available disk space!  Usually after two complete backups are made, any Archived Redo Log Files for prior backups are deleted.
Logical Structure
It is helpful to understand how an Oracle database is organized in terms of a logical structure that is used to organize physical objects.
Tablespace:  An Oracle 10g database must always consist of at least two tablespaces (SYSTEM and SYSAUX), although a typical Oracle database will have multiple tablespaces.
A tablespace is a logical storage facility (a logical container) for storing objects such as tables, indexes, sequences, clusters, and other database objects.
Each tablespace has at least one physical datafile that actually stores the tablespace at the operating system level.  A large tablespace may have more than one datafile allocated for storing objects assigned to that tablespace.
A tablespace belongs to only one database.
Tablespaces can be brought online and taken offline for purposes of backup and management, except for the SYSTEM tablespace that must always be online.
Tablespaces can be in either read-only or read-write status.
Datafile:  Tablespaces are stored in datafiles which are physical disk objects.
A datafile can only store objects for a single tablespace, but a tablespace may have more than one datafile – this happens when a disk drive device fills up and a tablespace needs to be expanded, then it is expanded to a new disk drive.
The DBA can change the size of a datafile to make it smaller or larger.  The file can also grow in size dynamically as the tablespace grows.
Segment:  When logical storage objects are created within a tablespace, for example, an employee table, a segment is allocated to the object.
Obviously a tablespace typically has many segments.
A segment cannot span tablespaces but can span datafiles that belong to a single tablespace.
Extent:  Each object has one segment which is a physical collection of extents.
Extents are simply collections of contiguous disk storage blocks.  A logical storage object such as a table or index always consists of at least one extent – ideally the initial extent allocated to an object will be large enough to store all data that is initially loaded.
As a table or index grows, additional extents are added to the segment.
A DBA can add extents to segments in order to tune performance of the system.
An extent cannot span a datafile.
Block:  The Oracle Server manages data at the smallest unit in what is termed a block or data block.  Data are actually stored in blocks.
A physical block is the smallest addressable location on a disk drive for read/write operations.
An Oracle data block consists of one or more physical blocks (operating system blocks) so the data block, if larger than an operating system block, should be an even multiple of the operating system block size, e.g., if the Linux operating system block size is 2K or 4K, then the Oracle data block should be 2K, 4K, 8K, 16K, etc in size.  This optimizes I/O.
The data block size is set at the time the database is created and cannot be changed.  It is set with the DB_BLOCK_SIZE parameter.  The maximum data block size depends on the operating system.
Thus, the Oracle database architecture includes both logical and physical structures as follows:
Physical:  Control files; Redo Log Files; Datafiles; Operating System Blocks.
Logical:  Tablespaces; Segments; Extents; Data Blocks.
SQL Statement Processing
SQL Statements are processed differently depending on whether the statement is a query, data manipulation language (DML) to update, insert, or delete a row, or data definition language (DDL) to write information to the data dictionary.
Processing a query:
Parse:
Search for identical statement in the Shared SQL Area.
Check syntax, object names, and privileges.
Lock objects used during parse.
Create and store execution plan.
Bind: Obtains values for variables.
Execute: Process statement.
Fetch: Return rows to user process.
Processing a DML statement:
Parse: Same as the parse phase used for processing a query.
Bind: Same as the bind phase used for processing a query.
Execute:
If the data and undo blocks are not already in the Database Buffer Cache, the server process reads them from the datafiles into the Database Buffer Cache.
The server process places locks on the rows that are to be modified. The undo block is used to store the before image of the data, so that the DML statements can be rolled back if necessary.
The data blocks record the new values of the data.
The server process records the before image to the undo block and updates the data block.  Both of these changes are made in the Database Buffer Cache.  Any changed blocks in the Database Buffer Cache are marked as dirty buffers.  That is, buffers that are not the same as the corresponding blocks on the disk.
The processing of a DELETE or INSERT command uses similar steps.  The before image for a DELETE contains the column values in the deleted row, and the before image of an INSERT contains the row location information.
Processing a DDL statement:  The execution of DDL (Data Definition Language) statements differs from the execution of DML (Data Manipulation Language) statements and queries, because the success of a DDL statement requires write access to the data dictionary.
For these statements, parsing actually includes parsing, data dictionary lookup, and execution.  Transaction management, session management, and system management SQL statements are processed using the parse and execute stages.  To re-execute them, simply perform another execute.


Instance recovery is performed in two steps, i.e., roll forward and roll back.
Cache Recovery or Rollforward:
Here, the changes recorded in the redo log files are applied to the affected blocks. This includes both committed and uncommitted data. Since undo data is protected by redo, the roll forward also regenerates the undo images. The time required for this is proportional to the changes made in the database after the last successful checkpoint. After cache recovery, the database will be 'consistent' to the point when the crash occurred. The database can then be opened and users can start connecting to it. The parameter RECOVERY_PARALLELISM specifies the number of processes that participate in instance or crash recovery, and we can thus speed up the roll forward.
Transaction Recovery or Rollback
The uncommitted data in the database will now be rolled back. This is coordinated by SMON, which rolls back sets of transactions in parallel (by default) using multiple server processes. SMON automatically decides when to begin parallel rollback and disperses the work among several parallel processes: process one rolls back one transaction, process two rolls back a second transaction, and so on. If a transaction is huge, Oracle begins intra-transaction recovery by dispersing the huge transaction among the slave processes: process one takes one part, process two takes another part, and so on. Parallel mode is the default and is controlled by the parameter FAST_START_PARALLEL_ROLLBACK. We can either turn it off (for serial recovery) or increase the degree of parallelism. If you change the value of the parameter FAST_START_PARALLEL_ROLLBACK, then transaction recovery will be stopped and restarted with the new implied degree of parallelism.
As mentioned earlier, user sessions are allowed to connect even before the transaction recovery is completed. If a user attempts to access a row that is locked by a terminated transaction, Oracle rolls back only those changes necessary to complete the transaction; in other words, it rolls them back on demand. Consequently, new transactions do not have to wait until all parts of a long transaction are rolled back.
This transaction recovery is required and has to be completed. We can disable transaction recovery temporarily but at some point this has to be completed. We can monitor the progress of fast-start parallel rollback by examining the V$FAST_START_SERVERS and V$FAST_START_TRANSACTIONS views.
The Fast-Start Fault Recovery feature reduces the time required for cache recovery, and makes the recovery bounded and predictable by limiting the number of dirty buffers and the number of redo records generated between the most recent redo record and the last checkpoint. With the Fast-Start Fault Recovery feature, the FAST_START_MTTR_TARGET initialization parameter simplifies the configuration of recovery time from instance or system failure. FAST_START_MTTR_TARGET specifies a target for the expected mean time to recover (MTTR), that is, the time (in seconds) that it should take to start up the instance and perform cache recovery. After FAST_START_MTTR_TARGET is set, the database manages incremental checkpoint writes in an attempt to meet that target.
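As a quick sketch (the 60-second target is illustrative), you can set the target and then compare it with the current estimate in V$INSTANCE_RECOVERY:
SQL> ALTER SYSTEM SET FAST_START_MTTR_TARGET = 60;
SQL> SELECT target_mttr, estimated_mttr FROM v$instance_recovery;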
If SMON is busy doing transaction recovery, you should never attempt a shutdown abort followed by a restart of the database; the entire recovery work done up to that point would have to be done again.
There are different modes in which you can open the database, e.g., migrate, read-only, and restricted modes.

What is Active Data Guard


It enables physical standby database to be open for read access while media  recovery is being performed on them to keep them synchronized with the production
database. The physical standby database is open in read-only mode while
redo transport and standby apply are both active.
Active Data Guard automatically repairs corrupt blocks online by using the active standby database.
Normally, queries executed on active standby databases return up-to-date results. Due to potential delays in redo transport or standby apply, a standby database may “fall behind” its primary, which can cause results of queries on the standby to be out of date.
Active Data Guard sessions can be configured with a maximum query delay (in
seconds). If the standby exceeds the delay, Active Data Guard returns an error to the application, which can then retry the query or transparently redirect the query to the primary, as required
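A minimal sketch of such a limit, assuming a 30-second tolerance (the value is illustrative), uses the STANDBY_MAX_DATA_DELAY session setting on the standby; a query in that session fails with an error if the standby lags the primary by more than the specified number of seconds:
SQL> ALTER SESSION SET STANDBY_MAX_DATA_DELAY = 30;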

What is logical standby database

In a logical standby database configuration, Data Guard SQL Apply uses redo information shipped from the primary system. However, instead of using media recovery to apply the changes (as in the physical standby database configuration), the redo data is transformed into equivalent SQL statements by using LogMiner technology. These SQL statements are then applied to the logical standby database. The logical standby database is open in read/write mode and is available for reporting capabilities.
The logical standby database can offer protection at database level, schema level, or even object level.
A logical standby database can be used to perform rolling database upgrades, thereby minimizing down time when upgrading to new database patch sets or full database releases.




Oracle Background Processes

When a buffer in the database buffer cache is modified, it is marked as a dirty buffer and added to the head of the checkpoint queue, which is kept in system change number (SCN) order. This order therefore matches the order of redo that is written to the redo logs for these changed buffers.

When the number of available buffers in the buffer cache falls below an internal threshold (to the extent that server processes find it difficult to obtain available buffers), DBWn writes infrequently used, modified (dirty) buffers to the data files from the tail of the LRU list so that processes can replace buffers when they need them. DBWn also writes from the tail of the checkpoint queue to keep the checkpoint advancing.
The SGA contains a memory structure that has the redo byte address (RBA) of the position in the redo stream where recovery should begin in the case of an instance failure. This structure acts as a pointer into the redo and is written to the control file by the CKPT process once every three seconds. Because the DBWn writes dirty buffers in SCN order, and because the
redo is in SCN order, every time DBWn writes dirty buffers from the LRU list, it also advances the pointer held in the SGA memory structure so that instance recovery (if required) begins reading the redo from approximately the correct location and avoids unnecessary I/O. This is known as incremental checkpointing.
In all cases, DBWn performs batched (multiblock) writes to improve efficiency. The number of blocks written in a multiblock write varies by operating system.


LGWR is responsible for redo log buffer management, writing redo log buffer entries to a redo log file on disk. LGWR writes all redo entries that have been copied into the buffer since the last time it wrote.
The redo log buffer is a circular buffer. When LGWR writes redo entries from the redo log buffer to a redo log file, server processes can then copy new entries over the entries in the redo log buffer that have been written to disk. LGWR normally writes fast enough to ensure that space is always available in the buffer for new entries, even when access to the redo log
is heavy. LGWR writes one contiguous portion of the buffer to disk.
LGWR writes:

When a user process commits a transaction
When the redo log buffer is one-third full
Before a DBWn process writes modified buffers to disk (if necessary)
Every three seconds

Before DBWn can write a modified buffer, all redo records that are associated with the changes to the buffer must be written to disk (the write-ahead protocol). If DBWn finds that some redo records have not been written, it signals LGWR to write the redo records to disk and waits for LGWR to complete writing the redo log buffer before it can write out the data buffers. LGWR writes to the current log group. If one of the files in the group is damaged or unavailable, LGWR continues writing to other files in the group and logs an error in the LGWR trace file and in the system alert log. If all files in a group are damaged, or if the group is unavailable because it has
not been archived, LGWR cannot continue to function.
When a user issues a COMMIT statement, LGWR puts a commit record in the redo log buffer and writes it to disk immediately, along with the transaction’s redo entries. The corresponding changes to data blocks are deferred until it is more efficient to write them. This is called a fast commit mechanism. The atomic write of the redo entry containing the transaction’s commit
record is the single event that determines whether the transaction has committed. Oracle Database returns a success code to the committing transaction, although the data buffers have not yet been written to disk.

What is a checkpoint and what are its types

A checkpoint is both a concept and a mechanism. There are different types of checkpoints; the most important ones for this discussion are the full checkpoint and the incremental checkpoint.
The checkpoint position defines at which system change number (SCN) in the redo thread instance recovery would need to begin.
The SCN at which a full checkpoint occurred is stored in both the data file headers and the control file.
The SCN at which the last incremental checkpoint occurred is only stored in the control file (in a structure known as the checkpoint progress record).
The CKPT process updates the control files and the headers of all data files to record the details of the checkpoint. The CKPT process does not write blocks to disk; DBWn always performs that work. The SCNs recorded in the file headers guarantee that all changes made to database blocks prior to that SCN have been written to disk.
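You can observe these recorded SCNs yourself; for example, the following queries against the standard V$ views show the checkpoint SCN in the control file and in each data file header:
SQL> SELECT checkpoint_change# FROM v$database;
SQL> SELECT file#, checkpoint_change# FROM v$datafile_header;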
What are the Process Monitor and System Monitor Processes

The System Monitor Process performs recovery at instance startup if necessary. SMON is also responsible for cleaning up temporary segments that are no longer in use. If any terminated transactions were skipped during instance recovery because of file-read or offline errors, SMON recovers them when the tablespace or file is brought back online. SMON checks regularly to see whether the process is needed. Other processes can call SMON if they detect a need for it.

The Process Monitor Process performs process recovery when a user process fails. PMON is responsible for cleaning up the database buffer cache and freeing resources that the user process was using. For example, it resets the status of the active transaction table, releases locks, and removes the process ID from the list of active processes.
PMON periodically checks the status of dispatcher and server processes, and restarts any that have stopped running (but not any that Oracle Database has terminated intentionally). Like SMON, PMON checks regularly to see whether it is needed. It can be called if another process detects the need for it.
What is a retention policy

A retention policy describes which backups will be kept and for how long.
You can set the value of the retention policy by using the RMAN CONFIGURE command.
Recovery Window Retention Policy
The best practice is to establish a period of time during which it will be possible to discover logical errors and fix the affected objects by doing a point-in-time recovery to just before the error occurred. This period of time is called the recovery window. This policy is specified in number of days. For each data file, there must always exist at least one backup that satisfies
the following condition:
SYSDATE - backup_checkpoint_time >= recovery_window

You can use the following command syntax to configure a recovery window retention policy:

RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF <days> DAYS;
where <days> is the size of the recovery window.
If you are not using a recovery catalog, you should keep the recovery window time period less than or equal to the value of the CONTROL_FILE_RECORD_KEEP_TIME parameter to prevent
the record of older backups from being overwritten in the control file. If you are using a recovery catalog, then make sure the value of CONTROL_FILE_RECORD_KEEP_TIME is greater than the
time period between catalog resynchronizations. Resynchronizations happen when you:
Create a backup. In this case, the synchronization is done implicitly.
Execute the RESYNC CATALOG command.
Recovery catalogs are covered in more detail in the lesson titled “Using the RMAN Recovery
Catalog.”



Redundancy Retention Policy

If you require a certain number of backups to be retained, you can set the retention policy on the basis of the redundancy option. This option requires that a specified number of backups be cataloged before any backup is identified as obsolete. The default retention policy has a redundancy of 1, which means that only one backup of a file must exist at any given time. A
backup is deemed obsolete when a more recent version of the same file has been backed up.
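For example, to keep two backups of each file instead of the default one (the number 2 is illustrative):
RMAN> CONFIGURE RETENTION POLICY TO REDUNDANCY 2;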


What is obsolete and expired backup

Expired backup: When the CROSSCHECK command is used to determine whether backups recorded in the repository still exist on disk or tape, if RMAN cannot locate the backups, then it updates their records in the RMAN repository to EXPIRED status.

Obsolete Backup: One of the key questions every backup strategy must address is how long you want to keep a backup. Although you can specify that a backup be kept forever without becoming obsolete,
it’s not common to follow such a strategy, unless you’re doing it for a special reason. Instead,backups become obsolete according to the retention policy you adopt. You can select the retention duration of backups when using RMAN in two ways.
In the first method, you can specify backup retention based on a recovery window. That is, all backups necessary to perform a point-in-time recovery to a specified past point of time will be retained by RMAN. If a backup is older than the point of time you chose, that backup will become obsolete according
to the backup retention rules.
The second way to specify the retention duration is to use a redundancy-based retention policy, under which you specify the number of backups of a file
that must be kept on disk. Any backups of a datafile greater than that number will be considered obsolete. Because obsolete backups are automatically deleted when space is needed for fresh files, you won't be running the risk of accidentally deleting necessary files.

Obsolete backups are any backups that you don't need to satisfy a configured retention policy. You may also delete obsolete backups according to any retention policy you may specify as an option to the delete obsolete command. The delete obsolete command will remove the deleted files from the backup media and mark those backups as deleted in both the control file and the recovery catalog.

The report obsolete command reports on any obsolete backups. Always run the crosscheck command first in order to update the status of the backups in the RMAN repository to that on disk and tape.
In the following example, the report obsolete command shows no obsolete
backups:
RMAN> crosscheck backup;
RMAN> report obsolete;
RMAN retention policy will be applied to the command
RMAN retention policy is set to redundancy 1
no obsolete backups found
The following execution of the report obsolete command shows that there are both obsolete backup sets and obsolete archived redo log backups. Again, run the crosscheck command before issuing the report obsolete command.
RMAN> crosscheck backup;
RMAN> report obsolete;
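Once obsolete backups have been reported, they can be removed with the delete obsolete command described above, for example:
RMAN> crosscheck backup;
RMAN> delete obsolete;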

What is Clusterware Startup Sequence

The Oracle High Availability Services daemon (ohasd) is responsible for starting in the proper order, monitoring, and restarting other local Oracle Clusterware daemons, up through the crsd daemon, which in turn manages clusterwide resources.

When a cluster node boots, or Clusterware is started on a running cluster node, the init process starts ohasd. The ohasd process then initiates the startup of the processes in the lower, or Oracle High Availability Services (OHASD), stack.

The cssdagent process is started, which in turn starts ocssd. The ocssd process discovers the voting disk either in ASM or on shared storage, and then joins the cluster.The cssdagent process monitors the cluster and provides I/O fencing. This service formerly was provided by Oracle Process Monitor Daemon (oprocd). A cssdagent failure may result in Oracle Clusterware restarting the node.
The orarootagent is started. This process is a specialized oraagent process that helps crsd start and manage resources owned by root, such as the network and the grid virtual IP address.





Node vip: The node vip is a node application (nodeapp) responsible for eliminating response delays (TCP timeouts) to client programs requesting a connection to the database. Each node vip is assigned an unused IP address.
This is usually done via DHCP but can be manually assigned. There is initially one node vip per cluster node at Clusterware startup. When a cluster node becomes unreachable, the node vip is failed over to a surviving node and redirects connection requests made to the unreachable node to a surviving node.

SCAN vip: SCAN vips or Single Client Access Name vips are part of a connection framework that eliminates dependencies on static cluster node names. This framework allows nodes to be added to or removed from the cluster without affecting the ability of clients to connect to the database. If GNS is used in the cluster, three SCAN vips are started on the member nodes using the IP addresses assigned by the DHCP server. If GNS is not used, SCAN vip addresses for the cluster can be defined in the DNS server used by the cluster nodes.

SCAN Listener: Three SCAN Listeners are started on the cluster nodes where the SCAN VIPs are started. Oracle Database 11g Release 2 and later instances register with SCAN listeners only as remote listeners.
GNS vip: If GNS is used to resolve client requests for the cluster, a single GNS vip for the cluster is started. The IP address is assigned in the GNS server used by the cluster nodes.
Node Listener: A node (local) listener runs on each cluster node and handles connection requests for the database instances running on that node.


SCAN and Local Listeners

When a client submits a connection request, the SCAN listener listening on a SCAN IP address and the SCAN port is contacted on the client's behalf. Because all services on the cluster are registered with the SCAN listener, the SCAN listener replies with the address of the local listener on the least-loaded node where the service is currently being offered.
Finally, the client establishes a connection to the service through the listener on the node where service is offered. All these actions take place transparently to the client without any explicit configuration required in the client.

During installation, SCAN listeners are created on the nodes that host the SCAN IP addresses.
Oracle Net Services routes application requests to the least-loaded instance providing the service. Because the SCAN addresses resolve to the cluster, rather than to a node address in the cluster, nodes can be added to or removed from the cluster without affecting the SCAN address configuration.
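As a quick illustration, the SCAN configuration can be displayed with srvctl, and a client can connect through the SCAN name using an EZConnect string (the names easy-scan.example.com and easy_srv below are illustrative assumptions):
$ srvctl config scan
$ srvctl config scan_listener
$ sqlplus system@//easy-scan.example.com:1521/easy_srv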



Failing to Start OHAS

The first daemon to start in a Grid Infrastructure environment is OHAS. This process relies on the init process to invoke /etc/init.d/init.ohasd, which starts /etc/rc.d/init.d/ohasd, which in turn executes $GRID_HOME/ohasd.bin. Without a properly working ohasd.bin process, none of the other stack
components will start. The entry in /etc/inittab defines that /etc/init.d/init.ohasd is started at runlevels 3 and 5. Runlevel 3 in Linux usually brings the system up in networked, multi-user mode;
however, it doesn’t start X11. Runlevel 5 is normally used for the same purpose, but it also starts the graphical user interface. If the system is at a runlevel other than 3 or 5, then ohasd.bin cannot be started, and you need to use a call to init to change the runlevel to either 3 or 5. You can check
/var/log/messages for output from the scripts under /etc/rc.d/init.d/; ohasd.bin logs information into the default log file destination at $GRID_HOME/log/hostname in the ohasd/ohasd.log subdirectory.
The administrator has the option to disable the start of the High Availability Services stack by calling crsctl disable crs. This call updates a flag in /etc/oracle/scls_scr/hostname/root/ohasdstr. The file contains only one word, either enable or disable, and no carriage return. If set to disable, then
/etc/rc.d/init.d/ohasd will not proceed with the startup. Call crsctl start crs to start the cluster stack manually in that case. Many Grid Infrastructure background processes rely on sockets created in /var/tmp/.oracle. You
can check which socket is used by a process by listing the contents of the /proc/pid/fd directory, where pid is the process id of the program you are looking at. In some cases, permissions on the sockets can become garbled; in our experience, moving the .oracle directory to a safe location and rebooting solved the cluster communication problems.
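A short sketch of the commands mentioned above, run as root (london1 stands in for your own hostname in the scls_scr path):
# cat /etc/oracle/scls_scr/london1/root/ohasdstr
enable
# crsctl disable crs
# crsctl enable crs
# crsctl start crs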
Another reason ohasd.bin might fail to start: the file system for $GRID_HOME could be either corrupt or otherwise not mounted. Earlier, it was noted that ohasd.bin lives in $GRID_HOME/bin. If $GRID_HOME isn’t mounted, then it is not possible to start the daemon.
We introduced the OLR as an essential file for starting Grid Infrastructure. If the OLR has become corrupt or is otherwise not accessible, then ohasd.bin cannot start. Successful initialization of the OLR is recorded in the ohasd.log, as in the following example (the timestamps have been removed for the sake
of clarity):
[ default][3046704848] OHASD Daemon Starting. Command string :reboot
[ default][3046704848] Initializing OLR
[ OCRRAW][3046704848]proprioo: for disk 0
(/u01/app/crs/cdata/london1.olr),
id match (1), total id sets, (1) need recover (0), my votes (0),
total votes (0), commit_lsn (15), lsn (15)
[ OCRRAW][3046704848]proprioo: my id set: (2018565920, 1028247821, 0, 0, 0)
[ OCRRAW][3046704848]proprioo: 1st set: (2018565920, 1028247821, 0, 0, 0)
[ OCRRAW][3046704848]proprioo: 2nd set: (0, 0, 0, 0, 0)
[ CRSOCR][3046704848] OCR context init CACHE Level: 0xaa4cfe0
[ default][3046704848] OHASD running as the Privileged user

Interestingly, the errors pertaining to the local registry have the same numbers as those for the OCR; however, they have been prefixed by PROCL. The L can easily be missed, so check carefully! If the OLR cannot be read, then you will see the error messages immediately under the Initializing OLR line. This chapter has covered two causes so far: the OLR is missing or the OLR is corrupt. The first case is much easier to diagnose because, in that case, OHAS will not start:
[root@london1 ~]# crsctl check crs
CRS-4639: Could not contact Oracle High Availability Services
In the preceding example, ohasd.log will contain an error message similar to this one:
[ default][1381425744] OHASD Daemon Starting. Command string :restart
[ default][1381425744] Initializing OLR
[ OCROSD][1381425744]utopen:6m’:failed in stat OCR file/disk
/u01/app/crs/cdata/london1.olr,
errno=2, os err string=No such file or directory
[ OCROSD][1381425744]utopen:7:failed to open any OCR file/disk, errno=2,
os err string=No such file or directory
[ OCRRAW][1381425744]proprinit: Could not open raw device
[ OCRAPI][1381425744]a_init:16!: Backend init unsuccessful : [26]
[ CRSOCR][1381425744] OCR context init failure. Error: PROCL-26: Error
while accessing the physical storage Operating System
error [No such file or directory] [2]
[ default][1381425744] OLR initalization failured, rc=26
[ default][1381425744]Created alert : (:OHAS00106:) : Failed to initialize
Oracle Local Registry
[ default][1381425744][PANIC] OHASD exiting; Could not init OLR
In this case, you should restore the OLR, which you will learn how to do in the “Maintaining Voting
Disk and OCR/OLR” section.

If the OLR is corrupted, then you will see slightly different errors. OHAS tries to read the OLR; while it succeeds for some keys, it fails for others. Long hex dumps will appear in the ohasd.log, indicating a problem. You should perform an ocrcheck -local in this case, which can help you determine the root
cause. The following output has been taken from a system where the OLR was corrupt:
[root@london1 ohasd]# ocrcheck -local
Status of Oracle Local Registry is as follows :
         Version                  :          3
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       2232
         Available space (kbytes) :     259888
         ID                       : 1022831156
         Device/File Name         : /u01/app/crs/cdata/london1.olr
                                    Device/File integrity check failed
         Local registry integrity check failed
         Logical corruption check bypassed


If the utility confirms that the OLR is corrupted, then you have no option but to restore it, as described in the "Maintaining Voting Disk and OCR/OLR" section.


Testing ASM Disk Failure Scenario and disk_repair_time

External Redundancy -- Every piece of data is stored once.
     If five drives are allocated, data is spread across the five drives. If you lose one drive, data is lost and must be restored from backup.
 
Normal Redundancy -- Requires two or more ASM disks; each extent is written to two different disks, so it can protect against a single drive outage.
Eg (three disks, each extent mirrored on two of them):
     Disk1: A     C
     Disk2: A  B
     Disk3:    B  C
If you lose one drive, you still have the data, since every extent has a mirror copy on a surviving disk.

High Redundancy -- Requires three or more ASM disks; the same data is written three times across the disk group, so it can protect against a two-drive outage.
If you drop one drive in a high redundancy disk group and abruptly shut down the machine, the disk group will not mount automatically, because the disk group is incomplete (one disk is missing). It can only be mounted with the FORCE option, as shown below.

 
 
  
For more details: https://youtu.be/GNcU0NudlJ0


When a disk failure occurs for an ASM disk, the behavior of ASM depends on the redundancy of the disk group. If the disk group uses EXTERNAL REDUNDANCY, it keeps working only if there is redundancy at the external RAID level. If there is no RAID at the external level, the disk group is dismounted immediately; the disk needs to be repaired or replaced, the disk group may then need to be dropped and re-created, and the data on it will require recovery.

For NORMAL and HIGH redundancy disk groups, the behavior is a little different. When a disk becomes corrupted or goes missing in a NORMAL/HIGH redundancy disk group, an error is reported in the alert log file and the disk goes OFFLINE, as we can see in the output of the query below, which I ran after starting my ASM disk failure test. I simply unplugged a disk belonging to an ASM disk group with NORMAL redundancy from the storage.
col name format a8
col header_status format a7
set lines 2000
col path format a10
select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN         1200 OFFLINE MISSING
  
Here we see the value "1200" in the REPAIR_TIMER column; this is the time in seconds after which the disk will be dropped automatically. It is calculated from a disk group attribute called DISK_REPAIR_TIME, discussed below.

In 10g, if a disk went missing it was dropped immediately and a REBALANCE operation kicked in right away, with ASM redistributing the ASM extents across the remaining disks in the disk group to restore redundancy.

DISK_REPAIR_TIME
Starting with 11g, Oracle provides a disk group attribute called DISK_REPAIR_TIME, with a default value of 3.6 hours. It means that if a disk goes missing, the disk is not dropped immediately; ASM waits for it to come back online or be replaced. This helps in scenarios where a disk is unplugged accidentally, or a storage server/SAN is disconnected or rebooted, leaving an ASM disk group without one or more disks. While the disk(s) remain unavailable, ASM keeps track of the extents that would have been written to the missing disk(s) and starts writing to them as soon as they come back online (this feature is called fast mirror resync). If the disk(s) do not come back online within the DISK_REPAIR_TIME threshold, they are dropped and a rebalance starts.

FAILGROUP_REPAIR_TIME
Starting with 12c, another attribute can be set on a disk group: FAILGROUP_REPAIR_TIME, with a default value of 24 hours. It is similar to DISK_REPAIR_TIME but applies to a whole failure group. In Exadata, all disks belonging to one storage server can belong to one failgroup (to avoid a mirror copy of an extent being written to a disk on the same storage server), and this attribute is quite handy in an Exadata environment when a complete storage server is taken down for maintenance or some other reason.
In the following we can see how to set values for the diskgroup attributes explained above.
SQL> col name format a30
SQL> select name,value from v$asm_attribute where group_number=3 and name like '%repair_time%';

NAME                           VALUE
------------------------------ --------------------
disk_repair_time               3.6h
failgroup_repair_time          24.0h

SQL> alter diskgroup data set attribute 'disk_repair_time'='1h';

Diskgroup altered.

SQL>  alter diskgroup data set attribute  'failgroup_repair_time'='10h';

Diskgroup altered.

SQL> select name,value from v$asm_attribute where group_number=3 and name like '%repair_time%';

NAME                           VALUE
------------------------------ --------------------
disk_repair_time               1h
failgroup_repair_time          10h

ORA-15042
If a disk is offline/missing from an ASM disk group, ASM may not mount the disk group automatically during an instance restart. In this case, we may need to mount the disk group manually with the FORCE option.
SQL> alter diskgroup data mount;
alter diskgroup data mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "3" is missing from group number "2"

SQL> alter diskgroup data mount force;

Diskgroup altered.

Monitoring the REPAIR_TIME
After a disk goes offline, the clock starts ticking; the value of REPAIR_TIMER can be monitored to see how much time remains to make the disk available again before it is dropped automatically.
SQL> select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN          649 OFFLINE MISSING

--We can confirm that no rebalance has started yet by using following query
SQL> select * from v$asm_operation;

no rows selected

If we are able to make the disk available again (or replace it) before DISK_REPAIR_TIME lapses, we can bring it back online. Note that we have to bring it ONLINE manually.
SQL> alter diskgroup data online disk data4;

Diskgroup altered.

select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN          465 SYNCING CACHED

--Syncing is in progress, and hence no rebalance would occur.

SQL> select * from v$asm_operation;

no rows selected
-- After some time, everything would become normal.

select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4    ORCL:DATA4 NORMAL   MEMBER             0 ONLINE  CACHED


 

If the disk cannot be made available or replaced, either ASM will drop it automatically after DISK_REPAIR_TIME has lapsed, or we drop the ASM disk manually. A rebalance occurs after the disk is dropped.
Since the disk status is OFFLINE, we need to use the FORCE option to drop it. After dropping the disk, the rebalance starts and can be monitored from the V$ASM_OPERATION view.
SQL> alter diskgroup data drop disk data4;
alter diskgroup data drop disk data4
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15084: ASM disk "DATA4" is offline and cannot be dropped.


SQL> alter diskgroup data drop disk data4 force;

Diskgroup altered.

select group_number,operation,pass,state,power,sofar,est_work from v$asm_operation;

GROUP_NUMBER OPERA PASS      STATE      POWER      SOFAR   EST_WORK
------------ ----- --------- ----- ---------- ---------- ----------
           2 REBAL RESYNC    DONE           9          0          0
           2 REBAL REBALANCE DONE           9         42         42
           2 REBAL COMPACT   RUN            9          1          0

Later we can replace the faulty disk and add the new disk back into this disk group. Adding the disk back initiates a rebalance once again.
SQL> alter diskgroup data add disk 'ORCL:DATA4';

Diskgroup altered.

SQL> select * from v$asm_operation;

select group_number,operation,pass,state,power,sofar,est_work from v$asm_operation;

GROUP_NUMBER OPERA PASS      STATE      POWER      SOFAR   EST_WORK
------------ ----- --------- ----- ---------- ---------- ----------
           2 REBAL RESYNC    DONE           9          0          0
           2 REBAL REBALANCE RUN            9         37       2787
           2 REBAL COMPACT   WAIT           9          1          0

https://www.oraclenext.com/2018/01/testing-asm-disk-failure-scenario-and.html



How to add disk to ASM
Posted on October 12, 2012 by Sher khan

How to add a disk to ASM (database) running on a production server
We have a database running on ASM; after two years we faced a space shortage.
We now plan to add a disk to the ASM diskgroup DATAGROUP.
SQL> @asm
NAME                 TOTAL_GB                FREE_GB
------------------------------ ---------- ----------
DATAGROUP            249.995117             15.2236328
IDXGROUP             149.99707              10.4892578
The steps are below.
1) Create a partition on disk /dev/sdm, the new LUN we received from storage
[root@rac-node1 ~]# fdisk -l /dev/sdm
Disk /dev/sdm: 85.8 GB, 85899345920 bytes
255 heads, 63 sectors/track, 10443 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/sdm doesn't contain a valid partition table
[root@rac-node1 ~]# fdisk /dev/sdm
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

The number of cylinders for this disk is set to 10443.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
 (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Command (m for help): n
Command action
 e extended
 p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-10443, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-10443, default 10443):
Using default value 10443
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
[root@rac-node1 ~]# fdisk -l /dev/sdm
Disk /dev/sdm: 85.8 GB, 85899345920 bytes
255 heads, 63 sectors/track, 10443 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdm1 1 10443 83883366 83 Linux
[root@rac-node1 ~]#
2) Configure the disk /dev/sdm1 for ASM, giving it the label DATA5
[root@rac-node1 ~]# /etc/init.d/oracleasm createdisk DATA5 /dev/sdm1
Marking disk "DATA5" as an ASM disk: [ OK ]
[root@rac-node1 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks: [ OK ]
[root@rac-node1 ~]# /etc/init.d/oracleasm listdisks
DATA3
DATA4
DATA5
DISK1
DISK2
INDEX2
INDEX5
[root@rac-node1 ~]#
Scandisks on RAC -node2
[root@rac-node2 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks: [ OK ]
[root@rac-node2 ~]# /etc/init.d/oracleasm listdisks
DATA3
DATA4
DATA5
DISK1
DISK2
INDEX2
INDEX5
Add the disk to /etc/rawdevices
[root@rac-node2 bin]vi /etc/sysconfig/rawdevices
/dev/raw/raw6 /dev/sdm1    ==> add this to rawdevices file
And add the following to /etc/rc.local so the permissions are set again on reboot
[root@rac-node2 bin]#vi /etc/rc.local

chmod 660 /dev/raw/raw6

Check the disk status 
SQL> set linesize 9999
SQL> ;
 SELECT
 NVL(a.name, '[CANDIDATE]') disk_group_name
 , b.path disk_file_path
 , b.name disk_file_name
 , b.failgroup disk_file_fail_group
 FROM
 v$asm_diskgroup a RIGHT OUTER JOIN v$asm_disk b USING (group_number)
 ORDER BY
* a.name
SQL> /
DISK_GROUP_NAME DISK_FILE_PATH DISK_FILE_NAME DISK_FILE_FAIL_GROUP
--------------- -------------- -------------- --------------------
DATAGROUP       ORCL:DISK1     DISK1          DISK1
DATAGROUP       ORCL:INDEX5    INDEX5         INDEX5
DATAGROUP       ORCL:DATA4     DATA4          DATA4
DATAGROUP       ORCL:DATA3     DATA3          DATA3
IDXGROUP        ORCL:DISK2     DISK2          DISK2
IDXGROUP        ORCL:INDEX2    INDEX2         INDEX2
[CANDIDATE]     ORCL:DATA5                    ==> this is the new disk
7 rows selected.
3) Add disk DATA5 to diskgroup DATAGROUP
SQL> alter diskgroup DATAGROUP ADD DISK 'ORCL:DATA5' ;
Diskgroup altered.
Check disk status again
SQL> SELECT
  2    NVL(a.name, '[CANDIDATE]') disk_group_name
  3    , b.path disk_file_path
  4    , b.name disk_file_name
  5    , b.failgroup disk_file_fail_group
  6  FROM
  7    v$asm_diskgroup a RIGHT OUTER JOIN v$asm_disk b USING (group_number)
  8  ORDER BY
  9    a.name;
DISK_GROUP_NAME DISK_FILE_PATH DISK_FILE_NAME DISK_FILE_FAIL_GROUP
--------------- -------------- -------------- --------------------
DATAGROUP       ORCL:INDEX5    INDEX5         INDEX5
DATAGROUP       ORCL:DATA4     DATA4          DATA4
DATAGROUP       ORCL:DISK1     DISK1          DISK1
DATAGROUP       ORCL:DATA5     DATA5          DATA5
DATAGROUP       ORCL:DATA3     DATA3          DATA3
IDXGROUP        ORCL:INDEX2    INDEX2         INDEX2
IDXGROUP        ORCL:DISK2     DISK2          DISK2
7 rows selected. DATA5 is no longer a candidate; it is now a member of DATAGROUP.

SQL> host cat script/asm.sql
select name,TOTAL_MB/1024 total_gb,free_mb/1024 FREE_GB from v$asm_diskgroup;
NAME             TOTAL_GB        FREE_GB
------------------------------ ---------- ----------
DATAGROUP       329.992188      95.21875
IDXGROUP        149.99707       10.4892578
SQL>
 
 
 
Configuring three IPs for SCAN listener in Oracle 11gR2 
SCAN
 
"The benefit of using the SCAN is that the connection information of the client does not need to change if you add or remove nodes in the cluster."
"Having a single name to access the cluster enables the client to use the EZConnect client and the simple JDBC thin URL to access any Database running in the cluster, independent of the active servers in the cluster. The SCAN provides load balancing and failover for client connections to the Database. The SCAN works as a cluster alias for Databases in the cluster."
"...provide information on what services are being provided by the instance, the current load, and a recommendation on how many incoming connections should be directed to the instance."
https://docs.oracle.com/database/121/JJDBC/scan.htm#JJDBC29160

"each pair of resources (SCAN VIP and Listener) will be started on a different server in the cluster, assuming the cluster consists of three or more nodes."
"In case, a 2-node-cluster is used (for which 3 IPs are still recommended for simplification reasons), one server in the cluster will host two sets of SCAN resources under normal operations.
 
 
How to configure SCAN listener with DNS?

About SCAN( Single Client Access Name) Listener in Oracle 11gR2 RAC:
Single Client Access Name (SCAN) is a new Oracle Real Application Clusters (RAC) 11g Release 2 feature that provides a single name for clients to access Oracle Databases running in a cluster. The benefit is that the client’s connect information does not need to change if you add or remove nodes in the cluster. Having a single name to access the cluster allows clients to use the EZConnect client and the simple JDBC thin URL to access any database running in the cluster, independently of which server(s) in the cluster the database is active. SCAN provides load balancing and failover for client connections to the database. The SCAN works as a cluster alias for databases in the cluster.
Why three IPs to be configured for SCAN listener through DNS (Domain Name Server):

If we configure 3 IPs for the SCAN listener through DNS, then if any one SCAN IP fails, failover happens to the other running IPs. Another benefit is that all client access is also resolved through DNS.
Failover becomes a bit of a problem if there is only one SCAN Listener. Let's assume the node where the one SCAN Listener is running on dies. Grid Infrastructure on a surviving node will start the SCAN Listener on some surviving node. It does take some time for GI to detect the failure and start it on some node. During this time, applications will not be able to connect so you'd lose high availability. Also consider the scenario where the SCAN Listener fails for some reason but the node it was running on is still operational. In that case, the SCAN Listener will not be restarted anywhere. You want more than one for high availability.

There are three ways we can configure our RAC environment:
1) Non-DNS (an IP-based RAC configuration;
only 1 SCAN IP will work)
2) DNS (a minimum of 3 SCAN IPs is required)
3) GNS (which uses DHCP)
 

Before Configuration:

$ srvctl config scan
SCAN name: prddbscan, Network: 1/101.10.1.1/255.255.255.192/en0
SCAN VIP name: scan1, IP: /prddbscan/101.10.1.4
$

Starting Configuration:
Step:1  - add three IPs in DNS , e.g.,

101.10.1.5            hrdbscan.hrapplication.com
101.10.1.6            hrdbscan.hrapplication.com
101.10.1.7            hrdbscan.hrapplication.com
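If you administer the DNS zone yourself, the three entries above correspond to three A records for the same name; a minimal sketch of BIND zone-file records (the record layout is an assumption about your DNS setup, reusing the IPs above):

; round-robin SCAN resolution: three A records for one name
hrdbscan    IN    A    101.10.1.5
hrdbscan    IN    A    101.10.1.6
hrdbscan    IN    A    101.10.1.7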

Step:2 - Stop all node scan listeners

$ srvctl stop scan_listener

Step:3 - Create the following file in /etc (on each node), adding the domain name

# vi /etc/resolv.conf
search domain hrapplication.com
nameserver      101.10.9.9

Note: the DNS/AD server domain is "hrapplication.com" and the nameserver IP is 101.10.9.9

Step: 4 - Verify with nslookup on all nodes - every node should return the three configured IPs

# nslookup hrdbscan.hrapplication.com
Server:         101.10.9.9
Address:        101.10.9.9#53

Name:   hrdbscan.hrapplication.com
Address: 101.10.1.7
Name:   hrdbscan.hrapplication.com
Address: 101.10.1.6
Name:   hrdbscan.hrapplication.com
Address: 101.10.1.5

Note: If your DNS server does not return a set of 3 IPs as shown above, or does not round-robin, ask your network administrator to enable such a setup. DNS using a round-robin algorithm on its own does not ensure failover of connections; however, the Oracle Client typically handles this. It is therefore recommended that the minimum client version used is the Oracle Database 11g Release 2 client.

Step: 5 - modify scan
#./srvctl modify scan -n hrdbscan.hrapplication.com
#./srvctl modify scan_listener -u

-- again verify

# ./srvctl config scan 

Step: 6 - start the scan listener

#./srvctl start scan_listener
#./srvctl status scan_listener

Step: 7 - Now stop the cluster services and start them again for the change to take effect

./crsctl stop crs -- one by one node

./crsctl start crs

./crsctl stat res -t
./crsctl check crs

Step: 8 - check the services

./crsctl stat res -t

HOW CONNECTION LOAD BALANCING WORKS USING SCAN

For clients connecting using Oracle SQL*Net 11g Release 2, three IP addresses will be received by the client by resolving the SCAN name through DNS as discussed. The client will then go through the list it receives from the DNS and try connecting through one of the IPs received. If the client receives an error, it will try the other addresses before returning an error to the user or application. This is similar to how client connection failover works in previous releases when an address list is provided in the client connection string.

When a SCAN Listener receives a connection request, the SCAN Listener will check for the least loaded instance providing the requested service. It will then re-direct the connection request to the local listener on the node where the least loaded instance is running. Subsequently, the client will be given the address of the local listener. The local listener will finally create the connection to the database instance.
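As an illustration, a client can reach any instance in the cluster through the SCAN alone, either with EZConnect or with a tnsnames.ora alias; the service name hrdb and the credentials below are assumptions, not from the original post:

$ sqlplus scott/tiger@//hrdbscan.hrapplication.com:1521/hrdb

HRDB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = hrdbscan.hrapplication.com)(PORT = 1521))
    (CONNECT_DATA = (SERVER = DEDICATED)(SERVICE_NAME = hrdb))
  )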

 
This document may help you. You can refer Oracle support documents for more clarification.

Case -1: Unable to start database instances after starting all cluster services:

When I tried to start the database instances after starting the cluster services on all nodes, I got the error below.

SQL> startup nomount;
ORA-00119: invalid specification for system parameter REMOTE_LISTENER
ORA-00132: syntax error or unresolved network name 'hrapplication.com:1521'

After some investigation, I found that somebody had disabled the options below during a RAM upgrade.

# vi /etc/resolv.conf
#search domain hrapplication.com
#nameserver      101.10.9.9

I uncommented the lines as shown below, then started the database services and everything was fine.

# vi /etc/resolv.conf
search domain hrapplication.com
nameserver      101.10.9.9
 

Approach to Troubleshoot an Abended OGG Process


Oracle GoldenGate is a heterogeneous replication tool. It is very easy to install and configure, but the real challenge comes when a process abends. Sometimes the problem is easy to detect; at other times it is not at all obvious how to proceed or how to troubleshoot the issue.

This article explains,

1. Levels of Failure in Oracle GoldenGate.
2. The approach to Troubleshoot Oracle GoldenGate.
3. How to identify the issue.
4. What are the files to be looked for Troubleshooting Oracle GoldenGate.
5. Tools to Monitor and Troubleshoot Oracle GoldenGate.

1. LEVELS OF FAILURE

Oracle GoldenGate can abend or fail at different levels, and there can be many reasons for Oracle GoldenGate process failures. The different levels at which Oracle GoldenGate processes fail or abend are:

1. Database Level
2. Network Level
3. Storage Level
4. User Level

a. DATABASE LEVEL OF FAILURE

Oracle GoldenGate can fail because of issues at the database level. Some of these issues are listed below:

Tablespace filled
Redo log corruption
Archive log destination filled
No Primary Key or Unique Index on the tables
Archive log Mode not enabled
Reset Log performed
Memory Problem with Streams_Pool_Size 
Database Hung

b. NETWORK LEVEL OF FAILURE

The network plays a vital role in Oracle GoldenGate replication. For every command you execute at the GGSCI prompt, the Manager process opens a port, and there should be a reliable, fast network between the source and target sides. Some of the network-level failures are listed below:

Network Fails
Network slow
Ports Unavailability
Firewall Enabled

c. STORAGE LEVEL OF FAILURE

There should be sufficient storage space available for Oracle GoldenGate to keep the trail files. At the Oracle Database level too, there should be sufficient space to retain the archive log files and space for the tablespaces. Proper privileges should be granted on the file system so that the Oracle GoldenGate processes can create the trail files in that location.

File System fills
File System corruption
No Proper privileges given to the File System
Connection Problem between Database Server and Storage
No Free Disks Available in Storage

d. USER LEVEL OF FAILURE

Of course, users make mistakes too. Some user-level failures are listed below:

Mistakenly Delete the GoldenGate Admin User at Database level.
Manually Performing Operations like Insert, Delete and Update at Target Side.
Manually deleting / removing the Trail Files either from Source server or Target server.
Forcefully Stopping any Oracle GoldenGate Processes like Manager, Extract, Pump, Collector or Replicat.
Killing the Oracle GoldenGate Processes at OS level.
Performing an ETROLLOVER at Extract / Pump / Replicat Processes.

So we have seen the different levels of failure in Oracle GoldenGate. How do you proceed when you face these failures in day-to-day operation? What is the approach for identifying the issue and solving it?

2. HOW TO APPROACH?

Below are the steps for approaching the problem. If the environment is already known to you, you can skip some of them.

Learn and Understand the Environment
Operating Provider and Operating System Version
Database Provider and Database Version
Is it a Cluster, Active / Passive?
Oracle GoldenGate UniDirectional or Bi-Directional
If Oracle, then is it a Single Instance or RAC – Real Application Clusters
Is it a Homogeneous or Heterogeneous Environment Replication
Network Flow, Ports Used and Firewalls configured
Components used in Oracle GoldenGate like Extract, Pump, Replicat processes and Trails files.

After reviewing the prerequisites (the environment study above, and so on), check whether the processes are up and running; INFO ALL is the command to check the status of the processes. A process can be in one of the following statuses.

RUNNING
The Process has started and running normally.

STOPPED
The Process has stopped either normally (Controlled Manner) or due to an error.

STARTING
The Process is starting.

ABENDED
The Process has stopped in an uncontrolled manner. Abnormal End is known as ABEND.

Of the statuses above, RUNNING, STOPPED and ABENDED are common. But what is STARTING? What actually happens when an Oracle GoldenGate process is in this state?

Whenever you start an abended Extract process, it takes some time to get started, because the process is recovering from its last abend point. To recover its processing state, the Extract process searches back through the online redo log files or archive log files to find the first log record of the transactions that were still open when it crashed. The further back the Extract process has to search, the longer it takes to recover and start. So the startup time depends on how far back in the redo logs or archive logs the oldest open transaction is.

To check the status of the Extract process, and to confirm that it is recovering properly, issue the commands shown in the sketch below.
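A hedged sketch of the typical checks from GGSCI; the group name EXT1 is an assumption, not from the original post:

GGSCI> info all
GGSCI> info extract EXT1, showch
GGSCI> send extract EXT1, status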

THREE BASIC FILES

There are many files that need to be checked whenever you face an issue in Oracle GoldenGate. Oracle GoldenGate logs its activity in the following files.

1. Error Log File – ggserr.log
2. Report File
3. Discard File
4. Trace File
5. Alert Log File
6. DDL Trace File

The first three are the basic, or major, files to look into whenever there are problems in Oracle GoldenGate. Below is an explanation of these three files.

What is Error Log file – ggserr.log?

This file is created during the installation of Oracle GoldenGate, in the Oracle GoldenGate home directory, with the name ggserr.log. Each installation of Oracle GoldenGate has its own ggserr.log in its respective directory. The file is updated by all Oracle GoldenGate processes, and the following information is logged in it:

 Start and Stop of the Oracle GoldenGate Processes.
 Processing Information like Bounded Recovery operations.
 Error Messages.
 Informational messages like normal operations happening in Oracle GoldenGate.
 WARNING Messages like Long Running Transactions.
 Commands executed in GGSCI Prompt.

Each line that the Oracle GoldenGate processes write to ggserr.log contains a timestamp, the severity (informational, warning or error), a message number and the message text.

You can view this file from the GGSCI prompt itself using the command VIEW GGSEVT, but it is usually better to view it with an OS tool because the file can grow very large.
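A hedged sketch of viewing the file; the GoldenGate home path is an assumption:

GGSCI> view ggsevt
$ tail -100 /u01/app/oracle/ogg/ggserr.log
$ grep -Ei 'error|warn' /u01/app/oracle/ogg/ggserr.log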

So with the ggserr.log file you can basically identify the following:

 What is the error?
 When did the error occur?
 How frequently did it occur?
 What operations were performed before the error occurred?

What is Report File?

A report file is a process-specific log file. Each process has its own report file, created when the process is instantiated. The file is stored in the dirrpt directory under the GoldenGate home and has the extension .rpt. It is automatically renamed on the next instantiation of the process, and once a process starts, all log entries for that process are written to its report file.

Let’s consider a process called EXT and the report file during instantiation of this process is called as EXT.rpt. If this process is stopped and started again, existing file EXT.rpt will be automatically renamed to EXT0.rpt and a new file will be generated with the name EXT.rpt and this occurs recursively till the value of the sequence reaches 9. If the last report file name for the process EXT is created as EXT9, now during the new file generation, the last file EXT9.rpt will be removed and EXT8.rpt will be renamed as EXT9.rpt. So, the report file with the lower sequence value will be the latest and younger one when compared with older sequence valued report file.

The REPORTROLLOVER parameter is used to manually or forcefully create a new report file for a process. To view the current report of a process, and to get its runtime statistics, the commands shown in the sketch below are used.
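A hedged sketch of those commands; the group names EXT1 and REP1 are assumptions, not from the original post:

GGSCI> view report EXT1
GGSCI> stats extract EXT1
GGSCI> stats replicat REP1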

The following information can be seen in the report file of a particular process:

 Oracle GoldenGate Product Version and Release
 Operating Version, Release, Machine Type, Hostname and Ulimit settings of the respective process
 Memory Utilized by the respective process
 Configured Parameters of the respective Oracle GoldenGate Process
 Database Provider, Version and Release
 Trail files Information
 Mapping of Tables
 Informational messages with respective to a particular process
 Warning messages with respective to a particular process
 Error messages with respective to a particular process
 All DDL Operations performed.
 All the Discarded Errors and Ignored Operations
 Crash dumps
 Any commands which are performed on that particular process.

Below is an example of the report file, which I have split into several parts to make it easier to follow.

1. Oracle GoldenGate Product Version and Release. Operating Version, Release, Machine Type, Hostname and Ulimit settings of the respective process

2. Configured Parameters of the respective Oracle GoldenGate Process

3. Database Provider, Version, Release and Trail File information.

4. Mapping of tables and Informational messages with respect to the Process.

5. Crash dump and Error messages of the respective process.

The examples above clearly show the contents of a report file. So, with the help of a report file, the following can be determined:

 In which trail file the process abended.
 Whether the trail file is moving forward.
 Whether the process keeps failing on the same trail file.
 What operations were performed before the process abended.
 Whether there are any errors in the parameter configuration.
 Whether the MAP statements have the correct table names.

What is Discard File?

A log file for logging failed operations of the Oracle GoldenGate processes. It is mainly used for Data errors. In Oracle GoldenGate 11g, this file is not created by default. We have to mention a keyword DISCARDFILE to enable discard file logging. But from Oracle GoldenGate 12c, this file is generated by default during the instantiation of the process.

The discard file can be named manually when it is enabled. Its extension is .dsc and it is located in the dirrpt directory.

The PURGE and APPEND keywords are used in the process parameter files to maintain the discard file manually. Similar to the report file, the discard file can also be rolled over, using the DISCARDFILEROLLOVER parameter. The options of the DISCARDFILE parameter are described below.

file_name
The relative or fully qualified name of the discard file, including the actual file name.

APPEND
Adds new content to existing content if the file already exists.

PURGE
Purges the file before writing new content.

MAXBYTES n | MEGABYTES n
File size in Bytes. For file size in bytes the valid range is from 1 to 2147483646. The default is 50000000. For file size in megabytes the valid range is from 1 to 2147. The default size is 50MB. If the specified size is exceeded, the process Abends.

NODISCARDFILE
When using this parameter, there will be no discard file creation. It prevents generating the Discard file.

The DISCARDFILE parameter is specified in the Replicat process parameter file, as in the sketch below.
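A hedged sketch; the file name and size are assumptions, not taken from the original post:

DISCARDFILE ./dirrpt/rep1.dsc, APPEND, MEGABYTES 100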

The discard file is mainly used on the target side. Each and every Replicat process should have its own discard file; this is mandatory.

In one case from my environment, a Replicat process abended due to the error OCI Error ORA-01403: no data found, and the failed operation was recorded in its discard file.

So, we have seen the three basic and important files where the Oracle GoldenGate processes log their information. There is also a tool that is used to troubleshoot Oracle GoldenGate when data or trail-file corruption occurs; it is mainly used when a data error occurs in Oracle GoldenGate.

The tool is called LOGDUMP. It is a very useful tool that allows a user to navigate through a trail file and compare the information in the trail file with the data extracted and replicated by the processes. The following can be seen in the trail file using the LOGDUMP utility:

 Transactions Information
 Operation type and Time when the Record written.
 Source Object name
 Image type, whether it is a Before Image or After Image.
 Column information with data and sequence information.
 Record length, Record data in ASCII format.
 RBA Information.

Using the LOGDUMP utility you can inspect the contents of a trail file record by record (a short session sketch follows the command list below).

Some of the logdump commands, with descriptions, are listed below. To get to the logdump prompt, just run the logdump program from the Oracle GoldenGate home directory.

Logdump 1> GHDR ON – To view the Record Header.

Logdump 2> DETAIL ON – To view the column information.

Logdump 3> DETAIL DATA – To view the Hex and ASCII values of the Column.

Logdump 4> USERTOKEN ON – User defined information specified in the Table of Map statements. These information are stored in the Trail file.

Logdump 4> GGSTOKEN ON – Oracle GoldenGate generated tokens. These tokens contains the Transaction ID, Row ID etc.,

Logdump 5> RECLEN length – Manually control the length of the record.

Logdump 6> OPEN file_name – To open a Trail file.

Logdump 7> NEXT – To move to the next File record. In short, you can use the letter N.

Logdump 8> POS rba – To position to a particular RBA.

Logdump 9> POS FIRST – To go to the position of the first record in the file.

Logdump 10> POS 0 – This is the alternate command for the POS FIRST. Either of this can be used.

Logdump 11> SCANFORENDTRANS – To go to the end of the transaction.

Logdump 12> HELP – To get the online help.

Logdump 13> EXIT – To exit from the Logdump prompt. You can also use QUIT alternatively.
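A hedged sketch of a short logdump session using the commands above; the GoldenGate home and the trail file name are assumptions, not from the original post:

$ cd /u01/app/oracle/ogg
$ ./logdump
Logdump 1> ghdr on
Logdump 2> detail data
Logdump 3> open ./dirdat/lt000000042
Logdump 4> pos 0
Logdump 5> next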

Hopefully you now have a clear view of how to approach an Oracle GoldenGate problem, find out who stopped an Oracle GoldenGate process, and identify the reason behind it.

 

-----------------------Standby--------------



Comparative study of Standby database from 7.3 to latest version

12c Data Guard Agenda
- Physical Standby
- Logical Standby
- Snapshot Standby
- Active Data Guard
- Data Guard Broker
- Architecture
- Configurations
- Standby Creations using Commands and OEM
- 12c New Features (Far Sync, Fast Sync, etc.)
- Far Sync Lab
- Data Protection Modes
- Role Transitions
- Flashback Database
- Fast-Start Failover (Observer Software)
- Backup and Recovery in DG
- Patching
- Optimization DG



High Availability Solutions from Oracle
- RAC
- RAC ONE Node
- Data Guard
- Golden Gate
- Oracle Streams

History

Version 7.3
- keeping duplicate DB in a separate server
- can be synchronized with Primary Database
- was constantly in Recovery Mode
- NOT able to automate the transfer of Archive Redo Logs
and Apply
- DBAs had to find their own way to transfer and apply
the Archive Redo Logs
- aim was disaster recovery


Version 8i
- Archive log shipping and apply process automatic
- which is now called
- managed standby environment (log shipping)
- managed recovery (apply process)
- was not possible to set a DELAY in the managed recovery mode
- possible to open a Standby with read-only mode
for reporting purpose
- when we added a data file or created a tablespace on the Primary,
these changes were NOT replicated to the Standby
- when we opened the Primary with resetlogs
or restored a backup control file,
we had to re-create the Standby



Version 9i
- Oracle 8i Standby was renamed to Oracle 9i Data Guard
- introduced Data Guard Broker
- ZERO Data Loss on Failover was guaranteed
- Switchover was introduced (primary <> standby)
- Gap resolution (missing logs detected
and transmitted automatically)
- DELAY option was added
- parallel recovery increases recovery performance on Standby
- Logical Standby was introduced



Version 10g
- Real-Time Apply (provides faster switchover and failover)
- Flashback Database support was introduced
- if we open a Primary with resetlogs,
it was NOT required to re-create the Standby
- Standby was able to recover through resetlogs
- Rolling Upgrades of Primary Database
- Fast-Start-Failover (Observer Software)

Version 11g
- Active Data Guard
- Snapshot Standby (possible with 10g R2
guaranteed restore point)
- continuous archived log shipping with snapshot standby
- compress REDO when resolving Gaps => 11g R1
- compress of all REDO => 11g R2
- possible to include different O/S in DataGuard
- recovery of Block corruptions automatic
for Active Data Guard
- "Block Change Tracking" can be run on Active Data Guard

Version 12c
- Far Sync
- Fast Sync
- Session Sequence
- temp as UNDO
- Rolling Upgrade using PL/SQL Package (DBMS_Rolling)


LNS=Log Writer Network Server
RFS=Remote File Server
MRP=Managed Recovery Process
LSP=Logical Standby Process
DMON=Data Guard Broker Monitor Process
NSS = Network Server SYNC


MRP = coordinates read and apply process of REDO on Physical Standby

RFS = responsible for receiving the REDO Data,
which is sent by the Primary to STandby

LGWR and SYNC
- REDO is read and sent to the Standby directly
from the log buffer by the LNS process
- an acknowledgement is needed from the standby (RFS to LNS
and LNS to LGWR)
before the COMMIT acknowledgement is sent to the database user


LGWR and ASYNC
- NO acknowledgement from the standby is needed before the COMMIT
acknowledgement is sent on the Primary database
- redo is read and sent to the standby from the redo log buffer
or the online redo logs by the LNS process
- if the redo log buffer is recycled before LNS has read it,
LNS automatically reads and sends the redo data
from the online redo logs
- committed transactions that were not yet shipped
to the standby may be lost in a Failover




FAL_CLIENT
- no longer required from 11g R2
- primary DB will obtain the Client Service Name
from the related LOG_ARCHIVE_DEST_n



How to start REDO Apply as Foreground Process?

alter database recover managed standby database;


How to start REDO Apply as Background Process?

use DISCONNECT FROM SESSION option,


alter database recover managed standby database
disconnect from session;


How to cancel REDO Apply?

alter database recover managed standby database cancel;


I found this... check "4.7.10.1 Data Guard Status" in http://docs.oracle.com/cd/E11857_01/em.111/e16285/oracle_database.htm#CHDBEAFG


How to Resolve Primary/Standby Log GAP In Case of Deleting Archivelogs From Primary?
I will write about resolving the Primary/Standby log gap in the case where some archive log files have been deleted from the primary. Suppose that we do not have a backup of the deleted archive files. Normally we (DBAs) should not allow such a situation, but it can happen. In this case, we need to find the current SCN of both the Primary and the Standby databases.
1- Find the current SCN with the following query on the Primary.
SQL> select current_scn from v$database;
CURRENT_SCN
———–
1289504966
2- Find the current SCN with the following query on the Standby.
SQL> select current_scn from v$database;
CURRENT_SCN
———–
1289359962
Using the function scn_to_timestamp(scn_number), you can check the time difference between the primary and the standby.
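A hedged sketch of that check, run on the primary with the two SCNs from steps 1 and 2 (it only works while both SCNs are still within the SCN-to-timestamp mapping retention):

SQL> select scn_to_timestamp(1289504966) - scn_to_timestamp(1289359962) as gap from dual;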
3- Stop apply process on the Standby database.
SQL> alter database recover managed standby database cancel;
4– Shutdown the Standby database.
SQL> shutdown immediate;
5- Take incremental backup from the latest SCN number of the Standby database on the Primary database. And copy backup to the standby server.
RMAN> backup incremental from scn 1289359962 database;
# scp /backup_ISTANBUL/dun52q66_1_1 oracle@192.168.2.3:/oracle/ora11g
6- Create new standby control file on the Primary database. And copy this file to standby server.
SQL> alter database create standby controlfile as '/oracle/ora11g/standby.ctl';
# scp /oracle/ora11g/standby.ctl oracle@192.168.2.3:/oracle/ora11g
7- Open the Standby database on NOMOUNT state to learn control files location.
SQL> startup nomount
SQL> show parameter control_files
8- Replace new standby control file with old files.
# cp /oracle/ora11g/standby.ctl /oracle/ora11g/ISTANBUL/data1/control01.ctl
# cp /oracle/ora11g/standby.ctl /oracle/ora11g/ISTANBUL/data2/control02.ctl
9- Open the Standby database on MOUNT state.
SQL> alter database mount standby database;
10- Connect to the RMAN and register backup to catalog.
# rman target /
RMAN> catalog start with '/oracle/ora11g';
It will ask for confirmation; answer "y".
11- Now, you can recover the Standby database. Start recover database.
RMAN> recover database;
When the recovery of the database is finished, RMAN searches for the latest archive file and raises an ORA-00334 error. In this case, don't worry about it. Exit from RMAN and start the apply process on the standby database.
SQL> alter database recover managed standby database disconnect from session;
We solved the Primary/Standby log gap with an RMAN incremental backup. When faced with such a situation, we do not need to think about rebuilding the standby database from scratch, because time is very valuable for us.

SRL – standby redo log


How the Standby apply process works

In a configuration where SRLs are not configured on the standby database, redo transport flows through the system along this path:


A transaction writes redo records into the Log Buffer in the System Global Area (SGA).
The Log Writer process (LGWR) writes redo records from the Log Buffer to the Online Redo Logs (ORLs).
When the ORL switches to the next log sequence (normally when the ORL fills up), the Archiver process (ARC0) will copy the ORL to the Archived Redo Log.
Because a standby database exists, a second Archiver process (ARC1) will read from a completed Archived Redo Log and transmit the redo over the network to the Remote File Server (RFS) process running for the standby instance.
RFS sends the redo stream to the local Archiver process (ARCn).
ARCn then writes the redo to the archived redo log location on the standby server.
Once the archived redo log is completed, the Managed Recovery Process (MRP0) sends the redo to the standby instance for applying the transaction.



With SRLs, not only do we have more items in the picture, we also have different choices, i.e. different paths to get from the primary to the standby. The first choice is to decide if we are configured for Max Protect or Max Performance as I will discuss its impact below.

Just like without SRLs, a transaction generates redo in the Log Buffer in the SGA.
The LGWR process writes the redo to the ORL.
Are we in Max Protect or Max Performance mode?
If Max Protect, then we are performing SYNC redo transport. The Network Server SYNC process (NSSn) is a slave process to LGWR. It ships redo to the RFS process on the standby server.
If Max Performance mode, then we are performing ASYNC redo transport. The Network Server ASYNC process (NSAn) reads from the ORL and transports the redo to the RFS process on the standby server.
RFS on the standby server simply writes the redo stream directly to the SRLs.
How the redo gets applied depends if we are using Real Time Apply or not.
If we are using Real Time Apply, MRP0 will read directly from the SRLs and apply the redo to the standby database.
If we are not using Real Time Apply, MRP0 will wait for the SRL’s contents to be archived and then once archived and once the defined delay has elapsed, MRP0 will apply the redo to the standby database.
Best Practices

I’ve already covered a few best practices concerning SRLs. I’ll recap what I have already covered and include a few more in this section.

Make sure your ORL groups all have the same exact size.  You want every byte in the ORL to have a place in its corresponding SRL.
Create the SRLs with the same exact byte size as the ORL groups. If they can’t be the same exact size, make sure they are bigger than the ORLs.
Do not assign the SRLs to any specific thread.  That way, the SRLs can be used by any thread, even with Oracle RAC primary databases.
When you create SRLs in the standby, create SRLs in the primary. They will normally never be used. But one day you may perform a switchover operation. When you do switchover, you want the old primary, now a standby database, to have SRLs. Create them at the same time.
For an Oracle RAC primary database, create the number of SRLs equal to the number of ORLs in all primary instances. For example, if you have a 3-node RAC database with 4 ORLs in each thread, create 12 SRLs (3x4) in your standby. No matter how many instances are in your standby, the standby needs enough SRLs to support all ORLs in the primary, for all instances
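A minimal sketch of creating thread-independent SRLs sized like the ORLs and verifying them; the file paths and the 512 MB size are assumptions, not from the original post:

SQL> alter database add standby logfile ('/u01/oradata/orcl/srl01.log') size 512m;
SQL> alter database add standby logfile ('/u01/oradata/orcl/srl02.log') size 512m;
SQL> select group#, thread#, bytes/1024/1024 mb, status from v$standby_log;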

DB_NAME – both the primary and the physical standby database will have the same database name. After all, the standby is an exact copy of the primary and its name needs to be the same.
DB_UNIQUE_NAME – While both the primary and the standby have the same database name, they have a unique name. In the primary, DB_UNIQUE_NAME equals DB_NAME. In the standby, DB_UNIQUE_NAME does not equal DB_NAME.  The DB_UNIQUE_NAME will match the ORACLE_SID for that environment.
LOG_ARCHIVE_FORMAT – Because we have to turn on archive logging, we need to specify the file format of the archived redo logs.
LOG_ARCHIVE_DEST_1 – The location on the primary where the ORLs are archived. This is called the archive log destination.
LOG_ARCHIVE_DEST_2 – The location of the standby database.
REMOTE_LOGIN_PASSWORDFILE – We need a password file because the primary will sign on to the standby remotely as SYSDBA.
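A hedged sketch of how these parameters might look on the primary, reusing the CHICAGO/BOSTON names that appear later in this post; paths and values are assumptions:

db_name='CHICAGO'
db_unique_name='CHICAGO'
log_archive_config='DG_CONFIG=(CHICAGO,BOSTON)'
log_archive_format='arch_%t_%s_%r.log'
log_archive_dest_1='LOCATION=/u01/arch/CHICAGO'
log_archive_dest_2='SERVICE=BOSTON ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=BOSTON'
remote_login_passwordfile='EXCLUSIVE'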


Create the Standby Database

This is the second major task to complete. In this section, we will create a backup of the primary and use it to create the standby database. We need to create a special standby control file. A password file, a parameter file, and a similar TNS alias will complete the setup. The subtasks are outlined below.

Create a backup of the primary.
Create a standby controlfile.
Copy the password file.
Create a parameter file.
Create a TNSNAMES.ORA file.
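A hedged sketch of those subtasks as run from the primary; the host name, paths and the BOSTON standby name are assumptions, not from the original post:

RMAN> backup database plus archivelog;
SQL> alter database create standby controlfile as '/tmp/boston.ctl';
SQL> create pfile='/tmp/initBOSTON.ora' from spfile;
$ scp $ORACLE_HOME/dbs/orapwCHICAGO boston-host:$ORACLE_HOME/dbs/orapwBOSTON
$ scp /tmp/boston.ctl /tmp/initBOSTON.ora boston-host:/tmp/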


How to Speed up and Troubleshooting MRP (Log Apply Rate of a Standby Database) Stuck Issues
To Speed up MRP on Standby database, Check for

1) parallel_execution_message_size - this is an OS dependent parameter
2) recovery_parallelism - this will be dictated by the numbers of CPU's and your ability to handle IO
3) Consider increasing the sga_target parameter, if it's set too low.
4) Check for Disk I/O. Move you I/O intensive files to faster disks including Online Redo log and Standby redo log files.

I've come across the links below, which I've found very useful in this regard.


Dataguard Performance
Edit (07,2013):
The following information is important about Physical Data Guard Redo Apply performance:
11g Media Recovery performance improvements include:
•More parallelism by default
•More efficient asynchronous redo read, parse, and apply
•Fewer synchronization points in the parallel apply algorithm
•The media recovery checkpoint at a redo log boundary no longer blocks the apply of the next log

In 11g, when tuning redo apply consider following:

•By default recovery parallelism = CPU Count-1. Do not use any other values.
•Keep PARALLEL_EXECUTION_MESSAGE_SIZE >= 8192
•Keep DB_CACHE_SIZE >= Primary value
•Keep DB_BLOCK_CHECKING = FALSE (if you have to)
•System Resources Needs to be assessed
•Query what MRP process is waiting
select a.event, a.wait_time, a.seconds_in_wait
from   gv$session_wait a, gv$session b
where  a.sid = b.sid
and    b.sid = (select sid from v$session
                where paddr = (select paddr from v$bgprocess where name = 'MRP0'));

Check: Active Data Guard 11g Best Practices Oracle Maximum Availability Architecture White Paper


When tuning redo transport service, consider following:

1 - Tune LOG_ARCHIVE_MAX_PROCESSES parameter on the primary.
•Specifies the parallelism of redo transport
•Default value is 2 in 10g, 4 in 11g
•Increase if there is high redo generation rate and/or multiple standbys
•Must be increased up to 30 in some cases.
•Significantly increases redo transport rate.
2 - Consider using Redo Transport Compression:
•In 11.2.0.2 redo transport compression can be always on
•Use if network bandwidth is insufficient
•and CPU power is available


Also consider:
3 - Configuring TCP Send / Receive Buffer Sizes (RECV_BUF_SIZE / SEND_BUF_SIZE)
4 - Increasing SDU Size
5 - Setting TCP.NODELAY to YES
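A hedged sketch of the corresponding Oracle Net settings (sqlnet.ora on both the primary and standby sites); the buffer sizes are assumptions to be sized to your bandwidth-delay product:

DEFAULT_SDU_SIZE=65535
RECV_BUF_SIZE=10485760
SEND_BUF_SIZE=10485760
TCP.NODELAY=YES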



Check: Redo Transport Services Best Practices Chapter of Oracle® Database High Availability Best Practices 11g Release 1
-------------------------------------------------------------------
Original Post:
Problem: The recovery service had been stopped for a while and a gap built up between the primary and the standby. After the recovery process was started again, the standby could not catch up with the primary because of low log-apply performance. Disk I/O and memory utilization on the standby server were nearly 100%.

Solution:
1 – Rebooting the standby server reduced memory utilization a little.
2 – ALTER DATABASE RECOVER MANAGED STANDBY DATABASE PARALLEL 8 DISCONNECT FROM SESSION;
In general, using the parallel recovery option is most effective at reducing recovery time when several datafiles on several different disks are being recovered concurrently. The performance improvement from the parallel recovery option is also dependent upon whether the operating system supports asynchronous I/O. If asynchronous I/O is not supported, the parallel recovery option can dramatically reduce recovery time. If asynchronous I/O is supported, the recovery time may be only slightly reduced by using parallel recovery.
3 – SQL>alter system Set PARALLEL_EXECUTION_MESSAGE_SIZE = 4096 scope = spfile;
Set PARALLEL_EXECUTION_MESSAGE_SIZE = 4096
When using parallel media recovery or parallel standby recovery, increasing the PARALLEL_EXECUTION_MESSAGE_SIZE database parameter to 4K (4096) can improve parallel recovery by as much as 20 percent. Set this parameter on both the primary and standby databases in preparation for switchover operations. Increasing this parameter requires more memory from the shared pool by each parallel execution slave process.
4 – Kernel parameters that changed in order to reduce file system cache size.
dbc_max_pct 10 10 Immed
dbc_min_pct 3 3 Immed
5 – For secure path (HP) load balancing, SQL Shortest Queue Length is chosen.
autopath set -l 6005-08B4-0007-4D25-0000-D000-025F-0000 -b SQL







Oracle 12c multiple physical standby setup and consideration






I have a question regarding having multiple physical standby databases at the same time (redo apply/log shipping) on the same server: is it possible? If so, what would be the DB_NAME for my second standby, based on the values of the parameters below?

Primary
*.DB_NAME=CHICAGO
*.db_unique_name='CHICAGO'
*.log_archive_config='DG_CONFIG=(CHICAGO, BOSTON, TORONTO)'
*.log_archive_dest_1='location=/adjarch/CHICAGO reopen=60'
*.log_archive_dest_2='SERVICE=BOSTON NOAFFIRM ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=BOSTON'
*.log_archive_dest_3='SERVICE=TORONTO NOAFFIRM ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=TORONTO'
*.log_archive_dest_state_2='DEFER'
*.log_archive_dest_state_3='DEFER'
*.log_archive_format='archCHICAGO_%t_%s_%r.log'

target1:
*.DB_NAME=CHICAGO
*.db_unique_name='BOSTON'
*.log_archive_config='DG_CONFIG=(CHICAGO, BOSTON)'
*.log_archive_dest_1='location=/adjarch/CHICAGO reopen=60'
*.log_archive_dest_2='SERVICE=CHICAGO NOAFFIRM ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=CHICAGO'

Target2:
*.DB_NAME=????
*.db_unique_name='TORONTO'
*.log_archive_config='DG_CONFIG=(CHICAGO,TORONTO)'
*.log_archive_dest_1='location=/adjarch/CHICAGO reopen=60'
*.log_archive_dest_3='SERVICE=CHICAGO NOAFFIRM ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=CHICAGO'
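Since a physical standby keeps the primary's DB_NAME (only DB_UNIQUE_NAME differs, as noted in the parameter descriptions earlier in this post), a hedged sketch of the completed Target2 values would be:

*.DB_NAME=CHICAGO
*.db_unique_name='TORONTO'
(with the remaining TORONTO parameters as already listed above)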



In this paper, I thought I would provide a few things that you may not know about SRLs. Some of this information was covered in the previous article, but it’s good to have all of this in one location. I write these items in no particular order.

Do not assign SRLs to a specific thread. – There is a temptation for DBAs who work on Oracle RAC databases to assign a SRL to a specific thread of redo.  The RAC DBA is already familiar with creating Online Redo Logs (ORLs) for a specific thread, one for each instance in the RAC database. So they must do similarly for SRLs, correct? The answer is no. Do not assign SRLs to a specific thread. If the SRL is assigned to a specific thread, then it can only be used by that thread and no other. If the SRL is not assigned to a thread, it can be used by any thread.

SRLs do not rotate like ORLs. – Most DBAs are used to seeing Online Redo Logs rotate. If there are three ORL groups, redo will be written to group 1, then group 2, and then group 3, and then back to group 1 again. Many DBAs working with SRLs for the first time assume SRLs rotate the same way, but they do not. If you have SRL groups 10, 11, 12, and 13 then the first redo transport stream will be written to SRL group 10. The next one will be written to SRL group 11. If group 10 becomes available again, the third redo stream will be written to SRL group 10. It is possible that SRL group 13 never gets used.

You should have one more SRL group than ORL groups – If you go back to the article I linked at the start, there is a second diagram showing the flow of redo when SRLs are in place. Either MRP0 or ARCn is reading from the SRL and applying redo or creating an archived redo log. No matter which route is taken for the redo, the process can take some time. It is a good idea to have an extra SRL in case the redo writes from the SRLs take extra time. Remember, for Oracle RAC primary databases, to count the groups from all primary threads.

SRLs provide a near-zero data loss solution, even in Max Performance mode. – As I stated in the previous article, SRLs are great for achieving a near-zero data loss solution even in Max Performance mode. If you look at the second diagram in that article, you can see that the primary will transport redo to the SRL in near real time. You would use Max Protect mode when you absolutely cannot afford data loss, but SRLs get you close to near-zero data loss which is often good enough for most people. Here is a production database (name changed to protect the innocent) as seen in the DG Broker. We can see the configuration is Max Performance mode.

DGMGRL> show configuration

Configuration - orcl_orcls

  Protection Mode: MaxPerformance
DGMGRL> show database orcls

Database - orcls

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          4 hours 29 minutes 21 seconds (computed 0 seconds ago)
  Average Apply Rate: 5.02 MByte/s

We can also see a transport lag of 0 seconds. I typically see between 0 and 2 seconds of transport lag on this system. One might say that this system does not generate a lot of redo, which is why I included the Average Apply Rate in the output. This standby is applying 5 megabytes per second on average, which means the primary is generating lots of redo. This is not an idle system, and I am still seeing a near-zero data loss implementation.

For RAC standby, any node can receive redo to the SRLs. – If you have an Oracle RAC primary and an Oracle RAC standby, then any node in the primary can transport redo to the SRLs on any node in the standby. This is why I use VIPs for my LOG_ARCHIVE_DEST_2 parameter settings in my configuration. I now have high availability for my redo transport. You may see SRLs active on any node in your RAC standby. That being said, in 12.1.0.2 and earlier, only one node will perform the redo apply.

If you use an Apply Delay, redo apply won’t take place until the log switch occurs. – If you look at my output from the DG Broker above, you’ll see a 0 second transport lag but an apply lag of almost 4.5 hours. I have set an Apply Delay of 4 hours in this standby database. Because the redo is written to the SRL, the redo from that SRL is not available to be applied until the log switch completes and that log switch passes the same apply delay. In my primary databases, I often set ARCHIVE_LAG_TARGET to 1 hour so that I have ORL switches at most once per hour. The apply lag in the standby will often be between Apply Delay and Apply Delay+ARCHIVE_LAG_TARGET. With my configuration, my apply lag is often between 4 and 5 hours. If you use an apply delay, then it’s a good idea to set the ARCHIVE_LAG_TARGET as well. Too many people miss this point and assume the apply lag should be very close to the Apply Delay setting.
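As a hedged example of wiring this up (the standby name orcls comes from the broker output above; 240 minutes and 3600 seconds are illustrative values, not recommendations):

DGMGRL> edit database orcls set property DelayMins=240;

-- on the primary, cap the time between log switches to one hour
SQL> alter system set archive_lag_target=3600 scope=both;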

Redo Transport does not use ARCn if SRLs are in place. – Too often, I see a posting in some forum where it is assumed that ARCn is the only process that transports redo to the standby. If SRLs are in place, NSAn is used instead, or LNS if prior to 12c. Refer back to the diagrams in my earlier paper for details on how log transport works.

SRLs should be the same size and at least as large as the biggest ORL. – I try to keep the ORL and SRL groups all set to the same exact size. But if there are mixed ORL sizes, then make sure the SRLs are sized to the largest of the ORL groups. Redo transport can be written to any of the SRLs, so all of the SRLs need to be sized to handle redo from any ORL.
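A simple way to check, again using only v$log and v$standby_log, is to compare sizes; every SRL should be at least as big as the largest ORL:

SQL> select thread#, group#, bytes/1024/1024 mb from v$log order by 1, 2;
SQL> select thread#, group#, bytes/1024/1024 mb from v$standby_log order by 1, 2;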



Why are jobs running slow on a particular node?
How to improve performance of the managed recovery process?

What is an Oracle wallet?
If the extract process fails, how do you troubleshoot it?

How to set up maximum protection and maximum availability databases

setup:

    There is a Primary Database in Maximum Protection Mode having at least two associated Standby Databases.
    Both Standby Databases are serving the Maximum Protection Mode, i.e. Log Transport Services to these Standby Databases are using 'LGWR SYNC AFFIRM'
    One or both Standby Databases are Physical Standby Databases in Active Data Guard Mode or at least open 'READ ONLY'


Behaviour:

If we now try to shut down such a Standby Database which is open READ ONLY, it fails with

ORA-01154: database busy. Open, close, mount, and dismount not allowed

although the remaining Standby Databases are serving the Maximum Protection Mode, too.

In the ALERT.LOG we can find Entries like this:

Attempt to shut down Standby Database
Standby Database operating in NO DATA LOSS mode
Detected primary database alive, shutdown primary first, shutdown aborted

Cause
If the Primary Database is in Maximum Protection Mode, all associated Standby Databases serving this Protection Mode are considered 'No Data Loss' Standby Databases and so cannot be shut down as long as the Primary Database is in this Protection Mode or still alive.
Solution
If you want to shut down this Standby Database only, there are two possibilities:

1. Use 'shutdown abort', which will force the shutdown of the Standby Database. Typically this should not harm the Standby Database; however, ensure that Log Apply Services (Managed Recovery) are stopped before you issue this command. So you can use:

SQL> alter database recover managed standby database cancel;
SQL> shutdown abort

2. Set the state of the corresponding log_archive_dest_n serving this Standby Database to 'defer' on the Primary Database (and perform a log switch to make this change effective). Then you can shut down the Standby Database in any way after the RFS processes have terminated on this Standby Database (if they do not terminate in a timely manner you can also kill them using the OS kill command).

On the Primary, set the state to 'defer', e.g. for log_archive_dest_2
SQL> alter system set log_archive_dest_state_2='defer' scope=memory;
SQL> alter system switch logfile;

Then on the Standby you can shut down (e.g. shutdown immediate)
SQL> shutdown immediate;

To find still-running RFS processes and their PIDs you can use this query:
SQL> select process, pid, status from v$managed_standby

If you have to kill RFS processes you can do this using the OS kill command:
$ kill -9 <pid>

For both cases, ensure there is at least one surviving Standby Database available still serving the Maximum Protection Mode.
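Before shutting down a standby in this scenario it can also help to confirm what the database itself reports; a minimal check using standard v$database columns:

SQL> select protection_mode, protection_level from v$database;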
=======================

What happens when EVMD goes down, and how do you troubleshoot the clusterware?

This note gives the output of the 'ps' command on pre-11gR2 releases of Oracle CRS and shows all clusterware processes running. It also helps to diagnose the state the clusterware is in, based on the 'ps' output. Note 1050908.1 explains the same for 11gR2 onwards.
Solution

Introduction

All the clusterware processes are normally retrieved via OS commands like:
ps -ef | grep -E 'init|d.bin|ocls|sleep|evmlogger|oprocd|diskmon|PID'

There are general processes, i.e. processes that need to be started on all platforms/releases
and specific processes, i.e. processes that need to be started on some CRS versions/platforms

a. the general processes are

ocssd.bin
evmd.bin
evmlogger.bin
crsd.bin

b. the specific processes are

oprocd: run on Unix when vendor Clusterware is not running. On Linux, only starting with 10.2.0.4.
oclsvmon.bin: normally run when a third party clusterware is running
oclsomon.bin: check program of the ocssd.bin (starting in 10.2.0.1)
diskmon.bin: new 11.1.0.7 process for exadata
oclskd.bin: new 11.1.0.6 process to reboot nodes in case rdbms instances are hanging

There are three fatal processes, i.e. processes whose abnormal halt or kill will provoke a node reboot (see note:265769.1):
1. the ocssd.bin
2. the oprocd.bin
3. the oclsomon.bin
The other processes are automatically restarted when they go away.

Overview of the 'ps' output

A. When all clusterware processes are started

1. CRS 10.2.0.3 on Solaris without third party clusterware

ps -ef | /usr/xpg4/bin/grep -E 'init|d.bin|ocls|sleep|evmlogger|UID'
     UID   PID  PPID   C    STIME TTY         TIME CMD
    root     1     0   0   Aug 25 ?          43:22 /sbin/init
    root   799     1   1   Aug 25 ?        1447:06 /bin/sh /etc/init.d/init.cssd fatal
    root   797     1   0   Aug 25 ?           0:00 /bin/sh /etc/init.d/init.evmd run
    root   801     1   0   Aug 25 ?           0:00 /bin/sh /etc/init.d/init.crsd run
    root  1144   799   0   Aug 25 ?           0:00 /bin/sh /etc/init.d/init.cssd daemon
    root  1091   799   0   Aug 25 ?           0:00 /bin/sh /etc/init.d/init.cssd oprocd
    root  1107   799   0   Aug 25 ?           0:00 /bin/sh /etc/init.d/init.cssd oclsomon
  oracle  1342  1144   0   Aug 25 ?         687:50 /u01/app/oracle/crs/10.2/bin/ocssd.bin
    root  1252  1091   0   Aug 25 ?          25:45 /u01/app/oracle/crs/10.2/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90
  oracle  1265  1107   0   Aug 25 ?           0:00 /bin/sh -c cd /u01/app/oracle/crs/10.2/log/artois1/cssd/oclsomon; ulimit -c unl
  oracle  1266  1265   0   Aug 25 ?         125:34 /u01/app/oracle/crs/10.2/bin/oclsomon.bin
    root 22137   799   0 07:10:38 ?           0:00 /bin/sleep 1
  oracle  1041   797   0   Aug 25 ?          68:01 /u01/app/oracle/crs/10.2/bin/evmd.bin
  oracle  1464  1041   0   Aug 25 ?           2:58 /u01/app/oracle/crs/10.2/bin/evmlogger.bin -o /u01/app/oracle/crs/10.2/evm/log/
    root  1080   801   0   Aug 25 ?        2299:04 /u01/app/oracle/crs/10.2/bin/crsd.bin reboot

2. CRS 10.2.0.3 on HP/UX with HP Serviceguard
ps -ef | /usr/bin/grep -E 'init|d.bin|ocls|sleep|evmlogger|UID'
     UID   PID  PPID  C    STIME TTY       TIME COMMAND
    root     1     0  0  Nov 13  ?        12:58 init
    root 17424     1  0  Dec 17  ?        136:39 /bin/sh /sbin/init.d/init.cssd fatal
    root 17425     1  0  Dec 17  ?         0:00 /bin/sh /sbin/init.d/init.crsd run
    root 17624 17424  0  Dec 17  ?         0:00 /bin/sh /sbin/init.d/init.cssd daemon
   haclu 17821 17624  0  Dec 17  ?        268:13 /haclu/64bit/app/oracle/product/crs102/bin/ocssd.bin
    root 17621 17424  0  Dec 17  ?         0:00 /bin/sh /sbin/init.d/init.cssd oclsvmon
   haclu 17688 17621  0  Dec 17  ?         0:00 /bin/sh -c cd /haclu/64bit/app/oracle/product/crs102/log/cehpclu7/cssd/oclsvmon; ulimit -c unlimited; /haclu/64bit/app/oracle/p
   haclu 17689 17688  0  Dec 17  ?         8:04 /haclu/64bit/app/oracle/product/crs102/bin/oclsvmon.bin
    root 17623 17424  0  Dec 17  ?         0:00 /bin/sh /sbin/init.d/init.cssd oclsomon
   haclu 17744 17623  0  Dec 17  ?         0:00 /bin/sh -c cd /haclu/64bit/app/oracle/product/crs102/log/cehpclu7/cssd/oclsomon; ulimit -c unlimited; /haclu/64bit/app/oracle/p
   haclu 17750 17744  0  Dec 17  ?        158:34 /haclu/64bit/app/oracle/product/crs102/bin/oclsomon.bin
    root 11530 17424  1 14:13:28 ?         0:00 /bin/sleep 1
   haclu  5727     1  0 13:49:56 ?         0:00 /haclu/64bit/app/oracle/product/crs102/bin/evmd.bin
   haclu  5896  5727  0 13:49:59 ?         0:00 /haclu/64bit/app/oracle/product/crs102/bin/evmlogger.bin -o /haclu/64bit/app/oracle/product/crs102/evm/log/evmlogger.info -l /h
    root 17611 17425  0  Dec 17  ?        163:50 /haclu/64bit/app/oracle/product/crs102/bin/crsd.bin reboot

3. CRS 10.2.0.4 on AIX with HACMP installed
# ps -ef | grep -E 'init|d.bin|ocls|sleep|evmlogger|UID'
     UID    PID   PPID   C    STIME    TTY  TIME CMD
    root      1      0   0   Dec 23      -  0:56 /etc/init
    root 106718      1   0   Jan 05      - 25:01 /bin/sh /etc/init.cssd fatal
    root 213226      1   0   Jan 05      -  0:00 /bin/sh /etc/init.crsd run
    root 278718      1   0   Jan 05      -  0:00 /bin/sh /etc/init.evmd run
    root 258308 106718   0   Jan 05      -  0:00 /bin/sh /etc/init.cssd daemon
   haclu 299010 348438   0   Jan 05      - 12:24 /haclu/64bit/app/oracle/product/crs102/bin/ocssd.bin
    root 315604 106718   0   Jan 05      -  0:00 /bin/sh /etc/init.cssd oclsomon
   haclu 303300 315604   0   Jan 05      -  0:00 /bin/sh -c cd /haclu/64bit/app/oracle/product/crs102/log/celaixclu3/cssd/oclsomon; ulimit -c unlimited; /haclu/64bit/app/oracle/product/crs102/bin/oclsomon  || exit $?
   haclu 278978 303300   0   Jan 05      -  2:36 /haclu/64bit/app/oracle/product/crs102/bin/oclsomon.bin
    root 250352 106718   0 13:56:56      -  0:00 /bin/sleep 1
   haclu 323672 278718   0   Jan 05      -  0:58 /haclu/64bit/app/oracle/product/crs102/bin/evmd.bin
   haclu 311416 323672   0   Jan 05      -  0:01 /haclu/64bit/app/oracle/product/crs102/bin/evmlogger.bin -o /haclu/64bit/app/oracle/product/crs102/evm/log/evmlogger.info -l /haclu/64bit/app/oracle/product/crs102/evm/log/evmlogger.log
    root 287166 213226   2   Jan 05      - 84:56 /haclu/64bit/app/oracle/product/crs102/bin/crsd.bin reboot

4. CRS 11.1.0.7 on Linux 32bit
[root@haclulnx1 init.d]# ps -ef | grep -E 'init|d.bin|ocls|oprocd|diskmon|evmlogger|PID'
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 16:55 ?        00:00:00 init [5]                       
root      5412     1  0 16:56 ?        00:00:00 /bin/sh /etc/init.d/init.evmd run
root      5413     1  0 16:56 ?        00:00:03 /bin/sh /etc/init.d/init.cssd fatal
root      5416     1  0 16:56 ?        00:00:00 /bin/sh /etc/init.d/init.crsd run
root      7690  5413  0 16:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd daemon
oracle    8465  7690  0 16:57 ?        00:00:01 /orasoft/red4u2/crs/bin/ocssd.bin
root      7648  5413  0 16:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd oprocd
root      8372  7648  0 16:56 ?        00:00:00 /orasoft/red4u2/crs/bin/oprocd run -t 1000 -m 500 -f
root      7672  5413  0 16:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd diskmon
oracle    8255  7672  0 16:56 ?        00:00:00 /orasoft/red4u2/crs/bin/diskmon.bin -d -f
root      7658  5413  0 16:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd oclsomon
root      8384  7658  0 16:56 ?        00:00:00 /sbin/runuser -l oracle -c /bin/sh -c 'cd /orasoft/red4u2/crs/log/haclulnx1/cssd/oclsomon; ulimit -c unlimited; /orasoft/red4u2/crs/bin/oclsomon  || exit $?'
oracle    8385  8384  0 16:56 ?        00:00:00 /bin/sh -c cd /orasoft/red4u2/crs/log/haclulnx1/cssd/oclsomon; ulimit -c unlimited; /orasoft/red4u2/crs/bin/oclsomon  || exit $?
oracle    8418  8385  0 16:56 ?        00:00:01 /orasoft/red4u2/crs/bin/oclsomon.bin
root      9746     1  0 17:00 ?        00:00:00 /orasoft/red4u2/crs/bin/oclskd.bin
oracle   10537     1  0 17:01 ?        00:00:00 /orasoft/red4u2/crs/bin/oclskd.bin
oracle    7606  7605  0 16:56 ?        00:00:00 /orasoft/red4u2/crs/bin/evmd.bin
oracle    9809  7606  0 17:00 ?        00:00:00 /orasoft/red4u2/crs/bin/evmlogger.bin -o /orasoft/red4u2/crs/evm/log/evmlogger.info -l /orasoft/red4u2/crs/evm/log/evmlogger.log
root      7585  5416  0 16:56 ?        00:00:08 /orasoft/red4u2/crs/bin/crsd.bin reboot

B. When the clusterware is not allowed to start on boot

This state is reached when:
1. 'crsctl stop crs' has been issued and the clusterware is stopped
or
2. the automatic startup of the clusterware has been disabled and the node has been rebooted, e.g.
./init.crs disable
Automatic startup disabled for system boot.

The 'ps' command only shows the three inittab processes with spawned sleep processes in a 30-second loop
ps -ef | grep -E 'init|d.bin|ocls|oprocd|diskmon|evmlogger|sleep|PID'
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 16:55 ?        00:00:00 init [5]                       
root     19770     1  0 18:00 ?        00:00:00 /bin/sh /etc/init.d/init.evmd run
root     19854     1  0 18:00 ?        00:00:00 /bin/sh /etc/init.d/init.crsd run
root     19906     1  0 18:00 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root     22143 19770  0 18:02 ?        00:00:00 /bin/sleep 30
root     22255 19854  0 18:02 ?        00:00:00 /bin/sleep 30
root     22266 19906  0 18:02 ?        00:00:00 /bin/sleep 30

The clusterware can be re-enabled via './init.crs enable' and/or via 'crsctl start crs'

C. When the clusterware is allowed to start on boot, but can't start because some prerequisites are not met

This state is reached when the node has rebooted and some prerequisites are missing, e.g.
1. OCR is not accessible
2. cluster interconnect can't accept tcp connections
3. CRS_HOME is not mounted
...
and 'crsctl check boot' (run as oracle) shows errors, e.g.

$ crsctl check boot
Oracle Cluster Registry initialization failed accessing Oracle Cluster Registry device:
PROC-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]

The three inittab processes are sleeping for 60 seconds in a loop in 'init.cssd startcheck'
ps -ef | grep -E 'init|d.bin|ocls|oprocd|diskmon|evmlogger|sleep|PID'
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 18:28 ?        00:00:00 init [5]                       
root      4969     1  0 18:29 ?        00:00:00 /bin/sh /etc/init.d/init.evmd run
root      5060     1  0 18:29 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root      5064     1  0 18:29 ?        00:00:00 /bin/sh /etc/init.d/init.crsd run
root      5405  4969  0 18:29 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      5719  5060  0 18:29 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      5819  5064  0 18:29 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      6986  5405  0 18:30 ?        00:00:00 /bin/sleep 60
root      6987  5819  0 18:30 ?        00:00:00 /bin/sleep 60
root      7025  5719  0 18:30 ?        00:00:00 /bin/sleep 60

Once 'crsctl check boot' returns nothing (no more error messages), the clusterware processes will start.
Oracle Rac crsctl and srvctl commands

CRSCTL Commands :-
Cluster Related Commands
crs_stat -t Shows HA resource status (hard to read)
crsstat Output of crs_stat -t formatted nicely
crsctl check crs CSS,CRS,EVM appears healthy
crsctl stop crs Stop crs and all other services
crsctl disable crs Prevents CRS from starting on reboot
crsctl enable crs Enables CRS start on reboot
crs_stop all Stops all registered resources
crs_start all Starts all registered resources
crsctl stop cluster -all Stops the cluster in all nodes
crsctl start cluster -all Starts the cluster in all nodes

SRVCTL Commands :-
Database Related Commands
srvctl start instance -d <db_name>  -i <inst_name> Starts an instance
srvctl start database -d <db_name> Starts all instances
srvctl stop database -d <db_name> Stops all instances, closes database
srvctl stop instance -d <db_name> -i <inst_name> Stops an instance
srvctl start service -d <db_name> -s <service_name> Starts a service
srvctl stop service -d <db_name> -s <service_name> Stops a service
srvctl status service -d <db_name> Checks status of a service
srvctl status instance -d <db_name> -i <inst_name> Checks an individual instance
srvctl status database -d  <db_name> Checks status of all instances
srvctl start nodeapps -n  <node_name> Starts gsd, vip, listener, and ons
srvctl stop nodeapps -n  <node_name> Stops gsd, vip and listener
srvctl status scan Status of scan listener
srvctl config scan Configuration of scan listener
srvctl status asm Status of ASM instance
How to interpret explain plan and what is cost in explain plan

The cost is a number that represents the estimated resource usage for each step. It is just a number, an internal unit, that is used to compare different plans.

Estimating
The estimator generates three types of measures:
• Selectivity
• Cardinality
• Cost


Cardinality represents the number of rows in a row source.
Cost represents the units of work or resource that are used.

Cardinality represents the number of rows in a row source. Here, the row source can be a base table, a view, or the result of a join or GROUP BY operator. If a select from a table is performed, the table is the row source and the cardinality is the number of rows in that table.
A higher cardinality => you're going to fetch more rows => you're going to do more work => the query will take longer. Thus the cost is (usually) higher.

All other things being equal, a query with a higher cost will use more resources and thus take longer to run. But all things rarely are equal. A lower-cost query can run faster than a higher-cost one!
Cost represents the number of units of work (or resource) that are used. The query optimizer uses disk I/O, CPU usage, and memory usage as units of work. So the cost used by the query optimizer represents an estimate of the number of disk I/Os and the amount of CPU and memory used in performing an operation. The operation can be scanning a table, accessing rows from a table by using an index, joining two tables together, or sorting a row source. The cost is the final output of the cost-based optimizer (CBO).

The access path determines the number of units of work that are required to get data from a
base table. The access path can be a table scan, a fast full index scan, or an index scan.
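To actually see which access path was chosen, the usual approach is EXPLAIN PLAN followed by DBMS_XPLAN; a minimal sketch, reusing the EMP/DEPTNO example that appears later in this post:

SQL> explain plan for select empno, ename, job from emp where deptno = 10;
SQL> select * from table(dbms_xplan.display);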

Changing optimizer behavior

The optimizer is influenced by:
• SQL statement construction
• Data structure
• Statistics
• SQL Plan Management options
• Session parameters
• System parameters
• Hints


Adaptive Execution Plans

A query plan changes during execution because runtime conditions indicate that optimizer estimates are inaccurate.
All adaptive execution plans rely on statistics that are collected during query execution.
The two adaptive plan techniques are:
– Dynamic plans  -- A dynamic plan chooses among subplans during statement execution.
For dynamic plans, the optimizer must decide which subplans to include in a dynamic
plan, which statistics to collect to choose a subplan, and thresholds for this choice.
– Re-optimization -- In contrast, re-optimization changes a plan for executions after the
current execution. For re-optimization, the optimizer must decide which statistics to
collect at which points in a plan and when re-optimization is feasible.
Operations that Retrieve Rows (Access Paths)
As I mentioned earlier, some operations retrieve rows from data sources, and in those cases, the object_name column shows the name of the data source, which can be a table, a view, etc. However, the optimizer might choose to use different techniques to retrieve the data depending on the information it has available from the database statistics. These different techniques that can be used to retrieve data are usually called access paths, and they are displayed in the operations column of the plan, usually enclosed in parentheses.
Below is a list of the most common access paths with a small explanation of each (source). I will not cover them all because I don't want to bore you. I'm sure that after reading the ones I include here you will have a very good understanding of what access paths are and how they can affect the performance of your queries.
Full Table Scan
A full table scan reads all rows from a table, and then filters out those rows that do not meet the selection criteria (if there is one). Contrary to what one could think, full table scans are not necessarily a bad thing. There are situations where a full table scan would be more efficient than retrieving the data using an index.
Table Access by Rowid
A rowid is an internal representation of the storage location of data. The rowid of a row specifies the data file and data block containing the row and the location of the row in that block. Locating a row by specifying its rowid is the fastest way to retrieve a single row because it specifies the exact location of the row in the database.
In most cases, the database accesses a table by rowid after a scan of one or more indexes.
Index Unique Scan
An index unique scan returns at most 1 rowid, and thus, after an index unique scan you will typically see a table access by rowid (if the desired data is not available in the index). Index unique scans can be used when a query predicate references all of the columns of a unique index, by using the equality operator.
Index Range Scan
An index range scan is an ordered scan of values, and it is typically used when a query predicate references some of the leading columns of an index, or when for any reason more than one value can be retrieved by using an index key. These predicates can include equality and non-equality operators (=, <, >, etc.).
Index Full Scan
An index full scan reads the entire index in order, and can be used in several situations, including cases in which there is no predicate, but certain conditions would allow the index to be used to avoid a separate sorting operation.
Index Fast Full Scan
An index fast full scan reads the index blocks in unsorted order, as they exist on disk. This method is used when all of the columns the query needs to retrieve are in the index, so the optimizer uses the index instead of the table.
Index Join Scan
An index join scan is a hash join of multiple indexes that together return all columns requested by a query. The database does not need to access the table because all data is retrieved from the indexes.
Operations that Manipulate Data
As I mentioned before, besides the operations that retrieve data from the database, there are some other types of operations you may see in an execution plan, which do not retrieve data, but operate on data that was retrieved by some other operation. The most common operations in this group are sorts and joins.
Sorts
A sort operation is performed when the rows coming out of the step need to be returned in some specific order. This can be necessary to comply with the order requested by the query, or to return the rows in the order in which the next operation needs them to work as expected, for example, when the next operation is a sort merge join.
Joins
When you run a query that includes more than one table in the FROM clause the database needs to perform a join operation, and the job of the optimizer is to determine the order in which the data sources should be joined, and the best join method to use in order to produce the desired results in the most efficient way possible.
Both of these decisions are made based on the available statistics.
Here is a small explanation for the different join methods the optimizer can decide to use:
Nested Loops Joins
When this method is used, for each row in the first data set that matches the single-table predicates, the database retrieves all rows in the second data set that satisfy the join predicate. As the name implies, this method works as if you had 2 nested for loops in a procedural programming language, in which for each iteration of the outer loop the inner loop is traversed to find the rows that satisfy the join condition.
As you can imagine, this join method is not very efficient on large data sets, unless the rows in the inner data set can be accessed efficiently (through an index).
In general, nested loops joins work best on small tables with indexes on the join conditions.
Hash Joins
The database uses a hash join to join larger data sets. In summary, the optimizer creates a hash table (what is a hash table?) from one of the data sets (usually the smallest one) using the columns used in the join condition as the key, and then scans the other data set applying the same hash function to the columns in the join condition to see if it can find a matching row in the hash table built from the first data set.
You don’t really need to understand how a hash table works. In general, what you need to know is that this join method can be used when you have an equi-join, and that it can be very efficient when the smaller of the data sets can be put completely in memory.
On larger data sets, this join method can be much more efficient than a nested loop.
Sort Merge Joins
A sort merge join is a variation of a nested loops join. The main difference is that this method requires the 2 data sources to be ordered first, but the algorithm to find the matching rows is more efficient.
This method is usually selected when joining large amounts of data when the join uses an inequality condition, or when a hash join would not be able to put the hash table for one of the data sets completely in memory.
 what is Incremental statistics
Incremental statistics maintenance was introduced in Oracle Database 11g to improve the performance of gathering statistics on large partitioned tables. When incremental statistics maintenance is enabled for a partitioned table, Oracle accurately generates global-level statistics by aggregating partition-level statistics.
By default, incremental maintenance does not use the staleness status to decide when to update statistics. This scenario is covered in an earlier blog post for Oracle Database 11g. If a partition or sub-partition is subject to even a single DML operation, statistics will be re-gathered, the appropriate synopsis will be updated and the global-level statistics will be re-calculated from the synopses. This behavior can be changed in Oracle Database 12c, allowing you to use the staleness threshold to define when incremental statistics will be re-calculated. This is covered in Staleness and DML thresholds, below.
Implementation
Enabling synopses

To enable the creation of synopses, a table must be configured to use incremental maintenance. This feature is switched on using a DBMS_STATS preference called ‘INCREMENTAL’. For example:

EXEC dbms_stats.set_table_prefs(null,'SALES','INCREMENTAL','TRUE')

Checking that incremental maintenance is enabled

The value of the DBMS_STATS preference can be checked as follows:

SELECT dbms_stats.get_prefs(pname=>'INCREMENTAL',
                            tabname=>'SALES') 
FROM dual;

Staleness and DML thresholds

As mentioned above, Optimizer statistics are considered stale when the number of changes made to data exceeds a certain threshold. This threshold is expressed as a percentage of row changes for a table, partition or subpartition and is set using a DBMS_STATS preference called STALE_PERCENT. The default value for stale percent is 10 so, for example, a partition containing 100 rows would be marked stale if more than 10 rows are updated, added or deleted. Here is an example of setting and inspecting the preference:

EXEC dbms_stats.set_table_prefs(null, 'SALES', 'STALE_PERCENT','5')

select dbms_stats.get_prefs('STALE_PERCENT',null,'SALES') from dual;

It is easy to check if a table or partition has been marked as stale:

select partition_name,
       subpartition_name,
       stale_stats               /* YES or NO */
from   dba_tab_statistics
where  table_name = 'SALES';

The database tracks DML operations to measure when data change has caused a table to exceed its staleness threshold. If you want to take a look at this information, bear in mind that the statistics are approximate and they are automatically flushed to disk periodically. If you want to see the figures change immediately during your tests then you will need to flush them manually (you must have the 'ANALYZE ANY' system privilege), like this:

EXEC dbms_stats.flush_database_monitoring_info
                
select  *
from    dba_tab_modifications
where   table_name = 'SALES';

Remember that if you are using incremental statistics in Oracle Database 11g, a single DML operation on a partition or sub-partition will make it a target for a statistics refresh  - even if it is not marked stale. In other words, we might update one row in a partition containing 1 million rows. The partition won't be marked stale (if we assume a 10% staleness threshold) but fresh statistics will be gathered. Oracle Database 12c exhibits the same behavior by default, but this release gives you the option to allow multiple DML changes to occur against a partition or sub-partition before it is a target for incremental refresh. You can enable this behavior by changing the DBMS_STATS preference INCREMENTAL_STALENESS from its default value (NULL) to 'USE_STALE_PERCENT'. For example:

exec dbms_stats.set_global_prefs('INCREMENTAL_STALENESS', 'USE_STALE_PERCENT')

Once this preference is set, a table’s STALE_PERCENT value will be used to define the threshold of DML change in the context of incremental maintenance. In other words, statistics will not be re-gathered for a partition if the number of DML changes is below the STALE_PERCENT threshold.
Locking statistics

Incremental statistics does work with locked partitions statistics as long as no DML occurs on the locked partitions. However, if DML does occurs on the locked partitions then we can no longer guarantee that the global statistics built from the locked statistics will be accurate so the database will fall back to using the non-incremental approach when gathering global statistics. However, if for some reason you must lock the partition level statistics and still want to take advantage of incremental statistics gathering, you can set the 'INCREMENTAL_STALENESS' preference to include ‘USE_LOCKED_STATS’. Once set, the locked partitions/subpartitions stats are NOT considered as stale as long as they have synopses, regardless of DML changes.

Note that ‘INCREMENTAL_STALENESS’ accepts multiple values, such as:

BEGIN
   dbms_stats.set_table_prefs(
      ownname=>null, 
      tabname=>'SALES', 
      pname =>'INCREMENTAL_STALENESS', 
      pvalue=>'USE_STALE_PERCENT, USE_LOCKED_STATS');
END;
/

Checking for staleness

You can check for table/partition/subpartition staleness very easily using the statistics views. For example:

EXEC dbms_stats.flush_database_monitoring_info  

select partition_name,subpartition_name,stale_stats
from   dba_tab_statistics
where  table_name = 'SALES'
order by partition_position, subpartition_position;

Database monitoring information is used to identify stale statistics, so you'll need to call FLUSH_DATABASE_MONITORING_INFO if you're testing this out and you want to see immediately how the staleness status is affected by data change.
Gathering statistics

How do you gather statistics on a table using incremental maintenance? Keep things simple! Let the Oracle Database work out how best to do it. Use these procedures:

                       EXEC dbms_stats.gather_table_stats(null,'SALES')       
or                     EXEC dbms_stats.gather_schema_stats(…)
or, even better        EXEC dbms_stats.gather_database_stats()

For the DBMS_STATS.GATHER... procedures you must use ESTIMATE_PERCENT set to AUTO_SAMPLE_SIZE. Since this is the default, that is what will be used in the examples above unless you have overridden it. If you use a percentage value for ESTIMATE_PERCENT, incremental maintenance will not kick in.
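If you are unsure what a table is currently using, you can set and check the preference explicitly; a small sketch against the SALES example used above:

EXEC dbms_stats.set_table_prefs(null,'SALES','ESTIMATE_PERCENT','DBMS_STATS.AUTO_SAMPLE_SIZE')

SELECT dbms_stats.get_prefs('ESTIMATE_PERCENT', null, 'SALES') FROM dual;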
Regathering statistics when data hasn’t changed

From time-to-time you might notice that statistics are gathered on partitions that have not been subject to any DML changes. Why is this? There are a number of reasons:

    Statistics have been unlocked.
    Table column usage has changed (this is explained below).
    New columns are added. This includes hidden columns created from statistics extensions such as column groups, column expressions.
    Synopses are not in sync with the column statistics. It is possible that you have gathered statistics in incremental mode at time T1. Then you disable incremental and regather statistics at time T2. Then the synopses’ timestamp T1 is out of sync with the basic column statistics’ timestamp T2.
    Unusual cases such as column statistics have been deleted using delete_column_statistics.

Bullet point "2" has some implications. The database tracks how columns are used in query predicates and stores this information in the data dictionary (sys.col_usage$). It uses this information to help it figure out which columns will benefit from a histogram to improve query cardinality estimates and, as a result, improve SQL execution plans. If column usage changes, then the database might choose to re-gather statistics and create a new histogram.
Locally partitioned index statistics

For locally partitioned index statistics, we first check their corresponding table partitions (or subpartitions). If the table (sub)partitions have fresh statistics and the index statistics have been gathered after the table (sub)partition-level statistics, then they are considered fresh and their statistics are not regathered.
Composite partitioned tables

Statistics at the subpartition level are gathered and stored by the database, but note that synopses are created at the partition level only. This means that if the statistics for a subpartition become stale due to data changes, then the statistics (and synopsis) for the parent partition will be refreshed by examining all of its subpartitions. The database only regathers subpartition-level statistics on subpartitions that are stale.
What's difference between SQL profiles and SQL plan baselines.


SQL Profiles were designed to correct optimizer behavior when the underlying data no longer fits the optimizer's statistics. Their goal is to arrive at the best possible execution plan for the SQL by giving more precise data to the optimizer. A SQL profile helps the optimizer minimize mistakes and is thus more likely to select the best plan.
A SQL profile is a correction to statistics that helps the optimizer generate a more efficient execution plan.

select name, type, status, sql_text from dba_sql_profiles;

You can also disable the profile if the SQL is working properly after accepting the profile.
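A profile recommended by the SQL Tuning Advisor is made active by accepting it; a sketch, where the task name 'my_tuning_task' is hypothetical and 'my_sql_profile' matches the drop example below:

BEGIN
  DBMS_SQLTUNE.ACCEPT_SQL_PROFILE (
    task_name   => 'my_tuning_task',   -- hypothetical tuning task name
    name        => 'my_sql_profile',   -- profile name, same as in the drop example below
    force_match => TRUE);              -- also match statements that differ only in literals
END;
/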


COLUMN category FORMAT a10
COLUMN sql_text FORMAT a20

SELECT NAME, SQL_TEXT, CATEGORY, STATUS FROM   DBA_SQL_PROFILES;

BEGIN
  DBMS_SQLTUNE.DROP_SQL_PROFILE ( 
    name => 'my_sql_profile' 
);
END;
/

Change SQL Profile
a. To disable a SQL profile:
exec dbms_sqltune.alter_sql_profile('<profile_name>', 'STATUS', 'DISABLED');

b. To add description to a SQL profile:
exec DBMS_SQLTUNE.alter_sql_profile('sqlprofile_name','DESCRIPTION','this is a test sql profile');

10. To delete SQL Profile:
exec dbms_sqltune.drop_sql_profile('SYS_SQLPROF_0132f8432cbc0000');




A SQL plan baseline for a SQL statement consists of a set of accepted plans. When the statement is parsed, the optimizer will only select the best plan from among this set. If a different plan is found using the normal cost-based selection process, the optimizer will add it to the plan history but this plan will not be used until it is verified to perform better than the existing accepted plan and is evolved.
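Verification and evolution can also be driven manually with DBMS_SPM; a minimal sketch (the sql_handle shown is the one from the baseline example later in this section):

SQL> set long 1000000
SQL> select dbms_spm.evolve_sql_plan_baseline(sql_handle => 'SQL_f4cffabc89b4ae3c') from dual;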

In Oracle, a SQL Profile creates extra information about a particular SQL that the optimizer can use at run time to select the optimal plan and ensure best performance. In essence, the SQL Profile enables dynamic behavior where the optimizer has multiple plans to choose from at run time based on run-time bind variables etc. When you run the SQL Tuning Advisor, in the list of recommendations you will see whether a SQL can be improved by creating a SQL Profile or a SQL Baseline. It is preferable to choose a SQL Profile simply because it allows the optimizer to pick the best execution plan at run time.


A SQL Baseline, on the other hand, is more of a brute-force method, where you simply marry a particular SQL to a specific SQL execution plan. So no matter what the run-time bind variables are for a given SQL, the optimizer will always try to use the SQL Baseline plan. This may work fine for most cases, but in instances where data skew is high it is preferable to pick more efficient plans based on the bind variable values passed at run time instead of always picking the same plan as generated by the SQL Baseline.
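Marrying a statement to its current plan is typically done by loading the cursor-cache plan into a baseline; a sketch using the sql_id from the demo query further below:

DECLARE
  n PLS_INTEGER;
BEGIN
  -- load the current plan(s) for this sql_id as accepted baseline plans
  n := dbms_spm.load_plans_from_cursor_cache(sql_id => '55utxfrbncds3');
  dbms_output.put_line('plans loaded: ' || n);
END;
/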

SQL query to get SPM baseline.

SQL> select count(*) from dba_sql_plan_baselines where parsing_schema_name='$1';

1.
Baselines know what plan they are trying recreate and SQL Profiles do not.
SQL Profiles will blindly apply any hints it has and what you get is what you get.
Baselines will apply the hints and if the optimizer gets the plan it was expecting, it uses the plan.
If it doesn’t come up with the expected plan, the hints are thrown away and the optimizer tries again (possibly with the hints from another accepted Baseline).

2.
Profiles have a “force matching” capability that allows them to be applied to multiple statements that differ only in the values of literals.
Think of it as a just in time cursor sharing feature. Baselines do not have this ability to act on multiple statements.

Comments from Kerry Osborne January 25th, 2012 – 16:38

I have seen Baselines be disregarded, even without such extreme conditions as a specified index having been removed.

The reason for this is that Baselines attempt to apply enough hints to limit the choices the optimizer has to a single plan,
but there are situations where the set of hints is not sufficient to actually force the desired plan.

What I mean is that the hints will eliminate virtually all possibility but there still may be a few that are valid and so it’s possible to get a different plan.

In fact, I have even seen situations where the act of creating a Baseline causes the plan to change.


SQL> select empno,ename,job from emp t1 where deptno = 10;

     EMPNO ENAME      JOB
---------- ---------- ---------
      7782 CLARK      MANAGER
      7839 KING       PRESIDENT
      7934 MILLER     CLERK

SQL_ID    55utxfrbncds3, child number 0
-------------------------------------
select empno,ename,job from emp t1 where deptno = 10

Plan hash value: 1614352715

------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |          |       |       |     2 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED     | EMP      |     3 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                       | DEPT_EMP |     3 |       |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("DEPTNO"=10)

Note
-----
   - SQL plan baseline SQL_PLAN_g9mzurk4v9bjw9da07b3a used for this statement


We can also query the Baseline to gather Information about the stored query.

select sql_handle,sql_text,origin,enabled,accepted,adaptive 
  from dba_sql_plan_baselines 
 where plan_name = 'SQL_PLAN_g9mzurk4v9bjw9da07b3a';

SQL_HANDLE            SQL_TEXT                    ORIGIN        ENABLED ACCEPTED ADAPTIVE
--------------------- --------------------------- ------------- ------- -------- --------
SQL_f4cffabc89b4ae3c  select empno,ename,job      AUTO-CAPTURE  YES     YES      NO
                      from emp where deptno = 10 
Trouble with lost wallet
 
We had an 11.2.0.4 instance with an Oracle wallet created, but after some issues with the server the master key file got lost.

Because this was a development server there was no available backup but the data wasn't so important. We created a new wallet with a new master key that is now open and "working".

For some queries to work we used the alter table <table_name> rekey command, and tested the environment by doing a select * over all the tables with encrypted columns (we only have encrypted columns in some tables, not encrypted tablespaces) and it worked. But with certain more complicated queries (between several tables) we are getting ORA-28362.

Is it possible to recover from the loss of the previous wallet, and could this error have to do with that? Would we need to recreate the tables?

Summary: No.  In order to restore a backup with encrypted data, the correct TDE wallet file must be available, else the restore/recover cannot be done. 

If all copies of the current ewallet.p12 file (the encryption wallet or TDE wallet, used to store the master encryption keys needed by the database) are lost -- whether deleted or corrupted -- then the database cannot be restored.  Oracle Support cannot assist in restoring the database if the correct TDE wallet is missing.

The wallet password is not the same as the database master key.  Knowing the password will not help, because this is only used to open the ewallet.p12 file.

The ewallet.p12 file is a critical component of the database's ability to function when TDE has been implemented.  There is no way to substitute another wallet, or decrypt the data, without having the correct TDE wallet file. 

Treat the ewallet.p12 file accordingly, and make sure to protect it against loss.
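A quick way to see whether the database can currently open its wallet is the v$encryption_wallet view; for example:

SQL> select wrl_type, wrl_parameter, status from v$encryption_wallet;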

Solution:
--------
So what you have to do is the following :

1) decrypt all the encrypted columns (select mkeyid from enc$ should return no rows)

2) remove the current wallet

3) reimplement TDE ( a new key with a new wallet )

4) re-encrypt those columns with the new key

Finally the previous queries will return the same TDE master key and you will not have any issues.
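For steps 1 and 4 above, column decryption and re-encryption are plain ALTER TABLE operations; an illustrative sketch with a hypothetical EMPLOYEES.SSN column:

SQL> alter table employees modify (ssn decrypt);
-- ... create the new wallet and master key ...
SQL> alter table employees modify (ssn encrypt);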

or

To know what you have to do / how you can recover from this we need to see the output of these queries :

select ts#, masterkeyid,  utl_raw.cast_to_varchar2( utl_encode.base64_encode('01'||substr(masterkeyid,1,4))) || utl_raw.cast_to_varchar2( utl_encode.base64_encode(substr(masterkeyid,5,length(masterkeyid)))) masterkeyid_base64  FROM v$encrypted_tablespaces;

select  utl_raw.cast_to_varchar2( utl_encode.base64_encode('01'||substr(mkeyid,1,4))) || utl_raw.cast_to_varchar2( utl_encode.base64_encode(substr(mkeyid,5,length(mkeyid)))) masterkeyid_base64  FROM (select RAWTOHEX(mkid) mkeyid from x$kcbdbk);

select mkeyid from enc$;

-------------------------

How to change from 8k db block size to 16k
---------------------------------------------

Summary : I think my first thoughts would be to create a new empty database on an 8KB block size, add a 4KB cache, and use transportable tablespaces to move the 4KB database to the 8KB database, 
          then move objects from 4KB to 8KB over time

SOLUTION:
A) Well, I think that creating an 8k tablespace will not solve your problem; you must remember
that you need a db_8k_cache_size defined if you're going to create an 8k tablespace in a 4k database. The only way to define the default block size is at the creation of your database, so you'll be mixing different cache sizes if you need them.

Try expdp with parallelism; maybe that can help you shorten your times.

1. create a new Oracle 12.2 database

2. use your export file to populate it.

B) You can't simply create an 8k block buffer, and restore the database, because the blocks will still be 4k in size.

To create the 8k buffer you can do it dynamically with:

ALTER SYSTEM SET DB_8K_CACHE_SIZE=512M

Now you can move objects like this:

alter table .. move tablespace tbs_app8k;

alter index .. rebuild tablespace tbs_app8k;

----- 
Step: 1
As a first step, ensure that a valid full backup of the database exists; if one is not found, it is strongly recommended to take one.

Step: 2
A new database instance with an 8k block size is created on the same server with identical character sets.

Step: 3
All application-specific schemas are listed in consultation with the application team, and the tablespaces holding application schema data are identified.
Once the schemas and tablespaces are identified, the corresponding tablespaces are created with an 8k block size and an '8k' suffix, i.e. if the tablespace name is users,
the 8k block size tablespace is created as users8k. Before creating 8k tablespaces we need to set the db_8k_cache_size parameter,
otherwise ORA-29339 will be signaled while creating tablespaces with an 8k block size.
As the database has only one application user, HR, and only one tablespace, users, we will only create the users8k tablespace with an 8k block size, as shown in the sketch below.
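A hedged sketch of that preparation (the cache size, datafile path and sizes are illustrative only):

SQL> alter system set db_8k_cache_size=256M scope=both;
SQL> create tablespace users8k
     datafile '/u01/oradata/orcl/users8k_01.dbf' size 2G
     blocksize 8k;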


Move DP schema from 8k db block size to 16k db block size

1. Collect data prior to this exercise.
Capture OOR and CF prior to the REORG process. Refer to the CF_OOR_Query.sql file.

2. Take export backup of DP schema
Shutdown all demantra services

$ expdp system/manager DUMPFILE=<DUMP_FILE_NAME>.dmp LOGFILE=<LOG_FILE_NAME>.log DIRECTORY=EXPDPBKP SCHEMAS=DP EXCLUDE=STATISTICS PARALLEL=8
 

3. Test that export backup is good enough by importing into a new schema

Add more data files to existing default tablespace of DP for this test then run impdp as below.

$ impdp system/manager DIRECTORY=EXPDPBKP DUMPFILE=<EXPORT_DUMP_NAME>.dmp REMAP_SCHEMA=DP:DPTEST TRANSFORM=OID:N LOGFILE=<LOG_FILENAME>.log
After successful import, drop schema DPTEST.
SQL> drop user DPTEST cascade;

4. Drop schema DP

SQL> drop user DP cascade;

5. Drop tablespaces of DP with 8k block size i.e.
1. APPS_TS_TX_DEMT_DP
2. APPS_TS_TX_DEMT_SALES_DATA
3. APPS_TS_TX_DEMT_SALES_INDX
4. APPS_TS_TX_DEMT_SIM_DATA
5. APPS_TS_TX_DEMT_SIM_INDX
6. APPS_TS_TX_DEMT_SALE_ENG_DATA
7. APPS_TS_TX_DEMT_SALE_ENG_INDX

SQL> set timing on
SQL> drop tablespace APPS_TS_TX_DEMT_DP including contents and datafiles; 
SQL> drop tablespace APPS_TS_TX_DEMT_SALES_DATA including contents and datafiles; 
SQL> drop tablespace APPS_TS_TX_DEMT_SALES_INDX including contents and datafiles;
SQL> drop tablespace APPS_TS_TX_DEMT_SIM_DATA including contents and datafiles;
SQL> drop tablespace APPS_TS_TX_DEMT_SIM_INDX including contents and datafiles; 
SQL> drop tablespace APPS_TS_TX_DEMT_SALE_ENG_DATA including contents and datafiles;
SQL> drop tablespace APPS_TS_TX_DEMT_SALE_ENG_INDX including contents and datafiles;

6. Set SGA size accordingly and restart database
SQL> alter system set sga_max_size=<SIZE_IN_MB> scope=spfile;
SQL> alter system set sga_target=<SIZE_IN_MB> scope=spfile;

Restart database.

*Set sga_max_size and sga_target to double the current value as we are setting db_16k_cache_size now.

7. Set db_16K_cache_size

SQL> alter system set db_16K_cache_size=<SIZE_IN_MB> scope=both; 
 *Set this to half of the sga_max_size value.
 

8. Create new tablespaces as below with db block size as 16k.

1. APPS_TS_TX_DEMT_DP
2. APPS_TS_TX_DEMT_SALES_DATA
3. APPS_TS_TX_DEMT_SALES_INDX
4. APPS_TS_TX_DEMT_SIM_DATA
5. APPS_TS_TX_DEMT_SIM_INDX
6. APPS_TS_TX_DEMT_SALE_ENG_DATA
7. APPS_TS_TX_DEMT_SALE_ENG_INDX

*Replace datafile names and locations as per the instance in the commands below.
*Consider the current size of the DP schema and accordingly create and size the default tablespace, i.e. "APPS_TS_TX_DEMT_DP".

Create tablespace APPS_TS_TX_DEMT_DP
datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_dp_001.dbf' 
size 8G
SEGMENT SPACE MANAGEMENT AUTO
EXTENT MANAGEMENT LOCAL 
UNIFORM SIZE 1M
BLOCKSIZE 16k;

alter tablespace APPS_TS_TX_DEMT_DP add datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_dp_002.dbf' size 8192M;

Create tablespace APPS_TS_TX_DEMT_SALES_DATA
datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_sales_data_001.dbf' 
size 5G
SEGMENT SPACE MANAGEMENT AUTO
EXTENT MANAGEMENT LOCAL 
UNIFORM SIZE 1M
BLOCKSIZE 16k;

Create tablespace APPS_TS_TX_DEMT_SALES_INDX
datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_sales_indx_001.dbf' 
size 5G
SEGMENT SPACE MANAGEMENT AUTO
EXTENT MANAGEMENT LOCAL 
UNIFORM SIZE 1M
BLOCKSIZE 16k;

Create tablespace APPS_TS_TX_DEMT_SIM_DATA
datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_sim_data_001.dbf' 
size 2G
SEGMENT SPACE MANAGEMENT AUTO
EXTENT MANAGEMENT LOCAL 
UNIFORM SIZE 1M
BLOCKSIZE 16k;

Create tablespace APPS_TS_TX_DEMT_SIM_INDX
datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_sim_indx_001.dbf' 
size 2G
SEGMENT SPACE MANAGEMENT AUTO
EXTENT MANAGEMENT LOCAL 
UNIFORM SIZE 1M
BLOCKSIZE 16k;

Create tablespace APPS_TS_TX_DEMT_SALE_ENG_DATA
datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_sale_eng_data_001.dbf' 
size 2G
SEGMENT SPACE MANAGEMENT AUTO
EXTENT MANAGEMENT LOCAL 
UNIFORM SIZE 1M
BLOCKSIZE 16k;

Create tablespace APPS_TS_TX_DEMT_SALE_ENG_INDX
datafile '/p01/pnmtvd01/data1/apps_ts_tx_demt_sale_eng_indx_001.dbf' 
size 2G
SEGMENT SPACE MANAGEMENT AUTO
EXTENT MANAGEMENT LOCAL 
UNIFORM SIZE 1M
BLOCKSIZE 16k;

** Consider the sizes of existing tablespaces and accordingly create new tablespaces.
 

9. Import the exported dump file

$ impdp system/***** DIRECTORY=EXPDPBKP DUMPFILE=<EXPORTED_DUMP_FILENAME>.dmp LOGFILE=<LOGFILE_NAME>.log
10. Give GRANTS to DP user. Execute grant scripts attached i.e. “grantsOnDP.zip” with APPS, GEPSVCP, etc. users as per scripts.

11. Recompile INVALIDS using utlrp.sql

12. Change INITRANS

SQL> ALTER TABLE DP.SALES_DATA INITRANS 20;
SQL> ALTER TABLE DP.MDP_MATRIX INITRANS 20;
13. Gather stats on tables identified for REORG.

SQL> execute DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DP', tabname => 'SALES_DATA', estimate_percent=> 40, method_opt=>'for all columns size 1, for all indexed columns size auto', degree=> 4);

SQL> execute DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DP', tabname => 'MDP_MATRIX', estimate_percent=> 40, method_opt=>'for all columns size 1, for all indexed columns size auto', degree=> 4);
 

14. Reorder SALES_DATA and MDP_MATRIX.

1. Execute below script as SYS to GRANT all required privileges to DP user for running REORG
SQL> @grant_table_reorg.sql
(Script is kept on pwercd01vn015 at /home/orapnmtvd02/. Also, available on Remote Windows Server of Demantra)

2. Run reorg as DP user
SQL> exec table_reorg.reorg('DP','SALES_DATA','C',30,1);
SQL> exec table_reorg.reorg('DP','MDP_MATRIX','C',30,1);

3. Execute script to REVOKE all given privileges from DP user. Run as SYS.
SQL> @revoke_table_reorg.sql
(Script is kept on pwercd01vn015 at /home/orapnmtvd02/. Also, available on Remote Windows Server of Demantra)

4. Verify success of the Reorg from entries in the table DP.LOG_TABLE_REORG.
SQL> SELECT * FROM DP.LOG_TABLE_REORG ORDER BY LOG_TIME DESC;

15. Delete table stats of SALES_DATA and MDP_MATRIX and then Gather Stats on schema with 80%

SQL> execute DBMS_STATS.DELETE_TABLE_STATS(ownname => 'DP', tabname => 'SALES_DATA');
SQL> execute DBMS_STATS.DELETE_TABLE_STATS(ownname => 'DP', tabname => 'MDP_MATRIX');
SQL> exec dbms_stats.GATHER_SCHEMA_STATS(OWNNAME=>'DP', estimate_percent=>80, DEGREE=>10);

16. Rebuild indexes on table SALES_DATA & MDP_MATRIX in new tablespaces
*Researching whether this is actually required every time we do a reorg.

17. Recheck OOR & CF as was done in step 1.





Top 5 issues that may prevent the successful startup of the Grid Infrastructure (GI) stack

To determine the status of GI, please run the following commands:

1. $GRID_HOME/bin/crsctl check crs
2. $GRID_HOME/bin/crsctl stat res -t -init
3. $GRID_HOME/bin/crsctl stat res -t
4. ps -ef | egrep 'init|d.bin'


Issue #1: CRS-4639: Could not contact Oracle High Availability Services, ohasd.bin not running or ohasd.bin is running but no init.ohasd or other processes
Symptoms:

1. Command '$GRID_HOME/bin/crsctl check crs' returns error:
     CRS-4639: Could not contact Oracle High Availability Services
2. Command 'ps -ef | grep init' does not show a line similar to:
     root 4878 1 0 Sep12 ? 00:00:02 /bin/sh /etc/init.d/init.ohasd run
3. Command 'ps -ef | grep d.bin' does not show a line similar to:
     root 21350 1 6 22:24 ? 00:00:01 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
    Or it may only show "ohasd.bin reboot" process without any other processes
4. ohasd.log report:
       2013-11-04 09:09:15.541: [ default][2609911536] Created alert : (:OHAS00117:) :  TIMED OUT WAITING FOR OHASD MONITOR
5. ohasOUT.log report:
       2013-11-04 08:59:14
       Changing directory to /u01/app/11.2.0/grid/log/lc1n1/ohasd
       OHASD starting
       Timed out waiting for init.ohasd script to start; posting an alert
6. ohasd.bin keeps restarting, ohasd.log report:
     2014-08-31 15:00:25.132: [  CRSSEC][733177600]{0:0:2} Exception: PrimaryGroupEntry constructor failed to validate group name with error: 0 groupId: 0x7f8df8022450 acl_string: pgrp:spec:r-x
     2014-08-31 15:00:25.132: [  CRSSEC][733177600]{0:0:2} Exception: ACL entry creation failed for: pgrp:spec:r-x
     2014-08-31 15:00:25.132: [    INIT][733177600]{0:0:2} Dump State Starting ...
7. Only the ohasd.bin is running, but there is nothing written in ohasd.log. OS /var/log/messages shows:
     2015-07-12 racnode1 logger: autorun file for ohasd is missing


Possible Causes:

1. For OL5/RHEL5 and under, and other platforms, the file '/etc/inittab' does not contain a line similar to the following (platform dependent):
      h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
     For OL6/RHEL6+, upstart is not configured properly. For RHEL7/OEL7, systemd is not configured correctly
2. runlevel 3 has not been reached, some rc3 script is hanging
3. the init process (pid 1) did not spawn the process defined in /etc/inittab (h1) or a bad entry before init.ohasd like xx:wait:<process> blocked the start of init.ohasd
4. CRS autostart is disabled
5. The Oracle Local Registry ($GRID_HOME/cdata/<node>.olr) is missing or corrupted (check as root user via "ocrdump -local /tmp/olr.log", the /tmp/olr.log should contain all GI daemon processes related information, compare with a working cluster to verify)
6. root user was in group "spec" before but now the group "spec" has been removed, the old group for root user is still recorded in the OLR, this can be verified in OLR dump
7. HOSTNAME was null when init.ohasd started especially after a node reboot

Solutions:

1. For OL5/RHEL5 and under, add the following line to /etc/inittab
    h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
   and then run "init q" as the root user.
   For Linux OL6/RHEL6, please refer to Note 1607600.1
2. Run command 'ps -ef | grep rc' and kill any remaining rc3 scripts that appear to be stuck.
3. Remove the bad entry before init.ohasd. Consult with OS vendor if "init q" does not spawn "init.ohasd run" process. As a workaround,
   start the init.ohasd manually, eg: as root user, run "/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null &"
4. Enable CRS autostart:
   # crsctl enable crs
   # crsctl start crs
5. Restore OLR from backup, as root user: (refer to Note 1193643.1)
   # crsctl stop crs -f
   # touch <GRID_HOME>/cdata/<node>.olr
   # chown root:oinstall <GRID_HOME>/cdata/<node>.olr
   # ocrconfig -local -restore <GRID_HOME>/cdata/<node>/backup_<date>_<num>.olr
   # crsctl start crs

If an OLR backup does not exist for any reason, a deconfig followed by rerunning root.sh is required to recreate the OLR, as root user:
   # <GRID_HOME>/crs/install/rootcrs.pl -deconfig -force
   # <GRID_HOME>/root.sh
6. Reinitializing/recreating the OLR is required; use the same commands as for recreating the OLR in solution 5 above
7. Restart the init.ohasd process or add "sleep 30" in init.ohasd to allow hostname populated correctly before starting Clusterware, refer to Note 1427234.1
8. If the above does not help, check OS messages for the ohasd.bin logger message and manually execute the crswrapexece.pl command mentioned in the OS message with LD_LIBRARY_PATH set to <GRID_HOME>/lib to continue debugging.
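As a rough illustration of solutions 2 and 3 above (the stuck script name and PID are hypothetical and will differ in your environment), as root user:

   # ps -ef | grep rc                                        >> look for a stuck rcN.d Snn script, e.g. S98gcstartup
   # kill -9 <pid of the stuck Snn script>
   # nohup /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null &
   # ps -ef | grep init.ohasd | grep -v grep                 >> confirm "init.ohasd run" is now up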
 

Issue #2: CRS-4530: Communications failure contacting Cluster Synchronization Services daemon, ocssd.bin is not running

Symptoms:

1. Command '$GRID_HOME/bin/crsctl check crs' returns errors:
    CRS-4638: Oracle High Availability Services is online
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
    CRS-4534: Cannot communicate with Event Manager
2. Command 'ps -ef | grep d.bin' does not show a line similar to:
    oragrid 21543 1 1 22:24 ? 00:00:01 /u01/app/11.2.0/grid/bin/ocssd.bin
3. ocssd.bin is running but abort with message "CLSGPNP_CALL_AGAIN" in ocssd.log
4. ocssd.log shows:

   2012-01-27 13:42:58.796: [ CSSD][19]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 223132864, wrtcnt, 1112, LATS 783238209, 
   lastSeqNo 1111, uniqueness 1327692232, timestamp 1327693378/787089065

5. For 3 or more node clusters, 2 nodes form the cluster fine but the 3rd node fails when joining; ocssd.log shows:

   2012-02-09 11:33:53.048: [ CSSD][1120926016](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 2 nodes with leader 2, racnode2, is smaller than  
   cohort of 2 nodes led by node 1, racnode1, based on map type 2
   2012-02-09 11:33:53.048: [ CSSD][1120926016]###################################
   2012-02-09 11:33:53.048: [ CSSD][1120926016]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread

6. ocssd.bin startup times out after 10 minutes

   2012-04-08 12:04:33.153: [    CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1333911873
   ......
   2012-04-08 12:14:31.994: [    CSSD][5]clssgmShutDown: Received abortive shutdown request from client.
   2012-04-08 12:14:31.994: [    CSSD][5]###################################
   2012-04-08 12:14:31.994: [    CSSD][5]clssscExit: CSSD aborting from thread GMClientListener
   2012-04-08 12:14:31.994: [    CSSD][5]###################################
   2012-04-08 12:14:31.994: [    CSSD][5](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

7. alert<node>.log shows:
2014-02-05 06:16:56.815
[cssd(3361)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /u01/app/11.2.0/grid/log/bdprod2/cssd/ocssd.log
...
2014-02-05 06:27:01.707
[ohasd(2252)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'bdprod2'.
2014-02-05 06:27:02.075
[ohasd(2252)]CRS-2771:Maximum restart attempts reached for resource 'ora.cssd'; will not restart.

Possible Causes:

1. Voting disk is missing or inaccessible
2. Multicast is not working for private network for 11.2.0.2.x (expected behavior) or 11.2.0.3 PSU5/PSU6/PSU7 or 12.1.0.1 (due to Bug 16547309)
3. The private network is not working; ping or traceroute <private host> shows destination unreachable. Or a firewall is enabled for the private network while ping/traceroute works fine
4. gpnpd does not come up, stuck in dispatch thread, Bug 10105195
5. too many disks discovered via asm_diskstring or slow scan of disks due to Bug 13454354 on Solaris 11.2.0.3 only
6. In some cases, a known bug could prevent the 2nd node's ocssd.bin from joining the cluster after the private network issue is fixed, refer to Note 1479380.1

Solutions:

1. Restore the voting disk access by checking storage access, disk permissions etc.
   If the disk is not accessible at OS level, please engage system administrator to restore the disk access.
   If the voting disk is missing from the OCR ASM diskgroup, start CRS in exclusive mode and recreate the voting disk:
   # crsctl start crs -excl
   # crsctl replace votedisk <+OCRVOTE diskgroup>
2. Refer to Document 1212703.1 for multicast test and fix. For 11.2.0.3 PSU5/PSU6/PSU7 or 12.1.0.1, either enable multicast for private network or apply patch 16547309 or latest PSU.
3. Consult with the network administrator to restore private network access or disable the firewall for the private network (for Linux, check "service iptables status" and "service ip6tables status"; see the sample checks after this list)
4. Kill the gpnpd.bin process on surviving node, refer Document 10105195.8
   Once above issues are resolved, restart Grid Infrastructure stack.
   If ping/traceroute works fine for the private network but a failed 11.2.0.1 to 11.2.0.2 upgrade has occurred, check
   Bug 13416559 for the workaround
5. Limit the number of ASM disks scanned by supplying a more specific asm_diskstring, refer to bug 13583387
   For Solaris 11.2.0.3 only, please apply patch 13250497, see Note 1451367.1.
6. Refer to the solution and workaround in Note 1479380.1
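As a rough sketch of the private network checks in solution 3 (hostnames/IPs are placeholders), run on each node:

   $ ping -c 3 <private hostname of remote node>
   $ traceroute <private IP of remote node>
   # service iptables status                                 >> Linux only; also check "service ip6tables status"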
 

Issue #3: CRS-4535: Cannot communicate with Cluster Ready Services, crsd.bin is not running
Symptoms:

1. Command '$GRID_HOME/bin/crsctl check crs' returns errors:
    CRS-4638: Oracle High Availability Services is online
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4529: Cluster Synchronization Services is online
    CRS-4534: Cannot communicate with Event Manager
2. Command 'ps -ef | grep d.bin' does not show a line similar to:
    root 23017 1 1 22:34 ? 00:00:00 /u01/app/11.2.0/grid/bin/crsd.bin reboot
3. Even if the crsd.bin process exists, command 'crsctl stat res -t -init' shows:
    ora.crsd
        1    ONLINE     INTERMEDIATE

Possible Causes:

1. ocssd.bin is not running or resource ora.cssd is not ONLINE
2. The +ASM<n> instance cannot start up for various reasons
3. OCR is inaccessible
4. Network configuration has been changed causing gpnp profile.xml mismatch
5. $GRID_HOME/crs/init/<host>.pid file for crsd has been removed or renamed manually, crsd.log shows: 'Error3 -2 writing PID to the file'
6. ocr.loc content mismatch with other cluster nodes. crsd.log shows: 'Shutdown CacheLocal. my hash ids don't match'
7. The private network is pingable with a normal ping but not with jumbo-frame-sized packets (eg: ping -s 8900 <private ip>) when jumbo frames are enabled (MTU: 9000+). Or some cluster nodes have jumbo frames set (MTU: 9000) while the problem node does not (MTU: 1500)
8. On AIX 6.1 TL08 SP01 and AIX 7.1 TL02 SP01, due to truncation of multicast packets.
9. udp_sendspace is set to default 9216 on AIX platform

Solutions:

1. Check the solution for Issue 2, ensure ocssd.bin is running and ora.cssd is ONLINE
2. For 11.2.0.2+, ensure that the resource ora.cluster_interconnect.haip is ONLINE, refer to Document 1383737.1 for ASM startup issues related to HAIP.
   Check if GRID_HOME/bin/oracle binary is linked with RAC option Document 284785.1
3. Ensure the OCR disk is available and accessible. If the OCR is lost for any reason, refer to Document 1062983.1 on how to restore the OCR.
4. Restore network configuration to be the same as interface defined in $GRID_HOME/gpnp/<node>/profiles/peer/profile.xml, refer to Document 283684.1 for private network modification.
5. Touch the <host>.pid file under $GRID_HOME/crs/init.
   For 11.2.0.1, the file is owned by the <grid> user.
   For 11.2.0.2, the file is owned by the root user.
6. Use the "ocrconfig -repair" command to fix the ocr.loc content:
   for example, as root user:
# ocrconfig -repair -add +OCR2 (to add an entry)
# ocrconfig -repair -delete +OCR2 (to remove an entry)
ohasd.bin needs to be up and running in order for above command to run.

Once above issues are resolved, either restart GI stack or start crsd.bin via:
   # crsctl start res ora.crsd -init
7. Engage the network admin to enable jumbo frames at the switch layer if they are enabled at the network interface. If jumbo frames are not required, change the MTU to 1500 for the private network on all nodes, then restart the GI stack on all nodes (see the sample checks after this list).
8. On AIX 6.1 TL08 SP01 and AIX 7.1 TL02 SP01, apply AIX patch per Document 1528452.1 AIX 6.1 TL8 or 7.1 TL2: 11gR2 GI Second Node Fails to Join the Cluster as CRSD and EVMD are in INTERMEDIATE State
9. Increase udp_sendspace to recommended value, refer to Document 1280234.1
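A quick way to check for an MTU mismatch on the private interconnect (the interface name eth1 is only an example; substitute your private interface and a remote private IP):

   $ /sbin/ip link show eth1 | grep mtu                      >> compare the MTU value across all nodes
   $ ping -s 8900 <private IP of remote node>                >> should succeed when jumbo frames are enabled end to end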
 

Issue #4: Agent or mdnsd.bin, gpnpd.bin, gipcd.bin not running
Symptoms:

1. orarootagent not running. ohasd.log shows:
2012-12-21 02:14:05.071: [    AGFW][24] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /grid/11.2.0/grid_2/bin/orarootagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/grid/11.2.0/grid_2/bin/orarootagent]
2. mdnsd.bin, gpnpd.bin or gipcd.bin not running, here is a sample for mdnsd log file:
2012-12-31 21:37:27.601: [  clsdmt][1088776512]Creating PID [4526] file for home /u01/app/11.2.0/grid host lc1n1 bin mdns to /u01/app/11.2.0/grid/mdns/init/
2012-12-31 21:37:27.602: [  clsdmt][1088776512]Error3 -2 writing PID [4526] to the file []
2012-12-31 21:37:27.602: [  clsdmt][1088776512]Failed to record pid for MDNSD
or
2012-12-31 21:39:52.656: [  clsdmt][1099217216]Creating PID [4645] file for home /u01/app/11.2.0/grid host lc1n1 bin mdns to /u01/app/11.2.0/grid/mdns/init/
2012-12-31 21:39:52.656: [  clsdmt][1099217216]Writing PID [4645] to the file [/u01/app/11.2.0/grid/mdns/init/lc1n1.pid]
2012-12-31 21:39:52.656: [  clsdmt][1099217216]Failed to record pid for MDNSD
3. oraagent or appagent not running, crsd.log shows:
2012-12-01 00:06:24.462: [    AGFW][1164069184] {0:2:27} Created alert : (:CRSAGF00130:) :  Failed to start the agent /u01/app/grid/11.2.0/bin/appagent_oracle

Possible Causes:

1. orarootagent missing execute permission
2. missing process associated <node>.pid file or the file has wrong ownership or permission
3. wrong permission/ownership within GRID_HOME
4. GRID_HOME disk space 100% full

Solutions:

1. Either compare the permission/ownership with a good node GRID_HOME and make correction accordingly or as root user:
   # cd <GRID_HOME>/crs/install
   # ./rootcrs.pl -unlock
   # ./rootcrs.pl -patch
This will stop the clusterware stack, set permission/ownership to root for required files and restart the clusterware stack.
2. If the corresponding <node>.pid does not exist, touch the file with correct ownership and permission, otherwise correct the <node>.pid ownership/permission as required, then restart the clusterware stack.
Here is the list of <node>.pid file under <GRID_HOME>, owned by root:root, permission 644:
  ./ologgerd/init/<node>.pid
  ./osysmond/init/<node>.pid
  ./ctss/init/<node>.pid
  ./ohasd/init/<node>.pid
  ./crs/init/<node>.pid
Owned by <grid>:oinstall, permission 644:
  ./mdns/init/<node>.pid  
  ./evm/init/<node>.pid
  ./gipc/init/<node>.pid
  ./gpnp/init/<node>.pid

3. For cause 3, please refer to solution 1.
4. Please clean up disk space in GRID_HOME, particularly old files under <GRID_HOME>/log/<node>/client/ and <diag dest>/tnslsnr/<node>/<listener name>/alert/ (see the sample commands below).
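A rough sketch for locating what is filling up the Grid home (paths follow the layout above; adjust <GRID_HOME> and <node> for your environment):

   $ df -h <GRID_HOME>
   $ du -sh <GRID_HOME>/log/<node>/client
   $ find <GRID_HOME>/log/<node>/client -name '*.log' -mtime +30 -ls     >> candidates older than 30 days for cleanup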
 

Issue #5: ASM instance does not start, ora.asm is OFFLINE
Symptoms:

1. Command 'ps -ef | grep asm' shows no ASM processes
2. Command 'crsctl stat res -t -init' shows:
         ora.asm
               1    ONLINE    OFFLINE


Possible Causes:

1. ASM spfile is corrupted
2. ASM discovery string is incorrect and therefore voting disk/OCR cannot be discovered
3. ASMlib configuration problem
4. ASM instances are using different cluster_interconnects; HAIP being OFFLINE on one node prevents the 2nd ASM instance from starting

Solutions:

1. Create a temporary pfile to start the ASM instance, then recreate the spfile (see the sketch after this list), see Document 1095214.1 for more details.
2. Refer to Document 1077094.1 to correct the ASM discovery string.
3. Refer to Document 1050164.1 to fix ASMlib configuration.
4. Refer to Document 1383737.1 for solution. For more information about HAIP, please refer to Document 1210883.1
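A minimal sketch of solution 1, assuming the OCR/voting diskgroup is named +OCRVOTE and the disks sit under /dev/oracleasm/disks (both are assumptions; take the exact parameters from Document 1095214.1 and your own configuration). As the grid user, with the environment set to the Grid home and the local ASM SID:

   $ cat /tmp/init+ASM1.ora
   instance_type=asm
   asm_diskstring='/dev/oracleasm/disks/*'
   asm_diskgroups='OCRVOTE'
   $ sqlplus / as sysasm
   SQL> startup pfile='/tmp/init+ASM1.ora';
   SQL> create spfile='+OCRVOTE' from pfile='/tmp/init+ASM1.ora';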



Troubleshoot Grid Infrastructure Startup Issues


Start up sequence:

In a nutshell, the operating system starts ohasd, ohasd starts agents to start up daemons (gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, evmd, asm etc.), and crsd starts agents that start user resources (database, SCAN, listener etc.).

For detailed Grid Infrastructure clusterware startup sequence, please refer to note 1053147.1


Cluster status

To find out cluster and daemon status:

$GRID_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

$GRID_HOME/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac1                  Started
ora.crsd
      1        ONLINE  ONLINE       rac1
ora.cssd
      1        ONLINE  ONLINE       rac1
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1
ora.ctssd
      1        ONLINE  ONLINE       rac1                  OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       rac1
ora.drivers.acfs
      1        ONLINE  ONLINE       rac1
ora.evmd
      1        ONLINE  ONLINE       rac1
ora.gipcd
      1        ONLINE  ONLINE       rac1
ora.gpnpd
      1        ONLINE  ONLINE       rac1
ora.mdnsd
      1        ONLINE  ONLINE       rac1

For 11.2.0.2 and above, there will be two more processes:

ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac1
ora.crf
      1        ONLINE  ONLINE       rac1
For 11.2.0.3 onward in non-Exadata, ora.diskmon will be offline:

ora.diskmon
      1        OFFLINE  OFFLINE       rac1

For 12c onward, ora.storage is introduced: 

ora.storage
1 ONLINE ONLINE racnode1 STABLE



To start an offline daemon - if ora.crsd is OFFLINE:

$GRID_HOME/bin/crsctl start res ora.crsd -init
 

Case 1: OHASD does not start

As ohasd.bin is responsible for starting all other clusterware processes directly or indirectly, it needs to start properly for the rest of the stack to come up. If ohasd.bin is not up, checking its status will report CRS-4639 (Could not contact Oracle High Availability Services); if ohasd.bin is already up, CRS-4640 will be reported when another startup attempt is made; and if it fails to start, the following will be reported:


CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.


Automatic ohasd.bin start up depends on the following:

1. OS is at appropriate run level:

The OS needs to be at the specified run level before CRS will try to start up.

To find out at which run level the clusterware needs to come up:

cat /etc/inittab|grep init.ohasd
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
Note: Oracle Linux 6 (OL6) or Red Hat Linux 6 (RHEL6) has deprecated inittab; instead, init.ohasd is configured via upstart in /etc/init/oracle-ohasd.conf, however, the process "/etc/init.d/init.ohasd run" should still be up. Oracle Linux 7 (and Red Hat Linux 7) uses systemd to manage start/stop services (example: /etc/systemd/system/oracle-ohasd.service)

The above example shows CRS is supposed to start at run levels 3 and 5; please note that depending on the platform, CRS comes up at a different run level.

To find out current run level:

who -r


2. "init.ohasd run" is up

On Linux/UNIX, as "init.ohasd run" is configured in /etc/inittab, process init (pid 1, /sbin/init on Linux, Solaris and hp-ux, /usr/sbin/init on AIX) will start and respawn "init.ohasd run" if it fails. Without "init.ohasd run" up and running, ohasd.bin will not start:


ps -ef|grep init.ohasd|grep -v grep
root      2279     1  0 18:14 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
Note: Oracle Linux 6 (OL6) or Red Hat Linux 6 (RHEL6) has deprecated inittab; instead, init.ohasd is configured via upstart in /etc/init/oracle-ohasd.conf, however, the process "/etc/init.d/init.ohasd run" should still be up.

If any rc Snn script (located in rcN.d, for example S98gcstartup) is stuck, the init process may not start "/etc/init.d/init.ohasd run"; please engage the OS vendor to find out why the relevant Snn script is stuck.

Error "[ohasd(<pid>)] CRS-0715:Oracle High Availability Service has timed out waiting for init.ohasd to be started." may be reported of init.ohasd fails to start on time. 
 
If the SA cannot identify the reason why init.ohasd is not starting, the following can be a very short-term workaround:

 cd <location-of-init.ohasd>
 nohup ./init.ohasd run &


3. Clusterware auto start is enabled - it's enabled by default

By default CRS is enabled for auto start upon node reboot, to enable:

$GRID_HOME/bin/crsctl enable crs

To verify whether it's currently enabled or not:

$GRID_HOME/bin/crsctl config crs
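If autostart is enabled, the output should look similar to the following (a sample; exact wording may vary by version):

CRS-4622: Oracle High Availability Services autostart is enabled.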

If the following is in the OS messages file:

Feb 29 16:20:36 racnode1 logger: Oracle Cluster Ready Services startup disabled.
Feb 29 16:20:36 racnode1 logger: Could not access /var/opt/oracle/scls_scr/racnode1/root/ohasdstr

The reason is that the file does not exist or is not accessible; the cause can be that someone modified it manually, or that the wrong opatch was used to apply a GI patch (i.e. opatch for Solaris X64 used to apply a patch on Linux).



4. syslogd is up and OS is able to execute init script S96ohasd

The OS may be stuck on some other Snn script while the node is coming up and thus never get the chance to execute S96ohasd; if that's the case, the following message will not be in the OS messages file:

Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.

If you don't see the above message, the other possibility is that syslogd (/usr/sbin/syslogd) is not fully up. Grid may fail to come up in that case as well. This may not apply to AIX.

To find out whether OS is able to execute S96ohasd while node is coming up, modify S96ohasd:

From:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        $LOGERR "Oracle HA daemon is enabled for autostart."

To:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        /bin/touch /tmp/ohasd.start."`date`"
        $LOGERR "Oracle HA daemon is enabled for autostart."

After a node reboot, if you don't see /tmp/ohasd.start.<timestamp> being created, it means the OS got stuck on some other Snn script. If you do see /tmp/ohasd.start.<timestamp> but not "Oracle HA daemon is enabled for autostart" in messages, likely syslogd is not fully up. In both cases, you will need to engage the System Administrator to find the issue at the OS level. For the latter case, the workaround is to "sleep" for about 2 minutes; modify ohasd:

From:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        $LOGERR "Oracle HA daemon is enabled for autostart."

To:

    case `$CAT $AUTOSTARTFILE` in
      enable*)
        /bin/sleep 120
        $LOGERR "Oracle HA daemon is enabled for autostart."

5. The file system on which GRID_HOME resides is online when init script S96ohasd is executed; once S96ohasd has executed, the following messages should be in the OS messages file:

Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.
..
Jan 20 20:46:57 rac1 logger: exec /ocw/grid/perl/bin/perl -I/ocw/grid/perl/lib /ocw/grid/bin/crswrapexece.pl /ocw/grid/crs/install/s_crsconfig_rac1_env.txt /ocw/grid/bin/ohasd.bin "reboot"


If you see the first line but not the last line, likely the file system containing the GRID_HOME was not online when S96ohasd was executed.
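A quick sanity check (the /ocw/grid path follows the example messages above; substitute your GRID_HOME and its mount point) is to confirm the file system is mounted and configured to mount at boot:

   $ df -h /ocw/grid
   $ grep <grid filesystem mount point> /etc/fstab           >> for NFS/remote storage, verify it is mounted before the rc scripts run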


6. Oracle Local Registry (OLR, $GRID_HOME/cdata/${HOSTNAME}.olr) is accessible and valid


ls -l $GRID_HOME/cdata/*.olr
-rw------- 1 root  oinstall 272756736 Feb  2 18:20 rac1.olr

If the OLR is inaccessible or corrupted, likely ohasd.log will have similar messages like following:


..
2010-01-24 22:59:10.470: [ default][1373676464] Initializing OLR
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:6m':failed in stat OCR file/disk /ocw/grid/cdata/rac1.olr, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.473: [  OCRRAW][1373676464]proprinit: Could not open raw device
2010-01-24 22:59:10.473: [  OCRAPI][1373676464]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 22:59:10.473: [  CRSOCR][1373676464] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2010-01-24 22:59:10.473: [ default][1373676464] OLR initalization failured, rc=26
2010-01-24 22:59:10.474: [ default][1373676464]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR

OR


..
2010-01-24 23:01:46.275: [  OCROSD][1228334000]utread:3: Problem reading buffer 1907f000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprioini: all disks are not OCR/OLR formatted
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprinit: Could not open raw device
2010-01-24 23:01:46.275: [  OCRAPI][1228334000]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 23:01:46.276: [  CRSOCR][1228334000] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage
2010-01-24 23:01:46.276: [ default][1228334000] OLR initalization failured, rc=26
2010-01-24 23:01:46.276: [ default][1228334000]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 23:01:46.277: [ default][1228334000][PANIC] OHASD exiting; Could not init OLR

OR


..
2010-11-07 03:00:08.932: [ default][1] Created alert : (:OHAS00102:) : OHASD is not running as privileged user
2010-11-07 03:00:08.932: [ default][1][PANIC] OHASD exiting: must be run as privileged user

OR


ohasd.bin comes up but the output of "crsctl stat res -t -init" shows no resources, and "ocrconfig -local -manualbackup" fails

OR


..
2010-08-04 13:13:11.102: [   CRSPE][35] Resources parsed
2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has been registered with the PE data model
2010-08-04 13:13:11.103: [   CRSPE][35] STARTUPCMD_REQ = false:
2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has changed state from [Invalid/unitialized] to [VISIBLE]
2010-08-04 13:13:11.103: [  CRSOCR][31] Multi Write Batch processing...
2010-08-04 13:13:11.103: [ default][35] Dump State Starting ...
..
2010-08-04 13:13:11.112: [   CRSPE][35] SERVERS:
:VISIBLE:address{{Absolute|Node:0|Process:-1|Type:1}}; recovered state:VISIBLE. Assigned to no pool

------------- SERVER POOLS:
Free [min:0][max:-1][importance:0] NO SERVERS ASSIGNED

2010-08-04 13:13:11.113: [   CRSPE][35] Dumping ICE contents...:ICE operation count: 0
2010-08-04 13:13:11.113: [ default][35] Dump State Done.


The solution is to restore a good backup of OLR with "ocrconfig -local -restore <ocr_backup_name>".
By default, OLR will be backed up to $GRID_HOME/cdata/$HOST/backup_$TIME_STAMP.olr once installation is complete.
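For example, to list the available OLR backups and restore one (paths follow the backup location described above; run as root with the GI stack down on this node):

   # ls -l $GRID_HOME/cdata/<node>/backup_*.olr
   # $GRID_HOME/bin/ocrconfig -local -restore $GRID_HOME/cdata/<node>/backup_<date>_<num>.olr
   # $GRID_HOME/bin/ocrconfig -local -manualbackup           >> optionally take a fresh manual backup once ohasd is healthy again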

7. ohasd.bin is able to access network socket files:


2010-06-29 10:31:01.570: [ COMMCRS][1206901056]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))

2010-06-29 10:31:01.571: [  OCRSRV][1217390912]th_listen: CLSCLISTEN failed clsc_ret= 3, addr= [(ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))]
2010-06-29 10:31:01.571: [  OCRSRV][3267002960]th_init: Local listener did not reach valid state

In a Grid Infrastructure cluster environment, ohasd-related socket files should be owned by root, but in an Oracle Restart environment they should be owned by the grid user; refer to the "Network Socket File Location, Ownership and Permission" section for example output.
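To eyeball the socket files (on Linux they typically live under /var/tmp/.oracle; the exact location is platform dependent, see the section referenced above):

   # ls -l /var/tmp/.oracle | head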

8. ohasd.bin is able to access log file location:

OS messages/syslog shows:

Feb 20 10:47:08 racnode1 OHASD[9566]: OHASD exiting; Directory /ocw/grid/log/racnode1/ohasd not found.

Refer to "Log File Location, Ownership and Permission" section for example output, if the expected directory is missing, create it with proper ownership and permission.

9. ohasd may fail to start on SUSE Linux after a node reboot, refer to note 1325718.1 - OHASD not Starting After Reboot on SLES

10. OHASD fails to start: "ps -ef | grep ohasd.bin" shows ohasd.bin is started, but nothing is written to $GRID_HOME/log/<node>/ohasd/ohasd.log for many minutes, and truss shows it is looping trying to close non-opened file handles:


..
15058/1:         0.1995 close(2147483646)                               Err#9 EBADF
15058/1:         0.1996 close(2147483645)                               Err#9 EBADF
..

Call stack of ohasd.bin from pstack shows the following:

_close  sclssutl_closefiledescriptors  main ..

The cause is bug 11834289, which is fixed in 11.2.0.3 and above; another symptom of the bug is that clusterware processes may fail to start with the same call stack and truss output (looping on the OS call "close"). If the bug is hit when trying to start other resources, "CRS-5802: Unable to start the agent process" could show up as well.

11. Other potential causes/solutions listed in note 1069182.1 - OHASD Failed to Start: Inappropriate ioctl for device

12. ohasd.bin started fine, however, "crsctl check crs" shows only the following and nothing else:

CRS-4638: Oracle High Availability Services is online
And "crsctl stat res -p -init" shows nothing

The cause is that OLR is corrupted, refer to note 1193643.1 to restore.

13. On EL7/OL7: note 1959008.1 - Install of Clusterware fails while running root.sh on OL7 - ohasd fails to start 

14. For EL7/OL7, patch 25606616 is needed: TRACKING BUG TO PROVIDE GI FIXES FOR OL7

15. If ohasd still fails to start, refer to ohasd.log in <grid-home>/log/<nodename>/ohasd/ohasd.log and ohasdOUT.log
 



Case 2: OHASD Agents do not start

OHASD.BIN will spawn four agents/monitors to start resources:

  oraagent: responsible for ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd etc
  orarootagent: responsible for ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs etc
  cssdagent / cssdmonitor: responsible for ora.cssd(for ocssd.bin) and ora.cssdmonitor(for cssdmonitor itself)

If ohasd.bin cannot start any of the above agents properly, the clusterware will not come to a healthy state.

1. A common cause of agent failure is that the log file or log directory for the agents does not have proper ownership or permissions.

Refer to below section "Log File Location, Ownership and Permission" for general reference.

One example is "rootcrs.pl -patch/postpatch" wasn't executed while patching manually resulting in agent start failure: 

2015-02-25 15:43:54.350806 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/orarootagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/orarootagent]

2015-02-25 15:43:54.382154 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/cssdagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdagent]

2015-02-25 15:43:54.384105 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/cssdmonitor Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdmonitor]


The solution is to execute the missed steps.



2. If an agent binary (oraagent.bin, orarootagent.bin etc.) is corrupted, the agent will not start, resulting in related resources not coming up:

2011-05-03 11:11:13.189
[ohasd(25303)]CRS-5828:Could not start agent '/ocw/grid/bin/orarootagent_grid'. Details at (:CRSAGF00130:) {0:0:2} in /ocw/grid/log/racnode1/ohasd/ohasd.log.


2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Created alert : (:CRSAGF00130:) :  Failed to start the agent /ocw/grid/bin/orarootagent_grid
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_START[ora.diskmon 1 1] ID 4098:403
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Can not stop the agent: /ocw/grid/bin/orarootagent_grid because pid is not initialized
..
2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} Fatal Error from AGFW Proxy: Unable to start the agent process
2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} CRS-2674: Start of 'ora.diskmon' on 'racnode1' failed

..

2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdagent]
2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00126:) :  Agent start failed
..
2011-06-27 22:34:57.806: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdmonitor Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdmonitor]

The solution is to compare the agent binary with a "good" node and restore a good copy (see the sketch after the truss output below).

truss/strace of ohasd shows the agent binary is corrupted:
32555 17:38:15.953355 execve("/ocw/grid/bin/orarootagent.bin",
["/opt/grid/product/112020/grid/bi"...], [/* 38 vars */]) = 0
..
32555 17:38:15.954151 --- SIGBUS (Bus error) @ 0 (0) ---  
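One hedged way to compare the binary with a healthy node (the choice of orarootagent.bin is illustrative; on a mismatch, copy the binary back from the good node while the stack is down on the problem node, preserving ownership and permissions):

   $ md5sum $GRID_HOME/bin/orarootagent.bin                  >> run on both the problem node and a good node and compare
   $ ls -l $GRID_HOME/bin/orarootagent.bin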

3. Agent may fail to start due to bug 11834289 with error "CRS-5802: Unable to start the agent process", refer to Section "OHASD does not start"  #10 for details.

4. Refer to: note 1964240.1 - CRS-5823:Could not initialize agent framework

 

Case 3: OCSSD.BIN does not start

Successful cssd.bin startup depends on the following:

1. GPnP profile is accessible - gpnpd needs to be fully up to serve profile

If ocssd.bin is able to get the profile successfully, likely ocssd.log will have messages similar to the following:

2010-02-02 18:00:16.251: [    GPnP][408926240]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "ipc://GPNPD_rac1", try 4 of 500...
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileVerifyForCall: [at clsgpnp.c:1867] Result: (87) CLSGPNP_SIG_VALPEER. Profile verified.  prf=0x165160d0
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileGetSequenceRef: [at clsgpnp.c:841] Result: (0) CLSGPNP_OK. seq of p=0x165160d0 is '6'=6
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2186] Result: (0) CLSGPNP_OK. Successful get-profile CALL to remote "ipc://GPNPD_rac1" disco ""

Otherwise messages like the following will show in ocssd.log:

2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1100] GIPC gipcretConnectionRefused (29) gipcConnect(ipc-ipc://GPNPD_rac1)
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1101] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "ipc://GPNPD_rac1"
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnp_getProfileEx: [at clsgpnp.c:546] Result: (13) CLSGPNP_NO_DAEMON. Can't get GPnP service profile from local GPnP daemon
2010-02-03 22:26:17.057: [ default][3852126240]Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2010-02-03 22:26:17.057: [    CSSD][3852126240]clsgpnp_getProfile failed, rc(13)
The solution is to ensure gpnpd is up and running properly.
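A quick way to check and (re)start gpnpd (standard commands; run the crsctl start as root):

   $ ps -ef | grep gpnpd.bin | grep -v grep
   $ $GRID_HOME/bin/crsctl stat res ora.gpnpd -init
   # $GRID_HOME/bin/crsctl start res ora.gpnpd -init         >> if it is reported OFFLINE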


2. Voting Disk is accessible

In 11gR2, ocssd.bin discovers voting disks using the settings from the GPnP profile; if not enough voting disks can be identified, ocssd.bin will abort itself.

2010-02-03 22:37:22.212: [    CSSD][2330355744]clssnmReadDiscoveryProfile: voting file discovery string(/share/storage/di*)
..
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvDiskVerify: Successful discovery of 0 disks
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvFindInitialConfigs: No voting files found
2010-02-03 22:37:22.228: [    CSSD][1145538880]###################################
2010-02-03 22:37:22.228: [    CSSD][1145538880]clssscExit: CSSD signal 11 in thread clssnmvDDiscThread

ocssd.bin may not come up with the following error if all nodes failed while there's a voting file change in progress:

2010-05-02 03:11:19.033: [    CSSD][1197668093]clssnmCompleteInitVFDiscovery: Detected voting file add in progress for CIN 0:1134513465:0, waiting for configuration to complete 0:1134513098:0

The solution is to start ocssd.bin in exclusive mode with note 1364971.1


If the voting disk is located on a non-ASM device, ownership and permissions should be:

-rw-r----- 1 ogrid oinstall 21004288 Feb  4 09:13 votedisk1
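To confirm which voting files are configured and then check their ownership/permissions against the example above (standard command; run as the grid user):

   $ $GRID_HOME/bin/crsctl query css votedisk
   $ ls -l <path of each voting file reported>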

3. Network is functional and name resolution is working:

If ocssd.bin can't bind to any network, likely the ocssd.log will have messages like the following:

2010-02-03 23:26:25.804: [GIPCXCPT][1206540320]gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet, ret gipcretFail (1)
2010-02-03 23:26:25.804: [GIPCGMOD][1206540320]gipcmodGipcPassInitializeNetwork: EXCEPTION[ ret gipcretFail (1) ]  failed to determine host from clsinet, using default
..
2010-02-03 23:26:25.810: [    CSSD][1206540320]clsssclsnrsetup: gipcEndpoint failed, rc 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssnmOpenGIPCEndp: failed to listen on gipc addr gipc://rac1:nm_eotcs- ret 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssscmain: failed to open gipc endp


If there is a connectivity issue on the private network (including multicast being off), likely the ocssd.log will have messages like the following:

2010-09-20 11:52:54.014: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 453, LATS 328297844, lastSeqNo 452, uniqueness 1284979488, timestamp 1284979973/329344894
2010-09-20 11:52:54.016: [    CSSD][1078421824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
..  >>>> after a long delay
2010-09-20 12:02:39.578: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 1037, LATS 328883434, lastSeqNo 1036, uniqueness 1284979488, timestamp 1284980558/329930254
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmExecuteClientRequest: MAINT recvd from proc 2 (0xe1ad870)
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmShutDown: Received abortive shutdown request from client.
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssscExit: CSSD aborting from thread GMClientListener
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################

To validate network, please refer to note 1054902.1
Please also check whether the network interface name matches the gpnp profile definition ("gpnptool get") for cluster_interconnect if CSSD could not start after a network change.
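A sketch of that comparison (the interface name and subnet in the sample output are illustrative):

   $ $GRID_HOME/bin/oifcfg getif
   eth1  192.168.10.0  global  cluster_interconnect
   $ $GRID_HOME/bin/gpnptool get 2>/dev/null | grep -i network      >> the interface/subnet here should match oifcfg and the OS configuration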

In 11.2.0.1, ocssd.bin may bind to public network if private network is unavailable

4. Vendor clusterware is up (if using vendor clusterware)

Grid Infrastructure provides full clusterware functionality and does not need vendor clusterware to be installed; but if you happen to have Grid Infrastructure on top of vendor clusterware in your environment, then the vendor clusterware needs to come up fully before CRS can be started. To verify, as the grid user:

$GRID_HOME/bin/lsnodes -n
racnode1    1
racnode1    0

If vendor clusterware is not fully up, likely ocssd.log will have messages similar to the following:

2010-08-30 18:28:13.207: [    CSSD][36]clssnm_skgxninit: skgxncin failed, will retry
2010-08-30 18:28:14.207: [    CSSD][36]clssnm_skgxnmon: skgxn init failed
2010-08-30 18:28:14.208: [    CSSD][36]###################################
2010-08-30 18:28:14.208: [    CSSD][36]clssscExit: CSSD signal 11 in thread skgxnmon

Before the clusterware is installed, execute the command below as grid user:

$INSTALL_SOURCE/install/lsnodes -v
 

One issue on hp-ux: note 2130230.1 - Grid infrastructure startup fails due to vendor Clusterware did not start (HP-UX Service guard)

 

5. Command "crsctl" being executed from wrong GRID_HOME

Command "crsctl" must be executed from correct GRID_HOME to start the stack, or similar message will be reported:

2012-11-14 10:21:44.014: [    CSSD][1086675264]ASSERT clssnm1.c 3248
2012-11-14 10:21:44.014: [    CSSD][1086675264](:CSSNM00056:)clssnmvStartDiscovery: Terminating because of the release version(11.2.0.2.0) of this node being lesser than the active version(11.2.0.3.0) that the cluster is at
2012-11-14 10:21:44.014: [    CSSD][1086675264]###################################
2012-11-14 10:21:44.014: [    CSSD][1086675264]clssscExit: CSSD aborting from thread clssnmvDDiscThread#
 

Case 4: CRSD.BIN does not start
If the "crsctl stat res -t -init" shows that ora.crsd is in intermediate state and if this is not the first node where crsd is starting, then a likely cause is that the csrd.bin is not able to talk to the master crsd.bin.
In this case, the master crsd.bin is likely having a problem, so killing the master crsd.bin is a likely solution. 
Issue "grep MASTER crsd.trc" to find out the node where the master crsd.bin is running.  Kill the crsd.bin on that master node.
The crsd.bin will automatically respawn although the master will be transferred to crsd.bin on another node.
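A rough sketch (the trace/log file location varies by version; clusterware restarts crsd.bin automatically after the kill):

   $ grep -i MASTER <crsd trace/log file>                    >> run on each node to identify the current master
   # ps -ef | grep crsd.bin | grep -v grep                   >> on the master node, note the crsd.bin PID
   # kill <pid of crsd.bin on the master node>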


Successful crsd.bin startup depends on the following:

1. ocssd is fully up

If ocssd.bin is not fully up, crsd.log will show messages like the following:

2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clssscConnect: gipc request failed with 29 (0x16)
2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clsssInitNative: connect failed, rc 29
2010-02-03 22:37:51.639: [  CRSRTI][1548456880] CSS is not ready. Received status 3 from CSS. Waiting for good status ..


2. OCR is accessible

If the OCR is located on ASM, the ora.asm resource (ASM instance) must be up and the diskgroup containing the OCR must be mounted; if not, likely the crsd.log will show messages like:

2010-02-03 22:22:55.186: [  OCRASM][2603807664]proprasmo: Error in open/create file in dg [GI]
[  OCRASM][2603807664]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup

2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: kgfoCheckMount returned [7]
2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: The ASM instance is down
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: Failed to open [+GI]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: No OCR/OLR devices are usable
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprinit: Could not open raw device
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRAPI][2603807664]a_init:16!: Backend init unsuccessful : [26]
2010-02-03 22:22:55.190: [  CRSOCR][2603807664] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
] [7]
2010-02-03 22:22:55.190: [    CRSD][2603807664][PANIC] CRSD exiting: Could not init OCR, code: 26

Note: in 11.2 ASM starts before crsd.bin, and brings up the diskgroup automatically if it contains the OCR.

If the OCR is located on a non-ASM device, expected ownership and permissions are:

-rw-r----- 1 root  oinstall  272756736 Feb  3 23:24 ocr

If the OCR is located on a non-ASM device and it is unavailable, likely crsd.log will show messages similar to the following:

2010-02-03 23:14:33.583: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:33.583: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:33.583: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:6m':failed in stat OCR file/disk /share/storage/ocr, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:34.587: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:35.589: [    CRSD][2346668976][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26


If the OCR is corrupted, likely crsd.log will show messages like the following:

2010-02-03 23:19:38.417: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]propriogid:1_2: INVALID FORMAT
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprioini: all disks are not OCR/OLR formatted
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprinit: Could not open raw device
2010-02-03 23:19:39.429: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:40.432: [    CRSD][3360863152][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26


If the owner or group of the grid user has been changed, even though ASM is available, likely crsd.log will show the following:

2010-03-10 11:45:12.510: [  OCRASM][611467760]proprasmo: Error in open/create file in dg [SYSTEMDG]
[  OCRASM][611467760]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges

2010-03-10 11:45:12.528: [  OCRASM][611467760]proprasmo: kgfoCheckMount returned [7]
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmo: The ASM instance is down
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: Failed to open [+SYSTEMDG]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: No OCR/OLR devices are usable
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprinit: Could not open raw device
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRAPI][611467760]a_init:16!: Backend init unsuccessful : [26]
2010-03-10 11:45:12.530: [  CRSOCR][611467760] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges
] [7]


If the oracle binary in GRID_HOME has wrong ownership or permissions (regardless of whether ASM is up and running), or if the grid user cannot write in ORACLE_BASE, likely crsd.log will show the following:

2012-03-04 21:34:23.139: [  OCRASM][3301265904]proprasmo: Error in open/create file in dg [OCR]
[  OCRASM][3301265904]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=12547, loc=kgfokge

2012-03-04 21:34:23.139: [  OCRASM][3301265904]ASM Error Stack : ORA-12547: TNS:lost contact

2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: kgfoCheckMount returned [7]
2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: The ASM instance is down
2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: Failed to open [+OCR]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: No OCR/OLR devices are usable
2012-03-04 21:34:23.635: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL
2012-03-04 21:34:23.636: [    GIPC][3301265904] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5326]
2012-03-04 21:34:23.639: [ default][3301265904]clsvactversion:4: Retrieving Active Version from local storage.
2012-03-04 21:34:23.643: [  OCRRAW][3301265904]proprrepauto: The local OCR configuration matches with the configuration published by OCR Cache Writer. No repair required.
2012-03-04 21:34:23.645: [  OCRRAW][3301265904]proprinit: Could not open raw device
2012-03-04 21:34:23.646: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL
2012-03-04 21:34:23.650: [  OCRAPI][3301265904]a_init:16!: Backend init unsuccessful : [26]
2012-03-04 21:34:23.651: [  CRSOCR][3301265904] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage
ORA-12547: TNS:lost contact

2012-03-04 21:34:23.652: [ CRSMAIN][3301265904] Created alert : (:CRSD00111:) :  Could not init OCR, error: PROC-26: Error while accessing the physical storage
ORA-12547: TNS:lost contact

2012-03-04 21:34:23.652: [    CRSD][3301265904][PANIC] CRSD exiting: Could not init OCR, code: 26

The expected ownership and permission of oracle binary in GRID_HOME should be:

-rwsr-s--x 1 grid oinstall 184431149 Feb  2 20:37 /ocw/grid/bin/oracle

If the OCR or its mirror is unavailable (ASM could be up, but the diskgroup for the OCR/mirror is unmounted), likely crsd.log will show the following:

2010-05-11 11:16:38.578: [  OCRASM][18]proprasmo: Error in open/create file in dg [OCRMIR]
[  OCRASM][18]SLOS : SLOS: cat=8, opn=kgfoOpenFile01, dep=15056, loc=kgfokge
ORA-17503: ksfdopn:DGOpenFile05 Failed to open file +OCRMIR.255.4294967295
ORA-17503: ksfdopn:2 Failed to open file +OCRMIR.255.4294967295
ORA-15001: diskgroup "OCRMIR
..
2010-05-11 11:16:38.647: [  OCRASM][18]proprasmo: kgfoCheckMount returned [6]
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmo: The ASM disk group OCRMIR is not found or not mounted
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmdvch: Failed to open OCR location [+OCRMIR] error [26]
2010-05-11 11:16:38.648: [  OCRRAW][18]propriodvch: Error  [8] returned device check for [+OCRMIR]
2010-05-11 11:16:38.648: [  OCRRAW][18]dev_replace: non-master could not verify the new disk (8)
[  OCRSRV][18]proath_invalidate_action: Failed to replace [+OCRMIR] [8]
[  OCRAPI][18]procr_ctx_set_invalid_no_abort: ctx set to invalid
..
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91: Comparing device hash ids between local and master failed
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Local dev (1862408427, 1028247821, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Master dev (1862408427, 1859478705, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:9: Shutdown CacheLocal. my hash ids don't match
[  OCRAPI][19]procr_ctx_set_invalid_no_abort: ctx set to invalid
[  OCRAPI][19]procr_ctx_set_invalid: aborting...
2010-05-11 11:16:46.587: [    CRSD][19] Dump State Starting ...


3. crsd.bin pid file exists and points to running crsd.bin process

If the pid file does not exist, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have entries similar to the following:


2010-02-14 17:40:57.927: [ora.crsd][1243486528] [check] PID FILE doesn't exist.
..
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Creating PID [30269] file for home /ocw/grid host racnode1 bin crs to /ocw/grid/crs/init/
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Error3 -2 writing PID [30269] to the file []
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Failed to record pid for CRSD
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Terminating process
2010-02-14 17:41:57.927: [ default][1092499776] CRSD exiting on stop request from clsdms_thdmai

The solution is to create a dummy pid file ($GRID_HOME/crs/init/$HOST.pid) manually as the grid user with the "touch" command and restart resource ora.crsd, for example:
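A minimal sketch (ownership of the pid file depends on version, see Issue #3 solution 5 above; run the crsctl commands as root):

   $ touch $GRID_HOME/crs/init/<node>.pid
   # $GRID_HOME/bin/crsctl stop res ora.crsd -init
   # $GRID_HOME/bin/crsctl start res ora.crsd -init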

If the pid file does exist and the PID in this file references a running process which is NOT the crsd.bin process, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have entries similar to the following:

2011-04-06 15:53:38.777: [ora.crsd][1160390976] [check] PID will be looked for in /ocw/grid/crs/init/racnode1.pid
2011-04-06 15:53:38.778: [ora.crsd][1160390976] [check] PID which will be monitored will be 1535                               >> 1535 is output of "cat /ocw/grid/crs/init/racnode1.pid"
2011-04-06 15:53:38.965: [ COMMCRS][1191860544]clsc_connect: (0x2aaab400b0b0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD))
[  clsdmc][1160390976]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD)) with status 9
2011-04-06 15:53:38.966: [ora.crsd][1160390976] [check] Error = error 9 encountered when connecting to CRSD
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Calling PID check for daemon
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Trying to check PID = 1535
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] PID check returned ONLINE CLSDM returned OFFLINE
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] DaemonAgent::check returned 5
2011-04-06 15:53:39.203: [    AGFW][1160390976] check for resource: ora.crsd 1 1 completed with status: FAILED
2011-04-06 15:53:39.203: [    AGFW][1170880832] ora.crsd 1 1 state changed from: UNKNOWN to: FAILED
..
2011-04-06 15:54:10.511: [    AGFW][1167522112] ora.crsd 1 1 state changed from: UNKNOWN to: CLEANING
..
2011-04-06 15:54:10.513: [ora.crsd][1146542400] [clean] Trying to stop PID = 1535
..
2011-04-06 15:54:11.514: [ora.crsd][1146542400] [clean] Trying to check PID = 1535


To verify on OS level:

ls -l /ocw/grid/crs/init/*pid
-rwxr-xr-x 1 ogrid oinstall 5 Feb 17 11:00 /ocw/grid/crs/init/racnode1.pid
cat /ocw/grid/crs/init/*pid
1535
ps -ef| grep 1535
root      1535     1  0 Mar30 ?        00:00:00 iscsid                  >> Note process 1535 is not crsd.bin

The solution is to create an empty pid file and to restart the resource ora.crsd, as root:


# > $GRID_HOME/crs/init/<racnode1>.pid
# $GRID_HOME/bin/crsctl stop res ora.crsd -init
# $GRID_HOME/bin/crsctl start res ora.crsd -init


4. Network is functional and name resolution is working:

If the network is not fully functioning, ocssd.bin may still come up, but crsd.bin may fail and the crsd.log will show messages like:


2010-02-03 23:34:28.412: [    GPnP][2235814832]clsgpnp_Init: [at clsgpnp0.c:837] GPnP client pid=867, tl=3, f=0
2010-02-03 23:34:28.428: [  OCRAPI][2235814832]clsu_get_private_ip_addresses: no ip addresses found.
..
2010-02-03 23:34:28.434: [  OCRAPI][2235814832]a_init:13!: Clusterware init unsuccessful : [44]
2010-02-03 23:34:28.434: [  CRSOCR][2235814832] OCR context init failure.  Error: PROC-44: Error in network address and interface operations Network address and interface operations error [7]
2010-02-03 23:34:28.434: [    CRSD][2235814832][PANIC] CRSD exiting: Could not init OCR, code: 44

Or:


2009-12-10 06:28:31.974: [  OCRMAS][20]proath_connect_master:1: could not connect to master  clsc_ret1 = 9, clsc_ret2 = 9
2009-12-10 06:28:31.974: [  OCRMAS][20]th_master:11: Could not connect to the new master
2009-12-10 06:29:01.450: [ CRSMAIN][2] Policy Engine is not initialized yet!
2009-12-10 06:29:31.489: [ CRSMAIN][2] Policy Engine is not initialized yet!

Or:


2009-12-31 00:42:08.110: [ COMMCRS][10]clsc_receive: (102b03250) Error receiving, ns (12535, 12560), transport (505, 145, 0)

To validate the network, please refer to note 1054902.1

5. The crsd executables (crsd.bin and crsd in GRID_HOME/bin) have correct ownership/permissions and have not been manually modified; a simple way to check is to compare the output of "ls -l <grid-home>/bin/crsd <grid-home>/bin/crsd.bin" with a "good" node.


6. crsd may not start due to the following:

note 1552472.1 -CRSD Will Not Start Following a Node Reboot: crsd.log reports: clsclisten: op 65 failed and/or Unable to get E2E port
note 1684332.1 - GI crsd Fails to Start: clsclisten: op 65 failed, NSerr (12560, 0), transport: (583, 0, 0)

 

7. To troubleshoot further, refer to note 1323698.1 - Troubleshooting CRSD Start up Issue
 

Case 5: GPNPD.BIN does not start
1. Name Resolution is not working

gpnpd.bin fails with following error in gpnpd.log:


2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "tcp://node2:9393", try 1 of 3...
2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1015] ENTRY
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1066] GIPC gipcretFail (1) gipcConnect(tcp-tcp://node2:9393)
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1067] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "tcp://node2:9393"

In the above example, please make sure the current node is able to ping and resolve "node2" and that there is no firewall between them (see the sample checks below).
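A few standard checks (the hostname node2 comes from the log above; port 9393 is the one shown in the call URL):

   $ ping -c 3 node2
   $ getent hosts node2                                      >> or nslookup node2, depending on the name resolution method
   # service iptables status                                 >> Linux; confirm nothing blocks the gpnpd port (9393 in this example)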

2. Bug 10105195

Due to Bug 10105195, gpnp dispatch is single-threaded and could be blocked by network scanning etc.; the bug is fixed in 11.2.0.2 GI PSU2, 11.2.0.3 and above, refer to note 10105195.8 for more details.


Case 6: Various other daemons do not start
Common causes:

1. Log file or directory for the daemon doesn't have appropriate ownership or permission

If the log file or log directory for the daemon doesn't have proper ownership or permissions, usually there is no new info in the log file and the timestamp remains the same while the daemon tries to come up.

Refer to below section "Log File Location, Ownership and Permission" for general reference.


2. Network socket file doesn't have appropriate ownership or permission

In this case, the daemon log will show messages like:

2010-02-02 12:55:20.485: [ COMMCRS][1121433920]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))

2010-02-02 12:55:20.485: [  clsdmt][1110944064]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))



3. OLR is corrupted

In this case, the daemon log will show messages like (this is a case that ora.ctssd fails to start):

2012-07-22 00:15:16.565: [ default][1]clsvactversion:4: Retrieving Active Version from local storage.
2012-07-22 00:15:16.575: [    CTSS][1]clsctss_r_av3: Invalid active version [] retrieved from OLR. Returns [19].
2012-07-22 00:15:16.585: [    CTSS][1](:ctss_init16:): Error [19] retrieving active version. Returns [19].
2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS init failed [19]
2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS daemon aborting [19].
2012-07-22 00:15:16.585: [    CTSS][1]CTSS daemon aborting


 
The solution is to restore a good copy of the OLR, see note 1193643.1.

 

4.  Other cases:

note 1087521.1 - CTSS Daemon Aborting With "op 65 failed, NSerr (12560, 0), transport: (583, 0, 0)" 

 

Case 7: CRSD Agents do not start

CRSD.BIN will spawn two agents to start up user resources - the two agents share the same names and binaries as the ohasd.bin agents:

  orarootagent: responsible for ora.netn.network, ora.nodename.vip, ora.scann.vip and  ora.gns
  oraagent: responsible for ora.asm, ora.eons, ora.ons, listener, SCAN listener, diskgroup, database, service resource etc

To find out the user resource status:

$GRID_HOME/bin/crsctl stat res -t


If crsd.bin cannot start any of the above agents properly, user resources may not come up.

1. A common cause of agent failure is that the log file or log directory for the agents does not have proper ownership or permissions.

Refer to below section "Log File Location, Ownership and Permission" for general reference.

2. Agent may fail to start due to bug 11834289 with error "CRS-5802: Unable to start the agent process", refer to Section "OHASD does not start"  #10 for details.


Case 8: HAIP does not start
HAIP may fail to start with various errors, for example:

[ohasd(891)]CRS-2807:Resource 'ora.cluster_interconnect.haip' failed to start automatically.
Refer to note 1210883.1 for more details of HAIP 

Network and Naming Resolution Verification

CRS depends on a fully functional network and name resolution. If the network or name resolution is not fully functioning, CRS may not come up successfully.

To validate network and name resolution setup, please refer to note 1054902.1


Log File Location, Ownership and Permission

Appropriate ownership and permission of sub-directories and files in $GRID_HOME/log is critical for CRS components to come up properly.

In Grid Infrastructure cluster environment:
Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and two separate RDBMS owners rdbmsap and rdbmsar, here is what it looks like under $GRID_HOME/log in a cluster environment:


drwxrwxr-x 5 grid oinstall 4096 Dec  6 09:20 log
  drwxr-xr-x  2 grid oinstall 4096 Dec  6 08:36 crs
  drwxr-xr-t 17 root   oinstall 4096 Dec  6 09:22 rac1
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:20 admin
    drwxrwxr-t 4 root   oinstall  4096 Dec  6 09:20 agent
      drwxrwxrwt 7 root    oinstall 4096 Jan 26 18:15 crsd
        drwxr-xr-t 2 grid  oinstall 4096 Dec  6 09:40 application_grid
        drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 oraagent_grid
        drwxr-xr-t 2 rdbmsap oinstall 4096 Jan 26 18:15 oraagent_rdbmsap
        drwxr-xr-t 2 rdbmsar oinstall 4096 Jan 26 18:15 oraagent_rdbmsar
        drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 ora_oc4j_type_grid
        drwxr-xr-t 2 root    root     4096 Jan 26 20:09 orarootagent_root
      drwxrwxr-t 6 root oinstall 4096 Dec  6 09:24 ohasd
        drwxr-xr-t 2 grid oinstall 4096 Jan 26 18:14 oraagent_grid
        drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdagent_root
        drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdmonitor_root
        drwxr-xr-t 2 root   root     4096 Jan 26 18:14 orarootagent_root    
    -rw-rw-r-- 1 root root     12931 Jan 26 21:30 alertrac1.log
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:44 client
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 crsd
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:24 cssd
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 ctssd
    drwxr-x--- 2 grid oinstall  4096 Jan 26 18:14 diskmon
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:25 evmd     
    drwxr-x--- 2 grid oinstall  4096 Jan 26 21:20 gipcd     
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:20 gnsd      
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:58 gpnpd    
    drwxr-x--- 2 grid oinstall  4096 Jan 26 21:19 mdnsd    
    drwxr-x--- 2 root oinstall  4096 Jan 26 21:20 ohasd     
    drwxrwxr-t 5 grid oinstall  4096 Dec  6 09:34 racg       
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgmain
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:57 srvm        
Please note that most log files in a sub-directory inherit the ownership of the parent directory; the listing above is only a general reference for spotting unexpected recursive ownership or permission changes inside the CRS home. If you have a working node with the same version, use that node as the reference.
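
One way to compare against a working node (a sketch; node names and output files are placeholders):

ls -lR $GRID_HOME/log/<nodename> > /tmp/crslog_perm_problem.txt     # on the problem node
ls -lR $GRID_HOME/log/<nodename> > /tmp/crslog_perm_good.txt        # on the working node
diff /tmp/crslog_perm_good.txt /tmp/crslog_perm_problem.txt         # after copying one file across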


In an Oracle Restart environment:
Here is what it looks like under $GRID_HOME/log in an Oracle Restart environment:

drwxrwxr-x 5 grid oinstall 4096 Oct 31  2009 log
  drwxr-xr-x  2 grid oinstall 4096 Oct 31  2009 crs
  drwxr-xr-x  3 grid oinstall 4096 Oct 31  2009 diag
  drwxr-xr-t 17 root   oinstall 4096 Oct 31  2009 rac1
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 admin
    drwxrwxr-t 4 root   oinstall  4096 Oct 31  2009 agent
      drwxrwxrwt 2 root oinstall 4096 Oct 31  2009 crsd
      drwxrwxr-t 8 root oinstall 4096 Jul 14 08:15 ohasd
        drwxr-xr-x 2 grid oinstall 4096 Aug  5 13:40 oraagent_grid
        drwxr-xr-x 2 grid oinstall 4096 Aug  2 07:11 oracssdagent_grid
        drwxr-xr-x 2 grid oinstall 4096 Aug  3 21:13 orarootagent_grid
    -rwxr-xr-x 1 grid oinstall 13782 Aug  1 17:23 alertrac1.log
    drwxr-x--- 2 grid oinstall  4096 Nov  2  2009 client
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 crsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 cssd
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 ctssd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 diskmon
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 evmd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 gipcd
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 gnsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 gpnpd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 mdnsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 ohasd
    drwxrwxr-t 5 grid oinstall  4096 Oct 31  2009 racg
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgmain
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 srvm
 

For 12.1.0.2 onward, refer to note 1915729.1 - Oracle Clusterware Diagnostic and Alert Log Moved to ADR

 
Network Socket File Location, Ownership and Permission

Network socket files can be located in /tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle

When a socket file has unexpected ownership or permissions, the daemon log file (e.g. evmd.log) will usually show the following:


2011-06-18 14:07:28.545: [ COMMCRS][772]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_EVMD))

2011-06-18 14:07:28.545: [  clsdmt][515]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=lexxxDBG_EVMD))
2011-06-18 14:07:28.545: [  clsdmt][515]Terminating process
2011-06-18 14:07:28.559: [ default][515] EVMD exiting on stop request from clsdms_thdmai


And the following error may be reported:


CRS-5017: The resource action "ora.evmd start" encountered the following error:
CRS-2674: Start of 'ora.evmd' on 'racnode1' failed
..

The solution is to stop GI as root (crsctl stop crs -f), clean up socket files and restart GI.
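
A minimal sketch of that cleanup, run as root on the affected node (the socket directory varies by platform as listed above; confirm which one is in use before removing it):

# crsctl stop crs -f
# rm -rf /var/tmp/.oracle /tmp/.oracle
# crsctl start crs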


Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and clustername eotcs

In a Grid Infrastructure cluster environment:
Below is example output from a cluster environment:


drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle

./.oracle:
drwxrwxrwt 2 root  oinstall 4096 Feb  2 21:25 .
srwxrwx--- 1 grid oinstall    0 Feb  2 18:00 master_diskmon
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 mdnsd
-rw-r--r-- 1 grid oinstall    5 Feb  2 18:00 mdnsd.pid
prw-r--r-- 1 root  root        0 Feb  2 13:33 npohasd
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 ora_gipc_GPNPD_rac1
-rw-r--r-- 1 grid oinstall    0 Feb  2 13:34 ora_gipc_GPNPD_rac1_lock
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sAevm
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sCevm
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_UI_SOCKET
srwxrwxrwx 1 root  root        0 Feb  2 21:25 srac1DBG_CRSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_CSSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_CTSSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_EVMD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GIPCD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GPNPD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_MDNSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN3
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs_lock
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1_lock
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sora_crsqs
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROC
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROL
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sSYSTEM.evm.acceptor.auth
 

In an Oracle Restart environment:
Below is example output from an Oracle Restart environment:


drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle

./.oracle:
srwxrwx--- 1 grid oinstall 0 Aug  1 17:23 master_diskmon
prw-r--r-- 1 grid oinstall 0 Oct 31  2009 npohasd
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.1
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.2
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.1
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.2
srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.1
srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.2
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.1
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.2
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.1
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.2
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sCRSD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_CSSD
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sEXTPROC1521
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost_lock
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1_lock
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sprocr_local_conn_0_PROL



Diagnostic file collection

If the issue cannot be identified with this note, run $GRID_HOME/bin/diagcollection.sh as root on all nodes and upload all of the .gz files it generates in the current directory.


REFERENCES
NOTE:969254.1 - How to Proceed from Failed Upgrade to 11gR2 Grid Infrastructure on Linux/Unix
BUG:10105195 - PROC-32 ACCESSING OCR; CRS DOES NOT COME UP ON NODE
NOTE:1325718.1 - OHASD not Starting After Reboot on SLES
NOTE:1427234.1 - autorun file for ohasd is missing
NOTE:1077094.1 - How to fix the "DiscoveryString" in profile.xml or "asm_diskstring" in ASM if set wrongly
BUG:11834289 - OHASD FAILED TO START TIMELY
NOTE:1053970.1 - Troubleshooting 11.2 or 12.1 Grid Infrastructure root.sh Issues
NOTE:1564555.1 - 11.2.0.3 PSU5/PSU6/PSU7 or 12.1.0.1 CSSD Fails to Start if Multicast Fails on Private Network
NOTE:1068835.1 - What to Do if 11gR2 Grid Infrastructure is Unhealthy


NOTE:1323698.1 - Troubleshooting CRSD Start up Issue
NOTE:1915729.1 - 12.1.0.2 Grid Infrastructure Oracle Clusterware Diagnostic (traces) and Alert Log Moved to ADR

NOTE:1054902.1 - How to Validate Network and Name Resolution Setup for the Clusterware and RAC
NOTE:1069182.1 - OHASD Failed to Start: Inappropriate ioctl for device
NOTE:942166.1 - How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation
NOTE:10105195.8 - Bug 10105195 - Clusterware fails to start after reboot due to gpnpd fails to start
NOTE:1053147.1 - 11gR2 Clusterware and Grid Home - What You Need to Know
-------------------


How to Change Various IP's In RAC Cluster


1) Shutdown Oracle Clusterware stack
2) Modify the IP address at network layer, DNS and /etc/hosts file to reflect the change or modify the MAC address at network layer
3) Restart Oracle Clusterware stack

Note: Network information (interface, subnet and role of each interface) for Oracle Clusterware is managed by ‘oifcfg’, but the actual IP address of each interface is not; ‘oifcfg’ cannot update IP address information. ‘oifcfg getif’ can be used to find out the interfaces currently configured in the OCR.
The ‘public’ network is for database client communication (the VIP uses the same network, though it is stored in the OCR as a separate entry), whereas the ‘cluster_interconnect’ network is for RDBMS/ASM cache fusion. Starting with 11gR2, the cluster_interconnect is also used for Clusterware heartbeats; this is a significant change compared with prior releases, which used the private node names specified at installation time for Clusterware heartbeats.

In this article we are going to see how to change the public, private, VIP and SCAN IPs.

First we need to change the RAC configuration before making any changes to the server-level IPs.

Changing Public and Private IPs in RAC Cluster.

Get the public and private Interconnect information

[grid@racnode1 ~]$ oifcfg getif
eth1  192.168.56.0  global  public
eth2  10.0.0.0  global  cluster_interconnect


Current IP Configurations

192.168.56.10   racnode1     
192.168.56.11   racnode2     
10.10.0.10      racnode1-priv
10.10.0.11      racnode2-priv
192.168.56.111   racnode1-vip 
192.168.56.112   racnode2-vip 
192.168.56.121   racdb-scan   
192.168.56.122   racdb-scan   
192.168.56.123   racdb-scan  

New IP configuration to be applied

10.1.1.30   racnode1     
10.1.1.31   racnode2     
20.20.20.21      racnode1-priv
20.20.20.22     racnode2-priv
10.1.1.40   racnode1-vip 
10.1.1.41   racnode2-vip 
10.1.1.50   racdb-scan   
10.1.1.51   racdb-scan   
10.1.1.52   racdb-scan 

Check the current status of the cluster:

[grid@racnode1 ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       racnode1                 STABLE
               ONLINE  ONLINE       racnode2                 STABLE
ora.PROD.dg
               ONLINE  ONLINE       racnode1                 STABLE
               ONLINE  ONLINE       racnode2                 STABLE
ora.asm
               ONLINE  ONLINE       racnode1                 Started,STABLE
               ONLINE  ONLINE       racnode2                 Started,STABLE
ora.net1.network
               ONLINE  ONLINE       racnode1                 STABLE
               ONLINE  ONLINE       racnode2                 STABLE
ora.ons
               ONLINE  ONLINE       racnode1                 STABLE
               ONLINE  ONLINE       racnode2                 STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       racnode2                 STABLE
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       racnode1                 STABLE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       racnode1                 STABLE
ora.MGMTLSNR
      1        ONLINE  ONLINE       racnode1                 169.254.112.86 10.10
                                                             .0.10,STABLE
ora.cvu
      1        ONLINE  ONLINE       racnode1                 STABLE
ora.mgmtdb
      1        ONLINE  ONLINE       racnode1                 Open,STABLE
ora.oc4j
      1        ONLINE  ONLINE       racnode1                 STABLE
ora.prod.db
      1        ONLINE  ONLINE       racnode1                 Open,STABLE
      2        ONLINE  ONLINE       racnode2                 Open,STABLE
ora.racnode1.vip
      1        ONLINE  ONLINE       racnode1                 STABLE
ora.racnode2.vip
      1        ONLINE  ONLINE       racnode2                 STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       racnode2                 STABLE
ora.scan2.vip
      1        ONLINE  ONLINE       racnode1                 STABLE
ora.scan3.vip
      1        ONLINE  ONLINE       racnode1                 STABLE
--------------------------------------------------------------------------------



Stop the services below. Make sure these services are stopped on all RAC nodes.
>srvctl stop database -d prod
>srvctl stop  mgmtdb
>srvctl stop MGMTLSNR
>srvctl stop nodeapps -f
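
To confirm everything is down before changing the network configuration, a quick check (using the resource names from this example):

srvctl status database -d prod
srvctl status nodeapps
crsctl stat res -t -w "TYPE = ora.database.type"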


Delete the previous public IP configuration

[grid@racnode1 ~]$ oifcfg getif
eth1  192.168.56.0  global  public
eth2  10.0.0.0  global  cluster_interconnect


 oifcfg  delif -global eth1

Redefine Public IP

oifcfg  setif -global eth1/10.1.1.0:public

Stop the cluster services:
crsctl stop cluster -all

Change the public IP at the OS level and update /etc/hosts or DNS as configured (an OS-level reference sketch is shown below), then start the cluster:
crsctl start cluster -all
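
For reference, a minimal sketch of the OS-level public IP change on Oracle Linux 5/6 with a static interface configuration (the interface name eth1, the file paths and the new address 10.1.1.30 come from this example and are assumptions; adjust for your platform and repeat on each node with its own address before starting the cluster):

# vi /etc/sysconfig/network-scripts/ifcfg-eth1    (set IPADDR=10.1.1.30, the new public IP of racnode1)
# service network restart
# vi /etc/hosts                                   (update the public entries for racnode1 and racnode2)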

Verify the changes.
[grid@racnode1 ~]$ oifcfg getif
eth2  10.0.0.0  global  cluster_interconnect
eth1  10.1.1.0  global  public



Changing Private IP 

Note: Do not change the private and public IPs at the same time.

From 11.2.0.2 onward the private interconnect cannot be deleted directly. Add the new private interconnect first, restart the Clusterware after the OS-level IPs have been changed, and only then remove the old private interconnect.

Redefine Private IP

# oifcfg  setif -global eth2/20.20.20.0:cluster_interconnect

Check that the new entry has been added

[grid@racnode1 ~]$ oifcfg  getif
eth2  10.0.0.0  global  cluster_interconnect    <-- old entry
eth1  10.1.1.0  global  public
eth2  20.20.20.0  global  cluster_interconnect  <-- new entry

crsctl stop cluster -all

Change the private interconnect IP at the OS level and update /etc/hosts or DNS as configured, then start the cluster:
crsctl start cluster -all


Delete Old Private Interconnect

# oifcfg  delif -global eth2/10.0.0.0

[root@racnode1 grid]# oifcfg getif
eth2  20.20.20.0  global  cluster_interconnect
eth1  10.1.1.0  global  public


Changing VIP of RAC cluster

Current VIP details:
[grid@racnode1 ~]$ srvctl config nodeapps
Network 1 exists
Subnet IPv4: 192.168.56.0/255.255.255.0/eth1, static
Subnet IPv6: 
Ping Targets: 
Network is enabled
Network is individually enabled on nodes: 
Network is individually disabled on nodes: 
VIP exists: network number 1, hosting node racnode1
VIP Name: racnode1-vip
VIP IPv4 Address: 192.168.56.111
VIP IPv6 Address: 
VIP is enabled.
VIP is individually enabled on nodes: 
VIP is individually disabled on nodes: 
VIP exists: network number 1, hosting node racnode2
VIP Name: racnode2-vip
VIP IPv4 Address: 192.168.56.112
VIP IPv6 Address: 
VIP is enabled.
VIP is individually enabled on nodes: 
VIP is individually disabled on nodes: 
ONS exists: Local port 6100, remote port 6200, EM port 2016, Uses SSL false
ONS is enabled
ONS is individually enabled on nodes: 
ONS is individually disabled on nodes: 

Make sure the nodeapps services are stopped on all nodes (they were stopped earlier with srvctl stop nodeapps -f).
Make changes in /etc/hosts or DNS as configured.
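
A quick check that the VIPs are really offline before modifying them (node names taken from this example):

srvctl status nodeapps
srvctl status vip -n racnode1
srvctl status vip -n racnode2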


Run the commands below as the root user to change the VIP configuration:

# srvctl modify nodeapps -n racnode1 -A 10.1.1.40/255.255.255.0/eth1
# srvctl modify nodeapps -n racnode2 -A 10.1.1.41/255.255.255.0/eth1


Verify changes are done 
[root@racnode1 grid]# srvctl config nodeapps
Network 1 exists
Subnet IPv4: 10.1.1.0/255.255.255.0/eth1, static
Subnet IPv6: 
Ping Targets: 
Network is enabled
Network is individually enabled on nodes: 
Network is individually disabled on nodes: 
VIP exists: network number 1, hosting node racnode1
VIP Name: racnode1-vip
VIP IPv4 Address: 10.1.1.40
VIP IPv6 Address: 
VIP is enabled.
VIP is individually enabled on nodes: 
VIP is individually disabled on nodes: 
VIP exists: network number 1, hosting node racnode2
VIP Name: racnode2-vip
VIP IPv4 Address: 10.1.1.41
VIP IPv6 Address: 
VIP is enabled.
VIP is individually enabled on nodes: 
VIP is individually disabled on nodes: 
ONS exists: Local port 6100, remote port 6200, EM port 2016, Uses SSL false
ONS is enabled
ONS is individually enabled on nodes: 
ONS is individually disabled on nodes: 

Start nodeapps
srvctl start nodeapps




Changing SCAN IP in RAC Cluster


Check the current SCAN IP addresses in DNS
[grid@racnode1 ~]$ nslookup racdb-scan
Server:  192.168.56.101
Address: 192.168.56.101#53

Name: racdb-scan.himvirtualdns.lab
Address: 192.168.56.121
Name: racdb-scan.himvirtualdns.lab
Address: 192.168.56.122
Name: racdb-scan.himvirtualdns.lab
Address: 192.168.56.123



Check the current SCAN VIP configuration in the cluster resources

cd $GRID_HOME/bin

[grid@racnode1 ~]$ srvctl config scan
SCAN name: racdb-scan, Network: 1
Subnet IPv4: 192.168.56.0/255.255.255.0/eth1, static
Subnet IPv6: 
SCAN 0 IPv4 VIP: 192.168.56.121
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes: 
SCAN VIP is individually disabled on nodes: 
SCAN 1 IPv4 VIP: 192.168.56.122
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes: 
SCAN VIP is individually disabled on nodes: 
SCAN 2 IPv4 VIP: 192.168.56.123
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes: 
SCAN VIP is individually disabled on nodes: 


Update the new SCAN IP addresses in the DNS server, then verify the name resolution:

[root@racnode1 grid]# nslookup racdb-scan
Server:  192.168.56.101
Address: 192.168.56.101#53

Name: racdb-scan.himvirtualdns.lab
Address: 10.1.1.51
Name: racdb-scan.himvirtualdns.lab
Address: 10.1.1.52
Name: racdb-scan.himvirtualdns.lab
Address: 10.1.1.50



Stop the SCAN listener and SCAN resources before modifying the configuration:

[root@racnode1 grid]# srvctl stop scan_listener
[root@racnode1 grid]# srvctl stop scan
[root@racnode1 grid]# srvctl status scan
SCAN VIP scan1 is enabled
SCAN VIP scan1 is not running
SCAN VIP scan2 is enabled
SCAN VIP scan2 is not running
SCAN VIP scan3 is enabled
SCAN VIP scan3 is not running


[root@racnode1 grid]# srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is not running
SCAN Listener LISTENER_SCAN2 is enabled
SCAN listener LISTENER_SCAN2 is not running
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is not running


# srvctl modify scan -n racdb-scan

Verify the change:

# srvctl config scan

[root@racnode1 grid]# srvctl config scan
SCAN name: racdb-scan, Network: 1
Subnet IPv4: 10.1.1.0/255.255.255.0/eth1, static
Subnet IPv6: 
SCAN 0 IPv4 VIP: 10.1.1.50
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes: 
SCAN VIP is individually disabled on nodes: 
SCAN 1 IPv4 VIP: 10.1.1.51
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes: 
SCAN VIP is individually disabled on nodes: 
SCAN 2 IPv4 VIP: 10.1.1.52
SCAN VIP is enabled.
SCAN VIP is individually enabled on nodes: 
SCAN VIP is individually disabled on nodes: 


Start SCAN and the SCAN listener

# srvctl start scan
# srvctl start scan_listener
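
To confirm the SCAN VIPs and SCAN listeners came back on the new addresses (a quick check; a final client-side connection test through the SCAN is also recommended):

# srvctl status scan
# srvctl status scan_listener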
