Oracle DBA question and answer
- Data distribution / volume changed, crossing an optimizer search tree decision tipping point
- Indexes / files got fragmented
- Statistics have been updated / added / dropped or have become stale and misleading due to data changes
- Windows memory utilization changed
- Transaction logs are full and not truncating, causing repeated physical file expansion
- Schema changed - index / indexed view / column / constraint added, modified or dropped, data type changed, etc.
- Trace flag settings changed
- Database or server setting changed
- Server CU level changed
- Client application session settings changed
I’d start by listing a few and see how many other (realistic) reasons other people want to add.
- A change in execution plan due to unlucky bind variable peeking with histograms
- A change in execution plan because the statistics have been slowly going out of sync with the data/query
- Special case: highly volatile tables, where the data/query/stats can go badly out of sync very quickly
- Special case: dynamic sampling used to bypass volatility problems – but sometimes the sample is an “unlucky” one.
- Special case: you had been using dynamic sampling, but someone collected stats on the objects that messed things up completely
- Special case: 12c has decided to add some SQL Directives that mean you do dynamic sampling even when there are stats on the table – the sampling takes more time and the resulting stats make things worse.
- A change in execution plan because new (and unlucky) statistics have been generated since the last time query ran.
- A change in execution plan because a few new, empty, partitions have been added to a critical table
- An unlucky change in execution plan because a small change in actual data volume (with correct stats in place) can have a dramatic impact on the shape of the plan
- Variant – the plan stays the same, but the execution mechanism changes because a table has grown and pushes Oracle into direct path reads
- Variant – as 5.1, but the change in mechanism is not due to the size of the table, but the current degree of caching of the table (see also 10)
- A change in execution plan because you had an incomplete set of hints in your SQL and your luck just ran out
- An unlucky change in data order that has a big impact on the success of subquery caching
- Special case – highly volatile tables being used as the target of existence subqueries (or “first_row” optimisation)
- Can be due to the actual placement of the “useful” data in the table compared to the optimizer’s “average” model.
- Can be due to stats and data not matching very well
- A small slippage in timing that results in the SQL suffering a direct data collision (locking / read-consistency) with some other process.
- Special case – a small slowdown resulting in the process being restarted, possibly many times, and colliding with copies of itself
- A small slippage in timing that leaves the SQL running concurrently with something that is processing unrelated data but competing for the raw (disc / CPU / PX Slave) resources.
- Interesting example – competition for the queue on an HBA, especially when the queue depth is small
- Special variation – timing problems due to month-end/year-end increasing volume to be processed and adding extra jobs
- Common variant – even without a time slip – you may simply be competing with some other system that’s sharing the same physical resources
- As for 9.3 – but someone has recently added a new application or whole new VM on the storage that you were using.
- Special case – other processes were holding a lot of PGA memory so your big hash join ran with a small workarea and spilled to disc catastrophically
- A small slippage in timing that means some other process cleared everything you wanted from the buffer cache before your SQL started running.
- RAC variant – the data is cached in another node because e.g. some other process ran on a different node
- A small slippage in timing that means a process that normally follows your SQL preceded it, and you’re spending all your time on delayed block cleanout.
- Someone dropped some indexes (or made them invalid)
- Someone rebuilt one or more indexes since the last time the SQL executed
- Someone has done a “shrink space” on a critical table since the last time you ran the SQL
- this could result in a change in execution plan due to a change in index clustering factors
- it could result in a change in physical I/O requirements because of a change in data placement – even with the same execution plan
- it could reduce the benefit of subquery-caching because of a change in data access order – even with the same execution plan
- A space management, or similar, problem suspended processing for a while – e.g. archiver stuck, temp space full for resumable session.
- You’re operating over a database link, and the problem is either in the connection, or at the far end of the link
- The problem may also be about execution plan changes that have a more extreme side effect because the plan includes a remote database
- Problems with SQL Baselines or profiles (profiles going out of date, or a new Baseline being accepted – perhaps by the 12c automatic job – that blocks a previously stable plan)
- It’s not the SQL, it’s whatever is calling the SQL (surrounding pl/sql, network to client, client data handling)
- Special case: Introduction, or modification, of firewall setup
- Special case: hardware problem increasing rate of packet retransmission
- The job was running in a resource group and a change in workload of other processes made the resource manager de-schedule this process even though its workload hadn’t changed
- The SQL hasn’t changed plan, or volume of data accessed, but it calls a pl/sql function that executes some SQL that has changed its behaviour for some reason.
Different Query Plans
Over time, your query plan may also change because statistics have changed (or not changed depending on the statistic in question). Normally, that indicates that Oracle has found a more efficient plan or that your data volumes have changed and Oracle expects a different plan would be more efficient with the new data volume. If, however, you are giving Oracle bad statistics (if, for example, you have tables that get much larger during month-end processing but you gather statistics when the tables are almost empty), you may induce Oracle to choose a very bad query plan. Depending on the version of Oracle, there are various ways to force Oracle to use the same query plan. If you can drill down and figure out what the problem with statistics is, Oracle probably provides a way to give the optimizer better statistics.
If you take a look at AWR/ASH data (if you have the appropriate licenses) or Statspack data (if your DBA has installed it), you should be able to figure out which camp your problems originate in. Are you getting different query plans for different executions? (You may need to capture a query plan from your initial benchmarks and compare it to the current plan, or increase your AWR retention to keep query plans for a few months, in order to see this.) Are you doing the same number of buffer gets over time but seeing vastly different amounts of I/O waits? Do you see a lot of contention for resources from other sessions? If so, that probably indicates that the issue is different load at different times.
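As a quick, hedged illustration (Diagnostics Pack license required; &sql_id stands in for the statement you are chasing), a query along these lines shows whether a statement has run with more than one plan and how the average elapsed time differed per plan:

select plan_hash_value,
       sum(executions_delta) as execs,
       round(sum(elapsed_time_delta)/nullif(sum(executions_delta),0)/1000,1) as avg_elapsed_ms
from   dba_hist_sqlstat
where  sql_id = '&sql_id'
group  by plan_hash_value;

DBMS_XPLAN.DISPLAY_AWR can then print each captured plan for a side-by-side comparison.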
Oracle 19c Grid/clusterware patching commands
Grid Infrastructure preconfigures and registers all required resources for all products available for these products and components, but only activates them when you choose to add them. As a result, some components may be listed as OFFLINE after the installation of Oracle Grid Infrastructure. Resources listed as TARGET:OFFLINE and STATE:OFFLINE do not need to be monitored; they represent components that are registered but not enabled, so they do not use any system resources. If an Oracle product or component is installed on the system and it requires a particular resource to be online, the software prompts you to activate the required offline resource.
crsctl stat res -t
nft2x015[/home/oracrs]$
nft2x015[/home/oracrs]$ crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
1 ONLINE ONLINE nft2x015 Started
ora.cluster_interconnect.haip
1 ONLINE ONLINE nft2x015
ora.crf
1 ONLINE ONLINE nft2x015
ora.crsd
1 ONLINE ONLINE nft2x015
ora.cssd
1 ONLINE ONLINE nft2x015
ora.cssdmonitor
1 ONLINE ONLINE nft2x015
ora.ctssd
1 ONLINE ONLINE nft2x015 OBSERVER
ora.diskmon
1 OFFLINE OFFLINE
ora.drivers.acfs
1 ONLINE ONLINE nft2x015
ora.evmd
1 ONLINE ONLINE nft2x015
ora.gipcd
1 ONLINE ONLINE nft2x015
ora.gpnpd
1 ONLINE ONLINE nft2x015
ora.mdnsd
1 ONLINE ONLINE nft2x015
kasarla01[+ASM2]_grid> crsctl status resource ora.ons
NAME=ora.ons
TYPE=ora.ons.type
TARGET=ONLINE , ONLINE
STATE=OFFLINE on kasarla01, OFFLINE on kasarla02
kasarla01[+ASM2]_grid>
An Oracle RAC Service is in UNKNOWN state on instance 1 and it is impossible to stop or start it with srvctl.
crs_stat -u ora.DBName.My-ServiceName.DBName1.srv
NAME=ora.DBName.My-ServiceName.DBName1.srv
TYPE=application
TARGET=ONLINE
STATE=UNKNOWN on Host1
Solution:
The UNKNOWN state can often be resolved by bringing the resource offline using crs_stop.
If the resource is not in UNKNOWN state, use srvctl to stop it and do not use crs_stop (as mentioned in Oracle Support note 845709.1, there is a risk of corrupting the OCR).
Make sure that you have a recent backup of your OCR:
ocrconfig -showbackup
Restore the instance resource to OFFLINE with this command:
crs_stop ora.DBName.My-ServiceName.DBName1.srv
Verify the status is now OFFLINE:
crs_stat -u ora.DBName.My-ServiceName.DBName1.srv
NAME=ora.DBName.My-ServiceName.DBName1.srv
TYPE=application
TARGET=OFFLINE
STATE=OFFLINE
Bring the service online with srvctl:
srvctl start service -d DBName -i DBName1 -s My-ServiceName
Verify that the service is ONLINE:
crs_stat -u ora.DBName.My-ServiceName.DBName1.srv
NAME=ora.DBName.My-ServiceName.DBName1.srv
TYPE=application
TARGET=ONLINE
STATE=ONLINE on Host1
Correcting our RAC problem
We have seen the CRS commands and syntax; now let's look at the practical side of things.
I was having some trouble while running my RAC on RHEL4 (I think the problem primarily arose because I supplied my virtual machines with inadequate memory). When I noticed that I was getting some alerts on my SoRAC tool, I looked at the status of my RAC.
[oracle@vmora01rh4 ~]$ cd /u01/app/oracle/product/10.2.0/crs/bin
[oracle@vmora01rh4 bin]$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.fokerac.db application ONLINE ONLINE vmora02rh4
ora....c1.inst application ONLINE ONLINE vmora01rh4
ora....c2.inst application OFFLINE UNKNOWN vmora02rh4
ora....serv.cs application ONLINE ONLINE vmora02rh4
ora....ac1.srv application ONLINE ONLINE vmora01rh4
ora....ac2.srv application ONLINE OFFLINE
ora....SM1.asm application ONLINE ONLINE vmora01rh4
ora....H4.lsnr application ONLINE ONLINE vmora01rh4
ora....rh4.gsd application ONLINE UNKNOWN vmora01rh4
ora....rh4.ons application ONLINE UNKNOWN vmora01rh4
ora....rh4.vip application ONLINE ONLINE vmora01rh4
ora....SM2.asm application OFFLINE UNKNOWN vmora02rh4
ora....H4.lsnr application OFFLINE UNKNOWN vmora02rh4
ora....rh4.gsd application ONLINE UNKNOWN vmora02rh4
ora....rh4.ons application OFFLINE UNKNOWN vmora02rh4
ora....rh4.vip application ONLINE ONLINE vmora02rh4
As you can see above, some of the applications are UNKNOWN or OFFLINE, neither of which is good for my RAC.
The plain crs_stat command gives you the full names of the applications, which you will need in order to shut some of them down manually before shutting the whole cluster down and restarting it.
[oracle@vmora01rh4 bin]$ crs_stat
NAME=ora.fokerac.db
TYPE=application
TARGET=ONLINE
STATE=ONLINE on vmora02rh4
NAME=ora.fokerac.fokerac1.inst
TYPE=application
TARGET=ONLINE
STATE=ONLINE on vmora01rh4
NAME=ora.fokerac.fokerac2.inst
TYPE=application
TARGET=OFFLINE
STATE=OFFLINE
NAME=ora.fokerac.fokeserv.cs
TYPE=application
TARGET=ONLINE
STATE=ONLINE on vmora02rh4
NAME=ora.fokerac.fokeserv.fokerac1.srv
TYPE=application
TARGET=ONLINE
STATE=ONLINE on vmora01rh4
NAME=ora.fokerac.fokeserv.fokerac2.srv
TYPE=application
TARGET=ONLINE
STATE=OFFLINE
NAME=ora.vmora01rh4.ASM1.asm
TYPE=application
TARGET=ONLINE
STATE=ONLINE on vmora01rh4
NAME=ora.vmora01rh4.LISTENER_VMORA01RH4.lsnr
TYPE=application
TARGET=ONLINE
STATE=ONLINE on vmora01rh4
NAME=ora.vmora01rh4.gsd
TYPE=application
TARGET=ONLINE
STATE=UNKNOWN on vmora01rh4
NAME=ora.vmora01rh4.ons
TYPE=application
TARGET=ONLINE
STATE=UNKNOWN on vmora01rh4
NAME=ora.vmora01rh4.vip
TYPE=application
TARGET=ONLINE
STATE=ONLINE on vmora01rh4
NAME=ora.vmora02rh4.ASM2.asm
TYPE=application
TARGET=OFFLINE
STATE=UNKNOWN on vmora02rh4
NAME=ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr
TYPE=application
TARGET=OFFLINE
STATE=UNKNOWN on vmora02rh4
NAME=ora.vmora02rh4.gsd
TYPE=application
TARGET=ONLINE
STATE=UNKNOWN on vmora02rh4
NAME=ora.vmora02rh4.ons
TYPE=application
TARGET=OFFLINE
STATE=UNKNOWN on vmora02rh4
NAME=ora.vmora02rh4.vip
TYPE=application
TARGET=ONLINE
STATE=ONLINE on vmora02rh4
I could also have attempted to stop them all using crs_stop -all, but it normally throws enough errors to force you to do it manually, one by one.
[oracle@vmora01rh4 bin]$ crs_stop -all
Attempting to stop `ora.vmora01rh4.ons` on member `vmora01rh4`
Attempting to stop `ora.vmora02rh4.ons` on member `vmora02rh4`
`ora.vmora02rh4.ons` on member `vmora02rh4` has experienced an unrecoverable failure.
Human intervention required to resume its availability.
Stop of `ora.vmora01rh4.ons` on member `vmora01rh4` succeeded.
Attempting to stop `ora.vmora01rh4.ASM1.asm` on member `vmora01rh4`
Attempting to stop `ora.fokerac.fokerac2.inst` on member `vmora02rh4`
`ora.fokerac.fokerac2.inst` on member `vmora02rh4` has experienced an unrecoverable failure.
Human intervention required to resume its availability.
Attempting to stop `ora.vmora02rh4.ASM2.asm` on member `vmora02rh4`
`ora.vmora02rh4.ASM2.asm` on member `vmora02rh4` has experienced an unrecoverable failure.
Human intervention required to resume its availability.
Attempting to stop `ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr` on member `vmora02rh4`
`ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr` on member `vmora02rh4` has experienced an unrecoverable failure.
Human intervention required to resume its availability.
Attempting to stop `ora.fokerac.fokerac2.inst` on member `vmora02rh4`
`ora.fokerac.fokerac2.inst` on member `vmora02rh4` has experienced an unrecoverable failure.
Human intervention required to resume its availability.
Attempting to stop `ora.vmora02rh4.ASM2.asm` on member `vmora02rh4`
`ora.vmora02rh4.ASM2.asm` on member `vmora02rh4` has experienced an unrecoverable failure.
Human intervention required to resume its availability.
Attempting to stop `ora.vmora02rh4.vip` on member `vmora02rh4`
Stop of `ora.vmora02rh4.vip` on member `vmora02rh4` succeeded.
Stop of `ora.vmora01rh4.ASM1.asm` on member `vmora01rh4` succeeded.
Attempting to stop `ora.vmora01rh4.LISTENER_VMORA01RH4.lsnr` on member `vmora01rh4`
Stop of `ora.vmora01rh4.LISTENER_VMORA01RH4.lsnr` on member `vmora01rh4` succeeded.
Attempting to stop `ora.vmora01rh4.vip` on member `vmora01rh4`
Stop of `ora.vmora01rh4.vip` on member `vmora01rh4` succeeded.
CRS-0216: Could not stop resource 'ora.vmora02rh4.ASM2.asm'.
CRS-0216: Could not stop resource 'ora.vmora02rh4.ons'.
CRS-0216: Could not stop resource 'ora.vmora02rh4.vip'.
For that very reason, we will go ahead and do it our way. First, we need to stop our instances.
[oracle@vmora01rh4 bin]$ srvctl stop instance -d fokerac -i fokerac1
[oracle@vmora01rh4 bin]$ srvctl stop instance -d fokerac -i fokerac2
Check our status
[oracle@vmora01rh4 bin]$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.fokerac.db application OFFLINE OFFLINE
ora....c1.inst application OFFLINE OFFLINE
ora....c2.inst application OFFLINE OFFLINE
ora....serv.cs application ONLINE UNKNOWN vmora02rh4
ora....ac1.srv application OFFLINE OFFLINE
ora....ac2.srv application OFFLINE OFFLINE
ora....SM1.asm application OFFLINE OFFLINE
ora....H4.lsnr application OFFLINE OFFLINE
ora....rh4.gsd application ONLINE UNKNOWN vmora01rh4
ora....rh4.ons application OFFLINE OFFLINE
ora....rh4.vip application OFFLINE OFFLINE
ora....SM2.asm application OFFLINE UNKNOWN vmora02rh4
ora....H4.lsnr application OFFLINE UNKNOWN vmora02rh4
ora....rh4.gsd application ONLINE UNKNOWN vmora02rh4
ora....rh4.ons application OFFLINE UNKNOWN vmora02rh4
ora....rh4.vip application OFFLINE OFFLINE
Stop the service
[oracle@vmora01rh4 bin]$ srvctl stop service -d fokerac -s fokeserv
Check status again
[oracle@vmora01rh4 bin]$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.fokerac.db application OFFLINE OFFLINE
ora....c1.inst application OFFLINE OFFLINE
ora....c2.inst application OFFLINE OFFLINE
ora....serv.cs application OFFLINE OFFLINE
ora....ac1.srv application OFFLINE OFFLINE
ora....ac2.srv application OFFLINE OFFLINE
ora....SM1.asm application OFFLINE OFFLINE
ora....H4.lsnr application OFFLINE OFFLINE
ora....rh4.gsd application ONLINE UNKNOWN vmora01rh4
ora....rh4.ons application OFFLINE OFFLINE
ora....rh4.vip application OFFLINE OFFLINE
ora....SM2.asm application OFFLINE UNKNOWN vmora02rh4
ora....H4.lsnr application OFFLINE UNKNOWN vmora02rh4
ora....rh4.gsd application ONLINE UNKNOWN vmora02rh4
ora....rh4.ons application OFFLINE UNKNOWN vmora02rh4
ora....rh4.vip application OFFLINE OFFLINE
OK, so we need to stop those applications now.
[oracle@vmora01rh4 bin]$ crs_stop ora.vmora01rh4.gsd
Attempting to stop `ora.vmora01rh4.gsd` on member `vmora01rh4`
Stop of `ora.vmora01rh4.gsd` on member `vmora01rh4` succeeded.
[oracle@vmora01rh4 bin]$ crs_stop ora.vmora02rh4.ASM2.asm
Attempting to stop `ora.vmora02rh4.ASM2.asm` on member `vmora02rh4`
Stop of `ora.vmora02rh4.ASM2.asm` on member `vmora02rh4` succeeded.
[oracle@vmora01rh4 bin]$ crs_stop ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr
Attempting to stop `ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr` on member `vmora02rh4`
Stop of `ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr` on member `vmora02rh4` succeeded.
[oracle@vmora01rh4 bin]$ crs_stop ora.vmora02rh4.gsd
Attempting to stop `ora.vmora02rh4.gsd` on member `vmora02rh4`
Stop of `ora.vmora02rh4.gsd` on member `vmora02rh4` succeeded.
[oracle@vmora01rh4 bin]$ crs_stop ora.vmora02rh4.ons
Attempting to stop `ora.vmora02rh4.ons` on member `vmora02rh4`
Stop of `ora.vmora02rh4.ons` on member `vmora02rh4` succeeded.
Check status
[oracle@vmora01rh4 bin]$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.fokerac.db application OFFLINE OFFLINE
ora....c1.inst application OFFLINE OFFLINE
ora....c2.inst application OFFLINE OFFLINE
ora....serv.cs application OFFLINE OFFLINE
ora....ac1.srv application OFFLINE OFFLINE
ora....ac2.srv application OFFLINE OFFLINE
ora....SM1.asm application OFFLINE OFFLINE
ora....H4.lsnr application OFFLINE OFFLINE
ora....rh4.gsd application OFFLINE OFFLINE
ora....rh4.ons application OFFLINE OFFLINE
ora....rh4.vip application OFFLINE OFFLINE
ora....SM2.asm application OFFLINE OFFLINE
ora....H4.lsnr application OFFLINE OFFLINE
ora....rh4.gsd application OFFLINE OFFLINE
ora....rh4.ons application OFFLINE OFFLINE
ora....rh4.vip application OFFLINE OFFLINE
OK, all set. Now let's bring them all online.
[oracle@vmora01rh4 bin]$ crs_start -all
Attempting to start `ora.vmora02rh4.vip` on member `vmora02rh4`
Attempting to start `ora.vmora01rh4.vip` on member `vmora01rh4`
Start of `ora.vmora02rh4.vip` on member `vmora02rh4` succeeded.
Start of `ora.vmora01rh4.vip` on member `vmora01rh4` succeeded.
Attempting to start `ora.vmora01rh4.ASM1.asm` on member `vmora01rh4`
Attempting to start `ora.vmora02rh4.ASM2.asm` on member `vmora02rh4`
Start of `ora.vmora02rh4.ASM2.asm` on member `vmora02rh4` succeeded.
Attempting to start `ora.fokerac.fokerac2.inst` on member `vmora02rh4`
Start of `ora.vmora01rh4.ASM1.asm` on member `vmora01rh4` succeeded.
Attempting to start `ora.fokerac.fokerac1.inst` on member `vmora01rh4`
Start of `ora.fokerac.fokerac2.inst` on member `vmora02rh4` succeeded.
Attempting to start `ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr` on member `vmora02rh4`
Start of `ora.fokerac.fokerac1.inst` on member `vmora01rh4` succeeded.
Attempting to start `ora.vmora01rh4.LISTENER_VMORA01RH4.lsnr` on member `vmora01rh4`
Start of `ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr` on member `vmora02rh4` succeeded.
Start of `ora.vmora01rh4.LISTENER_VMORA01RH4.lsnr` on member `vmora01rh4` succeeded.
CRS-1002: Resource 'ora.vmora02rh4.ons' is already running on member 'vmora02rh4'
CRS-1002: Resource 'ora.vmora01rh4.ons' is already running on member 'vmora01rh4'
Attempting to start `ora.fokerac.fokeserv.fokerac1.srv` on member `vmora01rh4`
Attempting to start `ora.vmora01rh4.gsd` on member `vmora01rh4`
Attempting to start `ora.fokerac.db` on member `vmora01rh4`
Attempting to start `ora.fokerac.fokeserv.fokerac2.srv` on member `vmora02rh4`
Attempting to start `ora.fokerac.fokeserv.cs` on member `vmora02rh4`
Attempting to start `ora.vmora02rh4.gsd` on member `vmora02rh4`
Start of `ora.fokerac.fokeserv.fokerac2.srv` on member `vmora02rh4` succeeded.
Start of `ora.fokerac.fokeserv.cs` on member `vmora02rh4` succeeded.
Start of `ora.fokerac.db` on member `vmora01rh4` succeeded.
Start of `ora.vmora02rh4.gsd` on member `vmora02rh4` succeeded.
Start of `ora.vmora01rh4.gsd` on member `vmora01rh4` succeeded.
Start of `ora.fokerac.fokeserv.fokerac1.srv` on member `vmora01rh4` succeeded.
CRS-0223: Resource 'ora.vmora01rh4.ons' has placement error.
CRS-0223: Resource 'ora.vmora02rh4.ons' has placement error.
Don't worry about those errors; the ONS resources were already running (see the CRS-1002 messages above) and simply did not report back in the order the clusterware started them.
[oracle@vmora01rh4 bin]$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.fokerac.db application ONLINE ONLINE vmora01rh4
ora....c1.inst application ONLINE ONLINE vmora01rh4
ora....c2.inst application ONLINE ONLINE vmora02rh4
ora....serv.cs application ONLINE ONLINE vmora02rh4
ora....ac1.srv application ONLINE ONLINE vmora01rh4
ora....ac2.srv application ONLINE ONLINE vmora02rh4
ora....SM1.asm application ONLINE ONLINE vmora01rh4
ora....H4.lsnr application ONLINE ONLINE vmora01rh4
ora....rh4.gsd application ONLINE ONLINE vmora01rh4
ora....rh4.ons application ONLINE ONLINE vmora01rh4
ora....rh4.vip application ONLINE ONLINE vmora01rh4
ora....SM2.asm application ONLINE ONLINE vmora02rh4
ora....H4.lsnr application ONLINE ONLINE vmora02rh4
ora....rh4.gsd application ONLINE ONLINE vmora02rh4
ora....rh4.ons application ONLINE ONLINE vmora02rh4
ora....rh4.vip application ONLINE ONLINE vmora02rh4
[oracle@vmora01rh4 bin]$
$ <GRID_HOME>/bin/crsctl stat res ora.racdb.db -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.racdb.db
1 OFFLINE OFFLINE Corrupted Controlfile
2 ONLINE ONLINE rachost2 Open
3 ONLINE ONLINE rachost3 Open
$ <GRID_HOME>/bin/srvctl status database -d racdb
Instance racdb1 is not running on node rachost1
Instance racdb2 is running on node rachost2
Instance racdb3 is running on node rachost3
$ sqlplus / as sysdba
...
SQL> select INSTANCE_NAME, STATUS from gv$instance;
INSTANCE_NAME STATUS
---------------- ------------
racdb1 OPEN
racdb2 OPEN
racdb3 OPEN
gpnpd: The Grid Plug and Play daemon (GPnPD) provides access to the Grid Plug and Play profile and coordinates updates to the profile among the nodes of the cluster to ensure that all the nodes have the most recent profile.
The cluster must have the following addresses configured:

- A public IP address for each node, with the following characteristics:
  - Static IP address
  - Configured before installation for each node, and resolvable to that node before installation
  - On the same subnet as all other public IP, VIP, and SCAN addresses
- A virtual IP address for each node, with the following characteristics:
  - Static IP address
  - Configured before installation for each node, but not currently in use
  - On the same subnet as all other public IP addresses, VIP addresses, and SCAN addresses
- A Single Client Access Name (SCAN) for the cluster, with the following characteristics:
  - Three static IP addresses configured on the domain name server (DNS) before installation, so that the three IP addresses are associated with the name provided as the SCAN and all three addresses are returned in random order by the DNS to the requestor
  - Configured before installation in the DNS to resolve to addresses that are not currently in use
  - Given a name that does not begin with a numeral
  - On the same subnet as all other public IP addresses, VIP addresses, and SCAN addresses
- A private IP address for each node, with the following characteristics:
  - Static IP address
  - Configured before installation, but on a separate private network with its own subnet, that is not resolvable except by other cluster member nodes, to improve interconnect performance
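A quick sanity check on the SCAN requirement, using a hypothetical name, is to resolve it and confirm that three addresses come back (their order should vary between calls):

nslookup mycluster-scan.example.com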
The maximum UDP socket receive buffer size varies according to the operating system; the upper limit may be as small as 128 KB or as large as 1 GB. The default settings are adequate for the majority of customers, but this is one of the first settings to examine if you are seeing lost blocks. Three significant conditions indicate that it may be necessary to increase the allocated buffer size:
- the MTU size has been increased
- netstat reports errors, or excessive fragmentation and/or reassembly of packets is observed
- ifconfig reports dropped packets or overflows
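As a minimal sketch on Linux (the 4 MB value is purely illustrative; check your OS documentation and Oracle's platform notes for the right numbers):

# Show the current default and maximum socket receive buffer sizes
sysctl net.core.rmem_default net.core.rmem_max
# Example: raise the maximum and reload the settings
echo "net.core.rmem_max = 4194304" >> /etc/sysctl.conf
sysctl -p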
In earlier releases, to minimize node evictions due to frequent private-NIC down events, bonding, trunking, teaming, or similar technology was required to provide redundant network connections between the nodes. Oracle Clusterware now provides an integrated solution, Redundant Interconnect Usage, which supports IP failover.
Multiple private network adapters can be defined either during the installation phase or afterward using oifcfg. The ora.cluster_interconnect.haip resource picks up a highly available virtual IP (the HAIP) from the link-local (Linux/Unix) range 169.254.0.0/16 and assigns one to each private network. With HAIP, interconnect traffic is by default load-balanced across all active interconnect interfaces. If a private interconnect interface fails or becomes non-communicative, Clusterware transparently moves the corresponding HAIP address to one of the remaining functional interfaces.
Grid Infrastructure can activate a maximum of four private network adapters at a time, even if more are defined. The number of HAIP addresses is determined by how many private network adapters are active when Grid Infrastructure comes up on the first node in the cluster. If there is only one active private network, Grid creates one HAIP; if two, Grid creates two; and so on. The number of HAIPs will not increase beyond four even if more private network adapters are activated, and a restart of Clusterware on all nodes is required for new adapters to become effective.
This functionality is available starting with Oracle Database 11g Release 2 (11.2.0.2). If you use the Oracle Clusterware Redundant Interconnect feature, you must use IPv4 addresses for the interfaces.
When you define multiple interfaces, Oracle Clusterware creates from one to four highly available IP (HAIP) addresses. Oracle RAC and Oracle Automatic Storage Management (Oracle ASM) instances use these interface addresses to ensure highly available, load-balanced interface communication between nodes. The installer enables Redundant Interconnect Usage to provide a high-availability private network.
By default, Oracle Grid Infrastructure software uses all of the HAIP addresses for private network communication, providing load-balancing across the set of interfaces you identify for the private network. If a private interconnect interface fails or becomes noncommunicative,
Oracle Clusterware transparently moves the corresponding HAIP address to one of the remaining functional interfaces.
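A minimal sketch with oifcfg (the interface name and subnet are placeholders for your environment):

# List the interfaces currently registered with Clusterware
oifcfg getif
# Register an additional private interconnect; after a Clusterware
# restart, an extra HAIP can be started on it
oifcfg setif -global eth2/192.168.10.0:cluster_interconnect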
If you do not use GNS, the SCAN should be defined in the DNS to resolve to the three addresses assigned to that name.
This should be done before you install Oracle Grid Infrastructure.
The SCAN and its associated IP addresses provide a stable name for clients to use for connections, independent of the nodes that make up the cluster.
SCANs function like a cluster alias. However, SCANs are resolved on any node in the cluster, so unlike a VIP address for a node, clients connecting to the SCAN no longer require updated VIP addresses as nodes are added to or removed from the cluster. Because the SCAN addresses
resolve to the cluster, rather than to a node address in the cluster, nodes can be added to or removed from the cluster without affecting the SCAN address configuration.
During installation, listeners are created on each node for the SCAN IP addresses. Oracle Clusterware routes application requests to the cluster SCAN to the least loaded instance providing the service. SCAN listeners can run on any node in the cluster. SCANs provide location independence for databases, so that the client configuration does not have to depend on which nodes run a particular database. Instances register with SCAN listeners only as remote listeners. Upgraded databases register with SCAN listeners as remote listeners, and also continue to register with all other listeners.
If you specify a GNS domain during installation, the SCAN defaults to clustername-scan.GNS_domain. If a GNS domain is not specified at installation, the SCAN defaults to clustername-scan.current_domain.
Finally, the client establishes a connection to the service through the listener on the node where service is offered. All these actions take place transparently to the client without any explicit configuration required in the client.
- An important service provided by Oracle Clusterware is node fencing.
- Node fencing is used to evict nonresponsive hosts from the cluster, preventing data corruption.

Allowing affected nodes to remain in the cluster increases the probability of data corruption. Traditionally, Oracle Clusterware has used a STONITH (Shoot The Other Node In The Head)-comparable fencing algorithm to ensure data integrity in cases in which cluster integrity is endangered and split-brain scenarios need to be prevented. For Oracle Clusterware this means that a local process enforces the removal of one or more nodes from the cluster (fencing). This approach traditionally involved a forced "fast" reboot of the offending node. A fast reboot is a shutdown and restart procedure that does not wait for any I/O to finish or for file systems to synchronize on shutdown. Starting with Oracle Clusterware 11g Release 2 (11.2.0.2), this mechanism has been changed to prevent such a reboot as much as possible by introducing rebootless node fencing.
Now, when a decision is made to evict a node from the cluster, Oracle Clusterware will first attempt to shut down all resources on the machine that was chosen to be the subject of an eviction. Specifically, I/O generating processes are killed and Oracle Clusterware ensures that those processes are completely stopped before continuing.
If all resources can be stopped and all I/O generating processes can be killed, Oracle Clusterware will shut itself down on the respective node, but will attempt to restart after the stack has been stopped.
If, for some reason, not all resources can be stopped or I/O generating processes cannot be stopped completely, Oracle Clusterware will still perform a reboot.
In addition to this traditional fencing approach, Oracle Clusterware now supports a new fencing mechanism based on remote node termination. The concept uses an external mechanism capable of restarting a problem node without cooperation either from Oracle Clusterware or from the
operating system running on that node. To provide this capability, Oracle Clusterware supports the Intelligent Management Platform Interface specification (IPMI), a standard management protocol.
To use IPMI and to be able to remotely fence a server in the cluster, the server must be equipped with a Baseboard Management Controller (BMC) that supports IPMI over a local area network (LAN). Once this hardware is in place in every server of the cluster, IPMI can be activated either during the installation of the Oracle Grid Infrastructure or afterward, as a post-installation management task, using CRSCTL.
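For the post-installation route, a hedged sketch with CRSCTL (run per node as the Grid user; the BMC address and account name are placeholders):

# Record the BMC address of this node in the Clusterware configuration
crsctl set css ipmiaddr 192.168.20.15
# Store the IPMI administrator credentials (prompts for the password)
crsctl set css ipmiadmin ipmiuser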
Oracle Clusterware continues to support third-party cluster solutions. For certified solutions,
Oracle Clusterware will integrate with the third-party cluster solution in such a way that node membership decisions are deferred to the third-party cluster solution. For Oracle RAC environments, it is worth noting that Oracle Clusterware is mandatory and provides all required functionality; no other third-party solution should therefore be required.
Local partitioned indexes
Local partitioned indexes allow the DBA to take individual partitions of a table and its indexes offline for maintenance (or reorganization) without affecting the other partitions and index partitions of the table.
In a local partitioned index, the key values and the number of index partitions match those of the partitions in the base table.
CREATE INDEX year_idx
ON all_fact (order_date)
LOCAL
(PARTITION name_idx1,
PARTITION name_idx2,
PARTITION name_idx3);
Oracle automatically equipartitions the index based upon the number of partitions in the indexed table. For example, in the above definition, if we had specified four index partitions against a three-partition all_fact, the CREATE INDEX would fail since the partition counts do not match. This equipartitioning also makes index maintenance easier, since a single partition can be taken offline and the index partition rebuilt without affecting the other partitions in the table.
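Using the index created above, for instance, one partition can be rebuilt while the others stay available:

ALTER INDEX year_idx REBUILD PARTITION name_idx2;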
A global partitioned index is used for all other indexes except the one that supports the table partition key. Global partitioned indexes are useful for OLTP (online transaction processing) applications, where fewer index probes are required than with local partitioned indexes. In the global index partition scheme, the index is harder to maintain, since the index may span partitions in the base table.
For example, when a table partition is dropped as part of a reorganization, the entire global index will be affected. When defining a global partitioned index, the DBA has complete freedom to specify as many partitions for the index as desired.
Now that we understand the concept, let’s examine the Oracle CREATE INDEX syntax for a globally partitioned index:
CREATE INDEX item_idx
ON all_fact (item_nbr)
GLOBAL PARTITION BY RANGE (item_nbr)
(PARTITION city_idx1 VALUES LESS THAN (100),
PARTITION city_idx2 VALUES LESS THAN (200),
PARTITION city_idx3 VALUES LESS THAN (300),
PARTITION city_idx4 VALUES LESS THAN (400),
PARTITION city_idx5 VALUES LESS THAN (MAXVALUE));
Here, we see that the item index has been defined with five partitions, each containing a subset of the index range values. Note that it is irrelevant that the base table is in three partitions. In fact, it is acceptable to create a global partitioned index on a table that does not have any partitioning.
Oracle offers client-side failover capabilities for RAC, including the following:
- Transparent Application Failover (TAF)
- Fast Connection Failover (FCF)
TAF can be configured on either the client side or the server side; the latter is the preferred way of configuring it. Note that this feature requires the use of the OCI libraries, so thin-client-only applications won't be able to benefit from it. With the introduction of the Oracle Instant Client, this problem can be alleviated somewhat by switching to the correct driver.
TAF can operate in two ways:
it can either restore a session or re-execute a select statement in the
event of a node failure.
While this feature has been around for a long time,
Oracle’s net manager configuration assistant doesn’t provide support for setting up client-side TAF. Also, TAF isn’t the most elegant way of handling node failures because any in-flight transactions will be rolled back—
TAF can resume running select statements only.
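For reference, a hypothetical client-side TAF entry in tnsnames.ora might look like this (host and service names are placeholders; SELECT failover with the BASIC method):

RACDB_TAF =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = mycluster-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = myservice)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 30)(DELAY = 5))
    )
  )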
When a RAC service failure is propagated to the JDBC application, the database has already rolled back the local transaction. The cache manager then cleans up all invalid connections. When an application holding an invalid connection tries to do work through that connection, it receives a SQLException ORA-17008, Closed Connection. The application has to handle the exception and reconnect.
Configuring ONS for Fast Connection Failover
In order for Fast Connection Failover to work, you must configure ONS correctly. ONS is shipped as part of Oracle Database 11g. This section covers the following topics:
ONS Configuration File
Client-Side ONS Configuration
Server-Side ONS Configuration Using racgons
Remote ONS Subscription
ONS Configuration File
ONS configuration is controlled by the ONS configuration file, ORACLE_HOME/opmn/conf/ons.config. This file tells the ONS daemon details about how it should behave and whom it should talk to. Configuration information within ons.config is defined in simple name/value pairs. There are three values that should always be configured within ons.config. The first is localport, the port that ONS binds to on the localhost interface to talk to local clients. An example of the localport configuration is the following:
localport=4100
The second is remoteport, the port that ONS binds to on all interfaces to talk to other ONS daemons. An example of the remoteport configuration is the following:
remoteport=4200
The nodes listed in the nodes line correspond to the individual nodes in the RAC instance. Listing the nodes ensures that the middle-tier node can communicate with the RAC nodes. At least one middle-tier node and one node in the RAC instance must be configured to see one another. As long as one node on each side is aware of the other, all nodes are visible. You need not list every single cluster and middle-tier node in the ONS config file of each Oracle RAC node. In particular, if one ONS config file cluster node is aware of the middle tier, then all nodes in the cluster are aware of it.
An example of the nodes configuration is the following:
nodes=myhost.example.com:4200,123.123.123.123:4200
Optional parameters control logging and SSL, for example:
loglevel=3
logfile=/private/oraclehome/opmn/logs/myons.log
walletfile=/private/oraclehome/opmn/conf/ssl.wlt/default
The ons.config file allows blank lines and comments on lines that begin with the number sign (#).
Client-Side ONS Configuration
You can access the client-side ONS through ORACLE_HOME/opmn. On the client side, there are two ways to set up ONS: running an ONS daemon on the client, or remote ONS configuration (subscribing directly to the ONS daemons on the RAC nodes).
Example 26-1 illustrates how a sample configuration file may look.
# This is an example ons.config file
#
# The first three values are required
localport=4100
remoteport=4200
nodes=racnode1.example.com:4200,racnode2.example.com:4200
Command | Effect | Output
---|---|---
start | Starts the ONS daemon | onsctl: ons started
stop | Stops the ONS daemon | onsctl: shutting down ons daemon...
ping | Verifies whether or not the ONS daemon is running | ons is running ...
reconfig | Triggers a reload of the ONS configuration without shutting down the ONS daemon |
help | Prints a help summary message for onsctl |
detailed | Prints a detailed help message for onsctl |
Server-Side ONS Configuration Using racgons
You can access the server-side ONS through ORA_CRS_HOME/opmn. You configure the server side by using racgons to add the middle-tier node information to OCR. This command is found in ORA_CRS_HOME/bin/racgons. Before using racgons, you must edit ons.config to set useocr=on.
What exactly is the use of FCF?
What FCF provides is very fast notification of the failure and the ability to reconnect immediately using the same URL. When a RAC node fails, the application will receive an exception. The application has to handle the exception and reconnect.
The JDBC driver does not re-target existing connections. If a node fails, the application must close the existing connection and get a new one. The way the application knows that the node failed is by getting an exception. There is more than one ORA error that can be thrown when a node fails, and the application must be able to deal with them all.
An application may call isFatalConnectionError() API on the OracleConnectionCacheManager to determine if the SQLException caught is fatal.
If the return value of this API is true, we need to retry the getConnection call on the DataSource.
How do we use FCF with JDBC driver?
In order to use FCF with JDBC, the following things must be done:
Configure and start ONS. If ONS is not correctly set up, implicit connection cache creation fails and an ONSException is thrown at the first getConnection() request.
See the "Configuring ONS" section under "Using Fast Connection Failover" in the Oracle Universal Connection Pool for JDBC Developer's Guide.
FCF is now configured through a pool-enabled data source and is tightly integrated with UCP. The FCF enabled through the Implicit Connection Cache as was used in 10g and 11g R1 is now deprecated.
Set the FastConnectionFailoverEnabled property before making the first getConnection() request to an OracleDataSource. When FastConnection Failover is enabled, the failover applies to all connections in the pool.
Use a service name rather than a SID when setting the OracleDataSource url property.
TAF is always active and does not have to be set.
TAF cannot be used with the thin driver.
The possible TAF failover types are:
- Session failover
- Select failover
- None (default)
The following are the possible failover types in the OracleOCIFailover interface:
FO_SESSION
Is equivalent to FAILOVER_MODE=SESSION in the tnsnames.ora file CONNECT_DATA flags. This means that only the user session is re-authenticated on the server-side while open cursors in the OCI application need to be re-executed.
FO_SELECT
Is equivalent to FAILOVER_MODE=SELECT in tnsnames.ora file CONNECT_DATA flags. This means that not only the user session is re-authenticated on the server-side, but open cursors in the OCI can continue fetching. This implies that the client-side logic maintains fetch-state of each open cursor.
FO_NONE
Is equivalent to FAILOVER_MODE=NONE in the tnsnames.ora file CONNECT_DATA flags. This is the default, in which no failover functionality is used. It can also be explicitly specified to prevent failover from happening. Additionally, FO_TYPE_UNKNOWN implies that a bad failover type was returned from the OCI driver.
Failover Methods
In addition to the failover type, a failover method can be specified. The failover method determines how the failover works; the following options are available:
- Basic
- Preconnect
With the basic method, connections are re-established only at failover time. If you consider using this approach, you should test for potential performance degradation during the design stage.
The preconnect option is slightly more difficult to configure. When you specify the preconnect parameter, the client is instructed to preconnect a session to a backup instance to speed up session failover. You need to bear in mind that these preconnections increase the number of sessions to the cluster. In addition, you also need to define what the backup connection should be.
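As a sketch (all aliases and hosts are placeholders), the BACKUP parameter names the alias against which the shadow session is pre-established:

RACDB1 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = myservice)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = PRECONNECT)(BACKUP = RACDB2))
    )
  )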
Hugepages and Large Pages
If you run an Oracle Database on a Linux server with more than 16 GB physical memory and your System Global Area (SGA) is greater than 8 GB, you should configure HugePages. Oracle promises better performance by doing this. A HugePages configuration means that the Linux kernel can handle "large pages", as Oracle generically calls them: 4 MB on x86, 2 MB on x86_64, and 256 MB on IA64, instead of the standard 4 KB on x86 and x86_64 or 16 KB on IA64. Bigger pages mean that the system uses fewer page tables and manages fewer mappings, which reduces the effort for their management and access.
However, there is a limitation on the Oracle side, because Automatic Memory Management (AMM) does not support HugePages. If you already use AMM and MEMORY_TARGET is set, you have to disable it and switch back to Automatic Shared Memory Management (ASMM); that means setting SGA_TARGET and PGA_AGGREGATE_TARGET. There is also a kernel feature called Transparent HugePages (THP), shipped since Red Hat Enterprise Linux 6 and its derivatives, which should be disabled as well. Oracle as well as Red Hat recommend disabling Transparent HugePages; see step 5 (Change Server Configuration) for details. So let's get started with the seven steps:
1. Check Physical Memory
First we should check the physically available memory. In this example we have about 128 GB of RAM. SGA_TARGET and PGA_AGGREGATE_TARGET together should not exceed the available memory, and enough space should remain for the OS processes themselves:
grep MemTotal /proc/meminfo
MemTotal: 132151496 kB
2. Check Database Parameter
Second, check your database parameters. First of all: is AMM disabled? MEMORY_TARGET and MEMORY_MAX_TARGET should be set to 0:
SQL> select value from v$parameter where name = 'memory_target';
VALUE
---------------------------
0
How big is our SGA? In this example, about 40 GB. Important: in the following query we convert directly into kB (value/1024), so we can continue calculating with that figure:
SQL> select value/1024 from v$parameter where name = 'sga_target';
VALUE
---------------------------
41943040
Finally, the parameter use_large_pages should be left at its default, enabled:
SQL> select value from v$parameter where name = 'use_large_pages';
VALUE
---------------------------
TRUE
3. Check Hugepagesize
In our example we use an x86_64 Red Hat Enterprise Linux server, so by default the huge page size should be 2 MB:
grep Hugepagesize /proc/meminfo
Hugepagesize: 2048 kB
4. Calculate Hugepages
For the calculation of the number of HugePages there is an easy formula:
SGA / Hugepagesize = Number Hugepages
Following our example:
41943040 / 2048 = 20480
If you run more than one database on your server, you should include the SGAs of all of your instances in the calculation:
(SGA of instance 1 + SGA of instance 2 + ...) / Hugepagesize = Number Hugepages
In My Oracle Support you can find a script (Doc ID 401749.1) called hugepages_settings.sh, which does the calculation. It also checks your kernel version and the shared memory segments actually in use by the SGAs. Note that the calculation only considers SGAs that are currently allocated: if your second instance is down, it will not be taken into account. So adjust your SGA settings and start all databases first; then run the script. The result should be the following line (you can also do the calculation yourself and check it against the script):
Recommended setting: vm.nr_hugepages = 20480
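The same arithmetic as a small shell snippet, using the numbers from this example:

# Huge page size reported by the kernel, in kB (2048 on x86_64)
PAGE_KB=$(awk '/Hugepagesize/ {print $2}' /proc/meminfo)
# SGA size in kB, as computed from sga_target above
SGA_KB=41943040
echo $(( SGA_KB / PAGE_KB ))   # prints 20480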
5. Change Server Configuration
The next step is to enter the number of HugePages in the server configuration file, which requires root permissions. On Red Hat Enterprise Linux 6 this is /etc/sysctl.conf:
vi /etc/sysctl.conf
vm.nr_hugepages=20480
If inserted correctly, the following result should show up:
grep vm.nr_hugepages /etc/sysctl.conf
vm.nr_hugepages=20480
The next parameters are the hard and soft memlock limits in /etc/security/limits.conf for our oracle user. This value should be smaller than the available memory but not smaller than the SGA; our HugePages must fit into it completely. Hence the following calculation:
Number Hugepages * Hugepagesize = minimum memlock
Following our example:
20480 * 2048 = 41943040
vi /etc/security/limits.conf
oracle soft memlock 41943040
oracle hard memlock 41943040
If inserted correctly, the following result should show up:
grep oracle /etc/security/limits.conf
...
oracle soft memlock 41943040
oracle hard memlock 41943040
As mentioned before, we have to disable Transparent HugePages from Red Hat Enterprise Linux 6 onward:
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
[always] madvise never
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
always madvise [never]
6. Server Reboot
If all parameters are set, do a complete reboot of your server. Alternatively, you can reload the kernel parameters with sysctl -p.
7. Check Configuration
Memlock correct?
ulimit -l
41943040
HugePages correctly configured and in use?
grep Huge /proc/meminfo
AnonHugePages:    538624 kB
HugePages_Total:   20480
HugePages_Free:    12292
HugePages_Rsvd:     8188
HugePages_Surp:        0
Hugepagesize:       2048 kB
Transparent Hugepages disabled?
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
always madvise [never]
Is the database using HugePages? To find out, we take a look into the alert log. After "Starting ORACLE instance (normal)", the "Large Pages Information" entry gives us the answer:
************************ Large Pages Information *******************
Per process system memlock (soft) limit = 100 GB
Total Shared Global Region in Large Pages = 40 GB (100%)
Large Pages used by this instance: 20481 (40 GB)
Large Pages unused system wide = 0 (0 KB)
Large Pages configured system wide = 20481 (40 GB)
Large Page size = 2048 KB
********************************************************************
If your configuration is incorrect, Oracle prints a recommendation for the right settings here. In the following example exactly one page (2048 kB of memlock) is missing to get 100% of the SGA into HugePages:
************************ Large Pages Information *******************
...
RECOMMENDATION:
Total System Global Area size is 40 GB. For optimal performance, prior to the next instance restart:
1. Increase the number of unused large pages by at least 1 (page size 2048 KB, total size 2048 KB) system wide to get 100% of the System Global Area allocated with large pages
2. Large pages are automatically locked into physical memory. Increase the per process memlock (soft) limit to at least 40 GB to lock 100% System Global Area's large pages into physical memory
********************************************************************
Done!
ocrdump reported 704 different keys for the OCR vs. 526 keys for the OLR on our installation.
To slightly complicate matters, there are two sets of Oracle Agents and Oracle Root Agents: one for the High Availability Services stack, and one created as part of the Cluster Ready Services stack. The Oracle Agent spawned by OHAS is owned by the Grid Infrastructure software owner.
In addition to these two processes, there are agents responsible for managing and monitoring the CSS daemon, called CSSDMONITOR and CSSDAGENT. CSSDAGENT, the agent process responsible for spawning CSSD, is created by the OHAS daemon. CSSDMONITOR, which monitors CSSD and the overall node health (jointly with CSSDAGENT), is also spawned by OHAS.
You might wonder how CSSD, which is required to start the clustered ASM instance, can be started if the voting disks are stored in ASM. This sounds like a chicken-and-egg problem: without access to the voting disks there is no CSS, hence the node cannot join the cluster; but without being part of the cluster, CSSD cannot start the ASM instance. To solve this problem, the ASM disk headers contain new metadata in 11.2: you can use kfed to read the header of an ASM disk containing a voting disk. The kfdhdb.vfstart and kfdhdb.vfend fields tell CSS where to find the voting file. This does not require the ASM instance to be up. Once the voting disks are located, CSS can access them and join the cluster.
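To see this for yourself, a hedged example (the device path is a placeholder; kfed lives in the Grid home):

kfed read /dev/mapper/asmdisk1 | grep -E 'kfdhdb.vfstart|kfdhdb.vfend'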
The high availability stack's Oracle Agent runs as the owner of the Grid Infrastructure stack in a clustered environment, as either the oracle or the grid user. It is spawned by OHAS directly as part of the cluster startup sequence, and it is responsible for starting resources that do not require root privileges.
The list of processes Oracle Agent starts includes the following:
- EVMD and EVMLOGGER
- the GIPC daemon
- the GPnP daemon
- the mDNS daemon
- CRS daemon
- CTSS daemon
- Disk Monitoring daemon
- ACFS drivers
- Starting and monitoring the local ASM instance
- ONS and eONS daemons
- The SCAN listener, where applicable
- The node listener
The oracle Oracle Agent will only spawn the database resource if account separation is used. If not (i.e., if you did not install Grid Infrastructure with a different user than the RDBMS binaries), then the oracle Oracle Agent will also perform the tasks listed previously for the grid Oracle Agent.
Finally, the Oracle Root Agent will create the following background processes:
- GNS, if configured
- GNS VIP, if GNS is enabled
- ACFS Registry
- Network
- SCAN VIP, if applicable
- Node VIP
The functionality provided by the Oracle Agent process in Oracle 11gR2 was provided by the racgmain and racgimon background processes in earlier releases.
Clusterware startup sequence
This tutorial describes the startup sequence of Oracle 12c RAC Clusterware as installed on Unix/Linux platforms.
$cat /etc/inittab | grep init.d | grep -v grep
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
Oracle Linux 6.x and Red Hat Linux 6.x have deprecated inittab; there, init.ohasd is configured for startup in /etc/init/oracle-ohasd.conf:
$cat /etc/init/oracle-ohasd.conf
start on runlevel [35]
stop on runlevel [!35]
respawn
exec /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
This starts up "init.ohasd run", which in turn starts the ohasd.bin background process:
$ps -ef | grep ohasd | grep -v grep
root 4056 1 1 Feb19 ? 01:54:34 /u01/app/12.1.0/grid/bin/ohasd.bin reboot
root 2715 1 0 Feb19 ? 00:00:00 /bin/sh /etc/init.d/init.ohasd run
OHASD (Oracle High Availability Services Daemon) is also what underpins Oracle Restart on a standalone server.
First, /etc/init triggers OHASD. Once OHASD is started (Level 0), it is responsible for starting the rest of the clusterware and the resources clusterware manages, directly or indirectly, through Levels 1-4.
Level 1 - OHASD on its own spawns four agent processes:
- cssdmonitor: CSS Monitor
- OHASD orarootagent: High Availability Service stack Oracle root agent
- OHASD oraagent: High Availability Service stack Oracle agent
- cssdagent: CSS Agent
Level 2 - On this level, the OHASD oraagent spawns five processes (mDNSD, GIPCD, GPnPD, EVMD, ASM) and the OHASD orarootagent spawns the remaining four (CRSD, CTSSD, Diskmon, ACFS):
- mDNSD: mDNS daemon process
- GIPCD: Grid Interprocess Communication daemon
- GPnPD: GPnP profile daemon
- EVMD: Event Monitor daemon
- ASM: resource for monitoring ASM instances
- CRSD: CRS daemon
- CTSSD: Cluster Time Synchronization Service daemon
- Diskmon: Disk Monitor daemon (Exadata server storage)
- ACFS: ASM Cluster File System drivers
Level 3 - CRSD spawns two CRSD agents: the CRSD orarootagent and the CRSD oraagent.
Level 4 - On this level, the CRSD orarootagent is responsible for starting the following resources:
- Network resource: for the public network
- SCAN VIPs
- Node VIPs: VIPs for each node
- ACFS Registry
- GNS VIP: VIP for GNS, if you use the GNS option
The CRSD oraagent starts the remaining resources:
- ASM resource: ASM instance(s) resource
- Diskgroup: used for managing/monitoring ASM disk groups
- DB resource: used for managing and monitoring the database and instances
- SCAN listener: listener for SCAN, listening on the SCAN VIP
- Listener: node listener, listening on the node VIP
- Services: database services
- ONS
- eONS: enhanced ONS
- GSD: for 9i backward compatibility
- GNS: performs name resolution (optional)
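This ordering is encoded in the resources' dependency attributes; one hedged way to inspect it (ora.crsd is just an example resource):

crsctl stat res ora.crsd -init -f | grep -E 'START_DEPENDENCIES|STOP_DEPENDENCIES'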
How the database interacts with ASM
The file creation process provides a fine illustration of the interactions that take place between database instances and ASM. The file creation process occurs as follows:
1. The database requests file creation.
2. An ASM foreground process creates a Continuing Operation Directory (COD) entry and
allocates space for the new file across the disk group.
3. The ASMB database process receives an extent map for the new file.
4. The file is now open and the database process initializes the file directly.
5. After initialization, the database process requests that the file creation is committed. This
causes the ASM foreground process to clear the COD entry and mark the file as created.
6. Acknowledgement of the file commit implicitly closes the file. The database instance will
need to reopen the file for future I/O.
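As a concrete trigger for this sequence, creating a tablespace on an ASM disk group is enough; the tablespace name below is illustrative, and the example assumes a disk group called DATA exists:
SQL> CREATE TABLESPACE app_data DATAFILE '+DATA' SIZE 100M;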
What is GPnP profile and its importance
The GPnP profile is an XML file named profile.xml, located under $GRID_HOME/gpnp/<hostname>/profiles/peer. Each node of the cluster maintains a local copy of this profile, which is maintained by the GPnP daemon together with the mDNS daemon.
Now before understanding why Oracle came up with GPnP profile, we need to focus on what it contains.
The GPnP profile defines a node's metadata about the network interfaces for the public and private interconnect, the ASM server parameter file, and the CSS voting disks. The profile is protected by a wallet against modification. If you have to modify the profile manually, it must first be unsigned with $GRID_HOME/bin/gpnptool, modified, and then signed again with the same utility; however, it is very unlikely you will ever need to do so.
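If you only want to read the profile, no signing is involved; the exact sign/unsign flags vary by version (check gpnptool help before editing anything), but a plain dump is simply:
$ $GRID_HOME/bin/gpnptool get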
Below is the profile XML that gpnptool's get option dumps to standard output, formatted for ease of readability.
<?xml version="1.0" encoding="UTF-8"?>
<gpnp:GPnP-Profile Version="1.0" ClusterName="..." ...>
  <gpnp:Network-Profile>
    <gpnp:HostNetwork id="gen" HostName="*">
      <gpnp:Network id="net1" IP="xxx.xx.x.x" Adapter="bond0" Use="public"/>
      <gpnp:Network id="net2" IP="xxx.xxx.x.x" Adapter="bond1" Use="cluster_interconnect"/>
    </gpnp:HostNetwork>
  </gpnp:Network-Profile>
  <orcl:CSS-Profile .../>
  <orcl:ASM-Profile .../>
  <ds:Signature ...>...</ds:Signature>
</gpnp:GPnP-Profile>
So from the above dump we can see that the GPnP profile contains the following information:
1) Cluster Name
2) Network Profile
3) CSS-Profile tag
4) ASM-Profile tag
Now that we have understood the content of a GPnP profile, we need to understand how Clusterware uses this information to start. From 11gR2 you have the option of storing the OCR and voting disks on ASM, but Clusterware needs the OCR and the voting disks to start CRSD and CSSD, and both of these files live on ASM, which is itself a resource of the node. So how does Clusterware start, and which files does it access to get the information needed to do so? To resolve this chicken-and-egg problem, Oracle came up with two local operating system files: the OLR and the GPnP profile.
When a node of an Oracle Clusterware cluster restarts, OHASD is started by platform-specific means. OHASD has access to the OLR (Oracle Local Registry) stored on the local file system. The OLR provides the data needed (to be explained in another post) to complete OHASD initialization.
OHASD brings up GPnP Daemon and CSS Daemon. CSS Daemon has access to the GPNP Profile stored on the local file system.
The Voting Files locations on ASM Disks are accessed by CSSD with well-known pointers in the ASM Disk headers and CSSD is able to complete initialization and start or join an existing cluster.
OHASD starts an ASM instance and ASM can now operate with CSSD initialized and operating. The ASM instance uses special code to locate the contents of the ASM SPFILE, assuming it is stored in a Diskgroup.
With an ASM instance operating and its disk groups mounted, access to Clusterware's OCR is available to CRSD. OHASD starts CRSD with access to the OCR in an ASM disk group, and thus Clusterware completes initialization and brings up the other services under its control.
Thus, with the information stored in the GPnP profile, along with the information in the OLR, several tasks have been automated or eased for administrators.
I hope the above information helps you in understanding the Grid Plug and Play profile: its content, its usage, and why it was required. Please comment below if you need more information on GPnP, such as the complete dump of the profile, how the GPnP daemon and mDNS daemon communicate to maintain an up-to-date profile on all the nodes, or how oifcfg, crsctl, asmcmd, and other utilities use IPC to alter the content of these files.
The voting disk (VD) stores the cluster membership information. Oracle Clusterware uses the VD to determine which nodes are members of a cluster. Oracle Cluster Synchronization Service daemon (OCSSD) on each cluster node updates the VD with the current status of the node every second. The VD is used to determine which RAC nodes are still in the cluster should the interconnect heartbeat between the RAC nodes fail.
CSS is the service which determines which nodes in the cluster are available, and it provides cluster group membership and simple locking services to other processes. CSS typically determines node availability via communication through a dedicated private network, with the voting disk used as a secondary communication mechanism. This is done by sending heartbeat messages through both the network and the voting disk.
The voting disk is a file on a clustered file system (or, from 11gR2, on ASM) that is accessible to all nodes in the cluster. Its primary purpose is to help in situations where the private network communication fails. The voting disk is then used to communicate the node state information that determines which nodes go offline. Without the voting disk, it can be difficult for an isolated node to determine whether it is experiencing a network failure or whether the other nodes are no longer available. It would then be possible for the cluster to enter a state in which multiple subclusters of nodes have unsynchronized access to the same database files.
Clusterware will evict one or more nodes from the cluster if a node can no longer access a majority (strictly more than half) of the voting disks. Some examples:
When you have 2 and 1 goes bad, the cluster stops, because the nodes realize they can only write to half of the original disks (1 out of 2), violating the rule that they must be able to write > half (yes, the rule says >, not >=).
When you have 3 and 1 goes bad, the cluster runs fine because the nodes know they can access more than half of the original voting disks (2/3 > half).
When you have 4 and 1 goes bad, the same, because (3/4 > half).
When you have 3 and 2 go bad, the cluster stops because the nodes can only access 1/3 of the voting disks, not > half.
When you have 4 and 2 go bad, the same, because the nodes can only access half, not > half.
So you see 4 voting disks have the same fault tolerance as 3, but you waste 1 disk without gaining anything. The recommendation for an odd number of voting disks helps save a little on hardware. All of the above assumes the nodes themselves are fine.
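To check how many voting disks a cluster actually has, and where they live, run:
$ crsctl query css votedisk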
By default, the database uses the following criteria to determine how long it needs to retain undo data:
- Length of the longest-running query
- Length of the longest-running transaction
- Longest flashback duration
Two points are worth noting:
- If you don't configure the undo tablespace with the AUTOEXTEND option, the database simply ignores the value you set for the UNDO_RETENTION parameter. The database will automatically tune the undo retention period based on database workload and the size of the undo tablespace. So, make sure you set the undo tablespace to a large size if you're receiving errors (ORA-01555: snapshot too old) indicating that the database is not retaining undo for a long enough time.
- If you want the database to try to honor the setting you specify for the UNDO_RETENTION parameter, make sure that you enable the AUTOEXTEND option for the undo tablespace. This way, Oracle will automatically extend the size of the undo tablespace to make room for undo from new transactions, instead of overwriting the older undo data. However, if you're still receiving ORA-01555 (snapshot too old) errors, say due to Oracle Flashback operations, it means that the database isn't able to dynamically tune the undo retention period effectively. In a case such as this, try increasing the value of the UNDO_RETENTION parameter to match the length of the longest Oracle Flashback operation. Alternatively, you can try moving to a larger fixed-size undo tablespace (without the AUTOEXTEND option).
You’re receiving the ORA-01555 (snapshot too old) errors during nightly runs of key production batch jobs. You want to eliminate these errors.
Solution
While setting a high value for the UNDO_RETENTION parameter can potentially minimize the possibility of receiving "snapshot too old" errors, it doesn't guarantee that the database won't overwrite older undo data that may be needed by a running transaction. You can move long-running batch jobs to a separate time interval when other programs aren't running in the database, to avoid these errors. Remember, though, that you don't always need several concurrent programs to generate "snapshot too old"; you can produce the error with a single poorly written job, for example by opening a cursor on a long-running query against a big table and then updating the same table inside the loop.
Regardless, while you can minimize the occurrence of "snapshot too old" errors with these approaches, you can't completely eliminate such errors without the guaranteed undo retention feature. When you configure guaranteed undo retention in a database, no SELECT statement can fail because of the "snapshot too old" error. The trade-off is that Oracle may keep new DML statements from executing if the undo tablespace can't accommodate their undo. Implementing the guaranteed undo feature is simple. Suppose you want to ensure that the database retains undo for at least an hour (3,600 seconds). First set the undo retention threshold with the alter system command shown here, and then set up guaranteed undo retention by specifying the retention guarantee clause when altering the undo tablespace.
SQL> alter system set undo_retention=3600;
System altered.
SQL> alter tablespace undotbs1 retention guarantee;
Tablespace altered.
SQL>
You can switch off guaranteed undo retention by executing the alter tablespace command with the retention noguarantee clause.
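For example, using the same undo tablespace as above:
SQL> alter tablespace undotbs1 retention noguarantee;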
■ Tip You can enable guaranteed undo retention by using the alter system command as shown in this recipe, as well as with the create database and create undo tablespace statements.
How It Works
Oracle uses the undo records stored in the undo tablespace to roll back transactions, provide read consistency, and help recover the database. In addition, the database also uses undo records to read data from a past point in time using Oracle Flashback Query. Undo data serves as the underpinning for several Oracle Flashback features that help you recover from logical errors.
Occurrence of the Error
The ORA-01555 error (snapshot too old) may occur in various situations. The following is a case where the error occurs during an export.
EXP-00008: ORACLE error 1555 encountered
ORA-01555: snapshot too old: rollback segment number 10 with name “_SYSSMU10$” too small
EXP-00000: Export terminated unsuccessfully
And you can receive the same error when performing a flashback transaction:
ERROR at line 1:
ORA-01555: snapshot too old: rollback segment number with name “” too small
ORA-06512: at “SYS.DBMS_FLASHBACK”, line 37
ORA-06512: at “SYS.DBMS_FLASHBACK”, line 70
ORA-06512: at li
The “snapshot too old” error occurs when Oracle overwrites undo data that’s needed by another query. The error is a direct result of how Oracle’s read consistency mechanism works. The error occurs during the execution of a long-running query when Oracle tries to read the “before image” of any changed rows from the undo segments.
For example, if a long-running query starts at 1 A.M. and runs until 6 A.M., it's possible for the database to change the data that's part of this query during the period in which the query executes. When Oracle tries to read the data as it appeared at 1 A.M., the query may fail if that data is no longer present in the undo segments. If your database is experiencing a lot of updates, Oracle may not be able to fetch the changed rows, because the before images recorded in the undo segments may have been overwritten. The transactions that changed the rows will have already committed, and the undo segments don't have a record of the before-change row values because the database overwrote the relevant undo data. Since Oracle fails to return consistent data for the current query, it issues the ORA-01555 error. The query that's currently running requires the before image to construct read-consistent data, but the before image isn't available.
The ORA-01555 error may be the result of one or both of the following: a heavy amount of queries during a time of intense changes in the database or too small an undo tablespace. You can increase the size of the undo tablespace, but that doesn’t ensure that the error won’t occur again.
Influence of Extents
The database stores undo data in undo extents, and there are three distinct types of undo extents: active (currently in use by a transaction), unexpired (no longer active, but still needed to satisfy the undo retention period), and expired (no longer needed for the retention period and thus reusable). When the database is forced to overwrite unexpired extents, a long-running query that still needs their contents can fail with the "snapshot too old" error. By default, the database will essentially stop honoring the undo retention period you specify if it encounters space pressure to accommodate the undo from new transactions. Since the unexpired undo extents contain undo records needed to satisfy the undo retention period, overwriting those extents in reality means that the database is lowering the undo retention period you've set. Enabling the undo retention guarantee helps assure the success of long-running queries as well as Oracle Flashback operations. The "guarantee" part of the undo retention guarantee is real: Oracle will certainly retain undo at least for the time you specify and will never overwrite any of the unexpired undo extents that contain the undo required to satisfy the undo retention period. However, there's a stiff price attached to this guarantee: Oracle will enforce retention even if it means that DML transactions fail because the database can't find space to record the undo for those transactions.
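You can verify whether the guarantee is in force for an undo tablespace by checking the RETENTION column of DBA_TABLESPACES, which shows GUARANTEE or NOGUARANTEE:
SQL> select tablespace_name, retention from dba_tablespaces where contents = 'UNDO';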
Purpose of Checkpoints
When Oracle Database Initiates Checkpoints
SQL> select checkpoint_change# from v$database;

CHECKPOINT_CHANGE#
------------------
           1677903

SQL> alter system checkpoint;

System altered.

SQL> select checkpoint_change# from v$database;

CHECKPOINT_CHANGE#
------------------
           1679716

SQL> select name, checkpoint_change# from v$datafile where name like '%system01%';

NAME                                          CHECKPOINT_CHANGE#
--------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf             1679716

SQL> select name, checkpoint_change# from v$datafile_header where name like '%01.dbf';

NAME                                          CHECKPOINT_CHANGE#
--------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf             1685610
/u02/app/oracle/oradata/mask11g/sysaux01.dbf             1685610
/u02/app/oracle/oradata/mask11g/undotbs01.dbf            1685610
/u02/app/oracle/oradata/mask11g/users01.dbf              1685610

SQL> alter tablespace users read only;

Tablespace altered.

SQL> select name, checkpoint_change# from v$datafile_header where name like '%01.dbf';

NAME                                          CHECKPOINT_CHANGE#
--------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf             1685610
/u02/app/oracle/oradata/mask11g/sysaux01.dbf             1685610
/u02/app/oracle/oradata/mask11g/undotbs01.dbf            1685610
/u02/app/oracle/oradata/mask11g/users01.dbf              1685618

SQL> alter tablespace users read write;

Tablespace altered.

SQL> select name, checkpoint_change# from v$datafile_header where name like '%01.dbf';

NAME                                          CHECKPOINT_CHANGE#
--------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf             1685610
/u02/app/oracle/oradata/mask11g/sysaux01.dbf             1685610
/u02/app/oracle/oradata/mask11g/undotbs01.dbf            1685610
/u02/app/oracle/oradata/mask11g/users01.dbf              1685642

SQL> select name, checkpoint_change# from v$datafile_header where name like '%system01%';

NAME                                          CHECKPOINT_CHANGE#
--------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf             1657172

SQL> select distinct last_change# from v$datafile;

LAST_CHANGE#
------------


SQL> alter database close;

Database altered.

SQL> select distinct last_change# from v$datafile;

LAST_CHANGE#
------------
     2125206

SQL> select distinct checkpoint_change# from v$datafile_header;

CHECKPOINT_CHANGE#
------------------
           2125206
Answer: An incremental checkpoint is sort of like when you are sitting on the toilet taking a large dump and you flush multiple times to prevent clogging the toilet.
The "fast start" recovery (and the fast_start_mttr_target) is directly related to the incremental checkpoint. By reducing the checkpoint time to be more frequent than a log switch, Oracle will recover and re-start faster in case of an instance crash.
The docs note that DBWR writes buffers to disk in advance of the checkpoint position, writing the "oldest" blocks first to preserve integrity.
A "checkpoint" is the event that triggers the writing of dirty blocks to disk, and a "normal" checkpoint only occurs with every redo log file switch.
In a nutshell, an "incremental" checkpoint directs the CKPT process to track the "dirty" blocks that the DBWR process still needs to write, thereby advancing the checkpoint SCN recorded in the control file.
The DBWR wakes up every 3 seconds, looking for dirty blocks to write, and sleeps if it finds none. This prevents a "burst" of writing when a redo log switches.
The ASMB process communicates with the CSS daemon on the node and receives file extent map information from the ASM instance. ASMB is also responsible for providing I/O statistics to the ASM instance.
Oracle Database creates server processes to handle the requests of user processes connected to the instance. The user process represents the application or tool that connects to the Oracle database.
Server processes created on behalf of each user's application can perform one or more of the following:
- Parse and run SQL statements issued through the application.
- Read necessary data blocks from data files on disk into the shared database buffers of the SGA (if the blocks are not already present in the SGA).
- Return results in such a way that the application can process the information.
Data blocks committed by a transaction may not yet have been written to the data files and appear only in the online redo log; these changes must be reapplied to the database. Conversely, the data files may contain changes that had not been committed when the instance failed; these changes must be rolled back to ensure transactional consistency. Instance recovery uses only the online redo log files and current online data files to synchronize the data files and ensure that they are consistent.
Instance recovery:
- Is caused by attempts to open a database whose files were not synchronized on shutdown
- Uses information stored in the redo log groups to synchronize files
- Involves two distinct operations:
- Rolling forward: redo log changes (both committed and uncommitted) are applied to data files.
- Rolling back: changes that were made but not committed are returned to their original state.
If the data files were not synchronized during shutdown, the instance uses information contained in the redo log groups to roll the data files forward to the time of shutdown. Then the database is opened and any uncommitted transactions are rolled back. In summary:
- Startup instance (data files are out of sync)
- Roll forward (redo)
- Committed and uncommitted data in files
- Database opened
- Roll back (undo)
- Committed data in files
For an instance to open a data file, the system change number (SCN) contained in the data file's header must match the current SCN that is stored in the database's control files.
If the numbers do not match, the instance applies redo data from the online redo logs, sequentially “redoing” transactions until the data files are up to date. After all data files have been synchronized with the control files, the database is opened and users can log in.
When redo logs are applied, all transactions are applied to bring the database up to the state as of the time of failure. This usually includes transactions that are in progress but have not yet been committed. After the database has been opened, those uncommitted transactions are rolled back.
At the end of the rollback phase of instance recovery, the data files contain only committed data.
- During instance recovery, the transactions between the checkpoint position and the end of the redo log must be applied to data files.
- You tune instance recovery by controlling the difference between the checkpoint position and the end of the redo log.
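In practice, the usual way to control that gap is the FAST_START_MTTR_TARGET parameter, and V$INSTANCE_RECOVERY shows how close the instance is to the target. A quick sketch (the 60-second target is illustrative):
SQL> alter system set fast_start_mttr_target = 60;
SQL> select target_mttr, estimated_mttr from v$instance_recovery;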
However, John continues, consider the case of a full table scan query that selects all the blocks of the table. If that table is large, its blocks will consume a large portion of the buffer cache, forcing out the blocks of other tables. It’s unlikely that all the blocks of a large table will be accessed regularly, so having those blocks in the cache does not actually help performance. But forcing out the blocks of other tables, especially popular blocks, degrades the overall performance of the applications running against the database. That is why, John explains, Oracle Database does not load the blocks into the buffer cache for full table scans.
How is a connection established and how does SQL run internally?
How does the database communicate with ASM?
How does hot backup work?
The redo logs for all RAC instances are located either on an OCFS shared disk or on a raw device that is visible to all the other RAC instances. This allows any other node to recover for a failed RAC node in the event of instance failure.
There are basically two types of failure in a RAC environment: instance and media. Instance failure involves the loss of one or more RAC instances, whether due to node failure or connectivity failure. Media failure involves the loss of one or more of the disk assets used to store the database files themselves.
- All nodes available.
- One or more RAC instances fail.
- Node failure is detected by any one of the remaining instances.
- The Global Resource Directory (GRD) is reconfigured and redistributed among the surviving nodes.
- The instance which first detected the failed instance reads the failed instance's redo logs to determine the blocks that need to be recovered. This task is done by the SMON process of the instance that detected the failure.
- Until this time database activity is frozen; SMON issues recovery requests for the blocks in need of recovery, while the other blocks, which are not needed for recovery, are available for normal processing.
- Oracle performs the roll forward operation against the blocks that were modified by the failed instance but not yet written to disk.
- Once redo logs are applied, uncommitted transactions are rolled back using the undo tablespace.
- The database on the RAC is now fully available.
Or, in essence, instance recovery means:
– writing all committed changes to the datafiles
– undoing all the uncommitted changes from the datafiles
– incrementing the checkpoint number to the SCN up to which changes have been written to the datafiles.
In a single instance database, before the instance crashes,
– some committed changes are in the redo log files but have not been written to the datafiles
– some uncommitted changes have made their way to datafiles
– some uncommitted changes are in the redo log buffer
After the instance crashes in a single instance database
– all uncommitted changes in the redo log buffer are wiped out
– Online redo log files are read to identify the blocks that need to be recovered
– Identified blocks are read from the datafiles
– During roll forward phase, all the changes (committed/uncommitted) in redo log files are applied to them
– During rollback phase, all uncommitted changes are rolled back after reading undo from undo tablespace.
– The checkpoint number is incremented in the control file and data file headers.
In a RAC database there can be two scenarios :
– Only one instance crashes
– Multiple instances crash
We will discuss these cases one by one.
Single instance crash in RAC database
In this case, the scenario is quite similar to an instance crash in a single-instance database, but with slight differences.
Let us consider a 3-node setup and a data block B1 with one column and 4 records in it. The column contains the values 100, 200, 300 and 400 in the 4 records. Initially the block is on disk. (The original post shows a chart of the update operations on the block across the nodes, with the block states colour-coded as CR, PI, and XCUR; the chart is not reproduced here.)
It is assumed that no incremental checkpointing has taken place on any of the nodes in the meanwhile.
Before crash status of block on various nodes is as follows:
– PI at SCN# 2 on Node1
– PI at SCN# 3 on Node2
– XCUR on Node3
Redo logs at various nodes are
Node1 : B1: 100 -> 101, SCN# 1
Node2 : B1:200 -> 201, SCN# 2
Node3 : B1:300 -> 301, SCN# 3
After the crash,
– Redo logs of crashed node (Node2) is analyzed and it is identified that block B1 needs to be recovered.
– It is also identified that role of the block is global as its different versions are available in Node1 and Node3
– It is identified that there is a PI on node1 whose SCN# (2) is earlier than the SCN# of crash (4)
– Changes from redo logs of Node2 are applied to the PI on Node1 and the block is written to disk
– Checkpoint # of node1 is incremented.
– a BWR (block written record) is placed in the redo log of Node1 to indicate that the block has been written to disk and need not be recovered in case Node1 crashes.
Here it can be readily seen that there are certain differences from the instance recovery in single instance database.
The role of the block is checked.
If the role is local, the block is read from disk and the changes from the redo logs of Node2 are applied, i.e., just like in a single-instance database.
If the role is global, it is checked whether a PI of the block with an SCN# earlier than the SCN# of the crash is available.
If a PI is available, the changes in the redo logs of Node2 are applied to the PI, instead of reading the block from disk.
If a PI is not available (it has been flushed to disk due to incremental checkpointing, either on the owner node of the PI or on any of the nodes at an SCN# later than the PI holder's), the block is read from disk and the changes from the redo logs of Node2 are applied, just like it used to happen in OPS.
Hence, it can be inferred that a PI, if available, speeds up instance recovery, as the need to read the block from disk is eliminated. If a PI is not available, the block is read from disk just like in OPS.
Multiple instance crash in RAC database
Let us consider a 4-node setup and a data block B1 with one column and 4 records in it. The column contains the values 100, 200, 300 and 400 in the 4 records. Initially the block is on disk. The sequence of events can be represented as follows:
Explanation:
SCN#1 – Node1 reads the block from disk and updates 100 to 101 in record. It holds the block in XCUR mode
SCN#2 – Node2 requests the same block for update. Node1 keeps the PI and Node2 holds the block in XCUR mode
SCN#3 – Node3 requests the same block for update. Node2 keeps the PI and Node3 holds the block in XCUR mode . Now we have two PIs
– On Node1 with SCN# 2
– On Node2 with SCN# 3
SCN# 4 – Local checkpointing takes place on Node2. PI on this node has SCN# 3.
It is checked if any of the other nodes has a PI at an earlier SCN# than this. Node1 has PI at SCN# 2.
Changes in the redo log of Node2 are applied to its PI and it is flushed to disk.
BWR is placed in redo log of Node2 to indicate that the block has been written to disk and need not be recovered in case Node2 crashes.
PI at node2 is discarded i.e. its state changes to CR which can’t be used to serve remote nodes.
PI at node1 is discarded i.e. its state changes to CR which can’t be used to serve remote nodes.
BWR is placed in the redo log of Node1 to indicate that the block has been written to disk and need not be recovered in case Node1 crashes.
Now on disk version of block contains changes of both Node1 and Node2.
SCN# 5 – Node4 requests the same block for update. Node3 keeps the PI and Node4 holds the block in XCUR mode .Node1 and Node2 have the CR’s.
SCN# 6 – Node1 again requests the same block for update. Node4 keeps the PI and Node1 holds the block in XCUR mode. Now Node1 has both the same block in CR and XCUR mode. Node3 has PI at SCN# 5.
SCN# 7 – Node2 and Node3 crash.
It is assumed that no incremental checkpointing has taken place on any of the nodes in the meanwhile.
Before crash status of block on various nodes is as follows:
– CR at SCN# 2 on Node1, XCUR on Node1
– CR at SCN# 3 on Node2
– PI at SCN# 5 on Node3
– PI at SCN# 6 on Node4
Redo logs at various nodes are
Node1 : B1: 100 -> 101, SCN# 1, BWR for B1 , B1:401->402 at SCN#6
Node2 : B1:200 -> 201, SCN# 2, BWR for B1
Node3 : B1:300 -> 301, SCN# 3
Node4 : B1:400->401 at SCN# 5
After the crash,
– Redo logs of crashed node (Node2) are analyzed and it is identified that block B1 has been flushed to disk as of SCN# 4 and need not be recovered as no changes have been made to it from Node2.
– No Redo log entry from Node2 needs to be applied
– Redo logs of crashed node (Node3) are analyzed and it is identified that block B1 needs to be recovered
– It is also identified that the role of the block is global, as different versions of it were/are available on Node1 (XCUR), Node2 (crashed), and Node4 (PI).
– Changes from Node3 have to be applied . It is checked if any PI is available which is earlier than the SCN# of the change on node3 which needs to be applied i.e. SCN# 3.
– It is identified that no PI is available whose SCN is earlier than the SCN# (3). Hence, block is read from the disk.
– Redo log entry which needs to be applied is : B1:300 -> 301, SCN# 3
– Redo is applied to the block read from the disk and the block is written to disk so that on disk version contains changes made by Node3 also.
– Checkpoint # of node2 and Node3 are incremented.
After instance recovery :
Node1 : holds CR and XCUR
Node2 :
Node3 :
Node4 : holds PI
On disk version of the block is:
101
201
301
400
3) How does the database communicate with ASM?
4) What happens when we put the database in begin backup mode?
5) How does SQL execute and how is a connection made with the database?
Connect to an instance using:
- User process
- Server process
The Oracle server components that are used depend on the type of SQL statement:
- Queries return rows
- DML statements log changes
- COMMIT ensures transaction recovery
Some Oracle server components do not participate in SQL statement processing.
Parse:
o Search for an identical statement in the Shared SQL Area.
o Check syntax, object names, and privileges.
o Lock objects used during parse.
o Create and store the execution plan.
Bind: Obtains values for variables.
Execute: Process statement.
Fetch: Return rows to user process.
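If you want to see these stages as explicit calls, the DBMS_SQL API exposes each one separately. A minimal sketch follows; it assumes the classic SCOTT.EMP demo table, and in everyday code you would normally use static SQL or native dynamic SQL instead:
DECLARE
  c      INTEGER;
  n      INTEGER;
  v_name VARCHAR2(30);
BEGIN
  c := DBMS_SQL.OPEN_CURSOR;
  -- Parse: syntax check, dictionary lookups, privilege checks, plan creation
  DBMS_SQL.PARSE(c, 'SELECT ename FROM emp WHERE deptno = :d', DBMS_SQL.NATIVE);
  -- Bind: obtain values for variables
  DBMS_SQL.BIND_VARIABLE(c, ':d', 10);
  DBMS_SQL.DEFINE_COLUMN(c, 1, v_name, 30);
  -- Execute: process the statement
  n := DBMS_SQL.EXECUTE(c);
  -- Fetch: return rows to the user process
  WHILE DBMS_SQL.FETCH_ROWS(c) > 0 LOOP
    DBMS_SQL.COLUMN_VALUE(c, 1, v_name);
    DBMS_OUTPUT.PUT_LINE(v_name);
  END LOOP;
  DBMS_SQL.CLOSE_CURSOR(c);
END;
/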
Bind: Same as the bind phase used for processing a query.
Execute:
o If the data and undo blocks are not already in the Database Buffer Cache, the server process reads them from the datafiles into the Database Buffer Cache.
o The server process places locks on the rows that are to be modified. The undo block is used to store the before image of the data, so that the DML statements can be rolled back if necessary.
o The data blocks record the new values of the data.
o The server process records the before image to the undo block and updates the data block. Both of these changes are made in the Database Buffer Cache. Any changed blocks in the Database Buffer Cache are marked as dirty buffers. That is, buffers that are not the same as the corresponding blocks on the disk.
o The processing of a DELETE or INSERT command uses similar steps. The before image for a DELETE contains the column values in the deleted row, and the before image of an INSERT contains the row location information.
For these statements, parsing actually includes parsing, data dictionary lookup, and execution. Transaction management, session management, and system management SQL statements are processed using the parse and execute stages. To re-execute them, simply perform another execute.
Stage 2: Parse the Statement
During parsing, the SQL statement is passed from the user process to Oracle and a parsed representation of the SQL statement is loaded into a shared SQL area. Many errors can be caught during this phase of statement processing. Parsing is the process of
- translating a SQL statement and verifying it to be a valid statement
- performing data dictionary lookups to check table and column definitions
- acquiring parse locks on required objects so that their definitions do not change during the statement's parsing
- checking privileges to access referenced schema objects
- determining the optimal execution plan for the statement
- loading it into a shared SQL area
- for distributed statements, routing all or part of the statement to remote nodes that contain referenced data
Although the parsing of a SQL statement validates that statement, parsing only identifies errors that can be found before statement execution. Thus, certain errors cannot be caught by parsing. For example, errors in data conversion or errors in data (such as an attempt to enter duplicate values in a primary key) and deadlocks are all errors or situations that can only be encountered and reported during the execution phase.
These notes introduce the Oracle server architecture. The architecture includes physical components, memory components, processes, and logical structures.
Oracle server: An Oracle server includes an Oracle Instance and an Oracle database.
An Oracle database includes several different types of files: datafiles, control files, redo log files and archive redo log files. The Oracle server also accesses parameter files and password files.
This set of files has several purposes.
One is to enable system users to process SQL statements.
Another is to improve system performance.
Still another is to ensure the database can be recovered if there is a software/hardware failure.
The database server must manage large amounts of data in a multi-user environment.
The server must manage concurrent access to the same data.
The server must deliver high performance. This generally means fast response times.
Oracle instance: An Oracle Instance consists of two different sets of components:
The first component set is the set of background processes (PMON, SMON, RECO, DBW0, LGWR, CKPT, D000 and others).
These will be covered later in detail – each background process is a computer program.
These processes perform input/output and monitor other Oracle processes to provide good performance and database reliability.
The second component set includes the memory structures that comprise the Oracle instance.
When an instance starts up, a memory structure called the System Global Area (SGA) is allocated.
At this point the background processes also start.
An Oracle Instance provides access to one and only one Oracle database.
Oracle database: An Oracle database consists of files.
Sometimes these are referred to as operating system files, but they are actually database files that store the database information that a firm or organization needs in order to operate.
The redo log files are used to recover the database in the event of application program failures, instance failures and other minor failures.
The archived redo log files are used to recover the database if a disk fails.
Other files not shown in the figure include:
The required parameter file that is used to specify parameters for configuring an Oracle instance when it starts up.
The optional password file authenticates special users of the database – these are termed privileged users and include database administrators.
Alert and Trace Log Files – these files store information about errors and actions taken that affect the configuration of the database.
User and server processes: The processes shown in the figure are called user and server processes. These processes are used to manage the execution of SQL statements.
A Shared Server Process can share memory and variable processing for multiple user processes.
A Dedicated Server Process manages memory and variables for a single user process.
Connecting to an Oracle Instance – Creating a Session
System users can connect to an Oracle database through SQLPlus or through an application program like the Internet Developer Suite (the program becomes the system user). This connection enables users to execute SQL statements.
The act of connecting creates a communication pathway between a user process and an Oracle Server. As is shown in the figure above, the User Process communicates with the Oracle Server through a Server Process. The User Process executes on the client computer. The Server Process executes on the server computer, and actually executes SQL statements submitted by the system user.
The figure shows a one-to-one correspondence between the User and Server Processes. This is called a Dedicated Server connection. An alternative configuration is to use a Shared Server where more than one User Process shares a Server Process.
Sessions: When a user connects to an Oracle server, this is termed a session. The session starts when the Oracle server validates the user for connection. The session ends when the user logs out (disconnects) or if the connection terminates abnormally (network failure or client computer failure).
A user can typically have more than one concurrent session, e.g., the user may connect using SQLPlus and also connect using Internet Developer Suite tools at the same time. The limit of concurrent session connections is controlled by the DBA.
If a system user attempts to connect and the Oracle Server is not running, the system user receives the Oracle Not Available error message.
Physical Structure – Database Files
As was noted above, an Oracle database consists of physical files. The database itself has:
Datafiles – these contain the organization’s actual data.
Redo log files – these contain a record of changes made to the database, and enable recovery when failures occur.
Control files – these are used to synchronize all database activities and are covered in more detail in a later module.
Other key files as noted above include:
Parameter file – there are two types of parameter files.
The init.ora file (also called the PFILE) is a static parameter file. It contains parameters that specify how the database instance is to start up. For example, some parameters will specify how to allocate memory to the various parts of the system global area.
The spfile.ora is a dynamic parameter file. It also stores parameters to specify how to startup a database; however, its parameters can be modified while the database is running.
Password file – specifies which *special* users are authenticated to startup/shut down an Oracle Instance.
Archived redo log files – these are copies of the redo log files and are necessary for recovery in an online, transaction-processing environment in the event of a disk failure.
Memory Structure
The memory structures include two areas of memory:
System Global Area (SGA) – this is allocated when an Oracle Instance starts up.
Program Global Area (PGA) – this is allocated when a Server Process starts up.
System Global Area
The SGA is an area in memory that stores information shared by all database processes and by all users of the database (sometimes it is called the Shared Global Area).
This information includes both organizational data and control information used by the Oracle Server.
The SGA is allocated in memory and virtual memory.
The size of the SGA can be established by a DBA by assigning a value to the parameter SGA_MAX_SIZE in the parameter file—this is an optional parameter.
The SGA is allocated when an Oracle instance (database) is started up based on values specified in the initialization parameter file (either PFILE or SPFILE).
The SGA has the following mandatory memory structures:
Shared Pool – includes two components:
Library Cache
Data Dictionary Cache
Database Buffer Cache
Redo Log Buffer
Other structures (for example, lock and latch management, statistical data)
Additional optional memory structures in the SGA include:
Large Pool
Java Pool
Streams Pool
The SHOW SGA SQL command will show you the SGA memory allocations. This is a recent clip of the SGA for the Oracle database at SIUE. In order to execute SHOW SGA you must be connected with the special privilege SYSDBA (which is only available to user accounts that are members of the DBA Linux group).
SQL> connect / as sysdba
Connected.
SQL> show sga
Total System Global Area 1610612736 bytes
Fixed Size 2084296 bytes
Variable Size 385876536 bytes
Database Buffers 1207959552 bytes
Redo Buffers 14692352 bytes
Oracle 8i and earlier versions of the Oracle Server used a Static SGA. This meant that if modifications to memory management were required, the database had to be shutdown, modifications were made to the init.ora parameter file, and then the database had to be restarted.
Oracle 9i and 10g use a Dynamic SGA. Memory configurations for the system global area can be made without shutting down the database instance. The advantage is obvious. This allows the DBA to resize the Database Buffer Cache and Shared Pool dynamically.
Several initialization parameters are set that affect the amount of random access memory dedicated to the SGA of an Oracle Instance. These are:
SGA_MAX_SIZE: This optional parameter is used to set a limit on the amount of virtual memory allocated to the SGA – a typical setting might be 1 GB; however, if the value for SGA_MAX_SIZE in the initialization parameter file or server parameter file is less than the sum the memory allocated for all components, either explicitly in the parameter file or by default, at the time the instance is initialized, then the database ignores the setting for SGA_MAX_SIZE.
DB_CACHE_SIZE: This optional parameter is used to tune the amount of memory allocated to the Database Buffer Cache in standard database blocks. Block sizes vary among operating systems. The DBORCL database uses 8 KB blocks. The total blocks in the cache defaults to 48 MB on LINUX/UNIX and 52 MB on Windows operating systems.
LOG_BUFFER: This optional parameter specifies the number of bytes allocated for the Redo Log Buffer.
SHARED_POOL_SIZE: This optional parameter specifies the number of bytes of memory allocated to shared SQL and PL/SQL. The default is 16 MB. If the operating system is based on a 64 bit configuration, then the default size is 64 MB.
LARGE_POOL_SIZE: This is an optional memory object – the size of the Large Pool defaults to zero. If the init.ora parameter PARALLEL_AUTOMATIC_TUNING is set to TRUE, then the default size is automatically calculated.
JAVA_POOL_SIZE: This is another optional memory object. The default is 24 MB of memory.
The combined size of the SGA components (DB_CACHE_SIZE, LOG_BUFFER, SHARED_POOL_SIZE, LARGE_POOL_SIZE, and JAVA_POOL_SIZE, plus the fixed SGA) cannot exceed the SGA_MAX_SIZE parameter.
Memory is allocated to the SGA as contiguous virtual memory in units termed granules. Granule size depends on the estimated total size of the SGA, which as was noted above, depends on the SGA_MAX_SIZE parameter. Granules are sized as follows:
If the SGA is less than 128 MB in total, each granule is 4 MB.
If the SGA is greater than 128 MB in total, each granule is 16 MB.
Granules are assigned to the Database Buffer Cache and Shared Pool, and these two memory components can dynamically grow and shrink. Using contiguous memory improves system performance. The actual number of granules assigned to one of these memory components can be determined by querying the database view named V$BUFFER_POOL.
Granules are allocated when the Oracle server starts a database instance in order to provide memory addressing space to meet the SGA_MAX_SIZE parameter. The minimum is 3 granules: one each for the fixed SGA, Database Buffer Cache, and Shared Pool. In practice, you’ll find the SGA is allocated much more memory than this. The SELECT statement shown below shows a current_size of 1,152 granules.
SELECT name, block_size, current_size, prev_size, prev_buffers
FROM v$buffer_pool;

NAME                 BLOCK_SIZE CURRENT_SIZE  PREV_SIZE PREV_BUFFERS
-------------------- ---------- ------------ ---------- ------------
DEFAULT                    8192         1152          0            0
For additional information on the dynamic SGA sizing, enroll in Oracle’s Oracle10g Database Performance Tuning course.
Automatic Shared Memory Management
Prior to Oracle 10G, a DBA had to manually specify SGA Component sizes through the initialization parameters, such as SHARED_POOL_SIZE, DB_CACHE_SIZE, JAVA_POOL_SIZE, and LARGE_POOL_SIZE parameters.
Automatic Shared Memory Management enables a DBA to specify the total SGA memory available through the SGA_TARGET initialization parameter. The Oracle Database automatically distributes this memory among various subcomponents to ensure most effective memory utilization.
The DBORCL database SGA_TARGET is set in the initDBORCL.ora file:
sga_target=1610612736
With automatic SGA memory management, the different SGA components are flexibly sized to adapt to the SGA available.
Setting a single parameter simplifies the administration task – the DBA only specifies the amount of SGA memory available to an instance – the DBA can forget about the sizes of individual components. No out of memory errors are generated unless the system has actually run out of memory. No manual tuning effort is needed.
The SGA_TARGET initialization parameter reflects the total size of the SGA and includes memory for the following components:
Fixed SGA and other internal allocations needed by the Oracle Database instance
The log buffer
The shared pool
The Java pool
The buffer cache
The keep and recycle buffer caches (if specified)
Nonstandard block size buffer caches (if specified)
The Streams Pool
If SGA_TARGET is set to a value greater than SGA_MAX_SIZE at startup, then the SGA_MAX_SIZE value is bumped up to accommodate SGA_TARGET. After startup, SGA_TARGET can be decreased or increased dynamically. However, it cannot exceed the value of SGA_MAX_SIZE that was computed at startup.
When you set a value for SGA_TARGET, Oracle Database 10g automatically sizes the most commonly configured components, including:
The shared pool (for SQL and PL/SQL execution)
The Java pool (for Java execution state)
The large pool (for large allocations such as RMAN backup buffers)
The buffer cache
There are a few SGA components whose sizes are not automatically adjusted. The DBA must specify the sizes of these components explicitly, if they are needed by an application. Such components are:
Keep/Recycle buffer caches (controlled by DB_KEEP_CACHE_SIZE and DB_RECYCLE_CACHE_SIZE)
Additional buffer caches for non-standard block sizes (controlled by DB_nK_CACHE_SIZE, n = {2, 4, 8, 16, 32})
Streams Pool (controlled by the new parameter STREAMS_POOL_SIZE)
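As a sketch of enabling automatic shared memory management dynamically (the size is illustrative and must not exceed SGA_MAX_SIZE; any component you still set explicitly acts as a minimum for that component, while setting it to zero hands it fully to automatic tuning):
SQL> ALTER SYSTEM SET SGA_TARGET = 1536M;
SQL> ALTER SYSTEM SET SHARED_POOL_SIZE = 0;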
Shared Pool
The Shared Pool is a memory structure that is shared by all system users. It consists of both fixed and variable structures. The variable component grows and shrinks depending on the demands placed on memory size by system users and application programs.
Memory can be allocated to the Shared Pool by the parameter SHARED_POOL_SIZE in the parameter file. You can alter the size of the shared pool dynamically with the ALTER SYSTEM SET command; an example command is shown below. You must keep in mind that the total memory allocated to the SGA is set by the SGA_TARGET parameter (and may also be limited by the SGA_MAX_SIZE if it is set), and since the Shared Pool is part of the SGA, you cannot exceed the maximum size of the SGA.
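The size below is illustrative; the value you choose depends on your workload:
SQL> ALTER SYSTEM SET SHARED_POOL_SIZE = 64M;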
The Shared Pool stores the most recently executed SQL statements and used data definitions. This is because some system users and application programs will tend to execute the same SQL statements often. Saving this information in memory can improve system performance.
The Shared Pool includes the Library Cache and Data Dictionary Cache.
Library Cache
Memory is allocated to the Library Cache whenever an SQL statement is parsed or a program unit is called. This enables storage of the most recently used SQL and PL/SQL statements.
If the Library Cache is too small, the Library Cache must purge statement definitions in order to have space to load new SQL and PL/SQL statements. Actual management of this memory structure is through a Least-Recently-Used (LRU) algorithm. This means that the SQL and PL/SQL statements that are oldest and least recently used are purged when more storage space is needed.
The Library Cache is composed of two memory subcomponents:
Shared SQL: This stores/shares the execution plan and parse tree for SQL statements. If a system user executes an identical statement, then the statement does not have to be parsed again in order to execute the statement.
Shared PL/SQL Procedures and Packages: This stores/shares the most recently used PL/SQL statements such as functions, packages, and triggers.
Data Dictionary Cache
The Data Dictionary Cache is a memory structure that caches data dictionary information that has been recently used. This includes user account information, datafile names, table descriptions, user privileges, and other information.
The database server manages the size of the Data Dictionary Cache internally and the size depends on the size of the Shared Pool in which the Data Dictionary Cache resides. If the size is too small, then the data dictionary tables that reside on disk must be queried often for information and this will slow down performance.
Buffer Caches
A number of buffer caches are maintained in memory in order to improve system response time.
Database Buffer Cache
The Database Buffer Cache is a fairly large memory object that stores the actual data blocks that are retrieved from datafiles by system queries and other data manipulation language commands.
A query causes a Server Process to first look in the Database Buffer Cache to determine if the requested information happens to already be located in memory – thus the information would not need to be retrieved from disk and this would speed up performance. If the information is not in the Database Buffer Cache, the Server Process retrieves the information from disk and stores it to the cache.
Keep in mind that information read from disk is read a block at a time, not a row at a time, because a database block is the smallest addressable storage space on disk.
Database blocks are kept in the Database Buffer Cache according to a Least Recently Used (LRU) algorithm and are aged out of memory if a buffer cache block is not used in order to provide space for the insertion of newly needed database blocks.
The buffers in the cache are organized in two lists:
the write list and,
the least recently used (LRU) list.
The write list holds dirty buffers – these are buffers that hold that data that has been modified, but the blocks have not been written back to disk.
The LRU list holds free buffers, pinned buffers, and dirty buffers that have not yet been moved to the write list. Free buffers do not contain any useful data and are available for use. Pinned buffers are currently being accessed.
When an Oracle process accesses a buffer, the process moves the buffer to the most recently used (MRU) end of the LRU list – this causes dirty buffers to age toward the LRU end of the LRU list.
When an Oracle user process needs a data row, it searches for the data in the database buffer cache because memory can be searched more quickly than hard disk can be accessed. If the data row is already in the cache (a cache hit), the process reads the data from memory; otherwise a cache miss occurs and data must be read from hard disk into the database buffer cache.
Before reading a data block into the cache, the process must first find a free buffer. The process searches the LRU list, starting at the LRU end of the list. The search continues until a free buffer is found or until the search reaches the threshold limit of buffers.
Each time the user process finds a dirty buffer as it searches the LRU, that buffer is moved to the write list and the search for a free buffer continues.
When the process finds a free buffer, it reads the data block from disk into the buffer and moves the buffer to the MRU end of the LRU list.
If an Oracle user process searches the threshold limit of buffers without finding a free buffer, the process stops searching the LRU list and signals the DBW0 background process to write some of the dirty buffers to disk. This frees up some buffers.
The block size for a database is set when a database is created and is determined by the init.ora parameter named DB_BLOCK_SIZE. Typical block sizes are 2KB, 4KB, 8KB, 16KB, and 32KB. The size of blocks in the Database Buffer Cache matches the block size for the database. The DBORCL database uses an 8KB block size.
Because tablespaces that store oracle tables can use different (non-standard) block sizes, there can be more than one Database Buffer Cache allocated to match block sizes in the cache with the block sizes in the non-standard tablespaces.
The size of the Database Buffer Caches can be controlled by the parameters DB_CACHE_SIZE and DB_nK_CACHE_SIZE to dynamically change the memory allocated to the caches without restarting the Oracle instance.
You can dynamically change the size of the Database Buffer Cache with the ALTER SYSTEM command like the one shown here:
ALTER SYSTEM SET DB_CACHE_SIZE = 96M;
You can have the Oracle Server gather statistics about the Database Buffer Cache to help you size it to achieve an optimal workload for the memory allocation. This information is displayed from the V$DB_CACHE_ADVICE view. In order for statistics to be gathered, you can dynamically alter the system by using the ALTER SYSTEM SET DB_CACHE_ADVICE (OFF, ON, READY) command. However, gathering statistics on system performance always incurs some overhead that will slow down system performance.
SQL> ALTER SYSTEM SET db_cache_advice = ON;
System altered.
SQL> DESC v$db_cache_advice;
 Name                                      Null?    Type
 ----------------------------------------- -------- -------------
 ID                                                 NUMBER
 NAME                                               VARCHAR2(20)
 BLOCK_SIZE                                         NUMBER
 ADVICE_STATUS                                      VARCHAR2(3)
 SIZE_FOR_ESTIMATE                                  NUMBER
 SIZE_FACTOR                                        NUMBER
 BUFFERS_FOR_ESTIMATE                               NUMBER
 ESTD_PHYSICAL_READ_FACTOR                          NUMBER
 ESTD_PHYSICAL_READS                                NUMBER
 ESTD_PHYSICAL_READ_TIME                            NUMBER
 ESTD_PCT_OF_DB_TIME_FOR_READS                      NUMBER
 ESTD_CLUSTER_READS                                 NUMBER
 ESTD_CLUSTER_READ_TIME                             NUMBER
SQL> SELECT name, block_size, advice_status FROM v$db_cache_advice;

NAME                 BLOCK_SIZE ADV
-------------------- ---------- ---
DEFAULT                    8192 ON
<more rows will display>
21 rows selected.
SQL> ALTER SYSTEM SET db_cache_advice = OFF;
System altered.
KEEP Buffer Pool
This pool retains blocks in memory (data from tables) that are likely to be reused throughout daily processing. An example might be a table containing user names and passwords or a validation table of some type.
The DB_KEEP_CACHE_SIZE parameter sizes the KEEP Buffer Pool.
RECYCLE Buffer Pool
This pool is used to store table data that is unlikely to be reused throughout daily processing – thus the data is quickly recycled.
The DB_RECYCLE_CACHE_SIZE parameter sizes the RECYCLE Buffer Pool.
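For example, a small, frequently read lookup table can be directed at the KEEP pool when it is created (the table name and columns are illustrative):
SQL> CREATE TABLE lookup_codes (code NUMBER, descr VARCHAR2(40)) STORAGE (BUFFER_POOL KEEP);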
Redo Log Buffer
The Redo Log Buffer memory object stores images of all changes made to database blocks. As you know, database blocks typically store several table rows of organizational data. This means that if a single column value from one row in a block is changed, the image is stored. Changes include INSERT, UPDATE, DELETE, CREATE, ALTER, or DROP.
Think of the Redo Log Buffer as a circular buffer that is reused over and over. As the buffer fills up, copies of the images are stored to the Redo Log Files that are covered in more detail in a later module.
Large Pool
The Large Pool is an optional memory structure that primarily relieves the memory burden placed on the Shared Pool. The Large Pool is used for the following tasks if it is allocated:
Allocating space for session memory requirements from the User Global Area (part of the Server Process) where a Shared Server is in use.
Transactions that interact with more than one database, e.g., a distributed database scenario.
Backup and restore operations by the Recovery Manager (RMAN) process.
RMAN uses the Large Pool for backup I/O buffers only when I/O slaves are configured (for example, with BACKUP_TAPE_IO_SLAVES = TRUE).
If the Large Pool is too small, memory allocation for backup will fail and memory will be allocated from the Shared Pool.
Parallel execution message buffers for parallel server operations. The PARALLEL_AUTOMATIC_TUNING = TRUE parameter must be set.
The Large Pool size is set with the LARGE_POOL_SIZE parameter (in recent releases it can be changed dynamically with ALTER SYSTEM). It does not use an LRU list to manage memory.
Java Pool
The Java Pool is an optional memory object, but is required if the database has Oracle Java installed and in use for Oracle JVM (Java Virtual Machine). The size is set with the JAVA_POOL_SIZE parameter that defaults to 24MB.
The Java Pool is used for memory allocation to parse Java commands.
Storing Java code and data in the Java Pool is analogous to SQL and PL/SQL code cached in the Shared Pool.
Streams Pool
This cache is new to Oracle 10g. It is sized with the parameter STREAMS_POOL_SIZE.
This pool stores data and control structures to support the Oracle Streams feature of Oracle Enterprise Edition. Oracle Streams manages sharing of data and events in a distributed environment.
If STREAMS_POOL_SIZE is not set or is zero, memory for Oracle Streams operations is allocated from up to 10% of the Shared Pool memory.
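To see how SGA memory is currently divided among these pools, you can query V$SGASTAT; a minimal sketch:
select nvl(pool, '(outside pools)') pool, round(sum(bytes)/1024/1024) mb
from v$sgastat
group by pool
order by mb desc;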
Program Global Area
The Program Global Area is also termed the Process Global Area (PGA) and is a region of memory allocated outside of the Oracle Instance. The PGA stores data and control information for a single Server Process or a single Background Process. It is allocated when a process is created, and the memory is reclaimed by the operating system when the process terminates. This is NOT a shared part of memory – one PGA per process only.
The content of the PGA varies, but generally includes the following:
Private SQL Area: Data for binding variables and runtime memory allocations. A user session issuing SQL statements has a Private SQL Area that may be associated with a Shared SQL Area if the same SQL statement is being executed by more than one system user. This often happens in OLTP environments where many users are executing and using the same application program.
Dedicated Server environment – the Private SQL Area is located in the Program Global Area.
Shared Server environment – the Private SQL Area is located in the System Global Area.
Session Memory: Memory that holds session variables and other session information.
SQL Work Area: Memory allocated for sort, hash-join, bitmap merge, and bitmap create types of operations.
Oracle 9i and later versions enable automatic sizing of the SQL Work Areas by setting the WORKAREA_SIZE_POLICY = AUTO parameter (this is the default!) and PGA_AGGREGATE_TARGET = n (where n is an amount of memory established by the DBA); a sketch follows the parameter list below. Alternatively, the DBA can let Oracle determine the appropriate amount of memory.
Oracle 8i and earlier required the DBA to set the following parameters to control SQL Work Area memory allocations:
SORT_AREA_SIZE.
HASH_AREA_SIZE.
BITMAP_MERGE_AREA_SIZE.
CREATE_BITMAP_AREA_SIZE.
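As referenced above, a minimal sketch of the automatic work area settings (the target value is illustrative):
SQL> ALTER SYSTEM SET WORKAREA_SIZE_POLICY = AUTO;
SQL> ALTER SYSTEM SET PGA_AGGREGATE_TARGET = 512M;
-- Check the effect:
SQL> SELECT name, value FROM v$pgastat
     WHERE name IN ('aggregate PGA target parameter', 'total PGA allocated');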
Software Code Area
Software code areas store Oracle executable files running as part of the Oracle instance.
These code areas are static in nature and are located in privileged memory that is separate from other user programs.
The code can be installed sharable when multiple Oracle instances execute on the same server with the same software release level.
Processes
You need to understand three different types of Processes:
User Process: Starts when a database user requests to connect to an Oracle Server.
Server Process: Establishes the Connection to an Oracle Instance when a User Process requests connection – makes the connection for the User Process.
Background Processes: These start when an Oracle Instance is started up.
User Process
In order to use Oracle, you must obviously connect to the database. This must occur whether you’re using SQL*Plus, an Oracle tool such as Designer or Forms, or an application program.
This generates a User Process (a memory object) that issues programmatic calls through your user interface (SQL*Plus, Integrated Developer Suite, or application program); the connection request creates a session and causes the generation of a Server Process that is either dedicated or shared.
Server Process
As you have seen, the Server Process is the go-between for a User Process and the Oracle Instance. In a Dedicated Server environment, there is a single Server Process to serve each User Process. In a Shared Server environment, a Server Process can serve several User Processes, although with some performance reduction. Allocation of server processes in a dedicated environment versus a shared environment is covered in further detail in the Oracle10g Database Performance Tuning course offered by Oracle Education.
Background Processes
As is shown here, there are both mandatory and optional background processes that are started whenever an Oracle Instance starts up. These background processes serve all system users. We will cover the mandatory processes in detail.
Dnnn: Dispatcher number “nnn”, for example, D000 would be the first dispatcher process – Dispatchers are optional background processes, present only when the shared server configuration is used. Shared server is discussed in your readings on the topic “Configuring Oracle for the Shared Server”.
RECO: Recoverer – The Recoverer process is used to resolve distributed transactions that are pending due to a network or system failure in a distributed database. At timed intervals, the local RECO attempts to connect to remote databases and automatically complete the commit or rollback of the local portion of any pending distributed transactions. For information about this process and how to start it, see your readings on the topic “Managing Distributed Transactions”.
Of these, the ones you’ll use most often are ARCn (archiver) when you automatically archive redo log file information (covered in a later module), and RECO for recovery where the database is distributed on two or more separate physical Oracle servers, perhaps a UNIX machine and an NT machine.
The initialization parameter DB_WRITER_PROCESSES specifies the number of DBWn processes.
The purpose of DBWn is to improve system performance by caching writes of database blocks from the Database Buffer Cache back to datafiles. Blocks that have been modified and that need to be written back to disk are termed “dirty blocks.” The DBWn also ensures that there are enough free buffers in the Database Buffer Cache to service Server Processes that may be reading data from datafiles into the Database Buffer Cache. Performance improves because by delaying writing changed database blocks back to disk, a Server Process may find the data that is needed to meet a User Process request already residing in memory!
DBWn writes to datafiles when one of the following events occurs: a server process cannot find a free buffer after scanning a threshold number of buffers; the number of dirty buffers reaches a threshold; a checkpoint occurs; a timeout occurs (about every three seconds); a tablespace is taken offline, made read-only, or placed in hot backup mode; or a table is dropped or truncated.
If an Oracle Instance fails, all information in memory not written to disk is lost. SMON is responsible for recovering the instance when the database is started up again. It does the following:
Rolls forward to recover data that was recorded in a Redo Log File, but that had not yet been recorded to a datafile by DBWn. SMON reads the Redo Log Files and applies the changes to the data blocks. This recovers all transactions that were committed because these were written to the Redo Log Files prior to system failure.
Opens the database to allow system users to logon.
Rolls back uncommitted transactions.
SMON also does limited space management. It combines (coalesces) adjacent areas of free space in the database’s datafiles for tablespaces that are dictionary managed.
It also deallocates temporary segments to create free space in the datafiles.
Think of a checkpoint record as a starting point for recovery. DBWn will have completed writing all buffers from the Database Buffer Cache to disk prior to the checkpoint, thus those records will not require recovery. Checkpointing does the following:
Ensures modified data blocks in memory are regularly written to disk – CKPT can call the DBWn process in order to ensure this and does so when writing a checkpoint record.
Reduces Instance Recovery time by minimizing the amount of work needed for recovery since only Redo Log File entries processed since the last checkpoint require recovery.
Causes all committed data to be written to datafiles during database shutdown.
If a Redo Log File fills up and a switch is made to a new Redo Log File (this is covered in more detail in a later module), the CKPT process also writes checkpoint information into the headers of the datafiles.
Checkpoint information written to control files includes the system change number (the SCN is a number stored in the control file and in the headers of the database files that are used to ensure that all files in the system are synchronized), location of which Redo Log File is to be used for recovery, and other information.
CKPT does not write data blocks or redo blocks to disk – it calls DBWn and LGWR as necessary.
When a Redo Log File fills up, Oracle switches to the next Redo Log File. The DBA creates several of these and the details of creating them are covered in a later module. If all Redo Log Files fill up, then Oracle switches back to the first one and uses them in a round-robin fashion by overwriting ones that have already been used – it should be obvious that the information stored on the files, once overwritten, is lost forever.
If the database is in what is termed ARCHIVELOG mode, then as the Redo Log Files fill up, they are individually written to Archived Redo Log Files by ARCn, and LGWR does not overwrite a Redo Log File until archiving has completed. Thus, committed data is not lost forever and can be recovered in the event of a disk failure. Only the contents of the SGA will be lost if an Instance fails.
In NOARCHIVELOG mode, the Redo Log Files are overwritten and not archived. Recovery can only be made to the last full backup of the database files. All committed transactions after the last full backup are lost, and you can see that this could cost the firm a lot of $$$.
When running in ARCHIVELOG mode, the DBA is responsible to ensure that the Archived Redo Log Files do not consume all available disk space! Usually after two complete backups are made, any Archived Redo Log Files for prior backups are deleted.
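You can confirm which mode the database is running in from a privileged SQL*Plus session:
SQL> ARCHIVE LOG LIST
SQL> SELECT log_mode FROM v$database;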
Logical Structure
It is helpful to understand how an Oracle database is organized in terms of a logical structure that is used to organize physical objects.
Each tablespace has at least one physical datafile that actually stores the tablespace at the operating system level. A large tablespace may have more than one datafile allocated for storing objects assigned to that tablespace.
A tablespace belongs to only one database.
Tablespaces can be brought online and taken offline for purposes of backup and management, except for the SYSTEM tablespace that must always be online.
Tablespaces can be in either read-only or read-write status.
The DBA can change the size of a datafile to make it smaller or larger. The file can also grow in size dynamically as the tablespace grows.
Segment: When logical storage objects are created within a tablespace, for example, an employee table, a segment is allocated to the object.
Obviously a tablespace typically has many segments.
A segment cannot span tablespaces but can span datafiles that belong to a single tablespace.
Extent: Each object has one segment which is a physical collection of extents.
Extents are simply collections of contiguous disk storage blocks. A logical storage object such as a table or index always consists of at least one extent – ideally the initial extent allocated to an object will be large enough to store all data that is initially loaded.
As a table or index grows, additional extents are added to the segment.
A DBA can add extents to segments in order to tune performance of the system.
An extent cannot span a datafile.
An Oracle data block consists of one or more physical blocks (operating system blocks) so the data block, if larger than an operating system block, should be an even multiple of the operating system block size, e.g., if the Linux operating system block size is 2K or 4K, then the Oracle data block should be 2K, 4K, 8K, 16K, etc in size. This optimizes I/O.
The data block size is set at the time the database is created and cannot be changed. It is set with the DB_BLOCK_SIZE parameter. The maximum data block size depends on the operating system.
Thus, the Oracle database architecture includes both logical and physical structures as follows:
Physical: Control files; Redo Log Files; Datafiles; Operating System Blocks.
Logical: Tablespaces; Segments; Extents; Data Blocks.
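To see this hierarchy for a specific object, you can query the data dictionary. A minimal sketch, assuming a table named EMP owned by SCOTT:
SELECT segment_name, tablespace_name, extent_id, file_id, block_id, blocks
FROM dba_extents
WHERE owner = 'SCOTT' AND segment_name = 'EMP'
ORDER BY extent_id;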
Processing a query:
Parse:
Search for identical statement in the Shared SQL Area.
Check syntax, object names, and privileges.
Lock objects used during parse.
Create and store execution plan.
Bind: Obtains values for variables.
Execute: Process statement.
Fetch: Return rows to user process.
Processing a DML statement:
Parse: Same as the parse phase used for processing a query.
Bind: Same as the bind phase used for processing a query.
Execute:
If the data and undo blocks are not already in the Database Buffer Cache, the server process reads them from the datafiles into the Database Buffer Cache.
The server process places locks on the rows that are to be modified. The undo block is used to store the before image of the data, so that the DML statements can be rolled back if necessary.
The data blocks record the new values of the data.
The server process records the before image to the undo block and updates the data block. Both of these changes are made in the Database Buffer Cache. Any changed blocks in the Database Buffer Cache are marked as dirty buffers. That is, buffers that are not the same as the corresponding blocks on the disk.
The processing of a DELETE or INSERT command uses similar steps. The before image for a DELETE contains the column values in the deleted row, and the before image of an INSERT contains the row location information.
During cache recovery (roll forward), the changes recorded in the redo log files are applied to the affected blocks. This includes both committed and uncommitted data. Because undo data is protected by redo, roll forward also regenerates the undo images. The time required for this is proportional to the changes made in the database after the last successful checkpoint. After cache recovery, the database is consistent to the point when the crash occurred, and it can then be opened so users can start connecting. The RECOVERY_PARALLELISM parameter specifies the number of processes that participate in instance or crash recovery, and can thus speed up roll forward.
As mentioned earlier, user sessions are allowed to connect even before the transaction recovery is completed. If a user attempts to access a row that is locked by a terminated transaction, Oracle rolls back only those changes necessary to complete the transaction; in other words, it rolls them back on demand. Consequently, new transactions do not have to wait until all parts of a long transaction are rolled back.
This transaction recovery is required and has to be completed. We can disable transaction recovery temporarily but at some point this has to be completed. We can monitor the progress of fast-start parallel rollback by examining the V$FAST_START_SERVERS and V$FAST_START_TRANSACTIONS views.
The Fast-Start Fault Recovery feature reduces the time required for cache recovery, and makes the recovery bounded and predictable by limiting the number of dirty buffers and the number of redo records generated between the most recent redo record and the last checkpoint. With the Fast-Start Fault Recovery feature, the FAST_START_MTTR_TARGET initialization parameter simplifies the configuration of recovery time from instance or system failure. FAST_START_MTTR_TARGET specifies a target for the expected mean time to recover (MTTR), that is, the time (in seconds) that it should take to start up the instance and perform cache recovery. After FAST_START_MTTR_TARGET is set, the database manages incremental checkpoint writes in an attempt to meet that target.
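A minimal sketch of setting an MTTR target and checking the estimate (the 60-second value is illustrative):
SQL> ALTER SYSTEM SET FAST_START_MTTR_TARGET = 60;
SQL> SELECT target_mttr, estimated_mttr, recovery_estimated_ios
     FROM v$instance_recovery;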
If SMON is busy doing transaction recovery, you should never attempt a SHUTDOWN ABORT and restart of the database: all the recovery work done up to that point would have to be done again.
There are different modes in which you can open the database, e.g., migrate, read-only, and restricted modes.
Switchover and switchback of a standby database
Convert a physical standby into a snapshot standby and revert back
It enables a physical standby database to be open for read access while media recovery is being performed on it to keep it synchronized with the production database. The physical standby database is open in read-only mode while redo transport and standby apply are both active.
Active Data Guard automatically repairs corrupt blocks online by using the active standby database.
Normally, queries executed on active standby databases return up-to-date results. Due to potential delays in redo transport or standby apply, a standby database may “fall behind” its primary, which can cause results of queries on the standby to be out of date.
Active Data Guard sessions can be configured with a maximum query delay (in seconds). If the standby exceeds the delay, Active Data Guard returns an error to the application, which can then retry the query or transparently redirect it to the primary, as required.
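The delay is controlled per session. A minimal sketch, assuming a session connected to an Active Data Guard standby (the 30-second value is illustrative):
SQL> ALTER SESSION SET STANDBY_MAX_DATA_DELAY = 30;
-- Queries in this session now fail with an error (ORA-03172)
-- if the standby lags the primary by more than 30 seconds.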
In a logical standby database configuration, Data Guard SQL Apply uses redo information shipped from the primary system. However, instead of using media recovery to apply the changes (as in the physical standby database configuration), the redo data is transformed into equivalent SQL statements by using LogMiner technology. These SQL statements are then applied to the logical standby database. The logical standby database is open in read/write mode and is available for reporting capabilities.
The logical standby database can offer protection at database level, schema level, or even object level.
A logical standby database can be used to perform rolling database upgrades, thereby minimizing down time when upgrading to new database patch sets or full database releases.
When a buffer in the database buffer cache is modified, it is marked as a dirty buffer and added to the head of the checkpoint queue, which is kept in system change number (SCN) order. This order therefore matches the order of the redo written to the redo logs for these changed buffers.
When the number of available buffers in the buffer cache falls below an internal threshold (to the extent that server processes find it difficult to obtain available buffers), DBWn writes infrequently used, modified (dirty) buffers to the data files from the tail of the LRU list so that processes can replace buffers when they need them. DBWn also writes from the tail of the checkpoint queue to keep the checkpoint advancing.
When redo activity is heavy, LGWR writes one contiguous portion of the buffer to disk.
LGWR writes:
- When a user process commits a transaction
- When the redo log buffer is one-third full
- Before a DBWn process writes modified buffers to disk (if necessary)
- Every three seconds
In ARCHIVELOG mode, if the redo log file that LGWR must switch into has not been archived, LGWR cannot continue to function until archiving of that file completes.
The SCN at which a full checkpoint occurred is stored in both the data file headers and the control file.
The SCN at which the last incremental checkpoint occurred is only stored in the control file (in a structure known as the checkpoint progress record).
The CKPT process updates the control files and the headers of all data files to record the details of the checkpoint (as shown in the graphic). The CKPT process does not write blocks to disk; DBWn always performs that work. The SCNs recorded in the file headers guarantee that all changes made to database blocks prior to that SCN have been written to disk.
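You can compare these SCNs yourself; a minimal sketch:
SQL> SELECT checkpoint_change# FROM v$database;
SQL> SELECT file#, checkpoint_change# FROM v$datafile_header;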
You can set the value of the retention policy by using the RMAN CONFIGURE command.
Recovery Window Retention Policy
The best practice is to establish a period of time during which it will be possible to discover logical errors and fix the affected objects by doing a point-in-time recovery to just before the error occurred. This period of time is called the recovery window. The policy is specified in a number of days. For each data file, there must always exist at least one backup that satisfies the following condition:
SYSDATE – backup_checkpoint_time >= recovery_window
You can use the following command syntax to configure a recovery window retention policy:
RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF <days> DAYS;
where <days> is the size of the recovery window.
If you are not using a recovery catalog, you should keep the recovery window time period less than or equal to the value of the CONTROL_FILE_RECORD_KEEP_TIME parameter to prevent
the record of older backups from being overwritten in the control file. If you are using a recovery catalog, then make sure the value of CONTROL_FILE_RECORD_KEEP_TIME is greater than the
time period between catalog resynchronizations. Resynchronizations happen when you:
- Create a backup. In this case, the synchronization is done implicitly.
- Execute the RESYNC CATALOG command.
A backup is deemed obsolete when a more recent version of the same file has been backed up.
Obsolete backups are any backups that you do not need to satisfy a configured retention policy. You may also delete obsolete backups according to any retention policy you specify as an option to the DELETE OBSOLETE command. The DELETE OBSOLETE command removes the deleted files from the backup media and marks those backups as deleted in both the control file and the recovery catalog.
The report obsolete command reports on any obsolete backups. Always run the crosscheck command first in order to update the status of the backups in the RMAN repository to that on disk and tape.
In the following example, the report obsolete command shows no obsolete
backups:
RMAN> crosscheck backup;
RMAN> report obsolete;
RMAN retention policy will be applied to the command
RMAN retention policy is set to redundancy 1
no obsolete backups found
The following execution of the report obsolete command shows that there are both obsolete backup sets and obsolete archived redo log backups. Again, run the crosscheck command before issuing the report obsolete command.
RMAN> crosscheck backup;
RMAN> report obsolete;
When a cluster node boots, or Clusterware is started on a running node, the init process starts ohasd. The ohasd process then initiates the startup of the processes in the lower stack, the Oracle High Availability Services (OHASD) stack.
- The cssdagent process is started, which in turn starts ocssd. The ocssd process discovers the voting disk either in ASM or on shared storage, and then joins the cluster. The cssdagent process monitors the cluster and provides I/O fencing. This service was formerly provided by the Oracle Process Monitor Daemon (oprocd). A cssdagent failure may result in Oracle Clusterware restarting the node.
- The orarootagent is started. This process is a specialized oraagent process that helps crsd start and manage resources owned by root, such as the network and the grid virtual IP address.
Finally, the client establishes a connection to the service through the listener on the node where the service is offered. All these actions take place transparently to the client, without any explicit configuration required on the client.
During installation, listeners are created on the nodes for the SCAN IP addresses.
Oracle Net Services routes application requests to the least loaded instance providing the service. Because the SCAN addresses resolve to the cluster, rather than to a node address in the cluster, nodes can be added to or removed from the cluster without affecting the SCAN address configuration.
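To review the resulting SCAN configuration on an existing cluster, the standard srvctl commands can be used (run as the Grid Infrastructure owner):
$ srvctl config scan
$ srvctl config scan_listener
$ srvctl status scan_listener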
When the operating system reaches the appropriate runlevel, the High Availability Services stack components will start. The entry in /etc/inittab defines that /etc/init.d/init.ohasd is started at runlevels 3 and 5. Runlevel 3 in Linux usually brings the system up in networked, multi-user mode;
however, it doesn’t start X11. Runlevel 5 is normally used for the same purpose, but it also starts the graphical user interface. If the system is at a runlevel other than 3 or 5, then ohasd.bin cannot be started, and you need to use a call to init to change the runlevel to either 3 or 5. You can check
/var/log/messages for output from the scripts under /etc/rc.d/init.d/; ohasd.bin logs information into the default log file destination at $GRID_HOME/log/hostname in the ohasd/ohasd.log subdirectory.
The administrator has the option to disable the start of the High Availability Services stack by calling crsctl disable crs. This call updates a flag in /etc/oracle/scls_scr/hostname/root/ohasdstr. The file contains only one word, either enable or disable, and no carriage return. If set to disable, then
/etc/rc.d/init.d/ohasd will not proceed with the startup. Call crsctl start crs to start the cluster stack manually in that case. Many Grid Infrastructure background processes rely on sockets created in /var/tmp/.oracle. You
can check which socket is used by a process by listing the contents of the /proc/pid/fd directory, where pid is the process id of the program you are looking at. In some cases, permissions on the sockets can become garbled; in our experience, moving the .oracle directory to a safe location and rebooting solved the cluster communication problems.
Another reason ohasd.bin might fail to start: the file system for $GRID_HOME could be either corrupt or otherwise not mounted. Earlier, it was noted that ohasd.bin lives in $GRID_HOME/bin. If $GRID_HOME isn’t mounted, then it is not possible to start the daemon.
We introduced the OLR as an essential file for starting Grid Infrastructure. If the OLR has become corrupt or is otherwise not accessible, then ohasd.bin cannot start. Successful initialization of the OLR is recorded in the ohasd.log, as in the following example (the timestamps have been removed for the sake
of clarity):
[ default][3046704848] OHASD Daemon Starting. Command string :reboot
[ default][3046704848] Initializing OLR
[ OCRRAW][3046704848]proprioo: for disk 0
(/u01/app/crs/cdata/london1.olr),
id match (1), total id sets, (1) need recover (0), my votes (0),
total votes (0), commit_lsn (15), lsn (15)
[ OCRRAW][3046704848]proprioo: my id set: (2018565920, 1028247821, 0, 0, 0)
[ OCRRAW][3046704848]proprioo: 1st set: (2018565920, 1028247821, 0, 0, 0)
[ OCRRAW][3046704848]proprioo: 2nd set: (0, 0, 0, 0, 0)
[ CRSOCR][3046704848] OCR context init CACHE Level: 0xaa4cfe0
[ default][3046704848] OHASD running as the Privileged user
Interestingly, the errors pertaining to the local registry have the same numbers as those for the OCR; however, they have been prefixed by PROCL. The L can easily be missed, so check carefully! If the OLR cannot be read, then you will see the error messages immediately under the Initializing OLR line. This chapter has covered two causes so far: the OLR is missing or the OLR is corrupt. The first case is much easier to diagnose because, in that case, OHAS will not start:
[root@london1 ~]# crsctl check crs
CRS-4639: Could not contact Oracle High Availability Services
In the preceding example, ohasd.log will contain an error message similar to this one:
[ default][1381425744] OHASD Daemon Starting. Command string :restart
[ default][1381425744] Initializing OLR
[ OCROSD][1381425744]utopen:6m’:failed in stat OCR file/disk
/u01/app/crs/cdata/london1.olr,
errno=2, os err string=No such file or directory
[ OCROSD][1381425744]utopen:7:failed to open any OCR file/disk, errno=2,
os err string=No such file or directory
[ OCRRAW][1381425744]proprinit: Could not open raw device
[ OCRAPI][1381425744]a_init:16!: Backend init unsuccessful : [26]
[ CRSOCR][1381425744] OCR context init failure. Error: PROCL-26: Error
while accessing the physical storage Operating System
error [No such file or directory] [2]
[ default][1381425744] OLR initalization failured, rc=26
[ default][1381425744]Created alert : (:OHAS00106:) : Failed to initialize
Oracle Local Registry
[ default][1381425744][PANIC] OHASD exiting; Could not init OLR
In this case, you should restore the OLR, which you will learn how to do in the “Maintaining Voting
Disk and OCR/OLR” section.
If the OLR is corrupted, then you will see slightly different errors. OHAS tries to read the OLR; while it succeeds for some keys, it fails for others. Long hex dumps will appear in the ohasd.log, indicating a problem. You should perform an ocrcheck -local in this case, which can help you determine the root
cause. The following output has been taken from a system where the OLR was corrupt:
[root@london1 ohasd]# ocrcheck -local
Status of Oracle Local Registry is as follows :
Device/File Name
: /u01/app/crs/cdata/london1.olr
Device/File integrity check failed
Local registry integrity check failed
Logical corruption check bypassed
If the utility confirms that the OLR is corrupted, then you have no option but to restore it.
Testing ASM Disk Failure Scenario and disk_repair_time
col name format a8
col header_status format a7
col path format a10
set lines 2000
select name, path, state, header_status, repair_timer, mode_status, mount_status
from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN         1200 OFFLINE MISSING
In 10g, if a disk went missing, it would immediately get dropped and a REBALANCE operation would kick in straight away, whereby ASM would start redistributing the ASM extents across the available disks in the ASM disk group to restore redundancy.
The following disk group attributes control how long ASM waits before permanently dropping an offline disk or failure group (ASM Fast Mirror Resync):
DISK_REPAIR_TIME
FAILGROUP_REPAIR_TIME
SQL> col name format a30
SQL> select name, value from v$asm_attribute
     where group_number = 3 and name like '%repair_time%';

NAME                           VALUE
------------------------------ --------------------
disk_repair_time               3.6h
failgroup_repair_time          24.0h

SQL> alter diskgroup data set attribute 'disk_repair_time' = '1h';
Diskgroup altered.

SQL> alter diskgroup data set attribute 'failgroup_repair_time' = '10h';
Diskgroup altered.

SQL> select name, value from v$asm_attribute
     where group_number = 3 and name like '%repair_time%';

NAME                           VALUE
------------------------------ --------------------
disk_repair_time               1h
failgroup_repair_time          10h
ORA-15042
SQL> alter diskgroup data mount;
alter diskgroup data mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "3" is missing from group number "2"

SQL> alter diskgroup data mount force;
Diskgroup altered.
Monitoring the REPAIR_TIME
SQL> select name, path, state, header_status, repair_timer, mode_status, mount_status
     from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN          649 OFFLINE MISSING

-- We can confirm that no rebalance has started yet by using the following query:
SQL> select * from v$asm_operation;
no rows selected
SQL> alter diskgroup data online disk data4;
Diskgroup altered.

SQL> select name, path, state, header_status, repair_timer, mode_status, mount_status
     from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN          465 SYNCING CACHED

-- Syncing is in progress, and hence no rebalance will occur.
SQL> select * from v$asm_operation;
no rows selected

-- After some time, everything returns to normal.
SQL> select name, path, state, header_status, repair_timer, mode_status, mount_status
     from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4    ORCL:DATA4 NORMAL   MEMBER             0 ONLINE  CACHED
Since the disk status is OFFLINE, we need to use the FORCE option to drop the disk. After dropping the disk, a rebalance starts and can be monitored from the V$ASM_OPERATION view.
SQL> alter diskgroup data drop disk data4;
alter diskgroup data drop disk data4
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15084: ASM disk "DATA4" is offline and cannot be dropped.

SQL> alter diskgroup data drop disk data4 force;
Diskgroup altered.

SQL> select group_number, operation, pass, state, power, sofar, est_work
     from v$asm_operation;

GROUP_NUMBER OPERA PASS      STATE      POWER SOFAR EST_WORK
------------ ----- --------- ----- ---------- ----- --------
           2 REBAL RESYNC    DONE           9     0        0
           2 REBAL REBALANCE DONE           9    42       42
           2 REBAL COMPACT   RUN            9     1        0
SQL> alter diskgroup data add disk 'ORCL:DATA4';
Diskgroup altered.

SQL> select group_number, operation, pass, state, power, sofar, est_work
     from v$asm_operation;

GROUP_NUMBER OPERA PASS      STATE      POWER SOFAR EST_WORK
------------ ----- --------- ----- ---------- ----- --------
           2 REBAL RESYNC    DONE           9     0        0
           2 REBAL REBALANCE RUN            9    37     2787
           2 REBAL COMPACT   WAIT           9     1        0
https://www.oraclenext.com/2018/01/testing-asm-disk-failure-scenario-and.html
What happens when two cell nodes reboot in Exadata with NORMAL redundancy?
HIGH redundancy
HIGH redundancy results in three copies of data being written to storage on three separate
Exadata Storage Servers. It provides the greatest protection, tolerating the simultaneous loss
of up to two entire cells.
HIGH redundancy is generally recommended for disk groups that house critical databases.
This is especially the case for disk groups based on high capacity cells because the amount
of time required to restore redundancy with high capacity disks (8 TB) increases the potential
for a second failure to occur before the first failure is completely dealt with.
One cost associated with this protection is that the overall storage capacity of Exadata is
effectively reduced by one third compared with NORMAL redundancy. For example, a Full
Rack Database Machine with high capacity disks has a raw disk capacity of 1344 TB. With
HIGH redundancy applied the usable capacity becomes approximately 400 TB compared to
approximately 600 TB using NORMAL redundancy.
Another cost is that all writes to HIGH redundancy disk groups must be performed on three
separate storage servers. This means that 50% more I/O bandwidth is consumed by writes on
HIGH redundancy disk groups compared with writes to NORMAL redundancy disk groups.
Extra I/O is not required for read operations. So the balance of application reads and writes,
coupled with the I/O bandwidth requirements of the application should be considered when
selecting the appropriate protection level.
NORMAL redundancy:
Double mirroring across two separate storage servers
Provides one layer of redundancy
No additional protection if one cell is offline
Requires extra time and effort to maintain redundancy through planned maintenance
NORMAL redundancy results in two copies of data being written to storage on two separate
Exadata Storage Servers. As already highlighted, NORMAL redundancy has lower costs in
terms of storage capacity and I/O bandwidth consumption for write operations.
Using NORMAL redundancy on Exadata provides good data protection, seamlessly
tolerating the loss of up to one entire cell. However, note that there are situations where data
protection may be compromised using NORMAL redundancy.
Firstly, consider that some maintenance operations, such as cell patching, require that a cell is taken offline for a period of time. During that time, data availability is maintained using the remaining copy. However, what happens if a failure affects the only remaining copy? The disk group would be taken offline, disrupting or possibly even terminating database processing. This situation can be protected against by removing the cell from the disk group and rebalancing the data onto other cells prior to taking the cell offline.
However, this operation requires a substantial amount of time and would consume a
substantial amount of I/O bandwidth. It would also require sufficient free space to be available
on the remaining cells.
Similarly, using NORMAL redundancy, the simultaneous loss of two disks on different storage
servers may result in some portion of a database becoming unavailable. While this is a highly
unlikely scenario, it is far more likely than suffering a comparable interruption using HIGH
redundancy.
Free space management
When a failure occurs, ASM requires free space in a disk group to re-create the lost data extents in order to preserve redundancy. The amount of free space required depends on the amount of storage affected by the failure. Oracle recommends that customers consider the possibility of losing an entire cell when determining the amount of free space that is routinely maintained. Note that the ASM redundancy level (HIGH or NORMAL) and the Exadata model being used can have a profound bearing on the amount of free space required to maintain ASM redundancy.
For example, on a Full Rack Database Machine, with 14 cells, the failure of a cell requires the
contents of the failed cell to be redistributed across the 13 surviving cells. This means that
1/14 (a little over 7%) of the overall capacity needs to be reserved as free space in order to
preserve redundancy if a cell is lost.
By contrast, on a Quarter Rack Database Machine, with only 3 cells, you require at least 1/3
of the total capacity to be free to preserve NORMAL redundancy if one cell becomes
unavailable.
Furthermore, on a Quarter Rack Database Machine using HIGH redundancy, it is impossible
to preserve redundancy if a cell is lost since a Quarter Rack only contains three cells to start
with. However in this case you could choose to continue operations with the remaining two
cells until the third one is replaced.
Quorum Disks on Database Servers
Oracle Clusterware requires 5 voting disks in different failure groups when using a high redundancy disk group to store the cluster voting disks. Consequently, in earlier releases, the voting disks were always created in a normal redundancy disk group on Exadata clusters with 4 or fewer storage servers. In such cases, if two cells become unavailable, the cluster is forced to shut down, even if the data is being protected in other high redundancy disk groups.
Quorum disks enable users to leverage the disks on database servers to achieve higher
availability in smaller Exadata configurations. Quorum disks are created on the database
servers and can be used to store cluster voting disks, the cluster registry (OCR), and
database server parameter files (spfiles).
For new systems, Oracle Exadata Deployment Assistant automatically configures quorum
disks during deployment if the required conditions are met. The main requirements are:
• You are configuring an Exadata cluster with 4 or fewer storage servers.
• High redundancy is specified for the DATA or RECO disk group (or both).
• The minimum required software versions are being used, including Exadata Storage
Server release 12.1.2.3.0 and Grid Infrastructure version 12.1.0.2.160119.
How does the database server interact with the storage servers when a command is issued?
Oracle Database communicates with Exadata cells by using a special library called
LIBCELL ($ORACLE_HOME/lib/libcell11.so or
$ORACLE_HOME/lib/libcell12.so). In combination with the database kernel and
ASM, LIBCELL transparently maps database I/O operations to Exadata Storage Server
enhanced operations. LIBCELL communicates with Exadata cells by using the
Intelligent Database protocol (iDB). iDB is a unique Oracle data transfer protocol, built
on Reliable Datagram Sockets (RDS), which runs on industry standard InfiniBand
networking hardware. LIBCELL and iDB enable ASM and database instances to utilize
Exadata Storage Server features, such as Smart Scan and I/O Resource Management.
The Database Resource Manager (DBRM) is integrated with Exadata Storage Server
I/O Resource Management (IORM). DBRM and IORM work together to ensure that I/O
resources are allocated based on administrator-defined priorities.
Diskmon checks the storage network interface state and cell liveness. It also performs
DBRM plan propagation to Exadata cells. Diskmon uses a node-wide master process
(diskmon.bin) and one slave process (DSKM) for each RDBMS or ASM instance.
The master performs the monitoring and propagates state information to the slaves. The
slaves use the SGA to communicate with the RDBMS or ASM processes. If there is a
failure in the cluster, Diskmon performs I/O fencing to protect data integrity. Cluster
Synchronization Services (CSS) decides what to fence. The master Diskmon starts with
the clusterware processes. The slave Diskmon processes are background processes
that are started and stopped in conjunction with the associated RDBMS or ASM
instance.
The Management Server (MS) on a database server provides a set of management and
configuration functions. It works in cooperation with the DBMCLI command-line
interface. Each database server is individually managed with DBMCLI. DBMCLI can be
used only from within a database server to manage that server. However, you can run
the same DBMCLI command remotely on multiple nodes with the dcli utility. In
addition, MS is responsible for sending alerts, and collects some statistics.
The Restart Server (RS) is used to start up or shut down MS and monitors MS to
automatically restart it if required.
Oracle Database and ASM processes use
LIBCELL to communicate with CELLSRV, and LIBCELL converts I/O requests into
messages that are sent to CELLSRV by using the iDB protocol.
To illustrate how SQL processing is performed in this architecture, consider the following example of a traditional (non-offloaded) table scan.
1. The client issues a SELECT statement with a predicate to filter a table and return only
the rows of interest to the user.
2. The database kernel maps this request to the file and extents containing the table.
3. The database kernel issues the I/Os to read all the table blocks.
4. All the blocks for the table being queried are read into memory.
5. SQL processing is conducted against the data blocks searching for the rows that satisfy
the predicate.
6. The required rows are returned to the client.
As is often the case with large queries, the predicate filters out most of the rows in the
table. Yet all the blocks from the table need to be read, transferred across the storage
network, and copied into memory. Many more rows are read into memory than required to
complete the requested SQL operation. This generates a large amount of unproductive I/O,
which wastefully consumes resources and impacts application throughput and response time.
On Exadata, database operations are handled differently. Queries that perform table scans
can be processed within Exadata cells and return only the required subset of data to the
database server. Row filtering, column filtering, some join processing, and other functions can
be performed within Exadata cells. Exadata Storage Server uses a special direct-read
mechanism for Smart Scan processing.
1. The client issues a SELECT statement to return some rows of interest.
2. The database kernel determines that the data is stored on Exadata cells so an iDB
command representing the SQL command is constructed and sent to the Exadata cells.
3. The Exadata Storage Server software scans the data blocks to extract the relevant rows
and columns which satisfy the SQL command.
4. Exadata cells return to the database instance iDB messages containing the requested
rows and columns of data. These results are not block images, so they are not stored in
the buffer cache.
5. The database kernel consolidates the result sets from across all the Exadata cells. This
is similar to how the results from a parallel query operation are consolidated.
6. The rows are returned to the client.
Moving SQL processing off the database server frees server CPU cycles and eliminates a
massive amount of unproductive I/O transfers. These resources are free to better service
other requests. Queries run faster, and more of them can be processed.
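To verify that Smart Scan is actually occurring, you can check the cell I/O statistics; a minimal sketch (exact statistic names can vary between releases):
SQL> SELECT name, value FROM v$sysstat
     WHERE name IN ('cell physical IO bytes eligible for predicate offload',
                    'cell physical IO interconnect bytes returned by smart scan');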
How to add a disk to ASM
SQL> @asm

NAME                             TOTAL_GB    FREE_GB
------------------------------ ---------- ----------
DATAGROUP                      249.995117 15.2236328
IDXGROUP                        149.99707 10.4892578
[root@rac-node1 ~]# fdisk -l /dev/sdm

Disk /dev/sdm: 85.8 GB, 85899345920 bytes
255 heads, 63 sectors/track, 10443 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdm doesn't contain a valid partition table

[root@rac-node1 ~]# fdisk /dev/sdm
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

The number of cylinders for this disk is set to 10443.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-10443, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-10443, default 10443):
Using default value 10443

Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.

[root@rac-node1 ~]# fdisk -l /dev/sdm

Disk /dev/sdm: 85.8 GB, 85899345920 bytes
255 heads, 63 sectors/track, 10443 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdm1               1       10443    83883366   83  Linux
[root@rac-node1 ~]# /etc/init.d/oracleasm createdisk DATA5 /dev/sdm1
Marking disk "DATA5" as an ASM disk:                       [  OK  ]
[root@rac-node1 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks:               [  OK  ]
[root@rac-node1 ~]# /etc/init.d/oracleasm listdisks
DATA3
DATA4
DATA5
DISK1
DISK2
INDEX2
INDEX5
[root@rac-node2 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks:               [  OK  ]
[root@rac-node2 ~]# /etc/init.d/oracleasm listdisks
DATA3
DATA4
DATA5
DISK1
DISK2
INDEX2
INDEX5
/dev/raw/raw6 /dev/sdm1 ==> add this to rawdevices file
chmod 660 /dev/raw/raw6
SQL> set linesize 9999
SQL> SELECT NVL(a.name, '[CANDIDATE]') disk_group_name
  2       , b.path disk_file_path
  3       , b.name disk_file_name
  4       , b.failgroup disk_file_fail_group
  5    FROM v$asm_diskgroup a RIGHT OUTER JOIN v$asm_disk b USING (group_number)
  6   ORDER BY a.name;

DISK_GROUP_NAME DISK_FILE_PATH DISK_FILE_NAME DISK_FILE_FAIL_GROUP
--------------- -------------- -------------- --------------------
DATAGROUP       ORCL:DISK1     DISK1          DISK1
DATAGROUP       ORCL:INDEX5    INDEX5         INDEX5
DATAGROUP       ORCL:DATA4     DATA4          DATA4
DATAGROUP       ORCL:DATA3     DATA3          DATA3
IDXGROUP        ORCL:DISK2     DISK2          DISK2
IDXGROUP        ORCL:INDEX2    INDEX2         INDEX2
[CANDIDATE]     ORCL:DATA5                    ==> this is the new disk
SQL> alter diskgroup DATAGROUP ADD DISK 'ORCL:DATA5';
Diskgroup altered.
SQL> SELECT NVL(a.name, '[CANDIDATE]') disk_group_name
  2       , b.path disk_file_path
  3       , b.name disk_file_name
  4       , b.failgroup disk_file_fail_group
  5    FROM v$asm_diskgroup a RIGHT OUTER JOIN v$asm_disk b USING (group_number)
  6   ORDER BY a.name;

DISK_GROUP_NAME DISK_FILE_PATH DISK_FILE_NAME DISK_FILE_FAIL_GROUP
--------------- -------------- -------------- --------------------
DATAGROUP       ORCL:INDEX5    INDEX5         INDEX5
DATAGROUP       ORCL:DATA4     DATA4          DATA4
DATAGROUP       ORCL:DISK1     DISK1          DISK1
DATAGROUP       ORCL:DATA5     DATA5          DATA5
DATAGROUP       ORCL:DATA3     DATA3          DATA3
IDXGROUP        ORCL:INDEX2    INDEX2         INDEX2
IDXGROUP        ORCL:DISK2     DISK2          DISK2

7 rows selected.

DATA5 is no longer listed as a candidate.

SQL> host cat script/asm.sql
select name, TOTAL_MB/1024 total_gb, free_mb/1024 FREE_GB from v$asm_diskgroup;

SQL> @asm

NAME                             TOTAL_GB    FREE_GB
------------------------------ ---------- ----------
DATAGROUP                      329.992188   95.21875
IDXGROUP                        149.99707 10.4892578
Configuring three IPs for SCAN listener in Oracle 11gR2
SCAN
The benefit of using the SCAN is that the connection information of the client does not need to change if you add or remove nodes in the cluster.
There are three ways to configure the SCAN:
1) Non-DNS (an IP-based RAC configuration; only 1 SCAN IP will work)
2) DNS (a minimum of 3 SCAN IPs is required)
3) GNS (uses DHCP)
Before Configuration:
$ srvctl stop scan_listener
./crsctl start crs
./crsctl stat res -t
SQL> startup nomount;
# vi /etc/resolv.conf
===
https://blogs.oracle.com/optimizer/how-to-generate-a-useful-sql-execution-plan
To check what execution plan a query used yesterday or some days back: find the SQL_ID of that query, then use the following queries to see the plan history with timings.
select SQL_ID, PLAN_HASH_VALUE, OPERATION, COST, TIMESTAMP from DBA_HIST_SQL_PLAN where SQL_ID='g9ks2772vgzdy';
select SAMPLE_TIME, SESSION_ID, SQL_PLAN_HASH_VALUE from DBA_HIST_ACTIVE_SESS_HISTORY where SQL_ID='9hha8mvkmmjfd' order by 1;
DBA_HIST_SQL_PLAN displays the execution plan information for each child cursor in the workload repository.
This view captures information from V$SQL_PLAN and is used with the DBA_HIST_SQLSTAT view.
SQL_ID (VARCHAR2(13)) - SQL identifier of the parent cursor in the library cache.
PLAN_HASH_VALUE (NUMBER) - Numerical representation of the SQL plan for the cursor. Comparing one PLAN_HASH_VALUE to another easily identifies whether or not two plans are the same (rather than comparing the two plans line by line).
COST (NUMBER) - Cost of the operation as estimated by the optimizer's cost-based approach. For statements that use the rule-based approach, this column is null.
CARDINALITY (NUMBER) - Estimate, by the cost-based optimizer, of the number of rows produced by the operation.
A query is picking a bad execution plan; how do you fix it to use the good execution plan available (using Oracle baselines)?
Because SQL plans will change whenever we make changes to the instance, any SQL statement will have a history of execution plans.
Here are some common acts that will change execution plans:
Enabling dynamic sampling
Table in index DDL (e.g. dropping an index)
Re-analyze the schema statistics (done automatically, starting in 10g)
Changing an optimizer parameter (optimizer_mode, optimizer_index_cost_adj)
Enabling parallelism
This execution plan history can be seen by running scripts against the STATSPACK (stats$sql_plan and stats$sql_plan_usage) or the AWR (dba_hist_sql_plan) tables.
Once you have acquired the SQL_ID for a statement, you can use the built-in dbms_xplan.display_awr procedure to see all the different execution plans for the query.
The dbms_xplan.display_awr function allows you to input only a SQL_ID, and Oracle will show you all of the explain plans recorded for that SQL_ID in the AWR.
This simple query will show changes to SQL explain plan history, if you have purchased the extra-cost licenses for AWR.
select * from TABLE(DBMS_XPLAN.DISPLAY_AWR('&sqlid'));
Using Oracle baselines We can fix the sql plan for a SQLID:
SQL plan management is a preventative mechanism that records and evaluates the execution plans of SQL statements over time.
This mechanism can build a SQL plan baseline, which is a set of accepted plans for a SQL statement. The accepted plans have been proven to perform well.
The goal of SQL plan baselines is to preserve the performance of corresponding SQL statements, regardless of changes in the database.
Examples of changes include:
New optimizer version
Changes to optimizer statistics and optimizer parameters
Changes to schema and metadata definitions
Changes to system settings
SQL profile creation
SQL plan baselines cannot help in cases where an event has caused irreversible execution plan changes, such as dropping an index.
The SQL tuning features of Oracle Database generate SQL profiles that help the optimizer to produce well-tuned plans. However,
this mechanism is reactive and cannot guarantee stable performance when drastic database changes occur.
SQL tuning can only resolve performance issues after they have occurred and are identified.
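A minimal sketch of capturing a known-good plan into a baseline; the SQL_ID and PLAN_HASH_VALUE are illustrative and would come from the AWR/cursor-cache queries discussed above:
DECLARE
  n PLS_INTEGER;
BEGIN
  -- Load the good plan from the cursor cache into a SQL plan baseline
  n := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(
         sql_id          => '0qa98gcnnza7h',
         plan_hash_value => 568322376);
  DBMS_OUTPUT.PUT_LINE(n || ' plan(s) loaded');
END;
/
-- Confirm the baseline was created and accepted:
SELECT sql_handle, plan_name, enabled, accepted, fixed
FROM dba_sql_plan_baselines;
If the good plan is only available in AWR (not in the cursor cache), it can be loaded into a SQL tuning set first and then imported with DBMS_SPM.LOAD_PLANS_FROM_SQLSET.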
Unstable Plans (Oracle Plan Stability/Instability)
One of the most frustrating things about the Cost Based Optimizer in Oracle is the sometimes erratic performance.
It can seem almost random at times. Resolving these issues can also be challenging. It’s a little like taking your car to the mechanic
when it makes a funny noise. It never seems to make the noise when the mechanic is there.
Fortunately we have ASH/AWR which captures a ton of information about what our databases are doing.
The first one can be used to show statements that have experienced significant variances in execution time
(it can be modified to look for variances in the amount of logical i/o, but I’ll leave it as an exercise for the reader).
I called the script unstable_plans.sql. It uses an analytic function to calculate a standard deviation on the average elapsed time by plan.
So the statements that have multiple plans with wild variations in the response time between plans will be returned by the script.
The script prompts for a couple of values. The first is minimum number of standard deviations. The second is the minimum elapsed time
(I usually don’t care if a statement executes sometimes in .005 seconds and sometimes in .02 seconds,
even though this is a large swing statistically). Both these inputs are defaulted by the way.
The second one is called awr_plan_change.sql. It’s shows how the plans for a given statement have changed over time,
along with some statistics such as the average elapsed time and the average lio’s.
Anyway, here’s an example of the usage of these two scripts (by the way, the example is on an 11gR1 database, but the scripts work fine on 10g as well).
> sqlplus / as sysdba
SQL*Plus: Release 11.1.0.6.0 - Production on Tue Oct 7 15:44:20 2008
Copyright (c) 1982, 2007, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> alter system set optimizer_mode=rule;
System altered.
SQL> @awr_snap
PL/SQL procedure successfully completed.
SQL> ----------------------------------------------------------------------------------------
--
-- File name: unstable_plans.sql
--
-- Purpose: Attempts to find SQL statements with plan instability.
--
-- Author: Kerry Osborne
--
-- Usage: This scripts prompts for two values, both of which can be left blank.
--
-- min_stddev: the minimum "normalized" standard deviation between plans
-- (the default is 2)
--
-- min_etime: only include statements that have an avg. etime > this value
-- (the default is .1 second)
--
-- See http://kerryosborne.oracle-guy.com/2008/10/unstable-plans/ for more info.
---------------------------------------------------------------------------------------
set lines 155
col execs for 999,999,999
col min_etime for 999,999.99
col max_etime for 999,999.99
col avg_etime for 999,999.999
col avg_lio for 999,999,999.9
col norm_stddev for 999,999.9999
col begin_interval_time for a30
col node for 99999
break on plan_hash_value on startup_time skip 1
select * from (
select sql_id, sum(execs), min(avg_etime) min_etime, max(avg_etime) max_etime, stddev_etime/min(avg_etime) norm_stddev
from (select sql_id, plan_hash_value, execs, avg_etime,
stddev(avg_etime) over (partition by sql_id) stddev_etime
from (
select sql_id, plan_hash_value,
sum(nvl(executions_delta,0)) execs,
(sum(elapsed_time_delta)/decode(sum(nvl(executions_delta,0)),0,1,sum(executions_delta))/1000000) avg_etime
-- sum((buffer_gets_delta/decode(nvl(buffer_gets_delta,0),0,1,executions_delta))) avg_lio
from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS
where ss.snap_id = S.snap_id
and ss.instance_number = S.instance_number
and executions_delta > 0
group by sql_id, plan_hash_value
))
group by sql_id, stddev_etime
)
where norm_stddev > nvl(to_number('&min_stddev'),2)
and max_etime > nvl(to_number('&min_etime'),.1)
order by norm_stddev
/
Enter value for min_stddev:
Enter value for min_etime:
SQL_ID SUM(EXECS) MIN_ETIME MAX_ETIME NORM_STDDEV
------------- ---------- ----------- ----------- -------------
c5by9gdw3814r 14 .03 .12 2.1274
848dyu9288c3h 16 .02 .16 2.1309
2am60vd2kw8ux 11 .05 .22 2.4976
frxg53fac2n8j 5 .03 .30 4.2479
0qa98gcnnza7h 62 25.58 314.34 7.9833
g0jvz8csyrtcf 2 .09 1.19 8.2304
2cn0kc8u4b81w 545 .02 .42 12.8022
9wt62290ah0f7 6 .01 .47 38.5857
d8mayxqw0wnpv 1373 .01 .85 48.3874
9 rows selected.
SQL> /
Enter value for min_stddev:
Enter value for min_etime: 2
SQL_ID SUM(EXECS) MIN_ETIME MAX_ETIME NORM_STDDEV
------------- ---------- ----------- ----------- -------------
0qa98gcnnza7h 62 25.58 314.34 7.9833
SQL> @find_sql
Enter value for sql_text:
Enter value for address:
Enter value for sql_id: 0qa98gcnnza7h
SQL_ID CHILD PLAN_HASH EXECS ETIME AVG_ETIME USERNAME SQL_TEXT
------------- ------ ---------- ------------ ------------- ------------- ------------- -----------------------------------------
0qa98gcnnza7h 0 3723858078 5 356.53 71.31 SYS select avg(pk_col) from kso.skew where co
l1 > 0
0qa98gcnnza7h 1 568322376 1 7.92 7.92 SYS select avg(pk_col) from kso.skew where co
l1 > 0
0qa98gcnnza7h 2 568322376 10 52.14 5.21 SYS select avg(pk_col) from kso.skew where co
l1 > 0
0qa98gcnnza7h 3 568322376 30 1,064.19 35.47 KSO select avg(pk_col) from kso.skew where co
l1 > 0
0qa98gcnnza7h 4 3723858078 10 4,558.62 455.86 KSO select avg(pk_col) from kso.skew where co
l1 > 0
SQL> set lines 155
col execs for 999,999,999
col avg_etime for 999,999.999
col avg_lio for 999,999,999.9
col begin_interval_time for a30
col node for 99999
break on plan_hash_value on startup_time skip 1
select ss.snap_id, ss.instance_number node, begin_interval_time, sql_id, plan_hash_value,
nvl(executions_delta,0) execs,
(elapsed_time_delta/decode(nvl(executions_delta,0),0,1,executions_delta))/1000000 avg_etime,
(buffer_gets_delta/decode(nvl(buffer_gets_delta,0),0,1,executions_delta)) avg_lio
from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS
where sql_id = nvl('&sql_id','4dqs2k5tynk61')
and ss.snap_id = S.snap_id
and ss.instance_number = S.instance_number
and executions_delta > 0
order by 1, 2, 3
/
Enter value for sql_id: 0qa98gcnnza7h
SNAP_ID NODE BEGIN_INTERVAL_TIME SQL_ID PLAN_HASH_VALUE EXECS AVG_ETIME AVG_LIO
---------- ------ ------------------------------ ------------- --------------- ------------ ------------ --------------
3206 1 02-OCT-08 08.00.38.743 AM 0qa98gcnnza7h 568322376 4 10.359 121,722.8
3235 1 03-OCT-08 01.00.44.932 PM 0qa98gcnnza7h 1 10.865 162,375.0
3235 1 03-OCT-08 01.00.44.932 PM 0qa98gcnnza7h 3723858078 1 127.664 28,913,271.0
3236 1 03-OCT-08 01.28.09.000 PM 0qa98gcnnza7h 568322376 1 7.924 162,585.0
3236 1 03-OCT-08 01.28.09.000 PM 0qa98gcnnza7h 3723858078 1 86.682 27,751,123.0
3305 1 06-OCT-08 10.00.11.988 AM 0qa98gcnnza7h 4 64.138 22,616,931.5
3305 1 06-OCT-08 10.00.11.988 AM 0qa98gcnnza7h 568322376 2 5.710 81,149.0
3306 1 06-OCT-08 11.00.16.490 AM 0qa98gcnnza7h 6 5.512 108,198.5
3307 1 06-OCT-08 12.00.20.716 PM 0qa98gcnnza7h 2 3.824 81,149.0
3328 1 07-OCT-08 08.39.20.525 AM 0qa98gcnnza7h 30 35.473 156,904.7
3335 1 07-OCT-08 03.00.20.950 PM 0qa98gcnnza7h 3723858078 10 455.862 28,902,128.6
11 rows selected.
SQL> @dplan_awr
Enter value for sql_id: 0qa98gcnnza7h
Enter value for plan_hash_value:
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 0qa98gcnnza7h
--------------------
select avg(pk_col) from kso.skew where col1 > 0
Plan hash value: 568322376
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 44497 (100)| |
| 1 | SORT AGGREGATE | | 1 | 11 | | |
| 2 | TABLE ACCESS FULL| SKEW | 32M| 335M| 44497 (2)| 00:08:54 |
---------------------------------------------------------------------------
SQL_ID 0qa98gcnnza7h
--------------------
select avg(pk_col) from kso.skew where col1 > 0
Plan hash value: 3723858078
--------------------------------------------------
| Id | Operation | Name |
--------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT AGGREGATE | |
| 2 | TABLE ACCESS BY INDEX ROWID| SKEW |
| 3 | INDEX RANGE SCAN | SKEW_COL1 |
--------------------------------------------------
Note
-----
- rule based optimizer used (consider using cbo)
33 rows selected.
A couple of quick comments. The unstable_plans.sql script shows several statements that are switching plans.
Most of them have sub-second response times though. The one that stands out is sql_id 0qa98gcnnza7h.
It has been executed 62 times, and one plan averages about 25 seconds while another averages about 314 seconds.
The awr_plan_change.sql script shows that this statement has been switching between two plans (3723858078 and 568322376).
3723858078 is the less efficient plan (doing 25M or so logical I/Os per execution) while plan 568322376 is much better
(doing only about 120K logical I/Os per execution).
This is a typical pattern we see when bind variable peeking is an issue: plans that flip flop back and forth between two or three options.
This post is not specifically about bind variable peeking, so I won’t go into all the details (if you want more info just search on this site).
But I will say that it is quite often the case that a single plan is acceptable regardless of the bind variables that are used (in this case the 568322376 plan).
It is pretty easy to get Oracle to use a specific plan using an Outline,
SQL Profile, or Baseline (depending on the version of Oracle you are using). And that’s exactly what we did in this case to resolve the issue.
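For reference, here is a minimal sketch of the kind of thing the create_sql_profile_awr.sql script used below can do: pull the outline hints for the chosen plan out of AWR and attach them to the statement as a SQL Profile via dbms_sqltune.import_sql_profile. The sql_id and plan_hash_value are the ones from this example; treat the rest as an illustration, not necessarily the exact script.

declare
  ar_profile_hints sys.sqlprof_attr;
  cl_sql_text      clob;
begin
  -- grab the outline hints for the desired plan from AWR
  select extractvalue(value(d), '/hint') as outline_hints
    bulk collect into ar_profile_hints
    from xmltable('/*/outline_data/hint'
           passing (select xmltype(other_xml) as xmlval
                      from dba_hist_sql_plan
                     where sql_id = '0qa98gcnnza7h'
                       and plan_hash_value = 568322376
                       and other_xml is not null)) d;
  -- grab the statement text from AWR
  select sql_text into cl_sql_text
    from dba_hist_sqltext
   where sql_id = '0qa98gcnnza7h'
     and rownum = 1;
  -- attach the hints to the statement as a SQL Profile
  dbms_sqltune.import_sql_profile(
    sql_text    => cl_sql_text,
    profile     => ar_profile_hints,
    name        => 'PROFILE_0qa98gcnnza7h',
    category    => 'DEFAULT',
    force_match => false);
end;
/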
SQL> @unstable_plans
Enter value for min_stddev:
Enter value for min_etime:
SQL_ID SUM(EXECS) MIN_ETIME MAX_ETIME NORM_STDDEV
------------- ---------- ----------- ----------- -------------
0qa98gcnnza7h 4 42.08 208.80 2.8016
SQL> @awr_plan_stats
Enter value for sql_id: 0qa98gcnnza7h
SQL_ID PLAN_HASH_VALUE EXECS ETIME AVG_ETIME AVG_LIO
------------- --------------- ------------ -------------- ------------ --------------
0qa98gcnnza7h 568322376 3 126.2 42.079 124,329.7
0qa98gcnnza7h 3723858078 1 208.8 208.796 28,901,466.0
SQL> @create_sql_profile_awr
Enter value for sql_id: 0qa98gcnnza7h
Enter value for plan_hash_value: 568322376
Enter value for category:
Enter value for force_matching:
PL/SQL procedure successfully completed.
SQL> @sql_profiles
Enter value for sql_text:
Enter value for name: PROFIL%
NAME CATEGORY STATUS SQL_TEXT FOR
------------------------------ --------------- -------- ---------------------------------------------------------------------- ---
PROFILE_0qa98gcnnza7h DEFAULT ENABLED select avg(pk_col) from kso.skew NO
SQL> set echo on
SQL> @sql_profile_hints
SQL> set lines 155
SQL> col hint for a150
SQL> select attr_val hint
2 from dba_sql_profiles p, sqlprof$attr h
3 where p.signature = h.signature
4 and name like ('&profile_name')
5 order by attr#
6 /
Enter value for profile_name: PROFILE_0qa98gcnnza7h
HINT
------------------------------------------------------------------------------------------------------------------------------------------------------
IGNORE_OPTIM_EMBEDDED_HINTS
OPTIMIZER_FEATURES_ENABLE('10.2.0.3')
ALL_ROWS
OUTLINE_LEAF(@"SEL$1")
FULL(@"SEL$1" "SKEW"@"SEL$1")
SQL> @sql_hints_awr
SQL> select
2 extractvalue(value(d), '/hint') as outline_hints
3 from
4 xmltable('/*/outline_data/hint'
5 passing (
6 select
7 xmltype(other_xml) as xmlval
8 from
9 dba_hist_sql_plan
10 where
11 sql_id = '&sql_id'
12 and plan_hash_value = &plan_hash_value
13 and other_xml is not null
14 )
15 ) d;
Enter value for sql_id: 0qa98gcnnza7h
Enter value for plan_hash_value: 568322376
OUTLINE_HINTS
-----------------------------------------------------------------------------------------------------------------------------------------------------------
IGNORE_OPTIM_EMBEDDED_HINTS
OPTIMIZER_FEATURES_ENABLE('10.2.0.3')
ALL_ROWS
OUTLINE_LEAF(@"SEL$1")
FULL(@"SEL$1" "SKEW"@"SEL$1")
A couple of additional points:
- Outlines and SQL Profiles both take the same approach to controlling execution plans. They both attempt to force the optimizer down a certain path by applying hints behind the scenes. This is, in my opinion, an almost impossible task. The more complex the statement, the more difficult the task becomes. The newest kid on the block in this area (in 11g) is called a Baseline, and while it doesn’t abandon the hinting approach altogether, it does at least store the plan_hash_value, so it can tell whether it regenerated the correct plan or not (see the sketch after this list).
- It does not appear that Outlines are being actively pursued by Oracle development anymore. So while they still work in 11g, they are becoming a little less reliable (and they were a bit quirky to begin with).
- SQL Profiles have the ability to replace literals with bind variables, similar to the cursor_sharing parameter. This means you can have a SQL Profile that will match multiple statements which use literals, without having to set cursor_sharing for the whole instance.
- Outlines take precedence over SQL Profiles. You can create both on the same statement, and if you do, the Outline will be used and the SQL Profile will be ignored. This is true in 11g as well, by the way.
- Outlines don’t appear to use the OPT_ESTIMATE hint. So I believe it is still a valid approach to accept a SQL Profile as offered by the SQL Tuning Advisor and then create an Outline on top of it. It seems to work pretty well most of the time. (Be sure to check the hints and the plan that gets produced.)
- Manually created SQL Profiles also don’t appear to use the OPT_ESTIMATE hint. So I believe it is also a valid approach to accept a SQL Profile as offered by the SQL Tuning Advisor and then create a SQL Profile on top of it. Note that you’ll have to use a different category, then drop the original, then enable the new SQL Profile. Which means this approach is a bit more work than just creating an Outline in the DEFAULT category.
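Since Baselines came up in the first point, here is a minimal sketch (assuming 11g, and reusing the sql_id and plan_hash_value from the earlier example) of loading a known-good plan from the cursor cache as an accepted SQL plan baseline:

set serveroutput on
declare
  n_loaded pls_integer;
begin
  -- load the plan identified by sql_id + plan_hash_value from the cursor
  -- cache; it becomes an accepted plan in the statement's baseline
  n_loaded := dbms_spm.load_plans_from_cursor_cache(
                sql_id          => '0qa98gcnnza7h',
                plan_hash_value => 568322376);
  dbms_output.put_line(n_loaded || ' plan(s) loaded');
end;
/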
Have a look at the difference between the hints generated by the SQL Tuning Advisor and those created by a manual SQL Profile or an Outline (note that I have tried to change the object names to protect the innocent, and in so doing may have made it slightly more difficult to follow):
SQL>
SQL> select * from table(dbms_xplan.display_cursor('&sql_id','&child_no',''));
Enter value for sql_id: fknfhx8wth51q
Enter value for child_no: 1
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID fknfhx8wth51q, child number 1
-------------------------------------
SELECT /* test4 */ col1, col2, col3 ...
Plan hash value: 3163842146
----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 1778 (100)| |
| 1 | NESTED LOOPS | | 1039 | 96627 | 1778 (1)| 00:00:33 |
| 2 | NESTED LOOPS | | 916 | 57708 | 1778 (1)| 00:00:33 |
|* 3 | TABLE ACCESS BY INDEX ROWID| TABLE_XXXX_LOOKUP | 446 | 17840 | 891 (1)| 00:00:17 |
|* 4 | INDEX RANGE SCAN | INDEX_XXXX_IS_CPCI | 12028 | | 18 (0)| 00:00:01 |
| 5 | TABLE ACCESS BY INDEX ROWID| TABLE_XXXX_IDENT | 2 | 46 | 2 (0)| 00:00:01 |
|* 6 | INDEX RANGE SCAN | INDEX_XXXXIP_17_FK | 2 | | 1 (0)| 00:00:01 |
|* 7 | INDEX UNIQUE SCAN | PK_TABLE_XXXX_ASSIGNMENT | 1 | 30 | 0 (0)| |
----------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter((
...
4 - access("L"."COL1"=:N1)
6 - access("L"."COL2"="I"."COL1")
Note
-----
- SQL profile SYS_SQLPROF_012061f471d50001 used for this statement
85 rows selected.
SQL> @sql_profile_hints
Enter value for name: SYS_SQLPROF_012061f471d50001
HINT
------------------------------------------------------------------------------------------------------------------------------------------------------
OPT_ESTIMATE(@"SEL$1", TABLE, "L"@"SEL$1", SCALE_ROWS=0.0536172171)
OPT_ESTIMATE(@"SEL$1", INDEX_SKIP_SCAN, "A"@"SEL$1", PK_TABLE_XXXX_ASSIGNMENT, SCALE_ROWS=4)
COLUMN_STATS("APP_OWNER"."TABLE_XXXX_ASSIGNMENT", "COL1", scale, length=6 distinct=1234 nulls=0 min=1000000014 max=1026369632)
COLUMN_STATS("APP_OWNER"."TABLE_XXXX_ASSIGNMENT", "COL2", scale, length=12 distinct=2 nulls=0)
COLUMN_STATS("APP_OWNER"."TABLE_XXXX_ASSIGNMENT", "COL3", scale, length=12 distinct=2 nulls=0)
TABLE_STATS("APP_OWNER"."TABLE_XXXX_ASSIGNMENT", scale, blocks=5 rows=2400)
OPTIMIZER_FEATURES_ENABLE(default)
7 rows selected.
SQL> -- no direct hints - only stats and scaling on the profile created by the SQL Tuning Advisor
SQL> -- (i.e. the dreaded OPT_ESTIMATE hints and no directive type hints like INDEX or USE_NL)
SQL>
SQL> -- now let's try an outline on top of it
SQL> @create_outline
Session altered.
Enter value for sql_id: fknfhx8wth51q
Enter value for child_number: 1
Enter value for outline_name: KSOTEST1
Outline KSOTEST1 created.
PL/SQL procedure successfully completed.
SQL> @outline_hints
Enter value for name: KSOTEST1
HINT
------------------------------------------------------------------------------------------------------------------------------------------------------
USE_NL(@"SEL$1" "A"@"SEL$1")
USE_NL(@"SEL$1" "I"@"SEL$1")
LEADING(@"SEL$1" "L"@"SEL$1" "I"@"SEL$1" "A"@"SEL$1")
INDEX(@"SEL$1" "A"@"SEL$1" ("TABLE_XXXX_ASSIGNMENT"."COL1" "TABLE_XXXX_ASSIGNMENT"."COL2" "TABLE_XXXX_ASSIGNMENT"."COL3"))
INDEX_RS_ASC(@"SEL$1" "I"@"SEL$1" ("TABLE_XXXX_IDENT"."COL1"))
INDEX_RS_ASC(@"SEL$1" "L"@"SEL$1" ("TABLE_XXXX_LOOKUP"."COL1" "TABLE_XXXX_LOOKUP"."COL2"))
OUTLINE_LEAF(@"SEL$1")
ALL_ROWS
DB_VERSION('11.1.0.7')
OPTIMIZER_FEATURES_ENABLE('11.1.0.7')
IGNORE_OPTIM_EMBEDDED_HINTS
11 rows selected.
SQL> -- no OPT_ESTIMATE hints on the outline
SQL> -- directive type hints - INDEX, USE_NL, etc...
SQL>
SQL> -- now let's try creating a manual profile
SQL> @create_sql_profile.sql
Enter value for sql_id: fknfhx8wth51q
Enter value for child_no: 1
Enter value for category: TEST
Enter value for force_matching:
PL/SQL procedure successfully completed.
SQL> @sql_profile_hints
Enter value for name: PROFILE_fknfhx8wth51q
HINT
------------------------------------------------------------------------------------------------------------------------------------------------------
IGNORE_OPTIM_EMBEDDED_HINTS
OPTIMIZER_FEATURES_ENABLE('11.1.0.7')
DB_VERSION('11.1.0.7')
ALL_ROWS
OUTLINE_LEAF(@"SEL$1")
INDEX_RS_ASC(@"SEL$1" "L"@"SEL$1" ("TABLE_XXXX_LOOKUP"."COL1" "TABLE_XXXX_LOOKUP"."COL2"))
INDEX_RS_ASC(@"SEL$1" "I"@"SEL$1" ("TABLE_XXXX_IDENT"."COL1"))
INDEX(@"SEL$1" "A"@"SEL$1" ("TABLE_XXXX_ASSIGNMENT"."COL1" "TABLE_XXXX_ASSIGNMENT"."COL2" "TABLE_XXXX_ASSIGNMENT"."COL3"))
LEADING(@"SEL$1" "L"@"SEL$1" "I"@"SEL$1" "A"@"SEL$1")
USE_NL(@"SEL$1" "I"@"SEL$1")
USE_NL(@"SEL$1" "A"@"SEL$1")
11 rows selected.
SQL> -- no OPT_ESTIMATE with the SQL Profile we created manually !
SQL> -- again it's directive - USE_NL, INDEX, LEADING, etc...
SQL>
So I apologize to all you SQL Profiles out there who have been lumped together by my prejudiced view, just because of the acts of a few of your brethren (i.e. the ones created by the SQL Tuning Advisor). SQL Profiles do indeed have all the capabilities of Outlines, and they are probably a better choice than Outlines in most cases.
SQL Plan Management
Introduction
SQL plan management (SPM) ensures that runtime performance will not degrade due to execution plan changes. To guarantee this, only accepted execution plans are used; any plan evolution that does occur is tracked and evaluated at a later point in time, and is only accepted if the new plan shows a noticeable improvement in runtime.
SQL Plan Management has three main components:
- Plan Capture: creation of SQL plan baselines that store accepted execution plans for all relevant SQL statements. SQL plan baselines are stored in the SQL management base in the SYSAUX tablespace.
- Plan Selection: ensures only accepted execution plans are used for statements with a SQL plan baseline, and records any new execution plans found for a statement as unaccepted plans in the SQL plan baseline.
- Plan Evolution: evaluates all unaccepted execution plans for a given statement, with only plans that show a performance improvement becoming accepted plans in the SQL plan baseline (see the sketch after this list).
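As a rough illustration of the evolution step, the 11g API looks like this (the sql_handle below is made up; you would take it from dba_sql_plan_baselines):

set serveroutput on
declare
  cl_report clob;
begin
  -- test-execute any unaccepted plans for the statement and accept those
  -- that show a noticeable performance improvement
  cl_report := dbms_spm.evolve_sql_plan_baseline(
                 sql_handle => 'SQL_abc123def4567890',  -- hypothetical handle
                 verify     => 'YES',
                 commit     => 'YES');
  dbms_output.put_line(dbms_lob.substr(cl_report, 4000, 1));
end;
/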
SQL Management Base
The SQL management base (SMB) is a logical repository in the data dictionary, physically located in the SYSAUX tablespace. In the context of SPM, it stores the following structures (a query to inspect them follows the list):
- SQL Plan History: the set of SQL execution plans generated for SQL statements over time. The history contains both SQL plan baselines and unaccepted plans.
- SQL Plan Baselines: a SQL plan baseline is an accepted plan that the optimizer is allowed to use for a SQL statement. In the typical use case, the database accepts a plan into the plan baseline only after verifying that the plan performs well.
- SQL Statement Log: a series of query signatures used to identify queries that have been executed more than once during automatic plan capture (see below).
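A quick way to see what the SMB currently holds (a minimal sketch against the standard dictionary view):

-- list the plan history stored in the SMB; ACCEPTED = 'YES' marks plans
-- that are part of the baseline and usable by the optimizer
select sql_handle, plan_name, origin, enabled, accepted, fixed
  from dba_sql_plan_baselines
 order by sql_handle, plan_name;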
Plan Capture
For SPM to become active, the SQL management base must be seeded with a set of acceptable execution plans, which will become the SQL plan baseline for the corresponding SQL statements. There are two different ways to populate the SQL management base: automatically or manually.
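For instance (a sketch of both approaches; the SQL Tuning Set name is hypothetical):

-- automatic capture: record plans for statements executed more than once
alter system set optimizer_capture_sql_plan_baselines = true;

-- manual capture: bulk-load plans, e.g. from an existing SQL Tuning Set
declare
  n_loaded pls_integer;
begin
  n_loaded := dbms_spm.load_plans_from_sqlset(
                sqlset_name => 'MY_STS');  -- hypothetical STS name
end;
/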