Saturday, 17 August 2019

Oracle Database Interview Questions


Oracle DBA questions and answers
Important link for interview questions:



http://oracle-dba-help.blogspot.com/search/label/Interview%20Questions

Oracle Database interview questions and answers


===========================To check Execution Plan Change=====================
Note --

We are using Oracle 12c. We have a nightly job that is pretty complicated but has run fine for years.

Recently it has suddenly started taking two or three hours longer than its normal run time (3 hours).

We did not change any code in the procedures, and the data only grows a little each day, so we are not sure what caused the slowness. The slowness also does not happen every day.

For example, in the past 2 weeks it ran 3 hours on most days, but 5 hours on 2 other days. A couple of times it failed, consuming a lot of temp tablespace and timing out. There was also more disk I/O during that time.

We have extended the temp tablespace to a fairly large size, but the problem still occurs on some days.

We have a monitoring tool, and I noticed that since this started, the job has been using a different execution plan, with some table scans instead of the previously used index scans.

But the tables are not very big, about 8000 rows.

What could be wrong and how should we troubleshoot it?

Note: the query is the same, but is the data the same? Maybe something else is heavily using the database at certain times. Maybe you're using network storage that is slow at certain times. Maybe the data happened to be cached in memory at certain times.


Reason
---------
1) Stale statistics on the batch-job related tables (see the staleness check after this list)
2) The volume of data changed tremendously, but the statistics did not change at all
  1. Data distribution / volume changed, crossing an optimizer search tree decision tipping point
  2. Indexes / files got fragmented
  3. Statistics have been updated / added / dropped or have become stale and misleading due to data changes
  4. Windows memory utilization changed
  5. Transaction logs are full and not truncating, causing repeated physical file expansion
  6. Schema changed - index / indexed view / column / constraint added, modified or dropped, data type changed, etc.
  7. Trace flag settings changed
  8. Database or server setting changed
  9. Server CU level changed
  10. Client application session settings changed
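
As a quick check for reasons 1 and 2, you can ask the data dictionary whether statistics on the job's tables are stale or missing. A minimal sketch; &schema_owner is a placeholder for the batch schema:

col table_name for a30
select table_name, num_rows, last_analyzed, stale_stats
from dba_tab_statistics
where owner = '&schema_owner'
and (stale_stats = 'YES' or last_analyzed is null);

With default settings, STALE_STATS = 'YES' means roughly 10% or more of the rows have changed since the last gather, which is exactly the situation described above.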


1) Check the temp and undo tablespaces, the statistics of the related tables, and the wait events
2) Generate an AWR report for the time window you need to investigate and compare it with an AWR report from a good period
3) Search the "SQL ordered by Elapsed Time" section. Look for operations with either a long "Elapsed Time per Exec" or a SQL with many executions. Basically, we look at the SQL Statistics section of the AWR report
4) Use the SQL id to obtain the execution plan for the relevant SQL.
  SELECT * FROM TABLE(DBMS_XPLAN.display_cursor('sql_id', NULL, 'all +peeked_binds +outline'));

Look for "TABLE ACCESS FULL" or "INDEX FULL SCAN", which can indicate poorly performing SQL. Check for an option to choose a different execution plan that saves a significant amount of time.
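
If the cursor has already aged out of the shared pool, DISPLAY_CURSOR returns nothing; with the Diagnostics Pack license you can pull the historical plans for the same SQL id out of AWR instead. A sketch, with &sql_id as a placeholder:

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_AWR('&sql_id', NULL, NULL, 'ALL'));

This prints every plan captured for that SQL id, so the old (good) and new (bad) plans can be compared side by side.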

2) If you have the licenses, run the SQL Tuning Advisor on the query. Check whether it suggests a profile, or whether you might need to create a SQL plan baseline.
3) Query tuning starts with getting the actual plan in use. Once you have both plans you can compare them and look for the cause of the change.
If you have the Diagnostics/Tuning Pack licenses, then you can query the AWR for the plans of the SQL that is switching plans.
But if the SQL is part of a batch process, it is probably not in the shared pool and has to be parsed each night on first execution.
Then, on that run, a bind variable peek may take place, so depending on the values peeked, the plan is set for the query.
The plan might be good for the specific values peeked but not for the majority of values to be processed.
I have seen this happen numerous times. Once you know the problem you can determine which technique or feature you want to use to avoid the issue:
a SQL Profile, hinting the SQL, or firing the query with a representative set of values just prior to running the program that submits the problem SQL.
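
For example, if the old plan is still in the cursor cache (or you know its plan hash value), one way to pin it is to load it as an accepted SQL plan baseline. A minimal sketch; &sql_id and &good_plan_hash_value are placeholders:

set serveroutput on
DECLARE
  n PLS_INTEGER;
BEGIN
  -- load the known-good plan from the cursor cache as an accepted baseline
  n := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(
         sql_id          => '&sql_id',
         plan_hash_value => &good_plan_hash_value);
  DBMS_OUTPUT.PUT_LINE(n || ' plan(s) loaded');
END;
/

From then on the optimizer will prefer the accepted baseline plan over a newly generated one.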

Fetch Bad Execution Plan Details or to check execution plan change/explain plan change

/* Find the bad SQL */
SELECT * FROM V$SQL WHERE SQL_TEXT LIKE '%T_IDX_ID%';
/* Get the SQL_ID and PLAN_HASH_VALUE */

SELECT SQL_ID,PLAN_HASH_VALUE,SQL_TEXT FROM V$SQL WHERE SQL_ID='&SQL_ID';


SELECT SQL_ID,PLAN_HASH_VALUE FROM V$SQL WHERE SQL_ID='&SQL_ID';

Enter value for sql_id: 4sw8r4uqf0znt
old   1: SELECT SQL_ID,PLAN_HASH_VALUE FROM V$SQL WHERE SQL_ID='&SQL_ID'
new   1: SELECT SQL_ID,PLAN_HASH_VALUE FROM V$SQL WHERE SQL_ID='4sw8r4uqf0znt'

SQL_ID        PLAN_HASH_VALUE
------------- ---------------
4sw8r4uqf0znt      3670979307
4sw8r4uqf0znt      1401136819


You may use the query below to identify plan changes:

-- should also run unstable_plan.sql
Prompt
prompt +------------------------------------+
prompt |Might be Indication of bind peeking |
Prompt +------------------------------------+
prompt
set lines 155
col execs for 999,999,999
col avg_etime for 999,999.999
col avg_lio for 999,999,999.9
col begin_interval_time for a30
col node for 99999
break on plan_hash_value on startup_time skip 1
select ss.snap_id, ss.instance_number node, begin_interval_time, sql_id, plan_hash_value,
nvl(executions_delta,0) execs,
(elapsed_time_delta/decode(nvl(executions_delta,0),0,1,executions_delta))/1000000 avg_etime,
(buffer_gets_delta/decode(nvl(executions_delta,0),0,1,executions_delta)) avg_lio
from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS
where sql_id = nvl('&sql_id','4dqs2k5tynk61')
and ss.snap_id = S.snap_id
and ss.instance_number = S.instance_number
and executions_delta > 0
order by 1, 2, 3
/

Prompt
prompt +------------------------------------+
prompt |         AWR Plan Stats             |
Prompt +------------------------------------+
prompt
-- Note that I have modified this script slightly to include snaps with 0 executions.
-- This is to account for situations with very long running statements (that generally
-- cross snapshot boundaries). In these situations, the executions_delta is incremented 
-- in the snapshot when the statement begins. There will be 0 executions_delta in
-- subsequent snapshots, but the time and lio's should still be considered.
set lines 155
col execs for 999,999,999
col etime for 999,999,999.9
col avg_etime for 999,999.999
col avg_cpu_time for 999,999.999
col avg_lio for 999,999,999.9
col avg_pio for 999,999,999.9
col begin_interval_time for a30
col node for 99999
break on plan_hash_value on startup_time skip 1
select sql_id, plan_hash_value, 
sum(execs) execs, 
-- sum(etime) etime, 
sum(etime)/sum(execs) avg_etime, 
sum(cpu_time)/sum(execs) avg_cpu_time,
sum(lio)/sum(execs) avg_lio, 
sum(pio)/sum(execs) avg_pio
from (
select ss.snap_id, ss.instance_number node, begin_interval_time, sql_id, plan_hash_value,
nvl(executions_delta,0) execs,
elapsed_time_delta/1000000 etime,
(elapsed_time_delta/decode(nvl(executions_delta,0),0,1,executions_delta))/1000000 avg_etime,
buffer_gets_delta lio,
disk_reads_delta pio,
cpu_time_delta/1000000 cpu_time,
(buffer_gets_delta/decode(nvl(executions_delta,0),0,1,executions_delta)) avg_lio,
(cpu_time_delta/decode(nvl(executions_delta,0),0,1,executions_delta)) avg_cpu_time
from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS
where sql_id = nvl('&sql_id','4dqs2k5tynk61')
and ss.snap_id = S.snap_id
and ss.instance_number = S.instance_number 
-- and executions_delta > 0
)
group by sql_id, plan_hash_value
order by 5
/
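
To see when each plan first and last appeared in AWR (and therefore roughly when the switch happened), you can also query DBA_HIST_SQL_PLAN directly. A sketch, again with &sql_id as a placeholder:

select sql_id, plan_hash_value, min(timestamp) first_seen, max(timestamp) last_seen
from dba_hist_sql_plan
where sql_id = '&sql_id'
group by sql_id, plan_hash_value
order by 3;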

I’d start listing a few and see how many other (realistic) reasons other people wanted to add.

  1. A change in execution plan due to unlucky bind variable peeking with histograms
  2. A change in execution plan because the statistics have been slowly going out of sync with the data/query
    1. Special case: highly volatile tables, where the data/query/stats can go badly out of sync very quickly
    2. Special case: dynamic sampling used to bypass volatility problems – but sometimes the sample is an “unlucky” one.
    3. Special case: you had been using dynamic sampling, but someone collected stats on the objects that messed things up completely
    4. Special case: 12c has decided to add some SQL Directives that mean you do dynamic sampling even when there are stats on the table – the sampling takes more time and the resulting stats make things worse.
  3. A change in execution plan because new (and unlucky) statistics have been generated since the last time the query ran.
  4. A change in execution plan because a few new, empty, partitions have been added to a critical table
  5. An unlucky change in execution plan because a small change in actual data volume (with correct stats in place) can have a dramatic impact on the shape of the plan
    1. Variant – the plan stays the same, but the execution mechanism changes because a table has grown and pushes Oracle into direct path reads
    2. Variant – as 5.1, but the change in mechanism is not due to the size of the table, but the current degree of caching of the table (see also 10)
  6. A change in execution plan because you had an incomplete set of hints in your SQL and your luck just ran out
  7. An unlucky change in data order that has a big impact on the success of subquery caching
    1. Special case – highly volatile tables being used as the target of existence subqueries (or “first_row” optimisation)
      1. Can be due to the actual placement of the “useful” data in the table compared to the optimizer’s “average” model.
      2. Can be due to stats and data not matching very well
  8. A small slippage in timing that results in the SQL suffering a direct data collision (locking / read-consistency) with some other process.
    1. Special case – a small slowdown resulting in the process being restarted, possibly many times, and colliding with copies of itself
  9. A small slippage in timing that leaves the SQL running concurrently with something that is processing unrelated data but competing for the raw (disc / CPU / PX Slave) resources.
    1. Interesting example – competition for the queue on an HBA, especially when the queue depth is small
    2. Special variation – timing problems due to month-end/year-end increasing volume to be processed and adding extra jobs
    3. Common variant – even without a time slip – you may simply be competing with some other system that’s sharing the same physical resources
    4. As for 9.3 – but someone has recently added a new application or whole new VM on the storage that you were using.
    5. Special case – other processes were holding a lot of PGA memory so your big hash join ran with a small workarea and spilled to disc catastrophically
  10. A small slippage in timing that means some other process cleared everything you wanted from the buffer cache before your SQL started running.
    1. RAC variant – the data is cached in another node because e.g. some other process ran on a different node
  11. A small slippage in timing that means a process that normally follows your SQL preceded it, and you’re spending all your time on delayed block cleanout.
  12. Someone dropped some indexes (or made them invalid)
  13. Someone rebuilt one or more indexes since the last time the SQL executed
  14. Someone has done a “shrink space” on a critical table since the last time you ran the SQL
    1. this could result in a change in execution plan due to a change in index clustering factors
    2. it could result in a change in physical I/O requirements because of a change in data placement – even with the same execution plan
    3. it could reduce the benefit of subquery-caching because of a change in data access order – even with the same execution plan
  15. A space management, or similar, problem suspended processing for a while – e.g. archiver stuck, temp space full for resumable session.
  16. You’re operating over a database link, and the problem is either in the connection, or at the far end of the link
    1. The problem may also be about execution plan changes that have a more extreme side effect because the plan includes a remote database
  17. Problems with SQL Baselines or profiles (profiles going out of date, or a new Baseline being accepted – perhaps by the 12c automatic job – that blocks a previously stable plan)
  18. It’s not the SQL, it’s whatever is calling the SQL (surrounding pl/sql, network to client, client data handling)
    1. Special case: Introduction, or modification, of firewall setup
    2. Special case: hardware problem increasing rate of packet retransmission
  19. The job was running in a resource group and a change in workload of other processes made the resource manager de-schedule this process even though its workload hadn’t changed
  20. The SQL hasn’t changed plan, or volume of data accessed, but it calls a pl/sql function that executes some SQL that has changed its behaviour for some reason.
or

Different Query Plans

Over time, your query plan may also change because statistics have changed (or not changed depending on the statistic in question). Normally, that indicates that Oracle has found a more efficient plan or that your data volumes have changed and Oracle expects a different plan would be more efficient with the new data volume. If, however, you are giving Oracle bad statistics (if, for example, you have tables that get much larger during month-end processing but you gather statistics when the tables are almost empty), you may induce Oracle to choose a very bad query plan. Depending on the version of Oracle, there are various ways to force Oracle to use the same query plan. If you can drill down and figure out what the problem with statistics is, Oracle probably provides a way to give the optimizer better statistics.

If you take a look at AWR/ASH data (if you have the appropriate licenses) or Statspack data (if your DBA has installed that), you should be able to figure out which camp your problems originate in. Are you getting different query plans for different executions? (You may need to capture a query plan from your initial benchmarks and compare it to the current plan, or you may need to increase your AWR retention to retain query plans for a few months, in order to see this.) Are you doing the same number of buffer gets over time but getting vastly different amounts of I/O waits? Do you see a lot of contention for resources from other sessions? If so, that probably indicates that the issue is different load at different times.
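
For the month-end scenario described above, one hedged approach is to gather statistics once while the tables are at a representative volume and then lock them, so a scheduled gather job cannot replace them with "empty table" statistics. A sketch; &owner and &table_name are placeholders:

exec DBMS_STATS.GATHER_TABLE_STATS(ownname => '&owner', tabname => '&table_name', cascade => TRUE);
exec DBMS_STATS.LOCK_TABLE_STATS(ownname => '&owner', tabname => '&table_name');

Locked statistics are skipped by the automatic statistics job until you call DBMS_STATS.UNLOCK_TABLE_STATS.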


===============================
Oracle 19c Grid/clusterware patching commands

srvctl relocate database -d esr1pcdb -node mprl224 -verbose

srvctl relocate database -d esr1pcdb -node mprl223 -verbose
srvctl relocate database -d esr2pcdb -node mprl224 -verbose



/oracle/esrgrid/19.0.0/crs/install/rootcrs.sh -prepatch

--cd /oracle/software/jdk/
FROM GRID HOME
opatch rollback -id 33197296

FROM RDBMS HOME
opatch rollback -id 33197296 -- rdbms
opatch rollback -id 33197296 -- rdbms (2nd home)

/oracle/esrgrid/19.0.0/rdbms/install/rootadd_rdbms.sh

/oracle/esrgrid/19.0.0/crs/install/rootcrs.sh -postpatch

----------------------------------------------------------------------
/oracle/esrgrid/19.0.0/OPatch/opatchauto apply /oracle/software/JUL_PSU_GRID/32895426 -analyze

nohup /oracle/esrgrid/19.0.0/OPatch/opatchauto apply /oracle/software/JUL_PSU_GRID/32895426 &
----------------------------------------------------------------------

/oracle/esrgrid/19.0.0/crs/install/rootcrs.sh -prepatch

FROM GRID
cd /oracle/software/OCT_JDK/33197296
opatch apply

FROM RDBMS HOME
cd /oracle/software/OCT_JDK/33197296
opatch apply
opatch apply (2nd home)

From Root
/oracle/esrgrid/19.0.0/crs/rdbms/install/rootadd_rdbms.sh
/oracle/esrgrid/19.0.0/crs/install/rootcrs.sh -postpatch
----------------------------------------------------------------------------------------------
/oracle/esrgrid/19.0.0/OPatch/opatchauto apply /oracle/software/JUL_PSU_GRID/32895426 -analyze
/oracle/esr1pcdb/19.0.0/OPatch/opatchauto apply /oracle/software/JUL_PSU_GRID/32895426 -oh /oracle/esr1pcdb/19.0.0 -analyze
/oracle/esr2pcdb/19.0.0/OPatch/opatchauto apply /oracle/software/JUL_PSU_GRID/32895426 -oh /oracle/esr2pcdb/19.0.0 -analyze



nohup /oracle/esrgrid/19.0.0/OPatch/opatchauto apply /oracle/software/JUL_PSU_GRID/32895426 &

nohup /oracle/esr1pcdb/19.0.0/OPatch/opatchauto apply /oracle/software/JUL_PSU_GRID/32895426 -oh /oracle/esr1pcdb/19.0.0 &
nohup /oracle/esr2pcdb/19.0.0/OPatch/opatchauto apply /oracle/software/JUL_PSU_GRID/32895426 -oh /oracle/esr2pcdb/19.0.0 &
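
After the apply (or rollback) completes, one way to confirm what actually landed in each database is to query the SQL patch registry (this complements 'opatch lspatches' run from each home, which lists the binary patch level). A sketch:

set lines 200
col description for a60
select patch_id, action, status, action_time, description
from dba_registry_sqlpatch
order by action_time;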

Converting a physical standby database into snapshot standby database

1) To check the archive log gap


SELECT ARCH.THREAD# "Thread", ARCH.SEQUENCE# "Last Sequence Received", APPL.SEQUENCE# "Last Sequence Applied", (ARCH.SEQUENCE# - APPL.SEQUENCE#) "Difference" FROM
(SELECT THREAD# ,SEQUENCE# FROM V$ARCHIVED_LOG WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$ARCHIVED_LOG GROUP BY THREAD#)) ARCH,
(SELECT THREAD# ,SEQUENCE# FROM V$LOG_HISTORY WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$LOG_HISTORY GROUP BY THREAD#)) APPL
WHERE ARCH.THREAD# = APPL.THREAD#;
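
On the standby, V$ARCHIVE_GAP gives a more direct answer when a gap actually exists (it returns no rows when there is none):

select thread#, low_sequence#, high_sequence# from v$archive_gap;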


select name,open_mode,database_role,controlfile_type from v$database;

select flashback_on from v$database;

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

alter database flashback on;

select flashback_on from v$database;

show parameter recovery
select name from v$recovery_file_dest;
show parameter db_recovery_file_dest_size


Check the db_recovery_file_dest and db_recovery_file_dest_size parameters.



ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;

select name,open_mode,database_role,controlfile_type from v$database;

select NAME,GUARANTEE_FLASHBACK_DATABASE from v$restore_point;



shutdown immediate
startup


select name,open_mode,database_role,controlfile_type from v$database;


Converting a snapshot standby database into a physical standby

Revert back
--------------
select name,open_mode,database_role,controlfile_type from v$database;
shutdown immediate
startup mount

select name,open_mode,database_role,controlfile_type from v$database;

alter database convert to physical standby;

SQL> alter database convert to physical standby;

Database altered.

SQL> select name,open_mode,database_role,controlfile_type from v$database;
select name,open_mode,database_role,controlfile_type from v$database
                                                          *
ERROR at line 1:
ORA-01507: database not mounted


SQL> shutdown immediate;
ORA-01507: database not mounted


ORACLE instance shut down.


select name,open_mode,database_role,controlfile_type from v$database;


shutdown immediate;
startup mount
select name,open_mode,database_role,controlfile_type from v$database;

select flashback_on from v$database;
alter database flashback off;

select flashback_on from v$database;

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
select name,open_mode,database_role,controlfile_type from v$database;
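
Once redo apply is running again, a quick way to confirm the standby has caught up is to check the transport and apply lag (run on the standby):

col name for a25
col value for a20
select name, value from v$dataguard_stats where name in ('transport lag', 'apply lag');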

==============Standby Database Switchover and Switchback=============

Overview

Check dba_jobs in the database and stop those jobs on the D.C. servers
Ensure a backup of the ODSMASTER.SNP_USER table has been taken for ODIDB01
Verify that each database is properly configured for its role
Verify that there are no redo transport errors or redo gaps at the standby database
Verify that the target of your switchover is synchronized
Check that redo is being received and applied at the D.R. site
Ensure temporary files exist on the standby database that match the temporary files on the primary database (see the query after this overview)
Ensure Flashback is enabled at both the D.C. and D.R. sites
Initiate the switchover on the primary database
Shut down and then mount the former primary database
Verify that the switchover target is ready to be switched to the primary role
Switch the target physical standby database role to the primary role
Open the new primary database
Start Redo Apply on the new physical standby database
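
For the temp file check in the list above, the same query can be run on the primary and on the standby and the output compared (a minimal sketch):

select ts.name tablespace_name, tf.name tempfile_name, tf.bytes/1024/1024 mb
from v$tempfile tf, v$tablespace ts
where tf.ts# = ts.ts#
order by 1;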



==============
Overview steps:

Step:1 Check the db name, open mode, database role of the primary and standby databases.

Step:2 Select the switchover status on the primary & standby db (expected values are shown after these steps).

Step:3 Check that no active users are connected to the databases.

Step:4 Switch the current online redo log file on primary database and verify that it has been applied in the standby database.

Step:5 Connect with primary database and initiate the switchover.

Step:6 Bounce the primary db and check the switchover status.

Step:7 Then convert the physical standby into the primary db. (Stop the MRP process.)

Step:8 Open the new standby db in read only mode.

Step:9 Apply the redo log files in the newly created standby (start the MRP process). Check whether the logs are being applied in the new standby db.
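
For Step 2, the switchover status can be read from v$database. Typical "ready" values are TO STANDBY (or SESSIONS ACTIVE, which just means user sessions are still connected) on the primary, and TO PRIMARY / SESSIONS ACTIVE on the standby after the primary has switched:

select db_unique_name, database_role, switchover_status from v$database;

If the status shows SESSIONS ACTIVE, the switchover command can be issued WITH SESSION SHUTDOWN, as done later in this note.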

Step:1 Check the db name, open mode, database role of the primary and standby databases.

Precheck
---------------

SELECT PROTECTION_MODE,CONTROLFILE_TYPE,REMOTE_ARCHIVE,PROTECTION_LEVEL,DATABASE_ROLE ROLE, SWITCHOVER_STATUS,DATAGUARD_BROKER,GUARD_STATUS FROM V$DATABASE;


set lines 2000
set pages 2000
col owner for a13
col username for a15
col host for a50
col DB_LINK for a30
col CREATED for a30

select * from dba_db_links;

select owner,DB_LINK from dba_db_links;

set lines 300 pages 1000
col OWNER for a30
col JOB_NAME for a30
SELECT OWNER, JOB_NAME, START_DATE,NEXT_RUN_DATE,STATE FROM DBA_SCHEDULER_JOBS WHERE ENABLED='TRUE' AND OWNER <> 'SYS';

col JOB_ACTION for a20
 col START_DATE for a20
 col LAST_START_DATE for a20
 col LAST_RUN_DURATION for a20
 col NEXT_RUN_DATE for a30
select JOB_NAME,JOB_ACTION,REPEAT_INTERVAL,LAST_START_DATE,LAST_RUN_DURATION,NEXT_RUN_DATE from dba_scheduler_jobs where owner='SYS' and JOB_NAME like '%STAT%';


SELECT db_unique_name, open_mode, database_role FROM v$database;


SELECT PROCESS FROM V$MANAGED_STANDBY WHERE PROCESS LIKE 'MRP%';
SELECT  PROCESS, STATUS,SEQUENCE#,BLOCK#,BLOCKS, DELAY_MINS FROM V$MANAGED_STANDBY;

column destination format a35 wrap
column process format a7
column archiver format a8
column ID format 99
column mid format 99
select dest_id "ID",destination,status,target,  schedule,process,mountid  mid from v$archive_dest order by dest_id;


SELECT RECOVERY_MODE FROM V$ARCHIVE_DEST_STATUS WHERE DEST_ID=2;

select * from V$DATAGUARD_CONFIG;

SELECT FS_FAILOVER_STATUS "FSFO STATUS",FS_FAILOVER_CURRENT_TARGET TARGET,FS_FAILOVER_THRESHOLD THRESHOLD,FS_FAILOVER_OBSERVER_PRESENT "OBSERVER PRESENT" FROM  V$DATABASE;


SELECT PROTECTION_MODE,CONTROLFILE_TYPE,REMOTE_ARCHIVE,PROTECTION_LEVEL,DATABASE_ROLE ROLE, SWITCHOVER_STATUS,DATAGUARD_BROKER,GUARD_STATUS FROM V$DATABASE;  

SELECT MESSAGE FROM V$DATAGUARD_STATUS;

SELECT fs_failover_observer_present, fs_failover_observer_host FROM v$database;

set lines 100 pages 999
col member format a70
select st.group#,st.sequence#,ceil(st.bytes / 1048576) mb,lf.member
from v$standby_log st,v$logfile lf
where st.group# = lf.group#
/

ALTER SESSION SET nls_date_format='DD-MON-YYYY HH24:MI:SS';
SELECT sequence#, first_time, next_time, applied FROM v$archived_log ORDER BY sequence#;

SELECT ARCH.THREAD# "Thread", ARCH.SEQUENCE# "Last Sequence Received", APPL.SEQUENCE# "Last Sequence Applied", 
(ARCH.SEQUENCE# - APPL.SEQUENCE#) "Difference" FROM (SELECT THREAD# ,SEQUENCE# FROM V$ARCHIVED_LOG WHERE 
(THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$ARCHIVED_LOG GROUP BY THREAD#)) ARCH, 
(SELECT THREAD# ,SEQUENCE# FROM V$LOG_HISTORY WHERE (THREAD#,FIRST_TIME ) 
IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$LOG_HISTORY GROUP BY THREAD#)) APPL WHERE ARCH.THREAD# = APPL.THREAD# ORDER BY 1; 

3) show parameter log_archive_config
        show parameter dg
        SELECT MESSAGE FROM V$DATAGUARD_STATUS;

4) select name,flashback_on from v$database;
5) select name from v$tempfile;
6) select value from v$diag_info where name ='Diag Trace';
7) show parameter recovery;
show parameter job
show parameter aq
8) Check crontab and the scheduler; if any stats jobs or RMAN backups are scheduled during the activity window, disable them


Implementation
--------------

Run these commands on the primary database (production) first:
compatible               = "11.2.0"
  log_archive_dest_1       = "LOCATION=/db1/orafra/svfe VALID_FOR=(ALL_LOGFILES,ALL_ROLES)  DB_UNIQUE_NAME=svfe"
  log_archive_dest_2       = "SERVICE=svfedr LGWR ASYNC NOAFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=svfedr"
  log_archive_dest_state_1 = "ENABLE"


alter system switch logfile;

switchover primary to standby
--------------------------------

select SWITCHOVER_STATUS from V$DATABASE;
select name,database_role,switchover_status from v$database;
!ps -ef|grep rman


SELECT * From dba_jobs_running;


show parameter job
show parameter aq
SQL> 
SQL> alter system set job_queue_processes=0 ;

System altered.

SQL> alter system set aq_tm_processes=0 ;

System altered.

show parameter job
show parameter aq
select name,database_role,switchover_status from v$database;

alter database commit to switchover to physical standby with session shutdown;
shut abort
startup mount;
or
SHUTDOWN IMMEDIATE;
STARTUP NOMOUNT;
ALTER DATABASE MOUNT STANDBY DATABASE;

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;

Archived Log entry 2278737 added for T-1.S-1136245 ID 0x109a137e LAD:1
Completed: ALTER DATABASE   MOUNT
2022-05-07T02:09:27.821104+04:00
ARC7 started with pid=42, OS id=941
Starting background process ARC8
2022-05-07T02:09:27.834239+04:00
ALTER SYSTEM SET ENCRYPTION WALLET OPEN IDENTIFIED BY *
Completed Successfully
   ALTER SYSTEM SET ENCRYPTION WALLET OPEN IDENTIFIED BY *
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION
2022-05-07T02:09:27.861561+04:00

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION


select name, open_mode, db_unique_name, database_role from v$database;


==================================

Convert standby to primary: our primary has already been converted to standby. Now it's time to convert the original standby into the primary.

After completing activity on primary database (production)

convert standby database to primary database

1) SELECT PROTECTION_MODE,CONTROLFILE_TYPE,REMOTE_ARCHIVE,PROTECTION_LEVEL,DATABASE_ROLE ROLE, SWITCHOVER_STATUS,DATAGUARD_BROKER,GUARD_STATUS FROM V$DATABASE;


2) select name,flashback_on from v$database;
3) select name from v$tempfile;
4) select value from v$diag_info where name ='Diag Trace';
5) show parameter recovery;
6) Check crontab and the scheduler; if any stats jobs or RMAN backups are scheduled during the activity window, disable them


1)select SWITCHOVER_STATUS from V$DATABASE;
If required (on older versions), cancel the MRP process first:
SQL> alter database recover managed standby database cancel;

This terminates the current switchover to standby if it never completed fully.
select max(sequence#) from v$archived_log where applied='YES';



2) ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN

3) shut immediate

4) startup
  or
 alter database open

select database_role from v$database;
select name, open_mode, database_role from v$database;



===============================================startup script===========

$ cat startup.sh
export ORACLE_SID=testdbs3

sqlplus / as sysdba <<EOF
startup mount;
alter database recover managed standby database disconnect from session ;
EOF
export ORACLE_SID=testdbs2
sqlplus / as sysdba <<EOF
startup mount;
alter database recover managed standby database disconnect from session ;
EOF
export ORACLE_SID=testdbs1
sqlplus / as sysdba <<EOF
startup mount;
alter database recover managed standby database disconnect from session ;
EOF

lsnrctl start LISTENER_TESTDB_STB

ps -ef|grep pmon
ps -ef|grep tns

Important Point
  • Grid Infrastructure pre-configures and registers all required resources for the products and components available, but only activates them when you choose to add them. As a result, some components may be listed as OFFLINE after the installation of Oracle Grid Infrastructure. Resources listed as TARGET:OFFLINE and STATE:OFFLINE do not need to be monitored. They represent components that are registered but not enabled, so they do not use any system resources. If an Oracle product or component is installed on the system and it requires a particular resource to be online, the software prompts you to activate the required offline resource.


    crsctl stat res -t
    nft2x015[/home/oracrs]$
    nft2x015[/home/oracrs]$ crsctl stat res -t -init
    --------------------------------------------------------------------------------
    NAME           TARGET  STATE        SERVER                   STATE_DETAILS
    --------------------------------------------------------------------------------
    Cluster Resources
    --------------------------------------------------------------------------------
    ora.asm
          1        ONLINE  ONLINE       nft2x015                 Started
    ora.cluster_interconnect.haip
          1        ONLINE  ONLINE       nft2x015
    ora.crf
          1        ONLINE  ONLINE       nft2x015
    ora.crsd
          1        ONLINE  ONLINE       nft2x015
    ora.cssd
          1        ONLINE  ONLINE       nft2x015
    ora.cssdmonitor
          1        ONLINE  ONLINE       nft2x015
    ora.ctssd
          1        ONLINE  ONLINE       nft2x015                 OBSERVER
    ora.diskmon
          1        OFFLINE OFFLINE
    ora.drivers.acfs
          1        ONLINE  ONLINE       nft2x015
    ora.evmd
          1        ONLINE  ONLINE       nft2x015
    ora.gipcd
          1        ONLINE  ONLINE       nft2x015
    ora.gpnpd
          1        ONLINE  ONLINE       nft2x015
    ora.mdnsd
          1        ONLINE  ONLINE       nft2x015


    kasarla01[+ASM2]_grid> crsctl status resource ora.ons
    NAME=ora.ons
    TYPE=ora.ons.type
    TARGET=ONLINE              , ONLINE
    STATE=OFFLINE on kasarla01, OFFLINE on kasarla02

    kasarla01[+ASM2]_grid>

    An Oracle RAC Service is in UNKNOWN state on instance 1 and it is impossible to stop or start it with srvctl.

    crs_stat -u ora.DBName.My-ServiceName.DBName1.srv

    NAME=ora.DBName.My-ServiceName.DBName1.srv
    TYPE=application
    TARGET=ONLINE
    STATE=UNKNOWN on Host1



    Solution:

    The UNKNOWN state can often be resolved by bringing the resource offline using crs_stop.

    If the resource is not in UNKNOWN state: use srvctl to stop it and do not use crs_stop (as mentioned in Oracle Support note 845709.1, there's a risk of corruption of the OCR).

    Make sure that you have a recent backup of your OCR:
    ocrconfig -showbackup

    Restore the instance resource to OFFLINE with this command:
    crs_stop ora.DBName.My-ServiceName.DBName1.srv

    Verify the status is now OFFLINE:
    crs_stat -u ora.DBName.My-ServiceName.DBName1.srv

    NAME=ora.DBName.My-ServiceName.DBName1.srv
    TYPE=application
    TARGET=OFFLINE
    STATE=OFFLINE

    Bring the service online with srvctl:
    srvctl start service -d DBName -i DBName1 -s My-ServiceName

    Verify that the service is ONLINE:
    crs_stat -u ora.DBName.My-ServiceName.DBName1.srv

    NAME=ora.DBName.My-ServiceName.DBName1.srv
    TYPE=application
    TARGET=ONLINE
    STATE=ONLINE on Host1








    Correcting our RAC problem

    We have seen the CRS commands and syntax; now let's look at the practical aspect of it.
    I was having some trouble while running my RAC on RHEL4 (I think the problem primarily arose because I supplied my virtual machines with inadequate memory). When I noticed that I was getting some alerts on my SoRAC tool, I looked at the status of my RAC.

    [oracle@vmora01rh4 ~]$ cd /u01/app/oracle/product/10.2.0/crs/bin
    [oracle@vmora01rh4 bin]$ crs_stat -t
    Name           Type           Target    State     Host
    ------------------------------------------------------------
    ora.fokerac.db application    ONLINE    ONLINE    vmora02rh4
    ora....c1.inst application    ONLINE    ONLINE    vmora01rh4
    ora....c2.inst application    OFFLINE   UNKNOWN   vmora02rh4
    ora....serv.cs application    ONLINE    ONLINE    vmora02rh4
    ora....ac1.srv application    ONLINE    ONLINE    vmora01rh4
    ora....ac2.srv application    ONLINE    OFFLINE
    ora....SM1.asm application    ONLINE    ONLINE    vmora01rh4
    ora....H4.lsnr application    ONLINE    ONLINE    vmora01rh4
    ora....rh4.gsd application    ONLINE    UNKNOWN   vmora01rh4
    ora....rh4.ons application    ONLINE    UNKNOWN   vmora01rh4
    ora....rh4.vip application    ONLINE    ONLINE    vmora01rh4
    ora....SM2.asm application    OFFLINE   UNKNOWN   vmora02rh4
    ora....H4.lsnr application    OFFLINE   UNKNOWN   vmora02rh4
    ora....rh4.gsd application    ONLINE    UNKNOWN   vmora02rh4
    ora....rh4.ons application    OFFLINE   UNKNOWN   vmora02rh4
    ora....rh4.vip application    ONLINE    ONLINE    vmora02rh4

    As you can see above, some of the applications are UNKNOWN or OFFLINE, either of which is not good for my RAC.

    The crs_stat command gives you the full names of the applications, which you may need in order to shut some applications down manually before shutting the whole cluster down and restarting it.

    [oracle@vmora01rh4 bin]$ crs_stat
    NAME=ora.fokerac.db
    TYPE=application
    TARGET=ONLINE
    STATE=ONLINE on vmora02rh4
    NAME=ora.fokerac.fokerac1.inst
    TYPE=application
    TARGET=ONLINE
    STATE=ONLINE on vmora01rh4
    NAME=ora.fokerac.fokerac2.inst
    TYPE=application
    TARGET=OFFLINE
    STATE=OFFLINE
    NAME=ora.fokerac.fokeserv.cs
    TYPE=application
    TARGET=ONLINE
    STATE=ONLINE on vmora02rh4
    NAME=ora.fokerac.fokeserv.fokerac1.srv
    TYPE=application
    TARGET=ONLINE
    STATE=ONLINE on vmora01rh4
    NAME=ora.fokerac.fokeserv.fokerac2.srv
    TYPE=application
    TARGET=ONLINE
    STATE=OFFLINE
    NAME=ora.vmora01rh4.ASM1.asm
    TYPE=application
    TARGET=ONLINE
    STATE=ONLINE on vmora01rh4
    NAME=ora.vmora01rh4.LISTENER_VMORA01RH4.lsnr
    TYPE=application
    TARGET=ONLINE
    STATE=ONLINE on vmora01rh4
    NAME=ora.vmora01rh4.gsd
    TYPE=application
    TARGET=ONLINE
    STATE=UNKNOWN on vmora01rh4
    NAME=ora.vmora01rh4.ons
    TYPE=application
    TARGET=ONLINE
    STATE=UNKNOWN on vmora01rh4
    NAME=ora.vmora01rh4.vip
    TYPE=application
    TARGET=ONLINE
    STATE=ONLINE on vmora01rh4
    NAME=ora.vmora02rh4.ASM2.asm
    TYPE=application
    TARGET=OFFLINE
    STATE=UNKNOWN on vmora02rh4
    NAME=ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr
    TYPE=application
    TARGET=OFFLINE
    STATE=UNKNOWN on vmora02rh4
    NAME=ora.vmora02rh4.gsd
    TYPE=application
    TARGET=ONLINE
    STATE=UNKNOWN on vmora02rh4
    NAME=ora.vmora02rh4.ons
    TYPE=application
    TARGET=OFFLINE
    STATE=UNKNOWN on vmora02rh4
    NAME=ora.vmora02rh4.vip
    TYPE=application
    TARGET=ONLINE
    STATE=ONLINE on vmora02rh4

    I could also have attempted to stop them all using crs_stop -all, but it normally throws enough errors to force you to do it manually, one by one.

    [oracle@vmora01rh4 bin]$ crs_stop -all
    Attempting to stop `ora.vmora01rh4.ons` on member `vmora01rh4`
    Attempting to stop `ora.vmora02rh4.ons` on member `vmora02rh4`
    `ora.vmora02rh4.ons` on member `vmora02rh4` has experienced an unrecoverable failure.
    Human intervention required to resume its availability.
    Stop of `ora.vmora01rh4.ons` on member `vmora01rh4` succeeded.
    Attempting to stop `ora.vmora01rh4.ASM1.asm` on member `vmora01rh4`
    Attempting to stop `ora.fokerac.fokerac2.inst` on member `vmora02rh4`
    `ora.fokerac.fokerac2.inst` on member `vmora02rh4` has experienced an unrecoverable failure.
    Human intervention required to resume its availability.
    Attempting to stop `ora.vmora02rh4.ASM2.asm` on member `vmora02rh4`
    `ora.vmora02rh4.ASM2.asm` on member `vmora02rh4` has experienced an unrecoverable failure.
    Human intervention required to resume its availability.
    Attempting to stop `ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr` on member `vmora02rh4`
    `ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr` on member `vmora02rh4` has experienced an unrecoverable failure.
    Human intervention required to resume its availability.
    Attempting to stop `ora.fokerac.fokerac2.inst` on member `vmora02rh4`
    `ora.fokerac.fokerac2.inst` on member `vmora02rh4` has experienced an unrecoverable failure.
    Human intervention required to resume its availability.
    Attempting to stop `ora.vmora02rh4.ASM2.asm` on member `vmora02rh4`
    `ora.vmora02rh4.ASM2.asm` on member `vmora02rh4` has experienced an unrecoverable failure.
    Human intervention required to resume its availability.
    Attempting to stop `ora.vmora02rh4.vip` on member `vmora02rh4`
    Stop of `ora.vmora02rh4.vip` on member `vmora02rh4` succeeded.
    Stop of `ora.vmora01rh4.ASM1.asm` on member `vmora01rh4` succeeded.
    Attempting to stop `ora.vmora01rh4.LISTENER_VMORA01RH4.lsnr` on member `vmora01rh4`
    Stop of `ora.vmora01rh4.LISTENER_VMORA01RH4.lsnr` on member `vmora01rh4` succeeded.
    Attempting to stop `ora.vmora01rh4.vip` on member `vmora01rh4`
    Stop of `ora.vmora01rh4.vip` on member `vmora01rh4` succeeded.
    CRS-0216: Could not stop resource 'ora.vmora02rh4.ASM2.asm'.
    CRS-0216: Could not stop resource 'ora.vmora02rh4.ons'.
    CRS-0216: Could not stop resource 'ora.vmora02rh4.vip'.

    For that very same reason we will go ahead and do it our way. Therefore, we need to stop our instances first.

    [oracle@vmora01rh4 bin]$ srvctl stop instance -d fokerac -i fokerac1
    [oracle@vmora01rh4 bin]$ srvctl stop instance -d fokerac -i fokerac2

    Check our status

    [oracle@vmora01rh4 bin]$ crs_stat -t
    Name           Type           Target    State     Host
    ------------------------------------------------------------
    ora.fokerac.db application    OFFLINE   OFFLINE
    ora....c1.inst application    OFFLINE   OFFLINE
    ora....c2.inst application    OFFLINE   OFFLINE
    ora....serv.cs application    ONLINE    UNKNOWN   vmora02rh4
    ora....ac1.srv application    OFFLINE   OFFLINE
    ora....ac2.srv application    OFFLINE   OFFLINE
    ora....SM1.asm application    OFFLINE   OFFLINE
    ora....H4.lsnr application    OFFLINE   OFFLINE
    ora....rh4.gsd application    ONLINE    UNKNOWN   vmora01rh4
    ora....rh4.ons application    OFFLINE   OFFLINE
    ora....rh4.vip application    OFFLINE   OFFLINE
    ora....SM2.asm application    OFFLINE   UNKNOWN   vmora02rh4
    ora....H4.lsnr application    OFFLINE   UNKNOWN   vmora02rh4
    ora....rh4.gsd application    ONLINE    UNKNOWN   vmora02rh4
    ora....rh4.ons application    OFFLINE   UNKNOWN   vmora02rh4
    ora....rh4.vip application    OFFLINE   OFFLINE

    Stop the service

    [oracle@vmora01rh4 bin]$ srvctl stop service -d fokerac -s fokeserv

    Check status again

    [oracle@vmora01rh4 bin]$ crs_stat -t
    Name           Type           Target    State     Host
    ------------------------------------------------------------
    ora.fokerac.db application    OFFLINE   OFFLINE
    ora....c1.inst application    OFFLINE   OFFLINE
    ora....c2.inst application    OFFLINE   OFFLINE
    ora....serv.cs application    OFFLINE   OFFLINE
    ora....ac1.srv application    OFFLINE   OFFLINE
    ora....ac2.srv application    OFFLINE   OFFLINE
    ora....SM1.asm application    OFFLINE   OFFLINE
    ora....H4.lsnr application    OFFLINE   OFFLINE
    ora....rh4.gsd application    ONLINE    UNKNOWN   vmora01rh4
    ora....rh4.ons application    OFFLINE   OFFLINE
    ora....rh4.vip application    OFFLINE   OFFLINE
    ora....SM2.asm application    OFFLINE   UNKNOWN   vmora02rh4
    ora....H4.lsnr application    OFFLINE   UNKNOWN   vmora02rh4
    ora....rh4.gsd application    ONLINE    UNKNOWN   vmora02rh4
    ora....rh4.ons application    OFFLINE   UNKNOWN   vmora02rh4
    ora....rh4.vip application    OFFLINE   OFFLINE

    OK, so we need to stop those applications now.

    [oracle@vmora01rh4 bin]$ crs_stop ora.vmora01rh4.gsd
    Attempting to stop `ora.vmora01rh4.gsd` on member `vmora01rh4`
    Stop of `ora.vmora01rh4.gsd` on member `vmora01rh4` succeeded.
    [oracle@vmora01rh4 bin]$ crs_stop ora.vmora02rh4.ASM2.asm
    Attempting to stop `ora.vmora02rh4.ASM2.asm` on member `vmora02rh4`
    Stop of `ora.vmora02rh4.ASM2.asm` on member `vmora02rh4` succeeded.
    [oracle@vmora01rh4 bin]$ crs_stop ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr
    Attempting to stop `ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr` on member `vmora02rh4`
    Stop of `ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr` on member `vmora02rh4` succeeded.
    [oracle@vmora01rh4 bin]$ crs_stop ora.vmora02rh4.gsd
    Attempting to stop `ora.vmora02rh4.gsd` on member `vmora02rh4`
    Stop of `ora.vmora02rh4.gsd` on member `vmora02rh4` succeeded.
    [oracle@vmora01rh4 bin]$ crs_stop ora.vmora02rh4.ons
    Attempting to stop `ora.vmora02rh4.ons` on member `vmora02rh4`
    Stop of `ora.vmora02rh4.ons` on member `vmora02rh4` succeeded.

    Check status

    [oracle@vmora01rh4 bin]$ crs_stat -t
    Name           Type           Target    State     Host
    ------------------------------------------------------------
    ora.fokerac.db application    OFFLINE   OFFLINE
    ora....c1.inst application    OFFLINE   OFFLINE
    ora....c2.inst application    OFFLINE   OFFLINE
    ora....serv.cs application    OFFLINE   OFFLINE
    ora....ac1.srv application    OFFLINE   OFFLINE
    ora....ac2.srv application    OFFLINE   OFFLINE
    ora....SM1.asm application    OFFLINE   OFFLINE
    ora....H4.lsnr application    OFFLINE   OFFLINE
    ora....rh4.gsd application    OFFLINE   OFFLINE
    ora....rh4.ons application    OFFLINE   OFFLINE
    ora....rh4.vip application    OFFLINE   OFFLINE
    ora....SM2.asm application    OFFLINE   OFFLINE
    ora....H4.lsnr application    OFFLINE   OFFLINE
    ora....rh4.gsd application    OFFLINE   OFFLINE
    ora....rh4.ons application    OFFLINE   OFFLINE
    ora....rh4.vip application    OFFLINE   OFFLINE

    OK, all set; now let's bring them all online.

    [oracle@vmora01rh4 bin]$ crs_start -all
    Attempting to start `ora.vmora02rh4.vip` on member `vmora02rh4`
    Attempting to start `ora.vmora01rh4.vip` on member `vmora01rh4`
    Start of `ora.vmora02rh4.vip` on member `vmora02rh4` succeeded.
    Start of `ora.vmora01rh4.vip` on member `vmora01rh4` succeeded.
    Attempting to start `ora.vmora01rh4.ASM1.asm` on member `vmora01rh4`
    Attempting to start `ora.vmora02rh4.ASM2.asm` on member `vmora02rh4`
    Start of `ora.vmora02rh4.ASM2.asm` on member `vmora02rh4` succeeded.
    Attempting to start `ora.fokerac.fokerac2.inst` on member `vmora02rh4`
    Start of `ora.vmora01rh4.ASM1.asm` on member `vmora01rh4` succeeded.
    Attempting to start `ora.fokerac.fokerac1.inst` on member `vmora01rh4`
    Start of `ora.fokerac.fokerac2.inst` on member `vmora02rh4` succeeded.
    Attempting to start `ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr` on member `vmora02rh4`
    Start of `ora.fokerac.fokerac1.inst` on member `vmora01rh4` succeeded.
    Attempting to start `ora.vmora01rh4.LISTENER_VMORA01RH4.lsnr` on member `vmora01rh4`
    Start of `ora.vmora02rh4.LISTENER_VMORA02RH4.lsnr` on member `vmora02rh4` succeeded.
    Start of `ora.vmora01rh4.LISTENER_VMORA01RH4.lsnr` on member `vmora01rh4` succeeded.
    CRS-1002: Resource 'ora.vmora02rh4.ons' is already running on member 'vmora02rh4'
    CRS-1002: Resource 'ora.vmora01rh4.ons' is already running on member 'vmora01rh4'
    Attempting to start `ora.fokerac.fokeserv.fokerac1.srv` on member `vmora01rh4`
    Attempting to start `ora.vmora01rh4.gsd` on member `vmora01rh4`
    Attempting to start `ora.fokerac.db` on member `vmora01rh4`
    Attempting to start `ora.fokerac.fokeserv.fokerac2.srv` on member `vmora02rh4`
    Attempting to start `ora.fokerac.fokeserv.cs` on member `vmora02rh4`
    Attempting to start `ora.vmora02rh4.gsd` on member `vmora02rh4`
    Start of `ora.fokerac.fokeserv.fokerac2.srv` on member `vmora02rh4` succeeded.
    Start of `ora.fokerac.fokeserv.cs` on member `vmora02rh4` succeeded.
    Start of `ora.fokerac.db` on member `vmora01rh4` succeeded.
    Start of `ora.vmora02rh4.gsd` on member `vmora02rh4` succeeded.
    Start of `ora.vmora01rh4.gsd` on member `vmora01rh4` succeeded.
    Start of `ora.fokerac.fokeserv.fokerac1.srv` on member `vmora01rh4` succeeded.
    *CRS-0223: Resource 'ora.vmora01rh4.ons' has placement error.
    CRS-0223: Resource 'ora.vmora02rh4.ons' has placement error.

    *Don't worry about those errors; the resources just did not report back to us in the sequence in which they were started by the clusterware.

    [oracle@vmora01rh4 bin]$ crs_stat -t
    Name           Type           Target    State     Host
    ------------------------------------------------------------
    ora.fokerac.db application    ONLINE    ONLINE    vmora01rh4
    ora....c1.inst application    ONLINE    ONLINE    vmora01rh4
    ora....c2.inst application    ONLINE    ONLINE    vmora02rh4
    ora....serv.cs application    ONLINE    ONLINE    vmora02rh4
    ora....ac1.srv application    ONLINE    ONLINE    vmora01rh4
    ora....ac2.srv application    ONLINE    ONLINE    vmora02rh4
    ora....SM1.asm application    ONLINE    ONLINE    vmora01rh4
    ora....H4.lsnr application    ONLINE    ONLINE    vmora01rh4
    ora....rh4.gsd application    ONLINE    ONLINE    vmora01rh4
    ora....rh4.ons application    ONLINE    ONLINE    vmora01rh4
    ora....rh4.vip application    ONLINE    ONLINE    vmora01rh4
    ora....SM2.asm application    ONLINE    ONLINE    vmora02rh4
    ora....H4.lsnr application    ONLINE    ONLINE    vmora02rh4
    ora....rh4.gsd application    ONLINE    ONLINE    vmora02rh4
    ora....rh4.ons application    ONLINE    ONLINE    vmora02rh4
    ora....rh4.vip application    ONLINE    ONLINE    vmora02rh4
    [oracle@vmora01rh4 bin]$



    $ <GRID_HOME>/bin/crsctl stat res ora.racdb.db -t

    --------------------------------------------------------------------------------
    NAME           TARGET  STATE        SERVER                   STATE_DETAILS      
    --------------------------------------------------------------------------------
    Cluster Resources
    --------------------------------------------------------------------------------
    ora.racdb.db
         1        OFFLINE OFFLINE                               Corrupted Controlfile
         2        ONLINE  ONLINE       rachost2                 Open
         3        ONLINE  ONLINE       rachost3                 Open


    $ <GRID_HOME>/bin/srvctl status database -d racdb
    Instance racdb1 is not running on node rachost1
    Instance racdb2 is running on node rachost2
    Instance racdb3 is running on node rachost3


    $ sqlplus / as sysdba
    ...
    SQL> select INSTANCE_NAME, STATUS from gv$instance;

    INSTANCE_NAME        STATUS
    ----------------     ------------
    racdb1               OPEN
    racdb2               OPEN
    racdb3               OPEN

  • gpnpd: The Grid Plug and Play daemon (GPNPD) provides access to the Grid Plug and Play profile and coordinates updates to the profile among the nodes of the cluster to ensure that all the nodes have the most recent profile.
How many IPs are required before installation of Clusterware/Grid?
If you do not enable GNS, the public and virtual IP addresses for each node must be static IP addresses, configured before installation for each node but not currently in use. Public and virtual IP addresses must be on the same subnet. Oracle Clusterware manages private IP addresses in the private subnet on interfaces you identify as private during the installation process.
The cluster must have the following addresses configured:
  • A public IP address for each node, with the following characteristics:
  1. Static IP address
  2. Configured before installation for each node, and resolvable to that node before installation
  3. On the same subnet as all other public IP, VIP, and SCAN addresses
  • A virtual IP address for each node, with the following characteristics:
  1. Static IP address
  2. Configured before installation for each node, but not currently in use
  3. On the same subnet as all other public IP addresses, VIP addresses, and SCAN addresses



  • A Single-Client Access Name (SCAN) for the cluster, with the following characteristics:
  1. Three static IP addresses configured on the domain name server (DNS) before installation, so that the three IP addresses are associated with the name provided as the SCAN, and all three addresses are returned in random order by the DNS to the requestor
  2. Configured before installation in the DNS to resolve to addresses that are not currently in use
  3. Given a name that does not begin with a numeral
  4. On the same subnet as all other public IP addresses, VIP addresses, and SCAN addresses
  • A private IP address for each node, with the following characteristics:
  1. Static IP address
  2. Configured before installation, but on a separate private network, with its own subnet, that is not resolvable except by other cluster member nodes to improve the interconnect performance
How to increase the performance of the interconnect
1) Link aggregation. It is also known as "NIC teaming" or "NIC bonding". It can be used to increase redundancy for higher availability with an Active/Standby configuration. Link aggregation can also be used to increase bandwidth for performance with an Active/Active configuration. This arrangement involves the simultaneous use of both bonded physical network interface cards in parallel to achieve a higher bandwidth, beyond the limit of any one single network card. It is very important that if 802.3ad is used at the NIC layer, the switch must also support and be configured for 802.3ad. Misconfiguration results in poor performance and interface resets or "port flapping".
2) An alternative is to consider a single network interface card with a higher bandwidth, such as 10 Gb Ethernet instead of 1 Gb Ethernet. InfiniBand can also be used for the interconnect.
3) UDP socket buffer (rx):
  • Default settings are adequate for the majority of customers. It may be necessary to increase the allocated buffer size when:
the MTU size has been increased
the netstat command reports errors
the ifconfig command reports dropped packets or overflow
The maximum UDP socket receive buffer size varies according to the operating system; the upper limit may be as small as 128 KB or as large as 1 GB. In most cases, the default settings are adequate for the majority of customers. This is one of the first settings to consider if you are seeing lost blocks.
Three significant conditions that indicate it may be necessary to change the UDP socket receive buffer size are: the MTU size has been increased, excessive fragmentation and/or reassembly of packets is observed, and dropped packets or overflows are observed.
4) Jumbo frames: Jumbo frames are not an Institute of Electrical and Electronics Engineers (IEEE) standard; they are not a requirement for Oracle Clusterware and are not configured by default. The use of jumbo frames is supported; however, special care must be taken because this is not an IEEE standard and there are significant variances among network devices and switches, especially from different manufacturers. The typical frame size for jumbo frames is 9 KB, but again, this can vary. It is necessary that all devices in the communication path be set to the same value.
Note: For Oracle Clusterware, the Maximum Transmission Unit (MTU) needs to be the same on all nodes. If it is not set to the same value, an error message will be sent to the Clusterware alert logs.
What is HAIP?
HAIP (High Availability IP) is the Oracle-based solution for load balancing and failover of private interconnect traffic. Typically, host-based solutions such as bonding (Linux), trunking (Solaris), etc. are used to implement high availability for private interconnect traffic, but HAIP is an Oracle solution for high availability.
In earlier releases, to minimize node evictions due to frequent private NIC down events, bonding, trunking, teaming, or similar technology was required to make use of redundant network connections between the nodes. Oracle Clusterware now provides an integrated solution which ensures "Redundant Interconnect Usage" as it supports IP failover.
Multiple private network adapters can be defined either during the installation phase or afterward using oifcfg. The ora.cluster_interconnect.haip resource will pick up a highly available virtual IP (the HAIP) from the "link-local" (Linux/Unix) IP range (169.254.0.0) and assign one to each private network. With HAIP, by default, interconnect traffic will be load balanced across all active interconnect interfaces. If a private interconnect interface fails or becomes non-communicative, then Clusterware transparently moves the corresponding HAIP address to one of the remaining functional interfaces.
Grid Infrastructure can activate a maximum of four private network adapters at a time, even if more are defined. The number of HAIP addresses is decided by how many private network adapters are active when Grid comes up on the first node in the cluster. If there is only one active private network, Grid will create one; if two, Grid will create two, and so on. The number of HAIPs won't increase beyond four even if more private network adapters are activated. A restart of Clusterware on all nodes is required for new adapters to become effective.
This functionality is available starting with Oracle Database 11g Release 2 (11.2.0.2). If you use the Oracle Clusterware Redundant Interconnect feature, you must use IPv4 addresses for the interfaces.
When you define multiple interfaces, Oracle Clusterware creates from one to four highly available IP (HAIP) addresses. Oracle RAC and Oracle Automatic Storage Management (Oracle ASM) instances use these interface addresses to ensure highly available, load-balanced interface communication between nodes. The installer enables Redundant Interconnect Usage to provide a high-availability private network.
By default, Oracle Grid Infrastructure software uses all of the HAIP addresses for private network communication, providing load-balancing across the set of interfaces you identify for the private network. If a private interconnect interface fails or becomes noncommunicative,
Oracle Clusterware transparently moves the corresponding HAIP address to one of the remaining functional interfaces.
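
You can see the interfaces (including the 169.254.x.x HAIP addresses) that the database and ASM instances are actually using by querying gv$cluster_interconnects. A sketch:

select inst_id, name, ip_address, is_public, source
from gv$cluster_interconnects
order by inst_id;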
What is the advantage of the Single-Client Access Name (SCAN)?
The Single-Client Access Name (SCAN) is the address used by clients connecting to the cluster. The SCAN is a fully qualified host name (host name + domain) registered to three IP addresses. If you use GNS and have DHCP support, then GNS will assign addresses dynamically to the SCAN.
If  you do not use GNS, the SCAN should be defined in the DNS to resolve to the three addresses assigned to that name.
This should be done before you install Oracle Grid Infrastructure.
The SCAN and its associated IP addresses provide a stable name for clients to use for connections, independent of the nodes that make up the cluster.
SCANs function like a cluster alias. However, SCANs are resolved on any node in the cluster, so unlike a VIP address for a node, clients connecting to the SCAN no longer require updated VIP addresses as nodes are added to or removed from the cluster. Because the SCAN addresses
resolve to the cluster, rather than to a node address in the cluster, nodes can be added to or removed from the cluster without affecting the SCAN address configuration.
During installation, listeners are created on each node for the SCAN IP addresses. Oracle Clusterware routes application requests addressed to the cluster SCAN to the least loaded instance providing the service. SCAN listeners can run on any node in the cluster. SCANs provide location independence for databases, so that the client configuration does not have to depend on which nodes run a particular database. Instances register with SCAN listeners only as remote listeners. Upgraded databases register with SCAN listeners as remote listeners, and also continue to register with all other listeners.
If you specify a GNS domain during installation, the SCAN defaults to clustername-scan.GNS_domain. If a GNS domain is not specified at installation, the SCAN defaults to clustername-scan.current_domain.
SCAN and Local Listeners
When a client submits a connection request, the SCAN listener listening on a SCAN IP address and the SCAN port are contacted on the client’s behalf. Because all services on the cluster are registered with the SCAN listener, the SCAN listener replies with the address of the local listener on the least-loaded node where the service is currently being offered.
Finally, the client establishes a connection to the service through the listener on the node where service is offered. All these actions take place transparently to the client without any explicit configuration required in the client.
During installation, listeners are created on nodes for the SCAN IP addresses. Oracle Net Services routes application requests to the least loaded instance providing the service. Because SCAN addresses resolve to the cluster, rather than to a node address in the cluster, nodes can be added to or removed from the cluster without affecting the SCAN address configuration.
What is node eviction and what is its advantage?
  • An important service provided by Oracle Clusterware is node fencing.
  • Node fencing is used to evict nonresponsive hosts from the cluster, preventing data corruptions.
Node fencing is a technique used by clustered environments to evict nonresponsive or malfunctioning hosts from the cluster. Allowing affected nodes to remain in the cluster increases the probability of data corruption. Traditionally, Oracle Clusterware uses a STONITH (Shoot The Other Node In The Head)-comparable fencing algorithm to ensure data integrity in cases in which cluster integrity is endangered and split-brain scenarios need to be prevented. For Oracle Clusterware this means that a local process enforces the removal of one or more nodes from the cluster (fencing). This approach traditionally involved a forced "fast" reboot of the offending node. A fast reboot is a shutdown and restart procedure that does not wait for any I/O to finish or for file systems to synchronize on shutdown. Starting with Oracle Clusterware 11g Release 2 (11.2.0.2), this mechanism has been changed to prevent such a reboot as much as possible by introducing rebootless node fencing.
Now, when a decision is made to evict a node from the cluster, Oracle Clusterware will first attempt to shut down all resources on the machine that was chosen to be the subject of an eviction. Specifically, I/O generating processes are killed and Oracle Clusterware ensures that those processes are completely stopped before continuing.
If all resources can be stopped and all I/O generating processes can be killed, Oracle Clusterware will shut itself down on the respective node, but will attempt to restart after the stack has been stopped.
If, for some reason, not all resources can be stopped or I/O generating processes cannot be stopped completely, Oracle Clusterware will still perform a reboot.
In addition to this traditional fencing approach, Oracle Clusterware now supports a new fencing mechanism based on remote node termination. The concept uses an external mechanism capable of restarting a problem node without cooperation either from Oracle Clusterware or from the
operating system running on that node. To provide this capability, Oracle Clusterware supports the Intelligent Management Platform Interface specification (IPMI), a standard management protocol.
To use IPMI and to be able to remotely fence a server in the cluster, the server must be equipped with a Baseboard Management Controller (BMC), which supports IPMI over a local area network (LAN). After this hardware is in place in every server of the cluster, IPMI can be activated either during the installation of the Oracle Grid Infrastructure or afterward as a post-installation management task by using CRSCTL.
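As a sketch of that post-installation task, IPMI can be configured with crsctl on each node (the BMC user name and address are illustrative; the first command prompts for the BMC password):

$ crsctl set css ipmiadmin bmcadmin
$ crsctl set css ipmiaddr 192.168.10.45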
Oracle Clusterware continues to support third-party cluster solutions. For certified solutions,
Oracle Clusterware will integrate with the third-party cluster solution in a way that node membership decisions are deferred to the third-party cluster solution. For Oracle RAC environments, it is worth noticing that Oracle Clusterware is mandatory and provides all required functionality. No other third-party solution should therefore be required.



What is the difference between an Oracle global index and a local index?
When using Oracle partitioning, you can specify the “global” or “local” parameter in the create index syntax:
Local Index: A local index is a one-to-one mapping between an index partition and a table partition. In general, local indexes allow for a cleaner "divide and conquer" approach for generating fast SQL execution plans with partition pruning. Of course, the key value for the table partition and the value for the local index must be identical.
The partitioning of the indexes is transparent to all SQL queries. The great benefit is that the Oracle query engine will scan only the index partition that is required to service the query, thus speeding up the query significantly. In addition, the Oracle parallel query engine will sense that the index is partitioned and will fire simultaneous queries to scan the indexes.
Local partitioned indexes
Local partitioned indexes allow the DBA to take individual partitions of a table and indexes offline for maintenance (or reorganization) without affecting the other partitions and indexes in the table.
In a local partitioned index, the key values and number of index partitions will match the number of partitions in the base table.
CREATE INDEX year_idx
ON all_fact (order_date)
LOCAL
(PARTITION name_idx1,
 PARTITION name_idx2,
 PARTITION name_idx3);



Oracle will automatically use equal partitioning of the index based upon the number of partitions in the indexed table. For example, in the above definition, if all_fact had four partitions rather than three, the CREATE INDEX would fail since the index and table partitions would not match. This equal partitioning also makes index maintenance easier, since a single partition can be taken offline and the index rebuilt without affecting the other partitions in the table.
Global partitioned indexes
A global index is a one-to-many relationship, allowing one index partition to map to many table partitions.
A global partitioned index is used for all other indexes except for the one that is used as the table partition key. Global partitioned indexes are common in OLTP (online transaction processing) applications, where fewer index probes are required than with local partitioned indexes. In the global index partition scheme, the index is harder to maintain since the index may span partitions in the base table.
For example, when a table partition is dropped as part of a reorganization, the entire global index will be affected. When defining a global partitioned index, the DBA has complete freedom to specify as many partitions for the index as desired.
Now that we understand the concept, let’s examine the Oracle CREATE INDEX syntax for a globally partitioned index:
CREATE INDEX item_idx
ON all_fact (item_nbr)
GLOBAL PARTITION BY RANGE (item_nbr)
(PARTITION city_idx1 VALUES LESS THAN (100),
 PARTITION city_idx2 VALUES LESS THAN (200),
 PARTITION city_idx3 VALUES LESS THAN (300),
 PARTITION city_idx4 VALUES LESS THAN (400),
 PARTITION city_idx5 VALUES LESS THAN (MAXVALUE));
(Note that the GLOBAL clause requires PARTITION BY RANGE, and Oracle requires the highest partition of a global partitioned index to be bounded by MAXVALUE.)
Here, we see that the item index has been defined with five partitions, each containing a subset of the index range values. Note that it is irrelevant that the base table is in three partitions. In fact, it is acceptable to create a global partitioned index on a table that does not have any partitioning.
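Once created, you can check how the index partitions were laid out from the data dictionary; a minimal sketch against the example above:

SQL> SELECT index_name, partition_name, high_value, status
     FROM   user_ind_partitions
     WHERE  index_name = 'ITEM_IDX';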
Making Failover Seamless
In addition to adding database instances to mitigate node failure, Oracle RAC offers a number of technologies to make a node failover seamless to the application (and subsequently, to the end user),
including the following:
  • Transparent Application Failover
  • Fast Connect Failover
Transparent Application Failover (TAF) is a client-side feature. The term refers to the failover/reestablishment of sessions in case of instance or node failures. TAF is not limited to RAC configurations; active/passive clusters can benefit equally from it. TAF can be defined through local naming in the client’s tnsnames.ora file or, alternatively, as attributes to a RAC database service. The
latter is the preferred way of configuring it. Note that this feature requires the use of the OCI libraries, so thin-client only applications won’t be able to benefit from it. With the introduction of the Oracle Instant client, this problem can be alleviated somewhat by switching to the correct driver.
TAF can operate in two ways:
it can either restore a session or re-execute a select statement in the
event of a node failure.
While this feature has been around for a long time,
Oracle’s net manager configuration assistant doesn’t provide support for setting up client-side TAF. Also, TAF isn’t the most elegant way of handling node failures because any in-flight transactions will be rolled back—
TAF can resume running select statements only.
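A minimal client-side tnsnames.ora sketch for TAF, assuming a SCAN name cluster01-scan.example.com and a service prod_svc (both illustrative):

PROD =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = cluster01-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = prod_svc)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 30)(DELAY = 5))))

With TYPE=SELECT, in-flight queries resume after failover; in-flight DML is still rolled back, as described above.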
The fast connection failover feature provides a different way of dealing with node failures and other types of events published by the RAC high availability framework (also known as the Fast Application Notification, or FAN). It is more flexible than TAF.
Fast Connection Failover offers a driver-independent way for your JDBC application to take advantage of the connection failover facilities introduced in 10g Release 1 (10.1).
    When a RAC service failure is propagated to the JDBC application, the database has already rolled back the local transaction. The cache manager then cleans up all invalid connections. When an application holding an invalid connection tries to do work through that connection, it receives a SQLException ORA-17008, Closed Connection. The application has to handle the exception and reconnect.



Configuring ONS for Fast Connection Failover

In order for Fast Connection Failover to work, you must configure ONS correctly. ONS is shipped as part of Oracle Database 11g.
This section covers the following topics:
ONS Configuration File
Client-Side ONS Configuration
Server-Side ONS Configuration Using racgons
Remote ONS Subscription

ONS Configuration File

ONS configuration is controlled by the ONS configuration file, ORACLE_HOME/opmn/conf/ons.config. This file tells the ONS daemon details about how it should behave and whom it should talk to. Configuration information within ons.config is defined in simple name and value pairs. There are three values that should always be configured within ons.config. The first is localport, the port that ONS binds to on the localhost interface to talk to local clients. An example of the localport configuration is the following:
localport=4100
The second value is remoteport, the port that ONS binds to on all interfaces for talking to other ONS daemons. An example of the remoteport configuration is the following:
remoteport=4200
The third value specifies nodes, a list of other ONS daemons to talk to. Node values are given as a comma-delimited list of either host names or IP addresses plus ports. Note that the port value that is given is the remote port that each ONS instance is listening on. In order to maintain an identical file on all nodes, the host:port of the current ONS node can also be listed in the nodes list. It will be ignored when reading the list.
The nodes listed in the nodes line correspond to the individual nodes in the RAC instance. Listing the nodes ensures that the middle-tier node can communicate with the RAC nodes. At least one middle-tier node and one node in the RAC instance must be configured to see one another. As long as one node on each side is aware of the other, all nodes are visible. You need not list every single cluster and middle-tier node in the ONS config file of each Oracle RAC node. In particular, if one ONS config file cluster node is aware of the middle tier, then all nodes in the cluster are aware of it.
An example of the nodes configuration is the following:
nodes=myhost.example.com:4200,123.123.123.123:4200
There are also several optional values that can be provided in ons.config. The first optional value is loglevel. This specifies the level of messages that should be logged by ONS. This value is an integer that ranges from 1, which indicates fewest messages logged, to 9, which indicates most messages logged. The default value is 3. The following is an example:
loglevel=3
The second optional value is a logfile name. This specifies a log file that ONS should use for logging messages. The default value for logfile is $ORACLE_HOME/opmn/logs/ons.log. The following is an example:
logfile=/private/oraclehome/opmn/logs/myons.log
The third optional value is a walletfile name. A wallet file is used by the Oracle Secure Sockets Layer (SSL) to store SSL certificates. If a wallet file is specified to ONS, it will use SSL when communicating with other ONS instances and require SSL certificate authentication from all ONS instances that try to connect to it. This means that if you want to turn on SSL for one ONS instance, then you must turn it on for all instances that are connected. This value should point to the directory where your ewallet.p12 file is located. The following is an example:
walletfile=/private/oraclehome/opmn/conf/ssl.wlt/default
One optional value is reserved for use on the server-side. useocr=on is used to tell ONS to store all Oracle RAC nodes and port numbers in Oracle Cluster Registry (OCR) instead of in the ONS configuration file. Do not use this option on the client-side.
The ons.config file allows blank lines and comments on lines that begin with the number sign (#).

Client-Side ONS Configuration

You can access the client-side ONS through ORACLE_HOME/opmn. On the client-side, there are two ways to set up ONS:
Remote ONS configuration (see "Remote ONS Subscription")
ONS daemon on the client-side
Example 26-1 illustrates how a sample configuration file may look.
Example 26-1 Example of a Sample ons.config File
# This is an example ons.config file
#
# The first three values are required
localport=4100
remoteport=4200
nodes=racnode1.example.com:4200,racnode2.example.com:4200
After configuring ONS, you start the ONS daemon with the onsctl command. It is the user's responsibility to make sure that an ONS daemon is running at all times.
Using the onsctl Command
After configuring, use ORACLE_HOME/opmn/bin/onsctl to start, stop, reconfigure, and monitor the ONS daemon. Table 26-1 is a summary of the commands that onsctl supports.
Table 26-1 onsctl Commands
Command Effect Output
start Starts the ONS daemon onsctl: ons started
stop Stops the ONS daemon onsctl: shutting down ons daemon...
ping Verifies whether or not the ONS daemon is running ons is running ...
reconfig Triggers a reload of the ONS configuration without shutting down the ONS daemon  
help Prints a help summary message for onsctl  
detailed Prints a detailed help message for onsctl  

Server-Side ONS Configuration Using racgons

You can access the server-side ONS through ORA_CRS_HOME/opmn. You configure the server-side by using racgons to add the middle-tier node information to OCR. This command is found in ORA_CRS_HOME/bin/racgons. Before using racgons, you must edit ons.config to set useocr=on.
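For example, to register two middle-tier ONS daemons in the OCR (host names and ports are illustrative):

$ ORA_CRS_HOME/bin/racgons add_config midtier1.example.com:6200 midtier2.example.com:6200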


    What exactly is the use of FCF?
    FCF provides very fast notification of the failure and the ability to reconnect immediately using the same URL. When a RAC node fails, the application will receive an exception. The application has to handle the exception and reconnect.
    The JDBC driver does not re-target existing connections. If a node fails, the application must close the existing connection and get a new one. The way the application knows that the node failed is by getting an exception. There is more than one ORA error that can be thrown when a node fails; the application must be able to deal with them all.
    An application may call isFatalConnectionError() API on the OracleConnectionCacheManager to determine if the SQLException caught is fatal.



    If the return value of this API is true, we need to retry the getConnection on the DataSource.






    How do we use FCF with JDBC driver?
    In order to use FCF with JDBC, the following things must be done:
        Configure and start ONS. If ONS is not correctly set up, implicit connection cache creation fails and an ONSException is thrown at the first getConnection() request.
        See Oracle® Universal Connection Pool for JDBC Developer's Guide in the section Configuring ONS located in Using Fast Connection Failover
        FCF is now configured through a pool-enabled data source and is tightly integrated with UCP.  The FCF enabled through the Implicit Connection Cache as was used in 10g and 11g R1 is now deprecated.
        Set the FastConnectionFailoverEnabled property before making the first getConnection() request to an OracleDataSource. When FastConnection Failover is enabled, the failover applies to all connections in the pool.
        Use a service name rather than a SID when setting the OracleDataSource url property.
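Putting the steps together, a hedged sketch of the UCP pool-enabled data source settings involved (shown in properties style; in Java each maps to the corresponding setter, and all host and service names are illustrative):

connectionFactoryClassName=oracle.jdbc.pool.OracleDataSource
url=jdbc:oracle:thin:@//cluster01-scan.example.com:1521/prod_svc
fastConnectionFailoverEnabled=true
onsConfiguration=nodes=racnode1.example.com:6200,racnode2.example.com:6200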



Transparent Application Failover
Transparent Application Failover (TAF), or simply Application Failover, is a feature of the OCI driver. It enables you to automatically reconnect to a database if the database instance to which the connection is made goes down. In this case, the active transactions roll back (a transaction rollback restores the database to the state of the last committed transaction). The new database connection, though created by a different node, is identical to the original. This is true regardless of how the connection was lost.
TAF is always active and does not have to be set.
TAF cannot be used with the thin driver.






Failover Modes
Transparent Application Failover can be configured to work in two modes, or it can be deactivated. If we count deactivated as a mode, it means TAF can be assigned the following three options:
  • Session failover
  • Select failover
  • None (default)
Failover Type Events

The following are possible failover types in the OracleOCIFailover interface:

    FO_SESSION
    Is equivalent to FAILOVER_MODE=SESSION in the tnsnames.ora file CONNECT_DATA flags. This means that only the user session is re-authenticated on the server-side while open cursors in the OCI application need to be re-executed.

    FO_SELECT
    Is equivalent to FAILOVER_MODE=SELECT in tnsnames.ora file CONNECT_DATA flags. This means that not only the user session is re-authenticated on the server-side, but open cursors in the OCI can continue fetching. This implies that the client-side logic maintains fetch-state of each open cursor.

    FO_NONE
    Is equivalent to FAILOVER_MODE=NONE in the tnsnames.ora file CONNECT_DATA flags. This is the default, in which no failover functionality is used. This can also be explicitly specified to prevent failover from happening. Additionally, FO_TYPE_UNKNOWN implies that a bad failover type was returned from the OCI driver.
Failover Methods
With the failover mode specified, users can further define a method that dictates exactly how TAF will re-establish the session on the other instance. A failover method can be defined independently of the
failover type. The failover method determines how the failover works; the following options are available:
  • Basic
  • Preconnect
As its name suggests, the basic option instructs the client to establish a new connection only after the node has failed. This can potentially lead to a large number of new connection requests to the surviving instance. In the case of a two-node RAC, this might cause performance degradation until all user
connections are re-established. If you consider using this approach, you should test for potential performance degradation during the design stage.
The preconnect option is slightly more difficult to configure. When you specify the preconnect parameter, the client is instructed to preconnect a session to a backup instance to speed up session failover. You need to bear in mind that these preconnections increase the number of sessions to the cluster. In addition, you also need to define what the backup connection should be.

Hugepages and Large Pages

If you run an Oracle Database on a Linux server with more than 16 GB of physical memory and your System Global Area (SGA) is greater than 8 GB, you should configure HugePages. Oracle promises better performance by doing this. A HugePages configuration means that the Linux kernel can handle "large pages", as Oracle generally calls them: instead of the standard 4 KB on x86 and x86_64 or 16 KB on IA64 systems, pages are 4 MB on x86, 2 MB on x86_64 and 256 MB on IA64 systems. Bigger pages mean that the system uses fewer page tables and manages fewer mappings, which reduces the effort for their management and access.

However, there is a limitation imposed by Oracle, because Automatic Memory Management (AMM) does not support HugePages. If you already use AMM and MEMORY_TARGET is set, you have to disable it and switch back to Automatic Shared Memory Management (ASMM); that means setting SGA_TARGET and PGA_AGGREGATE_TARGET. There is another feature called Transparent HugePages (THP) which should be disabled as well. It has been delivered since Red Hat Enterprise Linux 6 and its derivatives, and Oracle as well as Red Hat recommend disabling it (see point 5, Change Server Configuration). So let's get started with the 7 steps:

1. Check Physical Memory

First we should check our physically available memory. In the example we have about 128 GB of RAM. SGA_TARGET and PGA_AGGREGATE_TARGET together should not be more than the available memory. Besides that, there should be enough space left for the OS processes themselves:

grep MemTotal /proc/meminfo

MemTotal: 132151496 kB

2. Check Database Parameter

Second, check your database parameters. First of all: is AMM disabled? MEMORY_TARGET and MEMORY_MAX_TARGET should be set to 0:

SQL> select value from v$parameter where name = 'memory_target';

VALUE
---------------------------
0 

How big is our SGA? In this example about 40 GB. Important: In the following query we directly convert into kB (value/1024). With that we can continue to calculate directly:

SQL> select value/1024 from v$parameter where name = 'sga_target';

VALUE
---------------------------
41943040

Finally, by default the parameter use_large_pages should be enabled:

SQL> select value from v$parameter where name = 'use_large_pages';

VALUE
---------------------------
TRUE

3. Check Hugepagesize

In our example we use a x86_64 Red Hat Enterprise Linux Server. So by default hugepagesize should be set to 2 MB:

grep Hugepagesize /proc/meminfo

Hugepagesize:       2048 kB

4. Calculate Hugepages

For the calculation of the number of HugePages there is an easy way:

SGA / Hugepagesize = Number Hugepages

Following our example:

41943040 / 2048 = 20480

If you run more than one database on your server, you should include the SGA of all of your instances into the calculation:

( SGA 1. Instance + SGA 2. Instance + … etc. ) / Hugepagesize = Number Hugepages

In My Oracle Support you can find a script (Doc ID 401749.1) called hugepages_settings.sh, which does the calculation. This also includes a check of your kernel version and the shared memory area actually used by the SGA. Please consider that this calculation observes only the SGA segments currently in use: if your second instance is down, it will not be taken into account. That means you should adjust your SGA and start all databases first; then you can run the script. The result should be the following line. You can also make your own calculation and then check it against the script:

Recommended setting: vm.nr_hugepages = 20480

5. Change Server Configuration

The next step is to enter the number of HugePages in the server configuration file. For that you need root permissions. On Red Hat Enterprise Linux 6 this is /etc/sysctl.conf.

vi /etc/sysctl.conf

vm.nr_hugepages=20480

Correctly inserted, following result should show up:

grep vm.nr_hugepages /etc/sysctl.conf

vm.nr_hugepages=20480 

The next parameter is the hard and soft memlock in /etc/security/limits.conf for our oracle user. This value should be smaller than our available memory but not smaller than our SGA: the HugePages must fit into it 100 percent. For that, use the following calculation:

Number Hugepages * Hugepagesize = minimum Memlock

Following our example:

20480 * 2048 = 41943040
vi /etc/security/limits.conf

oracle               soft    memlock 41943040
oracle               hard    memlock 41943040

Correctly inserted, following result should show up:

grep oracle /etc/security/limits.conf

...
oracle               soft    memlock 41943040
oracle               hard    memlock 41943040

As mentioned before we have to disable transparent hugepages from Red Hat Linux version 6 ongoing:

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
[always] madvise never

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
always madvise [never]

6. Server Reboot

If all parameters are set, do a complete reboot of your server. As an alternative, you can reload the kernel parameters with sysctl -p.

7. Check Configuration

Memlock correct?

ulimit -l

41943040 

HugePages correctly configured and in use?

grep Huge /proc/meminfo

AnonHugePages:    538624 kB 
HugePages_Total:    20480
HugePages_Free:     12292
HugePages_Rsvd:      8188
HugePages_Surp:        0
Hugepagesize:       2048 kB

Transparent Hugepages disabled?

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled

always madvise [never]

Does the database use HugePages? For that we take a look into the alert log. After "Starting ORACLE instance (normal)", the following "Large Pages Information" entry gives us the answer:

************************ Large Pages Information *******************

Per process system memlock (soft) limit = 100 GB
Total Shared Global Region in Large Pages = 40 GB (100%) 
Large Pages used by this instance: 20481 (40 GB)
Large Pages unused system wide = 0 (0 KB)
Large Pages configured system wide = 20481 (40 GB)
Large Page size = 2048 KB

********************************************************************

If your configuration is incorrect, Oracle delivers a recommendation here for the right setting. In the following example exactly one page is missing, so the memlock limit must be raised by 2048 kB to come to 100% SGA use of HugePages:

************************ Large Pages Information *******************
...
...

RECOMMENDATION:

Total System Global Area size is 40 GB. For optimal performance,
prior to the next instance restart:
1. Increase the number of unused large pages by
at least 1 (page size 2048 KB, total size 2048 KB) system wide to
get 100% of the System Global Area allocated with large pages
2. Large pages are automatically locked into physical memory.
Increase the per process memlock (soft) limit to at least 40 GB to lock
100% System Global Area's large pages into physical memory

********************************************************************

Done!


Why is the OLR used, and what is its significance at Clusterware startup?
An additional cluster configuration file has been introduced with Oracle 11.2, the so-called Oracle Local Registry (OLR). Each node has its own copy of the file in the Grid Infrastructure software home.
The OLR stores important security contexts used by the Oracle High Availability Service early in the start sequence of Clusterware. The information in the OLR and the Grid Plug and Play configuration file are needed to locate the voting disks. If they are stored in ASM, the discovery string in the GPnP profile will be used by the cluster synchronization daemon to look them up. Later in the Clusterware boot sequence,the ASM instance will be started by the cssd process to access the OCR files; however, their location is stored in the /etc/ocr.loc file, just as it is in RAC 11.1. Of course, if the voting files and OCR are on a shared cluster file system, then an ASM instance is not needed and won’t be started unless a different resource depends on ASM.
Storing Information in the Oracle Local Registry
The Oracle Local Registry is the OCR’s local counterpart and a new feature introduced with Grid Infrastructure. The information stored in the OLR is needed by the Oracle High Availability Services daemon (OHASD) to start; this includes data about GPnP wallets, Clusterware configuration, and version information. Comparing the OCR with the OLR reveals that the OLR has far fewer keys;
for example,
ocrdump reported 704 different keys for the OCR vs. 526 keys for the OLR on our installation.
If you compare only the keys again, you will notice that the majority of keys in the OLR deal with the OHASD process, whereas the majority of keys in the OCR deal with the CRSD. This confirms what we said earlier: you need the OLR (along with the GPnP profile) to start the High Availability Services stack.
In contrast, the OCR is used extensively by CRSD. The OLR is maintained by the same command-line utilities as the OCR, with the appended -local option. Interestingly, the OLR is automatically backed up during an upgrade to Grid Infrastructure, whereas the OCR is not.
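For example, the -local variants let you check the OLR and take a manual backup:

$ ocrcheck -local
$ ocrconfig -local -manualbackup
$ ocrconfig -local -showbackup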



Using Grid Infrastructure Agents
In Oracle 11gR2 and later, there are two new types of agent processes: the Oracle Agent and the Oracle Root Agent. These processes interface between Oracle Clusterware and managed resources.
In previous versions of Oracle Clusterware, this functionality was provided by the RACG family of scripts and processes.
To slightly complicate matters, there are two sets of Oracle Agents and Oracle Root Agents, one for the High Availability Services stack and one for the Cluster Ready Services stack.
The Oracle Agent and Oracle Root Agent that belong to the High Availability Services stack are started by ohasd daemon. The Oracle Agent and Oracle Root Agent pertaining to the Cluster Ready
Services stack are started by the crsd daemon. In systems where the Grid Infrastructure installation is not owned by oracle (and this is probably the majority of installations), there is a third Oracle Agent
created as part of the Cluster Ready Services stack. Similarly, the Oracle Agent spawned by OHAS is owned by the Grid Infrastructure software owner.
In addition to these two processes, there are agents responsible for managing and monitoring the CSS daemon, called CSSDMONITOR and CSSDAGENT. CSSDAGENT, the agent process responsible for spawning CSSD is created by the OHAS daemon. CSSDMONITOR, which monitors CSSD and the overall
node health (jointly with the CSSDAGENT), is also spawned by OHAS.
You might wonder how CSSD, which is required to start the clustered ASM instance, can be started if voting disks are stored in ASM? This sounds like a chicken-and-egg problem: without access to the voting disks there is no CSS, hence the node cannot join the cluster. But without being part of the
cluster, CSSD cannot start the ASM instance. To solve this problem, the ASM disk headers have new metadata in 11.2: you can use kfed to read the header of an ASM disk containing a voting disk. The kfdhdb.vfstart and kfdhdb.vfend fields tell CSS where to find the voting file. This does not require the ASM instance to be up. Once the voting disks are located, CSS can access them and join the cluster.
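A hedged sketch of reading those fields with kfed (the device path and values are illustrative):

$ kfed read /dev/asmdisk1 | grep -E 'vfstart|vfend'
kfdhdb.vfstart:       352 ; 0x0ec: 0x00000160
kfdhdb.vfend:         384 ; 0x0f0: 0x00000180

Non-zero vfstart/vfend values indicate that the disk holds a voting file.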
The high availability stack’s Oracle Agent runs as the owner of the Grid Infrastructure stack in a clustered environment, as either the oracle or grid users. It is spawned by OHAS directly as part of the cluster startup sequence, and it is responsible for starting resources that do not require root privileges.
The list of processes Oracle Agent starts includes the following:



  • EVMD and EVMLOGGER
  • the gipc daemon
  • the gpnp daemon
  • The mDNS daemon
The Oracle Root Agent that is spawned by OHAS in turn starts all daemons that require root privileges to perform their programmed tasks. Such tasks include the following:
  • CRS daemon
  • CTSS daemon
  • Disk Monitoring daemon
  • ACFS drivers
Once CRS is started, it will create another Oracle Agent and Oracle Root Agent. If Grid Infrastructure is owned by the grid account, a second Oracle Agent is created. The grid Oracle Agent(s) will be responsible for:
  • Starting and monitoring the local ASM instance
  • ONS and eONS daemons
  • The SCAN listener, where applicable
  • The Node listener
There can be a maximum of three SCAN listeners in the cluster at any given time. If you have more than three nodes, then you can end up without a SCAN listener on a node. Likewise, in the extreme example where there is only one node in the cluster, you could end up with three SCAN listeners on that
node. The oracle Oracle Agent will only spawn the database resource if account separation is used. If not—i.e., if you didn't install Grid Infrastructure with a different user than the RDBMS binaries—then
the oracle Oracle Agent will also perform the tasks listed previously with the grid Oracle Agent.
The Oracle Root Agent finally will create the following background processes:
  • GNS, if configured
  • GNS VIP if GNS enabled
  • ACFS Registry
  • Network
  • SCAN VIP, if applicable
  • Node VIP



The functionality provided by the Oracle Agent process in Oracle 11gR2 was provided in earlier releases by the RACG family of scripts and processes mentioned above.
Clusterware startup sequence



Here is a brief explanation of how the Clusterware stack comes up, step by step.
1. When a node of an Oracle Clusterware cluster restarts, OHASD is started by platform-specific means. OHASD is the root for bringing up Oracle Clusterware. OHASD has access to the OLR (Oracle Local Registry) stored on the local file system. OLR provides needed data to complete OHASD initialization.
2. OHASD brings up GPNPD and CSSD. CSSD has access to the GPnP profile stored on the local file system. This profile contains the following vital bootstrap data:
        a. ASM Diskgroup Discovery String 
        b. ASM SPFILE location (Diskgroup name) 
        c. Name of the ASM Diskgroup containing the Voting Files
3. The Voting Files locations on ASM Disks are accessed by CSSD with well-known pointers in the ASM Disk headers and CSSD is able to complete initialization and start or join an existing cluster.
4. OHASD starts an ASM instance and ASM can now operate with CSSD initialized and operating. The ASM instance uses special code to locate the contents of the ASM SPFILE, assuming it is stored in a Diskgroup.
5. With an ASM instance operating and its Diskgroups mounted, access to Clusterware’s OCR is available to CRSD.
6. OHASD starts CRSD with access to the OCR in an ASM Diskgroup.
7. Clusterware completes initialization and brings up other services under its control.
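You can watch the lower-stack resources that OHASD manages with crsctl; for example:

$ crsctl stat res -t -init

which lists ora.cssd, ora.crsd, ora.asm, ora.ctssd and the other -init resources together with their state.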

or



12c Oracle Clusterware Startup Sequence - Oracle Clusterware starts automatically when a RAC node boots. The startup sequence runs through several levels; below you can see how the multi-level startup process brings up the full Grid Infrastructure stack and the resources that Clusterware manages.



This tutorial will describe startup sequence of oracle 12c RAC clusterware which is installed on Unix / Linux platform.



Oracle 12c RAC Clusterware Startup Sequence






Once your operating system finishes the bootstrap process, it reads /etc/inittab via the initialization daemon, init. An entry in the inittab file triggers the Oracle High Availability Services daemon:



$cat /etc/inittab | grep init.d | grep -v grep
h1:35:respawn:/etc/init.d/init.ohasd run  >/dev/null  2>&1 </dev/null
Oracle Linux 6.x and Red Hat Linux 6.x have deprecated inittab.  init.ohasd is configured in startup in /etc/init/oracle-ohasd.conf:

$cat /etc/init/oracle-ohasd.conf


start on runlevel [35]
stop on runlevel [!35]
respawn
exec /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
This starts "init.ohasd run", which in turn starts the ohasd.bin background process:

$ps  -ef  | grep  ohasd  | grep  -v grep

root  4056  1  1  Feb19   ?     01:54:34 /u01/app/12.1.0/grid/bin/ohasd.bin  reboot
root  2715  1   0 Feb19  ?     00:00:00  /bin/sh   /etc/init.d/init.ohasd  run



OHASD ( Oracle High Availability Services Daemon ) - on a standalone (non-cluster) installation this stack is also known as Oracle Restart



First, /etc/init triggers OHASD. Once OHASD is started at Level 0, it is responsible for starting the rest of the Clusterware stack and the resources that Clusterware manages, directly or indirectly, through Levels 1-4.



Level 1 - OHASD on its own triggers four agent processes:



  • cssdmonitor : CSS Monitor
  • OHASD orarootagent : High Availability Service stack Oracle root agent
  • OHASD oraagent : High Availability Service stack Oracle Agent
  • cssdagent : CSS Agent






Level 2 - On this level, the OHASD oraagent triggers five processes:



  • mDNSD : mDNS daemon process
  • GIPCD : Grid Interprocess Communication daemon
  • GPnPD : GPnP profile daemon
  • EVMD : Event Monitor daemon
  • ASM : Resources for monitoring ASM instances
Then, the OHASD orarootagent triggers the following processes:
  • CRSD : CRS daemon
  • CTSSD : Cluster Time Synchronisation Service Daemon 
  • Diskmon : Disk Monitor Daemon ( Exadata Server Storage )
  • ACFS : ( ASM Cluster File System ) Drivers
Next, the cssdagent starts the CSSD ( CSS daemon ) process.
Level 3 - The CRSD spawns two CRSD agents : CRSD orarootagent and CRSD oraagent.
Level 4 - On this level, the CRSD orarootagent is responsible for starting the following resources:
  • Network resource : for the public network
  • SCAN VIPs
  • Node VIPs : VIPs for each node
  • ACFS Registry
  • GNS VIP : VIP for GNS if you use the GNS option
Then, the CRSD oraagent is responsible for starting the rest of the resources, as follows:
  • ASM Resources : ASM instance(s) resource
  • Diskgroup : Used for managing / monitoring ASM diskgroups
  • Database Resource : Used for managing and monitoring the database and instances
  • SCAN Listener : Listener for SCAN listening on SCAN VIP
  • Listener : Node Listener listening on the Node VIP
  • Services : Database Services
  • ONS
  • eONS : Enhanced ONS
  • GSD : For 9i backward compatibility
  • GNS : performs name resolution ( Optional )



How does the database interact with ASM?






The file creation process provides a fine illustration of the interactions that take place between
database instances and ASM. The file creation process occurs as follows:



1. The database requests file creation.
2. An ASM foreground process creates a Continuing Operation Directory (COD) entry and
allocates space for the new file across the disk group.
3. The ASMB database process receives an extent map for the new file.
4. The file is now open and the database process initializes the file directly.
5. After initialization, the database process requests that the file creation is committed. This
causes the ASM foreground process to clear the COD entry and mark the file as created.
6. Acknowledgement of the file commit implicitly closes the file. The database instance will
need to reopen the file for future I/O.  






What is the GPnP profile and what is its importance?



The GPnP profile is an XML file located at $GRID_HOME/gpnp/<hostname>/profiles/peer/profile.xml. Each node of the cluster maintains a local copy of this profile, which is maintained by the GPnP daemon along with the mDNS daemon.
Now before understanding why Oracle came up with GPnP profile, we need to focus on what it contains.
The GPnP profile defines a node's metadata about network interfaces for public and private interconnect, the ASM server parameter file, and CSS voting disks. This profile is protected by a wallet against modification. If you have to manually modify the profile, it must first be unsigned with $GRID_HOME/bin/gpnptool, modified, and then signed again with the same utility; however, there is very little chance you would ever be required to do so.
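A hedged sketch of that unsign/sign workflow (the option syntax varies by version, so verify it with the gpnptool built-in help before use):

$ gpnptool get                      # dump the current profile to stdout
$ gpnptool unsign -p=profile.xml    # strip the wallet signature before editing
$ gpnptool sign -p=profile.xml -w=file:$GRID_HOME/gpnp/wallets/peer -o=profile_new.xml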
Now we’ll use the gpnptool with get option to dump this xml file into standard output. Below is the formatted output for the ease of readability.
<?xml version="1.0" encoding="UTF-8"?>
<gpnp:GPnP-Profile Version="1.0" xmlns="http://www.xyz/gpnp-profile"
xsi:schemaLocation="http://xyz/gpnp-profile gpnp-profile.xsd"
ProfileSequence="3" ClusterUId="002c207a71cvaljgkcea7bea5b3a49"
ClusterName="Cluster01" PALocation="">
<gpnp:Network-Profile>
<gpnp:HostNetwork id="gen" HostName="*">
<gpnp:Network id="net1" IP="xxx.xx.x.x" Adapter="bond0" Use="public"/>
<gpnp:Network id="net2" IP="xxx.xxx.x.x" Adapter="bond1"
Use="cluster_interconnect"/>
</gpnp:HostNetwork>
</gpnp:Network-Profile>
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400" />
<orcl:ASM-Profile id="asm" DiscoveryString=""
SPFile="+DATA/prod/asmparameterfile/registry.253.699915959" />
<ds:Signature...>...</ds:Signature>
</gpnp:GPnP-Profile>
So from the above dump we can see that the GPnP profile contains the following information:
1) Cluster Name
2) Network Profile
3) CSS-Profile tag
4) ASM-Profile tag
Now that we have understood the content of a GPnP profile, we need to understand how the Clusterware uses this information to start. From 11gR2 you have the option of storing the OCR and voting disk on ASM, but Clusterware needs the OCR and voting disk to start crsd and cssd, and both these files are on ASM, which is itself a resource on the node. So how does the Clusterware start, and which files does it access to get the information needed? To resolve this, Oracle came up with two local operating system files: the OLR and the GPnP profile.
When a node of an Oracle Clusterware cluster restarts, OHASD is started by platform-specific means. OHASD has access to the OLR (Oracle Local Registry) stored on the local file system. The OLR provides the data needed to complete OHASD initialization.
OHASD brings up GPnP Daemon and CSS Daemon. CSS Daemon has access to the GPNP Profile stored on the local file system.
The Voting Files locations on ASM Disks are accessed by CSSD with well-known pointers in the ASM Disk headers and CSSD is able to complete initialization and start or join an existing cluster.
OHASD starts an ASM instance and ASM can now operate with CSSD initialized and operating. The ASM instance uses special code to locate the contents of the ASM SPFILE, assuming it is stored in a Diskgroup.
With an ASM instance operating and its diskgroups mounted, access to Clusterware's OCR is available to CRSD. OHASD starts CRSD with access to the OCR in an ASM diskgroup, and thus Clusterware completes initialization and brings up the other services under its control.
Thus, with the information stored in the GPnP profile along with the information in the OLR, several tasks have been automated or eased for administrators.
I hope the above information helps you in understanding the Grid Plug and Play profile, its content, its usage, and why it was required. Please comment below if you need more information on GPnP, such as the complete dump of the profile, how the GPnP daemon and mDNS daemon communicate to maintain the updated profile on all the nodes, or how oifcfg, crsctl, asmcmd and other utilities use IPC to alter the content of these files.






What is the voting disk?
OCR is used to store the cluster configuration details. It stores the information about the resources that Oracle Clusterware controls. The resources include the Oracle RAC database and instances, listeners, and virtual IPs (VIPs) such as SCAN VIPs and local VIPs.



The voting disk (VD) stores the cluster membership information. Oracle Clusterware uses the VD to determine which nodes are members of a cluster. Oracle Cluster Synchronization Service daemon (OCSSD) on each cluster node updates the VD with the current status of the node every second. The VD is used to determine which RAC nodes are still in the cluster should the interconnect heartbeat between the RAC nodes fail.
CSS is the service which determines which nodes in the cluster are available, and it provides cluster group membership and simple locking services to other processes. CSS typically determines node availability via communication through a dedicated private network, with a voting disk used as a secondary communication mechanism. This is done by sending heartbeat messages through the network and the voting disk.
The voting disk is a file on a clustered file system that is accessible to all nodes in the cluster. Its primary purpose is to help in situations where the private network communication fails. The voting disk is then used to communicate the node state information used to determine which nodes go offline. Without the voting disk, it can be difficult for isolated nodes to determine whether they are experiencing a network failure or whether the other nodes are no longer available. It would then be possible for the cluster to enter a state where multiple subclusters of nodes would have unsynchronized access to the same database files.
The voting disk contains information regarding the nodes in the cluster and the disk heartbeat information. The CSSD of each node registers the information regarding its node in the voting disk with a pwrite() system call at a specific offset, and then uses a pread() system call to read the status of the other CSSD processes. But since the information regarding the nodes is in the OCR/OLR too, and the system calls have nothing to do with previous calls, there isn't any irreplaceable data kept in the voting disk. So, if you lose the voting disks, you can simply add them back without losing any data. But, of course, losing voting disks can lead to node reboots, and if you lose all voting disks, you will have to keep the CRS daemons down; only then can you add the voting disks back. Now that we have understood both heartbeats, which were the most important part, we will dig deeper into the voting disk: what is stored inside it, why Clusterware needs it, how many voting disks are required, and so on.
Now finally to understand the whole concept of voting disk we need to know what is split brain syndrome, I/O Fencing and simple majority rule.
Split Brain Syndrome: In an Oracle RAC environment, all the instances/servers communicate with each other using high-speed interconnects on the private network. This private network interface or interconnect is redundant and is used only for inter-instance Oracle data block transfers. With respect to Oracle RAC systems, split-brain occurs when the instance members in a RAC fail to ping/connect to each other via this private interconnect, but the servers are all physically up and running and the database instance on each of these servers is also running. These individual nodes are running fine and can conceptually accept user connections and work independently. So basically, due to lack of communication, each instance thinks that the other instance it cannot reach is down, and it needs to do something about the situation. The problem is that if we leave these instances running, the same block might get read and updated in these individual instances and there would be data integrity issues, as the blocks changed in one instance will not be locked and could be overwritten by another instance. This situation is termed Split Brain Syndrome.
I/O Fencing: There are situations where leftover write operations from failed database instances (the cluster function failed on the nodes, but the nodes are still running at the OS level) reach the storage system after the recovery process starts. Since these write operations are no longer in the proper serial order, they can damage the consistency of the stored data. Therefore, when a cluster node fails, the failed node needs to be fenced off from all the shared disk devices or disk groups. This methodology is called I/O fencing or failure fencing.
Simple Majority Rule: According to Oracle, "An absolute majority of voting disks configured (more than half) must be available and responsive at all times for Oracle Clusterware to operate." This means that to survive the loss of N voting disks, you must configure at least 2N+1 voting disks.
Now we are in a state to understand the use of voting disk in case of heartbeat failure.
Example 1: Suppose in a 3-node cluster with 3 voting disks, the network heartbeat fails between Node 1 and Node 3 and between Node 2 and Node 3, whereas Node 1 and Node 2 are able to communicate via the interconnect, and from the voting disk CSSD notices that all the nodes are able to write to the voting disks; this is a split-brain. The healthy nodes, Node 1 and Node 2, would update the kill block in the voting disk for Node 3. Then, during the pread() system call of Node 3's CSSD, it sees a self-kill flag set, and thus the CSSD of Node 3 evicts itself. Then comes the I/O fencing, and finally OHASD will attempt to restart the stack after a graceful shutdown.
Example 2: Suppose in a 2-node cluster with 3 voting disks, the disk heartbeat fails such that Node 1 can see 2 voting disks and Node 2 can see only 1 voting disk (if the number of voting disks were not odd, both nodes could each see half of them, and it would be difficult to avoid a split-brain). Based on the Simple Majority Rule, the CSSD process of Node 1 (2 voting disks) sends a kill request to the CSSD process of Node 2 (1 voting disk), and thus Node 2 evicts itself; then comes the I/O fencing, and finally OHASD will attempt to restart the stack after a graceful shutdown.
Thus the voting disk plays a role in both heartbeat failures, and hence is a very important file for node eviction and I/O fencing in case of a split-brain situation.
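You can list the voting disks and their state with crsctl (the File Universal Ids and paths below are illustrative):

$ crsctl query css votedisk
##  STATE    File Universal Id    File Name      Disk group
 1. ONLINE   a3b4...              /dev/asmdisk1  [DATA]
 2. ONLINE   b5c6...              /dev/asmdisk2  [DATA]
 3. ONLINE   c7d8...              /dev/asmdisk3  [DATA]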



I hope the above information helps you in understanding the concept of the voting disk: its purpose, what it contains, when it is used, etc. Please comment below if you need more information, such as split-brain examples for a bigger cluster, how Oracle executes STONITH internally, what processes are involved in the complete node eviction process, how to identify the cause of a node eviction, and how a node is evicted, rebooted and then joined again to the cluster.
What is node eviction and its troubleshooting steps



Clusterware will evict one or more nodes from the cluster if
a critical problem is detected. These problems include:

- A node not responding via network or disk heartbeat
- A hung node
- A hung ocssd.bin process

The purpose of this is to maintain the overall health of the cluster
by removing suspected nodes.



In Grid Infrastructure, a cluster is made of two or more nodes, and there are two heartbeats: the voting disk heartbeat and the network heartbeat.

The network heartbeat is across the interconnect: every second, a sending thread of CSSD sends a network TCP heartbeat to itself and all other nodes, and a receiving thread of CSSD receives the heartbeats. If network packets are dropped or contain errors, the TCP error-correction mechanism retransmits the packets; Oracle does not retransmit in this case. In the CSSD log, you will see a WARNING message about a missing heartbeat if a node does not receive a heartbeat from another node for 15 seconds (50% of misscount). Another warning is reported in the CSSD log if the same node has been missing for 22 seconds (75% of misscount), and similarly at 90% of misscount; when the heartbeat has been missing for 100% of the misscount (i.e. 30 seconds by default), the node is evicted.
The disk heartbeat is between the cluster nodes and the voting disk. The CSSD process on each RAC node maintains a heartbeat in a block of size 1 OS block, at a specific offset, via read/write system calls (pread/pwrite) on the voting disk. In addition to maintaining its own disk block, each CSSD process also monitors the disk blocks maintained by the CSSD processes running on the other cluster nodes. The written block has a header area with the node name and a counter which is incremented with every next beat (pwrite) from the other nodes. The disk heartbeat is maintained in the voting disk by the CSSD processes, and if a node has not written a disk heartbeat within the I/O timeout, the node is declared dead. Nodes that are of an unknown state, i.e. cannot be definitively said to be dead, and are not in the group of nodes designated to survive, are evicted: the node's kill block is updated to indicate that it has been evicted.


Thus, summarizing the heartbeats: the network heartbeat is pinged every second, and nodes must respond within the css misscount time; failure leads to node eviction. Similarly for the disk heartbeat: each node pings (reads/writes) the voting disk every second and must receive a response within the (long/short) disk timeout time.
Both heartbeats have threshold levels set: for the network heartbeat, css misscount, which defaults to 30 seconds, and for the disk heartbeat, disktimeout, which defaults to 200 seconds. If the nodes are not able to communicate with each other within the threshold time, one of the nodes will be evicted.
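You can confirm the thresholds on your own cluster with crsctl (the exact message text may vary by version):

$ crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
$ crsctl get css disktimeout
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.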

Why should the number of voting disks be odd?

All nodes must vote to the voting disks. For example, if we have three voting disks and one fails, we still have two voting disks, so Clusterware does not stop functioning, since by rule every node must be able to access more than 50 percent of the voting disks at any given time.

other examples

When you have 1 voting disk and it goes bad, the cluster stops functioning.

When you have 2 and 1 goes bad, the same happens because the nodes realize they can only write to half of the original disks (1 out of 2), violating the rule that they must be able to write > half (yes, the rule says >, not >=).

When you have 3 and 1 goes bad, the cluster runs fine because the nodes know they can access more than half of the original voting disks (2/3 > half).

When you have 4 and 1 goes bad, the same, because (3/4 > half).

When you have 3 and 2 go bad, the cluster stops because the nodes can only access 1/3 of the voting disks, not > half.

When you have 4 and 2 go bad, the same, because the nodes can only access half, not > half.

So you see, 4 voting disks have the same fault tolerance as 3, but you waste 1 disk without gaining anything. The recommendation of an odd number of voting disks helps save a little on hardware requirements.




Troubleshooting steps

The node eviction process is reported as Oracle error ORA-29740 in the alert log and LMON trace files.

1. Look at the cssd.log files on both nodes; usually we will get more information on the second node if the first node is evicted. Also take a look at the crsd.log file.

2. The evicted node will have core dump file generated and system reboot info.

3. Find out if there was a node reboot, whether it was because of CRS or something else, and check the system reboot time.

4. If you see "Polling" keywords with decreasing percentage values in the cssd.log file, the eviction is probably due to the network.
If you see "Diskpingout" or something related to -DISK-, then the eviction is because of a disk timeout.

5. After finding the network or disk issue, start going in depth.

6. Now it's time to collect NMON/OSW/RDA reports to verify whether it was a disk issue or a network issue.

7. If we see more memory contention/paging in the reports, then it's time to collect an AWR report to see what load/SQL was running during that period.

8. If the network was the issue, check whether any NIC cards were down or whether link switching has happened, and check that the private interconnect is working between both nodes.

9. Sometimes eviction could also be due to an OS error where the system is in a halt state for a while, memory overcommitment, or CPU 100% used.

10. Check the OS/system log files to get more information.

11. What got changed recently? Ask your coworker to open up a ticket with Oracle and upload logs

12. Check the health of the clusterware, DB instances, ASM instances, the uptime of all hosts and all the logs - ASM logs, Grid logs, CRS and ocssd.log,
HAS logs, EVM logs, DB instance logs, OS logs, and SAN logs for that particular timestamp (see the example commands after this list).

13. Check health of interconnect if error logs guide you in that direction.

14. Check the OS memory, CPU usage if error logs guide you in that direction.

15. Check the storage error logs if the other logs guide you in that direction.

16. Run TFA and OSWatcher, and check NETSTAT, IFCONFIG settings etc. based on the error messages during your RCA.

17. We have seen node eviction because iptables had been enabled; after iptables was turned off, everything went back to normal.
Avoid enabling firewalls between the nodes.
An ACL can open the ports on the interconnect, as we did, but we still experienced all kinds of issues
(unable to start CRS, unable to stop CRS, and node eviction).
We also had a problem with the voting disk caused by presenting LDEVs using business copies / ShadowImage, which made RAC less than happy.

18. Verify user equivalence between the cluster nodes.
19. Verify switch use for only interconnect. DO NOT USE same switch for other network operations.
20. Verify all nodes have 100% the same configuration; sometimes there are network or configuration differences that are not obvious.
Look for hangs in the logs and in monitoring tools such as NAGIOS to see whether any node ran out of RAM or became unresponsive.
21. A major reason for node evictions in our cluster, however, was patch levels not being equal across the two nodes.
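If any instances survive, a quick SQL-level sanity check (a minimal sketch; it assumes you can still connect to a surviving instance) is to confirm which instances are up and which interconnect each one is actually using:

SQL> SELECT inst_id, instance_name, host_name, status FROM gv$instance;
SQL> SELECT name, ip_address, is_public, source FROM v$cluster_interconnects;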



Nodes sometimes completely died, without any error whatsoever. It turned out to be a bug in the installer of the 11.1.0.7.1 PSU.






What is Undo retention and  retention guarantee and its importance
Enabling the undo retention guarantee helps assure the success of long-running queries as well as Oracle Flashback operations. The “guarantee” part of the undo retention guarantee is real—Oracle will certainly retain undo at least for the time you specify and will never overwrite any of the unexpired undo extents that contain the undo required to satisfy the undo retention period. However, there’s a stiff price attached to this guarantee—Oracle will guarantee retention even if it means that DML transactions fail because the database can’t find space to record the undo for those transactions. Therefore, you must exercise great caution when enabling the guaranteed undo retention capability.
By default, the database uses the following criteria to determine how long it needs to retain undo data:
  • Length of the longest-running query
  • Length of the longest-running transaction
  • Longest flashback duration
It’s somewhat difficult to understand how the Oracle database handles the retention of undo data. Here’s a brief summary of how things work:
  • If you don’t configure the undo tablespace with the AUTOEXTEND option, the database simply ignores the value you set for the UNDO_RETENTION parameter. The database will automatically tune the undo retention period based on the database workload and the size of the undo tablespace. So, make sure you set the undo tablespace to a large size if you’re receiving errors indicating (snapshot too old) that the database is not retaining undo for a long enough time.
Typically, the undo retention in this case is for a duration significantly longer than the longest-running active query in the database.
  • If you want the database to try to honor the settings you specify for the UNDO_RETENTION parameter, make sure that you enable the AUTOEXTEND option for the undo tablespace. This way, Oracle will automatically extend the size of the undo tablespace to make room for undo from new transactions, instead of overwriting the older undo data. However, if you’re receiving ORA-01555 (snapshot too old) errors—say, due to Oracle Flashback operations—it means that the database isn’t able to dynamically tune the undo retention period effectively. In a case such as this, try increasing the value of the UNDO_RETENTION parameter to match the length of the longest Oracle Flashback operation. Alternatively, you can try going to a larger fixed-size undo tablespace (without the AUTOEXTEND option).
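As a quick sanity check on the points above, the following sketch shows whether retention is guaranteed, whether the undo datafiles can autoextend, and what retention the database is currently tuning itself to (the tablespace name UNDOTBS1 is an assumption – substitute your own):

SQL> SELECT tablespace_name, retention FROM dba_tablespaces WHERE contents = 'UNDO';
SQL> SELECT file_name, autoextensible FROM dba_data_files WHERE tablespace_name = 'UNDOTBS1';
SQL> SELECT MAX(tuned_undoretention) FROM v$undostat;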
7-3. Resolving an ORA-01555 Error
Problem
You’re receiving the ORA-01555 (snapshot too old) errors during nightly runs of key production batch jobs. You want to eliminate these errors.
Solution
While setting a high value for the UNDO_RETENTION parameter can potentially minimize the possibility of receiving
“snapshot too old” errors, it doesn’t guarantee that the database won’t overwrite older undo data that may be needed by a running transaction. You can move long-running batch jobs to a separate time interval when other programs aren’t running in the database, to avoid these errors. Remember that you don’t always need several concurrent programs to generate “snapshot too old”; you can generate the error by running one poorly written query on a big table in a cursor and updating the same table inside the loop.
Regardless, while you can minimize the occurrence of “snapshot too old” errors with these approaches, you can’t completely eliminate such errors without enabling the guaranteed undo retention feature. When you configure guaranteed undo retention in a database, no SELECT statement can fail because of the “snapshot too old” error. The trade-off is that Oracle may keep new DML statements from executing when you set up guaranteed undo retention. Implementing the guaranteed undo feature is simple. Suppose you want to ensure that the database retains undo for at least an hour (3,600 seconds). First set the undo retention threshold with the alter system command shown here, and then set up guaranteed undo retention by specifying the retention guarantee clause to alter the undo tablespace.
SQL> alter system set undo_retention=3600;
System altered.
SQL> alter tablespace undotbs1 retention guarantee;
Tablespace altered.
SQL>
You can switch off guaranteed undo retention by executing the
alter tablespace command with the retention noguarantee clause.
 ■ Tip   You can enable guaranteed undo retention by using the alter system command as shown in this recipe, as well as with the create database and create undo tablespace statements.
How It Works
Oracle uses the undo records stored in the undo tablespace to help roll back transactions, provide read consistency, and help recover the database. In addition, the database also uses undo records to read data from a past point in time using Oracle Flashback Query. Undo data serves as the underpinning for several Oracle Flashback features that help you recover from logical errors.
Occurrence of the Error
The ORA-01555 error (snapshot too old) may occur in various situations. The following is a case where the error occurs during an export.
EXP-00008: ORACLE error 1555 encountered
ORA-01555: snapshot too old: rollback segment number 10 with name “_SYSSMU10$” too small
EXP-00000: Export terminated unsuccessfully
And you can receive the same error when performing a flashback transaction:
ERROR at line 1:
ORA-01555: snapshot too old: rollback segment number with name “” too small
ORA-06512: at “SYS.DBMS_FLASHBACK”, line 37
ORA-06512: at “SYS.DBMS_FLASHBACK”, line 70
ORA-06512: at li
The “snapshot too old” error occurs when Oracle overwrites undo data that’s needed by another query. The error is a direct result of how Oracle’s read consistency mechanism works. The error occurs during the execution of a long-running query when Oracle tries to read the “before image” of any changed rows from the undo segments.
For example, if a long-running query starts at 1 A.M. and runs until 6 A.M., it’s possible for the database to change the data that’s part of this query during the period in which the query executes. When Oracle tries to read the data as it appeared at 1 A.M., the query may fail if that data is no longer present in the undo segments. If your database is experiencing a lot of updates, Oracle may not be able to fetch the changed rows, because the before-change values recorded in the undo segments may have been overwritten. The transactions that changed the rows will have already committed, and the undo segments don’t have a record of the before-change row values because the database overwrote the relevant undo data. Since Oracle fails to return consistent data for the current query, it issues the ORA-01555 error. The query that’s currently running requires the before image to construct read-consistent data, but the before image isn’t available.
The ORA-01555 error may be the result of one or both of the following: a heavy amount of queries during a time of intense changes in the database or too small an undo tablespace. You can increase the size of the undo tablespace, but that doesn’t ensure that the error won’t occur again.
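Before resizing anything, it helps to look at V$UNDOSTAT, which records per 10-minute interval the longest query, the number of ORA-01555 errors (ssolderrcnt), and the number of out-of-space errors (nospaceerrcnt); a minimal sketch:

SQL> SELECT begin_time, maxquerylen, ssolderrcnt, nospaceerrcnt, tuned_undoretention
     FROM v$undostat ORDER BY begin_time DESC;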
Influence of Extents
The database stores undo data in undo extents, and there are three distinct types of undo extents:
Active: Transactions are currently using these extents.
Unexpired: These are extents that contain undo that’s required to satisfy the undo retention time specified by the UNDO_RETENTION initialization parameter.
Expired: These are extents with undo that’s been retained longer than the duration specified by the UNDO_RETENTION parameter.
If the database doesn’t find enough expired extents in the undo tablespace, or can’t allocate new undo extents, it will reuse unexpired extents (but never an active undo extent), and this leaves the door open for an ORA-01555
“snapshot too old” error. By default, the database will essentially stop honoring the undo retention period you specify if it comes under space pressure to accommodate the undo from new transactions. Since the unexpired undo extents contain undo records needed to satisfy the undo retention period, overwriting those extents in reality means that the
database is lowering the undo retention period you’ve set. Enabling the undo retention guarantee, described at the start of this answer, prevents exactly this, at the price that DML transactions may fail when the database can’t find space to record their undo.



What is incremental checkpointing? (Checkpoint & SCN)



A checkpoint is a data structure that indicates the “checkpoint position“, determined by the oldest dirty buffer in the database buffer cache. In terms of Oracle’s clock, this position is the SCN in the redo stream where instance recovery must begin. The checkpoint position acts as a pointer into the redo stream and is stored in the control file and in each data file header. When we say a checkpoint happened, we mean the writing of modified database buffers in the database buffer cache to disk. A successful checkpoint guarantees that all database changes up to the checkpoint SCN have been recorded in the datafiles, and the SCNs recorded in the file headers guarantee that all changes made to database blocks prior to that SCN are already written to disk. As a result, only those changes made after the checkpoint need to be applied during recovery.
Checkpoints are triggered under the following conditions:
§  Every 3 seconds (incremental checkpoint)
§  When a log switch happens
§  When the instance is shut down normal/transactional/immediate
§  Whenever ALTER TABLESPACE [OFFLINE NORMAL | READ ONLY | BEGIN BACKUP] is issued
§  Internal checkpoints forced by recovery-related parameters, e.g. FAST_START_MTTR_TARGET

Purpose of Checkpoints


Oracle Database uses checkpoints to achieve the following goals:
§  Reduce the time required for recovery in case of an instance or media failure
§  Ensure that dirty buffers in the buffer cache are written to disk regularly
§  Ensure that all committed data is written to disk during a consistent shutdown

When Oracle Database Initiates Checkpoints

The checkpoint process (CKPT) is responsible for writing checkpoints to the data file headers and the control file. Implementing a full checkpoint every time would be a costly operation and a major bottleneck for concurrency, so Oracle uses different types of checkpoints for different purposes:
§  Full checkpoint: Writes block images to the database for all dirty buffers from all instances. The control file and datafile headers are updated during this checkpoint. Until Oracle 8, a log switch also caused a full checkpoint; this changed from 8i onwards for performance reasons. Occurs in the following situations:
§  Alter system checkpoint global
§  Alter database begin backup
§  Alter database close
§  Shutdown Immediate/Transactional
§  Thread checkpoints: The database writes to disk all buffers modified by redo in a specific thread before a certain target. The set of thread checkpoints on all instances in a database is a database checkpoint. The control file and datafile headers are updated during this checkpoint. Occurs in the following situations:
§  Consistent database shutdown
§  Alter system checkpoint local
§  Online redo log switch
§  Tablespace and Datafile Checkpoint: Writes block images to the database for all dirty buffers for all files of a tablespace from all instances. The control file and datafile headers are updated during this checkpoint. Occurs in the following situations:
§  Alter tablespace … offline
§  Alter tablespace … begin backup
§  Alter tablespace … read only
§  Alter database datafile resize ( while shrinking a data file)
§  Parallel Query Checkpoint: Writes block images to the database for all dirty buffers belonging to objects accessed by the query from all instances. It is mandatory to maintain consistency. Occurs in the following situations:
§  Parallel Query
§  Parallel Query component of PDML or PDDL.
§  Incremental checkpoints: An incremental checkpoint is a type of thread checkpoint partly intended to avoid writing large numbers of blocks at online redo log switches. DBWn checks at least every 3 seconds to determine whether it has work to do. When DBWn writes dirty buffers, it advances the checkpoint position, causing CKPT to write the checkpoint position to the control file, but not to the data file headers.
§  Object Checkpoint: Writes block images to the database for all dirty buffers belonging to an object from all instances. Occurs in the following situations:
§  Drop table
§  Drop table … purge
§  Truncate table
§  Drop Index
§  Log Switch Checkpoint: Writes the contents of “some” dirty buffers to the database. Controlfile and datafile headers are updated with checkpoint_change#.
§  Instance Recovery Checkpoint: Writes recovered block back to datafiles. Trigger as soon as SMON is done with instance recovery.
§  RBR Checkpoint: This is actually a Reuse Block Range checkpoint; it usually appears after index rebuild operations.
§  Multiple Object Checkpoint: Triggered whenever a single operation causes checkpoints on multiple objects, e.g., dropping a partitioned table or index.
Whenever anything happens in the database, Oracle records an SCN that must be updated in various places. We can classify SCNs into the following major categories:
§  System (checkpoint) SCN: After a checkpoint completes, Oracle stores the system checkpoint SCN in the control file. We can check it in the checkpoint_change# column of the v$database view.
SQL> select checkpoint_change# from v$database;

CHECKPOINT_CHANGE#
------------------
1677903

SQL> alter system checkpoint;

System altered.

SQL> select checkpoint_change# from v$database;

CHECKPOINT_CHANGE#
------------------
1679716
§  DataFile (checkpoint) SCN: After a checkpoint completes, Oracle stores the SCN individually in the control file for each datafile. The following SQL shows the datafile checkpoint SCN for a datafile in the control file:
SQL> select name,checkpoint_change# from v$datafile where name like '%system01%';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                1679716
§  Partial (Checkpoint) SCN: Non-full checkpoints for a subset of the system (e.g. a tablespace or a datafile) set the checkpoint SCN for the affected entities only.
SQL> select name,checkpoint_change# from v$datafile_header where name like '%01.dbf';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                 1685610
/u02/app/oracle/oradata/mask11g/sysaux01.dbf                 1685610
/u02/app/oracle/oradata/mask11g/undotbs01.dbf                1685610
/u02/app/oracle/oradata/mask11g/users01.dbf                  1685610

SQL> alter tablespace users read only;

Tablespace altered.

SQL> select name,checkpoint_change# from v$datafile_header where name like '%01.dbf';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                  1685610
/u02/app/oracle/oradata/mask11g/sysaux01.dbf                  1685610
/u02/app/oracle/oradata/mask11g/undotbs01.dbf                 1685610
/u02/app/oracle/oradata/mask11g/users01.dbf                   1685618

SQL> alter tablespace users read write;

Tablespace altered.

SQL> select name,checkpoint_change# from v$datafile_header where name like '%01.dbf';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                  1685610
/u02/app/oracle/oradata/mask11g/sysaux01.dbf                  1685610
/u02/app/oracle/oradata/mask11g/undotbs01.dbf                 1685610
/u02/app/oracle/oradata/mask11g/users01.dbf                   1685642
§  Start (Checkpoint) SCN: Oracle stores the checkpoint SCN value in the header of each datafile. This is referred to as the start SCN because it is used at instance startup time to check if recovery is required. The following SQL shows the checkpoint SCN in the datafile header for a single datafile.
SQL> select name,checkpoint_change# from v$datafile_header where name like '%system01%';

NAME                                                 CHECKPOINT_CHANGE#
---------------------------------------------------- ------------------
/u02/app/oracle/oradata/mask11g/system01.dbf                  1657172
§  End (checkpoint) SCN: The stop (or termination) SCN is held in the control file for each datafile. The following SQL shows the stop SCN for a single datafile when the database is open for normal use.
SQL> select distinct LAST_CHANGE# from v$datafile;

LAST_CHANGE#
------------

SQL> alter database close;

Database altered.

SQL> select distinct LAST_CHANGE# from v$datafile;

LAST_CHANGE#
------------
2125206

SQL> select distinct CHECKPOINT_CHANGE# from v$datafile_header ;

CHECKPOINT_CHANGE#
------------------
2125206
During normal database operation, the stop SCN is NULL for all datafiles that are online in read-write mode.
SCN values while the database is up: following a checkpoint while the database is up and open for use, the system checkpoint SCN in the control file, the datafile checkpoint SCN in the control file, and the start SCN in each datafile header all match. The stop SCN for each datafile in the control file is NULL.
SCN after a clean shutdown: after a clean database shutdown resulting from a SHUTDOWN IMMEDIATE or SHUTDOWN NORMAL, followed by STARTUP MOUNT, the previous queries on v$database and v$datafile return matching SCN values. During a clean shutdown, a checkpoint is performed and the stop SCN for each datafile is set to the start SCN from the datafile header. Upon startup, Oracle compares the start SCN in the file header with the datafile checkpoint SCN in the control file. If they match, Oracle compares the start SCN in the datafile header with the datafile stop SCN in the control file. If they match, the database can be opened because all block changes have been applied, no changes were lost on shutdown, and therefore no recovery is required on startup. After the database is opened, the datafile stop SCN in the control file changes back to NULL to indicate that the datafile is open for normal use.


 An incremental checkpoint

An incremental checkpoint is a type of thread checkpoint partly intended to avoid writing large numbers of blocks at online redo log switches. DBWR checks at least every three seconds to determine whether it has work to do. When DBWn writes dirty buffers, it advances the checkpoint position, causing CKPT to write the checkpoint position to the control file, but not to the data file headers. . . .
During instance recovery, the database must apply the changes that occur between the checkpoint position and the end of the redo thread. Some changes may already have been written to the data files. However, only changes with SCNs lower than the checkpoint position are guaranteed to be on disk."
Can you explain incremental checkpoints in plain English?
Answer: An incremental checkpoint is like flushing little and often, rather than waiting and attempting one huge flush that risks clogging the system.

The "fast start" recovery (and the fast_start_mttr_target) is directly related to the incremental checkpoint.  By reducing the checkpoint time to be more frequent than a log switch, Oracle will recover and re-start faster in case of an instance crash. 

The docs note that DBWR writes buffers to disk ahead of the checkpoint position, writing the "oldest" blocks first to preserve integrity.
A "checkpoint" is the event that triggers writing of dirty blocks to the disks and a "normal" checkpoint only occurs with every redo log file switch. 
In a nutshell, an "incremental" directs the CKPT process to search for "dirty" blocks that need to be written by the DBWR process. thereby advancing the SCN to the control file.
The DBWR wakes up every 3 seconds, seeking dirty blocks and sleeps if he finds no blocks.  This prevents a "burst" of writing when a redo log switches.
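To watch this in a live database, you can compare the checkpoint SCN recorded in the control file with the current SCN; the gap is roughly the redo that instance recovery would have to apply (a minimal sketch):

SQL> SELECT checkpoint_change#, current_scn,
            current_scn - checkpoint_change# AS scn_gap
     FROM v$database;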




How does ASM communicate with the database?
A database that stores data on ASM volumes has two additional background processes: RBAL and ASMB. RBAL performs global opens of the disks. ASMB connects to the +ASMn instance to communicate information such as file creation and deletion.
The ASMB process communicates with the CSS daemon on the node and receives file extent map information from the ASM instance. ASMB is also responsible for providing I/O statistics to the ASM instance.
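You can confirm that these background processes are running in a database instance that uses ASM storage with a query like the following sketch (ASMB appears only while the database actually has ASM files open):

SQL> SELECT name, description FROM v$bgprocess
     WHERE name IN ('RBAL', 'ASMB') AND paddr <> '00';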
How is a connection established when DML is run?
The user process first communicates with a listener process that creates a server process in a dedicated environment.
Oracle Database creates server processes to handle the requests of user processes connected to the instance. The user process represents the application or tool that connects to the Oracle database.
Server processes created on behalf of each user’s application can perform one or more of the following:



  • Parse and run SQL statements issued through the application.
  • Read necessary data blocks from data files on disk into the shared database buffers of the SGA (if the blocks are not already present in the SGA).
  • Return results in such a way that the application can process the information.
When a user starts a transaction—for example, a DML operation—the before image of the data is recorded in the undo segment buffers, and the change details are recorded in the redo log buffer, from which they are written to the redo log files.
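To see the server process that was created for your own session, you can join V$SESSION to V$PROCESS; a minimal sketch (requires SELECT privileges on the V$ views):

SQL> SELECT s.sid, s.serial#, p.spid AS os_pid, s.server
     FROM v$session s JOIN v$process p ON p.addr = s.paddr
     WHERE s.sid = SYS_CONTEXT('USERENV', 'SID');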
What are single-instance recovery and RAC instance recovery?
If an instance of an open database fails, either because of a SHUTDOWN ABORT statement or abnormal termination, the following situations can result:



Data blocks committed by transactions may not yet have been written to the data files and may appear only in the online redo log; these changes must be reapplied to the database. The data files may also contain changes that had not been committed when the instance failed; these changes must be rolled back to ensure transactional consistency. Instance recovery uses only the online redo log files and current online data files to synchronize the data files and ensure that they are consistent.
Understanding Instance Recovery
Automatic instance or crash recovery:
  • Is caused by attempts to open a database whose files are not synchronized on shutdown
  • Uses information stored in redo log groups to synchronize files
  • Involves two distinct operations:
  1. Rolling forward: Redo log changes (both committed and uncommitted) are applied to data files.
  2. Rolling back: Changes that are made but not committed are returned to their original state.
The Oracle database automatically recovers from instance failure. All that needs to happen is for the instance to be started normally. If Oracle Restart is enabled and configured to monitor this database,then this happens automatically. The instance mounts the control files and then attempts to open the data files. When it discovers that the data files have not been
synchronized during shutdown, the instance uses information contained in
the redo log groups to roll the data files forward to the time of shutdown. Then the database is opened and any uncommitted transactions are rolled back.



Phases of Instance Recovery



  1. Startup instance (data files are out of sync)
  2. Roll forward (redo)
  3. Committed and uncommitted data in files
  4. Database opened
  5. Roll back (undo)
  6. Committed data in files



For an instance to open a data file, the system change number (SCN) contained in the data file’s header must match the current SCN that is stored in the database’s control files.
If the numbers do not match, the instance applies redo data from the online redo logs, sequentially “redoing” transactions until the data files are up to date. After all data files have been synchronized with the control files, the database is opened and users can log in.
When redo logs are applied, all transactions are applied to bring the database up to the state as of the time of failure. This usually includes transactions that are in progress but have not yet been committed. After the database has been opened, those uncommitted transactions are rolled back.
At the end of the rollback phase of instance recovery, the data files contain only committed data.
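A quick way to confirm that no data file still needs recovery is V$RECOVER_FILE, which is empty when nothing is pending (a minimal sketch):

SQL> SELECT file#, online_status, error, change# FROM v$recover_file;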



Tuning Instance Recovery



  • During instance recovery, the transactions between the checkpoint position and the end of the redo log must be applied to data files.
  • You tune instance recovery by controlling the difference between the checkpoint position and the end of the redo log.
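In practice you control that difference with the FAST_START_MTTR_TARGET parameter and verify the effect in V$INSTANCE_RECOVERY; a sketch, where the 60-second target is just an arbitrary example:

SQL> ALTER SYSTEM SET fast_start_mttr_target = 60;
SQL> SELECT target_mttr, estimated_mttr, recovery_estimated_ios
     FROM v$instance_recovery;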



 Why does Oracle recommend 3 voting disks when you have 2 nodes?
When you have 1 voting disk and it goes bad, the cluster stops functioning.
When you have 2 and 1 goes bad, the same happens because the nodes realize they can only write to half of the original disks (1 out of 2), violating the rule that they must be able to write > half (yes, the rule says >, not >=).
When you have 3 and 1 goes bad, the cluster runs fine because the nodes know they can access more than half of the original voting disks (2/3 > half).
When you have 4 and 1 goes bad, the same, because (3/4 > half).
When you have 3 and 2 go bad, the cluster stops because the nodes can only access 1/3 of the voting disks, not > half.
When you have 4 and 2 go bad, the same, because the nodes can only access half, not > half.
So you see, 4 voting disks have the same fault tolerance as 3, but you waste 1 disk without gaining anything. The recommendation of an odd number of voting disks helps save a little on hardware requirements.
All the above assume the nodes themselves are fine.
How is a big table loaded into the buffer cache?
Before John can explain the new feature, everyone in the office wants to know why a full table scan doesn’t use the buffer cache. When a session connected to the Oracle Database instance selects data from a table, John elaborates, the database server process reads the appropriate data blocks from the disk and puts them into the buffer cache by default. Each block goes into a buffer in the buffer cache. The reason is simple: if another session wants some data from those blocks, it can be served from those cached blocks much faster than being served from disk. The buffer cache is limited and usually smaller than the entire database, so when the cache is full and a new database block comes in, Oracle Database forces old buffers that have not been accessed in a long time out of the cache to make room for the new blocks coming in.
However, John continues, consider the case of a full table scan query that selects all the blocks of the table. If that table is large, its blocks will consume a large portion of the buffer cache, forcing out the blocks of other tables. It’s unlikely that all the blocks of a large table will be accessed regularly, so having those blocks in the cache does not actually help performance. But forcing out the blocks of other tables, especially popular blocks, degrades the overall performance of the applications running against the database. That is why, John explains, Oracle Database does not load the blocks into the buffer cache for full table scans.
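Assuming the new feature John is about to describe is automatic big table caching (introduced in 12c), a minimal sketch of enabling it and checking the result might look like this (the 40 percent target is an arbitrary example):

SQL> ALTER SYSTEM SET db_big_table_cache_percent_target = 40;
SQL> SELECT bt_cache_target, object_count FROM v$bt_scan_cache;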
How is a connection established and how does SQL run internally?
How does the database communicate with ASM?
How does hot backup work?
How does instance recovery happen in single-instance and RAC databases?
Instance recovery occurs when an instance goes down abruptly, either via a SHUTDOWN ABORT, a killing of a background process, or a crash of a node or the instance itself. After an ungraceful shutdown, it is necessary for the database to go through the process of rolling forward all information in the redo logs and rolling back any transactions that had not yet been committed. This process is known as instance recovery and is usually automatically performed by the SMON process.
The redo logs for all RAC instances are located either on an OCFS shared disk asset or on a RAW file system that is visible to all the other RAC instances. This allows any other node to recover for a failed RAC node in the event of instance failure.
There are basically two types of failure in a RAC environment: instance and media. Instance failure involves the loss of one or more RAC instances, whether due to node failure or connectivity failure. Media failure involves the loss of one or more of the disk assets used to store the database files themselves.
  1. All nodes are available.
  2. One or more RAC instances fail.
  3. Node failure is detected by any one of the remaining instances.
  4. The Global Resource Directory (GRD) is reconfigured and distributed among the surviving nodes.
  5. The instance that first detected the failed instance reads the failed instance's redo logs to determine which logs need to be recovered. This task is done by the SMON process of the instance that detected the failure.
  6. Until this point database activity is frozen. SMON issues recovery requests for all the blocks that are needed for recovery. Once all those blocks are available, the other blocks, which are not needed for recovery, become available for normal processing.
  7. Oracle performs a roll forward operation against the blocks that were modified by the failed instance but not written to disk, using the transactions recorded in the redo logs.
  8. Once the redo logs are applied, uncommitted transactions are rolled back using the undo tablespace.
  9. The database on the RAC is now fully available.
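Once recovery completes, a simple check from any surviving node confirms the cluster is serving again (a minimal sketch):

SQL> SELECT inst_id, instance_name, status, startup_time FROM gv$instance;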



Or
INSTANCE RECOVERY IN RAC DATABASE
I will discuss how instance recovery takes place in 11g R2 RAC. Instance recovery aims at
writing all committed changes to the datafiles
undoing all the uncommitted changes from the datafiles
incrementing the checkpoint number to the SCN up to which changes have been written to the datafiles.
In a single instance database, before the instance crashes,
some committed changes are in the redo log files but have not been written to the datafiles
some uncommitted changes have made their way to datafiles
some uncommitted changes are in the redo log buffer
After  the instance crashes in a single instance database
all uncommitted changes in the redo log buffer are wiped out
Online redo log files are read to identify the blocks that need to be recovered
Identified blocks are read from the datafiles
During roll forward phase, all the changes (committed/uncommitted) in redo log files are applied to them
During rollback phase, all uncommitted changes are rolled back after reading undo from undo tablespace.
The checkpoint# is incremented in the control file/data file headers.
In a RAC database there can be two scenarios :
Only one instance crashes
Multiple instances crash
We will discuss these cases one by one.
Single instance crash in RAC database
In this case, scenario is quite similar to instance crash in a single instance database. But there is slight difference also.
Let us consider a 3-node setup and a data block B1 with one column and 4 records in it. The column contains the values 100, 200, 300 and 400. Initially the block is on disk. The update operations on the block on the various nodes, and the corresponding states of the block (CR, PI, XCUR), are described below.
It is assumed that no incremental checkpointing has taken place on any of the nodes in the meanwhile.
Before crash status of block on various nodes is as follows:
PI at SCN# 2 on Node1
PI at SCN# 3 on Node2
XCUR on Node3



Redo logs at various nodes are
Node1 : B1: 100 -> 101, SCN# 1
Node2 : B1:200 -> 201, SCN# 2
Node3 : B1:300 -> 301, SCN# 3
After the crash,
Redo logs of crashed node (Node2) is analyzed and it is identified that block B1 needs to be recovered.
It is also identified that role of the block is global as its different versions are available in Node1 and Node3
It is identified that there is a PI on Node1 whose SCN# (2) is earlier than the SCN# of the crash (4)
Changes from redo logs of Node2 are applied to the PI on Node1 and the block is written to disk
Checkpoint # of node1 is incremented.
A BWR is placed in the redo log of Node1 to indicate that the block has been written to disk and need not be recovered in case Node1 crashes.
Here it can be readily seen that there are certain differences from instance recovery in a single-instance database.
The role of the block is checked.
If the role is local, the block will be read from the disk and changes from the redo logs of Node2 will be applied, i.e., just like in a single-instance database.
If the role is global:
It is checked whether a PI of the block at an SCN# earlier than the SCN# of the crash is available.
If a PI is available, the changes in the redo logs of Node2 are applied to the PI instead of reading the block from the disk.
If a PI is not available (it has been flushed to disk due to incremental checkpointing, either on the owner node of the PI or on any of the nodes at an SCN# greater than the PI holder's), the block will be read from the disk and the changes from the redo logs of Node2 applied, just as it used to happen in OPS.
Hence, it can be inferred that a PI, if available, speeds up instance recovery because the need to read the block from disk is eliminated. If a PI is not available, the block is read from the disk, just as in OPS.
Multiple instance crash in RAC database
Let us consider a 4-node setup and a data block B1 with one column and 4 records in it. The column contains the values 100, 200, 300 and 400. Initially the block is on disk. The sequence of events is as follows:
SCN#1 – Node1 reads the block from disk and updates 100 to 101 in the first record. It holds the block in XCUR mode
SCN#2 – Node2  requests the same block for update. Node1 keeps the PI and Node2 holds the block in XCUR mode
SCN#3 – Node3  requests the same block for update. Node2 keeps the PI and Node3 holds the block in XCUR mode . Now we have two PIs
On Node1 with SCN# 2
On Node2 with SCN# 3
SCN# 4 – Local checkpointing takes place on Node2. PI on this node has SCN# 3.
It is checked if any of the other nodes has a PI at an earlier SCN# than this. Node1 has PI at SCN# 2.
Changes in the redo log of Node2 are applied to its PI and it is flushed to disk.
BWR is placed in redo log of Node2 to indicate that the block has been written to disk and need not be recovered in case Node2 crashes.
The PI at Node2 is discarded, i.e., its state changes to CR, which can’t be used to serve remote nodes.
The PI at Node1 is also discarded, i.e., its state changes to CR.
A BWR is placed in the redo log of Node1 to indicate that the block has been written to disk and need not be recovered in case Node1 crashes.
Now on disk version of block contains changes of both Node1 and Node2.
SCN# 5 – Node4 requests the same block for update. Node3 keeps the PI and Node4 holds the block in XCUR mode. Node1 and Node2 hold CRs.
SCN# 6 – Node1 again requests the same block for update. Node4 keeps the PI and Node1 holds the block in XCUR mode. Now Node1 has both the same block in CR and XCUR mode. Node3 has PI at SCN# 5.
SCN# 7 – Node2 and Node3 crash.
It is assumed that no incremental checkpointing has taken place on any of the nodes in the meanwhile.
Before crash status of block on various nodes is as follows:
CR at SCN# 2 on Node1, XCUR on Node1
CR at SCN# 3 on Node2
PI  at SCN# 5 on Node3
PI at SCN# 6 on Node4
Redo logs at various nodes are
Node1 : B1: 100 -> 101, SCN# 1, BWR for B1 , B1:401->402 at SCN#6
Node2 : B1:200 -> 201, SCN# 2, BWR for B1
Node3 : B1:300 -> 301, SCN# 3
Node4 : B1:400->401 at SCN# 5
After the crash,
Redo logs of crashed node (Node2) are analyzed and it is identified that block B1 has been flushed to disk as of SCN# 4 and need not be recovered as no changes have been made to it from Node2.
No Redo log entry from Node2  needs to be applied
Redo logs of crashed node (Node3) are analyzed and it is identified that block B1 needs to be recovered
It is also identified that the role of the block is global, as different versions of it were/are available on Node1 (XCUR), Node2 (crashed), and Node4 (PI)
Changes from Node3 have to be applied . It is checked if any PI is available which is earlier than the SCN# of the change on node3 which needs to be applied i.e. SCN# 3.
It is identified that no PI is available whose SCN is earlier than SCN# 3. Hence, the block is read from the disk.
Redo log entry which needs to be applied is : B1:300 -> 301, SCN# 3
Redo is applied to the block read from the disk, and the block is written to disk so that the on-disk version also contains the change made by Node3.
The checkpoint numbers of Node2 and Node3 are incremented.
After instance recovery :
Node1 : holds CR and XCUR
Node2 :
Node3 :
Node4 : holds PI
On disk version  of the block is:
101
201
301
400






3) How does the database communicate with ASM?



4) What happens when we put the database in begin backup mode?
5) How does SQL execute and make a connection with the database?






How SQL Statement Processing Works in the Oracle Architecture
SQL Statements are processed differently depending on whether the statement is a query, data manipulation language (DML) to update, insert, or delete a row, or data definition language (DDL) to write information to the data dictionary.



Connect to an instance using:
_ User process
_ Server process
_ The Oracle server components that are used depend on the type of SQL statement:
- Queries return rows
- DML statements log changes
- COMMIT ensures transaction recovery
_ Some Oracle server components do not participate in SQL statement processing



Processing a query:
Parse:
o       Search for identical statement in the Shared SQL Area.
o       Check syntax, object names, and privileges.
o       Lock objects used during parse.
o       Create and store execution plan.
Bind: Obtains values for variables.
Execute: Process statement.
Fetch: Return rows to user process.
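You can observe this shared-cursor behaviour in V$SQL: run the same statement twice and the cursor is hard-parsed (LOADS) only once while EXECUTIONS climbs. A minimal sketch, where the /* demo */ comment is just a tag to find the cursor:

SQL> SELECT /* demo */ COUNT(*) FROM dual;
SQL> SELECT /* demo */ COUNT(*) FROM dual;
SQL> SELECT sql_id, parse_calls, loads, executions
     FROM v$sql WHERE sql_text LIKE 'SELECT /* demo */%';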
Processing a DML statement:
Parse: Same as the parse phase used for processing a query.
Bind: Same as the bind phase used for processing a query.
Execute:
o       If the data and undo blocks are not already in the Database Buffer Cache, the server process reads them from the datafiles into the Database Buffer Cache.
o       The server process places locks on the rows that are to be modified. The undo block is used to store the before image of the data, so that the DML statements can be rolled back if necessary.
o       The data blocks record the new values of the data.
o       The server process records the before image to the undo block and updates the data block.  Both of these changes are made in the Database Buffer Cache.  Any changed blocks in the Database Buffer Cache are marked as dirty buffers.  That is, buffers that are not the same as the corresponding blocks on the disk.
o       The processing of a DELETE or INSERT command uses similar steps.  The before image for a DELETE contains the column values in the deleted row, and the before image of an INSERT contains the row location information.



Processing a DDL statement:
The execution of DDL (Data Definition Language) statements differs from the execution of DML (Data Manipulation Language) statements and queries, because the success of a DDL statement requires write access to the data dictionary.
For these statements, parsing actually includes parsing, data dictionary lookup, and execution. Transaction management, session management, and system management SQL statements are processed using the parse and execute stages. To re-execute them, simply perform another execute.
Stage 2: Parse the Statement
During parsing, the SQL statement is passed from the user process to Oracle and a parsed representation of the SQL statement is loaded into a shared SQL area. Many errors can be caught during this phase of statement processing. Parsing is the process of
  • translating a SQL statement, verifying it to be a valid statement
  • performing data dictionary lookups to check table and column definitions
  • acquiring parse locks on required objects so that their definitions do not change during the statement’s parsing
  • checking privileges to access referenced schema objects
  • determining the optimal execution plan for the statement
  • loading it into a shared SQL area
  • for distributed statements, routing all or part of the statement to remote nodes that contain referenced data
A SQL statement is parsed only if a shared SQL area for an identical SQL statement does not exist in the shared pool. In this case, a new shared SQL area is allocated and the statement is parsed. For more information about shared SQL, refer to Chapter 10, Managing SQL and Shared PL/SQL Areas.
The parse phase includes processing requirements that need to be done only once no matter how many times the statement is executed. Oracle translates each SQL statement only once, re-executing that parsed statement during subsequent references to the statement.
Although the parsing of a SQL statement validates that statement, parsing only identifies errors that can be found before statement execution. Thus, certain errors cannot be caught by parsing. For example, errors in data conversion or errors in data (such as an attempt to enter duplicate values in a primary key) and deadlocks are all errors or situations that can only be encountered and reported during the execution phase.
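A concrete illustration: the second INSERT below parses successfully (valid syntax, table, and privileges) but fails only at execution time with a duplicate-key error. The dept_demo table, its constraint name, and the SCOTT schema are all hypothetical:

SQL> CREATE TABLE dept_demo (deptno NUMBER CONSTRAINT dept_demo_pk PRIMARY KEY);
SQL> INSERT INTO dept_demo VALUES (10);
1 row created.
SQL> INSERT INTO dept_demo VALUES (10);
ERROR at line 1:
ORA-00001: unique constraint (SCOTT.DEPT_DEMO_PK) violated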
Module 1 – Oracle Architecture
Objectives
These notes introduce the Oracle server architecture.  The architecture includes physical components, memory components, processes, and logical structures.
Primary Architecture Components
The figure shown above details the Oracle architecture.
Oracle server:  An Oracle server includes an Oracle Instance and an Oracle database.
An Oracle database includes several different types of files:  datafiles, control files, redo log files and archive redo log files.  The Oracle server also accesses parameter files and password files.
This set of files has several purposes.
One is to enable system users to process SQL statements.
Another is to improve system performance.
Still another is to ensure the database can be recovered if there is a software/hardware failure.
The database server must manage large amounts of data in a multi-user environment.
The server must manage concurrent access to the same data.
The server must deliver high performance.  This generally means fast response times.
Oracle instance:  An Oracle Instance consists of two different sets of components:
The first component set is the set of background processes (PMON, SMON, RECO, DBW0, LGWR, CKPT, D000 and others).
These will be covered later in detail – each background process is a computer program.
These processes perform input/output and monitor other Oracle processes to provide good performance and database reliability.
The second component set includes the memory structures that comprise the Oracle instance.
When an instance starts up, a memory structure called the System Global Area (SGA) is allocated.
At this point the background processes also start.
An Oracle Instance provides access to one and only one Oracle database.
Oracle database: An Oracle database consists of files.
Sometimes these are referred to as operating system files, but they are actually database files that store the database information that a firm or organization needs in order to operate.
The redo log files are used to recover the database in the event of application program failures, instance failures and other minor failures.
The archived redo log files are used to recover the database if a disk fails.
Other files not shown in the figure include:
The required parameter file that is used to specify parameters for configuring an Oracle instance when it starts up.
The optional password file authenticates special users of the database – these are termed privileged users and include database administrators.
Alert and Trace Log Files – these files store information about errors and actions taken that affect the configuration of the database.
User and server processes:  The processes shown in the figure are called user and server processes.  These processes are used to manage the execution of SQL statements.
A Shared Server Process can share memory and variable processing for multiple user processes.
A Dedicated Server Process manages memory and variables for a single user process.
Connecting to an Oracle Instance – Creating a Session
System users can connect to an Oracle database through SQL*Plus or through an application program like the Internet Developer Suite (the program becomes the system user).  This connection enables users to execute SQL statements.
The act of connecting creates a communication pathway between a user process and an Oracle Server.  As is shown in the figure above, the User Process communicates with the Oracle Server through a Server Process.  The User Process executes on the client computer.  The Server Process executes on the server computer, and actually executes SQL statements submitted by the system user.
The figure shows a one-to-one correspondence between the User and Server Processes.  This is called a Dedicated Server connection.  An alternative configuration is to use a Shared Server where more than one User Process shares a Server Process.
Sessions:  When a user connects to an Oracle server, this is termed a session.  The session starts when the Oracle server validates the user for connection.  The session ends when the user logs out (disconnects) or if the connection terminates abnormally (network failure or client computer failure).
A user can typically have more than one concurrent session, e.g., the user may connect using SQL*Plus and also connect using Internet Developer Suite tools at the same time.  The limit of concurrent session connections is controlled by the DBA.
If a system user attempts to connect and the Oracle Server is not running, the system user receives the Oracle Not Available error message.
Physical Structure – Database Files
As was noted above, an Oracle database consists of physical files.  The database itself has:
Datafiles – these contain the organization’s actual data.
Redo log files – these contain a record of changes made to the database, and enable recovery when failures occur.
Control files – these are used to synchronize all database activities and are covered in more detail in a later module.
Other key files as noted above include:
Parameter file – there are two types of parameter files.
The init.ora file (also called the PFILE) is a static parameter file.  It contains parameters that specify how the database instance is to start up.  For example, some parameters will specify how to allocate memory to the various parts of the system global area.
The spfile.ora is a dynamic parameter file.  It also stores parameters to specify how to startup a database; however, its parameters can be modified while the database is running.
Password file – specifies which *special* users are authenticated to startup/shut down an Oracle Instance.
Archived redo log files – these are copies of the redo log files and are necessary for recovery in an online, transaction-processing environment in the event of a disk failure.
Memory Structure
The memory structures include two areas of memory:
System Global Area (SGA) – this is allocated when an Oracle Instance starts up.
Program Global Area (PGA) – this is allocated when a Server Process starts up.
System Global Area
The SGA is an area in memory that stores information shared by all database processes and by all users of the database (sometimes it is called the Shared Global Area).
This information includes both organizational data and control information used by the Oracle Server.
The SGA is allocated in memory and virtual memory.
The size of the SGA can be established by a DBA by assigning a value to the parameter SGA_MAX_SIZE in the parameter file—this is an optional parameter.
The SGA is allocated when an Oracle instance (database) is started up based on values specified in the initialization parameter file (either PFILE or SPFILE).
The SGA has the following mandatory memory structures:
Shared Pool – includes two components:
Library Cache
Data Dictionary Cache
Database Buffer Cache
Redo Log Buffer
Other structures (for example, lock and latch management, statistical data)
Additional optional memory structures in the SGA include:
Large Pool
Java Pool
Streams Pool
The SHOW SGA SQL command will show you the SGA memory allocations.  This is a recent clip of the SGA for the Oracle database at SIUE.  In order to execute SHOW SGA you must be connected with the special privilege SYSDBA (which is only available to user accounts that are members of the DBA Linux group).
SQL> connect / as sysdba
Connected.
SQL> show sga
Total System Global Area 1610612736 bytes
Fixed Size                  2084296 bytes
Variable Size             385876536 bytes
Database Buffers         1207959552 bytes
Redo Buffers               14692352 bytes
Oracle 8i and earlier versions of the Oracle Server used a Static SGA.  This meant that if modifications to memory management were required, the database had to be shutdown, modifications were made to the init.ora parameter file, and then the database had to be restarted.
Oracle 9i and 10g use a Dynamic SGA.   Memory configurations for the system global area can be made without shutting down the database instance.  The advantage is obvious.  This allows the DBA to resize the Database Buffer Cache and Shared Pool dynamically.
Several initialization parameters are set that affect the amount of random access memory dedicated to the SGA of an Oracle Instance.  These are:
SGA_MAX_SIZE:  This optional parameter is used to set a limit on the amount of virtual memory allocated to the SGA – a typical setting might be 1 GB; however, if the value for SGA_MAX_SIZE in the initialization parameter file or server parameter file is less than the sum of the memory allocated for all components, either explicitly in the parameter file or by default, at the time the instance is initialized, then the database ignores the setting for SGA_MAX_SIZE.
DB_CACHE_SIZE:  This optional parameter is used to tune the amount of memory allocated to the Database Buffer Cache in standard database blocks.  Block sizes vary among operating systems.  The DBORCL database uses 8 KB blocks.  The total blocks in the cache defaults to 48 MB on LINUX/UNIX and 52 MB on Windows operating systems.
LOG_BUFFER:   This optional parameter specifies the number of bytes allocated for the Redo Log Buffer.
SHARED_POOL_SIZE:  This optional parameter specifies the number of bytes of memory allocated to shared SQL and PL/SQL.  The default is 16 MB.  If the operating system is based on a 64 bit configuration, then the default size is 64 MB.
LARGE_POOL_SIZE:  This is an optional memory object – the size of the Large Pool defaults to zero.  If the init.ora parameter PARALLEL_AUTOMATIC_TUNING is set to TRUE, then the default size is automatically calculated.
JAVA_POOL_SIZE:   This is another optional memory object.  The default is 24 MB of memory.
The combined size of the SGA components – DB_CACHE_SIZE, LOG_BUFFER, SHARED_POOL_SIZE, LARGE_POOL_SIZE, and JAVA_POOL_SIZE, plus the fixed SGA – cannot exceed the SGA_MAX_SIZE parameter.
Memory is allocated to the SGA as contiguous virtual memory in units termed granules.  Granule size depends on the estimated total size of the SGA, which as was noted above, depends on the SGA_MAX_SIZE parameter.  Granules are sized as follows:
If the SGA is less than 128 MB in total, each granule is 4 MB.
If the SGA is greater than 128 MB in total, each granule is 16 MB.
Granules are assigned to the Database Buffer Cache and Shared Pool, and these two memory components can dynamically grow and shrink.  Using contiguous memory improves system performance.  The actual number of granules assigned to one of these memory components can be determined by querying the database view named V$BUFFER_POOL.
Granules are allocated when the Oracle server starts a database instance in order to provide memory addressing space to meet the SGA_MAX_SIZE parameter.  The minimum is 3 granules:  one each for the fixed SGA, Database Buffer Cache, and Shared Pool.  In practice, you’ll find the SGA is allocated much more memory than this.  The SELECT statement shown below shows a current_size of 1,152 granules.
SELECT name, block_size, current_size, prev_size, prev_buffers
FROM v$buffer_pool;
NAME                 BLOCK_SIZE CURRENT_SIZE  PREV_SIZE PREV_BUFFERS
——————– ———- ———— ———- ————
DEFAULT                    8192         1152          0            0
For additional information on the dynamic SGA sizing, enroll in Oracle’s Oracle10g Database Performance Tuning course.
Automatic Shared Memory Management
Prior to Oracle 10G, a DBA had to manually specify SGA Component sizes through the initialization parameters, such as SHARED_POOL_SIZE, DB_CACHE_SIZE, JAVA_POOL_SIZE, and LARGE_POOL_SIZE parameters.
Automatic Shared Memory Management enables a DBA to specify the total SGA memory available through the SGA_TARGET initialization parameter.  The Oracle Database automatically distributes this memory among various subcomponents to ensure most effective memory utilization.
The DBORCL database SGA_TARGET is set in the initDBORCL.ora file:
sga_target=1610612736
With automatic SGA memory management, the different SGA components are flexibly sized to adapt  to the SGA available.
Setting a single parameter simplifies the administration task – the DBA only specifies the amount of SGA memory available to an instance – the DBA can forget about the sizes of individual components. No out of memory errors are generated unless the system has actually run out of memory.  No manual tuning effort is needed.
The SGA_TARGET initialization parameter reflects the total size of the SGA and includes memory for the following components:
Fixed SGA and other internal allocations needed by the Oracle Database instance
The log buffer
The shared pool
The Java pool
The buffer cache
The keep and recycle buffer caches (if specified)
Nonstandard block size buffer caches (if specified)
The Streams Pool
If SGA_TARGET is set to a value greater than SGA_MAX_SIZE at startup, then the SGA_MAX_SIZE value is bumped up to accommodate SGA_TARGET.  After startup, SGA_TARGET can be decreased or increased dynamically. However, it cannot exceed the value of SGA_MAX_SIZE that was computed at startup.
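For example, the SGA can be resized on the fly as long as the new value stays within SGA_MAX_SIZE; a minimal sketch (the 1200M value is arbitrary):

SQL> ALTER SYSTEM SET sga_target = 1200M;
SQL> SELECT name, ROUND(bytes/1024/1024) AS mb, resizeable FROM v$sgainfo;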
When you set a value for SGA_TARGET, Oracle Database 10g automatically sizes the most commonly configured components, including:
The shared pool (for SQL and PL/SQL execution)
The Java pool (for Java execution state)
The large pool (for large allocations such as RMAN backup buffers)
The buffer cache
There are a few SGA components whose sizes are not automatically adjusted. The DBA must specify the sizes of these components explicitly, if they are needed by an application. Such components are:
Keep/Recycle buffer caches (controlled by DB_KEEP_CACHE_SIZE and DB_RECYCLE_CACHE_SIZE)
Additional buffer caches for non-standard block sizes (controlled by DB_nK_CACHE_SIZE, n = {2, 4, 8, 16, 32})
Streams Pool (controlled by the new parameter STREAMS_POOL_SIZE)
Shared Pool
The Shared Pool is a memory structure that is shared by all system users.  It consists of both fixed and variable structures.  The variable component grows and shrinks depending on the demands placed on memory size by system users and application programs.
Memory can be allocated to the Shared Pool by the parameter SHARED_POOL_SIZE in the parameter file.  You can alter the size of the shared pool dynamically with the ALTER SYSTEM SET command.  An example command is shown below.  You must keep in mind that the total memory allocated to the SGA is set by the SGA_TARGET parameter (and may also be limited by SGA_MAX_SIZE if it is set), and since the Shared Pool is part of the SGA, you cannot exceed the maximum size of the SGA.
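A sketch of such a command (256M is an arbitrary value, and the new total must still fit within SGA_TARGET/SGA_MAX_SIZE):

SQL> ALTER SYSTEM SET shared_pool_size = 256M;
System altered.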
The Shared Pool stores the most recently executed SQL statements and used data definitions.  This is because some system users and application programs will tend to execute the same SQL statements often.  Saving this information in memory can improve system performance.
The Shared Pool includes the Library Cache and Data Dictionary Cache.
Library Cache
Memory is allocated to the Library Cache whenever an SQL statement is parsed or a program unit is called.  This enables storage of the most recently used SQL and PL/SQL statements.
If the Library Cache is too small, the Library Cache must purge statement definitions in order to have space to load new SQL and PL/SQL statements.  Actual management of this memory structure is through a Least-Recently-Used (LRU) algorithm.  This means that the SQL and PL/SQL statements that are oldest and least recently used are purged when more storage space is needed.
The Library Cache is composed of two memory subcomponents:
Shared SQL:  This stores/shares the execution plan and parse tree for SQL statements.  If a system user executes an identical statement, then the statement does not have to be parsed again in order to execute the statement.
Shared PL/SQL Procedures and Packages:  This stores/shares the most recently used PL/SQL statements such as functions, packages, and triggers.
Data Dictionary Cache
The Data Dictionary Cache is a memory structure that caches data dictionary information that has been recently used.  This includes user account information, datafile names, table descriptions, user privileges, and other information.
The database server manages the size of the Data Dictionary Cache internally and the size depends on the size of the Shared Pool in which the Data Dictionary Cache resides.  If the size is too small, then the data dictionary tables that reside on disk must be queried often for information and this will slow down performance.
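A rough way to check how well the Data Dictionary Cache is performing is to compare gets against misses in V$ROWCACHE (a sketch; a low hit percentage suggests the Shared Pool may be undersized):

SELECT SUM(gets) AS gets,
       SUM(getmisses) AS misses,
       ROUND(100 * (1 - SUM(getmisses)/SUM(gets)), 2) AS hit_pct
FROM   v$rowcache;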
Buffer Caches
A number of buffer caches are maintained in memory in order to improve system response time.
Database Buffer Cache
The Database Buffer Cache is a fairly large memory object that stores the actual data blocks that are retrieved from datafiles by system queries and other data manipulation language commands.
A query causes a Server Process to first look in the Database Buffer Cache to determine if the requested information happens to already be located in memory – thus the information would not need to be retrieved from disk and this would speed up performance.  If the information is not in the Database Buffer Cache, the Server Process retrieves the information from disk and stores it to the cache.
Keep in mind that information read from disk is read a block at a time, not a row at a time, because a database block is the smallest addressable storage space on disk.
Database blocks are kept in the Database Buffer Cache according to a Least Recently Used (LRU) algorithm and are aged out of memory if a buffer cache block is not used in order to provide space for the insertion of newly needed database blocks.
The buffers in the cache are organized in two lists:
the write list and,
the least recently used (LRU) list.
The write list holds dirty buffers – these are buffers that hold that data that has been modified, but the blocks have not been written back to disk.
The LRU list holds free buffers, pinned buffers, and dirty buffers that have not yet been moved to the write list.  Free buffers do not contain any useful data and are available for use.  Pinned buffers are currently being accessed.
When an Oracle process accesses a buffer, the process moves the buffer to the most recently used (MRU) end of the LRU list – this causes dirty buffers to age toward the LRU end of the LRU list.
When an Oracle user process needs a data row, it searches for the data in the database buffer cache because memory can be searched more quickly than hard disk can be accessed.  If the data row is already in the cache (a cache hit), the process reads the data from memory; otherwise a cache miss occurs and data must be read from hard disk into the database buffer cache.
Before reading a data block into the cache, the process must first find a free buffer. The process searches the LRU list, starting at the LRU end of the list.  The search continues until a free buffer is found or until the search reaches the threshold limit of buffers.
Each time the user process finds a dirty buffer as it searches the LRU, that buffer is moved to the write list and the search for a free buffer continues.
When the process finds a free buffer, it reads the data block from disk into the buffer and moves the buffer to the MRU end of the LRU list.
If an Oracle user process searches the threshold limit of buffers without finding a free buffer, the process stops searching the LRU list and signals the DBW0 background process to write some of the dirty buffers to disk.  This frees up some buffers.
The block size for a database is set when a database is created and is determined by the init.ora parameter file parameter named DB_BLOCK_SIZE.  Typical block sizes are 2KB, 4KB, 8KB, 16KB, and 32KB.  The size of blocks in the Database Buffer Cache matches the block size for the database.  The DBORCL database uses an 8KB block size.
Because tablespaces that store Oracle tables can use different (non-standard) block sizes, there can be more than one Database Buffer Cache allocated to match block sizes in the cache with the block sizes in the non-standard tablespaces.
The size of the Database Buffer Caches can be controlled by the parameters DB_CACHE_SIZE and DB_nK_CACHE_SIZE to dynamically change the memory allocated to the caches without restarting the Oracle instance.
You can dynamically change the size of the Database Buffer Cache with the ALTER SYSTEM command like the one shown here:
ALTER SYSTEM SET DB_CACHE_SIZE = 96M;
You can have the Oracle Server gather statistics about the Database Buffer Cache to help you size it to achieve an optimal workload for the memory allocation.  This information is displayed in the V$DB_CACHE_ADVICE view.  In order for statistics to be gathered, you can dynamically alter the system by setting the DB_CACHE_ADVICE parameter to OFF, ON, or READY with the ALTER SYSTEM SET command.  However, gathering statistics on system performance always incurs some overhead that will slow down system performance.
SQL> ALTER SYSTEM SET db_cache_advice = ON;
System altered.
SQL> DESC V$DB_cache_advice;
Name                                      Null?    Type
----------------------------------------- -------- -------------
ID                                                 NUMBER
NAME                                               VARCHAR2(20)
BLOCK_SIZE                                         NUMBER
ADVICE_STATUS                                      VARCHAR2(3)
SIZE_FOR_ESTIMATE                                  NUMBER
SIZE_FACTOR                                        NUMBER
BUFFERS_FOR_ESTIMATE                               NUMBER
ESTD_PHYSICAL_READ_FACTOR                          NUMBER
ESTD_PHYSICAL_READS                                NUMBER
ESTD_PHYSICAL_READ_TIME                            NUMBER
ESTD_PCT_OF_DB_TIME_FOR_READS                      NUMBER
ESTD_CLUSTER_READS                                 NUMBER
ESTD_CLUSTER_READ_TIME                             NUMBER
SQL> SELECT name, block_size, advice_status FROM v$db_cache_advice;
NAME                 BLOCK_SIZE ADV
-------------------- ---------- ---
DEFAULT                    8192 ON
<more rows will display>
21 rows selected.
SQL> ALTER SYSTEM SET db_cache_advice = OFF;
System altered.
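While the advice is ON (that is, before turning it off as above), the collected estimates can be queried to see the predicted physical reads at different cache sizes; a sketch:

SELECT size_for_estimate, size_factor, estd_physical_reads
FROM   v$db_cache_advice
WHERE  name = 'DEFAULT'
AND    block_size = 8192
ORDER  BY size_for_estimate;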
KEEP Buffer Pool
This pool retains blocks in memory (data from tables) that are likely to be reused throughout daily processing.  An example might be a table containing user names and passwords or a validation table of some type.
The DB_KEEP_CACHE_SIZE parameter sizes the KEEP Buffer Pool.
RECYCLE Buffer Pool
This pool is used to store table data that is unlikely to be reused throughout daily processing – thus the data is quickly recycled.
The DB_RECYCLE_CACHE_SIZE parameter sizes the RECYCLE Buffer Pool.
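Tables are assigned to these pools through their storage clause; a sketch using hypothetical tables (app.code_lookup and app.audit_history are made-up names):

-- Small lookup table that should stay cached
ALTER TABLE app.code_lookup STORAGE (BUFFER_POOL KEEP);

-- Large, rarely re-read table whose blocks can be recycled quickly
ALTER TABLE app.audit_history STORAGE (BUFFER_POOL RECYCLE);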
Redo Log Buffer
The Redo Log Buffer memory object stores images of all changes made to database blocks.  As you know, database blocks typically store several table rows of organizational data.  This means that if a single column value from one row in a block is changed, the image is stored.  Changes include INSERT, UPDATE, DELETE, CREATE, ALTER, or DROP.
Think of the Redo Log Buffer as a circular buffer that is reused over and over.  As the buffer fills up, copies of the images are stored to the Redo Log Files that are covered in more detail in a later module.
Large Pool
The Large Pool is an optional memory structure that primarily relieves the memory burden placed on the Shared Pool.  The Large Pool is used for the following tasks if it is allocated:
Allocating space for session memory requirements from the User Global Area (part of the Server Process) where a Shared Server is in use.
Transactions that interact with more than one database, e.g., a distributed database scenario.
Backup and restore operations by the Recovery Manager (RMAN) process.
RMAN uses the Large Pool only if the DBWR_IO_SLAVES or BACKUP_TAPE_IO_SLAVES initialization parameters are set.
If the Large Pool is too small, memory allocation for backup will fail and memory will be allocated from the Shared Pool.
Parallel execution message buffers for parallel server operations.  The PARALLEL_AUTOMATIC_TUNING = TRUE parameter must be set.
The Large Pool size is set with the LARGE_POOL_SIZE parameter (dynamically modifiable with ALTER SYSTEM in recent releases).  The Large Pool does not use an LRU list to manage memory.
Java Pool
The Java Pool is an optional memory object, but is required if the database has Oracle Java installed and in use for Oracle JVM (Java Virtual Machine).  The size is set with the JAVA_POOL_SIZE parameter that defaults to 24MB.
The Java Pool is used for memory allocation to parse Java commands.
Storing Java code and data in the Java Pool is analogous to SQL and PL/SQL code cached in the Shared Pool.
Streams Pool
This cache is new to Oracle 10g.  It is sized with the parameter STREAMS_POOL_SIZE.
This pool stores data and control structures to support the Oracle Streams feature of Oracle Enterprise Edition.  Oracle Streams manages sharing of data and events in a distributed environment.
If STREAMS_POOL_SIZE is not set or is zero, memory for Oracle Streams operations is allocated from up to 10% of the Shared Pool memory.
Program Global Area
The Program Global Area is also termed the Process Global Area (PGA) and is a part of memory allocated that is outside of the Oracle Instance.  The PGA stores data and control information for a single Server Process or a single Background Process.  It is allocated when a process is created and the memory is scavenged by the operating system when the process terminates.  This is NOT a shared part of memory – one PGA to each process only.
The content of the PGA varies, but generally includes the following:
Private SQL Area:  Data for binding variables and runtime memory allocations.  A user session issuing SQL statements has a Private SQL Area that may be associated with a Shared SQL Area if the same SQL statement is being executed by more than one system user.  This often happens in OLTP environments where many users are executing and using the same application program.
Dedicated Server environment – the Private SQL Area is located in the Program Global Area.
Shared Server environment – the Private SQL Area is located in the System Global Area.
Session Memory:  Memory that holds session variables and other session information.
SQL Work Area:  Memory allocated for sort, hash-join, bitmap merge, and bitmap create types of operations.
Oracle 9i and later versions enable automatic sizing of the SQL Work Areas by setting WORKAREA_SIZE_POLICY = AUTO (the default) and PGA_AGGREGATE_TARGET = n (where n is an amount of memory established by the DBA).  Alternatively, the DBA can let Oracle 10g determine the appropriate amount of memory.
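A minimal sketch of enabling automatic PGA memory management (the target value is illustrative):

ALTER SYSTEM SET workarea_size_policy = AUTO;
ALTER SYSTEM SET pga_aggregate_target = 512M;

-- Verify the settings
SHOW PARAMETER pga_aggregate_target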
Oracle 8i and earlier required the DBA to set the following parameters to control SQL Work Area memory allocations:
SORT_AREA_SIZE.
HASH_AREA_SIZE.
BITMAP_MERGE_AREA_SIZE.
CREATE_BITMAP_AREA_SIZE.
Software Code Area
Software code areas store Oracle executable files running as part of the Oracle instance.
These code areas are static in nature and are located in privileged memory that is separate from other user programs.
The code can be installed sharable when multiple Oracle instances execute on the same server with the same software release level.
Processes
You need to understand three different types of Processes:
User Process:  Starts when a database user requests to connect to an Oracle Server.
Server Process:  Establishes the Connection to an Oracle Instance when a User Process requests connection – makes the connection for the User Process.
Background Processes:  These start when an Oracle Instance is started up.
User Process
In order to use Oracle, you must obviously connect to the database.  This must occur whether you're using SQL*Plus, an Oracle tool such as Designer or Forms, or an application program.
This generates a User Process (a memory object) that makes programmatic calls through your user interface (SQL*Plus, Integrated Developer Suite, or an application program), creating a session and causing the generation of a Server Process that is either dedicated or shared.
Server Process
As you have seen, the Server Process is the go-between for a User Process and the Oracle Instance.   In a Dedicated Server environment, there is a single Server Process to serve each User Process.  In a Shared Server environment, a Server Process can serve several User Processes, although with some performance reduction.  Allocation of server process in a dedicated environment versus a shared environment is covered in further detail in the Oracle10g Database Performance Tuning course offered by Oracle Education.
Background Processes
There are both mandatory and optional background processes that are started whenever an Oracle Instance starts up.  These background processes serve all system users.  We will cover the mandatory processes in detail.
Optional Background Process Definition:
ARCn: Archiver – One or more archiver processes copy the online redo log files to archival storage when they are full or a log switch occurs.
CJQ0:  Coordinator Job Queue – This is the coordinator of job queue processes for an instance. It monitors the JOB$ table (the table of jobs in the job queue) and starts job queue processes (Jnnn) as needed to execute jobs. The Jnnn processes execute job requests created by the DBMS_JOBS package.
Dnnn:  Dispatcher number “nnn”, for example, D000 would be the first dispatcher process – Dispatchers are optional background processes, present only when the shared server configuration is used. Shared server is discussed in your readings on the topic “Configuring Oracle for the Shared Server”.
RECO:  Recoverer – The Recoverer process is used to resolve distributed transactions that are pending due to a network or system failure in a distributed database.  At timed intervals, the local RECO attempts to connect to remote databases and automatically complete the commit or rollback of the local portion of any pending distributed transactions.  For information about this process and how to start it, see your readings on the topic “Managing Distributed Transactions”.
Of these, the ones you’ll use most often are ARCn (archiver) when you automatically archive redo log file information (covered in a later module), and RECO for recovery where the database is distributed on two or more separate physical Oracle servers, perhaps a UNIX machine and an NT machine.
DBWn (also called DBWR in earlier Oracle Versions)
The Database Writer writes modified blocks from the database buffer cache to the datafiles. Although one database writer process (DBW0) is sufficient for most systems, you can configure up to 20 DBWn processes (DBW0 through DBW9 and DBWa through DBWj) in order to improve write performance for a system that modifies data heavily.
The initialization parameter DB_WRITER_PROCESSES specifies the number of DBWn processes.
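Because DB_WRITER_PROCESSES is a static parameter, changing it requires an SPFILE update and an instance restart; a sketch (the value 4 is illustrative):

ALTER SYSTEM SET db_writer_processes = 4 SCOPE=SPFILE;
-- Takes effect after the next instance restart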
The purpose of DBWn is to improve system performance by caching writes of database blocks from the Database Buffer Cache back to datafiles.  Blocks that have been modified and that need to be written back to disk are termed “dirty blocks.”  The DBWn also ensures that there are enough free buffers in the Database Buffer Cache to service Server Processes that may be reading data from datafiles into the Database Buffer Cache.  Performance improves because by delaying writing changed database blocks back to disk, a Server Process may find the data that is needed to meet a User Process request already residing in memory!
DBWn writes to datafiles when certain events occur, for example: when a checkpoint occurs, when a server process cannot find a free buffer after scanning a threshold number of buffers, on a timeout (every three seconds), and when a tablespace is taken offline, made read-only, or placed in backup mode.
LGWR
The Log Writer (LGWR) writes contents from the Redo Log Buffer to the Redo Log File that is in use.  These are sequential writes since the Redo Log Files record database modifications based on the actual time that the modification takes place.  LGWR actually writes before the DBWn writes and only confirms that a COMMIT operation has succeeded when the Redo Log Buffer contents are successfully written to disk.  LGWR can also call the DBWn to write contents of the Database Buffer Cache to disk.  LGWR writes when a transaction commits, when the Redo Log Buffer is one-third full, every three seconds, and before DBWn writes modified buffers to disk.
SMON
The System Monitor (SMON) is responsible for instance recovery by applying entries in the online redo log files to the datafiles.  It also performs other activities as outlined below.
If an Oracle Instance fails, all information in memory not written to disk is lost.  SMON is responsible for recovering the instance when the database is started up again.  It does the following:
Rolls forward to recover data that was recorded in a Redo Log File, but that had not yet been recorded to a datafile by DBWn.  SMON reads the Redo Log Files and applies the changes to the data blocks.  This recovers all transactions that were committed because these were written to the Redo Log Files prior to system failure.
Opens the database to allow system users to logon.
Rolls back uncommitted transactions.
SMON also does limited space management.  It combines (coalesces) adjacent areas of free space in the database’s datafiles for tablespaces that are dictionary managed.
It also deallocates temporary segments to create free space in the datafiles.
PMON
The Process Monitor (PMON) is a cleanup type of process that cleans up after failed processes, such as the dropping of a user connection due to a network failure or the abnormal end (abend) of a user application program.  PMON rolls back the failed process's work, releases its locks, and frees the other resources the process was using.
CKPT
The Checkpoint (CKPT) process writes information to the database control files that identifies the point in time with regard to the Redo Log Files where instance recovery is to begin should it be necessary.  This is done, at a minimum, once every three seconds.
Think of a checkpoint record as a starting point for recovery.  DBWn will have completed writing all buffers from the Database Buffer Cache to disk prior to the checkpoint, thus those records will not require recovery.  This does the following:
Ensures modified data blocks in memory are regularly written to disk – CKPT can call the DBWn process in order to ensure this and does so when writing a checkpoint record.
Reduces Instance Recovery time by minimizing the amount of work needed for recovery since only Redo Log File entries processed since the last checkpoint require recovery.
Causes all committed data to be written to datafiles during database shutdown.
If a Redo Log File fills up and a switch is made to a new Redo Log File (this is covered in more detail in a later module), the CKPT process also writes checkpoint information into the headers of the datafiles.
Checkpoint information written to control files includes the system change number (the SCN is a number stored in the control file and in the headers of the database files that are used to ensure that all files in the system are synchronized), location of which Redo Log File is to be used for recovery, and other information.
CKPT does not write data blocks or redo blocks to disk – it calls DBWn and LGWR as necessary.
ARCn
We cover the Archiver (ARCn) optional background process in more detail because it is almost always used for production systems storing mission critical information.   The ARCn process must be used to recover from loss of a physical disk drive for systems that are “busy” with lots of transactions being completed.
When a Redo Log File fills up, Oracle switches to the next Redo Log File.  The DBA creates several of these and the details of creating them are covered in a later module.  If all Redo Log Files fill up, then Oracle switches back to the first one and uses them in a round-robin fashion by overwriting ones that have already been used – it should be obvious that the information stored on the files, once overwritten, is lost forever.
If ARCn is in what is termed ARCHIVELOG mode, then as the Redo Log Files fill up, they are individually written to Archived Redo Log Files and LGWR does not overwrite a Redo Log File until archiving has completed.  Thus, committed data is not lost forever and can be recovered in the event of a disk failure.  Only the contents of the SGA will be lost if an Instance fails.
In NOARCHIVELOG mode, the Redo Log Files are overwritten and not archived.  Recovery can only be made to the last full backup of the database files.  All committed transactions after the last full backup are lost, and you can see that this could cost the firm a lot of $$$.
When running in ARCHIVELOG mode, the DBA is responsible to ensure that the Archived Redo Log Files do not consume all available disk space!  Usually after two complete backups are made, any Archived Redo Log Files for prior backups are deleted.
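For reference, enabling ARCHIVELOG mode requires a clean mount of the database; a minimal sketch:

SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;
-- Confirm the log mode and archive destination
ARCHIVE LOG LIST;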
Logical Structure
It is helpful to understand how an Oracle database is organized in terms of a logical structure that is used to organize physical objects.
Tablespace:  An Oracle 10g database must always consist of at least two tablespaces (SYSTEM and SYSAUX), although a typical Oracle database will have multiple tablespaces.
A tablespace is a logical storage facility (a logical container) for storing objects such as tables, indexes, sequences, clusters, and other database objects.
Each tablespace has at least one physical datafile that actually stores the tablespace at the operating system level.  A large tablespace may have more than one datafile allocated for storing objects assigned to that tablespace.
A tablespace belongs to only one database.
Tablespaces can be brought online and taken offline for purposes of backup and management, except for the SYSTEM tablespace that must always be online.
Tablespaces can be in either read-only or read-write status.
Datafile:  Tablespaces are stored in datafiles which are physical disk objects.
A datafile can only store objects for a single tablespace, but a tablespace may have more than one datafile – this happens when a disk drive device fills up and a tablespace needs to be expanded, then it is expanded to a new disk drive.
The DBA can change the size of a datafile to make it smaller or larger.  The file can also grow in size dynamically as the tablespace grows.
Segment:  When logical storage objects are created within a tablespace, for example, an employee table, a segment is allocated to the object.
Obviously a tablespace typically has many segments.
A segment cannot span tablespaces but can span datafiles that belong to a single tablespace.
Extent:  Each object has one segment which is a physical collection of extents.
Extents are simply collections of contiguous disk storage blocks.  A logical storage object such as a table or index always consists of at least one extent – ideally the initial extent allocated to an object will be large enough to store all data that is initially loaded.
As a table or index grows, additional extents are added to the segment.
A DBA can add extents to segments in order to tune performance of the system.
An extent cannot span a datafile.
Block:  The Oracle Server manages data at the smallest unit in what is termed a block or data block.  Data are actually stored in blocks.
A physical block is the smallest addressable location on a disk drive for read/write operations.
An Oracle data block consists of one or more physical blocks (operating system blocks) so the data block, if larger than an operating system block, should be an even multiple of the operating system block size, e.g., if the Linux operating system block size is 2K or 4K, then the Oracle data block should be 2K, 4K, 8K, 16K, etc in size.  This optimizes I/O.
The data block size is set at the time the database is created and cannot be changed.  It is set with the DB_BLOCK_SIZE parameter.  The maximum data block size depends on the operating system.
Thus, the Oracle database architecture includes both logical and physical structures as follows:
Physical:  Control files; Redo Log Files; Datafiles; Operating System Blocks.
Logical:  Tablespaces; Segments; Extents; Data Blocks.
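The logical-to-physical mapping can be explored through the data dictionary; a sketch using the classic SCOTT.EMP demo segment (substitute your own owner and segment name):

-- Tablespaces and their datafiles
SELECT tablespace_name, file_name, bytes/1024/1024 AS mb
FROM   dba_data_files;

-- Extents allocated to one segment
SELECT extent_id, file_id, block_id, blocks
FROM   dba_extents
WHERE  owner = 'SCOTT' AND segment_name = 'EMP';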
SQL Statement Processing
SQL Statements are processed differently depending on whether the statement is a query, data manipulation language (DML) to update, insert, or delete a row, or data definition language (DDL) to write information to the data dictionary.
Processing a query:
Parse:
Search for identical statement in the Shared SQL Area.
Check syntax, object names, and privileges.
Lock objects used during parse.
Create and store execution plan.
Bind: Obtains values for variables.
Execute: Process statement.
Fetch: Return rows to user process.
Processing a DML statement:
Parse: Same as the parse phase used for processing a query.
Bind: Same as the bind phase used for processing a query.
Execute:
If the data and undo blocks are not already in the Database Buffer Cache, the server process reads them from the datafiles into the Database Buffer Cache.
The server process places locks on the rows that are to be modified. The undo block is used to store the before image of the data, so that the DML statements can be rolled back if necessary.
The data blocks record the new values of the data.
The server process records the before image to the undo block and updates the data block.  Both of these changes are made in the Database Buffer Cache.  Any changed blocks in the Database Buffer Cache are marked as dirty buffers.  That is, buffers that are not the same as the corresponding blocks on the disk.
The processing of a DELETE or INSERT command uses similar steps.  The before image for a DELETE contains the column values in the deleted row, and the before image of an INSERT contains the row location information.
Processing a DDL statement:  The execution of DDL (Data Definition Language) statements differs from the execution of DML (Data Manipulation Language) statements and queries, because the success of a DDL statement requires write access to the data dictionary.
For these statements, parsing actually includes parsing, data dictionary lookup, and execution.  Transaction management, session management, and system management SQL statements are processed using the parse and execute stages.  To re-execute them, simply perform another execute.



Instance recovery is performed in two steps: roll forward and roll back.
Cache Recovery or Rollforward:
Here, the changes recorded in the redo log files are applied to the affected blocks. This includes both committed and uncommitted data. Since undo data is protected by redo, the roll forward regenerates the undo images as well. The time required for this is proportional to the changes made in the database after the last successful checkpoint. After cache recovery, the database will be 'consistent' to the point when the crash occurred. Now the database can be opened and users can start connecting to it. The parameter RECOVERY_PARALLELISM specifies the number of processes to participate in instance or crash recovery, and we can thus speed up the roll forward.
Transaction Recovery or Rollback
The uncommitted data in the database will now be rolled back. This is coordinated by SMON, which rolls back sets of transactions in parallel (by default) using multiple server processes. SMON automatically decides when to begin parallel rollback and disperses the work among several parallel processes: process one rolls back one transaction, process two rolls back a second transaction, and so on. If a transaction is huge, Oracle begins intra-transaction recovery by dispersing the huge transaction among the slave processes: process one takes one part, process two takes another part, and so on. Parallel mode is the default and is controlled by the parameter FAST_START_PARALLEL_ROLLBACK. We can either turn it off (for serial recovery) or increase the degree of parallelism. If you change the value of FAST_START_PARALLEL_ROLLBACK, then transaction recovery will be stopped and restarted with the new implied degree of parallelism.
As mentioned earlier, user sessions are allowed to connect even before the transaction recovery is completed. If a user attempts to access a row that is locked by a terminated transaction, Oracle rolls back only those changes necessary to complete the transaction; in other words, it rolls them back on demand. Consequently, new transactions do not have to wait until all parts of a long transaction are rolled back.
This transaction recovery is required and has to be completed. We can disable transaction recovery temporarily but at some point this has to be completed. We can monitor the progress of fast-start parallel rollback by examining the V$FAST_START_SERVERS and V$FAST_START_TRANSACTIONS views.
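A sketch of monitoring rollback progress through one of these views:

SELECT state, undoblockstotal, undoblocksdone,
       undoblockstotal - undoblocksdone AS blocks_remaining
FROM   v$fast_start_transactions;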
The Fast-Start Fault Recovery feature reduces the time required for cache recovery, and makes the recovery bounded and predictable by limiting the number of dirty buffers and the number of redo records generated between the most recent redo record and the last checkpoint. With the Fast-Start Fault Recovery feature, the FAST_START_MTTR_TARGET initialization parameter simplifies the configuration of recovery time from instance or system failure. FAST_START_MTTR_TARGET specifies a target for the expected mean time to recover (MTTR), that is, the time (in seconds) that it should take to start up the instance and perform cache recovery. After FAST_START_MTTR_TARGET is set, the database manages incremental checkpoint writes in an attempt to meet that target.
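For example (the 60-second target is illustrative), the target can be set dynamically and the current estimate compared against it:

ALTER SYSTEM SET fast_start_mttr_target = 60;

SELECT target_mttr, estimated_mttr
FROM   v$instance_recovery;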
If SMON is busy doing transaction recovery, you should never attempt a shutdown abort and restart of the database; the entire recovery work done up to that point would need to be done again.
There are different modes in which you can open the database, e.g., migrate, read-only, and restricted modes.

Switchover and switchback of a standby database


Overview steps:

Step:1 Check the db name, open mode, database role of the primary and standby databases.

Step:2 Select switchover status on primary & standby db.

Step:3 Check that there are no active users connected to the databases.

Step:4 Switch the current online redo log file on primary database and verify that it has been applied in the standby database.

Step:5 Connect with primary database and initiate the switchover.

Step:6 Bounce the primary db and check the switchover status.

Step:7 Then convert the physical standby into the primary db (stop the MRP process).

Step:8 Open the new standby db in read only mode.

Step:9 Apply the redo log files in the newly created standby (start the MRP process). Check whether the logs are being applied in the new standby db.

Step:1 Check the db name, open mode, database role of the primary and standby databases.

Precheck
---------------

SELECT PROTECTION_MODE,CONTROLFILE_TYPE,REMOTE_ARCHIVE,PROTECTION_LEVEL,DATABASE_ROLE ROLE, SWITCHOVER_STATUS,DATAGUARD_BROKER,GUARD_STATUS FROM V$DATABASE;


set lines 2000
set pages 2000
col owner for a13
col username for a12
col host for a50
col username for a15
col DB_LINK for a30
col CREATED for a30

select * from dba_db_links ;
/

select owner,DB_LINK from dba_db_links;

set lines 300 pages 1000
col OWNER for a30
col JOB_NAME for a30
SELECT OWNER, JOB_NAME, START_DATE,NEXT_RUN_DATE,STATE FROM DBA_SCHEDULER_JOBS WHERE ENABLED='TRUE' AND OWNER <> 'SYS';

col JOB_ACTION for a20
 col START_DATE for a20
 col LAST_START_DATE for a20
 col LAST_RUN_DURATION for a20
 col NEXT_RUN_DATE for a30
select JOB_NAME,JOB_ACTION,REPEAT_INTERVAL,LAST_START_DATE,LAST_RUN_DURATION,NEXT_RUN_DATE from dba_scheduler_jobs where owner='SYS' and JOB_NAME like '%STAT%';


SELECT db_unique_name, open_mode, database_role FROM v$database;


SELECT PROCESS FROM V$MANAGED_STANDBY WHERE PROCESS LIKE 'MRP%';
SELECT  PROCESS, STATUS,SEQUENCE#,BLOCK#,BLOCKS, DELAY_MINS FROM V$MANAGED_STANDBY;

column destination format a35 wrap
column process format a7
column archiver format a8
column ID format 99
column mid format 99
select dest_id "ID",destination,status,target,  schedule,process,mountid  mid from v$archive_dest order by dest_id;


SELECT RECOVERY_MODE FROM V$ARCHIVE_DEST_STATUS WHERE DEST_ID=2;

select * from V$DATAGUARD_CONFIG

SELECT FS_FAILOVER_STATUS "FSFO STATUS",FS_FAILOVER_CURRENT_TARGET TARGET,FS_FAILOVER_THRESHOLD THRESHOLD,FS_FAILOVER_OBSERVER_PRESENT "OBSERVER PRESENT" FROM  V$DATABASE;


SELECT PROTECTION_MODE,CONTROLFILE_TYPE,REMOTE_ARCHIVE,PROTECTION_LEVEL,DATABASE_ROLE ROLE, SWITCHOVER_STATUS,DATAGUARD_BROKER,GUARD_STATUS FROM V$DATABASE;  

SELECT MESSAGE FROM V$DATAGUARD_STATUS;

SELECT fs_failover_observer_present, fs_failover_observer_host FROM v$database;

set lines 100 pages 999
col member format a70
select st.group#,st.sequence#,ceil(st.bytes / 1048576) mb,lf.member
from v$standby_log st,v$logfile lf
where st.group# = lf.group#
/

ALTER SESSION SET nls_date_format='DD-MON-YYYY HH24:MI:SS';
SELECT sequence#, first_time, next_time, applied FROM v$archived_log ORDER BY sequence#;

SELECT ARCH.THREAD# "Thread", ARCH.SEQUENCE# "Last Sequence Received", APPL.SEQUENCE# "Last Sequence Applied", 
(ARCH.SEQUENCE# - APPL.SEQUENCE#) "Difference" FROM (SELECT THREAD# ,SEQUENCE# FROM V$ARCHIVED_LOG WHERE 
(THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$ARCHIVED_LOG GROUP BY THREAD#)) ARCH, 
(SELECT THREAD# ,SEQUENCE# FROM V$LOG_HISTORY WHERE (THREAD#,FIRST_TIME ) 
IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$LOG_HISTORY GROUP BY THREAD#)) APPL WHERE ARCH.THREAD# = APPL.THREAD# ORDER BY 1; 

3) show parameter log_archive_config
        show parameter dg
        SELECT MESSAGE FROM V$DATAGUARD_STATUS;

4) select name,flashback_on from v$database;
5) select name from v$tempfile;
6) select value from v$diag_info where name ='Diag Trace';
7) show parameter recovery;
show parameter job
show parameter aq
8) Check crontab and the scheduler; if any stats jobs or RMAN backups are scheduled during the activity window, disable them.


Implementation
--------------

Run the following checks on the primary database (production) first; the archive destination parameters should look like this:
compatible               = "11.2.0"
  log_archive_dest_1       = "LOCATION=/db1/orafra/svfe VALID_FOR=(ALL_LOGFILES,ALL_ROLES)  DB_UNIQUE_NAME=svfe"
  log_archive_dest_2       = "SERVICE=svfedr LGWR ASYNC NOAFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=svfedr"
  log_archive_dest_state_1 = "ENABLE"


 alter system switch logfile

switchover primary to standby
--------------------------------

select SWITCHOVER_STATUS from V$DATABASE;
select name,database_role,switchover_status from v$database;
!ps -ef|grep rman



SELECT * From dba_jobs_running;


show parameter job
show parameter aq
SQL> 
SQL> alter system set job_queue_processes=0 ;

System altered.

SQL> alter system set aq_tm_processes=0 ;

System altered.

show parameter job
show parameter aq
select name,database_role,switchover_status from v$database;

alter database commit to switchover to physical standby with session shutdown
shut abort
startup mount;
or
SHUTDOWN IMMEDIATE;
STARTUP NOMOUNT;
ALTER DATABASE MOUNT STANDBY DATABASE;

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;

Archived Log entry 2278737 added for T-1.S-1136245 ID 0x109a137e LAD:1
Completed: ALTER DATABASE   MOUNT
2022-05-07T02:09:27.821104+04:00
ARC7 started with pid=42, OS id=941
Starting background process ARC8
2022-05-07T02:09:27.834239+04:00
ALTER SYSTEM SET ENCRYPTION WALLET OPEN IDENTIFIED BY *
Completed Successfully
   ALTER SYSTEM SET ENCRYPTION WALLET OPEN IDENTIFIED BY *
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION
2022-05-07T02:09:27.861561+04:00

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION


select name, open_mode, db_unique_name, database_role from v$database;

==================================

Convert standby to primary: Our primary is already converted to standby. Now it’s time to convert original standby into primary

After completing activity on primary database (production)

convert standby database to primary database

1) SELECT PROTECTION_MODE,CONTROLFILE_TYPE,REMOTE_ARCHIVE,PROTECTION_LEVEL,DATABASE_ROLE ROLE, SWITCHOVER_STATUS,DATAGUARD_BROKER,GUARD_STATUS FROM V$DATABASE;


2) select name,flashback_on from v$database;
3) select name from v$tempfile;
4) select value from v$diag_info where name ='Diag Trace';
5) show parameter recovery;
6) Check crontab and the scheduler; if any stats jobs or RMAN backups are scheduled during the activity window, disable them.


1)select SWITCHOVER_STATUS from V$DATABASE;
If required (on older versions), cancel the MRP process first:
SQL> alter database recover managed standby database cancel;
 
Terminate the current switchover to Standby that never completed fully.  
select max(sequence#) from v$archived_log where applied='YES';



2) ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN

3) shut immediate

4) startup
  or
 alter database open

select database_role from v$database;
select name, open_mode, database_role from v$database;







Convert physical standby into snapshot mode and revert back

select name,open_mode,database_role,controlfile_type from v$database;

select flashback_on from v$database;

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

alter database flashback on;

select flashback_on from v$database;

show parameter recovery
select name from v$recovery_file_dest;
show parameter db_recovery_file_dest_size


Check the recovery destination and recovery destination size parameters (db_recovery_file_dest and db_recovery_file_dest_size).



ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;

select name,open_mode,database_role,controlfile_type from v$database;

select NAME,GUARANTEE_FLASHBACK_DATABASE from v$restore_point;



shutdown immediate
startup


select name,open_mode,database_role,controlfile_type from v$database;



===============================

Revert back from Snapshot mode into physical standby
--------------
select name,open_mode,database_role,controlfile_type from v$database;
shutdown immediate
startup mount

select name,open_mode,database_role,controlfile_type from v$database;
alter database convert to physical standby;

select name,open_mode,database_role,controlfile_type from v$database;

shutdown immediate;
startup mount
select name,open_mode,database_role,controlfile_type from v$database;

select flashback_on from v$database;
alter database flashback off;

select flashback_on from v$database;

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
select name,open_mode,database_role,controlfile_type from v$database;





What is Active Data Guard



Active Data Guard enables a physical standby database to be open for read access while media recovery is being performed on it to keep it synchronized with the production database. The physical standby database is open in read-only mode while redo transport and standby apply are both active.
Active Data Guard automatically repairs corrupt blocks online by using the active standby database.
Normally, queries executed on active standby databases return up-to-date results. Due to potential delays in redo transport or standby apply, a standby database may “fall behind” its primary, which can cause results of queries on the standby to be out of date.
Active Data Guard sessions can be configured with a maximum query delay (in
seconds). If the standby exceeds the delay, Active Data Guard returns an error to the application, which can then retry the query or transparently redirect the query to the primary, as required
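The delay tolerance is configured per session on the standby with STANDBY_MAX_DATA_DELAY; a sketch (the 30-second limit is illustrative):

-- Queries in this session return an error instead of stale results
-- if the standby is more than 30 seconds behind the primary
ALTER SESSION SET STANDBY_MAX_DATA_DELAY = 30;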



What is logical standby database



In a logical standby database configuration, Data Guard SQL Apply uses redo information shipped from the primary system. However, instead of using media recovery to apply the changes (as in the physical standby database configuration), the redo data is transformed into equivalent SQL statements by using LogMiner technology. These SQL statements are then applied to the logical standby database. The logical standby database is open in read/write mode and is available for reporting purposes.
The logical standby database can offer protection at database level, schema level, or even object level.
A logical standby database can be used to perform rolling database upgrades, thereby minimizing down time when upgrading to new database patch sets or full database releases.



Oracle Background Processes



When a buffer in the database buffer cache is modified, it is marked as a dirty buffer and added to the head of the checkpoint queue, which is kept in system change number (SCN) order. This order therefore matches the order of the redo that is written to the redo logs for these changed buffers.



When the number of available buffers in the buffer cache falls below an internal threshold (to the extent that server processes find it difficult to obtain available buffers), DBWn writes infrequently used, modified (dirty) buffers to the data files from the tail of the LRU list so that processes can replace buffers when they need them. DBWn also writes from the tail of the checkpoint queue to keep the checkpoint advancing.
The SGA contains a memory structure that has the redo byte address (RBA) of the position in the redo stream where recovery should begin in the case of an instance failure. This structure acts as a pointer into the redo and is written to the control file by the CKPT process once every three seconds. Because DBWn writes dirty buffers in SCN order, and because the redo is in SCN order, every time DBWn writes dirty buffers from the checkpoint queue, it also advances the pointer held in the SGA memory structure so that instance recovery (if required) begins reading the redo from approximately the correct location and avoids unnecessary I/O. This is known as incremental checkpointing.
In all cases, DBWn performs batched (multiblock) writes to improve efficiency. The number of blocks written in a multiblock write varies by operating system.



LGWR is responsible for redo log buffer management, writing redo buffer entries from the redo log buffer to a redo log file on disk. LGWR writes all redo entries that have been copied into the buffer since the last time it wrote.
The redo log buffer is a circular buffer. When LGWR writes redo entries from the redo log buffer to a redo log file, server processes can then copy new entries over the entries in the redo log buffer that have been written to disk. LGWR normally writes fast enough to ensure that space is always available in the buffer for new entries, even when access to the redo log is heavy. LGWR writes one contiguous portion of the buffer to disk.
LGWR writes:



  • When a user process commits a transaction
  • When the redo log buffer is one-third full
  • Before a DBWn process writes modified buffers to disk (if necessary)
  • Every three seconds
Before DBWn can write a modified buffer, all redo records that are associated with the changes to the buffer must be written to disk (the write-ahead protocol). If DBWn finds that some redo records have not been written, it signals LGWR to write the redo records to disk and waits for LGWR to complete writing the redo log buffer before it can write out the data buffers. LGWR writes to the current log group. If one of the files in the group is damaged or unavailable, LGWR continues writing to other files in the group and logs an error in the LGWR trace file and in the system alert log. If all files in a group are damaged, or if the group is unavailable because it has not been archived, LGWR cannot continue to function.
When a user issues a COMMIT statement, LGWR puts a commit record in the redo log buffer and writes it to disk immediately, along with the transaction’s redo entries. The corresponding changes to data blocks are deferred until it is more efficient to write them. This is called a fast commit mechanism. The atomic write of the redo entry containing the transaction’s commit
record is the single event that determines whether the transaction has committed. Oracle Database returns a success code to the committing transaction, although the data buffers have not yet been written to disk.
What is checkpoint and its types
A checkpoint is both a concept and a mechanism. There are different types of checkpoints; the most important ones for this discussion are the full checkpoint and the incremental checkpoint.
The checkpoint position defines at which system change number (SCN) in the redo thread instance recovery would need to begin.
The SCN at which a full checkpoint occurred is stored in both the data file headers and the control file.
The SCN at which the last incremental checkpoint occurred is only stored in the control file (in a structure known as the checkpoint progress record).
The CKPT process updates the control files and the headers of all data files to record the details of the checkpoint (as shown in the graphic). The CKPT process does not write blocks to disk; DBWn always performs that work. The SCNs recorded in the file headers guarantee that all changes made to database blocks prior to that SCN have been written to disk.
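A sketch of comparing the checkpoint SCN recorded in the control file with those recorded in the data file headers:

SELECT checkpoint_change# FROM v$database;

SELECT file#, checkpoint_change#
FROM   v$datafile_header;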
What is Process Monitor and System Monitor Process
The System Monitor Process performs recovery at instance startup if necessary. SMON is also responsible for cleaning up temporary segments that are no longer in use. If any terminated transactions were skipped during instance recovery because of file-read or offline errors, SMON recovers them when the tablespace or file is brought back online. SMON checks regularly to see whether it is needed. Other processes can call SMON if they detect a need for it.
The Process Monitor Process performs process recovery when a user process fails. PMON is responsible for cleaning up the database buffer cache and freeing resources that the user process was using. For example, it resets the status of the active transaction table, releases locks, and removes the process ID from the list of active processes.
PMON periodically checks the status of dispatcher and server processes, and restarts any that have stopped running (but not any that Oracle Database has terminated intentionally). Like SMON, PMON checks regularly to see whether it is needed. It can be called if another process detects the need for it.
What is retention policy
A retention policy describes which backups will be kept and for how long.
You can set the value of the retention policy by using the RMAN CONFIGURE command.
Recovery Window Retention Policy
The best practice is to establish a period of time during which it will be possible to discover logical errors and fix the affected objects by doing a point-in-time recovery to just before the error occurred. This period of time is called the recovery window. This policy is specified in a number of days. For each data file, there must always exist at least one backup that satisfies the following condition:
SYSDATE – backup_checkpoint_time >= recovery_window
You can use the following command syntax to configure a recovery window retention policy:
RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF <days> DAYS;
where <days> is the size of the recovery window.
If you are not using a recovery catalog, you should keep the recovery window time period less than or equal to the value of the CONTROL_FILE_RECORD_KEEP_TIME parameter to prevent
the record of older backups from being overwritten in the control file. If you are using a recovery catalog, then make sure the value of CONTROL_FILE_RECORD_KEEP_TIME is greater than the
time period between catalog resynchronizations. Resynchronizations happen when you:
  • Create a backup. In this case, the synchronization is done implicitly.
  • Execute the RESYNC CATALOG command.
Recovery catalogs are covered in more detail in the lesson titled “Using the RMAN Recovery Catalog.”
Redundancy Retention Policy
If you require a certain number of backups to be retained, you can set the retention policy on the basis of the redundancy option. This option requires that a specified number of backups be cataloged before any backup is identified as obsolete. The default retention policy has a redundancy of 1, which means that only one backup of a file must exist at any given time. A
backup is deemed obsolete when a more recent version of the same file has been backed up.
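For example, to keep two backups of each file (the value is illustrative):

RMAN> CONFIGURE RETENTION POLICY TO REDUNDANCY 2;
RMAN> REPORT OBSOLETE;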
What is obsolete and expired backup
Expired backup: When the CROSSCHECK command is used to determine whether backups recorded in the repository still exist on disk or tape, if RMAN cannot locate a backup, then RMAN updates its record in the RMAN repository to EXPIRED status.
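A typical sketch of refreshing the repository and cleaning up expired records:

RMAN> CROSSCHECK BACKUP;
RMAN> CROSSCHECK ARCHIVELOG ALL;
RMAN> DELETE EXPIRED BACKUP;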
Obsolete Backup: One of the key questions every backup strategy must address is how long you want to keep a backup. Although you can specify that a backup be kept forever without becoming obsolete,
it's not common to follow such a strategy, unless you're doing it for a special reason. Instead, backups become obsolete according to the retention policy you adopt. You can specify the retention duration of backups when using RMAN in two ways.
In the first method, you can specify backup retention based on a recovery window. That is, all backups necessary to perform a point-in-time recovery to a specified past point of time will be retained by RMAN. If a backup is older than the point of time you chose, that backup will become obsolete according
to the backup retention rules.
The second way to specify the retention duration is to use a redundancy-based retention policy, under which you specify the number of backups of a file
that must be kept on disk. Any backups of a datafile greater than that number will be considered obsolete. Because obsolete backups are automatically deleted when space is needed for fresh files, you won't be running the risk of accidentally deleting necessary files.
Obsolete backups are any backups that you don't need to satisfy a configured retention policy. You may also delete obsolete backups according to any retention policy you may specify as an option to the delete obsolete command. The delete obsolete command will remove the deleted files from the backup media and mark those backups as deleted in both the control file and the recovery catalog.
The report obsolete command reports on any obsolete backups. Always run the crosscheck command first in order to update the status of the backups in the RMAN repository to that on disk and tape.
In the following example, the report obsolete command shows no obsolete
backups:
RMAN> crosscheck backup;
RMAN> report obsolete;
RMAN retention policy will be applied to the command
RMAN retention policy is set to redundancy 1
no obsolete backups found
The following execution of the report obsolete command shows that there are both obsolete backup sets and obsolete archived redo log backups. Again, run the crosscheck command before issuing the report obsolete command.
RMAN> crosscheck backup;
RMAN> report obsolete;



What is Clusterware Startup Sequence
The Oracle High Availability Services daemon (ohasd) is responsible for starting, monitoring, and restarting, in the proper order, the other local Oracle Clusterware daemons, up through the crsd daemon, which in turn manages clusterwide resources.
When a cluster node boots, or Clusterware is started on a running clusterware node, the init process starts ohasd. The ohasd process then initiates the startup of the processes in the lower, or Oracle High Availability (OHASD) stack.



  • The cssdagent process is started, which in turn starts ocssd. The ocssd process discovers the voting disk either in ASM or on shared storage, and then joins the cluster. The cssdagent process monitors the cluster and provides I/O fencing. This service was formerly provided by the Oracle Process Monitor Daemon (oprocd). A cssdagent failure may result in Oracle Clusterware restarting the node.
  • The orarootagent is started. This process is a specialized oraagent process that helps crsd start and manage resources owned by root, such as the network and the grid virtual IP address.
Node vip: The node vip is a node application (nodeapp) responsible for eliminating response delays (TCP timeouts) to client programs requesting a connection to the database. Each node vip is assigned an unused IP address.
This is usually done via DHCP but can be manually assigned. There is initially one node vip per cluster node at Clusterware startup. When a cluster node becomes unreachable, the node vip is failed over to a surviving node and redirects connection requests made to the unreachable node to a surviving node.
SCAN vip: SCAN vips or Single Client Access Name vips are part of a connection framework that eliminates dependencies on static cluster node names. This framework allows nodes to be added to or removed from the cluster without affecting the ability of clients to connect to the database. If GNS is used in the cluster, three SCAN vips are started on the member nodes using the IP addresses assigned by the DHCP server. If GNS is not used, SCAN vip addresses for the cluster can be defined in the DNS server used by the cluster nodes.
SCAN Listener: Three SCAN Listeners are started on the cluster nodes where the SCAN VIPs are started. Oracle Database 11g Release 2 and later instances register with SCAN listeners only as remote listeners.
GNS vip: If GNS is used to resolve client requests for the cluster, a single GNS vip for the cluster is started. The IP address is assigned in the GNS server used by the cluster nodes.
SCAN and Local Listeners
When a client submits a connection request, the SCAN listener listening on a SCAN IP address and the SCAN port are contacted on the client’s behalf. Because all services on the cluster are registered with the SCAN listener, the SCAN listener replies with the address of the local listener on the least-loaded node where the service is currently being offered.
Finally, the client establishes a connection to the service through the listener on the node where service is offered. All these actions take place transparently to the client without any explicit configuration required in the client.
During installation, listeners are created on nodes for the SCAN IP addresses. Oracle Net Services routes application requests to the least-loaded instance providing the service. Because SCAN addresses resolve to the cluster, rather than to a node address in the cluster, nodes can be added to or removed from the cluster without affecting the SCAN address configuration.
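A sketch of verifying how instances register with the SCAN and local listeners:

SELECT name, value
FROM   v$parameter
WHERE  name IN ('local_listener', 'remote_listener');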
Failing to Start OHAS
The first daemon to start in a Grid Infrastructure environment is OHAS. This process relies on the init process to invoke /etc/init.d/init.ohasd, which starts /etc/rc.d/init.d/ohasd, which in turn executes $GRID_HOME/ohasd.bin. Without a properly working ohasd.bin process, none of the other stack
components will start. The entry in /etc/inittab defines that /etc/init.d/init.ohasd is started at runlevels 3 and 5. Runlevel 3 in Linux usually brings the system up in networked, multi-user mode;
however, it doesn’t start X11. Runlevel 5 is normally used for the same purpose, but it also starts the graphical user interface. If the system is at a runlevel other than 3 or 5, then ohasd.bin cannot be started, and you need to use a call to init to change the runlevel to either 3 or 5. You can check
/var/log/messages for output from the scripts under /etc/rc.d/init.d/; ohasd.bin logs information into the default log file destination at $GRID_HOME/log/hostname in the ohasd/ohasd.log subdirectory.
The administrator has the option to disable the start of the High Availability Services stack by calling crsctl disable crs. This call updates a flag in /etc/oracle/scls_scr/hostname/root/ohasdstr. The file contains only one word, either enable or disable, and no carriage return. If set to disable, then
/etc/rc.d/init.d/ohasd will not proceed with the startup. Call crsctl start crs to start the cluster stack manually in that case. Many Grid Infrastructure background processes rely on sockets created in /var/tmp/.oracle. You
can check which socket is used by a process by listing the contents of the /proc/pid/fd directory, where pid is the process id of the program you are looking at. In some cases, permissions on the sockets can become garbled; in our experience, moving the .oracle directory to a safe location and rebooting solved the cluster communication problems.
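A minimal sketch of these checks (the hostname component of the path is environment-specific):

# cat /etc/oracle/scls_scr/`hostname`/root/ohasdstr   # prints enable or disable
# crsctl enable crs                                   # turn autostart back on if it was disabled
# ls -l /var/tmp/.oracle                              # sockets used by the Grid Infrastructure daemons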
Another reason ohasd.bin might fail to start: the file system for $GRID_HOME could be either corrupt or otherwise not mounted. Earlier, it was noted that ohasd.bin lives in $GRID_HOME/bin. If $GRID_HOME isn’t mounted, then it is not possible to start the daemon.
We introduced the OLR as an essential file for starting Grid Infrastructure. If the OLR has become corrupt or is otherwise not accessible, then ohasd.bin cannot start. Successful initialization of the OLR is recorded in the ohasd.log, as in the following example (the timestamps have been removed for the sake
of clarity):
[ default][3046704848] OHASD Daemon Starting. Command string :reboot
[ default][3046704848] Initializing OLR
[ OCRRAW][3046704848]proprioo: for disk 0
(/u01/app/crs/cdata/london1.olr),
id match (1), total id sets, (1) need recover (0), my votes (0),
total votes (0), commit_lsn (15), lsn (15)
[ OCRRAW][3046704848]proprioo: my id set: (2018565920, 1028247821, 0, 0, 0)
[ OCRRAW][3046704848]proprioo: 1st set: (2018565920, 1028247821, 0, 0, 0)
[ OCRRAW][3046704848]proprioo: 2nd set: (0, 0, 0, 0, 0)
[ CRSOCR][3046704848] OCR context init CACHE Level: 0xaa4cfe0
[ default][3046704848] OHASD running as the Privileged user



Interestingly, the errors pertaining to the local registry have the same numbers as those for the OCR; however, they are prefixed with PROCL. The L can easily be missed, so check carefully! If the OLR cannot be read, you will see the error messages immediately under the Initializing OLR line. This section covers two causes: the OLR is missing, or the OLR is corrupt. The first case is much easier to diagnose because, in that case, OHAS will not start:
[root@london1 ~]# crsctl check crs
CRS-4639: Could not contact Oracle High Availability Services
In the preceding example, ohasd.log will contain an error message similar to this one:
[ default][1381425744] OHASD Daemon Starting. Command string :restart
[ default][1381425744] Initializing OLR
[ OCROSD][1381425744]utopen:6m’:failed in stat OCR file/disk
/u01/app/crs/cdata/london1.olr,
errno=2, os err string=No such file or directory
[ OCROSD][1381425744]utopen:7:failed to open any OCR file/disk, errno=2,
os err string=No such file or directory
[ OCRRAW][1381425744]proprinit: Could not open raw device
[ OCRAPI][1381425744]a_init:16!: Backend init unsuccessful : [26]
[ CRSOCR][1381425744] OCR context init failure. Error: PROCL-26: Error
while accessing the physical storage Operating System
error [No such file or directory] [2]
[ default][1381425744] OLR initalization failured, rc=26
[ default][1381425744]Created alert : (:OHAS00106:) : Failed to initialize
Oracle Local Registry
[ default][1381425744][PANIC] OHASD exiting; Could not init OLR
In this case, you should restore the OLR, which you will learn how to do in the “Maintaining Voting
Disk and OCR/OLR” section.



If the OLR is corrupted, then you will see slightly different errors. OHAS tries to read the OLR; while it succeeds for some keys, it fails for others. Long hex dumps will appear in ohasd.log, indicating a problem. You should perform an ocrcheck -local in this case, which can help you determine the root cause. The following output has been taken from a system where the OLR was corrupt:
[root@london1 ohasd]# ocrcheck -local
Status of Oracle Local Registry is as follows :
Device/File Name
: /u01/app/crs/cdata/london1.olr
Device/File integrity check failed
Local registry integrity check failed
Logical corruption check bypassed



If the utility confirms that the OLR is corrupted, then you have no option but to restore it; again, see the “Maintaining Voting Disk and OCR/OLR” section for the procedure.
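For reference, a minimal restore sketch, assuming an OLR backup exists (the backup file name is a placeholder):

# ocrconfig -local -showbackup          # list available OLR backups
# ocrconfig -local -restore <backup_file_name>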






Testing ASM Disk Failure Scenario and disk_repair_time



External Redundancy -- Every piece of data is stored once; ASM does no mirroring and relies on external RAID.
     If five drives are allocated, the data is striped across all five drives. If you lose one drive, data is lost and must be restored from backup.

Normal Redundancy -- Requires two or more ASM disks and protects against a single drive outage. Each extent is mirrored on a different disk, for example:
     Disk1: A     C
     Disk2: A  B
     Disk3:    B  C
If you lose one drive, the data survives because a mirror copy exists on another disk.

High Redundancy -- Requires three or more ASM disks and protects against two simultaneous drive outages,
writing the same data three times across the disk group.

If you drop one drive in a high redundancy disk group and abruptly shut down the machine, the disk group will not mount automatically,
because the disk group is incomplete (one disk is missing); it can only be mounted with the FORCE option, as shown below.
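For illustration, the redundancy level is fixed when the disk group is created; the disk names and failgroups below are only examples:

SQL> create diskgroup data_ext external redundancy disk 'ORCL:DISK1';

SQL> create diskgroup data_norm normal redundancy
  2    failgroup fg1 disk 'ORCL:DISK1','ORCL:DISK2'
  3    failgroup fg2 disk 'ORCL:DISK3','ORCL:DISK4';

SQL> create diskgroup data_high high redundancy
  2    failgroup fg1 disk 'ORCL:DISK1'
  3    failgroup fg2 disk 'ORCL:DISK2'
  4    failgroup fg3 disk 'ORCL:DISK3';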


For more details  https://youtu.be/GNcU0NudlJ0
    


When a disk failure occurs for an ASM disk, the behavior of ASM differs based on the redundancy of the disk group involved. If the disk group has EXTERNAL REDUNDANCY, it keeps working as long as there is redundancy at the external RAID level. If there is no RAID at the external level, the disk group is immediately dismounted; the disk needs to be repaired or replaced, the disk group might then need to be dropped and re-created, and the data on it would require recovery.


For NORMAL and HIGH redundancy disk groups, the behavior is a little different. When a disk gets corrupted or goes missing in a NORMAL/HIGH redundancy disk group, an error is reported in the alert log file and the disk becomes OFFLINE, as we can see in the output of the query below, run after I started my test of an ASM disk failure. I simply pulled out a disk that belonged to an ASM disk group with NORMAL redundancy.
col name format a8
col header_status format a7
set lines 2000
col path format a10
select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;


NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN         1200 OFFLINE MISSING
  
Here we see the value “1200” under the REPAIR_TIMER column; this is the time in seconds after which the disk will be dropped automatically. It is derived from a disk group attribute called DISK_REPAIR_TIME, which I discuss below.

In 10g, if a disk went missing, it was dropped immediately and a REBALANCE operation kicked in right away, whereby ASM started redistributing the ASM extents across the remaining disks in the disk group to restore redundancy.


DISK_REPAIR_TIME

Starting with 11g, Oracle provides a disk group attribute called DISK_REPAIR_TIME, with a default value of 3.6 hours. This means that if a disk goes missing, the disk is not dropped immediately; ASM waits this long for the disk to come back online or be replaced. This helps in scenarios where a disk is accidentally unplugged, or a storage server/SAN is disconnected or rebooted, leaving an ASM disk group without one or more disks. While the disk(s) remain unavailable, ASM keeps track of the extents that would have been written to the missing disk(s), and writes them out as soon as the missing disk(s) come back online (this feature is called fast mirror resync). If the disk(s) do not come back online within the DISK_REPAIR_TIME threshold, they are dropped and a rebalance starts.


FAILGROUP_REPAIR_TIME

Starting with 12c, another attribute can be set for the disk group: FAILGROUP_REPAIR_TIME, which has a default value of 24 hours. This attribute is similar to DISK_REPAIR_TIME but applies to a whole failure group. In Exadata, all disks belonging to a storage server typically form one failgroup (so that a mirror copy of an extent is never written to a disk in the same storage server), and this attribute is quite handy in an Exadata environment when a complete storage server is taken down for maintenance or some other reason.
The following shows how to query and set the disk group attributes explained above.
SQL> col name format a30
SQL> select name,value from v$asm_attribute where group_number=3 and name like '%repair_time%';


NAME                           VALUE
------------------------------ --------------------
disk_repair_time               3.6h
failgroup_repair_time          24.0h


SQL> alter diskgroup data set attribute 'disk_repair_time'='1h';


Diskgroup altered.


SQL>  alter diskgroup data set attribute  'failgroup_repair_time'='10h';


Diskgroup altered.


SQL> select name,value from v$asm_attribute where group_number=3 and name like '%repair_time%';


NAME                           VALUE
------------------------------ --------------------
disk_repair_time               1h
failgroup_repair_time          10h


ORA-15042

If a disk is offline or missing from an ASM disk group, ASM may not mount the disk group automatically during instance restart. In this case, we may need to mount the disk group manually with the FORCE option.
SQL> alter diskgroup data mount;
alter diskgroup data mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "3" is missing from group number "2"


SQL> alter diskgroup data mount force;


Diskgroup altered.


Monitoring the REPAIR_TIME

After a disk goes offline, the clock starts ticking, and the value of REPAIR_TIMER can be monitored to see how much time remains to make the disk available again before it is automatically dropped.
SQL> select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;


NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN          649 OFFLINE MISSING


--We can confirm that no rebalance has started yet by using the following query
SQL> select * from v$asm_operation;


no rows selected


If we are able to make the disk available again, or replace it, before DISK_REPAIR_TIME lapses, we can bring the disk back online. Please note that we need to bring it ONLINE manually.
SQL> alter diskgroup data online disk data4;


Diskgroup altered.


select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;


NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN          465 SYNCING CACHED


--Syncing is in progress, and hence no rebalance would occur.


SQL> select * from v$asm_operation;


no rows selected
-- After some time, everything would become normal.


select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;


NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4    ORCL:DATA4 NORMAL   MEMBER             0 ONLINE  CACHED




If the same disk cannot be made available or replaced, either ASM auto-drops the disk after DISK_REPAIR_TIME has lapsed, or we drop the ASM disk manually. A rebalance occurs after the disk drop.
Since the disk status is OFFLINE, we need to use the FORCE option to drop it. After the drop, the rebalance starts and can be monitored from the V$ASM_OPERATION view.
SQL> alter diskgroup data drop disk data4;
alter diskgroup data drop disk data4
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15084: ASM disk "DATA4" is offline and cannot be dropped.




SQL> alter diskgroup data drop disk data4 force;


Diskgroup altered.


select group_number,operation,pass,state,power,sofar,est_work from v$asm_operation;


GROUP_NUMBER OPERA PASS      STATE POWER SOFAR EST_WORK
------------ ----- --------- ----- ----- ----- --------
           2 REBAL RESYNC    DONE      9     0        0
           2 REBAL REBALANCE DONE      9    42       42
           2 REBAL COMPACT   RUN       9     1        0


Later we can replace the faulty disk and then add the new disk back into the disk group. Adding the disk back initiates a rebalance once again.
SQL> alter diskgroup data add disk 'ORCL:DATA4';


Diskgroup altered.


SQL> select * from v$asm_operation;


select group_number,operation,pass,state,power,sofar,est_work from v$asm_operation;


GROUP_NUMBER OPERA PASS      STATE POWER SOFAR EST_WORK
------------ ----- --------- ----- ----- ----- --------
           2 REBAL RESYNC    DONE      9     0        0
           2 REBAL REBALANCE RUN       9    37     2787
           2 REBAL COMPACT   WAIT      9     1        0



https://www.oraclenext.com/2018/01/testing-asm-disk-failure-scenario-and.html




Two cell nodes rebooting in Exadata -- NORMAL vs. HIGH redundancy

HIGH redundancy
HIGH redundancy results in three copies of data being written to storage on three separate
Exadata Storage Servers. It provides the greatest protection, tolerating the simultaneous loss
of up to two entire cells.
HIGH redundancy is generally recommended for disk groups that house critical databases.
This is especially the case for disk groups based on high capacity cells because the amount
of time required to restore redundancy with high capacity disks (8 TB) increases the potential
for a second failure to occur before the first failure is completely dealt with.
One cost associated with this protection is that the overall storage capacity of Exadata is
effectively reduced by one third compared with NORMAL redundancy. For example, a Full
Rack Database Machine with high capacity disks has a raw disk capacity of 1344 TB. With
HIGH redundancy applied the usable capacity becomes approximately 400 TB compared to
approximately 600 TB using NORMAL redundancy.
Another cost is that all writes to HIGH redundancy disk groups must be performed on three
separate storage servers. This means that 50% more I/O bandwidth is consumed by writes on
HIGH redundancy disk groups compared with writes to NORMAL redundancy disk groups.
Extra I/O is not required for read operations. So the balance of application reads and writes,
coupled with the I/O bandwidth requirements of the application should be considered when
selecting the appropriate protection level.


NORMAL redundancy:
Double mirroring across two separate storage servers
Provides one layer of redundancy
No additional protection if one cell is offline
Requires extra time and effort to maintain redundancy through planned maintenance

NORMAL redundancy results in two copies of data being written to storage on two separate
Exadata Storage Servers. As already highlighted, NORMAL redundancy has lower costs in
terms of storage capacity and I/O bandwidth consumption for write operations.
Using NORMAL redundancy on Exadata provides good data protection, seamlessly
tolerating the loss of up to one entire cell. However, note that there are situations where data
protection may be compromised using NORMAL redundancy.

Firstly, consider that some maintenance operations, such as cell patching, require that a cell
is taken offline for a period of time. During that time, data availability is maintained using the
remaining copy. However, what happens if a failure affects the only remaining copy? The disk
group would be taken offline, disrupting or possibly even terminating database processing.
This situation can be protected against by removing the cell from the disk group and
rebalancing the data onto other cells prior to taking the cell offline.

However, this operation requires a substantial amount of time and would consume a
substantial amount of I/O bandwidth. It would also require sufficient free space to be available
on the remaining cells.

Similarly, using NORMAL redundancy, the simultaneous loss of two disks on different storage
servers may result in some portion of a database becoming unavailable. While this is a highly
unlikely scenario, it is far more likely than suffering a comparable interruption using HIGH
redundancy.

Free space management


When a failure occurs, ASM requires free space in the disk group to re-create the lost data extents
in order to preserve redundancy. The amount of free space required depends on the amount
of storage affected by the failure. Oracle recommends that customers consider the possibility
of losing an entire cell when determining the amount of free space that should usually be
maintained. Note that the ASM redundancy level (HIGH or NORMAL) and the Exadata model
being used have a profound bearing on the amount of free space required to
maintain ASM redundancy.
For example, on a Full Rack Database Machine, with 14 cells, the failure of a cell requires the
contents of the failed cell to be redistributed across the 13 surviving cells. This means that
1/14 (a little over 7%) of the overall capacity needs to be reserved as free space in order to
preserve redundancy if a cell is lost.
By contrast, on a Quarter Rack Database Machine, with only 3 cells, you require at least 1/3
of the total capacity to be free to preserve NORMAL redundancy if one cell becomes
unavailable.
Furthermore, on a Quarter Rack Database Machine using HIGH redundancy, it is impossible
to preserve redundancy if a cell is lost since a Quarter Rack only contains three cells to start
with. However in this case you could choose to continue operations with the remaining two
cells until the third one is replaced.
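A simple way to check whether a disk group currently holds enough free space to survive such a failure is to compare FREE_MB with REQUIRED_MIRROR_FREE_MB (USABLE_FILE_MB already has that reserve subtracted):

SQL> select name, total_mb, free_mb, required_mirror_free_mb, usable_file_mb
  2  from v$asm_diskgroup;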

Quorum Disks on Database Servers

Oracle Clusterware requires five voting disks in different failure groups when a high redundancy
disk group is used to store the cluster voting disks. Consequently, in earlier releases,
the voting disks were always created on a normal redundancy disk group on Exadata clusters
with 4 or fewer storage servers. In such cases, if two cells become unavailable, the cluster is
forced to shut down, even if the data is protected in other high redundancy disk groups.
Quorum disks enable users to leverage the disks on database servers to achieve higher
availability in smaller Exadata configurations. Quorum disks are created on the database
servers and can be used to store cluster voting disks, the cluster registry (OCR), and
database server parameter files (spfiles).
For new systems, Oracle Exadata Deployment Assistant automatically configures quorum
disks during deployment if the required conditions are met. The main requirements are:
• You are configuring an Exadata cluster with 4 or fewer storage servers.
• High redundancy is specified for the DATA or RECO disk group (or both).
• The minimum required software versions are being used, including Exadata Storage
Server release 12.1.2.3.0 and Grid Infrastructure version 12.1.0.2.160119.



How does the database server interact with the storage servers when a command is issued?

Oracle Database communicates with Exadata cells by using a special library called
LIBCELL ($ORACLE_HOME/lib/libcell11.so or
$ORACLE_HOME/lib/libcell12.so). In combination with the database kernel and
ASM, LIBCELL transparently maps database I/O operations to Exadata Storage Server
enhanced operations. LIBCELL communicates with Exadata cells by using the
Intelligent Database protocol (iDB). iDB is a unique Oracle data transfer protocol, built
on Reliable Datagram Sockets (RDS), which runs on industry standard InfiniBand
networking hardware. LIBCELL and iDB enable ASM and database instances to utilize
Exadata Storage Server features, such as Smart Scan and I/O Resource Management.

The Database Resource Manager (DBRM) is integrated with Exadata Storage Server
I/O Resource Management (IORM). DBRM and IORM work together to ensure that I/O
resources are allocated based on administrator-defined priorities.

Diskmon checks the storage network interface state and cell liveness. It also performs
DBRM plan propagation to Exadata cells. Diskmon uses a node-wide master process
(diskmon.bin) and one slave process (DSKM) for each RDBMS or ASM instance.
The master performs the monitoring and propagates state information to the slaves. The
slaves use the SGA to communicate with the RDBMS or ASM processes. If there is a
failure in the cluster, Diskmon performs I/O fencing to protect data integrity. Cluster
Synchronization Services (CSS) decides what to fence. The master Diskmon starts with
the clusterware processes. The slave Diskmon processes are background processes
that are started and stopped in conjunction with the associated RDBMS or ASM
instance.

The Management Server (MS) on a database server provides a set of management and
configuration functions. It works in cooperation with the DBMCLI command-line
interface. Each database server is individually managed with DBMCLI. DBMCLI can be
used only from within a database server to manage that server. However, you can run
the same DBMCLI command remotely on multiple nodes with the dcli utility. In
addition, MS is responsible for sending alerts, and collects some statistics.
The Restart Server (RS) is used to start up or shut down MS and monitors MS to
automatically restart it if required.
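As a small sketch of the dcli usage mentioned above (the group file dbs_group, listing the database server hostnames, is an assumption):

# dcli -g dbs_group -l root "dbmcli -e list dbserver"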

Oracle Database and ASM processes use
LIBCELL to communicate with CELLSRV, and LIBCELL converts I/O requests into
messages that are sent to CELLSRV by using the iDB protocol.


To illustrate how SQL processing is performed in a traditional architecture, consider the
following example of a table scan.


1. The client issues a SELECT statement with a predicate to filter a table and return only
the rows of interest to the user.
2. The database kernel maps this request to the file and extents containing the table.
3. The database kernel issues the I/Os to read all the table blocks.
4. All the blocks for the table being queried are read into memory.
5. SQL processing is conducted against the data blocks searching for the rows that satisfy
the predicate.
6. The required rows are returned to the client.
As is often the case with large queries, the predicate filters out most of the rows in the
table. Yet all the blocks from the table need to be read, transferred across the storage
network, and copied into memory. Many more rows are read into memory than required to
complete the requested SQL operation. This generates a large amount of unproductive I/O,
which wastefully consumes resources and impacts application throughput and response time.





On Exadata, database operations are handled differently. Queries that perform table scans
can be processed within Exadata cells and return only the required subset of data to the
database server. Row filtering, column filtering, some join processing, and other functions can
be performed within Exadata cells. Exadata Storage Server uses a special direct-read
mechanism for Smart Scan processing.

1. The client issues a SELECT statement to return some rows of interest.
2. The database kernel determines that the data is stored on Exadata cells so an iDB
command representing the SQL command is constructed and sent to the Exadata cells.
3. The Exadata Storage Server software scans the data blocks to extract the relevant rows
and columns which satisfy the SQL command.
4. Exadata cells return to the database instance iDB messages containing the requested
rows and columns of data. These results are not block images, so they are not stored in
the buffer cache.
5. The database kernel consolidates the result sets from across all the Exadata cells. This
is similar to how the results from a parallel query operation are consolidated.
6. The rows are returned to the client.

Moving SQL processing off the database server frees server CPU cycles and eliminates a
massive amount of unproductive I/O transfers. These resources are free to better service
other requests. Queries run faster, and more of them can be processed.
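To verify that Smart Scan is actually being used, you can check the offload statistics (a minimal sketch, assuming an Exadata environment; the statistic names below come from V$SYSSTAT):

SQL> select name, round(value/1024/1024) mb
  2  from v$sysstat
  3  where name in ('cell physical IO bytes eligible for predicate offload',
  4                 'cell physical IO interconnect bytes returned by smart scan');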




How to add a disk to ASM



How to add a disk to an ASM disk group running on a production server
We have a database running on ASM; after two years we faced a shortage of space.
We now plan to add a disk to the ASM disk group DATAGROUP.
SQL> @asm
NAME                 TOTAL_GB                FREE_GB
------------------------------ ---------- ----------
DATAGROUP            249.995117             15.2236328
IDXGROUP             149.99707              10.4892578
The steps are below.
1) Create a partition on disk /dev/sdm, the new LUN we received from storage
[root@rac-node1 ~]# fdisk -l /dev/sdm
Disk /dev/sdm: 85.8 GB, 85899345920 bytes
255 heads, 63 sectors/track, 10443 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/sdm doesn't contain a valid partition table
[root@rac-node1 ~]# fdisk /dev/sdm
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

The number of cylinders for this disk is set to 10443.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
 (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Command (m for help): n
Command action
 e extended
 p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-10443, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-10443, default 10443):
Using default value 10443
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
[root@rac-node1 ~]# fdisk -l /dev/sdm
Disk /dev/sdm: 85.8 GB, 85899345920 bytes
255 heads, 63 sectors/track, 10443 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdm1 1 10443 83883366 83 Linux
[root@rac-node1 ~]#
2) Configure the disk /dev/sdm1 for ASM, giving it the label DATA5
[root@rac-node1 ~]# /etc/init.d/oracleasm createdisk DATA5 /dev/sdm1
Marking disk "DATA5" as an ASM disk: [ OK ]
[root@rac-node1 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks: [ OK ]
[root@rac-node1 ~]# /etc/init.d/oracleasm listdisks
DATA3
DATA4
DATA5
DISK1
DISK2
INDEX2
INDEX5
[root@rac-node1 ~]#
Scandisks on RAC -node2
[root@rac-node2 ~]# /etc/init.d/oracleasm scandisks
Scanning the system for Oracle ASMLib disks: [ OK ]
[root@rac-node2 ~]# /etc/init.d/oracleasm listdisks
DATA3
DATA4
DATA5
DISK1
DISK2
INDEX2
INDEX5
Add the disk to /etc/sysconfig/rawdevices:
[root@rac-node2 bin]vi /etc/sysconfig/rawdevices
/dev/raw/raw6 /dev/sdm1    ==> add this to rawdevices file
And add an entry to /etc/rc.local so the permissions are set again on reboot:
[root@rac-node2 bin]#vi /etc/rc.local
chmod 660 /dev/raw/raw6


Check the disk status 
SQL> set linesize 9999
SQL> SELECT NVL(a.name, '[CANDIDATE]') disk_group_name
  2       , b.path disk_file_path
  3       , b.name disk_file_name
  4       , b.failgroup disk_file_fail_group
  5    FROM v$asm_diskgroup a RIGHT OUTER JOIN v$asm_disk b USING (group_number)
  6   ORDER BY a.name
  7  /
DISK_GROUP_NAME DISK_FILE_PATH DISK_FILE_NAME DISK_FILE_FAIL_GROUP
--------------- -------------- -------------- --------------------
DATAGROUP       ORCL:DISK1     DISK1          DISK1
DATAGROUP       ORCL:INDEX5    INDEX5         INDEX5
DATAGROUP       ORCL:DATA4     DATA4          DATA4
DATAGROUP       ORCL:DATA3     DATA3          DATA3
IDXGROUP        ORCL:DISK2     DISK2          DISK2
IDXGROUP        ORCL:INDEX2    INDEX2         INDEX2
[CANDIDATE]     ORCL:DATA5                                          ==> this is the new disk
7 rows selected.
3) Add disk DATA5 to diskgroup DATAGROUP
SQL> alter diskgroup DATAGROUP ADD DISK 'ORCL:DATA5' ;
Diskgroup altered.
Check disk status again
SQL> SELECT NVL(a.name, '[CANDIDATE]') disk_group_name
  2       , b.path disk_file_path
  3       , b.name disk_file_name
  4       , b.failgroup disk_file_fail_group
  5    FROM v$asm_diskgroup a RIGHT OUTER JOIN v$asm_disk b USING (group_number)
  6   ORDER BY a.name;
DISK_GROUP_NAME DISK_FILE_PATH DISK_FILE_NAME DISK_FILE_FAIL_GROUP
--------------- -------------- -------------- --------------------
DATAGROUP       ORCL:INDEX5    INDEX5         INDEX5
DATAGROUP       ORCL:DATA4     DATA4          DATA4
DATAGROUP       ORCL:DISK1     DISK1          DISK1
DATAGROUP       ORCL:DATA5     DATA5          DATA5
DATAGROUP       ORCL:DATA3     DATA3          DATA3
IDXGROUP        ORCL:INDEX2    INDEX2         INDEX2
IDXGROUP        ORCL:DISK2     DISK2          DISK2
7 rows selected. DATA5 is no longer a candidate; it now belongs to DATAGROUP.

SQL> host cat script/asm.sql
select name,TOTAL_MB/1024 total_gb,free_mb/1024 FREE_GB from v$asm_diskgroup;
NAME             TOTAL_GB        FREE_GB
------------------------------ ---------- ----------
DATAGROUP       329.992188      95.21875
IDXGROUP        149.99707       10.4892578
SQL>
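Optionally, the rebalance triggered by the ADD DISK can be sped up and monitored (the power value 8 here is just an example):

SQL> alter diskgroup DATAGROUP rebalance power 8;
SQL> select operation, state, power, sofar, est_work, est_minutes from v$asm_operation;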
 
 
 

Configuring three IPs for SCAN listener in Oracle 11gR2 

SCAN

 

The benefit of using the SCAN is that the connection information of the client does not need to change if you add or remove nodes in the cluster."

"Having a single name to access the cluster enables the client to use the EZConnect client and the simple JDBC thin URL to access any Database running in the cluster, independent of the active servers in the cluster. The SCAN provides load balancing and failover for client connections to the Database. The SCAN works as a cluster alias for Databases in the cluster."
"...provide information on what services are being provided by the instance, the current load, and a recommendation on how many incoming connections should be directed to the instance."

"each pair of resources (SCAN VIP and Listener) will be started on a different server in the cluster, assuming the cluster consists of three or more nodes."
"In case, a 2-node-cluster is used (for which 3 IPs are still recommended for simplification reasons), one server in the cluster will host two sets of SCAN resources under normal operations.

 

 

How to configure SCAN listener with DNS?

About SCAN( Single Client Access Name) Listener in Oracle 11gR2 RAC:
Single Client Access Name (SCAN) is a new Oracle Real Application Clusters (RAC) 11g Release 2 feature that provides a single name for clients to access Oracle Databases running in a cluster. The benefit is that the client’s connect information does not need to change if you add or remove nodes in the cluster. Having a single name to access the cluster allows clients to use the EZConnect client and the simple JDBC thin URL to access any database running in the cluster, independently of which server(s) in the cluster the database is active. SCAN provides load balancing and failover for client connections to the database. The SCAN works as a cluster alias for databases in the cluster.
Why should three IPs be configured for the SCAN listener through DNS (Domain Name Server)?

If we configure 3 IPs for the SCAN listener through DNS, then if any one SCAN IP fails, connections fail over to another running IP. Another benefit is that all client access is resolved through DNS as well.
Failover becomes a bit of a problem if there is only one SCAN listener. Let's assume the node where that one SCAN listener is running dies. Grid Infrastructure will start the SCAN listener on some surviving node, but it takes some time for GI to detect the failure and restart the listener. During this time, applications will not be able to connect, so you'd lose high availability. Also consider the scenario where the SCAN listener fails for some reason but the node it was running on is still operational. In that case, the SCAN listener will not be restarted anywhere. You want more than one for high availability.
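With the three SCAN IPs registered in DNS, a client needs nothing more than an EZConnect string; for example (the service name HRSRV here is illustrative):

sqlplus scott/tiger@//hrdbscan.hrapplication.com:1521/HRSRV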

There are three ways we can configure our RAC environment:
1) Non-DNS (IP-based RAC configuration; only 1 SCAN IP will work)
2) DNS (a minimum of 3 SCAN IPs is required)
3) GNS (which uses DHCP)
 

Before Configuration:

$ srvctl config scan
SCAN name: prddbscan, Network: 1/101.10.1.1/255.255.255.192/en0
SCAN VIP name: scan1, IP: /prddbscan/101.10.1.4
$

Starting Configuration:
Step:1  - add three IPs in DNS , e.g.,

101.10.1.5            hrdbscan.hrapplication.com
101.10.1.6            hrdbscan.hrapplication.com
101.10.1.7            hrdbscan.hrapplication.com

Step:2 - Stop all node scan listeners

$ srvctl stop scan_listener

Step:3 - Create the file below in the /etc location, adding the domain name

# vi /etc/resolv.conf
search domain hrapplication.com
nameserver      101.10.9.9

Note : The DNS/AD server domain is "hrapplication.com" and its IP is 101.10.9.9

Step: 4 - Verify with nslookup on all nodes - every node should show the three configured IPs

# nslookup hrdbscan.hrapplication.com
Server:         101.10.9.9
Address:        101.10.9.9#53

Name:   hrdbscan.hrapplication.com
Address: 101.10.1.7
Name:   hrdbscan.hrapplication.com
Address: 101.10.1.6
Name:   hrdbscan.hrapplication.com
Address: 101.10.1.5

Note: If your DNS server does not return a set of 3 IPs as shown above or does not round-robin, ask your network administrator to enable such a setup. DNS using a round-robin algorithm on its own does not ensure failover of connections. However, the Oracle Client typically handles this. It is therefore recommended that the minimum version of the client used is the Oracle Database 11g Release 2 client.

Step: 5 - modify scan
#./srvctl modify scan -n hrdbscan.hrapplication.com
#./srvctl modify scan_listener -u

-- again verify

# ./srvctl config scan 

Step: 6 - start the scan listener

#./srvctl start scan_listener
#./srvctl status scan_listener

Step: 7 - Now stop the cluster services and start them again for the change to take effect

./crsctl stop crs -- one by one node

./crsctl start crs

./crsctl stat res -t
./crsctl check crs

Step: 8 - check the services

./crsctl stat res -t

HOW CONNECTION LOAD BALANCING WORKS USING SCAN

For clients connecting using Oracle SQL*Net 11g Release 2, three IP addresses will be received by the client by resolving the SCAN name through DNS as discussed. The client will then go through the list it receives from the DNS and try connecting through one of the IPs received. If the client receives an error, it will try the other addresses before returning an error to the user or application. This is similar to how client connection failover works in previous releases when an address list is provided in the client connection string.

When a SCAN Listener receives a connection request, the SCAN Listener will check for the least loaded instance providing the requested service. It will then re-direct the connection request to the local listener on the node where the least loaded instance is running. Subsequently, the client will be given the address of the local listener. The local listener will finally create the connection to the database instance.


Case 1: Unable to start database instances after starting all cluster services:

When I tried to start the database instances after starting the cluster services on all nodes, I got the error below.

SQL> startup nomount;
ORA-00119: invalid specification for system parameter REMOTE_LISTENER
ORA-00132: syntax error or unresolved network name 'hrapplication.com:1521'

After some verification, I found that somebody had commented out the lines below during a RAM upgrade.

# vi /etc/resolv.conf
#search domain hrapplication.com
#nameserver      101.10.9.9

I uncommented them as below, then started the database services, and everything was fine.

# vi /etc/resolv.conf
search domain hrapplication.com
nameserver      101.10.9.9




===



https://blogs.oracle.com/optimizer/how-to-generate-a-useful-sql-execution-plan

To check what execution plan a query used yesterday or some days back:

Find the SQL_ID of the query, then use the queries below to see its historical plans along with timestamps:

select SQL_ID, PLAN_HASH_VALUE, OPERATION, COST, TIMESTAMP from DBA_HIST_SQL_PLAN where SQL_ID='g9ks2772vgzdy';
select SAMPLE_TIME, SESSION_ID, SQL_PLAN_HASH_VALUE from DBA_HIST_ACTIVE_SESS_HISTORY where SQL_ID='9hha8mvkmmjfd' order by 1;
DBA_HIST_SQL_PLAN displays the execution plan information for each child cursor in the workload repository.
This view captures information from V$SQL_PLAN and is used with the DBA_HIST_SQLSTAT view.

SQL_ID (VARCHAR2(13)) - SQL identifier of the parent cursor in the library cache.
PLAN_HASH_VALUE (NUMBER) - Numerical representation of the SQL plan for the cursor.
Comparing one PLAN_HASH_VALUE to another easily identifies whether or not two plans are the same (rather than comparing the two plans line by line).

COST (NUMBER) - Cost of the operation as estimated by the optimizer's cost-based approach. For statements that use the rule-based approach, this column is null.
CARDINALITY (NUMBER) - Estimate, by the cost-based optimizer, of the number of rows produced by the operation.

A query is picking a bad execution plan; how do we fix it to use the good execution plan that is available (using Oracle baselines)?

Because SQL plans can change whenever we make changes to the instance, any SQL statement can have a history of execution plans.
Here are some common actions that will change execution plans:

    Enabling dynamic sampling

    Table and index DDL (e.g., dropping an index)

    Re-analyze the schema statistics (done automatically, starting in 10g)

    Changing an optimizer parameter (optimizer_mode, optimizer_index_cost_adj)

    Enabling parallelism

This execution plan history can be seen by running scripts against the STATSPACK (stats$sql_plan and stats$sql_plan_usage) or the AWR (dba_hist_sql_plan) tables.

Once you have acquired the SQL_ID for a statement, you can use the built-in dbms_xplan.display_awr procedure to see all the different execution plans for the query.

The dbms_xplan.display_awr function allows you to input only a SQL_ID, and Oracle will show you all of the explain plans recorded for that SQL_ID in the AWR.
This simple query will show changes to SQL explain plan history, if you have purchased the extra-cost licenses for AWR.

  select * from TABLE(DBMS_XPLAN.DISPLAY_AWR('&sqlid'));


Using Oracle baselines, we can fix the SQL plan for a given SQL_ID:
                                              
SQL plan management is a preventative mechanism that records and evaluates the execution plans of SQL statements over time.
This mechanism can build a SQL plan baseline, which is a set of accepted plans for a SQL statement. The accepted plans have been proven to perform well.
The goal of SQL plan baselines is to preserve the performance of corresponding SQL statements, regardless of changes in the database.
Examples of changes include:

    New optimizer version
    Changes to optimizer statistics and optimizer parameters
    Changes to schema and metadata definitions
    Changes to system settings
    SQL profile creation

SQL plan baselines cannot help in cases where an event has caused irreversible execution plan changes, such as dropping an index.
The SQL tuning features of Oracle Database generate SQL profiles that help the optimizer to produce well-tuned plans. However,
this mechanism is reactive and cannot guarantee stable performance when drastic database changes occur.
SQL tuning can only resolve performance issues after they have occurred and are identified.
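As a concrete sketch, a known-good plan that is still in the cursor cache can be loaded as an accepted baseline with DBMS_SPM (substitute your own SQL_ID and PLAN_HASH_VALUE; the values here reuse the example that appears later in this section):

SQL> declare
  2    n pls_integer;
  3  begin
  4    n := dbms_spm.load_plans_from_cursor_cache(
  5           sql_id          => '0qa98gcnnza7h',
  6           plan_hash_value => 568322376);
  7  end;
  8  /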


Unstable Plans (Oracle Plan Stability/Instability)

One of the most frustrating things about the Cost Based Optimizer in Oracle is the sometimes erratic performance.
It can seem almost random at times. Resolving these issues can also be challenging. It’s a little like taking your car to the mechanic
 when it makes a funny noise. It never seems to make the noise when the mechanic is there.
Fortunately we have ASH/AWR which captures a ton of information about what our databases are doing.

The first one can be used to show statements that have experienced significant variances in execution time
(it can be modified to look for variances in the amount of logical i/o, but I’ll leave it as an exercise for the reader).
 I called the script unstable_plans.sql. It uses an analytic function to calculate a standard deviation on the average elapsed time by plan.
So the statements that have multiple plans with wild variations in the response time between plans will be returned by the script.
The script prompts for a couple of values. The first is minimum number of standard deviations. The second is the minimum elapsed time
(I usually don’t care if a statement executes sometimes in .005 seconds and sometimes in .02 seconds,
even though this is a large swing statistically). Both these inputs are defaulted by the way.

The second one is called awr_plan_change.sql. It’s shows how the plans for a given statement have changed over time,
along with some statistics such as the average elapsed time and the average lio’s.

Anyway, here’s an example of the usage of these two scripts (by the way, the example is on an 11gR1 database, but the scripts work fine on 10g as well).

> sqlplus / as sysdba

SQL*Plus: Release 11.1.0.6.0 - Production on Tue Oct 7 15:44:20 2008

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> alter system set optimizer_mode=rule;

System altered.

SQL> @awr_snap

PL/SQL procedure successfully completed.

SQL> ----------------------------------------------------------------------------------------
--
-- File name:   unstable_plans.sql
--
-- Purpose:     Attempts to find SQL statements with plan instability.
--
-- Author:      Kerry Osborne
--
-- Usage:       This scripts prompts for two values, both of which can be left blank.
--
--              min_stddev: the minimum "normalized" standard deviation between plans
--                          (the default is 2)
--
--              min_etime:  only include statements that have an avg. etime > this value
--                          (the default is .1 second)
--
-- See http://kerryosborne.oracle-guy.com/2008/10/unstable-plans/ for more info.
---------------------------------------------------------------------------------------

set lines 155
col execs for 999,999,999
col min_etime for 999,999.99
col max_etime for 999,999.99
col avg_etime for 999,999.999
col avg_lio for 999,999,999.9
col norm_stddev for 999,999.9999
col begin_interval_time for a30
col node for 99999
break on plan_hash_value on startup_time skip 1
select * from (
select sql_id, sum(execs), min(avg_etime) min_etime, max(avg_etime) max_etime, stddev_etime/min(avg_etime) norm_stddev
from (select sql_id, plan_hash_value, execs, avg_etime,
stddev(avg_etime) over (partition by sql_id) stddev_etime
from (
select sql_id, plan_hash_value,
sum(nvl(executions_delta,0)) execs,
(sum(elapsed_time_delta)/decode(sum(nvl(executions_delta,0)),0,1,sum(executions_delta))/1000000) avg_etime
-- sum((buffer_gets_delta/decode(nvl(buffer_gets_delta,0),0,1,executions_delta))) avg_lio
from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS
where ss.snap_id = S.snap_id
and ss.instance_number = S.instance_number
and executions_delta > 0
group by sql_id, plan_hash_value
))
group by sql_id, stddev_etime
)
where norm_stddev > nvl(to_number('&min_stddev'),2)
and max_etime > nvl(to_number('&min_etime'),.1)
order by norm_stddev
/
Enter value for min_stddev:
Enter value for min_etime:

SQL_ID        SUM(EXECS)   MIN_ETIME   MAX_ETIME   NORM_STDDEV
------------- ---------- ----------- ----------- -------------
c5by9gdw3814r         14         .03         .12        2.1274
848dyu9288c3h         16         .02         .16        2.1309
2am60vd2kw8ux         11         .05         .22        2.4976
frxg53fac2n8j          5         .03         .30        4.2479
0qa98gcnnza7h         62       25.58      314.34        7.9833
g0jvz8csyrtcf          2         .09        1.19        8.2304
2cn0kc8u4b81w        545         .02         .42       12.8022
9wt62290ah0f7          6         .01         .47       38.5857
d8mayxqw0wnpv       1373         .01         .85       48.3874

9 rows selected.

SQL> /
Enter value for min_stddev:
Enter value for min_etime: 2

SQL_ID        SUM(EXECS)   MIN_ETIME   MAX_ETIME   NORM_STDDEV
------------- ---------- ----------- ----------- -------------
0qa98gcnnza7h         62       25.58      314.34        7.9833

SQL> @find_sql
Enter value for sql_text:
Enter value for address:
Enter value for sql_id: 0qa98gcnnza7h

SQL_ID         CHILD  PLAN_HASH        EXECS         ETIME     AVG_ETIME USERNAME      SQL_TEXT
------------- ------ ---------- ------------ ------------- ------------- ------------- -----------------------------------------
0qa98gcnnza7h      0 3723858078            5        356.53         71.31 SYS           select avg(pk_col) from kso.skew where co
                                                                                       l1 > 0

0qa98gcnnza7h      1  568322376            1          7.92          7.92 SYS           select avg(pk_col) from kso.skew where co
                                                                                       l1 > 0

0qa98gcnnza7h      2  568322376           10         52.14          5.21 SYS           select avg(pk_col) from kso.skew where co
                                                                                       l1 > 0

0qa98gcnnza7h      3  568322376           30      1,064.19         35.47 KSO           select avg(pk_col) from kso.skew where co
                                                                                       l1 > 0

0qa98gcnnza7h      4 3723858078           10      4,558.62        455.86 KSO           select avg(pk_col) from kso.skew where co
                                                                                       l1 > 0

SQL> set lines 155
col execs for 999,999,999
col avg_etime for 999,999.999
col avg_lio for 999,999,999.9
col begin_interval_time for a30
col node for 99999
break on plan_hash_value on startup_time skip 1
select ss.snap_id, ss.instance_number node, begin_interval_time, sql_id, plan_hash_value,
nvl(executions_delta,0) execs,
(elapsed_time_delta/decode(nvl(executions_delta,0),0,1,executions_delta))/1000000 avg_etime,
(buffer_gets_delta/decode(nvl(buffer_gets_delta,0),0,1,executions_delta)) avg_lio
from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS
where sql_id = nvl('&sql_id','4dqs2k5tynk61')
and ss.snap_id = S.snap_id
and ss.instance_number = S.instance_number
and executions_delta > 0
order by 1, 2, 3
/

Enter value for sql_id: 0qa98gcnnza7h

   SNAP_ID   NODE BEGIN_INTERVAL_TIME            SQL_ID        PLAN_HASH_VALUE        EXECS    AVG_ETIME        AVG_LIO
---------- ------ ------------------------------ ------------- --------------- ------------ ------------ --------------
      3206      1 02-OCT-08 08.00.38.743 AM      0qa98gcnnza7h       568322376            4       10.359      121,722.8
      3235      1 03-OCT-08 01.00.44.932 PM      0qa98gcnnza7h                            1       10.865      162,375.0
      3235      1 03-OCT-08 01.00.44.932 PM      0qa98gcnnza7h      3723858078            1      127.664   28,913,271.0
      3236      1 03-OCT-08 01.28.09.000 PM      0qa98gcnnza7h       568322376            1        7.924      162,585.0
      3236      1 03-OCT-08 01.28.09.000 PM      0qa98gcnnza7h      3723858078            1       86.682   27,751,123.0
      3305      1 06-OCT-08 10.00.11.988 AM      0qa98gcnnza7h                            4       64.138   22,616,931.5
      3305      1 06-OCT-08 10.00.11.988 AM      0qa98gcnnza7h       568322376            2        5.710       81,149.0
      3306      1 06-OCT-08 11.00.16.490 AM      0qa98gcnnza7h                            6        5.512      108,198.5
      3307      1 06-OCT-08 12.00.20.716 PM      0qa98gcnnza7h                            2        3.824       81,149.0
      3328      1 07-OCT-08 08.39.20.525 AM      0qa98gcnnza7h                           30       35.473      156,904.7
      3335      1 07-OCT-08 03.00.20.950 PM      0qa98gcnnza7h      3723858078           10      455.862   28,902,128.6

11 rows selected.

SQL> @dplan_awr
Enter value for sql_id: 0qa98gcnnza7h
Enter value for plan_hash_value:

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 0qa98gcnnza7h
--------------------
select avg(pk_col) from kso.skew where col1 > 0

Plan hash value: 568322376

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       | 44497 (100)|          |
|   1 |  SORT AGGREGATE    |      |     1 |    11 |            |          |
|   2 |   TABLE ACCESS FULL| SKEW |    32M|   335M| 44497   (2)| 00:08:54 |
---------------------------------------------------------------------------

SQL_ID 0qa98gcnnza7h
--------------------
select avg(pk_col) from kso.skew where col1 > 0

Plan hash value: 3723858078

--------------------------------------------------
| Id  | Operation                    | Name      |
--------------------------------------------------
|   0 | SELECT STATEMENT             |           |
|   1 |  SORT AGGREGATE              |           |
|   2 |   TABLE ACCESS BY INDEX ROWID| SKEW      |
|   3 |    INDEX RANGE SCAN          | SKEW_COL1 |
--------------------------------------------------

Note
-----
   - rule based optimizer used (consider using cbo)

33 rows selected.

A couple of quick comments. The unstable_plans.sql script shows several statements that are switching plans.
Most of them have sub-second response times though. The one that stands out is sql_id 0qa98gcnnza7h.
It has been executed 62 times; one plan averages about 25 seconds while another averages about 314 seconds.
The awr_plan_change.sql script shows that this statement has been switching between two plans (3723858078 and 568322376).
3723858078 is the less efficient plan (doing 25M or so logical i/o’s per execution) while plan 568322376 is much better
(doing only about 120K logical i/o’s per execution).
This is a typical pattern we see when bind variable peeking is an issue. Plans that flip flop back and forth between two or three options.
This post is not specifically about bind variable peeking and so I won’t go into all the details (if you want more info just search on this site).
 But I will say that it is quite often the case that a single plan is acceptable regardless of the bind variables that are used (in this case the 568322376 plan).
It is pretty easy to get Oracle to use a specific plan using an Outline,
SQL Profile, or Baseline (depending on the version of Oracle you are using). And that’s exactly what we did in this case to resolve the issue.



SQL> @unstable_plans
Enter value for min_stddev:
Enter value for min_etime:

SQL_ID        SUM(EXECS)   MIN_ETIME   MAX_ETIME   NORM_STDDEV
------------- ---------- ----------- ----------- -------------
0qa98gcnnza7h          4       42.08      208.80        2.8016

SSQL> @awr_plan_stats
Enter value for sql_id: 0qa98gcnnza7h

SQL_ID        PLAN_HASH_VALUE        EXECS          ETIME    AVG_ETIME        AVG_LIO
------------- --------------- ------------ -------------- ------------ --------------
0qa98gcnnza7h       568322376            3          126.2       42.079      124,329.7
0qa98gcnnza7h      3723858078            1          208.8      208.796   28,901,466.0

SQL> @create_sql_profile_awr
Enter value for sql_id: 0qa98gcnnza7h
Enter value for plan_hash_value: 568322376
Enter value for category:
Enter value for force_matching:

PL/SQL procedure successfully completed.

SQL> @sql_profiles
Enter value for sql_text:
Enter value for name: PROFIL%

NAME                           CATEGORY        STATUS   SQL_TEXT                                                               FOR
------------------------------ --------------- -------- ---------------------------------------------------------------------- ---
PROFILE_0qa98gcnnza7h          DEFAULT         ENABLED  select avg(pk_col) from kso.skew                                       NO

SQL> set echo on
SQL> @sql_profile_hints
SQL> set lines 155
SQL> col hint for a150
SQL> select attr_val hint
  2  from dba_sql_profiles p, sqlprof$attr h
  3  where p.signature = h.signature
  4  and name like ('&profile_name')
  5  order by attr#
  6  /
Enter value for profile_name: PROFILE_0qa98gcnnza7h

HINT
------------------------------------------------------------------------------------------------------------------------------------------------------
IGNORE_OPTIM_EMBEDDED_HINTS
OPTIMIZER_FEATURES_ENABLE('10.2.0.3')
ALL_ROWS
OUTLINE_LEAF(@"SEL$1")
FULL(@"SEL$1" "SKEW"@"SEL$1")

SQL> @sql_hints_awr
SQL> select
  2  extractvalue(value(d), '/hint') as outline_hints
  3  from
  4  xmltable('/*/outline_data/hint'
  5  passing (
  6  select
  7  xmltype(other_xml) as xmlval
  8  from
  9  dba_hist_sql_plan
 10  where
 11  sql_id = '&sql_id'
 12  and plan_hash_value = &plan_hash_value
 13  and other_xml is not null
 14  )
 15  ) d;
Enter value for sql_id: 0qa98gcnnza7h
Enter value for plan_hash_value: 568322376

OUTLINE_HINTS
-----------------------------------------------------------------------------------------------------------------------------------------------------------
IGNORE_OPTIM_EMBEDDED_HINTS
OPTIMIZER_FEATURES_ENABLE('10.2.0.3')
ALL_ROWS
OUTLINE_LEAF(@"SEL$1")
FULL(@"SEL$1" "SKEW"@"SEL$1")

A couple of additional points:

    Outlines and SQL Profiles both take the same approach to controlling execution plans. They both attempt to force the optimizer down a certain path by applying hints behind the scenes. This is in my opinion an almost impossible task. The more complex the statement, the more difficult the task becomes. The newest kid on the block in this area (in 11g) is called a baseline and while it doesn’t abandon the hinting approach altogether, it does at least store the plan_hash_value – so it can tell if it regenerated the correct plan or not.
    It does not appear that Outlines are being actively pursued by Oracle development anymore. So while they still work in 11g, they are becoming a little less reliable (and they were a bit quirky to begin with).
    SQL Profiles have the ability to replace literals with bind variables similar to the cursor_sharing parameter. This means you can have a SQL Profile that will match multiple statements which use literals without having to set cursor_sharing for the whole instance.
    Outlines take precedence over SQL Profiles. You can create both on the same statement and if you do, the outline will be used and the SQL Profile will be ignored. This is true in 11g as well, by the way.
    Outlines don't appear to use the OPT_ESTIMATE hint. So I believe it is still a valid approach to accept a SQL Profile as offered by the SQL Tuning Advisor and then create an Outline on top of it. It seems to work pretty well most of the time (be sure to check the hints and the plan that gets produced).
    Manually created SQL Profiles also don't appear to use the OPT_ESTIMATE hint. So I believe it is likewise a valid approach to accept a SQL Profile as offered by the SQL Tuning Advisor and then create a SQL Profile on top of it. Note that you'll have to use a different category, then drop the original, then enable the new SQL Profile, which makes this approach a bit more work than just creating an Outline in the DEFAULT category.
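As promised above, here is a minimal baseline sketch. DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE (available from 11g onwards) loads a plan straight from the cursor cache as an accepted baseline; the sql_id and plan_hash_value below are the ones from the earlier AWR example, used purely for illustration:

-- Capture the currently cached "good" plan as an accepted SQL plan baseline.
set serveroutput on
declare
  l_loaded pls_integer;
begin
  l_loaded := dbms_spm.load_plans_from_cursor_cache(
                sql_id          => '0qa98gcnnza7h',
                plan_hash_value => 568322376);
  dbms_output.put_line(l_loaded || ' plan(s) loaded');
end;
/

From then on the optimizer prefers the accepted plan and records any new plan it generates as unaccepted; this behaviour is covered in more detail in the SQL Plan Management section below.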

Have a look at the difference between the hints generated by the SQL Tuning Advisor and those created by a manual SQL Profile or an Outline (note that I have changed the object names to protect the innocent, which may make the example slightly harder to follow):

SQL>
SQL> select * from table(dbms_xplan.display_cursor('&sql_id','&child_no',''));
Enter value for sql_id: fknfhx8wth51q
Enter value for child_no: 1

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  fknfhx8wth51q, child number 1
-------------------------------------
SELECT /* test4 */ col1, col2, col3 ...


Plan hash value: 3163842146

----------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                          |       |       |  1778 (100)|          |
|   1 |  NESTED LOOPS                 |                          |  1039 | 96627 |  1778   (1)| 00:00:33 |
|   2 |   NESTED LOOPS                |                          |   916 | 57708 |  1778   (1)| 00:00:33 |
|*  3 |    TABLE ACCESS BY INDEX ROWID| TABLE_XXXX_LOOKUP        |   446 | 17840 |   891   (1)| 00:00:17 |
|*  4 |     INDEX RANGE SCAN          | INDEX_XXXX_IS_CPCI       | 12028 |       |    18   (0)| 00:00:01 |
|   5 |    TABLE ACCESS BY INDEX ROWID| TABLE_XXXX_IDENT         |     2 |    46 |     2   (0)| 00:00:01 |
|*  6 |     INDEX RANGE SCAN          | INDEX_XXXXIP_17_FK       |     2 |       |     1   (0)| 00:00:01 |
|*  7 |   INDEX UNIQUE SCAN           | PK_TABLE_XXXX_ASSIGNMENT |     1 |    30 |     0   (0)|          |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter((
...
   4 - access("L"."COL1"=:N1)
   6 - access("L"."COL2"="I"."COL1")

Note
-----
   - SQL profile SYS_SQLPROF_012061f471d50001 used for this statement


85 rows selected.

SQL> @sql_profile_hints
Enter value for name: SYS_SQLPROF_012061f471d50001

HINT
------------------------------------------------------------------------------------------------------------------------------------------------------
OPT_ESTIMATE(@"SEL$1", TABLE, "L"@"SEL$1", SCALE_ROWS=0.0536172171)
OPT_ESTIMATE(@"SEL$1", INDEX_SKIP_SCAN, "A"@"SEL$1", PK_TABLE_XXXX_ASSIGNMENT, SCALE_ROWS=4)
COLUMN_STATS("APP_OWNER"."TABLE_XXXX_ASSIGNMENT", "COL1", scale, length=6 distinct=1234 nulls=0 min=1000000014 max=1026369632)
COLUMN_STATS("APP_OWNER"."TABLE_XXXX_ASSIGNMENT", "COL2", scale, length=12 distinct=2 nulls=0)
COLUMN_STATS("APP_OWNER"."TABLE_XXXX_ASSIGNMENT", "COL3", scale, length=12 distinct=2 nulls=0)
TABLE_STATS("APP_OWNER"."TABLE_XXXX_ASSIGNMENT", scale, blocks=5 rows=2400)
OPTIMIZER_FEATURES_ENABLE(default)

7 rows selected.

SQL> -- no direct hints - only stats and scaling on the profile created by the SQL Tuning Advisor
SQL> -- (i.e. the dreaded OPT_ESTIMATE hints and no directive type hints like INDEX or USE_NL)
SQL>
SQL> -- now let's try an outline on top of it
SQL> @create_outline

Session altered.

Enter value for sql_id: fknfhx8wth51q
Enter value for child_number: 1
Enter value for outline_name: KSOTEST1
Outline KSOTEST1 created.

PL/SQL procedure successfully completed.
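
The create_outline script isn't shown either. One way to build a stored outline from a cursor that is already in the shared pool is DBMS_OUTLN.CREATE_OUTLINE, which takes the old-style hash value rather than the sql_id. A sketch, assuming the cursor from this example is still cached:

-- Look up the old-style hash value for the cursor, then create an outline.
declare
  l_hash_value number;
begin
  select hash_value
  into   l_hash_value
  from   v$sqlarea
  where  sql_id = 'fknfhx8wth51q';

  dbms_outln.create_outline(
    hash_value   => l_hash_value,
    child_number => 1,
    category     => 'DEFAULT');
end;
/

The outline is created with a generated SYS_OUTLINE_... name; it can then be renamed with ALTER OUTLINE ... RENAME TO KSOTEST1, which is presumably what the script above does.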

SQL> @outline_hints
Enter value for name: KSOTEST1

HINT
------------------------------------------------------------------------------------------------------------------------------------------------------
USE_NL(@"SEL$1" "A"@"SEL$1")
USE_NL(@"SEL$1" "I"@"SEL$1")
LEADING(@"SEL$1" "L"@"SEL$1" "I"@"SEL$1" "A"@"SEL$1")
INDEX(@"SEL$1" "A"@"SEL$1" ("TABLE_XXXX_ASSIGNMENT"."COL1" "TABLE_XXXX_ASSIGNMENT"."COL2" "TABLE_XXXX_ASSIGNMENT"."COL3"))
INDEX_RS_ASC(@"SEL$1" "I"@"SEL$1" ("TABLE_XXXX_IDENT"."COL1"))
INDEX_RS_ASC(@"SEL$1" "L"@"SEL$1" ("TABLE_XXXX_LOOKUP"."COL1" "TABLE_XXXX_LOOKUP"."COL2"))
OUTLINE_LEAF(@"SEL$1")
ALL_ROWS
DB_VERSION('11.1.0.7')
OPTIMIZER_FEATURES_ENABLE('11.1.0.7')
IGNORE_OPTIM_EMBEDDED_HINTS

11 rows selected.

SQL> -- no OPT_ESTIMATE hints on the outline
SQL> -- directive type hints - INDEX, USE_NL, etc...
SQL>
SQL> -- now let's try creating a manual profile
SQL> @create_sql_profile.sql
Enter value for sql_id: fknfhx8wth51q
Enter value for child_no: 1
Enter value for category: TEST
Enter value for force_matching:

PL/SQL procedure successfully completed.

SQL> @sql_profile_hints
Enter value for name: PROFILE_fknfhx8wth51q

HINT
------------------------------------------------------------------------------------------------------------------------------------------------------
IGNORE_OPTIM_EMBEDDED_HINTS
OPTIMIZER_FEATURES_ENABLE('11.1.0.7')
DB_VERSION('11.1.0.7')
ALL_ROWS
OUTLINE_LEAF(@"SEL$1")
INDEX_RS_ASC(@"SEL$1" "L"@"SEL$1" ("TABLE_XXXX_LOOKUP"."COL1" "TABLE_XXXX_LOOKUP"."COL2"))
INDEX_RS_ASC(@"SEL$1" "I"@"SEL$1" ("TABLE_XXXX_IDENT"."COL1"))
INDEX(@"SEL$1" "A"@"SEL$1" ("TABLE_XXXX_ASSIGNMENT"."COL1" "TABLE_XXXX_ASSIGNMENT"."COL2" "TABLE_XXXX_ASSIGNMENT"."COL3"))
LEADING(@"SEL$1" "L"@"SEL$1" "I"@"SEL$1" "A"@"SEL$1")
USE_NL(@"SEL$1" "I"@"SEL$1")
USE_NL(@"SEL$1" "A"@"SEL$1")

11 rows selected.

SQL> -- no OPT_ESTIMATE with the SQL Profile we created manually !
SQL> -- again it's directive - USE_NL, INDEX, LEADING, etc...
SQL>

So I apologize to all you SQL Profiles out there who have been lumped together by my prejudiced view, just because of the acts of a few of your brethren (i.e. the ones created by the SQL Tuning Advisor). SQL Profiles do indeed have all the capabilities of Outlines, and in most cases they are probably the better choice.



SQL Plan Management

Introduction

SQL plan management (SPM) ensures that runtime performance will not degrade due to execution plan changes. To guarantee this, only accepted execution plans are used; any plan evolution that does occur is tracked and evaluated at a later point in time, and a new plan is accepted only if it shows a noticeable improvement in runtime.
SQL Plan Management has three main components:

1) Plan Capture: creation of SQL plan baselines that store accepted execution plans for all relevant SQL statements. SQL plan baselines are stored in the SQL management base in the SYSAUX tablespace.

2) Plan Selection: ensures that only accepted execution plans are used for statements with a SQL plan baseline, and records any new execution plans found for a statement as unaccepted plans in the SQL plan baseline.

3) Plan Evolution: evaluates all unaccepted execution plans for a given statement; only plans that show a performance improvement become accepted plans in the SQL plan baseline (see the sketch after this list).
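To make Plan Evolution concrete, here is a minimal sketch using DBMS_SPM.EVOLVE_SQL_PLAN_BASELINE (12c also offers a task-based API via CREATE_EVOLVE_TASK and friends). The sql_handle below is a placeholder; real handles come from DBA_SQL_PLAN_BASELINES:

-- Verify any unaccepted plans for one statement and accept those that
-- perform better than the existing baseline plans.
set serveroutput on
declare
  l_report clob;
begin
  l_report := dbms_spm.evolve_sql_plan_baseline(
                sql_handle => 'SQL_0123456789abcdef',   -- placeholder
                verify     => 'YES',
                commit     => 'YES');
  dbms_output.put_line(dbms_lob.substr(l_report, 4000, 1));
end;
/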

SQL Management Base

The SQL management base (SMB) is a logical repository in the data dictionary, physically located in the SYSAUX tablespace. In the context of SPM, it stores the following structures:

1) SQL Plan History: the set of SQL execution plans generated for SQL statements over time. The history contains both SQL plan baselines and unaccepted plans.

2) SQL Plan Baselines: a SQL plan baseline is an accepted plan that the optimizer is allowed to use for a SQL statement. In the typical use case, the database accepts a plan into the plan baseline only after verifying that the plan performs well. (The contents of the SMB can be inspected with the query after this list.)

3) SQL Statement Log: a series of query signatures used to identify queries that have been executed more than once during automatic plan capture (see below).
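The plan history, including which plans are accepted, can be inspected through the DBA_SQL_PLAN_BASELINES view, for example:

-- Which plans exist for each statement, and which are accepted or fixed?
select sql_handle, plan_name, origin, enabled, accepted, fixed, created
from   dba_sql_plan_baselines
order  by sql_handle, created;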

Plan Capture
For SPM to become active, the SQL management base must be seeded with a set of acceptable execution plans, which become the SQL plan baselines for the corresponding SQL statements. There are two ways to populate the SQL management base: automatically or manually. Both routes are sketched below.
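
For illustration, a sketch of both routes (the SQL tuning set name is a placeholder):

-- Automatic capture: record plans for repeatable statements as baselines.
alter session set optimizer_capture_sql_plan_baselines = true;

-- Manual load, for example from a SQL tuning set:
declare
  l_loaded pls_integer;
begin
  l_loaded := dbms_spm.load_plans_from_sqlset(sqlset_name => 'MY_STS');
  dbms_output.put_line(l_loaded || ' plan(s) loaded');
end;
/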

