2) ASM and Grid Infrastructure Stack
In releases prior to 11gR2, Automatic Storage Management (ASM) was tightly integrated with the Clusterware stack. In 11gR2,
ASM is not only tightly integrated with the Clusterware stack, it’s
actually part of the Clusterware stack. The Grid Infrastructure stack is
the foundation of Oracle’s Private Database Cloud, and it provides the
essential Cloud Pool capabilities, such as growing server and storage
capacity as needed. This chapter discusses how ASM fits into the Oracle
Clusterware stack.
Clusterware Primer
Oracle Clusterware is the cross-platform
cluster software required to run the Real Application Clusters (RAC)
option for Oracle Database and provides the basic clustering services at
the operating system level that enable Oracle software to run in
clustered mode. The two main components of Oracle Clusterware are
Cluster Ready Services and Cluster Synchronization Services:
Cluster Ready Services (CRS) Provides
high-availability operations in a cluster. The CRS daemon (CRSd)
manages cluster resources based on the persistent configuration
information stored in Oracle Cluster Registry (OCR). These cluster
resources include the Oracle Database instance, listener, VIPs, SCAN
VIPs, and ASM. CRSd provides start, stop, monitor, and failover
operations for all the cluster resources, and it generates events when
the status of a resource changes.
Cluster Synchronization Services (CSS) Manages
the cluster configuration by controlling which nodes are members of the
cluster and by notifying members when a member (node) joins or leaves
the cluster. The following functions are provided by the Oracle Cluster
Synchronization Services daemon (OCSSd):
Group Services A distributed group membership system that allows for the synchronization of services between nodes
Lock Services Provide the basic cluster-wide serialization locking functions
Node Services Use OCR to store state data and update the information during reconfiguration
OCR Overview
The Oracle Cluster Registry is the central
repository for all the resources registered with Oracle Clusterware. It
contains the profile, state, and ownership details of the resources.
This includes both Oracle resources and user-defined application
resources. Oracle resources include the node apps (VIP, ONS, GSD, and
Listener) and database resources, such as database instances, and
database services. Oracle resources are added to the OCR by tools such
as DBCA, NETCA, and srvctl.
Voting File Overview
Oracle Clusterware maintains the membership of the nodes in the cluster using a special file called the voting disk (sometimes mistakenly referred to as a quorum disk).
The voting disk is also referred to as the voting file, so
you’ll see it referenced both ways, and both are correct. This file
contains the heartbeat records from all the nodes in the cluster. If a
node loses access to the voting file or is not able to complete the
heartbeat I/O within the threshold time, then that node is evicted out
of the cluster. Oracle Clusterware also maintains heartbeat with the
other member nodes of the cluster via the shared private interconnect
network. A split-brain syndrome occurs when there is a failure in the
private interconnect whereby multiple sub-clusters are formed within the
clustered nodes and the nodes in different sub-clusters are not able to
communicate with each other via the interconnect network but they still
have access to the voting files. The voting file enables Clusterware to
resolve network split brain among the cluster nodes. In such a
situation, the largest active sub-cluster survives. Oracle Clusterware
requires an odd number of voting files (1, 3, 5, …) to be created. This
is done to ensure that at any point in time, an active member of the
cluster has access to the majority number (n / 2 + 1) of voting files.
Here’s a list of some interesting 11gR2 changes for voting files:
The
voting files’ critical data is stored in the voting file itself and not in the
OCR anymore. From a voting file perspective, the OCR is not touched at
all. The critical data each node must agree on to form a cluster includes, for
example, the misscount value and the list of configured voting files.
In Oracle Clusterware 11g
Release 2 (11.2), it is no longer necessary to back up the voting
files. The voting file data is automatically backed up in OCR as part of
any configuration change and is automatically restored as needed. If
all voting files are corrupted, users can restore them as described in
the Oracle Clusterware Administration and Deployment Guide.
Grid Infrastructure Stack Overview
The Grid Infrastructure stack includes
Oracle Clusterware components, ASM, and ASM Cluster File System (ACFS).
Throughout this chapter, as well as the book, we will refer to Grid
Infrastructure as the GI stack.
The Oracle GI stack consists of two
sub-stacks: one managed by the Cluster Ready Services daemon (CRSd) and
the other by the Oracle High Availability Services daemon (OHASd). How
these sub-stacks come into play depends on how the GI stack is
installed. The GI stack is installed in two ways:
Grid Infrastructure for Standalone Server
Grid Infrastructure for Cluster
ASM is available in both these software stack
installations. When Oracle Universal Installer (OUI) is invoked to
install Grid Infrastructure, the main screen will show four options (see
Figure 2-1).
In this section, the options we want to focus on are Grid
Infrastructure for Standalone Server and Grid Infrastructure for
Cluster.
Grid Infrastructure for Standalone Server
Grid Infrastructure for Standalone Server is
essentially the single-instance (non-clustered) configuration, as in
previous releases. It is important to note that in 11gR2, because
ASM is part of the GI stack, Clusterware must be installed first before
the database software is installed; this holds true even for
single-instance deployments. Keep in mind that ASM will not need to be
in a separate ORACLE_HOME; it is installed and housed in the GI
ORACLE_HOME.
Grid Infrastructure for Standalone Server does
not configure the full Clusterware stack; just the minimal components
are set up and enabled—that is, private interconnect, CRS, and
OCR/voting files are not enabled or required. The OHASd daemon and its
startup framework replace all the existing pre-11.2 init scripts. The entry point
for OHASd is /etc/inittab, which executes the /etc/init.d/ohasd and
/etc/init.d/init.ohasd control scripts, including the start and stop
actions. The ohasd script is the framework control script, which
spawns the $GI_HOME/bin/ohasd.bin executable. OHASd is the main
daemon that provides High Availability Services (HAS) and starts the
remaining stack, including ASM, listener, and the database in a
single-instance environment.
A new feature that’s automatically enabled as
part of Grid Infrastructure for Standalone Server installation is Oracle
Restart, which provides high-availability restart functionality for
failed instances (database and ASM), services, listeners, and dismounted
disk groups. It also ensures these protected components start up and
shut down according to the dependency order required. This functionality
essentially replaces the legacy dbstart/dbstop script used in the pre-11gR2
single-instance configurations. Oracle Restart also executes health
checks that periodically monitor the health of these components. If a
check operation fails for a component, the component is forcibly shut
down and restarted. Note that Oracle Restart is only enabled in GI for
Standalone Server (non-clustered) environments. For clustered
configurations, health checks and the monitoring capability are provided
by Oracle Clusterware CRS agents.
When a server that has Grid Infrastructure for
Standalone Server enabled is booted up, the HAS process will initialize
and start up by first starting up ASM. ASM has a hard-start (pull-up)
dependency with CSS, so CSS is started up. Note that there is a
hard-stop dependency between ASM and CSS, so on stack shutdown ASM will
stop and then CSS will stop.
Grid Infrastructure for Cluster
Grid Infrastructure for Cluster is the
traditional installation of Clusterware. It includes multinode RAC
support, private interconnect, Clusterware files, and now also installs
ASM and ACFS drivers. With Oracle Clusterware 11gR2, ASM is not
simply the storage manager for database files, but also houses the
Clusterware files (OCR and voting files) and the ASM spfile.
When you select the Grid Infrastructure for Cluster option in OUI, as shown previously in Figure 2-1,
you will be prompted next on file storage options for the Clusterware
files (Oracle Clusterware Registry and Clusterware voting file). This is
shown in Figure 2-2.
Users are prompted to place Clusterware files
on either a shared file system or ASM. Note that raw disks are not
supported any longer for new installations. Oracle will support the
legacy method of storing Clusterware files (raw and so on) in upgrade
scenarios only.
When ASM is selected as the storage location for Clusterware files, the Create ASM Disk Group screen is shown next (see Figure 2-3).
You can choose external redundancy or ASM redundancy (normal or high) for the storage of
Clusterware files. However, keep in mind that the type of redundancy
affects the redundancy (or number of copies) of the voting files.
For example, normal redundancy requires
a minimum of three failure groups, and high redundancy a
minimum of five failure groups. This requirement stems from the fact
that an odd number of voting files must exist to enable a vote quorum.
Additionally, this allows the cluster to tolerate one or two disk failures, respectively, and still
maintain a quorum.
This first disk group created during
the installation can also be used to store database files. In previous
versions of ASM, this disk group was referred to as the DATA disk group.
Although it is recommended that you create a single disk group for
storing the Clusterware files and database files, users who employ
third-party snapshot technology against the ASM disk group
may want a separate disk group for the Clusterware
files. Users may also deploy a separate disk group for the Clusterware files
to leverage normal or high redundancy for them. In both
cases, create a small CRSDATA disk group with a 1MB
AU and enough failure groups to support the required redundancy. After
the installation, ASMCA can then be used to create the DATA disk group.
Voting Files and Oracle Cluster Registry Files in ASM
In versions prior to 11gR2, users
needed to configure and set up raw devices for housing the Clusterware
files (OCR and voting files). This step creates additional management
overhead and is error prone. Incorrect OCR/voting files setup creates
havoc for the Clusterware installation and directly affects run-time
environments. To mitigate these install preparation issues, 11gR2
allows the storing of the Clusterware files in ASM; this also
eliminates the need for a third-party cluster file system and eliminates
the complexity of managing disk partitions for the OCR and voting
files. The COMPATIBLE.ASM disk group compatibility attribute must be set
to 11.2 or greater to store the OCR or voting file data in a disk
group. This attribute is automatically set for new installations with
the OUI. Note that COMPATIBLE.RDBMS does not need to be advanced to
enable this feature. The COMPATIBLE.* attributes topic is covered in Chapter 3.
Voting Files in ASM
If you choose to store voting files in ASM,
then all voting files must reside in ASM in a single disk group (in
other words, Oracle does not support mixed configurations of storing
some voting files in ASM and some on NAS devices).
Unlike most ASM files, the voting files are
wholly consumed in multiple contiguous AUs. Additionally, the voting
file is not stored as a standard ASM file (that is, it cannot be listed
in the asmcmd ls command). However, the disk that contains the voting
file is reflected in the V$ASM_DISK view:
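As a sketch, a query along these lines shows which ASM disks contain a voting file (the VOTING_FILE column is available starting in 11gR2):

SELECT path, failgroup, voting_file
FROM   v$asm_disk
WHERE  voting_file = 'Y';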
The number of voting files you want to create in a particular Oracle ASM disk group depends on the redundancy of the disk group:
External redundancy A
disk group with external redundancy can store only one voting file.
Currently, no supported way exists to have multiple voting files stored
on an external redundancy disk group.
Normal redundancy A disk group with normal redundancy can store up to three voting files.
High redundancy A disk group with high redundancy can store up to five voting files.
In this example, we created an ASM disk group
with normal redundancy for the disk group containing voting files. The
following can be seen:
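As a sketch (the File Universal Ids and disk paths shown are placeholders), the voting file layout can be listed with crsctl:

crsctl query css votedisk
##  STATE    File Universal Id        File Name                  Disk group
 1. ONLINE   <FUID-1>                 /dev/mapper/disk01p1       [DATA]
 2. ONLINE   <FUID-2>                 /dev/mapper/disk02p1       [DATA]
 3. ONLINE   <FUID-3>                 /dev/mapper/disk03p1       [DATA]
Located 3 voting disk(s).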
ASM puts each voting file in its own failure group within the disk group. A failure group
is defined as the collection of disks that have a shared hardware
component for which you want to prevent its loss from causing a loss of
data.
For example, four drives that are in a single
removable tray of a large JBOD (just a bunch of disks) array are in the
same failure group because the tray could be removed, making all four
drives fail at the same time. Conversely, drives in the same cabinet can
be in multiple failure groups if the cabinet has redundant power and
cooling so that it is not necessary to protect against the failure of
the entire cabinet. If voting files are stored in ASM with normal or
high redundancy and the storage hardware in one failure group fails,
ASM relocates the affected voting file to a candidate disk in an
unaffected failure group, provided such a disk is available in the disk
group.
Voting files are managed differently from
other files that are stored in ASM. When voting files are placed on
disks in an ASM disk group, Oracle Clusterware records exactly on which
disks in that disk group they are located. Note that CSS has access to
voting files even if ASM becomes unavailable.
Voting files can be migrated from raw/block
devices into ASM. This is typical in upgrade scenarios. For
example, when a user upgrades from 10g to 11gR2, they are
allowed to continue storing their OCR/voting files on raw devices, but at a
later convenient time they can migrate these Clusterware files into ASM.
It is important to point out that users cannot upgrade to Oracle
Clusterware 12c from 10g without first moving the voting
files into ASM (or a shared file system), since raw disks are no longer
supported even for upgraded environments in 12c.
The following illustrates this:
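As a sketch (the disk group and raw device names are assumptions; run the commands as root once a suitable disk group exists):

# Move the voting files from raw devices into the DATA disk group
crsctl replace votedisk +DATA

# Add an OCR location in ASM, then remove the old raw location
ocrconfig -add +DATA
ocrconfig -delete /dev/raw/raw1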
Voting File Discovery
The method by which CSS identifies and locates voting files changed in 11.2. Before 11gR2, the voting files were located via a lookup in the OCR; in 11gR2, voting files are located via a Grid Plug and Play (GPnP) query. GPnP, a new component in the 11gR2
Clusterware stack, allows other GI stack components to query or modify
cluster-generic (non-node-specific) attributes. For example, the cluster
name and network profiles are stored in the GPnP profile. The GPnP
configuration, which consists of the GPnP profile and wallet, is created
during the GI stack installation. The GPnP profile is an XML file that
contains bootstrap information necessary to form a cluster. This profile
is identical on every peer node in the cluster. The profile is managed
by gpnpd and exists on every node (in gpnpd caches). The profile should
never be edited because it has a profile signature that maintains its
integrity.
When the CSS component of the Clusterware
stack starts up, it queries the GPnP profile to obtain the disk
discovery string. Using this disk string, CSS performs a discovery to
locate its voting files.
The following is an example of a CSS GPnP
profile entry. To query the GPnP profile, the user should use the
supplied (in CRS ORACLE_HOME) gpnptool utility:
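As a sketch, gpnptool get dumps the profile XML; the CSS and ASM entries look something like the following (attribute values other than the discovery string are illustrative):

$GRID_HOME/bin/gpnptool get
...
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
<orcl:ASM-Profile id="asm" DiscoveryString="/dev/mapper/*" SPFile="+DATA/myclust/asmparameterfile/registry.253.xxxxxxxxx"/>
...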
The CSS voting file discovery string anchors
into the ASM profile entry; that is, it derives its DiscoveryString from
the ASM profile entry. The ASM profile lists the value in the ASM
discovery string as ‘/dev/mapper/*’. Additionally, ASM uses this GPnP
profile entry to locate its spfile.
Voting File Recovery
Here’s a question that is often heard: If
ASM houses the Clusterware files, then what happens if the ASM instance
is stopped? This is an important point about the relationship between
CSS and ASM. CSS and ASM do not communicate directly. CSS discovers its
voting files independently and outside of ASM. This is evident at
cluster startup when CSS initializes before ASM is available. Thus, if
ASM is stopped, CSS continues to access the voting files, uninterrupted.
Additionally, the voting file is backed up into the OCR at every
configuration change and can be restored with the crsctl command.
If all voting files are corrupted, you can restore them as described next.
Furthermore, if the cluster is down and cannot
restart due to lost voting files, you must start CSS in exclusive mode
to replace the voting files by entering the following command:
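As a sketch (run as root; the disk group name is an assumption):

# Start the stack in exclusive mode without CRSd (the -nocrs flag is available from 11.2.0.2)
crsctl start crs -excl -nocrs

# Re-create the voting files in the target disk group
crsctl replace votedisk +CRSDATA

# Stop the exclusive-mode stack and restart normally
crsctl stop crs -f
crsctl start crs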
Oracle Cluster Registry (OCR)
Oracle Clusterware 11gR2 provides the
ability to store the OCR in ASM. Up to five OCR files can be stored in
ASM, although each has to be stored in a separate disk group.
The OCR is created, along with the voting
disk, when root.sh of the OUI installation is executed. The OCR is
stored in an ASM disk group as a standard ASM file with the file type
OCRFILE. The OCR file is stored like other ASM files and striped across
all the disks in the disk group. It also inherits the redundancy of the
disk group. To determine which ASM disk group the OCR is stored in, view
the default configuration location at /etc/oracle/ocr.loc:
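For example (the disk group name is an assumption), the file contents look like this:

cat /etc/oracle/ocr.loc
ocrconfig_loc=+DATA
local_only=FALSE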
The disk group that houses the OCR file is automounted by the ASM instance during startup.
All 11gR2 OCR commands now support the
ASM disk group. From a user perspective, OCR management and maintenance
works the same as in previous versions, with the exception of OCR
recovery, which is covered later in this section. As in previous
versions, the OCR is backed up automatically every four hours. However,
the new backup location is <GRID_HOME>/cdata/<cluster name>.
A single OCR file is stored when an external
redundancy disk group is used. It is recommended that for external
redundancy disk groups an additional OCR file be created in another disk
group for added redundancy. This can be done as follows:
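As a sketch (run as root; DATA2 is an assumed disk group name):

ocrconfig -add +DATA2
ocrcheck      # verify that both OCR locations are listed and pass the integrity check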
In an ASM redundancy disk group, the ASM
partnership and status table (PST) is replicated on multiple disks. In
the same way, there are redundant extents of OCR file stored in an ASM
redundancy disk group. Consequently, the OCR can tolerate the loss of the
same number of disks as the underlying disk group can, and it can be
relocated/rebalanced in response to disk failures. The ASM PST is
covered in Chapter 9.
OCR Recovery
When a process (OCR client) that wants to
read the OCR incurs a corrupt block, the OCR client I/O will
transparently reissue the read to the mirrored extents for a normal- or
high-redundancy disk group. In the background the OCR master (nominated
by CRS) provides a hint to the ASM layer identifying the corrupt disk.
ASM will subsequently start “check disk group” or “check disk,” which
takes the corrupt disk offline. This corrupt block recovery is only
possible when the OCR is configured in a normal- or high-redundancy disk
group.
In a normal- or high-redundancy disk group, users can recover from the corruption by taking either of the following steps:
Use the ALTER DISK GROUP CHECK statement if the disk group is already mounted.
Remount
the disk group with the FORCE option, which also takes the disk offline
when it detects the disk header corruption.
If you are using an external redundancy disk group, you must restore the OCR from backup
to recover from a corruption. Starting in Oracle Clusterware 11.2.0.3,
the OCR backup can be stored in a disk group as well.
The workaround is to configure an additional
OCR location on a different storage location using the ocrconfig -add
command. OCR clients can tolerate a corrupt block returned by ASM, as
long as the same block from the other OCR locations (mirrors) is not
corrupt. The following guidelines can be used to set up a redundant OCR
copy:
Ensure
that the ASM instance is up and running with the required disk group
mounted, and/or check the ASM alert.log for the status of the ASM instance.
Verify
that the OCR files were properly created in the disk group, using
asmcmd ls. Because the Clusterware stack keeps accessing OCR files, most
of the time the error will show up as a CRSD error in the crsd.log. Any
errors related to an ocr* command will generate a trace file in the
Grid_home/log/<hostname>/client directory; look for kgfo, kgfp, or
kgfn at the top of the error stack.
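As a sketch (the disk group and cluster names are assumptions), the OCR file and its health can be verified as follows:

# The OCR is stored as file number 255 of type OCRFILE under the cluster name directory
asmcmd ls -l +DATA/mycluster/OCRFILE/

# Run an OCR integrity check as root
ocrcheck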
Use Case Example
A customer has an existing three-node cluster with an 11gR1
stack (CRS 11.1.0.7; ASM 11.1.0.7; DB 11.1.0.7). They want to migrate
to a new cluster with new server hardware but the same storage. They
don’t want to install 11.1.0.7 on the new servers; they just want to
install 11.2.0.3. In other words, instead of doing an upgrade, they want
to create a new “empty” cluster and then “import” the ASM disks into
the 11.2 ASM instance. Is this possible?
Yes. To make this solution work, you will
install the GI stack and create a new cluster on the new servers, stop
the old cluster, and then rezone the SAN paths to the new servers.
During the GI stack install, when you’re prompted in the OUI to
configure the ASM disk group for a storage location for the OCR and
voting files, use the drop-down box to use an existing disk group. The
other option is to create a new disk group for the Clusterware files and
then, after the GI installation, discover and mount the old 11.1.0.7
disk group. You will need to do some post-install work to register the
databases and services with the new cluster.
The Quorum Failure Group
In certain circumstances, customers might want to build a stretch cluster. A stretch cluster
provides protection from site failure by allowing a RAC configuration
to be set up across distances greater than what’s typical “in the data
center.” In these RAC configurations, a third voting file must be
created at a third location for cluster arbitration. In pre-11gR2 configurations, users set up this third voting file on a NAS from a third location. In 11gR2, the third voting file can now be stored in an ASM quorum failure group.
The “Quorum Failgroup” clause was introduced
for Extended RAC setups and/or for setups with disk groups that
have only two disks (and therefore only two failure groups) but want to
use normal redundancy.
A quorum failure group is a special type of
failure group where the disks do not contain user data and are not
considered when determining redundancy requirements. Unfortunately,
during GI stack installation, the OUI does not offer the capability to
create a quorum failure group. However, this can be set up after the
installation. In the following example, we create a disk group with a
failure group and optionally a quorum failure group if a third array is
available:
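The following sketch assumes two storage arrays presented as /dev/mapper devices and an NFS-based quorum device; all names and paths are illustrative:

CREATE DISKGROUP DATA NORMAL REDUNDANCY
  FAILGROUP fg1 DISK '/dev/mapper/array1_lun1'
  FAILGROUP fg2 DISK '/dev/mapper/array2_lun1'
  QUORUM FAILGROUP fg3 DISK '/voting_nfs/vote_disk_01'
  ATTRIBUTE 'compatible.asm' = '11.2.0.0.0';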
If the disk group creation was done using
ASMCA, then after we add a quorum disk to the disk group, Oracle
Clusterware will automatically change the CSS vote disk location to the
following:
Clusterware Startup Sequence—Bootstrap If OCR Is Located in ASM
Oracle Clusterware 11g Release 2 introduces an integral component called the cluster agents. These agents are highly available, multithreaded daemons that implement entry points for multiple resource types.
ASM has to be up with the disk group mounted
before any OCR operations can be performed. OHASd maintains the resource
dependency and will bring up ASM with the required disk group mounted
before it starts the CRSd. Once ASM is up with the disk group mounted,
the usual ocr* commands (ocrcheck, ocrconfig, and so on) can be used. Figure 2-4 displays the client connections into ASM once the entire stack, including the database, is active.
NOTE
This lists the processes connected to ASM using the OS ps command. Note that most of these are bequeath connections.
The following output displays a similar listing but from an ASM client perspective:
There will be an ASM client listed for the connection OCR:
Here, +data.255 is the OCR file number, which is used to identify the OCR file within ASM.
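A query against V$ASM_CLIENT produces a listing along these lines (instance and database names will vary):

SELECT group_number, instance_name, db_name, status
FROM   v$asm_client;

-- Typical clients include the RDBMS instance(s), the ASM instance itself,
-- and an entry for the OCR, reported against the OCR file (such as +data.255).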
The voting files, OCR, and spfile are processed differently at bootstrap:
Voting file The
GPnP profile contains the disk group name where the voting files are
kept. The profile also contains the discovery string that covers the
disk group in question. When CSS starts up, it scans each disk group for
the matching string and keeps track of the ones containing a voting
disk. CSS then directly reads the voting file.
ASM spfile The
ASM spfile location is recorded in the disk header(s), which has the
spfile data. It is always just one AU. The logic is similar to CSS and
is used by the ASM server to find the parameter file and complete the
bootstrap.
OCR file OCR is stored as a regular ASM file. Once the ASM instance comes up, it mounts the disk group needed by the CRSd.
Disk Groups and Clusterware Integration
Before discussing the relationship of ASM
and Oracle Clusterware, it’s best to provide background on CRS modeling,
which describes the relationship between a resource, the resource
profile, and the resource relationship. A resource, as described
previously, is any entity that is being managed by CRS—for example,
physical (network cards, disks, and so on) or logical (VIPs, listeners,
databases, disk groups, and so on). The resource relationship defines
the dependency between resources (for example, state dependencies or
proximities) and is considered to be a fundamental building block for
expressing how an application’s components interact with each other. Two
or more resources are said to have a relationship when one (or both)
resource(s) either depends on or affects the other. For example, CRS
modeling mandates that the DB instance resource depend on the ASM
instance and the required disk groups.
As discussed earlier, because Oracle Clusterware version 11gR2
allows the Clusterware files to be stored in ASM, the ASM resources are
also managed by CRS. The key resource managed by CRS is the ASM disk
group resource.
Oracle Clusterware 11g Release 2
introduces a new agent concept that makes cluster resource management
very efficient and scalable. These agents are multithreaded daemons that
implement entry points for multiple resource types and spawn new
processes for different users. The agents are highly available and,
besides oraagent, orarootagent, and cssdagent/cssdmonitor, there can be
an application agent and a script agent. The two main agents are
oraagent and orarootagent. As the names suggest, oraagent and
orarootagent manage resources owned by Oracle and root, respectively. If
the CRS user is different from the ORACLE user, then CRSd would utilize
two oraagents and one orarootagent. The main agents perform different
tasks with respect to ASM. For example, oraagent performs the
start/stop/check/clean actions for ora.asm, database, and disk group
resources, whereas orarootagent performs start/stop/check/clean actions
for the ora.diskmon and ora.drivers.acfs resources.
The following output shows typical ASM-related CRS resources:
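As a sketch, the ASM-related resources can be listed with crsctl (the disk group names shown are assumptions):

crsctl stat res -t | egrep 'ora\.asm|\.dg'
# Typical entries include ora.asm and one ora.<DGNAME>.dg resource per disk
# group, for example ora.DATA.dg and ora.FRA.dg. The ACFS driver resource
# (ora.drivers.acfs) belongs to the lower stack and is visible with
# crsctl stat res -t -init.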
When the disk group is created, the disk group
resource is automatically created with the name, ora.<DGNAME>.dg,
and the status is set to ONLINE. The status OFFLINE will be set if the
disk group is dismounted, because this is a CRS-managed resource now.
When the disk group is dropped, the disk group resource is removed as
well. A dependency between the database and the disk group is
automatically created when the database tries to access the ASM files.
More specifically, a “hard” dependency type is created for the following
file types: datafiles, controlfiles, online logs, and SPFile. These
are the files that are absolutely needed to start up the database; for
all other files, the dependency is set to weak. This becomes important
when there are more than two disk groups: one for archive, another for
flash or temp, and so on. However, when the database no longer uses the
ASM files or the ASM files are removed, the database dependency is not
removed automatically. This must be done using the srvctl command-line
tool.
The following database CRS profile illustrates the dependency relationships between the database and ASM:
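As a sketch (the database name orcl and the disk groups DATA and FRA are assumptions), the dependency attributes can be pulled from the resource profile:

crsctl stat res ora.orcl.db -p | grep DEPENDENCIES
START_DEPENDENCIES=hard(ora.DATA.dg,ora.FRA.dg) weak(type:ora.listener.type,uniform:ora.ons) pullup(ora.DATA.dg,ora.FRA.dg)
STOP_DEPENDENCIES=hard(intermediate:ora.asm,shutdown:ora.DATA.dg,shutdown:ora.FRA.dg)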
Summary
The tighter integration between ASM and
Oracle Clusterware provides the capability for quickly deploying new
applications as well as managing changing workloads and capacity
requirements. This agility and elasticity are key drivers for the
Private Database Cloud. In addition, the ASM/Clusterware integration
with the database is the platform at the core of Oracle’s
Engineered Systems.
ASM Disks and Disk Groups
The first task
in building the ASM infrastructure is to discover and place disks under
ASM management. This step is best done with the coordination of storage
and system administrators. In storage area network (SAN) environments,
it is assumed that the disks are identified and configured
correctly—that is, they are appropriately zoned or “LUN masked” within
the SAN fabric and can be seen by the operating system (OS). Although
the concepts in this chapter are platform generic, we specifically show
examples using the Linux or Solaris platforms.
ASM Storage Provisioning
Before disks can be added to ASM, the
storage administrator needs to identify a set of disks or logical
devices from a storage array. Note that the term disk is used loosely because a disk can be any of the following:
An entire disk spindle
A partition of a physical disk spindle
An aggregation of several disk partitions across several disks
A logical device carved from a RAID (redundant array of independent drives) group set
A file created from an NFS file system
Once the preceding devices are created, they
are deemed logical unit numbers (LUNs). These LUNs are then presented to
the OS as logical disks.
In this book, we refer generically to LUNs or disks presented to the OS as simply disks. The terms LUN and disk may be used interchangeably.
DBAs and system administrators are often in
doubt as to the maximum LUN size they can use without performance
degradation, or as to the LUN size that will give the best performance.
For example, will 1TB- or 2TB-sized LUNs perform the same as 100GB- or
200GB-sized LUNs?
Size alone should not affect the performance
of an LUN. The underlying hardware, the number of disks that compose an
LUN, and the read-ahead and write-back caching policy defined on the LUN
all, in turn, affect the speed of the LUN. There is no magic number for
the LUN size or the number of ASM disks in the disk group.
Seek the advice of the storage vendor for the
best storage configuration for performance and availability, because
this may vary between vendors.
Given the database size and storage hardware
available, the best practice is to create larger LUNs (to reduce LUN
management) and, if possible, generate LUNs from a separate set of
storage array RAID sets so that the LUNs do not share drives. If the
storage array is a low-end commodity storage unit and storage RAID will
not be used, then it is best to employ ASM redundancy and use entire
drives as ASM disks. Additionally, the ASM disk size is the minimal
increment by which a disk group’s size can change.
NOTE
The maximum disk size for an ASM disk in pre-12c configurations is 2TB, and the minimum disk size is 4MB.
Users should create ASM disks with sizes less than 2TB in pre-12c environments. A message such as the following will be thrown if users specify ASM candidate disks that are greater than 2TB:
ASM Storage Device Configuration
This section details the steps and
considerations involved in configuring storage devices presented to the
operating system that were provisioned in the earlier section. This
function is typically performed by the system administrator or an ASM
administrator (that is, someone with root privileges).
Typically, disks presented to the OS can be
seen in the /dev directory on Unix/Linux systems. Note that each OS has
its unique representation of small computer system interface (SCSI) disk
naming. For example, on Solaris systems, disks usually have the SCSI
name format cwtxdysz, where c is the controller number, t is the target, d is the LUN/disk number, and s is the partition. Creating a partition serves three purposes:
To
skip the OS label/VTOC (volume table of contents). Different operating
systems have varying requirements for the OS label—that is, some may
require an OS label before it is used, whereas others do not. For
example, on a Solaris system, it is a best practice to create a
partition on the disk, such as partition 4 or 6, that skips the first
1MB into the disk.
To
create a placeholder to identify that the disk is being used because an
unpartitioned disk could be accidentally misused or overwritten.
To preserve alignment between ASM striping and storage array internal striping.
The goal is to align the ASM file extent
boundaries with any striping that may be done in the storage array. The
Oracle database does a lot of 1MB input/outputs (I/Os) that are aligned
to 1MB offsets in the data files. It is slightly less efficient to
misalign these I/Os with the stripes in the storage array, because
misalignment can cause one extra disk to be involved in the I/O.
Although this misalignment may not affect the latency of that particular
I/O, it reduces the overall throughput of the system by increasing the
number of disk seeks. This misalignment is independent of the operating
system. However, some operating systems may make it more difficult to
control the alignment or may add more offsets to block 0 of the ASM
disk.
The disk partition used for an ASM disk is
best aligned at 1MB within the LUN, as presented to the OS by the
storage. ASM uses the first allocation unit of a disk for metadata,
which includes the disk header. The ASM disk header itself is in block 0
of the disk given to ASM as an ASM disk.
Aligning ASM disk extent boundaries to storage
array striping only makes sense if the storage array striping is a
power of 2; otherwise, it is not much of a concern.
The alignment issue would be solved if we
could start the ASM disk at block 0 of the LUN, but that does not work
on some operating systems (Solaris, in particular). On Linux, you could
start the ASM disk at block 0, but then there is a chance an
administrator would run fdisk on the LUN and destroy the ASM disk
header. Therefore, we always recommend using a partition rather than
starting the ASM disk at block 0 of the LUN.
ASM Disk Device Discovery
Once the disks are presented to the OS, ASM
needs to discover them. This requires that the disk devices (Unix
filenames) have their ownership changed from root to the software owner
of Grid Infrastructure stack. The system administrator usually makes
this change. In our example, disks c3t19d5s4, c3t19d16s4, c3t19d17s4,
and c3t19d18s4 are identified, and their ownership is set to
oracle:dba. Now ASM must be configured to discover these disks. This is
done by defining the ASM init.ora parameter ASM_DISKSTRING. In our
example, we will use the following wildcard setting:
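A sketch of the parameter setting for the disks named above:

# ASM instance init.ora/spfile parameter
ASM_DISKSTRING = '/dev/rdsk/c3t19d*s4'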
An alternative to using standard SCSI names (such as cwtxdysz
or /dev/sdda) is to use special files. This option is useful when
establishing standard naming conventions and for easily identifying ASM
disks, such as asmdisk1. This option requires creating special files
using the mknod command or udev-generated names on Linux.
The following is a use case example of mknod.
To create a special file called asmdisk1 for a preexisting device
partition called c3t19d7s4, you can determine the OS major number and
minor number as follows:
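As a sketch, listing the raw device shows the major and minor numbers in place of the file size (ownership and date fields are illustrative):

ls -lL /dev/rdsk/c3t19d7s4
crw-r-----   1 root     sys       32, 20 Jan 10 10:15 /dev/rdsk/c3t19d7s4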
NOTE
Major and minor numbers are associated with
the device special files in the /dev directory and are used by the
operating system to determine the actual driver and device to be
accessed by the user-level request for the special device file.
The preceding example shows that the major and minor device numbers for this device are 32 and 20, respectively. The c at the beginning indicates that this is a character (raw) file.
After obtaining the major and minor numbers,
use the mknod command to create the character and block special files
that will be associated with c3t19d7s4. A special file called
/dev/asmdisk can be created under the /dev directory, as shown:
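As a sketch, using the major and minor numbers obtained above (run as root):

mkdir -p /dev/asmdisk
mknod /dev/asmdisk/asmdisk1 c 32 20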
Listing the special file shows the following:
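A sketch of the listing (ownership and timestamps are illustrative):

ls -l /dev/asmdisk
crw-r--r--   1 root     root      32, 20 Jan 10 10:22 asmdisk1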
Notice that this device has the same major and minor numbers as the native device c3t19d7s4.
For this partition (or slice) to be accessible to the ASM instance, change the permissions on this special file to the appropriate oracle user permissions:
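As a sketch (the oracle:dba ownership follows the earlier example; your grid software owner and group may differ):

chown oracle:dba /dev/asmdisk/asmdisk1
chmod 660 /dev/asmdisk/asmdisk1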
Repeat this step for all the required disks
that will be discovered by ASM. Now the slice is accessible by the ASM
instance. The ASM_DISKSTRING can be set to /dev/asmdisk/*. Once
discovered, the disk can be used as an ASM disk.
NOTE
It is not recommended that you create mknod
devices in the /dev/asm directory because the /dev/asm path is reserved
for ACFS to place ACFS configuration files and ADVM volumes. During 11gR2
Clusterware installation or upgrade, the root.sh or rootupgrade.sh
script may remove and re-create the /dev/asm directory, causing the
original mknod devices to be deleted. Be sure to use a different
directory instead, such as /dev/asmdisk.
ASM discovers all the required disks that make
up the disk group using “on-disk” headers and its search criteria
(ASM_DISKSTRING). ASM scans only for disks that match that ASM search
string. There are two forms of ASM disk discovery: shallow and deep. For
shallow discovery, ASM simply scans the disks that are eligible to be
opened. This is equivalent to executing “ls -l” on all the disk devices
that have the appropriate permissions. For deep discovery, ASM opens
each of those eligible disk devices. In most cases, ASM discoveries are
deep, the exception being when the *_STAT tables are queried instead of
the standard tables.
NOTE
For ASM in clustered environments, it is not
necessary to have the same pathname or major or minor device numbers
across all nodes. For example, node1 could access a disk pointed to by
path /dev/rdsk/c3t1d4s4, whereas node2 could present /dev/rdsk/c4t1d4s4
for the same device. Although ASM does not require that the disks have
the same names on every node, it does require that the same disks be
visible to each ASM instance via that instance’s discovery string. In
the event that pathnames differ between ASM nodes, the only necessary
action is to modify the ASM_DISKSTRING to match the search path. This is
a non-issue on Linux systems that use ASMLIB, because ASMLIB handles
the disk search and scan process.
Upon successful discovery, the V$ASM_DISK view
on the ASM instance reflects which disks were discovered. Note that
henceforth all views, unless otherwise stated, are queried from the ASM
instance and not from the RDBMS instance.
The following example shows the disks that
were discovered using the defined ASM_DISKSTRING. Notice that the NAME
column is empty and the GROUP_NUMBER is set to 0. This is because disks
were discovered that are not yet associated with a disk group.
Therefore, they have a null name and a group number of 0.
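A sketch of such a query:

SELECT group_number, name, header_status, path
FROM   v$asm_disk;

-- Newly discovered, unassigned disks show GROUP_NUMBER = 0, a NULL NAME,
-- and a HEADER_STATUS of CANDIDATE (or PROVISIONED if ASMLIB was used).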
In an Exadata environment, the physical disks on the storage cells are called cell disks.
Grid disks are created from the cell disks and are presented to ASM via
the LIBCELL interface; they are used to create disk groups in Exadata.
The default value for ASM_DISKSTRING in Exadata is ‘o/*/*’.
Note that these Exadata disks as presented by
LIBCELL are not presented to the OS as block devices, but rather as
internal network devices; they are not visible at the OS level. However,
the kfod tool can be used to verify ASM disk discovery. The following
shows kfod output of grid disks:
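A sketch of a kfod invocation; the grid disk names shown are illustrative:

$GRID_HOME/bin/kfod disks=all asm_diskstring='o/*/*'
# Each grid disk is reported with a path of the form
#   o/<cell IP address>/<grid disk name>
# for example, o/192.168.10.12/DATA_CD_00_cell01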
The preceding output shows the following:
The grid disks are presented from three storage cells (192.168.10.12, 192.168.10.13, and 192.168.10.14).
Disks have various header statuses that
reflect their membership state with a disk group. Disks can have the
following header statuses:
FORMER This state declares that the disk was formerly part of a disk group.
CANDIDATE A disk in this state is available to be added to a disk group.
MEMBER This state indicates that a disk is already part of a disk group. Note that the disk group may or may not be mounted.
PROVISIONED This
state is similar to CANDIDATE, in that it is available to be added to
disk groups. However, the provisioned state indicates that this disk has
been configured or made available using ASMLIB.
Note that ASM never marks disks as
CANDIDATE; a HEADER_STATUS of CANDIDATE is simply the outcome of
ASM disk discovery evaluating an unused disk. If a disk is dropped by ASM via a
normal DROP DISK, the header status becomes FORMER.
However, if a disk is taken offline and subsequently force dropped, the
HEADER_STATUS remains MEMBER.
The following is a useful query to run to view the status of disks in the ASM system:
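A sketch of such a query:

SELECT name, path, header_status, mode_status, state, failgroup
FROM   v$asm_disk
ORDER  BY group_number, disk_number;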
The views V$ASM_DISK_STAT and
V$ASM_DISKGROUP_STAT are identical to V$ASM_DISK and V$ASM_DISKGROUP.
However, the V$ASM_DISK_STAT and V$ASM_DISKGROUP_STAT views are polled
from memory and are based on the last deep disk discovery. Because these
new views provide efficient lightweight access, Enterprise Manager (EM)
can periodically query performance statistics at the disk level and
aggregate space usage statistics at the disk group level without
incurring significant overhead.
Third-Party Volume Managers and ASM
Although it is not a recommended practice,
host volume managers such as Veritas VxVM and IBM LVM can sit below ASM.
For example, a logical volume manager (LVM) can create raw logical
volumes and present these as disks to ASM. However, the third-party LVM
should not use any host-based mirroring or striping. ASM algorithms are
based on the assumption that I/Os to different disks are relatively
independent and can proceed in parallel. If any of the volume manager
virtualization features are used beneath ASM, the configuration becomes
too complex and confusing and can needlessly incur overhead, such as the
maintenance of a dirty region log (DRL). DRL is discussed in greater
detail later in this chapter.
In a clustered environment, such a
configuration can be particularly expensive. ASM does a better job of
providing this configuration’s functionality for database files.
Additionally, in RAC environments, if ASM were to run over third-party
volume managers, the volume managers must be cluster-aware—that is, they
must be cluster volume managers (CVMs).
However, it may make sense in certain cases to
have a volume manager under ASM (for example, when system administrators need
simplified management and tracking of disk assignments).
If a volume manager is used to create logical
volumes as ASM disks, the logical volumes should not use any LVM RAID
functionality.
Preparing ASM Disks on NFS
ASM supports Network File System (NFS) files
as ASM disks. To prepare NFS for ASM storage, the NAS NFS file system
must be made accessible to the server where ASM is running.
The following steps can be used to set up and configure ASM disks using the NFS file system:
1. On the NAS server, create the
file system. Depending on the NAS server, this will require creating
LUNs, creating RAID groups out of the LUNs, and finally creating a file
system from the block devices.
2. Export the NAS file system so
that it’s accessible to the host server running ASM. This mechanism will
differ based on the filer or NFS server being used. Typically this
requires the /etc/exports file to specify the NFS file system to be
remotely mounted.
3. On the host server, create the mount point where the NFS file system will be mounted.
4. Update /etc/fstab with an entry for the NFS file system, using the Oracle-recommended mount options.
5. Mount the NFS file system on the host server using the mount -a command.
6. Initialize the NFS file system files so they can be used as ASM disks (a consolidated sketch of steps 3 through 7 follows this list).
This step should be repeated to configure the appropriate number of disks.
7. Ensure that ASM can discover the newly created disk files (that is, check that the permissions are grid:asmadmin).
8. Set the ASM disk string appropriately when prompted in OUI for the ASM configuration.
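A consolidated sketch of steps 3 through 7; the mount point, filer export, file sizes, and grid:asmadmin ownership are assumptions:

# Step 3: create the mount point
mkdir -p /oradata/asmdisks

# Step 4: /etc/fstab entry using the Oracle-recommended NFS mount options
# nfsfiler:/vol/asmvol  /oradata/asmdisks  nfs \
#   rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0  0 0

# Step 5: mount the file system
mount -a

# Step 6: initialize zero-filled files to be used as ASM disks (repeat per disk)
dd if=/dev/zero of=/oradata/asmdisks/asmdisk1 bs=1M count=10240

# Step 7: make the files discoverable by the grid software owner
chown grid:asmadmin /oradata/asmdisks/asmdisk*
chmod 660 /oradata/asmdisks/asmdisk*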
It is very important to have the correct NFS
mount options set. If wrong mount options are set, an exception will be
thrown on file open. This is shown in the following listing. RAC and
Clusterware code uses the O_DIRECT flag for write calls, so the data
writes bypass the cache and go directly to the NFS server (thus avoiding
possible corruptions by an extra caching layer).
File system files that are opened as read-only
by all nodes (such as shared libraries) or files that are accessed by a
single node (such as trace files) can be on a mount point with the
mount option actimeo set to greater than 0. Only files that are
concurrently written and read by multiple nodes (such as database files,
application output files, and natively compiled libraries shared among
nodes) need to be on a mount point with actimeo set to 0. This not only
saves on network round trips for stat() calls, but the calls also don’t
have to wait for writes to complete. This could be a significant
speedup, especially for files being read and written by a single node.
Direct NFS
Oracle Database has built-in support for the
Network File System (NFS) client via Direct NFS (dNFS). dNFS, an
Oracle-optimized NFS client introduced in Oracle Database 11gR1, is built directly into the database kernel.
dNFS provides faster performance than the
native OS NFS client driver because it bypasses the OS. Additionally,
once dNFS is enabled, very little user configuration or tuning is
required. Data is cached just once in user space, so there’s no second
copy in kernel space. dNFS also provides implicit network interface load
balancing. ASM supports the dNFS client that integrates the NFS client
functionality directly in the Oracle Database software stack. If you are
using dNFS for RAC configurations, some special considerations need to
be made. dNFS cannot be used to store (actually access) voting files.
The reason for this lies in how voting files are accessed. CSS is a
multi-threaded process and dNFS (in its current state) is not thread
safe. OCR files and other cluster files (including database files) are
accessed using ASM file I/O operations.
Note that ASM Dynamic Volume Manager (Oracle ADVM) does not currently support NFS-based ASM files.
Preparing ASM Disks on OS Platforms
This section illustrates the specific tasks needed to configure ASM for the specific operating systems and environments.
Linux
On Intel-based systems such as
Linux/Windows, the first 63 blocks have been reserved for the master
boot record (MBR). Therefore, the first data partition starts at an
offset of 31.5KB (that is, 63 times 512 bytes equals 31.5KB).
This offset can cause misalignment on many
storage arrays’ memory cache or RAID configurations, causing performance
degradation due to overlapping I/Os. This performance impact is
especially evident for large block I/O workloads, such as parallel query
processing and full table scans.
The following shows how to manually perform
the alignment using sfdisk against an EMC Powerpath device. Note that
this procedure is applicable to any OS device that needs to be
partition-aligned.
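A minimal sketch, assuming a PowerPath pseudo device named /dev/emcpowera and an older util-linux sfdisk that accepts sector units (-uS); the partition is started at sector 2048 to give a 1MB offset:

echo "2048,," | sfdisk -uS /dev/emcpowera

# Verify the partition start sector
sfdisk -uS -l /dev/emcpowera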
Solaris
This section covers some of the nuances of
creating disk devices in a Solaris environment. The Solaris format
command is used to create OS slices. Note that slices 0 and 2 (for SMI
labels) cannot be used as ASM disks because these slices include the
Solaris VTOC. An example of the format command output (partition map)
for the device follows:
Notice that slice 4 is created and that it skips four cylinders, thus offsetting past the VTOC.
Use the logical character device as listed in
the /dev/rdsk directory. Devices in this directory are symbolic links to
the physical device files. Here’s an example:
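For example (the link target shown is illustrative; ":e,raw" denotes the raw device for slice 4):

ls -l /dev/rdsk/c3t19d5s4
# lrwxrwxrwx ... /dev/rdsk/c3t19d5s4 -> ../../devices/.../sd@13,0:e,raw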
To change the permission on these devices, do the following:
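As a sketch (ownership follows the oracle:dba convention used earlier):

chown oracle:dba /dev/rdsk/c3t19d*s4
chmod 660 /dev/rdsk/c3t19d*s4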
Now the ASM instance can access the slice. Set
the ASM_DISKSTRING to /dev/rdsk/c*s4. Note that the actual disk string
differs in each environment.
AIX
This section describes how to configure ASM
disks for AIX. It also recommends some precautions that are necessary
when using AIX disks.
In AIX, a disk is assigned a physical volume
identifier (PVID) when it is first assigned to a volume group or when it
is manually set using the AIX chdev command. When the PVID is assigned,
it is stored on the physical disk and in the AIX server’s system object
database, called Object Data Manager (ODM). The PVID resides in
the first 4KB of the disk and is displayed using the AIX lspv command.
In the following listing, the first two disks have PVIDs assigned and
the others do not:
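A sketch of such a listing (the PVIDs shown are placeholders):

lspv
hdisk0     00f62a1b8c9d0e1f     rootvg      active
hdisk1     00f62a1b8c9d1a2b     rootvg      active
hdisk2     none                 None
hdisk3     none                 None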
If a PVID-assigned disk is incorporated into
an ASM disk group, ASM will write an ASM disk header on the first 40
bytes of the disk, thus overwriting the PVID. Although initially no
problems may arise, on the subsequent reboot the OS, in coordination
with the ODM database, will restore the PVID onto the disk, thus
destroying the ASM disk and potentially resulting in data loss.
Therefore, it is a best practice on AIX not to
include a PVID on any disk that ASM will use. If a PVID does exist and
ASM has not used the disk yet, you can clear the PVID by using the AIX
chdev command.
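As a sketch (hdisk3 is an assumed device name):

chdev -l hdisk3 -a pv=clear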
Functionality was added to AIX to help prevent
corruption such as this. AIX commands that write to the LVM information
block have special checking added to determine if the disk is already
in use by ASM. This mechanism is used to prevent these disks from being
assigned to the LVM, which would result in the Oracle data becoming
corrupted. Table 4-1 lists the command and the corresponding AIX version where this checking is done.
AIX 6.1 and AIX 7.1 LVM commands contain new
functionality that can be used to better manage AIX devices used by
Oracle. This new functionality includes commands to better identify
shared disks across multiple nodes, the ability to assign a meaningful
name to a device, and a locking mechanism that the system administrator
can use when the disk is assigned to ASM to help prevent the accidental
reuse of a disk at a later time. This new functionality is listed in Table 4-2,
along with the minimum AIX level providing that functionality. Note
that these manageability commands do not exist for AIX 5.3.
The following illustrates the disk-locking and -checking functionality:
Lock every raw disk used by ASM to protect it from accidental reuse. This can be done while Oracle RAC is active on the cluster:
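A sketch, assuming hdisk3 is one of the ASM disks and that the AIX level provides the lkdev command (see Table 4-2):

lkdev -l hdisk3 -a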
Then use the lspv command to check the status of the disks.
ASM and Multipathing
An I/O path generally consists of an
initiator port, fabric port, target port, and LUN. Each permutation of
this I/O path is considered an independent path. For example, in a
high-availability scenario where each node has two host bus adapter
(HBA) ports connected to two separate switch ports to two target ports
on the back-end storage to a LUN, eight paths are visible to that LUN
from the OS perspective (two HBA ports times two switch ports times two
target ports times one LUN equals eight paths).
Path managers discover multiple paths to a
device by issuing a SCSI inquiry command (SCSI_INQUIRY) to each
operating system device. For example, on Linux the scsi_id call queries a
SCSI device via the SCSI_INQUIRY command and leverages the vital
product data (VPD) page 0x80 or 0x83. A disk or LUN responds to the
SCSI_INQUIRY command with information about itself, including vendor and
product identifiers and a unique serial number. The output from this
query is used to generate a value that is unique across all SCSI devices
that properly support pages 0x80 or 0x83.
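As a sketch (the scsi_id binary path and flags vary by Linux distribution; /dev/sda is an assumed device):

/sbin/scsi_id -g -u -d /dev/sda
# prints the unique identifier derived from the VPD page 0x80/0x83 data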
Typically devices that respond to the
SCSI_INQUIRY with the same serial number are considered to be accessible
from multiple paths.
Path manager software also provides multipath
software drivers. Most multipathing drivers support multipath services
for fibre channel–attached SCSI-3 devices. These drivers receive naming
and transport services from one or more physical HBA devices. To support
multipathing, a physical HBA driver must comply with the multipathing
services provided by this driver. Multipathing tools provide the
following benefits:
They provide a single block device interface for a multipathed LUN.
They detect any component failures in the I/O path, including the fabric port, channel adapter, or HBA.
When a loss of path occurs, such tools ensure that I/Os are rerouted to the available paths, with no process disruption.
They reconfigure the multipaths automatically when events occur.
They ensure that failed paths get revalidated as soon as possible and provide auto-failback capabilities.
They
configure the multipaths to maximize performance using various
load-balancing methods, such as round robin, least I/Os queued, and
least service time.
When a given disk has several paths defined,
each one will be presented as a unique pathname at the OS level,
although they all reference the same physical LUN; for example, the LUNs
/dev/rdsk/c3t19d1s4 and /dev/rdsk/c7t22d1s4 could be pointing to the
same disk device. The multipath abstraction provides I/O load balancing
across the HBAs as well as nondisruptive failovers on I/O path failures.
ASM, however, can tolerate the discovery of
only one unique device path per disk. For example, if the ASM_DISKSTRING
is /dev/rdsk/*, then several paths to the same device will be
discovered and ASM will produce an error message stating this. A
multipath driver, which generally sits above this SCSI-block layer,
usually produces a pseudo device that virtualizes the subpaths. For
example, in the case of EMC’s PowerPath, you can use the ASM_DISKSTRING
setting of /dev/rdsk/emcpower*. When I/O is issued to this disk device,
the multipath driver intercepts it and provides the necessary load
balancing to the underlying subpaths.
Examples of multipathing software include
Linux Device Mapper, EMC PowerPath, Veritas Dynamic Multipathing (DMP),
Oracle Sun Traffic Manager, Hitachi Dynamic Link Manager (HDLM), Windows
MPIO, and IBM Subsystem Device Driver Path Control Module (SDDPCM).
Additionally, some HBA vendors, such as QLogic, also provide multipathing solutions.
NOTE
Users are advised to verify the vendor
certification of ASM/ASMLIB with their multipathing drivers, because
Oracle does not certify or qualify these multipathing tools. Although
ASM does not provide multipathing capabilities, it does leverage
multipathing tools as long as the path or device that they produce
brings back a successful return code from an fstat system call. Metalink
Note 294869.1 provides more details on ASM and multipathing.
Linux Device Mapper
Device mapper provides a kernel framework that
allows multiple device drivers to be stacked on top of each other.
This section describes device mapper as it relates to
ASM and ASMLIB or Udev, because these components are often used
together, and provides a high-level overview of Linux
Device Mapper and Udev in order to support ASM more effectively.
Linux device mapper’s subsystem is the core
component of the Linux multipath chain. The component provides the
following high-level functionality:
A single logical device node for multiple paths to a single storage device.
I/O
gets rerouted to the available paths when a path loss occurs and there
is no disruption at the upper (user) layers because of this.
Device mapper is configured by using the
library libdevmapper. This library is used by multiple modules such as
dmsetup, LVM2, multipath tools, and kpartx, as shown in Figure 4-1.
Device mapper provides the kernel resident
mechanisms that support the creation of different combinations of
stacked target drivers for different block devices. At the highest level
of the stack is a single mapped device. This mapped device is
configured in the device mapper by passing map information about the
target devices to the device mapper via the libdevmapper library
interfaces. This role is performed by the multipath configuration tool
(discussed later). Each mapped segment consists of a start sector and
length and a target driver–specific number of target driver parameters.
Here’s an example of a mapping table for a multipath device with two underlying block devices:
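A hedged reconstruction of such a dmsetup table entry, consistent with the description that follows (the map name mpath1 is an assumption):

dmsetup table mpath1
0 1172123558 multipath 0 0 1 1 round-robin 0 2 1 65:208 1000 65:16 1000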
The preceding indicates that the starting
sector is 0, the length is 1172123558, and the driver is multipath
followed by multipathing parameters. Two devices are associated with
this map, 65:208 and 65:16 (major:minor). The parameter 1000 indicates
that after 1,000 I/Os the second path will be used in a round-robin
fashion.
Here’s an example of a mapping table for a logical volume (LVM2) device with one underlying block device:
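A hedged reconstruction (the map name is an assumption, and the trailing value is the starting offset on the underlying device, assumed 0 here):

dmsetup table vg01-lvol1
0 209715200 linear 9:1 0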
This indicates that the starting sector is 0 and the length is 209715200 with a linear driver for device 9:1 (major:minor).
Udev
In earlier versions of Linux, all device nodes were statically created in the dev-FS (/dev) file system at installation time. This created a huge list of entries in /dev, most of which were unused because the corresponding devices were not actually connected to the system. The other problem with this approach was that the same device could receive a different name each time it was connected, because the kernel assigned device names on a first-come basis: the first SCSI device discovered was named /dev/sda, the second /dev/sdb, and so on. Udev resolves this problem by managing device nodes on demand. Also, name-to-device mappings are not based on the order in which the devices are detected but on a system of predefined rules. Udev relies on the kernel hot-plug mechanism to create device files in user space.
The discovery of the current configuration is
done by probing block device nodes created in Sysfs. This file system presents kernel objects such as block devices, buses, and drivers to user space in a hierarchical manner. A device node is created
by Udev in reaction to a hot-plug event generated when a block device’s
request queue is registered with the kernel’s block subsystem.
As shown in Figure 4-2, in the context of multipath implementation, Udev performs the following tasks:
The multipath user-space daemon listens for the addition and removal of paths. This ensures that the multipath device maps are always up to date with the physical topology.
The user-space callbacks triggered by path addition or removal also invoke the user-space tool kpartx to create maps for the device partitions.
Many users use symbolic links to point to ASM disks instead of using Udev. A common question we get is whether there is a difference (advantages/disadvantages) between using symbolic links and using mknod devices with respect to aliasing a device.
Neither one provides persistent naming. In
other words, if you use a mknod device, you have a new alias to the
major/minor number. However, if the device changes its major/minor
number for any reason, your mknod device will be stale, just like the
symlink will be stale. The only way to obtain persistence is by using
ASMLIB, ASM, or Udev. For example, in Udev you can either rename the block device itself (using the NAME keyword) or create a symlink to it (using the SYMLINK keyword). Here’s an example:
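(The scsi_id invocation, the UUID in the RESULT match, and the ownership settings shown here are illustrative and vary by platform and release.)

KERNEL=="sd*", SUBSYSTEM=="block", PROGRAM=="/sbin/scsi_id -g -u -d /dev/$name", RESULT=="360a98000686f6959684a453333524174", NAME="asmdisk1", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd*", SUBSYSTEM=="block", PROGRAM=="/sbin/scsi_id -g -u -d /dev/$name", RESULT=="360a98000686f6959684a453333524174", SYMLINK+="asmdisk1", OWNER="oracle", GROUP="dba", MODE="0660"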
The first rule renames the sd device to “asmdisk1”. The second rule leaves the sd device name alone and creates a new symlink called “asmdisk1” that points to the sd device. Both methods keep pointing to the correct device as identified by the UUID; obviously, you should use only one method or the other.
Multipath Configuration
Now that we’ve talked about the various
tools and modules involved in making Linux multipathing work, this
section discusses the configuration steps required to make it
operational.
The main configuration file that drives the
Linux multipathing is /etc/multipath.conf. The configuration file is
composed of the following sections:
Defaults The default values of attributes are used whenever no specific device setting is given.
Blacklist Devices that should be excluded from the multipath discovery.
Blacklist_exceptions Devices
that should be included in the multipath discovery despite being listed
in the blacklist section. This section is generally used when you have excluded many disks with wildcards in the blacklist section but still want to include a few of them.
Multipaths This
is the main section that defines the multipath topology. The values are
indexed by the worldwide identifier (WWID) of the device.
Devices Device-specific settings.
Following are some commonly used
configuration parameters in the multipath.conf file (for a complete
listing, refer to the man pages on your system):
path_grouping_policy This parameter defines the path grouping policy. Here are the possible values:
failover There is one active path. This is equivalent to an active/passive configuration.
multibus All the paths are in one priority group and are used in a load-balancing configuration. This is the default value.
group_by_serial/group_by_prio/group_by_node_name The paths are grouped by serial number, by priority (as determined by a user callout program), or by target node name.
prio_callout The default program and arguments to call out to obtain a path priority value.
path_checker The method used to determine the path’s state.
user_friendly_names When set to yes, multipath devices are named mpath<n> unless an alias is defined in the multipaths section.
no_path_retry Specifies the number of retries before queuing is disabled, or fail to cause immediate I/O failure (no queuing). The default is 0.
Here’s an example of a configuration file that is used for RAC cluster configurations:
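(A representative sketch; the WWIDs, the second alias, the permissions, and the user and group IDs are illustrative.)

blacklist {
    devnode "^asm"
    devnode "^ofsctl"
    wwid 360060e80104ce140004ce14000000001
    wwid 360060e80104ce140004ce14000000002
}

defaults {
    user_friendly_names     no
    path_grouping_policy    failover
    no_path_retry           5
}

multipaths {
    multipath {
        wwid    360060e80104ce140004ce14000000014
        alias   HDD__966617575
        mode    660
        uid     1100
        gid     1200
    }
    multipath {
        wwid    360060e80104ce140004ce14000000015
        alias   HDD__966617576
        mode    660
        uid     1100
        gid     1200
    }
}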
The first section blacklists the asm and
ofsctl (ACFS) devices. It also specifically blacklists the two disks
with specific WWIDs, which in the context of the sample setup were the
OS disks. The defaults section provides generic parameters that are
applicable in this setup. Finally, the multipaths section lists two
WWIDs and corresponding aliases to be assigned to the devices. The mode,
uid, and gid provide the created device permissions, user ID, and group
ID, respectively, after creation.
Setting Up the Configuration
To create the multipath device mappings after configuring the multipath.conf file, use the /sbin/multipath command:
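(Device names, WWIDs, and sizes are illustrative, and the exact output format depends on the multipath-tools release.)

# /sbin/multipath -v2
# /sbin/multipath -ll HDD__966617575
HDD__966617575 (360060e80104ce140004ce14000000014)
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
 \_ 3:0:0:12 sdau 66:224 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 1:0:0:12 sdk  8:160  [active][ready]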
In this example, two multipath devices are
created in failover configuration. The first priority group is the
active paths: The device /dev/sdau is the active path for HDD__966617575
and /dev/sdk is the passive (standby) path. The path is visible in the
/dev/mapper folder. Furthermore, the partition mappings for the devices
can be created on the device using the kpartx tool.
After these settings have completed, the devices can be used by ASM or other applications on the server.
Disk Group
The primary component of ASM is the disk group, which is the highest-level data structure in ASM (see Figure 4-3).
A disk group is essentially a container that consists of a logical
grouping of disks that are managed together as a unit. The disk group is
comparable to a logical volume manager’s (LVM’s) volume group.
A disk group can contain files from many
different Oracle databases. Allowing multiple databases to share a disk
group provides greater potential for improved disk utilization and
greater overall throughput. The Oracle database may also store its files
in multiple disk groups managed by the same ASM instance. Note that a
database file can only be part of one disk group.
ASM disk groups differ from typical LVM volume
groups in that ASM disk groups have inherent automatic file-level
striping and mirroring capabilities. A database file created within an
ASM disk group is distributed equally across all disks in the disk
group, which provides an even input/output (I/O) load.
Disk Group Management
ASM has three disk group types: external redundancy, normal redundancy, and high redundancy. The disk group type, which is defined at disk group creation time, determines the default level of mirroring performed by ASM. An external redundancy disk group indicates that striping is done by ASM and that mirroring is handled and managed by the storage array. For example, a user may create an external redundancy disk group where the storage array (SAN) is an EMC VMAX or Hitachi USP series. Because the core competency of these high-end arrays is mirroring, external redundancy is well suited for them. A common question is, does ASM striping conflict with the striping performed by the SAN? The answer is no; ASM striping is complementary to the SAN striping.
With ASM redundancy, ASM performs and manages
the mirroring. ASM redundancy is the core deployment strategy used in
Oracle’s Engineered Solutions, such as Oracle Exadata, Oracle
SuperCluster, and Oracle Database Appliance. ASM redundancy is also used
with low-cost commodity storage or when deploying stretch clusters. For
details on ASM redundancy, see the “ASM Redundancy and Failure Groups”
section later in this chapter. The next section focuses on creating disk
groups in an external redundancy environment.
Creating Disk Groups
The creation of a disk group involves validation of the disks to be added. The disks must have the following attributes:
They cannot already be in use by another disk group.
They must not have a preexisting valid ASM header. The FORCE option must be used to override this.
They
cannot have an Oracle file header (for example, from any file created
by the RDBMS). The FORCE option must be used to override this. Trying to create an ASM disk using a raw device that contains a data file results in the following error:
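(The exact error stack can vary by release; the device path is illustrative.)

SQL> CREATE DISKGROUP DATA DISK '/dev/raw/raw6';
CREATE DISKGROUP DATA DISK '/dev/raw/raw6'
*
ERROR at line 1:
ORA-15018: diskgroup cannot be created
ORA-15201: disk /dev/raw/raw6 contains a valid RDBMS file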
The disk header validation prevents ASM from
destroying any data device already in use. Only disks with a header
status of CANDIDATE, FORMER, or PROVISIONED are allowed to be included
in disk group creation. To add disks to a disk group with a header
status of MEMBER or FOREIGN, use the FORCE option in the disk group
creation. To prevent gratuitous use of the FORCE option, ASM allows it
only when using the NOFORCE option would fail. An attempt to use FORCE
when it is not required results in an ORA-15034 error (disk ‘%s’ does
not require the FORCE option). Use the FORCE option with extreme
caution, because it overwrites the data on the disk that was previously
used as an ASM disk or database file.
A disk without a recognizable header is considered a CANDIDATE. There is no persistent header status called “candidate.”
Once ASM has discovered the disks, they can be
used to create a disk group. To reduce the complexity of managing ASM
and its disk groups, Oracle recommends that generally no more than two
disk groups be maintained and managed per RAC cluster or single ASM
instance. The following are typical disk groups that are created by
customers:
DATA disk group This
is where active database files such as data files, control files,
online redo logs, and change-tracking files used in incremental backups
are stored.
Fast Recovery Area (FRA) disk group This
is where recovery-related files are created, such as multiplexed copies
of the current control file and online redo logs, archived redo logs,
backup sets, and flashback log files.
Having one DATA disk group means there’s only one place to store all your database files, and it obviates the need to juggle data files around or decide where to place a new tablespace, as in traditional file system configurations. Having one
disk group for all your files also means better storage utilization,
thus making the IT director and storage teams very happy. If more
storage capacity or I/O capacity is needed, just add an ASM disk and
ensure that this storage pool container houses enough spindles to
accommodate the I/O rate of all the database objects.
To provide higher availability for the
database, when a Fast Recovery Area is chosen at database creation time,
an active copy of the control file and one member set of the redo log
group are stored in the Fast Recovery Area. Note that additional copies
of the control file or extra log files can be created and placed in
either disk group, as desired.
RAC users can optionally create a CRSDATA disk
group to store Oracle Clusterware files (for example, voting disks and
the Oracle Cluster registry) or the ASM spfile. When deploying the
CRSDATA disk group for this purpose, you should minimally use ASM normal
redundancy with three failure groups. This is generally done to provide
added redundancy for the Clusterware files. See Chapter 2 for more details.
Note that creating additional disk groups for
storing database data does not necessarily improve performance. However,
additional disk groups may be added to support tiered storage classes
in Information Lifecycle Management (ILM) or Hierarchical Storage
Management (HSM) deployments. For example, a separate disk group can be
created for archived or retired data (or partitions), and these
partitioned tablespaces can be migrated or initially placed on a disk
group based on Tier2 storage (RAID5), whereas Tier1 storage (RAID10) can
be used for the DATA disk group.
ASM provides out-of-the-box enablement of
redundancy and optimal performance. However, the following items should
be considered to increase performance and/or availability:
Implement multiple access paths to the storage array using two or more HBAs or initiators.
Deploy multipathing software over these multiple HBAs to provide I/O load-balancing and failover capabilities.
Use
disk groups with similarly sized and performing disks. A disk group
containing a large number of disks provides a wide distribution of data
extents, thus allowing greater concurrency for I/O and reducing the
occurrences of hotspots. Because a large disk group can easily sustain
various I/O characteristics and workloads, a single DATA disk group can
be used to house database files, log files, and control files.
Use
disk groups with four or more disks, making sure these disks span
several back-end disk adapters. As stated earlier, Oracle generally
recommends no more than two to three disk groups. For example, a common
deployment can be four or more disks in a DATA disk group spanning all
back-end disk adapters/directors, and eight to ten disks for the FRA
disk group. The size of the FRA will depend on what is stored and how
much (that is, full database backups, incremental backups, flashback
database logs, and archive logs). Note that an active copy of the
control file and one member of each of the redo log group are stored in
the FRA.
A disk group can be created using SQL,
Enterprise Manager (EM), ASMCMD commands, or ASMCA. In the following
example, a DATA disk group is created using four disks that reside in a
storage array, with the redundancy being handled externally by the
storage array. The following query lists the available disks that will
be used in the disk group:
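(A representative query; the disk paths other than c3t19d19s4 are illustrative.)

SQL> SELECT name, path, header_status, mode_status
       FROM v$asm_disk
      WHERE path LIKE '/dev/rdsk/c3t19%';

NAME       PATH                     HEADER_STATUS  MODE_ST
---------- ------------------------ -------------- -------
           /dev/rdsk/c3t19d5s4      CANDIDATE      ONLINE
           /dev/rdsk/c3t19d16s4     CANDIDATE      ONLINE
           /dev/rdsk/c3t19d17s4     CANDIDATE      ONLINE
           /dev/rdsk/c3t19d19s4     FORMER         ONLINE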
Notice that one of the disks, c3t19d19s4, was dropped from a disk group and thus shows a status of FORMER.
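The disk group itself can then be created with a statement along these lines, reusing the paths from the query above:

SQL> CREATE DISKGROUP DATA EXTERNAL REDUNDANCY DISK
       '/dev/rdsk/c3t19d5s4',
       '/dev/rdsk/c3t19d16s4',
       '/dev/rdsk/c3t19d17s4',
       '/dev/rdsk/c3t19d19s4';

Diskgroup created.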
The output from V$ASM_DISKGROUP shows the newly created disk group:
The output from V$ASM_DISK shows the status of the disks once the disk group is created:
After the disk group is successfully created,
metadata information, which includes creation date, disk group name, and
redundancy type, is stored in the System Global Area (SGA) and on each
disk (in the disk header) within the disk group. Although it possible to
mount disk groups only on specific nodes of the cluster, this is
generally not recommended because it may potentially obstruct CRS
resource startup modeling.
Once these disks are under ASM management, all
subsequent mounts of the disk group reread and validate the ASM disk
headers. The following output shows how the V$ASM_DISK view reflects the
disk state change after the disk is incorporated into the disk group:
The output that follows shows entries from the
ASM alert log reflecting the creation of the disk group and the
assignment of the disk names:
When you’re mounting disk groups, either at
ASM startup or for subsequent mounts, it is advisable to mount all
required disk groups at once. This minimizes the overhead of multiple
ASM disk discovery scans. With Grid Infrastructure, agents will
automatically mount any disk group needed by a database.
ASM Disk Names
ASM disk names are assigned by default based
on the disk group name and disk number, but names can be defined by the
user either during ASM disk group creation or when disks are added. The
following example illustrates how to create a disk group where disk
names are defined by the user:
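(The disk group name and disk paths are illustrative.)

SQL> CREATE DISKGROUP DATA2 EXTERNAL REDUNDANCY DISK
       '/dev/rdsk/c3t20d5s4'  NAME DATA2_DISK1,
       '/dev/rdsk/c3t20d6s4'  NAME DATA2_DISK2,
       '/dev/rdsk/c3t20d7s4'  NAME DATA2_DISK3,
       '/dev/rdsk/c3t20d8s4'  NAME DATA2_DISK4;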
If disk names are not provided, ASM dynamically assigns a disk name with a sequence number to each disk added to the disk group:
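(Illustrative output for the DATA disk group created earlier, assuming it is group number 1.)

SQL> SELECT name, path FROM v$asm_disk WHERE group_number = 1;

NAME         PATH
------------ ------------------------
DATA_0000    /dev/rdsk/c3t19d5s4
DATA_0001    /dev/rdsk/c3t19d16s4
DATA_0002    /dev/rdsk/c3t19d17s4
DATA_0003    /dev/rdsk/c3t19d19s4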
The ASM disk name is used when performing disk management activities, such as DROP, ONLINE, OFFLINE, and RESIZE DISK.
The ASM disk name is different from the small
computer system interface (SCSI) address. ASM disk names are stored
persistently in the ASM header, and they persist even if the SCSI
address name changes. Persistent names also allow for consistent naming
across Real Application Clusters (RAC) nodes. SCSI address name changes
occur due to array reconfigurations and/or after OS reboots. There is no
persistent binding of disk numbers to pathnames used by ASM to access
the storage.
Disk Group Numbers
The lowest nonzero available disk group number is assigned on the first mount of a disk group in a cluster. In an ASM cluster, even if the disk groups are mounted in a different order on different nodes, the disk group number for a given disk group is consistent across the cluster; the disk group name, of course, never changes. For example, if node 1 has dgA as group number 1 and dgB as group number 2, and node 2 mounts only dgB, then dgB is still group number 2 on node 2, even though group number 1 is not in use there.
|
NOTE
Disk group numbers are never recorded
persistently, so there is no disk group number in a disk header. Only
the disk group name is recorded in the disk header.
|
Disk Numbers
Although disk group numbers are never
recorded persistently, disk numbers are recorded on the disk headers.
When an ASM instance starts up, it discovers all the devices matching
the pattern specified in the initialization parameter ASM_DISKSTRING and
for which it has read/write access. If it sees an ASM disk header, it
knows the ASM disk number.
Also, disks that are discovered but are not part of any mounted disk group are reported in disk group 0. A disk remains in disk group 0 until it is added to a disk group or until its disk group is mounted, at which point it is associated with the correct disk group.
ASM Redundancy and Failure Groups
For systems that do not use external
redundancy, ASM provides its own redundancy mechanism. This redundancy,
as stated earlier, is used extensively in Exadata and Oracle Database
Appliance systems. These Engineered Solutions are covered in Chapter 12.
A disk group is divided into failure groups,
and each disk is in exactly one failure group. A failure group (FG) is a
collection of disks that can become unavailable due to the failure of
one of its associated components. Possible failing components could be
any of the following:
Storage array controllers
Host bus adapters (HBAs)
Fibre Channel (FC) switches
Disks
Entire arrays, such as NFS filers
Thus, disks in two separate failure groups (for a given disk group) must not share a common failure component. If you define failure groups for your disk group, ASM can tolerate the simultaneous failure of multiple disks in a single failure group in a normal-redundancy disk group, or in two failure groups in a high-redundancy disk group.
ASM uses a unique mirroring algorithm. ASM
does not mirror disks; rather, it mirrors extents. When ASM allocates a
primary extent of a file to one disk in a failure group, it allocates a
mirror copy of that extent to another disk in another failure group.
Thus, ASM ensures that a primary extent and its mirror copy never reside
in the same failure group.
Each file can have a different level of
mirroring (redundancy) in the same disk group. For example, in a normal
redundancy disk group, with at least three failure groups, we can have
one file with (default) normal redundancy, another file with no
redundancy, and yet another file with high redundancy (triple
mirroring).
Unlike other volume managers, ASM has no
concept of a primary disk or a mirrored disk. As a result, to provide
continued protection in the event of failure, your disk group requires
only spare capacity; a hot spare disk is unnecessary. Redundancy for
disk groups can be either normal (the default), where files are two-way mirrored (requiring at least two failure groups), or high,
which provides a higher degree of protection using three-way mirroring
(requiring at least three failure groups). After you create a disk
group, you cannot change its redundancy level. If you want a different
redundancy, you must create another disk group with the desired
redundancy and then move the data files (using Recovery Manager [RMAN]
restore, the ASMCMD copy command, or DBMS_FILE_TRANSFER) from the
original disk group to the newly created disk group.
|
NOTE
Disk group metadata is always triple mirrored with normal or high redundancy.
|
Additionally, after you assign a disk to a
failure group, you cannot reassign it to another failure group. If you
want to move it to another failure group, you must first drop it from
the current failure group and then add it to the desired failure group.
However, because the hardware configuration usually dictates the choice
of a failure group, users generally do not need to reassign a disk to
another failure group unless it is physically moved.
Creating ASM Redundancy Disk Groups
The following simple example shows how to create a normal redundancy disk group using two failure groups over a NetApp filer:
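(The NFS file paths are illustrative.)

SQL> CREATE DISKGROUP DATA_NFS NORMAL REDUNDANCY
       FAILGROUP FLGRP1 DISK '/oradata/filer1/disk1', '/oradata/filer1/disk2'
       FAILGROUP FLGRP2 DISK '/oradata/filer2/disk1', '/oradata/filer2/disk2';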
The same create diskgroup command can be executed using wildcard syntax:
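(Again, the paths are illustrative.)

SQL> CREATE DISKGROUP DATA_NFS NORMAL REDUNDANCY
       FAILGROUP FLGRP1 DISK '/oradata/filer1/disk*'
       FAILGROUP FLGRP2 DISK '/oradata/filer2/disk*';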
In the following example, ASM normal
redundancy is being deployed over a low-cost commodity storage array.
This storage array has four internal trays, with each tray having four
disks. Because the failing component to isolate is the storage tray, the
failure group boundary is set for the storage tray—that is, each
storage tray is associated with a failure group:
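(The device paths are illustrative; each FAILGROUP clause lists the four disks of one tray.)

SQL> CREATE DISKGROUP DATA_NRML NORMAL REDUNDANCY
       FAILGROUP FLGRP1 DISK
         '/dev/rdsk/c4t0d0s4','/dev/rdsk/c4t0d1s4','/dev/rdsk/c4t0d2s4','/dev/rdsk/c4t0d3s4'
       FAILGROUP FLGRP2 DISK
         '/dev/rdsk/c4t1d0s4','/dev/rdsk/c4t1d1s4','/dev/rdsk/c4t1d2s4','/dev/rdsk/c4t1d3s4'
       FAILGROUP FLGRP3 DISK
         '/dev/rdsk/c4t2d0s4','/dev/rdsk/c4t2d1s4','/dev/rdsk/c4t2d2s4','/dev/rdsk/c4t2d3s4'
       FAILGROUP FLGRP4 DISK
         '/dev/rdsk/c4t3d0s4','/dev/rdsk/c4t3d1s4','/dev/rdsk/c4t3d2s4','/dev/rdsk/c4t3d3s4';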
ASM and Intelligent Data Placement
Short stroking (along with RAID 0
striping) is a technique that storage administrators typically use to
minimize the performance impact of head repositioning delays. This
technique reduces seek times by limiting the actuator’s range of motion
and increases the media transfer rate. This effectively improves overall
disk throughput. Short stroking is implemented by formatting a drive so
that only the outer sectors of the disk platter (which have the highest
track densities) are used to store heavily accessed data, thus
providing the best overall throughput. However, short stroking a disk
limits the drive’s capacity by using a subset of the available tracks,
resulting in reduced usable capacity.
Intelligent Data Placement (IDP), a feature introduced in Oracle Database 11g Release 2, emulates the short stroking technique without sacrificing usable capacity or redundancy.
IDP automatically defines disk region
boundaries on ASM disks for best performance. Using the disk region
settings of a disk group, you can place frequently accessed data on the
outermost (hot) tracks. In addition, files with similar access patterns
are located physically close, thus reducing latency. IDP also enables
the placement of primary and mirror extents into different hot or cold
regions.
The IDP feature primarily works on JBOD (just a
bunch of disks) storage or disk storage that has not been partitioned
(for example, using RAID techniques) by the array. Although IDP can be
used with external redundancy, it can only be effective if you know that
certain files are frequently accessed while other files are rarely
accessed and if the lower numbered blocks perform better than the higher
numbered blocks. IDP on external redundancy may not be highly
beneficial. Moreover, IDP over external redundancy is not a tested
configuration in Oracle internal labs.
The default region of IDP is COLD so that all
data is allocated on the lowest disk addresses, which are on the outer
edge of physical disks. When the disk region settings are modified for
an existing file, only new file extensions are affected. Existing file
extents are not affected until a rebalance operation is initiated. It is
recommended that you manually initiate an ASM rebalance when a
significant number of IDP file policies (for existing files) are
modified. Note that a rebalance may affect system throughput, so it
should be a planned change management activity.
IDP settings can be specified for a file or by
using disk group templates. The disk region settings can be modified
after the disk group has been created. IDP is most effective for the
following workloads and access patterns:
For
databases with data files that are accessed at different rates. A
database that accesses all data files in the same way is unlikely to
benefit from IDP.
For
ASM disk groups that are sufficiently populated with usable data. As a
best-practice recommendation, the disk group should be more than
25-percent full. With lesser populated disk groups, the IDP management
overhead may minimize the IDP benefits.
For
disks that have better performance at the beginning of the media
relative to the end. Because Intelligent Data Placement leverages the
geometry of the disk, it is well suited to JBOD (just a bunch of disks).
In contrast, a storage array with LUNs composed of concatenated volumes
masks the geometry from ASM.
To implement IDP, the COMPATIBLE.ASM and
COMPATIBLE.RDBMS disk group attributes must be set to 11.2 or higher.
IDP can be implemented and managed using ASMCA or the following SQL
commands:
ALTER DISKGROUP ... ADD or MODIFY TEMPLATE
ALTER DISKGROUP ... MODIFY FILE
These commands include the disk region clause for setting hot/mirrorhot or cold/mirrorcold regions in a template:
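(The template name and the fully qualified file name are illustrative.)

SQL> ALTER DISKGROUP DATA ADD TEMPLATE datafile_hot
       ATTRIBUTE (HOT MIRRORHOT);

SQL> ALTER DISKGROUP DATA MODIFY FILE '+DATA/orcl/datafile/users.259.723575217'
       ATTRIBUTE (HOT MIRRORHOT);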
IDP is also applicable for ADVM volumes. When
creating ADVM volumes, you can specify the region location for primary
and secondary extents:
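(A sketch using ASMCMD; the volume name and size are illustrative, and the exact option names should be verified against your release.)

ASMCMD> volcreate -G DATA -s 10G --primary hot --secondary cold acfsvol1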
Designing for ASM Redundancy Disk Groups
Note that with ASM redundancy, you are not
restricted to having two failure groups for normal redundancy and three
for high redundancy. In the preceding example, four failure groups are
created to ensure that disk partners are not allocated from the same
storage tray. Another such example can be found in Exadata. In a
full-frame Exadata, there are 14 failure groups.
There may be cases where users want to protect
against storage area network (SAN) array failures. This can be
accomplished by putting each array in a separate failure group. For
example, a configuration may include two NetApp filers and the
deployment of ASM normal redundancy such that each filer—that is, all
logical unit numbers (LUNs) presented through the filer—is part of an
ASM failure group. In this scenario, ASM mirrors the extent between the
two filers.
If the database administrator (DBA) does not
specify a failure group in the CREATE DISKGROUP command, a failure group
is automatically constructed for each disk. This method of placing
every disk in its own failure group works well for most customers. In
fact, in Oracle Database Appliance, all disks are assigned in this
manner.
In case of Exadata, the storage grid disks are
presented with extra information to the database server nodes, so these
servers know exactly how to configure the failure groups without user
intervention.
The choice of failure groups depends on the
kinds of failures that need to be tolerated without the loss of data
availability. For small numbers of disks (for example, fewer than 20),
it is usually best to put every disk in its own failure group.
Nonetheless, this is also beneficial for large numbers of disks when the
main concern is spindle failure. To protect against the simultaneous
loss of multiple disk drives due to a single component failure, an
explicit failure group specification should be used. For example, a disk
group may be constructed from several small modular disk arrays. If the
system needs to continue operation when an entire modular array fails,
each failure group should consist of all the disks in one module. If one
module fails, all the data on that module is relocated to other modules
to restore redundancy. Disks should be placed in the same failure group
if they depend on a common piece of hardware whose unavailability or
failure needs to be tolerated.
It is much better to have several failure
groups as long as the data is still protected against the necessary
component failures. Having additional failure groups provides better
odds of tolerating multiple failures. Failure groups of uneven capacity
can lead to allocation problems that prevent full utilization of all
available storage. Moreover, having failure groups of different sizes
can waste disk space. There may be enough room to allocate primary
extents, but no space available for secondary extents. For example, in a
disk group with six disks and three failure groups, if two disks are in
their own individual failure groups and the other four are in one
common failure group, the allocation will be very unequal. All the
secondary extents from the big failure group can be placed on only two
of the six disks. The disks in the individual failure groups fill up
with secondary extents and block additional allocation even though
plenty of space is left in the large failure group. This also places an
uneven read and write load on the two disks that are full because they
contain more secondary extents that are accessed only for writes or if
the disk with the primary extent fails.
Allocating ASM Extent Sets
With ASM redundancy, the first file extent
allocated is chosen as the primary extent, and the mirrored extent is
called the secondary extent. In the case of high redundancy, there will
be two secondary extents. This logical grouping of primary and secondary
extents is called an extent set. Each disk in a disk group
contains nearly the same number of primary and secondary extents. This
provides an even distribution of read I/O activity across all the disks.
All the extents in an extent set always
contain the exact same data because they are mirrored versions of each
other. When a block is read from disk, it is always read from the
primary extent, unless the primary extent cannot be read. The preferred
read feature allows the database to read the secondary extent first
instead of reading the primary extent. This is especially important for
RAC Extended Cluster implementations. See the section “ASM and Extended
Clusters,” later in this chapter, for more details on this feature.
When a block is to be written to a file, each
extent in the extent set is written in parallel. This requires that all
writes complete before acknowledging the write to the client. Otherwise,
the unwritten side could be read before it is written. If one write I/O
fails, that side of the mirror must be made unavailable for reads
before the write can be acknowledged.
Disk Partnering
In ASM redundancy disk groups, ASM protects
against a double-disk failure (which can lead to data loss) by mirroring
copies of data on disks that are partners of the disk containing the
primary data extent. A disk partnership is a symmetric
relationship between two disks in a high- or normal-redundancy disk
group, and ASM automatically creates and maintains these relationships.
ASM selects partners for a disk from failure groups other than the
failure group to which the disk belongs. This ensures that a disk with a
copy of the lost disk’s data will be available following the failure of
the shared resource associated with the failure group. ASM limits the
number of disk partners to eight for any single disk.
Note that in the DATA_NRML example, ASM does not choose partner disks from a disk’s own failure group (FLGRP1, for instance); rather, its eight partners are chosen from the other three failure groups. Disk partnerships are only changed
when there is a loss or addition of an ASM disk. These partnerships are
not modified when disks are placed offline. Disk partnership is detailed
in Chapter 9.
Recovering Failure Groups
Let’s now return to the example in the
previous CREATE DISKGROUP DATA_NRML command. In the event of a disk
failure in failure group FLGRP1, which will induce a rebalance, the
contents (data extents) of the failed disk are reconstructed using the
redundant copies of the extents from partner disks. These partner disks are from failure groups FLGRP2, FLGRP3, or FLGRP4. If the database instance
needs to access an extent whose primary extent was on the failed disk,
the database will read the mirror copy from the appropriate disk. After
the rebalance is complete and the disk contents are fully reconstructed,
the database instance returns to reading primary copies only.
ASM and Extended Clusters
An extended cluster—also called a stretch cluster, geocluster, campus cluster, or metro-cluster—is
essentially a RAC environment deployed across two data center
locations. Many customers implement extended RAC to marry disaster
recovery with the benefits of RAC, all in an effort to provide higher
availability. Within Oracle, the term extended clusters is used to refer to all of the stretch cluster implementations.
The distance for extended RAC can be anywhere
between several meters to several hundred kilometers. Because Cluster
Ready Services (CRS)–RAC cluster group membership is based on the
ability to communicate effectively across the interconnect, extended
cluster deployment requires a low-latency network infrastructure. For
close proximity, users typically use Fibre Channel, whereas for large
distances Dark Fiber is used.
For normal-redundancy disk groups in extended
RAC, there should be only one failure group on each site of the extended
cluster. High-redundancy disk groups should not be used in extended
cluster configurations unless there are three sites. In this scenario,
there should be one failure group at each site. Note that you must name the
failure groups explicitly based on the site name.
|
NOTE
If a disk group contains an asymmetrical
configuration, such that there are more failure groups on one site than
another, then an extent could get mirrored to the same site and not to
the remote failure group. This could cause the loss of access to the
entire disk group if the site containing more than one failure group
fails.
|
In Oracle Clusterware 11g Release 2,
the concept of a quorum failgroup was introduced (a regular failgroup is the default). A quorum failure group is a special type of failure group that is used to house a single CSS quorum file, along with ASM metadata. Therefore, a quorum failgroup needs to be only about 300MB in size.
Because these failure groups do not contain user data, the quorum
failure group is not considered when determining redundancy
requirements. Additionally, the USABLE_FILE_MB in V$ASM_DISKGROUP does
not consider any free space that is present in QUORUM disks. However, a
quorum failure group counts when mounting a disk group. Chapter 1 contains details on creating disk groups with a quorum failgroup.
ASM Preferred Read
As stated earlier, ASM always reads the
primary copy of a mirrored extent set. Thus, a read for a specific block
may require a read of the primary extent at the remote site across the
interconnect. Accessing a remote disk through a metropolitan area or
wide area storage network is substantially slower than accessing a local
disk. This can tax the interconnect as well as result in high I/O and
network latency.
To mitigate this, a feature called preferred reads enables ASM administrators to specify a failure group from which local reads are satisfied (that is, preferred reads). In a normal- or high-redundancy disk group, when an extent set has a preferred disk, a read is always satisfied by the preferred disk if it is online. This feature is especially beneficial for extended cluster configurations.
The ASM_PREFERRED_READ_FAILURE_GROUPS
initialization parameter is used to specify a list of failure group
names that will provide local reads for each node in a cluster. The
format of the ASM_PREFERRED_READ_FAILURE_GROUPS parameter is as follows:
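ASM_PREFERRED_READ_FAILURE_GROUPS = diskgroup_name.failure_group_name[, diskgroup_name.failure_group_name ...]

For example, the following entry (with illustrative names) designates failure group FG1 of disk group MYDATA as the local, preferred failure group:

ASM_PREFERRED_READ_FAILURE_GROUPS = MYDATA.FG1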
Each entry is composed of DISKGROUP_NAME, which is the name of the disk group, and FAILUREGROUP_NAME,
which is the name of the failure group within that disk group, with a
period separating these two variables. Multiple entries can be specified
using commas as separators. This parameter can be dynamically changed.
The Preferred Read feature can also be useful
in mixed storage configurations. For example, in read-mostly workloads,
SSD storage can be created in one failgroup and standard disk drives can
be included in a second failgroup. This mixed configuration is
beneficial when the SSD storage is in limited supply (in the array) or
for economic reasons. The ASM_PREFERRED_READ_FAILURE_GROUPS parameter
can be set to the SSD failgroup. Note that writes will occur on both
failgroups.
In an extended cluster, the failure groups
that you specify with settings for the ASM_PREFERRED_READ_FAILURE_GROUPS
parameter should contain only disks that are local to the instance.
V$ASM_DISK indicates the preferred disks with a Y in the PREFERRED_READ
column.
The following example shows how to deploy the
preferred read feature and demonstrates some of its inherent benefits.
This example illustrates I/O patterns when the
ASM_PREFERRED_READ_FAILURE_GROUPS parameter is not set, and then
demonstrates how changing the parameter affects I/O:
1. Create a disk group with two failure groups:
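(Device paths are illustrative; three disks are placed in each failure group.)

SQL> CREATE DISKGROUP MYDATA NORMAL REDUNDANCY
       FAILGROUP FG1 DISK '/dev/sdb1','/dev/sdc1','/dev/sdd1'
       FAILGROUP FG2 DISK '/dev/sde1','/dev/sdf1','/dev/sdg1';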
2. The I/Os are evenly distributed across all disks—that is, these are non-localized I/Os.
3. The following query displays the balanced I/O that is the default for ASM configurations:
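(A representative query, using the MYDATA disk group from step 1.)

SQL> SELECT d.name, d.failgroup, d.reads, d.writes
       FROM v$asm_disk d, v$asm_diskgroup g
      WHERE d.group_number = g.group_number
        AND g.name = 'MYDATA';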
|
NOTE
V$ASM_DISK includes I/Os that are performed
by the ASM instance for ASM metadata. The V$ASM_DISK_IOSTAT tracks I/O
on a per-database basis. This view can be used to verify that the RDBMS
instance does not perform any I/O to a nonpreferred disk.
|
4. Now set the appropriate ASM
parameters for the preferred read. Note that you need not dismount or
remount the disk group because this parameter is dynamic.
Enter the following for Node1 (site1):
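(Disk group and failure group names follow the example in step 1.)

SQL> ALTER SYSTEM SET asm_preferred_read_failure_groups = 'MYDATA.FG1' SID='+ASM1';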
Enter this code for Node2 (site2):
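(FG2 is the failure group local to site2.)

SQL> ALTER SYSTEM SET asm_preferred_read_failure_groups = 'MYDATA.FG2' SID='+ASM2';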
5. Verify that the parameter took effect by querying GV$ASM_DISK. From Node1, observe the following:
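(A representative query; group number 1 is assumed for MYDATA, and the output is restricted to the preferred-read disks.)

SQL> SELECT inst_id, name, failgroup, preferred_read
       FROM gv$asm_disk
      WHERE group_number = 1
        AND preferred_read = 'Y'
      ORDER BY inst_id, name;

   INST_ID NAME          FAILGROUP  P
---------- ------------- ---------- -
         1 MYDATA_0000   FG1        Y
         1 MYDATA_0001   FG1        Y
         1 MYDATA_0004   FG1        Y
         2 MYDATA_0002   FG2        Y
         2 MYDATA_0003   FG2        Y
         2 MYDATA_0005   FG2        Y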
Keep in mind that disks MYDATA_0000,
MYDATA_0001, and MYDATA_0004 are part of the FG1 failure group, and
disks MYDATA_0002, MYDATA_0003, and MYDATA_0005 are in failure group
FG2.
6. Put a load on the system and
check I/O calls via EM or using V$ASM_DISK_IOSTAT. Notice in the
“Reads-Total” column that reads have a strong affinity to the disks in
FG1. This is because FG1 is local to Node1 where +ASM1 is running. The
remote disks in FG2 have very few reads.
7. Notice the small number of reads
that instance 1 is making to FG2 and the small number of reads that
instance 2 is making to FG1:
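(A representative per-database query against V$ASM_DISK_IOSTAT; run it on each ASM instance.)

SQL> SELECT instname, dbname, disk_number, reads, writes
       FROM v$asm_disk_iostat
      ORDER BY disk_number;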
Recovering from Transient and Permanent Disk Failures
This section reviews how ASM handles transient and permanent disk failures in normal- and high-redundancy disk groups.
Recovering from Disk Failures: Fast Disk Resync
The ASM Fast Disk Resync feature
significantly reduces the time to recover from transient disk failures
in failure groups. The feature accomplishes this speedy recovery by
quickly resynchronizing the failed disk with its partnered disks.
With Fast Disk Resync, the repair time is
proportional to the number of extents that have been written or modified
since the failure. This feature can significantly reduce the time that
it takes to repair a failed disk group from hours to minutes.
The Fast Disk Resync feature allows the user a
grace period to repair the failed disk and bring it back online. This time
allotment is dictated by the ASM disk group attribute DISK_REPAIR_TIME.
This attribute dictates the maximum disk outage that ASM can
tolerate before dropping the disk. If the disk is repaired before this
time is exceeded, then ASM resynchronizes the repaired disk when the
user places the disk online. The command ALTER DISKGROUP DISK ONLINE is
used to place the repaired disk online and initiate disk
resynchronization.
Taking disks offline does not change any
partnerships. Repartnering occurs when the disks are dropped at the end
of the expiration period.
Fast Disk Resync requires that the
COMPATIBLE.ASM and COMPATIBLE.RDBMS attributes of the ASM disk group be
set to at least 11.1.0.0.
In the following example, the current ASM 11gR2
disk group has a compatibility of 11.1.0.0 and is modified to 11.2.0.3.
To validate the attribute change, the V$ASM_ATTRIBUTE view is queried:
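(The disk group name is illustrative, and group number 1 is assumed in the validation query.)

SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'compatible.asm' = '11.2.0.3';
SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'compatible.rdbms' = '11.2.0.3';

SQL> SELECT name, value FROM v$asm_attribute
      WHERE group_number = 1 AND name LIKE 'compatible.%';

NAME                  VALUE
--------------------- -----------
compatible.asm        11.2.0.3
compatible.rdbms      11.2.0.3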
After you correctly set the compatibility to
Oracle Database version 11.2.0.3, you can set the DISK_REPAIR_TIME
attribute accordingly. Notice that the default repair time is 12,960
seconds, or 3.6 hours. The best practice is to set DISK_REPAIR_TIME to a
value depending on the operational logistics of the site; in other
words, it should be set to the mean time to detect and repair the disk.
If the value of DISK_REPAIR_TIME needs to be changed, you can enter the following command:
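(The value shown is illustrative.)

SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'disk_repair_time' = '8.5h';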
If the DISK_REPAIR_TIME parameter is not 0 and
an ASM disk fails, that disk is taken offline but not dropped. During
this outage, ASM tracks any modified extents using a bitmap that is
stored in disk group metadata. (See Chapter 9 for more details on the algorithms used for resynchronization.)
ASM’s GMON process will periodically inspect
(every three seconds) all mounted disk groups for offline disks. If GMON
finds any, it sends a message to a slave process to increment their
timer values (by three seconds) and initiate a drop for the offline
disks when the timer expires. This timer is displayed in the REPAIR_TIMER column of V$ASM_DISK.
The ALTER DISKGROUP DISK OFFLINE SQL command
or the EM ASM Target page can also be used to take the ASM disks offline
manually for preventative maintenance. The following describes this
scenario using SQL*Plus:
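(Disk and disk group names are illustrative.)

SQL> ALTER DISKGROUP DATA OFFLINE DISK DATA_0001;

Diskgroup altered.

SQL> SELECT name, mount_status, mode_status, repair_timer
       FROM v$asm_disk
      WHERE name = 'DATA_0001';

NAME         MOUNT_S MODE_ST REPAIR_TIMER
------------ ------- ------- ------------
DATA_0001    MISSING OFFLINE        12960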
Notice that the offline disk’s MOUNT_STATUS
and MODE_STATUS are set to the MISSING and OFFLINE states, and also that
the REPAIR_TIMER begins to decrement from the drop timer.
Disks Are Offline
After the maintenance is completed, you can use the ALTER DISKGROUP DATA ONLINE command to bring the disk (or all offline disks in the disk group) back online.
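A minimal sketch, reusing the DATA_0001 disk from the earlier example, is either of the following:

SQL> ALTER DISKGROUP DATA ONLINE DISK DATA_0001;

or

SQL> ALTER DISKGROUP DATA ONLINE ALL;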
This statement brings all the offline disks back online to bring the stale contents up to date and to enable new contents. See Chapter 9 for more details on how to implement resynchronization.
The following is an excerpt from the ASM alert log showing a disk being brought offline and online:
After fixing the disk, you can bring it online using the following command:
Once the disk is brought back online, the REPAIR_TIMER is reset to 0 and the MODE_STATUS is set to ONLINE.
At first glance, the Fast Disk Resync feature
may seem to be a substitute for Dirty Region Logging (DRL), which
several logical volume managers implement. However, Fast Disk Resync and
DRL are distinctly different.
DRL is a mechanism to track blocks that have
writes in flight. A mirrored write cannot be issued unless a bit in the
DRL is set to indicate there may be a write in flight. Because DRL
itself is on disk and also mirrored, it may require two DRL writes
before issuing the normal mirrored write. This is mitigated by having
each DRL bit cover a range of data blocks such that setting one bit will
cover multiple mirrored block writes. There is also some overhead for
I/O to clear DRL bits for blocks that are no longer being written. You
can often clear these bits while setting another bit in DRL.
If a host dies while it has mirrored writes in
flight, it is possible that one side of the mirror is written and the
other is not. Most applications require that they get the same data
every time if they read a block multiple times without writing it. If
one side was written but not the other, then different reads may get
different data. DRL mitigates this by constructing a set of blocks that
must be copied from one side to the other to ensure that all blocks are
the same on both sides of the mirror. Usually this set of blocks is much
larger than those that were being written at the time of the crash, and
it takes a while to create the copies.
During the copy, the storage is unavailable,
which increases overall recovery times. Additionally, it is possible
that the failure caused a partial write to one side, resulting in a
corrupt logical block. The copying may write the bad data over the good
data because the volume manager has no way of knowing which side is
good.
Fortunately, ASM does not need to maintain a
DRL. ASM clients, which include the Oracle database and ACFS, manage resilvering themselves and know how to recover their data so that the mirror sides are the same for the cases that matter; in other words, it is a client-side implementation. It is not always necessary to make the
mirror sides the same. For example, if a file is being initialized
before it is part of the database, it will be reinitialized after a
failure, so that file does not matter for the recovery process. For data
that does matter, Oracle must always have a means of tolerating a write
that was started but that might not have been completed. The redo log
is an example of one such mechanism in Oracle. Because Oracle already
has to reconstruct such interrupted writes, it is simple to rewrite both
sides of the mirror even if it looks like the write completed
successfully. The number of extra writes can be small, because Oracle is
excellent at determining exactly which blocks need recovery.
Another benefit of not using a DRL is that a
corrupt block, which does not report an I/O error on read, can be
recovered from the good side of the mirror. When a block corruption is
discovered, each side of the mirror is read to determine whether one of
them is valid. If the sides are different and one is valid, then the
valid copy is used and rewritten to both sides. This can repair a
partial write at host death. This mechanism is used all the time, not
just for recovery reads. Thus, an external corruption that affects only
one side of an ASM mirrored block can also be recovered.
ASM and I/O Error Failure Management
Whereas the previous section covers ASM
handling of transient and permanent disk failures in ASM redundancy disk
groups, this section discusses how ASM processes I/O errors, such as
read and write errors, and also discusses in general how to handle I/O
failures in external redundancy disk groups.
General Disk Failure Overview
Disk drives are mechanical devices and thus
tend to fail. As drives begin to fail or have sporadic I/O errors,
database failures become more likely.
The ability to detect and resolve device path
failures is a core component of path managers as well as HBAs. A disk
device can be in the following states or have the following issues:
Media sense errors These
include hard read errors and unrecoverable positioning errors. In this
situation, the disk device is still functioning and responds to
SCSI_INQUIRY requests.
Device too busy A
disk device can become so overwhelmed with I/O requests that it will
not respond to the SCSI_INQUIRY within a reasonable amount of time.
Failed device In
this case, the disk has actually failed and will not respond to a
SCSI_INQUIRY request, and when the SCSI_INQUIRY timeout occurs, the disk
and path will be taken offline.
Path failure The disk device may be intact, but a path component—such as a port or a fiber adapter—has failed.
In general, I/O requests can time out because
either the SCSI driver device is unable to respond to a host message
within the allotted time or the path on which a message was sent has
failed. To detect this path failure, HBAs typically enable a timer each
time a message is received from the SCSI driver. A link failure is
thrown if the timer exceeds the link-down timeout without receiving the
I/O acknowledgment. After the link-down event occurs, the Path Manager
determines that the path is dead and evaluates whether to reroute queued
I/O requests to alternate paths.
ASM and I/O Failures
The method that ASM uses to handle I/O
failures depends on the context in which the I/O failure occurred. If
the I/O failure occurs in the database instance, then it notifies ASM,
and ASM decides whether to take the disk offline. ASM takes whatever
action is appropriate based on the redundancy of the disk group and the
number of disks that are already offline.
If the I/O error occurs while ASM is trying to mount a disk group, the behavior depends on the release. In Oracle Database 10g
Release 2, if the instance is not the first to mount the disk group in
the cluster, it will not attempt to take any disks offline that are
online in the disk group mounted by other instances. If none of the
disks can be found, the mount will fail. The rationale here is that if
the disk in question has truly failed, the running instances will very
quickly take the disk offline. If the instance is the “first to mount,”
it will offline the missing disks because it has no other instance to
consult regarding the well-being of the missing disks.
If the error is local and you want to mount
the disk group on the instance that cannot access the disk, you need to
drop the disk from a node that mounted the disk group. Note that a drop
force command will allow the mount immediately. Often in such scenarios,
the disk cannot be found on a particular node because of errors in the
ASM_DISKSTRING or the permissions on the node.
In Oracle Database 11g, these two
behaviors are still valid, but rather than choosing one or the other
based on whether the instance is first to mount the disk group, the
behavior is based on how it was mounted. For example, if the disk group
MOUNT [NOFORCE] command is used, which is the default, this requires
that all online disks in the disk group be found at mount time. If any
disks are missing (or have I/O errors), the mount will fail. A disk
group MOUNT FORCE attempts to take disks offline as necessary, but
allows the mount to complete. Note that to discourage the excessive use
of FORCE, MOUNT FORCE succeeds only if a disk needs to be taken offline.
In 11.2.0.3, MOUNT [NOFORCE] will succeed in
Exadata and Oracle Database Appliance as long as the result leaves more
than one failgroup for normal redundancy or more than two failgroups for
high-redundancy disk groups.
ASM, as well as the database, takes proactive measures to handle I/O failures or data corruptions.
When the database reads a data block from
disk, it validates the checksum, the block number, and some other
fields. If the block fails the consistency checks, then an attempt is
made to reread the block to get a valid block read. A reread is meant to
handle potential transient issues with the I/O subsystem. Oracle can
read individual mirror sides to resolve corruptions. For corrupt blocks
in data files, the database code reads each side of the mirror and looks
for a good copy. If it finds a good copy, the read succeeds and the
good copy is written back to disk to repair the corruption, assuming
that the database is holding the appropriate locks to perform a write.
If the mirroring is done in a storage array (external redundancy), there
is no interface to select mirror sides for reading. In that case, the
RDBMS simply rereads the same block and hopes for the best; however,
with a storage array, this process will most likely return the same data
from the array cache unless the original read was corrupted. If the
RDBMS cannot find good data, an error is signaled. The corrupt block is
kept in buffer cache (if it is a cache-managed block) to avoid repeated
attempts to reread the block and to avoid excessive error reporting.
Note that the handling of corruption is different for each file type and for each piece of code that accesses the file. For example, the handling of data file corruption during an RMAN backup is different from that described in this section, as is the handling of archive log file corruption.
ASM, like most volume managers, does not do
any proactive polling of the hardware looking for faults. Servers
usually have enough I/O activity to make such polling unnecessary.
Moreover, ASM cannot tell whether an I/O error is due to a cable being
pulled or a disk failing. It is up to the operating system (OS) to
decide when to return an error or continue waiting for an I/O
completion. ASM has no control over how the OS handles I/O completions.
The OS signals a permanent I/O error to the caller (the Oracle I/O
process) after several retries in the device driver.
|
NOTE
Starting with Oracle Database 11g, in
the event of a disk failure, ASM polls disk partners and the other
disks in the failure group of the failed disk. This is done to
efficiently detect a pathological problem that may exist in the failure
group.
|
ASM takes disks offline from the disk group
only on a write operation I/O error, not for read operations. For
example, in Oracle Database 10g, if a permanent disk I/O error is
incurred during an Oracle write I/O operation, ASM takes the affected
disk offline and immediately drops it from the disk group, thus
preventing stale data reads. In Oracle Database 11g, if the
DISK_REPAIR_TIMER attribute is enabled, ASM takes the disk offline but
does not drop it. However, ASM does drop the disk if the
DISK_REPAIR_TIMER expires. This feature is covered in the section
“Recovering from Disk Failures: Fast Disk Resync,” earlier in this
chapter.
In Oracle Database 11g, ASM (in ASM
redundancy disk groups) attempts to remap bad blocks if a read fails.
This remapping can lead to a write, which could lead to ASM taking the
disk offline. For read errors, the block is read from the secondary
extents (only for normal or high redundancy). If the loss of a disk
would result in data loss, as in the case where a disk’s partner disk is
also offline, ASM automatically dismounts the disk group to protect the
integrity of the disk group data.
|
NOTE
Read failures from disk header and other unmirrored, physically addressed reads also cause ASM to take the disk offline.
|
In 11g, before taking a disk offline,
ASM checks the disk headers of all remaining disks in that failure group
to proactively check their liveliness. For offline efficiency, if all
remaining disks in that same failure group show signs of failure, ASM
will proactively offline the entire failure group.
In the case of apparent failures of disks in multiple failure groups, ASM dismounts the disk group directly rather than taking some disks offline and then dismounting the disk group. Also, ASM takes disks in a failure group offline all at once to allow for more efficient repartnering.
If the heartbeat cannot be written to a copy
of the Partnership Status Table (PST) in a normal- or high-redundancy
disk group, ASM takes the disk containing the PST copy offline and moves
the PST to another disk in the same disk group. In an external
redundancy disk group, the disk group is dismounted if the heartbeat
write fails. At mount time, the heartbeat is read twice, at least six seconds apart, to determine whether an instance outside the local cluster has mounted the disk group. If the two reads show different contents, the disk group is mounted by an unseen instance.
After the disk group is mounted, ASM will reread the heartbeat every hundredth time it is written. This is done to address two issues: first, to catch any potential race condition that the mount-time check did not catch, and second, to detect whether a disk group was accidentally mounted in two different clusters, with both of them heartbeating against the PST.
In the following example, ASM detects I/O failures as shown from the alert log:
The following warning indicates that ASM detected an I/O error on a particular disk:
This error message alerts the user that trying
to take the disk offline would cause data loss, so ASM is dismounting
the disk group instead:
Messages should also appear in the OS log indicating problems with this same disk (DATA_1_0001).
Many users want to simulate corruption in an
ASM file in order to test failure and recovery. Two types of
failure-injection tests that customers induce are block corruption and
disk failure. Unfortunately, overwriting an ASM disk simulates
corruption, not a disk failure. Note further that overwriting the
disk will corrupt ASM metadata as well as database files. This may not
be the user’s intended fault-injection testing. You must be cognizant of
the redundancy type deployed before deciding on the suite of tests run
in fault-injection testing. In cases where a block or set of blocks is
physically corrupted, ASM (in ASM redundancy disk groups) attempts to
reread all mirror copies of a corrupt block to find one copy that is not
corrupt.
Redundancy and the source of the corruption do matter when recovering a corrupt block. If data is written to disk in an ASM external redundancy disk group through external means, then these writes will go to all copies of the storage array mirror. For example, corruption could occur if the Unix/Linux dd command is inadvertently used to write to an in-use ASM disk.
Space Management Views for ASM Redundancy
Two columns in the V$ASM_DISKGROUP view provide more accurate information on free space usage: USABLE_FILE_MB and REQUIRED_MIRROR_FREE_MB.
In Oracle Database 10g Release 2, the
column USABLE_FILE_MB in V$ASM_DISKGROUP indicates the amount of free
space that can be “safely” utilized taking mirroring into account. The
column provides a more accurate view of usable space in the disk group.
Note that for external redundancy, the column FREE_MB is equal to
USABLE_FREE_SPACE.
Along with USABLE_FILE_MB, the
REQUIRED_MIRROR_FREE_MB column in V$ASM_DISKGROUP indicates more
accurately the amount of space that must remain available in a
given disk group to restore redundancy after one or more disk failures.
The amount of space displayed in this column takes mirroring into
account. The following discussion describes how REQUIRED_MIRROR_FREE_MB is
computed.
REQUIRED_MIRROR_FREE_MB indicates the amount of space
that must be available in a disk group to restore full redundancy after
the worst failure that can be tolerated by the disk group without
adding additional storage, where the worst failure refers to a permanent
disk failure that causes the affected disks to be dropped. The purpose of
this requirement is to ensure that there is sufficient space in the
remaining failure groups to restore redundancy. However, the computed
value depends on the type of ASM redundancy deployed:
For
a normal-redundancy disk group with more than two failure groups, the
value is the total raw space of all the disks in the largest failure
group. The largest failure group is the one with the largest total raw
capacity. For example, if each disk is in its own failure group, the
value would be the size of the largest-capacity disk. Where there are
only two failure groups in a normal-redundancy disk group, the size of
the largest disk in the disk group is used to compute
REQUIRED_MIRROR_FREE_MB.
For
a high-redundancy disk group with more than three failure groups, the
value is the total raw space for all the disks in the two largest
failure groups.
If disks are of different sizes across the
failure groups, this further complicates the REQUIRED_MIRROR_FREE_MB
calculation. Therefore, it is highly recommended that disk groups have
disks of equal size.
Be careful of cases where USABLE_FILE_MB has
negative values in V$ASM_DISKGROUP due to the relationship among
FREE_MB, REQUIRED_MIRROR_FREE_MB, and USABLE_FILE_MB. If USABLE_FILE_MB
is a negative value, you do not have sufficient space to reconstruct the
mirror of all extents in certain disk failure scenarios. For example,
in a normal-redundancy disk group with two failure groups,
USABLE_FILE_MB goes negative if you do not have sufficient space to
tolerate the loss of a single disk. In this situation, you could gain
more usable space, at the expense of losing all redundancy, by
force-dropping the remaining disks in the failure group containing the
failed disk.
A negative USABLE_FILE_MB value also means
that, depending on the value of FREE_MB, you may not be able to create
new files. The next failure may result in files with reduced redundancy
or in an out-of-space condition, which can hang the database.
If USABLE_FILE_MB becomes negative, it is strongly recommended that you
add more space to the disk group as soon as possible.
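To keep an eye on this, the relevant columns can be queried directly from V$ASM_DISKGROUP; the following is a minimal sketch:

-- Compare raw free space with mirror-aware usable space per disk group.
SELECT name, type, total_mb, free_mb,
       required_mirror_free_mb, usable_file_mb
FROM   v$asm_diskgroup;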
Disk Groups and Attributes
Oracle Database 11g introduced the
concept of ASM attributes. Unlike initialization parameters, which are
instance specific but apply to all disk groups, ASM attributes are disk
group specific and apply to all instances.
Attributes Overview
The ASM disk group attributes are shown in
the V$ASM_ATTRIBUTES view. However, this view is not populated until the
disk group compatibility—that is, COMPATIBLE.ASM—is set to 11.1.0. In
Clusterware Database 11g Release 2, the following attributes can be set:
Compatibility
COMPATIBLE.ASM This
attribute determines the minimum software version for an Oracle ASM
instance that can mount the disk group. This setting also affects the
format of the data structures for the ASM metadata on the disk. If the
SQL CREATE DISKGROUP statement, the ASMCMD mkdg command, or Oracle
Enterprise Manager is used to create disk groups, the default value for
the COMPATIBLE.ASM attribute is 10.1.
COMPATIBLE.RDBMS This
attribute determines the minimum COMPATIBLE database initialization
parameter setting for any database instance that is allowed to use
(open) the disk group. Ensure that the values for the COMPATIBLE
initialization parameter for all of the databases that access the disk
group are set to at least the value of the new setting for
COMPATIBLE.RDBMS. As with the COMPATIBLE.ASM attribute, the default
value is 10.1. The COMPATIBLE.ASM will always be greater than or equal
to COMPATIBLE.RDBMS. This topic is covered in more detail later in this
section.
COMPATIBLE.ADVM This
attribute determines whether the disk group can contain ADVM volumes.
The value can only be set to 11.2 or higher. The default value of the
COMPATIBLE.ADVM attribute is empty until set. However, before the
COMPATIBLE.ADVM is advanced, the COMPATIBLE.ASM attribute must already
be set to 11.2 or higher and the ADVM volume drivers must be loaded. The
COMPATIBLE.ASM attribute will always be greater than or equal to
COMPATIBLE.ADVM. Also, there is no relation between COMPATIBLE.ADVM and
COMPATIBLE.RDBMS.
ASM Disk Group Management
DISK_REPAIR_TIME This
attribute defines the delay in the drop disk operation by specifying a
time interval to repair the disk and bring it back online. The time can
be specified in units of minutes (m or M) or hours (h or H). This topic
is covered in the “Recovering from Disk Failures: Fast Disk Resync”
section.
AU_SIZE This attribute defines the disk group allocation unit size.
SECTOR_SIZE This
attribute specifies the default sector size of the disk contained in
the disk group. The SECTOR_SIZE disk group attribute can be set only
during disk group creation, and the possible values are 512, 4096, and
4K. The COMPATIBLE.ASM and COMPATIBLE.RDBMS disk group attributes must
be set to 11.2 or higher to set the sector size to a value other than
the default value.
Exadata Systems
CONTENT.TYPE This attribute was introduced in 11gR2 and is
valid only for Exadata systems. The COMPATIBLE.ASM
attribute must be set to 11.2.0.3 or higher to enable the CONTENT.TYPE
attribute for the disk group. The CONTENT.TYPE attribute identifies the
disk group type and implicitly dictates disk partnering for that disk
group. The value can be DATA, RECOVERY, or SYSTEM. Setting this attribute
determines the distance to the nearest neighbor disk in the failure group
where ASM mirrors copies of the data.
Keep the following points in mind:
The default value is DATA, which specifies a distance of 1 to the nearest neighbor disk.
A value of RECOVERY specifies a distance of 3 to the nearest neighbor disk.
A value of SYSTEM specifies a distance of 5.
STORAGE.TYPE This
attribute identifies the type of disks in the disk group and allows
users to enable Hybrid Columnar Compression (HCC) on that hardware. The
possible values are AXIOM, ZFSSA, and OTHER. The AXIOM and ZFSSA
values correspond to the Oracle Pillar Axiom storage platform and the
Oracle ZFS Storage Appliance, respectively. If the attribute is
set to OTHER, any type of disk can be in the disk group. The
STORAGE.TYPE attribute can only be set when creating a disk group or
when altering a disk group, and it cannot be set when clients are
connected to the disk group.
IDP.TYPE This attribute is related to the Intelligent Data Placement feature and influences data placement on disk.
CELL.SMART_SCAN_CAPABLE When set, this attribute enables Smart Scan capabilities in Exadata.
File Access Control
ACCESS_CONTROL.ENABLED This
attribute, when set, enables the facility for File Access Control. This
attribute can only be set when altering a disk group, with possible
values of TRUE and FALSE.
ACCESS_CONTROL.UMASK This
attribute specifies which permissions are masked on the creation of an
ASM file for the user that owns the file, for users in the same user
group and others not in the user group. The semantics of ASM umask
settings are similar to Unix/Linux umask. This attribute applies to all
files on a disk group, with possible values in the combinations of three
digits: {0|2|6} {0|2|6} {0|2|6}. The default is 066. This attribute can
only be set when altering a disk group.
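As an illustration of how disk group attributes are set, the following is a minimal sketch using a hypothetical disk group named DATA; the values shown are examples, not recommendations:

-- Extend the window during which an offlined disk may be repaired
-- before it is dropped (the default is 3.6 hours).
ALTER DISKGROUP data SET ATTRIBUTE 'disk_repair_time' = '8h';

-- Enable ASM File Access Control and mask write access for group and others.
ALTER DISKGROUP data SET ATTRIBUTE 'access_control.enabled' = 'true';
ALTER DISKGROUP data SET ATTRIBUTE 'access_control.umask' = '026';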
Disk Group Compatibility Attributes
The disk group attributes can be set at disk
group creation or by using the ALTER DISKGROUP command. For example, a
disk group can be created with 10.1 disk group compatibility and then
advanced to 11.2 by setting the COMPATIBLE.ASM attribute to 11.2.
Compatibility attributes are discussed in more detail in the next section.
The following example shows a CREATE DISKGROUP command that results in a disk group with 10.1 compatibility (the default):
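A minimal sketch of such a command, with hypothetical disk paths and failure group names, might look like this:

CREATE DISKGROUP data NORMAL REDUNDANCY
  FAILGROUP fg1 DISK '/dev/oracleasm/disks/DISK1'
  FAILGROUP fg2 DISK '/dev/oracleasm/disks/DISK2';
-- No COMPATIBLE.* attributes are specified, so compatible.asm defaults to 10.1.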
This disk group can then be advanced to 11.2 using the following command:
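A sketch, assuming the disk group is named DATA:

ALTER DISKGROUP data SET ATTRIBUTE 'compatible.asm' = '11.2';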
On successful advancing of the disk group, the following message appears:
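In SQL*Plus, for example, a successful statement is typically acknowledged as:

Diskgroup altered.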
In another example, the AU_SIZE attribute,
which dictates the allocation unit size, and the COMPATIBLE.ASM
attribute are specified at disk group creation. Note that the AU_SIZE
attribute can only be specified at disk group creation and cannot be
altered using the ALTER DISKGROUP command:
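A sketch with hypothetical disk paths, using a 4MB allocation unit purely as an example:

CREATE DISKGROUP data NORMAL REDUNDANCY
  FAILGROUP fg1 DISK '/dev/oracleasm/disks/DISK1'
  FAILGROUP fg2 DISK '/dev/oracleasm/disks/DISK2'
  ATTRIBUTE 'au_size' = '4M', 'compatible.asm' = '11.2';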
The V$ASM_ATTRIBUTE view can be queried to get the DATA disk group attributes:
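A minimal query sketch, assuming the disk group is named DATA:

SELECT a.name, a.value
FROM   v$asm_attribute a, v$asm_diskgroup g
WHERE  a.group_number = g.group_number
AND    g.name = 'DATA'
ORDER  BY a.name;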
In the previous example, the COMPATIBLE.ASM
attribute was advanced; this next example advances the COMPATIBLE.RDBMS
attribute. Notice that the version is set to simply 11.2, which is
equivalent to 11.2.0.0.0.
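A sketch, again assuming the DATA disk group:

ALTER DISKGROUP data SET ATTRIBUTE 'compatible.rdbms' = '11.2';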
Database Compatibility
When a database instance first connects to
an ASM instance, it negotiates the highest Oracle version that can be
supported between the instances. There are two types of compatibility
settings between ASM and the RDBMS: instance-level software
compatibility settings and disk group–specific settings.
Instance-level software compatibility is
defined using the init.ora parameter COMPATIBLE. This COMPATIBLE
parameter, which can be set to 11.2, 11.1, 10.2, or 10.1 at the ASM or
database instance level, defines what software features are available to
the instance. Explicitly setting the COMPATIBLE parameter in the ASM
instance is not allowed, and a lower COMPATIBLE value would not be useful
for ASM anyway, because ASM is compatible with multiple database
versions. Note that the COMPATIBLE.ASM value must be greater than or
equal to that of COMPATIBLE.RDBMS.
The other compatibility settings are specific
to a disk group and control which attributes are available to the ASM
disk group and which are available to the database. This is defined by
the ASM compatibility (COMPATIBLE.ASM) and RDBMS compatibility
(COMPATIBLE.RDBMS) attributes, respectively. These compatibility
attributes are persistently stored in the disk group metadata.
RDBMS Compatibility
RDBMS disk group compatibility is defined by
the COMPATIBLE.RDBMS attribute. This attribute, which defaults to 10.1
in Oracle Database 11g, is the minimum COMPATIBLE version setting
of a database that can mount the disk group. After the disk group
attribute of COMPATIBLE.RDBMS is advanced to 11.2, it cannot be
reversed.
ASM Compatibility
ASM disk group compatibility, as defined by
COMPATIBLE.ASM, controls the persistent format of the on-disk ASM
metadata structures. The ASM compatibility defaults to 10.1 and must
always be greater than or equal to the RDBMS compatibility level. After
the compatibility is advanced to 11.2, it cannot be reset to lower
versions. Any value up to the current software version can be set and
will be enforced. The compatibility attributes have quantized values, so
not all five parts of the version number have to be specified.
COMPATIBLE.RDBMS and COMPATIBLE.ASM together
control the persistent format of the on-disk ASM metadata structures.
The combination of the compatibility parameter setting of the database,
the software version of the database, and the RDBMS compatibility
setting of a disk group determines whether a database instance is
permitted to mount a given disk group. The compatibility setting also
determines which ASM features are available for a disk group.
The following query shows an ASM instance that was recently upgraded from Oracle Database 10g to Oracle Clusterware 11gR2:
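A sketch of such a query against the compatibility columns of V$ASM_DISKGROUP:

SELECT name, compatibility, database_compatibility
FROM   v$asm_diskgroup;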
Notice that the ASM compatibility and RDBMS
compatibility are still at the default (for upgraded instances) of 10.1.
The 10.1 setting is the lowest compatibility value supported by ASM.
|
NOTE
An ASM instance can support different RDBMS
clients with different compatibility settings, as long as the database
COMPATIBLE init.ora parameter setting of each database instance is
greater than or equal to the RDBMS compatibility of all disk groups.
|
See the section “Disk Groups and Attributes,” earlier in this chapter, for examples on advancing the compatibility.
The ASM compatibility of a disk group can be
set to 11.0, whereas its RDBMS compatibility could be 10.1, as in the
following example:
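A sketch with a hypothetical disk path:

CREATE DISKGROUP data EXTERNAL REDUNDANCY
  DISK '/dev/oracleasm/disks/DISK1'
  ATTRIBUTE 'compatible.asm' = '11.0', 'compatible.rdbms' = '10.1';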
This implies that the disk group can be
managed only by ASM software version 11.0 or higher, whereas any
database software version must be 10.1 or higher.
Summary
An ASM disk is the unit of persistent
storage given to a disk group. A disk can be added to or dropped from a
disk group. When a disk is added to a disk group, it is given a disk
name either automatically or by the administrator. This is different
from the OS name that is used to access the disk through the operating
system. In a RAC environment, the same disk may be accessed by different
OS names on different nodes. ASM accesses disks through the standard OS
interfaces used by Oracle to access any file (unless an ASMLIB is
used). Typically, an ASM disk is a partition of a LUN seen by the OS. An
ASM disk can be any device that can be opened through the OS open
system call, except for a local file system file. The LUN could
be a single physical disk spindle, or it could be a virtual LUN managed
by a highly redundant storage array.
A disk group is the fundamental object managed
by ASM. It is composed of multiple ASM disks. Each disk group is
self-describing—that is, all the metadata about the usage of the space
in the disk group is completely contained within the disk group. If ASM
can find all the disks in a disk group, it can provide access to the
disk group without any additional metadata.
A given ASM file is completely contained
within a single disk group. However, a disk group may contain files
belonging to several databases, and a single database may use files from
multiple disk groups. Most installations include only a small number of
disk groups—usually two, and rarely more than three.