2) ASM and Grid Infrastructure Stack
In releases prior to 11gR2, Automatic Storage Management (ASM) was tightly integrated with the Clusterware stack. In 11gR2,
ASM is not only tightly integrated with the Clusterware stack, it’s
actually part of the Clusterware stack. The Grid Infrastructure stack is
the foundation of Oracle’s Private Database Cloud, and it provides the
essential Cloud Pool capabilities, such as growing server and storage
capacity as needed. This chapter discusses how ASM fits into the Oracle
Clusterware stack.
Clusterware Primer
Oracle Clusterware is the cross-platform
cluster software required to run the Real Application Clusters (RAC)
option for Oracle Database and provides the basic clustering services at
the operating system level that enable Oracle software to run in
clustered mode. The two main components of Oracle Clusterware are
Cluster Ready Services and Cluster Synchronization Services:
Cluster Ready Services (CRS) Provides
high-availability operations in a cluster. The CRS daemon (CRSd)
manages cluster resources based on the persistent configuration
information stored in Oracle Cluster Registry (OCR). These cluster
resources include the Oracle Database instance, listener, VIPs, SCAN
VIPs, and ASM. CRSd provides start, stop, monitor, and failover
operations for all the cluster resources, and it generates events when
the status of a resource changes.
Cluster Synchronization Services (CSS) Manages
the cluster configuration by controlling which nodes are members of the
cluster and by notifying members when a member (node) joins or leaves
the cluster. The following functions are provided by the Oracle Cluster
Synchronization Services daemon (OCSSd):
Group Services A distributed group membership system that allows for the synchronization of services between nodes
Lock Services Provide the basic cluster-wide serialization locking functions
Node Services Use OCR to store state data and update the information during reconfiguration
OCR Overview
The Oracle Cluster Registry is the central
repository for all the resources registered with Oracle Clusterware. It
contains the profile, state, and ownership details of the resources.
This includes both Oracle resources and user-defined application
resources. Oracle resources include the node apps (VIP, ONS, GSD, and
Listener) and database resources, such as database instances, and
database services. Oracle resources are added to the OCR by tools such
as DBCA, NETCA, and srvctl.
Voting File Overview
Oracle Clusterware maintains the membership of the nodes in the cluster using a special file called the voting disk (sometimes mistakenly referred to as a quorum disk).
The voting disk is also referred to as the voting file, so
you’ll see it referenced both ways, and both are correct. This file
contains the heartbeat records from all the nodes in the cluster. If a
node loses access to the voting file or is not able to complete the
heartbeat I/O within the threshold time, then that node is evicted out
of the cluster. Oracle Clusterware also maintains heartbeat with the
other member nodes of the cluster via the shared private interconnect
network. A split-brain syndrome occurs when there is a failure in the
private interconnect whereby multiple sub-clusters are formed within the
clustered nodes and the nodes in different sub-clusters are not able to
communicate with each other via the interconnect network but they still
have access to the voting files. The voting file enables Clusterware to
resolve network split brain among the cluster nodes. In such a
situation, the largest active sub-cluster survives. Oracle Clusterware
requires an odd number of voting files (1, 3, 5, …) to be created. This
is done to ensure that at any point in time, an active member of the
cluster has access to the majority number (n / 2 + 1) of voting files.
Here’s a list of some interesting 11gR2 changes for voting files:
The
voting files’ critical data is stored in the voting file itself and not in the
OCR anymore. From a voting file perspective, the OCR is not touched at
all. The critical data each node must agree on to form a cluster includes, for
example, the misscount value and the list of configured voting files.
In Oracle Clusterware 11g
Release 2 (11.2), it is no longer necessary to back up the voting
files. The voting file data is automatically backed up in OCR as part of
any configuration change and is automatically restored as needed. If
all voting files are corrupted, users can restore them as described in
the Oracle Clusterware Administration and Deployment Guide.
Grid Infrastructure Stack Overview
The Grid Infrastructure stack includes
Oracle Clusterware components, ASM, and ASM Cluster File System (ACFS).
Throughout this chapter, as well as the book, we will refer to Grid
Infrastructure as the GI stack.
The Oracle GI stack consists of two
sub-stacks: one managed by the Cluster Ready Services daemon (CRSd) and
the other by the Oracle High Availability Services daemon (OHASd). How
these sub-stacks come into play depends on how the GI stack is
installed. The GI stack is installed in two ways:
Grid Infrastructure for Standalone Server
Grid Infrastructure for Cluster
ASM is available in both these software stack
installations. When Oracle Universal Installer (OUI) is invoked to
install Grid Infrastructure, the main screen will show four options (see
Figure 2-1).
In this section, the options we want to focus on are Grid
Infrastructure for Standalone Server and Grid Infrastructure for
Cluster.
Grid Infrastructure for Standalone Server
Grid Infrastructure for Standalone Server is
essentially the single-instance (non-clustered) configuration, as in
previous releases. It is important to note that in 11gR2, because
ASM is part of the GI stack, Clusterware must be installed first before
the database software is installed; this holds true even for
single-instance deployments. Keep in mind that ASM will not need to be
in a separate ORACLE_HOME; it is installed and housed in the GI
ORACLE_HOME.
Grid Infrastructure for Standalone Server does
not configure the full Clusterware stack; just the minimal components
are set up and enabled—that is, private interconnect, CRS, and
OCR/voting files are not enabled or required. The OHASd daemon and its
startup framework replace all the existing pre-11.2 init scripts. The entry point
for OHASd is /etc/inittab, which executes the /etc/init.d/ohasd and
/etc/init.d/init.ohasd control scripts, including the start and stop
actions. The ohasd script is the framework control script, which
spawns the $GI_HOME/bin/ohasd.bin executable. OHASd is the main
daemon that provides High Availability Services (HAS) and starts the
remaining stack, including ASM, listener, and the database in a
single-instance environment.
A new feature that’s automatically enabled as
part of Grid Infrastructure for Standalone Server installation is Oracle
Restart, which provides high-availability restart functionality for
failed instances (database and ASM), services, listeners, and dismounted
disk groups. It also ensures these protected components start up and
shut down according to the dependency order required. This functionality
essentially replaces the legacy dbstart/dbstop script used in the pre-11gR2
single-instance configurations. Oracle Restart also executes health
checks that periodically monitor the health of these components. If a
check operation fails for a component, the component is forcibly shut
down and restarted. Note that Oracle Restart is only enabled in GI for
Standalone Server (non-clustered) environments. For clustered
configurations, health checks and the monitoring capability are provided
by Oracle Clusterware CRS agents.
When a server that has Grid Infrastructure for
Standalone Server enabled is booted up, the HAS process will initialize
and start up by first starting up ASM. ASM has a hard-start (pull-up)
dependency with CSS, so CSS is started up. Note that there is a
hard-stop dependency between ASM and CSS, so on stack shutdown ASM will
stop and then CSS will stop.
Grid Infrastructure for Cluster
Grid Infrastructure for Cluster is the
traditional installation of Clusterware. It includes multinode RAC
support, private interconnect, Clusterware files, and now also installs
ASM and ACFS drivers. With Oracle Clusterware 11gR2, ASM is not
simply the storage manager for database files, but also houses the
Clusterware files (OCR and voting files) and the ASM spfile.
When you select the Grid Infrastructure for Cluster option in OUI, as shown previously in Figure 2-1,
you will be prompted next on file storage options for the Clusterware
files (Oracle Clusterware Registry and Clusterware voting file). This is
shown in Figure 2-2.
Users are prompted to place Clusterware files
on either a shared file system or ASM. Note that raw disks are not
supported any longer for new installations. Oracle will support the
legacy method of storing Clusterware files (raw and so on) in upgrade
scenarios only.
When ASM is selected as the storage location for Clusterware files, the Create ASM Disk Group screen is shown next (see Figure 2-3).
You can choose external redundancy or ASM redundancy (normal or high) for the storage of
Clusterware files. However, keep in mind that the type of redundancy
affects the redundancy (or number of copies) of the voting files.
For example, normal redundancy requires
a minimum of three failure groups, and high redundancy a
minimum of five failure groups. This requirement stems from the fact
that an odd number of voting files must exist to enable a vote quorum.
Additionally, this allows the cluster to tolerate one or two disk failures, respectively, and still
maintain a quorum.
This first disk group created during
the installation can also be used to store database files. In previous
versions of ASM, this disk group was referred to as the DATA disk group.
Although it is recommended that you create a single disk group for
storing the Clusterware files and database files, users who employ
third-party snapshot technology against the ASM disk group
may want a separate disk group for the Clusterware
files. Users may also deploy a separate disk group for the Clusterware files
to leverage normal or high redundancy for them. In both
cases, create a small CRSDATA disk group with a 1MB
AU and enough failure groups to support the required redundancy. After
the installation, ASMCA can then be used to create the DATA disk group.
Voting Files and Oracle Cluster Registry Files in ASM
In versions prior to 11gR2, users
needed to configure and set up raw devices for housing the Clusterware
files (OCR and voting files). This step creates additional management
overhead and is error prone. Incorrect OCR/voting files setup creates
havoc for the Clusterware installation and directly affects run-time
environments. To mitigate these install preparation issues, 11gR2
allows the storing of the Clusterware files in ASM; this also
eliminates the need for a third-party cluster file system and eliminates
the complexity of managing disk partitions for the OCR and voting
files. The COMPATIBLE.ASM disk group compatibility attribute must be set
to 11.2 or greater to store the OCR or voting file data in a disk
group. This attribute is automatically set for new installations with
the OUI. Note that COMPATIBLE.RDBMS does not need to be advanced to
enable this feature. The COMPATIBLE.* attributes topic is covered in Chapter 3.
Voting Files in ASM
If you choose to store voting files in ASM,
then all voting files must reside in ASM in a single disk group (in
other words, Oracle does not support mixed configurations of storing
some voting files in ASM and some on NAS devices).
Unlike most ASM files, the voting files are
wholly consumed in multiple contiguous AUs. Additionally, the voting
file is not stored as a standard ASM file (that is, it cannot be listed
in the asmcmd ls command). However, the disk that contains the voting
file is reflected in the V$ASM_DISK view:
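As a sketch, a query along these lines shows which ASM disks contain a voting file (the VOTING_FILE column is available starting in 11gR2):

SELECT path, failgroup, voting_file
FROM   v$asm_disk
WHERE  voting_file = 'Y';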
The number of voting files you want to create in a particular Oracle ASM disk group depends on the redundancy of the disk group:
External redundancy A
disk group with external redundancy can store only one voting file.
Currently, no supported way exists to have multiple voting files stored
on an external redundancy disk group.
Normal redundancy A disk group with normal redundancy can store up to three voting files.
High redundancy A disk group with high redundancy can store up to five voting files.
In this example, we created an ASM disk group
with normal redundancy for the disk group containing voting files. The
following can be seen:
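As a sketch (the File Universal Ids and disk paths shown are placeholders), the voting file layout can be listed with crsctl:

crsctl query css votedisk
##  STATE    File Universal Id        File Name                  Disk group
 1. ONLINE   <FUID-1>                 /dev/mapper/disk01p1       [DATA]
 2. ONLINE   <FUID-2>                 /dev/mapper/disk02p1       [DATA]
 3. ONLINE   <FUID-3>                 /dev/mapper/disk03p1       [DATA]
Located 3 voting disk(s).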
ASM puts each voting file in its own failure group within the disk group. A failure group
is defined as the collection of disks that have a shared hardware
component for which you want to prevent its loss from causing a loss of
data.
For example, four drives that are in a single
removable tray of a large JBOD (just a bunch of disks) array are in the
same failure group because the tray could be removed, making all four
drives fail at the same time. Conversely, drives in the same cabinet can
be in multiple failure groups if the cabinet has redundant power and
cooling so that it is not necessary to protect against the failure of
the entire cabinet. If voting files are stored in ASM with normal or
high redundancy and the storage hardware in one failure group fails,
ASM relocates the affected voting file to a candidate disk in an
unaffected failure group, provided such a disk is available in the disk
group.
Voting files are managed differently from
other files that are stored in ASM. When voting files are placed on
disks in an ASM disk group, Oracle Clusterware records exactly on which
disks in that disk group they are located. Note that CSS has access to
voting files even if ASM becomes unavailable.
Voting files can be migrated from raw/block
devices into ASM. This is typical in upgrade scenarios. For
example, when a user upgrades from 10g to 11gR2, they are
allowed to continue storing their OCR/voting files on raw devices, but at a
later convenient time they can migrate these Clusterware files into ASM.
It is important to point out that users cannot upgrade to Oracle
Clusterware 12c from 10g without first moving the voting
files into ASM (or a shared file system), since raw disks are no longer
supported even for upgraded environments in 12c.
The following illustrates this:
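As a sketch (the disk group and raw device names are assumptions; run the commands as root once a suitable disk group exists):

# Move the voting files from raw devices into the DATA disk group
crsctl replace votedisk +DATA

# Add an OCR location in ASM, then remove the old raw location
ocrconfig -add +DATA
ocrconfig -delete /dev/raw/raw1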
Voting File Discovery
The method by which CSS identifies and locates voting files changed in 11.2. Before 11gR2, the voting files were located via a lookup in the OCR; in 11gR2, voting files are located via a Grid Plug and Play (GPnP) query. GPnP, a new component in the 11gR2
Clusterware stack, allows other GI stack components to query or modify
cluster-generic (non-node-specific) attributes. For example, the cluster
name and network profiles are stored in the GPnP profile. The GPnP
configuration, which consists of the GPnP profile and wallet, is created
during the GI stack installation. The GPnP profile is an XML file that
contains bootstrap information necessary to form a cluster. This profile
is identical on every peer node in the cluster. The profile is managed
by gpnpd and exists on every node (in gpnpd caches). The profile should
never be edited because it has a profile signature that maintains its
integrity.
When the CSS component of the Clusterware
stack starts up, it queries the GPnP profile to obtain the disk
discovery string. Using this disk string, CSS performs a discovery to
locate its voting files.
The following is an example of a CSS GPnP
profile entry. To query the GPnP profile, the user should use the
supplied (in CRS ORACLE_HOME) gpnptool utility:
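As a sketch, gpnptool get dumps the profile XML; the CSS and ASM entries look something like the following (attribute values other than the discovery string are illustrative):

$GRID_HOME/bin/gpnptool get
...
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
<orcl:ASM-Profile id="asm" DiscoveryString="/dev/mapper/*" SPFile="+DATA/myclust/asmparameterfile/registry.253.xxxxxxxxx"/>
...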
The CSS voting file discovery string anchors
into the ASM profile entry; that is, it derives its DiscoveryString from
the ASM profile entry. The ASM profile lists the value in the ASM
discovery string as ‘/dev/mapper/*’. Additionally, ASM uses this GPnP
profile entry to locate its spfile.
Voting File Recovery
Here’s a question that is often heard: If
ASM houses the Clusterware files, then what happens if the ASM instance
is stopped? This is an important point about the relationship between
CSS and ASM. CSS and ASM do not communicate directly. CSS discovers its
voting files independently and outside of ASM. This is evident at
cluster startup when CSS initializes before ASM is available. Thus, if
ASM is stopped, CSS continues to access the voting files, uninterrupted.
Additionally, the voting file is backed up into the OCR at every
configuration change and can be restored with the crsctl command.
If all voting files are corrupted, you can restore them as described next.
Furthermore, if the cluster is down and cannot
restart due to lost voting files, you must start CSS in exclusive mode
to replace the voting files by entering the following command:
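As a sketch (run as root; the disk group name is an assumption):

# Start the stack in exclusive mode without CRSd (the -nocrs flag is available from 11.2.0.2)
crsctl start crs -excl -nocrs

# Re-create the voting files in the target disk group
crsctl replace votedisk +CRSDATA

# Stop the exclusive-mode stack and restart normally
crsctl stop crs -f
crsctl start crs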
Oracle Cluster Registry (OCR)
Oracle Clusterware 11gR2 provides the
ability to store the OCR in ASM. Up to five OCR files can be stored in
ASM, although each has to be stored in a separate disk group.
The OCR is created, along with the voting
disk, when root.sh of the OUI installation is executed. The OCR is
stored in an ASM disk group as a standard ASM file with the file type
OCRFILE. The OCR file is stored like other ASM files and striped across
all the disks in the disk group. It also inherits the redundancy of the
disk group. To determine which ASM disk group the OCR is stored in, view
the default configuration location at /etc/oracle/ocr.loc:
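For example (the disk group name is an assumption), the file contents look like this:

cat /etc/oracle/ocr.loc
ocrconfig_loc=+DATA
local_only=FALSE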
The disk group that houses the OCR file is automounted by the ASM instance during startup.
All 11gR2 OCR commands now support the
ASM disk group. From a user perspective, OCR management and maintenance
works the same as in previous versions, with the exception of OCR
recovery, which is covered later in this section. As in previous
versions, the OCR is backed up automatically every four hours. However,
the new backup location is <GRID_HOME>/cdata/<cluster name>.
A single OCR file is stored when an external
redundancy disk group is used. It is recommended that for external
redundancy disk groups an additional OCR file be created in another disk
group for added redundancy. This can be done as follows:
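As a sketch (run as root; DATA2 is an assumed disk group name):

ocrconfig -add +DATA2
ocrcheck      # verify that both OCR locations are listed and pass the integrity check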
In an ASM redundancy disk group, the ASM
partnership and status table (PST) is replicated on multiple disks. In
the same way, there are redundant extents of OCR file stored in an ASM
redundancy disk group. Consequently, the OCR can tolerate the loss of the
same number of disks as the underlying disk group can, and it can be
relocated/rebalanced in response to disk failures. The ASM PST is
covered in Chapter 9.
OCR Recovery
When a process (OCR client) that wants to
read the OCR incurs a corrupt block, the OCR client I/O will
transparently reissue the read to the mirrored extents for a normal- or
high-redundancy disk group. In the background the OCR master (nominated
by CRS) provides a hint to the ASM layer identifying the corrupt disk.
ASM will subsequently start “check disk group” or “check disk,” which
takes the corrupt disk offline. This corrupt block recovery is only
possible when the OCR is configured in a normal- or high-redundancy disk
group.
In a normal- or high-redundancy disk group, users can recover from the corruption by taking either of the following steps:
Use the ALTER DISK GROUP CHECK statement if the disk group is already mounted.
Remount
the disk group with the FORCE option, which also takes the disk offline
when it detects the disk header corruption.
If you are using an external redundancy disk group, you must restore the OCR from backup
to recover from a corruption. Starting in Oracle Clusterware 11.2.0.3,
the OCR backup can be stored in a disk group as well.
The workaround is to configure an additional
OCR location on a different storage location using the ocrconfig -add
command. OCR clients can tolerate a corrupt block returned by ASM, as
long as the same block from the other OCR locations (mirrors) is not
corrupt. The following guidelines can be used to set up a redundant OCR
copy:
Ensure
that the ASM instance is up and running with the required disk group
mounted, and/or check the ASM alert.log for the status of the ASM instance.
Verify
that the OCR files were properly created in the disk group, using
asmcmd ls. Because the Clusterware stack keeps accessing OCR files, most
of the time the error will show up as a CRSD error in the crsd.log. Any
errors related to an ocr* command will generate a trace file in the
Grid_home/log/<hostname>/client directory; look for kgfo, kgfp, or
kgfn at the top of the error stack.
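As a sketch (the disk group and cluster names are assumptions), the OCR file and its health can be verified as follows:

# The OCR is stored as file number 255 of type OCRFILE under the cluster name directory
asmcmd ls -l +DATA/mycluster/OCRFILE/

# Run an OCR integrity check as root
ocrcheck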
Use Case Example
A customer has an existing three-node cluster with an 11gR1
stack (CRS 11.1.0.7; ASM 11.1.0.7; DB 11.1.0.7). They want to migrate
to a new cluster with new server hardware but the same storage. They
don’t want to install 11.1.0.7 on the new servers; they just want to
install 11.2.0.3. In other words, instead of doing an upgrade, they want
to create a new “empty” cluster and then “import” the ASM disks into
the 11.2 ASM instance. Is this possible?
Yes. To make this solution work, you will
install the GI stack and create a new cluster on the new servers, stop
the old cluster, and then rezone the SAN paths to the new servers.
During the GI stack install, when you’re prompted in the OUI to
configure the ASM disk group for a storage location for the OCR and
voting files, use the drop-down box to use an existing disk group. The
other option is to create a new disk group for the Clusterware files and
then, after the GI installation, discover and mount the old 11.1.0.7
disk group. You will need to do some post-install work to register the
databases and services with the new cluster.
The Quorum Failure Group
In certain circumstances, customers might want to build a stretch cluster. A stretch cluster
provides protection from site failure by allowing a RAC configuration
to be set up across distances greater than what’s typical “in the data
center.” In these RAC configurations, a third voting file must be
created at a third location for cluster arbitration. In pre-11gR2 configurations, users set up this third voting file on a NAS from a third location. In 11gR2, the third voting file can now be stored in an ASM quorum failure group.
The “Quorum Failgroup” clause was introduced
for Extended RAC setups and/or for setups with disk groups that
have only two disks (and therefore only two failure groups) but want to
use normal redundancy.
A quorum failure group is a special type of
failure group where the disks do not contain user data and are not
considered when determining redundancy requirements. Unfortunately,
during GI stack installation, the OUI does not offer the capability to
create a quorum failure group. However, this can be set up after the
installation. In the following example, we create a disk group with a
failure group and optionally a quorum failure group if a third array is
available:
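The following sketch assumes two storage arrays presented as /dev/mapper devices and an NFS-based quorum device; all names and paths are illustrative:

CREATE DISKGROUP DATA NORMAL REDUNDANCY
  FAILGROUP fg1 DISK '/dev/mapper/array1_lun1'
  FAILGROUP fg2 DISK '/dev/mapper/array2_lun1'
  QUORUM FAILGROUP fg3 DISK '/voting_nfs/vote_disk_01'
  ATTRIBUTE 'compatible.asm' = '11.2.0.0.0';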
If the disk group creation was done using
ASMCA, then after we add a quorum disk to the disk group, Oracle
Clusterware will automatically change the CSS vote disk location to the
following:
Clusterware Startup Sequence—Bootstrap If OCR Is Located in ASM
Oracle Clusterware 11g Release 2 introduces an integral component called the cluster agents. These agents are highly available, multithreaded daemons that implement entry points for multiple resource types.
ASM has to be up with the disk group mounted
before any OCR operations can be performed. OHASd maintains the resource
dependency and will bring up ASM with the required disk group mounted
before it starts the CRSd. Once ASM is up with the disk group mounted,
the usual ocr* commands (ocrcheck, ocrconfig, and so on) can be used. Figure 2-4 displays the client connections into ASM once the entire stack, including the database, is active.
NOTE
This lists the processes connected to ASM using the OS ps command. Note that most of these are bequeath connections.
The following output displays a similar listing but from an ASM client perspective:
There will be an ASM client listed for the connection OCR:
Here, +data.255 is the OCR file number, which is used to identify the OCR file within ASM.
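A query against V$ASM_CLIENT produces a listing along these lines (instance and database names will vary):

SELECT group_number, instance_name, db_name, status
FROM   v$asm_client;

-- Typical clients include the RDBMS instance(s), the ASM instance itself,
-- and an entry for the OCR, reported against the OCR file (such as +data.255).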
The voting files, OCR, and spfile are processed differently at bootstrap:
Voting file The
GPnP profile contains the disk group name where the voting files are
kept. The profile also contains the discovery string that covers the
disk group in question. When CSS starts up, it scans each disk group for
the matching string and keeps track of the ones containing a voting
disk. CSS then directly reads the voting file.
ASM spfile The
ASM spfile location is recorded in the disk header(s), which has the
spfile data. It is always just one AU. The logic is similar to CSS and
is used by the ASM server to find the parameter file and complete the
bootstrap.
OCR file OCR is stored as a regular ASM file. Once the ASM instance comes up, it mounts the disk group needed by the CRSd.
Disk Groups and Clusterware Integration
Before discussing the relationship of ASM
and Oracle Clusterware, it’s best to provide background on CRS modeling,
which describes the relationship between a resource, the resource
profile, and the resource relationship. A resource, as described
previously, is any entity that is being managed by CRS—for example,
physical (network cards, disks, and so on) or logical (VIPs, listeners,
databases, disk groups, and so on). The resource relationship defines
the dependency between resources (for example, state dependencies or
proximities) and is considered to be a fundamental building block for
expressing how an application’s components interact with each other. Two
or more resources are said to have a relationship when one (or both)
resource(s) either depends on or affects the other. For example, CRS
modeling mandates that the DB instance resource depend on the ASM
instance and the required disk groups.
As discussed earlier, because Oracle Clusterware version 11gR2
allows the Clusterware files to be stored in ASM, the ASM resources are
also managed by CRS. The key resource managed by CRS is the ASM disk
group resource.
Oracle Clusterware 11g Release 2
introduces a new agent concept that makes cluster resource management
very efficient and scalable. These agents are multithreaded daemons that
implement entry points for multiple resource types and spawn new
processes for different users. The agents are highly available and,
besides oraagent, orarootagent, and cssdagent/cssdmonitor, there can be
an application agent and a script agent. The two main agents are
oraagent and orarootagent. As the names suggest, oraagent and
orarootagent manage resources owned by Oracle and root, respectively. If
the CRS user is different from the ORACLE user, then CRSd would utilize
two oraagents and one orarootagent. The main agents perform different
tasks with respect to ASM. For example, oraagent performs the
start/stop/check/clean actions for ora.asm, database, and disk group
resources, whereas orarootagent performs start/stop/check/clean actions
for the ora.diskmon and ora.drivers.acfs resources.
The following output shows typical ASM-related CRS resources:
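As a sketch, the ASM-related resources can be listed with crsctl (the disk group names shown are assumptions):

crsctl stat res -t | egrep 'ora\.asm|\.dg'
# Typical entries include ora.asm and one ora.<DGNAME>.dg resource per disk
# group, for example ora.DATA.dg and ora.FRA.dg. The ACFS driver resource
# (ora.drivers.acfs) belongs to the lower stack and is visible with
# crsctl stat res -t -init.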
When the disk group is created, the disk group
resource is automatically created with the name, ora.<DGNAME>.dg,
and the status is set to ONLINE. The status OFFLINE will be set if the
disk group is dismounted, because this is a CRS-managed resource now.
When the disk group is dropped, the disk group resource is removed as
well. A dependency between the database and the disk group is
automatically created when the database tries to access the ASM files.
More specifically, a “hard” dependency type is created for the following
file types: datafiles, controlfiles, online logs, and SPFile. These
are the files that are absolutely needed to start up the database; for
all other files, the dependency is set to weak. This becomes important
when there are more than two disk groups: one for archive, another for
flash or temp, and so on. However, when the database no longer uses the
ASM files or the ASM files are removed, the database dependency is not
removed automatically. This must be done using the srvctl command-line
tool.
The following database CRS profile illustrates the dependency relationships between the database and ASM:
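As a sketch (the database name orcl and the disk groups DATA and FRA are assumptions), the dependency attributes can be pulled from the resource profile:

crsctl stat res ora.orcl.db -p | grep DEPENDENCIES
START_DEPENDENCIES=hard(ora.DATA.dg,ora.FRA.dg) weak(type:ora.listener.type,uniform:ora.ons) pullup(ora.DATA.dg,ora.FRA.dg)
STOP_DEPENDENCIES=hard(intermediate:ora.asm,shutdown:ora.DATA.dg,shutdown:ora.FRA.dg)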
Summary
The tighter integration between ASM and
Oracle Clusterware provides the capability for quickly deploying new
applications as well as managing changing workloads and capacity
requirements. This agility and elasticity are key drivers for the
Private Database Cloud. In addition, the ASM/Clusterware integration
with the database is the platform at the core of Oracle’s
Engineered Systems.
ASM Disks and Disk Groups
The first task
in building the ASM infrastructure is to discover and place disks under
ASM management. This step is best done with the coordination of storage
and system administrators. In storage area network (SAN) environments,
it is assumed that the disks are identified and configured
correctly—that is, they are appropriately zoned or “LUN masked” within
the SAN fabric and can be seen by the operating system (OS). Although
the concepts in this chapter are platform generic, we specifically show
examples using the Linux or Solaris platforms.
ASM Storage Provisioning
Before disks can be added to ASM, the
storage administrator needs to identify a set of disks or logical
devices from a storage array. Note that the term disk is used loosely because a disk can be any of the following:
An entire disk spindle
A partition of a physical disk spindle
An aggregation of several disk partitions across several disks
A logical device carved from a RAID (redundant array of independent drives) group set
A file created from an NFS file system
Once the preceding devices are created, they
are deemed logical unit numbers (LUNs). These LUNs are then presented to
the OS as logical disks.
In this book, we refer generically to LUNs or disks presented to the OS as simply disks. The terms LUN and disk may be used interchangeably.
DBAs and system administrators are often in
doubt as to the maximum LUN size they can use without performance
degradation, or as to the LUN size that will give the best performance.
For example, will 1TB- or 2TB-sized LUNs perform the same as 100GB- or
200GB-sized LUNs?
Size alone should not affect the performance
of an LUN. The underlying hardware, the number of disks that compose an
LUN, and the read-ahead and write-back caching policy defined on the LUN
all, in turn, affect the speed of the LUN. There is no magic number for
the LUN size or the number of ASM disks in the disk group.
Seek the advice of the storage vendor for the
best storage configuration for performance and availability, because
this may vary between vendors.
Given the database size and storage hardware
available, the best practice is to create larger LUNs (to reduce LUN
management) and, if possible, generate LUNs from a separate set of
storage array RAID sets so that the LUNs do not share drives. If the
storage array is a low-end commodity storage unit and storage RAID will
not be used, then it is best to employ ASM redundancy and use entire
drives as ASM disks. Additionally, the ASM disk size is the minimal
increment by which a disk group’s size can change.
NOTE
The maximum disk size for an ASM disk in pre-12c configurations is 2TB, and the minimum disk size is 4MB.
Users should create ASM disks with sizes less than 2TB in pre-12c environments. A message such as the following will be thrown if users specify ASM candidate disks that are greater than 2TB:
ASM Storage Device Configuration
This section details the steps and
considerations involved in configuring storage devices presented to the
operating system that were provisioned in the earlier section. This
function is typically performed by the system administrator or an ASM
administrator (that is, someone with root privileges).
Typically, disks presented to the OS can be
seen in the /dev directory on Unix/Linux systems. Note that each OS has
its unique representation of small computer system interface (SCSI) disk
naming. For example, on Solaris systems, disks usually have the SCSI
name format cwtxdysz, where c is the controller number, t is the target, d is the LUN/disk number, and s is the partition. Creating a partition serves three purposes:
To
skip the OS label/VTOC (volume table of contents). Different operating
systems have varying requirements for the OS label—that is, some may
require an OS label before it is used, whereas others do not. For
example, on a Solaris system, it is a best practice to create a
partition on the disk, such as partition 4 or 6, that skips the first
1MB into the disk.
To
create a placeholder to identify that the disk is being used because an
unpartitioned disk could be accidentally misused or overwritten.
To preserve alignment between ASM striping and storage array internal striping.
The goal is to align the ASM file extent
boundaries with any striping that may be done in the storage array. The
Oracle database does a lot of 1MB input/outputs (I/Os) that are aligned
to 1MB offsets in the data files. It is slightly less efficient to
misalign these I/Os with the stripes in the storage array, because
misalignment can cause one extra disk to be involved in the I/O.
Although this misalignment may not affect the latency of that particular
I/O, it reduces the overall throughput of the system by increasing the
number of disk seeks. This misalignment is independent of the operating
system. However, some operating systems may make it more difficult to
control the alignment or may add more offsets to block 0 of the ASM
disk.
The disk partition used for an ASM disk is
best aligned at 1MB within the LUN, as presented to the OS by the
storage. ASM uses the first allocation unit of a disk for metadata,
which includes the disk header. The ASM disk header itself is in block 0
of the disk given to ASM as an ASM disk.
Aligning ASM disk extent boundaries to storage
array striping only makes sense if the storage array striping is a
power of 2; otherwise, it is not much of a concern.
The alignment issue would be solved if we
could start the ASM disk at block 0 of the LUN, but that does not work
on some operating systems (Solaris, in particular). On Linux, you could
start the ASM disk at block 0, but then there is a chance an
administrator would run fdisk on the LUN and destroy the ASM disk
header. Therefore, we always recommend using a partition rather than
starting the ASM disk at block 0 of the LUN.
ASM Disk Device Discovery
Once the disks are presented to the OS, ASM
needs to discover them. This requires that the disk devices (Unix
filenames) have their ownership changed from root to the software owner
of Grid Infrastructure stack. The system administrator usually makes
this change. In our example, disks c3t19d5s4, c3t19d16s4, c3t19d17s4,
and c3t19d18s4 are identified, and their ownership is set to
oracle:dba. Now ASM must be configured to discover these disks. This is
done by defining the ASM init.ora parameter ASM_DISKSTRING. In our
example, we will use the following wildcard setting:
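A sketch of the parameter setting for the disks named above:

# ASM instance init.ora/spfile parameter
ASM_DISKSTRING = '/dev/rdsk/c3t19d*s4'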
An alternative to using standard SCSI names (such as cwtxdysz
or /dev/sdda) is to use special files. This option is useful when
establishing standard naming conventions and for easily identifying ASM
disks, such as asmdisk1. This option requires creating special files
using the mknod command or udev-generated names on Linux.
The following is a use case example of mknod.
To create a special file called asmdisk1 for a preexisting device
partition called c3t19d7s4, you can determine the OS major number and
minor number as follows:
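As a sketch, listing the raw device shows the major and minor numbers in place of the file size (ownership and date fields are illustrative):

ls -lL /dev/rdsk/c3t19d7s4
crw-r-----   1 root     sys       32, 20 Jan 10 10:15 /dev/rdsk/c3t19d7s4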
NOTE
Major and minor numbers are associated with
the device special files in the /dev directory and are used by the
operating system to determine the actual driver and device to be
accessed by the user-level request for the special device file.
The preceding example shows that the major and minor device numbers for this device are 32 and 20, respectively. The c at the beginning indicates that this is a character (raw) file.
After obtaining the major and minor numbers,
use the mknod command to create the character and block special files
that will be associated with c3t19d7s4. A special file called
/dev/asmdisk can be created under the /dev directory, as shown:
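As a sketch, using the major and minor numbers obtained above (run as root):

mkdir -p /dev/asmdisk
mknod /dev/asmdisk/asmdisk1 c 32 20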
Listing the special file shows the following:
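A sketch of the listing (ownership and timestamps are illustrative):

ls -l /dev/asmdisk
crw-r--r--   1 root     root      32, 20 Jan 10 10:22 asmdisk1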
Notice that this device has the same major and minor numbers as the native device c3t19d7s4.
For this partition (or slice) to be accessible to the ASM instance, change the permissions on this special file to the appropriate oracle user permissions:
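As a sketch (the oracle:dba ownership follows the earlier example; your grid software owner and group may differ):

chown oracle:dba /dev/asmdisk/asmdisk1
chmod 660 /dev/asmdisk/asmdisk1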
Repeat this step for all the required disks
that will be discovered by ASM. Now the slice is accessible by the ASM
instance. The ASM_DISKSTRING can be set to /dev/asmdisk/*. Once
discovered, the disk can be used as an ASM disk.
NOTE
It is not recommended that you create mknod
devices in the /dev/asm directory because the /dev/asm path is reserved
for ACFS to place ACFS configuration files and ADVM volumes. During 11gR2
Clusterware installation or upgrade, the root.sh or rootupgrade.sh
script may remove and re-create the /dev/asm directory, causing the
original mknod devices to be deleted. Be sure to use a different
directory instead, such as /dev/asmdisk.
ASM discovers all the required disks that make
up the disk group using “on-disk” headers and its search criteria
(ASM_DISKSTRING). ASM scans only for disks that match that ASM search
string. There are two forms of ASM disk discovery: shallow and deep. For
shallow discovery, ASM simply scans the disks that are eligible to be
opened. This is equivalent to executing “ls -l” on all the disk devices
that have the appropriate permissions. For deep discovery, ASM opens
each of those eligible disk devices. In most cases, ASM discoveries are
deep, the exception being when the *_STAT tables are queried instead of
the standard tables.
NOTE
For ASM in clustered environments, it is not
necessary to have the same pathname or major or minor device numbers
across all nodes. For example, node1 could access a disk pointed to by
path /dev/rdsk/c3t1d4s4, whereas node2 could present /dev/rdsk/c4t1d4s4
for the same device. Although ASM does not require that the disks have
the same names on every node, it does require that the same disks be
visible to each ASM instance via that instance’s discovery string. In
the event that pathnames differ between ASM nodes, the only necessary
action is to modify the ASM_DISKSTRING to match the search path. This is
a non-issue on Linux systems that use ASMLIB, because ASMLIB handles
the disk search and scan process.
Upon successful discovery, the V$ASM_DISK view
on the ASM instance reflects which disks were discovered. Note that
henceforth all views, unless otherwise stated, are queried from the ASM
instance and not from the RDBMS instance.
The following example shows the disks that
were discovered using the defined ASM_DISKSTRING. Notice that the NAME
column is empty and the GROUP_NUMBER is set to 0. This is because disks
were discovered that are not yet associated with a disk group.
Therefore, they have a null name and a group number of 0.
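A sketch of such a query:

SELECT group_number, name, header_status, path
FROM   v$asm_disk;

-- Newly discovered, unassigned disks show GROUP_NUMBER = 0, a NULL NAME,
-- and a HEADER_STATUS of CANDIDATE (or PROVISIONED if ASMLIB was used).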
In an Exadata environment, the physical disks on the storage cells are called cell disks.
Grid disks are created from the cell disks and are presented to ASM via
the LIBCELL interface; they are used to create disk groups in Exadata.
The default value for ASM_DISKSTRING in Exadata is ‘o/*/*’.
Note that these Exadata disks as presented by
LIBCELL are not presented to the OS as block devices, but rather as
internal network devices; they are not visible at the OS level. However,
the kfod tool can be used to verify ASM disk discovery. The following
shows kfod output of grid disks:
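A sketch of a kfod invocation; the grid disk names shown are illustrative:

$GRID_HOME/bin/kfod disks=all asm_diskstring='o/*/*'
# Each grid disk is reported with a path of the form
#   o/<cell IP address>/<grid disk name>
# for example, o/192.168.10.12/DATA_CD_00_cell01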
The preceding output shows the following:
The grid disks are presented from three storage cells (192.168.10.12, 192.168.10.13, and 192.168.10.14).
Disks have various header statuses that
reflect their membership state with a disk group. Disks can have the
following header statuses:
FORMER This state declares that the disk was formerly part of a disk group.
CANDIDATE A disk in this state is available to be added to a disk group.
MEMBER This state indicates that a disk is already part of a disk group. Note that the disk group may or may not be mounted.
PROVISIONED This
state is similar to CANDIDATE, in that it is available to be added to
disk groups. However, the provisioned state indicates that this disk has
been configured or made available using ASMLIB.
Note that ASM never marks disks as
CANDIDATE; a HEADER_STATUS of CANDIDATE is simply the outcome of
ASM disk discovery evaluating an unused disk. If a disk is dropped by ASM via a
normal DROP DISK, the header status becomes FORMER.
However, if a disk is taken offline and subsequently force dropped, the
HEADER_STATUS remains MEMBER.
The following is a useful query to run to view the status of disks in the ASM system:
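A sketch of such a query:

SELECT name, path, header_status, mode_status, state, failgroup
FROM   v$asm_disk
ORDER  BY group_number, disk_number;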
The views V$ASM_DISK_STAT and
V$ASM_DISKGROUP_STAT are identical to V$ASM_DISK and V$ASM_DISKGROUP.
However, the V$ASM_DISK_STAT and V$ASM_DISKGROUP_STAT views are polled
from memory and are based on the last deep disk discovery. Because these
new views provide efficient lightweight access, Enterprise Manager (EM)
can periodically query performance statistics at the disk level and
aggregate space usage statistics at the disk group level without
incurring significant overhead.
Third-Party Volume Managers and ASM
Although it is not a recommended practice,
host volume managers such as Veritas VxVM and IBM LVM can sit below ASM.
For example, a logical volume manager (LVM) can create raw logical
volumes and present these as disks to ASM. However, the third-party LVM
should not use any host-based mirroring or striping. ASM algorithms are
based on the assumption that I/Os to different disks are relatively
independent and can proceed in parallel. If any of the volume manager
virtualization features are used beneath ASM, the configuration becomes
too complex and confusing and can needlessly incur overhead, such as the
maintenance of a dirty region log (DRL). DRL is discussed in greater
detail later in this chapter.
In a clustered environment, such a
configuration can be particularly expensive. ASM does a better job of
providing this configuration’s functionality for database files.
Additionally, in RAC environments, if ASM were to run over third-party
volume managers, the volume managers must be cluster-aware—that is, they
must be cluster volume managers (CVMs).
However, it may make sense in certain cases to
have a volume manager under ASM (for example, when system administrators need
simplified management and tracking of disk assignments).
If a volume manager is used to create logical
volumes as ASM disks, the logical volumes should not use any LVM RAID
functionality.
Preparing ASM Disks on NFS
ASM supports Network File System (NFS) files
as ASM disks. To prepare NFS for ASM storage, the NAS NFS file system
must be made accessible to the server where ASM is running.
The following steps can be used to set up and configure ASM disks using the NFS file system:
1. On the NAS server, create the
file system. Depending on the NAS server, this will require creating
LUNs, creating RAID groups out of the LUNs, and finally creating a file
system from the block devices.
2. Export the NAS file system so
that it’s accessible to the host server running ASM. This mechanism will
differ based on the filer or NFS server being used. Typically this
requires the /etc/exports file to specify the NFS file system to be
remotely mounted.
3. On the host server, create the mount point where the NFS file system will be mounted.
4. Update /etc/fstab with an entry for the NFS file system, using the Oracle-recommended mount options.
5. Mount the NFS file system on the host server using the mount -a command.
6. Initialize the NFS file system files so they can be used as ASM disks (a consolidated sketch of steps 3 through 7 follows this list).
This step should be repeated to configure the appropriate number of disks.
7. Ensure that ASM can discover the newly created disk files (that is, check that the permissions are grid:asmadmin).
8. Set the ASM disk string appropriately when prompted in OUI for the ASM configuration.
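A consolidated sketch of steps 3 through 7; the mount point, filer export, file sizes, and grid:asmadmin ownership are assumptions:

# Step 3: create the mount point
mkdir -p /oradata/asmdisks

# Step 4: /etc/fstab entry using the Oracle-recommended NFS mount options
# nfsfiler:/vol/asmvol  /oradata/asmdisks  nfs \
#   rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0  0 0

# Step 5: mount the file system
mount -a

# Step 6: initialize zero-filled files to be used as ASM disks (repeat per disk)
dd if=/dev/zero of=/oradata/asmdisks/asmdisk1 bs=1M count=10240

# Step 7: make the files discoverable by the grid software owner
chown grid:asmadmin /oradata/asmdisks/asmdisk*
chmod 660 /oradata/asmdisks/asmdisk*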
It is very important to have the correct NFS
mount options set. If wrong mount options are set, an exception will be
thrown on file open. This is shown in the following listing. RAC and
Clusterware code uses the O_DIRECT flag for write calls, so the data
writes bypass the cache and go directly to the NFS server (thus avoiding
possible corruptions by an extra caching layer).
File system files that are opened as read-only
by all nodes (such as shared libraries) or files that are accessed by a
single node (such as trace files) can be on a mount point with the
mount option actimeo set to greater than 0. Only files that are
concurrently written and read by multiple nodes (such as database files,
application output files, and natively compiled libraries shared among
nodes) need to be on a mount point with actimeo set to 0. This not only
saves on network round trips for stat() calls, but the calls also don’t
have to wait for writes to complete. This could be a significant
speedup, especially for files being read and written by a single node.
Direct NFS
Oracle Database has built-in support for the
Network File System (NFS) client via Direct NFS (dNFS). dNFS, an
Oracle-optimized NFS client introduced in Oracle Database 11gR1, is built directly into the database kernel.
dNFS provides faster performance than the
native OS NFS client driver because it bypasses the OS. Additionally,
once dNFS is enabled, very little user configuration or tuning is
required. Data is cached just once in user space, so there’s no second
copy in kernel space. dNFS also provides implicit network interface load
balancing. ASM supports the dNFS client that integrates the NFS client
functionality directly in the Oracle Database software stack. If you are
using dNFS for RAC configurations, some special considerations need to
be made. dNFS cannot be used to store (actually access) voting files.
The reason for this lies in how voting files are accessed. CSS is a
multi-threaded process and dNFS (in its current state) is not thread
safe. OCR files and other cluster files (including database files) are
accessed using ASM file I/O operations.
Note that ASM Dynamic Volume Manager (Oracle ADVM) does not currently support NFS-based ASM files.
Preparing ASM Disks on OS Platforms
This section illustrates the specific tasks needed to configure ASM for the specific operating systems and environments.
Linux
On Intel-based systems such as
Linux/Windows, the first 63 blocks have been reserved for the master
boot record (MBR). Therefore, the first data partition starts at an
offset of 31.5KB (that is, 63 times 512 bytes equals 31.5KB).
This offset can cause misalignment on many
storage arrays’ memory cache or RAID configurations, causing performance
degradation due to overlapping I/Os. This performance impact is
especially evident for large block I/O workloads, such as parallel query
processing and full table scans.
The following shows how to manually perform
the alignment using sfdisk against an EMC Powerpath device. Note that
this procedure is applicable to any OS device that needs to be
partition-aligned.
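A minimal sketch, assuming a PowerPath pseudo device named /dev/emcpowera and an older util-linux sfdisk that accepts sector units (-uS); the partition is started at sector 2048 to give a 1MB offset:

echo "2048,," | sfdisk -uS /dev/emcpowera

# Verify the partition start sector
sfdisk -uS -l /dev/emcpowera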
Solaris
This section covers some of the nuances of
creating disk devices in a Solaris environment. The Solaris format
command is used to create OS slices. Note that slices 0 and 2 (for SMI
labels) cannot be used as ASM disks because these slices include the
Solaris VTOC. An example of the format command output (partition map)
for the device follows:
Notice that slice 4 is created and that it skips four cylinders, thus offsetting past the VTOC.
Use the logical character device as listed in
the /dev/rdsk directory. Devices in this directory are symbolic links to
the physical device files. Here’s an example:
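For example (the link target shown is illustrative; ":e,raw" denotes the raw device for slice 4):

ls -l /dev/rdsk/c3t19d5s4
# lrwxrwxrwx ... /dev/rdsk/c3t19d5s4 -> ../../devices/.../sd@13,0:e,raw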
To change the permission on these devices, do the following:
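As a sketch (ownership follows the oracle:dba convention used earlier):

chown oracle:dba /dev/rdsk/c3t19d*s4
chmod 660 /dev/rdsk/c3t19d*s4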
Now the ASM instance can access the slice. Set
the ASM_DISKSTRING to /dev/rdsk/c*s4. Note that the actual disk string
differs in each environment.
AIX
This section describes how to configure ASM
disks for AIX. It also recommends some precautions that are necessary
when using AIX disks.
In AIX, a disk is assigned a physical volume
identifier (PVID) when it is first assigned to a volume group or when it
is manually set using the AIX chdev command. When the PVID is assigned,
it is stored on the physical disk and in the AIX server’s system object
database, called Object Data Manager (ODM). The PVID resides in
the first 4KB of the disk and is displayed using the AIX lspv command.
In the following listing, the first two disks have PVIDs assigned and
the others do not:
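A sketch of such a listing (the PVIDs shown are placeholders):

lspv
hdisk0     00f62a1b8c9d0e1f     rootvg      active
hdisk1     00f62a1b8c9d1a2b     rootvg      active
hdisk2     none                 None
hdisk3     none                 None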
If a PVID-assigned disk is incorporated into
an ASM disk group, ASM will write an ASM disk header on the first 40
bytes of the disk, thus overwriting the PVID. Although initially no
problems may arise, on the subsequent reboot the OS, in coordination
with the ODM database, will restore the PVID onto the disk, thus
destroying the ASM disk and potentially resulting in data loss.
Therefore, it is a best practice on AIX not to
include a PVID on any disk that ASM will use. If a PVID does exist and
ASM has not used the disk yet, you can clear the PVID by using the AIX
chdev command.
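As a sketch (hdisk3 is an assumed device name):

chdev -l hdisk3 -a pv=clear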
Functionality was added to AIX to help prevent
corruption such as this. AIX commands that write to the LVM information
block have special checking added to determine if the disk is already
in use by ASM. This mechanism is used to prevent these disks from being
assigned to the LVM, which would result in the Oracle data becoming
corrupted. Table 4-1 lists the command and the corresponding AIX version where this checking is done.
AIX 6.1 and AIX 7.1 LVM commands contain new
functionality that can be used to better manage AIX devices used by
Oracle. This new functionality includes commands to better identify
shared disks across multiple nodes, the ability to assign a meaningful
name to a device, and a locking mechanism that the system administrator
can use when the disk is assigned to ASM to help prevent the accidental
reuse of a disk at a later time. This new functionality is listed in Table 4-2,
along with the minimum AIX level providing that functionality. Note
that these manageability commands do not exist for AIX 5.3.
The following illustrates the disk-locking and -checking functionality:
Lock every raw disk used by ASM to protect it from accidental reuse. This can be done while Oracle RAC is active on the cluster:
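A sketch, assuming hdisk3 is one of the ASM disks and that the AIX level provides the lkdev command (see Table 4-2):

lkdev -l hdisk3 -a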
Then use the lspv command to check the status of the disks.
ASM and Multipathing
An I/O path generally consists of an
initiator port, fabric port, target port, and LUN. Each permutation of
this I/O path is considered an independent path. For example, in a
high-availability scenario where each node has two host bus adapter
(HBA) ports connected to two separate switch ports to two target ports
on the back-end storage to a LUN, eight paths are visible to that LUN
from the OS perspective (two HBA ports times two switch ports times two
target ports times one LUN equals eight paths).
Path managers discover multiple paths to a
device by issuing a SCSI inquiry command (SCSI_INQUIRY) to each
operating system device. For example, on Linux the scsi_id call queries a
SCSI device via the SCSI_INQUIRY command and leverages the vital
product data (VPD) page 0x80 or 0x83. A disk or LUN responds to the
SCSI_INQUIRY command with information about itself, including vendor and
product identifiers and a unique serial number. The output from this
query is used to generate a value that is unique across all SCSI devices
that properly support pages 0x80 or 0x83.
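As a sketch (the scsi_id binary path and flags vary by Linux distribution; /dev/sda is an assumed device):

/sbin/scsi_id -g -u -d /dev/sda
# prints the unique identifier derived from the VPD page 0x80/0x83 data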
Typically devices that respond to the
SCSI_INQUIRY with the same serial number are considered to be accessible
from multiple paths.
Path manager software also provides multipath
software drivers. Most multipathing drivers support multipath services
for fibre channel–attached SCSI-3 devices. These drivers receive naming
and transport services from one or more physical HBA devices. To support
multipathing, a physical HBA driver must comply with the multipathing
services provided by this driver. Multipathing tools provide the
following benefits:
They provide a single block device interface for a multipathed LUN.
They detect any component failures in the I/O path, including the fabric port, channel adapter, or HBA.
When a loss of path occurs, such tools ensure that I/Os are rerouted to the available paths, with no process disruption.
They reconfigure the multipaths automatically when events occur.
They ensure that failed paths get revalidated as soon as possible and provide auto-failback capabilities.
They
configure the multipaths to maximize performance using various
load-balancing methods, such as round robin, least I/Os queued, and
least service time.
When a given disk has several paths defined,
each one will be presented as a unique pathname at the OS level,
although they all reference the same physical LUN; for example, the LUNs
/dev/rdsk/c3t19d1s4 and /dev/rdsk/c7t22d1s4 could be pointing to the
same disk device. The multipath abstraction provides I/O load balancing
across the HBAs as well as nondisruptive failovers on I/O path failures.
ASM, however, can tolerate the discovery of
only one unique device path per disk. For example, if the ASM_DISKSTRING
is /dev/rdsk/*, then several paths to the same device will be
discovered and ASM will produce an error message stating this. A
multipath driver, which generally sits above this SCSI-block layer,
usually produces a pseudo device that virtualizes the subpaths. For
example, in the case of EMC’s PowerPath, you can use the ASM_DISKSTRING
setting of /dev/rdsk/emcpower*. When I/O is issued to this disk device,
the multipath driver intercepts it and provides the necessary load
balancing to the underlying subpaths.
Examples of multipathing software include
Linux Device Mapper, EMC PowerPath, Veritas Dynamic Multipathing (DMP),
Oracle Sun Traffic Manager, Hitachi Dynamic Link Manager (HDLM), Windows
MPIO, and IBM Subsystem Device Driver Path Control Module (SDDPCM).
Additionally, some HBA vendors, such as QLogic, also provide multipathing solutions.
NOTE
Users are advised to verify the vendor
certification of ASM/ASMLIB with their multipathing drivers, because
Oracle does not certify or qualify these multipathing tools. Although
ASM does not provide multipathing capabilities, it does leverage
multipathing tools as long as the path or device that they produce
brings back a successful return code from an fstat system call. Metalink
Note 294869.1 provides more details on ASM and multipathing.
Linux Device Mapper
Device mapper provides a kernel framework that
allows multiple device drivers to be stacked on top of each other.
This section describes device mapper as it relates to
ASM and ASMLIB or Udev, because these components are often used
together, and provides a high-level overview of Linux
Device Mapper and Udev in order to support ASM more effectively.
Linux device mapper’s subsystem is the core
component of the Linux multipath chain. The component provides the
following high-level functionality:
A single logical device node for multiple paths to a single storage device.
I/O
gets rerouted to the available paths when a path loss occurs and there
is no disruption at the upper (user) layers because of this.
Device mapper is configured by using the
library libdevmapper. This library is used by multiple modules such as
dmsetup, LVM2, multipath tools, and kpartx, as shown in Figure 4-1.
Device mapper provides the kernel resident
mechanisms that support the creation of different combinations of
stacked target drivers for different block devices. At the highest level
of the stack is a single mapped device. This mapped device is
configured in the device mapper by passing map information about the
target devices to the device mapper via the libdevmapper library
interfaces. This role is performed by the multipath configuration tool
(discussed later). Each mapped segment consists of a start sector and
length and a target driver–specific number of target driver parameters.
Here’s an example of a mapping table for a multipath device with two underlying block devices:
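A hedged reconstruction of such a dmsetup table entry, consistent with the description that follows (the map name mpath1 is an assumption):

dmsetup table mpath1
0 1172123558 multipath 0 0 1 1 round-robin 0 2 1 65:208 1000 65:16 1000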
The preceding indicates that the starting
sector is 0, the length is 1172123558, and the driver is multipath
followed by multipathing parameters. Two devices are associated with
this map, 65:208 and 65:16 (major:minor). The parameter 1000 indicates
that after 1,000 I/Os the second path will be used in a round-robin
fashion.
Here’s an example of a mapping table for a logical volume (LVM2) device with one underlying block device:
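A hedged reconstruction (the map name is an assumption, and the trailing value is the starting offset on the underlying device, assumed 0 here):

dmsetup table vg01-lvol1
0 209715200 linear 9:1 0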
This indicates that the starting sector is 0 and the length is 209715200 with a linear driver for device 9:1 (major:minor).
Udev
In earlier versions of Linux, all device nodes were statically created in the dev-FS (/dev) file system at installation time. This created a huge list of entries in /dev, most of which were unused because the corresponding devices were not actually connected to the system. The other problem with this approach was that the same device could receive a different name each time it was connected, because the kernel assigned device names on a first-come basis: the first SCSI device discovered was named /dev/sda, the second /dev/sdb, and so on. Udev resolves this problem by managing device nodes on demand. Also, name-to-device mappings are not based on the order in which the devices are detected but on a system of predefined rules. Udev relies on the kernel hot-plug mechanism to create device files in user space.
The discovery of the current configuration is
done by probing block device nodes created in Sysfs. This file system presents kernel objects such as block devices, buses, and drivers to user space in a hierarchical manner. A device node is created
by Udev in reaction to a hot-plug event generated when a block device’s
request queue is registered with the kernel’s block subsystem.
As shown in Figure 4-2, in the context of multipath implementation, Udev performs the following tasks:
The multipath user-space daemon listens for the addition and removal of paths. This ensures that the multipath device maps are always up to date with the physical topology.
The user-space callbacks triggered by path addition or removal also invoke the user-space tool kpartx to create maps for the device partitions.
Many users use symbolic links to point to ASM disks instead of using Udev. A common question we get is whether there is a difference (advantages/disadvantages) between using symbolic links and using mknod devices with respect to aliasing a device.
Neither one provides persistent naming. In
other words, if you use a mknod device, you have a new alias to the
major/minor number. However, if the device changes its major/minor
number for any reason, your mknod device will be stale, just like the
symlink will be stale. The only way to obtain persistence is by using
ASMLIB, ASM, or Udev. For example, in Udev you can either rename the block device itself (using the NAME keyword) or create a symlink to it (using the SYMLINK keyword). Here’s an example:
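(The scsi_id invocation, the UUID in the RESULT match, and the ownership settings shown here are illustrative and vary by platform and release.)

KERNEL=="sd*", SUBSYSTEM=="block", PROGRAM=="/sbin/scsi_id -g -u -d /dev/$name", RESULT=="360a98000686f6959684a453333524174", NAME="asmdisk1", OWNER="oracle", GROUP="dba", MODE="0660"

KERNEL=="sd*", SUBSYSTEM=="block", PROGRAM=="/sbin/scsi_id -g -u -d /dev/$name", RESULT=="360a98000686f6959684a453333524174", SYMLINK+="asmdisk1", OWNER="oracle", GROUP="dba", MODE="0660"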
The first rule renames the sd device to “asmdisk1”. The second rule leaves the sd device name alone and creates a new symlink called “asmdisk1” that points to the sd device. Both methods keep pointing to the correct device as identified by the UUID; obviously, you should use only one method or the other.
Multipath Configuration
Now that we’ve talked about the various
tools and modules involved in making Linux multipathing work, this
section discusses the configuration steps required to make it
operational.
The main configuration file that drives the
Linux multipathing is /etc/multipath.conf. The configuration file is
composed of the following sections:
Defaults The default values of attributes are used whenever no specific device setting is given.
Blacklist Devices that should be excluded from the multipath discovery.
Blacklist_exceptions Devices
that should be included in the multipath discovery despite being listed
in the blacklist section. This section is generally used when you have excluded many disks with wildcards in the blacklist section but still want to include a few of them.
Multipaths This
is the main section that defines the multipath topology. The values are
indexed by the worldwide identifier (WWID) of the device.
Devices Device-specific settings.
Following are some commonly used
configuration parameters in the multipath.conf file (for a complete
listing, refer to the man pages on your system):
path_grouping_policy This parameter defines the path grouping policy. Here are the possible values:
failover There is one active path. This is equivalent to an active/passive configuration.
multibus All the paths are in one priority group and are used in a load-balancing configuration. This is the default value.
group_by_serial/group_by_prio/group_by_node_name The paths are grouped by serial number, by priority (as determined by a user callout program), or by target node name.
prio_callout The default program and arguments to call out to obtain a path priority value.
path_checker The method used to determine the path’s state.
user_friendly_names When set to yes, multipath devices are named mpath<n> unless an alias is defined in the multipaths section.
no_path_retry Specifies the number of retries before queuing is disabled, or fail to cause immediate I/O failure (no queuing). The default is 0.
Here’s an example of a configuration file that is used for RAC cluster configurations:
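(A representative sketch; the WWIDs, the second alias, the permissions, and the user and group IDs are illustrative.)

blacklist {
    devnode "^asm"
    devnode "^ofsctl"
    wwid 360060e80104ce140004ce14000000001
    wwid 360060e80104ce140004ce14000000002
}

defaults {
    user_friendly_names     no
    path_grouping_policy    failover
    no_path_retry           5
}

multipaths {
    multipath {
        wwid    360060e80104ce140004ce14000000014
        alias   HDD__966617575
        mode    660
        uid     1100
        gid     1200
    }
    multipath {
        wwid    360060e80104ce140004ce14000000015
        alias   HDD__966617576
        mode    660
        uid     1100
        gid     1200
    }
}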
The first section blacklists the asm and
ofsctl (ACFS) devices. It also specifically blacklists the two disks
with specific WWIDs, which in the context of the sample setup were the
OS disks. The defaults section provides generic parameters that are
applicable in this setup. Finally, the multipaths section lists two
WWIDs and corresponding aliases to be assigned to the devices. The mode,
uid, and gid provide the created device permissions, user ID, and group
ID, respectively, after creation.
Setting Up the Configuration
To create the multipath device mappings after configuring the multipath.conf file, use the /sbin/multipath command:
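(Device names, WWIDs, and sizes are illustrative, and the exact output format depends on the multipath-tools release.)

# /sbin/multipath -v2
# /sbin/multipath -ll HDD__966617575
HDD__966617575 (360060e80104ce140004ce14000000014)
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
 \_ 3:0:0:12 sdau 66:224 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 1:0:0:12 sdk  8:160  [active][ready]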
In this example, two multipath devices are
created in failover configuration. The first priority group is the
active paths: The device /dev/sdau is the active path for HDD__966617575
and /dev/sdk is the passive (standby) path. The path is visible in the
/dev/mapper folder. Furthermore, the partition mappings for the devices
can be created on the device using the kpartx tool.
After these settings have completed, the devices can be used by ASM or other applications on the server.
Disk Group
The primary component of ASM is the disk group, which is the highest-level data structure in ASM (see Figure 4-3).
A disk group is essentially a container that consists of a logical
grouping of disks that are managed together as a unit. The disk group is
comparable to a logical volume manager’s (LVM’s) volume group.
A disk group can contain files from many
different Oracle databases. Allowing multiple databases to share a disk
group provides greater potential for improved disk utilization and
greater overall throughput. The Oracle database may also store its files
in multiple disk groups managed by the same ASM instance. Note that a
database file can only be part of one disk group.
ASM disk groups differ from typical LVM volume
groups in that ASM disk groups have inherent automatic file-level
striping and mirroring capabilities. A database file created within an
ASM disk group is distributed equally across all disks in the disk
group, which provides an even input/output (I/O) load.
Disk Group Management
ASM has three disk group types: external redundancy, normal redundancy, and high redundancy. The disk group type, which is defined at disk group creation time, determines the default level of mirroring performed by ASM. An external redundancy disk group indicates that striping is done by ASM and that mirroring is handled and managed by the storage array. For example, a user may create an external redundancy disk group where the storage array (SAN) is an EMC VMAX or Hitachi USP series. Because the core competency of these high-end arrays is mirroring, external redundancy is well suited for them. A common question is, does ASM striping conflict with the striping performed by the SAN? The answer is no; ASM striping is complementary to the SAN striping.
With ASM redundancy, ASM performs and manages
the mirroring. ASM redundancy is the core deployment strategy used in
Oracle’s Engineered Solutions, such as Oracle Exadata, Oracle
SuperCluster, and Oracle Database Appliance. ASM redundancy is also used
with low-cost commodity storage or when deploying stretch clusters. For
details on ASM redundancy, see the “ASM Redundancy and Failure Groups”
section later in this chapter. The next section focuses on creating disk
groups in an external redundancy environment.
Creating Disk Groups
The creation of a disk group involves validation of the disks to be added. The disks must have the following attributes:
They cannot already be in use by another disk group.
They must not have a preexisting valid ASM header. The FORCE option must be used to override this.
They
cannot have an Oracle file header (for example, from any file created
by the RDBMS). The FORCE option must be used to override this. Trying to create an ASM disk using a raw device that contains a data file results in the following error:
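(The exact error stack can vary by release; the device path is illustrative.)

SQL> CREATE DISKGROUP DATA DISK '/dev/raw/raw6';
CREATE DISKGROUP DATA DISK '/dev/raw/raw6'
*
ERROR at line 1:
ORA-15018: diskgroup cannot be created
ORA-15201: disk /dev/raw/raw6 contains a valid RDBMS file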
The disk header validation prevents ASM from
destroying any data device already in use. Only disks with a header
status of CANDIDATE, FORMER, or PROVISIONED are allowed to be included
in disk group creation. To add disks to a disk group with a header
status of MEMBER or FOREIGN, use the FORCE option in the disk group
creation. To prevent gratuitous use of the FORCE option, ASM allows it
only when using the NOFORCE option would fail. An attempt to use FORCE
when it is not required results in an ORA-15034 error (disk ‘%s’ does
not require the FORCE option). Use the FORCE option with extreme
caution, because it overwrites the data on the disk that was previously
used as an ASM disk or database file.
A disk without a recognizable header is considered a CANDIDATE. There is no persistent header status called “candidate.”
Once ASM has discovered the disks, they can be
used to create a disk group. To reduce the complexity of managing ASM
and its disk groups, Oracle recommends that generally no more than two
disk groups be maintained and managed per RAC cluster or single ASM
instance. The following are typical disk groups that are created by
customers:
DATA disk group This
is where active database files such as data files, control files,
online redo logs, and change-tracking files used in incremental backups
are stored.
Fast Recovery Area (FRA) disk group This
is where recovery-related files are created, such as multiplexed copies
of the current control file and online redo logs, archived redo logs,
backup sets, and flashback log files.
Having one DATA disk group means there’s only one place to store all your database files, and it obviates the need to juggle data files around or decide where to place a new tablespace, as in traditional file system configurations. Having one
disk group for all your files also means better storage utilization,
thus making the IT director and storage teams very happy. If more
storage capacity or I/O capacity is needed, just add an ASM disk and
ensure that this storage pool container houses enough spindles to
accommodate the I/O rate of all the database objects.
To provide higher availability for the
database, when a Fast Recovery Area is chosen at database creation time,
an active copy of the control file and one member set of the redo log
group are stored in the Fast Recovery Area. Note that additional copies
of the control file or extra log files can be created and placed in
either disk group, as desired.
RAC users can optionally create a CRSDATA disk
group to store Oracle Clusterware files (for example, voting disks and
the Oracle Cluster registry) or the ASM spfile. When deploying the
CRSDATA disk group for this purpose, you should minimally use ASM normal
redundancy with three failure groups. This is generally done to provide
added redundancy for the Clusterware files. See Chapter 2 for more details.
Note that creating additional disk groups for
storing database data does not necessarily improve performance. However,
additional disk groups may be added to support tiered storage classes
in Information Lifecycle Management (ILM) or Hierarchical Storage
Management (HSM) deployments. For example, a separate disk group can be
created for archived or retired data (or partitions), and these
partitioned tablespaces can be migrated or initially placed on a disk
group based on Tier2 storage (RAID5), whereas Tier1 storage (RAID10) can
be used for the DATA disk group.
ASM provides out-of-the-box enablement of
redundancy and optimal performance. However, the following items should
be considered to increase performance and/or availability:
Implement multiple access paths to the storage array using two or more HBAs or initiators.
Deploy multipathing software over these multiple HBAs to provide I/O load-balancing and failover capabilities.
Use
disk groups with similarly sized and performing disks. A disk group
containing a large number of disks provides a wide distribution of data
extents, thus allowing greater concurrency for I/O and reducing the
occurrences of hotspots. Because a large disk group can easily sustain
various I/O characteristics and workloads, a single DATA disk group can
be used to house database files, log files, and control files.
Use
disk groups with four or more disks, making sure these disks span
several back-end disk adapters. As stated earlier, Oracle generally
recommends no more than two to three disk groups. For example, a common
deployment can be four or more disks in a DATA disk group spanning all
back-end disk adapters/directors, and eight to ten disks for the FRA
disk group. The size of the FRA will depend on what is stored and how
much (that is, full database backups, incremental backups, flashback
database logs, and archive logs). Note that an active copy of the
control file and one member of each of the redo log group are stored in
the FRA.
A disk group can be created using SQL,
Enterprise Manager (EM), ASMCMD commands, or ASMCA. In the following
example, a DATA disk group is created using four disks that reside in a
storage array, with the redundancy being handled externally by the
storage array. The following query lists the available disks that will
be used in the disk group:
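(A representative query; the disk paths other than c3t19d19s4 are illustrative.)

SQL> SELECT name, path, header_status, mode_status
       FROM v$asm_disk
      WHERE path LIKE '/dev/rdsk/c3t19%';

NAME       PATH                     HEADER_STATUS  MODE_ST
---------- ------------------------ -------------- -------
           /dev/rdsk/c3t19d5s4      CANDIDATE      ONLINE
           /dev/rdsk/c3t19d16s4     CANDIDATE      ONLINE
           /dev/rdsk/c3t19d17s4     CANDIDATE      ONLINE
           /dev/rdsk/c3t19d19s4     FORMER         ONLINE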
Notice that one of the disks, c3t19d19s4, was dropped from a disk group and thus shows a status of FORMER.
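The disk group itself can then be created with a statement along these lines, reusing the paths from the query above:

SQL> CREATE DISKGROUP DATA EXTERNAL REDUNDANCY DISK
       '/dev/rdsk/c3t19d5s4',
       '/dev/rdsk/c3t19d16s4',
       '/dev/rdsk/c3t19d17s4',
       '/dev/rdsk/c3t19d19s4';

Diskgroup created.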
The output from V$ASM_DISKGROUP shows the newly created disk group:
The output from V$ASM_DISK shows the status of the disks once the disk group is created:
After the disk group is successfully created,
metadata information, which includes creation date, disk group name, and
redundancy type, is stored in the System Global Area (SGA) and on each
disk (in the disk header) within the disk group. Although it possible to
mount disk groups only on specific nodes of the cluster, this is
generally not recommended because it may potentially obstruct CRS
resource startup modeling.
Once these disks are under ASM management, all
subsequent mounts of the disk group reread and validate the ASM disk
headers. The following output shows how the V$ASM_DISK view reflects the
disk state change after the disk is incorporated into the disk group:
The output that follows shows entries from the
ASM alert log reflecting the creation of the disk group and the
assignment of the disk names:
When you’re mounting disk groups, either at
ASM startup or for subsequent mounts, it is advisable to mount all
required disk groups at once. This minimizes the overhead of multiple
ASM disk discovery scans. With Grid Infrastructure, agents will
automatically mount any disk group needed by a database.
ASM Disk Names
ASM disk names are assigned by default based
on the disk group name and disk number, but names can be defined by the
user either during ASM disk group creation or when disks are added. The
following example illustrates how to create a disk group where disk
names are defined by the user:
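(The disk group name and disk paths are illustrative.)

SQL> CREATE DISKGROUP DATA2 EXTERNAL REDUNDANCY DISK
       '/dev/rdsk/c3t20d5s4'  NAME DATA2_DISK1,
       '/dev/rdsk/c3t20d6s4'  NAME DATA2_DISK2,
       '/dev/rdsk/c3t20d7s4'  NAME DATA2_DISK3,
       '/dev/rdsk/c3t20d8s4'  NAME DATA2_DISK4;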
If disk names are not provided, ASM dynamically assigns a disk name with a sequence number to each disk added to the disk group:
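(Illustrative output for the DATA disk group created earlier, assuming it is group number 1.)

SQL> SELECT name, path FROM v$asm_disk WHERE group_number = 1;

NAME         PATH
------------ ------------------------
DATA_0000    /dev/rdsk/c3t19d5s4
DATA_0001    /dev/rdsk/c3t19d16s4
DATA_0002    /dev/rdsk/c3t19d17s4
DATA_0003    /dev/rdsk/c3t19d19s4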
The ASM disk name is used when performing disk management activities, such as DROP, ONLINE, OFFLINE, and RESIZE DISK.
The ASM disk name is different from the small
computer system interface (SCSI) address. ASM disk names are stored
persistently in the ASM header, and they persist even if the SCSI
address name changes. Persistent names also allow for consistent naming
across Real Application Clusters (RAC) nodes. SCSI address name changes
occur due to array reconfigurations and/or after OS reboots. There is no
persistent binding of disk numbers to pathnames used by ASM to access
the storage.
Disk Group Numbers
The lowest nonzero available disk group number is assigned on the first mount of a disk group in a cluster. In an ASM cluster, even if the disk groups are mounted in a different order on different nodes, the disk group number for a given disk group is consistent across the cluster; the disk group name, of course, never changes. For example, if node 1 has dgA as group number 1 and dgB as group number 2, and node 2 mounts only dgB, then dgB is still group number 2 on node 2, even though group number 1 is not in use there.
|
NOTE
Disk group numbers are never recorded
persistently, so there is no disk group number in a disk header. Only
the disk group name is recorded in the disk header.
|
Disk Numbers
Although disk group numbers are never
recorded persistently, disk numbers are recorded on the disk headers.
When an ASM instance starts up, it discovers all the devices matching
the pattern specified in the initialization parameter ASM_DISKSTRING and
for which it has read/write access. If it sees an ASM disk header, it
knows the ASM disk number.
Also, disks that are discovered but are not part of any mounted disk group are reported in disk group 0. A disk remains in disk group 0 until it is added to a disk group or until its disk group is mounted, at which point it is associated with the correct disk group.
ASM Redundancy and Failure Groups
For systems that do not use external
redundancy, ASM provides its own redundancy mechanism. This redundancy,
as stated earlier, is used extensively in Exadata and Oracle Database
Appliance systems. These Engineered Solutions are covered in Chapter 12.
A disk group is divided into failure groups,
and each disk is in exactly one failure group. A failure group (FG) is a
collection of disks that can become unavailable due to the failure of
one of its associated components. Possible failing components could be
any of the following:
Storage array controllers
Host bus adapters (HBAs)
Fibre Channel (FC) switches
Disks
Entire arrays, such as NFS filers
Thus, disks in two separate failure groups (for a given disk group) must not share a common failure component. If you define failure groups for your disk group, ASM can tolerate the simultaneous failure of multiple disks in a single failure group in a normal-redundancy disk group, or in two failure groups in a high-redundancy disk group.
ASM uses a unique mirroring algorithm. ASM
does not mirror disks; rather, it mirrors extents. When ASM allocates a
primary extent of a file to one disk in a failure group, it allocates a
mirror copy of that extent to another disk in another failure group.
Thus, ASM ensures that a primary extent and its mirror copy never reside
in the same failure group.
Each file can have a different level of
mirroring (redundancy) in the same disk group. For example, in a normal
redundancy disk group, with at least three failure groups, we can have
one file with (default) normal redundancy, another file with no
redundancy, and yet another file with high redundancy (triple
mirroring).
Unlike other volume managers, ASM has no
concept of a primary disk or a mirrored disk. As a result, to provide
continued protection in the event of failure, your disk group requires
only spare capacity; a hot spare disk is unnecessary. Redundancy for
disk groups can be either normal (the default), where files are two-way mirrored (requiring at least two failure groups), or high,
which provides a higher degree of protection using three-way mirroring
(requiring at least three failure groups). After you create a disk
group, you cannot change its redundancy level. If you want a different
redundancy, you must create another disk group with the desired
redundancy and then move the data files (using Recovery Manager [RMAN]
restore, the ASMCMD copy command, or DBMS_FILE_TRANSFER) from the
original disk group to the newly created disk group.
|
NOTE
Disk group metadata is always triple mirrored with normal or high redundancy.
|
Additionally, after you assign a disk to a
failure group, you cannot reassign it to another failure group. If you
want to move it to another failure group, you must first drop it from
the current failure group and then add it to the desired failure group.
However, because the hardware configuration usually dictates the choice
of a failure group, users generally do not need to reassign a disk to
another failure group unless it is physically moved.
Creating ASM Redundancy Disk Groups
The following simple example shows how to create a normal redundancy disk group using two failure groups over a NetApp filer:
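(The NFS file paths are illustrative.)

SQL> CREATE DISKGROUP DATA_NFS NORMAL REDUNDANCY
       FAILGROUP FLGRP1 DISK '/oradata/filer1/disk1', '/oradata/filer1/disk2'
       FAILGROUP FLGRP2 DISK '/oradata/filer2/disk1', '/oradata/filer2/disk2';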
The same create diskgroup command can be executed using wildcard syntax:
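(Again, the paths are illustrative.)

SQL> CREATE DISKGROUP DATA_NFS NORMAL REDUNDANCY
       FAILGROUP FLGRP1 DISK '/oradata/filer1/disk*'
       FAILGROUP FLGRP2 DISK '/oradata/filer2/disk*';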
In the following example, ASM normal
redundancy is being deployed over a low-cost commodity storage array.
This storage array has four internal trays, with each tray having four
disks. Because the failing component to isolate is the storage tray, the
failure group boundary is set for the storage tray—that is, each
storage tray is associated with a failure group:
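(The device paths are illustrative; each FAILGROUP clause lists the four disks of one tray.)

SQL> CREATE DISKGROUP DATA_NRML NORMAL REDUNDANCY
       FAILGROUP FLGRP1 DISK
         '/dev/rdsk/c4t0d0s4','/dev/rdsk/c4t0d1s4','/dev/rdsk/c4t0d2s4','/dev/rdsk/c4t0d3s4'
       FAILGROUP FLGRP2 DISK
         '/dev/rdsk/c4t1d0s4','/dev/rdsk/c4t1d1s4','/dev/rdsk/c4t1d2s4','/dev/rdsk/c4t1d3s4'
       FAILGROUP FLGRP3 DISK
         '/dev/rdsk/c4t2d0s4','/dev/rdsk/c4t2d1s4','/dev/rdsk/c4t2d2s4','/dev/rdsk/c4t2d3s4'
       FAILGROUP FLGRP4 DISK
         '/dev/rdsk/c4t3d0s4','/dev/rdsk/c4t3d1s4','/dev/rdsk/c4t3d2s4','/dev/rdsk/c4t3d3s4';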
ASM and Intelligent Data Placement
Short stroking (along with RAID 0
striping) is a technique that storage administrators typically use to
minimize the performance impact of head repositioning delays. This
technique reduces seek times by limiting the actuator’s range of motion
and increases the media transfer rate. This effectively improves overall
disk throughput. Short stroking is implemented by formatting a drive so
that only the outer sectors of the disk platter (which have the highest
track densities) are used to store heavily accessed data, thus
providing the best overall throughput. However, short stroking a disk
limits the drive’s capacity by using a subset of the available tracks,
resulting in reduced usable capacity.
Intelligent Data Placement (IDP), a feature introduced in Oracle Database 11g Release 2, emulates the short stroking technique without sacrificing usable capacity or redundancy.
IDP automatically defines disk region
boundaries on ASM disks for best performance. Using the disk region
settings of a disk group, you can place frequently accessed data on the
outermost (hot) tracks. In addition, files with similar access patterns
are located physically close, thus reducing latency. IDP also enables
the placement of primary and mirror extents into different hot or cold
regions.
The IDP feature primarily works on JBOD (just a
bunch of disks) storage or disk storage that has not been partitioned
(for example, using RAID techniques) by the array. Although IDP can be
used with external redundancy, it can only be effective if you know that
certain files are frequently accessed while other files are rarely
accessed and if the lower numbered blocks perform better than the higher
numbered blocks. IDP on external redundancy may not be highly
beneficial. Moreover, IDP over external redundancy is not a tested
configuration in Oracle internal labs.
The default region of IDP is COLD so that all
data is allocated on the lowest disk addresses, which are on the outer
edge of physical disks. When the disk region settings are modified for
an existing file, only new file extensions are affected. Existing file
extents are not affected until a rebalance operation is initiated. It is
recommended that you manually initiate an ASM rebalance when a
significant number of IDP file policies (for existing files) are
modified. Note that a rebalance may affect system throughput, so it
should be a planned change management activity.
IDP settings can be specified for a file or by
using disk group templates. The disk region settings can be modified
after the disk group has been created. IDP is most effective for the
following workloads and access patterns:
For
databases with data files that are accessed at different rates. A
database that accesses all data files in the same way is unlikely to
benefit from IDP.
For
ASM disk groups that are sufficiently populated with usable data. As a
best-practice recommendation, the disk group should be more than
25-percent full. With lesser populated disk groups, the IDP management
overhead may minimize the IDP benefits.
For
disks that have better performance at the beginning of the media
relative to the end. Because Intelligent Data Placement leverages the
geometry of the disk, it is well suited to JBOD (just a bunch of disks).
In contrast, a storage array with LUNs composed of concatenated volumes
masks the geometry from ASM.
To implement IDP, the COMPATIBLE.ASM and
COMPATIBLE.RDBMS disk group attributes must be set to 11.2 or higher.
IDP can be implemented and managed using ASMCA or the following SQL
commands:
ALTER DISKGROUP ... ADD or MODIFY TEMPLATE
ALTER DISKGROUP ... MODIFY FILE
These commands include the disk region clause for setting hot/mirrorhot or cold/mirrorcold regions in a template:
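(The template name and the fully qualified file name are illustrative.)

SQL> ALTER DISKGROUP DATA ADD TEMPLATE datafile_hot
       ATTRIBUTE (HOT MIRRORHOT);

SQL> ALTER DISKGROUP DATA MODIFY FILE '+DATA/orcl/datafile/users.259.723575217'
       ATTRIBUTE (HOT MIRRORHOT);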
IDP is also applicable for ADVM volumes. When
creating ADVM volumes, you can specify the region location for primary
and secondary extents:
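(A sketch using ASMCMD; the volume name and size are illustrative, and the exact option names should be verified against your release.)

ASMCMD> volcreate -G DATA -s 10G --primary hot --secondary cold acfsvol1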
Designing for ASM Redundancy Disk Groups
Note that with ASM redundancy, you are not
restricted to having two failure groups for normal redundancy and three
for high redundancy. In the preceding example, four failure groups are
created to ensure that disk partners are not allocated from the same
storage tray. Another such example can be found in Exadata. In a
full-frame Exadata, there are 14 failure groups.
There may be cases where users want to protect
against storage area network (SAN) array failures. This can be
accomplished by putting each array in a separate failure group. For
example, a configuration may include two NetApp filers and the
deployment of ASM normal redundancy such that each filer—that is, all
logical unit numbers (LUNs) presented through the filer—is part of an
ASM failure group. In this scenario, ASM mirrors the extent between the
two filers.
If the database administrator (DBA) does not
specify a failure group in the CREATE DISKGROUP command, a failure group
is automatically constructed for each disk. This method of placing
every disk in its own failure group works well for most customers. In
fact, in Oracle Database Appliance, all disks are assigned in this
manner.
In case of Exadata, the storage grid disks are
presented with extra information to the database server nodes, so these
servers know exactly how to configure the failure groups without user
intervention.
The choice of failure groups depends on the
kinds of failures that need to be tolerated without the loss of data
availability. For small numbers of disks (for example, fewer than 20),
it is usually best to put every disk in its own failure group.
Nonetheless, this is also beneficial for large numbers of disks when the
main concern is spindle failure. To protect against the simultaneous
loss of multiple disk drives due to a single component failure, an
explicit failure group specification should be used. For example, a disk
group may be constructed from several small modular disk arrays. If the
system needs to continue operation when an entire modular array fails,
each failure group should consist of all the disks in one module. If one
module fails, all the data on that module is relocated to other modules
to restore redundancy. Disks should be placed in the same failure group
if they depend on a common piece of hardware whose unavailability or
failure needs to be tolerated.
It is much better to have several failure
groups as long as the data is still protected against the necessary
component failures. Having additional failure groups provides better
odds of tolerating multiple failures. Failure groups of uneven capacity
can lead to allocation problems that prevent full utilization of all
available storage. Moreover, having failure groups of different sizes
can waste disk space. There may be enough room to allocate primary
extents, but no space available for secondary extents. For example, in a
disk group with six disks and three failure groups, if two disks are in
their own individual failure groups and the other four are in one
common failure group, the allocation will be very unequal. All the
secondary extents from the big failure group can be placed on only two
of the six disks. The disks in the individual failure groups fill up
with secondary extents and block additional allocation even though
plenty of space is left in the large failure group. This also places an
uneven read and write load on the two disks that are full because they
contain more secondary extents that are accessed only for writes or if
the disk with the primary extent fails.
Allocating ASM Extent Sets
With ASM redundancy, the first file extent
allocated is chosen as the primary extent, and the mirrored extent is
called the secondary extent. In the case of high redundancy, there will
be two secondary extents. This logical grouping of primary and secondary
extents is called an extent set. Each disk in a disk group
contains nearly the same number of primary and secondary extents. This
provides an even distribution of read I/O activity across all the disks.
All the extents in an extent set always
contain the exact same data because they are mirrored versions of each
other. When a block is read from disk, it is always read from the
primary extent, unless the primary extent cannot be read. The preferred
read feature allows the database to read the secondary extent first
instead of reading the primary extent. This is especially important for
RAC Extended Cluster implementations. See the section “ASM and Extended
Clusters,” later in this chapter, for more details on this feature.
When a block is to be written to a file, each
extent in the extent set is written in parallel. This requires that all
writes complete before acknowledging the write to the client. Otherwise,
the unwritten side could be read before it is written. If one write I/O
fails, that side of the mirror must be made unavailable for reads
before the write can be acknowledged.
Disk Partnering
In ASM redundancy disk groups, ASM protects
against a double-disk failure (which can lead to data loss) by mirroring
copies of data on disks that are partners of the disk containing the
primary data extent. A disk partnership is a symmetric
relationship between two disks in a high- or normal-redundancy disk
group, and ASM automatically creates and maintains these relationships.
ASM selects partners for a disk from failure groups other than the
failure group to which the disk belongs. This ensures that a disk with a
copy of the lost disk’s data will be available following the failure of
the shared resource associated with the failure group. ASM limits the
number of disk partners to eight for any single disk.
Note that in the DATA_NRML example, ASM does not choose partner disks from a disk’s own failure group (FLGRP1, for instance); rather, its eight partners are chosen from the other three failure groups. Disk partnerships are only changed
when there is a loss or addition of an ASM disk. These partnerships are
not modified when disks are placed offline. Disk partnership is detailed
in Chapter 9.
Recovering Failure Groups
Let’s now return to the example in the
previous CREATE DISKGROUP DATA_NRML command. In the event of a disk
failure in failure group FLGRP1, which will induce a rebalance, the
contents (data extents) of the failed disk are reconstructed using the
redundant copies of the extents from partner disks. These partner disks are from failure groups FLGRP2, FLGRP3, or FLGRP4. If the database instance
needs to access an extent whose primary extent was on the failed disk,
the database will read the mirror copy from the appropriate disk. After
the rebalance is complete and the disk contents are fully reconstructed,
the database instance returns to reading primary copies only.
ASM and Extended Clusters
An extended cluster—also called a stretch cluster, geocluster, campus cluster, or metro-cluster—is
essentially a RAC environment deployed across two data center
locations. Many customers implement extended RAC to marry disaster
recovery with the benefits of RAC, all in an effort to provide higher
availability. Within Oracle, the term extended clusters is used to refer to all of the stretch cluster implementations.
The distance for extended RAC can be anywhere
between several meters to several hundred kilometers. Because Cluster
Ready Services (CRS)–RAC cluster group membership is based on the
ability to communicate effectively across the interconnect, extended
cluster deployment requires a low-latency network infrastructure. For
close proximity, users typically use Fibre Channel, whereas for large
distances Dark Fiber is used.
For normal-redundancy disk groups in extended
RAC, there should be only one failure group on each site of the extended
cluster. High-redundancy disk groups should not be used in extended
cluster configurations unless there are three sites. In this scenario,
there should be one failure group at each site. Note that you must name the
failure groups explicitly based on the site name.
|
NOTE
If a disk group contains an asymmetrical
configuration, such that there are more failure groups on one site than
another, then an extent could get mirrored to the same site and not to
the remote failure group. This could cause the loss of access to the
entire disk group if the site containing more than one failure group
fails.
|
In Oracle Clusterware 11g Release 2,
the concept of a quorum failgroup was introduced (a regular failgroup is the default). A quorum failure group is a special type of failure group that is used to house a single CSS quorum file, along with ASM metadata. Therefore, a quorum failgroup needs to be only about 300MB in size.
Because these failure groups do not contain user data, the quorum
failure group is not considered when determining redundancy
requirements. Additionally, the USABLE_FILE_MB in V$ASM_DISKGROUP does
not consider any free space that is present in QUORUM disks. However, a
quorum failure group counts when mounting a disk group. Chapter 1 contains details on creating disk groups with a quorum failgroup.
ASM Preferred Read
As stated earlier, ASM always reads the
primary copy of a mirrored extent set. Thus, a read for a specific block
may require a read of the primary extent at the remote site across the
interconnect. Accessing a remote disk through a metropolitan area or
wide area storage network is substantially slower than accessing a local
disk. This can tax the interconnect as well as result in high I/O and
network latency.
To mitigate this, a feature called preferred reads enables ASM administrators to specify a failure group from which local reads are satisfied (that is, preferred reads). In a normal- or high-redundancy disk group, when an extent set has a preferred disk, a read is always satisfied by the preferred disk if it is online. This feature is especially beneficial for extended cluster configurations.
The ASM_PREFERRED_READ_FAILURE_GROUPS
initialization parameter is used to specify a list of failure group
names that will provide local reads for each node in a cluster. The
format of the ASM_PREFERRED_READ_FAILURE_GROUPS parameter is as follows:
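ASM_PREFERRED_READ_FAILURE_GROUPS = diskgroup_name.failure_group_name[, diskgroup_name.failure_group_name ...]

For example, the following entry (with illustrative names) designates failure group FG1 of disk group MYDATA as the local, preferred failure group:

ASM_PREFERRED_READ_FAILURE_GROUPS = MYDATA.FG1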
Each entry is composed of DISKGROUP_NAME, which is the name of the disk group, and FAILUREGROUP_NAME,
which is the name of the failure group within that disk group, with a
period separating these two variables. Multiple entries can be specified
using commas as separators. This parameter can be dynamically changed.
The Preferred Read feature can also be useful
in mixed storage configurations. For example, in read-mostly workloads,
SSD storage can be created in one failgroup and standard disk drives can
be included in a second failgroup. This mixed configuration is
beneficial when the SSD storage is in limited supply (in the array) or
for economic reasons. The ASM_PREFERRED_READ_FAILURE_GROUPS parameter
can be set to the SSD failgroup. Note that writes will occur on both
failgroups.
In an extended cluster, the failure groups
that you specify with settings for the ASM_PREFERRED_READ_FAILURE_GROUPS
parameter should contain only disks that are local to the instance.
V$ASM_DISK indicates the preferred disks with a Y in the PREFERRED_READ
column.
The following example shows how to deploy the
preferred read feature and demonstrates some of its inherent benefits.
This example illustrates I/O patterns when the
ASM_PREFERRED_READ_FAILURE_GROUPS parameter is not set, and then
demonstrates how changing the parameter affects I/O:
1. Create a disk group with two failure groups:
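(Device paths are illustrative; three disks are placed in each failure group.)

SQL> CREATE DISKGROUP MYDATA NORMAL REDUNDANCY
       FAILGROUP FG1 DISK '/dev/sdb1','/dev/sdc1','/dev/sdd1'
       FAILGROUP FG2 DISK '/dev/sde1','/dev/sdf1','/dev/sdg1';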
2. The I/Os are evenly distributed across all disks—that is, these are non-localized I/Os.
3. The following query displays the balanced I/O that is the default for ASM configurations:
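(A representative query, using the MYDATA disk group from step 1.)

SQL> SELECT d.name, d.failgroup, d.reads, d.writes
       FROM v$asm_disk d, v$asm_diskgroup g
      WHERE d.group_number = g.group_number
        AND g.name = 'MYDATA';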
|
NOTE
V$ASM_DISK includes I/Os that are performed
by the ASM instance for ASM metadata. The V$ASM_DISK_IOSTAT tracks I/O
on a per-database basis. This view can be used to verify that the RDBMS
instance does not perform any I/O to a nonpreferred disk.
|
4. Now set the appropriate ASM
parameters for the preferred read. Note that you need not dismount or
remount the disk group because this parameter is dynamic.
Enter the following for Node1 (site1):
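(Disk group and failure group names follow the example in step 1.)

SQL> ALTER SYSTEM SET asm_preferred_read_failure_groups = 'MYDATA.FG1' SID='+ASM1';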
Enter this code for Node2 (site2):
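(FG2 is the failure group local to site2.)

SQL> ALTER SYSTEM SET asm_preferred_read_failure_groups = 'MYDATA.FG2' SID='+ASM2';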
5. Verify that the parameter took effect by querying GV$ASM_DISK. From Node1, observe the following:
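(A representative query; group number 1 is assumed for MYDATA, and the output is restricted to the preferred-read disks.)

SQL> SELECT inst_id, name, failgroup, preferred_read
       FROM gv$asm_disk
      WHERE group_number = 1
        AND preferred_read = 'Y'
      ORDER BY inst_id, name;

   INST_ID NAME          FAILGROUP  P
---------- ------------- ---------- -
         1 MYDATA_0000   FG1        Y
         1 MYDATA_0001   FG1        Y
         1 MYDATA_0004   FG1        Y
         2 MYDATA_0002   FG2        Y
         2 MYDATA_0003   FG2        Y
         2 MYDATA_0005   FG2        Y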
Keep in mind that disks MYDATA_0000,
MYDATA_0001, and MYDATA_0004 are part of the FG1 failure group, and
disks MYDATA_0002, MYDATA_0003, and MYDATA_0005 are in failure group
FG2.
6. Put a load on the system and
check I/O calls via EM or using V$ASM_DISK_IOSTAT. Notice in the
“Reads-Total” column that reads have a strong affinity to the disks in
FG1. This is because FG1 is local to Node1 where +ASM1 is running. The
remote disks in FG2 have very few reads.
7. Notice the small number of reads
that instance 1 is making to FG2 and the small number of reads that
instance 2 is making to FG1:
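(A representative per-database query against V$ASM_DISK_IOSTAT; run it on each ASM instance.)

SQL> SELECT instname, dbname, disk_number, reads, writes
       FROM v$asm_disk_iostat
      ORDER BY disk_number;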
Recovering from Transient and Permanent Disk Failures
This section reviews how ASM handles transient and permanent disk failures in normal- and high-redundancy disk groups.
Recovering from Disk Failures: Fast Disk Resync
The ASM Fast Disk Resync feature
significantly reduces the time to recover from transient disk failures
in failure groups. The feature accomplishes this speedy recovery by
quickly resynchronizing the failed disk with its partnered disks.
With Fast Disk Resync, the repair time is
proportional to the number of extents that have been written or modified
since the failure. This feature can significantly reduce the time that
it takes to repair a failed disk group from hours to minutes.
The Fast Disk Resync feature allows the user a
grace period to repair the failed disk and bring it back online. This time
allotment is dictated by the ASM disk group attribute DISK_REPAIR_TIME.
This attribute dictates the maximum disk outage that ASM can
tolerate before dropping the disk. If the disk is repaired before this
time is exceeded, then ASM resynchronizes the repaired disk when the
user places the disk online. The command ALTER DISKGROUP DISK ONLINE is
used to place the repaired disk online and initiate disk
resynchronization.
Taking disks offline does not change any
partnerships. Repartnering occurs when the disks are dropped at the end
of the expiration period.
Fast Disk Resync requires that the
COMPATIBLE.ASM and COMPATIBLE.RDBMS attributes of the ASM disk group be
set to at least 11.1.0.0.
In the following example, the current ASM 11gR2
disk group has a compatibility of 11.1.0.0 and is modified to 11.2.0.3.
To validate the attribute change, the V$ASM_ATTRIBUTE view is queried:
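(The disk group name is illustrative, and group number 1 is assumed in the validation query.)

SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'compatible.asm' = '11.2.0.3';
SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'compatible.rdbms' = '11.2.0.3';

SQL> SELECT name, value FROM v$asm_attribute
      WHERE group_number = 1 AND name LIKE 'compatible.%';

NAME                  VALUE
--------------------- -----------
compatible.asm        11.2.0.3
compatible.rdbms      11.2.0.3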
After you correctly set the compatibility to
Oracle Database version 11.2.0.3, you can set the DISK_REPAIR_TIME
attribute accordingly. Notice that the default repair time is 12,960
seconds, or 3.6 hours. The best practice is to set DISK_REPAIR_TIME to a
value depending on the operational logistics of the site; in other
words, it should be set to the mean time to detect and repair the disk.
If the value of DISK_REPAIR_TIME needs to be changed, you can enter the following command:
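(The value shown is illustrative.)

SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'disk_repair_time' = '8.5h';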
If the DISK_REPAIR_TIME parameter is not 0 and
an ASM disk fails, that disk is taken offline but not dropped. During
this outage, ASM tracks any modified extents using a bitmap that is
stored in disk group metadata. (See Chapter 9 for more details on the algorithms used for resynchronization.)
ASM’s GMON process will periodically inspect
(every three seconds) all mounted disk groups for offline disks. If GMON
finds any, it sends a message to a slave process to increment their
timer values (by three seconds) and initiate a drop for the offline
disks when the timer expires. This timer is displayed in the REPAIR_TIMER column of V$ASM_DISK.
The ALTER DISKGROUP DISK OFFLINE SQL command
or the EM ASM Target page can also be used to take the ASM disks offline
manually for preventative maintenance. The following describes this
scenario using SQL*Plus:
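(Disk and disk group names are illustrative.)

SQL> ALTER DISKGROUP DATA OFFLINE DISK DATA_0001;

Diskgroup altered.

SQL> SELECT name, mount_status, mode_status, repair_timer
       FROM v$asm_disk
      WHERE name = 'DATA_0001';

NAME         MOUNT_S MODE_ST REPAIR_TIMER
------------ ------- ------- ------------
DATA_0001    MISSING OFFLINE        12960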
Notice that the offline disk’s MOUNT_STATUS
and MODE_STATUS are set to the MISSING and OFFLINE states, and also that
the REPAIR_TIMER begins to decrement from the drop timer.
Disks Are Offline
After the maintenance is completed, you can use the ALTER DISKGROUP DATA ONLINE command to bring the disk (or all offline disks in the disk group) back online.
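A minimal sketch, reusing the DATA_0001 disk from the earlier example, is either of the following:

SQL> ALTER DISKGROUP DATA ONLINE DISK DATA_0001;

or

SQL> ALTER DISKGROUP DATA ONLINE ALL;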
This statement brings all the offline disks back online to bring the stale contents up to date and to enable new contents. See Chapter 9 for more details on how to implement resynchronization.
The following is an excerpt from the ASM alert log showing a disk being brought offline and online:
After fixing the disk, you can bring it online using the following command:
Once the disk is brought back online, the REPAIR_TIMER is reset to 0 and the MODE_STATUS is set to ONLINE.
At first glance, the Fast Disk Resync feature
may seem to be a substitute for Dirty Region Logging (DRL), which
several logical volume managers implement. However, Fast Disk Resync and
DRL are distinctly different.
DRL is a mechanism to track blocks that have
writes in flight. A mirrored write cannot be issued unless a bit in the
DRL is set to indicate there may be a write in flight. Because DRL
itself is on disk and also mirrored, it may require two DRL writes
before issuing the normal mirrored write. This is mitigated by having
each DRL bit cover a range of data blocks such that setting one bit will
cover multiple mirrored block writes. There is also some overhead for
I/O to clear DRL bits for blocks that are no longer being written. You
can often clear these bits while setting another bit in DRL.
If a host dies while it has mirrored writes in
flight, it is possible that one side of the mirror is written and the
other is not. Most applications require that they get the same data
every time if they read a block multiple times without writing it. If
one side was written but not the other, then different reads may get
different data. DRL mitigates this by constructing a set of blocks that
must be copied from one side to the other to ensure that all blocks are
the same on both sides of the mirror. Usually this set of blocks is much
larger than those that were being written at the time of the crash, and
it takes a while to create the copies.
During the copy, the storage is unavailable,
which increases overall recovery times. Additionally, it is possible
that the failure caused a partial write to one side, resulting in a
corrupt logical block. The copying may write the bad data over the good
data because the volume manager has no way of knowing which side is
good.
Fortunately, ASM does not need to maintain a
DRL. ASM clients, which include the Oracle database and ACFS, manage resilvering themselves and know how to recover their data so that the mirror sides are the same for the cases that matter; in other words, it is a client-side implementation. It is not always necessary to make the
mirror sides the same. For example, if a file is being initialized
before it is part of the database, it will be reinitialized after a
failure, so that file does not matter for the recovery process. For data
that does matter, Oracle must always have a means of tolerating a write
that was started but that might not have been completed. The redo log
is an example of one such mechanism in Oracle. Because Oracle already
has to reconstruct such interrupted writes, it is simple to rewrite both
sides of the mirror even if it looks like the write completed
successfully. The number of extra writes can be small, because Oracle is
excellent at determining exactly which blocks need recovery.
Another benefit of not using a DRL is that a
corrupt block, which does not report an I/O error on read, can be
recovered from the good side of the mirror. When a block corruption is
discovered, each side of the mirror is read to determine whether one of
them is valid. If the sides are different and one is valid, then the
valid copy is used and rewritten to both sides. This can repair a
partial write at host death. This mechanism is used all the time, not
just for recovery reads. Thus, an external corruption that affects only
one side of an ASM mirrored block can also be recovered.
ASM and I/O Error Failure Management
Whereas the previous section covers ASM
handling of transient and permanent disk failures in ASM redundancy disk
groups, this section discusses how ASM processes I/O errors, such as
read and write errors, and also discusses in general how to handle I/O
failures in external redundancy disk groups.
General Disk Failure Overview
Disk drives are mechanical devices and thus
tend to fail. As drives begin to fail or have sporadic I/O errors,
database failures become more likely.
The ability to detect and resolve device path
failures is a core component of path managers as well as HBAs. A disk
device can be in the following states or have the following issues:
Media sense errors These
include hard read errors and unrecoverable positioning errors. In this
situation, the disk device is still functioning and responds to
SCSI_INQUIRY requests.
Device too busy A
disk device can become so overwhelmed with I/O requests that it will
not respond to the SCSI_INQUIRY within a reasonable amount of time.
Failed device In
this case, the disk has actually failed and will not respond to a
SCSI_INQUIRY request, and when the SCSI_INQUIRY timeout occurs, the disk
and path will be taken offline.
Path failure The disk device may be intact, but a path component—such as a port or a fiber adapter—has failed.
In general, I/O requests can time out because
either the SCSI driver device is unable to respond to a host message
within the allotted time or the path on which a message was sent has
failed. To detect this path failure, HBAs typically enable a timer each
time a message is received from the SCSI driver. A link failure is
thrown if the timer exceeds the link-down timeout without receiving the
I/O acknowledgment. After the link-down event occurs, the Path Manager
determines that the path is dead and evaluates whether to reroute queued
I/O requests to alternate paths.
ASM and I/O Failures
The method that ASM uses to handle I/O
failures depends on the context in which the I/O failure occurred. If
the I/O failure occurs in the database instance, then it notifies ASM,
and ASM decides whether to take the disk offline. ASM takes whatever
action is appropriate based on the redundancy of the disk group and the
number of disks that are already offline.
If the I/O error occurs while ASM is trying to mount a disk group, the behavior depends on the release. In Oracle Database 10g
Release 2, if the instance is not the first to mount the disk group in
the cluster, it will not attempt to take any disks offline that are
online in the disk group mounted by other instances. If none of the
disks can be found, the mount will fail. The rationale here is that if
the disk in question has truly failed, the running instances will very
quickly take the disk offline. If the instance is the “first to mount,”
it will offline the missing disks because it has no other instance to
consult regarding the well-being of the missing disks.
If the error is local and you want to mount
the disk group on the instance that cannot access the disk, you need to
drop the disk from a node that mounted the disk group. Note that a drop
force command will allow the mount immediately. Often in such scenarios,
the disk cannot be found on a particular node because of errors in the
ASM_DISKSTRING or the permissions on the node.
In Oracle Database 11g, these two
behaviors are still valid, but rather than choosing one or the other
based on whether the instance is first to mount the disk group, the
behavior is based on how it was mounted. For example, if the disk group
MOUNT [NOFORCE] command is used, which is the default, this requires
that all online disks in the disk group be found at mount time. If any
disks are missing (or have I/O errors), the mount will fail. A disk
group MOUNT FORCE attempts to take disks offline as necessary, but
allows the mount to complete. Note that to discourage the excessive use
of FORCE, MOUNT FORCE succeeds only if a disk needs to be taken offline.
In 11.2.0.3, MOUNT [NOFORCE] will succeed in
Exadata and Oracle Database Appliance as long as the result leaves more
than one failgroup for normal redundancy or more than two failgroups for
high-redundancy disk groups.
ASM, as well as the database, takes proactive measures to handle I/O failures or data corruptions.
When the database reads a data block from
disk, it validates the checksum, the block number, and some other
fields. If the block fails the consistency checks, then an attempt is
made to reread the block to get a valid block read. A reread is meant to
handle potential transient issues with the I/O subsystem. Oracle can
read individual mirror sides to resolve corruptions. For corrupt blocks
in data files, the database code reads each side of the mirror and looks
for a good copy. If it finds a good copy, the read succeeds and the
good copy is written back to disk to repair the corruption, assuming
that the database is holding the appropriate locks to perform a write.
If the mirroring is done in a storage array (external redundancy), there
is no interface to select mirror sides for reading. In that case, the
RDBMS simply rereads the same block and hopes for the best; however,
with a storage array, this process will most likely return the same data
from the array cache unless the original read was corrupted. If the
RDBMS cannot find good data, an error is signaled. The corrupt block is
kept in buffer cache (if it is a cache-managed block) to avoid repeated
attempts to reread the block and to avoid excessive error reporting.
Note that the handling of corruption is different for each file type and for each piece of code that accesses the file. For example, the handling of data file corruption during an RMAN backup is different from that described in this section, as is the handling of archive log file corruption.
ASM, like most volume managers, does not do
any proactive polling of the hardware looking for faults. Servers
usually have enough I/O activity to make such polling unnecessary.
Moreover, ASM cannot tell whether an I/O error is due to a cable being
pulled or a disk failing. It is up to the operating system (OS) to
decide when to return an error or continue waiting for an I/O
completion. ASM has no control over how the OS handles I/O completions.
The OS signals a permanent I/O error to the caller (the Oracle I/O
process) after several retries in the device driver.
|
NOTE
Starting with Oracle Database 11g, in
the event of a disk failure, ASM polls disk partners and the other
disks in the failure group of the failed disk. This is done to
efficiently detect a pathological problem that may exist in the failure
group.
|
ASM takes disks offline from the disk group
only on a write operation I/O error, not for read operations. For
example, in Oracle Database 10g, if a permanent disk I/O error is
incurred during an Oracle write I/O operation, ASM takes the affected
disk offline and immediately drops it from the disk group, thus
preventing stale data reads. In Oracle Database 11g, if the
DISK_REPAIR_TIMER attribute is enabled, ASM takes the disk offline but
does not drop it. However, ASM does drop the disk if the
DISK_REPAIR_TIMER expires. This feature is covered in the section
“Recovering from Disk Failures: Fast Disk Resync,” earlier in this
chapter.
In Oracle Database 11g, ASM (in ASM
redundancy disk groups) attempts to remap bad blocks if a read fails.
This remapping can lead to a write, which could lead to ASM taking the
disk offline. For read errors, the block is read from the secondary
extents (only for normal or high redundancy). If the loss of a disk
would result in data loss, as in the case where a disk’s partner disk is
also offline, ASM automatically dismounts the disk group to protect the
integrity of the disk group data.
|
NOTE
Read failures from disk header and other unmirrored, physically addressed reads also cause ASM to take the disk offline.
|
In 11g, before taking a disk offline,
ASM checks the disk headers of all remaining disks in that failure group
to proactively check their liveliness. For offline efficiency, if all
remaining disks in that same failure group show signs of failure, ASM
will proactively offline the entire failure group.
In the case of apparent failures of disks in multiple failure groups, ASM dismounts the disk group directly rather than taking some disks offline and then dismounting the disk group. Also, ASM takes disks in a failure group offline all at once to allow for more efficient repartnering.
If the heartbeat cannot be written to a copy
of the Partnership Status Table (PST) in a normal- or high-redundancy
disk group, ASM takes the disk containing the PST copy offline and moves
the PST to another disk in the same disk group. In an external
redundancy disk group, the disk group is dismounted if the heartbeat
write fails. At mount time, the heartbeat is read twice, at least six seconds apart, to determine whether an instance outside the local cluster has mounted the disk group. If the two reads show different contents, the disk group is mounted by an unseen instance.
After the disk group is mounted, ASM will reread the heartbeat every hundredth time it is written. This is done to address two issues: first, to catch any potential race condition that the mount-time check did not catch, and second, to detect whether a disk group was accidentally mounted in two different clusters, with both of them heartbeating against the PST.
In the following example, ASM detects I/O failures as shown from the alert log:
The following warning indicates that ASM detected an I/O error on a particular disk:
This error message alerts the user that trying
to take the disk offline would cause data loss, so ASM is dismounting
the disk group instead:
Messages should also appear in the OS log indicating problems with this same disk (DATA_1_0001).
Many users want to simulate corruption in an
ASM file in order to test failure and recovery. Two types of
failure-injection tests that customers induce are block corruption and
disk failure. Unfortunately, overwriting an ASM disk simulates
corruption, not a disk failure. Note further that overwriting the
disk will corrupt ASM metadata as well as database files. This may not
be the user’s intended fault-injection testing. You must be cognizant of
the redundancy type deployed before deciding on the suite of tests run
in fault-injection testing. In cases where a block or set of blocks is
physically corrupted, ASM (in ASM redundancy disk groups) attempts to
reread all mirror copies of a corrupt block to find one copy that is not
corrupt.
Redundancy and the source of the corruption do matter when recovering a corrupt block. If data is written to disk in an ASM external redundancy disk group through external means, then these writes will go to all copies of the storage array mirror. For example, corruption could occur if the Unix/Linux dd command is inadvertently used to write to an in-use ASM disk.
Space Management Views for ASM Redundancy
Two columns in the V$ASM_DISKGROUP view provide more accurate information on free space usage: USABLE_FILE_MB and REQUIRED_MIRROR_FREE_MB.
In Oracle Database 10g Release 2, the
column USABLE_FILE_MB in V$ASM_DISKGROUP indicates the amount of free
space that can be “safely” utilized taking mirroring into account. The
column provides a more accurate view of usable space in the disk group.
Note that for external redundancy, the column FREE_MB is equal to
USABLE_FREE_SPACE.
Along with USABLE_FILE_MB, the
REQUIRED_MIRROR_FREE_MB column in V$ASM_DISKGROUP indicates more
accurately the amount of space that must remain available in a
given disk group to restore redundancy after one or more disk failures.
The amount of space displayed in this column takes mirroring into
account. The following discussion describes how REQUIRED_MIRROR_FREE_MB is
computed.
REQUIRED_MIRROR_FREE_MB indicates the amount of space
that must be available in a disk group to restore full redundancy after
the worst failure that can be tolerated by the disk group without
adding additional storage, where the worst failure refers to a permanent
disk failure that causes the affected disks to be dropped. The purpose of
this requirement is to ensure that there is sufficient space in the
remaining failure groups to restore redundancy. However, the computed
value depends on the type of ASM redundancy deployed:
For
a normal-redundancy disk group with more than two failure groups, the
value is the total raw space of all the disks in the largest failure
group. The largest failure group is the one with the largest total raw
capacity. For example, if each disk is in its own failure group, the
value would be the size of the largest-capacity disk. Where there are
only two failure groups in a normal-redundancy disk group, the size of
the largest disk in the disk group is used to compute
REQUIRED_MIRROR_FREE_MB.
For
a high-redundancy disk group with more than three failure groups, the
value is the total raw space for all the disks in the two largest
failure groups.
If disks are of different sizes across the
failure groups, this further complicates the REQUIRED_MIRROR_FREE_MB
calculation. Therefore, it is highly recommended that disk groups have
disks of equal size.
Be careful of cases where USABLE_FILE_MB has
negative values in V$ASM_DISKGROUP due to the relationship among
FREE_MB, REQUIRED_MIRROR_FREE_MB, and USABLE_FILE_MB. If USABLE_FILE_MB
is a negative value, you do not have sufficient space to reconstruct the
mirror of all extents in certain disk failure scenarios. For example,
in a normal-redundancy disk group with two failure groups,
USABLE_FILE_MB goes negative if you do not have sufficient space to
tolerate the loss of a single disk. In this situation, you could gain
more usable space, at the expense of losing all redundancy, by
force-dropping the remaining disks in the failure group containing the
failed disk.
A negative USABLE_FILE_MB value also means
that, depending on the value of FREE_MB, you may not be able to create
new files. The next failure may result in files with reduced redundancy
or in an out-of-space condition, which can hang the database.
If USABLE_FILE_MB becomes negative, it is strongly recommended that you
add more space to the disk group as soon as possible.
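To keep an eye on this, the relevant columns can be queried directly from V$ASM_DISKGROUP; the following is a minimal sketch:

-- Compare raw free space with mirror-aware usable space per disk group.
SELECT name, type, total_mb, free_mb,
       required_mirror_free_mb, usable_file_mb
FROM   v$asm_diskgroup;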
Disk Groups and Attributes
Oracle Database 11g introduced the
concept of ASM attributes. Unlike initialization parameters, which are
instance specific but apply to all disk groups, ASM attributes are disk
group specific and apply to all instances.
Attributes Overview
The ASM disk group attributes are shown in
the V$ASM_ATTRIBUTES view. However, this view is not populated until the
disk group compatibility—that is, COMPATIBLE.ASM—is set to 11.1.0. In
Clusterware Database 11g Release 2, the following attributes can be set:
Compatibility
COMPATIBLE.ASM This
attribute determines the minimum software version for an Oracle ASM
instance that can mount the disk group. This setting also affects the
format of the data structures for the ASM metadata on the disk. If the
SQL CREATE DISKGROUP statement, the ASMCMD mkdg command, or Oracle
Enterprise Manager is used to create disk groups, the default value for
the COMPATIBLE.ASM attribute is 10.1.
COMPATIBLE.RDBMS This
attribute determines the minimum COMPATIBLE database initialization
parameter setting for any database instance that is allowed to use
(open) the disk group. Ensure that the values for the COMPATIBLE
initialization parameter for all of the databases that access the disk
group are set to at least the value of the new setting for
COMPATIBLE.RDBMS. As with the COMPATIBLE.ASM attribute, the default
value is 10.1. The COMPATIBLE.ASM will always be greater than or equal
to COMPATIBLE.RDBMS. This topic is covered in more detail later in this
section.
COMPATIBLE.ADVM This
attribute determines whether the disk group can contain ADVM volumes.
The value can only be set to 11.2 or higher. The default value of the
COMPATIBLE.ADVM attribute is empty until set. However, before the
COMPATIBLE.ADVM is advanced, the COMPATIBLE.ASM attribute must already
be set to 11.2 or higher and the ADVM volume drivers must be loaded. The
COMPATIBLE.ASM attribute will always be greater than or equal to
COMPATIBLE.ADVM. Also, there is no relation between COMPATIBLE.ADVM and
COMPATIBLE.RDBMS.
ASM Disk Group Management
DISK_REPAIR_TIME This
attribute defines the delay in the drop disk operation by specifying a
time interval to repair the disk and bring it back online. The time can
be specified in units of minutes (m or M) or hours (h or H). This topic
is covered in the “Recovering from Disk Failures: Fast Disk Resync”
section.
AU_SIZE This attribute defines the disk group allocation unit size.
SECTOR_SIZE This
attribute specifies the default sector size of the disk contained in
the disk group. The SECTOR_SIZE disk group attribute can be set only
during disk group creation, and the possible values are 512, 4096, and
4K. The COMPATIBLE.ASM and COMPATIBLE.RDBMS disk group attributes must
be set to 11.2 or higher to set the sector size to a value other than
the default value.
Exadata Systems
CONTENT.TYPE This attribute was introduced in 11gR2 and is
valid only for Exadata systems. The COMPATIBLE.ASM
attribute must be set to 11.2.0.3 or higher to enable the CONTENT.TYPE
attribute for the disk group. The CONTENT.TYPE attribute identifies the
disk group type and implicitly dictates disk partnering for that disk
group. The value can be DATA, RECOVERY, or SYSTEM. Setting this attribute
determines the distance to the nearest neighbor disk in the failure group
where ASM mirrors copies of the data.
Keep the following points in mind:
The default value is DATA, which specifies a distance of 1 to the nearest neighbor disk.
A value of RECOVERY specifies a distance of 3 to the nearest neighbor disk.
A value of SYSTEM specifies a distance of 5.
STORAGE.TYPE This
attribute identifies the type of disks in the disk group and allows
users to enable Hybrid Columnar Compression (HCC) on that hardware. The
possible values are AXIOM, ZFSSA, and OTHER. The AXIOM and ZFSSA
values correspond to the Oracle Pillar Axiom storage platform and the
Oracle ZFS Storage Appliance, respectively. If the attribute is
set to OTHER, any type of disk can be in the disk group. The
STORAGE.TYPE attribute can only be set when creating a disk group or
when altering a disk group, and it cannot be set when clients are
connected to the disk group.
IDP.TYPE This attribute is related to the Intelligent Data Placement feature and influences data placement on disk.
CELL.SMART_SCAN_CAPABLE When set, this attribute enables Smart Scan capabilities in Exadata.
File Access Control
ACCESS_CONTROL.ENABLED This
attribute, when set, enables the facility for File Access Control. This
attribute can only be set when altering a disk group, with possible
values of TRUE and FALSE.
ACCESS_CONTROL.UMASK This
attribute specifies which permissions are masked on the creation of an
ASM file for the user that owns the file, for users in the same user
group and others not in the user group. The semantics of ASM umask
settings are similar to Unix/Linux umask. This attribute applies to all
files on a disk group, with possible values in the combinations of three
digits: {0|2|6} {0|2|6} {0|2|6}. The default is 066. This attribute can
only be set when altering a disk group.
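As an illustration of how disk group attributes are set, the following is a minimal sketch using a hypothetical disk group named DATA; the values shown are examples, not recommendations:

-- Extend the window during which an offlined disk may be repaired
-- before it is dropped (the default is 3.6 hours).
ALTER DISKGROUP data SET ATTRIBUTE 'disk_repair_time' = '8h';

-- Enable ASM File Access Control and mask write access for group and others.
ALTER DISKGROUP data SET ATTRIBUTE 'access_control.enabled' = 'true';
ALTER DISKGROUP data SET ATTRIBUTE 'access_control.umask' = '026';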
Disk Group Compatibility Attributes
The disk group attributes can be set at disk
group creation or by using the ALTER DISKGROUP command. For example, a
disk group can be created with 10.1 disk group compatibility and then
advanced to 11.2 by setting the COMPATIBLE.ASM attribute to 11.2.
Compatibility attributes are discussed in more detail in the next section.
The following example shows a CREATE DISKGROUP command that results in a disk group with 10.1 compatibility (the default):
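A minimal sketch of such a command, with hypothetical disk paths and failure group names, might look like this:

CREATE DISKGROUP data NORMAL REDUNDANCY
  FAILGROUP fg1 DISK '/dev/oracleasm/disks/DISK1'
  FAILGROUP fg2 DISK '/dev/oracleasm/disks/DISK2';
-- No COMPATIBLE.* attributes are specified, so compatible.asm defaults to 10.1.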
This disk group can then be advanced to 11.2 using the following command:
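A sketch, assuming the disk group is named DATA:

ALTER DISKGROUP data SET ATTRIBUTE 'compatible.asm' = '11.2';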
On successful advancing of the disk group, the following message appears:
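In SQL*Plus, for example, a successful statement is typically acknowledged as:

Diskgroup altered.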
In another example, the AU_SIZE attribute,
which dictates the allocation unit size, and the COMPATIBLE.ASM
attribute are specified at disk group creation. Note that the AU_SIZE
attribute can only be specified at disk group creation and cannot be
altered using the ALTER DISKGROUP command:
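A sketch with hypothetical disk paths, using a 4MB allocation unit purely as an example:

CREATE DISKGROUP data NORMAL REDUNDANCY
  FAILGROUP fg1 DISK '/dev/oracleasm/disks/DISK1'
  FAILGROUP fg2 DISK '/dev/oracleasm/disks/DISK2'
  ATTRIBUTE 'au_size' = '4M', 'compatible.asm' = '11.2';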
The V$ASM_ATTRIBUTE view can be queried to get the DATA disk group attributes:
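A minimal query sketch, assuming the disk group is named DATA:

SELECT a.name, a.value
FROM   v$asm_attribute a, v$asm_diskgroup g
WHERE  a.group_number = g.group_number
AND    g.name = 'DATA'
ORDER  BY a.name;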
In the previous example, the COMPATIBLE.ASM
attribute was advanced; this next example advances the COMPATIBLE.RDBMS
attribute. Notice that the version is set to simply 11.2, which is
equivalent to 11.2.0.0.0.
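A sketch, again assuming the DATA disk group:

ALTER DISKGROUP data SET ATTRIBUTE 'compatible.rdbms' = '11.2';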
Database Compatibility
When a database instance first connects to
an ASM instance, it negotiates the highest Oracle version that can be
supported between the instances. There are two types of compatibility
settings between ASM and the RDBMS: instance-level software
compatibility settings and disk group–specific settings.
Instance-level software compatibility is
defined using the init.ora parameter COMPATIBLE. This COMPATIBLE
parameter, which can be set to 11.2, 11.1, 10.2, or 10.1 at the ASM or
database instance level, defines what software features are available to
the instance. Explicitly setting the COMPATIBLE parameter in the ASM
instance is not allowed, and a lower COMPATIBLE value would not be useful
for ASM anyway, because ASM is compatible with multiple database
versions. Note that the COMPATIBLE.ASM value must be greater than or
equal to that of COMPATIBLE.RDBMS.
The other compatibility settings are specific
to a disk group and control which attributes are available to the ASM
disk group and which are available to the database. This is defined by
the ASM compatibility (COMPATIBLE.ASM) and RDBMS compatibility
(COMPATIBLE.RDBMS) attributes, respectively. These compatibility
attributes are persistently stored in the disk group metadata.
RDBMS Compatibility
RDBMS disk group compatibility is defined by
the COMPATIBLE.RDBMS attribute. This attribute, which defaults to 10.1
in Oracle Database 11g, is the minimum COMPATIBLE version setting
of a database that can mount the disk group. After the disk group
attribute of COMPATIBLE.RDBMS is advanced to 11.2, it cannot be
reversed.
ASM Compatibility
ASM disk group compatibility, as defined by
COMPATIBLE.ASM, controls the persistent format of the on-disk ASM
metadata structures. The ASM compatibility defaults to 10.1 and must
always be greater than or equal to the RDBMS compatibility level. After
the compatibility is advanced to 11.2, it cannot be reset to lower
versions. Any value up to the current software version can be set and
will be enforced. The compatibility attributes have quantized values, so
not all five parts of the version number have to be specified.
COMPATIBLE.RDBMS and COMPATIBLE.ASM together
control the persistent format of the on-disk ASM metadata structures.
The combination of the compatibility parameter setting of the database,
the software version of the database, and the RDBMS compatibility
setting of a disk group determines whether a database instance is
permitted to mount a given disk group. The compatibility setting also
determines which ASM features are available for a disk group.
The following query shows an ASM instance that was recently upgraded from Oracle Database 10g to Oracle Clusterware 11gR2:
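A sketch of such a query against the compatibility columns of V$ASM_DISKGROUP:

SELECT name, compatibility, database_compatibility
FROM   v$asm_diskgroup;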
Notice that the ASM compatibility and RDBMS
compatibility are still at the default (for upgraded instances) of 10.1.
The 10.1 setting is the lowest compatibility value supported by ASM.
|
NOTE
An ASM instance can support different RDBMS
clients with different compatibility settings, as long as the database
COMPATIBLE init.ora parameter setting of each database instance is
greater than or equal to the RDBMS compatibility of all disk groups.
|
See the section “Disk Groups and Attributes,” earlier in this chapter, for examples on advancing the compatibility.
The ASM compatibility of a disk group can be
set to 11.0, whereas its RDBMS compatibility could be 10.1, as in the
following example:
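A sketch with a hypothetical disk path:

CREATE DISKGROUP data EXTERNAL REDUNDANCY
  DISK '/dev/oracleasm/disks/DISK1'
  ATTRIBUTE 'compatible.asm' = '11.0', 'compatible.rdbms' = '10.1';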
This implies that the disk group can be
managed only by ASM software version 11.0 or higher, whereas any
database software version must be 10.1 or higher.
Summary
An ASM disk is the unit of persistent
storage given to a disk group. A disk can be added to or dropped from a
disk group. When a disk is added to a disk group, it is given a disk
name either automatically or by the administrator. This is different
from the OS name that is used to access the disk through the operating
system. In a RAC environment, the same disk may be accessed by different
OS names on different nodes. ASM accesses disks through the standard OS
interfaces used by Oracle to access any file (unless an ASMLIB is
used). Typically, an ASM disk is a partition of a LUN seen by the OS. An
ASM disk can be any device that can be opened through the OS open
system call, except for a local file system file. The LUN could
be a single physical disk spindle, or it could be a virtual LUN managed
by a highly redundant storage array.
A disk group is the fundamental object managed
by ASM. It is composed of multiple ASM disks. Each disk group is
self-describing—that is, all the metadata about the usage of the space
in the disk group is completely contained within the disk group. If ASM
can find all the disks in a disk group, it can provide access to the
disk group without any additional metadata.
A given ASM file is completely contained
within a single disk group. However, a disk group may contain files
belonging to several databases, and a single database may use files from
multiple disk groups. Most installations include only a small number of
disk groups—usually two, and rarely more than three.