Friday 6 July 2018

Database Cloud Storage: ASM and ACFS Design and Deployment

7. ASM Files, Aliases, and Security
When an ASM disk group is created, a hierarchical file system structure is created. This hierarchical layout is very similar to the Unix or Windows file system hierarchy. ASM files, stored within this file system structure, are the objects that RDBMS instances access. They come in the form of data files, control files, spfiles, redo log files, and several other file types. The RDBMS treats ASM-based database files just like standard file system files.
ASM Filenames
When you create a database file (using the create tablespace, add datafile, or add logfile command) or even an archive log file, ASM explicitly creates the ASM file in the disk group specified. The following example illustrates how database files can be created in an ASM disk group:
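A representative command of this kind (the tablespace name and file size are illustrative) is:

SQL> CREATE TABLESPACE ishan DATAFILE '+DATA' SIZE 100M;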
This command creates a data file in the DATA disk group. ASM filenames are generated automatically when the data file is successfully created. Once the file is created, it becomes visible to the user via the standard RDBMS views, such as the V$DATAFILE view. Note that ASM filename syntax differs from typical file naming standards; ASM filenames use the following format:
+diskgroup_name/database_name/database file type/tag_name.file_number.incarnation
For example, the ASM filename of +DATA/yoda/datafile/ishan.259.616618501 for the tablespace named ISHAN can be dissected as follows:
image   +DATA   This is the name of the disk group where this file was created.
image   yoda   This specifies the name of the database that contains this file.
image   datafile   This is the database file type—in this case, datafile. There are over 20 file types in Oracle 11g.
image   ISHAN.259.616618501   This portion of the filename is the suffix of the full filename and is composed of the tag name, file number, and incarnation number. The tag name in the data file name corresponds to the tablespace name; in this example, the tag is the tablespace named ISHAN. For redo log files, the tag name is the group number (for example, group_3.264.54632413). The ASM file number for the ISHAN tablespace is 259; the file number in the ASM instance can be used to correlate filenames in the database instance. The incarnation number is 616618501. The incarnation number, which is derived from a timestamp, is used to provide uniqueness. Note that once the file has been created, the incarnation number does not change. The incarnation number distinguishes a new file from a previously deleted file that used the same file number.
For best practice, every database should implement the Oracle Managed File (OMF) feature to simplify Oracle database file administration. Here are some key benefits of OMF:
image   Simplified Oracle file management   All files are automatically created in a default location with system-generated names, thus a consistent file standard is inherently in place.
image   Space usage optimization   Files are deleted automatically when the tablespaces are dropped.
image   Reduction of Oracle file management errors   OMF minimizes errant file creation and deletion, and also mitigates file corruption due to inadvertent file reuse.
image   Enforcement of Optimal Flexible Architecture (OFA) standards   OMF complies with the OFA standards for filename and file locations.
You can enable OMF by setting the DB_CREATE_FILE_DEST and DB_RECOVERY_FILE_DEST parameters. Note that other *_DEST variables can be used for other file types. When the DB_CREATE_FILE_DEST parameter is set to +DATA, the default file location for tablespace data files becomes +DATA. Moreover, you need not even specify the disk group location in the tablespace creation statement. In fact, when the DB_CREATE_FILE_DEST and DB_RECOVERY_FILE_DEST parameters are set, the create database command can be simplified to the following statement:
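As a sketch (the disk group names, size, and database name are illustrative), the parameters and the simplified statement might look like this:

SQL> ALTER SYSTEM SET db_create_file_dest = '+DATA' SCOPE=BOTH;
SQL> ALTER SYSTEM SET db_recovery_file_dest_size = 100G SCOPE=BOTH;
SQL> ALTER SYSTEM SET db_recovery_file_dest = '+FLASH' SCOPE=BOTH;
SQL> CREATE DATABASE yoda;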
You can use the following command to create a tablespace:
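A minimal form of the statement, relying entirely on OMF (tablespace name from the earlier example), is:

SQL> CREATE TABLESPACE ishan;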
This command simply creates a data file in the ISHAN tablespace under the +DATA disk group using the default data file size of 100MB. However, the file size can be overridden while still leveraging the OMF name, as in the following example:
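For instance (the size shown is arbitrary):

SQL> CREATE TABLESPACE ishan DATAFILE '+DATA' SIZE 500M;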
NOTE
OMF is not enabled for a file when the filename is explicitly specified in “create/alter tablespace add datafile” commands. For example, the following is not considered an OMF file because it specifies an explicit filename and path:
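For example, a statement along these lines (the path and names are hypothetical) creates a non-OMF file:

SQL> CREATE TABLESPACE payroll DATAFILE '+DATA/yoda/oradata/payroll_01.dbf' SIZE 200M;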
However, the following is considered an OMF file:
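For example, a statement that names only the disk group (again, a sketch):

SQL> CREATE TABLESPACE payroll DATAFILE '+DATA' SIZE 200M;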
The following listing shows the relationship between the RDBMS files and the ASM file. Note that the file number from V$ASM_FILE is embedded in the filename. The first query is executed from the ASM instance and the second query is executed from the RDBMS instance:
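The queries might resemble the following sketch (the disk group number is illustrative and the output is omitted); the first is run on the ASM instance and the second on the RDBMS instance:

SQL> SELECT file_number, type, ROUND(bytes/1048576) mb
     FROM v$asm_file
     WHERE group_number = 1;

SQL> SELECT file#, name FROM v$datafile;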
Observe that this database contains ASM files and a non-ASM file named NISHA01.dbf. The NISHA tablespace is stored in a Unix file system called /u01/oradata—that is, it is not an ASM-managed file. Because the NISHA01.dbf file is a Unix file system file rather than an ASM file, the ASM file list from the SQL output does not include it. This illustrates an important point: An Oracle database can have files that reside on file systems, raw devices, and ASM, simultaneously. However, in RAC environments, they must all be on shared storage and accessible by all nodes in the cluster.
ASM Directories
ASM provides the capability to create user-defined directories using the ADD DIRECTORY clause of the ALTER DISKGROUP statement. User-defined directories can be created to support user-defined ASM aliases (discussed later). ASM directory paths must start with a plus sign (+) and a valid disk group name, followed by any user-specified subdirectory names. The only restriction is that the parent directory must exist before you attempt to create a subdirectory or alias in that directory. For example, both of the following are valid ASM directories:
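For example (the directory names are illustrative; the DATA disk group and yoda database are from the earlier examples):

SQL> ALTER DISKGROUP data ADD DIRECTORY '+DATA/yoda/mydir';
SQL> ALTER DISKGROUP data ADD DIRECTORY '+DATA/yoda/mydir/finance';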
However, the following ASM directory cannot be created because the parent directory of data files (oradata) does not exist:
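For example, the following fails if the oradata directory has not been created first:

SQL> ALTER DISKGROUP data ADD DIRECTORY '+DATA/yoda/oradata/datafiles';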
Although system directories such as +DATA/yoda cannot be manipulated, user-defined directories, such as the one successfully created in the previous example, can be renamed or dropped. The following examples illustrate this:
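For example, continuing with the illustrative directories created above:

SQL> ALTER DISKGROUP data RENAME DIRECTORY '+DATA/yoda/mydir/finance' TO '+DATA/yoda/mydir/finance_old';
SQL> ALTER DISKGROUP data DROP DIRECTORY '+DATA/yoda/mydir/finance_old';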
ASM Aliases
The filename notation described thus far (+diskgroup_name/database_name/database file type/tag_name.file_number.incarnation) is called the fully qualified filename notation (FQFN). An ASM alias can be used to make filenaming conventions easier to remember.
Note that whenever a file is created, a system alias is also automatically created for that file. The system aliases are created in a hierarchical directory structure that takes the following syntax:
<db_unique_name>/<file_type>/<alias name>
When the files are removed, the <alias name> is deleted but the hierarchical directory structure remains.
ASM aliases use a hierarchical directory format, similar to a filesystem hierarchy (/u01/oradata/dbname/datafile_name), and are used to reference a system-generated filename such as +DATA/yoda/datafile/system.256.589462555.
Alias names specify a disk group name, but instead of using a file and incarnation number, they take a user-defined string name. Alias ASM filenames are distinguished from fully qualified or numeric names because they do not end in a dotted pair of numbers. Note that there is a limit of one alias per ASM file. The following examples show how to create an ASM alias:
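A sketch of both approaches (directory, alias, and tablespace names are illustrative): an alias can be added for the existing ISHAN data file, or a new file can be created directly with a user-specified alias name:

SQL> ALTER DISKGROUP data ADD DIRECTORY '+DATA/yoda/oradata';
SQL> ALTER DISKGROUP data ADD ALIAS '+DATA/yoda/oradata/ishan_01.dbf'
       FOR '+DATA/yoda/datafile/ishan.259.616618501';
SQL> CREATE TABLESPACE scratch DATAFILE '+DATA/yoda/oradata/scratch_01.dbf' SIZE 200M;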
Note, as stated earlier, that OMF is not enabled when file aliases are explicitly specified in “create/alter tablespace add datafile” commands (as in the previous example).
Aliases are particularly useful when dealing with control files and spfiles—that is, an ASM alias filename is normally used in the CONTROL_FILES and SPFILE initialization parameters. In the following example, the SPFILE and CONTROL_FILES parameters are set to the alias, and the DB_CREATE_FILE_DEST and DB_RECOVERY_FILE_DEST parameters are set to the appropriate OMF destinations:
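A representative set of initialization parameter settings (paths, names, and sizes are illustrative) might look like this:

spfile                     = '+DATA/yoda/spfileyoda.ora'
control_files              = '+DATA/yoda/controlfile/control_01.ctl'
db_create_file_dest        = '+DATA'
db_recovery_file_dest      = '+FLASH'
db_recovery_file_dest_size = 100G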
To show the hierarchical tree of files stored in the disk group, use the following CONNECT BY query to generate the full path. However, a more efficient way to browse the hierarchy is to use the ASMCMD ls command or Enterprise Manager.
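One commonly used form of this query (adapted from the standard V$ASM_ALIAS hierarchy query; run it on the ASM instance) is:

SQL> SELECT CONCAT('+' || gname, SYS_CONNECT_BY_PATH(aname, '/')) full_alias_path
     FROM (SELECT g.name gname, a.parent_index pindex,
                  a.name aname, a.reference_index rindex
           FROM v$asm_alias a, v$asm_diskgroup g
           WHERE a.group_number = g.group_number)
     START WITH (MOD(pindex, POWER(2, 24))) = 0
     CONNECT BY PRIOR rindex = pindex;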
Templates
ASM file templates are named collections of attributes applied to files during file creation. Templates are used to set file-level redundancy (mirror, high, or unprotected) and striping attributes (fine or coarse) of files created in an ASM disk group.
Templates simplify file creation by housing complex file attribute specifications. When a disk group is created, ASM establishes a set of initial system default templates associated with that disk group. These templates contain the default attributes for the various Oracle database file types. When a file is created, the redundancy and striping attributes are set for that file, where the attributes are based on the system template that is the default template for the file type or an explicitly named template.
The following query lists the ASM files, redundancy, and striping size for a sample database.
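A query along these lines (a sketch; output omitted) can be run on the ASM instance:

SQL> SELECT a.name, f.type, f.redundancy, f.striped
     FROM v$asm_file f, v$asm_alias a
     WHERE f.group_number = a.group_number
       AND f.file_number  = a.file_number
       AND a.system_created = 'Y';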
The administrator can change attributes of the default templates if required. However, system default templates cannot be deleted. Additionally, administrators can add their own unique templates, as needed. The following SQL command illustrates how to create user templates (performed on the ASM instance) and then apply them to a new tablespace data file (performed on the RDBMS):
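For example, a user template (the template name and attributes are illustrative) can be created on the ASM instance as follows:

SQL> ALTER DISKGROUP data ADD TEMPLATE noncritical ATTRIBUTES (UNPROTECTED FINE);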
Once a template is created, you can apply it when creating the new tablespace:
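For example (tablespace name and size are illustrative), the template is referenced in parentheses after the disk group name:

SQL> CREATE TABLESPACE reports DATAFILE '+DATA(noncritical)' SIZE 200M;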
Using the ALTER DISKGROUP command, you can modify a template or drop the template using the DROP TEMPLATE clause. The following commands illustrate this:
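For example, continuing with the illustrative template created above:

SQL> ALTER DISKGROUP data ALTER TEMPLATE noncritical ATTRIBUTES (COARSE);
SQL> ALTER DISKGROUP data DROP TEMPLATE noncritical;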
If you need to change an ASM file attribute after the file has been created, the file must be copied into a new file with the new attributes. This is the only method of changing a file’s attributes.
V$ASM_TEMPLATE
Query the V$ASM_TEMPLATE view for information about templates. Here is an example for one of the disk groups:
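For example (the disk group number is illustrative):

SQL> SELECT name, redundancy, stripe, system
     FROM v$asm_template
     WHERE group_number = 1
     ORDER BY name;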
ASM File Access Control
In 11gR2, a new feature called ASM File Access Control was introduced to restrict file access to specific database instances that connect as SYSDBA. ASM File Access Control identifies a database by the OS user ID that owns the database instance's Oracle home.
ASM ACL Overview
ASM uses File Access Control to determine the additional privileges that are given to a database that has been authenticated as SYSDBA on the ASM instance. These additional privileges include the ability to modify and delete certain files, aliases, and user groups. Cloud DBAs can set up “user groups” to specify the list of databases that share the same access permissions to ASM files. User groups are lists of databases, and any database that authenticates as SYSDBA can create a user group.
Just as in Unix/Linux file permissions, each ASM file has three categories of privileges: owner, group, and other. Each category can have read-only permission, read-write permission, or no permission. The file owner is usually the creator of the file and can assign permissions for the file in any of the owner, group, and other categories. The owner can also change the group associated with the file. Note that only the creator of a group can delete it or modify its membership list.
When administering ASM File Access Control, it is recommended that you connect as SYSDBA to the database instance that is the owner of the files in the disk group.
To set up ASM File Access Control for files in a disk group, ensure the COMPATIBLE.ASM and COMPATIBLE.RDBMS disk group attributes are set to 11.2 or higher.
Create a new (or alter an existing) disk group with the following ASM File Access Control disk group attributes: ACCESS_CONTROL.ENABLED and ACCESS_CONTROL.UMASK. Before setting the ACCESS_CONTROL.UMASK disk group attribute, you must set the ACCESS_CONTROL.ENABLED attribute to true to enable ASM File Access Control.
The ACCESS_CONTROL.ENABLED attribute determines whether Oracle ASM File Access Control is enabled for a disk group. The value can be true or false.
The ACCESS_CONTROL.UMASK attribute determines which permissions are masked out on the creation of an ASM file for the user who owns the file, users in the same user group, and others not in the user group. This attribute applies to all files on a disk group. The values can be combinations of three digits: {0|2|6} {0|2|6} {0|2|6}. The default is 066. Setting the attribute to 0 masks out nothing. Setting it to 2 masks out write permission. Setting it to 6 masks out both read and write permissions.
The upcoming example in the next section shows how to enable ASM File Access Control for a disk group with a permissions setting of 026, which enables read-write access for the owner, read access for users in the group, and no access to others not in the group. Optionally, you can create user groups that are groups of database users who share the same access permissions to ASM files. Here are some File Access Control list considerations:
image   For files that exist in a disk group, before setting the ASM File Access Control disk group attributes, you must explicitly set the permissions and ownership on those existing files. Additionally, the files must be closed before setting the ownership or permissions.
image   When you set up File Access Control on an existing disk group, the files previously created remain accessible by everyone unless you explicitly set permissions to restrict access.
image   Ensure that the user exists before setting ownership or permissions on a file.
image   File Access Control, including permission management, can be performed using SQL*Plus, ASMCMD, or Enterprise Manager (using Enterprise Manager or ASMCMD is the easiest method).
ASM ACL Setup Example
To illustrate ASM File Access Control, we start with three OS users, including the Grid Infrastructure owner (grid) and the database software owner (oracle).
In the ASM instance, prepare the disk group for File Access Control:
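A sketch of the attribute settings (assuming the disk group is named DATA and the 026 umask discussed earlier):

SQL> ALTER DISKGROUP data SET ATTRIBUTE 'access_control.enabled' = 'true';
SQL> ALTER DISKGROUP data SET ATTRIBUTE 'access_control.umask'   = '026';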
Next, add two ASM groups:
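For example (the user group names and the second member, oracle2, are hypothetical; member OS users must already be known to the disk group):

SQL> ALTER DISKGROUP data ADD USERGROUP 'asm_grp1' WITH MEMBER 'oracle';
SQL> ALTER DISKGROUP data ADD USERGROUP 'asm_grp2' WITH MEMBER 'oracle2';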
Next, we set File Access Control for the data file '+DATA/yoda/datafile/marlie.283.702218775'.
Ownership cannot be changed for an open file, so we need to take the file offline in the database instance:
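For example, from the database instance:

SQL> ALTER DATABASE DATAFILE '+DATA/yoda/datafile/marlie.283.702218775' OFFLINE;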
We can now set file ownership in the ASM instance:
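A sketch of the ownership change (assuming the file should be owned by the OS user oracle):

SQL> ALTER DISKGROUP data SET OWNERSHIP OWNER = 'oracle'
     FOR FILE '+DATA/yoda/datafile/marlie.283.702218775';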
Default permissions are unchanged:
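For example, the current permissions can be checked with ASMCMD (a sketch; output omitted):

ASMCMD> ls --permission +DATA/yoda/datafile/marlie.283.702218775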
Now set the file permissions in the ASM instance:
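A sketch of the permission change, matching the 026 umask discussed earlier:

SQL> ALTER DISKGROUP data SET PERMISSION OWNER = read write, GROUP = read only, OTHER = none
     FOR FILE '+DATA/yoda/datafile/marlie.283.702218775';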
The next example illustrates that the Grid Infrastructure owner (ASM owner) cannot copy files (in this case, RMAN backups) out of the disk group if they are protected by ASM File Access Control.
First, create an RMAN backup:
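For example (a sketch; the backup is directed into the DATA disk group):

RMAN> BACKUP DATABASE FORMAT '+DATA';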
Now, using the grid user, we'll try to copy those backup pieces to the OS file system (recall that the backup files were created using the oracle user). File Access Control should prevent the copy operation and throw an "ORA-15260: permission denied on ASM disk group" error:
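A sketch of the attempted copy (the backup piece name below is a placeholder, not an actual name):

$ asmcmd
ASMCMD> cp +DATA/yoda/backupset/<backup_piece_name> /home/grid/backup_piece.bkp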
To be able to copy files from the DATA disk group to the file system, either disable access control or add the OS user grid to the appropriate ASM user group.
Check the user and user groups setup in ASM:
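For example (a sketch; output omitted), run on the ASM instance:

SQL> SELECT group_number, user_number, os_name FROM v$asm_user;
SQL> SELECT group_number, usergroup_number, name FROM v$asm_usergroup;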
Summary
Like most file systems, an ASM disk group contains a directory tree. The root directory for the disk group is always the disk group name. Every ASM file has a system-generated filename; the name is generated based on the instance that created it, the Oracle file type, the usage of the file, and the file numbers. The system-generated filename is of the form +disk_group/db_name/file_type/usage_tag.file_number.time_stamp. Directories are created automatically as needed to construct system-generated filenames.
A file can have one user alias and can be placed in any existing directory within the same disk group. The user alias can be used to refer to the file in any file operation where the system-generated filename could be used. When a full pathname is used to create the file, the pathname becomes the user alias. If a file is created by just using the disk group name, then no user alias is created. A user alias may be added to or removed from any file without disturbing the file.
The system-generated name is an OMF name, whereas a user alias is not an OMF name. If the system-generated name is used for a file, the system will automatically create and delete the file as needed. If the file is referred to by its user alias, the user is responsible for creating and deleting the file and any required directories.
CHAPTER 8 ASM Space Allocation and Rebalance
When a database is created under the constructs of ASM, it will be striped (and can be optionally mirrored) as per the Stripe and Mirror Everything (SAME) methodology. SAME is a concept that makes extensive use of striping and mirroring across large sets of disks to achieve high availability and to provide good performance with minimal tuning. ASM incorporates the SAME methodology. Using this method, ASM evenly distributes and balances input/output (I/O) load across all disks within the disk group. ASM solves one of the shortcomings of the original SAME methodology, because ASM maintains balanced data distribution even when storage configurations change.
ASM Space Allocation
This section discusses how ASM allocates space in the disk group and how clients such as the relational database management system (RDBMS) and ASM Cluster File System (ACFS) use the allocated space.
ASM Allocation Units
ASM allocates space in chunks called allocation units (AUs). An AU is the most granular allocation on a per-disk basis—that is, every ASM disk is divided into AUs of the same size. For most deployments of ASM, a 1MB stripe size has proved to be the best stripe depth for Oracle databases and also happens to be the largest I/O request that the RDBMS will currently issue in Oracle Database 11g. In large environments, it is recommended to use a larger AU size to reduce the amount of metadata needed to describe the files in the disk group. This optimal stripe size, coupled with even distribution of extents in the disk group and the buffer cache in the RDBMS, prevents hot spots.
Unlike traditional redundant array of independent disks (RAID) configurations, ASM striping is not done on a round-robin basis, nor is it done at the individual disk level. ASM randomly chooses a disk for allocating the initial extent. This is done to optimize the balance of the disk group. All subsequent AUs are allocated in such a way as to distribute each file equally and evenly across all disks and to fill all disks evenly (see Figure 8-1). Thus, every disk is maintained at the same percentage full, regardless of the size of the disk.
FIGURE 8-1.   ASM extents
For example, if a disk is twice as big as the others, it will contain twice as many extents. This ensures that all disks in a disk group have the same I/O load relative to their capacity. Because ASM balances the load across all the disks in a disk group, it is not a good practice to create multiple disk partitions from different areas of the same physical disk and then allocate the partitions as members of the same disk group. However, it may make sense for multiple partitions on a physical disk to be in different disk groups. ASM is abstracted from the underlying characteristic of the storage array (LUN). For example, if the storage array presents several RAID5 LUNs to ASM as disks, ASM will allocate extents transparently across each of those LUNs.
ASM Extents
When a database file is created in an ASM disk group, it is composed of a set of ASM extents, and these extents are evenly distributed across all disks in the disk group. Each extent consists of an integral number of AUs on an ASM disk. The mapping of extent size to number of AUs changes with the size of the file, as described in the Variable Sized Extents discussion later in this chapter.
The following two queries display the extent distribution for a disk group (the FAST disk group) that contains four disks. The first query shows the evenness based on megabytes per disk, and the second query lists the total extents for each disk in the FAST disk group (group_number 2) using the X$KFFXP base table:
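A sketch of the two queries (run as SYS on the ASM instance; X$KFFXP is an internal, undocumented structure, so treat this as illustrative):

SQL> SELECT name, total_mb, free_mb
     FROM v$asm_disk
     WHERE group_number = 2;

SQL> SELECT disk_kffxp disk, COUNT(pxn_kffxp) extents
     FROM x$kffxp
     WHERE group_kffxp = 2
     GROUP BY disk_kffxp
     ORDER BY disk_kffxp;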
A similar query run against group number 3 shows the even distribution of ASM extents for the System tablespace across all the disks in the DATA disk group. This tablespace contains a single 100MB data file called +DATA/yoda/datafile/system.256.589462555.
ASM Striping
There are two types of ASM file striping: coarse and fine-grained. For coarse distribution, each coarse-grained file extent is mapped to a single allocation unit.
With fine-grained distribution, the file is interleaved in 128K stripes across groups of eight AUs. Since each AU is guaranteed to be on a different ASM disk, each stripe ends up on a different physical disk. Fine-grained striping is also used for very small files (such as control files) to ensure that they are distributed across disks. Fine-grained striping is generally not good for sequential I/O (such as full table scans) once the sequential I/O exceeds one AU. As of Oracle 11gR2, only control files are fine-striped by default when the disk group is created; users can change the template for a given file type to change the defaults.
As discussed previously, each file stored in ASM requires metadata structures to describe the file extent locations. As a file grows, the metadata associated with that file also grows, as does the memory used to store the file extent locations. Oracle 11g introduces a new feature called Variable Sized Extents to minimize this metadata overhead. The main objective of this feature is to enable larger file extents to reduce metadata requirements as a file grows; as a byproduct, it allows for larger file size support (file sizes up to 140PB [a petabyte is 1,024TB]). For example, if a data file is initially as small as 1GB, the file extent size used will be 1 AU. As the file grows, several size thresholds are crossed and larger extent sizes are employed at each threshold, with the maximum extent size capped at 16 AUs. There are two thresholds: at 20,000 extents (20GB with 1MB AUs) the extent size increases to 4 AUs, and at 40,000 extents (100GB with 1MB AUs—20GB of 1-AU extents plus 80GB of 4-AU extents) it increases again to 16 AUs. Valid extent sizes are therefore 1, 4, and 16 AUs (which translate to 1MB, 4MB, and 16MB with 1MB AUs, respectively). When a file grows into multiple-AU extents, it is still striped at 1 AU to maintain the coarse-grained striping granularity of the file. The database administrator (DBA) or ASM administrator need not manage variable extents; ASM handles this automatically. This feature is very similar in behavior to the Automatic Extent Allocation that the RDBMS uses.
NOTE
The RDBMS layers of the code effectively limit file size to 128TB. The ASM structures can address 140PB.
The following example demonstrates the use of Variable Sized Extents. In this example, the SYSAUX tablespace contains a data file that is approximately 32GB, which exceeds the first threshold of 20,000 extents (20GB):
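The data file size can be confirmed from the database instance with a query such as:

SQL> SELECT file_name, ROUND(bytes/1024/1024/1024, 1) gb
     FROM dba_data_files
     WHERE tablespace_name = 'SYSAUX';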
Now if X$KFFXP is queried to find the ASM file that has a nondefault extent size, it should indicate that it is file number 263:
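A query along the following lines can be used (X$KFFXP and its SIZE_KFFXP column are internal and undocumented, so this is only a sketch):

SQL> SELECT number_kffxp file_number, size_kffxp extent_size_aus, COUNT(*) extents
     FROM x$kffxp
     WHERE group_kffxp = 3
       AND size_kffxp > 1
     GROUP BY number_kffxp, size_kffxp;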
The Variable Sized Extents feature is available only for disk groups whose RDBMS and ASM compatibility attributes are set to Oracle 11g (11.1) or higher. For disk groups created with Oracle Database 10g, the compatibility attributes must be advanced to 11.1.0. Variable extents take effect for newly created files and are not retroactively applied to files that were created with 10g software.
Setting Larger AU Sizes for VLDBs
For very large databases (VLDBs)—for example, databases that are 10TB and larger—it may be beneficial to change the default AU size (for example, to a 4MB AU). The following are benefits of changing the default size for VLDBs:
image   Reduced SGA size to manage the extent maps in the RDBMS instance
image   Increased file size limits
image   Reduced database open time, because VLDBs usually have many big data files
Increasing the AU size improves the time to open large databases and also reduces the amount of shared pool consumed by the extent maps. With 1MB AUs and fixed-size extents, the extent map for a 10TB database is about 90MB, which has to be read at open and then kept in memory. With 16MB AUs, this is reduced to about 5.5MB. In Oracle Database 10g, the entire extent map for a file is read from disk at file-open time.
Oracle Database 11g significantly minimizes the file-open latency issue by reading extent maps on demand for certain file types. In Oracle 10g, for every file open, the complete extent map is built and sent to the RDBMS instance from the ASM instance. For large files, this unnecessarily lengthens file-open time. In Oracle 11g, only the first 60 extents in the extent map are sent at file-open time. The rest are sent in batches as required by the database instance.
Setting Larger AU Sizes in Oracle Database 11g
For Oracle Database 11g ASM systems, the following CREATE DISKGROUP command can be executed to set the appropriate AU size:
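A sketch of such a command (the disk paths are hypothetical; a 4MB AU is used as discussed above):

SQL> CREATE DISKGROUP data EXTERNAL REDUNDANCY
     DISK '/dev/mapper/vol1', '/dev/mapper/vol2'
     ATTRIBUTE 'au_size'          = '4M',
               'compatible.asm'   = '11.2',
               'compatible.rdbms' = '11.2';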
The AU attribute can be used only at the time of disk group creation; furthermore, the AU size of an existing disk group cannot be changed after the disk group has been created.
ASM Rebalance
With traditional volume managers, expanding or shrinking striped file systems has typically been difficult. With ASM, these disk changes are now seamless operations involving redistribution (rebalancing) of the striped data. Additionally, these operations can be performed online.
Any change in the storage configuration—adding, dropping, or resizing a disk—triggers a rebalance operation. ASM does not dynamically move around “hot areas” or “hot extents.” Because ASM distributes extents evenly across all disks and the database buffer cache prevents small chunks of data from being hot areas on disk, it completely obviates the notion of hot disks or extents.
Rebalance Operation
The main objective of the rebalance operation is always to provide an even distribution of file extents and space usage across all disks in the disk group. The rebalance is done on a per-file basis to ensure that each file is evenly balanced across all disks. Upon completion of distributing the files evenly among all the disks in a disk group, ASM starts compacting the disks to ensure there is no fragmentation in the disk group. Fragmentation is possible only in disk groups where one or more files use variable extents. This is critical to ASM’s assurance of balanced I/O load. The ASM background process, RBAL, manages this rebalance. The RBAL process examines each file extent map, and the extents are redistributed on the new storage configuration. For example, consider an eight-disk external redundancy disk group, with a data file with 40 extents (each disk will house five extents). When two new drives of same size are added, that data file is rebalanced and distributed across 10 drives, with each drive containing four extents. Only eight extents need to move to complete the rebalance—that is, a complete redistribution of extents is not necessary because only the minimum number of extents is moved to reach equal distribution.
During the compaction phase of the rebalance, each disk is examined and data is moved to the head of the disk to eliminate any holes. As of Oracle 11g, the rebalance estimates reported in V$ASM_OPERATION do not factor in the work needed to complete the compaction of the disk group.
NOTE
A weighting factor, influenced by disk size and file size, affects rebalancing. A larger drive will consume more extents. This factor is used to achieve even distribution based on overall size.
The following is a typical process flow for ASM rebalancing:
1.   On the ASM instance, a DBA adds (or drops) a disk to (or from) a disk group.
2.   This invokes the RBAL process to create the rebalance plan and then begin coordination of the redistribution.
3.   RBAL calculates the work required to perform the task and then messages the ASM Rebalance (ARBx) processes to handle the request. In Oracle Release 11.2.0.2 and earlier, the number of ARBx processes invoked is directly determined by the init.ora parameter ASM_POWER_LIMIT or the power level specified in an add, drop, or rebalance command. Post–Oracle 11.2.0.2, there is always just one ARB0 process performing the rebalance operation. The ASM_POWER_LIMIT or the power level specified in the SQL command translates to the number of extents relocated in parallel.
4.   The Continuing Operations Directory (COD) is updated to reflect a rebalance activity. The COD is important when an in-flight rebalance fails. Recovering instances will see an outstanding COD entry for the rebalance and restart it.
5.   RBAL distributes plans to the ARBs. In general, RBAL generates a plan per file; however, larger files can be split among ARBs.
6.   ARBx performs a rebalance on these extents. Each extent is locked, relocated, and unlocked. Reads can proceed while an extent is being relocated. New writes to the locked extent are blocked, and in-flight writes must be reissued to the new location after the relocation is complete. This step is shown as operation REBAL in V$ASM_OPERATION. The rebalance algorithm is detailed in Chapter 9.
The ASM alert log records each rebalance operation, whether it is triggered by a drop disk or an add disk command; it shows the start of the rebalance, the power level used, and the completion of each rebalance phase.
In addition, an ARB trace file is created for each ARB process involved in the rebalance operation. This trace file can be found in the trace subdirectory under the DIAG directory; it records the extent relocations performed by that process, and the entries are repeated for each file assigned to the ARB process.
Resizing a Physical Disk or LUN and the ASM Disk Group
When you’re increasing the size of a disk group, it is a best practice to add disks of similar size. However, in some cases it is appropriate to resize disks rather than to add storage of equal size. For these cases, you should resize all the disks in the disk group (to the same size) at the same time. This section discusses how to expand or resize the logical unit number (LUN) as an ASM disk.
Disks in the storage array are usually configured as LUNs and presented to the host. When a LUN runs out of space, you can expand it within the storage array by adding new disks in the back end. However, the operating system (OS) must then recognize the new space. Some operating systems require a reboot to recognize the new LUN size. On Linux systems that use Emulex drivers, for example, the following can be used:
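The exact command depends on the driver and kernel version; one common approach on Linux (not necessarily the command from the original listing) is to force a SCSI bus rescan:

# N is the SCSI host number assigned to the HBA port
echo "- - -" > /sys/class/scsi_host/hostN/scan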
Here, N is the SCSI port ordinal assigned to this HBA port (see the /proc/scsi/lpfc directory and look for the “port_number” files).
The first step in increasing the size of an ASM disk is to add extra storage capacity to the LUN. To use more space, the partition must be re-created. This operation is at the partition table level, and that table is stored in the first sectors of the disk. Changing the partition table does not affect the data as long as the starting offset of the partition is not changed.
The V$ASM_DISK view includes a column called OS_MB that gives the actual OS size of the disk. This column can aid in appropriately resizing the disk and preventing attempts to resize disks that cannot be resized.
The general steps to resize an ASM disk are as follows:
1.   Resize the LUN from storage array. This is usually a noninvasive operation.
2.   Query V$ASM_DISK for the OS_MB value of the disk to be resized. If the OS or ASM does not see the new size, review the steps from the host bus adapter (HBA) vendor to probe for new devices. In some cases, this may require a reboot.
3.   Once the new size is visible, resize the ASM disk with ALTER DISKGROUP ... RESIZE DISK (or RESIZE ALL), which triggers a rebalance so that ASM can use the added space.
Rebalance Power Management
Rebalancing involves physical movement of file extents. Its impact is usually low because the rebalance is done a few extents at a time, so there’s little outstanding I/O at any given time per ARB process. This should not adversely affect online database activity. However, it is generally advisable to schedule the rebalance operation during off-peak hours.
The init.ora parameter ASM_POWER_LIMIT is used to influence the throughput and speed of the rebalance operation. For Oracle 11.2.0.1 and below, the range of values for ASM_POWER_LIMIT is 0–11, where a value of 11 is full throttle and a value of 1 (the default) is low speed. A value of 0, which turns off automatic rebalance, should be used with caution. In a Real Application Clusters (RAC) environment, the ASM_POWER_LIMIT is specific to each ASM instance. (A common question is why the maximum power limit is 11 rather than 10. Movie lovers might recall the amplifier discussion from This is Spinal Tap.)
For releases 11.2.0.2 and above, ASM_POWER_LIMIT can be set as high as 1024, which results in ASM relocating that many extents in parallel. Increasing the rebalance power also increases the amount of PGA memory needed during relocation. In the event that the current memory settings of the ASM instance prevent allocation of the required memory, ASM dials down the power and continues with the available memory.
The power value can also be set for a specific rebalance activity using the ALTER DISKGROUP command. This value is effective only for the specific rebalance task. The following example demonstrates this:
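For example (the disk group name is illustrative):

SQL> ALTER DISKGROUP data REBALANCE POWER 11;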
Here is an example from another session:
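For instance, the progress of the rebalance can be monitored from the other session with a query such as:

SQL> SELECT group_number, operation, state, power, sofar, est_work, est_minutes
     FROM v$asm_operation;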
Each rebalance step has various associated states. The following are the valid states:
image   WAIT   This indicates that currently no operations are running for the group.
image   RUN   An operation is running for the group.
image   HALT   An administrator is halting an operation.
image   ERRORS   An operation has been halted by errors.
A power value of 0 indicates that no rebalance should occur for this operation. This setting is particularly important when you're adding or removing storage (that has external redundancy) and then deferring the rebalance to a later scheduled time. However, a power level of 0 should be used with caution; this is especially true if the disk group is low on available space, because allocations in an out-of-balance disk group may fail with an ORA-15041 (diskgroup space exhausted) error.
The power level is adjusted with the ALTER DISKGROUP REBALANCE command, which affects only the current rebalance for the specified disk group; future rebalances are not affected. If you increase the power level of the existing rebalance, it will spawn new ARB processes. If you decrease the power level, the running ARB process will finish its extent relocation and then quiesce and die off.
If you are removing or adding several disks, add or remove disks in a single ALTER DISKGROUP statement; this reduces the number of rebalance operations needed for storage changes. This behavior is more critical where normal- and high-redundancy disk groups have been configured because of disk repartnering. Executing a single disk group reconfiguration command allows ASM to figure out the ideal disk partnering and reduce excessive data movement. The following example demonstrates this storage change:
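A sketch of a combined add/drop statement (the disk names, paths, and power level are hypothetical):

SQL> ALTER DISKGROUP data
     ADD  DISK '/dev/mapper/newdisk1', '/dev/mapper/newdisk2'
     DROP DISK data_0007, data_0008
     REBALANCE POWER 8;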
An ASM disk group rebalance is an asynchronous operation in that the control is returned immediately to the DBA after the operation executes in the background. The status of the ongoing operation can be queried from V$ASM_OPERATION. However, in some situations the disk group operation needs to be synchronous—that is, it must wait until rebalance is completed. The ASM ALTER DISKGROUP commands that result in a rebalance offer a WAIT option. This option allows for accurate (sequential) scripting that may rely on the space change from a rebalance completing before any subsequent action is taken. For instance, if you add 100GB of storage to a completely full disk group, you will not be able to use all 100GB of storage until the rebalance completes. The WAIT option ensures that the space addition is successful and is available for space allocations. If a new rebalance command is entered while one is already in progress in WAIT mode, the command will not return until the disk group is in a balanced state or the rebalance operation encounters an error.
The following SQL script demonstrates how the WAIT option can be used in SQL scripting. The script adds a new disk, /dev/sdc6, and waits until the add and rebalance operations complete, returning the control back to the script. The subsequent step adds a large tablespace.
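A sketch of such a script (the disk group, power level, and tablespace details are illustrative; /dev/sdc6 is the disk named in the text):

SQL> ALTER DISKGROUP data ADD DISK '/dev/sdc6' REBALANCE POWER 8 WAIT;
SQL> CREATE BIGFILE TABLESPACE bigdata DATAFILE '+DATA' SIZE 800G;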
If dropping a disk would result in a hung rebalance operation due to a lack of free space, ASM rejects the drop command at the time it is executed.
Fast Rebalance
When a storage change initiates a disk group rebalance, typically all active ASM instances of an ASM cluster and their RDBMS clients are notified and become engaged in the synchronization of the extents that are being rearranged. This messaging between instances can be “chatty” and thus can increase the overall time to complete the rebalance operation.
In certain situations where the user does not need the disk group to be "user accessible" and needs rebalancing to complete as soon as possible, it is beneficial to perform the rebalance operation without the extra overhead of the ASM-to-ASM and ASM-to-RDBMS messaging. The Fast Rebalance feature eliminates this overhead by allowing a single ASM instance to rebalance the disk group without the messaging overhead. The primary goal of Fast Rebalance is to improve the overall performance of the rebalance operation. Additionally, the rebalance operation can be invoked at the maximum power level (11, or 1024 for 11.2.0.2 and above) to provide the highest throughput, making the rebalance operation limited only by the I/O subsystem (to the degree you can saturate the I/O subsystem with maximum synchronous 1MB I/Os).
To eliminate messaging to other ASM instances, the ASM instance that performs the rebalance operation requires exclusive access to the disk group. To provide this exclusive disk group access, a new disk group mount mode, called RESTRICTED, was introduced in Oracle Database 11g. A disk group can be placed in RESTRICTED mode using STARTUP RESTRICT or ALTER DISKGROUP MOUNT RESTRICT.
When a disk group is mounted in RESTRICTED mode, RDBMS instances are prevented from accessing that disk group and thus databases cannot be opened. Furthermore, only one ASM instance in a cluster can mount a disk group in RESTRICTED mode. When the instance is started in RESTRICTED mode, all disk group mounts in that instance will automatically be in RESTRICTED mode. The ASM instance needs to be restarted in NORMAL mode to get it out of RESTRICTED mode.
At the end of the rebalance operation, the user must explicitly dismount the disk group and remount it in NORMAL mode to make the disk group available to RDBMS instances.
Effects of Imbalanced Disks
This section illustrates how ASM distributes extents evenly and creates a balanced disk group. Additionally, the misconceptions of disk group space balance management are covered.
The term “balanced” in the ASM world is slightly overloaded. A disk group can become imbalanced for several reasons:
image   Dissimilarly sized disks are used in a given disk group.
image   A rebalance operation was aborted.
image   A rebalance operation was halted. In this case, this state can be determined by the UNBALANCED column of the V$ASM_DISKGROUP view. Operationally, the DBA can resolve this problem by manually performing a rebalance against the specific disk group.
NOTE
The UNBALANCED column in V$ASM_DISKGROUP indicates that a rebalance is in flux—that is, either in progress or stopped. This column is not an indicator for an unbalanced disk group.
image   A disk was added to the disk group with an ASM_POWER_LIMIT or power level of 0, but the disk group was never rebalanced afterward.
This section focuses on the first reason: a disk group being imbalanced due to differently sized disks. For the other reasons, allowing the rebalance to complete will fix the imbalance automatically.
The main goal of ASM is to provide an even distribution of data extents across all disk members of a disk group. When an RDBMS instance requests a file creation, ASM allocates extents from all the disks in the specified disk group. The first disk allocation is chosen randomly, but all subsequent disks for extent allocation are chosen to evenly spread each file across all disks and to evenly fill all disks. Therefore, if all disks are equally sized, all disks should have the same number of extents and thus an even I/O load.
But what happens when a disk group contains unequally sized disks—for example, a set of 25GB disks mixed with a couple of 50GB disks? When allocating extents, ASM will place twice as many extents on each of the bigger 50GB disks as on the smaller 25GB disks. Thus, the 50GB disks will contain more data extents than their 25GB counterparts. This allocation scheme causes dissimilarly sized disks to fill at the same proportion, but will also induce unbalanced I/O across the disk group because the disk with more extents will receive more I/O requests.
The following example illustrates this scenario:
1.   Note that the FAST disk group initially contains two disks that are equally sized (8.6GB).
2.   Display the extent distribution on the current disk group layout using the extent distribution query shown earlier in this chapter; the even extent distribution is shown by the COUNT(PXN_KFFXP) column.
3.   Add two more 8.6GB disks to the disk group.
4.   Rerun the extent distribution query to display the distribution after the two disks were added; the extents are again spread evenly, now across four disks.
5.   Note that a 1GB disk was accidentally added.
6.   Display the space usage from V$ASM_DISK and notice the much smaller size of disk FAST_0004.
7.   The extent distribution query is rerun to display the effects of this mistake; the extent distribution is now uneven.
MYTH
Adding and dropping a disk in the same disk group requires two separate rebalance activities. In fact, some disks can be dropped and others added to a disk group with a single rebalance command. This is more efficient than separate commands.
ASM and Storage Array Migration
One of the core benefits of ASM is the ability to rebalance extents not only within disk enclosure frames (storage arrays) but also across frames. Customers have used this ability extensively when migrating between storage arrays (for example, from an EMC VNX to a VMAX storage system) or between storage vendors (for example, from EMC arrays to Hitachi Data Systems [HDS] arrays).
The following example illustrates the simplicity of this migration. In this example, the DATA disk group will migrate from an EMC VNX storage enclosure to the EMC VMAX enclosure. A requirement for this type of storage migration is that both storage enclosures must be attached to the host during the migration and must be discovered by ASM. Once the rebalance is completed and all the data is moved from the old frame, the old frame can be “unzoned” and “uncabled” from the host(s).
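A sketch of the migration command (the disk names follow the description in the text; the power level is illustrative):

SQL> ALTER DISKGROUP data
     ADD  DISK '/dev/rdsk/c7t19*'
     DROP DISK data_0001, data_0002, data_0003, data_0004
     REBALANCE POWER 8;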
This command indicates that the disks DATA_0001 through DATA_0004 (from the current EMC VNX disks) are to be dropped and that the new VMAX disks, specified by /dev/rdsk/c7t19*, are to be added. The ADD and DROP commands can all be done in one rebalance operation. Additionally, the RDBMS instance can stay online while the rebalance is in progress. Note that migration to new storage is an exception to the general rule against mixing different size/performance disks in the same disk group. This mixture of disparate disks is transient; the configuration at the end of the rebalance will have disks of similar size and performance characteristics.
ASM and OS Migration
Customers often ask whether they can move between systems with the same endianness but different operating systems while keeping the storage array the same. For example, suppose that a customer wants to migrate from Solaris to AIX, with the database on ASM over an EMC Clariion storage area network (SAN). The storage will be kept intact and physically reattached to the AIX server. Customers ask whether this migration is viable and/or supported.
Although ASM data structures are compatible with most OSs (except for endianness) and should not have a problem, other factors preclude this from working. In particular, the OS LUN partition has its own OS partition table format. It is unlikely that this partition table can be moved between different OSs.
Additionally, the database files themselves may have other issues, such as platform-specific structures and formats, and thus the database data will need to be converted to the target platform’s format. Some viable options include the following:
image   Data pump full export/import
image   Cross-platform transportable tablespaces (XTTSs)
image   Streams
Important Points on ASM Rebalance
The following are some important points on rebalance and extent distribution:
image   It is very important that similarly sized disks be used in a single disk group, and that failure groups are also of similar sizes. The use of dissimilar disks will cause uneven extent distribution and I/O load. If one disk lacks free space, it is impossible to do any allocation in a disk group because every file must be evenly allocated across all disks. Rebalancing and allocation should make the percentage of allocated space about the same on every disk.
image   Rebalance runs automatically only when there is a disk group configuration change. Many users have the misconception that ASM periodically wakes up to perform rebalance. This simply is not true.
image   If you are using similarly sized disks and you still see disk group imbalance, either a previous rebalance operation failed to complete (or was cancelled) or the administrator set the rebalance power to 0 via a rebalance command. A manual rebalance should fix these cases.
image   If a server goes down while you’re executing a rebalance, the rebalance will be automatically restarted after ASM instance/crash recovery. A persistently stored record indicates the need for a rebalance. The node that does the recovery sees the record indicating that a rebalance is in progress and that the rebalance was running on the instance that died. It will then start a new rebalance. The recovering node may be different from the node that initiated the rebalance.
image   Many factors determine the speed of rebalance. Most importantly, it depends on the underlying I/O subsystem. To calculate the lower bound of the time required for the rebalance to complete, determine the following:
1.   Calculate the amount of data that has to be moved. ASM relocates data proportional to the amount of space being added. If you are doubling the size of the disk group, then 50 percent of the data will be moved; if you are adding 10 percent more storage, then at least 10 percent of the data will be moved; and so on.
2.   Determine how long it will take the I/O subsystem to perform that amount of data movement. As described previously, ASM does relocation I/Os as a synchronous 1 AU read followed by a synchronous 1 AU write. Up to ASM_POWER_LIMIT I/Os can operate in parallel depending on the rebalance power. This calculation is a lower bound, because ASM has additional synchronization overhead.
3.   The impact of rebalance should be low because ASM relocates and locks extents one at a time. ASM relocates multiple extents simultaneously only if rebalance is running with higher power. Only the I/Os against the extents being relocated are blocked; ASM does not block I/Os for all files in the ASM disk group. I/Os are not actually blocked per se; reads can proceed from the old location during relocation, whereas some writes need to be temporarily stalled or may need to be reissued if they were in process during the relocation I/O. The writes to these extents can be completed after the extent is moved to its new location. All this activity is transparent to the application. Note that the chance of an I/O being issued during the time that an extent is locked is very small. In the case of the Flash disk group, which contains archive logs or backups, many of the files being relocated will not even be open at the time, so the impact is very minimal.
image   When a rebalance is started for newly added disks, ASM immediately begins using the free space on them; however, ASM continues to allocate files evenly across all disks. If a disk group is almost full, and a large disk (or set of disks) is then added, the RDBMS could get out-of-space (ORA-15041) errors even though there is seemingly sufficient space in the disk group. With the WAIT option to the ADD DISK command, control does not return to the user until rebalance is complete. This may provide more intuitive behavior to customers who run near capacity.
image   If disks are added very frequently, the same data is relocated many times, causing excessive data movement. It is a best practice to add and drop multiple disks at a time so that ASM can reorganize partnership information within ASM metadata more efficiently. For normal- and high-redundancy disk groups, it is very important to batch the operations for adding and dropping disks rather than doing them in rapid succession. The latter option generates much more overhead because mirroring and failure groups place greater constraints on where data can be placed. In extreme cases, nesting many adds and drops without allowing the intervening rebalance to run to completion can lead to the error ORA-15074.
Summary
Every ASM disk is divided into fixed-size allocation units. The AU is the fundamental unit of allocation within a disk group, and the usable space in an ASM disk is a multiple of this size. The AU size is a disk group attribute specified at disk group creation and defaults to 1MB, but may be set as high as 64MB. An AU should be large enough that accessing it in one I/O operation provides very good throughput—that is, the time to access an entire AU in one I/O should be dominated by the transfer rate of the disk rather than the time to seek to the beginning of the AU.
ASM spreads the extents of a file evenly across all disks in a disk group. Each extent comprises an integral number of AUs. Most files use coarse striping. With coarse striping, in each set of extents, the file is striped across the set at 1 AU granularity. Thus, each stripe of data in a file is on a different disk than the previous stripe of the file. A file may have fine-grained striping rather than coarse-grained. The difference is that the fine-grained striping granularity is 128K rather than 1 AU.
Rebalancing a disk group moves file extents between disks in the disk group to ensure that every file is evenly distributed across all the disks in the disk group. When all files are evenly distributed, all disks are evenly filled to the same percentage. This ensures load balancing. Rebalancing does not relocate data based on I/O statistics, nor is it started as a result of statistics. ASM automatically invokes rebalance only when a storage configuration change is made to an ASM disk group.
CHAPTER 9 ASM Operations
This chapter describes the flow of the critical operations for ASM disk groups and files. It also describes the key interactions between the ASM and relational database management system (RDBMS) instances.
Note that many of these ASM operations have been optimized in Engineered Systems (Exadata and ODA) and thus have significantly different behavior. Chapter 12 covers these optimizations. This chapter describes ASM operations in non–Engineered Systems.
ASM Instance Discovery
The first time an RDBMS instance tries to access an ASM file, it needs to establish its connection to the local ASM instance. Rather than requiring a static configuration file to locate the ASM instance, the RDBMS contacts the Cluster Synchronization Services (CSS) daemon where the ASM instance has registered. CSS provides the necessary connect string for the RDBMS to spawn a Bequeath connection to the ASM instance. The RDBMS authenticates itself to the ASM instance via operating system (OS) authentication by connecting as SYSDBA. This initial connection between the ASM instance and the RDBMS instance is known as the umbilicus, and it remains active as long as the RDBMS instance has any ASM files open. The RDBMS side of this connection is the ASMB process. (See the “File Open” section for an explanation of why ASMB can appear in an ASM instance.) The ASM side of the connection is a foreground process called the umbilicus foreground (UFG). RDBMS and ASM instances exchange critical messages over the umbilicus. Failure of the umbilicus is fatal to the RDBMS instance because the connection is critical to maintaining the integrity of the disk group. Some of the umbilicus messages are described later in the “Relocation” section of this chapter.
RDBMS Operations on ASM Files
This section describes the interaction between the RDBMS and ASM instances for the following operations on ASM files:
image   File Create
image   File Open
image   File I/O
image   File Close
image   File Delete
File Create
ASM filenames in the RDBMS are distinguished by the fact that they start with a plus sign (+). ASM file creation consists of three phases:
image   File allocation in the ASM instance
image   File initialization in the RDBMS instance
image   File creation committed in the RDBMS and ASM instance
When an RDBMS instance wants to create an ASM file, it sends a file-creation request to the ASM instance. The RDBMS instance has a pool of processes (the o0nn processes) that hold connections to foreground processes in the ASM instance, called network foregrounds (NFGs); these connections are used for tasks such as file creation. The file-creation request sent over the appropriate connection includes the following information:
image   Disk group name
image   File type
image   File block size
image   File size
image   File tag
The request may optionally include the following additional information:
image   Template name
image   Alias name
ASM uses this information to allocate the file (the topic of file allocation is described in greater detail in the “ASM File Allocation” section). ASM determines the appropriate striping and redundancy for the file based on the template. If the template is not explicitly specified in the request, the default template is used based on the file type. ASM uses the file type and tag information to create the system-generated filename. The system-generated filename is formed as follows:
+diskgroup_name/database_name/file_type/tag_name.file_number.incarnation
After allocating the file, ASM sends extent map information to the RDBMS instance. ASM creates a Continuing Operations Directory (COD) entry to track the pending file creation. The RDBMS instance subsequently issues the appropriate I/O to initialize the file. When initialization is complete, the RDBMS instance messages ASM to commit the creation of the file.
When ASM receives the commit message, ASM’s LGWR flushes the Active Change Directory (ACD) change record with the file-creation information. ASM’s DBWR subsequently asynchronously writes the appropriate allocation table, file directory, and alias directory entries to disk. Thus, the high-level tasks for DBWR and LGWR are similar in the ASM instance as they are in the RDBMS instance.
If the RDBMS instance explicitly or implicitly aborts the file creation without committing the creation, ASM uses the COD to roll back the file creation. Rollback marks the allocation table entries as free, releases the file directory entry, and removes the appropriate alias directory entries. Note that rollbacks in ASM do not use the same infrastructure (undo segments) or semantics that the RDBMS instance uses.
File Open
When an RDBMS instance needs to open an ASM file, it sends to the ASM instance a File Open request, with the filename, via one of the o0nn processes. ASM consults the file directory to get the extent map for the file. ASM sends the extent map to the RDBMS instance. The extent maps of the files are sent in batches to the database instance. ASM sends the first 60 extents of the extent map to the RDBMS instance at file-open time; the remaining extents are paged in on demand by the database instance. This delayed shipping of extent maps greatly improves the time it takes to open the database.
Opening the spfile at RDBMS instance startup is a special code path. This open operation cannot follow the typical open path, because the RDBMS instance System Global Area (SGA) does not yet exist to hold the extent map. The SGA sizing information is contained in the spfile. In this specific case, the RDBMS does proxy I/O through the ASM instance. When the ASM instance reads user data, it does a loop-back connection to itself. This results in ASM having an ASMB process during RDBMS instance startup. After the RDBMS has gotten the initial contents of the spfile, it allocates the SGA, closes the proxy open of the spfile, and opens the spfile again via the normal method used for all other files.
ASM tracks all the files an RDBMS instance has open. This allows ASM to prevent the deletion of open files. ASM also needs to know what files an RDBMS instance has open so that it can coordinate extent relocation, as described later in this chapter.
File I/O
RDBMS instances perform ASM file I/O directly to the ASM disks; in other words, the I/O is performed unbuffered and directly to the disk without involving ASM. Keep in mind that each RDBMS instance uses the extent maps it obtains during file open to determine where on the ASM disks to direct its reads and writes; thus, the RDBMS instance has all the information it needs to perform the I/O to the database file.
image
MYTH
RDBMS instances proxy all I/O to ASM files through an ASM instance.
File Close
When an RDBMS instance closes a file, it sends a message to the ASM instance. ASM cleans up its internal state when the file is closed. Closed files do not require messaging to RDBMS instances when their extents are relocated via rebalance.
ASM administrators can issue the following command to delete closed files manually:
image
image
The DROP FILE command fails for open files (generating an ORA-15028 error message “ASM file filename not dropped; currently being accessed”). Generally, manual deletion of files is not required if the ASM files are Oracle Managed Files (OMFs), which are automatically deleted when they are no longer needed. For instance, when the RDBMS drops a tablespace, it will also delete the underlying data files if they are OMFs.
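For illustration, assuming a disk group named DATA and reusing the system-generated ISHAN data file name from the earlier example, a closed file could be dropped manually with a statement along these lines:

ALTER DISKGROUP data DROP FILE '+DATA/yoda/datafile/ishan.259.616618501';

If the file is still open by an RDBMS instance, the statement fails with the ORA-15028 error noted above.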
File Delete
When an RDBMS instance deletes a file, it sends a request to the ASM instance. The ASM instance creates a COD entry to record the intent to delete the file. ASM then marks the appropriate allocation table entries as free, releases the file directory entry, and removes the appropriate alias directory entries. If the instance fails during file deletion, COD recovery completes the deletion. The delete request from the RDBMS instance is not complete until ASM completes freeing all of the allocated space.
ASM File Allocation
This section describes how ASM allocates files within an external redundancy disk group. For simplicity, this section explains ASM striping and variable-sized extents for ASM files in the absence of ASM redundancy. The concepts in striping and variable-sized extents also apply to files with ASM redundancy. A later section explains allocation of files with ASM redundancy.
External Redundancy Disk Groups
ASM allocates files so that they are evenly spread across all disks in a disk group. ASM uses the same algorithm for choosing disks for file allocation and for file rebalance. In the case of rebalance, if multiple disks are equally desirable for extent placement, ASM chooses the disk where the extent is already allocated if that is one of the choices.
image
MYTH
ASM chooses disks for allocation based on I/O statistics.
image
MYTH
ASM places the same number of megabytes from a file on each disk regardless of disk size.
ASM chooses the disk for the first extent of a file to optimize for space usage in the disk group. It strives to fill each disk evenly (proportional to disk size if the disks are not the same size). Subsequently, ASM tries to spread the extents of the file evenly across all disks. As described in the file directory persistent structure, ASM extent placement is not bound by strict modulo arithmetic; however, extent placement tends to follow an approximately round-robin pattern across the disks in a disk group. ASM allocation on a disk begins at the lower-numbered AUs (typically the outer tracks of disks). Keeping allocations concentrated in the lower-numbered AUs tends to reduce seek time and takes advantage of the highest-performing tracks of the disk. Note that the assumption about the mapping of lower-numbered AUs to higher-performing tracks is generally true for physical disks, but may not hold true for LUNs presented by storage arrays. ASM is not aware of the underlying physical layout of the ASM disks.
A potentially nonintuitive side effect of ASM’s allocation algorithm is that ASM may report out-of-space errors even when it shows adequate aggregate free space for the disk group. This side effect can occur if the disk group is out of balance. For external redundancy disk groups, the primary reason for disk group imbalance is an incomplete rebalance operation. This can occur if a user stops a rebalance operation by specifying rebalance power 0, or if the rebalance was canceled or terminated. Also, if a disk group is almost full when new disks are added, allocation may fail on the existing disks until rebalance has freed sufficient space. It is a best practice to leave a minimum of 20 percent free space uniformly across all disks.
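A quick way to check whether a disk group is drifting toward this situation is to compare free space disk by disk; the query below is only a sketch and assumes the standard V$ASM_DISK columns:

SELECT group_number, name, total_mb, free_mb,
       ROUND(100 * free_mb / total_mb, 1) AS pct_free
  FROM v$asm_disk
 WHERE group_number > 0
 ORDER BY group_number, pct_free;

A large spread in PCT_FREE among disks of the same disk group usually indicates an incomplete rebalance.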
Variable-Sized Extents
In disk groups with COMPATIBLE.RDBMS lower than 11.1, all extents are a fixed size of one allocation unit (AU). If COMPATIBLE.RDBMS is 11.1 or higher, extent sizes increase for data files as the file size grows. A multi-AU extent consists of multiple contiguous allocation units on the same disk. The first 20,000 extents in a file are one AU. The next 20,000 extents are four AUs. All extents beyond 40,000 are 16 AUs. This allows ASM to address larger files more efficiently. For the first 20,000 extents, allocation occurs exactly as with fixed-extent-size files. For multi-AU extents, ASM must find contiguous extents on a disk. ASM’s allocation pattern tends to concentrate allocations in the lower-numbered AUs and leave free space in the higher-numbered AUs. File shrinking or deletion can lead to free space fragmentation. ASM attempts to maintain defragmented disks during disk group rebalance. If during file allocation ASM is unable to find sufficient contiguous space on a disk for a multi-AU extent, it consolidates the free space until enough contiguous space is available for the allocation. Defragmentation uses the relocation locking mechanism described later in this chapter.
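The extent-size progression can be summarized as a simple lookup. The following sketch uses a hypothetical bind variable :extent_no (the extent number within the file, counting from zero) and assumes COMPATIBLE.RDBMS is 11.1 or higher:

-- :extent_no is a hypothetical bind variable used only for illustration
SELECT CASE
         WHEN :extent_no < 20000 THEN 1
         WHEN :extent_no < 40000 THEN 4
         ELSE 16
       END AS extent_size_in_aus
  FROM dual;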
ASM Striping
ASM offers two types of striping for files. Coarse striping is done at the AU level. Fine-grained striping is done at 128K granularity. Coarse striping provides better throughput, whereas fine-grained striping provides better latency. The file template indicates which type of striping is performed for the file. If a template is not specified during file creation, ASM uses the default template for the file type.
ASM striping is performed logically at the client level. In other words, RDBMS instances interpret extent maps differently based on the type of striping for the file. Striping is opaque to ASM operations such as extent relocation; however, ASM takes striping into account during file allocation.
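As a sketch of how a template influences striping, the following statements add a fine-grained template to a disk group and then reference it when creating a data file. The names DATA, finestripe, and demo_ts are assumptions; the ALTER DISKGROUP statement runs in the ASM instance, and the CREATE TABLESPACE statement runs in the RDBMS instance:

ALTER DISKGROUP data ADD TEMPLATE finestripe ATTRIBUTES (FINE);
CREATE TABLESPACE demo_ts DATAFILE '+DATA(finestripe)' SIZE 100M;

As noted above, the choice of striping changes how the RDBMS interprets the extent map rather than the extent sizes ASM allocates.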
Coarse Striping
With single AU extents, each extent contains a contiguous logical chunk of the file. Because ASM distributes files evenly across all disks at the extent level, such files are effectively striped at AU-sized boundaries.
With multi-AU extents, the ASM file is logically striped at the AU level across a stripe set of eight disks. For instance, with 1MB AUs, the first MB of the file is written to the first disk, the second MB is written to the second disk, the third MB is written to the third disk, and so on. The file allocation dictates the set of eight disks involved in the stripe set and the order in which they appear in the stripe set. If fewer than eight disks exist in the disk group, then disks may be repeated within a stripe set.
Figure 9-1 shows a file with coarse striping and 1MB fixed extents. The logical database file is 6.5MB. In the figure, each letter represents 128K, so uppercase A through Z followed by lowercase a through z cover 6.5MB. Each extent is 1MB (represented by eight letters, or 8 × 128K) and holds the contiguous logical content of the database file. ASM must allocate seven extents to hold a 6.5MB file with coarse striping.
image
image
image
FIGURE 9-1.   Coarse striping
Figure 9-2 represents a file with variable-sized extents and coarse striping. In order for the figure to fit on one page, it represents a disk group with eight disks. The file represented is 20,024MB. In the figure, each capital letter represents 1MB in the database file. The file has 20,008 extents. The first 20,000 are one AU, whereas the next eight are eight AU extents. Notice that the eight AU extents are allocated in sets of eight so that ASM can continue to stripe at the AU level. This file uses 20,064MB of space in the disk group.
image
image
image
FIGURE 9-2.   Coarse striping variable-sized extents
Fine-Grained Striping
Fine-grained striped ASM files are logically striped at 128K. For each set of eight extents, the file data is logically striped in a round-robin fashion: bytes 0K through 127K go on the first disk, 128K through 255K on the second disk, 256K through 383K on the third disk, and so on. The file allocation determines the set of eight disks involved in the stripe set and the order in which they appear in the stripe set. If fewer than eight disks exist in the disk group, then disks may be repeated within a stripe set.
image
MYTH
Fine-grained striped ASM files have smaller extent sizes than files with coarse striping.
Figure 9-3 shows a file with fine-grained striping and 1MB fixed extents. The logical database file is 6.5MB (and is the same as the file shown in Figure 9-1). In the figure, each letter represents 128K, so uppercase A through Z followed by lowercase a through z cover 6.5MB. Each extent is 1MB (representing eight letters or 8 × 128K). ASM must allocate eight extents to hold a 6.5MB file with fine-grained striping. As described previously, the logical contents of the database file are striped across the eight extents in 128K stripes.
image
image
image
FIGURE 9-3.   Fine-grained striping
ASM Redundancy
Unlike traditional volume managers, ASM mirrors extents rather than mirroring disks. ASM’s extent-based mirroring provides load balancing and hardware utilization advantages over traditional redundancy methods.
In traditional RAID 1 mirroring, the disk that mirrors a failed disk suddenly must serve all the reads that were intended for the failed drive, and it also serves all the reads needed to populate the hot spare disk that replaces the failed drive. When an ASM disk fails, the load of the I/Os that would have gone to that disk is distributed among the disks that contain the mirror copies of extents on the failed disk.
image
MYTH
ASM mirroring makes two disks look identical.
RAID 5 and RAID 1 redundancy schemes have a rigid formula that dictates the placement of the data relative to the mirror or parity copies. As a result, these schemes require hot spare disks to replace drives when they fail. “Hot spare” is a euphemism for “usually idle.” During normal activity, applications cannot take advantage of the I/O capacity of a hot spare disk. ASM, on the other hand, does not require any hot spare disks; it requires only spare capacity. During normal operation, all disks are actively serving I/O for the disk group. When a disk fails, ASM’s flexible extent pointers allow extents from a failed disk to be reconstructed from the mirror copies distributed across the disk partners of the failed disk. The reconstructed extents can be placed on the remaining disks in the disk group. See Chapter 12 for the specialized layout and configurations on Engineered Systems.
Failure Groups
Because of the physical realities of disk connectivity, disks do not necessarily fail independently. Certain disks share common single points of failure, such as a power supply, a host bus adapter (HBA), or a controller. Different users have different ideas about which component failures they want to be able to tolerate. Users can specify the shared components whose failures they want to tolerate by specifying failure groups. By default, each disk is in its own failure group; in other words, if the failgroup specification is omitted, ASM automatically places each disk into its own failgroup. The only exceptions are Exadata and ODA. In Exadata, all disks from the same storage cell are automatically placed in the same failgroup. In ODA, disks from specific slots in the array are automatically put into a specific failgroup.
ASM allocates mirror extents such that they are always in a different failure group than the primary extent. In the case of high-redundancy files, each extent copy in an extent set is in a different failure group. By placing mirror copies in different failure groups, ASM guarantees that even with the loss of all disks in a failure group, a disk group will still have at least one copy of every extent available. The way failure groups are managed and extents are distributed has been specialized and specifically designed to take advantage of the Engineered Systems. See Chapter 12 for the details on disk, failure group, and recovery management on Engineered Systems.
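For example, a normal-redundancy disk group that tolerates the loss of either of two disk controllers might be created as follows; the disk group name, failure group names, and device paths are assumptions for illustration:

CREATE DISKGROUP data NORMAL REDUNDANCY
  FAILGROUP controller1 DISK '/dev/sdb1', '/dev/sdc1'
  FAILGROUP controller2 DISK '/dev/sdd1', '/dev/sde1';

With this definition, ASM never places an extent and its mirror copy on disks in the same failure group.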
Disk Partners
ASM disks store the mirror copies of their extents on their disk partners. A disk partnership is a symmetric relationship between two disks in the same disk group but different failure groups. ASM selects partners for a disk from failure groups other than the failure group to which the disk belongs, but an ASM disk may have multiple partners that are in the same failure group. Limiting the number of partners for each disk minimizes the number of disks whose overlapping failures could lead to loss of data in a disk group. Consider a disk group with 100 disks. If an extent could be mirrored between any two disks, the failure of any two of the 100 disks could lead to data loss. If a disk mirrors its extents on only up to 10 disks, then when one disk fails, its 10 partners must remain online until the lost disk’s contents are reconstructed by rebalance, but a failure of any of the other 89 disks could be tolerated without loss of data.
ASM stores the partnership information in the Partnership Status Table (PST). Users do not specify disk partners; ASM chooses disk partners automatically based on the disk group’s failure group definitions. A disk never has any partners in its own failure group. Disk partnerships may change when the disk group configuration changes. Disks have a maximum of 10 active partnerships, although typically a disk has eight active partnerships. When disk group reconfigurations cause disks to drop existing partnerships and form new partnerships, the PST tracks the former partnerships until rebalance completes to ensure that former partners no longer mirror any extent between them. PST space restrictions limit each disk to a total of 20 current and former partners. For this reason, it is more efficient to perform disk reconfigurations (that is, to add or drop disks) in batches. Too many nested disk group reconfigurations can exhaust the PST space and result in an “ORA-15074: diskgroup requires rebalance completion” error message.
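For example, rather than adding disks one at a time with separate statements, a single statement (shown here with hypothetical device paths and an assumed disk group name) triggers only one partnership reconfiguration and one rebalance:

ALTER DISKGROUP data ADD DISK '/dev/sdf1', '/dev/sdg1', '/dev/sdh1'
  REBALANCE POWER 4;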
Allocation with ASM Redundancy
File allocation in normal- and high-redundancy disk groups is similar to allocation in external redundancy disk groups. ASM uses the same algorithm for allocating the primary extent for each extent set. ASM then balances mirror extent allocation among the partners of the disk that contains the primary extent.
Because mirror extent placement is dictated by disk partnerships, which are influenced by failure group definitions, it is important that failure groups in a disk group be of similar sizes. If some failure groups are smaller than others, ASM may return out-of-space errors during file allocation if the smaller failure group fills up sooner than the other failure groups.
I/O to ASM Mirrored Files
When all disks in an ASM disk group are online, ASM writes in parallel to all copies of an extent and reads from the primary extent. Writes to all extent copies are necessary for correctness. Reads from the primary extent provide the best balance of I/O across disks in a disk group.
image
MYTH
ASM needs to read from all mirror sides to balance I/O across disks.
Most traditional volume managers distribute reads evenly among the mirror sides in order to balance I/O load among the disks that mirror each other. Because each ASM disk contains a combination of primary and mirror extents, I/O to primary extents of ASM files spreads the load evenly across all disks. Although the placement of mirror extents is constrained by failure group definitions, ASM can allocate primary extents on any disk to optimize for even distribution of primary extents.
Read Errors
When an RDBMS instance encounters an I/O error trying to read a primary extent, it will try to read the mirror extent. If the read from the mirror extent is successful, the RDBMS can satisfy the read request, and upper-layer processing continues as usual. For a high-redundancy file, the RDBMS tries to read the second mirror extent if the first mirror extent returns an I/O error. If reads fail to all extent copies, the I/O error propagates to the upper layers, which take the appropriate action (such as taking a tablespace offline). A read error in an RDBMS instance never causes a disk to go offline.
The ASM instance handles read errors in a similar fashion. If ASM is unable to read any copy of a virtual metadata extent, it forces the dismounting of the disk group. If ASM is unable to read physical metadata for a disk, it takes the disk offline, because physical metadata is not mirrored.
Read errors can be due to the loss of access to the entire disk or due to bad sectors on an otherwise healthy disk. ASM tries to recover from bad sectors on a disk. Read errors in the RDBMS or ASM instance trigger the ASM instance to attempt bad block remapping. ASM reads a good copy of the extent and copies it to the disk that had the read error. If the write to the same location succeeds, the underlying allocation unit is deemed healthy (because the underlying disk likely did its own bad block relocation). If the write fails, ASM attempts to write the extent to a new allocation unit of the same disk. If that write succeeds, the original allocation unit is marked as unusable. If the write to the new allocation unit fails, the disk is taken offline. The process of relocating a bad allocation unit uses the same locking logic discussed later in this chapter for rebalance.
One unique benefit of ASM-based mirroring is that the RDBMS instance is aware of the mirroring. For many types of logical file corruptions, if the RDBMS instance reads unexpected data (such as a bad checksum or incorrect System Change Number [SCN]) from the primary extent, the RDBMS proceeds through the mirror sides looking for valid content. If the RDBMS can find valid data on an alternate mirror, it can proceed without errors (although it will log the problem in the alert log). If the process in the RDBMS instance that encountered the corrupt read is in a position to obtain the appropriate locks to ensure data consistency, it writes the correct data to all mirror sides.
Write Errors
When an RDBMS instance encounters a write error, it sends to the ASM instance a disk offline message indicating which disk had the write error. If the RDBMS can successfully complete a write to at least one extent copy and receive acknowledgment of the offline disk from the ASM instance, the write is considered successful for the purposes of the upper layers of the RDBMS. If writes to all mirror sides fail, the RDBMS takes the appropriate actions in response to a write error (such as taking a tablespace offline).
When the ASM instance receives a write error message from an RDBMS instance (or when an ASM instance encounters a write error itself), ASM attempts to take the disk offline. ASM consults the PST to see whether any of the disk’s partners are offline. If too many partners are already offline, ASM forces the dismounting of the disk group on that node. Otherwise, ASM takes the disk offline. ASM also tries to read the disk header for all the other disks in the same failure group as the disk that had the write error. If the disk header read fails for any of those disks, they are also taken offline. This optimization allows ASM to handle potential disk group reconfigurations more efficiently.
Disk offline is a global operation. The ASM instance that initiates the offline sends a message to all other ASM instances in the cluster. The ASM instances all relay the offline status to their client RDBMS instances. If COMPATIBLE.RDBMS is less than 11.1, ASM immediately force-drops disks that have gone offline. If COMPATIBLE.RDBMS is 11.1 or higher, disks stay offline until the administrator issues an online command or until the timer specified by the DISK_REPAIR_TIME attribute expires. If the timer expires, ASM force-drops the disk. The “Resync” section later in this chapter describes how ASM handles I/O to offline disks.
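A sketch of the related commands, assuming a disk group named DATA and an offline disk named DATA_0003: the first statement lengthens the repair timer (the disk_repair_time attribute requires compatible.asm of 11.1 or higher), and the second brings the repaired disk back online so that resync can restore its stale extents:

ALTER DISKGROUP data SET ATTRIBUTE 'disk_repair_time' = '8h';
ALTER DISKGROUP data ONLINE DISK data_0003;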
Mirror Resilvering
Because the RDBMS writes in parallel to multiple copies of extents, in the case of a process or node failure, it is possible that a write has completed to one extent copy but not another. Mirror resilvering ensures that two mirror sides (that may be read) are consistent.
image
MYTH
ASM needs a Dirty Region Log to keep mirrors in sync after process or node failure.
Most traditional volume managers handle mirror resilvering by maintaining a Dirty Region Log (DRL). The DRL is a bitmap with 1 bit per chunk (typically 512K) of the volume. Before a mirrored write can be issued, the DRL bit for the chunk must be set on disk. This results in a performance penalty if an additional write to the DRL is required before the mirrored write can be initiated. A write is unnecessary if the DRL bit is already set because of a prior write. The volume manager lazily clears DRL bits as part of other DRL accesses. Following a node failure, for each region marked in the DRL, the volume manager copies one mirror side to the other as part of recovery before the volume is made available to the application. The lazy clearing of bits in the DRL and coarse granularity inevitably result in more data than necessary being copied during traditional volume manager resilvering.
Because the RDBMS must recover from process and node failure anyway, RDBMS recovery also addresses potential ASM mirror inconsistencies if necessary. Some files, such as archived logs, are simply re-created if there was a failure while they were being written, so no resilvering is required. This is possible because an archive log’s creation is not committed until all of its contents are written. For some operations, such as writing intermediate sort data, the failure means that the data will never be read, so the mirror sides need not be consistent and no recovery is required. If the redo log indicates that a data file needs a change applied, the RDBMS reads the appropriate blocks from the data file and checks the SCN. If the SCN in the block indicates that the change must be applied, the RDBMS writes the change to both mirror sides (just as with all other writes). If the SCN in the block indicates that the change is already present, the RDBMS writes the block back to the data file in case an unread mirror side does not contain the change. The resilvering of the redo log itself is handled by reading all the mirror sides of the redo log and picking the valid version during recovery.
The ASM resilvering algorithm also provides better data integrity guarantees. For example, a crash could damage a block that was in the midst of being written. Traditional volume manager resilvering has a 50-percent chance of copying the damaged block over the good block as part of applying the Dirty Region Log (DRL). When RDBMS recovery reads a block, it verifies that the block is not corrupt. If the block is corrupt and mirrored by ASM, each mirrored copy is examined to find a valid version. Thus, a valid version is always used for ASM resilvering.
ASM’s mirror resilvering is more efficient and more effective than traditional volume managers, both during normal I/O operation and during recovery.
Preferred Read
Some customers deploy extended clusters for high availability. Extended clusters consist of servers and storage that reside in separate data centers at sites that are usually several kilometers apart. In these configurations, ASM normal-redundancy disk groups have two failure groups: one for each site. From a given server in an extended cluster, the I/O latency is greater for the disks at the remote site than for the disks in the local site. When COMPATIBLE.RDBMS is set to 11.1 or higher, users can set the ASM_PREFERRED_READ_FAILURE_GROUPS initialization parameter for each ASM instance to specify the local failure group in each disk group for that instance. When this parameter is set, reads are preferentially directed to disks in the specified failure group rather than always going to the primary extent first. In the case of I/O errors, reads can still be satisfied from the nonpreferred failure group.
To ensure that all mirrors have the correct data, writes must still go to all copies of an extent.
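A sketch of setting the parameter for one instance of an extended cluster follows. The value uses diskgroup.failgroup notation; the disk group name (DATA), failure group name (SITEA), and instance SID (+ASM1) are assumptions, and the statement presumes the ASM instance uses an spfile:

ALTER SYSTEM SET asm_preferred_read_failure_groups = 'DATA.SITEA'
  SID = '+ASM1' SCOPE = BOTH;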
Rebalance
Whenever the disk group configuration changes—whenever a disk is added, dropped, or resized—ASM rebalances the disk group to ensure that all the files are evenly spread across the disks in the disk group. ASM creates a COD entry to indicate that a rebalance is in progress. If the instance performing the rebalance terminates abnormally, a surviving instance restarts the rebalance. If no other ASM instance is running, the rebalance restarts when the instance restarts.
image
MYTH
An ASM rebalance places data on disks based on I/O statistics.
image
MYTH
ASM rebalances dynamically in response to hot spots on disks.
ASM spreads every file evenly across all the disks in the disk group. Because of the access patterns of the database, including the caching that occurs in the database’s buffer cache, ASM’s even distribution of files across the disks in a disk group usually leads to an even I/O load on each of the disks as long as the disks share similar size and performance characteristics.
Because ASM maintains pointers to each extent of a file, rebalance can minimize the number of extents that it moves during configuration changes. For instance, adding a fifth disk to an ASM disk group with four disks results in moving 20 percent of the extents. If modulo arithmetic were used—changing placement from every fourth extent of a file on each disk to every fifth extent of a file on each disk—then almost 100 percent of the extents would have needed to move. Figures 9-4 and 9-5 show the differences in data movement required when adding one disk to a four-disk disk group using ASM rebalance and traditional modulo arithmetic.
image
image
image
FIGURE 9-4.   Data movement using traditional modulo arithmetic
image
image
image
FIGURE 9-5.   Data movement using ASM rebalance
The RBAL process on the ASM instance that initiates the rebalance coordinates rebalance activity. For each file in the disk group—starting with the metadata files and continuing in ascending file number order—RBAL creates a plan for even placement of the file across the disks in the disk group. RBAL dispatches the rebalance plans to the ARBn processes in the instance. The number of ARBs is dictated by the power of the rebalance. Each ARBn relocates the appropriate extents according to its plan. Coordination of extent relocation among the ASM instances and RDBMS instances is described in the next section. In 11.2.0.2 and above (with compatible.asm set to 11.2.0.2), the same workflow occurs, except there is only one ARB0 process that performs all the relocations. The rebalance power dictates the number of outstanding async I/Os that will be issued or in flight. See the “Relocation” section for details on relocation mechanisms.
ASM displays the progress of the rebalance in V$ASM_OPERATION. ASM estimates the number of extents that need to be moved and the time required to complete the task. As rebalance progresses, ASM reports the number of extents moved and updates the estimates.
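A sketch of manually changing the rebalance power and then watching progress, with an assumed disk group name of DATA:

ALTER DISKGROUP data REBALANCE POWER 8;

SELECT group_number, operation, state, power, sofar, est_work, est_minutes
  FROM v$asm_operation;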
Resync
Resync is the operation that restores the updated contents of disks that are brought online from an offline state. Bringing online a disk that has had a transient failure is more efficient than adding the disk back to the disk group. When you add a disk, all the contents of the disk must be restored via a rebalance. Resync restores only the extents that would have been written while the disk was offline.
When a disk goes offline, ASM distributes an initialized slot of the Staleness Registry (SR) to each RDBMS and ASM instance that has mounted the disk group with the offline disk. Each SR slot has a bit per allocation unit in the offline disk. Any instance that needs to perform a write to an extent on the offline disk persistently sets the corresponding bit in its SR slot and performs the writes to the extent set members on other disks.
When a disk is being brought online, the disk is enabled for writes, so the RDBMS and ASM instances stop setting bits in their SR slots for that disk. Then ASM reconstructs the allocation table (AT) and free space table for the disk. Initially, all the AUs (except for the physical metadata and AUs marked as unreadable by block remapping) are marked as free. At this point, deallocation is allowed on the disk. ASM then examines all the file extent maps and updates the AT to reflect the extents that are allocated to the disk being brought online. The disk is enabled for new allocations. To determine which AUs are stale, ASM performs a bitwise OR of all the SR slots for the disk. Those AUs with set bits are restored using the relocation logic described in the following section. After all the necessary AUs have been restored, ASM enables the disk for reads. When the online is complete, ASM clears the SR slots.
Relocation
Relocation is the act of moving an extent from one place to another in an ASM disk group. Relocation takes place most commonly during rebalance, but also occurs during resync and during bad block remapping. The relocation logic guarantees that the active RDBMS clients always read the correct data and that no writes are lost during the relocation process.
Relocation operates on a per-extent basis. For a given extent, ASM first checks whether the file is open by any instance. If the file is closed, ASM can relocate the extent without messaging any of the client instances. Recovery Manager (RMAN) backup pieces and archive logs often account for a large volume of the space used in ASM disk groups, especially if the disk group is used for the Fast Recovery Area (FRA), but these files are usually not open. For closed files, ASM reads the extent contents from the source disk and writes them to the specified location on the target disk. ASM keeps track of the relocations that are ongoing for closed files so that if a client opens a file that is in the middle of relocation, the appropriate relocation messages are sent to the client as part of the file open.
If the file being relocated is open, the instance performing the relocation sends to all other ASM instances in the cluster a message stating its intent to relocate an extent. All ASM instances send a message, which indicates the intent to relocate the extent over the umbilicus to any RDBMS clients that have the file open. At this point, the RDBMS instances delay any new writes until the new extent location is available. Because only one extent is relocated at a time, the chances are very small of an RDBMS instance writing to the extent being relocated at its time of relocation. Each RDBMS instance acknowledges receipt of this message to its ASM instance. The ASM instances in turn acknowledge receipt to the coordinating ASM instance. After all acknowledgments arrive, the coordinating ASM instance reads the extent contents from the source disk and writes them to the target disk. After this is complete, the same messaging sequence occurs, this time to indicate the new location of the extent set. At this time, in-flight reads from the old location are still valid. Subsequent reads are issued to the new location. Writes that were in flight must be reissued to the new location. Writes that were delayed based on the first relocation message can now be issued to the new location. After all in-flight writes issued before the first relocation message are complete, the RDBMS instances use the same message-return sequence to notify the coordinating ASM instance that they are no longer referencing the old extent location. At this point, ASM can return the allocation unit(s) from the old extent to the free pool. Note that the normal RDBMS synchronization mechanisms remain in effect in addition to the relocation logic.
ASM Instance Recovery and Crash Recovery
When ASM instances exit unexpectedly due to a shutdown abort or an ASM instance crash, ASM performs recovery. As with the RDBMS, ASM can perform crash recovery or instance recovery. ASM performs instance recovery when an ASM instance mounts a disk group following abnormal instance termination. In a clustered configuration, surviving ASM instances perform crash recovery following the abnormal termination of another ASM instance in the cluster. ASM recovery is similar to the recovery performed by the RDBMS. For ASM recovery, the ACD is analogous to the redo logs in the RDBMS instance. The ASM COD is similar to the undo tablespaces in the RDBMS. With ASM, recovery is performed on a disk group basis. In a cluster, different ASM instances can perform recovery on different disk groups.
CSS plays a critical role in RAC database recovery. With RAC databases, CSS ensures that all I/O-capable processes from a crashed RDBMS instance have exited before a surviving RDBMS instance performs recovery. The OS guarantees that terminated processes do not have any pending I/O requests. This effectively fences off the crashed RDBMS instance. The surviving RDBMS instances can then proceed with recovery with the assurance that the crashed RDBMS instances cannot corrupt the database.
CSS plays a similar role in ASM recovery. Even a single-node ASM deployment involves multiple Oracle instances. When an ASM instance mounts a disk group, it registers with CSS. Each RDBMS instance that accesses ASM storage registers with CSS, which associates the RDBMS instance with the disk group that it accesses and with its local ASM instance. Each RDBMS instance has an umbilicus connection between its ASMB process and the UFG in the local ASM instance. Because ASMB is a fatal background process in the RDBMS, and the failure of a UFG kills ASMB, the failure of an ASM instance promptly terminates all its client RDBMS instances. CSS tracks the health of all the I/O-capable processes in RDBMS and ASM instances. If an RDBMS client terminates, CSS notifies the local ASM instance when all the I/O-capable processes have exited. At this point, ASM can clean up any resource held on behalf of the terminated RDBMS instances. When an ASM instance exits abnormally, its client RDBMS instances terminate shortly thereafter because of the severed umbilicus connection. CSS notifies the surviving ASM instances after all the I/O-capable processes from the terminated ASM instance and its associated RDBMS instances have exited. At this point, the surviving ASM instances can perform recovery knowing that the terminated instance can no longer perform any I/Os.
During ASM disk group recovery, an ASM instance first applies the ACD thread associated with the terminated ASM instance. Applying the ACD records ensures that the ASM cache is in a consistent state. The surviving ASM instance will eventually flush its cache to ensure that the changes are recorded in the other persistent disk group data structures. After ACD recovery is complete, ASM performs COD recovery. COD recovery performs the appropriate actions to handle long-running operations that were in progress on the terminated ASM instance. For example, COD recovery restarts a rebalance operation that was in progress on a terminated instance. It rolls back file creations that were in progress on terminated instances. File creations are rolled back because the RDBMS instance could not have committed a file creation that was not yet complete. The “Mirror Resilvering” section earlier in this chapter describes how ASM ensures mirror consistency during recovery.
ASM recovery is based on the same technology that the Oracle Database has used successfully for years.
Disk Discovery
Disk discovery is the process of examining disks that are available to an ASM instance. ASM performs disk discovery in several scenarios, including the following:
image   Mount disk group
image   Create disk group
image   Add disk
image   Online disk
image   Select from V$ASM_DISKGROUP or V$ASM_DISK
The set of disks that ASM examines is called the discovery set and is constrained by the ASM_DISKSTRING initialization parameter and the disks to which ASM has read/write permission. The default value for ASM_DISKSTRING varies by platform, but usually matches the most common location for devices on the platform. For instance, on Linux, the default is /dev/sd*. On Solaris, the default is /dev/rdsk/*. The ASM_DISKSTRING value should be broad enough to match all the disks that will be used by ASM. Making ASM_DISKSTRING more restrictive than the default can make disk discovery more efficient.
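For example, on a Linux system where the ASM disks are presented as multipath pseudo-devices, the discovery string could be narrowed to match only those devices. The path pattern below is an assumption, and the statement presumes the ASM instance uses an spfile:

ALTER SYSTEM SET asm_diskstring = '/dev/mapper/asm*' SCOPE = BOTH;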
image
MYTH
ASM persistently stores the path to each disk in a disk group.
The basic role of disk discovery is to allow ASM to operate on disks based on their content rather than the path to them. This allows the paths to disks to vary between reboots of a node and across nodes of a cluster. It also means that ASM disk groups are self-describing: All the information required to mount a disk group is found on the disk headers. Users need not enumerate the paths to each disk in a disk group to mount it. This makes ASM disk groups easily transportable from one node to another.
Mount Disk Group
When mounting a disk group, ASM scans all the disks in the discovery set to find each disk whose header indicates that it is a member of the specified disk group. ASM also inspects the PSTs to ensure that the disk group has a quorum of PSTs and that it has a sufficient number of disks to mount the disk group. If discovery finds each of the disks listed in the PST (and does not find any duplicate disks), the mount completes and the mount timestamp is updated in each disk header.
If some disks are missing, the behavior depends on the version of ASM and the redundancy of the disk group. For external-redundancy disk groups, the mount fails if any disk is missing. For normal-redundancy disk groups, the mount can succeed as long as none of the missing disks are partners of each other, and only if the instance is the first in the cluster to mount the disk group; if the disk group is already mounted by other instances, the mount fails. The rationale for this behavior is that if the missing disk had truly failed, the instances with the disk group mounted would have already taken the disk offline. As a result, the missing disk on the node trying to mount is likely to be the result of local problems, such as incorrect permissions, an incorrect discovery string, or a local connection problem. Rather than continue this inconsistent mount behavior, Oracle Database 11g introduced a FORCE option to the mount command. If you do not specify FORCE (which is equivalent to specifying NOFORCE), the mount fails if any disks are missing. If you specify FORCE to the mount command, ASM takes the missing disks offline (if it finds a sufficient number of disks) and completes the mount. A mount with the FORCE option fails if no disks need to be taken offline as part of the mount. This is to prevent the gratuitous use of FORCE during a mount.
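For example, assuming a disk group named DATA with one genuinely failed disk, the following statement mounts the disk group and takes the missing disk offline; the same statement fails if no disks actually need to be taken offline:

ALTER DISKGROUP data MOUNT FORCE;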
If multiple disks have the same name in the disk header, ASM does its best to pick the right one for the mount. If another ASM instance already has the disk group mounted, the mounting ASM instance chooses the disk with the mount timestamp that matches the mount timestamp cached in the instance that already has the disk group mounted. If no other instance has the disk group mounted or if multiple disks have the same ASM disk name and mount timestamp, the mount fails. With multipathing, the ASM_DISKSTRING or disk permissions should be set such that the discovery set contains only the path to the multipathing pseudo-device. The discovery set should not contain each individual path to each disk.
Create Disk Group
When creating a disk group, ASM writes a disk header to each of the disks specified in the CREATE DISKGROUP SQL statement. ASM recognizes disks that are formatted as ASM disks. Such disks have a MEMBER value in the HEADER_STATUS column of V$ASM_DISK. Other recognizable contents, such as database files, the Oracle Clusterware voting disk, or the Oracle Cluster Registry (OCR) stored on raw devices outside of ASM, appear as FOREIGN in the HEADER_STATUS column of V$ASM_DISK. By default, ASM disallows the use of disks that have a header status of FOREIGN or MEMBER. Specifying the FORCE option in a disk specification allows the use of disks with these header statuses. ASM does not allow reuse of any disk that is a member of a mounted disk group. ASM then verifies that it is able to find exactly one copy of each specified disk by reading the headers in the discovery set. If any disk is not found during this discovery, the disk group creation fails. If any disk is found more than once, the disk group creation also fails.
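Before creating a disk group, the header status of candidate disks can be checked with a query such as the following; disks already in use show MEMBER or FOREIGN, while unused disks typically show CANDIDATE, PROVISIONED, or FORMER:

SELECT path, header_status, mount_status, state
  FROM v$asm_disk
 ORDER BY header_status, path;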
Add Disk
When adding a disk to a disk group, ASM writes a disk header to each of the disks specified in the ADD DISK SQL statement. ASM verifies that it can discover exactly one copy of the specified disk in the discovery set. In clusters, the ASM instance adding the disk sends a message to all peers that have the disk group mounted to verify that they can also discover the new disk. If any peer ASM instance with the disk group mounted fails to discover the new disk, the ADD DISK operation fails. In this case, ASM marks the header as FORMER. ADD DISK follows the same rules for FORCE option usage as CREATE DISKGROUP.
Online Disk
The online operation specifies an ASM disk name, not a path. The ASM instance searches for the disk with the specified name in its discovery set. In a cluster, ASM sends a message to all peer instances that have the disk group mounted to verify that they can discover the specified disk. If any instance fails to discover the specified disk, ASM returns the disk to the OFFLINE state.
Select from V$ASM_DISK and V$ASM_DISKGROUP
Selecting from V$ASM_DISK or V$ASM_DISKGROUP reads all the disk headers in the discovery set to populate the views. V$ASM_DISK_STAT and V$ASM_DISKGROUP_STAT return the same information as V$ASM_DISK and V$ASM_DISKGROUP, but the _STAT views do not perform discovery. If any disks were added since the last discovery, they will not be displayed in the _STAT views.
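For routine monitoring scripts, the _STAT views are therefore the cheaper choice because they avoid triggering a fresh discovery; for example:

SELECT name, state, type, total_mb, free_mb
  FROM v$asm_diskgroup_stat;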
Summary
ASM is a sophisticated storage product that performs complex internal operations to provide simplified storage management to the user. Most of the operations described in this chapter are transparent to the user, but understanding what happens behind the curtains can be useful for planning and monitoring an ASM deployment.
CHAPTER 10 ACFS Design and Deployment
There is a big movement in the industry toward consolidation and cloud computing. At the core of all this is centralized-clustered storage, which requires a centralized file store such as a file system.
In Oracle Grid Infrastructure 11g Release 2, Oracle extends the capability of ASM by introducing the ASM Cluster File System, or ACFS. ACFS is a feature-rich, scalable, POSIX-compliant file system that is built from the traditional ASM disk groups. In Oracle Clusterware 11gR2 11.2.0.3, Oracle introduced a packaging suite called the Oracle Cluster File System, Cloud Edition, which provides a clustered file system for cloud storage applications, built on Automatic Storage Management, and includes advanced data management and security features. Oracle Cluster File System, Cloud Edition includes the following core components:
image   Automatic Storage Management (ASM)
image   ASM Dynamic Volume Manager (ADVM)
image   ASM Cluster File System (ACFS)
In addition, it provides the following services:
image   ACFS file tagging for aggregate operations
image   ACFS read-only snapshots
image   ACFS continuous replication
image   ACFS encryption
image   ACFS realm-based security
This chapter focuses on ACFS, its underlying architecture, and the associated data services.
ASM Cluster File System Overview
ACFS is a fully POSIX-compliant file system that can be accessed through native OS file system tools and APIs. Additionally, ACFS can be exported and accessed by remote clients via standard NAS file access protocols such as NFS and CIFS. In addition to ACFS, Oracle introduces ASM Dynamic Volume Manager (ADVM) to provide scalable volume management capabilities to support ACFS. Despite its name, ACFS can be utilized as a local file system as well as a cluster file system.
ACFS provides shared, cluster-wide access to various file types, including database binaries, trace files, Oracle BFILEs, user application data and reports, and other non-database application data. Storing these file types in ACFS inherently provides consistent global namespace support, which is the key to enabling uniform, cluster-wide file access and file pathing of all file system data. In the past, getting a consistent namespace across cluster nodes meant implementing a third-party cluster file system or NFS.
ACFS is built on top of the standard vnode/VFS file system interface ubiquitous in the Unix/Linux world, and it uses Microsoft standard interfaces on Windows. ACFS supports Windows APIs, command-line interfaces (CLIs) and tools (including Windows Explorer). ACFS uses standard file-related system calls like other file systems in the market. However, most file systems on the market are platform specific; this is especially true of cluster file systems. ACFS is a multiplatform cluster file system for Linux, Unix, and Windows, with the same features across all OS and platforms.
ACFS provides benefits in the following key areas:
image   Performance   ACFS is an extent-based clustered file system that provides metadata caching and fast directory lookups, thus enabling fast file access. Because ACFS leverages ASM functionality in the back end, it inherits a wide breadth of feature functionality already provided by ASM, such as maximized performance via ASM’s even extent distribution across all disks in the disk group.
image   Manageability   A key aspect of any file system container is that it needs to be able to grow as the data requirements change, and this needs to be performed with minimized downtime. ACFS provides the capability to resize the file system while the file system is online and active. In addition, ACFS supports other storage management services, such as file system snapshots. ACFS (as well as its underlying volumes) can be easily managed using various command-line interfaces or graphical tools such as ASM Configuration Assistant (ASMCA) and Oracle Enterprise Manager (EM). This wide variety of tools and utilities allows ACFS to be easily managed and caters to both system administrators and DBAs. Finally, ACFS management is integrated with Oracle Clusterware, which allows automatic startup of ACFS drivers and mounting of ACFS file systems based on dependencies set by the administrator.
image   Availability   ACFS, via journaling and checksums, provides the capability to quickly recover from file system outages and inconsistencies. ACFS checksums cover ACFS metadata. Any metadata inconsistency will be detected by checksum and can be remedied by performing fsck.
ACFS also leverages ASM mirroring for improved storage reliability. Additionally, because ACFS is tightly integrated with Oracle Clusterware, all cluster node membership and group membership services are inherently leveraged. Table 10-1 provides a summary of the features introduced in each of the 11gR2 patch set releases.
image
image
image
TABLE 10-1.   Features Introduced in Each of the 11gR2 Patch Set Releases
image
NOTE
ACFS is supported on Oracle Virtual Machine (OVM) in both Paravirtualized and Hardware Virtualized guests.
ACFS File System Design
This section illustrates the inner workings of the ACFS file system design—from the file I/O to space management.
ACFS File I/O
All ACFS I/O goes through ADVM, which maps these I/O requests to the actual physical devices based on the ASM extent maps it maintains in kernel memory. ACFS, therefore, inherits all the benefits of ASM’s data management capability.
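As a minimal sketch of the ADVM layer beneath ACFS, an ASM dynamic volume is created inside a disk group and then exposed as a block device on which the file system is built. The disk group and volume names below are assumptions, and creating a volume requires the disk group's COMPATIBLE.ADVM attribute to be 11.2 or higher:

ALTER DISKGROUP data ADD VOLUME acfsvol1 SIZE 10G;

SELECT volume_name, volume_device, state
  FROM v$asm_volume;

The VOLUME_DEVICE path returned by the query is the device on which the ACFS file system is then created and mounted at the operating system level.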
ACFS supports POSIX read/write semantics for the cluster environment. This essentially means that concurrent writes from multiple applications to the same file will not result in interlaced writes. In 11.2.0.4, ACFS introduced support for flock/fcntl with cluster-wide serialization; thus, whole-file exclusive locks taken on different nodes of a cluster serialize against each other.
ACFS shares the same cluster interconnect as Oracle RAC DB, Clusterware, and ASM; however, no ACFS I/Os travel across the interconnect. In other words, all ACFS I/Os are issued directly to the ASM storage devices. ACFS maintains a coherent distributed ACFS file cache (using the ACFS distributed lock manager) that provides a POSIX-compliant shared read cache as well as exclusive write caching to deliver “single-system POSIX file access semantics cluster-wide.” Note that this is cache coherency for POSIX file access semantics. ACFS delivers POSIX read-after-write semantics for a cluster environment. When an application requests to write a file on a particular node, ACFS will request an exclusive DLM lock. This, in turn, requires other cluster nodes to release any shared read or exclusive write locks for the file and to flush, if required, and invalidate any cached pages. However, it is important to contrast this with POSIX application file locking—cluster cache coherency provides POSIX access coherency (single-node coherency) semantics. POSIX file locking provides for application file access serialization, which is required if multiple users/applications are sharing a given file for reading and writing. ACFS does not provide cluster-wide POSIX file locking presently. However, it does support node local POSIX file locking.
Starting in 11.2.0.3, ACFS supports direct I/O—the ability to transfer data between user buffers and storage without additional buffering done by the operating system. ACFS supports the native file data caching mechanisms provided by different operating systems. Any caching directives settable via the open(2) system call are also honored. ACFS supports enabling direct I/O on a file-by-file basis using OS-provided mechanism; for example, using the O_DIRECT flag to open(2) on Linux and the FILE_FLAG_NO_BUFFERING flag to CreateFile on Windows. A file may be open in direct I/O mode and cached mode (the default) at the same time. ACFS ensures that the cache coherency guarantees mentioned earlier are always met even in such a mixed-usage scenario.
ACFS Space Allocation
ACFS file allocation is very similar to other POSIX-compliant file systems: The minimum file size is an ACFS block (4KB), with metadata sized 512 bytes to 4KB. ACFS also does file preallocation of storage as a means to efficiently manage file growth. This applies to regular files as well as directories. This preallocation amount is based on currently allocated storage.
The following is the basic algorithm for regular files:
image
image
The following are three use-case examples of ACFS file preallocation to illustrate this algorithm:
image   If a file is currently 64KB in size and the user writes 4KB to the file, ACFS will allocate 64KB of storage, bringing the file to a total of 128KB.
image   If a file is 4MB in size and the user writes 4KB, ACFS will allocate 1MB of storage, bringing the file to a total of 5MB.
image   If a file has zero bytes of storage—which may be the case if the file is brand new or was truncated to zero bytes—then at write() time no preallocation will be done.
The scenario is slightly different for directories. For directories, the algorithm is as follows:
image
image
Thus, if a directory is 16KB and needs to be extended, ACFS allocates another 16KB, bringing the directory to a total of 32KB. If the directory needs to be extended again, ACFS will allocate another 32KB, bringing it to a total of 64KB. Because the directory is now 64KB, all future extensions will be 64KB.
In addition to the preceding scenarios, if the desired storage cannot be contiguously allocated, ACFS will search for noncontiguous chunks of storage to satisfy the request. In such cases, ACFS disables all preallocation on this node for a period of time, until either a thread on this node frees up some storage or a certain amount of time passes (in the hope that another node has freed up some storage by then).
The salient aspects of ACFS file space management are:
image   ACFS tries to allocate contiguous space for a file when possible, which gives better performance for sequential reads and writes.
image   To improve performance of large file allocation, ACFS will preallocate the space when writing data.
image   This storage is not returned when the file is closed, but it is returned when the file is deleted.
image   ACFS allocates local metadata files as nodes mount the file system for the first time. This metadata space is allocated contiguously per node, with the maximum being 20MB.
image   ACFS maintains local bitmaps to reduce contention on the global storage bitmap. This local bitmap becomes very important during the search for local free space. Note that because of local space reservation (bitmap), when disk space is running low, allocations may be successful on one cluster node and fail on another. This is because no space is left in the latter’s local bitmap or the global bitmap.
It is important to note that this metadata space allocation may cause minor space discrepancies when used space is displayed by a command such as Unix/Linux df (for example, some of the space reported by the df command as “in use” may not actually be allocated yet). This local storage pool can be as large as 128MB per node and can allow space allocations to succeed, even though commands such as df report less space available than what is being allocated.
Distributed Lock Manager (DLM)
DLM provides a means for cluster-wide exclusive and shared locks. These DLM locks are taken whenever cluster-wide consistency and cache coherency must be guaranteed, such as for metadata updates and file reads and writes. ACFS uses the DLM services provided by Oracle Kernel Services (OKS). At a high level, a SHARED lock is requested when the cluster node is going only to read the protected entity. This SHARED lock is compatible with other SHARED locks and is incompatible with other EXCLUSIVE locks. An EXCLUSIVE lock is taken if the protected entity needs to be modified. Locks are cached on that node until some other node requests them in a conflicting mode. Therefore, a subsequent request for the same lock from the same node can be serviced without sending any internode messages, and the lock grant is quick.
When a node requests a lock that it does not hold or if the lock is not cached on that node, a blocking asynchronous system trap (BAST) is sent to the nodes that are holding the lock. If one or more nodes are currently holding the lock in a conflicting mode, the request is queued and is granted when the lock is released on that node (if there is no other request ahead in the queue). ACFS has eight threads (called “BAST handlers”) per node to service BAST requests. These threads are started at ACFS kernel module load time and service all file systems mounted on the node. For example:
image
image
The following syslog entry shows a BAST request being handled:
image
image
Metadata Buffer Cache
ACFS caches metadata for performance reasons. Some examples of metadata are inodes, data extent headers, and directory blocks. Access to metadata is arbitrated using DLM locks, as described earlier. All operations (such as file deletes and renames) involving metadata updates are done within a transaction, and modified metadata buffers are marked dirty in the buffer cache. In case of any errors, the transaction is aborted and the metadata changes are reverted. For all successfully completed transactions, dirty metadata buffers are written to disk in a special location called the Volume Log. Periodically, a kernel thread reads the Volume Log and applies these metadata changes to the actual metadata locations on disk. Once written into the Volume Log, a transaction can be recovered even if the node performing that transaction crashes before the transaction is applied to actual metadata locations.
Recovery
This section describes how the ACFS and ADVM layers handle various recovery scenarios.
ACFS and ADVM Recovery
All ACFS file systems must be cleanly dismounted before ASM or the GI stack is stopped. Typically this is internally managed by CRS; however, there are cases when emergency shutdown of the stack is necessary. A forced shutdown of ASM via SHUTDOWN ABORT should be avoided when there are mounted ACFS file systems; rather, a clean ACFS dismount followed by a graceful shutdown of ASM should be used.
When there is a component or node failure, a recovery of the component and its underlying dependent resource is necessary. For example, if an ACFS file system is forcibly dismounted because of a node failure, then recovery is necessary for ASM, the ASM Dynamic volume, as well as ACFS. This recovery process is implicitly performed as part of the component restart process. For example, if an ASM instance crashes due to node failure, then upon restart, ASM will perform crash recovery, or in the case of RAC, a surviving instance will perform instance recovery on the failed instance’s behalf. The ASM Dynamic volume is implicitly recovered as part of ASM recovery, and ACFS will be recovered using the ACFS Metadata Transaction Log.
The ADVM driver also supports cluster I/O fencing schemes to maintain cluster integrity and a consistent view of cluster membership. Furthermore, ASM Dynamic volume mirror recovery operations are coordinated cluster-wide such that the death of a cluster node or ASM instance does not result in mirror inconsistency or any other form of data corruption. If ACFS detects inconsistent file metadata returned from a read operation, based on the checksum, ACFS takes the appropriate action to isolate the affected file system components and generate a notification that fsck should be run as soon as possible. Each time the file system is mounted, a notification is generated with a system event logger message until fsck is performed. Note that fsck only repairs ACFS metadata structures. If fsck runs cleanly but the user still perceives the user data files to be corrupted, the user should restore those files from a backup, but the file system itself does not need to be re-created.
Unlike other file systems, when an ACFS metadata write operation fails, the ACFS drivers do not interrupt or notify the operating system environment; instead, ACFS isolates errors by placing the file system in an offline error state. For these cases, a umount of the “offlined” file system is required for that node. If a node fails, then another node will recover the failed node’s transaction log, assuming it can write the metadata out to the storage.
To recover from this scenario, the affected underlying ADVM volumes must be closed and reopened by dismounting any affected file systems. After the instance is restarted, the corresponding disk group must be mounted with the volume enabled, followed by a remount of the file system.
However, it might not be possible for an administrator to dismount a file system while it is in the offline error state if there are processes referencing the file system, such as a directory of the file system being the current working directory for a process. In these cases, to dismount the file system you need to identify all processes on the node that have file system references to files and directories. The Linux fuser and lsof commands will list processes with open files.
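As a rough illustration, the following commands can be used to track down and clear such references before dismounting; the mount point /u01/acfsmounts/acfs1 is an assumed example, not a name taken from this configuration.

# List PIDs of processes using the mounted ACFS file system (assumed mount point)
fuser -m /u01/acfsmounts/acfs1
# List the open files held on that file system, with the owning processes
lsof /u01/acfsmounts/acfs1
# Once the references are gone, the offlined file system can be dismounted
umount /u01/acfsmounts/acfs1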
ADVM and Dirty Region Logging
If ASM Dynamic volumes are created in ASM redundancy disk groups (normal or high), dirty region logging (DRL) is enabled via an ASM DRL volume file.
The ADVM driver will ensure ASM Dynamic volume mirror consistency and the recovery of only the dirty regions in cases of node and ASM instance failures. This is accomplished using DRL, which is an industry-common optimization for mirror consistency and the recovery of mirrored extents.
ADVM Design
Most file systems are created from an OS disk device, which is generally a logical volume device, created by a logical volume manager (LVM). For example, a Linux ext3 file system is generally created over a Linux LVM2 device with an underlying logical volume driver (LVD). ACFS is similar in this regard; however, it is created over an ADVM volume device file and all volume I/O is processed through the ADVM driver.
In Oracle ASM 11g Release 2, ASM introduces a new ASM file type, called asmvol, that is associated with ADVM Dynamic volumes. These volume files are similar to other ASM file types (such as archive logs, database data files, and so on) in that once they are created their extents are evenly distributed across all disks in the specified disk group. An ASM Dynamic volume file must be wholly contained in a single disk group, but there can be many volume files in one disk group.
As of 11.2.0.3, the default ADVM volume is allocated with 8MB extents across four columns and a fine-grained stripe width of 128KB. ADVM writes data as 128KB stripe chunks in round-robin fashion to each of the four columns and fills a stripe set of four 8MB extents with 256 stripe chunks (8MB / 128KB = 64 chunks per extent, times four extents) before moving to a second stripe set of four 8MB extents. Although the ADVM Dynamic volume extent size and stripe columns can optionally be specified at volume creation, this is not a recommended practice.
If the ASM disk group AU size is 8MB or less, the ADVM extent size is 8MB. If the ASM disk group AU size is configured larger than 8MB, ADVM extent size is the same as the AU size.
Note that setting the number of columns on an ADVM dynamic volume to 1 effectively turns off fine-grained striping for the ADVM volume, but maintains the coarse-grained striping of the ASM file extent distribution. Consecutive stripe columns map to consecutive ASM extents. For example, if in a normal ASM file, four ASM extents map to two LUNs in alternating order, then the stripe-column-to-LUN mapping works the same way.
This section covers how to create an ADVM volume, which will subsequently be used to create an ACFS file system. Note that these steps are not needed if you deploy the ACFS file system using ASM Configuration Assistant (ASMCA); if you are not using ASMCA, you will need to perform them manually.
To create a volume in a previously created disk group, use the following command:
image
image
The -G flag specifies the disk group where this volume will be created. This will create an ADVM volume file that is 25GB. All the extents of this volume file are distributed across all disks in the DATA disk group. Note that the volume name is limited to 11 characters on Linux and 23 characters on AIX.
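A minimal sketch of such a volume creation with ASMCMD follows; the disk group name DATA and the volume name testvol are assumptions, not necessarily the exact names used above.

# Create a 25GB ADVM volume named testvol in the DATA disk group
asmcmd volcreate -G DATA -s 25G testvol
# Display the volume attributes, including the generated ADVM device name
asmcmd volinfo -G DATA testvol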
Once the ASM Dynamic volume file is created inside ASM, an ADVM volume device (OS device node) will be automatically built in the /dev/asm directory. On Linux, udev functionality must be enabled for this device node to be generated. A udev rules file, /etc/udev/rules.d/55-usm.rules, contains the rule for /dev/asm/* and sets the group permission to asmadmin.
For clarity, we refer to the logical volume inside ASM as the ASM Dynamic volume, and the OS device in /dev/asm as the ADVM volume device. There is a direct one-for-one mapping between an ASM Dynamic volume and its associated ADVM volume device:
image
image
Notice that the volume name is included as part of the ADVM volume device name. The three-digit (suffix) number is the ADVM persistent cluster-wide disk group number. It is recommended that you provide a meaningful volume name so that the corresponding OS device is easily identifiable.
The ADVM volume device filenames are unique and persistent across all cluster nodes. The ADVM volume devices are created when the ASM instance is active with the required disk group mounted and dynamic volumes enabled.
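To illustrate the mapping, listing the /dev/asm directory shows one device node per enabled volume; the device name and disk group number below are assumed examples.

# List the ADVM volume device nodes (names follow the pattern <volume>-<dg number>)
ls -l /dev/asm
# e.g., a volume named testvol in disk group number 464 would appear as
#   /dev/asm/testvol-464, group-owned by asmadmin per the udev rule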
Note that the on-disk format for ASM and ACFS is consistent between 32-bit and 64-bit Linux. Therefore, customers wanting to migrate their file system from 32-bit to 64-bit should not have to convert their ASM- and ACFS-based metadata.
ACFS Configuration and Deployment
Generally space, volume, and file system management are performed with a typical workflow process. The following describes how a traditional file system is created:
image   The SAN administrator carves out the LUNs based on performance and availability criteria. These LUNs are zoned and presented to the OS as disk devices.
image   The system administrator encapsulates these disks into a storage pool or volume group. From this pool, logical volumes are created. Finally, file systems are created over the logical volumes.
So how does this change in the context of ACFS/ADVM/ASM? Not much. The following is the process flow for deploying ACFS/ADVM/ASM:
image   The SAN administrator carves out LUNs based on the defined application performance and availability SLA, following the ASM best practice that all disks in an ASM disk group should be of the same size and have similar performance characteristics. This guideline makes LUN management and provisioning easier. Finally, the provisioned LUNs are zoned and presented to the OS as disk devices.
image   The system administrator or DBA creates or expands an ASM disk group using these disks. From the disk group, ASM (logical) volumes are created. If using SQL*Plus or ASMCMD, the user needs to be connected as SYSASM for ADVM or ACFS configuration. Alternatively, ASMCA or EM can be used. Finally, file systems are created over these volume devices.
Note that although Oracle Grid Infrastructure 11g Release 2 supports RH/EL 4, 5, and 6, ACFS/ADVM requires a minimum of RH/EL 5.x. If you try to deploy ACFS on unsupported platforms, an error message similar to the following will be displayed:
image
image
Configuring the Environment for ACFS
In this section, we discuss ACFS and ADVM concepts as well as cover the workflow in building the ACFS infrastructure.
Although you can configure/manage ACFS in several ways, this chapter primarily illustrates ASMCMD and ASMCA. Note that every ASMCMD command shown in this chapter can also be performed in ASMCA, Enterprise Manager, or SQL*Plus. However, using ASMCA, Enterprise Manager, or ASMCMD is the recommended method for administrators managing ASM/ACFS.
The first task in setting up ACFS is to ensure the underlying disk group is created and mounted. Creating disk groups is described in Chapter 4.
Before the first volume in an ASM disk group is created, any dismounted disk groups must be mounted across all ASM instances in the cluster. This ensures uniqueness of all volume device names. The ADVM driver cannot verify that all disk groups are mounted; this must be ensured by the ASM administrator before adding the first volume in a disk group.
Also, compatible.asm and compatible.advm must be minimally set to 11.2.0.0 if this disk group is going to hold ADVM Dynamic volumes. The compatible.rdbms can be set to any valid value.
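As a sketch, the compatibility attributes can be set on an existing disk group with ASMCMD; the disk group name DATA is an assumption.

# Raise the minimum compatibility settings required for ADVM volumes
asmcmd setattr -G DATA compatible.asm 11.2.0.0.0
asmcmd setattr -G DATA compatible.advm 11.2.0.0.0
# Review the disk group attributes to confirm
asmcmd lsattr -l -G DATA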
As part of the ASM best practices, we recommend having two ASM disk groups. Place the ADVM volumes in either the DATA or RECO disk group, depending on the file system content. For example, the DATA disk group can be used to store ACFS file systems that contain database-related data or general-purpose data. The RECO disk group can be used to store ACFS file systems that hold database-recovery-related files, such as archived log files, RMAN backups, Datapump dump sets, and even database ORACLE_HOME backups (possibly zipped backups). It is also very common to use ACFS file systems for GoldenGate, more specifically for storing trail files; in these scenarios, a separate disk group is configured to hold the GoldenGate trail files. This also requires ACFS patch 11825850. Storing archived log files, RMAN backups, and Datapump dump sets is only supported in ACFS 11.2.0.3 and above. However, ACFS does not currently support snapshots of file systems housing these files.
ACFS Deployment
The two types of ACFS file systems are CRS Managed ACFS file systems and Registry Managed ACFS file systems. Both of these ACFS solutions have similar benefits with respect to startup/recovery and leverage CRS dependency modeling, such as unmounting and remounting offline file systems, mounting (pulling up) a disk group if it is not mounted, and enabling volumes if they are not enabled.
CRS Managed ACFS file systems have associated Oracle Clusterware resources and generally have defined interdependencies with other Oracle Clusterware resources (database, ASM disk group, and so on). CRS Managed ACFS is specifically designed for ORACLE_HOME file systems. Registry Managed ACFS file systems are general-use file systems that are completely transparent to Oracle Clusterware and its resources. There are no structural differences between the CRS Managed ACFS and Registry Managed ACFS file systems. The differences are strictly around Oracle Clusterware integration. Once an ACFS file system is created, all standard Linux/Unix file system commands can be used, such as the df, tar, cp, and rm commands. In 11gR2, storing any files that can be natively stored in ASM is not supported; in other words, you cannot store Oracle Database files (control files, data files, archived logs, online redo logs, and so on) in ACFS. Also, the Grid Infrastructure Home cannot be installed in ACFS; it must be installed in a separate file system, such as in ext3.
CRS Managed ACFS File Systems
As stated previously, the primary use for CRS Managed ACFS is storing the database Oracle Home. Figures 10-1 through 10-3 show how to create a CRS Managed ACFS file system. It is recommended that ACFS file systems used to house your database ORACLE_HOME have an OFA-compliant directory structure (for example, $ORACLE_BASE/product/11.2/dbhomex, where x represents the database home). Using ASMCA is the recommended method for creating a CRS Managed ACFS file system. ASMCA creates the volume and file system and establishes the required Oracle Clusterware resources.
image
image
image
FIGURE 10-1.   ASMCA screen for file system management
image
image
image
FIGURE 10-2.   File system creation screen
image
image
image
FIGURE 10-3.   Complete file system creation
Figure 10-1 displays the ASMCA screen used to launch the CRS Managed ACFS file system creation wizard. Right-click a specific disk group and then select the Create ACFS for Database Home option.
Figure 10-2 displays the Create ACFS Hosted Database Home screen, which will prompt for the volume name, file system mount point, and file system owner. The ADVM volume device name is taken from the volume name, which is specified during volume creation.
Figure 10-3 displays the Run ACFS Script window.
This script, which needs to run as root, is used to add the necessary Oracle Clusterware resources for the file system as well as mount the file system. Here’s what the script contains:
image
image
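While the generated script differs by environment, it typically registers the file system as a Clusterware resource and then starts (mounts) it. The following is only a hedged sketch; the device name, disk group, volume name, mount point, and owner are assumptions.

# Register the CRS Managed ACFS file system with Oracle Clusterware (run as root)
srvctl add filesystem -d /dev/asm/dbhome1-464 -g DATA -v dbhome1 \
  -m /u01/app/oracle/acfsmounts/dbhome1 -u oracle
# Start the resource, which mounts the file system on the cluster nodes
srvctl start filesystem -d /dev/asm/dbhome1-464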
Once this script is run, the file system will be mounted on all nodes of the cluster.
Registry Managed ACFS
Besides storing the database software binaries on ACFS, ACFS can be used to store other database-related content. The following are some use-case scenarios for a Registry Managed ACFS file system:
image   Automatic Diagnostic Repository (ADR)   A file system for a diagnostics logging area. Having a distinct file system for diagnostic logs provides a more easily managed container rather than placing this within ORACLE_BASE or ORACLE_HOME locations. Plus, a node’s logs are available even if that node is down for some reason.
image   External database data   This includes Oracle external file types, such as BFILEs and ETL data/external tables.
image   Database exports   Create a file system for storing the old database exports as well as Datapump exports. This is only supported in Grid Infrastructure stack 11.2.0.3 and above.
image   Utility log file repository   For customers who want a common log file repository for utilities such as RMAN, SQL*Loader, and Datapump, an ACFS file system can be created to store these log files.
image   Directory for UTL_FILE_DIR and Extproc   In a RAC environment, a shared file system plays an important role for customers who use UTL_FILE_DIR (PL/SQL file I/O), a shared file repository for CREATE DIRECTORY locations, and external procedures via Extproc calls.
image   Middle-tier shared file system for Oracle applications   For example, E-Business Suite as well as Siebel Server have a shared file system requirement for logs, shared documents, reports output, and so on.
To use an ACFS file system created in the database tier, you can export it over NFS and NFS-mount it on the middle-tier nodes. This allows customers to consolidate their storage and have integrated storage management across tiers.
Creating Registry Managed ACFS File Systems
This section describes Registry Managed ACFS file system creation. In this case, a node-local file system will be created for a given RAC node. To create a Registry Managed ACFS file system, you must create an ADVM volume first. Note that root access is not required to create an ACFS file system, but mounting and creating the CRS resource will require root access. Here are the steps:
1.   Create the volume:
image
2.   Once the ASM Dynamic volume is created and enabled, the file system can be created over the ADVM volume device. Create the file system:
image
image
NOTE
When a volume is created, it is automatically enabled.
3.   Mount the file system:
image
4.   Verify that the volume was created and mounted:
image
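The preceding steps might look like the following sketch; the disk group DATA, volume name logvol, volume size, and mount point are assumptions.

# 1. Create the ASM Dynamic volume (it is enabled automatically)
asmcmd volcreate -G DATA -s 10G logvol
# 2. Create the ACFS file system over the ADVM volume device
mkfs -t acfs /dev/asm/logvol-464
# 3. Mount the file system (run as root)
mkdir -p /u01/acfsmounts/logs
mount -t acfs /dev/asm/logvol-464 /u01/acfsmounts/logs
# 4. Verify the volume and the mounted file system
asmcmd volinfo -G DATA logvol
df -h /u01/acfsmounts/logs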
Setting Up Registry Managed ACFS
An ACFS Mount Registry is used to provide a persistent entry for each Registry Managed ACFS file system that needs to be mounted after a reboot. This ACFS Mount Registry is very similar to /etc/fstab on Linux. However, in cluster configurations, file systems registered in the ACFS Mount Registry are automatically mounted globally, similar to a cluster-wide mount table. This is the added benefit Oracle ACFS Mount Registry has over Unix fstab. The ACFS Mount Registry can be probed, using the acfsutil command, to obtain file system, mount, and file information.
When a general-purpose ACFS file system is created, it should be registered with the ACFS Mount Registry to ensure that the file system gets mounted on cluster/node startup. Users should not create ACFS Mount Registry entries for CRS Managed ACFS file systems.
Continuing from the previous example, where an ADVM volume and a corresponding file system were created, we will now register those file systems in the ACFS Mount Registry on their respective nodes so they can be mounted on node startup. When ACFS is in a cluster configuration, the acfsutil registry command can be run from any node in the cluster:
image
image
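A sketch of the registration follows, reusing the assumed device and mount point from the previous example.

# Add a persistent ACFS Mount Registry entry (can be run from any cluster node)
/sbin/acfsutil registry -a /dev/asm/logvol-464 /u01/acfsmounts/logs
# List the current registry entries to confirm
/sbin/acfsutil registry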
The following shows some examples of the acfsutil registry command. Note that the acfsutil command can be run using either the root or oracle user. To check the ACFS Mount Registry, use the following query:
image
image
To get more detailed information on a currently mounted file system, use the acfsutil info fs command:
image
image
image
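For example, under the same assumed mount point, the command takes the following form:

# Display size, free space, volume device, and mount details for the file system
/sbin/acfsutil info fs /u01/acfsmounts/logs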
Note that because these file systems will be used to store database-related content, they will need to have CRS resources against them. The following example illustrates how to create these resources:
image
image
image
NOTE
You cannot have ACFS registry entries and CRS resources defined for the same file system.
The same operation can be performed using ASMCA or Enterprise Manager.
Managing ACFS and ADVM
This section describes how to manage typical tasks related to ACFS and ADVM as well as illustrates the relationships among ACFS, ADVM, ASM, and Oracle Clusterware.
ACFS File System Resize
An ACFS file system can be dynamically grown and shrunk while it is online and with no impact to the user. The ACFS file system size and attributes are managed using the /sbin/acfsutil command. The ACFS extend and shrink operations are performed at the file system layer, which implicitly grows the underlying ADVM volume in multiples of the volume allocation unit.
The following example shows the acfsutil size command:
image
image
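A sketch of an online resize follows; the mount point and growth increment are assumptions.

# Grow the mounted ACFS file system by 5GB; the underlying ADVM volume
# is extended automatically in multiples of the volume allocation unit
/sbin/acfsutil size +5G /u01/acfsmounts/logs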
Currently, the limit on the number of times an ACFS file system can be resized is four.
ACFS starts with one extent and can grow out to four more extents, for a total of five global bitmap extents. To determine the number of times the file system has been resized, use the acfsdbg utility to list the internal file system storage bitmap:
image
image
Look at the Extent[*] fields for nonzero Length fields. The number of remaining zero-length extents indicates the minimum number of times you can still grow the file system. If the number of times the file system has been resized exceeds five, you need to take the mount point offline globally (across all nodes) and then run fsck -a to consolidate the internal storage bitmap for resizing:
image
image
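A hedged sketch of that consolidation step, assuming the file system has been dismounted on every node and using the assumed device name from earlier:

# Run a repairing fsck against the ADVM volume device to consolidate the
# internal storage bitmap (the file system must be dismounted cluster-wide)
fsck -a -t acfs /dev/asm/logvol-464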
Unmounting File Systems
Unmounting file systems involves a typical OS umount command. Before unmounting the file system, ensure that it is not in use. This may involve stopping dependent databases, jobs, and so on. In-use file systems cannot be unmounted. You can use various methods to show open file references of a file system:
image   Linux/Unix lsof command   Shows open file descriptors for the file system
image   Unix/Linux fuser command   Displays the PIDs of processes using the specified file systems
Any users or processes listed should be logged off or killed (kill -9).
Next we’ll look at the steps required to unmount an ACFS file system. The steps to unmount a CRS Managed ACFS file system and a Registry Managed ACFS file system differ slightly.
To unmount a general-purpose ACFS file system, unmount the file system. This command needs to be run on all nodes where the file system is currently mounted:
image
image
The steps are different to unmount a CRS Managed ACFS file system. Because CRS Managed ACFS file systems have associated CRS resources, the following steps need to be performed to stop the Oracle Clusterware resource and unmount the file system:
1.   Once the application has stopped using the file system, you can stop the Oracle Clusterware file system resource. The following command will also unmount the file system. Here, we unmount the file system across all nodes.
image
image
NOTE
If the srvctl stop filesystem command does not include a node or node list, the file system will be unmounted on all nodes.
2.   To unmount only a specific node, specify the node name in the –n flag:
image
3.   Verify the file system resource is unmounted:
image
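These steps might look like the following sketch; the device name and node name are assumptions.

# 1. Stop the Clusterware file system resource (unmounts on all nodes)
srvctl stop filesystem -d /dev/asm/dbhome1-464
# 2. Or unmount on a single node only
srvctl stop filesystem -d /dev/asm/dbhome1-464 -n racnode1
# 3. Verify the state of the file system resource
srvctl status filesystem -d /dev/asm/dbhome1-464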
As stated, it is highly recommended that you unmount any ACFS file systems first before the ASM instance is shut down. A forced shutdown or failure of an ASM instance with a mounted ACFS file system will result in I/O failures and dangling file handles; in other words, the ACFS file system user data and metadata that was written at the time of the termination may not be flushed to storage before ASM storage is fenced off. Thus, a forced shutdown of ASM will result in the ACFS file system having an offline error state. In the event that a file system enters into an offline error state, the ACFS Mount Registry and CRS Managed Resource action routines attempt to recover the file system and return it to an online state by unmounting and remounting the file system.
Deleting File Systems
Similar to unmounting ACFS file systems, the steps to delete an ACFS file system slightly differ between CRS Managed ACFS and Registry Managed ACFS.
To unmount and delete a Registry Managed ACFS file system, execute the following, which needs to be run on all nodes where the file system is currently mounted:
image
image
Next, delete the acfsutil registry entry for the file system:
image
image
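A sketch of the removal, using the assumed mount point and volume from earlier; the final voldelete is optional and only needed if the volume itself should be dropped.

# Unmount on every node where the file system is mounted
umount /u01/acfsmounts/logs
# Remove the persistent ACFS Mount Registry entry
/sbin/acfsutil registry -d /u01/acfsmounts/logs
# Optionally drop the underlying ASM Dynamic volume as well
asmcmd voldelete -G DATA logvol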
CRS Managed ACFS file systems have Oracle Clusterware resources that need to be removed. To do this, follow these steps:
1.   Run the following command, which unmounts the file system and stops the Oracle Clusterware resources. (Although this command can be executed from any node in the cluster, it will have to be rerun for each node.)
image
image
NOTE
The format for stop filesystem is srvctl stop filesystem -d <device name> -n <nodelist where fs is mounted>.
2.   Repeat this for each node.
3.   Verify the file system resource is unmounted:
image
4.   Once this file system is stopped and unmounted on all nodes where it was started, the Clusterware resource definitions need to be removed (this should only be run once from any node) and then the volume can be deleted:
image
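A sketch of steps 1 through 4 with assumed device, node, disk group, and volume names:

# 1-2. Stop (unmount) the file system resource on each node where it is mounted
srvctl stop filesystem -d /dev/asm/dbhome1-464 -n racnode1
srvctl stop filesystem -d /dev/asm/dbhome1-464 -n racnode2
# 3. Verify that the resource is offline
srvctl status filesystem -d /dev/asm/dbhome1-464
# 4. Remove the Clusterware resource (run once), then drop the volume
srvctl remove filesystem -d /dev/asm/dbhome1-464
asmcmd voldelete -G DATA dbhome1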
ADVM Management
Generally it is not necessary to perform ADVM management; however, in rare cases, volumes may need to be manually disabled or dropped. The /sbin/advmutil and ASMCMD commands should be used for these tasks. For details on command usage and various command options, review the Oracle Storage Administrator’s Guide.
You can get volume information using the advmutil volinfo command:
image
image
The same information can be displayed by the asmcmd volinfo command; however, the asmcmd uses the ASM Dynamic volume name, whereas the advmutil uses the ADVM volume device name:
image
image
To enable or disable the ASM Dynamic volumes, you can use ASMCMD. Enabling the volume instantiates (creates) the volume device in the /dev/asm directory. Here’s the command to enable the volume:
image
image
And the following command can be used to disable the volume:
image
image
The disable command only disables the volume and removes the ADVM volume device node entry from the OS (or more specifically from the /dev/asm directory); it does not delete the ASM Dynamic volumes or reclaim space from the ASM disk group. To delete (drop) the ASM Dynamic volume, use the drop command:
image
image
image
NOTE
You cannot delete or disable a volume or file system that is currently in use.
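The commands discussed above can be summarized in the following sketch; the disk group DATA and volume logvol are assumptions.

# Volume information by ADVM volume device name or by ASM Dynamic volume name
/sbin/advmutil volinfo /dev/asm/logvol-464
asmcmd volinfo -G DATA logvol
# Enable (instantiate the /dev/asm device node) or disable the volume
asmcmd volenable -G DATA logvol
asmcmd voldisable -G DATA logvol
# Drop the ASM Dynamic volume and return its space to the disk group
asmcmd voldelete -G DATA logvol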
ACFS Management
Like any other file system, ACFS consists of background processes and drivers. This section describes the background processes and their roles, reviews the ACFS kernel drivers, and discusses the integration of ACFS and Oracle Clusterware.
ASM/ACFS-Related Background Processes
Several new ASM instance background processes were added in Oracle 11gR2 to support the ACFS infrastructure. The following processes will be started upon the first use of an ADVM volume on the node:
image   VDBG   The Volume Driver Background process is very similar to the Umbilicus Foreground (UFG) process, which is used by the database to communicate with ASM. The VDBG will forward ASM requests to the ADVM driver. This communication occurs in the following cases:
image   When an extent is locked/unlocked during ASM rebalance operations (ASM disk offline, add/drop disk, force dismount disk group, and so on).
image   During volume management activities such as a volume resize.
image   VBGx   The Volume Background processes are a pool of worker processes used to manage requests from the ADVM driver and coordinate with the ASM instance. A typical case for this coordination is the opening/closing of an ADVM volume file (for example, when a file system mount or unmount request occurs).
image   VMBx   The Volume Membership Background process implements an I/O barrier and I/O fencing mechanism. This ensures that ASM instance recovery is not performed until all ADVM I/Os have completed.
image   ACFSx   The ACFS Background process coordinates cluster membership and group membership with the ASM instance. The ACFS process communicates this information to the ACFS driver, which in turn communicates with both the OKS and ADVM drivers. When a membership state/transition change is detected, an ioctl call is sent down to the kernel, which then begins the process of adding/removing a node from the cluster.
The following shows the new background processes highlighted in bold:
image
image
image
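As a rough check, the volume-related background processes can be listed from the operating system; the exact process names vary by version and instance name, so the pattern below is only an illustrative assumption.

# List the ADVM/ACFS-related ASM background processes on this node
ps -ef | egrep 'asm_(vdbg|vbg|vmb|acfs)' | grep -v egrep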
The following are ACFS kernel threads dedicated to the management of ACFS and ADVM volume devices:
image
image
The user mode processes, discussed earlier, are used to perform extent map services in support of the ADVM driver. For example, ASM file extents map the ADVM volume file to logical blocks located on specific physical devices. These ASM extent pointers are passed to the ADVM driver via the user space processes. When the driver receives I/O requests on the ADVM volume device, the driver redirects the I/O to the supporting physical devices as mapped by the target ADVM volume file’s extents. Because of this mapping, user I/O requests issued against ACFS file systems are sent directly to the block device (that is, ASM is not in the I/O path).
ACFS/ADVM Drivers
The installation of the Grid Infrastructure (GI) stack also installs the ACFS/ADVM drivers and utilities. Three drivers support ACFS and ADVM. They are dynamically loaded (in top-down order) by the OHASD process during Oracle Clusterware startup, and they are installed whether or not ACFS will be used:
image   oracleoks.ko   This is the kernel services driver, providing memory management support for ADVM/ACFS as well as lock and cluster synchronization primitives.
image   oracleadvm.ko   The ADVM driver maps I/O requests against an ADVM volume device to blocks in a corresponding on-disk ASM file location. This ADVM driver provides volume management driver capabilities that directly interface with the file system.
image   oracleacfs.ko   This is the ACFS driver, which supports all ACFS file system file operations.
During install, kernel modules are placed in /lib/modules/2.6.18-8.el5/extra/usm. These loaded drivers can be seen (on Linux) via the lsmod command:
image
image
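For example, a quick check that the three modules are loaded might look like this:

# Confirm the ACFS, ADVM, and OKS kernel modules are loaded
lsmod | grep oracle
# Expected entries include oracleacfs, oracleadvm, and oracleoks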
Linux OS vendors now support a “white list” of kernel APIs (kABI compatible or weak modules), which are defined not to change in the event of OS kernel updates or patches. The ACFS kernel APIs were added to this white list, which enables ACFS drivers to continue operation across certain OS upgrades and avoids the need for new drivers with every OS kernel upgrade.
Integration with Oracle Clusterware
When a CRS Managed ACFS file system is created, Oracle Clusterware will manage the resources for ACFS. In Oracle Clusterware 11g Release 2, the Oracle High Availability Services daemon (OHASd) component is called from the Unix/Linux init daemon to start up and initialize the Clusterware framework. OHASd’s main functions are as follows:
image   Start/restart/stop the Clusterware infrastructure processes.
image   Verify the existence of critical resources.
image   Load the three ACFS drivers (listed previously).
The correct ordering of the startup and shutdown of critical resources is also maintained and managed by OHASd; for example, OHASd will ensure that ASM starts after ACFS drivers are loaded.
Several key Oracle Clusterware resources are created when Grid Infrastructure for Cluster is installed or when the ADVM volume is created. Each of these CRS resources is node local and will have a corresponding start, stop, check, and clean action, which is executed by the appropriate Clusterware agents. These resources include the following:
image   ACFS Driver resource   This resource is created when Grid Infrastructure for Cluster is installed. This resource is created as ora.drivers.acfs and managed by the orarootagent. The ASM instance has a weak dependency against the ACFS driver resource. The Clusterware start action will start the ACFS driver resource when the ASM instance is started, and will implicitly load the ACFS/ADVM kernel drivers. These drivers will remain loaded until the GI stack is shut down.
image   ASM resource   This resource, ora.asm, is created as part of Grid Infrastructure for Cluster installation. This resource is started as part of the standard bootstrap of the GI stack and managed by the oraagent agent.
image   ACFS Registry resource   The Registry resource is created as part of the Grid Infrastructure for Cluster installation and managed by orarootagent. The activation of this resource will also mount all file systems listed in the ACFS Registry. This Registry resource also does file system recovery, via check action script, when file systems in an offline state are detected.
image   ACFS file system resource   This resource is created as ora.<diskgroup>.<volume>.acfs when the ASM Configuration Assistant (ASMCA) is used to create a DB Home file system. When the Database Configuration Assistant (DBCA) is used to create a database using the DB Home file system, an explicit dependency between the database, the ACFS file system hosting the DB Home, and ASM is created. Thus, a startup of ASM will pull up the appropriate ACFS file system along with the database.
ACFS/Clusterware Resource
Once the Oracle database software is installed in the CRS Managed ACFS file system and the database is created in ASM, Oracle Clusterware will implicitly create the resource dependencies between the database, the CRS Managed ACFS file system, and the ASM disk group. These dependencies are shown via the start/stop actions for a database using the following command:
image
image
The output has been condensed to focus on the start/stop dependencies items:
image
image
image
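A sketch of such a query follows; the database resource name ora.orcl.db is an assumption.

# Print the resource profile and filter the dependency attributes
crsctl stat res ora.orcl.db -p | egrep 'START_DEPENDENCIES|STOP_DEPENDENCIES'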
ACFS Startup Sequence
The start action, which is the same for CRS Managed and general-purpose file system resources, is to mount the file system. The CRS resource action includes confirming that the ACFS drivers are loaded, the required disk group is mounted, the volume is enabled, and the mount point is created, if necessary. If the file system is successfully mounted, then the state of the resource is set to online; otherwise, it is set to offline.
When the OS boots up and Oracle Clusterware is started, the following Clusterware operations are performed:
image   OHASd will load the ACFS drivers and start ASM.
image   As part of the ASM instance startup, all the appropriate disk groups, as listed in the ASM asm_diskgroup parameter, will also be mounted. As part of the disk group mount, all the appropriate ASM Dynamic volumes are enabled.
image   The CRS agent will start up and mount the CRS Managed ACFS file systems.
image   The appropriate CRS agents will start their respective resources. For example, oraagent will start up the database. Note that just before this step, all the resources necessary for the database to start are enabled, such as the ASM instance, disk group, volume, and CRS Managed ACFS file systems.
image   The ACFS Mount Registry agent will mount any ACFS file systems that are listed in the ACFS Mount Registry.
ACFS Shutdown Sequence
Shutdown includes stopping of the Oracle Grid Infrastructure stack via the crsctl stop cluster command or node shutdown. The following describes how this shutdown impacts the ACFS stack:
image   As part of the infrastructure shutdown of CRS, the Oracle Clusterware orarootagent will perform unmounts for file systems contained on ADVM volume devices. If any file systems could not be unmounted because they are in use (open file references), then an internal grace period is set for the processes with the open file references. At the end of the grace period, if these processes have not exited, they are terminated and the file systems are unmounted, resulting in the closing of the associated dynamic volumes.
image   The ASM Dynamic volumes are then disabled; ASM and its related resources are stopped.
All ACFS and ADVM logs for startup/shutdown and errors will be logged in the following places:
image   Oracle Clusterware home (for example, $ORACLE_HOME/log/<hostname>/alert.log)
image   ASM alert.log
image   $ORACLE_HOME/log/<hostname>/agent/ohasd/rootagent
ACFS Startup on Grid Infrastructure for Standalone
Grid Infrastructure for a standalone server (Oracle Restart) does not support managing root-based ACFS start actions. Thus, the following operations are not automatically performed:
image   Loading Oracle ACFS drivers
image   Mounting ACFS file systems listed in the Oracle ACFS Mount Registry
image   Mounting resource-based Oracle ACFS database home file systems
The following steps outline how to automate the load of the drivers and mount the file system (note that the root user needs to perform this setup):
1.   Create an initialization script called /etc/init.d/acfsload. This script contains the runlevel configuration and the acfsload command:
image
image
2.   Modify the permissions on the /etc/init.d/acfsload script to allow it to be executed by root:
image
3.   Use the chkconfig command to build the appropriate symbolic links for the rc2.d, rc3.d, rc4.d, and rc5.d runlevel directories:
image
4.   Verify that the chkconfig runlevel is set up correctly:
image
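A consolidated, hypothetical sketch of steps 1 through 4 follows; the Grid home path and the chkconfig start/stop priorities are assumptions.

# 1. Create the init script that loads the ACFS drivers at boot (run as root)
cat > /etc/init.d/acfsload <<'EOF'
#!/bin/sh
# chkconfig: 2345 30 21
# description: Load Oracle ACFS/ADVM kernel drivers
/u01/app/11.2.0/grid/bin/acfsload start
EOF
# 2. Make the script executable by root
chmod u+x /etc/init.d/acfsload
# 3. Create the runlevel symbolic links
chkconfig --add acfsload
# 4. Verify the runlevel configuration
chkconfig --list acfsload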
Finally, these file systems can be listed in Unix/Linux /etc/fstab and a similar rc initialization script can be used to mount them:
image
image
Exporting ACFS for NFS Access
Many customers want to replace their existing NFS appliances that are used for middle-tier apps with low-cost solutions. For example, Siebel architectures require a common file system (the Siebel file system) between nodes on the mid-tier to store data and physical files used by Siebel clients and the Siebel Enterprise Server. A similar common file system is required for E-Business Suite and PeopleSoft, as well as other packaged applications.
An NAS file access protocol is used to communicate file requests between an NAS client system and an NAS file server. NAS file servers provide the actual storage. ACFS can be configured as an NAS file server and, as such, can support remote file access from NAS clients that are configured with either NFS or CIFS file access protocols. Because ACFS is a cluster file system, it can support a common file system namespace cluster-wide; thus, each cluster node has access to the file system. If a node fails, the Grid Infrastructure stack transitions the state of the cluster, and the remaining cluster nodes continue to have access to ACFS file system data. Note that SCAN names cannot be used as NFS node service names; NFS mounting the ACFS exported file system using “hard” mount options is not supported.
In the current version, there is no failover of the NFS mount. The file system will need to be remounted on another node in the cluster.
When exporting ACFS file systems through NFS on Linux, you must specify the file system identification handle via the fsid exports option. The fsid value can be any 32-bit number. The use of the file system identification handle is necessary because the ADVM block device major numbers are not guaranteed to be the same across reboots of the same node or across different nodes in the cluster. The fsid exports option forces the file system identification portion of the file handle, which is used to communicate with NFS clients, to be the specified number instead of a number derived from the major and minor number of the block device on which the file system is mounted. If the fsid is not explicitly set, a reboot of the server (housing the ACFS file system) will cause NFS clients to see inconsistent file system data or detect “Stale NFS file handle” errors.
The following guidelines must be followed with respect to the fsid:
image   The value must be unique among all the exported file systems on the system.
image   The value must be unique among members of the cluster and must be the same number on each member of the cluster for a given file system.
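A sketch of an export entry that follows these guidelines; the mount point, client specification, and fsid value are assumptions, and the same fsid must be used on every cluster node for this file system.

# /etc/exports entry for the ACFS mount point with an explicit fsid (run as root)
echo '/u01/acfsmounts/logs *(rw,sync,fsid=101)' >> /etc/exports
# Re-export and verify
exportfs -ra
showmount -e localhost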
Summary
The ASM Cluster File System (ACFS) extends Automatic Storage Management (ASM) by providing a robust, modern, general-purpose, extent-based, journaling file system for files beyond Oracle database files, making ASM a complete storage management solution. ACFS provides support for files such as Oracle binaries, report files, trace files, alert logs, and other application data files. ACFS scales from small files to very large files (exabytes) and supports large numbers of nodes in a cluster.
In Oracle Grid Infrastructure 11g Release 2, ASM simplifies, automates, and reduces cost and overhead by providing a unified and integrated solution stack for all your file management needs, thus eliminating the need for third-party volume managers, file systems, and Clusterware platforms.
With the advent of ACFS, Oracle ASM 11g Release 2 has the capability to manage all data, including Oracle database files, Oracle Clusterware files, and nonstructured general-purpose data such as log files and text files.
