7. ASM Files, Aliases, and Security
When an ASM disk group is created, a hierarchical file system structure is built within it. This hierarchical layout is very similar to the Unix or Windows file system hierarchy. ASM files, stored within this file system structure, are the objects that RDBMS instances access. They come in the form of data files, control files, spfiles, redo log files, and several other file types. The RDBMS treats ASM-based database files just like standard file system files.
ASM Filenames
When you create a database file (using the
create tablespace, add datafile, or add logfile command) or even an
archive log file, ASM explicitly creates the ASM file in the disk group
specified. The following example illustrates how database files can be
created in an ASM disk group:
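A representative statement, using the DATA disk group and the ISHAN tablespace from this chapter's examples (the size shown is illustrative), might look like this:
CREATE TABLESPACE ishan DATAFILE '+DATA' SIZE 100M;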
This command creates a data file in the DATA disk group. The ASM filename is generated automatically upon successful creation of the data file. Once the file is created, it becomes visible to the user via the standard RDBMS views, such as the V$DATAFILE view. Note that the ASM filename syntax differs from typical naming standards; ASM filenames use the following format:
+diskgroup_name/database_name/database file type/tag_name.file_number.incarnation
For example, the ASM filename of
+DATA/yoda/datafile/ishan.259.616618501 for the tablespace named ISHAN
can be dissected as follows:
+DATA This is the name of the disk group where this file was created.
yoda This specifies the name of the database that contains this file.
datafile This is the database file type—in this case, datafile. There are over 20 file types in Oracle 11g.
ISHAN.259.616618501 This
portion of the filename is the suffix of the full filename, and is
composed of the tag name, file number, and incarnation number. The tag
name in the data file name corresponds to the tablespace name. In this
example, the tag is the tablespace named ISHAN. For redo log files, the
tag name is the group number (for example, group_3.264.54632413). The
ASM file number for the ISHAN tablespace is 259. The file number in the ASM instance can be used to correlate filenames in the database instance. The incarnation number is 616618501. The incarnation number, which is derived from a timestamp, is used to provide uniqueness. Note that once the file has been created, the incarnation number does not change. The incarnation number distinguishes a new file from a previously deleted file that used the same file number.
For best practice, every database should
implement the Oracle Managed File (OMF) feature to simplify Oracle
database file administration. Here are some key benefits of OMF:
Simplified Oracle file management All
files are automatically created in a default location with
system-generated names, thus a consistent file standard is inherently in
place.
Space usage optimization Files are deleted automatically when the tablespaces are dropped.
Reduction of Oracle file management errors OMF minimizes errant file creation and deletion, and also mitigates file corruption due to inadvertent file reuse.
Enforcement of Optimal Flexible Architecture (OFA) standards OMF complies with the OFA standards for filename and file locations.
You can enable OMF by setting the
DB_CREATE_FILE_DEST and DB_RECOVERY_FILE_DEST parameters. Note that
other *_DEST variables can be used for other file types. When the
DB_CREATE_FILE_DEST parameter is set to +DATA, the default file location
for tablespace data files becomes +DATA. Moreover, you need not even
specify the disk group location in the tablespace creation statement. In
fact, when the DB_CREATE_FILE_DEST and DB_RECOVERY_FILE_DEST parameters
are set, the create database command can be simplified to the following
statement:
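With both OMF destination parameters set, something as simple as the following would suffice (the database name follows the chapter's example):
CREATE DATABASE yoda;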
You can use the following command to create a tablespace:
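For example, a minimal OMF-style statement might be:
CREATE TABLESPACE ishan;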
This command simply creates a data file in
the ISHAN tablespace under the +DATA disk group using the default data
file size of 100MB. However, this file size can be overridden and still
leverage the OMF name, as in the following example:
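For instance, the following sketch overrides the default size while still letting OMF generate the filename:
CREATE TABLESPACE ishan DATAFILE SIZE 1G;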
NOTE
OMF is not
enabled for a file when the filename is explicitly specified in
“create/alter tablespace add datafile” commands. For example, the
following is not considered an OMF file because it specifies an explicit
filename and path:
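A non-OMF sketch might look like the following (the alias path is illustrative, and its parent ASM directory must already exist):
CREATE TABLESPACE ishan DATAFILE '+DATA/yoda/mydir/ishan_01.dbf' SIZE 100M;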
However, the following is considered an OMF file:
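Here only the disk group is named, so ASM generates the filename:
CREATE TABLESPACE ishan DATAFILE '+DATA' SIZE 100M;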
The following listing shows the relationship
between the RDBMS files and the ASM file. Note that the file number from
V$ASM_FILE is embedded in the filename. The first query is executed
from the ASM instance and the second query is executed from the RDBMS
instance:
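A sketch of the two queries (the column selections are illustrative):
-- From the ASM instance:
SELECT group_number, file_number, type, ROUND(bytes/1024/1024) mb
  FROM v$asm_file;
-- From the RDBMS instance:
SELECT file#, name FROM v$datafile;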
Observe that this database contains ASM files
and a non-ASM file named NISHA01.dbf. The NISHA tablespace is stored in
a Unix file system called /u01/oradata—that is, it is not an
ASM-managed file. Because the NISHA01.dbf file is a Unix file system
file rather than an ASM file, the ASM file list from the SQL output does
not include it. This illustrates an important point: An Oracle database
can have files that reside on file systems, raw devices, and ASM,
simultaneously. However, in RAC environments, they must all be on shared
storage and accessible by all nodes in the cluster.
ASM Directories
ASM provides the capability to create
user-defined directories using the ADD DIRECTORY clause of the ALTER
DISKGROUP statement. User-defined directories can be created to support
user-defined ASM aliases (discussed later). ASM directories must start
with a plus sign (+) and a valid disk group name, followed by any
user-specified subdirectory names. The only restriction is that the
parent directory must exist before you attempt to create a subdirectory
or alias in that directory. For example, both of the following are valid
ASM directories:
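For example (the directory names are illustrative):
ALTER DISKGROUP data ADD DIRECTORY '+DATA/yoda/mydir';
ALTER DISKGROUP data ADD DIRECTORY '+DATA/yoda/mydir/subdir1';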
However, the following ASM directory cannot be created, because its parent directory (oradata) does not exist:
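For instance (assuming +DATA/yoda/oradata was never created):
ALTER DISKGROUP data ADD DIRECTORY '+DATA/yoda/oradata/datafiles';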
Although system directories such as +DATA/yoda
cannot be manipulated, user-defined directories, such as the one
successfully created in the previous example, can be renamed or dropped.
The following examples illustrate this:
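A sketch of renaming and then dropping a user-defined directory:
ALTER DISKGROUP data RENAME DIRECTORY '+DATA/yoda/mydir/subdir1' TO '+DATA/yoda/mydir/subdir2';
ALTER DISKGROUP data DROP DIRECTORY '+DATA/yoda/mydir/subdir2';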
ASM Aliases
The filename notation described thus far (+diskgroup_name/database_name/database file type/tag_name.file_number.incarnation)
is called the fully qualified filename notation (FQFN). An ASM alias
can be used to make filenaming conventions easier to remember.
Note that whenever a file is created, a system
alias is also automatically created for that file. The system aliases
are created in a hierarchical directory structure that takes the
following syntax:
<db_unique_name>/<file_type>/<alias name>
When the files are removed, the <alias name> is deleted but the hierarchical directory structure remains.
ASM aliases are in a hierarchical directory format, similar to a filesystem hierarchy (/u01/oradata/dbname/datafile_name), and are used to reference a system-generated filename such as +DATA/yoda/datafile/system.256.589462555.
Alias names specify a disk group name, but
instead of using a file and incarnation number, they take a user-defined
string name. Alias ASM filenames are distinguished from fully qualified
or numeric names because they do not end in a dotted pair of numbers.
Note that there is a limit of one alias per ASM file. The following
examples show how to create an ASM alias:
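A sketch using the ISHAN data file from earlier in this chapter (the alias path is illustrative and its directory must already exist):
ALTER DISKGROUP data ADD ALIAS '+DATA/yoda/mydir/ishan_01.dbf'
  FOR '+DATA/yoda/datafile/ishan.259.616618501';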
Note, as stated earlier, that OMF is not
enabled when file aliases are explicitly specified in “create/alter
tablespace add datafile” commands (as in the previous example).
Aliases are particularly useful when dealing
with control files and spfiles—that is, an ASM alias filename is
normally used in the CONTROL_FILES and SPFILE initialization parameters.
In the following example, the SPFILE and CONTROL_FILES parameters are
set to the alias, and the DB_CREATE_FILE_DEST and DB_RECOVERY_FILE_DEST
parameters are set to the appropriate OMF destinations:
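A sketch of the relevant initialization parameter settings (the alias paths and the +FRA disk group name are illustrative):
spfile                = '+DATA/yoda/spfileyoda.ora'
control_files         = '+DATA/yoda/control01.ctl'
db_create_file_dest   = '+DATA'
db_recovery_file_dest = '+FRA'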
To show the hierarchical tree of files stored
in the disk group, use the following connect by clause SQL to generate
the full path. However, a more efficient way to browse the hierarchy is
to use the ASMCMD ls command or Enterprise Manager.
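One commonly used form of that query, run on the ASM instance, is sketched here:
SELECT CONCAT('+'||gname, SYS_CONNECT_BY_PATH(aname, '/')) full_alias_path
  FROM (SELECT g.name gname, a.parent_index pindex, a.name aname,
               a.reference_index rindex
          FROM v$asm_alias a, v$asm_diskgroup g
         WHERE a.group_number = g.group_number)
 START WITH (MOD(pindex, POWER(2, 24))) = 0
CONNECT BY PRIOR rindex = pindex;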
Templates
ASM file templates are named collections of
attributes applied to files during file creation. Templates are used to
set file-level redundancy (mirror, high, or unprotected) and striping
attributes (fine or coarse) of files created in an ASM disk group.
Templates simplify file creation by housing
complex file attribute specifications. When a disk group is created, ASM
establishes a set of initial system default templates associated with
that disk group. These templates contain the default attributes for the
various Oracle database file types. When a file is created, the
redundancy and striping attributes are set for that file, where the
attributes are based on the system template that is the default template
for the file type or an explicitly named template.
The following query lists the ASM files, redundancy, and striping size for a sample database.
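A sketch of such a query, run on the ASM instance (the group number is illustrative):
SELECT a.name, f.type, f.redundancy, f.striped
  FROM v$asm_file f, v$asm_alias a
 WHERE f.group_number = a.group_number
   AND f.file_number  = a.file_number
   AND f.group_number = 1;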
The administrator can change attributes of the
default templates if required. However, system default templates cannot
be deleted. Additionally, administrators can add their own unique
templates, as needed. The following SQL command illustrates how to
create user templates (performed on the ASM instance) and then apply
them to a new tablespace data file (performed on the RDBMS):
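A sketch of a user-defined template (the template name and attributes are illustrative):
ALTER DISKGROUP data ADD TEMPLATE all_new_files ATTRIBUTES (MIRROR FINE);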
Once a template is created, you can apply it when creating the new tablespace:
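For example, on the RDBMS instance:
CREATE TABLESPACE ishan DATAFILE '+DATA(all_new_files)' SIZE 100M;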
Using the ALTER DISKGROUP command, you can
modify a template or drop the template using the DROP TEMPLATE clause.
The following commands illustrate this:
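A sketch of altering and then dropping the template created above:
ALTER DISKGROUP data ALTER TEMPLATE all_new_files ATTRIBUTES (COARSE);
ALTER DISKGROUP data DROP TEMPLATE all_new_files;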
If you need to change an ASM file attribute
after the file has been created, the file must be copied into a new file
with the new attributes. This is the only method of changing a file’s
attributes.
V$ASM_TEMPLATE
Query the V$ASM_TEMPLATE view for information about templates. Here is an example for one of the disk groups:
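For example:
SELECT name, redundancy, stripe, system
  FROM v$asm_template
 WHERE group_number = 1;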
ASM File Access Control
In 11gR2, a new feature called ASM
File Access Control was introduced to restrict file access to specific
database instance users who connect as SYSDBA. ASM File Access Control
uses the user ID that owns the database instance home.
ASM ACL Overview
ASM uses File Access Control to determine
the additional privileges that are given to a database that has been
authenticated as SYSDBA on the ASM instance. These additional privileges
include the ability to modify and delete certain files, aliases, and
user groups. Cloud DBAs can set up “user groups” to specify the list of
databases that share the same access permissions to ASM files. User
groups are lists of databases, and any database that authenticates as
SYSDBA can create a user group.
Just as in Unix/Linux file permissions, each
ASM file has three categories of privileges: owner, group, and other.
Each category can have read-only permission, read-write permission, or
no permission. The file owner is usually the creator of the file and can
assign permissions for the file in any of the owner, group, and other
categories. The owner can also change the group associated with the
file. Note that only the creator of a group can delete it or modify its
membership list.
When administering ASM File Access Control, it
is recommended that you connect as SYSDBA to the database instance that
is the owner of the files in the disk group.
To set up ASM File Access Control for files in
a disk group, ensure the COMPATIBLE.ASM and COMPATIBLE.RDBMS disk group
attributes are set to 11.2 or higher.
Create a new (or alter an existing) disk group
with the following ASM File Access Control disk group attributes:
ACCESS_CONTROL.ENABLED and ACCESS_CONTROL.UMASK. Before setting the
ACCESS_CONTROL.UMASK disk group attribute, you must set the
ACCESS_CONTROL.ENABLED attribute to true to enable ASM File Access
Control.
The ACCESS_CONTROL.ENABLED attribute
determines whether Oracle ASM File Access Control is enabled for a disk
group. The value can be true or false.
The ACCESS_CONTROL.UMASK attribute determines
which permissions are masked out on the creation of an ASM file for the
user who owns the file, users in the same user group, and others not in
the user group. This attribute applies to all files on a disk group. The
values can be combinations of three digits: {0|2|6} {0|2|6} {0|2|6}.
The default is 066. Setting the attribute to 0 masks out nothing.
Setting it to 2 masks out write permission. Setting it to 6 masks out
both read and write permissions.
The upcoming example in the next section shows
how to enable ASM File Access Control for a disk group with a
permissions setting of 026, which enables read-write access for the
owner, read access for users in the group, and no access to others not
in the group. Optionally, you can create user groups that are groups of
database users who share the same access permissions to ASM files. Here
are some File Access Control list considerations:
For files that exist in a disk group before you set the ASM File Access Control disk group attributes, you must explicitly set the permissions and ownership on those existing files. Additionally, the files must be closed before setting the ownership or permissions.
When you set up File Access Control on an existing disk group, previously created files remain accessible by everyone unless you set permissions to restrict access.
Ensure that the user exists before setting ownership or permissions on a file.
File Access Control, including permission management, can be performed using SQL*Plus, ASMCMD, or Enterprise Manager (using Enterprise Manager or ASMCMD is the easiest method).
ASM ACL Setup Example
To illustrate ASM File Access Control, we start with three OS users:
In the ASM instance, prepare the disk group for File Access Control:
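A sketch of enabling access control and setting the 026 umask discussed earlier:
ALTER DISKGROUP data SET ATTRIBUTE 'access_control.enabled' = 'true';
ALTER DISKGROUP data SET ATTRIBUTE 'access_control.umask'   = '026';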
Next, add two ASM groups:
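A sketch, assuming the OS users oracle and oracle2 have already been added to the disk group with ALTER DISKGROUP ... ADD USER (the group and member names are illustrative):
ALTER DISKGROUP data ADD USERGROUP 'grp_prod' WITH MEMBER 'oracle';
ALTER DISKGROUP data ADD USERGROUP 'grp_test' WITH MEMBER 'oracle2';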
Set File Access Control for the data file ‘+DATA/yoda/datafile/marlie.283.702218775’:
Ownership cannot be changed for an open file, so we need to take the file offline in the database instance:
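For example, on the RDBMS instance:
ALTER DATABASE DATAFILE '+DATA/yoda/datafile/marlie.283.702218775' OFFLINE;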
We can now set file ownership in the ASM instance:
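A sketch, assuming oracle is the owning database user:
ALTER DISKGROUP data SET OWNERSHIP OWNER = 'oracle'
  FOR FILE '+DATA/yoda/datafile/marlie.283.702218775';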
Default permissions are unchanged:
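This can be verified with a query such as the following (columns per the 11.2 V$ASM_FILE view; the group number is illustrative):
SELECT file_number, permissions
  FROM v$asm_file
 WHERE group_number = 1;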
Now set the file permissions in the ASM instance:
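A sketch matching the 026 umask (owner read-write, group read-only, no access for others):
ALTER DISKGROUP data SET PERMISSION
  OWNER = read write, GROUP = read only, OTHER = none
  FOR FILE '+DATA/yoda/datafile/marlie.283.702218775';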
This example illustrates that the Grid
Infrastructure owner (ASM owner) cannot copy files (in this case, RMAN
backups) out of the disk group if they are protected by ASM File Access
Control.
First, create an RMAN backup:
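For example, connected as the oracle user:
RMAN> BACKUP AS BACKUPSET DATABASE FORMAT '+DATA';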
Now, using the grid user, we’ll try to copy those backup pieces to the OS file system (recall that the backup files were created by the oracle user). File Access Control should prevent the copy operation and throw an “ORA-15260: permission denied on ASM disk group” error:
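The attempted copy might look like this (the backup piece name is elided; the destination path is illustrative):
ASMCMD> cp +DATA/yoda/backupset/<backup_piece> /u01/backup/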
To be able to copy files from the disk group (DATA) to the file system, either disable access control or add the OS user grid to the correct ASM user group.
Check the user and user groups setup in ASM:
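A sketch of the relevant queries on the ASM instance:
SELECT group_number, user_number, os_id, os_name FROM v$asm_user;
SELECT group_number, usergroup_number, name FROM v$asm_usergroup;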
Summary
Like most file systems, an ASM disk group
contains a directory tree. The root directory for the disk group is
always the disk group name. Every ASM file has a system-generated
filename; the name is generated based on the instance that created it,
the Oracle file type, the usage of the file, and the file numbers. The
system-generated filename is of the form +disk_group/db_name/file_type/usage_tag.file_number.time_stamp. Directories are created automatically as needed to construct system-generated filenames.
A file can have one user alias and can be
placed in any existing directory within the same disk group. The user
alias can be used to refer to the file in any file operation where the
system-generated filename could be used. When a full pathname is used to
create the file, the pathname becomes the user alias. If a file is
created by just using the disk group name, then no user alias is
created. A user alias may be added to or removed from any file without
disturbing the file.
The system-generated name is an OMF name,
whereas a user alias is not an OMF name. If the system-generated name is
used for a file, the system will automatically create and delete the
file as needed. If the file is referred to by its user alias, the user
is responsible for creating and deleting the file and any required
directories.
CHAPTER 8 ASM Space Allocation and Rebalance
When a database
is created under the constructs of ASM, it will be striped (and can be
optionally mirrored) as per the Stripe and Mirror Everything (SAME)
methodology. SAME is a concept that makes extensive use of striping and
mirroring across large sets of disks to achieve high availability and to
provide good performance with minimal tuning. ASM incorporates the SAME
methodology. Using this method, ASM evenly distributes and balances
input/output (I/O) load across all disks within the disk group. ASM
solves one of the shortcomings of the original SAME methodology, because
ASM maintains balanced data distribution even when storage
configurations change.
ASM Space Allocation
This section discusses how ASM allocates
space in the disk group and how clients such as the relational database
management system (RDBMS) and ASM Cluster File System (ACFS) use the
allocated space.
ASM Allocation Units
ASM allocates space in chunks called allocation units (AUs).
An AU is the most granular allocation on a per-disk basis—that is,
every ASM disk is divided into AUs of the same size. For most
deployments of ASM, 1MB stripe size has proved to be the best stripe
depth for Oracle databases and also happens to be the largest I/O
request that the RDBMS will currently issue in Oracle Database 11g.
In large environments, it is recommended to use a larger AU size to reduce the metadata needed to describe the files in the disk group.
This optimal stripe size, coupled with even distribution of extents in
the disk group and the buffer cache in the RDBMS, prevents hot spots.
Unlike traditional redundant array of independent disks (RAID) configurations, ASM striping is not done on a round-robin basis, nor is it done at the individual disk level. ASM randomly
chooses a disk for allocating the initial extent. This is done to
optimize the balance of the disk group. All subsequent AUs are allocated
in such a way as to distribute each file equally and evenly across all
disks and to fill all disks evenly (see Figure 8-1). Thus, every disk is maintained at the same percentage full, regardless of the size of the disk.
For example, if a disk is twice as big as the
others, it will contain twice as many extents. This ensures that all
disks in a disk group have the same I/O load relative to their capacity.
Because ASM balances the load across all the disks in a disk group, it
is not a good practice to create multiple disk partitions from different
areas of the same physical disk and then allocate the partitions as
members of the same disk group. However, it may make sense for multiple
partitions on a physical disk to be in different disk groups. ASM is
abstracted from the underlying characteristic of the storage array
(LUN). For example, if the storage array presents several RAID5 LUNs to
ASM as disks, ASM will allocate extents transparently across each of
those LUNs.
ASM Extents
When a database file is created in an ASM
disk group, it is composed of a set of ASM extents, and these extents
are evenly distributed across all disks in the disk group. Each extent
consists of an integral number of AUs on an ASM disk. The mapping of extent size to number of AUs changes with the size of the file.
The following two queries display the extent
distribution for a disk group (the FAST disk group) that contains four
disks. The first query shows the evenness based on megabytes per disk,
and the second query lists the total extents for each disk in the FAST
disk group (group_number 2) using the X$KFFXP base table:
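A sketch of both queries (X$KFFXP is an undocumented base table, so column names may vary by release):
-- Megabytes per disk:
SELECT name, total_mb, free_mb
  FROM v$asm_disk
 WHERE group_number = 2;
-- Extent count per disk:
SELECT disk_kffxp, COUNT(pxn_kffxp) extents
  FROM x$kffxp
 WHERE group_kffxp = 2
 GROUP BY disk_kffxp
 ORDER BY disk_kffxp;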
Similarly, the following example illustrates
the even distribution of ASM extents for the System tablespace across
all the disks in the DATA disk group (group number 3). This tablespace
contains a single 100MB data file called
+DATA/yoda/datafile/system.256.589462555.
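A sketch of such a query, using ASM file number 256 from the filename above:
SELECT disk_kffxp, COUNT(*) extents
  FROM x$kffxp
 WHERE group_kffxp  = 3
   AND number_kffxp = 256
 GROUP BY disk_kffxp
 ORDER BY disk_kffxp;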
ASM Striping
There are two types of ASM file striping:
coarse and fine-grained. For coarse distribution, each coarse-grained
file extent is mapped to a single allocation unit.
With fine-grained distribution, each grain is interleaved at 128K across groups of eight AUs. Because each AU is guaranteed to be on a different ASM disk, each stripe ends up on a different physical disk. Fine-grained striping is also used for very small files (such as control files) to ensure that they are distributed across disks. Fine-grained striping is generally not good for sequential I/O (such as full table scans) once the sequential I/O exceeds one AU. As of Oracle 11gR2, only control files are fine-striped by default when the disk group is created; users can change the template for a given file type to change the defaults.
As discussed previously, each file stored in
ASM requires metadata structures to describe the file extent locations.
As the file grows, the metadata associated with that file also grows, as does the memory used to store the file extent locations. Oracle 11g
introduces a new feature called Variable Sized Extents to minimize the
overhead of the metadata. The main objective of this feature is to
enable larger file extents to reduce metadata requirements as a file
grows, and as a byproduct it allows for larger file size support (file
sizes up to 140PB [a petabyte is 1,024TB]). For example, if a data file
is initially as small as 1GB, the file extent size used will be 1 AU. As
the file grows, several size thresholds are crossed and larger extent
sizes are employed at each threshold, with the maximum extent size
capped at 16 AUs. Note that there are two thresholds: 20,000 extents
(20GB with 1MB AUs) and 40,000 extents (100GB [20GB of 1×AU and 20,000
of 4×AU] with 1MB AUs). Finally, extents beyond 40,000 use a 16×
multiplier. Valid extent sizes are 1, 4, and 16 AUs (which translate to
1MB, 4MB, and 16MB with 1MB AUs, respectively). When the file gets into
multiple AU extents, the file gets striped at 1AU to maintain the
coarse-grained striping granularity of the file. The database
administrator (DBA) or ASM administrator need not manage variable
extents; ASM handles this automatically. This feature is very similar in
behavior to the Automatic Extent Allocation that the RDBMS uses.
NOTE
The RDBMS layers of the code effectively limit file size to 128TB. The ASM structures can address 140PB.
The following example demonstrates the use of
Variable Sized Extents. In this example, the SYSAUX tablespace contains a
data file that is approximately 32GB, which exceeds the first threshold
of 20,000 extents (20GB):
Now if X$KFFXP is queried to find the ASM file that has a nondefault extent size, it should indicate that it is file number 263:
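A sketch of that query (SIZE_KFFXP is the extent size in AUs; X$KFFXP is undocumented, so verify the columns for your release):
SELECT number_kffxp file_number, size_kffxp extent_size_aus, COUNT(*) extents
  FROM x$kffxp
 WHERE group_kffxp = 3
   AND size_kffxp  > 1
 GROUP BY number_kffxp, size_kffxp;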
The Variable Sized Extents feature is available only for disk groups with Oracle 11g RDBMS and ASM compatibility. For disk groups created with Oracle Database 10g,
the compatibility attribute must be advanced to 11.1.0. Variable
extents take effect for newly created files and will not be
retroactively applied to files that were created with 10g software.
Setting Larger AU Sizes for VLDBs
For very large databases (VLDBs)—for
example, databases that are 10TB and larger—it may be beneficial to
change the default AU size (for example, to a 4MB AU). The following are
benefits of changing the default size for VLDBs:
Reduced SGA size to manage the extent maps in the RDBMS instance
Increased file size limits
Reduced database open time, because VLDBs usually have many big data files
Increasing the AU size improves the time to
open large databases and also reduces the amount of shared pool consumed
by the extent maps. With 1MB AUs and fixed-size extents, the extent map
for a 10TB database is about 90MB, which has to be read at open and
then kept in memory. With 16MB AUs, this is reduced to about 5.5MB. In
Oracle Database 10g, the entire extent map for a file is read from disk at file-open time.
Oracle Database 11g significantly minimizes the file-open latency issue by reading extent maps on demand for certain file types. In Oracle 10g,
for every file open, the complete extent map is built and sent to the
RDBMS instance from the ASM instance. For large files, this
unnecessarily lengthens file-open time. In Oracle 11g, only the
first 60 extents in the extent map are sent at file-open time. The rest
are sent in batches as required by the database instance.
Setting Larger AU Sizes in Oracle Database 11g
For Oracle Database 11g ASM systems, the following CREATE DISKGROUP command can be executed to set the appropriate AU size:
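A sketch of such a command (the disk paths are illustrative; a 4MB AU is used here):
CREATE DISKGROUP data EXTERNAL REDUNDANCY
  DISK '/dev/mapper/asmdisk1', '/dev/mapper/asmdisk2'
  ATTRIBUTE 'au_size'          = '4M',
            'compatible.asm'   = '11.2',
            'compatible.rdbms' = '11.2';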
The AU attribute can be used only at the time
of disk group creation; furthermore, the AU size of an existing disk
group cannot be changed after the disk group has been created.
ASM Rebalance
With traditional volume managers, expanding
or shrinking striped file systems has typically been difficult. With
ASM, these disk changes are now seamless operations involving
redistribution (rebalancing) of the striped data. Additionally, these
operations can be performed online.
Any change in the storage
configuration—adding, dropping, or resizing a disk—triggers a rebalance
operation. ASM does not dynamically move around “hot areas” or “hot
extents.” Because ASM distributes extents evenly across all disks and
the database buffer cache prevents small chunks of data from being hot
areas on disk, it completely obviates the notion of hot disks or
extents.
Rebalance Operation
The main objective of the rebalance
operation is always to provide an even distribution of file extents and
space usage across all disks in the disk group. The rebalance is done on
a per-file basis to ensure that each file is evenly balanced across all
disks. Upon completion of distributing the files evenly among all the
disks in a disk group, ASM starts compacting the disks to ensure there
is no fragmentation in the disk group. Fragmentation is possible only in
disk groups where one or more files use variable extents. This is
critical to ASM’s assurance of balanced I/O load. The ASM background
process, RBAL, manages this rebalance. The RBAL process examines each
file extent map, and the extents are redistributed on the new storage
configuration. For example, consider an eight-disk external redundancy
disk group, with a data file with 40 extents (each disk will house five
extents). When two new drives of same size are added, that data file is
rebalanced and distributed across 10 drives, with each drive containing
four extents. Only eight extents need to move to complete the
rebalance—that is, a complete redistribution of extents is not necessary
because only the minimum number of extents is moved to reach equal
distribution.
During the compaction phase of the rebalance, each disk is examined and data is moved toward the head of the disk to eliminate any holes. As of Oracle 11g, the rebalance estimates reported in V$ASM_OPERATION do not factor in the work needed to complete the compaction of the disk group.
NOTE
A weighting factor, influenced by disk size
and file size, affects rebalancing. A larger drive will consume more
extents. This factor is used to achieve even distribution based on
overall size.
The following is a typical process flow for ASM rebalancing:
1. On the ASM instance, a DBA adds (or drops) a disk to (or from) a disk group.
2. This invokes the RBAL process to create the rebalance plan and then begin coordination of the redistribution.
3. RBAL calculates the work required to perform the task and then messages the ASM Rebalance (ARBx) processes to handle the request. In Oracle releases prior to 11.2.0.2, the number of ARBx processes invoked is directly determined by the init.ora parameter ASM_POWER_LIMIT or the power level specified in an add, drop, or rebalance command. From Oracle 11.2.0.2 onward, there is always just one ARB0 process performing the rebalance operation, and the ASM_POWER_LIMIT or the power level specified in the SQL command translates to the number of extents relocated in parallel.
4. The Continuing Operations Directory (COD) is updated to reflect a rebalance activity. The COD is important when an in-flight rebalance fails. Recovering instances will see an outstanding COD entry for the rebalance and restart it.
5. RBAL distributes plans to the
ARBs. In general, RBAL generates a plan per file; however, larger files
can be split among ARBs.
6. ARBx performs a rebalance on these extents. Each extent is locked, relocated, and unlocked. Reads can proceed while an extent is being relocated. New writes to the locked extent are blocked, and outstanding writes must be reissued to the new location after the relocation is complete. This step is shown as operation REBAL in V$ASM_OPERATION. The rebalance algorithm is detailed in Chapter 9.
The following is an excerpt from the ASM alert log during a rebalance operation for a drop disk command:
The following is an excerpt from the ASM alert log during a rebalance operation for an add disk command:
An ARB trace file is created for each ARB
process involved in the rebalance operation. This ARB trace file can be
found in the trace subdirectory under the DIAG directory. The following
is a small excerpt from this trace file:
The preceding entries are repeated for each file assigned to the ARB process.
Resizing a Physical Disk or LUN and the ASM Disk Group
When you’re increasing the size of a disk
group, it is a best practice to add disks of similar size. However, in
some cases it is appropriate to resize disks rather than to add storage
of equal size. For these cases, you should resize all the disks in the
disk group (to the same size) at the same time. This section discusses
how to expand or resize the logical unit number (LUN) as an ASM disk.
Disks in the storage are usually configured as
a LUN and presented to the host. When a LUN runs out of space, you can
expand it within the storage array by adding new disks in the back end.
However, the operating system (OS) must then recognize the new space.
Some operating systems require a reboot to recognize the new LUN size.
On Linux systems that use Emulex drivers, for example, the following can
be used:
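One common approach is a sysfs SCSI rescan, sketched here; consult your HBA vendor's documentation for the exact procedure on your platform:
echo "- - -" > /sys/class/scsi_host/hostN/scan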
Here, N is the SCSI port ordinal assigned to
this HBA port (see the /proc/scsi/lpfc directory and look for the
“port_number” files).
The first step in increasing the size of an
ASM disk is to add extra storage capacity to the LUN. To use more space,
the partition must be re-created. This operation is at the partition
table level, and that table is stored in the first sectors of the disk.
Changing the partition table does not affect the data as long as the
starting offset of the partition is not changed.
The V$ASM_DISK view includes a column called OS_MB that gives the actual OS size of the disk. This column can aid in appropriately resizing the disk and in preventing attempts to resize disks that cannot be resized.
The general steps to resize an ASM disk are as follows:
1. Resize the LUN from storage array. This is usually a noninvasive operation.
2. Query V$ASM_DISK for the OS_MB
for the disk to be resized. If the OS or ASM does not see the new size,
review the steps from the host bus adapter (HBA) vendor to probe for new
devices. In some cases, this may require a reboot.
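A typical final step (an assumption here, not listed above) is to resize the ASM disks once the OS reports the new capacity; without a SIZE clause, each disk is resized to the size shown in OS_MB:
ALTER DISKGROUP data RESIZE ALL;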
Rebalance Power Management
Rebalancing involves physical movement of
file extents. Its impact is usually low because the rebalance is done a
few extents at a time, so there’s little outstanding I/O at any given
time per ARB process. This should not adversely affect online database
activity. However, it is generally advisable to schedule the rebalance
operation during off-peak hours.
The init.ora parameter ASM_POWER_LIMIT is used
to influence the throughput and speed of the rebalance operation. For
Oracle 11.2.0.1 and below, the range of values for ASM_POWER_LIMIT is
0–11, where a value of 11 is full throttle and a value of 1 (the
default) is low speed. A value of 0, which turns off automatic
rebalance, should be used with caution. In a Real Application Clusters
(RAC) environment, the ASM_POWER_LIMIT is specific to each ASM instance.
(A common question is why the maximum power limit is 11 rather than 10.
Movie lovers might recall the amplifier discussion from This is Spinal Tap.)
For releases 11.2.0.2 and above, the
ASM_POWER_LIMIT can be set up to 1024, which results in ASM relocating that many extents in parallel. Increasing the rebalance power also increases the amount of PGA memory needed during relocation. If the current memory settings of the ASM instance prevent allocation of the required memory, ASM dials down the power and continues with the available memory.
The power value can also be set for a specific
rebalance activity using the ALTER DISKGROUP command. This value is
effective only for the specific rebalance task. The following example
demonstrates this:
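For example:
ALTER DISKGROUP data REBALANCE POWER 11;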
Here is an example from another session:
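From another session, progress can be checked with a query such as:
SELECT group_number, operation, state, power, sofar, est_work, est_minutes
  FROM v$asm_operation;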
Each rebalance step has various associated states. The following are the valid states:
WAIT This indicates that currently no operations are running for the group.
RUN An operation is running for the group.
HALT An administrator is halting an operation.
ERRORS An operation has been halted by errors.
A power value of 0 indicates that no rebalance should occur for this operation. This setting is particularly
important when you’re adding or removing storage (that has external
redundancy) and then deferring the rebalance to a later scheduled time.
However, a power level of 0 should be used with caution; this is
especially true if the disk group is low on available space, which may
result in an ORA-15041 for out-of-balance disk groups.
The power level is adjusted with the ALTER
DISKGROUP REBALANCE command, which affects only the current rebalance
for the specified disk group; future rebalances are not affected. If you
increase the power level of the existing rebalance, it will spawn new
ARB processes. If you decrease the power level, the running ARB process
will finish its extent relocation and then quiesce and die off.
If you are removing or adding several disks,
add or remove disks in a single ALTER DISKGROUP statement; this reduces
the number of rebalance operations needed for storage changes. This
behavior is more critical where normal- and high-redundancy disk groups
have been configured because of disk repartnering. Executing a single
disk group reconfiguration command allows ASM to figure out the ideal
disk partnering and reduce excessive data movement. The following
example demonstrates this storage change:
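A sketch combining an add and a drop in one statement (the disk paths and names are illustrative):
ALTER DISKGROUP data
  ADD  DISK '/dev/mapper/asmdisk5', '/dev/mapper/asmdisk6'
  DROP DISK data_0001, data_0002
  REBALANCE POWER 8;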
An ASM disk group rebalance is an asynchronous
operation in that the control is returned immediately to the DBA after
the operation executes in the background. The status of the ongoing
operation can be queried from V$ASM_OPERATION. However, in some
situations the disk group operation needs to be synchronous—that is, it
must wait until rebalance is completed. The ASM ALTER DISKGROUP commands
that result in a rebalance offer a WAIT option. This option allows for
accurate (sequential) scripting that may rely on the space change from a
rebalance completing before any subsequent action is taken. For
instance, if you add 100GB of storage to a completely full disk group,
you will not be able to use all 100GB of storage until the rebalance
completes. The WAIT option ensures that the space addition is successful
and is available for space allocations. If a new rebalance command is
entered while one is already in progress in WAIT mode, the command will
not return until the disk group is in a balanced state or the rebalance
operation encounters an error.
The following SQL script demonstrates how the
WAIT option can be used in SQL scripting. The script adds a new disk,
/dev/sdc6, and waits until the add and rebalance operations complete,
returning the control back to the script. The subsequent step adds a
large tablespace.
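A sketch of such a script (the power level and tablespace size are illustrative):
ALTER DISKGROUP data ADD DISK '/dev/sdc6' REBALANCE POWER 8 WAIT;
-- Control returns only after the rebalance completes, so the new space
-- is fully usable for the next step:
CREATE BIGFILE TABLESPACE big_ts DATAFILE '+DATA' SIZE 500G;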
In the event that dropping a disk results in a
hung rebalance operation due to the lack of free space, ASM rejects the
drop command when it is executed. Here’s an example:
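For instance (the disk name is illustrative):
ALTER DISKGROUP data DROP DISK data_0003;
-- If the remaining disks cannot absorb the data, the command is rejected,
-- typically with ORA-15041 (diskgroup space exhausted).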
Fast Rebalance
When a storage change initiates a disk group
rebalance, typically all active ASM instances of an ASM cluster and
their RDBMS clients are notified and become engaged in the
synchronization of the extents that are being rearranged. This messaging
between instances can be “chatty” and thus can increase the overall
time to complete the rebalance operation.
In certain situations where the user does not
need the disk group to be “user accessible” and needs rebalancing to
complete as soon as possible, it is beneficial to perform the rebalance
operation without the extra overhead of the ASM-to-ASM and ASM-to-RDBMS
messaging. The Fast Rebalance feature eliminates this overhead by
allowing a single ASM instance to rebalance the disk group without the
messaging overhead. The primary goal of Fast Rebalance is to improve the
overall performance of the rebalance operation. Additionally, the
rebalance operation can be invoked at the maximum power level (power level 11, or 1024 for 11.2.0.2 and above) for maximum throughput, making the rebalance operation limited only by the I/O subsystem (to the degree you can saturate the I/O subsystem with synchronous 1MB I/Os).
To eliminate messaging to other ASM instances,
the ASM instance that performs the rebalance operation requires
exclusive access to the disk group. To provide this exclusive disk group
access, a new disk group mount mode, called RESTRICTED, was introduced
in Oracle Database 11g. A disk group can be placed in RESTRICTED mode using STARTUP RESTRICT or ALTER DISKGROUP MOUNT RESTRICTED.
When a disk group is mounted in RESTRICTED
mode, RDBMS instances are prevented from accessing that disk group and
thus databases cannot be opened. Furthermore, only one ASM instance in a
cluster can mount a disk group in RESTRICTED mode. When the instance is
started in RESTRICTED mode, all disk group mounts in that instance will
automatically be in RESTRICTED mode. The ASM instance needs to be
restarted in NORMAL mode to get it out of RESTRICTED mode.
At the end of the rebalance operation, the
user must explicitly dismount the disk group and remount it in NORMAL
mode to make the disk group available to RDBMS instances.
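A sketch of the overall Fast Rebalance flow on the single ASM instance performing the work (the disk group must be dismounted on all other instances first):
ALTER DISKGROUP data MOUNT RESTRICTED;
ALTER DISKGROUP data REBALANCE POWER 1024 WAIT;
ALTER DISKGROUP data DISMOUNT;
ALTER DISKGROUP data MOUNT;   -- normal mount; RDBMS instances can access it again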
Effects of Imbalanced Disks
This section illustrates how ASM distributes
extents evenly and creates a balanced disk group. Additionally, the
misconceptions of disk group space balance management are covered.
The term “balanced” in the ASM world is slightly overloaded. A disk group can become imbalanced for several reasons:
Dissimilarly sized disks are used in a given disk group.
A rebalance operation was aborted.
A rebalance operation was halted. This state can be determined by the UNBALANCED column of the V$ASM_DISKGROUP view. Operationally, the DBA can resolve this problem by manually performing a rebalance against the specific disk group.
NOTE
The UNBALANCED column in V$ASM_DISKGROUP
indicates that a rebalance is in flux—that is, either in progress or
stopped. This column is not an indicator for an unbalanced disk group.
A
disk was added to the disk group with an ASM_POWER_LIMIT or power level
of 0, but the disk group was never rebalanced afterward.
This section focuses on the first reason: a
disk group being imbalanced due to differently sized disks. For the
other reasons, allowing the rebalance to complete will fix the imbalance
automatically.
The main goal of ASM is to provide an even
distribution of data extents across all disk members of a disk group.
When an RDBMS instance requests a file creation, ASM allocates extents
from all the disks in the specified disk group. The first disk
allocation is chosen randomly, but all subsequent disks for extent
allocation are chosen to evenly spread each file across all disks and to
evenly fill all disks. Therefore, if all disks are equally sized, all
disks should have the same number of extents and thus an even I/O load.
But what happens when a disk group contains
unequally sized disks—for example, a set of 25GB disks mixed with a
couple of 50GB disks? When allocating extents, ASM will place twice as
many extents on each of the bigger 50GB disks as on the smaller 25GB
disks. Thus, the 50GB disks will contain more data extents than their
25GB counterparts. This allocation scheme causes dissimilarly sized
disks to fill at the same proportion, but will also induce unbalanced
I/O across the disk group because the disk with more extents will
receive more I/O requests.
The following example illustrates this scenario:
1. Note that the FAST disk group initially contains two disks that are equally sized (8.6GB):
2. Display the extent distribution
on the current disk group layout. The even extent distribution is shown
by the COUNT(PXN_KFFXP) column.
3. Add two more 8.6GB disks to the disk group:
4. Use the following query to display the extent distribution after the two disks were added:
5. Note that a 1GB disk was accidentally added:
6. Display the space usage from V$ASM_DISK. Notice the size of disk FAST_0004:
7. The extent distribution query is
rerun to display the effects of this mistake. Notice the unevenness of
the extent distribution.
MYTH
Adding and dropping a disk in the same disk
group requires two separate rebalance activities. In fact, some disks
can be dropped and others added to a disk group with a single rebalance
command. This is more efficient than separate commands.
ASM and Storage Array Migration
One of the core benefits of ASM is the
ability to rebalance extents not only within disk enclosure frames
(storage arrays) but also across frames. Customers have used this
ability extensively when migrating between storage arrays (for example,
from an EMC VNX to a VMAX storage system) or between storage
vendors (for example, from EMC arrays to Hitachi Data Systems [HDS]
arrays).
The following example illustrates the
simplicity of this migration. In this example, the DATA disk group will
migrate from an EMC VNX storage enclosure to the EMC VMAX enclosure. A
requirement for this type of storage migration is that both storage
enclosures must be attached to the host during the migration and must be
discovered by ASM. Once the rebalance is completed and all the data is
moved from the old frame, the old frame can be “unzoned” and “uncabled”
from the host(s).
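The migration statement might look like the following sketch (the rebalance power is illustrative):
ALTER DISKGROUP data
  DROP DISK data_0001, data_0002, data_0003, data_0004
  ADD  DISK '/dev/rdsk/c7t19*'
  REBALANCE POWER 8;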
This command indicates that the disks
DATA_0001 through DATA_0004 (from the current EMC VNX disks) are to be
dropped and that the new VMAX disks, specified by /dev/rdsk/c7t19*, are
to be added. The ADD and DROP commands can all be done in one rebalance
operation. Additionally, the RDBMS instance can stay online while the
rebalance is in progress. Note that migration to new storage is an
exception to the general rule against mixing different size/performance
disks in the same disk group. This mixture of disparate disks is
transient; the configuration at the end of the rebalance will have disks
of similar size and performance characteristics.
ASM and OS Migration
Customers often ask whether they can move
between systems with the same endianness but different operating systems,
while keeping the storage array the same. For example, suppose that a
customer wants to migrate from Solaris to AIX, with the database on ASM
over an EMC Clariion storage array network (SAN). The storage will be
kept intact and physically reattached to the AIX server. Customers ask
whether this migration is viable and/or supported.
Although ASM data structures are compatible
with most OSs (except for endianness) and should not have a problem,
other factors preclude this from working. In particular, the OS LUN
partition has its own OS partition table format. It is unlikely that
this partition table can be moved between different OSs.
Additionally, the database files themselves
may have other issues, such as platform-specific structures and formats,
and thus the database data will need to be converted to the target
platform’s format. Some viable options include the following:
Data pump full export/import
Cross-platform transportable tablespaces (XTTSs)
Streams
Important Points on ASM Rebalance
The following are some important points on rebalance and extent distribution:
It
is very important that similarly sized disks be used in a single disk
group, and that failure groups are also of similar sizes. The use of
dissimilar disks will cause uneven extent distribution and I/O load. If
one disk lacks free space, it is impossible to do any allocation in a
disk group because every file must be evenly allocated across all disks.
Rebalancing and allocation should make the percentage of allocated
space about the same on every disk.
Rebalance
runs automatically only when there is a disk group configuration
change. Many users have the misconception that ASM periodically wakes up
to perform rebalance. This simply is not true.
If
you are using similarly sized disks and you still see disk group
imbalance, either a previous rebalance operation failed to complete (or
was cancelled) or the administrator set the rebalance power to 0 via a
rebalance command. A manual rebalance should fix these cases.
If
a server goes down while you’re executing a rebalance, the rebalance
will be automatically restarted after ASM instance/crash recovery. A
persistently stored record indicates the need for a rebalance. The node
that does the recovery sees the record indicating that a rebalance is in
progress and that the rebalance was running on the instance that died.
It will then start a new rebalance. The recovering node may be different
from the node that initiated the rebalance.
Many
factors determine the speed of rebalance. Most importantly, it depends
on the underlying I/O subsystem. To calculate the lower bound of the
time required for the rebalance to complete, determine the following:
1. Calculate
amount of data that has to be moved. ASM relocates data proportional to
the amount of space being added. If you are doubling the size of the
disk group, then 50 percent of the data will be moved; if you are adding
10 percent more storage, then at least 10 percent of the data will be
moved; and so on.
2. Determine
how long it will take the I/O subsystem to perform that amount of data
movement. As described previously, ASM does relocation I/Os as a
synchronous 1 AU read followed by a synchronous 1 AU write. Up to
ASM_POWER_LIMIT I/Os can operate in parallel depending on the rebalance
power. This calculation is a lower bound, because ASM has additional
synchronization overhead.
3. The
impact of rebalance should be low because ASM relocates and locks
extents one at a time. ASM relocates multiple extents simultaneously
only if rebalance is running with higher power. Only the I/Os against
the extents being relocated are blocked; ASM does not block I/Os for all
files in the ASM disk group. I/Os are not actually blocked per se;
reads can proceed from the old location during relocation, whereas some
writes need to be temporarily stalled or may need to be reissued if they
were in process during the relocation I/O. The writes to these extents
can be completed after the extent is moved to its new location. All this
activity is transparent to the application. Note that the chance of an
I/O being issued during the time that an extent is locked is very small.
In the case of the Flash disk group, which contains archive logs or
backups, many of the files being relocated will not even be open at the
time, so the impact is very minimal.
When
a rebalance is started for newly added disks, ASM immediately begins
using the free space on them; however, ASM continues to allocate files
evenly across all disks. If a disk group is almost full, and a large
disk (or set of disks) is then added, the RDBMS could get out-of-space
(ORA-15041) errors even though there is seemingly sufficient space in
the disk group. With the WAIT option to the ADD DISK command, control
does not return to the user until rebalance is complete. This may
provide more intuitive behavior to customers who run near capacity.
If
disks are added very frequently, the same data is relocated many times,
causing excessive data movement. It is a best practice to add and drop
multiple disks at a time so that ASM can reorganize partnership
information within ASM metadata more efficiently. For normal- and
high-redundancy disk groups, it is very important to batch the
operations for adding and dropping disks rather than doing them in rapid
succession. The latter option generates much more overhead because
mirroring and failure groups place greater constraints on where data can
be placed. In extreme cases, nesting many adds and drops without
allowing the intervening rebalance to run to completion can lead to the
error ORA-15074.
Summary
Every ASM disk is divided into fixed-size
allocation units. The AU is the fundamental unit of allocation within a
disk group, and the usable space in an ASM disk is a multiple of this
size. The AU size is a disk group attribute specified at disk group
creation and defaults to 1MB, but may be set as high as 64MB. An AU
should be large enough that accessing it in one I/O operation provides
very good throughput—that is, the time to access an entire AU in one I/O
should be dominated by the transfer rate of the disk rather than the
time to seek to the beginning of the AU.
ASM spreads the extents of a file evenly
across all disks in a disk group. Each extent comprises an integral
number of AUs. Most files use coarse striping. With coarse striping, in
each set of extents, the file is striped across the set at 1 AU
granularity. Thus, each stripe of data in a file is on a different disk
than the previous stripe of the file. A file may have fine-grained
striping rather than coarse-grained. The difference is that the
fine-grained striping granularity is 128K rather than 1 AU.
Rebalancing a disk group moves file extents
between disks in the disk group to ensure that every file is evenly
distributed across all the disks in the disk group. When all files are
evenly distributed, all disks are evenly filled to the same percentage.
This ensures load balancing. Rebalancing does not relocate data based on
I/O statistics, nor is it started as a result of statistics. ASM
automatically invokes rebalance only when a storage configuration change
is made to an ASM disk group.
CHAPTER 9 ASM Operations
This chapter
describes the flow of the critical operations for ASM disk groups and
files. It also describes the key interactions between the ASM and
relational database management system (RDBMS) instances.
Note that many of these ASM operations have
been optimized in Engineered Systems (Exadata and ODA) and thus have
significantly different behavior. Chapter 12 covers these optimizations. This chapter describes ASM operations in non–Engineered Systems.
ASM Instance Discovery
The first time an RDBMS instance tries to
access an ASM file, it needs to establish its connection to the local
ASM instance. Rather than requiring a static configuration file to
locate the ASM instance, the RDBMS contacts the Cluster Synchronization
Services (CSS) daemon where the ASM instance has registered. CSS
provides the necessary connect string for the RDBMS to spawn a Bequeath
connection to the ASM instance. The RDBMS authenticates itself to the
ASM instance via operating system (OS) authentication by connecting as
SYSDBA. This initial connection between the ASM instance and the RDBMS
instance is known as the umbilicus, and it remains active as long
as the RDBMS instance has any ASM files open. The RDBMS side of this
connection is the ASMB process. (See the “File Open” section for an
explanation of why ASMB can appear in an ASM instance.) The ASM side of
the connection is a foreground process called the umbilicus foreground (UFG).
RDBMS and ASM instances exchange critical messages over the umbilicus.
Failure of the umbilicus is fatal to the RDBMS instance because the
connection is critical to maintaining the integrity of the disk group.
Some of the umbilicus messages are described later in the “Relocation”
section of this chapter.
RDBMS Operations on ASM Files
This section describes the interaction between the RDBMS and ASM instances for the following operations on ASM files:
File Create
File Open
File I/O
File Close
File Delete
File Create
ASM filenames in the RDBMS are distinguished
by the fact that they start with a plus sign (+). ASM file creation
consists of three phases:
File allocation in the ASM instance
File initialization in the RDBMS instance
File creation committed in the RDBMS and ASM instance
When an RDBMS instance wants to create an ASM
file, it sends a request to create the ASM file. The RDBMS instance has
a pool of processes (the o0nn processes) that hold connections to foreground processes in the ASM instance, called network foregrounds (NFGs); these connections are used for tasks such as file creation. The file-creation request sent over the appropriate connection includes the following information:
Disk group name
File type
File block size
File size
File tag
The request may optionally include the following additional information:
Template name
Alias name
ASM uses this information to allocate the
file (the topic of file allocation is described in greater detail in the
“ASM File Allocation” section). ASM determines the appropriate striping
and redundancy for the file based on the template. If the template is
not explicitly specified in the request, the default template is used
based on the file type. ASM uses the file type and tag information to
create the system-generated filename. The system-generated filename is
formed as follows:
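+diskgroup_name/database_name/file_type/tag_name.file_number.incarnation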
After allocating the file, ASM sends extent
map information to the RDBMS instance. ASM creates a Continuing
Operations Directory (COD) entry to track the pending file creation. The
RDBMS instance subsequently issues the appropriate I/O to initialize
the file. When initialization is complete, the RDBMS instance messages
ASM to commit the creation of the file.
When ASM receives the commit message, ASM’s
LGWR flushes the Active Change Directory (ACD) change record with the
file-creation information. ASM’s DBWR subsequently asynchronously writes
the appropriate allocation table, file directory, and alias directory
entries to disk. Thus, the high-level tasks for DBWR and LGWR in the ASM instance are similar to those in the RDBMS instance.
If the RDBMS instance explicitly or implicitly
aborts the file creation without committing the creation, ASM uses the
COD to roll back the file creation. Rollback marks the allocation table
entries as free, releases the file directory entry, and removes the
appropriate alias directory entries. Note that rollbacks in ASM do not
use the same infrastructure (undo segments) or semantics that the RDBMS
instance uses.
File Open
When an RDBMS instance needs to open an ASM
file, it sends to the ASM instance a File Open request, with the
filename, via one of the o0nn processes. ASM consults the file directory
to get the extent map for the file. ASM sends the extent map to the
RDBMS instance. The extent maps of the files are sent in batches to the
database instance. ASM sends the first 60 extents of the extent map to
the RDBMS instance at file-open time; the remaining extents are paged in
on demand by the database instance. This delayed shipping of extent
maps greatly improves the time it takes to open the database.
Opening the spfile at RDBMS instance startup
is a special code path. This open operation cannot follow the typical
open path, because the RDBMS instance System Global Area (SGA) does not
yet exist to hold the extent map. The SGA sizing information is
contained in the spfile. In this specific case, the RDBMS does proxy I/O
through the ASM instance. When the ASM instance reads user data, it
does a loop-back connection to itself. This results in ASM having an
ASMB process during RDBMS instance startup. After the RDBMS has gotten
the initial contents of the spfile, it allocates the SGA, closes the
proxy open of the spfile, and opens the spfile again via the normal
method used for all other files.
ASM tracks all the files an RDBMS instance has
open. This allows ASM to prevent the deletion of open files. ASM also
needs to know what files an RDBMS instance has open so that it can
coordinate extent relocation, as described later in this chapter.
File I/O
RDBMS instances perform ASM file I/O
directly to the ASM disks; in other words, the I/O is performed
unbuffered and directly to the disk without involving ASM. Keep in mind
that each RDBMS instance uses the extent maps it obtains during file
open to determine where on the ASM disks to direct its reads and writes;
thus, the RDBMS instance has all the information it needs to perform
the I/O to the database file.
MYTH
RDBMS instances proxy all I/O to ASM files through an ASM instance.
File Close
When an RDBMS instance closes a file, it
sends a message to the ASM instance. ASM cleans up its internal state
when the file is closed. Closed files do not require messaging to RDBMS
instances when their extents are relocated via rebalance.
ASM administrators can issue the following command to delete closed files manually:
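A sketch of the form of the command; the disk group and fully qualified filename here are illustrative:

-- Substitute your own disk group and fully qualified ASM filename.
ALTER DISKGROUP data DROP FILE '+DATA/mydb/datafile/users.266.912345678';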
The DROP FILE command fails for open files (generating an ORA-15028 error message “ASM file filename
not dropped; currently being accessed”). Generally, manual deletion of
files is not required if the ASM files are Oracle Managed Files (OMFs),
which are automatically deleted when they are no longer needed. For
instance, when the RDBMS drops a tablespace, it will also delete the
underlying data files if they are OMFs.
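For example, a sketch of dropping a tablespace together with its OMF data files (the tablespace name is hypothetical):

-- The underlying ASM data files are removed automatically because they are OMFs.
DROP TABLESPACE reports_ts INCLUDING CONTENTS AND DATAFILES;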
File Delete
When an RDBMS instance deletes a file, it
sends a request to the ASM instance. The ASM instance creates a COD
entry to record the intent to delete the file. ASM then marks the
appropriate allocation table entries as free, releases the file
directory entry, and removes the appropriate alias directory entries. If
the instance fails during file deletion, COD recovery completes the
deletion. The delete request from the RDBMS instance is not complete
until ASM completes freeing all of the allocated space.
ASM File Allocation
This section describes how ASM allocates
files within an external redundancy disk group. For simplicity, this
section explains ASM striping and variable-sized extents for ASM files
in the absence of ASM redundancy. The concepts in striping and
variable-sized extents also apply to files with ASM redundancy. A later
section explains allocation of files with ASM redundancy.
External Redundancy Disk Groups
ASM allocates files so that they are evenly
spread across all disks in a disk group. ASM uses the same algorithm for
choosing disks for file allocation and for file rebalance. In the case
of rebalance, if multiple disks are equally desirable for extent
placement, ASM chooses the disk where the extent is already allocated if
that is one of the choices.
MYTH
ASM chooses disks for allocation based on I/O statistics.
MYTH
ASM places the same number of megabytes from a file on each disk regardless of disk size.
ASM chooses the disk for the first extent of a
file to optimize for space usage in the disk group. It strives to fill
each disk evenly (proportional to disk size if the disks are not the
same size). Subsequently, ASM tries to spread the extents of the file
evenly across all disks. As described in the file directory persistent
structure, ASM extent placement is not bound by strict modulo
arithmetic; however, extent placement tends to follow an approximately
round-robin pattern across the disks in a disk group. ASM allocation on a
disk begins at the lower-numbered AUs (typically the outer tracks of
disks). Keeping allocations concentrated in the lower-numbered AUs tends
to reduce seek time and takes advantage of the highest-performing
tracks of the disk. Note that the assumption about the mapping of
lower-numbered AUs to higher-performing tracks is generally true for
physical disks, but may not hold true for LUNs presented by storage
arrays. ASM is not aware of the underlying physical layout of the ASM
disks.
A potentially nonintuitive side effect of
ASM’s allocation algorithm is that ASM may report out-of-space errors
even when it shows adequate aggregate free space for the disk group.
This side effect can occur if the disk group is out of balance. For
external redundancy disk groups, the primary reason for disk group
imbalance is an incomplete rebalance operation. This can occur if a user
stops a rebalance operation by specifying rebalance power 0, or if
rebalance was cancelled/terminated. Also, if a disk group is almost full
when new disks are added, allocation may fail on the existing disks
until rebalance has freed sufficient space. It is a best practice to
leave at least 20 percent free space uniformly across all disks.
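A quick way to check for imbalance and low free space, assuming the standard V$ASM_DISK view and disk group number 1, is a query such as the following:

-- Uneven PCT_FREE values across disks of the same disk group suggest an
-- incomplete rebalance or recently added disks.
SELECT group_number, name, total_mb, free_mb,
       ROUND(free_mb / total_mb * 100, 1) AS pct_free
FROM   v$asm_disk
WHERE  group_number = 1
ORDER  BY pct_free;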
Variable-Sized Extents
In disk groups with COMPATIBLE.RDBMS lower
than 11.1, all extents are a fixed size of one allocation unit (AU). If
COMPATIBLE.RDBMS is 11.1 or higher, extent sizes increase for data files
as the file size grows. A multi-AU extent consists of multiple
contiguous allocation units on the same disk. The first 20,000 extents
in a file are one AU. The next 20,000 extents are four AUs. All extents
beyond 40,000 are 16 AUs. This allows ASM to address larger files more
efficiently. For the first 20,000 extents, allocation occurs exactly as
with fixed-extent-size files. For multi-AU extents, ASM must find
contiguous extents on a disk. ASM’s allocation pattern tends to
concentrate allocations in the lower-numbered AUs and leave free space
in the higher-numbered AUs. File shrinking or deletion can lead to free
space fragmentation. ASM attempts to maintain defragmented disks during
disk group rebalance. If during file allocation ASM is unable to find
sufficient contiguous space on a disk for a multi-AU extent, it
consolidates the free space until enough contiguous space is available
for the allocation. Defragmentation uses the relocation locking
mechanism described later in this chapter.
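Variable-sized extents therefore depend on the disk group compatibility settings. A sketch for checking and advancing COMPATIBLE.RDBMS on 11.1 and later (the disk group name and target value are illustrative; compatibility can only be advanced, never lowered):

SELECT name, value
FROM   v$asm_attribute
WHERE  group_number = 1
AND    name LIKE 'compatible%';

ALTER DISKGROUP data SET ATTRIBUTE 'compatible.rdbms' = '11.1';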
ASM Striping
ASM offers two types of striping for files.
Coarse striping is done at the AU level. Fine-grained striping is done
at 128K granularity. Coarse striping provides better throughput, whereas
fine-grained striping provides better latency. The file template
indicates which type of striping is performed for the file. If a
template is not specified during file creation, ASM uses the default
template for the file type.
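The striping assigned to each file type can be inspected through the disk group's template definitions; a sketch using the standard V$ASM_TEMPLATE view:

-- STRIPE shows COARSE or FINE; SYSTEM indicates a default (system) template.
SELECT name, stripe, redundancy, system
FROM   v$asm_template
WHERE  group_number = 1
ORDER  BY name;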
ASM striping is performed logically at the
client level. In other words, RDBMS instances interpret extent maps
differently based on the type of striping for the file. Striping is
opaque to ASM operations such as extent relocation; however, ASM takes
striping into account during file allocation.
Coarse Striping
With single AU extents, each extent contains
a contiguous logical chunk of the file. Because ASM distributes files
evenly across all disks at the extent level, such files are effectively
striped at AU-sized boundaries.
With multi-AU extents, the ASM file is
logically striped at the AU level across a stripe set of eight disks.
For instance, with 1MB AUs, the first MB of the file is written to the
first disk, the second MB is written to the second disk, the third MB is
written to the third disk, and so on. The file allocation dictates the
set of eight disks involved in the stripe set and the order in which
they appear in the stripe set. If fewer than eight disks exist in the
disk group, then disks may be repeated within a stripe set.
Figure 9-1
shows a file with coarse striping and 1MB fixed extents. The logical
database file is 6.5MB. In the figure, each letter represents 128K, so
uppercase A through Z followed by lowercase a through z
cover 6.5MB. Each extent is 1MB (represented by eight letters, or 8 ×
128K) and holds the contiguous logical content of the database file. ASM
must allocate seven extents to hold a 6.5MB file with coarse striping.
Figure 9-2
represents a file with variable-sized extents and coarse striping. In
order for the figure to fit on one page, it represents a disk group with
eight disks. The file represented is 20,024MB. In the figure, each
capital letter represents 1MB in the database file. The file has 20,008
extents. The first 20,000 are one AU, whereas the next eight are eight
AU extents. Notice that the eight AU extents are allocated in sets of
eight so that ASM can continue to stripe at the AU level. This file uses
20,064MB of space in the disk group.
Fine-Grained Striping
Fine-grained striped ASM files are logically
striped at 128K. For each set of eight extents, the file data is
logically striped in a round-robin fashion: bytes 0K through 127K go on
the first disk, 128K through 255K on the second disk, 256K through 383K
on the third disk, and so on. The file allocation determines the set of
eight disks involved in the stripe set and the order in which they
appear in the stripe set. If fewer than eight disks exist in the disk
group, then disks may be repeated within a stripe set.
MYTH
Fine-grained striped ASM files have smaller extent sizes than files with coarse striping.
Figure 9-3
shows a file with fine-grained striping and 1MB fixed extents. The
logical database file is 6.5MB (and is the same as the file shown in Figure 9-1). In the figure, each letter represents 128K, so uppercase A through Z followed by lowercase a through z
cover 6.5MB. Each extent is 1MB (representing eight letters or 8 ×
128K). ASM must allocate eight extents to hold a 6.5MB file with
fine-grained striping. As described previously, the logical contents of
the database file are striped across the eight extents in 128K stripes.
ASM Redundancy
Unlike traditional volume managers, ASM
mirrors extents rather than mirroring disks. ASM’s extent-based
mirroring provides load balancing and hardware utilization advantages
over traditional redundancy methods.
In traditional RAID 1 mirroring, the disk that
mirrors a failed disk suddenly must serve all the reads that were
intended for the failed drive, and it also serves all the reads needed
to populate the hot spare disk that replaces the failed drive. When an
ASM disk fails, the load of the I/Os that would have gone to that disk
is distributed among the disks that contain the mirror copies of extents
on the failed disk.
MYTH
ASM mirroring makes two disks look identical.
RAID 5 and RAID 1 redundancy schemes have a
rigid formula that dictates the placement of the data relative to the
mirror or parity copies. As a result, these schemes require hot spare
disks to replace drives when they fail. “Hot spare” is a euphemism for
“usually idle.” During normal activity, applications cannot take
advantage of the I/O capacity of a hot spare disk. ASM, on the other
hand, does not require any hot spare disks; it requires only spare
capacity. During normal operation, all disks are actively serving I/O
for the disk group. When a disk fails, ASM’s flexible extent pointers
allow extents from a failed disk to be reconstructed from the mirror
copies distributed across the disk partners of the failed disk. The
reconstructed extents can be placed on the remaining disks in the disk
group. See Chapter 12 for the specialized layout and configurations on Engineered Systems.
Failure Groups
Because of the physical realities of disk
connectivity, disks do not necessarily fail independently. Certain disks
share common single points of failure, such as a power supply, a host
bus adapter (HBA), or a controller. Different users have different ideas
about which component failures they want to be able to tolerate. Users
can specify the shared components whose failures they want to tolerate
by specifying failure groups. By default, each disk is in its own
failure group; in other words, if the failgroup specification is
omitted, ASM automatically places each disk into its own failgroup. The
only exceptions are Exadata and ODA. In Exadata, all disks from the same
storage cell are automatically placed in the same failgroup. In ODA,
disks from specific slots in the array are automatically put into a
specific failgroup.
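A sketch of a normal-redundancy disk group created with explicit failure groups, one per storage controller (the disk group name, failgroup names, and device paths are hypothetical):

CREATE DISKGROUP data NORMAL REDUNDANCY
  FAILGROUP controller_a DISK '/dev/mapper/ctrla_disk1', '/dev/mapper/ctrla_disk2'
  FAILGROUP controller_b DISK '/dev/mapper/ctrlb_disk1', '/dev/mapper/ctrlb_disk2';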
ASM allocates mirror extents such that they
are always in a different failure group than the primary extent. In the
case of high-redundancy files, each extent copy in an extent set is in a
different failure group. By placing mirror copies in different failure
groups, ASM guarantees that even with the loss of all disks in a failure
group, a disk group will still have at least one copy of every extent
available. The way failure groups are managed and extents are
distributed has been specialized and specifically designed to take
advantage of the Engineered Systems. See Chapter 12 for the details on disk, failure group, and recovery management on Engineered Systems.
Disk Partners
ASM disks store the mirror copies of their
extents on their disk partners. A disk partnership is a symmetric
relationship between two disks in the same disk group but different
failure groups. ASM selects partners for a disk from failure groups
other than the failure group to which the disk belongs, but an ASM disk
may have multiple partners that are in the same failure group. Limiting
the number of partners for each disk minimizes the number of disks whose
overlapping failures could lead to loss of data in a disk group.
Consider a disk group with 100 disks. If an extent could be mirrored
between any two disks, the failure of any two of the 100 disks could
lead to data loss. If a disk mirrors its extents on only up to 10 disks,
then when one disk fails, its 10 partners must remain online until the
lost disk’s contents are reconstructed by rebalance, but a failure of
any of the other 89 disks could be tolerated without loss of data.
ASM stores the partnership information in the
Partnership Status Table (PST). Users do not specify disk partners; ASM
chooses disk partners automatically based on the disk group’s failure
group definitions. A disk never has any partners in its own failure
group. Disk partnerships may change when the disk group configuration
changes. Disks have a maximum of 10 active partnerships; typically, a disk has eight. When disk group reconfigurations cause
disks to drop existing partnerships and form new partnerships, the PST
tracks the former partnerships until rebalance completes to ensure that
former partners no longer mirror any extent between them. PST space restrictions limit each disk to a total of 20 current and former partners.
For this reason, it is more efficient to perform disk reconfiguration
(that is, to add or drop a disk) in batches. Too many nested disk group
reconfigurations can exhaust the PST space and result in an “ORA-15074:
diskgroup requires rebalance completion” error message.
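For example, rather than issuing a series of single-disk changes, a batched reconfiguration can be expressed in one statement so that partnerships are recalculated only once (the names, paths, and power value here are illustrative):

ALTER DISKGROUP data
  ADD  DISK '/dev/mapper/new_disk1', '/dev/mapper/new_disk2'
  DROP DISK data_0007
  REBALANCE POWER 4;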
Allocation with ASM Redundancy
File allocation in normal- and
high-redundancy disk groups is similar to allocation in external
redundancy disk groups. ASM uses the same algorithm for allocating the
primary extent for each extent set. ASM then balances mirror extent
allocation among the partners of the disk that contains the primary
extent.
Because mirror extent placement is dictated by
disk partnerships, which are influenced by failure group definitions,
it is important that failure groups in a disk group be of similar sizes.
If some failure groups are smaller than others, ASM may return
out-of-space errors during file allocation if the smaller failure group
fills up sooner than the other failure groups.
I/O to ASM Mirrored Files
When all disks in an ASM disk group are
online, ASM writes in parallel to all copies of an extent and reads from
the primary extent. Writes to all extent copies are necessary for
correctness. Reads from the primary extent provide the best balance of
I/O across disks in a disk group.
MYTH
ASM needs to read from all mirror sides to balance I/O across disks.
Most traditional volume managers distribute
reads evenly among the mirror sides in order to balance I/O load among
the disks that mirror each other. Because each ASM disk contains a
combination of primary and mirror extents, I/O to primary extents of ASM
files spreads the load evenly across all disks. Although the placement
of mirror extents is constrained by failure group definitions, ASM can
allocate primary extents on any disk to optimize for even distribution
of primary extents.
Read Errors
When an RDBMS instance encounters an I/O
error trying to read a primary extent, it will try to read the mirror
extent. If the read from the mirror extent is successful, the RDBMS can
satisfy the read request, and upper-layer processing continues as usual.
For a high-redundancy file, the RDBMS tries to read the second mirror
extent if the first mirror extent returns an I/O error. If reads fail to
all extent copies, the I/O error propagates to the upper layers, which
take the appropriate action (such as taking a tablespace offline). A
read error in an RDBMS instance never causes a disk to go offline.
The ASM instance handles read errors in a
similar fashion. If ASM is unable to read any copy of a virtual metadata
extent, it forces the dismounting of the disk group. If ASM is unable
to read physical metadata for a disk, it takes the disk offline, because
physical metadata is not mirrored.
Read errors can be due to the loss of access
to the entire disk or due to bad sectors on an otherwise healthy disk.
ASM tries to recover from bad sectors on a disk. Read errors in the
RDBMS or ASM instance trigger the ASM instance to attempt bad block
remapping. ASM reads a good copy of the extent and copies it to the disk
that had the read error. If the write to the same location succeeds,
the underlying allocation unit is deemed healthy (because the underlying
disk likely did its own bad block relocation). If the write fails, ASM
attempts to write the extent to a new allocation unit of the same disk.
If that write succeeds, the original allocation unit is marked as
unusable. If the write to the new allocation unit fails, the disk is
taken offline. The process of relocating a bad allocation unit uses the
same locking logic discussed later in this chapter for rebalance.
One unique benefit of ASM-based mirroring is
that the RDBMS instance is aware of the mirroring. For many types of
logical file corruptions, if the RDBMS instance reads unexpected data
(such as a bad checksum or incorrect System Change Number [SCN]) from
the primary extent, the RDBMS proceeds through the mirror sides looking
for valid content. If the RDBMS can find valid data on an alternate
mirror, it can proceed without errors (although it will log the problem
in the alert log). If the process in the RDBMS that encountered the bad read
is in a position to obtain the appropriate locks to ensure data
consistency, it writes the correct data to all mirror sides.
Write Errors
When an RDBMS instance encounters a write
error, it sends to the ASM instance a disk offline message indicating
which disk had the write error. If the RDBMS can successfully complete a
write to at least one extent copy and receive acknowledgment of the
offline disk from the ASM instance, the write is considered successful
for the purposes of the upper layers of the RDBMS. If writes to all
mirror sides fail, the RDBMS takes the appropriate actions in response
to a write error (such as taking a tablespace offline).
When the ASM instance receives a write error
message from an RDBMS instance (or when an ASM instance encounters a
write error itself), ASM attempts to take the disk offline. ASM consults
the PST to see whether any of the disk’s partners are offline. If too
many partners are already offline, ASM forces the dismounting of the
disk group on that node. Otherwise, ASM takes the disk offline. ASM also
tries to read the disk header for all the other disks in the same
failure group as the disk that had the write error. If the disk header
read fails for any of those disks, they are also taken offline. This
optimization allows ASM to handle potential disk group reconfigurations
more efficiently.
Disk offline is a global operation. The ASM
instance that initiates the offline sends a message to all other ASM
instances in the cluster. The ASM instances all relay the offline status
to their client RDBMS instances. If COMPATIBLE.RDBMS is less than 11.1,
ASM immediately force-drops disks that have gone offline. If
COMPATIBLE.RDBMS is 11.1 or higher, disks stay offline until the
administrator issues an online command or until the timer specified by
the DISK_REPAIR_TIME attribute expires. If the timer expires, ASM
force-drops the disk. The “Resync” section later in this chapter
describes how ASM handles I/O to offline disks.
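A sketch of the related commands, setting the offline timer and bringing a repaired disk back online before the timer expires (the disk group name, disk name, and repair time are illustrative):

ALTER DISKGROUP data SET ATTRIBUTE 'disk_repair_time' = '8h';

ALTER DISKGROUP data ONLINE DISK data_0005;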
Mirror Resilvering
Because the RDBMS writes in parallel to
multiple copies of extents, in the case of a process or node failure, it
is possible that a write has completed to one extent copy but not
another. Mirror resilvering ensures that two mirror sides (that may be
read) are consistent.
MYTH
ASM needs a Dirty Region Log to keep mirrors in sync after process or node failure.
Most traditional volume managers handle mirror
resilvering by maintaining a Dirty Region Log (DRL). The DRL is a
bitmap with 1 bit per chunk (typically 512K) of the volume. Before a
mirrored write can be issued, the DRL bit for the chunk must be set on
disk. This results in a performance penalty if an additional write to
the DRL is required before the mirrored write can be initiated. A write
is unnecessary if the DRL bit is already set because of a prior write.
The volume manager lazily clears DRL bits as part of other DRL accesses.
Following a node failure, for each region marked in the DRL, the volume
manager copies one mirror side to the other as part of recovery before
the volume is made available to the application. The lazy clearing of
bits in the DRL and coarse granularity inevitably result in more data
than necessary being copied during traditional volume manager
resilvering.
Because the RDBMS must recover from process
and node failure anyway, RDBMS recovery also addresses potential ASM
mirror inconsistencies if necessary. Some files, such as archived logs,
are simply re-created if there was a failure while they were being
written, so no resilvering is required. This is possible because an
archive log’s creation is not committed until all of its contents are
written. For some operations, such as writing intermediate sort data,
the failure means that the data will never be read, so the mirror sides
need not be consistent and no recovery is required. If the redo log
indicates that a data file needs a change applied, the RDBMS reads the
appropriate blocks from the data file and checks the SCN. If the SCN in
the block indicates that the change must be applied, the RDBMS writes
the change to both mirror sides (just as with all other writes). If the
SCN in the block indicates that the change is already present, the RDBMS
writes the block back to the data file in case an unread mirror side
does not contain the change. The resilvering of the redo log itself is
handled by reading all the mirror sides of the redo log and picking the
valid version during recovery.
The ASM resilvering algorithm also provides
better data integrity guarantees. For example, a crash could damage a
block that was in the midst of being written. Traditional volume manager
resilvering has a 50-percent chance of copying the damaged block over
the good block as part of applying the Dirty Region Log (DRL). When
RDBMS recovery reads a block, it verifies that the block is not corrupt.
If the block is corrupt and mirrored by ASM, each mirrored copy is
examined to find a valid version. Thus, a valid version is always used
for ASM resilvering.
ASM’s mirror resilvering is more efficient and
more effective than traditional volume managers, both during normal I/O
operation and during recovery.
Preferred Read
Some customers deploy extended clusters for
high availability. Extended clusters consist of servers and storage that
reside in separate data centers at sites that are usually several
kilometers apart. In these configurations, ASM normal-redundancy disk
groups have two failure groups: one for each site. From a given server
in an extended cluster, the I/O latency is greater for the disks at the
remote site than for the disks in the local site. When COMPATIBLE.RDBMS
is set to 11.1 or higher, users can set the
ASM_PREFERRED_READ_FAILURE_GROUPS initialization parameter for each ASM
instance to specify the local failure group in each disk group for that
instance. When this parameter is set, reads are preferentially directed
to the disk in the specified failure group rather than always going to
the primary extent first. In the case of I/O errors, reads can still be
satisfied from the nonpreferred failure group.
To ensure that all mirrors have the correct data, writes must still go to all copies of an extent.
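A sketch of setting the parameter on one ASM instance of an extended cluster (the instance SID, disk group, and failure group names are hypothetical):

-- Each ASM instance names its own local failure group, in diskgroup.failgroup form.
ALTER SYSTEM SET asm_preferred_read_failure_groups = 'DATA.SITEA' SID = '+ASM1';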
Rebalance
Whenever the disk group configuration
changes—whenever a disk is added, dropped, or resized—ASM rebalances the
disk group to ensure that all the files are evenly spread across the
disks in the disk group. ASM creates a COD entry to indicate that a
rebalance is in progress. If the instance terminates abnormally, a
surviving instance restarts the rebalance. If no other ASM instance is running, the rebalance restarts when the failed instance is restarted.
MYTH
An ASM rebalance places data on disks based on I/O statistics.
MYTH
ASM rebalances dynamically in response to hot spots on disks.
ASM spreads every file evenly across all the
disks in the disk group. Because of the access patterns of the database,
including the caching that occurs in the database’s buffer cache, ASM’s
even distribution of files across the disks in a disk group usually
leads to an even I/O load on each of the disks as long as the disks
share similar size and performance characteristics.
Because ASM maintains pointers to each extent
of a file, rebalance can minimize the number of extents that it moves
during configuration changes. For instance, adding a fifth disk to an
ASM disk group with four disks results in moving 20 percent of the
extents. If modulo arithmetic were used—changing placement to every
fifth extent of a file on each disk from every fourth extent of a file
on each disk—then almost 100 percent of the extents would have needed to
move. Figures 9-4 and 9-5
show the differences in data movement required when adding one disk
to a four-disk disk group using ASM rebalance and traditional modulo
arithmetic.
The RBAL process on the ASM instance that
initiates the rebalance coordinates rebalance activity. For each file in
the disk group—starting with the metadata files and continuing in
ascending file number order—RBAL creates a plan for even placement of
the file across the disks in the disk group. RBAL dispatches the
rebalance plans to the ARBn processes in the instance. The number of ARBs is dictated by the power of the rebalance. Each ARBn
relocates the appropriate extents according to its plan. Coordination
of extent relocation among the ASM instances and RDBMS instances is
described in the next section. In 11.2.0.2 and above (with COMPATIBLE.ASM set to 11.2.0.2), the same workflow occurs, except there
is only one ARB0 process that performs all the relocations. The
rebalance power dictates the number of outstanding async I/Os that will
be issued or in flight. See the “Relocation” section for details on
relocation mechanisms.
ASM displays the progress of the rebalance in
V$ASM_OPERATION. ASM estimates the number of extents that need to be
moved and the time required to complete the task. As rebalance
progresses, ASM reports the number of extents moved and updates the
estimates.
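A sketch of starting a rebalance at a higher power and monitoring its progress (the disk group name and power value are illustrative):

ALTER DISKGROUP data REBALANCE POWER 8;

SELECT group_number, operation, state, power, sofar, est_work, est_minutes
FROM   v$asm_operation;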
Resync
Resync is the operation that restores the
updated contents of disks that are brought online from an offline state.
Bringing online a disk that has had a transient failure is more
efficient than adding the disk back to the disk group. When you add a
disk, all the contents of the disk must be restored via a rebalance.
Resync restores only the extents that would have been written while the
disk was offline.
When a disk goes offline, ASM distributes an
initialized slot of the Staleness Registry (SR) to each RDBMS and ASM
instance that has mounted the disk group with the offline disk. Each SR
slot has a bit per allocation unit in the offline disk. Any instance
that needs to perform a write to an extent on the offline disk
persistently sets the corresponding bit in its SR slot and performs the
writes to the extent set members on other disks.
When a disk is being brought online, the disk
is enabled for writes, so the RDBMS and ASM instances stop setting bits
in their SR slots for that disk. Then ASM reconstructs the allocation
table (AT) and free space table for the disk. Initially, all the AUs
(except for the physical metadata and AUs marked as unreadable by block
remapping) are marked as free. At this point, deallocation is allowed on
the disk. ASM then examines all the file extent maps and updates the AT
to reflect the extents that are allocated to the disk being brought
online. The disk is enabled for new allocations. To determine which AUs
are stale, ASM performs a bitwise OR of all the SR slots for the disk.
Those AUs with set bits are restored using the relocation logic
described in the following section. After all the necessary AUs have
been restored, ASM enables the disk for reads. When the online is
complete, ASM clears the SR slots.
Relocation
Relocation is the act of moving an extent
from one place to another in an ASM disk group. Relocation takes place
most commonly during rebalance, but also occurs during resync and during
bad block remapping. The relocation logic guarantees that the active
RDBMS clients always read the correct data and that no writes are lost
during the relocation process.
Relocation operates on a per-extent basis. For
a given extent, ASM first checks whether the file is open by any
instance. If the file is closed, ASM can relocate the file without
messaging any of the client instances. Recovery Manager (RMAN) backup
pieces and archive logs often account for a large volume of the space
used in ASM disk groups, especially if the disk group is used for the
Fast Recovery Area (FRA), but these files are usually not open. For
closed files, ASM reads the extent contents from the source disk and
writes them to the specified location on the target disk. ASM keeps
track of the relocations that are ongoing for closed files so that if a
client opens a file which is in the middle of relocation, the
appropriate relocation messages are sent to the client as part of the
file open.
If the file being relocated is open, the
instance performing the relocation sends to all other ASM instances in
the cluster a message stating its intent to relocate an extent. All ASM
instances send a message, which indicates the intent to relocate the
extent over the umbilicus to any RDBMS clients that have the file open.
At this point, the RDBMS instances delay any new writes until the new
extent location is available. Because only one extent is relocated at a time, the chances are very small that an RDBMS instance will need to write to that extent while it is being relocated. Each RDBMS instance
acknowledges receipt of this message to its ASM instance. The ASM
instances in turn acknowledge receipt to the coordinating ASM instance.
After all acknowledgments arrive, the coordinating ASM instance reads
the extent contents from the source disk and writes them to the target
disk. After this is complete, the same messaging sequence occurs, this
time to indicate the new location of the extent set. At this time,
in-flight reads from the old location are still valid. Subsequent reads
are issued to the new location. Writes that were in flight must be
reissued to the new location. Writes that were delayed based on the
first relocation message can now be issued to the new location. After
all in-flight writes issued before the first relocation message are
complete, the RDBMS instances use the same message-return sequence to
notify the coordinating ASM instance that they are no longer referencing
the old extent location. At this point, ASM can return the allocation
unit(s) from the old extent to the free pool. Note that the normal RDBMS
synchronization mechanisms remain in effect in addition to the
relocation logic.
ASM Instance Recovery and Crash Recovery
When ASM instances exit unexpectedly due to a
shutdown abort or an ASM instance crash, ASM performs recovery. As with
the RDBMS, ASM can perform crash recovery or instance recovery. ASM
performs instance recovery when an ASM instance mounts a disk group
following abnormal instance termination. In a clustered configuration,
surviving ASM instances perform crash recovery following the abnormal
termination of another ASM instance in the cluster. ASM recovery is
similar to the recovery performed by the RDBMS. For ASM recovery, the
ACD is analogous to the redo logs in the RDBMS instance. The ASM COD is
similar to the undo tablespaces in the RDBMS. With ASM, recovery is
performed on a disk group basis. In a cluster, different ASM instances
can perform recovery on different disk groups.
CSS plays a critical role in RAC database
recovery. With RAC databases, CSS ensures that all I/O-capable processes
from a crashed RDBMS instance have exited before a surviving RDBMS
instance performs recovery. The OS guarantees that terminated processes
do not have any pending I/O requests. This effectively fences off the
crashed RDBMS instance. The surviving RDBMS instances can then proceed
with recovery with the assurance that the crashed RDBMS instances cannot
corrupt the database.
CSS plays a similar role in ASM recovery. Even
a single-node ASM deployment involves multiple Oracle instances. When
an ASM instance mounts a disk group, it registers with CSS. Each RDBMS
instance that accesses ASM storage registers with CSS, which associates
the RDBMS instance with the disk group that it accesses and with its
local ASM instance. Each RDBMS instance has an umbilicus connection
between its ASMB process and the UFG in the local ASM instance. Because
ASMB is a fatal background process in the RDBMS, and the failure of a
UFG kills ASMB, the failure of an ASM instance promptly terminates all
its client RDBMS instances. CSS tracks the health of all the I/O-capable
processes in RDBMS and ASM instances. If an RDBMS client terminates,
CSS notifies the local ASM instance when all the I/O-capable processes
have exited. At this point, ASM can clean up any resource held on behalf
of the terminated RDBMS instances. When an ASM instance exits
abnormally, its client RDBMS instances terminate shortly thereafter
because of the severed umbilicus connection. CSS notifies the surviving
ASM instances after all the I/O-capable processes from the terminated
ASM instance and its associated RDBMS instances have exited. At this
point, the surviving ASM instances can perform recovery knowing that the
terminated instance can no longer perform any I/Os.
During ASM disk group recovery, an ASM
instance first applies the ACD thread associated with the terminated ASM
instance. Applying the ACD records ensures that the ASM cache is in a
consistent state. The surviving ASM instance will eventually flush its
cache to ensure that the changes are recorded in the other persistent
disk group data structures. After ACD recovery is complete, ASM performs
COD recovery. COD recovery performs the appropriate actions to handle
long-running operations that were in progress on the terminated ASM
instance. For example, COD recovery restarts a rebalance operation that
was in progress on a terminated instance. It rolls back file creations
that were in progress on terminated instances. File creations are rolled
back because the RDBMS instance could not have committed a file
creation that was not yet complete. The “Mirror Resilvering” section
earlier in this chapter describes how ASM ensures mirror consistency
during recovery.
ASM recovery is based on the same technology that the Oracle Database has used successfully for years.
Disk Discovery
Disk discovery is the process of examining
disks that are available to an ASM instance. ASM performs disk discovery
in several scenarios, including the following:
Mount disk group
Create disk group
Add disk
Online disk
Select from V$ASM_DISKGROUP or V$ASM_DISK
The set of disks that ASM examines is called the discovery set
and is constrained by the ASM_DISKSTRING initialization parameter and
the disks to which ASM has read/write permission. The default value for
ASM_DISKSTRING varies by platform, but usually matches the most common
location for devices on the platform. For instance, on Linux, the
default is /dev/sd*. On Solaris, the default is /dev/rdsk/*. The
ASM_DISKSTRING value should be broad enough to match all the disks that
will be used by ASM. Making ASM_DISKSTRING more restrictive than the
default can make disk discovery more efficient.
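For example, on a system where multipath pseudo-devices are presented under a dedicated directory, the parameter might be narrowed as follows (the path pattern is hypothetical):

ALTER SYSTEM SET asm_diskstring = '/dev/mapper/asm*';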
MYTH
ASM persistently stores the path to each disk in a disk group.
The basic role of disk discovery is to allow
ASM to operate on disks based on their content rather than the path to
them. This allows the paths to disks to vary between reboots of a node
and across nodes of a cluster. It also means that ASM disk groups are
self-describing: All the information required to mount a disk group is
found on the disk headers. Users need not enumerate the paths to each
disk in a disk group to mount it. This makes ASM disk groups easily
transportable from one node to another.
Mount Disk Group
When mounting a disk group, ASM scans all
the disks in the discovery set to find each disk whose header indicates
that it is a member of the specified disk group. ASM also inspects the
PSTs to ensure that the disk group has a quorum of PSTs and that it has a
sufficient number of disks to mount the disk group. If discovery finds
each of the disks listed in the PST (and does not find any duplicate
disks), the mount completes and the mount timestamp is updated in each
disk header.
If some disks are missing, the behavior
depends on the version of ASM and the redundancy of the disk group. For
external-redundancy disk groups, the mount fails if any disk is missing.
For normal-redundancy disk groups, the mount can succeed as long as none of the missing disks are partners of each other, but only if the instance is the first in the cluster to mount the disk group. If the disk group is already mounted by other instances, the mount will fail. The rationale for this behavior is that if the missing disk had
truly failed, the instances with the disk group mounted would have
already taken the disk offline. As a result, the missing disk on the
node trying to mount is likely to be the result of local problems, such
as incorrect permissions, an incorrect discovery string, or a local
connection problem. Rather than continue this inconsistent mount behavior, ASM in Oracle Database 11g introduced a FORCE option to mount. If you do not specify FORCE (which is equivalent to specifying
NOFORCE), the mount fails if any disks are missing. If you specify FORCE
to the mount command, ASM takes the missing disks offline (if it finds a
sufficient number of disks) and completes the mount. A mount with the
FORCE option fails if no disks need to be taken offline as part of the
mount. This is to prevent the gratuitous use of FORCE during a mount.
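A sketch of the two mount forms (the disk group name is illustrative):

ALTER DISKGROUP data MOUNT;          -- equivalent to MOUNT NOFORCE
ALTER DISKGROUP data MOUNT FORCE;    -- offlines missing disks, if enough disks remain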
If multiple disks have the same name in the
disk header, ASM does its best to pick the right one for the mount. If
another ASM instance already has the disk group mounted, the mounting
ASM instance chooses the disk with the mount timestamp that matches the
mount timestamp cached in the instance that already has the disk group
mounted. If no other instance has the disk group mounted or if multiple
disks have the same ASM disk name and mount timestamp, the mount fails.
With multipathing, the ASM_DISKSTRING or disk permissions should be set
such that the discovery set contains only the path to the multipathing
pseudo-device. The discovery set should not contain each individual path
to each disk.
Create Disk Group
When creating a disk group, ASM writes a
disk header to each of the disks specified in the CREATE DISKGROUP SQL
statement. ASM recognizes disks that are formatted as ASM disks; such disks show MEMBER in the HEADER_STATUS field of V$ASM_DISK. Other recognizable patterns, such as those of database files, the Oracle Clusterware voting disk, or the Oracle Cluster Registry (OCR) stored on raw devices outside of ASM, show FOREIGN in HEADER_STATUS. By default, ASM disallows the use of disks that have a
header status of FOREIGN or MEMBER. Specifying the FORCE option to a
disk specification allows the use of disks with these header statuses.
ASM does not allow reuse of any disk that is a member of a mounted disk
group. ASM then verifies that it is able to find exactly one copy of
each specified disk by reading the headers in the discovery set. If any
disk is not found during this discovery, the disk group creation fails.
If any disk is found more than once, the disk group creation also fails.
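Header status can be checked before creating a disk group; a sketch using the standard V$ASM_DISK view:

-- CANDIDATE, PROVISIONED, or FORMER disks can be used directly;
-- MEMBER or FOREIGN disks require the FORCE option.
SELECT path, header_status, mount_status, name
FROM   v$asm_disk
ORDER  BY path;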
Add Disk
When adding a disk to a disk group, ASM
writes a disk header to each of the disks specified in the ADD DISK SQL
statement. ASM verifies that it can discover exactly one copy of the
specified disk in the discovery set. In clusters, the ASM instance
adding the disk sends a message to all peers that have the disk group
mounted to verify that they can also discover the new disk. If any peer
ASM instance with the disk group mounted fails to discover the new disk,
the ADD DISK operation fails. In this case, ASM marks the header as
FORMER. ADD DISK follows the same rules for FORCE option usage as CREATE
DISKGROUP.
Online Disk
The online operation specifies an ASM disk
name, not a path. The ASM instance searches for the disk with the
specified name in its discovery set. In a cluster, ASM sends a message
to all peer instances that have the disk group mounted to verify that
they can discover the specified disk. If any instance fails to discover
the specified disk, ASM returns the disk to the OFFLINE state.
Select from V$ASM_DISK and V$ASM_DISKGROUP
A select from V$ASM_DISK or V$ASM_DISKGROUP reads all the disk headers in the discovery set to populate the views.
V$ASM_DISK_STAT and V$ASM_DISKGROUP_STAT return the same information as
V$ASM_DISK and V$ASM_DISKGROUP, but the _STAT views do not perform
discovery. If any disks were added since the last discovery, they will
not be displayed in the _STAT views.
Summary
ASM is a sophisticated storage product that
performs complex internal operations to provide simplified storage
management to the user. Most of the operations described in this chapter
are transparent to the user, but understanding what happens behind the
curtains can be useful for planning and monitoring an ASM deployment.
CHAPTER 10 ACFS Design and Deployment
There is a big
movement in the industry toward consolidation and cloud computing. At
the core of all this is centralized, clustered storage, which requires a
centralized file store such as a file system.
In Oracle Grid Infrastructure 11g
Release 2, Oracle extends the capability of ASM by introducing the ASM
Cluster File System, or ACFS. ACFS is a feature-rich, scalable,
POSIX-compliant file system that is built from the traditional ASM disk
groups. In Oracle Clusterware 11g Release 2 (11.2.0.3), Oracle introduced a
packaging suite called the Oracle Cluster File System, Cloud Edition,
which provides a clustered file system for cloud storage applications,
built on Automatic Storage Management, and includes advanced data
management and security features. Oracle Cluster File System, Cloud
Edition includes the following core components:
Automatic Storage Management (ASM)
ASM Dynamic Volume Manager (ADVM)
ASM Cluster File System (ACFS)
In addition, it provides the following services:
ACFS file tagging for aggregate operations
ACFS read-only snapshots
ACFS continuous replication
ACFS encryption
ACFS realm-based security
This chapter focuses on ACFS, its underlying architecture, and the associated data services.
ASM Cluster File System Overview
ACFS is a fully POSIX-compliant file system
that can be accessed through native OS file system tools and APIs.
Additionally, ACFS can be exported and accessed by remote clients via
standard NAS file access protocols such as NFS and CIFS. In addition to
ACFS, Oracle introduces ASM Dynamic Volume Manager (ADVM) to provide
scalable volume management capabilities to support ACFS. Despite its
name, ACFS can be utilized as a local file system as well as a cluster
file system.
ACFS provides shared, cluster-wide access to
various file types, including database binaries, trace files, Oracle
BFILEs, user application data and reports, and other non-database
application data. Storing these file types in ACFS inherently provides
consistent global namespace support, which is the key to enabling
uniform, cluster-wide file access and file pathing of all file system
data. In the past, getting a consistent namespace across cluster nodes meant implementing a third-party cluster file system solution or NFS.
ACFS is built on top of the standard vnode/VFS
file system interface ubiquitous in the Unix/Linux world, and it uses
Microsoft standard interfaces on Windows. ACFS supports Windows APIs,
command-line interfaces (CLIs) and tools (including Windows Explorer).
ACFS uses standard file-related system calls like other file systems in
the market. However, most file systems on the market are platform
specific; this is especially true of cluster file systems. ACFS is a
multiplatform cluster file system for Linux, Unix, and Windows, with the
same features across all OS and platforms.
ACFS provides benefits in the following key areas:
Performance ACFS
is an extent-based clustered file system that provides metadata caching
and fast directory lookups, thus enabling fast file access. Because
ACFS leverages ASM functionality in the back end, it inherits a wide
breadth of feature functionality already provided by ASM, such as
maximized performance via ASM’s even extent distribution across all
disks in the disk group.
Manageability A
key aspect of any file system container is that it needs to be able to
grow as the data requirements change, and this needs to be performed
with minimized downtime. ACFS provides the capability to resize the file
system while the file system is online and active. In addition, ACFS
supports other storage management services, such as file system
snapshots. ACFS (as well as its underlying volumes) can be easily
managed using various command-line interfaces or graphical tools such as
ASM Configuration Assistant (ASMCA) and Oracle Enterprise Manager (EM).
This wide variety of tools and utilities allows ACFS to be easily
managed and caters to both system administrators and DBAs. Finally, ACFS
management is integrated with Oracle Clusterware, which allows
automatic startup of ACFS drivers and mounting of ACFS file systems
based on dependencies set by the administrator.
Availability ACFS,
via journaling and checksums, provides the capability to quickly
recover from file system outages and inconsistencies. ACFS checksums
cover ACFS metadata. Any metadata inconsistency will be detected by
checksum and can be remedied by performing fsck.
ACFS also leverages ASM mirroring for
improved storage reliability. Additionally, because ACFS is tightly
integrated with Oracle Clusterware, all cluster node membership and
group membership services are inherently leveraged. Table 10-1 provides a summary of the features introduced in each of the 11gR2 patch set releases.
NOTE
ACFS is supported on Oracle Virtual Machine (OVM) in both Paravirtualized and Hardware Virtualized guests.
ACFS File System Design
This section illustrates the inner workings of the ACFS file system design—from the file I/O to space management.
ACFS File I/O
All ACFS I/O goes through ADVM, which maps
these I/O requests to the actual physical devices based on the ASM
extent maps it maintains in kernel memory. ACFS, therefore, inherits all
the benefits of ASM’s data management capability.
ACFS supports POSIX read/write semantics for
the cluster environment. This essentially means that writes from
multiple applications to the same file concurrently will result in no
interlaced writes. In 11.2.0.4, ACFS introduces support for flock/fcntl with cluster-wide serialization; an exclusive whole-file lock taken on one node is therefore honored by the other nodes of the cluster.
ACFS shares the same cluster interconnect as
Oracle RAC DB, Clusterware, and ASM; however, no ACFS I/Os travel across
the interconnect. In other words, all ACFS I/Os are issued directly to
the ASM storage devices. ACFS maintains a coherent distributed ACFS file
cache (using the ACFS distributed lock manager) that provides a
POSIX-compliant shared read cache as well as exclusive write caching to
deliver “single-system POSIX file access semantics cluster-wide.” Note
that this is cache coherency for POSIX file access semantics. ACFS
delivers POSIX read-after-write semantics for a cluster environment.
When an application requests to write a file on a particular node, ACFS
will request an exclusive DLM lock. This, in turn, requires other
cluster nodes to release any shared read or exclusive write locks for
the file and to flush, if required, and invalidate any cached pages.
However, it is important to contrast this with POSIX application file
locking—cluster cache coherency provides POSIX access coherency
(single-node coherency) semantics. POSIX file locking provides for
application file access serialization, which is required if multiple
users/applications are sharing a given file for reading and writing.
Prior to 11.2.0.4, ACFS does not provide cluster-wide POSIX file locking; it does, however, support node-local POSIX file locking.
Starting in 11.2.0.3, ACFS supports direct
I/O—the ability to transfer data between user buffers and storage
without additional buffering done by the operating system. ACFS supports
the native file data caching mechanisms provided by different operating
systems. Any caching directives settable via the open(2) system call
are also honored. ACFS supports enabling direct I/O on a file-by-file
basis using OS-provided mechanisms; for example, using the O_DIRECT flag
to open(2) on Linux and the FILE_FLAG_NO_BUFFERING flag to CreateFile on
Windows. A file may be open in direct I/O mode and cached mode (the
default) at the same time. ACFS ensures that the cache coherency
guarantees mentioned earlier are always met even in such a mixed-usage
scenario.
ACFS Space Allocation
ACFS file allocation is very similar to
other POSIX-compliant file systems: The minimum file size is an ACFS
block (4KB), with metadata sized 512 bytes to 4KB. ACFS also does file
preallocation of storage as a means to efficiently manage file growth.
This applies to regular files as well as directories. This preallocation
amount is based on currently allocated storage.
The basic preallocation algorithm for regular files is best illustrated by the following three use-case examples:
If
a file is currently 64KB in size and the user writes 4KB to the file,
ACFS will allocate 64KB of storage, bringing the file to a total of
128KB.
If a file is 4MB in size and the user writes 4KB, ACFS will allocate 1MB of storage, bringing the file to a total of 5MB.
If a file has zero bytes of storage—which may be the case if the file is brand new or was truncated to zero bytes—then no preallocation will be done at write() time.
The scenario is slightly different for directories: each time a directory needs to be extended, ACFS preallocates an amount equal to the directory's current size, up to a maximum of 64KB per extension.
Thus, if a directory is 16KB and needs to be
extended, ACFS allocates another 16KB, bringing the directory to a total
of 32KB. If the directory needs to be extended again, ACFS will
allocate another 32KB, bringing it to a total of 64KB. Because the
directory is now 64KB, all future extensions will be 64KB.
In addition to the preceding scenarios, if the
desired storage cannot be contiguously allocated, ACFS will search for
noncontiguous chunks of storage to satisfy the request. In such cases,
ACFS will disable all preallocation on this node until either a thread on this node frees up some storage or a certain amount of time passes, in the hope that by then another node may have freed up some storage.
The salient aspects of ACFS file space management are:
ACFS
tries to allocate contiguous space for a file when possible, which
gives better performance for sequential reads and writes.
To improve performance of large file allocation, ACFS will preallocate the space when writing data.
This storage is not returned when the file is closed, but it is returned when the file is deleted.
ACFS
allocates local metadata files as nodes mount the file system for the
first time. This metadata space is allocated contiguously per node, with
the maximum being 20MB.
ACFS
maintains local bitmaps to reduce contention on the global storage
bitmap. This local bitmap becomes very important during the search for
local free space. Note that because of local space reservation (bitmap),
when disk space is running low, allocations may be successful on one
cluster node and fail on another. This is because no space is left in
the latter’s local bitmap or the global bitmap.
It is important to note that this metadata
space allocation may cause minor space discrepancies when used space is
displayed by a command such as Unix/Linux df (for example, df may report space as “in use” even though some of it has not actually been allocated yet). This local storage pool can be as
large as 128MB per node and can allow space allocations to succeed,
even though commands such as df report less space available than what is
being allocated.
Distributed Lock Manager (DLM)
DLM provides a means for cluster-wide
exclusive and shared locks. These DLM locks are taken whenever
cluster-wide consistency and cache coherency must be guaranteed, such as
for metadata updates and file reads and writes. ACFS uses the DLM
services provided by Oracle Kernel Services (OKS). At a high level, a
SHARED lock is requested when the cluster node is only going to read the protected entity. This SHARED lock is compatible with other SHARED
locks and is incompatible with other EXCLUSIVE locks. An EXCLUSIVE lock
is taken if the protected entity needs to be modified. Locks are cached
on that node until some other node requests them in a conflicting mode.
Therefore, a subsequent request for the same lock from the same node can
be serviced without sending any internode messages, and the lock grant
is quick.
When a node requests a lock that it does not
hold or if the lock is not cached on that node, a blocking asynchronous
system trap (BAST) is sent to the nodes that are holding the lock. If
one or more nodes are currently holding the lock in a conflicting mode,
the request is queued and is granted when the lock is released on that
node (if there is no other request ahead in the queue). ACFS has eight
threads (called “BAST handlers”) per node to service BAST requests.
These threads are started at ACFS kernel module load time and service all file systems mounted on the node; the handling of BAST requests can be observed in the system log (syslog).
Metadata Buffer Cache
ACFS caches metadata for performance
reasons. Some examples of metadata are inodes, data extent headers, and
directory blocks. Access to metadata is arbitrated using DLM locks, as
described earlier. All operations (such as file deletes and renames)
involving metadata updates are done within a transaction, and modified
metadata buffers are marked dirty in the buffer cache. In case of any
errors, the transaction is aborted and the metadata changes are
reverted. For all successfully completed transactions, dirty metadata
buffers are written to disk in a special location called the Volume Log.
Periodically, a kernel thread reads the Volume Log and applies these
metadata changes to the actual metadata locations on disk. Once written
into the Volume Log, a transaction can be recovered even if the node
performing that transaction crashes before the transaction is applied to
actual metadata locations.
Recovery
This section describes how the ACFS and ADVM layers handle various recovery scenarios.
ACFS and ADVM Recovery
All ACFS file systems must be cleanly
dismounted before ASM or the GI stack is stopped. Typically this is
internally managed by CRS; however, there are cases when emergency
shutdown of the stack is necessary. A forced shutdown of ASM via
SHUTDOWN ABORT should be avoided when there are mounted ACFS file
systems; rather, a clean ACFS dismount followed by a graceful shutdown
of ASM should be used.
When there is a component or node failure, a
recovery of the component and its underlying dependent resource is
necessary. For example, if an ACFS file system is forcibly dismounted
because of a node failure, then recovery is necessary for ASM, the ASM
Dynamic volume, as well as ACFS. This recovery process is implicitly
performed as part of the component restart process. For example, if an
ASM instance crashes due to node failure, then upon restart, ASM will
perform crash recovery, or in the case of RAC, a surviving instance will
perform instance recovery on the failed instance’s behalf. The ASM
Dynamic volume is implicitly recovered as part of ASM recovery, and ACFS
will be recovered using the ACFS Metadata Transaction Log.
The ADVM driver also supports cluster I/O
fencing schemes to maintain cluster integrity and a consistent view of
cluster membership. Furthermore, ASM Dynamic volume mirror recovery
operations are coordinated cluster-wide such that the death of a cluster
node or ASM instance does not result in mirror inconsistency or any
other form of data corruption. If ACFS detects inconsistent file
metadata returned from a read operation, based on the checksum, ACFS
takes the appropriate action to isolate the affected file system
components and generate a notification that fsck should be run as soon
as possible. Each time the file system is mounted, a notification is
generated with a system event logger message until fsck is performed.
Note that fsck only repairs ACFS metadata structures. If fsck runs
cleanly but the user still perceives the user data files to be
corrupted, the user should restore those files from a backup, but the
file system itself does not need to be re-created.
Unlike other file systems, when an ACFS
metadata write operation fails, the ACFS drivers do not interrupt or
notify the operating system environment; instead, ACFS isolates errors
by placing the file system in an offline error state. For these cases, a
umount of the “offlined” file system is required for that node. If a
node fails, then another node will recover the failed node’s transaction
log, assuming it can write the metadata out to the storage.
To recover from this scenario, the affected
underlying ADVM volumes must be closed and reopened by dismounting any
affected file systems. After the instance is restarted, the
corresponding disk group must be mounted with the volume enabled,
followed by a remount of the file system.
However, it might not be possible for an
administrator to dismount a file system while it is in the offline error
state if there are processes referencing the file system, such as a
directory of the file system being the current working directory for a
process. In these cases, to dismount the file system you need to
identify all processes on the node that have file system references to
files and directories. The Linux fuser and lsof commands will list
processes with open files.
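For example, assuming a file system mounted at an illustrative mount point /acfsdata, the following commands (run as root) list such processes:
# fuser -m /acfsdata
# lsof /acfsdata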
ADVM and Dirty Region Logging
If ASM Dynamic volumes are created in ASM
redundancy disk groups (normal or high), dirty region logging (DRL) is
enabled via an ASM DRL volume file.
The ADVM driver will ensure ASM Dynamic volume
mirror consistency and the recovery of only the dirty regions in cases
of node and ASM instance failures. This is accomplished using DRL, which
is an industry-common optimization for mirror consistency and the
recovery of mirrored extents.
ADVM Design
Most file systems are created from an OS
disk device, which is generally a logical volume device, created by a
logical volume manager (LVM). For example, a Linux ext3 file system is
generally created over a Linux LVM2 device with an underlying logical
volume driver (LVD). ACFS is similar in this regard; however, it is
created over an ADVM volume device file and all volume I/O is processed
through the ADVM driver.
In Oracle ASM 11g Release 2, ASM
introduces a new ASM file type, called asmvol, that is associated with
ADVM Dynamic volumes. These volume files are similar to other ASM file
types (such as archive logs, database data files, and so on) in that
once they are created their extents are evenly distributed across all
disks in the specified disk group. An ASM Dynamic volume file must be
wholly contained in a single disk group, but there can be many volume
files in one disk group.
As of 11.2.0.3, the default ADVM volume is now
allocated with 8MB extents across four columns and a fine-grained
stripe width of 128KB. ADVM writes data as 128KB stripe chunks in
round-robin fashion to each of the four columns and fills a stripe set
of four 8MB extents with 256 stripe chunks before moving to a second
stripe set of four 8MB extents. Although the ADVM Dynamic volume extent
size and the stripe columns can be optionally specified at volume
creation, this is not a recommended practice.
If the ASM disk group AU size is 8MB or less,
the ADVM extent size is 8MB. If the ASM disk group AU size is configured
larger than 8MB, ADVM extent size is the same as the AU size.
Note that setting the number of columns on an
ADVM dynamic volume to 1 effectively turns off fine-grained striping for
the ADVM volume, but maintains the coarse-grained striping of the ASM
file extent distribution. Consecutive stripe columns map to consecutive
ASM extents. For example, if in a normal ASM file, four ASM extents map
to two LUNs in alternating order, then the stripe-column-to-LUN mapping
works the same way.
This section covers how to create an ADVM
volume, which will subsequently be used to create an ACFS file system.
Note that these steps are not needed if you’re deploying an ACFS file
system using ASM Configuration Assistant (ASMCA). If you are not using
ASMCA, you will need to do this manually.
To create a volume in a previously created disk group, use the following command:
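A representative ASMCMD command (the volume name testvol is an assumption, not from the original listing) would be:
$ asmcmd volcreate -G DATA -s 25G testvol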
The -G flag specifies the disk group where
this volume will be created. This will create an ADVM volume file that
is 25GB. All the extents of this volume file are distributed across all
disks in the DATA disk group. Note that the volume name is limited to 11
characters on Linux and 23 characters on AIX.
Once the ASM Dynamic volume file is created
inside ASM, an ADVM volume device (OS device node) will be automatically
built in the /dev/asm directory. On Linux, udev functionality must be
enabled for this device node to be generated. A udev rules file in
/etc/udev/rules.d/55-usm.rules contains the rule for
/dev/asm/* and sets the group permission to asmadmin.
For clarity, we refer to the logical volume
inside ASM as the ASM Dynamic volume, and the OS device in /dev/asm as
the ADVM volume device. There is a direct one-for-one mapping between an
ASM Dynamic volume and its associated ADVM volume device:
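As a sketch using the earlier illustrative volume, both sides of this mapping can be listed as follows; the device typically appears as /dev/asm/testvol-263, where the numeric suffix will vary:
$ asmcmd volinfo -G DATA testvol
$ ls -l /dev/asm/testvol*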
Notice that the volume name is included as
part of the ADVM volume device name. The three-digit (suffix) number is
the ADVM persistent cluster-wide disk group number. It is recommended
that you provide a meaningful volume name so that the corresponding OS
device is easily identifiable.
The ADVM volume device filenames are unique
and persistent across all cluster nodes. The ADVM volume devices are
created when the ASM instance is active with the required disk group
mounted and dynamic volumes enabled.
Note that the on-disk format for ASM and ACFS
is consistent between 32-bit and 64-bit Linux. Therefore, customers
wanting to migrate their file system from 32-bit to 64-bit should not
have to convert their ASM- and ACFS-based metadata.
ACFS Configuration and Deployment
Generally, space, volume, and file system
management are performed with a typical workflow process. The following
describes how a traditional file system is created:
1. The SAN administrator carves out the LUNs based on performance and
availability criteria. These LUNs are zoned and presented to the OS as
disk devices.
2. The system administrator encapsulates these disks into a storage pool or
volume group. From this pool, logical volumes are created. Finally, file
systems are created over the logical volumes.
So how does this change in the context of
ACFS/ADVM/ASM? Not much. The following is the process flow for deploying
ACFS/ADVM/ASM:
1. The SAN administrator carves out LUNs based on the defined application
performance and availability SLA and using the ASM best practices, which
state that all disks in an ASM disk group should be of the same size
and have similar performance characteristics. This best practice and
guideline makes LUN management and provisioning easier.
Finally, the provisioned LUNs are zoned and presented to the OS as disk
devices.
2. The system administrator or DBA creates or expands an ASM disk group using
these disks. From the disk group, ASM (logical) volumes are created. If
using SQL*Plus or ASMCMD, the user needs to be connected as SYSASM for
ADVM or ACFS configuration. Alternatively, ASMCA or EM can be used.
Finally, file systems are created over these volume devices.
Note that although Oracle Grid Infrastructure 11g
Release 2 supports RH/EL 4, 5, and 6, ACFS/ADVM requires a minimum of
RH/EL 5.x. If you try to deploy ACFS on unsupported platforms, an error
message similar to the following will be displayed:
Configuring the Environment for ACFS
In this section, we discuss ACFS and ADVM concepts as well as cover the workflow in building the ACFS infrastructure.
Although you can configure/manage ACFS in
several ways, this chapter primarily illustrates ASMCMD and ASMCA. Note
that every ASMCMD command shown in this chapter can also be performed in
ASMCA, Enterprise Manager, or SQL*Plus. However, using ASMCA,
Enterprise Manager, or ASMCMD is the recommended method for
administrators managing ASM/ACFS.
The first task in setting up ACFS is to ensure
the underlying disk group is created and mounted. Creating disk groups
is described in Chapter 4.
Before the first volume in an ASM disk group
is created, any dismounted disk groups must be mounted across all ASM
instances in the cluster. This ensures uniqueness of all volume device
names. The ADVM driver cannot verify that all disk groups are mounted;
this must be ensured by the ASM administrator before adding the first
volume in a disk group.
Also, compatible.asm and compatible.advm must
be minimally set to 11.2.0.0 if this disk group is going to hold ADVM
Dynamic volumes. The compatible.rdbms can be set to any valid value.
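For illustration, the attributes could be set from SQL*Plus while connected to the ASM instance as SYSASM (compatible.asm must be advanced before compatible.advm; the DATA disk group name follows the earlier examples):
SQL> ALTER DISKGROUP data SET ATTRIBUTE 'compatible.asm' = '11.2';
SQL> ALTER DISKGROUP data SET ATTRIBUTE 'compatible.advm' = '11.2';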
As part of the ASM best practices, we
recommend having two ASM disk groups. It is recommended that you place
the ADVM volumes in either the DATA or RECO disk group, depending on the
file system content. For example, the DATA disk group can be used to
store ACFS file systems that contain database-related data or
general-purpose data. The RECO disk group can be used to store ACFS file
systems that store database-recovery-related files, such as archived
log files, RMAN backups, Datapump dump sets, and even database
ORACLE_HOME backups (possibly zipped backups). It is also very common to use
ACFS file systems for GoldenGate, more specifically for storing trail
files. In these scenarios, a separate disk group is configured for
holding GoldenGate trail files. This also requires ACFS patch 11825850.
Storing archived log files, RMAN backups, and Datapump dump sets is only
supported in ACFS 11.2.0.3 and above. However, ACFS does not currently
support snapshots of file systems housing these files.
ACFS Deployment
The two types of ACFS file systems are CRS
Managed ACFS file systems and Registry Managed ACFS file systems.
Both of these ACFS solutions have similar benefits with respect to
startup/recovery and leveraging CRS dependency modeling, such as
unmounting and remounting offline file systems, mounting (pulling up) a
disk group if not mounted, and enabling volumes if not enabled.
CRS Managed ACFS file systems have associated
Oracle Clusterware resources and generally have defined
interdependencies with other Oracle Clusterware resources (database, ASM
disk group, and so on). CRS Managed ACFS is specifically designed for
ORACLE_HOME file systems. Registry Managed ACFS file systems are
general-use file systems that are completely transparent to Oracle
Clusterware and its resources. There are no structural differences
between the CRS Managed ACFS and Registry Managed ACFS file systems. The
differences are strictly around Oracle Clusterware integration. Once an
ACFS file system is created, all standard Linux/Unix file system
commands can be used, such as the df, tar, cp, and rm commands. In 11gR2,
storing any files that can be natively stored in ASM is not supported;
in other words, you cannot store Oracle Database files (control files,
data files, archived logs, online redo logs, and so on) in ACFS. Also,
the Grid Infrastructure Home cannot be installed in ACFS; it must be
installed in a separate file system, such as in ext3.
CRS Managed ACFS File Systems
As stated previously, the primary use for CRS Managed ACFS is storing the database Oracle Home. Figures 10-1 through 10-3
show how to create a CRS Managed ACFS file system. It is recommended
that ACFS file systems used to house your database ORACLE_HOME have an
OFA-compliant directory structure (for example,
$ORACLE_BASE/product/11.2/dbhomex, where x represents the
database home). Using ASMCA is the recommended method for creating a CRS
Managed ACFS file system. ASMCA creates the volume and file system and
establishes the required Oracle Clusterware resources.
Figure 10-1
displays the ASMCA screen used to launch the wizard for creating a CRS
Managed ACFS file system. Right-click a specific disk group and then
select the Create ACFS for Database Home option.
Figure 10-2
displays the Create ACFS Hosted Database Home screen, which will prompt
for the volume name, file system mount point, and file system owner.
The ADVM volume device name is taken from the volume name, which is
specified during volume creation.
Figure 10-3 displays the Run ACFS Script window.
This script, which needs to run as root, is
used to add the necessary Oracle Clusterware resources for the file
system as well as mount the file system. Here’s what the script
contains:
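The generated script itself is not reproduced here; conceptually, it performs steps similar to the following sketch (the Grid home path, volume, device suffix, and mount point names are assumptions):
# /u01/app/11.2.0/grid/bin/srvctl add filesystem -d /dev/asm/dbhome1-263 -g DATA -v dbhome1 -m /u01/app/oracle/acfsdbhome1 -u oracle
# /u01/app/11.2.0/grid/bin/srvctl start filesystem -d /dev/asm/dbhome1-263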
Once this script is run, the file system will be mounted on all nodes of the cluster.
Registry Managed ACFS
Besides storing the database software
binaries on ACFS, ACFS can be used to store other database-related
content. The following are some use-case scenarios for a Registry
Managed ACFS file system:
Automatic Diagnostic Repository (ADR) A
file system for a diagnostics logging area. Having a distinct file
system for diagnostic logs provides a more easily managed container
than placing these logs within the ORACLE_BASE or ORACLE_HOME locations.
Plus, a node’s logs are available even if that node is down for some
reason.
External database data This includes Oracle external file types, such as BFILEs and ETL data/external tables.
Database exports Create
a file system for storing the old database exports as well as Datapump
exports. This is only supported in Grid Infrastructure stack 11.2.0.3
and above.
Utility log file repository For
customers who want a common log file repository for utilities such as
RMAN, SQL*Loader, and Datapump, an ACFS file system can be created to
store these log files.
Directory for UTL_FILE_DIR and Extproc In
a RAC environment, a shared file system plays an important role for
customers who use UTL_FILE_DIR (PL/SQL file I/O), a shared file
repository for CREATE DIRECTORY locations, and external procedures via
Extproc calls.
Middle-tier shared file system for Oracle applications For
example, E-Business Suite as well as Siebel Server have a shared file
system requirement for logs, shared documents, reports output, and so
on.
In order to use the ACFS file system that is
created in the database tier, one can export the ACFS file system as an
NFS file system and NFS mount it on the middle-tier nodes. This allows
customers to consolidate their storage and have integrated storage
management across tiers.
Creating Registry Managed ACFS File Systems
This section describes Registry Managed ACFS
file system creation. In this case, a node-local file system will be
created for a given RAC node. To create a Registry Managed ACFS file
system, you must create an ADVM volume first. Note that root access is
not required to create an ACFS file system, but mounting and creating
the CRS resource will require root access. Here are the steps:
1. Create the volume:
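For example (the volume name dbvol1 and the size are assumptions):
$ asmcmd volcreate -G DATA -s 10G dbvol1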
2. Once the ASM Dynamic volume is
created and enabled, the file system can be created over the ADVM volume
device. Create the file system:
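A sketch, assuming the resulting ADVM volume device is /dev/asm/dbvol1-263 (the numeric suffix will vary):
$ /sbin/mkfs -t acfs /dev/asm/dbvol1-263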
NOTE
When a volume is created, it is automatically enabled.
3. Mount the file system:
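For example, as root (the /acfsdata mount point is an assumption):
# mkdir -p /acfsdata
# mount -t acfs /dev/asm/dbvol1-263 /acfsdata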
4. Verify that the volume was created and mounted:
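For example, continuing with the same illustrative names:
$ asmcmd volinfo -G DATA dbvol1
$ df -h /acfsdata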
Setting Up Registry Managed ACFS
An ACFS Mount Registry is used to provide a
persistent entry for each Registry Managed ACFS file system that needs
to be mounted after a reboot. This ACFS Mount Registry is very similar
to /etc/fstab on Linux. However, in cluster configurations, file systems
registered in the ACFS Mount Registry are automatically mounted
globally, similar to a cluster-wide mount table. This is the added
benefit Oracle ACFS Mount Registry has over Unix fstab. The ACFS Mount
Registry can be probed, using the acfsutil command, to obtain file
system, mount, and file information.
When a general-purpose ACFS file system is
created, it should be registered with the ACFS Mount Registry to ensure
that the file system gets mounted on cluster/node startup. Users should
not create ACFS Mount Registry entries for CRS Managed ACFS file
systems.
Continuing from the previous example, where an
ADVM volume and a corresponding file system were created, we will now
register those file systems in the ACFS Mount Registry on their
respective nodes so they can be mounted on node startup. When ACFS is in
a cluster configuration, the acfsutil registry command can be run from
any node in the cluster:
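For example, using the illustrative device and mount point created earlier:
# /sbin/acfsutil registry -a /dev/asm/dbvol1-263 /acfsdata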
The following shows some examples of the
acfsutil registry command. Note that the acfsutil command can be run
using either the root or oracle user. To check the ACFS Mount Registry,
use the following query:
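For example, to list the current registry contents:
# /sbin/acfsutil registry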
To get more detailed information on a currently mounted file system, use the acfsutil info fs command:
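For example, for the illustrative mount point:
# /sbin/acfsutil info fs /acfsdata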
Note that because these file systems will be
used to store database-related content, they will need to have CRS
resources against them. The following example illustrates how to create
these resources:
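A sketch using the illustrative names from this section (run as root):
# srvctl add filesystem -d /dev/asm/dbvol1-263 -g DATA -v dbvol1 -m /acfsdata -u oracle
# srvctl start filesystem -d /dev/asm/dbvol1-263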
NOTE
You cannot have ACFS registry entries and CRS resources defined for the same file system.
The same operation can be performed using ASMCA or Enterprise Manager.
Managing ACFS and ADVM
This section describes how to manage typical
tasks related to ACFS and ADVM as well as illustrates the relationships
among ACFS, ADVM, ASM, and Oracle Clusterware.
ACFS File System Resize
An ACFS file system can be dynamically grown
and shrunk while it is online and with no impact to the user. The ACFS
file system size and attributes are managed using the /sbin/acfsutil
command. The ACFS extend and shrink operations are performed at the file
system layer, which implicitly grows the underlying ADVM volume in
multiples of the volume allocation unit.
The following example shows the acfsutil size command:
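For example, to grow the illustrative file system by 5GB while it remains online:
# /sbin/acfsutil size +5G /acfsdata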
Currently, the limit on the number of times an ACFS file system can be resized is four.
ACFS starts with one extent and can grow out
to four more extents, for a total of five global bitmap extents. To
determine the number of times the file system has been resized, use the
acfsdbg utility to list the internal file system storage bitmap:
Look at the Extent[*] fields for nonzero
Length fields. The number of remaining zero-length extents is an
indication of the minimum number of times you can grow the file system.
If the file system needs to be resized beyond this limit,
you need to take the mount point offline globally (across all
nodes) and then run fsck -a to consolidate the internal storage bitmap
for resize:
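A sketch using the illustrative device from earlier, after the file system has been unmounted on every node:
# fsck -a -t acfs /dev/asm/dbvol1-263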
Unmounting File Systems
Unmounting file systems involves a typical
OS umount command. Before unmounting the file system, ensure that it is
not in use. This may involve stopping dependent databases, jobs, and so
on. In-use file systems cannot be unmounted. You can use various methods
to show open file references of a file system:
Linux/Unix lsof command Shows open file descriptors for the file system
Unix/Linux fuser command Displays the PIDs of processes using the specified file systems
Any users or processes listed should be logged off or killed (kill -9).
Next we’ll look at the steps required to
unmount an ACFS file system. The steps to unmount a CRS Managed ACFS
file system and a Registry Managed ACFS file system differ slightly.
To unmount a Registry Managed (general-purpose) ACFS file system, run
the umount command on all nodes where the file system is currently mounted:
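For example, as root on each node (using the illustrative mount point):
# umount /acfsdata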
The steps are different to unmount a CRS
Managed ACFS file system. Because CRS Managed ACFS file systems have
associated CRS resources, the following steps need to be performed to
stop the Oracle Clusterware resource and unmount the file system:
1. Once the application has stopped
using the file system, you can stop the Oracle Clusterware file system
resource. The following command will also unmount the file system. Here,
we unmount the file system across all nodes.
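For example, using the illustrative device name:
# srvctl stop filesystem -d /dev/asm/dbvol1-263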
NOTE
If srvctl stop does not include a node or node list, it will be unmounted on all nodes.
2. To unmount on a specific node only, specify the node name with the -n flag:
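For example (the node name is illustrative):
# srvctl stop filesystem -d /dev/asm/dbvol1-263 -n racnode2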
3. Verify the file system resource is unmounted:
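For example, using the illustrative device name; the cluster resource listing can also be filtered for ACFS resources:
# srvctl status filesystem -d /dev/asm/dbvol1-263
# crsctl stat res -t | grep acfs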
As stated, it is highly recommended that you
unmount any ACFS file systems first before the ASM instance is shut
down. A forced shutdown or failure of an ASM instance with a mounted
ACFS file system will result in I/O failures and dangling file handles;
in other words, the ACFS file system user data and metadata that was
written at the time of the termination may not be flushed to storage
before ASM storage is fenced off. Thus, a forced shutdown of ASM will
result in the ACFS file system having an offline error state. In the
event that a file system enters into an offline error state, the ACFS
Mount Registry and CRS Managed Resource action routines attempt to
recover the file system and return it to an online state by unmounting
and remounting the file system.
Deleting File Systems
Similar to unmounting ACFS file systems, the
steps to delete an ACFS file system slightly differ between CRS Managed
ACFS and Registry Managed ACFS.
To unmount and delete a Registry Managed ACFS
file system, execute the following, which needs to be run on all nodes
where the file system is currently mounted:
Next, delete the acfsutil registry entry for the file system:
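For example, using the illustrative mount point registered earlier:
# /sbin/acfsutil registry -d /acfsdata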
CRS Managed ACFS file systems have Oracle Clusterware resources that need to be removed. To do this, follow these steps:
1. Run the following command, which
unmounts the file system and stops the Oracle Clusterware resources.
(Although this command can be executed from any node in the cluster, it
will have to be rerun for each node.)
NOTE
The format for stop filesystem is srvctl stop filesystem -d <device name> -n <nodelist where fs is mounted>.
2. Repeat this for each node.
3. Verify the file system resource is unmounted:
4. Once this file system is stopped
and unmounted on all nodes where it was started, the Clusterware
resource definitions need to be removed (this should only be run once
from any node) and then the volume can be deleted:
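A sketch with the illustrative names used earlier (run the srvctl command once, as root, from any node):
# srvctl remove filesystem -d /dev/asm/dbvol1-263
$ asmcmd voldelete -G DATA dbvol1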
ADVM Management
Generally it is not necessary to perform
ADVM management; however, in rare cases, volumes may need to be manually
disabled or dropped. The /sbin/advmutil and ASMCMD commands should be
used for these tasks. For details on command usage and various command
options, review the Oracle Storage Administrator’s Guide.
You can get volume information using the advmutil volinfo command:
The same information can be displayed by the
asmcmd volinfo command; however, the asmcmd uses the ASM Dynamic volume
name, whereas the advmutil uses the ADVM volume device name:
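For example, with the illustrative names used earlier:
# /sbin/advmutil volinfo /dev/asm/dbvol1-263
$ asmcmd volinfo -G DATA dbvol1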
To enable or disable the ASM Dynamic volumes,
you can use ASMCMD. Enabling the volume instantiates (creates) the
volume device in the /dev/asm directory. Here's the command to enable
the volume:
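For example, using the illustrative names from earlier:
$ asmcmd volenable -G DATA dbvol1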
And the following command can be used to disable the volume:
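For example, using the same illustrative volume:
$ asmcmd voldisable -G DATA dbvol1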
The disable command only disables the volume
and removes the ADVM volume device node entry from the OS (or more
specifically from the /dev/asm directory); it does not delete the ASM
Dynamic volumes or reclaim space from the ASM disk group. To delete
(drop) the ASM Dynamic volume, use the drop command:
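For example, using the same illustrative volume:
$ asmcmd voldelete -G DATA dbvol1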
NOTE
You cannot delete or disable a volume or file system that is currently in use.
ACFS Management
Like any other file system, ACFS consists of
background processes and drivers. This section describes the background
processes and their roles, reviews the ACFS kernel drivers, and
discusses the integration of ACFS and Oracle Clusterware.
ASM/ACFS-Related Background Processes
Several new ASM instance background processes were added in Oracle 11gR2 to support the ACFS infrastructure. The following processes will be started upon the first use of an ADVM volume on the node:
VDBG The
Volume Driver Background process is very similar to the Umbilicus
Foreground (UFG) process, which is used by the database to communicate
with ASM. The VDBG will forward ASM requests to the ADVM driver. This
communication occurs in the following cases:
When
an extent is locked/unlocked during ASM rebalance operations (ASM disk
offline, add/drop disk, force dismount disk group, and so on).
During volume management activities such as a volume resize.
VBGx The
Volume Background processes are a pool of worker processes used to
manage requests from the ADVM driver and coordinate with the ASM
instance. A typical case for this coordination is the opening/closing of
an ADVM volume file (for example, when a file system mount or unmount
request occurs).
VMBx The
VMB is used to implement an I/O barrier and I/O fencing mechanism. This
ensures that ASM instance recovery is not performed until all ADVM I/Os
have completed.
ACFSx The
ACFS Background process coordinates cluster membership and group
membership with the ASM instance. The ACFS process communicates this
information to the ACFS driver, which in turn communicates with both the
OKS and ADVM drivers. When a membership state/transition change is
detected, an ioctl call is sent down to the kernel, which then begins
the process of adding/removing a node from the cluster.
These new background processes can be observed among the ASM instance's operating system processes:
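As a rough sketch (exact process names can vary by platform and release), something like the following can be used on Linux to spot them:
$ ps -ef | egrep 'asm_(vdbg|vbg|vmb|acfs)'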
The following are ACFS kernel threads dedicated to the management of ACFS and ADVM volume devices:
The user mode processes, discussed earlier,
are used to perform extent map services in support of the ADVM driver.
For example, ASM file extents map the ADVM volume file to logical blocks
located on specific physical devices. These ASM extent pointers are
passed to the ADVM driver via the user space processes. When the driver
receives I/O requests on the ADVM volume device, the driver redirects
the I/O to the supporting physical devices as mapped by the target ADVM
volume file’s extents. Because of this mapping, user I/O requests issued
against ACFS file systems are sent directly to the block device (that
is, ASM is not in the I/O path).
ACFS/ADVM Drivers
The installation of the Grid Infrastructure
(GI) stack will also install the ACFS/ADVM drivers and utilities. Three
drivers support ACFS and ADVM. They are dynamically loaded (in top-down
order) by the OHASD process during Oracle Clusterware startup, and they
will be installed whether the ACFS is to be used or not:
oracleoks.ko This
is the kernel services driver, providing memory management support for
ADVM/ACFS as well as lock and cluster synchronization primitives.
oracleadvm.ko The
ADVM driver maps I/O requests against an ADVM volume device to blocks
in a corresponding on-disk ASM file location. This ADVM driver provides
volume management driver capabilities that directly interface with the
file system.
oracleacfs.ko This is the ACFS driver, which supports all ACFS file system file operations.
During install, kernel modules are placed in
/lib/modules/2.6.18-8.el5/extra/usm. These loaded drivers can be seen
(on Linux) via the lsmod command:
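For example:
# lsmod | grep oracle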
Linux OS vendors now support a “white list” of
kernel APIs (kABI compatible or weak modules), which are defined not to
change in the event of OS kernel updates or patches. The ACFS kernel
APIs were added to this white list, which enables ACFS drivers to
continue operation across certain OS upgrades and avoids the need for
new drivers with every OS kernel upgrade.
Integration with Oracle Clusterware
When a CRS Managed ACFS file system is created, Oracle Clusterware will manage the resources for ACFS. In Oracle Clusterware 11g
Release 2, the Oracle High Availability Services daemon (OHASd)
component is called from the Unix/Linux init daemon to start up and
initialize the Clusterware framework. OHASd’s main functions are as
follows:
Start/restart/stop the Clusterware infrastructure processes.
Verify the existence of critical resources.
Load the three ACFS drivers (listed previously).
The correct ordering of the startup and
shutdown of critical resources is also maintained and managed by OHASd;
for example, OHASd will ensure that ASM starts after ACFS drivers are
loaded.
Several
key Oracle Clusterware resources are created when Grid Infrastructure
for Cluster is installed or when the ADVM volume is created. Each of
these CRS resources is node local and will have a corresponding start,
stop, check, and clean action, which is executed by the appropriate
Clusterware agents. These resources include the following:
ACFS Driver resource This
resource is created when Grid Infrastructure for Cluster is installed.
This resource is created as ora.drivers.acfs and managed by the
orarootagent. The ASM instance has a weak dependency against the ACFS
driver resource. The Clusterware start action will start the ACFS driver
resource when the ASM instance is started, and will implicitly load the
ACFS/ADVM kernel drivers. These drivers will remain loaded until the GI
stack is shut down.
ASM resource This
resource, ora.asm, is created as part of Grid Infrastructure for
Cluster installation. This resource is started as part of the standard
bootstrap of the GI stack and managed by the oraagent agent.
ACFS Registry resource The
Registry resource is created as part of the Grid Infrastructure for
Cluster installation and managed by orarootagent. The activation of this
resource will also mount all file systems listed in the ACFS Registry.
This Registry resource also does file system recovery, via check action
script, when file systems in an offline state are detected.
ACFS file system resource This
resource is created as ora.<diskgroup>.<volume>.acfs when
the ASM Configuration Assistant (ASMCA) is used to create a DB Home file
system. When the Database Configuration Assistant (DBCA) is used to
create a database using the DB Home file system, an explicit dependency
between the database, the ACFS file system hosting the DB Home, and ASM
is created. Thus, a startup of ASM will pull up the appropriate ACFS
file system along with the database.
ACFS/Clusterware Resource
Once the Oracle database software is
installed in the CRS Managed ACFS file system and the database is
created in ASM, Oracle Clusterware will implicitly create the resource
dependencies between the database, the CRS Managed ACFS file system, and
the ASM disk group. These dependencies are shown via the start/stop
actions for a database using the following command:
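A sketch, assuming a database resource named ora.orcl.db (the database name is illustrative):
# crsctl stat res ora.orcl.db -p | egrep 'START_DEPENDENCIES|STOP_DEPENDENCIES'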
The output has been condensed to focus on the start/stop dependency items:
ACFS Startup Sequence
The start action, which is the same for CRS
Managed and general-purpose file system resources, is to mount the file
system. The CRS resource action includes confirming that the ACFS
drivers are loaded, the required disk group is mounted, the volume is
enabled, and the mount point is created, if necessary. If the file
system is successfully mounted, then the state of the resource is set to
online; otherwise, it is set to offline.
When the OS boots up and Oracle Clusterware is started, the following Clusterware operations are performed:
OHASd will load the ACFS drivers and start ASM.
As
part of the ASM instance startup, all the appropriate disk groups, as
listed in the ASM asm_diskgroup parameter, will also be mounted. As part
of the disk group mount, all the appropriate ASM Dynamic volumes are
enabled.
The CRS agent will start up and mount the CRS Managed ACFS file systems.
The
appropriate CRS agents will start their respective resources. For
example, oraagent will start up the database. Note that just before this
step, all the resources necessary for the database to start are
enabled, such as the ASM instance, disk group, volume, and CRS Managed
ACFS file systems.
The ACFS Mount Registry agent will mount any ACFS file systems that are listed in the ACFS Mount Registry.
ACFS Shutdown Sequence
Shutdown includes stopping of the Oracle
Grid Infrastructure stack via the crsctl stop cluster command or node
shutdown. The following describes how this shutdown impacts the ACFS
stack:
As
part of the infrastructure shutdown of CRS, the Oracle Clusterware
orarootagent will perform unmounts for file systems contained on ADVM
volume devices. If any file systems could not be unmounted because they
are in use (open file references), then an internal grace period is set
for the processes with the open file references. At the end of the grace
period, if these processes have not exited, they are terminated and the
file systems are unmounted, resulting in the closing of the associated
dynamic volumes.
The ASM Dynamic volumes are then disabled; ASM and its related resources are stopped.
All ACFS and ADVM logs for startup/shutdown and errors will be logged in the following places:
Oracle Clusterware home (for example, $ORACLE_HOME/log/<hostname>/alert.log)
ASM alert.log
$ORACLE_HOME/log/<hostname>/agent/ohasd/rootagent
ACFS Startup on Grid Infrastructure for Standalone
Grid Infrastructure for a Standalone Server (Oracle
Restart) does not support managing root-based ACFS start actions. Thus,
the following operations are not automatically performed:
Loading Oracle ACFS drivers
Mounting ACFS file systems listed in the Oracle ACFS Mount Registry
Mounting resource-based Oracle ACFS database home file systems
The following steps outline how to automate
the load of the drivers and mount the file system (note that the root
user needs to perform this setup):
1. Create an initialization script
called /etc/init.d/acfsload. This script contains the runlevel
configuration and the acfsload command:
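A minimal sketch of such a script (the Grid home path is an assumption; the chkconfig header values are illustrative):
#!/bin/sh
# chkconfig: 2345 30 21
# description: Loads the Oracle ACFS/ADVM/OKS drivers at boot
GRID_HOME=/u01/app/11.2.0/grid
$GRID_HOME/bin/acfsload start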
2. Modify the permissions on the /etc/init.d/acfsload script to allow it to be executed by root:
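For example:
# chmod u+x /etc/init.d/acfsload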
3. Use the chkconfig command to
build the appropriate symbolic links for the rc2.d, rc3.d, rc4.d, and
rc5.d runlevel directories:
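For example:
# chkconfig --add acfsload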
4. Verify that the chkconfig runlevel is set up correctly:
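For example:
# chkconfig --list acfsload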
Finally, these file systems can be listed in
Unix/Linux /etc/fstab and a similar rc initialization script can be used
to mount them:
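As a sketch, using the illustrative device and mount point from earlier, the /etc/fstab entry (with noauto so that mounting is deferred until the drivers are loaded) and the mount command issued by the rc script could look like this:
/dev/asm/dbvol1-263  /acfsdata  acfs  noauto  0 0
# mount -t acfs /dev/asm/dbvol1-263 /acfsdata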
Exporting ACFS for NFS Access
Many customers want to replace their
existing NFS appliances that are used for middle-tier apps with low-cost
solutions. For example, Siebel architectures require a common file
system (the Siebel file system) between nodes on the mid-tier to store
data and physical files used by Siebel clients and the Siebel Enterprise
Server. A similar common file system is required for E-Business Suite
and PeopleSoft, as well as other packaged applications.
An NAS file access protocol is used to
communicate file requests between an NAS client system and an NAS file
server. NAS file servers provide the actual storage. ACFS can be
configured as an NAS file server and, as such, can support remote file
access from NAS clients that are configured with either NFS or CIFS file
access protocols. Because ACFS is a cluster file system, it can support
a common file system namespace cluster-wide; thus, each cluster node
has access to the file system. If a node fails, the Grid Infrastructure
stack transitions the state of the cluster, and the remaining cluster
nodes continue to have access to ACFS file system data. Note that SCAN
names cannot be used as NFS node service names; NFS mounting the ACFS
exported file system using “hard” mount options is not supported.
In the current version, there is no failover
of the NFS mount. The file system will need to be remounted on another
node in the cluster.
When exporting ACFS file systems through NFS
on Linux, you must specify the file system identification handle via the
fsid exports option. The fsid value can be any 32-bit number. The use
of the file system identification handle is necessary because the ADVM
block device major numbers are not guaranteed to be the same across
reboots of the same node or across different nodes in the cluster. The
fsid exports option forces the file system identification portion of the
file handle, which is used to communicate with NFS clients, to be the
specified number instead of a number derived from the major and minor
number of the block device on which the file system is mounted. If the
fsid is not explicitly set, a reboot of the server (housing the ACFS
file system) will cause NFS clients to see inconsistent file system data
or detect “Stale NFS file handle” errors.
The following guidelines must be followed with respect to the fsid:
The value must be unique among all the exported file systems on the system.
The
value must be unique among members of the cluster and must be the same
number on each member of the cluster for a given file system.
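For illustration only, an /etc/exports entry on each cluster node for the illustrative mount point might look like the following (the fsid value of 101 is an assumption and must be the same on every node; restrict the client list as appropriate):
/acfsdata  *(rw,sync,fsid=101)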
Summary
The ASM Cluster File System (ACFS) extends
Automatic Storage Management (ASM) by providing a robust, modern,
general-purpose, extent-based, journaling file system for files beyond
the Oracle database files and thus becomes a complete storage management
solution. ACFS provides support for files such as Oracle binaries,
report files, trace files, alert logs, and other application data files.
ACFS scales from small files to very large files (exabytes) and
supports large numbers of nodes in a cluster.
In Oracle Grid Infrastructure 11g
Release 2, ASM simplifies, automates, and reduces cost and overhead by
providing a unified and integrated solution stack for all your file
management needs, thus eliminating the need for third-party volume
managers, file systems, and Clusterware platforms.
With the advent of ACFS, Oracle ASM 11g
Release 2 has the capability to manage all data, including Oracle
database files, Oracle Clusterware files, and nonstructured
general-purpose data such as log files and text files.