ACFS Data Services
As mentioned in
the previous chapter, the Oracle Cloud File System simplifies storage
management across file systems, middleware, and applications in private
clouds with a unified namespace. The Oracle Cloud File System offers
rapid elasticity and increased availability of pooled storage resources
as well as an innovative architecture for balanced I/O and highest
performance without tedious and complex storage administration.
One of the key aspects of the Oracle Cloud
File System (ACFS) is support for advanced data services such as
point-in-time snapshots, replication, file tagging, and file system
security and encryption features. This chapter covers the inner workings
and implementation of the ACFS data services.
ACFS Snapshots
Snapshots are point-in-time views of the source
file system as it appeared when the snapshot was taken. Think of them as a
way to go back in time to see what files or directories looked like at
that moment.
ACFS provides snapshot capability at the
file system level. The snapshot starts out with a set of duplicate
pointers to the extents in the primary file system. When an update is to
be made, there is no need to copy extents to the snapshot because it
already points to the existing blocks. New storage is allocated for the
updates in the primary file system.
This snapshot uses the first copy-on-write
(FCOW or COW) methodology to enable a consistent, version-based, online
view of the source file system. Snapshots are initially a sparse file
system, and as the source file system’s files change, the before-image
extent of those files is copied into the snapshot directory. The
before-image granularity is an ACFS extent, so if any byte in an extent
is modified, the extent is COW’ed and any subsequent changes in that
extent require no action for the snapshot.
ACFS supports read-only and read-write
snapshot services. Note that ACFS snapshots cannot be used on file
systems that house RMAN backups, archived logs, or Data Pump dump sets;
this restriction is removed in the Oracle 12c release.
When snapshots are created, they are
automatically available and always online while the file system is
mounted. Snapshots are created as a hidden subdirectory inside the
source file system called .ACFS/snaps/, so no separate mount operation
is needed and no separate file store needs to be maintained for the
snapshots.
ACFS supports a total of 63 snapshot views per file system, in any
combination of read-only and read-write snapshots.
ACFS Read-Only Snapshots
Because ACFS read-only snapshots are a
point-in-time view of a file system, they can be used as the source of a
file system logical backup. An ACFS snapshot can support the online
recovery of files inadvertently modified or deleted from a file system.
ACFS Read-Write Snapshots
ACFS read-write snapshots enable fast
creation of a snapshot image that can be both read and written without
impacting the state of the ACFS file system hosting the snapshot images.
To use ACFS read-write snapshots, the disk group compatibility
attribute for ADVM must be set to 11.2.0.3.0 or higher. If you create a
read-write snapshot on an existing ACFS file system from a version
earlier than 11.2.0.3.0, the file system is updated to the 11.2.0.3.0
format. After a file system has been updated to a higher version, it
cannot be returned to an earlier version.
The read-write snapshots can be used for the following purposes:
Testing
new versions of application software on production file data reflected
in the read-write snapshot image without modifying the original
production file system.
Running test scenarios on a real data set without modifying the original production file system.
Testing
ACFS features such as encryption or tagging. Data in snapshots can be
encrypted and workloads can be run to assess the performance impact of
the encryption before the live data is encrypted.
ACFS Snapshot by Example
The following shows how to create ACFS snapshots. In this example, a snapshot of the database ORACLE_HOME is created:
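For example, a read-only snapshot might be created along these lines (a sketch; the snapshot name and the ORACLE_HOME mount point /u01/app/oracle/acfsdb_home are hypothetical, and on 11.2.0.3 and later the -w option can be added to request a read-write snapshot):

[root@node1 ~]# /sbin/acfsutil snap create dbhome_snap1 /u01/app/oracle/acfsdb_home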
The following acfsutil command can be used to obtain information about ACFS snapshots and the file system:
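For instance (a sketch using the same hypothetical mount point; acfsutil snap info lists the snapshots, and acfsutil info fs reports file system details):

[root@node1 ~]# /sbin/acfsutil snap info /u01/app/oracle/acfsdb_home
[root@node1 ~]# /sbin/acfsutil info fs /u01/app/oracle/acfsdb_home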
To list all snapshots available in the cluster, execute the following query:
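For example, from SQL*Plus connected to an ASM instance (a sketch; the exact column list of the GV$ASM_ACFSSNAPSHOTS view may vary slightly by release):

SQL> SELECT fs_name, snap_name, create_time FROM gv$asm_acfssnapshots;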
Accessing the snapshot will always provide a
point-in-time view of a file; thus, ACFS snapshots can be very useful
for file-based recovery or for file system logical backups. If
file-level recovery is needed (for the base file system), it can be
performed using standard file copy or replace commands.
A possible use-case scenario for snapshots
could be to create a consistent recovery point set between the database
ORACLE_HOME and the database. This is useful, for example, when a
recovery point needs to be established before applying a database patch
set. Here are the steps to follow for this scenario:
1. Create an ORACLE_HOME snapshot:
2. Create a guaranteed restore point (GRP) in the database:
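For example (standard SQL*Plus syntax; the restore point name is arbitrary, and the database must meet the normal prerequisites for guaranteed restore points):

SQL> CREATE RESTORE POINT before_patch GUARANTEE FLASHBACK DATABASE;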
3. Apply the patch set.
4. If the patch set application fails, take one of the following actions:
Restore the database to the GRP.
Recover the file system by leveraging the snapshot.
ACFS Tagging
ACFS tagging enables associating tag names
with files, logically grouping files that may be present in any location
(directory) in a file system. ACFS replication can then select files
with a unique tag name for replication to a different remote cluster
site. The tagging option avoids having to replicate an entire Oracle
ACFS file system. Tags can be set or unset, and tag information for
files can be displayed using the command acfsutil tag.
At creation time, files and directories
inherit any tags from the parent directory. When a new tag is added to a
directory, existing files in the directory do not get tagged with the
same tag unless the –r option is specified with the acfsutil tag set
command. Any files created in the future, however, do inherit the tag,
regardless of whether or not the –r option was specified with the
acfsutil tag set command.
ACFS implements tagging using extended
attributes. Some editing tools and backup utilities do not retain the
extended attributes of the original file by default, unless a specific
switch is supplied. The following list describes the necessary
requirements and switch settings for some common utilities to ensure
ACFS tag names are preserved on the original file:
Install
the coreutils library (version coreutils-5.97-23.el5_4.1.src.rpm or
coreutils-5.97-23.el5_4.2.x86_64.rpm or later) on Linux to obtain a
version of the cp command that supports extended attribute preservation
with the --preserve=xattr switch and a version of the mv command that
preserves extended attributes without any switches.
The
vi editor requires the set bkc=yes option in the .vimrc (Linux) or
_vimrc (Windows) file to make a backup copy of a file and overwrite the
original. This preserves tag names on the original file.
emacs
requires that the backup-by-copying option is set to a non-nil value to
preserve tag names on the original filename rather than a backup copy.
This option must be added to the .emacs file.
The
rsync file-transfer utility requires the -X flag option to preserve tag
names. In addition, you must set the -l and -X flags to preserve the
tag names assigned to symbolic link files themselves.
The
tar backup utility on Linux requires the --xattrs flag to be set on the
command line to preserve tag names on a file. However, tar does not
retain the tag names assigned to symbolic link files, even with the
--xattrs flag.
The
tar backup utility on Windows currently provides no support for
retaining tag names because no switch exists to save extended
attributes.
As of 11.2.0.3, the ACFS tagging feature is
available only on Linux and Windows. To use the ACFS tagging
functionality on Linux, the disk group compatibility attributes for ASM
and ADVM must be set to 11.2.0.2 or higher. To use ACFS tagging
functionality on Windows, the disk group compatibility attributes for
ASM and ADVM must be set to 11.2.0.3.
This can be done with SQL*Plus, as illustrated here (notice that SQL*Plus is executed as the oracle user on node1):
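A minimal sketch, assuming the ADVM volume lives in a disk group named DATA (substitute your own disk group name; on Windows use 11.2.0.3 as described above):

[oracle@node1 ~]$ sqlplus / as sysasm
SQL> ALTER DISKGROUP data SET ATTRIBUTE 'compatible.asm'  = '11.2.0.2';
SQL> ALTER DISKGROUP data SET ATTRIBUTE 'compatible.advm' = '11.2.0.2';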
ACFS Replication Overview
In Oracle Release 11.2.0.2, the ACFS file
system replication feature was introduced on the Linux platform. This
feature enables replication of an ACFS file system across a network to a
remote site, which is useful for providing disaster recovery. Similar
to Data Guard, which replicates databases by
capturing database redo operations, ACFS replication captures ACFS file
system changes on a primary file system and transmits these changes to a
standby file system.
ACFS replication leverages Oracle Net and the
NETWORK_FILE_TRANSFER PL/SQL package for transferring replicated data
from a primary node to the standby file system node. ACFS replication is
only supported with Grid Infrastructure for a Cluster, as selected in the
Oracle Installer. ACFS replication is not supported on Grid
Infrastructure for a Standalone Server. However, you can install Grid
Infrastructure for a Cluster on a single node by supplying the necessary
information for a single node during installation.
The combination of Oracle Real Application
Clusters, Data Guard, and ACFS Replication provides comprehensive site
and disaster recovery policies for all files inside and outside the
database.
Primary File System
The source ACFS file system is referred to
as a primary file system and the target ACFS file system as a standby
file system. For every primary file system there can be only be one
standby file system. ACFS replication captures, in real time, file
system changes on the primary file system and saves them in files called
replication logs (rlogs). These rlogs are stored in the
.ACFS/repl directory of the file system that is being replicated. If the
primary node is part of a multinode cluster, all rlogs (one rlog per
node) created at a specific point in time are collectively called a cord. Rlogs combined into a cord are then transmitted to the standby node, where the cord is used to update the standby file system.
Keep in mind that data written to files is
first buffered in a file system cache (unless direct IO is used); then
at a later point in time it is committed to disk. ACFS guarantees that
when data is committed to disk it will also be written to the standby
file system.
Current Restrictions (11.2.0.3)
The following are consideration points when implementing ACFS file system replication in an Oracle Clusterware 11.2.0.3 system.
The minimum file system size that can be replicated is 4GB.
ACFS replication currently supports a maximum of eight nodes in the cluster hosting the primary file system.
The primary and standby file systems must be the same OS, architecture, and endianness.
ACFS cannot currently use encryption or security for replicated file systems.
Cascading standbys are not supported.
The ACFS standby file system must be empty before replication is initiated.
Standby File System
Replication logs are asynchronously
transported to the node hosting the standby file system, where they are
read and applied to the standby file system.
When the replication logs have been successfully applied to the standby
file system, they are deleted on both the primary and standby file
systems. Because the standby file system is a read-only file system, it
can be the source of consistent file system backups after all the
outstanding logs are applied.
NOTE
If needed, a read-write snapshot can be taken of the standby file system.
Planning for ACFS Replication
This section describes how to enable ACFS
replication. The examples assume that the Grid Infrastructure software
has been installed on nodes hosting the ACFS file system and that the
ADVM volumes are enabled and the ACFS file systems are mounted.
Note that the primary and standby sites can
have differing configurations. In other words, the primary can be a
multinode cluster and the standby can be a single-node cluster. If a
standby node is used for disaster recovery purposes, it is recommended
that the standby node have a configuration similar to that of the
primary cluster.
There are no rigid primary and standby node
roles; that is, a primary node can provide the role of primary for one
file system and also provide the role of standby for another file
system. However, for simplicity, this chapter will use the term primary node to indicate the node hosting the primary file system and the term standby node for the node hosting the standby file system.
This configuration represents the system used
in the following examples. With respect to replication, some commands,
such as acfsutil, must be executed with root privileges. Other commands,
such as sqlplus, are issued from the oracle user ID. In the examples,
the user ID is shown with the command prompt.
Tagging Considerations
ACFS tagging can be a key aspect of ACFS
replication, as it allows users to assign a common naming attribute to a
group of files. This is done by leveraging OS-specific extended
attributes and generic tagging CLIs. ACFS replication can
use these tags to select files with a unique tag name for replication to
a different remote cluster site. Thus, rather than replicating an entire
file system, ACFS tagging enables a user to select specific tagged files
and directories for replication.
Tagging enables data- or attribute-based replication.
The following example illustrates recursively tagging all files of the /acfs directory with the “reptag” tag:
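A sketch of the command (the /acfs mount point and reptag tag name follow this chapter's example):

[root@node1 ~]# /sbin/acfsutil tag set -r reptag /acfs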
Using tagging with ACFS replication requires
that a replication tag be specified when replication is first initiated
on the primary node. Tagging with replication cannot be implemented
after replication has been initiated. To begin tagging after replication
has been initiated requires that replication first be terminated and
then restarted with a tag name.
Before you implement ACFS replication, it is
important to determine how and what will be replicated; for example,
will all file system data be replicated, certain directories, or only
specific ACFS tagged files? This choice will impact file system sizing.
Keep in mind that the tags specified on the
init command line need not be applied to files at the time of the
initialization. For example, you can replicate files with the tags
Chicago and Boston, when at the time of replication only files with the
Chicago tag exist (that is, no files with the Boston tag exist). Any
subsequent files tagged with Boston will also begin to be replicated.
Setting Up Replication
Before initializing ACFS replication, ensure
that the primary file system has a minimum of 4GB of free space
multiplied by the number of nodes mounting the file system. This should
be done prior to executing the acfsutil repl init command; otherwise,
this command will fail.
ACFS replication also requires that the
compatible.asm and compatible.advm attributes for the disk group
containing the ACFS file system are set to a minimum of 11.2.0.2.0 for
Linux (or 11.2.0.3 on Windows) on both the primary and standby nodes. If
this was not done in the earlier steps (for enabling tagging or other
features), then it can be done now with the sqlplus command, as
illustrated here:
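A minimal sketch, again assuming a disk group named DATA; run this against the ASM instances on both the primary and standby sites (use 11.2.0.3 on Windows):

[oracle@node1 ~]$ sqlplus / as sysasm
SQL> ALTER DISKGROUP data SET ATTRIBUTE 'compatible.asm'  = '11.2.0.2.0';
SQL> ALTER DISKGROUP data SET ATTRIBUTE 'compatible.advm' = '11.2.0.2.0';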
Admin User Setup
In most cases, the SYS user in an ASM
instance can be used as the ACFS replication administrator, in which
case the SYS user will need to be granted the SYSDBA privilege (on the
ASM instance). If there is a need to have separate roles for replication
management (replication admin) and daily ASM management, then a
separate ASM user can be set up. This user must be granted SYSASM and
SYSDBA privileges. The following example shows how to set up a
replication admin user with a user ID of admin and a password of admin1.
If an ASM password file does not exist, you
should create the password file for ASM on all nodes (primary/standby
and secondary nodes with multinode clusters), as follows:
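A sketch, assuming the local ASM instance is named +ASM1 and that the command is run from the Grid Infrastructure home (the password file name follows the orapw<SID> convention):

[oracle@node1 ~]$ orapwd file=$ORACLE_HOME/dbs/orapw+ASM1 password=oracle_4U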
NOTE
Please use a password appropriate for your installation.
Next, create the ASM user on the primary node and assign the appropriate roles:
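A sketch of the SQL, run against the ASM instance on node1 (the admin/admin1 credentials are the ones chosen for this example):

[oracle@node1 ~]$ sqlplus / as sysasm
SQL> CREATE USER admin IDENTIFIED BY admin1;
SQL> GRANT sysasm TO admin;
SQL> GRANT sysdba TO admin;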
Then create the ASM user on the standby node and assign the appropriate roles:
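The same statements are then repeated against the ASM instance on the standby node (node2 in this example):

[oracle@node2 ~]$ sqlplus / as sysasm
SQL> CREATE USER admin IDENTIFIED BY admin1;
SQL> GRANT sysasm TO admin;
SQL> GRANT sysdba TO admin;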
Finally, review changes to the password file by querying v$pwfile_users:
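For example, from either ASM instance:

SQL> SELECT * FROM v$pwfile_users;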
Hereafter, the ACFS administrator role “admin” will refer to the role that manages ACFS file system replication.
File System Setup
Before initiating replication, the ACFS
admin must ensure that the primary file system is mounted and the
standby file system is only mounted on one node (in cluster
configurations).
It is recommended that users have the same
file system name for the standby and primary file systems. Also, if
you're replicating the entire file system (that is, not using ACFS
tagging), ensure that the standby file system is created with a size
equal to or larger than that of the primary file system.
Also, you should ensure that sufficient disk
space is available on both the primary and the standby file systems for
storing the replication logs. The “Pause and Resume Replication” section
later in this chapter covers file system sizing details when
replication is used. It is recommended that ACFS administrators monitor
and prevent both the primary file system and the standby file system
from running out of space. Enterprise Manager (EM) can be used for this
monitoring and for sending alerts when the file system becomes more
than 70 percent full.
In 11.2.0.3, the auto-terminate safeguard
functionality was introduced to prevent the primary file system from
running out of space. If 2GB or less of free space is available, ACFS
will terminate replication on the node. Auto-terminate prevents further
consumption of disk space for replication operations and frees disk
space consumed by any replication logs that remain. Before reaching the
2GB limit, ACFS writes warnings about the free space problem in the
Oracle Grid Infrastructure home alert log. Note that relying on the
auto-terminate feature exposes the administrator to losing the ability to
use the standby if the primary fails while it is running near full
capacity. We advise that this feature be used with extreme
caution.
If the primary file system runs out of space,
the applications using that file system may fail because ACFS cannot
create a new replication log. If the standby file system runs out of
space, it cannot accept new replication logs from the primary node;
therefore, changes cannot be applied to the standby file system, which
causes replication logs to accumulate on the primary file system as
well. In cases where the ACFS file system space becomes depleted, ACFS
administrators can expand the file system, remove unneeded ACFS
snapshots, or remove files to reclaim space (although the latter option
is not recommended). If the primary file system runs out of space and
the ACFS administrator intends to remove files to free up space, then
only files that are not currently being replicated (such as when ACFS
tagging is used) should be removed because the removal of a file that is
replicated will itself be captured in a replication log.
Network Setup
Two steps are needed to configure the network for ACFS replication:
1. Generate the appropriate Oracle
Network files. These files provide communication between the ASM
instances and ACFS replication.
2. Set the appropriate network
parameters for network transmission. Because ACFS replication is heavily
tied to network bandwidth, the appropriate settings need to be
configured.
Generating the Oracle Network Files
ACFS replication utilizes Oracle Net
Services for transmitting replication logs between primary and standby
nodes. The principal Oracle Net configuration is a file called
tnsnames.ora, and it resides at $ORACLE_HOME/network/admin/tnsnames.ora.
This file can be edited manually or through a configuration assistant
called netca in the Grid Home. The tnsnames.ora file must be updated on
each of the nodes participating in ACFS replication. The purpose of a
tnsnames.ora file is to provide the Oracle environment the definition of
a remote endpoint used during replication. Accordingly, there are
tnsnames.ora files on both the primary and standby nodes.
Once the file systems are created, use
$ORACLE_HOME/bin/netca (from Grid Home) to create connect strings and
network aliases for the primary/standby sites. Figures 11-1 and 11-2 illustrate the usage of NETCA to create Net Services for ACFS Replication.
On netca exit, the following message should be displayed if the services were set up correctly:
In our example, we created a PRIMARY_DATA
service and STANDBY_DATA service for the primary file system and standby
file system, respectively. The tnsnames.ora file used on the primary
node is as follows:
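(A sketch of such an entry; the alias, hostname, port, and service name shown here match the values described next and should be adjusted to your environment.)

STANDBY =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = node2)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = acfs_fs)
    )
  )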
The important elements are the alias name
(STANDBY), the hostname (node2), the default port (1521), and the
service name (acfs_fs). This tnsnames.ora file defines the remote
endpoint for replication (in this case, the standby endpoint as seen
from node1, the primary node). The standby node, in turn, requires a
tnsnames.ora file that defines the primary endpoint. It contains the following:
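(Again a sketch; the PRIMARY alias name is assumed here for symmetry with the entry above.)

PRIMARY =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = node1)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = acfs_fs)
    )
  )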
Notice the symmetry between the two tnsnames.ora files. For the sake of simplicity, the service names are the same.
Setting the Network Tunables
Successful replication deployment requires
network efficiency and sufficient bandwidth; therefore, the appropriate
network tuning must be performed. For ACFS replication, first determine
if Data Guard (DG) is already configured on the hosts. If DG is set up
appropriately with the appropriate network tunable parameters, then ACFS
replication can leverage the same settings. If DG is not enabled, use
the Data Guard best practices guide for network setup. The following
document describes these best practices (see the “Redo Transport Best
Practices” section of this paper):
Validating Network Configuration
Use the tnsping utility and SQL*Plus to test
and ensure that the tnsnames.ora files are set up correctly and basic
connectivity exists between both sites.
Execute the following to test connectivity from the primary node:
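A sketch, using the STANDBY alias from the tnsnames.ora sketch above and the admin replication user created earlier:

[oracle@node1 ~]$ tnsping STANDBY
[oracle@node1 ~]$ sqlplus admin/admin1@STANDBY as sysasm
SQL> SELECT instance_name FROM v$instance;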
Execute the following to test connectivity from the standby node:
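And the equivalent check in the other direction (same assumptions):

[oracle@node2 ~]$ tnsping PRIMARY
[oracle@node2 ~]$ sqlplus admin/admin1@PRIMARY as sysasm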
Replication Configuration and Initiation
Before proceeding, ensure for good measure that the file systems are mounted on each node. Replication is then initiated, first on the standby node and then on the primary node, and validated, as described in the following sections.
Initializing the Standby File System
Replication is first initiated on the
standby node, followed by initiation on the primary. Replication on the
standby is initiated using the /sbin/acfsutil command by the root user,
like so:
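A sketch of the command; the admin/admin1 credentials, the PRIMARY net alias, and the /acfs mount point follow this chapter's examples:

[root@node2 ~]# /sbin/acfsutil repl init standby -p admin/admin1@PRIMARY /acfs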
NOTE
If this command is interrupted for any
reason, the user must re-create the standby file system, mount it only
on one node of the site hosting the standby file system, and then rerun
the command.
This command uses the following configuration information:
The -p option indicates the username connection to the primary file system site as well as the service name to be used to connect as ASMADMIN on the primary file system node.
The file system listed is the standby file system (/acfs).
If the standby site is using a different service name than the primary file system site, the -c service_name option is required. (This option is not shown in the example.)
Now you need to verify that the standby file system is initiated:
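For example, using the acfsutil repl info command on the standby node (a sketch):

[root@node2 ~]# /sbin/acfsutil repl info -c /acfs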
Initializing the Primary File System
Once the standby node has been enabled, the
ACFS admin can initialize replication on the primary file system by
running the acfsutil repl init primary command:
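A sketch, again using this chapter's names and run as root on the primary node:

[root@node1 ~]# /sbin/acfsutil repl init primary -s admin/admin1@STANDBY /acfs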
This command allows for the following configuration information:
The -s option, followed by the connect string used to connect as ASMADMIN on the standby node.
The ACFS file system that is to be replicated.
The mount point on the standby node (-m mountp).
This is optional and not shown in the example. If not specified, it is
assumed that this mount point path is the same on the standby node as it
is on the primary file system node.
The -c option, which is used to indicate the primary service name. Again, this is optional and not shown in the example.
If tagging was enabled for this directory, then the tag name “reptag” can be added in the initialization command, as follows:
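For example (a sketch; the tag name is supplied in addition to the options already shown, and the exact argument order should be confirmed against the acfsutil help output for your release):

[root@node1 ~]# /sbin/acfsutil repl init primary -s admin/admin1@STANDBY reptag /acfs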
Next, you need to verify that the primary file system is initiated:
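For example, the same acfsutil repl info check can be run on the primary node (a sketch):

[root@node1 ~]# /sbin/acfsutil repl info -c /acfs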
Once the acfsutil repl init primary command
completes successfully, replication will begin transferring copies of
all specified files to the standby file system.
The replication happens in two phases: The
initial phase copies just the directory tree structure, and the second
phase copies the individual files. During this second phase, all updates
or truncates to replicated files are blocked. Once a file is completely
copied to the standby file system, replication logging for that
particular file is enabled. All changes to copied files are logged,
transported, and applied to the standby file system.
Next, you need to validate replication instantiation:
The rate of data change on the primary file system can be monitored using the acfsutil info fs -s command,
where the -s flag indicates the sample rate in seconds.
The amount of change includes all user and metadata modifications to
the file system. The following example illustrates its usage:
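For instance, to sample the rate of change on the /acfs file system every 60 seconds (a sketch; the reported output will vary with the workload):

[oracle@node1 ~]$ /sbin/acfsutil info fs -s 60 /acfs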
This “amount” value approximates the size of
replication logs generated when capturing changes to the file system.
This command is useful for approximating the extra space required for
storing replication logs in cases of planned or unplanned outages.
Pause and Resume Replication
The acfsutil repl pause command is used in
instances when replication needs to be temporarily halted, such as for
planned downtime on either the primary or standby site. The ACFS pause
command can be issued on either the primary or a standby file system
node. However, there is a difference in behavior between the two
scenarios.
When the pause command is issued on the standby node, replication logs
continue to be generated on the primary and propagated to the standby
file system, but these rlogs are not applied to the standby file system;
in other words, the transfer of rlogs from the primary node is not
suspended, only their application is deferred.
Consequently, rlogs will continue to accumulate at the node hosting the
standby file system. As noted earlier, replication logs are deleted on
the primary and standby sites only after they are successfully applied
to the file system on the standby node; therefore, care should be taken
to ensure that this does not cause the primary and standby file systems
to run out of space.
When the pause command is issued on the primary node, replication logs
are still generated but are not propagated to the standby; in other
words, rlogs are generated but their propagation is suspended. In this
scenario, rlogs will continue to accumulate at the primary file system,
which may cause the primary file system to run out of space. Thus, when
replication is paused on the standby, it is possible to run out of space
on both the primary and the standby, whereas pausing on the primary can
only cause issues for the primary. This distinction is relevant when a
standby system is the destination for multiple file systems.
In both cases, ACFS administrators should run
the acfsutil repl resume command at the earliest point possible, before
the accumulated replication logs fill the file system. Note that the
resume command should be executed at the same location where replication
was paused.
In cases where there is a planned outage and
the standby and primary file systems have to be unmounted, it is best to
ensure that all the changes are propagated and applied on the standby
file system. The acfsutil repl sync command is used for this purpose. It
is used to synchronize the state of the primary and standby file
systems, and it implicitly causes all outstanding replication data to be
transferred to the standby file system. The acfsutil repl sync command
returns success when this transfer is complete or when all these changes
have been successfully applied to the standby file system, if the apply
parameter is supplied. This command can only be run on the node hosting
the primary file system.
For unplanned outages, if the cluster (or
node) hosting the primary file system fails, the administrator of the
standby file system should decide whether or not the situation is a
disaster. If it is not a disaster, then when the primary site recovers,
replication will automatically restart. If it is a disaster, you should
issue an acfsutil repl terminate standby command on the standby file
system to convert it into a primary. If replication needs to be
reinstantiated, then once the original primary is restarted, replication
initialization will need to be performed again.
If the node hosting the standby file system
fails, a major concern is the amount of update activity that occurs on
the primary file system relative to the amount of free space allocated
to address standby file system outages. If the primary file system's
free space is exhausted because updates cannot be transferred to
the standby file system, a “file system out of space” condition will
occur and space will need to be made available—for example, by removing
items no longer needed (particularly snapshots), performing a file
system resize to add space, and so on. However, assuming the standby
comes back, then as soon as primary file system space is available,
replication will continue. During this interval, where no space is
available, the file system will return errors in response to update
requests. If the standby file system is going to be down for a long
period of time, it is recommended that the primary file system be
unmounted to avoid update activity on the file system that could result
in an out-of-space condition. When the standby file system becomes
available, the primary file system could be remounted and replication
will restart automatically. Alternatively, the primary file system admin
could elect to terminate and reinstantiate once the site hosting the
standby file system is recovered.
Sizing ACFS File Systems
To size the primary and standby file systems
appropriately for these planned and unplanned outages, you can use the
acfsutil info fs command, described earlier, as a guide to determine the
rate of replication log creation. First, determine the approximate time
interval when the primary file system is unable to send replication
logs to the standby file system at its usual rate or when standby file
systems are inaccessible while undergoing maintenance. Although it is
not easy to determine how long an unplanned outage will last, this exercise
helps in determining the overall impact when an unplanned outage occurs.
As an aid, run acfsutil info fs -s 1200 on the
primary file system to collect the average rate of change over a
24-hour period with a 20-minute interval:
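For example (a sketch using this chapter's /acfs mount point; let the command run for the full 24-hour observation window):

[oracle@node1 ~]$ /sbin/acfsutil info fs -s 1200 /acfs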
The output from this command helps determine
the average rate of change, the peak rate of change, and how long the
peaks last. Note that this command only collects data on the node it is
executed on. For clustered configurations, run the command and collect
data for all nodes in the cluster.
In the following scenario, assume that t = 60
minutes is the time interval that would adequately account for network
problems or maintenance on the site hosting the standby file system. The
following formula approximates the extra storage capacity needed for an
outage of 60 minutes:
N = Number of cluster nodes in the primary site generating rlogs
pt = Peak amount of change generated across all nodes for time t
t = 60 minutes
Therefore, the extra storage capacity needed to hold the replication logs is (N * 1GB) + pt.
In this use-case example, assume a four-node
cluster on the primary where all four are generating replication logs.
Also, during peak workload intervals, the total amount of change
reported for 60 minutes is approximately 6GB for all nodes. Using the
preceding storage capacity formula, 10GB of excess storage capacity on
the site hosting the primary file system is required for the replication
logs, or (4 * 1GB) + 6GB = 10GB.
ACFS Compare Command
In certain situations, users may want to
compare the contents of the primary and standby file systems. The
acfsutil repl compare command can be used to compare the entire ACFS
file system or a subset of files (such as tagged files).
The acfsutil repl compare command requires
that the standby file system be mounted locally for comparison. This can
be accomplished by NFS mounting the standby file system onto the
primary. As with any compare operation, it is recommended that the primary
has limited or no file changes occurring during the comparison.
The acfsutil repl compare command with the -a
option can be used to compare the entire contents of the primary file
system against those on the standby file system. The -a option also
tests for extra files on the standby file system that do not currently
exist on the primary.
The -a option is typically used when no tag
names were specified during the acfsutil repl init operation. When only
tagged files need to be compared, the -t option can be used. Users can
even compare multiple sets of tagged files by listing comma-separated
tag names. This option first locates all filenames on the primary file
system with the specified tag names and compares them to the
corresponding files on the standby. The -t option also tests for extra
files on the standby file system that do not have an associated tag name
specified during the acfsutil repl init operation. The acfsutil repl
info -c option can be used to determine what tags were specified during
the acfsutil repl init operation. If neither the -a nor -t option is
provided, a primary-to-standby file comparison is done without testing
tag names or extended attributes.
The following shows a sample execution of acfsutil repl compare:
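A sketch, assuming the standby file system has been NFS-mounted on the primary node at the hypothetical path /standby_acfs:

[root@node1 ~]# /sbin/acfsutil repl compare -a /acfs /standby_acfs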
Termination of Replication
The acfsutil repl terminate command is used
to abort the ongoing replication. The terminate command operates on a
specific file system. A graceful termination can be achieved by
terminating the replication first on the primary followed by the standby
node. A graceful termination allows for the standby to apply all
outstanding logs.
The following command terminates replication on the primary node:
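For example, using this chapter's /acfs mount point:

[root@node1 ~]# /sbin/acfsutil repl terminate primary /acfs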
Next, terminate replication on the standby node:
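For example:

[root@node2 ~]# /sbin/acfsutil repl terminate standby /acfs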
After the standby is terminated, the file system is automatically converted to writable mode.
Once file system replication termination has
completed for a specific file system, no replication infrastructure
exists between the primary and standby file systems. The termination of
replication is a permanent operation and requires a full
reinitialization to instantiate again. To restart replication, use the
acfsutil repl init command, as previously illustrated.
ACFS Security and Encryption
As discussed in Chapter 10,
ACFS provides standard POSIX file system support; however, ACFS also
provides other file system services, such as security, encryption,
tagging, snapshots, and replication. In this section we cover ACFS
Security and ACFS Encryption. ACFS Security and Encryption—along with
Oracle Database Vault and Oracle Advanced Security Option (ASO)—provide
a comprehensive security solution for unstructured data residing
outside the database and database-resident structured data,
respectively. Note that ACFS Security and Encryption are not part of the
ASO license; they must be licensed separately via the Cloud Edition.
There are two aspects of security on a file
system. One is the restriction of logical access to the data, such as
obtaining file information or file location. The other is preventing
physical access to the data, such as opening, reading, or writing to
data (files). The former is handled by ACFS Security and the latter by
ACFS Encryption.
Databases and Security
Databases generally have peripheral data
(data that lives outside the database, but has direct ties to the data
within the database), such as medical reports and images, text files,
contracts, metadata, and other unstructured data. This data needs to be
kept secure and must meet regulatory compliance requirements (for example, SOX,
HIPAA, PCI, or PII).
ACFS supports the Unix “user, group, others”
model and supports Access Control Lists (ACLs) on Windows. These
constructs are based on the Discretionary Access Control (DAC) model. In
the DAC model, controls are discretionary in the sense that a subject
with certain access permission is capable of passing that permission
(perhaps indirectly) on to any other subject. In the case of a file
system, the owner of a file can pass the privileges to anybody.
Besides some of the issues in DAC, such as
transfer of ownership, a major concern is that the root user or
administrator will bypass all user security and have the privileges to
access or modify anything on the file system. For databases where the
DBA (database administrator) has more privileges than required to
perform his duties, Oracle addresses this problem with a security
product called Oracle Database Vault, which helps users address such
security problems as protecting against insider threats, meeting
regulatory compliance requirements, and enforcing separation of duty. It
provides a number of flexible features that can be used to apply
fine-grained access control to the customer’s sensitive data. It
enforces industry-standard best practices in terms of separating duties
from traditionally powerful users. It protects data from privileged
users but still allows them to maintain Oracle databases.
The goal of Oracle Database Vault, however, is
limited to Oracle databases. Today, customers need the same kind of
fine-grained access control to data outside the database (such as Oracle
binaries, archive logs, redo logs, and application files such as Oracle
Apps). ACFS Security fills this gap with a similar paradigm, in which
realms, rules, rule sets, and command rules (described later) provide
fine-grained access to data.
ACFS Security
ACFS Security provides finer-grained access
policy definition and enforcement than allowed by an OS-provided access
control mechanism alone. Another goal of ACFS Security is to provide a
means to restrict users’ ability to pass privileges of the files they
own to other users if they are not consistent with the global policies
set within an organization. Lastly, ACFS Security follows the principle
of least privilege in the facilities it provides for the definition and
administration of security policies.
ACFS Security uses realms, rules, rule sets, and command rules for the definition and enforcement of security policies:
Realm An
ACFS realm is a functional grouping of file system objects that must be
secured for access by a user or a group of users. File system objects
can be files or directories. By having these objects grouped in the form
of a realm, ACFS Security can provide fine-grained access control to
the data stored in ACFS. For realm protection to take effect, objects
must be added to a realm. Objects can be added to more than one realm.
The definition of a realm also includes a list of users and groups. Only
users who are part of the realm, directly or indirectly via the groups,
can access the objects within the realm, and only if the realm's rules
are satisfied.
Rule A
rule is a Boolean expression that evaluates to TRUE or FALSE based on
some system parameter on which the rule is based. An option of ALLOW or
DENY can be associated with each rule. Rules can be shared among
multiple rule sets. For example, a “5–9PM” rule evaluates to TRUE if the
system time is between 5 p.m. and 9 p.m. when the rule is evaluated.
ACFS Security supports four types of rules:
Time Evaluates
to TRUE or FALSE based on whether the current system time falls between
the start time and end time specified as part of rule definition.
User Evaluates to TRUE or FALSE based on the user executing the operation.
Application Evaluates to TRUE or FALSE based on the application that is accessing the file system object.
Hostname Evaluates
to TRUE or FALSE based on the hostname accessing the file system
object. The hostname specified must be a cluster member and not a client
host accessing an ACFS file system via NFS, for example.
Rule set A
rule set is a collection of rules that evaluates to “allow” or “deny”
based on the assessment of its constituent rules. Rule sets can be
configured to evaluate to “allow” if all constituent rules evaluate to
TRUE with the option “allow” or if at least one rule evaluates to TRUE
with the option “allow” depending on the rule set options.
Command rule Oracle
ACFS command rules are associations of the file system operation with a
rule set. For example, the association of a file system create, delete,
or rename operation with a rule set makes a command rule. Command rules
are associated with a realm.
ACFS Security Administrator
In accordance with the principle of least
privilege, ACFS Security mandates that security policy definition and
management be the duty of a user with a well-defined security
administrator role and not a user with the system administrator role. To
this end, as part of initializing ACFS Security, the system
administrator is required to designate an OS user as an ACFS security administrator.
A temporary password is set for this security administrator, and it
should be changed immediately to keep the security administrator’s role
secure. This security administrator can then designate additional users
as security administrators using the acfsutil sec admin add command.
Only a security administrator can designate or remove another user as a
security administrator. There is always at least one security
administrator once ACFS Security has been initialized, and the last
security administrator cannot be removed using the acfsutil sec admin
remove command.
The security administrator creates and manages
security policies using realms, rules, rule sets, and command rules.
For any administrative tasks, the security administrator must
authenticate himself using a password that is different from his OS
account password. Each security administrator has a unique password,
which can be changed only by that security administrator. These
passwords are managed by ACFS Security infrastructure and are kept in a
secure Oracle Wallet stored in the Oracle Cluster Repository (OCR).
Security administrators are allowed to browse any part of the
file system tree. This allows them to list and choose files and
directories to be realm-secured. No security administrator, however, is
allowed to read the contents of any files without appropriate OS and realm permissions.
Enabling and Disabling ACFS Security
ACFS Security can be enabled or disabled on a
file system by running the acfsutil sec enable and acfsutil sec disable
commands, respectively. Disabling ACFS Security on a file system
preserves all the security policies defined for that file system, but
disables their enforcement, which implies that access to files and
directories on that file system is arbitrated only through the OS
mechanism. To re-enable enforcement, the ACFS security administrator can
run the acfsutil sec enable command. Security can be enabled and
disabled at the file system or realm level. By default, ACFS Security is
enabled on a file system when it is prepared for security. A newly
created realm can have security enabled or disabled via a command-line
option (the default is enabled). Disabling ACFS Security at the file
system level disables enforcement via all realms defined for that file
system. Enable and disable capability can be useful when security
policies are not completely defined and the security administrator
wishes to experiment with some policies before finalizing them.
Configuring ACFS Security
ACFS Security is supported only for ASM 11g Release 2, and the disk group compatibility attributes for ASM and ADVM must be set to 11.2.0.x, where x represents the version of the ASM installed.
ACFS file systems can be configured to use
ACFS Security via the acfsutil sec commands or the ASMCA utility. ACFS
Security must be initialized before any file systems can be
configured to use it. This is done using the acfsutil sec init command,
which needs to be run only once for the entire cluster. As part of the
acfsutil sec init command, an OS user is designated to be the first
security administrator. It is recommended that this OS user be distinct
from the DBA user. This user must also be in an existing OS group
designated as the Security Administrator group. Additional users can be
designated as security administrators. All security administrators,
however, must be members of the designated security OS group. Moreover,
initializing ACFS Security also creates the storage necessary to house
the security administrator’s security credentials.
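For example, a sketch using the orasec user and group created in the implementation example later in this chapter:

[root@node1 ~]# /sbin/acfsutil sec init -u orasec -g orasec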
Once ACFS Security has been initialized for the cluster, the security administrator can prepare
a file system to use it by running the acfsutil sec prepare command.
This step is a prerequisite for defining security policies for the file
system. The acfsutil sec prepare command performs the following actions:
It initializes ACFS Security metadata for the file system.
It enables ACFS Security on the file system.
It creates the following directories in the file system that is being prepared:
.Security
.Security/backup
.Security/logs
It builds the following system security realms:
SYSTEM_Logs Protects ACFS Security log files in the .Security/realm/logs/ directory.
SYSTEM_SecurityMetadata Protects the ACFS Security metadata XML file in the .Security/backup/ directory.
On
the Windows platform, the SYSTEM_Antivirus realm is created to provide
installed antivirus software programs access to run against the ACFS
file system. The SYSTEM_Antivirus realm can only perform the OPEN, READ,
READDIR, and setting time attribute operations on a file or directory.
Generally, antivirus software programs inoculate and delete infected
files. For the antivirus software programs to perform these actions
successfully, the ACFS security will need to be disabled. For every
realm-protected file or directory, the SYSTEM_Antivirus realm is
evaluated when authorization checks are performed to determine if the
SYSTEM_Antivirus realm allows access to the file or directory. To allow
the antivirus process to access realm-protected files or directories,
you must add the LocalSystem or SYSTEM group to the realm with the
acfsutil sec realm add command. If antivirus processes are running as
administrator, then the user administrator must be added to the
SYSTEM_Antivirus realm to allow access to realm-protected files and
directories. If no antivirus products have been installed, do not add
any users or groups to the SYSTEM_Antivirus realm. Because users or
groups added to the SYSTEM_Antivirus realm have READ and READDIR access,
you should limit the users or groups added to this realm. ACFS
administrators can restrict the time window when the users or groups of
this realm can access the realm-protected files or directories with
time-based rules. Additionally, ACFS administrators can also have
application-based rules if they can identify the process name for the
antivirus installation that scans the files.
Once a file system has been prepared for
security, the security administrator can start defining security
policies for the data in the file system by considering the following:
What data needs to be protected? Files to be protected must be added to one or more realms.
Who has access to data? Users intended to be allowed access to files in the realm must be added to the realm.
What
actions are the users allowed or not allowed to take on data? Command
rules (in conjunction with rule sets) define these actions.
Under what conditions can the data be accessed? Rules and rule sets define these criteria.
Access to files in a realm of an ACFS file
system must be authorized by both the realm and the underlying OS
permissions (that is, the standard “owner, group, other” permissions on
typical Unix/Linux platforms or Access Control Lists (ACLs) on Windows).
Accessing a file that has security enabled involves tiered validation.
First, access is checked against all realms that the file is a part of.
If even a single realm denies access, the overall operation is not allowed.
If realm authorization allows access, then OS permissions are checked;
only if those also authorize the access is the overall operation allowed.
ACFS Security and Encryption Logging
Auditing is a key aspect of any security
configuration, and ACFS Security is no exception. Auditing and
diagnostic data are logged for ACFS Security and Encryption. These log
files include information such as the execution of acfsutil commands,
use of security or system administrator privileges, run-time realm-check
authorization failures, setting of encryption parameters, rekey
operations, and so on. Logs are written to the following log files:
mount_point/.Security/realm/logs/sec-host_name.log This
file is created during the acfsutil sec prepare command and is itself
protected by ACFS Security using the SYSTEM_Logs realm.
$GRID_HOME/log/host_name/acfssec/acfssec.log This
file contains messages for commands that are not associated with a
specific file system, such as acfsutil sec init. The directory is
created during installation and is owned by the root user.
When an active log file grows to a predefined maximum size (10MB), the file is automatically moved to log_file_name.bak,
the administrator is notified, and logging continues to the regular log
file name. When the administrator is notified, he must archive and
remove the log_file_name.bak file. If an active log file grows to the maximum size and the log_file_name.bak
file exists, logging stops until the backup file is removed. After the
backup log file is removed, logging restarts automatically.
Databases and Encryption
Although mechanisms natively built into the
Oracle Database and those provided by Oracle Database Vault can control
access to data stored in the database, this data is not protected from
direct access via physical storage. A number of third-party tools can be
used to provide read and write access to data stored on secondary
storage, thus circumventing protection provided by the database and the
OS. Furthermore, these database and OS protection mechanisms do not
protect against data loss or theft. For example, storage can be
re-attached to a completely different system from the one it was
intended for. Features in Oracle Database's Advanced Security Option
(ASO) provide protection against such scenarios. Transparent Data
Encryption (TDE) provides the capability to encrypt data at the
column and tablespace levels, serving the data protection and compliance
needs of customers.
ACFS and File Encryption
The same threats as those mentioned
previously for database data exist for file system data too. ACFS
Encryption protects from these threats by encrypting data stored on a
secondary device, or data at rest. It should be noted that ACFS Encryption protects user data and not file system metadata.
Keeping data at rest encrypted renders the data useless without
encryption keys in case of theft of the physical storage on which the
data resides.
Encryption can be applied to individual files,
directories, or an entire ACFS file system. Furthermore, both encrypted
and nonencrypted files can exist in the same ACFS file system.
Applications need no modification to continue to work seamlessly with
encrypted files. Data is automatically encrypted when it is written to
disk and automatically decrypted when accessed by the application. It
should be noted that encryption is used for protecting stored data. It
does not provide access control or protection against malicious access,
both of which fall under the purview of ACFS Security and OS-level
access control mechanisms. Thus, a user authorized to read a file would
always get plain-text data.
Figure 11-3 shows the application and internal view of ACFS Encryption.
ACFS Encryption imposes no penalty on cached
reads and writes because the data in the OS page cache is in plain text.
Data is encrypted when it is flushed to disk and decrypted when read
from disk into the OS page cache for the first time.
ACFS Encryption Key Management
ACFS Encryption requires minimal key
management tasks to be performed by the administrator. Keys are
transparently created and securely stored with minimal user
intervention. ACFS Encryption uses two-level keys to minimize the amount
of data that is encrypted with a single key. A file encryption key
(FEK) is a per-file unique key. A file’s data is encrypted using the
FEK. A volume encryption key (VEK), a per-file system key, serves as a wrapping key, and each FEK is stored on disk encrypted using the VEK. Figure 11-4 shows the relationship between the two types of keys.
The encryption keys are never stored on disk
or in memory in plain text. The keys are either obfuscated or encrypted
using a user-supplied password. ACFS Encryption supports the Advanced
Encryption Standard (AES), which is a symmetric cipher algorithm,
defined in Federal Information Processing Standard (FIPS) 197. AES
provides three approved key lengths: 256, 192, and 128 bits. The key
length can be specified when you are configuring ACFS Encryption for a
file system.
ACFS Encryption supports the “rekey” operation
for both VEKs and FEKs. The rekey operation generates a new key and
reencrypts the data with this new key. For FEKs, the data to encrypt is
the user data residing in files and for VEKs the data is the FEKs.
ACFS Encryption Configuration and Use
Before using ACFS Encryption, the system
administrator needs to create storage for encryption keys using the
acfsutil encr init command. This command needs to be run once per
cluster, and it must be run before any other encryption commands. ACFS
Encryption provides an option to create password-protected storage for
encryption keys. Creating password-protected storage requires that the
password be supplied whenever an operation is going to read or modify
the encryption key store. The three operations that read or modify the
encryption key store are acfsutil encr set, acfsutil encr rekey –v, and
mount.
Once ACFS Encryption has been initialized, a
file system can be configured to use it via the acfsutil encr set
command. This command sets or changes encryption parameters, algorithm,
and key length for a file system. Once this command has been run on a
file system, individual files and directories or the entire file system
can be encrypted using the acfsutil encr on command.
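A minimal sketch of this sequence, assuming the /acfs file system and an AES key length of 128 bits (the algorithm and key length are set with the -a and -k options):

[root@node1 ~]# /sbin/acfsutil encr init
[root@node1 ~]# /sbin/acfsutil encr set -a AES -k 128 /acfs
[root@node1 ~]# /sbin/acfsutil encr on /acfs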
Certain ACFS Encryption command usage and
functionality requires system administrator privileges. This
functionality includes the commands for initiating, setting, and
reconfiguring ACFS Encryption. System administrators and ACFS security
administrators can initiate ACFS encryption operations; however,
unprivileged users can initiate encryption for files they own. An ACFS
security administrator can manage encryption parameters on a per-realm
basis. After a file is placed under realm security, file-level
encryption operations are not allowed on that file. Even if ACFS
Security allows the file owner or the root user to open the file,
file-level encryption operations are blocked. Encryption of
realm-protected files is managed exclusively by the ACFS
security administrator, who can enable and disable encryption for files
at a security realm level. After a directory has been added to a
security realm, all files created in the directory inherit the
realm-level encryption parameters. When a file is removed from its last
security realm, the file is encrypted or decrypted to match the file
system-level encryption status. The file is not re-encrypted to match
file system-level parameters if it was already encrypted with security
realm parameters.
A system administrator cannot rekey
realm-secured files at the file system or file level. To ensure all
realm-secured files are encrypted with the most recent VEK, you must
first remove encryption from all realms, and then re-enable encryption.
This action re-encrypts all files with the most recent VEK.
Encryption information for Oracle ACFS file
systems is displayed in the V$ASM_ACFS_ENCRYPTION_INFO (or
GV$ASM_ACFS_ENCRYPTION_INFO) view or by using the acfsutil sec info and
acfsutil encr info commands.
ACFS Snapshots, Security, and Encryption
Users cannot modify security or encryption
metadata in read-only snapshots. That is, security policies cannot be
modified or created and files cannot be encrypted, decrypted, or rekeyed
in a read-only snapshot. Files in a snapshot, however, preserve their
security and encryption statuses as they existed at the time of
snapshot creation. Changing the encryption or security status of a file
in the live file system does not change its status in the snapshot,
whether read-only or read-write. Therefore, if a file was not secured by
a realm in the snapshot, it cannot be realm-secured by adding the
corresponding file in the active file system to a security realm. If a
file was not encrypted in the snapshot, that file cannot be encrypted by
encrypting the corresponding file in the active file system. Therefore,
unprotected files in snapshots present another potential source of data
for malicious users. When applying security and encryption policies, an
administrator should be aware of these potential backdoors to
unprotected data. To that end, when certain encryption operations such
as enabling of file system–level encryption and rekey are attempted, a
warning is printed to let the administrator know that these will not
affect the files in the snapshot(s). To ensure no unprotected copies of
data are available for misuse, administrators should confirm that no
snapshots exist when security and encryption policies are applied.
Because read-write snapshots allow changes to
files in the snapshot, the encryption status of files can also be
changed. They can be encrypted, decrypted, or rekeyed by specifying as
the target a path in a read-write snapshot. An encryption, decryption,
or rekey operation specified at the file system level, however, does not
process files and directories of snapshots, read-only or read-write, in
the .ACFS/snaps directory. For the purpose of these operations, the
file system boundary includes only the live file system and not its
snapshots. To do these operations on read-write snapshot files, the
administrator can specify as the target a path in the read-write
snapshot.
In the 11.2.0.3 release, changing or creating
security policies in a read-write snapshot is not yet supported.
Furthermore, files and directories in a read-write snapshot cannot be added to or removed from security realms. A new file created in a realm-secured directory in a
read-write snapshot, however, inherits the realm security attributes of
the parent directory. If the realm protecting the new file has
encryption turned on, the file is encrypted with the encryption
parameters set in the realm. If the realm protecting the new file has
encryption turned off, the file is decrypted.
ACFS Security and Encryption Implementation
This section describes the steps to
implement ACFS Security and Encryption. These steps can be performed by
using the command line or the ASMCA utility. A mixture of both will be
shown for simplicity.
Here’s the use-case scenario: Company ABC
provides escrow services to buyers and sellers. As part of the value-add
services, ABC provides access to an information library. This library,
which is managed and maintained by ABC, stores historical content such
as preliminary reports, pro forma, and escrow final reports. ABC loads
all these reports from the front-end servers to the ACFS file system
that runs on a RAC cluster, with one directory per escrow. ABC now wants
to encrypt and secure all the escrow content.
In this example, it is assumed that the RAC
cluster is built, the ACFS file system is created with the appropriate
directories, and the content is loaded. Here are the steps to follow:
1. Create or identify the OS user
who will be the ACFS security administrator for the cluster. In our use
case, we will create a user named orasec in the orasec group.
2. Launch ASMCA to initialize ACFS Security and Encryption (see Figure 11-5).
Because enabling ACFS Security can only
be done by a privileged user, the Show Command button will display the
command to be issued by root. The user will be prompted to enter a new password for the ACFS security administrator, which must be at least eight
characters long. Note that this password is not the login password for
the OS user but rather the password for the ACFS security administrator.
Here’s the command to configure ACFS Security:
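A representative invocation (run as root), assuming the orasec user and group created in step 1:
acfsutil sec init -u orasec -g orasec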
And here’s the command to configure ACFS Encryption:
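A representative invocation (also run as root):
acfsutil encr init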
3. Verify that orasec has the
appropriate authorization to display ACFS Security information. Execute
the acfsutil sec info commands using only the chosen user ID:
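For example, logged in as the orasec user (the mount point is the one used throughout this example):
acfsutil sec info -m /u01/app/acfsdata/bfile_data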
4. Configure the ASM disk group
compatibility attributes for ADVM and ASM. ASM has several disk group
attributes, but in the context of ACFS Security and Encryption the
relevant ones are the following:
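As a sketch, the COMPATIBLE.ASM and COMPATIBLE.ADVM attributes could be set from SQL*Plus as follows (the disk group name and version values are illustrative; check the documentation for the minimum versions required by your release):
ALTER DISKGROUP bfile_dg SET ATTRIBUTE 'compatible.asm'  = '11.2.0.3';
ALTER DISKGROUP bfile_dg SET ATTRIBUTE 'compatible.advm' = '11.2.0.3';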
5. Create or identify the ACFS file
system that will be secured. Once this is identified, prepare the file
system for ACFS Security. In our case, we want to enable ACFS Security
on the /u01/app/acfsdata/bfile_data file system:
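A representative command, run as the ACFS security administrator (you will be prompted for the security administrator password):
acfsutil sec prepare -m /u01/app/acfsdata/bfile_data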
6. Verify that security is enabled:
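For example:
acfsutil sec info -m /u01/app/acfsdata/bfile_data
The output should report that security is enabled for the file system.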
7. As stated earlier, ACFS Security
can be used in conjunction with ACFS Encryption. For these cases,
encryption must be initialized and set before encryption is enabled on a
security realm. Keep in mind that in our example we initialized ACFS
Security and Encryption in one command (see step 2). However, if this
was not done in step 2, the following needs to be executed as root:
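(A representative sequence, using the mount point from this example:)
acfsutil encr init
acfsutil encr set -m /u01/app/acfsdata/bfile_data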
On the other hand, if it was performed in step 2, then run the following as root:
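(Again, a representative command for this example:)
acfsutil encr set -m /u01/app/acfsdata/bfile_data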
Note that we did not specify an AES encryption algorithm or key length, so ACFS picked up the defaults. To set a different key length (AES is the only supported
algorithm), use the –k option. For example, to set AES encryption of 256
bits, execute the following:
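(A representative command:)
acfsutil encr set -a AES -k 256 -m /u01/app/acfsdata/bfile_data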
The acfsutil encr set command
transparently generates a volume encryption key that is kept in the key
store that was previously configured with the acfsutil encr init
command.
Here’s the command to verify that encryption has been set:
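(A representative command:)
acfsutil encr info -m /u01/app/acfsdata/bfile_data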
Note that “Encryption status” here is
set to OFF. This is because the acfsutil encr set command does not encrypt any data on the file system, but only creates a volume
encryption key (VEK) and sets the encryption parameters for the file
system.
8. Enable encryption at the file system level with the following command:
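(A representative command, run as root:)
acfsutil encr on -m /u01/app/acfsdata/bfile_data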
9. Create the ACFS Security rule sets and then the rules:
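The exact rules depend on the security policy; as a sketch, a rule set and a username-based rule for this example might be created as follows (the rule set name, rule name, and user are illustrative):
acfsutil sec ruleset create escrow_ruleset -m /u01/app/acfsdata/bfile_data -o ALL_TRUE
acfsutil sec rule create escrow_user_rule -m /u01/app/acfsdata/bfile_data -t username escrowapp -o ALLOW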
This example specifies a rule type and a
rule value. The rule type can be application, hostname, time, or
username. The rule value depends on the type of rule. A rule can be
added to a rule set, and that rule set can be added to a realm. However,
you can create singleton rules without having the hierarchy of the rule
set and realms.
10. Add new rules to the rule set:
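(A representative command, adding the rule created in step 9 to the rule set:)
acfsutil sec ruleset edit escrow_ruleset -m /u01/app/acfsdata/bfile_data -a escrow_user_rule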
All these definitions could be listed in a file and executed in batch using the acfsutil sec batch command.
Summary
ACFS Data Services such as Replication,
Tagging, Snapshots, Security, and Encryption complement Oracle’s
Database Availability and Security technologies, such as Data Guard,
Database Vault, and Transparent Data Encryption. In addition, ACFS Data Services provides rich file support for unstructured data.
ASM Optimizations in Oracle Engineered Solutions
When
implementing a private cloud solution, a common question is, “Should I
build or buy?” In the “build” option, the IT personnel would create a
cloud pool by purchasing all the essential components, such as servers,
HBAs, storage arrays, and fabric switches. Additionally, all software
components—RAC, ASM, Grid Infrastructure, and Database—would have to be
installed and validated. Although this approach allows IT the
flexibility to pick and choose the appropriate components, it does
increase mean time to deploy as well as the chances of misconfiguration.
With the “buy” approach, users can deploy
Engineered Solutions or integrated solutions, which are pretested,
prevalidated, and preconfigured. Both provide fast deployment and
simplified management. Examples of integrated solutions include HP
CloudSystem and vCloud by VCE. Engineered Solutions, such as Oracle’s
Exadata, are not just integrated solutions, but the application software
(the Oracle database, in this case) and the hardware are tightly and
intelligently woven together.
Oracle Exadata Database Machine, Oracle
Database Appliance, and Oracle SPARC SuperCluster are solutions designed
to be optimal platforms for Oracle Database and therefore ideal
platforms for Private Database Cloud computing. This chapter focuses on
the ASM optimizations developed specifically for Oracle Exadata and
Oracle Database Appliance. This chapter is not meant to be an exhaustive
look at Exadata or ODA; many papers are available on Oracle Technology
Network (OTN) that cover this topic.
Overview of Exadata
Oracle Exadata Database Machine includes all
the hardware and software required for private cloud deployments.
Exadata combines servers, storage, and networks into one engineered
package, eliminating the difficult integration problems typically faced
when building your own private cloud. Rather than going through the
entire rationalization and standardization process, IT departments can
simply implement Oracle Exadata Database Machine for database
consolidation onto a private cloud.
Exadata Components
This section covers the important components of the Exadata system.
Compute Servers
Compute servers are the database servers
that run the Grid Infrastructure stack (Clusterware and ASM) along with
the Oracle Real Application Clusters stack. The compute servers behave
like standard database servers, except that in Exadata they are linked
with the libcell library, which allows the databases to communicate with
the cellsrv interface in the Exadata Storage Server.
Exadata Storage Server
In the Exadata X2 configuration, the Exadata
Storage cells (Exadata Cells) are servers preconfigured with 2×6-core
Intel Xeon L5640 processors, 24GB memory, 384GB of Exadata Smart Flash
Cache, 12 disks connected to a storage controller with 512MB
battery-backed cache, and dual-port InfiniBand connectivity. Oracle
Enterprise Linux operating system (OS) is the base OS for Exadata Cells
as well as for the compute nodes. All Exadata software comes
preinstalled when delivered. The 12 disks can be either High Performance
(HP) Serial Attached SCSI (SAS) disks that are 600GB 15,000 rpm or High
Capacity (HC) SAS disks that are 3TB 7,200 rpm.
Each of the 12 disks represents a Cell Disk
residing within an Exadata Storage cell. The Cell Disk is created
automatically by the Exadata software when the physical disk is
discovered. Cell Disks are logically partitioned into one or more Grid
Disks. Grid Disks are the logical disk devices assigned to ASM as ASM
disks. Figure 12-1 illustrates the relationship of Cell Disks to Grid Disks in a more comprehensive Exadata Storage grid.
Once the Cell Disks and Grid Disks are
configured, ASM disk groups are defined across the Exadata
configuration. When the data is loaded into the database, ASM will
evenly distribute the data and I/O within disk groups. ASM mirroring is
enabled for these disk groups to protect against disk failures.
Cellsrv (Cell Services), a primary component
of the Exadata software, provides the majority of Exadata storage
services and communicates with database instances on the database server
using the iDB protocol. Cellsrv provides the advanced SQL offload
capabilities, serves Oracle blocks when SQL offload processing is not
possible, and implements the DBRM I/O resource management functionality
to meter out I/O bandwidth to the various databases and consumer groups
issuing I/O. Cellsrv maintains a file called griddisk.owners.dat, which
has details such as the following:
ASM disk name
ASM disk group name
ASM failgroup name
Cluster identifier
When IORM is used, the IORM (I/O Resource
Manager) manages the Exadata cell I/O resources on a per-cell basis.
Whenever I/O requests exceed the capacity of the cell’s disks, IORM schedules the requests according to the configured resource plan; when the cell is operating below capacity, IORM does not queue I/O requests. Under load, IORM queues incoming requests and selects which I/Os to issue based on the resource plan’s allocations; databases and consumer groups with higher allocations are scheduled more frequently than those with lower allocations.
When IORM is enabled, it automatically manages
background I/Os. Critical background I/Os such as log file syncs and
control file reads and writes are prioritized. Databases with higher
resource allocations are able to issue disk I/Os more rapidly. Resource
allocation for workloads within a database is specified through the
database resource plan. If no database resource plan is enabled, all
user I/O requests from the database are treated equally. Background
I/Os, however, are still prioritized automatically.
Two other components of Oracle software
running in the cell are the Management Server (MS) and Restart Server
(RS). The MS is the primary interface to administer, manage, and query
the status of the Exadata cell. MS manageability, which is performed
using the Exadata cell command-line interface (CLI) or EM Exadata
plug-in, provides standalone Exadata cell management and configuration.
For example, from the cell, CLI commands are issued to configure
storage, query I/O statistics, and restart the cell. The distributed CLI
can also be used to issue commands to multiple cells, which eases
management across cells. Restart Server (RS) ensures ongoing functioning
of the Exadata software and services. RS also ensures storage services
are started and running, or services are restarted when required.
InfiniBand Infrastructure
The Database Machine includes an InfiniBand
interconnect between the compute nodes and Exadata Storage Server. The
InfiniBand network was chosen to ensure that sufficient network capacity is in place to support the low-latency and high-bandwidth requirements. Each
database server and Exadata cell has dual-port Quad Data Rate (QDR)
InfiniBand connectivity for high availability. The same InfiniBand
network also provides a high-performance cluster interconnect for the
Oracle Database Real Application Clusters (RAC) nodes.
iDB Protocol
The database servers and Exadata Storage
Server software communicate using the Intelligent Database protocol
(iDB), which is implemented in the database kernel. iDB runs over
InfiniBand and leverages ZDP (Zero-loss Zero-copy Datagram Protocol), a
zero-copy implementation of the industry-standard Reliable Datagram
Sockets (RDSv3) protocol. ZDP is used to minimize the number of data
copies required to service I/O operations. The iDB protocol implements a
function shipping architecture in addition to the traditional data
block shipping provided by the database; for example, iDB is used to
ship SQL operations down to the Exadata cells for execution and to
return query result sets to the database kernel. This allows Exadata
cells to return only the rows and columns that satisfy the SQL query,
instead of returning the entire database blocks as in typical storage
arrays. Exadata Storage Server operates like a traditional storage array
when offload processing is not possible. But when feasible, the
intelligence in the database kernel enables table scans to be passed
down to execute on the Exadata Storage Server so only requested data is
returned to the database server.
11gR2 Database Optimizations for Exadata
Oracle Database 11g Release 2 has
been significantly enhanced to take advantage of Exadata storage. One of
the unique things the Exadata storage does compared to traditional storage is to return only the rows and columns that satisfy the database query rather than the entire table being queried. Exadata pushes SQL
processing as close to the data (or disks) as possible and gets all the
disks operating in parallel. This reduces CPU consumption on the
database server, consumes much less bandwidth moving data between
database servers and storage servers, and returns a query result set
rather than entire tables. Eliminating data transfers and database
server workload can greatly benefit data warehousing queries that
traditionally become bandwidth and CPU constrained. Eliminating data
transfers can also have a significant benefit on online transaction
processing (OLTP) systems that often include large batch and report
processing operations.
Exadata Storage Servers also run more complex operations in storage:
Join filtering
Incremental backup filtering
I/O prioritization
Storage indexing
Database-level security
Offloaded scans on encrypted data
Data mining model scoring
Smart file creation
Exadata is transparent to the application and the database. In fact, the exact same Oracle Database 11g
Release 2 that runs on traditional systems runs on the Database
Machine. Existing SQL statements, whether ad hoc or in packaged or
custom applications, are unaffected and do not require any modification
when Exadata storage is used. The offload processing and bandwidth
advantages of the solution are delivered without any modification to the
application.
ASM Optimizations for Exadata
ASM provides the same functionality in
standard RAC clusters as in Exadata configurations. In Exadata, ASM
redundancy (ASM mirroring) is used, where each Exadata cell is defined
as a failure group. ASM automatically stripes the database data across
Exadata cells and disks to ensure a balanced I/O load and optimum
performance. The ASM mirroring in Exadata can be either normal or high
redundancy, but Maximum Availability Architecture (MAA) best practices
recommend high redundancy for higher resiliency.
ASM in Exadata automatically discovers grid
disks presented by the Exadata Storage Server. The pathname for
discovered Grid Disks has the format of o/cell-ip-address/griddisk-name.
The following is a sample listing of an Exadata Grid Disk discovered by ASM:
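The listing can be reproduced with a query along these lines (the IP address and grid disk name in the sample path are illustrative):
SELECT path FROM v$asm_disk WHERE path LIKE 'o/%';
-- for example: o/192.168.10.3/DATA_CD_00_exacell01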
The o in the Grid Disk pathname
indicates that it’s presented via libcell. Keep in mind that these disks
are not standard block devices; therefore, they cannot be listed or
manipulated with typical Linux commands such as fdisk and multipath.
The IP address in the pathname is the address
of the cell on the InfiniBand storage network. The Grid Disk name is
defined when the Grid Disk was provisioned in Exadata Storage.
Alternatively, the name may be system generated, by concatenating an
administrator-specified prefix to the name of the Cell Disk on which the
Grid Disk resides. Note that all Grid Disks from the same cell use the
same IP address in their pathname.
Disk Management Automation in Exadata
In Exadata, many of the manual ASM
operations have been internally automated and several disk-partnering
capabilities have been enhanced. ASM dynamic add and drop capability
enables non-intrusive cell and disk allocation, deallocation, and
reallocation.
XDMG is a new background process in the ASM instance that monitors the cell storage for any state change. XDMG also handles requests from the cells to online, offline, or drop/add a disk based on a user event or a failure. For example, if a cell becomes
inaccessible from a transient failure, or if a Grid Disk or Cell Disk
in the cell is inactivated, then XDMG will automatically initiate an
OFFLINE operation in the ASM instance.
The XDMG process works with the XDWK process,
which also runs within the ASM instance. The XDWK process executes the
ONLINE or DROP/ADD operation as requested by XDMG. The new processes in
the ASM instance automatically handle storage reconfiguration after
disk replacement, after cell reboots, or after cellsrv crashes.
Exadata Storage Server has the capability to
proactively drop a disk if needed, and is effective for both true disk
failures and predictive failures. When a disk fails, all I/Os to that
disk will fail. The proactive disk drop feature will then automatically
interrupt the existing drop operation (triggered by the prior disk
predictive failure) and turn it into a disk drop force. This is to
ensure that redundancy gets restored immediately without having to wait
for the disk repair timer to kick in.
The following output displays the action taken by the XDWK process to drop a disk:
To list the condition of Exadata disks, the following cellcli commands can be run on the Exadata Storage Server:
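For example (representative CellCLI commands; the attribute list can be tailored as needed):
CellCLI> LIST PHYSICALDISK WHERE status != normal DETAIL
CellCLI> LIST GRIDDISK ATTRIBUTES name, status, asmmodestatus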
In addition to the new ASM processes just
mentioned, there is one master diskmon process and one slave diskmon
process (dskm) for every Oracle database and ASM instance. The diskmon
is responsible for the following:
Handling of storage cell failures and I/O fencing
Monitoring of Exadata Server state on all storage cells in the cluster (heartbeat)
Broadcasting intra-database IORM (I/O Resource Manager) plans from databases to storage cells
Monitoring of the control messages from the database and ASM instances to storage cells
Communicating with other diskmons in the cluster
The following output shows the diskmon and the dskm processes from the database and ASM instances:
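(On a compute node, such a listing can be produced with a command like the following:)
ps -ef | egrep 'diskmon|dskm' | grep -v grep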
Exadata ASM Specific Attributes
The CONTENT.TYPE attribute identifies the
disk group type, which can be DATA, RECOVERY, or SYSTEM. The type value
determines the distance to the nearest partner disk/failgroup.
The default value is DATA, which specifies a distance of 1. The value of
RECOVERY specifies a distance of 3, and the value of SYSTEM specifies a
distance of 5. The primary objective of this attribute is to ensure that a failure at a given Exadata Storage Server does not take out all the disk groups configured on that rack. By having different partners based on the content type, a failure of a disk/cell does not affect the same set of disk partners in all the disk groups.
NOTE
For CONTENT.TYPE to be effective, one needs to have a full rack of the Database Machine.
The CONTENT.TYPE attribute can be specified
when you’re creating or altering a disk group. If this attribute is set
or changed using ALTER DISKGROUP, then the new configuration does not
take effect until a disk group rebalance is explicitly run.
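As a sketch, setting the attribute and triggering the required rebalance might look like this (the disk group name and power value are illustrative):
ALTER DISKGROUP reco SET ATTRIBUTE 'content.type' = 'recovery';
ALTER DISKGROUP reco REBALANCE POWER 4;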
The CONTENT.TYPE attribute is only valid for
disk groups that are set to NORMAL or HIGH redundancy. The
COMPATIBLE.ASM attribute must be set to 11.2.0.3 or higher to enable the
CONTENT.TYPE attribute for the disk group.
ODA Overview
The Oracle Database Appliance (ODA) is a
fully integrated system of software, servers, storage, and networking in
a single chassis. Like Exadata, Oracle Database Appliance is not just
preconfigured; it is prebuilt, preconfigured, pretested, and pretuned.
This configuration is an ideal choice for a pay-as-you-grow Private
Database Cloud. Customers can continually consolidate and enable CPU
capacity as needed.
An appliance like ODA reduces the time in
procuring individual components and completely minimizes the effort
required to set up an optimal configuration. The OS, network components,
SAN, storage redundancy, multipathing, and more, all become configured
as part of the ODA system enablement.
ODA Components
Oracle Database Appliance is made up of
building blocks similar to Exadata. These can be broken down into two
buckets: hardware and software. Each component is built and tested
together to provide maximum availability and performance.
Software Stack
ODA comes with preinstalled Oracle
Unbreakable Linux and Oracle Appliance Manager (OAK) software. The OAK
software provides one-button automation for the entire database stack,
which simplifies and automates the manual tasks typically associated
with installing, patching, managing, and supporting Oracle database
environments.
Hardware
The Oracle Database Appliance is a four-rack
unit (RU) server appliance that consists of two server nodes and
twenty-four 3.5-inch SAS/SSD disk slots.
Each Oracle Database Appliance system contains two redundant 2U form factor server nodes (system controllers SC0 and SC1).
Each server node plugs into the Oracle
Database Appliance chassis and operates independently of the other. A
failure on one server node does not impact the other node. The surviving
node uses cluster failover event management (via Oracle Clusterware) to
prevent complete service interruption. To support a redundant cluster,
each server node module contains a dual-port Ethernet controller
internally connected between the two server node modules through the
disk midplane. This internal connection eliminates the need for external
cables, thus making ODA a self-contained database appliance.
Each server contains two CPU sockets for the
Intel Xeon Processor X5675 CPUs, providing up to 12 enabled-on-demand
processor cores and 96GB of memory. On each ODA node are two dual-ported
LSI SAS controllers. They are each connected to an SAS expander that is
located on the system board. Each of these SAS expanders connects to 12
of the hard disks on the front of the ODA. Figure 12-2
shows the detailed layout of disk to expander relationship. The disks
are dual-ported SAS, so that each disk is connected to an expander on
each of the system controllers (SCs). The Oracle Database Appliance
contains twenty 600GB SAS hard disk drives that are shared between the
two nodes.
ASM and Storage
Notice that expander-to-disk connectivity is arranged such that Expander-0 from both nodes connects to disks in columns 1 and 2, whereas Expander-1 from both nodes connects to disks in columns 3 and 4.
ASM high redundancy is used to provide
triple-mirroring across these devices for highly available shared
storage. This appliance also contains four 73GB SAS solid state drives
for redo logs, triple mirrored to protect the Oracle database in case of
instance failure. Each disk is partitioned into two slices: p1 and p2.
The Oracle Linux Device Mapper utility is used to provide multipathing
for all disks in the appliance.
Brief Overview on Storage Layout (Slots/Columns)
The ASM storage layout includes the following:
ASM Diskgroup +DATA size 1.6TB (high redundancy)
ASM Diskgroup +RECO size 2.4TB (high redundancy) or size 0.8TB (high redundancy with external storage for backups)
ASM Diskgroup +REDO size 97.3GB (high redundancy)
The following output displays the disk naming
and mapping to the ASM disk group. Note that the output has been
shortened for brevity. ODA uses a specific naming convention to identify
the disks. For example, disk HDD_E0_S05_971436927p2 indicates that it
resides inside Expander 0 (E0) and slot 5 (S05).
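A query along these lines, run from the ASM instance, produces such a mapping (the output itself follows the naming convention just described):
SELECT g.name AS diskgroup, d.name AS disk, d.path
  FROM v$asm_disk d JOIN v$asm_diskgroup g ON d.group_number = g.group_number
 ORDER BY g.name, d.name;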
ASM Optimizations for ODA
The Oracle Appliance Manager (OAK), in
conjunction with ASM, automatically configures, manages, and monitors
disks for performance and availability. Additionally, OAK provides
alerts on performance and availability events as well as automatically
configures replacement drives in case of a hard disk failure.
The Storage Management Module feature has the following capabilities:
It takes corrective action on appropriate events.
It interacts with ASM for complete automation.
The Oracle Appliance Manager daemon (OAKd) monitors the physical state of disks.
It monitors disk status in ASM.
Based on events, it interacts with ASM for corrective actions.
ASM takes actions as directed by OAKd.
OAK tracks the configuration and layout of
all storage devices in the system. If a storage device fails, it detects
the failure and sends an alert by e-mail. When an alert is received,
users have the option to remove the failed disk and replace it. When OAK
detects a disk has been replaced, it verifies the disk size and other
characteristics, such as the firmware level. OAK then rebuilds the
partition table on the new disk to match the table on the failed disk.
Because ODA disks map the disk slot exactly to an expander group and
failgroup, the disk is added back to the appropriate ASM disk group
without intervention from the user. This mitigates incorrect ASM disk
adds.
The oakcli command can be used to display the
status of all disks in the engineered package. This is shown here using
the oakcli show disk command:
The oakcli command can also be used to display
the status of a specific disk. This is shown here using the oakcli show
disk <disk_name> command:
ODA and NFS
Optionally, customers can use NFS storage as
Tier 3 external storage, connected using one of the 10Gb Ethernet cards inside ODA. This NFS storage can be used to offload read-mostly or archived datasets. Oracle recommends using ZFS Storage Appliance (ZFSSA) storage or other NFS appliance hardware. If the NFS appliance contains read-only datasets, it is recommended to convert these data files to read-only (by marking their tablespaces read-only, as sketched following this paragraph) and then set the init.ora parameter READ_ONLY_OPEN_DELAYED=TRUE. This will improve database availability in
case the NFS appliance becomes unavailable. Note that having NFS-based
files presented to ASM as disks is neither recommended nor supported.
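A minimal sketch of these two settings (the tablespace name is illustrative):
ALTER TABLESPACE hist_2010 READ ONLY;
ALTER SYSTEM SET READ_ONLY_OPEN_DELAYED = TRUE SCOPE=SPFILE;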
Summary
Oracle Exadata Database Machine, Oracle
Database Appliance, and Oracle SPARC SuperCluster are excellent Private
Cloud Database consolidation platforms. These solutions provide
preintegrated configurations of hardware and software components
engineered to work together and optimized for different types of
database workloads. They also eliminate the complexity of deploying a
high-performance database system. Engineered systems such as ODA and
Exadata are tested in the factory and delivered ready to run. Because
all database machines are the same, their characteristics and operations
are well known and understood by Oracle field engineers and support.
Each customer will not need to diagnose and resolve unique issues that
only occur on their configuration. Performance tuning and stress testing
performed at Oracle are done on the exact same configuration that the
customer has, thus ensuring better performance and higher quality.
Further, the fact that the hardware and software configuration and
deployment, maintenance, troubleshooting, and diagnostics processes are
prevalidated greatly simplifies the operation of the system,
significantly reduces the risks, and lowers the overall cost while
producing a highly reliable, predictable database environment. The ease
and completeness of the predictive, proactive, and reactive management
processes possible on Oracle Database Appliance are simply nonexistent
in other environments.
Applications do not need to be certified
against engineered systems. Applications that are certified with Oracle
Database 11.2 RAC will run against the engineered systems. Choosing the
best platform for your organization will be one of the key milestones in
your roadmap to realizing the benefits of cloud computing for your
private databases.
ASM Tools and Utilities
Many tools can be used to manage ASM, such as SQL*Plus, ASMCMD, and Enterprise Manager (EM). In Oracle Clusterware 11gR2,
these tools were enhanced and several new tools were introduced to
manage ASM and its storage. These tools and utilities can be broken down
into two categories: fully functional ASM management and standalone
utilities.
Fully functional ASM management:
ASMCA
ASMCMD
Enterprise Manager
SQL*Plus
Standalone utilities:
Renamedg
ASRU
KFOD
AMDU
This chapter focuses on ASMCA, ASMCMD, and the standalone utilities.
ASMCA
ASMCA is a multipurpose utility and
configuration assistant like DBCA or NETCA. ASMCA is integrated and
invoked within the Oracle Universal Installer (OUI). It can be used as a
tool to upgrade ASM or run as a standalone configuration tool to manage
ASM instances, ASM disk groups, and ACFS.
ASMCA is invoked by running $GI_HOME/asmca. Prior to running asmca, ensure that the ORACLE_SID for ASM is set appropriately.
The following example illustrates ASMCA usage:
This illustration shows the ASM Instances tab,
where ASM instances can be started, stopped, and upgraded. ASMCA uses a
combination of SQL*Plus and Clusterware commands to configure the ASM
instance. By default, the ASM server parameter file (SPFILE) is always
stored in the disk group, allowing the instance to bootstrap on its own.
The Disk Groups tab of ASMCA, as shown in Figure 13-1,
allows the user to configure ASM disk groups based on the availability
requirements of the deployment. The Disk Groups tab allows the user to
modify disk group attributes.
If the user wants to create a new disk group, ASMCA allows disk group creation as shown in Figure 13-2. The default discovery string used to populate the candidate disks is either the one already in use by the ASM instance or the OS-specific default. As part of the disk group creation, the user gets to choose the redundancy of the disk group.
Finally, Figure 13-3 shows the list of disk groups that were created and are available for users to either create databases or ACFS file systems on.
To leverage additional configuration menu
options, users can right-click any of the listed disk groups. The menu
options enable you to perform the following actions:
Add disks to the disk group
Edit the disk group attributes
Manage templates for the disk group
Create an ACFS-based database home on the selected disk group
Dismount and mount the disk group, either locally or globally on all nodes of the cluster
Drop the disk group
Using ASMCA to manage and configure ACFS and ADVM is covered in Chapter 11.
Although we showed ASMCA usage via GUI access, ASMCA can also be used in command-line mode. The command line provides opportunities to
script ASM configuration. The following example illustrates ASMCA
command-line usage by creating an ASM disk group:
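A representative silent-mode invocation might look like the following (disk paths and attribute values are illustrative, and the exact flags can vary by version):
asmca -silent -createDiskGroup -diskGroupName DATA \
      -disk '/dev/oracleasm/disks/DISK1' -disk '/dev/oracleasm/disks/DISK2' \
      -redundancy EXTERNAL -au_size 4 \
      -compatible.asm '11.2.0.0.0' -compatible.rdbms '11.2.0.0.0'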
ASMCMD
ASMCMD was first introduced in 10gR2 with a basic command set for managing ASM files; however, in 11gR2
the asmcmd command has been expanded to fully manage ASM. In previous
versions, fully managing ASM was only possible via EM and SQL*Plus. This
section only focuses on the key new commands introduced in 11gR2. For the complete reference of the ASMCMD command set, refer to the Oracle Storage Administrator’s Guide.
ASMCMD, like ASMCA, needs to have the ORACLE_SID set to the ASM instance SID. In Oracle Clusterware 11gR2,
the “asmcmd -a” flag has been deprecated; in its place, the ASM
privilege must be set. ASMCMD can be used in interactive mode or
noninteractive (batch) mode.
The new capabilities found in 11.2 ASMCMD
include ASM instance management, disk group and disk management, ASM
file management, and ACFS/ADVM management. ASMCMD is profiled throughout this book; please refer to the specific chapter for appropriate ASMCMD
command usage.
ASM instance management includes the following capabilities:
Starting and stopping ASM instances
Modifying and listing ASM disk strings
Creating/modifying/removing ASM users and groups
Backing up and restoring the ASM SPFILE
Adding/removing/modifying/listing users from the password file
Backing up and restoring the metadata of disk groups
Displaying connected clients
Disk group and disk management includes the following capabilities:
Showing directory space utilization using the Linux/Unix-like command du
Mounting/dismounting/creating/altering/dropping disk groups
Rebalancing the disk groups
Displaying disk group attributes
Onlining or offlining the disks/failure groups
Repairing physical blocks
Displaying disk I/O statistics
ASM file management includes the following capabilities:
Managing ASM directories, templates, and aliases
Copying the files between the disk groups and OS
Adding/removing/modifying/listing the templates
Managing and modifying the ASM file ACLs
Finally, ACFS/ADVM management includes the capability to create, delete, enable, and list the ADVM volumes.
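For example, a short interactive session might look like this (the disk group name is illustrative):
$ asmcmd
ASMCMD> lsdg
ASMCMD> lsdsk -G DATA
ASMCMD> lsof
ASMCMD> spget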
Renamedg
In many cases, users may need to rename disk
groups. This could include cloned disk groups that will be used on
different hosts for test environments or snapshots of disk groups that
are to be mounted on the same host. Oracle Clusterware 11gR2
introduces a command that provides this capability: the renamedg
command. The renamedg command can be executed using a single phase or
two phases. Phase one generates a configuration file to be used in phase
two, and phase two uses the configuration file to perform the renaming
of the disk group.
The following example shows the steps in renaming a disk group:
1. Before the disk group can be
renamed, it must first be dismounted. If databases or ACFS file systems
are using this disk group, they must be unmounted before the renamedg is
executed. Note that this must be done on all nodes on a RAC
configuration. For RAC configurations, it’s best to use ASMCA to
globally dismount the disk group. Additionally, if the ASM spfile exists
in this disk group, it must be moved to another location or disk group.
The asmcmd lsof command can be used to determine whether any open files
exist in the disk group before a dismount is attempted.
For example: Use asmcmd to dismount the diskgroup:
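(A representative command:)
asmcmd umount DATA_NISHA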
2. Verify that the desired disk group was dismounted:
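(For example:)
asmcmd lsdg
The renamed disk group should no longer appear in the list of mounted disk groups.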
3. Rename the disk group:
The renamedg
command can also be run in dry-run mode. This may be useful to verify
the execution. This check mode verifies all the disks can be discovered
and that the disk group can be successfully renamed. The following
example shows the check option of the renamedg command:
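(A representative check run; the ASM disk string is illustrative:)
renamedg phase=both dgname=DATA_NISHA newdgname=DATA_ISHAN01 asm_diskstring='/dev/xvd*' check=true verbose=true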
Once the check has verified appropriate behavior, we can run the actual renamedg command:
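(The same command without the check option:)
renamedg phase=both dgname=DATA_NISHA newdgname=DATA_ISHAN01 asm_diskstring='/dev/xvd*' verbose=true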
4. Once the renamedg command
completes, we can remount the disk group. Mounting the disk group
inherently validates the disk group header:
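(For example:)
asmcmd mount DATA_ISHAN01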
This disk group name change is also reflected in the CRS resource status automatically:
Although the renamedg command updates the CRS
resource with the new disk group name, the old CRS disk group resource
still exists within CRS. In our example the DATA_NISHA disk group is
still listed as a CRS resource, although it is in the offline state:
To remove this defunct CRS resource, users
should run the srvctl remove diskgroup –g <diskgroup> command (note that
users should not use the crsctl delete resource ora.<diskgroup>.dg
command because it is unsupported):
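(For example:)
srvctl remove diskgroup -g DATA_NISHA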
The renamedg command currently does not rename
the disks that belong to the disk group. In our example, the original
disk group was named DATA_NISHA, so all underlying member disk names
started with DATA_NISHA by default. After the renamedg command is run,
the disk group is renamed to DATA_ISHAN01; however, the disks are still
named DATA_NISHA. This should not have any operational impact, and the disk names can be confirmed with a query such as the one below:
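(A representative query:)
SELECT d.name, d.path
  FROM v$asm_disk d, v$asm_diskgroup g
 WHERE d.group_number = g.group_number
   AND g.name = 'DATA_ISHAN01';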
The renamedg command does not rename the
pathname references for the data files that exist within that disk
group. To rename the data files appropriately and have this reflected in
the database control file, we can use the following database SQL to
generate a data file rename script:
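A sketch of such a script generator, using the old and new disk group names from this example:
SET HEADING OFF PAGESIZE 0 LINESIZE 200
SELECT 'ALTER DATABASE RENAME FILE ''' || name || ''' TO ''' ||
       REPLACE(name, '+DATA_NISHA', '+DATA_ISHAN01') || ''';'
  FROM v$datafile;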
Renaming disk groups that contain the
Clusterware files requires more careful planning. The following steps
illustrate this procedure. (Note that the Clusterware files must be
relocated to another disk group.)
1. Back up the OCR manually using either of the following commands:
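(For example, as root:)
ocrconfig -manualbackup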
or
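(Also as root, exporting to a file whose path is illustrative:)
ocrconfig -export /tmp/ocr_backup.dmp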
2. Create a temporary disk group, like so:
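(A minimal sketch; the disk path is illustrative:)
CREATE DISKGROUP TEMPDG EXTERNAL REDUNDANCY DISK '/dev/xvdg1';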
We can use a shared file system instead of a temporary disk group.
3. Confirm that Grid Infrastructure on all nodes is active:
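(For example, as root or as the Grid Infrastructure owner:)
crsctl check cluster -all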
4. Move the OCR, the voting files, and the ASM spfile to a temporary disk group:
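A sketch of the relocation, using placeholder disk group names (the current spfile path should first be confirmed and will differ in every installation):
# as root: add the new OCR location and drop the old one
ocrconfig -add +TEMPDG
ocrconfig -delete +OLD_DG
# as root: relocate the voting files
crsctl replace votedisk +TEMPDG
# as the Grid Infrastructure owner: relocate the ASM spfile
asmcmd spget
asmcmd spmove '<current_spfile_path>' '+TEMPDG/spfileASM.ora'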
We can use a shared file system instead
of a temporary disk group as the destination for clusterware files as
well as ASM spfile.
5. Restart the Grid Infrastructure on all nodes and confirm that it is active on all nodes:
6. Rename the disk group, like so:
7. Mount the disk group:
The diskgroup (.dg) resource is registered automatically.
8. Remove the old disk group (.dg) resource:
9. Move the OCR, the voting files, and the ASM spfile to <new_name_dg>:
10. Restart the Grid Infrastructure on all nodes and confirm that it is active on all nodes:
Although the renamedg command was introduced as part of Oracle Clusterware 11gR2, in many cases a disk group in a 10g or 11.1 environment needs to be renamed. You can use this tool to rename your 10g or 11gR1 ASM disk group.
In these cases, the 11gR2 stack, or
more specifically Oracle Restart (Oracle Grid Infrastructure for Single
Instance), needs to be installed on the server(s) where the disk group
operation will be performed. It is not necessary to install 11gR2
RAC for this. Also, it is not necessary to start the Oracle Restart
stack (you simply need the dormant software installation for the
renamedg command). Here are the steps to follow:
1. Install 11.2.0.x Oracle Restart (Oracle Grid Infrastructure for Single Instance).
2. Unmount the disk group that will be renamed.
3. Run renamedg from the Oracle Restart home.
4. Use the renamedg tool to rename the 10g or 11gR1 disk group.
5. Optionally, uninstall the Oracle
Restart software stack. If frequent disk group renames will be needed,
this step is not recommended.
6. Mount the disk group.
NOTE
The disk group cannot be renamed when it contains offline disks.
ASM Storage Reclamation Utility (ASRU)
Storage costs—both in administrative
overhead and capital expenses—are growing concerns for most enterprises.
Storage vendors have introduced many features to reduce the acquisition
cost side of storage. One such feature is thin provisioning,
which is a feature common to many storage arrays. Thin provisioning
enables on-demand allocation rather than up-front allocation of physical
storage. This storage feature reduces unused space and improves storage
utilization. Deploying Oracle databases with cost-effective thin
provisioned storage is an ideal way to achieve high storage efficiency
and dramatic storage capacity savings. By boosting storage utilization,
thin provisioning drives savings in purchased capacity, associated
power, and cooling costs. Although ASRU can be used with any storage vendor that provides thin provisioning, this section illustrates ASRU usage against 3Par’s storage to provide context.
The Oracle ASRU feature offers the ability to improve storage efficiency for Oracle Database 10g and 11g environments by reclaiming unused (but allocated) ASM disk space in thin provisioned environments.
Overview of ASRU Operation
Two key features allow thin provision storage reclamation:
Oracle
ASRU compacts the ASM disks, writes zeros to the free space, and
resizes the ASM disks to the original size with a single command, online
and without disruption.
3Par
Thin Persistence software detects zero writes and eliminates the
capacity associated with free space in thin provisioned volumes—simply,
quickly, and without disruption. 3Par Thin Persistence leverages the
unique, built-in, zero-detection capabilities.
Oracle ASM Storage Reclamation Utility (ASRU)
is a standalone utility used to reclaim storage in an ASM disk group
that was previously allocated but is no longer in use. ASRU accepts the
name of the disk group for which space should be reclaimed. When
executed, it writes blocks of zeros to regions on ASM disks where space
is currently unallocated. The storage array, using the zero-detect
capability of the array, will detect these zero blocks and reclaim any
corresponding physical storage.
The ASM administrator invokes the ASRU utility, which operates in three phases:
Compaction phase In
this phase, ASRU logically resizes the disks downward such that the
amount of space in the disk group is at the allocated amount of file
space in the disk group, plus a reserve capacity. The default value for
the reserve amount is 25 percent; however, the reserve value is a
tunable option in the utility. The resize operation of the disks is
logical to ASM and has no effect on the physical disks. The effect of
the resize operation is that file data in the ASM disk group is
compressed near the beginning of the disks, which is accomplished by an
ASM rebalance of the disk group. The utility uses the appropriate ASM V$
views to determine the current allocated size of the disk group. The
next phase does not begin until the ASM rebalance for the disk group has
completed and has been verified as complete.
Deallocation phase During
this phase, ASRU writes zeros above the region where the ASM disks have
been resized. The ASRU utility invokes another script called zerofill
that does the writing of zeros. It is during this deallocation phase
that the zero-detect algorithm within the 3Par Thin Engine will return
the freed storage blocks to the free storage pool.
Expansion phase In
the final phase, all the ASM disks will be resized to their original
size as determined when ASRU was started. This resize operation is a
logical resize of the disks with respect to ASM and does not result in a
reorganization of file data in the disk group.
When to Use ASRU to Reclaim Storage
Storage reclamation should be considered for the following database storage events:
Dropping one or more databases in an ASM disk group.
Dropping one or more tablespaces.
Adding
new LUNs to an ASM disk group to replace old LUNs. This triggers an ASM
rebalance to move a subset of the data from the old LUNs to the new
LUNs. The storage released from the old volumes is a candidate for
reclamation.
To determine whether storage reclamation will
be beneficial after one of these operations, it is important to
consider the effect of the reserve maintained by ASRU when the utility
reduces the size of the disk group during the compaction phase. The
temporarily reduced size is equal to the allocated space plus a reserve,
which allows active databases to grow during the reclamation process;
the default reserve is 25 percent of the allocated storage. Storage
reclamation is likely to be beneficial if the amount of allocated
physical storage significantly exceeds the amount of storage allocated
within ASM plus the reserve.
The amount of physical storage allocated on a
3Par InServ array can be determined using the 3Par InForm operating
system’s showvv command, available from the InForm command-line
interface (CLI), to show information about the virtual volumes (VVs)
used by ASM. Here is the standard way of using this command to obtain
information related to the effectiveness of thin provisioning for a
group of volumes matching oel5.*:
The –s option produces voluminous output.
Therefore, to make the output easier to understand, we will use more
complex options that show just the data columns that are directly
relevant to thin provisioning:
The Usr_Used_MB column indicates how many
megabytes are actually allocated to user data. In this example,
825,770MB of storage within ASM’s volumes has been written.
ASM’s view of how much storage is in use can be determined with a SQL query:
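(A representative query:)
SELECT name, total_mb, free_mb FROM v$asm_diskgroup;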
This example shows 197,986MB of free storage
out of 1,023,984MB available, or about 19.3 percent. The difference
between these quantities—825,998MB—is how much storage within ASM is in
use (that is, has actual written data).
Using ASRU to Reclaim Storage on 3Par: Use Cases
To illustrate storage reclamation using
Oracle ASRU and the 3Par InServ Storage Server, a 1TB ASM disk group was
created using four 250GB thin provisioned virtual volumes (TPVVs) on
the 3Par InServ array. Zero detection is enabled for the volumes from
the InForm CLI, as detailed in the following use cases.
Use Case #1
This use case involves reclaiming storage
space after dropping a database/data file. The steps involve creating a
DB named “yoda,” creating a 15GB data file tablespace and filling it
with data, and then dropping the data file and reclaiming the space via
ASRU.
1. Create an ASM disk group and create a database using DBCA. Now use the showvv command to view space consumed:
2. Create a 15GB tablespace (data
file) called par3_test. Create table BH in the par3_test tablespace and
fill it with test data:
3. Drop tablespace par3_test and then reclaim the space using ASRU:
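A sketch of the drop and reclamation (the disk group name is illustrative, and the ASRU utility is assumed to be in the current directory):
SQL> DROP TABLESPACE par3_test INCLUDING CONTENTS AND DATAFILES;
$ ./ASRU DATA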
4. Recheck space usage in 3Par:
Use Case #2
In this use case, a new database is created
along with a (single) new tablespace (15GB). Data is then seeded in it.
Then the table is truncated and a check is done to determine whether ASRU can reclaim space after this operation.
1. Create the table:
2. Load the data:
3. Truncate the table:
4. Recheck space usage in 3Par:
5. No space was reclaimed, as expected.
NOTE
Typically DBAs will drop database segments
(tables, indexes, and so on) or truncate tables in order to reclaim
space. Although this operation does reclaim space back to the tablespace
(and to the overall database), it does not release physical space back
to the storage array. In order to reclaim physical space back to the
storage array, a physical data file (or files) must be dropped or shrunk.
6. Now shrink the data file (since space HWM was already done via truncate):
7. Recheck space usage in 3Par:
KFOD
This section describes the purpose and
function of the KFOD utility. KFOD is used to probe the system for disks
that can be used for ASM. Note that KFOD does not perform any unique
discovery; in other words, it uses the same discovery mechanism invoked
by the ASM instance. KFOD is also invoked within the Oracle Universal
Installer (OUI) for Grid Infrastructure stack installation. This section
is not intended to be an exhaustive command reference; you can find
that information in the Oracle Storage Administrator’s Guide. Instead, we will cover the most important and relevant features and commands.
Recall from Chapter 4
that ASM stamps every disk that it adds to a disk group with a disk
header. Therefore, when KFOD lists candidate disks, these are all disks
that do not include the disk header. KFOD can list “true” candidate
disks if the keyword status=TRUE is specified. True candidate disks are
ones identified with the header status of CANDIDATE, FOREIGN, or FORMER.
Note that disks with a header status of CANDIDATE include true
candidate disks as well as FORMER disks (that is, disks that were
previously part of an ASM diskgroup). KFOD does not distinguish between
the two.
KFOD can display MEMBER disks if disks=all or
disks=asm is specified. KFOD will include the disk group name if
dscvgroup=TRUE is specified. KFOD will also discover and list Exadata
grid disks.
You can specify the name of the disk group to
be discovered by KFOD. KFOD lists all disks that are part of this disk
group. At most one disk group name can be specified at a time. This
command is valid only if an ASM instance is active.
For disk discovery, KFOD can use the default
parameters, command-line options, or read options from a pfile.
Specifying a pfile allows users to determine what disks would be
available to an ASM instance using a configured pfile. If parameters and
a pfile are specified, the pfile is not read; otherwise, if a pfile is
specified, the parameters are taken from it.
As of 11g, KFOD supports a clustered
environment. This means that KFOD is aware of all ASM instances
currently running in the cluster and is able to get information about
disk groups and ASM clients pertaining to all instances in the cluster.
The “hostlist” parameter can be used to filter the output for the
specified node in the cluster. 11gR2 ASM also introduces a
display format, which is invoked using the cluster=true keyword. The
default is to run in noncluster mode for backward compatibility. The
next set of examples shows KFOD usage.
Here’s how to list all active ASM instances in the ASM cluster:
Here’s how to display the client databases accessing the local ASM instance:
KFOD can be used to display the total
megabytes of metadata required for a disk group, which it calculates
using numbers of disks, clients, nodes, and so on, provided on the
command line or by default. The metadata parameters can be overridden to
perform what-if scenarios.
Here’s how to display the total megabytes of metadata required for a disk group with specified parameters:
AMDU
The ASM Metadata Dump Utility (AMDU) is part
of the Oracle Grid Infrastructure distribution. AMDU is used to extract
the available metadata from one or more ASM disks and generate
formatted output of individual blocks.
AMDU also has the ability to extract one or
more files from an unmounted disk group and write them to the OS file
system. This dump output can be shipped to Oracle Support for analysis.
Oracle Support can use the dump output to generate formatted block
printouts. AMDU does not require the disk group to be mounted or the ASM
instance to be active.
AMDU performs three basic functions. A given execution of AMDU may perform one, two, or all three of these functions:
Dump metadata from ASM disks to the OS file system for later analysis.
Extract the contents of an ASM file and write it to an OS file system even if the disk group is not mounted.
Print metadata blocks.
The AMDU input data may be the contents of the ASM disks or ingested from a directory created by a previous run of AMDU.
AMDU produces four types of output files:
Extracted files One extracted file is created for every file listed under the -extract option on the command line.
Image files Image files contain block images from the ASM disks. This is the raw data that is copied from the disks.
Map files Map files are ASCII files that describe the data in the image files for a particular disk group.
Report file One report file is generated for every run of the utility without the -directory option (except if -noreport is specified).
In this first example, we will use AMDU to
extract a database control file. The disk group is still mounted and
we’ll extract one of the control files for a database named ISHAN.
1. Determine the ASM disk string:
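(For example:)
asmcmd dsget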
2. Determine the location of all the control files:
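(For example, from the ISHAN database:)
SELECT name FROM v$controlfile;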
In this example, we have a single copy of the control file in the disk group DATA.
3. Determine the disks for the DATA disk group:
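(For example:)
asmcmd lsdsk -G DATA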
4. Extract the control file out of the disk group DATA onto the file system. Here are the options used:
-diskstring This is either the full path to disk devices or the value of the ASM_DISKSTRING parameter.
-extract The disk group name, followed by a period, followed by the ASM file number.
-output The output file name (in the current directory).
-noreport Indicates not to generate the AMDU run report.
-nodir Indicates not to create the dump directory.
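Putting these options together, a representative invocation might look like the following (the disk string and the control file's ASM file number are illustrative):
amdu -diskstring '/dev/xvd*' -extract DATA.261 -output ishan_cf.ctl -noreport -nodir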
In this second example, we extract a data
file when the disk group is not mounted using AMDU. The objective is to
extract a single data file, named something like USERS, from the disk
group DATA, which is dismounted. This will require us to dump all
metadata for the disk group DATA.
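A representative dump command (the disk string is illustrative):
amdu -diskstring '/dev/xvd*' -dump DATA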
Report.txt contains information about the
server, the amdu command, the options used, a list of disks that are
members of the disk group DATA, and information about the allocation
units (AUs) on those disks. Let’s review the contents of the report
file:
The file DATA.map contains the data map. The following shows a sampling of DATA.map that was generated:
Of immediate interest are fields starting with
A and F. The field A0000421, for example, indicates that this line is
for allocation unit (AU) 421, and the field F00000259 indicates that
this line is about ASM file 259.
ASM metadata file 6 is the alias directory, so
that is the first place to look. From DATA.map, we can work out AUs for
ASM file 6:
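(For example:)
grep F00000006 DATA.map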
This single line in the map indicates that all
the file aliases fit in a single AU; in other words, there are not many
files in this disk group. If the output listed multiple lines from the
grep command, this would reflect that many ASM files exist in this disk
group.
From the preceding grep output, the alias
directory seems to be in allocation unit 10 (A00000010) on disk 2
(D0002). From report.txt, we know that disk 2 is /dev/xvde1 and that the
AU size is 1MB. Let’s have a look at the alias directory. You can use
kfed for this:
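(A representative command, using the disk and allocation unit identified above:)
kfed read /dev/xvde1 aun=10 blkn=0 | more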
KFBTYP_ALIASDIR indicates that this is the alias directory. Now look for a data file named USERS:
This gives us the following output:
The USERS tablespace is ASM file 259. Now extract the file:
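(A representative command; the disk string and output file name are illustrative:)
amdu -diskstring '/dev/xvd*' -extract DATA.259 -output users_extracted.dbf -noreport -nodir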
These steps can be repeated for the System and Sysaux data files as well as the control files. These can then be used to open the database, or the extracted files can be plugged into another database and recovered.
It is important to note that although the amdu
command will extract the file, the file itself may be corrupt or
damaged in some way. After all, there is a reason for the disk group not
mounting—chances are the ASM metadata is corrupt or missing, but that
can be the case with the data file as well. The point is that there’s no
substitute for a backup, so keep that in mind.
Summary
Various tools can be used to manage ASM.
These tools and utilities handle a range of tasks—from managing daily
ASM activities, to assisting in the recovery of ASM files and renaming
ASM disk groups. As a best practice, the ASMCMD utility or Enterprise
Manager should be used to manage ASM.
Oracle 12c ASM: A New Frontier
When Automatic Storage Management (ASM) was introduced in 10gR1,
it was simply marketed as the volume manager for the Oracle database.
ASM was designed as a purpose-built host-based volume management and
file system that is integrated with the Oracle database. These simple
ideas delivered a powerful solution that eliminates many headaches DBAs
and storage administrators once had with managing storage in an Oracle
environment.
However, ASM has now become an integral part
of the enterprise stack. ASM is not only a significant part of the
Oracle Clusterware stack, but is also a core component of Engineered
Systems such as Exadata and ODA. Oracle 12c was announced in 2013, and along with this 12c
release came significant changes for ASM. This chapter covers some of
the key management and high availability features introduced in 12c
ASM. You will get a glimpse of these advancements, the history behind
the new features, and why these features are a necessary part of the
future of ASM.
The main theme of 12c ASM is extreme
scalability and management of real-world data types. In addition, it
removes many of the limitations of previous ASM generations. This
chapter previews some of the key features of ASM and cloud storage in
Oracle 12c. Note that this chapter does not provide an exhaustive
overview of the new features, just the key features and optimizations.
This chapter was written using the Beta2 version, so the examples and
command syntax may be different from the production 12c version.
Password Files in ASM
In releases prior to Oracle 12c, most
of the Oracle database and ASM-related files could be stored in ASM
disk groups. The key exception was the oracle password file—neither the
ASM and database password files could be stored in a disk group. These
password files, created by orapwd utility, resided in the
$ORACLE_HOME/dbs directory by default and therefore were local to the
node and instance. This required manual synchronization of the password
file. If the password file became out of sync between instances, it
could cause inconsistent login behavior. Although Oracle 11gR2
provided the capability for cross-instance calls (CIC) to synchronize
the password file, if an instance or node was inactive, synchronization
was not possible, thus still leaving the password file inconsistent.
An inconsistent password file is even more problematic for ASM instances because ASM does not have a data dictionary to fall back on when the file system-based password file is inconsistent.
In Oracle 12c (for new installations), the
default location of the password file is in an ASM disk group. The
location of the password file becomes a CRS resource attribute of the
ASM and database instance. The ASM instance and the disk group that is
storing the password file need to be available before password file authentication is possible. The SYSASM or SYSDBA privilege can be used with the password file in ASM.
For the ASM instance, operating system
authentication is performed to bootstrap the startup of the ASM
instance. This is transparently handled as part of the Grid
Infrastructure startup sequence. As in previous releases, the SYSASM
privilege is required to create the ASM password file.
Note that the compatible.asm disk group
attribute must be set to 12.1 or later to enable storage of shared
password files in an ASM disk group.
The following outlines how to set up a password file in ASM; a sketch of the corresponding commands follows the steps.
Database password file:
1. Create a password file.
2. Move the existing password file into ASM.
ASM password file:
Create an ASM password file. Note the asm=y option, which distinguishes this creation from regular password file creation.
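The following is a minimal sketch of these steps; the disk group name (+DATA), the database unique name (orcl), and the file paths are illustrative.
Database password file, created directly in ASM or moved from $ORACLE_HOME/dbs:
$ orapwd file='+DATA/ORCL/orapworcl' dbuniquename='orcl'
ASMCMD> pwmove --dbuniquename 'orcl' /u01/app/oracle/product/12.1.0/dbhome_1/dbs/orapworcl +DATA/ORCL/orapworcl
ASM password file (note the asm=y option):
$ orapwd file='+DATA/orapwasm' asm=y
ASMCMD> pwcopy --asm /u01/app/12.1.0/grid/dbs/orapw+ASM +DATA/orapwasm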
Disk Management and Rebalance New Features
In 12c there are several new features that improve disk management functions, specifically improved availability in the face of transient disk or failure group failures. This section
covers these key disk management features.
Fast Disk Resync and Checkpoints
The 11g disk online feature provides
the capability to online and resync disks that have incurred transient
failures. Note that this feature is applicable only to ASM disk groups
that use ASM redundancy.
The resync operation updates the ASM extents
that were modified while the disk or disks were offline. However, prior
to Oracle 12c, this feature was single threaded; that is, a single online process thread was used to bring the disk(s) completely online. For disks that had been offline for a prolonged period of time and had accumulated a large number of extent changes, the resync operation could take a very long time. In Oracle 12c, the online and resync operation becomes a multithreaded operation, very similar to the ASM rebalance operation.
Thus the disk online can leverage a power
level from 1 to 1024, with 1 being the default. This power level
controls how many outstanding I/Os will be issued to the I/O subsystem, and thus has a direct impact on the performance of the system. Keep in mind that you are still bound by the server's I/O subsystem, so setting a very large power level does not necessarily improve resync time; the server where the resync operation runs can only process a certain number of I/Os concurrently. A power level between 8 and 16
has proven beneficial for resyncing a single disk, whereas a power level
of 8–32 has proven useful for bringing a failure group (with multiple
disks) online.
In versions prior to Oracle 12c, the
resync operation set and cleared flags (in the Staleness Registry) at the beginning and end of the operation; an interrupted resync operation had to be restarted from the beginning because the stale extent bit flags were cleared only at the end of the operation. In 12c
ASM, resync operations now support checkpoints. These checkpoints are
now set after a batch of extents are updated and their stale extent
flags cleared, thus making auto-restart begin at the last checkpoint. If
the resync operation fails or gets interrupted, it is automatically
restarted from the last resync phase and uses internally generated
resync checkpoints.
The following illustrates the command usage for performing an ASM fast disk resync:
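A minimal sketch, assuming an ASM-redundancy disk group named DATA, an offlined disk named DATA_0001, and a failure group named FG1:
SQL> ALTER DISKGROUP data ONLINE DISK data_0001 POWER 16;
SQL> ALTER DISKGROUP data ONLINE DISKS IN FAILGROUP fg1 POWER 32;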
Fast Disk Replacement
In versions 11gR2 and prior, a failed disk was taken offline or dropped, a new disk was put in its place (generally in the same tray slot), and then this disk was added back into the ASM disk group. This procedure required a complete disk group
rebalance. In Oracle 12c, the Fast Disk Replacement feature
allows a failed disk (or disks) to be replaced without requiring a
complete disk group rebalance operation. With the Fast Disk Replacement
feature, the disk is replaced in the disk tray slot and then added back
into the ASM disk group as a replacement disk. Initially this
disk is in an offline state and resynced (populated) with copies of ASM
extents from mirror extents from its partners. Note that because this is
a replacement disk, it inherits the same disk name and is automatically
placed back into the same failure group. The key benefit of the Fast
Disk Replacement feature is that it allows ASM administrators to replace
a disk using a fast, efficient, atomic operation with minimal system
impact because no disk group reorganization is necessary.
The main difference between Fast Disk Resync
and Fast Disk Replacement is that the disk has failed and is implicitly
dropped in Fast Disk Replacement, whereas in Fast Disk Resync the disk
is temporarily offline due to a transient path or component failure. If
the disk repair timer expires before the replacement disk can be put in
place, then users would have to use the regular disk add command to add
the replacement disk to the disk group.
The following illustrates the command for performing ASM Fast Disk Replacement:
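A minimal sketch, assuming the failed disk DATA_0001 is replaced by a new device presented at /dev/xvdf1:
SQL> ALTER DISKGROUP data REPLACE DISK data_0001 WITH '/dev/xvdf1' POWER 4;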
Failure Group Repair Timer
When an individual disk fails, the failure
is often terminal and the disk must be replaced. When all the disks in a
failure group fail simultaneously, it is unlikely that all the disks
individually failed at the same time. Rather, it is more likely that
some transient issue caused the failure. For example, a failure group
could fail because of a storage network outage. Because failure group
outages are more likely to be transient in nature, and because replacing
all the disks in a failure group is a far more expensive operation than
replacing a single disk, it makes sense for failure groups to have a
larger repair time to ensure that all the disks don’t get dropped
automatically in the event of a failure group outage. Administrators can
now specify a failure group repair time similar to the 11g disk repair timer. This includes a new disk group attribute called failgroup_repair_time. The default setting is 24 hours.
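A sketch of adjusting the attribute for a disk group named DATA (the 48-hour value is illustrative):
SQL> ALTER DISKGROUP data SET ATTRIBUTE 'failgroup_repair_time' = '48h';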
Rebalance Time Estimations
In Oracle 12c, the different phases
of the ASM rebalance operation are itemized with time estimations. In
versions prior to Oracle Database 12c, the rebalance work estimates were
highly variable.
With Oracle Database 12c, a more
detailed and accurate work plan is created at the beginning of each
rebalance operation. Additionally, administrators can produce a work
plan estimate before actually performing a rebalance operation, allowing
administrators to better plan storage changes and predict impact.
In Oracle Database 12c, administrators
can now use the new ESTIMATE WORK command to generate the work plan.
This work estimate populates the V$ASM_ESTIMATE view, and the EST_WORK
column can be used to estimate the number of ASM extents that will
be moved by the operation.
It is important to note that the unit in the
V$ASM_ESTIMATE view is ASM extents, and this does not provide an
explicit time estimate, such as the one provided in V$ASM_OPERATION.
The time estimate in V$ASM_OPERATION is based
on the current work rate observed during execution of the operation.
Because the current work rate can vary considerably, due to variations
in the overall system workload, administrators should use knowledge of
their environment and workload patterns to convert the data in
V$ASM_ESTIMATE into a time estimate if required.
The first step is generating a work estimate for the disk group rebalance operation:
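The command name may differ between the Beta2 build used for this chapter and the production release, where the equivalent SQL is EXPLAIN WORK; a sketch for a pending disk drop on a disk group named DATA:
SQL> EXPLAIN WORK FOR ALTER DISKGROUP data DROP DISK data_0001;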
Now, the work plan estimate that’s generated can be viewed:
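A sketch of the query (the group number is illustrative):
SQL> SELECT est_work FROM V$ASM_ESTIMATE WHERE group_number = 1;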
File Priority Rebalance
When a disk fails and no replacement is
available, the rebalance operation redistributes the data across the
remaining available disks in order to quickly restore redundancy.
With Oracle Database 12c, ASM
implements file priority ordered rebalance, which provides
priority-based restoration of the redundancy of critical files, such as
control files and online redo log files, to ensure that they are
protected if a secondary failure occurs soon afterward.
Flex ASM
In releases prior to 12c, an ASM
instance ran on every node in a cluster, and the databases communicated
via this local ASM instance for storage access. Furthermore, the ASM
instances communicated with each other and presented shared disk groups
to the database clients running in that cluster. This collection of ASM instances forms what is known as an ASM cluster domain.
Although this ASM architecture has been the standard since the inception of ASM, it does have some drawbacks:
Database
instances are dependent on a node-specific ASM instance. Thus, if an
ASM instance fails, all the database instances on that server fail as
well. Additionally, as the ASM cluster size grows, the number of ASM
instances grows and the communication overhead associated with managing
the storage increases.
ASM
overhead scales with the size of the cluster, and cluster
reconfiguration events increase with the number of servers in a cluster.
From an ASM perspective, larger clusters mean more frequent
reconfiguration events. A reconfiguration event is when a server enters
or departs a cluster configuration. From a cluster management
perspective, reconfiguration is a relatively expensive event.
With
Private Database Cloud and database consolidation, as the number of
database instances increases on a server, the importance and dependence
on the ASM instance increases.
The new Flex ASM feature in Oracle Release 12c changes this architecture with regard to ASM cluster organization and
communication. The Flex ASM feature includes two key sub-features or
architectures: Flex ASM Clustering and Remote ASM Access.
Flex ASM Clustering
In Oracle Release 12c, a smaller number of ASM instances run on a subset of servers in the cluster. The number of ASM instances is called the ASM cardinality. The default ASM cardinality is three, but that can be changed using the srvctl modify asm command. 12c
database instance connectivity is connection time load balanced across
the set of ASM instances. If a server running an ASM instance fails,
Oracle Clusterware will start a new ASM instance on a different server
to maintain the cardinality. If a 12c database instance is using a
particular ASM instance, and that instance is lost because of a server
crash or ASM instance failure, then the Oracle 12c database
instance will reconnect to an ASM instance on another node. The key
benefits of the Flex ASM Clustering feature include the following:
It eliminates the requirement for an ASM instance on every cluster server.
Database instances connect to any ASM instance in the cluster.
Database instances can fail over to a secondary ASM instance.
Administrators specify the cardinality of ASM instances (the default is three).
Clusterware ensures ASM cardinality is maintained.
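As noted above, the cardinality is adjusted with srvctl; a minimal sketch that raises it to four instances and then confirms the configuration:
$ srvctl modify asm -count 4
$ srvctl config asm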
The Flex ASM feature can be implemented in three different ways:
Pure 12c mode In this mode, the Grid Infrastructure and database are both running the 12c version. In this model, the database fully leverages all the new 12c features.
Mixed mode This mode includes two sub-modes: Standard mode and Flex Cluster mode.
With Standard mode, 12c Clusterware and ASM are hard-wired to each node, similar to the pre-12c deployment style. This model allows pre-12c and 12c databases to coexist. However, in the event of a node or ASM failure, only 12c databases can leverage the failover to an existing ASM instance (on another node).
In Flex Cluster mode, ASM instances run only on specific nodes (as determined by the cardinality). Pre-12c databases connect locally where ASM is running, and 12c databases can run on any node in the cluster and connect to ASM remotely.
Flex ASM Listeners
In order to support the Flex ASM feature,
the ASM listener was introduced. The ASM Listener, which is functionally
similar to the SCAN Listener, is a new global CRS resource with the
following key characteristics:
There are three ASM listeners in Flex ASM, and they run where an ASM instance is running.
ASM instances register with all ASM listeners.
Connectivity is load balanced across ASM instances.
Clients (DB instances) connect to ASM using
ASM listener endpoints. These clients connect using connect data
credentials defined by the Cluster Synchronization Services (CSS) Group
Membership Services (GMS) layer. The clients seek the best connection by
using the ASM listener on the local node if one is running there; if no ASM instance is running on the local node, the clients connect to any remote ASM instance in the cluster.
Flex ASM Network
In versions prior to 12c, Oracle
Clusterware required a public network for client application access and a
private network for internode communication within the cluster; this
included ASM traffic. The Flex ASM Network feature also provides the
capability to isolate ASM’s internal network traffic to its own
dedicated private network. The Oracle Universal Installer (OUI) presents
the DBA with a choice as to whether a dedicated network is to be used
for ASM. The ASM network is the communication path over which all the traffic between database instances and ASM instances flows. This
traffic is mostly metadata, such as a particular file’s extent map. If
the customer chooses, the ASM network can be dedicated to ASM traffic or shared with CSS; a separate dedicated network is not required.
Remote ASM Access
In previous versions, ASM clients used OS authentication to connect to ASM. This was a simple model because ASM clients and servers were always on the same server. With Oracle
Database 12c, ASM clients and ASM servers can be on different
servers (as part of the Flex ASM Network configuration). A default
configuration is created when the ASM cluster is formed, which is based
on the password specified for the ASM administrator at installation
time. Also, by default, the password file for ASM is now stored in an
ASM disk group. Having a common global password file addresses many
issues related to synchronizing separate password files on many servers
in a cluster. Additionally, the storing of password files in a disk group is extended to Oracle 12c databases as well. For database
instances, the DBCA utility executes commands to create an ASM user for
the operating system user creating the database. This is done
automatically without user intervention. Following this process, the
database user can remotely log into ASM and access ASM disk groups.
ASM Optimizations on Engineered Systems
In Chapter 12,
we described some of the ASM optimizations that were made specifically
for Engineered Systems such as Exadata and the Oracle Database Appliance
(ODA). There are several other important features in Oracle 12c ASM that support Engineered Systems. This section describes further ASM optimizations and features added in Oracle 12c for supporting Engineered Systems:
Oracle Database 12c
allows administrators to control the amount of resources dedicated to
disk resync operations. The ASM power limit can now be set for disk
resync operations, when disks are brought back online. This feature is
conceptually similar to the power limit setting for disk group
rebalance, with the range being 1 (least system resources) to 1024 (most
system resources).
If
a resync operation is interrupted and restarted, the previously
completed phases of the resync are skipped and processing recommences at
the beginning of the first remaining incomplete phase. Additionally,
these disk resync operations now have checkpoints enabled, such that an
interrupted resync operation is automatically restarted.
With Oracle Database 12c, extent
relocations performed by a rebalance operation can be offloaded to
Exadata Storage Server. Using this capability, a single offload request
can replace multiple read and write I/O requests. Offloading relocations
avoids sending data to the ASM host, thus improving rebalance
performance.
For
NORMAL and HIGH redundancy ASM disk groups, the algorithm that
determines the placement of secondary extents uses an adjacency measure
to determine the placement. In prior versions of ASM, the same algorithm
and adjacency measure were used for all disk groups. Oracle Database 12c
ASM provides administrators with the option to specify the content type
associated with each ASM disk group. Three possible settings are
allowed: data, recovery, and system. Each content type setting modifies
the adjacency measure used by the secondary extent placement algorithm.
The result is that the contents of disk groups with different content
type settings are distributed across the available disks differently.
This decreases the likelihood that a double-failure will result in data
loss across NORMAL redundancy disk groups with different content type
settings. Likewise, a triple-failure is less likely to result in data
loss for HIGH redundancy disk groups with different content type
settings.
Administrators can specify the content type for each disk group using the disk group attribute CONTENT.TYPE.
Possible values for the content type are data, recovery, or system.
Specifying different content types decreases the likelihood of a single
disk failure from impacting multiple disk groups in the same way.
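A sketch of tagging two disk groups with their content types (the disk group names are illustrative):
SQL> ALTER DISKGROUP data SET ATTRIBUTE 'content.type' = 'data';
SQL> ALTER DISKGROUP reco SET ATTRIBUTE 'content.type' = 'recovery';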
Error Checking and Scrubbing
In previous Oracle Database versions, when
data was read, a series of checks was performed on data to validate its
logical consistency. If a logical corruption was detected, ASM could
automatically recover by reading the mirror copies on NORMAL and HIGH
redundancy disk groups. One problem with this approach is that
corruption to seldom-accessed data could go unnoticed in the system for a
long time between reads. Also, the possibility of multiple corruptions
affecting all the mirror copies of data increases over time, so
seldom-accessed data may simply be unavailable when it is required.
Additionally, in releases prior to Oracle 12c, when an ASM extent
was moved during a rebalance operation, it was read and written without
any additional content or consistency checks.
ASM in Oracle 12c introduces proactive scrubbing capabilities, which check content consistency in flight (as data is accessed). Scrubbing provides the
capability to perform early corruption detection. Early detection of
corruption is vital because undetected corruption can compromise
redundancy and increases the likelihood of data loss.
Scrubbing is performed by a new background
process, SCRB, that performs various checks for logical data
corruptions. When a corruption is detected, the scrubbing process first
tries to use available mirrors to resolve the situation. If all the
mirror copies of data are corrupt or unavailable, the scrubbing process
gives up and the user can recover the corrupted blocks from an RMAN
backup if one is available.
Scrubbing can be invoked implicitly during rebalance operations, or an administrator can scrub specific areas on demand at the disk group level. To perform on-demand scrubbing, the following command can be executed:
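A minimal sketch for a disk group named DATA; the optional REPAIR clause asks ASM to fix corruptions it can resolve from mirror copies:
SQL> ALTER DISKGROUP data SCRUB POWER LOW;
SQL> ALTER DISKGROUP data SCRUB REPAIR POWER HIGH;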
When
scrubbing occurs during a rebalance, extents that are read during the
rebalance undergo a series of internal checks to ensure their logical
integrity. Scrubbing in the rebalance operation requires a new attribute
to be set for the disk group.
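In the production 12c release this is exposed as the content.check disk group attribute; assuming that attribute, a sketch:
SQL> ALTER DISKGROUP data SET ATTRIBUTE 'content.check' = 'TRUE';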
The content checking includes hardware
assisted resilient data (HARD), which includes checks on user data,
validation of file types from the file directory against the block
contents and file directory information, and mirror side comparisons.
Other Miscellaneous Flex ASM Features
Other notable features that are part of Flex ASM include:
The maximum number of ASM disk groups is increased from 63 to 511.
ASM instances in an ASM cluster validate each other's patch levels. This check is disabled during rolling upgrades; at the end of a rolling upgrade, patch level consistency is validated.
ASM physical metadata, such as disk headers and allocation tables, is now replicated. Previously, only virtual metadata was replicated when ASM mirroring was used.
Summary
Since its inception, ASM has grown from
being a purpose-built volume manager for the database to a feature-rich
storage manager that supports all database-related files and includes a
POSIX-compliant cluster file system. In addition, ASM has become the centerpiece of Oracle Engineered Systems. Oracle 12c ASM addresses extreme scalability and the management of real-world data types, and it removes many of the limitations of previous generations of ASM. ASM has also evolved to meet the cloud computing demands of consolidation, high utilization, and high availability.