EasyReliableDBA: Daily Tasks and Monitoring on Exadata, and Useful Links

Oracle Exadata administration requires specialized monitoring of both the database servers and the storage cells. Daily tasks focus on ensuring high availability, maintaining cell health, and proactively reviewing hardware alerts.

Key Daily Tasks & Commands

1. Cell and Hardware Health

Check overall cell health and status on every cell in the group file: dcli -g cell_group "cellcli -e list cell detail"
Check critical hardware alerts: cellcli -e "list alerthistory where severity = 'critical'"
List grid disk and physical disk status: cellcli -e list griddisk and cellcli -e list physicaldisk
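
On a full rack the disk listings above get long, so it helps to show only the rows that need attention. The following is a minimal sketch, assuming the status is the last whitespace-separated column of each row, as when the listing is limited to the name and status attributes. The function name not_normal is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print only the rows whose last column is not "normal".
# Assumption: each input row ends with the disk status, for example the
# output of "cellcli -e list physicaldisk attributes name,status".
not_normal() {
    awk 'NF > 0 && $NF != "normal"'
}

# Example use on a storage server (not run here):
#   cellcli -e list physicaldisk attributes name,status | not_normal
```

An empty result means every physical disk reported normal. Grid disks report a different healthy status, so this filter is meant for the physical disk listing only.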

2. Storage and Smart Flash Cache

Monitor Flash Cache usage and efficiency: cellcli -e list flashcache detail
Check I/O statistics: run iostat -x on the server, or query cell metrics with cellcli -e "list metriccurrent where name like '.*_IO.*'"

3. Database & Cluster Verification

Monitor Clusterware status: crsctl check cluster and crsctl check css
Review database and Exadata wait events in AWR reports. Look for Exadata-specific waits such as cell single block physical read.

How to get hardware alert in Exadata

Exadata hardware alerts are generated via ILOM, CellCLI/DBMCLI, or Oracle Enterprise Manager (OEM). They are categorized as Critical, Warning, or Info. A practical way to view alerts and act on them is as follows:

1. View and Get Notified of Alerts

Oracle Enterprise Manager (OEM): Review hardware status on the Exadata Database Machine Home page where active incidents are outlined in red. Configure Incident Rules to send alerts via email or SNMP.
Exadata Storage Server (Cells): Use the LIST ALERTHISTORY command in CellCLI. You can configure email notifications directly from the cell.
Database Servers (Compute Nodes): Use DBMCLI to review alerts and set up SMTP/SNMP notifications.
Oracle Auto Service Request (ASR): Register your Exadata to automatically open Service Requests with Oracle Support for critical hardware faults.

2. Determine Further Action (Troubleshooting)

When a hardware alert (e.g., HALRT-*) fires, take the following steps to evaluate the severity and plan the fix:

Check the ILOM: Access the Integrated Lights Out Manager (ILOM) of the affected server to check the fault status and event log:
show /SP/faultmgmt

Check Physical Indicators: Look for illuminated amber Fault LEDs (e.g., on hard drives, power supplies).

Review Details: Review the component specific error definitions (e.g., Temperature, Fan, Power Supply, Disk errors) detailed in the Oracle Exadata Error Messages Guide.

Engage Oracle Support: If the fault is hardware-related (e.g., disk, power supply, fan) or the component requires replacement, open a Service Request on My Oracle Support. Provide the alert details and run diagpack if requested by Oracle.

3. Clear the Alert

After the physical hardware has been replaced or the issue is rectified, clear the alert. Depending on how the alert was raised, it might clear automatically or require manual intervention via ILOM fault management or CellCLI/DBMCLI.

To safely replace a disk in Oracle Exadata after a hardware alert, use CellCLI to identify the failed drive, wait for Automatic Storage Management (ASM) to rebalance, and pull the drive once the blue "OK to remove" LED illuminates. All storage devices are hot-pluggable and can be replaced without powering down.

Follow this exact process to identify, drop, and replace your failed disk:

1. Check for Hardware Alerts

Run one query per severity level:

CellCLI> LIST ALERTHISTORY WHERE severity = 'critical'
CellCLI> LIST ALERTHISTORY WHERE severity = 'warning'
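
When this check runs from a script, it can be simpler to issue one alert query per severity level. The following is a minimal sketch; the function name severity_queries is hypothetical, and it only prints the CellCLI statements so that they can be reviewed before anything is run.

```shell
#!/bin/sh
# Hypothetical helper: print one LIST ALERTHISTORY statement per severity
# passed as an argument. It prints text only and does not call cellcli.
severity_queries() {
    for sev in "$@"; do
        printf "list alerthistory where severity = '%s'\n" "$sev"
    done
}

# Example use on a storage server (not run here):
#   severity_queries critical warning | while read -r q; do cellcli -e "$q"; done
```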

2. Identify the Failed Disk

Run the following command on the storage server to find the exact name, slot number, and device path of the failed drive.

CellCLI> LIST PHYSICALDISK WHERE status != 'normal' DETAIL

Note: Make a note of the name (e.g., 28:5) and slotNumber.

3. Drop the Disk for Replacement

If you are running Oracle Exadata System Software Release 21.2.0 or newer, use the following command to drop the physical disk while maintaining redundancy:

CellCLI> ALTER PHYSICALDISK <disk_name> DROP FOR REPLACEMENT MAINTAIN REDUNDANCY NOWAIT

Wait for a storage server alert confirming the disk is dropped and data has successfully rebalanced before proceeding.

4. Physically Replace the Disk

Locate the physical server (a white locator LED will be illuminated on the front of the chassis).
Identify the disk itself (an amber "Fault" LED will be lit).
Wait for the blue "OK to remove" LED to light up before pulling the drive.
Press the disk ejection lever, pull out the failed drive, and slide the new drive into the chassis until it locks in place.

5. Verify the Replacement

Once the drive is inserted, the new disk is automatically detected and configured. Run the following to confirm it is back to normal:

CellCLI> LIST PHYSICALDISK WHERE name = <disk_name> ATTRIBUTES status
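
The verification step can be repeated until the disk settles. The following is a minimal polling sketch, not an Oracle-supplied procedure. The function name wait_for_normal and the STATUS_CMD variable are hypothetical: STATUS_CMD names any command or shell function that prints the status of the disk given as its first argument, for example a wrapper around the LIST PHYSICALDISK command above.

```shell
#!/bin/sh
# Hypothetical helper: poll a disk's status until it reads "normal".
# Assumption: "$STATUS_CMD <disk>" prints only the status text.
wait_for_normal() {
    disk="$1"
    tries="${2:-30}"   # maximum number of polls
    delay="${3:-60}"   # seconds to sleep between polls
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Collapse surrounding whitespace so "  normal" compares cleanly.
        status=$("$STATUS_CMD" "$disk" | awk '{$1=$1; print}')
        if [ "$status" = "normal" ]; then
            echo "disk $disk is normal"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "disk $disk still reports: $status" >&2
    return 1
}

# Example wiring on a storage server (an assumption, not run here):
#   disk_status() { cellcli -e "list physicaldisk where name = '$1' attributes status"; }
#   STATUS_CMD=disk_status
#   wait_for_normal 28:5
```

Keeping the status lookup behind STATUS_CMD lets the loop be tried with a stub before it is pointed at a real cell.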

Full procedural breakdowns for varying disk types (e.g., hard disks, flash disks, or M.2 system disks) are available in the Oracle Exadata Maintenance Guide. 

How to get hardware alert in Exadata and Oracle Support

To automatically receive Exadata hardware alerts and have them routed to Oracle Support, you must configure Oracle ASR (Auto Service Request). ASR automatically logs a Service Request (SR) with Oracle Support for specific hardware faults, while configuring Exadata alert notifications keeps your team informed.

Here is the step-by-step process to configure and monitor hardware alerts:

1. Enable Oracle Auto Service Request (ASR)

Oracle ASR detects critical hardware faults (such as disks, memory, or power supplies) and automatically opens a support ticket with Oracle.

Install ASR Manager: Deploy the ASR Manager software on a standalone server external to your Exadata rack.
Enable Telemetry: Configure your Exadata Database Servers and Storage Servers to send telemetry and traps to the ASR Manager.
Activate ASR: Register and activate your ASR assets through the My Oracle Support portal to link them with your Customer Support Identifier (CSI).

2. Configure Email and SNMP Alert Notifications

To ensure your internal IT and DBAs are immediately notified of any warning or critical hardware metrics:

Storage Servers: Log in via CellCLI and use the ALTER CELL command to define your SMTP mail server, from/to addresses, and notification policy.
Database Servers: Set up hardware fault alerts through the Integrated Lights Out Manager (ILOM) so you receive email or SNMP traps directly when a component fails.
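
For the storage server side, the ALTER CELL settings can be sketched as follows. This is a minimal example, assuming mail notification for critical, warning, and clear alerts. The function name mail_alert_cmd is hypothetical, and the server name and addresses in the usage lines are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print the CellCLI statement that enables mail alerts.
# Arguments: SMTP server, from address, to address (all placeholders here).
# It prints text only and does not call cellcli.
mail_alert_cmd() {
    smtp_server="$1"
    from_addr="$2"
    to_addr="$3"
    printf "ALTER CELL smtpServer='%s', smtpFromAddr='%s', smtpToAddr='%s', notificationPolicy='critical,warning,clear', notificationMethod='mail'\n" \
        "$smtp_server" "$from_addr" "$to_addr"
}

# Example use on a storage server (not run here):
#   cellcli -e "$(mail_alert_cmd mail.example.com cell01@example.com dba@example.com)"
#   cellcli -e "ALTER CELL VALIDATE MAIL"
```

Printing the statement first makes it easy to review the values before applying them, and ALTER CELL VALIDATE MAIL then sends a test message to confirm delivery.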

3. Integrate with Oracle Enterprise Manager (OEM)

Oracle Enterprise Manager Cloud Control provides a comprehensive graphical interface for monitoring Exadata hardware.

Discover Targets: Use the Exadata plug-in within OEM to discover and promote all database nodes, storage cells, InfiniBand switches, and PDUs.
View Hardware Incidents: Navigate to the Database Machine Home Page to view the schematic layout. Components with active incidents (like a faulty disk or fan) will be outlined in red.
Set Up Incident Rules: Configure Incident Rules in OEM to forward hardware alerts to your internal ticketing systems and to alert designated administrators.

4. Direct Command-Line Checks

If you need to view active or historical alerts directly on the Exadata system:

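Storage Cell Alerts: Run LIST ALERTHISTORY in CellCLI to see the alerts the cell has raised. LIST ALERTDEFINITION shows the definitions of the alerts a cell can raise, which helps when interpreting an entry.
Compute Node Alerts: Run LIST ALERTHISTORY in DBMCLI, check the standard alert.log, or review the Oracle ILOM Event Log via SSH.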
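
Storage cells use CellCLI and database servers use DBMCLI, so a script shared across both can pick whichever tool is present. The following is a minimal sketch; the function name alert_tool is hypothetical, and it assumes the tool is on the PATH of the user running it.

```shell
#!/bin/sh
# Hypothetical helper: print the name of the first alert CLI found,
# preferring cellcli (storage cell) over dbmcli (database server).
alert_tool() {
    for tool in cellcli dbmcli; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

# Example use on either server type (not run here):
#   "$(alert_tool)" -e "list alerthistory"
```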

Useful Links & Documentation

For a complete understanding of best practices, metric definitions, and command references, leverage the following official and expert resources:

Oracle Exadata Platform: Official portal covering on-premises, OCI, and multi-cloud Exadata platforms.
Oracle Exadata Database Machine - Operational Best Practices: Comprehensive PDF detailing tuning, isolation, and day-to-day operations.
Exadata Performance and AWR: Guide to reading Exadata-specific statistics in Automatic Workload Repository (AWR) reports.
Monitoring Oracle Exadata MOS Notes: Official Oracle blog tracking critical My Oracle Support (MOS) articles and patching guidelines.

For daily tasks it is very helpful to have a collection of Exadata MOS notes at hand. The following notes are more or less my "Favorites" from MOS.

Information Center

1306791.2 – "Oracle Exadata Database Machine"

Master Note

888828.1 – Exadata Database Machine and Exadata Storage Server Supported Versions

Best Practices

757552.1 – Oracle Exadata Best Practices

1274318.1 – Oracle Sun Database Machine Setup/Configuration Best Practices

1244344.1 – Exadata Starter Kit

Operation Tasks

1473002.1 – Using dbserver_backup.sh to backup compute nodes

1538068.1 – Remove partition if dbserver_backup.sh fails

1428394.1 – Password stuff (pam_tally2)

1093890.1 – Shutdown and startup Exadata and Compute nodes on rack

1446274.1 – ILOM command reference (startup and shutdown Exadata from ILOM)

1520896.1 – DBFS Configuration Health Check

1054431.1 – Configure DBFS on Exadata Checklist

1553103.1 – latest dbnoteupdate.sh note

401749.1 – Shell Script to Calculate Values Recommended Linux HugePages / HugeTLB Configuration

…

Cell-Storage Server

1921528.1 – SRDC – EEST Storage Cell General Issues

1306635.1 – Replacement of flash – how to check firmware and status. Resetting status

1188080.1 – Steps to shut down or reboot an Exadata storage cell without affecting ASM

1477020.1 – Exadata: ASM Diskgroup Showing Status Of _DROPPED_… After Storage Maintenance

761868.1 – Oracle Exadata Diagnostic Information required for Disk Failures and some other Hardware issues

Patching

1262380.1 – Master note on Exadata patching

1473002.1 – Using ULN to install server patches with YUM

1545789.1 – ISO install Cheat Sheets

1136544.1 – Relinking notes

1553103.1 – Exadata Database Server Patching using the DB Node Update Utility

Software Specific Release Notes

1537407.1 – Oracle 12c

Tuesday, 20 June 2023
