Tuesday, 20 June 2023

Daily tasks and monitoring on Exadata, with useful links

Oracle Exadata administration requires specialized monitoring of both the database servers and the storage cells. Daily tasks focus on ensuring high availability, maintaining cell health, and proactively reviewing hardware alerts.

Key Daily Tasks & Commands
1. Cell and Hardware Health
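  • Check overall cell health and status: dcli -g cell_group -l celladmin "cellcli -e list cell detail" (-g names a file listing the cells, -l the remote user; the command to run is the final argument)
  • Check hardware alerts: cellcli -e "list alerthistory where severity = 'critical'" (append and examinedBy = '' to show only alerts nobody has reviewed yet)
  • List grid disk and physical disk status: cellcli -e list griddisk and cellcli -e list physicaldisk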
2. Storage and Smart Flash Cache
  • Monitor Flash Cache usage and efficiency: cellcli -e list flashcache detail
  • Check I/O statistics: Run iostat -x or query cell metrics, for example cellcli -e "list metriccurrent where name like '.*_IO.*'"
3. Database & Cluster Verification
  • Monitor Clusterware status: crsctl check cluster -all and crsctl check css
  • Review database and Exadata wait events in AWR reports. Look for Exadata-specific waits such as cell single block physical read.
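
The storage-cell checks above can be collected into one small script. The following is a minimal Python sketch, not an Oracle-supplied tool. It assumes dcli is on the PATH, that cell_group is a file listing the storage cells, and that celladmin is the remote user. The names CELL_CHECKS, daily_cell_checks, and run_checks are hypothetical helpers introduced here for illustration.

```python
import shlex
import subprocess

# CellCLI commands taken from the daily checklist above.
CELL_CHECKS = [
    "list cell detail",
    "list alerthistory where severity = 'critical'",
    "list griddisk",
    "list physicaldisk",
    "list flashcache detail",
]


def daily_cell_checks(group_file="cell_group", user="celladmin"):
    """Return one dcli argument list per CellCLI check."""
    return [
        ["dcli", "-g", group_file, "-l", user, f'cellcli -e "{check}"']
        for check in CELL_CHECKS
    ]


def run_checks(commands):
    """Run each command and return {remote command: combined output}."""
    results = {}
    for argv in commands:
        done = subprocess.run(argv, capture_output=True, text=True)
        results[argv[-1]] = done.stdout + done.stderr
    return results


# Print the commands instead of running them, so the sketch is safe to try
# on a machine without dcli. Call run_checks(daily_cell_checks()) on a
# database server to execute them.
for argv in daily_cell_checks():
    print(shlex.join(argv))
```

Keeping the command list separate from the runner makes it easy to review exactly what will be executed on every cell before anything runs.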


How to get hardware alerts in Exadata


Exadata hardware alerts are generated via ILOM, CellCLI/DBMCLI, or Oracle Enterprise Manager (OEM). They are categorized as Critical, Warning, or Info. The following steps cover viewing alerts and acting on them:


1. View and Get Notified of Alerts
  • Oracle Enterprise Manager (OEM): Review hardware status on the Exadata Database Machine Home page where active incidents are outlined in red. Configure Incident Rules to send alerts via email or SNMP.
  • Exadata Storage Server (Cells): Use the LIST ALERTHISTORY command in CellCLI. You can configure email notifications directly from the cell.
  • Database Servers (Compute Nodes): Use DBMCLI to review alerts and set up SMTP/SNMP notifications.
  • Oracle Auto Service Request (ASR): Register your Exadata so that Service Requests are opened automatically with Oracle Support for critical hardware faults.

2. Determine Further Action (Troubleshooting)
When a hardware alert (e.g., HALRT-*) fires, take the following steps to evaluate the severity and plan the fix:
  • Check the ILOM: Access the Integrated Lights Out Manager (ILOM) of the affected server to check the fault status and event log:
    show /SP/faultmgmt
  • Check Physical Indicators: Look for illuminated amber Fault LEDs (e.g., on hard drives, power supplies).
  • Review Details: Review the component specific error definitions (e.g., Temperature, Fan, Power Supply, Disk errors) detailed in the Oracle Exadata Error Messages Guide.
  • Engage Oracle Support: If the fault is hardware-related (e.g., disk, power supply, fan) or the component requires replacement, open a Service Request on My Oracle Support. Provide the alert details and run diagpack if requested by Oracle.

3. Clear the Alert
After the physical hardware has been replaced or the issue is rectified, clear the alert. Depending on how the alert was raised, it might clear automatically or require manual intervention via ILOM fault management or CellCLI/DBMCLI.

    To replace a failed disk safely in Oracle Exadata, use CellCLI to identify the failed drive, drop it for replacement, wait for Automatic Storage Management (ASM) to rebalance, and pull the drive once the blue "OK to remove" LED illuminates. Hard disks are hot-pluggable and can be replaced without powering down the server; for flash devices, check the procedure for your server model.
    Follow this process to identify, drop, and replace the failed disk:
    1. Check for Hardware Alerts
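    Log in to your storage server (as celladmin or root) and review the alert history to locate the failing disk. CellCLI filters are combined with AND, so run one query per severity:
    CellCLI> LIST ALERTHISTORY WHERE severity = 'critical'
    CellCLI> LIST ALERTHISTORY WHERE severity = 'warning'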

    2. Identify the Failed Disk
    Run the following command on the storage server to find the name, slot number, and device path of the failed drive:
    CellCLI> LIST PHYSICALDISK WHERE status != 'normal' DETAIL

    Note: Record the name (e.g., 28:5) and slotNumber; the later steps need them.
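
If you want to pick the name and slotNumber out of the DETAIL output programmatically, a small parser is enough. The sketch below is an illustration under stated assumptions: it assumes DETAIL output consists of "attribute: value" lines and that a new object starts after a blank line or at the next name attribute. The function parse_detail is a hypothetical helper, and the SAMPLE text is invented for the example; it is not output captured from a real cell.

```python
def parse_detail(text):
    """Parse CellCLI DETAIL output into a list of {attribute: value} dicts."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the record collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, because values such as 28:5
        # contain colons themselves.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "name" and current:
            # A repeated "name" attribute also starts a new object.
            records.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return records


# Invented sample in the assumed layout (not real cell output).
SAMPLE = """\
         name:                   28:5
         slotNumber:             5
         status:                 failed

         name:                   28:7
         slotNumber:             7
         status:                 warning - predictive failure
"""

disks = parse_detail(SAMPLE)
for disk in disks:
    print(disk["name"], disk["slotNumber"], disk["status"])
```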
    3. Drop the Disk for Replacement
    If you are running Oracle Exadata System Software Release 21.2.0 or newer, use the following command to drop the physical disk while maintaining redundancy:

    CellCLI> ALTER PHYSICALDISK <disk_name> DROP FOR REPLACEMENT MAINTAIN REDUNDANCY NOWAIT

    Wait for a storage server alert confirming the disk is dropped and data has successfully rebalanced before proceeding.
    4. Physically Replace the Disk
    a. Locate the physical server (a white locator LED will be illuminated on the front of the chassis).
    b. Identify the disk itself (an amber "Fault" LED will be lit).
    c. Wait for the blue "OK to remove" LED to light up before pulling the drive.
    d. Press the disk ejection lever, pull out the failed drive, and slide the new drive into the chassis until it locks in place.

    5. Verify the Replacement
    Once the drive is inserted, the new disk is detected and configured automatically. Run the following to confirm that it is back to normal:
    CellCLI> LIST PHYSICALDISK WHERE name = <disk_name> ATTRIBUTES status
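
Because the new disk can take a while to reach the normal state, the status check above is often repeated. The following Python sketch shows one way to poll for it. It is an assumption-laden illustration: wait_for_status is a hypothetical helper, get_status stands for any function you write that runs the LIST PHYSICALDISK command above and returns the status string, and the number of attempts and the delay are arbitrary defaults.

```python
import time


def wait_for_status(get_status, wanted="normal", attempts=30,
                    delay_s=60.0, sleep=time.sleep):
    """Call get_status() until it returns `wanted`.

    Returns the number of attempts used, or raises TimeoutError when the
    wanted status was not seen within `attempts` tries.
    """
    last = None
    for attempt in range(1, attempts + 1):
        last = get_status()
        if last == wanted:
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    raise TimeoutError(f"status is still {last!r} after {attempts} attempts")


# Demonstration with a fake status source, so no cell is contacted and
# no real waiting happens. The status strings are placeholders.
fake_statuses = iter(["failed", "failed", "normal"])
used = wait_for_status(lambda: next(fake_statuses), sleep=lambda s: None)
print("reached normal after", used, "checks")
```

Passing the sleep function as a parameter keeps the loop testable without waiting in real time.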

    Full procedural breakdowns for varying disk types (e.g., hard disks, flash disks, or M.2 system disks) are available in the Oracle Exadata Maintenance Guide.

    How to get hardware alerts in Exadata and route them to Oracle Support


    To receive Exadata hardware alerts automatically and have them routed to Oracle Support, configure Oracle ASR (Auto Service Request). ASR automatically logs a Service Request (SR) with Oracle Support for specific hardware faults, while Exadata alert notifications keep your own team informed.
    Here is the step-by-step process to configure and monitor hardware alerts:
    1. Enable Oracle Auto Service Request (ASR)
    Oracle ASR detects critical hardware faults (such as disks, memory, or power supplies) and automatically opens a support ticket with Oracle. 
    • Install ASR Manager: Deploy the ASR Manager software on a standalone server external to your Exadata rack.
    • Enable Telemetry: Configure your Exadata Database Servers and Storage Servers to send telemetry and traps to the ASR Manager.
    • Activate ASR: Register and activate your ASR assets through the My Oracle Support portal to link them with your Customer Support Identifier (CSI).
    2. Configure Email and SNMP Alert Notifications
    To ensure your DBAs and operations staff are notified immediately of warning or critical hardware alerts:
    • Storage Servers: Log in via CellCLI and use the ALTER CELL command to define your SMTP mail server, from/to addresses, and notification policy.
    • Database Servers: Use DBMCLI and the ALTER DBSERVER command to define the same SMTP settings. You can also set up alert rules in the Integrated Lights Out Manager (ILOM) so that you receive email or SNMP traps directly when a component fails.
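
As an illustration of the storage server setting, the sketch below assembles an ALTER CELL statement for e-mail notifications. Treat it as a sketch under assumptions: the attribute names (smtpServer, smtpFromAddr, smtpToAddr, notificationPolicy, notificationMethod) are written as I recall them from the CellCLI reference and should be verified against your release, alter_cell_mail_command is a hypothetical helper, and the host and addresses are placeholders.

```python
def alter_cell_mail_command(smtp_server, from_addr, to_addrs,
                            policy=("critical", "warning", "clear")):
    """Build an ALTER CELL statement that enables e-mail notifications."""
    settings = {
        "smtpServer": smtp_server,
        "smtpFromAddr": from_addr,
        "smtpToAddr": ",".join(to_addrs),
        "notificationPolicy": ",".join(policy),
        "notificationMethod": "mail",
    }
    # Values are wrapped in single quotes, so reject embedded quotes
    # instead of guessing at an escaping rule.
    for name, value in settings.items():
        if "'" in value:
            raise ValueError(f"{name} must not contain a single quote")
    assignments = ", ".join(
        f"{name}='{value}'" for name, value in settings.items()
    )
    return f"ALTER CELL {assignments}"


# Placeholder server and addresses; replace with your own.
command = alter_cell_mail_command(
    "mail.example.com", "exadata@example.com", ["dba-team@example.com"]
)
print(command)
```

After applying the settings in CellCLI, ALTER CELL VALIDATE MAIL sends a test message so you can confirm that delivery works.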

    3. Integrate with Oracle Enterprise Manager (OEM)
    Oracle Enterprise Manager Cloud Control provides a comprehensive graphical interface for monitoring Exadata hardware. 
    • Discover Targets: Use the Exadata plug-in within OEM to discover and promote all database nodes, storage cells, InfiniBand switches, and PDUs.
    • View Hardware Incidents: Navigate to the Database Machine Home Page to view the schematic layout. Components with active incidents (like a faulty disk or fan) will be outlined in red.
    • Set Up Incident Rules: Configure Incident Rules in OEM to forward hardware alerts to your internal ticketing systems and to alert designated administrators.
    4. Direct Command-Line Checks
    If you need to view active or historical alerts directly on the Exadata system:
    • Storage Cell Alerts: Run LIST ALERTHISTORY in CellCLI to see the alerts the cell has raised. LIST ALERTDEFINITION shows the alert types the cell can raise, not the alerts themselves.
    • Compute Node Alerts: Run LIST ALERTHISTORY in DBMCLI, check the database alert.log, or review the Oracle ILOM Event Log via SSH.


    Useful Links & Documentation
    For best practices, metric definitions, and command references, a collection of Exadata My Oracle Support (MOS) notes is very helpful during the daily tasks. The following notes are more or less my "Favorites" from MOS.


    Information Center

    1306791.2 – "Oracle Exadata Database Machine"

    Master Note

    888828.1 – Exadata Database Machine and Exadata Storage Server Supported Versions

    Best Practices

    757552.1 – Oracle Exadata Best Practices

    1274318.1 – Oracle Sun Database Machine Setup/Configuration Best Practices

    1244344.1 – Exadata Starter Kit

    Operation Tasks

    1473002.1 – Using dbserver_backup.sh to backup compute nodes

    1538068.1 – Remove partition if dbserver_backup.sh fails

    1428394.1 – Password stuff (pam_tally2)

    1093890.1 – Shutdown and startup Exadata and Compute nodes on rack

    1446274.1 – ILOM command reference (startup and shutdown Exadata from ILOM)

    1520896.1 – DBFS Configuration Health Check

    1054431.1 – Configure DBFS on Exadata Checklist

    1553103.1 – latest dbnodeupdate.sh note

    401749.1  –  Shell Script to Calculate Values Recommended Linux HugePages / HugeTLB Configuration

     …

    Cell-Storage Server

    1921528.1 – SRDC – EEST Storage Cell General Issues

    1306635.1 – Replacement of flash – how to check firmware and status. Resetting status

    1188080.1 – Steps to shut down or reboot an Exadata storage cell without affecting ASM 

    1477020.1 – Exadata: ASM Diskgroup Showing Status Of _DROPPED_… After Storage Maintenance

    761868.1   – Oracle Exadata Diagnostic Information required for Disk Failures and some other Hardware issues

    Patching

    1262380.1 – Master note on Exadata patching

    1473002.1 – Using ULN to install server patches with YUM

    1545789.1 – ISO install Cheat Sheets

    1136544.1 – Relinking notes

    1553103.1 – Exadata Database Server Patching using the DB Node Update Utility

     

    Software Specific Release Notes

    1537407.1 – Oracle 12c
