EasyReliableDBA: Oracle Database Architect(L4 Support (Architect/Lead)) Question and Answer 2026 Part 3

Part 1: Architect/Lead Interview Q&A

1. Explain Plan & SQL Plan Directives

Question: How do you tackle a suddenly degraded execution plan in production without changing application code?
Answer & Commands: Use DBMS_XPLAN to extract the plan via the SQL ID.
sql
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('your_sql_id'));
Check if the optimizer generated dynamic hints. Inject a SQL Plan Directive or SQL Profile to force specific join methods (e.g., HASH_SJ) or indexes.
Risk Consideration: Forcing hints can backfire as data volumes grow. Always baseline the plan first via SPM to test the fix.

2. SQL Plan Management (SPM) & Real-Time SPM

Question: How does Real-Time SPM work in Oracle Database 23ai?
Answer: Prior to 23ai, identifying and fixing bad plans required manual intervention or background jobs. Oracle Database 23ai introduces Real-Time SPM to automatically detect and repair performance regressions caused by plan changes during hard parses. The optimizer evaluates alternative plans, instantly accepting and enforcing the best-performing plan using SQL Plan Baselines without DBA intervention.

3. Oracle 23ai "AI Vector Search" & ML Engine

Question: How are you utilizing 23ai’s in-database AI/ML features in your recent designs?
Answer: Oracle Database 23ai allows declarative SQL to run machine learning models and AI Vector Search directly on business data. This eliminates slow data movement and reduces risks since data never has to leave the protected database environment.
Customer Requirement/Risk: When deploying this, consider GPU/memory sizing constraints and maintain compliance by ensuring AI vector data complies with TDE (Transparent Data Encryption) and row-level security protocols.

Part 2: BLOB Migration & Performance Tuning

Customer Requirement / Problem

Migrating legacy LONG/LONG RAW or older BASICFILE LOBs to modern SECUREFILE LOBs, particularly in multi-terabyte architectures.

Step 1: Check Database Compatibility & Default Parameters

sql

ALTER SYSTEM SET DB_SECUREFILE = 'PREFERRED' SCOPE=BOTH;

Step 2: Migration Command (Using Data Pump with Parallel Workers)

bash

expdp system/password directory=data_dump_dir schemas=hr dumpfile=hr_secure.dmp logfile=expdp_hr.log
impdp system/password directory=data_dump_dir schemas=hr 
dumpfile=hr_secure.dmp logfile=impdp_hr.log TRANSFORM=LOB_STORAGE:SECUREFILE

Test Cases & Risk Considerations

Storage Space: SecureFile LOBs might consume more space depending on deduplication rules. Ensure your ASM has 20-30% headroom.
Integrity Check: Run DBMS_REDEFINITION for near-zero downtime, or use parallel ALTER TABLE MOVE commands for static tables.
Benchmark Tool: Use ORION to benchmark I/O throughput before the move. Use DBMS_RESOURCE_MANAGER to throttle LOB segment read/writes to avoid impacting OLTP operations during peak hours.

Migrating legacy LONG and BASICFILE LOBs to modern SECUREFILE LOBs in multi-terabyte Oracle 19c and 26ai environments is best achieved using the DBMS_REDEFINITION package. This method minimizes downtime, maximizes space savings through Advanced LOB Compression and Deduplication, and manages massive volumes of data at the partition level. [

Why Migrate?

Performance: SECUREFILE outperforms BASICFILE by 2x to 4x, particularly in read/write caching and SSD/NVMe environments.
Storage Efficiency: It provides native, block-level compression and deduplication.
Deprecation: LONG columns are fundamentally deprecated by Oracle; converting them to CLOB/BLOB is a mandatory prerequisite for modern database usage. [

Recommended Architecture & Strategy

1. Leverage DBMS_REDEFINITION [

In a multi-terabyte architecture, attempting to alter tables directly or executing standard ALTER TABLE... MOVE operations will cause severe locking and generate overwhelming undo/redo logs.

Use the DBMS_REDEFINITION package to perform the migration online. Your applications can continue reading and writing to the source table while the massive dataset is being transformed in the background. ]

2. Process Partition-by-Partition

Attempting a single-table migration of a 5TB table will drain system resources and require massive temporary tablespaces.

Use the partition-level redefinition feature of DBMS_REDEFINITION. Redefine one partition at a time to keep resource contention, UNDO/REDO generation, and tablespace spikes under strict control.

3. Setup Target SECUREFILE Storage Parameters

During the redefinition process, map the legacy types to SECUREFILE using a specialized CREATE TABLE interim statement. Make sure to include advanced features to reduce the terabyte footprint:

Compression: Use COMPRESS HIGH or COMPRESS MEDIUM (part of the Oracle Advanced Compression Option).
Deduplication: Use DEDUPLICATE to save space if your datasets have repeated binary payloads.
Cache Logging: Specify CACHE or NOCACHE LOGGING depending on your I/O subsystem capabilities.

Step-by-Step Migration Process

Phase 1: LONG to LOB (Prerequisite)

If your schema contains legacy LONG or LONG RAW columns, you cannot directly move them to SECUREFILE. You first need to map them to CLOB/BLOB using TO_LOB() in DBMS_REDEFINITION:

Create an interim table defining the migrated columns with CLOB/BLOB data types.
Call DBMS_REDEFINITION.START_REDEF_TABLE and supply the TO_LOB() conversion parameter. []

Phase 2: BASICFILE to SECUREFILE

If your columns are already basic LOB datatypes, or once you finish the Phase 1 LONG conversion, execute the online redefinition:

Create an Interim Table: Define the LOB columns with the STORE AS SECUREFILE clause. Add COMPRESS or DEDUPLICATE here.
Start Redefinition: Use DBMS_REDEFINITION.START_REDEF_TABLE.
Copy Dependents: Run DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS to clone indexes, triggers, and constraints onto the interim table.
Sync Interims: Periodically call DBMS_REDEFINITION.SYNC_INTERIM_TABLE to catch up on DML changes while the bulk copy runs.
Finish Redefinition: Call DBMS_REDEFINITION.FINISH_REDEF_TABLE to finalize the atomic swap.

Best Practices for Multi-Terabyte Environments

Tablespace Sizing: Allocate a new, dedicated BIGFILE tablespace for SECUREFILE segments. Ensure auto-extend is disabled or strictly capped on underlying datafiles to prevent runaway space consumption.
Parallel DML: When conducting the data copy, use parallel hints or parallelism clauses to speed up the bulk insertion, reducing the overall time needed for the materialized view to catch up.
UNDO & Redo Generation: Ensure your UNDO tablespace and REDO logs are highly provisioned or use NOLOGGING on the interim table if you can afford to rebuild backups post-migration.

Part 3: Deep Dive: Memorable L4 Task & Escalation

Customer Scenario

An enterprise OLTP system handling financial transactions experienced sudden 100% CPU usage and ORA-04031 (Shared Pool out of memory) errors. This halted the billing application.

Tools Used:

AWR/ASH: To pinpoint exact SQL IDs and execution concurrency.
SQLT (SQL Tuning Health-Check Script): To deeply analyze execution paths and index usage.
Oradebug: Used to dump the shared pool trace and inspect sub-heap memory fragmentation.

Troubleshooting Steps & Fix

Identify the Culprit: Ran ASH queries to spot unindexed search paths causing massive hard parsing.
sql
SELECT sql_id, child_number, executions, parse_calls FROM V$SQL WHERE parse_calls > 10000 AND loaded = 'Y';
Memory Dump & Analysis: The dump identified heavy shared pool sub-heap fragmentation.
The Solution: We identified 120,000 literal (non-bind) SQL statements flooding the memory.
- Fix: Altered database parameters to force bind variables and flushed the shared pool safely.
sql
ALTER SYSTEM SET cursor_sharing = FORCE; ALTER SYSTEM FLUSH SHARED_POOL;
Post-Setup Issue: After implementation, certain queries with extreme skewed data ran slowly because of CURSOR_SHARING = FORCE. We fixed this by placing SQL Profiles on the specific skewed queries, overriding the FORCE behavior locally.

Here is the step-by-step command guide and troubleshooting workflow to stabilize the system and apply a permanent fix.

Step 1: Immediate Stabilization (Under Production Pressure)

Do not restart the instance immediately unless the database is completely frozen. Start with immediate relief to clear the fragmented memory and stop rogue sessions.

1. Flush the Shared Pool (Temporary Defragmentation):

sql

ALTER SYSTEM FLUSH SHARED_POOL;

Note: This will free up memory by clearing unused objects but will temporarily cause a spike in CPU usage as queries get hard parsed again.

2. Identify the Culprit/Rogue Sessions (High CPU/Locks):
Run this script to find sessions executing queries that consume the most memory or are causing heavy parsing: [

sql

SELECT s.sid, s.serial#, s.username, s.osuser, s.machine, s.status, q.sql_text, q.parsing_schema_name 
FROM v$session s 
JOIN v$sql q ON s.sql_id = q.sql_id 
WHERE s.status = 'ACTIVE' 
AND s.type = 'USER' 
ORDER BY q.sharable_mem DESC;

3. Kill the Culprit Processes:

If a specific billing process or a literal-heavy query is monopolizing the CPU, terminate it gracefully:

sql

ALTER SYSTEM KILL SESSION 'sid,serial#' IMMEDIATE;

Step 2: In-Depth Troubleshooting

Identify exactly what is causing the shared pool to fill up and fragment.

1. Check the Failed Memory Request (from ORA-04031 error trace):
Look at the error trace file in the database alert log (or the application error stack) to identify which component is failing:

shared pool: Indicates dynamic SQL/PLSQL issues.
large pool: Indicates RMAN, parallel execution, or Parallel Query issues.

2. Find Hard Parse Culprits (Literal SQL without Bind Variables):
Heavy OLTP systems often suffer from queries that do not use bind variables (e.g., WHERE id = 1, WHERE id = 2). Run the following to locate the offenders:

sql

SELECT sql_text, executions, parse_calls, sharable_mem 
FROM v$sql 
WHERE executions < 5 
AND parse_calls > 100 
ORDER BY parse_calls DESC;

Step 3: Permanent Fixes and Application Tuning

1. Use Bind Variables (Application Fix):
The ultimate solution is to modify the billing application to use bind variables. If the developers cannot alter the code immediately, enable Cursor Sharing globally or at the session level to force the database to replace literal values with binds:

sql

ALTER SYSTEM SET cursor_sharing = FORCE SCOPE=BOTH;

2. Pin Large Packages:

If the ORA-04031 error complains about large PL/SQL packages (common in large enterprise billing applications), pre-load and pin them into memory during startup:

sql

EXEC DBMS_SHARED_POOL.KEEP('your_package_name', 'P');

3. Increase SGA & Shared Pool Reserved Size:

If the database sizing parameters are too low, increase the limits. It is highly recommended to allocate at least 10-15% of your SHARED_POOL_SIZE for reserved memory: [

sql

ALTER SYSTEM SET shared_pool_size = <New_Size_e.g_16G> SCOPE=BOTH;
ALTER SYSTEM SET shared_pool_reserved_size = <Size_e.g_2G> SCOPE=BOTH;

Step 4: Use Cases and Best Practices

Dynamic SQL in Billing Reports: OLTP reports often build queries dynamically depending on user inputs. Use case: If these queries do not use binds, they trigger ORA-04031. Fix: Utilize CURSOR_SHARING = SIMILAR or FORCE to let the database handle it dynamically.
Batch Billing Processes: Bulk-inserting financial records can sometimes cause PGA/Temporary Tablespace exhaustions which cascade into shared pool issues. Fix: Periodically flush the shared pool during off-peak windows or increase the SHARED_POOL_SIZE to accommodate high-volume batch statements.
Database Maintenance: Monitor memory health using the Autonomous Health Framework (AHF) by running the Support Diagnostic Tool:
bash
tfactl diagcollect -srdc ORA-04031

Q: How does Oracle Database 23ai/26ai transform SQL Plan Management (SPM)?

Answer: Prior to 23ai, capturing baselines and evolving them manually was tedious. Oracle now includes Real-Time SPM and the Automatic SQL Tuning Set (ASTS). The database actively test-executes alternative plans in the background/foreground and automatically creates baselines or reinstates the best-performing plan without manual intervention.
Test Case Pre-consideration: When testing upgrades to 23/26ai, run the SQL Performance Analyzer (SPA) to predict and validate performance regressions before they hit production.

Q: Risk Assessment: Deploying a "SQL Plan Directive" vs. fixing an execution plan

Answer: A SQL Plan Directive is a proactive instruction given to the Optimizer to create dynamic sampling or gather statistics when it suspects missing or stale statistics. An explicit fix (like SPM or a SQL Profile) is a reactive, surgical strike.
Risk Assessment & Compliance: Directives are dynamic and change plan stability unpredictably, which may violate strict financial (e.g., SOX) compliance that requires predictable, repeatable execution plans. Always prioritize fixed SPM baselines for critical reporting queries.

1. Performance Issue Troubleshooting: Explain Plan & SQL Tuning

When a query is performing poorly, follow this step-by-step diagnostic triage:

Step 1: Identify the SQL & Execution Plan

sql

SELECT sql_id, child_number, plan_hash_value, executions, elapsed_time 
FROM v$sql 
WHERE sql_id = '&sql_id';

-- Get the exact execution plan
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id', &child_number, 'ALLSTATS LAST'));

Step 2: Check for Bottlenecks (Wait Events)

Look at V$SESSION_WAIT or V$ACTIVE_SESSION_HISTORY (ASH) to see if the delay is I/O related (e.g., db file scattered read) or locking related (e.g., enq: TX - row lock contention).

Step 3: Analyze Table/Index Statistics

sql

EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => '&schema', tabname => '&table', cascade => TRUE, method_opt => 'FOR ALL COLUMNS SIZE AUTO');

Step 4: Fix Using SQL Plan Management (SPM)

If a new execution plan is performing poorly, load the known good plan from the cursor cache or AWR into a baseline:

sql

-- Load good plan from Cursor Cache
DECLARE
  v_plans_loaded NUMBER;
BEGIN
  v_plans_loaded := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(
    sql_id => '&good_sql_id',
    plan_hash_value => &good_php
  );
END;
/

2. Checkpoints

What it is: A checkpoint is a synchronized event where the database writes all dirty (modified) data blocks from the Buffer Cache in memory to the datafiles on disk.
Mechanism: The Checkpoint Process (CKPT) updates the headers of all datafiles and the control files to record the checkpointscn. This limits the amount of redo logs needed to be read in the event of an instance crash. [
Manual Trigger:

sql

ALTER SYSTEM CHECKPOINT;

3. Customer Requirement & Post-Setup Issue Troubleshooting

Customer Requirement: "Users are reporting severe latency across globally distributed nodes."

Solution: Implement Oracle Globally Distributed Database (Raft protocol) to distribute data geographically.
Post-Setup Troubleshooting: If distributed transactions are failing, verify node connectivity. Use V$GLOBAL_TRANSACTION to find locks or orphaned 2-phase commit states. If read traffic is high on the central node, offload reads to Oracle True Cache.

5. Memorable Career L4 Issue Resolution

The Issue: During a month-end close (highly concurrent environment), users experienced a global system freeze.

Tool Used: ASH (Active Session History) and AWR.
The Investigation:

Queried V$LOCK and V$SESSION to find the "Head of the Snake"—a massive batch update was blocking hundreds of OLTP transactions.
The culprit blocked session held a TX enqueue (row lock) and was waiting indefinitely on library cache lock due to an uncompiled PL/SQL block in a heavily accessed package.
The root cause was a DDL being executed in the middle of peak hours by another support team, invalidating package states.

The Fix & Solution:

Terminated the offending session (DDL process) safely to roll back:
sql
ALTER SYSTEM KILL SESSION '&sid,&serial#' IMMEDIATE;
Implemented an Automated Event Trigger on the database to prevent DDL executions during production peak hours (proactive compliance and risk control).

A global system freeze during a highly concurrent month-end close in Oracle 19c is typically caused by severe contention within the shared memory tier—specifically, Library Cache locks/pins, latch or mutex serializations, or automated background maintenance jobs conflicting with peak transactional application workloads. Under extreme concurrency, a single blocking session trying to invalidate or parse a highly dependent object can cause a cascading, database-wide hang.

Top Root Causes to Investigate

Automated Statistics Gathering Interference: If the default Oracle automatic maintenance task (GATHER_STATS_JOB) kicks in during month-end activities, it flushes and invalidates existing cursors. When thousands of concurrent sessions simultaneously attempt to re-parse the invalidated SQL statements, the system undergoes severe cursor: pin S wait on X or library cache lock contention.
On-the-Fly DDL or Package Recompilations: Running ad-hoc structural modifications (DDL), truncating global temporary tables, or compiling a shared PL/SQL package while thousands of sessions are executing it will immediately demand an exclusive library cache pin. This forces all incoming transactional sessions into an un-killable wait state.
Logon Storms and USER$ Contention: High volumes of financial batch processes spinning up new connections concurrently can overwhelm the Shared Pool. In Oracle 19c, connection storms frequently lock up on library cache: mutex X or library cache load lock due to internal row updates on the dictionary table tracking the Last Successful Logon Time (LSLT).
Hard Parsing via SQL Literals: Month-end reconciliation code that passes unique hardcoded literal values instead of using Bind Variables fills up the Shared Pool with un-reusable SQL statements. This induces massive latch: shared pool contention and high CPU thrashing.

Immediate Diagnostic & Resolution Steps

1. Gather Real-Time Hang Information

Because the system is "frozen," standard SQL queries may hang. You must log into SQL*Plus using the Preliminary Connection mode to safely probe the instance memory without allocating a standard session:

bash

sqlplus -prelim / as sysdba

Once inside, generate a system state dump to analyze the exact dependency graph of the freeze:

sql

ORADEBUG SETMYPID;
ORADEBUG UNLIMIT;
ORADEBUG HANGANALYZE 3;
ORADEBUG SYSTEMSTATE 266;

Review the resulting trace files in your Diagnostic Dest directory (alert/trace) to isolate the root blocking SID.

Generate an Emergency HANGANALYZE
If the database is so frozen that SQL*Plus hangs, log in as SYSDBA using the preliminary option and run a hang analysis:
bash
# From OS terminal sqlplus -prelim / as sysdba # Inside SQL*Plus ORADEBUG SETMYPID; ORADEBUG UNLIMIT; ORADEBUG HANGANALYZE 3; # Wait 1 minute ORADEBUG HANGANALYZE 3; EXIT;

2. Identify the Ultimate Blocker

If you can run standard queries, isolate the top wait events and the root session causing the cascading tree using v$session and v$session_wait:

sql

SELECT sid, serial#, username, event, p1raw, blocking_session, status 
FROM v$session 
WHERE wait_class <> 'Idle' AND status = 'ACTIVE';

Look closely for wait events like library cache lock, library cache pin, cursor: pin S wait on X, or enq: TX - row lock contention.

3. Terminate the Blocker Safely

Locate the root session at the top of the wait tree and terminate it to instantly free up the system queues:

sql

ALTER SYSTEM KILL SESSION 'sid,serial#' IMMEDIATE;

Root Cause Analysis (RCA) Synthesis

Common Root Cause	Technical Trigger	Prevention/Fix
Row-Level Enqueue Locking	Multiple batch jobs updating the same control rows simultaneously.	Implement row skipping (`SKIP LOCKED`) or batch scheduling offsets.
Missing Foreign Key Indexes	Child table locks completely during updates to parent tables under high concurrency.	Index all foreign keys involved in month-end tables.
Library Cache Pin	An administrator or job tried to compile/alter a package while users were actively running it.	Ban DDL/compilations during business hours. Use `EDITION-BASED REDEFINITION`.
ITL Shortage	Not enough Interested Transaction Lists slots allocated for high concurrent block updates.	Increase `INITRANS` on heavily hit transactional tables and indexes.

Post-Incident Actions

Pull an AWR Report: Generate an Automatic Workload Repository (AWR) report covering the 1-hour window before and during the freeze to visualize resource utilization profiles.
Review ASH Data: Query v$active_session_history for that specific time window to track historical session states even if the sessions have terminated.

Permanent Mitigation Strategies

Area [	Strategic Fix	Expected Impact
Maintenance	Lock financial table statistics or disable the automated `DBMS_STATS` window during the closing period.	Prevents cursor invalidation and post-break cascades.
Logon Storms	Apply fix for Bug 33121934 and set `_granularity_last_successful_login_time=30`.	Throttle dictionary updates to eliminate logon freeze conditions.
Shared Pool	Use Bind Variables across financial scripts or set `CURSOR_SHARING = FORCE` if changing code isn't viable.	Minimizes hard parsing and lowers latch pressure.
Architecture	Configure standard Connection Pools to hold persistent connections rather than creating new threads dynamically.	Eliminates CPU overhead caused by aggressive login loops.

1. Daily L4 Tasks & Automation

Fleet Patching & Provisioning (FPP): Manage out-of-place patching for Oracle Grid Infrastructure and databases to minimize downtime.
Database & Workload Consolidation: Manage large-scale CDB/PDB environments. Automate PDB provisioning, cloning, and refresh cycles via scripts.
Proactive Capacity Planning: Use AWR warehouse and ASH analytics to predict storage and compute bottlenecks.

EasyReliableDBA

Sunday, 5 July 2026

Oracle Database Architect(L4 Support (Architect/Lead)) Question and Answer 2026 Part 3

The Issue: During a month-end close (highly concurrent environment), users experienced a global system freeze.

No comments:

Post a Comment

Search This Blog