Chapter 06 · Storage Layer

I/O & Storage Tuning

I/O is the most common bottleneck in Oracle databases. Unlike CPU or memory, I/O problems are visible, measurable, and often fixable without touching a single line of SQL. But fixing I/O requires understanding exactly what Oracle is reading, why it's reading it, and where that data lives on disk.


Oracle I/O Taxonomy

Before diagnosing anything, you need to know what types of I/O Oracle produces and which process or operation generates each one. Every wait event maps to exactly one I/O type.

Single-block read (READ · critical path)
One 8KB block from data files into the buffer cache. Caused by index range scans, unique scans, rowid lookups. Blocks the server process until complete.
Wait event: db file sequential read

Multi-block read (READ)
8–128 contiguous blocks at once. Full table scans and fast full index scans. DB_FILE_MULTIBLOCK_READ_COUNT controls batch size.
Wait event: db file scattered read

Direct path read (READ)
Bypasses the buffer cache entirely — data goes straight to the PGA. Large parallel scans, temp reads during sort spill. Cannot benefit other sessions.
Wait events: direct path read, direct path read temp

Direct path write (WRITE)
Writes directly from the PGA to disk, bypassing the buffer cache. Sort spills to temp, direct-path INSERT (APPEND hint), parallel DML.
Wait events: direct path write, direct path write temp

DBWR writes (WRITE)
Database Writer flushes dirty blocks from the buffer cache to data files. Happens at checkpoints and when the buffer cache fills. Asynchronous — does not directly block users.
Wait event: db file parallel write

LGWR writes (WRITE · critical path)
Log Writer flushes the redo buffer to the online redo logs on every COMMIT. Directly blocks committing sessions until complete. Redo log placement on fast storage is critical.
Wait events: log file sync, log file parallel write

ARCH writes (WRITE)
Archiver copies filled redo log groups to the archive destination. If ARCH falls behind, redo log group reuse is blocked and all committing sessions stall.
Wait event: log file switch (archiving needed)

Control file I/O (READ/WRITE)
Checkpoints and log switches update the control file SCN, and multiple copies multiply this I/O. Critical file — multiplex copies across separate disks.
Wait events: control file sequential read, control file parallel write
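The taxonomy above is a fixed mapping, which can be sketched as a lookup table. A minimal Python sketch, not an Oracle API: the keys are real Oracle wait events, and the `classify()` helper and groupings simply restate the cards above.

```python
# Wait event -> (I/O type, direction), restating the taxonomy above.
# Event names are real Oracle wait events; the helper is illustrative.
IO_TYPE = {
    "db file sequential read":            ("single-block read", "READ"),
    "db file scattered read":             ("multi-block read", "READ"),
    "direct path read":                   ("direct path read", "READ"),
    "direct path read temp":              ("direct path read", "READ"),
    "direct path write":                  ("direct path write", "WRITE"),
    "direct path write temp":             ("direct path write", "WRITE"),
    "db file parallel write":             ("DBWR write", "WRITE"),
    "log file sync":                      ("LGWR write", "WRITE"),
    "log file parallel write":            ("LGWR write", "WRITE"),
    "log file switch (archiving needed)": ("ARCH write", "WRITE"),
    "control file sequential read":       ("control file I/O", "READ"),
    "control file parallel write":        ("control file I/O", "WRITE"),
}

def classify(event: str) -> str:
    """Map a wait event to its I/O type; unknown events fall through."""
    io_type, direction = IO_TYPE.get(event, ("unknown", "?"))
    return f"{event}: {io_type} ({direction})"
```

Every event maps to exactly one I/O type, so a lookup like this is all the classification ASH output ever needs.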

Diagnosing I/O Bottlenecks

I/O diagnosis always moves from coarse to fine: system → tablespace → datafile → segment → SQL. Never jump straight to segment-level — the datafile view tells you which physical disk is the problem first.

SQL — System-wide I/O wait picture
-- Step 1: System-wide I/O wait events (all time, sorted by total wait)
SELECT event, wait_class,
       total_waits,
       ROUND(time_waited_micro/1e6, 1)              AS total_wait_secs,
       ROUND(time_waited_micro/total_waits/1000, 2)  AS avg_wait_ms,
       ROUND(time_waited_micro * 100 /
         SUM(time_waited_micro) OVER (), 1)          AS pct_of_all
FROM   v$system_event
WHERE  wait_class IN ('User I/O', 'System I/O')
AND    total_waits > 0
ORDER BY time_waited_micro DESC;

-- Thresholds to know:
-- avg_wait_ms > 20ms on SSD      = storage subsystem struggling
-- avg_wait_ms > 5ms  on NVMe     = possible I/O saturation
-- avg_wait_ms > 100ms on any disk = critical bottleneck
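The thresholds in the comments can be expressed as a small helper; the cutoffs (5ms NVMe, 20ms SSD, 100ms anywhere) come straight from the listing above, while the function itself is an illustrative sketch.

```python
# Classify an average I/O wait using the thresholds from the text.
def assess_io_latency(avg_wait_ms: float, storage: str) -> str:
    if avg_wait_ms > 100:                 # any disk: critical
        return "critical bottleneck"
    if storage == "ssd" and avg_wait_ms > 20:
        return "storage subsystem struggling"
    if storage == "nvme" and avg_wait_ms > 5:
        return "possible I/O saturation"
    return "ok"
```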

I/O Analysis by Datafile

One datafile with 10× the reads of others means a hot segment lives there. This query pinpoints the exact physical file — then you can find which segments live in it.

SQL — I/O stats per datafile
SELECT df.name                          AS datafile_path,
       ts.name                          AS tablespace_name,
       fs.phyrds                        AS physical_reads,
       fs.phywrts                       AS physical_writes,
       -- READTIM/WRITETIM/SINGLEBLKRDTIM are centiseconds; *10 converts to ms
       ROUND(fs.readtim  * 10 / NULLIF(fs.phyrds,  0), 2) AS avg_read_ms,
       ROUND(fs.writetim * 10 / NULLIF(fs.phywrts, 0), 2) AS avg_write_ms,
       fs.singleblkrds                  AS single_block_reads,
       ROUND(fs.singleblkrdtim * 10 /
         NULLIF(fs.singleblkrds, 0), 2)              AS avg_singleblk_ms
FROM   v$filestat   fs
JOIN   v$datafile   df ON df.file# = fs.file#
JOIN   v$tablespace ts ON ts.ts#   = df.ts#
ORDER BY fs.phyrds + fs.phywrts DESC;
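One unit trap worth remembering: V$FILESTAT reports its timing columns in centiseconds (hundredths of a second), so converting to milliseconds means multiplying by 10. A sketch of the conversion with illustrative numbers:

```python
# V$FILESTAT time columns are centiseconds (1/100 s).
# avg latency in ms = time_cs / ops * 10
def avg_latency_ms(time_cs: int, ops: int) -> float:
    return round(time_cs / ops * 10, 2) if ops else 0.0

# 4,500 cs (45 s) spent on 30,000 reads -> 1.5 ms per read
assert avg_latency_ms(4500, 30000) == 1.5
```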

I/O Analysis by Segment

Once you know the hot datafile, find which tables and indexes live there. The segment statistics view gives you both current and historical breakdowns.

SQL — Top segments by physical reads (current + AWR)
-- Current: top segments by physical reads
SELECT owner, object_name, object_type,
       logical_reads, physical_reads, physical_writes,
       ROUND(physical_reads*100/NULLIF(logical_reads,0),1) AS miss_pct
FROM   v$segment_statistics
WHERE  owner NOT IN ('SYS','SYSTEM')
AND   (physical_reads > 1000 OR logical_reads > 100000)
ORDER BY physical_reads DESC
FETCH FIRST 20 ROWS ONLY;

-- Historical: top segments from AWR (last 24 hours)
SELECT o.owner, o.object_name, o.object_type,
       SUM(s.logical_reads_delta)  AS logical_reads,
       SUM(s.physical_reads_delta) AS physical_reads
FROM   dba_hist_seg_stat      s
JOIN   dba_hist_seg_stat_obj  o  ON  o.dataobj# = s.dataobj#
                                AND o.obj#     = s.obj#
JOIN   dba_hist_snapshot      sn ON sn.snap_id = s.snap_id
WHERE  sn.begin_interval_time > SYSDATE - 1
AND    o.owner NOT IN ('SYS','SYSTEM')
GROUP BY o.owner, o.object_name, o.object_type
ORDER BY physical_reads DESC
FETCH FIRST 20 ROWS ONLY;

Redo Log Tuning

Redo logs are on the critical path of every COMMIT. Getting them wrong is expensive. Getting them right is one of the highest-leverage fixes available.

Redo write path: LGWR flushes the redo log buffer (a circular buffer in the SGA) to the CURRENT online redo log group on COMMIT. At each log switch the groups cycle CURRENT → ACTIVE → INACTIVE, and the ARCH process copies inactive groups to the archive destination (LOG_ARCHIVE_DEST_n). Best practice is a multiplexed redo log group: member A on one fast disk or ASM disk group, member B on a different one. LGWR writes both members simultaneously and waits for the slower one.
SQL — Redo log health check + switch frequency
-- Current log groups, sizes, status
SELECT l.group#, l.members, l.bytes/1024/1024 AS size_mb,
       l.status, l.archived, lf.member AS log_file_path
FROM   v$log     l
JOIN   v$logfile lf ON lf.group# = l.group#
ORDER BY l.group#, lf.member;

-- Log switch heatmap (should be every 15-30 min in OLTP)
-- More frequent = redo logs too small
SELECT TO_CHAR(first_time,'YYYY-MM-DD HH24') AS hour,
       COUNT(*)                              AS switches
FROM   v$log_history
WHERE  first_time > SYSDATE - 2
GROUP BY TO_CHAR(first_time,'YYYY-MM-DD HH24')
ORDER BY hour;

-- What size does Oracle recommend for your workload?
SELECT optimal_logfile_size AS recommended_log_mb
FROM   v$instance_recovery;

-- Check for log file switch waits (ARCH falling behind)
SELECT event, total_waits, time_waited,
       ROUND(time_waited/total_waits,2) AS avg_wait_cs
FROM   v$system_event
WHERE  event LIKE 'log file switch%'
ORDER BY time_waited DESC;
SQL — Add larger redo log groups (cannot resize existing)
-- Step 1: Add new larger groups (multiplexed on two disk groups)
ALTER DATABASE ADD LOGFILE GROUP 4
  ('/u02/oradata/redo4a.log',
   '/u03/oradata/redo4b.log') SIZE 500M;

ALTER DATABASE ADD LOGFILE GROUP 5
  ('/u02/oradata/redo5a.log',
   '/u03/oradata/redo5b.log') SIZE 500M;

-- Step 2: Force log switches to make old groups INACTIVE
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM CHECKPOINT;

-- Step 3: Drop old small groups (only when INACTIVE + ARCHIVED=YES)
ALTER DATABASE DROP LOGFILE GROUP 1;
ALTER DATABASE DROP LOGFILE GROUP 2;
ALTER DATABASE DROP LOGFILE GROUP 3;
🔎 Real scenario — log switch storm
A nightly batch runs midnight–4 AM. The heatmap shows 120 log switches per hour — one every 30 seconds. Redo logs are 50MB. The batch generates 100MB/min of redo. Every switch triggers ARCH. ARCH can't keep up. Sessions wait on log file switch (archiving needed). Fix: increase redo log size to 500MB — switches drop to 12/hour, ARCH keeps up easily. Batch runtime drops from 4 hours to 2.5 hours.
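The scenario's arithmetic is worth writing down: switch rate is just redo generation rate divided by log size. A one-line sketch using the numbers from the text (100MB/min of redo, 50MB vs 500MB logs):

```python
# switches/hour = redo rate (MB/min) * 60 / log size (MB)
def switches_per_hour(redo_mb_per_min: float, log_size_mb: float) -> float:
    return redo_mb_per_min * 60 / log_size_mb

assert switches_per_hour(100, 50) == 120.0   # one switch every 30 seconds
assert switches_per_hour(100, 500) == 12.0   # after resizing to 500MB
```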

Temp Tablespace Tuning

Temp tablespace I/O is almost always a symptom of PGA being too small. But when you can't add memory immediately, you need to make temp I/O as fast as possible — and prevent one runaway query from starving everyone else.

SQL — Temp tablespace usage and best practices
-- Who is using temp right now and how much
SELECT s.sid, s.username, s.sql_id,
       SUM(u.blocks) * 8 / 1024 AS temp_mb,   -- assumes 8KB block size
       s.event, s.seconds_in_wait
FROM   v$session    s
JOIN   v$sort_usage u ON u.session_addr = s.saddr
GROUP BY s.sid, s.username, s.sql_id, s.event, s.seconds_in_wait
ORDER BY temp_mb DESC;

-- Temp tablespace size vs used
SELECT tf.tablespace_name, tf.file_name,
       tf.bytes/1024/1024      AS total_mb,
       NVL(su.used_mb, 0)       AS used_mb
FROM   dba_temp_files tf
LEFT JOIN (SELECT tablespace,
                  SUM(blocks)*8/1024 AS used_mb
           FROM   v$sort_usage
           GROUP BY tablespace) su
  ON su.tablespace = tf.tablespace_name;

-- Add more temp files for parallel I/O striping
ALTER TABLESPACE TEMP ADD TEMPFILE
  '/u04/oradata/temp02.dbf' SIZE 10G AUTOEXTEND ON MAXSIZE 50G;
ALTER TABLESPACE TEMP ADD TEMPFILE
  '/u05/oradata/temp03.dbf' SIZE 10G AUTOEXTEND ON MAXSIZE 50G;

-- Separate temp tablespace for report users (isolation)
CREATE TEMPORARY TABLESPACE TEMP_REPORTS
  TEMPFILE '/u04/oradata/temp_reports01.dbf' SIZE 50G
  EXTENT MANAGEMENT LOCAL UNIFORM SIZE 1M;

ALTER USER report_user1 TEMPORARY TABLESPACE TEMP_REPORTS;
ALTER USER report_user2 TEMPORARY TABLESPACE TEMP_REPORTS;

Checkpoint Tuning

A checkpoint forces DBWR to write all dirty blocks to disk and updates the control file SCN. More frequent checkpoints = faster crash recovery but higher write I/O spikes. The right tool is FAST_START_MTTR_TARGET — tell Oracle your recovery time objective and let it set checkpoint frequency automatically.

SQL — Checkpoint tuning with MTTR target
-- Tell Oracle: recover in under 60 seconds after a crash
-- Oracle works backwards to determine checkpoint frequency
ALTER SYSTEM SET fast_start_mttr_target = 60 SCOPE=BOTH;

-- Check estimated vs target MTTR
SELECT target_mttr, estimated_mttr,
       recovery_estimated_ios, actual_redo_blks,
       optimal_logfile_size AS recommended_log_mb
FROM   v$instance_recovery;
-- estimated_mttr should be <= target_mttr

-- Check DBWR is keeping up (free_buffer_waits > 0 = DBWR bottleneck)
SELECT name, value FROM v$sysstat
WHERE  name = 'free buffer waits';

DBWR Processes & Async I/O

On systems with high write I/O, a single DBWR process becomes the bottleneck. Adding DBWR processes or enabling async I/O distributes the write load; both are SPFILE changes that take effect at the next restart, so plan them together.

SQL — DBWR tuning and async I/O
-- Add DBWR processes (rule: 1 per 4-8 CPU cores / storage controller)
-- Requires restart
ALTER SYSTEM SET db_writer_processes = 4 SCOPE=SPFILE;

-- Enable async I/O (massive write throughput improvement)
-- Check current state first
SELECT name, value FROM v$parameter
WHERE  name IN ('disk_asynch_io', 'filesystemio_options');

-- Enable for both reads and writes (best option)
ALTER SYSTEM SET disk_asynch_io        = TRUE   SCOPE=SPFILE;
ALTER SYSTEM SET filesystemio_options  = 'SETALL' SCOPE=SPFILE;
-- SETALL = enable both async I/O and direct I/O
-- Restart required for filesystemio_options

Partitioning — Eliminate I/O at the Design Level

Partitioning is the most powerful I/O reduction technique for large tables. Instead of reading millions of blocks to find rows in a date range, Oracle reads only the partition that contains those rows — often reducing I/O by 90%+.

Without partitioning — 120GB scan
  2024 Q1 — 10GB
  2024 Q2 — 10GB
  2024 Q3 — 10GB
  2024 Q4 — 10GB
  ... 8 more quarters ...
  Query: WHERE order_date > SYSDATE-7
  Result: ~15.7 million blocks read — 45 seconds

With range partitioning — pruning works
  2024 Q1 — skipped
  2024 Q2 — skipped
  ... 9 more quarters skipped ...
  2026 Q1 — ACCESSED ONLY
  Partition pruning: skip 11 partitions
  Result: ~900 blocks read — 1.2 seconds
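The block counts above can be checked with quick arithmetic, assuming 8KB blocks (the 120GB table and 10GB quarterly partitions are the figures from the diagram):

```python
# Block-count arithmetic for the pruning comparison, assuming 8KB blocks.
BLOCK_KB = 8

def blocks(gb: float) -> int:
    return int(gb * 1024 * 1024 // BLOCK_KB)

full_scan = blocks(120)      # all 12 quarters: ~15.7 million blocks
one_partition = blocks(10)   # a single quarter: ~1.3 million blocks
assert full_scan == 15_728_640
assert full_scan // one_partition == 12   # pruning alone cuts I/O 12x
```

Pruning gets you to one partition's worth of blocks; within-partition filtering (indexes, storage indexes) then reduces the actual reads much further, which is where the ~900-block result comes from.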
SQL — Interval partitioning + verify pruning
-- Interval partitioning: auto-creates new partitions as months arrive
CREATE TABLE orders_partitioned (
  order_id     NUMBER,
  customer_id  NUMBER,
  order_date   DATE NOT NULL,
  status       VARCHAR2(20),
  total_amount NUMBER
)
PARTITION BY RANGE (order_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
  PARTITION p_before_2024
    VALUES LESS THAN (DATE '2024-01-01')
);
-- Oracle auto-creates new monthly partitions going forward

-- Verify partition pruning in the plan (look for these operations):
-- PARTITION RANGE SINGLE     → best: exactly 1 partition accessed
-- PARTITION RANGE ITERATOR   → range of partitions (acceptable)
-- PARTITION RANGE ALL        → all partitions: PRUNING FAILED
EXPLAIN PLAN FOR
SELECT COUNT(*) FROM orders_partitioned
WHERE  order_date >= DATE '2026-03-01';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY(NULL,NULL,'PARTITION'));
-- Look for: Pstart=1 Pstop=1 (one partition) vs Pstart=1 Pstop=ALL (all)

-- Partition pruning FAILS with functions on partition key:
-- WHERE TRUNC(order_date) = TRUNC(SYSDATE)  → ALL partitions scanned
-- Fix: WHERE order_date >= TRUNC(SYSDATE) AND order_date < TRUNC(SYSDATE)+1
🚨 The partition pruning trap
Wrapping the partition key in a function kills pruning silently. WHERE TRUNC(order_date) = TRUNC(SYSDATE) scans ALL partitions even though it looks selective. Always use direct comparisons: WHERE order_date >= TRUNC(SYSDATE) AND order_date < TRUNC(SYSDATE)+1.
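The trap can be modeled in a few lines. A toy model of the optimizer's view, assuming monthly partitions; this is illustrative only, not how the optimizer is actually implemented: a direct predicate on the raw key can be compared against partition bounds, but a function-wrapped key is opaque, so every partition is scanned.

```python
from datetime import date

# Monthly partitions for 2026 as (lower_bound, upper_bound) pairs.
partitions = [(date(2026, m, 1), date(2026, m + 1, 1)) for m in range(1, 12)]

def pruned(pred_lo: date) -> list:
    # Direct predicate order_date >= pred_lo: skip any partition
    # whose upper bound is at or below the predicate's lower bound.
    return [p for p in partitions if p[1] > pred_lo]

def function_wrapped() -> list:
    # TRUNC(order_date) = ... : bounds are unusable, all partitions accessed.
    return partitions

assert len(pruned(date(2026, 3, 1))) == 9   # March through November only
assert len(function_wrapped()) == 11        # pruning failed
```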

Storage Layout Best Practices

Where you put files matters enormously. The golden rule: redo logs on the fastest storage, on separate controllers from data files. LGWR writes redo synchronously on every COMMIT — it must never compete with DBWR for I/O bandwidth.

Fastest disk (NVMe)
  • Redo log group 1 — member A
  • Redo log group 2 — member A
  • Redo log group 3 — member A
  • Control file — copy 1 of 3
Fast SSD — DATA
  • SYSTEM tablespace
  • SYSAUX tablespace
  • APPS_DATA — hot tables
  • APPS_IDX — indexes
Second SSD — DG2
  • Redo log group 1 — member B (mirror)
  • Redo log group 2 — member B (mirror)
  • TEMP tablespace — sort spill
  • Control file — copy 2 of 3
High-capacity — RECO
  • Archive logs
  • RMAN backups
  • FRA — flashback logs
  • Control file — copy 3 of 3
⚠ LGWR writes both redo members simultaneously
LGWR writes to all members of the current group at the same time and waits for the slowest one. If member A is on NVMe (0.5ms) and member B is on spinning disk (8ms), every COMMIT takes 8ms. Keep both members on equally fast storage.
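The warning reduces to one line of logic; a sketch with illustrative member latencies:

```python
# A multiplexed group commits at the speed of its slowest member:
# LGWR issues writes to all members in parallel, then waits for all.
def commit_write_ms(member_latencies_ms: list) -> float:
    return max(member_latencies_ms)

assert commit_write_ms([0.5, 0.5]) == 0.5   # both members on NVMe
assert commit_write_ms([0.5, 8.0]) == 8.0   # one member on spinning disk
```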

I/O Triage Toolkit

01 · Find the I/O pain in ASH (last 30 min)
Filter ASH for wait_class IN ('User I/O','System I/O') — shows which I/O event dominates and which SQL_IDs cause it.
02 · Find the hottest datafiles
V$FILESTAT — sort by phyrds+phywrts. One file dominating = a hot segment lives there. Check avg_read_ms — >20ms on SSD is a problem.
03 · Find which segments live in the hot file
V$SEGMENT_STATISTICS or DBA_SEGMENTS joined to DBA_DATA_FILES by tablespace_name. Now you know the exact table or index causing the I/O.
04 · Check redo log health
V$LOG_HISTORY heatmap for switch frequency. V$SYSTEM_EVENT for log file switch waits. V$INSTANCE_RECOVERY for recommended log size.
05 · Route to the right fix
Full table scan causing scattered reads → missing index or partitioning. log file sync → slow redo disk, too many commits, or logs too small. direct path temp → PGA too small (fix in Chapter 5).
SQL — Complete I/O triage queries
-- STEP 1: I/O pain in ASH (last 30 min)
SELECT event, wait_class,
       COUNT(*)                                          AS ash_samples,
       ROUND(COUNT(*)*100/SUM(COUNT(*)) OVER(),1)     AS pct
FROM   v$active_session_history
WHERE  sample_time > SYSDATE - 30/1440
AND    wait_class IN ('User I/O','System I/O')
GROUP BY event, wait_class
ORDER BY ash_samples DESC;

-- STEP 2: Top SQL causing I/O in that window
SELECT sql_id, COUNT(*) AS samples
FROM   v$active_session_history
WHERE  sample_time > SYSDATE - 30/1440
AND    wait_class IN ('User I/O','System I/O')
GROUP BY sql_id
ORDER BY samples DESC
FETCH FIRST 10 ROWS ONLY;

-- STEP 3: Hottest datafiles
SELECT df.name, ts.name AS ts_name,
       fs.phyrds, fs.phywrts,
       -- readtim/writetim are centiseconds; *10 converts to ms
       ROUND(fs.readtim*10/NULLIF(fs.phyrds,0),2)   AS avg_read_ms,
       ROUND(fs.writetim*10/NULLIF(fs.phywrts,0),2) AS avg_write_ms
FROM   v$filestat fs
JOIN   v$datafile df ON df.file# = fs.file#
JOIN   v$tablespace ts ON ts.ts# = df.ts#
WHERE  fs.phyrds + fs.phywrts > 0
ORDER BY fs.phyrds + fs.phywrts DESC
FETCH FIRST 10 ROWS ONLY;

-- STEP 4: Hot segments in that tablespace
SELECT s.segment_name, s.segment_type,
       s.bytes/1024/1024 AS size_mb
FROM   dba_segments s
JOIN   dba_data_files df ON df.tablespace_name = s.tablespace_name
WHERE  df.file_name = '&hot_datafile_from_step3'
ORDER BY s.bytes DESC
FETCH FIRST 20 ROWS ONLY;
🔎 Real scenario — "everything slowed at 2AM"
You check ASH for 2–3 AM and find db file scattered read at 60% of all wait time. The SQL is a weekly job: SELECT * FROM audit_log WHERE created_date < SYSDATE - 90. The AUDIT_LOG table is 80GB with no partitioning. It's doing a full table scan, flooding the buffer cache and evicting OLTP blocks — causing OLTP sessions to wait on db file sequential read for their indexes. Fix: (1) Immediate — use /*+ PARALLEL */ hint so Oracle uses direct path reads (bypasses buffer cache, doesn't evict OLTP blocks). (2) Long-term — partition AUDIT_LOG by month — the job then touches only the relevant partition.

Self-Check Questions

  • Q1 You see db file scattered read dominating ASH, caused by a legitimate full table scan on a 500GB table that runs once daily. It's evicting OLTP blocks from the buffer cache. What two techniques can you apply immediately to prevent this scan from hurting all other sessions?
  • Q2 Your redo log heatmap shows 80 log switches per hour during business hours. V$INSTANCE_RECOVERY shows optimal_logfile_size = 800MB but your current logs are 100MB. Describe the exact steps to increase redo log size without taking the database offline.
  • Q3 A query with WHERE TRUNC(order_date) >= TRUNC(SYSDATE) - 7 on a range-partitioned table is scanning all 36 partitions instead of just the last one. Why is this happening and how exactly do you fix the query?
  • Q4 You add a second DBWR process (db_writer_processes = 2) but free buffer waits in V$SYSSTAT is still climbing. What else should you check, and what parameter controls how aggressively DBWR writes dirty blocks proactively?
  • Q5 Your Linux server has 128GB RAM and SGA_TARGET = 80GB. HugePages is not configured. Explain what is happening at the OS memory management level and what the performance impact is on a busy OLTP system with many concurrent sessions.