Oracle I/O Taxonomy
Before diagnosing anything, you need to know what types of I/O Oracle produces and which process or operation generates each one. Every wait event belongs to exactly one wait class — for I/O, that means User I/O (foreground sessions waiting on their own reads and writes) or System I/O (background processes such as DBWR, LGWR, and ARCH).
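The mapping from wait event to wait class is visible directly in the data dictionary. A minimal sketch using v$event_name, the catalog of all wait events and their classes:

```sql
-- Count I/O-related wait events per wait class.
-- v$event_name lists every wait event Oracle knows, with its wait class.
SELECT wait_class, COUNT(*) AS event_count
FROM   v$event_name
WHERE  wait_class IN ('User I/O', 'System I/O')
GROUP BY wait_class
ORDER BY wait_class;
```

Swap the aggregate for `SELECT name, wait_class ... ORDER BY wait_class, name` to browse the individual events in each class.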
Diagnosing I/O Bottlenecks
I/O diagnosis always moves from coarse to fine: system → tablespace → datafile → segment → SQL. Never jump straight to segment-level — the datafile view tells you which physical disk is the problem first.
-- Step 1: System-wide I/O wait events (all time, sorted by total wait)
SELECT event,
       wait_class,
       total_waits,
       ROUND(time_waited_micro / 1e6, 1) AS total_wait_secs,
       ROUND(time_waited_micro / total_waits / 1000, 2) AS avg_wait_ms,
       ROUND(time_waited_micro * 100 / SUM(time_waited_micro) OVER (), 1) AS pct_of_all
FROM   v$system_event
WHERE  wait_class IN ('User I/O', 'System I/O')
AND    total_waits > 0
ORDER BY time_waited_micro DESC;

-- Thresholds to know:
--   avg_wait_ms > 5ms on NVMe      = possible I/O saturation
--   avg_wait_ms > 20ms on SSD      = storage subsystem struggling
--   avg_wait_ms > 100ms on any disk = critical bottleneck
I/O Analysis by Datafile
One datafile with 10× the reads of others means a hot segment lives there. This query pinpoints the exact physical file — then you can find which segments live in it.
-- v$filestat times are in centiseconds; multiply by 10 for milliseconds
SELECT df.name AS datafile_path,
       ts.name AS tablespace_name,
       fs.phyrds AS physical_reads,
       fs.phywrts AS physical_writes,
       ROUND(fs.readtim * 10 / NULLIF(fs.phyrds, 0), 2) AS avg_read_ms,
       ROUND(fs.writetim * 10 / NULLIF(fs.phywrts, 0), 2) AS avg_write_ms,
       fs.singleblkrds AS single_block_reads,
       ROUND(fs.singleblkrdtim * 10 / NULLIF(fs.singleblkrds, 0), 2) AS avg_singleblk_ms
FROM   v$filestat fs
JOIN   v$datafile df ON df.file# = fs.file#
JOIN   v$tablespace ts ON ts.ts# = df.ts#
ORDER BY fs.phyrds + fs.phywrts DESC;
I/O Analysis by Segment
Once you know the hot datafile, find which tables and indexes live there. The segment statistics view gives you both current and historical breakdowns.
-- Current: top segments by physical reads
SELECT owner, object_name, object_type,
       logical_reads, physical_reads, physical_writes,
       ROUND(physical_reads * 100 / NULLIF(logical_reads, 0), 1) AS miss_pct
FROM   v$segment_statistics
WHERE  owner NOT IN ('SYS', 'SYSTEM')
AND    (physical_reads > 1000 OR logical_reads > 100000)
ORDER BY physical_reads DESC
FETCH FIRST 20 ROWS ONLY;

-- Historical: top segments from AWR (last 24 hours)
SELECT o.owner, o.object_name, o.object_type,
       SUM(s.logical_reads_delta)  AS logical_reads,
       SUM(s.physical_reads_delta) AS physical_reads
FROM   dba_hist_seg_stat s
JOIN   dba_hist_seg_stat_obj o ON o.dataobj# = s.dataobj# AND o.obj# = s.obj#
JOIN   dba_hist_snapshot sn ON sn.snap_id = s.snap_id
WHERE  sn.begin_interval_time > SYSDATE - 1
AND    o.owner NOT IN ('SYS', 'SYSTEM')
GROUP BY o.owner, o.object_name, o.object_type
ORDER BY physical_reads DESC
FETCH FIRST 20 ROWS ONLY;
Redo Log Tuning
Redo logs are on the critical path of every COMMIT. Getting them wrong is expensive. Getting them right is one of the highest-leverage fixes available.
-- Current log groups, sizes, status
SELECT l.group#, l.members, l.bytes/1024/1024 AS size_mb,
       l.status, l.archived, lf.member AS log_file_path
FROM   v$log l
JOIN   v$logfile lf ON lf.group# = l.group#
ORDER BY l.group#, lf.member;

-- Log switch heatmap (should be every 15-30 min in OLTP)
-- More frequent = redo logs too small
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*) AS switches
FROM   v$log_history
WHERE  first_time > SYSDATE - 2
GROUP BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER BY hour;

-- What size does Oracle recommend for your workload?
SELECT optimal_logfile_size AS recommended_log_mb
FROM   v$instance_recovery;

-- Check for log file switch waits (ARCH falling behind)
SELECT event, total_waits, time_waited,
       ROUND(time_waited / total_waits, 2) AS avg_wait_cs
FROM   v$system_event
WHERE  event LIKE 'log file switch%'
ORDER BY time_waited DESC;
-- Step 1: Add new larger groups (multiplexed on two disk groups)
ALTER DATABASE ADD LOGFILE GROUP 4
  ('/u02/oradata/redo4a.log', '/u03/oradata/redo4b.log') SIZE 500M;
ALTER DATABASE ADD LOGFILE GROUP 5
  ('/u02/oradata/redo5a.log', '/u03/oradata/redo5b.log') SIZE 500M;

-- Step 2: Force log switches to make old groups INACTIVE
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM CHECKPOINT;

-- Step 3: Drop old small groups (only when INACTIVE + ARCHIVED=YES)
ALTER DATABASE DROP LOGFILE GROUP 1;
ALTER DATABASE DROP LOGFILE GROUP 2;
ALTER DATABASE DROP LOGFILE GROUP 3;
Typical symptom: sessions stall on log file switch (archiving needed) because ARCH cannot keep pace with frequent switches. Fix: increase redo log size to 500MB — switches drop to 12/hour, ARCH keeps up easily, and batch runtime drops from 4 hours to 2.5 hours.
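To pick a target size, one rough approach is to estimate redo generated per hour as switches/hour times the current log size, then size the logs so switches land in the 2-4 per hour band. A sketch, assuming v$log_history covers a representative workload window:

```sql
-- Rough sizing: hourly redo ≈ switches/hour × current log size.
-- Dividing hourly redo by 4 targets at most 4 switches per hour.
SELECT TO_CHAR(h.first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*) AS switches,
       ROUND(COUNT(*) * MAX(l.log_bytes) / 1024 / 1024) AS approx_redo_mb,
       ROUND(COUNT(*) * MAX(l.log_bytes) / 1024 / 1024 / 4) AS suggested_log_mb
FROM   v$log_history h
CROSS JOIN (SELECT MAX(bytes) AS log_bytes FROM v$log) l
WHERE  h.first_time > SYSDATE - 1
GROUP BY TO_CHAR(h.first_time, 'YYYY-MM-DD HH24')
ORDER BY hour;
```

Size for the busiest hour, not the average, and cross-check against optimal_logfile_size in v$instance_recovery.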
Temp Tablespace Tuning
Temp tablespace I/O is almost always a symptom of PGA being too small. But when you can't add memory immediately, you need to make temp I/O as fast as possible — and prevent one runaway query from starving everyone else.
-- Who is using temp right now and how much (assumes an 8KB block size)
SELECT s.sid, s.username, s.sql_id,
       SUM(u.blocks) * 8 / 1024 AS temp_mb,
       s.event, s.seconds_in_wait
FROM   v$session s
JOIN   v$sort_usage u ON u.session_addr = s.saddr
GROUP BY s.sid, s.username, s.sql_id, s.event, s.seconds_in_wait
ORDER BY temp_mb DESC;

-- Temp tablespace size vs used
SELECT tf.tablespace_name, tf.file_name,
       tf.bytes/1024/1024 AS total_mb,
       NVL(su.used_mb, 0) AS used_mb
FROM   dba_temp_files tf
LEFT JOIN (SELECT tablespace, SUM(blocks) * 8 / 1024 AS used_mb
           FROM v$sort_usage
           GROUP BY tablespace) su
       ON su.tablespace = tf.tablespace_name;

-- Add more temp files for parallel I/O striping
ALTER TABLESPACE TEMP ADD TEMPFILE '/u04/oradata/temp02.dbf'
  SIZE 10G AUTOEXTEND ON MAXSIZE 50G;
ALTER TABLESPACE TEMP ADD TEMPFILE '/u05/oradata/temp03.dbf'
  SIZE 10G AUTOEXTEND ON MAXSIZE 50G;

-- Separate temp tablespace for report users (isolation)
CREATE TEMPORARY TABLESPACE TEMP_REPORTS
  TEMPFILE '/u04/oradata/temp_reports01.dbf' SIZE 50G
  EXTENT MANAGEMENT LOCAL UNIFORM SIZE 1M;
ALTER USER report_user1 TEMPORARY TABLESPACE TEMP_REPORTS;
ALTER USER report_user2 TEMPORARY TABLESPACE TEMP_REPORTS;
Checkpoint Tuning
A checkpoint forces DBWR to write all dirty blocks to disk and updates the control file SCN. More frequent checkpoints = faster crash recovery but higher write I/O spikes. The right tool is FAST_START_MTTR_TARGET — tell Oracle your recovery time objective and let it set checkpoint frequency automatically.
-- Tell Oracle: recover in under 60 seconds after a crash
-- Oracle works backwards to determine checkpoint frequency
ALTER SYSTEM SET fast_start_mttr_target = 60 SCOPE=BOTH;

-- Check estimated vs target MTTR
SELECT target_mttr, estimated_mttr, recovery_estimated_ios,
       actual_redo_blks,
       optimal_logfile_size AS recommended_log_mb
FROM   v$instance_recovery;
-- estimated_mttr should be <= target_mttr

-- Check DBWR is keeping up (waits > 0 = DBWR bottleneck)
-- 'free buffer waits' is a wait event, so query v$system_event
SELECT event, total_waits, time_waited
FROM   v$system_event
WHERE  event = 'free buffer waits';
DBWR Processes & Async I/O
On systems with high write I/O, a single DBWR process becomes the bottleneck. Adding DBWR processes or enabling async I/O distributes the write load — note that both changes require an instance restart to take effect.
-- Add DBWR processes (rule of thumb: 1 per 4-8 CPU cores / storage controller)
-- Requires restart
ALTER SYSTEM SET db_writer_processes = 4 SCOPE=SPFILE;

-- Enable async I/O (massive write throughput improvement)
-- Check current state first
SELECT name, value
FROM   v$parameter
WHERE  name IN ('disk_asynch_io', 'filesystemio_options');

-- Enable for both reads and writes (best option)
ALTER SYSTEM SET disk_asynch_io = TRUE SCOPE=SPFILE;
ALTER SYSTEM SET filesystemio_options = 'SETALL' SCOPE=SPFILE;
-- SETALL = enable both async I/O and direct I/O
-- Restart required for filesystemio_options
Partitioning — Eliminate I/O at the Design Level
Partitioning is the most powerful I/O reduction technique for large tables. Instead of reading millions of blocks to find rows in a date range, Oracle reads only the partition that contains those rows — often reducing I/O by 90%+.
Before partitioning: WHERE order_date > SYSDATE-7 full-scans the table — 15,360 blocks read, 45 seconds.
After partitioning by month: the same predicate touches only the current partition — ~900 blocks read, 1.2 seconds.
-- Interval partitioning: auto-creates new partitions as months arrive
CREATE TABLE orders_partitioned (
    order_id     NUMBER,
    customer_id  NUMBER,
    order_date   DATE NOT NULL,
    status       VARCHAR2(20),
    total_amount NUMBER
)
PARTITION BY RANGE (order_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
    PARTITION p_before_2024 VALUES LESS THAN (DATE '2024-01-01')
);
-- Oracle auto-creates new monthly partitions going forward

-- Verify partition pruning in the plan (look for these operations):
--   PARTITION RANGE SINGLE   → best: exactly 1 partition accessed
--   PARTITION RANGE ITERATOR → range of partitions (acceptable)
--   PARTITION RANGE ALL      → all partitions: PRUNING FAILED
EXPLAIN PLAN FOR
SELECT COUNT(*) FROM orders_partitioned
WHERE  order_date >= DATE '2026-03-01';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY(NULL, NULL, 'PARTITION'));
-- Look for: Pstart=1 Pstop=1 (one partition) vs Pstart=1 Pstop=ALL (all)

-- Partition pruning FAILS with functions on the partition key:
--   WHERE TRUNC(order_date) = TRUNC(SYSDATE) → ALL partitions scanned
-- Fix: WHERE order_date >= TRUNC(SYSDATE) AND order_date < TRUNC(SYSDATE)+1
WHERE TRUNC(order_date) = TRUNC(SYSDATE) scans ALL partitions even though it looks selective. Always use direct comparisons: WHERE order_date >= TRUNC(SYSDATE) AND order_date < TRUNC(SYSDATE)+1.
Storage Layout Best Practices
Where you put files matters enormously. The golden rule: redo logs on the fastest storage, on separate controllers from data files. LGWR writes redo synchronously on every COMMIT — it must never compete with DBWR for I/O bandwidth.
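A quick sanity check is whether any redo log member shares a mount point with a datafile. A heuristic sketch that compares the leading path component — it assumes conventional /uNN/... filesystem paths, not ASM disk groups:

```sql
-- Flag redo log members whose top-level mount point also hosts datafiles.
-- INSTR(..., '/', 1, 2) finds the second slash, so the SUBSTR yields '/u02' etc.
SELECT lf.member AS redo_member,
       df.name   AS datafile_on_same_mount
FROM   v$logfile lf
JOIN   v$datafile df
  ON   SUBSTR(lf.member, 1, INSTR(lf.member, '/', 1, 2) - 1)
     = SUBSTR(df.name,   1, INSTR(df.name,   '/', 1, 2) - 1)
ORDER BY lf.member;
```

Any rows returned mean LGWR and DBWR are contending for the same device — move the redo members to dedicated spindles or a separate ASM disk group.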
I/O Triage Toolkit
-- STEP 1: I/O pain in ASH (last 30 min)
SELECT event, wait_class,
       COUNT(*) AS ash_samples,
       ROUND(COUNT(*) * 100 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM   v$active_session_history
WHERE  sample_time > SYSDATE - 30/1440
AND    wait_class IN ('User I/O', 'System I/O')
GROUP BY event, wait_class
ORDER BY ash_samples DESC;

-- STEP 2: Top SQL causing I/O in that window
SELECT sql_id, COUNT(*) AS samples
FROM   v$active_session_history
WHERE  sample_time > SYSDATE - 30/1440
AND    wait_class IN ('User I/O', 'System I/O')
GROUP BY sql_id
ORDER BY samples DESC
FETCH FIRST 10 ROWS ONLY;

-- STEP 3: Hottest datafiles (v$filestat times are centiseconds; x10 = ms)
SELECT df.name, ts.name AS ts_name,
       fs.phyrds, fs.phywrts,
       ROUND(fs.readtim * 10 / NULLIF(fs.phyrds, 0), 2) AS avg_read_ms,
       ROUND(fs.writetim * 10 / NULLIF(fs.phywrts, 0), 2) AS avg_write_ms
FROM   v$filestat fs
JOIN   v$datafile df ON df.file# = fs.file#
JOIN   v$tablespace ts ON ts.ts# = df.ts#
WHERE  fs.phyrds + fs.phywrts > 0
ORDER BY fs.phyrds + fs.phywrts DESC
FETCH FIRST 10 ROWS ONLY;

-- STEP 4: Hot segments in that tablespace
SELECT s.segment_name, s.segment_type, s.bytes/1024/1024 AS size_mb
FROM   dba_segments s
JOIN   dba_data_files df ON df.tablespace_name = s.tablespace_name
WHERE  df.file_name = '&hot_datafile_from_step3'
ORDER BY s.bytes DESC
FETCH FIRST 20 ROWS ONLY;
Example triage: ASH shows db file scattered read at 60% of all wait time. The SQL is a weekly job: SELECT * FROM audit_log WHERE created_date < SYSDATE - 90. The AUDIT_LOG table is 80GB with no partitioning. It is doing a full table scan, flooding the buffer cache and evicting OLTP blocks — so OLTP sessions wait on db file sequential read for their index lookups. Fix: (1) Immediate — use a /*+ PARALLEL */ hint so Oracle uses direct path reads, which bypass the buffer cache and don't evict OLTP blocks. (2) Long-term — partition AUDIT_LOG by month, so the job touches only the relevant partitions.
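The immediate fix from the case above might look like the sketch below. The degree of 4 is illustrative; whether Oracle actually uses direct path reads also depends on your version's serial/parallel direct-read heuristics, so verify with the execution plan:

```sql
-- Parallel full scan uses direct path reads: blocks go straight to the
-- PX server processes' PGA, so the buffer cache (and OLTP) is untouched
SELECT /*+ PARALLEL(a, 4) */ *
FROM   audit_log a
WHERE  created_date < SYSDATE - 90;
```

Confirm in the plan that the scan shows direct path access (e.g. TABLE ACCESS FULL under PX BLOCK ITERATOR, with direct path read waits in ASH rather than db file scattered read).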
Self-Check Questions
Q1. You see db file scattered read dominating ASH, caused by a legitimate full table scan on a 500GB table that runs once daily. It's evicting OLTP blocks from the buffer cache. What two techniques can you apply immediately to prevent this scan from hurting all other sessions?

Q2. Your redo log heatmap shows 80 log switches per hour during business hours. V$INSTANCE_RECOVERY shows optimal_logfile_size = 800MB but your current logs are 100MB. Describe the exact steps to increase redo log size without taking the database offline.

Q3. A query with WHERE TRUNC(order_date) >= TRUNC(SYSDATE) - 7 on a range-partitioned table is scanning all 36 partitions instead of just the last one. Why is this happening and how exactly do you fix the query?

Q4. You add a second DBWR process (db_writer_processes = 2) but free buffer waits in V$SYSTEM_EVENT is still climbing. What else should you check, and what parameter controls how aggressively DBWR writes dirty blocks proactively?

Q5. Your Linux server has 128GB RAM and SGA_TARGET = 80GB. HugePages is not configured. Explain what is happening at the OS memory management level and what the performance impact is on a busy OLTP system with many concurrent sessions.