PostgreSQL

Query Optimization: EXPLAIN ANALYZE Deep Dive

An advanced guide to PostgreSQL query optimization with EXPLAIN ANALYZE, covering node types, join algorithms, cost estimation, plan reading, slow query diagnosis, and practical optimization techniques.

Overview

EXPLAIN ANALYZE is the single most important tool in your PostgreSQL performance toolkit. It shows you exactly what the database engine does when it executes your query — which indexes it uses, how it joins tables, where it spends time, and where your assumptions about performance are wrong. If you are running PostgreSQL in production and you have never read an EXPLAIN ANALYZE output top to bottom, you are flying blind.

Prerequisites

  • PostgreSQL 13+ (version requirements for specific features are noted inline)
  • Solid understanding of SQL (JOINs, subqueries, CTEs, aggregations)
  • Familiarity with indexing concepts (B-tree, hash, GIN)
  • Node.js and the pg module for the application-layer examples
  • A database with enough data to make query plans interesting (at least tens of thousands of rows)

EXPLAIN vs EXPLAIN ANALYZE

There are two forms of EXPLAIN, and they are fundamentally different.

EXPLAIN (without ANALYZE) asks the planner: "What would you do if I ran this query?" It produces an estimated plan based on table statistics. It does not execute the query.

EXPLAIN SELECT * FROM orders WHERE customer_id = 4821;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using idx_orders_customer_id on orders  (cost=0.43..12.58 rows=7 width=94)
   Index Cond: (customer_id = 4821)

EXPLAIN ANALYZE actually runs the query and shows you both the estimated plan and the actual execution results side by side.

EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 4821;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_orders_customer_id on orders  (cost=0.43..12.58 rows=7 width=94) (actual time=0.028..0.035 rows=6 loops=1)
   Index Cond: (customer_id = 4821)
 Planning Time: 0.089 ms
 Execution Time: 0.058 ms

The critical difference: EXPLAIN guesses. EXPLAIN ANALYZE measures. When the estimates diverge from actuals, that is where your performance problems live.

Warning: EXPLAIN ANALYZE executes the query. For INSERT, UPDATE, or DELETE statements, wrap it in a transaction and roll back:

BEGIN;
EXPLAIN ANALYZE DELETE FROM orders WHERE status = 'cancelled';
ROLLBACK;

Reading the Output Tree

Every EXPLAIN output is a tree of plan nodes. The query executor processes this tree bottom-up — leaf nodes execute first, and their results flow upward.

EXPLAIN ANALYZE
SELECT c.name, COUNT(o.id) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.created_at > '2024-01-01'
GROUP BY c.name
ORDER BY order_count DESC
LIMIT 10;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1842.56..1842.59 rows=10 width=44) (actual time=24.381..24.384 rows=10 loops=1)
   ->  Sort  (cost=1842.56..1848.23 rows=2268 width=44) (actual time=24.379..24.381 rows=10 loops=1)
         Sort Key: (count(o.id)) DESC
         Sort Method: top-N heapsort  Memory: 26kB
         ->  HashAggregate  (cost=1780.12..1802.80 rows=2268 width=44) (actual time=23.985..24.198 rows=2268 loops=1)
               Group Key: c.name
               Batches: 1  Memory Usage: 369kB
               ->  Hash Join  (cost=112.45..1735.78 rows=8868 width=40) (actual time=1.204..19.877 rows=8734 loops=1)
                     Hash Cond: (o.customer_id = c.id)
                     ->  Seq Scan on orders o  (cost=0.00..1443.00 rows=50000 width=12) (actual time=0.008..6.742 rows=50000 loops=1)
                     ->  Hash  (cost=98.12..98.12 rows=1146 width=36) (actual time=1.178..1.179 rows=2268 loops=1)
                           Buckets: 4096  Batches: 1  Memory Usage: 168kB
                           ->  Seq Scan on customers c  (cost=0.00..98.12 rows=1146 width=36) (actual time=0.011..0.712 rows=2268 loops=1)
                                 Filter: (created_at > '2024-01-01 00:00:00'::timestamp without time zone)
                                 Rows Removed by Filter: 2732

Each node has these fields:

  • cost=0.43..12.58: estimated startup cost .. total cost (in arbitrary planner units)
  • rows=7: estimated number of rows returned
  • width=94: estimated average row width in bytes
  • actual time=0.028..0.035: actual startup time .. total time in milliseconds
  • rows=6: actual number of rows returned
  • loops=1: number of times this node was executed

Important: When loops is greater than 1, multiply the actual time by loops to get the true wall-clock cost. The reported times are per-loop averages.

Startup Cost vs Total Cost

The two cost numbers are not a range — they represent different things:

  • Startup cost — the cost before the node can emit its first row. For a Sort node, this includes sorting the entire input. For a Seq Scan, this is essentially zero.
  • Total cost — the cost to emit all rows.

This matters for queries with LIMIT. If a node has a high startup cost (like a Sort), adding LIMIT 10 does not make the sort cheaper — it still must sort everything before returning the first row. But an Index Scan has near-zero startup cost and can return rows immediately, making it efficient with LIMIT.

-- Sort has high startup cost — LIMIT does not help the sort itself
EXPLAIN ANALYZE SELECT * FROM orders ORDER BY total_amount DESC LIMIT 10;
 Limit  (cost=4892.33..4892.36 rows=10 width=94) (actual time=42.517..42.521 rows=10 loops=1)
   ->  Sort  (cost=4892.33..5017.33 rows=50000 width=94) (actual time=42.515..42.517 rows=10 loops=1)
         Sort Key: total_amount DESC
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Seq Scan on orders  (cost=0.00..1443.00 rows=50000 width=94) (actual time=0.009..8.131 rows=50000 loops=1)

Now with an index on total_amount:

CREATE INDEX idx_orders_total_amount ON orders (total_amount DESC);
EXPLAIN ANALYZE SELECT * FROM orders ORDER BY total_amount DESC LIMIT 10;
 Limit  (cost=0.29..1.04 rows=10 width=94) (actual time=0.031..0.047 rows=10 loops=1)
   ->  Index Scan using idx_orders_total_amount on orders  (cost=0.29..3768.29 rows=50000 width=94) (actual time=0.029..0.043 rows=10 loops=1)

The Index Scan's total cost (3768) is higher than the bare Seq Scan's (1443), but its startup cost is near zero, so the Limit node can stop after 10 rows having done almost no work. The planner knows this.

Scan Node Types

Seq Scan (Sequential Scan)

Reads every row in the table from disk, in physical order. This is the baseline — no index needed, no random I/O. On small tables, this is often faster than an index scan.

Seq Scan on orders  (cost=0.00..1443.00 rows=50000 width=94)
  Filter: (status = 'pending')
  Rows Removed by Filter: 47234

When Rows Removed by Filter is much larger than the number of rows actually returned, that is a sign an index might help.

Index Scan

Traverses a B-tree index to find matching rows, then fetches the actual row data from the heap (table).

Index Scan using idx_orders_status on orders  (cost=0.29..85.42 rows=2766 width=94)
  Index Cond: (status = 'pending')

Good for selective queries (returning a small percentage of rows). Involves random I/O to the heap, which is why Postgres might prefer a Seq Scan if the query returns a large percentage of the table.

Index Only Scan

Like an Index Scan, but all requested columns are in the index — no heap fetch needed. This is as fast as it gets.

Index Only Scan using idx_orders_customer_status on orders  (cost=0.29..42.18 rows=12 width=8)
  Index Cond: (customer_id = 4821)
  Heap Fetches: 0

Heap Fetches: 0 is the ideal. If this number is high, run VACUUM on the table to update the visibility map.
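
A quick way to check how much of the table is marked all-visible, here for the orders table used throughout:

SELECT relpages, relallvisible FROM pg_class WHERE relname = 'orders';

-- VACUUM updates the visibility map; ANALYZE refreshes planner statistics
VACUUM ANALYZE orders;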

Bitmap Index Scan + Bitmap Heap Scan

A two-phase approach. First, the Bitmap Index Scan reads the index and builds a bitmap of which table pages contain matching rows. Then the Bitmap Heap Scan reads those pages sequentially.

Bitmap Heap Scan on orders  (cost=52.18..1298.42 rows=2766 width=94)
  Recheck Cond: (status = 'pending')
  Heap Blocks: exact=842
  ->  Bitmap Index Scan on idx_orders_status  (cost=0.00..51.49 rows=2766 width=0)
        Index Cond: (status = 'pending')

This is Postgres's sweet spot between Seq Scan and Index Scan — good for medium-selectivity queries. It avoids the random I/O of Index Scan by reading pages in physical order.

Multiple bitmap scans can be combined with BitmapAnd or BitmapOr for queries with multiple conditions:

EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'pending' AND region = 'us-west';
Bitmap Heap Scan on orders  (cost=72.38..498.12 rows=312 width=94)
  Recheck Cond: ((status = 'pending') AND (region = 'us-west'))
  ->  BitmapAnd  (cost=72.38..72.38 rows=312 width=0)
        ->  Bitmap Index Scan on idx_orders_status  (cost=0.00..51.49 rows=2766 width=0)
              Index Cond: (status = 'pending')
        ->  Bitmap Index Scan on idx_orders_region  (cost=0.00..20.64 rows=5641 width=0)
              Index Cond: (region = 'us-west')

Join Algorithms

PostgreSQL has three join strategies, and the planner picks the cheapest one based on table sizes, available indexes, and sort order.

Nested Loop

For each row in the outer table, scan the inner table for matches. Fast when the outer table is small and the inner table has an index on the join column.

Nested Loop  (cost=0.29..845.72 rows=12 width=130) (actual time=0.038..0.142 rows=12 loops=1)
  ->  Index Scan using idx_customers_email on customers c  (cost=0.29..8.30 rows=1 width=36) (actual time=0.021..0.023 rows=1 loops=1)
        Index Cond: (email = '[email protected]')
  ->  Index Scan using idx_orders_customer_id on orders o  (cost=0.29..837.30 rows=12 width=94) (actual time=0.014..0.108 rows=12 loops=1)
        Index Cond: (customer_id = c.id)

This is the right choice here: one customer, few orders. But Nested Loop with a large outer table and no inner index is catastrophic.

Hash Join

Builds a hash table from the smaller (inner) table, then probes it for each row from the larger (outer) table. No index needed. Good for large equi-joins.

Hash Join  (cost=112.45..1735.78 rows=8868 width=40) (actual time=1.204..19.877 rows=8734 loops=1)
  Hash Cond: (o.customer_id = c.id)
  ->  Seq Scan on orders o  (cost=0.00..1443.00 rows=50000 width=12) (actual time=0.008..6.742 rows=50000 loops=1)
  ->  Hash  (cost=98.12..98.12 rows=1146 width=36) (actual time=1.178..1.179 rows=2268 loops=1)
        Buckets: 4096  Batches: 1  Memory Usage: 168kB

Watch the Batches count. If it exceeds 1, the hash table spilled to disk because work_mem is too small. This kills performance.

Merge Join

Both inputs must be sorted on the join key. Postgres walks through both sorted sets simultaneously. Efficient when both inputs are already sorted (e.g., via an index).

Merge Join  (cost=0.57..3842.18 rows=50000 width=130) (actual time=0.034..28.412 rows=50000 loops=1)
  Merge Cond: (c.id = o.customer_id)
  ->  Index Scan using customers_pkey on customers c  (cost=0.29..248.29 rows=5000 width=36) (actual time=0.012..1.842 rows=5000 loops=1)
  ->  Index Scan using idx_orders_customer_id on orders o  (cost=0.29..2843.29 rows=50000 width=94) (actual time=0.009..14.218 rows=50000 loops=1)

Merge Join is common in large analytical queries that join pre-sorted data. If both sides need sorting first, the planner usually prefers Hash Join instead.

Sort Operations

Sort

Standard sort operation. Pay attention to the Sort Method:

Sort  (cost=4892.33..5017.33 rows=50000 width=94) (actual time=42.515..48.218 rows=50000 loops=1)
  Sort Key: created_at DESC
  Sort Method: external merge  Disk: 5624kB

external merge Disk means the sort spilled to disk. This is slow. Increase work_mem:

SET work_mem = '16MB';

After increasing work_mem:

Sort  (cost=4127.33..4252.33 rows=50000 width=94) (actual time=31.207..35.442 rows=50000 loops=1)
  Sort Key: created_at DESC
  Sort Method: quicksort  Memory: 7428kB

Now it fits in memory. Much faster.

Incremental Sort (PostgreSQL 13+)

When data is already partially sorted, Incremental Sort only sorts within groups of the presorted prefix.

EXPLAIN ANALYZE SELECT * FROM orders ORDER BY customer_id, created_at DESC;

If there is an index on customer_id, Postgres can do an Incremental Sort:

Incremental Sort  (cost=0.48..6842.48 rows=50000 width=94) (actual time=0.078..28.914 rows=50000 loops=1)
  Sort Key: customer_id, created_at DESC
  Presorted Key: customer_id
  Full-sort Groups: 1563  Sort Method: quicksort  Average Memory: 28kB  Peak Memory: 32kB

This uses much less memory because it only sorts small groups at a time.

Aggregate Operations

HashAggregate

Builds a hash table of groups. Fast, but requires enough memory to hold all groups.

HashAggregate  (cost=1780.12..1802.80 rows=2268 width=44) (actual time=23.985..24.198 rows=2268 loops=1)
  Group Key: c.name
  Batches: 1  Memory Usage: 369kB

If Batches > 1, the aggregate spilled to disk. Increase work_mem.

GroupAggregate

Requires sorted input. Streams through sorted data and aggregates group by group. Uses minimal memory but needs the sort cost up front.

GroupAggregate  (cost=0.29..4218.29 rows=5000 width=44) (actual time=0.042..32.187 rows=5000 loops=1)
  Group Key: customer_id
  ->  Index Scan using idx_orders_customer_id on orders  (cost=0.29..3768.29 rows=50000 width=94) (actual time=0.028..18.412 rows=50000 loops=1)

The planner chooses GroupAggregate when the data is already sorted (e.g., from an index scan) and HashAggregate when it would need to sort first.
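
To compare the two strategies on the same query, you can temporarily disable hash aggregation for the current session. This is a testing knob, not a production setting:

SET enable_hashagg = off;
EXPLAIN ANALYZE SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;
RESET enable_hashagg;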

CTEs and Materialization

Before PostgreSQL 12, CTEs were always materialized — the CTE result was computed once and stored in a temporary buffer, creating an optimization fence. The planner could not push predicates into the CTE.

Starting in PostgreSQL 12, the planner can inline CTEs that are referenced once. You can force the old behavior with MATERIALIZED:

-- Inlined (PostgreSQL 12+, referenced once)
WITH recent_orders AS (
    SELECT * FROM orders WHERE created_at > '2025-01-01'
)
SELECT * FROM recent_orders WHERE status = 'pending';
Index Scan using idx_orders_status_created on orders  (cost=0.29..85.42 rows=124 width=94)
  Index Cond: (status = 'pending')
  Filter: (created_at > '2025-01-01'::date)

The CTE was eliminated. The planner combined both predicates. Now force materialization:

WITH recent_orders AS MATERIALIZED (
    SELECT * FROM orders WHERE created_at > '2025-01-01'
)
SELECT * FROM recent_orders WHERE status = 'pending';
CTE Scan on recent_orders  (cost=1842.00..2217.00 rows=5000 width=94)
  Filter: (status = 'pending')
  ->  CTE recent_orders
        ->  Seq Scan on orders  (cost=0.00..1843.00 rows=15000 width=94)
              Filter: (created_at > '2025-01-01'::date)

Now the status = 'pending' filter cannot be pushed down. It scans 15,000 rows from the CTE and filters afterward. Use MATERIALIZED only when you know the CTE result is expensive to compute and is referenced multiple times.
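
The opposite hint exists as well: NOT MATERIALIZED forces inlining even when the CTE is referenced more than once, at the cost of computing it at each reference.

WITH recent_orders AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE created_at > '2025-01-01'
)
SELECT * FROM recent_orders WHERE status = 'pending';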

Subquery Execution: SubPlan and InitPlan

InitPlan

A subquery that runs once and its result is reused. Shown at the top of the plan:

EXPLAIN ANALYZE SELECT * FROM orders WHERE total_amount > (SELECT AVG(total_amount) FROM orders);
Seq Scan on orders  (cost=1568.00..3011.00 rows=16667 width=94) (actual time=8.412..18.724 rows=17234 loops=1)
  Filter: (total_amount > $0)
  Rows Removed by Filter: 32766
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=1443.00..1443.01 rows=1 width=32) (actual time=8.389..8.390 rows=1 loops=1)
          ->  Seq Scan on orders orders_1  (cost=0.00..1443.00 rows=50000 width=8) (actual time=0.005..4.812 rows=50000 loops=1)

The subquery runs once (loops=1), computes the average, and the outer query uses it as a constant.

SubPlan

A correlated subquery that runs once per outer row. This is usually a performance disaster:

EXPLAIN ANALYZE
SELECT c.name,
       (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS order_count
FROM customers c;
Seq Scan on customers c  (cost=0.00..91443.00 rows=5000 width=44) (actual time=0.048..412.187 rows=5000 loops=1)
  SubPlan 1
    ->  Aggregate  (cost=18.15..18.16 rows=1 width=8) (actual time=0.079..0.079 rows=1 loops=5000)
          ->  Index Only Scan using idx_orders_customer_id on orders o  (cost=0.29..18.12 rows=10 width=0) (actual time=0.028..0.068 rows=10 loops=5000)
                Index Cond: (customer_id = c.id)

The subquery runs 5,000 times (loops=5000). Total time: 0.079ms * 5000 = 395ms. Rewrite this as a JOIN:

SELECT c.name, COUNT(o.id) AS order_count
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;

EXPLAIN Options

BUFFERS

Shows how many pages were read from cache (shared hit) vs disk (read):

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 4821;
Index Scan using idx_orders_customer_id on orders  (cost=0.29..12.58 rows=7 width=94) (actual time=0.028..0.035 rows=6 loops=1)
  Index Cond: (customer_id = 4821)
  Buffers: shared hit=4
Planning:
  Buffers: shared hit=12

shared hit=4 means 4 pages were read from PostgreSQL's shared buffer cache. If you see shared read=4, those pages came from disk (or OS cache). High read counts indicate a cold cache or insufficient shared_buffers.
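
For a table-level view of the same hit-versus-read split, accumulated since the last statistics reset, pg_statio_user_tables is useful:

SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       round(heap_blks_hit::numeric
             / NULLIF(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;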

FORMAT JSON

Machine-readable output, great for programmatic analysis:

EXPLAIN (ANALYZE, FORMAT JSON) SELECT * FROM orders WHERE customer_id = 4821;

SETTINGS (PostgreSQL 12+)

Shows non-default planner settings that affected the plan:

SET work_mem = '256MB';
EXPLAIN (ANALYZE, SETTINGS) SELECT * FROM orders ORDER BY created_at;
Sort  (cost=4127.33..4252.33 rows=50000 width=94) (actual time=31.207..35.442 rows=50000 loops=1)
  Sort Key: created_at
  Sort Method: quicksort  Memory: 7428kB
Settings: work_mem = '256MB'

WAL (PostgreSQL 13+)

Shows WAL (Write-Ahead Log) generation for DML statements:

EXPLAIN (ANALYZE, WAL, BUFFERS) INSERT INTO orders (customer_id, total_amount) VALUES (1, 99.99);

Useful for understanding the write amplification of your queries.

Identifying Slow Nodes

The key skill is comparing estimated rows to actual rows. When they diverge, the planner made a bad decision.

Hash Join  (cost=112.45..235.78 rows=10 width=40) (actual time=1.204..847.212 rows=48721 loops=1)

The planner estimated 10 rows but got 48,721. This means:

  1. The hash table was undersized
  2. Downstream nodes were not prepared for this volume
  3. The total time (847ms) is far beyond what the cost model predicted

Root cause: Stale statistics. Fix it:

ANALYZE orders;
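
To confirm that stale statistics really were the culprit, pg_stat_user_tables shows when the table was last analyzed and how many rows have changed since:

SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'orders';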

Another pattern — watch for high actual time on a single node relative to its children:

Sort  (cost=5892.33..6017.33 rows=50000 width=94) (actual time=312.515..358.218 rows=50000 loops=1)
  Sort Key: created_at DESC
  Sort Method: external merge  Disk: 12824kB
  ->  Seq Scan on orders  (cost=0.00..1443.00 rows=50000 width=94) (actual time=0.009..8.131 rows=50000 loops=1)

The Seq Scan took 8ms but the Sort took 312ms and spilled to disk. The fix is either increasing work_mem or adding an index on created_at DESC.
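
The index route might look like this (the index name is illustrative); with it, rows come back already in the requested order and the Sort node disappears from the plan:

CREATE INDEX idx_orders_created_at_desc ON orders (created_at DESC);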

Common Red Flags

  1. Seq Scan on large tables with a selective filter — you almost certainly need an index
  2. Rows estimate off by 10x or more — run ANALYZE, check for correlated columns, or use extended statistics
  3. Nested Loop with high outer row count and no inner index — produces O(n*m) behavior
  4. Hash Batches > 1 — hash table spilled to disk, increase work_mem
  5. external merge Disk on Sort nodes — same fix, increase work_mem
  6. Bitmap Heap Scan with lossy blocks — work_mem too low for the bitmap, resulting in page-level (lossy) rather than row-level filtering
  7. SubPlan with high loops count — rewrite as a JOIN

Fixing Bad Plans

Adding Indexes

The most common fix. Target columns that appear in WHERE, JOIN ON, and ORDER BY:

-- Composite index for a common query pattern
CREATE INDEX idx_orders_status_created ON orders (status, created_at DESC);

-- Covering index to enable Index Only Scan
CREATE INDEX idx_orders_covering ON orders (customer_id) INCLUDE (total_amount, status);

-- Partial index for a common filter
CREATE INDEX idx_orders_pending ON orders (created_at DESC) WHERE status = 'pending';

Rewriting Queries

Sometimes the query structure itself is the problem:

-- Bad: correlated subquery
SELECT * FROM customers c
WHERE (SELECT COUNT(*) FROM orders WHERE customer_id = c.id) > 5;

-- Good: JOIN with HAVING
SELECT c.*
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.id
HAVING COUNT(o.id) > 5;

-- Also good: EXISTS with a HAVING clause (when you do not need the count in the result)
SELECT * FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.id
    HAVING COUNT(*) > 5
);

Updating Statistics

When row estimates are wrong, statistics are usually stale or insufficient:

-- Update statistics for a specific table
ANALYZE orders;

-- Increase statistics target for a specific column (default is 100)
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;

-- Create extended statistics for correlated columns (PostgreSQL 10+)
CREATE STATISTICS stat_orders_region_status (dependencies, ndistinct, mcv)
    ON region, status FROM orders;
ANALYZE orders;
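
To verify the statistics object was built and see what the planner learned from it, PostgreSQL 12+ exposes the results in the pg_stats_ext view:

SELECT statistics_name, n_distinct, dependencies
FROM pg_stats_ext
WHERE tablename = 'orders';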

Planner Configuration

These settings influence plan selection. Change them per-session for testing, not globally unless you know what you are doing.

-- Discourage Seq Scans (useful for testing if an index plan is faster)
SET enable_seqscan = off;

-- Adjust the cost of random I/O relative to sequential I/O
-- Default is 4.0. On SSDs, set closer to 1.1-1.5
SET random_page_cost = 1.1;

-- Increase memory available for sorts and hash operations
SET work_mem = '64MB';

-- Increase memory for maintenance operations (CREATE INDEX, VACUUM)
SET maintenance_work_mem = '512MB';

-- Allow more parallel workers
SET max_parallel_workers_per_gather = 4;

For SSD-based servers, random_page_cost = 1.1 is one of the most impactful changes you can make. The default of 4.0 assumes spinning disks and makes the planner irrationally afraid of index scans.
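
To persist the change beyond the session (assuming sufficient privileges), ALTER SYSTEM writes it to postgresql.auto.conf, and a configuration reload applies it without a restart:

ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();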

Using pg_stat_statements to Find Slow Queries

Before you can optimize, you need to know what to optimize. pg_stat_statements tracks query performance across all sessions.

-- Enable the extension (requires pg_stat_statements in shared_preload_libraries
-- in postgresql.conf and a server restart, then CREATE EXTENSION once per database)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find the top 10 slowest queries by total time
SELECT
    substring(query, 1, 80) AS short_query,
    calls,
    round(total_exec_time::numeric, 2) AS total_ms,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(stddev_exec_time::numeric, 2) AS stddev_ms,
    rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
          short_query           | calls  | total_ms  | mean_ms | stddev_ms | rows
--------------------------------+--------+-----------+---------+-----------+---------
 SELECT * FROM orders WHERE sta | 284710 | 142847.12 |    0.50 |      1.24 | 1847210
 SELECT c.name, COUNT(o.id) AS | 12847  |  89412.84 |    6.96 |      3.18 |  128470
 UPDATE orders SET status = $1  |  48721 |  51842.18 |    1.06 |      0.84 |   48721

The first query ran 284,710 times. Even though each call averages 0.50ms, the total is 142 seconds. Optimizing this saves the most cumulative time.
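
When iterating on fixes, reset the counters after each change (requires appropriate privileges) so the before and after numbers are not blended together:

SELECT pg_stat_statements_reset();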

EXPLAIN in Node.js

In production Node.js applications, you want to automatically capture slow query plans. Here is a practical approach using the pg module:

var pg = require('pg');

var pool = new pg.Pool({
    connectionString: process.env.DATABASE_URL,
    max: 20
});

var SLOW_QUERY_THRESHOLD_MS = 200;

function queryWithExplain(text, params, callback) {
    var start = Date.now();

    pool.query(text, params, function(err, result) {
        var duration = Date.now() - start;

        if (duration > SLOW_QUERY_THRESHOLD_MS) {
            console.warn('[SLOW QUERY] ' + duration + 'ms: ' + text);
            console.warn('[SLOW QUERY] Params: ' + JSON.stringify(params));

            // Capture the EXPLAIN ANALYZE for diagnosis
            var explainQuery = 'EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) ' + text;
            pool.query(explainQuery, params, function(explainErr, explainResult) {
                if (!explainErr) {
                    var plan = JSON.stringify(explainResult.rows[0]['QUERY PLAN'], null, 2);
                    console.warn('[SLOW QUERY] Plan:\n' + plan);

                    // In production, send this to your logging service
                    logSlowQuery({
                        query: text,
                        params: params,
                        duration: duration,
                        plan: explainResult.rows[0]['QUERY PLAN']
                    });
                }
            });
        }

        callback(err, result);
    });
}

function logSlowQuery(data) {
    // Send to your logging/monitoring service
    // e.g., Datadog, Sentry, CloudWatch, or your own table
    console.log('[SLOW QUERY LOG]', JSON.stringify({
        timestamp: new Date().toISOString(),
        query: data.query,
        duration_ms: data.duration,
        plan_nodes: extractNodeTypes(data.plan)
    }));
}

function extractNodeTypes(plan) {
    var nodes = [];
    function walk(node) {
        if (node['Node Type']) {
            nodes.push({
                type: node['Node Type'],
                actual_rows: node['Actual Rows'],
                actual_time: node['Actual Total Time']
            });
        }
        if (node.Plans) {
            node.Plans.forEach(function(child) {
                walk(child);
            });
        }
    }
    if (Array.isArray(plan)) {
        plan.forEach(function(p) { walk(p.Plan); });
    }
    return nodes;
}

// Usage
queryWithExplain(
    'SELECT * FROM orders WHERE customer_id = $1 AND status = $2',
    [4821, 'pending'],
    function(err, result) {
        if (err) {
            console.error('Query failed:', err);
            return;
        }
        console.log('Got ' + result.rows.length + ' orders');
    }
);

Important caveat: Running EXPLAIN ANALYZE on every slow query doubles the execution cost. In high-traffic production systems, sample slow queries (e.g., capture the plan for 1 in 100 slow queries) or use EXPLAIN without ANALYZE to get the estimated plan without re-executing:

var sampleRate = 0.01; // 1% of slow queries

if (duration > SLOW_QUERY_THRESHOLD_MS && Math.random() < sampleRate) {
    var explainQuery = 'EXPLAIN (FORMAT JSON) ' + text;
    pool.query(explainQuery, params, function(explainErr, explainResult) {
        // Log estimated plan without re-executing
    });
}

Complete Working Example: Diagnosing and Fixing a Slow Query

Let us walk through a real scenario. We have an e-commerce database and a query that is running too slow.

The Schema

CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(255),
    region VARCHAR(50),
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    status VARCHAR(20),
    total_amount DECIMAL(10,2),
    region VARCHAR(50),
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE order_items (
    id SERIAL PRIMARY KEY,
    order_id INTEGER REFERENCES orders(id),
    product_id INTEGER,
    quantity INTEGER,
    unit_price DECIMAL(10,2)
);

-- 50,000 customers, 500,000 orders, 2,000,000 order items
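
If you want to reproduce a dataset of roughly this shape, generate_series works well. The value distributions below are illustrative assumptions, not the exact data behind the plans that follow:

INSERT INTO customers (name, email, region, created_at)
SELECT 'Customer ' || i,
       'customer' || i || '@example.com',
       (ARRAY['us-west', 'us-east', 'eu-central', 'apac'])[1 + (i % 4)],
       NOW() - random() * interval '730 days'
FROM generate_series(1, 50000) AS i;

INSERT INTO orders (customer_id, status, total_amount, region, created_at)
SELECT 1 + (random() * 49999)::int,
       (ARRAY['pending', 'completed', 'cancelled'])[1 + floor(random() * 3)::int],
       round((random() * 500)::numeric, 2),
       (ARRAY['us-west', 'us-east', 'eu-central', 'apac'])[1 + floor(random() * 4)::int],
       NOW() - random() * interval '365 days'
FROM generate_series(1, 500000) AS i;

INSERT INTO order_items (order_id, product_id, quantity, unit_price)
SELECT 1 + (random() * 499999)::int,
       1 + (random() * 999)::int,
       1 + (random() * 4)::int,
       round((random() * 100)::numeric, 2)
FROM generate_series(1, 2000000) AS i;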

The Slow Query

EXPLAIN (ANALYZE, BUFFERS)
SELECT c.name, c.region,
       SUM(oi.quantity * oi.unit_price) AS total_revenue,
       COUNT(DISTINCT o.id) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.id
WHERE o.status = 'completed'
  AND o.created_at BETWEEN '2025-01-01' AND '2025-12-31'
  AND c.region = 'us-west'
GROUP BY c.name, c.region
ORDER BY total_revenue DESC
LIMIT 20;

The Bad Plan (Before Optimization)

Limit  (cost=48921.45..48921.50 rows=20 width=76) (actual time=2847.218..2847.234 rows=20 loops=1)
  Buffers: shared hit=12847 read=34218
  ->  Sort  (cost=48921.45..48932.12 rows=4268 width=76) (actual time=2847.215..2847.228 rows=20 loops=1)
        Sort Key: (sum((oi.quantity * oi.unit_price))) DESC
        Sort Method: top-N heapsort  Memory: 28kB
        Buffers: shared hit=12847 read=34218
        ->  HashAggregate  (cost=48712.18..48754.86 rows=4268 width=76) (actual time=2842.187..2845.412 rows=4268 loops=1)
              Group Key: c.name, c.region
              Batches: 1  Memory Usage: 1249kB
              Buffers: shared hit=12847 read=34218
              ->  Hash Join  (cost=18412.45..47842.18 rows=174000 width=44) (actual time=892.412..2587.218 rows=168421 loops=1)
                    Hash Cond: (oi.order_id = o.id)
                    Buffers: shared hit=12847 read=34218
                    ->  Seq Scan on order_items oi  (cost=0.00..24718.00 rows=2000000 width=16) (actual time=0.012..412.187 rows=2000000 loops=1)
                          Buffers: shared hit=4218 read=20500
                    ->  Hash  (cost=17842.45..17842.45 rows=45600 width=36) (actual time=891.842..891.843 rows=42187 loops=1)
                          Buckets: 65536  Batches: 1  Memory Usage: 3412kB
                          Buffers: shared hit=8629 read=13718
                          ->  Hash Join  (cost=248.12..17842.45 rows=45600 width=36) (actual time=12.187..842.412 rows=42187 loops=1)
                                Hash Cond: (o.customer_id = c.id)
                                Buffers: shared hit=8629 read=13718
                                ->  Seq Scan on orders o  (cost=0.00..16843.00 rows=91200 width=12) (actual time=0.024..718.412 rows=87412 loops=1)
                                      Filter: ((status = 'completed') AND (created_at >= '2025-01-01') AND (created_at <= '2025-12-31'))
                                      Rows Removed by Filter: 412588
                                      Buffers: shared hit=8412 read=13500
                                ->  Hash  (cost=198.12..198.12 rows=4000 width=32) (actual time=12.087..12.088 rows=8421 loops=1)
                                      Buckets: 16384 (originally 4096)  Batches: 1  Memory Usage: 542kB
                                      Buffers: shared hit=217 read=218
                                      ->  Seq Scan on customers c  (cost=0.00..198.12 rows=4000 width=32) (actual time=0.018..8.412 rows=8421 loops=1)
                                            Filter: (region = 'us-west')
                                            Rows Removed by Filter: 41579
                                            Buffers: shared hit=217 read=218
 Planning Time: 1.842 ms
 Execution Time: 2847.518 ms

Total time: 2,847ms. Let us diagnose this.

Reading the Plan — Where Is the Time?

  1. Seq Scan on order_items — reads all 2,000,000 rows (412ms). This is the biggest table and it has no filter pushed down. It reads 20,500 pages from disk.
  2. Seq Scan on orders with Filter — reads 500,000 rows, removes 412,588 (718ms). The filter on status and created_at is applied after reading every row.
  3. Row estimate errors — customers estimated 4,000 but got 8,421. The planner sized the hash table wrong (resized from 4,096 to 16,384 buckets).
  4. shared read=34218 total — significant disk I/O. Not enough of the data is cached.

The Fix

-- 1. Index for the orders filter (most impactful)
CREATE INDEX idx_orders_status_created ON orders (status, created_at)
    WHERE status = 'completed';

-- 2. Index for order_items join
CREATE INDEX idx_order_items_order_id ON order_items (order_id)
    INCLUDE (quantity, unit_price);

-- 3. Index for customers region filter
CREATE INDEX idx_customers_region ON customers (region);

-- 4. Update statistics
ANALYZE customers;
ANALYZE orders;
ANALYZE order_items;

The Fixed Plan (After Optimization)

EXPLAIN (ANALYZE, BUFFERS)
SELECT c.name, c.region,
       SUM(oi.quantity * oi.unit_price) AS total_revenue,
       COUNT(DISTINCT o.id) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.id
WHERE o.status = 'completed'
  AND o.created_at BETWEEN '2025-01-01' AND '2025-12-31'
  AND c.region = 'us-west'
GROUP BY c.name, c.region
ORDER BY total_revenue DESC
LIMIT 20;
Limit  (cost=8421.45..8421.50 rows=20 width=76) (actual time=187.218..187.234 rows=20 loops=1)
  Buffers: shared hit=14842
  ->  Sort  (cost=8421.45..8432.12 rows=4268 width=76) (actual time=187.215..187.228 rows=20 loops=1)
        Sort Key: (sum((oi.quantity * oi.unit_price))) DESC
        Sort Method: top-N heapsort  Memory: 28kB
        Buffers: shared hit=14842
        ->  HashAggregate  (cost=8212.18..8254.86 rows=4268 width=76) (actual time=182.187..185.412 rows=4268 loops=1)
              Group Key: c.name, c.region
              Batches: 1  Memory Usage: 1249kB
              Buffers: shared hit=14842
              ->  Nested Loop  (cost=1248.45..7342.18 rows=174000 width=44) (actual time=8.412..142.218 rows=168421 loops=1)
                    Buffers: shared hit=14842
                    ->  Hash Join  (cost=1248.16..4842.45 rows=42187 width=36) (actual time=8.387..42.187 rows=42187 loops=1)
                          Hash Cond: (o.customer_id = c.id)
                          Buffers: shared hit=2847
                          ->  Index Scan using idx_orders_status_created on orders o  (cost=0.42..3218.42 rows=87412 width=12) (actual time=0.024..18.412 rows=87412 loops=1)
                                Index Cond: ((status = 'completed') AND (created_at >= '2025-01-01') AND (created_at <= '2025-12-31'))
                                Buffers: shared hit=2412
                          ->  Hash  (cost=1198.12..1198.12 rows=8421 width=32) (actual time=8.087..8.088 rows=8421 loops=1)
                                Buckets: 16384  Batches: 1  Memory Usage: 542kB
                                Buffers: shared hit=435
                                ->  Index Scan using idx_customers_region on customers c  (cost=0.29..1198.12 rows=8421 width=32) (actual time=0.018..4.412 rows=8421 loops=1)
                                      Index Cond: (region = 'us-west')
                                      Buffers: shared hit=435
                    ->  Index Only Scan using idx_order_items_order_id on order_items oi  (cost=0.43..0.58 rows=4 width=16) (actual time=0.001..0.002 rows=4 loops=42187)
                          Index Cond: (order_id = o.id)
                          Heap Fetches: 0
                          Buffers: shared hit=11995
 Planning Time: 0.847 ms
 Execution Time: 187.518 ms

Results

  • Execution Time: 2,847ms -> 187ms (15.2x faster)
  • Disk Reads: 34,218 pages -> 0 pages (100% cache hit)
  • Order Items Scan: full Seq Scan (2M rows) -> Index Only Scan (4 rows per loop); full scan eliminated
  • Orders Scan: Seq Scan + Filter -> direct Index Scan; no rows removed by filter

The key changes:

  • order_items went from a full Seq Scan of 2M rows to an Index Only Scan that reads only the matching rows per order, with zero heap fetches
  • orders went from Seq Scan + Filter (reading 500K, keeping 87K) to an Index Scan that only reads matching rows
  • All reads come from shared buffers — zero disk I/O on the second run
  • The join strategy changed from Hash Join to Nested Loop for order_items, which is correct because the loop count is bounded by the filtered orders

Common Issues & Troubleshooting

Issue 1: Planner Chooses Seq Scan Despite Index

EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'completed';
Seq Scan on orders  (cost=0.00..16843.00 rows=425000 width=94) (actual time=0.024..142.187 rows=425842 loops=1)
  Filter: (status = 'completed')
  Rows Removed by Filter: 74158

Why: The query returns 85% of the table. An index scan would be slower because of random I/O. The planner is correct. If you only need a subset of columns, create a covering index. If you need to filter further, add more conditions.
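
A sketch of the covering-index idea, assuming the application only needs id and total_amount for completed orders (the index name is hypothetical):

CREATE INDEX idx_orders_status_incl ON orders (status) INCLUDE (id, total_amount);

-- With the visibility map mostly set, this can become an Index Only Scan that
-- reads the narrow index instead of the full-width heap rows.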

Issue 2: Wrong Row Estimates Due to Correlated Columns

EXPLAIN ANALYZE SELECT * FROM orders WHERE region = 'us-west' AND status = 'pending';
Bitmap Heap Scan on orders  (cost=82.18..4842.12 rows=180 width=94) (actual time=4.218..28.412 rows=8421 loops=1)
  Recheck Cond: (region = 'us-west')
  Filter: (status = 'pending')
  Rows Removed by Filter: 12

Estimated 180 rows, actual 8,421 — the planner assumed region and status are independent, but they are correlated (most us-west orders are pending).

Fix: Extended statistics:

CREATE STATISTICS stat_orders_region_status (dependencies) ON region, status FROM orders;
ANALYZE orders;

Issue 3: Hash Join Spilling to Disk

Hash Join  (cost=48218.45..98421.18 rows=500000 width=130) (actual time=2847.218..8421.412 rows=500000 loops=1)
  Hash Cond: (o.customer_id = c.id)
  ->  Seq Scan on orders o  ...
  ->  Hash  (cost=24718.00..24718.00 rows=50000 width=36) (actual time=2842.187..2842.188 rows=50000 loops=1)
        Buckets: 65536  Batches: 4  Memory Usage: 4097kB

Batches: 4 means the hash table did not fit in work_mem and spilled to disk in 4 batches.

Fix:

SET work_mem = '32MB';  -- Increase from default 4MB

Or better yet, set it per-query in your application:

function runLargeJoinQuery(params, callback) {
    // SET LOCAL lasts one transaction and must run on the same connection as the query
    pool.connect(function(err, client, release) {
        if (err) return callback(err);
        client.query("BEGIN; SET LOCAL work_mem = '32MB'", function(err) {
            if (err) { release(); return callback(err); }
            client.query('SELECT ...', params, function(queryErr, result) {
                client.query(queryErr ? 'ROLLBACK' : 'COMMIT', function() { release(); callback(queryErr, result); });
            });
        });
    });
}

Issue 4: Index Not Used Due to Type Mismatch

-- Column is INTEGER, but the comparison forces a text cast on the column
-- (written explicitly here; ORMs and query builders sometimes generate this)
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id::text = '4821';
Seq Scan on orders  (cost=0.00..16843.00 rows=1 width=94) (actual time=0.024..142.187 rows=6 loops=1)
  Filter: ((customer_id)::text = '4821'::text)

Casting customer_id to text means the B-tree index on customer_id cannot be used.

Fix: Use the correct type:

SELECT * FROM orders WHERE customer_id = 4821;

In Node.js, be explicit about parameter types:

// Bad — pg might send this as text
pool.query('SELECT * FROM orders WHERE customer_id = $1', ['4821']);

// Good — ensure it is a number
pool.query('SELECT * FROM orders WHERE customer_id = $1', [parseInt(customerId, 10)]);

Best Practices

  • Always use EXPLAIN ANALYZE, not just EXPLAIN. Estimated plans lie. Actual plans tell the truth. The gap between estimated and actual rows is where performance problems hide.

  • Run ANALYZE after bulk data changes. After large INSERTs, DELETEs, or schema changes, the planner's statistics are stale. Autovacuum handles this eventually, but for immediate accuracy, run ANALYZE tablename explicitly.

  • Set random_page_cost appropriately for your storage. If you are on SSDs (and you should be), set random_page_cost = 1.1. The default of 4.0 assumes spinning disks and causes the planner to avoid index scans even when they would be faster.

  • Use BUFFERS to understand I/O patterns. EXPLAIN (ANALYZE, BUFFERS) shows whether data comes from cache or disk. High shared read counts indicate either a cold cache, insufficient shared_buffers, or a query that touches too many pages.

  • Create composite indexes that match your query patterns. A single-column index on status is rarely enough. If your queries filter on status and sort by created_at, create an index like CREATE INDEX ON orders (status, created_at). Column order matters — put equality conditions first, range conditions and sort columns after.

  • Monitor with pg_stat_statements, not just application logs. Application-level query timing includes network latency, connection pool wait time, and serialization. pg_stat_statements measures pure database execution time and tracks every query, including those from ORMs and background jobs you might not be monitoring.

  • Prefer EXISTS over COUNT for existence checks. SELECT COUNT(*) FROM orders WHERE customer_id = $1 scans all matching rows. SELECT EXISTS (SELECT 1 FROM orders WHERE customer_id = $1) stops at the first match.

  • Do not over-index. Every index slows down writes and consumes disk space. If an index shows idx_scan = 0 in pg_stat_user_indexes, no query has used it since the statistics were last reset, and it is a candidate for dropping (see the query after this list).

  • Test query plans with production-like data volumes. A query that uses an index on 1,000 rows might switch to a Seq Scan on 1,000,000 rows. Always test EXPLAIN ANALYZE on a database with realistic data distribution and volume.
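
A starting point for the unused-index audit mentioned above, assuming statistics have been collecting long enough to be representative:

SELECT schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;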
