Implementing RAG with PostgreSQL and OpenAI Embeddings

A production-grade RAG implementation using PostgreSQL with pgvector for vector storage, OpenAI embeddings for semantic search, and GPT-4 for generation, all wired together with Node.js.

Overview

Retrieval-Augmented Generation (RAG) is the most practical way to give large language models access to your own data without the cost, complexity, and staleness problems of fine-tuning. Instead of baking knowledge into model weights, RAG retrieves relevant documents at query time and feeds them as context to the LLM. This article walks through a production-grade RAG implementation using PostgreSQL with pgvector for vector storage, OpenAI's embedding models for semantic search, and GPT-4 for generation — all wired together with Node.js.

Prerequisites

  • Node.js v18+ installed
  • PostgreSQL 15+ with superuser access
  • An OpenAI API key with access to text-embedding-3-small and gpt-4o
  • Basic familiarity with SQL and Express.js
  • psql CLI or a PostgreSQL client like pgAdmin

What RAG Is and Why It Beats Fine-Tuning

Fine-tuning trains a model on your data. RAG retrieves your data and hands it to the model at inference time. The difference matters enormously in practice.

Fine-tuning has three serious problems. First, it is expensive — you are paying to train a model, and you will pay again every time your data changes. Second, the model can hallucinate "facts" it absorbed during training with no way to trace where the information came from. Third, your data goes stale. The moment you add a new document, your fine-tuned model knows nothing about it until you retrain.

RAG solves all three. Your data lives in a database. When a user asks a question, you search that database for the most relevant chunks, stuff them into the prompt as context, and let the LLM synthesize an answer. The data is always current. You can cite sources. And you pay only for inference, not training.

The architecture looks like this:

User Query
    │
    ▼
[Embed Query] ──► OpenAI Embeddings API
    │
    ▼
[Vector Search] ──► PostgreSQL + pgvector
    │
    ▼
[Retrieved Chunks] ──► Top-K relevant documents
    │
    ▼
[Augmented Prompt] ──► System prompt + retrieved context + user query
    │
    ▼
[LLM Generation] ──► GPT-4o
    │
    ▼
Answer with Citations

RAG is the right choice when your data changes frequently, when you need source attribution, when you have domain-specific documents the model was never trained on, or when you cannot afford the latency and cost of fine-tuning cycles.

Setting Up PostgreSQL with pgvector

pgvector is a PostgreSQL extension that adds vector data types and similarity search operators. It is the most battle-tested vector store that runs inside a database you already know how to operate.

Install the extension. On Ubuntu/Debian:

sudo apt install postgresql-15-pgvector

On macOS with Homebrew:

brew install pgvector

On DigitalOcean Managed Databases, pgvector is available as a trusted extension — you just need to enable it.

Create the database and enable the extension:

CREATE DATABASE rag_demo;
\c rag_demo

CREATE EXTENSION IF NOT EXISTS vector;

Now create the table that will hold your document chunks and their embeddings. OpenAI's text-embedding-3-small produces 1536-dimensional vectors by default:

CREATE TABLE document_chunks (
    id SERIAL PRIMARY KEY,
    document_id VARCHAR(255) NOT NULL,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    metadata JSONB DEFAULT '{}',
    embedding vector(1536),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(document_id, chunk_index)
);

-- Full-text search support
ALTER TABLE document_chunks ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;

CREATE INDEX idx_content_tsv ON document_chunks USING GIN (content_tsv);

-- Metadata index for filtering
CREATE INDEX idx_metadata ON document_chunks USING GIN (metadata);

We will add vector indexes after discussing the tradeoffs between IVFFlat and HNSW later in this article.

Generating Embeddings with OpenAI

Install the required packages:

npm install openai pg dotenv express

Here is the embedding module. It handles single texts and batches; rate-limit retries are handled separately with the withRetry helper shown in the Troubleshooting section:

// embeddings.js
var OpenAI = require("openai");

var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

var EMBEDDING_MODEL = "text-embedding-3-small";
var MAX_BATCH_SIZE = 100; // OpenAI allows up to 2048, but smaller batches are safer

function generateEmbedding(text) {
    return client.embeddings.create({
        model: EMBEDDING_MODEL,
        input: text
    }).then(function(response) {
        return response.data[0].embedding;
    });
}

function generateEmbeddingsBatch(texts) {
    var batches = [];
    for (var i = 0; i < texts.length; i += MAX_BATCH_SIZE) {
        batches.push(texts.slice(i, i + MAX_BATCH_SIZE));
    }

    var results = [];
    var chain = Promise.resolve();

    batches.forEach(function(batch, batchIndex) {
        chain = chain.then(function() {
            console.log("Embedding batch " + (batchIndex + 1) + "/" + batches.length +
                " (" + batch.length + " texts)");

            return client.embeddings.create({
                model: EMBEDDING_MODEL,
                input: batch
            });
        }).then(function(response) {
            response.data.forEach(function(item) {
                results.push(item.embedding);
            });
        });
    });

    return chain.then(function() {
        return results;
    });
}

module.exports = {
    generateEmbedding: generateEmbedding,
    generateEmbeddingsBatch: generateEmbeddingsBatch,
    EMBEDDING_MODEL: EMBEDDING_MODEL
};

The text-embedding-3-small model costs $0.02 per million tokens as of early 2026. For a corpus of 10,000 chunks averaging 300 tokens each, the total embedding cost is about $0.06. This is negligible compared to generation costs.
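
As a quick sanity check, the arithmetic is easy to script. A minimal sketch, assuming the pricing quoted above (update PRICE_PER_MILLION_TOKENS if the rate changes):

// cost-estimate.js: back-of-envelope embedding cost
var PRICE_PER_MILLION_TOKENS = 0.02; // assumed rate for text-embedding-3-small

function estimateEmbeddingCost(chunkCount, avgTokensPerChunk) {
    var totalTokens = chunkCount * avgTokensPerChunk;
    return (totalTokens / 1e6) * PRICE_PER_MILLION_TOKENS;
}

console.log(estimateEmbeddingCost(10000, 300)); // 0.06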

Document Chunking Strategies

Chunking is the most underappreciated part of a RAG pipeline. Bad chunks produce bad retrieval. Bad retrieval produces hallucinated answers. There are three main approaches.

Fixed-Size Chunking

Split text into chunks of N characters with overlap. Simple and predictable:

// chunking.js

function chunkFixedSize(text, chunkSize, overlap) {
    chunkSize = chunkSize || 1000;
    overlap = overlap || 200;

    var chunks = [];
    var start = 0;

    while (start < text.length) {
        var end = Math.min(start + chunkSize, text.length);
        chunks.push(text.slice(start, end));
        start += chunkSize - overlap;
    }

    return chunks;
}

This works but frequently splits sentences mid-thought. A chunk ending with "The server returns a 500 error when" is useless without what follows.

Sentence-Based Chunking

Group complete sentences until you hit a size limit. Better semantic coherence:

function chunkBySentence(text, maxChunkSize, overlap) {
    maxChunkSize = maxChunkSize || 1000;
    overlap = overlap || 1;

    var sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
    var chunks = [];
    var currentChunk = [];
    var currentSize = 0;

    for (var i = 0; i < sentences.length; i++) {
        var sentence = sentences[i].trim();

        if (currentSize + sentence.length > maxChunkSize && currentChunk.length > 0) {
            chunks.push(currentChunk.join(" "));

            // Keep last N sentences as overlap
            var overlapSentences = currentChunk.slice(-overlap);
            currentChunk = overlapSentences;
            currentSize = overlapSentences.join(" ").length;
        }

        currentChunk.push(sentence);
        currentSize += sentence.length;
    }

    if (currentChunk.length > 0) {
        chunks.push(currentChunk.join(" "));
    }

    return chunks;
}

Markdown-Aware Chunking

For markdown documents, respect the structure. Headers, code blocks, and lists should stay together when possible:

function chunkMarkdown(markdown, maxChunkSize) {
    maxChunkSize = maxChunkSize || 1500;

    // Split on headers while keeping the header with its content
    var sections = markdown.split(/(?=^#{1,3}\s)/m);
    var chunks = [];

    sections.forEach(function(section) {
        section = section.trim();
        if (!section) return;

        if (section.length <= maxChunkSize) {
            chunks.push(section);
        } else {
            // Section too large — fall back to sentence chunking
            var subChunks = chunkBySentence(section, maxChunkSize, 1);
            chunks = chunks.concat(subChunks);
        }
    });

    return chunks;
}

module.exports = {
    chunkFixedSize: chunkFixedSize,
    chunkBySentence: chunkBySentence,
    chunkMarkdown: chunkMarkdown
};

My recommendation: use markdown-aware chunking for structured documents and sentence-based chunking for everything else. Target chunks of 500–1500 characters. Smaller chunks improve retrieval precision. Larger chunks provide more context per retrieval. Start with 1000 characters and adjust based on your retrieval quality.
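
If you want that routing in code, a small wrapper works. This is a sketch: chunkAuto is not part of chunking.js, and the isMarkdown flag would come from the file extension or content type:

// chunk-auto.js: illustrative dispatcher over the chunkers above
var chunking = require("./chunking");

function chunkAuto(text, isMarkdown) {
    if (isMarkdown) {
        return chunking.chunkMarkdown(text, 1500);
    }
    return chunking.chunkBySentence(text, 1000, 1);
}

module.exports = { chunkAuto: chunkAuto };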

Storing and Indexing Embeddings

Database Layer

// db.js
var pg = require("pg");

var pool = new pg.Pool({
    connectionString: process.env.POSTGRES_CONNECTION_STRING,
    max: 20,
    idleTimeoutMillis: 30000,
    connectionTimeoutMillis: 5000
});

function query(text, params) {
    return pool.query(text, params);
}

function getClient() {
    return pool.connect();
}

module.exports = {
    query: query,
    getClient: getClient,
    pool: pool
};

Inserting Chunks with Embeddings

// store.js
var db = require("./db");
var embeddings = require("./embeddings");
var chunking = require("./chunking");

function ingestDocument(documentId, content, metadata) {
    var chunks = chunking.chunkMarkdown(content);

    console.log("Document " + documentId + ": " + chunks.length + " chunks");

    return embeddings.generateEmbeddingsBatch(chunks).then(function(vectors) {
        var client;

        return db.getClient().then(function(c) {
            client = c;
            return client.query("BEGIN");
        }).then(function() {
            // Delete existing chunks for this document (re-ingestion)
            return client.query(
                "DELETE FROM document_chunks WHERE document_id = $1",
                [documentId]
            );
        }).then(function() {
            var insertChain = Promise.resolve();

            chunks.forEach(function(chunk, index) {
                insertChain = insertChain.then(function() {
                    var vectorStr = "[" + vectors[index].join(",") + "]";

                    return client.query(
                        "INSERT INTO document_chunks (document_id, chunk_index, content, metadata, embedding) " +
                        "VALUES ($1, $2, $3, $4, $5::vector)",
                        [documentId, index, chunk, JSON.stringify(metadata), vectorStr]
                    );
                });
            });

            return insertChain;
        }).then(function() {
            return client.query("COMMIT");
        }).then(function() {
            client.release();
            console.log("Stored " + chunks.length + " chunks for " + documentId);
            return chunks.length;
        }).catch(function(err) {
            // Roll back before releasing so the connection returns to the pool clean
            return client.query("ROLLBACK")
                .catch(function() { /* ignore rollback failures */ })
                .then(function() {
                    client.release();
                    throw err;
                });
        });
    });
}

module.exports = {
    ingestDocument: ingestDocument
};

IVFFlat vs HNSW Indexes

pgvector supports two index types for approximate nearest neighbor search. The choice matters.

IVFFlat partitions vectors into lists (clusters) and searches only the nearest lists. It is faster to build and uses less memory, but it clusters the rows that already exist in the table, so build it only after your data is loaded:

-- IVFFlat: good for datasets under 1M vectors
-- lists = sqrt(num_rows) is a reasonable starting point
CREATE INDEX idx_embedding_ivfflat ON document_chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

HNSW (Hierarchical Navigable Small World) builds a graph structure. Higher recall, no training step, but uses more memory and takes longer to build:

-- HNSW: better recall, recommended for production
CREATE INDEX idx_embedding_hnsw ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 128);

For most RAG applications with fewer than a million documents, HNSW is the better choice. The recall difference is meaningful — an IVFFlat search might miss a relevant chunk that HNSW would find. The build time penalty only matters during initial ingestion.

Set the search parameter at query time to trade speed for accuracy:

-- For HNSW, higher ef = better recall, slower search
SET hnsw.ef_search = 100;

-- For IVFFlat, higher probes = better recall, slower search
SET ivfflat.probes = 10;

Building the Similarity Search Layer

// search.js
var db = require("./db");
var embeddings = require("./embeddings");

function searchSimilar(queryText, options) {
    var limit = (options && options.limit) || 5;
    var threshold = (options && options.threshold) || 0.3;
    var metadataFilter = (options && options.metadata) || null;

    return embeddings.generateEmbedding(queryText).then(function(queryVector) {
        var vectorStr = "[" + queryVector.join(",") + "]";

        var sql = "SELECT id, document_id, chunk_index, content, metadata, " +
            "1 - (embedding <=> $1::vector) AS similarity " +
            "FROM document_chunks " +
            "WHERE 1 - (embedding <=> $1::vector) > $2";

        var params = [vectorStr, threshold];
        var paramIndex = 3;

        if (metadataFilter) {
            sql += " AND metadata @> $" + paramIndex;
            params.push(JSON.stringify(metadataFilter));
            paramIndex++;
        }

        sql += " ORDER BY embedding <=> $1::vector LIMIT $" + paramIndex;
        params.push(limit);

        return db.query(sql, params);
    }).then(function(result) {
        return result.rows;
    });
}

module.exports = {
    searchSimilar: searchSimilar
};

The <=> operator computes cosine distance. We convert it to similarity (1 - distance) so higher numbers mean better matches. A similarity of 0.85 is a strong match. Below 0.5 is usually noise.
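
A quick way to exercise it (the query and threshold here are illustrative; actual scores depend on your corpus):

// Inspect raw retrieval results without involving the LLM
var search = require("./search");

search.searchSimilar("How do I enable the pgvector extension?", { limit: 3, threshold: 0.5 })
    .then(function(rows) {
        rows.forEach(function(row) {
            console.log(row.similarity.toFixed(3), row.document_id, "chunk " + row.chunk_index);
        });
    })
    .catch(function(err) {
        console.error("Search failed:", err.message);
    });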

The Full RAG Pipeline

This is where retrieval meets generation. The quality of your system prompt and the way you format retrieved context will make or break your RAG application.

// rag.js
var OpenAI = require("openai");
var search = require("./search");

var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

var SYSTEM_PROMPT = "You are a technical assistant. Answer the user's question " +
    "based ONLY on the provided context documents. If the context does not contain " +
    "enough information to answer, say so explicitly. Do not make up information. " +
    "Cite the source document IDs when referencing specific information.";

function buildContext(chunks) {
    return chunks.map(function(chunk, index) {
        return "--- Document: " + chunk.document_id +
            " (chunk " + chunk.chunk_index + ", relevance: " +
            (chunk.similarity * 100).toFixed(1) + "%) ---\n" +
            chunk.content;
    }).join("\n\n");
}

function queryRAG(question, options) {
    var searchOptions = {
        limit: (options && options.topK) || 5,
        threshold: (options && options.threshold) || 0.3,
        metadata: (options && options.metadata) || null
    };

    var retrievedChunks;

    return search.searchSimilar(question, searchOptions).then(function(chunks) {
        retrievedChunks = chunks;

        if (chunks.length === 0) {
            return {
                answer: "I could not find any relevant information to answer your question.",
                sources: [],
                tokensUsed: 0
            };
        }

        var context = buildContext(chunks);

        var messages = [
            { role: "system", content: SYSTEM_PROMPT },
            {
                role: "user",
                content: "Context:\n" + context + "\n\nQuestion: " + question
            }
        ];

        return client.chat.completions.create({
            model: "gpt-4o",
            messages: messages,
            temperature: 0.1,
            max_tokens: 1500
        });
    }).then(function(response) {
        if (response.answer) return response; // No chunks case

        var sources = retrievedChunks.map(function(chunk) {
            return {
                documentId: chunk.document_id,
                chunkIndex: chunk.chunk_index,
                similarity: chunk.similarity,
                preview: chunk.content.substring(0, 150) + "..."
            };
        });

        return {
            answer: response.choices[0].message.content,
            sources: sources,
            tokensUsed: response.usage.total_tokens
        };
    });
}

module.exports = {
    queryRAG: queryRAG
};

A few critical decisions in this code:

  1. Temperature 0.1: Low temperature keeps the model grounded in the retrieved context. Higher temperatures lead to more creative (read: hallucinated) answers.
  2. Explicit grounding instruction: "Answer based ONLY on the provided context" is essential. Without it, the model will happily answer from its training data and ignore your documents.
  3. Source attribution: Passing document IDs into the context makes it possible for the model to cite sources, and for your UI to link back to original documents.
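
For reference, a call to queryRAG with explicit options might look like this (the threshold and metadata filter are illustrative):

var rag = require("./rag");

rag.queryRAG("How do I create error handling middleware in Express?", {
    topK: 5,
    threshold: 0.5,
    metadata: { category: "nodejs" }
}).then(function(result) {
    console.log(result.answer);
    console.log(result.sources.length + " sources, " + result.tokensUsed + " tokens");
});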

Hybrid Search: Combining Full-Text and Vector Similarity

Pure vector search has a weakness: it can miss exact keyword matches that a user expects to find. The query "error code PG-4502" embeds to a general "PostgreSQL error" vector, but the user wants that specific code. Hybrid search combines both approaches.

// hybrid-search.js
var db = require("./db");
var embeddings = require("./embeddings");

function hybridSearch(queryText, options) {
    var limit = (options && options.limit) || 5;
    var vectorWeight = (options && options.vectorWeight) || 0.7;
    var textWeight = 1.0 - vectorWeight;

    return embeddings.generateEmbedding(queryText).then(function(queryVector) {
        var vectorStr = "[" + queryVector.join(",") + "]";

        // Reciprocal Rank Fusion (RRF) combines the two result sets
        var sql = "WITH vector_results AS (" +
            "  SELECT id, content, document_id, chunk_index, metadata, " +
            "    ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS vector_rank " +
            "  FROM document_chunks " +
            "  ORDER BY embedding <=> $1::vector " +
            "  LIMIT 20" +
            "), " +
            "text_results AS (" +
            "  SELECT id, content, document_id, chunk_index, metadata, " +
            "    ROW_NUMBER() OVER (ORDER BY ts_rank(content_tsv, websearch_to_tsquery('english', $2)) DESC) AS text_rank " +
            "  FROM document_chunks " +
            "  WHERE content_tsv @@ websearch_to_tsquery('english', $2) " +
            "  ORDER BY ts_rank(content_tsv, websearch_to_tsquery('english', $2)) DESC " +
            "  LIMIT 20" +
            ") " +
            "SELECT COALESCE(v.id, t.id) AS id, " +
            "  COALESCE(v.content, t.content) AS content, " +
            "  COALESCE(v.document_id, t.document_id) AS document_id, " +
            "  COALESCE(v.chunk_index, t.chunk_index) AS chunk_index, " +
            "  COALESCE(v.metadata, t.metadata) AS metadata, " +
            "  (COALESCE($3::float / (60 + v.vector_rank), 0) + " +
            "   COALESCE($4::float / (60 + t.text_rank), 0)) AS rrf_score " +
            "FROM vector_results v " +
            "FULL OUTER JOIN text_results t ON v.id = t.id " +
            "ORDER BY rrf_score DESC " +
            "LIMIT $5";

        return db.query(sql, [vectorStr, queryText, vectorWeight, textWeight, limit]);
    }).then(function(result) {
        return result.rows;
    });
}

module.exports = {
    hybridSearch: hybridSearch
};

Reciprocal Rank Fusion (RRF) is the standard way to merge ranked lists. The constant 60 in the denominator prevents top-ranked results from dominating too aggressively. This approach consistently outperforms either search method alone in my testing — typically 10–15% better retrieval precision on mixed query types.
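
The same calculation is easy to express outside SQL if you ever need to fuse ranked lists in application code. A sketch of the formula the query above implements, with k = 60:

// Weighted Reciprocal Rank Fusion over lists of result IDs.
// score(id) = sum over lists of weight / (k + rank), where rank starts at 1.
function rrfFuse(rankedLists, weights, k) {
    k = k || 60;
    var scores = {};

    rankedLists.forEach(function(list, listIndex) {
        var weight = weights[listIndex] != null ? weights[listIndex] : 1;
        list.forEach(function(id, position) {
            scores[id] = (scores[id] || 0) + weight / (k + position + 1);
        });
    });

    return Object.keys(scores).sort(function(a, b) {
        return scores[b] - scores[a];
    });
}

// rrfFuse([["a", "b", "c"], ["b", "d"]], [0.7, 0.3]) ranks "b" first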

Relevance Scoring and Re-Ranking

The initial retrieval gets you candidates. Re-ranking improves precision by using a more expensive scoring function on the smaller candidate set:

// reranker.js
var OpenAI = require("openai");

var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

function rerankChunks(query, chunks, topN) {
    topN = topN || 3;

    if (chunks.length <= topN) return Promise.resolve(chunks);

    var prompt = "Given the query: \"" + query + "\"\n\n" +
        "Rate the relevance of each document chunk on a scale of 0-10. " +
        "Respond with ONLY a JSON object of the form " +
        "{\"scores\": [{\"index\": 0, \"score\": 7}, ...]} covering every chunk.\n\n";

    chunks.forEach(function(chunk, i) {
        prompt += "Chunk " + i + ":\n" + chunk.content.substring(0, 500) + "\n\n";
    });

    return client.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
        temperature: 0,
        response_format: { type: "json_object" }
    }).then(function(response) {
        var scores = JSON.parse(response.choices[0].message.content);
        var scoreArray = scores.scores || scores;

        // Merge scores back and sort
        scoreArray.forEach(function(item) {
            if (chunks[item.index]) {
                chunks[item.index].rerank_score = item.score;
            }
        });

        chunks.sort(function(a, b) {
            return (b.rerank_score || 0) - (a.rerank_score || 0);
        });

        return chunks.slice(0, topN);
    });
}

module.exports = {
    rerankChunks: rerankChunks
};

Re-ranking adds latency and cost (one additional LLM call), so use it selectively. It is most valuable when your initial retrieval returns chunks with similar similarity scores — the re-ranker breaks ties using deeper semantic understanding.
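
One way to use it selectively is to over-fetch candidates and let the re-ranker pick the final few. A sketch wiring the two modules together:

// Over-fetch 10 candidates, then keep the 3 best after re-ranking
var search = require("./search");
var reranker = require("./reranker");

function retrieveWithRerank(question) {
    return search.searchSimilar(question, { limit: 10, threshold: 0.3 })
        .then(function(chunks) {
            return reranker.rerankChunks(question, chunks, 3);
        });
}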

Performance Optimization

Batch Embedding with Connection Pooling

When ingesting large document sets, batch your database operations:

function batchInsertChunks(client, chunks, vectors, documentId, metadata) {
    var values = [];
    var placeholders = [];
    var paramIndex = 1;

    chunks.forEach(function(chunk, i) {
        var vectorStr = "[" + vectors[i].join(",") + "]";
        placeholders.push(
            "($" + paramIndex + ", $" + (paramIndex + 1) + ", $" +
            (paramIndex + 2) + ", $" + (paramIndex + 3) + ", $" +
            (paramIndex + 4) + "::vector)"
        );
        values.push(documentId, i, chunk, JSON.stringify(metadata), vectorStr);
        paramIndex += 5;
    });

    var sql = "INSERT INTO document_chunks (document_id, chunk_index, content, metadata, embedding) " +
        "VALUES " + placeholders.join(", ");

    return client.query(sql, values);
}

Index Tuning Reference

Dataset Size  | Index Type | Parameters                | Build Time | Query Time (p95)
10K vectors   | HNSW       | m=16, ef_construction=64  | ~5s        | ~2ms
100K vectors  | HNSW       | m=16, ef_construction=128 | ~45s       | ~5ms
1M vectors    | HNSW       | m=32, ef_construction=200 | ~15min     | ~12ms
10K vectors   | IVFFlat    | lists=100                 | ~2s        | ~3ms
100K vectors  | IVFFlat    | lists=316                 | ~10s       | ~6ms
1M vectors    | IVFFlat    | lists=1000                | ~2min      | ~15ms

These are approximate numbers from a 4-core, 8GB RAM machine. Your results will vary, but the relative differences hold.

Monitoring Query Performance

-- Check index usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content, 1 - (embedding <=> '[0.1, 0.2, ...]'::vector) AS similarity
FROM document_chunks
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;

-- Expected output with HNSW index:
-- Index Scan using idx_embedding_hnsw on document_chunks
--   Rows: 5  Loops: 1
--   Planning Time: 0.2 ms
--   Execution Time: 4.8 ms

If you see Seq Scan instead of Index Scan, your index is not being used. Common causes: the table has fewer rows than the index cost threshold (try SET enable_seqscan = off to confirm), or you are filtering before the vector search in a way that prevents index use.

Complete Working Example

Here is the full application wired together as an Express server:

// app.js
var express = require("express");
var dotenv = require("dotenv");
var fs = require("fs");
var path = require("path");

dotenv.config();

var db = require("./db");
var store = require("./store");
var rag = require("./rag");
var hybridSearch = require("./hybrid-search");

var app = express();
app.use(express.json({ limit: "5mb" }));

// Health check
app.get("/health", function(req, res) {
    db.query("SELECT 1").then(function() {
        res.json({ status: "ok", timestamp: new Date().toISOString() });
    }).catch(function(err) {
        res.status(500).json({ status: "error", message: err.message });
    });
});

// Ingest a markdown document
app.post("/api/ingest", function(req, res) {
    var documentId = req.body.documentId;
    var content = req.body.content;
    var metadata = req.body.metadata || {};

    if (!documentId || !content) {
        return res.status(400).json({ error: "documentId and content are required" });
    }

    var startTime = Date.now();

    store.ingestDocument(documentId, content, metadata).then(function(chunkCount) {
        var elapsed = Date.now() - startTime;
        res.json({
            documentId: documentId,
            chunksStored: chunkCount,
            processingTimeMs: elapsed
        });
    }).catch(function(err) {
        console.error("Ingestion error:", err);
        res.status(500).json({ error: err.message });
    });
});

// Ingest markdown files from a directory
app.post("/api/ingest-directory", function(req, res) {
    var dirPath = req.body.path;

    if (!dirPath || !fs.existsSync(dirPath)) {
        return res.status(400).json({ error: "Valid directory path is required" });
    }

    var files = fs.readdirSync(dirPath).filter(function(f) {
        return f.endsWith(".md");
    });

    var totalChunks = 0;
    var startTime = Date.now();
    var chain = Promise.resolve();

    files.forEach(function(file) {
        chain = chain.then(function() {
            var filePath = path.join(dirPath, file);
            var content = fs.readFileSync(filePath, "utf-8");
            var documentId = path.basename(file, ".md");

            return store.ingestDocument(documentId, content, {
                source: file,
                ingestedAt: new Date().toISOString()
            }).then(function(count) {
                totalChunks += count;
            });
        });
    });

    chain.then(function() {
        var elapsed = Date.now() - startTime;
        res.json({
            filesProcessed: files.length,
            totalChunks: totalChunks,
            processingTimeMs: elapsed
        });
    }).catch(function(err) {
        console.error("Directory ingestion error:", err);
        res.status(500).json({ error: err.message });
    });
});

// Query the RAG system
app.post("/api/query", function(req, res) {
    var question = req.body.question;
    var topK = req.body.topK || 5;
    var useHybrid = req.body.hybrid || false;

    if (!question) {
        return res.status(400).json({ error: "question is required" });
    }

    var startTime = Date.now();

    rag.queryRAG(question, {
        topK: topK,
        threshold: 0.3
    }).then(function(result) {
        var elapsed = Date.now() - startTime;
        result.processingTimeMs = elapsed;
        res.json(result);
    }).catch(function(err) {
        console.error("Query error:", err);
        res.status(500).json({ error: err.message });
    });
});

// Search without generation (useful for debugging retrieval)
app.post("/api/search", function(req, res) {
    var query = req.body.query;
    var limit = req.body.limit || 10;

    if (!query) {
        return res.status(400).json({ error: "query is required" });
    }

    hybridSearch.hybridSearch(query, { limit: limit }).then(function(results) {
        res.json({ results: results });
    }).catch(function(err) {
        res.status(500).json({ error: err.message });
    });
});

var PORT = process.env.PORT || 3000;
app.listen(PORT, function() {
    console.log("RAG server running on port " + PORT);
});

Testing the Application

Ingest a document:

curl -X POST http://localhost:3000/api/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "documentId": "express-middleware-guide",
    "content": "# Express Middleware Guide\n\nMiddleware functions are functions that have access to the request object, the response object, and the next middleware function...",
    "metadata": {"category": "nodejs", "author": "shane"}
  }'

Response:

{
  "documentId": "express-middleware-guide",
  "chunksStored": 12,
  "processingTimeMs": 1847
}

Query the system:

curl -X POST http://localhost:3000/api/query \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I create error handling middleware in Express?"}'

Response:

{
  "answer": "To create error handling middleware in Express, define a function with four parameters: err, req, res, and next. Express recognizes middleware as error-handling specifically because it has four arguments instead of three...",
  "sources": [
    {
      "documentId": "express-middleware-guide",
      "chunkIndex": 7,
      "similarity": 0.891,
      "preview": "## Error Handling Middleware\n\nError handling middleware is defined with four arguments instead of three..."
    }
  ],
  "tokensUsed": 847,
  "processingTimeMs": 2341
}

Common Issues and Troubleshooting

1. "ERROR: type 'vector' does not exist"

error: type "vector" does not exist
    at Parser.parseErrorMessage (node_modules/pg-protocol/src/parser.ts:369:69)

The pgvector extension is not installed or not enabled in your database. Run CREATE EXTENSION IF NOT EXISTS vector; while connected to the specific database, not the default postgres database. On managed databases, check that pgvector is available in your provider's extension list.

2. "ERROR: different vector dimensions 1536 and 768"

error: expected 1536 dimensions, not 768

Your column is defined as vector(1536) but you are inserting vectors from a different model. The text-embedding-3-small model defaults to 1536 dimensions. If you switched models or used OpenAI's dimensions parameter to reduce the size, your column definition must match. Either recreate the column or re-embed all existing data with the same model and dimensions.
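
If you choose to recreate the column, something like the following works (768 is only an example dimension, and every existing row must be re-embedded afterwards):

-- Drop and recreate the embedding column with the new dimension,
-- then re-run ingestion so every chunk gets a fresh vector
ALTER TABLE document_chunks DROP COLUMN embedding;
ALTER TABLE document_chunks ADD COLUMN embedding vector(768);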

3. Sequential Scan Instead of Index Scan

EXPLAIN SELECT * FROM document_chunks ORDER BY embedding <=> '[...]'::vector LIMIT 5;
-- Shows: Seq Scan on document_chunks

This happens when the table has fewer rows than PostgreSQL's cost estimator thinks justify an index scan (typically under 1000 rows), or when you have not run ANALYZE after creating the index. Fix:

ANALYZE document_chunks;
-- Force index usage to confirm the index works:
SET enable_seqscan = off;
-- If recall is poor once the index is used, raise the search parameters:
SET hnsw.ef_search = 100;  -- or SET ivfflat.probes = 10;

4. "429 Too Many Requests" from OpenAI

Error: 429 Rate limit exceeded. You are sending requests too quickly.

The OpenAI embeddings API has rate limits based on your tier. Add exponential backoff:

function withRetry(fn, maxRetries) {
    maxRetries = maxRetries || 3;
    var attempt = 0;

    function tryCall() {
        return fn().catch(function(err) {
            if (err.status === 429 && attempt < maxRetries) {
                attempt++;
                var delay = Math.pow(2, attempt) * 1000;
                console.log("Rate limited, retrying in " + delay + "ms...");
                return new Promise(function(resolve) {
                    setTimeout(resolve, delay);
                }).then(tryCall);
            }
            throw err;
        });
    }

    return tryCall();
}
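
Usage is a thin wrapper around the existing call (five retries here is an arbitrary choice):

// Retry the single-text embedding call from embeddings.js
var embeddings = require("./embeddings");

withRetry(function() {
    return embeddings.generateEmbedding("What is middleware?");
}, 5).then(function(vector) {
    console.log("Got " + vector.length + " dimensions"); // 1536
});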

5. Out of Memory During Bulk Ingestion

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

If you are ingesting thousands of documents, you are likely holding all embeddings in memory. Process documents in smaller batches and release references between batches. Increase Node's heap size as a stopgap:

node --max-old-space-size=4096 app.js

Best Practices

  • Chunk size matters more than you think. Start with 800–1200 characters, measure retrieval quality with a test set of questions, and adjust. Too small and you lose context. Too large and irrelevant text dilutes the embedding.

  • Always store the raw text alongside the vector. You need it for the LLM prompt, for debugging retrieval quality, and for re-embedding if you switch models. Vectors alone are useless without the text they represent.

  • Use cosine similarity, not L2 distance, for text embeddings. OpenAI's embeddings are normalized, so cosine similarity and dot product give identical rankings. But cosine similarity scores are intuitive (0 to 1 range) and easier to threshold.

  • Set a similarity threshold and enforce it. Do not blindly pass the top-K results to the LLM. If the best match scores 0.35, it is probably irrelevant. A threshold of 0.5–0.6 works well for most use cases with text-embedding-3-small.

  • Version your embeddings. Store the model name and any parameters (like dimensions) in your metadata, as shown in the sketch after this list. When OpenAI releases a new embedding model, you will want to re-embed everything — and you need to know which vectors are from which model.

  • Monitor token usage in production. Every RAG query costs embedding tokens (for the query) plus generation tokens (for the LLM call). Log these per-request. A query retrieving 5 chunks of 1000 characters each adds roughly 1200 tokens to your prompt, costing about $0.006 with GPT-4o.

  • Test retrieval separately from generation. Build a /search endpoint that returns chunks without calling the LLM. Create a test set of 20–50 questions with known-good source documents. Measure recall (did the right chunk appear in the top 5?) before worrying about answer quality.

  • Use HNSW indexes for production workloads. The recall improvement over IVFFlat is worth the extra memory. Set ef_construction to at least 128 during index build and hnsw.ef_search to at least 64 at query time.

  • Implement document-level deduplication. Before ingesting, check if the document already exists and whether its content has changed. Re-embedding unchanged documents wastes money and creates index churn.
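
A sketch of the embedding-versioning idea from the list above, using the metadata column that already exists (the file path and field names are illustrative):

var fs = require("fs");
var store = require("./store");

var content = fs.readFileSync("docs/express-middleware-guide.md", "utf-8");

// Record which model produced the vectors at ingestion time
store.ingestDocument("express-middleware-guide", content, {
    category: "nodejs",
    embeddingModel: "text-embedding-3-small",
    embeddingDimensions: 1536
});

// Later, retrieval can be restricted to one embedding version via the metadata filter:
// searchSimilar(question, { metadata: { embeddingModel: "text-embedding-3-small" } })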
