Implementing RAG with PostgreSQL and OpenAI Embeddings
A production-grade RAG implementation using PostgreSQL with pgvector for vector storage, OpenAI embeddings for semantic search, and GPT-4o for generation, all wired together with Node.js.
Overview
Retrieval-Augmented Generation (RAG) is the most practical way to give large language models access to your own data without the cost, complexity, and staleness problems of fine-tuning. Instead of baking knowledge into model weights, RAG retrieves relevant documents at query time and feeds them as context to the LLM. This article walks through a production-grade RAG implementation using PostgreSQL with pgvector for vector storage, OpenAI's embedding models for semantic search, and GPT-4o for generation — all wired together with Node.js.
Prerequisites
- Node.js v18+ installed
- PostgreSQL 15+ with superuser access
- An OpenAI API key with access to text-embedding-3-small and gpt-4o
- Basic familiarity with SQL and Express.js
- The psql CLI or a PostgreSQL client like pgAdmin
What RAG Is and Why It Beats Fine-Tuning
Fine-tuning trains a model on your data. RAG retrieves your data and hands it to the model at inference time. The difference matters enormously in practice.
Fine-tuning has three serious problems. First, it is expensive — you are paying to train a model, and you will pay again every time your data changes. Second, the model can hallucinate "facts" it absorbed during training with no way to trace where the information came from. Third, your data goes stale. The moment you add a new document, your fine-tuned model knows nothing about it until you retrain.
RAG solves all three. Your data lives in a database. When a user asks a question, you search that database for the most relevant chunks, stuff them into the prompt as context, and let the LLM synthesize an answer. The data is always current. You can cite sources. And you pay only for inference, not training.
The architecture looks like this:
User Query
│
▼
[Embed Query] ──► OpenAI Embeddings API
│
▼
[Vector Search] ──► PostgreSQL + pgvector
│
▼
[Retrieved Chunks] ──► Top-K relevant documents
│
▼
[Augmented Prompt] ──► System prompt + retrieved context + user query
│
▼
[LLM Generation] ──► GPT-4o
│
▼
Answer with Citations
RAG is the right choice when your data changes frequently, when you need source attribution, when you have domain-specific documents the model was never trained on, or when you cannot afford the latency and cost of fine-tuning cycles.
Setting Up PostgreSQL with pgvector
pgvector is a PostgreSQL extension that adds vector data types and similarity search operators. It is the most battle-tested vector store that runs inside a database you already know how to operate.
Install the extension. On Ubuntu/Debian:
sudo apt install postgresql-15-pgvector
On macOS with Homebrew:
brew install pgvector
On DigitalOcean Managed Databases, pgvector is available as a trusted extension — you just need to enable it.
Create the database and enable the extension:
CREATE DATABASE rag_demo;
\c rag_demo
CREATE EXTENSION IF NOT EXISTS vector;
Now create the table that will hold your document chunks and their embeddings. OpenAI's text-embedding-3-small produces 1536-dimensional vectors by default:
CREATE TABLE document_chunks (
id SERIAL PRIMARY KEY,
document_id VARCHAR(255) NOT NULL,
chunk_index INTEGER NOT NULL,
content TEXT NOT NULL,
metadata JSONB DEFAULT '{}',
embedding vector(1536),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE(document_id, chunk_index)
);
-- Full-text search support
ALTER TABLE document_chunks ADD COLUMN content_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX idx_content_tsv ON document_chunks USING GIN (content_tsv);
-- Metadata index for filtering
CREATE INDEX idx_metadata ON document_chunks USING GIN (metadata);
We will add vector indexes after discussing the tradeoffs between IVFFlat and HNSW later in this article.
Generating Embeddings with OpenAI
Install the required packages:
npm install openai pg dotenv express
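The code in this article reads its configuration from environment variables (loaded via dotenv in the Express app later). A minimal .env sketch; the variable names match the code below and the values are placeholders:
# .env (placeholder values)
OPENAI_API_KEY=your-openai-api-key
POSTGRES_CONNECTION_STRING=postgresql://postgres:password@localhost:5432/rag_demo
PORT=3000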
Here is the embedding module. It handles single texts and batches, with retry logic for rate limits:
// embeddings.js
var OpenAI = require("openai");
var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
var EMBEDDING_MODEL = "text-embedding-3-small";
var MAX_BATCH_SIZE = 100; // OpenAI allows up to 2048, but smaller batches are safer
function generateEmbedding(text) {
return client.embeddings.create({
model: EMBEDDING_MODEL,
input: text
}).then(function(response) {
return response.data[0].embedding;
});
}
function generateEmbeddingsBatch(texts) {
var batches = [];
for (var i = 0; i < texts.length; i += MAX_BATCH_SIZE) {
batches.push(texts.slice(i, i + MAX_BATCH_SIZE));
}
var results = [];
var chain = Promise.resolve();
batches.forEach(function(batch, batchIndex) {
chain = chain.then(function() {
console.log("Embedding batch " + (batchIndex + 1) + "/" + batches.length +
" (" + batch.length + " texts)");
return client.embeddings.create({
model: EMBEDDING_MODEL,
input: batch
});
}).then(function(response) {
response.data.forEach(function(item) {
results.push(item.embedding);
});
});
});
return chain.then(function() {
return results;
});
}
module.exports = {
generateEmbedding: generateEmbedding,
generateEmbeddingsBatch: generateEmbeddingsBatch,
EMBEDDING_MODEL: EMBEDDING_MODEL
};
The text-embedding-3-small model costs $0.02 per million tokens as of early 2026. For a corpus of 10,000 chunks averaging 300 tokens each, the total embedding cost is about $0.06. This is negligible compared to generation costs.
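A quick back-of-the-envelope helper makes that arithmetic explicit. The price is hard-coded here as an assumption taken from the figure above and will drift over time:
// cost-estimate.js: rough embedding cost, price hard-coded as an assumption
var PRICE_PER_MILLION_TOKENS = 0.02; // USD, text-embedding-3-small (see figure above)

function estimateEmbeddingCost(chunkCount, avgTokensPerChunk) {
  var totalTokens = chunkCount * avgTokensPerChunk;
  return (totalTokens / 1e6) * PRICE_PER_MILLION_TOKENS;
}

// 10,000 chunks at ~300 tokens each is 3M tokens, roughly $0.06
console.log("$" + estimateEmbeddingCost(10000, 300).toFixed(2));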
Document Chunking Strategies
Chunking is the most underappreciated part of a RAG pipeline. Bad chunks produce bad retrieval. Bad retrieval produces hallucinated answers. There are three main approaches.
Fixed-Size Chunking
Split text into chunks of N characters with overlap. Simple and predictable:
// chunking.js
function chunkFixedSize(text, chunkSize, overlap) {
chunkSize = chunkSize || 1000;
overlap = overlap || 200;
var chunks = [];
var start = 0;
while (start < text.length) {
var end = Math.min(start + chunkSize, text.length);
chunks.push(text.slice(start, end));
start += chunkSize - overlap;
}
return chunks;
}
This works but frequently splits sentences mid-thought. A chunk ending with "The server returns a 500 error when" is useless without what follows.
Sentence-Based Chunking
Group complete sentences until you hit a size limit. Better semantic coherence:
function chunkBySentence(text, maxChunkSize, overlap) {
maxChunkSize = maxChunkSize || 1000;
overlap = overlap || 1;
var sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
var chunks = [];
var currentChunk = [];
var currentSize = 0;
for (var i = 0; i < sentences.length; i++) {
var sentence = sentences[i].trim();
if (currentSize + sentence.length > maxChunkSize && currentChunk.length > 0) {
chunks.push(currentChunk.join(" "));
// Keep last N sentences as overlap
var overlapSentences = currentChunk.slice(-overlap);
currentChunk = overlapSentences;
currentSize = overlapSentences.join(" ").length;
}
currentChunk.push(sentence);
currentSize += sentence.length;
}
if (currentChunk.length > 0) {
chunks.push(currentChunk.join(" "));
}
return chunks;
}
Markdown-Aware Chunking
For markdown documents, respect the structure. Headers, code blocks, and lists should stay together when possible:
function chunkMarkdown(markdown, maxChunkSize) {
maxChunkSize = maxChunkSize || 1500;
// Split on headers while keeping the header with its content
var sections = markdown.split(/(?=^#{1,3}\s)/m);
var chunks = [];
sections.forEach(function(section) {
section = section.trim();
if (!section) return;
if (section.length <= maxChunkSize) {
chunks.push(section);
} else {
// Section too large — fall back to sentence chunking
var subChunks = chunkBySentence(section, maxChunkSize, 1);
chunks = chunks.concat(subChunks);
}
});
return chunks;
}
module.exports = {
chunkFixedSize: chunkFixedSize,
chunkBySentence: chunkBySentence,
chunkMarkdown: chunkMarkdown
};
My recommendation: use markdown-aware chunking for structured documents and sentence-based chunking for everything else. Target chunks of 500–1500 characters. Smaller chunks improve retrieval precision. Larger chunks provide more context per retrieval. Start with 1000 characters and adjust based on your retrieval quality.
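One way to pick a starting point is to run a representative document through each strategy and compare chunk counts and sizes before embedding anything. A small sketch using the chunking module above (the sample file path is a placeholder):
// compare-chunking.js: quick look at how each strategy splits the same document
var fs = require("fs");
var chunking = require("./chunking");

// Placeholder path: point this at any markdown document you have locally
var text = fs.readFileSync("./docs/sample.md", "utf-8");

["chunkFixedSize", "chunkBySentence", "chunkMarkdown"].forEach(function(name) {
  var chunks = chunking[name](text, 1000);
  var totalChars = chunks.reduce(function(sum, c) { return sum + c.length; }, 0);
  console.log(name + ": " + chunks.length + " chunks, avg " +
    Math.round(totalChars / chunks.length) + " chars");
});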
Storing and Indexing Embeddings
Database Layer
// db.js
var pg = require("pg");
var pool = new pg.Pool({
connectionString: process.env.POSTGRES_CONNECTION_STRING,
max: 20,
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000
});
function query(text, params) {
return pool.query(text, params);
}
function getClient() {
return pool.connect();
}
module.exports = {
query: query,
getClient: getClient,
pool: pool
};
Inserting Chunks with Embeddings
// store.js
var db = require("./db");
var embeddings = require("./embeddings");
var chunking = require("./chunking");
function ingestDocument(documentId, content, metadata) {
var chunks = chunking.chunkMarkdown(content);
console.log("Document " + documentId + ": " + chunks.length + " chunks");
var texts = chunks.map(function(chunk) { return chunk; });
return embeddings.generateEmbeddingsBatch(texts).then(function(vectors) {
var client;
return db.getClient().then(function(c) {
client = c;
return client.query("BEGIN");
}).then(function() {
// Delete existing chunks for this document (re-ingestion)
return client.query(
"DELETE FROM document_chunks WHERE document_id = $1",
[documentId]
);
}).then(function() {
var insertChain = Promise.resolve();
chunks.forEach(function(chunk, index) {
insertChain = insertChain.then(function() {
var vectorStr = "[" + vectors[index].join(",") + "]";
return client.query(
"INSERT INTO document_chunks (document_id, chunk_index, content, metadata, embedding) " +
"VALUES ($1, $2, $3, $4, $5::vector)",
[documentId, index, chunk, JSON.stringify(metadata), vectorStr]
);
});
});
return insertChain;
}).then(function() {
return client.query("COMMIT");
}).then(function() {
client.release();
console.log("Stored " + chunks.length + " chunks for " + documentId);
return chunks.length;
}).catch(function(err) {
// Roll back before releasing so a failed transaction is not returned to the pool half-open
return client.query("ROLLBACK").then(function() {
client.release();
throw err;
});
});
});
}
module.exports = {
ingestDocument: ingestDocument
};
IVFFlat vs HNSW Indexes
pgvector supports two index types for approximate nearest neighbor search. The choice matters.
IVFFlat partitions vectors into lists (clusters) and searches only the nearest lists. Faster to build, uses less memory, but requires training data:
-- IVFFlat: good for datasets under 1M vectors
-- lists = sqrt(num_rows) is a reasonable starting point
CREATE INDEX idx_embedding_ivfflat ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
HNSW (Hierarchical Navigable Small World) builds a graph structure. Higher recall, no training step, but uses more memory and takes longer to build:
-- HNSW: better recall, recommended for production
CREATE INDEX idx_embedding_hnsw ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 128);
For most RAG applications with fewer than a million documents, HNSW is the better choice. The recall difference is meaningful — an IVFFlat search might miss a relevant chunk that HNSW would find. The build time penalty only matters during initial ingestion.
Set the search parameter at query time to trade speed for accuracy:
-- For HNSW, higher ef = better recall, slower search
SET hnsw.ef_search = 100;
-- For IVFFlat, higher probes = better recall, slower search
SET ivfflat.probes = 10;
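Because these settings are session-scoped, a pooled Node.js application should scope them to a transaction with SET LOCAL so they do not leak to other requests that later reuse the same connection. A sketch, assuming the HNSW index above and the db module from earlier:
// Scope ef_search to a single transaction so the setting does not persist
// on the pooled connection after release()
var db = require("./db");

function searchWithEfSearch(vectorStr, limit) {
  return db.getClient().then(function(client) {
    return client.query("BEGIN")
      .then(function() { return client.query("SET LOCAL hnsw.ef_search = 100"); })
      .then(function() {
        return client.query(
          "SELECT id, content FROM document_chunks ORDER BY embedding <=> $1::vector LIMIT $2",
          [vectorStr, limit]
        );
      })
      .then(function(result) {
        return client.query("COMMIT").then(function() {
          client.release();
          return result.rows;
        });
      })
      .catch(function(err) {
        return client.query("ROLLBACK").then(function() {
          client.release();
          throw err;
        });
      });
  });
}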
Building the Similarity Search Layer
// search.js
var db = require("./db");
var embeddings = require("./embeddings");
function searchSimilar(queryText, options) {
var limit = (options && options.limit) || 5;
var threshold = (options && options.threshold) || 0.3;
var metadataFilter = (options && options.metadata) || null;
return embeddings.generateEmbedding(queryText).then(function(queryVector) {
var vectorStr = "[" + queryVector.join(",") + "]";
var sql = "SELECT id, document_id, chunk_index, content, metadata, " +
"1 - (embedding <=> $1::vector) AS similarity " +
"FROM document_chunks " +
"WHERE 1 - (embedding <=> $1::vector) > $2";
var params = [vectorStr, threshold];
var paramIndex = 3;
if (metadataFilter) {
sql += " AND metadata @> $" + paramIndex;
params.push(JSON.stringify(metadataFilter));
paramIndex++;
}
sql += " ORDER BY embedding <=> $1::vector LIMIT $" + paramIndex;
params.push(limit);
return db.query(sql, params);
}).then(function(result) {
return result.rows;
});
}
module.exports = {
searchSimilar: searchSimilar
};
The <=> operator computes cosine distance. We convert it to similarity (1 - distance) so higher numbers mean better matches. A similarity of 0.85 is a strong match. Below 0.5 is usually noise.
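A quick usage sketch that makes those numbers concrete; the query string is only an example:
// try-search.js: inspect raw retrieval quality before involving the LLM
require("dotenv").config();
var search = require("./search");

search.searchSimilar("How do I configure connection pooling?", { limit: 5, threshold: 0.5 })
  .then(function(rows) {
    rows.forEach(function(row) {
      console.log(row.similarity.toFixed(3) + "  " + row.document_id + "#" + row.chunk_index);
    });
  })
  .catch(function(err) {
    console.error("Search failed:", err.message);
  });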
The Full RAG Pipeline
This is where retrieval meets generation. The quality of your system prompt and the way you format retrieved context will make or break your RAG application.
// rag.js
var OpenAI = require("openai");
var search = require("./search");
var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
var SYSTEM_PROMPT = "You are a technical assistant. Answer the user's question " +
"based ONLY on the provided context documents. If the context does not contain " +
"enough information to answer, say so explicitly. Do not make up information. " +
"Cite the source document IDs when referencing specific information.";
function buildContext(chunks) {
return chunks.map(function(chunk, index) {
return "--- Document: " + chunk.document_id +
" (chunk " + chunk.chunk_index + ", relevance: " +
(chunk.similarity * 100).toFixed(1) + "%) ---\n" +
chunk.content;
}).join("\n\n");
}
function queryRAG(question, options) {
var searchOptions = {
limit: (options && options.topK) || 5,
threshold: (options && options.threshold) || 0.3,
metadata: (options && options.metadata) || null
};
var retrievedChunks;
return search.searchSimilar(question, searchOptions).then(function(chunks) {
retrievedChunks = chunks;
if (chunks.length === 0) {
return {
answer: "I could not find any relevant information to answer your question.",
sources: [],
tokensUsed: 0
};
}
var context = buildContext(chunks);
var messages = [
{ role: "system", content: SYSTEM_PROMPT },
{
role: "user",
content: "Context:\n" + context + "\n\nQuestion: " + question
}
];
return client.chat.completions.create({
model: "gpt-4o",
messages: messages,
temperature: 0.1,
max_tokens: 1500
});
}).then(function(response) {
if (response.answer) return response; // No chunks case
var sources = retrievedChunks.map(function(chunk) {
return {
documentId: chunk.document_id,
chunkIndex: chunk.chunk_index,
similarity: chunk.similarity,
preview: chunk.content.substring(0, 150) + "..."
};
});
return {
answer: response.choices[0].message.content,
sources: sources,
tokensUsed: response.usage.total_tokens
};
});
}
module.exports = {
queryRAG: queryRAG
};
A few critical decisions in this code:
- Temperature 0.1: Low temperature keeps the model grounded in the retrieved context. Higher temperatures lead to more creative (read: hallucinated) answers.
- Explicit grounding instruction: "Answer based ONLY on the provided context" is essential. Without it, the model will happily answer from its training data and ignore your documents.
- Source attribution: Passing document IDs into the context makes it possible for the model to cite sources, and for your UI to link back to original documents.
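To exercise the pipeline end to end from a standalone script (the question is illustrative), something like this works:
// try-rag.js: ask one question and print the answer with its sources
require("dotenv").config();
var rag = require("./rag");

rag.queryRAG("How do I create error handling middleware in Express?", { topK: 5 })
  .then(function(result) {
    console.log(result.answer + "\n");
    result.sources.forEach(function(s) {
      console.log("Source: " + s.documentId + " (chunk " + s.chunkIndex + ", " +
        (s.similarity * 100).toFixed(1) + "%)");
    });
    console.log("Tokens used: " + result.tokensUsed);
  })
  .catch(function(err) {
    console.error("RAG query failed:", err.message);
  });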
Hybrid Search: Combining Full-Text and Vector Similarity
Pure vector search has a weakness: it can miss exact keyword matches that a user expects to find. The query "error code PG-4502" embeds to a general "PostgreSQL error" vector, but the user wants that specific code. Hybrid search combines both approaches.
// hybrid-search.js
var db = require("./db");
var embeddings = require("./embeddings");
function hybridSearch(queryText, options) {
var limit = (options && options.limit) || 5;
var vectorWeight = (options && options.vectorWeight) || 0.7;
var textWeight = 1.0 - vectorWeight;
return embeddings.generateEmbedding(queryText).then(function(queryVector) {
var vectorStr = "[" + queryVector.join(",") + "]";
// Reciprocal Rank Fusion (RRF) combines the two result sets
var sql = "WITH vector_results AS (" +
" SELECT id, content, document_id, chunk_index, metadata, " +
" ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS vector_rank " +
" FROM document_chunks " +
" ORDER BY embedding <=> $1::vector " +
" LIMIT 20" +
"), " +
"text_results AS (" +
" SELECT id, content, document_id, chunk_index, metadata, " +
" ROW_NUMBER() OVER (ORDER BY ts_rank(content_tsv, websearch_to_tsquery('english', $2)) DESC) AS text_rank " +
" FROM document_chunks " +
" WHERE content_tsv @@ websearch_to_tsquery('english', $2) " +
" ORDER BY ts_rank(content_tsv, websearch_to_tsquery('english', $2)) DESC " +
" LIMIT 20" +
") " +
"SELECT COALESCE(v.id, t.id) AS id, " +
" COALESCE(v.content, t.content) AS content, " +
" COALESCE(v.document_id, t.document_id) AS document_id, " +
" COALESCE(v.chunk_index, t.chunk_index) AS chunk_index, " +
" COALESCE(v.metadata, t.metadata) AS metadata, " +
" (COALESCE($3::float / (60 + v.vector_rank), 0) + " +
" COALESCE($4::float / (60 + t.text_rank), 0)) AS rrf_score " +
"FROM vector_results v " +
"FULL OUTER JOIN text_results t ON v.id = t.id " +
"ORDER BY rrf_score DESC " +
"LIMIT $5";
return db.query(sql, [vectorStr, queryText, vectorWeight, textWeight, limit]);
}).then(function(result) {
return result.rows;
});
}
module.exports = {
hybridSearch: hybridSearch
};
Reciprocal Rank Fusion (RRF) is the standard way to merge ranked lists. The constant 60 in the denominator prevents top-ranked results from dominating too aggressively. This approach consistently outperforms either search method alone in my testing — typically 10–15% better retrieval precision on mixed query types.
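To make the scoring concrete, here is the same weighted RRF formula computed in plain JavaScript for a chunk that ranks 2nd in the vector list and 5th in the full-text list, using the default 0.7/0.3 weights:
// rrf-example.js: the weighted RRF score from the SQL above, in isolation
var K = 60; // damping constant so rank 1 does not dwarf ranks 2, 3, ...

function rrfScore(vectorRank, textRank, vectorWeight, textWeight) {
  var score = 0;
  if (vectorRank !== null) score += vectorWeight / (K + vectorRank);
  if (textRank !== null) score += textWeight / (K + textRank);
  return score;
}

// Ranked 2nd by vector search and 5th by full-text search, weights 0.7 / 0.3:
console.log(rrfScore(2, 5, 0.7, 0.3).toFixed(4)); // 0.0159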
Relevance Scoring and Re-Ranking
The initial retrieval gets you candidates. Re-ranking improves precision by using a more expensive scoring function on the smaller candidate set:
// reranker.js
var OpenAI = require("openai");
var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
function rerankChunks(query, chunks, topN) {
topN = topN || 3;
if (chunks.length <= topN) return Promise.resolve(chunks);
var prompt = "Given the query: \"" + query + "\"\n\n" +
"Rate the relevance of each document chunk on a scale of 0-10. " +
"Respond with ONLY a JSON array of objects with 'index' and 'score' fields.\n\n";
chunks.forEach(function(chunk, i) {
prompt += "Chunk " + i + ":\n" + chunk.content.substring(0, 500) + "\n\n";
});
return client.chat.completions.create({
model: "gpt-4o-mini",
messages: [{ role: "user", content: prompt }],
temperature: 0,
response_format: { type: "json_object" }
}).then(function(response) {
var scores = JSON.parse(response.choices[0].message.content);
var scoreArray = scores.scores || scores;
// Merge scores back and sort
scoreArray.forEach(function(item) {
if (chunks[item.index]) {
chunks[item.index].rerank_score = item.score;
}
});
chunks.sort(function(a, b) {
return (b.rerank_score || 0) - (a.rerank_score || 0);
});
return chunks.slice(0, topN);
});
}
module.exports = {
rerankChunks: rerankChunks
};
Re-ranking adds latency and cost (one additional LLM call), so use it selectively. It is most valuable when your initial retrieval returns chunks with similar similarity scores — the re-ranker breaks ties using deeper semantic understanding.
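A typical arrangement is to over-fetch candidates from the vector search and keep only the top few after re-ranking. A sketch combining the modules above (the query is illustrative):
// retrieve-rerank.js: over-fetch candidates, then narrow them with the re-ranker
require("dotenv").config();
var search = require("./search");
var reranker = require("./reranker");

function retrieveWithRerank(question) {
  // Pull a wider candidate set than you intend to pass to the LLM
  return search.searchSimilar(question, { limit: 10, threshold: 0.3 })
    .then(function(candidates) {
      return reranker.rerankChunks(question, candidates, 3);
    });
}

retrieveWithRerank("How does connection pooling interact with transactions?")
  .then(function(topChunks) {
    topChunks.forEach(function(chunk) {
      console.log((chunk.rerank_score || 0) + "  " + chunk.document_id + "#" + chunk.chunk_index);
    });
  });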
Performance Optimization
Batch Embedding with Connection Pooling
When ingesting large document sets, batch your database operations:
function batchInsertChunks(client, chunks, vectors, documentId, metadata) {
var values = [];
var placeholders = [];
var paramIndex = 1;
chunks.forEach(function(chunk, i) {
var vectorStr = "[" + vectors[i].join(",") + "]";
placeholders.push(
"($" + paramIndex + ", $" + (paramIndex + 1) + ", $" +
(paramIndex + 2) + ", $" + (paramIndex + 3) + ", $" +
(paramIndex + 4) + "::vector)"
);
values.push(documentId, i, chunk, JSON.stringify(metadata), vectorStr);
paramIndex += 5;
});
var sql = "INSERT INTO document_chunks (document_id, chunk_index, content, metadata, embedding) " +
"VALUES " + placeholders.join(", ");
return client.query(sql, values);
}
Index Tuning Reference
| Dataset Size | Index Type | Parameters | Build Time | Query Time (p95) |
|---|---|---|---|---|
| 10K vectors | HNSW | m=16, ef_construction=64 | ~5s | ~2ms |
| 100K vectors | HNSW | m=16, ef_construction=128 | ~45s | ~5ms |
| 1M vectors | HNSW | m=32, ef_construction=200 | ~15min | ~12ms |
| 10K vectors | IVFFlat | lists=100 | ~2s | ~3ms |
| 100K vectors | IVFFlat | lists=316 | ~10s | ~6ms |
| 1M vectors | IVFFlat | lists=1000 | ~2min | ~15ms |
These are approximate numbers from a 4-core, 8GB RAM machine. Your results will vary, but the relative differences hold.
Monitoring Query Performance
-- Check index usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content, 1 - (embedding <=> '[0.1, 0.2, ...]'::vector) AS similarity
FROM document_chunks
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;
-- Expected output with HNSW index:
-- Index Scan using idx_embedding_hnsw on document_chunks
-- Rows: 5 Loops: 1
-- Planning Time: 0.2 ms
-- Execution Time: 4.8 ms
If you see Seq Scan instead of Index Scan, your index is not being used. Common causes: the table has fewer rows than the index cost threshold (try SET enable_seqscan = off to confirm), or you are filtering before the vector search in a way that prevents index use.
Complete Working Example
Here is the full application wired together as an Express server:
// app.js
var express = require("express");
var dotenv = require("dotenv");
var fs = require("fs");
var path = require("path");
dotenv.config();
var db = require("./db");
var store = require("./store");
var rag = require("./rag");
var hybridSearch = require("./hybrid-search");
var app = express();
app.use(express.json({ limit: "5mb" }));
// Health check
app.get("/health", function(req, res) {
db.query("SELECT 1").then(function() {
res.json({ status: "ok", timestamp: new Date().toISOString() });
}).catch(function(err) {
res.status(500).json({ status: "error", message: err.message });
});
});
// Ingest a markdown document
app.post("/api/ingest", function(req, res) {
var documentId = req.body.documentId;
var content = req.body.content;
var metadata = req.body.metadata || {};
if (!documentId || !content) {
return res.status(400).json({ error: "documentId and content are required" });
}
var startTime = Date.now();
store.ingestDocument(documentId, content, metadata).then(function(chunkCount) {
var elapsed = Date.now() - startTime;
res.json({
documentId: documentId,
chunksStored: chunkCount,
processingTimeMs: elapsed
});
}).catch(function(err) {
console.error("Ingestion error:", err);
res.status(500).json({ error: err.message });
});
});
// Ingest markdown files from a directory
app.post("/api/ingest-directory", function(req, res) {
var dirPath = req.body.path;
if (!dirPath || !fs.existsSync(dirPath)) {
return res.status(400).json({ error: "Valid directory path is required" });
}
var files = fs.readdirSync(dirPath).filter(function(f) {
return f.endsWith(".md");
});
var totalChunks = 0;
var startTime = Date.now();
var chain = Promise.resolve();
files.forEach(function(file) {
chain = chain.then(function() {
var filePath = path.join(dirPath, file);
var content = fs.readFileSync(filePath, "utf-8");
var documentId = path.basename(file, ".md");
return store.ingestDocument(documentId, content, {
source: file,
ingestedAt: new Date().toISOString()
}).then(function(count) {
totalChunks += count;
});
});
});
chain.then(function() {
var elapsed = Date.now() - startTime;
res.json({
filesProcessed: files.length,
totalChunks: totalChunks,
processingTimeMs: elapsed
});
}).catch(function(err) {
console.error("Directory ingestion error:", err);
res.status(500).json({ error: err.message });
});
});
// Query the RAG system
app.post("/api/query", function(req, res) {
var question = req.body.question;
var topK = req.body.topK || 5;
if (!question) {
return res.status(400).json({ error: "question is required" });
}
var startTime = Date.now();
rag.queryRAG(question, {
topK: topK,
threshold: 0.3
}).then(function(result) {
var elapsed = Date.now() - startTime;
result.processingTimeMs = elapsed;
res.json(result);
}).catch(function(err) {
console.error("Query error:", err);
res.status(500).json({ error: err.message });
});
});
// Search without generation (useful for debugging retrieval)
app.post("/api/search", function(req, res) {
var query = req.body.query;
var limit = req.body.limit || 10;
if (!query) {
return res.status(400).json({ error: "query is required" });
}
hybridSearch.hybridSearch(query, { limit: limit }).then(function(results) {
res.json({ results: results });
}).catch(function(err) {
res.status(500).json({ error: err.message });
});
});
var PORT = process.env.PORT || 3000;
app.listen(PORT, function() {
console.log("RAG server running on port " + PORT);
});
Testing the Application
Ingest a document:
curl -X POST http://localhost:3000/api/ingest \
-H "Content-Type: application/json" \
-d '{
"documentId": "express-middleware-guide",
"content": "# Express Middleware Guide\n\nMiddleware functions are functions that have access to the request object, the response object, and the next middleware function...",
"metadata": {"category": "nodejs", "author": "shane"}
}'
Response:
{
"documentId": "express-middleware-guide",
"chunksStored": 12,
"processingTimeMs": 1847
}
Query the system:
curl -X POST http://localhost:3000/api/query \
-H "Content-Type: application/json" \
-d '{"question": "How do I create error handling middleware in Express?"}'
Response:
{
"answer": "To create error handling middleware in Express, define a function with four parameters: err, req, res, and next. Express recognizes middleware as error-handling specifically because it has four arguments instead of three...",
"sources": [
{
"documentId": "express-middleware-guide",
"chunkIndex": 7,
"similarity": 0.891,
"preview": "## Error Handling Middleware\n\nError handling middleware is defined with four arguments instead of three..."
}
],
"tokensUsed": 847,
"processingTimeMs": 2341
}
Common Issues and Troubleshooting
1. "ERROR: type 'vector' does not exist"
error: type "vector" does not exist
at Parser.parseErrorMessage (node_modules/pg-protocol/src/parser.ts:369:69)
The pgvector extension is not installed or not enabled in your database. Run CREATE EXTENSION IF NOT EXISTS vector; while connected to the specific database, not the default postgres database. On managed databases, check that pgvector is available in your provider's extension list.
2. "ERROR: different vector dimensions 1536 and 768"
error: expected 1536 dimensions, not 768
Your column is defined as vector(1536) but you are inserting vectors from a different model. The text-embedding-3-small model defaults to 1536 dimensions. If you switched models or used OpenAI's dimensions parameter to reduce the size, your column definition must match. Either recreate the column or re-embed all existing data with the same model and dimensions.
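If you deliberately want smaller vectors, for storage or index size, the text-embedding-3 models accept a dimensions parameter in the embeddings call, but the column definition has to change with it. A sketch of the reduced-size call; the matching column would need to be declared as vector(768), and every existing row re-embedded at that size:
// Request 768-dimensional embeddings; the table column must be vector(768),
// and existing rows must be re-embedded at the same size with the same model.
var OpenAI = require("openai");
var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

function embedReduced(text) {
  return client.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
    dimensions: 768 // shrink from the default 1536
  }).then(function(response) {
    return response.data[0].embedding; // 768 numbers
  });
}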
3. Sequential Scan Instead of Index Scan
EXPLAIN SELECT * FROM document_chunks ORDER BY embedding <=> '[...]'::vector LIMIT 5;
-- Shows: Seq Scan on document_chunks
This happens when the table has fewer rows than PostgreSQL's cost estimator thinks justify an index scan (typically under 1000 rows), or when you have not run ANALYZE after creating the index. Fix:
ANALYZE document_chunks;
-- If still sequential:
SET ivfflat.probes = 10; -- or SET hnsw.ef_search = 100;
-- Force index usage for testing:
SET enable_seqscan = off;
4. "429 Too Many Requests" from OpenAI
Error: 429 Rate limit exceeded. You are sending requests too quickly.
The OpenAI embeddings API has rate limits based on your tier. Add exponential backoff:
function withRetry(fn, maxRetries) {
maxRetries = maxRetries || 3;
var attempt = 0;
function tryCall() {
return fn().catch(function(err) {
if (err.status === 429 && attempt < maxRetries) {
attempt++;
var delay = Math.pow(2, attempt) * 1000;
console.log("Rate limited, retrying in " + delay + "ms...");
return new Promise(function(resolve) {
setTimeout(resolve, delay);
}).then(tryCall);
}
throw err;
});
}
return tryCall();
}
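Usage is a matter of wrapping the call in a function so the helper can re-invoke it, for example around generateEmbedding from earlier:
// Usage: pass a function that returns the promise, so it can be re-invoked on each retry
var embeddings = require("./embeddings");

withRetry(function() {
  return embeddings.generateEmbedding("some chunk text");
}, 5).then(function(vector) {
  console.log("Embedded after retries: " + vector.length + " dimensions");
});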
5. Out of Memory During Bulk Ingestion
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
If you are ingesting thousands of documents, you are likely holding all embeddings in memory. Process documents in smaller batches and release references between batches. Increase Node's heap size as a stopgap:
node --max-old-space-size=4096 app.js
Best Practices
Chunk size matters more than you think. Start with 800–1200 characters, measure retrieval quality with a test set of questions, and adjust. Too small and you lose context. Too large and irrelevant text dilutes the embedding.
Always store the raw text alongside the vector. You need it for the LLM prompt, for debugging retrieval quality, and for re-embedding if you switch models. Vectors alone are useless without the text they represent.
Use cosine similarity, not L2 distance, for text embeddings. OpenAI's embeddings are normalized, so cosine similarity and dot product give identical rankings. But cosine similarity scores are intuitive (0 to 1 range) and easier to threshold.
Set a similarity threshold and enforce it. Do not blindly pass the top-K results to the LLM. If the best match scores 0.35, it is probably irrelevant. A threshold of 0.5–0.6 works well for most use cases with text-embedding-3-small.
Version your embeddings. Store the model name and any parameters (like dimensions) in your metadata. When OpenAI releases a new embedding model, you will want to re-embed everything — and you need to know which vectors are from which model.
Monitor token usage in production. Every RAG query costs embedding tokens (for the query) plus generation tokens (for the LLM call). Log these per-request. A query retrieving 5 chunks of 1000 characters each adds roughly 1200 tokens to your prompt, costing about $0.006 with GPT-4o.
Test retrieval separately from generation. Build a /search endpoint that returns chunks without calling the LLM. Create a test set of 20–50 questions with known-good source documents. Measure recall (did the right chunk appear in the top 5?) before worrying about answer quality.
Use HNSW indexes for production workloads. The recall improvement over IVFFlat is worth the extra memory. Set ef_construction to at least 128 during index build and hnsw.ef_search to at least 64 at query time.
Implement document-level deduplication. Before ingesting, check if the document already exists and whether its content has changed. Re-embedding unchanged documents wastes money and creates index churn.
References
- pgvector GitHub Repository — Installation, operators, and index documentation
- OpenAI Embeddings Guide — Model options, dimensions, and pricing
- OpenAI Cookbook: Question Answering with Embeddings — Official RAG tutorial
- PostgreSQL Full-Text Search Documentation — tsvector, tsquery, and ranking functions
- Reciprocal Rank Fusion Paper — The RRF algorithm used in hybrid search
- node-postgres (pg) Documentation — Connection pooling, parameterized queries, and transactions
- HNSW Algorithm Paper — The graph-based index structure used by pgvector
