Cost-Effective LLM API Integration Patterns
Battle-tested patterns for integrating LLM APIs into Node.js applications while keeping costs under control, covering semantic caching, model tiering, batch processing, and budget enforcement.
Overview
LLM APIs are powerful but expensive -- a single poorly-optimized production workload can burn through thousands of dollars in a month without careful engineering. This article covers battle-tested patterns for integrating LLM APIs into Node.js applications while keeping costs under control. Whether you are building a customer-facing chatbot, a content pipeline, or an internal tool that leverages AI, these patterns will help you ship features without blowing your budget.
Prerequisites
- Node.js v18+ installed
- Basic familiarity with REST APIs and async/await in Node.js
- A Redis instance (local or hosted) for caching examples
- API keys for OpenAI and/or Anthropic (free tier is fine for testing)
- npm for package management
Install the dependencies we will use throughout this article:
npm install openai @anthropic-ai/sdk tiktoken redis express
Understanding LLM API Pricing Models
Before you can optimize costs, you need to understand how you are being charged. Every major LLM provider charges per token, not per request. A token is roughly 3/4 of a word in English -- short common words like "I" are a single token, while longer or rarer words split into several.
There are two token counts that matter on every API call:
| Metric | Description | Typical Cost Ratio |
|---|---|---|
| Input tokens | Your prompt, system message, and any context you send | Lower cost (often 2-10x cheaper than output) |
| Output tokens | The model's response | Higher cost |
Here is a simplified pricing comparison as of early 2026:
{
"openai": {
"gpt-4o": { "input_per_1M": 2.50, "output_per_1M": 10.00 },
"gpt-4o-mini": { "input_per_1M": 0.15, "output_per_1M": 0.60 },
"gpt-4.1": { "input_per_1M": 2.00, "output_per_1M": 8.00 },
"gpt-4.1-mini": { "input_per_1M": 0.40, "output_per_1M": 1.60 },
"gpt-4.1-nano": { "input_per_1M": 0.10, "output_per_1M": 0.40 }
},
"anthropic": {
"claude-sonnet-4": { "input_per_1M": 3.00, "output_per_1M": 15.00 },
"claude-haiku-3.5": { "input_per_1M": 0.80, "output_per_1M": 4.00 }
}
}
The key insight: input tokens are cheap, output tokens are expensive. This single fact drives most optimization strategies. A verbose prompt that produces a concise response is almost always cheaper than a terse prompt that causes the model to ramble.
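To make the point concrete, here is the arithmetic at the GPT-4o rates above: a verbose prompt with a tight answer versus a terse prompt with a rambling one.
// Two calls against gpt-4o ($2.50/1M input, $10.00/1M output):
// A: verbose 500-token prompt that elicits a tight 100-token answer
// B: terse 100-token prompt that lets the model ramble for 500 tokens
var rate = { input: 2.50, output: 10.00 };
function costUSD(inputTokens, outputTokens) {
  return (inputTokens / 1e6) * rate.input + (outputTokens / 1e6) * rate.output;
}
console.log(costUSD(500, 100).toFixed(6)); // 0.002250
console.log(costUSD(100, 500).toFixed(6)); // 0.005250 -- more than double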
Batch vs. Real-Time
Both OpenAI and Anthropic offer batch processing APIs at a 50% discount. If your workload does not need an immediate response -- think content generation, data classification, summarization pipelines -- you should always use batch endpoints. We will cover this in detail below.
Token Counting and Estimation Before API Calls
Never send a request without knowing roughly how many tokens it will consume. The tiktoken library gives you exact counts for OpenAI models, and the counts are close enough for Anthropic as well.
var tiktoken = require("tiktoken");
var encoder = tiktoken.encoding_for_model("gpt-4o");
function countTokens(text) {
var tokens = encoder.encode(text);
return tokens.length;
}
function estimateCost(inputText, estimatedOutputTokens, model) {
var pricing = {
"gpt-4o": { input: 2.50, output: 10.00 },
"gpt-4o-mini": { input: 0.15, output: 0.60 },
"gpt-4.1-nano": { input: 0.10, output: 0.40 },
"claude-sonnet-4": { input: 3.00, output: 15.00 },
"claude-haiku-3.5": { input: 0.80, output: 4.00 }
};
var modelPricing = pricing[model];
if (!modelPricing) {
throw new Error("Unknown model: " + model);
}
var inputTokens = countTokens(inputText);
var inputCost = (inputTokens / 1000000) * modelPricing.input;
var outputCost = (estimatedOutputTokens / 1000000) * modelPricing.output;
return {
inputTokens: inputTokens,
estimatedOutputTokens: estimatedOutputTokens,
estimatedCost: inputCost + outputCost,
model: model
};
}
// Usage
var prompt = "Explain the difference between REST and GraphQL in two sentences.";
var estimate = estimateCost(prompt, 100, "gpt-4o-mini");
console.log(estimate);
// { inputTokens: 12, estimatedOutputTokens: 100, estimatedCost: 0.0000618, model: 'gpt-4o-mini' }
This lets you make informed decisions before every API call. In production, I log these estimates and compare them against actual usage to calibrate my output token estimates over time.
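A minimal sketch of that calibration loop is below. The logUsageDelta helper is hypothetical; in practice you would send the record to your metrics pipeline rather than the console.
// Hypothetical helper: compare the pre-call estimate with the usage the API reports.
function logUsageDelta(estimate, response) {
  var actualInput = response.usage.prompt_tokens;
  var actualOutput = response.usage.completion_tokens;
  console.log(JSON.stringify({
    model: estimate.model,
    estimatedInput: estimate.inputTokens,
    actualInput: actualInput,
    inputDeltaPct: ((actualInput - estimate.inputTokens) / estimate.inputTokens * 100).toFixed(1),
    estimatedOutput: estimate.estimatedOutputTokens,
    actualOutput: actualOutput
  }));
}
// Usage, once the API call resolves:
// logUsageDelta(estimate, response);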
Prompt Optimization to Reduce Token Usage
Prompt engineering is not just about getting better results -- it is about getting the same results with fewer tokens. Here are concrete techniques:
1. Set a max_tokens Limit
Always set max_tokens on your API calls. Without it, models will generate until they hit the context window limit, and you pay for every token.
var OpenAI = require("openai");
var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
function classifyText(text) {
return client.chat.completions.create({
model: "gpt-4o-mini",
max_tokens: 10, // Classification only needs a few tokens
messages: [
{
role: "system",
content: "Classify the sentiment as: positive, negative, or neutral. Reply with one word only."
},
{ role: "user", content: text }
]
});
}
2. Use Structured Output Instructions
Tell the model exactly what format you want. This prevents wasted tokens on preambles like "Sure! Here's the answer to your question..."
var systemPrompt = [
"You are a JSON-only responder.",
"Return a JSON object with these exact keys: category, confidence, reasoning.",
"Do not include any text outside the JSON object.",
"Keep reasoning under 20 words."
].join(" ");
3. Compress Context With Summaries
If you are passing conversation history or document context, summarize older content rather than passing it verbatim. A 10,000-token document can often be summarized into 500 tokens without losing the information the model needs.
function compressHistory(messages, maxTokens) {
var totalTokens = 0;
var compressed = [];
// Keep most recent messages intact, summarize older ones
for (var i = messages.length - 1; i >= 0; i--) {
var msgTokens = countTokens(messages[i].content);
if (totalTokens + msgTokens > maxTokens && compressed.length > 3) {
// Collapse remaining older messages into one entry (crude truncation here;
// see the summarization sketch below for the stronger variant)
var olderContent = messages.slice(0, i + 1)
.map(function(m) { return m.role + ": " + m.content; })
.join("\n");
compressed.unshift({
role: "system",
content: "Summary of earlier conversation: " + olderContent.substring(0, 500)
});
break;
}
totalTokens += msgTokens;
compressed.unshift(messages[i]);
}
return compressed;
}
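The substring(0, 500) truncation above is only a stand-in. A sketch of true summarization, sending the older turns to a cheap model first (reusing the OpenAI client from the max_tokens example):
// Replace the truncation with an actual LLM summary of the older turns.
function summarizeOlderMessages(olderMessages) {
  var transcript = olderMessages
    .map(function(m) { return m.role + ": " + m.content; })
    .join("\n");
  return client.chat.completions.create({
    model: "gpt-4o-mini", // summaries do not need a frontier model
    max_tokens: 200,      // hard cap on what the summary itself can cost
    messages: [
      { role: "system", content: "Summarize this conversation in under 100 words, keeping names, decisions, and open questions." },
      { role: "user", content: transcript }
    ]
  }).then(function(response) {
    return {
      role: "system",
      content: "Summary of earlier conversation: " + response.choices[0].message.content
    };
  });
}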
Semantic Caching With Redis
This is the single highest-impact optimization for most applications. If two users ask essentially the same question, why pay for the same API call twice?
Simple string matching will not work -- "What is Node.js?" and "Can you explain Node.js?" are the same question. You need semantic similarity matching. The approach: generate an embedding for each query, store it alongside the cached response, and check incoming queries against cached embeddings using cosine similarity.
var Redis = require("redis");
var OpenAI = require("openai");
var crypto = require("crypto");
var redis = Redis.createClient({ url: process.env.REDIS_URL || "redis://localhost:6379" });
redis.connect(); // node-redis v4+ requires an explicit connect before issuing commands
var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
var CACHE_TTL = 3600; // 1 hour
var SIMILARITY_THRESHOLD = 0.92;
function cosineSimilarity(a, b) {
var dotProduct = 0;
var normA = 0;
var normB = 0;
for (var i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
function getEmbedding(text) {
return openai.embeddings.create({
model: "text-embedding-3-small",
input: text
}).then(function(response) {
return response.data[0].embedding;
});
}
function getCachedResponse(query) {
return getEmbedding(query).then(function(queryEmbedding) {
return redis.keys("llm_cache:*").then(function(keys) {
var bestMatch = null;
var bestSimilarity = 0;
var promises = keys.map(function(key) {
return redis.hGetAll(key).then(function(cached) {
var cachedEmbedding = JSON.parse(cached.embedding);
var similarity = cosineSimilarity(queryEmbedding, cachedEmbedding);
if (similarity > bestSimilarity && similarity >= SIMILARITY_THRESHOLD) {
bestSimilarity = similarity;
bestMatch = {
response: cached.response,
similarity: similarity,
originalQuery: cached.query
};
}
});
});
return Promise.all(promises).then(function() {
return { match: bestMatch, embedding: queryEmbedding };
});
});
});
}
function cacheResponse(query, response, embedding) {
var key = "llm_cache:" + crypto.createHash("sha256").update(query).digest("hex");
return redis.hSet(key, {
query: query,
response: response,
embedding: JSON.stringify(embedding),
timestamp: Date.now().toString()
}).then(function() {
return redis.expire(key, CACHE_TTL);
});
}
In production, this cache saves 30-60% of API costs depending on how repetitive your traffic is. E-commerce product Q&A, FAQ bots, and documentation assistants see the highest cache hit rates.
Performance note: Scanning all keys for similarity is fine up to a few thousand cache entries. Beyond that, use a vector database like Pinecone or pgvector instead of the brute-force scan shown here.
Model Tiering Strategy
Not every task needs GPT-4o or Claude Sonnet. Here is a practical tiering system:
var MODEL_TIERS = {
simple: {
model: "gpt-4.1-nano",
maxTokens: 150,
description: "Classification, extraction, yes/no questions",
costPer1kOutput: 0.0004
},
standard: {
model: "gpt-4o-mini",
maxTokens: 1000,
description: "Summarization, simple generation, translation",
costPer1kOutput: 0.0006
},
complex: {
model: "gpt-4o",
maxTokens: 4000,
description: "Analysis, creative writing, code generation, reasoning",
costPer1kOutput: 0.01
}
};
function selectModel(taskType, inputTokens) {
// Auto-downgrade if input is very short (likely a simple task)
if (inputTokens < 50 && taskType !== "complex") {
return MODEL_TIERS.simple;
}
return MODEL_TIERS[taskType] || MODEL_TIERS.standard;
}
// Usage
var tier = selectModel("simple", 25);
console.log("Using model: " + tier.model);
// Using model: gpt-4.1-nano
The cost difference is dramatic. A task that costs $0.01 with GPT-4o costs $0.0006 with GPT-4o-mini -- a 16x reduction. Over millions of requests, this adds up to thousands of dollars saved.
My rule of thumb: start every new feature with the cheapest model that could work, then upgrade only when quality metrics show you need to.
Batch API Processing for Non-Real-Time Workloads
OpenAI's Batch API gives you a 50% discount in exchange for results delivered within 24 hours (usually much faster). Perfect for content pipelines, data processing, and scheduled jobs.
var fs = require("fs");
var OpenAI = require("openai");
var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
function createBatchFile(tasks) {
var lines = tasks.map(function(task, index) {
return JSON.stringify({
custom_id: "task-" + index,
method: "POST",
url: "/v1/chat/completions",
body: {
model: task.model || "gpt-4o-mini",
max_tokens: task.maxTokens || 500,
messages: task.messages
}
});
});
var filePath = "/tmp/batch_input_" + Date.now() + ".jsonl";
fs.writeFileSync(filePath, lines.join("\n"));
return filePath;
}
function submitBatch(filePath) {
return client.files.create({
file: fs.createReadStream(filePath),
purpose: "batch"
}).then(function(file) {
return client.batches.create({
input_file_id: file.id,
endpoint: "/v1/chat/completions",
completion_window: "24h"
});
});
}
function checkBatchStatus(batchId) {
return client.batches.retrieve(batchId).then(function(batch) {
console.log("Status: " + batch.status);
console.log("Completed: " + batch.request_counts.completed + "/" + batch.request_counts.total);
if (batch.status === "completed") {
return client.files.content(batch.output_file_id).then(function(content) {
return content.text();
});
}
return null;
});
}
// Example: batch-classify 1000 support tickets
var tasks = [];
for (var i = 0; i < 1000; i++) {
tasks.push({
model: "gpt-4o-mini",
maxTokens: 20,
messages: [
{ role: "system", content: "Classify this support ticket as: billing, technical, account, other. One word only." },
{ role: "user", content: "Ticket #" + i + ": I cannot reset my password" }
]
});
}
// 1000 classifications at 50% discount
// Regular cost: ~$0.02 | Batch cost: ~$0.01
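When the batch completes, checkBatchStatus returns the output file as JSONL text, one result per line keyed by your custom_id. A minimal parser:
// Parse the Batch API output JSONL into a custom_id -> response text map.
function parseBatchOutput(jsonlText) {
  var results = {};
  jsonlText.split("\n").forEach(function(line) {
    if (!line.trim()) return;
    var record = JSON.parse(line);
    var body = record.response && record.response.body;
    if (body && body.choices && body.choices[0]) {
      results[record.custom_id] = body.choices[0].message.content;
    }
  });
  return results;
}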
Rate Limiting and Retry Strategies
LLM APIs have strict rate limits. You need proper retry logic or your application will drop requests under load.
function sleep(ms) {
return new Promise(function(resolve) {
setTimeout(resolve, ms);
});
}
function callWithRetry(apiCall, options) {
var maxRetries = (options && options.maxRetries) || 5;
var baseDelay = (options && options.baseDelay) || 1000;
function attempt(retryCount) {
return apiCall().catch(function(error) {
if (retryCount >= maxRetries) {
throw error;
}
var status = error.status || (error.response && error.response.status);
// Rate limited - use Retry-After header if available
if (status === 429) {
var retryAfter = error.headers && error.headers["retry-after"];
var delay = retryAfter ? parseInt(retryAfter, 10) * 1000 : baseDelay * Math.pow(2, retryCount) + Math.random() * 1000; // jitter avoids synchronized retries
console.log("Rate limited. Retrying in " + delay + "ms (attempt " + (retryCount + 1) + "/" + maxRetries + ")");
return sleep(delay).then(function() {
return attempt(retryCount + 1);
});
}
// Server errors - retry with backoff
if (status >= 500) {
var backoff = baseDelay * Math.pow(2, retryCount) + Math.random() * 1000;
console.log("Server error " + status + ". Retrying in " + Math.round(backoff) + "ms");
return sleep(backoff).then(function() {
return attempt(retryCount + 1);
});
}
// Client errors (400, 401, 403) - do not retry
throw error;
});
}
return attempt(0);
}
// Usage
callWithRetry(function() {
return client.chat.completions.create({
model: "gpt-4o-mini",
messages: [{ role: "user", content: "Hello" }]
});
}, { maxRetries: 3, baseDelay: 2000 }).then(function(result) {
console.log(result.choices[0].message.content);
});
The jitter (random delay added to backoff) is critical in production. Without it, all your retrying clients will hit the API at the same moment, causing a thundering herd problem.
Cost Tracking and Alerting With Budget Controls
Every production LLM integration needs cost tracking. Not "we will add it later" -- from day one. Here is a practical implementation:
var Redis = require("redis");
function CostTracker(redisClient, options) {
this.redis = redisClient;
this.dailyBudget = (options && options.dailyBudget) || 50.00;
this.monthlyBudget = (options && options.monthlyBudget) || 1000.00;
this.alertThreshold = (options && options.alertThreshold) || 0.80;
this.onAlert = (options && options.onAlert) || function(msg) { console.warn("[COST ALERT] " + msg); };
}
CostTracker.prototype.recordUsage = function(model, inputTokens, outputTokens, metadata) {
var pricing = {
"gpt-4o": { input: 2.50, output: 10.00 },
"gpt-4o-mini": { input: 0.15, output: 0.60 },
"gpt-4.1-nano": { input: 0.10, output: 0.40 },
"claude-sonnet-4": { input: 3.00, output: 15.00 },
"claude-haiku-3.5": { input: 0.80, output: 4.00 }
};
var modelPricing = pricing[model];
if (!modelPricing) return Promise.resolve(0);
var cost = (inputTokens / 1000000) * modelPricing.input +
(outputTokens / 1000000) * modelPricing.output;
var today = new Date().toISOString().split("T")[0];
var month = today.substring(0, 7);
var self = this;
var dailyKey = "llm_cost:daily:" + today;
var monthlyKey = "llm_cost:monthly:" + month;
var detailKey = "llm_cost:detail:" + today;
return Promise.all([
this.redis.incrByFloat(dailyKey, cost),
this.redis.incrByFloat(monthlyKey, cost),
this.redis.rPush(detailKey, JSON.stringify({
model: model,
inputTokens: inputTokens,
outputTokens: outputTokens,
cost: cost,
timestamp: Date.now(),
metadata: metadata || {}
}))
]).then(function(results) {
var dailyTotal = parseFloat(results[0]);
var monthlyTotal = parseFloat(results[1]);
// Set TTL on daily keys (48 hours)
self.redis.expire(dailyKey, 172800);
self.redis.expire(detailKey, 172800);
// Monthly keys expire after 35 days
self.redis.expire(monthlyKey, 3024000);
// Check budget thresholds
if (dailyTotal > self.dailyBudget * self.alertThreshold) {
self.onAlert("Daily spend at $" + dailyTotal.toFixed(2) + " of $" + self.dailyBudget.toFixed(2) + " budget");
}
if (monthlyTotal > self.monthlyBudget * self.alertThreshold) {
self.onAlert("Monthly spend at $" + monthlyTotal.toFixed(2) + " of $" + self.monthlyBudget.toFixed(2) + " budget");
}
return cost;
});
};
CostTracker.prototype.isOverBudget = function() {
var today = new Date().toISOString().split("T")[0];
var month = today.substring(0, 7);
var self = this;
return Promise.all([
this.redis.get("llm_cost:daily:" + today),
this.redis.get("llm_cost:monthly:" + month)
]).then(function(results) {
var dailyTotal = parseFloat(results[0] || 0);
var monthlyTotal = parseFloat(results[1] || 0);
return {
daily: dailyTotal >= self.dailyBudget,
monthly: monthlyTotal >= self.monthlyBudget,
dailySpent: dailyTotal,
monthlySpent: monthlyTotal
};
});
};
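With isOverBudget in place, enforcement can live in a small Express middleware in front of your AI routes. A sketch, assuming tracker is a connected CostTracker instance:
// Reject (or degrade) AI requests once the budget is exhausted.
function budgetGuard(tracker) {
  return function(req, res, next) {
    tracker.isOverBudget().then(function(status) {
      if (status.daily || status.monthly) {
        return res.status(503).json({
          error: "AI features temporarily unavailable (budget limit reached)"
        });
      }
      next();
    }).catch(function() {
      // A tracking outage should not take the feature down
      next();
    });
  };
}
// app.use("/api/ai", budgetGuard(costTracker));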
Streaming Responses for Perceived Performance
Streaming does not save money directly, but it dramatically improves user experience. Instead of waiting 5-10 seconds for a complete response, users see text appearing immediately. This reduces abandonment rates and makes the cost you are spending deliver more value.
var OpenAI = require("openai");
var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
function streamCompletion(messages, onChunk, onDone) {
  var fullResponse = "";
  var totalTokens = 0;
  return client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: messages,
    stream: true,
    stream_options: { include_usage: true } // final chunk reports token usage
  }).then(function(stream) {
    // The SDK returns an async iterable, not an EventEmitter,
    // so walk it with the async iterator protocol.
    var iterator = stream[Symbol.asyncIterator]();
    function pump() {
      return iterator.next().then(function(result) {
        if (result.done) {
          if (onDone) onDone(fullResponse, totalTokens);
          return fullResponse;
        }
        var chunk = result.value;
        var content = chunk.choices[0] && chunk.choices[0].delta && chunk.choices[0].delta.content;
        if (content) {
          fullResponse += content;
          onChunk(content);
        }
        if (chunk.usage) {
          totalTokens = chunk.usage.total_tokens;
        }
        return pump();
      });
    }
    return pump();
  });
}
// Express SSE endpoint
function handleStreamRoute(req, res) {
res.setHeader("Content-Type", "text/event-stream");
res.setHeader("Cache-Control", "no-cache");
res.setHeader("Connection", "keep-alive");
var messages = [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: req.body.message }
];
streamCompletion(
messages,
function(chunk) {
res.write("data: " + JSON.stringify({ content: chunk }) + "\n\n");
},
function(fullResponse, tokens) {
res.write("data: " + JSON.stringify({ done: true, totalTokens: tokens }) + "\n\n");
res.end();
}
).catch(function(err) {
res.write("data: " + JSON.stringify({ error: err.message }) + "\n\n");
res.end();
});
}
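Wiring the handler into an Express app (JSON body parsing assumed):
var express = require("express");
var app = express();
app.use(express.json());
app.post("/api/chat/stream", handleStreamRoute);
app.listen(3000, function() {
  console.log("Streaming endpoint ready on :3000");
});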
Comparing Provider Costs
Choosing the right provider for each use case can cut your bill significantly. Here is my real-world experience:
| Use Case | Best Provider | Model | Why |
|---|---|---|---|
| Simple classification | OpenAI | gpt-4.1-nano | Cheapest per token, fast |
| General chat | OpenAI | gpt-4o-mini | Best price/quality ratio |
| Complex reasoning | Anthropic | claude-sonnet-4 | Better at nuanced analysis |
| Code generation | OpenAI | gpt-4.1 | Strong coding, good pricing |
| Long documents | Anthropic | claude-sonnet-4 | 200K context window |
| Bulk processing | OpenAI | gpt-4o-mini batch | 50% discount on batch API |
For open-source models (Llama 3, Mistral), hosting costs through providers like Together AI or Fireworks can be 3-10x cheaper than OpenAI for comparable quality on routine tasks. The trade-off is slightly lower quality and more operational complexity.
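Most of these hosts expose OpenAI-compatible endpoints, so trying them is usually just a matter of pointing the OpenAI SDK at a different baseURL. The base URL and model id below are assumptions -- check your provider's documentation:
var OpenAI = require("openai");
// Assumed values -- confirm the exact base URL and model id with your provider.
var together = new OpenAI({
  apiKey: process.env.TOGETHER_API_KEY,
  baseURL: "https://api.together.xyz/v1"
});
together.chat.completions.create({
  model: "meta-llama/Llama-3-8b-chat-hf", // provider-specific model id
  max_tokens: 100,
  messages: [{ role: "user", content: "Classify this ticket: I cannot reset my password" }]
}).then(function(res) {
  console.log(res.choices[0].message.content);
});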
Complete Working Example
Here is a production-ready Node.js module that ties everything together: semantic caching, automatic model tiering, token counting, cost tracking, and budget enforcement.
// llm-client.js
var OpenAI = require("openai");
var tiktoken = require("tiktoken");
var Redis = require("redis");
var crypto = require("crypto");
var encoder = tiktoken.encoding_for_model("gpt-4o");
var PRICING = {
"gpt-4o": { input: 2.50, output: 10.00 },
"gpt-4o-mini": { input: 0.15, output: 0.60 },
"gpt-4.1-nano": { input: 0.10, output: 0.40 }
};
var TIERS = {
simple: { model: "gpt-4.1-nano", maxTokens: 100 },
standard: { model: "gpt-4o-mini", maxTokens: 1000 },
complex: { model: "gpt-4o", maxTokens: 4000 }
};
var SIMILARITY_THRESHOLD = 0.92;
var CACHE_TTL = 3600;
function LLMClient(options) {
this.openai = new OpenAI({ apiKey: options.apiKey });
this.redis = Redis.createClient({ url: options.redisUrl || "redis://localhost:6379" });
this.dailyBudget = options.dailyBudget || 50.00;
this.monthlyBudget = options.monthlyBudget || 1000.00;
this.connected = false;
this.stats = { requests: 0, cacheHits: 0, totalCost: 0 };
}
LLMClient.prototype.connect = function() {
var self = this;
return this.redis.connect().then(function() {
self.connected = true;
console.log("LLM client connected to Redis");
});
};
LLMClient.prototype.countTokens = function(text) {
return encoder.encode(text).length;
};
LLMClient.prototype._getEmbedding = function(text) {
return this.openai.embeddings.create({
model: "text-embedding-3-small",
input: text
}).then(function(res) {
return res.data[0].embedding;
});
};
LLMClient.prototype._cosineSimilarity = function(a, b) {
var dot = 0, normA = 0, normB = 0;
for (var i = 0; i < a.length; i++) {
dot += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};
LLMClient.prototype._checkCache = function(query) {
var self = this;
return this._getEmbedding(query).then(function(queryEmb) {
return self.redis.keys("llm_cache:*").then(function(keys) {
if (!keys.length) return { hit: false, embedding: queryEmb };
var best = null;
var bestScore = 0;
var checks = keys.map(function(key) {
return self.redis.hGetAll(key).then(function(entry) {
var cachedEmb = JSON.parse(entry.embedding);
var sim = self._cosineSimilarity(queryEmb, cachedEmb);
if (sim > bestScore && sim >= SIMILARITY_THRESHOLD) {
bestScore = sim;
best = { response: entry.response, model: entry.model, similarity: sim };
}
});
});
return Promise.all(checks).then(function() {
return { hit: !!best, data: best, embedding: queryEmb };
});
});
});
};
LLMClient.prototype._saveCache = function(query, response, embedding, model) {
var key = "llm_cache:" + crypto.createHash("sha256").update(query).digest("hex");
return this.redis.hSet(key, {
query: query,
response: response,
embedding: JSON.stringify(embedding),
model: model,
timestamp: Date.now().toString()
}).then(function() {
return this.redis.expire(key, CACHE_TTL);
}.bind(this));
};
LLMClient.prototype._trackCost = function(model, inputTokens, outputTokens) {
var p = PRICING[model];
if (!p) return Promise.resolve(0);
var cost = (inputTokens / 1e6) * p.input + (outputTokens / 1e6) * p.output;
var today = new Date().toISOString().split("T")[0];
var month = today.substring(0, 7);
this.stats.totalCost += cost;
return Promise.all([
this.redis.incrByFloat("llm_cost:daily:" + today, cost),
this.redis.incrByFloat("llm_cost:monthly:" + month, cost)
]).then(function(results) {
var daily = parseFloat(results[0]);
var monthly = parseFloat(results[1]);
return { cost: cost, dailyTotal: daily, monthlyTotal: monthly };
});
};
LLMClient.prototype._selectModel = function(tier) {
var self = this;
var today = new Date().toISOString().split("T")[0];
return this.redis.get("llm_cost:daily:" + today).then(function(dailySpent) {
var spent = parseFloat(dailySpent || 0);
// If approaching budget, downgrade model
if (spent > self.dailyBudget * 0.9 && tier === "complex") {
console.log("Budget pressure: downgrading from complex to standard");
return TIERS.standard;
}
if (spent > self.dailyBudget * 0.95) {
console.log("Budget critical: using cheapest model");
return TIERS.simple;
}
return TIERS[tier] || TIERS.standard;
});
};
LLMClient.prototype.complete = function(prompt, options) {
var tier = (options && options.tier) || "standard";
var systemPrompt = (options && options.system) || "You are a helpful assistant.";
var skipCache = (options && options.skipCache) || false;
var self = this;
self.stats.requests++;
// Step 1: Check cache (unless skipped)
var cachePromise = skipCache
? Promise.resolve({ hit: false, embedding: null })
: self._checkCache(prompt);
return cachePromise.then(function(cache) {
if (cache.hit) {
self.stats.cacheHits++;
return {
content: cache.data.response,
model: cache.data.model,
cached: true,
similarity: cache.data.similarity,
cost: 0,
inputTokens: 0,
outputTokens: 0
};
}
// Step 2: Select model (with budget-aware downgrade)
return self._selectModel(tier).then(function(selectedTier) {
var inputText = systemPrompt + " " + prompt;
var inputTokens = self.countTokens(inputText); // pre-call estimate for logging; billing below uses response.usage
// Step 3: Make API call
return self.openai.chat.completions.create({
model: selectedTier.model,
max_tokens: selectedTier.maxTokens,
messages: [
{ role: "system", content: systemPrompt },
{ role: "user", content: prompt }
]
}).then(function(response) {
var content = response.choices[0].message.content;
var outputTokens = response.usage.completion_tokens;
var actualInputTokens = response.usage.prompt_tokens;
// Step 4: Track cost
return self._trackCost(selectedTier.model, actualInputTokens, outputTokens).then(function(costInfo) {
// Step 5: Cache the response
var cachePromise = cache.embedding
? self._saveCache(prompt, content, cache.embedding, selectedTier.model)
: Promise.resolve();
return cachePromise.then(function() {
return {
content: content,
model: selectedTier.model,
cached: false,
cost: costInfo.cost,
inputTokens: actualInputTokens,
outputTokens: outputTokens,
dailyTotal: costInfo.dailyTotal,
monthlyTotal: costInfo.monthlyTotal
};
});
});
});
});
});
};
LLMClient.prototype.getStats = function() {
return {
requests: this.stats.requests,
cacheHits: this.stats.cacheHits,
hitRate: this.stats.requests > 0
? (this.stats.cacheHits / this.stats.requests * 100).toFixed(1) + "%"
: "0%",
totalCost: "$" + this.stats.totalCost.toFixed(4)
};
};
LLMClient.prototype.disconnect = function() {
return this.redis.quit();
};
module.exports = LLMClient;
Usage Example
// app.js
var LLMClient = require("./llm-client");
var llm = new LLMClient({
apiKey: process.env.OPENAI_API_KEY,
redisUrl: process.env.REDIS_URL,
dailyBudget: 25.00,
monthlyBudget: 500.00
});
llm.connect().then(function() {
// Simple classification - uses cheapest model
return llm.complete("Is this email spam? 'You won a free iPhone!'", {
tier: "simple",
system: "Reply with only: spam or not_spam"
});
}).then(function(result) {
console.log("Response:", result.content);
console.log("Model:", result.model);
console.log("Cost: $" + result.cost.toFixed(6));
console.log("Cached:", result.cached);
// Response: spam
// Model: gpt-4.1-nano
// Cost: $0.000003
// Cached: false
// Same question again - served from cache
return llm.complete("Is this email spam? 'You just won a free iPhone!'", {
tier: "simple",
system: "Reply with only: spam or not_spam"
});
}).then(function(result) {
console.log("\nSecond call:");
console.log("Response:", result.content);
console.log("Cached:", result.cached);
console.log("Similarity:", result.similarity);
console.log("Cost: $" + result.cost.toFixed(6));
// Second call:
// Response: spam
// Cached: true
// Similarity: 0.967
// Cost: $0.000000
console.log("\nStats:", llm.getStats());
// Stats: { requests: 2, cacheHits: 1, hitRate: '50.0%', totalCost: '$0.0000' }
return llm.disconnect();
});
Common Issues and Troubleshooting
1. Token Count Mismatch Between Estimate and Actual
Error: Expected ~200 input tokens, API reported 847
This happens when you forget to count system message tokens, or when the API adds formatting tokens around your messages. The tiktoken count is for raw text -- the actual API wraps each message with special tokens (<|im_start|>, role identifiers, etc.). Budget an extra 10-15% above your tiktoken estimate, and always use the usage field from the API response for cost tracking, not your pre-call estimates.
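A small guard that pads the estimate for that overhead and then reconciles against the real usage field (countTokens and the CostTracker are from earlier sections):
// Pad the tiktoken estimate to account for per-message formatting tokens,
// then record actual usage from the response for cost tracking.
var TOKEN_OVERHEAD = 1.15; // ~15% buffer on top of raw text counts
function paddedEstimate(text) {
  return Math.ceil(countTokens(text) * TOKEN_OVERHEAD);
}
// After the call, always prefer the API's own numbers:
// tracker.recordUsage(model, response.usage.prompt_tokens, response.usage.completion_tokens);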
2. Redis Connection Failures Blocking API Calls
Error: connect ECONNREFUSED 127.0.0.1:6379
Never let a cache failure prevent your application from working. Catch errors on every Redis operation and fall through to a direct API call. Your caching layer is an optimization, not a dependency.
function safeCacheCheck(query) {
  return getCachedResponse(query).catch(function(err) {
    console.warn("Cache unavailable, proceeding without cache:", err.message);
    // Same shape getCachedResponse resolves with on a miss
    return { match: null, embedding: null };
  });
}
3. Rate Limit Errors During Burst Traffic
Error: 429 Too Many Requests - Rate limit reached for gpt-4o-mini on tokens per min (TPM): Limit 200000
The fix is not just retry logic -- you need a client-side rate limiter that prevents you from hitting the limit in the first place. Use a token bucket or sliding window algorithm to throttle outgoing requests. The bottleneck npm package works well for this:
npm install bottleneck
var Bottleneck = require("bottleneck");
var limiter = new Bottleneck({
maxConcurrent: 5, // Max parallel requests
minTime: 200, // Min ms between requests
reservoir: 50, // Max requests per interval
reservoirRefreshAmount: 50,
reservoirRefreshInterval: 60000 // Refresh every minute
});
function rateLimitedComplete(client, prompt, options) {
return limiter.schedule(function() {
return client.complete(prompt, options);
});
}
4. Stale Cache Returning Outdated Responses
User asks about "current Node.js LTS version" and gets a cached response from 3 months ago saying "Node.js 20"
Time-sensitive queries should bypass the cache entirely. Add query classification to detect temporal queries:
var TEMPORAL_KEYWORDS = ["current", "latest", "today", "now", "recent", "2026", "this week", "this month"];
function isTemporalQuery(query) {
var lowerQuery = query.toLowerCase();
return TEMPORAL_KEYWORDS.some(function(keyword) {
return lowerQuery.indexOf(keyword) !== -1;
});
}
// In your complete() method:
var skipCache = (options && options.skipCache) || isTemporalQuery(prompt);
5. Budget Enforcement Too Aggressive After Restart
Warning: Daily budget exceeded - all requests blocked, even though almost nothing has been spent today
If your Redis data is lost (restart, eviction) and the counters reset to $0, enforcement is merely too lenient for the rest of the day. The dangerous case is restoring persisted data that contains stale budget counters, which can block traffic against spend that never happened today. Always set TTL on cost tracking keys and validate that the date in the key matches today before enforcing limits.
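One way to implement that check, using the key's TTL as a staleness signal -- the reset-on-missing-TTL heuristic is an assumption on top of the tracker above:
// Ignore a daily counter that was restored without its expiry.
function getTodaysSpend(redisClient) {
  var today = new Date().toISOString().split("T")[0];
  var key = "llm_cost:daily:" + today;
  return Promise.all([
    redisClient.get(key),
    redisClient.ttl(key) // -1 means the key exists but has no TTL
  ]).then(function(results) {
    var spend = parseFloat(results[0] || 0);
    var ttl = results[1];
    if (ttl === -1) {
      // Likely restored from an old snapshot; reset rather than block traffic.
      return redisClient.del(key).then(function() { return 0; });
    }
    return spend;
  });
}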
Best Practices
Always set max_tokens on every API call. An unbound response can generate thousands of tokens you did not intend. For classification tasks, set it to 10-20. For summaries, estimate the desired length and add a 20% buffer.
Use the cheapest model that meets your quality bar. Start with GPT-4.1-nano or GPT-4o-mini. Only upgrade to GPT-4o or Claude Sonnet when you have measured evidence that the cheaper model is not good enough. Run A/B tests with real users if possible.
Cache aggressively, invalidate conservatively. For most applications, a 1-hour TTL on cached LLM responses is safe. FAQ bots and documentation assistants can use 24-hour caches. Conversational apps with personalization should cache less.
Track costs per feature, not just per application. Tag each API call with the feature that triggered it (e.g., "chat", "search", "classification"). This lets you identify which features are expensive and where optimization will have the most impact.
Use batch APIs for anything that can wait. Content generation, data enrichment, classification pipelines -- if the user is not staring at a loading spinner, use the batch endpoint and save 50%.
Implement circuit breakers for cost control. If your daily spend exceeds a threshold, automatically downgrade to cheaper models or disable non-essential AI features rather than hard-failing. A degraded experience is better than a $10,000 surprise bill.
Monitor token-to-value ratio. Not all tokens are created equal. A 500-token response that directly answers a user's question is more valuable than a 2,000-token response padded with disclaimers. Optimize your prompts to maximize useful output per token spent.
Pre-compute and store where possible. If you know what questions users frequently ask, pre-generate answers during off-peak hours using batch API pricing and serve them as static content. This is infinitely cheaper than real-time LLM calls.
Version your prompts. When you change a prompt, the cache should be invalidated. Include a prompt version hash in your cache key to prevent stale responses after prompt updates.
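For that last point, hashing the prompt template into the cache key is enough. A sketch building on the crypto-based keys used earlier:
var crypto = require("crypto");
// Derive a short version tag from the current prompt template text.
var PROMPT_VERSION = crypto.createHash("sha256")
  .update("You are a JSON-only responder. v3")
  .digest("hex")
  .substring(0, 8);
function cacheKey(query) {
  var queryHash = crypto.createHash("sha256").update(query).digest("hex");
  // Changing the prompt template changes PROMPT_VERSION, which invalidates old entries.
  return "llm_cache:" + PROMPT_VERSION + ":" + queryHash;
}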
References
- OpenAI API Pricing - Current token pricing for all models
- OpenAI Batch API Documentation - Official guide for batch processing
- Anthropic API Pricing - Claude model pricing tiers
- tiktoken on npm - Token counting library for OpenAI models
- OpenAI Tokenizer Tool - Interactive token counting
- Redis Documentation - Caching layer documentation
- bottleneck on npm - Rate limiting for Node.js
- OpenAI Rate Limits - Understanding and handling rate limits
