
Cost-Effective LLM API Integration Patterns

Battle-tested patterns for integrating LLM APIs into Node.js applications while keeping costs under control, covering semantic caching, model tiering, batch processing, and budget enforcement.


Overview

LLM APIs are powerful but expensive -- a single poorly-optimized production workload can burn through thousands of dollars in a month without careful engineering. This article covers battle-tested patterns for integrating LLM APIs into Node.js applications while keeping costs under control. Whether you are building a customer-facing chatbot, a content pipeline, or an internal tool that leverages AI, these patterns will help you ship features without blowing your budget.

Prerequisites

  • Node.js v18+ installed
  • Basic familiarity with REST APIs and async/await in Node.js
  • A Redis instance (local or hosted) for caching examples
  • API keys for OpenAI and/or Anthropic (free tier is fine for testing)
  • npm for package management

Install the dependencies we will use throughout this article:

npm install openai @anthropic-ai/sdk tiktoken redis express

Understanding LLM API Pricing Models

Before you can optimize costs, you need to understand how you are being charged. Every major LLM provider charges per token, not per request. A token is roughly 3/4 of a word in English -- the word "hamburger" is two tokens, and "I" is one token.

There are two token counts that matter on every API call:

Metric | Description | Relative Cost
Input tokens | Your prompt, system message, and any context you send | Lower cost (often 2-10x cheaper than output)
Output tokens | The model's response | Higher cost

Here is a simplified pricing comparison as of early 2026:

{
  "openai": {
    "gpt-4o": { "input_per_1M": 2.50, "output_per_1M": 10.00 },
    "gpt-4o-mini": { "input_per_1M": 0.15, "output_per_1M": 0.60 },
    "gpt-4.1": { "input_per_1M": 2.00, "output_per_1M": 8.00 },
    "gpt-4.1-mini": { "input_per_1M": 0.40, "output_per_1M": 1.60 },
    "gpt-4.1-nano": { "input_per_1M": 0.10, "output_per_1M": 0.40 }
  },
  "anthropic": {
    "claude-sonnet-4": { "input_per_1M": 3.00, "output_per_1M": 15.00 },
    "claude-haiku-3.5": { "input_per_1M": 0.80, "output_per_1M": 4.00 }
  }
}

The key insight: input tokens are cheap, output tokens are expensive. This single fact drives most optimization strategies. A verbose prompt that produces a concise response is almost always cheaper than a terse prompt that causes the model to ramble.
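
To make that concrete, here is a quick back-of-the-envelope comparison using the GPT-4o prices above (the token counts are illustrative):

// GPT-4o: $2.50 per 1M input tokens, $10.00 per 1M output tokens
var verbosePromptConciseAnswer = (800 / 1e6) * 2.50 + (150 / 1e6) * 10.00;
// = 0.002 + 0.0015 = $0.0035 per call

var tersePromptRamblingAnswer = (100 / 1e6) * 2.50 + (900 / 1e6) * 10.00;
// = 0.00025 + 0.009 = $0.00925 per call -- roughly 2.6x more expensive

console.log(verbosePromptConciseAnswer.toFixed(5), tersePromptRamblingAnswer.toFixed(5));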

Batch vs. Real-Time

Both OpenAI and Anthropic offer batch processing APIs at a 50% discount. If your workload does not need an immediate response -- think content generation, data classification, summarization pipelines -- you should always use batch endpoints. We will cover this in detail below.


Token Counting and Estimation Before API Calls

Never send a request without knowing roughly how many tokens it will consume. The tiktoken library gives you exact counts for OpenAI models, and the counts are close enough for Anthropic as well.

var tiktoken = require("tiktoken");

var encoder = tiktoken.encoding_for_model("gpt-4o");

function countTokens(text) {
  var tokens = encoder.encode(text);
  return tokens.length;
}

function estimateCost(inputText, estimatedOutputTokens, model) {
  var pricing = {
    "gpt-4o": { input: 2.50, output: 10.00 },
    "gpt-4o-mini": { input: 0.15, output: 0.60 },
    "gpt-4.1-nano": { input: 0.10, output: 0.40 },
    "claude-sonnet-4": { input: 3.00, output: 15.00 },
    "claude-haiku-3.5": { input: 0.80, output: 4.00 }
  };

  var modelPricing = pricing[model];
  if (!modelPricing) {
    throw new Error("Unknown model: " + model);
  }

  var inputTokens = countTokens(inputText);
  var inputCost = (inputTokens / 1000000) * modelPricing.input;
  var outputCost = (estimatedOutputTokens / 1000000) * modelPricing.output;

  return {
    inputTokens: inputTokens,
    estimatedOutputTokens: estimatedOutputTokens,
    estimatedCost: inputCost + outputCost,
    model: model
  };
}

// Usage
var prompt = "Explain the difference between REST and GraphQL in two sentences.";
var estimate = estimateCost(prompt, 100, "gpt-4o-mini");
console.log(estimate);
// { inputTokens: 12, estimatedOutputTokens: 100, estimatedCost: 0.0000618, model: 'gpt-4o-mini' }

This lets you make informed decisions before every API call. In production, I log these estimates and compare them against actual usage to calibrate my output token estimates over time.
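
A minimal sketch of that calibration loop -- here it just logs to the console; in production you would ship these records to your logging or metrics pipeline:

var OpenAI = require("openai");
var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

function completeWithCalibration(prompt, model, estimatedOutputTokens) {
  // Pre-call estimate using the helper defined above
  var estimate = estimateCost(prompt, estimatedOutputTokens, model);

  return client.chat.completions.create({
    model: model,
    max_tokens: estimatedOutputTokens,
    messages: [{ role: "user", content: prompt }]
  }).then(function(response) {
    // response.usage holds the authoritative counts you are actually billed for
    console.log(JSON.stringify({
      model: model,
      estimatedInputTokens: estimate.inputTokens,
      actualInputTokens: response.usage.prompt_tokens,
      estimatedOutputTokens: estimatedOutputTokens,
      actualOutputTokens: response.usage.completion_tokens
    }));
    return response;
  });
}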


Prompt Optimization to Reduce Token Usage

Prompt engineering is not just about getting better results -- it is about getting the same results with fewer tokens. Here are concrete techniques:

1. Set a max_tokens Limit

Always set max_tokens on your API calls. Without it, the model is free to keep generating until it decides to stop or hits its output limit, and you pay for every one of those tokens.

var OpenAI = require("openai");
var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

function classifyText(text) {
  return client.chat.completions.create({
    model: "gpt-4o-mini",
    max_tokens: 10,  // Classification only needs a few tokens
    messages: [
      {
        role: "system",
        content: "Classify the sentiment as: positive, negative, or neutral. Reply with one word only."
      },
      { role: "user", content: text }
    ]
  });
}

2. Use Structured Output Instructions

Tell the model exactly what format you want. This prevents wasted tokens on preambles like "Sure! Here's the answer to your question..."

var systemPrompt = [
  "You are a JSON-only responder.",
  "Return a JSON object with these exact keys: category, confidence, reasoning.",
  "Do not include any text outside the JSON object.",
  "Keep reasoning under 20 words."
].join(" ");
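
On OpenAI models you can also enforce this at the API level with JSON mode via response_format. Note that JSON mode expects the word "JSON" to appear somewhere in your messages, which the system prompt above already satisfies:

client.chat.completions.create({
  model: "gpt-4o-mini",
  max_tokens: 150,
  response_format: { type: "json_object" },  // constrains the model to emit valid JSON
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: "Categorize: 'My invoice is wrong and I was double charged.'" }
  ]
}).then(function(response) {
  var parsed = JSON.parse(response.choices[0].message.content);
  console.log(parsed.category, parsed.confidence);
});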

3. Compress Context With Summaries

If you are passing conversation history or document context, summarize older content rather than passing it verbatim. A 10,000-token document can often be summarized into 500 tokens without losing the information the model needs.

function compressHistory(messages, maxTokens) {
  var totalTokens = 0;
  var compressed = [];

  // Keep most recent messages intact, summarize older ones
  for (var i = messages.length - 1; i >= 0; i--) {
    var msgTokens = countTokens(messages[i].content);
    if (totalTokens + msgTokens > maxTokens && compressed.length > 3) {
      // Collapse the remaining older messages into one truncated block.
      // (Truncation is used here for simplicity; in practice you would
      // summarize olderContent with a cheap model instead.)
      var olderContent = messages.slice(0, i + 1)
        .map(function(m) { return m.role + ": " + m.content; })
        .join("\n");
      compressed.unshift({
        role: "system",
        content: "Summary of earlier conversation: " + olderContent.substring(0, 500)
      });
      break;
    }
    totalTokens += msgTokens;
    compressed.unshift(messages[i]);
  }

  return compressed;
}

Semantic Caching With Redis

This is the single highest-impact optimization for most applications. If two users ask essentially the same question, why pay for the same API call twice?

Simple string matching will not work -- "What is Node.js?" and "Can you explain Node.js?" are the same question. You need semantic similarity matching. The approach: generate an embedding for each query, store it alongside the cached response, and check incoming queries against cached embeddings using cosine similarity.

var Redis = require("redis");
var OpenAI = require("openai");
var crypto = require("crypto");

var redis = Redis.createClient({ url: process.env.REDIS_URL || "redis://localhost:6379" });
redis.connect().catch(console.error); // node-redis v4+ requires an explicit connect before issuing commands
var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

var CACHE_TTL = 3600; // 1 hour
var SIMILARITY_THRESHOLD = 0.92;

function cosineSimilarity(a, b) {
  var dotProduct = 0;
  var normA = 0;
  var normB = 0;
  for (var i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

function getEmbedding(text) {
  return openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  }).then(function(response) {
    return response.data[0].embedding;
  });
}

function getCachedResponse(query) {
  return getEmbedding(query).then(function(queryEmbedding) {
    return redis.keys("llm_cache:*").then(function(keys) {
      var bestMatch = null;
      var bestSimilarity = 0;

      var promises = keys.map(function(key) {
        return redis.hGetAll(key).then(function(cached) {
          var cachedEmbedding = JSON.parse(cached.embedding);
          var similarity = cosineSimilarity(queryEmbedding, cachedEmbedding);
          if (similarity > bestSimilarity && similarity >= SIMILARITY_THRESHOLD) {
            bestSimilarity = similarity;
            bestMatch = {
              response: cached.response,
              similarity: similarity,
              originalQuery: cached.query
            };
          }
        });
      });

      return Promise.all(promises).then(function() {
        return { match: bestMatch, embedding: queryEmbedding };
      });
    });
  });
}

function cacheResponse(query, response, embedding) {
  var key = "llm_cache:" + crypto.createHash("sha256").update(query).digest("hex");
  return redis.hSet(key, {
    query: query,
    response: response,
    embedding: JSON.stringify(embedding),
    timestamp: Date.now().toString()
  }).then(function() {
    return redis.expire(key, CACHE_TTL);
  });
}

In production, this cache saves 30-60% of API costs depending on how repetitive your traffic is. E-commerce product Q&A, FAQ bots, and documentation assistants see the highest cache hit rates.

Performance note: Scanning all keys for similarity is fine up to a few thousand cache entries. Beyond that, use a vector database like Pinecone or pgvector instead of the brute-force scan shown here.
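
Putting the two helpers together, a cache-aware wrapper looks roughly like this (cachedComplete is an illustrative name, and the model and max_tokens values are placeholders):

function cachedComplete(query) {
  return getCachedResponse(query).then(function(result) {
    if (result.match) {
      // Cache hit: no API call, no cost
      return { content: result.match.response, cached: true };
    }
    // Cache miss: call the API, then store the response alongside the query embedding
    return openai.chat.completions.create({
      model: "gpt-4o-mini",
      max_tokens: 300,
      messages: [{ role: "user", content: query }]
    }).then(function(response) {
      var content = response.choices[0].message.content;
      return cacheResponse(query, content, result.embedding).then(function() {
        return { content: content, cached: false };
      });
    });
  });
}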


Model Tiering Strategy

Not every task needs GPT-4o or Claude Sonnet. Here is a practical tiering system:

var MODEL_TIERS = {
  simple: {
    model: "gpt-4.1-nano",
    maxTokens: 150,
    description: "Classification, extraction, yes/no questions",
    costPer1kOutput: 0.0004
  },
  standard: {
    model: "gpt-4o-mini",
    maxTokens: 1000,
    description: "Summarization, simple generation, translation",
    costPer1kOutput: 0.0006
  },
  complex: {
    model: "gpt-4o",
    maxTokens: 4000,
    description: "Analysis, creative writing, code generation, reasoning",
    costPer1kOutput: 0.01
  }
};

function selectModel(taskType, inputTokens) {
  // Auto-downgrade if input is very short (likely a simple task)
  if (inputTokens < 50 && taskType !== "complex") {
    return MODEL_TIERS.simple;
  }
  return MODEL_TIERS[taskType] || MODEL_TIERS.standard;
}

// Usage
var tier = selectModel("simple", 25);
console.log("Using model: " + tier.model);
// Using model: gpt-4.1-nano

The cost difference is dramatic. A task that costs $0.01 with GPT-4o costs $0.0006 with GPT-4o-mini -- a 16x reduction. Over millions of requests, this adds up to thousands of dollars saved.

My rule of thumb: start every new feature with the cheapest model that could work, then upgrade only when quality metrics show you need to.


Batch API Processing for Non-Real-Time Workloads

OpenAI's Batch API gives you a 50% discount in exchange for results delivered within 24 hours (usually much faster). Perfect for content pipelines, data processing, and scheduled jobs.

var fs = require("fs");
var OpenAI = require("openai");

var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

function createBatchFile(tasks) {
  var lines = tasks.map(function(task, index) {
    return JSON.stringify({
      custom_id: "task-" + index,
      method: "POST",
      url: "/v1/chat/completions",
      body: {
        model: task.model || "gpt-4o-mini",
        max_tokens: task.maxTokens || 500,
        messages: task.messages
      }
    });
  });

  var filePath = "/tmp/batch_input_" + Date.now() + ".jsonl";
  fs.writeFileSync(filePath, lines.join("\n"));
  return filePath;
}

function submitBatch(filePath) {
  return client.files.create({
    file: fs.createReadStream(filePath),
    purpose: "batch"
  }).then(function(file) {
    return client.batches.create({
      input_file_id: file.id,
      endpoint: "/v1/chat/completions",
      completion_window: "24h"
    });
  });
}

function checkBatchStatus(batchId) {
  return client.batches.retrieve(batchId).then(function(batch) {
    console.log("Status: " + batch.status);
    console.log("Completed: " + batch.request_counts.completed + "/" + batch.request_counts.total);

    if (batch.status === "completed") {
      return client.files.content(batch.output_file_id).then(function(content) {
        return content.text();
      });
    }
    return null;
  });
}

// Example: batch-classify 1000 support tickets
var tasks = [];
for (var i = 0; i < 1000; i++) {
  tasks.push({
    model: "gpt-4o-mini",
    maxTokens: 20,
    messages: [
      { role: "system", content: "Classify this support ticket as: billing, technical, account, other. One word only." },
      { role: "user", content: "Ticket #" + i + ": I cannot reset my password" }
    ]
  });
}

// 1000 classifications at 50% discount
// Regular cost: ~$0.02 | Batch cost: ~$0.01
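
Wiring those helpers together looks like this; the five-minute polling interval is arbitrary, and in a real pipeline you would poll from a scheduled job rather than an in-process timer:

var inputPath = createBatchFile(tasks);

submitBatch(inputPath).then(function(batch) {
  console.log("Submitted batch: " + batch.id);

  // Poll until the batch completes, then parse the JSONL results
  var timer = setInterval(function() {
    checkBatchStatus(batch.id).then(function(results) {
      if (results) {
        clearInterval(timer);
        // One JSON object per line, keyed by the custom_id we assigned
        var lines = results.trim().split("\n");
        console.log("Received " + lines.length + " results");
      }
    });
  }, 5 * 60 * 1000);
});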

Rate Limiting and Retry Strategies

LLM APIs have strict rate limits. You need proper retry logic or your application will drop requests under load.

function sleep(ms) {
  return new Promise(function(resolve) {
    setTimeout(resolve, ms);
  });
}

function callWithRetry(apiCall, options) {
  var maxRetries = (options && options.maxRetries) || 5;
  var baseDelay = (options && options.baseDelay) || 1000;

  function attempt(retryCount) {
    return apiCall().catch(function(error) {
      if (retryCount >= maxRetries) {
        throw error;
      }

      var status = error.status || (error.response && error.response.status);

      // Rate limited - use Retry-After header if available
      if (status === 429) {
        var retryAfter = error.headers && error.headers["retry-after"];
        var delay = retryAfter ? parseInt(retryAfter) * 1000 : baseDelay * Math.pow(2, retryCount);
        console.log("Rate limited. Retrying in " + delay + "ms (attempt " + (retryCount + 1) + "/" + maxRetries + ")");
        return sleep(delay).then(function() {
          return attempt(retryCount + 1);
        });
      }

      // Server errors - retry with backoff
      if (status >= 500) {
        var backoff = baseDelay * Math.pow(2, retryCount) + Math.random() * 1000;
        console.log("Server error " + status + ". Retrying in " + Math.round(backoff) + "ms");
        return sleep(backoff).then(function() {
          return attempt(retryCount + 1);
        });
      }

      // Client errors (400, 401, 403) - do not retry
      throw error;
    });
  }

  return attempt(0);
}

// Usage
callWithRetry(function() {
  return client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Hello" }]
  });
}, { maxRetries: 3, baseDelay: 2000 }).then(function(result) {
  console.log(result.choices[0].message.content);
});

The jitter (random delay added to backoff) is critical in production. Without it, all your retrying clients will hit the API at the same moment, causing a thundering herd problem.


Cost Tracking and Alerting With Budget Controls

Every production LLM integration needs cost tracking. Not "we will add it later" -- from day one. Here is a practical implementation:

var Redis = require("redis");

function CostTracker(redisClient, options) {
  this.redis = redisClient;
  this.dailyBudget = (options && options.dailyBudget) || 50.00;
  this.monthlyBudget = (options && options.monthlyBudget) || 1000.00;
  this.alertThreshold = (options && options.alertThreshold) || 0.80;
  this.onAlert = (options && options.onAlert) || function(msg) { console.warn("[COST ALERT] " + msg); };
}

CostTracker.prototype.recordUsage = function(model, inputTokens, outputTokens, metadata) {
  var pricing = {
    "gpt-4o": { input: 2.50, output: 10.00 },
    "gpt-4o-mini": { input: 0.15, output: 0.60 },
    "gpt-4.1-nano": { input: 0.10, output: 0.40 },
    "claude-sonnet-4": { input: 3.00, output: 15.00 },
    "claude-haiku-3.5": { input: 0.80, output: 4.00 }
  };

  var modelPricing = pricing[model];
  if (!modelPricing) return Promise.resolve(0);

  var cost = (inputTokens / 1000000) * modelPricing.input +
             (outputTokens / 1000000) * modelPricing.output;

  var today = new Date().toISOString().split("T")[0];
  var month = today.substring(0, 7);

  var self = this;
  var dailyKey = "llm_cost:daily:" + today;
  var monthlyKey = "llm_cost:monthly:" + month;
  var detailKey = "llm_cost:detail:" + today;

  return Promise.all([
    this.redis.incrByFloat(dailyKey, cost),
    this.redis.incrByFloat(monthlyKey, cost),
    this.redis.rPush(detailKey, JSON.stringify({
      model: model,
      inputTokens: inputTokens,
      outputTokens: outputTokens,
      cost: cost,
      timestamp: Date.now(),
      metadata: metadata || {}
    }))
  ]).then(function(results) {
    var dailyTotal = parseFloat(results[0]);
    var monthlyTotal = parseFloat(results[1]);

    // Set TTL on daily keys (48 hours)
    self.redis.expire(dailyKey, 172800);
    self.redis.expire(detailKey, 172800);
    // Monthly keys expire after 35 days
    self.redis.expire(monthlyKey, 3024000);

    // Check budget thresholds
    if (dailyTotal > self.dailyBudget * self.alertThreshold) {
      self.onAlert("Daily spend at $" + dailyTotal.toFixed(2) + " of $" + self.dailyBudget.toFixed(2) + " budget");
    }
    if (monthlyTotal > self.monthlyBudget * self.alertThreshold) {
      self.onAlert("Monthly spend at $" + monthlyTotal.toFixed(2) + " of $" + self.monthlyBudget.toFixed(2) + " budget");
    }

    return cost;
  });
};

CostTracker.prototype.isOverBudget = function() {
  var today = new Date().toISOString().split("T")[0];
  var month = today.substring(0, 7);
  var self = this;

  return Promise.all([
    this.redis.get("llm_cost:daily:" + today),
    this.redis.get("llm_cost:monthly:" + month)
  ]).then(function(results) {
    var dailyTotal = parseFloat(results[0] || 0);
    var monthlyTotal = parseFloat(results[1] || 0);

    return {
      daily: dailyTotal >= self.dailyBudget,
      monthly: monthlyTotal >= self.monthlyBudget,
      dailySpent: dailyTotal,
      monthlySpent: monthlyTotal
    };
  });
};
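
Hooking the tracker into a request path looks roughly like this -- a sketch, assuming an OpenAI client constructed as in earlier sections; trackedComplete is an illustrative helper name:

var redisClient = Redis.createClient({ url: process.env.REDIS_URL });
redisClient.connect().catch(console.error);

var tracker = new CostTracker(redisClient, { dailyBudget: 25.00, monthlyBudget: 500.00 });

function trackedComplete(client, messages) {
  return tracker.isOverBudget().then(function(budget) {
    if (budget.daily || budget.monthly) {
      throw new Error("LLM budget exhausted: $" + budget.dailySpent.toFixed(2) + " spent today");
    }
    return client.chat.completions.create({
      model: "gpt-4o-mini",
      max_tokens: 500,
      messages: messages
    }).then(function(response) {
      // Record the usage the API actually billed, tagged with the feature that triggered it
      return tracker.recordUsage(
        "gpt-4o-mini",
        response.usage.prompt_tokens,
        response.usage.completion_tokens,
        { feature: "chat" }
      ).then(function() {
        return response;
      });
    });
  });
}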

Streaming Responses for Perceived Performance

Streaming does not save money directly, but it dramatically improves user experience. Instead of waiting 5-10 seconds for a complete response, users see text appearing immediately. This reduces abandonment rates and makes the cost you are spending deliver more value.

var OpenAI = require("openai");
var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// With stream: true, the OpenAI Node SDK returns an async iterable,
// not an EventEmitter, so we consume it with for await...of.
async function streamCompletion(messages, onChunk, onDone) {
  var fullResponse = "";
  var totalTokens = 0;

  var stream = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: messages,
    stream: true,
    stream_options: { include_usage: true }  // final chunk carries token usage
  });

  for await (var chunk of stream) {
    var content = chunk.choices[0] && chunk.choices[0].delta && chunk.choices[0].delta.content;
    if (content) {
      fullResponse += content;
      onChunk(content);
    }
    if (chunk.usage) {
      totalTokens = chunk.usage.total_tokens;
    }
  }

  if (onDone) onDone(fullResponse, totalTokens);
  return fullResponse;
}

// Express SSE endpoint
function handleStreamRoute(req, res) {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  var messages = [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: req.body.message }
  ];

  streamCompletion(
    messages,
    function(chunk) {
      res.write("data: " + JSON.stringify({ content: chunk }) + "\n\n");
    },
    function(fullResponse, tokens) {
      res.write("data: " + JSON.stringify({ done: true, totalTokens: tokens }) + "\n\n");
      res.end();
    }
  ).catch(function(err) {
    res.write("data: " + JSON.stringify({ error: err.message }) + "\n\n");
    res.end();
  });
}

Comparing Provider Costs

Choosing the right provider for each use case can cut your bill significantly. Here is my real-world experience:

Use Case | Best Provider | Model | Why
Simple classification | OpenAI | gpt-4.1-nano | Cheapest per token, fast
General chat | OpenAI | gpt-4o-mini | Best price/quality ratio
Complex reasoning | Anthropic | claude-sonnet-4 | Better at nuanced analysis
Code generation | OpenAI | gpt-4.1 | Strong coding, good pricing
Long documents | Anthropic | claude-sonnet-4 | 200K context window
Bulk processing | OpenAI | gpt-4o-mini batch | 50% discount on batch API

For open-source models (Llama 3, Mistral), hosting costs through providers like Together AI or Fireworks can be 3-10x cheaper than OpenAI for comparable quality on routine tasks. The trade-off is slightly lower quality and more operational complexity.
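
Most of these hosts expose OpenAI-compatible endpoints, so the switch is usually just a base URL and model name. The URL and model below are illustrative assumptions -- check your provider's documentation for the exact values:

var OpenAI = require("openai");

// Point the same SDK at an OpenAI-compatible provider.
// Base URL and model name are examples only, not verified values.
var openModelClient = new OpenAI({
  apiKey: process.env.TOGETHER_API_KEY,
  baseURL: "https://api.together.xyz/v1"
});

openModelClient.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
  max_tokens: 200,
  messages: [{ role: "user", content: "Classify this ticket: 'I was charged twice.'" }]
}).then(function(response) {
  console.log(response.choices[0].message.content);
});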


Complete Working Example

Here is a production-ready Node.js module that ties everything together: semantic caching, automatic model tiering, token counting, cost tracking, and budget enforcement.

// llm-client.js
var OpenAI = require("openai");
var tiktoken = require("tiktoken");
var Redis = require("redis");
var crypto = require("crypto");

var encoder = tiktoken.encoding_for_model("gpt-4o");

var PRICING = {
  "gpt-4o":          { input: 2.50,  output: 10.00 },
  "gpt-4o-mini":     { input: 0.15,  output: 0.60 },
  "gpt-4.1-nano":    { input: 0.10,  output: 0.40 }
};

var TIERS = {
  simple:   { model: "gpt-4.1-nano",  maxTokens: 100 },
  standard: { model: "gpt-4o-mini",   maxTokens: 1000 },
  complex:  { model: "gpt-4o",        maxTokens: 4000 }
};

var SIMILARITY_THRESHOLD = 0.92;
var CACHE_TTL = 3600;

function LLMClient(options) {
  this.openai = new OpenAI({ apiKey: options.apiKey });
  this.redis = Redis.createClient({ url: options.redisUrl || "redis://localhost:6379" });
  this.dailyBudget = options.dailyBudget || 50.00;
  this.monthlyBudget = options.monthlyBudget || 1000.00;
  this.connected = false;
  this.stats = { requests: 0, cacheHits: 0, totalCost: 0 };
}

LLMClient.prototype.connect = function() {
  var self = this;
  return this.redis.connect().then(function() {
    self.connected = true;
    console.log("LLM client connected to Redis");
  });
};

LLMClient.prototype.countTokens = function(text) {
  return encoder.encode(text).length;
};

LLMClient.prototype._getEmbedding = function(text) {
  return this.openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  }).then(function(res) {
    return res.data[0].embedding;
  });
};

LLMClient.prototype._cosineSimilarity = function(a, b) {
  var dot = 0, normA = 0, normB = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

LLMClient.prototype._checkCache = function(query) {
  var self = this;
  return this._getEmbedding(query).then(function(queryEmb) {
    return self.redis.keys("llm_cache:*").then(function(keys) {
      if (!keys.length) return { hit: false, embedding: queryEmb };

      var best = null;
      var bestScore = 0;

      var checks = keys.map(function(key) {
        return self.redis.hGetAll(key).then(function(entry) {
          var cachedEmb = JSON.parse(entry.embedding);
          var sim = self._cosineSimilarity(queryEmb, cachedEmb);
          if (sim > bestScore && sim >= SIMILARITY_THRESHOLD) {
            bestScore = sim;
            best = { response: entry.response, model: entry.model, similarity: sim };
          }
        });
      });

      return Promise.all(checks).then(function() {
        return { hit: !!best, data: best, embedding: queryEmb };
      });
    });
  });
};

LLMClient.prototype._saveCache = function(query, response, embedding, model) {
  var key = "llm_cache:" + crypto.createHash("sha256").update(query).digest("hex");
  return this.redis.hSet(key, {
    query: query,
    response: response,
    embedding: JSON.stringify(embedding),
    model: model,
    timestamp: Date.now().toString()
  }).then(function() {
    return this.redis.expire(key, CACHE_TTL);
  }.bind(this));
};

LLMClient.prototype._trackCost = function(model, inputTokens, outputTokens) {
  var p = PRICING[model];
  if (!p) return Promise.resolve(0);

  var cost = (inputTokens / 1e6) * p.input + (outputTokens / 1e6) * p.output;
  var today = new Date().toISOString().split("T")[0];
  var month = today.substring(0, 7);

  this.stats.totalCost += cost;

  return Promise.all([
    this.redis.incrByFloat("llm_cost:daily:" + today, cost),
    this.redis.incrByFloat("llm_cost:monthly:" + month, cost)
  ]).then(function(results) {
    var daily = parseFloat(results[0]);
    var monthly = parseFloat(results[1]);
    return { cost: cost, dailyTotal: daily, monthlyTotal: monthly };
  });
};

LLMClient.prototype._selectModel = function(tier) {
  var self = this;
  var today = new Date().toISOString().split("T")[0];

  return this.redis.get("llm_cost:daily:" + today).then(function(dailySpent) {
    var spent = parseFloat(dailySpent || 0);

    // If approaching budget, downgrade model
    if (spent > self.dailyBudget * 0.9 && tier === "complex") {
      console.log("Budget pressure: downgrading from complex to standard");
      return TIERS.standard;
    }
    if (spent > self.dailyBudget * 0.95) {
      console.log("Budget critical: using cheapest model");
      return TIERS.simple;
    }

    return TIERS[tier] || TIERS.standard;
  });
};

LLMClient.prototype.complete = function(prompt, options) {
  var tier = (options && options.tier) || "standard";
  var systemPrompt = (options && options.system) || "You are a helpful assistant.";
  var skipCache = (options && options.skipCache) || false;
  var self = this;

  self.stats.requests++;

  // Step 1: Check cache (unless skipped)
  var cachePromise = skipCache
    ? Promise.resolve({ hit: false, embedding: null })
    : self._checkCache(prompt);

  return cachePromise.then(function(cache) {
    if (cache.hit) {
      self.stats.cacheHits++;
      return {
        content: cache.data.response,
        model: cache.data.model,
        cached: true,
        similarity: cache.data.similarity,
        cost: 0,
        inputTokens: 0,
        outputTokens: 0
      };
    }

    // Step 2: Select model (with budget-aware downgrade)
    return self._selectModel(tier).then(function(selectedTier) {
      var inputText = systemPrompt + " " + prompt;
      var inputTokens = self.countTokens(inputText);

      // Step 3: Make API call
      return self.openai.chat.completions.create({
        model: selectedTier.model,
        max_tokens: selectedTier.maxTokens,
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: prompt }
        ]
      }).then(function(response) {
        var content = response.choices[0].message.content;
        var outputTokens = response.usage.completion_tokens;
        var actualInputTokens = response.usage.prompt_tokens;

        // Step 4: Track cost
        return self._trackCost(selectedTier.model, actualInputTokens, outputTokens).then(function(costInfo) {
          // Step 5: Cache the response
          var cachePromise = cache.embedding
            ? self._saveCache(prompt, content, cache.embedding, selectedTier.model)
            : Promise.resolve();

          return cachePromise.then(function() {
            return {
              content: content,
              model: selectedTier.model,
              cached: false,
              cost: costInfo.cost,
              inputTokens: actualInputTokens,
              outputTokens: outputTokens,
              dailyTotal: costInfo.dailyTotal,
              monthlyTotal: costInfo.monthlyTotal
            };
          });
        });
      });
    });
  });
};

LLMClient.prototype.getStats = function() {
  return {
    requests: this.stats.requests,
    cacheHits: this.stats.cacheHits,
    hitRate: this.stats.requests > 0
      ? (this.stats.cacheHits / this.stats.requests * 100).toFixed(1) + "%"
      : "0%",
    totalCost: "$" + this.stats.totalCost.toFixed(4)
  };
};

LLMClient.prototype.disconnect = function() {
  return this.redis.quit();
};

module.exports = LLMClient;

Usage Example

// app.js
var LLMClient = require("./llm-client");

var llm = new LLMClient({
  apiKey: process.env.OPENAI_API_KEY,
  redisUrl: process.env.REDIS_URL,
  dailyBudget: 25.00,
  monthlyBudget: 500.00
});

llm.connect().then(function() {

  // Simple classification - uses cheapest model
  return llm.complete("Is this email spam? 'You won a free iPhone!'", {
    tier: "simple",
    system: "Reply with only: spam or not_spam"
  });

}).then(function(result) {
  console.log("Response:", result.content);
  console.log("Model:", result.model);
  console.log("Cost: $" + result.cost.toFixed(6));
  console.log("Cached:", result.cached);
  // Response: spam
  // Model: gpt-4.1-nano
  // Cost: $0.000003
  // Cached: false

  // A near-identical question - semantically matched and served from cache
  return llm.complete("Is this email spam? 'You just won a free iPhone!'", {
    tier: "simple",
    system: "Reply with only: spam or not_spam"
  });

}).then(function(result) {
  console.log("\nSecond call:");
  console.log("Response:", result.content);
  console.log("Cached:", result.cached);
  console.log("Similarity:", result.similarity);
  console.log("Cost: $" + result.cost.toFixed(6));
  // Second call:
  // Response: spam
  // Cached: true
  // Similarity: 0.967
  // Cost: $0.000000

  console.log("\nStats:", llm.getStats());
  // Stats: { requests: 2, cacheHits: 1, hitRate: '50.0%', totalCost: '$0.0000' }

  return llm.disconnect();
});

Common Issues and Troubleshooting

1. Token Count Mismatch Between Estimate and Actual

Error: Expected ~200 input tokens, API reported 847

This happens when you forget to count system message tokens, or when the API adds formatting tokens around your messages. The tiktoken count is for raw text -- the actual API wraps each message with special tokens (<|im_start|>, role identifiers, etc.). Budget an extra 10-15% above your tiktoken estimate, and always use the usage field from the API response for cost tracking, not your pre-call estimates.

2. Redis Connection Failures Blocking API Calls

Error: connect ECONNREFUSED 127.0.0.1:6379

Never let a cache failure prevent your application from working. Wrap all Redis operations in try/catch and fall through to a direct API call. Your caching layer is an optimization, not a dependency.

function safeCacheCheck(query) {
  return getCachedResponse(query).catch(function(err) {
    console.warn("Cache unavailable, proceeding without cache:", err.message);
    // Match the shape getCachedResponse resolves with so callers need no special case
    return { match: null, embedding: null };
  });
}

3. Rate Limit Errors During Burst Traffic

Error: 429 Too Many Requests - Rate limit reached for gpt-4o-mini on tokens per min (TPM): Limit 200000

The fix is not just retry logic -- you need a client-side rate limiter that prevents you from hitting the limit in the first place. Use a token bucket or sliding window algorithm to throttle outgoing requests. The bottleneck npm package works well for this:

npm install bottleneck

var Bottleneck = require("bottleneck");

var limiter = new Bottleneck({
  maxConcurrent: 5,        // Max parallel requests
  minTime: 200,            // Min ms between requests
  reservoir: 50,           // Max requests per interval
  reservoirRefreshAmount: 50,
  reservoirRefreshInterval: 60000  // Refresh every minute
});

function rateLimitedComplete(client, prompt, options) {
  return limiter.schedule(function() {
    return client.complete(prompt, options);
  });
}

4. Stale Cache Returning Outdated Responses

User asks about "current Node.js LTS version" and gets a cached response from 3 months ago saying "Node.js 20"

Time-sensitive queries should bypass the cache entirely. Add query classification to detect temporal queries:

var TEMPORAL_KEYWORDS = ["current", "latest", "today", "now", "recent", "2026", "this week", "this month"];

function isTemporalQuery(query) {
  var lowerQuery = query.toLowerCase();
  return TEMPORAL_KEYWORDS.some(function(keyword) {
    return lowerQuery.indexOf(keyword) !== -1;
  });
}

// In your complete() method:
var skipCache = (options && options.skipCache) || isTemporalQuery(prompt);

5. Budget Enforcement Too Aggressive After Restart

Warning: Daily budget exceeded ($0.00/$25.00) - all requests blocked

If your Redis data is lost (restart, eviction) and counters reset to $0, the system simply starts fresh. The risk is the opposite case: restoring an old Redis dump brings back stale counters, so requests get blocked based on spend from a previous day or month. Always set TTLs on cost-tracking keys, and rebuild the daily and monthly key names from the current date on every check so a stale key is never mistaken for today's spend.


Best Practices

  • Always set max_tokens on every API call. An unbound response can generate thousands of tokens you did not intend. For classification tasks, set it to 10-20. For summaries, estimate the desired length and add 20% buffer.

  • Use the cheapest model that meets your quality bar. Start with GPT-4.1-nano or GPT-4o-mini. Only upgrade to GPT-4o or Claude Sonnet when you have measured evidence that the cheaper model is not good enough. Run A/B tests with real users if possible.

  • Cache aggressively, invalidate conservatively. For most applications, a 1-hour TTL on cached LLM responses is safe. FAQ bots and documentation assistants can use 24-hour caches. Conversational apps with personalization should cache less.

  • Track costs per feature, not just per application. Tag each API call with the feature that triggered it (e.g., "chat", "search", "classification"). This lets you identify which features are expensive and where optimization will have the most impact.

  • Use batch APIs for anything that can wait. Content generation, data enrichment, classification pipelines -- if the user is not staring at a loading spinner, use the batch endpoint and save 50%.

  • Implement circuit breakers for cost control. If your daily spend exceeds a threshold, automatically downgrade to cheaper models or disable non-essential AI features rather than hard-failing. A degraded experience is better than a $10,000 surprise bill.

  • Monitor token-to-value ratio. Not all tokens are created equal. A 500-token response that directly answers a user's question is more valuable than a 2,000-token response padded with disclaimers. Optimize your prompts to maximize useful output per token spent.

  • Pre-compute and store where possible. If you know what questions users frequently ask, pre-generate answers during off-peak hours using batch API pricing and serve them as static content. Serving a stored answer costs nothing per request, compared to paying for a real-time LLM call every time.

  • Version your prompts. When you change a prompt, the cache should be invalidated. Include a prompt version hash in your cache key to prevent stale responses after prompt updates (see the sketch below).
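
A minimal sketch of that last point -- the PROMPT_VERSION constant is an illustrative identifier you bump whenever the prompt template changes:

var crypto = require("crypto");

// Bump this whenever the system prompt or template changes
var PROMPT_VERSION = "classifier-v3";

function versionedCacheKey(query) {
  var hash = crypto.createHash("sha256")
    .update(PROMPT_VERSION + "::" + query)
    .digest("hex");
  return "llm_cache:" + PROMPT_VERSION + ":" + hash;
}

// Entries keyed under an older version are simply never matched again
// and age out via their TTL.
console.log(versionedCacheKey("Is this email spam?"));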

