
Fixing Claude's Slow Response Nights: Local Fallbacks & Hybrid Setups


If you've spent any serious time building with Claude's API, you've noticed the pattern. Mornings on the West Coast? Snappy. Sub-two-second responses on Haiku, maybe four or five on Sonnet. But come 6 PM Pacific, when evening traffic across the US is peaking and overnight batch jobs are spinning up, things get ugly. Timeouts. Retries. That soul-crushing spinner while your application hangs waiting for a response that might come back in 15 seconds or might not come back at all.

I hit this wall hard last fall while building out a content pipeline for Grizzly Peak Software. I had a Node.js system that was processing hundreds of articles through Claude for editing and metadata extraction. During the day, it hummed along beautifully. At night, it would stall, timeout, and leave me with half-processed batches that I'd have to reconcile in the morning.


So I did what any engineer living in an Alaskan cabin with questionable satellite internet would do: I built a fallback system that routes to local models when the cloud is slow. And honestly? It changed how I think about AI infrastructure entirely.


Why Cloud AI Has Bad Nights

Let's be clear about what's happening here. This isn't a criticism of Anthropic specifically — OpenAI has the same problem, Google's API has the same problem. Any cloud AI service that's getting hammered by millions of users is going to have variable latency.

The reasons are straightforward:

  1. Peak usage hours overlap. When it's late afternoon in San Francisco, it's evening in New York. That's a massive chunk of the world's developer population all hitting the same endpoints at once.

  2. Batch jobs stack up. Enterprise customers running large batch processes tend to kick them off at end-of-business, creating demand spikes that compound with interactive usage.

  3. GPU capacity is finite. Despite what the marketing materials suggest, there aren't infinite H100 clusters waiting to serve your requests. Inference capacity has physical limits.

  4. Rate limiting gets aggressive. When load increases, providers throttle more aggressively. You might not be hitting your explicit rate limit, but your requests are getting queued behind higher-priority traffic.

I started logging response times from Claude's API over a two-week period. The pattern was unmistakable: average response times roughly doubled between 5 PM and 11 PM Pacific. Some requests that took 3 seconds during the day were taking 12-15 seconds at night. And the timeout rate — requests that exceeded my 30-second cutoff — went from near-zero during business hours to about 8% during peak evening hours.

That's not acceptable for anything approaching a production system.
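
The logging itself doesn't need to be fancy. A bucketing helper along these lines (a sketch; the field names are illustrative, not from my actual pipeline) is enough to surface the hourly pattern from recorded samples:

```javascript
// Group logged API calls by hour of day and compute average latency plus
// the share of requests that hit the 30-second timeout cutoff.
function hourlyLatencyReport(samples) {
    var buckets = {};
    samples.forEach(function(s) {
        var hour = new Date(s.timestamp).getHours();
        if (!buckets[hour]) buckets[hour] = { total: 0, count: 0, timeouts: 0 };
        buckets[hour].total += s.latencyMs;
        buckets[hour].count += 1;
        if (s.latencyMs >= 30000) buckets[hour].timeouts += 1;
    });
    return Object.keys(buckets).map(function(hour) {
        var b = buckets[hour];
        return {
            hour: Number(hour),
            avgMs: Math.round(b.total / b.count),
            timeoutRate: b.timeouts / b.count
        };
    });
}
```

Dump the report to a CSV once a day and the evening spike is impossible to miss.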


The Case for Local Model Fallbacks

Here's the thing most people get wrong about local models: they think it's an all-or-nothing choice. Either you're using Claude (smart but sometimes slow) or you're using a local model (fast but dumber). That's a false dichotomy.

The real power is in hybrid routing. Use the cloud model when it's responsive. Fall back to a local model when it's not. And design your system so that either path produces acceptable results.

Not every task needs a frontier model. When I'm generating article metadata, summarizing text, or doing basic classification, a well-prompted 7B or 13B parameter model running locally can handle it just fine. When I need complex reasoning, nuanced writing, or multi-step analysis, I route to Claude. And when Claude is having a bad night, I either queue those complex tasks for later or accept slightly lower quality from a larger local model.

The economics work out too. Running Ollama on a machine with a decent GPU costs you electricity. Running Claude's API costs you per token. For high-volume workloads, the local fallback actually saves money even when the cloud is fast.
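
To make that concrete, here's a rough back-of-envelope sketch. Every number in it (token rates, wattage, electricity price) is an illustrative assumption, not a quoted price:

```javascript
// Monthly cloud cost for a steady workload, given per-million-token rates.
function monthlyCloudCost(requestsPerDay, inTokens, outTokens, inRatePerM, outRatePerM) {
    var perRequest = (inTokens / 1e6) * inRatePerM + (outTokens / 1e6) * outRatePerM;
    return requestsPerDay * 30 * perRequest;
}

// Monthly electricity cost for a GPU box left running around the clock.
function monthlyLocalCost(watts, dollarsPerKwh) {
    return (watts / 1000) * 24 * 30 * dollarsPerKwh;
}
```

At, say, 1,000 requests a day with 2,000 input and 500 output tokens each, a mid-tier cloud rate lands in the hundreds of dollars a month, while a 350 W GPU box at typical residential rates costs a few tens of dollars in power. The crossover comes fast at volume.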


Setting Up Ollama as Your Local Backend

Ollama is the easiest way to get local models running. I've tried llama.cpp directly, I've tried vLLM, I've tried text-generation-webui. Ollama wins on simplicity without sacrificing much flexibility.

Install it, pull a model, and you've got an OpenAI-compatible API running locally:

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull some models for different use cases
ollama pull llama3:8b        # Good general purpose fallback
ollama pull codellama:13b    # Better for code-related tasks
ollama pull mistral:7b       # Fast and surprisingly capable

Ollama exposes an API on localhost:11434 that's similar enough to OpenAI's format that you can swap it in with minimal code changes. Here's a basic wrapper that talks to both:

var http = require("http");
var https = require("https");

function callClaude(prompt, options) {
    return new Promise(function(resolve, reject) {
        var body = JSON.stringify({
            model: options.model || "claude-sonnet-4-20250514",
            max_tokens: options.maxTokens || 1024,
            messages: [{ role: "user", content: prompt }]
        });

        var req = https.request({
            hostname: "api.anthropic.com",
            path: "/v1/messages",
            method: "POST",
            headers: {
                "Content-Type": "application/json",
                "x-api-key": process.env.ANTHROPIC_API_KEY,
                "anthropic-version": "2023-06-01"
            },
            timeout: options.timeout || 15000
        }, function(res) {
            var data = "";
            res.on("data", function(chunk) { data += chunk; });
            res.on("end", function() {
                try {
                    var parsed = JSON.parse(data);
                    resolve({
                        text: parsed.content[0].text,
                        source: "claude",
                        latency: Date.now() - startTime
                    });
                } catch (e) {
                    reject(new Error("Failed to parse Claude response"));
                }
            });
        });

        var startTime = Date.now();
        req.on("timeout", function() { req.destroy(); reject(new Error("Claude timeout")); });
        req.on("error", reject);
        req.write(body);
        req.end();
    });
}

function callOllama(prompt, options) {
    return new Promise(function(resolve, reject) {
        var body = JSON.stringify({
            model: options.localModel || "llama3:8b",
            prompt: prompt,
            stream: false,
            options: {
                temperature: options.temperature || 0.7,
                num_predict: options.maxTokens || 1024
            }
        });

        var req = http.request({
            hostname: "localhost",
            port: 11434,
            path: "/api/generate",
            method: "POST",
            headers: { "Content-Type": "application/json" }
        }, function(res) {
            var data = "";
            res.on("data", function(chunk) { data += chunk; });
            res.on("end", function() {
                try {
                    var parsed = JSON.parse(data);
                    resolve({
                        text: parsed.response,
                        source: "ollama-" + (options.localModel || "llama3:8b"),
                        latency: Date.now() - startTime
                    });
                } catch (e) {
                    reject(new Error("Failed to parse Ollama response"));
                }
            });
        });

        var startTime = Date.now();
        req.on("error", reject);
        req.write(body);
        req.end();
    });
}

Nothing fancy here. Just two functions that call two different APIs and return results in a consistent format. The source field is important — you want to know which model actually handled each request for logging and quality monitoring.


Building the Hybrid Router

Here's where it gets interesting. The router needs to make a decision: should this request go to the cloud or to the local model? And it needs to make that decision fast, before the user (or your pipeline) is left waiting.

My approach uses a rolling latency tracker. It keeps a window of recent Claude API response times and uses that to predict whether the next request is likely to be fast or slow:

var latencyTracker = {
    window: [],
    maxSize: 20,
    slowThreshold: 8000,  // 8 seconds

    record: function(latencyMs) {
        this.window.push(latencyMs);
        if (this.window.length > this.maxSize) {
            this.window.shift();
        }
    },

    getAverage: function() {
        if (this.window.length === 0) return 0;
        var sum = this.window.reduce(function(a, b) { return a + b; }, 0);
        return sum / this.window.length;
    },

    isCloudSlow: function() {
        if (this.window.length < 3) return false;  // Not enough data
        return this.getAverage() > this.slowThreshold;
    },

    getRecentFailRate: function() {
        var recent = this.window.slice(-5);
        if (recent.length === 0) return 0;  // avoid 0/0 before any samples exist
        var failures = recent.filter(function(l) { return l >= 30000; });
        return failures.length / recent.length;
    }
};

Now the router itself. It considers the task complexity, the current cloud performance, and whether a local model can handle the job:

var COMPLEX_TASKS = ["analysis", "writing", "reasoning", "code_review"];
var SIMPLE_TASKS = ["summarize", "classify", "extract", "format"];

function routeRequest(prompt, taskType, options) {
    var useLocal = false;
    var reason = "";

    // If cloud is slow and task is simple, go local
    if (latencyTracker.isCloudSlow() && SIMPLE_TASKS.indexOf(taskType) !== -1) {
        useLocal = true;
        reason = "cloud_slow_simple_task";
    }

    // If recent failure rate is high, go local regardless
    if (latencyTracker.getRecentFailRate() > 0.4) {
        useLocal = true;
        reason = "high_failure_rate";
    }

    // Force cloud for complex tasks unless it's completely down
    if (COMPLEX_TASKS.indexOf(taskType) !== -1 && latencyTracker.getRecentFailRate() < 0.8) {
        useLocal = false;
        reason = "complex_task_cloud_preferred";
    }

    // Allow manual override
    if (options.forceLocal) {
        useLocal = true;
        reason = "manual_override";
    }

    console.log("[router] task=" + taskType + " useLocal=" + useLocal + " reason=" + reason +
        " avgLatency=" + Math.round(latencyTracker.getAverage()) + "ms");

    if (useLocal) {
        return callOllama(prompt, options);
    }

    // Try cloud with fallback to local on timeout
    return callClaude(prompt, options).then(function(result) {
        latencyTracker.record(result.latency);
        return result;
    }).catch(function(err) {
        console.log("[router] Claude failed (" + err.message + "), falling back to Ollama");
        latencyTracker.record(30000);  // Record as max latency
        return callOllama(prompt, options);
    });
}

The key insight here is the tiered decision-making. Simple tasks get routed locally at the first sign of slowness. Complex tasks stay on Claude unless the service is essentially down. And every cloud request, successful or not, feeds back into the latency tracker so the system adapts in real-time.


The llama.cpp Alternative for Maximum Control

Ollama is great for getting started, but if you want more control — custom quantization, fine-tuned GGUF models, or specific GPU layer allocation — llama.cpp is worth the extra setup.

I run llama.cpp on a dedicated machine with an RTX 3090. The 24GB of VRAM lets me run 13B parameter models at full speed or 30B models with some CPU offloading. Here's my typical startup:

# Build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUDA=1

# Run the server with a specific model
./server \
    -m models/codellama-13b-instruct.Q5_K_M.gguf \
    --host 0.0.0.0 \
    --port 8081 \
    -ngl 40 \
    -c 4096 \
    --threads 8

The -ngl 40 flag offloads 40 layers to the GPU. For a 13B model with Q5 quantization, that's basically the whole model in VRAM. The --host 0.0.0.0 makes it accessible from other machines on my local network, which matters when my development machine and my inference machine are different boxes.
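
As a sanity check on that claim, the VRAM arithmetic is straightforward. The bits-per-weight figure for a given GGUF quant type is approximate, so treat this as an estimate:

```javascript
// Rough weight-memory estimate for a quantized model.
// bitsPerWeight is approximate per GGUF quant type (Q5_K_M is roughly 5.5).
function estimateVramGb(paramsBillions, bitsPerWeight) {
    return paramsBillions * 1e9 * bitsPerWeight / 8 / 1e9;
}
```

estimateVramGb(13, 5.5) comes out around 8.9 GB of weights, which is why a Q5 13B model fits comfortably on a 24GB card with room left for the KV cache and context.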

llama.cpp's server exposes an OpenAI-compatible endpoint at /v1/chat/completions, so integrating it into the router is straightforward:

function callLlamaCpp(prompt, options) {
    return new Promise(function(resolve, reject) {
        var body = JSON.stringify({
            messages: [{ role: "user", content: prompt }],
            max_tokens: options.maxTokens || 1024,
            temperature: options.temperature || 0.7
        });

        var req = http.request({
            hostname: options.llamaHost || "localhost",
            port: options.llamaPort || 8081,
            path: "/v1/chat/completions",
            method: "POST",
            headers: { "Content-Type": "application/json" }
        }, function(res) {
            var data = "";
            res.on("data", function(chunk) { data += chunk; });
            res.on("end", function() {
                try {
                    var parsed = JSON.parse(data);
                    resolve({
                        text: parsed.choices[0].message.content,
                        source: "llama-cpp",
                        latency: Date.now() - startTime
                    });
                } catch (e) {
                    reject(new Error("Failed to parse llama.cpp response"));
                }
            });
        });

        var startTime = Date.now();
        req.on("error", reject);
        req.write(body);
        req.end();
    });
}


Handling Quality Differences Between Models

Here's the honest truth that nobody talks about enough: local models are worse than Claude for most tasks. A Llama 3 8B model is not going to produce the same quality output as Claude Sonnet. Pretending otherwise is delusional.

But "worse" doesn't mean "unusable." It means you need to design your prompts and your system differently depending on which model is handling the request.

What I do is maintain two prompt templates for each task type — one optimized for Claude and one optimized for local models:

var promptTemplates = {
    summarize: {
        cloud: "Summarize the following article in 2-3 sentences, capturing the key technical insight:\n\n{content}",
        local: "You are a technical writer. Read the following article and write exactly 2 sentences summarizing the main point. Be specific and technical. Do not add opinions.\n\nArticle:\n{content}\n\nSummary:"
    },
    classify: {
        cloud: "Classify this article into one of these categories: {categories}\n\nArticle: {content}\n\nRespond with just the category name.",
        local: "Task: Pick ONE category from this list that best matches the article.\nCategories: {categories}\n\nArticle: {content}\n\nRules:\n- Respond with ONLY the category name\n- Do not explain your choice\n- Pick exactly one category\n\nCategory:"
    },
    extract_metadata: {
        cloud: "Extract the following metadata from this article as JSON: title, author, date, tags (array), reading_time_minutes (estimate).\n\n{content}",
        local: "Extract metadata from this article. Return valid JSON only, no other text.\n\nRequired fields:\n- title (string)\n- author (string) \n- date (string, ISO format)\n- tags (array of strings, max 5)\n- reading_time_minutes (integer)\n\nArticle:\n{content}\n\nJSON:"
    }
};

function getPrompt(taskType, modelSource, variables) {
    var template = promptTemplates[taskType];
    if (!template) throw new Error("Unknown task type: " + taskType);

    var promptText = modelSource === "cloud" ? template.cloud : template.local;

    Object.keys(variables).forEach(function(key) {
        promptText = promptText.replace("{" + key + "}", variables[key]);
    });

    return promptText;
}

The local model prompts are more explicit, more constrained, and more structured. Local models need more hand-holding. They're more likely to ramble, add unnecessary preambles, or deviate from the requested format. The tighter you make the prompt, the better the results.

I also run a simple validation layer on responses from local models:

function validateResponse(result, taskType) {
    if (taskType === "classify") {
        var validCategories = ["AI", "JavaScript", "DevOps", "Architecture", "Database", "Career"];
        if (validCategories.indexOf(result.text.trim()) === -1) {
            return { valid: false, reason: "Invalid category: " + result.text.trim() };
        }
    }

    if (taskType === "extract_metadata") {
        try {
            JSON.parse(result.text);
        } catch (e) {
            return { valid: false, reason: "Invalid JSON response" };
        }
    }

    if (result.text.trim().length < 10) {
        return { valid: false, reason: "Response too short" };
    }

    return { valid: true };
}

When validation fails on a local model response, the system can either retry with a different local model, queue the task for Claude when it's less busy, or flag it for manual review. In practice, about 90% of simple tasks pass validation from local models on the first try.
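
Those three options can live in one small policy function. A sketch, where the action names and the choice of alternate model are illustrative assumptions rather than my exact production values:

```javascript
// Decide what to do with a local-model response that failed validation.
function escalate(task, attempt) {
    if (attempt === 0) {
        // First failure: retry once on a different local model
        return { action: "retry_local", model: "mistral:7b" };
    }
    if (task.deferrable) {
        // Still failing: queue for Claude during off-peak hours
        return { action: "defer" };
    }
    // Non-deferrable and out of retries: a human looks at it
    return { action: "manual_review" };
}
```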


Putting It All Together: A Production-Ready Pipeline

Here's how I wire this into an actual content processing pipeline:

var queue = require("./queue");  // Simple file-based task queue

function processBatch(articles) {
    var tasks = [];

    articles.forEach(function(article) {
        // Simple tasks — route through hybrid system
        tasks.push({
            id: article.id + "-classify",
            type: "classify",
            content: article.text,
            priority: "normal"
        });

        tasks.push({
            id: article.id + "-summarize",
            type: "summarize",
            content: article.text,
            priority: "normal"
        });

        // Complex task — prefer cloud, defer if unavailable
        tasks.push({
            id: article.id + "-analysis",
            type: "analysis",
            content: article.text,
            priority: "high",
            deferrable: true
        });
    });

    return processTasksSequentially(tasks, 0, []);
}

function processTasksSequentially(tasks, index, results) {
    if (index >= tasks.length) return Promise.resolve(results);

    var task = tasks[index];
    // Pick the prompt template matching the router's likely choice. If the router
    // later falls back mid-request, the cloud-style prompt still runs locally,
    // just less tightly tuned for the smaller model.
    var modelSource = latencyTracker.isCloudSlow() ? "local" : "cloud";
    var prompt = getPrompt(task.type, modelSource, { content: task.content });

    return routeRequest(prompt, task.type, { timeout: 15000 })
        .then(function(result) {
            var validation = validateResponse(result, task.type);
            if (!validation.valid && task.deferrable) {
                console.log("[pipeline] Deferring task " + task.id + ": " + validation.reason);
                queue.defer(task);
                results.push({ id: task.id, status: "deferred" });
            } else {
                results.push({ id: task.id, status: "complete", result: result });
            }
            return processTasksSequentially(tasks, index + 1, results);
        })
        .catch(function(err) {
            console.log("[pipeline] Task " + task.id + " failed: " + err.message);
            results.push({ id: task.id, status: "error", error: err.message });
            return processTasksSequentially(tasks, index + 1, results);
        });
}

The deferred queue is checked periodically during off-peak hours. I run a cron job at 4 AM Pacific that processes any deferred tasks — that's when Claude is consistently fastest.
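
The drain itself can be a small script fired by cron. This sketch assumes hypothetical queue methods listDeferred() and complete(), and keys everything off a timezone-aware hour check:

```javascript
// Is it currently the off-peak window (2-6 AM) in US Pacific time?
function isOffPeakPacific(date) {
    var hour = Number(new Intl.DateTimeFormat("en-US", {
        hour: "numeric", hour12: false, timeZone: "America/Los_Angeles"
    }).format(date));
    return hour >= 2 && hour < 6;
}

// Process all deferred tasks, but only during the off-peak window.
// `queue.listDeferred()` and `queue.complete()` are hypothetical names.
function drainDeferred(queue, process, now) {
    if (!isOffPeakPacific(now || new Date())) return Promise.resolve([]);
    var tasks = queue.listDeferred();
    return Promise.all(tasks.map(function(task) {
        return process(task).then(function(result) {
            queue.complete(task.id);
            return result;
        });
    }));
}
```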


Real-World Results

After running this hybrid system for three months, here are the numbers:

  • Overall completion rate: 99.7% (up from 94% with cloud-only)
  • Average response time: 4.2 seconds (down from 6.8 seconds cloud-only)
  • Cost reduction: 35% lower API spend (simple tasks running locally)
  • Night-time reliability: Timeout rate dropped from 8% to under 1%

The quality trade-off is real but manageable. Local model outputs for classification and summarization are about 92% as good as Claude's, measured by my manual spot-check sampling. For the tasks I route locally, that's an acceptable trade.


What I'd Do Differently

If I were building this from scratch today, I'd add a few things:

Health check pinging. Instead of relying only on actual request latency, send a lightweight ping to Claude every 60 seconds to proactively detect slowdowns before they affect real requests.

Model-specific routing. Different local models are better at different tasks. Route classification to Mistral (it's weirdly good at it), summarization to Llama 3, and code tasks to CodeLlama. The router should match task types to model strengths.
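
That routing table can be a plain lookup keyed by task type, using the Ollama tags pulled earlier. A sketch:

```javascript
// Map task types to the local model best suited for them.
var MODEL_BY_TASK = {
    classify: "mistral:7b",       // Mistral is weirdly good at classification
    summarize: "llama3:8b",       // Llama 3 for summaries
    code_review: "codellama:13b"  // CodeLlama for code tasks
};

function pickLocalModel(taskType) {
    return MODEL_BY_TASK[taskType] || "llama3:8b";  // general-purpose default
}
```

The router would then pass pickLocalModel(taskType) as options.localModel instead of a single hardcoded fallback.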

Gradual degradation. Instead of binary cloud/local switching, implement a percentage-based split. When cloud latency is medium, send 70% to cloud and 30% to local. Only go fully local when things are really bad.
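
A sketch of that split. The thresholds and percentages here are assumptions to tune against your own latency logs, not measured optima:

```javascript
// Map the rolling average cloud latency to a probability of routing to the cloud.
function cloudShare(avgLatencyMs) {
    if (avgLatencyMs < 5000) return 1.0;   // fast: send everything to the cloud
    if (avgLatencyMs < 8000) return 0.7;   // medium: 70/30 split
    if (avgLatencyMs < 15000) return 0.3;  // slow: mostly local
    return 0.0;                            // effectively down: all local
}

// Coin-flip against the share; `rand` is injectable for testing.
function chooseCloud(avgLatencyMs, rand) {
    return (rand === undefined ? Math.random() : rand) < cloudShare(avgLatencyMs);
}
```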

Response caching. If you're processing similar content, cache responses keyed by a hash of the prompt. Many classification and extraction tasks produce identical results for similar inputs.

The hybrid approach isn't glamorous. It's plumbing. But it's the kind of plumbing that turns a fragile demo into a system you can actually rely on. And living in Alaska, where my satellite internet adds its own latency on top of everything else, having a local fallback isn't just convenient — it's necessary.

Build for the worst case. Your best case will take care of itself.


Shane Larson is a software engineer and the founder of Grizzly Peak Software. He builds AI-powered tools from a cabin in Caswell Lakes, Alaska, where the northern lights are more reliable than cloud API response times. His book on training large language models is available on Amazon.
