The Night Claude Got Dumber: What Happened to Model Performance, and What I Did About It

It was 11 PM on a Tuesday, and I was in the middle of a productive Claude Code session — refactoring the job board on Grizzly Peak Software — when something shifted. The same model that had been writing crisp, architecturally sound Node.js code for the past three hours suddenly started producing output that felt like it came from a different brain entirely.

Functions that had been correctly using var and require() started showing up as const and import. Error handling that had been thorough and production-grade became superficial. Responses that had been concise started bloating with unnecessary explanations I hadn't asked for. And the model kept "forgetting" context from earlier in the conversation, asking me to re-explain things I'd stated clearly twenty messages ago.

I stared at my screen in the dim light of my cabin in Caswell Lakes, Alaska, and thought: "Did Claude just get dumber?"

If you've worked with AI models extensively — not just casual ChatGPT conversations, but real, sustained development work — you know exactly what I'm talking about. And you probably thought you were imagining it.

You weren't.


This Is a Real Phenomenon, and It's Maddening

Let me be clear about what I mean, because people who use AI casually will think I'm being dramatic. I'm not talking about the model occasionally getting something wrong. Models make mistakes — that's expected and manageable. I'm talking about perceptible shifts in capability during a single work session or between sessions on the same day.

One hour Claude is a senior engineer. The next hour it's a junior dev who read a blog post about the topic. The code structure changes. The reasoning depth changes. The ability to maintain context across a long conversation changes. It's like your pair programming partner got swapped out for their less experienced cousin, and nobody told you.

I started noticing this pattern while building out the 500+ article library for Grizzly Peak Software. Some sessions produced brilliant first drafts that needed minimal editing. Other sessions, using the exact same prompts and system instructions, produced generic slop that I'd have been embarrassed to publish. The only variable that changed was when I was running the prompts.

After a few weeks of this, I started keeping notes. Time of day, day of week, response quality on a 1-5 scale, specific failure modes. What I found wasn't scientific proof of anything, but the patterns were hard to ignore.


The Theories: What's Actually Going On

Theory 1: Load Balancing and Model Routing

This is the theory I find most plausible, and it's the one that the AI community has the most circumstantial evidence for.

Large language model providers don't run a single instance of their model. They run fleets of servers, often with different hardware configurations, quantization levels, and sometimes even slightly different model versions. When you send a request, a load balancer routes it to whatever server has capacity.

During peak usage hours — business hours in the US, roughly 9 AM to 6 PM Eastern — the load is highest. To handle demand without everything falling over, providers likely route some requests to servers running more aggressively quantized (compressed) versions of the model. Quantization makes the model smaller and faster but reduces quality, sometimes noticeably.

This would explain why my best Claude sessions tend to happen between 10 PM and 2 AM Alaska time (2 AM to 6 AM Eastern). Less load, more likely to get routed to a full-precision instance.

Is this confirmed? No. Anthropic, OpenAI, and Google don't publicly disclose their routing strategies. But it's consistent with standard practice for any high-traffic web service, and it matches my observations.

Theory 2: Silent Model Updates

AI providers update their models constantly. Sometimes they announce it. Sometimes they don't. And sometimes an "update" is actually a regression.

I distinctly remember a week in late 2025 when Claude's coding ability seemed to drop across the board. Conversations on Reddit and Twitter confirmed I wasn't alone — thousands of developers noticed the same thing. Anthropic never officially acknowledged a change, but the degradation was real and lasted several days before quality returned to baseline.

The problem with silent updates is that they break your calibrated expectations. If you've spent weeks refining a prompt that works perfectly with one model version, and the model quietly changes underneath you, your prompt might now produce worse output — and you'll blame yourself before you blame the model.

Theory 3: Context Window Degradation

This one is less about the provider and more about how transformer models fundamentally work. As your conversation gets longer, the model's ability to attend to earlier messages degrades. This isn't a bug — it's an architectural limitation of how attention mechanisms work, even in models with large context windows.

Having 200k tokens of context available doesn't mean the model uses all 200k tokens equally. Information at the beginning and end of the context window gets more attention than information in the middle. So your carefully crafted system prompt at the top of the conversation gradually loses influence as the conversation grows.

This explains why Claude can feel brilliant at the start of a session and mediocre an hour later — even without any load balancing shenanigans. The model is literally paying less attention to your instructions.

Theory 4: Temperature and Sampling Variance

Sometimes the model isn't dumber — it's just unlucky. LLMs are probabilistic. The same prompt can produce different outputs every time, and occasionally the random sampling produces a response that's noticeably worse than average.

This theory explains individual bad responses but not sustained quality drops across an entire session. When I notice degradation, it's usually consistent for a stretch of time, not random blips.

Theory 5: You're Tired and Your Prompts Are Getting Worse

I'd be dishonest if I didn't include this one. Sometimes the model didn't get dumber — I got lazier. At 1 AM, my prompts are less precise, my context-setting is sloppier, and my expectations are higher because I'm tired and impatient. The model reflects the quality of what you put in.

That said, I've controlled for this. I've run the exact same saved prompt at different times and gotten materially different quality outputs. So while prompt fatigue is a real factor, it doesn't explain everything.


The Data (Such As It Is)

Here's what my informal tracking over six weeks showed. This isn't peer-reviewed science — it's one engineer's observations. Take it for what it's worth.

Best performance windows (Alaska Time):

  • 10 PM to 2 AM (2 AM - 6 AM Eastern) — consistently highest quality
  • 5 AM to 7 AM (9 AM - 11 AM Eastern) — surprisingly good, maybe fresh model instances spun up for the day

Worst performance windows:

  • 12 PM to 4 PM (4 PM - 8 PM Eastern) — peak US usage, most noticeable degradation
  • Weekday afternoons worse than weekend afternoons

Specific degradation patterns I observed:

  • Code style inconsistency (ignoring explicit style instructions)
  • Reduced reasoning depth (surface-level answers to complex questions)
  • Context amnesia (forgetting instructions from earlier in the conversation)
  • Increased verbosity (padding responses with unnecessary explanations)
  • More "hedging" language ("you might want to consider…" instead of direct answers)

Again — anecdotal. But consistently anecdotal.


Practical Fixes That Actually Help

Complaining about model degradation is satisfying but unproductive. Here's what I actually do about it.

Fix 1: Keep Conversations Short

This is the single most effective mitigation. Instead of running one long conversation for an entire development session, I start a new conversation every 30-45 minutes or whenever I shift to a different task.

Yes, this means re-establishing context each time. I keep a CLAUDE.md file in my project root with all my project context, coding conventions, and current task state. When I start a new conversation, the model reads that file and is immediately up to speed. It takes about 30 seconds of overhead and saves me from the slow context degradation that kills quality in long sessions.

For Grizzly Peak Software, my CLAUDE.md file includes the tech stack, file structure, URL patterns, environment variables, and coding style preferences. When Claude Code loads this at the start of a session, it's like handing a new contractor the project documentation — they can be productive immediately.
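As a sketch of what that looks like, here is a minimal CLAUDE.md skeleton. The specific entries are illustrative, not the actual Grizzly Peak file:

```markdown
# Project Context (illustrative skeleton, not the real file)

## Tech stack
- Node.js, CommonJS modules: var, function(), require() — no ESM

## Coding conventions
- Thorough error handling on every function that performs I/O
- Code examples must be self-contained and runnable

## Current task
- Refactoring the job board; re-read these conventions before writing code
```

The point is that everything the model needs to be productive lives in one file the tooling loads automatically, so starting a fresh conversation costs seconds, not minutes.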

Fix 2: Work During Off-Peak Hours

This sounds dumb and obvious, but it works. My most productive AI-assisted coding sessions happen late at night Alaska time. If the load balancing theory is correct, fewer users means better routing, which means better outputs.

I realize not everyone lives in Alaska with a flexible schedule. But if you have any control over when you do your AI-heavy work, try shifting it to early morning or late evening US time and see if you notice a difference.

Fix 3: Use Explicit Quality Anchors in Your Prompts

When I notice quality starting to slip, I add explicit quality requirements to my prompts. Not vague instructions like "be thorough" — specific, measurable requirements.

Requirements for this response:
- All code must use var, function(), and require() — no const, let, arrow functions, or ESM imports
- Include error handling for every function that performs I/O
- Every code example must be self-contained and runnable
- Do not include explanations unless I specifically ask for them
- If you are uncertain about any technical claim, say so explicitly rather than guessing

This acts as a quality floor. The model might still produce lower-quality output, but the explicit constraints prevent it from drifting as far as it would without them.

Fix 4: Maintain a Fallback Model Strategy

I don't rely on a single model anymore. My primary workflow uses Claude (currently through Claude Code), but I keep fallback options ready:

  • GPT-4o for when Claude seems to be having a bad day. Different provider, different infrastructure, different degradation patterns. When one is off, the other is often fine.
  • Gemini 2.5 Pro for research-heavy tasks. Google's infrastructure is massive enough that I've noticed fewer load-related quality fluctuations, though the baseline coding quality is different from Claude's.
  • Local models via Ollama for tasks where I need consistency over capability. A local Llama model won't have brilliant days or terrible days — it'll have the same mediocre-but-predictable day every day. For boilerplate generation and simple refactoring, predictability beats occasional brilliance.

I wrote about my experience with Gemini becoming a daily driver in a previous article. The point isn't that any one model is best — it's that having options means you're never stuck waiting for one provider to get its act together.

Fix 5: Version Your Prompts and Track Results

I keep my important prompts in files, not in my head. When a prompt that was working well starts producing worse output, I can verify that the prompt hasn't changed — which means the model has.

More importantly, I timestamp the results. When I look back at my prompt performance log and see that the same prompt produced great results on Monday and poor results on Wednesday, that's data. It won't fix the problem, but it stops me from going insane thinking I'm imagining things.

Fix 6: The Nuclear Option — Restart Everything

When things are really bad, I close the terminal, clear the conversation, wait fifteen minutes, and start completely fresh. Sometimes you've gotten routed to a bad instance, or your conversation context has degraded beyond recovery, or something else is happening that a clean start will fix.

It feels wasteful. It sometimes is wasteful. But it's less wasteful than spending an hour fighting a degraded model and producing work you'll have to redo anyway.


The Bigger Picture: Trust but Verify

Here's what I've learned from over a year of using AI models as my primary development tool: they are not deterministic, they are not consistent, and they are not transparent about why.

This doesn't mean they're not useful. They're incredibly useful. Claude Code has probably 3x'd my shipping speed across Grizzly Peak Software, AutoDetective.ai, and the various side projects I'm always tinkering with. But I've had to learn to work with the inconsistency rather than pretending it doesn't exist.

The engineers who get burned by AI tools are the ones who calibrate to the model's best performance and then get frustrated when it doesn't maintain that level. Calibrate to the model's average performance instead. Treat the good sessions as bonuses and the bad sessions as expected variance.

And always — always — review the output. I don't care how good the model was yesterday. Today's code gets read by human eyes before it goes anywhere near production. That's not distrust. That's engineering.


What I Want from AI Providers

If Anthropic, OpenAI, or Google are reading this (they're not, but humor me):

Transparency about routing. If I'm getting a quantized model during peak hours, tell me. Let me opt into a queue for the full-precision model if I'm willing to wait. Don't silently give me a worse product and pretend nothing changed.

Version pinning. Let me pin to a specific model version for a project. If I've calibrated my prompts to Claude 3.5 Sonnet v2, don't silently upgrade me to v3 and break my workflows. Make upgrades opt-in for professional users.

Performance metrics. Give me a dashboard showing my request routing, model version, and inference time. I don't need to know your infrastructure secrets — I just need enough information to understand why Tuesday was great and Wednesday was terrible.

These aren't unreasonable asks. Every other API I use provides versioning and monitoring. AI APIs should be no different.


The Night Shift Advantage

I'll end with this: the night Claude "got dumber" was the night I fundamentally changed how I work with AI tools. I stopped treating them as reliable utilities and started treating them as talented but unpredictable collaborators.

Some nights, my collaborator is firing on all cylinders and we ship incredible work together. Some nights, they're having an off night and I need to carry more of the weight myself. And some nights, I look at the output and say "you know what, let me handle this one" and write the code myself.

That's not a failure of the technology. That's just the reality of working with probabilistic systems in 2026. And honestly? It's not that different from working with human teammates. Some days your senior dev is brilliant. Some days they're distracted and you have to pick up the slack.

The difference is that Claude doesn't get offended when I throw away its output and start over. That's a feature, not a bug.

Now if you'll excuse me, it's 11 PM in Alaska. Prime coding hours.


Shane Larson is a software engineer with 30+ years of experience, writing code and articles from his cabin in Caswell Lakes, Alaska. He runs Grizzly Peak Software and AutoDetective.ai, and he promises he's not just being paranoid about the model quality thing.
