Building a Real-Time Voice AI Agent: From Browser Prototype to Working Demo

Author: Shane Larson | Published on: December 22, 2025
Building a voice AI agent from scratch using Node.js, Express, and OpenAI's API. Learn how to handle browser speech recognition, prevent feedback loops, manage async audio, and create smooth voice conversations—all without heavy frameworks.

Introduction

What if you could build a voice assistant that actually understands and responds naturally, without relying on expensive third-party platforms? That's exactly what I set out to explore, and the journey from a simple browser prototype to a polished demo taught me valuable lessons about real-time audio processing, state management, and the quirks of browser APIs.

In this article, I'll walk you through how I built a complete voice-to-voice conversational AI agent using Node.js, Express, OpenAI's API, and the Web Speech API. More importantly, I'll share the bugs I encountered, the architectural decisions I made, and the solutions that transformed a buggy prototype into a smooth, functional demo that actually works.

The Vision

The goal was straightforward: create a voice agent that could:

  • Listen to users speak naturally
  • Process their speech and understand intent
  • Generate intelligent responses using AI
  • Speak those responses back in a natural voice
  • Maintain conversation context across multiple exchanges

Simple in concept, but as you'll see, the devil was in the details.

[Screenshot: the voice agent demo interface]

Phase 1: The Browser-Only Prototype

Starting Simple

I began with the most minimal viable version: a pure browser-based prototype using the Web Speech API. This validated two critical assumptions:

  1. Could I reliably capture voice input using SpeechRecognition?
  2. Could I synthesize natural-sounding speech using SpeechSynthesis?

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
const synth = window.speechSynthesis;

recognition.lang = "en-US";
recognition.interimResults = true;
recognition.continuous = true;

This prototype ran entirely in the browser with basic intent handling. It proved the concept worked, but had obvious limitations: no real AI, no server-side processing, and no way to keep API keys secure.
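
For context, the prototype's "basic intent handling" was nothing more than keyword matching against the transcript. The sketch below is hypothetical (my actual rules differed), but it captures the idea, and hints at why this approach doesn't scale:

// Hypothetical rule-based intent handling: just keyword matching
function handleIntent(transcript) {
  const text = transcript.toLowerCase();

  if (text.includes("time")) {
    return `It's ${new Date().toLocaleTimeString()}.`;
  }
  if (text.includes("hello")) {
    return "Hello! How can I help you today?";
  }
  // Anything unrecognized falls through to a canned reply
  return "Sorry, I don't know how to help with that yet.";
}

recognition.onresult = (event) => {
  const result = event.results[event.results.length - 1];
  if (!result.isFinal) return; // Ignore interim results in the prototype

  const reply = handleIntent(result[0].transcript);
  synth.speak(new SpeechSynthesisUtterance(reply));
};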

Key Learnings from the Prototype

The browser-only version proved the mechanics worked, but it revealed a fundamental limitation: you need real AI to have real conversations. Rule-based intent handling quickly becomes a maintenance nightmare and lacks the natural language understanding users expect. This was the key insight that drove the architecture forward: I needed to integrate an actual language model (OpenAI) to get genuinely useful, contextual responses.

Other important discoveries:

  • Browser compatibility matters: Firefox and Safari have limited Web Speech API support. Chrome and Edge became the target browsers.
  • Microphone permissions are critical: Users need to explicitly grant permission, and handling denial gracefully is essential.
  • Voice selection makes a huge difference: System voices vary wildly in quality, so providing choice is important. (These last two points are sketched right after this list.)
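
Here is a minimal sketch of those two points, reusing the recognition and synth objects from above (the "status" and "voiceSelect" element IDs are illustrative, not taken from the demo):

// Surface a denied microphone permission instead of failing silently
recognition.onerror = (event) => {
  if (event.error === "not-allowed" || event.error === "service-not-allowed") {
    document.getElementById("status").textContent =
      "Microphone access was denied. Check your browser's site permissions.";
  }
};

// Let the user pick from the available system voices
function loadVoices() {
  const select = document.getElementById("voiceSelect");
  select.innerHTML = "";
  synth.getVoices().forEach((voice, i) => {
    const opt = document.createElement("option");
    opt.value = i;
    opt.textContent = `${voice.name} (${voice.lang})`;
    select.appendChild(opt);
  });
}

// Chrome loads voices asynchronously, so refresh the list when they arrive
synth.onvoiceschanged = loadVoices;
loadVoices();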

Phase 2: Moving to Server-Side Rendering

Architecture Decision

I chose a server-side rendered approach using Express and EJS over a modern SPA framework. This wasn't about following trends—it was about leveraging what I know works. After running production Express sites for over 10 years, I knew exactly what I was getting: reliability, simplicity, and no surprises.

More importantly, I didn't want to install React, Vue, or any other heavyweight frontend framework for what is essentially a demo. The requirements were simple enough that vanilla JavaScript could handle everything the browser needed to do.

Why this approach made sense:

  1. Familiarity: I know Express inside and out—no learning curve, no gotchas
  2. Simplicity: No build tools, no complex state management, just server-rendered HTML
  3. Security: API keys stay on the server where they belong
  4. Speed: Get something working quickly without framework overhead
  5. Lightweight: The entire client-side JS is under 300 lines

The architecture split responsibilities cleanly:

  • Browser: Handles audio capture (microphone) and playback (speakers)
  • Server: Handles AI reasoning with OpenAI API
  • Communication: Simple REST API for sending text and receiving responses

// server.js (client is an OpenAI instance; MODEL and systemPrompt are defined above)
app.post("/api/chat", async (req, res) => {
  const { userText, history } = req.body;

  // Trim incoming history to a sliding window (see Phase 3)
  const safeHistory = Array.isArray(history) ? history.slice(-12) : [];

  const input = [
    { role: "system", content: systemPrompt },
    ...safeHistory,
    { role: "user", content: userText }
  ];

  const response = await client.responses.create({
    model: MODEL,
    input
  });

  const reply = response.output_text?.trim();

  // Append the new exchange and keep only the most recent messages
  const updatedHistory = [
    ...safeHistory,
    { role: "user", content: userText },
    { role: "assistant", content: reply }
  ].slice(-12);

  res.json({ reply, history: updatedHistory });
});
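
On the browser side, that endpoint is consumed by a small fetch wrapper. The askServer() helper referenced throughout the client snippets below looks roughly like this (a sketch; the demo's exact version may differ slightly):

// public/app.js: send the transcript plus history, get back the reply
async function askServer(userText) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ userText, history })
  });
  const data = await res.json();
  history = data.history; // Server returns the trimmed, updated history
  return data.reply;
}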

File Structure

The project structure was intentionally minimal:

VoiceAgent101/
├── server.js              # Express server + OpenAI integration
├── views/
│   └── index.ejs         # HTML template
├── public/
│   ├── app.js            # Client-side JavaScript
│   └── styles.css        # Styling
└── .env                  # Environment variables

Phase 3: OpenAI Integration

Choosing the Right API

OpenAI provides several APIs, but I went with the Responses API because it's designed for conversational use cases. The key insight was treating conversation history as a sliding window:

// Keep only the last 12 messages (6 exchanges)
const safeHistory = Array.isArray(history) ? history.slice(-12) : [];

This prevents context windows from growing infinitely while maintaining enough history for coherent conversation.

Crafting the System Prompt

The system prompt turned out to be crucial. My first attempt produced verbose, markdown-formatted responses. The agent would say things like "Here are three ways you can…" with bullet points—terrible for voice!

The refined prompt emphasized natural speech:

const systemPrompt = `You are a friendly and helpful voice assistant engaged in natural conversation. 
Speak naturally as if talking to someone, not writing. 
Keep responses concise (2-3 sentences) unless the user explicitly asks for more detail. 
Avoid using markdown, bullet points, or special formatting—just speak naturally.`;

This simple change transformed the experience from robotic to conversational.

Phase 4: The Bug Hunt Begins

This is where things got interesting. The basic flow worked, but several critical bugs made the app unusable in practice.

Bug #1: The Feedback Loop

The Problem: The agent would hear its own voice and respond to itself, creating an infinite conversation loop.

What was happening:

  1. Agent speaks response
  2. Microphone picks up agent's voice
  3. Speech recognition interprets it as user input
  4. Agent responds to its own response
  5. Loop continues forever

The Solution: The microphone needed to be disabled while the agent spoke. But this was trickier than it sounds because SpeechSynthesis is asynchronous and doesn't provide reliable callbacks.

I made the speak() function return a Promise:

function speak(text) {
  return new Promise((resolve) => {
    const utter = new SpeechSynthesisUtterance(text);
    utter.onend = () => resolve();
    utter.onerror = () => resolve();

    synth.cancel(); // Stop any ongoing speech
    synth.speak(utter);
  });
}

Then I could properly await it:

const reply = await askServer(userText);
addMessage("agent", reply);
await speak(reply); // Wait for speech to complete
// Only restart recognition after speech is done

I also added a 500ms cooldown after speech to ensure audio output fully cleared:

setTimeout(() => {
  try { recognition.start(); } catch {}
}, 500);

Bug #2: Unwanted Recognition Restarts

The Problem: Speech recognition would stop and restart multiple times while waiting for the server response, causing "no-speech" errors and confusing status displays.

Root Cause: The recognition.onend event fires automatically, and my code was set to auto-restart recognition whenever it ended. During the server call, recognition would end naturally, trigger a restart, timeout, end again, restart again…

The Solution: Add a processing flag to prevent auto-restarts during server calls:

let isProcessing = false;

recognition.onend = () => {
  // Only auto-restart if listening AND not processing
  if (isListening && !isProcessing) {
    setTimeout(() => {
      try { recognition.start(); } catch {}
    }, 250);
  }
};

// In the speech recognition handler:
isProcessing = true;
recognition.stop(); // Explicitly stop
const reply = await askServer(userText);
await speak(reply);
isProcessing = false; // Clear flag

This simple flag eliminated random stops/starts and made the flow predictable.

Bug #3: Stop Button Didn't Actually Stop

The Problem: Clicking "Stop" would disable the microphone, but the agent would keep speaking if it was mid-sentence.

The Solution: Cancel speech synthesis when stopping:

function stopListening() {
  isListening = false;
  try { recognition.stop(); } catch {}
  synth.cancel(); // Stop any ongoing speech
  setStatus("Idle");
}

Also check the isListening flag before processing responses:

if (isListening) {
  addMessage("agent", reply);
  await speak(reply);
}

Bug #4: Confusing Status Messages

The Problem: Users would see "Stopped listening" in the middle of normal conversations, making them think something broke.

Root Cause: I was logging every recognition.onend event, even when it was an automatic internal stop during processing.

The Solution: Only show "Stopped listening" for manual stops:

recognition.onend = () => {
  // Only show event if user manually stopped (not processing)
  if (!isProcessing) {
    addEventLine("⏹️ Stopped listening");
    setStatus("Idle");
  }
  // Auto-restart logic...
};

Phase 5: Polish and User Experience

Chat-Style Interface

Text logs are fine for debugging, but users expect a chat interface. I replaced the plain text log with proper message bubbles:

.msg.user .body {
  background: linear-gradient(135deg, var(--accent), var(--accent-2));
  color: #fff;
  border-bottom-right-radius: 4px; /* Speech bubble tail effect */
}

.msg.agent .body {
  background: var(--panel-2);
  color: var(--text);
  border-bottom-left-radius: 4px;
}

The result: user messages appear as blue bubbles on the right, agent messages as gray bubbles on the left, and system events in the center with icons.
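
Behind this is a small addMessage() helper, which the client snippets in this article call whenever a user or agent message needs to be rendered. A minimal sketch (the real version also wires up the copy button described next):

// Append a chat bubble to the log; role is "user" or "agent"
function addMessage(role, text) {
  const msg = document.createElement("div");
  msg.className = `msg ${role}`;

  const body = document.createElement("div");
  body.className = "body";
  body.textContent = text;

  msg.appendChild(body);
  elLog.appendChild(msg);
  elLog.scrollTop = elLog.scrollHeight; // Keep the newest message in view
}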

Copy Functionality

Users often want to save or share conversations. I added two copy features:

  1. Copy individual agent responses with a 📋 button
  2. Copy entire transcript with a "Copy All" button

elCopyAll.onclick = async () => {
  const messages = elLog.querySelectorAll(".msg");
  let chatText = "Voice Agent Chat Transcript\n" + "=".repeat(40) + "\n\n";

  messages.forEach((msg) => {
    const role = msg.classList.contains("user") ? "You" : "Agent";
    const body = msg.querySelector(".body");
    chatText += `${role}: ${body.textContent}\n\n`;
  });

  await navigator.clipboard.writeText(chatText);
};
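
The per-message 📋 button works the same way, just scoped to a single agent bubble. Roughly (the class name and helper are illustrative):

// Attach a copy button to an agent message bubble
function addCopyButton(msg, text) {
  const btn = document.createElement("button");
  btn.className = "copy-btn";
  btn.textContent = "📋";
  btn.onclick = async () => {
    await navigator.clipboard.writeText(text);
    btn.textContent = "✅";
    setTimeout(() => (btn.textContent = "📋"), 1500); // Brief confirmation
  };
  msg.appendChild(btn);
}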

Status Indicators

Clear feedback is crucial for voice interfaces since users can't see what's happening. I implemented a three-stage status progression:

  1. Listening… - Microphone is active
  2. ⏳ Waiting for agent… - Processing on server
  3. 🔊 Speaking… - Playing audio response

These simple indicators eliminated user confusion about why nothing was happening.
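
The helpers behind these indicators are tiny. setStatus() and addEventLine() show up throughout the snippets above; they look roughly like this (element IDs and class names are illustrative):

const elStatus = document.getElementById("status");

// Update the single status line shown above the chat log
function setStatus(text) {
  elStatus.textContent = text;
}

// Add a centered system event line (e.g. "⏹️ Stopped listening") to the log
function addEventLine(text) {
  const line = document.createElement("div");
  line.className = "event";
  line.textContent = text;
  elLog.appendChild(line);
}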

Technical Architecture Deep Dive

State Management

The app maintains several critical state flags:

let isListening = false;    // User wants to interact
let isProcessing = false;   // Server call in progress
let finalBuffer = "";       // Accumulated speech text
let history = [];           // Conversation context

The interaction between these flags is what makes the flow work smoothly (a sketch of the matching start handler follows the list below):

  • isListening controls whether we're in "conversation mode"
  • isProcessing prevents race conditions during server calls
  • finalBuffer accumulates text from multiple recognition results
  • history maintains conversation context
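
For completeness, the start handler mirrors the stopListening() shown earlier; a sketch of how the flags are reset when the user clicks Start:

function startListening() {
  isListening = true;
  isProcessing = false;
  finalBuffer = "";
  setStatus("Listening…");
  try { recognition.start(); } catch {}
}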

Speech Recognition Flow

The recognition flow deserves special attention because it's event-driven and asynchronous:

recognition.onresult = async (event) => {
  let interim = "";

  // Collect finalized and interim text separately
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const res = event.results[i];
    const text = res[0].transcript;
    if (res.isFinal) finalBuffer += text;
    else interim += text;
  }

  // Show interim results in status
  if (interim) setStatus("Listening… (hearing)");

  // When we have finalized text, process it
  if (finalBuffer.trim()) {
    const userText = finalBuffer.trim();
    finalBuffer = "";

    isProcessing = true;
    recognition.stop();

    const reply = await askServer(userText);
    await speak(reply);

    isProcessing = false;
    // Recognition will auto-restart via onend handler
  }
};

This approach provides:

  • Real-time visual feedback (interim results)
  • Proper boundary detection (isFinal)
  • Clean state transitions (isProcessing)

Conversation History Management

Keeping conversation history trim is important for both performance and cost:

const nextHistory = [
  ...safeHistory,
  { role: "user", content: userText },
  { role: "assistant", content: reply }
].slice(-12); // Keep only last 12 messages

Twelve messages (six exchanges) provides enough context for coherent conversation without bloating the context window or increasing API costs.

Lessons Learned

1. Async Audio is Hard

Browser audio APIs are asynchronous but inconsistent. SpeechSynthesis doesn't always fire onend reliably, especially on mobile. Always add timeouts as backup:

// Inside the speak() Promise, alongside the existing onend handler:
const timeout = setTimeout(() => resolve(), 10000); // 10s max
utter.onend = () => {
  clearTimeout(timeout);
  resolve();
};

2. State Management in Event-Driven Code

When multiple async events can happen simultaneously (microphone, server calls, speech synthesis), explicit state flags prevent race conditions. Don't rely on component state alone.

3. User Feedback is Critical

Voice interfaces lack visual cues that users rely on. Status indicators, event logs, and clear feedback about what's happening are essential for a good user experience.

4. Browser Compatibility Matters

The Web Speech API has spotty support. Chrome and Edge work well, but Firefox and Safari have limitations. Always provide fallbacks and clear error messages.
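
A minimal example of the kind of fallback I mean: detect support up front and tell the user, rather than failing silently (the elStart button reference is illustrative):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

if (!SpeechRecognition || !window.speechSynthesis) {
  setStatus("This browser doesn't support the Web Speech API. Try Chrome or Edge.");
  elStart.disabled = true; // Disable the Start button so users aren't left guessing
}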

5. Prompt Engineering for Voice

AI models trained on text need explicit guidance to generate voice-appropriate responses. "Speak as if talking, not writing" seems obvious, but it makes a huge difference.

Results

The demo achieved exactly what I set out to prove:

  • Latency: ~2-3 seconds from user speech to agent response start (mostly OpenAI API latency)
  • Reliability: Zero feedback loops, clean state transitions, predictable behavior
  • User Experience: Smooth, conversational, and actually pleasant to use
  • Browser Support: Excellent on Chrome/Edge, limited on Firefox/Safari
  • Cost: ~$0.01-0.05 per conversation (OpenAI API usage)
  • Code Simplicity: Under 500 total lines of JavaScript

More importantly, it proved that you don't need complex frameworks or infrastructure to build something that works well. Sometimes the simplest approach is the best approach.

What's Next

This demo proved the concept works, but there are several enhancements I'm considering if I continue developing this:

1. Function Calling (Tools)

OpenAI's function calling would let the agent actually do things:

const tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get current weather for a location",
      parameters: {
        type: "object",
        properties: {
          location: { type: "string" }
        }
      }
    }
  }
];

This would enable the agent to:

  • Check weather and calendars
  • Search knowledge bases
  • Execute database queries
  • Control external systems

2. Knowledge File Integration

Build a framework for importing and working with knowledge files:

  • Upload PDFs, docs, spreadsheets
  • Automatic chunking and embedding
  • Semantic search during conversations
  • Context injection before AI calls

Think of it as giving the agent a "library" it can reference during conversations.
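
None of this is built yet, but the retrieval step at the core of it is small. A rough sketch using OpenAI embeddings and cosine similarity (chunking, storage, and caching omitted; in practice you would precompute the chunk embeddings rather than re-embedding them per question):

// Embed the question and the chunks, then rank chunks by cosine similarity
async function findRelevantChunks(question, chunks, topK = 3) {
  const { data } = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: [question, ...chunks.map((c) => c.text)]
  });
  const [qVec, ...chunkVecs] = data.map((d) => d.embedding);

  const cosine = (a, b) => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  };

  return chunks
    .map((chunk, i) => ({ chunk, score: cosine(qVec, chunkVecs[i]) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}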

3. Authentication

Add user authentication to support:

  • Personal conversation histories
  • User-specific knowledge bases
  • Role-based access control
  • Usage tracking and quotas

4. Configurable System Prompts

Instead of hardcoding the system prompt, make it configurable:

const templates = {
  general: "You are a helpful assistant...",
  technical: "You are a technical support expert...",
  sales: "You are a sales assistant...",
  custom: userDefinedPrompt
};

This would let users customize the agent's personality and expertise for different use cases without touching code.

5. Streaming Responses

OpenAI supports streaming, which could reduce perceived latency:

const stream = await client.responses.create({
  model: MODEL,
  input,
  stream: true
});

for await (const chunk of stream) {
  // Speak each chunk as it arrives
}

Though this adds complexity around managing partial responses and interruptions.
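
One way to tame that complexity would be to buffer the stream server-side and only forward complete sentences to the browser's speak(). A rough, untested sketch, assuming the SDK emits response.output_text.delta events and that some relay (SSE or WebSocket) carries each sentence to the client; sendSentenceToBrowser() is hypothetical:

let buffer = "";

for await (const event of stream) {
  if (event.type === "response.output_text.delta") {
    buffer += event.delta;

    // Forward a sentence as soon as one is complete
    const match = buffer.match(/^(.+?[.!?])\s*(.*)$/s);
    if (match) {
      sendSentenceToBrowser(match[1]); // Hypothetical relay to the client's speak()
      buffer = match[2];
    }
  }
}

if (buffer.trim()) sendSentenceToBrowser(buffer); // Flush whatever is left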

Conclusion

Building this voice AI demo taught me that the hard parts aren't the AI itself—OpenAI handles that beautifully—but rather the "boring" details: managing microphone state, preventing feedback loops, handling async audio, and providing clear user feedback.

What started as a simple experiment turned into a genuine learning experience. The journey from buggy prototype to polished demo involved:

  • Fixing a feedback loop that made the agent talk to itself
  • Implementing proper state management with processing flags
  • Adding polish with copy functions and status indicators
  • Crafting prompts that generate voice-appropriate responses

The resulting demo is clean, responsive, and actually pleasant to use. More importantly, it proved that you can build something functional and useful without reaching for heavy frameworks or complex infrastructure. Sometimes the tools you already know—Express, vanilla JavaScript, and simple REST APIs—are exactly what you need.

If I continue developing this, the foundation is solid enough to add authentication, function calling, knowledge base integration, and configurable prompts without major refactoring. That's the beauty of keeping things simple: you can always add complexity later when you actually need it.

And if you're building voice interfaces, remember: the Web Speech API is powerful but quirky, async audio requires careful state management, and users need constant feedback about what's happening. Get those fundamentals right, and the rest falls into place.

Want to try it yourself? Check out the full code on GitHub: https://github.com/grizzlypeaksoftware/VoiceAgent101

Resources

The complete code for this demo is available on GitHub: https://github.com/grizzlypeaksoftware/VoiceAgent101

The repository includes:

  • Full Express server implementation
  • Client-side voice handling
  • Styled chat interface
  • Comprehensive documentation
  • Setup instructions and troubleshooting guide

Tech Stack

  • Backend: Node.js, Express.js, OpenAI API
  • Frontend: Vanilla JavaScript, Web Speech API, EJS templates
  • Styling: CSS with modern dark theme
  • Voice: Browser SpeechRecognition + SpeechSynthesis

Key Dependencies

{
  "express": "^4.18.2",
  "ejs": "^3.1.9",
  "openai": "^4.20.1",
  "dotenv": "^16.3.1"
}

The beauty of this stack is its simplicity—no complex build tools, no heavy frameworks, just clean server-rendered HTML with vanilla JavaScript for voice interaction. Sometimes less is more.


Have you built voice interfaces before? What challenges did you face? What would you add to a demo like this? I'd love to hear your thoughts.
