Building a Real-Time Voice AI Agent: From Browser Prototype to Working Demo
Introduction
What if you could build a voice assistant that actually understands and responds naturally, without relying on expensive third-party platforms? That's exactly what I set out to explore, and the journey from a simple browser prototype to a polished demo taught me valuable lessons about real-time audio processing, state management, and the quirks of browser APIs.
In this article, I'll walk you through how I built a complete voice-to-voice conversational AI agent using Node.js, Express, OpenAI's API, and the Web Speech API. More importantly, I'll share the bugs I encountered, the architectural decisions I made, and the solutions that transformed a buggy prototype into a smooth, functional demo that actually works.
The Vision
The goal was straightforward: create a voice agent that could:
- Listen to users speak naturally
- Process their speech and understand intent
- Generate intelligent responses using AI
- Speak those responses back in a natural voice
- Maintain conversation context across multiple exchanges
Simple in concept, but as you'll see, the devil was in the details.

Phase 1: The Browser-Only Prototype
Starting Simple
I began with the most minimal viable version: a pure browser-based prototype using the Web Speech API. This validated two critical assumptions:
- Could I reliably capture voice input using SpeechRecognition?
- Could I synthesize natural-sounding speech using SpeechSynthesis?
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
const synth = window.speechSynthesis;
recognition.lang = "en-US";
recognition.interimResults = true;
recognition.continuous = true;
This prototype ran entirely in the browser with basic intent handling. It proved the concept worked, but had obvious limitations: no real AI, no server-side processing, and no way to keep API keys secure.
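To give a sense of what "basic intent handling" means in practice, here's an illustrative sketch; the actual rules aren't preserved in the final repo, so the keywords and replies below are hypothetical:

// Illustrative only: the kind of rule-based intent handling the prototype used
function handleIntent(text) {
  const lower = text.toLowerCase();
  if (lower.includes("time")) {
    return `It is ${new Date().toLocaleTimeString()}.`;
  }
  if (lower.includes("hello") || lower.includes("hi")) {
    return "Hello! How can I help you?";
  }
  return "Sorry, I don't know how to help with that yet.";
}

recognition.onresult = (event) => {
  const text = event.results[event.results.length - 1][0].transcript;
  const reply = handleIntent(text);
  const utter = new SpeechSynthesisUtterance(reply);
  synth.speak(utter);
};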
Key Learnings from the Prototype
The browser-only version proved the mechanics worked, but it revealed a fundamental limitation: you need real AI to have real conversations. Rule-based intent handling quickly becomes a maintenance nightmare and lacks the natural language understanding users expect. This was the key insight that drove the architecture forward: I needed to integrate an actual language model (OpenAI) to get genuinely useful, contextual responses.
Other important discoveries:
- Browser compatibility matters: Firefox and Safari have limited Web Speech API support. Chrome and Edge became the target browsers.
- Microphone permissions are critical: Users need to explicitly grant permission, and handling denial gracefully is essential.
- Voice selection makes a huge difference: System voices vary wildly in quality, so providing choice is important.
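Offering that choice is mostly a matter of listing what speechSynthesis.getVoices() returns. A minimal sketch, assuming a voice-selection dropdown in the page (the element ID is hypothetical):

// Populate a <select id="voiceSelect"> with the available system voices
const elVoice = document.getElementById("voiceSelect"); // hypothetical element
function loadVoices() {
  const voices = synth.getVoices();
  elVoice.innerHTML = voices
    .map((v, i) => `<option value="${i}">${v.name} (${v.lang})</option>`)
    .join("");
}
loadVoices();
synth.onvoiceschanged = loadVoices; // Chrome loads voices asynchronously

// Later, when speaking:
// utter.voice = synth.getVoices()[elVoice.value];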
Phase 2: Moving to Server-Side Rendering
Architecture Decision
I chose a server-side rendered approach using Express and EJS over a modern SPA framework. This wasn't about following trends—it was about leveraging what I know works. After running production Express sites for over 10 years, I knew exactly what I was getting: reliability, simplicity, and no surprises.
More importantly, I didn't want to install React, Vue, or any other heavyweight frontend framework for what is essentially a demo. The requirements were simple enough that vanilla JavaScript could handle everything the browser needed to do.
Why this approach made sense:
- Familiarity: I know Express inside and out—no learning curve, no gotchas
- Simplicity: No build tools, no complex state management, just server-rendered HTML
- Security: API keys stay on the server where they belong
- Speed: Get something working quickly without framework overhead
- Lightweight: The entire client-side JS is under 300 lines
The architecture split responsibilities cleanly:
- Browser: Handles audio capture (microphone) and playback (speakers)
- Server: Handles AI reasoning with OpenAI API
- Communication: Simple REST API for sending text and receiving responses
// server.js
app.post("/api/chat", async (req, res) => {
  const { userText, history } = req.body;

  const input = [
    { role: "system", content: systemPrompt },
    ...history,
    { role: "user", content: userText }
  ];

  const response = await client.responses.create({
    model: MODEL,
    input
  });

  const reply = response.output_text?.trim();

  // Append this exchange so the client can send it back on the next turn
  // (history trimming is covered in Phase 3)
  const updatedHistory = [
    ...history,
    { role: "user", content: userText },
    { role: "assistant", content: reply }
  ];

  res.json({ reply, history: updatedHistory });
});
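On the client side, the askServer() helper used throughout the rest of this article is just a thin fetch wrapper around that endpoint. Roughly (the repo's exact implementation may differ in details):

// Client-side helper: send the user's text plus history, get the reply back
async function askServer(userText) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ userText, history })
  });
  const data = await res.json();
  history = data.history; // server returns the updated, trimmed history
  return data.reply;
}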
File Structure
The project structure was intentionally minimal:
VoiceAgent101/
├── server.js # Express server + OpenAI integration
├── views/
│ └── index.ejs # HTML template
├── public/
│ ├── app.js # Client-side JavaScript
│ └── styles.css # Styling
└── .env # Environment variables
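The top of server.js is standard wiring: Express, EJS, static assets, and the OpenAI client. A sketch of what that setup looks like (the default model name and port here are placeholders, not the repo's actual values):

// server.js (setup, approximate)
require("dotenv").config();
const express = require("express");
const OpenAI = require("openai");

const app = express();
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const MODEL = process.env.OPENAI_MODEL || "gpt-4o-mini"; // placeholder default

app.set("view engine", "ejs");
app.use(express.static("public"));
app.use(express.json());

app.get("/", (req, res) => res.render("index"));

// ... /api/chat route as shown above ...

app.listen(process.env.PORT || 3000, () => console.log("Voice agent running"));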
Phase 3: OpenAI Integration
Choosing the Right API
OpenAI provides several APIs, but I went with the Responses API because it's designed for conversational use cases. The key insight was treating conversation history as a sliding window:
// Keep only the last 12 messages (6 exchanges)
const safeHistory = Array.isArray(history) ? history.slice(-12) : [];
This prevents context windows from growing infinitely while maintaining enough history for coherent conversation.
Crafting the System Prompt
The system prompt turned out to be crucial. My first attempt produced verbose, markdown-formatted responses. The agent would say things like "Here are three ways you can…" with bullet points—terrible for voice!
The refined prompt emphasized natural speech:
const systemPrompt = `You are a friendly and helpful voice assistant engaged in natural conversation.
Speak naturally as if talking to someone, not writing.
Keep responses concise (2-3 sentences) unless the user explicitly asks for more detail.
Avoid using markdown, bullet points, or special formatting—just speak naturally.`;
This simple change transformed the experience from robotic to conversational.
Phase 4: The Bug Hunt Begins
This is where things got interesting. The basic flow worked, but several critical bugs made the app unusable in practice.
Bug #1: The Feedback Loop
The Problem: The agent would hear its own voice and respond to itself, creating an infinite conversation loop.
What was happening:
- Agent speaks response
- Microphone picks up agent's voice
- Speech recognition interprets it as user input
- Agent responds to its own response
- Loop continues forever
The Solution: The microphone needed to be disabled while the agent spoke. But this was trickier than it sounds because SpeechSynthesis is asynchronous and doesn't provide reliable callbacks.
I made the speak() function return a Promise:
function speak(text) {
return new Promise((resolve) => {
const utter = new SpeechSynthesisUtterance(text);
utter.onend = () => resolve();
utter.onerror = () => resolve();
synth.cancel(); // Stop any ongoing speech
synth.speak(utter);
});
}
Then I could properly await it:
const reply = await askServer(userText);
addMessage("agent", reply);
await speak(reply); // Wait for speech to complete
// Only restart recognition after speech is done
I also added a 500ms cooldown after speech to ensure audio output fully cleared:
setTimeout(() => {
  try { recognition.start(); } catch {}
}, 500);
Bug #2: Unwanted Recognition Restarts
The Problem: Speech recognition would stop and restart multiple times while waiting for the server response, causing "no-speech" errors and confusing status displays.
Root Cause: The recognition.onend event fires automatically, and my code was set to auto-restart recognition whenever it ended. During the server call, recognition would end naturally, trigger a restart, timeout, end again, restart again…
The Solution: Add a processing flag to prevent auto-restarts during server calls:
let isProcessing = false;

recognition.onend = () => {
  // Only auto-restart if listening AND not processing
  if (isListening && !isProcessing) {
    setTimeout(() => {
      try { recognition.start(); } catch {}
    }, 250);
  }
};

// In the speech recognition handler:
isProcessing = true;
recognition.stop(); // Explicitly stop

const reply = await askServer(userText);
await speak(reply);

isProcessing = false; // Clear flag
This simple flag eliminated random stops/starts and made the flow predictable.
Bug #3: Stop Button Didn't Actually Stop
The Problem: Clicking "Stop" would disable the microphone, but the agent would keep speaking if it was mid-sentence.
The Solution: Cancel speech synthesis when stopping:
function stopListening() {
  isListening = false;
  try { recognition.stop(); } catch {}
  synth.cancel(); // Stop any ongoing speech
  setStatus("Idle");
}
Also check the isListening flag before processing responses:
if (isListening) {
  addMessage("agent", reply);
  await speak(reply);
}
Bug #4: Confusing Status Messages
The Problem: Users would see "Stopped listening" in the middle of normal conversations, making them think something broke.
Root Cause: I was logging every recognition.onend event, even when it was an automatic internal stop during processing.
The Solution: Only show "Stopped listening" for manual stops:
recognition.onend = () => {
  // Only show event if user manually stopped (not processing)
  if (!isProcessing) {
    addEventLine("⏹️ Stopped listening");
    setStatus("Idle");
  }
  // Auto-restart logic...
};
Phase 5: Polish and User Experience
Chat-Style Interface
Text logs are fine for debugging, but users expect a chat interface. I replaced the plain text log with proper message bubbles:
.msg.user .body {
  background: linear-gradient(135deg, var(--accent), var(--accent-2));
  color: #fff;
  border-bottom-right-radius: 4px; /* Speech bubble tail effect */
}

.msg.agent .body {
  background: var(--panel-2);
  color: var(--text);
  border-bottom-left-radius: 4px;
}
The result: user messages appear as blue bubbles on the right, agent messages as gray bubbles on the left, and system events in the center with icons.
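The addMessage() helper that builds those bubbles is plain DOM work. A sketch that matches the class names above (the repo version also handles system events and the per-message copy button):

// Append a chat bubble for either the user or the agent
function addMessage(role, text) {
  const msg = document.createElement("div");
  msg.className = `msg ${role}`; // "msg user" or "msg agent"

  const body = document.createElement("div");
  body.className = "body";
  body.textContent = text;

  msg.appendChild(body);
  elLog.appendChild(msg);
  elLog.scrollTop = elLog.scrollHeight; // keep the newest message in view
}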
Copy Functionality
Users often want to save or share conversations. I added two copy features:
- Copy individual agent responses with a 📋 button
- Copy entire transcript with a "Copy All" button
elCopyAll.onclick = async () => {
  const messages = elLog.querySelectorAll(".msg");
  let chatText = "Voice Agent Chat Transcript\n" + "=".repeat(40) + "\n\n";

  messages.forEach((msg) => {
    const role = msg.classList.contains("user") ? "You" : "Agent";
    const body = msg.querySelector(".body");
    chatText += `${role}: ${body.textContent}\n\n`;
  });

  await navigator.clipboard.writeText(chatText);
};
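The per-message 📋 button works the same way, just scoped to a single bubble. Something along these lines (the class name is hypothetical):

// Attach a copy button to an agent message (called when adding agent replies)
function addCopyButton(msg, text) {
  const btn = document.createElement("button");
  btn.className = "copy-btn"; // hypothetical class name
  btn.textContent = "📋";
  btn.onclick = () => navigator.clipboard.writeText(text);
  msg.appendChild(btn);
}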
Status Indicators
Clear feedback is crucial for voice interfaces since users can't see what's happening. I implemented a three-stage status progression:
- Listening… - Microphone is active
- ⏳ Waiting for agent… - Processing on server
- 🔊 Speaking… - Playing audio response
These simple indicators eliminated user confusion about why nothing was happening.
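Under the hood, setStatus() is a one-line DOM update; centralizing it is what keeps the three-stage progression consistent. Roughly, assuming a status element with id="status" (the ID is hypothetical):

// Single place to update the status line shown above the chat log
const elStatus = document.getElementById("status"); // hypothetical element ID
function setStatus(text) {
  elStatus.textContent = text;
}

// Usage at each stage of the flow:
// setStatus("Listening…");
// setStatus("⏳ Waiting for agent…");
// setStatus("🔊 Speaking…");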
Technical Architecture Deep Dive
State Management
The app maintains several critical state flags:
let isListening = false; // User wants to interact
let isProcessing = false; // Server call in progress
let finalBuffer = ""; // Accumulated speech text
let history = []; // Conversation context
The interaction between these flags is what makes the flow work smoothly:
- isListening controls whether we're in "conversation mode"
- isProcessing prevents race conditions during server calls
- finalBuffer accumulates text from multiple recognition results
- history maintains conversation context
Speech Recognition Flow
The recognition flow deserves special attention because it's event-driven and asynchronous:
recognition.onresult = async (event) => {
  let interim = "";

  // Collect all finalized text
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const res = event.results[i];
    const text = res[0].transcript;
    if (res.isFinal) finalBuffer += text;
    else interim += text;
  }

  // Show interim results in status
  if (interim) setStatus("Listening… (hearing)");

  // When we have finalized text, process it
  if (finalBuffer.trim()) {
    const userText = finalBuffer.trim();
    finalBuffer = "";

    isProcessing = true;
    recognition.stop();

    const reply = await askServer(userText);
    await speak(reply);

    isProcessing = false;
    // Recognition will auto-restart via onend handler
  }
};
This approach provides:
- Real-time visual feedback (interim results)
- Proper boundary detection (isFinal)
- Clean state transitions (isProcessing)
Conversation History Management
Keeping conversation history trim is important for both performance and cost:
const nextHistory = [
  ...safeHistory,
  { role: "user", content: userText },
  { role: "assistant", content: reply }
].slice(-12); // Keep only last 12 messages
Twelve messages (six exchanges) provides enough context for coherent conversation without bloating the context window or increasing API costs.
Lessons Learned
1. Async Audio is Hard
Browser audio APIs are asynchronous but inconsistent. SpeechSynthesis doesn't always fire onend reliably, especially on mobile. Always add timeouts as backup:
const timeout = setTimeout(() => resolve(), 10000); // 10s max
utter.onend = () => {
  clearTimeout(timeout);
  resolve();
};
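Folding that timeout into the earlier speak() function gives a version that can't hang indefinitely, even if the browser never fires onend:

function speak(text) {
  return new Promise((resolve) => {
    const utter = new SpeechSynthesisUtterance(text);
    const timeout = setTimeout(resolve, 10000); // fallback if onend never fires
    const done = () => {
      clearTimeout(timeout);
      resolve();
    };
    utter.onend = done;
    utter.onerror = done;
    synth.cancel();
    synth.speak(utter);
  });
}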
2. State Management in Event-Driven Code
When multiple async events can happen simultaneously (microphone, server calls, speech synthesis), explicit state flags prevent race conditions. Don't rely on component state alone.
3. User Feedback is Critical
Voice interfaces lack visual cues that users rely on. Status indicators, event logs, and clear feedback about what's happening are essential for a good user experience.
4. Browser Compatibility Matters
The Web Speech API has spotty support. Chrome and Edge work well, but Firefox and Safari have limitations. Always provide fallbacks and clear error messages.
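A feature check up front costs a few lines and saves users from a silently broken page (elStart here stands in for whatever start-button reference the page uses):

// Detect missing Web Speech API support before wiring anything up
const SpeechRecognitionImpl =
  window.SpeechRecognition || window.webkitSpeechRecognition;

if (!SpeechRecognitionImpl || !window.speechSynthesis) {
  setStatus("Voice features aren't supported in this browser. Try Chrome or Edge.");
  elStart.disabled = true; // hypothetical start-button element
}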
5. Prompt Engineering for Voice
AI models trained on text need explicit guidance to generate voice-appropriate responses. "Speak naturally, not write" seems obvious but makes a huge difference.
Results
The demo achieved exactly what I set out to prove:
- Latency: ~2-3 seconds from user speech to agent response start (mostly OpenAI API latency)
- Reliability: Zero feedback loops, clean state transitions, predictable behavior
- User Experience: Smooth, conversational, and actually pleasant to use
- Browser Support: Excellent on Chrome/Edge, limited on Firefox/Safari
- Cost: ~$0.01-0.05 per conversation (OpenAI API usage)
- Code Simplicity: Under 500 total lines of JavaScript
More importantly, it proved that you don't need complex frameworks or infrastructure to build something that works well. Sometimes the simplest approach is the best approach.
What's Next
This demo proved the concept works, but there are several enhancements I'm considering if I continue developing this:
1. Function Calling (Tools)
OpenAI's function calling would let the agent actually do things:
const tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get current weather for a location",
      parameters: {
        type: "object",
        properties: {
          location: { type: "string" }
        }
      }
    }
  }
];
This would enable the agent to:
- Check weather and calendars
- Search knowledge bases
- Execute database queries
- Control external systems
2. Knowledge File Integration
Build a framework for importing and working with knowledge files:
- Upload PDFs, docs, spreadsheets
- Automatic chunking and embedding
- Semantic search during conversations
- Context injection before AI calls
Think of it as giving the agent a "library" it can reference during conversations.
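None of this exists in the demo yet, but the core retrieval step could stay small: embed the user's question, then rank pre-embedded chunks by cosine similarity. A hypothetical sketch using OpenAI's embeddings endpoint:

// Hypothetical retrieval step: rank pre-embedded knowledge chunks against the question
async function findRelevantChunks(question, chunks, topK = 3) {
  const { data } = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: question
  });
  const q = data[0].embedding;

  const cosine = (a, b) => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  };

  return chunks
    .map((c) => ({ ...c, score: cosine(q, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}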
3. Authentication
Add user authentication to support:
- Personal conversation histories
- User-specific knowledge bases
- Role-based access control
- Usage tracking and quotas
4. Configurable System Prompts
Instead of hardcoding the system prompt, make it configurable:
const templates = {
  general: "You are a helpful assistant...",
  technical: "You are a technical support expert...",
  sales: "You are a sales assistant...",
  custom: userDefinedPrompt
};
This would let users customize the agent's personality and expertise for different use cases without touching code.
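Wiring that in would be a small change in the /api/chat handler, assuming the client sends a mode field with each request:

// Pick a prompt template per request; fall back to the general one
const systemPrompt = templates[req.body.mode] || templates.general;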
5. Streaming Responses
OpenAI supports streaming, which could reduce perceived latency:
const stream = await client.responses.create({
  model: MODEL,
  input,
  stream: true
});

for await (const chunk of stream) {
  // Speak each chunk as it arrives
}
Though this adds complexity around managing partial responses and interruptions.
Conclusion
Building this voice AI demo taught me that the hard parts aren't the AI itself—OpenAI handles that beautifully—but rather the "boring" details: managing microphone state, preventing feedback loops, handling async audio, and providing clear user feedback.
What started as a simple experiment turned into a genuine learning experience. The journey from buggy prototype to polished demo involved:
- Fixing a feedback loop that made the agent talk to itself
- Implementing proper state management with processing flags
- Adding polish with copy functions and status indicators
- Crafting prompts that generate voice-appropriate responses
The resulting demo is clean, responsive, and actually pleasant to use. More importantly, it proved that you can build something functional and useful without reaching for heavy frameworks or complex infrastructure. Sometimes the tools you already know—Express, vanilla JavaScript, and simple REST APIs—are exactly what you need.
If I continue developing this, the foundation is solid enough to add authentication, function calling, knowledge base integration, and configurable prompts without major refactoring. That's the beauty of keeping things simple: you can always add complexity later when you actually need it.
And if you're building voice interfaces, remember: the Web Speech API is powerful but quirky, async audio requires careful state management, and users need constant feedback about what's happening. Get those fundamentals right, and the rest falls into place.
Want to try it yourself? Check out the full code on GitHub: https://github.com/grizzlypeaksoftware/VoiceAgent101
Resources
The complete code for this demo is available on GitHub: https://github.com/grizzlypeaksoftware/VoiceAgent101
The repository includes:
- Full Express server implementation
- Client-side voice handling
- Styled chat interface
- Comprehensive documentation
- Setup instructions and troubleshooting guide
Tech Stack
- Backend: Node.js, Express.js, OpenAI API
- Frontend: Vanilla JavaScript, Web Speech API, EJS templates
- Styling: CSS with modern dark theme
- Voice: Browser SpeechRecognition + SpeechSynthesis
Key Dependencies
{
  "express": "^4.18.2",
  "ejs": "^3.1.9",
  "openai": "^4.20.1",
  "dotenv": "^16.3.1"
}
The beauty of this stack is its simplicity—no complex build tools, no heavy frameworks, just clean server-rendered HTML with vanilla JavaScript for voice interaction. Sometimes less is more.
Have you built voice interfaces before? What challenges did you face? What would you add to a demo like this? I'd love to hear your thoughts.
