
Vibe Coding a LiveKit Voice Agent

Jan 27, 2026 · 19 min read

LiveKit separates the voice agent from the frontend, giving you control over STT, LLM, and TTS choices - including custom voice clones for any language.

Introduction

I'd been building voice agents for a while, mainly with Retell AI, though I'd dabbled with Ultravox too, and it was going well. The platform handles a lot of complexity - you configure your agent, add tools, get post-call data and, of course, connect it to a phone number, and it works. Most of my projects were in English or Spanish, and Retell's built-in voices covered those nicely.

Then an opportunity came along with a client who needed something different - a language not supported by Retell. And that's where LiveKit entered the picture.

The following blog post is part 1 of my first attempt at building a voice agent from scratch with LiveKit. I'll be sharing my experience and lessons learned. I am not a software developer, but I do have a background in AI and voice technology and a decent understanding of how code works.

Why LiveKit? The Language Problem

The voice AI space has two types of platforms. There are the integrated platforms like Retell, Vapi, and Bland that handle everything for you. Then there are the frameworks like LiveKit and Pipecat that give you the building blocks to assemble yourself.

📝 With LiveKit, you're assembling Lego blocks of code. With Retell, you're buying a pre-built set.

Integrated platforms are brilliant for getting started quickly. You don't need to think about speech-to-text, LLM orchestration, or text-to-speech separately - it's all packaged together, and it works great. The trade-off is flexibility, customisation and price. If you need a niche language like Hebrew or Catalan - something outside their supported options - you're stuck.

LiveKit sits at the framework level. It provides the real-time audio infrastructure - the WebSocket/WebRTC connections, the room management, the audio streaming - but leaves the choice of AI components up to you. With Retell, for instance, you can choose voices from ElevenLabs and Cartesia; with LiveKit you can choose from a wide variety of providers - in our case, MiniMax. LiveKit lets you pick your own STT provider, your own LLM, your own TTS.

The Voice Agent Stack - taxonomy showing LiveKit, MiniMax, OpenAI, Cloudflare, Cal.com, Vercel, Next.js, React, Python - Vibe Coded with Claude Opus 4.5

This modular approach meant I could use MiniMax for text-to-speech. MiniMax offers voice cloning - upload a sample of someone speaking, and it creates a voice that sounds like them. Suddenly, Hebrew wasn't a limitation. Neither was any other language.

The Two-Part Architecture

Because we are building a web-based application prototype/demo, where the 'caller' speaks with the app on screen rather than through a phone, there are two separate deployments needed.

Part 1: The Agent

The agent is a Python application that handles the Voice Agent side of things. It receives audio from the user, converts it to text (STT), sends that text to an LLM such as OpenAI's GPT-4.1 or Gemini Flash 3.0, and converts the response back to speech (TTS). This can run on LiveKit Cloud or you can self-host it. For our prototype, I chose LiveKit Cloud for convenience.

Two deployments from local dev - one rocket to LiveKit Cloud for the Agent, one to Vercel for the Frontend

Part 2: The Frontend

The frontend is a web application - in my case, built with React and Next.js - that provides the user interface. At this stage I'm not interested in telephony at all. I want users to experience interacting with the Voice Agent to get an idea of what they might be buying - to imagine how their clients might feel booking an appointment by speaking to Grace. Oh, I forgot to mention: my client chose the name Grace.

📝 Using a cool web app interface enhances the user experience. LiveKit and ElevenLabs have some great-looking components - you can find them in the shadcn directory.

Web app design

For me it's important, even at this stage of prototyping, to have the right interface in keeping with the client's brand. If it's a spa or an upmarket hotel, we want tasteful colours, premium fonts and an elegant interface. If it's a fast-paced burger bar, something a little more playful and dynamic. Thankfully, LiveKit has just released some pretty cool new components for their voice agents that you can find in the shadcn directory. I went for the Orb in turquoise - LiveKit's default, yes, but it matched our Azure Spa theme perfectly.

Security concept - protecting access to the voice agent demo

At this stage I should also mention that I added a little security feature to the Voice Agent - a login panel. The plan is to send the demo to selected users (leads) and provide them with the login. We don't want any Tom, Dick or Harry using the agent - while LiveKit's per-minute rates are more affordable than Retell's, the costs of running a Voice Agent can still add up quickly.

Once in, users see a single button: ‘Talk with Grace’. Once they accept the microphone permission, the Orb comes to life and the conversation begins.

The frontend doesn't do any AI Voice processing. It generates an authentication token, connects to LiveKit Cloud, and streams audio via WebRTC. All the intelligence lives in the agent in the background.

User visits URL → Frontend generates token → Connects to LiveKit Cloud → Agent handles voice

User flow - from visiting URL through frontend token generation to LiveKit Cloud and agent
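Those tokens are standard JWTs signed with the LiveKit API secret - which is why the trailing-newline gotcha covered later in this post breaks validation. Purely to illustrate the shape (in practice you'd use LiveKit's server SDK; the `room` grant claim below is a simplified assumption, not LiveKit's actual token schema), a JWT is just two base64url-encoded JSON segments plus an HMAC-SHA256 signature:

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def make_jwt(api_key: str, api_secret: str, identity: str, room: str) -> str:
    """Sketch of an HMAC-SHA256 JWT; claim names here are simplified assumptions."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {
        "iss": api_key,                 # issuer: the API key
        "sub": identity,                # the user joining the room
        "exp": int(time.time()) + 600,  # short-lived token
        "room": room,                   # simplified grant claim (assumption)
    }
    signing_input = (
        f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(payload).encode())}"
    )
    signature = hmac.new(
        api_secret.encode(), signing_input.encode(), hashlib.sha256
    ).digest()
    return f"{signing_input}.{b64url(signature)}"


token = make_jwt("APIKEY123", "SECRET", identity="visitor-1", room="bella-spa")
```

Note that a stray trailing newline in the secret changes every signature - which is exactly the Vercel environment-variable pitfall described later in this post.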

Effectively, a LiveKit agent delivered as a web app is two applications living in one project. This separation is a simple and obvious concept, but coming from Retell it took a few minutes to sink in. The frontend is just a web page. I could embed it anywhere, style it however I wanted, add my own authentication. The agent runs independently on LiveKit's servers in the cloud (or on my own server somewhere).

The Voice Stack

With the architecture basics understood, I needed to choose my voice stack. LiveKit's Python SDK has plugins for most major providers.

What is a stack?

Stack simply refers to the technologies used to build a product or a workflow.

What is an SDK?

An SDK (Software Development Kit) is like a toolbox that a company gives you so you can build things that work with their product. Instead of figuring everything out from scratch, you get pre-made pieces - ready-to-use code, instructions, and examples - that make it much easier to connect your project to their service.

Speech-to-Text: OpenAI Whisper

Whisper handles Hebrew well. I set the language parameter explicitly to avoid any confusion. One of the exciting things about building with LiveKit, though, is that I can swap this out. Whisper has an excellent reputation for transcribing speech to text, but what if there were something better? For instance, ElevenLabs recently released ‘Scribe v2 Realtime’. Maybe that model is better, faster, more affordable. No problem - swap out Whisper for Scribe v2! In fact, once we get to the testing and review stage, that's exactly what we'll do. We can observe just about every aspect of each call - including STT.

Note: STT (Speech-to-Text) is also known as ASR (Automatic Speech Recognition). You'll see both terms used interchangeably.

```python
stt=openai.STT(model="whisper-1", language="he")
```

LLM: GPT-4.1-mini for now

I used GPT-4.1-mini to power the agent. It's fast enough for real-time conversation and handles the example Hebrew in the prompts without issue. The system prompt is written in English with Hebrew example phrases, and the model responds in whatever language the caller uses. That said, there were times during my conversations when the agent would slip back into Hebrew even though I was speaking English. For production use, or even to demo with the client, we'll likely opt for a slightly more advanced model such as GPT-4.1, which should enhance the quality of the responses. But we are a few days away from that yet.

Me speaking English while the agent responds in Hebrew - shalom shalom!

Text-to-Speech: MiniMax with Voice Clone

This was the deciding factor in why we went with LiveKit. MiniMax's Speech 2.6 Turbo model supports voice cloning and claims low latency. Voice cloning via the API is very simple, with the instructions all documented clearly on MiniMax's website. My colleague in Israel sent me a voice recording for me to 'clone'. (It actually failed first time - MiniMax needs 10 seconds of audio but the recording was only 9.93 seconds, so with a little help from Claude I extended it by half a second and got the job done.) Armed with my MiniMax API key and voice_id, I was ready to wire up the TTS.

```python
tts=minimax.TTS(model="speech-2.6-turbo", voice=MINIMAX_VOICE_ID)
```

The Date Injection Issue

If there is one thing a Voice Agent always needs to know, it's today's date and time. From there, it has a point of reference. If you say 'have you got space available tomorrow?', it will look for the date after today. If you say 'I want to book an appointment for Monday', it will know to look for the nearest Monday in the future relative to today. We take it for granted that an AI agent should know the right date - after all, it has expertise in just about everything from archaeology to quantum physics. But don't rely on it to tell you today's date. You need to make the agent time- and date-aware.
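Once the agent knows today's date, resolving 'the nearest Monday in the future' is simple arithmetic. A quick illustrative sketch (my own, not code from the agent):

```python
from datetime import date, timedelta


def next_weekday(today: date, weekday: int) -> date:
    """Return the next future occurrence of `weekday` (Monday=0 ... Sunday=6).

    If today already is that weekday, jump a full week ahead, matching the
    "nearest Monday in the future" behaviour described above.
    """
    days_ahead = (weekday - today.weekday() - 1) % 7 + 1
    return today + timedelta(days=days_ahead)


next_weekday(date(2026, 1, 26), 0)  # Monday Jan 26 -> the following Monday, February 2
```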

Today, I forgot.

That moment when you realize you forgot to inject the date

So, even though the agent was up and running - it could greet callers, understand their requests - every time it went to check availability, there were no spaces available.

Large language models have a training cutoff date. When GPT-4.1-mini processes a request, it has no idea whether "today" is January 2026 or October 2024. If a caller says "I'd like to book for tomorrow", the model might interpret that relative to its training data - which could be years in the past.

The solution is to inject the current date and time into the system prompt. Not as a static string you write once, but dynamically at the start of each call.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def get_current_datetime_jerusalem() -> str:
    """Get current datetime in Jerusalem time zone."""
    tz = ZoneInfo("Asia/Jerusalem")
    now = datetime.now(tz)
    return now.strftime("%A, %B %d, %Y at %H:%M")
```

Then in the agent initialisation:

```python
# Load prompt template
with open(prompt_path) as f:
    prompt_template = f.read()

# Inject current datetime
current_time = get_current_datetime_jerusalem()
full_prompt = prompt_template.replace(
    "{{current_time_Asia/Jerusalem}}", current_time
)
```

The prompt template includes a placeholder:

```markdown
## Context
- Timezone: {{current_time_Asia/Jerusalem}}
```

When the agent starts, this becomes:

```markdown
## Context
- Timezone: Monday, January 26, 2026 at 19:45
```

Now the LLM knows what day it is. "Tomorrow" means tomorrow. "Next Tuesday" means next Tuesday. Bookings work.

Date injection flow - from template placeholder to final prompt with real datetime

This problem isn't specific to LiveKit - it affects every voice agent that uses an LLM for conversation. If you're building booking agents on any platform, check how date context is being provided. Without it, you'll spend hours debugging API errors that seem to make no sense.

The Cloudflare Worker Bridge

Now that we have a high-level overview of how the LiveKit agent works, it needs to do something functional. Grace has a job to do. She's not built to pass the time of day.

Like our booking agents built on Ultravox and Retell, Grace is built to turn callers into customers. All appointments are booked through Cal.com and from there, if the client wishes, the booking is integrated into their platform of choice.

I used to use n8n to book on Cal but recently I've switched to Cloudflare Workers - one block of code in the cloud that creates, reschedules and cancels bookings efficiently and without the need of an additional platform like n8n.

Why? A few reasons, but mainly because if we have a choice between deterministic code and an AI agent powered by an LLM, we should choose code every time; it's faster, more reliable and far cheaper than burning tokens.

Why deterministic code beats LLM calls for business logic

Service Catalogue Pattern

The spa offers three services - facials, massages, and spa sessions. Each has a different duration and maps to a different Cal.com event type. Rather than hardcoding this in the agent, I created a service catalogue in the worker:

```javascript
const SERVICE_CATALOG = {
  'facial':      { eventTypeId: 4525126, duration: 60, displayName: 'Facial Treatment' },
  'massage':     { eventTypeId: 4525127, duration: 30, displayName: 'Massage' },
  'spa_session': { eventTypeId: 4525128, duration: 90, displayName: 'Spa Session' }
};
```

The agent just says "I need a massage slot for January 27th" and the worker handles the translation to Cal.com's event type IDs.

Deterministic Business Hours Validation

The worker validates requests against business hours before hitting the Cal.com API. If someone asks for an appointment at 3am, the worker rejects it immediately with a helpful message rather than waiting for Cal.com to return an error.
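A sketch of that kind of deterministic gate - in Python for readability, although the actual worker is JavaScript, and the opening hours here are invented purely for illustration:

```python
from datetime import datetime

# Invented opening hours for illustration: 09:00-19:00, closed Saturdays.
OPEN_HOUR, CLOSE_HOUR = 9, 19
CLOSED_WEEKDAYS = {5}  # Saturday (Monday=0)


def validate_business_hours(requested: datetime, duration_minutes: int) -> tuple[bool, str]:
    """Reject out-of-hours requests before spending a round trip on the Cal.com API."""
    if requested.weekday() in CLOSED_WEEKDAYS:
        return False, "We're closed that day - could we try another day?"
    end_hour = requested.hour + (requested.minute + duration_minutes) / 60
    if requested.hour < OPEN_HOUR or end_hour > CLOSE_HOUR:
        return False, (
            f"We're open from {OPEN_HOUR}:00 to {CLOSE_HOUR}:00 - "
            "would a time in that window work?"
        )
    return True, "ok"


validate_business_hours(datetime(2026, 1, 27, 3, 0), 30)  # 3am -> rejected with a helpful message
```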

API Abstraction

If I later switch from Cal.com to another booking system, I only need to update the worker. The agent's tool definitions stay the same.

The worker exposes five main actions: check_availability, create_booking, cancel_booking, reschedule_booking, and get_all_bookings. Each takes simple parameters and returns structured responses that the agent can read aloud naturally.
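To make that concrete, here's a rough sketch of the dispatch pattern - again in Python for readability (the actual worker is JavaScript), with stub handlers and response fields that are my own assumptions:

```python
def handle_request(action: str, params: dict) -> dict:
    """Route an agent tool call to the matching handler; unknown actions fail loudly.

    The handlers here are stubs - the real worker would call the Cal.com API.
    """
    handlers = {
        "check_availability": lambda p: {"ok": True, "slots": ["10:00", "11:30", "14:00"]},
        "create_booking":     lambda p: {"ok": True, "booking_id": "demo-123"},
        "cancel_booking":     lambda p: {"ok": True},
        "reschedule_booking": lambda p: {"ok": True},
        "get_all_bookings":   lambda p: {"ok": True, "bookings": []},
    }
    handler = handlers.get(action)
    if handler is None:
        return {"ok": False, "error": f"Unknown action: {action}"}
    return handler(params)


handle_request("check_availability", {"service": "massage", "date": "2026-01-27"})
# {'ok': True, 'slots': ['10:00', '11:30', '14:00']}
```

The win is that the agent only ever deals in simple names and dates; everything fiddly stays in deterministic code.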

Writing the Voice Prompt

Voice prompts are different from chatbot prompts. In a text conversation, the user can scroll back, re-read, take their time. In a voice conversation, everything happens in real time. The agent needs to be concise, ask one question at a time, and handle interruptions gracefully. I rarely write a prompt from scratch anymore - a meta-prompt is the way to go. In fact, I have a Claude Skill that utilises a meta-prompt stored in my global project directory.

The Voice Agent Metaprompt - Role, Personality, Context, Instructions, Stages, Critical Rules

Whether my agents are built on LiveKit or Retell, they all follow a similar structure and are written in clear Markdown. My preference is to keep prompts as lean as possible, following the principles of 'lean orchestration': clear and concise, with tools cut down to the absolute necessities and repetition - such as unnecessary reminders - removed.

Some practical tips that apply to any voice agent, LiveKit or otherwise:

Limit options. When offering appointment times, three choices maximum. Any more can overwhelm callers.

Use verbal bridges. Before checking availability (which takes a second or two), the agent says something like "Let me check that for you..." to reduce those awkward silences while the API call happens.

Spell things out. When confirming a phone number, for instance, the agent reads it digit by digit with pauses: "That's zero - five - zero - one - two - three - four - five - six - seven. Is that correct?"
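A little helper along these lines could produce that readback (a sketch of mine, not the agent's actual code; the dashes give the TTS natural pauses):

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_out_number(number: str) -> str:
    """Render a phone number digit by digit, with dashes the TTS reads as pauses."""
    words = [DIGIT_WORDS[ch] for ch in number if ch.isdigit()]
    return "That's " + " - ".join(words) + ". Is that correct?"


spell_out_number("050-123-4567")
# "That's zero - five - zero - one - two - three - four - five - six - seven. Is that correct?"
```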

Anticipate questions about whether Grace is AI. Embrace it and be upfront. At the start of the conversation, Grace says: “Hello, welcome to Bella Spa, I’m Grace, Bella’s AI Digital Receptionist, how can I help you today?”

Deployment Lessons

Getting everything running locally was straightforward. But deploying was a little finicky. First, the agent wouldn't deploy - apparently it needs a README.md file. Then there were the files I'd forgotten to add to the .gitignore. Then I got error messages saying I already had an agent deployed - despite the deployment failure, the agent had still registered as deployed! The great thing about advanced models such as Claude Opus 4.5 is that they eat these errors for breakfast - copy, paste, solved.

Deployment - launching the agent to the cloud

LiveKit gives us cool little CLI commands to perform these actions: want to see all your agents? Type lk agent list. Want to delete an agent? That's lk agent delete. And to create a new agent, lk agent create. Let's say we want to update our prompt - make the changes locally in the agent/prompt file, then deploy the agent again:

```bash
cd ~/dqx-livekit-bella-spa/agent
lk agent deploy  # Rebuilds Docker image with new prompt
```

LiveKit Cloud Free Tier: One Agent Only

I'm not a big fan of LiveKit's pricing model - I wish they had pay-as-you-go. Yes, there's a free tier with 1,000 minutes, and yes, the cost per call minute is considerably lower than Retell's, but the free tier is limited to one agent and you can soon run out of credits once you start testing.

If you want to deploy a new agent, you have to delete the existing one. The next pricing tier up is a whopping $50. Unlike the free tier, if you run out of credits you can continue to use the agent - you're just charged $0.05 per minute of usage. I wish that were available on the free tier as well.

I guess I will just have to wait another few days until next month to continue working on the prototype.

Counting down the days until credits reset

Ultravox and Retell have the advantage here with their pay-as-you-go pricing - at least at the small scale of development. With LiveKit, on top of the paid tier, you'll still need to fork out for the API calls to the likes of MiniMax and OpenAI.

Vercel Environment Variables and Newlines

When setting environment variables via the Vercel CLI, if you use echo to pipe the value, it adds a trailing newline. That newline breaks JWT token validation.

```bash
# WRONG - adds trailing newline
echo "APIKEY123" | vercel env add LIVEKIT_API_KEY production

# CORRECT - no trailing newline
printf '%s' 'APIKEY123' | vercel env add LIVEKIT_API_KEY production
```

Matching Credentials Across Deployments

The frontend needs the LiveKit URL, API key, and API secret that match the agent's project. These come from ~/.livekit/cli-config.yaml after you run lk cloud auth. If the frontend connects to the wrong LiveKit project, the agent won't respond - it's sitting in a different room entirely.

What's Next

The booking agent works. Callers can schedule appointments, reschedule, and cancel - all through natural voice conversation. But there's more to build.

Conversation Flow Improvements

After fixing the date issue, I was able to successfully book an appointment at Bella Spa. However, the conversation flow was off. Sometimes the agent misses points she should cover, or says things she's not supposed to; sometimes the conversation just feels a little off. This is to be expected. All agents need a bit of nurturing first time around - very much like new employees.

Improvement consists of reviewing the transcript, more phone calls, unit testing, batch testing and iterating on the prompt.

We need to have some tangible measurements in place that quantify the agent's performance. But we also need to ‘listen’ with our human ears. We want the conversation with the agent to ‘feel’ right. Does it match our tone, does it have any quirks, is the conversation engaging?

end_call

I need to add an end_call function, otherwise the agent will never hang up - and an agent that never hangs up can be costly.

Post-Call Analytics

LiveKit provides metrics events during the call - latency measurements, transcription timing, token usage. Although much of this is visible in the dashboard, we need to store it for analytics - I want to capture these events in a database. Understanding where delays happen helps optimise the experience. Going forward, and depending on the client, we may also need to store transcripts, call summaries, token costs and so on.

Transcript Storage - OFF

Not every conversation should be logged, and by default observability - including transcripts - is turned off; it needs to be enabled explicitly. We want it on not just for debugging, but for compliance and quality review. LiveKit's session history contains the full transcript - I just need to persist it somewhere.

Post-call analytics and data storage concepts
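The persistence itself can start as simply as writing each session's turns to disk. A sketch under assumptions - the turn structure and file layout here are invented, and LiveKit's actual session history object will look different:

```python
import json
from pathlib import Path


def persist_transcript(call_id: str, turns: list[dict], out_dir: str = "transcripts") -> Path:
    """Write one call's transcript to disk as JSON, keyed by call id."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{call_id}.json"
    path.write_text(
        json.dumps({"call_id": call_id, "turns": turns}, ensure_ascii=False, indent=2)
    )
    return path


# Hypothetical turn shape, purely for illustration
persist_transcript("demo-001", [
    {"role": "agent", "text": "Hello, welcome to Bella Spa."},
    {"role": "user", "text": "I'd like to book a massage."},
])
```

Swapping the JSON file for a database insert (Supabase, say) is a one-function change once the shape is settled.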

Client Dashboard

Eventually, the spa owner should be able to see their call metrics, read transcripts, and adjust the agent's behaviour. That means building a frontend dashboard backed by the stored data. I have a few options here, having built a dashboard for a similar use case recently - I'm tempted to reuse that. But if you've read my previous blogs, you'll know I'm a big fan of Supabase.

Configurable Prompts

I'm working with a client partner in another country who needs access to the prompt in order to make changes on the go. Right now, the voice prompt is in my project directory. To make changes, I need to edit it on my local machine, then redeploy the agent. A better approach would be storing the prompt as an environment variable or fetching it from a database at startup. That way my colleague can tweak the wording from anywhere in the world without touching the code base.
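A sketch of the environment-variable-with-file-fallback idea (the variable name AGENT_SYSTEM_PROMPT and the fallback path are my assumptions, not anything LiveKit defines):

```python
import os
from pathlib import Path


def load_prompt(fallback_path: str = "agent/prompt.md") -> str:
    """Prefer a remotely-managed prompt (env var), fall back to the file in the repo."""
    prompt = os.environ.get("AGENT_SYSTEM_PROMPT")
    if prompt:
        return prompt
    return Path(fallback_path).read_text()


# With the env var set (e.g. via the hosting platform's dashboard),
# changing the wording needs no redeploy:
os.environ["AGENT_SYSTEM_PROMPT"] = "You are Grace, Bella Spa's AI receptionist."
print(load_prompt())  # prints the env-var version
```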

Telephony

You might be wondering - what about phone calls? Right now, this is a web application. Users visit a URL, click a button, and talk to Grace through their browser. There's no phone number to call.

That's intentional. This project is a prototype - a demo for the client to experience the voice agent before committing. If they're happy and want to go ahead, we can connect LiveKit to telephony platforms like Twilio, Telnyx, Zadarma and others. LiveKit supports all the usual providers. But for now, a web-based demo keeps things simple and keeps costs down while we're still iterating.

Conclusion

Building a LiveKit Voice agent with code seems daunting. However, once you have a high level overview of the architecture and grasp the basic concepts, it's not so bad.

The two-part deployment model, the importance of date injection, understanding concepts in their documentation such as 'rooms', 'WebRTC', 'sessions', 'Egress' and 'Ingress' - these all start to make sense once you've built something real.

If Retell or another integrated platform does what you need, use it. They are fantastic, user friendly platforms for managing Voice Agents. User friendly doesn't mean they are short on advanced features either - the developer experience is excellent and you'll ship faster. These platforms are accessible, reliable and perform well.

But when you need something they don't offer - a specific language, a custom voice, unusual integrations, advanced concurrency, granular levels of control, per-minute cost considerations, advanced security features such as role-based access, hosting options outside the US - a framework like LiveKit opens doors that would otherwise stay closed.

Key Takeaways

  • LiveKit uses a two-part architecture - the agent runs on LiveKit Cloud while the frontend deploys separately to Vercel
  • Voice cloning services like MiniMax let you create custom voices for languages that aren't natively supported
  • The LLM doesn't know today's date - you must inject it into the prompt or all your bookings will fail
  • Cloudflare Workers make excellent bridges between voice agents and booking APIs like Cal.com
