Real-Time Call Translation With Twilio and OpenAI (Both Sides of the Line)

April 01, 2026 • 7 min read

About the author

Gonzalo Gomez

AI & Automation Specialist

I design AI-powered communication systems. My work focuses on voice agents, WhatsApp chatbots, AI assistants, and workflow automation built primarily on Twilio, n8n, and modern LLMs like OpenAI and Claude. Over the past 7 years, I've shipped 30+ automation projects handling 250k+ monthly interactions.

Introduction

Language barriers in call centers are a solved problem. Most people just don't know it yet, or they think it requires expensive middleware and a six-month integration project. It doesn't. Here's how I built a full two-way real-time translation system using Twilio Flex, Studio, TaskRouter, and the OpenAI Realtime API, with ngrok bridging local dev to the public web.

 

When a customer calls in speaking Spanish, the agent hears English. When the agent responds in English, the caller hears Spanish. Both sides are live, streaming, and neither person has to do anything differently. No hold music, no "please wait while we connect you to a translator."

Let me walk through how this works.

 

The Architecture

The core of this system is a single Node.js application sitting in the middle of everything. It has three jobs:

 

  1. Connect the caller to the agent through Twilio
  2. Stream both sides of the audio to OpenAI for translation
  3. Push the translated audio back to the correct leg of the call

 

To pull this off, you need two phone numbers.

 

The first number is the customer-facing number. When someone calls it, Twilio Studio runs a flow that captures the call and connects it to your application via a WebSocket. The second number is the agent-side number, handling the other leg of the call, also connected to your application. Both numbers are streaming audio in and out of your server in parallel.

 

The reason for two numbers is that Twilio's media streams work per call leg. You need separate WebSocket connections to capture what the caller says versus what the agent says. Your app sits between them, translating in both directions.
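
To make the "app sits between them" idea concrete, here is a minimal sketch of a per-call bridge. The `CallBridge` class, the leg names, and the `send` callback are my own illustration, not code from the repository. The outbound message shape (`event: "media"` with a base64 `payload`) is the one Twilio's bidirectional Media Streams expect.

```javascript
// Hypothetical bridge for one translated call: one entry per call leg.
// Each leg registers the streamSid Twilio assigned to it and a send function
// (for example, ws.send bound to that leg's WebSocket).
class CallBridge {
  constructor() {
    this.legs = new Map(); // "caller" | "agent" -> { streamSid, send }
  }

  attach(leg, streamSid, send) {
    if (leg !== "caller" && leg !== "agent") {
      throw new Error(`Unknown leg: ${leg}`);
    }
    this.legs.set(leg, { streamSid, send });
  }

  detach(leg) {
    this.legs.delete(leg);
  }

  // Audio translated from one leg is always delivered to the other leg.
  deliverTranslation(fromLeg, base64Audio) {
    const targetLeg = fromLeg === "caller" ? "agent" : "caller";
    const target = this.legs.get(targetLeg);
    if (!target) return false; // the other leg is not connected yet

    target.send(
      JSON.stringify({
        event: "media",
        streamSid: target.streamSid,
        media: { payload: base64Audio },
      })
    );
    return true;
  }
}
```

Keeping the routing rule in one place ("translated audio goes to the other leg") makes it harder to accidentally echo a translation back to the person who just spoke.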

 

On top of that:

  • Twilio Flex handles the agent UI and availability management
  • Twilio Studio defines the inbound call flow and language selection
  • TaskRouter creates and assigns tasks when a call comes in
  • OpenAI Realtime API does the actual speech-to-speech translation
  • ngrok exposes your local app so Twilio webhooks can reach it

 

The Call Flow

When a customer calls the main number, Studio runs. The caller gets a menu: press 1 for English, press 2 for Spanish. That selection gets stored and used by the application as the source language for translation. Simple, but important. The app needs to know which direction to translate.
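
A rough sketch of how the keypress could turn into a translation direction for each leg. The digit map mirrors the menu above. The function names, the fixed agent language, and the prompt wording are assumptions for illustration, not what the repository necessarily does.

```javascript
// Menu digit -> caller language, mirroring the Studio menu (1 = English, 2 = Spanish).
const LANGUAGES_BY_DIGIT = new Map([
  ["1", "English"],
  ["2", "Spanish"],
]);

// Assumption: the agent's language is fixed per deployment.
function translationPlan(digit, agentLanguage = "English") {
  const callerLanguage = LANGUAGES_BY_DIGIT.get(String(digit));
  if (!callerLanguage) {
    throw new Error(`Unsupported menu selection: ${digit}`);
  }
  return {
    needsTranslation: callerLanguage !== agentLanguage,
    callerToAgent: { from: callerLanguage, to: agentLanguage },
    agentToCaller: { from: agentLanguage, to: callerLanguage },
  };
}

// Hypothetical instruction text for the translation model on one direction.
function translationPrompt({ from, to }) {
  return `You are an interpreter. Translate everything you hear from ${from} to ${to}. Speak only the translation.`;
}
```

With this shape, supporting another language is one more entry in the map plus one more branch in the Studio menu.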

 

From there, Studio redirects the call to your ngrok URL. Your app receives it, creates a Twilio conference or proxied call, and connects the second leg (agent-side number) to the available agent in Flex. TaskRouter picks up the task and matches it to an available worker.

 

The agent accepts the call in the Flex UI. Both legs are now live. Your WebSocket connections are streaming audio from each side. OpenAI handles the translation and returns audio. That audio gets played into the opposite leg of the call.
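
For reference, this is roughly what the TwiML for opening a bidirectional stream on one leg can look like. `<Connect>`, `<Stream>`, and `<Parameter>` are standard TwiML. The `/media-stream` path and the parameter names `leg` and `language` are hypothetical, so check the repository for the real ones.

```javascript
// Escape text so it is safe inside an XML attribute value.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build TwiML that connects the current call leg to the app over a WebSocket.
// The path and parameter names are assumptions for illustration.
function buildStreamTwiml(domain, leg, language) {
  const url = `wss://${domain}/media-stream`;
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXml(url)}">`,
    `      <Parameter name="leg" value="${escapeXml(leg)}" />`,
    `      <Parameter name="language" value="${escapeXml(language)}" />`,
    "    </Stream>",
    "  </Connect>",
    "</Response>",
  ].join("\n");
}
```

Custom parameters set this way arrive in the stream's `start` message under `customParameters`, which is one way for the WebSocket handler to know which leg and language it is dealing with.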

 

Setting Up Twilio Studio

The repository includes a JSON you can import directly into Studio. No building from scratch.

 

Go to Studio > Flows > Create new flow, then import from JSON. The flow has two branches based on keypress input. Branch 1 marks the call as English-sourced. Branch 2 marks it as Spanish-sourced. Both branches end in a redirect to your application, so you need to update that URL: replace the host with your ngrok host and keep the rest of the path intact.

 

Once the flow is published, assign it to phone number one (the customer-facing one) as the handler for incoming voice calls.

For phone number two (agent-side), configure its webhook URL to also point to your ngrok URL. Same host replacement, same logic. This number is what your app will call programmatically to create the second leg.
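
The "swap the host, keep the path" step is easy to get wrong by hand, so here is a small helper sketch using Node's built-in `URL` class. The function name and the example path are made up for illustration.

```javascript
// Replace only the host of a webhook URL, keeping scheme, path, and query.
function swapHost(webhookUrl, newHost) {
  if (newHost.includes("://")) {
    throw new Error("newHost must be a bare host, without a scheme");
  }
  const url = new URL(webhookUrl);
  url.host = newHost;
  return url.toString();
}

// Hypothetical example: the path "/incoming-call" is an assumption.
const updated = swapHost(
  "https://old-tunnel.ngrok.app/incoming-call?lang=es",
  "new-tunnel.ngrok.app"
);
// updated === "https://new-tunnel.ngrok.app/incoming-call?lang=es"
```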

 

TaskRouter Configuration

Flex sets up a workspace and default workflows automatically on first login. For this project the routing logic is simple: assign any task to any available worker. That means your Target Worker Expression in TaskRouter should be 1 == 1, which matches every worker.

 

Go to TaskRouter > Workspaces > Workflows, open your workflow, and confirm the filter expression. Then go to Queues and make sure your queue is targeting that same expression. If your agent account shows up under "Matching Workers," you're good.
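
If you would rather see the routing rule as data than click through the console, a TaskRouter workflow configuration is a JSON document with a `task_routing` section. The sketch below builds a catch-all version of it. The helper name, the friendly name, and the placeholder queue SID are assumptions, and the "1 == 1" worker match itself still lives on the queue, so treat this as an illustration of the shape rather than a drop-in config.

```javascript
// Build a catch-all workflow configuration: every task goes to one queue.
// queueSid is the SID of your task queue (it starts with "WQ").
function buildCatchAllWorkflowConfig(queueSid) {
  return JSON.stringify(
    {
      task_routing: {
        filters: [
          {
            filter_friendly_name: "Any task, any available worker",
            expression: "1==1", // matches every task
            targets: [{ queue: queueSid }],
          },
        ],
        default_filter: { queue: queueSid },
      },
    },
    null,
    2
  );
}
```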

 

The SID you need from here starts with WW. Copy it into your .env as FLEX_WORKFLOW_SID.

 

Environment Variables

Everything that connects the pieces lives in your .env:

 

TWILIO_ACCOUNT_SID=
TWILIO_AUTH_TOKEN=
CALLER_NUMBER=          # Customer-facing number
FLEX_NUMBER=            # Agent-side number
FLEX_WORKFLOW_SID=      # WW... SID from TaskRouter
OPENAI_API_KEY=
PORT=5050
DOMAIN=                 # ngrok domain WITHOUT https://
FORWARD_AUDIO_BEFORE_TRANSLATION=false

 

A few things worth flagging:

Phone numbers need the full international format with the + prefix, no spaces or parentheses. +12025551234, not (202) 555-1234.
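
A quick startup check can catch a badly formatted number before Twilio rejects it. This is a minimal sketch: the helper names are mine, and the pattern only checks the E.164 shape (a plus sign, then up to 15 digits with no leading zero), not whether the number actually exists.

```javascript
// E.164 shape: "+" followed by 2 to 15 digits, first digit not 0.
const E164_PATTERN = /^\+[1-9]\d{1,14}$/;

function isE164(value) {
  return E164_PATTERN.test(value);
}

// Returns a list of human-readable problems; an empty list means both look fine.
function checkPhoneNumbers(env) {
  const problems = [];
  for (const name of ["CALLER_NUMBER", "FLEX_NUMBER"]) {
    const value = env[name] ?? "";
    if (!isE164(value)) {
      problems.push(`${name} must look like +12025551234, got "${value}"`);
    }
  }
  return problems;
}
```

Calling `checkPhoneNumbers(process.env)` once at boot and refusing to start on a non-empty result is usually enough.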

 

The DOMAIN variable does not include https://. This is because the same domain string gets used for both HTTPS webhooks and WebSocket connections (wss://). If you prefix it, the WebSocket URL breaks.
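
Here is a sketch of why the bare domain matters: both URLs are derived from the same string, with only the scheme changing. The helper name and the `/media-stream` path in the usage note are assumptions.

```javascript
// Derive the HTTPS webhook URL and the WSS stream URL from one bare domain.
function buildUrls(domain, path = "/") {
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(domain)) {
    throw new Error("DOMAIN must not include a scheme such as https://");
  }
  const host = domain.replace(/\/+$/, ""); // tolerate a trailing slash
  return {
    webhook: `https://${host}${path}`,
    stream: `wss://${host}${path}`,
  };
}
```

For example, `buildUrls("example.ngrok.app", "/media-stream").stream` gives `wss://example.ngrok.app/media-stream`, while passing `https://example.ngrok.app` throws instead of quietly producing `wss://https://...`.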

 

The OpenAI API key needs access to Whisper. If you're scoping your key's permissions and you block audio models, the transcription step fails silently and you'll spend an hour debugging the wrong thing. Leave Whisper access enabled.

 

FORWARD_AUDIO_BEFORE_TRANSLATION defaults to false. Setting it to true plays the original audio on both sides simultaneously with the translation. Useful for some use cases, annoying in development when you're hearing yourself twice on a slight delay.
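
One gotcha when reading a flag like this in Node: environment variables are always strings, and the string "false" is truthy in JavaScript. A minimal parsing sketch (the helper name is mine, and I am not claiming the repository parses it this way):

```javascript
// Parse a "true"/"false" environment value into a real boolean.
function parseBooleanFlag(value, defaultValue = false) {
  if (value === undefined || value === "") return defaultValue;
  const normalized = String(value).trim().toLowerCase();
  if (normalized === "true") return true;
  if (normalized === "false") return false;
  throw new Error(`Expected "true" or "false", got "${value}"`);
}
```

Usage would be something like `parseBooleanFlag(process.env.FORWARD_AUDIO_BEFORE_TRANSLATION)`, which returns `false` when the variable is unset.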

 

Twilio Flex Plan

Flex has three pricing options:

  • Free tier: 5,000 active user hours (a one-time allowance rather than a monthly quota), limited analytics
  • Per-seat: $1/hour/user
  • Named user: $150/month flat, unlimited hours

 

For development and low-volume production, the free tier is more than enough. The analytics limitations don't matter for what we're building here. You get the full agent UI, real-time presence, and all the TaskRouter features.

 

How the Translation Actually Works

When your app receives audio from a call leg, it streams it to OpenAI's Realtime API. The API returns translated audio as a stream. Your app injects that audio back into the opposite call leg using Twilio's <Play> or media stream injection.
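
The relay itself mostly comes down to re-wrapping messages. The sketch below assumes the Realtime session is configured for G.711 μ-law audio, which is what Twilio Media Streams send, so the base64 payload can pass through without transcoding. The function names are mine, and the Realtime event names are the ones I know of (the audio delta event has been named differently across API versions), so verify them against the version you target.

```javascript
// Twilio "media" message -> OpenAI Realtime "input_audio_buffer.append" event.
// Returns null for Twilio messages that carry no audio (start, stop, mark...).
function twilioMediaToRealtime(twilioMessage) {
  if (twilioMessage.event !== "media") return null;
  return {
    type: "input_audio_buffer.append",
    audio: twilioMessage.media.payload, // base64 audio, passed through as-is
  };
}

// Assumption: audio chunks arrive under one of these event types.
const AUDIO_DELTA_EVENTS = new Set([
  "response.audio.delta",
  "response.output_audio.delta",
]);

// Realtime audio delta -> Twilio "media" message for the OPPOSITE call leg.
function realtimeDeltaToTwilio(realtimeEvent, targetStreamSid) {
  if (!AUDIO_DELTA_EVENTS.has(realtimeEvent.type) || !realtimeEvent.delta) {
    return null;
  }
  return {
    event: "media",
    streamSid: targetStreamSid,
    media: { payload: realtimeEvent.delta },
  };
}
```

Each function returns a plain object, so the WebSocket handlers only need to `JSON.stringify` the result and send it when it is not `null`.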

 

The key detail: this is not "record the whole sentence, then translate." It's streaming translation. Latency is noticeable but not conversation-breaking. In a live test ("Hi, I'm Gonzalo, I'm calling because my account is locked"), the translated audio came through fast enough that the conversation felt natural. Not perfect, but workable. The kind of latency you'd attribute to a bad connection, not to a translation layer.

 

What This Is Good For

The obvious use case is call centers with global customer bases. Agents who speak one language can serve customers who speak others. No bilingual staffing requirement.

 

But the use case I find more interesting is reselling this as a service. If you work with SMBs that have started expanding internationally, this is a turnkey solution you can deploy on their existing Twilio account. The infrastructure is already there. You're adding a middleware layer and a second phone number.

The system also extends naturally. The language selection menu is just a Studio widget. Adding a third language is a third branch. Spanish, English, Portuguese. Same architecture, more branches.

 

What To Watch Out For

A few things I hit during implementation that are worth knowing upfront:

ngrok URL changes on restart. Every time you restart ngrok (unless you're on a paid plan with a fixed domain), you get a new URL. That means updating the webhook on both phone numbers and the Studio flow redirect. Annoying. Pay for the fixed domain or use a different tunneling solution in production.

Twilio Flex first-login setup is automatic but slow. When you create your Flex account, Twilio provisions your workspace, channels, and default TaskRouter config in the background. Give it a few minutes before trying to configure anything. If things look missing, wait and refresh.

WebSocket connections drop. In production, you'll need reconnection logic. During development it's fine, but under real call volume a dropped WebSocket means one side of the call is getting untranslated audio with no warning.
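
As a starting point for that reconnection logic, here is a generic retry-with-exponential-backoff sketch. It applies to the connection your app opens itself (the OpenAI socket, in this architecture). The function names, defaults, and the injectable `sleep` are my own choices, and `connect` stands in for whatever function opens your socket and resolves once it is ready.

```javascript
// Delay grows 250ms, 500ms, 1s, 2s... and is capped at maxMs.
function backoffDelay(attempt, baseMs = 250, maxMs = 5000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call connect() until it succeeds or maxAttempts is reached.
// connect is a hypothetical async function that resolves with an open socket.
async function connectWithRetry(
  connect,
  { maxAttempts = 5, baseMs = 250, maxMs = 5000, sleep = defaultSleep } = {}
) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelay(attempt, baseMs, maxMs));
      }
    }
  }
  throw lastError;
}
```

On a live call you would probably keep the attempt count low and the delays short, since a few seconds of silence already feels broken to both parties.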

 

You can find the repository here. It includes the Studio flow JSON, the .env template, and the application code. If you've already been following the outbound call series, the patterns here will look familiar. Same webhook structure, same ngrok setup, different payload on the WebSocket.

 

-Gonza
