February 24, 2026 • 5 min read
AI & Automation Specialist
I design AI-powered communication systems. My work focuses on voice agents, WhatsApp chatbots, AI assistants, and workflow automation built primarily on Twilio, n8n, and modern LLMs such as OpenAI's GPT models and Anthropic's Claude. Over the past 7 years, I've shipped 30+ automation projects handling 250k+ monthly interactions.
Most AI assistant tutorials focus on prompts or models.
In production, that is rarely the hard part.
The real challenge is building a system that:
- handles voice notes and text through a single pipeline
- checks real calendar availability instead of inventing it
- keeps conversation state across messages
- returns output predictable enough to act on
In this article, I’ll break down the full architecture behind a WhatsApp AI agent that automatically schedules appointments.
Instead of a step-by-step tutorial, this is a practical engineering walkthrough of the architecture and the design decisions behind it.
Before diving into details, here is the overall architecture:
WhatsApp (Twilio) → Webhook → Input normalization → Audio/Text branching → Transcription (if needed) → AI Agent (tools + memory) → Output parser → WhatsApp response

Key principle:
The LLM is only one component inside a larger system.
A conversational agent that schedules appointments must respect real constraints:
- availability lives in a real calendar, not in the model
- users send voice notes as often as text
- conversations span multiple messages and need memory
- the output has to be machine-parseable to trigger actions
Without solving these first, the model becomes unreliable regardless of how good it is.
All incoming messages arrive via webhook.
Example node configuration:
{
  "type": "n8n-nodes-base.webhook",
  "parameters": {
    "httpMethod": "POST",
    "path": "your-webhook-id"
  },
  "name": "Webhook"
}
Design decision:
Never let downstream nodes depend on raw webhook structure. Normalize early.
We convert the incoming payload into a stable internal format:
{
  "type": "n8n-nodes-base.set",
  "name": "Edit Fields",
  "parameters": {
    "assignments": {
      "assignments": [
        { "name": "type", "value": "={{ $json.body.MessageType }}" },
        { "name": "body", "value": "={{ $json.body.Body }}" },
        { "name": "from", "value": "={{ $json.body.From }}" },
        { "name": "recording", "value": "={{ $json.body.MediaUrl0 }}" }
      ]
    }
  }
}
This matters because the rest of the system becomes independent from provider-specific formats.
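The same idea can be sketched outside n8n as a single mapping function. The function name is an assumption for illustration; the field names mirror the Set node above.

```javascript
// Hypothetical standalone version of the "Edit Fields" step:
// map a Twilio webhook payload onto a stable internal shape so
// downstream logic never touches provider-specific field names.
function normalizeInbound(twilioBody) {
  return {
    type: twilioBody.MessageType,       // "text" | "audio" | ...
    body: twilioBody.Body ?? '',
    from: twilioBody.From,              // e.g. "whatsapp:+34600111222"
    recording: twilioBody.MediaUrl0 ?? null,
  };
}

const msg = normalizeInbound({
  MessageType: 'text',
  Body: 'Hola, quiero una cita',
  From: 'whatsapp:+34600111222',
});
```

Swapping messaging providers later means rewriting only this one function.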
Audio messages require extra steps:
- download the media file from Twilio
- transcribe it with a speech-to-text model
- merge the transcript back into the same path as text messages
Example switch logic:
{
  "type": "n8n-nodes-base.switch",
  "name": "Switch",
  "parameters": {
    "rules": {
      "values": [
        { "outputKey": "Audio", "conditions": [{ "leftValue": "={{ $json.type }}", "rightValue": "audio" }] },
        { "outputKey": "Text", "conditions": [{ "leftValue": "={{ $json.type }}", "rightValue": "text" }] }
      ]
    }
  }
}
Important principle:
The AI agent should never know whether input was audio or text.
Voice notes are fetched from Twilio servers and transcribed using OpenAI.
{
  "type": "n8n-nodes-base.httpRequest",
  "name": "HTTP Request",
  "parameters": {
    "url": "={{ $json.recording }}",
    "authentication": "predefinedCredentialType",
    "nodeCredentialType": "twilioApi"
  }
}

{
  "type": "@n8n/n8n-nodes-langchain.openAi",
  "name": "Transcribe a recording",
  "parameters": {
    "resource": "audio",
    "operation": "transcribe",
    "options": { "language": "es" }
  }
}
Key insight:
Normalize inputs BEFORE reaching the agent. Otherwise prompts become unnecessarily complex.
Both branches merge into a single standardized structure:
const base = { ...$('Edit Fields').first().json };

const normalizedBody = (base.type === 'audio')
  ? ($input.first().json.text ?? '').trim()
  : (base.body ?? '');

return [{
  json: {
    ...base,
    body: normalizedBody,
    from: base.from.split('whatsapp:')[1]
  }
}];
This dramatically reduces downstream complexity.
Before generating the response, we send a “typing” indicator via Twilio through an HTTP Request node.
{
  "type": "n8n-nodes-base.httpRequest",
  "name": "Señal de escribiendo respuesta",
  "parameters": {
    "method": "POST",
    "url": "https://messaging.twilio.com/v2/Indicators/Typing.json",
    "bodyParameters": {
      "parameters": [
        { "name": "messageId", "value": "={{ $('Webhook').item.json.body.MessageSid }}" },
        { "name": "channel", "value": "whatsapp" }
      ]
    }
  }
}
Small change, huge perceived intelligence. Users feel like they are interacting with a real system rather than automation.
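For readers building this outside n8n, the same request can be assembled in plain code before handing it to any HTTP client. The helper name is an assumption; the endpoint and body fields mirror the node config above.

```javascript
// Hypothetical helper mirroring the HTTP Request node above:
// build the typing-indicator request for a given inbound message.
function buildTypingRequest(messageSid) {
  return {
    method: 'POST',
    url: 'https://messaging.twilio.com/v2/Indicators/Typing.json',
    form: {
      messageId: messageSid, // the inbound MessageSid from the webhook
      channel: 'whatsapp',
    },
  };
}

const req = buildTypingRequest('SM0123456789abcdef');
```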
The agent is composed of a model, conversation memory, and calendar tools.
Model: GPT-4.1-mini with a structured output schema. A mini model is the right trade-off here: it keeps costs low at scale, and the agent rarely needs extended reasoning over information it already has.
Memory: session keyed by phone number, with the window limited to recent interactions to avoid uncontrolled growth.
Key design rule:
The model does not invent data. It queries tools.
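A minimal sketch of what such a tool definition might look like. The tool name, description, and parameters are assumptions for illustration, not the exact definitions from this workflow.

```javascript
// Hypothetical calendar tool exposed to the agent: instead of
// guessing availability, the model calls this with a date and
// receives real occupied time ranges back.
const getBusySlotsTool = {
  name: 'get_busy_slots',
  description: 'Return occupied time ranges on the calendar for a given day',
  parameters: {
    type: 'object',
    properties: {
      date: { type: 'string', description: 'ISO date, e.g. 2026-02-24' },
    },
    required: ['date'],
  },
};
```

The model decides *when* to call the tool; the data it reasons over always comes from the calendar, never from its own weights.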
{
  "model": "gpt-4.1-mini",
  "response_format": "json_schema"
}
Structured output prevents downstream failures.
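An illustrative `json_schema` response format might look like the following. The field names (`message`, `action`) are assumptions; the point is that the model's reply is validated against a fixed shape before anything downstream consumes it.

```javascript
// Hypothetical structured-output schema for the agent's reply.
const responseFormat = {
  type: 'json_schema',
  json_schema: {
    name: 'agent_reply',
    strict: true,
    schema: {
      type: 'object',
      properties: {
        message: { type: 'string' }, // text sent back to the user
        action: { type: 'string', enum: ['reply', 'book', 'ask_clarification'] },
      },
      required: ['message', 'action'],
      additionalProperties: false,
    },
  },
};
```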
One fundamental constraint:
Calendar events represent occupied time, not availability.
The system must:
- fetch the events already on the calendar (occupied time)
- invert them into free slots within working hours
- offer only slots that are genuinely open
This single rule prevents most real-world booking failures.
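The inversion can be sketched in a few lines. Working hours and slot length below are assumptions; the only input that matters is the list of occupied events.

```javascript
// Sketch of the busy→free inversion: walk the working window in
// fixed-size slots and keep only those that overlap no busy event.
function freeSlots(busy, dayStart, dayEnd, slotMinutes = 60) {
  const slotMs = slotMinutes * 60 * 1000;
  const slots = [];
  for (let t = dayStart.getTime(); t + slotMs <= dayEnd.getTime(); t += slotMs) {
    const overlaps = busy.some(
      (ev) => ev.start.getTime() < t + slotMs && ev.end.getTime() > t
    );
    if (!overlaps) slots.push(new Date(t));
  }
  return slots;
}

// Example: 9:00–13:00 working window with a 10:00–11:00 event booked.
const day = '2026-02-24';
const open = freeSlots(
  [{ start: new Date(`${day}T10:00:00Z`), end: new Date(`${day}T11:00:00Z`) }],
  new Date(`${day}T09:00:00Z`),
  new Date(`${day}T13:00:00Z`)
);
// open → 09:00, 11:00, 12:00 (10:00 is occupied)
```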
Agents don’t always return identical output structures.
Instead of trusting the response blindly, we parse defensively:
function tryParseJson(output) {
  try { return JSON.parse(output); } catch { return null; }
}
Production systems assume variability.
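In practice the parser feeds a small extraction layer with a graceful fallback. The fallback message below is an assumption; the point is that a malformed model response never crashes the flow.

```javascript
function tryParseJson(output) {
  try { return JSON.parse(output); } catch { return null; }
}

// Defensive extraction around the agent's output: if the model
// ignored the schema, degrade gracefully instead of failing.
function extractMessage(rawOutput) {
  const parsed = tryParseJson(rawOutput);
  if (parsed && typeof parsed.message === 'string') return parsed.message;
  return 'Lo siento, ¿puedes repetirlo?'; // assumed fallback reply
}
```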
Final node sends the AI-generated message back using Twilio. Simple, but only after validation and formatting layers.
{
  "type": "n8n-nodes-base.twilio",
  "name": "Send an SMS/MMS/WhatsApp message",
  "parameters": {
    "toWhatsapp": true,
    "to": "={{ $('Normalizar datos').item.json.from }}",
    "message": "={{ $json.message }}"
  }
}
AI agents are not chatbots with better prompts.
They are systems composed of:
- input normalization
- transcription
- memory
- tools that query real data
- defensive output parsing
- delivery back to the channel
If you are building something similar and want help moving from prototype to production, feel free to reach out.
Ready to automate your customer conversations?
Contact me