Executive Overview

AgentAid
System Design

A real-time call lead marketplace connecting inbound insurance calls to licensed agents via browser softphone — built for reliability, auditability, and scale.

Final Expense Insurance · Real-Time Routing · Prepaid Billing · Event-Driven · Auditable
8 Core Components
6 MVP Phases
~15 min Presentation
02 Product Context

Inbound calls.
Licensed agents.
Zero hardware.

AgentAid routes consumer-initiated inbound phone calls to licensed final expense insurance agents through a browser-based softphone.

  • Agents prepay balance, go online, receive calls
  • Automatically charged per connected call
  • Browser-based softphone — no hardware required

The system must be reliable, auditable, event-driven, and ready to evolve into a white-label marketplace platform.

Business Goals

  • Route inbound calls to the right licensed agent in real time
  • Enforce online, licensed, funded, and daily cap eligibility
  • Debit agent balance per connected call automatically
  • Real-time visibility for admins
  • Support future verticals and white-label deployments

Engineering Goals

  • High availability for call routing
  • Reliable event processing
  • Safe payment and balance handling
  • Full routing auditability
  • Minimal operational complexity for MVP
  • Scale without major rewrites

North Star

Simple enough to ship fast. Reliable enough for real money, real calls, and real-time routing.

Evolution Path

Architecture designed to evolve into a white-label marketplace platform without major rewrites.

03 Architecture Principles

Decisions driven by
constraints, not trends

Every technology choice maps to a product requirement — not resume-driven development.

Principle 01

Modular Monolith

One deployable unit with clear internal modules. Fast to build, simple to operate.

Why: Microservices add coordination overhead before we have scale problems. Extract later if needed.
Principle 02

Event-Driven Core

Webhooks ingest fast. Workers process async. Every state change is an auditable event.

Why: External systems (Twilio, Stripe, CallGrid) are unreliable timing partners. Decouple ingestion from processing.
Principle 03

Ledger, Not Balance

Every credit and debit is a transaction record. Balance is derived, never blindly mutated.

Why: Real money requires auditability. A mutable balance field alone is a liability in disputes.
Principle 04

Idempotent Everything

Duplicate webhooks must never double-charge or double-route. Store raw events first.

Why: Stripe and Twilio can deliver the same event more than once. Design for it from day one.
Principle 05

Real-Time by Default

WebSockets for agent ringing, admin monitoring, and balance updates — not polling.

Why: A missed ring is lost revenue. Operations need live visibility, not 30-second delays.
Principle 06

AWS-Native MVP

Managed services for database, queue, secrets, and deployment. Minimize ops surface area.

Why: Small team, high stakes. Managed infra lets engineers focus on routing and billing logic.
04 High-Level Architecture

Eight components.
One critical path.

flowchart TD
    A["Inbound Call / CallGrid"] --> B["Webhook API"]
    B --> C["PostgreSQL Event Store"]
    B --> D["AWS SQS Queue"]
    D --> E["Routing Worker"]
    E --> F["PostgreSQL"]
    E --> G["Twilio Voice SDK / Twilio Voice"]
    E --> H["Audit Log"]
    G --> I["Browser Softphone - React"]
    I --> J["Agent Dashboard"]
    K["Stripe Checkout"] --> L["Stripe Webhook API"]
    L --> M["PostgreSQL Ledger"]
    L --> D
    F --> N["Admin Dashboard"]
    E --> O["WebSocket Gateway"]
    O --> N
    O --> J
          
4.1 React Agent Dashboard: Real-time state, softphone, forms
4.2 Admin Dashboard: Separate surface for ops & security
4.3 Node.js Backend API: Webhooks, integrations, real-time
4.4 PostgreSQL: Strong consistency for money & routing
4.5 AWS SQS: Async events, retries, decoupling
4.6 Twilio Voice SDK: Browser softphone, no hardware
4.7 Stripe: Prepaid top-ups via verified webhooks
4.8 WebSocket Gateway: Live ringing, monitoring, balance
05 Call Routing Flow

From inbound ring
to agent connection

  1. Caller → CallGrid: calls the campaign phone number
  2. CallGrid → API: sends the inbound call webhook
  3. API → DB: stores the raw call event
  4. API → SQS: publishes CALL_RECEIVED
  5. SQS → Worker → DB: finds eligible agents, creates the routing audit log
  6. Worker → Twilio: connects the call to the selected agent
  7. Twilio → Agent: rings the browser softphone
  8. Agent → Twilio → API → SQS: agent accepts the call → connected webhook → CALL_CONNECTED

Eligibility rule: an agent is eligible only if all of the following hold.

  • Agent is approved
  • Agent is online
  • Agent is licensed in the caller's state
  • Agent has enough balance
  • Agent has not reached the daily cap
  • Agent is not currently busy
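
The eligibility rule above can be read as an ordered chain of checks that either passes an agent or records why it was skipped. The following is a minimal sketch under stated assumptions: the `AgentSnapshot` shape, its field names, and the `pricePerCallCents` parameter are hypothetical, and the reason codes reuse the routing_attempts values from the data model, except `AGENT_NOT_APPROVED`, which is an illustrative addition because the deck lists no code for that check.

```typescript
// Reason codes reuse the routing_attempts values from the data model.
// AGENT_NOT_APPROVED is a hypothetical addition for the "approved" check.
type SkipReason =
  | "AGENT_NOT_APPROVED"
  | "AGENT_OFFLINE"
  | "STATE_NOT_LICENSED"
  | "INSUFFICIENT_BALANCE"
  | "DAILY_CAP_REACHED"
  | "AGENT_BUSY";

// Hypothetical snapshot of the agent fields the rule needs.
interface AgentSnapshot {
  id: string;
  approved: boolean;
  online: boolean;
  licensedStates: string[];
  balanceCents: number;
  callsToday: number;
  dailyCap: number;
  busy: boolean;
}

// Returns null when the agent is eligible, otherwise the first failing reason.
function checkEligibility(
  agent: AgentSnapshot,
  callerState: string,
  pricePerCallCents: number,
): SkipReason | null {
  if (!agent.approved) return "AGENT_NOT_APPROVED";
  if (!agent.online) return "AGENT_OFFLINE";
  if (!agent.licensedStates.includes(callerState)) return "STATE_NOT_LICENSED";
  if (agent.balanceCents < pricePerCallCents) return "INSUFFICIENT_BALANCE";
  if (agent.callsToday >= agent.dailyCap) return "DAILY_CAP_REACHED";
  if (agent.busy) return "AGENT_BUSY";
  return null;
}
```

Returning the first failing reason, rather than a bare boolean, is what lets each skipped agent be written to the routing audit log with an explanation.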

MVP Selection Strategy

  1. Filter eligible agents
  2. Prioritize agents with higher balance
  3. Respect daily cap
  4. Avoid sending multiple calls to the same busy agent
  5. Round-robin or weighted round-robin among eligible agents
Why this approach: Simple, fair, auditable, and good enough for MVP. Advanced bidding or scoring can be added later.
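
One possible reading of the selection strategy, sketched below, treats round-robin as "least recently routed first" and uses the higher balance only as a tie-breaker. This is an interpretation rather than the deck's stated algorithm; the `Candidate` shape and the `lastRoutedAt` field are hypothetical, and the input is assumed to contain only agents that already passed the eligibility filter.

```typescript
// Hypothetical candidate shape: agents that already passed the eligibility filter.
interface Candidate {
  id: string;
  balanceCents: number;
  lastRoutedAt: number; // epoch ms of the last call routed to this agent, 0 if never
}

// Round-robin approximated as "least recently routed first";
// ties are broken in favor of the higher balance.
function selectAgent(eligible: Candidate[]): Candidate | null {
  if (eligible.length === 0) return null; // caller would record CALL_NOT_ROUTED
  const ordered = [...eligible].sort(
    (a, b) => a.lastRoutedAt - b.lastRoutedAt || b.balanceCents - a.balanceCents,
  );
  return ordered[0] ?? null;
}
```

Because the ordering is deterministic for a given input, the same candidate list always yields the same choice, which keeps routing decisions reproducible in the audit log.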
06 Event-Driven Processing

SQS decouples
ingestion from logic

Webhook handlers: receive → validate signature → store raw event → publish to queue → return 200. Business processing happens in workers.

Webhook: Receive → Validate Signature → Store Raw Event → Publish to SQS → Return 200
Worker: Process → Retry on fail → DLQ if exhausted
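
The webhook half of this pipeline can be sketched as a handler that does nothing beyond validating, persisting, and enqueueing. The version below is a minimal illustration with in-memory stand-ins: `rawEvents` and `outbound` are hypothetical substitutes for the PostgreSQL event store and the SQS queue, the signature result is passed in as a boolean, and the 400 response for a bad signature is an assumption.

```typescript
// Hypothetical stand-ins for the PostgreSQL event store and the SQS queue.
interface RawEvent { source: string; externalId: string; body: string }
interface QueuedMessage { type: string; eventKey: string }

const rawEvents = new Map<string, RawEvent>();
const outbound: QueuedMessage[] = [];

// Validate, persist, enqueue, respond. No business logic in the request path.
function handleWebhook(event: RawEvent, signatureValid: boolean, eventType: string): number {
  if (!signatureValid) return 400; // reject before touching storage
  const key = `${event.source}:${event.externalId}`;
  // Store the raw payload once; a redelivered webhook keeps the first copy.
  if (!rawEvents.has(key)) rawEvents.set(key, event);
  // Always enqueue, so a publish that failed on an earlier delivery is not lost.
  // Duplicate messages are expected and left to the idempotent workers.
  outbound.push({ type: eventType, eventKey: key });
  return 200;
}
```

Publishing on every delivery trades a few duplicate messages for the guarantee that a stored event is never silently left unprocessed.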
Event types: CALL_RECEIVED · CALL_ROUTING_REQUESTED · CALL_ROUTING_STARTED · CALL_ROUTED · CALL_NOT_ROUTED · CALL_CONNECTED · CALL_ENDED · CALL_BILLING_REQUESTED · CALL_BILLED · PAYMENT_RECEIVED · BALANCE_UPDATED · AGENT_ONLINE · AGENT_OFFLINE · AGENT_AUTO_OFFLINED

Why SQS over RabbitMQ?

Reduces coupling between webhook ingestion and business processing. Retries, durability, and failure isolation — without managing RabbitMQ. AWS-native, no exchange routing needed at MVP.

Why Event-Driven?

The system depends on external systems, asynchronous workflows, retries, and auditability. Prevents webhook timeouts and reduces the risk of losing critical events.
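SQS provides redelivery and dead-lettering itself through a redrive policy, so none of this needs to be hand-written. The sketch below only simulates those semantics in memory to show what "retry on fail, DLQ if exhausted" means for a worker; `MAX_RECEIVES` and the `QueueMessage` shape are hypothetical.

```typescript
interface QueueMessage { id: string; type: string; receiveCount: number }

const MAX_RECEIVES = 3; // assumption, analogous to the SQS maxReceiveCount setting

// Drains the queue. A failed message is redelivered until it has been
// received MAX_RECEIVES times, then parked in the dead-letter list.
function drain(
  messages: QueueMessage[],
  handler: (message: QueueMessage) => void,
): { processed: string[]; deadLettered: string[] } {
  const pending = messages.map((m) => ({ ...m }));
  const processed: string[] = [];
  const deadLettered: string[] = [];
  while (pending.length > 0) {
    const message = pending.shift()!;
    message.receiveCount += 1;
    try {
      handler(message);
      processed.push(message.id);
    } catch {
      if (message.receiveCount >= MAX_RECEIVES) deadLettered.push(message.id);
      else pending.push(message); // redelivered on a later pass
    }
  }
  return { processed, deadLettered };
}
```

A message that lands in the dead-letter list is not lost: it stays inspectable and can be replayed once the underlying fault is fixed.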

07 Billing & Stripe Webhooks

Never trust the frontend
for money

Credit Flow — Stripe Top-Up

  1. Agent → Stripe: frontend starts the Stripe Checkout flow
  2. Stripe → API: payment_intent.succeeded webhook
  3. API → DB: stores the Stripe event (event ID as idempotency key)
  4. API → SQS: publishes PAYMENT_RECEIVED
  5. Worker → DB: updates the balance ledger, marks the agent funded

Debit Flow — Per Connected Call

  1. Twilio → API: call connected/ended webhook
  2. API → DB: stores the call event
  3. API → SQS: publishes CALL_BILLING_REQUESTED
  4. Worker → DB: creates the debit transaction, updates the agent balance
  5. Worker → Agent: auto-offlines the agent if the balance is too low
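
Steps 4 and 5 of the debit flow can be sketched as a single billing function. This is an illustration under assumptions, not the actual worker: the `AgentAccount` and `DebitEntry` shapes are hypothetical, "balance is too low" is taken to mean "cannot cover one more call at the same price", and in a real system these steps would run inside one PostgreSQL transaction rather than on in-memory objects.

```typescript
// Hypothetical shapes; amounts are integer cents to avoid floating-point drift.
interface AgentAccount { id: string; balanceCents: number; online: boolean }
interface DebitEntry {
  agentId: string;
  type: "DEBIT";
  amountCents: number;
  balanceAfterCents: number;
  sourceId: string; // the call ID, used as the idempotency key
}

// Returns the events the worker would emit; an empty list means "already billed".
function billConnectedCall(
  agent: AgentAccount,
  ledger: DebitEntry[],
  callId: string,
  priceCents: number,
): string[] {
  // Idempotency: at most one debit per call.
  if (ledger.some((entry) => entry.sourceId === callId)) return [];
  agent.balanceCents -= priceCents;
  ledger.push({
    agentId: agent.id,
    type: "DEBIT",
    amountCents: priceCents,
    balanceAfterCents: agent.balanceCents,
    sourceId: callId,
  });
  const events = ["CALL_BILLED", "BALANCE_UPDATED"];
  // Assumed threshold: go offline when the balance cannot cover another call.
  if (agent.online && agent.balanceCents < priceCents) {
    agent.online = false;
    events.push("AGENT_AUTO_OFFLINED");
  }
  return events;
}
```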
Billing Principle — Ledger Model

Use a ledger model, not only a mutable balance field. Each balance change generates a transaction record — auditable and safer.

Why webhooks: The frontend must never be trusted as the source of truth for payments. Balance is updated only after Stripe confirms the transaction server-side.

Ledger types

  • CREDIT: Stripe payment
  • DEBIT: Connected call
  • REFUND: Stripe refund
  • ADMIN_ADJUSTMENT: Manual admin override
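
To make "balance is derived" concrete, the sketch below computes a balance purely from ledger entries. The sign convention is an assumption, not something the deck specifies: amounts are stored as signed integer cents, with CREDIT positive, DEBIT and REFUND negative, and ADMIN_ADJUSTMENT either. The `LedgerEntry` shape and function names are hypothetical.

```typescript
type LedgerType = "CREDIT" | "DEBIT" | "REFUND" | "ADMIN_ADJUSTMENT";

// Hypothetical entry shape; amountCents is a signed delta in integer cents.
interface LedgerEntry { agentId: string; type: LedgerType; amountCents: number }

// Assumed sign convention: CREDIT > 0, DEBIT and REFUND < 0, ADMIN_ADJUSTMENT any non-zero.
function hasValidSign(entry: LedgerEntry): boolean {
  switch (entry.type) {
    case "CREDIT": return entry.amountCents > 0;
    case "DEBIT":
    case "REFUND": return entry.amountCents < 0;
    case "ADMIN_ADJUSTMENT": return entry.amountCents !== 0;
  }
}

// The balance is never stored on its own; it is the sum of the agent's entries.
function deriveBalance(entries: LedgerEntry[], agentId: string): number {
  return entries
    .filter((entry) => entry.agentId === agentId)
    .reduce((sum, entry) => sum + entry.amountCents, 0);
}
```

A cached `balance_after` column can still be kept for fast reads, as long as it can always be re-checked against this sum.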

08 Twilio Voice SDK

Browser softphone.
No hardware required.

Agents receive calls directly in Chrome. No mobile app, no desk phone, no provisioning delay.

Backend → Access Token → React App
Twilio Device → Register → Ready for calls
Inbound Call → Ring Browser → Agent Accepts

Why Twilio Voice SDK?

Removes physical phones, mobile apps, and hardware. Agents receive calls directly in Chrome. Backend generates access tokens; React registers a Twilio Device.

Agent Dashboard

Onboarding · Licensed states · NPN verification · Prepaid balance · Online/offline toggle · Browser softphone · Balance, call history, daily cap

Admin Dashboard

Live call monitor · Agent approve/suspend · Balance overrides · Campaign & phone number config · Routing logs · Revenue reports

Why a separate admin experience: Operationally sensitive workflows. A separate surface improves security, permissions, and product clarity.

WebSocket Gateway

Agent call ringing · Agent availability · Admin live call monitor · Balance updates · Call status changes
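
One way to picture the gateway is as a channel-based fan-out: each agent connection subscribes to its own channel, admin connections subscribe to a shared one, and workers publish events to whichever channel applies. The sketch below is a transport-free illustration; the `Gateway` class, the channel naming (`agent:<id>`, `admin`), and the `Send` callback standing in for a WebSocket connection are all hypothetical.

```typescript
// A Send callback stands in for one WebSocket connection.
type Send = (message: string) => void;

class Gateway {
  private channels = new Map<string, Set<Send>>();

  // Registers a connection on a channel and returns an unsubscribe function.
  subscribe(channel: string, send: Send): () => void {
    const subscribers = this.channels.get(channel) ?? new Set<Send>();
    subscribers.add(send);
    this.channels.set(channel, subscribers);
    return () => {
      subscribers.delete(send);
    };
  }

  // Serializes the event once and delivers it to every subscriber.
  // Returns the number of connections that received it.
  publish(channel: string, event: { type: string; [key: string]: unknown }): number {
    const subscribers = this.channels.get(channel);
    if (!subscribers) return 0;
    const message = JSON.stringify(event);
    subscribers.forEach((send) => send(message));
    return subscribers.size;
  }
}
```

Returning the delivery count lets the caller notice when a ring reached no connected browser, which is the case that would otherwise turn into a silently missed call.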

09 PostgreSQL Data Model

Strong consistency
for money & routing

PostgreSQL stores agents, campaigns, call records, routing attempts, balance ledger, Stripe transactions, and audit logs.

Tables: agents · agent_licenses · agent_status · campaigns · campaign_phone_numbers · calls · call_events · routing_attempts · balance_ledger · stripe_events · admin_users · admin_audit_logs

Key tables and columns:

  • agents: id · name · email · npn_number · status · created_at · updated_at
  • calls: external_call_id · campaign_id · caller_state · status · selected_agent_id · started · connected · ended
  • routing_attempts: call_id · agent_id · decision · reason; values: AGENT_OFFLINE, INSUFFICIENT_BALANCE, STATE_NOT_LICENSED, DAILY_CAP_REACHED, AGENT_BUSY, SELECTED
  • balance_ledger: agent_id · type · amount · balance_after · source · source_id; types: CREDIT, DEBIT, REFUND, ADMIN_ADJUSTMENT
  • stripe_events: event_id (idempotency key) · raw payload · processed_at
  • admin_audit_logs: admin_user_id · action · target · metadata · created_at

Why PostgreSQL: Strong consistency for balance, routing, billing, and audit trails. Reliable transactions, relational modeling, indexes, and query performance, rather than eventual consistency.
10 Reliability, Availability & Idempotency

Near 100% availability
on the routing path

Inbound webhook → Routing decision → Twilio connection → Call status processing → Billing
01 Lightweight Webhooks: Validate, persist, enqueue, respond. No business logic in the request path, which prevents timeouts and lost events.
02 SQS Retries + DLQ: Automatic retry on worker failure. Dead-letter queue for events that exhaust retries, inspectable and replayable.
03 Idempotent Processors: Stripe event IDs and external call IDs prevent duplicate billing and routing. Safe to process the same event twice.
04 Raw Event Storage: CallGrid, Twilio, and Stripe payloads stored before processing. Enables debugging, replay, and compliance audit.
05 Database Transactions: Balance updates, ledger entries, and call status changes happen atomically. Partial failures roll back cleanly.
06 Health Checks & Monitoring: CloudWatch, Sentry, structured logs. Monitor API errors, queue depth, failed jobs, webhook failures, routing latency, billing failures.
07 Failure Scenario Handling:
  • Duplicate Stripe/Twilio webhooks → idempotency keys
  • No eligible agent → log, mark not routed, fall back to voicemail, overflow, or admin alert
  • Low balance → debit, auto-offline, notify agent
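
The idempotent-processor idea can be sketched as a small "process once" wrapper keyed by the provider's event ID or the external call ID. This is a minimal in-memory illustration: `processedKeys` is a hypothetical stand-in for a table with a unique constraint, and in a real system recording the key and applying the effect would share one database transaction.

```typescript
// Hypothetical stand-in for a processed-events table with a unique key column.
const processedKeys = new Set<string>();

// Runs the effect at most once per key. If the effect throws, the key is not
// recorded, so a queue redelivery can safely attempt the work again.
function processOnce(key: string, effect: () => void): "PROCESSED" | "DUPLICATE" {
  if (processedKeys.has(key)) return "DUPLICATE";
  effect();
  processedKeys.add(key);
  return "PROCESSED";
}
```

For example, `processOnce("stripe:" + eventId, creditLedger)` would apply a top-up once no matter how many times the same webhook is delivered; the `stripe:` key prefix and `creditLedger` are illustrative names.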
11 AWS Deployment & Security

Managed infra.
Security by design.

  • ECS Fargate: Node.js API + Workers
  • RDS PostgreSQL: Relational database
  • SQS: Queue + DLQ
  • CloudWatch: Logs + metrics
  • Secrets Manager: API keys + credentials
  • S3: Static assets / exports
  • CloudFront: CDN
  • Route 53 + ACM: DNS + SSL

MVP Deployment

React frontend · Node.js API · PostgreSQL on RDS · SQS · Worker process · Twilio · Stripe · WebSocket gateway.

Start with ECS Fargate if possible. EC2 acceptable for v1. RDS from day one — database reliability is critical.

Modular monolith modules: auth · agents · campaigns · routing · billing · telephony · admin · audit · notifications

Security Requirements

  • HTTPS everywhere · secrets in Secrets Manager
  • Validate Stripe & Twilio webhook signatures
  • RBAC for admin · audit logs for admin actions
  • Never expose Stripe/Twilio secrets to frontend
  • Least-privilege IAM · encrypt DB at rest
  • Secure JWT/session management
Why this matters: The system handles payments, phone calls, agent identities, and insurance data, so security and auditability are core product requirements.
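
To show what "validate webhook signatures" involves, the sketch below checks a Stripe-style signature header by hand: Stripe signs the string `<timestamp>.<raw body>` with HMAC-SHA256 and sends it as `t=...,v1=...`. In production the official Stripe library's `stripe.webhooks.constructEvent` performs this check and is the safer choice; the function name, the `nowSeconds` parameter, and the 300-second tolerance default here are assumptions for illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verifies a "t=<unix seconds>,v1=<hex hmac>" header against the raw request body.
function verifyStripeSignature(
  rawBody: string,
  header: string,
  secret: string,
  nowSeconds: number,
  toleranceSeconds = 300, // assumed replay window
): boolean {
  const parts = header.split(",").map((part) => part.split("="));
  const timestamp = parts.find(([key]) => key === "t")?.[1];
  const candidates = parts.filter(([key]) => key === "v1").map(([, value]) => value ?? "");
  if (!timestamp || candidates.length === 0) return false;

  // Reject stale or malformed timestamps to limit replay attacks.
  const signedAt = Number(timestamp);
  if (!Number.isFinite(signedAt)) return false;
  if (Math.abs(nowSeconds - signedAt) > toleranceSeconds) return false;

  const expected = createHmac("sha256", secret).update(`${timestamp}.${rawBody}`).digest();
  // Constant-time comparison; lengths must match before timingSafeEqual is called.
  return candidates.some((candidate) => {
    const given = Buffer.from(candidate, "hex");
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}
```

The check must run against the raw request body, before any JSON parsing, because re-serialized JSON rarely matches the bytes that were signed.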
12 Observability & Operations

If routing fails,
revenue stops immediately

Operations and product analytics are directly connected. Observability is a revenue protection strategy.

Business Metrics
  • Calls received, routed, connected, missed
  • Revenue by campaign and date
  • Agent balance and online time
  • Conversion rate by campaign
Technical Metrics
  • API and webhook latency
  • SQS queue depth and worker failures
  • Routing latency end-to-end
  • Twilio / Stripe webhook failure rates
  • WebSocket disconnects · error rates
  • Database query performance

CloudWatch

Infrastructure metrics, log aggregation, alerting on queue depth and error rates.

Sentry

Application error tracking with stack traces. Catch routing and billing exceptions in production.

Structured Logs

JSON logs with correlation IDs across webhook → queue → worker → database for end-to-end tracing.
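
A correlation ID only helps if every stage writes it in the same field. The sketch below shows one possible log-line shape; the `LogContext` fields, the stage names, and the idea of reusing the external call ID as the correlation ID are assumptions for illustration.

```typescript
// Hypothetical context carried from the webhook through the queue to the worker.
interface LogContext {
  correlationId: string; // e.g. the external call ID or Stripe event ID
  stage: "webhook" | "queue" | "worker" | "database";
}

// Builds one JSON log line. Core keys are written last so that extra
// fields cannot overwrite the correlation ID, stage, or message.
function logLine(
  context: LogContext,
  message: string,
  fields: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    ...fields,
    timestamp: new Date().toISOString(),
    level: "info",
    ...context,
    message,
  });
}

// Usage: the same correlationId appears at every stage of one call.
console.log(logLine({ correlationId: "call-123", stage: "webhook" }, "raw event stored"));
console.log(logLine({ correlationId: "call-123", stage: "worker" }, "agent selected", { agentId: "a1" }));
```

Filtering the log store on a single `correlationId` then reconstructs the full path of one call or payment.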

13 MVP Delivery Plan

Six phases.
Incremental value.

  1. Core Infrastructure
  2. Routing Engine
  3. Telephony
  4. Billing
  5. Admin Platform
  6. QA & Launch

Phase 1 — Core Infrastructure
  • AWS setup · PostgreSQL schema · Node.js backend
  • Authentication · SQS queue · Worker service
  • Raw webhook ingestion · Deployment pipeline
Phase 2 — Routing Engine
  • Campaign phone number mapping
  • Eligibility rules · state licensing · online/offline
  • Daily cap · balance validation · routing audit logs
Phase 3 — Telephony
  • Twilio Voice SDK token generation
  • Browser softphone · incoming call flow
  • Call connected/ended events · call history
Phase 4 — Billing
  • Stripe Checkout · Stripe webhooks
  • Balance ledger · per-call debit
  • Auto-offline on low balance
Phase 5 — Admin Platform
  • Live call monitor · agent management
  • Campaign management · revenue reporting
  • Routing logs
Phase 6 — QA & Launch
  • E2E call testing · payment testing
  • Webhook replay testing · load testing
  • Monitoring dashboards · soft launch with seed agents
14 Final Recommendation

Speed today.
Scale tomorrow.

Simple enough to build quickly, but reliable enough to handle real money, real phone calls, and real-time routing. Not microservices from day one.

Modular Monolith · PostgreSQL · SQS Workers · Twilio · Stripe · WebSockets · AWS Managed Infrastructure

Speed

Fast to build, simple to deploy, easier to maintain as the first developer. Clear module ownership inside one codebase.

Reliability

Event-driven, idempotent, ledger-based. Durable queue, raw event storage, and transactional billing from day one.

Scale Path

High-volume modules (routing, billing) can be extracted into separate services when metrics demand it — not before.
