GhostBrain Architecture

Overview

GhostBrain is a real-time voice AI interviewer bot that conducts natural conversations over phone calls or a local microphone. It combines streaming speech recognition, a large language model, and voice synthesis into a single low-latency conversational loop. Post-call analysis runs in a separate event-driven, serverless service, so it never adds latency to a live call.

System Architecture

flowchart TD
    subgraph Clients
        T["📞 Twilio Phone Call"]
        M["🎤 Local Microphone"]
        W["🌐 Daily WebRTC"]
    end

    subgraph Live ["Live Calling Service (Cloud Run)"]
        F["⚡ FastAPI WebSocket Endpoint"]

        subgraph Pipeline ["Pipecat Voice Pipeline"]
            direction TB
            TR["Transport"] -->|"Audio"| STT["Deepgram STT"]
            STT -->|"Text"| VAD["Silero VAD"]
            VAD -->|"Intent"| LLM["Groq Llama-3.3-70B"]
            LLM -->|"Text Response"| TTS["OpenAI TTS"]
            TTS -->|"Synthesized Audio"| TR
        end
    end

    T <-->|"Audio Stream"| F
    M <-->|"Audio Stream"| F
    W <-->|"Audio Stream"| F
    F <-->|"Frames"| TR

    F -.->|"Upload transcript"| S["🪣 GCS Transcript Storage"]

    subgraph Async ["Post-Call Processing"]
        E["⚡ Eventarc Trigger"]
        P["⚙️ Post-Call Service (Cloud Run)"]
        A["🧠 Anthropic Claude"]
    end

    S -->|"Object Finalized"| E
    E -->|"Webhook"| P
    P <-->|"Summarize & Split"| A
    P -.->|"Save Markdown Files"| S

Core Components

1. Input Layer

The system accepts audio input from multiple sources:

  • Twilio WebSocket: Production phone calls via Twilio Media Streams
      • 8kHz sample rate (telephony standard)
      • µ-law audio encoding
      • Real-time bidirectional streaming

  • Local Microphone: Development/testing via PyAudio
      • 16kHz sample rate (higher quality)
      • Direct PCM audio capture
      • No telephony overhead
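
The telephony path delivers 8-bit µ-law samples, while the local path captures linear PCM directly. As a minimal sketch of the conversion between the two (the function names `ulaw_byte_to_pcm16` and `decode_ulaw` are illustrative and not taken from the GhostBrain codebase), standard G.711 µ-law expansion to 16-bit PCM might look like this:

```python
import struct

ULAW_BIAS = 0x84  # 132, the bias added during G.711 mu-law encoding


def ulaw_byte_to_pcm16(u: int) -> int:
    """Expand one G.711 mu-law byte into a signed 16-bit PCM sample."""
    u = ~u & 0xFF                  # mu-law bytes are stored bit-inverted
    sign = u & 0x80                # top bit carries the sign
    exponent = (u >> 4) & 0x07     # 3-bit segment number
    mantissa = u & 0x0F            # 4-bit step within the segment
    magnitude = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS
    return -magnitude if sign else magnitude


def decode_ulaw(payload: bytes) -> bytes:
    """Decode a mu-law payload into little-endian 16-bit PCM bytes."""
    samples = [ulaw_byte_to_pcm16(b) for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)
```

Decoding changes only the sample encoding. The audio is still sampled at 8kHz afterwards, so any stage that expects 16kHz input would need a separate resampling step.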

2. Live Service (FastAPI)

Central web application managing active WebSocket connections:

  • Endpoint: /ws - Accepts Twilio Media Stream connections
  • Async Architecture: Full async/await pattern for concurrent connections
  • Transcript Upload: Uploads the raw text transcript to GCS immediately upon call hangup.
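
Twilio Media Streams sends JSON text messages over the WebSocket, each tagged with an `event` field such as `start`, `media`, or `stop`, with audio carried as a base64 payload. The following is a rough sketch of how a handler behind `/ws` could dispatch on those events. `CallState` and `handle_twilio_message` are hypothetical names, and the real service presumably hands frames to Pipecat rather than buffering them.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CallState:
    """Hypothetical per-connection state; not part of any library."""
    stream_sid: Optional[str] = None
    audio_chunks: List[bytes] = field(default_factory=list)
    ended: bool = False


def handle_twilio_message(raw: str, state: CallState) -> None:
    """Update call state from one Twilio Media Streams JSON message."""
    msg = json.loads(raw)
    event = msg.get("event")
    if event == "start":
        # The start message carries stream metadata, including its ID.
        state.stream_sid = msg["start"]["streamSid"]
    elif event == "media":
        # Audio arrives base64-encoded (mu-law at 8kHz for phone calls).
        state.audio_chunks.append(base64.b64decode(msg["media"]["payload"]))
    elif event == "stop":
        state.ended = True
    # Other events (e.g. "connected", "mark") are ignored in this sketch.
```

Keeping the dispatch logic in a plain function, separate from the WebSocket receive loop, makes it straightforward to unit test without a live connection.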

3. Pipecat Pipeline

The heart of the live system - a composable pipeline for real-time voice processing:

flowchart LR
    TR_IN["Transport Input"] --> STT["STT"]
    STT --> VAD["VAD"]
    VAD --> UA["User Aggregator"]
    UA --> LLM["LLM"]
    LLM --> TTS["TTS"]
    TTS --> TR_OUT["Transport Output"]
    TR_OUT --> AA["Assistant Aggregator"]
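
To illustrate what "composable" means here, the toy sketch below chains stand-in stages in the same order as the diagram. It deliberately does not use Pipecat's API: `compose`, `make_stage`, and `STAGE_ORDER` are invented for illustration, and each stage only tags the frame so the ordering is visible.

```python
from typing import Callable, List

Stage = Callable[[str], str]


def compose(stages: List[Stage]) -> Stage:
    """Chain stages so that each stage's output feeds the next one."""
    def run(frame: str) -> str:
        for stage in stages:
            frame = stage(frame)
        return frame
    return run


def make_stage(name: str) -> Stage:
    """Build a stand-in stage that appends its name to the frame."""
    return lambda frame: f"{frame}>{name}"


# Same left-to-right order as the flowchart above.
STAGE_ORDER = ["stt", "vad", "user_aggregator", "llm", "tts",
               "transport_out", "assistant_aggregator"]
pipeline = compose([make_stage(name) for name in STAGE_ORDER])
```

Because every stage shares one interface, swapping a provider (for example a different STT service) means replacing a single list entry rather than rewiring the flow.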

4. Voice Activity Detection (VAD)

Model: Silero VAD

  • Detects when users start and stop speaking
  • Configurable pause detection (0.2-0.5 seconds)
  • Prevents interruptions and crosstalk
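
The pause-detection setting can be pictured as a counter over per-frame speech probabilities: an utterance counts as finished once enough consecutive frames fall below a threshold. The sketch below shows that idea only. The frame length, threshold, and function name are assumptions, and this is not Silero's or Pipecat's actual implementation.

```python
from typing import Iterable, List


def speech_end_frames(
    probs: Iterable[float],
    frame_secs: float = 0.032,   # assumed frame duration
    stop_secs: float = 0.3,      # pause length, within the 0.2-0.5s range
    threshold: float = 0.5,      # assumed speech-probability cutoff
) -> List[int]:
    """Return indices of frames at which an utterance is judged finished."""
    needed = max(1, round(stop_secs / frame_secs))  # silent frames required
    ends: List[int] = []
    speaking = False
    silent_run = 0
    for i, p in enumerate(probs):
        if p >= threshold:
            speaking = True
            silent_run = 0          # any speech resets the pause counter
        elif speaking:
            silent_run += 1
            if silent_run >= needed:
                ends.append(i)      # pause long enough: end of utterance
                speaking = False
                silent_run = 0
    return ends
```

A shorter `stop_secs` makes the bot respond sooner but risks cutting in during natural mid-sentence pauses; a longer value does the opposite.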

5. Speech-to-Text (STT)

Service: Deepgram
Model: nova-2

  • High accuracy on conversational speech
  • Real-time streaming transcription (<300ms latency)

6. Large Language Model (LLM)

Service: Groq
Model: llama-3.3-70b-versatile

  • Ultra-fast inference on Groq's LPU architecture, targeting <200ms TTFT (time to first token) to keep conversations natural

7. Text-to-Speech (TTS)

Service: OpenAI
Model: tts-1
Voice: alloy

  • Natural-sounding synthesized speech optimized for latency

8. Post-Call Processing (Eventarc & Anthropic)

To prevent heavy processing from stealing CPU cycles from live callers, analysis is handled by a separate Cloud Run service:

  • Eventarc: Listens for file drops in the GCS bucket.
  • Anthropic Claude 3.5 Sonnet: Analyzes the raw transcript, splits the user's thoughts into separate topics, and formats each one as Markdown using predefined templates (e.g. Daily Logs, Project Ideas).
  • Storage Loop: The generated markdown files are saved back into the processed/ prefix of the GCS bucket.
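
Because the processed files are written back to the same bucket that Eventarc watches, each of those writes produces its own finalized event. A handler would therefore need to skip its own output to avoid reprocessing it indefinitely. Below is a minimal sketch of such a guard, assuming the event body exposes the Cloud Storage object's `bucket` and `name` fields and that raw transcripts end in `.txt`; `transcript_to_process` is a hypothetical helper name.

```python
from typing import Optional

PROCESSED_PREFIX = "processed/"  # output prefix named in the list above


def transcript_to_process(event: dict) -> Optional[str]:
    """Return a gs:// URI for a new raw transcript, or None to skip the event."""
    bucket = event.get("bucket")
    name = event.get("name", "")
    if not bucket or not name:
        return None  # malformed or unrelated event
    if name.startswith(PROCESSED_PREFIX):
        return None  # our own output; handling it would trigger a loop
    if not name.endswith(".txt"):
        return None  # assumed: raw transcripts are plain-text files
    return f"gs://{bucket}/{name}"
```

An alternative is to write processed files to a second bucket, which removes the need for the prefix check at the cost of one more resource to manage.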

Data Flow

Live Phone Call Flow

sequenceDiagram
    actor U as User
    participant T as Twilio
    participant CR as Cloud Run (Live)
    participant P as Pipecat Pipeline
    participant G as GCS

    U->>T: Calls Phone Number
    T->>CR: Initiates WebSocket
    CR->>P: Initializes Pipeline
    loop Audio Streaming
        U->>T: Speaks
        T->>P: Audio Stream (Inbound)
        P->>P: STT → LLM → TTS
        P->>T: Audio Stream (Outbound)
        T->>U: Hears Response
    end
    U->>T: Hangs Up
    T->>CR: Disconnects
    CR->>G: Uploads raw transcript.txt
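
The final step of this flow, uploading the transcript once Twilio disconnects, has to run even when the connection drops abruptly. One way to sketch that is a `try`/`finally` around the call loop. In the example below `run_call`, the injected `receive_turns` and `upload` callables, and the `transcripts/` object path are all assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_call(
    receive_turns: Callable[[List[str]], Awaitable[None]],
    upload: Callable[[str, str], Awaitable[None]],
    call_id: str,
) -> None:
    """Run one call and upload its transcript even if the call ends abruptly."""
    turns: List[str] = []
    try:
        # Appends transcript lines to `turns`; returns or raises on disconnect.
        await receive_turns(turns)
    finally:
        if turns:  # skip the upload for calls with no recorded speech
            await upload(f"transcripts/{call_id}.txt", "\n".join(turns))
```

Injecting `upload` keeps the sketch independent of any storage client; in a real deployment it would wrap a Cloud Storage write.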

Post-Call Async Flow

sequenceDiagram
    participant G as GCS
    participant E as Eventarc
    participant PC as Cloud Run (Post-Call)
    participant A as Anthropic Claude

    G->>E: Fires Object Finalized Event
    E->>PC: Triggers POST /events/post-call
    PC->>G: Downloads raw transcript
    PC->>A: Prompts with Context & Templates
    A-->>PC: Returns structured JSON files
    PC->>G: Uploads processed/ template files
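
The last two steps of this flow turn Claude's structured reply into objects under `processed/`. The reply schema is not specified here, so the sketch below assumes a JSON array of `{"filename", "content"}` objects and a per-call subfolder; both are assumptions, as is the helper name `plan_uploads`.

```python
import json
import posixpath
from typing import Dict


def plan_uploads(model_output: str, call_id: str) -> Dict[str, str]:
    """Map an assumed JSON reply to {object path: Markdown body}."""
    files = json.loads(model_output)
    uploads: Dict[str, str] = {}
    for item in files:
        # basename() drops any directory parts, so a model-supplied name
        # cannot escape the processed/ prefix.
        filename = posixpath.basename(item["filename"])
        uploads[f"processed/{call_id}/{filename}"] = item["content"]
    return uploads
```

Computing the upload plan as plain data before touching storage makes it easy to validate or log the model's output before anything is written.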

Deployment Architecture

Google Cloud Platform

flowchart TD
    subgraph GCP ["Google Cloud Project"]
        CR_LIVE["☁️ Cloud Run (Live)"]
        CR_POST["☁️ Cloud Run (Post-Call)"]
        GCS["🪣 Cloud Storage Bucket"]
        EV["⚡ Eventarc"]
        SM["🔐 Secret Manager"]

        CR_LIVE -->|"Uploads transcript"| GCS
        GCS -->|"Triggers"| EV
        EV -->|"Invokes"| CR_POST

        CR_LIVE -->|"Reads Keys"| SM
        CR_POST -->|"Reads Keys"| SM
    end

Infrastructure as Code: Terraform manages all GCP resources, provisioning the Eventarc triggers and Pub/Sub permissions and binding Secret Manager versions to the Cloud Run services.

Performance Characteristics

Latency Budget (Live Service)

  • STT Latency: ~200-300ms (Deepgram streaming)
  • LLM Latency: ~100-200ms (Groq LPU)
  • TTS Latency: ~150-250ms (OpenAI streaming)
  • Network: ~50-100ms
  • Total End-to-End: ~500-850ms
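
The end-to-end figure is the sum of the per-stage ranges, which assumes the stages run strictly one after another. A quick check of that arithmetic (the dictionary name is illustrative):

```python
# (low, high) per-stage latency estimates in milliseconds, from the list above
BUDGET_MS = {
    "stt": (200, 300),
    "llm": (100, 200),
    "tts": (150, 250),
    "network": (50, 100),
}

total_low = sum(low for low, _ in BUDGET_MS.values())
total_high = sum(high for _, high in BUDGET_MS.values())
print(f"end-to-end: ~{total_low}-{total_high}ms")  # end-to-end: ~500-850ms
```

Where stages stream into each other (for example, TTS starting before the LLM has finished generating), the perceived latency may be lower than this serial sum.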

Scalability

  • Decoupled Workloads: Transcript analysis (Claude) and JSON parsing run in a separate Post-Call Cloud Run service, so the Live service keeps its WebSocket connections stable without competing for CPU.