
Voice Mode

What Voice Mode does

Voice Mode turns AI Partner into a spoken conversation interface. You speak, Whisper transcribes your words, the agent generates a response, and a TTS voice reads it back — all in one continuous loop.

This is the same underlying pipeline used in meeting attendance, but for interactive one-on-one conversation rather than passive listening.


Activating Voice Mode

Click the Voice button in the chat mode selector at the bottom of the chat panel:

Mode: [Auto] [Chat] [Goal] [Voice]
                              ↑ click here

Or press the keyboard shortcut Ctrl+Shift+V.

When Voice Mode is active:

  • A microphone button appears in the chat input area
  • The current model name shows a 🎤 indicator
  • Responses are spoken aloud automatically

How it works

  1. You speak.
  2. The browser captures audio (MediaRecorder API).
  3. The audio is sent to POST /api/voice/transcribe (Whisper STT).
  4. The transcript appears in the chat input (you can edit it before sending).
  5. The agent generates a response (same as Chat or Goal mode).
  6. The response text is sent to POST /api/voice/tts.
  7. The audio stream plays in the browser.
  8. Voice Mode is ready for the next input.
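The ordering of the steps above can be sketched as a single function. This is a minimal illustration, not AI Partner's actual code: the three callables are hypothetical stand-ins for POST /api/voice/transcribe, the agent, and POST /api/voice/tts, and `edit_transcript` models the optional manual edit before sending.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    transcript: str
    reply_text: str
    reply_audio: bytes


def run_voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],    # stand-in for POST /api/voice/transcribe
    generate_reply: Callable[[str], str],  # stand-in for the agent
    synthesize: Callable[[str], bytes],    # stand-in for POST /api/voice/tts
    edit_transcript: Callable[[str], str] = lambda text: text,
) -> VoiceTurn:
    """Run one speak -> transcribe -> reply -> speak-back turn."""
    transcript = edit_transcript(transcribe(audio))  # user may edit before sending
    reply_text = generate_reply(transcript)
    reply_audio = synthesize(reply_text)
    return VoiceTurn(transcript, reply_text, reply_audio)
```

Passing the stages in as callables keeps the sketch independent of any particular HTTP client, so each stage can be swapped for a real request once the request and response shapes are known.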

STT — Speech to Text

AI Partner uses Whisper for transcription. Whisper can run in either of two setups:

| Setup | How | Quality |
| --- | --- | --- |
| Self-hosted (recommended) | Docker service on port 8000 | Best: private, no API cost |
| OpenAI Whisper API | OPENAI_API_KEY required | Good: cloud, usage cost |

Start the self-hosted Whisper service:

docker compose --profile full up whisper

Set the transcription language in .env (optional, defaults to auto-detect):

WHISPER_LANGUAGE=en

Supported languages: English, Hindi, Spanish, French, German, Japanese, Chinese, Portuguese, and 95 more.


TTS — Text to Speech

Responses are spoken using your configured TTS provider. Set it in workspace/USER.md:

voice_profile: elevenlabs:your-voice-id

| Provider | Config | Quality | Cost |
| --- | --- | --- | --- |
| ElevenLabs | elevenlabs:<voice_id> | Highest: realistic clone of your own voice | ~$0.30/1K chars |
| MiniMax TTS | minimax:<voice_id> | Very good: expressive, low latency | Low |
| OpenAI TTS | openai:nova (or alloy/echo/fable/onyx/shimmer) | Good: preset voices, no clone | ~$0.015/1K chars |
| Browser TTS | browser | Basic: uses OS system voices | Free |
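As a rough illustration of the provider:voice_id convention and the approximate prices listed above, the sketch below splits a voice_profile value and estimates the cost of speaking a reply. The function names and validation rules are assumptions made for this example, not AI Partner's own parser, and the rates are only the approximate figures from the table.

```python
from typing import NamedTuple, Optional

KNOWN_PROVIDERS = {"elevenlabs", "minimax", "openai", "browser"}

# Approximate USD per 1,000 characters, copied from the provider table.
# MiniMax is listed only as "Low", so no figure is assumed for it.
APPROX_COST_PER_1K = {"elevenlabs": 0.30, "openai": 0.015, "browser": 0.0}


class VoiceProfile(NamedTuple):
    provider: str
    voice_id: Optional[str]


def parse_voice_profile(value: str) -> VoiceProfile:
    """Split 'provider:voice_id'; 'browser' takes no voice id."""
    provider, _, voice_id = value.strip().partition(":")
    provider = provider.lower()
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown TTS provider: {provider!r}")
    if provider != "browser" and not voice_id:
        raise ValueError(f"{provider} needs a voice id, e.g. {provider}:<voice_id>")
    return VoiceProfile(provider, voice_id or None)


def estimate_cost_usd(profile: VoiceProfile, text: str) -> Optional[float]:
    """Return an approximate cost, or None when no rate is listed."""
    rate = APPROX_COST_PER_1K.get(profile.provider)
    if rate is None:
        return None
    return rate * len(text) / 1000
```

At these rates, a 2,000-character reply would cost roughly $0.60 with ElevenLabs and roughly $0.03 with OpenAI TTS.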

Audio output formats — configure in .env:

TTS_FORMAT=mp3 # mp3 / wav / opus / aac / flac
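The two .env settings mentioned on this page (WHISPER_LANGUAGE and TTS_FORMAT) could be read and validated along the following lines. This is a sketch under assumptions: the helper names are hypothetical, the mp3 fallback is a guess because no default is stated here, and inline comments are stripped in case the .env loader in use does not remove them.

```python
import os
from typing import Mapping, Optional

SUPPORTED_TTS_FORMATS = ("mp3", "wav", "opus", "aac", "flac")


def _clean(value: str) -> str:
    """Drop a trailing '# comment' and surrounding whitespace."""
    return value.split("#", 1)[0].strip()


def whisper_language(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the configured language code, or None for auto-detect."""
    return _clean(env.get("WHISPER_LANGUAGE", "")) or None


def tts_format(env: Mapping[str, str] = os.environ, default: str = "mp3") -> str:
    """Return a validated TTS output format (the default is an assumption)."""
    value = _clean(env.get("TTS_FORMAT", default)).lower()
    if value not in SUPPORTED_TTS_FORMATS:
        raise ValueError(
            f"TTS_FORMAT must be one of {', '.join(SUPPORTED_TTS_FORMATS)}; got {value!r}"
        )
    return value
```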

Voice-to-voice chat (full round-trip)

For the most natural experience, use the voice-to-voice endpoint directly:

POST /api/voice/chat

This handles the full loop server-side:

  1. Transcribes your audio
  2. Runs the agent
  3. Returns synthesized audio of the response

Useful for building custom integrations (e.g., phone IVR, smart speaker wake-word).
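A custom integration might call the endpoint roughly as follows. The request and response formats are not specified on this page, so this sketch assumes the audio is sent as a raw request body with an audio/webm content type and that the response body is the synthesized audio; the base URL is also a placeholder. Check the actual API contract before relying on any of this.

```python
import urllib.request


def build_voice_chat_request(
    base_url: str,
    audio: bytes,
    content_type: str = "audio/webm",  # assumed upload format
) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/voice/chat."""
    url = base_url.rstrip("/") + "/api/voice/chat"
    return urllib.request.Request(
        url,
        data=audio,
        headers={"Content-Type": content_type},
        method="POST",
    )


def voice_chat(base_url: str, audio: bytes, timeout: float = 60.0) -> bytes:
    """Send recorded audio and return the response body (assumed to be audio)."""
    request = build_voice_chat_request(base_url, audio)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

For example, `voice_chat("http://localhost:3000", recorded_bytes)` would perform one full round-trip, where the host and port are hypothetical.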


Microphone permissions

Voice Mode requires microphone access. On first use, your browser will ask for permission. Grant it and the microphone icon will turn active (red when recording, grey when idle).

If the microphone doesn't work:

  • Chrome/Edge: ensure the site is served from localhost or over https:// (microphone access is blocked on http:// for non-localhost origins)
  • Firefox: go to Settings (Preferences in older versions) → Privacy & Security → Permissions → Microphone
  • Check system audio settings: the microphone you want to use should be selected as the default input in your OS

Push-to-talk vs. auto-stop

Two recording modes (toggle in Settings → Voice):

| Mode | How to use |
| --- | --- |
| Push-to-talk | Hold the microphone button while speaking; release to send |
| Auto-stop | Click once to start recording; recording stops automatically after 1.5 s of silence |

Auto-stop is more hands-free; push-to-talk gives you more control in noisy environments.
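The auto-stop rule can be illustrated with a small silence detector over per-frame audio levels. This is a sketch, not the actual client logic: the level threshold, the frame length, and the choice to ignore silence before the first speech are all assumptions; only the 1.5-second window comes from the description above.

```python
import math
from typing import Iterable, Optional


def auto_stop_index(
    levels: Iterable[float],
    frame_seconds: float,
    silence_seconds: float = 1.5,  # window from the docs
    threshold: float = 0.02,       # assumed level below which a frame is silent
) -> Optional[int]:
    """Return the index of the frame at which recording would stop, or None."""
    # Number of consecutive silent frames that add up to the silence window.
    frames_needed = math.ceil(silence_seconds / frame_seconds - 1e-9)
    silent_frames = 0
    heard_speech = False
    for index, level in enumerate(levels):
        if level >= threshold:
            heard_speech = True
            silent_frames = 0  # any speech resets the silence window
        elif heard_speech:
            silent_frames += 1
            if silent_frames >= frames_needed:
                return index
    return None
```

Counting whole frames rather than summing durations avoids floating-point drift when the frame length does not divide the window exactly.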


Using Voice Mode with goals

Voice Mode always routes through Auto — you no longer have to say the word "goal" to run one. Every utterance is sent with mode: "auto" and the backend classifier decides chat vs. goal:

  • A question or chit-chat ("what's the weather", "what is 2 + 2") stays chat — answered instantly.
  • An imperative desktop/app/media command ("turn up the volume", "open Spotify and play something", "click the submit button") escalates to goal, which can reach computer use / T4 host control.

The classifier is pure regex/heuristics (no extra LLM call), so routing adds no latency — a spoken command is as fast as typing one. Short commands like "mute" or "open Spotify" are recognised even though they're under four words.
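To make the idea of a regex-only router concrete, here is a deliberately small sketch. The patterns and verb list are invented for illustration and are not the backend's real rules; the actual classifier is likely to cover many more cases.

```python
import re

# Hypothetical patterns; the real classifier's rules are not documented here.
QUESTION_START = re.compile(
    r"^(?:what|who|when|where|why|how|is|are|do|does|can|could)\b", re.IGNORECASE
)
COMMAND_START = re.compile(
    r"^(?:please\s+)?(?:open|close|click|play|pause|mute|unmute|turn|launch|type|scroll|press)\b",
    re.IGNORECASE,
)


def classify(utterance: str) -> str:
    """Return 'goal' for imperative commands and 'chat' for everything else."""
    text = utterance.strip()
    if QUESTION_START.match(text) or text.endswith("?"):
        return "chat"
    if COMMAND_START.match(text):
        return "goal"  # matches even one-word commands such as "mute"
    return "chat"
```

Because the decision is a pair of regex matches, it runs in microseconds and needs no model call. A rule set this small will misroute some phrasings, such as a command worded as a polite question.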

When the backend actually starts a goal, the agent gives a brief spoken "On it." (so a chat-classified utterance never gets a misleading acknowledgment), and speaks a summary when the goal completes.

Routing to goal does not force host control. The agent still picks the simplest capable tool — a web action over driving your real desktop — and T4 is the last, opt-in tier. If a host command needs T4 and it's disabled (or you're in multi-user mode), the agent says so rather than acting. Sensitive host actions are gated by the denylist + approval layer.

Safety gates for spoken commands

Destructive verbs (send email, delete, purchase, book, deploy, cancel subscription) trigger a spoken "Are you sure? Say yes or go" confirmation before firing. While a goal is running, a new command is queued (say "queue it") rather than colliding with the running goal.
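The gating described above can be sketched as a small decision function. The verb patterns mirror the list in this section, but the function names, the return values, and the precedence between queueing and confirming are assumptions for illustration only.

```python
import re

# Patterns built from the verb list in this section; matching is word-based,
# so a harmless sentence containing "book" as a noun would also trigger.
DESTRUCTIVE = re.compile(
    r"\b(?:send\s+(?:an?\s+)?email|delete|purchase|book|deploy|"
    r"cancel\s+(?:my\s+|the\s+)?subscription)\b",
    re.IGNORECASE,
)


def needs_confirmation(command: str) -> bool:
    """True when the command contains a destructive verb."""
    return DESTRUCTIVE.search(command) is not None


def next_step(command: str, goal_running: bool, confirmed: bool = False) -> str:
    """Return 'queue', 'confirm', or 'run' for a spoken command."""
    if goal_running:
        return "queue"    # do not collide with the running goal (assumed precedence)
    if needs_confirmation(command) and not confirmed:
        return "confirm"  # ask for spoken confirmation first
    return "run"
```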


Partner Mode

Partner Mode keeps the T3 virtual desktop container permanently active so it's ready for immediate computer use — no 30-second cold-start each time you need a browser.

Activating Partner Mode

Say (or type) any of these:

wake up partner
hey partner
wake partner

AI Partner boots the T3 container, opens a browser, and responds:

Partner activated — tell me Boss!!

The header badge in the top-right switches to:

🤖 Partner active ● (green pulse)

Two things Partner Mode can do

1. Join a meeting (auto-wake)

Paste any supported meeting URL — Partner Mode does not need to be active first. The system detects the URL and wakes the T3 container automatically:

@partner https://meet.google.com/abc-defg-hij

This bypasses the goal executor entirely and goes straight to the meeting pipeline (audio capture → Whisper STT → TTS response). See Meeting Proxy for the full flow.

2. General computer use (requires active Partner Mode)

For any non-meeting task, Partner Mode must be active first (wake up partner), then prefix the command with @partner:

@partner go to google.com and search for today's top AI news
@partner open https://linear.app/pricing and extract all plan details

Or via voice:

"partner, check our Stripe dashboard and tell me today's revenue"

These run through the goal executor with the T3 container's live Chromium browser — so the agent operates the real browser you see in the live view panel, not a headless instance.

If you send an @partner non-meeting command while Partner Mode is sleeping, the response will be: "Partner is sleeping. Say 'wake up partner' first."
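The routing rules described so far (wake phrases, auto-wake for meeting URLs, and the sleeping check for other @partner commands) can be summarized in a short sketch. The route names and the meeting-host list are hypothetical; only meet.google.com appears on this page, and the real set of supported meeting URLs may be larger.

```python
from urllib.parse import urlparse

WAKE_PHRASES = {"wake up partner", "hey partner", "wake partner"}
MEETING_HOSTS = {"meet.google.com"}  # hypothetical list; only this host is shown in the docs


def contains_meeting_url(text: str) -> bool:
    """True if any whitespace-separated token is a URL on a known meeting host."""
    for token in text.split():
        try:
            parsed = urlparse(token)
        except ValueError:
            continue  # malformed URL-like token
        if parsed.scheme in ("http", "https") and parsed.hostname in MEETING_HOSTS:
            return True
    return False


def route(message: str, partner_active: bool) -> str:
    """Return 'wake', 'meeting', 'goal', 'sleeping', or 'regular'."""
    text = message.strip()
    if text.lower() in WAKE_PHRASES:
        return "wake"
    if not text.lower().startswith("@partner"):
        return "regular"   # not a Partner command
    command = text[len("@partner"):].strip()
    if contains_meeting_url(command):
        return "meeting"   # auto-wakes T3, bypasses the goal executor
    if not partner_active:
        return "sleeping"  # "Partner is sleeping. Say 'wake up partner' first."
    return "goal"          # goal executor with the T3 browser
```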

Deactivating Partner Mode

bye partner
goodbye partner
partner sleep

The T3 container shuts down and the header badge returns to Partner off.

Checking Partner status

is partner active?
partner status
check partner

When to use Partner Mode vs. regular goals

| Situation | Use |
| --- | --- |
| Joining a meeting | @partner <meeting URL> (auto-wakes T3) |
| Multiple browser tasks in one session | Wake Partner Mode (container stays warm between tasks) |
| Single one-off web extraction | Regular goal (T1/T2 handles it without T3) |
| CAPTCHA or login-walled site | Regular goal (agent auto-escalates to T3 as needed) |
| Complex form filling you'll repeat | Partner Mode + skill learning |

Partner Mode is powered by the T3 container tier. If Docker Desktop is not running or the desktop image is not built, activation will return an error with instructions to build the image.