Voice Mode
What Voice Mode does
Voice Mode turns AI Partner into a spoken conversation interface. You speak, Whisper transcribes your words, the agent generates a response, and a TTS voice reads it back — all in one continuous loop.
This is the same underlying pipeline used in meeting attendance, but for interactive one-on-one conversation rather than passive listening.
Activating Voice Mode
Click the Voice button in the chat mode selector at the bottom of the chat panel:
Mode: [Auto] [Chat] [Goal] [Voice]
↑ click here
Or press the keyboard shortcut Ctrl+Shift+V.
When Voice Mode is active:
- A microphone button appears in the chat input area
- The current model name shows a 🎤 indicator
- Responses are spoken aloud automatically
How it works
You speak
↓
Browser captures audio (MediaRecorder API)
↓
Audio sent to POST /api/voice/transcribe (Whisper STT)
↓
Transcript appears in chat input (you can edit before sending)
↓
Agent generates response (same as Chat or Goal mode)
↓
Response text sent to POST /api/voice/tts
↓
Audio stream plays in browser
↓
Ready for next input
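The loop above can be sketched as a single function that takes each stage as a callable. This is a minimal illustration, not AI Partner's implementation: the function name `voice_turn` and its four callable parameters are hypothetical, and in the real product the first and last stages would be HTTP calls to the two endpoints shown in the diagram.

```python
from typing import Callable, Tuple


def voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    review: Callable[[str], str],
    run_agent: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Tuple[str, str, bytes]:
    """Run one pass of the loop: recorded audio in, spoken reply out."""
    transcript = transcribe(audio)   # stands in for POST /api/voice/transcribe
    message = review(transcript)     # the user may edit the transcript before sending
    reply = run_agent(message)       # same agent path as Chat or Goal mode
    speech = synthesize(reply)       # stands in for POST /api/voice/tts
    return message, reply, speech
```

Passing the stages in as callables keeps the sketch independent of any particular STT or TTS provider, which mirrors how the page describes them as swappable.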
STT — Speech to Text
AI Partner uses Whisper for transcription. Whisper can run in one of two setups:
| Setup | How | Quality |
|---|---|---|
| Self-hosted (recommended) | Docker service on port 8000 | Best — private, no API cost |
| OpenAI Whisper API | OPENAI_API_KEY required | Good — cloud, usage cost |
Start the self-hosted Whisper service:
docker compose --profile full up whisper
Set the transcription language in .env (optional, defaults to auto-detect):
WHISPER_LANGUAGE=en
Supported languages: English, Hindi, Spanish, French, German, Japanese, Chinese, Portuguese, and 95 more.
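As a rough sketch of how the optional WHISPER_LANGUAGE setting could be read, the helper below returns the configured code or None when the variable is unset, which the caller would treat as auto-detect. The function name `whisper_language` is hypothetical, and the use of None to signal auto-detect is an assumption rather than documented behaviour.

```python
import os
from typing import Mapping, Optional


def whisper_language(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the configured language code, or None to mean auto-detect."""
    value = env.get("WHISPER_LANGUAGE", "").strip().lower()
    # An unset or blank variable falls back to auto-detect.
    return value or None
```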
TTS — Text to Speech
Responses are spoken using your configured TTS provider. Set it in workspace/USER.md:
voice_profile: elevenlabs:your-voice-id
| Provider | Config | Quality | Cost |
|---|---|---|---|
| ElevenLabs | elevenlabs:<voice_id> | Highest — realistic clone of your own voice | ~$0.30/1K chars |
| MiniMax TTS | minimax:<voice_id> | Very good — expressive, low latency | Low |
| OpenAI TTS | openai:nova (or alloy/echo/fable/onyx/shimmer) | Good — preset voices, no clone | ~$0.015/1K chars |
| Browser TTS | browser | Basic — uses OS system voices, free | Free |
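The voice_profile values in the table follow a provider:voice pattern, with browser as the one value that takes no voice. A small parser for that pattern might look like the sketch below. The names `VoiceProfile` and `parse_voice_profile` are hypothetical, and the validation rules are inferred from the table, not taken from AI Partner's source.

```python
from typing import NamedTuple, Optional

# Provider names as they appear in the Config column above.
KNOWN_PROVIDERS = {"elevenlabs", "minimax", "openai", "browser"}


class VoiceProfile(NamedTuple):
    provider: str
    voice: Optional[str]


def parse_voice_profile(value: str) -> VoiceProfile:
    """Split 'provider:voice' into its parts and check the provider name."""
    provider, _, voice = value.strip().partition(":")
    provider = provider.lower()
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown TTS provider: {provider!r}")
    if provider == "browser":
        # Browser TTS uses OS system voices, so no voice id is expected.
        return VoiceProfile(provider, None)
    if not voice:
        raise ValueError(f"{provider} needs a voice, e.g. {provider}:<voice_id>")
    return VoiceProfile(provider, voice)
```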
Audio output formats — configure in .env:
TTS_FORMAT=mp3 # mp3 / wav / opus / aac / flac
Voice-to-voice chat (full round-trip)
For the most natural experience, use the voice-to-voice endpoint directly:
POST /api/voice/chat
This handles the full loop server-side:
- Transcribes your audio
- Runs the agent
- Returns synthesized audio of the response
Useful for building custom integrations (e.g., phone IVR, smart speaker wake-word).
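For a custom integration, a client could prepare the request along these lines. Only the path POST /api/voice/chat comes from this page. The request body format is not documented here, so sending raw audio bytes with an audio/webm content type is an assumption, as are the helper name `build_voice_chat_request` and whatever base URL you pass in.

```python
import urllib.request


def build_voice_chat_request(
    base_url: str, audio: bytes, content_type: str = "audio/webm"
) -> urllib.request.Request:
    """Build (but do not send) a POST request to the voice-to-voice endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/voice/chat",
        data=audio,  # assumed: raw audio bytes as the request body
        headers={"Content-Type": content_type},  # assumed content type
        method="POST",
    )
```

Sending it is then a matter of passing the request to `urllib.request.urlopen` and reading the response body, which this page describes as synthesized audio of the agent's reply.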
Microphone permissions
Voice Mode requires microphone access. On first use, your browser will ask for permission. Grant it and the microphone icon will turn active (red when recording, grey when idle).
If the microphone doesn't work:
- Chrome/Edge: ensure the site is on localhost or https:// (microphone is blocked on http:// for non-localhost)
- Check system audio settings: the default mic should be selected in your OS
- Firefox: go to Preferences → Privacy & Security → Permissions → Microphone
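The localhost-or-HTTPS rule in the checklist above can be checked ahead of time for a given deployment URL. The sketch below is a simplified approximation of the browser's secure-context rule, not the browser's actual algorithm, and the function name `mic_allowed_origin` is hypothetical.

```python
from urllib.parse import urlsplit

# Loopback hosts that browsers treat as secure even over plain http.
_LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def mic_allowed_origin(page_url: str) -> bool:
    """Return True if a page at this URL can normally request the microphone."""
    parts = urlsplit(page_url)
    if parts.scheme == "https":
        return True
    host = (parts.hostname or "").lower()
    return parts.scheme == "http" and host in _LOOPBACK_HOSTS
```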
Push-to-talk vs. auto-stop
Two recording modes (toggle in Settings → Voice):
| Mode | How to use |
|---|---|
| Push-to-talk | Hold the microphone button while speaking; release to send |
| Auto-stop | Click once to start recording; recording stops automatically after 1.5 s of silence |
Auto-stop is more hands-free; push-to-talk gives you more control in noisy environments.
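To make the auto-stop rule concrete, here is one way silence detection could work over a stream of per-frame loudness values. Only the 1.5-second silence window comes from this page. The loudness threshold, the frame length, the choice to ignore silence before any speech is heard, and the name `auto_stop_index` are all assumptions for illustration.

```python
from typing import Iterable, Optional


def auto_stop_index(
    levels: Iterable[float],
    frame_ms: int = 100,
    threshold: float = 0.02,
    silence_ms: int = 1500,
) -> Optional[int]:
    """Return the index of the frame at which recording should stop, or None."""
    quiet_ms = 0
    heard_speech = False
    for i, level in enumerate(levels):
        if level >= threshold:
            heard_speech = True
            quiet_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            quiet_ms += frame_ms
            if quiet_ms >= silence_ms:
                return i
    return None  # still recording: not enough trailing silence yet
```

Counting in whole milliseconds avoids floating-point drift when many short frames are summed.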
Using Voice Mode with goals
Voice Mode always routes through Auto — you no longer have to say the word "goal" to run one. Every utterance is sent with mode: "auto" and the backend classifier decides chat vs. goal:
- A question or chit-chat ("what's the weather", "what is 2 + 2") stays chat — answered instantly.
- An imperative desktop/app/media command ("turn up the volume", "open Spotify and play something", "click the submit button") escalates to goal, which can reach computer use / T4 host control.
The classifier is pure regex/heuristics (no extra LLM call), so routing adds no latency — a spoken command is as fast as typing one. Short commands like "mute" or "open Spotify" are recognised even though they're under four words.
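A toy version of such a regex-based router is sketched below to show why it adds no latency: each utterance costs a couple of pattern matches and nothing else. The patterns, verb list, and function name `classify` are invented for illustration and are far simpler than whatever rules the real backend classifier uses.

```python
import re

# Hypothetical patterns: question openers stay in chat, imperative verbs escalate.
_QUESTION = re.compile(
    r"^(what|who|when|where|why|how|is|are|do|does|can|could)\b", re.I
)
_COMMAND = re.compile(
    r"^(please\s+)?(open|close|click|play|pause|mute|unmute|turn|launch|type|scroll)\b",
    re.I,
)


def classify(utterance: str) -> str:
    """Return 'chat' or 'goal' for a spoken utterance using regexes only."""
    text = utterance.strip()
    if _QUESTION.match(text) or text.endswith("?"):
        return "chat"
    if _COMMAND.match(text):
        return "goal"  # includes one-word commands such as "mute"
    return "chat"  # default to the cheaper path when unsure
```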
When the backend actually starts a goal, the agent gives a brief spoken "On it." (so a chat-classified utterance never gets a misleading acknowledgment), and speaks a summary when the goal completes.
Routing to goal does not force host control. The agent still picks the simplest capable tool — a web action over driving your real desktop — and T4 is the last, opt-in tier. If a host command needs T4 and it's disabled (or you're in multi-user mode), the agent says so rather than acting. Sensitive host actions are gated by the denylist + approval layer.
Safety gates for spoken commands
Destructive verbs (send email, delete, purchase, book, deploy, cancel subscription) trigger a spoken "Are you sure? Say yes or go" confirmation before firing. While a goal is running, a new command is queued ("say queue it") rather than colliding.
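Both gates can be illustrated with a short sketch: a pattern that flags the destructive verbs listed above, and a queue that holds new commands while a goal is running. The names `needs_confirmation` and `GoalQueue` are hypothetical, the pattern is a naive keyword match (it would also flag "book" used as a noun), and the sketch skips the spoken prompts entirely.

```python
import re
from collections import deque
from typing import Deque, Optional

# Verbs taken from the list above; the exact matching rules are an assumption.
_DESTRUCTIVE = re.compile(
    r"\b(send\s+(an?\s+)?email|delete|purchase|book|deploy|"
    r"cancel\s+(my\s+|the\s+)?subscription)\b",
    re.I,
)


def needs_confirmation(command: str) -> bool:
    """Return True if the command should be confirmed before it fires."""
    return _DESTRUCTIVE.search(command) is not None


class GoalQueue:
    """Run one goal at a time and queue anything that arrives meanwhile."""

    def __init__(self) -> None:
        self.running: Optional[str] = None
        self.pending: Deque[str] = deque()

    def submit(self, command: str) -> str:
        if self.running is None:
            self.running = command
            return "started"
        self.pending.append(command)
        return "queued"

    def finish(self) -> Optional[str]:
        """Mark the current goal done and start the next queued one, if any."""
        self.running = self.pending.popleft() if self.pending else None
        return self.running
```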
Partner Mode
Partner Mode keeps the T3 virtual desktop container permanently active so it's ready for immediate computer use — no 30-second cold-start each time you need a browser.
Activating Partner Mode
Say (or type) any of these:
wake up partner
hey partner
wake partner
AI Partner boots the T3 container, opens a browser, and responds:
Partner activated — tell me Boss!!
The header badge in the top-right switches to:
🤖 Partner active ● (green pulse)
Two things Partner Mode can do
1. Join a meeting (auto-wake)
Paste any supported meeting URL — Partner Mode does not need to be active first. The system detects the URL and wakes the T3 container automatically:
@partner https://meet.google.com/abc-defg-hij
This bypasses the goal executor entirely and goes straight to the meeting pipeline (audio capture → Whisper STT → TTS response). See Meeting Proxy for the full flow.
2. General computer use (requires active Partner Mode)
For any non-meeting task, Partner Mode must be active first (wake up partner), then prefix the command with @partner:
@partner go to google.com and search for today's top AI news
@partner open https://linear.app/pricing and extract all plan details
Or via voice:
"partner, check our Stripe dashboard and tell me today's revenue"
These run through the goal executor with the T3 container's live Chromium browser — so the agent operates the real browser you see in the live view panel, not a headless instance.
If you send an @partner non-meeting command while Partner Mode is sleeping, the response will be: "Partner is sleeping. Say 'wake up partner' first."
Deactivating Partner Mode
bye partner
goodbye partner
partner sleep
The T3 container shuts down and the header badge returns to Partner off.
Checking Partner status
is partner active?
partner status
check partner
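The wake, sleep, status, and @partner rules described in the sections above can be condensed into one small dispatcher. This is an illustrative sketch only: the class name `Partner`, the returned action labels, and the meeting URL pattern (which covers just the Google Meet host shown in the example) are assumptions. The page does not say whether a meeting auto-wake also switches Partner Mode to active, so the sketch leaves the flag unchanged in that case.

```python
import re

WAKE = {"wake up partner", "hey partner", "wake partner"}
SLEEP = {"bye partner", "goodbye partner", "partner sleep"}
STATUS = {"is partner active?", "partner status", "check partner"}
# Only the host from the example above; other supported hosts are not listed here.
MEETING_URL = re.compile(r"https://meet\.google\.com/\S+")


class Partner:
    def __init__(self) -> None:
        self.active = False

    def handle(self, message: str) -> str:
        """Map a typed or spoken message to a hypothetical action label."""
        text = message.strip()
        lowered = text.lower()
        if lowered in WAKE:
            self.active = True
            return "wake"
        if lowered in SLEEP:
            self.active = False
            return "sleep"
        if lowered in STATUS:
            return "active" if self.active else "sleeping"
        if lowered.startswith("@partner"):
            if MEETING_URL.search(text):
                return "meeting"  # auto-wakes T3 and bypasses the goal executor
            # Non-meeting tasks require Partner Mode to be awake first.
            return "goal" if self.active else "blocked"
        return "chat"
```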
When to use Partner Mode vs. regular goals
| Situation | Use |
|---|---|
| Joining a meeting | @partner <meeting URL> — auto-wakes T3 |
| Multiple browser tasks in one session | Wake Partner Mode — container stays warm between tasks |
| Single one-off web extraction | Regular goal — T1/T2 handles it without T3 |
| CAPTCHA or login-walled site | Regular goal — agent auto-escalates to T3 as needed |
| Complex form filling you'll repeat | Partner Mode + skill learning |
Partner Mode is powered by the T3 container tier. If Docker Desktop is not running or the desktop image is not built, activation will return an error with instructions to build the image.