Voice & meetings
Capture paths
- Phone dial-in: Twilio + Gemini Live. Real-time speech-to-text, zero upload. See Voice dial-in.
- File upload: mp3 / m4a / webm. Whisper transcribes. See Uploading meetings.
- Paste transcript: skip transcription. Fastest path.

All three produce a Meeting row with a Transcript. Downstream processing is identical.
Transcription
Whisper (default)
- Cloud: OpenAI’s `whisper-1` model. Requires `OPENAI_API_KEY`.
- Self-hosted: any Whisper-compatible endpoint via `WHISPER_BASE_URL=http://host:9000`. Run `whisper.cpp` or Faster-Whisper.
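The choice between the cloud and self-hosted paths can be sketched as a small resolver. This is a minimal illustration, not the project's code: `resolveWhisperTarget`, `WhisperTarget`, and the assumption that a self-hosted server exposes an OpenAI-style `/v1/audio/transcriptions` route and accepts the model name `whisper-1` are all hypothetical. Only the two environment variables come from the text above.

```typescript
// Hypothetical helper: names and the self-hosted request path are assumptions.
type Env = Record<string, string | undefined>;

interface WhisperTarget {
  mode: "cloud" | "self-hosted";
  url: string;     // full transcription endpoint
  model: string;
  apiKey?: string; // only set in cloud mode
}

const OPENAI_BASE_URL = "https://api.openai.com/v1";

function resolveWhisperTarget(env: Env): WhisperTarget {
  const baseUrl = env.WHISPER_BASE_URL?.trim();
  if (baseUrl) {
    // Self-hosted: assume the server mirrors OpenAI's transcription route.
    return {
      mode: "self-hosted",
      url: `${baseUrl.replace(/\/+$/, "")}/v1/audio/transcriptions`,
      model: "whisper-1", // assumed; some servers ignore or rename this
    };
  }
  const apiKey = env.OPENAI_API_KEY?.trim();
  if (!apiKey) {
    throw new Error("OPENAI_API_KEY is required when WHISPER_BASE_URL is unset");
  }
  return {
    mode: "cloud",
    url: `${OPENAI_BASE_URL}/audio/transcriptions`,
    model: "whisper-1",
    apiKey,
  };
}
```

Passing the environment in as an argument (for example `resolveWhisperTarget(process.env)`) keeps the function easy to test and makes the precedence explicit: a set `WHISPER_BASE_URL` wins over the cloud default.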
Domain-aware prompting
Before transcription, we build a domain prompt from:
- The active project’s name.
- Top-20 god-nodes from the project graph (technical terms, class names).
- Frequently-referenced nouns from previous transcripts in the same project.
This prompt is passed as the prompt field to Whisper. The
difference on technical meetings is significant — “BAAgentService”
stops coming back as “be agent service.”
See backend/src/services/transcription/domain-prompt.ts.
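A rough sketch of how such a prompt could be assembled from the three inputs listed above. This is not the contents of `domain-prompt.ts`: `DomainPromptInput`, `buildDomainPrompt`, the output wording, and the 800-character budget are all assumptions made for illustration. The budget exists because OpenAI's speech-to-text guide notes that Whisper only considers the final 224 tokens of the prompt, so an unbounded term list would not help.

```typescript
// Hypothetical shapes; the real domain-prompt.ts may differ.
interface DomainPromptInput {
  projectName: string;
  godNodes: string[];      // assumed to be pre-sorted by importance
  frequentNouns: string[]; // from earlier transcripts in the same project
}

const MAX_GOD_NODES = 20;
const MAX_PROMPT_CHARS = 800; // assumed budget, not a documented limit

function buildDomainPrompt(input: DomainPromptInput): string {
  // Merge both term sources, dropping blanks and case-insensitive duplicates.
  const seen = new Set<string>();
  const terms: string[] = [];
  const candidates = [...input.godNodes.slice(0, MAX_GOD_NODES), ...input.frequentNouns];
  for (const candidate of candidates) {
    const cleaned = candidate.trim();
    const key = cleaned.toLowerCase();
    if (cleaned && !seen.has(key)) {
      seen.add(key);
      terms.push(cleaned);
    }
  }

  // Add terms in priority order until the character budget is reached.
  const header = `Project: ${input.projectName.trim()}. Terms: `;
  const kept: string[] = [];
  let length = header.length;
  for (const term of terms) {
    const extra = term.length + (kept.length > 0 ? 2 : 0); // ", " separator
    if (length + extra + 1 > MAX_PROMPT_CHARS) break;       // +1 for the final "."
    kept.push(term);
    length += extra;
  }
  return `${header}${kept.join(", ")}.`;
}
```

The returned string would then be sent as the `prompt` field of the transcription request. Keeping exact identifier spellings such as `BAAgentService` in the prompt is what nudges Whisper toward reproducing them.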
Gemini Live (voice dial-in only)
For real-time dial-in, we use Gemini Live, which streams audio in and emits text plus semantic events. Latency is lower than with Whisper; transcription quality is comparable for conversational speech.
The Meeting object
Stored in Postgres:
- `id`: stable.
- `tenantId`, `projectId`: scope.
- `title`: inferred from filename / Meet calendar / user input.
- `source`: `upload | voice | paste`.
- `duration`: seconds; null for paste.
- `transcript`: full text.
- `speakers`: if parseable from source.
- `status`: `uploading | transcribing | ready | failed`.
- `audioUrl`: optional; only when `RETAIN_RECORDINGS=1`.
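For readers who prefer types, the field list above maps onto roughly the following TypeScript shape. This is an illustrative sketch, not the actual schema: the concrete types (string ids, `speakers` as a string array) and the `validateMeeting` helper are assumptions. Only the field names, the enum values, and the two stated constraints come from the list.

```typescript
// Illustrative types derived from the field list; not the actual schema.
type MeetingSource = "upload" | "voice" | "paste";
type MeetingStatus = "uploading" | "transcribing" | "ready" | "failed";

interface Meeting {
  id: string;              // assumed string id
  tenantId: string;
  projectId: string;
  title: string;
  source: MeetingSource;
  duration: number | null; // seconds; null for paste
  transcript: string;
  speakers?: string[];     // assumed shape; present only if parseable
  status: MeetingStatus;
  audioUrl?: string;       // only when RETAIN_RECORDINGS=1
}

// Hypothetical helper: checks the constraints stated in the field list.
function validateMeeting(m: Meeting, retainRecordings: boolean): string[] {
  const problems: string[] = [];
  if (m.source === "paste" && m.duration !== null) {
    problems.push("pasted transcripts must have a null duration");
  }
  if (m.duration !== null && m.duration < 0) {
    problems.push("duration must be non-negative");
  }
  if (m.audioUrl !== undefined && !retainRecordings) {
    problems.push("audioUrl is only allowed when RETAIN_RECORDINGS=1");
  }
  return problems;
}
```

An empty array from `validateMeeting` means the record is consistent with the documented constraints; each string otherwise names one violated rule.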
Lifecycle
    [source]
       │
       ▼
    Meeting.status = transcribing
       │
       ▼ (Whisper / Gemini Live)
    Transcript persisted
       │
       ▼
    Meeting.status = ready
       │
       ▼ (if FEATURE_AUTO_BRIEF=1)
    Brief drafting kicked off

Editing a transcript
Transcripts are versioned (`Transcript.version`). Edits produce a
new version; old versions are retained for audit. The chief-of-staff
uses the latest version on the next brief draft.
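The append-only behaviour described here can be sketched in memory as follows. The real data lives in Postgres, so this only illustrates the rule that edits never overwrite: `TranscriptHistory`, `TranscriptVersion`, and the choice to start numbering at 1 are assumptions.

```typescript
// In-memory illustration only; names and the starting version are assumptions.
interface TranscriptVersion {
  version: number;
  text: string;
  editedAt: Date;
}

class TranscriptHistory {
  private readonly versions: TranscriptVersion[] = [];
  private current: TranscriptVersion;

  constructor(initialText: string, now: Date = new Date()) {
    this.current = { version: 1, text: initialText, editedAt: now };
    this.versions.push(this.current);
  }

  // An edit appends a new version; earlier versions are never mutated.
  edit(newText: string, now: Date = new Date()): TranscriptVersion {
    this.current = { version: this.current.version + 1, text: newText, editedAt: now };
    this.versions.push(this.current);
    return this.current;
  }

  // What a brief draft would read: always the newest version.
  latest(): TranscriptVersion {
    return this.current;
  }

  // Full history, oldest first, for audit.
  all(): readonly TranscriptVersion[] {
    return this.versions;
  }
}
```

For example, correcting a misheard "be agent service" to "BAAgentService" through `edit` leaves version 1 available via `all()` while `latest()` returns version 2.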
Privacy flags
| Flag | Default | Effect |
|---|---|---|
| `RETAIN_RECORDINGS` | 0 | Keep raw audio after transcription. |
| `RETAIN_CALL_AUDIO` | 0 | Keep Twilio call audio (post-stream). |
| `WHISPER_BASE_URL` | unset | Use local Whisper endpoint. |
| `VOICE_CONSENT_GREETING` | 1 | Prepend a recording-consent notice. |
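A minimal sketch of reading these flags with the defaults from the table. `loadPrivacyFlags`, `PrivacyFlags`, and the rule that any value other than `"0"` or `"1"` falls back to the default are assumptions for illustration; the actual parsing may differ.

```typescript
// Hypothetical loader; only the flag names and defaults come from the table.
type FlagEnv = Record<string, string | undefined>;

interface PrivacyFlags {
  retainRecordings: boolean;
  retainCallAudio: boolean;
  whisperBaseUrl: string | null; // null means "unset"
  voiceConsentGreeting: boolean;
}

// Treats "1" as on and "0" as off; anything else uses the documented default.
function readBool(value: string | undefined, fallback: boolean): boolean {
  if (value === "1") return true;
  if (value === "0") return false;
  return fallback;
}

function loadPrivacyFlags(env: FlagEnv): PrivacyFlags {
  const baseUrl = env.WHISPER_BASE_URL?.trim();
  return {
    retainRecordings: readBool(env.RETAIN_RECORDINGS, false),
    retainCallAudio: readBool(env.RETAIN_CALL_AUDIO, false),
    whisperBaseUrl: baseUrl ? baseUrl : null,
    voiceConsentGreeting: readBool(env.VOICE_CONSENT_GREETING, true),
  };
}
```

With an empty environment this yields the privacy-preserving defaults: no audio retained, the cloud Whisper path, and the consent greeting on.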
Limitations
- No speaker diarization on phone dial-in (Gemini Live limitation).
- No real-time transcript visible during a call (UI polls every 5 s).
- Single language per meeting; no automatic language detection.
Typical latency
| Path | End-to-end | Notes |
|---|---|---|
| Paste | <100 ms | Just a DB insert. |
| Upload 10-min | ~30 s | Whisper inference + Drive download if relevant. |
| Upload 1-hour | ~90 s | Linear-ish in audio length. |
| Voice dial-in | transcript ready within 5 s of hangup | Live streaming means text is ready when the call ends. |