Voice & meetings

  1. Phone dial-in — Twilio + Gemini Live. Real-time speech-to-text, zero upload. See Voice dial-in.
  2. File upload — mp3 / m4a / webm. Whisper transcribes. See Uploading meetings.
  3. Paste transcript — skip transcription. Fastest path.

All three produce a Meeting row with a Transcript. Downstream is identical.

  • Cloud: OpenAI’s whisper-1 model. Requires OPENAI_API_KEY.
  • Self-hosted: any Whisper-compatible endpoint via WHISPER_BASE_URL=http://host:9000. Run whisper.cpp or Faster-Whisper.
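A minimal sketch of how the two backends above might be resolved from the environment. The names `resolveWhisperBackend` and `WhisperBackend` are hypothetical, and letting WHISPER_BASE_URL take precedence over the cloud key is an assumption rather than confirmed behaviour.

```typescript
// Hypothetical helper for illustration; not the project's actual code.
type Env = Record<string, string | undefined>;

type WhisperBackend =
  | { kind: "cloud"; model: "whisper-1"; apiKey: string }
  | { kind: "self-hosted"; baseUrl: string };

function resolveWhisperBackend(env: Env): WhisperBackend {
  const baseUrl = env["WHISPER_BASE_URL"]?.trim();
  if (baseUrl) {
    // Assumption: a configured self-hosted endpoint wins over the cloud key.
    return { kind: "self-hosted", baseUrl: baseUrl.replace(/\/+$/, "") };
  }
  const apiKey = env["OPENAI_API_KEY"]?.trim();
  if (!apiKey) {
    throw new Error("Set OPENAI_API_KEY (cloud) or WHISPER_BASE_URL (self-hosted).");
  }
  return { kind: "cloud", model: "whisper-1", apiKey };
}
```

Taking the environment as a parameter keeps the function easy to test; in a Node process it would typically be called with `process.env`.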

Before transcription, we build a domain prompt from:

  • The active project’s name.
  • Top-20 god-nodes from the project graph (technical terms, class names).
  • Frequently-referenced nouns from previous transcripts in the same project.

This prompt is passed as the prompt field to Whisper. The difference on technical meetings is significant — “BAAgentService” stops coming back as “be agent service.”

See backend/src/services/transcription/domain-prompt.ts.
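The prompt assembly described above could look roughly like the sketch below. `buildDomainPrompt`, `DomainPromptInput`, the output wording, and the `maxTerms` cap are all assumptions made for illustration; the real logic lives in domain-prompt.ts. The project name and every term other than BAAgentService are made-up sample data. OpenAI's speech-to-text guide notes that Whisper only considers the final 224 tokens of the prompt, which is why the sketch keeps the term list short.

```typescript
// Hypothetical shapes and names; see domain-prompt.ts for the real version.
interface DomainPromptInput {
  projectName: string;
  godNodes: string[];      // top terms from the project graph
  frequentNouns: string[]; // nouns from earlier transcripts in the project
}

function buildDomainPrompt(input: DomainPromptInput, maxTerms = 40): string {
  const seen = new Set<string>();
  const terms: string[] = [];
  // God-nodes come first (capped at 20), then frequent nouns.
  for (const raw of [...input.godNodes.slice(0, 20), ...input.frequentNouns]) {
    const term = raw.trim();
    const key = term.toLowerCase(); // de-duplicate case-insensitively
    if (term && !seen.has(key)) {
      seen.add(key);
      terms.push(term);
    }
  }
  const glossary = terms.slice(0, maxTerms).join(", ");
  return glossary
    ? `Project: ${input.projectName}. Terms: ${glossary}.`
    : `Project: ${input.projectName}.`;
}

const examplePrompt = buildDomainPrompt({
  projectName: "Acme Billing",
  godNodes: ["BAAgentService", "TenantRouter"],
  frequentNouns: ["tenantrouter", "invoice run"],
});
// examplePrompt === "Project: Acme Billing. Terms: BAAgentService, TenantRouter, invoice run."
```

The resulting string would then be sent as the prompt field of the transcription request.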

For real-time dial-in, we use Gemini Live, which streams audio in and emits text plus semantic events. It has lower latency than Whisper, and its transcription quality is comparable for conversational speech.

Each Meeting row is stored in Postgres with these fields:

  • id — stable.
  • tenantId, projectId — scope.
  • title — inferred from filename / Meet calendar / user input.
  • source — upload | voice | paste.
  • duration — seconds; null for paste.
  • transcript — full text.
  • speakers — if parseable from source.
  • status — uploading | transcribing | ready | failed.
  • audioUrl — optional; only when RETAIN_RECORDINGS=1.
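As a rough picture of the row described above, here is a TypeScript shape for it. The field names come from the list; the concrete types (string ids, `string[]` speakers, optional `audioUrl`) are assumptions, and the sample values are invented.

```typescript
// Field names follow the list above; the concrete types are assumptions.
type MeetingSource = "upload" | "voice" | "paste";
type MeetingStatus = "uploading" | "transcribing" | "ready" | "failed";

interface Meeting {
  id: string;
  tenantId: string;
  projectId: string;
  title: string;
  source: MeetingSource;
  duration: number | null; // seconds; null for paste
  transcript: string;
  speakers?: string[];     // present only if parseable from the source
  status: MeetingStatus;
  audioUrl?: string;       // only when RETAIN_RECORDINGS=1
}

// A pasted transcript: no audio, so no duration and no audioUrl.
const pastedMeeting: Meeting = {
  id: "mtg_1",
  tenantId: "tenant_1",
  projectId: "proj_1",
  title: "Weekly sync",
  source: "paste",
  duration: null,
  transcript: "Alice: Let's start with the release plan.",
  status: "ready",
};
```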

[source]
  ▼
Meeting.status = transcribing
  ▼ (Whisper / Gemini Live)
Transcript persisted
  ▼
Meeting.status = ready
  ▼ (if FEATURE_AUTO_BRIEF=1)
Brief drafting kicked off
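The flow above can be read as a small state machine. The sketch below is one way to encode it; the transition table, `advance`, and `shouldDraftBrief` are hypothetical, and the allowed transitions (including where `failed` can be reached from) are inferred from the status values, not confirmed by the source.

```typescript
// Hypothetical transition table inferred from the flow above.
type Status = "uploading" | "transcribing" | "ready" | "failed";

const ALLOWED: Record<Status, Status[]> = {
  uploading: ["transcribing", "failed"],
  transcribing: ["ready", "failed"],
  ready: [],
  failed: [],
};

// Returns the new status, or throws if the move is not in the table.
function advance(current: Status, next: Status): Status {
  if (!ALLOWED[current].includes(next)) {
    throw new Error(`Illegal transition: ${current} -> ${next}`);
  }
  return next;
}

// Assumption: the flag is enabled by the literal string "1".
function shouldDraftBrief(
  status: Status,
  env: Record<string, string | undefined>,
): boolean {
  return status === "ready" && env["FEATURE_AUTO_BRIEF"] === "1";
}
```

Keeping the allowed moves in one table makes it harder for a late Whisper callback to move a `failed` meeting back to `ready` by accident.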

Transcripts are versioned (Transcript.version). Each edit produces a new version; old versions are retained for audit. The chief-of-staff uses the latest version the next time it drafts a brief.
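The append-only behaviour can be sketched with an in-memory model. `TranscriptVersion`, `appendEdit`, and `latestVersion` are hypothetical names, and starting the counter at 1 is an assumption; the real rows live in Postgres.

```typescript
// Hypothetical in-memory model of append-only transcript versions.
interface TranscriptVersion {
  meetingId: string;
  version: number; // mirrors Transcript.version; assumed to start at 1
  text: string;
}

// Highest-numbered version for a meeting, or undefined if there is none.
function latestVersion(
  history: TranscriptVersion[],
  meetingId: string,
): TranscriptVersion | undefined {
  return history
    .filter((v) => v.meetingId === meetingId)
    .reduce<TranscriptVersion | undefined>(
      (best, v) => (best === undefined || v.version > best.version ? v : best),
      undefined,
    );
}

// An edit appends a new version; earlier versions are never touched.
function appendEdit(
  history: TranscriptVersion[],
  meetingId: string,
  text: string,
): TranscriptVersion[] {
  const previous = latestVersion(history, meetingId);
  const next: TranscriptVersion = {
    meetingId,
    version: (previous?.version ?? 0) + 1,
    text,
  };
  return [...history, next];
}
```

A brief drafter would call `latestVersion` when it starts, so an edit made after a draft only affects the next draft.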

| Flag | Default | Effect |
| --- | --- | --- |
| RETAIN_RECORDINGS | 0 | Keep raw audio after transcription. |
| RETAIN_CALL_AUDIO | 0 | Keep Twilio call audio (post-stream). |
| WHISPER_BASE_URL | unset | Use local Whisper endpoint. |
| VOICE_CONSENT_GREETING | 1 | Prepend a recording-consent notice. |
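A small sketch of reading these flags with the defaults from the table. `readVoiceFlags`, `readBool`, and `VoiceFlags` are hypothetical names, and treating "1" as on and any other non-empty value as off is an assumption about how the flags are parsed.

```typescript
// Hypothetical parser; defaults are taken from the flags table.
type FlagEnv = Record<string, string | undefined>;

interface VoiceFlags {
  retainRecordings: boolean;
  retainCallAudio: boolean;
  whisperBaseUrl: string | null;
  voiceConsentGreeting: boolean;
}

// Assumption: "1" enables a flag; unset or empty falls back to the default.
function readBool(env: FlagEnv, key: string, fallback: boolean): boolean {
  const raw = env[key];
  return raw === undefined || raw === "" ? fallback : raw === "1";
}

function readVoiceFlags(env: FlagEnv): VoiceFlags {
  return {
    retainRecordings: readBool(env, "RETAIN_RECORDINGS", false),
    retainCallAudio: readBool(env, "RETAIN_CALL_AUDIO", false),
    whisperBaseUrl: env["WHISPER_BASE_URL"] || null,
    voiceConsentGreeting: readBool(env, "VOICE_CONSENT_GREETING", true),
  };
}
```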

  • No speaker diarization on phone dial-in (Gemini Live limitation).
  • No real-time transcript visible during a call (UI polls every 5 s).
  • Single language per meeting; no automatic language detection.

| Path | End-to-end | Notes |
| --- | --- | --- |
| Paste | <100 ms | Just a DB insert. |
| Upload 10-min | ~30 s | Whisper inference + Drive download if relevant. |
| Upload 1-hour | ~90 s | Linear-ish in audio length. |
| Voice dial-in | transcript ready within 5 s of hangup | Live streaming means text is ready when the call ends. |
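To make "linear-ish" concrete, the two upload figures can be joined with a straight line: (90 s - 30 s) / (60 min - 10 min) = 1.2 s per minute of audio, plus about 18 s of fixed overhead. The sketch below is only an interpolation between those two approximate numbers, with the hypothetical name `estimateUploadSeconds`; it is not a measured model and says nothing reliable about audio far outside that range.

```typescript
// Rough interpolation only; the anchor points are the approximate
// upload figures quoted above, not new measurements.
const SHORT_UPLOAD = { minutes: 10, seconds: 30 };
const LONG_UPLOAD = { minutes: 60, seconds: 90 };

function estimateUploadSeconds(audioMinutes: number): number {
  // Seconds of processing per minute of audio (1.2 with these anchors).
  const slope =
    (LONG_UPLOAD.seconds - SHORT_UPLOAD.seconds) /
    (LONG_UPLOAD.minutes - SHORT_UPLOAD.minutes);
  // Fixed overhead implied by the line (about 18 s with these anchors).
  const intercept = SHORT_UPLOAD.seconds - slope * SHORT_UPLOAD.minutes;
  return Math.round(intercept + slope * audioMinutes);
}

// A 30-minute upload lands between the two quoted figures.
const halfHourEstimate = estimateUploadSeconds(30); // 54
```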