# Audio / Voice Notes — 2026-01-17

## What works
- Media understanding (audio): If audio understanding is enabled (or auto-detected), Clawdia:
  - Locates the first audio attachment (local path or URL) and downloads it if needed.
  - Enforces `maxBytes` before sending to each model entry.
  - Runs the first eligible model entry in order (provider or CLI).
  - If it fails or skips (size/timeout), it tries the next entry.
  - On success, it replaces `Body` with an `[Audio]` block and sets `{{Transcript}}`.
- Command parsing: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
- Verbose logging: In `--verbose`, we log when transcription runs and when it replaces the body.
## Auto-detection (default)

If you don't configure models and `tools.media.audio.enabled` is not set to `false`, Clawdia auto-detects in this order and stops at the first working option:

1. Local CLIs (if installed)
   - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
   - `whisper-cli` (from `whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
   - `whisper` (Python CLI; downloads models automatically)
2. Gemini CLI (`gemini`) using `read_many_files`
3. Provider keys (OpenAI → Groq → Deepgram → Google)
To disable audio transcription entirely, set `tools.media.audio.enabled: false`.
To customize, set `tools.media.audio.models`.
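
For instance, disabling looks like this (a minimal sketch assuming a JSON-style config file; only the `tools.media.audio.enabled` key comes from this page):

```jsonc
{
  "tools": {
    "media": {
      "audio": {
        // Turn off audio transcription and auto-detection entirely.
        "enabled": false
      }
    }
  }
}
```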
Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
## Config examples
### Provider + CLI fallback (OpenAI + Whisper CLI)
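
A sketch of a two-entry fallback chain. The `provider`, `model`, and `timeoutSeconds` keys appear elsewhere on this page; the `type` and `command` field names and the `{{MediaPath}}` placeholder are assumptions about the entry schema, not confirmed here:

```jsonc
{
  "tools": {
    "media": {
      "audio": {
        "models": [
          // Tried first: OpenAI hosted transcription.
          { "provider": "openai", "model": "gpt-4o-mini-transcribe" },
          // Fallback when the provider entry fails or skips (size/timeout).
          // "type", "command", and {{MediaPath}} are assumed names.
          {
            "type": "cli",
            "command": "whisper-cli -m /path/to/ggml-tiny.bin -f {{MediaPath}}",
            "timeoutSeconds": 60
          }
        ]
      }
    }
  }
}
```

Entries run in order, so the CLI only fires when the OpenAI entry fails or is skipped.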
### Provider-only with scope gating
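
A sketch of a single provider entry gated by scope rules. The `chatType` values (`direct`, `group`, `room`) and first-match-wins behavior come from the Gotchas below; the `scopes` rule shape itself is an assumption:

```jsonc
{
  "tools": {
    "media": {
      "audio": {
        "models": [
          {
            "provider": "openai",
            "model": "gpt-4o-transcribe",
            // Rule shape ("scopes", "enabled") is assumed; first matching
            // rule wins, and chatType is normalized to direct/group/room.
            "scopes": [
              { "chatType": "direct", "enabled": true },
              { "chatType": "group", "enabled": false }
            ]
          }
        ]
      }
    }
  }
}
```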
### Provider-only (Deepgram)
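
A minimal Deepgram-only chain; auth comes from `DEEPGRAM_API_KEY` (see notes below). The `nova-2` model name is illustrative, not taken from this page:

```jsonc
{
  "tools": {
    "media": {
      "audio": {
        "models": [
          // Auth resolves via DEEPGRAM_API_KEY (or models.providers.*.apiKey).
          // "nova-2" is an illustrative model name, not confirmed by this page.
          { "provider": "deepgram", "model": "nova-2" }
        ]
      }
    }
  }
}
```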
## Notes & limits
- Provider auth follows the standard model auth order (auth profiles, env vars, `models.providers.*.apiKey`).
- Deepgram picks up `DEEPGRAM_API_KEY` when `provider: "deepgram"` is used.
- Deepgram setup details: Deepgram (audio transcription).
- Audio providers can override `baseUrl`, `headers`, and `providerOptions` via `tools.media.audio`.
- Default size cap is 20MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried.
- Default `maxChars` for audio is unset (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output.
- OpenAI auto default is `gpt-4o-mini-transcribe`; set `model: "gpt-4o-transcribe"` for higher accuracy.
- Use `tools.media.audio.attachments` to process multiple voice notes (`mode: "all"` + `maxAttachments`); see the sketch after this list.
- Transcript is available to templates as `{{Transcript}}`.
- CLI stdout is capped (5MB); keep CLI output concise.
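
A sketch combining the limit knobs above (all key names appear in this list; the values are illustrative):

```jsonc
{
  "tools": {
    "media": {
      "audio": {
        "maxBytes": 20971520,   // 20MB, the default size cap
        "maxChars": 4000,       // trim long transcripts (unset by default)
        "attachments": {
          "mode": "all",        // process every voice note in the message
          "maxAttachments": 3
        }
      }
    }
  }
}
```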
## Gotchas
- Scope rules use first-match wins. `chatType` is normalized to `direct`, `group`, or `room`.
- Ensure your CLI exits 0 and prints plain text; if it emits JSON, massage it into plain text (e.g. `jq -r .text`).
- Keep timeouts reasonable (`timeoutSeconds`, default 60s) to avoid blocking the reply queue.
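
For the JSON-massaging case, a CLI entry might look like this sketch (`my-transcriber` is a hypothetical tool, and the `type`/`command`/`{{MediaPath}}` names are the same schema assumptions as in the config examples above):

```jsonc
{
  "tools": {
    "media": {
      "audio": {
        "models": [
          {
            "type": "cli",
            // Hypothetical JSON-emitting transcriber piped through jq so the
            // entry prints plain text and exits 0 on success.
            "command": "sh -c 'my-transcriber {{MediaPath}} | jq -r .text'",
            "timeoutSeconds": 60
          }
        ]
      }
    }
  }
}
```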
