Hermes Agent Voice Mode: STT/TTS Pipeline and Setup
Summary
Hermes Agent supports a full voice interaction layer spanning three distinct modes: interactive CLI microphone input, spoken replies in Telegram and Discord chat, and live conversation inside Discord voice channels. The feature is built on a pluggable speech-to-text (STT) and text-to-speech (TTS) stack with both local (free, private) and cloud (higher quality, paid) provider options. Voice mode is an optional capability layered on top of a working text-mode Hermes installation.
Details
Architecture and Modes
Voice mode in Hermes is not a single feature but three separate interaction surfaces with different setup requirements and use cases:
| Mode | Platform | Trigger |
|---|---|---|
| Interactive microphone loop | CLI | Ctrl+B push-to-record inside hermes |
| Voice replies in chat | Telegram, Discord | /voice on or /voice tts in a chat session |
| Live Discord voice channel bot | Discord VC | /voice join from a Discord text channel |
The recommended progression is: get text mode working first, then enable CLI voice, then messaging voice replies, and only then tackle Discord VC mode. Each step introduces additional system dependencies and configuration surface, so the incremental path keeps the debugging scope manageable.
Installation
Voice support ships as optional extras on top of the base hermes-agent package:
- pip install "hermes-agent[voice]": CLI microphone and playback
- pip install "hermes-agent[messaging]": Telegram/Discord gateway voice delivery
- pip install "hermes-agent[tts-premium]": ElevenLabs integration
- pip install "hermes-agent[all]": everything
System-level dependencies are also required. On Debian/Ubuntu:
sudo apt install portaudio19-dev ffmpeg libopus0 espeak-ng
portaudio handles microphone I/O for CLI mode; ffmpeg converts audio for TTS delivery and messaging platforms; opus/libopus0 is the Discord voice codec; espeak-ng is the phonemizer backend required by NeuTTS.
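Before you enable voice mode, it can help to confirm that these system dependencies are visible to Python. The following is a minimal sketch, not part of Hermes: the function name `check_voice_dependencies` is hypothetical, and it assumes the executables are named `ffmpeg` and `espeak-ng` and the shared libraries are discoverable as `portaudio` and `opus`.

```python
import ctypes.util
import shutil
from typing import Dict


def check_voice_dependencies() -> Dict[str, bool]:
    """Report which system dependencies appear to be installed.

    Hypothetical helper for illustration; Hermes may perform its own checks.
    """
    return {
        # Executables are looked up on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "espeak-ng": shutil.which("espeak-ng") is not None,
        # Shared libraries are looked up using the platform's linker conventions.
        "portaudio": ctypes.util.find_library("portaudio") is not None,
        "opus": ctypes.util.find_library("opus") is not None,
    }


if __name__ == "__main__":
    for name, found in check_voice_dependencies().items():
        print(f"{name:10s} {'ok' if found else 'MISSING'}")
```

A missing entry here does not prove the package is absent (library lookup varies by platform), but it is a quick first check before debugging audio problems.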
NeuTTS can be installed separately via python -m pip install -U neutts[all]. The hermes setup wizard detects whether it is present and offers to install it automatically, falling back to Edge TTS if the install fails or is skipped.
STT/TTS Provider Stack
Hermes supports a choice of providers at both ends of the pipeline. API keys for cloud providers go into ~/.hermes/.env.
STT options:
| Provider | Model | Cost | Key Required |
|---|---|---|---|
| local (Whisper) | base to large-v3 | Free | No |
| groq | whisper-large-v3-turbo | Free tier available | Yes (GROQ_API_KEY) |
| openai | whisper-1 | Paid | Yes (VOICE_TOOLS_OPENAI_KEY) |
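Since the cloud STT providers depend on keys stored in ~/.hermes/.env, a small check can show which providers are usable with the keys you have. This is an illustrative sketch only: `parse_env` and `usable_stt_providers` are hypothetical helpers, and the simple KEY=VALUE parsing is an assumption about the file format rather than a description of how Hermes loads it.

```python
from typing import Dict, List, Mapping, Optional

# Key each STT provider needs, taken from the table above; "local" needs none.
STT_REQUIRED_KEYS: Dict[str, Optional[str]] = {
    "local": None,
    "groq": "GROQ_API_KEY",
    "openai": "VOICE_TOOLS_OPENAI_KEY",
}


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def usable_stt_providers(env: Mapping[str, str]) -> List[str]:
    """Return providers whose required key is present and non-empty."""
    return [
        name
        for name, key in STT_REQUIRED_KEYS.items()
        if key is None or env.get(key)
    ]
```

For example, passing the parsed contents of a file that sets only GROQ_API_KEY yields the local and groq providers.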
TTS options:
| Provider | Quality | Cost | Latency |
|---|---|---|---|
| edge (Microsoft Edge TTS) | Good | Free | ~1s |
| neutts | Good | Free (local) | Varies |
| openai | Good | Paid | ~1.5s |
| elevenlabs | Excellent | Paid | ~2s |
The recommended default is local STT with the base Whisper model and Edge TTS with the en-US-AriaNeural voice. This setup costs nothing, needs no API keys, and gives acceptable quality for most workflows.
Configuration Reference
The voice subsystem is configured in ~/.hermes/config.yaml:
    voice:
      record_key: "ctrl+b"
      max_recording_seconds: 120
      auto_tts: false
      silence_threshold: 200
      silence_duration: 3.0

    stt:
      provider: "local"
      local:
        model: "base"

    tts:
      provider: "edge"
      edge:
        voice: "en-US-AriaNeural"

silence_threshold (RMS value) controls how sensitive the silence detector is; increase it (e.g. to 250) in noisy environments. silence_duration controls how long a pause must be before recording auto-stops; increase it to 4.0 or more if you pause frequently between sentences. record_key can be remapped if ctrl+b conflicts with tmux or terminal bindings.
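To make the interaction between silence_threshold and silence_duration concrete, here is a minimal sketch of RMS-based auto-stop. It is an assumption about the general technique, not Hermes's actual detector: the function names, the fixed frame length, and the strict less-than comparison are all illustrative.

```python
import math
from typing import Iterable, Optional, Sequence


def rms(samples: Sequence[int]) -> float:
    """Root-mean-square level of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def frames_until_auto_stop(
    frames: Iterable[Sequence[int]],
    frame_seconds: float,
    silence_threshold: float = 200,
    silence_duration: float = 3.0,
) -> Optional[int]:
    """Return how many frames are consumed before recording would stop.

    Returns None if the stream ends before enough continuous silence is seen.
    """
    silent_for = 0.0
    for count, frame in enumerate(frames, start=1):
        if rms(frame) < silence_threshold:
            silent_for += frame_seconds
            if silent_for >= silence_duration:
                return count
        else:
            # Any frame at or above the threshold resets the silence timer.
            silent_for = 0.0
    return None
```

In this sketch, raising silence_threshold makes more frames count as silence (useful when background noise keeps the timer from ever reaching its limit), while raising silence_duration requires a longer unbroken quiet stretch before stopping.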
CLI Voice Mode Pipeline
When voice mode is active in the CLI, the recording loop works as follows:
- User presses Ctrl+B; a beep confirms recording has started
- Audio is captured via PortAudio with a live RMS level bar
- After silence_duration seconds of continuous silence, recording stops (two beeps)
- Audio is transcribed via the configured STT provider (Whisper locally or a cloud API)
- The transcript is passed through the agent pipeline
- If TTS is enabled, the response is streamed sentence-by-sentence with markdown and <think> blocks stripped before synthesis
- Recording automatically restarts for continuous hands-free use
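The TTS preparation step in the list above (strip `<think>` blocks and markdown, then split into sentences) can be sketched as follows. This is a simplified illustration under stated assumptions: the regular expressions and the `speakable_sentences` name are hypothetical, and Hermes's real cleanup rules are likely more thorough.

```python
import re
from typing import List

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
LINK = re.compile(r"\[([^\]]+)\]\([^)]+\)")    # [label](url) -> label
MARKUP = re.compile(r"[*`#]+")                 # emphasis, inline code, headings
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")    # split after ., ! or ?


def speakable_sentences(text: str) -> List[str]:
    """Return plain sentences suitable for sending to a TTS engine one by one."""
    text = THINK_BLOCK.sub("", text)
    text = LINK.sub(r"\1", text)
    text = MARKUP.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in SENTENCE_END.split(text) if s]
```

Yielding one sentence at a time lets synthesis of the first sentence begin while later ones are still being prepared, which is the point of sentence-by-sentence streaming.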
A hallucination filter runs on Whisper output, suppressing 26 known phantom phrases that Whisper produces when given silence.
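A filter of this kind can be sketched as a normalized lookup against a phrase set. The phrases below are illustrative placeholders only; the actual 26-phrase list is not reproduced here, and `filter_transcript` is a hypothetical name.

```python
import re
from typing import Optional

# Illustrative entries only; NOT the real list used by Hermes.
PHANTOM_PHRASES = {"thank you", "thanks for watching", "you"}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def filter_transcript(text: str) -> Optional[str]:
    """Return the transcript, or None if it is empty or a known phantom phrase."""
    if normalize(text) in PHANTOM_PHRASES:
        return None
    return text.strip() or None
```

Matching on the whole normalized transcript, rather than on substrings, means a genuine utterance that merely contains one of the phrases is still passed through.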
Discord Voice Channel Mode
This mode requires additional Discord bot permissions beyond the standard text-bot setup:
- Bot permissions: Connect, Speak, Use Voice Activity (permissions integer 274881432640)
- Privileged Gateway Intents: Presence Intent, Server Members Intent (critical for voice speaker identification), Message Content Intent
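A permissions integer is a bitfield, so you can check it against the voice permissions listed above. The sketch below uses Discord's documented bit positions for Connect (bit 20), Speak (bit 21), and Use Voice Activity (bit 25); the helper name `missing_permissions` is hypothetical.

```python
from typing import Dict, List

# Bit flags from Discord's permission bitfield.
VOICE_PERMISSIONS: Dict[str, int] = {
    "CONNECT": 1 << 20,
    "SPEAK": 1 << 21,
    "USE_VAD": 1 << 25,  # "Use Voice Activity"
}


def missing_permissions(value: int, required: Dict[str, int] = VOICE_PERMISSIONS) -> List[str]:
    """Return the names of required permissions whose bit is not set in value."""
    return [name for name, bit in required.items() if not value & bit]


if __name__ == "__main__":
    print(missing_permissions(274881432640))
```

For the integer quoted above, this sketch finds Connect and Speak present and reports USE_VAD as missing, so it may be worth confirming that permission separately when generating the invite link.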
Once joined via /voice join, the bot listens to each user independently, detects speech boundaries (0.5s speech minimum, 1.5s silence to cut), transcribes, and speaks replies back into the channel. Transcripts also appear in the associated text channel as [Voice] @user: what you said.
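The speech-boundary rule (at least 0.5s of speech, cut after 1.5s of silence) can be illustrated with a small segmenter over per-frame voice-activity flags. This is a sketch under assumptions: it treats the 0.5s minimum as total voiced time within an utterance, takes precomputed boolean flags rather than raw audio, and uses hypothetical function names.

```python
from typing import Iterable, List, Optional, Tuple


def segment_utterances(
    voiced_flags: Iterable[bool],
    frame_seconds: float,
    min_speech: float = 0.5,
    cut_silence: float = 1.5,
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, of completed utterances.

    An utterance still open when the stream ends is not emitted.
    """
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None  # first voiced frame of the open utterance
    last_voiced = -1             # most recent voiced frame
    voiced_count = 0
    for i, voiced in enumerate(voiced_flags):
        if voiced:
            if start is None:
                start, voiced_count = i, 0
            voiced_count += 1
            last_voiced = i
        elif start is not None:
            if (i - last_voiced) * frame_seconds >= cut_silence:
                # Enough trailing silence: close the utterance, drop short blips.
                if voiced_count * frame_seconds >= min_speech:
                    segments.append((start, last_voiced + 1))
                start = None
    return segments


def format_voice_transcript(user: str, text: str) -> str:
    """Format a transcript line as it appears in the text channel."""
    return f"[Voice] @{user}: {text}"
```

Running one segmenter instance per speaker mirrors the per-user listening described above, since each user's silence timer is then independent of the others.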
For production-quality Discord VC use, the recommended stack is local large-v3 or Groq Whisper for STT and ElevenLabs for TTS. For zero-cost use, local STT + Edge TTS provides a workable setup.
Common Failure Modes
- “Bot joins VC but hears nothing”: check DISCORD_ALLOWED_USERS, verify privileged intents are enabled, confirm Connect/Speak permissions
- “Whisper outputs garbage”: increase silence_threshold, use a quieter environment, or try a larger model
- “Text response but no voice”: check TTS provider config and API quotas; verify ffmpeg is installed for Edge TTS conversion paths
- “Works in DMs but not server channels”: Discord server channels require @mention by default unless DISCORD_REQUIRE_MENTION=false is set
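The last failure mode comes down to a simple gating rule, sketched below. This is an illustration of the behavior described, not Hermes's code: `should_respond` is a hypothetical function, and treating any value other than "false" as "mention required" is an assumption about how the variable is parsed.

```python
from typing import Mapping


def should_respond(is_dm: bool, bot_mentioned: bool, env: Mapping[str, str]) -> bool:
    """Decide whether the bot should answer a Discord message."""
    # Assumption: mentions are required unless the variable is exactly "false".
    raw = env.get("DISCORD_REQUIRE_MENTION", "true")
    require_mention = raw.strip().lower() != "false"
    if is_dm:
        return True  # DMs are always addressed to the bot
    return bot_mentioned or not require_mention
```

With no variable set, a server-channel message without a mention is ignored, which matches the "works in DMs but not server channels" symptom.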