Hermes Agent Voice Mode: STT/TTS Pipeline and Setup

Summary

Hermes Agent supports a full voice interaction layer spanning three distinct modes: interactive CLI microphone input, spoken replies in Telegram and Discord chat, and live conversation inside Discord voice channels. The feature is built on a pluggable speech-to-text (STT) and text-to-speech (TTS) stack with both local (free, private) and cloud (higher quality, paid) provider options. Voice mode is an optional capability layered on top of a working text-mode Hermes installation.

Details

Architecture and Modes

Voice mode in Hermes is not a single feature but three separate interaction surfaces with different setup requirements and use cases:

Mode                           | Platform          | Trigger
-------------------------------+-------------------+------------------------------------------
Interactive microphone loop    | CLI               | Ctrl+B push-to-record inside hermes
Voice replies in chat          | Telegram, Discord | /voice on or /voice tts in a chat session
Live Discord voice channel bot | Discord VC        | /voice join from a Discord text channel

The recommended progression is: get text mode working first, then enable CLI voice, then messaging voice replies, and only then tackle Discord VC mode. Each step introduces additional system dependencies and configuration surface, so the incremental path keeps the debugging scope manageable.

Installation

Voice support ships as optional extras on top of the base hermes-agent package:

  • pip install "hermes-agent[voice]" — CLI microphone and playback
  • pip install "hermes-agent[messaging]" — Telegram/Discord gateway voice delivery
  • pip install "hermes-agent[tts-premium]" — ElevenLabs integration
  • pip install "hermes-agent[all]" — everything

System-level dependencies are also required. On Debian/Ubuntu:

sudo apt install portaudio19-dev ffmpeg libopus0 espeak-ng

portaudio handles microphone I/O for CLI mode; ffmpeg converts audio for TTS delivery and messaging platforms; opus/libopus0 is the Discord voice codec; espeak-ng is the phonemizer backend required by NeuTTS.
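
Before installing the Python extras, it can help to confirm that the system packages are visible. The following is a minimal preflight sketch, not part of Hermes: the function name check_voice_dependencies is hypothetical, and the library names passed to find_library ("portaudio", "opus") are assumptions about how the shared libraries are registered on a typical Linux system.

```python
import ctypes.util
import shutil


def check_voice_dependencies():
    """Return a dict mapping each system dependency to True (found) or False."""
    return {
        # Executables are looked up on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "espeak-ng": shutil.which("espeak-ng") is not None,
        # Shared libraries are located through the platform's usual search rules.
        "portaudio": ctypes.util.find_library("portaudio") is not None,
        "opus": ctypes.util.find_library("opus") is not None,
    }


if __name__ == "__main__":
    for name, found in check_voice_dependencies().items():
        print(f"{name:10s} {'ok' if found else 'MISSING'}")
```

A MISSING entry only means the lookup failed from Python; a library installed in a non-standard location may still work at runtime.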

NeuTTS can be installed separately via python -m pip install -U "neutts[all]" (the quotes keep shells such as zsh from treating the brackets as a glob pattern). The hermes setup wizard checks whether NeuTTS is present and, if it is not, offers to install it automatically, falling back to Edge TTS if the install fails or is skipped.

STT/TTS Provider Stack

Hermes supports a choice of providers at both ends of the pipeline. API keys for cloud providers go into ~/.hermes/.env.

STT options:

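Provider        | Model                  | Cost                | Key Required
----------------+------------------------+---------------------+-----------------------------
local (Whisper) | base to large-v3       | Free                | No
groq            | whisper-large-v3-turbo | Free tier available | Yes (GROQ_API_KEY)
openai          | whisper-1              | Paid                | Yes (VOICE_TOOLS_OPENAI_KEY)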
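
The key requirements in the STT table can be checked before starting a session. This is a sketch under stated assumptions: the helper names parse_env_file and missing_stt_key are hypothetical, and the parser assumes ~/.hermes/.env holds plain KEY=VALUE lines. Only the provider and key names come from the table.

```python
import os
from pathlib import Path

# Key names taken from the STT table; "local" needs no key.
STT_REQUIRED_KEY = {
    "local": None,
    "groq": "GROQ_API_KEY",
    "openai": "VOICE_TOOLS_OPENAI_KEY",
}


def parse_env_file(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_stt_key(provider, env):
    """Return the name of the missing key, or None if the provider is usable."""
    if provider not in STT_REQUIRED_KEY:
        raise ValueError(f"unknown STT provider: {provider}")
    key = STT_REQUIRED_KEY[provider]
    if key is None or env.get(key):
        return None
    return key


if __name__ == "__main__":
    env_path = Path.home() / ".hermes" / ".env"
    file_env = parse_env_file(env_path.read_text()) if env_path.exists() else {}
    # Variables already exported in the shell take precedence (an assumption).
    merged = {**file_env, **os.environ}
    for provider in STT_REQUIRED_KEY:
        print(provider, "->", missing_stt_key(provider, merged) or "ready")
```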

TTS options:

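Provider                  | Quality   | Cost         | Latency
--------------------------+-----------+--------------+--------
edge (Microsoft Edge TTS) | Good      | Free         | ~1s
neutts                    | Good      | Free (local) | Varies
openai                    | Good      | Paid         | ~1.5s
elevenlabs                | Excellent | Paid         | ~2s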

The default recommended configuration uses local STT with the base Whisper model and Edge TTS with the en-US-AriaNeural voice — this is zero-cost and requires no API keys while providing acceptable quality for most workflows.

Configuration Reference

The voice subsystem is configured in ~/.hermes/config.yaml:

voice:
  record_key: "ctrl+b"
  max_recording_seconds: 120
  auto_tts: false
  silence_threshold: 200
  silence_duration: 3.0
 
stt:
  provider: "local"
  local:
    model: "base"
 
tts:
  provider: "edge"
  edge:
    voice: "en-US-AriaNeural"

silence_threshold (RMS value) controls how sensitive the silence detector is — increase it (e.g. to 250) in noisy environments. silence_duration controls how long a pause must be before recording auto-stops; increase to 4.0+ if you pause frequently between sentences. record_key can be remapped if ctrl+b conflicts with tmux or terminal bindings.
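
To make the interaction between silence_threshold and silence_duration concrete, here is a minimal sketch of RMS-based auto-stop. It is an illustration rather than Hermes's actual detector: the function names, the fixed chunk length, and the choice to treat a chunk at exactly the threshold as speech are all assumptions.

```python
import math
from array import array


def rms(samples):
    """Root-mean-square level of a chunk of signed 16-bit PCM samples."""
    if len(samples) == 0:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def should_stop(chunks, chunk_seconds, silence_threshold=200, silence_duration=3.0):
    """True once the trailing run of quiet chunks spans silence_duration seconds."""
    quiet_chunks = 0
    for chunk in reversed(chunks):
        if rms(chunk) >= silence_threshold:
            break  # most recent speech found; stop counting
        quiet_chunks += 1
    return quiet_chunks * chunk_seconds >= silence_duration


if __name__ == "__main__":
    speech = array("h", [1000, -1000] * 80)   # RMS 1000, well above threshold
    hush = array("h", [10, -10] * 80)         # RMS 10, below threshold
    print(should_stop([speech] + [hush] * 6, chunk_seconds=0.5))
```

Raising silence_threshold makes more chunks count as quiet, which is why it helps when background noise keeps the recorder from stopping. Raising silence_duration requires a longer quiet run before the stop triggers.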

CLI Voice Mode Pipeline

When voice mode is active in the CLI, the recording loop works as follows:

  1. User presses Ctrl+B; a beep confirms recording has started
  2. Audio is captured via PortAudio with a live RMS level bar
  3. After silence_duration seconds of continuous silence, recording stops (two beeps)
  4. Audio is transcribed via the configured STT provider (Whisper locally or a cloud API)
  5. The transcript is passed through the agent pipeline
  6. If TTS is enabled, the response is streamed sentence-by-sentence with markdown and <think> blocks stripped before synthesis
  7. Recording automatically restarts for continuous hands-free use
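
Step 6 can be sketched as a text-cleaning pass followed by a sentence split. This is a simplified illustration with a hypothetical function name (speakable_sentences). The regular expressions handle only <think> blocks, inline links, and a few common markdown markers, and the real pipeline may strip more.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
MD_LINK = re.compile(r"\[([^\]]+)\]\([^)]+\)")   # [label](url) -> label
MD_MARKS = re.compile(r"[*`#]+")                  # emphasis, code ticks, headings
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def speakable_sentences(reply):
    """Strip reasoning blocks and markdown, then split into sentences for TTS."""
    text = THINK_BLOCK.sub(" ", reply)
    text = MD_LINK.sub(r"\1", text)
    text = MD_MARKS.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in SENTENCE_END.split(text) if s]


if __name__ == "__main__":
    reply = "<think>plan</think>**Done.** See [docs](http://x.y) now! Ok?"
    for sentence in speakable_sentences(reply):
        print(sentence)  # each sentence could be handed to the TTS provider in turn
```

Splitting by sentence lets playback of the first sentence begin while later ones are still being synthesized, which is the usual motivation for streaming TTS this way.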

A hallucination filter runs on Whisper output, suppressing 26 known phantom phrases that Whisper produces when given silence.
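
A filter of this kind can be sketched as a normalized set lookup. The phrases below are illustrative placeholders only, since the actual list of 26 phrases is not reproduced in this document. The function names and the decision to also drop empty transcripts are assumptions.

```python
import string

# Placeholder examples, NOT the real Hermes list.
PHANTOM_PHRASES = {
    "thank you",
    "thanks for watching",
    "you",
}

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace so variants match."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def filter_transcript(text):
    """Return None for an empty or phantom transcript, else the text unchanged."""
    key = normalize(text)
    if not key or key in PHANTOM_PHRASES:
        return None
    return text
```

Matching the whole normalized transcript, rather than searching for substrings, keeps legitimate sentences that merely contain one of the phrases.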

Discord Voice Channel Mode

This mode requires additional Discord bot permissions beyond the standard text-bot setup:

  • Bot permissions: Connect, Speak, Use Voice Activity (permissions integer 274881432640)
  • Privileged Gateway Intents: Presence Intent, Server Members Intent (critical for voice speaker identification), Message Content Intent
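
A permissions integer is a bitfield, so it can be decoded to see what an invite link actually grants. The sketch below uses the bit positions from Discord's documented permission flags (Connect is bit 20, Speak is bit 21, Use Voice Activity is bit 25). The helper names are hypothetical. Decoding the integer quoted above shows Connect and Speak set alongside several text-channel bits, while bit 25 is not set, so it may be worth confirming Use Voice Activity separately in the server's role settings.

```python
# Bit positions from Discord's documented permission flags.
CONNECT = 1 << 20
SPEAK = 1 << 21
USE_VAD = 1 << 25  # shown in the Discord UI as "Use Voice Activity"


def set_bits(value):
    """Return the positions of all set bits, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


def has_permission(value, flag):
    """True if every bit of flag is present in value."""
    return (value & flag) == flag


if __name__ == "__main__":
    permissions = 274881432640  # integer quoted in the bullet list above
    print("set bits:", set_bits(permissions))
    for name, flag in [("Connect", CONNECT), ("Speak", SPEAK), ("Use Voice Activity", USE_VAD)]:
        print(f"{name}: {has_permission(permissions, flag)}")
```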

Once joined via /voice join, the bot listens to each user independently, detects speech boundaries (0.5s speech minimum, 1.5s silence to cut), transcribes, and speaks replies back into the channel. Transcripts also appear in the associated text channel as [Voice] @user: what you said.
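
The speech-boundary rule (0.5s minimum speech, 1.5s of silence to cut) can be sketched as a small state machine, with one instance kept per speaking user. This is an illustration only: the class name, the frame-based interface, and the assumption that an upstream voice-activity detector supplies the is_speech flag are not taken from Hermes.

```python
from dataclasses import dataclass


@dataclass
class UtteranceSegmenter:
    min_speech: float = 0.5   # seconds of speech needed for a valid utterance
    cut_silence: float = 1.5  # seconds of silence that closes an utterance
    speech: float = 0.0       # speech accumulated in the current utterance
    silence: float = 0.0      # silence accumulated since the last speech frame

    def feed(self, is_speech, seconds):
        """Consume one frame; return True when an utterance is ready to transcribe."""
        if is_speech:
            self.speech += seconds
            self.silence = 0.0
            return False
        if self.speech == 0.0:
            return False  # nobody has spoken yet, nothing to cut
        self.silence += seconds
        if self.silence < self.cut_silence:
            return False
        long_enough = self.speech >= self.min_speech
        self.speech = self.silence = 0.0  # reset for the next utterance
        return long_enough


def format_transcript(user, text):
    """Render a transcript line in the format shown in the text channel."""
    return f"[Voice] @{user}: {text}"


if __name__ == "__main__":
    segmenters = {}  # one segmenter per user so speakers are tracked independently
    seg = segmenters.setdefault("alice", UtteranceSegmenter())
    for is_speech in [True, True, False, False, False]:
        if seg.feed(is_speech, 0.5):
            print(format_transcript("alice", "what you said"))
```

Utterances shorter than min_speech are dropped at the cut point, which filters out coughs and brief noises before they reach the STT provider.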

For production-quality Discord VC use, the recommended stack is local large-v3 or Groq Whisper for STT and ElevenLabs for TTS. For zero-cost use, local STT + Edge TTS provides a workable setup.

Common Failure Modes

  • “Bot joins VC but hears nothing”: check DISCORD_ALLOWED_USERS, verify privileged intents are enabled, confirm Connect/Speak permissions
  • “Whisper outputs garbage”: increase silence_threshold, use a quieter environment, or try a larger model
  • “Text response but no voice”: check TTS provider config and API quotas; verify ffmpeg is installed for Edge TTS conversion paths
  • “Works in DMs but not server channels”: Discord server channels require @mention by default unless DISCORD_REQUIRE_MENTION=false is set