Hermes Auxiliary Models

Summary

The Hermes Auxiliary Model system is a configuration framework that lets the Hermes Agent offload specialized side tasks, such as image analysis, web summarization, and session search, to lightweight or task-specific LLMs. By default these tasks use Gemini Flash via auto-detection, but each can be routed individually to providers including OpenRouter, Codex, or local OpenAI-compatible endpoints.

Details

The Hermes Agent distinguishes between its primary reasoning model and “Auxiliary Models” used for background or specialized operations. This separation ensures that high-latency or high-cost models are not wasted on simple tasks like summarizing a URL or analyzing a screenshot, while also allowing for multimodal capabilities (Vision) even if the main chat model is text-only.

Universal Configuration Pattern

Every auxiliary model slot in the Hermes configuration follows a standardized “three-knob” pattern:

  1. provider: Determines the authentication and routing logic. Defaults to "auto".
  2. model: Specifies the exact model string to request. If left blank, Hermes uses the provider’s default for that specific task.
  3. base_url: A custom OpenAI-compatible endpoint. When this is set, it takes precedence over the provider setting, allowing Hermes to point to local LLMs (e.g., via Ollama or vLLM) or specific proxy services.
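The precedence implied by the three knobs (base_url over provider, with "auto" as the fallback) can be sketched in a few lines. This is an illustrative sketch only; the function name, return shape, and the "custom" routing label are assumptions, not the actual Hermes implementation.

```python
# Illustrative sketch of the "three-knob" resolution order: base_url wins
# over provider, and provider defaults to "auto" when unset.

def resolve_endpoint(slot: dict) -> dict:
    """Resolve one auxiliary slot's routing from its three knobs."""
    provider = slot.get("provider", "auto")
    model = slot.get("model", "")  # empty string -> provider's task default
    base_url = slot.get("base_url")

    if base_url:
        # A custom base_url takes precedence over the provider setting.
        return {"routing": "custom", "base_url": base_url, "model": model}
    return {"routing": provider, "model": model}

# base_url wins even though a provider is also configured:
print(resolve_endpoint({"provider": "openrouter",
                        "base_url": "http://localhost:1234/v1",
                        "model": "qwen2.5-vl"}))
```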

Supported Task Slots

The system defines several specific slots under the auxiliary configuration block, each with its own timeout settings:

  • vision: Used for image and screenshot analysis. Defaults to a 30s timeout. It is often configured to use Gemini Flash or GPT-4o.
  • web_extract: Handles web page summarization and content extraction. (30s timeout).
  • approval: A specialized slot for internal agentic gates or confirmation logic. (30s timeout).
  • compression: Manages context window compression. This slot has a significantly higher default timeout of 120s to account for processing large amounts of text.
  • session_search: Powers the agent’s ability to query historical conversation data. (30s timeout).
  • skills_hub: Manages the retrieval and indexing of agent tools and capabilities. (30s timeout).
  • mcp: Handles interactions with the Model Context Protocol servers. (30s timeout).
  • flush_memories: Used during the process of committing short-term observations to long-term storage in the knowledge graph. (30s timeout).

Providers and Authentication

Hermes supports a wide array of providers for these tasks:

  • auto: The default setting which attempts to select the best available model based on the environment.
  • openrouter: Routes requests through OpenRouter; requires an OPENROUTER_API_KEY.
  • nous: Uses the Nous Portal, authenticated via hermes login.
  • codex: Uses Codex OAuth (ChatGPT Plus/Pro accounts). If Codex is the main provider, vision tasks are routed here automatically.
  • main: Forces the auxiliary task to use the same endpoint and credentials as the primary chat model.
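The provider list above can be read as a mapping from provider name to credential source. In the sketch below, only the provider names and the OPENROUTER_API_KEY variable come from the documentation; the mapping structure and helper function are illustrative assumptions.

```python
import os

# Which environment variable, if any, a provider needs. None means the
# provider authenticates some other way (OAuth, `hermes login`, or by
# reusing the main model's credentials).
PROVIDER_CREDENTIALS = {
    "openrouter": "OPENROUTER_API_KEY",
    "nous": None,   # authenticated via `hermes login`
    "codex": None,  # Codex OAuth (ChatGPT Plus/Pro)
    "main": None,   # reuses the primary chat model's endpoint/credentials
    "auto": None,   # selects the best available model at runtime
}

def missing_credentials(provider: str) -> bool:
    """True if the provider needs an env var that is not currently set."""
    env_var = PROVIDER_CREDENTIALS.get(provider)
    return env_var is not None and not os.environ.get(env_var)
```

A pre-flight check like this is one way a tool could fail fast with a clear message before a request is ever sent.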

Configuration Examples

Users can configure these models via config.yaml or environment variables. The YAML configuration is preferred for its support of timeouts and specific task overrides.

Example: Overriding Vision to use a local model

auxiliary:
  vision:
    base_url: "http://localhost:1234/v1"
    api_key: "local-key"
    model: "qwen2.5-vl"

Example: Using OpenRouter for Web Extraction

auxiliary:
  web_extract:
    provider: "openrouter"
    model: "google/gemini-flash-1.5"

For legacy support or quick overrides, environment variables such as AUXILIARY_VISION_MODEL or AUXILIARY_WEB_EXTRACT_PROVIDER can be used, though they do not support the full range of timeout configurations available in the YAML schema.