Context Management and Compression System

Summary

The Context Management system is a core architectural component of the Sokrates agent framework (Hermes/Eidos) designed to maintain conversation coherence within finite LLM context windows. It combines automated context compression via secondary LLM calls, memory character limits, and iteration budgeting to prevent token overflow and performance degradation.

Details

The system manages the “pressure” of the context window through several interlocking mechanisms: automated compression, memory persistence limits, and turn-based budgets.

Context Compression

When a conversation approaches the limits of the primary model’s context window, the system triggers an automated compression process. This process uses a separate, typically faster and cheaper LLM (defaulting to Gemini Flash) to summarize the historical conversation while preserving the most recent context.

The compression behavior is governed by the following parameters:

  • Threshold (0.50): The compression process is triggered when the context window reaches 50% capacity.
  • Target Ratio (0.20): The system attempts to compress the existing context down to 20% of its current size.
  • Protect Last N (20): The most recent 20 messages are never included in the compression summary, ensuring that the immediate conversational flow remains intact and high-fidelity.
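The interaction of these three parameters can be sketched as follows. This is an illustrative sketch only: the function names (should_compress, split_for_compression) and message representation are assumptions, not the framework's actual API.

```python
# Sketch of the compression policy described above (names are hypothetical).

COMPRESS_THRESHOLD = 0.50   # trigger once the window hits 50% capacity
TARGET_RATIO = 0.20         # aim to shrink history to ~20% of its size
PROTECT_LAST_N = 20         # never summarize the newest 20 messages

def should_compress(used_tokens: int, context_window: int) -> bool:
    """Trigger compression once the context window reaches 50% capacity."""
    return used_tokens / context_window >= COMPRESS_THRESHOLD

def split_for_compression(messages: list) -> tuple[list, list]:
    """Separate the compressible history from the protected recent tail."""
    if len(messages) <= PROTECT_LAST_N:
        return [], messages            # nothing old enough to summarize
    return messages[:-PROTECT_LAST_N], messages[-PROTECT_LAST_N:]
```

The older slice would be handed to the summary model with a target of roughly TARGET_RATIO of its current size, while the protected tail is passed through verbatim.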

Provider and Model Selection

The system supports multiple providers for the summary model, including nous, openrouter, codex, anthropic, and main (the primary model). By default, auto-detection logic selects the first available provider in the order OpenRouter > Nous > Codex, typically resolving to google/gemini-3-flash-preview.
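The auto-detection order can be sketched like this; the helper name and the shape of the fallback are assumptions for illustration, not the framework's real implementation.

```python
# Hypothetical sketch of the provider auto-detection order described above.

PRIORITY = ["openrouter", "nous", "codex"]  # checked in this order
DEFAULT_MODEL = "google/gemini-3-flash-preview"

def pick_summary_provider(available: set[str]):
    """Return the first available provider and the default summary model."""
    for provider in PRIORITY:
        if provider in available:
            return provider, DEFAULT_MODEL
    return "main", None  # fall back to the primary model itself
```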

Users can override these defaults in the configuration:

  • Specific Provider: Forcing summary_provider: nous with a specific summary_model.
  • Custom Endpoints: By setting summary_base_url, the system can point to self-hosted instances (e.g., Ollama), DeepSeek, or other OpenAI-compatible APIs. If a base URL is provided, the specific provider setting is ignored in favor of the custom endpoint.
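A configuration override might look like the following. The key names (summary_provider, summary_model, summary_base_url) come from the text above; the values, model names, and endpoint URL are purely illustrative.

```yaml
# Option A – force a specific provider (model name is an example):
summary_provider: nous
summary_model: Hermes-3-Llama-3.1-8B

---
# Option B – custom OpenAI-compatible endpoint. When summary_base_url
# is set, the provider setting is ignored in favor of this endpoint:
summary_base_url: http://localhost:11434/v1   # e.g. a local Ollama server
summary_model: deepseek-chat
```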

Memory Configuration

The system maintains distinct limits for different types of persistent memory to ensure they do not consume an outsized portion of the context window:

  • memory_char_limit: Set to 2200 characters (approximately 800 tokens).
  • user_char_limit: Set to 1375 characters (approximately 500 tokens) for user profile information.
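Both limits work out to roughly 2.75 characters per token. Enforcing them is a simple clamp, sketched below; the function name is hypothetical.

```python
# Sketch of enforcing the character limits above (names are illustrative).

MEMORY_CHAR_LIMIT = 2200   # ~800 tokens of persistent memory
USER_CHAR_LIMIT = 1375     # ~500 tokens of user profile information

def clamp(text: str, limit: int) -> str:
    """Truncate persisted text so it never exceeds its character budget."""
    return text if len(text) <= limit else text[:limit]
```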

Iteration Budgeting and Pressure Monitoring

To prevent infinite loops or excessive resource consumption, the system implements an iteration budget:

  • Max Turns: Defaults to 90 turns (agent.max_turns).
  • Warnings: The system injects a _budget_warning into the last tool result JSON when the agent reaches 70% (caution) and 90% (warning) of its turn limit.
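The warning injection can be sketched as below. The _budget_warning key and the 70%/90% thresholds come from the text; the function name and message wording are assumptions.

```python
import json

# Hypothetical sketch of the turn-budget warnings described above.

MAX_TURNS = 90          # agent.max_turns default
CAUTION_FRAC = 0.70     # inject a caution note at 70% of the budget
WARNING_FRAC = 0.90     # escalate to a warning at 90%

def annotate_tool_result(result_json: str, turn: int) -> str:
    """Add a _budget_warning field to the last tool result, if warranted."""
    payload = json.loads(result_json)
    frac = turn / MAX_TURNS
    if frac >= WARNING_FRAC:
        payload["_budget_warning"] = f"warning: {turn}/{MAX_TURNS} turns used"
    elif frac >= CAUTION_FRAC:
        payload["_budget_warning"] = f"caution: {turn}/{MAX_TURNS} turns used"
    return json.dumps(payload)
```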

For human operators, the CLI provides visual “Context Pressure” progress bars that appear once context window usage crosses the 60% threshold, with a higher-urgency state above 85%, providing real-time feedback on the proximity of a compression event.
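A minimal rendering of such a pressure bar is sketched below. The 60% and 85% thresholds are from the text; the bar's visual format and labels are assumptions.

```python
# Illustrative "Context Pressure" bar; the visual style is hypothetical.

SHOW_AT = 0.60      # bar appears above 60% context usage
URGENT_AT = 0.85    # styled as urgent above 85% usage

def pressure_bar(usage: float, width: int = 20) -> str:
    """Return a textual progress bar, or '' while pressure is low."""
    if usage < SHOW_AT:
        return ""
    filled = int(round(usage * width))
    label = "CRITICAL" if usage >= URGENT_AT else "caution"
    return f"[{'#' * filled}{'-' * (width - filled)}] {usage:.0%} {label}"
```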