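Context Compression and Caching

Hermes Agent uses a dual compression system together with Anthropic prompt caching to manage context-window usage efficiently across long conversations.

Source files: agent/context_engine.py (ABC), agent/context_compressor.py (default engine), agent/prompt_caching.py, gateway/run.py (session hygiene), run_agent.py (search for _compress_context)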

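Pluggable Context Engine

Context management is built on the ContextEngine ABC (agent/context_engine.py). The built-in ContextCompressor is the default implementation, but a plugin can swap in an alternative engine (for example, lossless context management).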

context:
  engine: "compressor"    # default: built-in lossy summarization
  # engine: "lcm"         # example: a plugin that provides lossless context
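
To make the plugin idea concrete, here is a minimal sketch of how an engine could be selected by name from the config. The method names (should_compress, compress), the ENGINES registry, the "noop" engine, and load_engine are all hypothetical; the real ABC in agent/context_engine.py may expose a different interface.

```python
from abc import ABC, abstractmethod


class ContextEngine(ABC):
    """Hypothetical interface; not the actual Hermes ABC."""

    @abstractmethod
    def should_compress(self, prompt_tokens: int) -> bool:
        """Decide whether the context needs to be reduced."""

    @abstractmethod
    def compress(self, messages: list) -> list:
        """Return a (possibly shorter) message list."""


class NoopEngine(ContextEngine):
    """Toy engine that never compresses; used only for illustration."""

    def should_compress(self, prompt_tokens: int) -> bool:
        return False

    def compress(self, messages: list) -> list:
        return list(messages)


# Hypothetical registry mapping the `context.engine` value to a class.
ENGINES = {"noop": NoopEngine}


def load_engine(config: dict) -> ContextEngine:
    """Instantiate the engine named under context.engine."""
    name = config["context"]["engine"]
    if name not in ENGINES:
        raise ValueError(f"unknown context engine: {name!r}")
    return ENGINES[name]()
```

A plugin would then only need to register its class under a new name for the config switch above to pick it up.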

Dual Compression System

Hermes has two compression layers that run independently:

                     ┌──────────────────────────┐
  Incoming message   │   Gateway Session Hygiene │  Fires at 85% of context
  ─────────────────► │   (pre-agent, rough est.) │  Safety net for large sessions
                     └─────────────┬────────────┘
                                   │
                                   ▼
                     ┌──────────────────────────┐
                     │   Agent ContextCompressor │  Fires at 50% of context (default)
                     │   (in-loop, real tokens)  │  Normal context management
                     └──────────────────────────┘

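1. Gateway Session Hygiene (85% threshold)

Located in gateway/run.py. This is a safety net that runs before the agent processes a message. It keeps sessions from growing too large between turns (for example, Telegram/Discord sessions that accumulate overnight).

2. Agent ContextCompressor (50% threshold, configurable)

Located in agent/context_compressor.py. This is the primary compression system. It runs inside the agent's tool loop and has access to accurate, API-reported token counts.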
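
The difference between the two layers can be sketched as two trigger checks: one working from a rough estimate, the other from a real token count. The function names and the 4-characters-per-token heuristic are assumptions made for illustration, not the actual code in gateway/run.py or agent/context_compressor.py.

```python
def estimate_tokens(messages: list) -> int:
    """Rough estimate: about 4 characters per token (an assumed heuristic)."""
    chars = 0
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, str):
            chars += len(content)
    return chars // 4


def gateway_needs_hygiene(messages: list, context_length: int,
                          threshold: float = 0.85) -> bool:
    """Pre-agent safety net: uses the rough estimate only."""
    return estimate_tokens(messages) >= context_length * threshold


def agent_needs_compression(prompt_tokens: int, context_length: int,
                            threshold: float = 0.50) -> bool:
    """In-loop check: uses the token count reported by the API."""
    return prompt_tokens >= context_length * threshold
```

Because the in-loop check fires at 50% while the gateway check waits for 85%, the gateway layer should normally matter only for sessions that grew while no agent loop was running.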

Configuration

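compression:
  enabled: true              # enable/disable compression (default: true)
  threshold: 0.50            # fraction of the context window (default: 0.50 = 50%)
  target_ratio: 0.20         # fraction of the threshold kept as tail (default: 0.20)
  protect_last_n: 20         # minimum number of protected tail messages (default: 20)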
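
As a worked example of how these settings translate into token budgets, the sketch below reads target_ratio as a fraction of the threshold token count, which is how the comment above describes it. CompressionConfig and token_budgets are hypothetical names, and the 200,000-token window is only an example value.

```python
from dataclasses import dataclass


@dataclass
class CompressionConfig:
    """Mirrors the documented defaults of the compression section."""
    enabled: bool = True
    threshold: float = 0.50
    target_ratio: float = 0.20
    protect_last_n: int = 20


def token_budgets(cfg: CompressionConfig, context_length: int) -> tuple:
    """Return (threshold_tokens, tail_tokens) for a given context window."""
    threshold_tokens = round(context_length * cfg.threshold)
    tail_tokens = round(threshold_tokens * cfg.target_ratio)
    return threshold_tokens, tail_tokens


# With a 200,000-token window and the defaults, compression would start at
# 100,000 tokens and about 20,000 tokens would be reserved for the tail.
budgets = token_budgets(CompressionConfig(), 200_000)
```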

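Compression Algorithm

The ContextCompressor.compress() method follows a four-phase algorithm:

Phase 1: Prune old tool results

Old tool results (longer than 200 characters) are replaced with: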

[Old tool output cleared to save context space]
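
A minimal sketch of this pruning step follows. What counts as "old" is not specified here, so the sketch assumes it means any tool message outside the last keep_last messages; that parameter, the function name, and the "tool" role check are illustrative assumptions.

```python
PLACEHOLDER = "[Old tool output cleared to save context space]"


def prune_old_tool_results(messages: list, keep_last: int = 20,
                           min_chars: int = 200) -> list:
    """Replace long tool outputs outside the recent window with a placeholder.

    Returns a new list; the input messages are not modified.
    """
    cutoff = max(len(messages) - keep_last, 0)
    pruned = []
    for index, msg in enumerate(messages):
        content = msg.get("content")
        is_old_tool_output = (
            index < cutoff
            and msg.get("role") == "tool"
            and isinstance(content, str)
            and len(content) > min_chars
        )
        if is_old_tool_output:
            msg = {**msg, "content": PLACEHOLDER}
        pruned.append(msg)
    return pruned
```

Pruning runs before summarization, so bulky tool output is dropped without spending any auxiliary-LLM tokens on it.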

Phase 2: Determine boundaries

┌─────────────────────────────────────────────────────────────┐
│  Message list                                               │
│                                                             │
│  [0..2]  ← protect_first_n (system + first exchange)        │
│  [3..N]  ← middle turns → SUMMARIZED                       │
│  [N..end] ← tail (by token budget OR protect_last_n)        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
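
One way to read the diagram is sketched below: the head is a fixed number of messages, and the tail grows backwards from the end until it holds at least protect_last_n messages and the token budget is used up. This interpretation of "token budget OR protect_last_n", along with the function name and parameters, is an assumption and may not match the real boundary logic.

```python
def find_boundaries(token_counts: list, protect_first_n: int = 3,
                    protect_last_n: int = 20,
                    tail_token_budget: int = 20_000) -> tuple:
    """Return (head_end, tail_start).

    messages[:head_end] is the protected head, messages[head_end:tail_start]
    is the middle to summarize, and messages[tail_start:] is the tail.
    token_counts[i] is the token count of message i.
    """
    total = len(token_counts)
    head_end = min(protect_first_n, total)
    tail_start = total
    used = 0
    # Walk backwards from the newest message, growing the tail.
    while tail_start > head_end:
        cost = token_counts[tail_start - 1]
        kept = total - tail_start
        # Stop only once the minimum count is met and the budget would overflow.
        if kept >= protect_last_n and used + cost > tail_token_budget:
            break
        used += cost
        tail_start -= 1
    return head_end, tail_start
```

If head_end equals tail_start, there is no middle section and nothing to summarize.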

Phase 3: Generate a structured summary

The middle turns are summarized by an auxiliary LLM using a structured template:

## Goal
[What the user is trying to accomplish]

## Constraints & Preferences
[User preferences, coding style, constraints, important decisions]

## Progress
### Done
[Completed work — specific file paths, commands run, results]
### In Progress
[Work currently underway]
### Blocked
[Any blockers or issues encountered]

## Key Decisions
[Important technical decisions and why]

## Relevant Files
[Files read, modified, or created — with brief note on each]

## Next Steps
[What needs to happen next]

## Critical Context
[Specific values, error messages, configuration details]

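Phase 4: Assemble the compressed messages

The compressed message list consists of:

Head messages (on the first compression, a note is appended to the system prompt)
Summary message (its role is chosen to avoid a consecutive-same-role violation)
Tail messages (unmodified)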
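
The assembly step can be sketched as follows. The wording of COMPRESSION_NOTE, the function names, and the fallback behaviour are hypothetical. In particular, when the head ends with one role and the tail starts with the other, no single role avoids a repeat; this sketch simply falls back to "user" and leaves that case unresolved.

```python
# Hypothetical wording; the real note text is not given in this document.
COMPRESSION_NOTE = "\n\n[Note: earlier turns were summarized to save context.]"


def pick_summary_role(prev_role, next_role) -> str:
    """Pick a role that differs from both neighbours when possible."""
    for role in ("user", "assistant"):
        if role != prev_role and role != next_role:
            return role
    return "user"  # unresolved conflict; see the lead-in above


def assemble(head: list, summary_text: str, tail: list,
             first_compression: bool = True) -> list:
    """Build head + summary + tail without mutating the inputs."""
    head = [dict(msg) for msg in head]
    if first_compression and head and head[0].get("role") == "system":
        head[0]["content"] = head[0]["content"] + COMPRESSION_NOTE
    prev_role = head[-1]["role"] if head else None
    next_role = tail[0]["role"] if tail else None
    summary = {
        "role": pick_summary_role(prev_role, next_role),
        "content": summary_text,
    }
    return [*head, summary, *tail]
```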

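Prompt Caching (Anthropic)

Source: agent/prompt_caching.py

Caching the conversation prefix reduces input token cost for multi-turn conversations by roughly 75%. This relies on Anthropic's cache_control breakpoints.

Strategy: system_and_3

Anthropic allows at most 4 cache_control breakpoints per request. Hermes uses the "system_and_3" strategy: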

Breakpoint 1: System prompt           (stable across all turns)
Breakpoint 2: 3rd-to-last non-system message  ─┐
Breakpoint 3: 2nd-to-last non-system message   ├─ Rolling window
Breakpoint 4: Last non-system message           ─┘
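
The breakpoint placement can be sketched as below. The sketch assumes OpenAI-style chat messages in which the system prompt is a message with role "system" (as accepted by OpenRouter), and it attaches Anthropic's {"type": "ephemeral"} cache_control marker to the last content block of each chosen message. The function names are hypothetical and this is not the code in agent/prompt_caching.py.

```python
import copy

CACHE_MARKER = {"type": "ephemeral"}


def _mark(message: dict) -> None:
    """Attach cache_control to the last content block of a message."""
    content = message.get("content")
    if isinstance(content, str):
        # Plain strings are promoted to a single text block so they can
        # carry the marker.
        message["content"] = [
            {"type": "text", "text": content, "cache_control": dict(CACHE_MARKER)}
        ]
    elif isinstance(content, list) and content:
        content[-1]["cache_control"] = dict(CACHE_MARKER)


def apply_system_and_3(messages: list) -> list:
    """Return a copy with up to 4 breakpoints: system + last 3 non-system."""
    marked = copy.deepcopy(messages)
    system = [m for m in marked if m.get("role") == "system"]
    non_system = [m for m in marked if m.get("role") != "system"]
    for message in system[:1] + non_system[-3:]:
        _mark(message)
    return marked
```

Because the last three breakpoints roll forward with the conversation, each new request can reuse the prefix cached by the previous one, while the system prompt breakpoint stays fixed.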

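Enabling Prompt Caching

Prompt caching is enabled automatically when the following conditions hold:

The model is an Anthropic Claude model (detected from the model name)
The provider supports cache_control (the native Anthropic API or OpenRouter)

The CLI displays the cache status at startup: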

💾 Prompt caching: ENABLED (Claude via OpenRouter, 5m TTL)

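Context Pressure Warning

The agent emits a context pressure warning at 85% of the compression threshold (not 85% of the context window: 85% of the threshold, which is itself 50% of the context window by default, so 42.5% of the window):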

⚠️  Context is 85% to compaction threshold (42,500/50,000 tokens)
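
The arithmetic behind this warning can be sketched as follows. The function name and parameters are hypothetical, and the 100,000-token window is chosen only so that the numbers match the example message above.

```python
def pressure_warning(prompt_tokens: int, context_length: int,
                     threshold: float = 0.50, warn_at: float = 0.85):
    """Return a warning string once usage reaches warn_at of the threshold."""
    threshold_tokens = round(context_length * threshold)
    if threshold_tokens <= 0:
        return None
    progress = prompt_tokens / threshold_tokens
    if progress < warn_at:
        return None
    return (
        f"⚠️  Context is {progress:.0%} to compaction threshold "
        f"({prompt_tokens:,}/{threshold_tokens:,} tokens)"
    )


# A 100,000-token window gives a 50,000-token threshold, so the warning
# first appears at 42,500 tokens, which is 42.5% of the full window.
message = pressure_warning(42_500, 100_000)
```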
