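Environments, Benchmarks, and Data Generation

This document covers Hermes Agent's test environments, benchmarking framework, and training data generation tools.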

Environments

Development Environment

# Create a virtual environment
uv venv venv --python 3.11
source venv/bin/activate  # Linux/macOS
# or
.\venv\Scripts\activate  # Windows

# Install all dependencies
uv pip install -e ".[all,dev]"

# Run the tests
pytest tests/ -v

Test Environment

Hermes uses pytest for unit and integration tests:

# Run all tests
pytest tests/ -v

# Run a specific module
pytest tests/agent/ -v

# Run with coverage
pytest tests/ --cov=src --cov-report=html

Benchmarks

Running Benchmarks

# Run all benchmarks
python -m benchmarks.run

# Run a specific suite
python -m benchmarks.run --suite agent

# Compare models
python -m benchmarks.compare --models claude-sonnet-4 gpt-4o

Benchmark Format

Benchmark definitions are written in YAML:

# benchmarks/agent_quality.yaml
name: agent_quality
description: Agent quality benchmarks
tasks:
  - name: code_generation
    prompt: "Write a Python function to sort a list"
    metrics:
      - syntax_valid
      - test_pass_rate
  - name: reasoning
    prompt: "Solve this logic puzzle..."
    metrics:
      - correctness
      - explanation_quality
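
As a rough illustration of how such a definition might be checked before a run, the sketch below validates a mapping shaped like the YAML above once it has already been parsed. The `Task`, `Suite`, and `parse_suite` names are hypothetical and are not part of the Hermes codebase; the YAML loading step itself is left out so the example stays within the standard library.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    prompt: str
    metrics: list[str] = field(default_factory=list)


@dataclass
class Suite:
    name: str
    description: str
    tasks: list[Task]


def parse_suite(raw: dict) -> Suite:
    """Build a Suite from an already-parsed mapping, failing loudly on gaps."""
    tasks = []
    for entry in raw.get("tasks", []):
        missing = {"name", "prompt", "metrics"} - entry.keys()
        if missing:
            raise ValueError(
                f"task {entry.get('name', '?')!r} is missing {sorted(missing)}"
            )
        tasks.append(Task(entry["name"], entry["prompt"], list(entry["metrics"])))
    if not tasks:
        raise ValueError("a suite needs at least one task")
    return Suite(raw["name"], raw.get("description", ""), tasks)


# Mirrors benchmarks/agent_quality.yaml as it would look after YAML parsing.
raw_suite = {
    "name": "agent_quality",
    "description": "Agent quality benchmarks",
    "tasks": [
        {
            "name": "code_generation",
            "prompt": "Write a Python function to sort a list",
            "metrics": ["syntax_valid", "test_pass_rate"],
        },
        {
            "name": "reasoning",
            "prompt": "Solve this logic puzzle...",
            "metrics": ["correctness", "explanation_quality"],
        },
    ],
}
suite = parse_suite(raw_suite)
```

Failing at load time on a missing `prompt` or `metrics` key keeps a malformed task from surfacing only halfway through a long benchmark run.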

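Evaluation Metrics

Metric	Description	How it is computed
`syntax_valid`	Generated code is syntactically valid	Parse the code into a Python AST
`test_pass_rate`	Fraction of tests that pass	Run the unit tests
`correctness`	Correctness of the answer	Human annotation or automated verification
`response_time`	Response time	End-to-end latency measurement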
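
The `syntax_valid` row describes parsing the output into a Python AST. A minimal sketch of that idea follows; the function name and signature are assumptions for illustration, and the benchmark framework's actual implementation may differ.

```python
import ast


def syntax_valid(source: str) -> bool:
    """Return True if `source` parses as Python. The code is never executed."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as embedded null bytes on older Pythons.
        return False
    return True


good = syntax_valid("def f(xs):\n    return sorted(xs)\n")
bad = syntax_valid("def f(:\n    pass\n")
```

Parsing checks only that the code is well-formed. It says nothing about whether the code does what the prompt asked, which is why the table pairs it with `test_pass_rate`.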

Data Generation

Batch Runner

The batch runner generates training data:

python batch_runner.py \
    --dataset_file=data/prompts.jsonl \
    --batch_size=20 \
    --run_name=my_run \
    --model=anthropic/claude-sonnet-4.6

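Trajectory Format

Generated data is written as JSONL, one JSON record per line. A single record, pretty-printed here for readability, looks like this: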

{
  "prompt_index": 0,
  "conversations": [
    {"from": "human", "value": "..."},
    {"from": "gpt", "value": "...", "tool_calls": [...]},
    {"from": "tool", "value": "..."}
  ],
  "completed": true,
  "tool_stats": {...}
}
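
To make the record layout concrete, here is a small sketch that reads trajectory lines and tallies completed runs and turns per speaker. It relies only on the `conversations`, `from`, and `completed` fields shown above; the `summarize` helper and the sample records are made up for illustration.

```python
import io
import json
from collections import Counter


def summarize(lines):
    """Count records, completed records, and turns per speaker across JSONL lines."""
    total = completed = 0
    turns = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        total += 1
        completed += bool(record.get("completed"))
        turns.update(turn["from"] for turn in record.get("conversations", []))
    return {"total": total, "completed": completed, "turns": dict(turns)}


# Two made-up records standing in for a real trajectory file.
sample = io.StringIO(
    '{"prompt_index": 0, "conversations": [{"from": "human", "value": "hi"}, '
    '{"from": "gpt", "value": "hello"}], "completed": true}\n'
    '{"prompt_index": 1, "conversations": [{"from": "human", "value": "ls"}], '
    '"completed": false}\n'
)
summary = summarize(sample)
```

Because `summarize` accepts any iterable of lines, an open file handle can be passed in place of the `io.StringIO` sample.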

Dataset Management

# List datasets
hermes data list

# Export a dataset
hermes data export trajectory_samples.jsonl

# Filter a dataset
hermes data filter --input data.jsonl --output filtered.jsonl --condition "completed == true"
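
When the CLI is not available, the same `completed == true` filter can be approximated in a few lines of Python. This is a sketch only: the CLI's condition syntax is not specified here, so the predicate is written as an ordinary Python callable, and `filter_jsonl` is a hypothetical helper name.

```python
import json


def filter_jsonl(lines, predicate):
    """Yield the raw JSONL lines whose decoded record satisfies `predicate`."""
    for line in lines:
        stripped = line.strip()
        if stripped and predicate(json.loads(stripped)):
            yield stripped


# Made-up records; a record without a "completed" key is treated as not completed.
records = [
    '{"prompt_index": 0, "completed": true}',
    '{"prompt_index": 1, "completed": false}',
    '{"prompt_index": 2}',
]
kept = list(filter_jsonl(records, lambda r: r.get("completed") is True))
```

Yielding the original line rather than re-serializing the record leaves the kept records unchanged apart from surrounding whitespace.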

Performance Profiling

Memory Profiling

# Use memory_profiler
python -m memory_profiler run_agent.py

# Generate a memory usage report
mprof run --multiprocess run_agent.py
mprof plot

CPU Profiling

# Use cProfile
python -m cProfile -o profile.stats run_agent.py

# Analyze the results
python -c "import pstats; p = pstats.Stats('profile.stats'); p.sort_stats('cumulative').print_stats(20)"
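
Profiling a whole script can bury the function of interest in startup noise. One alternative, sketched below with the standard `cProfile` and `pstats` modules, is to profile a single call from inside Python. The `profile_call` helper and the `busy_work` workload are illustrative stand-ins, not Hermes code.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=20, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def busy_work(n):
    # Stand-in workload; replace with the call you want to measure.
    return sum(i * i for i in range(n))


value, report = profile_call(busy_work, 10_000)
```

Printing `report` gives the same kind of table as the `pstats` one-liner above, restricted to the profiled call.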

Latency Profiling

# End-to-end latency benchmark
python benchmarks/latency.py --runs 100

# Per-stage latency breakdown
python benchmarks/latency.py --breakdown
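
For readers who want to see what an end-to-end latency measurement involves, the following sketch times repeated calls and reports the mean, median, and 95th percentile. It is not the contents of `benchmarks/latency.py`; `measure_latency` and its return keys are assumptions, and the per-stage breakdown is not covered.

```python
import statistics
import time


def measure_latency(func, runs=100):
    """Call func `runs` times and return latency statistics in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000)
    # 99 cut points; cuts[k - 1] approximates the k-th percentile.
    cuts = statistics.quantiles(samples, n=100)
    return {
        "runs": runs,
        "mean_ms": statistics.fmean(samples),
        "p50_ms": statistics.median(samples),
        "p95_ms": cuts[94],
        "max_ms": max(samples),
    }


# Trivial stand-in for an agent call.
stats = measure_latency(lambda: sum(range(1000)), runs=50)
```

Reporting a tail percentile next to the mean matters for agent workloads, where a few slow tool calls can dominate perceived responsiveness while barely moving the average.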

Continuous Integration

GitHub Actions

# .github/workflows/test.yml
name: Test
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -e ".[dev]"
      - name: Run tests
        run: pytest tests/ -v

Test Coverage

Coverage Targets

Agent core: 90%+
Tools system: 85%+
Gateway: 80%+
CLI: 75%+
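
One way to keep these targets honest is to compare measured per-component coverage against them in a script. The sketch below assumes the measured percentages are already available as a mapping; the component keys (`agent_core`, `tools`, `gateway`, `cli`) and the `coverage_gaps` helper are hypothetical names chosen for illustration.

```python
# Targets from the list above, keyed by hypothetical component names.
TARGETS = {"agent_core": 90.0, "tools": 85.0, "gateway": 80.0, "cli": 75.0}


def coverage_gaps(measured, targets=TARGETS):
    """Return {component: (measured, target)} for components below target.

    A component missing from `measured` counts as 0% covered.
    """
    gaps = {}
    for name, goal in targets.items():
        actual = measured.get(name, 0.0)
        if actual < goal:
            gaps[name] = (actual, goal)
    return gaps


gaps = coverage_gaps({"agent_core": 92.5, "tools": 81.0, "gateway": 80.0})
```

Here `tools` falls short of its target and `cli` is reported because no measurement was supplied for it.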

Generating a Coverage Report

pytest tests/ --cov=src --cov-report=html --cov-report=term
open htmlcov/index.html  # macOS; on Linux use xdg-open

Mocks and Test Doubles

Using mock

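from unittest.mock import Mock, patch

# execute_tool must be imported from the module under test;
# the source does not name that module.

def test_tool_execution():
    # Stand-in sandbox that reports a successful command.
    mock_sandbox = Mock()
    mock_sandbox.execute.return_value = {"success": True, "output": "result"}

    # Replace the sandbox factory for the duration of the call.
    with patch('sandbox.get_sandbox', return_value=mock_sandbox):
        result = execute_tool("terminal", {"command": "ls"})
        assert result["success"]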
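
The test above depends on Hermes modules, so it cannot run outside the repository. The same pattern in a self-contained form might look like the sketch below, where `run_command` is a hypothetical helper that receives the sandbox as an argument instead of looking it up through a patched factory.

```python
from unittest.mock import Mock


def run_command(sandbox, command):
    """Hypothetical helper: delegate to the sandbox and normalize its reply."""
    reply = sandbox.execute(command)
    return {"success": bool(reply.get("success")), "output": reply.get("output", "")}


def test_run_command_uses_sandbox():
    # The mock stands in for a real sandbox and records how it was called.
    sandbox = Mock()
    sandbox.execute.return_value = {"success": True, "output": "result"}

    result = run_command(sandbox, "ls")

    assert result == {"success": True, "output": "result"}
    sandbox.execute.assert_called_once_with("ls")


test_run_command_uses_sandbox()
```

Passing the dependency in directly avoids `patch` target strings, which break silently when a module is renamed. Asserting on `assert_called_once_with` checks the interaction as well as the return value.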

Using pytest fixtures

import pytest

@pytest.fixture
def mock_env(monkeypatch):
    monkeypatch.setenv("HERMES_HOME", "/tmp/test_hermes")
    monkeypatch.setenv("API_KEY", "test-key")
    return True
