Evaluation & Simulation

Score agent quality, run multi-turn simulations, and benchmark personality consistency.

Evaluating Responses

Score an agent's responses against a template rubric.

const result = await client.agents.evaluate("agent-id", {
  templateId: "template-id",
  messages: [
    { role: "user", content: "I'm feeling really stressed about work" },
    { role: "assistant", content: "I hear you. Work stress can be overwhelming..." },
  ],
});

console.log(result.score);       // 0-100
console.log(result.feedback);    // detailed feedback
console.log(result.categories);  // per-category scores

Evaluation Templates

Create scoring rubrics with weighted categories.

// Create a template
const template = await client.evalTemplates.create({
  name: "Empathy & Support",
  description: "Evaluates emotional intelligence and supportive responses",
  scoringRubric: "Score based on empathy, active listening, and actionable advice",
  categories: ["empathy", "active_listening", "actionable_advice"],
  judgeModel: "claude-sonnet-4-6",
  temperature: 0.3,
});

// List templates
const templates = await client.evalTemplates.list();
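
The create() call above passes categories as plain names, and how weights are attached is not shown here. As a minimal sketch of what weighted scoring means, assuming hypothetical per-category scores on a 0-100 scale and a hypothetical weight map (neither shape is confirmed by the SDK), an overall score could be computed as a weighted average:

```typescript
// Hypothetical shapes for illustration only; not the SDK's actual types.
type CategoryScores = Record<string, number>; // category -> score (0-100)
type CategoryWeights = Record<string, number>; // category -> relative weight

// Weighted average of category scores. Categories without an explicit
// weight default to 1; weights for categories with no score are ignored.
function weightedScore(scores: CategoryScores, weights: CategoryWeights = {}): number {
  let total = 0;
  let weightSum = 0;
  for (const [category, score] of Object.entries(scores)) {
    const weight = weights[category] ?? 1;
    total += score * weight;
    weightSum += weight;
  }
  return weightSum === 0 ? 0 : total / weightSum;
}

// Empathy counts twice as much as the other two categories.
const overall = weightedScore(
  { empathy: 90, active_listening: 80, actionable_advice: 60 },
  { empathy: 2, active_listening: 1, actionable_advice: 1 },
);
console.log(overall.toFixed(1));
```

Doubling the weight of empathy pulls the overall score toward that category, which is the effect a weighted rubric is meant to produce.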

Running Simulations

Run simulated multi-turn conversations to test agent behavior at scale.

for await (const event of client.agents.simulate("agent-id", {
  maxSessions: 3,
  maxTurnsPerSession: 10,
  simulatedDurationHours: 24,
  enableProactive: true,
  enableConsolidation: true,
  userPersonas: [
    {
      name: "Alex",
      background: "College student struggling with math",
      personalityTraits: ["anxious", "eager to learn"],
      communicationStyle: "casual, uses slang",
    },
  ],
})) {
  console.log(`[${event.type}] ${event.message}`);
  if (event.totalCostUsd) {
    console.log(`Cost so far: $${event.totalCostUsd}`);
  }
}
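
When a simulation emits many events, logging each one can be noisy. The sketch below keeps a running tally instead. It assumes only the fields used in the loop above (type, message, totalCostUsd) and treats totalCostUsd as cumulative; the SimEvent interface, the SimulationSummary class, and the sample event type names are hypothetical and not part of the SDK.

```typescript
// Hypothetical event shape, based only on the fields used in the loop above.
interface SimEvent {
  type: string;
  message?: string;
  totalCostUsd?: number;
}

// Accumulates a per-type event count and the most recent cost figure.
class SimulationSummary {
  readonly countsByType: Record<string, number> = {};
  latestCostUsd = 0;

  add(event: SimEvent): void {
    this.countsByType[event.type] = (this.countsByType[event.type] ?? 0) + 1;
    if (event.totalCostUsd !== undefined) {
      this.latestCostUsd = event.totalCostUsd;
    }
  }
}

// Stand-in events with made-up type names; with the SDK, call
// summary.add(event) inside the for await loop instead.
const summary = new SimulationSummary();
const sampleEvents: SimEvent[] = [
  { type: "session_start", message: "Session 1" },
  { type: "turn", message: "Alex: hey", totalCostUsd: 0.01 },
  { type: "turn", message: "Agent: hi!", totalCostUsd: 0.02 },
];
for (const event of sampleEvents) summary.add(event);
console.log(summary.countsByType, summary.latestCostUsd);
```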

Simulation + Evaluation (runEval)

Run a simulation and evaluate it in a single call.

for await (const event of client.agents.runEval("agent-id", {
  templateId: "template-id",
  maxSessions: 5,
  maxTurnsPerSession: 8,
})) {
  if (event.type === "evaluation") {
    console.log("Score:", event.score);
  }
}
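
One way to use these scores, for example as a quality gate in CI, is to collect the events and compare the mean score against a threshold. This is a sketch under assumptions: only the type and score fields appear in the loop above, and the EvalEvent interface, both helper functions, and the "progress" event type are hypothetical.

```typescript
// Hypothetical event shape: only `type` and `score` appear in the loop above.
interface EvalEvent {
  type: string;
  score?: number;
}

// Mean of all scores carried by "evaluation" events; null if there are none.
function meanEvaluationScore(events: EvalEvent[]): number | null {
  const scores = events
    .filter((e) => e.type === "evaluation" && typeof e.score === "number")
    .map((e) => e.score as number);
  if (scores.length === 0) return null;
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// True only when at least one evaluation was seen and the mean meets the bar.
function passesThreshold(events: EvalEvent[], minScore: number): boolean {
  const mean = meanEvaluationScore(events);
  return mean !== null && mean >= minScore;
}

// Stand-in events; with the SDK, push each streamed event into an array first.
const collected: EvalEvent[] = [
  { type: "progress" },
  { type: "evaluation", score: 82 },
  { type: "evaluation", score: 74 },
];
console.log(meanEvaluationScore(collected), passesThreshold(collected, 75));
```

Returning null when no evaluation event arrived keeps an empty run from being mistaken for a passing or failing one.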

Evaluation Runs

Track and manage simulation runs.

// List runs
const runs = await client.evalRuns.list({ agentId: "agent-id" });

// Get a specific run
const run = await client.evalRuns.get("run-id");

// Reconnect to a streaming run
for await (const event of client.evalRuns.streamEvents("run-id")) {
  console.log(event.type, event.message);
}

Async Simulations

Simulations also support an asynchronous mode via simulateAsync(), which returns a RunRef immediately so that you can poll or reconnect later.
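
The arguments to simulateAsync() and the fields of RunRef are not shown here, so the following is only a sketch of the polling half, written against a generic fetch function. The RunSnapshot interface, its status values, and the pollUntilDone helper are all hypothetical; with the SDK, the fetcher might wrap client.evalRuns.get(runId), assuming its result exposes a comparable status.

```typescript
// Hypothetical run snapshot; the real RunRef and run shapes are not shown in the source.
interface RunSnapshot {
  id: string;
  status: "running" | "completed" | "failed";
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls fetchRun until the run leaves the "running" state or attempts run out.
async function pollUntilDone(
  fetchRun: () => Promise<RunSnapshot>,
  intervalMs = 2000,
  maxAttempts = 30,
): Promise<RunSnapshot> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const run = await fetchRun();
    if (run.status !== "running") return run;
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`Run still in progress after ${maxAttempts} attempts`);
}

// Stand-in fetcher that reports completion on its third call.
let calls = 0;
const fakeFetch = async (): Promise<RunSnapshot> => ({
  id: "run-id",
  status: ++calls < 3 ? "running" : "completed",
});

pollUntilDone(fakeFetch, 10).then((run) => console.log(run.status));
```

Capping the number of attempts keeps a stalled run from blocking the caller indefinitely. For live progress rather than a final status, reconnecting with streamEvents() as shown earlier is the alternative.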
