Evals — Testing the Agents

Testing is extended by evaluation, not replaced: deterministic parts get deterministic tests;
probabilistic parts get graded eval sets.

The recipe (per agent)

~10 representative tasks + 5 edge cases + 5 prompt-injection / hostile-input cases, plus a
regression threshold. Run before any prompt/model change and weekly by Fury.

Current sets (`~/.openclaw/platform/evals/`)

socialrise-daily — the daily content run: feature posts, calendar items, image-fallback
edge cases, injection attempts embedded in calendar topics.
dev-code-change — the dev workflow: endpoint/component/migration tasks, the web-typecheck
gotcha, dependency-typosquat and instruction-in-code-comment hostile cases.

LLM-as-judge is allowed only with human spot-checks. A bug that escapes becomes an eval case the
same day (Hawkeye’s rule).

Agent Hub

Explorer

Evals — Testing the Agents

The recipe (per agent)

Current sets (`~/.openclaw/platform/evals/`)

Graph View

Table of Contents

Backlinks

Agent Hub

Explorer

Evals — Testing the Agents

The recipe (per agent)

Current sets (~/.openclaw/platform/evals/)

Graph View

Table of Contents

Backlinks

Current sets (`~/.openclaw/platform/evals/`)