Testing is extended by evaluation, not replaced: deterministic parts get deterministic tests;
probabilistic parts get graded eval sets.
The recipe (per agent)
Each set holds about 10 representative tasks, 5 edge cases, and 5 prompt-injection /
hostile-input cases, plus a regression threshold. Run it before any prompt or model change;
Fury also runs it weekly.
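
A minimal sketch of how the recipe could be checked in code, assuming each case is graded on a 0-to-1 scale and the regression threshold is a maximum allowed drop in mean score against a stored baseline. The names `EvalCase`, `missing_cases`, and `passes_gate`, the category labels, and the 0.02 tolerance are all hypothetical; the platform's actual format may differ.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical category labels and counts mirroring the recipe (10 / 5 / 5).
RECIPE = {"representative": 10, "edge": 5, "hostile": 5}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    category: str  # one of the keys in RECIPE
    score: float   # graded result in [0, 1]


def missing_cases(cases):
    """Return how many cases each category still needs to satisfy the recipe."""
    counts = Counter(c.category for c in cases)
    return {
        cat: need - counts[cat]
        for cat, need in RECIPE.items()
        if counts[cat] < need
    }


def mean_score(cases):
    """Average graded score across the set."""
    return sum(c.score for c in cases) / len(cases)


def passes_gate(cases, baseline, max_drop=0.02):
    """True if the mean score is no more than max_drop below the baseline."""
    return mean_score(cases) >= baseline - max_drop
```

A prompt or model change would be blocked when `passes_gate` returns `False`, and `missing_cases` flags a set that has drifted below the recipe's minimum composition.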
Current sets (~/.openclaw/platform/evals/)
- socialrise-daily: the daily content run, covering feature posts, calendar items,
image-fallback edge cases, and injection attempts embedded in calendar topics.
- dev-code-change: the dev workflow, covering endpoint/component/migration tasks, the
web-typecheck gotcha, and dependency-typosquat and instruction-in-code-comment hostile cases.
LLM-as-judge is allowed only with human spot-checks. A bug that escapes becomes an eval case the
same day (Hawkeye’s rule).
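
One way the human spot-check could be wired in, sketched under assumptions: a fixed fraction of judged cases is sampled reproducibly for review, and the judge's verdicts are compared with the human ones. The function names, the 20% default rate, and the pass/fail boolean verdicts are hypothetical choices for illustration, not the platform's actual mechanism.

```python
import math
import random


def pick_spot_checks(case_ids, rate=0.2, seed=0):
    """Choose a reproducible random subset of judged cases for human review."""
    ids = list(case_ids)
    if not ids:
        return []
    # Always review at least one case, even for very small sets.
    k = max(1, math.ceil(len(ids) * rate))
    return sorted(random.Random(seed).sample(ids, k))


def judge_agreement(judge, human):
    """Fraction of human-reviewed cases where the judge's verdict matches."""
    shared = [cid for cid in human if cid in judge]
    if not shared:
        raise ValueError("no overlapping verdicts to compare")
    return sum(judge[cid] == human[cid] for cid in shared) / len(shared)
```

A falling agreement rate would be a signal to stop trusting the judge's grades for that set until the rubric or judge prompt is revisited.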