When to run evals

Agent evals are most useful when output quality matters as much as code correctness.

Good moments to run them

On pull requests

Run evals in CI for PRs that change prompts, agent logic, tools, retrieval, or model settings.

Why:

Catch behavior regressions before merge
Keep expected quality stable over time
Make prompt changes reviewable with pass/fail feedback

Before releases

Run the full eval suite before deployments (for example tests/AgentEvals).

Why:

Validate critical flows after dependency/model updates
Confirm outputs still meet product expectations

During prompt iteration

Run a subset locally while editing prompt logic.

Why:

Tight feedback loop for faster prompt tuning
Immediate signal when a change breaks a known behavior

Suggested cadence

Small local run while developing: php artisan ai-evals:run --filter="..."
Full suite on PR and on merge to main
Optional scheduled nightly run for broader confidence

Cases where evals are especially valuable

Customer support answers (policy, compliance, tone)
Structured outputs that must include key facts
Agent workflows where wrong wording creates support or legal risk
High-impact user journeys (checkout, cancellation, refunds)