Define what 'good' means for your agents, then grade real decisions against it — rule-based scoring, no LLM required.
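
As a rough illustration of the idea, here is a minimal sketch of rule-based grading in Python. It is not this project's actual API: `Rule`, `grade`, the rule names, and the decision fields (`action`, `sources`, `cost_usd`, `approved`) are all hypothetical, chosen only to show how "good" can be written down as weighted checks and applied to a recorded decision without calling an LLM.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# A recorded agent decision, represented here as a plain dict (assumption).
Decision = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """One hypothetical criterion for what 'good' means."""
    name: str
    weight: float
    check: Callable[[Decision], bool]


def grade(decision: Decision, rules: List[Rule]) -> Dict[str, Any]:
    """Score a decision as the weighted fraction of rules it satisfies."""
    results = [(rule, bool(rule.check(decision))) for rule in rules]
    total = sum(rule.weight for rule, _ in results)
    earned = sum(rule.weight for rule, ok in results if ok)
    return {
        "score": earned / total if total else 0.0,
        "passed": [rule.name for rule, ok in results if ok],
        "failed": [rule.name for rule, ok in results if not ok],
    }


# Hypothetical definition of 'good' for a support agent.
RULES = [
    Rule("cites_source", 2.0, lambda d: bool(d.get("sources"))),
    Rule("within_budget", 1.0, lambda d: d.get("cost_usd", 0.0) <= 0.50),
    Rule(
        "no_unapproved_refund",
        1.0,
        lambda d: d.get("action") != "refund" or d.get("approved", False),
    ),
]

# A made-up decision to grade; a real one would come from agent logs.
decision = {
    "action": "refund",
    "sources": ["kb-12"],
    "cost_usd": 0.10,
    "approved": False,
}

report = grade(decision, RULES)
print(report)
```

Because every rule is an ordinary deterministic function, the same decision always gets the same grade, and the `failed` list says which criterion was missed.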