Use Test Results to Improve Your Agent
If your tests all pass, congratulations! However, if one or more tests fail, you have some work to do. For more information, look at the errorMessage field in a failed test result and the metricScore for each test result.
Use the conversation preview panel in the Agent Builder UI to talk to the active agent and test the words and responses in a conversational way. Then use that information to fine-tune your agent instructions, actions, or topics.
A topic test checks if the agent responded with the expected topic when it received the utterance. A topic test is defined by an expectation name of topic_sequence_match in the AiEvaluationDefinition metadata component. If the test fails, check the topic's expectedValue defined in AiEvaluationDefinition versus the topic that the agent actually used. The metricScore field is PASS or FAILED.
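For orientation, here's a minimal sketch of how a topic expectation could look inside a test case in AiEvaluationDefinition. Only the expectation name topic_sequence_match and the expectedValue field come from this section; the surrounding element names, the sample utterance, and the topic API name are illustrative assumptions, so check the AiEvaluationDefinition reference for the exact schema.

```xml
<!-- Illustrative sketch only: the element names around the expectation and the
     sample values are assumptions, not the authoritative schema. -->
<testCase>
    <inputs>
        <utterance>Can you help me reset my password?</utterance>
    </inputs>
    <expectation>
        <name>topic_sequence_match</name>
        <!-- Expected topic to compare against the topic the agent actually used. -->
        <expectedValue>Account_Support</expectedValue>
    </expectation>
</testCase>
```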
An action test verifies if the agent used the expected action or actions. An action test is defined by an expectation name of action_sequence_match in the AiEvaluationDefinition metadata component. If the test fails, check the action's expectedValue defined in AiEvaluationDefinition versus the action that the agent actually used. The metricScore field is PASS or FAILED.
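An action expectation follows the same pattern in this sketch. The action API name below is a placeholder, and the exact format for expressing a sequence of multiple actions is an assumption to verify against the schema reference.

```xml
<!-- Illustrative sketch: the action name is a placeholder, and the format for
     listing multiple expected actions is an assumption. -->
<expectation>
    <name>action_sequence_match</name>
    <!-- Expected action (or action sequence) to compare against what the agent invoked. -->
    <expectedValue>IdentifyCustomerByEmail</expectedValue>
</expectation>
```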
An outcome test semantically compares the expected and actual values using natural language. Even if the text of the actual outcome differs from the expected outcome, the test can still pass if the core meaning is the same. However, if the actual outcome is significantly different, the test fails. An outcome test is defined by an expectation name of bot_response_rating in the AiEvaluationDefinition metadata component. If the test fails, check the expectedValue defined in AiEvaluationDefinition versus the actual agent response. The metricScore field is PASS or FAILED.
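Because the outcome comparison is semantic, the expectedValue is written as natural language that captures the meaning you expect, not an exact string to match. A sketch, with purely illustrative wording:

```xml
<!-- Illustrative sketch: the expected outcome text is an example, not a required format. -->
<expectation>
    <name>bot_response_rating</name>
    <!-- Passes if the agent's reply conveys this meaning, even with different wording. -->
    <expectedValue>The agent confirms the order was located and summarizes its shipping status.</expectedValue>
</expectation>
```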
A test is coherent if the response is easy to understand and has no grammatical errors. This test type is defined by an expectation name of coherence in the AiEvaluationDefinition metadata component. If you use this quality check, you don't need an expectedValue field value. The metricScore field is PASS or FAILED.
A test is complete if the response includes all the essential information. This test type is defined by an expectation name of completeness in the AiEvaluationDefinition metadata component. If you use this quality check, you don't need an expectedValue field value. The metricScore field is PASS or FAILED.
A test is concise if the response is brief but comprehensive. Shorter is better. This test type is defined by an expectation name of conciseness in the AiEvaluationDefinition metadata component. If you use this quality check, you don't need an expectedValue field value. The metricScore field is PASS or FAILED.
A latency test returns the latency in milliseconds from sending a request until a response is received. This test type is defined by an expectation name of output_latency_milliseconds in the AiEvaluationDefinition metadata component. If you use this quality check, you don't need an expectedValue field value.
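Because the coherence, completeness, conciseness, and latency checks take no expectedValue, adding them is a matter of naming the expectation. A sketch, assuming these expectations can simply be listed alongside the others in a test case:

```xml
<!-- Illustrative sketch: quality and latency checks need only an expectation name. -->
<expectation>
    <name>coherence</name>
</expectation>
<expectation>
    <name>completeness</name>
</expectation>
<expectation>
    <name>conciseness</name>
</expectation>
<expectation>
    <name>output_latency_milliseconds</name>
</expectation>
```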
An instruction adherence test evaluates how well the generated responses follow the topic instructions. The instruction adherence test type is stored as a metric in the AiEvalTestCaseCritRslt metadata component. The generatedData field contains the response generated by the large language model (LLM), and the actualValue field contains the response that the AI agent actually provides, including any further modifications. The metricScore field contains one of three possible values: HIGH, LOW, or UNCERTAIN. See Instruction Adherence in Salesforce Help for more details.