Use Test Results to Improve Your Agent
If your tests all pass, congratulations! However, if one or more tests fail, you have some work to do. For more information, look at the errorMessage field in a failed test result and the metricScore for each test result.
Use the conversation preview panel in the Agent Builder UI to talk to the active agent and test the words and responses in a conversational way. Then use that information to fine-tune your agent instructions, actions, or topics.
A topic test checks if the agent responded with the expected topic when it received the utterance. A topic test is defined by an expectation name of topic_sequence_match in the AiEvaluationDefinition metadata component. If the test fails, check the topic's expectedValue defined in AiEvaluationDefinition versus the topic that the agent actually used. The metricScore field is PASS or FAILED.
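For orientation, here's a minimal sketch of how a topic expectation could look inside a test case in AiEvaluationDefinition. Only the expectation name topic_sequence_match and the expectedValue field come from this section; the surrounding element names, the sample utterance, and the topic API name are illustrative assumptions, so check the AiEvaluationDefinition reference for the exact schema.

```xml
<!-- Illustrative sketch only: the element names around the expectation and the
     sample values are assumptions, not the authoritative schema. -->
<testCase>
    <inputs>
        <utterance>Can you help me reset my password?</utterance>
    </inputs>
    <expectation>
        <name>topic_sequence_match</name>
        <!-- Expected topic to compare against the topic the agent actually used. -->
        <expectedValue>Account_Support</expectedValue>
    </expectation>
</testCase>
```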
An action test verifies if the agent used the expected action or actions. An action test is defined by an expectation name of action_sequence_match in the AiEvaluationDefinition metadata component. If the test fails, check the action's expectedValue defined in AiEvaluationDefinition versus the action that the agent actually used. The metricScore field is PASS or FAILED.
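An action expectation follows the same pattern in this sketch. The action API name below is a placeholder, and the exact format for expressing a sequence of multiple actions is an assumption to verify against the schema reference.

```xml
<!-- Illustrative sketch: the action name is a placeholder, and the format for
     listing multiple expected actions is an assumption. -->
<expectation>
    <name>action_sequence_match</name>
    <!-- Expected action (or action sequence) to compare against what the agent invoked. -->
    <expectedValue>IdentifyCustomerByEmail</expectedValue>
</expectation>
```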
An outcome test semantically compares the expected and actual values using natural language. Even if the text of the actual outcome differs from the expected outcome, the test can still pass if the core meaning is the same. However, if the actual outcome is significantly different, the test fails. An outcome test is defined by an expectation name of bot_response_rating in the AiEvaluationDefinition metadata component. If the test fails, check the expectedValue defined in AiEvaluationDefinition versus the actual agent response. The metricScore field is PASS or FAILED.
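Because the outcome comparison is semantic, the expectedValue is written as natural language that captures the meaning you expect, not an exact string to match. A sketch, with purely illustrative wording:

```xml
<!-- Illustrative sketch: the expected outcome text is an example, not a required format. -->
<expectation>
    <name>bot_response_rating</name>
    <!-- Passes if the agent's reply conveys this meaning, even with different wording. -->
    <expectedValue>The agent confirms the order was located and summarizes its shipping status.</expectedValue>
</expectation>
```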
A test is coherent if the response is easy to understand and has no grammatical errors. This test type is defined by an expectation name of coherence in the AiEvaluationDefinition metadata component. If you use this quality check, you don't need an expectedValue field value. The metricScore field is PASS or FAILED.
A test is complete if the response includes all the essential information. This test type is defined by an expectation name of completeness in the AiEvaluationDefinition metadata component. If you use this quality check, you don't need an expectedValue field value. The metricScore field is PASS or FAILED.
A test is concise if the response is brief but comprehensive. Shorter is better. This test type is defined by an expectation name of conciseness in the AiEvaluationDefinition metadata component. If you use this quality check, you don't need an expectedValue field value. The metricScore field is PASS or FAILED.
A latency test returns the latency in milliseconds from sending a request until a response is received. This test type is defined by an expectation name of output_latency_milliseconds in the AiEvaluationDefinition metadata component. If you use this quality check, you don't need an expectedValue field value.
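Because the coherence, completeness, conciseness, and latency checks take no expectedValue, adding them is a matter of naming the expectation. A sketch, assuming these expectations can simply be listed alongside the others in a test case:

```xml
<!-- Illustrative sketch: quality and latency checks need only an expectation name. -->
<expectation>
    <name>coherence</name>
</expectation>
<expectation>
    <name>completeness</name>
</expectation>
<expectation>
    <name>conciseness</name>
</expectation>
<expectation>
    <name>output_latency_milliseconds</name>
</expectation>
```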
An instruction adherence test evaluates how well the generated responses follow the topic instructions. The instruction adherence test type is stored as a metric in the AiEvalTestCaseCritRslt metadata component. The generatedData field contains the response generated by the large language model (LLM), and the actualValue field contains the response that the AI agent actually provides, including any further modifications. The metricScore field contains one of three possible values: HIGH, LOW, or UNCERTAIN. See Instruction Adherence in Salesforce Help for more details.