Create Custom Scorers for Agent Testing

Custom scorers let you define your own evaluation logic for agent testing. Standard Expectations cover common test scenarios, such as topic matching and response coherence, while custom scorers create expectations tailored to your business requirements.

Custom scorers evaluate agent behavior at different levels of granularity. A custom scorer can test an entire conversation, a single interaction, or a specific moment within an interaction. Scorers use a prompt template to evaluate agent behavior automatically with an LLM.

Use the AiAgentScorerDefinition Metadata API type to define custom scorers and deploy them to your org.

A custom scorer evaluates agent behavior and produces a result that maps to an outcome: pass, fail, or not applicable. Each scorer has two key components:

  • Engine: The evaluation logic. Use a PromptTemplate engine to assess the agent's behavior with an LLM.
  • Output mapping: Rules that translate the engine's result into a pass, fail, or not-applicable outcome.

In this guide, we define a custom scorer that uses a prompt template to detect whether a customer dropped off before the conversation resolved.

Before you begin, review these prerequisites:

  • Agentforce is enabled in your org with at least one active agent. See Set up Agents in Salesforce Help.
  • If your scorer uses the PromptTemplate engine type, the prompt template must already exist in your org or be deployed alongside the scorer. See Deploy a Scorer with a Prompt Template for details on deploying both together.

Create an AiAgentScorerDefinition metadata component to define your scorer. Store the component in the aiAgentScorerDefinitions folder with the .aiAgentScorerDefinition file suffix.

The inputScope field determines the data that the scorer evaluates:

  • Session: Evaluates the entire agent session, including all interactions.
  • Interaction: Evaluates a single interaction (one utterance-response pair).
  • Moment: Evaluates a specific moment within an interaction, such as an individual action invocation.

Agentforce Observability currently supports only Session scope at run time. To reference the latest interaction inside a session-scoped scorer, use the getLastInteraction invocable action within your prompt template.

The AiAgentScorerDefinition component has these fields:

  • inputScope (string): Required. The level of agent data the scorer evaluates. Valid values: Session, Interaction, Moment.
  • dataType (string): Required. The data type of the scorer's output. Valid values: Text, Number.
  • scorerVersion (scorerVersion[]): Required. The version configuration for the scorer. Scorers support multiple versions.

Version numbers must be sequential starting from 1, and each scorer supports a maximum of 100 versions.
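
To catch numbering mistakes before a deployment fails, you could run a small local check. The following Python sketch is not part of any Salesforce tool: the function name validate_version_numbers is hypothetical, and it assumes you have already read the versionNumber values out of your scorer file into a list.

```python
MAX_VERSIONS = 100  # documented per-scorer limit


def validate_version_numbers(version_numbers: list[int]) -> list[str]:
    """Return a list of problems; an empty list means the numbering is valid.

    Hypothetical local check: versionNumber values must form 1, 2, 3, ...
    with no gaps or duplicates, and a scorer can't exceed MAX_VERSIONS.
    """
    problems: list[str] = []
    if not version_numbers:
        # scorerVersion is a required field, so at least one version must exist.
        problems.append("at least one scorerVersion is required")
        return problems
    if len(version_numbers) > MAX_VERSIONS:
        problems.append(
            f"{len(version_numbers)} versions exceeds the maximum of {MAX_VERSIONS}"
        )
    # Sorting ignores the order in the file and checks only the set of numbers.
    expected = list(range(1, len(version_numbers) + 1))
    if sorted(version_numbers) != expected:
        problems.append(
            f"version numbers {sorted(version_numbers)} are not sequential from 1"
        )
    return problems


print(validate_version_numbers([1, 2, 3]))  # []
print(validate_version_numbers([1, 3]))
```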

Each scorerVersion element contains these fields:

  • versionNumber (integer): Required. The version number. Must be sequential starting from 1.
  • status (string): Required. The lifecycle status. Valid values: Available, Archived.
  • description (string): Required. A description of what the scorer evaluates.
  • label (string): Required. A display label for the scorer version.
  • agentAssociation (AgentAssociation): Required. Associates the scorer with a specific agent.
  • engine (engine[]): Required. The evaluation logic for the scorer.
  • outputEnumValue (outputEnumValue[]): Required. One or more mappings that translate engine output values to Pass, Fail, or NotApplicable outcomes.
  • specification (specification[]): Optional. Constraints on the scorer's output values, such as min, max, step, and threshold.

The agentAssociation element contains these fields:

  • isActive (boolean): Required. Whether the scorer is active for the associated agent. Can be true only for versions with Available status. Only one agent association per scorer can have isActive set to true.
  • agentApiName (string): Required. The API name of the agent. The agent must exist in the org. For example, Copilot_for_Salesforce.
  • samplingRate (double): Optional. A value greater than 0 and less than or equal to 1.0 that controls the sampling rate. Default is 1.0.
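
The agentAssociation rules lend themselves to a similar pre-deployment check. This is a minimal Python sketch under stated assumptions: VersionAssociation and check_associations are hypothetical names, each version is flattened into one record, and "only one active association" is read as applying across all versions of the scorer. It can't confirm that the agent exists in your org; only a deployment does that.

```python
from dataclasses import dataclass


@dataclass
class VersionAssociation:
    """Hypothetical flattened view of one scorerVersion and its agentAssociation."""

    version_number: int
    status: str                 # "Available" or "Archived"
    is_active: bool
    agent_api_name: str
    sampling_rate: float = 1.0  # documented default


def check_associations(versions: list[VersionAssociation]) -> list[str]:
    """Return rule violations; an empty list means no problems were found."""
    problems: list[str] = []
    active = [v.version_number for v in versions if v.is_active]
    if len(active) > 1:
        problems.append(f"only one association can be active, found versions {active}")
    for v in versions:
        if v.is_active and v.status != "Available":
            problems.append(f"version {v.version_number}: isActive requires Available status")
        if not 0 < v.sampling_rate <= 1.0:
            problems.append(f"version {v.version_number}: samplingRate must be > 0 and <= 1.0")
        if not v.agent_api_name:
            problems.append(f"version {v.version_number}: agentApiName is required")
    return problems


history = [
    VersionAssociation(1, "Archived", False, "Copilot_for_Salesforce"),
    VersionAssociation(2, "Available", True, "Copilot_for_Salesforce", 0.5),
]
print(check_associations(history))  # []
```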

The engine element contains these fields:

  • engineType (string): Required. The type of evaluation engine. Valid value: PromptTemplate.
  • engineRef (string): Required. The API name of the prompt template.

Each outputEnumValue mapping contains these fields:

  • isFallback (boolean): Whether this mapping is the default when no other mapping matches.
  • isFallbackSystem (boolean): Whether this mapping is the system-level fallback.
  • outcomeType (string): Optional. The test outcome. Valid values: Pass, Fail, NotApplicable. Default: NotApplicable.
  • value (string): The engine output value that maps to this outcome.
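
The following Python sketch illustrates one plausible reading of how these mappings resolve an engine result: an exact value match wins, then the entry marked isFallback, and otherwise the NotApplicable default. The matching order is an assumption rather than documented behavior, isFallbackSystem isn't modeled, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_OUTCOME = "NotApplicable"  # documented default for outcomeType


@dataclass
class OutputEnumValue:
    value: Optional[str]                 # engine output this entry matches
    outcome_type: str = DEFAULT_OUTCOME  # "Pass", "Fail", or "NotApplicable"
    is_fallback: bool = False


def resolve_outcome(engine_output: str, mappings: list[OutputEnumValue]) -> str:
    """Translate a raw engine output into a test outcome (assumed semantics)."""
    # 1. An entry whose value equals the engine output wins.
    for mapping in mappings:
        if mapping.value is not None and mapping.value == engine_output:
            return mapping.outcome_type
    # 2. Otherwise use the entry marked isFallback, if there is one.
    for mapping in mappings:
        if mapping.is_fallback:
            return mapping.outcome_type
    # 3. Assumption: no match and no fallback means NotApplicable.
    return DEFAULT_OUTCOME


drop_off_mappings = [
    OutputEnumValue("0", "Pass"),  # no drop-off
    OutputEnumValue("1", "Fail"),  # drop-off detected
    OutputEnumValue(None, is_fallback=True),
]
print(resolve_outcome("1", drop_off_mappings))       # Fail
print(resolve_outcome("unsure", drop_off_mappings))  # NotApplicable
```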

The specification element contains these fields:

  • max (double): The maximum valid output value.
  • min (double): The minimum valid output value.
  • step (double): The increment between valid output values.
  • threshold (double): Optional. Output values greater than or equal to the threshold pass.
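
To see how the specification fields constrain a numeric output, here is a hedged Python sketch. The threshold rule follows the description above; treating step as a grid that starts at min (or at 0 when min is unset) is an assumption, and Specification, is_valid_output, and passes_threshold are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Specification:
    min: Optional[float] = None
    max: Optional[float] = None
    step: Optional[float] = None
    threshold: Optional[float] = None


def is_valid_output(value: float, spec: Specification) -> bool:
    """Check the value against min, max, and (assumed) a step grid anchored at min."""
    if spec.min is not None and value < spec.min:
        return False
    if spec.max is not None and value > spec.max:
        return False
    if spec.step:
        origin = spec.min if spec.min is not None else 0.0
        steps = (value - origin) / spec.step
        # Allow for floating-point error when testing for a whole number of steps.
        if not math.isclose(steps, round(steps), abs_tol=1e-9):
            return False
    return True


def passes_threshold(value: float, spec: Specification) -> Optional[bool]:
    """Documented rule: values >= threshold pass. Returns None if no threshold is set."""
    if spec.threshold is None:
        return None
    return value >= spec.threshold


quality = Specification(min=0.0, max=1.0, step=0.25, threshold=0.75)
print(is_valid_output(0.5, quality), passes_threshold(0.5, quality))  # True False
print(is_valid_output(0.6, quality))                                  # False
```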

This example defines a custom scorer that evaluates whether a customer drops off before a conversation resolves. The scorer uses a prompt template to analyze the session and outputs a value of 0 (no drop-off, pass) or 1 (drop-off detected, fail).
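
A scorer file for this example could be produced as follows. This minimal Python sketch uses only the standard library and rests on several assumptions: the XML nests elements exactly as the field lists above describe, the root element uses the standard Metadata API namespace, and elements appear in alphabetical order, as Metadata API usually writes them when it retrieves components. Customer_Drop_Off_Detector is a hypothetical prompt template name, and the label and description text are illustrative; compare the output with a scorer retrieved from your org before relying on it.

```python
import xml.etree.ElementTree as ET

# Assumption: scorer files use the standard Metadata API namespace.
MD_NS = "http://soap.sforce.com/2006/04/metadata"


def add_text(parent: ET.Element, tag: str, text: str) -> ET.Element:
    """Append <tag>text</tag> to parent and return the new element."""
    child = ET.SubElement(parent, tag)
    child.text = text
    return child


def build_drop_off_scorer(template_api_name: str, agent_api_name: str) -> str:
    """Return scorer XML where engine output 0 passes and 1 fails."""
    root = ET.Element("AiAgentScorerDefinition", xmlns=MD_NS)
    add_text(root, "dataType", "Number")     # the engine returns 0 or 1
    add_text(root, "inputScope", "Session")  # the only scope supported at run time

    version = ET.SubElement(root, "scorerVersion")
    association = ET.SubElement(version, "agentAssociation")
    add_text(association, "agentApiName", agent_api_name)
    add_text(association, "isActive", "true")
    add_text(association, "samplingRate", "1.0")
    # Illustrative description and label text.
    add_text(version, "description",
             "Detects whether the customer dropped off before the conversation resolved.")
    engine = ET.SubElement(version, "engine")
    add_text(engine, "engineRef", template_api_name)
    add_text(engine, "engineType", "PromptTemplate")
    add_text(version, "label", "Customer Drop-Off")
    # One mapping per engine output value.
    for value, outcome in (("0", "Pass"), ("1", "Fail")):
        mapping = ET.SubElement(version, "outputEnumValue")
        add_text(mapping, "outcomeType", outcome)
        add_text(mapping, "value", value)
    add_text(version, "status", "Available")
    add_text(version, "versionNumber", "1")

    ET.indent(root)  # pretty-print; requires Python 3.9+
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Customer_Drop_Off_Detector is a hypothetical prompt template API name.
print(build_drop_off_scorer("Customer_Drop_Off_Detector", "Copilot_for_Salesforce"))
```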

To deploy a custom scorer, create a project directory that contains a package.xml file and an aiAgentScorerDefinitions folder holding your .aiAgentScorerDefinition file.
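
If you script your setup, this layout can be created with Python's standard library. The sketch assumes package.xml sits at the project root next to the aiAgentScorerDefinitions folder; the function name, the my-scorer-project folder, and the Customer_Drop_Off scorer name are hypothetical, and the file contents are placeholders.

```python
import tempfile
from pathlib import Path


def create_scorer_project(root: Path, scorer_name: str,
                          scorer_xml: str, package_xml: str) -> Path:
    """Write package.xml and one scorer file under root; return the scorer file path."""
    scorer_dir = root / "aiAgentScorerDefinitions"
    scorer_dir.mkdir(parents=True, exist_ok=True)
    scorer_file = scorer_dir / f"{scorer_name}.aiAgentScorerDefinition"
    scorer_file.write_text(scorer_xml, encoding="utf-8")
    (root / "package.xml").write_text(package_xml, encoding="utf-8")
    return scorer_file


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder contents; Customer_Drop_Off is a hypothetical scorer name.
    create_scorer_project(Path(tmp) / "my-scorer-project", "Customer_Drop_Off",
                          "<!-- scorer XML -->", "<!-- package.xml -->")
    print(sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*")))
```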

The package.xml file lists the scorer to deploy under the AiAgentScorerDefinition metadata type.

The members value must match the filename of your .aiAgentScorerDefinition file (without the extension).
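
As a sketch, this Python function writes a package.xml for a single scorer and derives members from the file name, so the two can't drift apart. The element layout (types, members, name, version) follows the standard Metadata API manifest format; the function name is hypothetical, and the API version in the usage line is a placeholder.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

MD_NS = "http://soap.sforce.com/2006/04/metadata"
SUFFIX = ".aiAgentScorerDefinition"


def build_package_xml(scorer_file: str, api_version: str) -> str:
    """Return a package.xml that deploys one scorer.

    The members value is the scorer file name without its extension.
    """
    name = Path(scorer_file).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"expected a {SUFFIX} file, got {name!r}")
    package = ET.Element("Package", xmlns=MD_NS)
    types = ET.SubElement(package, "types")
    ET.SubElement(types, "members").text = name.removesuffix(SUFFIX)  # Python 3.9+
    ET.SubElement(types, "name").text = "AiAgentScorerDefinition"
    ET.SubElement(package, "version").text = api_version
    ET.indent(package)  # Python 3.9+
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + ET.tostring(package, encoding="unicode")


# "64.0" is a placeholder; use an API version your org supports for this type.
print(build_package_xml("aiAgentScorerDefinitions/Customer_Drop_Off.aiAgentScorerDefinition", "64.0"))
```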

Deploy the scorer to your org with the Salesforce CLI.
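
A deployment can also be scripted. The following Python sketch assumes the project directory holds package.xml next to the aiAgentScorerDefinitions folder (metadata format), so it passes that folder to the CLI's --metadata-dir flag. The folder name and org alias are placeholders, and deploy_command and deploy are hypothetical helpers; confirm the flags with sf project deploy start --help for your CLI version.

```python
import shlex
import subprocess
from pathlib import Path


def deploy_command(project_dir: Path, target_org: str) -> list[str]:
    """Build an sf project deploy start call for a metadata-format directory."""
    return [
        "sf", "project", "deploy", "start",
        "--metadata-dir", str(project_dir),  # folder containing package.xml
        "--target-org", target_org,          # org alias or username
    ]


def deploy(project_dir: Path, target_org: str) -> None:
    """Run the deployment; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(deploy_command(project_dir, target_org), check=True)


# my-scorer-project and my-org are placeholders. This only prints the command;
# call deploy() to actually run it.
print(shlex.join(deploy_command(Path("my-scorer-project"), "my-org")))
# prints: sf project deploy start --metadata-dir my-scorer-project --target-org my-org
```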

If your scorer uses the PromptTemplate engine type, you can deploy both the template and the scorer together. Add a genAiPromptTemplates folder containing your prompt template definition, and add the template to package.xml.

In package.xml, the GenAiPromptTemplate type must appear before AiAgentScorerDefinition. Metadata API deploys types in the order in which they appear, and a scorer deploys successfully only if the prompt template it references already exists.
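
You can check this ordering rule mechanically before deploying. This Python sketch parses a package.xml with the standard library and reports whether GenAiPromptTemplate is listed before AiAgentScorerDefinition. The function names and the member names in the sample manifest are hypothetical, and the version value is a placeholder.

```python
import xml.etree.ElementTree as ET

NS = {"md": "http://soap.sforce.com/2006/04/metadata"}


def type_order(package_xml: str) -> list[str]:
    """Return the <name> of each <types> entry in document order."""
    root = ET.fromstring(package_xml)
    return [t.findtext("md:name", default="", namespaces=NS)
            for t in root.findall("md:types", NS)]


def template_precedes_scorer(package_xml: str) -> bool:
    """True unless AiAgentScorerDefinition is listed before GenAiPromptTemplate."""
    names = type_order(package_xml)
    if "GenAiPromptTemplate" not in names or "AiAgentScorerDefinition" not in names:
        return True  # if either type is missing, there is nothing to order
    return names.index("GenAiPromptTemplate") < names.index("AiAgentScorerDefinition")


# Member names are hypothetical; 64.0 is a placeholder API version.
EXAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
    <types>
        <members>Customer_Drop_Off_Detector</members>
        <name>GenAiPromptTemplate</name>
    </types>
    <types>
        <members>Customer_Drop_Off</members>
        <name>AiAgentScorerDefinition</name>
    </types>
    <version>64.0</version>
</Package>"""

print(type_order(EXAMPLE))                # ['GenAiPromptTemplate', 'AiAgentScorerDefinition']
print(template_precedes_scorer(EXAMPLE))  # True
```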

To retrieve a scorer definition from your org, use the Salesforce CLI.

You can also retrieve a scorer definition by using the same package.xml that you used for deployment.
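
Both retrieval styles can be scripted. The following Python sketch only builds the sf project retrieve start argument lists; it assumes you want metadata-format output in a local folder (--target-metadata-dir), and the scorer name, org alias, and folder are placeholders. Run sf project retrieve start --help to confirm these flags for your CLI version.

```python
import shlex


def retrieve_command(scorer_name: str, target_org: str, output_dir: str) -> list[str]:
    """Retrieve one scorer by Type:Name into a metadata-format directory."""
    return [
        "sf", "project", "retrieve", "start",
        "--metadata", f"AiAgentScorerDefinition:{scorer_name}",
        "--target-org", target_org,
        "--target-metadata-dir", output_dir,
    ]


def retrieve_with_manifest_command(manifest: str, target_org: str, output_dir: str) -> list[str]:
    """Retrieve everything listed in an existing package.xml."""
    return [
        "sf", "project", "retrieve", "start",
        "--manifest", manifest,
        "--target-org", target_org,
        "--target-metadata-dir", output_dir,
    ]


# Placeholders: Customer_Drop_Off, my-org, and retrieved are hypothetical names.
print(shlex.join(retrieve_command("Customer_Drop_Off", "my-org", "retrieved")))
print(shlex.join(retrieve_with_manifest_command("package.xml", "my-org", "retrieved")))
```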

To update an existing scorer, modify the .aiAgentScorerDefinition file and redeploy. Keep in mind these constraints:

  • You can add new versions to a scorer, but you can't delete existing versions.
  • You can update a version's status value (for example, from Available to Archived).
  • You can update the agentAssociation isActive and samplingRate values.
  • Deployment matches scorers by the members name in package.xml. If a scorer with that name already exists, the deployment updates the existing scorer.
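
These update rules can be checked by comparing the versions before and after your edit. In this Python sketch, which illustrates the documented constraints rather than reproducing Salesforce behavior, each side is a hypothetical {versionNumber: status} dictionary; the check flags deleted versions, gaps in numbering, and status values other than Available or Archived.

```python
VALID_STATUSES = {"Available", "Archived"}


def check_update(existing: dict[int, str], updated: dict[int, str]) -> list[str]:
    """Compare {versionNumber: status} maps from before and after an edit."""
    problems: list[str] = []
    removed = sorted(set(existing) - set(updated))
    if removed:
        problems.append(f"existing versions can't be deleted: {removed}")
    if sorted(updated) != list(range(1, len(updated) + 1)):
        problems.append("version numbers must stay sequential from 1")
    for number, status in sorted(updated.items()):
        if status not in VALID_STATUSES:
            problems.append(f"version {number}: invalid status {status!r}")
    return problems


# Archiving version 1 and adding version 2 is an allowed update.
print(check_update({1: "Available"}, {1: "Archived", 2: "Available"}))  # []
# Dropping version 2 is not.
print(check_update({1: "Available", 2: "Available"}, {1: "Available"}))
```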