Create Custom Scorers for Agent Testing

Custom scorers let you define your own evaluation logic for agent testing. Standard Expectations cover common test scenarios, such as topic matching and response coherence; custom scorers let you create expectations tailored to your business requirements.

Custom scorers evaluate agent behavior at different levels of granularity. A custom scorer can test an entire conversation, a single interaction, or a specific moment within an interaction. Scorers use a prompt template to evaluate agent behavior automatically with an LLM.

Use the AiAgentScorerDefinition Metadata API type to define custom scorers and deploy them to your org.

A custom scorer evaluates agent behavior and produces a result that maps to an outcome: pass, fail, or not applicable. Each scorer has two key components:

Engine: The evaluation logic. Use a PromptTemplate engine to assess the agent’s behavior with an LLM.
Output mapping: Rules that translate the engine’s result into a pass, fail, or not-applicable outcome.

In this guide, we define a custom scorer that uses a prompt template to detect whether a customer dropped off before the conversation resolved.

Before you begin, confirm these prerequisites:

Agentforce is enabled in your org with at least one active agent. See Set up Agents in Salesforce Help.
If your scorer uses the PromptTemplate engine type, the prompt template must already exist in your org or be deployed alongside the scorer. See Deploy a Scorer with a Prompt Template for details on deploying both together.

Create an AiAgentScorerDefinition metadata component to define your scorer. Store the component in the aiAgentScorerDefinitions folder, in a file with the .aiAgentScorerDefinition suffix.

The inputScope field determines the data that the scorer evaluates:

Scope	Description
`Session`	Evaluates the entire agent session, including all interactions.
`Interaction`	Evaluates a single interaction (one utterance-response pair).
`Moment`	Evaluates a specific moment within an interaction, such as an individual action invocation.

Agentforce Observability currently supports only Session scope at run time. To reference the latest interaction inside a session-scoped scorer, use the getLastInteraction invocable action within your prompt template.

The AiAgentScorerDefinition component has these top-level fields:

Field	Type	Description
`inputScope`	string	Required. The level of agent data the scorer evaluates. Valid values: `Session`, `Interaction`, `Moment`.
`dataType`	string	Required. The data type of the scorer’s output. Valid values: `Text`, `Number`.
`scorerVersion`	scorerVersion[]	Required. The version configuration for the scorer. Scorers support multiple versions.

Version numbers must be sequential starting from 1, and each scorer supports a maximum of 100 versions.

Each scorerVersion entry has these fields:

Field	Type	Description
`versionNumber`	integer	Required. The version number. Must be sequential starting from `1`.
`status`	string	Required. The lifecycle status. Valid values: `Available`, `Archived`.
`description`	string	Required. A description of what the scorer evaluates.
`label`	string	Required. A display label for the scorer version.
`agentAssociation`	AgentAssociation	Required. Associates the scorer with a specific agent.
`engine`	engine[]	Required. The evaluation logic for the scorer.
`outputEnumValue`	outputEnumValue[]	Required. One or more mappings that translate engine output values to pass, fail, or not-applicable outcomes.
`specification`	specification[]	Optional. Constraints on the scorer’s output values, such as min, max, step, and threshold.

The agentAssociation field has these subfields:

Field	Type	Description
`isActive`	boolean	Required. Whether the scorer is active for the associated agent. Can only be `true` for versions with `Available` status. Only one agent association per scorer can have `isActive` set to `true`.
`agentApiName`	string	Required. The API name of the agent. The agent must exist in the org. For example, `Copilot_for_Salesforce`.
`samplingRate`	double	Optional. A value greater than `0` and up to `1.0` that controls the sampling rate. Default is `1.0`.
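
As a rough sketch of how these subfields might look inside a scorer version, the fragment below associates the version with an agent and sets a sampling rate. The element nesting is inferred from the field tables rather than taken from a verified schema, the agent name reuses the example value from the table, and the 0.25 rate is purely illustrative.

```xml
<!-- Fragment of a scorerVersion entry; other required fields are omitted. -->
<agentAssociation>
    <!-- Must name an agent that already exists in the org. -->
    <agentApiName>Copilot_for_Salesforce</agentApiName>
    <!-- Can be true only when the enclosing version has Available status. -->
    <isActive>true</isActive>
    <!-- Illustrative value: greater than 0 and at most 1.0. Defaults to 1.0 when omitted. -->
    <samplingRate>0.25</samplingRate>
</agentAssociation>
```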

Each engine entry has these fields:

Field	Type	Description
`engineType`	string	Required. The type of evaluation engine. Valid value: `PromptTemplate`.
`engineRef`	string	Required. The API name of the prompt template.

Each outputEnumValue entry has these fields:

Field	Type	Description
`isFallback`	boolean	Whether this mapping is the default when no other mapping matches.
`isFallbackSystem`	boolean	Whether this mapping is the system-level fallback.
`outcomeType`	string	Optional. The test outcome. Valid values: `Pass`, `Fail`, `NotApplicable`. Default is `NotApplicable`.
`value`	string	The engine output value that maps to this outcome.
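
To illustrate how an explicit mapping and a fallback mapping could sit side by side, here is a hedged fragment. The output values `resolved` and `other` are hypothetical, the nesting is inferred from the field tables, and `isFallbackSystem` is left out because its exact semantics are not described here.

```xml
<!-- Fragment of a scorerVersion entry; other required fields are omitted. -->
<scorerVersion>
    <!-- Explicit mapping: the hypothetical engine output "resolved" counts as a pass. -->
    <outputEnumValue>
        <isFallback>false</isFallback>
        <outcomeType>Pass</outcomeType>
        <value>resolved</value>
    </outputEnumValue>
    <!-- Assumed default mapping, used when no other mapping matches the output. -->
    <outputEnumValue>
        <isFallback>true</isFallback>
        <outcomeType>NotApplicable</outcomeType>
        <value>other</value>
    </outputEnumValue>
</scorerVersion>
```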

Each specification entry has these fields:

Field	Type	Description
`max`	double	The maximum valid output value.
`min`	double	The minimum valid output value.
`step`	double	The increment between valid output values.
`threshold`	double	Optional. Output values greater than or equal to `threshold` pass.
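
As an illustration only, the fragment below constrains a numeric output to whole numbers from 0 to 10 and sets a passing threshold of 7. It assumes a scorer whose dataType is Number; the specific numbers are invented for the example.

```xml
<!-- Fragment of a scorerVersion entry, assuming the scorer's dataType is Number. -->
<specification>
    <max>10</max>
    <min>0</min>
    <!-- Valid outputs are 0, 1, 2, and so on up to 10. -->
    <step>1</step>
    <!-- Output values greater than or equal to 7 pass. -->
    <threshold>7</threshold>
</specification>
```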

This example defines a custom scorer that evaluates whether a customer drops off before a conversation resolves. The scorer uses a prompt template to analyze the session and outputs a value of 0 (no drop-off, pass) or 1 (drop-off detected, fail).
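
A minimal sketch of what such a definition file could look like is shown below. It is assembled from the field tables in this guide, not copied from a verified sample: the file name `Customer_Drop_Off`, the prompt template name `Detect_Customer_Drop_Off`, the `Number` data type, and the element order are all assumptions, and the agent name reuses the example value from the agentAssociation table.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical file: aiAgentScorerDefinitions/Customer_Drop_Off.aiAgentScorerDefinition -->
<AiAgentScorerDefinition xmlns="http://soap.sforce.com/2006/04/metadata">
    <!-- Session scope, because only Session is currently supported at run time. -->
    <inputScope>Session</inputScope>
    <!-- Assumption: the 0 or 1 output is declared as a Number. -->
    <dataType>Number</dataType>
    <scorerVersion>
        <versionNumber>1</versionNumber>
        <status>Available</status>
        <label>Customer Drop-Off</label>
        <description>Detects whether the customer dropped off before the conversation resolved.</description>
        <agentAssociation>
            <agentApiName>Copilot_for_Salesforce</agentApiName>
            <isActive>true</isActive>
        </agentAssociation>
        <engine>
            <engineType>PromptTemplate</engineType>
            <!-- Hypothetical API name of the prompt template that analyzes the session. -->
            <engineRef>Detect_Customer_Drop_Off</engineRef>
        </engine>
        <!-- 0 means no drop-off was detected, so the session passes. -->
        <outputEnumValue>
            <outcomeType>Pass</outcomeType>
            <value>0</value>
        </outputEnumValue>
        <!-- 1 means a drop-off was detected, so the session fails. -->
        <outputEnumValue>
            <outcomeType>Fail</outcomeType>
            <value>1</value>
        </outputEnumValue>
    </scorerVersion>
</AiAgentScorerDefinition>
```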

To deploy a custom scorer, create a project directory that contains a package.xml manifest and an aiAgentScorerDefinitions folder holding your .aiAgentScorerDefinition file.

The package.xml file specifies the scorer to deploy:
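
A manifest along these lines would list a single scorer. The scorer name `Customer_Drop_Off` is hypothetical, and the API version is an assumption; substitute a version that your org supports for this metadata type.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
    <types>
        <!-- Hypothetical scorer name, taken from Customer_Drop_Off.aiAgentScorerDefinition. -->
        <members>Customer_Drop_Off</members>
        <name>AiAgentScorerDefinition</name>
    </types>
    <!-- Assumed API version; replace it with one that your org supports. -->
    <version>66.0</version>
</Package>
```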

The members value must match the filename of your .aiAgentScorerDefinition file (without the extension).

Deploy the scorer to your org with the Salesforce CLI, for example with the `sf project deploy start` command.

If your scorer uses the PromptTemplate engine type, you can deploy both the template and the scorer together. Add a genAiPromptTemplates folder containing your prompt template definition, and add the template to package.xml.

In package.xml, the GenAiPromptTemplate type must appear before AiAgentScorerDefinition. Metadata API deploys types in the order that they appear, and the prompt template must exist before the scorer that references it can successfully deploy.
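
The ordering described above might look like this in a combined manifest. Both member names are hypothetical, and the API version is an assumption.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
    <!-- Listed first so that the prompt template exists before the scorer deploys. -->
    <types>
        <members>Detect_Customer_Drop_Off</members>
        <name>GenAiPromptTemplate</name>
    </types>
    <!-- The scorer references the template above through its engineRef field. -->
    <types>
        <members>Customer_Drop_Off</members>
        <name>AiAgentScorerDefinition</name>
    </types>
    <!-- Assumed API version; replace it with one that your org supports. -->
    <version>66.0</version>
</Package>
```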

To retrieve a scorer definition from your org, use the Salesforce CLI, for example with the `sf project retrieve start` command.

You can also retrieve a scorer definition by using the same package.xml that you used for deployment.

To update an existing scorer, modify the .aiAgentScorerDefinition file and redeploy. Keep these constraints in mind:

You can add new versions to a scorer, but you can’t delete existing versions.
You can update a version’s status value (for example, from Available to Archived).
You can update the agentAssociation isActive and samplingRate values.
The deployment matches on the members name in package.xml. If a scorer with that name already exists, the deployment updates the existing scorer.
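
As a sketch of an update that respects these constraints, the fragment below keeps version 1 in the file but archives and deactivates it, then adds version 2 as the active version. Only the fields relevant to the change are shown, the nesting is inferred from the field tables, and the agent name reuses the example value from the agentAssociation table.

```xml
<!-- Fragment of an updated definition; fields not shown stay as they were. -->
<AiAgentScorerDefinition xmlns="http://soap.sforce.com/2006/04/metadata">
    <scorerVersion>
        <versionNumber>1</versionNumber>
        <!-- Version 1 cannot be deleted, so it is archived instead. -->
        <status>Archived</status>
        <agentAssociation>
            <agentApiName>Copilot_for_Salesforce</agentApiName>
            <!-- isActive can be true only for versions with Available status. -->
            <isActive>false</isActive>
        </agentAssociation>
    </scorerVersion>
    <scorerVersion>
        <!-- New version, numbered sequentially after version 1. -->
        <versionNumber>2</versionNumber>
        <status>Available</status>
        <agentAssociation>
            <agentApiName>Copilot_for_Salesforce</agentApiName>
            <isActive>true</isActive>
        </agentAssociation>
    </scorerVersion>
</AiAgentScorerDefinition>
```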