Toxicity Scoring with Models API

Most LLMs are trained on vast collections of data that include a wide range of problematic content, so generated responses can contain toxic material that is unacceptable in an enterprise context. When prompts allow free-text input, services can also receive toxic language entered directly by end users. Detecting toxic language is a key capability of the Einstein Trust Layer: it enables application developers and customers to take appropriate policy actions in response, and it's an important component of our audit trail solution. See Einstein Trust Layer in Salesforce Help.

By default, the Models API automatically performs toxicity confidence scoring on LLM generations. The API passes back a flag indicating whether toxicity was detected, along with score information. This information is also stored in Data Cloud for your review. See Review Toxicity Scores in Salesforce Help.

For the most accurate toxicity detection, specify the correct locale in your API request. See Specify Languages and Locales with Models API.
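For example, a generation request body can include locale information alongside the prompt. This is only a sketch: the exact parameter names and structure of the localization block are defined in Specify Languages and Locales with Models API, so treat the field names and values here as illustrative assumptions.

```json
{
  "prompt": "Summarize this customer conversation in two sentences.",
  "localization": {
    "defaultLocale": "ja_JP",
    "inputLocales": [
      { "locale": "ja_JP", "probability": 1.0 }
    ]
  }
}
```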

Region-specific language patterns make harmful or inappropriate content harder to identify, so the Einstein Trust Layer doesn't always detect such content with 100% accuracy.

The following response snippet highlights some of the content moderation data that's returned for a generation. This information is also stored in Data Cloud.
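The snippet is a minimal sketch. The contentQuality structure follows the table later in this section; the surrounding generation fields, the category names, and all values are illustrative rather than an exact payload.

```json
{
  "generation": {
    "generatedText": "Thanks for reaching out! Here's a summary of your order status...",
    "contentQuality": {
      "scanToxicity": {
        "isDetected": false,
        "categories": [
          { "categoryName": "profanity", "score": 0 },
          { "categoryName": "hate", "score": 0 },
          { "categoryName": "violence", "score": 0 }
        ]
      }
    }
  }
}
```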

This table lists some of the key content moderation fields in a generation response.

| Field | Type | Parent Object | Description |
| --- | --- | --- | --- |
| contentQuality | object | generation | Content moderation details for generated content, including its safety and quality scores. |
| scanToxicity | object | contentQuality | Represents safety and toxicity data. |
| isDetected | boolean | scanToxicity | Indicates whether a scan detected any toxic content. When isDetected is true, there's a high level of confidence that the content contains toxic language. When isDetected is false, the content isn't necessarily free of toxicity; it means the model didn't detect toxicity in the content. |
| categories | array | scanToxicity | The array of toxicity categories. |
| categoryName | string | categories | Name of the toxicity category. |
| score | number | categories | A toxicity score from 0 to 1. A higher score means the response is more likely to be toxic. |
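For contrast, here's a sketch of scanToxicity data for a generation that does trip the detector. The category names and score values are illustrative.

```json
{
  "scanToxicity": {
    "isDetected": true,
    "categories": [
      { "categoryName": "profanity", "score": 0.94 },
      { "categoryName": "hate", "score": 0.11 },
      { "categoryName": "violence", "score": 0.06 }
    ]
  }
}
```

In a case like this, your application can key a policy action, such as blocking the response or flagging it for review, off the isDetected flag or off individual category scores that exceed a threshold you define.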