Toxicity Scoring with Models API

Most LLMs are trained on vast collections of data that include a wide range of problematic content, so generated responses can contain toxic material that is unacceptable in an enterprise context. When prompts allow free-text input, services can also receive toxic language entered directly by end users. Detecting toxic language is a key capability of the Einstein Trust Layer: it enables application developers and customers to take appropriate policy actions in response, and it's an important component of our audit trail solution. See Einstein Trust Layer in Salesforce Help.

By default, the Models API automatically performs toxicity confidence scoring on LLM generations. The API passes back a flag indicating whether toxicity was detected, along with score information. This information is also stored in Data Cloud for your review. See Review Toxicity Scores in Salesforce Help.

For the most accurate toxicity detection, specify the correct locale in your API request. See Specify Languages and Locales with Models API.
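For example, a generation request body can include locale information alongside the prompt. This is only a sketch: the exact parameter names and structure of the localization block are defined in Specify Languages and Locales with Models API, so treat the field names and values here as illustrative assumptions.

```json
{
  "prompt": "Summarize this customer conversation in two sentences.",
  "localization": {
    "defaultLocale": "ja_JP",
    "inputLocales": [
      { "locale": "ja_JP", "probability": 1.0 }
    ]
  }
}
```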

Region-specific language patterns make harmful or inappropriate content harder to identify, so the Einstein Trust Layer doesn't always detect such content with 100% accuracy.

The following response snippet highlights some of the content moderation data that's returned for a generation. This information is also stored in Data Cloud.
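The snippet is a minimal sketch. The contentQuality structure follows the table later in this section; the surrounding generation fields, the category names, and all values are illustrative rather than an exact payload.

```json
{
  "generation": {
    "generatedText": "Thanks for reaching out! Here's a summary of your order status...",
    "contentQuality": {
      "scanToxicity": {
        "isDetected": false,
        "categories": [
          { "categoryName": "profanity", "score": 0 },
          { "categoryName": "hate", "score": 0 },
          { "categoryName": "violence", "score": 0 }
        ]
      }
    }
  }
}
```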

This table lists some of the key content moderation fields in a generation response.

| Field | Type | Parent Object | Description |
| --- | --- | --- | --- |
| contentQuality | object | generation | Content moderation details for generated content, including its safety and quality scores. |
| scanToxicity | object | contentQuality | Represents safety and toxicity data. |
| isDetected | boolean | scanToxicity | Indicates whether a scan detected any toxic content. When isDetected is true, there's a high level of confidence that the content contains toxic language. When isDetected is false, the content isn't necessarily free of toxicity; it means the model didn't detect toxicity in the content. |
| categories | array | scanToxicity | The array of toxicity categories. |
| categoryName | string | categories | Name of the toxicity category. |
| score | number | categories | A toxicity score from 0 to 1. A higher score means the response is more likely to be toxic. |
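For contrast, here's a sketch of scanToxicity data for a generation that does trip the detector. The category names and score values are illustrative.

```json
{
  "scanToxicity": {
    "isDetected": true,
    "categories": [
      { "categoryName": "profanity", "score": 0.94 },
      { "categoryName": "hate", "score": 0.11 },
      { "categoryName": "violence", "score": 0.06 }
    ]
  }
}
```

In a case like this, your application can key a policy action, such as blocking the response or flagging it for review, off the isDetected flag or off individual category scores that exceed a threshold you define.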