Dataset and Model Best Practices
-
Target at least 1,000 examples per dataset label.
-
Each dataset label should have about the same number of images. For example, avoid a situation where you have 1,000 images in one label and 400 in another within the same dataset.
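The two guidelines above can be checked before you upload a dataset. The following is a minimal sketch, not part of the Einstein API: the function name `audit_labels`, the label names, and the 1.5 balance tolerance are all assumptions made for illustration. Only the 1,000-example target and the 1,000-versus-400 case come from the text.

```python
from collections import Counter

MIN_EXAMPLES = 1000  # target per label, from the guideline above
MAX_RATIO = 1.5      # hypothetical tolerance for "about the same"; not from the docs


def audit_labels(labels, min_examples=MIN_EXAMPLES, max_ratio=MAX_RATIO):
    """Return per-label counts, the labels below the target, and a balance flag."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels to audit")
    # Labels that have fewer examples than the target.
    too_small = sorted(name for name, n in counts.items() if n < min_examples)
    # Compare the largest label to the smallest one.
    largest, smallest = max(counts.values()), min(counts.values())
    balanced = largest / smallest <= max_ratio
    return counts, too_small, balanced


# The unbalanced case from the text: 1,000 images in one label, 400 in another.
labels = ["beach"] * 1000 + ["mountain"] * 400
counts, too_small, balanced = audit_labels(labels)
print(dict(counts), too_small, balanced)
```

Pick a tolerance that suits your data; the point is to catch large gaps such as 1,000 versus 400 before training.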
-
Include a wide variety of images for each dataset label. If a label contains images of a certain object, include images with these characteristics:
- Color
- Black and white
- Blurred
- With other objects the object might typically be seen with
- With text and without text (if applicable)
-
A wide variety of images makes the model more accurate. For example, if you have a dataset label called “buildings,” include images of many different building styles: skyscraper, gothic, modern, and so on.
-
In a binary dataset, include images in the negative label that look similar to images in the positive label. For example, if your positive label is oranges, include grapefruits, tangerines, lemons, and other citrus fruits in your negative label.
-
For a multi-label model, include images in which the objects appear in different areas of the image. For example, for the label “car,” include images that show the car in different positions within the image.
-
A dataset can have up to 500 labels, but we recommend a maximum of 100 labels for better model accuracy.
-
If your dataset contains many labels, increase the number of examples per label.
-
We recommend that an Einstein Intent or Einstein Sentiment dataset contain a maximum of 100 labels. If you need more than 100 labels, consider hierarchical classification.
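This page doesn't describe hierarchical classification further. One common reading is a two-stage setup: a top-level model chooses among a small number of groups, and a separate model per group chooses the final label, so each dataset stays under the 100-label recommendation. The sketch below illustrates only the routing logic. The keyword rules stand in for trained models, and every name in it is hypothetical.

```python
def classify_hierarchically(text, coarse_model, fine_models):
    """Pick a group with the coarse model, then a label with that group's model."""
    group = coarse_model(text)
    label = fine_models[group](text)
    return group, label


# Hypothetical stand-ins for trained models: simple keyword rules.
def coarse(text):
    lowered = text.lower()
    return "billing" if "invoice" in lowered or "refund" in lowered else "shipping"


fine = {
    "billing": lambda t: "refund_request" if "refund" in t.lower() else "invoice_question",
    "shipping": lambda t: "late_delivery" if "late" in t.lower() else "tracking",
}

result = classify_hierarchically("I want a refund for my order", coarse, fine)
print(result)
```

In practice each stand-in would be a prediction call against its own model, which means two calls per string instead of one.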
-
We recommend fewer than 150 words for the length of the intent or sentiment string. This guideline applies both to an example in a language dataset and to a string sent to a model for prediction.
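A length check like the one below can run before you add an example or request a prediction. It is a sketch under one assumption: words are counted by splitting on whitespace, which may differ from how the service counts them. The function names are illustrative.

```python
MAX_WORDS = 150  # recommended upper bound from the guideline above


def word_count(text):
    """Count whitespace-separated tokens (an assumed definition of a word)."""
    return len(text.split())


def within_length_guideline(text, max_words=MAX_WORDS):
    """Return True when the text has fewer than max_words words."""
    return word_count(text) < max_words


short_ok = within_length_guideline("We had a great time")
long_ok = within_length_guideline("word " * 150)
print(short_ok, long_ok)
```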
-
During the training process, special text formatting, such as emojis, words in all uppercase, and punctuation, isn’t included. For example, if you add a text example containing a smiley emoji to a dataset, the emoji isn’t considered during training. Only the text is used.
-
When you send in text for prediction, the model doesn’t consider special text formatting and punctuation. For example, when you send the string “We had a great time! :)” to the model, the model returns a prediction for the string “We had a great time”.
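To preview roughly what the model sees, you can approximate this behavior locally. The sketch below is not the service's actual preprocessing, which this page doesn't specify. It only drops characters that aren't letters, digits, underscores, or whitespace, then collapses extra spaces. It leaves letter case untouched and also removes apostrophes, so treat it as an illustration.

```python
import re


def approximate_model_view(text):
    """Strip punctuation and emoji-like symbols, then collapse whitespace."""
    # [^\w\s] matches anything that is not a word character or whitespace.
    stripped = re.sub(r"[^\w\s]", "", text)
    # split() + join() removes leading, trailing, and repeated spaces.
    return " ".join(stripped.split())


cleaned = approximate_model_view("We had a great time! :)")
print(cleaned)  # We had a great time
```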
-
Batch predictions aren’t supported. When you send text in for a prediction, you make a single API call to the /intent endpoint or the /sentiment endpoint.
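Because there is no batch endpoint, scoring several strings means looping and making one call per string. The sketch below shows that pattern with the request function passed in as a parameter. `predict_all`, `fake_sentiment`, and the shape of the returned dictionary are placeholders for illustration, not the real API response.

```python
def predict_all(texts, predict_one):
    """Call predict_one once per string and return the results in input order."""
    return [predict_one(text) for text in texts]


# Hypothetical stand-in for a real HTTP request to the /sentiment endpoint.
calls = []


def fake_sentiment(text):
    calls.append(text)  # record each call to show that there is one per string
    label = "positive" if "great" in text else "neutral"
    return {"text": text, "label": label}  # placeholder response shape


results = predict_all(["We had a great time", "The room was fine"], fake_sentiment)
print(len(calls), [r["label"] for r in results])
```

In real use, replace `fake_sentiment` with a function that sends one string to the endpoint and parses the response.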