Datasets and Models

A dataset contains the source image or text data. When you train a dataset, the training process creates a model. The model is the construct that returns predictions.

Here are some key points about datasets and models:

  • A dataset is the structure that contains your data, whether that data is images or text.

  • The training process uses the dataset to create a model. You can train a dataset multiple times to create multiple models, so a single dataset can be the source of many models.

  • The relationship between a dataset and a model ends after the model is created. A model doesn’t reference the source dataset again unless you retrain the dataset.

  • You use APIs to edit a dataset; there’s no visual way to edit a dataset. If you have new data that you want to include in a model, call the API to create a new dataset, or call the API to add the new data to an existing dataset (see the sketch after this list).

  • For Einstein Language, target 200–500 examples per dataset label. For Einstein Vision, target at least 1,000 images per label.

  • Make sure that each dataset label has about the same number of images. For example, avoid a situation where you have 1,000 images in one label and 400 in another in the same dataset.

  • Use a wide variety of examples for each label to improve model accuracy, because the images or text sent in for prediction will likely vary.

  • For image datasets, consider including a negative label, such as “Other.” When an image sent in for prediction doesn’t match any labels in the model, the model returns the value “Other.” For language datasets, you can use the out-of-domain algorithm when you train a dataset and create a model. If text sent for prediction doesn't match any of the labels in the model, the model returns an empty probability.
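
As a rough illustration of the API-only workflow mentioned above, here’s a minimal Python sketch that creates a text dataset from a file and then adds new examples to it. It assumes you already have a valid access token; the endpoint paths, field names, and file names are assumptions based on the public Einstein Language REST API, so check the API reference for your version before relying on them.

```python
import requests

# Assumption: you already generated an OAuth access token for the API.
TOKEN = "<YOUR_ACCESS_TOKEN>"
BASE = "https://api.einstein.ai/v2/language"  # assumed base URL
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Create a dataset from a local CSV of labeled text examples (asynchronous upload).
with open("case_routing.csv", "rb") as f:
    resp = requests.post(
        f"{BASE}/datasets/upload",
        headers=HEADERS,
        files={"data": f},
        data={"type": "text-intent"},
    )
dataset = resp.json()
print("Created dataset:", dataset.get("id"))

# Add new examples to an existing dataset (PUT to the same upload endpoint).
with open("new_examples.csv", "rb") as f:
    resp = requests.put(
        f"{BASE}/datasets/{dataset['id']}/upload",
        headers=HEADERS,
        files={"data": f},
    )
print("Upload status:", resp.json().get("statusMsg"))
```

The same pattern applies to image datasets; only the base path and dataset type differ.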

See Dataset Training and Retraining for more information on when to train a dataset vs. retrain a dataset.

When you train or retrain a dataset, the training process sets aside some of the training data to test the model for accuracy.

  • Training data—Examples used by the training process to create the model.
  • Test data—Examples set aside by the training process to test the model accuracy.

The default amount of data used for training and testing varies depending on which API you use.

  • Einstein Vision—90% of the data is used to create the model, and 10% is used to test the model’s accuracy.
  • Einstein Language—80% of the data is used to create the model, and 20% is used to test the model’s accuracy.

You can change this ratio, also called the split, by using the trainSplitRatio parameter.
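
For example, here’s a hedged sketch of a training call that overrides the default split through trainParams. The endpoint, dataset ID, and model name are assumptions (they follow the public Einstein Language REST API); for Einstein Vision, the path would point at the vision training resource instead.

```python
import json
import requests

TOKEN = "<YOUR_ACCESS_TOKEN>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Train the dataset, holding out 30% of the examples for testing
# instead of the default 20% (Einstein Language) or 10% (Einstein Vision).
resp = requests.post(
    "https://api.einstein.ai/v2/language/train",   # assumed endpoint
    headers=HEADERS,
    data={
        "name": "Case Routing Model",              # hypothetical model name
        "datasetId": "1003360",                    # hypothetical dataset ID
        "trainParams": json.dumps({"trainSplitRatio": 0.7}),
    },
)
body = resp.json()
print(body.get("modelId"), body.get("status"))
```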

You specify the amount of training and test data, but the actual data that the training process holds out for testing is randomly selected. This means you can see differences in models and model metrics even when those models are created from the same dataset.
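
To see why metrics can drift between training runs, here’s a toy sketch (not the service’s actual sampling code) of a random 80/20 holdout: each run holds out a different subset of the same dataset, so the measured accuracy can vary even though the training data is identical.

```python
import random

examples = list(range(1000))   # stand-in for 1,000 labeled examples
train_split_ratio = 0.8        # 80% train / 20% test, like the Language default

for run in range(2):
    shuffled = random.sample(examples, len(examples))   # new random order each run
    cutoff = int(len(shuffled) * train_split_ratio)
    train, test = shuffled[:cutoff], shuffled[cutoff:]
    print(f"Run {run}: first 5 held-out examples -> {sorted(test)[:5]}")
```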