Datasets and Models

A dataset contains the source text data. Training a dataset creates a model, and the model is the construct that returns predictions.

Here are some key points about datasets and models:

  • A dataset is the structure that contains your data.

  • The training process uses the dataset to create a model. You can train a dataset multiple times, and each training creates a new model, so a single dataset can produce many models.

  • The relationship between a dataset and a model is complete after the model is created. After a model is created, the model doesn’t reference the source dataset again unless you retrain the dataset.

  • You use APIs to edit a dataset. There’s no visual way to edit a dataset. If you have new data that you want to include in a model, you call the API to create a new dataset.

  • For Einstein Language, target 200–500 examples per dataset label.
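
One way to check the per-label target before you upload is to count examples locally. The following is a minimal sketch, not part of the Einstein API: the helper name `labels_below_target` and the `(text, label)` pair layout are assumptions made for illustration.

```python
from collections import Counter

# Lower bound of the 200-500 target range mentioned above.
MIN_EXAMPLES = 200


def labels_below_target(examples, minimum=MIN_EXAMPLES):
    """Return {label: count} for every label with fewer than `minimum` examples.

    `examples` is an iterable of (text, label) pairs. This layout is
    hypothetical; adapt it to however your source data is stored.
    """
    counts = Counter(label for _text, label in examples)
    return {label: n for label, n in counts.items() if n < minimum}


# Illustrative data: one label meets the target, one falls short.
examples = (
    [("what is the weather today", "weather")] * 250
    + [("book a table for two", "reservation")] * 40
)

print(labels_below_target(examples))  # {'reservation': 40}
```

Labels reported by the helper are candidates for more examples before you create the dataset.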

See Dataset Training and Retraining for more information on when to train a dataset vs. retrain a dataset.

When you train or retrain a dataset, the training process sets aside some of the training data to test the model for accuracy.

  • Training data—Examples used by the training process to create the model.
  • Test data—Examples set aside by the training process to test the model accuracy.
  • By default, Einstein Language uses 80% of the data to create the model and the remaining 20% to test the model’s accuracy.

You can change this ratio, also called the split, by using the trainSplitRatio parameter.
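
To see what a given split means in practice, the arithmetic can be sketched as follows. The function `split_counts` is hypothetical, and the rounding rule is an assumption; the service may round differently, so treat the numbers as approximate.

```python
def split_counts(total_examples, train_split_ratio=0.8):
    """Return (training_count, test_count) for a dataset of a given size.

    `train_split_ratio` mirrors the meaning of trainSplitRatio: the
    fraction of examples used for training. Rounding to the nearest
    whole example is an assumption made for this sketch.
    """
    if not 0 < train_split_ratio < 1:
        raise ValueError("train_split_ratio must be between 0 and 1")
    training = round(total_examples * train_split_ratio)
    return training, total_examples - training


print(split_counts(1000))       # (800, 200)
print(split_counts(1000, 0.7))  # (700, 300)
```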

You specify the proportion of training and test data, but the actual examples that the training process holds out for testing are randomly selected. This means you can see differences in models and model metrics even when those models are created from the same dataset.
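
The effect of random selection can be illustrated locally. This is a sketch of a generic random holdout, not the service's actual selection logic; `random_split` and its `seed` parameter are hypothetical and exist only to make the example repeatable.

```python
import random


def random_split(examples, train_split_ratio=0.8, seed=None):
    """Shuffle a copy of `examples` and split it into (training, test) lists."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_split_ratio)
    return shuffled[:cut], shuffled[cut:]


examples = [f"example-{i}" for i in range(100)]

# Two training runs over the same dataset, each with its own random holdout.
train_a, test_a = random_split(examples, seed=1)
train_b, test_b = random_split(examples, seed=2)

print(len(train_a), len(test_a))  # 80 20
# The sizes match across runs, but the held-out examples usually differ.
print(len(set(test_a) & set(test_b)), "test examples shared between the two runs")
```

Because each run trains and tests on a different subset, metrics such as accuracy can vary from model to model even though the dataset is unchanged.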