Einstein OCR Model Card

The model analyzed in this card returns text, probabilities, and other information identified in an image or PDF.

Get information about training data, use cases, features, and other considerations for the optical character recognition (OCR) model.

Salesforce AI Research

  • February 2022
  • The model doesn’t have numbered versioning. When a fix is sent out, it's deployed everywhere, so customers are always running on the latest version.
  • Minor changes can occur throughout the release.
  • Major changes can occur and are communicated via release notes.

Vision model for character recognition.

Einstein OCR has three modules. The first module detects text in an image. The second module recognizes the characters in each detected region and converts them, character by character, into a string. The third module tags the resulting text with an entity such as person, phone number, or address.

The detection module uses a convolutional neural network (CNN), and the recognition module uses a combination of a CNN and a long short-term memory (LSTM) network.
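
To make the data flow concrete, here's a minimal sketch of that pipeline in Python. All three functions are hypothetical stubs standing in for the detector, recognizer, and tagger; they aren't part of any Einstein API.

    # Conceptual sketch of the three-module pipeline. The three functions are
    # hypothetical stubs, not part of the Einstein OCR service.
    def detect_text_regions(image):
        return [(10, 10, 120, 40)]   # module 1 (CNN): boxes as (x1, y1, x2, y2)

    def recognize_characters(image, box):
        return "Jane Doe"            # module 2 (CNN + LSTM): decoded string

    def tag_entity(text):
        return "PERSON"              # module 3: entity label for the string

    def run_pipeline(image):
        results = []
        for box in detect_text_regions(image):
            text = recognize_characters(image, box)
            results.append({"box": box, "text": text, "entity": tag_entity(text)})
        return results

    print(run_pipeline(image=None))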

Einstein OCR is available to Salesforce customers that have any SKU for Einstein Vision and Language.

Send questions or comments about the model to vision-team@salesforce.com. For more information, see the Einstein OCR documentation.

The Einstein OCR model detects alphanumeric text and common special characters (such as '#' and '&') in an image.

The Einstein OCR model is designed for enterprise customers. Users can access the model via REST API or embed the API in a Salesforce application.
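
As a sketch, a Detect Text call with Python's requests library might look like the following. The endpoint, form fields, and response shape follow the public Einstein Vision documentation, but verify them against the current docs before relying on them; the access token and image URL are placeholders.

    import requests

    # Multipart form fields for the Detect Text endpoint. "sampleContent" can
    # be used instead of "sampleLocation" to upload the image bytes directly.
    form = {
        "modelId": (None, "OCRModel"),
        "task": (None, "text"),
        "sampleLocation": (None, "https://example.com/business-card.png"),
    }
    response = requests.post(
        "https://api.einstein.ai/v2/vision/ocr",
        headers={"Authorization": "Bearer <ACCESS_TOKEN>"},  # placeholder
        files=form,  # files= makes requests send multipart/form-data
    )
    response.raise_for_status()

    # Each entry carries the detected string, a confidence score, and the
    # bounding box of the text within the image.
    for item in response.json().get("probabilities", []):
        print(item["label"], item["probability"], item["boundingBox"])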

  • OCR text: Identifies typed and handwritten text in a document or image and returns the relative position of that text. Use cases include:
      • VIN detection
      • Serial number detection
  • OCR entity: Identifies text in an image and returns the relative position and the associated entity. Building on OCR text, it identifies the text and then the entity that the text represents. Use cases include extracting contact information from business cards.
  • OCR table: Identifies text in an image and returns the relative position and the associated cell number. Building on OCR text, it identifies tables in the text and maps entries to specific table cells. Use cases include digitizing tables such as price sheets.
  • The Einstein OCR model currently supports only English. See the notes below for details on supported characters.
  • The model doesn't support checkboxes or circled answers.
  • The model is optimized for short-form prediction rather than identifying characters, words, or entities in long-form sentences or paragraphs.
  • Certain use cases are prohibited under the Salesforce Acceptable Use Policy.
  • Users can't submit data prohibited by the Sensitive Data section of the Security, Privacy and Architecture documentation.

The quality of a user's input can affect the accuracy of the results.

These factors can affect text recognition accuracy; a client-side mitigation sketch follows the list.

  • Low-quality images, such as images downloaded from unknown sources or scanned poorly
  • Low light
  • Low resolution
  • High skew or perspective
  • Varying density of words in the document
  • Length of strings
  • Excessively cursive or otherwise illegible handwriting
  • Font of the typed text
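
Several of these factors can be mitigated before upload. The snippet below is an illustrative sketch using OpenCV (an assumption; preprocessing happens on the client, outside the Einstein API) to address low resolution, low light, and scan noise:

    import cv2

    def preprocess(path):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        # Low resolution: upscale small images so characters have enough pixels.
        if min(img.shape) < 1000:
            img = cv2.resize(img, None, fx=2.0, fy=2.0,
                             interpolation=cv2.INTER_CUBIC)
        # Low light / low contrast: adaptive histogram equalization.
        img = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(img)
        # Noise from poor scans: mild denoising that preserves character edges.
        return cv2.fastNlMeansDenoising(img, h=10)

    cv2.imwrite("cleaned.png", preprocess("raw-scan.png"))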

These factors can affect entity recognition accuracy.

  • Addresses whose formats vary between countries, for example, in apartment numbers and postal codes
  • Birth dates in different formats
  • Names in different cultures
  • Various genders

Precision and recall are used to evaluate where the text is within an image (the bounding box). Word-level accuracy is used to evaluate character recognition in identified text. The entity F1 score is used to discern the accuracy of entity tags identified for text.

The overall model is also evaluated with an F1 score, whose calculation combines the precision, recall, and word-level accuracy described above.
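
To make the metrics concrete, here's a worked example with invented counts; these numbers are for illustration only, not Einstein OCR results.

    # Detection metrics from invented counts, for illustration only.
    tp = 90   # predicted boxes that match a ground-truth box
    fp = 10   # predicted boxes with no matching ground-truth box
    fn = 15   # ground-truth boxes the model missed

    precision = tp / (tp + fp)                          # 0.90
    recall = tp / (tp + fn)                             # ~0.857
    f1 = 2 * precision * recall / (precision + recall)  # ~0.878

    # Word-level accuracy: the fraction of words whose predicted string
    # exactly matches the ground truth.
    predicted = ["INVOICE", "TOTAL", "42.00"]
    expected = ["INVOICE", "TOTAL", "42.08"]
    word_accuracy = sum(p == e for p, e in zip(predicted, expected)) / len(expected)  # ~0.667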

The training dataset is composed of data from open-source libraries, among them the handwriting database provided by the Unipen Foundation (2).

In addition to the open-source libraries, synthetic datasets were used (1). Synthetic datasets were created by picking a scene or a plain white background, imposing randomly selected words onto the background, and varying attributes such as font, character size, and skew.
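
A minimal sketch of that recipe using Pillow is shown below; the word list, font path, and parameter ranges are illustrative assumptions, not the actual training configuration.

    import random
    from PIL import Image, ImageDraw, ImageFont

    WORDS = ["invoice", "Total", "ACME", "4217"]  # illustrative vocabulary

    def make_sample(path="sample.png"):
        word = random.choice(WORDS)
        # Plain white background; a scene crop could be pasted in instead.
        img = Image.new("RGB", (320, 96), "white")
        # Random font size; the font file is an assumption about the system.
        font = ImageFont.truetype("DejaVuSans.ttf", size=random.randint(24, 48))
        ImageDraw.Draw(img).text((20, 20), word, font=font, fill="black")
        # Random skew via a small rotation.
        img = img.rotate(random.uniform(-10, 10), expand=True, fillcolor="white")
        img.save(path)
        return word  # the ground-truth label for this synthetic sample

    print(make_sample())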

The entity recognition capabilities use a combination of in-house named-entity recognition (NER) models, the Stanford business card dataset, and an in-house business card dataset.

The data used to evaluate the model was gathered from the sources listed above. The evaluation and training data sets are different subsets of the source data that don’t overlap.

The results reported here are from a publicly available challenge dataset, ICDAR Focused Scene Text. This dataset consists of real-world scenes, such as storefronts and street signs, with images primarily focused on text content. This scenario is typical for text-reading applications where the user explicitly directs the camera's focus to the text content of interest. The challenge in this dataset arises from complex text orientation and complex backgrounds.

Complex text orientation necessitates estimating skew in addition to the location and content of the text. Complex backgrounds can reduce the contrast of the text and introduce false positives in the form of random patterns that look like text.

The ICDAR dataset was designed to encourage the research community to develop solutions for text reading in real-world images, and the challenge allowed an OCR system to limit its predicted words to a fixed vocabulary (as with the state-of-the-art (SoTA) benchmark reported below). Such a limitation helps correct OCR errors because every word to be detected in the test set is present in the fixed vocabulary, and the search space for a word in a test image shrinks from potentially infinite (all possible combinations of letters and numbers) to a small number of words in the specified vocabulary.
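
For intuition, vocabulary-constrained correction can be sketched with Python's standard difflib module; the vocabulary here is invented for illustration.

    import difflib

    VOCABULARY = ["PARKING", "ENTRANCE", "EXIT", "OPEN", "CLOSED"]  # invented

    def correct(raw):
        # Snap a raw OCR string to the closest vocabulary word, if one is close.
        matches = difflib.get_close_matches(raw.upper(), VOCABULARY, n=1, cutoff=0.6)
        return matches[0] if matches else raw  # fall back to the raw prediction

    print(correct("PARK1NG"))  # -> "PARKING"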

The SoTA models reported here use a general vocabulary set to aid text prediction and are trained on this dataset. Our results don't use a vocabulary, because the models must support OCR applications that a vocabulary can't capture, such as vehicle identification number (VIN) or serial number scanning. Nor are our models trained exclusively on this dataset; a generic model is used that's more robust and addresses a wider range of challenges. Even so, the F1 score diverges from the SoTA results by just 6.17 points.

Evaluation metric: F1 score
Salesforce Research model ID: scene

                    Salesforce Research    SoTA Research Benchmark*
Fixed Vocabulary    No                     Yes
Usage               Real world             Fine-tuned for the dataset
F1 Score            79.61                  85.78

*Published state-of-the-art (SoTA) results for this benchmark.

The text is processed into a string character by character. Because dictionaries aren't referenced, underlying dictionary bias isn't an issue. But the supported characters are limited to the Roman alphabet, which can advantage English-speaking individuals, companies, and organizations.

Entities are identified only after the text is converted, character by character, into a string, so identifying certain entities can be challenging for the model. For example, first or last names from certain regional, ethnic, or religious groups are sometimes recognized with higher accuracy than others. Similar challenges apply to non-binary gender entries and to varied birth date formats.

  • Only text containing characters from the supported character list is identified; languages that use other characters aren't supported. Support for more characters and languages is planned.
  • Only entities in the supported entity list are identified in this version of the model.
  • Only phone numbers in one of the supported country-specific formats are identified in this version of the model.
  • The Detect Text API provides more than one prediction model, each with a unique modelId specifier (for example, OCRModel and tabulatev2). For the best accuracy, use the recommended modelId and task combinations, as sketched after this list.
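
As a sketch, the recommended pairings can be expressed as a simple lookup; these pairings are an assumption based on the model IDs named above, so confirm them against the current Einstein OCR documentation before relying on them.

    # Assumed modelId/task pairings; confirm against the Einstein OCR docs.
    RECOMMENDED_MODEL_FOR_TASK = {
        "text": "OCRModel",     # general text detection (VINs, serial numbers)
        "contact": "OCRModel",  # business-card entity extraction
        "table": "tabulatev2",  # table digitization
    }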

These characters are supported:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~£≈
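
A quick client-side check that an expected string uses only these characters might look like this; the space in the set is an addition for convenience when checking multi-word strings.

    # Set of supported characters, copied from the list above; the trailing
    # space is added here for convenience, not part of the documented list.
    SUPPORTED = set(
        "!\"#$%&'()*+,-./0123456789:;<=>?@"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`"
        "abcdefghijklmnopqrstuvwxyz{|}~£≈ "
    )

    def is_supported(text):
        return all(ch in SUPPORTED for ch in text)

    print(is_supported("Invoice #42 & Co."))  # True
    print(is_supported("Straße 5"))           # False: "ß" isn't supported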

These entities are supported.

  • ADDRESS
  • EMAIL
  • FAX
  • HOME_PHONE
  • MOBILE_PHONE
  • OFFICE_PHONE
  • ORG
  • PERSON
  • WEBSITE

Phone numbers with these country-specific formats are supported.

  • CA
  • CN
  • DE
  • ES
  • FR
  • GB
  • IN
  • IT
  • MX
  • JP
  • PT
  • US

The model also identifies service and emergency numbers (such as 911), but only in the locales listed below.

Locale                           Locale Code    Service Numbers
Chinese, Simplified (China)      zh_cn          110, 119, 120
Chinese, Traditional (Taiwan)    zh_tw          110, 119, 120
Danish (Denmark)                 da             112
Dutch (Netherlands)              nl_nl          112
English (U.S.A.)                 en_us          911
English (U.K.)                   en_gb          999, 112
French (France)                  fr             15, 17, 18, 112
German (Germany)                 de             110, 112
Italian (Italy)                  it             112
Japanese (Japan)                 ja             110, 119
Korean (Korea)                   ko             112, 119
Portuguese (Brazil)              pt_br          190, 192, 193
Portuguese (Portugal)            pt_pt          112
Russian (Russia)                 ru             101, 102, 103, 112
Spanish (Spain)                  es             112
Swedish (Sweden)                 sv             112

(1) Praveen Krishnan and C. V. Jawahar, "Generating Synthetic Data for Text Recognition," arXiv preprint arXiv:1608.04224, 2016.
(2) The Unipen database, provided by the Unipen Foundation.