Einstein Vision is a powerful feature of the Salesforce Platform that allows developers to build apps that can react by categorizing incoming images. It’s a neat little corner of AI and one I recently got a chance to utilize for a demo. In this post, I share some of the lessons that I learned while setting it up.

As for the demo itself, it started with a rather innocent question: “What if we tried to detect bears at Moscone?”

Clearly, if a real bear appears at Moscone West, we’ll have a lot more to worry about than how machine learning works. There really isn’t any practical reason to answer this question, but we pride ourselves on laughing in the face of mere practicality when it comes to diving into a technical question. After all, as one of our Trailhead marketing managers, Dana Hall, noted, “Bears are kind of hypnotic because they already kinda look like people in bear suits.”

Which prompted an evolution of the question: “Can we tell the difference between a person in a bear suit and a real bear?”

And so Project BearNoBear was born.

First things first: What do we mean by AI?

Artificial intelligence (AI), that is, intelligence exhibited by machines rather than humans, is a fairly loaded term. It can remind people of everything from Siri to HAL 9000. For much of the Einstein Platform, we’re more specifically referring to a kind of technology called machine learning. Machine learning is about software that predicts the qualities of new data based on existing data. To learn the qualities of the existing data, the data is organized into categories called classifiers. The more data in the classifiers, the more properties the software can learn about the categories. So, as the end user, you organize your categories into directories, put as much data as possible into those groups, and upload them to be converted into classifiers that machine learning can utilize.

Short version: Upload enough pictures of an apple, indicate to the machine that it’s an apple, and then the machine can apply the qualities of an apple to something it has never encountered to determine its apple-ness.

Step 1: The human part of training: downloading a lot of images

Our task is to determine if a given image includes:

  1. A bear
  2. A person in a bear suit
  3. No bear

Since all the smart people at Einstein Vision have figured out the hard training part, a big part of our job is to simply gather the data for the classifier and organize it into qualities. You can think of qualities as categories, or more specifically, directories.

The first step is to create three directories:

  1. Real Bear
  2. Person in a Bear Suit
  3. No Bear

And then search, download, and sort various images of known bears, bear suits, and humans (or even random objects) into those directories.
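On disk, that setup is nothing fancier than three folders. The names below match the categories above, but they’re yours to choose:

```shell
# One directory per category; each becomes a label
# when the data is uploaded to Einstein Vision.
mkdir -p "Real Bear" "Person in a Bear Suit" "No Bear"
```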

Step 2: Sign up for Einstein Vision

Obviously, the Einstein team isn’t so genius that they can simply train a machine off your hard drive. Before we get into the specifics from Project BearNoBear, let me point you in the right direction on how to access Einstein Vision.

This is a rough outline for context; the specifics are available in the Einstein Vision (previously known as Metamind) docs: https://metamind.readme.io/docs/apex-qs-create-vf-page.

Follow the prerequisite steps listed in the docs to get an Einstein account activated.

First, zip your files.

Second, generate a token.
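Token generation uses an OAuth JWT-bearer flow: you build a JWT assertion from your Einstein account email and the private key you downloaded at signup, sign it, and exchange it for an access token. A hedged sketch, with the endpoint as documented at the time of writing and `$ASSERTION` standing in for your signed JWT:

```shell
# Exchange a signed JWT assertion for a short-lived access token.
# $ASSERTION must be generated beforehand from your account email and
# private key (the docs point to tools that build it for you).
curl -s -X POST https://api.einstein.ai/v2/oauth2/token \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=urn:ietf:params:oauth:grant-type:jwt-bearer&assertion=$ASSERTION"
```

The JSON response contains an `access_token` value; that’s the bearer token for all the calls below.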

Third, upload your files (this uses curl as an example):
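Something along these lines — a sketch against the v2 REST endpoint, with the exact URL and field names in the docs and the zip file name carried over from the hypothetical example above:

```shell
# Upload the zipped, labeled images as a new image dataset.
curl -X POST https://api.einstein.ai/v2/vision/datasets/upload \
  -H "Authorization: Bearer $TOKEN" \
  -H "Cache-Control: no-cache" \
  -H "Content-Type: multipart/form-data" \
  -F "data=@bearnobear.zip" \
  -F "type=image"
```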

The response from that looks like this and includes a dataset ID.
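Trimmed to the interesting fields, and with invented values for illustration, the shape is roughly:

```json
{
  "id": 1000022,
  "name": "bearnobear",
  "type": "image",
  "statusMsg": "UPLOADING",
  "labelSummary": { "labels": [] }
}
```

The `id` value is the dataset ID you’ll need for training.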

Take the dataset ID, give your model a name, and tell it to get training.
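Another curl sketch — the dataset ID is carried over from the upload response above, and the model name is arbitrary:

```shell
# Kick off training on the uploaded dataset.
curl -X POST https://api.einstein.ai/v2/vision/train \
  -H "Authorization: Bearer $TOKEN" \
  -F "name=BearNoBear Model" \
  -F "datasetId=1000022"
```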

And that response gives you the custom model ID.
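Again trimmed, and with invented values, it looks roughly like:

```json
{
  "datasetId": 1000022,
  "name": "BearNoBear Model",
  "status": "QUEUED",
  "modelId": "2KXJEOLQOVPJSTC6VMBBIZ4QYU"
}
```

The `modelId` value is the custom model ID; training runs asynchronously, so you poll with that ID until the status reports the model is ready.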

With that custom model ID, you can start making API calls to make predictions. To see more, look at Christophe’s example of how to use Vision with Lightning Components, René’s Using Einstein within the Salesforce Platform or his Apex Wrapper for Einstein.
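A prediction call is one more hedged curl sketch — the image file name here is invented, `sampleContent` uploads a local image, and `sampleLocation` can point at a URL instead:

```shell
# Ask the trained model to classify a new image.
curl -X POST https://api.einstein.ai/v2/vision/predict \
  -H "Authorization: Bearer $TOKEN" \
  -F "sampleContent=@maybe-a-bear.jpg" \
  -F "modelId=2KXJEOLQOVPJSTC6VMBBIZ4QYU"
```

The response carries a `probabilities` array: one entry per label, each with a confidence score.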

For Project BearNoBear, at this point, I now had my classifiers set up within Einstein and started to find images of bears in the wild, people in bear costumes, and images without any bears.

What did I learn?

Lesson one: Start slowly

This can seem counterintuitive because everyone who has worked with machine learning tells you that a deep well of data is super important for making the machine smarter. But digging a well takes time. And while you’re digging that well, you can see how well your machine is learning.

The first set of BearNoBear (trademark not pending) pictures were roughly 20 to 30 images deep for each category. Your final classifier can be 10 times that, but starting with smaller sample sizes lets you see how the training is going.

Lesson two: Context is king

Early results of the classifier were pretty decent. The always amazing Mary Scotton offered this pic of her and old-school Codey hanging out:

and it nailed it in one. That was a person in a bear suit. Then Dana brought her rather impressive (actually, it’s slightly scary) library of bear pictures to (forgive me) bear, and it failed at examples like these:

In every example, the classifier thought Dana was showing a person in a bear suit. Notice anything similar between the one it got right and the ones it got wrong? In each of Dana’s examples, the bear is acting in a human-like way. I realized the mistake wasn’t even terribly mysterious: the classifier was learning that standing wasn’t bear-like. From a contextual point of view, they were bears in a non-bear-like context. So I added more standing bears to the “real bears” category and instantly started seeing better predictions, regardless of a bear’s posture or whatever it was wearing.

Lesson three: But it also includes a lack of context

If we weren’t going to be training our classifier to work in the wild, so to speak, we could have the luxury of images of objects set against specific backgrounds. Some of the early sample images I used on Einstein’s food classifier were intentionally images with completely white backgrounds. This removes the noise that the classifier has to train on and makes for great baseline images.

For instance, if you were going to do detection of product images, start with images of those products set on a white background from a wide variety of angles. Test the classifier for that, and then add in the product set in a variety of real-world settings.

Lesson four: “Not” is important

There’s an old joke about how to carve a wooden elephant: Take a block of wood and remove everything that doesn’t look like an elephant. Getting the classifier to determine bears that look “fake” versus “real” was a challenge, but then what if we want to determine that there’s no bear at all?

What constitutes “not a bear”? I started by creating a classifier that was a set of random images of anything that wasn’t a bear: hats, flowers, people in crowds, and such. Actually, I started by focusing mostly on random sets of people, portrait shots, myself in a mirror, and so on, because a random selfie is less likely to trigger a false positive of “real bear.”

However, there’s another way to create this baseline “other”: a global dataset. This gives you a safe fallback for when you’re checking for bears and no ursine-related images are involved at all.


Hopefully, you now have an appreciation for how to manage classifiers to train for specific tasks. Einstein Vision has done an excellent job of making predictive vision services easily consumable for developers. #BearNoBear is still up and running. If you want to give it a try, follow @bear_labs, take a photo of the most bear-like thing you can find, and add the hashtag. In an upcoming post (time and bears permitting), I will talk about how you can use Twitter bots for your own devious needs.

And if you want to try Einstein Vision with Trailhead, check out René’s Home for Wayward Cats Project.
