Text Classification: When Not to Use Machine Learning

Machine learning is a great approach for many text classification problems. One example is classifying an email as “spam” or “not spam” based on its textual content.

The following is not one of them.

Consider the problem of classifying a job title into a rank from C-level, VP-level, Director-level, Manager-level, or Staff. Below are some examples of job titles and the ranks we’d like them to be classified to.

Chief Information Officer Vice President → C-level
Admin to Chief Information Officer → Staff
E-Commerce Project Manager → Manager-level
Director, Software Engineering → Director-level
General Manager → VP-level
Assistant to Vice President → Staff
Assistant Vice President → VP-level

At first glance, one might be tempted to use keywords/regular expressions mapped to ranks for this purpose. For example, if the title contained the phrase chief <word> officer, we might rank it C-level.
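As a minimal sketch of this keyword idea, assuming Python's standard re module: only the chief <word> officer pattern comes from the text above. The other patterns, the first-match-wins order, the Staff default, and the name classify_by_keyword are illustrative assumptions.

```python
import re

# Hypothetical keyword rules as (pattern, rank) pairs, checked in order.
# Only "chief <word> officer" is taken from the article; the rest are
# illustrative guesses.
KEYWORD_RULES = [
    (re.compile(r"\bchief \w+ officer\b"), "C-level"),
    (re.compile(r"\bvice president\b"), "VP-level"),
    (re.compile(r"\bdirector\b"), "Director-level"),
    (re.compile(r"\bmanager\b"), "Manager-level"),
]


def classify_by_keyword(title, default="Staff"):
    """Return the rank of the first rule whose pattern occurs in the title."""
    text = title.lower()
    for pattern, rank in KEYWORD_RULES:
        if pattern.search(text):
            return rank
    # Assumption: titles that match no keyword fall back to Staff.
    return default
```

This naive version gets Chief Information Officer right, but it ranks Admin to Chief Information Officer as C-level and General Manager as Manager-level, both of which disagree with the examples above.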

On further thought, after noting some subtleties, as witnessed in some of the examples above, we might prefer a more sophisticated approach. If we are (or aspire to be) data scientists, we might imagine that machine learning would work better for this problem. Specifically, use a training set of job titles classified to ranks to automatically learn, via machine learning, to classify a title to its appropriate rank.

Sounds very appealing.

As we’ll see below, this doesn’t work out as well as we might imagine.

A Machine Learning Solution

For a machine learning solution to this problem, one needs the following:

1. A training set of (job title, rank) pairs
2. Features to be extracted from a title that the machine learning algorithm will use

We might choose, for example, every word and every two-word phrase in the title to be a feature. Below are some examples of job titles and their word-level and two-word level features.

Title | Features
General Manager | general, manager, “general manager”
Director, Software Engineering | director, software, engineering, “director software”, “software engineering”
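The word and two-word-phrase features above could be extracted along these lines. This is a sketch under assumptions: the function name extract_features and the tokenization rule (lowercase, split on anything other than letters, digits, and internal hyphens) are our own choices, not something the approach prescribes.

```python
import re


def extract_features(title):
    """Return the words of a title followed by its adjacent two-word phrases."""
    # Assumed tokenization: lowercase alphanumeric runs, keeping internal
    # hyphens so that "E-Commerce" stays a single word.
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())
    # Pair each word with its right-hand neighbour to form two-word phrases.
    phrases = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + phrases
```

Calling extract_features("General Manager") gives the three features listed for that title: general, manager, and “general manager”.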


The difficulty with this approach is that, unless the training set is very large and sufficiently diverse, a machine learning solution can significantly overfit it.

The term “overfit” means “the learned model does not work adequately well on titles not seen during training”.

Below is a simple example that illustrates this. Imagine that the training set has an entry chief medical officer → C-level. Also imagine that no other title in the training set has the word medical in it. In view of this, a machine learning algorithm is likely to learn the association that medical predicts C-level, which is clearly wrong.
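To make this concrete, here is a small sketch that simply counts how often each word co-occurs with each rank. The toy training set is hypothetical (only chief medical officer comes from the text), and raw counting stands in for whatever learning algorithm is actually used.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set; only the first entry is from the article.
TRAINING = [
    ("chief medical officer", "C-level"),
    ("chief financial officer", "C-level"),
    ("office manager", "Manager-level"),
    ("sales manager", "Manager-level"),
]


def word_rank_counts(pairs):
    """Count, for every word, how many training titles of each rank contain it."""
    counts = defaultdict(Counter)
    for title, rank in pairs:
        for word in set(title.split()):
            counts[word][rank] += 1
    return counts


def rank_share(counts, word, rank):
    """Fraction of the training titles containing `word` that carry `rank`."""
    total = sum(counts[word].values())
    return counts[word][rank] / total if total else 0.0
```

With this data, rank_share gives medical the same share of C-level titles as chief, namely all of them. From one example alone, nothing tells the learner that chief is the word that matters and medical is incidental.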

Why is this happening? We are expecting the machine learning algorithm to automatically figure out which words, and which two-word phrases, predict specific ranks and which don’t. Hundreds of thousands of different words can occur in the imagined universe of titles. (The contacts database at Data.com has more than ten million distinct titles.) The number of possible two-word phrases is that figure squared, which is at least in the tens of billions. For the machine learning solution to automatically discover which of these words and two-word phrases predict specific ranks requires a very large training set.

Can we alleviate this issue by limiting our features to words? The reasoning is that limiting features to words drastically reduces the universe of feature values, so a significantly smaller training set suffices to learn the associations between individual words and ranks.

Yes, but we pay a price for it in reduced accuracy. Certain two-word phrases, for instance vice president, predict ranks more accurately than the independent combination of the words in them. (president predicts C-level; vice in and of itself does not strongly predict VP-level.)

Moreover, the number of distinct words in the universe of titles is still rather large, so the requisite training set will still remain large.

If a very large training set is available, great. If not, as is often the case, what to do? Let’s revisit the keyword → rank rule-based approach.

A Rules-Based Solution

Consider the rule

manager → manager-level

Interpret this as “if the title contains the word manager, classify it to the rank manager-level”.

This single rule correctly classifies most (but not all) titles that contain the word manager. In the parlance of machine learning, this one rule generalizes massively (albeit not perfectly).

To improve on this, the following mechanism helps:

If two rules fire on a particular title, and the antecedent of one of the two is a subphrase of the antecedent of the other, override the former rule.

Let’s see an example.

Add the following rule:

general manager → VP-level

Consider the title General Manager, data.com. Both rules fire on this title. The general manager rule wins because manager is a subphrase of general manager. This results in the title getting classified to VP-level.
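The override mechanism can be sketched as follows. The two rules are the ones from the text; the helper names (tokenize, contains_phrase, surviving_rules) and the tokenization are assumptions made for illustration.

```python
import re

# The two rules discussed above: antecedent phrase -> rank.
RULES = {
    "manager": "Manager-level",
    "general manager": "VP-level",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(words, phrase):
    """True if the word list `phrase` occurs contiguously inside `words`."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


def surviving_rules(title, rules=RULES):
    """Return the rules that fire on the title and are not overridden."""
    words = tokenize(title)
    fired = [a for a in rules if contains_phrase(words, a.split())]
    # A fired rule is overridden when its antecedent is a subphrase of the
    # antecedent of a different fired rule.
    return {
        a: rules[a]
        for a in fired
        if not any(a != b and contains_phrase(b.split(), a.split()) for b in fired)
    }
```

For General Manager, data.com only the general manager rule survives, so the title is classified to VP-level. For E-Commerce Project Manager only the manager rule fires, and it survives unchallenged.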

How do we ensure that Assistant to Vice President gets classified to Staff whereas Assistant Vice President to VP-level?

We need to add a simple mechanism: a numeric strength for each rule, shown in parentheses below. To illustrate this, imagine that the rule set is as follows:

1. assistant → Staff (1)
2. vice president → VP-level (2)
3. assistant to → Staff (3)

Consider the title Assistant to Vice President. Rules 1, 2, and 3 all fire on this title. Rule 3 overrides rule 1. Next, rule 3 predicts Staff more strongly than rule 2 predicts VP-level. So Staff wins. Next, consider the title Assistant Vice President. Rules 1 and 2 fire. Rule 2 predicts VP-level more strongly than rule 1 predicts Staff. So VP-level wins.
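Putting the override step and the strengths together, a classifier along these lines reproduces the walk-through. It is a sketch under assumptions: the helper names and tokenization are ours, and the text does not say how ties between equally strong rules should be broken (here the rule listed first wins).

```python
import re

# The three rules from the text as (antecedent, rank, strength).
RULES = [
    ("assistant", "Staff", 1),
    ("vice president", "VP-level", 2),
    ("assistant to", "Staff", 3),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(words, phrase):
    """True if the word list `phrase` occurs contiguously inside `words`."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


def classify(title, rules=RULES, default=None):
    """Apply subphrase overrides, then let the strongest surviving rule win."""
    words = tokenize(title)
    fired = [r for r in rules if contains_phrase(words, r[0].split())]
    # Step 1: drop any fired rule whose antecedent is a subphrase of the
    # antecedent of another fired rule.
    survivors = [
        r for r in fired
        if not any(r[0] != o[0] and contains_phrase(o[0].split(), r[0].split())
                   for o in fired)
    ]
    if not survivors:
        return default
    # Step 2: the surviving rule with the highest strength decides the rank.
    return max(survivors, key=lambda r: r[2])[1]
```

With these three rules, classify("Assistant to Vice President") returns Staff and classify("Assistant Vice President") returns VP-level, matching the reasoning above.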

It turns out that by hand-crafting a couple of hundred such rules one can achieve a high classification accuracy on a large test set of titles.

Sure, hand-crafting a few hundred rules takes work. Putting together a training set of tens to hundreds of thousands of titles pre-classified to ranks might take a whole lot more work.

Combining Rules and Machine Learning

The rules-based approach gives us massive generalization from a small set of rules. However, it doesn’t automatically learn from its mistakes. If feedback is expected to arrive continually (even if at a low rate), automated learning from such feedback to improve classification accuracy is very attractive. The alternative of manually adjusting the rules from such feedback is more laborious, and puts humans in the loop. (Humans are intelligent, but don’t scale.)

So a sensible combination would be to use the rule-based approach to quickly get a decent classifier off the ground; then use machine learning to automatically adjust the rules from feedback.

For instance, machine learning can be used to automatically adjust the strengths of the various rules from feedback.
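One possible way to do this, offered purely as an illustration, is a mistake-driven update in the spirit of the perceptron: when feedback says a title was misclassified, weaken the rule that won and strengthen the surviving rules that voted for the correct rank. Nothing here is a prescribed method. The update rule, the step size, and all function names are assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(words, phrase):
    """True if the word list `phrase` occurs contiguously inside `words`."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


def surviving_rules(title, strengths):
    """Antecedents that fire on the title and are not overridden by a longer one."""
    words = tokenize(title)
    fired = [a for a in strengths if contains_phrase(words, a.split())]
    return [a for a in fired
            if not any(a != b and contains_phrase(b.split(), a.split())
                       for b in fired)]


def classify(title, ranks, strengths):
    """Rank of the strongest surviving rule, or None if no rule fires."""
    survivors = surviving_rules(title, strengths)
    if not survivors:
        return None
    return ranks[max(survivors, key=lambda a: strengths[a])]


def learn_from_feedback(title, correct_rank, ranks, strengths, step=1.0):
    """Hypothetical update: adjust strengths only when the prediction was wrong."""
    survivors = surviving_rules(title, strengths)
    if not survivors:
        return  # no rule fired; a new rule would have to be written by hand
    winner = max(survivors, key=lambda a: strengths[a])
    if ranks[winner] == correct_rank:
        return  # prediction was right, leave the strengths alone
    strengths[winner] -= step  # weaken the rule that won incorrectly
    for antecedent in survivors:
        if ranks[antecedent] == correct_rank:
            strengths[antecedent] += step  # strengthen rules that were right
```

For example, start with deliberately wrong strengths of 2.0 for assistant → Staff and 1.0 for vice president → VP-level. A single piece of feedback that Assistant Vice President should be VP-level swaps the two strengths, and the title is classified correctly from then on. This sketch only tunes existing rules. It cannot invent a missing one.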

How do you solve such issues? We’d love to hear from you.

If you’re interested in these sorts of problems, Salesforce is hiring! Visit http://www.salesforce.com/tech to find your #dreamjob.

September 23, 2015
