# GloVe and Word Vectors for Sentiment Analysis

In this workbook, we take another look at the Stanford Sentiment Treebank and use what we learned to do some basic sentiment analysis. If you haven't completed [Word Meaning and Word2Vec](https://trailhead.salesforce.com/content/learn/modules/word-meaning-and-word2vec), we suggest you do that first.

Code sections of the notebook appear in grey cells. To run the code in a cell, hover over the brackets in the upper left corner of the cell and click the play button, or press Shift+Enter. You can edit the code in any cell. Before running a cell, be sure that you've run all the cells above it to avoid errors.

When you have completed the lab, return to Trailhead to enter your answers to the exercises in the quiz section and get points.

In [0]:
import matplotlib.pyplot as plt
import random
import collections
import numpy as np
import os
# import the submodule explicitly; `import urllib` alone does not guarantee urllib.request is loaded
import urllib.request
import zipfile
import math
import datetime as dt
import string
import re
import time
from tqdm import tqdm
import tarfile
import io
import array

!pip3 install http://download.pytorch.org/whl/cu80/torch-0.4.0-cp36-cp36m-linux_x86_64.whl
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.modules.module import Module

def println(*x):
  print(*x)
  print()

# Data

## The Stanford Sentiment Treebank
We'll be playing around with the Stanford Sentiment Treebank as an example of sentiment analysis, or sentiment classification.

### Download SST

In [0]:
train_url = 'https://raw.githubusercontent.com/salesforce/decaNLP/master/local_data/train_fine_sent.csv'
dev_url = 'https://raw.githubusercontent.com/salesforce/decaNLP/master/local_data/dev_fine_sent.csv'

def download(url):
    filename = os.path.basename(url)
    if not os.path.exists(filename):
        downloaded_path, _ = urllib.request.urlretrieve(url, filename)
    else:
        downloaded_path = filename
    return downloaded_path

train_path = download(train_url)
dev_path = download(dev_url)

In [0]:
print('Stanford Sentiment Treebank train: ', train_path)
print('Stanford Sentiment Treebank dev: ', dev_path)

# Hands-on: Extracting examples

In this cell, we're extracting and cleaning up the SST data. The SST dataset comes with labels 0-4 that indicate very negative, negative, neutral, positive, and very positive examples. The binary version of the task merges labels 0 with 1 and 3 with 4, throwing away the neutral label 2. For simplicity, we're using the binary version. 

It's your job to implement the `binarize()` function to get us there. The function takes an example in the form `(sentence, label)` and returns a tuple of the form `(sentence, binarized_label)`. Map labels 0 and 1 to label 0, labels 3 and 4 to label 1, and label 2 to `None`.
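One way to picture the mapping is as a small lookup on the five fine-grained labels. The sketch below uses a hypothetical helper, `to_binary_label`, and made-up toy examples; neither is part of the notebook, and it is meant as an illustration of the rule, not as the required solution.

```python
# Hypothetical helper illustrating the label mapping described above.
def to_binary_label(label):
    """Map a fine-grained SST label (0-4) to 0, 1, or None."""
    if label in (0, 1):
        return 0
    if label in (3, 4):
        return 1
    return None  # the neutral label 2 has no binary counterpart

# Made-up toy examples in the (sentence, label) form used by the notebook.
toy_examples = [(['awful', 'film'], 0), (['fine'], 2), (['great', 'fun'], 4)]
toy_binarized = [(sentence, to_binary_label(label)) for sentence, label in toy_examples]
print(toy_binarized)
```

Only the label changes; the tokenized sentence is passed through untouched.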

In [0]:
train_examples = None

# We will need to clean the sentiment data
def clean(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)
    def lower(text):
        return text.lower()
    return white_space_fix(remove_articles(remove_punc(lower(s))))
  
def get_examples(path):
  with open(path) as f:
    # skip the headers of the csv file
    next(f) 
    examples = [(clean(s[2:]).split(), int(s[0])) for s in f] 
    return examples
  
def binarize(ex):
  # TODO: Implement as described

binary_train_examples = [binarize(ex) for ex in get_examples(train_path) if not ex[1] == 2]
binary_dev_examples = [binarize(ex) for ex in get_examples(dev_path) if not ex[1] == 2]

In [0]:
print('Stanford Sentiment Treebank training example: ', binary_train_examples[0])
print('Stanford Sentiment Treebank dev example: ', binary_dev_examples[0])

# Quiz Question 1
What is the first word of the first binarized training example?

# Quiz Question 2
What is the first word of the first binarized dev example?

# Create a vocabulary

In [0]:
class Vocabulary:
  
  def __init__(self, sentences):
    word_counts = collections.Counter([w for s in sentences for w in s ])
    print('Sentences contain ', len(word_counts), ' words.')
                                 
    # Replace words missing from word_counts with a special unknown token.
    # No frequency cutoff is applied here, so every word is currently kept.
    unk_token = 'UNK'
    new_sentences = []
    for sentence in sentences:
      new_sentence = []
      for word in sentence:
        if word in word_counts:
          new_sentence.append(word)
        else:
          new_sentence.append(unk_token)
      new_sentences.append(new_sentence)
    self.sentences = new_sentences
                            
          
    self.index_to_word, self.word_to_index = {}, {}
    sorted_common_word_counts = sorted(word_counts.items(), 
                                       key=lambda tup: (-tup[1], tup[0]))


    for idx, (word, count) in enumerate(sorted_common_word_counts):
      self.index_to_word[idx] = word
      self.word_to_index[word] = idx
      
    self.index_to_word[idx+1] = 'PAD'
    self.index_to_word[idx+2] = unk_token
    self.word_to_index['PAD'] = idx+1
    self.word_to_index[unk_token] = idx+2

n_train = len(binary_train_examples)
n_dev = len(binary_dev_examples)
examples = binary_train_examples + binary_dev_examples
sentences = [x[0] for x in examples]
train_labels = [x[1] for x in binary_train_examples]
dev_labels = [x[1] for x in binary_dev_examples]
vocab = Vocabulary(sentences)
unked = vocab.sentences
train_examples = list(zip(unked[:n_train], train_labels))
dev_examples = list(zip(unked[-n_dev:], dev_labels))

In [0]:
print('The vocabulary has ', len(vocab.index_to_word), 'words.')
print(list(vocab.index_to_word[i] for i in range(10)))
print(train_examples[0][:10])
print(dev_examples[0][:10])

# Quiz Question 3
Why does the vocabulary have more words than the sentences contain?

# Quiz Question 4
What is the most frequent word in the vocabulary?

# Hands-on: Numericalize the data
Implement the `numericalize()` method. This method takes in an example of the form `(sentence, label)` and returns a tuple of the form `(numericalized_sentence, label)`. In this case, numericalization means that the words are replaced with the corresponding index in the vocabulary.
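To make the idea concrete, here is a minimal sketch on a toy vocabulary. The dictionary `toy_word_to_index` and the function `toy_numericalize` are hypothetical stand-ins (the notebook's real mapping lives in `vocab.word_to_index`), so treat this as an illustration and not as the expected implementation.

```python
# Toy vocabulary (hypothetical); the notebook's real mapping is vocab.word_to_index.
toy_word_to_index = {'movie': 0, 'good': 1, 'bad': 2, 'PAD': 3, 'UNK': 4}

def toy_numericalize(example, word_to_index):
    """Replace each word in a (sentence, label) pair with its vocabulary index."""
    sentence, label = example
    return ([word_to_index[word] for word in sentence], label)

toy_result = toy_numericalize((['good', 'movie'], 1), toy_word_to_index)
print(toy_result)  # ([1, 0], 1)
```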

In [0]:
def numericalize(example, vocabulary):
  # TODO: implement as described

train_numericalized = [numericalize(s, vocab) for s in train_examples]
dev_numericalized = [numericalize(s, vocab) for s in dev_examples]

In [0]:
print(train_numericalized[0][:10])
print(dev_numericalized[0][:10])

# Quiz Question 5
What is the index of the first word in the first example of `train_numericalized`?

# Quiz Question 6
What is the index of the first word in the first example of `dev_numericalized`?

# Hands-on: Construct minibatches


In these functions, we create minibatches so that the model can process many examples in parallel. At the TODO statement, append the index of the special PAD token as many times as needed to pad each input to the same length (`max_len`).
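Padding can be sketched on plain Python lists. The value `PAD_INDEX` and the sequences below are made up for illustration; in the notebook the PAD index would come from `vocab.word_to_index['PAD']`. This sketch builds new lists, whereas the notebook's `batch()` extends each input in place.

```python
PAD_INDEX = 3  # hypothetical; in the notebook this would be vocab.word_to_index['PAD']

toy_inputs = [[5, 2, 9], [7], [4, 1]]
max_len = max(len(seq) for seq in toy_inputs)

# Append PAD_INDEX to each sequence until it reaches max_len.
padded = [seq + [PAD_INDEX] * (max_len - len(seq)) for seq in toy_inputs]
print(padded)  # [[5, 2, 9], [7, 3, 3], [4, 1, 3]]
```

Once every sequence has the same length, the list of lists can be turned into a single rectangular tensor.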

In [0]:
def tensorize(x, dtype=torch.long):
  if len(x) == 0:
    return None
  return torch.tensor(x, dtype=dtype)

def batch(examples, vocab, batch_size=64):
  if len(examples) > 1:
    example_indices = random.sample(range(0, len(examples)-1), batch_size)
  else:
    example_indices = [0]
  batch_examples = [examples[idx] for idx in example_indices]
  inputs = [example[0] for example in batch_examples]
  max_len = max([len(input) for input in inputs])
  for input in inputs:
    if len(input) < max_len:
      input += # TODO: implement as described
  targets = [example[1] for example in batch_examples]
  return [tensorize(inputs), tensorize(targets)]


In [0]:
random.seed(123)
for x in batch(train_numericalized, vocab):
  print(x)
for x in batch(dev_numericalized, vocab):
  print(x)

# Quiz Question 7
What is the first number in the first tensor printed out above?

# Quiz Question 8
What is the last number of the last tensor printed out above?

# Hands-on: Define the CBOW Model
This model takes the word vectors for each word in the sentence, adds them all together, and then feeds the result through a *two-layer neural network* with [ReLU activation functions](https://pytorch.org/docs/stable/nn.html#relu) before using a final linear layer over the output classes to compute scores for each class. In practice, creating this network means using the [`nn.Sequential`](https://pytorch.org/docs/stable/nn.html#sequential) container. The container takes as arguments the `nn.Linear` and `nn.ReLU` modules for each layer you want to construct (hint: you need to call `nn.ReLU` twice and `nn.Linear` three times).

Then, implement the `forward()` method as described in the comments.
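The "add them all together" step and the ReLU activation can be illustrated without PyTorch. The sketch below uses made-up 3-dimensional vectors and plain Python lists; the names `toy_vectors` and `relu` are hypothetical and are not the notebook's API. In the model itself, the same sum would typically be taken over the sequence dimension of the embedded batch.

```python
# Made-up 3-dimensional word vectors for a two-word sentence.
toy_vectors = {'good': [0.5, 1.0, -1.0], 'movie': [1.5, 0.0, 2.0]}
sentence = ['good', 'movie']

# Continuous bag-of-words: sum the word vectors dimension by dimension.
cbow = [sum(dims) for dims in zip(*(toy_vectors[word] for word in sentence))]
print(cbow)  # [2.0, 1.0, 1.0]

def relu(values):
    """Elementwise ReLU: negative entries become 0, others pass through."""
    return [max(0.0, v) for v in values]

print(relu([-1.0, 0.5]))  # [0.0, 0.5]
```

Because the vectors are summed, word order is discarded, which is what makes this a bag-of-words representation.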

In [0]:
class CBOW(nn.Module):
  
  def __init__(self, vocab_size, num_classes=5, embedding_size=300, hidden_size=128):
    super().__init__()
    self.embedding = nn.Embedding(vocab_size, embedding_size)
    # TODO: define a ReLU network as described above
    self.relu_network = 
    
  def forward(self, batch, train_embeddings):    
    inputs = batch[0]
    # TODO: obtain the input vectors
    input_vectors = 
    # TODO: if the input vectors are pretrained and should not be overwritten
    # then detach them here; otherwise set input_vectors to be itself
    input_vectors = 
    # TODO: compute a continuous bag-of-words over the input vectors
    cbow = 
    # TODO: compute scores 
    scores = 
    return scores

# Training CBOW on SST
We'll now need to define a training loop that trains our models over some number of iterations. The inline comments walk you through this process. Check out the [PyTorch documentation](https://pytorch.org/docs/stable/optim.html) for a review of optimizers.
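As a reminder of what the loss in the training loop measures, here is a hand-written cross entropy for a single example in plain Python. The function `cross_entropy` below is an illustrative sketch, not the PyTorch call used in the notebook; PyTorch's `F.cross_entropy` computes this quantity per example and, by default, averages it over the batch.

```python
import math

def cross_entropy(scores, target):
    """Negative log of the softmax probability assigned to the target class."""
    max_score = max(scores)  # subtract the max for numerical stability
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum_exp - scores[target]

# With two equal scores, each class gets probability 0.5, so the loss is log(2).
toy_loss = cross_entropy([0.0, 0.0], target=1)
print(toy_loss)
```

The loss shrinks as the score of the correct class grows relative to the others, which is the signal the optimizer follows when it updates the parameters.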

In [0]:
def get_trainable_parameters(model):
  """Returns the trainable parameters of a model"""
  return list(filter(lambda p: p.requires_grad, model.parameters()))

def denumericalize(vocab, x):
  return [vocab.index_to_word[y] for y in x]

def train(model, vocab, train_dataset, dev_dataset, device, 
          max_iterations=int(1e4), log_every=1e2,
          val_every=1e2, batch_size=64, train_embeddings=True):
  best_dev_acc = 0
  print('Training on ', len(train_dataset), ' examples for ',
        max_iterations, ' iterations with ', len(vocab.index_to_word), ' words in the vocabulary.')
  model.to(device)
  model.train()
  # TODO: initialize a default Adam optimizer
  # Hint: use get_trainable_parameters
  opt =  
  avg_loss = 0
  for iteration in range(max_iterations):
    # TODO: zero out the gradients the optimizer is tracking
    
    # TODO: get the next batch from the dataset
    b = 
    b = [x.to(device) for x in b]
    # TODO: get scores from the model 
    scores =
    
    targets = b[1]
    # TODO: use scores and targets to compute a cross entropy loss
    loss = 
 
    # TODO: compute gradients using the loss

    # TODO: update your parameters by using the optimizer to take a step

    # logging  
    avg_loss += loss.item()
    if (iteration + 1) % log_every == 0:
      print(f'Iteration: {iteration + 1}, avg_loss: {avg_loss / log_every}')
      avg_loss = 0
      
    # validating on the dev_dataset
    if (iteration + 1) % val_every == 0:
      model.eval()
      num_correct = 0
      for idx in range(len(dev_dataset)):
        dev_example = [dev_dataset[idx]]
        b = batch(dev_example, vocab, batch_size=1)
        b = [x.to(device) for x in b]
        scores = model(b, False)
        predictions = scores.argmax(1)
        if predictions.item() == b[1].item():
          num_correct += 1
      dev_acc = num_correct / len(dev_dataset)
      print('Validation accuracy: ', dev_acc)
      if dev_acc > best_dev_acc:
        best_dev_acc = dev_acc
      # switch back to training mode once validation is done
      model.train()
      
  print('Best validation accuracy: ', best_dev_acc)

# Pretrained Word Vectors
Getting pretrained word vectors usually involves a decent amount of code just to handle downloading large files from the internet, parsing them to create dictionaries for all the words, and then maintaining large tensors for the actual word vectors. While it would be a good exercise to do this yourself, we've included code below that conveniently gives us access to both GloVe vectors and another kind of pretrained word vectors called FastText vectors. We'll compare these vectors against training from scratch and against each other.

This code was pulled from [decaNLP](https://github.com/salesforce/decaNLP/blob/203a02e2326de65400a8d3dce63fdb0f4ae0c324/text/torchtext/vocab.py) and then customized for this notebook due to RAM constraints. There's no code for you to implement here, but take a look at how this snippet works.
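The core of the parsing step is small. The sketch below reads a made-up, in-memory file in the `word v1 v2 ... vn` text layout and keeps only the words found in a toy vocabulary, which mirrors the memory-saving filter in the snippet that follows. The names `toy_file`, `toy_vocab`, and `toy_table` are hypothetical, and the vectors are invented values.

```python
import io

# Made-up lines in the "word v1 v2 ... vn" text layout read by the Vectors class.
toy_file = io.StringIO("good 0.1 0.2 0.3\nmovie 0.4 0.5 0.6\nrare 0.7 0.8 0.9\n")
toy_vocab = {'good', 'movie'}  # hypothetical vocabulary

toy_table = {}
for line in toy_file:
    # Split on a literal space, as the notebook's loader does.
    word, *entries = line.rstrip().split(" ")
    if word not in toy_vocab:
        continue  # skip words we do not need, to save memory
    toy_table[word] = [float(x) for x in entries]

print(toy_table)
```

Filtering while reading means only vectors for words in the vocabulary are ever held in memory.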

In [0]:
def reporthook(t):
    """https://github.com/tqdm/tqdm"""
    last_b = [0]

    def inner(b=1, bsize=1, tsize=None):
        """
        b: int, optional
        Number of blocks just transferred [default: 1].
        bsize: int, optional
        Size of each block (in tqdm units) [default: 1].
        tsize: int, optional
        Total size (in tqdm units). If [default: None] remains unchanged.
        """
        if tsize is not None:
            t.total = tsize
        t.update((b - last_b[0]) * bsize)
        last_b[0] = b
    return inner

class Vectors(object):

    def __init__(self, name, vocab, cache='.vector_cache',
                 url=None):
        """Arguments:
               name: name of the file that contains the vectors
               cache: directory for cached vectors
               url: url for download if vectors not found in cache
         """
        self.cache(name, cache, vocab, url=url)

    def __getitem__(self, token):
        if token in self.stoi:
            return self.vectors[self.stoi[token]]
        else:
            return torch.Tensor.zero_(torch.Tensor(1, self.dim))

    def cache(self, name, cache, vocab, url=None):
        if os.path.isfile(name):
            path = name
            path_pt = os.path.join(cache, os.path.basename(name)) + '.pt'
        else:
            path = os.path.join(cache, name)
            path_pt = path + '.pt'

        if not os.path.isfile(path_pt):
            if not os.path.isfile(path) and url:
                print('Downloading vectors from {}'.format(url))
                if not os.path.exists(cache):
                    os.makedirs(cache)
                dest = os.path.join(cache, os.path.basename(url))
                if not os.path.isfile(dest):
                    with tqdm(unit='B', unit_scale=True, miniters=1, desc=dest) as t:
                        urllib.request.urlretrieve(url, dest, reporthook=reporthook(t))
                print('Extracting vectors into {}. This is going to take a while.'.format(cache))
                ext = os.path.splitext(dest)[1][1:]
                if ext == 'zip':
                    with zipfile.ZipFile(dest, "r") as zf:
                        zf.extractall(cache)
                elif ext == 'gz':
                    with tarfile.open(dest, 'r:gz') as tar:
                        tar.extractall(path=cache)
            if not os.path.isfile(path):
                raise RuntimeError('no vectors found at {}'.format(path))

            itos, vectors, dim = [], [], None

            # Try to read the whole file with utf-8 encoding.
            binary_lines = False
            try:
                with io.open(path, encoding="utf8") as f:
                    lines = [line for line in f]
            # If there are malformed lines, read in binary mode
            # and manually decode each word from utf-8
            except:
                print("Could not read {} as UTF8 file, "
                               "reading file as bytes and skipping "
                               "words with malformed UTF8.".format(path))
                with open(path, 'rb') as f:
                    lines = [line for line in f]
                binary_lines = True

            print("Loading vectors from {}".format(path))
            for line in tqdm(lines, total=len(lines)):
                # Explicitly splitting on " " is important, so we don't
                # get rid of Unicode non-breaking spaces in the vectors.
                entries = line.rstrip().split(b" " if binary_lines else " ")

                word, entries = entries[0], entries[1:]
                if word not in vocab.word_to_index:
                    continue
                if dim is None and len(entries) > 1:
                    dim = len(entries)
                elif len(entries) == 1:
                    print("Skipping token {} with 1-dimensional "
                                   "vector {}; likely a header".format(word, entries))
                    continue
                elif dim != len(entries):
                    raise RuntimeError(
                        "Vector for token {} has {} dimensions, but previously "
                        "read vectors have {} dimensions. All vectors must have "
                        "the same number of dimensions.".format(word, len(entries), dim))

                if binary_lines:
                    try:
                        # `six` is never imported; on Python 3 the binary type is simply `bytes`
                        if isinstance(word, bytes):
                            word = word.decode('utf-8')
                    except:
                        print("Skipping non-UTF8 token {}".format(repr(word)))
                        continue
                vectors.append([float(x) for x in entries])
                itos.append(word)

            self.itos = itos
            self.stoi = {word: i for i, word in enumerate(itos)}
            self.dim = dim
            self.vectors = torch.Tensor(vectors).view(-1, dim)
            print('Saving vectors to {}'.format(path_pt))
            torch.save((self.itos, self.stoi, self.vectors, self.dim), path_pt)
        else:
            print('Loading vectors from {}'.format(path_pt))
            self.itos, self.stoi, self.vectors, self.dim = torch.load(path_pt)



class GloVe(Vectors):
    url = {
        '42B': 'http://nlp.stanford.edu/data/glove.42B.300d.zip',
        '840B': 'http://nlp.stanford.edu/data/glove.840B.300d.zip',
        'twitter.27B': 'http://nlp.stanford.edu/data/glove.twitter.27B.zip',
        '6B': 'http://nlp.stanford.edu/data/glove.6B.zip',
    }

    def __init__(self, vocab, name='840B', dim=300, **kwargs):
        url = self.url[name]
        name = 'glove.{}.{}d.txt'.format(name, str(dim))
        super(GloVe, self).__init__(name, vocab, url=url, **kwargs)


class FastText(Vectors):

    url_base = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.{}.vec'

    def __init__(self, vocab, language="en", **kwargs):
        url = self.url_base.format(language)
        name = os.path.basename(url)
        super(FastText, self).__init__(name, vocab, url=url, **kwargs)

def _default_unk_index():
    return 0
  
def load_vectors(pretrained_vectors, vocab):
  vectors = torch.zeros(len(vocab.index_to_word), pretrained_vectors.dim)
  for index, word in vocab.index_to_word.items():
    vectors[index] = pretrained_vectors[word]
  return vectors

## Train SST from Scratch

In [0]:
torch.manual_seed(123)
random.seed(123)
device = torch.device("cpu")

start_time = time.time()
train(CBOW(vocab_size=len(vocab.index_to_word)), vocab, train_numericalized, dev_numericalized, device)
print('Time Elapsed: ', time.time() - start_time)

# Quiz Question 9
What is the best validation accuracy achieved by training the CBOW model from scratch?

## Train with GloVe


In [0]:
torch.manual_seed(123)
random.seed(123)
device = torch.device("cpu")

model = CBOW(vocab_size=len(vocab.index_to_word))
glove = GloVe(vocab)
vectors = load_vectors(glove, vocab)
model.embedding.weight.data = vectors

start_time = time.time()
train(model, vocab, train_numericalized, dev_numericalized, device, train_embeddings=False)
print('Time Elapsed: ', time.time() - start_time)

# Quiz Question 10 
What is the best validation accuracy achieved by the model using GloVe vectors?

## Train with FastText

In [0]:
torch.manual_seed(123)
random.seed(123)
device = torch.device("cpu")

model = CBOW(vocab_size=len(vocab.index_to_word))
fasttext = FastText(vocab)
vectors = load_vectors(fasttext, vocab)
model.embedding.weight.data = vectors

start_time = time.time()
train(model, vocab, train_numericalized, dev_numericalized, device, train_embeddings=False)
print('Time Elapsed: ', time.time() - start_time)

# Quiz Question 11
What is the best validation accuracy achieved by the model using FastText vectors?

# Quiz Question 12
Which was faster, training CBOW using pretrained vectors or using randomly initialized vectors?