Introduction

In this NLP Getting Started challenge on Kaggle, we are given tweets which are labelled 1 if they are about real disasters and 0 if not. The goal is to predict, given the text of a tweet and some other metadata about it, whether it is about a real disaster or not.
In this Part 5, on data preparation for Deep Learning, I will use the raw data with the splits generated in Part 2 to create a single Data Module class that holds all the preprocessing, vectorization and PyTorch DataLoader logic, in preparation for the deep learning models we will build with PyTorch. Following up from Part 4 on tree-based models, I will move on to creating neural networks to classify these tweets in the upcoming posts.

Vocabulary, Vectorizer and More

I will be using the pytorch-lightning framework to build a consistent DataModule class which holds the logic for generating the train, val and test datasets and for creating the corresponding data loaders, with their in-built batching and shuffling functionality. First things first, let’s start with the Vocabulary class.

Vocabulary

The Vocabulary class is responsible for converting the individual tokens (i.e. words) in our tweets into numerical indices. It is basically a simple dictionary with some sugar.

class Vocabulary:
    def __init__(self):
        # Two mirrored dictionaries: token -> index and index -> token
        self._token_to_idx = {}
        self._idx_to_token = {}

    def add_token(self, token):
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index

    def lookup_token(self, token):
        return self._token_to_idx[token]

    def lookup_index(self, index):
        if index not in self._idx_to_token:
            raise KeyError('the index {} is not in the vocab'.format(index))
        return self._idx_to_token[index]

    def get_token_to_idx(self):
        # Exposes the token -> index mapping; used later when building the embedding matrix
        return self._token_to_idx
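
To make the behaviour concrete, here is a quick usage sketch of the class above (the tokens and indices are purely illustrative):

vocab = Vocabulary()
vocab.add_token("fire")   # first token gets index 0
vocab.add_token("alarm")  # index 1
vocab.add_token("fire")   # already present, still index 0

print(vocab.lookup_token("alarm"))  # 1
print(vocab.lookup_index(0))        # 'fire'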

For any text vectorization task, along with transforming individual tokens into numerical indices, we will also need a few other things -

  • Begin Token - A special token (<BEGIN>) that signifies the beginning of a new tweet.
  • End Token - A special token (<END>) that signifies the end of a tweet.
  • Unknown Token - A special token (<UNK>) that signifies out-of-vocabulary words.
  • Mask Token - A special token (<MASK>) that is used to make all the tweet vectors the same length by appending mask tokens at the end.

Let’s create a SequenceVocabulary class which includes these 4 special tokens in the vocabulary by default -

class SequenceVocabulary(Vocabulary):
    def __init__(self, unk_token="<UNK>", mask_token="<MASK>",
                 begin_seq_token="<BEGIN>", end_seq_token="<END>"):
        super(SequenceVocabulary, self).__init__()
        self._mask_token = mask_token
        self._unk_token = unk_token
        self._begin_seq_token = begin_seq_token
        self._end_seq_token = end_seq_token

        self.mask_index = self.add_token(self._mask_token)
        self.unk_index = self.add_token(self._unk_token)
        self.begin_seq_index = self.add_token(self._begin_seq_token)
        self.end_seq_index = self.add_token(self._end_seq_token)

    def lookup_token(self, token):
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]
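
As a quick illustrative sketch, the four special tokens occupy the first four indices, and any out-of-vocabulary word falls back to the <UNK> index:

seq_vocab = SequenceVocabulary()  # <MASK>=0, <UNK>=1, <BEGIN>=2, <END>=3
seq_vocab.add_token("fire")                   # gets index 4

print(seq_vocab.lookup_token("fire"))         # 4
print(seq_vocab.lookup_token("earthquake"))   # 1, i.e. the <UNK> index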

Vectorizer

Now that we have our vocabulary in place, we need a vectorizer class that, given a dataframe, creates the vocabulary for that dataset and also provides the functionality to transform a single tweet text into a numerical vector representation.

import string
from collections import Counter

import numpy as np

class TweetsVectorizer:
    def __init__(self, text_vocab):
        self.text_vocab = text_vocab

    def vectorize(self, text, vector_length=-1):
        # Wrap the token indices with the <BEGIN> and <END> indices
        indices = [self.text_vocab.begin_seq_index]
        indices.extend(self.text_vocab.lookup_token(token) for token in text.split(" "))
        indices.append(self.text_vocab.end_seq_index)

        if vector_length < 0:
            vector_length = len(indices)

        # Pad the remainder of the vector with the <MASK> index
        out_vector = np.zeros(vector_length, dtype=np.int64)
        out_vector[:len(indices)] = indices
        out_vector[len(indices):] = self.text_vocab.mask_index

        return out_vector, len(indices)

    @classmethod
    def from_dataframe(cls, tweets_df, cutoff=1):
        # Count tokens across the dataframe, ignoring bare punctuation
        word_counts = Counter()
        for text in tweets_df.text:
            for token in text.split(" "):
                if token not in string.punctuation:
                    word_counts[token] += 1

        # Only keep tokens that occur more often than the cutoff
        text_vocab = SequenceVocabulary()
        for word, word_count in word_counts.items():
            if word_count > cutoff:
                text_vocab.add_token(word)

        return cls(text_vocab)

The TweetsVectorizer class above handles a couple of things -

  • The from_dataframe method takes in the tweets dataframe and creates the SequenceVocabulary.
  • Each tweet text is split into tokens (by space) and token counts are generated.
  • All the tokens with counts higher than the given cutoff are added to our text vocabulary.
  • The vectorize method is the main entry point: it vectorizes a given tweet text using the previously created text vocabulary.
  • First the <BEGIN> token index is added to the vector, then each of the words in the text is looked up in the vocabulary and appended, and finally the <END> token index is appended.
  • This vector is then padded to a fixed length by appending <MASK> token indices at the end.
  • Along with the final vector, we also return the actual length of the vector without any <MASK> token indices.

As an example, if our vocabulary looks as follows -

Token     Index
<MASK>    0
<UNK>     1
<BEGIN>   2
<END>     3
the       4
fire      5
alarm     6
went      7
off       8
....      ....

The vectorize function will transform the tweet -

the fire alarm went off

to -

[2, 4, 5, 6, 7, 8, 3, 0, 0, 0, 0, 0, 0, ….. 0]

with 0s (the <MASK> index) appended at the end up to the maximum sequence length, i.e. the length of the longest tweet plus the begin and end tokens.
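
Here is a minimal sketch that reproduces the example above on a toy dataframe. Note that I pass cutoff=0 so that every word is kept; with the default cutoff of 1, a word has to occur more than once to make it into the vocabulary:

import pandas as pd

# Toy dataframe with a single tweet, purely for illustration
toy_df = pd.DataFrame({"text": ["the fire alarm went off"]})
vectorizer = TweetsVectorizer.from_dataframe(toy_df, cutoff=0)

vector, length = vectorizer.vectorize("the fire alarm went off", vector_length=12)
print(vector)  # [2 4 5 6 7 8 3 0 0 0 0 0]
print(length)  # 7 -> 5 words + <BEGIN> + <END>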

Dataset

Now we are ready to create the Dataset object for our data which is the abstraction provided by PyTorch for easy data transport and loading.

import torch
from torch.utils.data import Dataset

class TweetsDataset(Dataset):
    def __init__(self, tweets_df, vectorizer, max_seq_length=-1):
        self.tweets_df = tweets_df
        self._vectorizer = vectorizer
        self._max_seq_length = max_seq_length

    @classmethod
    def load_dataset_and_make_vectorizer(cls, tweets_df):
        # Build the vectorizer from the train split only
        train_tweets_df = tweets_df[tweets_df.split == 'train']
        measure_len = lambda context: len(context.split(" "))
        max_seq_length = max(map(measure_len, tweets_df.text)) + 2  # for begin and end tokens
        return cls(train_tweets_df, TweetsVectorizer.from_dataframe(train_tweets_df), max_seq_length)

    def get_vectorizer(self):
        return self._vectorizer

    def get_max_seq_length(self):
        return self._max_seq_length

    def __len__(self):
        # Required by the PyTorch DataLoader
        return len(self.tweets_df)

    def __getitem__(self, idx):
        row = self.tweets_df.iloc[idx]

        text_vector, vec_length = self._vectorizer.vectorize(row.text, self._max_seq_length)
        return {'x_data': torch.tensor(text_vector),
                'y_target': torch.tensor(row.target),
                'text': row.text,
                'x_length': vec_length}

  • The load_dataset_and_make_vectorizer method creates both the vectorizer and the dataset object.
  • The train split of the tweets dataframe is used to create the vectorizer object. Both the val and test splits will reuse this train vectorizer for vectorization.
  • The maximum tweet length (plus 2 for the begin and end tokens) is calculated and used to pad all vectors to a fixed length.
  • __getitem__ is the method that gets called during the model training cycle.
  • It runs the vectorization and returns all the relevant fields.

We will create 3 instances of the TweetsDataset class, one each for the train, val and test splits. These will be used in the Data Module accordingly to load data via Data Loaders.
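
A rough sketch of how that wiring might look (the csv path is hypothetical, and this is essentially what the Data Module below does in its setup method):

import pandas as pd

tweets_df = pd.read_csv("tweets_with_splits.csv")  # hypothetical path to the data from Part 2

train_ds = TweetsDataset.load_dataset_and_make_vectorizer(tweets_df)
val_ds = TweetsDataset(tweets_df[tweets_df.split == 'val'],
                       train_ds.get_vectorizer(),
                       train_ds.get_max_seq_length())
test_ds = TweetsDataset(tweets_df[tweets_df.split == 'test'],
                        train_ds.get_vectorizer(),
                        train_ds.get_max_seq_length())

sample = train_ds[0]
print(sample['x_data'].shape, sample['y_target'], sample['x_length'])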

Data Module

PyTorch Lightning provides a fantastic LightningDataModule class that helps create a shareable, reusable object that encapsulates all the steps needed to process data.

from typing import Optional

import numpy as np
import pandas as pd
import torch
import pytorch_lightning as pl

class DisasterTweetsDataModule(pl.LightningDataModule):
    def __init__(self, tweets_data_path, embeddings_path, batch_size, num_workers):
        super().__init__()
        # Paths and loader settings used in setup() and the dataloader methods
        self.tweets_data_path = tweets_data_path
        self.embeddings_path = embeddings_path
        self.batch_size = batch_size
        self.num_workers = num_workers

    def setup(self, stage: Optional[str] = None):
        tweets_df = pd.read_csv(self.tweets_data_path)
        # The train split builds the vectorizer; val and test reuse it
        self.train_ds = TweetsDataset.load_dataset_and_make_vectorizer(tweets_df)
        self.val_ds = TweetsDataset(tweets_df[tweets_df.split == 'val'],
                                    self.train_ds.get_vectorizer(),
                                    self.train_ds.get_max_seq_length())
        self.test_ds = TweetsDataset(tweets_df[tweets_df.split == 'test'],
                                     self.train_ds.get_vectorizer(),
                                     self.train_ds.get_max_seq_length())

        def load_glove_from_file(glove_filepath):
            # Parse the GloVe text file into a word -> row index mapping
            # and a matrix of embedding vectors
            word_to_index = {}
            embeddings = []
            with open(glove_filepath, "r") as fp:
                for index, line in enumerate(fp):
                    line = line.split(" ")
                    word_to_index[line[0]] = index
                    embedding_i = np.array([float(val) for val in line[1:]])
                    embeddings.append(embedding_i)
            return word_to_index, np.stack(embeddings)

        def make_embedding_matrix(glove_filepath, words):
            word_to_idx, glove_embeddings = load_glove_from_file(glove_filepath)
            embedding_size = glove_embeddings.shape[1]

            final_embeddings = np.zeros((len(words), embedding_size))

            for i, word in enumerate(words):
                if word in word_to_idx:
                    final_embeddings[i, :] = glove_embeddings[word_to_idx[word]]
                else:
                    # Words without a GloVe vector get a Xavier-uniform initialization
                    embedding_i = torch.ones(1, embedding_size)
                    torch.nn.init.xavier_uniform_(embedding_i)
                    final_embeddings[i, :] = embedding_i
            return final_embeddings

        words = self.train_ds.get_vectorizer().text_vocab.get_token_to_idx().keys()
        embeddings = make_embedding_matrix(self.embeddings_path, words)
        self.pretrained_embeddings = torch.from_numpy(embeddings).float()

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
            self.train_ds,
            batch_size=self.batch_size,
            shuffle=True,
            drop_last=True,
            num_workers=self.num_workers
        )

The above code shows the most important components of the LightningDataModule class.

  • The setup method is the main method that runs all the data preprocessing and vectorization steps.
  • First the dataframe is loaded from the csv data path and 3 TweetsDataset objects are created, one each for the train, val and test splits.
  • As in the previous parts of this series, we will use GloVe pre-trained embeddings to get low-dimensional vectors for each of the words in our vocabulary. The setup method also loads these embeddings from file and creates the embedding matrix for all the words in our vocabulary.
  • These pre-trained embeddings are returned as a 2-D tensor that we can use to initialize our embedding layer during modelling.
  • Also note that we generate a Xavier-uniform embedding vector for words in our vocabulary that do not have a GloVe embedding.
  • Finally, the train_dataloader method wraps the train dataset in a DataLoader object provided by PyTorch. This generates batches of data and is also configured to shuffle the data and drop the last batch if it is incomplete.
  • Similarly, I have also added val_dataloader and test_dataloader methods (not shown in the snippet above) that create DataLoader objects for the val and test datasets respectively. Note that both of those need shuffling and drop_last turned off - a possible sketch is shown below.
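
For completeness, here is one possible implementation of those two methods, mirroring train_dataloader above; these would sit inside DisasterTweetsDataModule:

    def val_dataloader(self):
        # No shuffling and no dropped batches for evaluation
        return torch.utils.data.DataLoader(
            self.val_ds,
            batch_size=self.batch_size,
            shuffle=False,
            drop_last=False,
            num_workers=self.num_workers
        )

    def test_dataloader(self):
        return torch.utils.data.DataLoader(
            self.test_ds,
            batch_size=self.batch_size,
            shuffle=False,
            drop_last=False,
            num_workers=self.num_workers
        )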

Summary

In this part of the series, we have done the groundwork for data preparation that we can use to jumpstart our Deep Learning modelling process. The full code for this class can be viewed on my github here. In the next part, I will start building a simple multi-layer perceptron using PyTorch Lightning and the DataModule class that we discussed in this part. See you there!