Introduction

In this NLP getting started challenge on kaggle, we are given tweets which are classified as 1 if they are about real disasters and 0 if not. The goal is to predict given the text of the tweets and some other metadata about the tweet, if its about a real disaster or not.
In this part 6 for building Multi Layer Perceptron, I will use the data module generated in Part 5 to create a Multi Layer Perceptron model to predict if the tweet is about a real disaster. Following up from the previous Part 4 about tree-based models, I will generate the prediction output of this model on the validation set and compare results.

Classifier Class

Mutli-Layer Perceptron Pictorial representation from https://www.tutorialspoint.com/tensorflow/tensorflow_multi_layer_perceptron_learning.htm

Above pictorial representation shows a simple multi-layer perceptron with 2 hidden layers. We will essentially build the same for our usecase with a few more layers and additional bells and whistles related to network regularization.

I will be using the pytorch-lightning framework to build a classifier class.
In a deep learning training pipeline we need to configure the following 5 essential components -

Network architecture - We define exactly how is the network structured. The architecture would vary based on if the network we are trying to build is a perceptron or a Convolutional network or a Recurrent network or any other form of network. We also define different architecture parameters like the number of layers in the network and the number of neurons in each hidden layer of the fully connected network or the number of channels in case of a convolutional network.
Loss function - The loss function would calculate the loss of the prediction from the ground truth. This loss value’s gradient is then backpropogated through the network in order to update the network weights and come up with a lower loss value in subsequent iterations.
Optimizer - Optimizers update the weight parameters to minimize the loss function. In its simplest form, we multiply our preconfigured constant learning rate to the gradient of each parameter in the network to update it. Fancier optimizers update the learning rate, bring in momentum etc. but essentially accomplish the same goal of updating the weights to minimize loss.
Forward Pass - This method defines how the training batch will forward pass through the Network by using parameter weights and activation functions (and dropout/batch norms) and finally generate the prediction output.
Training Step / Val Step / Test Step - We configure specific to each phase (train/val/test) how exactly we will compute and report the loss.

That’s it! Rest everything related to epoch loop, checkpointing, logging, book-keeping are handled by PyTorch Lightning so we do not need to implement that specifically (as we would have if we were to use vanilla PyTorch. example here). In the rest of the post I will mainly discuss the above 5 components specific to building a Multi Layer Perceptron for our Disaster Tweets usecase.

Network Architecture and Loss Function

A multi layer perceptron is a simple neural network where neurons in each layer are fully connected to all the neurons in the next layer. Weights are applied to each input feature by a linear transform along with a bias and non-linearity is introduced via the activation function. Let’s see how our network architecture and loss function look.

import pytorch_lightning as pl

class DisasterTweetsClassifierMLP(pl.LightningModule):

    def __init__(self, num_classes, dropout_p, pretrained_embeddings,
                 max_seq_length, learning_rate, weight_decay):
        super().__init__()

        self.save_hyperparameters('num_classes', 'dropout_p', 'learning_rate', 'weight_decay')

        embedding_dim = pretrained_embeddings.size(1)
        num_embeddings = pretrained_embeddings.size(0)

        self.emb = torch.nn.Embedding(embedding_dim=embedding_dim,
                                      num_embeddings=num_embeddings,
                                      padding_idx=0,
                                      _weight=pretrained_embeddings)
        self.fc = torch.nn.Sequential(
            torch.nn.Linear(in_features=embedding_dim * max_seq_length, out_features=128),
            torch.nn.BatchNorm1d(num_features=128),
            torch.nn.Dropout(p=dropout_p),
            torch.nn.ReLU(),
            torch.nn.Linear(in_features=128, out_features=64),
            torch.nn.BatchNorm1d(num_features=64),
            torch.nn.Dropout(p=dropout_p),
            torch.nn.ReLU(),
            torch.nn.Linear(in_features=64, out_features=32),
            torch.nn.BatchNorm1d(num_features=32),
            torch.nn.Dropout(p=dropout_p),
            torch.nn.ReLU(),
            torch.nn.Linear(in_features=32, out_features=16),
            torch.nn.BatchNorm1d(num_features=16),
            torch.nn.Dropout(p=dropout_p),
            torch.nn.ReLU(),
            torch.nn.Linear(in_features=16, out_features=num_classes)
        )
        self.loss = torch.nn.CrossEntropyLoss(reduction='none')

We are using the LightningModule abstraction provided by PyTorch Lightning to create our training module.
First we do some housekeeping by saving all the hyperparameters that are passed to our module.
As we discussed in my previous post about data preparation, we will use pretrained glove embeddings to embed our vectorized tweets into lower dimensional vectors. For this we use the torch.nn.Embedding layer with pretrained embeddings supplied as initial weights.
Next we define our fc (fully connected) network, using the torch.nn.Sequential syntactic sugar provided by PyTorch. In here we have 5 fully connected layers of 128, 64, 32, 16 and 2 neurons respectively. Each of these layers use linear transform (therefore torch.nn.Linear) to transform its input from one dimension to another.
In order to regularize the network so it doesn’t overfit our training data we are using Dropout, this will randomly drop output of certain nodes/neurons based on given dropout probability. Here is an excellent post on Dropout if you are curious.
In order to stabilize the learning process and reduce the number of training epochs required for the network to converge, we are also using batch normalization via torch.nn.BatchNorm1d. This will standardize (0 mean and 1 standard deviation) the inputs to a layer for each batch. Here is an excellent post on Batch Normalization if you are curious.
We apply ReLU activation to the output of each layer which provides us with our non-linearity. It will keep the output as is if it is positive or else make it zero.
Finally, the last layer will provide 2 outputs, one for each of our classes (disaster or not). We can transform it to probability by using softmax to see what’s the predicted probability of each class.
Loss Function - We are using the Cross Entropy Loss function here. My favorite explanation for this loss is this post that does a beautiful job of explaining how exactly it is computed and how its gradient is calculated. The reason for passing ‘reduction=none’ in the loss function will become clear in the section below about training/val step.

Optimizer

class DisasterTweetsClassifierMLP(pl.LightningModule):
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(),
                                lr=self.hparams.learning_rate,
                                weight_decay=self.hparams.weight_decay)

We are using the Adam optimizer for optimizer our network parameters which is a good default optimizer as it combines advantages of 2 other well known optimizers AdaGrad and RMSProp. Here is a gentle introduction on Adam optimizer. We pass in all our network parameters, the provided learning rate and weight decay hyper parameters to torch.optim.Adam.

Forward Method

class DisasterTweetsClassifierMLP(pl.LightningModule):
  def forward(self, batch):
        x_embedded = self.emb(batch).view(batch.size(0), -1)
        output = self.fc(x_embedded)

        return output

In the forward pass through the network, we first generate the embeddings for our batch and also flatten it. For example, if we had batch size as 8 and each of the 8 samples were of normalized length 70, then the self.emb embedding layer will transform it to a tensor 8 x 70 x 25 (if we were to use 25 dimensional glove embeddings). i.e. it’ll extract the 25-dimensional embedding for each of 70 words in our 8 samples. Then the view(batch.size(0), -1) will flatten it such that the tensor will transform from 8 x 70 x 25 to 8 x 1750 (i.e. it’ll append the embedding vectors of all 70 words in the sample for all 8 samples).
This flattened vector then gets passed through our fc fully connected layer and generates the output which gets returned.

Training Step / Val Step / Test Step

class DisasterTweetsClassifierMLP(pl.LightningModule):
  def training_step(self, batch, batch_idx):
        y_pred = self(batch['x_data']) #internally calls the forward method
        loss = self.loss(y_pred, batch['y_target']).mean()
        return {'loss': loss, 'log': {'train_loss': loss}}

In our training step, one batch of training dataset is passed to the class (i.e. the forward method). The output is then passed, along with the true target values, to our loss function. We return the mean cross entropy loss across all samples in the batch. Note that, we also pass a log field in our response which is used by our logger utility for training instrumentation. In our case, we use the Tensorboard Logger (more about that ahead).

class DisasterTweetsClassifierMLP(pl.LightningModule):
  def validation_step(self, batch, batch_idx):
        y_pred = self(batch['x_data'])
        loss = self.loss(y_pred, batch['y_target'])
        acc = (y_pred.argmax(-1) == batch['y_target']).float()
        return {'loss': loss, 'acc': acc}
        
  def validation_epoch_end(self, outputs):
        loss = torch.cat([o['loss'] for o in outputs], 0).mean()
        acc = torch.cat([o['acc'] for o in outputs], 0).mean()
        out = {'val_loss': loss, 'val_acc': acc}
        return {**out, 'log': out}

We have split the validation step in 2 methods. The validation_step and the validation_epoch_end. Since we want to summarize the loss for the entire validation dataset and not just for each individual validation batch, we only return the individual losses of each sample in the batch from our validation_step method (similar to how we computed in training step) and do the aggregation in the validation_epoch_end method, which as the name suggests runs after the validation epoch is over. Here, we see the reason for passing reduction=none in our loss function instantiation. Since, we don’t allow the loss function to do reduction, and handle it manually ourselves, we can use the same function for both train and validation steps. As we noted earlier, training loss gets computed over each batch and gets backpropogated, however validation loss is computed over the entire validation dataset and reported. We also compute accuracy on our validation set for reporting purposes.

The test step is exactly identical to the validation step (Mutatis mutandis). i.e. Test dataset used instead of validation dataset.

The Training Routine

With all the above work, our network is all set and ready to train. Let’s discuss the training routine which remains mostly same across different networks. It majorly has class object creation, logger configuration and running the fit method on our trainer.

if __name__ == '__main__':
  pl.seed_everything(42)

  dm = DisasterTweetsDataModule(tweets_data_path=args.tweets_data_path,
                                embeddings_path=args.embeddings_path,
                                batch_size=args.batch_size,
                                num_workers=args.num_workers)

  dm.setup('fit')

  model = DisasterTweetsClassifierMLP(num_classes=2,
                                      dropout_p=args.dropout_p,
                                      learning_rate=args.learning_rate,
                                      weight_decay=args.weight_decay,
                                      pretrained_embeddings=dm.pretrained_embeddings,
                                      max_seq_length=dm.train_ds.get_max_seq_length())

  trainer = pl.Trainer.from_argparse_args(args,
                                          deterministic=True,
                                          weights_summary='full',
                                          early_stop_callback=True)

  trainer.configure_logger(pl.loggers.TensorBoardLogger('lightning_logs/',
                                                        name='disaster_tweets_mlp_glove'))
  trainer.fit(model, dm)

We first seed everything so we can have reproducible results.
Next, we create an object of our DisasterTweetsDataModule passing all the required params and also call it’s setup method. (For more context checkout my previous post Part 5).
We create our DisasterTweetsClassifierMLP model object which holds all the above code related to the network, loss, optimizer etc.
After that we instantiate the Trainer object provided by PyTorch Lightning, which runs our entire training pipeline and provides a lot of other functionality like logging and early stopping. For now, we have enabled the default early stopping callback, which will stop the training if we do not see val_loss decreasing for 3 consecutive epochs.
We also configure the TensorBoard Logger that will help visualize and monitor the training routine.
Finally, we call the fit method of the trainer with the Model object and the Lightning Data Module object.

Inference and Results

Training Routine Plots in Tensorboard

The above plots show the training routine visualizations in Tensorboard. We see the validation loss continuously decreases until around epoch=13 where it starts saturating and our early stopping callback gets triggered and stops the training.

def predict_target(text, classifier, vectorizer, max_seq_length):
    text_vector, _ = vectorizer.vectorize(text, max_seq_length)
    text_vector = torch.tensor(text_vector)
    pred = torch.nn.functional.softmax(classifier(text_vector.unsqueeze(dim=0)), dim=1)
    probability, target = pred.max(dim=1)

    return {'pred': target.item(), 'probability': probability.item()}

We can use the above method to run the trained model on new incoming text and generate predictions. When we run prediction on our validation dataset we get an accuracy score of 0.7835232252410167 and F1-score of 0.7507568113017156. F1-score for the best tuned Gradient Boosted Tree model that we got in Part 4 was 0.8107538, which is much higher than our multi-layer perceptron. However, note that this is just the default performance of this Network Architecture on our validation set with single constant values for our different hyperparameters like batch size, learning rate, hidden layers etc. We can do a full hyperparameter sweep and tune this network to get an optimal performance number (coming up in a future post).

Summary

In this part of the series, we built a multi-layer perceptron neural network architecture to predict if a tweet is about a real disaster using the PyTorch Lightning framework. We built this on top of the Lightning Data Module that we discussed in the previous post of this series. We can see that we get a good solid baseline performance from this network and can hopefully with further hyperparameter tuning improve it. The full code for this post can be viewed on my github here. In the next part, I will build a Recurrent Neural Network for this task using our same underlying data module. Hope to see you there.

References

Project Summary Page - NLP with disaster tweets: Summary
Project Part 1 - NLP with Disaster Tweets: Part 1 Data Preparation
Project Part 2 - NLP with Disaster Tweets: Part 2 Nearest Neighbor Models
Project Part 3 - NLP with Disaster Tweets: Part 3 Linear Models
Project Part 4 - NLP with Disaster Tweets: Part 4 Tree-based Models
Project Part 5 - NLP with Disaster Tweets: Part 5 Deep Learning Data Preparation
PyTorch - Docs
PyTorch Lightning - Docs
Natural Language Processing with PyTorch - ebook
Full DisasterTweetsClassifierMLP implementation - github

NLP with Disaster Tweets: Part 6 Multi Layer Perceptron