Introduction

Recently I have been starting to learn and move towards PyTorch instead of Keras/Tensorflow, and I have to say I have been positively impressed by PyTorch’s neat API. However, I didn’t find a lot of articles around how to make that transition easy and I found myself going back and forth trying to lookup what are the corresponding layer functions or loss functions in PyTorch in comparison to tensorflow. This post is mostly to remind myself or anyone else who is making this transition by showing a basic deep learning toy example, which is written in both Tensorflow and PyTorch. It’ll be a quick small post and hopefully help anyone to quickly refer some basic Tensorflow vs. PyTorch functionality.

Any neural network model training workflow follows the following basic steps -

  1. Prepare data.
  2. Define network architecture
  3. Start an epoch and forward pass data through the laid out network.
  4. Calculate prediction from the network, and calculate the chosen loss metric using true value.
  5. Calculate gradient for each parameter in the network by backpropogating the loss.
  6. Use learning rate and the calculated gradient to calculate the new parameter values.
  7. Repeat step 2 above for another epoch.

I will use the same above blueprint and build model in both these frameworks. Note that I am using Tensorflow’s quickstart tutorial as the toy model and will build corresponding model in PyTorch.

Modelling

Load Libraries

rm(list = ls())
library(reticulate)

use_condaenv("stanford-nlp")
import tensorflow as tf

tf.__version__
## '2.2.0'
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

torch.__version__
## '1.4.0'

Tensorflow

Prep Data

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

print('X_TRAIN: {0} Y_TRAIN: {1} \n X_TEST: {2} Y_TEST: {3}'.format(x_train.shape, y_train.shape, x_test.shape, y_test.shape))
## X_TRAIN: (60000, 28, 28) Y_TRAIN: (60000,) 
##  X_TEST: (10000, 28, 28) Y_TEST: (10000,)

Using the classic MNIST dataset to predict the handwritten digit.

Network Architecture

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

In the above simple network, we first flatten the 2d image of the handwritten digits from 28x28 pixel values to single vector of 784 values. Next, we apply a fully connected dense layer with 128 hidden nodes and relu activation. Followed by a dropout layer with 20% dropout probability and then our final Dense layer that calculates the log-odds for each of the 10 digits.
Next we define our loss function, the categorical cross entropy function (with parameters that uses logits to calculate the loss value) and the Adam Optimizer that will help train the network more efficiently as compared to using a simple gradient descent optimizer.

Fit and evaluate

model.fit(x_train, y_train, epochs=5)

We train our model for 5 epochs above and evaluate below.

model.evaluate(x_test,  y_test, verbose=2)
## 313/313 - 0s - loss: 0.0773 - accuracy: 0.9738
## [0.0772763043642044, 0.973800003528595]
probability_model = tf.keras.Sequential([
  model,
  tf.keras.layers.Softmax()
])

probability_model(x_test[:3]).numpy()
## array([[3.1675106e-07, 2.3145599e-07, 6.6892753e-05, 2.2017957e-04,
##         4.1319330e-09, 1.6857163e-07, 9.4870272e-14, 9.9964452e-01,
##         2.0268694e-06, 6.5717024e-05],
##        [7.4892768e-09, 1.6987176e-05, 9.9998236e-01, 4.0866948e-07,
##         7.2810264e-17, 1.7220984e-07, 2.1501261e-09, 5.0949375e-12,
##         6.0641248e-08, 3.3724926e-11],
##        [6.3570706e-06, 9.9873573e-01, 5.9583277e-04, 9.2160417e-06,
##         9.0130925e-06, 5.0910571e-06, 7.6300630e-06, 1.9710122e-04,
##         4.3375589e-04, 2.5116697e-07]], dtype=float32)

In order to calculate predicted probability for each digit (instead of log-odds), we run our model output through a simple softmax function and display the predicted probabilities for the first 3 samples in the test data.

Now moving on to PyTorch!

PyTorch

Prep data

x_train_torch = torch.from_numpy(x_train)
y_train_torch = torch.from_numpy(y_train)

x_test_torch = torch.from_numpy(x_test)
y_test_torch = torch.from_numpy(y_test)

x_train_torch = x_train_torch.float()
x_test_torch = x_test_torch.float()
y_train_torch = y_train_torch.long()
y_test_torch = y_test_torch.long()

We are using the same MNIST dataset we used for building our Tensorflow example but converting it to PyTorch tensors and making sure they have the correct datatypes.

Network Architecture

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.hidden = nn.Linear(28*28, 128)
        self.drop = nn.Dropout(0.2)
        self.out = nn.Linear(128,10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.hidden(x))
        x = self.drop(x)
        x = self.out(x)
        return x
        
torch_model = Net()

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(torch_model.parameters(), lr = 0.001)

With PyTorch, we get to build the network architecture in a more object oriented fashion where we define each layer in the class constructor and a forward function that defines how the input data will pass forward through the network. Note that, we do not need to have a separate “Flatten” layer in PyTorch like we did in Tensorflow, we can just restructure the tensor using the “view” function to achieve the same effect.
We define the same loss function (CrossEntropyLoss), note however PyTorch’s CrossEntropyLoss function applies Softmax internally. We also initialize our Adam optimizer object.

Fit and evaluate

torch_model.train()
## Net(
##   (hidden): Linear(in_features=784, out_features=128, bias=True)
##   (drop): Dropout(p=0.2, inplace=False)
##   (out): Linear(in_features=128, out_features=10, bias=True)
## )
for epoch in range(30):
    y_pred = torch_model(x_train_torch)

    loss = loss_fn(y_pred, y_train_torch)
    
    if epoch%10 == 9:
        print(epoch, loss.item())

    optimizer.zero_grad()

    loss.backward()

    optimizer.step()
## 9 1.6672592163085938
## 19 1.0455528497695923
## 29 0.7043634653091431

We have a much more hands-on approach to fitting the network on the data. You can see the exact steps are similar to what we defined initially. This workflow gives a bit more control to the user (IMHO) while training the neural network.

torch_model.eval()
## Net(
##   (hidden): Linear(in_features=784, out_features=128, bias=True)
##   (drop): Dropout(p=0.2, inplace=False)
##   (out): Linear(in_features=128, out_features=10, bias=True)
## )
preds = torch_model(x_test_torch).detach().numpy()

preds = np.argmax(preds, axis=1)
np.mean(y_test_torch.numpy() == preds)
## 0.8619

I also wanted to run this network through a simple gradient descent instead of a pre-built optimizer just out of curiosity to see how well it performs.

Simple Gradient Descent

torch_model = Net()
learning_rate = 1e-4

torch_model.train()
## Net(
##   (hidden): Linear(in_features=784, out_features=128, bias=True)
##   (drop): Dropout(p=0.2, inplace=False)
##   (out): Linear(in_features=128, out_features=10, bias=True)
## )
for epoch in range(30):
    y_pred = torch_model(x_train_torch)

    loss = loss_fn(y_pred, y_train_torch)
    if epoch % 10 == 9:
        print(epoch, loss.item())

    torch_model.zero_grad()

    loss.backward()

    with torch.no_grad():
        for param in torch_model.parameters():
            param = -learning_rate * param.grad
            
## 9 2.3100671768188477
## 19 2.3099873065948486
## 29 2.3100321292877197
torch_model.eval()
## Net(
##   (hidden): Linear(in_features=784, out_features=128, bias=True)
##   (drop): Dropout(p=0.2, inplace=False)
##   (out): Linear(in_features=128, out_features=10, bias=True)
## )
preds = torch_model(x_test_torch).detach().numpy()

preds = np.argmax(preds, axis=1)
np.mean(y_test_torch.numpy() == preds)
## 0.1399

So intead of using the adam optimizer, we manually use the learning rate and the computed gradient for each parameter and update the value for each param during an epoch. As we see the model doesn’t perform quite as well, but nevertheless we now know how we can use PyTorch’s cool API to fully control every aspect of training the neural network.

Summary

In summary, I think both Tensorflow and PyTorch provide solid frameworks for training neural networks. If you prefer more intuitive, easy-to-use API Tensorflow should be the way to go and if you like a neat object-oriented, more fine-grained control API then PyTorch serves that purpose well.

References