Introduction

In this NLP getting started challenge on Kaggle, we are given tweets that are labelled 1 if they are about real disasters and 0 if not. The goal is to predict, from the text of a tweet and some accompanying metadata, whether it is about a real disaster.
In this part 2, on nearest neighbor modelling, I will use the processed data generated in Part 1 to train nearest neighbor models with the tidymodels framework in order to predict whether a tweet is about a real disaster.

Analysis

Load Libraries

rm(list = ls())
library(tidyverse)
library(ggplot2)
library(tidymodels)
library(silgelib)

theme_set(theme_plex())

Loading processed data from the previous part

tweets <- readRDS("../data/nlp_with_disaster_tweets/tweets_proc.rds")
tweets_final <- readRDS("../data/nlp_with_disaster_tweets/tweets_test_proc.rds")
tweets %>% 
  dim
## [1] 7613  830
tweets_final %>% 
  dim
## [1] 3263  829

Feature preprocessing and engineering

tweets %>% 
  mutate(target = as.factor(target),
         id = as.character(id)) -> tweets
tweets %>% 
  count(target, sort = T)
## # A tibble: 2 x 2
##   target     n
##   <fct>  <int>
## 1 0       4342
## 2 1       3271
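
The classes are fairly balanced but not equal. As a quick sanity check (a sketch, not part of the original pipeline), the majority-class rate tells us the accuracy a constant model would achieve, which any real model should beat:

# Always predicting the majority class 0 would be right
# 4342 / 7613 of the time, i.e. roughly 57% accuracy.
max(table(tweets$target)) / nrow(tweets)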

Split data

Splitting the data into three sets: a test set with 10% of the data, a cross validation set with 20%, and a training set with 70%. The training and cross validation sets will be used for training, tuning, and validating the models and comparing them with one another. The test set will only be used for a final estimate of model performance on unseen data.

set.seed(42)
# With prop = 0.1, the "training" portion of this split holds 10% of the rows,
# which we set aside as the test set; the remaining 90% is split again below.
tweets_split <- initial_split(tweets, prop = 0.1, strata = target)

tweets_test <- training(tweets_split)
tweets_train_cv <- testing(tweets_split)

set.seed(42)
# 7/9 of the remaining 90% is 70% of the full data (the training set),
# leaving 2/9 of 90% = 20% as the cross validation set.
tweets_split <- initial_split(tweets_train_cv, prop = 7/9, strata = target)
tweets_train <- training(tweets_split)
tweets_cv <- testing(tweets_split)
dim(tweets_train)
## [1] 5328  830
dim(tweets_cv)
## [1] 1522  830
dim(tweets_test)
## [1] 763 830

Preparation Recipe

I will use the recipes package from tidymodels to define a recipe for data preprocessing and feature engineering.

recipe(target ~ ., data = tweets_train) %>% 
  update_role(id, new_role = "ID") %>% 
  step_rm(location, keyword) %>% 
  step_mutate(len = str_length(text),
              num_hashtags = str_count(text, "#")) %>% 
  step_rm(text) %>% 
  step_zv(all_numeric(), -all_outcomes()) %>% 
  step_normalize(all_numeric(), -all_outcomes())  %>% 
  step_pca(all_predictors(), -len, -num_hashtags, threshold = 0.80) -> tweets_recipe

Notes on the recipe above

  • We use the training dataset to create the recipe.
  • We won’t use the ‘id’ field as a predictor, only as an identifier.
  • For the current analysis, we drop the location and keyword features.
  • Creating a length feature to capture the tweet length, and another feature for the number of hashtags in the tweet.
  • Dropping the text field, since we have generated all the features we want from it for now.
  • Removing all predictors with zero variance.
  • Normalizing all features, i.e. centering and scaling.
  • Reducing dimensionality with PCA, keeping enough components to explain 80% of the variance, while excluding our two custom features from the projection so they are retained as-is.

tweets_prep <- tweets_recipe %>% 
  prep(training = tweets_train, 
       strings_as_factors = FALSE)
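
As an optional sanity check (a sketch, not part of the original analysis), we can list the trained steps of the prepped recipe and peek at the dimensions of the processed training data:

# Summarise the trained recipe steps (rm, mutate, zv, normalize, pca, ...).
tweets_prep %>% 
  tidy()

# Extract the preprocessed training data and check how many
# columns survive the zero-variance filter and PCA.
tweets_prep %>% 
  juice() %>% 
  dim()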

Modelling

Baseline model

I will first create a baseline model to beat. Here, we predict labels at random in proportion to the target class counts and evaluate how that strategy performs.

tweets_prep %>% 
  juice() %>% 
  count(target) %>% 
  mutate(prob = n/sum(n)) %>% 
  pull(prob) -> probs
set.seed(42)
tweets_prep %>% 
  bake(new_data = tweets_cv) %>% 
  mutate(predicted_target = as.factor(sample(0:1, 
                                             size = nrow(tweets_cv), 
                                             prob = probs, replace = T))) %>% 
  accuracy(target, predicted_target)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.512
set.seed(42)
tweets_prep %>% 
  bake(new_data = tweets_cv) %>% 
  mutate(predicted_target = as.factor(sample(0:1, 
                                             size = nrow(tweets_cv), 
                                             prob = probs, replace = T))) %>% 
  f_meas(target, predicted_target)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 f_meas  binary         0.581

As we see above, we have a baseline f1-score of 0.581 on the cross validation set. We need to build and train a model that beats this baseline.
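
As an aside, the two separate accuracy and f_meas calls above can be combined using yardstick’s metric_set(), which builds a single function that computes several metrics in one pass (a sketch reusing the objects from above; baseline_metrics is just a local name):

baseline_metrics <- metric_set(accuracy, f_meas)
set.seed(42)
tweets_prep %>% 
  bake(new_data = tweets_cv) %>% 
  mutate(predicted_target = as.factor(sample(0:1, 
                                             size = nrow(tweets_cv), 
                                             prob = probs, replace = TRUE))) %>% 
  baseline_metrics(truth = target, estimate = predicted_target)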

set.seed(42)
tweets_prep %>% 
  bake(new_data = tweets_test) %>% 
  mutate(predicted_target = as.factor(sample(0:1, 
                                             size = nrow(tweets_test), 
                                             prob = probs, replace = T))) %>% 
  accuracy(target, predicted_target)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.503
set.seed(42)
tweets_prep %>% 
  bake(new_data = tweets_test) %>% 
  mutate(predicted_target = as.factor(sample(0:1, 
                                             size = nrow(tweets_test), 
                                             prob = probs, replace = T))) %>% 
  f_meas(target, predicted_target)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 f_meas  binary         0.574

Generating submission file

set.seed(42)
tweets_prep %>% 
  bake(new_data = tweets_final) %>% 
  mutate(target = as.factor(sample(0:1, 
                                   size = nrow(tweets_final), 
                                   prob = probs, replace = T))) %>% 
  select(id, target) %>% 
  write_csv("../data/nlp_with_disaster_tweets/submissions/baseline_cvf_57_testf_57.csv")

K-Nearest Neighbor model

Let’s build a basic KNN model with a small fixed number of neighbors to see how modelling is done in this framework and what the model output looks like.

Basic

knn_spec <- nearest_neighbor(neighbors = 3) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

wf <- workflow() %>% 
  add_recipe(tweets_recipe)
knn_fit <- wf %>% 
  add_model(knn_spec) %>% 
  fit(data = tweets_train)

saveRDS(knn_fit, "../data/nlp_with_disaster_tweets/knn/knn_basic_fit.rds")
knn_fit <- readRDS("../data/nlp_with_disaster_tweets/knn/knn_basic_fit.rds")
knn_fit %>% 
  pull_workflow_fit() -> wf_fit  # extract the underlying parsnip model fit

wf_fit$fit$MISCLASS
##     optimal
## 3 0.3521021

The above shows a simple K-nearest neighbors model using the “kknn” engine, which achieves a minimal misclassification rate of about 0.352. Let’s try tuning the number of neighbors (K) and see if we can interpret the underlying problem space.

Tuning number of neighbors

Using 5-fold cross validation with values of K from 1 to 100.

set.seed(1234)
folds <- vfold_cv(tweets_train, strata = target, v = 5, repeats = 1)

tune_spec <- nearest_neighbor(neighbors = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("kknn")

neighbor_grid <- expand.grid(neighbors = seq(1,100, by = 1))
set.seed(1234)
# Run the tuning grid in parallel across the physical cores.
doParallel::registerDoParallel(cores = parallel::detectCores(logical = FALSE))

knn_grid <- tune_grid(
  wf %>% add_model(tune_spec),
  resamples = folds,
  grid = neighbor_grid,
  metrics = metric_set(accuracy, roc_auc, f_meas),
  control = control_grid(save_pred = TRUE,
                         verbose = TRUE)
)

saveRDS(knn_grid, "../data/nlp_with_disaster_tweets/knn/knn_grid.rds")
knn_grid <- readRDS("../data/nlp_with_disaster_tweets/knn/knn_grid.rds")

knn_grid %>% 
  collect_metrics()
## # A tibble: 300 x 6
##    neighbors .metric  .estimator  mean     n std_err
##        <dbl> <chr>    <chr>      <dbl> <int>   <dbl>
##  1         1 accuracy binary     0.628     5 0.00597
##  2         1 f_meas   binary     0.705     5 0.00426
##  3         1 roc_auc  binary     0.604     5 0.00653
##  4         2 accuracy binary     0.628     5 0.00598
##  5         2 f_meas   binary     0.705     5 0.00426
##  6         2 roc_auc  binary     0.630     5 0.00928
##  7         3 accuracy binary     0.628     5 0.00544
##  8         3 f_meas   binary     0.706     5 0.00349
##  9         3 roc_auc  binary     0.639     5 0.00873
## 10         4 accuracy binary     0.628     5 0.00537
## # … with 290 more rows
knn_grid %>% 
  collect_metrics() %>% 
  mutate(flexibility = 1/neighbors,
         .metric = str_to_title(str_replace_all(.metric, "_", " "))) %>% 
  ggplot(aes(flexibility, mean, color = .metric)) + 
  geom_errorbar(aes(ymin = mean - std_err,
                    ymax = mean + std_err), alpha = 0.5) + 
  geom_line(size = 1.5) + 
  facet_wrap(~.metric, scales = "free", nrow = 3) + 
  scale_x_log10() + 
  theme(legend.position = "none") + 
  labs(title = "Model performance against model flexibility",
       subtitle = "F1-score peaks around lower flexibility values",
       x = "Model flexibility i.e. Log(1/NumberOfNeighbors)",
       y = "Mean metric value")

As the plot above shows, the f1-score on the assessment folds increases until around K = 20 and then starts to fall. We plot flexibility (i.e. 1/NumberOfNeighbors) to visualize how performance changes as the model becomes more flexible: a KNN model with K = 1 is highly flexible and thus has high variance, whereas K = 100 gives a much more rigid model that is less flexible and may suffer from bias.
It looks like our underlying problem sits closer to the flexible end of this spectrum than the rigid one (since the optimal K appears to be around 20). We should keep this in mind when picking further models to experiment with.
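
Rather than eyeballing the peak from the plot, we can also list the top candidate values of K directly with tune’s show_best() (output omitted here):

# Top 5 values of K ranked by mean f1-score across the folds.
knn_grid %>% 
  show_best("f_meas", n = 5)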

Let’s pick out the best value of K based on the highest f1-score, train our final model on the full training dataset, and evaluate it against the cross validation dataset.

knn_grid %>% 
  select_best("f_meas") -> highest_f_meas

final_knn <- finalize_workflow(
  wf %>% add_model(tune_spec),
  highest_f_meas
)
last_fit(final_knn, 
         tweets_split,
         metrics = metric_set(accuracy, roc_auc, f_meas)) -> knn_last_fit

saveRDS(knn_last_fit, "../data/nlp_with_disaster_tweets/knn/knn_last_fit.rds")
knn_last_fit <- readRDS("../data/nlp_with_disaster_tweets/knn/knn_last_fit.rds")
knn_last_fit %>% 
  collect_metrics()
## # A tibble: 3 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.643
## 2 f_meas   binary         0.756
## 3 roc_auc  binary         0.687

Our final KNN model, fit with K = 25, gives an f1-score of 0.756, which is much higher than our baseline model’s score on the same CV dataset.
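
The original notebook does not show a submission for this model, but as a sketch (fitting on the combined training + CV data is one reasonable choice, and the file name below is just a placeholder), the finalized workflow could be used to predict on the competition test set:

# Fit the finalized workflow on all non-test data, then predict
# classes for the competition test set and write out a submission.
final_knn_fit <- fit(final_knn, data = tweets_train_cv)

predict(final_knn_fit, new_data = tweets_final) %>% 
  bind_cols(tweets_final %>% select(id)) %>% 
  select(id, target = .pred_class) %>% 
  write_csv("../data/nlp_with_disaster_tweets/submissions/knn_submission.csv")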

Summary

Even with a basic modelling algorithm like K-nearest neighbors, we can learn quite a bit about our underlying problem space, carry that learning into further model selection and tuning, and still produce a fairly robust model that predicts far more effectively than the baseline.
The tidymodels framework also provides a clean modelling structure that is easy to reproduce and reuse for training a variety of models. In the next part of this series, I will work with another classic modelling algorithm, Lasso regression, where we will also see whether any of these features are much more important than the others and whether our two custom features are useful.
