NLP with Disaster Tweets: Part 2 Nearest Neighbor Models
tech, DataAnalysis, DataScience, NLP, TextMining
Introduction
In this NLP getting-started challenge on Kaggle, we are given tweets that are labelled 1 if they are about real disasters and 0 if not. The goal is to predict, given the text of a tweet and some other metadata, whether it is about a real disaster. In this Part 2 on nearest neighbor modelling, I will use the processed data generated in Part 1 to train nearest neighbor models that predict whether a tweet is about a real disaster, using the tidymodels framework.
Analysis
Load Libraries
rm(list = ls())
library(tidyverse)
library(ggplot2)
library(tidymodels)
library(silgelib)
theme_set(theme_plex())
Loading processed data from the previous part
<- readRDS("../data/nlp_with_disaster_tweets/tweets_proc.rds")
tweets <- readRDS("../data/nlp_with_disaster_tweets/tweets_test_proc.rds") tweets_final
%>%
tweets dim
## [1] 7613 830
tweets_final %>%
  dim
## [1] 3263 829
Feature preprocessing and engineering
tweets %>%
  mutate(target = as.factor(target),
         id = as.character(id)) -> tweets

tweets %>%
  count(target, sort = T)
## # A tibble: 2 x 2
## target n
## <fct> <int>
## 1 0 4342
## 2 1 3271
Split data
Splitting the data into three sets: a test set with 10% of the data, a cross-validation set with 20%, and a training set with 70%. The training and cross-validation sets will be used for training, tuning, and validating the models and comparing them with each other; the test set will only be used for a final estimate of model performance on unseen data.
set.seed(42)
tweets_split <- initial_split(tweets, prop = 0.1, strata = target)

# training() holds the 10% side of this split, which we set aside as the test set
tweets_test <- training(tweets_split)
tweets_train_cv <- testing(tweets_split)

set.seed(42)
# 7/9 of the remaining 90% gives the 70% training set; the other 20% becomes the CV set
tweets_split <- initial_split(tweets_train_cv, prop = 7/9, strata = target)
tweets_train <- training(tweets_split)
tweets_cv <- testing(tweets_split)
dim(tweets_train)
## [1] 5328 830
dim(tweets_cv)
## [1] 1522 830
dim(tweets_test)
## [1] 763 830
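Since we stratified all the splits on target, each of the three sets should preserve the roughly 57/43 class ratio of the full data. A quick check (a minimal sketch; output omitted):

tweets_train %>%
  count(target) %>%
  mutate(prop = n / sum(n))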
Preparation Recipe
I will use the recipes package from tidymodels to define a recipe for data preprocessing and feature engineering.
recipe(target ~ ., data = tweets_train) %>%
  update_role(id, new_role = "ID") %>%
  step_rm(location, keyword) %>%
  step_mutate(len = str_length(text),
              num_hashtags = str_count(text, "#")) %>%
  step_rm(text) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_pca(all_predictors(), -len, -num_hashtags, threshold = 0.80) -> tweets_recipe
Note above:
- We use the training dataset to create the recipe.
- We won’t use the ‘id’ field as a predictor, only as an identifier.
- For the current analysis, we drop the location and keyword features.
- We create a length feature to model tweet length, and another feature for the number of hashtags in the tweet.
- We get rid of the text field, since we have generated all the features we want from it for now.
- We remove all predictors with zero variance.
- We normalize all features, i.e. center and scale them.
- We add dimensionality reduction using PCA to keep 80% of the variance, reducing the number of features while excluding our custom len and num_hashtags features from the PCA so they are kept as-is.
tweets_prep <- tweets_recipe %>%
  prep(training = tweets_train,
       strings_as_factors = FALSE)
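As a quick sanity check, we can pull the processed training set out of the prepped recipe with juice() and look at its dimensions; the exact column count depends on how many principal components the 80% variance threshold retains (a minimal sketch; output omitted):

tweets_prep %>%
  juice() %>%
  dim()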
Modelling
Baseline model
I will first create a baseline model to beat. In this case, we can predict labels at random, in proportion to the target class counts, and evaluate the model’s performance accordingly.
tweets_prep %>%
  juice() %>%
  count(target) %>%
  mutate(prob = n/sum(n)) %>%
  pull(prob) -> probs
set.seed(42)
tweets_prep %>%
  bake(new_data = tweets_cv) %>%
  mutate(predicted_target = as.factor(sample(0:1,
                                             size = nrow(tweets_cv),
                                             prob = probs, replace = T))) %>%
  accuracy(target, predicted_target)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.512
set.seed(42)
tweets_prep %>%
  bake(new_data = tweets_cv) %>%
  mutate(predicted_target = as.factor(sample(0:1,
                                             size = nrow(tweets_cv),
                                             prob = probs, replace = T))) %>%
  f_meas(target, predicted_target)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 f_meas binary 0.581
As we see above, we get a baseline F1 score (the harmonic mean of precision and recall) of 0.5812 on the CV set. We need to build and train a model that beats this baseline.
set.seed(42)
tweets_prep %>%
  bake(new_data = tweets_test) %>%
  mutate(predicted_target = as.factor(sample(0:1,
                                             size = nrow(tweets_test),
                                             prob = probs, replace = T))) %>%
  accuracy(target, predicted_target)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.503
set.seed(42)
tweets_prep %>%
  bake(new_data = tweets_test) %>%
  mutate(predicted_target = as.factor(sample(0:1,
                                             size = nrow(tweets_test),
                                             prob = probs, replace = T))) %>%
  f_meas(target, predicted_target)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 f_meas binary 0.574
Generating submission file
set.seed(42)
tweets_prep %>%
  bake(new_data = tweets_final) %>%
  mutate(target = as.factor(sample(0:1,
                                   size = nrow(tweets_final),
                                   prob = probs, replace = T))) %>%
  select(id, target) %>%
  write_csv("../data/nlp_with_disaster_tweets/submissions/baseline_cvf_57_testf_57.csv")
K-Nearest Neighbor model
Let’s build a basic KNN model with a small fixed number of neighbors to see how modelling is done in this framework and what the model output looks like.
Basic
knn_spec <- nearest_neighbor(neighbors = 3) %>%
  set_engine("kknn") %>%
  set_mode("classification")

wf <- workflow() %>%
  add_recipe(tweets_recipe)

knn_fit <- wf %>%
  add_model(knn_spec) %>%
  fit(data = tweets_train)

saveRDS(knn_fit, "../data/nlp_with_disaster_tweets/knn/knn_basic_fit.rds")
knn_fit <- readRDS("../data/nlp_with_disaster_tweets/knn/knn_basic_fit.rds")

knn_fit %>%
  pull_workflow_fit() -> wf_fit

wf_fit$fit$MISCLASS
## optimal
## 3 0.3521021
The above shows a simple K-nearest neighbors model using the “kknn” engine, which gives a minimal misclassification of about 0.3521. Let’s try tuning the number of neighbors (K) and see if we can interpret the underlying problem space.
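Before moving on to tuning, we could also score this basic fit directly against our CV set; predict() on a fitted workflow applies the recipe to the new data automatically (a minimal sketch, not run in the original analysis):

knn_fit %>%
  predict(new_data = tweets_cv) %>%
  bind_cols(tweets_cv %>% select(target)) %>%
  f_meas(target, .pred_class)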
Tuning number of neighbors
Using 5-fold cross validation and values of K going from 1 to 100.
set.seed(1234)
folds <- vfold_cv(tweets_train, strata = target, v = 5, repeats = 1)

tune_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_mode("classification") %>%
  set_engine("kknn")

neighbor_grid <- expand.grid(neighbors = seq(1, 100, by = 1))
set.seed(1234)
doParallel::registerDoParallel(cores = parallel::detectCores(logical = FALSE))

knn_grid <- tune_grid(
  wf %>% add_model(tune_spec),
  resamples = folds,
  grid = neighbor_grid,
  metrics = metric_set(accuracy, roc_auc, f_meas),
  control = control_grid(save_pred = TRUE,
                         verbose = TRUE)
)
saveRDS(knn_grid, "../data/nlp_with_disaster_tweets/knn/knn_grid.rds")
<- readRDS("../data/nlp_with_disaster_tweets/knn/knn_grid.rds")
knn_grid
%>%
knn_grid collect_metrics()
## # A tibble: 300 x 6
## neighbors .metric .estimator mean n std_err
## <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 1 accuracy binary 0.628 5 0.00597
## 2 1 f_meas binary 0.705 5 0.00426
## 3 1 roc_auc binary 0.604 5 0.00653
## 4 2 accuracy binary 0.628 5 0.00598
## 5 2 f_meas binary 0.705 5 0.00426
## 6 2 roc_auc binary 0.630 5 0.00928
## 7 3 accuracy binary 0.628 5 0.00544
## 8 3 f_meas binary 0.706 5 0.00349
## 9 3 roc_auc binary 0.639 5 0.00873
## 10 4 accuracy binary 0.628 5 0.00537
## # … with 290 more rows
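Rather than scanning all 300 rows, show_best() from the tune package surfaces the top candidates for a chosen metric (a quick sketch; output omitted):

knn_grid %>%
  show_best("f_meas", n = 5)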
knn_grid %>%
  collect_metrics() %>%
  mutate(flexibility = 1/neighbors,
         .metric = str_to_title(str_replace_all(.metric, "_", " "))) %>%
  ggplot(aes(flexibility, mean, color = .metric)) +
  geom_errorbar(aes(ymin = mean - std_err,
                    ymax = mean + std_err), alpha = 0.5) +
  geom_line(size = 1.5) +
  facet_wrap(~.metric, scales = "free", nrow = 3) +
  scale_x_log10() +
  theme(legend.position = "none") +
  labs(title = "Model performance against model flexibility",
       subtitle = "F1-score peaks around lower flexibility values",
       x = "Model flexibility i.e. Log(1/NumberOfNeighbors)",
       y = "Mean metric value")
As we see in the plot above, the F1 score increases on the evaluation folds until around K = 20 and then starts falling. We plot flexibility (i.e. 1/NumberOfNeighbors) to visualize how performance varies as the model becomes more flexible. A KNN model with K = 1 is highly flexible and therefore has high variance, whereas K = 100 gives a much stricter, less flexible model that may suffer from bias. Since the optimal K appears to be around 20, our underlying problem seems to sit closer to the flexible end of this spectrum, which is worth remembering when picking further models for experimentation. Let’s pick the best K based on the highest F1 score, train our final model on the full training dataset, and evaluate it against the cross-validation dataset.
knn_grid %>%
  select_best("f_meas") -> highest_f_meas

final_knn <- finalize_workflow(
  wf %>% add_model(tune_spec),
  highest_f_meas
)
last_fit(final_knn,
         tweets_split,
         metrics = metric_set(accuracy, roc_auc, f_meas)) -> knn_last_fit

saveRDS(knn_last_fit, "../data/nlp_with_disaster_tweets/knn/knn_last_fit.rds")
knn_last_fit <- readRDS("../data/nlp_with_disaster_tweets/knn/knn_last_fit.rds")

knn_last_fit %>%
  collect_metrics()
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.643
## 2 f_meas binary 0.756
## 3 roc_auc binary 0.687
Our final fitted KNN model with K = 25 gives an F1 score of 0.7561 on the CV set, much higher than our baseline model’s 0.5812 on the same data.
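Analogous to the baseline, we could fit this finalized workflow on all of the training and CV data and generate a KNN submission file (a sketch; the output filename is illustrative):

final_knn %>%
  fit(data = tweets_train_cv) %>%
  predict(new_data = tweets_final) %>%
  bind_cols(tweets_final %>% select(id)) %>%
  select(id, target = .pred_class) %>%
  write_csv("../data/nlp_with_disaster_tweets/submissions/knn_submission.csv")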
Summary
Using this basic modelling algorithm, K-nearest neighbors, we can thus learn quite a few things about our underlying problem space, apply those learnings to further model selection and tuning, and still produce a fairly robust model that predicts far more effectively than the baseline. The tidymodels framework also provides a clean modelling structure that can easily be reproduced and used to train a variety of models. In the next part of this series, I will work on another classic modelling algorithm, Lasso Regression, where we will also see whether any of these features are much more important than the others and whether our two custom features are useful.
References
- Project Summary Page - NLP with disaster tweets: Summary
- Project Part 1 - NLP with Disaster Tweets: Part 1 Data Preparation
- Lasso Regression using Tidymodels by Julia Silge.
- Introduction to Statistical Learning - Book