Introduction

In this NLP getting started challenge on Kaggle, we are given tweets that are labelled 1 if they are about real disasters and 0 if not. The goal is to predict, given the text of a tweet and some other metadata about it, whether it is about a real disaster.
In this Part 1 on data preparation, I will do some basic exploration and vectorize the given tweet text into GloVe embedding vectors.

Analysis

Load Libraries

rm(list = ls())
library(tidyverse)
library(ggplot2)
library(GGally)
library(skimr)
library(tidymodels)
library(keras)
library(janitor)

theme_set(theme_light())

Read Data

tweets <- read_csv("../data/nlp_with_disaster_tweets/train.csv")
tweets_test <- read_csv("../data/nlp_with_disaster_tweets/test.csv")
skim(tweets)
Table 1: Data summary

  Name                     tweets
  Number of rows           7613
  Number of columns        5
  Column type frequency:
    character              3
    numeric                2
  Group variables          None

Variable type: character

  skim_variable   n_missing   complete_rate   min   max   empty   n_unique   whitespace
  keyword                61            0.99     4    21       0        221            0
  location             2534            0.67     1    49       0       3279            0
  text                    0            1.00     7   157       0       7503            0

Variable type: numeric

  skim_variable   n_missing   complete_rate      mean        sd   p0    p25    p50    p75    p100   hist
  id                      0               1   5441.93   3137.12    1   2734   5408   8146   10873   ▇▇▇▇▇
  target                  0               1      0.43      0.50    0      0      0      1       1   ▇▁▁▁▆

Getting GloVe embeddings for tweet text and adding them as features

The workflow for vectorizing tweet text into GloVe embeddings is as follows:

  1. Tokenize incoming tweet texts in the training data.
  2. Download and parse GloVe embeddings into an embedding matrix for the tokenized words.
  3. Generate embedding vectors for the tweet text in the training data.
  4. Generate embedding vectors for the tweet text in the test data.
  5. Append to the given tweet features and export.

Tokenize incoming tweet texts in the training data

Using keras’ text_tokenizer to tokenize the text in the tweets dataset.

text_tokenizer() %>% 
  fit_text_tokenizer(tweets$text) -> tokenizer

# +1 because keras reserves index 0 for padding.
num_words <- length(tokenizer$word_index) + 1
print(length(tokenizer$word_index))
## [1] 22700

A total of 22,700 unique words were assigned an index by the tokenizer.
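
As a quick sanity check (optional, not part of the pipeline), we can peek at a few entries of the word index the tokenizer built:

# word_index is a named list mapping each token to its integer index;
# the most frequent token gets index 1, and index 0 is implicitly reserved for padding.
str(head(tokenizer$word_index, 5))
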
Using the fitted tokenizer, we convert the text into sequences of word indices.

sequences <- texts_to_sequences(tokenizer, tweets$text)

summary(map_int(sequences, length))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   13.00   17.00   16.84   21.00   33.00
maxlen <- max(map_int(sequences, length))
print(maxlen)
## [1] 33

Capping the maximum sequence length at 33 translates every tweet into a sequence of exactly 33 indices: if the original sequence is longer, it is truncated from the beginning, and if it is shorter, it is padded with zeros at the beginning to bring the final length to 33.

pad_sequences(sequences, maxlen = maxlen) -> padded_sequences

str(padded_sequences)
##  int [1:7613, 1:33] 0 0 0 0 0 0 0 0 0 0 ...

As we see above, for each of the 7,613 tweets in the training data we have created a tokenized sequence of 33 elements.
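
As a quick illustration of the “pre” padding and truncation behaviour (a toy example, separate from the pipeline above):

# A short sequence gets zeros prepended; a long one gets its start dropped.
pad_sequences(list(c(5, 6, 7)), maxlen = 5)             # expect: 0 0 5 6 7
pad_sequences(list(c(1, 2, 3, 4, 5, 6, 7)), maxlen = 5) # expect: 3 4 5 6 7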

Download and parse GloVe embeddings into an embedding matrix for the tokenized words

I downloaded the pre-trained GloVe embeddings trained on 2 billion tweets from Stanford’s NLP GloVe project page, and borrowed the code for parsing them and generating the embedding matrix from my deepSentimentR package.

# Parse the GloVe text file into an environment mapping word -> embedding vector.
parse_glove_embeddings <- function(file_path) {
  lines <- readLines(file_path)
  embeddings_index <- new.env(hash = TRUE, parent = emptyenv())
  for (i in seq_along(lines)) {
    # Each line holds a word followed by its space-separated embedding values.
    line <- lines[[i]]
    values <- strsplit(line, " ")[[1]]
    word <- values[[1]]
    embeddings_index[[word]] <- as.double(values[-1])
  }
  cat("Found", length(embeddings_index), "word vectors.\n")
  return(embeddings_index)
}

# Build a (max_words x embedding_dim) matrix where row (index + 1) holds the
# GloVe vector of the word with that tokenizer index; row 1 stays zero for
# the padding index 0, as do rows for words missing from GloVe.
generate_embedding_matrix <- function(word_index, embedding_dim, max_words, glove_file_path) {
  embeddings_index <- parse_glove_embeddings(glove_file_path)

  embedding_matrix <- array(0, c(max_words, embedding_dim))
  for (word in names(word_index)) {
    index <- word_index[[word]]
    if (index < max_words) {
      embedding_vector <- embeddings_index[[word]]
      if (!is.null(embedding_vector)) {
        embedding_matrix[index + 1, ] <- embedding_vector
      }
    }
  }

  return(embedding_matrix)
}

embedding_dim <- 25
embedding_matrix <- generate_embedding_matrix(
  tokenizer$word_index,
  embedding_dim = embedding_dim,
  max_words = num_words,
  glove_file_path = "../../../nlp_with_disaster_tweets/data/glove.twitter.27B/glove.twitter.27B.25d.txt"
)

saveRDS(embedding_matrix, "../data/nlp_with_disaster_tweets/embedding_matrix_25d.rds")
embedding_matrix <- readRDS("../data/nlp_with_disaster_tweets/embedding_matrix_25d.rds")
str(embedding_matrix)
##  num [1:22701, 1:25] 0 0.7864 0.4186 0.7086 -0.0102 ...

Using only 25-dimensional embeddings to keep the computations fast, we have created an embedding matrix that holds a 25-dimensional vector for each of the 22,700 tokenized words in our tweet text (plus one all-zero row for the padding index).
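
As a spot check (the word “fire” here is just an illustrative choice; any token present in the tweets vocabulary works), we can look up a single word’s GloVe vector, remembering that the +1 offset accounts for the padding row:

# Look up the tokenizer index for a word and pull its 25-d GloVe vector.
idx <- tokenizer$word_index[["fire"]]
if (!is.null(idx) && idx < num_words) {
  embedding_matrix[idx + 1, ]
}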

Generate embedding vectors for the tweet text in the training data

We use the keras modelling framework to generate embeddings for the training data: a simple sequential model with a single embedding layer, whose weights we set to (and freeze at) the embedding matrix created above, followed by a flatten layer that turns the output into a 2D matrix of dimensions (7613, 33x25).

keras_model_sequential() %>% 
  layer_embedding(input_dim = num_words, output_dim = embedding_dim, 
                  input_length = maxlen, name = "embedding") %>% 
  layer_flatten(name = "flatten") -> model_embedding

model_embedding %>% 
  get_layer(name = "embedding") %>% 
  set_weights(list(embedding_matrix)) %>% 
  freeze_weights()

model_embedding %>% 
  predict(padded_sequences) -> tweets_embeddings

str(tweets_embeddings)
##  num [1:7613, 1:825] 0 0 0 0 0 0 0 0 0 0 ...

For each of the 7,613 tweet sequences padded to length 33, we use the keras model to “predict”, which looks up the embedding of each of the 33 indices in the sequence and subsequently flattens them into a single feature vector of 33x25 = 825 dimensions.
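
Since the embedding layer here is doing nothing more than a frozen lookup, an equivalent base-R sketch (using the padded_sequences and embedding_matrix built above) looks like this and should match tweets_embeddings:

# For each padded sequence, pull the 25-d vector of every word index and
# flatten timestep-by-timestep into a single 33 * 25 = 825 dimensional row.
# Index 0 (padding) maps to row 1 of embedding_matrix, which is all zeros.
manual_embeddings <- t(apply(padded_sequences, 1, function(seq_row) {
  as.vector(t(embedding_matrix[seq_row + 1, , drop = FALSE]))
}))
dim(manual_embeddings)  # expect: 7613 825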

Generate embedding vectors for the tweet text in the test data

Using the same approach as above (i.e. tokenize, pad and vectorize using GloVe embeddings) on the test data, we generate the corresponding embedding vectors for the text in the test tweets. Note that we use the tokenizer previously fitted on the training data to tokenize the test data.

sequences_test <- texts_to_sequences(tokenizer, tweets_test$text)
pad_sequences(sequences_test, maxlen = maxlen) -> padded_sequences_test

model_embedding %>% 
  predict(padded_sequences_test) -> tweets_embeddings_test

str(tweets_embeddings_test)
##  num [1:3263, 1:825] 0 0 0 0 0 0 0 0 0 0 ...

We similarly get 825 embedding dimensions for 3,263 tweets in the test data.
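
A quick consistency check (optional) before appending, to confirm that the train and test feature matrices line up:

# Same 825 embedding columns in both, and one row per tweet in the test set.
stopifnot(ncol(tweets_embeddings_test) == ncol(tweets_embeddings))
stopifnot(nrow(tweets_embeddings_test) == nrow(tweets_test))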

Append to the given tweet features and export

tweets %>% 
  bind_cols(as_tibble(tweets_embeddings)) %>% 
  clean_names() -> tweets_proc

tweets_test %>% 
  bind_cols(as_tibble(tweets_embeddings_test)) %>% 
  clean_names() -> tweets_test_proc
saveRDS(tweets_proc, "../data/nlp_with_disaster_tweets/tweets_proc.rds")
saveRDS(tweets_test_proc, "../data/nlp_with_disaster_tweets/tweets_test_proc.rds")

Exporting the appended feature set lets us pick this dataset up directly for modelling. I will cover the modelling using the tidymodels framework in upcoming posts.
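
As a rough preview of how the exported features could be picked back up (a minimal sketch assuming a standard rsample split; the actual modelling workflow comes in the next post):

library(tidymodels)

tweets_proc <- readRDS("../data/nlp_with_disaster_tweets/tweets_proc.rds")

# Hold out a validation set stratified on the target before any modelling.
set.seed(42)  # arbitrary seed, just for reproducibility
tweets_split <- initial_split(tweets_proc, prop = 0.8, strata = target)
tweets_train <- training(tweets_split)
tweets_valid <- testing(tweets_split)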
