Introduction

With India’s 2019 General Elections around the corner, I thought it’d be a good idea to analyse the election manifestos of its two biggest political parties, BJP and Congress. Let’s use text mining to understand what each party promises and prioritizes.
In this first part, I’ll collect and clean the data and set up the groundwork for the project.

Analysis

Load libraries

rm(list = ls())
library(tidyverse)
library(pdftools)
library(tidylog)
library(hunspell)
library(tidytext)
library(ggplot2)
library(gridExtra)
library(scales)
library(reticulate)
library(widyr)
library(igraph)
library(ggraph)

theme_set(theme_light())
use_python("~/anaconda3/bin/python")

Downloading Manifestos

BJP’s manifesto is available on their website, bjp.org.

bjp_txt <- pdf_text("~/Downloads/BJP-Election-english-2019.pdf")

tibble(
  page = seq_along(bjp_txt),
  text = bjp_txt
  ) %>% 
  separate_rows(text, sep = "\n") %>% 
  group_by(page) %>% 
  mutate(line = row_number()) %>% 
  ungroup() %>% 
  select(page, line, text) -> bjp
bjp %>% 
  glimpse()
## Rows: 1,590
## Columns: 3
## $ page <int> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ line <int> 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ text <chr> "", "                                           Table of Content…

Congress’ manifesto is available on their website, inc.in.

# mode = "wb" writes the PDF as a binary file, which matters on Windows
download.file("https://manifesto.inc.in/pdf/english.pdf", "~/Downloads/congress.pdf", mode = "wb")
congress_txt <- pdf_text("~/Downloads/congress.pdf")
tibble(
  page = seq_along(congress_txt),
  text = congress_txt
  ) %>% 
  separate_rows(text, sep = "\n") %>% 
  group_by(page) %>% 
  mutate(line = row_number()) %>% 
  ungroup() %>% 
  select(page, line, text) -> congress
congress %>% 
  glimpse()
## Rows: 1,490
## Columns: 3
## $ page <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ line <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ text <chr> "CONGRESS", "WILL", "DELIVER", "      MANIFESTO", "      LOK SAB…
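
Both manifestos go through the same page-to-line pipeline, so it could be wrapped in a small function. The sketch below is one way to do that. The name `pages_to_lines` is my own (hypothetical), and the two-page input is made up to stand in for what `pdf_text()` returns (one string per page).

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical helper: takes a character vector with one element per page
# and returns one row per line of text, numbered within each page.
pages_to_lines <- function(pages) {
  tibble(
    page = seq_along(pages),
    text = pages
  ) %>%
    separate_rows(text, sep = "\n") %>%
    group_by(page) %>%
    mutate(line = row_number()) %>%
    ungroup() %>%
    select(page, line, text)
}

# Made-up input standing in for pdf_text() output
toy_pages <- c("Title\nSubtitle", "First promise\nSecond promise\nThird promise")
toy_lines <- pages_to_lines(toy_pages)
toy_lines
```

With the real data this would be called as `pages_to_lines(bjp_txt)` and `pages_to_lines(congress_txt)`.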

Cleaning

Page range

As the two documents show, the first few pages contain the title and index, followed by notes from the party leaders. The actual plans for development and work start on page 11 in BJP’s manifesto and page 7 in Congress’. I’ll filter out all the earlier pages for this exploration.

bjp %>% 
  filter(page >= 11) -> bjp_content
congress %>% 
  filter(page >= 7) -> congress_content
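
One way to double-check a cut-off page is to look at the first line of each page around it before filtering. Here is a minimal sketch on placeholder rows (the text values are invented, not taken from either manifesto), assuming the same `page`, `line`, `text` columns as above.

```r
library(tibble)
library(dplyr)

# Placeholder rows, not real manifesto text
toy <- tibble(
  page = c(9L, 9L, 10L, 10L, 11L, 11L),
  line = c(1L, 2L, 1L, 2L, 1L, 2L),
  text = c("leader note", "more note", "leader note", "signature",
           "section heading", "first commitment")
)

# First line of each page around the suspected cut-off
page_heads <- toy %>%
  filter(between(page, 9, 12)) %>%
  group_by(page) %>%
  slice_head(n = 1) %>%
  ungroup()
page_heads

# Keep only pages from the chosen cut-off onwards
toy_content <- toy %>%
  filter(page >= 11)
```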

Text NA

Dropping all the rows where we don’t have any text.

bjp_content %>% 
  filter(!is.na(text)) -> bjp_content
congress_content %>% 
  filter(!is.na(text)) -> congress_content
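
The `glimpse()` output earlier shows an empty string (`""`) on the first BJP page, which suggests blank lines can arrive as empty or whitespace-only strings rather than `NA`. If that is the case, a stricter filter along these lines would catch them too. This is a sketch on made-up rows, and I have not checked whether the later normalizing step already drops such lines.

```r
library(tibble)
library(dplyr)
library(stringr)

# Made-up rows: one real line, one empty, one whitespace-only, one NA
toy <- tibble(
  page = c(11L, 11L, 11L, 11L),
  line = 1:4,
  text = c("Nation First", "", "      ", NA)
)

# Drop NA rows and rows that are empty once whitespace is squished
toy_nonblank <- toy %>%
  filter(!is.na(text), str_squish(text) != "")
toy_nonblank
```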

Normalize

Normalizing the text lines with unnest_tokens(), which lowercases the text by default.

bjp_content %>% 
  unnest_tokens(text, text, token = "lines") -> bjp_content
congress_content %>% 
  unnest_tokens(text, text, token = "lines") -> congress_content
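
To see what this step does, here it is on two made-up lines. `unnest_tokens()` lowercases by default (`to_lower = TRUE`). The `str_squish()` call is an extra step I am assuming would be useful for trimming the padding left over from the PDF layout; it is not part of the pipeline above.

```r
library(tibble)
library(dplyr)
library(stringr)
library(tidytext)

# Made-up lines, one with the kind of leading padding PDF extraction leaves
toy <- tibble(
  page = c(11L, 11L),
  line = 1:2,
  text = c("      Nation First", "Doubling Farmers' Income")
)

toy_norm <- toy %>%
  unnest_tokens(text, text, token = "lines") %>%  # lowercases by default
  mutate(text = str_squish(text))                 # assumed extra step: trim padding
toy_norm
```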

I’ll take a deep dive into individual topics of the manifestos in separate blog posts. For now, I will export our cleaned and normalized data for future analysis.

Export Data

bjp_content %>% 
  write_csv("../data/indian_election_2019/bjp_manifesto_clean.csv")
congress_content %>% 
  write_csv("../data/indian_election_2019/congress_manifesto_clean.csv")
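
`write_csv()` does not create missing folders, so the target directory has to exist first. The sketch below creates it and then reads the file back as a quick sanity check. It uses a temporary directory and a made-up file name (`toy_manifesto_clean.csv`) so it has no side effects; the real paths above would replace them.

```r
library(tibble)
library(readr)

# Hypothetical output location inside a temp directory
out_dir <- file.path(tempdir(), "indian_election_2019")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Made-up cleaned rows
toy <- tibble(
  page = c(11L, 11L),
  line = 1:2,
  text = c("nation first", "doubling farmers' income")
)

out_file <- file.path(out_dir, "toy_manifesto_clean.csv")
write_csv(toy, out_file)

# Read back with explicit column types so page and line stay integers
toy_back <- read_csv(
  out_file,
  col_types = cols(
    page = col_integer(),
    line = col_integer(),
    text = col_character()
  )
)
```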

Stay Tuned!
