India Elections 2019: BJP & Congress Manifestos’ Analysis - Part 1 Data Cleaning
tech, DataAnalysis, DataScience, India, Python, R, TextMining, Visualizations
Introduction
With India’s 2019 General Elections around the corner, I thought it’d be a good idea to analyse the election manifestos of its two biggest political parties, BJP and Congress. Let’s use text mining to understand what each party promises and prioritizes. In this part 1, I’ll collect and clean the data and set up the groundwork for the project.
Analysis
Load libraries
rm(list = ls())
library(tidyverse)
library(pdftools)
library(tidylog)
library(hunspell)
library(tidytext)
library(ggplot2)
library(gridExtra)
library(scales)
library(reticulate)
library(widyr)
library(igraph)
library(ggraph)
theme_set(theme_light())
use_python("~/anaconda3/bin/python")
Downloading Manifestos
BJP’s manifesto is available on their website, bjp.org.
# Read the PDF: pdf_text() returns one string per page
bjp_txt <- pdf_text("~/Downloads/BJP-Election-english-2019.pdf")
# One row per line of text, numbered within each page
tibble(
  page = 1:length(bjp_txt),
  text = bjp_txt
) %>%
  separate_rows(text, sep = "\n") %>%
  group_by(page) %>%
  mutate(line = row_number()) %>%
  ungroup() %>%
  select(page, line, text) -> bjp
bjp %>%
  glimpse()
## Rows: 1,590
## Columns: 3
## $ page <int> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ line <int> 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ text <chr> "", " Table of Content…
Congress’ manifesto is available on their website, inc.in.
download.file("https://manifesto.inc.in/pdf/english.pdf", "~/Downloads/congress.pdf")
# Read the PDF: pdf_text() returns one string per page
congress_txt <- pdf_text("~/Downloads/congress.pdf")
# One row per line of text, numbered within each page
tibble(
  page = 1:length(congress_txt),
  text = congress_txt
) %>%
  separate_rows(text, sep = "\n") %>%
  group_by(page) %>%
  mutate(line = row_number()) %>%
  ungroup() %>%
  select(page, line, text) -> congress
congress %>%
  glimpse()
## Rows: 1,490
## Columns: 3
## $ page <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ line <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ text <chr> "CONGRESS", "WILL", "DELIVER", " MANIFESTO", " LOK SAB…
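Both manifestos go through the same steps, so the pipeline could be wrapped in a small helper. This is only a sketch: pages_to_lines() is a hypothetical name that does not appear in the code above, and the toy input stands in for what pdf_text() returns (one string per page).

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical helper (not part of the original analysis): turn a character
# vector with one element per page into a tibble with one row per line.
pages_to_lines <- function(pages) {
  tibble(
    page = seq_along(pages),
    text = pages
  ) %>%
    separate_rows(text, sep = "\n") %>%   # split each page on newlines
    group_by(page) %>%
    mutate(line = row_number()) %>%       # number lines within each page
    ungroup() %>%
    select(page, line, text)
}

# Toy input standing in for pdf_text() output
toy_pages <- c("Title\nSubtitle", "First line\nSecond line\nThird line")
toy_lines <- pages_to_lines(toy_pages)
toy_lines
```

With a helper like this, the two blocks above would reduce to calls such as pages_to_lines(bjp_txt) and pages_to_lines(congress_txt).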
Cleaning
Page range
As we can see from the two documents, the first few pages contain the title and index of each manifesto, followed by notes from the party leaders. The actual plans for development and work start on page 11 of BJP’s manifesto and page 7 of Congress’. I’ll filter out all the earlier pages for exploration.
bjp %>%
  filter(page >= 11) -> bjp_content
congress %>%
  filter(page >= 7) -> congress_content
Text NA
Dropping all the rows where we don’t have any text.
bjp_content %>%
  filter(!is.na(text)) -> bjp_content
congress_content %>%
  filter(!is.na(text)) -> congress_content
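One thing to keep in mind: is.na() only catches missing values, while the glimpse() output earlier shows that blank lines come through as empty strings. If those should go too, a stricter filter could look like the sketch below. The toy tibble and the name toy_nonblank are made up for illustration, and this is an optional variation rather than what the pipeline above does.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy data: one real line, one empty string, one whitespace-only line, one NA
toy <- tibble(
  page = c(11L, 11L, 11L, 11L),
  line = 1:4,
  text = c("Nation First", "", "   ", NA)
)

# Drop NA rows as before, plus rows that are empty after trimming whitespace
toy_nonblank <- toy %>%
  filter(!is.na(text), str_trim(text) != "")

toy_nonblank
```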
Normalize
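Normalizing text lines: unnest_tokens() with token = "lines" keeps the text one line per row and converts it to lowercase by default.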
bjp_content %>%
  unnest_tokens(text, text, token = "lines") -> bjp_content
congress_content %>%
  unnest_tokens(text, text, token = "lines") -> congress_content
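To see what this step does, here is a small sketch on made-up data. The toy lines are invented, not taken from either manifesto. It relies on the default to_lower = TRUE of unnest_tokens(), which lowercases the text.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy data with mixed-case text (invented for illustration)
toy <- tibble(
  page = c(11L, 11L),
  line = 1:2,
  text = c("Towards A NEW India", "Doubling Farmers' Income")
)

# Same call as above: the input and output column are both named text
toy_norm <- toy %>%
  unnest_tokens(text, text, token = "lines")

toy_norm
```

Lowercasing at this stage means later word counts will not treat "India" and "india" as different tokens.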
I’ll take a deep dive into individual topics of the manifestos in separate blog posts. For now, I will export our cleaned and normalized data for future analysis.
Export Data
bjp_content %>%
  write_csv("../data/indian_election_2019/bjp_manifesto_clean.csv")
congress_content %>%
  write_csv("../data/indian_election_2019/congress_manifesto_clean.csv")
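write_csv() expects the target folder to exist already, so it can help to create it first and read the file back as a quick sanity check. The sketch below uses a temporary folder and a toy tibble. Both are assumptions for illustration, not the project's actual paths or data.

```r
library(readr)
library(tibble)

# Toy data standing in for a cleaned manifesto
toy <- tibble(
  page = c(11L, 11L),
  line = 1:2,
  text = c("nation first", "good governance")
)

# Hypothetical output location inside the session's temp directory
out_dir <- file.path(tempdir(), "indian_election_2019")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

out_file <- file.path(out_dir, "toy_manifesto_clean.csv")
write_csv(toy, out_file)

# Read the file back to confirm that the rows and text survived the round trip
toy_back <- read_csv(out_file, show_col_types = FALSE)
toy_back
```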
Stay Tuned!
References
- Part 2 Economic Growth
- For all the parts go to Project Summary Page - India General Elections 2019 Analysis