Tim Thoughts data/rstats/policy

Election Analysis in R

I’m going to write my first tutorial in R. While I would never call myself an expert in R (especially when I follow in the footsteps of so many great R programmers), I hope some beginners to data analysis may find my thoughts useful. A more experienced programmer would (and should!) pick out inefficiencies and errors in my code.

I’m thoroughly obsessed with the US election this year, as you may have noticed from my Tweets over the last few months. The question we’ll be looking to answer in this project: in which geographic areas are certain candidates strongest? Media organizations like the New York Times, Los Angeles Times, and FiveThirtyEight have worked on fascinating interactives to help visualize the chaos of the Republican and Democratic primaries. While we’re not looking to re-invent the wheel here, it will be a fun challenge to replicate some of their ideas within the R ecosystem so that we may go a bit further.

One particular thing I noticed is that the NYT map is divided into counties, while the LAT map is divided by state. Let’s take a more granular view and see if we can build a map focusing on Congressional districts. This is a bit problematic because this data isn’t readily available from most major news outlets, as they tend to report results by county as they come in (for example, CBS’s coverage of the Iowa caucus).

Fortunately there’s a wonderful website called the Green Papers, which has become a one-stop shop for political nerds wishing to learn the arcane rules of state elections. What their website lacks in graphic design, it makes up for in astounding substance if you’re trying to figure out how delegates are allocated. They’ve done a lot of the heavy lifting for us by aggregating congressional-district-level data directly from state election offices and party websites. We will be web-scraping their data whenever it is available. I trust no other site on the Internet more for accurate delegate counts (you’ll find that the AP, CNN, CBS, NYT, etc. all vary in their exact delegate counts, as many states’ final counts will not be solidified until state conventions, delegate selection… it gets a little nauseating thinking about it, ugh).

So we’ll start with the annoying part of any data project: getting clean data. Part one will focus on web-scraping, cleaning, and data manipulation. These are the mundane tasks you delegate to your interns. Just know that they’re vital to your ultimate goal of some fancy regression or app.

PART 1 web-scraping

PART 2 data cleaning

PART 3 visualization in DT

PART 4 spatial mapping

PART 5 shiny

PART 6 deploying

I’m gonna deploy the final app so that you’ll have a sneak peek of what the result looks like.

Shall we begin?

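library(rvest)
# Load plyr before dplyr: attaching plyr second would mask dplyr verbs such as summarise() and mutate()
library(plyr)
library(dplyr)
library(tidyr)
library(stringr)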
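
To give a rough feel for what the scraping and cleaning steps involve, here is a minimal sketch using rvest and dplyr. Everything in it is an assumption made for illustration: the HTML is a small made-up table embedded in the script (not the Green Papers’ actual markup), and the district labels, candidate names, and vote counts are invented. The idea is simply to parse a page, pull a table out of it, turn the comma-formatted counts into numbers, and find the leader in each district.

```r
library(rvest)
library(dplyr)

# Hypothetical stand-in for a results page. The structure, names, and
# numbers are made up; a real page would be fetched with read_html(<url>).
html <- '<html><body>
<table id="results">
  <tr><th>District</th><th>Candidate</th><th>Votes</th></tr>
  <tr><td>CD1</td><td>Candidate A</td><td>1,200</td></tr>
  <tr><td>CD1</td><td>Candidate B</td><td>800</td></tr>
  <tr><td>CD2</td><td>Candidate A</td><td>950</td></tr>
  <tr><td>CD2</td><td>Candidate B</td><td>1,050</td></tr>
</table>
</body></html>'

# Parse the HTML string into a document we can query.
page <- read_html(html)

# Select the table by its CSS id and convert it to a data frame.
results <- page %>%
  html_element("table#results") %>%
  html_table()

# Vote counts arrive as text like "1,200"; strip the commas and convert.
results <- results %>%
  mutate(Votes = as.integer(gsub(",", "", as.character(Votes))))

# Keep the row with the most votes in each district.
leaders <- results %>%
  group_by(District) %>%
  filter(Votes == max(Votes)) %>%
  ungroup()

print(leaders)
```

Working from an embedded string keeps the sketch runnable offline; swapping in a real URL only changes the read_html() call, although the CSS selector and cleaning steps would need to match whatever the real page looks like.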