Tim Thoughts data/rstats/policy

Election Analysis, Pt 2: Web-scraping with rvest and gdata using Alabama

So I’ve been relying on the excellent Green Papers site for the majority of the data that I want. Here is what’s on their website, and it’s generally the most reliable source (you’ll notice that a lot of major media sites like FiveThirtyEight will reference their work).

If I wanted to scrape this exact data frame as it is, I can use “rvest”” to do so.

al<- "http://www.thegreenpapers.com/P16/AL-R" #get URL
al<- read_html(al) 
al<-html_table(html_nodes(al, "table"), fill=TRUE)[7] #extract relevant information
al<-al[c(-1:-2, -12),] #subset correct columns 
al[2:13]<-apply(al[2:13], 2, function(x) as.numeric(gsub(",", "", x))) #delete thousands separator and convert to numeric, will be handy
names(al)<- c("CD","Pop_Vote","Qual_Vote", "Total_Del","Trump_Pop","Trump_Alloc","Trump_Del","Cruz_Pop","Cruz_Alloc","Cruz_Del","Rubio_Pop","Rubio_Alloc","Rubio_Del") #rename columns

So the data is ready…right? Sadly I have a slight OCD complex with political data, and we have a much larger task. I want the complete data set, direct from Alabama’s Secretary of State’s office, where they have a much more detailed breakdown. Such information would contain the rest of the candidates, precinct totals, absentee vote percentages.

Fortunately the Green Papers provided a direct link to Alabama’s website. There are two different data sets available. One is labeled “Results Spreadsheet, Certified March 11 2016” and the other is a ZIP archive of Excel files titled “County-By-County Precinct Level Primary Election Results”.

This is what the first file looks like.

Scrolling down to row 75, we’ll see the totals divided by Congressional District. Jackpot, it has all of the candidates!

Notice that there are 66 more sheets that look the same for each county. Might be relevant for calculating county totals, but more on that later.

If you want to use this data, we can use “gdata” and “RCurl” to directly read XLS files into our environment. Alternatively, you can download the Excel file directly into your own directory, open it, save it as a csv, and read it back into R with a simple read.csv function. You may find that option more simpler and faster than learning how to read an Excel document into R. However the spirit of this blog is to maintain reproducubility as much as possibile, and I heed Karl Broman’s guidelines on avoiding absolute paths.

alabama_by_congress <- read.xls(url, na="")
alabama_by_congress<- alabama_by_congress[1:3] #only need the first 3 columns
alabama_by_congress<- na.omit(alabama_by_congress[74:183,]) #only need the Presidential races, ignore House District/Senate/etc races
names(alabama_by_congress)<- c("Candidate", "Votes", "Percent") #rename the columns for ease
alabama_by_congress$State<- "Alabama" #I like adding "State" as a column, useful when combining for future datasets with other states
alabama_by_congress$Votes<-as.numeric(gsub(",","",alabama_by_congress$Votes)) #2 functions in this, delete all comma separators and convert to numeric
alabama_by_congress$Candidate<- toTitleCase(word(tolower(alabama_by_congress$Candidate),-1)) #3 functions in this. "tolower" converts JEB BUSH to lower case, then word (from stringr) takes the last word from the string "bush", then toTitleCase from "tools" will capitalize it to title case.
alabama_by_congress<- alabama_by_congress%>% group_by(Candidate) %>% slice(1:7) %>% filter(Candidate !="Total_cd", Votes != "Votes")
#dplyr deserves its own tutorial, but group_by and slice will parse this down further
alabama_by_congress$Percent<-as.numeric(sub("%","",alabama_by_congress$Percent)) #deletes % sign, converts to numeric
alabama_by_congress$District<- paste0(str_pad(1:7, 2, "left", pad="0")) #Creates a "District" column. Will be relevant for geocoding, in later tutorials
alabama_by_congress<- alabama_by_congress[,c(4,5,1,2,3)] #reorder columns so it goes state first

Stop here if you’re satisfied! Part 3 will go into my OCD complex because (spoilers) this data is actually isn’t the most up to date data available. Also, this spreadsheet only contains Republicans! We’ll be reading the ZIP archive in the next post. If you really don’t care, move along…

comments powered by Disqus