Tim Thoughts data/rstats/policy

Tampa Update

things I've drastically improved my knowledge of:
git
shiny
rselenium
votebuilder
sql
hamilton lyrics
snapchat
coffee consumption

A Million Things I Haven't Done

So much for writing about the election this year. I just moved to Tampa to join Hillary Clinton and the Florida Democrats, where I'll be working on the data and analytics team for the duration of the election. Volunteering as a fellow for HRC during the California primary will always be a highlight of my life, and I'm really thankful for all the great people I met.

But the hardest part comes next. I'll be working with some really smart people on all sorts of projects, but my primary focus will be helping the organizers and ground game maximize their impact here in Florida. I'm excited to use my growing data skills in the largest (and according to recent polls, the most contested) battleground state alongside some amazingly gifted people. Posting R code and graphs will have to wait for a while. Wish me luck, I'm excited to learn all the things and do all the things.

When times get hard, and they will, I have to remind myself of why I started all this. One of my first posts here a year ago was a little short story joke I wrote about the Don, but it's no longer funny. I take my politics very seriously, and I'm eager to do whatever it takes to put Hillary into office. This isn't the post where I tell you why I support her (maybe a later time), but I'll briefly say that I am proud to stand with my fellow Democrats behind such a qualified candidate.

My boss gave me a survey to complete before I arrived: a list of questions about my professional identity and personality. I've been thinking a lot about what I want to do with my life, and it was refreshing to rethink some of my goals and what I want out of this campaign for myself. It's really important to put my pride and ego aside and concentrate on my daily tasks, but I get that I'm also supposed to grow as a person. I'll list some of my goals here to remind myself.

Professional goals: Master the VAN/Votebuilder, better understand the landscape of Democratic Party data infrastructure, learn how to do more data science things like random forest algorithms, practice SQL/Tableau, absorb all of Daniel Kreiss's new book

Personal goals: Eat healthy, practice self-care, send postcards home, call mom/dad more, memorize the Hamilton soundtrack

Unlikely goals: Post more updates on Snapchat/Instagram, 2K MMR on DOTA, try more EDM music, sleep, date hahahahahaha okay let's just stop here.

There's so much left to learn, and I am so eager for this challenge. I won't come home til the job is finished, and I hope to be a better version of myself by then. I hope this election will change me for the better, and I look forward to the grind. See you in November!

Data Perfectionism is the Enemy

Back in graduate school, my development economics professor assigned us a story by Jorge Luis Borges. In "Funes the Memorious", the protagonist meets a teenage prodigy who has developed perfect memory after a horse-riding accident. For example, the young boy could recall an entire day's worth of memories, and would spend an entire day reliving his thoughts from the previous day.

As we discussed the story, a student remarked, "Wow, what a cool ability!" But I recognized the lesson right away. The boy's ability was both a blessing and a curse: his perfect recall of every single detail prevented him from thinking abstractly or broadly. Hence the relevance to a class on economics, as our professor was stressing the necessity of sacrificing a tiny bit of detail for broader policy validity. Or something like that, I assumed that was his point. Sorry Jeff.

I've been working on aggregating all the primary results by congressional district, and it's been increasingly frustrating to see the disparity between state reporting methods. Tuesday's results in Pennsylvania are a fine example, since the state only reported results by county.

I messaged the Green Papers, and they pointed me towards an AP press release for the Democrats that was much more helpful. To attain their estimated delegate count, they told me their method was "to go from county to CD -- say, 30% of a county is in CDa, 60% in CDb and 10% in CDc: we take 30% of the county vote and apply it to CDa, 60% of the county vote and apply it to CDb, and 10% of the county vote and apply it to CDc. We did that for each county. We have found that the results much more usually end up pretty close to what the final delegate numbers per CD turn out to be".

It should be good enough for me.
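
The Green Papers method quoted above can be sketched in a few lines of base R. This is only my reading of their description, not their actual code: the county names, vote totals, and shares below are made up, and the quote doesn't say what the percentages measure (population, registered voters, or something else).

```r
# Made-up county totals, for illustration only.
county_votes <- data.frame(
  county = c("A", "B"),
  votes  = c(1000, 500),
  stringsAsFactors = FALSE
)

# Made-up shares: the fraction of each county assumed to sit in each district.
shares <- data.frame(
  county   = c("A", "A", "A", "B"),
  district = c("CDa", "CDb", "CDc", "CDa"),
  share    = c(0.3, 0.6, 0.1, 1.0),
  stringsAsFactors = FALSE
)

# Sanity check: the shares for each county should add up to 1.
share_totals <- tapply(shares$share, shares$county, sum)
stopifnot(all(abs(share_totals - 1) < 1e-8))

# Split each county's vote by its shares, then add up the pieces per district.
apportioned <- merge(county_votes, shares, by = "county")
apportioned$est_votes <- apportioned$votes * apportioned$share
district_est <- aggregate(est_votes ~ district, data = apportioned, FUN = sum)

print(district_est)
```

The district totals are estimates, since the method assumes votes are spread evenly across each county.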

Sigh. Okay, so there are 67 counties and 18 congressional districts in Pennsylvania. Some counties sit entirely in one district, and some are split across more than one. But I now know that Precinct 1271 of Whitehall Dist 1 of Allegheny County in Pennsylvania (a precinct being literally the smallest, atomic unit of political geography in the United States) is actually split between 2 congressional districts.

I know this because I'm going to each of the 67 county websites, downloading their data directly, and filtering it in R. I figured I could work out which district a particular precinct belongs to by checking which Congressional race it's voting in. That's when I realized 1271 of Allegheny was voting for candidates and delegates in both the 14th and 18th districts. When I checked 1272-1288, etc. to see the districts for the rest of Whitehall 2-16, they're all in the 18th. You can double-check my results by Control + F "1271 Whitehall Dist 1" at this website.
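
Here is a minimal sketch of that check in base R, assuming the county data has been reshaped to one row per precinct and congressional race on its ballot. The column names and the rows are made up for illustration (only the 1271 label comes from the county file), so treat it as a pattern rather than my exact script.

```r
# Made-up rows: one row per precinct and congressional race on its ballot.
ballot <- data.frame(
  precinct = c("1271 Whitehall Dist 1", "1271 Whitehall Dist 1",
               "1272 Whitehall Dist 2", "1273 Whitehall Dist 3"),
  district = c("14", "18", "18", "18"),
  stringsAsFactors = FALSE
)

# Count the distinct districts that show up for each precinct.
n_districts <- sapply(
  split(ballot$district, ballot$precinct),
  function(x) length(unique(x))
)

# Any precinct voting in more than one congressional race is split.
split_precincts <- names(n_districts)[n_districts > 1]
print(split_precincts)
```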

So that ONE particular precinct has a few people living in the 14th. Just to confirm, I made a phone call to Allegheny's elections department this morning, and they told me that because of some redistricting that occurred during even-numbered years (when Congressional representatives are elected), a precinct such as 1271 may be split. It's literally the only one in the county.

After doing more online digging, I finally found an updated spreadsheet matching precincts to districts on the state website. Considering the lesson of Funes and grad school, I will just recode 1271 as part of the 18th district and move on. Going for perfect granularity when you're dealing with election data is a Sisyphean task and an endless timesuck.
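
The recode itself is tiny. A sketch, assuming a precinct-to-district lookup where the split precinct appears once per district (the table and its column names are made up here, not the state's spreadsheet):

```r
# Made-up lookup with the split precinct listed under both districts.
lookup <- data.frame(
  precinct_id = c("1271", "1271", "1272", "1273"),
  district    = c("14", "18", "18", "18"),
  stringsAsFactors = FALSE
)

# The simplification: treat precinct 1271 as wholly inside the 18th.
lookup$district[lookup$precinct_id == "1271"] <- "18"

# The recode leaves two identical rows for 1271, so keep one row per precinct.
lookup <- unique(lookup)
rownames(lookup) <- NULL

print(lookup)
```

The cost is that the handful of 14th district votes from that precinct get counted in the 18th.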

I just thought I should write a disclaimer, because while I theoretically accept and understand what I just wrote, it's still frustrating to know that there's incomplete data out there. Just learn to live with it, Tim.

Election Analysis, Pt 2: Web-scraping with rvest and gdata using Alabama

So I've been relying on the excellent Green Papers site for the majority of the data that I want. Here is what's on their website, and it's generally the most reliable source (you'll notice that a lot of major media sites like FiveThirtyEight will reference their work).

If I wanted to scrape this exact data frame as it is, I can use "rvest" to do so.

library(rvest)
al<- "http://www.thegreenpapers.com/P16/AL-R" #get URL
al<- read_html(al) 
al<-html_table(html_nodes(al, "table"), fill=TRUE)[7] #extract relevant information
al<-data.frame(al)
al<-al[c(-1:-2, -12),] #drop rows 1, 2 and 12 (this subsets rows, not columns)
al[2:13]<-apply(al[2:13], 2, function(x) as.numeric(gsub(",", "", x))) #delete thousands separator and convert to numeric, will be handy
names(al)<- c("CD","Pop_Vote","Qual_Vote", "Total_Del","Trump_Pop","Trump_Alloc","Trump_Del","Cruz_Pop","Cruz_Alloc","Cruz_Del","Rubio_Pop","Rubio_Alloc","Rubio_Del") #rename columns
print(al)
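
The comma-stripping step is the part that tends to trip people up, so here it is in isolation on a made-up table (the values are invented, and only some of the column names are borrowed from above). This sketch uses lapply instead of apply, which is my own preference for keeping the data frame columns as they are; it is not a claim that apply is wrong here.

```r
# Made-up scraped table: numbers with thousands separators arrive as text.
scraped <- data.frame(
  CD        = c("CD 1", "CD 2"),
  Pop_Vote  = c("123,456", "98,765"),
  Trump_Pop = c("54,321", "43,210"),
  stringsAsFactors = FALSE
)

# Remove the commas, then convert the cleaned text to numeric.
to_number <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Apply the helper only to the columns that hold counts.
num_cols <- c("Pop_Vote", "Trump_Pop")
scraped[num_cols] <- lapply(scraped[num_cols], to_number)

str(scraped)
```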

So the data is ready...right? Sadly I have a slight OCD complex with political data, and we have a much larger task. I want the complete data set, direct from the Alabama Secretary of State's office, which has a much more detailed breakdown. That data includes the rest of the candidates, precinct totals, and absentee vote percentages.

Fortunately the Green Papers provided a direct link to Alabama's website. There are two different data sets available. One is labeled "Results Spreadsheet, Certified March 11 2016" and the other is a ZIP archive of Excel files titled "County-By-County Precinct Level Primary Election Results".

This is what the first file looks like.

Scrolling down to row 75, we'll see the totals divided by Congressional District. Jackpot, it has all of the candidates!

Notice that there are 66 more sheets that look the same for each county. Might be relevant for calculating county totals, but more on that later.

If you want to use this data, we can use "gdata" and "RCurl" to directly read XLS files into our environment. Alternatively, you can download the Excel file into your own directory, open it, save it as a csv, and read it back into R with a simple read.csv function. You may find that option simpler and faster than learning how to read an Excel document into R. However, the spirit of this blog is to maintain reproducibility as much as possible, and I heed Karl Broman's guidelines on avoiding absolute paths.
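
For anyone taking the csv route, a rough sketch of what that looks like with a relative path. Everything here is a stand-in: the folder, the file name, and the two rows are made up, and a temporary folder plays the role of the project directory so the example runs anywhere.

```r
# A temporary folder stands in for the project directory.
project_dir <- tempfile("alabama_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE)

# Made-up rows standing in for the csv saved by hand from Excel.
toy <- data.frame(
  Candidate = c("DONALD J. TRUMP", "TED CRUZ"),
  Votes     = c("1,234", "987"),
  stringsAsFactors = FALSE
)
write.csv(toy, file.path(project_dir, "data", "alabama_republican.csv"),
          row.names = FALSE)

# Work from the project folder and refer to the file by a relative path.
old_wd <- setwd(project_dir)
al_csv <- read.csv(file.path("data", "alabama_republican.csv"),
                   stringsAsFactors = FALSE, na.strings = "")
setwd(old_wd)

print(al_csv)
```

Because the path is relative, the same read.csv line works on someone else's machine as long as they keep the same folder layout.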

library(gdata)
library(tools)
library(stringr)
library(dplyr)
url<-"http://www.alabamavotes.gov/downloads/election/2016/primary/primaryResultsCertified-Republican-Spreadsheet-2016-03-11.xlsx"
alabama_by_congress <- read.xls(url, na="")
alabama_by_congress<- alabama_by_congress[1:3] #only need the first 3 columns
alabama_by_congress<- na.omit(alabama_by_congress[74:183,]) #only need the Presidential races, ignore House District/Senate/etc races
names(alabama_by_congress)<- c("Candidate", "Votes", "Percent") #rename the columns for ease
alabama_by_congress$State<- "Alabama" #I like adding "State" as a column, useful when combining for future datasets with other states
alabama_by_congress$Votes<-as.numeric(gsub(",","",alabama_by_congress$Votes)) #2 functions in this, delete all comma separators and convert to numeric
alabama_by_congress$Candidate<- toTitleCase(word(tolower(alabama_by_congress$Candidate),-1)) #3 functions in this. "tolower" converts JEB BUSH to lower case, then word (from stringr) takes the last word from the string "bush", then toTitleCase from "tools" will capitalize it to title case.
alabama_by_congress<- alabama_by_congress%>% group_by(Candidate) %>% slice(1:7) %>% filter(Candidate !="Total_cd", Votes != "Votes")
#dplyr deserves its own tutorial, but group_by and slice will parse this down further
alabama_by_congress$Percent<-as.numeric(sub("%","",alabama_by_congress$Percent)) #deletes % sign, converts to numeric
alabama_by_congress$District<- paste0(str_pad(1:7, 2, "left", pad="0")) #Creates a "District" column. Will be relevant for geocoding, in later tutorials
alabama_by_congress<- alabama_by_congress[,c(4,5,1,2,3)] #reorder columns so it goes state first
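
To see what the string cleanup lines above do without downloading anything, here is a small sketch on made-up values. It swaps in base-R stand-ins for the stringr calls (sub for word, sprintf for str_pad) so it runs with no extra packages; those substitutions are mine, not part of the script above.

```r
library(tools)  # toTitleCase() lives in the tools package that ships with R

# Made-up raw values in the style of the state spreadsheet.
raw_names   <- c("JEB BUSH", "DONALD J. TRUMP", "MARCO RUBIO")
raw_percent <- c("10.5%", "43.25%", "7%")

# Base-R stand-in for stringr::word(x, -1): drop everything up to the last space.
last_word <- function(x) sub(".*\\s", "", trimws(x))

# Lower-case, keep the surname, then capitalize it.
candidates <- toTitleCase(last_word(tolower(raw_names)))

# Strip the percent sign and convert to numeric.
percents <- as.numeric(sub("%", "", raw_percent, fixed = TRUE))

# Base-R stand-in for str_pad(1:7, 2, "left", pad = "0").
districts <- sprintf("%02d", 1:7)

print(candidates)
print(percents)
print(districts)
```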

Stop here if you're satisfied! Part 3 will go into my OCD complex because (spoilers) this data actually isn't the most up-to-date data available. Also, this spreadsheet only contains Republicans! We'll be reading the ZIP archive in the next post. If you really don't care, move along...

Goodbye Kobe

Kobe Bryant's greatest moments will forever live on in YouTube reels, and I'm thankful that I can always revisit them. Years from now when the memories fade, we'll use them as ammunition in arguments with our younger coworkers and family members who doubt that anyone could play like that.

Here's what I remember, the good and the bad.

The Utah airballs as a young gun. His first match-up against MJ. The 4th quarter comeback against Portland in the 2000 WCF, ending with the crossover on Pippen leading to the famous "Bryant... to Shaq!" alley-oop. Game 4 when Shaq fouled out at Indiana. The first ring of three, a dynasty is born. Boulder, Colorado. A 3pt moonball buzzerbeater at Portland to clinch the division in 04. The feud, and Shaq leaves. The Smush Parker/Kwame Brown era begins (the Dark Ages, to me). Getting 62 in 3 quarters. Averaging 43 points for the month of January 2006, peaking at the 81 vs Toronto. "Bryant for the win...BANG!" to go up 3-1, though it wasn't meant to be. Trade rumors to Chicago, cursing out Bynum in the parking lot. Hope returns with Phil, and Pau Gasol. The train wreck of a 39-point beatdown in Game 6 of the NBA Finals, as the hated Celtics steamrolled Kobe's Lakers to win their first title in decades. Redemption through Team USA, taking over the 4th quarter to beat Spain for the gold. Those stupid and endearing Kobe and LeBron puppet commercials, hyping a Finals matchup that would never happen. That ridiculous one-footed leaning 3 point buzzerbeater against Wade's Heat in 2009 that no professional player should ever even attempt, unless you say "Kobe". The 2010 Finals. 17 in 6 minutes in the 3rd quarter to silence the Garden. 6-24, 15 rebounds in Game 7, still the most dramatic game of basketball I've ever seen. No one knew at the time that it would be the end of an era for Kobe and the Lakers; they haven't returned to the Finals since. It's now LeBron's time. The failed CP3 trade. Dwight and Nash come, it doesn't work out. The Toronto comeback. The Achilles tear, the free throws. The struggle to return. The end.

YouTube will capture most, if not all, of these. It won't capture my memories of hanging out with my bro, watching KCAL right before my dad leaves for work. Watch parties with my high school friends during the deep playoff runs. College runs at the rec center, with Shaan setting up like Sasha Vujacic at the line. Every time I had a hard final or project to work on, I'd mentally channel my inner Mamba (because his legendary work ethic stories had started to spread across the Internet at that point). I had the utmost pleasure of growing up with Kobe, and it breaks me a little to say goodbye to him because it means recognizing that a part of my life is now closed. I don't even watch basketball that much anymore, it's not a part of my routine (I blame Twitch, election season and Reddit highlights). I'm not as invested in fantasy basketball as I used to be, I'm not too emotionally invested in the young guns of our roster, and I don't see how I could be. Kobe Bryant may literally be the last athlete that I have any semblance of an emotional connection to. We grew up together, from those Utah airballs to that last ovation he'll have tonight.

One last memory. Here is the play by play from a game at the Milwaukee Bucks in 2009. The Lakers were losing by 6 points with 90 seconds left. I was eating at the time, and when I glanced over at the TV, my first reaction was "Huh, it's okay, Kobe's got this". After scoring the next 6 points, Kobe nails the turnaround buzzer beater (from the same exact spot he missed in regulation). That's what I'll miss most: the inevitability.

Thank you for all of it. The sweat of work, the pain of failure, the glory of rings. Thanks.
