The Sunlight Foundation has a project called OpenStates that contains data scraped from all 50 state legislatures. This data, plus data found at TIGER, is what I wanted to use for the next iteration of LetterSource. When examining the raw data, I found that the scraped addresses were, well, messy. I decided to take it upon myself to clean them up.
This led to my OpenStates Parser, which goes through the bulk data, extracts what it can, and normalizes the addresses as much as it can. I found the address parsing to be an interesting process and wanted to talk a little about the algorithm I used to do it.
The algorithm went through many iterations, as each time I ran it I found yet another exception. Let’s just say addresses are not as standard as people like to believe they are.
The algorithm I created is as follows:
The problem I ran into even after doing this was that certain states don’t list addresses, just room numbers. I wrote in some exceptions for them to build the capitol address. Even after that, I found a decent number of addresses whose city wasn’t in the list of cities for that zip code. I think that might just be a deficiency in my zip code database, so I tried using Google to make sense of those addresses; sometimes that was successful, other times not.
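To make those two workarounds a bit more concrete, here is a rough Python sketch of what they might look like. It is not the code from my parser: the lookup tables, their sample entries, the regular expression, and the function names `expand_room_only` and `city_matches_zip` are all made up for illustration, and a real version would load its data from a full zip code database and a hand-maintained list of capitol addresses.

```python
import re

# Hypothetical lookup tables with illustrative sample entries only.
ZIP_TO_CITIES = {
    "78701": {"AUSTIN"},
    "95814": {"SACRAMENTO"},
}

CAPITOL_ADDRESSES = {
    "tx": ("1100 Congress Ave", "Austin", "TX", "78701"),
}

# Assumed pattern for an "address" that is nothing but a room number,
# e.g. "Room 1E.8" or "Rm. 12".
ROOM_ONLY = re.compile(r"^\s*(?:room|rm\.?)\s*([\w.-]+)\s*$", re.IGNORECASE)


def expand_room_only(state, raw):
    """Return a full capitol address if raw is just a room number, else None."""
    match = ROOM_ONLY.match(raw)
    capitol = CAPITOL_ADDRESSES.get(state.lower())
    if not match or capitol is None:
        return None
    street, city, st, zip_code = capitol
    return f"{street}, Room {match.group(1)}, {city}, {st} {zip_code}"


def city_matches_zip(city, zip_code):
    """Return True or False when the zip is known, None when it is not in the table."""
    cities = ZIP_TO_CITIES.get(zip_code[:5])
    if cities is None:
        return None
    return city.strip().upper() in cities
```

Returning `None` for an unknown zip code keeps "my database has never heard of this zip" separate from "the city really doesn't match", which is the point where falling back to an outside lookup would make sense.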
I want to reïterate that Sunlight Labs’ parsing is pretty good. It’s not their fault the raw data states and legislators publish is dirty.