About Me!

This blog is about my musings and thoughts. I hope you find it useful, at most, and entertaining, at least.

jim@jimkeener.com

OpenStates and Address Parsing

Date: 2013-03-10
Tags: gis address

The Sunlight Foundation has a project called OpenStates that contains data scraped from all 50 state legislatures. This data, plus data found at TIGER, is what I wanted to use for the next iteration of LetterSource. When examining the raw data, I found that the scraped addresses were, well, messy. I decided to take it upon myself to clean them up.

This led to my OpenStates Parser, which goes through the bulk data, extracts what it can, and normalizes the addresses as much as it can. I found the address parsing to be an interesting process and wanted to talk a little about the algorithm I used to do it.

The algorithm went through many iterations because each time I ran it, I found yet another exception. Let’s just say addresses are not as standard as people like to believe they are.

The algorithm I created is as follows:

Find the zip code – I look for 5 digits after the first 10 characters of the raw address. If I find a zip+4, I use that; otherwise I use the last group of 5 digits in the raw address.
Find the city – Using the 5-digit zip code, I do a lookup of all cities that have that zip code (using http://federalgovernmentzipcodes.us/, but other zip databases could easily be used as well or in tandem). I then make some modifications to this list of cities (e.g., Saint to St), creating a new list of the originals and the modifications. I then look for words in the raw address that have a high Jaro similarity with any of the target cities and use the best score. I did this, rather than simple string matching, because I found misspellings in the scraped data.
Everything before the city is the street address – I don’t yet separate out the street and street number, but would like to later.
If the address is too long, make it smaller – USPS Publication 28 § 35 says that lines should have no more than 40 characters and no more than 8 words. I try to validate against this by splitting the street line before and after certain key words to make multiple lines. If that isn’t enough, I apply the Publication 28 Appendix G abbreviations to make the lines as short as possible.
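
To make the first three steps a bit more concrete, here is a rough Python sketch of how they could fit together. This is not the parser's actual code. The function names, the 0.85 similarity threshold, the tiny ZIP_TO_CITIES table (a stand-in for a real zip code database), and the sample address are all assumptions made up for illustration. The Jaro similarity is written out by hand so the example needs nothing beyond the standard library.

```python
import re

ZIP_PLUS4 = re.compile(r"(?<!\d)(\d{5})-(\d{4})(?!\d)")
ZIP5 = re.compile(r"(?<!\d)\d{5}(?!\d)")
WORD = re.compile(r"[A-Za-z][A-Za-z.'-]*")

# Illustrative stand-in for a real zip code database (assumption).
ZIP_TO_CITIES = {"17120": ["Harrisburg"], "63101": ["Saint Louis"]}


def find_zip(raw, skip=10):
    """Return (zip5, plus4 or None, start index), or None if no zip is found."""
    match = ZIP_PLUS4.search(raw, skip)  # prefer a zip+4
    if match:
        return match.group(1), match.group(2), match.start()
    hits = list(ZIP5.finditer(raw, skip))
    if hits:  # otherwise take the last 5-digit run
        return hits[-1].group(), None, hits[-1].start()
    return None


def jaro(s1, s2):
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matches."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)
    matched1 = []
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    if not matched1:
        return 0.0
    matched2 = [ch for ch, hit in zip(s2, used) if hit]
    transpositions = sum(a != b for a, b in zip(matched1, matched2)) // 2
    m = len(matched1)
    return (m / len(s1) + m / len(s2) + (m - transpositions) / m) / 3


def city_variants(city):
    """The city name plus a common rewrite (Saint -> St)."""
    names = {city}
    if city.startswith("Saint "):
        names.add("St " + city[len("Saint "):])
    return names


def find_city(text, cities, threshold=0.85):
    """Return (city, start index) for the best fuzzy match, or None."""
    tokens = list(WORD.finditer(text))
    best = (threshold, None, None)  # hypothetical cutoff for "high" similarity
    for city in cities:
        for name in city_variants(city):
            size = len(name.split())  # compare against runs of the same word count
            for i in range(len(tokens) - size + 1):
                phrase = " ".join(t.group() for t in tokens[i:i + size])
                score = jaro(phrase.lower().replace(".", ""), name.lower())
                if score > best[0]:
                    best = (score, city, tokens[i].start())
    return None if best[1] is None else (best[1], best[2])


def parse_address(raw):
    """Split a raw address into street, city, and zip; None if a step fails."""
    found = find_zip(raw)
    if found is None:
        return None
    zip5, plus4, zip_start = found
    before_zip = raw[:zip_start]
    match = find_city(before_zip, ZIP_TO_CITIES.get(zip5, []))
    if match is None:
        return None
    city, city_start = match
    street = before_zip[:city_start].strip(" ,")  # everything before the city
    return {"street": street, "city": city, "zip": zip5, "plus4": plus4}


# Made-up address with a misspelled city name.
print(parse_address("Room 123 Main Capitol Building Harisburg PA 17120-2020"))
```

In this sketch the misspelled "Harisburg" still resolves to "Harrisburg", because the fuzzy score clears the threshold where an exact string comparison would have failed. Multi-word cities are handled by comparing runs of adjacent words, which is one possible way to do it, not necessarily the best one.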

One problem I ran into even after doing this was that certain states don’t list addresses, just room numbers. I wrote in some exceptions for them to build the capitol address. Even after that, I found a decent number of addresses whose city wasn’t in the list of cities for that zip code. I think that might just be a deficiency in my zip code database, so I tried using Google to make sense of those addresses; sometimes that was successful, other times not.

I want to reïterate that Sunlight Labs’ parsing is pretty good. It’s not their fault that the raw data states and legislators publish is dirty.