A Python script for matching a list of messy addresses against a Gazetteer using dedupe. It can also serve as a pseudo-geocoder if your Gazetteer includes latitude/longitude information.
To get this script working without dedupe already installed, run:
git clone git@github.com:datamade/address-matching.git
cd address-matching
pip install "numpy>=1.6"
pip install -r requirements.txt
You will need a Gazetteer of all unique addresses in a given area. For this example, we used the Cook County Address Point shapefile.
This program takes a list of addresses and matches them to individual records in the Gazetteer. For this example, we are using a messy list of early childhood education locations in Chicago. This file can have multiple entries referring to the same place.
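To make the matching idea concrete, here is a minimal sketch of looking up messy addresses in a small Gazetteer. It is not how dedupe works: dedupe learns a matching model from the pairs you label, while this sketch swaps in plain string similarity from Python's standard library `difflib`. The `GAZETTEER` contents, the coordinates, and the names `normalize` and `match_address` are all hypothetical, chosen only for illustration.

```python
import difflib

# Hypothetical Gazetteer: canonical address -> (latitude, longitude).
# The entries and coordinates are illustrative placeholders, not real data
# from the Cook County Address Point file.
GAZETTEER = {
    "100 N STATE ST": (41.8832, -87.6278),
    "1060 W ADDISON ST": (41.9484, -87.6553),
    "233 S WACKER DR": (41.8789, -87.6359),
}


def normalize(address):
    """Uppercase, drop punctuation, and collapse runs of whitespace."""
    cleaned = "".join(
        ch for ch in address.upper() if ch.isalnum() or ch.isspace()
    )
    return " ".join(cleaned.split())


def match_address(messy, gazetteer=GAZETTEER, cutoff=0.6):
    """Return (canonical_address, (lat, long)), or None if nothing is close.

    Uses difflib string similarity as a stand-in for a learned model.
    """
    candidates = difflib.get_close_matches(
        normalize(messy), list(gazetteer), n=1, cutoff=cutoff
    )
    if not candidates:
        return None
    best = candidates[0]
    return best, gazetteer[best]
```

Calling `match_address("1060 west addison street")` returns the canonical `"1060 W ADDISON ST"` entry together with its coordinates, which is the "pseudo-geocoder" behavior in miniature. Several messy rows can resolve to the same Gazetteer record, which is why duplicates in the input file are not a problem.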
Once you have a Gazetteer and a messy input file, run address_matching.py:
python address_matching.py
You will be prompted to label some training pairs so that dedupe can learn which records match. The dedupe documentation covers this labeling step in more detail.
The output will be saved to address_matching_output.csv
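To sanity-check the result, a short sketch like the following reads a CSV file and reports its header and row count. It assumes only that the output file has a header row. The column names are not specified here, so none are hard-coded, and `summarize_csv` is a hypothetical helper name.

```python
import csv


def summarize_csv(path):
    """Return (header, row_count) for a CSV file that has a header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])          # empty list if the file is empty
        row_count = sum(1 for _ in reader)  # remaining rows are data rows
    return header, row_count


# Example usage after the script has finished:
# header, row_count = summarize_csv("address_matching_output.csv")
# print(header, row_count)
```

Comparing the row count against the number of rows in the messy input file is a quick way to check that every input address made it into the output.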