Giter VIP home page Giter VIP logo

address-net's Introduction

AddressNet

Read more on Towards Data Science

Background

This project is an attempt to create a recurrent neural network that segments an Australian street address into its components such that it can be more easily matched against a structured address database. The primary use-case for a model such as this is to transform legacy address data (e.g. unvalidated addresses, such as those collected on paper or by phone) into a reportable form at minimal cost. Once structured address data is produced, searching databases such as GNAF for geocoding information is much easier!

Installation

Get the latest code by installing directly from git using

pip install git+https://github.com/jasonrig/address-net.git

Or from PyPI:

pip install address-net
pip install address-net[tf]     # install TensorFlow (CPU version)
pip install address-net[tf_gpu] # install TensorFlow (GPU version)

You will need an appropriate version of TensorFlow installed, ideally greater than version 1.12. This is not automatically installed since the CPU and GPU versions of TensorFlow exist in separate packages.

Model output

This model performs character-level classification, assigning each a one of the following 22 classes as defined by the GNAF database:

  1. Separator/Blank
  2. Building name
  3. Level number prefix
  4. Level number
  5. Level number suffix
  6. Level type
  7. Flat number prefix
  8. Flat number
  9. Flat number suffix
  10. Flat type
  11. Number first prefix
  12. Number first
  13. Number first suffix
  14. Number last prefix
  15. Number last
  16. Number last suffix
  17. Street name
  18. Street suffix
  19. Street type
  20. Locality name
  21. State
  22. Postcode

An example result from this model for "168A Separation Street Northcote, VIC 3070" would be:

address classification for 168A Separation Street Northcote, VIC 3070

Architecture

This model uses a character-level vocabulary consisting of digits, lower-case ASCII characters, punctuation and whitespace as defined in Python's string package. These characters are encoded using embedding vectors of eight units in length.

The encoded text is fed through a bidirectional three-layer 128-Gated Recurrent Unit (GRU) Recurrent Neural Network (RNN). The outputs from the forward and backward pass are concatenated and fed through a dense layer with ELU activations to produce logits for each class. The final output probabilities are generated through a softmax transformation.

Regularisation is achieved in three ways:

  1. Data augmentation: the addresses constructed from the GNAF dataset are semi-randomly generated so that a huge variety of permutations are produced
  2. Noise: a random typo generator that creates plausible errors consisting of insertions, transpositions, deletions and substitutions of nearby keys on the keyboard is used for each address
  3. Dropout for the outputs and state is applied to the RNN layers

Data sources

The data used to produce this model was from the GNAF database and is available under a permissive Creative Commons-like license. The GNAF data is available as a series of SQL files that can be imported to databases such as PostgreSQL, including a summary view named "address_view". Code included in generate_tf_records.py was used to consume a CSV dump of this file, producing a TFRecord file that is natively supported by TensorFlow.

Pretrained model

While you are free to train this model using the model_fn provided, a pretrained model is supplied with this package under addressnet/pretrained and is the default model loaded when using the prediction function. Thus, using this package should be as simple as:

from addressnet.predict import predict_one

if __name__ == "__main__":
    # This is a fake address!
    print(predict_one("casa del gelato, 10A 24-26 high street road mount waverley vic 3183"))

Expected output:

{
    'building_name': 'CASA DEL GELATO',
    'flat_number': '10',
    'flat_number_suffix': 'A',
    'number_first': '24',
    'number_last': '26',
    'street_name': 'HIGH STREET',
    'street_type': 'ROAD',
    'locality_name': 'MOUNT WAVERLEY',
    'state': 'VICTORIA',
    'postcode': '3183'
}

Because the model is not sensitive to small typographical errors, a simple string similarity algorithm is used to normalise fields such as street_type and state, since we know exhaustively what they should be.

address-net's People

Contributors

jasonrig avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.