
pyhacrf's People

Contributors

dirko, fgregg

pyhacrf's Issues

Speedups

With a trained model, pyhacrf is spending almost all of its time in the _forward and _build_lattice functions. Are these parts of the code stable enough for speed optimizations?

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     4437    0.535    0.000   51.148    0.012 pyhacrf.py:124(predict_proba)
     4437    0.026    0.000   31.088    0.007 pyhacrf.py:355(predict)
     4437    0.074    0.000   31.020    0.007 pyhacrf.py:360(_forward_probabilities)
     4437   26.067    0.006   30.942    0.007 pyhacrf.py:374(_forward)
     4437    0.111    0.000   19.417    0.004 pyhacrf.py:326(__init__)
     4437   14.329    0.003   19.305    0.004 pyhacrf.py:414(_build_lattice)
  2794029    4.299    0.000    4.299    0.000 {numpy.core._dotblas.dot}
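
For anyone who wants to reproduce a table like the one above, here is a minimal sketch using the standard-library profiler. The `slow_score` function is a hypothetical stand-in for a prediction call and is not part of pyhacrf; swap in the real call to profile it.

```python
import cProfile
import io
import pstats


def slow_score(a, b):
    # Hypothetical stand-in for a pyhacrf prediction call.
    return sum(1 for x in a for y in b if x == y)


def profile_calls(func, pairs, top=5):
    """Profile func over all pairs and return the report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    for a, b in pairs:
        func(a, b)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report = profile_calls(slow_score, [("foo", "bar")] * 1000)
print(report)
```

Sorting by cumulative time is what puts callers such as predict_proba at the top, while the tottime column shows where the time is actually spent.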

Order matters in predicted probability of match

If the strings have different lengths, the predicted probability depends on the argument order:

> print(ed('foo1', 'bar'))
0.459472080321
> print(ed('bar', 'foo1'))
0.506212489757
> print(ed('foo', 'bar'))
0.496366272811
> print(ed('bar', 'foo'))
0.496366272811
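
One possible stopgap, until the model itself is symmetric, is to average the score over both argument orders. This is only a sketch of that idea, not something pyhacrf does; `toy_score` is a hypothetical, deliberately order-dependent scorer standing in for `ed`.

```python
def symmetrized(score):
    """Wrap a pairwise scorer so the result no longer depends on argument order."""
    def wrapper(a, b):
        return 0.5 * (score(a, b) + score(b, a))
    return wrapper


def asymmetry(score, a, b):
    """Absolute difference between the two argument orders (0.0 means symmetric)."""
    return abs(score(a, b) - score(b, a))


def toy_score(a, b):
    # Hypothetical order-dependent scorer, used only for illustration.
    return len(a) / (len(a) + 2.0 * len(b))


sym = symmetrized(toy_score)
print(asymmetry(toy_score, "foo1", "bar"))
print(asymmetry(sym, "foo1", "bar"))
```

Averaging doubles the number of model evaluations, so it hides the symptom rather than fixing the cause.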

Pull together training corpus

It might help to add quite a few dictionary words (~1000) with a single edit as matches, and random pairs as mismatches, to give it an edit-distance-like starting point.

I'll start working on this.
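
A rough sketch of how such a corpus could be generated. The function names, the label strings and the one-match-plus-one-mismatch-per-word layout are all assumptions made for illustration; the word list would come from a real dictionary file.

```python
import random
import string


def single_edit(word, rng):
    """Return a copy of a non-empty word with one random character edit."""
    op = rng.choice(["insert", "delete", "substitute"])
    if op == "delete" and len(word) == 1:
        op = "substitute"  # avoid producing an empty string
    letter = rng.choice(string.ascii_lowercase)
    if op == "insert":
        pos = rng.randrange(len(word) + 1)
        return word[:pos] + letter + word[pos:]
    pos = rng.randrange(len(word))
    if op == "delete":
        return word[:pos] + word[pos + 1:]
    while letter == word[pos]:  # a substitution must change the character
        letter = rng.choice(string.ascii_lowercase)
    return word[:pos] + letter + word[pos + 1:]


def build_corpus(words, seed=0):
    """Pair each word with a one-edit variant (match) and a random other word (non-match)."""
    rng = random.Random(seed)
    pairs, labels = [], []
    for word in words:
        pairs.append((word, single_edit(word, rng)))
        labels.append("match")
        other = rng.choice(words)
        while other == word:  # needs at least two distinct words
            other = rng.choice(words)
        pairs.append((word, other))
        labels.append("non-match")
    return pairs, labels


pairs, labels = build_corpus(["apple", "banana", "cherry", "damson", "elder"])
```

Note that a random pair can itself be one edit apart (for example "cat" and "bat"), so a real corpus may want to filter those out of the non-matches.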

Affine gap distance

dedupe currently uses an affine gap distance. This edit distance is very similar to the Levenshtein distance, except that extending a gap (a run of deletions or insertions) costs less than opening one.

This works really well for the kinds of strings we deal with in record linkage. How would we implement this for pyhacrf?
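
For reference, the plain dynamic program for an affine gap distance can be sketched as below (a Gotoh-style recurrence with three tables). This is not dedupe's implementation and not a pyhacrf model; the cost values are arbitrary defaults chosen for illustration. In pyhacrf the analogous change would presumably be extra states for "inside a gap", which this sketch does not attempt.

```python
def affine_gap_distance(a, b, mismatch=1.0, gap_open=1.0, gap_extend=0.5):
    """Edit distance where a gap of length k costs gap_open + (k - 1) * gap_extend."""
    inf = float("inf")
    n, m = len(a), len(b)
    # M: alignment ends in a match or mismatch.
    # X: alignment ends in a gap that consumes characters of a only.
    # Y: alignment ends in a gap that consumes characters of b only.
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = sub + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap pays gap_open; continuing the same kind of gap pays gap_extend.
            X[i][j] = min(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = min(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return min(M[n][m], X[n][m], Y[n][m])


print(affine_gap_distance("abcde", "ab"))
```

With these defaults, dropping the three trailing characters of "abcde" is charged as one gap (one open plus two extensions) rather than three independent deletions.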

New release

Could you make a PyPI release of the Python 3 compatible code?

Thanks!

Build error under conda on Windows 7

I am trying to use pip to install pyhacrf under conda on Windows 7 and it is giving me the following error. Any help with a workaround would be greatly appreciated.

C:\Users\User\AppData\Local\Continuum\Anaconda>pip install pyhacrf
Collecting pyhacrf
  Using cached pyhacrf-0.0.12.tar.gz
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.9 in c:\users\User\appdata\local\continuum\anaconda\lib\site-p
ackages (from pyhacrf)
Requirement already satisfied (use --upgrade to upgrade): PyLBFGS>=0.1.3 in c:\users\User\appdata\local\continuum\anaconda\lib\si
te-packages (from pyhacrf)
Building wheels for collected packages: pyhacrf
  Running setup.py bdist_wheel for pyhacrf
  Complete output from command C:\Users\User\AppData\Local\Continuum\Anaconda\python.exe -c "import setuptools;__file__='c:\\user
s\\User\\appdata\\local\\temp\\pip-build-rojt7c\\pyhacrf\\setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __f
ile__, 'exec'))" bdist_wheel -d c:\users\User\appdata\local\temp\tmpqnoh1dpip-wheel-:
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-2.7
  creating build\lib.win-amd64-2.7\pyhacrf
  copying pyhacrf\feature_extraction.py -> build\lib.win-amd64-2.7\pyhacrf
  copying pyhacrf\pyhacrf.py -> build\lib.win-amd64-2.7\pyhacrf
  copying pyhacrf\state_machine.py -> build\lib.win-amd64-2.7\pyhacrf
  copying pyhacrf\__init__.py -> build\lib.win-amd64-2.7\pyhacrf
  running build_ext
  Looking for python27.dll
  building 'pyhacrf.algorithms' extension
  C compiler: gcc -m64 -g -DNDEBUG -DMS_WIN64 -O2 -Wall -Wstrict-prototypes

  creating build\temp.win-amd64-2.7
  creating build\temp.win-amd64-2.7\Release
  creating build\temp.win-amd64-2.7\Release\pyhacrf
  compile options: '-D__MSVCRT_VERSION__=0x0900 -IC:\\Users\\User\\AppData\\Local\\Continuum\\Anaconda\\lib\\site-packages\\numpy
\\core\\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\lib\site-packages\numpy\core\include -IC:\Users\User\AppData\
Local\Continuum\Anaconda\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\PC -c'
  gcc -m64 -g -DNDEBUG -DMS_WIN64 -O2 -Wall -Wstrict-prototypes -D__MSVCRT_VERSION__=0x0900 -IC:\\Users\\User\\AppData\\Local\\Co
ntinuum\\Anaconda\\lib\\site-packages\\numpy\\core\\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\lib\site-packages\nu
mpy\core\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\PC
-c pyhacrf/algorithms.c -o build\temp.win-amd64-2.7\Release\pyhacrf\algorithms.o
  Found executable C:\Users\User\AppData\Local\Continuum\Anaconda\Scripts\gcc.bat
  gcc -m64 -g -shared build\temp.win-amd64-2.7\Release\pyhacrf\algorithms.o -LC:\\Users\\User\\AppData\\Local\\Continuum\\Anacond
a\\lib\\site-packages\\numpy\\core\\lib -LC:\Users\User\AppData\Local\Continuum\Anaconda\libs -LC:\Users\User\AppData\Local\Co
ntinuum\Anaconda\PCbuild\amd64 -lnpymath -lpython27 -lmsvcr90 -o build\lib.win-amd64-2.7\pyhacrf\algorithms.pyd
  Warning: .drectve `/manifestdependency:"type='win32' name='Microsoft.VC90.CRT' version='9.0.21022.8' processorArchitecture='amd64'
 publicKeyToken='1fc8b3b9a1e18e3b'" /DEFAULTLIB:"python27.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized
  Warning: .drectve `/manifestdependency:"type='win32' name='Microsoft.VC90.CRT' version='9.0.21022.8' processorArchitecture='amd64'
 publicKeyToken='1fc8b3b9a1e18e3b'" /DEFAULTLIB:"python27.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized
  C:\\Users\\User\\AppData\\Local\\Continuum\\Anaconda\\lib\\site-packages\\numpy\\core\\lib/npymath.lib(build/temp.win-amd64-2.7
/build/src.win-amd64-2.7/numpy/core/src/npymath/npy_math.obj):(.text+0x2e3): undefined reference to `__imp_modff'
  collect2.exe: error: ld returned 1 exit status
  error: Command "gcc -m64 -g -shared build\temp.win-amd64-2.7\Release\pyhacrf\algorithms.o -LC:\\Users\\User\\AppData\\Local\\Co
ntinuum\\Anaconda\\lib\\site-packages\\numpy\\core\\lib -LC:\Users\User\AppData\Local\Continuum\Anaconda\libs -LC:\Users\User\
AppData\Local\Continuum\Anaconda\PCbuild\amd64 -lnpymath -lpython27 -lmsvcr90 -o build\lib.win-amd64-2.7\pyhacrf\algorithms.pyd" fai
led with exit status 1

  ----------------------------------------
  Failed building wheel for pyhacrf

Single character string matches

How do you indicate that two one-character strings match, since the transition matrix starts at the first character?

In typical string-distance dynamic programs, the matrix starts before the first character, so we can indicate matches.
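
To make the convention concrete, here is a minimal Levenshtein sketch (not pyhacrf code) in which row 0 and column 0 stand for the empty prefix. Cell (1, 1) is then reached by a transition that consumes the first character of each string, so even a pair of one-character strings has a match or mismatch transition to score.

```python
def levenshtein(a, b):
    """Classic edit distance using an (len(a) + 1) x (len(b) + 1) table."""
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    # Row 0 and column 0 represent the empty prefix of each string.
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    return dist[n][m]


print(levenshtein("a", "a"), levenshtein("a", "b"))
```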

Binary wheels?

Hi @dirko,

For dedupe, I've set up some scripts to automatically build binary wheels for OS X and Windows for many of the dependencies. This makes it a lot easier for people to use these libraries.

ex: https://github.com/datamade/affinegap/blob/master/.travis.yml
https://github.com/datamade/affinegap/blob/master/appveyor.yml

When you push a tag to GitHub, it automatically builds these wheels and deploys them to GitHub.

Would you be interested in me writing these scripts for pyhacrf?

Levenshtein?

@dirko, could you explain what you mean here:

# TODO: For longer strings, tokenize and use Levenshtein

    def __init__(self, bias=1.0, start=False, end=False, match=False, numeric=False, transition=False):
        # TODO: For longer strings, tokenize and use Levenshtein
        # distance up until a lattice position.  Other (possibly)
        # useful features might be whether characters are consonant or
        # vowel, punctuation, case.

What do you mean by "use Levenshtein distance up until a lattice position"?

Latent classes?

General question.

The McCallum paper describes a model with multiple latent states, but by default it seems that there are only two states (match, nonmatch). Am I missing something?

Dynamic program without an edgelist.

With the speedups possible from using an approximate log and exp (#27), pyhacrf is very close to usable for applications that make millions of comparisons. However, there is still a lot of overhead from calculating the edges.

For models where the only transition is between adjacent positions in the string1, string2 matrix, we could avoid calculating the edgelist altogether. I think I'll write up a model that does just that.
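
A rough sketch of the idea, simplified to a single state per lattice position (the real model has several states per position). The predecessors of each cell are derived from its indices, so no edge list is ever built. The `log_potential` callback and the move names are hypothetical placeholders for whatever the model would compute from its features and parameters.

```python
import math


def logaddexp(x, y):
    """Numerically stable log(exp(x) + exp(y))."""
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    hi, lo = max(x, y), min(x, y)
    return hi + math.log1p(math.exp(lo - hi))


def grid_forward(n_rows, n_cols, log_potential):
    """Log-sum over all paths from cell (0, 0) to cell (n_rows - 1, n_cols - 1).

    Paths may step down, right, or diagonally. log_potential(i, j, move) is a
    hypothetical callback giving the log score of entering cell (i, j) by
    move, one of "start", "down", "right" or "diag".
    """
    alpha = [[-math.inf] * n_cols for _ in range(n_rows)]
    alpha[0][0] = log_potential(0, 0, "start")
    for i in range(n_rows):
        for j in range(n_cols):
            if i == 0 and j == 0:
                continue
            total = -math.inf
            # Predecessors are implied by the indices; nothing is looked up.
            if i > 0:
                total = logaddexp(total, alpha[i - 1][j] + log_potential(i, j, "down"))
            if j > 0:
                total = logaddexp(total, alpha[i][j - 1] + log_potential(i, j, "right"))
            if i > 0 and j > 0:
                total = logaddexp(total, alpha[i - 1][j - 1] + log_potential(i, j, "diag"))
            alpha[i][j] = total
    return alpha[-1][-1]


# With all potentials set to zero the result is the log of the number of paths.
print(math.exp(grid_forward(3, 3, lambda i, j, move: 0.0)))
```

A quick sanity check for a version like this is that zero potentials should reproduce the path counts of the grid (3 paths for a 2 x 2 grid, 13 for a 3 x 3 grid).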
