dirko / pyhacrf
Hidden alignment conditional random field for classifying string pairs.
License: BSD 3-Clause "New" or "Revised" License
With a trained model, pyhacrf is spending all its time in the _forward and _build_lattice functions. Are these parts of the code stable enough for speed optimizations?
 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
   4437    0.535    0.000   51.148    0.012  pyhacrf.py:124(predict_proba)
   4437    0.026    0.000   31.088    0.007  pyhacrf.py:355(predict)
   4437    0.074    0.000   31.020    0.007  pyhacrf.py:360(_forward_probabilities)
   4437   26.067    0.006   30.942    0.007  pyhacrf.py:374(_forward)
   4437    0.111    0.000   19.417    0.004  pyhacrf.py:326(__init__)
   4437   14.329    0.003   19.305    0.004  pyhacrf.py:414(_build_lattice)
2794029    4.299    0.000    4.299    0.000  {numpy.core._dotblas.dot}
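For anyone who wants to reproduce a table like this, here is a minimal sketch using the standard library's cProfile and pstats. The helper name profile_call and the stand-in function slow_score are hypothetical; in practice you would pass the trained model's prediction call instead.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def slow_score(n):
    """Stand-in workload (hypothetical); replace with the real call."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    _, report = profile_call(slow_score, 100000)
    print(report)
```

Sorting by cumulative time puts the callers at the top, which is how a breakdown like the one above separates time spent in _forward from time spent in _build_lattice.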
If the strings have different lengths, the predicted probability depends on the order of the arguments:
> print(ed('foo1', 'bar'))
0.459472080321
> print(ed('bar', 'foo1'))
0.506212489757
> print(ed('foo', 'bar'))
0.496366272811
> print(ed('bar', 'foo'))
0.496366272811
Hi @dirko, can you make a new release with my latest speedups?
Thanks!
It might help to add quite a few dictionary words (~1000) with a single edit as matches, and random pairs as mismatches, to give the model an edit-distance-like starting point.
I'll start working on this.
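A rough sketch of how such a seed training set could be generated. The function names, the lowercase ASCII alphabet, and the label strings are assumptions for illustration and are not part of pyhacrf. Note that a random pair can occasionally be close in edit distance, so the non-match labels are only approximately right.

```python
import random
import string


def single_edit(word, rng):
    """Return word with one random insertion, deletion, or substitution."""
    ops = ["insert"]
    if word:
        ops.append("substitute")
    if len(word) > 1:
        ops.append("delete")
    op = rng.choice(ops)
    letter = rng.choice(string.ascii_lowercase)
    if op == "insert":
        pos = rng.randrange(len(word) + 1)
        return word[:pos] + letter + word[pos:]
    pos = rng.randrange(len(word))
    if op == "delete":
        return word[:pos] + word[pos + 1:]
    # Substitute: pick a letter that differs from the original one.
    while letter == word[pos]:
        letter = rng.choice(string.ascii_lowercase)
    return word[:pos] + letter + word[pos + 1:]


def make_training_pairs(words, seed=0):
    """Build (pair, label) lists; needs at least two distinct words."""
    rng = random.Random(seed)
    pairs, labels = [], []
    for word in words:
        # One single-edit variant per word, labelled as a match.
        pairs.append((word, single_edit(word, rng)))
        labels.append("match")
        # One random different word, labelled as a non-match.
        other = rng.choice(words)
        while other == word:
            other = rng.choice(words)
        pairs.append((word, other))
        labels.append("non-match")
    return pairs, labels
```

With about 1000 dictionary words this yields about 2000 labelled pairs, balanced between the two classes.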
Currently dedupe uses an affine gap distance. This distance is very similar to the Levenshtein distance, except that extending a gap (a run of deletions or insertions) costs less than opening one.
This works really well for the kinds of strings we deal with in record linkage. How would we implement this for pyhacrf?
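For reference, a minimal sketch of the affine gap distance described above, written as a plain dynamic program with three tables (Gotoh-style). The function name and the cost values are assumptions chosen for illustration; they are not dedupe's or pyhacrf's defaults, and this is a fixed-cost distance, not an HACRF model.

```python
def affine_gap_distance(a, b, mismatch=1.0, gap_open=1.0, gap_extend=0.5):
    """Edit distance where the first character of a gap costs gap_open
    and each further character of the same gap costs gap_extend."""
    inf = float("inf")
    n, m = len(a), len(b)
    # sub[i][j]: best cost for a[:i], b[:j] ending in a match/substitution
    # dele[i][j]: best cost ending with a[i-1] deleted
    # ins[i][j]: best cost ending with b[j-1] inserted
    sub = [[inf] * (m + 1) for _ in range(n + 1)]
    dele = [[inf] * (m + 1) for _ in range(n + 1)]
    ins = [[inf] * (m + 1) for _ in range(n + 1)]
    sub[0][0] = 0.0
    for i in range(1, n + 1):
        dele[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        ins[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else mismatch
            sub[i][j] = cost + min(sub[i - 1][j - 1],
                                   dele[i - 1][j - 1],
                                   ins[i - 1][j - 1])
            # Continuing an existing gap is cheaper than opening a new one.
            dele[i][j] = min(sub[i - 1][j] + gap_open,
                             ins[i - 1][j] + gap_open,
                             dele[i - 1][j] + gap_extend)
            ins[i][j] = min(sub[i][j - 1] + gap_open,
                            dele[i][j - 1] + gap_open,
                            ins[i][j - 1] + gap_extend)
    return min(sub[n][m], dele[n][m], ins[n][m])
```

With these illustrative costs, affine_gap_distance("abcdef", "abc") is 2.0 (one gap of length three: 1.0 + 0.5 + 0.5), whereas the Levenshtein distance for the same pair is 3. Setting gap_open and gap_extend to the same value recovers the Levenshtein distance. The three tables suggest what a learned equivalent might need: a way to distinguish "inside a gap" from "starting a gap", which is an assumption about the modelling approach rather than something pyhacrf is confirmed to provide.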
Could you make a PyPI release of the Python 3 compatible code?
Thanks!
I am trying to use pip to install pyhacrf under conda on Windows 7, and it gives me the following error. Any help with a workaround would be greatly appreciated.
C:\Users\User\AppData\Local\Continuum\Anaconda>pip install pyhacrf
Collecting pyhacrf
Using cached pyhacrf-0.0.12.tar.gz
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.9 in c:\users\User\appdata\local\continuum\anaconda\lib\site-p
ackages (from pyhacrf)
Requirement already satisfied (use --upgrade to upgrade): PyLBFGS>=0.1.3 in c:\users\User\appdata\local\continuum\anaconda\lib\si
te-packages (from pyhacrf)
Building wheels for collected packages: pyhacrf
Running setup.py bdist_wheel for pyhacrf
Complete output from command C:\Users\User\AppData\Local\Continuum\Anaconda\python.exe -c "import setuptools;__file__='c:\\user
s\\User\\appdata\\local\\temp\\pip-build-rojt7c\\pyhacrf\\setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __f
ile__, 'exec'))" bdist_wheel -d c:\users\User\appdata\local\temp\tmpqnoh1dpip-wheel-:
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-2.7
creating build\lib.win-amd64-2.7\pyhacrf
copying pyhacrf\feature_extraction.py -> build\lib.win-amd64-2.7\pyhacrf
copying pyhacrf\pyhacrf.py -> build\lib.win-amd64-2.7\pyhacrf
copying pyhacrf\state_machine.py -> build\lib.win-amd64-2.7\pyhacrf
copying pyhacrf\__init__.py -> build\lib.win-amd64-2.7\pyhacrf
running build_ext
Looking for python27.dll
building 'pyhacrf.algorithms' extension
C compiler: gcc -m64 -g -DNDEBUG -DMS_WIN64 -O2 -Wall -Wstrict-prototypes
creating build\temp.win-amd64-2.7
creating build\temp.win-amd64-2.7\Release
creating build\temp.win-amd64-2.7\Release\pyhacrf
compile options: '-D__MSVCRT_VERSION__=0x0900 -IC:\\Users\\User\\AppData\\Local\\Continuum\\Anaconda\\lib\\site-packages\\numpy
\\core\\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\lib\site-packages\numpy\core\include -IC:\Users\User\AppData\
Local\Continuum\Anaconda\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\PC -c'
gcc -m64 -g -DNDEBUG -DMS_WIN64 -O2 -Wall -Wstrict-prototypes -D__MSVCRT_VERSION__=0x0900 -IC:\\Users\\User\\AppData\\Local\\Co
ntinuum\\Anaconda\\lib\\site-packages\\numpy\\core\\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\lib\site-packages\nu
mpy\core\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\include -IC:\Users\User\AppData\Local\Continuum\Anaconda\PC
-c pyhacrf/algorithms.c -o build\temp.win-amd64-2.7\Release\pyhacrf\algorithms.o
Found executable C:\Users\User\AppData\Local\Continuum\Anaconda\Scripts\gcc.bat
gcc -m64 -g -shared build\temp.win-amd64-2.7\Release\pyhacrf\algorithms.o -LC:\\Users\\User\\AppData\\Local\\Continuum\\Anacond
a\\lib\\site-packages\\numpy\\core\\lib -LC:\Users\User\AppData\Local\Continuum\Anaconda\libs -LC:\Users\User\AppData\Local\Co
ntinuum\Anaconda\PCbuild\amd64 -lnpymath -lpython27 -lmsvcr90 -o build\lib.win-amd64-2.7\pyhacrf\algorithms.pyd
Warning: .drectve `/manifestdependency:"type='win32' name='Microsoft.VC90.CRT' version='9.0.21022.8' processorArchitecture='amd64'
publicKeyToken='1fc8b3b9a1e18e3b'" /DEFAULTLIB:"python27.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized
Warning: .drectve `/manifestdependency:"type='win32' name='Microsoft.VC90.CRT' version='9.0.21022.8' processorArchitecture='amd64'
publicKeyToken='1fc8b3b9a1e18e3b'" /DEFAULTLIB:"python27.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized
C:\\Users\\User\\AppData\\Local\\Continuum\\Anaconda\\lib\\site-packages\\numpy\\core\\lib/npymath.lib(build/temp.win-amd64-2.7
/build/src.win-amd64-2.7/numpy/core/src/npymath/npy_math.obj):(.text+0x2e3): undefined reference to `__imp_modff'
collect2.exe: error: ld returned 1 exit status
error: Command "gcc -m64 -g -shared build\temp.win-amd64-2.7\Release\pyhacrf\algorithms.o -LC:\\Users\\User\\AppData\\Local\\Co
ntinuum\\Anaconda\\lib\\site-packages\\numpy\\core\\lib -LC:\Users\User\AppData\Local\Continuum\Anaconda\libs -LC:\Users\User\
AppData\Local\Continuum\Anaconda\PCbuild\amd64 -lnpymath -lpython27 -lmsvcr90 -o build\lib.win-amd64-2.7\pyhacrf\algorithms.pyd" fai
led with exit status 1
----------------------------------------
Failed building wheel for pyhacrf
@dirko, could you release a new version to PyPI?
How do you indicate that two one-character strings match, since the transition matrix starts at the first character?
In typical string-distance dynamic programs, the matrix starts before the first character of each string, so the very first transition can already indicate a match.
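To make the comparison concrete, here is a small sketch of the usual Levenshtein table, which has one extra row and column for the empty prefix. This is the textbook dynamic program, not pyhacrf's lattice; the function name is an assumption.

```python
def levenshtein_table(a, b):
    """Return the full (len(a)+1) x (len(b)+1) edit-distance table.

    Row 0 and column 0 stand for the empty prefix, i.e. the position
    before the first character of each string.
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        table[i][0] = i  # delete all of a[:i]
    for j in range(m + 1):
        table[0][j] = j  # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(table[i - 1][j - 1] + cost,  # match/substitute
                              table[i - 1][j] + 1,         # delete
                              table[i][j - 1] + 1)         # insert
    return table


print(levenshtein_table("a", "a"))  # [[0, 1], [1, 0]]
print(levenshtein_table("a", "b"))  # [[0, 1], [1, 1]]
```

For two one-character strings the table is 2 x 2, and the match shows up as a diagonal transition from cell (0, 0) to cell (1, 1). A lattice that starts at the first character has only a single cell for this pair and therefore no transition on which to record the match.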
Hi @dirko,
For dedupe, I've set up some scripts to automatically build binary wheels for OS X and Windows for many of the dependencies. This makes it a lot easier for people to use these libraries.
For example: https://github.com/datamade/affinegap/blob/master/.travis.yml
https://github.com/datamade/affinegap/blob/master/appveyor.yml
When you push a tag to GitHub, it automatically builds these wheels and deploys them to GitHub.
Would you be interested in me writing these scripts for pyhacrf?
@dirko, could you explain what you mean here:
pyhacrf/pyhacrf/feature_extraction.py
Line 201 in 5145568
def __init__(self, bias=1.0, start=False, end=False, match=False, numeric=False, transition=False):
# TODO: For longer strings, tokenize and use Levenshtein
# distance up until a lattice position. Other (possibly)
# useful features might be whether characters are consonant or
# vowel, punctuation, case.
What do you mean by "use Levenshtein distance up until a lattice position"?
Hi @dirko,
There are some problems building for Windows. You can see them here:
https://ci.appveyor.com/project/fgregg/dedupe/build/1.0.210#L117
General question.
The McCallum paper describes a model with multiple latent states, but by default it seems that there are only two states (match, nonmatch). Am I missing something?
With the speedups possible from using an approximate log and exp (#27), pyhacrf is very close to usable for applications that make millions of comparisons. However, there is still a lot of overhead from calculating the edges.
For models where the only transitions are between adjacent positions in the string1 x string2 matrix, we could avoid calculating the edge list altogether. I think I'll write up a model that does just that.
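A minimal sketch of that idea, assuming a single-state model in which each cell can only be reached from its upper, left, and upper-left neighbours. The function names and the way scores are passed in are assumptions for illustration; this is not pyhacrf's _forward. The recursion walks the grid directly, so no edge list is ever built.

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    top = max(values)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(v - top) for v in values))


def grid_forward(node_scores, diag=0.0, down=0.0, right=0.0):
    """Log-space forward pass over a rows x cols lattice.

    node_scores[i][j] is the log-potential of cell (i, j); diag, down
    and right are log-potentials of the three adjacent-cell moves.
    Returns the log of the summed score of all paths from the top-left
    cell to the bottom-right cell.
    """
    rows, cols = len(node_scores), len(node_scores[0])
    alpha = [[NEG_INF] * cols for _ in range(rows)]
    alpha[0][0] = node_scores[0][0]
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            incoming = []
            if i > 0 and j > 0:
                incoming.append(alpha[i - 1][j - 1] + diag)
            if i > 0:
                incoming.append(alpha[i - 1][j] + down)
            if j > 0:
                incoming.append(alpha[i][j - 1] + right)
            alpha[i][j] = node_scores[i][j] + logsumexp(incoming)
    return alpha[rows - 1][cols - 1]
```

As a sanity check, with every potential set to zero the result is the log of the number of lattice paths, for example log(13) on a 3 x 3 grid. A real model would have per-state and per-position transition scores, which this sketch leaves out.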