kpu / kenlm
KenLM: Faster and Smaller Language Model Queries
Home Page: http://kheafield.com/code/kenlm/
License: Other
Not sure if this is an issue, but right now I am not able to install from the git clone on either Windows (under Cygwin) or Linux machines. Both give me an error when running "./bjam install":
warning: mismatched versions of Boost.Build engine and core
warning: Boost.Build engine (bjam) is 2014.03.00
warning: Boost.Build core (at /usr/share/boost-build) is 2013.05-svn
error: Unable to find file or target named
error: 'prefix-include'
error: referred to from project at
error: '.'
setup.py needs to be adjusted manually to add flags like HAVE_ZLIB under the extra_compile_args section in order to be able to read compressed LM files.
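For reference, a minimal sketch of what such a manual edit could look like. This is hypothetical, not the actual setup.py from the repository: the define names match kenlm's build flags, but the source list and layout in any given checkout may differ.

```python
# Hypothetical sketch of the relevant part of setup.py.  The feature
# defines (HAVE_ZLIB etc.) must be added by hand so the extension can
# read compressed LM files; the source list here is illustrative only.
from setuptools import Extension

ARGS = [
    '-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=6',
    '-DHAVE_ZLIB',   # add -DHAVE_BZLIB / -DHAVE_XZLIB if those libs exist
]

ext_modules = [Extension(
    name='kenlm',
    sources=['python/kenlm.cpp'],  # plus the lm/ and util/ sources
    language='c++',
    extra_compile_args=ARGS,
    libraries=['z'],               # link zlib to match -DHAVE_ZLIB
)]
```

Each -DHAVE_* flag should only be added when the corresponding library is actually installed, and the matching entry added to libraries so the link step succeeds.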
Compiling kenlm on OS X El Capitan with ./bjam yields the following output – any suggestions?
I have also installed Boost via Homebrew.
rm -rf bootstrap
mkdir bootstrap
cc -o bootstrap/jam0 command.c compile.c constants.c debug.c execcmd.c frames.c function.c glob.c hash.c hdrmacro.c headers.c jam.c jambase.c jamgram.c lists.c make.c make1.c object.c option.c output.c parse.c pathsys.c regexp.c rules.c scan.c search.c subst.c timestamp.c variable.c modules.c strings.c filesys.c builtins.c class.c cwd.c native.c md5.c w32_getreg.c modules/set.c modules/path.c modules/regex.c modules/property-set.c modules/sequence.c modules/order.c execunix.c fileunix.c pathunix.c
make.c:296:37: warning: adding 'int' to a string does not append to the string [-Wstring-plus-int]
printf( "make\t--\t%s%s\n", spaces( depth ), object_str( t->name ) );
^~~~~~~~~~~~~~~
make.c:85:44: note: expanded from macro 'spaces'
~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
make.c:296:37: note: use array indexing to silence this warning
make.c:85:44: note: expanded from macro 'spaces'
^
make.c:303:37: warning: adding 'int' to a string does not append to the string [-Wstring-plus-int]
printf( "make\t--\t%s%s\n", spaces( depth ), object_str( t->name ) );
^~~~~~~~~~~~~~~
make.c:85:44: note: expanded from macro 'spaces'
~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
make.c:303:37: note: use array indexing to silence this warning
make.c:85:44: note: expanded from macro 'spaces'
^
make.c:376:45: warning: adding 'int' to a string does not append to the string [-Wstring-plus-int]
printf( "bind\t--\t%s%s: %s\n", spaces( depth ),
^~~~~~~~~~~~~~~
make.c:85:44: note: expanded from macro 'spaces'
~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
make.c:376:45: note: use array indexing to silence this warning
make.c:85:44: note: expanded from macro 'spaces'
^
make.c:384:45: warning: adding 'int' to a string does not append to the string [-Wstring-plus-int]
printf( "time\t--\t%s%s: %s\n", spaces( depth ),
^~~~~~~~~~~~~~~
make.c:85:44: note: expanded from macro 'spaces'
~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
make.c:384:45: note: use array indexing to silence this warning
make.c:85:44: note: expanded from macro 'spaces'
^
make.c:389:45: warning: adding 'int' to a string does not append to the string [-Wstring-plus-int]
printf( "time\t--\t%s%s: %s\n", spaces( depth ),
^~~~~~~~~~~~~~~
make.c:85:44: note: expanded from macro 'spaces'
~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
make.c:389:45: note: use array indexing to silence this warning
make.c:85:44: note: expanded from macro 'spaces'
^
make.c:731:13: warning: adding 'int' to a string does not append to the string [-Wstring-plus-int]
spaces( depth ), object_str( t->name ) );
^~~~~~~~~~~~~~~
make.c:85:44: note: expanded from macro 'spaces'
~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
make.c:731:13: note: use array indexing to silence this warning
make.c:85:44: note: expanded from macro 'spaces'
^
6 warnings generated.
modules/path.c:16:12: warning: implicit declaration of function 'file_query' is invalid in C99
[-Wimplicit-function-declaration]
return file_query( list_front( lol_get( frame->args, 0 ) ) ) ?
^
1 warning generated.
./bootstrap/jam0 -f build.jam --toolset=darwin --toolset-root= clean
...found 1 target...
...updating 1 target...
...updated 1 target...
./bootstrap/jam0 -f build.jam --toolset=darwin --toolset-root=
...found 139 targets...
...updating 3 targets...
[MKDIR] bin.macosxx86_64
[COMPILE] bin.macosxx86_64/b2
clang: warning: optimization flag '-finline-functions' is not supported
[the warning above repeats once per compilation unit, 49 times in total]
clang: warning: argument unused during compilation: '-finline-functions'
make.c:296:37: warning: adding 'int' to a string does not append to the string [-Wstring-plus-int]
            printf( "make\t--\t%s%s\n", spaces( depth ), object_str( t->name ) );
make.c:85:44: note: expanded from macro 'spaces'
make.c:296:37: note: use array indexing to silence this warning
[the same -Wstring-plus-int warning, expanded from the 'spaces' macro at make.c:85:44, is also reported at make.c lines 303, 376, 384, 389, 731, 768, 772, 778, 784, 787, 790, 793, 797, 800, 803, 806, 809, 812, 815, 821 and 833]
22 warnings generated.
modules/path.c:16:12: warning: implicit declaration of function 'file_query' is invalid in C99 [-Wimplicit-function-declaration]
    return file_query( list_front( lol_get( frame->args, 0 ) ) ) ?
1 warning generated.
[COPY] bin.macosxx86_64/bjam
...updated 3 targets...
~/Downloads/kenlm
Failed to run bash -c "g++ -dM -x c++ -E /dev/null -include boost/version.hpp 2>/dev/null |grep '#define BOOST_'"
Boost does not seem to be installed or g++ is confused.
Installing using pip no longer works since the changes made in commit 500406a.
Pip install fails with the following errors:
python/kenlm.cpp:1430:59: error: ‘class lm::base::Model’ has no member named ‘Score’
python/kenlm.cpp:1450:57: error: ‘class lm::base::Model’ has no member named ‘Score’
python/kenlm.cpp:1637:74: error: ‘class lm::base::Model’ has no member named ‘FullScore’
python/kenlm.cpp:1693:72: error: ‘class lm::base::Model’ has no member named ‘FullScore’
Hi, I was wondering if there was a way to access the probability of the next word in a sentence.
Is reverse lookup supported?
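One common workaround for next-word probability, sketched here under assumptions rather than as an official API: because a sentence score is a sum of per-word conditional log10 scores, the conditional probability of the next word is the difference between scoring the extended prefix and the prefix alone. The helper below is hypothetical and only assumes an object exposing kenlm's score(sentence, bos=..., eos=...) method.

```python
def next_word_logprob(model, context, word):
    """Log10 P(word | context), computed as a difference of prefix scores.

    `model` is assumed to expose kenlm's score(sentence, bos=True, eos=True)
    method; eos is disabled because we are scoring open-ended prefixes,
    not complete sentences.
    """
    with_word = (context + ' ' + word).strip()
    return (model.score(with_word, bos=True, eos=False)
            - model.score(context, bos=True, eos=False))
```

Scoring many candidate words this way re-walks the prefix for every candidate; for large vocabularies the state-based BaseScore interface avoids that repeated work.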
I've just tried using KenLM, and hit an error.
>>> model = kenlm.Model('LM/en.europarl-nc.lm')
Loading the LM will be faster if you build a binary file.
Reading /Users/bittlingmayer/Desktop/sgnln2/private-SignalN-Research/tsiran/lm/LM/en.europarl-nc.lm
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
*The ARPA file is missing <unk>. Substituting log10 probability -100.
***************************************************************************************************
>>> model.score('This is a test')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'kenlm.Model' object has no attribute 'score'
>>> model
<Model from en.europarl-nc.lm>
>>> dir(model)
['BaseFullScore', 'BaseScore', 'BeginSentenceWrite', 'NullContextWrite', '__class__', '__contains__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'order', 'path']
Any idea what I may be doing wrong?
If it makes a difference, I installed via pip and I'm using python 2.7 (anaconda).
Is there a nice way to emulate SRILM's continuous-ngram-count? My goal is to have markers for punctuation (such as commas, periods, exclamation marks, etc.) and to be able to keep context across sentences.
Currently I put the whole text on one line, but it's not great memory-wise.
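kenlm has no continuous-ngram-count mode, but the Python wrapper's state-passing interface can approximate cross-sentence scoring without putting the whole text on one line. A sketch under assumptions: it presumes kenlm's documented State/BaseScore API (BeginSentenceWrite, NullContextWrite), and is written as a generic function so the model is passed in.

```python
def score_stream(model, tokens, make_state, begin_sentence=True):
    """Sum log10 probabilities over a token stream, carrying state across
    whatever boundaries the caller chooses, so context survives between
    "sentences" delimited only by punctuation tokens.

    Assumed API: make_state() -> State, model.BeginSentenceWrite(state),
    model.NullContextWrite(state), and
    model.BaseScore(in_state, word, out_state) -> log10 probability.
    """
    state, out_state = make_state(), make_state()
    if begin_sentence:
        model.BeginSentenceWrite(state)   # start from the <s> context
    else:
        model.NullContextWrite(state)     # start from an empty context
    total = 0.0
    for word in tokens:
        total += model.BaseScore(state, word, out_state)
        state, out_state = out_state, state  # reuse the two state buffers
    return total
```

With a real model this would be called as something like score_stream(model, text.split(), kenlm.State); the context window is still capped at the model's order, but it is never reset at line breaks.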
I can install the kenlm Python package outside of virtualenv, but I am having trouble inside a virtualenv.
Using Mac OS 10.11.4
nlp $ uname -a
Darwin Motokis-Macintosh.local 15.4.0 Darwin Kernel Version 15.4.0: Fri Feb 26 21:17:08 PST 2016; root:xnu-3248.40.184~2/RELEASE_X86_64 x86_64
nlp $ which clang
/usr/bin/clang
nlp $ clang --version
Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Error message:
(nlp) nlp $ STATIC_DEPS=true pip install https://github.com/kpu/kenlm/archive/master.zip
Collecting https://github.com/kpu/kenlm/archive/master.zip
Downloading https://github.com/kpu/kenlm/archive/master.zip (513kB)
100% |████████████████████████████████| 522kB 636kB/s
Installing collected packages: kenlm
Running setup.py install for kenlm ... error
Complete output from command /Users/apewu/smartannotations/nlp/bin/python2.7 -u -c "import setuptools, tokenize;__file__='/var/folders/d1/2291vfk93bq5l675mc1dy21m0000gn/T/pip-obzcbl-build/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/d1/2291vfk93bq5l675mc1dy21m0000gn/T/pip-rSkucL-record/install-record.txt --single-version-externally-managed --compile --install-headers /Users/apewu/smartannotations/nlp/bin/../include/site/python2.7/kenlm:
running install
running build
running build_ext
building 'kenlm' extension
creating build
creating build/temp.macosx-10.11-x86_64-2.7
creating build/temp.macosx-10.11-x86_64-2.7/util
creating build/temp.macosx-10.11-x86_64-2.7/lm
creating build/temp.macosx-10.11-x86_64-2.7/util/double-conversion
creating build/temp.macosx-10.11-x86_64-2.7/python
clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/bit_packing.cc -o build/temp.macosx-10.11-x86_64-2.7/util/bit_packing.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/ersatz_progress.cc -o build/temp.macosx-10.11-x86_64-2.7/util/ersatz_progress.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/exception.cc -o build/temp.macosx-10.11-x86_64-2.7/util/exception.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/file.cc -o build/temp.macosx-10.11-x86_64-2.7/util/file.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/file_piece.cc -o build/temp.macosx-10.11-x86_64-2.7/util/file_piece.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
util/file_piece.cc:37:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
In file included from util/file_piece.cc:3:
In file included from ./util/double-conversion/double-conversion.h:31:
./util/double-conversion/utils.h:302:16: warning: unused typedef 'VerifySizesAreEqual' [-Wunused-local-typedef]
typedef char VerifySizesAreEqual[sizeof(Dest) == sizeof(Source) ? 1 : -1]
^
2 warnings generated.
clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/float_to_string.cc -o build/temp.macosx-10.11-x86_64-2.7/util/float_to_string.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
In file included from util/float_to_string.cc:3:
In file included from ./util/double-conversion/double-conversion.h:31:
./util/double-conversion/utils.h:302:16: warning: unused typedef 'VerifySizesAreEqual' [-Wunused-local-typedef]
typedef char VerifySizesAreEqual[sizeof(Dest) == sizeof(Source) ? 1 : -1]
^
1 warning generated.
clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/integer_to_string.cc -o build/temp.macosx-10.11-x86_64-2.7/util/integer_to_string.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/mmap.cc -o build/temp.macosx-10.11-x86_64-2.7/util/mmap.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
util/mmap.cc:246:15: warning: unused variable 'from_size' [-Wunused-variable]
std::size_t from_size = mem.size();
^
1 warning generated.
clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/murmur_hash.cc -o build/temp.macosx-10.11-x86_64-2.7/util/murmur_hash.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/parallel_read.cc -o build/temp.macosx-10.11-x86_64-2.7/util/parallel_read.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/pool.cc -o build/temp.macosx-10.11-x86_64-2.7/util/pool.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
clang -fno-strict-aliasing -fno-common -dynamic -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/read_compressed.cc -o build/temp.macosx-10.11-x86_64-2.7/util/read_compressed.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
util/read_compressed.cc:24:10: fatal error: 'lzma.h' file not found
#include <lzma.h>
^
1 error generated.
error: command 'clang' failed with exit status 1
----------------------------------------
Command "/Users/apewu/smartannotations/nlp/bin/python2.7 -u -c "import setuptools, tokenize;__file__='/var/folders/d1/2291vfk93bq5l675mc1dy21m0000gn/T/pip-obzcbl-build/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/d1/2291vfk93bq5l675mc1dy21m0000gn/T/pip-rSkucL-record/install-record.txt --single-version-externally-managed --compile --install-headers /Users/apewu/smartannotations/nlp/bin/../include/site/python2.7/kenlm" failed with error code 1 in /var/folders/d1/2291vfk93bq5l675mc1dy21m0000gn/T/pip-obzcbl-build/
Hi, thank you for this nice tool, and also thanks for providing a Windows version.
I have to work on a server with Windows Server 8 R2, and I had successfully built KenLM itself with the project files in the windows folder. However, it always errors out when I try to install the Python library on this Windows server.
P.S. I am mainly working in Python, so the KenLM training tool is not that urgent for me since I could train the data on another machine; I just want to know how to install the Python part.
Any help would be appreciated.
Hi Ken,
Is there a way to replicate with KenLM the workflow where we build an LM as with continuous-ngram-count and then query/process a text with hidden-ngram (given a hidden-vocab file)?
Cheers,
Vince
It would be nice to have access to kenlm.LanguageModel.vocab, or even (maybe a more pythonic way) to support the iterable protocol on kenlm.LanguageModel.
The current RewindableStream implementation can cycle through blocks with operator++ and then fail to detect an overrun, because the same memory block has been recycled.
Can you comment on build systems?
bjam is the default and preferred option. I see you have provided compile_query_only.sh, presumably as a convenience for folks who don't want to bother with Boost.
What about cmake? I see that it has been added to the tree, and in fact I just built Joshua using it, but it's not clear to me that this was the right thing to do. In particular, cmake does not seem to respond to environment settings of, e.g., KENLM_MAX_ORDER. Why is cmake present, and what is its intended use? It also litters files all over the place.
It seems I should revert to using bjam in my own build process.
(My goal is to make it easier to depend on KenLM. Ideally I'd like to package it as a submodule. I've already separated KenLM from Joshua's wrappers and it works well, apart from the build system complication).
(Caveat: I do not understand modern build systems.)
Hi Kenneth!
I am now using kenlm to experiment with different language models. From time to time I need to compute the conditional probabilities of all n-grams. ARPA files do not contain them all, and there is a rule for computing the probabilities that are not explicitly listed. I wrote a simple 20-line Python script that uses the arpa package to do that. Basically, what that package does is accept an n-gram string and return the probability of the last word conditioned on the prefix. Maybe I did something wrong, but it takes "forever" to compute, for instance, all 5-gram probabilities, even with hundreds of threads.
I am wondering what would be the best way to compute the probabilities of all possible n-grams with kenlm. I looked through the code and your examples, and I think something like this may work:
Does that sound reasonable, or is there a better way to do it?
Thanks,
Sergey.
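For reference, the backoff rule in question can be sketched in a few lines of pure Python. The tables below are invented toy values, not from a real model; kenlm implements this same recursion natively, and its C++ or Python bindings will be far faster than the arpa package.

```python
# Toy ARPA-style tables (log10 values); all numbers invented for illustration.
LOGPROB = {
    ('the',): -1.0,
    ('cat',): -2.0,
    ('the', 'cat'): -0.5,
}
BACKOFF = {('the',): -0.3}   # log10 backoff weight of the history "the"
UNK_LOGPROB = -3.0           # stand-in for the <unk> probability

def log10_prob(ngram):
    """ARPA backoff rule: use the listed probability if present; otherwise
    add the history's backoff weight and recurse on the shortened n-gram."""
    ngram = tuple(ngram)
    if ngram in LOGPROB:
        return LOGPROB[ngram]
    if len(ngram) == 1:
        return UNK_LOGPROB
    # Unlisted backoff weights default to 0.0, i.e. a multiplier of 1.
    return BACKOFF.get(ngram[:-1], 0.0) + log10_prob(ngram[1:])

listed = log10_prob(('the', 'cat'))      # listed directly: -0.5
backed_off = log10_prob(('cat', 'cat'))  # backs off to P(cat): -2.0
```

This is exactly what the arpa package does per query; the speed problem is the repeated dictionary walk in Python, not the rule itself.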
platform : 64bit, Red Hat Enterprise Linux Server release 5.8 (Tikanga)
g++ : g++ (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52)
You must use ./bjam if you want language model estimation, filtering, or support for compressed files (.gz, .bz2, .xz)
Compiling with g++ -I. -O3 -DNDEBUG -DKENLM_MAX_ORDER=6
./util/scoped.hh: In static member function 'static void util::scoped_c_forward<T, clean>::Close(T*) [with T = void, void (* clean)(T*) = free]':
./util/scoped.hh:28: instantiated from 'util::scoped_base<T, Closer>::~scoped_base() [with T = void, Closer = util::scoped_c_forward<void, free>]'
./util/scoped.hh:55: instantiated from here
./util/scoped.hh:70: internal compiler error: in build_call, at cp/call.c:321
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://bugzilla.redhat.com/bugzilla> for instructions.
Preprocessed source stored into /tmp/ccFSXFye.out file, please attach this to your bugreport.
Hi, I got a segmentation fault when running "lmplz -o 3 < text > arpa" on a corpus; the stack trace is pasted below. lmplz runs fine on several other corpora. The only thing special about this corpus is that it contains a lot of duplicated sentences; I don't know whether that could cause the segmentation fault.
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffca054700 (LWP 9216)]
0x00000000004856ca in lm::builder::NGram::IsMarked (this=0x7fffca053c20) at ./lm/builder/ngram.hh:77
77 return Value().count >> (sizeof(Value().count) * 8 - 1);
(gdb) bt
#0 0x00000000004856ca in lm::builder::NGram::IsMarked (this=0x7fffca053c20)
at ./lm/builder/ngram.hh:77
#1 0x000000000048e12a in lm::builder::NGram::CutoffCount (this=0x7fffca053c20)
at ./lm/builder/ngram.hh:93
#2 0x000000000048afa6 in lm::builder::(anonymous namespace)::PruneNGramStream::operator++ (
this=0x7fffca053c20) at /home/cfan/tools/kenlm/lm/builder/initial_probabilities.cc:74
#3 0x000000000048bb40 in lm::builder::(anonymous namespace)::MergeRight::Run (this=0x95bc78,
primary=...) at /home/cfan/tools/kenlm/lm/builder/initial_probabilities.cc:238
#4 0x000000000048df48 in util::stream::Thread::operator()<util::stream::ChainPosition, lm::builder::{anonymous}::MergeRight>(const util::stream::ChainPosition &, lm::builder::(anonymous namespace)::MergeRight &) (this=0x928170, position=..., worker=...) at ./util/stream/chain.hh:77
#5  0x000000000048ddf1 in boost::_bi::list2<boost::_bi::value<util::stream::ChainPosition>, boost::_bi::value<lm::builder::{anonymous}::MergeRight> >::operator()<boost::reference_wrapper<util::stream::Thread>, boost::_bi::list0>(boost::_bi::type<void>, boost::reference_wrapper<util::stream::Thread>&, boost::_bi::list0&, int) (this=0x95bc40, f=..., a=...) at /usr/include/boost/bind/bind.hpp:313
#6  0x000000000048dccf in boost::_bi::bind_t<void, boost::reference_wrapper<util::stream::Thread>, boost::_bi::list2<boost::_bi::value<util::stream::ChainPosition>, boost::_bi::value<lm::builder::{anonymous}::MergeRight> > >::operator()() (this=0x95bc38)
at /usr/include/boost/bind/bind_template.hpp:20
#7  0x000000000048dc34 in boost::detail::thread_data<boost::_bi::bind_t<void, boost::reference_wrapper<util::stream::Thread>, boost::_bi::list2<boost::_bi::value<util::stream::ChainPosition>, boost::_bi::value<lm::builder::{anonymous}::MergeRight> > > >::run() (this=0x95bab0)
at /usr/include/boost/thread/detail/thread.hpp:61
Even the most basic quantity, P(you | where are), cannot be computed:
full_scores("where are you") automatically appends <s> at the beginning of the phrase, which is stupid.
If I really want to compute "<s> where are you", I will append <s> myself.
Hello, in the kenlm documentation I found only one function to use: en_model.score(sentence).
Can you please provide a detailed description of the available functions, if there are more?
I'm trying to read unigram, bigram, and trigram probabilities from the LM as they appear there.
For example, the LM contains the following lines. I need a function that works like this: en_model.bigram_prob("too recognize") would return -4.923469.
-4.923469 too recognised
-4.923469 too recognises
-4.923469 too recognize
-4.923469 too recommend
The same for unigrams and trigrams.
Does kenlm support such functionality ?
Thank you,
Zaven.
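For what it's worth, the ARPA text format is easy to parse directly. Below is a minimal pure-Python sketch of such a bigram_prob helper; the helper name and the inline file fragment are hypothetical, and a real parser would also need to handle the optional third (backoff) column on each line.

```python
# Hypothetical fragment of an ARPA file; real files also contain a \data\
# header, other n-gram sections, and backoff weights in a third column.
ARPA_FRAGMENT = """\
\\2-grams:
-4.923469\ttoo recognised
-4.923469\ttoo recognize
"""

def load_bigrams(lines):
    """Collect log10 probabilities from the \\2-grams: section,
    keyed by the n-gram string exactly as it appears."""
    probs, in_section = {}, False
    for line in lines:
        line = line.strip()
        if line == '\\2-grams:':
            in_section = True
            continue
        if line.startswith('\\'):  # the next section header ends ours
            in_section = False
            continue
        if in_section and line:
            fields = line.split('\t')
            probs[fields[1]] = float(fields[0])
    return probs

bigrams = load_bigrams(ARPA_FRAGMENT.splitlines())

def bigram_prob(ngram):
    """Hypothetical helper mirroring the requested en_model.bigram_prob."""
    return bigrams[ngram]
```

The same loop, pointed at the \1-grams: or \3-grams: section, covers unigrams and trigrams.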
Hi @kpu, I have a question for you.
I trained a 4-gram LM and a 5-gram LM on the same corpus with the same configuration.
When I test the language models on a sentence, I find an unreasonable result.
For example, I have a sentence here:
m4 = kenlm.Model('4gram-lm')
m5 = kenlm.Model('5gram-lm')
sent_3 = 'bolivia holds presidential'
s4 = m4.score(sent_3, bos = False, eos = False)
s5 = m5.score(sent_3, bos = False, eos = False)
I test the language-model score on a sentence of length 3 and get exactly the same s4 and s5, which is reasonable:
s4: -13.948734283447266
s5: -13.948734283447266
But when I test on a sentence of length 4, something strange happens:
sent_4 = 'bolivia holds presidential and'
s4 = m4.score(sent_4, bos = False, eos = False)
s5 = m5.score(sent_4, bos = False, eos = False)
s4: -8.61363410949707
s5: -8.647890090942383
I think s4 and s5 should be the same; however, I get slightly different values there.
For a string of length 4, not considering bos and eos:
p4(w1 w2 w3 w4) = p(w1) * p(w2 | w1) * p(w3 | w1 w2) * p(w4 | w1 w2 w3)
p5(w1 w2 w3 w4) = p(w1) * p(w2 | w1) * p(w3 | w1 w2) * p(w4 | w1 w2 w3)
So p4 and p5 should be the same, right? Can you give me some explanation for this?
Of course, for a sentence of length 5 it will be different, because the last terms in the following formulas differ:
p4(w1 w2 w3 w4 w5) = p(w1) * p(w2 | w1) * p(w3 | w1 w2) * p(w4 | w1 w2 w3) * p(w5 | w2 w3 w4)
p5(w1 w2 w3 w4 w5) = p(w1) * p(w2 | w1) * p(w3 | w1 w2) * p(w4 | w1 w2 w3) * p(w5 | w1 w2 w3 w4)
Hi,
I've tried to use kenlm as a library in my decoder. However, libkenlm.so gives unexpected results.
You can reproduce my situation as follows, assuming kenlm is compiled:
cd </path/to/kenlm/lm>
g++ -DKENLM_MAX_ORDER=2 -I../ -c -o query_main.o query_main.cc
g++ -L../lib -o query_main query_main.o -lkenlm
export LD_LIBRARY_PATH=../lib
./query_main test.arpa
This raises a core dump.
My environment is Ubuntu 12.04.2, with g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3.
Add a flag to lmplz such that, within each order of the created ARPA file, the n-grams are sorted lexicographically.
Running on OSX, boost 1.55
I'm essentially following the instructions in this document:
http://victor.chahuneau.fr/notes/2012/07/03/kenlm.html
Interestingly, everything worked once, but then stopped working. When I run ./bjam, I get a couple of errors, one involving a broken pipe, but the one I'm most concerned about is:
-bash: ./kenlm/bin/lmplz: No such file or directory
The output begins with: warning: No toolsets are configured.
warning: Configuring default toolset "darwin".
warning: If the default is wrong, your build may not work correctly.
warning: Use the "toolset=xxxxx" option to override our guess.
warning: For more configuration options, please consult
warning: http://boost.org/boost-build2/doc/html/bbv2/advanced/configuration.html
...patience...
...found 628 targets...
...updating 38 targets...
and at the end, the output is...
...failed darwin.link lm/bin/left_test.test/darwin-5.1.0/release/threading-multi/left_test...
...skipped <plm/bin/left_test.test/darwin-5.1.0/release/threading-multi>left_test.run for lack of <plm/bin/left_test.test/darwin-5.1.0/release/threading-multi>left_test...
...failed updating 23 targets...
...skipped 15 targets...
If I could attach the log I would but it's very long!
I've run into issues trying to compile kenlm on Mavericks 10.9. Using the default clang provided by Xcode:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin13.0.0
Thread model: posix
everything seems to compile okay (a few tests fail), but when I go to train a model, I get:
jbg-hackintosh:simtrans jbg$ lmplz -o 3 -S 2G -T /tmp < scratch/lm/train-de > scratch/lm/train-de.arpa
=== 1/5 Counting and sorting n-grams ===
Reading stdin
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Function not implemented
I thought maybe clang was the issue, so I also tried gcc 4.8 (via Homebrew), which produces a linking error (which I won't copy here, as it may be a Boost issue; I haven't debugged it fully). My student reproduced the same issue on his Mavericks laptop.
Is there a recommended path for building kenlm in 10.9?
Hi,
is there a tool to convert existing ARPA-format files to the files generated by lmplz's --intermediate flag?
Thanks, Mittul
Looks like the required() setting from Boost.Program_options is used, which was only added in 1.41. I guess the version requirement in CMakeLists.txt should be bumped.
[ 85%] Building CXX object lm/CMakeFiles/partial_test.dir/partial_test.cc.o
/home/cortex-m40/kenlm/lm/kenlm_benchmark_main.cc: In function ‘int main(int, char**)’:
/home/cortex-m40/kenlm/lm/kenlm_benchmark_main.cc:200:51: error: ‘class boost::program_options::typed_value<std::basic_string<char>, char>’ has no member named ‘required’
("model,m", po::value<std::string>(&model)->required(), "Model to query or convert vocab ids")
^
make[2]: *** [lm/CMakeFiles/kenlm_benchmark.dir/kenlm_benchmark_main.cc.o] Error 1
make[1]: *** [lm/CMakeFiles/kenlm_benchmark.dir/all] Error 2
Hi ~ I have some issues when trying to compile mosesdecoder on RedHat 5.8 with gcc 4.1.2.
Any tips on how to fix it?
Thank you very much!
./util/scoped.hh:70: internal compiler error: in build_call, at cp/call.c:321
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://bugzilla.redhat.com/bugzilla> for instructions.
Preprocessed source stored into /tmp/ccnex5v6.out file, please attach this to your bugreport.
...failed gcc.compile.c++ lm/bin/gcc-4.1.2/release/debug-symbols-on/link-static/threading-multi/quantize.o...
.../libs/kenlm/lm/builder/adjust_counts.cc:61 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
ERROR: 1-gram discount out of range for adjusted count 2: -1.6000001
Aborted (core dumped)
What could have happened to cause this error? We preprocessed the files to limit them to a 10k vocabulary (replacing out-of-vocabulary words with <unk>). The files are sufficiently big (with line breaks, thanks to the help in the other thread); some output info:
Unigram tokens 77187240 types 10002
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:120024 2:37614993408 3:70528114688
ERROR: 1-gram discount out of range for adjusted count 2: -1.75
Aborted (core dumped)
Did fixing the vocabulary externally cause this problem?
KenLM is sweet.
In order to compile it on OSX (10.8.3) I had to modify the 'limits' include in:
https://github.com/kpu/kenlm/blob/master/util/file.cc
I added one more include to the very top of this file:
#include <limits.h>
and everything suddenly compiled like magic. The '.h' was the secret sauce.
Brew is pretty nice for the boost stuff too. I was dreading this aspect, but:
$ brew install boost
just worked.
Hello,
I know this isn't an issue, but I couldn't find anywhere else to ask.
It seems there is no way to interpolate several ARPA models into one, as SRILM and IRSTLM do. Is this a planned feature, or does kenlm not do it on purpose?
Thank you anyway for the awesome job with kenlm !
Hi,
I am using the tool to build an LM over entity grids. Obviously, I am therefore not interested in including probabilities of n-grams that contain sentence boundaries. Is it possible to achieve this somehow? I still want to compute n-grams only within a sentence, so making one big sentence would not solve the problem.
thanks! (especially for the great tool!)
Sorry about this question :(
I ran into some confusion: I always thought perplexity for a document is evaluated per sentence, and then you average over all the sentences' perplexities in the document. Is this how KenLM implements bin/query?
Or does KenLM evaluate the perplexity over the whole document and then normalize by the length of the document?
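The two definitions do differ, so the distinction matters. A toy sketch with invented log10 word scores, assuming the corpus-level definition pools every scored word before normalizing:

```python
# Hypothetical per-word log10 probabilities for a two-sentence "document".
sentences = [
    [-1.0, -1.0, -1.0],  # sentence 1 (3 scored words, e.g. incl. </s>)
    [-3.0],              # sentence 2 (1 scored word)
]

def perplexity(log10_probs):
    """Perplexity = 10^(-average log10 probability per scored word)."""
    return 10.0 ** (-sum(log10_probs) / len(log10_probs))

# Corpus-level: pool every scored word of the document, then normalize.
all_words = [lp for sent in sentences for lp in sent]
doc_ppl = perplexity(all_words)

# Averaging per-sentence perplexities is a different quantity.
avg_sentence_ppl = sum(perplexity(s) for s in sentences) / len(sentences)
```

Here doc_ppl is about 31.6 while avg_sentence_ppl is 505, since averaging perplexities (rather than log probabilities) overweights short, poorly scored sentences.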
I would like to point out that identifiers like "LM_NGRAM_QUERY__" and "LM_VOCAB__" do not conform to the naming rules of the C++ language standard: identifiers containing a double underscore are reserved for the implementation.
Would you like to adjust your choice of unique names?
Would it be possible to upload the official repo to PyPI?
Hello. Firstly, thanks for this great tool. The Python support has made it very easy to use alongside nltk for some recent research.
I'm having difficulty finding documentation for the probabilities returned by model.full_scores(). They appear to be log probabilities, but I'm unsure of the base.
Scanning through the repository, I found this line that seems to indicate that it is base 10:
Line 67 in a8a1b55
But I can't find any other reason to confirm that this is the case. Thanks.
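Assuming base 10 (which matches the line cited above, and the ARPA format, whose stored probabilities are log10), converting to another base is just a constant factor:

```python
import math

log10_p = -13.948734283447266    # a kenlm score, taken as log10
p = 10.0 ** log10_p              # back to a raw probability
ln_p = log10_p * math.log(10.0)  # the same score in natural log
```

So to compare against a model that reports natural-log probabilities, multiply kenlm's scores by ln(10) ≈ 2.3026.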
I'm using Ubuntu 12.04.
I first tried to compile with Boost 1.46, but it failed because -lboost_exception did not exist in /usr/lib.
Then I tried Boost 1.55 (in /usr/local), but lmplz_main always fails to compile while everything else compiles successfully. Both the source from GitHub and from http://kheafield.com/code/kenlm.tar.gz produce the same error.
gcc.compile.c++ /home/***/LM/kenlm/lm/builder/bin/gcc-4.6/release/link-static/threading-multi/lmplz_main.o
/home/***/LM/kenlm/lm/builder/lmplz_main.cc: In function ‘int main(int, char**)’:
/home/***/LM/kenlm/lm/builder/lmplz_main.cc:55:72: error: no matching function for call to ‘value(uint64_t*)’
/home/***/LM/kenlm/lm/builder/lmplz_main.cc:55:72: note: candidates are:
/usr/local/include/boost/program_options/detail/value_semantic.hpp:175:5: note: template<class T> boost::program_options::typed_value<T>* boost::program_options::value()
/usr/local/include/boost/program_options/detail/value_semantic.hpp:183:5: note: template<class T> boost::program_options::typed_value<T>* boost::program_options::value(T*)
"g++" -ftemplate-depth-128 -O3 -finline-functions -Wno-inline -Wall -pthread -DKENLM_MAX_ORDER=6 -DNDEBUG -I"." -I"util/double-conversion" -c -o "/home/***/LM/kenlm/lm/builder/bin/gcc-4.6/release/link-static/threading-multi/lmplz_main.o" "/home/***/LM/kenlm/lm/builder/lmplz_main.cc"
...failed gcc.compile.c++ /home/***/LM/kenlm/lm/builder/bin/gcc-4.6/release/link-static/threading-multi/lmplz_main.o...
So, what's wrong with my compilation? I'm new to this.
I have two files. One file works fine with kenlm, the other gives the following error:
jbg-hackintosh:qblearn jbg$ lmplz -o 2 -S 2G -T -kndiscount /tmp < bl > scratch/Literature/10393.comb.arpa
=== 1/5 Counting and sorting n-grams ===
Reading stdin
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Unigram tokens 2366 types 1168
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:14016 2:2147469568
/Users/jbg/repositories/kenlm/lm/builder/adjust_counts.cc:50 in void lm::builder::::StatCollector::CalculateDiscounts() threw BadDiscountException because `discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
ERROR: 2-gram discount out of range for adjusted count 3: -0.402645
Abort trap: 6
The only difference between the two files is that one ends with the sentence:
Lord Melbourne offered him a lordship, which he declined
I've also sent the full files to Kenneth via e-mail.
Part of the output is copied below. According to the home page, "Estimation and filtering require Boost at least 1.36.0 and zlib." I have Boost 1.46.1 and get the following linking error.
So basically my question is: which versions of Boost work?
...failed gcc.link util/bin/gcc-4.6/release/link-static/threading-multi/bit_packing_test...
...skipped <putil/bin/gcc-4.6/release/link-static/threading-multi>bit_packing_test.passed for lack of <putil/bin/gcc-4.6/release/link-static/threading-multi>bit_packing_test...
gcc.link util/bin/gcc-4.6/release/link-static/threading-multi/joint_sort_test
util/bin/gcc-4.6/release/link-static/threading-multi/joint_sort_test.o: In function `main':
joint_sort_test.cc:(.text.startup+0xb): undefined reference to `boost::unit_test::unit_test_main(bool (*)(), int, char**)'
collect2: ld returned 1 exit status
"g++" -o "util/bin/gcc-4.6/release/link-static/threading-multi/joint_sort_test" -Wl,--start-group "util/bin/gcc-4.6/release/link-static/threading-multi/joint_sort_test.o" "util/bin/gcc-4.6/release/link-static/threading-multi/parallel_read.o" "util/bin/gcc-4.6/release/link-static/threading-multi/read_compressed.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/diy-fp.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/fixed-dtoa.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/bignum.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/strtod.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/double-conversion.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/bignum-dtoa.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/fast-dtoa.o" "util/double-conversion/bin/gcc-4.6/release/link-static/threading-multi/cached-powers.o" "util/bin/gcc-4.6/release/link-static/threading-multi/bit_packing.o" "util/bin/gcc-4.6/release/link-static/threading-multi/ersatz_progress.o" "util/bin/gcc-4.6/release/link-static/threading-multi/exception.o" "util/bin/gcc-4.6/release/link-static/threading-multi/file.o" "util/bin/gcc-4.6/release/link-static/threading-multi/file_piece.o" "util/bin/gcc-4.6/release/link-static/threading-multi/mmap.o" "util/bin/gcc-4.6/release/link-static/threading-multi/murmur_hash.o" "util/bin/gcc-4.6/release/link-static/threading-multi/pool.o" "util/bin/gcc-4.6/release/link-static/threading-multi/scoped.o" "util/bin/gcc-4.6/release/link-static/threading-multi/string_piece.o" "util/bin/gcc-4.6/release/link-static/threading-multi/usage.o" -Wl,-Bstatic -lboost_system-mt -lboost_system-mt -lboost_unit_test_framework-mt -lboost_thread-mt -lz -Wl,-Bdynamic -lSegFault -lrt -Wl,--end-group -pthread
...failed gcc.link util/bin/gcc-4.6/release/link-static/threading-multi/joint_sort_test...
...skipped <putil/bin/gcc-4.6/release/link-static/threading-multi>joint_sort_test.passed for lack of <putil/bin/gcc-4.6/release/link-static/threading-multi>joint_sort_test...
...failed updating 12 targets...
...skipped 16 targets...
I rebuilt my kenlm with the max order set to 10 by passing
cmake .. -DKENLM_MAX_ORDER=10
during the build, and by updating the setup.py file:
ARGS = ['-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=10']
Now I'm able to use lmplz without error to build a 7-gram language model.
However, when trying to use the python interface, I still get the following error:
IOError: Cannot read model '../models/LM_7gram.klm' (lm/model.cc:49 in void lm::ngram::detail::(anonymous namespace)::CheckCounts(const std::vector<uint64_t> &) threw FormatLoadException because
counts.size() > 6'. This model has order 7 but KenLM was compiled to support up to 6. If your build system supports changing KENLM_MAX_ORDER, change it there and recompile. In the KenLM tarball or Moses, use e.g.
bjam --max-kenlm-order=6 -a'. Otherwise, edit lm/max_order.hh.)
File "lm/builder/output.cc" uses std::cerr, which results in a compilation error on Ubuntu 12.04. Simply adding "#include <iostream>" to the file solves the problem.
Is this known? Should I make a pull request?
This is the full log:
➜ bin/lmplz -o 5 -S 50% -T /tmp <~/data/enwiki-latest-pages-articles >text.arpa
=== 1/5 Counting and sorting n-grams ===
Reading /home/deeppixel/data/enwiki-latest-pages-articles
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 4027024634 types 8571832
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:102861984 2:1237697920 3:2320683776 4:3713093888 5:5414928896
Statistics:
1 8571831 D1=0.682343 D2=1.02373 D3+=1.37025
2 208792530 D1=0.747714 D2=1.07416 D3+=1.35152
3 871078563 D1=0.826502 D2=1.1214 D3+=1.3274
4 1692737525 D1=0.88864 D2=1.18124 D3+=1.33282
5 2308548475 D1=0.874941 D2=1.29421 D3+=1.3912
Memory estimate for binary LM:
type GB
probing 100 assuming -p 1.5
probing 116 assuming -r models -p 1.5
trie 53 without quantization
trie 31 assuming -q 8 -b 8 quantization
trie 46 assuming -a 22 array pointer compression
trie 24 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:102861972 2:1329358848 3:2492548096 4:3988076544 5:5815945216
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:102861972 2:910337664 3:1706883200 4:2731013120 5:3982727424
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
**********************************************************************---Last input should have been poison.
[1] 6802 abort bin/lmplz -o 5 -S 50% -T /tmp < ~/data/enwiki-latest-pages-articles >
clang -fno-strict-aliasing -fno-common -dynamic -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c util/file.cc -o build/temp.macosx-10.11-x86_64-2.7/util/file.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -DHAVE_ZLIB -DHAVE_BZLIB -DHAVE_XZLIB
util/file.cc:32:10: fatal error: 'features.h' file not found
#include <features.h>
^
1 error generated.
error: command 'clang' failed with exit status 1
Hi 👋 ,
it would be nice to be able to train language models on existing count files that contain n-gram counts, similar to the -read parameter of ngram-count from SRILM. The ability to load only counts enables the use of essentially unlimited n-gram statistics, such as skip-ngrams.
I have a question about KenLM. Assume I have a trained 3-gram language model, and I want to get the probabilities of all words in the vocabulary given a two-word sequence.
Say I have the two-word sequence "A B". I want to get:
P(A|A B) P(B|A B) P(C|A B) P(D|A B) P(E|A B)
and so on.
Does the C++ or Python interface provide this? Thanks a lot.
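As far as I know there is no single call for this, but with the Python bindings you can loop over a word list yourself, scoring each candidate with Model.BaseScore from a kenlm.State that encodes the context (the bindings do not expose vocabulary iteration, so the word list must come from elsewhere). The lookup being repeated is just the backoff rule; here is a pure-Python sketch with invented toy values, using a bigram context for brevity:

```python
# Toy bigram tables with unigram backoff; all log10 values are invented.
UNIGRAM = {'A': -0.6, 'B': -0.6, 'C': -0.6}  # log10 P(w)
BIGRAM = {('B', 'A'): -0.3}                  # log10 P(A | B)
BACKOFF = {'B': -0.05}                       # log10 backoff weight b(B)

def next_word_log10(prev, word):
    """log10 P(word | prev): use the explicit bigram if listed,
    otherwise back off to the unigram with the history's penalty."""
    if (prev, word) in BIGRAM:
        return BIGRAM[(prev, word)]
    return BACKOFF.get(prev, 0.0) + UNIGRAM[word]

# The distribution over the whole vocabulary given the context "B":
dist = {w: next_word_log10('B', w) for w in UNIGRAM}
```

For a longer context like "A B", the same loop applies; only the lookup recurses through more history lengths.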
It would be nice if the various executables supported --help and -help flags.
Hi,
I can't get KenLM working on my corpus.
I've followed the usual steps:
./bin/lmplz -T /tmp/ --text corpus.txt --arpa myarpa.arpa
./bin/build_binary myarpa.arpa my_probing_model.mmap
Then I tried the snippet from here:
https://kheafield.com/code/kenlm/developers/
With a TrieModel, it always ends with a segfault, regardless of MAX_ORDER. The error occurs here:
lm::ngram::trie::TrieSearch<lm::ngram::DontQuantize, lm::ngram::trie::DontBhiksha>::SetupMemory(unsigned char*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) ()
With a ProbingModel, I get a segfault only for MAX_ORDER < 5:
lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::ResumeScore(unsigned int const*, unsigned int const*, unsigned char, unsigned long&, float*, unsigned char&, lm::FullScoreReturn&)
For MAX_ORDER = 5, the C++ program runs, but with a couple of Valgrind errors:
==3445== Invalid write of size 8
==3445== at 0x411B1A: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::GenericModel(char const*, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445== by 0x409920: lm::ngram::ProbingModel::ProbingModel(char const*, lm::ngram::Config const&) (model.hh:136)
Invalid write of size 8
==3445== at 0x43A06B: lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>::SetupMemory(unsigned char*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445== by 0x411515: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::SetupMemory(void*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445== by 0x411FC0: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::GenericModel(char const*, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
But a JNA wrapper around the same snippet raises a "malloc(): memory corruption" error when loading the model.
I tried with and without pruning, with order 2 and 3, both with the KenLM from the download section and the one from GitHub. The corpus is about 1 GB.
One peculiarity of the vocabulary is that it contains a lot of words that are substrings of other words in the vocabulary.
I'm aware that this is probably not enough information for proper debugging, but I would be interested to know whether the Valgrind errors are OK, and whether you can suggest anything that would help me find the problem.
My system is Mint 17. The compilation succeeded with no warnings.
I'm getting this error while trying to compile kenlm with make -j 4; please help:
undefined reference to `boost::unit_test::ut_detail::normalize_test_case_name
Hi,
I'm running KenLM on the LM1B data (the 1 Billion Word Language Modeling benchmark), and for some weird reason perplexity goes down for an extremely small model:
unigram tokens | unigram types | with OOV | exclude OOV
38023755 | 337972 | 156.5 | 148.27
3807417 | 107563 | 247.5 | 215.76
380918 | 32879 | 398.2 | 283.43
37438 | 8406 | 522.2 | 253.29
3728 | 1640 | 392.2 | 118.1
As you can see, when the unigram token count drops very low (the smallest model), the perplexity magically drops to 392.2.
How does KenLM calculate perplexity including OOV and excluding OOV?
Hello! I'm new to open source and would like to help. :)
I checked the kenlm project with Cppcheck, a static analysis tool for C/C++ code.
All errors are in "jam-files/engine". Can I fix these errors via a pull request?
Or is that code not used?
[jam-files/engine/compile.c:69]: (error) Buffer is accessed out of bounds.
[jam-files/engine/hcache.c:146]: (error) Common realloc mistake: 'buf' nulled but not freed upon failure
[jam-files/engine/lists.c:104]: (error) Pointer to local array variable returned.
[jam-files/engine/lists.c:135]: (error) Pointer to local array variable returned.
[jam-files/engine/lists.c:35]: (error) Allocation with malloc, return doesnt release it.
[jam-files/engine/make1.c:121]: (error) Allocation with malloc, return doesnt release it.
[jam-files/engine/mkjambase.c:73]: (error) Resource leak: fout
[jam-files/engine/modules/order.c:85]: (error) Memory leak: colors
[jam-files/engine/object.c:262]: (error) Memory leak: m
[jam-files/engine/regexp.c:255]: (error) Memory leak: r
[jam-files/engine/regexp.c:520]: (error) Uninitialized variable: classend
[jam-files/engine/regexp.c:521]: (error) Uninitialized variable: classr
[jam-files/engine/rules.c:552]: (error) Buffer is accessed out of bounds.
[jam-files/engine/yyacc.c:166]: (error) Memory leak: key.string
[jam-files/engine/yyacc.c:195]: (error) Resource leak: grammar_source_f
full list: http://pastebin.com/0AjCPcD