chokkan / crfsuite
CRFsuite: a fast implementation of Conditional Random Fields (CRFs)
Home Page: http://www.chokkan.org/software/crfsuite/
License: Other
CRFsuite
Version 0.12
http://www.chokkan.org/software/crfsuite/

* INTRODUCTION
CRFSuite is an implementation of Conditional Random Fields (CRFs) for labeling sequential data. Please refer to the web site for more information about this software.

* COPYRIGHT AND LICENSING INFORMATION
This program is distributed under the modified BSD license. Refer to COPYING file for the precise description of the license.

Portions of this software are based on libLBFGS.

The MIT License

Copyright (c) 1990 Jorge Nocedal
Copyright (c) 2007 Naoaki Okazaki

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Portions of this software are based on Constant Quark Database (CQDB).

The BSD license.

Copyright (c) 2007, Naoaki Okazaki
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of the Northwestern University, University of Tokyo, nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Portions of this software are based on RumAVL.

MIT/X Consortium License.

Copyright (c) 2005-2007 Jesse Long <[email protected]>
All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
1. The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
2. The origin of the Software must not be misrepresented; you must not claim that you wrote the original Software.
3. Altered source versions of the Software must be plainly marked as such, and must not be misrepresented as being the original Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Portions of this software are based on a portable stdint.h (for MSVC).

Copyright (c) 2005-2007 Paul Hsieh

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

Redistributions in binary form must not misrepresent the orignal source in the documentation and/or other materials provided with the distribution.

The names of the authors nor its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Portions of this software are based on Mersenne Twister.

Copyright (C) 1997 - 2002, Makoto Matsumoto and Takuji Nishimura,
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. The names of its contributors may not be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

* SPECIAL THANKS GOES TO...
Olivier Grisel
Andreas Holzbach
Baoli Li
Yoshimasa Tsuruoka
Hiroshi Manabe
Riza Theresa B. Batista-Navarro
After training with
crfsuite learn -m review.model train.data
I got a model, which I then tested with
crfsuite tag -r -m review.model test.data > result
but the result contains many (null) labels:
cat -n result | grep 'null'
33 (null) N
35 (null) N
37 (null) N
46 (null) N
50 (null) N
59 (null) N
68 (null) N
77 (null) N
97 (null) N
118 (null) N
134 (null) N
139 (null) N
196 (null) N
205 (null) N
263 (null) N
273 (null) N
287 (null) N
288 (null) N
324 (null) N
330 (null) N
342 (null) N
343 (null) N
347 (null) N
349 (null) N
366 (null) N
However, the corresponding lines in test.data are all tagged:
cat -n result | grep 'null' | awk '{print $1}' | xargs -i{} sed -n '{}p' test.data | awk '{print $1}'
SEV-KY-95
LJ-119
LJ-119
SEV-PY-50
CPU-CM-146
HX-116
LOG-HGL-137
HX-116
HH-94
TH-18
LOG-HK-92
CAM-HX-116
KY-95
GY-149
NET-HX-151
SYS-BL-30
NET-WY-97
NET-WY-97
NET-BK-100
SCN-KY-95
BAT-BNY-168
BAT-RY-53
KY-95
BAT-HX-151
KY-95
If the training file is in DOS format, CRFsuite treats an empty line "\r\n" as a sample with the label "\r" and all features zero. You can see this when you tag the file with -l to output all marginals.
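A minimal sketch of one way to work around this, assuming the fix is simply to convert the data files to Unix line endings before training and tagging. The names dos_to_unix and convert_file are illustrative helpers, not part of CRFsuite.

```python
def dos_to_unix(data: bytes) -> bytes:
    """Convert DOS (CRLF) line endings to Unix (LF) line endings."""
    return data.replace(b"\r\n", b"\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Write a copy of src_path to dst_path with LF line endings.

    Hypothetical helper: the paths are whatever data files you
    pass to `crfsuite learn` / `crfsuite tag`.
    """
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(dos_to_unix(data))
```

For example, convert_file("train.data", "train.unix.data") would produce a copy in which the blank separator lines are truly empty.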
Hi,
I use CRFSuite extensively to build various models. Until now all my models used only L2 regularization. I discovered that with L1 regularization I get virtually the same performance with a much more compact model.
So I started to use L1+L2 parameters during training. My question is: how is it implemented? Is the objective function penalized with those two terms at the same time or is it sequential (first L1, then re-training with L2)?
From the documentation it is clear that L-BFGS is used whenever L1=0. If L1 is positive, the OWL-QN solver is switched on automatically. From the literature I know that OWL-QN optimizes with an L1 regularizer only; there is no L2 regularization term. So how is it implemented in CRFSuite?
Thanks
I am very happy to see crfsuite being improved since very recently! Thanks a lot, chokkan! I don't know the policy of version numbers but wouldn't it be great to increment the version number to reflect the relatively large number of pull requests being integrated?
Thanks a lot for maintaining the software!
myhost:python gaowei$ python setup.py build_ext
running build_ext
building '_crfsuite' extension
g++ -fno-strict-aliasing -I/Users/gaowei/anaconda/include -arch x86_64 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/gaowei/anaconda/include/python2.7 -c crfsuite.cpp -o build/temp.macosx-10.5-x86_64-2.7/crfsuite.o
g++ -fno-strict-aliasing -I/Users/gaowei/anaconda/include -arch x86_64 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/gaowei/anaconda/include/python2.7 -c export_wrap.cpp -o build/temp.macosx-10.5-x86_64-2.7/export_wrap.o
export_wrap.cpp:5182:25: error: redefinition of 'swig::traits<std::vector<CRFSuite::Attribute, std::allocator<CRFSuite::Attribute> > >'
template <> struct traits<CRFSuite::Item > {
^~~~~~~~~~~~~~~~~~~~~~~
export_wrap.cpp:5051:22: note: previous definition is here
template <> struct traits<std::vector<CRFSuite::Attribute, std::allocator< CRFSuite::Attribute > > > {
^
1 error generated.
error: command 'g++' failed with exit status 1
myhost:python gaowei$
According to one of my colleagues, CRFSuite seems to fail to compile when an SSE intrinsic (_mm_castsi128_pd) is used.
The GCC version is 3.4.6.
Hi,
I am experimenting with integrating CRFSuite into a 64-bit C++11 Windows application compiled with Visual Studio 2013 and have encountered an unexpected problem. I have successfully built the 64-bit (release and debug) cqdl.lib and crf.lib static libraries.
The program loads a prebuilt model that was constructed and tested using the front-end application (crfsuite.exe).
The following is the sequence of actions my program takes:
(1) construct a tagger:
CRFSuite::Tagger tagger;
(2) I successfully load the pre-constructed model:
tagger.open("path to model");
(3) Create a new ItemSequence that will be used to tag unknown data:
CRFSuite::ItemSequence xseq;
(4)In a loop I create items, populate attributes and add the item to the sequence
CRFSuite::Item item;
for (auto t : terms){
item.push_back(CRFSuite::Attribute(t));
}
xseq.push_back(item);
(5) give the sequence to the tagger
tagger.set(xseq);
BAM!!!! I get the following runtime check error:
"Run-Time Check Failure #2 - Stack around the variable '_inst' was corrupted."
My assumption is that I am doing something incorrectly.
Any insight into this would be greatly appreciated.
CRFSuite provides a good pipeline for NER training and recognition using CRFs. I wanted to confirm the training procedure. From what I observed, word embeddings alone do not give good accuracy. However, adding them on top of baseline features such as contextual tokens, POS, isupper, isdigit, istitle, etc. gives good accuracy. Is there anything I am missing?
@chokkan : I faced the following error while I tried "Build the binding" step ($ python setup.py build_ext):
running build_ext
building '_crfsuite' extension
C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\BIN\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG "-IC:\Program Files (x86)\Python33\include" "-IC:\ProgramFiles (x86)\Python33\include" /Tpcrfsuite.cpp /Fobuild\temp.win32-3.3\Release\crfsuite.obj
crfsuite.cpp
crfsuite.cpp(1) : error C2059: syntax error : '<'
error: command '"C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\BIN\cl.exe"' failed with exit status 2
any idea?
Thanks,
Mahnoosh
In crfsuite.hpp, none of the member function definitions, such as Trainer::Trainer, are inline. So when multiple source files include crfsuite.hpp, the build reports duplicate symbols for these member functions.
Please add "inline" to all the member functions defined in the header, like this:
@@ -44,6 +44,7 @@
namespace CRFSuite
{
+inline
Trainer::Trainer()
{
data = new crfsuite_data_t;
Thanks.
I cannot do something like
crfsuite learn -m some.model < train.txt
However it works for tagging
crfsuite tag -m some.model < test.txt
Weird.
I get that error when importing, and I also get a "cannot find file" error for libcrfsuite.so.
Nowadays many people use CRFs to segment images. Can I use CRFsuite to do this?
All the examples process text with CRFsuite. Could you give me an example that processes images with CRFsuite?
Thank you very much!
Hello, I know that an HMM can predict the label of an observed sequence, for example the word for a speech signal. Can a CRF do the same thing? The steps would be:
1. First, train different CRF models on different sets of samples.
2. Feed the sequence to be predicted into each model and get its probability under each model.
3. Find the model with the maximum probability and label the sequence with that model's label.
Thanks a lot!
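The three steps above can be sketched as follows. This is only an outline under stated assumptions: each trained model is represented by a hypothetical scoring callable that returns a (log-)probability for a sequence, and classify_by_best_model is an illustrative name, not a CRFsuite function. Note also that a CRF models the conditional probability of labels given the observations, so whether scores from separately trained CRFs are comparable in the way HMM likelihoods are is an assumption of this sketch.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical type: a scorer maps an observation sequence to a score.
Scorer = Callable[[Sequence[str]], float]


def classify_by_best_model(
    sequence: Sequence[str],
    scorers: Dict[str, Scorer],
) -> Tuple[str, float]:
    """Return (class label, score) of the model that scores the sequence highest."""
    if not scorers:
        raise ValueError("at least one model is required")
    # Step 2: score the sequence under every model.
    scores = {label: scorer(sequence) for label, scorer in scorers.items()}
    # Step 3: pick the label of the best-scoring model.
    best_label = max(scores, key=scores.get)
    return best_label, scores[best_label]
```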
CRFsuite uses a single tagger instance per model, which is created once and returned from get_tagger(). As such, it is not safe to tag data from multiple threads at once, because the tagging process mutates the tagger.
It would be a simple enough change to create a new tagger instance in get_tagger() every time. Would that be enough to ensure thread safety?
Perhaps I missed it in the documentation, but it is not clear to me how duplicate features are treated in CRFSuite. Suppose my data looks like this:
...
label0 A C D
label1 A B C D A
label2 B C C D C
label3 B C D E
...
Here the 2nd and 3rd examples have duplicate features: the 2nd has feature A twice, while the 3rd has feature C three times.
I would really hope that duplicated features get their weight incremented, so that my example is equivalent to:
...
label0 A C D
label1 A:2 B C D
label2 B C:3 D
label3 B C D E
...
Thanks
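Whatever CRFsuite does with repeated attributes, the data can be rewritten into the explicit name:value form shown above before training, which removes the ambiguity. A minimal sketch, assuming tab-separated fields, attributes that carry no explicit value yet, and attribute names without ':'; merge_duplicate_attributes is an illustrative name.

```python
from collections import Counter


def merge_duplicate_attributes(line: str, sep: str = "\t") -> str:
    """Collapse repeated attributes on one item line into name:count form.

    Assumes the first field is the label and the remaining fields are
    plain attribute names (no ':' and no explicit values).
    """
    label, *attrs = line.split(sep)
    counts = Counter(attrs)  # preserves first-seen order on Python 3.7+
    merged = [name if n == 1 else f"{name}:{n}" for name, n in counts.items()]
    return sep.join([label] + merged)
```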
Problem: tagger sometimes generates sequence that is not possible according to transition matrix.
Preconditions: sparse transition matrix.
Repro: model file and sample sequence available upon request.
Patch: https://gist.github.com/pgmmpk/6193513
Please, review
Mike
I want to add features like "isCapital", "isNumber" etc. Could you please tell me how can I do that?
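Since the CRFsuite data format is one item per line, with the label followed by tab-separated attributes, features like these are usually generated by a small script before training. A minimal sketch, assuming one token per item; the attribute names w=, isCapital and isNumber and both function names are illustrative choices, not anything CRFsuite defines.

```python
def token_attributes(token: str) -> list:
    """Build attribute strings for one token (illustrative feature set)."""
    # Note: a token containing ':' would need escaping or replacing first.
    attrs = ["w=" + token.lower()]
    if token[:1].isupper():
        attrs.append("isCapital")  # first character is an uppercase letter
    if token.isdigit():
        attrs.append("isNumber")   # token consists only of digits
    return attrs


def to_crfsuite_line(label: str, token: str) -> str:
    """Format one item as 'label<TAB>attr1<TAB>attr2...'."""
    return "\t".join([label] + token_attributes(token))
```

Writing one such line per token, with an empty line between sequences, gives a file that can be passed to crfsuite learn.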
I use the crfsuite tagger in a simple test program:
CRFSuite::Tagger* newTagger = new CRFSuite::Tagger();
newTagger->open("test_model.model");
...
tagVec = newTagger->tag(items);
delete newTagger;
Running the program under valgrind reports an error such as:
==11280== 7,519,904 bytes in 1 blocks are definitely lost in loss record 2 of 2
==11280== at 0x4C2DBB6: malloc (vg_replace_malloc.c:299)
==11280== by 0x42041A: crf1dm_new (in /home/artem/projects/test_crfsuite/dist/Debug/GNU-Linux/test_crfsuite)
==11280== by 0x415B8F: crf1m_create_instance_from_file (in /home/artem/projects/test_crfsuite/dist/Debug/GNU-Linux/test_crfsuite)
==11280== by 0x403551: CRFSuite::Tagger::open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (crfsuite.hpp:289)
==11280== by 0x403D1A: main (main.cpp:26)
==11280==
==11280== LEAK SUMMARY:
==11280== definitely lost: 7,519,904 bytes in 1 blocks
==11280== indirectly lost: 0 bytes in 0 blocks
==11280== possibly lost: 0 bytes in 0 blocks
==11280== still reachable: 72,704 bytes in 1 blocks
==11280== suppressed: 0 bytes in 0 blocks
==11280==
==11280== For counts of detected and suppressed errors, rerun with: -v
==11280== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
"Definitely lost" here is equal to the .model file size.
The place that valgrind points to:
...
buffer = buffer_orig = (uint8_t*)malloc(size + 16);
if (buffer_orig = NULL) {
goto error_exit;
}
...
return crf1dm_new_impl(buffer_orig, buffer, size);
where:
static crf1dm_t* crf1dm_new_impl(uint8_t* buffer_orig, const uint8_t* buffer, uint32_t size)
{
...
model->buffer_orig = buffer_orig;
Then in the "destructor" the memory is never freed:
void crf1dc_delete(crf1d_context_t* ctx)
{
...
if (model->buffer_orig != NULL) {
free(model->buffer_orig);
model->buffer_orig = NULL;
}
model->buffer = NULL;
...
After my fix #74, valgrind shows that the memory no longer leaks:
==27439== LEAK SUMMARY:
==27439== definitely lost: 0 bytes in 0 blocks
==27439== indirectly lost: 0 bytes in 0 blocks
==27439== possibly lost: 0 bytes in 0 blocks
==27439== still reachable: 72,704 bytes in 1 blocks
==27439== suppressed: 0 bytes in 0 blocks
==27439==
==27439== For counts of detected and suppressed errors, rerun with: -v
==27439== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Sometimes I use CRFSuite for document classification. All the features of a document are simply put on a single line whose first token is the label, as defined by the format.
In the classic logistic regression setup one fits the model by finding the parameters: theta (number of features x number of classes) and a bias term. CRFSuite gives the former matrix of coefficients but no bias term. Is a bias term necessary for classification?
After all, according to some seminal papers on sequence analysis, a CRF is just a generalization of logistic regression to sequences.
Thanks
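One possible workaround, sketched below under the assumption that a constant attribute present on every item can play the role of an intercept (the model can then learn one weight per label for it). The helper add_bias and the attribute name "bias" are illustrative choices, not something CRFsuite provides.

```python
def add_bias(line: str, bias_name: str = "bias", sep: str = "\t") -> str:
    """Append a constant attribute to one item line of a CRFsuite data file.

    Empty lines are sequence separators and are returned unchanged.
    """
    if not line:
        return line
    return line + sep + bias_name
```

Applied to every line of both the training and the test data, this gives each document the same always-on attribute.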
I am new to CRFs and am still learning the basic concepts and probabilistic graphical models (PGM). However, I urgently need to become familiar with the relevant crfsuite code so that I can start implementing satellite image classification as soon as possible. Can anyone advise me on the way forward?
Is it possible to return the n most likely predictions using CRF? If so, which part of the source code should be modified to get this behavior? I could not find any parameter that provides it.
Thank you.
g++ was failing during the phase:
python setup.py build_ext
with the errors:
Undefined symbols for architecture i386:
"_PyArg_ParseTuple", referenced from:
__wrap_version in export_wrap.o
.
.
There were many undefined symbols during the linking phase. By adding the path to the python libraries I was able to compile. The command: python-config --libs will tell you the exact commands you need to add to g++.
I had undefined symbols issues when trying to build the python wrapper (using the "modified" export_wrap.cpp as suggested in README).
The undefined symbols error can be seen there : https://gist.github.com/3248610
It was issued while running python setup.py build_ext
by the command
g++ -arch i386 -arch x86_64 build/temp.macosx-10.7-intel-2.7/crfsuite.o build/temp.macosx-10.7-intel-2.7/export_wrap.o -L/usr/local/lib -lcrfsuite -o build/lib.macosx-10.7-intel-2.7/_crfsuite.so -shared
After some research I managed to resolve the issue by passing the option -undefined dynamic_lookup to the linker (found in man ld).
So finally, this worked:
python setup.py build_ext
g++ -arch i386 -arch x86_64 build/temp.macosx-10.7-intel-2.7/crfsuite.o build/temp.macosx-10.7-intel-2.7/export_wrap.o -L/usr/local/lib -lcrfsuite -o build/lib.macosx-10.7-intel-2.7/_crfsuite.so -shared -Wl,-undefined,dynamic_lookup
python setup.py build_ext
python setup.py install
(the option can be passed to the linker with -Wl,-undefined,dynamic_lookup)
It may save someone else a few hours!
(I am not sure how to change the code to prevent this error. If one gives me some clues I can propose a pull request).
Where can I find a C++ demo for crfsuite?
If my data looks like the following, crfsuite can't be used, right?
label-1 , vector
label-2, vector
...
The Wapiti CRF toolkit has a neat feature called N-best Viterbi output, which returns the n-best label sequences for an input sequence. Is there similar functionality in crfsuite?
Thanks for your hints!
The documentation for crfsuite_create_instance currently states that 0 is returned upon success (like the rest of the API), but the implementation returns 1 upon success, because it uses == 0 to perform conditional execution via short-circuit evaluation of the first half of the expression.
From the current latest revision (a6f144b) of lib/crf/src/crfsuite.c:
int crfsuite_create_instance(const char *iid, void **ptr)
{
int ret =
crf1de_create_instance(iid, ptr) == 0 ||
crfsuite_dictionary_create_instance(iid, ptr) == 0;
return ret;
}
Suggested correction:
int crfsuite_create_instance(const char *iid, void **ptr)
{
int ret = crf1de_create_instance(iid, ptr);
if (ret != 0)
ret = crfsuite_dictionary_create_instance(iid, ptr);
return ret;
}
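The difference between the two versions can be modelled in a few lines of Python. This is only a model of the return-value logic in the C snippets above, with the two factory calls replaced by hypothetical callables that return 0 on success and nonzero on failure.

```python
def create_instance_current(first, second) -> int:
    """Model of `first() == 0 || second() == 0`.

    The boolean result is 1 when either call succeeds and 0 when both
    fail, i.e. the opposite of the documented convention.
    """
    return int(first() == 0 or second() == 0)


def create_instance_suggested(first, second) -> int:
    """Model of the suggested correction: 0 on success, nonzero on failure."""
    ret = first()
    if ret != 0:
        ret = second()  # only tried when the first factory fails
    return ret
```

In both versions the second factory is only called when the first one fails; only the value handed back to the caller differs.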
I need to train a crf model using a large dataset, which doesn't fit into the memory all at once. Is there a way to train the model in a batch mode?
Hi all,
I am trying to load a model from Python and it yields the following error :
>>> import crfsuite
>>> t = crfsuite.Tagger()
>>> t.open("/.../my.crf.model")
Assertion failed: (false), function crf1dm_initialize_header, file /SourceCache/CRFSuite/CRFSuite-33/crfsuite/lib/crf/src/crf1d_model.c, line 990.
Abort trap: 6
The strange thing is that the same model works perfectly when loaded from the CLI :
crfsuite tag -m /.../my.crf.model test.txt
I have tried to reinstall swig, but it does not change anything. Any clue of what I should try next ?
Thanks in advance and Happy New Year !
Hi,
If the ':' character is used as a label, it is swallowed when I print the predicted labels.
I attached two links for a dummy-set of train/test data below.
I trained my model with the command "crfsuite learn -m out.model miniTrain.txt".
I printed the predictions by calling "crfsuite tag -m out.model miniTest.txt".
The command-line output contains empty lines where ':' is used as the label (red boxes).
https://dl.dropboxusercontent.com/u/2953290/miniTrain.txt
https://dl.dropboxusercontent.com/u/2953290/miniTest.txt
I am working on OSX 64 bit
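One possible workaround, assuming the ':' is being interpreted as the name/value separator of the data format: replace it with a placeholder before writing the data files and restore it after tagging. The placeholder string and both function names are arbitrary choices for this sketch.

```python
# Arbitrary placeholder; it must not occur in the real labels or attributes.
PLACEHOLDER = "__COLON__"


def protect_colons(field: str) -> str:
    """Replace ':' in a label or attribute name before writing the data file."""
    return field.replace(":", PLACEHOLDER)


def restore_colons(field: str) -> str:
    """Undo protect_colons on a label read back from the tagger output."""
    return field.replace(PLACEHOLDER, ":")
```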
Is there a way to restrict possible set of tags for each item?
For example, I want to do Morphological Disambiguation, so for each word there is a small set of possible tags (from dictionary), as opposed to all possible tags for all words.
@chokkan: It would be great if you could provide build instructions for the SWIG wrapper.
I changed the include paths in swig/python/setup.py and swig/python/prepare.sh and ran prepare.sh, but when I run setup.py, building the extension module fails with the error:
export_wrap.cpp:4727: error: redefinition of ‘struct swig::traits<std::vector<CRFSuite::Attribute, std::allocator<CRFSuite::Attribute> > >’
export_wrap.cpp:4635: error: previous definition of ‘struct swig::traits<std::vector<CRFSuite::Attribute, std::allocator<CRFSuite::Attribute> > >’
thanks,
Peter
I've installed crfsuite on Windows 7 64-bit.
I have a Farsi text like this:
دولتی ADJ-SIM-GEN O
عربی N-PR-SING O
استقامت N-COM-SING O
و CON O
ایستادگی N-COM،SING-GEN O
افزونتر ADJ-CMPR O
از P O
The encoding is windows-1256. When I run the command:
python chunking.py < train.txt > train.crfsuite.txt
I get an error: too few fields (1) for ['w','pos','y'].
What should I do?
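"Too few fields (1)" suggests that each line was read as a single field, which would happen if the columns are separated by something other than the separator the script splits on. A minimal sketch of a preprocessing step, assuming that is the cause and that the script expects single spaces: decode the windows-1256 bytes and normalize the whitespace between columns. normalize_line is an illustrative name.

```python
def normalize_line(raw: bytes, encoding: str = "cp1256") -> str:
    """Decode one line and separate its columns with single spaces.

    Assumption: columns are separated by arbitrary whitespace (tabs or
    several spaces) and contain no internal whitespace themselves.
    """
    text = raw.decode(encoding)
    return " ".join(text.split())
```

The normalized lines can then be written out (for example as UTF-8) and fed to chunking.py instead of the original file.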
Training input:
a x
b y
Output of running crfsuite learn
CRFSuite 0.12 Copyright (c) 2007-2011 Naoaki Okazaki
Start time of the training: 2011-11-16T01:05:15Z
Reading the data set(s)
[1] a
0....1....2....3....4....5....6....7....8....9....10
Number of instances: 1
Seconds required: 0.000
Statistics the data set(s)
Number of data sets (groups): 1
Number of instances: 1
Number of items: 2
Number of attributes: 2
Number of labels: 2
Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 3
Seconds required: 0.000
L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 2147483647
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20
***** Iteration #1 *****
Loss: 1.166644
Feature norm: 0.423138
Error norm: 0.014465
Active features: 3
Line search trials: 2
Line search step: 0.410505
Seconds required for this iteration: 0.000
***** Iteration #2 *****
Loss: 1.166594
Feature norm: 0.600181
Error norm: 0.002310
Active features: 3
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.000
***** Iteration #3 *****
Loss: 1.166593
Feature norm: 0.600198
Error norm: 0.000006
Active features: 3
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.000
*** glibc detected *** crfsuite: free(): invalid pointer: 0x00000000098b4e30 ***
======= Backtrace: =========
/lib64/libc.so.6[0x336fa7245f]
/lib64/libc.so.6(cfree+0x4b)[0x336fa728bb]
/chomes/wuke/.world/liblbfgs-1.10/lib/liblbfgs-1.10.so(lbfgs+0x1483)[0x2b594b9c8003]
/chomes/wuke/.world/crfsuite-0.12/lib/libcrfsuite-0.12.so(crfsuite_train_lbfgs+0x303)[0x2b594b7b8ab3]
/chomes/wuke/.world/crfsuite-0.12/lib/libcrfsuite-0.12.so[0x2b594b7c004d]
crfsuite[0x403080]
crfsuite[0x40480b]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x336fa1d994]
crfsuite[0x401629]
======= Memory map: ========
00400000-00407000 r-xp 00000000 00:2c 9494537 /chomes/wuke/.world/crfsuite-0.12/bin/crfsuite
00606000-00607000 rw-p 00006000 00:2c 9494537 /chomes/wuke/.world/crfsuite-0.12/bin/crfsuite
098a1000-098c2000 rw-p 098a1000 00:00 0 [heap]
336f600000-336f61c000 r-xp 00000000 fd:00 1011849 /lib64/ld-2.5.so
336f81b000-336f81c000 r--p 0001b000 fd:00 1011849 /lib64/ld-2.5.so
336f81c000-336f81d000 rw-p 0001c000 fd:00 1011849 /lib64/ld-2.5.so
336fa00000-336fb4e000 r-xp 00000000 fd:00 261766 /lib64/libc-2.5.so
336fb4e000-336fd4e000 ---p 0014e000 fd:00 261766 /lib64/libc-2.5.so
336fd4e000-336fd52000 r--p 0014e000 fd:00 261766 /lib64/libc-2.5.so
336fd52000-336fd53000 rw-p 00152000 fd:00 261766 /lib64/libc-2.5.so
336fd53000-336fd58000 rw-p 336fd53000 00:00 0
3370200000-3370282000 r-xp 00000000 fd:00 261846 /lib64/libm-2.5.so
3370282000-3370481000 ---p 00082000 fd:00 261846 /lib64/libm-2.5.so
3370481000-3370482000 r--p 00081000 fd:00 261846 /lib64/libm-2.5.so
3370482000-3370483000 rw-p 00082000 fd:00 261846 /lib64/libm-2.5.so
3374200000-337420d000 r-xp 00000000 fd:00 261848 /lib64/libgcc_s-4.1.2-20080825.so.1
337420d000-337440d000 ---p 0000d000 fd:00 261848 /lib64/libgcc_s-4.1.2-20080825.so.1
337440d000-337440e000 rw-p 0000d000 fd:00 261848 /lib64/libgcc_s-4.1.2-20080825.so.1
2b594b7ae000-2b594b7b0000 rw-p 2b594b7ae000 00:00 0
2b594b7b0000-2b594b7c5000 r-xp 00000000 00:2c 9494832 /chomes/wuke/.world/crfsuite-0.12/lib/libcrfsuite-0.12.so
2b594b7c5000-2b594b9c4000 ---p 00015000 00:2c 9494832 /chomes/wuke/.world/crfsuite-0.12/lib/libcrfsuite-0.12.so
2b594b9c4000-2b594b9c5000 rw-p 00014000 00:2c 9494832 /chomes/wuke/.world/crfsuite-0.12/lib/libcrfsuite-0.12.so
2b594b9c5000-2b594b9c9000 r-xp 00000000 00:2c 16683972 /chomes/wuke/.world/liblbfgs-1.10/lib/liblbfgs-1.10.so
2b594b9c9000-2b594bbc8000 ---p 00004000 00:2c 16683972 /chomes/wuke/.world/liblbfgs-1.10/lib/liblbfgs-1.10.so
2b594bbc8000-2b594bbc9000 rw-p 00003000 00:2c 16683972 /chomes/wuke/.world/liblbfgs-1.10/lib/liblbfgs-1.10.so
2b594bbc9000-2b594bbca000 rw-p 2b594bbc9000 00:00 0
2b594bc14000-2b594bc15000 rw-p 2b594bc14000 00:00 0
2b594bc15000-2b594bc18000 r-xp 00000000 00:2c 9494889 /chomes/wuke/.world/crfsuite-0.12/lib/libcqdb-0.12.so
2b594bc18000-2b594be17000 ---p 00003000 00:2c 9494889 /chomes/wuke/.world/crfsuite-0.12/lib/libcqdb-0.12.so
2b594be17000-2b594be18000 rw-p 00002000 00:2c 9494889 /chomes/wuke/.world/crfsuite-0.12/lib/libcqdb-0.12.so
2b594be18000-2b594be1a000 rw-p 2b594be18000 00:00 0
7fffb3b20000-7fffb3b35000 rw-p 7ffffffe9000 00:00 0 [stack]
7fffb3bfd000-7fffb3c00000 r-xp 7fffb3bfd000 00:00 0 [vdso]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vsyscall]
Aborted
There don't seem to be any unit tests in the code - it would be great to have some tests in this and the other C++ dependencies and then add (many more) unit tests!
Python extensions must be built with the same VS as Python itself; it is VC 9.0 for Python 2.7, VC 10.0 for Python 3.4 and VC 2015 for Python 3.5+. But after a5221da and #44 crfsuite can be built only with msvc 2015. This is a problem for https://github.com/tpeng/python-crfsuite - it bundles crfsuite and builds a Cython extension for it, but after recent changes this no longer works because crfsuite can't be built with earlier msvc versions now.
Hi,
I used ./bin/crfsuite tag -m model test.txt > test.tagged
The number of lines in the output is not the same as test.txt
Example:
wc -l test.txt
11620
wc -l test.tagged
11500
It doesn't always happen, so there may be a bug.
This may be a stupid question, but I followed your manual and am still stuck on this.
The first step in training a tagging model is to transform the raw data into a "feature/attribute" file using chunking.py,
that is:
train.txt -----> train.crfsuite.txt
test.txt -----> test.crfsuite.txt
Then training and testing are both done on these "feature/attribute" files, like this:
crfsuite learn -m CRF.model train.crfsuite.txt
crfsuite tag -m CRF.model test.crfsuite.txt
But when I do tagging, I don't actually want to run an experiment and check accuracy, F1 score, and so on. I only have unlabelled text data, so how do I tag it?
I tried this:
crfsuite tag -m CRF.model unlabelled.txt
but every token gets the same label, which is obviously wrong.
Should I first transform my unlabelled text data into a "feature/attribute" file? If so, how?
Please help.
I've been trying for the past couple of days to train the CRF with rich spatial data looking like this:
Sequence1:
A1 L=0.0 O=North
B1 L=0.8 O=East
C1 L=0.8 O=East
C2 L=0.8 O=South
Sequence2:
A2 L=0.0 O=North
A3 L=0.8 O=South
A4 L=0.8 O=South
B5 L=0.8 O=East
(Something like a pawn traveling on a chessboard on possible paths.)
Then I'm passing arbitrary data (a small path) and try to match them and get a label sequence telling me the most probable path that the pawn took but I'm getting nowhere.
Would you be able to provide an example of chunking using a dataset like this one for possible path-map matching?
Thank you in advance.
J.
According to the INSTALL file:
1. `cd' to the directory containing the package's source code and type
`./configure' to configure the package for your system. If you're
using `csh' on an old version of System V, you might need to type
`sh ./configure' instead to prevent `csh' from trying to execute
`configure' itself.
I cannot locate configure.sh in the directory, and autoconf does not do anything. Please let me know how to fix it.
I think CRFsuite could be optimized to use the multiple cores available on most machines these days. A simple approach I thought of is to compute the scores in the for loop of encoder_objective_and_gradients_batch in parallel,
especially at this line:
crfsuite/lib/crf/src/crf1d_encode.c
Line 825 in 8c0028c
An additional dependency might be needed if we use a multiprocessing library such as OpenMP to implement the feature, which could be switched on or off with a flag.
Some API changes might also be needed to ensure that the results from the parallel jobs are aggregated properly.
I would love to get feedback on this and to know whether anyone else is working on such a patch.
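To illustrate the aggregation step only, here is a toy sketch in Python. It is not CRFsuite's code: `instance_loss_and_grad` is a hypothetical stand-in (squared error of a linear score) for the per-sequence objective, and a thread pool stands in for what OpenMP would do in C. Python threads would not give a real speed-up for this kind of work; the point is the pattern of summing partial losses and gradients computed over separate chunks.

```python
from concurrent.futures import ThreadPoolExecutor


def instance_loss_and_grad(weights, instance):
    """Toy per-instance objective: squared error of a linear score."""
    features, target = instance
    score = sum(w * x for w, x in zip(weights, features))
    err = score - target
    return 0.5 * err * err, [err * x for x in features]


def chunk_loss_and_grad(weights, chunk):
    """Accumulate loss and gradient over one chunk of instances."""
    loss, grad = 0.0, [0.0] * len(weights)
    for inst in chunk:
        l, g = instance_loss_and_grad(weights, inst)
        loss += l
        grad = [a + b for a, b in zip(grad, g)]
    return loss, grad


def batch_loss_and_grad(weights, instances, workers=4):
    """Split the batch into chunks, process them concurrently, sum the partials."""
    chunks = [instances[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: chunk_loss_and_grad(weights, c), chunks))
    loss = sum(p[0] for p in partials)
    grad = [sum(col) for col in zip(*(p[1] for p in partials))]
    return loss, grad
```

Each worker keeps its own accumulator and the partial results are combined only once at the end, so no locking is needed around the shared gradient vector.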
Instead, one can simply run python setup.py build_ext -R PATH_TO_CRFSUITE
to tell Python to look for crfsuite.so at PATH_TO_CRFSUITE.
crfsuite version = 0.12
StopIteration
During handling of the above exception, another exception occurred:
SystemError: <built-in function delete_ItemSequence> returned a result with an error set
During handling of the above exception, another exception occurred:
SystemError: <built-in function delete_StringList> returned a result with an error set
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "..../train.py", line 73, in <module>
for xseq_, yseq_ in instances(fi_):
SystemError: <built-in function delete_Item> returned a result with an error set
my code:
#!/usr/bin/env python
import crfsuite
import sys
from xxxxx import config
import os.path


# Inherit crfsuite.Trainer to implement message() function, which receives
# progress messages from a training process.
class Trainer(crfsuite.Trainer):
    def message(self, s):
        # Simply output the progress messages to STDOUT.
        sys.stdout.write(s)


def instances(fi):
    xseq = crfsuite.ItemSequence()
    yseq = crfsuite.StringList()
    i = 0
    for line in fi:
        i += 1
        print(i)
        line = line.strip('\n')
        if not line:
            # An empty line presents an end of a sequence.
            # if xseq:
            yield xseq, tuple(yseq)
            xseq = crfsuite.ItemSequence()
            yseq = crfsuite.StringList()
            continue
        # Split the line with TAB characters.
        fields = line.split('\t')
        # Append attributes to the item.
        item = crfsuite.Item()
        for field in fields[1:]:
            p = field.rfind(':')
            if p == -1:
                # Unweighted (weight=1) attribute.
                item.append(crfsuite.Attribute(field))
            else:
                # Weighted attribute
                item.append(crfsuite.Attribute(field[:p], float(field[p + 1:])))
        # Append the item to the item sequence.
        xseq.append(item)
        # Append the label to the label sequence.
        yseq.append(fields[0])
    xseq.erase()


if __name__ == '__main__':
    version = "_v1_python"
    # This demonstrates how to obtain the version string of CRFsuite.
    print("crfsuite version = ", crfsuite.version())
    # Create a Trainer object.
    trainer = Trainer()
    # Read training instances from STDIN, and set them to trainer.
    train_feature_file = config.train_feature_file+"_v1"
    model_file = config.model_file+version
    if not os.path.exists(train_feature_file):
        raise FileNotFoundError("train_file: {} not found".format(train_feature_file))
    if os.path.exists(model_file):
        raise FileExistsError("model_file: {} have been existed".format(model_file))
    with open(train_feature_file, 'r') as fi_:
        for xseq_, yseq_ in instances(fi_):
            trainer.append(xseq_, yseq_, 0)
    # Use L2-regularized SGD and 1st-order dyad features.
    trainer.select('l2sgd', 'crf1d')
    # This demonstrates how to list parameters and obtain their values.
    for name in trainer.params():
        print(name, trainer.get(name), trainer.help(name))
    # Set the coefficient for L2 regularization to 0.1
    trainer.set('c2', '0.1')
    # Start training; the training process will invoke trainer.message()
    # to report the progress.
    trainer.train(model_file, -1)
I am new to CRFs and am still building up my understanding of the basic concepts and of probabilistic graphical models (PGM). However, I urgently need to start familiarizing myself with the relevant crfsuite code so that I can begin implementing satellite image classification as soon as possible. Can anyone advise me on the way forward?
On Ubuntu 13.10, after downloading and unpacking crfsuite master:
./autogen.sh
...
aclocal: warning: autoconf input should be named 'configure.ac', not 'configure.in'
configure.in:33: error: automatic de-ANSI-fication support has been removed
/usr/share/aclocal-1.13/obsolete.m4:26: AM_C_PROTOTYPES is expanded from...
configure.in:33: the top level
autom4te: /usr/bin/m4 failed with exit status: 1
aclocal: error: echo failed with exit status: 1
aclocal failed!
Removing the offending macro by replacing
AM_C_PROTOTYPES
with
dnl AM_C_PROTOTYPES
in configure.in
fixes the (immediate) issue.
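The same edit can be scripted. Below is a small sketch (the helper name `comment_out_macro` is hypothetical) that prefixes the macro with dnl wherever it starts a line, and leaves lines that are already commented out untouched.

```python
import re


def comment_out_macro(text, macro="AM_C_PROTOTYPES"):
    """Prefix every line that starts with the macro with 'dnl '."""
    pattern = re.compile(r"^([ \t]*)(" + re.escape(macro) + r")\b", re.MULTILINE)
    return pattern.sub(r"\1dnl \2", text)


def patch_file(path="configure.in"):
    """Rewrite the file in place with the macro commented out."""
    with open(path, "r") as f:
        original = f.read()
    with open(path, "w") as f:
        f.write(comment_out_macro(original))
```

Calling patch_file() from the source directory before running ./autogen.sh applies the workaround; running it twice is harmless because commented lines no longer match.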
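I would like to point out that an identifier like “__CRFSUITE_API_HPP__” does not conform to the naming rules of the C++ language standard: identifiers that contain a double underscore, or that begin with an underscore followed by an uppercase letter, are reserved for the implementation.
Would you like to choose different unique names?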
I am using SWIG 2.0.9 to build the Python interface. During the first step of the build process (./build.sh --swig), I get the following errors:
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (directorout) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &DIRECTOROUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (in) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (in) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (typecheck) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (typecheck) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (argout) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *OUTPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (argout) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &OUTPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (typecheck) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (typecheck) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (freearg) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (freearg) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INOUT
Other people have this error as well using swig 2.0.8:
http://swig.10945.n7.nabble.com/Error-copying-typemap-when-building-crfsuite-td10775.html
I'm trying to train a model with a text file that is 42 GB in size. I have more than enough memory on my machine, but I get a segmentation fault (core dump) while training. Is there any reason why this would happen?
My team and I have trained multiple models on smaller datasets on the same machine, so we are confident that crfsuite is set up correctly.