chokkan / crfsuite

CRFsuite: a fast implementation of Conditional Random Fields (CRFs)

Home Page: http://www.chokkan.org/software/crfsuite/

License: Other

crfsuite's Introduction

                               CRFsuite
                              Version 0.12
                 http://www.chokkan.org/software/crfsuite/



* INTRODUCTION
CRFSuite is an implementation of Conditional Random Fields (CRFs) for
labeling sequential data. Please refer to the web site for more
information about this software.



* COPYRIGHT AND LICENSING INFORMATION

This program is distributed under the modified BSD license. Refer to
the COPYING file for the precise description of the license.


Portions of this software are based on libLBFGS.

The MIT License

Copyright (c) 1990 Jorge Nocedal
Copyright (c) 2007 Naoaki Okazaki

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.


Portions of this software are based on Constant Quark Database (CQDB).

The BSD license.

Copyright (c) 2007, Naoaki Okazaki
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    * Neither the name of the Northwestern University, University of Tokyo,
      nor the names of its contributors may be used to endorse or promote
      products derived from this software without specific prior written
      permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


Portions of this software are based on RumAVL.

MIT/X Consortium License.

Copyright (c) 2005-2007 Jesse Long <[email protected]>
All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:

   1. The above copyright notice and this permission notice shall be
      included in all copies or substantial portions of the Software.
   2. The origin of the Software must not be misrepresented; you must not
      claim that you wrote the original Software.
   3. Altered source versions of the Software must be plainly marked as
      such, and must not be misrepresented as being the original Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.


Portions of this software are based on a portable stdint.h (for MSVC).

Copyright (c) 2005-2007 Paul Hsieh

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

    Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.

    Redistributions in binary form must not misrepresent the orignal
    source in the documentation and/or other materials provided
    with the distribution.

    The names of the authors nor its contributors may be used to
    endorse or promote products derived from this software without
    specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
OF THE POSSIBILITY OF SUCH DAMAGE.


Portions of this software are based on Mersenne Twister.

Copyright (C) 1997 - 2002, Makoto Matsumoto and Takuji Nishimura,
All rights reserved.                          

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

  1. Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  3. The names of its contributors may not be used to endorse or promote 
     products derived from this software without specific prior written 
     permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.



* SPECIAL THANKS GOES TO...
Olivier Grisel
Andreas Holzbach
Baoli Li
Yoshimasa Tsuruoka
Hiroshi Manabe
Riza Theresa B. Batista-Navarro

crfsuite's Issues

Why are some reference labels (null) in the output of tag -r?

After training

crfsuite learn -m review.model train.data

I got a model, then tested it:

crfsuite tag -r -m review.model test.data > result

but I found many (null) labels:

cat -n result | grep 'null'
    33	(null)	N
    35	(null)	N
    37	(null)	N
    46	(null)	N
    50	(null)	N
    59	(null)	N
    68	(null)	N
    77	(null)	N
    97	(null)	N
   118	(null)	N
   134	(null)	N
   139	(null)	N
   196	(null)	N
   205	(null)	N
   263	(null)	N
   273	(null)	N
   287	(null)	N
   288	(null)	N
   324	(null)	N
   330	(null)	N
   342	(null)	N
   343	(null)	N
   347	(null)	N
   349	(null)	N
   366	(null)	N

but in test.data these items are actually tagged:

cat -n result | grep 'null' | awk '{print $1}' | xargs -i{} sed -n '{}p' test.data | awk '{print $1}'
SEV-KY-95
LJ-119
LJ-119
SEV-PY-50
CPU-CM-146
HX-116
LOG-HGL-137
HX-116
HH-94
TH-18
LOG-HK-92
CAM-HX-116
KY-95
GY-149
NET-HX-151
SYS-BL-30
NET-WY-97
NET-WY-97
NET-BK-100
SCN-KY-95
BAT-BNY-168
BAT-RY-53
KY-95
BAT-HX-151
KY-95
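
One thing that may be worth ruling out (an assumption, not something confirmed in this report) is that the affected reference labels never occur in train.data, so the model has no name for them. The sketch below lists labels that appear in the test lines but not in the training lines. It assumes the label is the first tab-separated field of each non-empty line; the helper names are made up for illustration.

```python
def read_labels(lines):
    """Collect the first tab-separated field of every non-empty line."""
    labels = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if line:
            labels.add(line.split("\t", 1)[0])
    return labels


def unseen_labels(train_lines, test_lines):
    """Return test labels that never occur in the training lines."""
    return sorted(read_labels(test_lines) - read_labels(train_lines))


# Tiny in-memory stand-ins for train.data and test.data (hypothetical content).
train = ["A\tw=x\n", "N\tw=y\n", "\n"]
test = ["A\tw=x\n", "SEV-KY-95\tw=z\n"]
print(unseen_labels(train, test))
```

With real files, the same functions can be fed open(...) handles instead of lists.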

Empty "\r\n" lines are treated as samples with label "\r"

If the training file is in DOS format, CRFsuite treats an empty line "\r\n" as a sample with label "\r" and no features (all zeros). This can be seen when you tag the file with -l to output all marginals.
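
A possible workaround until this is handled in the reader is to convert line endings before training. The sketch below does that in plain Python; the function name and the sample data are illustrative only.

```python
def normalize_newlines(text):
    """Convert DOS (\\r\\n) and bare \\r line endings to Unix \\n."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# A DOS-formatted sample: two items, an empty separator line, one item.
sample = "B\tw=a\r\nI\tw=b\r\n\r\nO\tw=c\r\n"
clean = normalize_newlines(sample)
print(repr(clean))
```

After conversion the separator between sequences is a truly empty line rather than a line containing "\r".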

How regularization with L1 and L2 works

Hi,

I use CRFSuite extensively to build various models. Until now all my models used only L2 regularization. I discovered that with L1 regularization I can get virtually the same performance with a much more compact model.

So I started to use both the L1 and L2 parameters during training. My question is: how is this implemented? Is the objective function penalized with both terms at the same time, or is it sequential (first L1, then re-training with L2)?

From the documentation it is clear that L-BFGS can be used whenever L1=0. If L1 is positive, the OWL-QN solver is switched on automatically. From the literature I understand that OWL-QN optimizes with an L1 regularizer only, with no L2 regularization term. So how is this implemented in CRFSuite?
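
To make the "at the same time" reading of the question concrete: a simultaneous (elastic-net style) penalty adds both terms to a single objective. The sketch below only illustrates that reading. It is not taken from the CRFsuite sources, and the exact scaling of the L2 term (for example a factor of 1/2) is an assumption.

```python
def regularized_objective(neg_log_likelihood, weights, c1, c2):
    """Loss plus an L1 and an L2 penalty applied in the same objective."""
    l1_norm = sum(abs(w) for w in weights)        # ||w||_1
    l2_norm_sq = sum(w * w for w in weights)      # ||w||_2^2
    return neg_log_likelihood + c1 * l1_norm + c2 * l2_norm_sq


# Hypothetical numbers, chosen only to show how the terms combine.
print(regularized_objective(1.0, [1.0, -2.0], c1=0.5, c2=0.25))
```

In this formulation the L2 term is smooth and can be folded into the loss and its gradient, while the L1 term is the non-smooth part that an orthant-wise method such as OWL-QN handles.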

Thanks

Increment the version number?

I am very happy to see crfsuite being improved since very recently! Thanks a lot, chokkan! I don't know the policy of version numbers but wouldn't it be great to increment the version number to reflect the relatively large number of pull requests being integrated?

Thanks a lot for maintaining the software!

swig python build error

myhost:python gaowei$ python setup.py build_ext
running build_ext
building '_crfsuite' extension
g++ -fno-strict-aliasing -I/Users/gaowei/anaconda/include -arch x86_64 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/gaowei/anaconda/include/python2.7 -c crfsuite.cpp -o build/temp.macosx-10.5-x86_64-2.7/crfsuite.o
g++ -fno-strict-aliasing -I/Users/gaowei/anaconda/include -arch x86_64 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/gaowei/anaconda/include/python2.7 -c export_wrap.cpp -o build/temp.macosx-10.5-x86_64-2.7/export_wrap.o
export_wrap.cpp:5182:25: error: redefinition of 'swig::traits<std::vector<CRFSuite::Attribute, std::allocator<CRFSuite::Attribute> > >'
template <> struct traits<CRFSuite::Item > {
^~~~~~~~~~~~~~~~~~~~~~~
export_wrap.cpp:5051:22: note: previous definition is here
template <> struct traits<std::vector<CRFSuite::Attribute, std::allocator< CRFSuite::Attribute > > > {
^
1 error generated.
error: command 'g++' failed with exit status 1
myhost:python gaowei$

Fail to compile with SSE

According to one of my colleagues, CRFSuite fails to compile when an SSE intrinsic (_mm_castsi128_pd) is used.

GCC version is 3.4.6

64 bit stack exception

Hi,
I am experimenting with integrating CRFSuite into a 64-bit C++11 Windows application compiled with Visual Studio 2013 and have encountered an unexpected problem. I have successfully built the 64-bit (release and debug) cqdl.lib and crf.lib static libraries.

The program loads a prebuilt model that was constructed and tested using the front-end application (crfsuite.exe).

The following is the sequence of actions my program takes:
(1) construct a tagger:
CRFSuite::Tagger tagger;

(2) I successfully load the pre-constructed model:
tagger.open("path to model");

(3) Create a new ItemSequence that will be used to tag unknown data:
CRFSuite::ItemSequence xseq;

(4) In a loop I create items, populate their attributes, and add each item to the sequence:
CRFSuite::Item item;
for (auto t : terms){
item.push_back(CRFSuite::Attribute(t));
}
xseq.push_back(item);

(5) give the sequence to the tagger
tagger.set(xseq);

BAM!!!! I get the following runtime check error:
"Run-Time Check Failure #2 - Stack around the variable '_inst' was corrupted."

My assumption is that I am doing something incorrectly.
Any insight into this would be greatly appreciated.

Feature for training CRF

CRFSuite provides a good pipeline for NER training and recognition using CRFs. I wanted to confirm the training procedure. From what I observed, word embeddings alone do not give good accuracy. However, adding them on top of baseline features such as contextual tokens, POS, isupper, isdigit, istitle, etc. gives good accuracy. Is there anything I am missing?

Building the binding failed

@chokkan: I hit the following error at the "Build the binding" step ($ python setup.py build_ext):

running build_ext
building '_crfsuite' extension
C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\BIN\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG "-IC:\Program Files (x86)\Python33\include" "-IC:\ProgramFiles (x86)\Python33\include" /Tpcrfsuite.cpp /Fobuild\temp.win32-3.3\Release\crfsuite.obj
crfsuite.cpp
crfsuite.cpp(1) : error C2059: syntax error : '<'
error: command '"C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\BIN\cl.exe"' failed with exit status 2

Any idea?

Thanks,
Mahnoosh

missing inline

In crfsuite.hpp, the definitions of member functions such as Trainer::Trainer are not inline. So when multiple source files include crfsuite.hpp, the C++ compiler reports duplicate symbols for these member functions.
Please add "inline" to all the member functions defined in the header, like this:

@@ -44,6 +44,7 @@
 namespace CRFSuite
 {

+inline
 Trainer::Trainer()
 {
     data = new crfsuite_data_t;

Thanks.

Learning cannot be piped

I cannot do something like
crfsuite learn -m some.model < train.txt
However, it works for tagging:
crfsuite tag -m some.model < test.txt
Weird.

Python import error: no module found _crfsuite

I get that error when importing; I also get an error that libcrfsuite.so cannot be found.

How can CRFsuite be used for image segmentation?

Many people now use CRFs to segment images. Can I use CRFsuite to do this?
All the examples process text with CRFsuite. Can you give me an example that processes an image with CRFsuite?
Thank you very much!

Can a CRF predict like an HMM? Thanks a lot!

Hello, I know that an HMM can predict the label of a whole observed sequence, for example the word spoken in one utterance. Can a CRF do the same thing? The steps would be:
1. Train different CRF models on different sets of samples.
2. Feed the sequence to be predicted to each model and get its probability under each model.
3. Find the model with the maximum probability, and label the sequence with that model's label.

Thanks a lot!

CRFsuite cannot tag concurrently

CRFsuite uses a single tagger instance per model, created here and returned from get_tagger() here. As such, it is not safe to tag data from multiple threads at once, because the tagging process mutates the tagger.

It would be a simple enough change to create a new tagger instance in get_tagger() every time. Would that be enough to ensure thread safety?
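
Independently of whether a fresh instance per get_tagger() call would be sufficient inside the library, callers can avoid sharing mutable tagger state by keeping one tagger per thread. The sketch below shows that pattern with thread-local storage. PerThreadTagger and StubTagger are hypothetical names; StubTagger stands in for whatever object actually wraps the model.

```python
import threading


class PerThreadTagger:
    """Lazily create one tagger per thread so state is never shared."""

    def __init__(self, factory):
        self._factory = factory          # callable that builds a new tagger
        self._local = threading.local()  # separate attribute namespace per thread

    def get(self):
        tagger = getattr(self._local, "tagger", None)
        if tagger is None:
            tagger = self._factory()
            self._local.tagger = tagger
        return tagger


class StubTagger:
    """Placeholder for a real tagger object (illustration only)."""


pool = PerThreadTagger(StubTagger)
main_tagger = pool.get()

seen = []
worker = threading.Thread(target=lambda: seen.append(pool.get()))
worker.start()
worker.join()
print(main_tagger is pool.get(), seen[0] is main_tagger)
```

The cost is one loaded model per thread, which may matter for large models.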

Question about duplicate features

Perhaps I missed it in the documentation, but it is not clear to me how duplicate features are treated in CRFSuite. Suppose my data looks like this:

...
label0 A C D
label1 A B C D A
label2 B C C D C
label3 B C D E
...

Here we see that the 2nd and 3rd examples have duplicate features. The 2nd example has feature A twice, while the 3rd example has feature C three times.

I would really hope that the weight of a duplicated feature is incremented, that is, I hope my example is equivalent to:

...
label0 A C D
label1 A:2 B C D
label2 B C:3 D
label3 B C D E
...
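
Whether CRFsuite merges duplicates this way internally is not answered here, but the explicit form can be produced in preprocessing so the behavior no longer depends on it. The sketch below rewrites a line by counting repeated attributes and emitting name:count for those that occur more than once. It assumes attributes carry no existing ":value" suffix, and it joins fields with a space to match the example above (real CRFsuite data is tab-separated).

```python
from collections import Counter


def merge_duplicates(line):
    """Collapse repeated attributes into a single name:count attribute."""
    label, *attrs = line.split()
    counts = Counter(attrs)  # keeps first-seen order of attributes
    merged = [name if n == 1 else f"{name}:{n}" for name, n in counts.items()]
    return " ".join([label] + merged)


for row in ["label0 A C D", "label1 A B C D A", "label2 B C C D C"]:
    print(merge_duplicates(row))
```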

Thanks

bug: viterbi transition scores too soft

Problem: tagger sometimes generates sequence that is not possible according to transition matrix.

Preconditions: sparse transition matrix.

Repro: model file and sample sequence available upon request.

Patch: https://gist.github.com/pgmmpk/6193513

Please, review
Mike

How can I add more features?

I want to add features like "isCapital", "isNumber", etc. Could you please tell me how I can do that?
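
Features of this kind are usually generated by a preprocessing script that writes one line per token: the label followed by tab-separated attribute strings. The sketch below is one minimal way to do that. The attribute names (w=, isCapital, isNumber, w[-1]=, w[1]=, __BOS__, __EOS__) are arbitrary choices for illustration, and tokens containing ':' or '\' would need escaping first.

```python
def token_attributes(tokens, i):
    """Build a list of attribute strings for the token at position i."""
    word = tokens[i]
    attrs = ["w=" + word.lower()]
    if word[:1].isupper():
        attrs.append("isCapital")
    if word.isdigit():
        attrs.append("isNumber")
    # Context tokens, with markers at the sequence boundaries.
    attrs.append("w[-1]=" + tokens[i - 1].lower() if i > 0 else "__BOS__")
    attrs.append("w[1]=" + tokens[i + 1].lower() if i < len(tokens) - 1 else "__EOS__")
    return attrs


def to_crfsuite_lines(tokens, labels):
    """Return one 'label<TAB>attr<TAB>attr...' line per token."""
    return ["\t".join([y] + token_attributes(tokens, i))
            for i, y in enumerate(labels)]


lines = to_crfsuite_lines(["Paris", "2024"], ["LOC", "O"])
print("\n".join(lines))
```

Sequences are then separated by an empty line in the training file.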

Memory leak in CRFSuite::Tagger

I use the crfsuite tagger in a simple test program:

CRFSuite::Tagger* newTagger = new CRFSuite::Tagger();
newTagger->open("test_model.model");
...
tagVec = newTagger->tag(items);
delete newTagger;

Running the program in valgrind would return an error such as:

==11280== 7,519,904 bytes in 1 blocks are definitely lost in loss record 2 of 2
==11280==    at 0x4C2DBB6: malloc (vg_replace_malloc.c:299)
==11280==    by 0x42041A: crf1dm_new (in /home/artem/projects/test_crfsuite/dist/Debug/GNU-Linux/test_crfsuite)
==11280==    by 0x415B8F: crf1m_create_instance_from_file (in /home/artem/projects/test_crfsuite/dist/Debug/GNU-Linux/test_crfsuite)
==11280==    by 0x403551: CRFSuite::Tagger::open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (crfsuite.hpp:289)
==11280==    by 0x403D1A: main (main.cpp:26)
==11280== 
==11280== LEAK SUMMARY:
==11280==    definitely lost: 7,519,904 bytes in 1 blocks
==11280==    indirectly lost: 0 bytes in 0 blocks
==11280==      possibly lost: 0 bytes in 0 blocks
==11280==    still reachable: 72,704 bytes in 1 blocks
==11280==         suppressed: 0 bytes in 0 blocks
==11280== 
==11280== For counts of detected and suppressed errors, rerun with: -v
==11280== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

"Definitely lost" here is equal to the .model file size.
The place that valgrind points to:

...
buffer = buffer_orig = (uint8_t*)malloc(size + 16);
if (buffer_orig = NULL) {
    goto error_exit;
}
...
return crf1dm_new_impl(buffer_orig, buffer, size);

where:

static crf1dm_t* crf1dm_new_impl(uint8_t* buffer_orig, const uint8_t* buffer, uint32_t size)
{
...
model->buffer_orig = buffer_orig;

then in the "destructor" the memory is never freed:

void crf1dc_delete(crf1d_context_t* ctx)
{
...
    if (model->buffer_orig != NULL) {
        free(model->buffer_orig);
        model->buffer_orig = NULL;
    }
    model->buffer = NULL;
...

After my fix #74 valgrind shows that memory doesn't leak any more:

==27439== LEAK SUMMARY:
==27439==    definitely lost: 0 bytes in 0 blocks
==27439==    indirectly lost: 0 bytes in 0 blocks
==27439==      possibly lost: 0 bytes in 0 blocks
==27439==    still reachable: 72,704 bytes in 1 blocks
==27439==         suppressed: 0 bytes in 0 blocks
==27439== 
==27439== For counts of detected and suppressed errors, rerun with: -v
==27439== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Bias term?

Sometimes I am using the CRFSuite to do document classification. All the features for a document are simply tucked in a single line where the label is the first token in that line as defined by the format.

In the classic Logistic Regression setup one tries to fit the model by finding the parameters - theta (number of features x number classes) and a bias term. The CRFSuite gives the former matrix of coefficients but no bias term. Is it necessary for classification?

All in all, CRF is just a generalization of Logistic Regression to sequences according to some seminal papers on sequence analysis.
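
If a bias term is wanted, a common trick with feature-based toolkits is to add one constant attribute to every item, so that the weight learned for that attribute per label plays the role of an intercept. Whether this is necessary for a given task is not settled here. The sketch below appends such an attribute to each data line; the name "bias" is an arbitrary choice.

```python
def add_bias(line, name="bias"):
    """Append a constant attribute to a non-empty 'label<TAB>attrs' line."""
    text = line.rstrip("\r\n")
    if not text:
        return text  # keep empty separator lines empty
    return text + "\t" + name


rows = ["spam\tw=offer\tw=free\n", "\n", "ham\tw=meeting\n"]
with_bias = [add_bias(row) for row in rows]
print(with_bias)
```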

Thanks

crfsuite for image classification

I am new to CRFs and am still learning the basic concepts and probabilistic graphical models (PGM). However, I urgently need to get familiar with the relevant crfsuite code so that I can start implementing satellite image classification as soon as possible. Can anyone advise me on the way forward?

n-best tagging results

Is it possible to return the n most likely predictions using CRF? If so, which part of the source code should be modified to get this behavior? I could not find any parameter that provides it.

Thank you.

Python wrapper build fails on OS X

g++ was failing during the phase:

python setup.py build_ext
with the errors:
Undefined symbols for architecture i386:
"_PyArg_ParseTuple", referenced from:
__wrap_version in export_wrap.o
.
.

There were many undefined symbols during the linking phase. By adding the path to the Python libraries I was able to compile. The command python-config --libs prints the exact linker flags you need to add to the g++ command.

Undefined symbols when building python wrapper on OSX (solution)

I had undefined symbols issues when trying to build the python wrapper (using the "modified" export_wrap.cpp as suggested in README).

The undefined symbols error can be seen there : https://gist.github.com/3248610

It was issued while running python setup.py build_ext by the command
g++ -arch i386 -arch x86_64 build/temp.macosx-10.7-intel-2.7/crfsuite.o build/temp.macosx-10.7-intel-2.7/export_wrap.o -L/usr/local/lib -lcrfsuite -o build/lib.macosx-10.7-intel-2.7/_crfsuite.so -shared

After some research I managed to resolve the issue by passing to the linker the option -undefined dynamic_lookup (found in man ld)

So finally, this worked:

python setup.py build_ext
g++ -arch i386 -arch x86_64 build/temp.macosx-10.7-intel-2.7/crfsuite.o build/temp.macosx-10.7-intel-2.7/export_wrap.o -L/usr/local/lib -lcrfsuite -o build/lib.macosx-10.7-intel-2.7/_crfsuite.so -shared -Wl,-undefined,dynamic_lookup
python setup.py build_ext
python setup.py install

(the option can be passed to the linker with -Wl,-undefined,dynamic_lookup)

It may save some hours to someone else..!

(I am not sure how to change the code to prevent this error. If one gives me some clues I can propose a pull request).

Is there a C++ demo?

Where can I find a demo of using crfsuite from C++?

Is crfsuite only used for NLP?

If my data looks like the following, crfsuite can't be used, right?

label-1 , vector
label-2, vector

...

Question: Is "n-best" tagging possible with CRFSuite?

The Wapiti CRF toolkit has a neat feature called N-best Viterbi output which returns the n-best label sequences for an input sequence. Is there a similar functionality in crfsuite?
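
For readers unfamiliar with the idea, n-best decoding for a linear chain can be sketched as a Viterbi pass that keeps the n best partial paths per label instead of one. The code below is a standalone illustration in plain Python, not CRFsuite or Wapiti code. It assumes a non-empty sequence, additive (log-domain) scores, state_scores[t][y] for label y at position t, and trans_scores[prev][y] for a transition.

```python
import heapq


def nbest_viterbi(state_scores, trans_scores, n):
    """Return the n highest-scoring (score, label_path) pairs."""
    num_labels = len(state_scores[0])
    # beams[y]: up to n best (score, path) hypotheses that end in label y
    beams = [[(state_scores[0][y], (y,))] for y in range(num_labels)]
    for t in range(1, len(state_scores)):
        new_beams = []
        for y in range(num_labels):
            candidates = [
                (score + trans_scores[prev][y] + state_scores[t][y], path + (y,))
                for prev in range(num_labels)
                for score, path in beams[prev]
            ]
            new_beams.append(heapq.nlargest(n, candidates))
        beams = new_beams
    finals = [hyp for beam in beams for hyp in beam]
    return heapq.nlargest(n, finals)


# Two positions, two labels; numbers are made up for the example.
state = [[1.0, 0.0], [0.0, 2.0]]
trans = [[0.5, 0.0], [0.0, 0.5]]
best_two = nbest_viterbi(state, trans, 2)
print(best_two)
```

Keeping n hypotheses per label and position is enough for an exact n-best list, at roughly n times the cost of ordinary Viterbi.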

Thanks for your hints!

Documentation on the return value of `crfsuite_create_instance` does not match implementation.

The documentation for crfsuite_create_instance currently states that 0 is returned upon success (like the rest of the API), but the implementation returns 1 upon success due to the use of == 0 in the implementation to perform a conditional execution via short-circuit evaluation of the first half of the expression.

From current latest revision (a6f144b) of lib/crf/src/crfsuite.c:

int crfsuite_create_instance(const char *iid, void **ptr)
{
    int ret = 
        crf1de_create_instance(iid, ptr) == 0 ||
        crfsuite_dictionary_create_instance(iid, ptr) == 0;

    return ret;
}

Suggested correction:

int crfsuite_create_instance(const char *iid, void **ptr)
{
    int ret = crf1de_create_instance(iid, ptr);
    if (ret != 0)
      ret = crfsuite_dictionary_create_instance(iid, ptr);
    return ret;
}

Batch training with large datasets

I need to train a crf model using a large dataset, which doesn't fit into the memory all at once. Is there a way to train the model in a batch mode?

Unable to load model from Python, while OK from the CLI.

Hi all,

I am trying to load a model from Python and it yields the following error :

>>> import crfsuite
>>> t = crfsuite.Tagger()
>>> t.open("/.../my.crf.model")

Assertion failed: (false), function crf1dm_initialize_header, file /SourceCache/CRFSuite/CRFSuite-33/crfsuite/lib/crf/src/crf1d_model.c, line 990.
Abort trap: 6

The strange thing is that the same model works perfectly when loaded from the CLI :

crfsuite tag -m /.../my.crf.model test.txt

I have tried to reinstall swig, but it does not change anything. Any clue of what I should try next ?

Thanks in advance and Happy New Year !

Issue with the ':' character as label

Hi,
If the ':' character is used as a label, it is swallowed when I print out the predicted labels.

I attached two links for a dummy-set of train/test data below.
I trained my model with the command "crfsuite learn -m out.model miniTrain.txt"
I printed the prediction by calling "crfsuite tag -m out.model miniTest.txt"

The command line output prints empty lines if the ':' is used as label (red boxes)

https://dl.dropboxusercontent.com/u/2953290/miniTrain.txt
https://dl.dropboxusercontent.com/u/2953290/miniTest.txt

I am working on OSX 64 bit
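
A possible explanation, offered as an assumption to verify rather than a confirmed diagnosis: in the CRFsuite data format ':' separates an attribute name from its value, so a bare ':' field may be parsed as an empty name. My understanding is that the format uses "\:" and "\\" as escapes; whether that also applies to labels in this version is untested here. The helper below (a hypothetical name) applies that escaping to a field before it is written to the data file.

```python
def escape_field(text):
    """Escape backslashes and colons in a label or attribute name."""
    # Backslashes first, so the ones added for ':' are not doubled again.
    return text.replace("\\", "\\\\").replace(":", "\\:")


for raw in [":", "NN", "a\\b:c"]:
    print(raw, "->", escape_field(raw))
```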

Restrict tags for each item

Is there a way to restrict possible set of tags for each item?
For example, I want to do Morphological Disambiguation, so for each word there is a small set of possible tags (from dictionary), as opposed to all possible tags for all words.

Build instructions for SWIG wrapper

@chokkan: It would be great if you could provide build instructions for the SWIG wrapper.

I changed the include paths in swig/python/setup.py and swig/python/prepare.sh and run prepare.sh but when I try to run setup.py building the extension module fails with the error:

export_wrap.cpp:4727: error: redefinition of ‘struct swig::traits<std::vector<CRFSuite::Attribute, std::allocator<CRFSuite::Attribute> > >’
export_wrap.cpp:4635: error: previous definition of ‘struct swig::traits<std::vector<CRFSuite::Attribute, std::allocator<CRFSuite::Attribute> > >’

thanks,
Peter

What's the problem with my data?

I've installed crfsuite on Windows 7 64-bit.
I have a Farsi text like this:
دولتی ADJ-SIM-GEN O
عربی N-PR-SING O
استقامت N-COM-SING O
و CON O
ایستادگی N-COM،SING-GEN O
افزون‌تر ADJ-CMPR O
از P O
The encoding is windows-1256. When I run the command:
python chunking.py < train.txt > train.crfsuite.txt
I get an error: too few fields (1) for ['w','pos','y'].
What should I do?
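
The message says that some line yielded a single field where three were expected, which often points at a separator mismatch (for example tabs in the data while the script splits on spaces) rather than at the encoding. That reading is a guess. The sketch below, independent of chunking.py, reports the lines that have too few fields for a given separator so the offending lines can be inspected.

```python
def short_lines(lines, sep=" ", expected=3):
    """Return (line_number, field_count) for lines with too few fields."""
    found = []
    for number, line in enumerate(lines, 1):
        text = line.rstrip("\r\n")
        if not text:
            continue  # empty lines separate sequences
        count = len(text.split(sep))
        if count < expected:
            found.append((number, count))
    return found


# Line 1 is tab-separated, line 3 is space-separated (made-up sample).
sample = ["a\tPOS\tO\n", "\n", "b POS O\n"]
print(short_lines(sample, sep=" "))
print(short_lines(sample, sep="\t"))
```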

segfault on simple training input

Training input:

a       x
b       y

Output of running crfsuite learn

CRFSuite 0.12  Copyright (c) 2007-2011 Naoaki Okazaki

Start time of the training: 2011-11-16T01:05:15Z

Reading the data set(s)
[1] a
0....1....2....3....4....5....6....7....8....9....10
Number of instances: 1
Seconds required: 0.000

Statistics the data set(s)
Number of data sets (groups): 1
Number of instances: 1
Number of items: 2
Number of attributes: 2
Number of labels: 2

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 3
Seconds required: 0.000

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 2147483647
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 1.166644
Feature norm: 0.423138
Error norm: 0.014465
Active features: 3
Line search trials: 2
Line search step: 0.410505
Seconds required for this iteration: 0.000

***** Iteration #2 *****
Loss: 1.166594
Feature norm: 0.600181
Error norm: 0.002310
Active features: 3
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.000

***** Iteration #3 *****
Loss: 1.166593
Feature norm: 0.600198
Error norm: 0.000006
Active features: 3
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.000

*** glibc detected *** crfsuite: free(): invalid pointer: 0x00000000098b4e30 ***
======= Backtrace: =========
/lib64/libc.so.6[0x336fa7245f]
/lib64/libc.so.6(cfree+0x4b)[0x336fa728bb]
/chomes/wuke/.world/liblbfgs-1.10/lib/liblbfgs-1.10.so(lbfgs+0x1483)[0x2b594b9c8003]
/chomes/wuke/.world/crfsuite-0.12/lib/libcrfsuite-0.12.so(crfsuite_train_lbfgs+0x303)[0x2b594b7b8ab3]
/chomes/wuke/.world/crfsuite-0.12/lib/libcrfsuite-0.12.so[0x2b594b7c004d]
crfsuite[0x403080]
crfsuite[0x40480b]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x336fa1d994]
crfsuite[0x401629]
======= Memory map: ========
00400000-00407000 r-xp 00000000 00:2c 9494537                            /chomes/wuke/.world/crfsuite-0.12/bin/crfsuite
00606000-00607000 rw-p 00006000 00:2c 9494537                            /chomes/wuke/.world/crfsuite-0.12/bin/crfsuite
098a1000-098c2000 rw-p 098a1000 00:00 0                                  [heap]
336f600000-336f61c000 r-xp 00000000 fd:00 1011849                        /lib64/ld-2.5.so
336f81b000-336f81c000 r--p 0001b000 fd:00 1011849                        /lib64/ld-2.5.so
336f81c000-336f81d000 rw-p 0001c000 fd:00 1011849                        /lib64/ld-2.5.so
336fa00000-336fb4e000 r-xp 00000000 fd:00 261766                         /lib64/libc-2.5.so
336fb4e000-336fd4e000 ---p 0014e000 fd:00 261766                         /lib64/libc-2.5.so
336fd4e000-336fd52000 r--p 0014e000 fd:00 261766                         /lib64/libc-2.5.so
336fd52000-336fd53000 rw-p 00152000 fd:00 261766                         /lib64/libc-2.5.so
336fd53000-336fd58000 rw-p 336fd53000 00:00 0 
3370200000-3370282000 r-xp 00000000 fd:00 261846                         /lib64/libm-2.5.so
3370282000-3370481000 ---p 00082000 fd:00 261846                         /lib64/libm-2.5.so
3370481000-3370482000 r--p 00081000 fd:00 261846                         /lib64/libm-2.5.so
3370482000-3370483000 rw-p 00082000 fd:00 261846                         /lib64/libm-2.5.so
3374200000-337420d000 r-xp 00000000 fd:00 261848                         /lib64/libgcc_s-4.1.2-20080825.so.1
337420d000-337440d000 ---p 0000d000 fd:00 261848                         /lib64/libgcc_s-4.1.2-20080825.so.1
337440d000-337440e000 rw-p 0000d000 fd:00 261848                         /lib64/libgcc_s-4.1.2-20080825.so.1
2b594b7ae000-2b594b7b0000 rw-p 2b594b7ae000 00:00 0 
2b594b7b0000-2b594b7c5000 r-xp 00000000 00:2c 9494832                    /chomes/wuke/.world/crfsuite-0.12/lib/libcrfsuite-0.12.so
2b594b7c5000-2b594b9c4000 ---p 00015000 00:2c 9494832                    /chomes/wuke/.world/crfsuite-0.12/lib/libcrfsuite-0.12.so
2b594b9c4000-2b594b9c5000 rw-p 00014000 00:2c 9494832                    /chomes/wuke/.world/crfsuite-0.12/lib/libcrfsuite-0.12.so
2b594b9c5000-2b594b9c9000 r-xp 00000000 00:2c 16683972                   /chomes/wuke/.world/liblbfgs-1.10/lib/liblbfgs-1.10.so
2b594b9c9000-2b594bbc8000 ---p 00004000 00:2c 16683972                   /chomes/wuke/.world/liblbfgs-1.10/lib/liblbfgs-1.10.so
2b594bbc8000-2b594bbc9000 rw-p 00003000 00:2c 16683972                   /chomes/wuke/.world/liblbfgs-1.10/lib/liblbfgs-1.10.so
2b594bbc9000-2b594bbca000 rw-p 2b594bbc9000 00:00 0 
2b594bc14000-2b594bc15000 rw-p 2b594bc14000 00:00 0 
2b594bc15000-2b594bc18000 r-xp 00000000 00:2c 9494889                    /chomes/wuke/.world/crfsuite-0.12/lib/libcqdb-0.12.so
2b594bc18000-2b594be17000 ---p 00003000 00:2c 9494889                    /chomes/wuke/.world/crfsuite-0.12/lib/libcqdb-0.12.so
2b594be17000-2b594be18000 rw-p 00002000 00:2c 9494889                    /chomes/wuke/.world/crfsuite-0.12/lib/libcqdb-0.12.so
2b594be18000-2b594be1a000 rw-p 2b594be18000 00:00 0 
7fffb3b20000-7fffb3b35000 rw-p 7ffffffe9000 00:00 0                      [stack]
7fffb3bfd000-7fffb3c00000 r-xp 7fffb3bfd000 00:00 0                      [vdso]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vsyscall]
Aborted

Question about CRF model

Hi,

Can I use CRFsuite to implement the model in the attached image? Thanks a lot.

Create some unit tests

There don't seem to be any unit tests in the code. It would be great to have some tests here and in the other C++ dependencies, and then to add many more unit tests over time.

crfsuite can't be built with msvc < 2015

Python extensions must be built with the same VS as Python itself; it is VC 9.0 for Python 2.7, VC 10.0 for Python 3.4 and VC 2015 for Python 3.5+. But after a5221da and #44 crfsuite can be built only with msvc 2015. This is a problem for https://github.com/tpeng/python-crfsuite - it bundles crfsuite and builds a Cython extension for it, but after recent changes this no longer works because crfsuite can't be built with earlier msvc versions now.

The number of lines in input and output for tagged is different

Hi,

I used ./bin/crfsuite tag -m model test.txt > test.tagged
The number of lines in the output is not the same as test.txt

Example:

wc -l test.txt
11620
wc -l test.tagged
11500

This does not happen every time, so there may be a bug.
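
One way to narrow this down is to compare the structure of the two files instead of their raw line counts. The sketch below assumes the usual CRFsuite layout (one item per line, a blank line ending each sequence); the function name summarize is hypothetical. If the item counts agree and only the blank-line counts differ, the gap probably comes from extra blank lines in the input, not from lost predictions. That is only a guess and is not confirmed in this report.

```python
def summarize(lines):
    """Return (items, sequences, blank_lines) for CRFsuite-style text.

    An item is any non-blank line; a sequence is a run of consecutive items.
    """
    items = sequences = blanks = 0
    in_sequence = False
    for raw in lines:
        if raw.strip():
            items += 1
            if not in_sequence:
                # First item after a blank line (or at the start) opens a sequence.
                sequences += 1
                in_sequence = True
        else:
            blanks += 1
            in_sequence = False
    return items, sequences, blanks
```

Calling summarize(open("test.txt")) and summarize(open("test.tagged")) and comparing the two tuples shows whether items, sequences, or only blank lines differ.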

how to do tagging without "attribute" file?

This may be a basic question, but I followed the manual and am still stuck.

The first step in training a tagging model is to transform the raw data into a "feature/attribute" file using chunking.py, that is:
train.txt -----> train.crfsuite.txt
test.txt -----> test.crfsuite.txt
Training and testing are then both run on these "feature/attribute" files, like this:
crfsuite learn -m CRF.model train.crfsuite.txt
crfsuite tag -m CRF.model test.crfsuite.txt
However, when tagging I do not want to run an experiment and check accuracy, F1 score and so on. I only have unlabelled text data, so how do I tag it?
I tried this:
crfsuite tag -m CRF.model unlabelled.txt
but the results are all the same, which is obviously wrong.
Should I first transform my unlabelled text data into a "feature/attribute" file? If so, how?
Please help.
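
A minimal sketch of such a conversion follows. It assumes that the tagger reads the same tab-separated layout as the trainer (label first, then attributes, with a blank line ending each sequence) and that a placeholder label is acceptable in the first column for unlabelled data. The attribute template here (w[0], w[-1], w[1]) and all function names are hypothetical; whatever template is used must match the one used to build the training file, otherwise the model will not recognise the attributes.

```python
def escape(value):
    """Escape backslashes and colons, since ':' separates an attribute from its weight."""
    return value.replace("\\", "\\\\").replace(":", "\\:")


def token_attributes(tokens, i):
    """Hypothetical template: current word plus previous and next word."""
    attrs = ["w[0]=" + escape(tokens[i])]
    attrs.append("w[-1]=" + escape(tokens[i - 1]) if i > 0 else "__BOS__")
    attrs.append("w[1]=" + escape(tokens[i + 1]) if i + 1 < len(tokens) else "__EOS__")
    return attrs


def to_crfsuite(sentences, placeholder="O"):
    """Render tokenised sentences as attribute lines with a placeholder label."""
    lines = []
    for tokens in sentences:
        for i in range(len(tokens)):
            lines.append("\t".join([placeholder] + token_attributes(tokens, i)))
        lines.append("")  # a blank line terminates the sequence
    return "\n".join(lines) + "\n"
```

The returned string can be written to a file such as unlabelled.crfsuite.txt and passed to crfsuite tag in place of the raw text.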

Spatial Data Training

I've been trying for the past couple of days to train the CRF with rich spatial data looking like this:

Sequence1:
A1 L=0.0 O=North
B1 L=0.8 O=East
C1 L=0.8 O=East
C2 L=0.8 O=South

Sequence2:
A2 L=0.0 O=North
A3 L=0.8 O=South
A4 L=0.8 O=South
B5 L=0.8 O=East

(Something like a pawn traveling on a chessboard on possible paths.)

I then pass in arbitrary data (a short path) and try to obtain the label sequence for the most probable path the pawn took, but I'm getting nowhere.
Would you be able to provide an example, along the lines of the chunking one, that uses a dataset like this for path-to-map matching?

Thank you in advance.
J.
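
As a starting point, the sketch below shows one way the lines above could be rewritten into tab-separated label and attribute fields. It rests on assumptions not stated in the question: that the first token (A1, B1, ...) is the label to predict, that numeric fields such as L=0.8 are meant as weighted attributes (name:weight), and that categorical fields such as O=East are kept as plain attribute names. The function name convert_step is hypothetical.

```python
def convert_step(line):
    """Turn 'A1 L=0.8 O=East' into 'A1<TAB>L:0.8<TAB>O=East'."""
    label, *fields = line.split()
    attrs = []
    for field in fields:
        name, _, value = field.partition("=")
        try:
            float(value)
        except ValueError:
            # Categorical value: keep the whole 'O=East' string as the attribute name.
            attrs.append(field)
        else:
            # Numeric value: emit the name with the number as its weight.
            attrs.append(f"{name}:{value}")
    return "\t".join([label] + attrs)


def convert_sequence(lines):
    """Convert one path; the trailing empty string marks the end of the sequence."""
    return [convert_step(line) for line in lines] + [""]
```

Whether a weighted attribute or a binned categorical attribute suits the L values better depends on the data and would need to be checked experimentally.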

Problem in installation

According to the INSTALL file:

1. 'cd' to the directory containing the package's source code and type './configure' to configure the package for your system. If you're using 'csh' on an old version of System V, you might need to type 'sh ./configure' instead to prevent 'csh' from trying to execute 'configure' itself.

I cannot locate configure.sh in the directory, and running autoconf does not do anything. Please let me know how to fix it.

Multi core support for training on large number of instances

I think CRFsuite could be optimized to use the multiple cores available on most machines these days. A simple approach I had in mind is to compute the scores in parallel in the for loop of encoder_objective_and_gradients_batch, specifically at this line:

crfsuite/lib/crf/src/crf1d_encode.c

Line 825 in 8c0028c

for (i = 0;i < N;++i) {

Using a multiprocessing library such as OpenMP for this would add a dependency, which could be switched on or off with a build flag.

Some API changes might also be needed to ensure that the results from the parallel jobs are aggregated correctly.

I would love to get feedback on this and to know whether anyone else is working on such a patch.
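
The aggregation concern can be illustrated with a small sketch. It is not CRFsuite code: the per-instance function is a toy squared-error loss standing in for the real per-sequence computation, all names are hypothetical, and Python threads are used only to show the pattern (they will not speed up CPU-bound work the way an OpenMP reduction in C would). The point is that each worker accumulates into private buffers, and the partial losses and gradients are summed once at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def instance_loss_and_gradient(weights, instance):
    """Toy stand-in for the per-instance work: squared error of a linear score."""
    features, target = instance
    error = sum(w * x for w, x in zip(weights, features)) - target
    return 0.5 * error * error, [error * x for x in features]


def chunk_loss_and_gradient(weights, chunk):
    """Accumulate loss and gradient over one chunk into private buffers."""
    loss, grad = 0.0, [0.0] * len(weights)
    for instance in chunk:
        part_loss, part_grad = instance_loss_and_gradient(weights, instance)
        loss += part_loss
        grad = [a + b for a, b in zip(grad, part_grad)]
    return loss, grad


def batch_loss_and_gradient(weights, instances, workers=4):
    """Split the batch across workers, then reduce the partial results."""
    chunks = [instances[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: chunk_loss_and_gradient(weights, c), chunks))
    total_loss = sum(p[0] for p in partials)
    total_grad = [sum(column) for column in zip(*(p[1] for p in partials))]
    return total_loss, total_grad
```

Because floating-point addition is not associative, a parallel reduction can differ from the serial loop in the last bits, which is worth keeping in mind when comparing results between the two modes.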

export LD_LIBRARY is unnecessary for Python bindings

Instead, one can simply run python setup.py build_ext -R PATH_TO_CRFSUITE, which records PATH_TO_CRFSUITE as a runtime library search path so that crfsuite.so is looked up there.

Python Swig: SystemError: <built-in function delete_Item> returned a result with an error set

crfsuite version = 0.12

StopIteration

During handling of the above exception, another exception occurred:

SystemError: <built-in function delete_ItemSequence> returned a result with an error set

During handling of the above exception, another exception occurred:

SystemError: <built-in function delete_StringList> returned a result with an error set

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "..../train.py", line 73, in <module>
    for xseq_, yseq_ in instances(fi_):
SystemError: <built-in function delete_Item> returned a result with an error set

my code:

#!/usr/bin/env python

import crfsuite
import sys
from xxxxx import config
import os.path


# Inherit crfsuite.Trainer to implement message() function, which receives
# progress messages from a training process.
class Trainer(crfsuite.Trainer):
    def message(self, s):
        # Simply output the progress messages to STDOUT.
        sys.stdout.write(s)


def instances(fi):
    xseq = crfsuite.ItemSequence()
    yseq = crfsuite.StringList()
    i = 0
    for line in fi:
        i += 1
        print(i)
        line = line.strip('\n')
        if not line:
            # An empty line presents an end of a sequence.
            # if xseq:
            yield xseq, tuple(yseq)
            xseq = crfsuite.ItemSequence()
            yseq = crfsuite.StringList()
            continue

        # Split the line with TAB characters.
        fields = line.split('\t')

        # Append attributes to the item.
        item = crfsuite.Item()
        for field in fields[1:]:
            p = field.rfind(':')
            if p == -1:
                # Unweighted (weight=1) attribute.
                item.append(crfsuite.Attribute(field))
            else:
                # Weighted attribute
                item.append(crfsuite.Attribute(field[:p], float(field[p + 1:])))

        # Append the item to the item sequence.
        xseq.append(item)
        # Append the label to the label sequence.
        yseq.append(fields[0])

    xseq.erase()


if __name__ == '__main__':

    version = "_v1_python"
    # This demonstrates how to obtain the version string of CRFsuite.
    print("crfsuite version = ", crfsuite.version())

    # Create a Trainer object.
    trainer = Trainer()

    # Read training instances from STDIN, and set them to trainer.
    train_feature_file = config.train_feature_file+"_v1"
    model_file = config.model_file+version

    if not os.path.exists(train_feature_file):
        raise FileNotFoundError("train_file: {} not found".format(train_feature_file))

    if os.path.exists(model_file):
        raise FileExistsError("model_file: {} have been existed".format(model_file))

    with open(train_feature_file, 'r') as fi_:
        for xseq_, yseq_ in instances(fi_):
            trainer.append(xseq_, yseq_, 0)

    # Use L2-regularized SGD and 1st-order dyad features.
    trainer.select('l2sgd', 'crf1d')

    # This demonstrates how to list parameters and obtain their values.
    for name in trainer.params():
        print(name, trainer.get(name), trainer.help(name))

    # Set the coefficient for L2 regularization to 0.1
    trainer.set('c2', '0.1')

    # Start training; the training process will invoke trainer.message()
    # to report the progress.
    trainer.train(model_file, -1)

crfsuite for image classification

./autogen.sh fails on error: automatic de-ANSI-fication support has been removed

On Ubuntu 13.10, after downloading and unpacking crfsuite master:

./autogen.sh
...
aclocal: warning: autoconf input should be named 'configure.ac', not 'configure.in'
configure.in:33: error: automatic de-ANSI-fication support has been removed
/usr/share/aclocal-1.13/obsolete.m4:26: AM_C_PROTOTYPES is expanded from...
configure.in:33: the top level
autom4te: /usr/bin/m4 failed with exit status: 1
aclocal: error: echo failed with exit status: 1
aclocal failed!

Removing the offending macro by replacing

AM_C_PROTOTYPES

with

dnl AM_C_PROTOTYPES

in configure.in fixes the (immediate) issue.
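
The same edit can be scripted. This is a minimal sketch with a hypothetical helper name; it only rewrites text and leaves reading and writing configure.in to the caller. Lines that already start with dnl are left alone, so applying it twice is harmless.

```python
import re


def disable_am_c_prototypes(configure_in_text):
    """Prefix a bare AM_C_PROTOTYPES line with 'dnl ' so that m4 ignores it."""
    return re.sub(
        r"(?m)^([ \t]*)AM_C_PROTOTYPES\b",
        r"\1dnl AM_C_PROTOTYPES",
        configure_in_text,
    )
```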

reserved identifier violation

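I would like to point out that an identifier like “__CRFSUITE_API_HPP__” does not conform to the naming rules of the C++ language standard: identifiers that contain a double underscore, or that begin with an underscore followed by an uppercase letter, are reserved for the implementation.
Would you like to choose different unique names?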
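
For illustration, the rule can be checked mechanically. The sketch below covers only the two cases that are reserved everywhere (a double underscore anywhere in the name, or a leading underscore followed by an uppercase letter); it does not cover names with a single leading underscore, which are reserved only in the global namespace. The function name is hypothetical.

```python
import re

# Matches a leading underscore plus uppercase letter, or a double underscore anywhere.
_RESERVED = re.compile(r"^_[A-Z]|__")


def is_reserved_cpp_identifier(name):
    """Return True if the name is reserved for the C++ implementation in any scope."""
    return bool(_RESERVED.search(name))
```

Under this check, __CRFSUITE_API_HPP__ is reported as reserved, while a guard such as CRFSUITE_API_HPP (a hypothetical replacement) is not.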

Error copying typemap during prepare.sh --swig

I am using SWIG 2.0.9 to build the Python interface. During the first step of the build process (./build.sh --swig) I get the following errors:

/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (directorout) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &DIRECTOROUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (in) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (in) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (typecheck) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (typecheck) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (argout) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *OUTPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (argout) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &OUTPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (typecheck) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (typecheck) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (freearg) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > *INOUT
/opt/local/share/swig/2.0.9/std/std_vector.i:87: Error: Can't copy typemap (freearg) std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INPUT = std::vector< CRFSuite::Item,std::allocator< CRFSuite::Item > > &INOUT

Other people report the same error with SWIG 2.0.8:
http://swig.10945.n7.nabble.com/Error-copying-typemap-when-building-crfsuite-td10775.html

Size of Training Data.

I'm trying to train a model with a text file that is 42 GB in size. I have more than enough memory on my machine, but I get a segmentation fault (core dump) during training. Is there any reason why this would happen?

My team and I have trained multiple models on smaller datasets on the same machine, so we are confident that CRFsuite is set up correctly.
