shangjingbo1226 / autophrase
AutoPhrase: Automated Phrase Mining from Massive Text Corpora
License: Apache License 2.0
It seems that there are some issues (shown below) when running ./phrasal_segmentation.sh on WSL (Windows Subsystem for Linux, Ubuntu 18.04).
===Part-Of-Speech Tagging===
./phrasal_segmentation.sh: 44: [: EN: unexpected operator
Based on this post, either running bash ./phrasal_segmentation.sh solves the problem, or substituting == with = in the ./phrasal_segmentation.sh file does.
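The `[: EN: unexpected operator` message is the key: on Ubuntu, `sh` is dash, whose `[` builtin only supports the POSIX `=` comparison, while `==` is a bash extension. A minimal sketch of the portable form (the variable name is an assumption; the `EN` comparison mirrors line 44 of the script):

```shell
LANG_CODE=EN
# POSIX string comparison with '=': accepted by dash, bash, and zsh alike,
# whereas '==' fails in dash with "unexpected operator".
if [ "$LANG_CODE" = "EN" ]; then
    echo "English selected"
fi
```

Both workarounds fix the same thing: `bash ./phrasal_segmentation.sh` forces an interpreter that understands `==`, while substituting `=` makes the script valid for any POSIX shell.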
Could you publish the files wiki_all.txt and wiki_quality.txt in Portuguese?
Using the Portuguese Wikipedia (ptwiki), the files links, linktext, and redirects.gz were generated.
Using wikidata-20180720-all.json.bz2, the file wikidata.tsv.gz was generated by redirecting the output of that phase (load-wikidata) to a file.
But unfortunately analyse-links generates an entities.gz file that is empty. Can you explain what is happening? We want to use your method on a Portuguese corpus.
What's the correct format for the raw input?
The documentation says each line should be one document, but DBLP.txt contains one document per line followed by a line containing only '.'. Here is example DBLP.txt content:
My Cat Is Object-Oriented.
.
Making Database Systems Fast Enough for CAD Applications.
.
Optimizing Smalltalk Message Performance.
.
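So each title line is a document and the lone `.` lines are separators. If you want plain one-document-per-line input, the separator lines are easy to strip beforehand (a sketch; whether AutoPhrase requires them removed is an assumption):

```shell
# Keep every line except those consisting of a single period.
printf 'My Cat Is Object-Oriented.\n.\nOptimizing Smalltalk Message Performance.\n.\n' \
    | grep -v '^\.$'
```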
gcc version 7.3.0 (Ubuntu 7.3.0-27ubuntu1~18.04)
openjdk version "1.8.0_181"
When I run ./auto_phrase.sh on a freshly cloned repository of AutoPhrase, I get the error POS file doesn't have enough POS tags many hundreds if not thousands of times. I also get the error ERROR: not a parameter file: ./lib/english-utf8.par! at the Part-Of-Speech Tagging step. This doesn't seem like normal behavior, as I have used AutoPhrase on another system and did not get this error.
Is there any way to change the input data corpus while still using the pre-trained model, which was trained on DBLP?
When I apply AutoPhrase to a Chinese corpus (with the wiki_cn model), this error happens.
From my observation, a token A appears only once in the corpus, and when the function mappingBackText in AutoPhrase/tools/tokenizer/src/Tokenizer.java loads this token, the text in the buffer has already moved past the token's position. That is why the error occurs.
When I clone the project and run auto_phrase.sh directly, some exceptions occur:
===Saving Model and Results===
cp: cannot stat 'tmp/segmentation.model': No such file or directory
===Generating Output===
java.io.FileNotFoundException: tmp/final_quality_multi-words.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at Tokenizer.tokenizeText(Tokenizer.java:705)
at Tokenizer.main(Tokenizer.java:856)
java.io.FileNotFoundException: tmp/final_quality_unigrams.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at Tokenizer.tokenizeText(Tokenizer.java:705)
at Tokenizer.main(Tokenizer.java:856)
java.io.FileNotFoundException: tmp/final_quality_salient.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at Tokenizer.tokenizeText(Tokenizer.java:705)
at Tokenizer.main(Tokenizer.java:856)
Loading data...
max word token id = 0
Mining frequent phrases...
selected MAGIC = 1
Extracting features...
Constructing label pools...
The size of the positive pool = 0
The size of the negative pool = 0
Estimating Phrase Quality...
0 0
[ERROR] not enough training data found!
[ERROR] no training data found!
Segmenting...
Rectifying features...
Estimating Phrase Quality...
0 0
[ERROR] not enough training data found!
[ERROR] no training data found!
Segmenting...
Dumping results...
Done.
Certain options, like TEXT_TO_SEG in phrasal_segmentation.sh and RAW_LABEL_FILE in auto_phrase.sh, don't accept arguments the way the other options do. I will send a PR.
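A plausible culprit (an assumption until the PR lands) is the space after `:-` in the defaults these scripts use, e.g. `RAW_TRAIN=${RAW_TRAIN:- data/input.txt}`: the space becomes part of the default value, so the variable silently carries a leading blank. A minimal sketch:

```shell
unset OPT
# A space after ':-' leaks into the default value...
WITH_SPACE=${OPT:- default.txt}
# ...whereas the tight form yields the intended string.
NO_SPACE=${OPT:-default.txt}
echo "[$WITH_SPACE]"   # brackets make the leading blank visible
echo "[$NO_SPACE]"
```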
I am running AutoPhrase on a 1.66 GB corpus, and during phrasal segmentation I got this error. I'm not sure whether I should cut down the size of the corpus or try something else. What is causing this error?
Thanks a lot for making this code available.
I have a particular use case. I wish to train the segmentation model on a large corpus to get the useful key phrases, and then apply that same model to different data. Is there a way that can be done? Looking at phrasal_segmentation.sh, it appears that its code is completely unrelated to the code used in auto_phrase.sh.
I understand that running the code on the combination of the large data plus my data is a possibility, but I am looking for a more efficient solution if one is available.
Do the two scripts share any information, either via the tmp folder or some other way? If all I am interested in is tagging quality phrases in text, should I run auto_phrase.sh followed by phrasal_segmentation.sh, or only the latter?
Finally, since your code uses parallel processing, do you maintain the order of the output values relative to the input values? E.g., if the input file specified in TEXT_TO_SEG has sentences in some order, will the output file segmentation.txt have outputs in the same order?
Lastly, is there a way to have the segmentation functions available as a library that I can call on my text data?
I have some Chinese text and would like to use this tool for phrase mining.
After cloning this project, what is the correct way to run it (if not running the default auto_phrase.sh)?
Thanks!
Hi Jingbo,
I was so excited when I read your paper, but I can't access the DBLP data. I also tried changing my IP proxy, and I still can't access it.
The problem details:
curl http://dmserv2.cs.illinois.edu/data/DBLP.txt.gz --output data/DBLP.txt.gz
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (7) Failed to connect to dmserv2.cs.illinois.edu port 80: Connection refused
After changing the proxy, the downloaded data size is 0 bytes.
I hope that I can get your help soon.
Thanks,
Jianan
I find that AutoPhrase works well for text mining, e.g., topic modeling. May I know if there is an AutoPhrase package in Python? Thank you.
Hi, Jingbo! I tried to run the scripts on the same dataset with different min_sup values, and for each value I get a list of quality phrases and quality single words. Not surprisingly, those lists differ in both length and content. Here is my question: is there a way to evaluate which result is better than another without involving human annotation? For now, I counted how many times each phrase shows up in the documents.
Hi,
First of all, thank you for the wonderful open-source contribution. One thing I found in the previous SegPhrase implementation was that the frequent pattern mining did not scale to a large text corpus because of the Python implementation. Would this implementation scale to a large corpus?
I have a very patchy understanding of C/C++, so I don't quite understand it completely.
Thank you
Sandeep
Hi, I am training on the first 1100 MB of Gutenberg (http://www.gutenberg-tar.com/) and get the following error; nothing is generated in the results folder:
===Compilation===
===Tokenization===
Current step: Tokenizing input file...
real 2m0.784s
user 10m22.952s
sys 0m17.820s
Detected Language: EN
Current step: Tokenizing wikipedia phrases...
No provided expert labels.
===Part-Of-Speech Tagging===
Current step: Merging...
===AutoPhrasing===
=== Current Settings ===
Iterations = 2
Minimum Support Threshold = 30
Maximum Length Threshold = 6
POS-Tagging Mode Enabled
Number of threads = 10
Labeling Method = DPDN
Auto labels from knowledge bases
Max Positive Samples = -1
=======
Loading data...
# of total tokens = 216070506
max word token id = 5117344
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
# of documents = 23194556
# of distinct POS tags = 57
Mining frequent phrases...
selected MAGIC = 5117347
# of frequent phrases = 5420607
Extracting features...
Constructing label pools...
The size of the positive pool = 31419
The size of the negative pool = 5384346
# truth patterns = 220417
Estimating Phrase Quality...
Segmenting...
Rectifying features...
Estimating Phrase Quality...
Segmenting...
Dumping results...
Done.
real 14m30.644s
user 68m46.148s
sys 4m15.356s
===Saving Model and Results===
===Generating Output===
My settings are:
#!/bin/bash
MODEL=${MODEL:- "models/DBLP"}
# RAW_TRAIN is the input of AutoPhrase, where each line is a single document.
RAW_TRAIN=${RAW_TRAIN:- data/input.txt}
# When FIRST_RUN is set to 1, AutoPhrase will run all preprocessing.
# Otherwise, AutoPhrase directly starts from the current preprocessed data in the tmp/ folder.
FIRST_RUN=${FIRST_RUN:- 1}
# When ENABLE_POS_TAGGING is set to 1, AutoPhrase will utilize the POS tagging in the phrase mining.
# Otherwise, a simple length penalty mode as the same as SegPhrase will be used.
ENABLE_POS_TAGGING=${ENABLE_POS_TAGGING:- 1}
# A hard threshold of raw frequency is specified for frequent phrase mining, which will generate a candidate set.
MIN_SUP=${MIN_SUP:- 30}
# You can also specify how many threads can be used for AutoPhrase
THREAD=${THREAD:- 10}
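Because each setting above uses a `${VAR:-default}` expansion, every knob can be overridden from the environment at invocation time (e.g. `MIN_SUP=50 ./auto_phrase.sh`, with 50 as an illustrative value) without editing the script. The mechanism in isolation:

```shell
# The default applies only when the variable is unset or empty;
# an exported or inline value wins otherwise.
MIN_SUP=${MIN_SUP:-30}
THREAD=${THREAD:-10}
echo "MIN_SUP=$MIN_SUP THREAD=$THREAD"
```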
Hi,
I tried to run AutoPhrase on the latest wiki dump data. According to the log, it failed at the "AutoPhrasing" step.
Here is the output:
===Compilation===
===Tokenization===
[Warning] White Space in tokens!!! ᠮᠣᠩᠭᠣᠯᠦᠨ
[Warning] White Space in tokens!!! ᠭᠠᠴᠠᠭᠠ
real 17m55.954s
user 60m25.987s
sys 0m32.384s
Detected Language: EN
Current step: Tokenizing wikipedia phrases...
No provided expert labels.
===Part-Of-Speech Tagging===
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3253k 100 3253k 0 0 2118k 0 0:00:01 0:00:01 --:--:-- 2116k
English parameter file (Linux, UTF8) installed.
Current step: Merging...
===AutoPhrasing===
=== Current Settings ===
Iterations = 2
Minimum Support Threshold = 10
Maximum Length Threshold = 6
POS-Tagging Mode Enabled
Number of threads = 8
Labeling Method = DPDN
Auto labels from knowledge bases
Max Positive Samples = -1
=======
Loading data...
# of total tokens = -1822932212
max word token id = 7063562
# of documents = 5707656
# of distinct POS tags = 57
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
./auto_phrase.sh: line 108: 15694 Aborted (core dumped) ./bin/segphrase_train --pos_tag --thread $THREAD --pos_prune data/BAD_POS_TAGS.txt --label_method $LABEL_METHOD --label $LABEL_FILE --max_positives $MAX_POSITIVES --min_sup $MIN_SUP
real 21m20.400s
user 20m50.328s
sys 0m15.308s
Any suggestion would be helpful, thanks.
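The telltale sign is the negative count "# of total tokens = -1822932212": a signed 32-bit counter wrapped around. Adding 2^32 back recovers the likely true count, which is far beyond what a 32-bit build can index; this appears to be what the `// define LARGE` switch in src/utils/parameters.h (mentioned in a later report) addresses:

```shell
# Undo one signed-32-bit wraparound by adding 2^32 = 4294967296.
printf '%d\n' $(( -1822932212 + 4294967296 ))
# About 2.47 billion tokens, vs. the int32 maximum of 2147483647.
```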
Data scientists should use Python.
Hello, when I run
curl http://dmserv2.cs.illinois.edu/data/DBLP.txt.gz
it fails. Is this server down, or is the file not there any more?
I tried to run on big corpora (4-40 GB, Java 1.8, Ubuntu) and observed the following behavior:
Tokenizing the input file during the training phase crashes with java.lang.OutOfMemoryError: GC overhead limit exceeded. When I tried to tune the Java heap size according to the recommendations on Stack Overflow, I added -Xms20g -Xmx40g -XX:-UseGCOverheadLimit, and tokenizing still crashes with either the same message or Exception in thread "main" java.lang.OutOfMemoryError: Java heap space.
The memory usage is always around 10-16 GB, while I have 32 GB RAM and 32 GB swap, so I expected to be able to process up to a 16 GB corpus according to the 4x estimation for MIN_SUP=30.
I've tried bigger values of MIN_SUP on smaller datasets, but the results seem worse to me.
It seems to me that the Java settings restrict the memory usage. Could you please give me some help on tuning?
The dataset I used for testing was the 37 GB Gutenberg corpus from http://www.gutenberg-tar.com/, which I merged into one big txt file. The maximum file size I can tokenize without error is 2 GB, but then segphrase_train crashes:
segphrase_train: src/classification/../model_training/segmentation.h:504: double Segmentation::adjustPOSTagTransition(std::vector<std::pair<long long int, long long int> >&, int): Assertion `f[i] > -1e80' failed.
./auto_phrase.sh: line 108: 10522 Aborted (core dumped)
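The 2 GB ceiling is suggestive: Java indexes arrays and buffers with a signed 32-bit int, so a single in-memory buffer cannot exceed 2147483647 bytes no matter how large the heap is (whether the tokenizer really holds the whole file in one buffer is an assumption):

```shell
# INT32_MAX, the hard cap on a single Java array/buffer index,
# computed as (1 << 31) - 1 to stay within POSIX shell arithmetic.
printf '%d\n' $(( (1 << 31) - 1 ))
# 2147483647 bytes is just under 2 GiB, matching the observed limit.
```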
In phrasal_segmentation.sh, the default settings are:
HIGHLIGHT_MULTI=${HIGHLIGHT_MULTI:- 0.5}
HIGHLIGHT_SINGLE=${HIGHLIGHT_SINGLE:- 0.8}
to segment the quality phrases.
I used phrasal_segmentation.sh to segment my own corpus; before that, I had obtained segmentation.model from the same corpus.
However, the final segmentation.txt I got treats many low-quality single words as phrases:
e.g.
The paper describes a natural language based expert system route advisor for the public bus transport in Trondheim, Norway.
even though the quality scores of these single words are much lower than 0.8:
0.4528807447 paper
0.5108750335 public
0.5367821510 bus
...
First, I replaced DBLP.txt with my corpus and appended some terms to wiki_quality.txt. Then I ran auto_phrase.sh, and everything went well. Next, I chose the first 10 lines of the corpus as a test file and set TEXT_TO_SEG to the file path. However, when I ran phrasal_segmentation.sh, it threw an exception, as shown below:
===Compilation===
===Tokenization===
[pool-1-thread-3] WARN DICLOG - not find library.properties in classpath use it by default !
[pool-1-thread-3] WARN DICLOG - init userLibrary warning :/home/zhouyisha/AutoPhrase/library/default.dic because : file not found or failed to read !
[pool-1-thread-3] WARN DICLOG - init ambiguity error :/home/zhouyisha/AutoPhrase/library/ambiguity.dic because : not find that file or can not found!
[pool-1-thread-2] INFO DICLOG - init core library ok use time :549
[pool-1-thread-4] INFO DICLOG - init ngram ok use time :508
real 0m1.677s
user 0m4.064s
sys 0m0.180s
Detected Language: CN
===Part-Of-Speech Tagging===
===Phrasal Segmentation===
=== Current Settings ===
Segmentation Model Path = models/DBLP/segmentation.model
After the phrasal segmentation, only following phrases will be highlighted with <phrase> and </phrase>
Q(multi-word phrases) >= 0.500000
Q(single-word phrases) >= 0.800000
=======
POS guided model loaded.
# of loaded patterns = 1656
# of loaded truth patterns = 3206
POS transition matrix loaded
Phrasal segmentation finished.
# of total highlighted quality phrases = 11
# of total processed sentences = 20
avg highlights per sentence = 0.55
real 0m0.010s
user 0m0.008s
sys 0m0.000s
===Generating Output===
java.io.FileNotFoundException: tmp/raw_tokenized_text_to_seg.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at Tokenizer.mappingBackText(Tokenizer.java:595)
at Tokenizer.main(Tokenizer.java:848)
I checked the tmp folder, and indeed there was no file named raw_tokenized_text_to_seg.txt.
Could you help me find out why this file hasn't been generated? And during which step should it be generated? Thanks a lot.
Hi,
This is great work! I have read the paper and experimented a lot with this code. I focus on Chinese corpora. I have some questions about the data, as follows:
like positive:
add negative_quality.txt and negative_all.txt
Since WikipediaEntities has no running guide, I skimmed its source code; its process should be:
But step 3 needs a file called wikidata.tsv.gz. Where does this file come from?
Hi Jingbo,
A segmentation fault occurs when I change TRAIN_DATA. I have un-commented the // define LARGE at the beginning of src/utils/parameters.h. The training data's size is 35 GB. The details are below:
# of documents = 175912840
# of distinct POS tags = 56
Mining frequent phrases...
selected MAGIC = 3746593
./auto_phrase.sh: line 110: 16940 Segmentation fault (core dumped) ./bin/segphrase_train --pos_tag --thread $THREAD --pos_prune data/BAD_POS_TAGS.txt --label_method $LABEL_METHOD --label $LABEL_FILE --max_positives $MAX_POSITIVES --min_sup $MIN_SUP
Best wishes,
Jianan
When I changed DBLP.txt to my own corpus, I ran into the problem below:
I don't know where to find library.properties. Can you tell me how to solve this? Thank you very much.
[pool-1-thread-1] WARN DICLOG - not find library.properties in classpath use it by default !
[pool-1-thread-1] WARN DICLOG - init userLibrary warning :/homeh/AutoPhrasebrary/default.dic because : file not found or failed to read !
[pool-1-thread-1] WARN DICLOG - init ambiguity error :/homeh/AutoPhrasebrarybiguity.dic because : not find that file or can not found!
[pool-1-thread-1] INFO DICLOG - init core library ok use time :607
[pool-1-thread-1] INFO DICLOG - init ngram ok use time :815
real 0m1.736s
user 0m4.317s
sys 0m0.141s
Detected Language: CN
[pool-1-thread-2] WARN DICLOG - not find library.properties in classpath use it by default !
[pool-1-thread-2] WARN DICLOG - init userLibrary warning :/homeh/AutoPhrasebrary/default.dic because : file not found or failed to read !
[pool-1-thread-2] WARN DICLOG - init ambiguity error :/homeh/AutoPhrasebrarybiguity.dic because : not find that file or can not found!
Missing parameter file english.utf8.par
Some parameter files can't be downloaded automatically due to an invalid URL. I tried to download english.par and replace english.utf8.par with it, but that doesn't work. I would appreciate it if I could get this parameter file from you.
I have trained a model on Wikipedia, so I have segmentation.model and the list of extracted phrases. How can I apply this model to a new corpus to extract new phrases? Is that possible? Or does phrasal_segmentation.sh only highlight phrases extracted from the original corpus?
===Compilation===
compile.sh: line 5: make: command not found
===Tokenization===
[pool-1-thread-8] WARN DICLOG - not find library.properties in classpath use it by default !
[pool-1-thread-8] WARN DICLOG - init userLibrary warning :/autophrase/library/default.dic because : file not found or failed to read !
[pool-1-thread-8] WARN DICLOG - init ambiguity error :/autophrase/library/ambiguity.dic because : not find that file or can not found!
In your last commit you included changes to src/frequent_pattern_mining/frequent_pattern_mining.h. Specifically, you use TOTAL_TOKEN_TYPE, but this type is not defined anywhere. As such, the project doesn't build, giving the errors:
src/frequent_pattern_mining/frequent_pattern_mining.h:157:31: error: ‘TOTAL_TOKEN_TYPE’ was not declared in this scope
src/frequent_pattern_mining/frequent_pattern_mining.h:157:31: note: suggested alternative: ‘TOTAL_TOKENS_TYPE’
inline bool pruneByPOSTag(TOTAL_TOKEN_TYPE st, TOTAL_TOKEN_TYPE ed)
^~~~~~~~~~~~~~~~
TOTAL_TOKENS_TYPE
Build attempted with the following dependencies enabled:
1) gcc/7.1.0 3) mpich/3.1.4 5) git/2.6.3 7) java/1.8.0
2) python/3.4 4) hdp/0.1 6) gsl/2.3
parameters.h, line 24:
typedef unsigned char POS_ID_TYPE;
segment.cpp, line 115:
POS_ID_TYPE posTagId = -1;
segmentation.h, line 308 and many other lines:
tags[j] >= 0
Assigning -1 to an unsigned type wraps around to 255, so comparisons like tags[j] >= 0 are always true.
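The effect is easy to demonstrate outside C++ by emulating an unsigned 8-bit store with shell arithmetic (masking with 0xFF mimics how an unsigned char truncates the value):

```shell
# Storing -1 into an unsigned 8-bit slot keeps only the low byte.
posTagId=$(( -1 & 0xFF ))
echo "$posTagId"                       # prints 255, not -1
# Hence a guard like `tags[j] >= 0` on POS_ID_TYPE can never be false.
[ "$posTagId" -ge 0 ] && echo "guard always passes"
```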
When I execute auto_phrase.sh, some exceptions are thrown:
java.io.FileNotFoundException: tmp/final_quality_multi-words.txt
java.io.FileNotFoundException: tmp/final_quality_unigrams.txt
java.io.FileNotFoundException: tmp/final_quality_salient.txt
According to the error information, the problem is located
at Tokenizer.tokenizeText(Tokenizer.java:618)
at Tokenizer.main(Tokenizer.java:766)
So, how do I resolve this?
Should be:
Ubuntu:
g++ 4.8: $ sudo apt-get install g++-4.8
Java 8: $ sudo apt-get install openjdk-8-jdk
curl: $ sudo apt-get install curl
(Also, curl is required.)
What is the strategy for the mapping-back part of AutoPhrase? We found that the mapping-back program fails on some corpora; we have solved the letter-case problem, but the results still do not seem good enough.
This issue occurred when I ran ./phrasal_segmentation.sh, at the Generating Output stage.
I used my own data corpus (22 MB in size) while still using the pre-trained model. (As mentioned in Issue #22, I changed TEXT_TO_SEG in the phrasal_segmentation.sh script.)
I have tried increasing the buffer limit from the default size of 8192 to 16384 (2 x 8192), 32768 (4 x 8192), 65536 (8 x 8192), 131072 (16 x 8192), and 262144 (32 x 8192), but to no avail.
How may I resolve this issue?
It reports the following errors:
tools/tokenizer/src/Tokenizer.java:415: error: local variable language is accessed from within inner class; needs to be declared final
if (!hasSuitableAnalyzer(language)) {
^
tools/tokenizer/src/Tokenizer.java:416: error: local variable language is accessed from within inner class; needs to be declared final
SpecialTagger tagger = getTagger(Thread.currentThread().getName(), language, mode);
^
tools/tokenizer/src/Tokenizer.java:416: error: local variable mode is accessed from within inner class; needs to be declared final
SpecialTagger tagger = getTagger(Thread.currentThread().getName(), language, mode);
^
tools/tokenizer/src/Tokenizer.java:417: error: local variable mode is accessed from within inner class; needs to be declared final
token_pairs = lineToTokens(tagger, line, mode);
^
tools/tokenizer/src/Tokenizer.java:419: error: local variable language is accessed from within inner class; needs to be declared final
Analyzer analyzer = getAnalyzer(Thread.currentThread().getName(), language, mode);
^
tools/tokenizer/src/Tokenizer.java:419: error: local variable mode is accessed from within inner class; needs to be declared final
Analyzer analyzer = getAnalyzer(Thread.currentThread().getName(), language, mode);
^
tools/tokenizer/src/Tokenizer.java:420: error: local variable mode is accessed from within inner class; needs to be declared final
token_pairs = lineToTokens(analyzer, line, mode);
^
tools/tokenizer/src/Tokenizer.java:425: error: local variable mode is accessed from within inner class; needs to be declared final
if (mode.equals("train") || mode.equals("test") || mode.equals("direct_test")) {
^
tools/tokenizer/src/Tokenizer.java:425: error: local variable mode is accessed from within inner class; needs to be declared final
if (mode.equals("train") || mode.equals("test") || mode.equals("direct_test")) {
^
tools/tokenizer/src/Tokenizer.java:425: error: local variable mode is accessed from within inner class; needs to be declared final
if (mode.equals("train") || mode.equals("test") || mode.equals("direct_test")) {
^
tools/tokenizer/src/Tokenizer.java:426: error: local variable tag_writer is accessed from within inner class; needs to be declared final
if (tag_writer != null) {
^
tools/tokenizer/src/Tokenizer.java:434: error: local variable tag_writer is accessed from within inner class; needs to be declared final
if (tag_writer == null || mode.equals("test") || mode.equals("direct_test")) { // we always need raw tokens under the test mode
^
tools/tokenizer/src/Tokenizer.java:434: error: local variable mode is accessed from within inner class; needs to be declared final
if (tag_writer == null || mode.equals("test") || mode.equals("direct_test")) { // we always need raw tokens under the test mode
^
tools/tokenizer/src/Tokenizer.java:434: error: local variable mode is accessed from within inner class; needs to be declared final
if (tag_writer == null || mode.equals("test") || mode.equals("direct_test")) { // we always need raw tokens under the test mode
^
tools/tokenizer/src/Tokenizer.java:449: error: local variable mode is accessed from within inner class; needs to be declared final
if (mode.equals("train") && case_sen.equals("Y")) {
^
tools/tokenizer/src/Tokenizer.java:449: error: local variable case_sen is accessed from within inner class; needs to be declared final
if (mode.equals("train") && case_sen.equals("Y")) {
^
tools/tokenizer/src/Tokenizer.java:464: error: local variable mode is accessed from within inner class; needs to be declared final
} else if (mode.equals("train") && case_sen.equals("N")) {
^
tools/tokenizer/src/Tokenizer.java:464: error: local variable case_sen is accessed from within inner class; needs to be declared final
} else if (mode.equals("train") && case_sen.equals("N")) {
^
tools/tokenizer/src/Tokenizer.java:495: error: local variable mode is accessed from within inner class; needs to be declared final
} else if (mode.equals("test") || mode.equals("direct_test")) {
^
tools/tokenizer/src/Tokenizer.java:495: error: local variable mode is accessed from within inner class; needs to be declared final
} else if (mode.equals("test") || mode.equals("direct_test")) {
^
tools/tokenizer/src/Tokenizer.java:500: error: local variable case_sen is accessed from within inner class; needs to be declared final
if (case_sen.equals("N")) {
^
tools/tokenizer/src/Tokenizer.java:515: error: local variable mode is accessed from within inner class; needs to be declared final
else if (mode.equals("translate")) {
^
22 errors
How do I fix this?
Thanks!
" The language in the input will be automatically detected". So there's no way for me to apply your algorithm for my language (Vietnamese). I think you should make the algorithm flexibly by allowing human efforts so that if people make use of this algorithm in clever way and suitable for their language, they will get a big step in processing languages. Thank you.
Hey, I'm trying to use AutoPhrase on MEDLINE2017 (nearly 30 million documents).
When I run it, this is the error. It looks like the number of total tokens may have overflowed?
===AutoPhrasing===
=== Current Settings ===
Iterations = 2
Minimum Support Threshold = 20
Maximum Length Threshold = 6
POS-Tagging Mode Enabled
Number of threads = 24
Labeling Method = DPDN
Auto labels from knowledge bases
Max Positive Samples = -1
=======
Loading data...
# of total tokens = -349783405
max word token id = 44869137
terminate called after throwing an instance of 'std::bad_array_new_length'
what(): std::bad_array_new_length
./auto_phrase.sh: line 110: 157112 Aborted ./bin/segphrase_train --pos_tag --thread $THREAD --pos_prune data/BAD_POS_TAGS.txt --label_method $LABEL_METHOD --label $LABEL_FILE --max_positives $MAX_POSITIVES --min_sup $MIN_SUP
real 3m57.970s
user 3m48.907s
sys 0m7.165s
I got this error when I ran phrasal_segmentation.sh on a huge txt file (990K lines, 68 MB).
I cannot provide the file to reproduce the problem because the data is not public.
I got some partial output in segmentation.txt, so my best guess is that the file is too large?
Please let me know if there is an easy way to solve this issue.
I tried running AutoPhrase on a dataset of mine followed by phrasal segmentation. I used my own expert labels file and for the phrasal segmentation set MULTIWORD cutoff at 0.7 and SINGLEWORD cutoff at 0.8, but I ended up getting phrases like the following in my results.
We aimed to test <phrase>the</phrase> overexpansion of <phrase>the</phrase> <phrase>BVS</phrase> scaffold in vitro and evaluate the impact of <phrase>excessive</phrase> <phrase>scaffold</phrase> <phrase>oversizing</phrase> <phrase>on</phrase> <phrase>focal point</phrase> support
In the above example, phrases such as "the" and "on" should not be included, as they conflict with the SegPhrase rule that a phrase should be filtered out if it ends with a stopword. I didn't edit the stopword file, so this behavior is strange.
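As a stopgap while the filtering rule is investigated, highlighted stopwords can be stripped from segmentation.txt with a post-filter (a sketch; the stopword alternation here is a tiny illustrative subset, not the repo's actual stopword file):

```shell
# Unwrap <phrase> markup around a few known stopwords, leaving
# legitimate multi-word highlights untouched.
echo 'impact <phrase>on</phrase> <phrase>focal point</phrase> support' \
    | sed -E 's#<phrase>(the|on|of|a|an)</phrase>#\1#g'
```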
Hi, Jingbo!
I ran the bash file on exactly the same input file, with exactly the same environment variables (min_sup), several times, but I got different results every time. Could you please let me know why this happens?
Thank you,
Yuan
I have edited the Tokenizer.java file, setting the maximum value to 100000000 on the following line:
AutoPhrase/tools/tokenizer/src/Tokenizer.java
Line 628 in 5f49499
"[Fatal Error] Load Limit Exceeded! You may want to modify the load limit in the Tokenizer.java
But I still get this error. What is the load limit based on? My data has around 200,000,000 phrases.
DBLP.txt.gz seems to have been moved, as the URL (http://dmserv2.cs.illinois.edu/data/DBLP.txt.gz) currently leads to a 404. What is the new URL for the data?
I have already cloned WikipediaEntities, but I don't know how to use it, and its README doesn't explain how either.
Thanks!
not enough training data found!
Hello, I have a question: when I try to use my own input.txt, this error always occurs.
Could you tell me the reason?
Thank you so much!