shangjingbo1226 / autophrase
AutoPhrase: Automated Phrase Mining from Massive Text Corpora
License: Apache License 2.0
It seems that there are some issues (shown below) when running ./phrasal_segmentation.sh on WSL (Windows Subsystem for Linux, Ubuntu 18.04).
===Part-Of-Speech Tagging===
./phrasal_segmentation.sh: 44: [: EN: unexpected operator
Based on this post, either running bash ./phrasal_segmentation.sh solves the problem, or substituting == with = in the ./phrasal_segmentation.sh file does.
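The `[: EN: unexpected operator` message is the key: on Ubuntu, `sh` is dash, whose `[` builtin only supports the POSIX `=` comparison, while `==` is a bash extension. A minimal sketch of the portable form (the variable name is an assumption; the `EN` comparison mirrors line 44 of the script):

```shell
LANG_CODE=EN
# POSIX string comparison with '=': accepted by dash, bash, and zsh alike,
# whereas '==' fails in dash with "unexpected operator".
if [ "$LANG_CODE" = "EN" ]; then
    echo "English selected"
fi
```

Both workarounds fix the same thing: `bash ./phrasal_segmentation.sh` forces an interpreter that understands `==`, while substituting `=` makes the script valid for any POSIX shell.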
Could you publish the files wiki_all.txt and wiki_quality.txt in Portuguese?
Using the Portuguese Wikipedia (ptwiki), the files links, linktext, and redirects.gz were generated.
Using wikidata-20180720-all.json.bz2, the file wikidata.tsv.gz was generated by redirecting the output of that phase (load-wikidata) to a file.
But unfortunately analyse-links generates an entities.gz file that is empty. Can you explain what is happening? We want to use your method on a Portuguese corpus.
What's the correct format for the raw input?
The documentation says each line should be one document, but DBLP.txt contains one document per line followed by a line containing only '.'. Here is example DBLP.txt content:
My Cat Is Object-Oriented.
.
Making Database Systems Fast Enough for CAD Applications.
.
Optimizing Smalltalk Message Performance.
.
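So each title line is a document and the lone `.` lines are separators. If you want plain one-document-per-line input, the separator lines are easy to strip beforehand (a sketch; whether AutoPhrase requires them removed is an assumption):

```shell
# Keep every line except those consisting of a single period.
printf 'My Cat Is Object-Oriented.\n.\nOptimizing Smalltalk Message Performance.\n.\n' \
    | grep -v '^\.$'
```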
gcc version 7.3.0 (Ubuntu 7.3.0-27ubuntu1~18.04)
openjdk version "1.8.0_181"
When I run ./auto_phrase.sh on a freshly cloned repository of AutoPhrase, I get the error POS file doesn't have enough POS tags many hundreds if not thousands of times. I also get the error ERROR: not a parameter file: ./lib/english-utf8.par! at the Part-Of-Speech Tagging step. This doesn't seem like normal behavior, as I have used AutoPhrase on another system and did not get this error.
Is there any way to change the input data corpus while still using the pre-trained model, which was trained on DBLP?
When I apply AutoPhrase to a Chinese corpus (with the wiki_cn model), this error happens.
From my observation, a token A appears only once in the corpus, and when the function mappingBackText in AutoPhrase/tools/tokenizer/src/Tokenizer.java loads this token, the text in the buffer has already moved past the token's position. That is why the error occurs.
When I clone the project and run auto_phrase.sh directly, some exceptions occur:
===Saving Model and Results===
cp: cannot stat 'tmp/segmentation.model': No such file or directory
===Generating Output===
java.io.FileNotFoundException: tmp/final_quality_multi-words.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at Tokenizer.tokenizeText(Tokenizer.java:705)
at Tokenizer.main(Tokenizer.java:856)
java.io.FileNotFoundException: tmp/final_quality_unigrams.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at Tokenizer.tokenizeText(Tokenizer.java:705)
at Tokenizer.main(Tokenizer.java:856)
java.io.FileNotFoundException: tmp/final_quality_salient.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at Tokenizer.tokenizeText(Tokenizer.java:705)
at Tokenizer.main(Tokenizer.java:856)
Loading data...
max word token id = 0
Mining frequent phrases...
selected MAGIC = 1
Extracting features...
Constructing label pools...
The size of the positive pool = 0
The size of the negative pool = 0
Estimating Phrase Quality...
0 0
[ERROR] not enough training data found!
[ERROR] no training data found!
Segmenting...
Rectifying features...
Estimating Phrase Quality...
0 0
[ERROR] not enough training data found!
[ERROR] no training data found!
Segmenting...
Dumping results...
Done.
Certain options, like TEXT_TO_SEG in phrasal_segmentation.sh and RAW_LABEL_FILE in auto_phrase.sh, don't accept arguments the way the other options do. I will send a PR.
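A plausible culprit (an assumption until the PR lands) is the space after `:-` in the defaults these scripts use, e.g. `RAW_TRAIN=${RAW_TRAIN:- data/input.txt}`: the space becomes part of the default value, so the variable silently carries a leading blank. A minimal sketch:

```shell
unset OPT
# A space after ':-' leaks into the default value...
WITH_SPACE=${OPT:- default.txt}
# ...whereas the tight form yields the intended string.
NO_SPACE=${OPT:-default.txt}
echo "[$WITH_SPACE]"   # brackets make the leading blank visible
echo "[$NO_SPACE]"
```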
I am running AutoPhrase on a 1.66 GB corpus, and during phrasal segmentation I got this error. I'm not sure whether I should cut down the size of the corpus or try something else. What is causing this error?
Thanks a lot for making this code available.
I have a particular use case. I wish to train the segmentation model on a large corpus to get the useful key phrases, and then apply that same model to different data. Is there a way that can be done? Looking at phrasal_segmentation.sh, it appears that its code is completely unrelated to the code used in auto_phrase.sh.
I understand that running the code on the combination of the large data plus my data is a possibility, but I am looking for a more efficient solution if one is available.
Do the two scripts share any information, either via the tmp folder or some other way? If all I am interested in is tagging quality phrases in text, should I run auto_phrase.sh followed by phrasal_segmentation.sh, or only the latter?
Finally, since your code uses parallel processing, do you maintain the order of the output values relative to the input values? E.g., if the input file specified in TEXT_TO_SEG has sentences in some order, will the output file segmentation.txt have outputs in the same order?
Lastly, is there a way to have the segmentation functions available as a library that I can call on my text data?
I have some Chinese text and would like to use this tool for phrase mining.
After cloning this project, what is the correct way to run it (if not running the default auto_phrase.sh)?
Thanks!
Hi Jingbo,
I was so excited when I read your paper, but I can't access the DBLP data. I also tried changing my IP proxy, and I still can't access it.
The problem details:
curl http://dmserv2.cs.illinois.edu/data/DBLP.txt.gz --output data/DBLP.txt.gz
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (7) Failed to connect to dmserv2.cs.illinois.edu port 80: Connection refused
After changing the proxy, the downloaded data size is 0 bytes.
I hope that I can get your help soon.
Thanks,
Jianan
I find that AutoPhrase works well for text mining, e.g., topic modeling. May I know if there is an AutoPhrase package in Python? Thank you.
Hi, Jingbo! I tried to run the scripts on the same dataset with different min_sup values, and for each value I get a list of quality phrases and quality single words. Not surprisingly, those lists differ in both length and content. Here is my question: is there a way to evaluate which result is better than another without involving human annotation? For now, I counted how many times each phrase shows up in the documents.
Hi,
First of all, thank you for the wonderful open-source contribution. One thing I found in the previous SegPhrase implementation was that the frequent pattern mining did not scale to a large text corpus because of the Python implementation. Would this implementation scale to a large corpus?
I have a very patchy understanding of C/C++, so I don't quite understand it completely.
Thank you
Sandeep
Hi, I am training on the first 1100 MB of Gutenberg (http://www.gutenberg-tar.com/) and get the following error; nothing is generated in the results folder:
===Compilation===
===Tokenization===
Current step: Tokenizing input file...
real 2m0.784s
user 10m22.952s
sys 0m17.820s
Detected Language: EN
Current step: Tokenizing wikipedia phrases...
No provided expert labels.
===Part-Of-Speech Tagging===
Current step: Merging...
===AutoPhrasing===
=== Current Settings ===
Iterations = 2
Minimum Support Threshold = 30
Maximum Length Threshold = 6
POS-Tagging Mode Enabled
Number of threads = 10
Labeling Method = DPDN
Auto labels from knowledge bases
Max Positive Samples = -1
=======
Loading data...
# of total tokens = 216070506
max word token id = 5117344
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
POS file doesn't have enough POS tags
# of documents = 23194556
# of distinct POS tags = 57
Mining frequent phrases...
selected MAGIC = 5117347
# of frequent phrases = 5420607
Extracting features...
Constructing label pools...
The size of the positive pool = 31419
The size of the negative pool = 5384346
# truth patterns = 220417
Estimating Phrase Quality...
Segmenting...
Rectifying features...
Estimating Phrase Quality...
Segmenting...
Dumping results...
Done.
real 14m30.644s
user 68m46.148s
sys 4m15.356s
===Saving Model and Results===
===Generating Output===
My settings are:
#!/bin/bash
MODEL=${MODEL:- "models/DBLP"}
# RAW_TRAIN is the input of AutoPhrase, where each line is a single document.
RAW_TRAIN=${RAW_TRAIN:- data/input.txt}
# When FIRST_RUN is set to 1, AutoPhrase will run all preprocessing.
# Otherwise, AutoPhrase directly starts from the current preprocessed data in the tmp/ folder.
FIRST_RUN=${FIRST_RUN:- 1}
# When ENABLE_POS_TAGGING is set to 1, AutoPhrase will utilize the POS tagging in the phrase mining.
# Otherwise, a simple length penalty mode as the same as SegPhrase will be used.
ENABLE_POS_TAGGING=${ENABLE_POS_TAGGING:- 1}
# A hard threshold of raw frequency is specified for frequent phrase mining, which will generate a candidate set.
MIN_SUP=${MIN_SUP:- 30}
# You can also specify how many threads can be used for AutoPhrase
THREAD=${THREAD:- 10}
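Because each setting above uses a `${VAR:-default}` expansion, every knob can be overridden from the environment at invocation time (e.g. `MIN_SUP=50 ./auto_phrase.sh`, with 50 as an illustrative value) without editing the script. The mechanism in isolation:

```shell
# The default applies only when the variable is unset or empty;
# an exported or inline value wins otherwise.
MIN_SUP=${MIN_SUP:-30}
THREAD=${THREAD:-10}
echo "MIN_SUP=$MIN_SUP THREAD=$THREAD"
```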
Hi,
I tried to run AutoPhrase on the latest wiki dump data. According to the log, it failed at the "AutoPhrasing" step.
Here is the output:
===Compilation===
===Tokenization===
[Warning] White Space in tokens!!! ᠮᠣᠩᠭᠣᠯᠦᠨ
[Warning] White Space in tokens!!! ᠭᠠᠴᠠᠭᠠ
real 17m55.954s
user 60m25.987s
sys 0m32.384s
Detected Language: EN
Current step: Tokenizing wikipedia phrases...
No provided expert labels.
===Part-Of-Speech Tagging===
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3253k 100 3253k 0 0 2118k 0 0:00:01 0:00:01 --:--:-- 2116k
English parameter file (Linux, UTF8) installed.
Current step: Merging...
===AutoPhrasing===
=== Current Settings ===
Iterations = 2
Minimum Support Threshold = 10
Maximum Length Threshold = 6
POS-Tagging Mode Enabled
Number of threads = 8
Labeling Method = DPDN
Auto labels from knowledge bases
Max Positive Samples = -1
=======
Loading data...
# of total tokens = -1822932212
max word token id = 7063562
# of documents = 5707656
# of distinct POS tags = 57
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
./auto_phrase.sh: line 108: 15694 Aborted (core dumped) ./bin/segphrase_train --pos_tag --thread $THREAD --pos_prune data/BAD_POS_TAGS.txt --label_method $LABEL_METHOD --label $LABEL_FILE --max_positives $MAX_POSITIVES --min_sup $MIN_SUP
real 21m20.400s
user 20m50.328s
sys 0m15.308s
Any suggestion would be helpful, thanks.
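The telltale sign is the negative count "# of total tokens = -1822932212": a signed 32-bit counter wrapped around. Adding 2^32 back recovers the likely true count, which is far beyond what a 32-bit build can index; this appears to be what the `// define LARGE` switch in src/utils/parameters.h (mentioned in a later report) addresses:

```shell
# Undo one signed-32-bit wraparound by adding 2^32 = 4294967296.
printf '%d\n' $(( -1822932212 + 4294967296 ))
# About 2.47 billion tokens, vs. the int32 maximum of 2147483647.
```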
Data scientists should use Python.
Hello, when I run
curl http://dmserv2.cs.illinois.edu/data/DBLP.txt.gz
it fails. Is this server down, or is the file not there any more?
I tried to run on big corpora (4-40 GB, Java 1.8, Ubuntu) and observed the following behavior:
Tokenizing the input file during the training phase crashes with java.lang.OutOfMemoryError: GC overhead limit exceeded. When I tried to tune the Java heap size according to the recommendations on Stack Overflow, I added -Xms20g -Xmx40g -XX:-UseGCOverheadLimit, and tokenizing still crashes with either the same message or Exception in thread "main" java.lang.OutOfMemoryError: Java heap space.
The memory usage is always around 10-16 GB, while I have 32 GB RAM and 32 GB swap, so I expected to be able to process up to a 16 GB corpus according to the 4x estimation for MIN_SUP=30.
I've tried bigger values of MIN_SUP on smaller datasets, but the results seem worse to me.
It seems to me that the Java settings restrict the memory usage. Could you please give me some help on tuning?
The dataset I used for testing was the 37 GB Gutenberg corpus from http://www.gutenberg-tar.com/, which I merged into one big txt file. The maximum file size I can tokenize without error is 2 GB, but then segphrase_train crashes:
segphrase_train: src/classification/../model_training/segmentation.h:504: double Segmentation::adjustPOSTagTransition(std::vector<std::pair<long long int, long long int> >&, int): Assertion `f[i] > -1e80' failed.
./auto_phrase.sh: line 108: 10522 Aborted (core dumped)
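The 2 GB ceiling is suggestive: Java indexes arrays and buffers with a signed 32-bit int, so a single in-memory buffer cannot exceed 2147483647 bytes no matter how large the heap is (whether the tokenizer really holds the whole file in one buffer is an assumption):

```shell
# INT32_MAX, the hard cap on a single Java array/buffer index,
# computed as (1 << 31) - 1 to stay within POSIX shell arithmetic.
printf '%d\n' $(( (1 << 31) - 1 ))
# 2147483647 bytes is just under 2 GiB, matching the observed limit.
```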
In phrasal_segmentation.sh, the default settings are:
HIGHLIGHT_MULTI=${HIGHLIGHT_MULTI:- 0.5}
HIGHLIGHT_SINGLE=${HIGHLIGHT_SINGLE:- 0.8}
to segment the quality phrases.
I used phrasal_segmentation.sh to segment my own corpus; before that, I had obtained segmentation.model from the same corpus.
However, the final segmentation.txt I got treats many low-quality single words as phrases:
e.g.
The paper describes a natural language based expert system route advisor for the public bus transport in Trondheim, Norway.
even though the quality scores of these single words are much lower than 0.8:
0.4528807447 paper
0.5108750335 public
0.5367821510 bus
...
First, I replaced DBLP.txt with my corpus and appended some terms to wiki_quality.txt. Then I ran auto_phrase.sh, and everything went well. Next, I chose the first 10 lines of the corpus as a test file and set TEXT_TO_SEG to the file path. However, when I ran phrasal_segmentation.sh, it threw an exception, as shown below:
===Compilation===
===Tokenization===
[pool-1-thread-3] WARN DICLOG - not find library.properties in classpath use it by default !
[pool-1-thread-3] WARN DICLOG - init userLibrary warning :/home/zhouyisha/AutoPhrase/library/default.dic because : file not found or failed to read !
[pool-1-thread-3] WARN DICLOG - init ambiguity error :/home/zhouyisha/AutoPhrase/library/ambiguity.dic because : not find that file or can not found!
[pool-1-thread-2] INFO DICLOG - init core library ok use time :549
[pool-1-thread-4] INFO DICLOG - init ngram ok use time :508
real 0m1.677s
user 0m4.064s
sys 0m0.180s
Detected Language: CN
===Part-Of-Speech Tagging===
===Phrasal Segmentation===
=== Current Settings ===
Segmentation Model Path = models/DBLP/segmentation.model
After the phrasal segmentation, only following phrases will be highlighted with <phrase> and </phrase>
Q(multi-word phrases) >= 0.500000
Q(single-word phrases) >= 0.800000
=======
POS guided model loaded.
# of loaded patterns = 1656
# of loaded truth patterns = 3206
POS transition matrix loaded
Phrasal segmentation finished.
# of total highlighted quality phrases = 11
# of total processed sentences = 20
avg highlights per sentence = 0.55
real 0m0.010s
user 0m0.008s
sys 0m0.000s
===Generating Output===
java.io.FileNotFoundException: tmp/raw_tokenized_text_to_seg.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at Tokenizer.mappingBackText(Tokenizer.java:595)
at Tokenizer.main(Tokenizer.java:848)
I checked the tmp folder, and indeed there was no file named raw_tokenized_text_to_seg.txt.
Could you help me find out why this file hasn't been generated? And during which step should it be generated? Thanks a lot.
Hi,
This is great work! I have read the paper and experimented a lot with this code. I focus on Chinese corpora. I have some questions about the data, as follows:
like positive:
add negative_quality.txt and negative_all.txt
Since WikipediaEntities has no running guide, I skimmed its source code; its process should be:
But step 3 needs a file called wikidata.tsv.gz. Where does this file come from?
Hi Jingbo,
A segmentation fault occurs when I change TRAIN_DATA. I have un-commented the // define LARGE at the beginning of src/utils/parameters.h. The training data's size is 35 GB. The details are below:
# of documents = 175912840
# of distinct POS tags = 56
Mining frequent phrases...
selected MAGIC = 3746593
./auto_phrase.sh: line 110: 16940 Segmentation fault (core dumped) ./bin/segphrase_train --pos_tag --thread $THREAD --pos_prune data/BAD_POS_TAGS.txt --label_method $LABEL_METHOD --label $LABEL_FILE --max_positives $MAX_POSITIVES --min_sup $MIN_SUP
Best wishes,
Jianan
When I changed DBLP.txt to my own corpus, I ran into the problem below:
I don't know where to find library.properties. Can you tell me how to solve this? Thank you very much.
[pool-1-thread-1] WARN DICLOG - not find library.properties in classpath use it by default !
[pool-1-thread-1] WARN DICLOG - init userLibrary warning :/homeh/AutoPhrasebrary/default.dic because : file not found or failed to read !
[pool-1-thread-1] WARN DICLOG - init ambiguity error :/homeh/AutoPhrasebrarybiguity.dic because : not find that file or can not found!
[pool-1-thread-1] INFO DICLOG - init core library ok use time :607
[pool-1-thread-1] INFO DICLOG - init ngram ok use time :815
real 0m1.736s
user 0m4.317s
sys 0m0.141s
Detected Language: CN
[pool-1-thread-2] WARN DICLOG - not find library.properties in classpath use it by default !
[pool-1-thread-2] WARN DICLOG - init userLibrary warning :/homeh/AutoPhrasebrary/default.dic because : file not found or failed to read !
[pool-1-thread-2] WARN DICLOG - init ambiguity error :/homeh/AutoPhrasebrarybiguity.dic because : not find that file or can not found!
Missing parameter file english.utf8.par
Some parameter files can't be downloaded automatically due to an invalid URL. I tried to download english.par and replace english.utf8.par with it, but that doesn't work. I would appreciate it if I could get this parameter file from you.
I have trained a model on Wikipedia, so I have segmentation.model and the list of extracted phrases. How can I apply this model to a new corpus to extract new phrases? Is that possible? Or does phrasal_segmentation.sh only highlight phrases extracted from the original corpus?
===Compilation===
compile.sh: line 5: make: command not found
===Tokenization===
[pool-1-thread-8] WARN DICLOG - not find library.properties in classpath use it by default !
[pool-1-thread-8] WARN DICLOG - init userLibrary warning :/autophrase/library/default.dic because : file not found or failed to read !
[pool-1-thread-8] WARN DICLOG - init ambiguity error :/autophrase/library/ambiguity.dic because : not find that file or can not found!
In your last commit you included changes to src/frequent_pattern_mining/frequent_pattern_mining.h. Specifically, you use TOTAL_TOKEN_TYPE, but this type is not defined anywhere. As such, the project doesn't build, giving the errors:
src/frequent_pattern_mining/frequent_pattern_mining.h:157:31: error: ‘TOTAL_TOKEN_TYPE’ was not declared in this scope
src/frequent_pattern_mining/frequent_pattern_mining.h:157:31: note: suggested alternative: ‘TOTAL_TOKENS_TYPE’
inline bool pruneByPOSTag(TOTAL_TOKEN_TYPE st, TOTAL_TOKEN_TYPE ed)
^~~~~~~~~~~~~~~~
TOTAL_TOKENS_TYPE
Build attempted with the following dependencies enabled:
1) gcc/7.1.0 3) mpich/3.1.4 5) git/2.6.3 7) java/1.8.0
2) python/3.4 4) hdp/0.1 6) gsl/2.3
parameters.h, line 24:
typedef unsigned char POS_ID_TYPE;
segment.cpp, line 115:
POS_ID_TYPE posTagId = -1;
segmentation.h, line 308 and many other lines:
tags[j] >= 0
Assigning -1 to an unsigned type wraps around to 255, so comparisons like tags[j] >= 0 are always true.
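The effect is easy to demonstrate outside C++ by emulating an unsigned 8-bit store with shell arithmetic (masking with 0xFF mimics how an unsigned char truncates the value):

```shell
# Storing -1 into an unsigned 8-bit slot keeps only the low byte.
posTagId=$(( -1 & 0xFF ))
echo "$posTagId"                       # prints 255, not -1
# Hence a guard like `tags[j] >= 0` on POS_ID_TYPE can never be false.
[ "$posTagId" -ge 0 ] && echo "guard always passes"
```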
When I execute auto_phrase.sh, some exceptions are thrown:
java.io.FileNotFoundException: tmp/final_quality_multi-words.txt
java.io.FileNotFoundException: tmp/final_quality_unigrams.txt
java.io.FileNotFoundException: tmp/final_quality_salient.txt
According to the error information, the problem is located
at Tokenizer.tokenizeText(Tokenizer.java:618)
at Tokenizer.main(Tokenizer.java:766)
So, how do I resolve this?
Should be:
Ubuntu:
g++ 4.8: $ sudo apt-get install g++-4.8
Java 8: $ sudo apt-get install openjdk-8-jdk
curl: $ sudo apt-get install curl
(Also, curl is required.)
What is the strategy for the mapping-back part of AutoPhrase? We found that the mapping-back program fails on some corpora; we have solved the letter-case problem, but the results still do not seem good enough.
This issue occurred when I ran ./phrasal_segmentation.sh, at the Generating Output stage.
I used my own data corpus (22 MB in size) while still using the pre-trained model. (As mentioned in Issue #22, I changed TEXT_TO_SEG in the phrasal_segmentation.sh script.)
I have tried increasing the buffer limit from the default size of 8192 to 16384 (2 x 8192), 32768 (4 x 8192), 65536 (8 x 8192), 131072 (16 x 8192), and 262144 (32 x 8192), but to no avail.
How may I resolve this issue?
It reports the following errors:
tools/tokenizer/src/Tokenizer.java:415: error: local variable language is accessed from within inner class; needs to be declared final
if (!hasSuitableAnalyzer(language)) {
^
tools/tokenizer/src/Tokenizer.java:416: error: local variable language is accessed from within inner class; needs to be declared final
SpecialTagger tagger = getTagger(Thread.currentThread().getName(), language, mode);
^
tools/tokenizer/src/Tokenizer.java:416: error: local variable mode is accessed from within inner class; needs to be declared final
SpecialTagger tagger = getTagger(Thread.currentThread().getName(), language, mode);
^
tools/tokenizer/src/Tokenizer.java:417: error: local variable mode is accessed from within inner class; needs to be declared final
token_pairs = lineToTokens(tagger, line, mode);
^
tools/tokenizer/src/Tokenizer.java:419: error: local variable language is accessed from within inner class; needs to be declared final
Analyzer analyzer = getAnalyzer(Thread.currentThread().getName(), language, mode);
^
tools/tokenizer/src/Tokenizer.java:419: error: local variable mode is accessed from within inner class; needs to be declared final
Analyzer analyzer = getAnalyzer(Thread.currentThread().getName(), language, mode);
^
tools/tokenizer/src/Tokenizer.java:420: error: local variable mode is accessed from within inner class; needs to be declared final
token_pairs = lineToTokens(analyzer, line, mode);
^
tools/tokenizer/src/Tokenizer.java:425: error: local variable mode is accessed from within inner class; needs to be declared final
if (mode.equals("train") || mode.equals("test") || mode.equals("direct_test")) {
^
tools/tokenizer/src/Tokenizer.java:425: error: local variable mode is accessed from within inner class; needs to be declared final
if (mode.equals("train") || mode.equals("test") || mode.equals("direct_test")) {
^
tools/tokenizer/src/Tokenizer.java:425: error: local variable mode is accessed from within inner class; needs to be declared final
if (mode.equals("train") || mode.equals("test") || mode.equals("direct_test")) {
^
tools/tokenizer/src/Tokenizer.java:426: error: local variable tag_writer is accessed from within inner class; needs to be declared final
if (tag_writer != null) {
^
tools/tokenizer/src/Tokenizer.java:434: error: local variable tag_writer is accessed from within inner class; needs to be declared final
if (tag_writer == null || mode.equals("test") || mode.equals("direct_test")) { // we always need raw tokens under the test mode
^
tools/tokenizer/src/Tokenizer.java:434: error: local variable mode is accessed from within inner class; needs to be declared final
if (tag_writer == null || mode.equals("test") || mode.equals("direct_test")) { // we always need raw tokens under the test mode
^
tools/tokenizer/src/Tokenizer.java:434: error: local variable mode is accessed from within inner class; needs to be declared final
if (tag_writer == null || mode.equals("test") || mode.equals("direct_test")) { // we always need raw tokens under the test mode
^
tools/tokenizer/src/Tokenizer.java:449: error: local variable mode is accessed from within inner class; needs to be declared final
if (mode.equals("train") && case_sen.equals("Y")) {
^
tools/tokenizer/src/Tokenizer.java:449: error: local variable case_sen is accessed from within inner class; needs to be declared final
if (mode.equals("train") && case_sen.equals("Y")) {
^
tools/tokenizer/src/Tokenizer.java:464: error: local variable mode is accessed from within inner class; needs to be declared final
} else if (mode.equals("train") && case_sen.equals("N")) {
^
tools/tokenizer/src/Tokenizer.java:464: error: local variable case_sen is accessed from within inner class; needs to be declared final
} else if (mode.equals("train") && case_sen.equals("N")) {
^
tools/tokenizer/src/Tokenizer.java:495: error: local variable mode is accessed from within inner class; needs to be declared final
} else if (mode.equals("test") || mode.equals("direct_test")) {
^
tools/tokenizer/src/Tokenizer.java:495: error: local variable mode is accessed from within inner class; needs to be declared final
} else if (mode.equals("test") || mode.equals("direct_test")) {
^
tools/tokenizer/src/Tokenizer.java:500: error: local variable case_sen is accessed from within inner class; needs to be declared final
if (case_sen.equals("N")) {
^
tools/tokenizer/src/Tokenizer.java:515: error: local variable mode is accessed from within inner class; needs to be declared final
else if (mode.equals("translate")) {
^
22 errors
How do I fix this?
Thanks!
" The language in the input will be automatically detected". So there's no way for me to apply your algorithm for my language (Vietnamese). I think you should make the algorithm flexibly by allowing human efforts so that if people make use of this algorithm in clever way and suitable for their language, they will get a big step in processing languages. Thank you.
Hey, I'm trying to use AutoPhrase on MEDLINE2017 (nearly 30 million documents).
When I run it, this is the error. It looks like the number of total tokens may have overflowed?
===AutoPhrasing===
=== Current Settings ===
Iterations = 2
Minimum Support Threshold = 20
Maximum Length Threshold = 6
POS-Tagging Mode Enabled
Number of threads = 24
Labeling Method = DPDN
Auto labels from knowledge bases
Max Positive Samples = -1
=======
Loading data...
# of total tokens = -349783405
max word token id = 44869137
terminate called after throwing an instance of 'std::bad_array_new_length'
what(): std::bad_array_new_length
./auto_phrase.sh: line 110: 157112 Aborted ./bin/segphrase_train --pos_tag --thread $THREAD --pos_prune data/BAD_POS_TAGS.txt --label_method $LABEL_METHOD --label $LABEL_FILE --max_positives $MAX_POSITIVES --min_sup $MIN_SUP
real 3m57.970s
user 3m48.907s
sys 0m7.165s
I got this error when I ran phrasal_segmentation.sh on a huge txt file (990K lines, 68 MB).
I cannot provide the file to reproduce the problem because the data is not public.
I got some partial output in segmentation.txt, so my best guess is that the file is too large?
Please let me know if there is an easy way to solve this issue.
I tried running AutoPhrase on a dataset of mine followed by phrasal segmentation. I used my own expert labels file and for the phrasal segmentation set MULTIWORD cutoff at 0.7 and SINGLEWORD cutoff at 0.8, but I ended up getting phrases like the following in my results.
We aimed to test <phrase>the</phrase> overexpansion of <phrase>the</phrase> <phrase>BVS</phrase> scaffold in vitro and evaluate the impact of <phrase>excessive</phrase> <phrase>scaffold</phrase> <phrase>oversizing</phrase> <phrase>on</phrase> <phrase>focal point</phrase> support
In the above example, phrases such as "the" and "on" should not be included, as they conflict with the SegPhrase rule that a phrase should be filtered out if it ends with a stopword. I didn't edit the stopword file, so this behavior is strange.
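As a stopgap while the filtering rule is investigated, highlighted stopwords can be stripped from segmentation.txt with a post-filter (a sketch; the stopword alternation here is a tiny illustrative subset, not the repo's actual stopword file):

```shell
# Unwrap <phrase> markup around a few known stopwords, leaving
# legitimate multi-word highlights untouched.
echo 'impact <phrase>on</phrase> <phrase>focal point</phrase> support' \
    | sed -E 's#<phrase>(the|on|of|a|an)</phrase>#\1#g'
```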
Hi, Jingbo!
I ran the bash file on exactly the same input file, with exactly the same environment variables (min_sup), several times, but I got different results every time. Could you please let me know why this happens?
Thank you,
Yuan
I have edited the Tokenizer.java file, setting the maximum value to 100000000 on the following line:
AutoPhrase/tools/tokenizer/src/Tokenizer.java
Line 628 in 5f49499
"[Fatal Error] Load Limit Exceeded! You may want to modify the load limit in the Tokenizer.java
But I still get this error. What is the load limit based on? My data has around 200,000,000 phrases.
DBLP.txt.gz seems to have been moved, as the URL (http://dmserv2.cs.illinois.edu/data/DBLP.txt.gz) currently leads to a 404. What is the new URL for the data?
I have already cloned WikipediaEntities, but I don't know how to use it, and its README doesn't explain how either.
Thanks!
not enough training data found!
Hello, I have a question: when I try to use my own input.txt, this error always occurs.
Could you tell me the reason?
Thank you so much!