CMU ARK Twitter Part-of-Speech Tagger

Home Page: http://www.ark.cs.cmu.edu/TweetNLP/

License: Other

Shell 1.48% PHP 16.02% R 2.06% Python 0.57% Java 79.50% Makefile 0.10% HTML 0.27%

ark-tweet-nlp's Issues

Ark-tweet-nlp with gate

Just wondering whether anybody has used ark-tweet-nlp as a plugin in GATE.

Thanks.

--output-file doesn't work

./runTagger.sh --output-format conll --output-file text.txt examples/casual.txt
This does not work properly: it produces no output.

Setup doesn't work

Hey folks,

I just wanted to try out your tagger, but I can't get it to run. First off, I tried following your hacking.txt, but without success.

The project structure is also unusual for a Java project, so I have some questions:

Why are you providing the jargs jar? Did you change something in it, so that the standard version available through Maven cannot be used?
The same goes for the GNU Trove jar that you provide. Were any changes made to the library?
Why do the actual source files live in a separate src folder in the root of the project, while the resources stay in the ark-tweet-nlp folder?
Are metaphone-map2.txt and ptb_ordered_metaphone.txt, which are in the lib directory, external resources, or did you create them? If you created them, why are they in the lib directory?
Where is posBerkeley.jar from? Is it available to the public (e.g. from here)?

Since I want to use, try, and evaluate it, I'm very interested in your project. I'm also experienced with Maven, Java, and Eclipse, so I could help you restructure it.

Tweetnlp crashed on this input, note that there are lines with no words at all...

A musician must make music, an artist must paint.., to be ultimately at peace with himself. What a man can be, he must be ~Maslow
A small body of determined spirits fired by an unquenchable faith in their mission can alter the course of history.~Gandhi #quote
Never for the sake of peace and quiet deny your convictions ~ Dag Hammarskjold #quote
5020

@kz713twt Amanpour
2

Oh my god ! It was an wedding anniversary today, but I stayed at home unfortunately.

emoticon confused with RT

For instance, "pls RT__Tell" is tokenized as "pls R T__T ell": the "T__T" in the middle is treated as an emoticon.

I have an ad-hoc fix for now. It seems OK to me.

how does the tokenizer work? (whitespace tokenizer?)

Hi, I have a question about the tokenizer. What criteria does it use to split text into tokens?
In the example you gave:
@thecamion @
I O
like V
monkeys N
, ,
but &
I O
still R
hate V
COSTCO ^
parking N
lots N
.. ,

Why aren't "COSTCO", "parking", and "lots" tokenized as a single phrase, "COSTCO parking lots"? I know this tagger is fast, so it probably tokenizes on whitespace?

Cannot Train POS with Locale Other Than English

If I train the arktweet POS tagger under a non-English locale, e.g. German (cf. the train method), the training process fails because it writes a file containing decimals in German formatting. For example, numbers like 0.2 are written as 0,2 (German notation), and the trainer component then fails to load this file because of the comma.
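The behaviour described above can be reproduced in a few lines. The sketch below is not the tagger's code; `LocaleSafeFormat` and `formatWeight` are hypothetical names, and it assumes the weights are written with `String.format`. It shows that passing an explicit `Locale.ROOT` keeps the decimal point regardless of the default locale of the JVM.

```java
import java.util.Locale;

final class LocaleSafeFormat {
    // Hypothetical helper: format a weight with a locale-independent decimal point.
    static String formatWeight(double w) {
        return String.format(Locale.ROOT, "%.6f", w);
    }

    public static void main(String[] args) {
        // With a German locale, %f uses a decimal comma ("0,2" on a standard JDK).
        String german = String.format(Locale.GERMANY, "%.1f", 0.2);
        System.out.println(german);

        // Double.parseDouble only accepts a decimal point, so the German form fails to load.
        try {
            Double.parseDouble(german);
        } catch (NumberFormatException e) {
            System.out.println("cannot parse: " + german);
        }

        // The locale-independent form round-trips.
        System.out.println(Double.parseDouble(formatWeight(0.2)));
    }
}
```

A workaround that needs no code change is to start the JVM with an English locale, for example with the standard system properties -Duser.language=en -Duser.country=US.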

Port to PHP?

Is there a port of this to PHP? :(

My code skills are basic and I have only just gotten round to being competent at PHP. Java is a bit above me right now. Is there a PHP port of this OR is there a way of integrating the existing code into a PHP web app?

kevinzzz007/ark-tweet-nlp : WindowsError: [Error 2] The system cannot find the file specified

Could you please help me run your code from
https://github.com/kevinzzz007/ark-tweet-nlp
file ark-tweet-nlp-windows.zip

CMU ARK Twitter Part-of-Speech Tagger-Python wrapper for Windows http://www.ark.cs.cmu.edu/TweetNLP/

I only added a test line at the end of your code:
runString("lets go to store ")

and got this error message:

  File "c:\Sander\mycode\pedrobalage-TwitterHybridClassifier_dec28\ark-tweet-nlp-windows\CMUTweetTaggerWindows.py", line 42, in <module>
    runString("lets go to store ")
  File "c:\Sander\mycode\pedrobalage-TwitterHybridClassifier_dec28\ark-tweet-nlp-windows\CMUTweetTaggerWindows.py", line 29, in runString
    p = subprocess.Popen('java -XX:ParallelGCThreads=2 -Xmx500m -jar ark-tweet-nlp-0.3.2.jar ' + file_name,stdout=subprocess.PIPE)
  File "C:\Anaconda\Lib\subprocess.py", line 710, in __init__
    errread, errwrite)
  File "C:\Anaconda\Lib\subprocess.py", line 958, in _execute_child
    startupinfo)

WindowsError: [Error 2] The system cannot find the file specified
PS:
I actually get the same error from this project:
https://github.com/pedrobalage/TwitterHybridClassifier/tree/master/Data/Lexicon/NRC-Hashtag-Sentiment-Lexicon-v0.1

So, to localise the problem, I first tried running your code to use ark-tweet-nlp on a Windows machine.

Use twitter-text to extract hashtags, mentions, and URLs

Currently the tokenizer has its own regexes for hashtags, mentions, and URLs (and there's a comment about what the best URL pattern is). Twitter maintains a Java library, twitter-text, that can extract these and handles all sorts of weird edge cases. It also has a pretty good regex for URLs that aren't preceded by a protocol. Offloading the identification of Twitter-specific tokens to the Twitter-maintained library would probably improve the identification of those items (or, at the very least, mean the tokenizer makes the same mistakes as Twitter itself).

Scala compilation failed in cygwin

When compiling the Scala code in a Cygwin environment, the compiler failed while importing a class from scala-lib.jar and printed an error message full of garbled characters. I suspect an encoding problem. However, the compilation succeeded in Windows cmd.
Just a heads-up, in case someone encounters the same problem.

the --input-field command option doesn't even seem to work

It seems that, regardless of the number I specify, the tagger always goes to the first column for data.

Use POS without tokenizer

Dear,

The tokenizer does a great job; however, it does not suit the task I have at hand. Is it possible to run only the POS tagger, without the tokenizer? If so, how should I approach this?

Kind regards,
Henry

Cannot build properly

I followed these steps to build the project and run the example for tagging:

git clone https://github.com/brendano/ark-tweet-nlp.git
cd ark-tweet-nlp/
mvn clean package
cp ark-tweet-nlp/target/original-ark-tweet-nlp-0.3.2.jar .
mv original-ark-tweet-nlp-0.3.2.jar ark-tweet-nlp-0.3.2.jar
./runTagger.sh examples/example_tweets.txt

I am getting the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonParseException
    at cmu.arktweetnlp.impl.Model.loadModelFromText(Model.java:409)
    at cmu.arktweetnlp.Tagger.loadModel(Tagger.java:40)
    at cmu.arktweetnlp.RunTagger.runTagger(RunTagger.java:85)
    at cmu.arktweetnlp.RunTagger.main(RunTagger.java:373)
Caused by: java.lang.ClassNotFoundException: com.fasterxml.jackson.core.JsonParseException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 4 more

Any ideas?

Cannot execute runTagger.sh script from other directories

For example, from the parent directory of ark-tweet-nlp, I get:

Noahs feature file:null
File with initial transition probs:null
Reading embeddings file...
java.io.FileNotFoundException: lib/embeddings.txt (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:137)
    at java.io.FileInputStream.<init>(FileInputStream.java:96)
    at java.io.FileReader.<init>(FileReader.java:58)
    at edu.cmu.cs.lti.ark.ssl.util.BasicFileIO.openFileToRead(BasicFileIO.java:39)
    at edu.cmu.cs.lti.ark.ssl.pos.SemiSupervisedPOSTagger.readDistSim(SemiSupervisedPOSTagger.java:630)
    at edu.cmu.cs.lti.ark.ssl.pos.SemiSupervisedPOSTagger.setVariousOptions(SemiSupervisedPOSTagger.java:531)
    at edu.cmu.cs.lti.ark.ssl.pos.SemiSupervisedPOSTagger.<init>(SemiSupervisedPOSTagger.java:178)
    at edu.cmu.cs.lti.ark.tweetnlp.TweetTaggerInstance.<init>(TweetTaggerInstance.java:62)
    at edu.cmu.cs.lti.ark.tweetnlp.TweetTaggerInstance.getInstance(TweetTaggerInstance.java:24)
    at edu.cmu.cs.lti.ark.tweetnlp.RunPOSTagger.tweetTagging(RunPOSTagger.java:43)
    at edu.cmu.cs.lti.ark.tweetnlp.RunPOSTagger.doPOSTagging(RunPOSTagger.java:39)
    at edu.cmu.cs.lti.ark.tweetnlp.RunPOSTagger.main(RunPOSTagger.java:63)
12-Sep-2011 15:25:03 edu.cmu.cs.lti.ark.ssl.util.BasicFileIO openFileToRead
SEVERE: Could not open file:lib/embeddings.txt
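The trace shows a relative path, lib/embeddings.txt, being opened against the current working directory, which is why the script only works from inside the project directory. One possible fix, sketched below under assumptions, is to resolve such paths against the directory the code was loaded from. `InstallPaths` and its methods are hypothetical names, not part of the tagger.

```java
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class InstallPaths {
    // Resolve a possibly relative resource path against a base directory.
    static Path resolve(Path baseDir, String relative) {
        Path p = Paths.get(relative);
        return p.isAbsolute() ? p : baseDir.resolve(p).normalize();
    }

    // Directory that contains the jar (or class directory) a class was loaded from.
    static Path codeLocationDir(Class<?> cls) throws URISyntaxException {
        Path loc = Paths.get(cls.getProtectionDomain().getCodeSource().getLocation().toURI());
        return Files.isDirectory(loc) ? loc : loc.getParent();
    }

    public static void main(String[] args) throws URISyntaxException {
        Path base = codeLocationDir(InstallPaths.class);
        // Independent of the directory the program was started from.
        System.out.println(resolve(base, "lib/embeddings.txt"));
    }
}
```

This assumes the lib directory sits next to the jar. If it does not, the base directory would have to come from a command-line option or environment variable instead.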

Tokenizer gets stuck on bad-match regex matching

I found a problem when using your Twokenize tokenizer.

The program gets stuck in the Twokenize.simpleTokenize() method, in this part:
while (matches.find()) {
    // The spans of the "bads" should not be split.
    if (matches.start() != matches.end()) { // unnecessary?
        List bad = new ArrayList(1);
        bad.add(splitPunctText.substring(matches.start(), matches.end()));
        bads.add(bad);
        badSpans.add(new Pair<Integer, Integer>(matches.start(), matches.end()));
    }
}

Example string: String test = "@rkdalswl0302 완댜)물론~얼굴은아니지만....................................................................................................";
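The issue does not say why matches.find() hangs. If the cause is regex backtracking on the long run of dots (an assumption), one defensive measure is to put a time budget on matching. The sketch below wraps the input in a CharSequence whose charAt checks a deadline, so a runaway match throws instead of blocking forever. `DeadlineCharSequence` and `GuardedMatch` are hypothetical names; nothing here is taken from Twokenize.

```java
import java.util.regex.Pattern;

// CharSequence wrapper that aborts regex matching once a deadline has passed.
final class DeadlineCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadlineNanos;

    DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
        this.inner = inner;
        this.deadlineNanos = deadlineNanos;
    }

    @Override public char charAt(int index) {
        // The matcher reads its input through charAt, so this check runs during matching.
        if (System.nanoTime() - deadlineNanos > 0) {
            throw new IllegalStateException("regex matching exceeded its time budget");
        }
        return inner.charAt(index);
    }

    @Override public int length() { return inner.length(); }

    @Override public CharSequence subSequence(int start, int end) {
        return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    @Override public String toString() { return inner.toString(); }
}

final class GuardedMatch {
    // Returns the result of find(), or throws IllegalStateException on timeout.
    static boolean findWithin(Pattern pattern, String text, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(text, deadline)).find();
    }

    public static void main(String[] args) {
        Pattern dots = Pattern.compile("\\.{2,}");
        System.out.println(findWithin(dots, "wait....", 1000));
    }
}
```

A caller could catch the exception and fall back to a plain whitespace split for that tweet. This only contains the symptom; the pattern that backtracks would still need to be rewritten.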

Word Cluster

How do I run the tagger with word clusters? The usage text says:
"\n --word-clusters Alternate word clusters file (see FeatureExtractor)" +
Which filename should we pass here?
Thanks in advance.

JSON output

Please see my comment on closed ticket #12.

Hi,
I saw this thread and wonder whether we could have JSON output as well. If you like, I can make the changes and submit them for merging.
Right now the conll output looks like this:
Tax N 0.8294
benefits N 0.9934
of P 0.9954
retirement N 0.9764
savings N 0.9953
in P 0.9947
jeopardy N 0.9038
-- , 0.9228
one $ 0.9548
economist N 0.9925
says V 0.9961
the D 0.9906
hit N 0.9130
to P 0.9981
savers N 0.5918
could V 0.9966
endanger V 0.9993
the D 0.9996
economy N 0.9971
. , 0.9980

Would it make sense to have output like this?
{
"values":[
{"word":"Tax","POS":"N","confidence":0.8294},
......
]

}
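The proposed format can be produced from the existing conll output with a small converter. The sketch below is only an illustration of the proposal, not a patch to the tagger. It assumes the conll fields are tab-separated (word, tag, confidence), and `ConllToJson` is a hypothetical name. Escaping is done by hand to keep the example free of dependencies; a real change would more likely use a JSON library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ConllToJson {
    // Minimal JSON string escaping: quotes, backslashes, and control characters.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"') {
                sb.append("\\\"");
            } else if (c == '\\') {
                sb.append("\\\\");
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Convert the conll lines of one tweet into the JSON shape proposed above.
    static String toJson(List<String> conllLines) {
        List<String> items = new ArrayList<>();
        for (String line : conllLines) {
            if (line.trim().isEmpty()) continue; // blank lines separate tweets
            String[] f = line.split("\t");
            if (f.length < 3) {
                throw new IllegalArgumentException("expected 3 tab-separated fields: " + line);
            }
            double confidence = Double.parseDouble(f[2]);
            items.add("{\"word\":\"" + escape(f[0]) + "\",\"POS\":\"" + escape(f[1])
                    + "\",\"confidence\":" + confidence + "}");
        }
        return "{\"values\":[" + String.join(",", items) + "]}";
    }

    public static void main(String[] args) {
        System.out.println(toJson(Arrays.asList("Tax\tN\t0.8294", "benefits\tN\t0.9934")));
    }
}
```

Escaping matters here because tokens such as a lone double quote do occur in tweets.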

GPL

Hi,

Has anyone managed to remove or replace the components that make the library GPL, and lived to tell about it?

In particular for POS tagging only, but also in general?

Thanks!

LICENSE Issue GPLv2 compatibility with GPLv3

Hi Brendan,

We are using your library for a Twitter application that we plan to release under GPLv3. However, we cannot release your code together with our GPLv3 code, because your license does not state that the software is licensed under GPLv2 or later versions.

If you could change your license to "GPLv2 or later", it would be easier to use your code in GPLv3-licensed code.

You can see from this compatibility matrix that "GPLv2 or later" is compatible with GPLv3, but GPLv2 alone is not.

I look forward to your response.

Missing default model.20120919 after building from source code

After I build ark-tweet-nlp-0.3.2.jar with mvn package, running ./runTagger.sh examples/example_tweets.txt throws an IOException.

Details:

Exception in thread "main" java.io.IOException: Neither file nor resource found for: /cmu/arktweetnlp/model.20120919
    at cmu.arktweetnlp.util.BasicFileIO.openFileOrResource(BasicFileIO.java:250)
    at cmu.arktweetnlp.impl.Model.loadModelFromText(Model.java:409)
    at cmu.arktweetnlp.Tagger.loadModel(Tagger.java:40)
    at cmu.arktweetnlp.RunTagger.runTagger(RunTagger.java:85)
    at cmu.arktweetnlp.RunTagger.main(RunTagger.java:373)

Suppress the printouts

The printouts are, I guess, for debugging purposes; they are of no use in real usage.

some config files in scripts/train.sh

I am trying to use the semi-CRF tools to train a CRF model, using the script scripts/train.sh.
It has this option: --noahsFeaturesFile noah.feats

I have no idea what noah.feats looks like.
Could you give me an example?

Tagger is slow

Dipanjan is working on a fix

Could you explain the meaning of "model.20120919.txt"?

Can you explain the meaning of each column?
_BIAS_ O 0.219882
_BIAS_ V 0.337862
_BIAS_ D -0.505394
_BIAS_ A 0.00000
_BIAS_ N 0.785694
_BIAS_ P 0.00000
_BIAS_ , 1.71877
_BIAS_ ^ 1.35129
_BIAS_ L 0.00000
_BIAS_ ~ -0.0207695
_BIAS_ @ -0.424859
_BIAS_ U -0.358914
_BIAS_ $ 0.00000

ark tweet tagger fails with a conll input file with just one column

./ark-tweet-nlp-0.3.2/runTagger.sh --input-format conll data/test.txt
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at cmu.arktweetnlp.io.CoNLLReader.sentenceFromLines(CoNLLReader.java:55)
at cmu.arktweetnlp.io.CoNLLReader.readFile(CoNLLReader.java:32)
at cmu.arktweetnlp.RunTagger.runTaggerInEvalMode(RunTagger.java:161)
at cmu.arktweetnlp.RunTagger.runTagger(RunTagger.java:87)
at cmu.arktweetnlp.RunTagger.main(RunTagger.java:364)

The file test.txt was:

This
is
a
test
!

Adding a (tab-separated) dummy column solves the problem (but still, it ought to work with files that have just one column):
This 1
is 2
a 3
test 4
! 5
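The ArrayIndexOutOfBoundsException: 1 suggests that the reader indexes the second tab-separated field unconditionally. A more tolerant parse could look like the sketch below. It is not the actual CoNLLReader code; `ConllLine` is a hypothetical class that treats the label column as optional.

```java
final class ConllLine {
    final String token;
    final String tag; // null when the line has no label column

    private ConllLine(String token, String tag) {
        this.token = token;
        this.tag = tag;
    }

    // Parse one non-blank conll line; the second column is optional.
    static ConllLine parse(String line) {
        String[] parts = line.split("\t");
        String tag = parts.length > 1 ? parts[1] : null;
        return new ConllLine(parts[0], tag);
    }

    public static void main(String[] args) {
        ConllLine oneColumn = parse("This");
        ConllLine twoColumns = parse("This\t1");
        System.out.println(oneColumn.token + " / " + oneColumn.tag);
        System.out.println(twoColumns.token + " / " + twoColumns.tag);
    }
}
```

With a null tag the caller can still tag the sentence and simply skip evaluation against gold labels.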

boutta: P => V

Nathan noticed this today:

"boutta", short for "about to", is currently tagged as P in the 0.3 data release, but it should be V since it's like a modal auxiliary verb, similar to "ought to". In fact, the Brown clusters have figured this out, grouping "boutta" with "tryna", "gonna", and "finna" variants ("trying/going to", "going to", "fixing to"): http://www.ark.cs.cmu.edu/TweetNLP/paths/0011001.html

This might also be related to immediate future auxiliaries as mentioned in the NAACL paper (for "finna" and Texan English).

Current examples of the problem, just for "boutta":

~/twi/pos/ark-tweet-nlp/data/twpos-data-v0.3 % grep -ni boutta *.conll
oct27.conll:22611:boutta P
oct27.conll:26789:Boutta P

Some further inconsistencies. Here are examples of this cluster in the data. I haven't looked at them in context yet but highly doubt the P reading is correct.

daily547.conll:1422 Tryna V
daily547.conll:2499 tryna V
daily547.conll:3934 Bouta P
oct27.conll:1534 fiNna R
oct27.conll:3469 fina V
oct27.conll:3923 gon V
oct27.conll:6065 tryna V
oct27.conll:7890 tryna V
oct27.conll:8455 gne V
oct27.conll:11337 tryna V
oct27.conll:13993 gon V
oct27.conll:19302 finna P
oct27.conll:21114 gon V
oct27.conll:22610 boutta P
oct27.conll:24181 tryna V
oct27.conll:26788 Boutta P

Twokenize runs into NullPointerException for conll output format, with provided example (casual.txt)

me$ java -jar ark-tweet-nlp-0.3.2.jar --output-format conll --just-tokenize /tmp/casual.txt
Detected text input format
Exception in thread "main" java.lang.NullPointerException
at cmu.arktweetnlp.RunTagger.outputJustTagging(RunTagger.java:245)
at cmu.arktweetnlp.RunTagger.runTagger(RunTagger.java:130)
at cmu.arktweetnlp.RunTagger.main(RunTagger.java:364)

Tagger works on the same input though.

me$ java -jar ark-tweet-nlp-0.3.2.jar --output-format conll /tmp/casual.txt
Detected text input format
@Cwallll @ 0.9989
@diddy_dance @ 0.9986
ikr ! 0.8143
smh G 0.9406
he O 0.9963
asked V 0.9979
fir P 0.5545
yo D 0.6272
last A 0.9871
name N 0.9998
so P 0.9838
he O 0.9981
can V 0.9997
add V 0.9997
u O 0.9978
on P 0.9426
fb ^ 0.9453
lololol ! 0.9664

:o E 0.9387
:/ E 0.9983
:'( E 0.9975

:o E 0.9964
(: E 0.9994
:) E 0.9997
.< E 0.9952
XD E 0.9938
-__- E 0.9956
o.O E 0.9899
;D E 0.9995
:-) E 0.9992
@_@ E 0.9964
:P E 0.9996
8D E 0.9961
: E 0.6925
1 $ 0.9194
:( E 0.9715
:D E 0.9996
=| E 0.9963
" , 0.6125
) , 0.9078
: , 0.6272
E 0.4920
.... , 0.8882

jar dependencies are not pulled correctly

The instructions in "hacking.txt" suggest running the "mvn package" command to populate the jar dependencies in ark-tweet-nlp/src/target. The first problem is that it generates the dependencies in ark-tweet-nlp/target, not in ark-tweet-nlp/src/target.

Further, it creates ark-tweet-nlp/target/com.googlecode.addjars.mojo.AddJarsMojo and populates three jars. But all of them are empty. For example this is what I get on my machine:

$ ls -l
total 8
-rwxrwx---+ 1 user_xyz None 0 Sep 8 14:36 ark-tweet-nlp-posBerkeley.jar
-rwxrwx---+ 1 user_xyz None 0 Sep 8 14:36 ark-tweet-nlp-stanford-postagger-2010-05-26.jar
-rwxrwx---+ 1 user_xyz None 0 Sep 8 14:36 ark-tweet-nlp-trove-3.0.0a5.jar

The pom files should be modified so that the right jars are pulled in here.

Trying to get in touch regarding a security issue

Hey there!

I'd like to report a security issue but cannot find contact instructions on your repository.

If not a hassle, might you kindly add a SECURITY.md file with an email, or another contact method? GitHub recommends this best practice to ensure security issues are responsibly disclosed, and it would serve as a simple instruction for security researchers in the future.

Thank you for your consideration, and I look forward to hearing from you!

(cc @huntr-helper)

Taking too much memory

Try to flush the output file to conserve memory.

"yeen" O => (Pronoun Verb)

I have been told that "yeen" is short for "you ain't".

Currently it's tagged "Yeen/O gotta lie" (Daily547).

TODO: check whether we have a compound tag for this