brendano / ark-tweet-nlp
CMU ARK Twitter Part-of-Speech Tagger
Home Page: http://www.ark.cs.cmu.edu/TweetNLP/
License: Other
When compiling Scala in a Cygwin environment, the compiler failed to import a class from scala-lib.jar and printed an error message full of garbled characters. I suspect an encoding problem. The same compilation succeeded in Windows cmd, however.
Just a reminder, in case someone encounters the same problem.
Is there a port of this to PHP? :(
My code skills are basic and I have only just gotten round to being competent at PHP. Java is a bit above me right now. Is there a PHP port of this OR is there a way of integrating the existing code into a PHP web app?
Can you explain the meaning of each column?
_BIAS_ O 0.219882
_BIAS_ V 0.337862
_BIAS_ D -0.505394
_BIAS_ A 0.00000
_BIAS_ N 0.785694
_BIAS_ P 0.00000
_BIAS_ , 1.71877
_BIAS_ ^ 1.35129
_BIAS_ L 0.00000
_BIAS_ ~ -0.0207695
_BIAS_ @ -0.424859
_BIAS_ U -0.358914
_BIAS_ $ 0.00000
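Not the author, but judging by the layout each line appears to be feature, tag, weight (with _BIAS_ being a per-tag bias feature). A minimal sketch, assuming that three-column format, for reading such lines with the plain Java stdlib:

```java
import java.util.HashMap;
import java.util.Map;

public class ModelLineSketch {
    // Assumed layout: <feature> <tag> <weight>, whitespace-separated.
    static Map<String, Double> parseBiases(String[] lines) {
        Map<String, Double> biasForTag = new HashMap<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length != 3) continue;          // skip malformed lines
            String feature = cols[0];
            String tag = cols[1];
            double weight = Double.parseDouble(cols[2]);
            if (feature.equals("_BIAS_")) {
                biasForTag.put(tag, weight);         // bias weight for this tag
            }
        }
        return biasForTag;
    }

    public static void main(String[] args) {
        String[] sample = {
            "_BIAS_ O 0.219882",
            "_BIAS_ N 0.785694",
        };
        System.out.println(parseBiases(sample).get("N")); // 0.785694
    }
}
```

The column names here are my guess from the data shape, not from the model code itself.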
I have been told that "yeen" is short for "you ain't"
currently it's tagged "Yeen/O gotta lie" (Daily547)
TODO: check whether we have a compound tag for this
When I try to use the semi-CRF tools to train a CRF model, I use the script in scripts/train.sh.
There is an option: --noahsFeaturesFile noah.feats
I have no idea what noah.feats looks like.
Can you give me an example?
I found a problem using your Twokenize tokenizer.
The program gets stuck in the Twokenize.simpleTokenize() method in this part:
while (matches.find()) {
    // The spans of the "bads" should not be split.
    if (matches.start() != matches.end()) { // unnecessary?
        List<String> bad = new ArrayList<String>(1);
        bad.add(splitPunctText.substring(matches.start(), matches.end()));
        bads.add(bad);
        badSpans.add(new Pair<Integer, Integer>(matches.start(), matches.end()));
    }
}
Example string: String test = "@rkdalswl0302 완댜)물론~얼굴은아니지만....................................................................................................";
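For anyone debugging this: the quoted loop just records the [start, end) span of each non-empty "protected" match. A self-contained sketch of the same start()/end() bookkeeping, using a hypothetical stand-in pattern (runs of dots) rather than the tokenizer's real one:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpanSketch {
    // Record [start, end) spans of every non-empty match, like the quoted loop.
    static List<int[]> findSpans(Pattern p, String text) {
        List<int[]> badSpans = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            if (m.start() != m.end()) {            // skip empty matches
                badSpans.add(new int[]{m.start(), m.end()});
            }
        }
        return badSpans;
    }

    public static void main(String[] args) {
        // Hypothetical "protected" pattern, standing in for the tokenizer's bads.
        Pattern dots = Pattern.compile("\\.{2,}");
        for (int[] s : findSpans(dots, "fine....done..ok")) {
            System.out.println(s[0] + ".." + s[1]);
        }
    }
}
```

This sketch terminates on long dot runs; whether the original hang comes from one of the tokenizer's more complex patterns backtracking on the 100+ trailing dots in the example string is something I have not confirmed.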
Currently the tokenizer has its own regexes for hashtags, mentions, and URLs (and there's a comment about what the best URL pattern is). Twitter maintains a Java library, twitter-text, that can extract these and handles all sorts of weird edge cases; it also has a pretty good regex for URLs that aren't preceded by a protocol. Offloading identification of the Twitter-specific tokens to the Twitter-maintained library would probably improve the identification of those items (or at the very least mean it's making the same mistakes as Twitter itself).
pls see my comment on closed ticket #12
Hi,
I saw this thread and I wonder if we could have JSON output as well. If you like, I can make the changes and submit them to be merged.
Right now the CoNLL output is like:
Tax N 0.8294
benefits N 0.9934
of P 0.9954
retirement N 0.9764
savings N 0.9953
in P 0.9947
jeopardy N 0.9038
-- , 0.9228
one $ 0.9548
economist N 0.9925
says V 0.9961
the D 0.9906
hit N 0.9130
to P 0.9981
savers N 0.5918
could V 0.9966
endanger V 0.9993
the D 0.9996
economy N 0.9971
. , 0.9980
Does it make sense to have output like
{
  "values": [
    {"word": "Tax", "POS": "N", "confidence": 0.8294},
    ......
  ]
}
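Until a JSON output format lands in the tagger itself, the CoNLL lines can be converted externally. A minimal sketch (plain stdlib, no JSON library) producing the shape proposed above:

```java
public class ConllToJson {
    // Convert "word TAG confidence" lines into the proposed {"values":[...]} shape.
    static String toJson(String[] conllLines) {
        StringBuilder sb = new StringBuilder("{\"values\":[");
        for (int i = 0; i < conllLines.length; i++) {
            String[] c = conllLines[i].trim().split("\\s+");
            if (i > 0) sb.append(',');
            sb.append("{\"word\":\"").append(c[0])
              .append("\",\"POS\":\"").append(c[1])
              .append("\",\"confidence\":").append(c[2]).append('}');
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson(new String[]{"Tax N 0.8294", "benefits N 0.9934"}));
    }
}
```

Note this sketch does no JSON string escaping, so real tweet tokens containing quotes or backslashes (common in emoticons) would need a proper JSON library.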
Nathan noticed this today:
"boutta", short for "about to", is currently tagged as P in the 0.3 data release, but it should be V since it's like a modal auxiliary verb, similar to "ought to". In fact, the Brown clusters have figured this out, grouping "boutta" with "tryna", "gonna", and "finna" variants ("trying/going to", "going to", "fixing to"): http://www.ark.cs.cmu.edu/TweetNLP/paths/0011001.html
This might also be related to immediate future auxiliaries as mentioned in the NAACL paper (for "finna" and Texan English).
Current examples of the problem, just for "boutta":
~/twi/pos/ark-tweet-nlp/data/twpos-data-v0.3 % grep -ni boutta *.conll
oct27.conll:22611:boutta P
oct27.conll:26789:Boutta P
Some further inconsistencies. Here are examples of this cluster in the data. I haven't looked at them in context yet but highly doubt the P reading is correct.
daily547.conll:1422 Tryna V
daily547.conll:2499 tryna V
daily547.conll:3934 Bouta P
oct27.conll:1534 fiNna R
oct27.conll:3469 fina V
oct27.conll:3923 gon V
oct27.conll:6065 tryna V
oct27.conll:7890 tryna V
oct27.conll:8455 gne V
oct27.conll:11337 tryna V
oct27.conll:13993 gon V
oct27.conll:19302 finna P
oct27.conll:21114 gon V
oct27.conll:22610 boutta P
oct27.conll:24181 tryna V
oct27.conll:26788 Boutta P
If I train the arktweet POS tagger with e.g. a German locale (cf. the train method), the training process fails because it generates a file containing decimals in German formatting. For example, numbers like 0.2 are formatted as 0,2 (German notation), and the trainer component fails to load this file because of the comma.
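A likely fix (a sketch only; I haven't checked it against the actual trainer code) is to pin number formatting to a fixed locale when writing the file, i.e. use the Locale-taking overload of String.format instead of the default-locale one:

```java
import java.util.Locale;

public class LocaleSafeFormat {
    public static void main(String[] args) {
        double w = 0.2;
        Locale.setDefault(Locale.GERMANY);               // simulate a German JVM
        String viaDefault = String.format("%.1f", w);             // "0,2" - breaks the parser
        String pinned = String.format(Locale.US, "%.1f", w);      // "0.2" - always parseable
        System.out.println(viaDefault + " vs " + pinned);
    }
}
```

The same applies to reading: parsing with Double.parseDouble expects the dot notation regardless of locale, so writing with Locale.US (or Locale.ROOT) keeps write and read consistent.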
Hi Brendan,
We are using your library for a Twitter application which we plan to release under GPLv3. However, we cannot release your code with our GPLv3 code, as it doesn't specify that your version of the software is licensed under GPLv2 "or later versions".
So if you can change your license to "GPLv2 or later", it will be easier to use your code in GPLv3-released code.
You can consult the compatibility matrix to see that "GPLv2 or later" is compatible with GPLv3, but plain GPLv2 is not.
I look forward to your response.
Hi, I have a question about the tokenizer. On what criteria does the tokenizer tokenize?
in the example you gave:
@thecamion @
I O
like V
monkeys N
, ,
but &
I O
still R
hate V
COSTCO ^
parking N
lots N
.. ,
Why wouldn't "COSTCO", "parking", "lots" be tokenized into one phrase, "COSTCO parking lots"? I know this tagger is fast, so it probably tokenizes based on spaces?...
Try to flush the output file to conserve memory.
Just wondering if anybody has used ark-tweet-nlp as a plugin in GATE.
thanks
Hey folks,
I just wanted to try out your tagger, but I can't get it to run. First off, I tried following your hacking.txt, but with no success.
Also, the project structure is weird for a Java project. So I have some questions about this project:
- Why is there a src folder in the root of the project while the resources are kept in the ark-tweet-nlp folder?
- Are metaphone-map2.txt and ptb_ordered_metaphone.txt, which are contained in the lib directory, external resources or were they created by you? If so, why are they in the lib directory?
- Where is posBerkeley.jar from? Is it available to the public (e.g. from here)?
Since I want to use/try/evaluate it, I'm very interested in your project. I'm also experienced with Maven, Java, and Eclipse, so I could help you with restructuring this stuff.
./ark-tweet-nlp-0.3.2/runTagger.sh --input-format conll data/test.txt
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at cmu.arktweetnlp.io.CoNLLReader.sentenceFromLines(CoNLLReader.java:55)
at cmu.arktweetnlp.io.CoNLLReader.readFile(CoNLLReader.java:32)
at cmu.arktweetnlp.RunTagger.runTaggerInEvalMode(RunTagger.java:161)
at cmu.arktweetnlp.RunTagger.runTagger(RunTagger.java:87)
at cmu.arktweetnlp.RunTagger.main(RunTagger.java:364)
The file test.txt was:
This
is
a
test
!
Adding a (tab-separated) dummy column solves the problem (but still, it ought to work with files with just one column):
This 1
is 2
a 3
test 4
! 5
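A possible defensive fix (hypothetical; not the project's actual CoNLLReader code) is to tolerate one-column lines when splitting, defaulting the tag to empty instead of indexing past the array:

```java
public class LenientConllSplit {
    // Split a CoNLL line into {token, tag}; tolerate a missing tag column.
    static String[] tokenAndTag(String line) {
        String[] parts = line.split("\t", 2);           // at most 2 fields
        String token = parts[0];
        String tag = parts.length > 1 ? parts[1] : "";  // default instead of AIOOBE
        return new String[]{token, tag};
    }

    public static void main(String[] args) {
        String[] a = tokenAndTag("This");               // one column: no exception
        String[] b = tokenAndTag("This\tD");
        System.out.println(a[1].isEmpty() + " " + b[1]);
    }
}
```

Whether an empty tag is an acceptable default likely depends on the mode: in eval mode the tag column is genuinely required, so a clear error message might be the better fix there.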
./runTagger.sh --output-format conll --output-file text.txt examples/casual.txt
It does not work properly; no output is produced....
For example, from the parent directory of ark-tweet-nlp, I get:
Noahs feature file:null
File with initial transition probs:null
Reading embeddings file...
java.io.FileNotFoundException: lib/embeddings.txt (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:137)
at java.io.FileInputStream.<init>(FileInputStream.java:96)
at java.io.FileReader.<init>(FileReader.java:58)
at edu.cmu.cs.lti.ark.ssl.util.BasicFileIO.openFileToRead(BasicFileIO.java:39)
at edu.cmu.cs.lti.ark.ssl.pos.SemiSupervisedPOSTagger.readDistSim(SemiSupervisedPOSTagger.java:630)
at edu.cmu.cs.lti.ark.ssl.pos.SemiSupervisedPOSTagger.setVariousOptions(SemiSupervisedPOSTagger.java:531)
at edu.cmu.cs.lti.ark.ssl.pos.SemiSupervisedPOSTagger.<init>(SemiSupervisedPOSTagger.java:178)
at edu.cmu.cs.lti.ark.tweetnlp.TweetTaggerInstance.<init>(TweetTaggerInstance.java:62)
at edu.cmu.cs.lti.ark.tweetnlp.TweetTaggerInstance.getInstance(TweetTaggerInstance.java:24)
at edu.cmu.cs.lti.ark.tweetnlp.RunPOSTagger.tweetTagging(RunPOSTagger.java:43)
at edu.cmu.cs.lti.ark.tweetnlp.RunPOSTagger.doPOSTagging(RunPOSTagger.java:39)
at edu.cmu.cs.lti.ark.tweetnlp.RunPOSTagger.main(RunPOSTagger.java:63)
12-Sep-2011 15:25:03 edu.cmu.cs.lti.ark.ssl.util.BasicFileIO openFileToRead
SEVERE: Could not open file:lib/embeddings.txt
me$ java -jar ark-tweet-nlp-0.3.2.jar --output-format conll --just-tokenize /tmp/casual.txt
Detected text input format
Exception in thread "main" java.lang.NullPointerException
at cmu.arktweetnlp.RunTagger.outputJustTagging(RunTagger.java:245)
at cmu.arktweetnlp.RunTagger.runTagger(RunTagger.java:130)
at cmu.arktweetnlp.RunTagger.main(RunTagger.java:364)
Tagger works on the same input though.
me$ java -jar ark-tweet-nlp-0.3.2.jar --output-format conll /tmp/casual.txt
Detected text input format
@Cwallll @ 0.9989
@diddy_dance @ 0.9986
ikr ! 0.8143
smh G 0.9406
he O 0.9963
asked V 0.9979
fir P 0.5545
yo D 0.6272
last A 0.9871
name N 0.9998
so P 0.9838
he O 0.9981
can V 0.9997
add V 0.9997
u O 0.9978
on P 0.9426
fb ^ 0.9453
lololol ! 0.9664
:o E 0.9387
:/ E 0.9983
:'( E 0.9975
:o E 0.9964
(: E 0.9994
:) E 0.9997
.< E 0.9952
XD E 0.9938
-__- E 0.9956
o.O E 0.9899
;D E 0.9995
:-) E 0.9992
@_@ E 0.9964
:P E 0.9996
8D E 0.9961
: E 0.6925
1 $ 0.9194
:( E 0.9715
:D E 0.9996
=| E 0.9963
" , 0.6125
) , 0.9078
: , 0.6272
E 0.4920
.... , 0.8882
Hi,
Has anyone managed to remove / replace the components which make the library GPL, and lived to tell about it?
In particular, for only the POS Tagging, but also, in general?
Thanks!
I followed these steps to build the project and run the example for tagging:
git clone https://github.com/brendano/ark-tweet-nlp.git
cd ark-tweet-nlp/
mvn clean package
cp ark-tweet-nlp/target/original-ark-tweet-nlp-0.3.2.jar .
mv original-ark-tweet-nlp-0.3.2.jar ark-tweet-nlp-0.3.2.jar
./runTagger.sh examples/example_tweets.txt
I am getting the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonParseException
at cmu.arktweetnlp.impl.Model.loadModelFromText(Model.java:409)
at cmu.arktweetnlp.Tagger.loadModel(Tagger.java:40)
at cmu.arktweetnlp.RunTagger.runTagger(RunTagger.java:85)
at cmu.arktweetnlp.RunTagger.main(RunTagger.java:373)
Caused by: java.lang.ClassNotFoundException: com.fasterxml.jackson.core.JsonParseException
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 4 more
Any ideas?
Hey there!
I'd like to report a security issue but cannot find contact instructions on your repository.
If not a hassle, might you kindly add a SECURITY.md
file with an email, or another contact method? GitHub recommends this best practice to ensure security issues are responsibly disclosed, and it would serve as a simple instruction for security researchers in the future.
Thank you for your consideration, and I look forward to hearing from you!
(cc @huntr-helper)
Dear,
The tokenizer does a great job; however, it does not suit my needs for the task I have at hand. Is it possible to run only the POS tagger, without the tokenizer? If so, how should I approach this?
Kind regards,
Henry
The instructions in "hacking.txt" suggest running the "mvn package" command to populate jar dependencies in ark-tweet-nlp/src/target. The first error is that it generates the dependencies in ark-tweet-nlp/target, not in ark-tweet-nlp/src/target.
Further, it creates ark-tweet-nlp/target/com.googlecode.addjars.mojo.AddJarsMojo and populates three jars. But all of them are empty. For example this is what I get on my machine:
$ ls -l
total 8
-rwxrwx---+ 1 user_xyz None 0 Sep 8 14:36 ark-tweet-nlp-posBerkeley.jar
-rwxrwx---+ 1 user_xyz None 0 Sep 8 14:36 ark-tweet-nlp-stanford-postagger-2010-05-26.jar
-rwxrwx---+ 1 user_xyz None 0 Sep 8 14:36 ark-tweet-nlp-trove-3.0.0a5.jar
The POM files should be modified to produce the right jars here.
For instance, "pls RT__Tell" will be parsed to "pls R T__T ell".
I have an ad-hoc fix for now. It seems OK to me.
It seems that, regardless of the number I specify, the tagger always goes to the first column for data.
After I use mvn package to build ark-tweet-nlp-0.3.2.jar, it reports an IOException when I run ./runTagger.sh examples/example_tweets.txt.
Details:
Exception in thread "main" java.io.IOException: Neither file nor resource found for: /cmu/arktweetnlp/model.20120919
at cmu.arktweetnlp.util.BasicFileIO.openFileOrResource(BasicFileIO.java:250)
at cmu.arktweetnlp.impl.Model.loadModelFromText(Model.java:409)
at cmu.arktweetnlp.Tagger.loadModel(Tagger.java:40)
at cmu.arktweetnlp.RunTagger.runTagger(RunTagger.java:85)
at cmu.arktweetnlp.RunTagger.main(RunTagger.java:373)
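For anyone hitting this: the model appears to be loaded as a classpath resource, so the jar you run must actually contain /cmu/arktweetnlp/model.20120919 (the plain mvn package output may not, depending on which jar you pick up). A hypothetical stdlib-only sketch of a resource-then-file lookup, useful for checking which side of the fallback is failing:

```java
import java.io.File;
import java.io.InputStream;

public class ResourceCheck {
    // Report where a name resolves: "resource" (on the classpath),
    // "file" (on disk), or "missing".
    static String locate(String name) {
        InputStream in = ResourceCheck.class.getResourceAsStream(name);
        if (in != null) return "resource";
        if (new File(name).exists()) return "file";
        return "missing";
    }

    public static void main(String[] args) {
        System.out.println(locate("/cmu/arktweetnlp/model.20120919"));
    }
}
```

If this prints "missing" when run against your built jar, the model file simply isn't packaged into it.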
A musician must make music, an artist must paint.., to be ultimately at peace with himself. What a man can be, he must be ~Maslow
A small body of determined spirits fired by an unquenchable faith in their mission can alter the course of history.~Gandhi #quote
Never for the sake of peace and quiet deny your convictions ~ Dag Hammarskjold #quote
5020
@kz713twt Amanpour
2
Oh my god ! It was an wedding anniversary today, but I stayed at home unfortunately.
How do I run the algorithm with word clusters?
"\n --word-clusters Alternate word clusters file (see FeatureExtractor)" +
Which filename do we have to write here?
Thanks in advance
Could you please help me run your code from
https://github.com/kevinzzz007/ark-tweet-nlp
file ark-tweet-nlp-windows.zip
CMU ARK Twitter Part-of-Speech Tagger-Python wrapper for Windows http://www.ark.cs.cmu.edu/TweetNLP/
I only added a test line to your code at the end:
runString("lets go to store ")
and got this error message:
File "c:\Sander\mycode\pedrobalage-TwitterHybridClassifier_dec28\ark-tweet-nlp-windows\CMUTweetTaggerWindows.py", line 42, in <module>
runString("lets go to store ")
File "c:\Sander\mycode\pedrobalage-TwitterHybridClassifier_dec28\ark-tweet-nlp-windows\CMUTweetTaggerWindows.py", line 29, in runString
p = subprocess.Popen('java -XX:ParallelGCThreads=2 -Xmx500m -jar ark-tweet-nlp-0.3.2.jar ' + file_name,stdout=subprocess.PIPE)
File "C:\Anaconda\Lib\subprocess.py", line 710, in __init__
errread, errwrite)
File "C:\Anaconda\Lib\subprocess.py", line 958, in _execute_child
startupinfo)
WindowsError: [Error 2] The system cannot find the file specified
PS
Actually, I get the same error from this project:
https://github.com/pedrobalage/TwitterHybridClassifier/tree/master/Data/Lexicon/NRC-Hashtag-Sentiment-Lexicon-v0.1
so I tried to localise the problem by running your code first, to get ark-tweet-nlp working on a Windows machine.
The printouts, I guess, are for debugging purposes; they have no use for real usage.
Dipanjan is working on a fix