trinker / textstem
Tools for fast text stemming & lemmatization
library(textstem)

dw <- c('driver', 'drive', 'drove', 'driven', 'drives', 'driving')
stem_words(dw)
lemmatize_words(dw)
bw <- c('are', 'am', 'being', 'been', 'be')
stem_words(bw)
lemmatize_words(bw)
library(magrittr)  # provides %>%

'aren\'t' %>%
  textclean::replace_contraction() %>%
  lemmatize_strings()
A copy of an email from [email protected]:
I was trying the hunspell engine and I noticed that,
if there is a EURO symbol in the text,
make_lemma_dictionary() gives an error.
make_lemma_dictionary("some text with € symbol",
engine = "hunspell", lang = "de_AT")
## Error in R_hunspell_stem(dictionary, words) :
## basic_string::_M_construct null not valid
I found the same behavior for
c("€", "…", "“", "„")
I am not sure if this is the intended behavior,
but if not, perhaps the easiest fix would be
to run intToUtf8() over some reasonable sequence of code points
and exclude all the symbols that cause trouble.
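As a workaround until the underlying hunspell call is guarded, the troublesome symbols can be stripped before building the dictionary. This is a sketch: clean_non_ascii() is a hypothetical helper, not part of textstem.

```r
# Drop characters that have no ASCII equivalent (EURO sign, smart quotes,
# ellipsis, ...) before handing text to the hunspell engine.
clean_non_ascii <- function(x) {
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

clean_non_ascii("some text with \u20ac symbol")
```

The cleaned text can then be passed to make_lemma_dictionary() as usual.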
make sure left column has no dupes
For example, if I run:
lemmatize_strings(c("this that also", "me too thanks", "also also"))
I get:
[1] "this that conjurer" "me too thank" "conjurer conjurer"
It's not a huge bug, but I was very confused about what was going on when I was running this on a larger corpus, and very curious about how it happened!
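This is what you would expect if the lemma table's left (token) column contained duplicate keys, with a later duplicate shadowing the intended row. A minimal sketch of the de-duplication on a toy table (the 'conjurer' row is invented for illustration):

```r
# Toy lemma table with a duplicated token key.
dict <- data.frame(
  token = c("also", "also", "thanks"),
  lemma = c("also", "conjurer", "thank"),
  stringsAsFactors = FALSE
)

# Keep only the first row per token so lookups are deterministic.
dict_clean <- dict[!duplicated(dict$token), ]
```

With the duplicate dropped, "also" maps back to itself instead of the stray second entry.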
lemmatize_strings("This is 34.546 above")
"This be 34. 546 above"
Thanks for an awesome package. Is there a reference for Mechura's (2016) English lemmatization list used in the package? I will of course be referencing textstem in my paper, but would also like a reference for the list, and can't find one on Google.
Many thanks!
I tried the code below, which I took from:
https://www.rdocumentation.org/packages/textstem/versions/0.1.4/topics/lemmatize_strings
I installed TreeTagger under C:\
The code is as below:
x <- c('the dirtier dog has eaten the pies', 'that shameful pooch is tricky and sneaky',
"He opened and then reopened the food bag", 'There are skies of blue and red roses too!',
NA, "The doggies, well they aren't joyfully running.", "The daddies are coming over...",
"This is 34.546 above")
lemma_dictionary <- make_lemma_dictionary(x, engine = 'treetagger')
ERROR MESSAGE (in Turkish):
Error in dplyr::filter([email protected][c("token", "lemma")], !lemma %in% :
"TT.res" adında bir slot yok ("kRp.text" sınıfında bir nesne için)
In English it says:
... there is no slot named "TT.res" (for an object of class "kRp.text")
Can you please help me solve this issue?
"As" is almost always the conjunction/preposition/whatever, and is rarely the plural of "a".
I am a long time user of the koRpus package and on a Windows machine. I recently upgraded my version of koRpus and when I run:
koRpus::treetag(c("run", "ran", "running"), treetagger="manual", format="obj",
TT.tknz=FALSE , lang="en",
TT.options=list(path="C:/TreeTagger", preset="en"))
I get the error:
Error: Specified file cannot be found:
C:/TreeTagger/cmd/utf8-tokenize.pl
Switching back to version 0.06-5 makes this go away. The reason is that on a Windows machine there is no cmd/utf8-tokenize.pl; the file is called cmd/utf8-tokenize.perl.
The textstem package that I maintain depends on koRpus for this functionality, so it concerns me that other Windows users are unable to utilize it.
I am using the most current treetagger version available from here: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
Thank you for your attention to this.
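Until koRpus looks for the right filename on Windows, one workaround is to copy the .perl script to the .pl name koRpus expects. This is a sketch: ensure_tokenizer_alias() is a hypothetical helper, and the C:/TreeTagger path comes from the report above.

```r
# Copy utf8-tokenize.perl to utf8-tokenize.pl so koRpus finds the tokenizer.
# Returns TRUE if the expected .pl file exists afterwards.
ensure_tokenizer_alias <- function(tt_cmd_dir) {
  src <- file.path(tt_cmd_dir, "utf8-tokenize.perl")
  dst <- file.path(tt_cmd_dir, "utf8-tokenize.pl")
  if (file.exists(src) && !file.exists(dst)) {
    file.copy(src, dst)
  }
  file.exists(dst)
}

# e.g. on the machine from the report:
# ensure_tokenizer_alias("C:/TreeTagger/cmd")
```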
I had cleaned and lemmatized a document, but it turns out the default lemmatization for the word 'second' is '2'. Not only is this confusing, it is incorrect in some situations. Numbers are usually removed in preprocessing, so seeing a number in the results is very confusing.
What if the word 'second' is in the text but being used as a measure of time?
I know this is using the lexicon package and a lemmatization table from Mechura, but I think this package might be the right place to correct this mistake.
There are a bunch of number conversions in the lemmatization table. Again, I think this is confusing for most use cases, where removing numbers is probably the norm. Perhaps there should be an option to use these number conversions or not. At the very least, 'second' should be fixed somehow, I think -- you would need the POS tag to lemmatize it properly.
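One way to opt out of the number conversions today is to filter purely numeric lemmas out of the dictionary before passing it to lemmatize_strings(). This is a sketch on a toy table; the real table is the Mechura list from the lexicon package, and the token/lemma column names are an assumption here.

```r
# Toy stand-in for the lemma table; real rows come from the lexicon package.
lemmas <- data.frame(
  token = c("second", "running", "thirds"),
  lemma = c("2", "run", "third"),
  stringsAsFactors = FALSE
)

# Drop rows whose lemma is purely numeric (e.g. "second" -> "2"),
# keeping only word-to-word mappings.
lemmas_no_numbers <- lemmas[!grepl("^[0-9.]+$", lemmas$lemma), ]
```

The filtered table can then be supplied via the dictionary argument of lemmatize_strings()/lemmatize_words().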
I am having an issue specifying the path to TreeTagger in make_lemma_dictionary(). No matter what directory I specify, it fails to find TreeTagger on Linux, saying "Error: Specified directory cannot be found". I am not sure whether this works on Linux at all.
Hi,
Firstly, thanks for a great and useful package! I've been experimenting with the make_lemma_dictionary function and was wondering if the addition of the following features would be helpful:
Because the text is separated into tokens before being sent to treetag(), some of the context is lost. Would it make sense to have an option to keep the text as is, i.e., full sentences? Here's an example: c("That food is really nice.", "That felt is really nice."). Because the token/line with 'felt' stands alone (the other terms already appear earlier), TreeTagger falls back to its default interpretation of 'felt' as a verb. Passing the full sentences to treetag() allows for the proper tagging.
I had some issues getting the treetag() function itself to work; potential bugs have been raised with koRpus's developer. I was wondering if a debug flag could be passed to treetag(), along with an option to unsuppress messages, so that users could diagnose problems.
Thanks!
Best,
Jay