trinker / textstem
Tools for fast text stemming & lemmatization
library(textstem)

dw <- c('driver', 'drive', 'drove', 'driven', 'drives', 'driving')
stem_words(dw)
lemmatize_words(dw)
bw <- c('are', 'am', 'being', 'been', 'be')
stem_words(bw)
lemmatize_words(bw)
library(magrittr)  # provides %>%

'aren\'t' %>%
  textclean::replace_contraction() %>%
  lemmatize_strings()
A copy of an email from [email protected]:
I was trying the hunspell engine and I noticed that,
if there is a EURO symbol in the text,
make_lemma_dictionary() gives an error.
make_lemma_dictionary("some text with € symbol",
engine = "hunspell", lang = "de_AT")
## Error in R_hunspell_stem(dictionary, words) :
## basic_string::_M_construct null not valid
I found the same behavior for
c("€", "…", "“", "„")
I am not sure if this is the intended behavior,
but if not, perhaps the easiest fix would be
to run intToUtf8() over some reasonable sequence of code points
and exclude all the symbols that cause trouble.
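As a workaround until the underlying hunspell call is guarded, the troublesome symbols can be stripped before building the dictionary. This is a sketch: clean_non_ascii() is a hypothetical helper, not part of textstem.

```r
# Drop characters that have no ASCII equivalent (EURO sign, smart quotes,
# ellipsis, ...) before handing text to the hunspell engine.
clean_non_ascii <- function(x) {
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

clean_non_ascii("some text with \u20ac symbol")
```

The cleaned text can then be passed to make_lemma_dictionary() as usual.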
make sure left column has no dupes
For example, if I run:
lemmatize_strings(c("this that also", "me too thanks", "also also"))
I get:
[1] "this that conjurer" "me too thank" "conjurer conjurer"
It's not a huge bug, but I was very confused about what was going on when I was running this on a larger corpus, and very curious about how it happened!
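This is what you would expect if the lemma table's left (token) column contained duplicate keys, with a later duplicate shadowing the intended row. A minimal sketch of the de-duplication on a toy table (the 'conjurer' row is invented for illustration):

```r
# Toy lemma table with a duplicated token key.
dict <- data.frame(
  token = c("also", "also", "thanks"),
  lemma = c("also", "conjurer", "thank"),
  stringsAsFactors = FALSE
)

# Keep only the first row per token so lookups are deterministic.
dict_clean <- dict[!duplicated(dict$token), ]
```

With the duplicate dropped, "also" maps back to itself instead of the stray second entry.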
lemmatize_strings("This is 34.546 above")
"This be 34. 546 above"
Thanks for an awesome package. Is there a reference for Mechura's (2016) English lemmatization list used in the package? I will of course be referencing textstem in my paper, but would also like a reference for the list, and can't find one on Google.
Many thanks!
I tried the code below, which I took from:
https://www.rdocumentation.org/packages/textstem/versions/0.1.4/topics/lemmatize_strings
I installed TreeTagger under C:\
The code is as below:
x <- c('the dirtier dog has eaten the pies', 'that shameful pooch is tricky and sneaky',
"He opened and then reopened the food bag", 'There are skies of blue and red roses too!',
NA, "The doggies, well they aren't joyfully running.", "The daddies are coming over...",
"This is 34.546 above")
lemma_dictionary <- make_lemma_dictionary(x, engine = 'treetagger')
ERROR MESSAGE (in Turkish):
Error in dplyr::filter([email protected][c("token", "lemma")], !lemma %in% :
"TT.res" adında bir slot yok ("kRp.text" sınıfında bir nesne için)
In English it says:
... there is no slot named "TT.res" (for an object of class "kRp.text")
Can you please help me solve this issue?
"As" is almost always the conjunction/preposition/whatever, and is rarely the plural of "a".
I am a long time user of the koRpus package and on a Windows machine. I recently upgraded my version of koRpus and when I run:
koRpus::treetag(c("run", "ran", "running"), treetagger="manual", format="obj",
TT.tknz=FALSE , lang="en",
TT.options=list(path="C:/TreeTagger", preset="en"))
I get the error:
Error: Specified file cannot be found:
C:/TreeTagger/cmd/utf8-tokenize.pl
Switching back to version 0.06-5 makes this go away. The reason is that on a Windows machine there is no cmd/utf8-tokenize.pl; the file is called cmd/utf8-tokenize.perl.
The textstem package that I maintain depends on koRpus for this functionality, so it concerns me that other Windows users are unable to utilize it.
I am using the most current treetagger version available from here: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
Thank you for your attention to this.
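Until koRpus looks for the right filename on Windows, one workaround is to copy the .perl script to the .pl name koRpus expects. This is a sketch: ensure_tokenizer_alias() is a hypothetical helper, and the C:/TreeTagger path comes from the report above.

```r
# Copy utf8-tokenize.perl to utf8-tokenize.pl so koRpus finds the tokenizer.
# Returns TRUE if the expected .pl file exists afterwards.
ensure_tokenizer_alias <- function(tt_cmd_dir) {
  src <- file.path(tt_cmd_dir, "utf8-tokenize.perl")
  dst <- file.path(tt_cmd_dir, "utf8-tokenize.pl")
  if (file.exists(src) && !file.exists(dst)) {
    file.copy(src, dst)
  }
  file.exists(dst)
}

# e.g. on the machine from the report:
# ensure_tokenizer_alias("C:/TreeTagger/cmd")
```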
I had cleaned and lemmatized a document, but it turns out the default lemmatization for the word 'second' is '2'. Not only is this confusing, it is incorrect in some situations. Numbers are usually removed in preprocessing, so seeing a number in the results is very confusing.
What if the word 'second' is in the text but being used as a measure of time?
I know this is using the lexicon package and a lemmatization table from Mechura, but I think this package might be the right place to correct this mistake.
There are a bunch of number conversions in the lemmatization table. Again, I think this is confusing for most use cases, where removing numbers is probably the norm. Perhaps there should be an option to use these number conversions or not. At the very least, 'second' should be fixed somehow, I think -- you would need the POS tag to lemmatize it properly.
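One way to opt out of the number conversions today is to filter purely numeric lemmas out of the dictionary before passing it to lemmatize_strings(). This is a sketch on a toy table; the real table is the Mechura list from the lexicon package, and the token/lemma column names are an assumption here.

```r
# Toy stand-in for the lemma table; real rows come from the lexicon package.
lemmas <- data.frame(
  token = c("second", "running", "thirds"),
  lemma = c("2", "run", "third"),
  stringsAsFactors = FALSE
)

# Drop rows whose lemma is purely numeric (e.g. "second" -> "2"),
# keeping only word-to-word mappings.
lemmas_no_numbers <- lemmas[!grepl("^[0-9.]+$", lemmas$lemma), ]
```

The filtered table can then be supplied via the dictionary argument of lemmatize_strings()/lemmatize_words().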
I am having an issue specifying the path to TreeTagger in make_lemma_dictionary(). No matter what directory I specify, it fails to find TreeTagger on Linux, saying "Error: Specified directory cannot be found". I am not sure whether this works on Linux at all.
Hi,
Firstly, thanks for a great and useful package! I've been experimenting with the make_lemma_dictionary function and was wondering if the addition of the following features would be helpful:
Because the text is separated into tokens before being sent to treetag(), some of the context is lost. Would it make sense to have an option to keep the text as is, i.e., full sentences? Here's an example: c("That food is really nice.", "That felt is really nice."). Because the token/line with 'felt' stands alone (the other terms already appear earlier), TreeTagger falls back to its default interpretation of 'felt' as a verb. Passing the full sentences to treetag() allows for the proper tagging.
I had some issues getting the treetag() function itself to work; potential bugs have been raised with koRpus's developer. I was wondering if a debug flag could be passed to treetag(), along with an option to unsuppress messages, so that users could diagnose problems.
Thanks!
Best,
Jay