
textstem's People

Contributors

kbenoit, trinker


textstem's Issues

Add to readme examples

library(textstem)
library(magrittr)  # for the pipe operator

## stemming vs. lemmatizing forms of "drive"
dw <- c('driver', 'drive', 'drove', 'driven', 'drives', 'driving')

stem_words(dw)
lemmatize_words(dw)

## stemming vs. lemmatizing forms of "be"
bw <- c('are', 'am', 'being', 'been', 'be')

stem_words(bw)
lemmatize_words(bw)

## expand contractions first, then lemmatize
'aren\'t' %>%
    textclean::replace_contraction() %>%
    lemmatize_strings()

Non-ASCII symbol causes error

Copied from an email from [email protected]:

I was trying the hunspell engine and noticed that if there is a euro symbol in the text, make_lemma_dictionary gives an error.

make_lemma_dictionary("some text with € symbol", 
                      engine = "hunspell", lang = "de_AT")
## Error in R_hunspell_stem(dictionary, words) : 
##   basic_string::_M_construct null not valid

I found the same behavior for

c("€", "…", "“", "„")

I am not sure if this is intended behavior, but if not, perhaps the easiest fix would be to run intToUtf8 over some reasonable sequence of code points and exclude all the symbols that cause trouble.
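
As a user-side workaround, one possible sketch (not the package's own fix; the iconv() step is an assumption) is to strip non-ASCII characters before building the dictionary:

library(textstem)

txt <- "some text with € symbol"
# drop characters that cannot be represented in ASCII (removes €, …, etc.)
txt_ascii <- iconv(txt, from = "UTF-8", to = "ASCII", sub = "")
make_lemma_dictionary(txt_ascii, engine = "hunspell", lang = "de_AT")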

Default dictionary for lemmatize_strings replaces "also" with "conjurer"

For example, if I run:

lemmatize_strings(c("this that also", "me too thanks", "also also"))

I get:

[1] "this that conjurer" "me too thank" "conjurer conjurer"

It's not a huge bug, but I was very confused about what was going on when I was running this on a larger corpus, and very curious about how it happened!
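
One possible workaround, sketched here under the assumption that lexicon::hash_lemmas is the default table and that lemmatize_strings() accepts a dictionary argument, is to drop the offending row before lemmatizing:

library(textstem)

# remove the token "also" from the default lemma table (workaround sketch)
my_dict <- lexicon::hash_lemmas[lexicon::hash_lemmas$token != "also", ]
lemmatize_strings(c("this that also", "me too thanks", "also also"),
                  dictionary = my_dict)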

Problem in lemma_dictionary <- make_lemma_dictionary(x, engine = 'treetagger')

I tried the code below, which I took from:

https://www.rdocumentation.org/packages/textstem/versions/0.1.4/topics/lemmatize_strings

I installed TreeTagger under C:\

The code is as follows:

x <- c('the dirtier dog has eaten the pies', 'that shameful pooch is tricky and sneaky',
"He opened and then reopened the food bag", 'There are skies of blue and red roses too!',
NA, "The doggies, well they aren't joyfully running.", "The daddies are coming over...",
"This is 34.546 above")

lemma_dictionary <- make_lemma_dictionary(x, engine = 'treetagger')

ERROR MESSAGE (in Turkish):
Error in dplyr::filter(<tagged object>@TT.res[c("token", "lemma")], !lemma %in% :
"TT.res" adında bir slot yok ("kRp.text" sınıfında bir nesne için)

In English, it says: there is no slot named "TT.res" (for an object of class "kRp.text").

Can you please help me solve this issue?

lemmatizes "as" to "a"

"As" is almost always the conjunction/preposition/whatever, and is rarely the plural of "a".

koRpus tree tagger broken

I am a long-time user of the koRpus package on a Windows machine. I recently upgraded my version of koRpus, and when I run:

koRpus::treetag(c("run", "ran", "running"), treetagger = "manual", format = "obj",
                TT.tknz = FALSE, lang = "en",
                TT.options = list(path = "C:/TreeTagger", preset = "en"))

I get the error:

 Error: Specified file cannot be found:
 C:/TreeTagger/cmd/utf8-tokenize.pl 

Switching back to version 0.06-5 makes this go away. The reason is that on a Windows machine there is no cmd/utf8-tokenize.pl; the file is called cmd/utf8-tokenize.perl.

The textstem package that I maintain depends on koRpus for this functionality, so it concerns me that other Windows users are unable to use it.

I am using the most current treetagger version available from here: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

Thank you for your attention to this.
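
Until koRpus itself is fixed, one possible user-side workaround (a sketch only, not an official fix) is to copy the tokenizer script to the file name koRpus expects:

# give koRpus the file name it looks for on Windows (workaround sketch)
file.copy("C:/TreeTagger/cmd/utf8-tokenize.perl",
          "C:/TreeTagger/cmd/utf8-tokenize.pl")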

lemmatization of numbers is confusing

I had cleaned and lemmatized a document, and it turns out the default lemmatization for the word 'second' is '2'. Not only is this confusing, it is incorrect in some situations. Numbers are usually removed in preprocessing, so seeing a number in the results is very confusing.

What if the word 'second' is in the text but is being used as a measure of time?

I know this is using the lexicon package and a lemmatization table from Mechura, but I think this package might be the right place to correct this mistake.

There are a bunch of number conversions in the lemmatization table. Again, I think this is confusing for most use cases, where removing numbers is probably the norm. Perhaps there should be an option to use these number conversions or not (a sketch of filtering them out is below). At the very least, 'second' should be fixed somehow; you would need the POS tag to lemmatize it properly.
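
Until such an option exists, one possible workaround (a sketch, assuming the default dictionary is lexicon::hash_lemmas) is to drop the rows whose lemma is a bare number:

library(textstem)

# keep only dictionary entries whose lemma is not purely numeric
no_numbers <- lexicon::hash_lemmas[!grepl("^[0-9.]+$", lexicon::hash_lemmas$lemma), ]
lemmatize_strings("wait a second", dictionary = no_numbers)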

Path for Treetag

I am having an issue specifying the path to TreeTagger in make_lemma_dictionary(). No matter what directory I specify, it fails to load TreeTagger on Linux. It says "Error: Specified directory cannot be found". I am not sure whether this works on Linux at all.
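
For reference, the failing call presumably looks roughly like this; the directory shown and the use of the path argument are assumptions based on the report, so check ?make_lemma_dictionary for the exact interface:

# sketch of the failing call on Linux (directory is an example only)
lemma_dictionary <- make_lemma_dictionary(x, engine = 'treetagger',
                                          path = '/opt/treetagger')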

Issues with make_lemma_dictionary for treetagger engine

Hi,

Firstly, thanks for a great and useful package! I've been experimenting with the make_lemma_dictionary function and was wondering if the addition of the following features would be helpful:

  1. Because the text is separated into tokens prior to its being sent into treetag(), some of the context is lost. Would it make sense to have an option to keep the text as is, i.e., full sentences? Here's an example: c("That food is really nice.","That felt is really nice."). Because the token/line with 'felt' is all by itself (as the other terms already appear), TreeTagger uses the default interpretation of felt as a verb. Passing in the full sentences to treetag() allows for the proper tagging.

  2. I had some issues getting the treetag() function itself to work; potential bugs have been raised with koRpus's developer. I was wondering if a debug flag could be passed to treetag(), as well as an option to not suppress messages, so that users can diagnose problems.

Thanks!

Best,
Jay
