My understanding of the metric introduced by Deveaud et al. 2014 (section 3.2) differs from how it is implemented in the ldatuning package. However, I can't tell if my understanding is correct, since the authors of the paper didn't reply to my questions and don't provide any code. Still, I wanted to raise the following points that I stumbled upon:
In the ldatuning implementation, the divergence is calculated over the whole word distribution for each pair of topics (lines 254ff). However, my interpretation of the paper is that for any two topics k and k', the top n words in their word distributions are determined first (the sets W_k and W_k' in the paper). This doesn't happen in the implementation – there's no parameter n for the Deveaud2014 function. Furthermore, I think the divergence is only calculated for the subset of words that occur in the top-n lists of both topics, i.e. the intersection of W_k and W_k' (see the subscript of the sums in eq. 2).
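A minimal sketch of this reading, with toy word distributions and an illustrative n (none of these values come from the paper or the package): the comparison for a topic pair is restricted to the intersection of the two topics' top-n word sets.

```r
# Illustrative only: tiny word distributions for two topics.
top_n_words <- function(phi, n) names(sort(phi, decreasing = TRUE))[seq_len(n)]

phi_k      <- c(a = 0.40, b = 0.30, c = 0.20, d = 0.10)
phi_kprime <- c(b = 0.50, c = 0.30, e = 0.15, a = 0.05)

W_k      <- top_n_words(phi_k, 3)       # "a" "b" "c"
W_kprime <- top_n_words(phi_kprime, 3)  # "b" "c" "e"

intersect(W_k, W_kprime)                # "b" "c": the only words eq. 2 would sum over
```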
Apart from these possible issues in the implementation, I was wondering about two things in the paper, but as I said I couldn't reach the authors for a discussion. I'd still like to raise these questions here, because maybe someone else has an opinion about that:
The formula for the Jensen-Shannon divergence (JSD) in the paper is different from the one that is usually used: JSD(P||Q) = 1/2 * D(P||M) + 1/2 * D(Q||M), with M = 1/2 * (P+Q) and D(X||Y) being the Kullback-Leibler divergence. The paper doesn't explain why. I can see from the comments in the code that the author of ldatuning also stumbled over this.
What if W_k and W_k' are disjoint, i.e. no top word occurs in both topics of a pair? This will actually happen quite often with a large vocabulary, a low n and a high number of topics. In my understanding, the word distributions for the top words of the two topics then diverge completely, since the topics don't even share common top words. So I'd argue that in this case the divergence for such a pair of topics should be the upper bound of the JSD function (which is 1 when the log base is 2). The paper doesn't say anything about what should happen if W_k and W_k' were disjoint, so I guess they simply wouldn't add anything to the total divergence, i.e. if a pair of topics share no common top words, they don't diverge at all, which seems like strange reasoning to me.
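To illustrate the disjoint case, here is a minimal sketch of the standard JSD with base-2 logarithms; two distributions with disjoint support hit the upper bound of 1 (the vectors are toy values, not from the paper):

```r
# Standard Jensen-Shannon divergence with base-2 log (upper bound: 1).
jsd <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(x, y) sum(ifelse(x > 0, x * log2(x / y), 0))
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

p <- c(0.5, 0.5, 0.0, 0.0)  # top words of topic k
q <- c(0.0, 0.0, 0.5, 0.5)  # top words of topic k' (disjoint support)
jsd(p, q)                   # 1: maximal divergence
```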
I also wondered why they came up with their own metric anyway, since there were already several topic model evaluation metrics available at the time (Griffiths & Steyvers 2004, Cao et al. 2009, Wallach et al. 2009, Arun et al. 2010 and more). I don't see in the paper how they assessed the performance of their metric compared to the other metrics.
Hi,
very helpful package for identifying the optimal number of topics for my LDA models!
I am also using the biterm topic modelling (BTM) approach (see https://github.com/bnosac/BTM). Is this package also applicable to BTM?
In case not, an extension would be great.
Thank you!
I read the paper by Arun et al. that you cited in the implementation of the "Arun2010" metric. Two things in the ldatuning implementation were a little unclear to me (as they are in the paper, which is sometimes quite vague): the paper derives the second distribution from the quantity L*M2. I interpret this as norm(L*M2); however, the implementation appears to compute norm(L)*M2. Also, I wonder why the "M-norm" is used for normalization. I interpret the paper as using the L1-norm to yield cm2 (which then sums to 1). I haven't tried it out yet, but just from reading the source code, I believe that cm1 and cm2, as currently implemented, are not proper distributions to be used in a KL divergence. Please correct me if I'm wrong :)
It would be a great enhancement if verbose mode could indicate when it has finished fitting the model for each k in the topics parameter, even something as simple as "Model fit for k=50". Doing so would provide a de facto progress meter, allowing the user to see how quickly the FindTopicsNumber() job is progressing instead of having no idea when it will complete. It's okay if the k values appear out of order due to parallel processing; what's important is seeing how quickly the models are being fit.
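A minimal sketch of what this could look like (the function name and loop are illustrative, not the package's internals):

```r
# Emit one line as each model finishes; order may vary under parallelism.
report_fit <- function(k) message(sprintf("Model fit for k=%d", k))

for (k in c(10, 25, 50)) {
  # ... fit the LDA model for this k here ...
  report_fit(k)
}
```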
I was thinking it might be a good idea to return a list of models instead of a data.frame of the metrics, similar to how the caret package tunes models. My current workflow is to run FindTopicsNumber, which takes a long time, and then run the model again for the desired number of topics. It would be great if the models could be cached, or at least if there were a parameter to turn that feature on and off.
What do you think?
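As a rough sketch of the idea (names are illustrative, and a stand-in fitter replaces the real LDA call), the function could keep the fitted models in a list keyed by k:

```r
fit_all <- function(topics, fit_fun) {
  models <- lapply(topics, fit_fun)
  names(models) <- as.character(topics)
  models  # e.g. models[["50"]] can be reused later without re-fitting
}

m <- fit_all(c(2, 3), function(k) list(k = k))  # stand-in for topicmodels::LDA
names(m)  # "2" "3"
```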
Running FindTopicsNumber_plot returns this warning message:
The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the ldatuning package.
FindTopicNumber.Rd currently reads:
\item{topics}{Vvector with number of topics to compare different models.}
I think it should read:
\item{topics}{Vector with number of topics to compare different models.}
with one 'v' instead of two.
Thanks for the helpful package.
Thank you for the package.
I ran the package without any issues on macOS and Windows.
However, on an Apple M1 chip, the package produces an error.
Is there anyone in the same situation?
I am getting the following errors after running the code given in the vignette:
library(pacman)
p_load("tm", "SnowballCC", "RColorBrewer", "ggplot2", "wordcloud", "biclust",
"cluster", "igraph", "fpc", "Rcampdf")
p_load("topicmodels", "devtools", "ldatuning")
data("AssociatedPress", package="topicmodels")
dtm <- AssociatedPress[1:10, ]
result <- FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
knitr::kable(result)
Error in seq_len(m) : argument must be coercible to non-negative integer
In addition: Warning messages:
1: In rep(digits, length.out = m) :
first element used of 'length.out' argument
2: In seq_len(m) : first element used of 'length.out' argument
FindTopicsNumber_plot(result)
Error in subset.default(values, select = 2:ncol(values)) :
argument "subset" is missing, with no default
After running FindTopicsNumber_plot() I got this warning message:
Warning message:
The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2
3.3.4.
ℹ The deprecated feature was likely used in the ldatuning package.
Please report the issue at <https://github.com/nikita-moor/ldatuning/issues>.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
FindTopicsNumber_plot rescales the score calculations. It does so by rescaling all columns except for the topic numbers. If the LDA_model column is present, it generates an error on the call to rescale. Because the LDA_model column isn't needed for the plot, it needs to be dropped before the call to rescale.
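A minimal sketch of the fix (the data frame here is fabricated; only the column name LDA_model matches the package):

```r
values <- data.frame(topics = c(2, 3), Griffiths2004 = c(-1500, -1400))
values$LDA_model <- I(list("model2", "model3"))  # stand-in for fitted model objects

# Drop the model column before rescaling; only numeric columns remain.
plot_values <- values[, names(values) != "LDA_model", drop = FALSE]
names(plot_values)  # "topics" "Griffiths2004"
```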
Hi,
I am going to use your library to find the optimal number of topics for my data. I followed this tutorial: https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html and got the expected results. But when I tried to use my own data with the code below, I ran into an error.
I open my data via this code:
dtm <- read.csv("F:/Download/IoT App txt mining/Book2.txt")
Then:
result <- FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
Finally, I get the following error:
fit models...Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: invalid argument type
I am a beginner in R. Could you please help me?
Thank you!
Hello,
Thank you for developing such a useful software!
When I run FindTopicsNumber(), I get results normally for some data, but I get the following error for other data.
fit models... done.
calculate metrics:
Griffiths2004... done.
CaoJuan2009... done.
Arun2010...Error in FUN(X[[i]], ...) :
dims [product 71] do not match the length of object [80]
In addition: Warning message:
In cm1/cm2 :
longer object length is not a multiple of shorter object length
And here is the R script file that gave me the above error:
ldatuning_error.zip
If I exclude "Arun2010" from the "metrics" option, I get results normally without any errors.
My sessionInfo():
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=Japanese_Japan.932 LC_CTYPE=Japanese_Japan.932
[3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C
[5] LC_TIME=Japanese_Japan.932
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ldatuning_1.0.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 xml2_1.3.2 magrittr_2.0.1
[4] munsell_0.5.0 colorspace_2.0-2 tm_0.7-8
[7] R6_2.5.0 rlang_0.4.11 fansi_0.5.0
[10] tools_4.1.0 parallel_4.1.0 grid_4.1.0
[13] gtable_0.3.0 utf8_1.2.2 modeltools_0.2-23
[16] ellipsis_0.3.2 tibble_3.1.3 lifecycle_1.0.0
[19] crayon_1.4.1 gmp_0.6-2 NLP_0.2-1
[22] ggplot2_3.3.5 vctrs_0.3.8 glue_1.4.2
[25] slam_0.1-48 Rmpfr_0.8-4 compiler_4.1.0
[28] pillar_1.6.2 topicmodels_0.2-12 scales_1.1.1
[31] stats4_4.1.0 pkgconfig_2.0.3
I also get the same error with R 3.x.
Best.
I want to publish a new release, but your website nathanchaney.com and the corresponding email are not available.
Please contact me at my old address.
This is a great package! The only pain point is that it is really slow, and the slowest step seems to be the LDA fitting step from the topicmodels package. How easy would it be to port this over to text2vec (http://text2vec.org/) LDA? Its model fitting is extremely fast (about a 10x speed improvement). I think this grid-search type of approach would be extremely beneficial.
Hello,
I am trying to build a good topic model on a large Wikipedia dump (around 7GB in size) and would like to use ldatuning::FindTopicsNumber to get a rough idea of the optimal number of topics for this corpus. However, it seems that this piece of code runs for too long.
result <- ldatuning::FindTopicsNumber(
DTM,
topics = seq(from = 500, to = 1000, by = 500),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
verbose = TRUE,
mc.cores = 4L,
)
As a matter of fact, I let it run for more than two weeks until I was forced to kill the process, as it was taking too long and I needed the resources for other tasks I am working on. I have tried to run the same piece of code on a smaller dataset (around 30MB), but still no results. I think the given number of topics might be too large. Can you confirm my suspicion? Maybe you have some tips on how to efficiently calculate the optimal number of topics for relatively large datasets using your library, e.g. increasing the number of cores, a smaller set of metrics, the method used... That would help me a lot!
Thank you and keep up with the good work!
Hello,
I am using the FindTopicsNumber function and have run into an issue. I ran the following code:
result <- ldatuning::FindTopicsNumber(
nao_dtm,
topics = seq(from = 2, to = 20, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 5691),
mc.cores = 2L,
verbose = TRUE)
and my console simply stopped at the output "fit models..." without ever finishing the task or producing an error. My CPU didn't seem to be working on the task while I waited. The following are the features of my dtm:
<<DocumentTermMatrix (documents: 115, terms: 3972)>>
Non-/sparse entries: 11725/445055
Sparsity: 97%
Maximal term length: 14
Weighting: term frequency (tf)
The data size is rather small, so it should not be taking so long. I am using R 3.6.0 on macOS. The exact same code seems to run fine on two other devices, with R 3.6.0 on Windows and R 3.5.3 on macOS; both completed within a minute. I have tried upgrading all my packages, restarting the computer, etc.
Thank you!
I got the same error mentioned here
Is this a bug or is it something we did wrong in our code?
I cannot select the number of topics for a model built with the lda package (lda.collapsed.gibbs.sampler).
I build the document-term matrix from the documents and vocabulary using ldaformat2dtm from the topicmodels package:
dtm <- ldaformat2dtm(documents = documents, vocab = vocab)
result <- FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Arun2010"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
More than 9 hours have passed with no results. What could be the pitfall here?
Hello guys,
Pretty inexperienced person here...
I'm trying to use ldatuning to find the optimal number of topics for a database of Facebook comments.
I am running on:
MacOSX
R 4.0.2
Rstudio 1.3.1073
Every time I try to run with my data:
result <- FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
I get the error in the title.
I prepare my data with this:
tokens = tolower(mensaje_lda$texto)
tokens = word_tokenizer(tokens)
it = itoken(tokens, ids = mensaje$parrafo, progressbar = FALSE)
v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 10, doc_proportion_max = 0.5, doc_proportion_min = 0.001)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = "dgCMatrix")
I am following this guide https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
and when I try FindTopicsNumber on the AssociatedPress data, it actually runs... What am I doing wrong? Any help would be appreciated.
I'm getting this error when trying to run ldatuning:
fit models...Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: Each row of the input matrix needs to contain at least one non-zero entry
Calls: FindTopicsNumber ... clusterApply -> staticClusterApply -> checkForRemoteErrors
My corpus is pretty sparse, but I'm able to feed it to the lda and topicmodels packages without any issues:
<>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 60000
<<DocumentTermMatrix (documents: 60000, terms: 176674)>>
Non-/sparse entries: 3072914/10597367086
Sparsity : 100%
Maximal term length: 30
Weighting : term frequency (tf)
Fitting a model via Gibbs sampling tends to scale with the number of topics k, with larger values of k taking longer to fit. Because the user may be interested in tuning k over a large range, the models fit for smaller values of k will tend to take less time than those for larger values. The current use of parLapply may not be as efficient as a load-balanced approach in this scenario. The parallelization may be better implemented using clusterApplyLB, which in general does a better job of managing utilization when the distributed tasks take variable lengths of time.
Accordingly, would it be appropriate to modify FindTopicsNumber to take a logical argument lb, with parallelization similar to the following? If so, I'm happy to submit a PR.
# assumes a cluster object `cl` created earlier via parallel::makeCluster()
lda.fun <- function(x) {
  topicmodels::LDA(dtm, k = x, method = method, control = control)
}
if (lb) {
  # load-balanced: hand the next k to whichever worker frees up first
  models <- parallel::clusterApplyLB(cl, x = topics, fun = lda.fun)
} else {
  models <- parallel::parLapply(cl, X = topics, fun = lda.fun)
}
When attempting to execute FindTopicsNumber, I consistently get a fatal error. At first I thought there may be an issue with my data, but I also reproduce the issue when following the steps on the Select number of topics for LDA model article.
Here is my sessionInfo()
:
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] topicmodels_0.2-11 ldatuning_1.0.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 crayon_1.3.4 slam_0.1-47 grid_4.0.2 R6_2.4.1
[6] lifecycle_0.2.0 gtable_0.3.0 stats4_4.0.2 magrittr_1.5 scales_1.1.1
[11] ggplot2_3.3.2 pillar_1.4.6 rlang_0.4.7 NLP_0.2-0 xml2_1.3.2
[16] vctrs_0.3.4 ellipsis_0.3.1 glue_1.4.2 munsell_0.5.0 compiler_4.0.2
[21] tm_0.7-7 pkgconfig_2.0.3 colorspace_1.4-1 modeltools_0.2-23 tibble_3.0.3
And here are the commands I enter:
> data("AssociatedPress", package="topicmodels")
> dtm <- AssociatedPress[1:10, ]
> result <- FindTopicsNumber(
+ dtm,
+ topics = seq(from = 2, to = 15, by = 1),
+ metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
+ method = "Gibbs",
+ control = list(seed = 77),
+ mc.cores = 2L,
+ verbose = TRUE
+ )
And the response and traceback:
fit models... done.
calculate metrics:
Griffiths2004...
*** caught illegal operation ***
address 0x110468a51, cause 'illegal opcode'
Traceback:
1: initialize(value, ...)
2: initialize(value, ...)
3: new("mpfr", .Call(Arith_mpfr_d, e1, e2, .Arith.codes[.Generic]))
4: -Rmpfr::mpfr(x, prec = 2000L) + llMed
5: -Rmpfr::mpfr(x, prec = 2000L) + llMed
6: Rmpfr::mean(exp(-Rmpfr::mpfr(x, prec = 2000L) + llMed))
7: FUN(X[[i]], ...)
8: lapply(X = X, FUN = FUN, ...)
9: sapply(logLiks, function(x) { llMed <- stats::median(x) metric <- as.double(llMed - log(Rmpfr::mean(exp(-Rmpfr::mpfr(x, prec = 2000L) + llMed)))) return(metric)})
10: Griffiths2004(models, control)
11: FindTopicsNumber(dtm, topics = seq(from = 2, to = 15, by = 1), metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"), method = "Gibbs", control = list(seed = 77), mc.cores = 2L, verbose = TRUE)