My understanding of the metric introduced by Deveaud et al. 2014 (section 3.2) differs from how it is implemented in the ldatuning package. However, I can't tell if my understanding is correct, since the authors of the paper didn't reply to my questions and don't provide any code. Still, I wanted to raise the following points that I stumbled upon:
In the ldatuning implementation, the divergence is calculated over the whole word distribution for each pair of topics (lines 254ff). However, my interpretation of the paper is that for any two topics k and k', the top n words in their word distributions are determined first (the sets W_k and W_k' in the paper). This doesn't happen in the implementation – there's no parameter n for the Deveaud2014 function. Furthermore, I think the divergence is only calculated for the subset of words that occur in the top-n lists of both topics, i.e. the intersection of W_k and W_k' (see the subscript of the sums in eq. 2).
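A minimal sketch of this reading, with toy word distributions and an illustrative n (none of these values come from the paper or the package): the comparison for a topic pair is restricted to the intersection of the two topics' top-n word sets.

```r
# Illustrative only: tiny word distributions for two topics.
top_n_words <- function(phi, n) names(sort(phi, decreasing = TRUE))[seq_len(n)]

phi_k      <- c(a = 0.40, b = 0.30, c = 0.20, d = 0.10)
phi_kprime <- c(b = 0.50, c = 0.30, e = 0.15, a = 0.05)

W_k      <- top_n_words(phi_k, 3)       # "a" "b" "c"
W_kprime <- top_n_words(phi_kprime, 3)  # "b" "c" "e"

intersect(W_k, W_kprime)                # "b" "c": the only words eq. 2 would sum over
```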
Apart from these possible issues in the implementation, I was wondering about two things in the paper, but as I said I couldn't reach the authors for a discussion. I'd still like to raise these questions here, because maybe someone else has an opinion about that:
The formula for the Jensen-Shannon divergence (JSD) in the paper is different from the one that is usually used: JSD(P||Q) = 1/2 * D(P||M) + 1/2 * D(Q||M), with M = 1/2 * (P+Q) and D(X||Y) being the Kullback-Leibler divergence. The paper doesn't explain why. I can see from the comments in the code that the author of ldatuning also stumbled over this.
What if W_k and W_k' are disjoint, i.e. no top word occurs in both topics of a pair? This will actually happen quite often with a large vocabulary, a low n and a high number of topics. In my understanding, the word distributions for the top words of the two topics then diverge completely, since the topics don't even share common top words. So I'd argue that in this case the divergence for such a pair of topics should be the upper bound of the JSD function (which is 1 when the log base is 2). The paper doesn't say anything about what should happen if W_k and W_k' were disjoint, so I guess they simply wouldn't add anything to the total divergence, i.e. if a pair of topics share no common top words, they don't diverge at all, which seems like strange reasoning to me.
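To illustrate the disjoint case, here is a minimal sketch of the standard JSD with base-2 logarithms; two distributions with disjoint support hit the upper bound of 1 (the vectors are toy values, not from the paper):

```r
# Standard Jensen-Shannon divergence with base-2 log (upper bound: 1).
jsd <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(x, y) sum(ifelse(x > 0, x * log2(x / y), 0))
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

p <- c(0.5, 0.5, 0.0, 0.0)  # top words of topic k
q <- c(0.0, 0.0, 0.5, 0.5)  # top words of topic k' (disjoint support)
jsd(p, q)                   # 1: maximal divergence
```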
I also wondered why they came up with their own metric anyway, since there were already several topic model evaluation metrics available at the time (Griffiths & Steyvers 2004, Cao et al. 2009, Wallach et al. 2009, Arun et al. 2010 and more). I don't see in the paper how they assessed the performance of their metric compared to the other metrics.
Hi,
very helpful package for identifying the optimal number of topics for my LDA models!
I am also using the biterm topic modelling (BTM) approach (see https://github.com/bnosac/BTM). Is this package also applicable to BTM?
In case not, an extension would be great.
Thank you!
I read the paper by Arun et al. that you cited in the implementation of the "Arun2010" metric. Two things in the ldatuning implementation were a little unclear to me (as they are in the paper, which is sometimes quite vague): the paper derives the second distribution from the quantity L*M2. I interpret this as norm(L*M2); however, the implementation appears to compute norm(L)*M2. Also, I wonder why the "M-norm" is used for normalization. I interpret the paper as using the L1-norm to yield cm2 (which then sums to 1). I haven't tried it out yet, but just from reading the source code, I believe that cm1 and cm2, as currently implemented, are not proper distributions to be used in a KL divergence. Please correct me if I'm wrong :)
It would be a great enhancement if verbose mode could indicate when it has finished fitting the model for each k in the topics parameter, even something as simple as "Model fit for k=50". Doing so would provide a de facto progress meter, allowing the user to see how quickly the FindTopicsNumber() job is progressing instead of having no idea when it will complete. It's okay if the k values appear out of order due to parallel processing; what's important is seeing how quickly the models are being fit.
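A minimal sketch of what this could look like (the function name and loop are illustrative, not the package's internals):

```r
# Emit one line as each model finishes; order may vary under parallelism.
report_fit <- function(k) message(sprintf("Model fit for k=%d", k))

for (k in c(10, 25, 50)) {
  # ... fit the LDA model for this k here ...
  report_fit(k)
}
```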
I was thinking it might be a good idea to return a list of models instead of a data.frame of the metrics, similar to how the caret package tunes models. My current workflow is to run FindTopicsNumber, which takes a long time, and then run the model again for the desired number of topics. It would be great if the models could be cached, or at least if there were a parameter to turn that feature on and off.
What do you think?
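As a rough sketch of the idea (names are illustrative, and a stand-in fitter replaces the real LDA call), the function could keep the fitted models in a list keyed by k:

```r
fit_all <- function(topics, fit_fun) {
  models <- lapply(topics, fit_fun)
  names(models) <- as.character(topics)
  models  # e.g. models[["50"]] can be reused later without re-fitting
}

m <- fit_all(c(2, 3), function(k) list(k = k))  # stand-in for topicmodels::LDA
names(m)  # "2" "3"
```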
Running FindTopicsNumber_plot returns this warning message:
The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the ldatuning package.
FindTopicNumber.Rd currently reads:
\item{topics}{Vvector with number of topics to compare different models.}
I think it should read:
\item{topics}{Vector with number of topics to compare different models.}
with one 'v' instead of two.
Thanks for the helpful package.
Thank you for the package.
I ran the package without any issues on macOS and Windows.
However, on an Apple M1 chip, the package produces an error.
Is there anyone in the same situation?
I am getting the following errors after running the code given in the vignette:
library(pacman)
p_load("tm", "SnowballCC", "RColorBrewer", "ggplot2", "wordcloud", "biclust",
"cluster", "igraph", "fpc", "Rcampdf")
p_load("topicmodels", "devtools", "ldatuning")
data("AssociatedPress", package="topicmodels")
dtm <- AssociatedPress[1:10, ]
result <- FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
knitr::kable(result)
Error in seq_len(m) : argument must be coercible to non-negative integer
In addition: Warning messages:
1: In rep(digits, length.out = m) :
first element used of 'length.out' argument
2: In seq_len(m) : first element used of 'length.out' argument
FindTopicsNumber_plot(result)
Error in subset.default(values, select = 2:ncol(values)) :
argument "subset" is missing, with no default
After running FindTopicsNumber_plot() I got this warning message:
Warning message:
The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2
3.3.4.
ℹ The deprecated feature was likely used in the ldatuning package.
Please report the issue at <https://github.com/nikita-moor/ldatuning/issues>.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
FindTopicsNumber_plot rescales the score calculations. It does so by rescaling all columns except for the topic numbers. If the LDA_model column is present, it generates an error on the call to rescale. Because the LDA_model column isn't needed for the plot, it needs to be dropped before the call to rescale.
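A minimal sketch of the fix (the data frame here is fabricated; only the column name LDA_model matches the package):

```r
values <- data.frame(topics = c(2, 3), Griffiths2004 = c(-1500, -1400))
values$LDA_model <- I(list("model2", "model3"))  # stand-in for fitted model objects

# Drop the model column before rescaling; only numeric columns remain.
plot_values <- values[, names(values) != "LDA_model", drop = FALSE]
names(plot_values)  # "topics" "Griffiths2004"
```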
Hi,
I am going to use your library to find the optimal number of topics for my data. I followed this tutorial: https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html and got the expected results. But when I tried to use my own data with the code below, I ran into an error.
I open my data via this code:
dtm <- read.csv("F:/Download/IoT App txt mining/Book2.txt")
Then:
result <- FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
Finally, I get the following error:
fit models...Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: invalid argument type
I am a beginner in R. Could you please help me?
Thank you!
Hello,
Thank you for developing such a useful software!
When I run FindTopicsNumber(), I get results normally for some data, but I get the following error for other data.
fit models... done.
calculate metrics:
Griffiths2004... done.
CaoJuan2009... done.
Arun2010...Error in FUN(X[[i]], ...) :
dims [product 71] do not match the length of object [80]
In addition: Warning message:
In cm1/cm2 :
longer object length is not a multiple of shorter object length
And here is the R script file that gave me the above error:
ldatuning_error.zip
If I exclude "Arun2010" from the "metrics" option, I get results normally without any errors.
My sessionInfo():
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=Japanese_Japan.932 LC_CTYPE=Japanese_Japan.932
[3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C
[5] LC_TIME=Japanese_Japan.932
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ldatuning_1.0.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 xml2_1.3.2 magrittr_2.0.1
[4] munsell_0.5.0 colorspace_2.0-2 tm_0.7-8
[7] R6_2.5.0 rlang_0.4.11 fansi_0.5.0
[10] tools_4.1.0 parallel_4.1.0 grid_4.1.0
[13] gtable_0.3.0 utf8_1.2.2 modeltools_0.2-23
[16] ellipsis_0.3.2 tibble_3.1.3 lifecycle_1.0.0
[19] crayon_1.4.1 gmp_0.6-2 NLP_0.2-1
[22] ggplot2_3.3.5 vctrs_0.3.8 glue_1.4.2
[25] slam_0.1-48 Rmpfr_0.8-4 compiler_4.1.0
[28] pillar_1.6.2 topicmodels_0.2-12 scales_1.1.1
[31] stats4_4.1.0 pkgconfig_2.0.3
I also get the same error with R 3.x.
Best.
I want to publish a new release, but your website nathanchaney.com and the corresponding email are not available.
Please contact me at my old address.
This is a great package! The only pain point is that it is really slow, and the slowest step seems to be the LDA fitting step from the topicmodels package. How easy would it be to port this over to text2vec (http://text2vec.org/) LDA? Its model fitting is extremely fast (about a 10x speed improvement). I think this grid-search type of approach would be extremely beneficial.
Hello,
I am trying to build a good topic model on a large Wikipedia dump (around 7GB in size) and would like to use ldatuning::FindTopicsNumber to get a rough idea of the optimal number of topics for this corpus. However, it seems that this piece of code runs for too long.
result <- ldatuning::FindTopicsNumber(
DTM,
topics = seq(from = 500, to = 1000, by = 500),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
verbose = TRUE,
mc.cores = 4L,
)
As a matter of fact, I let it run for more than two weeks until I was forced to kill the process, as it was taking too long and I needed the resources for other tasks I am working on. I have tried to run the same piece of code on a smaller dataset (around 30MB), but still no results. I think the given number of topics might be too large. Can you confirm my suspicion? Maybe you have some tips on how to efficiently calculate the optimal number of topics for relatively large datasets using your library, e.g. increasing the number of cores, a smaller set of metrics, the method used... That would help me a lot!
Thank you and keep up with the good work!
Hello,
I am using the FindTopicsNumber function and have run into an issue. I ran the following code:
result <- ldatuning::FindTopicsNumber(
nao_dtm,
topics = seq(from = 2, to = 20, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 5691),
mc.cores = 2L,
verbose = TRUE)
and my console simply stopped at the output "fit models..." without ever finishing the task or producing an error. My CPU didn't seem to be working on the task while I waited. The following are the features of my dtm:
<<DocumentTermMatrix (documents: 115, terms: 3972)>>
Non-/sparse entries: 11725/445055
Sparsity: 97%
Maximal term length: 14
Weighting: term frequency (tf)
The data size is rather small, so it should not be taking so long. I am using R 3.6.0 on macOS. The exact same code seems to run fine on two other devices, with R 3.6.0 on Windows and R 3.5.3 on macOS; both completed within a minute. I have tried upgrading all my packages, restarting the computer, etc.
Thank you!
I got the same error mentioned here
Is this a bug or is it something we did wrong in our code?
I cannot select the number of topics for a model built with the lda package (lda.collapsed.gibbs.sampler).
I build the document-term matrix from the documents and vocabulary using ldaformat2dtm from the topicmodels package:
dtm <- ldaformat2dtm(documents = documents, vocab = vocab)
result <- FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Arun2010"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
More than 9 hours have passed with no results. What could be the pitfall here?
Hello guys,
Pretty inexperienced person here...
I'm trying to use ldatuning to find the optimal number of topics for a database of Facebook comments.
I am running on:
MacOSX
R 4.0.2
Rstudio 1.3.1073
Every time I try to run with my data:
result <- FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
I get the error in the title.
I prepare my data with this:
tokens = tolower(mensaje_lda$texto)
tokens = word_tokenizer(tokens)
it = itoken(tokens, ids = mensaje$parrafo, progressbar = FALSE)
v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 10, doc_proportion_max = 0.5, doc_proportion_min = 0.001)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = "dgCMatrix")
I am following this guide https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
and when I try FindTopicsNumber on the AssociatedPress data, it actually runs... What am I doing wrong? Any help would be appreciated.
I'm getting this error when trying to run ldatuning:
fit models...Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: Each row of the input matrix needs to contain at least one non-zero entry
Calls: FindTopicsNumber ... clusterApply -> staticClusterApply -> checkForRemoteErrors
My corpus is pretty sparse, but I'm able to feed it to the lda and topicmodels packages without any issues:
<>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 60000
<<DocumentTermMatrix (documents: 60000, terms: 176674)>>
Non-/sparse entries: 3072914/10597367086
Sparsity : 100%
Maximal term length: 30
Weighting : term frequency (tf)
Fitting a model via Gibbs sampling tends to scale with the number of topics k, with larger values of k taking longer to fit. Because the user may be interested in tuning k over a large range, the models fit for smaller values of k will tend to take less time than those for larger values. The current use of parLapply may not be as efficient as a load-balanced approach in this scenario. The parallelization may be better implemented using clusterApplyLB, which in general does a better job of managing utilization when the distributed tasks take variable lengths of time.
Accordingly, would it be appropriate to modify FindTopicsNumber to take a logical argument lb, with parallelization similar to the following? If so, I'm happy to submit a PR.
# assumes a cluster object `cl` created earlier via parallel::makeCluster()
lda.fun <- function(x) {
  topicmodels::LDA(dtm, k = x, method = method, control = control)
}
if (lb) {
  # load-balanced: hand the next k to whichever worker frees up first
  models <- parallel::clusterApplyLB(cl, x = topics, fun = lda.fun)
} else {
  models <- parallel::parLapply(cl, X = topics, fun = lda.fun)
}
When attempting to execute FindTopicsNumber, I consistently get a fatal error. At first I thought there may be an issue with my data, but I also reproduce the issue when following the steps on the Select number of topics for LDA model article.
Here is my sessionInfo()
:
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] topicmodels_0.2-11 ldatuning_1.0.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 crayon_1.3.4 slam_0.1-47 grid_4.0.2 R6_2.4.1
[6] lifecycle_0.2.0 gtable_0.3.0 stats4_4.0.2 magrittr_1.5 scales_1.1.1
[11] ggplot2_3.3.2 pillar_1.4.6 rlang_0.4.7 NLP_0.2-0 xml2_1.3.2
[16] vctrs_0.3.4 ellipsis_0.3.1 glue_1.4.2 munsell_0.5.0 compiler_4.0.2
[21] tm_0.7-7 pkgconfig_2.0.3 colorspace_1.4-1 modeltools_0.2-23 tibble_3.0.3
And here are the commands I enter:
> data("AssociatedPress", package="topicmodels")
> dtm <- AssociatedPress[1:10, ]
> result <- FindTopicsNumber(
+ dtm,
+ topics = seq(from = 2, to = 15, by = 1),
+ metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
+ method = "Gibbs",
+ control = list(seed = 77),
+ mc.cores = 2L,
+ verbose = TRUE
+ )
And the response and traceback:
fit models... done.
calculate metrics:
Griffiths2004...
*** caught illegal operation ***
address 0x110468a51, cause 'illegal opcode'
Traceback:
1: initialize(value, ...)
2: initialize(value, ...)
3: new("mpfr", .Call(Arith_mpfr_d, e1, e2, .Arith.codes[.Generic]))
4: -Rmpfr::mpfr(x, prec = 2000L) + llMed
5: -Rmpfr::mpfr(x, prec = 2000L) + llMed
6: Rmpfr::mean(exp(-Rmpfr::mpfr(x, prec = 2000L) + llMed))
7: FUN(X[[i]], ...)
8: lapply(X = X, FUN = FUN, ...)
9: sapply(logLiks, function(x) { llMed <- stats::median(x) metric <- as.double(llMed - log(Rmpfr::mean(exp(-Rmpfr::mpfr(x, prec = 2000L) + llMed)))) return(metric)})
10: Griffiths2004(models, control)
11: FindTopicsNumber(dtm, topics = seq(from = 2, to = 15, by = 1), metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"), method = "Gibbs", control = list(seed = 77), mc.cores = 2L, verbose = TRUE)