yumeng5 / topclus

79.0 2.0 8.0 110.08 MB

[WWW 2022] Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations

License: Apache License 2.0

Shell 1.36% Python 98.64%

language-model pretrained-language-model topic-modeling topic-discovery clustering

topclus's Issues

A request for evaluation script

Hello,
Could you please share the script used for evaluation? The obtained results may differ with different evaluation scripts.
Thanks a lot!

Best wishes,
Lei

cluster loss is 0 on my own dataset

Dear authors,
Thanks for your excellent work! I'm running the model on my own dataset and trying to understand the code. However, I always get cluster_loss = 0. I am using a Twitter dataset, which is noisy and sparse, and I have changed the BERT sequence length to 128 accordingly in the preprocessing step. I have tried several different batch sizes and learning rates. (It seems to work fine on the given Yelp examples.) Does anyone have the same problem, or perhaps know why? Thanks in advance!

Request for using TopClus on different pretrained language models

Hi,

I've read your paper and I like this approach. Thank you for sharing the code. I've one question regarding the pretrained language models (PLMs) that you use for getting the contextualized word representations. I saw in the source code that the model you use is fixed, and it's the classical 'bert-base-uncased':

TopClus/src/trainer.py

Line 22 in 01e22fb

pretrained_lm = 'bert-base-uncased'

Suppose I'm interested in using this method on a corpus of Italian texts. In that case, would it be possible to change this model and use bert-base-multilingual-uncased instead?

If that's possible, can we make pretrained_lm a parameter of the TopClusTrainer?

Thank you.
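
The request above could be sketched roughly as follows. This is only an illustration of exposing the model name as a command-line option: `TrainerSketch`, `build_parser`, and the `--pretrained_lm` flag are hypothetical names, not the repository's actual code, and any tokenizer, vocabulary, or preprocessed data tied to the old model would presumably need to be regenerated for the new one.

```python
import argparse


class TrainerSketch:
    """Hypothetical stand-in for TopClusTrainer; not the repository's class."""

    def __init__(self, args):
        # The issue quotes this value as hard-coded ('bert-base-uncased')
        # in src/trainer.py; here it is read from the parsed arguments.
        self.pretrained_lm = args.pretrained_lm


def build_parser():
    parser = argparse.ArgumentParser()
    # Assumed flag name; the default preserves the current behaviour.
    parser.add_argument(
        "--pretrained_lm",
        default="bert-base-uncased",
        help="name or local path of the pretrained language model",
    )
    return parser


# No flag given: falls back to the original model.
default_trainer = TrainerSketch(build_parser().parse_args([]))

# Flag given: e.g. a multilingual model for an Italian corpus.
italian_trainer = TrainerSketch(
    build_parser().parse_args(["--pretrained_lm", "bert-base-multilingual-uncased"])
)

print(default_trainer.pretrained_lm)
print(italian_trainer.pretrained_lm)
```

Keeping `bert-base-uncased` as the default means existing commands would behave as before.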

Question for NMI measurement

Hi, thank you for sharing the code:)

Since I am a novice here, this may be a trivial question, but it seems that NMI is measured on the training dataset.

Is this a typical setting for measuring NMI (i.e., measuring it on the training set)?

Thank you
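
For context, here is a minimal sketch of what NMI computes. It is not the authors' evaluation script: it is a plain-Python version using arithmetic-mean normalization, and the label lists are made-up toy data. NMI compares predicted cluster assignments with gold labels, and the gold labels enter only at scoring time, not when the clusters are fitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two label assignments."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """Normalized mutual information with arithmetic-mean normalization."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both assignments put everything in a single group
    return mutual_information(a, b) / ((h_a + h_b) / 2)


gold = [0, 0, 1, 1]           # toy gold labels
same_partition = [1, 1, 0, 0]  # same grouping, different cluster ids
unrelated = [0, 1, 0, 1]       # grouping independent of the gold labels

score_same = nmi(gold, same_partition)
score_unrelated = nmi(gold, unrelated)
print(score_same, score_unrelated)
```

The score depends only on how the documents are grouped, not on the cluster ids themselves.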

Question on AE pretraining part

TopClus/src/trainer.py

Line 79 in 01e22fb

 optimizer = Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=self.args.lr) 

Thank you for sharing the code. As far as I understand, this stage only trains the autoencoder parameters, so it seems that

'''
Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=self.args.lr)
'''
should be converted to

'''
Adam(filter(lambda p: p.requires_grad, model.ae.parameters()), lr=self.args.lr).
'''

If it is not, please let me know.
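
One way to check which parameters the `requires_grad` filter actually hands to the optimizer is sketched below. `ToyModel` is a hypothetical stand-in (an `encoder` playing the role of the language model, plus an `ae` submodule), not the repository's model. Whether the real code freezes the non-autoencoder parameters is not stated here, so this only shows that the two calls select the same parameters when everything outside `ae` is frozen, and different ones otherwise.

```python
import torch
from torch import nn
from torch.optim import Adam


class ToyModel(nn.Module):
    """Hypothetical stand-in: an encoder plus an autoencoder submodule `ae`."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.ae = nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 8))


def trainable_names(model):
    """Names of the parameters that the requires_grad filter would keep."""
    return [name for name, p in model.named_parameters() if p.requires_grad]


model = ToyModel()
all_trainable = trainable_names(model)  # encoder and ae parameters

# Freeze everything outside the autoencoder.
for p in model.encoder.parameters():
    p.requires_grad = False
frozen_trainable = trainable_names(model)  # only ae parameters remain

# Same construction as in the quoted line, applied to the toy model.
optimizer = Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
n_in_optimizer = sum(len(group["params"]) for group in optimizer.param_groups)

print(all_trainable)
print(frozen_trainable)
print(n_in_optimizer)
```

Running `trainable_names` on the real model just before the optimizer is created would show whether the suggested change makes a difference.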

I think one word has multiple contextualized embeddings in the corpus. How do you deal with that?

I like your paper, but I find it unclear how multiple embeddings of the same word/token are handled. Is there any chance that different embeddings of the same word are mapped to different clusters, and that all of them are quite close to their cluster centers in the spherical space? How do you deal with that?
