
ConSERT's People

Contributors

yym6472

ConSERT's Issues

Question on table 1 in ACL 2021 paper

Hi, I have a question about the dataset.
Section 4.1 (setups) says that in the unsupervised setting, unlabeled texts from STS12-16 + STSb + SICK-R are used for training.
I have looked through the dataset files, and the number of unlabeled samples I find for STS16 (the last row) does not match your Table 1: I count 8002.
I could not find any mention in the paper of enlarging the dataset by concatenating it with itself, but the numbers do work out if I double the labeled (train/valid/test) samples and the unlabeled samples.
Did you double the size of the train/valid/test samples?
And does performance suffer if they are not doubled?

Thank you,
Jin

Question about data augmentation

Hello, a question about the data augmentation implementation: does cutoff only set the initial embeddings of the affected rows or columns to zero while leaving the mask unchanged, so that those positions still take part in subsequent training updates? In the source code of "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation", it looks like both the mask and the embeddings are set to zero. Is my understanding mistaken?
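A minimal sketch of feature cutoff as described in the question (not the authors' exact implementation): a random subset of embedding columns is zeroed while the attention mask stays untouched, so those positions still attend and continue to receive gradient updates.

```python
import torch

def feature_cutoff(embeddings: torch.Tensor, rate: float = 0.2) -> torch.Tensor:
    """Zero a random fraction of embedding dimensions; embeddings: (batch, seq_len, hidden)."""
    hidden = embeddings.size(-1)
    cols = torch.randperm(hidden)[: int(hidden * rate)]
    out = embeddings.clone()
    out[:, :, cols] = 0.0   # embeddings zeroed ...
    return out              # ... attention mask deliberately left unchanged
```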

CPU RAM memory leak

I've been re-implementing ConSERT these days.

Out of curiosity, I removed early stopping to check whether it makes a difference in the scores.

I found that this code might have a CPU memory leak.

When I execute this code, total CPU memory usage keeps increasing until the process eventually shuts down.

Have you experienced this kind of situation with this code as well?
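A small monitoring sketch (assumes the psutil package is available; it is not part of this repo) that can help confirm the leak: log resident memory after each evaluation step and check whether it grows monotonically across epochs.

```python
import os
import psutil  # assumption: psutil is installed; used only for monitoring

def log_rss(step: int) -> None:
    """Print the resident set size of the current process in MiB."""
    rss_mib = psutil.Process(os.getpid()).memory_info().rss / 2**20
    print(f"step {step}: RSS = {rss_mib:.1f} MiB")
```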

AttributeError: module 'torch.distributed' has no attribute '_all_gather_base'

Neither torch 1.6.0 nor torch 1.8.1 works; both raise the error shown in the title.

Traceback (most recent call last):
File "main.py", line 14, in <module>
from sentence_transformers import models, losses
File "/root/ConSERT/sentence_transformers/__init__.py", line 3, in <module>
from .datasets import SentencesDataset, SentenceLabelDataset, ParallelSentencesDataset
File "/root/ConSERT/sentence_transformers/datasets/__init__.py", line 1, in <module>
from .sampler import *
File "/root/ConSERT/sentence_transformers/datasets/sampler/__init__.py", line 1, in <module>
from .LabelSampler import *
File "/root/ConSERT/sentence_transformers/datasets/sampler/LabelSampler.py", line 6, in <module>
from ...datasets import SentenceLabelDataset
File "/root/ConSERT/sentence_transformers/datasets/SentenceLabelDataset.py", line 8, in <module>
from .. import SentenceTransformer
File "/root/ConSERT/sentence_transformers/SentenceTransformer.py", line 11, in <module>
import transformers
File "/root/ConSERT/transformers/__init__.py", line 22, in <module>
from .integrations import ( # isort:skip
File "/root/ConSERT/transformers/integrations.py", line 58, in <module>
from .file_utils import is_torch_tpu_available
File "/root/ConSERT/transformers/file_utils.py", line 140, in <module>
from apex import amp # noqa: F401
File "/root/miniconda3/lib/python3.8/site-packages/apex/__init__.py", line 27, in <module>
from . import transformer
File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/__init__.py", line 4, in <module>
from apex.transformer import pipeline_parallel
File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/__init__.py", line 1, in <module>
from apex.transformer.pipeline_parallel.schedules import get_forward_backward_func
File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/schedules/__init__.py", line 3, in <module>
from apex.transformer.pipeline_parallel.schedules.fwd_bwd_no_pipelining import (
File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/schedules/fwd_bwd_no_pipelining.py", line 10, in <module>
from apex.transformer.pipeline_parallel.schedules.common import Batch
File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/schedules/common.py", line 9, in <module>
from apex.transformer.pipeline_parallel.p2p_communication import FutureTensor
File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/p2p_communication.py", line 25, in <module>
from apex.transformer.utils import split_tensor_into_1d_equal_chunks
File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/utils.py", line 11, in <module>
torch.distributed.all_gather_into_tensor = torch.distributed._all_gather_base
AttributeError: module 'torch.distributed' has no attribute '_all_gather_base'

This error comes from apex (see NVIDIA/apex#1526).

The installed apex does not match the torch version. Could you tell me which torch version you used?
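As the traceback shows, apex/transformer/utils.py aliases torch.distributed.all_gather_into_tensor to the private _all_gather_base at import time, and older torch releases do not define that symbol. A quick probe (sketch) to check the installed torch before importing apex:

```python
import torch
import torch.distributed as dist

# _all_gather_base only exists in newer torch releases; if it is missing,
# either upgrade torch or check out an older apex commit (see NVIDIA/apex#1526).
if not hasattr(dist, "_all_gather_base"):
    print(f"torch {torch.__version__} has no torch.distributed._all_gather_base")
```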

OSError:Model name '/data/ConSERT-master/chinese-roberta-wwm-ext-large' was not found in tokenizers model name list(roberta-base, roberta-large, robert-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector)

Following the instructions in the README, I downloaded the chinese-roberta-wwm-ext-large pretrained model and placed it in the ./chinese-roberta-wwm-ext-large directory. I then ran python3 main.py --no_pair --seed 1 --use_apex_amp --apex_amp_opt_level O1 --batch_size 32 --max_seq_length 40 --evaluation_steps 20 --add_cl --cl_loss_only --cl_rate 0.15 --temperature 0.1 --learning_rate 0.0000005 --chinese_dataset atec_ccks --num_epochs 10 --da_final_1 feature_cutoff --da_final_2 shuffle --cutoff_rate_final_1 0.2 --model_name_or_path ./chinese-roberta-wwm-ext-large --model_save_path ./output/unsup-consert-large-atec_ccks --force_del --patience 10, and it reported the error above. What is the problem? Could you help me with this?
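One possible cause (an assumption, not confirmed by the maintainer): chinese-roberta-wwm-ext-large is a BERT-architecture checkpoint shipped with a WordPiece vocab.txt, while the error lists only RoBERTa tokenizer names, which suggests the directory is being routed through the RoBERTa tokenizer that expects vocab.json/merges.txt. A minimal check that the checkpoint itself is fine:

```python
from transformers import BertTokenizer, BertModel

# If loading through the BERT classes succeeds, the error above comes from the
# tokenizer class selection, not from the downloaded files.
tokenizer = BertTokenizer.from_pretrained("./chinese-roberta-wwm-ext-large")
model = BertModel.from_pretrained("./chinese-roberta-wwm-ext-large")
```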

About the vector concatenation in the loss

Some of the losses add an extra concatenated term: if concatenation_sent_max_square: torch.max(rep_a, rep_b).pow(2). Are there experimental results for this? The default concatenation in sentence-transformers, concat(u, v, |u-v|), is the one cited in the paper and has been shown effective in many experiments (it gives better sentence-similarity representations). What is the source or the mathematical meaning of this extra trick?
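For reference, a minimal sketch of how such a classifier input could be assembled (rep_a and rep_b are assumed to be the two pooled sentence embeddings; the max-square term is the optional extra feature the question asks about):

```python
import torch

def concat_features(rep_a: torch.Tensor, rep_b: torch.Tensor) -> torch.Tensor:
    # standard sentence-transformers style features: u, v, |u - v|
    parts = [rep_a, rep_b, torch.abs(rep_a - rep_b)]
    # optional extra interaction term (concatenation_sent_max_square)
    parts.append(torch.max(rep_a, rep_b).pow(2))
    return torch.cat(parts, dim=-1)
```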

Data augmentation question

Hello, and thank you for sharing your work!
Have you tried combining SimCSE with the various data augmentation strategies from your paper? Do you think that would improve the results? Thanks!

How to use the model with sentence-transformers for inference?

Cannot load the model.
Code:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("../../models/consbert/unsup-consert-base-atec_ccks") # the model path
Error message:
Traceback (most recent call last):
File "/home/qhd/PythonProjects/GraduationProject/code/preprocess_unlabeled_second/sentence-bert.py", line 16, in
model = SentenceTransformer("../../models/cosbert/unsup-consert-base-atec_ccks")
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py", line 87, in init
modules = self._load_sbert_model(model_path)
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py", line 824, in _load_sbert_model
module = module_class.load(os.path.join(model_path, module_config['path']))
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/sentence_transformers/models/Transformer.py", line 123, in load
return Transformer(model_name_or_path=input_path, **config)
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/sentence_transformers/models/Transformer.py", line 30, in init
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path if tokenizer_name_or_path is not None else model_name_or_path, cache_dir=cache_dir, **tokenizer_args)
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 445, in from_pretrained
return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1719, in from_pretrained
return cls._from_pretrained(
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1791, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/models/bert/tokenization_bert_fast.py", line 177, in init
super().init(
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 96, in init
fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: No such file or directory (os error 2)
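One workaround sketch (it assumes the checkpoint layout of the older sentence-transformers bundled in this repo, which stores the encoder weights under a 0_Transformer subfolder; adjust the path to your actual save directory): load the transformer directly and mean-pool the token embeddings for inference.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical path; the "0_Transformer" subfolder is an assumption about the save format.
path = "../../models/consbert/unsup-consert-base-atec_ccks/0_Transformer"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModel.from_pretrained(path)
model.eval()

def encode(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean pooling

embeddings = encode(["花呗如何还款", "花呗怎么还钱"])
```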

'BertModel' object has no attribute 'set_flag'

The full error is:
File "/data2/work2/chenzhihao/NLP/nlp/sentence_transformers/SentenceTransformer.py", line 594, in fit
loss_value = loss_model(features, labels)
File "/root/anaconda3/envs/NLP_py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/data2/work2/chenzhihao/NLP/nlp/sentence_transformers/losses/AdvCLSoftmaxLoss.py", line 775, in forward
rep_a_view1 = self._data_aug(sentence_feature_a, self.data_augmentation_strategy_final_1,
File "/data2/work2/chenzhihao/NLP/nlp/sentence_transformers/losses/AdvCLSoftmaxLoss.py", line 495, in _data_aug
self.model[0].auto_model.set_flag("data_aug_cutoff", True)
File "/root/anaconda3/envs/NLP_py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1185, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'BertModel' object has no attribute 'set_flag'

I am loading the hfl/chinese-roberta-wwm-ext model.
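A likely explanation (an assumption): set_flag is only defined on BertModel in the patched copy of transformers that ships inside the ConSERT repository; if the pip-installed transformers is imported instead, the attribute is missing. A quick sketch to make sure the repo-local package wins:

```python
import sys

sys.path.insert(0, "/path/to/ConSERT")  # hypothetical path to the repository root
import transformers

# Should point inside the ConSERT repo, not site-packages, so that BertModel
# carries the set_flag() hook used by the cutoff/shuffle augmentations.
print(transformers.__file__)
```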

OSError when running main.py

I've been running into this issue when I run bash scripts/unsup-consert-base.sh

Traceback (most recent call last):
  File "main.py", line 327, in <module>
    main(args)
  File "main.py", line 185, in main
    word_embedding_model = models.Transformer(args.model_name_or_path, attention_probs_dropout_prob=0.0, hidden_dropout_prob=0.0)
  File "/home/qmin/ConSERT/sentence_transformers/models/Transformer.py", line 36, in __init__
    self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
  File "/home/qmin/ConSERT/transformers/modeling_auto.py", line 629, in from_pretrained
    pretrained_model_name_or_path, *model_args, config=config, **kwargs
  File "/home/qmin/ConSERT/transformers/modeling_utils.py", line 954, in from_pretrained
    "Unable to load weights from pytorch checkpoint file. "

OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. 

Is there any workaround?
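A quick sanity check (sketch; the directory name is a placeholder): this OSError most often means the pytorch_model.bin in the --model_name_or_path directory is truncated or is a Git LFS pointer file rather than the real weights.

```python
import os
import torch

model_dir = "/path/to/bert-base-uncased"            # hypothetical local checkpoint dir
ckpt = os.path.join(model_dir, "pytorch_model.bin")
print(os.path.getsize(ckpt), "bytes")               # an LFS pointer is only a few hundred bytes
state_dict = torch.load(ckpt, map_location="cpu")   # raises if the file is corrupt
print(len(state_dict), "tensors loaded")
```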

Can only the BERT model be run?

Is the current version of the code only tested with the BERT model? It seems to fail with other models; for example, there is no set_flag attribute in the RoBERTa model.

OS compatibility of this project

I see the Linux shell scripts under scripts/. Can the project also be run on Windows, or would it need some small modifications? Thanks!

Question about dropout

Hello, I would like to ask:
1. In the code, only unsup-consert-base.sh uses the no_dropout option; the other scripts do not set BERT's built-in dropout to 0. Why is that?
2. When BERT's dropout is disabled, is it disabled for both the original sentence and the augmented sentence, or only for the augmented sentence?
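For reference, a sketch mirroring the models.Transformer call from this repo's main.py (visible in the OSError traceback above): both BERT dropout probabilities are forced to 0 in the model config at construction time, so when no_dropout is in effect the zeroed dropout would apply to every forward pass, the original view as much as the augmented one.

```python
from sentence_transformers import models  # the repo's bundled copy

# Sketch of the no_dropout path; the model name is a placeholder.
word_embedding_model = models.Transformer(
    "bert-base-uncased",
    attention_probs_dropout_prob=0.0,
    hidden_dropout_prob=0.0,
)
```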

Multi-GPU training

Hello. This code trains on a single GPU. I tried to make it multi-GPU by wrapping the model with torch.nn.DataParallel, but running it then fails with "'DataParallel' object has no attribute ...". If I change every use of the model during training to model.module, the error goes away, but then training only uses the GPU given by gpu_id.
How did you do multi-GPU training in your experiments? Looking forward to your reply, thank you.
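Not the authors' setup, just a common pattern sketched with a toy module: keep the DataParallel wrapper for the forward pass so batches are split across GPUs, and unwrap it only when touching custom attributes, instead of rewriting every call site to model.module.

```python
import torch
from torch import nn

class ToyLoss(nn.Module):
    """Stand-in for a loss module that also exposes custom attributes."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 2)
        self.custom_flag = False
    def forward(self, x):
        return self.linear(x).mean()

def unwrap(m: nn.Module) -> nn.Module:
    """Return the inner module when wrapped in DataParallel."""
    return m.module if isinstance(m, nn.DataParallel) else m

loss_model = ToyLoss()
if torch.cuda.device_count() > 1:
    loss_model = nn.DataParallel(loss_model)   # forward() is split across GPUs

loss = loss_model(torch.randn(4, 8))           # keep calling through the wrapper
unwrap(loss_model).custom_flag = True          # custom attributes via the inner module
```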

Error in the loss module during training

Hello! When I run python main.py --chinese_dataset atec_ccks --model_name_or_path /path/huggingface-models/chinese-roberta-wwm-ext --seed 7777 --num_epochs 5 --model_save_path ./models/ --tensorboard_log_dir ./logs/ --adv_training to train, I hit an error in the loss module (error screenshot not reproduced here).
Could you help me look into this problem? Thank you.

analysis_rep_space.py question

Hi, when I run your code on the Chinese dataset I get an output directory.
But there is an analysis_rep_space.py file, and I do not know what it is for.
Also, when I try to run that file, ./tmp/stsb_test_features.txt does not exist anywhere in the code.

About sentence representations

Q1: In unsupervised training, the two data augmentation strategies produce two representations of each sentence, but at evaluation time the paper says the sentence representation is obtained by averaging the tokens of the last two layers. Which sentence does this refer to: the original sentence fed through the transformer, or the augmented one?
Q2: In the supervised tasks, a downstream-task loss is added. When training in the joint mode, does the data augmentation in Figure 2 still use two augmented views, or one original sentence plus one augmented sentence? And again, which sentence's representation is used at evaluation time? Thanks!

NT-Xent loss function

Thank you for this excellent work. One small question: after reading the code, the implementation of the loss does not seem to match the NT-Xent formula exactly?
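For reference, the NT-Xent objective as it is usually stated (and as given in the ConSERT paper), where r_i and r_j are the two augmented views of the same sentence, sim is cosine similarity, tau is the temperature, and the sum runs over all 2N in-batch representations:

\mathcal{L}_{i,j} = -\log \frac{\exp\big(\mathrm{sim}(r_i, r_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(r_i, r_k)/\tau\big)}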
