
ConSERT's People

Contributors

yym6472

ConSERT's Issues

Question on table 1 in ACL 2021 paper

Hi, I have a question about the dataset.
Section 4.1 (setups) says that in the unsupervised setting, unlabeled texts from STS12-16 + STSb + SICK-R are used for training.
I have looked through the dataset files, and the number of unlabeled samples I find for STS16 (the last row) does not match your Table 1: I count 8002.
I could not find any mention in the paper of enlarging the dataset by concatenating it with itself, but the numbers do work out if I double the labeled (train/valid/test) samples and the unlabeled samples.
Did you double the size of the train/valid/test samples?
And does performance suffer if they are not doubled?

Thank you,
Jin

Question about data augmentation

Hello, a question about the data augmentation implementation: does cutoff only set the initial embeddings of the affected rows or columns to zero while leaving the mask unchanged, so that those positions still take part in subsequent training updates? In the source code of "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation", it looks like both the mask and the embeddings are set to zero. Is my understanding mistaken?
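A minimal sketch of feature cutoff as described in the question (not the authors' exact implementation): a random subset of embedding columns is zeroed while the attention mask stays untouched, so those positions still attend and continue to receive gradient updates.

```python
import torch

def feature_cutoff(embeddings: torch.Tensor, rate: float = 0.2) -> torch.Tensor:
    """Zero a random fraction of embedding dimensions; embeddings: (batch, seq_len, hidden)."""
    hidden = embeddings.size(-1)
    cols = torch.randperm(hidden)[: int(hidden * rate)]
    out = embeddings.clone()
    out[:, :, cols] = 0.0   # embeddings zeroed ...
    return out              # ... attention mask deliberately left unchanged
```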

CPU RAM memory leak

I've been re-implementing ConSERT these days.

Out of curiosity, I removed early stopping to check whether it makes a difference in the scores.

I found that this code might have a CPU memory leak.

When I execute this code, total CPU memory usage keeps increasing until the process eventually shuts down.

Have you experienced this kind of situation with this code as well?
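A small monitoring sketch (assumes the psutil package is available; it is not part of this repo) that can help confirm the leak: log resident memory after each evaluation step and check whether it grows monotonically across epochs.

```python
import os
import psutil  # assumption: psutil is installed; used only for monitoring

def log_rss(step: int) -> None:
    """Print the resident set size of the current process in MiB."""
    rss_mib = psutil.Process(os.getpid()).memory_info().rss / 2**20
    print(f"step {step}: RSS = {rss_mib:.1f} MiB")
```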

AttributeError: module 'torch.distributed' has no attribute '_all_gather_base'

Neither torch 1.6.0 nor torch 1.8.1 works; both raise the error shown in the title.

Traceback (most recent call last):
File "main.py", line 14, in <module>
from sentence_transformers import models, losses
File "/root/ConSERT/sentence_transformers/__init__.py", line 3, in <module>
from .datasets import SentencesDataset, SentenceLabelDataset, ParallelSentencesDataset
File "/root/ConSERT/sentence_transformers/datasets/__init__.py", line 1, in <module>
from .sampler import *
File "/root/ConSERT/sentence_transformers/datasets/sampler/__init__.py", line 1, in <module>
from .LabelSampler import *
File "/root/ConSERT/sentence_transformers/datasets/sampler/LabelSampler.py", line 6, in <module>
from ...datasets import SentenceLabelDataset
File "/root/ConSERT/sentence_transformers/datasets/SentenceLabelDataset.py", line 8, in <module>
from .. import SentenceTransformer
File "/root/ConSERT/sentence_transformers/SentenceTransformer.py", line 11, in <module>
import transformers
File "/root/ConSERT/transformers/__init__.py", line 22, in <module>
from .integrations import ( # isort:skip
File "/root/ConSERT/transformers/integrations.py", line 58, in <module>
from .file_utils import is_torch_tpu_available
File "/root/ConSERT/transformers/file_utils.py", line 140, in <module>
from apex import amp # noqa: F401
File "/root/miniconda3/lib/python3.8/site-packages/apex/__init__.py", line 27, in <module>
from . import transformer
File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/__init__.py", line 4, in <module>
from apex.transformer import pipeline_parallel
File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/__init__.py", line 1, in <module>
from apex.transformer.pipeline_parallel.schedules import get_forward_backward_func
File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/schedules/__init__.py", line 3, in <module>
from apex.transformer.pipeline_parallel.schedules.fwd_bwd_no_pipelining import (
File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/schedules/fwd_bwd_no_pipelining.py", line 10, in <module>
from apex.transformer.pipeline_parallel.schedules.common import Batch
File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/schedules/common.py", line 9, in <module>
from apex.transformer.pipeline_parallel.p2p_communication import FutureTensor
File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/p2p_communication.py", line 25, in <module>
from apex.transformer.utils import split_tensor_into_1d_equal_chunks
File "/root/miniconda3/lib/python3.8/site-packages/apex/transformer/utils.py", line 11, in <module>
torch.distributed.all_gather_into_tensor = torch.distributed._all_gather_base
AttributeError: module 'torch.distributed' has no attribute '_all_gather_base'

This error comes from apex (see NVIDIA/apex#1526).

The installed apex does not match the torch version. Could you tell me which torch version you used?
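As the traceback shows, apex/transformer/utils.py aliases torch.distributed.all_gather_into_tensor to the private _all_gather_base at import time, and older torch releases do not define that symbol. A quick probe (sketch) to check the installed torch before importing apex:

```python
import torch
import torch.distributed as dist

# _all_gather_base only exists in newer torch releases; if it is missing,
# either upgrade torch or check out an older apex commit (see NVIDIA/apex#1526).
if not hasattr(dist, "_all_gather_base"):
    print(f"torch {torch.__version__} has no torch.distributed._all_gather_base")
```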

OSError:Model name '/data/ConSERT-master/chinese-roberta-wwm-ext-large' was not found in tokenizers model name list(roberta-base, roberta-large, robert-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector)

Following the instructions in the README, I downloaded the chinese-roberta-wwm-ext-large pretrained model and placed it in the ./chinese-roberta-wwm-ext-large directory. I then ran python3 main.py --no_pair --seed 1 --use_apex_amp --apex_amp_opt_level O1 --batch_size 32 --max_seq_length 40 --evaluation_steps 20 --add_cl --cl_loss_only --cl_rate 0.15 --temperature 0.1 --learning_rate 0.0000005 --chinese_dataset atec_ccks --num_epochs 10 --da_final_1 feature_cutoff --da_final_2 shuffle --cutoff_rate_final_1 0.2 --model_name_or_path ./chinese-roberta-wwm-ext-large --model_save_path ./output/unsup-consert-large-atec_ccks --force_del --patience 10, and it reported the error above. What is the problem? Could you help me with this?
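One possible cause (an assumption, not confirmed by the maintainer): chinese-roberta-wwm-ext-large is a BERT-architecture checkpoint shipped with a WordPiece vocab.txt, while the error lists only RoBERTa tokenizer names, which suggests the directory is being routed through the RoBERTa tokenizer that expects vocab.json/merges.txt. A minimal check that the checkpoint itself is fine:

```python
from transformers import BertTokenizer, BertModel

# If loading through the BERT classes succeeds, the error above comes from the
# tokenizer class selection, not from the downloaded files.
tokenizer = BertTokenizer.from_pretrained("./chinese-roberta-wwm-ext-large")
model = BertModel.from_pretrained("./chinese-roberta-wwm-ext-large")
```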

About the vector concatenation in the loss

Some of the losses add an extra concatenated term: if concatenation_sent_max_square: torch.max(rep_a, rep_b).pow(2). Are there experimental results for this? The default concatenation in sentence-transformers, concat(u, v, |u-v|), is the one cited in the paper and has been shown effective in many experiments (it gives better sentence-similarity representations). What is the source or the mathematical meaning of this extra trick?
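For reference, a minimal sketch of how such a classifier input could be assembled (rep_a and rep_b are assumed to be the two pooled sentence embeddings; the max-square term is the optional extra feature the question asks about):

```python
import torch

def concat_features(rep_a: torch.Tensor, rep_b: torch.Tensor) -> torch.Tensor:
    # standard sentence-transformers style features: u, v, |u - v|
    parts = [rep_a, rep_b, torch.abs(rep_a - rep_b)]
    # optional extra interaction term (concatenation_sent_max_square)
    parts.append(torch.max(rep_a, rep_b).pow(2))
    return torch.cat(parts, dim=-1)
```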

Data augmentation question

Hello, and thank you for sharing your work!
Have you tried combining SimCSE with the various data augmentation strategies from your paper? Do you think that would improve the results? Thanks!

How to use the model with sentence-transformers for inference?

Cannot load the model.
Code:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("../../models/consbert/unsup-consert-base-atec_ccks") # the model path
Error message:
Traceback (most recent call last):
File "/home/qhd/PythonProjects/GraduationProject/code/preprocess_unlabeled_second/sentence-bert.py", line 16, in
model = SentenceTransformer("../../models/cosbert/unsup-consert-base-atec_ccks")
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py", line 87, in init
modules = self._load_sbert_model(model_path)
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py", line 824, in _load_sbert_model
module = module_class.load(os.path.join(model_path, module_config['path']))
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/sentence_transformers/models/Transformer.py", line 123, in load
return Transformer(model_name_or_path=input_path, **config)
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/sentence_transformers/models/Transformer.py", line 30, in init
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path if tokenizer_name_or_path is not None else model_name_or_path, cache_dir=cache_dir, **tokenizer_args)
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 445, in from_pretrained
return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1719, in from_pretrained
return cls._from_pretrained(
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1791, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/models/bert/tokenization_bert_fast.py", line 177, in init
super().init(
File "/home/qhd/anaconda3/envs/qhdpython39/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 96, in init
fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: No such file or directory (os error 2)
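One workaround sketch (it assumes the checkpoint layout of the older sentence-transformers bundled in this repo, which stores the encoder weights under a 0_Transformer subfolder; adjust the path to your actual save directory): load the transformer directly and mean-pool the token embeddings for inference.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical path; the "0_Transformer" subfolder is an assumption about the save format.
path = "../../models/consbert/unsup-consert-base-atec_ccks/0_Transformer"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModel.from_pretrained(path)
model.eval()

def encode(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean pooling

embeddings = encode(["花呗如何还款", "花呗怎么还钱"])
```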

'BertModel' object has no attribute 'set_flag'

The full error is:
File "/data2/work2/chenzhihao/NLP/nlp/sentence_transformers/SentenceTransformer.py", line 594, in fit
loss_value = loss_model(features, labels)
File "/root/anaconda3/envs/NLP_py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/data2/work2/chenzhihao/NLP/nlp/sentence_transformers/losses/AdvCLSoftmaxLoss.py", line 775, in forward
rep_a_view1 = self._data_aug(sentence_feature_a, self.data_augmentation_strategy_final_1,
File "/data2/work2/chenzhihao/NLP/nlp/sentence_transformers/losses/AdvCLSoftmaxLoss.py", line 495, in _data_aug
self.model[0].auto_model.set_flag("data_aug_cutoff", True)
File "/root/anaconda3/envs/NLP_py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1185, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'BertModel' object has no attribute 'set_flag'

I am loading the hfl/chinese-roberta-wwm-ext model.
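A likely explanation (an assumption): set_flag is only defined on BertModel in the patched copy of transformers that ships inside the ConSERT repository; if the pip-installed transformers is imported instead, the attribute is missing. A quick sketch to make sure the repo-local package wins:

```python
import sys

sys.path.insert(0, "/path/to/ConSERT")  # hypothetical path to the repository root
import transformers

# Should point inside the ConSERT repo, not site-packages, so that BertModel
# carries the set_flag() hook used by the cutoff/shuffle augmentations.
print(transformers.__file__)
```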

OSError when running main.py

I've been running into this issue when I run bash scripts/unsup-consert-base.sh

Traceback (most recent call last):
  File "main.py", line 327, in <module>
    main(args)
  File "main.py", line 185, in main
    word_embedding_model = models.Transformer(args.model_name_or_path, attention_probs_dropout_prob=0.0, hidden_dropout_prob=0.0)
  File "/home/qmin/ConSERT/sentence_transformers/models/Transformer.py", line 36, in __init__
    self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
  File "/home/qmin/ConSERT/transformers/modeling_auto.py", line 629, in from_pretrained
    pretrained_model_name_or_path, *model_args, config=config, **kwargs
  File "/home/qmin/ConSERT/transformers/modeling_utils.py", line 954, in from_pretrained
    "Unable to load weights from pytorch checkpoint file. "

OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. 

Is there any workaround?
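A quick sanity check (sketch; the directory name is a placeholder): this OSError most often means the pytorch_model.bin in the --model_name_or_path directory is truncated or is a Git LFS pointer file rather than the real weights.

```python
import os
import torch

model_dir = "/path/to/bert-base-uncased"            # hypothetical local checkpoint dir
ckpt = os.path.join(model_dir, "pytorch_model.bin")
print(os.path.getsize(ckpt), "bytes")               # an LFS pointer is only a few hundred bytes
state_dict = torch.load(ckpt, map_location="cpu")   # raises if the file is corrupt
print(len(state_dict), "tensors loaded")
```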

Can only the BERT model be run?

Is the current version of the code only tested with the BERT model? It seems to fail with other models; for example, there is no set_flag attribute in the RoBERTa model.

OS compatibility of this project

I see the Linux shell scripts under scripts/. Can the project also be run on Windows, or would it need some small modifications? Thanks!

Question about dropout

Hello, I would like to ask:
1. In the code, only unsup-consert-base.sh uses the no_dropout option; the other scripts do not set BERT's built-in dropout to 0. Why is that?
2. When BERT's dropout is disabled, is it disabled for both the original sentence and the augmented sentence, or only for the augmented sentence?
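For reference, a sketch mirroring the models.Transformer call from this repo's main.py (visible in the OSError traceback above): both BERT dropout probabilities are forced to 0 in the model config at construction time, so when no_dropout is in effect the zeroed dropout would apply to every forward pass, the original view as much as the augmented one.

```python
from sentence_transformers import models  # the repo's bundled copy

# Sketch of the no_dropout path; the model name is a placeholder.
word_embedding_model = models.Transformer(
    "bert-base-uncased",
    attention_probs_dropout_prob=0.0,
    hidden_dropout_prob=0.0,
)
```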

Multi-GPU training

Hello. This code trains on a single GPU. I tried to make it multi-GPU by wrapping the model with torch.nn.DataParallel, but running it then fails with "'DataParallel' object has no attribute ...". If I change every use of the model during training to model.module, the error goes away, but then training only uses the GPU given by gpu_id.
How did you do multi-GPU training in your experiments? Looking forward to your reply, thank you.
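Not the authors' setup, just a common pattern sketched with a toy module: keep the DataParallel wrapper for the forward pass so batches are split across GPUs, and unwrap it only when touching custom attributes, instead of rewriting every call site to model.module.

```python
import torch
from torch import nn

class ToyLoss(nn.Module):
    """Stand-in for a loss module that also exposes custom attributes."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 2)
        self.custom_flag = False
    def forward(self, x):
        return self.linear(x).mean()

def unwrap(m: nn.Module) -> nn.Module:
    """Return the inner module when wrapped in DataParallel."""
    return m.module if isinstance(m, nn.DataParallel) else m

loss_model = ToyLoss()
if torch.cuda.device_count() > 1:
    loss_model = nn.DataParallel(loss_model)   # forward() is split across GPUs

loss = loss_model(torch.randn(4, 8))           # keep calling through the wrapper
unwrap(loss_model).custom_flag = True          # custom attributes via the inner module
```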

Error in the loss module during training

Hello! When I run python main.py --chinese_dataset atec_ccks --model_name_or_path /path/huggingface-models/chinese-roberta-wwm-ext --seed 7777 --num_epochs 5 --model_save_path ./models/ --tensorboard_log_dir ./logs/ --adv_training to train, I hit an error in the loss module (error screenshot not reproduced here).
Could you help me look into this problem? Thank you.

analysis_rep_space.py question

Hi, when I run your code on the Chinese dataset I get an output directory.
But there is an analysis_rep_space.py file, and I do not know what it is for.
Also, when I try to run that file, ./tmp/stsb_test_features.txt does not exist anywhere in the code.

About sentence representations

Q1: In unsupervised training, the two data augmentation strategies produce two representations of each sentence, but at evaluation time the paper says the sentence representation is obtained by averaging the tokens of the last two layers. Which sentence does this refer to: the original sentence fed through the transformer, or the augmented one?
Q2: In the supervised tasks, a downstream-task loss is added. When training in the joint mode, does the data augmentation in Figure 2 still use two augmented views, or one original sentence plus one augmented sentence? And again, which sentence's representation is used at evaluation time? Thanks!

NT-Xent loss function

Thank you for this excellent work. One small question: after reading the code, the implementation of the loss does not seem to match the NT-Xent formula exactly?
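For reference, the NT-Xent objective as it is usually stated (and as given in the ConSERT paper), where r_i and r_j are the two augmented views of the same sentence, sim is cosine similarity, tau is the temperature, and the sum runs over all 2N in-batch representations:

\mathcal{L}_{i,j} = -\log \frac{\exp\big(\mathrm{sim}(r_i, r_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(r_i, r_k)/\tau\big)}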
