microsoft / vert-papers

263 stars · 12 watchers · 90 forks · 22.5 MB

This repository contains code and datasets related to entity/knowledge papers from the VERT (Versatile Entity Recognition & Disambiguation Toolkit) project, by the Knowledge Computing group at Microsoft Research Asia (MSRA).

License: MIT License

Python 98.26% Batchfile 0.27% Shell 1.20% Perl 0.27%
ner entity-extraction entity-linking entity-resolution nlp nlp-resources ml named-entity-recognition linkingpark unitrans

vert-papers's Introduction

This repository contains code, datasets, and links related to entity/knowledge papers from the VERT (Versatile Entity Recognition & Disambiguation Toolkit) project, by the Knowledge Computing (KC) group at Microsoft Research Asia (MSRA).

Our group is hiring both research interns and full-time employees! If you are interested, please take a look at:

News:

Recent Papers:

Related Projects:

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

vert-papers's People

Contributors

arleneyuzhiwei, hitercs, iofu728, lyutyuh, microsoft-github-policy-service[bot], microsoftopensource, msftgits, mtt1998, qianhuiwu, sherryzyy, tellarin


vert-papers's Issues

Are there any parameters to be modified when reproducing the AdvPicker paper?

Hello, are there any parameters to be modified when reproducing the AdvPicker paper? I use the same CoNLL-2002 and CoNLL-2003 datasets, on which I can obtain performance similar to the paper for Single-Multi, but when reproducing AdvPicker I only get F1 scores of 40-50 over 5 random seeds and 3 languages. I ran the run.sh file without any modification.

Why are my results so poor?

I downloaded the code without changing anything; just to make sure it ran, I loaded the local bert_model.

2023-10-26 12:59:12 INFO: - Step: 0/5001, span loss = 6.118810, type loss = 0.000000, time = 2.27s.
2023-10-26 12:59:53 INFO: - Step: 20/5001, span loss = 5.980084, type loss = 0.000000, time = 43.38s.
2023-10-26 13:00:36 INFO: - Step: 40/5001, span loss = 5.879471, type loss = 0.000000, time = 86.74s.
2023-10-26 13:01:17 INFO: - Step: 60/5001, span loss = 5.772162, type loss = 0.000000, time = 128.07s.
2023-10-26 13:01:57 INFO: - Step: 80/5001, span loss = 5.848635, type loss = 0.000000, time = 167.65s.
2023-10-26 13:02:36 INFO: - Step: 100/5001, span loss = 5.735874, type loss = 0.000000, time = 206.93s.
2023-10-26 13:03:15 INFO: - Step: 120/5001, span loss = 5.667262, type loss = 0.000000, time = 246.25s.
2023-10-26 13:03:55 INFO: - Step: 140/5001, span loss = 5.633423, type loss = 0.000000, time = 285.52s.
2023-10-26 13:04:34 INFO: - Step: 160/5001, span loss = 5.501272, type loss = 0.000000, time = 325.18s.
2023-10-26 13:05:16 INFO: - Step: 180/5001, span loss = 5.545497, type loss = 0.000000, time = 366.45s.
2023-10-26 13:05:55 INFO: - Step: 200/5001, span loss = 5.426091, type loss = 0.000000, time = 406.06s.

2023-10-26 13:08:26 INFO: - ***** Eval results inter-valid *****
2023-10-26 13:08:26 INFO: - f1 = 0.0
2023-10-26 13:08:26 INFO: - f1_threshold = 0.0
2023-10-26 13:08:26 INFO: - loss = tensor(1.1085, device='cuda:0')
2023-10-26 13:08:26 INFO: - precision = 0.0
2023-10-26 13:08:26 INFO: - precision_threshold = 0.0
2023-10-26 13:08:26 INFO: - recall = 0.0
2023-10-26 13:08:26 INFO: - recall_threshold = 0.0
2023-10-26 13:08:26 INFO: - span_f1 = 0.05210224202151717
2023-10-26 13:08:26 INFO: - span_p = 0.03787716670233228
2023-10-26 13:08:26 INFO: - span_r = 0.08343808925204142
2023-10-26 13:08:26 INFO: - type_f1 = 0.0
2023-10-26 13:08:26 INFO: - type_p = 0.0
2023-10-26 13:08:26 INFO: - type_r = 0.0
2023-10-26 13:08:26 INFO: - 0.000,0.000,0.000,3.788,8.344,5.210,0.000,0.000,0.000,0.000,0.000,0.000
2023-10-26 13:08:26 INFO: - ===> Best Valid F1: 0.0

cannot download the models

I can't download the content from the link you provided; I keep getting errors, and it's not a network problem. Could you please provide another download method?

Obtaining Predicted NER Results for a Series of Sentences with DecomposedMetaNER

I am hoping to use the DecomposedMetaNER model for a learning project. My expectation is to input a series of sentences and, after processing by the model, receive a series of NER predictions, including entities and their corresponding types. However, after following the Quick Start steps and running the model, I did not get this kind of output. I can only see the evaluation results in the console.

Is there any method or guidance available to help me obtain the desired predicted NER results for my input sentences? Any assistance would be greatly appreciated.

How to preprocess the CoNLL2003 de dataset?

Hi, I downloaded the ECI Multilingual Text dataset from LDC and the ner.tgz file from the official CoNLL2003 website. However, running bin/make.deu in the ner folder (built from ner.tgz) did not produce the CoNLL2003 de dataset; it failed with the error "Incorrect number of lines in data files". Could you tell me how to preprocess the CoNLL2003 de dataset with the ECI dataset and ner.tgz? Thanks.

An experiment about meta-testing

Hi. I deleted lines 384-385 and line 447 of learner.py to avoid fine-tuning on the support set during meta-testing. Is this right? Thanks.

[CAN-NER] What's the 'seg'?

model/data_process.py

for index, line in enumerate(lines):
    text = line.strip().split(" ")
    word = text[0][0]  # first character of the first field: the character itself
    seg = text[0][1]   # second character of the first field: the seg flag
    tag = text[1]      # the NER tag

What is the seg? In the train data, there is no seg!

吴 B-NAME
重 M-NAME
阳 E-NAME
, O
中 B-CONT
国 M-CONT
国 M-CONT
籍 E-CONT
, O
大 B-EDU
学 M-EDU
本 M-EDU
科 E-EDU
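
Judging from the split in the snippet above (an inference from the code, not a documented format), the first whitespace-separated field seems to pack the character together with a one-digit segmentation flag, so a hypothetical input line consistent with that parsing would look like:

line = "吴0 B-NAME"                       # hypothetical: char + seg flag, then the tag
text = line.strip().split(" ")
print(text[0][0], text[0][1], text[1])    # -> 吴 0 B-NAME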

Why are my distillation results so bad?

I ran the source code 5 times and got these results for mBERT-TLADV: F1 of de/es/nl 72/76/81; after KD the results become 73/76/81, which is much lower than the results reported in the original paper (mBERT-TLADV 74/77/81, after KD 75/79/83). Are there any other tricks needed to reproduce the reported results besides the given code?

The version of FewNERD

Hi, @iofu728. It seems the open-source dataset "episode-data" is the arXiv version of Few-NERD? I found that the reproduced results are very different from those in the paper; maybe you used the ACL version of Few-NERD in the paper?

CAN-NER data-processing problem

Can you offer the data files in the data folder, like the following:
./data/embeds, ./data/embedding/embeds, ./data/embedding/word2vec.sgns.weibo.onlychar, ./data/embedding/fasttext.cc.zh.300.vec.onlychar, ./data/embedding/glove.6B.100d.english.txt, ./data/embedding/glove.6B.300d.english.txt

Asking about some problems

How exactly is the Seg Embedding computed and then concatenated with the char embedding? The paper and the code do not seem to agree.
In addition, I ran into two more problems. The first: what does the input data format look like? Could you provide some sample data (feeding (char tag) directly does not work)? Looking at the data-processing code, I cannot see how the seg information is added.
The second question

[SingleMulti-TS] Loading Similarities

Hi,

Thanks for your great work on this project.

I ran into a problem when running the multi-source teacher-student learning; I think it is related to the sentence/domain similarity.
It appears at # STEP3: multi-source teacher-student learning, after I run the domain-model training and save the domain model.

2021-06-20 13:53:14 INFO: - Training/evaluation parameters Namespace(adam_epsilon=1e-08, balance_classes=False, cache_dir='', config_name='', data_dir='./data/ner/conll', device=device(type='cuda'), do_lower_case=False, do_predict=False, do_train=True, do_voting=False, domain_orthogonal=False, evaluate_during_training=False, freeze_bottom_layer=3, gamma_R=0.01, gpu_ids=[7], hard_label_usage='none', hard_label_weight=0, labels='./data/ner/conll/labels.txt', learning_rate=0.0001, log_dir='result-22/ts-learn-var-domain-id-rank_64-gamma_0.01/logs', logging_steps=20, low_rank_size=64, max_grad_norm=1.0, max_seq_length=128, max_steps=-1, model_name_or_path='bert-base-multilingual-cased', n_gpu=1, num_train_epochs=3, output_dir='result-22/ts-learn-var-domain-id-rank_64-gamma_0.01', overwrite_cache=False, per_gpu_eval_batch_size=64, per_gpu_train_batch_size=32, save_steps=2000, seed=22, sim_dir='domain-model', sim_level='domain', sim_type='learn', sim_with_tgt=False, src_langs=['en'], src_model_dir='conll-model-22', src_model_dir_prefix='mono-src-', tau_metric='var', tgt_lang='id', tokenizer_name='', train_hard_label=False, warmup_ratio=0.1, warmup_steps=0, weight_decay=1e-05)
2021-06-20 13:53:16 INFO: - Loading features from cached file ./data/ner/conll/en/cached_train_bert-base-multilingual-cased_128.
2021-06-20 13:53:18 INFO: - ********** scheme: training with KD **********
2021-06-20 13:53:18 INFO: - Loading features from cached file ./data/ner/conll/id/cached_train_bert-base-multilingual-cased_128.
2021-06-20 13:53:18 INFO: - sim_dir: domain-model/id-rank_64-gamma_0.01-seed_22
2021-06-20 13:53:18 INFO: - ==> Computing similarities....
2021-06-20 13:53:18 INFO: - ***** Compute sentence embeddings for [id] plain text dataset using the [pre-trained] base_model *****
2021-06-20 13:53:29 INFO: -   Num examples = 1
2021-06-20 13:53:29 INFO: -   Batch size = 64
2021-06-20 13:53:29 INFO: - ***** Running evaluation on id *****
2021-06-20 13:53:29 INFO: -   Num of examples = 1
2021-06-20 13:53:29 INFO: -   Instantaneous batch size per GPU = 64
2021-06-20 13:53:29 INFO: -   GPU IDs for training: 7
2021-06-20 13:53:29 INFO: -   Batch size = 64
2021-06-20 13:53:29 INFO: - ==> tau: nan
2021-06-20 13:53:29 INFO: -   Sentence similarities:
2021-06-20 13:53:29 INFO: -   en
2021-06-20 13:53:29 INFO: -   nan
Traceback (most recent call last):
  File "main.py", line 918, in <module>
    main(args)
  File "main.py", line 822, in main
    src_probs, src_predictions = get_src_weighted_probs(args, dataset_pt, config, mode="train", src_datasets=src_datasets)
  File "main.py", line 590, in get_src_weighted_probs
    st_sims, dm_sims = get_st_sims(args, domain_model, dataset_st_ebd, src_idxs)
  File "main.py", line 489, in get_st_sims
    print_sim_info(st_sims, dm_sims)
  File "main.py", line 536, in print_sim_info
    logger.info("  " + "\t".join([str(round(v.item(), 4)) for v in st_sims[i]]))
IndexError: index 2 is out of bounds for dimension 0 with size 1

Do you have any idea why this happens and how to fix this?

Thanks in advance!

The result of decoding BPE

Hello!
Could you help me understand the following output?
I passed these tags as the query labels to DecomposedMetaNER:
['action', 'action', 'O', 'entity', 'O', 'O', 'O', 'action', 'O', 'property', 'entity', 'O', 'O', 'property', 'entity', 'O', 'O', 'property', 'O', 'entity', 'O', 'O']
After applying convert_bpe, I get these word indices:
[[0, 1], [3, 4], [7, 8], [9, 9], [10, 11], [13, 13], [14, 15], [17, 17], [19, 20]]
What is the logic of the pairs? Why do I get [i, i] or [i, i+1] for some single words?
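
For context, this is a general property of WordPiece/BPE tokenization rather than anything specific to convert_bpe: a word kept as a single subword piece occupies one token index, giving a span [i, i], while a word split into two pieces occupies two indices, giving [i, i+1]. A minimal sketch with a toy tokenizer (word_spans and toy_tokenize are illustrative, not repository functions):

def word_spans(words, tokenize):
    # Map each word to its inclusive [start, end] span over subword tokens.
    spans, pos = [], 0
    for w in words:
        n = len(tokenize(w))              # number of subword pieces for this word
        spans.append([pos, pos + n - 1])  # one piece -> [i, i]; two pieces -> [i, i+1]
        pos += n
    return spans

# Toy tokenizer: split long words into two pieces, keep short ones whole.
toy_tokenize = lambda w: [w[:4], "##" + w[4:]] if len(w) > 5 else [w]
print(word_spans(["playing", "chess"], toy_tokenize))  # [[0, 1], [2, 2]]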

How to run the CAN-NER code

I am new to GitHub and don't understand how to run the CAN-NER code. Could someone please guide me on how to run the commands given in the README file of CAN-NER? It would be really helpful; I appreciate any guidance you can provide.

DecomposedMetaNER evaluation problem

I have put the Few-NERD episode-data into ./episode-data/inter and run bash script/run.sh, but some metrics of the model, such as precision, recall, and F1, all return 0. Did I do anything wrong?

Failing to reproduce the F1 score on the Cross-Dataset benchmark

Hi @iofu728, thanks for your README file; I have already reproduced the F1 score for Few-NERD V6. However, when it comes to Cross-Dataset, I cannot reproduce the result presented in the paper.
For example, for the Wiki 1-shot setting, the result is 14.837,4.236,6.591,25.452,7.266,11.305,34.928,34.928,34.928,15.006,3.800,6.064 if I do not modify any hyperparameter. Then I set the fine-tune steps to 30 and 20 according to A.3 Additional Implementation Details, and I also changed the inner_lambda_max_loss and the type_threshold; the final result is 17.484,12.891,14.840,35.528,26.194,30.155,41.734,41.734,41.734,19.475,12.240,15.032. It looks better, but there is still some distance from 17.54.
Could you please tell me where the problem is (meta-training or meta-testing)? Have you recorded the experimental hyperparameters for Cross-Dataset?
Thanks.

Migration Problem of Code on Apple M1 Chip

Hi, while testing the code I found that the .cuda() call on this line cannot run on the M1 chip. PyTorch can use M1 chips for GPU acceleration, but this line does not work. May I ask if you have any solutions?
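
Not an official fix, but a common portable pattern is to select the best available device instead of calling .cuda() unconditionally; recent PyTorch versions expose Apple Silicon GPUs through the MPS backend. A minimal sketch:

import torch

# Prefer CUDA, fall back to Apple Silicon's MPS backend, then CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model = torch.nn.Linear(4, 2).to(device)  # instead of model.cuda()
x = torch.randn(3, 4, device=device)
print(model(x).shape)                     # torch.Size([3, 2])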

About Seg Embedding

According to the paper, the BMES scheme ("Convolutional Neural Network with Word Embeddings for Chinese Word Segmentation") is used to represent CWS, and four types of tags are enough to label all characters.
My question is: in your code there are five tags (OBMES) in the soft seg embedding. Why?

CAN-NER data_process.py segmentation part

for r in sentences:
    lines = r.split('\n')
    for index, line in enumerate(lines):
        text = line.strip().split("\t")
        # text = line.strip().split(" ")

        word = text[0][0]
        seg = text[0][1]
        tag = text[1]
        word = normalize_word(word)
        if word not in word2id:
            # If word is not in word2id, map it to <UNK>; the id of <UNK> is 1.
            word2id[word] = 1
        if tag not in tag2id:
            print(tag)
            tag2id[tag] = len(tag2id)

        if index == len(lines) - 1:
            # I don't understand the segmentation logic for the last line here.
            if seg == "0":
                seg = 4
            else:
                seg = 3
            sent.append([word2id[word], tag2id[tag], seg])
            ret = getDataFromSent_with_seg_test(sent)
            rs.append(ret)
            sent = []
        else:
            next_seg = lines[index + 1].split(" ")[0][1]
            if seg == "0":
                if next_seg == "0":
                    seg = 4
                else:
                    seg = 1
            else:
                if next_seg == "0":
                    seg = 3
                else:
                    seg = 2
            sent.append([word2id[word], tag2id[tag], seg])

I don't quite understand how data_process.py handles the segmentation. In the code above, when a character is on the last line of a sentence, seg == "0" is replaced with 4, otherwise with 3. If it is not the last line and seg is "0", the code looks at the next character's seg: if that is also "0", seg is replaced with 4, otherwise with 1, and so on. What is this logic? Could you explain it?
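
One consistent reading of the snippet (an inference, not confirmed by the authors): the raw per-character flag marks whether a character starts a word ("0") or continues one, and the loop converts that boundary flag into BMES codes with B=1, M=2, E=3, S=4. A minimal sketch of that mapping (boundary_to_bmes is illustrative, not a repository function):

def boundary_to_bmes(flags):
    # flags[i] == "0" means character i starts a word; anything else continues one.
    codes = []
    for i, f in enumerate(flags):
        last = i == len(flags) - 1
        word_ends = last or flags[i + 1] == "0"  # next char starts a new word
        if f == "0":
            codes.append(4 if word_ends else 1)  # Single or Begin
        else:
            codes.append(3 if word_ends else 2)  # End or Middle
    return codes

# A two-character word followed by a single-character word: B, E, S
print(boundary_to_bmes(["0", "1", "0"]))  # [1, 3, 4]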

A question about meta-learning few-shot NER

Hi, I have a question about the ablation study in Table 4, 1) w/o MAML.

I ran the code to get the results of w/o MAML and found that w/o MAML also uses learner.evaluate_meta_() to evaluate the performance.

As you know, meta-learning aims at fast adaptation. It is reasonable for your method to use evaluate_meta_() for fast adaptation, but it is unreasonable for w/o MAML to be evaluated with evaluate_meta_().

From my point of view, w/o MAML also trains the model again during the meta-testing phase (fast adaptation). However, this method is supposed to be without MAML.

In CAN-NER

self.add_module('feature_embeds_{}'.format(0), embeds)    # registers feature_embeds_0
self.add_module('feature_embeds_{}'.format(1), soft_seg)  # registers feature_embeds_1

while for i >= 2 it raises an error:

feature_embed_data = batch_x[:, :, i]
feature_embeds = getattr(self, 'feature_embeds_{}'.format(i))  # fails for i >= 2: no such module was registered

Also, in the preprocessing script there are some index-out-of-range errors.
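
For context (a generic PyTorch pattern, not the authors' fix): add_module registers a child module under a string name, and getattr only finds names that were actually registered, so iterating i beyond the registered feature indices fails. A self-contained sketch of the pattern with one embedding table per feature column (MultiFeatureEmbeds is illustrative, not the repository's class):

import torch
import torch.nn as nn

class MultiFeatureEmbeds(nn.Module):
    def __init__(self, num_embeddings, dims):
        super().__init__()
        # Register one embedding per feature column; getattr below only
        # succeeds for the indices registered here.
        self.num_features = len(dims)
        for i, d in enumerate(dims):
            self.add_module('feature_embeds_{}'.format(i),
                            nn.Embedding(num_embeddings, d))

    def forward(self, batch_x):
        # batch_x: (batch, seq_len, num_features) integer ids
        outs = []
        for i in range(self.num_features):  # never index past what was registered
            emb = getattr(self, 'feature_embeds_{}'.format(i))
            outs.append(emb(batch_x[:, :, i]))
        return torch.cat(outs, dim=-1)

m = MultiFeatureEmbeds(100, [8, 4])
x = torch.randint(0, 100, (2, 5, 2))
print(m(x).shape)  # torch.Size([2, 5, 12])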
