
dlkp's Introduction

dlkp [WIP]


A transformers-based deep learning library for identifying keyphrases in text documents.

dlkp is:

  • A deep learning keyphrase extraction and generation library. dlkp allows you to train and apply state-of-the-art deep learning models for keyphrase extraction and generation from text documents.

  • A transformer-based framework. dlkp builds directly on transformers, making it easy to train and evaluate your own transformer-based keyphrase extraction and generation models and to experiment with new approaches using different contextual embeddings.

  • A dataset library for keyphrase extraction and generation. dlkp has simple interfaces that allow you to download several benchmark datasets in the domain of keyphrase extraction and generation from Huggingface Datasets and readily use them in training your models with the transformers library. It provides easy access to BIO-tagged data for several datasets, such as Inspec, NUS, WWW, KDD, KP20K, LDKP and many more, suitable for training your keyphrase extraction model as a sequence tagger (see the example after this list).

  • An evaluation library for keyphrase extraction and generation. dlkp implements several evaluation metrics for evaluating keyphrase extraction and generation models and helps to generate evaluation reports of your models.
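For instance, loading one of these datasets typically looks like the following minimal sketch, which uses Huggingface Datasets with the midas/inspec dataset and its extraction config (the same dataset and field names that appear in the training example reported in the issues below):

from datasets import load_dataset

# Load the Inspec keyphrase dataset in its "extraction" (BIO-tagged) configuration
dataset = load_dataset("midas/inspec", "extraction")

# Each example exposes the tokenized document and one BIO tag per token
example = dataset["train"][0]
print(example["document"][:10])
print(example["doc_bio_tags"][:10])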

Quick Start

Requirements and Installation

The project is based on transformers>=4.6.0 and Python 3.6+. If you do not have Python 3.6 or later, install it first, for example by following a guide for Ubuntu 16.04. Then, in your favorite virtual environment, simply do:

git clone https://github.com/midas-research/dlkp.git
cd dlkp
pip install -e .

Example Usage

Keyphrase Extraction
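A minimal extraction example, adapted from the tagger usage reported in the issues below (the model path is a placeholder for a trained checkpoint or a model name on the Huggingface Hub):

from dlkp.models import KeyphraseTagger

# Load a trained keyphrase tagger from a local checkpoint or the Huggingface Hub
tagger = KeyphraseTagger.load(model_name_or_path="path/to/trained/tagger")

# Run keyphrase extraction over a piece of text
input_text = "We explore how to learn task-specific language models aimed towards learning rich representation of keyphrases from text documents."
keyphrases = tagger.predict(input_text)
print(keyphrases)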

Keyphrase Generation
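A corresponding generation example, following the interface proposed in the Usage Interface issue below (the KeyphraseGenerator API is the proposed interface and may still be a work in progress):

from dlkp.models import KeyphraseGenerator

# Load a keyphrase generation model from a local checkpoint or the Huggingface Hub
generator = KeyphraseGenerator.load("path/to/trained/generator")

# Generate keyphrases for the input text
input_text = "We explore how to learn task-specific language models aimed towards learning rich representation of keyphrases from text documents."
keyphrases = generator.predict(input_text)
print(keyphrases)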

Tutorials

Citing dlkp

Contact

Please email your questions or comments to Amardeep Kumar or Debanjan Mahata

Contributing

Thanks for your interest in contributing! There are many ways to get involved; start with our contributor guidelines and then check these open issues for specific tasks.

The MIT License (MIT)

dlkp's People

Contributors

ad6398, debanjanbhucs


dlkp's Issues

Modularity

  • rename existing modules to more relevant names
  • remove nested imports
  • data collator base class
  • KpeTrainer base class

Metrics

  • KpMetrics base class
  • extraction metrics (a sketch follows this list)
    • Mean Reciprocal Rank (MRR)
    • Mean Average Precision (MAP)
    • NDCG
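
A minimal sketch of what ranked extraction metrics such as MRR and MAP could look like (illustrative only; not the library's actual KpMetrics implementation):

def mean_reciprocal_rank(predicted, gold):
    # predicted: list of ranked keyphrase lists; gold: list of gold keyphrase sets
    total = 0.0
    for preds, truth in zip(predicted, gold):
        for rank, kp in enumerate(preds, start=1):
            if kp in truth:
                total += 1.0 / rank
                break
    return total / len(predicted)

def mean_average_precision(predicted, gold, k=10):
    # Average precision over the top-k predictions, averaged across documents
    total = 0.0
    for preds, truth in zip(predicted, gold):
        hits, ap = 0, 0.0
        for rank, kp in enumerate(preds[:k], start=1):
            if kp in truth:
                hits += 1
                ap += hits / rank
        total += ap / max(1, min(len(truth), k))
    return total / len(predicted)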

Dataset Config Name in KEDataArguments

dataset_config_name="extraction",

Do we need this in KEDataArguments? What if someone enters 'generation'? 'extraction' should be fixed, and users should not be able to change it.
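
One way to enforce this would be to validate or hard-code the value when the arguments are constructed. A hypothetical sketch of such a check (not the current dlkp code; the field names mirror the example usage below):

from dataclasses import dataclass

@dataclass
class KEDataArgumentsSketch:
    dataset_name: str = "midas/inspec"
    dataset_config_name: str = "extraction"

    def __post_init__(self):
        # Keyphrase extraction only makes sense with the "extraction" config,
        # so fail early instead of letting training break later.
        if self.dataset_config_name != "extraction":
            raise ValueError("dataset_config_name must be 'extraction' for keyphrase extraction")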

Model Training Failed for bloomberg/KBIR

Model training failed while fine-tuning the bloomberg/KBIR model.

Here is the output:

(venv) debanjan@deb-research:~/code/research/dlkp/examples/ke$ CUDA_VISIBLE_DEVICES=1 python ke_tagger_transformers.py
03/15/2022 21:56:26 - WARNING - dlkp.extraction.train_eval_kp_tagger -   Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
03/15/2022 21:56:26 - INFO - dlkp.extraction.train_eval_kp_tagger -   Training/evaluation parameters KETrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=100,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=3e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=../../outputs/runs/Mar15_21-56-26_deb-research,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=100,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=2,
optim=OptimizerNames.ADAMW_HF,
output_dir=../../outputs,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=4,
per_device_train_batch_size=4,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=../../outputs,
save_on_each_node=False,
save_steps=1000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
[INFO|configuration_utils.py:648] 2022-03-15 21:56:27,153 >> loading configuration file https://huggingface.co/bloomberg/KBIR/resolve/main/config.json from cache at /home/debanjan/.cache/huggingface/transformers/e3e4e9cc0f5071082c8ddd8a31edf789b32dc26ba7dec62e3fbe88190bb22206.a2909b3b000ff4512970d210c555be4a09eeb3e22fbde068618b6919d78ddc34
[INFO|configuration_utils.py:684] 2022-03-15 21:56:27,154 >> Model config RobertaConfig {
  "_name_or_path": "bloomberg/KBIR",
  "architectures": [
    "KLMForReplacementAndMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.17.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

[INFO|tokenization_utils_base.py:1786] 2022-03-15 21:56:28,110 >> loading file https://huggingface.co/bloomberg/KBIR/resolve/main/vocab.json from cache at /home/debanjan/.cache/huggingface/transformers/f9fcf68a490a2d176769963ac10ab2c59182602353f3484d14da025d9d7585ab.647b4548b6d9ea817e82e7a9231a320231a1c9ea24053cc9e758f3fe68216f05
[INFO|tokenization_utils_base.py:1786] 2022-03-15 21:56:28,110 >> loading file https://huggingface.co/bloomberg/KBIR/resolve/main/merges.txt from cache at /home/debanjan/.cache/huggingface/transformers/ef76f4aba99dea7f976cccdc74652fdc84f73480f539d10abe709d5182fe94ed.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1786] 2022-03-15 21:56:28,110 >> loading file https://huggingface.co/bloomberg/KBIR/resolve/main/tokenizer.json from cache at /home/debanjan/.cache/huggingface/transformers/5af92da1cf1b5bbe300ffd058e94205c37374f4f5de77f2be7d41d559f9aa59b.fc9576039592f026ad76a1c231b89aee8668488c671dfbe6616bab2ed298d730
[INFO|tokenization_utils_base.py:1786] 2022-03-15 21:56:28,110 >> loading file https://huggingface.co/bloomberg/KBIR/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1786] 2022-03-15 21:56:28,110 >> loading file https://huggingface.co/bloomberg/KBIR/resolve/main/special_tokens_map.json from cache at /home/debanjan/.cache/huggingface/transformers/214f7e5010364e3b2a1b4768a7c94d3925495de7d1df2b3a4bf3bc9e60fec3a9.cb2244924ab24d706b02fd7fcedaea4531566537687a539ebb94db511fd122a0
[INFO|tokenization_utils_base.py:1786] 2022-03-15 21:56:28,110 >> loading file https://huggingface.co/bloomberg/KBIR/resolve/main/tokenizer_config.json from cache at /home/debanjan/.cache/huggingface/transformers/499673c10e6c643617124841edb2de871366d6c41fc6216f2cb4fea18705f2df.ad1eb0115e1dcd62936cffaee6ce0d19c9b8fea6a4ad255cc807ce29db906ec2
[INFO|configuration_utils.py:648] 2022-03-15 21:56:28,234 >> loading configuration file https://huggingface.co/bloomberg/KBIR/resolve/main/config.json from cache at /home/debanjan/.cache/huggingface/transformers/e3e4e9cc0f5071082c8ddd8a31edf789b32dc26ba7dec62e3fbe88190bb22206.a2909b3b000ff4512970d210c555be4a09eeb3e22fbde068618b6919d78ddc34
[INFO|configuration_utils.py:684] 2022-03-15 21:56:28,235 >> Model config RobertaConfig {
  "_name_or_path": "bloomberg/KBIR",
  "architectures": [
    "KLMForReplacementAndMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.17.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

03/15/2022 21:56:29 - WARNING - datasets.builder -   Reusing dataset inspec (/home/debanjan/.cache/huggingface/datasets/midas___inspec/extraction/0.0.1/debd18641afb7048a36cee2b7bb8dfbf2cd1a68899118653a42fd760cf84284e)
100%|██████████| 3/3 [00:00<00:00, 1503.87it/s]
[WARNING|tokenization_utils_base.py:2334] 2022-03-15 21:56:29,649 >> Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
[WARNING|tokenization_utils_base.py:2347] 2022-03-15 21:56:29,649 >> Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
[INFO|configuration_utils.py:648] 2022-03-15 21:56:31,430 >> loading configuration file https://huggingface.co/bloomberg/KBIR/resolve/main/config.json from cache at /home/debanjan/.cache/huggingface/transformers/e3e4e9cc0f5071082c8ddd8a31edf789b32dc26ba7dec62e3fbe88190bb22206.a2909b3b000ff4512970d210c555be4a09eeb3e22fbde068618b6919d78ddc34
[INFO|configuration_utils.py:684] 2022-03-15 21:56:31,430 >> Model config RobertaConfig {
  "_name_or_path": "bloomberg/KBIR",
  "architectures": [
    "KLMForReplacementAndMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.17.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

[INFO|modeling_utils.py:1431] 2022-03-15 21:56:31,803 >> loading weights file https://huggingface.co/bloomberg/KBIR/resolve/main/pytorch_model.bin from cache at /home/debanjan/.cache/huggingface/transformers/dcbcac674440cfdfe901cc2259b02eb80c5069676d931c4416a862656ebf0f49.5aec73b8a150e8021bbc6d6d1de4d9548566e457a9a5044790d0f685dcb48c6b
[WARNING|modeling_utils.py:1693] 2022-03-15 21:56:34,052 >> Some weights of the model checkpoint at bloomberg/KBIR were not used when initializing RobertaForTokenClassification: ['infilling_head.mlp_layer_norm.layer_norm1.bias', 'infilling_head.position_embeddings.weight', 'infilling_head.mlp_layer_norm.layer_norm2.bias', 'lm_head.dense.weight', 'replacement_classification_head.classifier.weight', 'lm_head.dense.bias', 'infilling_head.mlp_layer_norm.linear1.weight', 'infilling_head.decoder.weight', 'infilling_head.mlp_layer_norm.layer_norm1.weight', 'infilling_head.num_tok_classifier.bias', 'infilling_head.mlp_layer_norm.linear2.bias', 'replacement_classification_head.classifier.bias', 'infilling_head.mlp_layer_norm.layer_norm2.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'infilling_head.mlp_layer_norm.linear2.weight', 'infilling_head.num_tok_classifier.weight', 'infilling_head.mlp_layer_norm.linear1.bias', 'replacement_classification_head.bias', 'infilling_head.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[WARNING|modeling_utils.py:1704] 2022-03-15 21:56:34,052 >> Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at bloomberg/KBIR and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|trainer.py:570] 2022-03-15 21:56:37,210 >> The following columns in the training set  don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: id, doc_bio_tags, special_tokens_mask, document. If id, doc_bio_tags, special_tokens_mask, document are not expected by `RobertaForTokenClassification.forward`,  you can safely ignore this message.
/home/debanjan/code/research/dlkp/venv/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[INFO|trainer.py:1279] 2022-03-15 21:56:37,258 >> ***** Running training *****
[INFO|trainer.py:1280] 2022-03-15 21:56:37,258 >>   Num examples = 1000
[INFO|trainer.py:1281] 2022-03-15 21:56:37,258 >>   Num Epochs = 2
[INFO|trainer.py:1282] 2022-03-15 21:56:37,258 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:1283] 2022-03-15 21:56:37,258 >>   Total train batch size (w. parallel, distributed & accumulation) = 4
[INFO|trainer.py:1284] 2022-03-15 21:56:37,259 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1285] 2022-03-15 21:56:37,259 >>   Total optimization steps = 500
  3%|█████▋                                  | 15/500 [00:02<01:17,  6.25it/s]
Traceback (most recent call last):
  File "ke_tagger_transformers.py", line 40, in <module>
    KeyphraseTagger.train_and_eval(
  File "/home/debanjan/code/research/dlkp/src/dlkp/extraction/tagger.py", line 95, in train_and_eval
    return train_eval_extraction_model(
  File "/home/debanjan/code/research/dlkp/src/dlkp/extraction/train_eval_kp_tagger.py", line 165, in train_eval_extraction_model
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/debanjan/code/research/dlkp/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1400, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/debanjan/code/research/dlkp/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1984, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/debanjan/code/research/dlkp/src/dlkp/extraction/trainer.py", line 219, in compute_loss
    outputs = model(**inputs)
  File "/home/debanjan/code/research/dlkp/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/debanjan/code/research/dlkp/venv/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 1397, in forward
    outputs = self.roberta(
  File "/home/debanjan/code/research/dlkp/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/debanjan/code/research/dlkp/venv/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 816, in forward
    buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (527) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [4, 527].  Tensor sizes: [1, 514]

Code Used:

from dlkp.models import KeyphraseTagger
from dlkp.extraction import (
    KEDataArguments,
    KEModelArguments,
    KETrainingArguments,
)

training_args = KETrainingArguments(
    output_dir="../../outputs",
    learning_rate=3e-5,
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    # gradient_accumulation_steps=4,
    do_train=True,
    do_eval=True,
    do_predict=False,
    evaluation_strategy="steps",
    save_steps=1000,
    eval_steps=100,
    # lr_scheduler_type= 'cosine',
    # warmup_steps=200,
    logging_steps=100
    # weight_decay =0.001
)
model_args = KEModelArguments(
    model_name_or_path="bloomberg/KBIR",
    use_crf=True,
)
data_args = KEDataArguments(
    dataset_name="midas/inspec",
    dataset_config_name="extraction",
    pad_to_max_length=True,
    overwrite_cache=True,
    label_all_tokens=True,
    preprocessing_num_workers=8,
    return_entity_level_metrics=True,
)
KeyphraseTagger.train_and_eval(
    model_args=model_args,
    data_args=data_args,
    training_args=training_args
)

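The traceback indicates that the tokenized batches (527 tokens) exceed KBIR's max_position_embeddings of 514, which matches the earlier warnings that no maximum length was provided for padding and truncation. A possible workaround, assuming KEDataArguments exposes a max_seq_length field like the standard transformers token-classification examples (an assumption, not a verified dlkp parameter), would be to cap the sequence length explicitly:

data_args = KEDataArguments(
    dataset_name="midas/inspec",
    dataset_config_name="extraction",
    pad_to_max_length=True,
    max_seq_length=512,  # assumed field; keeps inputs within KBIR's 514 position embeddings
    overwrite_cache=True,
    label_all_tokens=True,
    preprocessing_num_workers=8,
    return_entity_level_metrics=True,
)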

Logger

  • improve logging system

Save the best_model

While training a model, it does not save the best model in a separate directory. It would be nice to save the best model under a best_model directory inside the outputs directory by default, and users should also be able to point to a directory of their choice.
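
Until such an option exists, a possible workaround is to use the underlying transformers TrainingArguments fields that KETrainingArguments already exposes in its configuration dump (load_best_model_at_end, metric_for_best_model); a hedged sketch:

training_args = KETrainingArguments(
    output_dir="../../outputs",
    evaluation_strategy="steps",
    eval_steps=100,
    save_steps=100,                 # must align with eval_steps when loading the best model at the end
    save_total_limit=2,
    load_best_model_at_end=True,    # reload the best checkpoint once training finishes
    metric_for_best_model="f1",     # assumed metric key; depends on the metrics dlkp reports
)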

Dataset class

  • KpDatset base class (inherited from Hugging Face Dataset)
  • customized pre-processing steps
  • train, dev, and test splits from a single file

Unable to select the free gpu while training

Currently the training code does not select a free GPU when training models. On a multi-GPU system, if one of the GPUs is busy (GPU:0), it does not fall back to the other GPU (GPU:1); it would be nice to have that feature.
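
A minimal sketch of how such a selection could work, picking the CUDA device with the most free memory via PyTorch (illustrative only; dlkp does not currently do this):

import torch

def pick_freest_gpu() -> int:
    # Return the index of the CUDA device with the most free memory
    free_memory = []
    for device_id in range(torch.cuda.device_count()):
        free_bytes, _total_bytes = torch.cuda.mem_get_info(device_id)
        free_memory.append((free_bytes, device_id))
    return max(free_memory)[1]

if torch.cuda.is_available():
    torch.cuda.set_device(pick_freest_gpu())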

Building from git main generates errors it didn't before

Problem Description

Hi, thanks for this great library! I used it successfully a couple of months ago, but on returning, my code no longer worked. I was installing the library as:

pip install git+https://github.com/midas-research/dlkp.git

The first thing I noticed is that when I loaded a model I had trained with the library months ago, it started warning about a whole heap of unused roberta layers and recommended fine-tuning before getting any use out of the model, which I hadn't seen before.

Then when I tried to use the library I received the error:

---> 81 all_kps = KEDatasets.extract_kp_from_tags(token_ids, tags)
     83 extracted_kps = self.tokenizer.batch_decode(
     84     all_kps,
     85     skip_special_tokens=True,
     86     clean_up_tokenization_spaces=True,
     87 )
     88 examples["extracted_keyphrase"] = extracted_kps

TypeError: extract_kp_from_tags() missing 1 required positional argument: 'tokenizer'

This was very strange because, looking at the code on GitHub in the main branch, that method did not expect a tokenizer in its signature.

I tried uninstalling and reinstalling, but still got the same strange code that I couldn't locate in the repo.

My Hack Solution

I managed to resolve the issue by installing from an earlier branch (ldkp):

pip install git+https://github.com/midas-research/dlkp.git@ldkp

I have no idea where or why it was pulling in code with method signatures that didn't match the code I could see in the repo. But this fixed it.

Any idea what's going on with the main branch?

Thanks!

Usage Interface

It would be nice to have a usage interface as follows:

from dlkp.models import KeyphraseTagger
from dlkp.models import KeyphraseGenerator

# load the keyphrase tagger
tagger = KeyphraseTagger.load(<path to tagger/model name in huggingface hub>)

# run keyphrase extraction over input_text
tagger.predict(input_text)

# load the keyphrase generator
generator = KeyphraseGenerator.load(<path to generator/model name in huggingface hub>)

# run keyphrase generation over input_text
generator.predict(input_text)

Model predictions have issues with merging sub-words

The model prediction seems to have a bug and does not properly merge sub-words in the output.

For example this is the output obtained:

[[' representation', ' documents', ' mask', 'ing strategies', ' transformer language models', ' discrim', 'in', ' gener', 'ative', ' discrim', 'in', 'ative', ' Key', 'phrase Boundary Infilling with Replacement', 'K', 'BI', ' key', 'phrase extraction', ' gener', 'ative', ' BART', ' Key', 'B', 'ART', ' Cat', 'Se', 'q', ' key', ' generation', ' named entity recognition', ' answering', ' relation extraction', ' abstract', 'ive summarization']]

After executing the following code:

tagger = KeyphraseTagger.load(
    model_name_or_path="../../outputs"
    )

input_text = "In this work, we explore how to learn task-specific language models aimed towards learning rich " \
             "representation of keyphrases from text documents. We experiment with different masking strategies for " \
             "pre-training transformer language models (LMs) in discriminative as well as generative settings. In the " \
             "discriminative setting, we introduce a new pre-training objective - Keyphrase Boundary Infilling with " \
             "Replacement (KBIR), showing large gains in performance (upto 9.26 points in F1) over SOTA, when LM " \
             "pre-trained using KBIR is fine-tuned for the task of keyphrase extraction. In the generative setting, we " \
             "introduce a new pre-training setup for BART - KeyBART, that reproduces the keyphrases related to the " \
             "input text in the CatSeq format, instead of the denoised original input. This also led to gains in " \
             "performance (upto 4.33 points in F1@M) over SOTA for keyphrase generation. Additionally, we also " \
             "fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), " \
             "relation extraction (RE), abstractive summarization and achieve comparable performance with that of the " \
             "SOTA, showing that learning rich representation of keyphrases is indeed beneficial for many other " \
             "fundamental NLP tasks."

keyphrases = tagger.predict(input_text)
print(keyphrases)

As can be seen in the output, the keyphrase pieces 'mask' and 'ing strategies' are treated as separate keyphrases. This looks like a bug in merging sub-words when formatting the prediction output.
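
For reference, one common way to avoid this is to merge predictions at the word level using a fast tokenizer's word_ids(), so sub-word pieces of the same word can never end up in different keyphrases. A minimal sketch (illustrative; not dlkp's current implementation):

def merge_subword_predictions(encoding, tags):
    # encoding: a BatchEncoding from a fast tokenizer; tags: one BIO tag per sub-word token
    word_level_tags = {}
    for token_index, word_id in enumerate(encoding.word_ids()):
        if word_id is None:
            continue  # skip special tokens such as <s> and </s>
        # keep the tag of the first sub-word of each word
        word_level_tags.setdefault(word_id, tags[token_index])
    # word_level_tags maps word index -> BIO tag; keyphrases are then built over whole words
    return [word_level_tags[i] for i in sorted(word_level_tags)]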

inference [KPE]

  • function to extract keyphrases from BIO tags (see the sketch below)
  • inference function where the user can pass text input and get keyphrases extracted from it
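
A minimal sketch of extracting keyphrases from word-level BIO tags (illustrative only; the library's KEDatasets.extract_kp_from_tags may differ):

def extract_keyphrases_from_bio(words, tags):
    # words: list of tokens; tags: parallel list of "B", "I", "O" labels
    keyphrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B":
            if current:
                keyphrases.append(" ".join(current))
            current = [word]
        elif tag == "I" and current:
            current.append(word)
        else:
            if current:
                keyphrases.append(" ".join(current))
            current = []
    if current:
        keyphrases.append(" ".join(current))
    return keyphrases

# extract_keyphrases_from_bio(["keyphrase", "extraction", "is", "fun"], ["B", "I", "O", "O"])
# -> ["keyphrase extraction"]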
