
dlkp's Introduction

dlkp [WIP]


A transformers-based deep learning library for identifying keyphrases in text documents.

dlkp is:

  • A deep learning keyphrase extraction and generation library. dlkp allows you to train and apply state-of-the-art deep learning models for keyphrase extraction and generation from text documents.

  • A transformer-based framework. dlkp builds directly on transformers, making it easy to train and evaluate your own transformer-based keyphrase extraction and generation models and to experiment with new approaches using different contextual embeddings.

  • A dataset library for keyphrase extraction and generation. dlkp has simple interfaces that allow you to download several benchmark datasets in the domain of keyphrase extraction and generation from Huggingface Datasets and readily use them in training your models with the transformers library. It provides easy access to BIO-tagged data for several datasets, such as Inspec, NUS, WWW, KDD, KP20K, LDKP and many more, suitable for training your keyphrase extraction model as a sequence tagger (see the example after this list).

  • An evaluation library for keyphrase extraction and generation. dlkp implements several evaluation metrics for evaluating keyphrase extraction and generation models and helps to generate evaluation reports of your models.
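For instance, loading one of these datasets typically looks like the following minimal sketch, which uses Huggingface Datasets with the midas/inspec dataset and its extraction config (the same dataset and field names that appear in the training example reported in the issues below):

from datasets import load_dataset

# Load the Inspec keyphrase dataset in its "extraction" (BIO-tagged) configuration
dataset = load_dataset("midas/inspec", "extraction")

# Each example exposes the tokenized document and one BIO tag per token
example = dataset["train"][0]
print(example["document"][:10])
print(example["doc_bio_tags"][:10])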

Quick Start

Requirements and Installation

The project is based on transformers>=4.6.0 and Python 3.6+. If you do not have Python 3.6 or later, install it first, for example by following a guide for Ubuntu 16.04. Then, in your favorite virtual environment, simply do:

git clone https://github.com/midas-research/dlkp.git
cd dlkp
pip install -e .

Example Usage

Keyphrase Extraction
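A minimal extraction example, adapted from the tagger usage reported in the issues below (the model path is a placeholder for a trained checkpoint or a model name on the Huggingface Hub):

from dlkp.models import KeyphraseTagger

# Load a trained keyphrase tagger from a local checkpoint or the Huggingface Hub
tagger = KeyphraseTagger.load(model_name_or_path="path/to/trained/tagger")

# Run keyphrase extraction over a piece of text
input_text = "We explore how to learn task-specific language models aimed towards learning rich representation of keyphrases from text documents."
keyphrases = tagger.predict(input_text)
print(keyphrases)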

Keyphrase Generation
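A corresponding generation example, following the interface proposed in the Usage Interface issue below (the KeyphraseGenerator API is the proposed interface and may still be a work in progress):

from dlkp.models import KeyphraseGenerator

# Load a keyphrase generation model from a local checkpoint or the Huggingface Hub
generator = KeyphraseGenerator.load("path/to/trained/generator")

# Generate keyphrases for the input text
input_text = "We explore how to learn task-specific language models aimed towards learning rich representation of keyphrases from text documents."
keyphrases = generator.predict(input_text)
print(keyphrases)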

Tutorials

Citing dlkp

Contact

Please email your questions or comments to Amardeep Kumar or Debanjan Mahata

Contributing

Thanks for your interest in contributing! There are many ways to get involved; start with our contributor guidelines and then check these open issues for specific tasks.

The MIT License (MIT)

dlkp's People

Contributors

ad6398, debanjanbhucs


dlkp's Issues

Modularity

  • rename existing modules to more relevant names
  • remove nested imports
  • data collator base class
  • KpeTrainer base class

Metrics

  • KpMetrics base class
  • extraction metrics (a sketch follows this list)
    • Mean Reciprocal Rank (MRR)
    • Mean Average Precision (MAP)
    • NDCG
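
A minimal sketch of what ranked extraction metrics such as MRR and MAP could look like (illustrative only; not the library's actual KpMetrics implementation):

def mean_reciprocal_rank(predicted, gold):
    # predicted: list of ranked keyphrase lists; gold: list of gold keyphrase sets
    total = 0.0
    for preds, truth in zip(predicted, gold):
        for rank, kp in enumerate(preds, start=1):
            if kp in truth:
                total += 1.0 / rank
                break
    return total / len(predicted)

def mean_average_precision(predicted, gold, k=10):
    # Average precision over the top-k predictions, averaged across documents
    total = 0.0
    for preds, truth in zip(predicted, gold):
        hits, ap = 0, 0.0
        for rank, kp in enumerate(preds[:k], start=1):
            if kp in truth:
                hits += 1
                ap += hits / rank
        total += ap / max(1, min(len(truth), k))
    return total / len(predicted)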

Dataset Config Name in KEDataArguments

dataset_config_name="extraction",

Do we need this in KEDataArguments? What if someone enters 'generation'? 'extraction' should be fixed, and users should not be able to change it.
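
One way to enforce this would be to validate or hard-code the value when the arguments are constructed. A hypothetical sketch of such a check (not the current dlkp code; the field names mirror the example usage below):

from dataclasses import dataclass

@dataclass
class KEDataArgumentsSketch:
    dataset_name: str = "midas/inspec"
    dataset_config_name: str = "extraction"

    def __post_init__(self):
        # Keyphrase extraction only makes sense with the "extraction" config,
        # so fail early instead of letting training break later.
        if self.dataset_config_name != "extraction":
            raise ValueError("dataset_config_name must be 'extraction' for keyphrase extraction")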

Model Training Failed for bloomberg/KBIR

Model training failed while fine-tuning the bloomberg/KBIR model.

Here is the output:

(venv) debanjan@deb-research:~/code/research/dlkp/examples/ke$ CUDA_VISIBLE_DEVICES=1 python ke_tagger_transformers.py
03/15/2022 21:56:26 - WARNING - dlkp.extraction.train_eval_kp_tagger -   Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
03/15/2022 21:56:26 - INFO - dlkp.extraction.train_eval_kp_tagger -   Training/evaluation parameters KETrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=100,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=3e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=../../outputs/runs/Mar15_21-56-26_deb-research,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=100,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=2,
optim=OptimizerNames.ADAMW_HF,
output_dir=../../outputs,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=4,
per_device_train_batch_size=4,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=../../outputs,
save_on_each_node=False,
save_steps=1000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
[INFO|configuration_utils.py:648] 2022-03-15 21:56:27,153 >> loading configuration file https://huggingface.co/bloomberg/KBIR/resolve/main/config.json from cache at /home/debanjan/.cache/huggingface/transformers/e3e4e9cc0f5071082c8ddd8a31edf789b32dc26ba7dec62e3fbe88190bb22206.a2909b3b000ff4512970d210c555be4a09eeb3e22fbde068618b6919d78ddc34
[INFO|configuration_utils.py:684] 2022-03-15 21:56:27,154 >> Model config RobertaConfig {
  "_name_or_path": "bloomberg/KBIR",
  "architectures": [
    "KLMForReplacementAndMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.17.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

[INFO|tokenization_utils_base.py:1786] 2022-03-15 21:56:28,110 >> loading file https://huggingface.co/bloomberg/KBIR/resolve/main/vocab.json from cache at /home/debanjan/.cache/huggingface/transformers/f9fcf68a490a2d176769963ac10ab2c59182602353f3484d14da025d9d7585ab.647b4548b6d9ea817e82e7a9231a320231a1c9ea24053cc9e758f3fe68216f05
[INFO|tokenization_utils_base.py:1786] 2022-03-15 21:56:28,110 >> loading file https://huggingface.co/bloomberg/KBIR/resolve/main/merges.txt from cache at /home/debanjan/.cache/huggingface/transformers/ef76f4aba99dea7f976cccdc74652fdc84f73480f539d10abe709d5182fe94ed.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1786] 2022-03-15 21:56:28,110 >> loading file https://huggingface.co/bloomberg/KBIR/resolve/main/tokenizer.json from cache at /home/debanjan/.cache/huggingface/transformers/5af92da1cf1b5bbe300ffd058e94205c37374f4f5de77f2be7d41d559f9aa59b.fc9576039592f026ad76a1c231b89aee8668488c671dfbe6616bab2ed298d730
[INFO|tokenization_utils_base.py:1786] 2022-03-15 21:56:28,110 >> loading file https://huggingface.co/bloomberg/KBIR/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1786] 2022-03-15 21:56:28,110 >> loading file https://huggingface.co/bloomberg/KBIR/resolve/main/special_tokens_map.json from cache at /home/debanjan/.cache/huggingface/transformers/214f7e5010364e3b2a1b4768a7c94d3925495de7d1df2b3a4bf3bc9e60fec3a9.cb2244924ab24d706b02fd7fcedaea4531566537687a539ebb94db511fd122a0
[INFO|tokenization_utils_base.py:1786] 2022-03-15 21:56:28,110 >> loading file https://huggingface.co/bloomberg/KBIR/resolve/main/tokenizer_config.json from cache at /home/debanjan/.cache/huggingface/transformers/499673c10e6c643617124841edb2de871366d6c41fc6216f2cb4fea18705f2df.ad1eb0115e1dcd62936cffaee6ce0d19c9b8fea6a4ad255cc807ce29db906ec2
[INFO|configuration_utils.py:648] 2022-03-15 21:56:28,234 >> loading configuration file https://huggingface.co/bloomberg/KBIR/resolve/main/config.json from cache at /home/debanjan/.cache/huggingface/transformers/e3e4e9cc0f5071082c8ddd8a31edf789b32dc26ba7dec62e3fbe88190bb22206.a2909b3b000ff4512970d210c555be4a09eeb3e22fbde068618b6919d78ddc34
[INFO|configuration_utils.py:684] 2022-03-15 21:56:28,235 >> Model config RobertaConfig {
  "_name_or_path": "bloomberg/KBIR",
  "architectures": [
    "KLMForReplacementAndMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.17.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

03/15/2022 21:56:29 - WARNING - datasets.builder -   Reusing dataset inspec (/home/debanjan/.cache/huggingface/datasets/midas___inspec/extraction/0.0.1/debd18641afb7048a36cee2b7bb8dfbf2cd1a68899118653a42fd760cf84284e)
100%|██████████| 3/3 [00:00<00:00, 1503.87it/s]
[WARNING|tokenization_utils_base.py:2334] 2022-03-15 21:56:29,649 >> Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
[WARNING|tokenization_utils_base.py:2347] 2022-03-15 21:56:29,649 >> Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
[INFO|configuration_utils.py:648] 2022-03-15 21:56:31,430 >> loading configuration file https://huggingface.co/bloomberg/KBIR/resolve/main/config.json from cache at /home/debanjan/.cache/huggingface/transformers/e3e4e9cc0f5071082c8ddd8a31edf789b32dc26ba7dec62e3fbe88190bb22206.a2909b3b000ff4512970d210c555be4a09eeb3e22fbde068618b6919d78ddc34
[INFO|configuration_utils.py:684] 2022-03-15 21:56:31,430 >> Model config RobertaConfig {
  "_name_or_path": "bloomberg/KBIR",
  "architectures": [
    "KLMForReplacementAndMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.17.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

[INFO|modeling_utils.py:1431] 2022-03-15 21:56:31,803 >> loading weights file https://huggingface.co/bloomberg/KBIR/resolve/main/pytorch_model.bin from cache at /home/debanjan/.cache/huggingface/transformers/dcbcac674440cfdfe901cc2259b02eb80c5069676d931c4416a862656ebf0f49.5aec73b8a150e8021bbc6d6d1de4d9548566e457a9a5044790d0f685dcb48c6b
[WARNING|modeling_utils.py:1693] 2022-03-15 21:56:34,052 >> Some weights of the model checkpoint at bloomberg/KBIR were not used when initializing RobertaForTokenClassification: ['infilling_head.mlp_layer_norm.layer_norm1.bias', 'infilling_head.position_embeddings.weight', 'infilling_head.mlp_layer_norm.layer_norm2.bias', 'lm_head.dense.weight', 'replacement_classification_head.classifier.weight', 'lm_head.dense.bias', 'infilling_head.mlp_layer_norm.linear1.weight', 'infilling_head.decoder.weight', 'infilling_head.mlp_layer_norm.layer_norm1.weight', 'infilling_head.num_tok_classifier.bias', 'infilling_head.mlp_layer_norm.linear2.bias', 'replacement_classification_head.classifier.bias', 'infilling_head.mlp_layer_norm.layer_norm2.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'infilling_head.mlp_layer_norm.linear2.weight', 'infilling_head.num_tok_classifier.weight', 'infilling_head.mlp_layer_norm.linear1.bias', 'replacement_classification_head.bias', 'infilling_head.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[WARNING|modeling_utils.py:1704] 2022-03-15 21:56:34,052 >> Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at bloomberg/KBIR and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|trainer.py:570] 2022-03-15 21:56:37,210 >> The following columns in the training set  don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: id, doc_bio_tags, special_tokens_mask, document. If id, doc_bio_tags, special_tokens_mask, document are not expected by `RobertaForTokenClassification.forward`,  you can safely ignore this message.
/home/debanjan/code/research/dlkp/venv/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[INFO|trainer.py:1279] 2022-03-15 21:56:37,258 >> ***** Running training *****
[INFO|trainer.py:1280] 2022-03-15 21:56:37,258 >>   Num examples = 1000
[INFO|trainer.py:1281] 2022-03-15 21:56:37,258 >>   Num Epochs = 2
[INFO|trainer.py:1282] 2022-03-15 21:56:37,258 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:1283] 2022-03-15 21:56:37,258 >>   Total train batch size (w. parallel, distributed & accumulation) = 4
[INFO|trainer.py:1284] 2022-03-15 21:56:37,259 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1285] 2022-03-15 21:56:37,259 >>   Total optimization steps = 500
  3%|█████▋                                  | 15/500 [00:02<01:17,  6.25it/s]
Traceback (most recent call last):
  File "ke_tagger_transformers.py", line 40, in <module>
    KeyphraseTagger.train_and_eval(
  File "/home/debanjan/code/research/dlkp/src/dlkp/extraction/tagger.py", line 95, in train_and_eval
    return train_eval_extraction_model(
  File "/home/debanjan/code/research/dlkp/src/dlkp/extraction/train_eval_kp_tagger.py", line 165, in train_eval_extraction_model
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/debanjan/code/research/dlkp/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1400, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/debanjan/code/research/dlkp/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1984, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/debanjan/code/research/dlkp/src/dlkp/extraction/trainer.py", line 219, in compute_loss
    outputs = model(**inputs)
  File "/home/debanjan/code/research/dlkp/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/debanjan/code/research/dlkp/venv/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 1397, in forward
    outputs = self.roberta(
  File "/home/debanjan/code/research/dlkp/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/debanjan/code/research/dlkp/venv/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 816, in forward
    buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (527) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [4, 527].  Tensor sizes: [1, 514]

Code Used:

from dlkp.models import KeyphraseTagger
from dlkp.extraction import (
    KEDataArguments,
    KEModelArguments,
    KETrainingArguments,
)

training_args = KETrainingArguments(
    output_dir="../../outputs",
    learning_rate=3e-5,
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    # gradient_accumulation_steps=4,
    do_train=True,
    do_eval=True,
    do_predict=False,
    evaluation_strategy="steps",
    save_steps=1000,
    eval_steps=100,
    # lr_scheduler_type= 'cosine',
    # warmup_steps=200,
    logging_steps=100
    # weight_decay =0.001
)
model_args = KEModelArguments(
    model_name_or_path="bloomberg/KBIR",
    use_crf=True,
)
data_args = KEDataArguments(
    dataset_name="midas/inspec",
    dataset_config_name="extraction",
    pad_to_max_length=True,
    overwrite_cache=True,
    label_all_tokens=True,
    preprocessing_num_workers=8,
    return_entity_level_metrics=True,
)
KeyphraseTagger.train_and_eval(
    model_args=model_args,
    data_args=data_args,
    training_args=training_args
)

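The traceback indicates that the tokenized batches (527 tokens) exceed KBIR's max_position_embeddings of 514, which matches the earlier warnings that no maximum length was provided for padding and truncation. A possible workaround, assuming KEDataArguments exposes a max_seq_length field like the standard transformers token-classification examples (an assumption, not a verified dlkp parameter), would be to cap the sequence length explicitly:

data_args = KEDataArguments(
    dataset_name="midas/inspec",
    dataset_config_name="extraction",
    pad_to_max_length=True,
    max_seq_length=512,  # assumed field; keeps inputs within KBIR's 514 position embeddings
    overwrite_cache=True,
    label_all_tokens=True,
    preprocessing_num_workers=8,
    return_entity_level_metrics=True,
)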

Logger

  • improve logging system

Save the best_model

While training a model, it does not save the best model in a separate directory. It would be nice to save the best model under a best_model directory inside the outputs directory by default, and users should also be able to point to a directory of their choice.
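
Until such an option exists, a possible workaround is to use the underlying transformers TrainingArguments fields that KETrainingArguments already exposes in its configuration dump (load_best_model_at_end, metric_for_best_model); a hedged sketch:

training_args = KETrainingArguments(
    output_dir="../../outputs",
    evaluation_strategy="steps",
    eval_steps=100,
    save_steps=100,                 # must align with eval_steps when loading the best model at the end
    save_total_limit=2,
    load_best_model_at_end=True,    # reload the best checkpoint once training finishes
    metric_for_best_model="f1",     # assumed metric key; depends on the metrics dlkp reports
)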

Dataset class

  • KpDatset base class (inherited from Hugging Face Dataset)
  • customized pre-processing steps
  • train, dev, and test splits from a single file

Unable to select the free gpu while training

Currently the training code does not select a free GPU when training models. On a multi-GPU system, if one of the GPUs is busy (GPU:0), it does not fall back to the other GPU (GPU:1); it would be nice to have that feature.
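
A minimal sketch of how such a selection could work, picking the CUDA device with the most free memory via PyTorch (illustrative only; dlkp does not currently do this):

import torch

def pick_freest_gpu() -> int:
    # Return the index of the CUDA device with the most free memory
    free_memory = []
    for device_id in range(torch.cuda.device_count()):
        free_bytes, _total_bytes = torch.cuda.mem_get_info(device_id)
        free_memory.append((free_bytes, device_id))
    return max(free_memory)[1]

if torch.cuda.is_available():
    torch.cuda.set_device(pick_freest_gpu())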

Building from git main generates errors it didn't before

Problem Description

Hi, thanks for this great library! I used it successfully a couple of months ago, but on returning, my code no longer worked. I was installing the library as:

pip install git+https://github.com/midas-research/dlkp.git

The first thing I noticed is that when I loaded a model I had trained with the library months ago, it started warning about a whole heap of unused roberta layers and recommended fine-tuning before getting any use out of the model, which I hadn't seen before.

Then when I tried to use the library I received the error:

---> 81 all_kps = KEDatasets.extract_kp_from_tags(token_ids, tags)
     83 extracted_kps = self.tokenizer.batch_decode(
     84     all_kps,
     85     skip_special_tokens=True,
     86     clean_up_tokenization_spaces=True,
     87 )
     88 examples["extracted_keyphrase"] = extracted_kps

TypeError: extract_kp_from_tags() missing 1 required positional argument: 'tokenizer'

This was very strange because, looking at the code on GitHub in the main branch, that method did not expect a tokenizer in its signature.

I tried uninstalling and reinstalling, but still got the same strange code that I couldn't locate in the repo.

My Hack Solution

I managed to resolve the issue by installing from an earlier branch (ldkp):

pip install git+https://github.com/midas-research/dlkp.git@ldkp

I have no idea where or why it was pulling in code with method signatures that didn't match the code I could see in the repo. But this fixed it.

Any idea what's going on with the main branch?

Thanks!

Usage Interface

It would be nice to have a usage interface as follows:

from dlkp.models import KeyphraseTagger
from dlkp.models import KeyphraseGenerator

# load the keyphrase tagger
tagger = KeyphraseTagger.load(<path to tagger/model name in huggingface hub>)

# run keyphrase extraction over input_text
tagger.predict(input_text)

# load the keyphrase generator
generator = KeyphraseGenerator.load(<path to generator/model name in huggingface hub>)

# run keyphrase generation over input_text
generator.predict(input_text)

Model predictions have issues with merging sub-words

The model prediction seems to have a bug and does not properly merge sub-words in the output.

For example this is the output obtained:

[[' representation', ' documents', ' mask', 'ing strategies', ' transformer language models', ' discrim', 'in', ' gener', 'ative', ' discrim', 'in', 'ative', ' Key', 'phrase Boundary Infilling with Replacement', 'K', 'BI', ' key', 'phrase extraction', ' gener', 'ative', ' BART', ' Key', 'B', 'ART', ' Cat', 'Se', 'q', ' key', ' generation', ' named entity recognition', ' answering', ' relation extraction', ' abstract', 'ive summarization']]

After executing the following code:

tagger = KeyphraseTagger.load(
    model_name_or_path="../../outputs"
    )

input_text = "In this work, we explore how to learn task-specific language models aimed towards learning rich " \
             "representation of keyphrases from text documents. We experiment with different masking strategies for " \
             "pre-training transformer language models (LMs) in discriminative as well as generative settings. In the " \
             "discriminative setting, we introduce a new pre-training objective - Keyphrase Boundary Infilling with " \
             "Replacement (KBIR), showing large gains in performance (upto 9.26 points in F1) over SOTA, when LM " \
             "pre-trained using KBIR is fine-tuned for the task of keyphrase extraction. In the generative setting, we " \
             "introduce a new pre-training setup for BART - KeyBART, that reproduces the keyphrases related to the " \
             "input text in the CatSeq format, instead of the denoised original input. This also led to gains in " \
             "performance (upto 4.33 points in F1@M) over SOTA for keyphrase generation. Additionally, we also " \
             "fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), " \
             "relation extraction (RE), abstractive summarization and achieve comparable performance with that of the " \
             "SOTA, showing that learning rich representation of keyphrases is indeed beneficial for many other " \
             "fundamental NLP tasks."

keyphrases = tagger.predict(input_text)
print(keyphrases)

As can be seen in the output, the keyphrase pieces 'mask' and 'ing strategies' are treated as separate keyphrases. This looks like a bug in merging sub-words when formatting the prediction output.
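
For reference, one common way to avoid this is to merge predictions at the word level using a fast tokenizer's word_ids(), so sub-word pieces of the same word can never end up in different keyphrases. A minimal sketch (illustrative; not dlkp's current implementation):

def merge_subword_predictions(encoding, tags):
    # encoding: a BatchEncoding from a fast tokenizer; tags: one BIO tag per sub-word token
    word_level_tags = {}
    for token_index, word_id in enumerate(encoding.word_ids()):
        if word_id is None:
            continue  # skip special tokens such as <s> and </s>
        # keep the tag of the first sub-word of each word
        word_level_tags.setdefault(word_id, tags[token_index])
    # word_level_tags maps word index -> BIO tag; keyphrases are then built over whole words
    return [word_level_tags[i] for i in sorted(word_level_tags)]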

inference [KPE]

  • function to extract keyphrases from BIO tags (see the sketch below)
  • inference function where the user can pass text input and get keyphrases extracted from it
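
A minimal sketch of extracting keyphrases from word-level BIO tags (illustrative only; the library's KEDatasets.extract_kp_from_tags may differ):

def extract_keyphrases_from_bio(words, tags):
    # words: list of tokens; tags: parallel list of "B", "I", "O" labels
    keyphrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B":
            if current:
                keyphrases.append(" ".join(current))
            current = [word]
        elif tag == "I" and current:
            current.append(word)
        else:
            if current:
                keyphrases.append(" ".join(current))
            current = []
    if current:
        keyphrases.append(" ".join(current))
    return keyphrases

# extract_keyphrases_from_bio(["keyphrase", "extraction", "is", "fun"], ["B", "I", "O", "O"])
# -> ["keyphrase extraction"]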
