deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

Home Page: https://farm.deepset.ai

License: Apache License 2.0

Python 69.92% Dockerfile 0.03% Jupyter Notebook 29.94% HTML 0.11%
language-models bert nlp deep-learning transfer-learning pytorch nlp-library nlp-framework xlnet-pytorch ner

farm's People

Contributors

ahmedidr, bogdankostic, brandenchan, busyxin, cjb06776, cregouby, felixvor, ftesser, guan27, guggio, jinnerbichler, johann-petrak, julian-risch, kolk, lalitpagaria, lambdaofgod, lingsond, maknotavailable, pashok3d, philipmay, rohanag, skiran252, skirdey, stefan-it, tanaysoni, tanmaylaud, tholor, timoeller, tripl3a, tstadel


farm's Issues

Label_ids "None" upon initiating training of Adaptive Model

When calling the train method on the AdaptiveModel, an error is thrown while the losses are collected from the prediction head (logits_to_loss_per_head in adaptive_model): when logits and labels are combined in the prediction head (logits_to_loss in prediction_head) to create a per_sample_loss, the variable "label_ids" is None, so "view()" cannot be called on it. "label_ids" becomes None because the assignment "kwargs.get(self.label_tensor_name)" returns None.
However, the documentation does not reveal whether and where "label_ids", or rather "label_tensor_names", ought to be specified.

Error message
Train epoch 1/5: 0%| | 0/200 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/f_weise/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/192.5728.105/helpers/pydev/pydevd.py", line 2060, in
main()
File "/home/f_weise/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/192.5728.105/helpers/pydev/pydevd.py", line 2054, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/home/f_weise/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/192.5728.105/helpers/pydev/pydevd.py", line 1405, in run
return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
File "/home/f_weise/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/192.5728.105/helpers/pydev/pydevd.py", line 1412, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/home/f_weise/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/192.5728.105/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/f_weise/projects/adup-watchdog/src/scripts/BertBaseCased.py", line 152, in
model_training = trainer.train(model)
File "/home/f_weise/projects/adup-watchdog/.venv/src/farm/farm/train.py", line 154, in train
per_sample_loss = model.logits_to_loss(logits=logits, **batch)
File "/home/f_weise/projects/adup-watchdog/.venv/src/farm/farm/modeling/adaptive_model.py", line 129, in logits_to_loss
all_losses = self.logits_to_loss_per_head(logits, **kwargs)
File "/home/f_weise/projects/adup-watchdog/.venv/src/farm/farm/modeling/adaptive_model.py", line 116, in logits_to_loss_per_head
all_losses.append(head.logits_to_loss(logits=logits_for_one_head, **kwargs))
File "/home/f_weise/projects/adup-watchdog/.venv/src/farm/farm/modeling/prediction_head.py", line 263, in logits_to_loss
return self.loss_fct(logits, label_ids.view(-1))
AttributeError: 'NoneType' object has no attribute 'view'

Expected behavior
I expected the variable "label_ids" to be a tensor whose length/shape matches the number of samples in the training set, so that a per_sample_loss can be calculated and training starts successfully.

To Reproduce
# Imports assumed for FARM 0.2.x (they were not part of the original snippet):
import re
import torch
from farm.modeling.tokenization import BertTokenizer
from farm.data_handler.processor import TextClassificationProcessor
from farm.data_handler.data_silo import DataSilo
from farm.modeling.language_model import Bert
from farm.modeling.prediction_head import TextClassificationHead
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.optimization import initialize_optimizer
from farm.train import Trainer

class BertBaseCased(object):

    def get_layout_for_pytorch(self, list_of_pcs):
        list_of_texts_to_channels = []
        for pc in list_of_pcs:
            pc.text = re.sub(r'\s+', ' ', pc.text).strip()
            pc.text = re.sub(r'[^a-zA-ZÜÖÄüöä\d\s:\.\,]', ' ', pc.text)
            # Skip empty/placeholder texts and duplicates. The original condition
            # (`if pc.text is "" or " ' " and ...`) was always truthy for " ' ".
            if pc.text in ("", " ' ") or pc.text in list_of_texts_to_channels:
                continue
            else:
                list_of_texts_to_channels.append(
                    [pc.text, 'sensitive' if pc.channel.class_channel_id < 100 else 'non-sensitive'])
        return list_of_texts_to_channels

import logging
logging.getLogger().setLevel(logging.INFO)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Devices available: {}".format(device))

tokenizer = BertTokenizer.from_pretrained(
    pretrained_model_name_or_path="bert-base-german-cased",
    do_lower_case=True)

processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=50,
                                        data_dir="/home/f_weise/projects/adup-watchdog/src/scripts/",
                                        train_filename="data_train.tsv",
                                        dev_filename=None,
                                        test_filename="data_test.tsv",
                                        dev_split=0.1,
                                        columns = ["text", "label"],
                                        label_list= ["sensitive", "non-sensitive"],
                                        source_field= "label",
                                        metrics = ["acc"],
                                        use_multiprocessing = True)

data_silo = DataSilo(processor=processor,batch_size= 15)

MODEL_NAME_OR_PATH = "bert-base-german-cased"

language_model = Bert.load(MODEL_NAME_OR_PATH)

LAYER_DIMS = [768, 2]

processor.add_task(name='text_classification', metric='acc', label_list=["sensitive", "non-sensitive"])
prediction_head = TextClassificationHead(layer_dims=[768, 2])

EMBEDS_DROPOUT_PROB = 0.1

model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
    lm_output_types=["per_sequence"],
    device=device)


LEARNING_RATE = 2e-5
WARMUP_PROPORTION = 0.1
N_EPOCHS = 5

optimizer, warmup_linear = initialize_optimizer(
    model=model,
    learning_rate=LEARNING_RATE,
    warmup_proportion=WARMUP_PROPORTION,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=N_EPOCHS)

N_GPU = 0

trainer = Trainer(
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=N_EPOCHS,
    n_gpu=N_GPU,
    warmup_linear=warmup_linear,
    device=device,
)


model_training = trainer.train(model)

save_dir = "bert-german-test_version"
model.save(save_dir)
processor.save(save_dir)


(Screenshots attached: data_test, data_train)

System:

  • OS: Mint 19.1
  • GPU/CPU: cpu
  • FARM version: farm-0.2.0

Proposition: improve dependency management with pipenv

Coming in with another proposition for your repo (feel free to dismiss if not important to you or not required :) ):
Fix dependency management using pipenv. Why? This article explains it very well and I could only re-articulate the info from it:

https://realpython.com/pipenv-guide/#dependency-management-with-requirementstxt

And here is the pipenv official doc:
https://docs.pipenv.org/en/latest/basics/

I'm using the same mechanism with docker-compose and pipenv and am happy about this configuration. Just my 2 cents for today.

Text Classification on GermEval2018 fine-grained does not work

Describe the bug
When running examples/text_classification.py on the GermEval2018 fine-grained dataset (4 classes instead of 2), the model does not learn anything.

Error message
No error, but the macro F1 score on the test set is 0.049 instead of 0.488, and only one class is predicted.
When the dev set is evaluated during training, the predictions jump from all being one class to all being another class.

Expected behavior
I think something happened to the class weighting. I also checked the coarse-grained task: there, no class weighting is applied and the performance is normal.

Steps To Reproduce
Colab notebook with fine-grained settings

`dev_size` param in run-by-config is being ignored

I tried setting the dev_size parameter in the following ways (in different runs):

"dev_size": {"value": [0.2,0.3], "default": 0.1, "desc": "Split a dev set from the training set using dev_size as proportion."}
"dev_size": {"value": 0.2, "default": 0.1, "desc": "Split a dev set from the training set using dev_size as proportion."}
"dev_size": {"value": null, "default": 0.2, "desc": "Split a dev set from the training set using dev_size as proportion."}

It always defaulted to dev_size=0.1. This value is set in the constructor of my processor, just as in the GermEval examples.

Is this the intended behaviour?

Wrong number of total steps for linear warmup schedule

The total number of optimization steps is currently calculated incorrectly because of the rounding here:

optimization_steps = int(n_examples / batch_size / grad_acc_steps) * n_epochs

This results in one step fewer than required per epoch. When running for multiple epochs, the LR ends up at 0 for the last few steps and the related warning message appears:
Training beyond specified 't_total'. Learning rate multiplier set to 0. Please set 't_total' of {} correctly.

In addition, we start with an LR of zero for the first step, which is a waste of computation.
(Screenshot wrong_lr: plot of the learning-rate schedule.)
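A minimal sketch of the arithmetic (illustrative numbers, assuming the DataLoader does not drop the last partial batch; this is not FARM code):

import math

# Illustrative numbers, not taken from a real run.
n_examples, batch_size, grad_acc_steps, n_epochs = 1001, 10, 1, 5

# Current calculation: int() truncates the steps per epoch.
current_total = int(n_examples / batch_size / grad_acc_steps) * n_epochs      # 500

# Possible fix: round the steps per epoch up before multiplying by n_epochs.
steps_per_epoch = math.ceil(n_examples / (batch_size * grad_acc_steps))       # 101
proposed_total = steps_per_epoch * n_epochs                                   # 505
print(current_total, proposed_total)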

Segmentation fault when running inference for QA models

Describe the bug
When running examples/question_answering.py, training finishes, but line 107, result = model.inference_from_dicts(dicts=QA_input), throws a segmentation fault.

Error message
10/19/2019 20:57:29 - INFO - pytorch_transformers.modeling_utils - loading weights file ../saved_models/bert-english-qa-tutorial/language_model.bin
10/19/2019 20:57:30 - INFO - farm.modeling.adaptive_model - Found files for loading 1 prediction heads
10/19/2019 20:57:30 - INFO - farm.modeling.prediction_head - Loading prediction head from ../saved_models/bert-english-qa-tutorial/prediction_head_0.bin
10/19/2019 20:57:30 - WARNING - farm.modeling.adaptive_model - ML logging didn't work: INVALID_PARAMETER_VALUE: Changing param value is not allowed. Param with key='lm_name' was already logged with value='bert-base-cased' for run ID='cd27bbe37acd434fb1792857e8d8da38. Attempted logging new value '../saved_models/bert-english-qa-tutorial'.
10/19/2019 20:57:30 - INFO - pytorch_transformers.tokenization_utils - Model name '../saved_models/bert-english-qa-tutorial' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc). Assuming '../saved_models/bert-english-qa-tutorial' is a path or url to a directory containing tokenizer files.
10/19/2019 20:57:30 - INFO - pytorch_transformers.tokenization_utils - Didn't find file ../saved_models/bert-english-qa-tutorial/added_tokens.json. We won't load it.
10/19/2019 20:57:30 - INFO - pytorch_transformers.tokenization_utils - Didn't find file ../saved_models/bert-english-qa-tutorial/special_tokens_map.json. We won't load it.
10/19/2019 20:57:30 - INFO - pytorch_transformers.tokenization_utils - loading file None
10/19/2019 20:57:30 - INFO - pytorch_transformers.tokenization_utils - loading file None
10/19/2019 20:57:30 - INFO - pytorch_transformers.tokenization_utils - loading file ../saved_models/bert-english-qa-tutorial/vocab.txt
10/19/2019 20:57:31 - INFO - farm.utils - device: cpu n_gpu: 0, distributed training: False, 16-bits training: False
Segmentation fault

Expected behavior
Inferencer extracts the answer from the paragraph.

Additional context
Inference on text classification and NER works fine.


System:

  • OS: Ubuntu 17.10
  • GPU/CPU: Tesla V100-SXM2-16GB
  • FARM version: 0.2.0

Error when running by config with a list of batch sizes

The config was like this:

"batch_size":                   {"value": [16, 32], "default": 48,  "desc": "Total batch size for training for single GPU v100. If using multiGPU, the total batch size will be automatically adjusted."},

Which led to the following, somewhat odd, error:

Traceback (most recent call last):
  File "run_nohate_experiments.py", line 16, in <module>
    main()
  File "run_nohate_experiments.py", line 12, in main
    run_experiment(experiment)
  File "/workspace/FARM/farm/experiment.py", line 79, in run_experiment
    processor=processor, batch_size=args.batch_size, distributed=distributed
  File "/workspace/FARM/farm/data_handler/data_silo.py", line 39, in __init__
    self._load_data()
  File "/workspace/FARM/farm/data_handler/data_silo.py", line 66, in _load_data
    self._initialize_data_loaders()
  File "/workspace/FARM/farm/data_handler/data_silo.py", line 79, in _initialize_data_loaders
    tensor_names=self.tensor_names,
  File "/workspace/FARM/farm/data_handler/dataloader.py", line 49, in __init__
    collate_fn=collate_fn,
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 179, in __init__
    batch_sampler = BatchSampler(sampler, batch_size, drop_last)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 162, in __init__
    "but got batch_size={}".format(batch_size))
ValueError: batch_size should be a positive integer value, but got batch_size=16

Add XLNet

Let's add XLNet as a second language model to FARM. It should be a good first alternative to BERT in many cases.

Deadlock in DataSilo._get_dataset when using docker

UPDATE/SOLUTION: PSA FOR POSTERITY
If you have a deadlock running FARM in a docker container, make sure you are running the container with --ipc=host to increase shared memory.
SOLUTION END

Describe the bug
There seems to be a multiprocessing-related deadlock in DataSilo._get_dataset, which transforms tsv lines into dicts into datasets, chunkwise. Reading a moderately-sized training set (~18k docs) stalls with zero CPU activity after around 2/3 of the data.

Error message
This is a rather unhelpful trace just like so many multiprocessing deadlocks:

Process ForkPoolWorker-23:
Process ForkPoolWorker-20:
Process ForkPoolWorker-22:
Process ForkPoolWorker-21:
Process ForkPoolWorker-19:
Process ForkPoolWorker-17:
Process ForkPoolWorker-18:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
Traceback (most recent call last):
Traceback (most recent call last):
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 335, in get
    res = self._reader.recv_bytes()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
KeyboardInterrupt
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
 60%|██████████████████████████████████████████████████████▋                                    | 10752/17908 [04:04<02:42, 44.04 Dicts/s]
Process ForkPoolWorker-24:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt

Expected behavior
Dataset loading should successfully finish after processing the last chunk.

Additional context
I have tested that manually loading the data works:

# This works
train_dicts = processor.file_to_dicts("train.tsv")
train_dataset, tensor_names = processor.dataset_from_dicts(dicts=train_dicts)

Loading a subset of the first 10k docs works, too.

One hunch is that grouper(dicts, multiprocessing_chunk_size) under some condition produces a pathological chunk size.

To Reproduce
I'll try and see if I can come up with a synthetic reproducer that doesn't include my data (which I can't give out).

System:

  • OS: Ubuntu 18.04 with nvidia-docker2 and a CUDA 10.0 image
  • GPU/CPU: GTX 1080 / Xeon 4-core, 120GB RAM
  • FARM version: master
    The system is otherwise idle: no file system contention, no excessive context switches, plenty of free RAM.

Multi-GPU only enabled in experiment mode

Currently we only allow multi-GPU usage in experiment mode (running from configs). It would make sense to make this more generic and also allow it in the "building blocks" mode.

We would need to move the parallelization from here:

FARM/farm/experiment.py

Lines 154 to 160 in 64b2565

if fp16:
    model.half()
if local_rank > -1:
    model = WrappedDDP(model)
elif n_gpu > 1:
    model = WrappedDataParallel(model)

to either Trainer.train() or AdaptiveModel.__init__().
The latter seems the more logical place to me, but it would require passing quite a few additional args to the constructor, while we already have those args available in Trainer.train().
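A hedged sketch of what such a generic helper could look like (shown with plain PyTorch wrappers; FARM's WrappedDDP/WrappedDataParallel would slot in the same way; this is not the actual implementation):

import torch

def wrap_for_multi_gpu(model, n_gpu, local_rank=-1, fp16=False):
    """Parallelize a model outside of experiment mode (sketch, not FARM code)."""
    if fp16:
        model.half()
    if local_rank > -1:
        model = torch.nn.parallel.DistributedDataParallel(model)
    elif n_gpu > 1:
        model = torch.nn.DataParallel(model)
    return model

# Usage inside Trainer.train() or AdaptiveModel.__init__():
# model = wrap_for_multi_gpu(model, n_gpu=n_gpu, local_rank=local_rank, fp16=fp16)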

Initialize Inferencer from building blocks

Currently the inferencer is always constructed by loading a saved model from disk:

FARM/farm/infer.py

Lines 30 to 58 in a537b25

def __init__(self, load_dir, batch_size=4, gpu=False):
    """
    Initializes inferencer from directory with saved model.
    :param load_dir: Directory containing a saved AdaptiveModel
    :type load_dir str
    :param batch_size: Number of samples computed once per batch
    :type batch_size: int
    :param gpu: If GPU shall be used
    :type gpu: bool
    """
    # Init device and distributed settings
    device, n_gpu = initialize_device_settings(
        use_cuda=gpu, local_rank=-1, fp16=False
    )
    self.processor = Processor.load_from_dir(load_dir)
    self.model = AdaptiveModel.load(load_dir, device)
    self.model.eval()
    self.batch_size = batch_size
    self.device = device
    self.language = self.model.language_model.language
    # TODO adjust for multiple prediction heads
    if len(self.model.prediction_heads) == 1:
        self.prediction_type = self.model.prediction_heads[0].model_type
        self.label_map = self.processor.label_maps[0]
    elif len(self.model.prediction_heads) == 0:
        self.prediction_type = "embedder"
    self.name = os.path.basename(load_dir)
    set_all_seeds(42, n_gpu)

Let's change it to have a classmethod "load" for doing this and a constructor expecting an adaptive model + processor.
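A rough sketch of the proposed split (import paths and helpers assumed from FARM 0.2.x as quoted above; not the actual implementation):

import os

# Import paths are assumptions based on FARM 0.2.x.
from farm.data_handler.processor import Processor
from farm.modeling.adaptive_model import AdaptiveModel
from farm.utils import initialize_device_settings, set_all_seeds

class Inferencer:
    def __init__(self, model, processor, batch_size=4, gpu=False, name=None):
        # Build the Inferencer directly from in-memory building blocks.
        device, n_gpu = initialize_device_settings(use_cuda=gpu, local_rank=-1, fp16=False)
        self.model = model
        self.model.eval()
        self.processor = processor
        self.batch_size = batch_size
        self.device = device
        self.name = name
        set_all_seeds(42, n_gpu)

    @classmethod
    def load(cls, load_dir, batch_size=4, gpu=False):
        # Keep today's behaviour available via a classmethod.
        device, _ = initialize_device_settings(use_cuda=gpu, local_rank=-1, fp16=False)
        processor = Processor.load_from_dir(load_dir)
        model = AdaptiveModel.load(load_dir, device)
        return cls(model, processor, batch_size=batch_size, gpu=gpu,
                   name=os.path.basename(load_dir))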

NER+CRF missing as PredictionHead

Adding Conditional Random Fields as PredictionHead for Named Entity Recognition seems promising. Usually CRFs outperform every other type of token classifier.

run_all_experiments.py QA throws keyError

When I run run_all_experiments.py, the last question answering experiment stops with a KeyError:

10/17/2019 20:16:51 - INFO - pytorch_transformers.modeling_utils -   loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin from cache at /root/.cache/torch/pytorch_transformers/35d8b9d36faaf46728a0192d82bf7d00137490cd6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2
10/17/2019 20:16:54 - WARNING - farm.modeling.language_model -   Could not automatically detect from language model name what language it is.
We guess it's an *ENGLISH* model ...
If not: Init the language model by supplying the 'language' param.
Example: Bert.load('my_mysterious_model_name', language='de')
10/17/2019 20:16:58 - INFO - farm.modeling.optimization -   Number of optimization steps: 12220
Traceback (most recent call last):
  File "run_all_experiments.py", line 36, in <module>
    main()
  File "run_all_experiments.py", line 33, in main
    run_experiment(experiment)
  File "/home/user/farm/experiment.py", line 147, in run_experiment
    model = trainer.train(model)
  File "/home/user/farm/train.py", line 129, in train
    model.connect_heads_with_processor(self.data_silo.processor.tasks)
  File "/home/user/farm/modeling/adaptive_model.py", line 233, in connect
_heads_with_processor
    head.label_tensor_name = tasks[head.task_name]["label_tensor_name"]
KeyError: 'question_answering'

System:

  • OS: Ubuntu 17.10
  • GPU/CPU: Tesla V100-SXM2-16GB
  • FARM version: 0.2.0

Suggestion: Add Option to use Processor.dataset_from_dicts() in datasilo.load_data()

When initializing a new DataSilo, it automatically calls self._load_data(), which in turn calls self.processor.dataset_from_file().

An option to pass data via dictionaries would be much appreciated: Processor.dataset_from_dicts() is already implemented, and manually working around the file-based loading causes a lot of inconvenience during preprocessing (e.g. errors because train.tsv was not found).

Test for document regression broken: TypeError in featurize_samples()

The test for the recently added RegressionProcessor is broken.
@busyxin Can you please have a look at what caused this?

../farm/data_handler/data_silo.py:39: in __init__
    self._load_data()
../farm/data_handler/data_silo.py:47: in _load_data
    self.data["train"], self.tensor_names = self.processor.dataset_from_file(train_file)
../farm/data_handler/processor.py:369: in dataset_from_file
    self._featurize_samples()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <farm.data_handler.processor.RegressionProcessor object at 0x7f2a0406e320>

    def _featurize_samples(self):
        chunks_to_process = int(len(self.baskets) / self.multiprocessing_chunk_size)
        num_cpus = min(mp.cpu_count(), self.max_processes, chunks_to_process) or 1
        logger.info(
            f"Got ya {num_cpus} parallel workers to featurize samples in baskets (chunksize = {self.multiprocessing_chunk_size}) ..."
        )
    
        try:
            if "train" in self.baskets[0].id:
                train_labels = []
                for basket in self.baskets:
                    for sample in basket.samples:
                        train_labels.append(sample.clear_text["label"])
                scaler = StandardScaler()
                scaler.fit(np.reshape(train_labels, (-1, 1)))
                self.label_list = [scaler.mean_.item(), scaler.scale_.item()]
                # Create label_maps because featurize is called after Processor instantiation
                self.label_maps = [{0:scaler.mean_.item(), 1:scaler.scale_.item()}]
    
        except Exception as e:
            logger.warning(f"Baskets not found: {e}")
    
        with mp.Pool(processes=num_cpus) as p:
            all_features_gen = p.imap(
                self._multiproc_featurize,
                self.baskets,
                chunksize=self.multiprocessing_chunk_size,
            )
    
            for basket_features, basket in tqdm(
                zip(all_features_gen, self.baskets), total=len(self.baskets)
            ):
                for f, s in zip(basket_features, basket.samples):
                    # Samples don't have labels during Inference mode
                    if "label" in s.clear_text:
                        label = s.clear_text["label"]
>                       scaled_label = (label - self.label_list[0]) / self.label_list[1]
E                       TypeError: unsupported operand type(s) for -: 'str' and 'float'

../farm/data_handler/processor.py:908: TypeError
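A guess at the cause with a minimal standalone sketch (an assumption, not the actual patch): the regression labels are read from the tsv as strings, so casting them to float before fitting and applying the StandardScaler avoids this TypeError.

import numpy as np
from sklearn.preprocessing import StandardScaler

labels_from_tsv = ["1.5", "2.0", "0.5"]          # labels arrive as strings

scaler = StandardScaler()
scaler.fit(np.reshape([float(l) for l in labels_from_tsv], (-1, 1)))  # cast first

label = labels_from_tsv[0]
# ("1.5" - scaler.mean_.item()) would raise the TypeError shown above;
# casting the label avoids it:
scaled_label = (float(label) - scaler.mean_.item()) / scaler.scale_.item()
print(scaled_label)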

Add changelog

I advocate for adding a CHANGELOG.md or CHANGES.md document as a toplevel resource in the repo to be able to follow changes / fixes / new commits. Having to look at the commits is a lot of work for users of your framework, especially when you have rapid iteration and are getting a lot of stuff done.

If you don't want to write down changes by hand, you could generate them from squashed commits. Using something like the angular commit message format (https://docs.google.com/document/d/1QrDFcIiPjSLDn3EL15IJygNPiHORgU1_OOAqWjiDU5Y/edit) helps immensely.

LM finetuning example missing data

Trying to run examples/lm_finetuning.py results in FileNotFoundError: [Errno 2] No such file or directory: '../data/finetune_sample/train.txt'. The other examples seem to be able to automatically download the dataset if it's not present.

Apex.amp support

Describe the solution you'd like
It would be great if apex.amp were supported.

Additional context
If you are interested, I can work on this. I don't think it should be too much of a problem. I'd replace the fp16 handling with amp instead of simply halving everything.
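A minimal sketch of what the amp-based path could look like (assumes NVIDIA apex is installed and a CUDA device is available; this is not FARM's implementation):

import torch
from apex import amp  # NVIDIA apex, assumed installed

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# "O1" patches ops to run in fp16 where safe and keeps fp32 master weights,
# instead of calling model.half() on everything.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(4, 10).cuda()
loss = model(x).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()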

Save data checkpoints for input data processing

Is your feature request related to a problem or particular use case?
Creating training data takes a long time and also consumes a lot of memory, especially for Language Model adaptation.

Describe the solution you'd like
Since FARM processes data chunkwise, the transformed chunks can be saved to disk; a sketch of the idea follows below.
Ideally these chunks are also removed from memory to save RAM.
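A hedged sketch of the idea (helper names are made up, this is not FARM code): persist each featurized chunk with torch.save and reload it on the next run instead of re-running preprocessing.

import os
import torch

def save_chunk(dataset_chunk, cache_dir, chunk_id):
    """Write one transformed chunk to disk so it can be dropped from RAM."""
    os.makedirs(cache_dir, exist_ok=True)
    torch.save(dataset_chunk, os.path.join(cache_dir, f"chunk_{chunk_id}.pt"))

def load_chunk(cache_dir, chunk_id):
    """Reload a previously cached chunk when it is needed for training."""
    return torch.load(os.path.join(cache_dir, f"chunk_{chunk_id}.pt"))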

Please add `proxies=proxies` capability to any functions call to `requests`

Is your feature request related to a problem or particular use case?
As a corporate user, I need to pass proxies for every HTTPS request. Currently, the framework's initialization commands do not pass a proxies=proxies parameter through to the requests library, which makes initialization very painful (manual downloads) and poorly reproducible.

Describe the solution you'd like
Please pass a proxies=proxies parameter through to every call to the requests library, so that an initialisation like

proxies = {'https': config['https_proxy']}
tokenizer = BertTokenizer(
    vocab_file="bert-base-cased",
    proxies = proxies,
    do_lower_case=False)

works through corporate proxies.

Additional context
I managed to work around it by patching the web-related functions from huggingface/transformers/.../file_utils.py.
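For illustration, a minimal sketch of the requested pass-through (download_file is a hypothetical helper name; requests itself accepts a proxies dict on every call):

import requests

def download_file(url, target_path, proxies=None):
    """Hypothetical download helper that forwards a proxies dict to requests."""
    # proxies example: {"https": "http://user:password@corporate-proxy:8080"}
    response = requests.get(url, proxies=proxies, stream=True)
    response.raise_for_status()
    with open(target_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)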

Wrong vocab ids for custom vocab

The user can enrich the vocabulary of an already pretrained language model by supplying a custom vocab. However, if this custom vocab contains tokens that are already present in the original vocab, the indices get messed up in an edge case because the duplicate check is done on the "unstripped" token rather than the "stripped" one. This caused some trouble further down my pipeline. I will add a fix soon.

if token not in self.vocab.keys():

KeyError: 'text_classification'

Hi guys,

Unfortunately, it is me again. This time, I am reporting an issue during text classification which did not appear yesterday despite running the same code. Everything works fine until training the model.

tokenizer = BertTokenizer.from_pretrained(
    pretrained_model_name_or_path="bert-base-german-cased",
    do_lower_case=False)

processor = TextClassificationProcessor(
    tokenizer=tokenizer,
    max_seq_len=128,
    data_dir="",
    train_filename="train.tsv",
    dev_filename='dev.tsv',
    labels=["Claim","Premise"],
    metric="f1_macro",
    dev_split=0.0
    )

BATCH_SIZE = 32
data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE)

MODEL_NAME_OR_PATH = "bert-base-german-cased"
language_model = Bert.load(MODEL_NAME_OR_PATH)

LAYER_DIMS = [768, 2]
prediction_head = TextClassificationHead(layer_dims=LAYER_DIMS)

EMBEDS_DROPOUT_PROB = 0.1
model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
    lm_output_types=["per_sequence"],
    device=device)

LEARNING_RATE = 2e-5
WARMUP_PROPORTION = 0.1
N_EPOCHS = 1
optimizer, warmup_linear = initialize_optimizer(
    model=model,
    learning_rate=LEARNING_RATE,
    warmup_proportion=WARMUP_PROPORTION,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=N_EPOCHS)

N_GPU = 1
trainer = Trainer(
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=N_EPOCHS,
    n_gpu=N_GPU,
    warmup_linear=warmup_linear,
    device=device,
)

Error message

/usr/local/lib/python3.6/dist-packages/farm/modeling/adaptive_model.py in connect_heads_with_processor(self, tasks)
    231         """
    232         for head in self.prediction_heads:
--> 233             head.label_tensor_name = tasks[head.task_name]["label_tensor_name"]
    234             head.label_list = tasks[head.task_name]["label_list"]
    235             head.metric = tasks[head.task_name]["metric"]
KeyError: 'text_classification'

Because of this error message, I added a task to the processor, which resulted in an error when I wanted to create the DataSilo.

processor.add_task(name='text_classification', metric='f1_macro',label_list=['Claim', 'Premise'])
/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1250             if not(self.name == 'loc' and not raise_missing):
   1251                 not_found = list(set(key) - set(ax))
-> 1252                 raise KeyError("{} not in index".format(not_found))
   1253 
   1254             # we skip the warning on Categorical/Interval
KeyError: '[None] not in index'

Thus, I assume the bug lies in the 'text_classification' task name.

Thanks in advance,
Sebastian

System:

  • OS: Mac OS 10.14.6 running on Google Colab
  • GPU/CPU: GPU
  • FARM version: 0.2.0

Adding additional custom features?

Is it possible to add additional custom features in addition to using pre-trained language models for the down-stream tasks?

For example:

# TWO INPUTS
这是一只狗和这是一只红猫<div>This is a dog and that is a panda
0 0 0 0 B-TERM 0 0 0 0 0 0 0<div>0 0 0 0 0 0 0 0 0

# OUTPUT
0 0 0 0 B-TERM 0 0 0 0 0 0 0<div>0 0 0 B-TERM 0 0 0 0 0

'NERProcessor' has no attribute 'tokenizer'

Python: 3.6.8
OS: Windows 10
FARM: 0.2.0

Hi guys,

I wanted to do some NER fine-tuning with FARM and tried both approaches:

experiments = load_experiments("conll_de_config.json")

and the stick-your-blocks-together one. Both of them ended with the following error:

08/28/2019 07:47:35 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
08/28/2019 07:47:35 - INFO - pytorch_transformers.tokenization_utils -   loading file https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt from cache at C:\Users\JulianGerhard\.cache\torch\pytorch_transformers\da299cdd121a3d71e1626f2908dda0d02658f42e925a3d6abd8273ec08cf41a6.2a48e6c60dcdb582effb718237ce5894652e3b4abb94f0a4d9a857b70333308d
08/28/2019 07:47:35 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
08/28/2019 07:47:35 - INFO - farm.data_handler.data_silo -   Loading train set from: data/conll03-de\train.txt 
08/28/2019 07:47:35 - INFO - farm.data_handler.utils -    Couldn't find data/conll03-de\train.txt locally. Trying to download ...
08/28/2019 07:47:35 - INFO - farm.data_handler.utils -   downloading and extracting file C:\Users\JulianGerhard\PycharmProjects\word_embeddings\ner_finetuning\data\conll03-de to dir 
08/28/2019 07:47:39 - INFO - farm.data_handler.processor -   Got ya 8 parallel workers to fill the baskets with samples (chunksize = 1000)...
  0%|          | 0/24000 [00:00<?, ?it/s]multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\JulianGerhard\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\JulianGerhard\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "c:\users\juliangerhard\pycharmprojects\farm\farm\data_handler\processor.py", line 310, in _multiproc_sample
    samples = cls._dict_to_samples(dict=basket.raw, all_dicts=all_dicts)
  File "c:\users\juliangerhard\pycharmprojects\farm\farm\data_handler\processor.py", line 555, in _dict_to_samples
    tokenized = tokenize_with_metadata(dict["text"], cls.tokenizer, cls.max_seq_len)
AttributeError: type object 'NERProcessor' has no attribute 'tokenizer'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:/Users/JulianGerhard/PycharmProjects/word_embeddings/ner_finetuning/farm_experimen t.py", line 6, in <module>
    run_experiment(experiments[0])
  File "c:\users\juliangerhard\pycharmprojects\farm\farm\experiment.py", line 87, in run_experiment
    distributed=distributed,
  File "c:\users\juliangerhard\pycharmprojects\farm\farm\data_handler\data_silo.py", line 39, in __init__
    self._load_data()
  File "c:\users\juliangerhard\pycharmprojects\farm\farm\data_handler\data_silo.py", line 47, in _load_data
    self.data["train"], self.tensor_names = self.processor.dataset_from_file(train_file)
  File "c:\users\juliangerhard\pycharmprojects\farm\farm\data_handler\processor.py", line 366, in dataset_from_file
    self._init_samples_in_baskets()
  File "c:\users\juliangerhard\pycharmprojects\farm\farm\data_handler\processor.py", line 304, in _init_samples_in_baskets
    zip(samples, self.baskets), total=len(self.baskets)
  File "C:\Users\JulianGerhard\AppData\Local\Programs\Python\Python36\lib\site-packages\tqdm\_tqdm.py", line 1034, in __iter__
    for obj in iterable:
  File "C:\Users\JulianGerhard\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 320, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "C:\Users\JulianGerhard\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 735, in next
    raise value
AttributeError: type object 'NERProcessor' has no attribute 'tokenizer'
  0%|          | 0/24000 [00:06<?, ?it/s]

I installed it twice: from the repo and with pip.

After investigating this briefly (I need to leave), I think it has something to do with your NERProcessor's classmethod implementation. I can confirm that the tokenizer is set properly:

<farm.modeling.tokenization.BertTokenizer object at 0x00000238DE551080>

and the cls instance that _dict_to_samples is receiving is of type:

<class 'farm.data_handler.processor.NERProcessor'>

Any ideas?

Regards

Multiprocessing causes data preprocessing to crash

Data preprocessing crashes when performing language model fine-tuning. This has to do with the size of the dataset, since training ran smoothly with a dataset of 10k samples. The error below was thrown when processing a dataset of 5 million samples.

(Screenshot of the error attached: 2019-10-08 at 10:11:29.)

Inference now does not utilise GPU, is much slower and more verbose

After the recent commits, inference_from_dicts replaced run_inference and uses multiprocessing. However, at least for my task, it made matters worse.

I have a test dataset of 200 text documents. They are preprocessed and split into fragments, from which lists of dicts {"text": "some text"} are formed for subsequent text classification with FARM. There are 5576 fragments in total. Previously, with the gpu flag enabled, these fragments were classified in 53 seconds, utilising up to 1900 MB of GPU memory. The Processor transformed samples into inputs for the BERT model silently. I have installed FARM from the older branch to make sure that it still works this way.

With the new changes to inference, each list of dicts is treated the same way as a dataset during training, i.e. FARM prints one sample from each, with tokens, offsets, token_ids etc., and with the full ASCII art. Even with logging turned off, a tqdm progress bar is still printed for each list of dicts. Inference for the whole dataset now takes 370 seconds. All this time, no more than 1390 MB of GPU memory is utilised (this is the typical size of the bare BERT model loaded onto the GPU).

The results of inference did not change.

System:

  • OS: Ubuntu 18.04
  • GPU/CPU: NVIDIA GTX 1080 Ti, CUDA 10.1
  • FARM version: newest version on the morning of 16 Oct 2019.

Evaluation running slowly

I've noticed that eval runs more slowly than training, which doesn't seem intuitive to me. Shouldn't eval run faster as there's no back propagation?

For example, while training a standard LM:

Training: 25941/25941 [3:31:49<00:00, 2.95it/s] = 2.4 it/s
Eval: 3811/3811 [41:24<00:00, 1.53it/s] = 1.53 it/s

Unwanted type conversion in Class TextClassificationProcessor

I'm not sure if this is a bug or not, but I'm getting an error when using the doc_classification example with a custom dataset labeled True and False. I managed to solve it by simply adding "label" in front of "true" and "false", but I wanted to report it anyway.

Accuracy metric in LM finetuning always zero

The accuracy metric reported by the Evaluator for the masked LM task during LM finetuning is wrong. This is because we have a nested list of predictions with different lengths there (a different number of masked tokens per sample), which leads to wrongly converted numpy arrays and a wrong metric. I am working on a fix.
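A tiny illustration (not FARM code) of how such ragged nested lists can break the metric computation:

import numpy as np

# A different number of masked tokens per sample gives ragged lists.
preds  = [[5, 7], [3]]
labels = [[5, 7], [4]]

p = np.array(preds, dtype=object)
l = np.array(labels, dtype=object)

# The comparison is now done per whole list, not per token,
# so the "accuracy" no longer counts correctly predicted tokens.
print(p == l)             # [ True False]
print((p == l).mean())    # 0.5 instead of 2 out of 3 correct tokens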

excessive uncalled-for warnings when using the inferencer

While evaluating a trained model on some held-out data, we get literally hundreds or thousands of warnings of this type:

  "09/26/2019 17:33:12 - WARNING - farm.data_handler.input_features -   [Task: text_classification] Could not convert labels to ids via label_list!\n",
  "If your are running in *inference* mode: Don't worry!\n",
  "If you are running in *training* mode: Verify you are supplying a proper label list to your processor.\n",

I am running in inference mode, so I am not worried, but I am troubled by the vast number of warnings that make it hard to find the important information in all of that output.

Expected behavior: As FARM knows that it runs in inference mode, it should not print this warning; otherwise it should print it only once.

CoNLL-2003: index out of range

Hi :)

I wanted to train a NER model on the CoNLL-2003 dataset (I only changed the path to the dataset in the configuration file):

from farm.experiment import run_experiment, load_experiments 
experiments = load_experiments("experiments/ner/conll2003_de_config.json")
run_experiment(experiments[0])

Then the following error message appears:

09/03/2019 19:03:58 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
09/03/2019 19:03:58 - INFO - pytorch_transformers.tokenization_utils -   loading file https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt from cache at /root/.cache/torch/pytorch_transformers/da299cdd121a3d71e1626f2908dda0d02658f42e925a3d6abd8273ec08cf41a6.2a48e6c60dcdb582effb718237ce5894652e3b4abb94f0a4d9a857b70333308d
09/03/2019 19:03:58 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
09/03/2019 19:03:58 - INFO - farm.data_handler.data_silo -   Loading train set from: /root/.flair/datasets/conll_03_german/train.txt 
09/03/2019 19:03:59 - INFO - farm.data_handler.processor -   Got ya 12 parallel workers to fill the baskets with samples (chunksize = 1000)...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12152/12152 [00:04<00:00, 2728.79it/s]
09/03/2019 19:04:04 - INFO - farm.data_handler.processor -   Got ya 12 parallel workers to featurize samples in baskets (chunksize = 1000) ...
  0%|                                                                                                                          | 0/12152 [00:00<?, ?it/s]---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/mnt/FARM/farm/data_handler/processor.py", line 367, in _multiproc_featurize
    all_features.append(cls._sample_to_features(sample=sample))
  File "/mnt/FARM/farm/data_handler/processor.py", line 665, in _sample_to_features
    tokenizer=cls.tokenizer,
  File "/mnt/FARM/farm/data_handler/input_features.py", line 185, in samples_to_features_ner
    labels_token = expand_labels(labels_word, initial_mask, non_initial_token)
  File "/mnt/FARM/farm/data_handler/utils.py", line 214, in expand_labels
    labels_token.append(labels_word[word_index])
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

IndexError                                Traceback (most recent call last)
<ipython-input-3-386307ddc5b9> in <module>
----> 1 run_experiment(experiments[0])

/mnt/FARM/farm/experiment.py in run_experiment(args)
     85         processor=processor,
     86         batch_size=args.parameter.batch_size,
---> 87         distributed=distributed,
     88     )
     89 

/mnt/FARM/farm/data_handler/data_silo.py in __init__(self, processor, batch_size, distributed)
     38         self.batch_size = batch_size
     39         self.class_weights = None
---> 40         self._load_data()
     41 
     42     def _load_data(self):

/mnt/FARM/farm/data_handler/data_silo.py in _load_data(self)
     46         train_file = os.path.join(self.processor.data_dir, self.processor.train_filename)
     47         logger.info("Loading train set from: {} ".format(train_file))
---> 48         self.data["train"], self.tensor_names = self.processor.dataset_from_file(train_file)
     49 
     50         # dev data

/mnt/FARM/farm/data_handler/processor.py in dataset_from_file(self, file, log_time)
    396             c = time.time()
    397             MlLogger.log_metrics(metrics={"t_init_samples": (c - b) / 60}, step=0)
--> 398             self._featurize_samples()
    399             d = time.time()
    400             MlLogger.log_metrics(metrics={"t_featurize_samples": (d - c) / 60}, step=0)

/mnt/FARM/farm/data_handler/processor.py in _featurize_samples(self)
    345 
    346                 for basket_features, basket in tqdm(
--> 347                         zip(all_features_gen, self.baskets), total=len(self.baskets)
    348                 ):
    349                     for f, s in zip(basket_features, basket.samples):

/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py in __iter__(self)
   1058                 """), fp_write=getattr(self.fp, 'write', sys.stderr.write))
   1059 
-> 1060             for obj in iterable:
   1061                 yield obj
   1062                 # Update and possibly print the progressbar.

/usr/lib/python3.6/multiprocessing/pool.py in <genexpr>(.0)
    318                     result._set_length
    319                 ))
--> 320             return (item for chunk in result for item in chunk)
    321 
    322     def imap_unordered(self, func, iterable, chunksize=1):

/usr/lib/python3.6/multiprocessing/pool.py in next(self, timeout)
    733         if success:
    734             return value
--> 735         raise value
    736 
    737     __next__ = next                    # XXX

IndexError: list index out of range

Could you help me with that? I used the latest FARM version from master branch. Thanks ❤️

Threading Error upon building Data Silo

Upon loading the DataSilo, which should call the functions of the previously created Processor object to generate the datasets, I keep running into a threading error without any useful information about the interplay between the threading.py module and FARM (see the complete stack trace below).

Error message:
09/16/2019 15:15:59 - INFO - farm.data_handler.data_silo -
Loading data into the data silo ...
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|
 (o)(o)------'\ _ /     ( )

09/16/2019 15:15:59 - INFO - farm.data_handler.data_silo - Loading train set from: data_train.tsv

09/16/2019 15:15:59 - INFO - farm.data_handler.processor - Got ya 2 parallel workers to fill the baskets with samples (chunksize = 1000)...
Exception ignored in: <function _after_fork at 0x7fe36e789170>
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 1374, in _after_fork
assert len(_active) == 1
AssertionError:
Exception ignored in: <function _after_fork at 0x7fe36e789170>
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 1374, in _after_fork
assert len(_active) == 1
AssertionError:

Expected behavior
I expected a DataLoader object to be returned, which would later be passed on to the model Trainer object.

To Reproduce
Screenshots of the file heads and the code snippet needed to reproduce are attached.

System:

  • OS: Linux Mint 19.1
  • GPU/CPU: cpu
  • FARM version: farm-0.2.0

Attachments: head_data_test, head_data_train, FARM_ErrorToReproduce.pdf

Empty datasets due to random_split_ConcatDataset

Describe the bug
The DataSilo crashes when trying to split a dev set from a "small" train set.
It seems that random_split_ConcatDataset() doesn't work if there's only a single chunk (= 1 dataset): idx_dataset is 0 in that case and thus creates an empty ConcatDataset for train.

Error message

10/11/2019 17:12:47 - INFO - farm.data_handler.data_silo -   Loading dev set as a slice of train set
Traceback (most recent call last):
  File ".../train.py", line 436, in <module>
    augmentation=True)
  File ".../train.py", line 348, in continue_finetuning
    data_silo = DataSilo(processor=processor, batch_size=batch_size, multiprocessing_chunk_size=2000)
  File "/.../farm/data_handler/data_silo.py", line 49, in __init__
    self._load_data()
  File ".../farm/data_handler/data_silo.py", line 104, in _load_data
    self._create_dev_from_train()
  File ".../farm/data_handler/data_silo.py", line 175, in _create_dev_from_train
    train_dataset, dev_dataset = self.random_split_ConcatDataset(self.data["train"], lengths=[n_train, n_dev])
  File ".../farm/data_handler/data_silo.py", line 200, in random_split_ConcatDataset
    train = ConcatDataset(ds.datasets[:idx_dataset])
  File ".../torch/utils/data/dataset.py", line 68, in __init__
    assert len(datasets) > 0, 'datasets should not be an empty iterable'
AssertionError: datasets should not be an empty iterable

Expected behavior
A portion of the train set should be split off into a dev set.

Additional context
Related function:
https://github.com/deepset-ai/FARM/blob/master/farm/data_handler/data_silo.py#L186

    def random_split_ConcatDataset(self, ds, lengths):
...
        if sum(lengths) != len(ds):
            raise ValueError("Sum of input lengths does not equal the length of the input dataset!")

        idx_dataset = np.where(np.array(ds.cumulative_sizes) > lengths[0])[0][0]

        train = ConcatDataset(ds.datasets[:idx_dataset])
        test = ConcatDataset(ds.datasets[idx_dataset:])
        return train, test
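A hedged sketch of one possible fix (not the actual FARM patch): when the split index falls inside the first and only chunk, fall back to a sample-level split with torch.utils.data.random_split instead of splitting on chunk boundaries.

import torch
from torch.utils.data import ConcatDataset, TensorDataset, random_split

# Single-chunk silo: the chunk-level split above would produce an empty train set.
ds = ConcatDataset([TensorDataset(torch.arange(100))])

n_dev = int(0.1 * len(ds))
n_train = len(ds) - n_dev
train, dev = random_split(ds, [n_train, n_dev])
print(len(train), len(dev))   # 90 10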

To Reproduce

  1. Provide no dev_filename but instead set dev_split
  2. Have a train file that has fewer samples than multiprocessing_chunk_size

System:

  • OS: Ubuntu 18.04
  • GPU/CPU: CPU
  • FARM version: 0.2.1

Add cross-validation for small datasets

I'm currently working with a pretty small dataset (only 1.8k observations). Therefore, it seems appropriate to cross-validate my results. Wouldn't it make sense to add such functionality to FARM? How could this best be implemented within the framework?
Cheers, a³
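A generic sketch (not FARM-specific) of how folds could be produced for a small dataset, with one DataSilo/Trainer built per fold and the metrics averaged afterwards (sklearn's KFold assumed as the splitting utility):

from sklearn.model_selection import KFold

docs = [{"text": f"document {i}", "label": "positive" if i % 2 else "negative"}
        for i in range(1800)]

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, dev_idx) in enumerate(kfold.split(docs)):
    train_docs = [docs[i] for i in train_idx]
    dev_docs = [docs[i] for i in dev_idx]
    # Build a processor / DataSilo / Trainer per fold here,
    # collect the dev metrics, then average them over the folds.
    print(f"fold {fold}: {len(train_docs)} train / {len(dev_docs)} dev")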

Evaluating a saved model on another Test dataset

Hey guys,

I was just wondering if there is a way to evaluate a saved model on another test dataset (maybe I missed it in the docs). I managed to do it using the Inferencer, but it is quite slow and doesn't seem like the right way to do it. Thank you in advance.

Predicting without persisting / using different labels for different prediction heads (multitask)

Hi everyone!

I just started to work with FARM and have two questions about it.

  1. How do I use a model without storing it first? In all your examples you first train a model, save it to disk and then load it via an Inferencer. But what do I do if I want to use my text classification model right after training it? Right now the Inferencer does not support multiple prediction heads, so there must be another way to do this, I guess? I just can't find the correct method for it.

  2. How can I add multiple prediction heads for different text classification tasks? As far as I see it, many blocks in the process are referring to the data silo, like the trainer, the optimizer and also the prediction_head. But every data silo only has one processor, and every processor needs one label column. But for different text classification tasks, I want to have different label columns. Is this possible at all?

I like the idea behind your framework a lot! It would be cool if someone could help me to understand my problems better :). Right now I have a workaround for my problems but it consists of ugly code and has a big performance issue.

Best regards,
Garstig

TypeError: classification_report() got an unexpected keyword argument 'target_names'

Hi guys,

I really love your pre-trained models and how easy they are to use with FARM. I was classifying German text according to the tutorial and was able to adapt it to my own dataset. However, for the BIO classification of my texts, I keep getting the same error message during training, despite closely following your tutorial code.

Thanks in advance,
Sebastian

Error message

TypeError: Traceback (most recent call last)
<ipython-input-21-17cc446f9014> in <module>()
----> 1 model = trainer.train(model)
1 frames
/usr/local/lib/python3.6/dist-packages/farm/eval.py in eval(self, model)
    121                 else:
    122                     result["report"] = report_fn(
--> 123                         label_all[head_num], preds_all[head_num], digits=4, target_names=head.label_list)
    124 
    125             all_results.append(result)
TypeError: classification_report() got an unexpected keyword argument 'target_names'

Here you can find my code that worked fine until training.

tokenizer = BertTokenizer.from_pretrained(
    pretrained_model_name_or_path="bert-base-german-cased",
    do_lower_case=False)

ner_processor = NERProcessor(tokenizer=tokenizer, 
                             max_seq_len=512, 
                             data_dir="",
                             train_filename='train.txt',
                             dev_filename=None,
                             dev_split=0.1,
                             labels=["[PAD]","X","B","I","O"]
                             )
ner_labels = ["[PAD]","X", "B", "I", "O"]
ner_processor.add_task("ner", "seq_f1", ner_labels)

LAYER_DIMS = [768, 5]
ner_prediction_head = TokenClassificationHead(layer_dims=LAYER_DIMS)

BATCH_SIZE = 8
EMBEDS_DROPOUT_PROB = 0.1
LEARNING_RATE = 2e-5
WARMUP_PROPORTION = 0.1
N_EPOCHS = 1
N_GPU = 1

data_silo = DataSilo(
    processor=ner_processor,
    batch_size=BATCH_SIZE)

model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[ner_prediction_head],
    embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
    lm_output_types=["per_token"],
    device=device)

optimizer, warmup_linear = initialize_optimizer(
    model=model,
    learning_rate=LEARNING_RATE,
    warmup_proportion=WARMUP_PROPORTION,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=N_EPOCHS)

trainer = Trainer(
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=N_EPOCHS,
    n_gpu=N_GPU,
    warmup_linear=warmup_linear,
    device=device,
)

System:

  • OS: Google Colab
  • GPU/CPU: GPU
  • FARM version: farm 0.2.0

Could not convert labels to ids via label_list

Hi there,

when I try to build a DataSilo with a TextClassificationProcessor, I get the following error/warning:
"Could not convert labels to ids via label_list"

The label_list I give to the processor is fine, however. I investigated a bit, and the error seems to appear when
label_raw = sample.clear_text[label_name]
is executed in input_features.py.
It looks like the clear_text dictionary does not have a key called "text_classification_label", even though that's the name of my label column.

Here's my code:

tokenizer = BertTokenizer.from_pretrained(
    pretrained_model_name_or_path="bert-base-german-cased",
    do_lower_case=False)

bert_df = pd.DataFrame({
    "text": self.df['Originalbeitrag'],
    'text_classification_label': self.df['Tonalität des Artikels/Themas']
}).dropna()
train_set = bert_df.sample(frac=0.8)
test_set = bert_df.drop(train_set.index)
train_set.to_csv('data/train.tsv', sep='\t', index=False,
                 header=['text', 'text_classification_label'], columns=train_set.columns)
test_set.to_csv('data/test.tsv', sep='\t', index=False,
                 header=['text', 'text_classification_label'], columns=test_set.columns)

processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        multilabel=True,
                                        max_seq_len=512,
                                        data_dir="data/",
                                        label_list=['positiv', 'neutral', 'ambivalent', 'negativ'],
                                        metric="acc",
                                        header=0,
                                        label_column_name='text_classification_label')

BATCH_SIZE = 32
data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE)
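
One way to narrow this down (a quick check, assuming the TSV files were written exactly as in the snippet above) is to read the train file back and confirm that the header actually contains the column name passed as label_column_name:

import pandas as pd

# re-read the file the processor will consume and inspect its header
df_check = pd.read_csv("data/train.tsv", sep="\t")
print(df_check.columns.tolist())   # expected: ['text', 'text_classification_label']
print(df_check.head(2))

# if 'text_classification_label' is missing here, sample.clear_text cannot contain that key
# and the "Could not convert labels to ids via label_list" warning appears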

System:

  • Ubuntu: 18.04

Expected input batch_size to match target batch_size

Hello,

when running the Trainer for a multi-class classification task (4 classes, batch_size=8) there seems to be a mismatch:

Expected input batch_size (8) to match target batch_size (32)

The error appears in this context:

farm/modeling/prediction_head.py(262)logits_to_loss()
-> return self.loss_fct(logits, label_ids.view(-1))
(Pdb)  logits
tensor([[ 0.0025,  0.1011,  0.0897,  0.5420],
        [-0.2871,  0.1540,  0.1067,  0.3153],
        [ 0.0552,  0.5559,  0.0795,  0.3348],
        [ 0.2319,  0.1192, -0.0756,  0.5185],
        [ 0.2457, -0.1189, -0.3208,  0.1301],
        [-0.1660,  0.1726, -0.0196,  0.0887],
        [-0.3584,  0.0351,  0.0103,  0.2665],
        [-0.3831,  0.0922,  0.2931,  0.2717]], grad_fn=<A
(Pdb)  label_ids
tensor([[0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1],
        [1, 0, 0, 0],
        [0, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 1, 0, 0]])

.
.
.

--Call--
> /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py(537)__call__()
-> def __call__(self, *input, **kwargs):
(Pdb)  input
(tensor([[ 0.0025,  0.1011,  0.0897,  0.5420],
        [-0.2871,  0.1540,  0.1067,  0.3153],
        [ 0.0552,  0.5559,  0.0795,  0.3348],
        [ 0.2319,  0.1192, -0.0756,  0.5185],
        [ 0.2457, -0.1189, -0.3208,  0.1301],
        [-0.1660,  0.1726, -0.0196,  0.0887],
        [-0.3584,  0.0351,  0.0103,  0.2665],
        [-0.3831,  0.0922,  0.2931,  0.2717]], grad_fn=<AddmmBackward>), tensor([0, 1, 0, 
        0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
        1, 0, 0, 0, 0, 1, 0, 0]))

Then the input and the target mismatch:

--Call--
> /usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py(914)forward()
-> def forward(self, input, target):
(Pdb)  input
tensor([[ 0.0025,  0.1011,  0.0897,  0.5420],
    [-0.2871,  0.1540,  0.1067,  0.3153],
    [ 0.0552,  0.5559,  0.0795,  0.3348],
    [ 0.2319,  0.1192, -0.0756,  0.5185],
    [ 0.2457, -0.1189, -0.3208,  0.1301],
    [-0.1660,  0.1726, -0.0196,  0.0887],
    [-0.3584,  0.0351,  0.0103,  0.2665],
    [-0.3831,  0.0922,  0.2931,  0.2717]], grad_fn=<AddmmBackward>)
(Pdb)  target
tensor([0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
    1, 0, 0, 0, 0, 1, 0, 0])

Finally the error is thrown here:

> /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py(1820)nll_loss()
-> if input.size(0) != target.size(0):
-> raise ValueError('Expected input batch_size ({}) to match target batch_size ({}).'

Is there something I can do to avoid this ValueError?
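
Judging only from the Pdb output above (so this is an inference, not a confirmed diagnosis), the labels arrive one-hot encoded with shape (8, 4); label_ids.view(-1) then flattens them to 32 entries, while CrossEntropyLoss expects 8 class indices. A minimal PyTorch sketch of the two target formats:

import torch
import torch.nn as nn

logits = torch.randn(8, 4)                          # batch_size=8, 4 classes
one_hot = torch.eye(4)[torch.randint(0, 4, (8,))]   # labels as one-hot rows, shape (8, 4)

loss_fct = nn.CrossEntropyLoss()

# mirrors the failing call: flattening one-hot labels yields 32 targets for 8 inputs
# loss_fct(logits, one_hot.view(-1))  # -> ValueError: Expected input batch_size (8) to match target batch_size (32)

# CrossEntropyLoss wants one class index per sample, e.g. via argmax over the one-hot rows
class_ids = one_hot.argmax(dim=1)                   # shape (8,), dtype long
print(loss_fct(logits, class_ids))

If the task is genuinely multi-label, a multi-label head with a BCE-style loss would be the better fit; the sketch above only illustrates the single-label case.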

Multiprocessing Error with PyTorch Version 1.2.0

Describe the bug
When upgrading to the newest PyTorch version, multiprocessing breaks.

Error message
Traceback (most recent call last):
File "/home/timo/Programming/gitlab/FARM/examples/doc_classification.py", line 55, in
batch_size=batch_size)
File "/home/timo/Programming/gitlab/FARM/farm/data_handler/data_silo.py", line 50, in init
self._load_data()
File "/home/timo/Programming/gitlab/FARM/farm/data_handler/data_silo.py", line 97, in _load_data
self.data["train"], self.tensor_names = self._get_dataset(train_file)
File "/home/timo/Programming/gitlab/FARM/farm/data_handler/data_silo.py", line 85, in _get_dataset
for dataset, tensor_names in tqdm(results, total=len(dicts)/self.multiprocessing_chunk_size):
File "/home/timo/.virtualenvs/farm/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1017, in iter
for obj in iterable:
File "/home/timo/anaconda3/lib/python3.6/multiprocessing/pool.py", line 735, in next
raise value
multiprocessing.pool.MaybeEncodingError: Error sending result: '(<torch.utils.data.dataset.TensorDataset object at 0x7fd0da706a90>, ['input_ids', 'padding_mask', 'segment_ids', 'text_classification_label_ids'])'. Reason: 'RuntimeError('error executing torch_shm_manager at "/home/timo/.virtualenvs/farm/lib/python3.6/site-packages/torch/bin/torch_shm_manager" at /pytorch/torch/lib/libshm/core.cpp:99',)'

Additional context
When installing through FARM's requirements.txt, PyTorch version 1.0.2 is installed, but that will change in the future.

System:

  • OS: Ubuntu 18.04
  • GPU/CPU: both
  • FARM version: master @ commit ID 45c7180

This issue will be fixed by #96

output_dir parameter in run by config is being ignored

If you set the parameter like this:

    "output_dir": {"value": "/path/to/output", "default": "", "desc": "Output directory where model predictions and checkpoints will be saved."},

You will get an error if the directory already exists and the run will stop.

If the directory does not exist, the run executes successfully, but the directory is empty afterwards and the log contains entries such as:

08/03/2019 15:11:20 - WARNING - farm.file_utils -   Path saved_models/bert-german-GermEval18Coarse already exists. You might be overwriting files.
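
Until the config runner handles this itself, one workaround (a sketch, simply making the directory creation tolerant of an existing path) is to create the output directory up front before starting the run:

from pathlib import Path

output_dir = Path("/path/to/output")   # same value as in the config snippet above
# build the directory tree if needed and do not fail when it already exists
output_dir.mkdir(parents=True, exist_ok=True)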

IndexError during data processing

While putting together a notebook to demonstrate another issue, I encountered the following error:

/content/FARM/farm/data_handler/utils.py in read_docs_from_txt(filename, delimiter, encoding)
    137 
    138         # if last row in file is not empty
--> 139         if all_docs[-1] != doc:
    140             all_docs.append({"doc": doc})
    141             # sample_to_doc.pop()

IndexError: list index out of range

The error should be reproducible by running the following colab notebook: https://colab.research.google.com/drive/1BWW207df5kHoYrXQOkEwDw5zVv_SnwP-

Perhaps it's just some sort of data cleaning issue, but maybe it's worth looking into if only to make data processing more robust.
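
If the root cause is an empty (or unexpectedly delimited) input file, a defensive guard around the quoted check would avoid indexing an empty list; this is only a sketch of the idea, reusing the variable names from the snippet above:

# defensive variant of the check in read_docs_from_txt: only index all_docs when it is non-empty
if doc:
    if not all_docs or all_docs[-1] != doc:
        all_docs.append({"doc": doc})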
