The nlpboost online documentation says:
The task name for QA is qa, so the correct configuration is DatasetConfig(..., task="qa"). The default format for this task is the SQUAD format (check squad dataset in Huggingface’s Datasets). If your QA dataset is not in that format, you can either preprocess it before using AutoTrainer with it, or use a pre_func in DatasetConfig to achieve the same.
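For the second option, a minimal `pre_func` sketch could look like the following. The input field names (`passage`, `query`, `answer_text`, `answer_start`) are assumptions for illustration only; the output fields (`id`, `title`, `context`, `question`, `answers`) follow the SQuAD schema:

```python
def to_squad_format(example):
    """Map one example from a hypothetical custom QA schema to the SQuAD
    schema expected by the qa task. Input field names are illustrative."""
    return {
        "id": str(example["id"]),
        "title": example.get("title", ""),
        "context": example["passage"],
        "question": example["query"],
        # SQuAD stores answers as parallel lists of texts and character offsets
        "answers": {
            "text": [example["answer_text"]],
            "answer_start": [example["answer_start"]],
        },
    }
```

Passing a function like this as `pre_func` in `DatasetConfig` would let `dataset.map(...)` rewrite each row before training.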
I tried to launch a SQuAD model with this dataset configuration:
squad_config = default_args_dataset.copy()
squad_config.update(
    {
        "dataset_name": "squad",
        "alias": "squad",
        "task": "qa",
        "text_field": "context",
        "label_col": "question",
        "hf_load_kwargs": {"path": "squad"},
    }
)
But when I launch the training script, it fails with KeyError: 'test' (which makes sense, because the squad dataset has train and validation splits, but no test split).
It's possible to change line 94 of nlpboost/hfdatasets_manager.py, replacing the "test" split with "validation", and training then runs, but would the final test still work correctly?
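Rather than hard-coding "validation" in place of "test", a more defensive patch could fall back to whichever split exists. This is a sketch, not nlpboost's actual code, and `pick_eval_split` is a hypothetical helper; since `datasets.DatasetDict` is a `dict` subclass, the same logic works on any mapping of split names to splits:

```python
def pick_eval_split(dataset):
    """Return the split to use for final evaluation.

    Prefer the 'test' split; fall back to 'validation' when the dataset
    (like squad) ships only train/validation. `dataset` is any dict-like
    mapping of split names to splits, e.g. a datasets.DatasetDict.
    """
    for split_name in ("test", "validation"):
        if split_name in dataset:
            return dataset[split_name]
    raise KeyError("dataset has neither a 'test' nor a 'validation' split")
```

With squad, the model would then be evaluated on the validation split, which is common practice for SQuAD given that its official test set is not publicly distributed.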
93 if self.dataset_config.task == "qa":
94 test_dataset = dataset["test"]
The full error output is:
/data/afernandez/nlpboost/src/nlpboost/hfdatasets_manager.py:62 in get_dataset_and_tag2id │
│ │
│ 59 │ │ │ Dictionary with tags (labels) and their indexes. │
│ 60 │ │ """ │
│ 61 │ │ if self.dataset_config.pretokenized_dataset is None: │
│ ❱ 62 │ │ │ dataset, tag2id = self._generic_load_dataset(tokenizer) │
│ 63 │ │ else: │
│ 64 │ │ │ dataset = self.dataset_config.pretokenized_dataset │
│ 65 │ │ │ tag2id = {} │
│ │
│ /data/afernandez/nlpboost/src/nlpboost/hfdatasets_manager.py:94 in _generic_load_dataset │
│ │
│ 91 │ │ if self.dataset_config.pre_func is not None: │
│ 92 │ │ │ dataset = dataset.map(self.dataset_config.pre_func, remove_columns=dataset[" │
│ 93 │ │ if self.dataset_config.task == "qa": │
│ ❱ 94 │ │ │ test_dataset = dataset["test"] │
│ 95 │ │ tags = get_tags(dataset, self.dataset_config) │
│ 96 │ │ tag2id = {t: i for i, t in enumerate(sorted(tags))} │
│ 97 │ │ dataset = self._general_label_mapper(tag2id, dataset) │
│ │
│ /data/afernandez/odesia/lib/python3.10/site-packages/datasets/dataset_dict.py:58 in __getitem__ │
│ │
│ 55 │ │
│ 56 │ def __getitem__(self, k) -> Dataset: │
│ 57 │ │ if isinstance(k, (str, NamedSplit)) or len(self) == 0: │
│ ❱ 58 │ │ │ return super().__getitem__(k) │
│ 59 │ │ else: │
│ 60 │ │ │ available_suggested_splits = [ │
│ 61 │ │ │ │ split for split in (Split.TRAIN, Split.TEST, Split.VALIDATION) if split │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'test'