ibm / unitxt Goto Github PK

View Code? Open in Web Editor NEW

150.0 15.0 37.0 43.71 MB

🦄 Unitxt: a python library for getting data fired up and set for training and evaluation

Home Page: https://www.unitxt.ai

License: Apache License 2.0

Python 99.58% Makefile 0.12% Shell 0.16% Dockerfile 0.14%

ai data llm mlops nlp nlp-library python

unitxt's Introduction

In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution.

Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt-Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively.

Turn.on.speakers.mov

🦄 Currently on Unitxt Catalog

🦄 Run Unitxt Exploration Dashboard

To launch unitxt graphical user interface first install unitxt with ui requirements:

pip install unitxt[ui]

Then launch the ui by running:

unitxt-explore

🦄 Contributors

Please install Unitxt from source by:

git clone [email protected]:IBM/unitxt.git
cd unitxt
pip install -e ".[dev]"
pre-commit install

🦄 Citation

If you use Unitxt in your research, please cite our paper:

@inproceedings{bandel-etal-2024-unitxt,
    title = "Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative {AI}",
    author = "Bandel, Elron  and
      Perlitz, Yotam  and
      Venezian, Elad  and
      Friedman, Roni  and
      Arviv, Ofir  and
      Orbach, Matan  and
      Don-Yehiya, Shachar  and
      Sheinwald, Dafna  and
      Gera, Ariel  and
      Choshen, Leshem  and
      Shmueli-Scheuer, Michal  and
      Katz, Yoav",
    editor = "Chang, Kai-Wei  and
      Lee, Annie  and
      Rajani, Nazneen",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-demo.21",
    pages = "207--215",
    abstract = "In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution.Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively. Join the Unitxt community at https://github.com/IBM/unitxt",
}

Unitxt emoji designed by OpenMoji - the open-source emoji and icon project. License: CC BY-SA 4.0

unitxt's People

Contributors

Stargazers

Watchers

unitxt's Issues

Use of cache after changing card returns stale results

I changed a card (added a preprocessing step), but the dataset was loaded from cache:

07/16/2023 13:49:32 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /Users/yoavkatz/cache/huggingface/datasets/unitxt___data/card=cards.sst2_sentiment,template_item=0/1.1.1/161c975966d35694e0db488ca61993c4a4cfb44975f0fa25e6aac6dc3806b97f/cache-d2a30425e116067b.arrow

Need to include a hash of the card text in the cache key.

Independent randomizations

The current use of nested_seed(..) calls in unitxt is problematic:

when not specifying a sub_seed parameter, the new seed depends on a call to get_random_string(10), which in turn uses the current randomizer within _thread_local.random. So the state of that randomizer affects the seed of the new nested randomizer. Any outside effect on that randomizer thus affects the new seed, for example, any added runs of datasets or templates.
since within the nested_seed(..) scope, things are returned lazily, some randomizations may occur after the scope exits.

Consider need for requirements for operator caches

Loaders need it, do other operators?

#339

Random split of a predefined size

Capp the maximum number of examples returned by the split random mix (e.g., who cares for a 5% of the examples of a 1trilion sentences for test)

Dict Templates canot be added to the catalog

It is not an artifact so you can't save the whole named groups of templates
Perhaps there are also other things like that as well instructions or template list that can't be saved to the catalog

Add a ability to map specific values in List

Currently MapInstanceValues - maps complete valus in fields

MapInstanceValues(mappers={"labels": {"[]": ["none"]}}, strict=False),

However, in multi label classification, we would like to map individual elements within a field.

We would like an option 'process_every_value' to be added to MapInstanceValues to allow working on list.

MapInstanceValues(mappers={"labels": {"a": "apple", "b": "bandana"}}, strict=False, process_every_value=True),

Add CICD to the project

Support more sophisticated templates with ease

Currently using things like {choices} ends up with "[a,b,c]" which might not be the preferred format.
Also, several columns might be templated together.
{col_title:column}

Improve runtime of confidence interval calculation for GlobalMetric

Computing of confidence intervals for GlobelMetric objects (here) requires the recalculation of the metric multiple times, depending on the n_resamples parameter. This recalculation may be costly in runtime for some metrics, specifically those that score the prediction with an independent evaluation model.

The runtime for this computation should be improved, for example by caching the inference results of such evaluation models.

from chiti :You can make 100 inference requests. Then take 100 or 1000 random samples with replacement from this 100 set, which will give you 100/1000 different metrics for free, as you simply reuse the inference results . If the original set of 100 samples is random, then the above will give you a pretty accurate estimation of the confidence intervals without having to run inference on more than the original 100 instances.

More augmentation capabilities.

We have a request for another augmentor that adds whitespace at both the start and end of string:

"let's add up to 5 consecutive whitespaces at the beginning and end of the prompt. Within the 5 slots, for each slot, pick between

{
" " : 20,
"\t" : 10,
"\n" : 40,
"" : 30,
}

It's simpler to AugmentSuffix, except that it's also for prefix as well + the selection is repeated 5 times .

Maybe we can generalize AugmentSuffix.

Automate licenses and metadata from hf

Make a small function that takes a card and adds the license from HF to it (if it was not manually added before).
Then we can run it on all cards and get all the hf licenses and save one hassle for adding a new dataset (as this is not preprocessing)
Maybe also download any other metadata HF has?

In spirit, metadata is not preprocessing, so it is not an addition over HF and we can just use what already exists there. Possibly, we can even avoid resaving the metadata and just load it upon request with the loader, but it is probably harder and might be specific to HF datasets

additional_input not available to HuggingFace Metrics

additional_input field is not available to HuggingfaceMetrics .
PR #348 tries to solve the issue, but encounters test failures in other metrics.

CC @yoavkatz

NormalizeListFields should be replaced

It is in the wrong place, uninformative and not a problem of the one that just adds a dataset

Artifacts cache returns mutable items

Consider the following scenario:

Get an artifact from the artifacts repo (e.g. an Accuracy metric).
Make an adjustment to it (e.g. disable confidence interval calculation).
Retrieve the artifact again (e.g. to run another evaluation with the Accuracy metric).
The artifact is then retrieved from the in-memory cache of the artifacts repo, the retrieved item will be the modified artifact object.

This may lead to unexpected results. For example, i was not getting any confidence intervals, because i had previously disabled that for the cached Accuracy metric while running on another stream.

Generally, current usage of the artifactory-repo cache can be dangerous. There is no explicit way to prevent changes to objects returned from the cache. Maybe the returned artifact should be a copy of what is present in the cache.

Add new Augmentor to add random suffix to text

Augmentors (See #250 ) allow random modification of the input text of the models, to test model robustness.

Today, there is only a single kind of Augmentor: AugmentWhitespace:

which is registered in:

https://github.com/IBM/unitxt/blob/main/prepare/augmentors/augment_whitespace.py

Note that augmentors have two configurations, augment the entire input provided to the model (which includes the instructions, template prompt, etc). or only specific defined fields in the task. This is controlled by the

operator = AugmentWhitespace(augment_model_input=True)
vs
operator = AugmentWhitespace(augment_task_input=True)

The current whitespace augments replaces any existing whitespace in the string with random 1-3 whitespace chars.

class AugmentWhitespace(Augmentor):
    """
    Augments the inputs by replace existing whitespace with other whitespace.
    Currently each whitespace is replaced by a random choice of 1-3 whitespace charaters (spcae, tab, newline).
    """

    def process_value(self, value: Any) -> Any:
        import re

        words = re.split("(\s+)", value)
        new_value = ""

        for word in words:
            if word.isspace():
                new_value += random.choice(["\n", "\t", " "]) * random.randint(1, 3)
            else:
                new_value += word
        return new_value

There is a request to support augmentation of whitespace, but only at the end (even if there was no whitespace at the end).

One option is to add a new Augmentor class that randomly add one of a set of suffixes.

class AugmentSuffix(Augmentor)

That has a list of possible suffixes that the user can customize and will be randomly added.

operator = AugmentSuffix(augment_model_input=True, suffixes=[" ", "\n", ""])

We can add the option for weighted selection by passing a dictionary.

operator = AugmentSuffix(augment_model_input=True, suffixes={" ": 2 , "\n": 3 ,"" : 5})

Weighted F1 Metric

Adding Weighted F1 Metric to the list of available metrics.

LoadHF can not load from compressed files

dataset https://huggingface.co/datasets/GEM/xlsum is implemented over files compressed as *.tar.bz2.
Currently, these can not be loaded in Streaming mode, as shows in the attached Notebook
ShowNotImplemented.pdf

I am suggesting a fast patch in a coming pull request.

Support for a tokenizer parameter in metrics

Some metrics, such as Rouge, accept a tokenizer parameter for better support for foreign languages. It will be helpful to expose this option.

https://discuss.huggingface.co/t/which-tokenizer-does-rouge-metric-uses-under-the-hood/19903

https://github.com/google-research/google-research/blob/e3d00617cb28064b6e96ab4e2485079f0ca5a763/rouge/rouge_scorer.py#L60

cc: @perlitz @yoavkatz @gitMichal

Why do we verify after preparing and not before

in Stream we first make actions (in prepare()) and only then verify() the inputs.
So we both waste time and can't assume everything makes sense in the prepare.
If this is fixed, fix FieldOperator prepare() that starts with an assert section as well

Add shared templates

Add templates for classification nli and other classes which already share a function for their creation

Unclear error messag e when specifying a wrong field name

I used as input, the wrong field that does not appear in the input dataset, and I got a key error (See below).

It should print where exactly the field was accessed in the stream process, and the possible correct fields.

Artifact cards.sst2_sentiment is fetched from LocalCatalog(type='local_catalog', name='local', location='/Users/yoavkatz/fm-eval/fm_eval/catalogs/private')
Traceback (most recent call last):
File "/Users/yoavkatz/opt/miniconda3/envs/fme2/lib/python3.11/site-packages/datasets/builder.py", line 1629, in _prepare_split_single
for key, record in generator:
File "/Users/yoavkatz/cache/huggingface/modules/datasets_modules/datasets/unitxt--data/161c975966d35694e0db488ca61993c4a4cfb44975f0fa25e6aac6dc3806b97f/data.py", line 91, in _generate_examples
for i, row in enumerate(generator):
File "/Users/yoavkatz/cache/huggingface/modules/datasets_modules/datasets/unitxt--data/161c975966d35694e0db488ca61993c4a4cfb44975f0fa25e6aac6dc3806b97f/operator.py", line 141, in _process_stream
first_instance = next(iterator)
^^^^^^^^^^^^^^
File "/Users/yoavkatz/cache/huggingface/modules/datasets_modules/datasets/unitxt--data/161c975966d35694e0db488ca61993c4a4cfb44975f0fa25e6aac6dc3806b97f/operator.py", line 123, in _process_stream
for instance in stream:
File "/Users/yoavkatz/cache/huggingface/modules/datasets_modules/datasets/unitxt--data/161c975966d35694e0db488ca61993c4a4cfb44975f0fa25e6aac6dc3806b97f/operator.py", line 124, in _process_stream
yield self._process_instance(instance, stream_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yoavkatz/cache/huggingface/modules/datasets_modules/datasets/unitxt--data/161c975966d35694e0db488ca61993c4a4cfb44975f0fa25e6aac6dc3806b97f/operator.py", line 127, in _process_instance
return self.process(instance, stream_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yoavkatz/cache/huggingface/modules/datasets_modules/datasets/unitxt--data/161c975966d35694e0db488ca61993c4a4cfb44975f0fa25e6aac6dc3806b97f/task.py", line 16, in process
inputs = {key: instance[key] for key in self.inputs}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yoavkatz/cache/huggingface/modules/datasets_modules/datasets/unitxt--data/161c975966d35694e0db488ca61993c4a4cfb44975f0fa25e6aac6dc3806b97f/task.py", line 16, in
inputs = {key: instance[key] for key in self.inputs}
~~~~~~~~^^^^^
KeyError: 'sentence'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/yoavkatz/fm-eval/./fm_eval/runnables/run_text2text.py", line 157, in
main(sys.argv)
File "/Users/yoavkatz/fm-eval/./fm_eval/runnables/run_text2text.py", line 34, in main
raw_datasets = get_datasets(config)
^^^^^^^^^^^^^^^^^^^^
File "/Users/yoavkatz/fm-eval/fm_eval/runnables/data_utils.py", line 28, in get_datasets
raw_datasets = load_dataset(
^^^^^^^^^^^^^
File "/Users/yoavkatz/opt/miniconda3/envs/fme2/lib/python3.11/site-packages/datasets/load.py", line 1809, in load_dataset
builder_instance.download_and_prepare(
File "/Users/yoavkatz/opt/miniconda3/envs/fme2/lib/python3.11/site-packages/datasets/builder.py", line 909, in download_and_prepare
self._download_and_prepare(
File "/Users/yoavkatz/opt/miniconda3/envs/fme2/lib/python3.11/site-packages/datasets/builder.py", line 1670, in _download_and_prepare
super()._download_and_prepare(
File "/Users/yoavkatz/opt/miniconda3/envs/fme2/lib/python3.11/site-packages/datasets/builder.py", line 1004, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/Users/yoavkatz/opt/miniconda3/envs/fme2/lib/python3.11/site-packages/datasets/builder.py", line 1508, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/Users/yoavkatz/opt/miniconda3/envs/fme2/lib/python3.11/site-packages/datasets/builder.py", line 1665, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

MapInstanceValues name is not indicative

what is an instance? example or row? instances can mean many things

support balanced sampling of classes for classification datasets

Feature request-

for a given classification dataset, we want to support balanced sampling of class labels to create a more balanced subset of dataset. This will enable us to run train experiments with balanced datasets

renderer name is uninformative

It is not self explanatory what a renderer is.
Its essence is to deal with the whole input, or to glue demonstrations together. Input formatter?

CastFields

Not in canonic format (field_to_field etc.)
Can only work inplace

same template for different datasets

Can we share a template across datasets? this would require instead of {col1} {col2} have something more generic that can eat any input column (and maybe some meta information of it like its name) for example to have {Sentences} or {Col_name}:{Sentence}.

Write tutorial for unitxt

Increase n_resamples for GlobalMetric in testing so confidence intervals are not NaN

@eladven @matanor
In test_utils/metrics.py/test_metric, for a GlobalMetric we have

    if isinstance(metric, GlobalMetric) and metric.n_resamples:
        metric.n_resamples3  # Use a low number of resamples in testing for GlobalMetric, to save runtime

while this may be a good way to save runtime, it appears to cause an issue in the testing because it can cause the confidence intervals (which by default use bias-corrected accelerated BCa method) to be undefined. In my case, the resampled scores (the input theta_hat_b to _bca_interval), which consist only of 3 numbers (n_resamples as set above) are all above the value of the single value computed as "theta_hat". This causes the value of "percentile" to be 0, and therefore z0_hat, which is used to compute the rest of the interval, to be -Inf, which causes the other computations to be NaN. Basically we need theta_hat to be somewhere in the range of the values of theta_hat_b, which would cause percentile to be in (0,1) and not exactly 0 or 1 (see formulas http://users.stat.umn.edu/~helwig/notes/bootci-Notes.pdf slides 34-35). The likelihood of this happening increases if we allow some more resamples, perhaps 15 or 20 is enough for testing purposes. It seems like some other GlobalMetric objects like NDCG and Squad are tested as MetricPipeline, not GlobalMetric, so this is not an issue for them. Not sure if it affects other tests.

FormTask name is uninformative

Uninformative name, was it meant to be multipleChoice?

format utility modifies virtual env files

When running the util/format.sh script to fix the code format, it also modifies files inside the "venv" directory, which causes problems to the installed packages. The virtual env directory should be excluded from this process.

Potential error in metrics/F1MultiLabel

@matanor @elronbandel In F1MultiLabel, I think the following line is a mistake

labels = [
    lbl
    for lbl in {label for reference in references for label in reference}
    if lbl not in self.classes_to_ignore
]

This is supposed to be collecting the unique non-ignored label values in reference. But since reference is a list of strings due to the line references = [reference[0] for reference in references], this ends up generating a list of the single characters in the list, which creates labels that didn't exist. For instance, with the inputs

references = [["A B"], ["BC D"], ["C"], ["123"]]
predictions = [["B", "AB", "A"], ["A", "bC", "BC DF"], ["c", " C"], [13, 23, 234]]

for the first pair of input and prediction, the result of labels is [' ', 'A', 'B'] (the individual characters of 'A B'), not just 'A B'. That is, ' ', 'A', and 'B' are new label classes that are invented. I think that one of the following is the case:

either it should instead be labels = set([reference for reference in references if reference not in self.classes_to_ignore]), or use set difference
Possibly F1MultiLabel was intended to allow multiple reference values per prediction, rather than the reverse. Possibly the assertion was copied from F1 without changing it to check if there is only one prediction per reference.

Also, it would be useful to have examples in prepare/metrics/f1.py illustrating use of these metrics, as the other ones have.

Rouge might not be correct

Rouge in the old unitxt had some preprocessing that is not included here. (something to do with separation of sentences) this might affect the results.
@gitMichal

Informative metric not found error

Currently it looks something like:

    metric, _ = fetch_artifact(metric_name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibmlc/fusion/unitxt/src/unitxt/artifact.py", line 186, in fetch_artifact
    raise UnitxtArtifactNotFoundError(name, Artifactories().artifactories)
unitxt.artifact.UnitxtArtifactNotFoundError: Artifact accuracy does not exist```

We should have something like python has “no method XXXX.  Did you mean YYYY”?
Do we have a list of all available artificats we an compare to?
https://stackoverflow.com/questions/10018679/python-find-closest-string-from-a-list-to-another-string

Another option is to use fetch_metric(metric_name), which will automatically add the “metric” name to the name.

New operator to return most common values from a given field

Often we need to list a set of choices for the model, which can be derived from the labels of train data.

Today, people need to write them manually, use metadata (if available) or find some other way to generate them.

The idea is to add a new StreamOperators which returns field values

ExtractFieldValues(stream="train", field="label", to_field="choices", min_precent (optional)=1 , min_count (optional)= 10)

Fetch all instance from stream ("train")
Count distribution of values of field ("label")
Remove from dictionary all labels below min _precent or below min_count
Sort the list of value by count
Add the list field named to_field to all streams in input multi stream ("choices"). [ "negative" , "positive"]

Before:
train:
{ "text" : ... , "label: positive}
{ "text" : ... , "label: positive}
{ "text" : ... , "label: neutral}
{ "text" : ... , "label: neutral}
{ "text" : ... , "label: negative}

test:
{ "text" : ... , "label: positive}
{ "text" : ... , "label: negative}

After:

train:
{ "text" : ... , "label: positive" , choices : ["positive","neutral"}
{ "text" : ... , "label: positive, choices : ["positive","neutral"}}
{ "text" : ... , "label: negative, choices : ["positive","neutral"}}

test:
{ "text" : ... , "label: positive, choices : ["positive","neutral"}}
{ "text" : ... , "label: negative, choices : ["positive","neutral"}}

Similar to merge (MultiStreamOperator)

Add check for validaty of all templates

Today templates can be defined that use field that don't exist or not use any fields at all.

test_card() should check all templates and not just one.
it should validate the user did not forget to add atlease one field.

A formatter based on format strings

The current formatter ICLFormat expresses the used format implicitly, using several sepsrators added in between the template and the demos:

class ICLFormat(SizeLimitingFormat):
    prefix: str = ""
    input_prefix: str = ""
    output_prefix: str = ""
    target_prefix: str = " "
    instruction_prefix: str = ""
    input_output_separator: str = "\n"
    demo_separator: str = "\n\n"
    suffix: str = ""

A more explicit formatter could be based on two fields only, one for the instructions format, and another for the demos format.
For example,
instructions_format =
"|system message|

{demos}

{instruction}
"
where the demos are created in a loop, each demo formatted using a demo_format, or using the template, and the instruction is also produced by the template.

Generally, there will be several fixed field names that may be integrated into the formats, and one or two format fields that control how the output looks.

The test test_thread_safety throws exceptions

The test test_thread_safety throws the following exceptions:

Traceback (most recent call last): 
 File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/threading.py", line 932, in _bootstrap_inner 
 self.run() 
 File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/threading.py", line 870, in run 
 self._target(*self._args, **self._kwargs) 
 File "/home/runner/work/unitxt/unitxt/tests/test_random_utils.py", line 64, in thread_function 
 with nested_seed(): 
 File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/contextlib.py", line 113, in __enter__ 
 return next(self.gen) 
 File "/home/runner/work/unitxt/unitxt/src/unitxt/random_utils.py", line 30, in nested_seed 
 state = _thread_local.random.getstate() 
AttributeError: '_thread._local' object has no attribute 'random'

See for example here.

These are thrown by the threads created in the test, causing the threads to exit. However, the test does not fail, since it is not checked whether an exception was thrown by the threads.

The exception is thrown because the random attribute is not initialized for the created threads. So, when they try to get a nested seed (here) an exception is thrown.

Remove slash (/) from object key in LoadFromIBMCloud

The slash character is not allowed as a valid character in the object key when storing files in IBM COS thus the retrieval of files with this character will lead to an object not being found.

Making data_dir as an optional parameter will solve this issue.

Subtle issue in ExtractFieldValues

The tests of ExtractFieldValues work well but when I use it, I saw the the field was not added.

The reason , I think is that process method modified the stream in place and does not create a new stream generator that modifies the input streams.

   for name in multi_stream:
            for instance in multi_stream[name]:
                instance[self.to_field] = values_to_keep
        return multi_stream

So if you add, before the return.

 for name in multi_stream:
         for instance in multi_stream[name]:
            print(instance)

You will see that the instance (which is a actually new instance that is fetched again from the multi-stream[name], does not include the change )

I think the implementation should be similar to :

class SpreadSplit(InstanceOperatorWithMultiStreamAccess):
which has access to the multi stream, but then adds a single value to the instance.

class ExtractFieldValues(InstanceOperatorWithMultiStreamAccess):
    field: str
    stream_name: str
    overall_top_frequency_percent: Optional[int] = 100
    min_frequency_percent: Optional[int] = 0
    to_field: str
    process_every_value: Optional[bool] = False

    def prepare(self):
        self.local_cache = None

    def verify(self):
            return super().verify()

    def process(
        self, instance: Dict[str, object], multi_stream: MultiStream
    ) -> Dict[str, object]:
        try:
            if self.local_cache is None:
                self.local_cache =  calculate_extracted_values(multi stream)
         
            instance[self.to_field] =  self.local_cache
            return instance
        except Exception as e:
            raise Exception(
                f"Unable to fetch instances from '{self.source_stream}' to '{self.target_field}'"
            ) from e

example operator test is broken

Trying to add the new operator in the docs gives an assertion error. The test on line 42 in test_utils/operators.py is backwards.

(fme) [jlquinn@cccxl005 unitxt]$ python ../optest.py 
/u/jlquinn/jlquinn-mt/gaama/conda.x86/envs/fme/lib/python3.9/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
Artifact tmp_name saved to /tmp/tmpoyx56n6a/tmp_name.json
Traceback (most recent call last):
  File "/dccstor/multilm2/fm-ml-foglight/unitxt/../optest.py", line 29, in <module>
    print(test_operator(operator, inputs, targets)) # True
  File "/dccstor/multilm2/fm-ml-foglight/unitxt/src/unitxt/test_utils/operators.py", line 46, in test_operator
    raise AssertionError("\n".join(errors))
AssertionError: input and output must be equal, got <{'a': 1, 'b': 2}> =/= <{'a': 1, 'b': 2}>
input and output must be equal, got <{'a': 2, 'b': 2}> =/= <{'a': 2, 'b': 2}>

optest.py

from typing import (                                                                                                                                                                                 
    Any,                                                                                                                                                                                             
    Callable,                                                                                                                                                                                        
    Dict,                                                                                                                                                                                            
    Generator,                                                                                                                                                                                       
    Iterable,                                                                                                                                                                                        
    List,                                                                                                                                                                                            
    Optional,                                                                                                                                                                                        
    Tuple,                                                                                                                                                                                           
    Union,                                                                                                                                                                                           
)                                                                                                                                                                                                    
from unitxt.operator import StreamInstanceOperator                                                                                                                                                   
                                                                                                                                                                                                     
class AddFields1(StreamInstanceOperator):                                                                                                                                                            
    fields: Dict[str, object]                                                                                                                                                                        
                                                                                                                                                                                                     
    def process(self, instance: Dict[str, Any], stream_name: str = None) -> Dict[str, Any]:                                                                                                          
        return {**instance, **self.fields}                                                                                                                                                           
                                                                                                                                                                                                     
                                                                                                                                                                                                     
operator = AddFields1(fields={"b": 2})                                                                                                                                                               
                                                                                                                                                                                                     
inputs = [{'a': 1}, {'a': 2}]                                                                                                                                                                        
targets = [{'a': 1, 'b': 2}, {'a': 2, 'b': 2}]                                                                                                                                                       
                                                                                                                                                                                                     
from unitxt.test_utils.operators import test_operator                                                                                                                                                
                                                                                                                                                                                                     
print(test_operator(operator, inputs, targets)) # True

Remove src prefix in "The Task Card example"

There is inconsistency with import in adding_common_recipe.rst
one time it uses src prefix, 4 times it doesn't.

Error message for IBM Cloud loader produce wrong message error

The error message for ibm_boto3 not being installed has typo and it does not specify what should be installed to fix the issue @yoavkatz

Make sure test run after every push

RenderFormatTemplate is not a template

In templates.py there is RenderFormatTemplate, RenderAutoFormatTemplate, and RenderTemplatedICL. In theory I expected these to be templates, but they derive from ABC, not Template.

Errors that occure while generating the dataset with templates.empty

The issue is that I am getting errors while generating the dataset. I am providing the whole traceback below:

Task: type=standard_recipe_with_indexes,card=cards.entities_selected.all,template=templates.empty
model: google/flan-t5-xl

Traceback (most recent call last):
  File "/Users/rafalmaciasz/opt/anaconda3/envs/fm-eval/lib/python3.10/site-packages/datasets/builder.py", line 1706, in _prepare_split_single
    num_examples, num_bytes = writer.finalize()
  File "/Users/rafalmaciasz/opt/anaconda3/envs/fm-eval/lib/python3.10/site-packages/datasets/arrow_writer.py", line 598, in finalize
    raise SchemaInferenceError("Please pass `features` or at least one example when writing data")
datasets.arrow_writer.SchemaInferenceError: Please pass `features` or at least one example when writing data

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/rafalmaciasz/dev/ibm/fm-eval-setup/fm-eval/fm_eval/runnables/OrchestratedTasks/TasksRunner.py", line 197, in run
    result_n, config = taskRunner.run_task()
  File "/Users/rafalmaciasz/dev/ibm/fm-eval-setup/fm-eval/fm_eval/runnables/OrchestratedTasks/SingleTaskWrapper.py", line 135, in run_task
    global_scores, _ = run_experiment(self.config, self.per_device_eval_batch_size)
  File "/Users/rafalmaciasz/dev/ibm/fm-eval-setup/fm-eval/fm_eval/runnables/run_text2text.py", line 38, in run_experiment
    raw_datasets = get_datasets(config)
  File "/Users/rafalmaciasz/dev/ibm/fm-eval-setup/fm-eval/fm_eval/runnables/data_utils.py", line 49, in get_datasets
    raw_datasets = load_dataset(
  File "/Users/rafalmaciasz/opt/anaconda3/envs/fm-eval/lib/python3.10/site-packages/datasets/load.py", line 2136, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/rafalmaciasz/opt/anaconda3/envs/fm-eval/lib/python3.10/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/Users/rafalmaciasz/.cache/huggingface/modules/datasets_modules/datasets/dataset/1aae7442b3ea98cbf3a572039df652f4d1f89511e864fa07c075a7104482edc5/dataset.py", line 135, in _download_and_prepare
    result = super()._download_and_prepare(dl_manager, "no_checks", **prepare_splits_kwargs)
  File "/Users/rafalmaciasz/opt/anaconda3/envs/fm-eval/lib/python3.10/site-packages/datasets/builder.py", line 1720, in _download_and_prepare
    super()._download_and_prepare(
  File "/Users/rafalmaciasz/opt/anaconda3/envs/fm-eval/lib/python3.10/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Users/rafalmaciasz/opt/anaconda3/envs/fm-eval/lib/python3.10/site-packages/datasets/builder.py", line 1555, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/Users/rafalmaciasz/opt/anaconda3/envs/fm-eval/lib/python3.10/site-packages/datasets/builder.py", line 1715, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Call only a specific split

When a specific split is called, shouldn't process everything. For example, when calling MMLU (100K train 1K test) we don't want to process everything

Type check to avoid warning on "str" when it is convertible to an artifact

In every place where we allow a string from the catalog to replace our Artifact s we need to announce the right types to allow it.
This means either union in every place (x|str) or to define a new python type that names this union informatively (or to pass an object rather than string, but not friendly to users so I don't like this one.

repr and types for Artifacts, Instructions etc.

Instructions templates and our other classes don't have representations
Among other things it makes it annoying to debug

artifact name to include hyphen (minus sign)?

Artifacts built for multilingual datasets, are built one per language, and they typically contain the language in their name:
e.g. add_to_catalog(card, f"cards.xnli.{lang}", overwrite=True)
method verify_legal_catalog_name(name) in /unitxt/src/unitxt/catalog.py , enforces the names of the artifact to match the reg exp r"^[\w" + COLLECTION_SEPARATOR + "]+$" .
Now, in some datasets, e.g. AmazonScience/massive, the language names do contain hyphen, e.g: 'de-DE', 'el-GR', 'en-US' .
Should the regular expression be expanded to also allow hyphens? Or is there a reason not to?
Might a twicked artifact name, e.g. cards.amazon_mass.en_US (for language en-US), cause a problem downstream? Are there tools that depend on the name of the language being part of the name of the artifact?

Unify groups of streams to have the same structure

Move all fields operators to inherit from field operator (e.g. CastFields is complex and not generic)
try to create something simplifying (with 2\multiple interactig fields or with other generic use cases for streams) so they will all get the same inputs have simple logic.