nateraw / modelcards

📝 Utility to create, edit, and publish model cards on the Hugging Face Hub. [**Now lives in huggingface_hub**]

License: MIT License

Python 37.34% Jupyter Notebook 62.66%

modelcards's Introduction

modelcards

⚠️ Deprecation Warning ⚠️ The utilities in this repo now live in `huggingface_hub`, so this project will no longer be maintained.

You can find more information in the Hugging Face Hub documentation:

For a full walkthrough, try the demo in Colab:

Usage

Installation

pip install modelcards

Examples

Load a model card from a Hugging Face Hub repo:

from modelcards import ModelCard

card = ModelCard.load("nateraw/food")

# Access its card data
print(card.data)

# Update its card data
card.data.library_name = "transformers"

# Save it to a file
card.save("my_card.md")

# Or, push it to the hub directly to replace the existing card
card.push_to_hub("nateraw/food")

Make model cards from the default model card template. 👀 You can see what the resulting model card looks like at this Hugging Face Hub repo.

from modelcards import CardData, ModelCard

repo_id = "nateraw/my-cool-model-with-card"

# Initialize card from default template, including card metadata
card = ModelCard.from_template(
    card_data=CardData(  # Card metadata object that will be converted to YAML block
        language='en',
        license='mit',
        library_name='timm',
        tags=['image-classification', 'resnet'],
        datasets='beans',
        metrics=['accuracy', 'f1'],
    ),
    model_id=repo_id.split('/')[-1],  # Jinja template kwarg
    model_description="Some really helpful description...",  # Jinja template kwarg
)

The modelcards.CardData class is used above to define some card metadata. This metadata is leveraged by the Hugging Face Hub to:

- enable discoverability of your model through filters
- provide a standardized way to share your evaluation results (which are then automatically posted to Papers With Code)
- enable the Inference API if your model is compatible with one of the available Inference API pipelines
- and more!
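To make "card metadata" concrete: it is the YAML block that sits between the `---` lines at the top of the README. Below is a minimal sketch of separating that block from the card body using only the standard library. `split_front_matter` is a hypothetical helper written for illustration, not a function from `modelcards`, and it leaves the YAML unparsed.

```python
import re

# Matches a leading block delimited by lines that contain only '---'.
_FRONT_MATTER = re.compile(r"\A---[ \t]*\n(.*?)\n---[ \t]*(?:\n|\Z)", re.DOTALL)


def split_front_matter(text):
    """Return (yaml_block, body); yaml_block is None if no metadata block is found."""
    match = _FRONT_MATTER.match(text)
    if match is None:
        return None, text
    return match.group(1), text[match.end():]


readme = "---\nlanguage: en\nlicense: mit\n---\n\n# my-cool-model\n"
yaml_block, body = split_front_matter(readme)
print(yaml_block)  # raw YAML text, still to be parsed by a YAML library
print(body)
```

A real implementation would hand `yaml_block` to a YAML parser and validate the result; this sketch only shows where the metadata lives.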

You can also make your own template and supply that to the from_template method by using the template_path argument.

from pathlib import Path

from modelcards import CardData, ModelCard

template_text = """
---
{{ card_data }}
---

# {{ model_id | default("CoolModel") }}

This model is part of `super_cool_models` package (which doesn't exist)! It is a fine tuned `cool-model` on the `{{ dataset_name }}`.

## Intended uses & limitations

This model doesn't exist, so you probably don't want to use it! This is just an example template. Please write a very thoughtful model card ❤️
"""

Path('my_template.md').write_text(template_text)

card = ModelCard.from_template(
    card_data=CardData(  # Card metadata object that will be converted to YAML block
        language='en',
        license='mit',
        library_name='super_cool_models',
        tags=['image-classification', 'cool-model'],
        datasets='awesome-dataset',
        metrics=['accuracy', 'f1'],
    ),
    template_path='my_template.md', # The template we just wrote!
    model_id='cool-model',  # Jinja template kwarg
    dataset_name='awesome-dataset', # Jinja template kwarg
)


modelcards's Issues

Add token for loading card

Right now you can't load private cards without background authentication, since the token is not exposed as an argument.

PyPI source does not have `requirements.txt`: leads to installation failure

The package source on PyPI does not have any requirements.txt. This leads to installation failure while trying to install from the source (*.tar.gz file).

Update docstring syntax

As mentioned by lysandre here, we should update docstrings to look like this. Adding this issue to track it

            language (`Union[str, List[str]]`, *optional*):

Originally posted by @LysandreJik in #18 (comment)

Include `revision` and `commit_description` kwargs in `push_to_hub`

Just good practice here...match all the generic push_to_hub kwargs from huggingface_hub.

Decouple model card specific items from RepoCard

Right now, in a couple places there are model specific items in the RepoCard object. If we want to support Data Cards, we should decouple these things from RepoCard and include them in ModelCard. Then, any dataset specific things can go in a new object, DatasetCard, including the dataset card template. (related to #36 )

from_template silently drops non-matching kwargs

As a user, I find it surprising that I can pass any argument to ModelCard.from_template, only to have it silently ignored if it doesn't match a variable in the template.

To reproduce, use the same code as in the README and add an arbitrary argument, e.g.

card = ModelCard.from_template(card_data=..., model_id=..., model_description=..., foo='123')
card.save(...)

The foo='123' part is silently dropped.

ping @adrinjalali @merveenoyan
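One possible way to surface this is to compare the supplied kwargs against the variable names that appear in the template and warn about the leftovers. The sketch below is an assumption-laden illustration: `unused_template_kwargs` is a hypothetical helper, and the regex is a simplified stand-in that only looks at the first identifier in each `{{ ... }}` expression. Jinja's own `jinja2.meta.find_undeclared_variables` would be the more robust choice in a real fix.

```python
import re
import warnings

# Simplified scan: captures the first identifier in each {{ ... }} expression.
# It ignores {% ... %} blocks, unlike Jinja's own parser.
_VARIABLE = re.compile(r"\{\{\s*([A-Za-z_]\w*)")


def unused_template_kwargs(template_text, template_kwargs):
    """Return the sorted kwarg names that never appear in the template."""
    used = set(_VARIABLE.findall(template_text))
    return sorted(set(template_kwargs) - used)


template_text = "# {{ model_id }}\n\n{{ model_description }}\n"
unused = unused_template_kwargs(
    template_text,
    {"model_id": "food", "model_description": "...", "foo": "123"},
)
if unused:
    warnings.warn(f"Template kwargs not used by the template: {unused}")
```

Whether this should be a warning or an error is a design choice; a warning keeps existing calls working while still making the dropped argument visible.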

Return URL string from `push_to_hub`

In push_to_hub, we use upload_file which will return a string URL of the uploaded file on the specified branch/revision/etc. We should return that.

Context: This way, it will be easier to match huggingface_hub's metadata_update when we port this over to a PR there.

Creating a card from a template string instead of file

I wanted to know if this would be something that could be added: Add a new class method from_string (or similar name) to initialize a ModelCard when the template is already stored as a string variable. That way, in some circumstances, we can avoid the "round-trip" through a file. Implementation-wise, it would be quite simple:

class ModelCard(RepoCard):
    @classmethod
    def from_template(
        cls,
        card_data: CardData,
        template_path: Optional[str] = TEMPLATE_MODELCARD_PATH,
        **template_kwargs,
    ):
        template_str = Path(template_path).read_text()
        return cls.from_string(card_data=card_data, template_str=template_str, **template_kwargs)

    @classmethod
    def from_string(
        cls,
        card_data: CardData,
        template_str: str,
        **template_kwargs,
    ):
        content = jinja2.Template(template_str).render(
            card_data=card_data.to_yaml(), **template_kwargs
        )
        return cls(content)

Please add a license to the repository and the PyPI source as well.

Currently no license file is present in the repository and the PyPI source. Please add a license.

Update CardData.model_name on push_to_hub if it's different

If a user has set a model_name that differs from the name of the repo they're pushing to, update it to use the repo's name instead.

Perhaps warn user about this or have a flag to avoid/enable this behavior?

yaml error when saving the card

I get:

yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object/apply:numpy.core.multiarray.scalar'
  in "<unicode string>", line 16, column 14:
          value: !!python/object/apply:numpy.core ...

This happens when I try to save the card. I don't know what I'm doing wrong (I'm doing something similar in the tests and they pass), so maybe I'm hitting an edge case. I tried both pyyaml 6.0 and 5.4, since both are allowed.

Here's code to reproduce the issue:

from modelcards import CardData
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV, train_test_split
from skops import card
X, y = load_breast_cancer(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
param_grid = {
    "max_leaf_nodes": [5, 10, 15],
    "max_depth": [2, 5, 10],
}
model = HalvingGridSearchCV(
    estimator=HistGradientBoostingClassifier(),
    param_grid=param_grid,
    random_state=42,
    n_jobs=-1,
).fit(X_train, y_train)
model.score(X_test, y_test)
limitations = "This model is not ready to be used in production."
model_description = (
    "This is a HistGradientBoostingClassifier model trained on breast cancer dataset."
    " It's trained with Halving Grid Search Cross Validation, with parameter grids on"
    " max_leaf_nodes and max_depth."
)
license = "mit"
eval_results = card.evaluate(
    model, X_test, y_test, "neg_mean_squared_error", "random_type", "dummy_dataset", "tabular-regression"
)
card_data = CardData(
    license=license,
    tags=["tabular-classification"],
    datasets="breast-cancer",
    eval_results=eval_results,
    model_name="my-cool-model",
)
permutation_importances = card.permutation_importances(model, X_test, y_test)
model_card = card.create_model_card(
    model,
    card_data=card_data,
    template_path = "skops/skops/card/default_template.md",
    limitations=limitations,
    model_description=model_description,
    permutation_importances=permutation_importances,
)
model_card.save(f"{local_repo}/README.md")

Below is the error I get:

python3 test.py
Traceback (most recent call last):
  File "test.py", line 100, in <module>
    model_description=model_description,
  File "/Users/mervenoyan/Desktop/skops/skops/skops/card/_model_card.py", line 57, in create_model_card
    **card_kwargs,
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/modelcards/cards.py", line 274, in from_template
    return cls(content)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/modelcards/cards.py", line 40, in __init__
    data_dict = yaml.safe_load(yaml_block)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/yaml/__init__.py", line 162, in safe_load
    return load(stream, SafeLoader)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/yaml/__init__.py", line 114, in load
    return loader.get_single_data()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/yaml/constructor.py", line 43, in get_single_data
    return self.construct_document(node)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/yaml/constructor.py", line 52, in construct_document
    for dummy in generator:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/yaml/constructor.py", line 404, in construct_yaml_map
    value = self.construct_mapping(node)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/yaml/constructor.py", line 210, in construct_mapping
    return super().construct_mapping(node, deep=deep)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/yaml/constructor.py", line 135, in construct_mapping
    value = self.construct_object(value_node, deep=deep)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/yaml/constructor.py", line 92, in construct_object
    data = constructor(self, node)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/yaml/constructor.py", line 420, in construct_undefined
    node.start_mark)
yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object/apply:numpy.core.multiarray.scalar'
  in "<unicode string>", line 16, column 14:
          value: !!python/object/apply:numpy.core ...

I also tried to save without args after CardData to see if they have weird characters, I still get the same error. You can try my fork's feature_importance branch to get the necessary functions.

ModelCard init breaks if model-index exists but is invalid

For example, in this repo there is no dataset defined, so it breaks:

from modelcards import ModelCard

card = ModelCard.load('nateraw/rare-puppers')

It would be nice if it tried to partially load the card, then helped users fix the validation issues.

As a quick fix, I think I'll just ignore model-index in these cases for now, logging a warning to give an indication of what's going on.

Should we update syntax for loading model cards?

So right now, I have it so ModelCard.load works for both repo IDs and local files.

I'm thinking perhaps we could do:

ModelCard.open(filepath) to open a local file
ModelCard.load_from_hub(repo_id) to load a remote Hugging Face model repo model card

WDYT?
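For context on why a single `load` is ambiguous, here is a rough sketch of the kind of guess it has to make. `resolve_card_source` is a hypothetical helper, not the library's actual logic: it treats the argument as a local file if one exists at that path and as a Hub repo ID otherwise.

```python
from pathlib import Path
import tempfile


def resolve_card_source(repo_id_or_path):
    """Return ("file", Path) for an existing local file, else ("hub", repo_id)."""
    path = Path(repo_id_or_path)
    if path.is_file():
        return "file", path
    return "hub", str(repo_id_or_path)


with tempfile.TemporaryDirectory() as tmp:
    local_card = Path(tmp) / "README.md"
    local_card.write_text("# my card\n", encoding="utf-8")
    file_result = resolve_card_source(local_card)

hub_result = resolve_card_source("nateraw/food")
```

With this approach, a local file that happens to be named like a repo ID shadows the remote card, which is one argument for separate, explicit entry points.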

Add ability to PR modelcard updates

@mgerchick suggested to add a workflow to PR (via contributions tab on Hugging Face Hub) updates made to an existing model card, which is something she and others have been doing for model cards around the Hub. I think we could wait until any utilities for this are stable in huggingface_hub, then leverage them here as needed.

Probably related to this: huggingface/huggingface_hub#888

Will add other related issues/PRs in huggingface_hub as I find them (others feel free to drop them below too 😄 )

Support for datacards/datasheets

Thanks for this library -- I've just started playing with this, and it looks like it is going to be super useful :)

Are there any plans for also supporting the creation of datacard/datasheets in this library?

I think this could be quite useful for a few use cases. In particular, being able to template out some standard information might be useful for organizations that want to standardize parts of a datacard. For example, in https://huggingface.co/datasets/BritishLibraryLabs/EThOS-PhD-metadata, we may want to be able to pass in a list of names or ORCID iDs to go under https://huggingface.co/datasets/BritishLibraryLabs/EThOS-PhD-metadata#dataset-curators.

This could end up looking something like:

datacard = DataCard.from_template(
    card_data=DataCardData(  # Card metadata object that will be converted to YAML block
        license='mit',
        tags=['image-classification'],
        ...
    ),
    template_path='my_data_template.md',  # The template we just wrote!
    dataset_id='cool-model',  # Jinja template kwarg
    external_url='data.bl....',  # Jinja template kwarg
    curators=['name1', 'name2'],
)

I think this could also be useful for organizations/users using the hub to store data that is actively being developed/annotated. They could then use this feature to automagically create some key stats about the dataset (e.g. number of instances, label frequency breakdowns, annotator agreement scores) and keep that documentation in sync with a changing dataset. I had planned to add something like this to https://github.com/davanstrien/hugit-cli/ but would rather piggyback on something else!

Determine how to deal with type validation in card metadata

How do we want to deal with card data validation? Dataclasses are a great way to give type hints, but they don't do any validation. Is this a complex enough use case to use pydantic? I think it would make our lives easier, but I'm concerned the dependency won't be welcomed in huggingface_hub if this code gets integrated there down the line.

CC @adrinjalali

Validate Card Metadata

Card Metadata that comes at the top of the readme between the ---s should be validated. For example, languages are a list of strings, not a string, so if someone overwrites card.data['language'] = 'en' it should be automatically updated to ['en']. If they try to set it to a weird type, like a float, they should get a helpful error message.
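A minimal sketch of the coercion described above, assuming the rule "a single string becomes a one-element list, a list of strings passes through, anything else is rejected". `normalize_language` is a hypothetical helper for illustration, not an existing function in this package.

```python
def normalize_language(value):
    """Coerce a language setting to a list of strings, or raise a helpful error."""
    if value is None:
        return None
    if isinstance(value, str):
        return [value]
    if isinstance(value, (list, tuple)) and all(isinstance(v, str) for v in value):
        return list(value)
    raise TypeError(
        "language must be a string or a list of strings, "
        f"got {type(value).__name__}: {value!r}"
    )


print(normalize_language("en"))          # ['en']
print(normalize_language(["en", "fr"]))  # ['en', 'fr']
```

The same pattern could be applied per field, for example from a property setter on the card data object, so that invalid assignments fail at the point where they are made.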

Add support for Carbon Emissions reporting in CardData

It's possible to include emissions data in your model card metadata. We should make it easy to do so with this package.

Spec:

co2_eq_emissions:
      emissions: "in grams of CO2"
      source: "source of the information, either directly from AutoTrain, code carbon or from a scientific article documenting the model"
      training_type: "pretraining or fine-tuning"
      geographical_location: "as granular as possible, for instance Quebec, Canada or Brooklyn, NY, USA"
      hardware_used: "how much compute and what kind, e.g. 8 v100 GPUs"

This spec also apparently lets you just say co2_eq_emissions: <amount> where the amount is what you would have passed to emissions within, so something like 1000.0. This is what I've seen used pretty often.

In the end, if someone didn't report emissions data but they want to, it'll be as easy as this as soon as we add the feature.

from modelcards import ModelCard

card = ModelCard.load('nateraw/rare-puppers')
card.data.emissions = 1000.0
card.push_to_hub('nateraw/rare-puppers')

Or, if they want to include the extra metadata from the spec, maybe we have a dataclass called something like EmissionsData you can use.

from modelcards import ModelCard

card = ModelCard.load('nateraw/rare-puppers')
card.data.emissions = EmissionsData(
    emissions=1000.0,
    source="codecarbon",
    training_type="fine-tuning",
    geographical_location="Rochester, NY",
    hardware_used="1 Nvidia GTX 1080"
)
card.push_to_hub('nateraw/rare-puppers')
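As a rough sketch of what such a dataclass might look like, assuming the field names from the spec above: `EmissionsData` and its `to_metadata` method are hypothetical here (the class is only proposed in this issue). The method returns a bare number when only `emissions` is set, matching the short form of the spec, and a dict otherwise.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EmissionsData:
    emissions: float  # grams of CO2
    source: Optional[str] = None
    training_type: Optional[str] = None
    geographical_location: Optional[str] = None
    hardware_used: Optional[str] = None

    def to_metadata(self):
        """Return the value to store under the co2_eq_emissions key."""
        # Drop unset fields so they don't show up as nulls in the YAML block.
        fields = {k: v for k, v in asdict(self).items() if v is not None}
        if list(fields) == ["emissions"]:
            return self.emissions
        return fields


short_form = EmissionsData(emissions=1000.0).to_metadata()
long_form = EmissionsData(emissions=1000.0, source="codecarbon").to_metadata()
```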

CC @sashavor

Update validation endpoint

The endpoint for card validation has been updated from /validate-yaml to /api/validate-yaml. Should reflect this change in RepoCard.

Add a tag for this library automatically

Add a tag like auto-generated-modelcard that gets added to tag list to help track usage. Probably want a way to opt out of this too, but the default should be opt-in, I think.

Loading cards from Hub with unhandled metadata headers fails

When loading a card from the hub, this package fails when loading one that has unhandled fields in the metadata header. For example, widgets.

To recreate the issue, pip install modelcards, then:

from modelcards import ModelCard

card = ModelCard.load('Maltehb/aelaectra-danish-electra-small-cased')

You'll get TypeError: __init__() got an unexpected keyword argument 'co2_eq_emissions'. This is coming from CardData's init, which doesn't expect this co2_eq_emissions key. This example specifically is related to #14, but it is also true for many other fields that are used in the metadata header ('widgets', etc.)
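One way to avoid this class of failure is to let the metadata object accept and keep keys it does not know about. The sketch below uses a hypothetical stand-in class, `LenientCardData`, with illustrative values; it is not the package's `CardData`, just a demonstration of the `**kwargs` pattern.

```python
class LenientCardData:
    """Hypothetical stand-in for CardData that tolerates unknown metadata keys."""

    def __init__(self, language=None, license=None, tags=None, **kwargs):
        self.language = language
        self.license = license
        self.tags = tags
        # Keep unrecognized keys (e.g. co2_eq_emissions) instead of raising TypeError.
        self.__dict__.update(kwargs)

    def to_dict(self):
        """Return all set fields, known and unknown, ready to be dumped to YAML."""
        return {k: v for k, v in self.__dict__.items() if v is not None}


metadata = {"language": "da", "license": "mit", "co2_eq_emissions": 4.0}
card_data = LenientCardData(**metadata)
```

Because unknown keys survive the round trip through `to_dict`, saving the card again would not strip metadata the package does not model explicitly.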

Extensions for integrations

How should a library integration developer extend the model card here to fit their use case?

It would be nice to start with documentation that explains how the workflow works for us, so we can settle the API. I would suggest something like this file (https://github.com/skops-dev/skops/blob/main/examples/plot_hf_hub.py) instead of a notebook, since it can be converted into a notebook.

Should we use card data object directly in ModelCard.from_template?

Right now, the ModelCard.from_template function takes in the same args as CardData, and then just initializes it. Does it make more sense to initialize CardData separately and pass that to ModelCard.from_template? That way the documentation around card data is all in one place.

So instead of

card = ModelCard.from_template(language='en', ...)

you would do

card_data = CardData(language='en', ...)
card = ModelCard.from_template(card_data, **template_kwargs)

CC @adrinjalali

Add ability to pull information from Transformers docs

Hi,

This library looks great already. It would be awesome if we can leverage it when adding new models to the Transformers library. For now, model cards are created manually (after pushing the checkpoints to the hub).

We typically duplicate things from the documentation into the model card (like the abstract of the paper, a code snippet showcasing basic usage, the URL of the paper, short description of the model, etc.).

For instance, check the ViT docs, which is then used in the model card of a ViT checkpoint.

So ideally, we could use this library in the conversion scripts, where we also add the model card when pushing to the model.

Add encoding when saving card

Seems there were issues when saving the card related to encoding (and perhaps related/unrelated issue about saving cards on Windows). Let's add encoding when saving the card to see if that fixes the issue.

Related convo here: skops-dev/skops#22
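A minimal sketch of the proposed fix, assuming the card is written through `pathlib`. Without an explicit `encoding`, Python falls back to the platform's preferred encoding, which on Windows is often a legacy code page that cannot represent characters such as emoji. `save_card` is a hypothetical helper, not the package's `save` method.

```python
from pathlib import Path
import tempfile


def save_card(filepath, content):
    """Write card text as UTF-8 regardless of the platform's default encoding."""
    Path(filepath).write_text(content, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "README.md"
    save_card(target, "# my-cool-model ❤️\n")
    # Read back with the same encoding to confirm the round trip.
    round_trip = target.read_text(encoding="utf-8")
```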

Handle invalid model-index instead of throwing data away

Right now, if model-index is not formatted correctly in the remote repo when using ModelCard.load(repo_id), it will tell you the model-index is invalid and that we're not loading eval results.

For context, under the hood modelcards.CardData uses a list of EvalResult to represent the model index, which is stored at CardData.eval_results. When you turn card data to dict, it doesn't return the list of EvalResult, but instead formats it to a valid model index and assigns it to the model-index key of the metadata dict.

The code can be seen here:

modelcards/modelcards/card_data.py

Lines 151 to 167 in a45b85b

def to_dict(self):
    """Converts CardData to a dict. It also formats the internal eval_results to
    be compatible with the model-index format.

    Returns:
        dict: CardData represented as a dictionary ready to be dumped to a YAML
            block for inclusion in a README.md file.
    """

    data_dict = copy.deepcopy(self.__dict__)
    if self.eval_results is not None:
        data_dict["model-index"] = eval_results_to_model_index(
            self.model_name, self.eval_results
        )
        del data_dict["eval_results"], data_dict["model_name"]

    return _remove_none(data_dict)

So, maybe in the case that the found 'model-index' is invalid, we assign it to card_data['model-index'] so it can still be accessed. If the user tries to update the eval_results, we'll use that, but if not, the previous model-index will still be intact.
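The idea can be sketched roughly as follows. Both functions are hypothetical and heavily simplified: `parse_model_index` only checks that each entry has a name and a list of results, which is much looser than real model-index validation, and `load_metadata` shows the fallback of keeping the raw value when parsing fails.

```python
import warnings


def parse_model_index(model_index):
    """Simplified check: a non-empty list of dicts with 'name' and a 'results' list."""
    if not isinstance(model_index, list) or not model_index:
        raise ValueError("model-index must be a non-empty list")
    results = []
    for entry in model_index:
        valid = (
            isinstance(entry, dict)
            and "name" in entry
            and isinstance(entry.get("results"), list)
        )
        if not valid:
            raise ValueError(f"invalid model-index entry: {entry!r}")
        results.extend(entry["results"])
    return model_index[0]["name"], results


def load_metadata(data_dict):
    """Split metadata into (model_name, eval_results, remaining fields)."""
    data_dict = dict(data_dict)  # work on a copy
    model_index = data_dict.pop("model-index", None)
    if model_index is None:
        return None, None, data_dict
    try:
        model_name, eval_results = parse_model_index(model_index)
    except ValueError as err:
        warnings.warn(f"Keeping unparsed model-index as-is: {err}")
        data_dict["model-index"] = model_index  # preserved, not discarded
        return None, None, data_dict
    return model_name, eval_results, data_dict


broken = {"license": "mit", "model-index": [{"name": "rare-puppers"}]}  # no results
name, results, rest = load_metadata(broken)
```

In this sketch the invalid block stays available under `rest["model-index"]`, so saving the card again would write it back unchanged unless the user sets new eval results.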

Model card template for NLP models with corresponding auto-fill texts for different variables

Template for NLP models with corresponding auto-fill texts for different variables: NLP_modelcard_new_spec.md

Explanation of template variables and expected values: nlp_modelcard_template_vars.md

Examples of generating model cards using the modelcards repo and this template:

from modelcards import ModelCard, CardData

repo_id = "" # Place your repo id here

# Set the metadata that will go at the top of the card
card_data = CardData(
    license='mit',
    language=['en', 'fr', 'multilingual'],
    )

# Model card for a text generation model
card1 = ModelCard.from_template(
    card_data, 
    template_path='NLP_modelcard_new_spec.md', 
    model_id=repo_id.split('/')[-1], 
    model_task="text_generation", 
    authors=['person1', 'person2', 'person3'],
    model_summary="This is a placeholder summary.",
    related_models=['fake_model1', 'fake_model2'],
    blog_link="https://huggingface.co", 
    paper_link="https://huggingface.co",
    model_card_user="policymaker")

# Model card with no task specified, where direct use and downstream use are specified
card2 = ModelCard.from_template(
    card_data, 
    template_path='NLP_modelcard_new_spec.md', 
    model_id=repo_id.split('/')[-1], 
    authors=['person1', 'person2', 'person3'],
    model_summary="This is a placeholder summary.",
    related_models=['fake_model1', 'fake_model2'],
    blog_link="https://huggingface.co", 
    paper_link="https://huggingface.co",
    direct_use="This model can be used to do some sort of task.",
    downstream_use="This model can be used downstream as part of some system.")

# if you want to push it to the hub
card1.push_to_hub( # change to card 1 or 2 accordingly
    repo_id,
    commit_message="initial model card",
    create_pr=True)

Allow returning specific model card sections

It would be really great to be able to go:

card.description

and get only the model description! 🤗
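A rough sketch of how section access could work, assuming sections are identified by markdown headings in the card text. `split_sections` is a hypothetical helper: it returns a flat mapping that ignores heading levels, and it would be confused by `#` lines inside code blocks, so a real implementation would likely need a proper markdown parser.

```python
import re

# A heading is 1-6 '#' characters at the start of a line, followed by a title.
_HEADING = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def split_sections(markdown_text):
    """Map each heading title to the text between it and the next heading."""
    matches = list(_HEADING.finditer(markdown_text))
    sections = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(markdown_text)
        sections[current.group(2)] = markdown_text[current.end():end].strip()
    return sections


card_text = (
    "# my-model\n\n"
    "## Model description\n\nA ResNet.\n\n"
    "## Limitations\n\nNone known.\n"
)
sections = split_sections(card_text)
print(sections["Model description"])  # A ResNet.
```

An attribute such as `card.description` could then be a thin property over a lookup like `sections["Model description"]`, with the heading title depending on the template in use.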

Support Evaluation Metrics in Card Data

We should be able to add/update model card evaluation metrics in the modelcard metadata (the stuff that is in between the ---s). These updates/additions to model card evaluation data should be validated to make sure no problems occur when pushing to the hub (RE: #2 ).
