huggingface / hub-docs
Docs of the Hugging Face Hub
Home Page: http://hf.co/docs/hub
License: Apache License 2.0
Hi,
First of all, I hope I'm opening this issue in the right place (I hesitated to post in this other repo, but it seems less active, with only a handful of issues in the past year: https://github.com/huggingface/model_card).
I'll illustrate my observation by talking about French models, but the logic applies to any language.
I found by chance the following model on the hub: https://huggingface.co/dbmdz/electra-base-french-europeana-cased-generator
An ELECTRA model for French, released more than a year ago, and I had never heard of it? How is that possible?
I realized it was simply because the model was not indexed correctly (it has no "fr" tag). This probably explains why it was downloaded only 12 times last month, which I think is a shame.
So I did a little more research to see whether I had missed any other unindexed French models, and here is the list I came up with:
This represents 24 models. Relative to what the "fr" filter reports (https://huggingface.co/models?language=fr), that is about 7.4% (24/(24+300)) of the French models that are not indexed.
So I think it would be important to improve this indexing.
I have two ideas to submit:
A slightly different but related topic is multilingual models. Should multilingual models be tagged with all the languages they cover, or not?
This convention has been adopted for the Helsinki-NLP models (an example: https://huggingface.co/Helsinki-NLP/opus-mt-af-fr is tagged with both "af" and "fr").
But this is not the case for the Geotrend models (an example: https://huggingface.co/Geotrend/bert-base-en-fr-cased is tagged with neither "en" nor "fr") or for the T-Systems ones (an example: https://huggingface.co/T-Systems-onsite/cross-en-fr-roberta-sentence-transformer is tagged with neither "en" nor "fr").
I haven't checked with datasets, but I guess the problem must apply there too. So I think this would be a point to harmonize.
Have a nice day :)
(I noticed recently that you changed the language filtering GET parameter from `filter` to `language`, but unfortunately my problem still persists.)
If I search for Danish models, then models like this also pop up. It is not a Danish model, but it has the `da` tag (here standing for "direct assessment"). This could be fixed by filtering on the `tag-green` class rather than by the general `tag` filtering that I guess is done currently.
Thanks!
Would have as input
Would output:
This should be similar in implementation to the image classification widget.
Is your feature request related to a problem? Please describe.
I tried to upload a 13GB dataset file (for this repo), and after waiting a couple of hours for it to upload, it gave a "Payload too large" error.
Describe the solution you'd like
I'd like to be able to upload large dataset files using the web UI.
Describe alternatives you've considered
I'm guessing there's some sort of command line tool for cases like this, but I'd prefer to not have to install something just to upload a file - it's bad UX. There's no fundamental technical reason why the web UI can't handle large files, so it doesn't seem like a good idea to put limits like this in place.
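(For what it's worth, here is a minimal sketch of the library route, assuming `huggingface_hub`'s `upload_file`; whether it copes with a 13GB file is a separate question, and the repo and file names below are made up.)

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="data/train.jsonl",  # the large local file
    path_in_repo="train.jsonl",
    repo_id="my-username/my-dataset",
    repo_type="dataset",
)
```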
Additional info
If the team is for some reason adamant about not allowing upload of large files via the web UI, then at the very least it would be good to stop the user with an error when they pick the file, rather than after the upload is complete. I.e. check the size of the blob with JS before uploading rather than uploading and checking size on the server.
If it's actually already possible to upload large files via the web UI (and I've just misunderstood the process), then please consider this issue a request for better UX in guiding the user toward doing that. After creating the dataset repo I just clicked across to the "files and versions" tab, and then clicked "Add file > Upload file".
Thanks!
Is your feature request related to a problem? Please describe.
I am trying to share our hate speech measurement model, which predicts a continuous measure of hate speech severity. So it is doing text regression rather than classification, but the input is still just text that gets tokenized. Could you clarify whether I need to create a custom pipeline for this, or how else to proceed? I have one of our models uploaded for TF at https://huggingface.co/ucberkeley-dlab/hate-measure-roberta-large
The architecture is just a Transformer backbone and a 1D global average pooling layer, followed by a linear output node.
Describe the solution you'd like
I would like to support the hosted inference API so that individuals can get a continuous score prediction out of our model and for the demo widget to work.
Describe alternatives you've considered
I'm not clear on whether I need to create a new TextRegression task for HuggingFace, which feels like it would be a large lift, or whether there is an easier way to do this.
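(For reference, here is one way a custom pipeline could be sketched, by subclassing `transformers.Pipeline`. This is a hypothetical illustration on my part, not an existing transformers task, and it assumes the model returns a single continuous logit per input.)

```python
from transformers import Pipeline

class TextRegressionPipeline(Pipeline):
    """Hypothetical sketch: return a continuous score for a text input."""

    def _sanitize_parameters(self, **kwargs):
        # No extra parameters in this sketch.
        return {}, {}, {}

    def preprocess(self, text):
        return self.tokenizer(text, return_tensors=self.framework)

    def _forward(self, model_inputs):
        return self.model(**model_inputs)

    def postprocess(self, model_outputs):
        # Assumes one regression output per example.
        return {"score": float(model_outputs.logits[0][0])}
```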
Additional context
The preprint for our work is at https://arxiv.org/abs/2009.10277 and this is to allow the model to be used in the Jigsaw Toxic Severity Kaggle competition: https://www.kaggle.com/c/jigsaw-toxic-severity-rating/overview
I'm currently adding LayoutLMv2 and LayoutXLM to HuggingFace Transformers. These models, built by Microsoft, have impressive capabilities for understanding document images (scanned documents, such as PDFs). LayoutLM and its successor LayoutLMv2 are extensions of BERT that incorporate layout and visual information in addition to text. LayoutXLM is a multilingual version of LayoutLMv2.
It would be really cool to have inference widgets for the following tasks:
Document image understanding (also called form understanding) means understanding all pieces of information of a document image. Example datasets here are FUNSD, CORD, SROIE and Kleister-NDA.
The input is a document image:
The output should be the same image, but with colored bounding boxes indicating, for example, which parts of the image are questions (blue), which are answers (green), which are headers (orange), etc.
LayoutLMv2 solves this as a NER problem, using `LayoutLMv2ForTokenClassification`. First, an OCR engine is run on the image to get a list of words + corresponding coordinates. These are then tokenized and, together with the image, sent through the LayoutLMv2 model. The model then labels each token using its classification head.
Document visual question answering means, given an image + question, generate (or extract) an answer. For example, for the PDF document above, a question could be "what's the date at which this document was sent?", and the answer is "January 11, 1999".
An example dataset here is DocVQA, on which LayoutLMv2 obtains SOTA performance (who might have guessed).
LayoutLMv2 solves this as an extractive question answering problem, similar to SQuAD. I've defined a `LayoutLMv2ForQuestionAnswering`, which predicts the `start_positions` and `end_positions`.
Document image classification is fairly simple: given a document image, classify it (e.g. invoice/form/letter/email/etc.). An example dataset here is [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/). For this, I have defined a `LayoutLMv2ForSequenceClassification`, which just places a linear layer on top of the model in order to classify documents.
I don't think we can leverage the existing 'token-classification', 'question-answering' and 'image-classification' pipelines, as the inputs are quite different (document images instead of text). To ease the development of new pipelines, I have implemented a new `LayoutLMv2Processor`, which takes care of all the preprocessing required for LayoutLMv2. It combines a `LayoutLMv2FeatureExtractor` (for the image modality) and a `LayoutLMv2Tokenizer` (for the text modality). I would also argue that if we add other models in the future, they should all implement a processor that takes care of all the preprocessing (and possibly postprocessing). Processors are ideal for multi-modal models (they have previously been defined for CLIP and Wav2Vec2).
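(As a small illustration of the processor flow just described; this sketch assumes the public microsoft/layoutlmv2-base-uncased checkpoint, a local document image, and pytesseract/detectron2 being installed.)

```python
from PIL import Image
from transformers import LayoutLMv2ForTokenClassification, LayoutLMv2Processor

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
# Note: the token-classification head below is randomly initialized; in
# practice one would load a fine-tuned checkpoint.
model = LayoutLMv2ForTokenClassification.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("document.png").convert("RGB")
# The processor runs OCR (words + boxes) and tokenizes everything in one call.
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)
predicted_labels = outputs.logits.argmax(-1)
```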
Note that you're not expected to do all of the following steps. This PR helps track all the steps required to get a new task fully supported in the Hub 🔥
- `transformers` pipeline: huggingface/transformers#13828
- Integration guide: https://hf.co/docs/hub/adding-a-task
When I go to a repo like this one I can click the "use in transformers" button. This makes me think "grand! that'll be a quickstart!". It shows me this code:
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("oliverguhr/fullstop-punctuation-multilang-large")
model = AutoModelForTokenClassification.from_pretrained("oliverguhr/fullstop-punctuation-multilang-large")
This is a bit of a bummer. It tells me how to set up a tokenizer and a model, but it doesn't tell me at all how to use the model. It'd be much more pragmatic if it instead showed this code:
from transformers import pipeline
pipe = pipeline("token-classification", "oliverguhr/fullstop-punctuation-multilang-large")
pipe(["this is an example that you can actually run"])
This way, I don't need to search the docs for the type of pipeline that matches the model I'm dealing with, and I immediately have something that works in my notebook. Wouldn't it be better to generate the pipeline code in the docs?
Potentially you could still render the model/tokenizer setup as well, but it doesn't feel like part of the "getting started journey".
Naming is tentative
Note that you're not expected to do all of the following steps. This PR helps track all the steps required to get a new task fully supported in the Hub 🔥
- `transformers` pipeline
- Integration guide: https://hf.co/docs/hub/adding-a-task
We can do this with gensim or fasttext models for example for obtaining closest words with nearest neighbors. Example repo: https://huggingface.co/Hellisotherpeople/debate2vec
CC @mishig25 in case you would like to help with the widget; this is pretty much the same as the `text-classification` widget.
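(For context, a rough sketch of the kind of nearest-neighbor lookup such a widget would surface, assuming gensim's `KeyedVectors` API; the vectors filename below is made up.)

```python
from gensim.models import KeyedVectors

# Load word2vec-format vectors and query the closest words.
vectors = KeyedVectors.load_word2vec_format("debate2vec.vec")
print(vectors.most_similar("education", topn=5))
```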
Hey guys, huge fan of you and what you've done to make all our lives easier.
I think it would be really cool if the site https://huggingface.co/ had the option to favorite a model or dataset if you're a user.
Would have as input
Would output
This should be similar in implementation to the image classification widget.
Is your feature request related to a problem? Please describe.
I got this error:
Traceback (most recent call last):
  File "app.py", line 1, in <module>
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
ModuleNotFoundError: No module named 'transformers'
when I pushed this version to Spaces: https://huggingface.co/spaces/ttj/t0-generation/commit/efc071478da5be7f6a369b034ecda4844ed3ad22
Describe the solution you'd like
Pre-install common libraries.
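(As a stopgap, and assuming on my side that Spaces installs the Python dependencies listed in a requirements.txt at the root of the repo, declaring the missing libraries there should avoid this particular error:)
transformers
torch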
Hello,
I think this is a UX improvement and would give these pages better visibility. I would like a redirect (or link) from the models page to the task page, so that when people filter for a task, they can also get better information about it (and these resources would be more visible).
cc: @osanseviero @gary149 @beurkinger
https://ai.googleblog.com/2021/01/totto-controlled-table-to-text.html
as suggested by @mrm8488
Should be decently easy once huggingface/huggingface_hub#87 is merged
We are having more non-NLP tasks supported in the Inference API, but there are no code snippets at https://huggingface.co/superb/hubert-large-superb-er even though there are some at https://api-inference.huggingface.co/docs/python/html/detailed_parameters.html#audio-classification-task
This probably requires some internal changes, plus changing https://github.com/huggingface/huggingface_hub/tree/main/widgets/src/lib/inferenceSnippets, since it always expects the same type of input.
cc @mishig25
Would have as inputs:
as suggested by @patil-suraj (for CLIP?)
Should be decently easy once huggingface/huggingface_hub#87 is merged
Note that you're not expected to do all of the following steps. This PR helps track all the steps required to get a new task fully supported in the Hub 🔥
- `transformers` pipeline: https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines/object_detection.py
- Integration guide: https://hf.co/docs/hub/adding-a-task
Is your feature request related to a problem? Please describe.
As discussed on Slack, it would be nice if we added a direct link on the hub to the documentation of a model (e.g. linking to https://huggingface.co/transformers/model_doc/bert.html for bert-base-uncased). This can probably be done based on the `config.json` of a model.
Describe the solution you'd like
A button could be added either next to "Deploy" and "Use in Transformers", or a link could be added within the "Use in Transformers" button.
cc @julien-c
As we do huggingface/huggingface_hub#744 and add library documentation based on docstrings, the hub docs might split into `huggingface_hub` and product usage.
Based on the existing content, what we can do is more self-contained use cases. An initial mental model, without creating additional content, would be:
Move to `huggingface_hub` as guides
Then, on the Hub:
- Model card
- Repositories
- CO2 Emissions
- Widgets
- Inference API
- Security
- Endpoints
- Adding a new task
- Integrating a new library (with parts of it linking to `huggingface_hub`)
This would also include Spaces, which can very likely be broken into more pieces.
WDYT @julien-c @LysandreJik @adrinjalali @muellerzr of this as a first step once we kick off the splitting?
In v0.9 of `flair`, a new zero-shot sequence labeling feature was included, which seems (a) very exciting and (b) very useful for real-world NER applications where acquiring labeled data is expensive.
It would be cool to include support for this in the Hugging Face Hub, possibly as a companion widget to the existing one for zero-shot classification.
Here's a link to the paper behind the TARS technique.
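(For concreteness, here is roughly what the flair v0.9 zero-shot tagging API looks like; this is a sketch, and the label set below is made up.)

```python
from flair.data import Sentence
from flair.models import TARSTagger

# Load the pretrained TARS tagger and define an ad-hoc label set.
tagger = TARSTagger.load("tars-ner")
tagger.add_and_switch_to_new_task(
    task_name="zero-shot-demo",
    label_dictionary=["person", "location"],
    label_type="ner",
)

sentence = Sentence("George Washington went to Washington.")
tagger.predict(sentence)
print(sentence.to_tagged_string())
```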
cc @osanseviero
Note that you're not expected to do all of the following steps. This issue helps track all the steps required to get a new task fully supported in the Hub 🔥
- `transformers` pipeline
- Integration guide: https://hf.co/docs/hub/adding-a-task
Note that you're not expected to do all of the following steps. This PR helps track all the steps required to get a new task fully supported in the Hub 🔥
- `transformers` pipeline
- Integration guide: https://hf.co/docs/hub/adding-a-task
What do you think about paying some percentage to the developers/owners of models if they get used through the Inference API?
It could make uploading more amazing models more attractive if it were possible to generate some passive income.
It would be like RapidAPI for NLP 🤗
Is your feature request related to a problem? Please describe.
It is hard to tell how big a model is without going to the repo and seeing the size of the weights file.
Describe the solution you'd like
It might prove useful to be able to sort models by size or number of parameters, especially for people with more modest compute budgets. Perhaps even display that information alongside a model's popularity.
Describe alternatives you've considered
Alternatives would be prior knowledge of the distilled/small model landscape. This feature might also help people discover new models they hadn't heard of by enabling filtering queries like "most popular and smallest model trained using MLM on wikitext".
Additional context
The datasets tab has a similar feature with the `Size` category filter.
Note that you're not expected to do all of the following steps. This PR helps track all the steps required to get a new task fully supported in the Hub 🔥
- `transformers` pipeline
- Integration guide: https://hf.co/docs/hub/adding-a-task
Example model: https://arxiv.org/pdf/2111.05610v1.pdf
cc @nateraw
Note that you're not expected to do all of the following steps. This PR helps track all the steps required to get a new task fully supported in the Hub 🔥
- `transformers` pipeline
- Integration guide: https://hf.co/docs/hub/adding-a-task
This is a request for a widget/inference API on the hub for multimodal models on VQA tasks:
Would have input as
This should be used for VisualBERT/LXMERT models, and might need Detectron or something similar to the Faster R-CNN model here: https://github.com/huggingface/transformers/tree/master/examples/research_projects/lxmert
Hello, I am trying to create a new Streamlit app on Hugging Face that uses PySpark.
I created an app.py file and a requirements.txt file, and a basic Streamlit app without PySpark ran seamlessly.
However, the problem starts when I add the line `spark = SparkSession.builder.appName("appName").getOrCreate()` to start the Spark session; I get the following error:
Exception: Java gateway process exited before sending its port number
The content of requirements.txt file:
streamlit
pyspark==3.1.2
spark-nlp==3.3.2
If anyone who has worked with PySpark on HF can help me, I would appreciate it!
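(In case it helps, and assuming on my part that this error means no Java runtime was found and that Spaces installs the Debian packages listed in a packages.txt file at the root of the repo, adding a JDK there may resolve it:)
default-jdk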
I would like to start documenting good practices of model repos to add to our documentation.
Some come to mind rather quickly
How do we want to encourage users to have multiple checkpoints in a single repo? There was a related discussion for GPT-J and for other contributions.
My suggestion
I'm just gathering ideas so any are welcome!
cc @patrickvonplaten @julien-c @LysandreJik @lewtun @NielsRogge I hope I did not forget anyone
It would be useful to be able to generate a Digital Object Identifier (DOI) for artifacts living on the Hub.
It would let people cite a specific dataset or a specific model, and make their own datasets and models citable. Some venues require DOIs for digital resources, and it would be nice not to have to use third parties for that.
Kaggle, for instance, currently has that feature for public datasets: https://www.kaggle.com/product-feedback/108594
Drawback: It costs money, because one has to go through approved agencies (e.g.: https://datacite.org/feemodel.html)
Automation would probably be straightforward: Creating DOIs with the Datacite REST API
Would have as inputs:
as suggested by @patil-suraj (for CLIP?)
Should be decently easy once huggingface/huggingface_hub#87 is merged
Should we show the same sentence, or say that no tokens were found?
Note that you're not expected to do all of the following steps. This PR helps track all the steps required to get a new task fully supported in the Hub 🔥
- `transformers` pipeline
- Integration guide: https://hf.co/docs/hub/adding-a-task
When I call the hosted text-generation API, the request fails if I set the `temperature` parameter to an integer value of `1` instead of a float value of `1.0`.
For example, this succeeds:
curl -i -X POST https://api-inference.huggingface.co/models/my_organization/my_model \
-H "Authorization: Bearer <<REDACTED>>" \
-H "Content-Type: application/json" \
-d \
'{
"inputs":"Once upon a time",
"parameters": {
"temperature": 1.0,
"max_new_tokens": 20
}
}'
But this request fails:
curl -i -X POST https://api-inference.huggingface.co/models/my_organization/my_model \
-H "Authorization: Bearer <<REDACTED>>" \
-H "Content-Type: application/json" \
-d \
'{
"inputs":"Once upon a time",
"parameters": {
"temperature": 1,
"max_new_tokens": 20
}
}'
...with this error message:
{"error":["value is not a valid float: `temperature` in `parameters`"]}
Both requests should succeed. The only difference between the two requests is the numeric formatting of the `temperature` parameter (with or without a decimal point and trailing zero).
Note that you're not expected to do all of the following steps. This issue helps track all the steps required to get a new task fully supported in the Hub 🔥
- `transformers` pipeline
- Integration guide: https://hf.co/docs/hub/adding-a-task
Would have as inputs:
as suggested by @patil-suraj (for CLIP?)
Should be decently easy once huggingface/huggingface_hub#87 is merged
For the DETR model, which will soon be part of HuggingFace Transformers (see huggingface/transformers#11653 (comment)), it would be cool to have object detection and image segmentation (actually panoptic segmentation) inference widgets.
Similar to the image classification widget, a user should be able to upload/drag an image to the widget, which is then annotated with bounding boxes and classes (in case of object detection), or turned into a segmentation map (in case of panoptic segmentation).
Here are 2 notebooks which illustrate what you can do with the head models of DETR:
- `DetrForObjectDetection`: https://colab.research.google.com/drive/170dlGN5s37uaYO32XKUHfPklGS8oB059?usp=sharing
- `DetrForSegmentation`: https://colab.research.google.com/drive/1hTGTPGBLPRY1QkLmG7P9air6v04tcXUL?usp=sharing
The models are already on the hub: https://huggingface.co/models?search=facebook/detr
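(As a sketch of what such a widget backend could call, assuming the `object-detection` pipeline that was being added to transformers around this time; the image path below is made up.)

```python
from transformers import pipeline

# Run DETR through the object-detection pipeline; the output contains the
# labels, scores, and bounding boxes the widget would need to draw.
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
results = detector("cats.jpg")  # local path, URL, or PIL image
for result in results:
    print(result["label"], result["score"], result["box"])
```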
cc @LysandreJik
When a user selects a specific task on the Hugging Face Hub (for example, `image-to-text`):
That user is shown a series of models, with no guidance as to which model might be state of the art, or which might be the most performant for their use case.
To test the capabilities and behavior of each model, the user must find a Space or a Colab notebook, if one is available (not every model has one).
The user should be able to evaluate and compare models directly for a given task (e.g. `image-to-text`).
Note that you're not expected to do all of the following steps. This PR helps track all the steps required to get a new task fully supported in the Hub 🔥
- `transformers` pipeline
- Integration guide: https://hf.co/docs/hub/adding-a-task
@osanseviero thank you for looking into this issue.
Simply put, we built an API that receives questions or documents from a user through a web application on the front end, and on the back end downloads ML models from huggingface.co in order to answer those questions or process those documents, and to encode new documents for search.
In order to connect to huggingface.co to download models, we require an SSL certificate. Without the cert, the following error comes up:
The current fix requires that I download the certificate chain manually through my browser, as follows:
Note that the Hugging Face SSL certificate expires very soon!
To install the certificate, I copy it over to my container in my Dockerfile and add the following lines of code to the application:
```python
import requests
import certifi

try:
    print('Checking connection to Huggingface...')
    test = requests.get('https://huggingface.co')
    print('Connection to Huggingface OK.')
except requests.exceptions.SSLError as err:
    print('SSL Error. Adding custom certs to Certifi store...')
    # Append the manually downloaded certificate chain to certifi's CA bundle.
    cafile = certifi.where()
    with open('huggingface-co-chain.pem', 'rb') as infile:
        customca = infile.read()
    with open(cafile, 'ab') as outfile:
        outfile.write(customca)
    print('That might have worked.')
```
The main problem with this fix is that the certificate I download is only valid for a short period of time (one or two weeks).
- Ideally, we should be able to do this on the command line as part of the container build, but so far efforts to do so have not worked.
- The following command yields a chain of three certificates, while the ones downloaded from the browser have a chain of four:
openssl s_client -showcerts -verify 5 -connect huggingface.co:443 < /dev/null
- The missing certificate appears to be the zScaler root CA, which shows up in the browser but not on the command line.
Is your feature request related to a problem? Please describe.
In newer versions of huggingface_hub, text inputs and outputs left-justify their text, but we could add an attribute to automatically detect and adjust for right-to-left language text.
Describe the solution you'd like
Recommended:
- The `<form>` or the `<span ... role="textbox">` element should have the HTML attribute `dir="auto"` (no CSS equivalent).
- The `<p ... class="alert alert-success">` element should have `dir="auto"`.
Less critical:
- Select `.prose > {h1, h2, h3, h4, p}` and apply `dir="auto"`; the selection is so we don't mess with code/pre/table elements.
- The `<div>` around the bar chart could also have `dir="auto"`.
- Adding `direction: rtl` CSS to text inputs and outputs.
Is your feature request related to a problem? Please describe.
Currently, the following data fields on the hub are only displayed in MB:
- Size of the generated dataset:
- Size of downloaded dataset files:
- Total amount of disk used:
Figures like 1895.01 MB and 1611.50 MB become unwieldy to reason about as they grow, compared to 1.89501 GB or 1.6115 GB.
Describe the solution you'd like
Convert numbers to GB in dataset cards when MB > 1000.
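(A minimal sketch of this conversion rule; the helper name is hypothetical.)

```python
def format_size(size_mb: float) -> str:
    """Display a size in MB, switching to GB above 1000 MB."""
    if size_mb > 1000:
        return f"{size_mb / 1000:.5f} GB"
    return f"{size_mb:.2f} MB"

print(format_size(1895.01))  # 1.89501 GB
print(format_size(611.50))   # 611.50 MB
```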
Describe alternatives you've considered
I considered advocating truncating the size to 3 decimal places, as precision to 5 or 6 decimal places in GB (a number like 1.543210 GB) may be an unnecessary degree of precision for users.
Ultimately, though, I reasoned that more precision is often better.
Additional context
I'm happy to contribute to this. I didn't see exactly where this was handled in the current codebase, so any pointers appreciated. Also, let me know if I should be opening this up in the datasets repo instead...
Using the `huggingface_hub` library, I was able to collect some statistics on the 9,984 models that are currently hosted on the Hub. The main goal of this exercise was to find answers to the following questions:
A model with the `BertForSequenceClassification` architecture is likely to be about text classification; similarly for the other `ModelNameForXxx` architectures.
Without applying any filters on the architecture names, the number of models per criterion is shown in the table below:
| Has architecture | Has dataset | Has metric | Number of models |
|---|---|---|---|
| ✅ | ❌ | ❌ | 8129 |
| ✅ | ✅ | ❌ | 1241 |
| ✅ | ✅ | ✅ | 359 |
These numbers include models for which a task may not be easily inferred from the architecture alone. For example, `BertModel` would presumably be associated with a `feature-extraction` task, but these are not simple to evaluate.
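(For what it's worth, here is a rough sketch of how such counts can be collected, assuming `huggingface_hub`'s `list_models` with `fetch_config=True`; this is illustrative, not the exact script behind the numbers above.)

```python
from collections import Counter
from huggingface_hub import HfApi

# Count architecture names across Hub models by fetching each model's config.
api = HfApi()
architecture_counts = Counter()
for model in api.list_models(fetch_config=True):
    config = getattr(model, "config", None) or {}
    for arch in config.get("architectures") or []:
        architecture_counts[arch] += 1

print(architecture_counts.most_common(10))
```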
By applying a filter on the architecture name to contain any of "For", "MarianMTModel" (translation), or "LMHeadModel" (language modelling), we arrive at the following table:
| Has task | Has dataset | Has metric | Number of models |
|---|---|---|---|
| ✅ | ❌ | ❌ | 7452 |
| ✅ | ✅ | ❌ | 1150 |
| ✅ | ✅ | ✅ | 337 |
Some models either have no architecture (e.g. the info is missing from the `config.json` file, or the model belongs to another library like Flair), or have multiple ones:
| Number of architectures | Number of models |
|---|---|
| 0 | 1755 |
| 1 | 8125 |
| 2 | 1 |
| 3 | 3 |
Based on these counts, it thus makes sense to just focus on models with a single architecture.
For models with a single architecture, I extract the task names from the architecture name according to the following mappings:
The resulting frequency counts are shown below:
LanguageModeling 3250
Translation 1354
SequenceClassification 829
ConditionalGeneration 766
Model 655
QuestionAnswering 364
CTC 318
TokenClassification 286
PreTraining 163
MultipleChoice 37
MultiLabelSequenceClassification 17
ImageClassification 15
MultiLabelClassification 11
Generation 7
ImageClassificationWithTeacher 4
We can visualise which tasks are connected to which datasets as a graph. Here we show the top 10 tasks (measured by node connectivity), with the top 20 datasets marked in orange.
👋 Greetings! I'm not sure if this is the correct place to file an issue for huggingface.co, but I figured I'd try anyway. :)
Is your feature request related to a problem? Please describe.
Would it be possible to include models that are mobile-compatible as a filterable category in the `Other` section on the Hugging Face website? Meaning `.tflite`, `.ptl` for PyTorch, or Core ML models for iOS devices.
In competitors' hubs, models of varying sizes are colocated on a single page (example below), so you might see the base example, a Colab notebook to test out the model interactively, and a TF Lite implementation. You can also sort and filter based on model type, as well as framework version and fine-tunability (though Hugging Face's models would win that contest every time 😄).
Note that you're not expected to do all of the following steps. This PR helps track all the steps required to get a new task fully supported in the Hub 🔥
- `transformers` pipeline
- Integration guide: https://hf.co/docs/hub/adding-a-task