dccuchile / beto
BETO - Spanish version of the BERT model
License: Creative Commons Attribution 4.0 International
Could you please provide the script used to run the GLUES tasks?
Thank you!
I'd like to run English-to-Spanish translation. I followed the Transformers guide, and as I understand it I need a custom pipeline for inference.
Are there any resources that might help with that?
First of all, thanks for working on such an impressive Spanish pre-trained model, it means a lot to the Spanish NLP world!
I've been playing with BETO for a while; could you please provide the license for the BETO model, so that I can tell whether it can be used for commercial purposes?
Thank you!
Hi,
When I try to use the transformers pipelines, the 'fill-mask' case works fine:
nlp_fill = pipeline('fill-mask', model="dccuchile/bert-base-spanish-wwm-cased", tokenizer="dccuchile/bert-base-spanish-wwm-cased")
nlp_fill('Mi caballo está ' + nlp_fill.tokenizer.mask_token)
[{'score': 0.3315647542476654, 'sequence': '[CLS] Mi caballo está muerto [SEP]', 'token': 3435},
 {'score': 0.055828217417001724, 'sequence': '[CLS] Mi caballo está herido [SEP]', 'token': 9403},
 {'score': 0.045170336961746216, 'sequence': '[CLS] Mi caballo está aquí [SEP]', 'token': 1401},
 {'score': 0.03713322430849075, 'sequence': '[CLS] Mi caballo está perdido [SEP]', 'token': 5036},
 {'score': 0.036504991352558136, 'sequence': '[CLS] Mi caballo está enfermo [SEP]', 'token': 8434}]
But in other cases, such as 'sentiment-analysis', it does not work well: it only returns 'LABEL_0' and 'LABEL_1'. I think 'LABEL_0' is positive and 'LABEL_1' negative:
nlp_token_class = pipeline('sentiment-analysis', model="dccuchile/bert-base-spanish-wwm-cased", tokenizer="dccuchile/bert-base-spanish-wwm-cased")
nlp_token_class('Que día tan malo hace')
[{'label': 'LABEL_1', 'score': 0.5785217881202698}]
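For context, the base BETO checkpoint ships only the pretrained encoder; when the pipeline builds a sequence classifier from it, the classification head is randomly initialized, and the generic 'LABEL_0'/'LABEL_1' names come from the default id2label mapping in the config. A minimal sketch (label names are hypothetical) of what a fine-tuned checkpoint's config.json would carry instead:

```python
import json

# Without fine-tuning, transformers attaches a randomly initialized
# classification head and names its outputs from the default id2label
# mapping, hence the generic 'LABEL_0'/'LABEL_1'. A fine-tuned
# checkpoint would save something like this (label names hypothetical):
config = {
    "model_type": "bert",
    "num_labels": 2,
    "id2label": {"0": "NEGATIVE", "1": "POSITIVE"},
    "label2id": {"NEGATIVE": 0, "POSITIVE": 1},
}
print(json.dumps(config["id2label"], indent=2))
```

Until such a head is trained, the scores above are essentially arbitrary, so neither label can reliably be read as positive or negative.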
Hi,
We recently released our RoBERTa-base and RoBERTa-large models, and we compared them against your BETO. The paper states the following:
Our model has 12 self-attention layers with 16 attention-heads each (Vaswani et al., 2017), using 1024 as hidden size. In total our model has 110M parameters.
However, the configuration files (BETO cased and BETO uncased) of the models uploaded to the Hugging Face Hub do not contain those hyperparameters: the hidden size is 768 instead of 1024, and there are 12 attention heads instead of 16. Otherwise the stated parameter count (110M) would not fit.
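A back-of-the-envelope parameter count supports this: assuming a vocabulary of roughly 31k WordPieces (my assumption for BETO), 110M parameters only works out for a hidden size of 768 with 12 layers, not 1024.

```python
def bert_param_count(vocab, hidden, layers, intermediate, max_pos=512, types=2):
    """Approximate BERT encoder parameter count (embeddings + layers,
    ignoring the small LayerNorm and pooler terms)."""
    emb = (vocab + max_pos + types) * hidden
    attn = 4 * (hidden * hidden + hidden)  # Q, K, V, output projection
    ffn = hidden * intermediate + intermediate + intermediate * hidden + hidden
    return emb + layers * (attn + ffn)

base = bert_param_count(31000, 768, 12, 3072)          # hidden size 768
wider = bert_param_count(31000, 1024, 12, 4096)        # hidden size 1024
print(f"{base/1e6:.0f}M vs {wider/1e6:.0f}M")          # -> 109M vs 183M
```

So the ~110M figure matches the 768/12-head configuration in the Hub files, and the paper's "1024 hidden size, 16 heads" description appears to be the typo.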
Hi! I'm writing to ask for your opinion on a use of BETO.
The short version up front: does fine-tuning make sense in this context?
I'm building a recommender system for citizen proposals. The situation is that I have a database of about 700 proposals written by citizens during the 2017 elections. These proposals have a title and a description, which I vectorize using BETO, and I build a system to retrieve the closest proposals.
Greetings, and many thanks for this.
Hi! I wanted to run the Part-of-Speech test on my machine. I see you used the Hugging Face Transformers library, but I can't find how to use it for the POS task. Could you share the code with me?
I would also like to use the same dataset you did. I saw you listed the Spanish Universal Dependencies dataset, but since there are three different datasets, I'm not sure which one you used.
Thanks!
Regards, Pedro
Hi there! First of all, thank you very much for developing a Spanish BERT model; the community was waiting for this and we really appreciate it. It would be very helpful if you published the hyperparameters you used to get the results shown in the README.md. Is there any paper or blog post you've published where you describe the strategy followed to train the model?
Thank you very much in advance :)
I have designed a BERT model for NER based on your model, but I would like to know how to recover the values behind the [UNK] tokens, i.e. obtain the actual text in the sentence for a token carrying this label.
This is because sometimes the detected entities are assigned this label, and then I cannot tell what their actual text is.
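One common approach (not something from this repo) is to use a fast tokenizer's character offsets (return_offsets_mapping=True) to map any [UNK] back to the raw substring it covers. A self-contained sketch, with hand-made tokens and offsets standing in for real tokenizer output:

```python
def recover_unk(sentence, tokens, offsets):
    """Given tokens and their (start, end) character offsets in the
    original sentence (as returned by a fast tokenizer with
    return_offsets_mapping=True), replace each [UNK] with the raw
    text it covers."""
    return [sentence[s:e] if tok == "[UNK]" else tok
            for tok, (s, e) in zip(tokens, offsets)]

# Hypothetical tokenization where the emoji fell outside the vocabulary:
sentence = "Mi caballo 🐎 corre"
tokens = ["Mi", "caballo", "[UNK]", "corre"]
offsets = [(0, 2), (3, 10), (11, 12), (13, 18)]
print(recover_unk(sentence, tokens, offsets))  # -> ['Mi', 'caballo', '🐎', 'corre']
```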
Hi!
Could you explain the process for using BETO? I know the import_meta_graph function has to be used, but the input format required to run the model and obtain the vector is not clear. For example, if I have the sentence "quiero ir a ese banco", how do I obtain the vector for that sentence? Does the sentence need preprocessing, or is it fed in as-is?
Regards ✌️
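On the second part of the question, a common recipe (one option among several, not necessarily what the authors used) is to mean-pool the per-token output vectors into a single sentence vector, skipping the special tokens. A toy sketch with hand-made 3-dimensional vectors:

```python
def mean_pool(token_vectors, special_mask):
    """Average the per-token output vectors into one sentence vector,
    skipping special tokens like [CLS]/[SEP]. Using the [CLS] vector
    directly is the other usual choice."""
    kept = [v for v, is_special in zip(token_vectors, special_mask) if not is_special]
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

# Toy vectors for "[CLS] quiero ir a ese banco [SEP]":
vecs = [[9, 9, 9], [1, 2, 3], [3, 2, 1], [2, 2, 2], [0, 2, 4], [4, 2, 0], [9, 9, 9]]
mask = [1, 0, 0, 0, 0, 0, 1]
print(mean_pool(vecs, mask))  # -> [2.0, 2.0, 2.0]
```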
I've followed some examples in transformers with no success:
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf
tokenizer = AutoTokenizer.from_pretrained("./data/beto/")
model = TFAutoModelForQuestionAnswering.from_pretrained("./data/beto/")
ValueError: Unrecognized model in ./data/beto/. Should have a `model_type` key in its config.json
Hello there,
First of all, congratulations and thank you very much for this fantastic work!
I have been using BETO in some projects with no problems so far. Its tokenization and the resulting subwords are, for Spanish, far better than those from multilingual BERT.
But I have now noticed that I was hitting far too many UNK tokens (it usually happens when there is some unusual character).
After checking the BETO vocabulary (both cased and uncased), I have seen that it is missing characters like % (percent), / (slash), and ; (semicolon), and probably some other common punctuation marks and symbols.
Is it true that they are missing, or have I overlooked something?
I think that such characters (tokens) carry relevant information, and collapsing all of them into the UNK token may hurt the performance of the model.
As a quick fix, I have added them to my own customised version of the BETO vocabulary, using some of the unused token positions.
It would be great if you could add those missing characters/symbols to the official BETO vocabulary, including the version uploaded to the Hugging Face Transformers models directory, even if they were not seen during pretraining. That way they would be available to everybody and would have their own embeddings for fine-tuning tasks.
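The quick fix mentioned above can be sketched like this: overwrite some of the [unusedN] placeholder rows of a local copy of the vocabulary with the missing symbols. An in-memory list stands in for the real vocab.txt, which would be read and written back one token per line:

```python
# Symbols observed to be missing from the vocabulary:
missing = ["%", "/", ";"]

# Stand-in for open("vocab.txt").read().splitlines():
vocab = ["[PAD]", "[unused0]", "[unused1]", "[unused2]", "[UNK]", "hola"]

# Reassign the first placeholder slots to the missing symbols:
unused_slots = [i for i, tok in enumerate(vocab) if tok.startswith("[unused")]
for slot, symbol in zip(unused_slots, missing):
    vocab[slot] = symbol

print(vocab)  # -> ['[PAD]', '%', '/', ';', '[UNK]', 'hola']
```

Since the token ids of the placeholder slots are unchanged, the patched vocabulary stays compatible with the pretrained embedding matrix; the new symbols simply get the (untrained) embeddings of the slots they replace, which then learn during fine-tuning.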
Thank you very much for your attention.
Hi everyone,
At www.adelacu.com we are interested in evaluating BETO for an application we have in production; to that end we would like to install it and fit it to our current data. Is there anyone who could help us with this? It would, of course, be paid work.
Thanks. My contact details: [email protected], +569 7608-0532
Issue:
The Hugging Face AutoModel / AutoTokenizer classes can't detect the model type on their own.
Fix:
Would it be possible to update config.json to add the following first line for both the cased and uncased models?
{
"model_type": "bert",
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
...
}
Thx in advance
=============================
Example Before:
from transformers import BertTokenizer
BertTokenizer.from_pretrained(path)
Example Now:
from transformers import AutoTokenizer
AutoTokenizer.from_pretrained(path)  # BertTokenizer still works
==============================
This is important in order to automate model selection (e.g. getting scores from multiple models such as BETO, RoBERTa, GPT-2).
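The proposed change can be sketched as a small patch; an in-memory dict stands in for the real config.json, which would be loaded with json.load and written back:

```python
import json

# Sketch of the proposed fix: put "model_type": "bert" into a local
# copy of config.json so AutoModel/AutoTokenizer can resolve the class.
cfg = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
}  # stand-in for json.load(open("config.json"))
cfg = {"model_type": "bert", **cfg}  # key order only matters cosmetically
print(json.dumps(cfg, indent=2))
```

JSON objects are unordered, so the key works anywhere in the file; putting it first just makes it easy to spot.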
I'm not sure whether it's an issue with the models or with the implementation, but I have not been able to load the model referenced on the Hugging Face page:
%tensorflow_version 2.x
!pip install transformers
import torch
from transformers import *
# These work:
#m = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
#m = AutoModelForSequenceClassification.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
# These don't work:
m = TFBertForSequenceClassification.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
m
It raises an error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-1-fb60f3d257dd> in <module>()
9
10 #Estos no funcionan
---> 11 m = TFBertForSequenceClassification.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
12
13 m
1 frames
/usr/lib/python3.6/genericpath.py in isfile(path)
28 """Test whether a path is a regular file"""
29 try:
---> 30 st = os.stat(path)
31 except OSError:
32 return False
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
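For what it's worth, this error pattern typically means that no TensorFlow weights file (tf_model.h5) exists for that model id, so the TF class ends up with a None path. transformers can convert the PyTorch checkpoint on the fly via the real from_pt=True argument of from_pretrained; the helper below, which only inspects a local model folder, is my own hypothetical sketch:

```python
from pathlib import Path

def weight_load_kwargs(model_dir):
    """Decide how to call TFBertForSequenceClassification.from_pretrained
    for a *local* model folder: use native TF weights when present,
    otherwise fall back to converting the PyTorch checkpoint."""
    d = Path(model_dir)
    if (d / "tf_model.h5").exists():
        return {}
    if (d / "pytorch_model.bin").exists():
        return {"from_pt": True}   # convert PyTorch weights on load
    raise FileNotFoundError(f"no weight files found in {model_dir}")

# e.g.: m = TFBertForSequenceClassification.from_pretrained(path, **weight_load_kwargs(path))
```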
You can check it in Colab.
Many thanks for the contribution :)
I am currently pre-training BERT for another language and I hope to benchmark my model against mBERT, just as you did, on tasks such as POS in languages other than English. Do you have any resources you could share (code, repos, etc.) that I could use for this benchmark?
Hi! I'm using BETO to classify texts into categories.
Since BETO is a model that relies on context to improve its predictions, what cleaning is it advisable to apply to the text?
For BERT (English) the following is common:
Removing accents, removing stopwords, and lemmatizing would make no sense here.
Removing special characters would cover !"#$%&'()*+,-./:;<=>?@[\]^_{|}~. I also don't know how much removing the exclamation/question marks matters; at most I could remove everything except those.
Removing e-mails and hyperlinks seems reasonable to me, but what about digits?
I look forward to your reply; I'm open to discussion. Thanks!!
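One possible cleaning pass along those lines (an illustrative sketch, not the authors' recipe): drop URLs, e-mail addresses and digits, remove special characters except the Spanish punctuation that carries tone (¡ ! ¿ ?), and deliberately keep accents:

```python
import re

def clean_es(text: str) -> str:
    """One possible cleaning pass for Spanish text before BETO:
    remove hyperlinks, e-mail addresses and digits, then every
    special character except ¡ ! ¿ ?. Accents are deliberately kept."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # hyperlinks
    text = re.sub(r"\S+@\S+", " ", text)                # e-mail addresses
    text = re.sub(r"\d+", " ", text)                    # digits
    text = re.sub(r"[^\w\s¡!¿?]", " ", text)            # other special chars
    return re.sub(r"\s+", " ", text).strip()

print(clean_es("¡Visita www.ejemplo.com y escribe a [email protected]!"))
```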
I have a custom Spanish dataset which I want to classify into specific labels. However, the repo contains no specific instructions (apart from the weight files) on how to do that (specifically, using TF2). Can you be more specific?
Good afternoon. First of all, many thanks for the effort of training a Spanish language model; on top of that it works very well, so thank you twice over. I have been using your model for a long time, and I have noticed that since the March release it works considerably better and gives very good results. If possible, I would like to know what changes you made to the model in that release. Regarding the architecture, I saw that nothing has changed, and the tokenizer seems the same. Did you train the model for longer? Did you train it on a different corpus? Thanks in advance, and thanks again for the effort.
Hi again, and thanks in advance for your response. We are wondering how you dealt with accents, which are very important in Spanish (e.g. "hacia" is not the same as "hacía"). When using your tokenizer with the transformers library, it seems that both words (and other pairs of this type) have the same id in the embedding and therefore the same vector representation.
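That behaviour matches the uncased model: BERT-style uncased preprocessing strips accents before WordPiece runs, so "hacia" and "hacía" collapse to the same vocabulary id, while the cased model keeps them distinct. A stdlib sketch of that accent-stripping step (NFD-decompose, drop combining marks):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Mimics the accent stripping applied by BERT-style *uncased*
    preprocessing: decompose to NFD, then drop combining marks (Mn).
    This is why 'hacia' and 'hacía' share one id in an uncased model."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(strip_accents("hacía"))  # -> hacia
```

If the accent distinction matters for your task, the cased checkpoint is the one to use.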
I'm looking to implement some transformer model (BERT or another), but I haven't found anything single-language; apparently the only option is to do the pretraining by hand.
Have you managed to get anywhere? And also, would you find extra help useful?
Hi, there's something strange with the model using transformers library:
In [5]: tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
In [6]: tokenizer.model_max_length
Out[6]: 1000000000000000019884624838656
In [7]: tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
In [8]: tokenizer.model_max_length
Out[8]: 512
So it returns a wrong value for model_max_length, whereas for another model like BERTurk it returns the correct value.
The easiest way would be to extend the tokenizer_config.json and add a "max_len": 512 option :)
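The suggested fix can be sketched like this; an in-memory dict stands in for the real tokenizer_config.json. Older transformers versions read the "max_len" key while newer ones use "model_max_length", so setting both should be harmless:

```python
import json

# Extend tokenizer_config.json so the tokenizer reports BERT's real
# 512-token limit instead of the huge default sentinel value.
tokenizer_config = {"do_lower_case": False}  # stand-in for json.load(...)
tokenizer_config["model_max_length"] = 512   # key read by newer versions
tokenizer_config["max_len"] = 512            # legacy key, older versions
print(json.dumps(tokenizer_config, indent=2))
```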
It is very important that we have tools like these trained and available in Spanish. Best wishes and many thanks again, brothers!
Hello!
I have read the paper and I would like to replicate the training of this model from scratch. You specify the pre-training configuration (hyperparameters), but not the hardware cost or the time it took to train the models. Can you provide that information? I'm interested in the number of TPU v3-8 pods (the preemptible ones, as mentioned in the article) used for training, and in the training time (hours, days, weeks), if possible. I would like to estimate how much it would cost to train a model like yours.
Also, what minimum hardware would such a model require to run in a production environment?
Thank you in advance!
Regards :)
Dear all,
Is there any code available to run the GLUES benchmark? I would like to run it on another model we have trained.
If necessary, I can help put together something like the run_glue.py script.
Many thanks!