dccuchile / beto
BETO - Spanish version of the BERT model
License: Creative Commons Attribution 4.0 International
Could you please provide the script used to run the GLUES tasks?
Thank you!
I'd like to run English-to-Spanish translation. I followed the Transformers guide, and as I understand it I need a custom pipeline for inference.
Are there any resources that might help with that?
First of all, thanks for working on such an impressive Spanish pre-trained model, it means a lot to the Spanish NLP world!
I've been playing with BETO for a while; could you please provide the license for the BETO model, so that I can tell whether it can be used for commercial purposes?
Thank you!
Hi,
When I try to use the transformers pipelines, the 'fill-mask' case works fine:
nlp_fill = pipeline('fill-mask', model="dccuchile/bert-base-spanish-wwm-cased", tokenizer="dccuchile/bert-base-spanish-wwm-cased")
nlp_fill('Mi caballo está ' + nlp_fill.tokenizer.mask_token)
[{'score': 0.3315647542476654, 'sequence': '[CLS] Mi caballo está muerto [SEP]', 'token': 3435},
 {'score': 0.055828217417001724, 'sequence': '[CLS] Mi caballo está herido [SEP]', 'token': 9403},
 {'score': 0.045170336961746216, 'sequence': '[CLS] Mi caballo está aquí [SEP]', 'token': 1401},
 {'score': 0.03713322430849075, 'sequence': '[CLS] Mi caballo está perdido [SEP]', 'token': 5036},
 {'score': 0.036504991352558136, 'sequence': '[CLS] Mi caballo está enfermo [SEP]', 'token': 8434}]
But in other cases, such as 'sentiment-analysis', it does not work well: it only returns 'LABEL_0' and 'LABEL_1'. I think 'LABEL_0' is positive and 'LABEL_1' negative:
nlp_token_class = pipeline('sentiment-analysis', model="dccuchile/bert-base-spanish-wwm-cased", tokenizer="dccuchile/bert-base-spanish-wwm-cased")
nlp_token_class('Que día tan malo hace')
[{'label': 'LABEL_1', 'score': 0.5785217881202698}]
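For context, the base BETO checkpoint ships only the pretrained encoder; when the pipeline builds a sequence classifier from it, the classification head is randomly initialized, and the generic 'LABEL_0'/'LABEL_1' names come from the default id2label mapping in the config. A minimal sketch (label names are hypothetical) of what a fine-tuned checkpoint's config.json would carry instead:

```python
import json

# Without fine-tuning, transformers attaches a randomly initialized
# classification head and names its outputs from the default id2label
# mapping, hence the generic 'LABEL_0'/'LABEL_1'. A fine-tuned
# checkpoint would save something like this (label names hypothetical):
config = {
    "model_type": "bert",
    "num_labels": 2,
    "id2label": {"0": "NEGATIVE", "1": "POSITIVE"},
    "label2id": {"NEGATIVE": 0, "POSITIVE": 1},
}
print(json.dumps(config["id2label"], indent=2))
```

Until such a head is trained, the scores above are essentially arbitrary, so neither label can reliably be read as positive or negative.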
Hi,
We recently released our RoBERTa-base and RoBERTa-large models, and we compared them against your BETO. The paper states the following:
Our model has 12 self-attention layers with 16 attention-heads each (Vaswani et al., 2017), using 1024 as hidden size. In total our model has 110M parameters.
However, the configuration files (BETO cased and BETO uncased) of the models uploaded to the Hugging Face Hub do not contain those hyperparameters: the hidden size is 768 instead of 1024, and there are 12 attention heads instead of 16. Otherwise the stated parameter count (110M) would not fit.
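A back-of-the-envelope parameter count supports this: assuming a vocabulary of roughly 31k WordPieces (my assumption for BETO), 110M parameters only works out for a hidden size of 768 with 12 layers, not 1024.

```python
def bert_param_count(vocab, hidden, layers, intermediate, max_pos=512, types=2):
    """Approximate BERT encoder parameter count (embeddings + layers,
    ignoring the small LayerNorm and pooler terms)."""
    emb = (vocab + max_pos + types) * hidden
    attn = 4 * (hidden * hidden + hidden)  # Q, K, V, output projection
    ffn = hidden * intermediate + intermediate + intermediate * hidden + hidden
    return emb + layers * (attn + ffn)

base = bert_param_count(31000, 768, 12, 3072)          # hidden size 768
wider = bert_param_count(31000, 1024, 12, 4096)        # hidden size 1024
print(f"{base/1e6:.0f}M vs {wider/1e6:.0f}M")          # -> 109M vs 183M
```

So the ~110M figure matches the 768/12-head configuration in the Hub files, and the paper's "1024 hidden size, 16 heads" description appears to be the typo.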
Hi! I'm writing to ask for your opinion on a use of BETO.
The short version up front: does fine-tuning make sense in this context?
I'm building a recommender system for citizen proposals. The situation is that I have a database of about 700 proposals written by citizens during the 2017 elections. These proposals have a title and a description, which I vectorize using BETO, and I build a system to retrieve the closest proposals.
Greetings, and many thanks for this.
Hi! I wanted to run the Part-of-Speech test on my machine. I see you used the Hugging Face Transformers library, but I can't find how to use it for the POS task. Could you share the code with me?
I would also like to use the same dataset you did. I saw you listed the Spanish Universal Dependencies dataset, but since there are three different datasets, I'm not sure which one you used.
Thanks!
Regards, Pedro
Hi there! First of all, thank you very much for developing a Spanish BERT model; the community was waiting for this and we really appreciate it. It would be very helpful if you published the hyperparameters you used to get the results shown in the README.md. Is there any paper or blog post you've published where you describe the strategy followed to train the model?
Thank you very much in advance :)
I have designed a BERT model for NER based on your model, but I would like to know how to recover the values behind the [UNK] tokens, i.e. obtain the actual text in the sentence for a token carrying this label.
This is because sometimes the detected entities are assigned this label, and then I cannot tell what their actual text is.
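One common approach (not something from this repo) is to use a fast tokenizer's character offsets (return_offsets_mapping=True) to map any [UNK] back to the raw substring it covers. A self-contained sketch, with hand-made tokens and offsets standing in for real tokenizer output:

```python
def recover_unk(sentence, tokens, offsets):
    """Given tokens and their (start, end) character offsets in the
    original sentence (as returned by a fast tokenizer with
    return_offsets_mapping=True), replace each [UNK] with the raw
    text it covers."""
    return [sentence[s:e] if tok == "[UNK]" else tok
            for tok, (s, e) in zip(tokens, offsets)]

# Hypothetical tokenization where the emoji fell outside the vocabulary:
sentence = "Mi caballo 🐎 corre"
tokens = ["Mi", "caballo", "[UNK]", "corre"]
offsets = [(0, 2), (3, 10), (11, 12), (13, 18)]
print(recover_unk(sentence, tokens, offsets))  # -> ['Mi', 'caballo', '🐎', 'corre']
```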
Hi!
Could you explain the process for using BETO? I know the import_meta_graph function has to be used, but the input format required to run the model and obtain the vector is not clear. For example, if I have the sentence "quiero ir a ese banco", how do I obtain the vector for that sentence? Does the sentence need preprocessing, or is it fed in as-is?
Regards ✌️
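On the second part of the question, a common recipe (one option among several, not necessarily what the authors used) is to mean-pool the per-token output vectors into a single sentence vector, skipping the special tokens. A toy sketch with hand-made 3-dimensional vectors:

```python
def mean_pool(token_vectors, special_mask):
    """Average the per-token output vectors into one sentence vector,
    skipping special tokens like [CLS]/[SEP]. Using the [CLS] vector
    directly is the other usual choice."""
    kept = [v for v, is_special in zip(token_vectors, special_mask) if not is_special]
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

# Toy vectors for "[CLS] quiero ir a ese banco [SEP]":
vecs = [[9, 9, 9], [1, 2, 3], [3, 2, 1], [2, 2, 2], [0, 2, 4], [4, 2, 0], [9, 9, 9]]
mask = [1, 0, 0, 0, 0, 0, 1]
print(mean_pool(vecs, mask))  # -> [2.0, 2.0, 2.0]
```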
I've followed some examples in transformers with no success:
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf
tokenizer = AutoTokenizer.from_pretrained("./data/beto/")
model = TFAutoModelForQuestionAnswering.from_pretrained("./data/beto/")
ValueError: Unrecognized model in ./data/beto/. Should have a `model_type` key in its config.json
Hello there,
First of all, congratulations and thank you very much for this fantastic work!
I have been using BETO in some projects with no problems so far. Its tokenization and the resulting subwords are, for Spanish, far better than those from multilingual BERT.
But I have now noticed that I was hitting far too many UNK tokens (it usually happens when there is some unusual character).
After checking the BETO vocabulary (both cased and uncased), I have seen that it is missing characters like % (percent), / (slash), and ; (semicolon), and probably some other common punctuation marks and symbols.
Is it true that they are missing, or have I overlooked something?
I think that such characters (tokens) carry relevant information, and collapsing all of them into the UNK token may hurt the performance of the model.
As a quick fix, I have added them to my own customised version of the BETO vocabulary, using some of the unused token positions.
It would be great if you could add those missing characters/symbols to the official BETO vocabulary, including the version uploaded to the Hugging Face Transformers models directory, even if they were not seen during pretraining. That way they would be available to everybody and would have their own embeddings for fine-tuning tasks.
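The quick fix mentioned above can be sketched like this: overwrite some of the [unusedN] placeholder rows of a local copy of the vocabulary with the missing symbols. An in-memory list stands in for the real vocab.txt, which would be read and written back one token per line:

```python
# Symbols observed to be missing from the vocabulary:
missing = ["%", "/", ";"]

# Stand-in for open("vocab.txt").read().splitlines():
vocab = ["[PAD]", "[unused0]", "[unused1]", "[unused2]", "[UNK]", "hola"]

# Reassign the first placeholder slots to the missing symbols:
unused_slots = [i for i, tok in enumerate(vocab) if tok.startswith("[unused")]
for slot, symbol in zip(unused_slots, missing):
    vocab[slot] = symbol

print(vocab)  # -> ['[PAD]', '%', '/', ';', '[UNK]', 'hola']
```

Since the token ids of the placeholder slots are unchanged, the patched vocabulary stays compatible with the pretrained embedding matrix; the new symbols simply get the (untrained) embeddings of the slots they replace, which then learn during fine-tuning.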
Thank you very much for your attention.
Hi everyone,
At www.adelacu.com we are interested in evaluating BETO for an application we have in production; to that end we would like to install it and fit it to our current data. Is there anyone who could help us with this? It would, of course, be paid work.
Thanks. My contact details: [email protected], +569 7608-0532
Issue:
The Hugging Face AutoModel / AutoTokenizer classes can't detect the model type on their own.
Fix:
Would it be possible to update config.json to add the following first line for both the cased and uncased models?
{
"model_type": "bert",
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
...
}
Thx in advance
=============================
Example Before:
from transformers import BertTokenizer
BertTokenizer.from_pretrained(path)
Example Now:
from transformers import AutoTokenizer
AutoTokenizer.from_pretrained(path)  # BertTokenizer still works
==============================
This is important in order to automate model selection (e.g. getting scores from multiple models such as BETO, RoBERTa, GPT-2).
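The proposed change can be sketched as a small patch; an in-memory dict stands in for the real config.json, which would be loaded with json.load and written back:

```python
import json

# Sketch of the proposed fix: put "model_type": "bert" into a local
# copy of config.json so AutoModel/AutoTokenizer can resolve the class.
cfg = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
}  # stand-in for json.load(open("config.json"))
cfg = {"model_type": "bert", **cfg}  # key order only matters cosmetically
print(json.dumps(cfg, indent=2))
```

JSON objects are unordered, so the key works anywhere in the file; putting it first just makes it easy to spot.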
I'm not sure whether it's an issue with the models or with the implementation, but I have not been able to load the model referenced on the Hugging Face page:
%tensorflow_version 2.x
!pip install transformers
import torch
from transformers import *
# These work:
#m = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
#m = AutoModelForSequenceClassification.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
# These don't work:
m = TFBertForSequenceClassification.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
m
It raises an error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-1-fb60f3d257dd> in <module>()
9
10 #Estos no funcionan
---> 11 m = TFBertForSequenceClassification.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
12
13 m
1 frames
/usr/lib/python3.6/genericpath.py in isfile(path)
28 """Test whether a path is a regular file"""
29 try:
---> 30 st = os.stat(path)
31 except OSError:
32 return False
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
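For what it's worth, this error pattern typically means that no TensorFlow weights file (tf_model.h5) exists for that model id, so the TF class ends up with a None path. transformers can convert the PyTorch checkpoint on the fly via the real from_pt=True argument of from_pretrained; the helper below, which only inspects a local model folder, is my own hypothetical sketch:

```python
from pathlib import Path

def weight_load_kwargs(model_dir):
    """Decide how to call TFBertForSequenceClassification.from_pretrained
    for a *local* model folder: use native TF weights when present,
    otherwise fall back to converting the PyTorch checkpoint."""
    d = Path(model_dir)
    if (d / "tf_model.h5").exists():
        return {}
    if (d / "pytorch_model.bin").exists():
        return {"from_pt": True}   # convert PyTorch weights on load
    raise FileNotFoundError(f"no weight files found in {model_dir}")

# e.g.: m = TFBertForSequenceClassification.from_pretrained(path, **weight_load_kwargs(path))
```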
You can check it in Colab.
Many thanks for the contribution :)
I am currently pre-training BERT for another language and I hope to benchmark my model against mBERT, just as you did, on tasks such as POS in languages other than English. Do you have any resources you could share (code, repos, etc.) that I could use for this benchmark?
Hi! I'm using BETO to classify texts into categories.
Since BETO is a model that relies on context to improve its predictions, what cleaning is it advisable to apply to the text?
For BERT (English) the following is common:
Removing accents, removing stopwords, and lemmatizing would make no sense here.
Removing special characters would cover !"#$%&'()*+,-./:;<=>?@[\]^_{|}~. I also don't know how much removing the exclamation/question marks matters; at most I could remove everything except those.
Removing e-mails and hyperlinks seems reasonable to me, but what about digits?
I look forward to your reply; I'm open to discussion. Thanks!!
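One possible cleaning pass along those lines (an illustrative sketch, not the authors' recipe): drop URLs, e-mail addresses and digits, remove special characters except the Spanish punctuation that carries tone (¡ ! ¿ ?), and deliberately keep accents:

```python
import re

def clean_es(text: str) -> str:
    """One possible cleaning pass for Spanish text before BETO:
    remove hyperlinks, e-mail addresses and digits, then every
    special character except ¡ ! ¿ ?. Accents are deliberately kept."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # hyperlinks
    text = re.sub(r"\S+@\S+", " ", text)                # e-mail addresses
    text = re.sub(r"\d+", " ", text)                    # digits
    text = re.sub(r"[^\w\s¡!¿?]", " ", text)            # other special chars
    return re.sub(r"\s+", " ", text).strip()

print(clean_es("¡Visita www.ejemplo.com y escribe a [email protected]!"))
```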
I have a custom Spanish dataset which I want to classify into specific labels. However, the repo contains no specific instructions (apart from the weight files) on how to do that (specifically, using TF2). Can you be more specific?
Good afternoon. First of all, many thanks for the effort of training a Spanish language model; on top of that it works very well, so thank you twice over. I have been using your model for a long time, and I have noticed that since the March release it works considerably better and gives very good results. If possible, I would like to know what changes you made to the model in that release. Regarding the architecture, I saw that nothing has changed, and the tokenizer seems the same. Did you train the model for longer? Did you train it on a different corpus? Thanks in advance, and thanks again for the effort.
Hi again, and thanks in advance for your response. We are wondering how you dealt with accents, which are very important in Spanish (e.g. "hacia" is not the same as "hacía"). When using your tokenizer with the transformers library, it seems that both words (and other pairs of this type) have the same id in the embedding and therefore the same vector representation.
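That behaviour matches the uncased model: BERT-style uncased preprocessing strips accents before WordPiece runs, so "hacia" and "hacía" collapse to the same vocabulary id, while the cased model keeps them distinct. A stdlib sketch of that accent-stripping step (NFD-decompose, drop combining marks):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Mimics the accent stripping applied by BERT-style *uncased*
    preprocessing: decompose to NFD, then drop combining marks (Mn).
    This is why 'hacia' and 'hacía' share one id in an uncased model."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(strip_accents("hacía"))  # -> hacia
```

If the accent distinction matters for your task, the cased checkpoint is the one to use.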
I'm looking to implement some transformer model (BERT or another), but I haven't found anything single-language; apparently the only option is to do the pretraining by hand.
Have you managed to get anywhere? And also, would you find extra help useful?
Hi, there's something strange with the model using transformers library:
In [5]: tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
In [6]: tokenizer.model_max_length
Out[6]: 1000000000000000019884624838656
In [7]: tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
In [8]: tokenizer.model_max_length
Out[8]: 512
So it returns a wrong value for model_max_length, whereas for another model like BERTurk it returns the correct value.
The easiest way would be to extend the tokenizer_config.json and add a "max_len": 512 option :)
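The suggested fix can be sketched like this; an in-memory dict stands in for the real tokenizer_config.json. Older transformers versions read the "max_len" key while newer ones use "model_max_length", so setting both should be harmless:

```python
import json

# Extend tokenizer_config.json so the tokenizer reports BERT's real
# 512-token limit instead of the huge default sentinel value.
tokenizer_config = {"do_lower_case": False}  # stand-in for json.load(...)
tokenizer_config["model_max_length"] = 512   # key read by newer versions
tokenizer_config["max_len"] = 512            # legacy key, older versions
print(json.dumps(tokenizer_config, indent=2))
```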
It is very important that we have tools like these trained and available in Spanish. Best wishes and many thanks again, brothers!
Hello!
I have read the paper and I would like to replicate the training of this model from scratch. You specify the pre-training configuration (hyperparameters), but not the hardware cost or the time it took to train the models. Can you provide that information? I'm interested in the number of TPU v3-8 pods (the preemptible ones, as mentioned in the article) used for training, and in the training time (hours, days, weeks), if possible. I would like to estimate how much it would cost to train a model like yours.
Also, what minimum hardware would such a model require to run in a production environment?
Thank you in advance!
Regards :)
Dear all,
Is there any code available to run the GLUES benchmark? I would like to run it on another model we have trained.
If necessary, I can help put together something like the run_glue.py script.
Many thanks!