indonlp / indonlu

The first large-scale natural language processing benchmark for the Indonesian language. We provide multiple downstream tasks, pre-trained IndoBERT models, and starter code. (AACL-IJCNLP 2020)

Home Page: https://indobenchmark.com

License: Apache License 2.0

Jupyter Notebook 49.88% Python 47.47% Shell 2.65%

indonesian bahasa bert benchmark datasets nlp nlu aacl indobert indobert-models indonlu indo4b indobert-lite indonlp

indonlu's Introduction

IndoNLU

Read this README in Bahasa Indonesia.

IndoNLU is a collection of Natural Language Understanding (NLU) resources for Bahasa Indonesia covering 12 downstream tasks. We provide the code to reproduce the results, along with large pre-trained models (IndoBERT and IndoBERT-lite) trained on a corpus of around 4 billion words (Indo4B), which amounts to more than 20 GB of text data. The project started as a joint collaboration between universities and industry, including Institut Teknologi Bandung, Universitas Multimedia Nusantara, The Hong Kong University of Science and Technology, Universitas Indonesia, Gojek, and Prosa.AI.

Research Paper

IndoNLU has been accepted at AACL-IJCNLP 2020; the details are in our paper: https://www.aclweb.org/anthology/2020.aacl-main.85.pdf. If you use any component of IndoNLU, including Indo4B, FastText-Indo4B, or IndoBERT, in your work, please cite the following paper:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}

How to contribute to IndoNLU?

Be sure to check the contributing guidelines, and contact the maintainers or open an issue to collect feedback before starting your PR.

12 Downstream Tasks

You can check [Link]
We provide train, valid, and test sets. The test set labels are masked (no true labels are released) to preserve the integrity of the evaluation. Please submit your predictions to the submission portal at CodaLab.

Examples

A guide to loading an IndoBERT model and fine-tuning it on sequence classification and sequence tagging tasks.
You can check link

Submission Format

Please check the link. Each task has a different format. Every submission file always starts with the index column (the id of the test sample, following the order of the masked test set).

To submit, first rename your prediction file to pred.txt, then zip the file. After uploading, allow the system some time to compute the results. You can check the progress in your results tab.
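The packaging step described above can be sketched as follows. This is a minimal sketch, not the official tooling: the `index` and `label` column names, the comma separator, the archive name `submission.zip`, and the helper `package_predictions` are all assumptions for illustration. The exact layout differs per task, so check the submission format link before uploading.

```python
import csv
import os
import tempfile
import zipfile


def package_predictions(predictions, out_dir, archive_name="submission.zip"):
    """Write predictions to pred.txt (index column first) and zip it.

    `predictions` is a list of labels ordered like the masked test set.
    Column names and the CSV layout are assumptions, not the official spec.
    """
    pred_path = os.path.join(out_dir, "pred.txt")
    with open(pred_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "label"])
        for idx, label in enumerate(predictions):
            writer.writerow([idx, label])

    zip_path = os.path.join(out_dir, archive_name)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Store the file at the archive root under the required name.
        zf.write(pred_path, arcname="pred.txt")
    return zip_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = package_predictions(["positive", "negative", "neutral"], tmp)
        with zipfile.ZipFile(path) as zf:
            print(zf.namelist())
```

Writing the file with `arcname="pred.txt"` keeps it at the root of the archive regardless of where the temporary file lives on disk.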

Indo4B Dataset

We provide access to our large pretraining dataset. In this version, we exclude all Twitter tweets due to restrictions of the Twitter Developer Policy and Agreement.

Indo4B Dataset (23 GB uncompressed, 5.6 GB compressed) [Link]

IndoBERT and IndoBERT-lite Models

We provide 4 IndoBERT and 4 IndoBERT-lite pre-trained language models [Link]

IndoBERT-base
- Phase 1 [Link]
- Phase 2 [Link]
IndoBERT-large
- Phase 1 [Link]
- Phase 2 [Link]
IndoBERT-lite-base
- Phase 1 [Link]
- Phase 2 [Link]
IndoBERT-lite-large
- Phase 1 [Link]
- Phase 2 [Link]

FastText (Indo4B)

We provide the full uncased FastText model file (11.9 GB) and the corresponding vector file (3.9 GB).

FastText model (11.9 GB) [Link]
Vector file (3.9 GB) [Link]

We also provide smaller FastText models with reduced vocabularies for each of the 12 downstream tasks.

FastText-Indo4B [Link]
FastText-CC-ID [Link]

Leaderboard

Community Portal and Public Leaderboard [Link]
Submission Portal https://competitions.codalab.org/competitions/26537


indonlu's Issues

Computation Power to Pretrain the Indo4B Dataset From Scratch

Hi, I'm using one of your models for text similarity and it works great!

I would like to pretrain the model from scratch using the Indo4B dataset, which is huge (~24 GB). May I know how much RAM and VRAM is needed to train it with the same batch size you stated in your paper? For example, IndoBERT-base used a batch size of 256. Are 16 GB of VRAM and 32 GB of RAM enough?

Thank you for this amazing work!

Different Vocab Size Between Tokenizer and Model's Word Embedding Layer

Expected Behavior

The tokenizer's vocabulary size and the size of BERT's word embedding layer should be the same.

Actual Behavior

The tokenizer's vocabulary size and the size of BERT's word embedding layer are not the same.

Steps to Reproduce the Problem

Load the model: model = AutoModel.from_pretrained('indobenchmark/indobert-base-p1')
Print the model: print(model)

Load the tokenizer: tokenizer = AutoTokenizer.from_pretrained('indobenchmark/indobert-base-p1')
Print the length of the tokenizer: print(len(tokenizer))

multi_label_classification swish import error

I'm trying to run finetune_casa.ipynb, but I get an error when it imports BertForMultiLabelClassification from multi_label_classification.
The error looks like this:

ImportError                               Traceback (most recent call last)
<ipython-input-3-2955fce89653> in <module>()
     10 from nltk.tokenize import TweetTokenizer, word_tokenize
     11 
---> 12 from indonlu.modules.multi_label_classification import BertForMultiLabelClassification
     13 from indonlu.utils.forward_fn import forward_sequence_multi_classification
     14 from indonlu.utils.metrics import absa_metrics_fn

/content/indonlu/modules/multi_label_classification.py in <module>()
      7 from torch.nn import CrossEntropyLoss, MSELoss
      8 
----> 9 from transformers.activations import gelu, gelu_new, swish
     10 from transformers.configuration_bert import BertConfig
     11 from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_callable

ImportError: cannot import name 'swish' from 'transformers.activations' (/usr/local/lib/python3.7/dist-packages/transformers/activations.py)

Can you help me solve this error?

Reproduce FacQA Task Result

Hi IndoNLU team,

I have a problem reproducing the FacQA result from the IndoNLU paper using the indobert-lite-large pretrained model. The paper reports an F1 score of 69.47.

But when I reproduce the run and submit the predictions to CodaLab, I get an F1 score of 64.4.

The difference between my result and the one reported in the paper is about 5 points, which I think is quite large.

To reproduce the task, I ran the https://github.com/indobenchmark/indonlu/blob/master/main.py code. For the hyperparameters, I followed what is written in the paper.

I ran the main.py code as shown below:

Is there some configuration that I missed?

Thank you in advance.

SSL Error on site

Minor SSL issue on indobenchmark.com

SentencePiece error when using pre-trained Albert (indobenchmark/indobert-lite-large-p2)

Hi indobenchmark team!

I am trying to use the pre-trained models for question answering task. It worked well when I'm using BERT, however, I got this error when using Albert (specifically indobenchmark/indobert-lite-large-p2):

File "/home/kiki/anaconda3/envs/huggingface/lib/python3.8/site-packages/simpletransformers/question_answering/question_answering_model.py", line 184, in __init__
  self.tokenizer = tokenizer_class.from_pretrained(model_name, do_lower_case=self.args.do_lower_case, **kwargs)
File "/home/kiki/anaconda3/envs/huggingface/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1428, in from_pretrained
  return cls._from_pretrained(*inputs, **kwargs)
File "/home/kiki/anaconda3/envs/huggingface/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1575, in _from_pretrained
  tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/kiki/anaconda3/envs/huggingface/lib/python3.8/site-packages/transformers/tokenization_albert.py", line 155, in __init__
  self.sp_model.Load(vocab_file)
File "/home/kiki/anaconda3/envs/huggingface/lib/python3.8/site-packages/sentencepiece.py", line 367, in Load
  return self.LoadFromFile(model_file)
File "/home/kiki/anaconda3/envs/huggingface/lib/python3.8/site-packages/sentencepiece.py", line 177, in LoadFromFile
  return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

Python version & packages that I used:

Python                3.8.5
sentencepiece         0.1.91
transformers          3.3.1
simpletransformers    0.48.9

I guess this error occurs due to the missing spiece.model file in the model. Could you provide this missing file? Or is there any other solution to this problem? Thank you.

ValueError: invalid literal for int() with base 10: 'sentiment'

Expected Behavior

Dear Author,

I want to do multiclass classification by modifying DocumentSentimentDataset:

class DocumentSentimentDataset(Dataset):
    # Static constant variable
    LABEL2INDEX = {'Ekonomi': 0, 'Hukum': 1, 'Kesehatan': 2, 'Sosial': 3, 'Teknologi': 4}
    INDEX2LABEL = {0: 'Ekonomi', 1: 'Hukum', 2: 'Kesehatan', 3: 'Sosial', 4: 'Teknologi'}
    NUM_LABELS = 5

    def load_dataset(self, path):
        df = pd.read_csv(path, sep='\t', header=None)
        df.columns = ['text', 'sentiment']
        #df['sentiment'] = df['sentiment'].apply(lambda lab: self.LABEL2INDEX[lab])
        return df

    def __init__(self, dataset_path, tokenizer, no_special_token=False, *args, **kwargs):
        self.data = self.load_dataset(dataset_path)
        self.tokenizer = tokenizer
        self.no_special_token = no_special_token

    def __getitem__(self, index):
        data = self.data.loc[index, :]
        text, sentiment = data['text'], data['sentiment']
        subwords = self.tokenizer.encode(text, add_special_tokens=not self.no_special_token)
        return np.array(subwords), np.array(sentiment), data['text']

    def __len__(self):
        return len(self.data)

But when I started to train the model, I got an error like this:

ValueError: Caught ValueError in DataLoader worker process 12.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
return self.collate_fn(data)
File "/kaggle/working/indonlu/utils/data_utils.py", line 550, in _collate_fn
sentiment_batch[i,0] = sentiment
ValueError: invalid literal for int() with base 10: 'sentiment'

I have checked that the 'sentiment' column is int.

Do you have any advice for my problem?

Thank you in advance.
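One possible cause, assuming the TSV file begins with a header row: with `header=None`, the header line `text<TAB>sentiment` is read as the first data row, so the label at index 0 is the string 'sentiment', which cannot be stored in the integer label array built in `_collate_fn`. The sketch below reproduces that with the standard-library `csv` module instead of pandas to stay dependency-free; the sample rows and the helper names `read_rows` and `to_int_labels` are made up for illustration.

```python
import csv
import io

# Made-up TSV content with a header row, as an assumption about the input file.
RAW = "text\tsentiment\nharga saham naik\t0\nvonis hakim dibacakan\t1\n"


def read_rows(raw, has_header):
    """Return (text, label) rows; skip the first row only if it is a header."""
    rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
    return rows[1:] if has_header else rows


def to_int_labels(rows):
    """Mimic the collate step, which stores each label in an integer array."""
    return [int(label) for _, label in rows]


if __name__ == "__main__":
    # Treating the header as data triggers the same kind of ValueError.
    try:
        to_int_labels(read_rows(RAW, has_header=False))
    except ValueError as err:
        print(err)

    # Skipping the header row gives clean integer labels.
    print(to_int_labels(read_rows(RAW, has_header=True)))
```

With pandas, the corresponding change would be to read the file with `header=0` (or drop the first row) so the column names are not treated as data. If the labels are stored as names such as 'Ekonomi' rather than integers, the `LABEL2INDEX` mapping would also need to be applied.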

Model File Export

Hi, I would like to ask: suppose I have finished fine-tuning my model. How can I export it to a file on disk so that I can later load it and run predictions directly?

Thanks.

DocumentSentimentDataset key error

Hi, I'm currently doing sentiment analysis using indobert-base-p1, and following this reference: (https://medium.com/@eza.a.putra/implementasi-bert-untuk-analisis-sentimen-terhadap-ulasan-aplikasi-flip-berbahasa-indonesia-557d691e0440).
When I ran the program everything ran smoothly and there were no errors at all. But when I changed my data set, I got an error in this section:
Then I tried to process the data set from what started out like this: to become like this: Then the error changed to something like this:
Then I changed the sentiment from numbers to positive and negative words, and now the error has changed to this:

Is this a problem with the data set that I have or a problem in the data_utils.py coding?

Contributing Guidelines

Hi Indobenchmark team,

Thanks for such an interesting project.
Just curious: are there contributing guidelines for this project?

Many thanks,
David

Loss Function for Fine-Tuning

Hi, IndoNLU team,

Thanks for your amazing work! I'm currently working on my bachelor thesis, using IndoBERT for a sequence classification task.
If I want to change the loss function for fine-tuning, where or how can I do it?

From your tutorials, I found that you use cross-entropy as the loss function for the multiclass classification task (sentiment analysis in that case).

But when I want to dig more into the code, I can't find it. I can just find the CrossEntropyLoss() in:

none are for multi class classification.

The tutorials also mentioned :

"Cross entropy loss is calculated by comparing how well the probability distribution output by Softmax matches the one-hot-encoded ground truth label of the data."

But the SmSA fine-tuning example doesn't show the ground truth being one-hot encoded; the labels are label-encoded instead. I also tried printing list_hyp and list_label, in case they are one-hot encoded somewhere outside the code I can see, but the outputs are unchanged (just the mapping from LABEL2INDEX). Meanwhile, I suppose the SmSA labels don't have an order or rank, right? Neither do the labels in my thesis task.

Thank you in advance!
Regards,
Celine.
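On the one-hot question: cross-entropy against a one-hot target reduces to the negative log-probability of the true class, so an integer class index carries the same information and implies no ordering between classes. PyTorch's `CrossEntropyLoss` accepts class indices as targets for this reason. The sketch below checks the equivalence in plain Python; the logits are made-up numbers, and it makes no claim about where the loss is computed in this repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_from_one_hot(logits, one_hot):
    """Cross-entropy against an explicit one-hot target vector."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


def ce_from_index(logits, label_index):
    """Cross-entropy against an integer class index (label encoding)."""
    return -math.log(softmax(logits)[label_index])


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.5]   # made-up scores for three classes
    label = 2                   # label-encoded ground truth
    one_hot = [0.0, 0.0, 1.0]   # the same ground truth, one-hot encoded
    print(ce_from_one_hot(logits, one_hot), ce_from_index(logits, label))
```

Both functions return the same value because every term of the one-hot sum is zero except the one for the true class.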

HTTPs Certificate for indobenchmark expired

Hello, I hope you are all well.
As mentioned in the title, the HTTPS certificate for the indobenchmark website has expired.
I hope the project continues and the certificate gets renewed.

The bert model reserved all the GPU

Is an NVIDIA V100 with 32 GB of memory large enough to load the BERT model?

I got an error RuntimeError: CUDA out of memory. Tried to allocate 120.00 MiB (GPU 0; 31.75 GiB total capacity; 30.32 GiB already allocated; 9.50 MiB free; 30.42 GiB reserved in total by PyTorch) after the model was loaded in GPU.

The whole GPU is occupied by the model, so I can't move the input tensors onto the GPU.

does the classification process not use softmax?

I re-read the Forward function for sequence classification and it turns out there is no external application of softmax. Do you really not use softmax, or do you use softmax internally? If you use softmax internally, where is the softmax implemented?
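As general background rather than a statement about this repository's code: softmax is monotonic, so the class with the largest logit is also the class with the largest softmax probability, and PyTorch's `CrossEntropyLoss` applies log-softmax internally to raw logits. A forward function can therefore produce predictions and a loss without an explicit softmax call. The sketch below illustrates the argmax property with made-up logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


if __name__ == "__main__":
    batch_logits = [[1.2, -0.3, 0.1], [-2.0, 0.4, 0.3], [0.0, 0.0, 5.0]]
    for logits in batch_logits:
        probs = softmax(logits)
        # The predicted class is identical with or without softmax.
        print(argmax(logits), argmax(probs), [round(p, 3) for p in probs])
```

An explicit softmax is only needed when the probabilities themselves are required, for example to report confidence scores.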

Codalab Submission upload has been disabled.

I am trying to upload a submission through the CodaLab competition portal. However, every time I try to submit, it says "Submission upload has been disabled. See the new instance at: https://codalab.lisn.upsaclay.fr/". I see that the competition doesn't have an end date, so is this because I can't resubmit, or has the competition actually ended?

Test split in the Emot

Thank you so much for the tool. I have a question about the Emot dataset. I have observed that the test set contains only one label: "happy". For a fair evaluation, I think the test set should contain samples from the other labels too. I wonder whether the wrong test set was uploaded or whether this is a design choice.

Expected Behavior

the test split should have samples from each label

Actual Behavior

the test set has only one label: happy

Steps to Reproduce the Problem

N/A, i got it from the dataset analysis.

Build spell checker in Bahasa Indonesia

Hi, you guys are doing an amazing work!

I am working on a spell checker for Bahasa Indonesia. Currently I am combining the original fastText Indonesian language model with Norvig's spell-check algorithm. The results are OK, but I think they can be improved further with a larger and cleaner language model.

I tried your FastText (Indo4B) model, but so far it produces the same results as the previous one. There are still words such as "anaak", "indonesa", etc.

Any idea how I can do this task better? I am a newbie here, by the way :) Any advice is welcome.

Could you please direct me to a dataset/corpus of yours that covers formal Indonesian?

Thanks a lot!

Training and validation accuracy multi label classification

Hi, thank you IndoNLU team for making this IndoBERT model. I'm currently working on a thesis using IndoBERT with BertForMultiLabelClassification.

I have successfully run the "finetune_casa.ipynb" notebook provided in the examples folder.

I used a private dataset and adapted it to match AspectBasedSentimentAnalysisAiryDataset in utils/data_utils.py. The dataset I use is imbalanced.

However, when I visualize the results with matplotlib, the train and eval accuracies differ considerably, and the eval accuracy stays nearly constant. I also tried the same thing with the dataset provided in dataset/casa_absa-prosa, but the results are not much different from those on my own dataset.
This is the code used for fine-tuning:

train_loss_lists = []
train_acc_lists = []
eval_loss_lists = []
eval_acc_lists = []

# Train
n_epochs = 8
for epoch in range(n_epochs):
    model.train()
    torch.set_grad_enabled(True)
 
    total_train_loss = 0
    list_hyp, list_label = [], []

    train_pbar = tqdm(train_loader, leave=True, total=len(train_loader))
    for i, batch_data in enumerate(train_pbar):
        # Forward model
        loss, batch_hyp, batch_label = forward_sequence_multi_classification(model, batch_data[:-1], i2w=i2w, device='cuda')

        # Update model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        tr_loss = loss.item()
        total_train_loss = total_train_loss + tr_loss

        # Calculate metrics
        list_hyp += batch_hyp
        list_label += batch_label

        train_pbar.set_description("(Epoch {}) TRAIN LOSS:{:.4f} LR:{:.8f}".format((epoch+1),
            total_train_loss/(i+1), get_lr(optimizer)))


    # Calculate train metric
    metrics = absa_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) TRAIN LOSS:{:.4f} {} LR:{:.8f}".format((epoch+1),
        total_train_loss/(i+1), metrics_to_string(metrics), get_lr(optimizer)))
    train_acc_lists.append(metrics['ACC'])
    current_train_loss = round(total_train_loss/(i+1), 4)
    train_loss_lists.append(current_train_loss)
    
    # Evaluate on validation
    model.eval()
    torch.set_grad_enabled(False)
    
    total_loss, total_correct, total_labels = 0, 0, 0
    list_hyp, list_label = [], []

    pbar = tqdm(valid_loader, leave=True, total=len(valid_loader))
    for i, batch_data in enumerate(pbar):
        batch_seq = batch_data[-1]        
        loss, batch_hyp, batch_label = forward_sequence_multi_classification(model, batch_data[:-1], i2w=i2w, device='cuda')
        
        # Calculate total loss
        valid_loss = loss.item()
        total_loss = total_loss + valid_loss

        # Calculate evaluation metrics
        list_hyp += batch_hyp
        list_label += batch_label
        metrics = absa_metrics_fn(list_hyp, list_label)

        pbar.set_description("VALID LOSS:{:.4f} {}".format(total_loss/(i+1), metrics_to_string(metrics)))
        
    metrics = absa_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) VALID LOSS:{:.4f} {}".format((epoch+1),
        total_loss/(i+1), metrics_to_string(metrics)))
    eval_acc_lists.append(metrics['ACC'])
    current_eval_loss = round(total_loss/(i+1), 4)
    eval_loss_lists.append(current_eval_loss)

The matplotlib result looks like this:

Obviously there's some issue with how this is checked, but I can't put my finger on it. Is there anything I can check?

Indo4B Cased

Thank you, IndoBenchmark team, for this dataset :D
If I may ask, is a cased version of Indo4B available?

Thank you

[Question] [Help Needed] local variable 'subword_batch' referenced before assignment

Hi, thank you to the IndoBenchmark team for the tutorial, example code, and the IndoBERT model.
I have successfully run the "finetune_smsa.ipynb" notebook provided in the examples folder.

Next, I would like to fine-tune emotion recognition on the "emot_emotion-twitter" dataset.

So I tried to modify the sentiment analysis code. In my code, I use EmotionDetectionDataset and EmotionDetectionDataLoader for dataset preparation. However, when I run the "Fine Tuning & Evaluation" part, I get this error: "local variable 'subword_batch' referenced before assignment"

At first I thought I had failed to prepare the datasets, so I double-checked "train_loader" with "len(train_loader)" and got 3,521, which indicates that it is not empty and should be passed to the "forward_sequence_classification" function.

Is there something I forgot to change for emotion recognition?
And how can I solve this issue? Thank you in advance. 🙏

Here is my Google Colab notebook file, if needed:
indo-bert.zip

Make this open source to meet github recommended community standard.

Please see checklist here: https://github.com/indobenchmark/indonlu/community

load classification report and confusion matrix

Hi, thank you indobenchmark team for making this IndoBERT model! I'm a newbie in Python and excited to learn NLP. I have successfully run the "finetune_smsa.ipynb" notebook provided in the examples folder.

Sorry, I want to ask about something basic. I am confused about how to get a classification report and a confusion matrix for evaluating finetune_smsa, because I don't understand which variables hold the predictions (I mean something like the y_pred and y_true variables in the sklearn tutorial).

Is there any code to get a classification report and a confusion matrix? Thank you in advance!
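In the example training and evaluation loops, predictions and gold labels are accumulated in `list_hyp` and `list_label`, so these can play the roles of `y_pred` and `y_true`. The following is a minimal plain-Python sketch of a confusion matrix and per-class report; the sample values and the helper `per_class_report` are illustrative, not part of the repository.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_report(y_true, y_pred, labels):
    """Precision, recall and F1 per label, computed from the matrix."""
    matrix = confusion_matrix(y_true, y_pred, labels)
    report = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)
        actual = sum(matrix[i])
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


if __name__ == "__main__":
    # Made-up stand-ins for the lists collected during evaluation.
    list_label = ["positive", "negative", "neutral", "positive", "negative"]
    list_hyp = ["positive", "negative", "positive", "positive", "neutral"]
    labels = ["positive", "neutral", "negative"]
    for row in confusion_matrix(list_label, list_hyp, labels):
        print(row)
    print(per_class_report(list_label, list_hyp, labels))
```

If scikit-learn is installed, `sklearn.metrics.classification_report(list_label, list_hyp)` and `sklearn.metrics.confusion_matrix(list_label, list_hyp)` give the same information with less code.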

RuntimeError: CUDA error: device-side assert triggered

Expected Behavior

I want to fine-tune indobenchmark/indobert-base-p2 for text classification. My dataset covers several math courses/topics (peluang, integral, trigonometri, etc.), with about 1,200 rows in total (about 10-20 example questions per course).

I am using a Google Colab GPU.

Dataframe head

I split into 3 files: train.csv, test.csv, valid.csv

Everything works fine when following the finetune_smsa.ipynb sample code.

Actual Behavior

I encounter the error "CUDA error: device-side assert triggered" when running the training code from https://github.com/indobenchmark/indonlu/blob/master/examples/finetune_smsa.ipynb
(line 14 under the Fine tune section).

I strongly suspect the Google Colab GPU configuration.

Steps to Reproduce the Problem

Follow the example code, change the data frame class

class PDataset(Dataset):
  NUM_LABELS = NUM_LABELS
  LABEL2INDEX = LABEL2INDEX
  INDEX2LABEL = INDEX2LABEL
  
  def load_dataset(self, path):
      df = pd.read_csv(path)
      df['lesson_id'] = df['lesson_id'].apply(lambda lesson_id: self.LABEL2INDEX[lesson_id])
      return df
  
  def __init__(self, dataset_path, tokenizer, no_special_token=False, *args, **kwargs):
      self.data = self.load_dataset(dataset_path)
      self.tokenizer = tokenizer
      self.no_special_token = no_special_token
      
  def __getitem__(self, index):
      data = self.data.loc[index,:]
      content, lesson_id = data['content'], data['lesson_id']
      
      subwords = self.tokenizer.encode(content, add_special_tokens=not self.no_special_token)
      return np.array(subwords), np.array(lesson_id), data['content']
  
  def __len__(self):
      return len(self.data)

Content loader

class PContentDataLoader(DataLoader):
    def __init__(self, max_seq_len=512, *args, **kwargs):
        super(PContentDataLoader, self).__init__(*args, **kwargs)
        self.collate_fn = self._collate_fn
        self.max_seq_len = max_seq_len
        
    def _collate_fn(self, batch):
        batch_size = len(batch)
        max_seq_len = max(map(lambda x: len(x[0]), batch))
        max_seq_len = min(self.max_seq_len, max_seq_len)
        
        subword_batch = np.zeros((batch_size, max_seq_len), dtype=np.int64)
        mask_batch = np.zeros((batch_size, max_seq_len), dtype=np.float32)
        sentiment_batch = np.zeros((batch_size, 1), dtype=np.int64)
        
        seq_list = []
        for i, (subwords, sentiment, raw_seq) in enumerate(batch):
            subwords = subwords[:max_seq_len]
            subword_batch[i,:len(subwords)] = subwords
            mask_batch[i,:len(subwords)] = 1
            sentiment_batch[i,0] = sentiment
            
            seq_list.append(raw_seq)
            
        return subword_batch, mask_batch, sentiment_batch, seq_list

Error when running the train code
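A device-side assert during training is often caused by an out-of-range index, for example a label id that is greater than or equal to the number of output classes, or a token id beyond the embedding size. This is a common cause, not a diagnosis of this specific report. The sketch below checks both on the CPU before training; `find_out_of_range` and `check_batch` are hypothetical helpers and the values are made up.

```python
def find_out_of_range(values, upper_bound):
    """Return (position, value) pairs that fall outside [0, upper_bound)."""
    return [(i, v) for i, v in enumerate(values) if not 0 <= v < upper_bound]


def check_batch(label_ids, token_ids, num_labels, vocab_size):
    """Collect index problems that would trip an assert inside a CUDA kernel."""
    problems = {}
    bad_labels = find_out_of_range(label_ids, num_labels)
    if bad_labels:
        problems["labels"] = bad_labels
    bad_tokens = find_out_of_range(token_ids, vocab_size)
    if bad_tokens:
        problems["tokens"] = bad_tokens
    return problems


if __name__ == "__main__":
    # Made-up values: 3 output classes, but one row carries label id 3.
    print(check_batch(label_ids=[0, 2, 3], token_ids=[2, 15, 7],
                      num_labels=3, vocab_size=100))
```

For the code in this issue, the equivalent check would compare the values in `LABEL2INDEX` against `NUM_LABELS` and confirm that the number of labels configured on the model matches. Running a single batch on the CPU usually gives a more readable error message than the CUDA assert.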

Benchmark table is not consistent (am I wrong?)

I read the paper and compared it to this website: https://www.indobenchmark.com/leaderboard.html. It seems that the sequence labelling benchmark results are not the same. I also ran my own fine-tuning, and the result is closer to the one in the paper than to the one on the website. Is there any explanation for this discrepancy?

Example how to use FastText

Adjust DocumentSentimentDataset() to read dataset with 3 columns

Hi, many thanks to the IndoBenchmark team for releasing the IndoBERT model.
I'm currently working on my thesis project on sentence similarity detection. The dataset consists of pairs of questions scraped from Quora, saved in .csv format with 3 columns: question1, question2, and is_duplicate.

For the training process I'm following the Finetuning SMSA.ipynb notebook. But in the Prepare Dataset section, I get an error when running DocumentSentimentDataset(), like this:

It seems this is because the function only accepts 2 columns from the dataset, while my dataset has 3 columns.

I tried changing the data_utils.py file like this, but the error remains the same:

How can I solve this problem?
Thank you in advance.

Textual entailment usage

Hi, thanks for publishing this work.
For the textual entailment task specifically, how do I use your model, given that we need to feed it two sentences?
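As general background on BERT-style models: a sentence pair is fed as one sequence laid out as `[CLS] A [SEP] B [SEP]`, with segment (token type) ids 0 for the first part and 1 for the second. The toy sketch below shows that layout using whitespace splitting in place of the real subword tokenizer; `pack_pair` is a hypothetical helper. In practice the Hugging Face tokenizer builds this layout when it is given two texts, for example `tokenizer.encode(text_a, text_b)`.

```python
def pack_pair(premise, hypothesis):
    """Build the [CLS] A [SEP] B [SEP] layout with matching segment ids.

    Whitespace splitting stands in for the real subword tokenizer.
    """
    tokens_a = premise.lower().split()
    tokens_b = hypothesis.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


if __name__ == "__main__":
    tokens, segment_ids = pack_pair("Budi makan nasi", "Budi makan")
    print(tokens)
    print(segment_ids)
```

The packed sequence can then be classified like a single sentence, with one label per pair.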

need indobert cased pretrained model

I have read in some articles that a cased BERT model is more accurate for part-of-speech tagging and NER-type data. I hope the team will build a cased pretrained IndoBERT model.

Pre-training perplexity

Hi, I want to ask about the training log/behaviour. Do you have notes on it, or at least the final perplexity for each of the BERT variants? It would help the research community reproduce your research. Thanks.

Documentation in Bahasa Indonesia

Let's add documentation in Indonesian.

ValueError('Expected input batch_size ({}) to match target batch_size ({}).' ValueError: Expected input batch_size (3208) to match target batch_size (3784).

Hi indonlu team,

I got this error when I ran the FacQA task:
ValueError('Expected input batch_size ({}) to match target batch_size ({}).' ValueError: Expected input batch_size (3208) to match target batch_size (3784).

I get this error when using my own dataset. My dataset has many long passages, so I think the truncation process is going wrong. This is my modification to QAFactoidDataLoader:

Can you help or give me some pointers for fixing this error?

Thanks in advance.
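A mismatch like this can arise when token sequences are truncated to the maximum length but the label sequences are not, so the flattened inputs and targets end up with different sizes. The sketch below assumes that is the cause and that there is one label per token; it is not the repository's `QAFactoidDataLoader` code, and `truncate_example` and `flattened_sizes` are hypothetical helpers.

```python
def truncate_example(subwords, labels, max_seq_len):
    """Cut tokens and their per-token labels to the same length."""
    if len(subwords) != len(labels):
        raise ValueError("each token needs exactly one label before truncation")
    return subwords[:max_seq_len], labels[:max_seq_len]


def flattened_sizes(batch, max_seq_len):
    """Total token and label counts after truncating every example."""
    n_tokens = n_labels = 0
    for subwords, labels in batch:
        subwords, labels = truncate_example(subwords, labels, max_seq_len)
        n_tokens += len(subwords)
        n_labels += len(labels)
    return n_tokens, n_labels


if __name__ == "__main__":
    # Made-up batch: one long passage (10 tokens) and one short one (4 tokens).
    batch = [(list(range(10)), ["O"] * 10), (list(range(4)), ["O"] * 4)]
    print(flattened_sizes(batch, max_seq_len=6))
```

When both counts agree after truncation, the loss function receives inputs and targets of matching size.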

LSTM Classification for Document Sentiment Task

Hi! Indobenchmark team.
I've been wondering: could I use an LSTM for the document sentiment task instead of logistic regression? I'm working on my thesis on document sentiment classification using BERT and LSTM. I'm pretty new to the topic and have already tried the example code from finetune smsa, but I'm having a hard time changing the code to use an LSTM instead of logistic regression. It would be really helpful if you could give me a hint about what I should use and what I should change in the example code. Thanks in advance 😄

Load Dataset Error

Expected Behavior

Loading the dataset at the prepare-dataset stage should run smoothly without any errors.

Actual Behavior

When I run it on my computer, the data preprocessing section fails with ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements

Question

How do I fix the error? My dataset has 2 columns: a text column and a sentiment column. There are 7,200 rows in train.csv, 1,000 rows in valid.csv, and 1,800 rows in test.csv.
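The message "Expected axis has 1 elements, new values have 2 elements" means the loaded table has a single column while two column names are being assigned. One possible cause, assuming the file is comma-separated while the loader reads it with a tab separator: every line is parsed as one field. The sketch below shows the effect with the standard-library `csv` module; the sample rows and the helper `column_count` are made up.

```python
import csv
import io

# Made-up sample: a comma-separated file, which is an assumption about the data.
RAW = "filmnya bagus sekali,positive\npelayanan lambat,negative\n"


def column_count(raw, delimiter):
    """Number of columns seen in the first row for a given delimiter."""
    first_row = next(csv.reader(io.StringIO(raw), delimiter=delimiter))
    return len(first_row)


if __name__ == "__main__":
    print(column_count(RAW, "\t"))  # tab: the whole line is a single column
    print(column_count(RAW, ","))   # comma: text and sentiment are separated
```

If this is the cause, either save the files as tab-separated or change the separator passed to `pd.read_csv` in the dataset class to match the files.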

What exactly is the environment for this library

Expected Behavior

Was hoping to replicate https://github.com/indobenchmark/indonlu/blob/master/examples/finetune_ner_grit.ipynb
and get a finetuned model :)

Actual Behavior

However, was faced with ImportError issues stemming from incompatible functions found from older versions of transformers.

Steps to Reproduce the Problem

Just run the cells with transformers version 4.3.0.
I would love to know which versions of which dependencies were used.

I've read the paper, and would love to try out this amazing work on a corpus for a school project.
