Open-Dialog Chatbots for Learning New Languages [Part 1] | IAmANerd about i-am-a-nerd HOT 18 OPEN

utterances-bot commented on September 27, 2024

Open-Dialog Chatbots for Learning New Languages [Part 1] | IAmANerd

from i-am-a-nerd.

Comments (18)

virattt commented on September 27, 2024

Thank you! Both this notebook and the Data Preprocessing colab are incredibly helpful.

ncoop57 commented on September 27, 2024

@virattt Glad it was helpful!

monisha08041998 commented on September 27, 2024

Is there a limit for the set of conversations?

ncoop57 commented on September 27, 2024

@monisha08041998 It was only trained on conversations with a max length of 9, so going beyond that may lead to poor results. Also, the max length of the entire conversation that GPT-2 will consider when predicting new tokens is 512 tokens, so even if you go beyond that, it can only look at that many tokens for context.
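
The two limits mentioned here can be applied when building the model's context. This is a minimal sketch, not the blog's code: `recent_history` is a hypothetical helper, reading the 9 as a number of turns is an assumption, and the choice to drop the oldest history first is also an assumption.

```python
def recent_history(turns, max_turns=9, max_tokens=512):
    """Return the token ids the model would see as context.

    `turns` is a list of turns, each a list of token ids, oldest first.
    The defaults mirror the numbers in the comment above (assumed units:
    9 turns, 512 tokens).
    """
    # Keep only the most recent turns.
    kept = turns[-max_turns:]
    # Flatten the kept turns into one token sequence.
    flat = [tok for turn in kept for tok in turn]
    # Keep only the most recent tokens if the sequence is still too long.
    return flat[-max_tokens:]


# Example: 12 turns of 100 tokens each, where every token id is the turn index.
history = [[i] * 100 for i in range(12)]
context = recent_history(history)
print(len(context), context[0], context[-1])
```

Anything older than the returned window has no effect on the next prediction, which is why very long chats can lose track of what was said early on.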

bhuvan1643 commented on September 27, 2024

How are the pre-trained weights going to help here, given that the new data is completely different?

bhuvan1643 commented on September 27, 2024

How did you use the pre-trained tokenizer here? The pre-trained one contains only English words, but the data here is Spanish.

ncoop57 commented on September 27, 2024

@bhuvan1643 DialoGPT used the original GPT2 model, pretrained weights, and tokenizer. Even though the vast majority of the data was English, it still contained some Spanish text and therefore the necessary Spanish characters/words.

I am not 100% sure the pre-trained weights help with modeling the Spanish language. However, Spanish has a lot of overlap in vocabulary and grammatical structure with English: English is a Germanic language rather than a Romance one, but it borrowed a large share of its vocabulary from Latin and French. This overlap may help the model transfer its knowledge from English to Spanish.

I'm not sure how well this would work on more distant languages like Chinese, Hindi, etc., since there is almost no overlap even if you converted the words/characters to their Latin versions.
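
One further reason the pre-trained tokenizer can handle Spanish text at all: GPT-2's tokenizer is a byte-level BPE, so any UTF-8 string breaks down into bytes that are all in its base vocabulary, and accented characters never become unknown tokens. The sketch below only illustrates that byte-level fallback in plain Python. It is not the tokenizer's actual merge procedure, and `utf8_byte_ids` is a hypothetical helper name.

```python
def utf8_byte_ids(text):
    """Return the UTF-8 byte values of `text`, each in the range 0..255."""
    return list(text.encode("utf-8"))


spanish = "¿Cómo estás?"
ids = utf8_byte_ids(spanish)

# Every symbol is representable, including the accented characters,
# which simply take two bytes each instead of one.
print(len(spanish), "characters ->", len(ids), "bytes")

# The mapping is lossless: the original text can be recovered from the bytes.
assert bytes(ids).decode("utf-8") == spanish
```

The practical cost is efficiency rather than coverage: text that is rare in the pre-training data tends to be split into more, shorter tokens, so Spanish sentences use up the context window faster than comparable English ones.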

TheHmmka commented on September 27, 2024

Where did you train the large model? Did you use a cloud service or something like that?

ncoop57 commented on September 27, 2024

@TheHmmka I trained the larger model on one of my school's machines, which had four 1080 Ti GPUs. I'm sure you could train it on a cloud service relatively easily, but I've never had experience with those.

etrigger commented on September 27, 2024

I cannot download the data (subtitles of Spanish TV shows), and the script to generate a CSV cannot be accessed either.
Can you please fix them?
Thanks

ncoop57 commented on September 27, 2024

Hey @etrigger, could you show me the error you are getting when trying to download or generate the data? I tried to reproduce this, but it was working for me.

etrigger commented on September 27, 2024

I can download the data now, but the script can't be opened.
https://colab.research.google.com/drive/1kKErlSSpewQbWexFPEj1rPWsYpMx69ZS?usp=sharing

Here is the error:
A network error occurred and the request could not be completed.

https://drive.google.com/drive/?action=locate&id=1kKErlSSpewQbWexFPEj1rPWsYpMx69ZS&authuser=0
A network error occurred and the request could not be completed.
GapiError: A network error occurred and the request could not be completed.
at pz.Vs [as constructor] (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:704:150)
at new pz (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:1225:318)
at Da.program_ (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:1359:470)
at Fa (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:19:336)
at Da.throw_ (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:18:402)
at Ia.throw (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:20:248)
at g (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:62:155)

ncoop57 commented on September 27, 2024

@etrigger what an interesting error. I did a bit of digging, and it seems to be an issue with Colab in certain situations. Here is an issue about it: googlecolab/colabtools#1771, but it seems like it just automagically got fixed for the person who opened it. I'd recommend trying a different browser, or incognito mode in the browser you are using, to see if that fixes it. I don't think there is anything I can do from my side other than giving you access to a converted Python script so you can download it yourself. Here is a link to it; you can download the file and run it locally if you want (though be careful, because it takes a lot of compute, networking, and memory to generate the CSV, especially for languages that have a ton of examples): https://drive.google.com/file/d/1qvIh3zztJT7TelMYLdahOoGmypw398VD/view?usp=sharing

etrigger commented on September 27, 2024

@ncoop57 Thanks for the script.

etrigger commented on September 27, 2024

@ncoop57 A question on preparing the training data format:
I have dialog data like this: each line has sentence A (source) followed by sentence B (target).
How should I organize the data for training?

ncoop57 commented on September 27, 2024

@etrigger I have the format that DialoGPT requires in the data section of my blog: https://nathancooper.io/i-am-a-nerd/chatbot/deep-learning/gpt2/2020/05/12/chatbot-part-1.html#The-Data!. I recommend first getting it into the format that my code expects (each column having a different response) and then tossing it into that function to generate the necessary input data for your model.
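
For source/target pairs like the ones described in the question, the reshaping could look roughly like this. It is a sketch under assumptions, not the blog's function: the column names `context` and `response` and both helper names are hypothetical, and it relies only on the fact that DialoGPT expects the turns of a conversation concatenated into one sequence, each turn ending with GPT-2's end-of-text token.

```python
EOS = "<|endoftext|>"  # GPT-2's end-of-text token, used by DialoGPT to end each turn


def pairs_to_rows(pairs):
    """Turn (source, target) sentence pairs into rows with one column per turn."""
    return [{"context": src.strip(), "response": tgt.strip()} for src, tgt in pairs]


def row_to_text(row, columns=("context", "response")):
    """Concatenate the turns of one row, oldest first, ending each turn with EOS."""
    return "".join(row[col] + EOS for col in columns)


pairs = [
    ("Hola, ¿cómo estás?", "Muy bien, gracias."),
    ("¿Dónde vives?", "Vivo en Madrid."),
]
for row in pairs_to_rows(pairs):
    # Each printed line is one training example, ready to be tokenized.
    print(row_to_text(row))
```

With more than one turn of history per example, the same idea extends by adding further context columns and listing them oldest first in `columns`.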

berkozg96 commented on September 27, 2024

I have a question about the defined train and evaluate functions. Both have:
inputs, labels = (batch, batch)
meaning that inputs and labels are exactly the same. My question is: Shouldn't the model try to learn how to respond to the given input? I feel like there is something wrong with that in this case.

Viile1 commented on September 27, 2024

It looks odd, but inputs and labels being the same is expected when training a causal language model such as GPT-2. The model is trained to predict the next token at every position. Assuming the training code uses a Hugging Face GPT-2 language-modeling head, the labels are shifted by one position inside the model, so the prediction at position t is compared with the token at position t+1 rather than with the token it was given.

Because each conversation is fed in as one sequence of turns, learning to predict the next token also means learning to produce the response that follows a given context. The model is not rewarded for copying its input, so identical inputs and labels do not make it simply memorize an identity mapping.
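
On the `inputs, labels = (batch, batch)` question in this thread, it may help to see the loss a causal language model computes when it is handed identical inputs and labels. The plain-Python sketch below is a stand-in for illustration, not the library's implementation: `shifted_lm_loss` and `one_hot_logits` are hypothetical names, and the two toy "models" are just hand-built logits.

```python
import math


def shifted_lm_loss(logits, labels):
    """Mean next-token cross-entropy with labels shifted by one position."""
    # Position t predicts token t+1: drop the last logit row and the first label.
    shift_logits = logits[:-1]
    shift_labels = labels[1:]
    total = 0.0
    for row, target in zip(shift_logits, shift_labels):
        # Negative log-softmax of the target, with max-subtraction for stability.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(shift_labels)


def one_hot_logits(token_ids, vocab_size, scale=10.0):
    """Logits that put almost all probability on the given token at each position."""
    return [[scale if v == t else 0.0 for v in range(vocab_size)] for t in token_ids]


batch = [0, 1, 2, 3]
inputs, labels = (batch, batch)  # same pattern as in the training loop

copy_model = one_hot_logits(inputs, vocab_size=4)            # echoes the current token
next_model = one_hot_logits(inputs[1:] + [0], vocab_size=4)  # predicts the following token

print("copying the input:   ", shifted_lm_loss(copy_model, labels))
print("predicting next token:", shifted_lm_loss(next_model, labels))
```

Because of the shift, the model that echoes its input gets a high loss and the model that predicts the following token gets a loss near zero, even though `inputs` and `labels` are the same list.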
