Comments (18)

virattt commented on September 27, 2024

Thank you! Both this notebook and the Data Preprocessing colab are incredibly helpful.

from i-am-a-nerd.

ncoop57 commented on September 27, 2024

@virattt Glad it was helpful!

monisha08041998 commented on September 27, 2024

Is there a limit on the number of turns in a conversation?

ncoop57 commented on September 27, 2024

@monisha08041998 It was only trained on conversations of at most 9 turns, so going beyond that may lead to poor results. Also, the maximum length of the entire conversation that GPT-2 will consider when predicting new tokens is 512 tokens, so even if you go beyond that, it can only look at that many tokens for context.
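A rough sketch of the two limits described above, assuming a conversation is stored as a list of token-id lists, one per turn. The names `MAX_TURNS`, `MAX_TOKENS`, and `build_context` are illustrative and do not come from the notebook; 50256 is the id of GPT-2's `<|endoftext|>` token.

```python
from typing import List

MAX_TURNS = 9      # turns seen during training, per the comment above
MAX_TOKENS = 512   # context length mentioned in the comment above
EOS_ID = 50256     # GPT-2's <|endoftext|> token id


def build_context(turns: List[List[int]],
                  max_turns: int = MAX_TURNS,
                  max_tokens: int = MAX_TOKENS) -> List[int]:
    """Keep only the most recent turns and tokens that fit both limits."""
    recent = turns[-max_turns:]      # drop the oldest turns first
    flat: List[int] = []
    for turn in recent:
        flat.extend(turn)
        flat.append(EOS_ID)          # each turn ends with EOS
    return flat[-max_tokens:]        # keep only the newest tokens
```

Truncating from the left keeps the most recent part of the conversation, which is usually what matters most for the next reply.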

bhuvan1643 commented on September 27, 2024

How are the pre-trained weights going to help here, given that the new data is completely different?

bhuvan1643 commented on September 27, 2024

How did you use the pre-trained tokenizer here? The pre-trained one contains only English words, but the data here is Spanish.

ncoop57 commented on September 27, 2024

@bhuvan1643 DialoGPT uses the original GPT-2 model, pretrained weights, and tokenizer. Even though the vast majority of GPT-2's training data was English, it still contained some Spanish text, and the tokenizer is a byte-level BPE, so it can encode the necessary Spanish characters and words.

I am not 100% sure the pre-trained weights help with modeling Spanish. However, Spanish overlaps quite a bit with English in vocabulary (English borrowed heavily from Latin and French), and both use the Latin alphabet, even though Spanish is a Romance language and English is a Germanic one. This overlap may help the model transfer its knowledge from English to Spanish.

I'm not sure how well this would work on more distant languages such as Chinese or Hindi, since there is almost no overlap even if you converted the words/characters to their Latin versions.
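To illustrate why a tokenizer built mostly from English text can still encode Spanish: GPT-2's tokenizer is a byte-level BPE, so its base vocabulary covers all 256 byte values. The sketch below only shows the raw UTF-8 bytes that such a tokenizer starts from; it does not run the real tokenizer, and `utf8_bytes` is a made-up helper name.

```python
def utf8_bytes(text: str) -> list:
    """Return the UTF-8 byte values a byte-level tokenizer starts from."""
    return list(text.encode("utf-8"))


for sample in ["hello", "¿cómo estás?", "你好"]:
    data = utf8_bytes(sample)
    # Every value is in range(256), so a byte-level vocabulary can always
    # represent the text; the learned merges only affect how many tokens
    # the text ends up taking.
    assert all(0 <= b < 256 for b in data)
    print(f"{sample!r}: {len(sample)} chars -> {len(data)} bytes")
```

Text that was rare in the tokenizer's training data tends to split into more tokens, which is one reason the 512-token context may fill up faster for non-English input.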

TheHmmka commented on September 27, 2024

Where did you train the large model? Did you use a cloud service or something like that?

ncoop57 commented on September 27, 2024

@TheHmmka I trained the larger model on one of my school's machines that had four 1080 Ti GPUs. I'm sure you could train it on a cloud service relatively easily, but I've never had experience with those.

etrigger commented on September 27, 2024

I cannot download the data (subtitles of Spanish TV shows), and the script to generate a CSV cannot be accessed either.
Can you please fix them?
Thanks

ncoop57 commented on September 27, 2024

Hey @etrigger, could you show me the error you are getting when trying to download or generate the data? I tried to reproduce this, but it was working for me.

etrigger commented on September 27, 2024

I can download the data now, but the script can't be opened.
https://colab.research.google.com/drive/1kKErlSSpewQbWexFPEj1rPWsYpMx69ZS?usp=sharing

Here is the error:
A network error occurred and the request could not be completed.

https://drive.google.com/drive/?action=locate&id=1kKErlSSpewQbWexFPEj1rPWsYpMx69ZS&authuser=0
A network error occurred and the request could not be completed.
GapiError: A network error occurred and the request could not be completed.
at pz.Vs [as constructor] (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:704:150)
at new pz (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:1225:318)
at Da.program_ (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:1359:470)
at Fa (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:19:336)
at Da.throw_ (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:18:402)
at Ia.throw (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:20:248)
at g (https://colab.research.google.com/v2/external/external_polymer_binary_l10n__zh_cn.js?vrz=colab-20210128-085606-RC00_354297656:62:155)

ncoop57 commented on September 27, 2024

@etrigger What an interesting error. I did a bit of digging and it seems to be an issue with Colab in certain situations. Here is an issue about it: googlecolab/colabtools#1771, but it seems like it just automagically got fixed for the person who opened it. I'd recommend trying a different browser, or incognito mode in the browser you are using, to see if that fixes it. I don't think there is anything I can do from my side other than giving you access to a converted Python script. Here is a link to it, so you can download the file and run it locally if you want (though be careful, because it takes a lot of compute, networking, and memory to generate the CSV, especially for languages that have a ton of examples): https://drive.google.com/file/d/1qvIh3zztJT7TelMYLdahOoGmypw398VD/view?usp=sharing

etrigger commented on September 27, 2024

@ncoop57 Thanks for the script.

etrigger commented on September 27, 2024

@ncoop57 I have a question about preparing the training data format.
My dialog data looks like this: each line has sentence A (source) followed by sentence B (target).
How should I organize the data for training?

ncoop57 commented on September 27, 2024

@etrigger I describe the format that DialoGPT requires in the data section of my blog: https://nathancooper.io/i-am-a-nerd/chatbot/deep-learning/gpt2/2020/05/12/chatbot-part-1.html#The-Data!. I recommend first getting your data into the format that my code expects (each column holding a different response) and then passing it to that function to generate the necessary input data for your model.
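As a rough illustration of that advice for source/target pairs, the sketch below assumes tab-separated input lines and a two-column layout with the reply in `response` and the preceding turn in `context`. The column names, the tab separator, and all function names here are assumptions for illustration; check the blog's data section for the layout the notebook's code actually expects.

```python
import csv
import io

EOS = "<|endoftext|>"  # GPT-2 / DialoGPT end-of-text marker


def pairs_to_rows(lines):
    """Turn 'source<TAB>target' lines into response/context rows."""
    rows = []
    for line in lines:
        source, target = line.rstrip("\n").split("\t", 1)
        rows.append({"response": target, "context": source})
    return rows


def row_to_text(row):
    """Join the turns oldest-first, ending each turn with EOS."""
    return row["context"] + EOS + row["response"] + EOS


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["response", "context"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

For example, `row_to_text(pairs_to_rows(["Hola\tHola, ¿qué tal?"])[0])` produces one training string in which the source and the target are each followed by the end-of-text marker.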

berkozg96 commented on September 27, 2024

I have a question about the defined train and evaluate functions. Both have:
inputs, labels = (batch, batch)
meaning that inputs and labels are exactly the same. My question is: Shouldn't the model try to learn how to respond to the given input? I feel like there is something wrong with that in this case.

Viile1 commented on September 27, 2024

Having the inputs and labels be the same is actually expected for a causal language model such as GPT-2 or DialoGPT. When you pass labels to a Hugging Face GPT-2 language-modeling head, the labels are shifted by one position inside the model, so at every position the model is trained to predict the next token, not to copy the current one.

Each training example is the whole conversation (the context turns followed by the response) joined into one sequence, so learning next-token prediction over that sequence is how the model learns to respond to a given input. Separate inputs and labels are what you would use for an encoder-decoder (seq2seq) model, where the prompt goes to the encoder and the response is the decoder target.
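For context on the `inputs, labels = (batch, batch)` question, here is a minimal pure-Python sketch of a next-token (causal) language-modeling loss in which the labels are the same sequence as the inputs and the shift by one position happens inside the loss. This mirrors the idea behind the Hugging Face GPT-2 language-modeling head, but it is not the library's code, and `causal_lm_loss` is a made-up name.

```python
import math
from typing import List, Sequence


def causal_lm_loss(logits: Sequence[Sequence[float]], labels: List[int]) -> float:
    """Mean next-token cross-entropy with labels identical to the inputs.

    logits[i] are the scores produced after reading tokens 0..i, so they
    are compared against labels[i + 1]; the last position has no target.
    """
    shift_logits = logits[:-1]   # predictions made at positions 0..n-2
    shift_labels = labels[1:]    # targets are the tokens that follow
    total = 0.0
    for scores, target in zip(shift_logits, shift_labels):
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[target]   # -log softmax(scores)[target]
    return total / len(shift_labels)


batch = [0, 1, 2]                                       # inputs and labels
predicts_next = [[0, 10, 0], [0, 0, 10], [0, 0, 0]]     # favors the next token
copies_input = [[10, 0, 0], [0, 10, 0], [0, 0, 0]]      # favors the same token

print(causal_lm_loss(predicts_next, batch))   # small loss
print(causal_lm_loss(copies_input, batch))    # large loss
```

Because of the shift, a model that merely copies its input gets a high loss, while one that predicts the following token gets a low loss.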
