microsoft / godel
Large-scale pretrained models for goal-directed dialog
Home Page: http://aka.ms/GODEL
License: MIT License
I noticed that evaluate_data() in generate.py uses the global eval_dataloader instead of the dataloader passed into the function call. I was wondering if this is intentional.
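The concern can be illustrated with a minimal sketch. The function and variable names below are hypothetical stand-ins, not the actual code from generate.py; they only show how a function that ignores its parameter ends up evaluating the global data regardless of what the caller passes.

```python
# Hypothetical stand-ins: this is NOT the code from generate.py.
eval_dataloader = ["global batch"]  # module-level loader


def evaluate_shadowed(dataloader):
    # Bug pattern: the parameter is ignored and the global is iterated.
    return [batch for batch in eval_dataloader]


def evaluate_fixed(dataloader):
    # Fix: iterate over the loader that was actually passed in.
    return [batch for batch in dataloader]


test_loader = ["test batch"]
print(evaluate_shadowed(test_loader))  # still reads the global loader
print(evaluate_fixed(test_loader))     # reads the argument
```

If the global is used on purpose, dropping the unused parameter would make that intent explicit.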
why is there virtually no documentation?
./PROGRAM_DEMO/data/
1. data
2. dummy_data
2.1.dstc
2.2.msmarco
2.3.reddit
2.4.unifedqa
3.grounded
4.ungrounded
I wonder if I am right to use EXAMPLE.JSON to train directly. If that is wrong, how can I do it correctly?
{
"Context": "Please remind me of calling to Jessie at 2PM.",
"Knowledge": "reminder_contact_name is Jessie, reminder_time is 2PM",
"Response": "Sure, set the reminder: call to Jesse at 2PM"
},
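A small sketch of how records in the shape shown above could be checked and serialized before training. It assumes a JSON Lines layout (one object per line), which this thread does not confirm; the helper name `to_jsonl` is hypothetical and not part of the GODEL repository.

```python
import json

# The three field names come from the example record above.
REQUIRED_KEYS = ("Context", "Knowledge", "Response")


def to_jsonl(records):
    """Serialize records to JSON Lines after checking the three fields."""
    lines = []
    for i, rec in enumerate(records):
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            raise ValueError(f"record {i} is missing {missing}")
        lines.append(json.dumps({k: rec[k] for k in REQUIRED_KEYS}))
    return "\n".join(lines)


example = {
    "Context": "Please remind me of calling to Jessie at 2PM.",
    "Knowledge": "reminder_contact_name is Jessie, reminder_time is 2PM",
    "Response": "Sure, set the reminder: call to Jesse at 2PM",
}
print(to_jsonl([example]))
```

Validating the keys up front surfaces malformed records before a long training run starts.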
@pengbaolin you mentioned for 1.1 you trained it for longer, is that dataset included and did you change the configuration for the training? It doesn't seem like you modified any of the code on the main branch for this
I encountered an error while trying to download the base model for the GODEL project. When executing the command
wget https://bapengstorage.blob.core.windows.net/fileshare/godel_base.tar.gz
I received the following error message: "409 Public access is not permitted on this storage account."
Followed the installation steps mentioned in the repository's readme.
Executed the command wget https://bapengstorage.blob.core.windows.net/fileshare/godel_base.tar.gz
I expected the base model to be downloaded successfully without any errors.
Instead, I encountered the error message "409 Public access is not permitted on this storage account," which prevented the download.
I would appreciate it if someone could provide assistance in resolving this issue. It would be great if an alternative download link could be provided or if the instructions in the readme could be updated to address this problem. Access to the base model is essential for proceeding with the installation and fine-tuning of Godel.
Thank you for your attention to this matter.
Any discussions or plans to offer a TensorFlow variant on HuggingFace? Thanks
I tried to do something like a list in Python, and then generate a response based on these answers, but it didn't lead to anything good. Can someone share the code or tell me how to do it all?
So I've downloaded enough to have a go at it. What commands to run to turn the 1.1 large model into ready to run material and will it remember my interactions?
I ran into a problem when using DialoGPT to download the Reddit dataset during the "Downloading and Extracting Data" part: I get an empty train.db folder.
Can anyone help me with downloading the Reddit data?
Thanks
Data preparation involves downloading Reddit comment and submission data from https://files.pushshift.io/reddit/ and the documentation says the total data is around 700GB. However, the actual size of the data is around 2TB. Up to which YYYY-MM did you use the Reddit data for training GODEL?
Hi, would you be interested in sharing your models in the Hugging Face Hub? The Hub offers free hosting of over 54K models, and it would make your work more accessible and visible to the rest of the ML community. We already have an organization for microsoft, similar to GitHub, on Hugging Face for adding models/datasets/spaces (web demos): https://huggingface.co/microsoft
Some of the benefits of sharing your models through the Hub would be:
Creating the repos and adding new models should be a relatively straightforward process if you've used Git before. This is a step-by-step guide explaining the process in case you're interested.
and here are guides for adding spaces(web demos) and datasets to your org
How to add a Space: https://huggingface.co/blog/gradio-spaces
uploading a dataset: https://huggingface.co/docs/datasets/upload_dataset.html
as well as examples of web demos in the microsoft organization:
github: https://github.com/microsoft/unicl
Spaces: https://huggingface.co/spaces/microsoft/unicl-img-recog-demo
Please let us know if you would be interested and if you have any questions.
Hugging Face team
cc: @osanseviero
When running train.py using the train_file and validation_file arguments, I get the following error.
Traceback (most recent call last):
  File "GODEL/train.py", line 633, in <module>
    main()
  File "GODEL/train.py", line 426, in main
    test_dataset = lm_datasets["test"]
  File "/GODEL/.py38venv/lib/python3.8/site-packages/datasets/dataset_dict.py", line 41, in __getitem__
    return super().__getitem__(k)
KeyError: 'test'
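The traceback indicates that lm_datasets has no "test" key when only train and validation files are supplied. A minimal sketch of a guard follows; `pick_split` is a hypothetical helper, and a plain dict stands in for the DatasetDict that train.py builds. Falling back to the validation split is an assumption about what a user might want, not the maintainers' fix.

```python
def pick_split(splits, name, fallback=None):
    """Return splits[name], or splits[fallback] when name is absent."""
    if name in splits:
        return splits[name]
    if fallback is not None and fallback in splits:
        return splits[fallback]
    raise KeyError(f"no '{name}' split; available: {sorted(splits)}")


# A plain dict stands in for the DatasetDict (assumption for illustration).
lm_datasets = {"train": ["t1", "t2"], "validation": ["v1"]}
test_dataset = pick_split(lm_datasets, "test", fallback="validation")
print(test_dataset)
```

Without a fallback the helper still raises KeyError, but the message lists the splits that are available.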
The output text is different on ROCm GPU. It should have been the same.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def to_gpu(x):
    # Identity by default; swap in the commented call to run on the GPU.
    return x  # .to("cuda:0")  # Uncomment to test

tokenizer = AutoTokenizer.from_pretrained("microsoft/GODEL-v1_1-large-seq2seq")
model = to_gpu(AutoModelForSeq2SeqLM.from_pretrained("microsoft/GODEL-v1_1-large-seq2seq"))

def generate(instruction, knowledge, dialog):
    # Prefix the knowledge only when some is provided.
    if knowledge != '':
        knowledge = '[KNOWLEDGE] ' + knowledge
    # Dialog turns are joined with the ' EOS ' separator.
    dialog = ' EOS '.join(dialog)
    query = f"{instruction} [CONTEXT] {dialog} {knowledge}"
    input_ids = to_gpu(tokenizer(f"{query}", return_tensors="pt")).input_ids
    outputs = model.generate(input_ids, max_length=128, min_length=8, top_p=0.9, do_sample=True)
    output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return output

instruction = f'Instruction: given a dialog context, you need to response empathically.'
# Leave the knowledge empty
knowledge = ''
dialog = [
    'Does money buy happiness?',
    'It is a question. Money buys you a lot of things, but not enough to buy happiness.',
    'What is the best way to buy happiness ?'
]
# dialog = ["Hey my name is Thomas! How are you?"]  # Uncomment to test
response = generate(instruction, knowledge, dialog)
print(response)
absl-py==1.0.0
astunparse==1.6.3
cachetools==5.1.0
certifi==2022.12.7
charset-normalizer==3.0.1
click==8.1.3
contourpy==1.0.5
cycler==0.11.0
filelock==3.9.0
flatbuffers==1.12
fonttools==4.37.4
gast==0.4.0
google-auth==2.6.6
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.46.3
h5py==3.7.0
huggingface-hub==0.12.1
idna==3.4
joblib==1.2.0
keras==2.9.0
Keras-Preprocessing==1.1.2
keybert==0.7.0
kiwisolver==1.4.4
libclang==14.0.1
Markdown==3.3.7
markdown-it-py==2.2.0
matplotlib==3.6.1
mdurl==0.1.2
nltk==3.8.1
numpy==1.24.2
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.0
opt-einsum==3.3.0
packaging==23.0
pandas==1.4.2
Pillow==9.2.0
protobuf==3.19.4
psutil==5.9.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
Pygments==2.14.0
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2022.1
PyYAML==6.0
regex==2022.10.31
requests==2.28.2
requests-oauthlib==1.3.1
rich==13.3.1
rsa==4.8
scikit-learn==1.2.1
scipy==1.10.1
sentence-transformers==2.2.2
sentencepiece==0.1.96
six==1.16.0
tensorboard==2.9.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.9.1
tensorflow-estimator==2.9.0
tensorflow-io-gcs-filesystem==0.26.0
termcolor==1.1.0
threadpoolctl==3.1.0
tokenizers==0.13.2
torch==1.13.1+rocm5.2
torchvision==0.14.1
tqdm==4.64.1
transformers==4.26.1
typing_extensions==4.5.0
urllib3==1.26.14
Werkzeug==2.1.2
wrapt==1.14.1
CPU | GPU |
---|---|
Money doesn't buy happiness. It only gives you money for things you have. It doesn't buy happiness. | Be happy, not to be. |
Hello Thomas, I'm fine. How are you? | Hi, I'm good. How are you? |
It may be a dumb question, but can someone please guide me on how to convert our own documents (often long documents with paragraphs, bullet points, tables, etc.) into this model's input format so that they can be used with the train.py script? An example of one paragraph converted to training input would help me.
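One possible starting point, sketched under assumptions: the document is split on blank lines into chunks, and each chunk becomes the Knowledge field of a record in the Context/Knowledge/Response shape quoted in another issue on this page. The helper names and the 100-word cap are hypothetical choices, and the Context and Response still have to be written or collected by hand, since the document only supplies the knowledge.

```python
def split_paragraphs(document, max_words=100):
    """Split on blank lines, then cap each chunk at max_words words."""
    chunks = []
    for para in document.split("\n\n"):
        words = para.split()
        # Empty paragraphs produce no chunks because the range is empty.
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks


def make_record(context, knowledge, response):
    """Build one training record in the three-field shape."""
    return {"Context": context, "Knowledge": knowledge, "Response": response}


doc = "GODEL is a dialog model.\n\nIt can ground responses in text."
chunks = split_paragraphs(doc)
record = make_record("What is GODEL?", chunks[0], "It is a dialog model.")
print(record)
```

Tables and bullet lists would need to be flattened to plain text first; how best to do that is not covered in this thread.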
In trying to run the demo, I'm running into an error I'm not sure how to resolve:
!python examples/dstc9/dstc9_server.py
Traceback (most recent call last):
  File "examples/dstc9/dstc9_server.py", line 54, in <module>
    from DialoGLM.server import *
ModuleNotFoundError: No module named 'DialoGLM'
I found this project, but I'm not sure it's the right project or how to interconnect them:
https://github.com/microsoft/DialogLM
Thanks for the code release.
But where can I find the module for this import?
from DialoGLM.server import *
GODEL/examples/dstc9/dstc9_server.py
Line 54 in c1af42a
First of all, thank you for making this work public!
I'm curious about the model size shown in the README for the released GODEL_XL model (based on GPT-J). In the table in the README it lists the model size as "2.7B". My understanding is that GPT-J has 6B parameters.
Is the number of parameters for GODEL XL listed in the README correct?
Once the computer restarts, changing nothing, I get the error below.
The script has been changed already to GODEL.server on line 54.
Traceback (most recent call last):
  File "examples/dstc9/dstc9_server.py", line 54, in <module>
    from GODEL.server import *
ModuleNotFoundError: No module named 'GODEL'
nltk==3.4.1 was added from adee686 while nltk==3.7 was already there.
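Duplicate pins like this can be caught with a short check. The sketch below is a hypothetical helper, not part of the repository, and it only handles simple `name==version` lines; extras, markers, and other specifiers are ignored.

```python
import re
from collections import defaultdict


def find_conflicting_pins(requirements_text):
    """Map each package pinned more than once to its distinct versions."""
    pins = defaultdict(set)
    for line in requirements_text.splitlines():
        # Only plain 'name==version' pins are recognized (assumption).
        match = re.match(r"^\s*([A-Za-z0-9_.\-]+)==(\S+)", line)
        if match:
            pins[match.group(1).lower()].add(match.group(2))
    return {name: sorted(v) for name, v in pins.items() if len(v) > 1}


sample = "nltk==3.7\ntorch==1.13.1\nnltk==3.4.1\n"
print(find_conflicting_pins(sample))  # {'nltk': ['3.4.1', '3.7']}
```

Running it against the contents of requirements.txt would list any package that is pinned to more than one version.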
I was able to successfully complete fine-tuning by following the "fine tuning" example in the readme, using custom train, validation, and test datasets that I transformed into GODEL format.
I can see fine-tuned models under 4 checkpoint directories that were created as a result of said training. The final log output of the training was: "09/11/2022 02:15:34 - INFO - main - Saving model outputs to output/test-step-44200"
When I look at output/test-step-44200 (this is the same for all of the test-step-NNN files), it looks like:
[
...,
[
"todo"
],
[
"todo"
],
[
"todo"
],
...
]
Did you use a specific decoding strategy other than beam search, and which values are needed to reproduce the results? In the paper you mentioned 5 beams.
Does this require few-shot examples to produce coherent results? I am getting odd results: either the knowledge is copied in the response or it generates unk |knowledge|>
What is the strategy or template to prompt? I have read the paper; it uses the <|environment|> token between two exchanges of messages but
Reading the Server Code it looks like
context: "sentence. <|knowledge|> knowledge sentence. => "
generated: "sentence. <|knowledge|> knowledge sentence. => generated sentence."
In the training data, one example for knowledge grounded looks like
START EOS:
I am considering downloading the training data to see how the model was trained so I can fix this issue, but it seems like a lot of effort to get a demo working properly.
It seems like the claims in the paper are that it does well in metrics, but there doesn't seem to be much discussion on how well it generates responses given knowledge grounded responses. Thanks
If you can provide an example of what 3 turns of knowledge grounded dialogue looks like and what 3 turns of no knowledge grounded dialogue looks like that would save an immense amount of time for people wanting to leverage this project.
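As a rough illustration only: the sketch below follows the query layout used in the GODEL-v1_1-large-seq2seq snippet quoted in another issue on this page (turns joined with " EOS ", an optional "[KNOWLEDGE]" segment). It may not match the `<|knowledge|>` format from the server code quoted above, the maintainers have not confirmed it here, and the grounded dialog turns and knowledge sentence are made up for the example. `build_query` is a hypothetical helper.

```python
def build_query(instruction, dialog, knowledge=""):
    """Join turns with ' EOS ' and append knowledge only when present."""
    context = " EOS ".join(dialog)
    query = f"{instruction} [CONTEXT] {context}"
    if knowledge:
        query += f" [KNOWLEDGE] {knowledge}"
    return query


instruction = "Instruction: given a dialog context, you need to response empathically."

# Three turns without knowledge (turns taken from the snippet on this page).
ungrounded = build_query(instruction, [
    "Does money buy happiness?",
    "It is a question. Money buys you a lot of things, but not enough to buy happiness.",
    "What is the best way to buy happiness ?",
])

# Three turns with knowledge (turns and knowledge are illustrative only).
grounded = build_query(instruction, [
    "Can you set a reminder?",
    "Sure, who should I remind you to call?",
    "Jessie, at 2PM.",
], knowledge="reminder_contact_name is Jessie, reminder_time is 2PM")

print(ungrounded)
print(grounded)
```

Unlike the quoted snippet, this version does not leave a trailing space when the knowledge is empty.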
Hello, I wasn't able to find anywhere what is the maximum Knowledge length for the model. I'd like to know how many tokens I can feed as knowledge before they are cut off.
Thanks
Table 6 of the paper compares the effect of full fine-tuning on four tasks between GPT-3 and GODEL-XL. Why does this result look worse than GODEL-L in table 3-5 under a larger scale (GODEL-L is set with few-shot).
The file GODEL/requirements.txt lists two versions of nltk
line 6: nltk==3.7
line 19: nltk==3.4.1
Resulting in:
The hosted Inference API on huggingface seems to be broken. https://huggingface.co/microsoft/GODEL-v1_1-base-seq2seq
First off, the README is unclear: it suggests running python EXAMPLE_server.py to start a local server where I presume you can type questions, but there is no such file, and there is a vague comment that points to the dstc9_server.py file instead.
So I presume you are actually supposed to run python examples/dstc9/dstc9_server.py. However, doing so downloads 850M of some unknown data and then fails with the error below.
On any system without NVidia GPU enter the project and run the following commands:
λ python examples/dstc9/dstc9_server.py
Downloading: 100%|██████████| 1.18k/1.18k [00:00<00:00, 978kB/s]
Downloading: 100%|██████████| 850M/850M [03:51<00:00, 3.85MB/s]
Traceback (most recent call last):
  File "/home/constantine/Projects/GODEL/examples/dstc9/dstc9_server.py", line 57, in <module>
    main()
  File "/home/constantine/Projects/GODEL/GODEL/server.py", line 56, in main
    model = model.to(args.device)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 907, in to
    return self._apply(convert)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 905, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 216, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
There should be no error, because the project isn't specific to any GPU, and I have AMD and Intel GPU drivers on my system
There's an error about a GPU which isn't even present on my system
I had to apply this PR to make it work.
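A common way to avoid this class of failure is to fall back to the CPU when CUDA is unavailable. The sketch below is not the referenced PR; `pick_device` is a hypothetical helper, and it assumes the device is given as a string such as "cuda" or "cpu", as the `args.device` in the traceback suggests.

```python
def pick_device(requested="cuda"):
    """Return the requested device, or 'cpu' when CUDA cannot be used."""
    try:
        import torch
    except ImportError:
        # Without torch there is nothing to place on a GPU.
        return "cpu"
    if requested.startswith("cuda") and not torch.cuda.is_available():
        return "cpu"
    return requested


device = pick_device("cuda")
print(device)
```

The returned value could then be passed where the server calls `model.to(...)`, so machines without an NVIDIA driver run on the CPU instead of raising.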
Hi,
Recently I've been involved in adding support for GODEL to the rust-bert repository, and we were wondering if there is any chance to get those 2 PRs merged:
Thank you
Hi.
I appreciate your model and code.
I am also interested in the dataset with the human evaluation results.
If possible, could you share the generated dataset with the human evaluation scores?
The requirements.txt file contains nltk==3.7 and nltk==3.4, which of course results in conflicting dependencies. Which version should be kept?
EDIT: I saw that there is already an open request. I will close this issue.
Hi Baolin, Could you provide the evaluation scripts for F1^R, F1^K, etc.? Many thanks!