
microsoft / godel

Stars: 836 · Watchers: 20 · Forks: 110 · Size: 51.02 MB

Large-scale pretrained models for goal-directed dialog

Home Page: http://aka.ms/GODEL

License: MIT License

Languages: Python 87.60%, JavaScript 0.12%, HTML 0.36%, Vue 1.95%, SCSS 4.17%, Shell 1.47%, Makefile 4.32%
Topics: data-processing, dialogue, dialogue-systems, machine-learning, text-data, text-generation, transformers, conversational-ai, language-grounding, grounded-generation

godel's Issues

Some questions about datasets

How do these files work in the project? (The layout is a bit different from your paper.)

./PROGRAM_DEMO/data/

1. data
2. dummy_data
    2.1. dstc
    2.2. msmarco
    2.3. reddit
    2.4. unifedqa
3. grounded
4. ungrounded

Also, I wonder how I can train the project with my own datasets.

Am I right that I can train directly on a file like EXAMPLE.JSON below? If that is wrong, how can I do it correctly?

  {
    "Context": "Please remind me of calling to Jessie at 2PM.",
    "Knowledge": "reminder_contact_name is Jessie, reminder_time is 2PM",
    "Response": "Sure, set the reminder: call to Jesse at 2PM"
  },
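For reference, a minimal sketch of producing such a file, assuming train.py consumes a JSON list of Context/Knowledge/Response records and is pointed at it via the --train_file / --validation_file arguments mentioned in another issue on this page (the record below is a made-up example):

import json

# Hypothetical records in the same Context/Knowledge/Response shape as above.
examples = [
    {
        "Context": "Book a table for two at Luigi's tonight.",
        "Knowledge": "restaurant_name is Luigi's, party_size is 2, time is tonight",
        "Response": "Done, I booked a table for two at Luigi's for tonight.",
    },
]

# Write the records as a JSON list; pass the file to train.py via --train_file.
with open("train.json", "w", encoding="utf-8") as f:
    json.dump(examples, f, ensure_ascii=False, indent=2)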

Error downloading base model - "409 Public access is not permitted on this storage account."

Description:

I encountered an error while trying to download the base model for the GODEL project. When executing the command
wget https://bapengstorage.blob.core.windows.net/fileshare/godel_base.tar.gz, I received the following error message: "409 Public access is not permitted on this storage account."

Steps to Reproduce:

1. Followed the installation steps in the repository's README.
2. Executed wget https://bapengstorage.blob.core.windows.net/fileshare/godel_base.tar.gz.

Expected Behavior:

I expected the base model to be downloaded successfully without any errors.

Actual Behavior:

Instead, I encountered the error message "409 Public access is not permitted on this storage account," which prevented the download.
I would appreciate assistance in resolving this issue. It would be great if an alternative download link could be provided, or if the README instructions could be updated to address this problem. Access to the base model is essential for installing and fine-tuning GODEL.

Thank you for your attention to this matter.
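One possible workaround while the blob is locked down, assuming the fine-tuned checkpoints on the Hugging Face Hub cover your use case (godel_base.tar.gz itself may be a different pretraining checkpoint, so this is only an approximation):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Pull a published checkpoint from the Hugging Face Hub instead of the Azure blob.
tokenizer = AutoTokenizer.from_pretrained("microsoft/GODEL-v1_1-base-seq2seq")
model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/GODEL-v1_1-base-seq2seq")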

TensorFlow support

Any discussions or plans to offer a TensorFlow variant on HuggingFace? Thanks

How to recreate saving dialog history

I tried keeping the turns in a Python list and then generating a response from those answers, but it didn't lead to anything good. Can someone share code or explain how to do it?
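A minimal sketch of one way to do it, assuming the prompt format from the v1_1 model cards (the same format as the generate() snippet in the ROCm issue below): keep a running list of turns and append both the user's message and the model's reply after every exchange.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/GODEL-v1_1-large-seq2seq")
model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/GODEL-v1_1-large-seq2seq")

def generate(instruction, knowledge, dialog):
    # Turns are joined with ' EOS ', as in the model card snippet.
    if knowledge:
        knowledge = "[KNOWLEDGE] " + knowledge
    query = f"{instruction} [CONTEXT] {' EOS '.join(dialog)} {knowledge}"
    input_ids = tokenizer(query, return_tensors="pt").input_ids
    outputs = model.generate(input_ids, max_length=128, min_length=8, top_p=0.9, do_sample=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

instruction = "Instruction: given a dialog context, you need to response empathically."
dialog = []  # running history of alternating user/bot turns

while True:
    dialog.append(input("you> "))   # record the user's turn
    response = generate(instruction, "", dialog)
    dialog.append(response)         # record the bot's reply as context for the next turn
    print("bot>", response)

Note that the whole history is re-encoded on every turn, so very long conversations will eventually exceed the encoder's input budget and older turns would need to be truncated.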

What to do after downloading enough.

So I've downloaded enough to have a go at it. What commands do I run to turn the v1.1 large model into something ready to use, and will it remember my interactions?

Cannot download data

I ran into a problem using the DialoGPT scripts to download the Reddit dataset during the "Downloading and Extracting Data" step, and ended up with an empty train.db folder.

Can anyone help me with downloading the Reddit data?
Thanks

Reddit Data

Data preparation involves downloading Reddit comment and submission data from https://files.pushshift.io/reddit/, and the README says the total is around 700GB. However, the actual size of the data is around ~2TB. Up to which YYYY-MM of Reddit data did you use for training GODEL?

adding web demo and models on Hugging Face

Hi, would you be interested in sharing your models on the Hugging Face Hub? The Hub offers free hosting of over 54K models, and it would make your work more accessible and visible to the rest of the ML community. We already have an organization for Microsoft on Hugging Face, mirroring the GitHub org, for adding models/datasets/spaces (web demos): https://huggingface.co/microsoft

Some of the benefits of sharing your models through the Hub would be:

  • versioning, commit history and diffs
  • repos provide useful metadata about their tasks, languages, metrics, etc that make them discoverable
  • multiple features from TensorBoard visualizations, PapersWithCode integration, and more
  • wider reach of your work to the ecosystem

Creating the repos and adding new models should be a relatively straightforward process if you've used Git before. This is a step-by-step guide explaining the process in case you're interested.

And here are guides for adding Spaces (web demos) and datasets to your org:

How to add a Space: https://huggingface.co/blog/gradio-spaces
uploading a dataset: https://huggingface.co/docs/datasets/upload_dataset.html

as well as examples of web demos in the microsoft organization:

github: https://github.com/microsoft/unicl
Spaces: https://huggingface.co/spaces/microsoft/unicl-img-recog-demo

Please let us know if you would be interested and if you have any questions.

Hugging Face team

cc: @osanseviero

Need test_file argument

When running train.py with the train_file and validation_file arguments, I get the following error.

Traceback (most recent call last):
  File "GODEL/train.py", line 633, in <module>
    main()
  File "GODEL/train.py", line 426, in main
    test_dataset = lm_datasets["test"]
  File "/GODEL/.py38venv/lib/python3.8/site-packages/datasets/dataset_dict.py", line 41, in __getitem__
    return super().__getitem__(k)
KeyError: 'test'
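The cleanest fix would be a test_file argument, as the title says. In the meantime, a minimal workaround (an assumption about train.py's structure, inferred only from this traceback) is to guard the lookup near line 426:

# Around train.py:426 - build the test set only when a "test" split exists.
# (Any later use of test_dataset needs the same guard.)
test_dataset = lm_datasets["test"] if "test" in lm_datasets else None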

Different output on ROCm

The output text is different on a ROCm GPU. It should have been the same as on CPU.

Code

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def to_gpu(x):
    return x  # .to("cuda:0")  # Uncomment to test on GPU

tokenizer = AutoTokenizer.from_pretrained("microsoft/GODEL-v1_1-large-seq2seq")
model = to_gpu(AutoModelForSeq2SeqLM.from_pretrained("microsoft/GODEL-v1_1-large-seq2seq"))

def generate(instruction, knowledge, dialog):
    if knowledge != '':
        knowledge = '[KNOWLEDGE] ' + knowledge
    dialog = ' EOS '.join(dialog)
    query = f"{instruction} [CONTEXT] {dialog} {knowledge}"
    input_ids = to_gpu(tokenizer(query, return_tensors="pt")).input_ids
    # Note: do_sample=True makes generation stochastic, so outputs vary across
    # runs (and devices) unless the seed is fixed, e.g. torch.manual_seed(0).
    outputs = model.generate(input_ids, max_length=128, min_length=8, top_p=0.9, do_sample=True)
    output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return output

instruction = 'Instruction: given a dialog context, you need to response empathically.'
# Leave the knowledge empty
knowledge = ''
dialog = [
    'Does money buy happiness?',
    'It is a question. Money buys you a lot of things, but not enough to buy happiness.',
    'What is the best way to buy happiness ?'
]
# dialog = ["Hey my name is Thomas! How are you?"]  # Uncomment to test
response = generate(instruction, knowledge, dialog)
print(response)
requirements.txt
absl-py==1.0.0
astunparse==1.6.3
cachetools==5.1.0
certifi==2022.12.7
charset-normalizer==3.0.1
click==8.1.3
contourpy==1.0.5
cycler==0.11.0
filelock==3.9.0
flatbuffers==1.12
fonttools==4.37.4
gast==0.4.0
google-auth==2.6.6
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.46.3
h5py==3.7.0
huggingface-hub==0.12.1
idna==3.4
joblib==1.2.0
keras==2.9.0
Keras-Preprocessing==1.1.2
keybert==0.7.0
kiwisolver==1.4.4
libclang==14.0.1
Markdown==3.3.7
markdown-it-py==2.2.0
matplotlib==3.6.1
mdurl==0.1.2
nltk==3.8.1
numpy==1.24.2
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.0
opt-einsum==3.3.0
packaging==23.0
pandas==1.4.2
Pillow==9.2.0
protobuf==3.19.4
psutil==5.9.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
Pygments==2.14.0
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2022.1
PyYAML==6.0
regex==2022.10.31
requests==2.28.2
requests-oauthlib==1.3.1
rich==13.3.1
rsa==4.8
scikit-learn==1.2.1
scipy==1.10.1
sentence-transformers==2.2.2
sentencepiece==0.1.96
six==1.16.0
tensorboard==2.9.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.9.1
tensorflow-estimator==2.9.0
tensorflow-io-gcs-filesystem==0.26.0
termcolor==1.1.0
threadpoolctl==3.1.0
tokenizers==0.13.2
torch==1.13.1+rocm5.2
torchvision==0.14.1
tqdm==4.64.1
transformers==4.26.1
typing_extensions==4.5.0
urllib3==1.26.14
Werkzeug==2.1.2
wrapt==1.14.1

Output

CPU | GPU
Money doesn't buy happiness. It only gives you money for things you have. | It doesn't buy happiness. Be happy, not to be.
Hello Thomas, I'm fine. How are you? | Hi, I'm good. How are you?

How to train with our own knowledge base documents

It may be a dumb question, but can someone please guide me on how to convert our own documents (often large documents with paragraphs, bullet points, tables, etc.) into this model's input data so they can be used with the train.py script? One example of a paragraph converted to training input would help me a lot.
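A minimal sketch of one approach, assuming the Context/Knowledge/Response JSON shape from the dataset issue above: treat each paragraph of your document as a Knowledge string and pair it with a question and answer you author (or mine from logs) yourself.

import json

# Hypothetical input: a plain-text document with paragraphs separated by blank lines.
with open("my_document.txt", encoding="utf-8") as f:
    paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

records = []
for para in paragraphs:
    # Each record still needs a (question, answer) pair per paragraph; the
    # placeholders below would be written by hand or mined from support logs.
    records.append({
        "Context": "A QUESTION ANSWERABLE FROM THIS PARAGRAPH",
        "Knowledge": para,
        "Response": "THE ANSWER, GROUNDED IN THE PARAGRAPH",
    })

with open("train.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

Tables and bullet lists would need to be flattened to plain text first, and whether train.py expects a JSON list or JSON Lines is worth verifying against the README's fine-tuning example.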

Trying the GODEL Demo

In trying to run the demo, I'm running into an error I'm not sure how to resolve:

!python examples/dstc9/dstc9_server.py
Traceback (most recent call last):
  File "examples/dstc9/dstc9_server.py", line 54, in <module>
    from DialoGLM.server import *
ModuleNotFoundError: No module named 'DialoGLM'

I found this project, but I'm not sure whether it's the right one or how to connect them:
https://github.com/microsoft/DialogLM

Question about GODEL_XL (GPT-J) model size

First of all, thank you for making this work public!

I'm curious about the model size shown in the README for the released GODEL_XL model (based on GPT-J). In the table in the README it lists the model size as "2.7B". My understanding is that GPT-J has 6B parameters.

Is the number of parameters for GODEL XL listed in the README correct?

Getting errors after restarting computer

Once the computer restarts, changing nothing, I get the error below.

The script has been changed already to GODEL.server on line 54.

Traceback (most recent call last):
  File "examples/dstc9/dstc9_server.py", line 54, in <module>
    from GODEL.server import *
ModuleNotFoundError: No module named 'GODEL'

Model outputs after training consist of "todo" statements

I was able to successfully complete fine-tuning by following the "fine tuning" example in the README, using custom train, validation, and test datasets that I transformed into the GODEL format.

I can see fine-tuned models under the 4 checkpoint directories created by that training. The final log output of the training was: "09/11/2022 02:15:34 - INFO - __main__ - Saving model outputs to output/test-step-44200"

When I look at output/test-step-44200 (and this is the same for all of the test-step-NNN files), it looks like:

[
 ...,
  [
    "todo"
  ],
  [
    "todo"
  ],
  [
    "todo"
  ],
 ...
]

Decoding Strategy? and Prompting Guide

Did you use a specific decoding strategy other than beam search, and what parameter values are needed to reproduce the results? In the paper you mention 5 beams.

Does this require few-shot examples to produce coherent results? I am getting odd results: either the knowledge is copied into the response, or it generates stray tokens like <unk> <|knowledge|>.

What is the strategy or template for prompting? I have read the paper; it uses an <|environment|> token between two exchanges of messages, but reading the server code it looks like:

context: "sentence. <|knowledge|> knowledge sentence. => "
generated: "sentence. <|knowledge|> knowledge sentence. => generated sentence."

In the training data, one example for knowledge grounded looks like

START EOS:

I am considering downloading the training data to see how the model was trained and fix this issue, but that seems like a lot of effort just to get a demo working properly.

The paper's claims are that the model does well on metrics, but there doesn't seem to be much discussion of how well it actually generates responses when grounded in knowledge. Thanks.

If you can provide an example of what 3 turns of knowledge-grounded dialogue looks like, and what 3 turns of non-knowledge-grounded dialogue looks like, that would save an immense amount of time for people wanting to leverage this project.

@pengbaolin

Max Knowledge Length

Hello, I wasn't able to find anywhere what the maximum Knowledge length for the model is. I'd like to know how many tokens I can feed as knowledge before they are cut off.

Thanks
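One way to probe this empirically, sketched under the assumption that the v1_1 seq2seq checkpoints are used (their tokenizer reports its configured limit as model_max_length):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/GODEL-v1_1-large-seq2seq")

knowledge = "some long knowledge passage ..."  # placeholder text
n_tokens = len(tokenizer(knowledge).input_ids)

# The instruction, [CONTEXT], dialog turns and [KNOWLEDGE] marker all share the
# same encoder input budget, so compare the length of the *full* query string,
# not just the knowledge, against the tokenizer's configured maximum.
print(f"knowledge tokens: {n_tokens}, tokenizer limit: {tokenizer.model_max_length}")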

Question about GODEL-XL experimental in paper

Table 6 of the paper compares the effect of full fine-tuning on four tasks between GPT-3 and GODEL-XL. Why do these results look worse than GODEL-L's in Tables 3-5, despite the larger scale (GODEL-L is evaluated few-shot)?

Running backend server results in: `Found no NVIDIA driver on your system.`

First off, the README is unclear: it suggests running python EXAMPLE_server.py to start a local server where you can presumably type questions, but no such file exists, only a vague comment pointing to the dstc9_server.py file instead.

So I presume you're actually supposed to run python examples/dstc9/dstc9_server.py; however, doing so downloads 850M of some unknown data and then fails with the error below.

Steps to reproduce

On any system without an NVIDIA GPU, enter the project directory and run the following command:

 λ python examples/dstc9/dstc9_server.py
Downloading: 100%|██████████| 1.18k/1.18k [00:00<00:00, 978kB/s]
Downloading: 100%|██████████| 850M/850M [03:51<00:00, 3.85MB/s]
Traceback (most recent call last):
  File "/home/constantine/Projects/GODEL/examples/dstc9/dstc9_server.py", line 57, in <module>
    main()
  File "/home/constantine/Projects/GODEL/GODEL/server.py", line 56, in main
    model = model.to(args.device)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 907, in to
    return self._apply(convert)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 905, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 216, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

Expected

There should be no error: the project isn't specific to any GPU vendor, and I have AMD and Intel GPU drivers on my system.

Actual

There's an error about an NVIDIA GPU that isn't even present on my system.

Additional information

I had to apply this PR to make it work.
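For reference, the usual fix for this failure mode (and presumably what that PR does; this is only an inference, since the PR isn't linked here) is to select the device at runtime instead of hard-coding CUDA:

import torch

# Fall back to CPU when no CUDA device/driver is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)  # `model` as loaded in GODEL/server.py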

generated data and human evaluation result

Hi,
I appreciate your model and code, but I'm interested in the dataset with the human evaluation results. If possible, could you share the generated dataset together with the human evaluation scores?

requirements.txt contains two nltk versions

The requirements.txt file contains both nltk==3.7 and nltk==3.4, which of course results in conflicting dependencies. Which version should be used?

EDIT: I saw that there is already an open request, so I will close this issue.
