
microsoft / godel

Stars: 836 · Watchers: 20 · Forks: 110 · Size: 51.02 MB

Large-scale pretrained models for goal-directed dialog

Home Page: http://aka.ms/GODEL

License: MIT License

Languages: Python 87.60%, JavaScript 0.12%, HTML 0.36%, Vue 1.95%, SCSS 4.17%, Shell 1.47%, Makefile 4.32%
Topics: data-processing, dialogue, dialogue-systems, machine-learning, text-data, text-generation, transformers, conversational-ai, language-grounding, grounded-generation

godel's Issues

Some questions about datasets

How do these files work in the project? (The layout is a bit different from your paper.)

./PROGRAM_DEMO/data/

1. data
2. dummy_data
    2.1. dstc
    2.2. msmarco
    2.3. reddit
    2.4. unifedqa
3. grounded
4. ungrounded

Also, I wonder how I can train the project with my own datasets.

Am I right that I can train directly on a file like EXAMPLE.JSON below? If that is wrong, how can I do it correctly?

  {
    "Context": "Please remind me of calling to Jessie at 2PM.",
    "Knowledge": "reminder_contact_name is Jessie, reminder_time is 2PM",
    "Response": "Sure, set the reminder: call to Jesse at 2PM"
  },
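For reference, a minimal sketch of producing such a file, assuming train.py consumes a JSON list of Context/Knowledge/Response records and is pointed at it via the --train_file / --validation_file arguments mentioned in another issue on this page (the record below is a made-up example):

import json

# Hypothetical records in the same Context/Knowledge/Response shape as above.
examples = [
    {
        "Context": "Book a table for two at Luigi's tonight.",
        "Knowledge": "restaurant_name is Luigi's, party_size is 2, time is tonight",
        "Response": "Done, I booked a table for two at Luigi's for tonight.",
    },
]

# Write the records as a JSON list; pass the file to train.py via --train_file.
with open("train.json", "w", encoding="utf-8") as f:
    json.dump(examples, f, ensure_ascii=False, indent=2)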

Error downloading base model - "409 Public access is not permitted on this storage account."

Description:

I encountered an error while trying to download the base model for the GODEL project. When executing the command
wget https://bapengstorage.blob.core.windows.net/fileshare/godel_base.tar.gz, I received the following error message: "409 Public access is not permitted on this storage account."

Steps to Reproduce:

1. Followed the installation steps in the repository's README.
2. Executed wget https://bapengstorage.blob.core.windows.net/fileshare/godel_base.tar.gz.

Expected Behavior:

I expected the base model to be downloaded successfully without any errors.

Actual Behavior:

Instead, I encountered the error message "409 Public access is not permitted on this storage account," which prevented the download.
I would appreciate assistance in resolving this issue. It would be great if an alternative download link could be provided, or if the README instructions could be updated to address this problem. Access to the base model is essential for installing and fine-tuning GODEL.

Thank you for your attention to this matter.
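One possible workaround while the blob is locked down, assuming the fine-tuned checkpoints on the Hugging Face Hub cover your use case (godel_base.tar.gz itself may be a different pretraining checkpoint, so this is only an approximation):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Pull a published checkpoint from the Hugging Face Hub instead of the Azure blob.
tokenizer = AutoTokenizer.from_pretrained("microsoft/GODEL-v1_1-base-seq2seq")
model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/GODEL-v1_1-base-seq2seq")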

TensorFlow support

Any discussions or plans to offer a TensorFlow variant on HuggingFace? Thanks

How to recreate saving dialog history

I tried keeping the turns in a Python list and then generating a response from those answers, but it didn't lead to anything good. Can someone share code or explain how to do it?
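A minimal sketch of one way to do it, assuming the prompt format from the v1_1 model cards (the same format as the generate() snippet in the ROCm issue below): keep a running list of turns and append both the user's message and the model's reply after every exchange.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/GODEL-v1_1-large-seq2seq")
model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/GODEL-v1_1-large-seq2seq")

def generate(instruction, knowledge, dialog):
    # Turns are joined with ' EOS ', as in the model card snippet.
    if knowledge:
        knowledge = "[KNOWLEDGE] " + knowledge
    query = f"{instruction} [CONTEXT] {' EOS '.join(dialog)} {knowledge}"
    input_ids = tokenizer(query, return_tensors="pt").input_ids
    outputs = model.generate(input_ids, max_length=128, min_length=8, top_p=0.9, do_sample=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

instruction = "Instruction: given a dialog context, you need to response empathically."
dialog = []  # running history of alternating user/bot turns

while True:
    dialog.append(input("you> "))   # record the user's turn
    response = generate(instruction, "", dialog)
    dialog.append(response)         # record the bot's reply as context for the next turn
    print("bot>", response)

Note that the whole history is re-encoded on every turn, so very long conversations will eventually exceed the encoder's input budget and older turns would need to be truncated.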

What to do after downloading enough.

So I've downloaded enough to have a go at it. What commands do I run to turn the v1.1 large model into something ready to use, and will it remember my interactions?

Cannot download data

I ran into a problem using the DialoGPT scripts to download the Reddit dataset during the "Downloading and Extracting Data" step, and ended up with an empty train.db folder.

Can anyone help me with downloading the Reddit data?
Thanks

Reddit Data

Data preparation involves downloading Reddit comment and submission data from https://files.pushshift.io/reddit/, and the README says the total is around 700GB. However, the actual size of the data is around ~2TB. Up to which YYYY-MM of Reddit data did you use for training GODEL?

adding web demo and models on Hugging Face

Hi, would you be interested in sharing your models on the Hugging Face Hub? The Hub offers free hosting of over 54K models, and it would make your work more accessible and visible to the rest of the ML community. We already have an organization for Microsoft on Hugging Face, mirroring the GitHub org, for adding models/datasets/spaces (web demos): https://huggingface.co/microsoft

Some of the benefits of sharing your models through the Hub would be:

  • versioning, commit history and diffs
  • repos provide useful metadata about their tasks, languages, metrics, etc that make them discoverable
  • multiple features from TensorBoard visualizations, PapersWithCode integration, and more
  • wider reach of your work to the ecosystem

Creating the repos and adding new models should be a relatively straightforward process if you've used Git before. This is a step-by-step guide explaining the process in case you're interested.

And here are guides for adding Spaces (web demos) and datasets to your org:

How to add a Space: https://huggingface.co/blog/gradio-spaces
uploading a dataset: https://huggingface.co/docs/datasets/upload_dataset.html

as well as examples of web demos in the microsoft organization:

github: https://github.com/microsoft/unicl
Spaces: https://huggingface.co/spaces/microsoft/unicl-img-recog-demo

Please let us know if you would be interested and if you have any questions.

Hugging Face team

cc: @osanseviero

Need test_file argument

When running train.py with the train_file and validation_file arguments, I get the following error.

Traceback (most recent call last):
  File "GODEL/train.py", line 633, in <module>
    main()
  File "GODEL/train.py", line 426, in main
    test_dataset = lm_datasets["test"]
  File "/GODEL/.py38venv/lib/python3.8/site-packages/datasets/dataset_dict.py", line 41, in __getitem__
    return super().__getitem__(k)
KeyError: 'test'
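The cleanest fix would be a test_file argument, as the title says. In the meantime, a minimal workaround (an assumption about train.py's structure, inferred only from this traceback) is to guard the lookup near line 426:

# Around train.py:426 - build the test set only when a "test" split exists.
# (Any later use of test_dataset needs the same guard.)
test_dataset = lm_datasets["test"] if "test" in lm_datasets else None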

Different output on ROCm

The output text is different on a ROCm GPU. It should have been the same as on CPU.

Code

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def to_gpu(x):
    return x  # .to("cuda:0")  # Uncomment to test on GPU

tokenizer = AutoTokenizer.from_pretrained("microsoft/GODEL-v1_1-large-seq2seq")
model = to_gpu(AutoModelForSeq2SeqLM.from_pretrained("microsoft/GODEL-v1_1-large-seq2seq"))

def generate(instruction, knowledge, dialog):
    if knowledge != '':
        knowledge = '[KNOWLEDGE] ' + knowledge
    dialog = ' EOS '.join(dialog)
    query = f"{instruction} [CONTEXT] {dialog} {knowledge}"
    input_ids = to_gpu(tokenizer(query, return_tensors="pt")).input_ids
    # Note: do_sample=True makes generation stochastic, so outputs vary across
    # runs (and devices) unless the seed is fixed, e.g. torch.manual_seed(0).
    outputs = model.generate(input_ids, max_length=128, min_length=8, top_p=0.9, do_sample=True)
    output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return output

instruction = 'Instruction: given a dialog context, you need to response empathically.'
# Leave the knowledge empty
knowledge = ''
dialog = [
    'Does money buy happiness?',
    'It is a question. Money buys you a lot of things, but not enough to buy happiness.',
    'What is the best way to buy happiness ?'
]
# dialog = ["Hey my name is Thomas! How are you?"]  # Uncomment to test
response = generate(instruction, knowledge, dialog)
print(response)
requirements.txt
absl-py==1.0.0
astunparse==1.6.3
cachetools==5.1.0
certifi==2022.12.7
charset-normalizer==3.0.1
click==8.1.3
contourpy==1.0.5
cycler==0.11.0
filelock==3.9.0
flatbuffers==1.12
fonttools==4.37.4
gast==0.4.0
google-auth==2.6.6
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.46.3
h5py==3.7.0
huggingface-hub==0.12.1
idna==3.4
joblib==1.2.0
keras==2.9.0
Keras-Preprocessing==1.1.2
keybert==0.7.0
kiwisolver==1.4.4
libclang==14.0.1
Markdown==3.3.7
markdown-it-py==2.2.0
matplotlib==3.6.1
mdurl==0.1.2
nltk==3.8.1
numpy==1.24.2
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.0
opt-einsum==3.3.0
packaging==23.0
pandas==1.4.2
Pillow==9.2.0
protobuf==3.19.4
psutil==5.9.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
Pygments==2.14.0
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2022.1
PyYAML==6.0
regex==2022.10.31
requests==2.28.2
requests-oauthlib==1.3.1
rich==13.3.1
rsa==4.8
scikit-learn==1.2.1
scipy==1.10.1
sentence-transformers==2.2.2
sentencepiece==0.1.96
six==1.16.0
tensorboard==2.9.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.9.1
tensorflow-estimator==2.9.0
tensorflow-io-gcs-filesystem==0.26.0
termcolor==1.1.0
threadpoolctl==3.1.0
tokenizers==0.13.2
torch==1.13.1+rocm5.2
torchvision==0.14.1
tqdm==4.64.1
transformers==4.26.1
typing_extensions==4.5.0
urllib3==1.26.14
Werkzeug==2.1.2
wrapt==1.14.1

Output

CPU | GPU
Money doesn't buy happiness. It only gives you money for things you have. | It doesn't buy happiness. Be happy, not to be.
Hello Thomas, I'm fine. How are you? | Hi, I'm good. How are you?

How to train with our own knowledge base documents

It may be a dumb question, but can someone please guide me on how to convert our own documents (often large documents with paragraphs, bullet points, tables, etc.) into this model's input data so they can be used with the train.py script? One example of a paragraph converted to training input would help me a lot.
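A minimal sketch of one approach, assuming the Context/Knowledge/Response JSON shape from the dataset issue above: treat each paragraph of your document as a Knowledge string and pair it with a question and answer you author (or mine from logs) yourself.

import json

# Hypothetical input: a plain-text document with paragraphs separated by blank lines.
with open("my_document.txt", encoding="utf-8") as f:
    paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

records = []
for para in paragraphs:
    # Each record still needs a (question, answer) pair per paragraph; the
    # placeholders below would be written by hand or mined from support logs.
    records.append({
        "Context": "A QUESTION ANSWERABLE FROM THIS PARAGRAPH",
        "Knowledge": para,
        "Response": "THE ANSWER, GROUNDED IN THE PARAGRAPH",
    })

with open("train.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

Tables and bullet lists would need to be flattened to plain text first, and whether train.py expects a JSON list or JSON Lines is worth verifying against the README's fine-tuning example.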

Trying the GODEL Demo

In trying to run the demo, I'm running into an error I'm not sure how to resolve:

!python examples/dstc9/dstc9_server.py
Traceback (most recent call last):
  File "examples/dstc9/dstc9_server.py", line 54, in <module>
    from DialoGLM.server import *
ModuleNotFoundError: No module named 'DialoGLM'

I found this project, but I'm not sure whether it's the right one or how to connect them:
https://github.com/microsoft/DialogLM

Question about GODEL_XL (GPT-J) model size

First of all, thank you for making this work public!

I'm curious about the model size shown in the README for the released GODEL_XL model (based on GPT-J). In the table in the README it lists the model size as "2.7B". My understanding is that GPT-J has 6B parameters.

Is the number of parameters for GODEL XL listed in the README correct?

Getting errors after restarting computer

Once the computer restarts, changing nothing, I get the error below.

The script has been changed already to GODEL.server on line 54.

Traceback (most recent call last):
  File "examples/dstc9/dstc9_server.py", line 54, in <module>
    from GODEL.server import *
ModuleNotFoundError: No module named 'GODEL'

Model outputs after training consist of "todo" statements

I was able to successfully complete fine-tuning by following the "fine tuning" example in the README, using custom train, validation, and test datasets that I transformed into the GODEL format.

I can see fine-tuned models under the 4 checkpoint directories created by that training. The final log output of the training was: "09/11/2022 02:15:34 - INFO - __main__ - Saving model outputs to output/test-step-44200"

When I look at output/test-step-44200 (and this is the same for all of the test-step-NNN files), it looks like:

[
 ...,
  [
    "todo"
  ],
  [
    "todo"
  ],
  [
    "todo"
  ],
 ...
]

Decoding Strategy? and Prompting Guide

Did you use a specific decoding strategy other than beam search, and what parameter values are needed to reproduce the results? In the paper you mention 5 beams.

Does this require few-shot examples to produce coherent results? I am getting odd results: either the knowledge is copied into the response, or it generates stray tokens like <unk> <|knowledge|>.

What is the strategy or template for prompting? I have read the paper; it uses an <|environment|> token between two exchanges of messages, but reading the server code it looks like:

context: "sentence. <|knowledge|> knowledge sentence. => "
generated: "sentence. <|knowledge|> knowledge sentence. => generated sentence."

In the training data, one example for knowledge grounded looks like

START EOS:

I am considering downloading the training data to see how the model was trained and fix this issue, but that seems like a lot of effort just to get a demo working properly.

The paper's claims are that the model does well on metrics, but there doesn't seem to be much discussion of how well it actually generates responses when grounded in knowledge. Thanks.

If you can provide an example of what 3 turns of knowledge-grounded dialogue looks like, and what 3 turns of non-knowledge-grounded dialogue looks like, that would save an immense amount of time for people wanting to leverage this project.

@pengbaolin

Max Knowledge Length

Hello, I wasn't able to find anywhere what the maximum Knowledge length for the model is. I'd like to know how many tokens I can feed as knowledge before they are cut off.

Thanks
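One way to probe this empirically, sketched under the assumption that the v1_1 seq2seq checkpoints are used (their tokenizer reports its configured limit as model_max_length):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/GODEL-v1_1-large-seq2seq")

knowledge = "some long knowledge passage ..."  # placeholder text
n_tokens = len(tokenizer(knowledge).input_ids)

# The instruction, [CONTEXT], dialog turns and [KNOWLEDGE] marker all share the
# same encoder input budget, so compare the length of the *full* query string,
# not just the knowledge, against the tokenizer's configured maximum.
print(f"knowledge tokens: {n_tokens}, tokenizer limit: {tokenizer.model_max_length}")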

Question about GODEL-XL experimental in paper

Table 6 of the paper compares the effect of full fine-tuning on four tasks between GPT-3 and GODEL-XL. Why do these results look worse than GODEL-L's in Tables 3-5, despite the larger scale (GODEL-L is evaluated few-shot)?

Running backend server results in: `Found no NVIDIA driver on your system.`

First off, the README is unclear: it suggests running python EXAMPLE_server.py to start a local server where you can presumably type questions, but no such file exists, only a vague comment pointing to the dstc9_server.py file instead.

So I presume you're actually supposed to run python examples/dstc9/dstc9_server.py; however, doing so downloads 850M of some unknown data and then fails with the error below.

Steps to reproduce

On any system without an NVIDIA GPU, enter the project directory and run the following command:

 λ python examples/dstc9/dstc9_server.py
Downloading: 100%|██████████| 1.18k/1.18k [00:00<00:00, 978kB/s]
Downloading: 100%|██████████| 850M/850M [03:51<00:00, 3.85MB/s]
Traceback (most recent call last):
  File "/home/constantine/Projects/GODEL/examples/dstc9/dstc9_server.py", line 57, in <module>
    main()
  File "/home/constantine/Projects/GODEL/GODEL/server.py", line 56, in main
    model = model.to(args.device)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 907, in to
    return self._apply(convert)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 905, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/constantine/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 216, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

Expected

There should be no error: the project isn't specific to any GPU vendor, and I have AMD and Intel GPU drivers on my system.

Actual

There's an error about an NVIDIA GPU that isn't even present on my system.

Additional information

I had to apply this PR to make it work.
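For reference, the usual fix for this failure mode (and presumably what that PR does; this is only an inference, since the PR isn't linked here) is to select the device at runtime instead of hard-coding CUDA:

import torch

# Fall back to CPU when no CUDA device/driver is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)  # `model` as loaded in GODEL/server.py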

generated data and human evaluation result

Hi,
I appreciate your model and code, but I'm interested in the dataset with the human evaluation results. If possible, could you share the generated dataset together with the human evaluation scores?

requirements.txt contains two nltk versions

The requirements.txt file contains both nltk==3.7 and nltk==3.4, which of course results in conflicting dependencies. Which version should be used?

EDIT: I saw that there is already an open request, so I will close this issue.
