
gpt-2-output-dataset's Introduction

gpt-2-output-dataset

This dataset contains:

  • 250K documents from the WebText test set
  • For each GPT-2 model (trained on the WebText training set), 250K random samples (temperature 1, no truncation) and 250K samples generated with Top-K 40 truncation

We look forward to the research produced using this data!

Download

For each model, we have a training split of 250K generated examples, as well as validation and test splits of 5K examples.

All data was originally located in Google Cloud Storage under the directory gs://gpt-2/output-dataset/v1, and has since been migrated to Azure: https://openaipublic.blob.core.windows.net/gpt-2/output-dataset/v1/

There, you will find files:

  • webtext.${split}.jsonl
  • small-117M.${split}.jsonl
  • small-117M-k40.${split}.jsonl
  • medium-345M.${split}.jsonl
  • medium-345M-k40.${split}.jsonl
  • large-762M.${split}.jsonl
  • large-762M-k40.${split}.jsonl
  • xl-1542M.${split}.jsonl
  • xl-1542M-k40.${split}.jsonl

where ${split} is one of train, valid, or test.

We've provided a script, download_dataset.py, to download all of them.
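
If you prefer not to use the provided script, the following is a minimal sketch of the same idea. It assumes only the Azure endpoint noted above and the file names listed; download_dataset.py in this repo does essentially the same thing, so treat this as an illustration rather than the official tool.

# Minimal download sketch (assumes the Azure endpoint noted above).
import os
import requests

BASE = 'https://openaipublic.blob.core.windows.net/gpt-2/output-dataset/v1/'
MODELS = ['webtext', 'small-117M', 'small-117M-k40', 'medium-345M', 'medium-345M-k40',
          'large-762M', 'large-762M-k40', 'xl-1542M', 'xl-1542M-k40']

os.makedirs('data', exist_ok=True)
for model in MODELS:
    for split in ['train', 'valid', 'test']:
        filename = f'{model}.{split}.jsonl'
        r = requests.get(BASE + filename, stream=True)
        r.raise_for_status()
        with open(os.path.join('data', filename), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                f.write(chunk)
        print('downloaded', filename)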

Finetuned model samples

Additionally, we encourage research on detection of finetuned models. We have released data under gs://gpt-2/output-dataset/v1-amazonfinetune/ with samples from a GPT-2 full model finetuned to output Amazon reviews.

Detectability baselines

We're interested in seeing research on the detectability of GPT-2 model family generations.

We provide some initial analysis of two baselines, as well as code for the better baseline.

Overall, we are able to achieve accuracies in the mid-90s for Top-K 40 generations, and mid-70s to high-80s (depending on model size) for random generations. We also find some evidence that adversaries can evade detection via finetuning from released models.
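
To make concrete what a simple detection baseline can look like, here is a hedged sketch of a logistic-regression classifier over TF-IDF features. This is an illustrative reconstruction under the assumption that such a classifier resembles the simpler of the two baselines; it is not necessarily the exact baseline code in this repo.

# Illustrative detection baseline: logistic regression on TF-IDF n-grams.
# An assumption-laden sketch, not the repo's exact baseline implementation.
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def load_texts(path, limit=10000):
    with open(path) as f:
        return [json.loads(line)['text'] for _, line in zip(range(limit), f)]

real = load_texts('data/webtext.train.jsonl')
fake = load_texts('data/xl-1542M-k40.train.jsonl')

vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=2 ** 16)
X = vectorizer.fit_transform(real + fake)
y = [1] * len(real) + [0] * len(fake)  # 1 = human-written, 0 = generated

clf = LogisticRegression(max_iter=1000).fit(X, y)
print('train accuracy:', clf.score(X, y))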

Data removal requests

If you believe your work is included in WebText and would like us to remove it, please let us know at [email protected].

gpt-2-output-dataset's People

Contributors

jongwook, newmu, wuthefwasthat


gpt-2-output-dataset's Issues

Training code fails on 0-length inputs (which are in several datasets included by the author / used in the report)

Some of the training data (specifically, the GPT-2-generated datasets) contains texts of length 0. This causes training (and would cause inference) to error out. Is this expected? Please see the error message below; a workaround sketch follows the list of affected files.

Loading data/webtext.train.jsonl: 100%|██████████| 250000/250000 [00:05<00:00, 49837.49it/s]
Loading data/webtext.test.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 48536.65it/s]
Loading data/webtext.valid.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 48406.80it/s]
Loading data/xl-1542M.train.jsonl: 100%|██████████| 250000/250000 [00:05<00:00, 46902.35it/s]
Loading data/xl-1542M.test.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 45678.24it/s]
Loading data/xl-1542M.valid.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 45654.67it/s]
Epoch 1:  10%|█         | 2098/20834 [22:20<3:19:33,  1.56it/s, acc=0.856, loss=0.297]
Traceback (most recent call last):
  File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/local/home/openai_code/detector/train.py", line 324, in <module>
    run(**vars(args))
  File "/local/home/openai_code/detector/train.py", line 255, in run
    train_metrics = train(model, optimizer, device, train_loader, f'Epoch {epoch}')
  File "/local/home/openai_code/detector/train.py", line 108, in train
    for texts, masks, labels in loop:
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/local/home/openai_code/detector/dataset.py", line 60, in __getitem__
    tokens = self.tokenizer.encode(text)
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1427, in encode
    **kwargs,
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1569, in encode_plus
    first_ids = get_input_ids(text)
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1541, in get_input_ids
    tokens = self.tokenize(text, add_special_tokens=add_special_tokens, **kwargs)
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1265, in tokenize
    text = self.prepare_for_tokenization(text, **kwargs)
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_roberta.py", line 239, in prepare_for_tokenization
    if add_prefix_space and not text[0].isspace():
IndexError: string index out of range

The following datasets contain entries with length of 0:

./data/large-762M.train.jsonl
./data/large-762M.valid.jsonl
./data/medium-345M.train.jsonl
./data/small-117M100.valid.jsonl
./data/small-117M.test.jsonl
./data/small-117M.train.jsonl
./data/small-117M.valid.jsonl
./data/xl-1542M.train.jsonl
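
A possible workaround is sketched below, under the assumption that simply dropping the empty records is acceptable for your experiment; this is not an official fix.

# Filter zero-length texts out of a .jsonl split before training.
import json

def load_nonempty(path):
    records = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if len(record['text']) > 0:  # skip the problematic empty entries
                records.append(record)
    return records

data = load_nonempty('data/xl-1542M.train.jsonl')
print(len(data), 'non-empty records')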

About cuda devices

Hi,
looking at your code, I noticed something interesting:

num_workers = int(subprocess.check_output(['python3', '-c', 'import torch; print(torch.cuda.device_count())']))

Here you spawn a subprocess from the main process to check how many CUDA devices are available. My question is: what is the difference between your version and doing something like the following?

if torch.cuda.is_available():
    num_workers = int(torch.cuda.device_count())

I assume you did this because the two differ in some way, but I don't know how.

Thanks
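
One plausible explanation (an assumption, not the authors' stated rationale): calling torch.cuda.device_count() in-process can initialize CUDA state in the main process, which is known to interact badly with fork-based DataLoader workers created later; querying in a throwaway subprocess leaves the parent process untouched. The two approaches side by side:

# The two approaches side by side (illustration only; the rationale in the
# comments is an assumption, not confirmed by the repo authors).
import subprocess
import sys

# (a) In-process: touches torch (and possibly CUDA driver state) in the
#     current process, before any DataLoader workers are forked.
import torch
n_inprocess = torch.cuda.device_count() if torch.cuda.is_available() else 0

# (b) Subprocess: any CUDA initialization happens in a short-lived child,
#     leaving the parent process CUDA-free.
n_subprocess = int(subprocess.check_output(
    [sys.executable, '-c', 'import torch; print(torch.cuda.device_count())']))

print(n_inprocess, n_subprocess)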

Inference Example Code

Hi,
could you provide a code sample showing how inference is done after the training phase?
Thanks
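
A minimal inference sketch follows. It is a reconstruction based on the logic in detector/server.py (the checkpoint key 'model_state_dict' and the fake/real label order are taken from that file and the tracebacks elsewhere on this page), but the tokenizer arguments and the 512-token cap are assumptions, not official example code. If you hit the position_ids state_dict mismatch discussed in a later issue, see the workaround there.

# Minimal detector inference sketch; mirrors detector/server.py but is
# not official example code.
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForSequenceClassification.from_pretrained('roberta-base')
checkpoint = torch.load('detector-base.pt', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

text = "Paste the passage you want to classify here."
tokens = tokenizer.encode(text, max_length=512, truncation=True, return_tensors='pt')
with torch.no_grad():
    logits = model(tokens)[0]
fake_prob, real_prob = logits.softmax(dim=-1)[0].tolist()  # index 0 = fake, 1 = real in server.py
print(f'machine-generated: {fake_prob:.3f}, human-written: {real_prob:.3f}')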

Access denied to model checkpoints

Hello, when trying to get the two model checkpoints as indicated here using:

wget https://storage.googleapis.com/gpt-2/detector-models/v1/detector-large.pt

I get the error: ERROR 403: Forbidden.

It seems like permissions are not allowing me to download the files. Could you help me get access to the models?

Quickstart guide?

I'd love to start using OpenAI's datasets and packages, in particular this one, for text generation. Sorry for the basic question, but is there a beginner's tutorial for getting started somewhere? Perhaps a detailed installation guide?

WebText Dataset format

What is the meaning of the length and ended fields in the dataset lines?

{"id": 1, "ended": true, "length": 66, "text": "LeSean McCoy going through warmups with first team offense. To my eye, does not look close to 100 percent when cutting and exploding.\n\nABOUT COOKIES\n\nTo help make this website better, to improve and personalize your experience and for advertising purposes, are you happy to accept cookies and other technologies?"}

I can also see that there are newlines followed by indexes, as in:

{"id": 0, "ended": true, "length": 138, "text": "These girlfriends deserves a special mention for going that extra mile, hopefully doesn't set too many guys off on the path towards outrageous demands.\n\n1. She knows the severity of man-flu\n\n2. All fun and games is all good\n\n3. A voucher that says 'I love you'\n\n4. When arguments don't drag on forever.\n\n5. Providing everything he needs.\n\n6. Very understanding\n\n7. As awesome a gesture as this is, we are worried about this man's cooking skills.\n\n8. Nice cake\n\n8. Fair bargaining\n\n9. Excellent gift choice\n\n10. Very thoughtful"}

so \n\n3. ... \n\n8. What does this mean? Is it just a questionnaire-style scraped document?

I can see that the detector does not use those info anyways: https://github.com/openai/gpt-2-output-dataset/blob/master/detector/dataset.py#L17

Thank you.
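
For inspecting these fields, a small sketch is given below. The reading offered in the comments (length counts BPE tokens rather than characters, and ended flags texts that finished naturally rather than being cut off) is a plausible interpretation, not an authoritative answer.

# Inspect the per-record metadata fields in a .jsonl split.
import json
from itertools import islice

with open('data/webtext.valid.jsonl') as f:
    for line in islice(f, 5):
        record = json.loads(line)
        # 'length' appears to be a token count (capped at 1024) rather than
        # a character count; 'ended' appears to flag whether the text ended
        # naturally. Both readings are assumptions, not documented facts.
        print(record['id'], record['length'], record['ended'], len(record['text']))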

Missing key(s) in state_dict: "roberta.embeddings.position_ids".

Hi there,
when I try to start the detector (python -m detector.server detector-base.pt), I get the error: Missing key(s) in state_dict: "roberta.embeddings.position_ids".

How can I solve this and play around with the detector?
Thanks

Precondition
I downloaded (and tried with) both detector-base.pt and detector-large.pt. Either way, I get the same error. I can see from previous issues that people have started the detector successfully, so maybe there is an issue with the .pt files I am downloading.

Full execution log

$ python -m detector.server detector-base.pt

Loading checkpoint from detector-base.pt
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'lm_head.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/opt/anaconda3/envs/detector/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/anaconda3/envs/detector/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/user/workspace/stanford/gpt-2-output-dataset/detector/server.py", line 120, in <module>
    fire.Fire(main)
  File "/opt/anaconda3/envs/detector/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/anaconda3/envs/detector/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/anaconda3/envs/detector/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/Users/user/workspace/stanford/gpt-2-output-dataset/detector/server.py", line 89, in main
    model.load_state_dict(data['model_state_dict'])
  File "/opt/anaconda3/envs/detector/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RobertaForSequenceClassification:
	Missing key(s) in state_dict: "roberta.embeddings.position_ids".
	Unexpected key(s) in state_dict: "roberta.pooler.dense.weight", "roberta.pooler.dense.bias".
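
A common workaround, offered here as an assumption rather than an official fix: this mismatch typically comes from loading a checkpoint saved under an older transformers release into a newer one. Either pin transformers to the version in requirements.txt, or load non-strictly, on the reasoning that position_ids is a recomputable buffer and the pooler weights are unused by the classification head:

# Hedged workaround: tolerate version skew between the checkpoint and the
# installed transformers release. Verify that the reported mismatches are
# limited to buffers/pooler weights before trusting the loaded model.
import torch
from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained('roberta-base')
checkpoint = torch.load('detector-base.pt', map_location='cpu')
missing, unexpected = model.load_state_dict(checkpoint['model_state_dict'], strict=False)
print('missing:', missing)        # expect only roberta.embeddings.position_ids
print('unexpected:', unexpected)  # expect only roberta.pooler.dense.{weight,bias}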

RuntimeError: Error(s) in loading state_dict for RobertaForSequenceClassification

I get the following error after trying to run

pip install -r requirements.txt
python -m detector.server detector-base.pt

Error:

RuntimeError: Error(s) in loading state_dict for RobertaForSequenceClassification:
        Missing key(s) in state_dict: "roberta.embeddings.position_ids".
        Unexpected key(s) in state_dict: "roberta.pooler.dense.weight", "roberta.pooler.dense.bias"

I'm not sure if there was a change between the Google-hosted version of the RoBERTa weights and the Azure-hosted version.

Thanks for the help!

Release a dataset using nucleus (top-p) sampling

This preprint (https://arxiv.org/abs/1904.09751) demonstrates that nucleus sampling can generate better text in general than either top-k or full sampling. This has been my experience with both the 117M and 345M parameter models. Releasing a dataset of outputs sampled with this method would be very helpful, especially since it seems that it can generate good outputs without falling prey to the pronoun overuse etc. mentioned in the README.

Questions regarding the dataset format and partition

Hi, thanks for making this great dataset public. I have downloaded webtext.train.jsonl, a file with 250K lines. I am not sure whether it is just a sample or a slice of the WebText training set on which the GPT-2 models were trained. May I have access to the full WebText training set?

Looking forward to your reply or any advice.

Compress the dataset files

Can we get the JSON files compressed (zip or gzip, for example)? This could improve collaboration, as not everyone can afford fast and reliable internet.

ERROR train.py: Default process group is not initialized

I get this error when training on a single GPU, when the function distributed() is called to disable tqdm.

To avoid this, I have simply wrapped distributed() like so:

def distributed():
    try:
        return dist.is_available() and dist.is_initialized()
    except RuntimeError:
        # The default process group is not initialized on single-GPU runs.
        return False

Create a requirements.txt file

It's not clear what the repo prerequisites are. There should be a "requirements.txt" file (I'd be more than happy to create one if provided with the required package versions).
I'm collecting package names as I try to run the two Python scripts.

download_model.py [117M | 345M | 355M | 762M | 774M | 1542M] -> The specified bucket does not exist

Hi,

I am just starting with gpt-2 and wanted to run some tutorials.
For example, running python3 download_model.py 117M outputs:

python3 download_model.py 117M
Fetching checkpoint: 1.00kit [00:00, 320kit/s]
Fetching encoder.json: 1.00kit [00:00, 428kit/s]
Fetching hparams.json: 1.00kit [00:00, 302kit/s]
Fetching model.ckpt.data-00000-of-00001: 1.00kit [00:00, 333kit/s]
Fetching model.ckpt.index: 1.00kit [00:00, 366kit/s]
Fetching model.ckpt.meta: 1.00kit [00:00, 348kit/s]
Fetching vocab.bpe: 1.00kit [00:00, 427kit/s]

Inspecting any of those files gives:

<?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchBucket</Code><Message>The specified bucket does not exist.</Message></Error>

I also don't see any data when I try:
https://console.cloud.google.com/storage/browser/gpt-2/output-dataset/v1

Is there anything I am doing wrong?
best,
gmohsyan

Permission for Commercial Use

Can I use this project for commercial purposes? I want to create a website that classifies whether text was generated by a GPT-2 model or written by a human.

License?

Great work with the Detector, guys!

What type of license is this using?

🤞🏽 for MIT.

python3.5 error

I get this error when using python3.5:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 174, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.5/runpy.py", line 144, in _get_module_details
    code = loader.get_code(mod_name)
  File "<frozen importlib._bootstrap_external>", line 767, in get_code
  File "<frozen importlib._bootstrap_external>", line 727, in source_to_code
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/mnt/Data/ubuntu/gpt-2-output-dataset/detector/train.py", line 133
    records = [record for v in range(votes) for record in tqdm(loader, desc=f'Preloading data ... {v}',

python3.5 error invalid syntax

I get this error when using python3.5:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 174, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.5/runpy.py", line 133, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/usr/lib/python3.5/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "/home/user/.local/lib/python3.5/site-packages/pytest/__init__.py", line 3, in <module>
    from . import collect
  File "/home/user/.local/lib/python3.5/site-packages/pytest/collect.py", line 8, in <module>
    from _pytest.deprecated import PYTEST_COLLECT_MODULE
  File "/home/user/.local/lib/python3.5/site-packages/_pytest/deprecated.py", line 13, in <module>
    from _pytest.warning_types import PytestDeprecationWarning
  File "/home/user/.local/lib/python3.5/site-packages/_pytest/warning_types.py", line 8, in <module>
    from _pytest.compat import final
  File "/home/user/.local/lib/python3.5/site-packages/_pytest/compat.py", line 39
    NOTSET: "Final" = NotSetType.token  # noqa: E305
          ^
SyntaxError: invalid syntax

Is this capable of running on Mac M1 architecture?

I'm using a Mac M1 mini and am unable to install the requirements.

I successfully got everything downloaded and changed the requirements.txt file:

transformers==2.9.1

Then installed the requirements with:

pip install -r requirements.txt

The install works up until:

error[E0463]: can't find crate for `core`
  = note: the `x86_64-apple-darwin` target may not be installed
  = help: consider downloading the target with `rustup target add x86_64-apple-darwin`
  = help: consider building the standard library from source with `cargo build -Zbuild-std`

error[E0463]: can't find crate for `compiler_builtins`

I downloaded the target with:

rustup target add x86_64-apple-darwin

Repeated the install of requirements.txt:

pip install -r requirements.txt

Same error as above, so I tried to build the standard library from source:

cargo build -Zbuild-std

An error was generated:

error: the `-Z` flag is only accepted on the nightly channel of Cargo, but this is the `stable` channel

C:\Users\JoeBiden\AppData\Local\Programs\Python\Python37\python.exe: Error while finding module specification for 'detector.server' (ModuleNotFoundError: No module named 'detector')

When running the command "python -m detector.server detector-base.pt" in the detector folder, I get this error:
C:\Users\JoeBiden\AppData\Local\Programs\Python\Python37\python.exe: Error while finding module specification for 'detector.server' (ModuleNotFoundError: No module named 'detector')

I know the code says to run it from the root of the repo instead of the detector folder, but then I get an error telling me detector-base.pt doesn't exist; well, of course it doesn't, because it's in the detector folder 💀

How is the output text generated?

Please correct me if my understanding is wrong. GPT-2 was trained on WebText, and the small, medium, and large model outputs you provide are completions conditioned on part of the context of each WebText post, correct? When I load the datasets and look at the sentences, I don't understand how the posts generated by the large model correspond to the WebText training set. How does the model generate text (i.e., what context does it use)? Right now the xth generated post doesn't seem to correspond to the xth post in WebText.

Questions about the meaning of dataset attributes

Regarding your dataset: does the "length" attribute represent the length of the "text" attribute, or something else? I don't think it is the character length of "text"; for example, in the file medium-345M-k40.train.jsonl, "length" = 1024 but I calculated the length of the text to be 4750. So what does the "length" attribute mean? I look forward to your reply. Thank you very much.

how

I got everything up and running, but I was wondering: if I had a .txt file and wanted to pass it through the model, is there any way to do this from the command line rather than setting up the whole webpage?

Filenames for the finetuned Amazon review samples?

Perhaps I overlooked something, but Google Cloud Storage does not serve directory listings, so the files in gs://gpt-2/output-dataset/v1-amazonfinetune/ are not easily downloaded, given that their names are not specified anywhere.

re: "Additionally, we encourage research on detection of finetuned models. We have released data under gs://gpt-2/output-dataset/v1-amazonfinetune/ with samples from a GPT-2 full model finetuned to output Amazon reviews."

For anyone wanting the files, the full list is / seems to be:

gs://gpt-2/output-dataset/v1-amazonfinetune/amazon-xl-1542M-k40.test.jsonl
gs://gpt-2/output-dataset/v1-amazonfinetune/amazon-xl-1542M-k40.train.jsonl
gs://gpt-2/output-dataset/v1-amazonfinetune/amazon-xl-1542M-k40.valid.jsonl
gs://gpt-2/output-dataset/v1-amazonfinetune/amazon-xl-1542M-nucleus.test.jsonl
gs://gpt-2/output-dataset/v1-amazonfinetune/amazon-xl-1542M-nucleus.train.jsonl
gs://gpt-2/output-dataset/v1-amazonfinetune/amazon-xl-1542M-nucleus.valid.jsonl
gs://gpt-2/output-dataset/v1-amazonfinetune/amazon-xl-1542M.test.jsonl
gs://gpt-2/output-dataset/v1-amazonfinetune/amazon-xl-1542M.train.jsonl
gs://gpt-2/output-dataset/v1-amazonfinetune/amazon-xl-1542M.valid.jsonl
gs://gpt-2/output-dataset/v1-amazonfinetune/amazon.test.jsonl
gs://gpt-2/output-dataset/v1-amazonfinetune/amazon.train.jsonl
gs://gpt-2/output-dataset/v1-amazonfinetune/amazon.valid.jsonl
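
Given the Azure migration noted in the README, a hedged fetch sketch follows. It assumes (but does not confirm) that the gs:// bucket maps to the Azure endpoint the same way the main v1 dataset does; the variant names are taken from the list above.

# Fetch the Amazon-finetune samples; assumes the gs:// bucket maps to the
# Azure endpoint the same way the main v1 dataset does.
import requests

BASE = 'https://openaipublic.blob.core.windows.net/gpt-2/output-dataset/v1-amazonfinetune/'
VARIANTS = ['amazon', 'amazon-xl-1542M', 'amazon-xl-1542M-k40', 'amazon-xl-1542M-nucleus']

for variant in VARIANTS:
    for split in ['train', 'valid', 'test']:
        filename = f'{variant}.{split}.jsonl'
        r = requests.get(BASE + filename, stream=True)
        r.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
        print('downloaded', filename)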

RuntimeError: Error(s) in loading state_dict for RobertaForSequenceClassification even after transformers==2.9.1

I get the following error after trying to run

pip install -r requirements.txt
python -m detector.server detector-base.pt

Error:

RuntimeError: Error(s) in loading state_dict for RobertaForSequenceClassification:
        Missing key(s) in state_dict: "roberta.embeddings.position_ids".
        Unexpected key(s) in state_dict: "roberta.pooler.dense.weight", "roberta.pooler.dense.bias"

I have tried transformers==2.9.1 in requirements.txt, but I still receive the same error.

Thanks for your help.

SyntaxError: invalid syntax detector server

I get a syntax error when running the detector. Is there something I should configure for the detector before running it?

  File "C:\Users\user\AppData\Local\Programs\Python\Python35\lib\runpy.py", line 151, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name)
  File "C:\Users\user)\AppData\Local\Programs\Python\Python35\lib\runpy.py", line 126, in _get_module_details
    code = loader.get_code(mod_name)
  File "<frozen importlib._bootstrap_external>", line 764, in get_code
  File "<frozen importlib._bootstrap_external>", line 724, in source_to_code
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "C:\Docks\git\notebooks\GPT 2\gpt-2-output-dataset-master\detector\server.py", line 13
    model: RobertaForSequenceClassification = None
         ^
SyntaxError: invalid `syntax`

Simplified English often falsely classified as AI output

This is more feedback than a bug report, so feel free to close or ignore the issue. It appears that there are lots of false positives for articles in the Simplified English Wikipedia.

Examples:

Both articles predate GPT. They cannot be AI generations, yet the system is 99% sure. I found more examples; it appears that Simplified English is classified as AI output with relatively high probability.

Why RoBERTa?

Why did you use RoBERTa and not BERT or ELMo instead?

Finding a strange error in a simple GPT4.5 question

The error text includes:
OpenAI error. That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 981c94eaf7d8e0a89ddecf88a8cd2499 in your message.)
I have heard that there is a reward for reporting errors. How can I get the payment? I know a lot of other errors! Please send me a message at my e-mail [email protected].

Temperature for detectability baselines

The temperature seems to differ for each model in the detectability baselines.

Model    Temperature 1    Top-K 40
117M     88.29%           96.79%
345M     88.94%           95.22%
762M     77.16%           94.43%
1542M    74.31%           92.69%

It seems that the temperature is chosen closer to:

  • 0.9 for the 117M and 345M models,
  • 0.75 for the 762M and 1542M models.

Did you optimize the temperature for each model?
If so, is that a recommended setting when we use your models for other applications?

I mention this because a Python package based on GPT-2 uses 0.7 as the default temperature with your 117M and 345M models, and that might not be an optimal default value.

gpt-2-output-dataset

I'm getting an error with both the base and large models:
OSError: It looks like the config file at 'detector-base.pt' is not a valid JSON file.

Issues running the detection training code

Thank you for releasing the code for training the detector. I want to fine-tune the detector on the Amazon reviews you released, but I have some issues.

The file "train.py" in "detector" directory requires some python functions in some directories (dataset, download, utils) that are missing. Can you please help?

The following imports give errors:

from .dataset import Corpus, EncodedDataset
from .download import download
from .utils import summary, distributed

Thank you

Detector failing on certain inputs

I've been playing around with the detector at https://openai-openai-detector.hf.space/, and I noticed that there are some inputs that leave the detector stuck on a "Predicting..." screen.
I consistently get the error when this input appears as a substring: docker swarm init --advertise-addr

(Screenshots attached: the stuck "Predicting..." page and the developer-console error.)
It seems the server fails to respond to this sort of input. I'm not sure what's special about it, because the server works just fine if I change the "a" to "b".

Loss, Logits error while training

There is an assignment error in the train.py script, wherein loss and logits end up as 'str' type after the assignment and hence have to be updated.

Lines 108 and 146:

loss, logits = model(texts, attention_mask=masks, labels=labels)

Here the loss variable ends up as a 'str', so the subsequent loss.backward() fails, stating that a 'str' object has no backward method.
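
This symptom is typical of newer transformers releases, where the forward pass returns a ModelOutput mapping rather than a tuple, so tuple-unpacking yields the string keys. A hedged fix sketch follows, assuming such a version; texts, masks, and labels are the variables from train.py quoted above, and this is not the repo's official patch.

# With ModelOutput-returning transformers versions, unpack by attribute
# instead of relying on tuple order (which yields the dict keys as strings).
outputs = model(texts, attention_mask=masks, labels=labels)
loss, logits = outputs.loss, outputs.logits

# Equivalent alternative: request the legacy tuple form explicitly.
# loss, logits = model(texts, attention_mask=masks, labels=labels,
#                      return_dict=False)[:2]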

How to work with JSON lines database?

Hello. I downloaded all the files, and all of them are just random text samples in JSON format. I want to train my own tensorflow.js model using this database, but I don't have a question database here. So, what do I need to do?

python vs. python3 in line 96 of /detector/server.py

This is a simple problem; just posting so others are aware.

To get the web-based GPT-2 Output Detector to work, I had to change "python" to "python3" in line 96 of detector/server.py. See:

num_workers = int(subprocess.check_output(['python', '-c', 'import torch; print(torch.cuda.device_count())']))
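
A more portable variant (a suggestion, not the reporter's actual patch) is to reuse the running interpreter via sys.executable, which sidesteps the python-vs-python3 naming question entirely:

# Portable variant: invoke whichever interpreter is running this script.
import subprocess
import sys

num_workers = int(subprocess.check_output(
    [sys.executable, '-c', 'import torch; print(torch.cuda.device_count())']))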

System:
OS: Ubuntu 19.10 eoan
Kernel: x86_64 Linux 5.3.0-19-generic
Uptime: 13d 6h 1m
Packages: 2125
Shell: bash 5.0.3
Resolution: 2560x1440
DE: GNOME
WM: GNOME Shell
WM Theme: Adwaita
GTK Theme: Yaru-dark [GTK2/3]
Icon Theme: Yaru
Font: Ubuntu 11
CPU: Intel Core i7-8809G @ 8x 4.2GHz [27.8°C]
GPU: AMD VEGAM (DRM 3.33.0, 5.3.0-19-generic, LLVM 9.0.0)
RAM: 6278MiB / 32035MiB

Behavior before the change:

~/Projects/AI/gpt-2-output-dataset/detector$ python3 -m server detector-large.pt
Loading checkpoint from detector-large.pt
Starting HTTP server on port 8080
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named torch
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/drew/Projects/AI/gpt-2-output-dataset/detector/server.py", line 120, in <module>
    fire.Fire(main)
  File "/home/drew/.local/lib/python3.7/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/drew/.local/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/home/drew/.local/lib/python3.7/site-packages/fire/core.py", line 675, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/drew/Projects/AI/gpt-2-output-dataset/detector/server.py", line 96, in main
    num_workers = int(subprocess.check_output(['python', '-c', 'import torch; print(torch.cuda.device_count())']))
  File "/usr/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['python', '-c', 'import torch; print(torch.cuda.device_count())']' returned non-zero exit status 1.

Behavior after the change (as expected):

~/Projects/AI/gpt-2-output-dataset/detector$ python3 -m server detector-large.pt
Loading checkpoint from detector-large.pt
Starting HTTP server on port 8080
[] Process has started; loading the model ...
[] Ready to serve
[] "GET / HTTP/1.1" 200 -
[] "GET /favicon.ico HTTP/1.1" 200 -
[] "GET /?This%20is%20an%20online%20demo%20of%20the%20GPT-2%20output%20detector%20model.%20Enter%20some%20text%20in%20the%20text%20box;%20the%20predicted%20probabilities%20will%20be%20displayed%20below.%20The%20results%20start%20to%20get%20reliable%20after%20around%2050%20tokens. HTTP/1.1" 200 -
