Comments (5)
Thanks! Feel free to ping me for examples. I may not respond immediately because we're all busy, but I'd like to help.
from datasets.
Hi @natolambert, could you please give some examples of JSON files to benchmark?
Please note that this JSON file (https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/eval-set-scores/Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback.json) is not in "records" orient; instead it has the following structure:
{
"chat_template": "tulu",
"id": [30, 34, 35,...],
"model": "Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback",
"model_type": "Seq. Classifier",
"results": [1, 1, 1, ...],
"scores_chosen": [4.421875, 1.8916015625, 3.8515625,...],
"scores_rejected": [-2.416015625, -1.47265625, -0.9912109375,...],
"subset": ["alpacaeval-easy", "alpacaeval-easy", "alpacaeval-easy",...],
"text_chosen": ["<s>[INST] How do I detail a...",...],
"text_rejected": ["<s>[INST] How do I detail a...",...]
}
Note that "records" orient should be a list (not a dict) with each row as one item of the list:
[
{"chat_template": "tulu", "id": 30,... },
{"chat_template": "tulu", "id": 34,... },
...
]
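For illustration, a columnar dict like the one above can be turned into the records orient with a few lines of plain Python. This is only a sketch (field names follow the snippet above): list-valued columns become per-row values, and scalar fields such as "chat_template" and "model" are broadcast to every row.

```python
import json

def columnar_to_records(data: dict) -> list[dict]:
    # Number of rows = length of any list-valued column.
    n_rows = max(len(v) for v in data.values() if isinstance(v, list))
    # Scalar fields (e.g. "chat_template", "model") are repeated on every row.
    return [
        {k: (v[i] if isinstance(v, list) else v) for k, v in data.items()}
        for i in range(n_rows)
    ]

columnar = {
    "chat_template": "tulu",
    "id": [30, 34, 35],
    "results": [1, 1, 1],
}
records = columnar_to_records(columnar)
print(json.dumps(records))
```

The output is a JSON array of row objects, i.e. the records orient described above.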
Thanks again for your feedback, @natolambert.
However, strictly speaking, the last file is not in JSON format but in a kind of JSON Lines-like format (although not proper JSON Lines either, because there are multiple newline characters within each object). Not even pandas can read that file.
Anyway, for proper JSON Lines files, I would expect datasets and pandas to have the same performance, as both use pyarrow under the hood...
A proper JSON file in records orient should be a list (a JSON array): the first character should be "[".
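As a quick check, pandas reads such a records-orient array directly (illustrative inline data below, not the actual file):

```python
import io

import pandas as pd

# A records-orient file is a JSON array of row objects; each object is one row.
text = '[{"chat_template": "tulu", "id": 30}, {"chat_template": "tulu", "id": 34}]'
df = pd.read_json(io.StringIO(text), orient="records")
print(df)
```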
Anyway, I am generating a JSON file from your JSON-Lines file to test performance.
We use a mix (which is a mess); here's an example with the records orient:
https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/best-of-n/alpaca_eval/tulu-13b/OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5.json
There are more in that folder, ~40 MB maybe?
@albertvillanova here's a snippet so you don't need to click:
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
0
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.076171875
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
1
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.87890625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
2
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.287109375
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
3
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 1.6337890625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
4
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 5.27734375
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
5
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.0625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
6
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 2.29296875
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
7
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 6.77734375
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
8
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.853515625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
9
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 4.86328125
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
10
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 2.890625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
11
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 4.70703125
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
12
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 4.45703125
}
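A file like the snippet above (concatenated pretty-printed objects, with no commas and no surrounding array) can still be parsed in plain Python. A sketch using json.JSONDecoder.raw_decode, which steps through the stream one object at a time:

```python
import json

def parse_concatenated_json(text: str) -> list:
    # json.load rejects this format ("Extra data" after the first object),
    # so decode one object at a time, skipping the whitespace in between.
    decoder = json.JSONDecoder()
    objs, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        objs.append(obj)
    return objs
```

Writing the result back with json.dump then yields a proper JSON array in records orient.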