Comments (5)
Thanks! Feel free to ping me for examples. I may not respond immediately because we're all busy, but I'd like to help.
from datasets.
Hi @natolambert, could you please give some examples of JSON files to benchmark?
Please note that this JSON file (https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/eval-set-scores/Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback.json) is not in "records" orient; instead it has the following structure:
{
"chat_template": "tulu",
"id": [30, 34, 35,...],
"model": "Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback",
"model_type": "Seq. Classifier",
"results": [1, 1, 1, ...],
"scores_chosen": [4.421875, 1.8916015625, 3.8515625,...],
"scores_rejected": [-2.416015625, -1.47265625, -0.9912109375,...],
"subset": ["alpacaeval-easy", "alpacaeval-easy", "alpacaeval-easy",...],
"text_chosen": ["<s>[INST] How do I detail a...",...],
"text_rejected": ["<s>[INST] How do I detail a...",...]
}
Note that "records" orient should be a list (not a dict) with each row as one item of the list:
[
{"chat_template": "tulu", "id": 30,... },
{"chat_template": "tulu", "id": 34,... },
...
]
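For illustration, a columnar dict like the one above can be turned into the records orient with a few lines of plain Python. This is only a sketch (field names follow the snippet above): list-valued columns become per-row values, and scalar fields such as "chat_template" and "model" are broadcast to every row.

```python
import json

def columnar_to_records(data: dict) -> list[dict]:
    # Number of rows = length of any list-valued column.
    n_rows = max(len(v) for v in data.values() if isinstance(v, list))
    # Scalar fields (e.g. "chat_template", "model") are repeated on every row.
    return [
        {k: (v[i] if isinstance(v, list) else v) for k, v in data.items()}
        for i in range(n_rows)
    ]

columnar = {
    "chat_template": "tulu",
    "id": [30, 34, 35],
    "results": [1, 1, 1],
}
records = columnar_to_records(columnar)
print(json.dumps(records))
```

The output is a JSON array of row objects, i.e. the records orient described above.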
Thanks again for your feedback, @natolambert.
However, strictly speaking, the last file is not in JSON format but in a kind of JSON Lines-like format (although not proper JSON Lines either, because there are multiple newline characters within each object). Not even pandas can read that file.
Anyway, for proper JSON Lines files, I would expect datasets and pandas to have the same performance, as both use pyarrow under the hood...
A proper JSON file in records orient should be a list (a JSON array): the first character should be "[".
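As a quick check, pandas reads such a records-orient array directly (illustrative inline data below, not the actual file):

```python
import io

import pandas as pd

# A records-orient file is a JSON array of row objects; each object is one row.
text = '[{"chat_template": "tulu", "id": 30}, {"chat_template": "tulu", "id": 34}]'
df = pd.read_json(io.StringIO(text), orient="records")
print(df)
```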
Anyway, I am generating a JSON file from your JSON-Lines file to test performance.
We use a mix (which is a mess); here's an example with the records orient:
https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/best-of-n/alpaca_eval/tulu-13b/OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5.json
There are more in that folder, ~40 MB maybe?
@albertvillanova here's a snippet so you don't need to click:
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
0
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.076171875
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
1
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.87890625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
2
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.287109375
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
3
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 1.6337890625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
4
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 5.27734375
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
5
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.0625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
6
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 2.29296875
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
7
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 6.77734375
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
8
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 3.853515625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
9
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 4.86328125
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
10
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 2.890625
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
11
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 4.70703125
}
{
"config": "top_p=0.9;temp=1.0",
"dataset_details": "helpful_base",
"id": [
0,
12
],
"model": "allenai/tulu-2-dpo-13b",
"scores": 4.45703125
}
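A file like the snippet above (concatenated pretty-printed objects, with no commas and no surrounding array) can still be parsed in plain Python. A sketch using json.JSONDecoder.raw_decode, which steps through the stream one object at a time:

```python
import json

def parse_concatenated_json(text: str) -> list:
    # json.load rejects this format ("Extra data" after the first object),
    # so decode one object at a time, skipping the whitespace in between.
    decoder = json.JSONDecoder()
    objs, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        objs.append(obj)
    return objs
```

Writing the result back with json.dump then yields a proper JSON array in records orient.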