
taco's Introduction

TACO (Topics in Algorithmic COde generation dataset)

🤗 Hugging Face   |    BAAI DataHub   |    Paper


TACO (Topics in Algorithmic COde generation dataset) is a dataset focused on algorithmic code generation, designed to provide a more challenging training set and evaluation benchmark for code generation models. The dataset consists of programming competition problems that are harder and closer to real programming scenarios. It emphasizes improving or evaluating a model's understanding and reasoning in practical settings, rather than the mere implementation of predefined functions.

  • Larger scale: TACO includes a training set (25,443 problems) and a test set (1,000 problems), making it the largest code generation dataset currently available.
  • Higher quality: Each problem in the TACO dataset is paired with a diverse set of solution answers (up to 1.55M answers), which reduces the risk of overfitting during training and makes evaluation results more reliable.
  • Fine-grained labels: Each problem in the TACO dataset includes fine-grained labels such as task topics, algorithms, skills, and difficulty levels. These labels provide more accurate references for the training and evaluation of code generation models.

News and Updates

  • 🚀🚀🚀[2024/06/19] Announcing a testing-framework update: bug fixes and enhanced functionality! We are excited to release our first version after extensive manual validation and debugging. This update addresses several bugs in the testing framework we adopted from the APPS benchmark; we have double-checked the test cases and updated the TACO test set to a new version.

    Details on the fixes can be found here.

  • 🔥🔥🔥[2024/04/11] Announcing the release of part of the TACO project models on Hugging Face! We have fully fine-tuned top-tier code models such as CodeLlama and StarCoder, ranging from 1B to 15B parameters, specifically tailored for competitive algorithmic challenges. Check out our models under FlagOpen on Hugging Face. Join us in advancing the Code & LLMs community! 🏋️‍♂️👩‍💻

Download and Use

🤗 Hugging Face

First, install the datasets package.

pip install -U datasets

Then, load the dataset with the following program.

from datasets import load_dataset
taco = load_dataset('BAAI/TACO', token=YOUR_HF_TOKEN)
  • You can also specify the split ("train" or "test") through
    from datasets import load_dataset
    taco = load_dataset('BAAI/TACO', split='train', token=YOUR_HF_TOKEN)
  • You can also specify the difficulties (a list drawn from ["EASY", "MEDIUM", "MEDIUM_HARD", "HARD", "VERY_HARD"], with ["ALL"] as the default) or the skills (a list drawn from ["Data structures", "Sorting", "Range queries", "Complete search", "Amortized analysis", "Dynamic programming", "Bit manipulation", "Greedy algorithms"], with ["ALL"] as the default) by passing the corresponding list as an argument.
    from datasets import load_dataset
    taco_difficulties = load_dataset('BAAI/TACO', difficulties=['EASY'], token=YOUR_HF_TOKEN)
    from datasets import load_dataset
    taco_skills = load_dataset('BAAI/TACO', skills=['Sorting', 'Range queries'], token=YOUR_HF_TOKEN)
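
Once loaded, each record carries the problem statement along with its reference solutions, test cases, and labels. As a quick sanity check, the sketch below prints a few fields of the first training example. It assumes that solutions is stored as a JSON-encoded list of programs and that skill_types is a serialized list (the filtering examples below parse it with eval), so both need to be decoded before use.

import json
from datasets import load_dataset

taco = load_dataset('BAAI/TACO', split='train', token=YOUR_HF_TOKEN)
sample = taco[0]
print(sample['question'][:300])    # problem statement
print(sample['difficulty'])        # e.g. EASY, MEDIUM, ...
print(sample['skill_types'])       # serialized list of skill labels
solutions = json.loads(sample['solutions'])   # assumed to be a JSON-encoded list of reference programs
print(len(solutions), 'reference solutions')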

BAAI DataHub

First, download the dataset and unzip it into a folder named "BAAI-TACO". Then, load the dataset with the following program.

from datasets import load_from_disk
taco = load_from_disk(PATH_TO_BAAI_TACO)  # path to the unzipped "BAAI-TACO" folder
  • You can also specify the split ("train" or "test") through
    from datasets import load_from_disk
    taco = load_from_disk(PATH_TO_BAAI_TACO)['train']
  • You can also specify the difficulties (a list drawn from ["EASY", "MEDIUM", "MEDIUM_HARD", "HARD", "VERY_HARD"], with ["ALL"] as the default) or the skills (a list drawn from ["Data structures", "Sorting", "Range queries", "Complete search", "Amortized analysis", "Dynamic programming", "Bit manipulation", "Greedy algorithms"], with ["ALL"] as the default) by filtering the loaded dataset on the corresponding fields, as shown below.
    from datasets import load_from_disk
    difficulties = ['EASY']
    taco = load_from_disk(PATH_TO_BAAI_TACO)
    taco_difficulties = taco.filter(lambda entry: entry['difficulty'] in difficulties)
    from datasets import load_from_disk
    skills = set(['Sorting', 'Range queries'])
    taco = load_from_disk(PATH_TO_BAAI_TACO)
    taco_skills = taco.filter(lambda entry: set(eval(entry['skill_types'])) & skills)
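
If you need both kinds of filtering at once on a locally loaded copy, the two conditions can be combined in a single filter call. The sketch below reuses the difficulty and skill_types fields from the examples above.

from datasets import load_from_disk

difficulties = {'EASY', 'MEDIUM'}
skills = {'Sorting', 'Range queries'}
taco = load_from_disk(PATH_TO_BAAI_TACO)['train']
taco_subset = taco.filter(
    lambda entry: entry['difficulty'] in difficulties
    and bool(set(eval(entry['skill_types'])) & skills)
)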

Statistics of TACO

| Comparison Dimension | TACO | CodeContest | APPS | HumanEval(/-X) | MBP(/X)P |
|---|---|---|---|---|---|
| Problem Scale (train/dev/test) | 25443/-/1000 | 13328/117/165 | 5000/-/5000 | -/-/164 | 374/-/500 |
| No Answers in Test Set | 0 | 43/165 | 1235/5000 | 0 | 0 |
| Duplicate Questions | No Duplication | No Duplication | No Duplication | Duplicates Removed | Duplicates Removed |
| Duplicate Answers | Duplicates Removed | No Duplication | No Duplication | Duplicates Removed | Duplicates Removed |
| Test Cases/Problems | 202.3 | 203.7 | 20.99 | 7.77 | 3 |
| Task Topics | Yes | Yes | No | No | No |
| Algorithm Tags | Yes | No | No | No | No |
| Programming Skills | Yes | No | No | No | No |
| Difficulty Tags | Yes | Yes | Yes | No | No |

The distribution of algorithm tags in TACO:

The distribution of programming skills in TACO:

Evaluation with TACO

First, initialize the model and tokenizer, as well as the difficulties or skills you want to evaluate on.

# Initialize model and tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'codellama/CodeLlama-7b-hf'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
device = "cuda:0"
model = model.to(device)


# Initialize evaluation dataset 
difficulties = ['ALL']
# difficulties = ["EASY", "MEDIUM", "MEDIUM_HARD", "HARD", "VERY_HARD"] 
# skills = ['ALL']
# skills = ["Data structures", "Sorting", "Range queries", "Complete search", "Amortized analysis", "Dynamic programming", "Bit manipulation", "Greedy algorithms"]

from datasets import load_dataset
taco = load_dataset('BAAI/TACO', split='test', difficulties=difficulties)
# taco = load_dataset('BAAI/TACO', split='test', skills=skills)

Then, run generations with code models.

# Sampling settings: completions per problem and decoding parameters.
# predict() and truncate_after_eof_strings() are helpers defined in generation.py.
n_samples = 200
temperature = 0.2
top_p = 0.95
output = []
for idx, sample in enumerate(taco):
    prompt = sample['question']
    results = {"task_id": idx, "prompt": prompt}
    generations = []
    for i in range(n_samples):
        seed = i
        generation = predict(device, model, tokenizer, prompt, seed, top_p, temperature, max_length=2048)
        clean_code = truncate_after_eof_strings(generation)
        generations.append(clean_code)
    results["output"] = generations
    output.append(results)
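
The loop above relies on two helpers from the repository's generation.py: predict, which draws one sampled completion, and truncate_after_eof_strings, which post-processes it. The snippet below is only a minimal sketch of what predict might look like, assuming a standard Hugging Face sampling call; the actual implementation in generation.py may differ.

import torch

def predict(device, model, tokenizer, prompt, seed, top_p, temperature, max_length=2048):
    # Seed each draw so the n_samples completions are reproducible.
    torch.manual_seed(seed)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_length).to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            max_length=max_length,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Return only the newly generated continuation, without the prompt tokens.
    return tokenizer.decode(outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)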

generation.py gives a complete example of generating TACO result samples with CodeLlama; it outputs a JSON file, generation.json.

[
    {
        "task_id": 0,
        "prompt": "The city park of IT City contains n east to ...",
        "output": [
            "\ndef solve(n):\n    return n**5 - 10*n**4 + 40*n**3 ...",
            "\ndef solve(n):\n    return n**5 - 10*n**4 + 40*n**3 ...",
            ...
        ]
    },
    {
        "task_id": "1",
        "prompt": "Zookeeper is buying a carton of fruit to feed ...",
        "output": [
            "\ndef solve(n, s):\n    pre, suf, ans = [0]*n, [0]*n, ...",
            "\ndef solve(n, s):\n    pre, suf, ans = [0]*n, [0]*n, ...",
            ...
        ]
    },
    ...
]

Finally, execute the generated code and compute metrics. compute_metric.py gives a complete example of code execution and pass@k computation using the generation.json produced in the previous step.
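
For reference, pass@k is usually computed with the unbiased estimator from the Codex paper, where n is the number of generations per problem and c the number that pass all test cases. The sketch below shows that estimator; compute_metric.py may differ in implementation details.

import numpy as np

def pass_at_k(n, c, k):
    # Probability that at least one of k samples drawn from n (with c correct) passes.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))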

The result file taco_metrics.json looks like this:

{
    "pass@1": 0.0932,
    "pass@10": 0.1515,
    "pass@100": 0.1999,
    "detail" : {
        "pass@1": {
            "0": ...,
            "1": ...,
            ...
        },
        "pass@10": {
            "0": ...,
            "1": ...,
            ...
        },
        "pass@100": {
            "0": ...,
            "1": ...,
            ...
        },
    }
}

Finetuning with TACO

First, tokenize the training set of TACO. We provide a Python script, pretokenizing.py, and an example shell script, pretokenize.sh, to help you. This step writes the pretokenized training data to cache_dir under the name dataset_name. Below is an example using the CodeLlama-7b tokenizer.

python pretokenizing.py \
    --tokenizer_dir codellama/CodeLlama-7b-hf \
    --cache_dir . \
    --dataset_name codellama_tokenized 

Then, fine-tune on the pretokenized training data. We provide a Python script, train.py, and an example shell script, finetune.sh, to help you. This step writes the checkpoints to output_dir. Below is an example of fine-tuning CodeLlama-7b.

torchrun --nproc_per_node=8 --nnodes=1 train.py \
    --model_name_or_path codellama/CodeLlama-7b-hf \
    --data_path codellama_tokenized \
    --bf16 True \
    --output_dir codellama_ft \
    --num_train_epochs 2 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 5e-5 \
    --weight_decay 0.1 \
    --warmup_ratio 0.1 \
    --logging_steps 1 \
    --resume_from_checkpoint True \
    --gradient_checkpointing True \
    --deepspeed ds_configs/deepspeed_z2_config_bf16.json

Evaluation Results

We conducted experiments on the TACO test set and training set with GPT-4 and with code generation models trained on large amounts of code data. The results show:

  • The TACO test set is highly challenging: GPT-4 achieves a pass@1 score of only 31.5 even at the easy level. Apart from GPT-4, the pass@1 scores of the other code models across the five difficulty levels are generally below 10, and even their pass@100 scores do not reach GPT-4's pass@1.

  • Utilizing the TACO training set with fine-grained labels can selectively enhance the performance of code generation models. For instance, after fine-tuning starcoder-1b on specific skills using the TACO training set, there is a noticeable improvement in performance.

Citation

If you use the models, data, or code from this project, please cite the original paper:

@article{li2023taco,
  title={TACO: Topics in Algorithmic COde generation dataset},
  author={Rongao Li and Jie Fu and Bo-Wen Zhang and Tao Huang and Zhihong Sun and Chen Lyu and Guang Liu and Zhi Jin and Ge Li},
  journal={arXiv preprint arXiv:2312.14852},
  year={2023}
}

License

The TACO dataset, authored by BAAI, Shandong Normal University, and Peking University, is released under the Apache 2.0 License. However, the data also includes content under other permissive licenses such as the MIT License, as well as web-crawled data used under the terms of the CC BY 4.0 license (Creative Commons Attribution 4.0 International).


taco's Issues

compute_metric.py seems to have a problem

I sampled 20 problems from the test set and used compute_metric.py to evaluate the solution code that ships with the dataset, but only about 60%-70% of the solutions passed. I took one of the failing solutions and checked it manually: it is correct, and it was also accepted when submitted to Codeforces, so the problem appears to be in the evaluation itself. Also, were the results reported in the paper computed with this script?

How to construct the appropriate output?

I am trying to evaluate codellama-7b on the easy difficulty of the TACO dataset, but I find that the generated code must follow a certain format in order to pass the test cases, e.g.
s = input()\nprint(s.swapcase())
rather than
def solve(s):\n    return s.swapcase()
How should I construct the appropriate output?

Does the updated evaluation framework have a serious bug?

I tried evaluating the dataset's own solution code again, but all evaluations seem to fail: every result returned by check_correctness() is -1. I am not familiar with the APPS testing framework, so I did not investigate further. Could you look into this issue?

How is the difficulty level obtained?

Hi, thanks for the great work! I am curious about the criteria for the difficulty-level annotations in the dataset: are they based on the source websites' own tags or on user pass rates? Can you share more details on this? Thank you!

Data contamination

Great work! TACO is the best code generation dataset among the open-source datasets I have seen so far.

I am curious whether you considered data contamination when building the dataset. In the LLM era, whether the test set is contaminated is a very important consideration, and I could not find any related information in the paper.

code-llama-7b-python accuracy does not match the paper

I ran 200 problems from the easy portion of the test split and measured a pass@1 of only about 3, which does not match the accuracy reported in the paper. I used the prompt and evaluation code from this repo, with n_samples=1 and temperature=0.8. I would appreciate any help figuring out the cause. Thanks!

Finetuned Models

Hello! I'm having some trouble reproducing the finetuning with the script – would you mind releasing the trained models so I can verify their evaluation results and build on top of them? Thank you!

specific performance of gpt-4

Thanks for releasing this dataset and all the amazing work you have done! Do you have any data on the specific performance of gpt-4 on the test data set? If so, can you send me a copy?

AttributeError: module 'inspect' has no attribute 'getargspec'. Did you mean: 'getargs'?

When I use compute_metric.py to evaluate the generation results, the console reported "no module named pyext". I installed it using pip and got the following error:

Collecting pyext
  Using cached pyext-0.7.tar.gz (7.8 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [9 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "...\AppData\Local\Temp\pip-install-exx9gznl\pyext_28002897deae467da164cbba24ad8613\setup.py", line 6, in <module>
          import pyext
        File "...\AppData\Local\Temp\pip-install-exx9gznl\pyext_28002897deae467da164cbba24ad8613\pyext.py", line 117, in <module>
          oargspec = inspect.getargspec
                     ^^^^^^^^^^^^^^^^^^
      AttributeError: module 'inspect' has no attribute 'getargspec'. Did you mean: 'getargs'?
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

After searching online, I found that this is caused by a Python version incompatibility (inspect.getargspec was removed in newer Python versions), and I am using Python 3.11 to support OpenAI's models.

Upon further investigation, it appears that this issue does not exist in Python 3.8, so I downgraded to Python 3.8.

Moreover, someone mentioned that the problem may be resolved in the second quarter of 2024: https://community.privacyidea.org/t/python-3-11-support/3115/2

I suggest providing more detailed environment setup instructions on the guide page.
