Giter VIP home page Giter VIP logo

jailbreak_llm's Introduction

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

This is the official repository for "Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation", by Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen.

In this work, we propose the generation exploitation attack, an extremely simple approach that disrupts model alignment by only manipulating variations of generation methods. Check out our project page for more details and examples!

If you find our implementation and paper helpful, please consider citing our work:

@article{huang2023catastrophic,
  title={Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation},
  author={Huang, Yangsibo and Gupta, Samyak and Xia, Mengzhou and Li, Kai and Chen, Danqi},
  journal={arXiv preprint arXiv:2310.06987},
  year={2023}
}

Table of Contents

Updates

  • 2023-10-11: We release the first version.

Installation

Our implementation was based on transformers==4.32.1 and torch==2.0.1.

Preparation

Our attack is model-agnostic. To launch the attack on a specified model, please download the model and insert/update its path in configs.py. The table below provides links to all models we used in the paper.

Model Hugging Face Page
Vicuna-7B https://huggingface.co/lmsys/vicuna-7b-v1.5
Vicuna-13B https://huggingface.co/lmsys/vicuna-13b-v1.5
Vicuna-33B https://huggingface.co/lmsys/vicuna-33b-v1.3
MPT-7B https://huggingface.co/mosaicml/mpt-7b
MPT-30B https://huggingface.co/mosaicml/mpt-30b
Falcon-7B https://huggingface.co/tiiuae/falcon-7b
Falcon-40B https://huggingface.co/tiiuae/falcon-40b
LLaMA2-7B https://huggingface.co/meta-llama/Llama-2-7b-hf
LLaMA2-13B https://huggingface.co/meta-llama/Llama-2-13b-hf
LLaMA2-7B-chat https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
LLaMA2-13B-chat https://huggingface.co/meta-llama/Llama-2-13b-chat-hf

Launch the attack

Default or greedy generation

The commands below runs the default decoding (temperature=0.1, top_p=0.9) for the Llama-2-7b-chat-hf model:

python attack.py \
    --model Llama-2-7b-chat-hf \
    --use_default \
    --use_system_prompt

or the greedy decoding for the Llama-2-7b-chat-hf model:

python attack.py \
    --model Llama-2-7b-chat-hf \
    --use_greedy \
    --use_system_prompt

Exploited generation

The command below exploits various decoding settings for the Llama-2-7b-chat-hf model (with the system prompt disabled):

python attack.py \
    --model Llama-2-7b-chat-hf \
    --tune_temp \
    --tune_topp \
    --tune_topk \
    --n_sample 1

You can increase the --n_sample parameter to generate more examples for each prompt, which potentially makes the attack stronger.

Evaluate the attack

Before running the evaluation, please make sure that you have prepared a risk scorer model (saved in SCORER_PATH) for the attacker to pick the best attack result among several candidates. In the meantime, you need to prepare a separate evaluator (saved in EVALUATOR_PATH) for the final scoring purpose.

You may download our scorer and evaluator here.

To evaluate the attack with substring match by Zou et al., you can use the following command:

python evaluate.py \
    --model Llama-2-7b-chat-hf \
    --config 'exploited' \
    --matching_only \
    --scorer_path SCORER_PATH

or using our evaluator:

python evaluate.py \
    --model Llama-2-7b-chat-hf \
    --config 'exploited' \
    --evaluator_path EVALUATOR_PATH \
    --scorer_path SCORER_PATH

which summarizes results in a json file. For instance, here is the output for Llama-2-7b-chat-hf model with n_sample=1.

{
    "best_attack_config": {
        "temp": {
            "temp_0.95": "25"   # Most vulnerable temp is 0.95, which gives 25% ASR
        },
        "topk": {
            "topk_500": "26"   # Most vulnerable top-k is 500, which gives 26% ASR
        },
        "topp": {
            "topp_0.7": "29"   # Most vulnerable temp is 0.7, which gives 29% ASR
        }
    },
    "greedy": "16",            # Greedy decoding without system prompt gives 16% ASR
    "break_by_temp": "47",     # Exploiting temp only gives 47% ASR
    "break_by_topk": "54",     # Exploiting top-k only gives 54% ASR
    "break_by_topp": "77",     # Exploiting top-p only gives 77% ASR
    "break_by_all": "81"       # Exploiting all decoding strategies gives 81% ASR
}

You can aggregate results for different models (to obtain Table 1 in our paper) using the example shown in aggregate_results.ipynb.

Bugs or questions

If you have any questions related to the code or the paper, feel free to email Yangsibo ([email protected]) or open an issue. Please try to specify the problem with details so we can help you better and quicker!

jailbreak_llm's People

Contributors

hazelsuko07 avatar danqi avatar eltociear avatar princeton-sysml avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.