clp24-arch-entropy

Code for our paper, "Archetypes and Entropy: Theory-Driven Extraction of Evidence for Suicide Risk," our submission to CLPsych 2024 Shared Tasks A and B.

Each directory description

Task A:
Task B:

Relative entropy pipeline

All code is in the relative-entropy directory. There is a script that will run the full pipeline: run_entropy_pipeline.sh (see steps 0 and 1 below before running it). The section below describes each step.

Steps

Data: This pipeline assumes that the following files are in one directory:
- clp24_all_messages.csv: All messages from the dataset, including data from the no risk users. Prepare it so that it contains a label column with the post authors' risk levels and a by column (which annotator group). For no risk users, set the label to "No".
- clp24_all_messages_test.csv: The messages of the users designated for the test set for the shared task.
- clp24_all_messages_train.csv: The messages of the users not in the test set.
- clp24_SW_messages_sent_tokenized.csv: Sentences from each post in r/SuicideWatch (nltk.sent_tokenize applied to messages).
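
Before running the pipeline, it may help to confirm that all four files are in place. The following is a minimal sketch, not part of the repository: the file names come from the list above, while the helper name missing_data_files is hypothetical.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = [
    "clp24_all_messages.csv",
    "clp24_all_messages_test.csv",
    "clp24_all_messages_train.csv",
    "clp24_SW_messages_sent_tokenized.csv",
]


def missing_data_files(datadir):
    """Return the required file names that are not present in datadir.

    Hypothetical helper for illustration; the repository does not ship it.
    """
    root = Path(datadir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list from missing_data_files("/data/clp24/") would mean the directory is ready for the steps below.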

Create a file config.py with these contents:

'''config.py'''
DATADIR = # string pointing to the path of the clpsych24 shared task data, e.g., "/data/clp24/" 
MODELDIR = # string pointing to a path to a directory where you want to save the finetuning output, e.g., "/data/clp24/finetune-output/"
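
For reference, a filled-in config.py could look like the sketch below. The two paths are simply the example values from the comments above and are assumptions about your setup; replace them with your own locations.

```python
"""config.py"""
# Example values copied from the template comments; replace with your own paths.
DATADIR = "/data/clp24/"  # directory holding the clpsych24 shared task data
MODELDIR = "/data/clp24/finetune-output/"  # directory for the finetuning output
```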

Prepare dataset for language modeling scripts
```
python prepare_data.py
```
- The script will create two files:
  1. DATADIR/all_labels_messages_train.csv
  2. DATADIR/all_labels_messages_test.csv
- Change where to save the files in the two lines after the imports, if you like.
- Change the number of no risk posts to include in the test set, if you like. We go with a small sample.

Finetune a language model on a group of CLPsych users based on risk level.

You must specify a group to finetune on. All options follow the format {Risk level}_risk_{Subreddit set}. Risk level can be no, low, moderate, high, or any (includes low, moderate, and high, but not no risk). Subreddit set can be sw (SuicideWatch posts only) or all (all posts by the users in the risk level group).
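
The naming scheme can be illustrated with a short sketch that builds every valid group name and splits a name back into its parts. The function names here are hypothetical and are not taken from finetune.py.

```python
from itertools import product

RISK_LEVELS = ["no", "low", "moderate", "high", "any"]
SUBREDDIT_SETS = ["sw", "all"]


def valid_train_groups():
    """Build every name of the form {Risk level}_risk_{Subreddit set}."""
    return [f"{risk}_risk_{subs}" for risk, subs in product(RISK_LEVELS, SUBREDDIT_SETS)]


def parse_train_group(name):
    """Split a group name into (risk level, subreddit set); raise on bad names."""
    risk, sep, subs = name.partition("_risk_")
    if not sep or risk not in RISK_LEVELS or subs not in SUBREDDIT_SETS:
        raise ValueError(f"not a valid train group: {name!r}")
    return risk, subs
```

For instance, parse_train_group("high_risk_sw") yields the pair ("high", "sw").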

Example:

python finetune.py --train_group high_risk_sw --model_output_dir data/model_output

Usage:
python finetune.py [-h] -g {low_risk_sw,high_risk_sw,moderate_risk_sw,low_risk_all,high_risk_all,moderate_risk_all,no_risk_sw,no_risk_all,any_risk_sw,any_risk_all} [-tr ALL_LABELS_MESSAGES_TRAIN] [-d MODEL_OUTPUT_DIR] [-k] [--logging_dir LOGGING_DIR] [--device DEVICE] [-m {distilgpt2}] [-e NUM_EPOCHS] [-lr LEARNING_RATE] [-wd WEIGHT_DECAY] [-ss SAVE_STRATEGY] [-es EVAL_STRATEGY] [--logging_steps LOGGING_STEPS] [--hidden_dropout_prob HIDDEN_DROPOUT_PROB] [--train_proportion TRAIN_PROPORTION]

Options:
    -h, --help
    -g, --train_group {low_risk_sw,high_risk_sw,moderate_risk_sw,low_risk_all,high_risk_all,moderate_risk_all,no_risk_sw,no_risk_all,any_risk_sw,any_risk_all}
    -tr ALL_LABELS_MESSAGES_TRAIN, --all_labels_messages_train ALL_LABELS_MESSAGES_TRAIN
                            Path to training data file.
    -d, --model_output_dir MODEL_OUTPUT_DIR
                            Directory for saving the model checkpoints. They will be saved at [args.dir]/[args.train_group]. Recommended to make this unique to your experiment.
    -k, --keep_annotations
                            Pass in -k if you want to include the posts that we annotated internally.
    --logging_dir LOGGING_DIR
                            Path to a logging dir if you don't want to use the default.
    --device DEVICE
    -m, --base_lm {distilgpt2}
                            Name of pre-trained language model you want to finetune.
    -e, --num_epochs NUM_EPOCHS
    -lr LEARNING_RATE, --learning_rate LEARNING_RATE
    -wd, --weight_decay WEIGHT_DECAY
    -ss, --save_strategy SAVE_STRATEGY
    -es, --eval_strategy EVAL_STRATEGY
    --logging_steps LOGGING_STEPS
                            Number of update steps between logs
    --hidden_dropout_prob HIDDEN_DROPOUT_PROB
                            Dropout
    --train_proportion TRAIN_PROPORTION
                            Proportion of data to use for train, the rest will be used for eval.

(Optional) Plot losses of the finetuned language models.

python plot_loss.py

Options:
    -h, --help            show this help message and exit
    -lm LM_TO_PLOT [LM_TO_PLOT ...], --lm_to_plot LM_TO_PLOT [LM_TO_PLOT ...]
                            Name of the language models you want to plot (i.e. the 'train group'). If passing in nothing, it will do all found LMs in --model_dir path.
    -d, --model_dir MODEL_DIR 
                            Directory where the lms are. Defaults to MODELDIR specified in config.py.
    -o, --plot_path PLOT_PATH
                            Directory where you want to save the plot images. Defaults to MODELDIR/_loss_plots.
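
If you want to inspect the loss curves without plotting, the sketch below separates training and evaluation losses from a training log. It assumes the models were finetuned with the Hugging Face Trainer, which writes a trainer_state.json file with a log_history list into each checkpoint directory. That assumption is suggested by options such as --save_strategy and --eval_strategy but is not confirmed here, and the function names are hypothetical.

```python
import json
from pathlib import Path


def split_loss_history(log_history):
    """Separate (step, loss) pairs for training and evaluation log entries."""
    train = [(e["step"], e["loss"]) for e in log_history if "loss" in e]
    evals = [(e["step"], e["eval_loss"]) for e in log_history if "eval_loss" in e]
    return train, evals


def load_loss_history(checkpoint_dir):
    """Read trainer_state.json from a checkpoint directory (assumed layout)."""
    state_path = Path(checkpoint_dir) / "trainer_state.json"
    state = json.loads(state_path.read_text())
    return split_loss_history(state["log_history"])
```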

Compute token entropies using one group's language model on another group's test data.

TODO: optimize implementation, current implementation is inefficient

Example:

# compute losses from the any_risk_sw model on its own test data
python compute_entropy.py --train_group any_risk_sw --test_group any_risk_sw;

# compute losses from the no_risk_sw model on the any_risk_sw test data
python compute_entropy.py --train_group no_risk_sw --test_group any_risk_sw;

Options:
    -h, --help            show this help message and exit
    -tr {low_risk_sw,high_risk_sw,moderate_risk_sw,low_risk_all,high_risk_all,moderate_risk_all,no_risk_sw,no_risk_all,any_risk_sw,any_risk_all}, --train_group {low_risk_sw,high_risk_sw,moderate_risk_sw,low_risk_all,high_risk_all,moderate_risk_all,no_risk_sw,no_risk_all,any_risk_sw,any_risk_all}
                            This specifies the language model that we'll use to compute losses.
    -te {low_risk_sw,high_risk_sw,moderate_risk_sw,low_risk_all,high_risk_all,moderate_risk_all,no_risk_sw,no_risk_all,any_risk_sw,any_risk_all,annotations}, --test_group {low_risk_sw,high_risk_sw,moderate_risk_sw,low_risk_all,high_risk_all,moderate_risk_all,no_risk_sw,no_risk_all,any_risk_sw,any_risk_all,annotations}
                            This specifies the data that the language model will be run on for losses.
    -d MODEL_OUTPUT_DIR, --model_output_dir MODEL_OUTPUT_DIR
                            Directory where the finetuned models are saved.
    -o TOKEN_ENTROPIES_DIR, --token_entropies_dir TOKEN_ENTROPIES_DIR
                            Directory path where you want to save the token entropies.
    --all_labels_messages_train ALL_LABELS_MESSAGES_TRAIN
                            Path to training data file.
    --all_labels_messages_test ALL_LABELS_MESSAGES_TEST
                            Path to test data file.
    -m {distilgpt2,lsanochkin/deberta-large-feedback,microsoft/deberta-base}, --base_lm {distilgpt2,lsanochkin/deberta-large-feedback,microsoft/deberta-base}
    -lm {lm,mlm}, --lm_type {lm,mlm}
                            lm: CausalLM | mlm: MaskedLM. Not implemented for mlm yet.
    --device DEVICE
    -sw SW_SIZE, --sw_size SW_SIZE
                            Context window size preceding target token.
    --checkpoint_selection CHECKPOINT_SELECTION
                            Strategy for choosing the model checkpoint to load. Use a) min_eval_loss if you want to choose based on the minimum loss on the val set during training, b) last to use the last checkpoint, or c) a path to
                            the checkpoint directory, e.g., model_dir/train_group/checkpoint-1000.
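
The help text above describes the output as the losses of one group's language model on another group's data. Assuming this means the per-token cross-entropy (the negative log-probability the model assigns to each observed token given its preceding context), the quantity can be sketched in plain Python as below. This is a toy illustration with hand-supplied logits, not the code in compute_entropy.py, and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_entropies(logits_per_position, target_ids):
    """Negative log-probability (in nats) of each observed token.

    logits_per_position[i] holds the model's logits for position i, and
    target_ids[i] is the vocabulary index of the token actually observed there.
    """
    return [
        -log_softmax(logits)[target]
        for logits, target in zip(logits_per_position, target_ids)
    ]
```

A token the model finds surprising gets a high value, and a token it predicts confidently gets a value near zero. With uniform logits over a four-word vocabulary, every token scores log(4) nats.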

Map token entropies to sentences. The script aggregates the token entropies at the sentence level. The default output file is {ARGS.test_group}_sentence_entropies.json.

python sentence_entropy.py

Options:
    -h, --help            show this help message and exit
    -dir TOKEN_ENTROPIES_DIR, --token_entropies_dir TOKEN_ENTROPIES_DIR
                            Directory where you have the token entropies saved.
    -o ARGS.OUTDIR, --ARGS.outdir ARGS.OUTDIR
                            Directory where you want to save the sentence entropies.
    --all_messages ALL_MESSAGES
                            Path to test data file.
    -st SENT_TOKENIZED_FILE, --sent_tokenized_file SENT_TOKENIZED_FILE
                            Path to the data broken into sentences.
    -te {any_risk_sw,annotations}, --test_group {any_risk_sw,annotations}
                            This specifies the data that the language model will be run on for losses.
    -s, --save_intermediate_values
                            Pass in if you want intermediate values to be written to a file.
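
One way to picture the aggregation is sketched below. It assumes each token carries a character span within its post and that a sentence's score is the mean of its tokens' entropies. Both the data layout and the choice of the mean are assumptions for illustration; sentence_entropy.py may aggregate differently.

```python
def sentence_entropies(token_spans, token_values, sentence_spans):
    """Mean token entropy per sentence; None for sentences with no tokens.

    token_spans and sentence_spans are (start, end) character offsets into the
    same post, with end exclusive. A token belongs to the sentence that
    contains its start offset.
    """
    results = []
    for sent_start, sent_end in sentence_spans:
        values = [
            value
            for (tok_start, _tok_end), value in zip(token_spans, token_values)
            if sent_start <= tok_start < sent_end
        ]
        results.append(sum(values) / len(values) if values else None)
    return results
```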

Explore span selection policies and write a submission file with the chosen policy. You may need to edit the script by hand to adjust the policies or create new ones; the script already includes code that outputs samples based on your policies.
```
python span_selection.py # run with -w if you want to write submission files with the policies implemented in the script.
```
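
As an illustration of what a selection policy might look like, the sketch below keeps the sentences whose score is at or above a percentile cutoff. It is not one of the policies in span_selection.py: the function names are hypothetical, the nearest-rank percentile is an assumed definition, and the default of 70 is borrowed from the submission names used elsewhere in this README.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: the smallest value with at least q% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def select_spans(sentences, scores, percentile_threshold=70):
    """Keep sentences whose score reaches the percentile cutoff (assumed policy)."""
    cutoff = percentile(scores, percentile_threshold)
    return [sent for sent, score in zip(sentences, scores) if score >= cutoff]
```

A real policy could combine several scores per sentence; this sketch uses a single score to keep the thresholding step visible.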

(Optional) Use the compare_systems.py script to compare the highlighted evidence per user from two systems.

# example
python compare_systems.py -a max_arch_score_70_submission -b prod_max_arc_ent_ac_70_submission

Options:
    -h, --help            show this help message and exit
    -s SAVE_DF, --save_df SAVE_DF
                            Pass in a path to save the df with all the intermediate and final score values.
    -arc ARCHETYPES_SENT_FILE, --archetypes_sent_file ARCHETYPES_SENT_FILE
                            Path to archetypes sentences file.
    -dir SENTENCE_ENTROPY_DIR, --sentence_entropy_dir SENTENCE_ENTROPY_DIR
                            Path to directory that has the sentence entropy files.
    -te {any_risk_sw,annotations}, --test_group {any_risk_sw,annotations}
                            This specifies whether we're using the internal annotations or the task submission data.
    -df SENT_DF_PATH, --sent_df_path SENT_DF_PATH
                            Path to sentence entropies file. This is an alternative to passing in --sentence_entropy_dir and --test_group.
    -p PERCENTILE_THRESHOLD, --percentile_threshold PERCENTILE_THRESHOLD
    -o SUBMISSION_FILE_PATH, --submission_file_path SUBMISSION_FILE_PATH
    -ml MAX_OUTPUT_LEN, --max_output_len MAX_OUTPUT_LEN
    --num_posts_to_show NUM_POSTS_TO_SHOW
    --n_factors N_FACTORS
    -w, --write
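
The kind of per-user comparison this step performs can be sketched as a set overlap. The sketch assumes each system's output has been loaded as a dictionary mapping a user ID to a list of highlighted strings. That layout, the function name, and the Jaccard score are assumptions for illustration and do not describe compare_systems.py itself.

```python
def compare_highlights(system_a, system_b):
    """Per-user overlap between two systems' highlighted evidence.

    Each argument maps a user ID to a list of highlighted strings
    (an assumed layout, not the actual submission file format).
    """
    report = {}
    for user in sorted(set(system_a) | set(system_b)):
        a = set(system_a.get(user, []))
        b = set(system_b.get(user, []))
        union = a | b
        report[user] = {
            "shared": sorted(a & b),
            "only_a": sorted(a - b),
            "only_b": sorted(b - a),
            # Jaccard similarity; defined as 1.0 when both sets are empty.
            "jaccard": len(a & b) / len(union) if union else 1.0,
        }
    return report
```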

TODO: (optional) Make latex visual of highlighted spans
