Circuit discovery in GPT-2 small, using sparse autoencoding
To install manually, just run these commands in the shell:
git clone https://github.com/DavidUdell/sparse_circuit_discovery
cd sparse_circuit_discovery
pip install -e .
Alternatively, I have a Docker image on DockerHub; it's especially convenient for pulling to a remote server.
Your base of operations is `sparse_coding/config/central_config.yaml`.
The most important hyperparameters are clustered up top:
## Key Params
# Throughout, leave out entries for None. Writing in `None` values will get
# you the string "None". Key params here:
ACTS_LAYERS_SLICE: "9:12"
INIT_THINNING_FACTOR: 0.01
NUM_SEQUENCES_INTERPED: 200
SEQ_PER_DIM_CAP: 100
# Only pin single dims per layer.
DIMS_PINNED:
9: [331]
In order:
- `ACTS_LAYERS_SLICE` is a Python slice formatted as a string. It sets which layers of the GPT-2 small model you'll interpret activations at.
- `INIT_THINNING_FACTOR` is the fraction of features at the first layer in your slice that you'll plot. I.e., a fraction of 1 will try to plot every feature in the layer.
- `NUM_SEQUENCES_INTERPED` is the number of token sequences used during plotting, for the purpose of calculating logit effects and downstream feature effects.
- `SEQ_PER_DIM_CAP` is the maximum number of top-activating sequences a feature can have. I.e., when it equals `NUM_SEQUENCES_INTERPED`, you're saying that any feature that fired on every sequence should be interpreted over all of those sequences. For computational reasons, we basically want to set `NUM_SEQUENCES_INTERPED` as high as we can, and then set this value relatively low, so that our interpretability calculations stay tractable.
- `DIMS_PINNED` is a dictionary of layer indices mapped to singleton lists containing lone feature indices. If set for the first layer in your slice, it completely overrides `INIT_THINNING_FACTOR`. (See the sketch after this list.)
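To make the interplay concrete, here's a minimal sketch, assuming PyYAML and hypothetical variable names (this isn't the repository's actual loader), of how these key params could be parsed and how `DIMS_PINNED` overrides `INIT_THINNING_FACTOR`:

```python
# Minimal sketch: parsing the key params. Illustrative only; names and
# logic here are hypothetical, not the repo's actual loader.
import yaml

with open("sparse_coding/config/central_config.yaml") as f:
    config = yaml.safe_load(f)

# "9:12" becomes slice(9, 12): layers 9, 10, and 11 of GPT-2 small.
start, stop = (int(i) for i in config["ACTS_LAYERS_SLICE"].split(":"))
acts_layers = slice(start, stop)

# Entries for None are left out of the YAML, so default to an empty dict.
dims_pinned = config.get("DIMS_PINNED") or {}
if start in dims_pinned:
    # A pin at the first sliced layer fully overrides the thinning factor.
    first_layer_features = dims_pinned[start]
else:
    # GPT-2 small has hidden width 768; at a projection factor of 32, each
    # autoencoder has 768 * 32 = 24576 features. A thinning factor of 0.01
    # keeps ~245 of them.
    total_features = 768 * 32
    num_kept = int(config["INIT_THINNING_FACTOR"] * total_features)
```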
Set these values, save `central_config.yaml`, then run interpretability with:
cd sparse_coding
python3 pipe.py
Data appears in `sparse_coding/data/`.
The last cognition graph you generated is saved as both a `.svg` for you and a `.dot` for the computer. If you run the interpretability pipeline again, the new data will expand upon that old `.dot` file. This way, you can progressively trace out circuits as you go.
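Concretely, you can think of that accumulation as a graph union. Here's a small sketch, assuming networkx with pydot installed and hypothetical file names (the repo handles this internally), of what "expanding the old `.dot` file" amounts to:

```python
# Sketch of the accumulation idea: a new run's graph merged into the old
# .dot file. The repo does this internally; file names here are hypothetical.
import networkx as nx
from networkx.drawing.nx_pydot import read_dot, write_dot

old_graph = read_dot("sparse_coding/data/cognition_graph.dot")
new_graph = read_dot("sparse_coding/data/new_run.dot")

# compose() takes the union of nodes and edges, so features and ablation
# edges traced in earlier runs persist alongside the new ones.
merged = nx.compose(old_graph, new_graph)
write_dot(merged, "sparse_coding/data/cognition_graph.dot")
```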
There's also an independent circuit validation pipeline, `val.py`. This script simultaneously ablates all the features that comprise a circuit, to see how the overall circuit behaves under ablation (rather than just looking at separate features under independent ablations, the way the `pipe.py` cognition graphs do).
To set this up, first set `ACTS_LAYERS_SLICE` to encompass the relevant layers in GPT-2 small, including one extra layer after the last layer you pin features in:
## Key Params
# Throughout, leave out entries for None. Writing in `None` values will get
# you the string "None". Key params here:
ACTS_LAYERS_SLICE: "6:9"
and then pin all the features that comprise a given circuit in `VALIDATION_DIMS_PINNED`:
# Here you can freely pin multiple dims per layer.
VALIDATION_DIMS_PINNED:
6: [8339, 14104, 18854]
7: [2118]
Now run validation with:
python3 val.py
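For intuition on what "simultaneously ablates" means here, the sketch below uses a toy autoencoder: every pinned feature across every pinned layer is zeroed on the same forward pass, rather than one at a time. This is a conceptual illustration, not `val.py`'s actual implementation; all names are hypothetical.

```python
# Toy illustration of simultaneous circuit ablation. Not val.py's real code.
import torch

class ToySAE(torch.nn.Module):
    """Stand-in for one layer's pretrained sparse autoencoder."""
    def __init__(self, d_model: int = 768, proj_factor: int = 32):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_model * proj_factor)
        self.dec = torch.nn.Linear(d_model * proj_factor, d_model)

    def encode(self, acts):
        return torch.relu(self.enc(acts))

    def decode(self, feats):
        return self.dec(feats)

def ablate(acts, sae, pinned_dims):
    """Project into the SAE basis, zero all pinned features at once, project back."""
    feats = sae.encode(acts)
    feats[..., pinned_dims] = 0.0
    return sae.decode(feats)

# The whole circuit from VALIDATION_DIMS_PINNED is knocked out together on
# each forward pass, rather than feature-by-feature as in pipe.py.
pinned = {6: [8339, 14104, 18854], 7: [2118]}
saes = {layer: ToySAE() for layer in pinned}
acts = torch.randn(1, 10, 768)  # (batch, seq, d_model) residual activations
patched = {layer: ablate(acts, saes[layer], dims) for layer, dims in pinned.items()}
```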
Consider the cognition graph at the top of this page. Each box with a label like `4.112` is a feature in a sparse autoencoder: `4` is its layer index, while `112` is its column index in that layer's autoencoder. You can cross-reference more comprehensive interpretability data for any given feature on Neuronpedia.
Blue tokens in the sequences in each box represent top feature activations in their contexts, out to a specified length on either side. Blue and red tokens in the individual boxes at the bottom are the logits most upweighted and downweighted by that feature's ablation. (Gray is the 0.0-effect edge case.) Arrows between boxes represent downstream ablation effects on other features: red arrows indicate downweighting, blue arrows indicate upweighting, and effect strength is indicated by color transparency.
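A node label splits cleanly into (layer, feature) indices, so cross-referencing can be automated. Here's a tiny helper assuming Neuronpedia's URL scheme for Joseph Bloom's GPT-2 small residual-stream SAEs (the scheme is my assumption, not something the repo emits):

```python
# Hypothetical helper: cross-reference a cognition-graph node on Neuronpedia.
# The "res-jb" URL scheme is assumed, not emitted by the repo.
def neuronpedia_url(node_label: str) -> str:
    layer, feature = node_label.split(".")
    return f"https://www.neuronpedia.org/gpt2-small/{layer}-res-jb/{feature}"

print(neuronpedia_url("4.112"))
# -> https://www.neuronpedia.org/gpt2-small/4-res-jb/112
```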
- I've gimped a lot of repository functionality for now: only GPT-2 small and a projection factor of 32 are supported, to take advantage of a set of preexisting sparse autoencoders.
- When there is an "exactly 0.0 effect from ablations" error, check whether your layers slice is compatible with your pinned dim.
- If you're encountering cryptic env-variable bugs, ensure you're running CUDA Toolkit 12.2 or newer.
- As the shell syntax suggests, Unix-like paths (on macOS or Linux) are currently required, and Windows pathing will probably not play nice with the repo.
Current version is 0.5.0.
The `sae_training` sub-directory is Joseph Bloom's, a dependency for importing his pretrained sparse autoencoders from the HF Hub.
The Neuronpedia wiki is Johnny Lin's.