Circuit discovery in GPT-2 small, using sparse autoencoding
To install manually, just run these commands in the shell:
git clone https://github.com/DavidUdell/sparse_circuit_discovery
cd sparse_circuit_discovery
pip install -e .
Alternatively, I have a Docker image on DockerHub; it's especially convenient for pulling to a remote server.
Your base of operations is `sparse_coding/config/central_config.yaml`.
The most important hyperparameters are clustered up top:
## Key Params
# Throughout, leave out entries for None. Writing in `None` values will get
# you the string "None". Key params here:
ACTS_LAYERS_SLICE: "9:12"
INIT_THINNING_FACTOR: 0.01
NUM_SEQUENCES_INTERPED: 200
SEQ_PER_DIM_CAP: 100
# Only pin single dims per layer.
DIMS_PINNED:
9: [331]
In order:
- `ACTS_LAYERS_SLICE` is a Python slice formatted as a string. It sets which layers of the GPT-2 small model you'll interpret activations at.
- `INIT_THINNING_FACTOR` is the fraction of features at the first layer in your slice that you'll plot. I.e., a fraction of 1 will try to plot every feature in the layer.
- `NUM_SEQUENCES_INTERPED` is the number of token sequences used during plotting, for the purpose of calculating logit effects and downstream feature effects.
- `SEQ_PER_DIM_CAP` is the maximum number of top-activating sequences a feature can have. I.e., when it equals `NUM_SEQUENCES_INTERPED`, you're saying that any feature that fired on every sequence should be interpreted over all of those sequences. For computational reasons, we basically want to set `NUM_SEQUENCES_INTERPED` as high as we can, and then set this value relatively low, so that our interpretability calculations stay tractable.
- `DIMS_PINNED` is a dictionary of layer indices mapped to singleton lists containing lone feature indices. If set for the first layer in your slice, it completely overrides `INIT_THINNING_FACTOR`. (See the sketch after this list.)
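To make the interplay concrete, here's a minimal sketch, assuming PyYAML and hypothetical variable names (this isn't the repository's actual loader), of how these key params could be parsed and how `DIMS_PINNED` overrides `INIT_THINNING_FACTOR`:

```python
# Minimal sketch: parsing the key params. Illustrative only; names and
# logic here are hypothetical, not the repo's actual loader.
import yaml

with open("sparse_coding/config/central_config.yaml") as f:
    config = yaml.safe_load(f)

# "9:12" becomes slice(9, 12): layers 9, 10, and 11 of GPT-2 small.
start, stop = (int(i) for i in config["ACTS_LAYERS_SLICE"].split(":"))
acts_layers = slice(start, stop)

# Entries for None are left out of the YAML, so default to an empty dict.
dims_pinned = config.get("DIMS_PINNED") or {}
if start in dims_pinned:
    # A pin at the first sliced layer fully overrides the thinning factor.
    first_layer_features = dims_pinned[start]
else:
    # GPT-2 small has hidden width 768; at a projection factor of 32, each
    # autoencoder has 768 * 32 = 24576 features. A thinning factor of 0.01
    # keeps ~245 of them.
    total_features = 768 * 32
    num_kept = int(config["INIT_THINNING_FACTOR"] * total_features)
```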
Set these values, save `central_config.yaml`, then run interpretability with:
cd sparse_coding
python3 pipe.py
Data appears in `sparse_coding/data/`.
The last cognition graph you generated is saved as both a `.svg` for you and a `.dot` for the computer. If you run the interpretability pipeline again, the new data will expand upon that old `.dot` file. This way, you can progressively trace out circuits as you go.
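Concretely, you can think of that accumulation as a graph union. Here's a small sketch, assuming networkx with pydot installed and hypothetical file names (the repo handles this internally), of what "expanding the old `.dot` file" amounts to:

```python
# Sketch of the accumulation idea: a new run's graph merged into the old
# .dot file. The repo does this internally; file names here are hypothetical.
import networkx as nx
from networkx.drawing.nx_pydot import read_dot, write_dot

old_graph = read_dot("sparse_coding/data/cognition_graph.dot")
new_graph = read_dot("sparse_coding/data/new_run.dot")

# compose() takes the union of nodes and edges, so features and ablation
# edges traced in earlier runs persist alongside the new ones.
merged = nx.compose(old_graph, new_graph)
write_dot(merged, "sparse_coding/data/cognition_graph.dot")
```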
There's also an independent circuit validation pipeline, `val.py`. This script simultaneously ablates all the features that comprise a circuit, to see how the overall circuit behaves under ablation (rather than just looking at separate features under independent ablations, the way the `pipe.py` cognition graphs do).
To set this up, first set `ACTS_LAYERS_SLICE` to encompass the relevant layers in GPT-2 small, including one extra layer after the last layer you pin features in:
## Key Params
# Throughout, leave out entries for None. Writing in `None` values will get
# you the string "None". Key params here:
ACTS_LAYERS_SLICE: "6:9"
and then pin all the features that comprise a given circuit in `VALIDATION_DIMS_PINNED`:
# Here you can freely pin multiple dims per layer.
VALIDATION_DIMS_PINNED:
6: [8339, 14104, 18854]
7: [2118]
Now run validation with:
python3 val.py
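For intuition on what "simultaneously ablates" means here, the sketch below uses a toy autoencoder: every pinned feature across every pinned layer is zeroed on the same forward pass, rather than one at a time. This is a conceptual illustration, not `val.py`'s actual implementation; all names are hypothetical.

```python
# Toy illustration of simultaneous circuit ablation. Not val.py's real code.
import torch

class ToySAE(torch.nn.Module):
    """Stand-in for one layer's pretrained sparse autoencoder."""
    def __init__(self, d_model: int = 768, proj_factor: int = 32):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_model * proj_factor)
        self.dec = torch.nn.Linear(d_model * proj_factor, d_model)

    def encode(self, acts):
        return torch.relu(self.enc(acts))

    def decode(self, feats):
        return self.dec(feats)

def ablate(acts, sae, pinned_dims):
    """Project into the SAE basis, zero all pinned features at once, project back."""
    feats = sae.encode(acts)
    feats[..., pinned_dims] = 0.0
    return sae.decode(feats)

# The whole circuit from VALIDATION_DIMS_PINNED is knocked out together on
# each forward pass, rather than feature-by-feature as in pipe.py.
pinned = {6: [8339, 14104, 18854], 7: [2118]}
saes = {layer: ToySAE() for layer in pinned}
acts = torch.randn(1, 10, 768)  # (batch, seq, d_model) residual activations
patched = {layer: ablate(acts, saes[layer], dims) for layer, dims in pinned.items()}
```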
Consider the cognition graph at the top of this page. Each box with a label like `4.112` is a feature in a sparse autoencoder: `4` is its layer index, while `112` is its column index in that layer's autoencoder. You can cross-reference more comprehensive interpretability data for any given feature on Neuronpedia.
Blue tokens in the sequences in each box represent top feature activations in their contexts, out to a specified length on either side. Blue and red tokens in the individual boxes at the bottom are the logits most upweighted and downweighted by that feature's ablation. (Gray is the 0.0-effect edge case.) Arrows between boxes represent downstream ablation effects on other features: red arrows indicate downweighting, blue arrows indicate upweighting, and effect strength is indicated by color transparency.
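A node label splits cleanly into (layer, feature) indices, so cross-referencing can be automated. Here's a tiny helper assuming Neuronpedia's URL scheme for Joseph Bloom's GPT-2 small residual-stream SAEs (the scheme is my assumption, not something the repo emits):

```python
# Hypothetical helper: cross-reference a cognition-graph node on Neuronpedia.
# The "res-jb" URL scheme is assumed, not emitted by the repo.
def neuronpedia_url(node_label: str) -> str:
    layer, feature = node_label.split(".")
    return f"https://www.neuronpedia.org/gpt2-small/{layer}-res-jb/{feature}"

print(neuronpedia_url("4.112"))
# -> https://www.neuronpedia.org/gpt2-small/4-res-jb/112
```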
- I've gimped a lot of repository functionality for now: only GPT-2 small and a projection factor of 32 are supported, to take advantage of a set of preexisting sparse autoencoders.
- When there is an "exactly 0.0 effect from ablations" error, check whether your layers slice is compatible with your pinned dim.
- If you're encountering cryptic env-variable bugs, ensure you're running CUDA Toolkit 12.2 or newer.
- As the shell syntax suggests, Unix-like paths (on macOS or Linux) are currently required, and Windows pathing will probably not play nice with the repo.
Current version is 0.5.0.
The `sae_training` sub-directory is Joseph Bloom's, a dependency for importing his pretrained sparse autoencoders from the HF Hub.
The Neuronpedia wiki is Johnny Lin's.