jbloomlab / polyclonal
Model mutational escape from polyclonal antibodies.
License: Other
The numbering of epitopes is not identifiable, so if we run inference multiple times we can get models with different orderings of epitopes. This issue is about harmonizing epitope orderings between inference runs.
Define a "mapping matrix" to be a matrix that decides which epitopes in the first polyclonal object map to which epitopes in the second object.
This is already done in Tim's notebook
We will want to have tests for this in various scenarios, including the ones that @timcyu mentions:
You can then match them up, although you might have to think about what to do in the case where two or more of the "bootstrapped" epitopes are both most highly correlated with the same "fit" epitope. In that case, maybe proceed with the one-to-one mapping that yields the highest average correlation. Let me know if that makes sense.
This can have some "round trip" tests where we apply the same 2x2 matrix twice and make sure we get out the same polyclonal object.
Putting this all together, we have a method `.harmonize_epitopes_with(other)`, which rearranges the epitopes in `self` to match the provided `other` object. It could also return some diagnostic correlation information that corresponds to the sorts of plots that @timcyu made in his notebook.
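A minimal sketch of the matching step @timcyu describes, assuming site-wise escape values are available as sites-by-epitopes arrays (`harmonization_map` and the argument names are hypothetical, not the actual API): compute all pairwise correlations, then exhaustively pick the one-to-one assignment with the highest average correlation, which also resolves the tie case where two epitopes are most correlated with the same one.

```python
import itertools
import numpy as np

def harmonization_map(self_escape, other_escape):
    """Map each epitope of `self_escape` to an epitope of `other_escape`.

    Both arguments are (n_sites, n_epitopes) arrays. Returns a dict from
    self-epitope index to other-epitope index, choosing the one-to-one
    assignment with the highest average Pearson correlation.
    """
    n = self_escape.shape[1]
    # cross-correlations between epitope columns of the two objects
    corr = np.corrcoef(self_escape.T, other_escape.T)[:n, n:]
    # exhaustive search is fine for the small number of epitopes
    best = max(
        itertools.permutations(range(n)),
        key=lambda p: np.mean([corr[i, p[i]] for i in range(n)]),
    )
    return dict(enumerate(best))
```

The "round trip" tests in this issue would then check that applying the resulting mapping matrix twice returns the original polyclonal object.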
I think in general it might make more sense to bootstrap on variants when all concentrations have the same variants. Right now it bootstraps different variants for each concentration, rather than the same variants for each concentration. I will think about this and perhaps push a fix.
Right now the model averaging just takes the mean across models, but the median may be more robust to outliers, so it should be added as an option in `PolyclonalCollection` and made the default for `PolyclonalAverage`.
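The robustness argument in one line (plain numpy, not the package API):

```python
import numpy as np

# escape estimates for one mutation across models, with one outlier model
estimates = np.array([1.0, 1.1, 0.9, 1.05, 8.0])
mean_est = estimates.mean()        # dragged up toward the outlier
median_est = np.median(estimates)  # essentially unaffected by it
```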
Box some mutations with a standard deviation over some given threshold in red.
Right now the installation docs have to explain how to install from GitHub (as we are using a development version of `altair`), so before releasing 1.0 the docs need to be updated to point to this tag.
The default value of `collapse_identical_variants` is "mean", which makes some sense when fitting a single model: it pre-averages all variants with the same mutations and then weights by the number of observations, which reduces the data set size a bit and makes fitting faster (by roughly 30% on a plausible simulated data set).
However, this pre-averaging no longer makes sense in conjunction with bootstrapping, as we want to bootstrap the true experimental data, not data averaged across variants. For instance, if we have 1000 measurements of wildtype, right now the bootstrap would either keep all 1000 or drop all 1000, whereas if we set `collapse_identical_variants=False` then it will bootstrap these measurements individually.
I am switching the default here. Note that this is a backward-compatibility-breaking change in terms of consistency of results, although I have verified that the changes are quite small. I am also re-running the notebooks with the new setup.
First copying @matsen's Slack post here:
Here's a little thread about Fisher information for those interested.
Here's a nice review article: https://arxiv.org/pdf/1705.01064.pdf
The idea is that we have finite data, so we can think of repeated draws from the data generating process followed by parameter estimation as being a random process.
There is a version of the central limit theorem that tells us that (a rescaled version of) this distribution is close to normal, with the mean at the true value. We can calculate the covariance matrix of that multivariate normal as the inverse of the Fisher information matrix.
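In symbols (standard asymptotic statistics, restating the review's result rather than anything polyclonal-specific): if \(\hat{\theta}_n\) is the maximum-likelihood estimate from \(n\) data points and \(\theta_0\) the true parameter,

```latex
% asymptotic normality of the MLE
\sqrt{n}\,\bigl(\hat{\theta}_n - \theta_0\bigr)
  \;\xrightarrow{d}\;
  \mathcal{N}\!\bigl(0,\; I(\theta_0)^{-1}\bigr),
\qquad
I(\theta) \;=\; -\,\mathbb{E}\!\left[
  \frac{\partial^2 \log L(\theta)}{\partial \theta\, \partial \theta^{\top}}
\right]
```

so in practice the sampling covariance of the estimates is approximated by the inverse of the observed Fisher information.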
The situation we have with `polyclonal` is a little more complex: we have lots of parameters and lots of data, but most data points are not informative about most parameters.
So... I think we should start by just bootstrapping the data as Jesse suggested to ground our understanding here, and seeing if we can relate the empirical uncertainty to the number of data points containing a given mutation.
This is a point that @timcyu suggested to @Bernadetadad. I think the ideal value of `reg_escape_weight` depends a lot on the data.
I found that increasing it from 0.01 to 0.02 seems to generally at least not hurt and possibly give better results. But increasing it beyond 0.02 hurts the ability to infer subdominant epitopes in polyclonal sera, at least for the simulated data. For instance, at 0.05 there is no escape for class 1 (weakest) epitope.
For monoclonal antibodies, however, having a higher weight like 0.1 does seem to help as there isn't the subdominant epitope issue.
Overall, I suspect the optimal value here is going to have to be tuned based on the data set and number of epitopes.
This issue will involve changes to address some of the points that @Bernadetadad has noted, where there are problems particularly in site-level metrics and in some mutations having poorly fit values. This will improve visualization of the results.
The changes will be as follows:
1. Update to the development `altair` (still unreleased but on GitHub), and make times seen the relevant statistic rather than fraction of bootstrap replicates.
2. For the plots of results for `PolyclonalAverage`, we want to be able to filter mutations by how many models have estimates for that mutation. This should be implemented in the `PolyclonalCollection` base class.
@timcyu, as you noted, fixing the spread regularization might slightly change numerical results. Also, some of the `altair` plots still look strange (like here). I'm not sure if that is because they just need to be re-run and committed with fixes you've already made, or whether something still needs to be fixed. But related to this, they just released `altair` 4.2. So if you are doing this, you could also update the requirement to `>=4.2` when you re-run things, just so we are using the latest version. That is also what I now have installed in the BloomLab conda environment.
Just do the fitting in this notebook using `torchdms`. Use the same fitting data (noise, avg2muts) and then see how well it fits actual mutation escapes and predicts variant escapes on the exact avg3muts model.
The goal here is just to determine if one package is working notably better than the other. If `polyclonal` is working better, once we have things squared away we can get back to the Matsen lab about fixing whatever issues remain with `torchdms`. If `torchdms` is already working better, we can just move ASAP to getting all the real data and visualization integrated with that.
The regularization of the site spread is as follows:
I had originally differentiated this as follows and implemented in the code:
Previously this was passing tests. That is because the tests in `test_gradients.ipynb` were previously only being run on non-fit models. @matsen tested on models that had been fit as well, which made the relevant parameter values different (not the same beta for all mutations at each site), and then the test for the site-spread regularization failed!
I re-implemented the tests to do what @matsen was doing, and indeed confirmed his result: the test was failing. Why?
However, everything works fine if the `(M - 1) / M` factor is dropped from the formula for the gradient above. I cannot for the life of me figure out why this is the case!
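A possible explanation, assuming the spread penalty at a site with \(M\) mutations is the variance of the betas (an assumption about the omitted formula above): differentiating only the \(m\)-th squared term, while forgetting that \(\bar\beta\) depends on every \(\beta_m\), produces exactly a spurious \((M-1)/M\) factor, and the cross terms cancel it.

```latex
V = \frac{1}{M}\sum_{i=1}^{M}\bigl(\beta_i - \bar\beta\bigr)^2,
\qquad
\frac{\partial V}{\partial \beta_m}
= \frac{2}{M}\left[
    (\beta_m - \bar\beta)\Bigl(1 - \tfrac{1}{M}\Bigr)
    - \frac{1}{M}\sum_{i \neq m} (\beta_i - \bar\beta)
  \right]
= \frac{2}{M}\,(\beta_m - \bar\beta)
```

The last equality holds because \(\sum_i (\beta_i - \bar\beta) = 0\), so \(\sum_{i\neq m}(\beta_i - \bar\beta) = -(\beta_m - \bar\beta)\). Keeping only the first bracketed term gives \(\tfrac{2}{M}\,\tfrac{M-1}{M}(\beta_m - \bar\beta)\), i.e. the version with the extra \((M-1)/M\), which would explain why dropping it makes the gradient test pass.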
I'm continuing to work on this even after previous pull request, so opening as new issue.
There are simulated data for a number of concentrations. Right now the example just fits three (0.25, 1, 4). Create some sort of analysis that compares how the model does when fit to different sets of concentrations.
The goal is to figure out what we actually want in experiments.
The reference numbering scheme for influenza H3 includes negative numbers. This has been handled for general analysis (see #103). But the function `mut_escape_pdb_b_factor` still raises the following error for H3 data:
ValueError Traceback (most recent call last)
Cell In[40], line 1
----> 1 model.mut_escape_pdb_b_factor(
2 input_pdbfile="data/PDBs/4o5n.pdb",
3 chains=["A", "B"],
4 metric="mean"
5 )
File /fh/fast/bloom_j/software/miniconda3/envs/dms-vep-pipeline/lib/python3.10/site-packages/polyclonal/polyclonal.py:2117, in Polyclonal.mut_escape_pdb_b_factor(self, input_pdbfile, chains, metric, outdir, outfile, missing_metric, min_times_seen)
2115 os.makedirs(os.path.dirname(output_pdbfile), exist_ok=True)
2116 result_files.append((epitope, output_pdbfile))
-> 2117 polyclonal.pdb_utils.reassign_b_factor(
2118 input_pdbfile,
2119 output_pdbfile,
2120 df.query("epitope == @epitope"),
2121 metric_col,
2122 missing_metric=missing_metric,
2123 )
2124 return pd.DataFrame(result_files, columns=["epitope", "PDB file"])
File /fh/fast/bloom_j/software/miniconda3/envs/dms-vep-pipeline/lib/python3.10/site-packages/polyclonal/pdb_utils.py:278, in reassign_b_factor(input_pdbfile, output_pdbfile, df, metric_col, site_col, chain_col, missing_metric, model_index)
275 raise ValueError("non-unique metric for a site in a chain")
277 if df[site_col].dtype != int:
--> 278 raise ValueError("function currently requires `site_col` to be int")
280 # read PDB, catch warnings about discontinuous chains
281 with warnings.catch_warnings():
ValueError: function currently requires `site_col` to be int
It seems like removing this error from pdb_utils.reassign_b_factor should fix the issue. Let me know if more info would be helpful! @jbloom
I am trying to initialize polyclonal models using the mutation effects of a different polyclonal model that uses reference sites rather than sequential sites. To use reference sites, you must specify a list of the reference sites when initializing. However, specifying a list of sites when initializing causes both the "prune_to_data" and "fill_to_data" options for "data_mut_escape_overlap" to fail, because those options only work if `self.sites` is None (a check for this is explicitly coded in). Is there a reason these options require that `self.sites` is None?
@timcyu, sorry I didn't catch this when doing the review of #84, but for the `altair` bar plots I think it would be better to fix the x-axis scale to go from 0 to 1. I believe this is done by `domain` or `range` (can't remember) in `Scale` within the `X` encoding. The reason is that right now the interactive plots sometimes make a difference look big, but that's because the x-axis scale shrinks to the data and isn't always covering the relevant range from 0 all the way to 1.
We'd like to be able to handle the case where there is a site numbering scheme that is not sequential integers, as is often the case where variants of a virus are numbered against a reference. We'd like to somehow be able to report the reference numbers in the key results / plots.
The warning is:
/home/jbloom/polyclonal/polyclonal/plot.py:716: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_percent_max[e] = df_percent_max[e].clip(lower=0)
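The usual fix (a sketch on toy data, not the actual `plot.py` code): take an explicit `.copy()` of the slice before mutating it, and assign through `.loc`:

```python
import pandas as pd

df = pd.DataFrame({"site": [1, 2, 3], "e1": [-0.5, 0.2, 1.3]})

# an explicit copy means we are never writing into a view of `df`,
# which is what triggers SettingWithCopyWarning
df_percent_max = df[["e1"]].copy()
df_percent_max.loc[:, "e1"] = df_percent_max["e1"].clip(lower=0)
```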
Currently, heatmaps of `mut_escape_df` are hard-coded to plot the `escape` column of these data frames. A lot of the implementation choices I made in #43 were primarily to re-use as much of the visualization code in `plot.py` as possible. The summary stats I generate from a `polyclonal_collection` object are held in a dictionary of `stat : data frame` pairs, where each data frame has the same format as `mut_escape_df` but with the summary statistic for each mutation replacing the `escape` column. I would like to be able to call `plot.mut_escape_heatmap()` with a specification of which value is being plotted from each of the data frames in the summary-stat dictionaries.
To extend these methods (i.e., `activity_wt_barplot()`, `mut_escape_heatmap()`, and maybe `mut_escape_lineplot()`), there are a few things we could do:
1. Rename the summary-statistic column in each data frame to `escape` (this could get messy).
2. Add an argument to the `plot.py` methods that allows the value being plotted to be provided as a column name. For example, I'd like to be able to plot heatmaps (or any plot mentioned above) of the bootstrapped summary stats like this: `plot.mut_escape_heatmap(mut_escape_df=summary_stat_dict['mean'], stat='mean')`.

If everybody is okay with the approach in 2, I'll get started on this after #43 has been merged in.
@timcyu, two of your notebooks are huge: `partition_function.ipynb` is 44 MB and `specify_epitopes.ipynb` is 24 MB.
At least for `specify_epitopes.ipynb`, this is because you are using `altair` to plot tons of points; use a static plotting program like `plotnine` for that. I haven't looked in detail at why `partition_function.ipynb` is so big, but I'm guessing it's because you are plotting a huge data frame. Recall that `altair` embeds the data frame in the plot, so either figure out how to plot with a smaller data frame, or use a program like `plotnine` that makes non-interactive plots that don't store the whole data frame.
There should be ways to solve this by intelligently designing the data frames with fewer points or if you don't need interactivity using another program: https://altair-viz.github.io/user_guide/faq.html#why-does-altair-lead-to-such-extremely-large-notebooks
Add an example of fitting real monoclonal antibody data as the data seem to be somewhat different than even for the "noisy" simulated data.
Any autograd system, JAX included, is essentially based on functional programming rather than object-oriented programming: JAX needs to see exactly all of the inputs in order to set up the chain of computations.
Thus to get to #31 , we need to re-express the core computations for polyclonal to be free functions.
Successful completion of this issue will be to:
1. Move the core computations into a `functional` submodule (happily taking requests for better names).
2. Refactor the `Polyclonal` methods to give the needed behavior.

This issue is just bringing up a point @matsen @WSDeWitt @zorian15 @timcyu and I discussed yesterday: adding the option for a regularization term that enforces the idea that sites in an epitope should somehow be in 3D proximity.
We imagine two ways the structural information could be passed to the model:
In either case, we have to keep in mind that some sites often lack structural information for a protein as they are missing from crystal or cryo-EM structure.
There are two ideas:
Maybe [2] is better as it would be order L rather than L^2 if there are L sites.
In terms of how this is done, maybe we weight each site in an epitope by the average of the squares of its betas to identify sites that are big players in the epitope. Then we use this to compute the centroid of the epitope and the spread around it. One point @jbloom thinks might be important is that this spread have some L1-like property in the sense that it doesn't become catastrophic for regularization if there are two spatially distant sites for some unexpected reason, although we do think this is unlikely.
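A sketch of that weighting scheme (all names hypothetical, not package API): weight each site by the average of its squared betas, compute the weighted centroid, and penalize the weighted mean distance to it. Using the mean distance rather than the mean squared distance is the L1-like choice, so one unexpectedly distant site is not catastrophic.

```python
import numpy as np

def epitope_spread(coords, betas):
    """Weighted centroid and spatial spread of one epitope.

    coords: (n_sites, 3) alpha-carbon coordinates
    betas: (n_sites, n_muts) escape values at each site
    """
    weights = (betas ** 2).mean(axis=1)   # avg squared beta per site
    weights = weights / weights.sum()
    centroid = weights @ coords           # weighted centroid of the epitope
    dists = np.linalg.norm(coords - centroid, axis=1)
    # weighted mean distance (not squared), so a single distant site
    # contributes linearly rather than catastrophically
    return centroid, weights @ dists
```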
@zorian15, I think this should be the next thing worked on. Basically, we now have properties that get the activities, mutation effects, and site-summarized mutation effects from a `PolyclonalCollection`.
But we need to get other stuff, like predictions.
I think we should again make each method return a single data frame, and keep it parallel to `Polyclonal`.
This would mean:
1. Removing the current `make_predictions` and `summarize_bootstraped_predictions` methods. These don't meet the above criteria.
2. Adding `icXX` and `icXX_replicates` methods to `PolyclonalCollection`. These should exactly parallel `Polyclonal.icxx`, with the following differences: (a) `icXX_replicates` should also have a `bootstrap_replicate` column similar to `PolyclonalCollection.mut_escape_df_replicates`, and (b) `icXX` should return `mean`, `median`, `std`, `n_bootstrap_replicates`, and `frac_bootstrap_replicates` like `PolyclonalCollection.mut_escape_df`.
3. Eventually adding `prob_escape` and `prob_escape_replicates` methods to `PolyclonalCollection`.

I'd suggest starting with (1) and (2) above. Does this sound good as a next step?
Randomly subsample down (always seed random number generator) number of variants and see how fitting scales with it.
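For the subsampling itself, a seeded `pandas` sample is enough (sketch; `variants` is a stand-in for the real variant table):

```python
import pandas as pd

variants = pd.DataFrame({"barcode": range(30000)})

# always pass random_state so every run fits the same subsamples
subsamples = {n: variants.sample(n=n, random_state=1) for n in [5000, 10000, 20000]}
```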
@zorian15, I'm going to run the bootstrapping on a realistic data set now that it's fully implemented and see if I have any further suggested changes. I will use this to add a notebook to the docs as well. Does that sound good as next step?
This was one thing I forgot to implement in the "eHarmony" PR (#39): heatmaps displaying the output of `make_max_correlation_matrix()`, or any future function that produces a correlation matrix (in a `pandas.DataFrame`) between the parameters of two `polyclonal` objects.
I'm planning to implement `altair` heatmaps similar to the ones @timcyu used in the "Specifying epitopes" example in the docs.
I'm just raising the issue here now, but will finish getting the bootstrapping functionality (#15) ready for PR review before I tackle this.
I'm proposing to introduce a new regularization term that penalizes the similarity between epitope escape profiles. The reason this may be useful is that @caelanradford and others are getting identical-looking epitope escape profiles with sera. These profiles sometimes show multiple peaks that look like they should separate into their own epitopes.
Here's an idea that I think can be applied when fitting the site-wise model. It penalizes the dot product of site-wise escape values for each pair of epitopes, so it should encourage any escape mutation to have zero effect on escape at all but one of the epitopes. I remember this is similar to something @matsen suggested a long time ago for `torchdms` too.
I think we should also consider trying to reproduce this phenomenon with the simulated dataset. My guess is we might get it if we increase the noise. Or we could just try to work directly with @caelanradford's data or Frances' cocktail selection data if it's available.
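A sketch of the proposed term (function and argument names are made up; the penalty form is the pairwise dot product described above):

```python
import numpy as np

def epitope_similarity_penalty(site_escape, weight=1.0):
    """Penalize the dot product of site-wise escape between epitope pairs.

    site_escape: (n_sites, n_epitopes) array of site-level escape values.
    As proposed here this penalizes the raw dot product; squaring or taking
    absolute values would avoid rewarding negatively correlated profiles.
    """
    gram = site_escape.T @ site_escape        # epitope-by-epitope dot products
    off_diag = gram - np.diag(np.diag(gram))  # zero out self-similarity
    # each unordered pair appears twice in the symmetric matrix
    return weight * off_diag.sum() / 2
```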
The intuitive idea is that most mutations at a site should have similar-ish effects, and at many sites there should be no effect for a given epitope.
Right now, the regularization is set up as follows: (1) a penalty on the `beta` (mutation escape) values, and (2) a penalty on the spread of the `beta` values at a given site. See here.
@matsen @WSDeWitt were talking about group lasso. Is that a better way to do things than above?
Jesse will add code to compute IC50.
The model basically cares about the weight assigned to the microstate in the partition function in which no antibodies are bound. Right now it assumes that every microstate (all combinations of bound epitopes) is possible (see here: https://jbloomlab.github.io/polyclonal/biophysical_model.html). In reality, we expect that some combinations of epitopes can't be bound at the same time and should be excluded from the partition function. My intuition is that neglecting this should not have a big effect: where two epitopes can both be bound, the states with each bound individually exist anyway, and the doubly-bound state doesn't make much additional contribution relative to the totally unbound microstate. But I do think it would be interesting for someone to actually write out the real math for this and figure out how accurate the assumption of no competition is. This paper by Tal has some partition-function-like calculations related to what we'd need: https://pubmed.ncbi.nlm.nih.gov/32365091/
@WSDeWitt said:
Related ideas here: http://www.sfu.ca/phys/347/Lecture%2022.pdf Coupling term J appears between the two bound states, breaking the factorization of the grand partition function. Can we get a minimal model of interference by introducing learnable constants J_ij for each epitope pair {i, j}?
Then I said: I think we should consider this. But we should first somehow figure out if the models are meaningfully different in the regime where we are getting most of our information from the experiments (variants where fraction completely unbound >> 0). If they aren’t, it may not be possible to meaningfully fit any such interference parameters.
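For reference, the no-competition factorization being discussed is, as I read the linked biophysical-model page (worth double-checking the exact sign conventions there):

```latex
p_v(c) \;=\; \prod_{e} U_e(v, c),
\qquad
U_e(v, c) \;=\; \frac{1}{1 + c\, e^{-\phi_e(v)}},
\qquad
\phi_e(v) \;=\; -a_{\mathrm{wt},e} + \sum_{m} \beta_{m,e}\, b(v)_m
```

where \(U_e\) is the probability that epitope \(e\) is unbound at concentration \(c\). A coupling term \(J_{ij}\) between bound epitopes, as in the lecture notes above, would break this product over epitopes.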
The idea here is that if we have multiple sera, we might want to constrain them to have similar epitopes even if the activities are different and the exact `beta` values change a bit.
Right now we think this idea should be kept in mind, but we should wait to fit a model on real serum data before starting too much on this.
When optimizing modeling parameters, it's helpful to generate a lot of models using different selection concentrations and regularization parameters, and compare the line plots in a notebook. But the general `model.mut_escape_plot()` function generates both a line plot and a heatmap. The heatmaps make each plot much bulkier and make it difficult to compare models directly.
Incorporating an option to display just the line plot would help streamline these analyses. @jbloom
We have simulated noisy data with an average of 1, 2, 3, and 4 mutations per variant. If we fit a model to each of them, how good are the fits? Is one notably better than the others? You could maybe do this using all 30,000 variants of each, and also using fewer variants (say 10,000 random ones, making sure you've shuffled first so you aren't getting all single mutants if they are sorted by number of mutations).
In `pdb_utils.py`, the following lines in `def reassign_b_factor` raise an error when run on a Windows machine, because Windows handles `int` differently than Unix systems:
if df[site_col].dtype != int:
raise ValueError("function currently requires `site_col` to be int")
This is explained in the following post: https://stackoverflow.com/questions/64901822/why-do-pandas-integer-dtypes-not-behave-the-same-on-unix-and-windows
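A platform-independent version of the check (a sketch of a possible fix, not a committed patch): test against numpy's integer hierarchy instead of Python's `int`:

```python
import numpy as np
import pandas as pd

# on Windows integer columns default to int32, so a strict `dtype != int`
# comparison (int64 on unix) raises there; np.issubdtype accepts any
# integer width on both platforms
df = pd.DataFrame({"site": np.array([1, 2, 3], dtype=np.int32)})

site_col = "site"
if not np.issubdtype(df[site_col].dtype, np.integer):
    raise ValueError("function currently requires `site_col` to be int")
```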
We need to be able to average `Polyclonal` models fit to different experimental replicates or libraries. It appears that this is better than bootstrapping for really estimating error.
This addition will become version 1.0 as it involves some backward-incompatible changes:
The bootstrapping as currently written is done by the `PolyclonalCollection` class. This isn't really correct: `PolyclonalCollection` should be a general-purpose base class for collections of `Polyclonal` objects (which can include bootstrapping, averaging, etc.), and should then be sub-classed by specific classes for particular use cases. So the first change is to make `PolyclonalCollection` general for any collection, with just the general methods, and then make a new class `PolyclonalBootstrap` that does the actual bootstrapping. This will be backward incompatible because what was called `PolyclonalCollection` will become `PolyclonalBootstrap`.
Make a new `PolyclonalAverage` class for averaging that subclasses `PolyclonalBootstrap`.
Make a notebook illustrating the averaging, and clearly suggest that when possible experimental replicates are preferable to bootstrapping.
After trying out some different values for different fitting parameters while using the epitope similarity regularization penalty, I am still having issues separating escape effects into believable epitopes. Epitopes being fit are no longer mirrors of each others' effects, but now do not make much intuitive sense. For example, mutations at the first residue of glycosylation sites are being put in one epitope, while mutations at the third residue in the same glycosylation site (which are essentially the same mutation, knocking out the glycan) are being put in a second epitope. I am also having cases of 5 or so sites in a row in linear sequence causing escape, but each site being seemingly randomly put in a different epitope depending on parameters chosen, when we would really expect most of these are probably the same epitope.
This is likely an artifact of how the mutagenesis was done; the mutagenic primers used will overwrite adjacent mutations in subsequent rounds of PCR, so few variants have mutations close in linear sequence. This should make it difficult for polyclonal to determine these mutations are in separate epitopes.
This could be improved by penalizing residues for being in the same epitope unless they are in close proximity, either in linear sequence or in tertiary structure. Either method would have drawbacks. Proximity in linear sequence obviously differs greatly from proximity in tertiary structure, while structural proximity fails where portions of the protein are not visualized in structures. This could be especially problematic for HIV Env, where the variable loops are often not visualized but are often very important for antibody escape. Linear-sequence proximity penalization would probably be easier to implement and would probably somewhat alleviate what I am seeing.
@Bernadetadad, I'm almost done with the milestones for what you need for the pipeline to analyze your DMS data (see here).
However, I wanted to add one more potential issue: do we want to order the amino acids in the rows of the heatmaps differently? Right now they are alphabetical. But in prior deep mutational scanning (e.g., see Fig 1 of Tyler and Allie's latest paper) we ordered them by physicochemical properties.
Should I make that the default for us too? If so, should I just use same ordering from above paper, or any preferences?
@matsen and team, this is a Python question for you.
If you look in @timcyu's notebooks, he often has code like in cell [3] of this notebook. Essentially what he is doing is providing a manually built descriptive string that describes the model in that situation (e.g., how many variants, or what concentrations or whatever), fitting that model, then pickle-ing it and keeping track of what descriptive strings have already been fit so they don't have to be re-fit.
Is there a general way to automate this process into a function for the package? The issue, of course, is that @timcyu's strings don't actually fully describe the model: e.g., they don't describe the input data or other parameters associated with the model that aren't being varied; he has just manually included what happens to be important for this case.
To generically automate this, we'd somehow need to key a dict or database with a hashable key that uniquely describes the entire model and associated data set. Then every time the user goes to fit a new model, the function would look up whether that hashable key corresponds to a model that has already been fit.
Is this sort of thing possible?
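Yes, within limits. A sketch (all names hypothetical): hash a canonical serialization of the parameters together with the input data frame, and use the digest as the cache key. `joblib.Memory` does essentially this automatically for function calls, so it may be worth evaluating before rolling our own.

```python
import hashlib
import pickle

def model_key(params, data):
    """Cache key from everything that defines a fit: a dict of model
    parameters and the input data frame. Anything left out of the hash
    would silently alias distinct models -- which is the hard part this
    thread raises."""
    h = hashlib.sha256()
    h.update(pickle.dumps(sorted(params.items())))
    # serialize the data frame stably (CSV text, fixed row order)
    h.update(data.to_csv(index=False).encode())
    return h.hexdigest()
```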
I am opening this issue for myself to investigate as I think more deeply about how to handle interaction effects. It occurred to me that cases of epistasis (particularly negative epistasis) between mutations could make it difficult for `Polyclonal` to learn "true" mutation effects.
As a simple illustration:
- Mutation `A` has a +5 effect on antibody escape at epitope 1.
- Mutation `B` also has a +5 effect on antibody escape at epitope 1.
- `A` and `B` together have an additional -5 effect on antibody escape due to some negative epistatic interaction.

Since `Polyclonal` does not fit any interaction terms, I worry that these types of interactions between mutations at an epitope can obscure the effects of individual mutations, leading to potential misinterpretation of escape mutations (e.g., mutation `A` or `B` being given a much lower individual effect than the ground truth). I am also interested in how these types of interactions, which we should expect in nature to some extent, will affect escape prediction of unseen variants.
For now, I am going to simulate some very basic data with and without interactions and compare how `Polyclonal` does at (1) inferring the true individual mutation effects and (2) predicting escape of unseen variants.
@caelanradford, since you are working with these models and have a lot of expertise in protein structure, we were wondering if you'd be interested in helping with a sub-issue related to #24.
Basically, we want to regularize the epitopes to be clustered in 3D space. To do that, we have to provide as input a CSV that gives the 3D alpha carbon coordinate for each residue. For now we are testing on the RBD, which is PDB 6M0J.
Would you want to write a function that extracts the 3D coordinates of the alpha carbons for a monomer chain of the RBD into a CSV with columns `site,x,y,z`? I think this could be done using `Bio.PDB`. If you wanted to make a pull request with a little function that does this, in addition to just posting the coordinates, you could add it to the `polyclonal.pdb_utils` module. I'd think the function would take as input the PDB file and chain of interest.
Is this something you'd be up for doing? It might also be good to have your input on what we're trying to do in #24 if you want to follow along and come to any future meetings on it.
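A sketch of such a function using `Bio.PDB` (the function name and exact interface are just suggestions, not the actual `polyclonal.pdb_utils` API); it skips residues without an alpha carbon, e.g. waters or incompletely resolved residues:

```python
import csv
from Bio import PDB  # biopython

def ca_coords_to_csv(pdbfile, chain_id, outfile):
    """Write a site,x,y,z CSV of alpha-carbon coordinates for one chain."""
    structure = PDB.PDBParser(QUIET=True).get_structure("s", pdbfile)
    chain = structure[0][chain_id]  # first model, requested chain
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["site", "x", "y", "z"])
        for residue in chain:
            if "CA" in residue:  # skip waters / residues without a CA atom
                x, y, z = residue["CA"].coord
                # residue.id is (hetfield, resseq, icode); resseq is the site
                writer.writerow([residue.id[1], float(x), float(y), float(z)])
```

For 6M0J the RBD monomer is one chain, so a single call with that chain ID should produce the CSV we need.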
The intention of this first step is not to make any actual new functionality, but rather
A major source of "error" appears to be mutations that are found in only a few variants. So I will add functionality to report how many variants each mutation is found in, and enable this to be used for filtering results.
The docs and GitHub should cite the pre-print: https://www.biorxiv.org/content/10.1101/2022.09.17.508366v1