jbloomlab / polyclonal
Model mutational escape from polyclonal antibodies.
License: Other
The numbering of epitopes is not identifiable, so if we run inference multiple times we can get models with different orderings of epitopes. This issue is about harmonizing epitope orderings between inference runs.
Define a "mapping matrix" to be a matrix that decides which epitopes in the first polyclonal object map to which epitopes in the second object.
This is already done in Tim's notebook
We will want to have tests for this in various scenarios, including the ones that @timcyu mentions:
You can then match them up, although you might have to think about what to do in the case where two or more of the "bootstrapped" epitopes are both most highly correlated with the same "fit" epitope. In that case, maybe proceed with the one-to-one mapping that yields the highest average correlation. Let me know if that makes sense.
This can have some "round trip" tests where we apply the same 2x2 matrix twice and make sure we get out the same polyclonal object.
Putting this all together, we have a method `.harmonize_epitopes_with(other)`, which rearranges the epitopes in `self` to match the provided `other` object. It could also return some diagnostic correlation information that corresponds to the sorts of plots that @timcyu made in his notebook.
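A minimal sketch of the matching step @timcyu describes, assuming site-wise escape values are available as sites-by-epitopes arrays (`harmonization_map` and the argument names are hypothetical, not the actual API): compute all pairwise correlations, then exhaustively pick the one-to-one assignment with the highest average correlation, which also resolves the tie case where two epitopes are most correlated with the same one.

```python
import itertools
import numpy as np

def harmonization_map(self_escape, other_escape):
    """Map each epitope of `self_escape` to an epitope of `other_escape`.

    Both arguments are (n_sites, n_epitopes) arrays. Returns a dict from
    self-epitope index to other-epitope index, choosing the one-to-one
    assignment with the highest average Pearson correlation.
    """
    n = self_escape.shape[1]
    # cross-correlations between epitope columns of the two objects
    corr = np.corrcoef(self_escape.T, other_escape.T)[:n, n:]
    # exhaustive search is fine for the small number of epitopes
    best = max(
        itertools.permutations(range(n)),
        key=lambda p: np.mean([corr[i, p[i]] for i in range(n)]),
    )
    return dict(enumerate(best))
```

The "round trip" tests in this issue would then check that applying the resulting mapping matrix twice returns the original polyclonal object.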
I think in general it might make more sense to bootstrap on variants when all concentrations have the same variants. Right now it bootstraps different variants for each concentration, rather than the same variants for each concentration. I will think about this and perhaps push a fix.
Right now the model averaging just takes the mean across models, but the median may be more robust to outliers, so it should be added as an option in `PolyclonalCollection` and made the default for `PolyclonalAverage`.
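The robustness argument in one line (plain numpy, not the package API):

```python
import numpy as np

# escape estimates for one mutation across models, with one outlier model
estimates = np.array([1.0, 1.1, 0.9, 1.05, 8.0])
mean_est = estimates.mean()        # dragged up toward the outlier
median_est = np.median(estimates)  # essentially unaffected by it
```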
Box some mutations with a standard deviation over some given threshold in red.
Right now the installation docs have to explain how to install from GitHub (as we are using a development version of `altair`), so before releasing 1.0 the docs need to be updated to point to this tag.
The default value of `collapse_identical_variants` is "mean", which makes some sense when fitting a single model: it pre-averages all variants with the same mutations and then weights by the number of observations, which reduces the data set size a bit and makes fitting faster (by roughly 30% on a plausible simulated data set).
However, this pre-averaging no longer makes sense in conjunction with bootstrapping, as we want to bootstrap the true experimental data, not data averaged across variants. For instance, if we have 1000 measurements of wildtype, right now the bootstrap would either keep all 1000 or drop all 1000, whereas if we set `collapse_identical_variants=False` then it will bootstrap these measurements individually.
I am switching the default here. Note that this is a backward-compatibility-breaking change in terms of consistency of results, although I have verified that the changes are quite small. I am also re-running the notebooks with the new setup.
First copying @matsen's Slack post here:
Here's a little thread about Fisher information for those interested.
Here's a nice review article: https://arxiv.org/pdf/1705.01064.pdf
The idea is that we have finite data, so we can think of repeated draws from the data generating process followed by parameter estimation as being a random process.
There is a version of the central limit theorem that tells us that (a rescaled version of) this distribution is close to normal, with the mean at the true value. We can calculate the covariance matrix of that multivariate normal as the inverse of the Fisher information matrix.
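In symbols (standard asymptotic statistics, restating the review's result rather than anything polyclonal-specific): if \(\hat{\theta}_n\) is the maximum-likelihood estimate from \(n\) data points and \(\theta_0\) the true parameter,

```latex
% asymptotic normality of the MLE
\sqrt{n}\,\bigl(\hat{\theta}_n - \theta_0\bigr)
  \;\xrightarrow{d}\;
  \mathcal{N}\!\bigl(0,\; I(\theta_0)^{-1}\bigr),
\qquad
I(\theta) \;=\; -\,\mathbb{E}\!\left[
  \frac{\partial^2 \log L(\theta)}{\partial \theta\, \partial \theta^{\top}}
\right]
```

so in practice the sampling covariance of the estimates is approximated by the inverse of the observed Fisher information.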
The situation we have with `polyclonal` is a little more complex: we have lots of parameters and lots of data, but most data points are not informative about most parameters.
So... I think we should start by just bootstrapping the data as Jesse suggested to ground our understanding here, and seeing if we can relate the empirical uncertainty to the number of data points containing a given mutation.
This is a point that @timcyu suggested to @Bernadetadad. I think the ideal value of `reg_escape_weight` depends a lot on the data.
I found that increasing it from 0.01 to 0.02 seems to generally at least not hurt and possibly give better results. But increasing it beyond 0.02 hurts the ability to infer subdominant epitopes in polyclonal sera, at least for the simulated data. For instance, at 0.05 there is no escape for class 1 (weakest) epitope.
For monoclonal antibodies, however, having a higher weight like 0.1 does seem to help as there isn't the subdominant epitope issue.
Overall, I suspect the optimal value here is going to have to be tuned based on the data set and number of epitopes.
This issue will involve changes to address some of the points that @Bernadetadad has noted, where there are problems particularly in site-level metrics and in some mutations having poorly fit values. This will improve visualization of the results.
The changes will be as follows:
1. Update to the development `altair` (still unreleased but on GitHub), and make times seen the relevant statistic rather than fraction of bootstrap replicates.
2. For the plots of results for `PolyclonalAverage`, we want to be able to filter mutations by how many models have estimates for that mutation. This should be implemented in the `PolyclonalCollection` base class.
@timcyu, as you noted, fixing the spread regularization might slightly change numerical results. Also, some of the `altair` plots still look strange (like here). I'm not sure if that is because they just need to be re-run and committed with fixes you've already made, or whether something still needs to be fixed. But related to this, they just released `altair` 4.2. So if you are doing this, you could also update the requirement to `>=4.2` when you re-run things, just so we are using the latest version. That is also what I now have installed in the BloomLab conda environment.
Just do the fitting in this notebook using `torchdms`. Use the same fitting data (noise, avg2muts) and then see how well it fits actual mutation escapes and predicts variant escapes on the exact avg3muts model.
The goal here is just to determine if one package is working notably better than the other. If `polyclonal` is working better, once we have things squared away we can get back to the Matsen lab about fixing whatever issues remain with `torchdms`. If `torchdms` is already working better, we can just move ASAP to getting all the real data and visualization integrated with that.
The regularization of the site spread is as follows:
I had originally differentiated this as follows and implemented in the code:
Previously this was passing tests. That is because the tests in `test_gradients.ipynb` were previously only being run on non-fit models. @matsen tested on models that had been fit as well, which made the relevant parameter values different (not the same beta for all mutations at each site), and then the test for the site-spread regularization failed!
I re-implemented the tests to do what @matsen was doing, and indeed confirmed his result: the test was failing. Why?
However, everything works fine if the `(M - 1) / M` factor is dropped from the formula for the gradient above. I cannot for the life of me figure out why this is the case!
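A possible explanation, assuming the spread penalty at a site with \(M\) mutations is the variance of the betas (an assumption about the omitted formula above): differentiating only the \(m\)-th squared term, while forgetting that \(\bar\beta\) depends on every \(\beta_m\), produces exactly a spurious \((M-1)/M\) factor, and the cross terms cancel it.

```latex
V = \frac{1}{M}\sum_{i=1}^{M}\bigl(\beta_i - \bar\beta\bigr)^2,
\qquad
\frac{\partial V}{\partial \beta_m}
= \frac{2}{M}\left[
    (\beta_m - \bar\beta)\Bigl(1 - \tfrac{1}{M}\Bigr)
    - \frac{1}{M}\sum_{i \neq m} (\beta_i - \bar\beta)
  \right]
= \frac{2}{M}\,(\beta_m - \bar\beta)
```

The last equality holds because \(\sum_i (\beta_i - \bar\beta) = 0\), so \(\sum_{i\neq m}(\beta_i - \bar\beta) = -(\beta_m - \bar\beta)\). Keeping only the first bracketed term gives \(\tfrac{2}{M}\,\tfrac{M-1}{M}(\beta_m - \bar\beta)\), i.e. the version with the extra \((M-1)/M\), which would explain why dropping it makes the gradient test pass.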
I'm continuing to work on this even after previous pull request, so opening as new issue.
There are simulated data for a number of concentrations. Right now the example just fits three (0.25, 1, 4). Create some sort of analysis that compares how the model does when fit to different sets of concentrations.
The goal is to figure out what we actually want in experiments.
The reference numbering scheme for influenza H3 includes negative numbers. This has been handled for general analysis (see #103). But the function `mut_escape_pdb_b_factor` still raises the following error for H3 data:
ValueError Traceback (most recent call last)
Cell In[40], line 1
----> 1 model.mut_escape_pdb_b_factor(
2 input_pdbfile="data/PDBs/4o5n.pdb",
3 chains=["A", "B"],
4 metric="mean"
5 )
File /fh/fast/bloom_j/software/miniconda3/envs/dms-vep-pipeline/lib/python3.10/site-packages/polyclonal/polyclonal.py:2117, in Polyclonal.mut_escape_pdb_b_factor(self, input_pdbfile, chains, metric, outdir, outfile, missing_metric, min_times_seen)
2115 os.makedirs(os.path.dirname(output_pdbfile), exist_ok=True)
2116 result_files.append((epitope, output_pdbfile))
-> 2117 polyclonal.pdb_utils.reassign_b_factor(
2118 input_pdbfile,
2119 output_pdbfile,
2120 df.query("epitope == @epitope"),
2121 metric_col,
2122 missing_metric=missing_metric,
2123 )
2124 return pd.DataFrame(result_files, columns=["epitope", "PDB file"])
File /fh/fast/bloom_j/software/miniconda3/envs/dms-vep-pipeline/lib/python3.10/site-packages/polyclonal/pdb_utils.py:278, in reassign_b_factor(input_pdbfile, output_pdbfile, df, metric_col, site_col, chain_col, missing_metric, model_index)
275 raise ValueError("non-unique metric for a site in a chain")
277 if df[site_col].dtype != int:
--> 278 raise ValueError("function currently requires `site_col` to be int")
280 # read PDB, catch warnings about discontinuous chains
281 with warnings.catch_warnings():
ValueError: function currently requires `site_col` to be int
It seems like removing this error from pdb_utils.reassign_b_factor should fix the issue. Let me know if more info would be helpful! @jbloom
I am trying to initialize polyclonal models using the mutation effects of a different polyclonal model that uses reference sites rather than sequential sites. To use reference sites, you must specify a list of the reference sites when initializing. However, specifying a list of sites when initializing causes both the "prune_to_data" and "fill_to_data" options for "data_mut_escape_overlap" to fail, because those options only work if `self.sites` is None (a check for this is explicitly coded in). Is there a reason these options require that `self.sites` is None?
@timcyu, sorry I didn't catch this when doing the review of #84, but for the `altair` bar plots I think it would be better to fix the x-axis scale to go from 0 to 1. I believe this is done by `domain` or `range` (can't remember) in `Scale` within the `X` encoding. The reason is that right now the interactive plots sometimes make a difference look big, but that's because the x-axis scale shrinks to the data and isn't always covering the relevant range from 0 all the way to 1.
We'd like to be able to handle the case where there is a site numbering scheme that is not sequential integers, as is often the case where variants of a virus are numbered against a reference. We'd like to somehow be able to report the reference numbers in the key results / plots.
The warning is:
/home/jbloom/polyclonal/polyclonal/plot.py:716: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_percent_max[e] = df_percent_max[e].clip(lower=0)
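The usual fix (a sketch on toy data, not the actual `plot.py` code): take an explicit `.copy()` of the slice before mutating it, and assign through `.loc`:

```python
import pandas as pd

df = pd.DataFrame({"site": [1, 2, 3], "e1": [-0.5, 0.2, 1.3]})

# an explicit copy means we are never writing into a view of `df`,
# which is what triggers SettingWithCopyWarning
df_percent_max = df[["e1"]].copy()
df_percent_max.loc[:, "e1"] = df_percent_max["e1"].clip(lower=0)
```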
Currently, heatmaps of `mut_escape_df` are hard-coded to plot the `escape` column of these data frames. A lot of the implementation choices I made in #43 were primarily to re-use as much of the visualization code in `plot.py` as possible. The summary stats I generate from a `polyclonal_collection` object are held in a dictionary of `stat : data frame` pairs, where each data frame has the same format as `mut_escape_df` but with the summary statistic for each mutation replacing the `escape` column. I would like to be able to call `plot.mut_escape_heatmap()` with a specification of which value is being plotted from each of the data frames in the summary-stat dictionaries.
To extend these methods (i.e., `activity_wt_barplot()`, `mut_escape_heatmap()`, and maybe `mut_escape_lineplot()`), there are a few things we could do:
1. Rename the summary-statistic column in each data frame to `escape` (this could get messy).
2. Add an argument to the `plot.py` methods that allows the value being plotted to be provided as a column name. For example, I'd like to be able to plot heatmaps (or any plot mentioned above) of the bootstrapped summary stats like this: `plot.mut_escape_heatmap(mut_escape_df=summary_stat_dict['mean'], stat='mean')`.

If everybody is okay with the approach in 2, I'll get started on this after #43 has been merged in.
@timcyu, two of your notebooks are huge: `partition_function.ipynb` is 44 MB and `specify_epitopes.ipynb` is 24 MB.
At least for `specify_epitopes.ipynb`, this is because you are using `altair` to plot tons of points; use a static plotting program like `plotnine` for that. I haven't looked in detail at why `partition_function.ipynb` is so big, but I'm guessing it's because you are plotting a huge data frame. Recall that `altair` embeds the data frame in the plot, so either figure out how to plot with a smaller data frame, or use a program like `plotnine` that makes non-interactive plots that don't store the whole data frame.
There should be ways to solve this by intelligently designing the data frames with fewer points or if you don't need interactivity using another program: https://altair-viz.github.io/user_guide/faq.html#why-does-altair-lead-to-such-extremely-large-notebooks
Add an example of fitting real monoclonal antibody data as the data seem to be somewhat different than even for the "noisy" simulated data.
Any autograd system, JAX included, is essentially based on functional programming rather than object-oriented programming: JAX needs to see exactly all of the inputs in order to set up the chain of computations.
Thus to get to #31 , we need to re-express the core computations for polyclonal to be free functions.
Successful completion of this issue will be to:
1. Move the core computations into a `functional` submodule (happily taking requests for better names).
2. Refactor the `Polyclonal` methods to give the needed behavior.

This issue is just bringing up a point @matsen @WSDeWitt @zorian15 @timcyu and I discussed yesterday: adding the option for a regularization term that enforces the idea that sites in an epitope should somehow be in 3D proximity.
We imagine two ways the structural information could be passed to the model:
In either case, we have to keep in mind that some sites often lack structural information for a protein as they are missing from crystal or cryo-EM structure.
There are two ideas:
Maybe [2] is better as it would be order L rather than L^2 if there are L sites.
In terms of how this is done, maybe we weight each site in an epitope by the average of the squares of its betas to identify sites that are big players in the epitope. Then we use this to compute the centroid of the epitope and the spread around it. One point @jbloom thinks might be important is that this spread have some L1-like property in the sense that it doesn't become catastrophic for regularization if there are two spatially distant sites for some unexpected reason, although we do think this is unlikely.
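A sketch of that weighting scheme (all names hypothetical, not package API): weight each site by the average of its squared betas, compute the weighted centroid, and penalize the weighted mean distance to it. Using the mean distance rather than the mean squared distance is the L1-like choice, so one unexpectedly distant site is not catastrophic.

```python
import numpy as np

def epitope_spread(coords, betas):
    """Weighted centroid and spatial spread of one epitope.

    coords: (n_sites, 3) alpha-carbon coordinates
    betas: (n_sites, n_muts) escape values at each site
    """
    weights = (betas ** 2).mean(axis=1)   # avg squared beta per site
    weights = weights / weights.sum()
    centroid = weights @ coords           # weighted centroid of the epitope
    dists = np.linalg.norm(coords - centroid, axis=1)
    # weighted mean distance (not squared), so a single distant site
    # contributes linearly rather than catastrophically
    return centroid, weights @ dists
```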
@zorian15, I think this should be the next thing worked on. Basically, we now have properties that get the activities, mutation effects, and site-summarized mutation effects from a `PolyclonalCollection`.
But we need to get other stuff, like predictions.
I think we should again make each method return a single data frame, and keep it parallel to `Polyclonal`.
This would mean:
1. Removing the current `make_predictions` and `summarize_bootstraped_predictions` methods. These don't meet the above criteria.
2. Adding `icXX` and `icXX_replicates` methods to `PolyclonalCollection`. These should exactly parallel `Polyclonal.icxx`, with the following differences: (a) `icXX_replicates` should also have a `bootstrap_replicate` column similar to `PolyclonalCollection.mut_escape_df_replicates`, and (b) `icXX` should return `mean`, `median`, `std`, `n_bootstrap_replicates`, and `frac_bootstrap_replicates` like `PolyclonalCollection.mut_escape_df`.
3. Eventually adding `prob_escape` and `prob_escape_replicates` methods to `PolyclonalCollection`.

I'd suggest starting with (1) and (2) above. Does this sound good as a next step?
Randomly subsample down (always seed random number generator) number of variants and see how fitting scales with it.
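For the subsampling itself, a seeded `pandas` sample is enough (sketch; `variants` is a stand-in for the real variant table):

```python
import pandas as pd

variants = pd.DataFrame({"barcode": range(30000)})

# always pass random_state so every run fits the same subsamples
subsamples = {n: variants.sample(n=n, random_state=1) for n in [5000, 10000, 20000]}
```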
@zorian15, I'm going to run the bootstrapping on a realistic data set now that it's fully implemented and see if I have any further suggested changes. I will use this to add a notebook to the docs as well. Does that sound good as next step?
This was one thing I forgot to implement in the "eHarmony" PR (#39): heatmaps displaying the output of `make_max_correlation_matrix()`, or any future function that produces a correlation matrix (in a `pandas.DataFrame`) between the parameters of two `polyclonal` objects.
I'm planning to implement `altair` heatmaps similar to the ones @timcyu used in the "Specifying epitopes" example in the docs.
I'm just raising the issue here now, but will finish getting the bootstrapping functionality (#15) ready for PR review before I tackle this.
I'm proposing to introduce a new regularization term that penalizes the similarity between epitope escape profiles. The reason this may be useful is that @caelanradford and others are getting identical-looking epitope escape profiles with sera. These profiles sometimes show multiple peaks that look like they should separate into their own epitopes.
Here's an idea that I think can be applied when fitting the site-wise model. It penalizes the dot product of site-wise escape values for each pair of epitopes, so it should encourage any escape mutation to have zero effect on escape at all but one of the epitopes. I remember this is similar to something @matsen suggested a long time ago for `torchdms` too.
I think we should also consider trying to reproduce this phenomenon with the simulated dataset. My guess is we might get it if we increase the noise. Or we could just try to work directly with @caelanradford's data or Frances' cocktail selection data if it's available.
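A sketch of the proposed term (function and argument names are made up; the penalty form is the pairwise dot product described above):

```python
import numpy as np

def epitope_similarity_penalty(site_escape, weight=1.0):
    """Penalize the dot product of site-wise escape between epitope pairs.

    site_escape: (n_sites, n_epitopes) array of site-level escape values.
    As proposed here this penalizes the raw dot product; squaring or taking
    absolute values would avoid rewarding negatively correlated profiles.
    """
    gram = site_escape.T @ site_escape        # epitope-by-epitope dot products
    off_diag = gram - np.diag(np.diag(gram))  # zero out self-similarity
    # each unordered pair appears twice in the symmetric matrix
    return weight * off_diag.sum() / 2
```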
The intuitive idea is that most mutations at a site should have similar-ish effects, and at many sites there should be no effect for a given epitope.
Right now, the regularization is set up as follows: (1) a penalty on the `beta` (mutation escape) values, and (2) a penalty on the spread of the `beta` values at a given site. See here.
@matsen @WSDeWitt were talking about group lasso. Is that a better way to do things than above?
Jesse will add code to compute IC50.
The model basically cares about the weight assigned to the microstate in the partition function in which no antibodies are bound. Right now it assumes that every microstate (all combinations of bound epitopes) is possible (see here: https://jbloomlab.github.io/polyclonal/biophysical_model.html). In reality, we expect that some combinations of epitopes can't be bound at the same time and should be excluded from the partition function. My intuition is that neglecting this should not have a big effect: where two epitopes can both be bound, the states with each bound individually exist anyway, and the doubly-bound state doesn't make much additional contribution relative to the totally unbound microstate. But I do think it would be interesting for someone to actually write out the real math for this and figure out how accurate the assumption of no competition is. This paper by Tal has some partition-function-like calculations related to what we'd need: https://pubmed.ncbi.nlm.nih.gov/32365091/
@WSDeWitt said:
Related ideas here: http://www.sfu.ca/phys/347/Lecture%2022.pdf Coupling term J appears between the two bound states, breaking the factorization of the grand partition function. Can we get a minimal model of interference by introducing learnable constants J_ij for each epitope pair {i, j}?
Then I said: I think we should consider this. But we should first somehow figure out if the models are meaningfully different in the regime where we are getting most of our information from the experiments (variants where fraction completely unbound >> 0). If they aren’t, it may not be possible to meaningfully fit any such interference parameters.
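For reference, the no-competition factorization being discussed is, as I read the linked biophysical-model page (worth double-checking the exact sign conventions there):

```latex
p_v(c) \;=\; \prod_{e} U_e(v, c),
\qquad
U_e(v, c) \;=\; \frac{1}{1 + c\, e^{-\phi_e(v)}},
\qquad
\phi_e(v) \;=\; -a_{\mathrm{wt},e} + \sum_{m} \beta_{m,e}\, b(v)_m
```

where \(U_e\) is the probability that epitope \(e\) is unbound at concentration \(c\). A coupling term \(J_{ij}\) between bound epitopes, as in the lecture notes above, would break this product over epitopes.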
The idea here is that if we have multiple sera, we might want to constrain them to have similar epitopes even if the activities are different and the exact `beta` values change a bit.
Right now we think this idea should be kept in mind, but we should wait to fit a model on real serum data before starting too much on this.
When optimizing modeling parameters, it's helpful to generate a lot of models using different selection concentrations and regularization parameters, and compare the line plots in a notebook. But the general `model.mut_escape_plot()` function generates both a line plot and a heatmap. The heatmaps make each plot much bulkier and make it difficult to compare models directly.
Incorporating an option to display just the line plot would help streamline these analyses. @jbloom
We have simulated noisy data with an average of 1, 2, 3, and 4 mutations per variant. If we fit a model to each of them, how good are the fits? Is one notably better than the others? You could maybe do this using all 30,000 variants of each, and also using fewer variants (say 10,000 random ones, making sure you've shuffled first so you aren't getting all single mutants if they are sorted by number of mutations).
In `pdb_utils.py`, the following lines in `def reassign_b_factor` raise an error when run on a Windows machine, because Windows handles `int` differently than Unix systems:
if df[site_col].dtype != int:
raise ValueError("function currently requires `site_col` to be int")
This is explained in the following post: https://stackoverflow.com/questions/64901822/why-do-pandas-integer-dtypes-not-behave-the-same-on-unix-and-windows
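A platform-independent version of the check (a sketch of a possible fix, not a committed patch): test against numpy's integer hierarchy instead of Python's `int`:

```python
import numpy as np
import pandas as pd

# on Windows integer columns default to int32, so a strict `dtype != int`
# comparison (int64 on unix) raises there; np.issubdtype accepts any
# integer width on both platforms
df = pd.DataFrame({"site": np.array([1, 2, 3], dtype=np.int32)})

site_col = "site"
if not np.issubdtype(df[site_col].dtype, np.integer):
    raise ValueError("function currently requires `site_col` to be int")
```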
We need to be able to average `Polyclonal` models fit to different experimental replicates or libraries. It appears that this is better than bootstrapping for really estimating error.
This addition will become version 1.0 as it involves some backward-incompatible changes:
The bootstrapping as currently written is done by the `PolyclonalCollection` class. This isn't really correct: `PolyclonalCollection` should be a general-purpose base class for collections of `Polyclonal` objects (which can include bootstrapping, averaging, etc.), and should then be sub-classed by specific classes for particular use cases. So the first change is to make `PolyclonalCollection` general for any collection, with just the general methods, and then make a new class `PolyclonalBootstrap` that does the actual bootstrapping. This will be backward incompatible because what was called `PolyclonalCollection` will become `PolyclonalBootstrap`.
Make a new `PolyclonalAverage` class for averaging that subclasses `PolyclonalBootstrap`.
Make a notebook illustrating the averaging, and clearly suggest that when possible experimental replicates are preferable to bootstrapping.
After trying out some different values for different fitting parameters while using the epitope similarity regularization penalty, I am still having issues separating escape effects into believable epitopes. Epitopes being fit are no longer mirrors of each others' effects, but now do not make much intuitive sense. For example, mutations at the first residue of glycosylation sites are being put in one epitope, while mutations at the third residue in the same glycosylation site (which are essentially the same mutation, knocking out the glycan) are being put in a second epitope. I am also having cases of 5 or so sites in a row in linear sequence causing escape, but each site being seemingly randomly put in a different epitope depending on parameters chosen, when we would really expect most of these are probably the same epitope.
This is likely an artifact of how the mutagenesis was done; the mutagenic primers used will overwrite adjacent mutations in subsequent rounds of PCR, so few variants have mutations close in linear sequence. This should make it difficult for polyclonal to determine these mutations are in separate epitopes.
This could be improved by penalizing residues for being in the same epitope unless they are in close proximity, either in linear sequence or in tertiary structure. Either method would have drawbacks. Proximity in linear sequence obviously differs greatly from proximity in tertiary structure, while structural proximity fails where portions of the protein are not visualized in structures. This could be especially problematic for HIV Env, where the variable loops are often not visualized but are often very important for antibody escape. Linear-sequence proximity penalization would probably be easier to implement and would probably somewhat alleviate what I am seeing.
@Bernadetadad, I'm almost done with the milestones for what you need for the pipeline to analyze your DMS data (see here).
However, I wanted to add one more potential issue: do we want to order the amino acids in the rows of the heatmaps differently? Right now they are alphabetical. But in prior deep mutational scanning (e.g., see Fig 1 of Tyler and Allie's latest paper) we ordered them by physicochemical properties.
Should I make that the default for us too? If so, should I just use same ordering from above paper, or any preferences?
@matsen and team, this is a Python question for you.
If you look in @timcyu's notebooks, he often has code like in cell [3] of this notebook. Essentially what he is doing is providing a manually built descriptive string that describes the model in that situation (e.g., how many variants, or what concentrations or whatever), fitting that model, then pickle-ing it and keeping track of what descriptive strings have already been fit so they don't have to be re-fit.
Is there a general way to automate this process into a function for the package? The issue, of course, is that @timcyu's strings don't actually fully describe the model: e.g., they don't describe the input data or other parameters associated with the model that aren't being varied; he has just manually included what happens to be important for this case.
To generically automate this, we'd somehow need to key a dict or database with a hashable key that uniquely describes the entire model and associated data set. Then every time the user goes to fit a new model, the function would look up whether that hashable key corresponds to a model that has already been fit.
Is this sort of thing possible?
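Yes, within limits. A sketch (all names hypothetical): hash a canonical serialization of the parameters together with the input data frame, and use the digest as the cache key. `joblib.Memory` does essentially this automatically for function calls, so it may be worth evaluating before rolling our own.

```python
import hashlib
import pickle

def model_key(params, data):
    """Cache key from everything that defines a fit: a dict of model
    parameters and the input data frame. Anything left out of the hash
    would silently alias distinct models -- which is the hard part this
    thread raises."""
    h = hashlib.sha256()
    h.update(pickle.dumps(sorted(params.items())))
    # serialize the data frame stably (CSV text, fixed row order)
    h.update(data.to_csv(index=False).encode())
    return h.hexdigest()
```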
I am opening this issue for myself to investigate as I think more deeply about how to handle interaction effects. It occurred to me that cases of epistasis (particularly negative epistasis) between mutations could make it difficult for `Polyclonal` to learn "true" mutation effects.
As a simple illustration:
- Mutation `A` has a +5 effect on antibody escape at epitope 1.
- Mutation `B` also has a +5 effect on antibody escape at epitope 1.
- `A` and `B` together have an additional -5 effect on antibody escape due to some negative epistatic interaction.

Since `Polyclonal` does not fit any interaction terms, I worry that these types of interactions between mutations at an epitope can obscure the effects of individual mutations, leading to potential misinterpretation of escape mutations (e.g., mutation `A` or `B` being given a much lower individual effect than the ground truth). I am also interested in how these types of interactions, which we should expect in nature to some extent, will affect escape prediction of unseen variants.
For now, I am going to simulate some very basic data with and without interactions and compare how `Polyclonal` does at (1) inferring the true individual mutation effects and (2) predicting escape of unseen variants.
@caelanradford, since you are working with these models and have a lot of expertise in protein structure, we were wondering if you'd be interested in helping with a sub-issue related to #24.
Basically, we want to regularize the epitopes to be clustered in 3D space. To do that, we have to provide as input a CSV that gives the 3D alpha carbon coordinate for each residue. For now we are testing on the RBD, which is PDB 6M0J.
Would you want to write a function that extracts the 3D coordinates of the alpha carbons for a monomer chain of the RBD into a CSV with columns `site,x,y,z`? I think this could be done using `Bio.PDB`. If you wanted to make a pull request with a little function that does this, in addition to just posting the coordinates, you could add it to the `polyclonal.pdb_utils` module. I'd think the function would take as input the PDB file and chain of interest.
Is this something you'd be up for doing? It might also be good to have your input on what we're trying to do in #24 if you want to follow along and come to any future meetings on it.
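A sketch of such a function using `Bio.PDB` (the function name and exact interface are just suggestions, not the actual `polyclonal.pdb_utils` API); it skips residues without an alpha carbon, e.g. waters or incompletely resolved residues:

```python
import csv
from Bio import PDB  # biopython

def ca_coords_to_csv(pdbfile, chain_id, outfile):
    """Write a site,x,y,z CSV of alpha-carbon coordinates for one chain."""
    structure = PDB.PDBParser(QUIET=True).get_structure("s", pdbfile)
    chain = structure[0][chain_id]  # first model, requested chain
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["site", "x", "y", "z"])
        for residue in chain:
            if "CA" in residue:  # skip waters / residues without a CA atom
                x, y, z = residue["CA"].coord
                # residue.id is (hetfield, resseq, icode); resseq is the site
                writer.writerow([residue.id[1], float(x), float(y), float(z)])
```

For 6M0J the RBD monomer is one chain, so a single call with that chain ID should produce the CSV we need.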
The intention of this first step is not to make any actual new functionality, but rather
A major source of "error" appears to be mutations that are found in only a few variants. So I will add functionality to report how many variants each mutation is found in, and enable this to be used for filtering results.
The docs and GitHub should cite the pre-print: https://www.biorxiv.org/content/10.1101/2022.09.17.508366v1