jbloomlab / dms_tools

Package for analyzing and visualizing deep mutational scanning (DMS) data

License: GNU General Public License v3.0

Python 52.40% TeX 43.71% PostScript 1.62% C 2.26%

dms_tools's Issues

Add `median` option to `dms_merge`

It appears that it would be useful to be able to easily get the median as well as the average. It is probably OK if this only works for diffsel, as that is where it seems likely to be useful.
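As a rough illustration of the request, here is a minimal sketch of merging replicate values by either average or median. It assumes each replicate has already been read into a dictionary keyed by (site, mutation); the helper `merge_values` is hypothetical and is not how `dms_merge` is implemented.

```python
import statistics


def merge_values(replicates, method="average"):
    """Combine per-replicate values keyed by (site, mutation).

    Hypothetical helper for illustration only; not part of dms_tools.
    """
    if method not in ("average", "median"):
        raise ValueError("method must be 'average' or 'median'")
    combine = statistics.mean if method == "average" else statistics.median
    keys = set(replicates[0])
    for rep in replicates[1:]:
        if set(rep) != keys:
            raise ValueError("replicates do not share the same keys")
    return {key: combine([rep[key] for rep in replicates]) for key in sorted(keys)}


# Three made-up replicates; the third has an outlier at (1, "A").
reps = [
    {(1, "A"): 0.2, (1, "C"): -1.0},
    {(1, "A"): 0.4, (1, "C"): -0.5},
    {(1, "A"): 3.0, (1, "C"): -0.6},
]
averaged = merge_values(reps, "average")
medians = merge_values(reps, "median")
```

With an outlier in one replicate, the median stays close to the other two replicates while the average is pulled toward the outlier, which is the situation where the option would help.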

Maybe @mbdoud can do this?

ratio estimation

@Haddox I looked at your updates to the preferences estimates from enrichment ratios. Your changes look good.

I have just made a few stylistic tweaks to the code by moving some items out of a for loop that didn't need to be executed each time.

You can see the changes using git diff. First pull the repository. Then type git log. You will see that your most recent commit was 6d113ebf9513c09d08629e15db8dd645af87da94, and my last one was 9de5d5af0cb55f9d6badb466064eeb15f12f36e1. So you can see the changes I made to the code in src/mcmc.py by typing:

git diff 6d113ebf9513c09d08629e15db8dd645af87da94 9de5d5af0cb55f9d6badb466064eeb15f12f36e1 src/mcmc.py

After reviewing these changes, the last thing we need to do is update the documentation for dms_inferprefs to explain how the enrichment ratios are now calculated. This is currently done for the src/mcmc.py file, but not in docs/dms_inferprefs.rst. So update this documentation to reflect the corrected approach, building the docs with:

 cd docs
 make html

and then looking in docs/_build/html/dms_inferprefs.html for the updated documentation.

After you've made these changes, you can push them back and I'll look at them.

Also, look into what you think needs to be done for the differential preferences estimation...

add pseudocount specifically to sample with lower depth in dms_diffselection

dms_diffselection currently adds the value specified by pseudocount to the selected sample and scales the mock sample appropriately. This was in line with experimental design for some of my early experiments where I knew selected samples would have less depth and wanted to ensure that at least pseudocount counts were added to the lowest-depth sample.

Instead it should add pseudocount to the sample with lower depth at the site and scale the other sample appropriately, in case the mock sample actually has lower depth. This guarantees that in any situation, at least pseudocount counts are added to both samples every time.
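A minimal sketch of the proposed behavior, assuming the scaling is proportional to the ratio of depths at the site (the issue does not spell out the exact scaling). The function name and the zero-depth fallback are assumptions of this sketch, not existing dms_diffselection code.

```python
def scaled_pseudocounts(depth_selected, depth_mock, pseudocount):
    """Return (pseudocount_selected, pseudocount_mock) for one site.

    The lower-depth sample receives `pseudocount`; the other sample receives
    that value scaled up by the depth ratio, so both get at least `pseudocount`.
    Hypothetical helper for illustration only.
    """
    if depth_selected <= 0 or depth_mock <= 0:
        # The depth ratio is undefined here; this sketch falls back to the
        # unscaled pseudocount for both samples (an assumption).
        return float(pseudocount), float(pseudocount)
    if depth_selected <= depth_mock:
        return float(pseudocount), pseudocount * depth_mock / depth_selected
    return pseudocount * depth_selected / depth_mock, float(pseudocount)


# Selected sample has lower depth: it gets the bare pseudocount.
low_selected = scaled_pseudocounts(1000, 4000, 10)
# Mock sample has lower depth: now it gets the bare pseudocount instead.
low_mock = scaled_pseudocounts(4000, 1000, 10)
```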

conflicts in master

@mbdoud: There are conflicts in your merge that you did not resolve.

These are due to me making changes, pushing them to GitHub, and then you making changes to the same file in an old version. When you pulled (or pushed) to GitHub you should have gotten a message about conflicts. In general, you need to fix the conflicts before doing anything -- don't leave the conflicted version.

I am going to go through and try to fix these. You can then test the resolved branch.

In the future, the easiest thing is to just avoid issues like this by pulling the latest version before making changes. Otherwise you need to be sure to resolve any merge conflicts.

Bar plots misaligned

Bar plots generated by dms_summarizealignments have a problem where the ticks are not properly aligned with the bars. Instead of being aligned in the center of the bar, they are aligned at the right-hand side of the bar.

Building consensus reads does not measure all pairwise distances

@jbloom: I think I found out why I was getting different alignment results when reads were ordered differently in the input FASTQ files. It relates to making sure that all reads are above minreadidentity relative to one another. Right now, dms_tools only measures distances relative to the FIRST read assigned to that barcode. It does not measure all pairwise distances. So depending on which read is the first read, the distances measured may differ, exceeding the minreadidentity in some cases but not others.

It seems like we should change the code to measure all pairwise distances. I could do this for python. However, currently the default code for the function that does this computation (dms_tools.utils.BuildReadConsensus) is written using C, which I do not have experience in.
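To make the order dependence concrete, here is a small pure-Python sketch comparing the first-read check with an all-pairs check. It assumes identity is the fraction of matching positions between equal-length reads; how dms_tools.utils.BuildReadConsensus actually treats ambiguous bases or quality scores is not covered, and the function names are hypothetical.

```python
from itertools import combinations


def identity(read1, read2):
    """Fraction of positions at which two equal-length reads agree."""
    if len(read1) != len(read2):
        raise ValueError("reads must have the same length")
    if not read1:
        return 1.0
    return sum(a == b for a, b in zip(read1, read2)) / len(read1)


def first_read_pass(reads, minreadidentity):
    """Compare every read only to the first read (order dependent)."""
    return all(identity(reads[0], r) >= minreadidentity for r in reads[1:])


def all_pairs_pass(reads, minreadidentity):
    """Compare every pair of reads (independent of input order)."""
    return all(
        identity(r1, r2) >= minreadidentity for r1, r2 in combinations(reads, 2)
    )


reads = ["AAAAAAAAAA", "AAAAAAAAAT", "TAAAAAAAAA"]
reordered = [reads[1], reads[0], reads[2]]

first_original = first_read_pass(reads, 0.9)      # both others are 0.9 from read 0
first_reordered = first_read_pass(reordered, 0.9)  # new first read is 0.8 from the last
pairs_original = all_pairs_pass(reads, 0.9)
pairs_reordered = all_pairs_pass(reordered, 0.9)
```

The first-read check gives different answers for the two orderings, while the all-pairs check gives the same answer for both, at the cost of a number of comparisons that grows quadratically with the number of reads per barcode.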

What do you think should be the next step?

dms2_bcsubamp for one sample and only 1 barcode (single-end run)?

Hello, I have a quick question regarding using dms2_bcsubamp or dms2_batch_bcsubamp. Is it possible to run this code on sequencing data from a single-end run instead of a paired-end run? Also for only one subamplicon instead of multiple ones. The way I'm trying to set this up is to use the same NGS strategy as explained in:

https://jbloomlab.github.io/dms_tools2/bcsubamp.html

Except I use FASTQs from a single-end run instead of a paired-end run. I can optionally still amplify the target by appending 2 barcodes on each site, making it small enough so that a single-end read captures both barcodes. The only issue is the resulting FASTQs will be R1 instead of both R1 & R2.

Is there an easy way to do this or am I missing something that would make this impossible?

making `dms_merge` handle counts

@orrzor

Looks good in general. A few questions that you might address in a further modification:

Shouldn't the default --chartype be codon rather than DNA? It seems like right now we do most of our analyses with codons and have that as the default for the other programs.
Related point, it is slightly confusing how the choices in the argument parser for the --chartype option specifies an upper-case DNA, but then the default is a lower-case dna. Maybe make this consistent so the user isn't confused about whether to enter upper or lower case. Even if in the end it doesn't matter (looks like you make choices case-insensitive), this might clarify.
I wasn't clear if the adding of counts accepts the --minus option. Either way is probably OK, but perhaps clarify this in the docs?
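One way to make the case handling consistent is sketched below with the standard argparse module: lower-case the value before validation, and list the choices and default in the same case. The set of choices shown is an assumption for illustration; the real dms_merge parser may differ.

```python
import argparse


def make_parser():
    """Build an illustrative parser; not the actual dms_merge parser."""
    parser = argparse.ArgumentParser(description="Illustrative parser only.")
    parser.add_argument(
        "--chartype",
        type=str.lower,  # normalize case before the choices check
        choices=["codon", "dna", "aa"],  # assumed set of character types
        default="codon",
        help="Character type (case-insensitive).",
    )
    return parser


args_upper = make_parser().parse_args(["--chartype", "DNA"])
args_default = make_parser().parse_args([])
```

Because argparse applies `type` before checking `choices`, the user can enter DNA or dna and the program always sees the lower-case form.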

--Jesse

Remove suggestion to use `sudo` with `pip`

http://jbloomlab.github.io/dms_tools/installation.html#quick-installation

Using sudo with pip is a bad idea because it mixes system package management with pip package management. People frequently end up in #python with totally busted Python installs because of this. Please remove the suggestion from the docs.

Equations not rendered on some docs pages

Some of the equations, such as 11-16 and 18 on the inferprefs_algorithm page, are not rendering correctly on the webpage.

I believe these equations are duplicated on the inferdiffprefs_algorithm page, as they also do not render there.

add details on round 2 primers in dms_subassemble documentation

add mincounts option to dms_diffselection

Add an option --mincounts to dms_diffselection so that differential selection is only inferred at a site if there are at least this many counts in either the selected or mock-selected libraries.

The default should probably be zero.

We need to figure out how to specify undetermined values in output files.

We also need to ensure that programs that do subsequent processing of the data (e.g., dms_logoplot) can correctly handle the undetermined values.
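A minimal sketch of how this could fit together, assuming undetermined values are written as the string NaN (the issue leaves the representation open). All names here are hypothetical and nothing below is existing dms_diffselection code.

```python
import math

UNDETERMINED = "NaN"  # assumed placeholder; the representation is still undecided


def passes_mincounts(n_selected, n_mock, mincounts=0):
    """True if either library has at least `mincounts` counts at the site."""
    return n_selected >= mincounts or n_mock >= mincounts


def format_value(value):
    """Format a value for an output file, marking undetermined values."""
    return UNDETERMINED if math.isnan(value) else "%.4f" % value


def parse_value(text):
    """Read a value back; float('NaN') yields a float nan."""
    return float(text)


def total_ignoring_undetermined(values):
    """Example of downstream processing that skips undetermined values."""
    return sum(v for v in values if not math.isnan(v))


written = [format_value(v) for v in (0.5, float("nan"), -1.25)]
read_back = [parse_value(t) for t in written]
total = total_ignoring_undetermined(read_back)
```

With the default of zero, every site passes the filter, so existing behavior would be unchanged unless the option is set.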

dms_barcodedsubamplicons without barcodes?

Hey! Got a quick question. If we did not barcode our variants, but rather have a sample of pooled variants for each round/condition which were randomly fragmented and sequenced (i.e. each pool is barcoded, but not each variant in each pool), can we still use dms_barcodedsubamplicons? Do we just set the barcode to a null value? Thanks!

check counts files to reflect barcode counts, not read counts

MCMC FAILED to converge after all attempts

I get this error while running dms_tools to analyze DMS data from Olson et al.
There are no errors for any other datasets that I have analysed with dms_tools (version: 1.1.15).
I am not sure why this is happening for this particular data.
Please have a look at the logfile to see if you can find anything.

Logfile: https://github.com/rraadd88/tests/blob/master/Selection_WRT_Input.log
Command used: python dms_inferprefs Input Selection --ncpus -10 --chartype aa
Input counts: https://github.com/rraadd88/tests/blob/master/Input
Selection counts: https://github.com/rraadd88/tests/blob/master/Selection

adding dms_matchsubassembledbarcodes

@djpl
I have added your script as the dms_matchsubassembledbarcodes.

We're now going to start re-structuring this into a proper program.

The program itself is now in dms_tools/scripts/dms_matchsubassembledbarcodes.

To get the newest version, go to the dms_tools that you have pulled from GitHub. Type git status. You should see something that says that you are on the branch subassembly. If this is not the case, do git checkout subassembly and then confirm you're on the branch subassembly. If that still doesn't work, let me know.

Then type git pull origin subassembly to pull the latest version from GitHub. Type git log and you should see that the last commit was made by me recently with the message Initial modifications to dms_matchsubassembledbarcodes to make it read arguments from argparse parser.

I have already made some changes to your program.

I moved all commands in global scope into the main function. You shouldn't have any global commands; this is bad programming practice.
I have added an argument parsing function in src/parsearguments called MatchSubassembledBarcodesParser. This now parses the arguments for the scripts in the proper way. If you install the updated package with python setup.py install --user you can now type dms_matchsubassembledbarcodes -h and you should get the help message. You can run the script by providing arguments at the command line.

You should make the following changes:

Look at how the script now works by parsing the arguments from the command line. Make sure the input / output specifications are correct.
Right now, I haven't done anything with the reference_seq variable. Do we really need that as a separate input, or can you extract it from the subassembled variants file? If you can extract it, get it that way to avoid unnecessary arguments.
Add a variable description for the --minquality option in the argument parser.
Clean up other aspects of the program. For instance, the barcode length might vary in future experiments. So rather than being hardcoded as 18, this should be an argument that is either determined from the subassembled variants file (if possible) or specified as an option. I think it can be determined by looking at the length of the barcodes in the subassembled variants?
Think if there are other input-output things that should be specified.
Look at the other scripts in ./scripts/ and make the basic input-output in the main function work similarly.
For the quality scores, you currently hard code the conversion table. This can be done with a proper command using chr and ord, as in minqchar = chr(minq + 33).
Add documentation strings for all functions in the proper format: https://www.python.org/dev/peps/pep-0257/
Change the from <PACKAGE> import <MODULE> to just import <PACKAGE> and then when you call it, just call it with PACKAGE.MODULE.FUNCTION. This leads to clearer code.
Convert print statements to the Python3 compatible format: http://www.python-course.eu/python3_print.php
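The chr / ord suggestion for quality scores can be sketched as follows, assuming the FASTQ files use the Phred+33 encoding implied by minqchar = chr(minq + 33). The helper names are hypothetical.

```python
def min_quality_char(minq, offset=33):
    """Character encoding the Phred score `minq` (Phred+33 assumed)."""
    return chr(minq + offset)


def quality_scores(qstring, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(c) - offset for c in qstring]


def all_sites_pass(qstring, minq):
    """True if every position has quality >= minq.

    Comparing characters directly works because the encoding is monotonic.
    """
    minqchar = min_quality_char(minq)
    return all(c >= minqchar for c in qstring)


minqchar = min_quality_char(15)
decoded = quality_scores("I!")
```

This replaces a hard-coded conversion table with two one-line functions and keeps the offset in a single place.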

Once you've finished all of this, add your changes with git add .. If you then type git status, you should see a list of the files you've changed. Commit these with git commit -m 'MESSAGE' where MESSAGE is a brief one-sentence description of what you've done. You can then push them back to GitHub with git push origin subassembly. At that point, I'll get an automatic message and I can look at your changes and we can do another round.

If you have questions / clarifications, do them by replying to this message on GitHub rather than via e-mail so we can keep all the conversation here in one place.

Also, I'm not sure what program you're using to edit the Python files? If possible, see if you can set it up to make your tabs as four spaces rather than a tab character.

division by 0 in ReadDMSCounts

In ReadDMSCounts, there is now functionality to calculate character frequencies. For the line, counts[site]['F_%s' % c] = float(counts[site][c])/counts[site]['COUNTS'], there should probably be a catch for when the site has zero counts and division by 0 occurs?

I sometimes use WriteDMSCounts to write count files where not all sites have counts. For instance in the targeted amplicon sequencing, only a region of NP has been sequenced and is assigned counts. All sites outside the amplicon have counts 0. Now when that gets read by ReadDMSCounts, there is division by 0. I think it would be good to catch the division by 0 error and then assign a 0 frequency for the characters at that site.
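A minimal sketch of the suggested guard, using the same dictionary layout as the line quoted above (counts[site][c], counts[site]['COUNTS'], counts[site]['F_%s' % c]). The wrapper function `add_frequencies` is hypothetical; in practice the check would go inside ReadDMSCounts.

```python
def add_frequencies(counts, chars):
    """Add 'F_<char>' frequencies to each site, using 0.0 where a site has no counts."""
    for site_counts in counts.values():
        total = site_counts["COUNTS"]
        for c in chars:
            if total > 0:
                site_counts["F_%s" % c] = float(site_counts[c]) / total
            else:
                # Site outside the sequenced region: avoid dividing by zero.
                site_counts["F_%s" % c] = 0.0
    return counts


chars = ["A", "C"]
counts = {
    "1": {"A": 3, "C": 1, "COUNTS": 4},
    "2": {"A": 0, "C": 0, "COUNTS": 0},  # no counts at this site
}
add_frequencies(counts, chars)
```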

Determining whether a mutation is beneficial, deleterious or neutral

Hi,
I am trying to analyse DMS data of TEM1 from Firnberg et al. 2014 (A Comprehensive .. Gene’s Fitness Landscape) which is available here.
For that, I converted sequencing counts data to the input format used for dms_tools as described in case of Melnikov et al. example.

I got the preferences for each site by following command,

python dms_inferprefs Amp0.5.mat_mut_aas_X Amp256.mat_mut_aas_X Amp256_inferprefs_X.txt --ncpus -1 --chartype aa

The distribution of preferences looks like this,

So to infer whether a mutation is beneficial/deleterious in case of ratio based approach, it's very clear (>1 or <1). I went through the manuscript and the documentation but I could not find out how to do it with the preferences values.
Please let me know.

inputs (sequencing counts) : https://github.com/rraadd88/aaprogs_testing/tree/master/dms_tools/input
outputs (preferences) : https://github.com/rraadd88/aaprogs_testing/tree/master/dms_tools/output

move commonly used iPython notebooks to dms_tools

@mbdoud and @orrzor have a number of commonly used functions that they define in their iPython notebooks. Moving these into dms_tools so that they are accessible via the Python API would be better programming practice. Functions that are repeatedly re-used in different analyses are better incorporated as part of a software package. And the iPython notebooks would be simpler to read if they focused on the actual analysis at hand rather than spending huge amounts of text defining general functions.

Specific functions that seem like good candidates (looking here: https://github.com/jbloomlab/WSN-HA_mAb_dms/blob/master/SecondaryAnalysis_all_mAbs/SecondaryAnalysis_all_mAbs.ipynb):

ReadCodonCountsToAA: note that this function appears to be a compound of the existing ReadCodonCounts and SumCodonToAA, and so perhaps could be re-written as such?
ReadCodonCountsToAA_DataFrame: maybe you could just make ReadCodonCountsToAA have an option to return either a dictionary or data frame rather than having two separate functions?
ParseNSMutFreqBySite
I think the correlation plots should be made part of dms_correlate, so actually part of the script, not just the Python API. This seems like a general action: to correlate the differential selection values.
Perhaps some other functions, I haven't looked carefully.
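For reference, the summing step that a compound ReadCodonCountsToAA would wrap can be sketched as below. This is a stand-alone illustration using the standard genetic code; it assumes each site's dictionary contains only codon keys, and it does not reproduce the signatures of the existing ReadCodonCounts or SumCodonToAA functions.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def sum_codon_counts_to_aa(codon_counts, include_stop=True):
    """Sum per-site codon counts into amino-acid counts.

    Hypothetical helper: `codon_counts` maps site -> {codon: count}.
    """
    aa_counts = {}
    for site, counts in codon_counts.items():
        summed = Counter()
        for codon, n in counts.items():
            aa = CODON_TO_AA[codon]
            if aa == "*" and not include_stop:
                continue
            summed[aa] += n
        aa_counts[site] = dict(summed)
    return aa_counts


example = {1: {"ATG": 5, "TTA": 2, "CTG": 3, "TAA": 1}}
aa_example = sum_codon_counts_to_aa(example)
aa_no_stop = sum_codon_counts_to_aa(example, include_stop=False)
```

A single function with an option controlling the return type (dictionary or data frame) could then call the reader followed by this kind of summing step, rather than duplicating the logic in two functions.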

Let me know if there are specific parts of this that one of you wants to work on, or specific parts that you want me to do.
