jbloomlab / dms_tools
Package for analyzing and visualizing deep mutational scanning (DMS) data
License: GNU General Public License v3.0
It would be useful to be able to easily get the median as well as the average. It is probably OK if this only works for diffsel, as that is where it seems likely to be useful.
Maybe @mbdoud can do this?
@Haddox I looked at your updates to the preferences estimates from enrichment ratios. Your changes look good.
I have just made a few stylistic tweaks to the code by moving some items out of a for loop that didn't need to be executed each time.
You can see the changes using git diff. First pull the repository. Then type git log. You will see that your most recent commit was 6d113ebf9513c09d08629e15db8dd645af87da94, and my last one was 9de5d5af0cb55f9d6badb466064eeb15f12f36e1. So you can see the changes I made to the code in src/mcmc.py by typing:
git diff 6d113ebf9513c09d08629e15db8dd645af87da94 9de5d5af0cb55f9d6badb466064eeb15f12f36e1 src/mcmc.py
After reviewing these changes, the last thing we need to do is update the documentation for dms_inferprefs to explain how the enrichment ratios are now calculated. This is currently done for the src/mcmc.py file, but not in docs/dms_inferprefs.rst. So update this documentation to reflect the corrected approach, building the docs with:
cd docs
make html
and then looking in docs/_build/html/dms_inferprefs.html for the updated documentation.
After you've made these changes, you can push them back and I'll look at them.
Also, look into what you think needs to be done for the differential preferences estimation...
dms_diffselection currently adds the value specified by pseudocount to the selected sample and scales the mock sample appropriately. This was in line with experimental design for some of my early experiments where I knew selected samples would have less depth and wanted to ensure that at least pseudocount counts were added to the lowest-depth sample.
Instead it should add pseudocount to the sample with lower depth at the site and scale the other sample appropriately, in case the mock sample actually has lower depth. This guarantees that in any situation, at least pseudocount counts are added to both samples every time.
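A minimal sketch of the proposed behavior, not the dms_diffselection code itself. It assumes that "scale appropriately" means multiplying pseudocount by the ratio of the two depths at the site; the function name scaled_pseudocounts and its arguments are hypothetical.

```python
def scaled_pseudocounts(depth_sel, depth_mock, pseudocount):
    """Return (pseudocount_sel, pseudocount_mock) for one site.

    Assumption: the lower-depth sample gets exactly `pseudocount`, and the
    other sample gets `pseudocount` scaled up by the ratio of the depths.
    """
    if depth_sel <= 0 or depth_mock <= 0:
        # No depth ratio can be formed, so fall back to the plain pseudocount.
        return float(pseudocount), float(pseudocount)
    if depth_sel <= depth_mock:
        # Selected sample is shallower (or equal): scale the mock sample up.
        return float(pseudocount), pseudocount * depth_mock / float(depth_sel)
    # Mock sample is shallower: scale the selected sample up.
    return pseudocount * depth_sel / float(depth_mock), float(pseudocount)


# The shallower sample always receives exactly `pseudocount`.
print(scaled_pseudocounts(100, 400, 10))
print(scaled_pseudocounts(400, 100, 10))
```

Because the scaling factor is always at least 1, neither sample ever receives fewer than pseudocount counts, whichever one is shallower.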
@mbdoud: There are conflicts in your merge that you did not resolve.
These are due to me making changes, pushing them to GitHub, and then you making changes to the same file in an old version. When you pulled (or pushed) to GitHub you should have gotten a message about conflicts. In general, you need to fix the conflicts before doing anything -- don't leave the conflicted version.
I am going to go through and try to fix these. You can then test the resolved branch.
In the future, the easiest thing is to avoid issues like this by pulling the latest version before making changes. Otherwise you need to be sure to resolve the merge conflicts.
Barplots generated by dms_summarizealignments have a problem where the ticks are not properly aligned with the bars. Instead of being aligned in the center of the bar, they are aligned at the right-hand side of the bar.
@jbloom: I think I found out why I was getting different alignment results when reads were ordered differently in the input FASTQ files. It relates to making sure that all reads are above minreadidentity relative to one another. Right now, dms_tools only measures distances relative to the FIRST read assigned to that barcode. It does not measure all pairwise distances. So depending on which read is the first read, the distances measured may differ, exceeding the minreadidentity in some cases but not others.
It seems like we should change the code to measure all pairwise distances. I could do this for Python. However, currently the default code for the function that does this computation (dms_tools.utils.BuildReadConsensus) is written using C, which I do not have experience in.
What do you think should be the next step?
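To illustrate the order dependence described above, here is a small Python sketch. It is not the BuildReadConsensus implementation. It assumes equal-length reads and defines identity as the fraction of matching positions; all function names are hypothetical.

```python
from itertools import combinations


def identity(read1, read2):
    """Fraction of positions at which two equal-length reads match."""
    if len(read1) != len(read2):
        raise ValueError("reads must have the same length")
    if not read1:
        return 1.0
    matches = sum(1 for a, b in zip(read1, read2) if a == b)
    return matches / float(len(read1))


def passes_vs_first(reads, minreadidentity):
    """Compare every read only to the first read (order dependent)."""
    first = reads[0]
    return all(identity(first, r) >= minreadidentity for r in reads[1:])


def passes_all_pairs(reads, minreadidentity):
    """Compare every pair of reads (independent of read order)."""
    return all(identity(r1, r2) >= minreadidentity
               for r1, r2 in combinations(reads, 2))


# Same three reads in two different orders.
order1 = ["AATT", "AAAA", "TTTT"]
order2 = ["AAAA", "AATT", "TTTT"]
print(passes_vs_first(order1, 0.5), passes_vs_first(order2, 0.5))
print(passes_all_pairs(order1, 0.5), passes_all_pairs(order2, 0.5))
```

With the first-read check the two orderings disagree, while the all-pairs check gives the same answer for both. The all-pairs check costs a number of comparisons quadratic in the reads per barcode, which may matter for deeply sequenced barcodes.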
Hello, I have a quick question regarding using dms2_bcsubamp or dms2_batch_bcsubamp. Is it possible to run this code on sequencing data from a single-end run instead of a paired-end run? Also for only one subamplicon instead of multiple ones. The way I'm trying to set this up is to use the same NGS strategy as explained in:
https://jbloomlab.github.io/dms_tools2/bcsubamp.html
Except I use FASTQs from a single-end run instead of a paired-end run. I can optionally still amplify the target by appending 2 barcodes on each site, making it small enough so that a single-end read captures both barcodes. The only issue is the resulting FASTQs will be R1 instead of both R1 & R2.
Is there an easy way to do this or am I missing something that would make this impossible?
Looks good in general. A few questions that you might address in a further modification:
Should the default --chartype be codon rather than DNA? It seems like right now we do most of our analyses with codons and have that as the default for the other programs.
The --chartype option specifies an upper-case DNA, but then the default is a lower-case dna. Maybe make this consistent so the user isn't confused about whether to enter upper or lower case. Even if in the end it doesn't matter (looks like you make choices case-insensitive), this might clarify.
--minus option. Either way is probably OK, but perhaps clarify this in the docs?
--Jesse
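One way to keep the accepted values and the default consistent is to normalize case before argparse checks the choices. This is a sketch with a hypothetical parser; the choices shown are assumptions and not necessarily those of the program under review.

```python
import argparse


def build_parser():
    """Hypothetical parser showing case-insensitive --chartype handling."""
    parser = argparse.ArgumentParser(description="Hypothetical example parser.")
    parser.add_argument(
        "--chartype",
        type=str.lower,  # normalize input before it is checked against choices
        choices=["codon", "dna", "aa"],  # assumed set of choices
        default="codon",
        help="Character type (case-insensitive): codon, dna, or aa.",
    )
    return parser


# Upper-case input is accepted and stored in lower case.
args = build_parser().parse_args(["--chartype", "DNA"])
print(args.chartype)
```

With this approach the help text, the choices, and the default can all be written in one case, so the user never has to guess which spelling is expected.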
http://jbloomlab.github.io/dms_tools/installation.html#quick-installation
Using sudo with pip is a bad idea because it mixes system package management with pip package management. People frequently end up in #python with totally busted Python installs because of this. Please remove the suggestion from the docs.
Some of the equations, such as 11-16 and 18 on the inferprefs_algorithm page, are not rendering correctly on the webpage.
I believe these equations are duplicated on the inferdiffprefs_algorithm page, as they also do not render there.
Add an option --mincounts to dms_diffselection so that differential selection is only inferred at a site if there are >= this many counts in either the selected or mock-selected libraries.
The default should probably be zero.
We need to figure out how to specify undetermined values in output files.
We also need to ensure that programs that do subsequent processing of the data (e.g., dms_logoplot) can correctly handle the undetermined values.
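As a starting point for discussion, here is a sketch of the --mincounts filter with NaN as the marker for undetermined values. Using NaN is only one possible choice, since the representation is still undecided above. The function names and the dictionary layout are hypothetical.

```python
import math


def mask_low_count_sites(diffsel, counts_sel, counts_mock, mincounts=0):
    """Return a copy of `diffsel` with low-count sites set to NaN.

    A site is kept if it has at least `mincounts` counts in either the
    selected or the mock-selected library.
    """
    masked = {}
    for site, value in diffsel.items():
        enough = (counts_sel.get(site, 0) >= mincounts
                  or counts_mock.get(site, 0) >= mincounts)
        masked[site] = value if enough else float("nan")
    return masked


def mean_ignoring_undetermined(values):
    """Average only the determined (non-NaN) values."""
    determined = [v for v in values if not math.isnan(v)]
    return sum(determined) / len(determined) if determined else float("nan")


masked = mask_low_count_sites(
    {1: 0.5, 2: -1.0}, {1: 20, 2: 3}, {1: 5, 2: 4}, mincounts=10)
print(masked)
print(mean_ignoring_undetermined(masked.values()))
```

Downstream programs would then need an explicit rule, like the averaging helper here, for skipping or displaying undetermined sites.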
Hey! Got a quick question. If we did not barcode our variants, but rather have a sample of pooled variants for each round/condition which were randomly fragmented and sequenced (i.e. each pool is barcoded, but not each variant in each pool), can we still use dms_barcodedsubamplicons? Do we just set the barcode to a null value? Thanks!
I get this error while running dms_tools to analyze DMS data from Olson et al.
There are no errors for any other datasets that I have analysed with dms_tools (version: 1.1.15).
I am not sure why this is happening for this particular data.
Please have a look at the logfile to see if you can find anything.
Logfile: https://github.com/rraadd88/tests/blob/master/Selection_WRT_Input.log
Command used: python dms_inferprefs Input Selection --ncpus -10 --chartype aa
Input counts: https://github.com/rraadd88/tests/blob/master/Input
Selection counts: https://github.com/rraadd88/tests/blob/master/Selection
@djpl I have added your script as dms_matchsubassembledbarcodes.
We're now going to start re-structuring this into a proper program.
The program itself is now in dms_tools/scripts/dms_matchsubassembledbarcodes.
To get the newest version, go to the dms_tools repository that you have pulled from GitHub. Type git status. You should see something that says that you are on the branch subassembly. If this is not the case, do git checkout subassembly and then confirm you're on the branch subassembly. If that still doesn't work, let me know.
Then type git pull origin subassembly to pull the latest version from GitHub. Type git log and you should see that the last commit was made by me recently with the message Initial modifications to dms_matchsubassembledbarcodes to make it read arguments from argparse parser.
I have already made some changes to your program:
The code is now in a main function. You shouldn't have any global commands; this is bad programming practice.
There is now a parser in src/parsearguments called MatchSubassembledBarcodesParser. This now parses the arguments for the script in the proper way. If you install the updated package with python setup.py install --user you can now type dms_matchsubassembledbarcodes -h and you should get the help message. You can run the script by providing arguments at the command line.
You should make the following changes:
The reference_seq variable. Do we really need that as a separate input, or can you extract it from the subassembled variants file? If you can extract it, get it that way to avoid unnecessary arguments.
--minquality option in the argument parser.
Look at the other programs in ./scripts/ and make the basic input-output in the main function work similarly.
Use chr and ord, as in minqchar = chr(minq + 33).
Change from <PACKAGE> import <MODULE> to just import <PACKAGE> and then when you call it, just call it with PACKAGE.MODULE.FUNCTION. This leads to clearer code.
Change print statements to the Python3 compatible format: http://www.python-course.eu/python3_print.php
Once you've finished all of this, add your changes with git add . (including the trailing period). If you then type git status, you should see a list of the files you've changed. Commit these with git commit -m 'MESSAGE' where MESSAGE is a brief one-sentence description of what you've done. You can then push them back to GitHub with git push origin subassembly. At that point, I'll get an automatic message and I can look at your changes and we can do another round.
If you have questions / clarifications, do them by replying to this message on GitHub rather than via e-mail so we can keep all the conversation here in one place.
Also, I'm not sure what program you're using to edit the Python files? If possible, see if you can set it up to make your tabs as four spaces rather than a tab character.
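For reference, the chr / ord conversion mentioned above could look roughly like the following sketch. It assumes FASTQ qualities use the Phred+33 encoding (as implied by minqchar = chr(minq + 33)) and that a read passes only if every base meets the minimum; the helper names are hypothetical. It also uses the Python 3 style print.

```python
from __future__ import print_function  # Python 3 style print on Python 2

PHRED_OFFSET = 33  # offset implied by minqchar = chr(minq + 33)


def quality_to_char(q):
    """Convert an integer quality score to its FASTQ character."""
    return chr(q + PHRED_OFFSET)


def char_to_quality(c):
    """Convert a FASTQ quality character back to an integer score."""
    return ord(c) - PHRED_OFFSET


def passes_minquality(quality_string, minq):
    """True if every base has quality >= minq (assumed filtering rule)."""
    minqchar = quality_to_char(minq)
    # Characters compare by code point, so no per-base ord() call is needed.
    return all(c >= minqchar for c in quality_string)


print(quality_to_char(20), char_to_quality("I"))
print(passes_minquality("IIII5", 20), passes_minquality("II#I", 20))
```

Converting the threshold to a character once, rather than converting every base to an integer, keeps the per-read check cheap.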
In ReadDMSCounts, there is now functionality to calculate character frequencies. For the line counts[site]['F_%s' % c] = float(counts[site][c])/counts[site]['COUNTS'], there should probably be a catch for when the site has zero counts and division by 0 occurs?
I sometimes use WriteDMSCounts to write count files where not all sites have counts. For instance in the targeted amplicon sequencing, only a region of NP has been sequenced and is assigned counts. All sites outside the amplicon have counts 0. Now when that gets read by ReadDMSCounts, there is division by 0. I think it would be good to catch the division by 0 error and then assign a 0 frequency for the characters at that site.
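A sketch of the suggested guard, written as a standalone helper rather than as a patch to ReadDMSCounts. The function name add_frequencies and its characters argument are hypothetical; the dictionary layout follows the line quoted above.

```python
def add_frequencies(counts, characters):
    """Add 'F_<char>' frequency entries to each site in `counts`, in place.

    Sites with zero total counts get a frequency of 0.0 for every
    character instead of raising ZeroDivisionError.
    """
    for site in counts:
        total = counts[site]['COUNTS']
        for c in characters:
            if total > 0:
                counts[site]['F_%s' % c] = float(counts[site][c]) / total
            else:
                counts[site]['F_%s' % c] = 0.0
    return counts


counts = {
    '1': {'A': 3, 'C': 1, 'COUNTS': 4},
    '2': {'A': 0, 'C': 0, 'COUNTS': 0},  # site outside the sequenced amplicon
}
add_frequencies(counts, ['A', 'C'])
print(counts['1']['F_A'], counts['2']['F_A'])
```

Checking the total explicitly, rather than catching ZeroDivisionError, makes the zero-count case visible in the code.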
Hi,
I am trying to analyse DMS data of TEM1 from Firnberg et al. 2014 (A Comprehensive .. Gene’s Fitness Landscape) which is available here.
For that, I converted the sequencing counts data to the input format used for dms_tools, as described for the Melnikov et al. example.
I obtained the preferences for each site with the following command:
python dms_inferprefs Amp0.5.mat_mut_aas_X Amp256.mat_mut_aas_X Amp256_inferprefs_X.txt --ncpus -1 --chartype aa
The distribution of preferences looks like this,
With a ratio-based approach, it is very clear whether a mutation is beneficial or deleterious (ratio >1 or <1). I went through the manuscript and the documentation, but I could not find out how to do this with the preference values.
Please let me know.
inputs (sequencing counts) : https://github.com/rraadd88/aaprogs_testing/tree/master/dms_tools/input
outputs (preferences) : https://github.com/rraadd88/aaprogs_testing/tree/master/dms_tools/output
@mbdoud and @orrzor have a number of commonly used functions that they define in their iPython notebooks. Moving these into dms_tools so that they are accessible via the Python API would be better programming practice. Functions that are repeatedly re-used in different analyses are better incorporated as part of a software package. And the iPython notebooks would be simpler to read if they focused on the actual analysis at hand rather than spending huge amounts of text defining general functions.
Specific functions that seem like good candidates (looking here: https://github.com/jbloomlab/WSN-HA_mAb_dms/blob/master/SecondaryAnalysis_all_mAbs/SecondaryAnalysis_all_mAbs.ipynb):
ReadCodonCountsToAA: note that this function seems to be a compound of the existing ReadCodonCounts and SumCodonToAA, and so perhaps could be re-written as such?
ReadCodonCountsToAA_DataFrame: maybe you could just make ReadCodonCountsToAA have an option to return either a dictionary or data frame rather than having two separate functions?
ParseNSMutFreqBySite
dms_correlate, so actually part of the script, not just the Python API. This seems like a general action: to correlate the differential selection values.
Let me know if there are specific parts of this that one of you wants to work on, or specific parts that you want me to do.
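To make the codon-to-amino-acid step concrete, here is a self-contained sketch of summing per-site codon counts into amino-acid counts with the standard genetic code. It is a stand-in for illustration only: it does not call or reproduce the existing ReadCodonCounts or SumCodonToAA functions, and the name sum_codon_counts_to_aa and the dictionary layout are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def sum_codon_counts_to_aa(codon_counts):
    """Sum {site: {codon: n}} into {site: {amino_acid: n}}.

    Stop codons are summed under '*'.
    """
    aa_counts = {}
    for site, site_counts in codon_counts.items():
        summed = {}
        for codon, n in site_counts.items():
            aa = CODON_TO_AA[codon]
            summed[aa] = summed.get(aa, 0) + n
        aa_counts[site] = summed
    return aa_counts


example = {'1': {'ATG': 5}, '2': {'TTT': 2, 'TTC': 3, 'TAA': 1}}
print(sum_codon_counts_to_aa(example))
```

A combined reader could then be a thin wrapper that reads the codon counts and passes them through a function like this, with a flag choosing between a dictionary and a data frame as the return type.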