mpra_selection's Issues
Have some method to specify target ingroup
I had previously hard-coded primates as the ingroup, but that is not correct for the LTV1 enhancer. Instead, I have added a dictionary to the config.yaml file that specifies the ingroup (and should default to primates if no entry is found). However, this doesn't completely work yet, for reasons I haven't fully traced.
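A minimal sketch of the intended lookup, assuming the parsed config.yaml holds the dictionary under a key called `ingroups` that maps enhancer name to ingroup. The key name, the `get_ingroup` helper, and the `example_clade` value are hypothetical and not taken from the repository:

```python
DEFAULT_INGROUP = "primates"


def get_ingroup(config, enhancer, default=DEFAULT_INGROUP):
    """Return the ingroup configured for an enhancer, or the default.

    `config` is the dict parsed from config.yaml. The "ingroups" key is an
    assumed name; an absent or empty section falls back to the default.
    """
    ingroups = config.get("ingroups") or {}
    return ingroups.get(enhancer, default)


# Hypothetical config: only LTV1 overrides the default ingroup.
config = {"ingroups": {"LTV1": "example_clade"}}
print(get_ingroup(config, "LTV1"))
print(get_ingroup(config, "HBG1"))
```

Using `or {}` covers the case where the section exists in the YAML file but is empty, which would otherwise parse to `None`.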
Think about comparison with ancestral reconstruction MPRAs
Klein et al have published STARR-seq data with some ancestral reconstruction of enhancers.
- Do any of their enhancers overlap with the ones in the Kircher data?
- Do we see selection in the branches that most correspond to the enhancers that are specifically expressed in certain taxa? Presumably we will have lower-than-chance overlap with the enhancers that express in non-humans, since the Kircher enhancers weren't randomly chosen, but that's okay.
Think more closely about what to do with bases where we have data on deletions
At least in the Kircher data, there are SNPs where we know what effect a deletion has. In theory, we could use this, though the data may be somewhat biased (for instance, for HBG1 we have 85 of the 274 bases with a deletion measured).
Also, I should double check that I'm not counting these in the possible up, down, and neutral mutations, since at the moment I don't count them in the actual up, down, and neutral.
BLAST can sometimes not return any hits in Rodents
It looks like TCF7L2, possibly among many others, does not return any rodent hits. As an outgroup, it does have 3 species in Equus (caballus, asinus, and przewalski), and one species of bat. Given that I'm not testing any of the ancestors actually along the outgroup branches, this isn't a huge issue, but the ancestor_comparisons step does assume that there will be an outgroup species in the tree. One simple option could be to include more outgroup species, possibly without removing any species as outgroups at all. I need to think and test.
Merge Reconstructions doesn't work with snakemake on the cluster
When I try to run the merge_reconstructions rule using smcluster (a quick and dirty snakemake wrapper that spins jobs out across the cluster), it fails, giving a maximum recursion error. But when I run it using just plain old snakemake, it does just fine.
The error is:
Building DAG of jobs...
InputFunctionException in line 320 of /home/pcombs/MPRA_selection/Snakefile:
RecursionError: maximum recursion depth exceeded in comparison
Wildcards:
enhancer=enhancers/LTV1-1
target=FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastM
L-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-mammals-clustalw-clustalw-tcoffee-clustalw-clustalo-clustalw-mcoffee-tcoffee-clustalo-tcoffee-tcoffee-muscle-clustalo-muscle-clustalw-clustalo-mcoffee-muscle-clustalw-clustalo-tcoffee-clustalw-muscle-muscle-tcoffee-muscle-tcoffee-clustalo-clustalo-tcoffee-clustalo-clustalo-clustalw-muscle-clustalw-clustalw-tcoffee-clustalo-mcoffee-clustalw-muscle-clustalw-tcoffee-clustalw-clustalo-clustalo-clustalw-clustalw-tcoffee-muscle-mcoffee-tcoffee-mcoffee-muscle-muscle-mcoffee-muscle-muscle-tcoffee-muscle-muscle-mcoffee-clustalw-clustalo-muscle-tcoffee-mcoffee-clustalo-tcoffee-tcoffee-muscle-clustalo-clustal
w-clustalw-clustalw-clustalo-clustalo-mcoffee-clustalw-muscle-clustalo-muscle-tcoffee-muscle-clustalw-clustalo-clustalo-clustalo-tcoffee-tcoffee-clustalo-clustalw-clustalw-mcoffee-tcoffee-mcoffee-muscle-clustalw-mcoffee-clustalo-clustalw-tcoffee-clustalo-clustalo-mcoffee-tcoffee-muscle-mcoffee-tcoffee-mcoffee-muscle-tcoffee-mcoffee-clustalo-clustalo-mcoffee-tcoffee-clustalw-tcoffee-clustalo-clustalo-clustalo-mcoffee-mcoffee-muscle-muscle-mcoffee-clustalo-clustalo-muscle-clustalo-tcoffee-clustalo-mcoffee-clustalo-tcoffee-muscle-clustalo-mcoffee-clustalo-clustalo-muscle-clustalo-tcoffee-muscle-tcoffee-tcoffee-muscle-clustalw-clustalo-tcoffee-tcoffee-mcoffee-muscle-muscle-mcoffee-clustalo-muscle-clustalo-muscle-mcoffee-tcoffee-clustalw-muscle-mcoffee-tcoffee-mcoffee-tcoffee-clustalo-clustalw-clustalw-clustalw-clustalw-tcoffee-tcoffee-tcoffee-mcoffee-mcoffee-muscle-mcoffee-muscle-muscle-clustalw-tcoffee-clustalw-mcoffee-clustalw-mcoffee-mcoffee-mcoffee-clustalw-clustalo-mcoffee-clustalw-clustalw-muscle-muscle-clustalo-clustalw-tcoffee-tcoffee-muscle-muscle-clustalw-mcoffee-muscle-muscle-mcoffee-clustalo-muscle-clustalw-clustalo-muscle-mcoffee-clustalw-tcoffee-clustalo-mcoffee-clustalw-clustalw-mcoffee-muscle-tcoffee-muscle-clustalw-clustalw-muscle-muscle-clustalw-muscle-muscle-tcoffee-clustalw-clustalo-clustalo-clustalo-muscle-tcoffee-clustalo-mcoffee-muscle-muscle-clustalw-clustalw-tcoffee-mcoffee-clustalo-clustalw-clustalo-mcoffee-mcoffee-clustalw-muscle-clustalo-mcoffee-clustalo-tcoffee-muscle-clustalo-mcoffee-clustalo-tcoffee-tcoffee-muscle-clustalw-muscle-clustalw-mcoffee-muscle-clustalo-muscle-clustalo-clustalo-clustalw-clustalw-clustalw-clustalw-clustalw-tcoffee-clustalo-clustalw-muscle-tcoffee-tcoffee-clustalo-clustalw-clustalo-clustalw-clustalo-muscle-clustalw-tcoffee-clustalw-clustalo-clustalw-mcoffee-clustalw-mcoffee-clustalw-tcoffee-muscle-muscle-muscle-clustalo-muscle-tcoffee-clustalo-tcoffee-clustalw-clustalo-muscle-tcoffee-clustalo-tcoffee-clustalw-mcoffe
e-tcoffee-mcoffee-clustalo-clustalo-clustalw-mcoffee-muscle-mcoffee-clustalo-tcoffee-tcoffee-clustalw-mcoffee-muscle-mcoffee-tcoffee-tcoffee-muscle-muscle-clustalw-mcoffee-clustalw-muscle-mcoffee-tcoffee-mcoffee-tcoffee-clustalo-clustalw-clustalo-muscle-clustalo-mcoffee-clustalo-clustalo-muscle-mcoffee-muscle-clustalo-muscle-tcoffee-clustalw-mcoffee-clustalw-muscle-clustalw-clustalw-muscle-muscle-clustalo-tcoffee-tcoffee-muscle-clustalo-clustalo-clustalo-mcoffee-tcoffee-tcoffee-clustalo-muscle-tcoffee-tcoffee-clustalo-clustalo-mcoffee-clustalw-clustalo-clustalw-mcoffee-tcoffee-mcoffee-mcoffee-muscle-mcoffee-mcoffee-clustalo-tcoffee-mcoffee-mcoffee-muscle-mcoffee-clustalo-clustalo-mcoffee-mcoffee-clustalw-clustalo-tcoffee-muscle-muscle-clustalw-clustalo-mcoffee-mcoffee-clustalw-mcoffee-tcoffee-clustalo-clustalw-mcoffee-clustalw-clustalw-clustalo-clustalw-clustalo-mcoffee-mcoffee-muscle-clustalw-clustalw-muscle-clustalw-muscle-tcoffee-tcoffee-mcoffee-tcoffee-muscle-clustalo-clustalo-tcoffee-clustalo-tcoffee-clustalo-clustalo-clustalo-muscle-clustalo-muscle-clustalw-mcoffee-tcoffee-tcoffee-muscle-mcoffee-tcoffee-clustalw-tcoffee-clustalw-muscle-tcoffee-clustalo-tcoffee-muscle-clustalw-tcoffee-tcoffee-tcoffee-clustalo-tcoffee-clustalo-clustalo-mcoffee-muscle-clustalw-muscle-tcoffee-clustalw-clustalw-clustalo-mcoffee-clustalw-clustalw-clustalo-clustalo-clustalw-tcoffee
smcluster is defined as:
snakemake --reason --use-conda --printshellcmds --jobs 100 --cluster 'sbatch -p {cluster.partition} --job-name {cluster.jobname} --mem {cluster.memory} --time {cluster.time} --cpus {cluster.cpus} --error {output[0]}.log.e --output {output[0]}.log.o ' --cluster-config cluster.json
Come up with a figure or three on biological results
My first thought here is to just have a scatterplot with the log10 p-value for upregulation on the x axis, and the log10 p-value for downregulation on the y axis.
  | p p p
l |
o |
g | p
p | e e
  | e e p p p
  +---------------------
    log p
Where p are the promoters and e are the enhancers.
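A rough sketch of how the scatterplot described above might be produced. It assumes each element comes with one p-value for upregulation and one for downregulation, and it plots -log10(p) rather than log10(p) so that significant elements sit far from the origin; that sign flip is my assumption. The function names and the tuple layout are hypothetical, and matplotlib is only imported when the plot is actually drawn:

```python
import math


def neg_log10(p, floor=1e-300):
    """-log10 of a p-value, clamped so that p == 0 does not raise."""
    return -math.log10(max(p, floor))


def scatter_points(results):
    """Convert (name, kind, p_up, p_down) tuples to plot coordinates."""
    return [
        (name, kind, neg_log10(p_up), neg_log10(p_down))
        for name, kind, p_up, p_down in results
    ]


def plot_points(points, path="selection_scatter.png"):
    """Draw promoters and enhancers with different markers and save to path."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for kind, marker in (("promoter", "o"), ("enhancer", "^")):
        xs = [x for _, k, x, _ in points if k == kind]
        ys = [y for _, k, _, y in points if k == kind]
        ax.scatter(xs, ys, marker=marker, label=kind)
    ax.set_xlabel("-log10 p (upregulation)")
    ax.set_ylabel("-log10 p (downregulation)")
    ax.legend()
    fig.savefig(path)


# Made-up p-values, purely to show the call pattern.
example = [("elementA", "promoter", 0.01, 1.0), ("elementB", "enhancer", 0.5, 0.1)]
points = scatter_points(example)
```

Calling `plot_points(points)` would then write the figure to disk.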
Make changes to allow software to run more easily on different computers
So probably the long-term value here is in making a complete package that will do the selection calculations on even more MPRA data, with the operation on the Kircher data being a nice proof-of-concept. To that end, I should:
- Include the aligners in the conda installation (where applicable?) to minimize the number of pre-installation steps that need to be done.
- Give the option for blast to be done on the NCBI server. To limit the number of requests, one could imagine having a maximum number of submissions, and above that requiring a local download, but I would think that this is not actually a big deal, since the refseq_genomes database is pretty big, so you'd have to do a lot of blast searches to make up the difference. I'm still glad I have it locally, but no sense requiring it.
Decide what to do when there's not complete saturation of the mutagenesis
For instance, RET has only 2 of the 3 possible substitutions tested at position '573 (the C to G change is missing):
10 43086572 A C 98 2972 3663 -0.08 0.1749 RET
10 43086572 A G 215 7260 9664 -0.02 0.70391 RET
10 43086572 A T 306 10958 14018 0.02 0.60999 RET
10 43086573 C A 6 163 185 -0.17 0.47475 RET
10 43086573 C T 136 4233 5560 -0.06 0.21929 RET
10 43086574 A C 222 7970 10160 0.01 0.72669 RET
10 43086574 A G 531 17261 22568 0 0.90982 RET
10 43086574 A T 373 13178 16714 0.01 0.77223 RET
I'm not sure whether it biases things to skip only the missing G, or if I'd be better off marking that base as bad and dropping it from analyses altogether.
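Whichever way that decision goes, the incompletely saturated positions need to be found first. A small sketch, assuming the first four whitespace-separated columns of each row are chromosome, position, reference base, and alternate base, as they appear to be in the rows above. The function name is hypothetical:

```python
from collections import defaultdict

BASES = set("ACGT")


def missing_substitutions(rows):
    """Map (chrom, pos) to the sorted substitutions that were never measured.

    Only positions with at least one measured substitution can be reported;
    positions absent from the data entirely are not detected here.
    """
    ref_at = {}
    seen = defaultdict(set)
    for line in rows:
        fields = line.split()
        if len(fields) < 4:
            continue
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[2], fields[3]
        if alt not in BASES:
            continue  # ignore non-substitution records such as deletions
        ref_at[(chrom, pos)] = ref
        seen[(chrom, pos)].add(alt)
    missing = {}
    for key, alts in seen.items():
        absent = BASES - {ref_at[key]} - alts
        if absent:
            missing[key] = sorted(absent)
    return missing


rows = [
    "10 43086572 A C 98 2972 3663 -0.08 0.1749 RET",
    "10 43086572 A G 215 7260 9664 -0.02 0.70391 RET",
    "10 43086572 A T 306 10958 14018 0.02 0.60999 RET",
    "10 43086573 C A 6 163 185 -0.17 0.47475 RET",
    "10 43086573 C T 136 4233 5560 -0.06 0.21929 RET",
    "10 43086574 A C 222 7970 10160 0.01 0.72669 RET",
    "10 43086574 A G 531 17261 22568 0 0.90982 RET",
    "10 43086574 A T 373 13178 16714 0.01 0.77223 RET",
]
print(missing_substitutions(rows))  # {('10', 43086573): ['G']}
```

The resulting dictionary could drive either policy: skip just the listed substitutions, or drop every position that appears as a key.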
Deal with LINE elements
For ECR11, Justin removes LINE elements from... the Homonoidae sequences? The reconstruction? The text is unclear, but at any rate, I need to not compare bases in LINE elements and similar.
Which means that I need to
- Identify LINEs and other repetitive elements
- Figure out an appropriate time to remove them from my sequences
- Do everything after that.
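One possible shape for the removal step, sketched under the assumption that repeat annotations are already available as 0-based, half-open (start, end) intervals on the sequence, for example converted from a repeat annotation track. Where those intervals come from, and at which pipeline stage the masking should happen, is still open. The function names are hypothetical:

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace every base inside the given intervals with mask_char."""
    chars = list(seq)
    for start, end in intervals:
        # Clamp to the sequence bounds so stray annotations cannot fail.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


def comparable_positions(seq_a, seq_b, mask_char="N"):
    """Indices where neither of two equal-length sequences is masked."""
    return [
        i for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if mask_char not in (a, b)
    ]


masked = mask_intervals("ACGTACGT", [(2, 5)])
print(masked)  # ACNNNCGT
print(comparable_positions(masked, "ACGTACGA"))  # [0, 1, 5, 6, 7]
```

Masking rather than deleting keeps the coordinates stable, so later steps can skip masked bases without re-mapping positions.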
Make sure to process exonic enhancers differently
First step here, of course, is to figure out which, if any, enhancers actually are exonic. If any are, then I should definitely also apply the Agoglia et al analysis, and maybe not do the Smith et al analysis.
Deal with internal node if sequence same as parent
In Smith et al, they just reuse the name, so A8 is the ancestor of both the Homininae and the Hominidae, whereas I have them as separate nodes. Some work could be done to check for identity and rename them as appropriate. I haven't thought through yet whether this affects things.
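A sketch of the identity check, assuming the reconstruction can be summarized as two dictionaries covering the internal nodes: one mapping each node to its parent and one mapping each node to its reconstructed sequence. The function name, the node names, and the sequences below are made up for illustration:

```python
def canonical_names(parent_of, seq_of):
    """Map each node to the topmost ancestor that shares its sequence.

    Nodes whose sequence differs from their parent's keep their own name.
    """
    names = {}
    for node in seq_of:
        top = node
        parent = parent_of.get(top)
        # Walk upward while the parent's sequence is identical.
        while parent is not None and seq_of[parent] == seq_of[top]:
            top = parent
            parent = parent_of.get(top)
        names[node] = top
    return names


# Hypothetical three-node chain with toy sequences.
parent_of = {"Homininae": "Hominidae", "Hominidae": "Hominoidea"}
seq_of = {"Hominoidea": "ACGT", "Hominidae": "ACGA", "Homininae": "ACGA"}
print(canonical_names(parent_of, seq_of))
```

Here the Homininae node would take the Hominidae name because the two sequences are identical, which mirrors the name reuse described above.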