
mpra_selection's People

Contributors

petercombs

Forkers

thefraserlab

mpra_selection's Issues

Have some method to specify target ingroup

So I have previously been hard-coding that primates are the ingroup, but for the LTV1 enhancer, this is not the case. So instead, I have added to the config.yaml file a dictionary specifying the ingroup (which should default to primates if not found). However, this doesn't completely work yet, for reasons I haven't fully traced.
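A minimal sketch of the lookup-with-default described above, assuming the YAML has already been loaded into a dict. The key name `ingroups` and the `mammals` value are hypothetical placeholders, not taken from the actual config.yaml:

```python
DEFAULT_INGROUP = "primates"

def get_ingroup(config, enhancer):
    """Return the ingroup clade for an enhancer, falling back to primates."""
    # `ingroups` is a hypothetical key name; `or {}` also covers a key that
    # is present in the YAML but empty (which loads as None).
    ingroups = config.get("ingroups") or {}
    return ingroups.get(enhancer, DEFAULT_INGROUP)

# Illustrative config only; the real ingroup for LTV1 may differ.
config = {"ingroups": {"LTV1": "mammals"}}
print(get_ingroup(config, "LTV1"))
print(get_ingroup(config, "RET"))
```

One thing that may be worth checking is whether the key used for the lookup has the same form as the wildcard value. If the wildcard arrives as a path such as enhancers/LTV1-1, a dictionary keyed on LTV1 would silently fall through to the default.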

Think about comparison with ancestral reconstruction MPRAs

Klein et al have published STARR-seq data with some ancestral reconstruction of enhancers.

  1. Do any of their enhancers overlap with the ones in the Kircher data?

  2. Do we see selection in the branches that most correspond to the enhancers that are specifically expressing in certain taxa? Presumably we will have lower-than-chance overlap with the enhancers that express in non-humans, since the Kircher enhancers weren't randomly chosen, but that's okay.


Think more closely about what to do with bases where we have data on deletions

At least in the Kircher data, there are SNPs where we know what effect a deletion has. In theory, we could use this, though the data may be somewhat biased (for instance, for HBG1 we have 85 of the 274 bases with a deletion measured).

Also, I should double check that I'm not counting these in the possible up, down, and neutral mutations, since at the moment I don't count them in the actual up, down, and neutral.

BLAST can sometimes not return any hits in Rodents

It looks like TCF7L2, possibly among many others, does not return any rodent hits. It does have outgroup hits from 3 species of Equus (caballus, asinus, and przewalskii) and one species of bat. Given that I'm not testing any of the ancestors actually along the outgroup branches, this isn't a huge issue, but the ancestor_comparisons step does assume that there will be an outgroup species in the tree. One simple option could be to include more outgroup species, possibly not filtering out any outgroup species at all. I need to think and test.
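As a rough sketch of the kind of guard that could sit in front of the ancestor_comparisons step, the snippet below treats every leaf outside the ingroup as an outgroup and fails loudly if none is left. The function names and the species labels are hypothetical, and the real step presumably works on a tree object rather than a flat list of leaves:

```python
def find_outgroup(leaves, ingroup):
    """Return the leaves that are not in the ingroup, in their original order."""
    ingroup = set(ingroup)
    return [leaf for leaf in leaves if leaf not in ingroup]

def require_outgroup(leaves, ingroup):
    """Like find_outgroup, but raise if the tree has no outgroup species."""
    outgroup = find_outgroup(leaves, ingroup)
    if not outgroup:
        raise ValueError("no outgroup species in tree")
    return outgroup

# Illustrative leaf names only.
leaves = ["Homo_sapiens", "Pan_troglodytes", "Equus_caballus", "Equus_asinus"]
primates = {"Homo_sapiens", "Pan_troglodytes"}
print(require_outgroup(leaves, primates))
```

Keeping every non-ingroup hit, rather than only rodents, would correspond to the "include more outgroup species" option.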

Merge Reconstructions doesn't work with snakemake on the cluster

When I try to run the merge_reconstructions rule using smcluster (a quick and dirty snakemake wrapper that spins jobs out across the cluster), it fails, giving a maximum recursion error. But when I run it using just plain old snakemake, it does just fine.

The error is:

Building DAG of jobs...
InputFunctionException in line 320 of /home/pcombs/MPRA_selection/Snakefile:
RecursionError: maximum recursion depth exceeded in comparison
Wildcards:
enhancer=enhancers/LTV1-1
target=FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastM
L-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-FastML-mammals-clustalw-clustalw-tcoffee-clustalw-clustalo-clustalw-mcoffee-tcoffee-clustalo-tcoffee-tcoffee-muscle-clustalo-muscle-clustalw-clustalo-mcoffee-muscle-clustalw-clustalo-tcoffee-clustalw-muscle-muscle-tcoffee-muscle-tcoffee-clustalo-clustalo-tcoffee-clustalo-clustalo-clustalw-muscle-clustalw-clustalw-tcoffee-clustalo-mcoffee-clustalw-muscle-clustalw-tcoffee-clustalw-clustalo-clustalo-clustalw-clustalw-tcoffee-muscle-mcoffee-tcoffee-mcoffee-muscle-muscle-mcoffee-muscle-muscle-tcoffee-muscle-muscle-mcoffee-clustalw-clustalo-muscle-tcoffee-mcoffee-clustalo-tcoffee-tcoffee-muscle-clustalo-clustal
w-clustalw-clustalw-clustalo-clustalo-mcoffee-clustalw-muscle-clustalo-muscle-tcoffee-muscle-clustalw-clustalo-clustalo-clustalo-tcoffee-tcoffee-clustalo-clustalw-clustalw-mcoffee-tcoffee-mcoffee-muscle-clustalw-mcoffee-clustalo-clustalw-tcoffee-clustalo-clustalo-mcoffee-tcoffee-muscle-mcoffee-tcoffee-mcoffee-muscle-tcoffee-mcoffee-clustalo-clustalo-mcoffee-tcoffee-clustalw-tcoffee-clustalo-clustalo-clustalo-mcoffee-mcoffee-muscle-muscle-mcoffee-clustalo-clustalo-muscle-clustalo-tcoffee-clustalo-mcoffee-clustalo-tcoffee-muscle-clustalo-mcoffee-clustalo-clustalo-muscle-clustalo-tcoffee-muscle-tcoffee-tcoffee-muscle-clustalw-clustalo-tcoffee-tcoffee-mcoffee-muscle-muscle-mcoffee-clustalo-muscle-clustalo-muscle-mcoffee-tcoffee-clustalw-muscle-mcoffee-tcoffee-mcoffee-tcoffee-clustalo-clustalw-clustalw-clustalw-clustalw-tcoffee-tcoffee-tcoffee-mcoffee-mcoffee-muscle-mcoffee-muscle-muscle-clustalw-tcoffee-clustalw-mcoffee-clustalw-mcoffee-mcoffee-mcoffee-clustalw-clustalo-mcoffee-clustalw-clustalw-muscle-muscle-clustalo-clustalw-tcoffee-tcoffee-muscle-muscle-clustalw-mcoffee-muscle-muscle-mcoffee-clustalo-muscle-clustalw-clustalo-muscle-mcoffee-clustalw-tcoffee-clustalo-mcoffee-clustalw-clustalw-mcoffee-muscle-tcoffee-muscle-clustalw-clustalw-muscle-muscle-clustalw-muscle-muscle-tcoffee-clustalw-clustalo-clustalo-clustalo-muscle-tcoffee-clustalo-mcoffee-muscle-muscle-clustalw-clustalw-tcoffee-mcoffee-clustalo-clustalw-clustalo-mcoffee-mcoffee-clustalw-muscle-clustalo-mcoffee-clustalo-tcoffee-muscle-clustalo-mcoffee-clustalo-tcoffee-tcoffee-muscle-clustalw-muscle-clustalw-mcoffee-muscle-clustalo-muscle-clustalo-clustalo-clustalw-clustalw-clustalw-clustalw-clustalw-tcoffee-clustalo-clustalw-muscle-tcoffee-tcoffee-clustalo-clustalw-clustalo-clustalw-clustalo-muscle-clustalw-tcoffee-clustalw-clustalo-clustalw-mcoffee-clustalw-mcoffee-clustalw-tcoffee-muscle-muscle-muscle-clustalo-muscle-tcoffee-clustalo-tcoffee-clustalw-clustalo-muscle-tcoffee-clustalo-tcoffee-clustalw-mcoffe
e-tcoffee-mcoffee-clustalo-clustalo-clustalw-mcoffee-muscle-mcoffee-clustalo-tcoffee-tcoffee-clustalw-mcoffee-muscle-mcoffee-tcoffee-tcoffee-muscle-muscle-clustalw-mcoffee-clustalw-muscle-mcoffee-tcoffee-mcoffee-tcoffee-clustalo-clustalw-clustalo-muscle-clustalo-mcoffee-clustalo-clustalo-muscle-mcoffee-muscle-clustalo-muscle-tcoffee-clustalw-mcoffee-clustalw-muscle-clustalw-clustalw-muscle-muscle-clustalo-tcoffee-tcoffee-muscle-clustalo-clustalo-clustalo-mcoffee-tcoffee-tcoffee-clustalo-muscle-tcoffee-tcoffee-clustalo-clustalo-mcoffee-clustalw-clustalo-clustalw-mcoffee-tcoffee-mcoffee-mcoffee-muscle-mcoffee-mcoffee-clustalo-tcoffee-mcoffee-mcoffee-muscle-mcoffee-clustalo-clustalo-mcoffee-mcoffee-clustalw-clustalo-tcoffee-muscle-muscle-clustalw-clustalo-mcoffee-mcoffee-clustalw-mcoffee-tcoffee-clustalo-clustalw-mcoffee-clustalw-clustalw-clustalo-clustalw-clustalo-mcoffee-mcoffee-muscle-clustalw-clustalw-muscle-clustalw-muscle-tcoffee-tcoffee-mcoffee-tcoffee-muscle-clustalo-clustalo-tcoffee-clustalo-tcoffee-clustalo-clustalo-clustalo-muscle-clustalo-muscle-clustalw-mcoffee-tcoffee-tcoffee-muscle-mcoffee-tcoffee-clustalw-tcoffee-clustalw-muscle-tcoffee-clustalo-tcoffee-muscle-clustalw-tcoffee-tcoffee-tcoffee-clustalo-tcoffee-clustalo-clustalo-mcoffee-muscle-clustalw-muscle-tcoffee-clustalw-clustalw-clustalo-mcoffee-clustalw-clustalw-clustalo-clustalo-clustalw-tcoffee

smcluster is defined as:

snakemake --reason --use-conda --printshellcmds --jobs 100 --cluster 'sbatch -p {cluster.partition} --job-name {cluster.jobname} --mem {cluster.memory} --time {cluster.time} --cpus {cluster.cpus} --error {output[0]}.log.e --output {output[0]}.log.o ' --cluster-config cluster.json

Come up with a figure or three on biological results

My first thought here is to just have a scatterplot with log10 pvalue for upregulation on the x axis, and log10 pvalue for downregulation on the y axis.

  |  p               p p 
l |
o |
g |  p                  
p |     e    e
  |   e e        p     p   p
 +---------------------
           log p 

Where p are the promoters and e are the enhancers.

Make changes to allow software to run more easily on different computers

So probably the long-term value here is in making a complete package that will do the selection calculations on even more MPRA data, with the operation on the Kircher data being a nice proof-of-concept. To that end, I should:

  • Include the aligners in the conda installation (where applicable?) to minimize the number of pre-installation steps that need to be done.

  • Give the option for blast to be done on the NCBI server. To limit the number of requests, one could imagine having a maximum number of submissions, and above that requiring a local download, but I would think that this is not actually a big deal, since the refseq_genomes is pretty big, so you'd have to do a lot of blast searches to make up the difference. I'm still glad I have it locally, but no sense requiring it.

Decide what to do when there's not complete saturation of the mutagenesis

For instance, RET has only 2 of the 3 possible substitutions tested at position 43086573 (C>G is missing):

10      43086572        A       C       98      2972    3663    -0.08   0.1749  RET
10      43086572        A       G       215     7260    9664    -0.02   0.70391 RET
10      43086572        A       T       306     10958   14018   0.02    0.60999 RET
10      43086573        C       A       6       163     185     -0.17   0.47475 RET
10      43086573        C       T       136     4233    5560    -0.06   0.21929 RET
10      43086574        A       C       222     7970    10160   0.01    0.72669 RET
10      43086574        A       G       531     17261   22568   0       0.90982 RET
10      43086574        A       T       373     13178   16714   0.01    0.77223 RET

I'm not sure whether it biases things to skip only the missing G, or if I'd be better off marking that base as bad and dropping it from analyses altogether.
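Either way, it helps to know which positions are incomplete. Here is a small sketch that flags them, using the first four columns of the rows above (chromosome, position, ref, alt). The function name and the tuple layout are assumptions for illustration, not the project's actual data structures:

```python
from collections import defaultdict

BASES = set("ACGT")

def missing_alts(rows):
    """Map (chrom, pos) -> set of untested substitutions, for incomplete positions only."""
    seen = defaultdict(set)
    refs = {}
    for chrom, pos, ref, alt in rows:
        seen[(chrom, pos)].add(alt)
        refs[(chrom, pos)] = ref
    missing = {}
    for key, alts in seen.items():
        # Three substitutions are expected per position; any other alt
        # (e.g. a deletion marker) is ignored by the set difference.
        gap = (BASES - {refs[key]}) - alts
        if gap:
            missing[key] = gap
    return missing

rows = [
    ("10", 43086572, "A", "C"), ("10", 43086572, "A", "G"), ("10", 43086572, "A", "T"),
    ("10", 43086573, "C", "A"), ("10", 43086573, "C", "T"),
    ("10", 43086574, "A", "C"), ("10", 43086574, "A", "G"), ("10", 43086574, "A", "T"),
]
print(missing_alts(rows))
```

The result could feed either policy: skip just the listed substitutions, or drop every position that appears as a key.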

Deal with LINE elements

For ECR11, Justin removes LINE elements from... the Hominoidea sequences? The reconstruction? The text is unclear, but at any rate, I need to not compare bases in LINE elements and similar.

Which means that I need to

  1. Identify LINEs and other repetitive elements
  2. Figure out an appropriate time to remove them from my sequences
  3. Do everything after that.
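A sketch of what step 2 might look like once repeat intervals are available from step 1. Identifying the repeats is out of scope here. The half-open interval convention, the `N` mask character, and both function names are assumptions for illustration:

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace bases inside half-open [start, end) intervals with mask_char."""
    masked = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(seq))):
            masked[i] = mask_char
    return "".join(masked)

def comparable_positions(seq_a, seq_b, mask_char="N"):
    """Indices where neither of two equal-length aligned sequences is masked."""
    return [
        i for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != mask_char and b != mask_char
    ]

masked = mask_intervals("ACGTACGT", [(2, 5)])
print(masked)
print(comparable_positions(masked, "ACGTACGT"))
```

Masking rather than deleting keeps coordinates stable, which may make it easier to do this before alignment. Note that alignment gaps will still shift positions, so the intervals have to be in the same coordinate system as the sequence they are applied to.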

Make sure to process exonic enhancers differently

First step here, of course, is to figure out which, if any, enhancers actually are exonic. If they are, then I should definitely also apply the Agoglia et al analysis, and maybe not do the Smith et al analysis.

Deal with internal node if sequence same as parent

In Smith et al, they just reuse the name, so A8 is the ancestor of both the Homininae and the Hominidae. Whereas I have them as separate nodes. Some work could be done to check for identity and rename them as appropriate. I haven't thought through yet whether this affects things.
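One possible way to do the identity check, sketched on a toy tree. It assumes a child-to-parent dict and a dict of reconstructed sequences per internal node; the node names and sequences below are made up, and this is not how Smith et al describe their procedure:

```python
def canonical_names(parent, seqs):
    """Map each node to the topmost ancestor in an unbroken chain of identical sequences."""
    canon = {}

    def resolve(node):
        if node in canon:
            return canon[node]
        p = parent.get(node)
        if p is not None and seqs[p] == seqs[node]:
            # Same sequence as the parent: reuse whatever name the parent resolves to.
            canon[node] = resolve(p)
        else:
            canon[node] = node
        return canon[node]

    for node in seqs:
        resolve(node)
    return canon

# Toy example: Homininae is identical to its parent Hominidae.
parent = {"Homininae": "Hominidae", "Hominidae": "Hominoidea"}
seqs = {"Hominoidea": "ACGA", "Hominidae": "ACGT", "Homininae": "ACGT"}
print(canonical_names(parent, seqs))
```

Passing only internal nodes in `seqs` keeps extant species from being renamed after an identical ancestor.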
