
Comments (14)

AmandaKedaigle commented on August 25, 2024

Oh OK, from the abstract I thought they had just refined PWMs using ENCODE data. I am not married to MotifMap at all; we should compare these and keep looking for others! One thing to consider is that they appear to provide only human matches, whereas MotifMap also has mm9 and mm10, which I think is important. I also just downloaded their matches.txt file, and they don't provide any kind of affinity scores. They do have "matches for the shuffled control motifs (indicated with _C#)", so we could potentially calculate our own FDR-like scores, but that doesn't help with your original question about the scores.

I think no matter which data source we use, we need to think carefully about what kind of score we want to use in the regression.

from garnet.

AmandaKedaigle commented on August 25, 2024

Originally we used FDR from MotifMap (the paper that explains the score is here: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-495), but then I think Alex did some kind of normalization on it.


alexlenail commented on August 25, 2024

I don't believe there's any normalization.

https://github.com/fraenkel-lab/GarNet/blob/master/GarNet/garnet.py#L163

Is there an issue with 0-score motifs?


iamjli commented on August 25, 2024

Ideally, the motif score is an estimate of binding affinity, so it's strange to see points that lie on the y-axis of the scatter plots.


AmandaKedaigle commented on August 25, 2024

(Thanks Alex, I took the idea of normalization from the example notebook, where you said "I've already normalized the MotifMap file, so I won't do it here" and used a file called motifmap.normalized.cleaned.tsv. What processing were you referring to there?)

We should think about whether FDR is the right score to use. I wouldn't think of FDR as an estimate of binding affinity. MotifMap integrates scores that have to do with binding affinity (from position weight matrices) with conservation information. I said to use FDR at first because it integrates all the other scores.

"The first score is the Normalized Log-Odds (NLOD) score derived from the position weight matrix of the corresponding transcription factor. The second score is the Bayesian Branch Length Score (BBLS) to measure the degree of evolutionary conservation. Functional elements, such as those playing a regulatory role, often evolve more slowly than neutral sequences and can be detected by their higher level of conservation ... The third score is the False Discovery Rate (FDR) estimated by using Monte Carlo methods ... The False Discovery Rate is computed as the median number of sites found using the shuffled matrices divided by the number of sites found for the real matrix at a particular (NLOD, BBLS) score combination or higher."

Since we're using FDR derived using Monte Carlo, zero is actually a really good score? (I still haven't been able to generate scatter plots so haven't looked at this - how common is it?)
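To make the quoted FDR definition concrete, here is a minimal Python sketch. It is not MotifMap's actual code: the function name `monte_carlo_fdr`, the representation of sites as (NLOD, BBLS) tuples, and the toy numbers are all assumptions made for illustration.

```python
from statistics import median

def monte_carlo_fdr(real_sites, shuffled_runs, nlod_cut, bbls_cut):
    """Estimate the FDR at one (NLOD, BBLS) threshold pair.

    real_sites: list of (nlod, bbls) tuples found with the real matrix.
    shuffled_runs: list of such lists, one per shuffled matrix.
    Hypothetical data layout, chosen only for this sketch.
    """
    def count(sites):
        # Sites at this score combination "or higher" on both scores.
        return sum(1 for nlod, bbls in sites
                   if nlod >= nlod_cut and bbls >= bbls_cut)

    n_real = count(real_sites)
    if n_real == 0:
        return None  # undefined when no real site passes the thresholds
    # Median count over the shuffled matrices divided by the real count.
    return median(count(run) for run in shuffled_runs) / n_real

# Toy data (made up): four real sites, three shuffled matrices.
real = [(0.9, 1.0), (0.8, 0.5), (0.7, 2.0), (0.95, 1.5)]
shuffled = [[(0.9, 1.2)], [], [(0.5, 0.1)]]

strict = monte_carlo_fdr(real, shuffled, 0.75, 0.4)  # median shuffled count is 0
loose = monte_carlo_fdr(real, shuffled, 0.0, 0.0)    # median shuffled count is 1
```

Under this definition an FDR of exactly zero just means that at least half of the shuffled matrices produced no sites at that threshold, so zero is the best possible value rather than a missing score.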


iamjli commented on August 25, 2024

If we are going to reevaluate motif matching, we should also take a look at this work out of the Kellis lab.


AmandaKedaigle commented on August 25, 2024

That was published in 2013, so theoretically the PWMs they found should have been included in TRANSFAC and JASPAR, and should already be in MotifMap. I don't know that they got that specific paper, but the version we got from them was updated in 2014. Is there a reason you like that paper specifically?


iamjli commented on August 25, 2024

They combined motifs from 5 sources, including TRANSFAC and JASPAR:

"We found that no single algorithm or database comprehensively assays the motifs relevant to the binding diversity surveyed by ENCODE. Therefore, our approach was to collect motifs from several literature sources (11–16) and supplement them with motifs discovered de novo on the data sets themselves using five established tools (17–21)"

Their motif file also contains proteins in the correct namespace, so that's a small plus over MotifMap. I'm not sure if it's been updated since 2014, but it's just a suggestion.


alexlenail commented on August 25, 2024

@AmandaKedaigle "normalized" meant namespace-normalized: I changed the names of the genes/TFs so that they matched the rest of our names and belonged to the same namespace.


AmandaKedaigle commented on August 25, 2024

After a conversation with Ernest, we should definitely switch MotifMap from FDR to NLOD scores and avoid using conservation as part of our scores. We should still compare MotifMap with other sources, but let's use NLOD for those comparisons.

I was going to switch this in the code myself, but it looks like the parse_motifs_file function is not being used anymore since we've switched to bedtools? Is there still a way to make your own motifs file included with garnet, or should we delete that function from garnet.py and require the motifs file to be formatted a certain way?

(I am fine either way, personally, especially if we are eventually able to include preprocessed, correctly formatted versions of the human and mouse motif files when we distribute GarNet.)


iamjli commented on August 25, 2024

Agreed.

Alex made the argument that it may come in useful for cleaning up MotifMap data, which I agreed with initially.

But the data can be cleaned via the command line very quickly, so I don't think we really need this function anymore. Here's the command I used:

awk -F'\t' -v OFS='\t' '{ print $5, $8, $14, $3, $12, $7 }' motifMap_hg19.tsv | tr -d '"' | sort -k1,1 -k2,2n > motifMap.NLOD.bed

Output here:

/nfs/latdata/iamjli/packages/GarNet2/garnet_data/hg19/garnetDB.hg19.NLOD.10kb.tsv.

I'm not sure how long the pandas parsing takes, but I'm fairly sure this is faster, especially for large input files.
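If a pure-Python fallback is ever wanted, the same column selection could be sketched with the standard library alone. This is only an approximation of the shell pipeline, under stated assumptions: the names `AWK_COLUMNS`, `reorder_rows`, and `convert` are hypothetical, every row is assumed to have at least 14 tab-separated fields, the second output field is assumed to be numeric, and Python's string ordering may differ from the locale-aware ordering of `sort`.

```python
# 1-based column positions copied from the awk command above; what each
# column holds is not stated in this thread, so they are left unnamed.
AWK_COLUMNS = [5, 8, 14, 3, 12, 7]

def reorder_rows(rows, columns=AWK_COLUMNS):
    """Pick columns, strip double quotes, sort by field 1 then numeric field 2."""
    picked = [[row[c - 1].replace('"', '') for c in columns] for row in rows]
    # Mirrors `sort -k1,1 -k2,2n`: text order on field 1, numeric on field 2.
    picked.sort(key=lambda r: (r[0], float(r[1])))
    return picked

def convert(in_path, out_path):
    """Read a tab-separated file, reorder it, and write the result."""
    with open(in_path) as src:
        rows = [line.rstrip('\n').split('\t') for line in src if line.strip()]
    with open(out_path, 'w') as dst:
        for row in reorder_rows(rows):
            dst.write('\t'.join(row) + '\n')
```

Calling `convert("motifMap_hg19.tsv", "motifMap.NLOD.bed")` would then stand in for the one-liner. It loads the whole file into memory, so the shell version is still likely the better choice for large inputs.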


iamjli commented on August 25, 2024

It would be great to have something like this from chromVAR: https://github.com/GreenleafLab/chromVARmotifs


alexlenail commented on August 25, 2024

What's the status on this issue?


iamjli commented on August 25, 2024

I'm exploring the motif score/TF affinity metric in depth, and it will take a while, so I'm okay with closing this issue for now.
