Giter VIP home page Giter VIP logo

Comments (36)

wdecoster avatar wdecoster commented on August 16, 2024 2

Hmm tagging @marcus1487 again. I noticed that the bam file uses Mm and Ml tags, with a small second letter. I thought the specifications were using all caps for these tags?
samtools/hts-specs#362 (but maybe there are other threads relevant here). Methplotlib currently expects MM and MP tags. Of course, changing that in methplotlib is not too crazy :-)

I see Mm and Ml pop up in samtools/hts-specs#418...

from methplotlib.

wdecoster avatar wdecoster commented on August 16, 2024

This looks like a bug in my code... And if it isn't a bug I would be happy to make methplotlib compatible. Do you think it would be possible to share the cram file?

Thanks,
Wouter

from methplotlib.

dipannita-g avatar dipannita-g commented on August 16, 2024

Hi Wouter!

Yes, I can share the CRAM file. I will share a Google drive link to the file in an email.

Thanks a lot for your time!

from methplotlib.

PaulaRomeroLozano avatar PaulaRomeroLozano commented on August 16, 2024

Hello Wouter,

Is there any update on this? I am also interested.

Thanks a lot in advance.

Regards,
Paula

from methplotlib.

wdecoster avatar wdecoster commented on August 16, 2024

Hi Paula,

Do you mean you got the same error? I hope to find time this week...

Wouter

from methplotlib.

PaulaRomeroLozano avatar PaulaRomeroLozano commented on August 16, 2024

Hello Wouter,

Maybe we can continue the thread on: all context methylation visualization #29 :)

Thank you,
Paula

from methplotlib.

wdecoster avatar wdecoster commented on August 16, 2024

Hi @dipannita-g, I am very confused. Maybe @marcus1487 will have any insight here?

The cram file you shared with me (and maybe could share with Marcus too?) is behaving oddly and it hurts my brain. It could just be Friday afternoon, but I have no idea what's going on here.
samtools idxstats tells me that there are no aligned or unaligned reads in your file, but that disagrees with samtools flagstat and simple observation using samtools view.
Converting the file to bam and back to cram puts 32890416 reads as unaligned in idxstats, but the output of samtools view looks equivalent.
Converting the sorted_mod_mappings.bam you also shared to cram (with -T for the reference.fasta) puts 26885680 reads as unaligned in idxstats.

For all three crams flagstat reports 17656 reads mapped - which looks okay to me.

from methplotlib.

dipannita-g avatar dipannita-g commented on August 16, 2024

Hi Wouter,
Thanks a lot for taking a look at this! I have no clue why the discrepancy is happening when using Samtools idxstats. Flagstat also gives me the same number of mapped reads.

I have also uploaded the original bam file which Megalodon produced, mod_mappings.bam, in the folder, if that helps somehow.

I am sharing the files with @marcus1487 as well.

Thanks again!
Dipannita

from methplotlib.

a-slide avatar a-slide commented on August 16, 2024

Hi @wdecoster,

It is a bit misleading. Originally MM/MP were proposed but it drifted away to Mm/Ml as implemented in Megalodon (and Guppy).

from methplotlib.

a-slide avatar a-slide commented on August 16, 2024

Would Methplotlib be able to cope with multiple modifications encoded on the Mm field ?

from methplotlib.

wdecoster avatar wdecoster commented on August 16, 2024

Soon! How do the Ml map to the Mm in case of multiple modifications?

from methplotlib.

a-slide avatar a-slide commented on August 16, 2024

Small correction on the Mm/Ml. Once it is accepted in the spec it will become MM/ML, So better make the tag parsing case insensitive.

Good news for multiple mods support. The proposed spec actually describes how several mods can be stored in the Mm tag but it is arguably a bit hard to follow. Essentially, you can concatenate multiple modifications in the Mm tag like that
Mm:Z:C+m,5,12;C+h,5,12;
With likelihood scores in the same order in the Ml tag.

from methplotlib.

wdecoster avatar wdecoster commented on August 16, 2024

Right, but without a ; separator in the likelihood scores?
I noticed in the bam file I'm currently looking at that it's just one array of integers, with as many likelihoods as the positions for all modifications together.

The Mm tag still stores the distances between positions for that specific base with the modification, right?

from methplotlib.

marcus1487 avatar marcus1487 commented on August 16, 2024

Yes, there is no ; separator in the Ml/ML scores. I think this is a requirement of the B type in the BAM spec to allow more efficient storage as a single array. If you read the field directly from the bam/cram it would just be an array with no separators.

Yes the fields for different mods are all structured the same way and can treated independently, but the order matters in order to access the scores in the Ml/ML tag.

from methplotlib.

a-slide avatar a-slide commented on August 16, 2024

I forgot to mention before, but the documentation for the new proposed modification tags lives here https://github.com/jkbonfield/hts-specs/blob/7fafbdf65291da9377d9e17736c4e4bb06fea9a6/SAMtags.tex#L477

And here is a PDF version : SAMtags.pdf

from methplotlib.

wdecoster avatar wdecoster commented on August 16, 2024

Question for @marcus1487 or @a-slide: do you have a suggestion of minimal cut-off of Ml for megalodon? I see the output includes a probability of C+m for all Cs in the sequence, but most have just a very low probability, I guess, and it doesn't make much sense to show them all does it?

from methplotlib.

wdecoster avatar wdecoster commented on August 16, 2024

I have just pushed a bunch of code that should enable megalodon cram/bam format in methplotlib v0.19.0, and I am looking forward to your feedback.

I would very much appreciate it if @marcus1487 or @a-slide would have the time to take a look at my code for handling the Mm/Ml tags, the relevant bits are in this function. I have commented quite heavily, as it is (at least to me) not the most trivial task to map the number-of-bases-to-skip to genomic coordinates.

from methplotlib.

a-slide avatar a-slide commented on August 16, 2024

Hi Wouter,
I had a look at your code and it seems to be parsing the delta encoding mods correctly. It is indeed not trivial, but this encoding was chosen for compactness reasons (I personally think a simple offset relative to the start would have been better).

I still need to test it and compare it to the recently implemented equivalent functionality in IGV. If you want to have a look at real data, I put together a small human chr20 5mC toy dataset there https://nanoporetech.box.com/s/82pnw3lhusfs93s0vj7azxiz4x7kxabo

from methplotlib.

a-slide avatar a-slide commented on August 16, 2024

For the cut-off, @marcus1487 will know better, but I suspect it might be different depending on the model used, so maybe leave it to the users to decide where to put the bar. Or just use a transparent to solid color scale

from methplotlib.

wdecoster avatar wdecoster commented on August 16, 2024

Aha thanks, I'll have a look at the testdata.

Yes, the cutoff is an option the user can specify, but I want to set a sensible default. A problem is that plotting a dot for every C and every A gets quite dense and a heavy html. The current implementation omits points below a cutoff and uses a colorscale for the rest

from methplotlib.

marcus1487 avatar marcus1487 commented on August 16, 2024

I agree that a cutoff would be very useful. In megalodon the cutoff is 1%, but this is a command line parameter with the default to output as much of the data as is reasonable. This is likely too low a cutoff for visualization. In megalodon a read is only included in the per-site aggregated "modified" base count when the per-read probability of a modification is greater than 0.8. So to give a bit of wiggle room I think a default of 0.6-0.7 would be a reasonable default.

Another relevant issue here is calibration of these scores. Megalodon calibrates the modified base scores to match emperical data with a ground truth. Guppy produced calls on the other hand are not calibrated. So the shading/color scale may need to be adjusted for different callers.

from methplotlib.

marcus1487 avatar marcus1487 commented on August 16, 2024

For the code, a couple minor comments:

  • on line 273 offset = len(deltas) should be offset += len(deltas) (won't cause bug now as we're not calling three mods at once... yet)
  • Not sure on this one, but should basemod be mod on line 280? Or are you using the canonical base later in processing?

Otherwise this looks great!

P.S. I agree that get_forward_sequence result is a bit odd. And +1 for using all the pysam commands!

from methplotlib.

dithiii avatar dithiii commented on August 16, 2024

Megalodon input works great for me, using the mod_mappings.bam. However I don't like the color. It's just a light beige to dark red palette. Is there a way to specify it to go from blue (unmethylated) to red (methylated) as I see when using nanopolish output?

from methplotlib.

wdecoster avatar wdecoster commented on August 16, 2024

Hi @dithiii thanks for the feedback! I'm not sure about your request.
The difference with nanopolish is that it symmetrically has a log-likelihood ratio to indicate if a position is more likely modified or more likely unmodified. Megalodon in contrast has 'just' the probability that a position is modified.
This problem also relates to setting a cut-off for the likelihood score that was discussed above, and I could (instead of hiding those) mark the positions below the cut-off with a single shade of blue.

@marcus1487 Thanks for the correction on line 273 - that would have been a nasty bug in the future.
Line 280 just ends up as annotation in the graph and better spells out the modification the user is seeing in that panel, so basemod (C+m) makes a bit more sense than just mod (m).

You mention the cut-off in percent above - and what is in the cram/bam is then Phred scaled?

from methplotlib.

fbrennen avatar fbrennen commented on August 16, 2024

Ah, it's worth noting that the Ml tag is not required according to the spec -- it's possible to only have Mm. That shouldn't happen from either megalodon or guppy output as far as I know, but it is a possibility.

from methplotlib.

wdecoster avatar wdecoster commented on August 16, 2024

Aha, great, I'll make some changes to avoid crashing on that.

from methplotlib.

PaulaRomeroLozano avatar PaulaRomeroLozano commented on August 16, 2024

Hello,

Maybe the colours are light beige to dark red, and not "down" to blue, because the megalodon probabilities are shown on a natural log, whether the log-likehood from nanopolish are log 10, therefore the spread in range probabilities (and so colours) is wider?

Thank you all for the work,
I am going to try it out!

Best,
Paula

from methplotlib.

dithiii avatar dithiii commented on August 16, 2024

I'm a little confused on how to show different modifications. In my case, I am using a mod_mappings.bam that I sorted and loaded into methplotlib. It has calls for both 5mC and 5hmC. But the colors on the output are limited to simply "likely modified". Ideas on how to deal with this? I'd love different colors ranges for mods.

In the meantime though, I would just like to know how to filter a bam file for a specific mod so I can make separate plots for each of the two mods. Help?

from methplotlib.

marcus1487 avatar marcus1487 commented on August 16, 2024

The separation of the 5mC and 5hmC calls is a bit complicated. For example given probabilities of 5mC=0.2 and 5hmC=0.3 and C=0.5 in a read at a particular site, I see three "logical" ways to output this call into a file with only C and 5mC. a) 5mC=0.5 C=0.5 b) 5mC=0.2 C=0.8 and c) 5mC=0.35 C=0.65. The problem is that given probabilities of all three the conversion to the other 2 is ambiguous and any of the above 3 options could be argued for with valid points.

For this reason I think if only one modified base is required that should be modeled explicitly or the modified bases should be considered together. To that point the 5mC model is likely to represent option a above since 5hmC is collapsed in alternative bisulfite data used to train that model (see megalodon training data prep here).

One additional point not really related to methplotlib, but which may be presenting as an issue here is that the 5mC+5hmC nanopore model currently generally produces weak 5hmC calls. This is due to issues with our current training data where 5hmC training data is likely not completely 5hmC where we expect it. We are working to improve this in a later version of this model. In this context though this would present as the 5hmC mods showing as very light in almost all cases, so it may be that the 5hmC calls are simply hard to see given the lower probabilities.

from methplotlib.

dithiii avatar dithiii commented on August 16, 2024

Thanks Marcus, I see the issue. I called bases using the rerio 5mC+5hmC model and honestly I found it to work very well. The genome is mitochondria and I saw unrealistically high 5mC when calling with the 5mC-allcontext model. It called way more 5mC than is really there. Nanopolish called much lower, but still had a respectable amount. Then I called again with the 5mC+5hmC model and found essentially zero 5mC and much more 5hmC. Those results actually make much more biological sense to me (since I believe that most mtDNA 5mC is not real anyway, and any 5hmC would be misread as 5mC using bisulfite methods).

So I give your model more credit than you do. I think it's doing a great job on calling 5hmC in my case! My remaining problem is how to visualize zero 5mC and low but consistent levels of 5hmC. That's why I wanted to split the bam file. I see that it's not easy. I'm thinking of alternatives.

from methplotlib.

marcus1487 avatar marcus1487 commented on August 16, 2024

@dithiii That is great to hear! Glad you are seeing good results! I guess I should turn up my optimism for that model in its current state.

If there is a valid use case and reasonable expected behavior for a splitting method, I would be open to adding a command for this to megalodon. I think option c is the most reasonable. It would just produce very weird results where the other modified base was high probability. For example with C=0.002 5hmC=0.008 5mC=0.99, it seems a bit weird to report C=0.2 and 5hmC=0.8 probabilities, but maybe this is the most logical thing to do (it feels a bit hacky).

from methplotlib.

dithiii avatar dithiii commented on August 16, 2024

"For example with C=0.002 5hmC=0.008 5mC=0.99, it seems a bit weird to report C=0.2 and 5hmC=0.8 probabilities, but maybe this is the most logical thing to do (it feels a bit hacky)."

Right tricky question, I don't think that's a good way to report it, assuming someone asks "is it C or 5hmC" at that position, I don't think it's a logical thing to report "well, it's 0.8 likely to be 5hmC over the 0.2 probability of C" when the model really thinks it's 99% likely to be 5mC. It's not a reliable report. That being said, I don't know how to resolve that issue. Personally I would rather it just say, "it's 0.002 likely to be C and 0.008 likely to be 5hmC" and leave the user scratching their head and asking "I wonder what the other 99% call is" and hope they ask the right question.

from methplotlib.

marcus1487 avatar marcus1487 commented on August 16, 2024

Yes. That would be more ideal though still problematic, but this is unfortunately a limitation of the file format. It is assumed that the probabilities would sum to 1. So the probability not found in modified bases is assumed to be the unmodified form.

from methplotlib.

PaulaRomeroLozano avatar PaulaRomeroLozano commented on August 16, 2024

Hello,

I have tried to use methplotlib from megalodon BAMs.

  1. for 5mC

  2. for 5mC and 5hmC
    I think the format is not right...
    Please find below the command I used and the plot I am getting.

methplotlib -m hipox_sub_25.cram -n calls -w chr1:3746460-3771645 -f GRCh38.primary_assembly.genome.fa -g Homo_sapiens.GRCh38.90.chr.gtf

image

methplotlib -m 5hmc_hip_sub.sorted.cram -n calls -w chr1:3746460-3771645 -f GRCh38.primary_assembly.genome.fa -g Homo_sapiens.GRCh38.90.chr.gtf

image

Can you spot what is the problem?

You feedback would be very much appreciated,

Thank you in advance,
Paula

from methplotlib.

wdecoster avatar wdecoster commented on August 16, 2024

Ehm I don't directly have an idea, do you think it would be possible to share the dataset that causes this?

Thanks,
Wouter

from methplotlib.

PaulaRomeroLozano avatar PaulaRomeroLozano commented on August 16, 2024

Good morning Wouter,

Please find the dataset in the email I just sent you.

Thank you so much for having a look into this.

Very much appreciated,
Kind regards,

Paula

from methplotlib.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.