Comments (36)
Hmm tagging @marcus1487 again. I noticed that the bam file uses Mm
and Ml
tags, with a small second letter. I thought the specifications were using all caps for these tags?
samtools/hts-specs#362 (but maybe there are other threads relevant here). Methplotlib currently expects MM and MP tags. Of course, changing that in methplotlib is not too crazy :-)
I see Mm and Ml pop up in samtools/hts-specs#418...
from methplotlib.
This looks like a bug in my code... And if it isn't a bug I would be happy to make methplotlib compatible. Do you think it would be possible to share the cram file?
Thanks,
Wouter
from methplotlib.
Hi Wouter!
Yes, I can share the CRAM file. I will share a Google drive link to the file in an email.
Thanks a lot for your time!
from methplotlib.
Hello Wouter,
Is there any update on this? I am also interested.
Thanks a lot in advance.
Regards,
Paula
from methplotlib.
Hi Paula,
Do you mean you got the same error? I hope to find time this week...
Wouter
from methplotlib.
Hello Wouter,
Maybe we can continue the thread on: all context methylation visualization #29 :)
Thank you,
Paula
from methplotlib.
Hi @dipannita-g, I am very confused. Maybe @marcus1487 will have any insight here?
The cram file you shared with me (and maybe could share with Marcus too?) is behaving oddly and it hurts my brain. It could just be Friday afternoon, but I have no idea what's going on here.
samtools idxstats tells me that there are no aligned or unaligned reads in your file, but that disagrees with samtools flagstat and simple observation using samtools view.
Converting the file to bam and back to cram puts 32890416 reads as unaligned in idxstats, but the output of samtools view looks equivalent.
Converting the sorted_mod_mappings.bam you also shared to cram (with -T for the reference.fasta) puts 26885680 reads as unaligned in idxstats.
For all three crams flagstat reports 17656 reads mapped - which looks okay to me.
from methplotlib.
Hi Wouter,
Thanks a lot for taking a look at this! I have no clue why the discrepancy is happening when using Samtools idxstats. Flagstat also gives me the same number of mapped reads.
I have also uploaded the original bam file which Megalodon produced, mod_mappings.bam, in the folder, if that helps somehow.
I am sharing the files with @marcus1487 as well.
Thanks again!
Dipannita
from methplotlib.
Hi @wdecoster,
It is a bit misleading. Originally MM/MP were proposed but it drifted away to Mm/Ml as implemented in Megalodon (and Guppy).
from methplotlib.
Would Methplotlib be able to cope with multiple modifications encoded on the Mm field ?
from methplotlib.
Soon! How do the Ml map to the Mm in case of multiple modifications?
from methplotlib.
Small correction on the Mm/Ml. Once it is accepted in the spec it will become MM/ML, So better make the tag parsing case insensitive.
Good news for multiple mods support. The proposed spec actually describes how several mods can be stored in the Mm tag but it is arguably a bit hard to follow. Essentially, you can concatenate multiple modifications in the Mm tag like that
Mm:Z:C+m,5,12;C+h,5,12;
With likelihood scores in the same order in the Ml tag.
from methplotlib.
Right, but without a ;
separator in the likelihood scores?
I noticed in the bam file I'm currently looking at that it's just one array of integers, with as many likelihoods as the positions for all modifications together.
The Mm tag still stores the distances between positions for that specific base with the modification, right?
from methplotlib.
Yes, there is no ;
separator in the Ml/ML scores. I think this is a requirement of the B
type in the BAM spec to allow more efficient storage as a single array. If you read the field directly from the bam/cram it would just be an array with no separators.
Yes the fields for different mods are all structured the same way and can treated independently, but the order matters in order to access the scores in the Ml/ML tag.
from methplotlib.
I forgot to mention before, but the documentation for the new proposed modification tags lives here https://github.com/jkbonfield/hts-specs/blob/7fafbdf65291da9377d9e17736c4e4bb06fea9a6/SAMtags.tex#L477
And here is a PDF version : SAMtags.pdf
from methplotlib.
Question for @marcus1487 or @a-slide: do you have a suggestion of minimal cut-off of Ml for megalodon? I see the output includes a probability of C+m for all Cs in the sequence, but most have just a very low probability, I guess, and it doesn't make much sense to show them all does it?
from methplotlib.
I have just pushed a bunch of code that should enable megalodon cram/bam format in methplotlib v0.19.0, and I am looking forward to your feedback.
I would very much appreciate it if @marcus1487 or @a-slide would have the time to take a look at my code for handling the Mm/Ml tags, the relevant bits are in this function. I have commented quite heavily, as it is (at least to me) not the most trivial task to map the number-of-bases-to-skip to genomic coordinates.
from methplotlib.
Hi Wouter,
I had a look at your code and it seems to be parsing the delta encoding mods correctly. It is indeed not trivial, but this encoding was chosen for compactness reasons (I personally think a simple offset relative to the start would have been better).
I still need to test it and compare it to the recently implemented equivalent functionality in IGV. If you want to have a look at real data, I put together a small human chr20 5mC toy dataset there https://nanoporetech.box.com/s/82pnw3lhusfs93s0vj7azxiz4x7kxabo
from methplotlib.
For the cut-off, @marcus1487 will know better, but I suspect it might be different depending on the model used, so maybe leave it to the users to decide where to put the bar. Or just use a transparent to solid color scale
from methplotlib.
Aha thanks, I'll have a look at the testdata.
Yes, the cutoff is an option the user can specify, but I want to set a sensible default. A problem is that plotting a dot for every C and every A gets quite dense and a heavy html. The current implementation omits points below a cutoff and uses a colorscale for the rest
from methplotlib.
I agree that a cutoff would be very useful. In megalodon the cutoff is 1%, but this is a command line parameter with the default to output as much of the data as is reasonable. This is likely too low a cutoff for visualization. In megalodon a read is only included in the per-site aggregated "modified" base count when the per-read probability of a modification is greater than 0.8. So to give a bit of wiggle room I think a default of 0.6-0.7 would be a reasonable default.
Another relevant issue here is calibration of these scores. Megalodon calibrates the modified base scores to match emperical data with a ground truth. Guppy produced calls on the other hand are not calibrated. So the shading/color scale may need to be adjusted for different callers.
from methplotlib.
For the code, a couple minor comments:
- on line 273
offset = len(deltas)
should beoffset += len(deltas)
(won't cause bug now as we're not calling three mods at once... yet) - Not sure on this one, but should
basemod
bemod
on line 280? Or are you using the canonical base later in processing?
Otherwise this looks great!
P.S. I agree that get_forward_sequence
result is a bit odd. And +1 for using all the pysam commands!
from methplotlib.
Megalodon input works great for me, using the mod_mappings.bam. However I don't like the color. It's just a light beige to dark red palette. Is there a way to specify it to go from blue (unmethylated) to red (methylated) as I see when using nanopolish output?
from methplotlib.
Hi @dithiii thanks for the feedback! I'm not sure about your request.
The difference with nanopolish is that it symmetrically has a log-likelihood ratio to indicate if a position is more likely modified or more likely unmodified. Megalodon in contrast has 'just' the probability that a position is modified.
This problem also relates to setting a cut-off for the likelihood score that was discussed above, and I could (instead of hiding those) mark the positions below the cut-off with a single shade of blue.
@marcus1487 Thanks for the correction on line 273 - that would have been a nasty bug in the future.
Line 280 just ends up as annotation in the graph and better spells out the modification the user is seeing in that panel, so basemod (C+m) makes a bit more sense than just mod (m).
You mention the cut-off in percent above - and what is in the cram/bam is then Phred scaled?
from methplotlib.
Ah, it's worth noting that the Ml
tag is not required according to the spec -- it's possible to only have Mm
. That shouldn't happen from either megalodon or guppy output as far as I know, but it is a possibility.
from methplotlib.
Aha, great, I'll make some changes to avoid crashing on that.
from methplotlib.
Hello,
Maybe the colours are light beige to dark red, and not "down" to blue, because the megalodon probabilities are shown on a natural log, whether the log-likehood from nanopolish are log 10, therefore the spread in range probabilities (and so colours) is wider?
Thank you all for the work,
I am going to try it out!
Best,
Paula
from methplotlib.
I'm a little confused on how to show different modifications. In my case, I am using a mod_mappings.bam that I sorted and loaded into methplotlib. It has calls for both 5mC and 5hmC. But the colors on the output are limited to simply "likely modified". Ideas on how to deal with this? I'd love different colors ranges for mods.
In the meantime though, I would just like to know how to filter a bam file for a specific mod so I can make separate plots for each of the two mods. Help?
from methplotlib.
The separation of the 5mC and 5hmC calls is a bit complicated. For example given probabilities of 5mC=0.2 and 5hmC=0.3 and C=0.5 in a read at a particular site, I see three "logical" ways to output this call into a file with only C and 5mC. a) 5mC=0.5 C=0.5 b) 5mC=0.2 C=0.8 and c) 5mC=0.35 C=0.65. The problem is that given probabilities of all three the conversion to the other 2 is ambiguous and any of the above 3 options could be argued for with valid points.
For this reason I think if only one modified base is required that should be modeled explicitly or the modified bases should be considered together. To that point the 5mC model is likely to represent option a
above since 5hmC is collapsed in alternative bisulfite data used to train that model (see megalodon training data prep here).
One additional point not really related to methplotlib, but which may be presenting as an issue here is that the 5mC+5hmC nanopore model currently generally produces weak 5hmC calls. This is due to issues with our current training data where 5hmC training data is likely not completely 5hmC where we expect it. We are working to improve this in a later version of this model. In this context though this would present as the 5hmC mods showing as very light in almost all cases, so it may be that the 5hmC calls are simply hard to see given the lower probabilities.
from methplotlib.
Thanks Marcus, I see the issue. I called bases using the rerio 5mC+5hmC model and honestly I found it to work very well. The genome is mitochondria and I saw unrealistically high 5mC when calling with the 5mC-allcontext model. It called way more 5mC than is really there. Nanopolish called much lower, but still had a respectable amount. Then I called again with the 5mC+5hmC model and found essentially zero 5mC and much more 5hmC. Those results actually make much more biological sense to me (since I believe that most mtDNA 5mC is not real anyway, and any 5hmC would be misread as 5mC using bisulfite methods).
So I give your model more credit than you do. I think it's doing a great job on calling 5hmC in my case! My remaining problem is how to visualize zero 5mC and low but consistent levels of 5hmC. That's why I wanted to split the bam file. I see that it's not easy. I'm thinking of alternatives.
from methplotlib.
@dithiii That is great to hear! Glad you are seeing good results! I guess I should turn up my optimism for that model in its current state.
If there is a valid use case and reasonable expected behavior for a splitting method, I would be open to adding a command for this to megalodon. I think option c
is the most reasonable. It would just produce very weird results where the other modified base was high probability. For example with C=0.002 5hmC=0.008 5mC=0.99, it seems a bit weird to report C=0.2 and 5hmC=0.8 probabilities, but maybe this is the most logical thing to do (it feels a bit hacky).
from methplotlib.
"For example with C=0.002 5hmC=0.008 5mC=0.99, it seems a bit weird to report C=0.2 and 5hmC=0.8 probabilities, but maybe this is the most logical thing to do (it feels a bit hacky)."
Right tricky question, I don't think that's a good way to report it, assuming someone asks "is it C or 5hmC" at that position, I don't think it's a logical thing to report "well, it's 0.8 likely to be 5hmC over the 0.2 probability of C" when the model really thinks it's 99% likely to be 5mC. It's not a reliable report. That being said, I don't know how to resolve that issue. Personally I would rather it just say, "it's 0.002 likely to be C and 0.008 likely to be 5hmC" and leave the user scratching their head and asking "I wonder what the other 99% call is" and hope they ask the right question.
from methplotlib.
Yes. That would be more ideal though still problematic, but this is unfortunately a limitation of the file format. It is assumed that the probabilities would sum to 1. So the probability not found in modified bases is assumed to be the unmodified form.
from methplotlib.
Hello,
I have tried to use methplotlib from megalodon BAMs.
-
for 5mC
-
for 5mC and 5hmC
I think the format is not right...
Please find below the command I used and the plot I am getting.
methplotlib -m hipox_sub_25.cram -n calls -w chr1:3746460-3771645 -f GRCh38.primary_assembly.genome.fa -g Homo_sapiens.GRCh38.90.chr.gtf
methplotlib -m 5hmc_hip_sub.sorted.cram -n calls -w chr1:3746460-3771645 -f GRCh38.primary_assembly.genome.fa -g Homo_sapiens.GRCh38.90.chr.gtf
Can you spot what is the problem?
You feedback would be very much appreciated,
Thank you in advance,
Paula
from methplotlib.
Ehm I don't directly have an idea, do you think it would be possible to share the dataset that causes this?
Thanks,
Wouter
from methplotlib.
Good morning Wouter,
Please find the dataset in the email I just sent you.
Thank you so much for having a look into this.
Very much appreciated,
Kind regards,
Paula
from methplotlib.
Related Issues (20)
- Color modification frequency HOT 1
- all context methylation visualization HOT 21
- Unable to install using pip HOT 7
- gtf fails HOT 9
- No calls in window HOT 7
- Bug? HOT 1
- What kind of Nanocompore outputs should be used? HOT 4
- different version different error HOT 19
- Can bam files from guppy version v.6.1.1 be used?
- Python import issue HOT 5
- cannot import name 'get_all_ties'
- bgzip, tabix and sorting HOT 5
- Pandas.error and shape mismatch HOT 17
- Window parameter cannot deal with complicated fasta headers HOT 4
- Remora .bam as a input HOT 7
- mehplotlib having trouble finding sklearn HOT 3
- AttributeError: module 'plotly' has no attribute 'subplots' HOT 1
- sklearn deprecated HOT 1
- problem with HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from methplotlib.