
Comments (10)

chrishendra93 commented on September 3, 2024

Hi @Vadim-Fed , thanks for raising the issue.

  1. Currently, the only way to compare the two samples is to check the probability modified value at each site of interest in both samples.
  2. The probability modified column should give the probability that a particular site is modified. If it is NA, I suspect that something went wrong in the preprocessing steps.
  3. The number of rows does not correspond to the number of modified sites; rather, every site with at least 20 reads is displayed. The modification status of each site should be reflected by the probability modified column.
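
A minimal sketch of the per-site comparison described in point 1, assuming each sample's result table is a CSV with `transcript_id`, `transcript_position`, and `probability_modified` columns (column names are an assumption here and should be checked against your own output file). The values below are made up for illustration and are not real m6anet output.

```python
import csv
import io


def load_site_probabilities(handle):
    """Map (transcript_id, position) -> probability_modified from a CSV handle."""
    probs = {}
    for row in csv.DictReader(handle):
        value = row["probability_modified"].strip()
        if value.lower() in ("", "na", "nan"):
            continue  # skip sites without a usable probability
        key = (row["transcript_id"], int(row["transcript_position"]))
        probs[key] = float(value)
    return probs


def compare_samples(probs_a, probs_b):
    """Return (site, p_a, p_b) for every site present in both samples."""
    shared_sites = sorted(probs_a.keys() & probs_b.keys())
    return [(site, probs_a[site], probs_b[site]) for site in shared_sites]


# Hypothetical in-memory tables standing in for two result files.
HEADER = "transcript_id,transcript_position,n_reads,probability_modified\n"
sample_a = io.StringIO(HEADER + "chr01,7589,25,0.91\nchr01,7646,30,0.10\nchr01,7669,21,NA\n")
sample_b = io.StringIO(HEADER + "chr01,7589,22,0.20\nchr01,7669,40,0.05\n")

shared = compare_samples(load_site_probabilities(sample_a),
                         load_site_probabilities(sample_b))
for (transcript, position), p_a, p_b in shared:
    print(transcript, position, p_a, p_b)
```

Only sites with a usable probability in both samples are compared, so NA rows drop out rather than being treated as zero.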

To fix your problem, can I trouble you with printing the first 5 rows of the eventalign.txt file and perhaps the first five entries from data.json?

thank you!

Regards

Christopher Hendra

from m6anet.

Vadim-Fed commented on September 3, 2024

Sure, no trouble at all.
WT2 eventalign.txt:
contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level
chr01 1220 ATTGT 1 t 1535 110.50 3.360 0.01029 NNNNN 0.00 0.00 inf
chr01 1253 CACTT 1 t 1534 112.71 1.968 0.00398 AAGTG 117.29 4.03 -1.03
chr01 1254 ACTTT 1 t 1533 110.15 1.686 0.00398 AAAGT 108.73 4.06 0.32
chr01 1254 ACTTT 1 t 1532 108.27 1.348 0.00232 AAAGT 108.73 4.06 -0.10
chr01 1254 ACTTT 1 t 1531 109.90 1.617 0.00531 AAAGT 108.73 4.06 0.26
chr01 1254 ACTTT 1 t 1530 111.42 1.583 0.00863 AAAGT 108.73 4.06 0.60
chr01 1254 ACTTT 1 t 1529 105.97 1.604 0.00432 AAAGT 108.73 4.06 -0.62
chr01 1254 ACTTT 1 t 1528 112.65 1.532 0.00432 AAAGT 108.73 4.06 0.87
chr01 1254 ACTTT 1 t 1527 110.31 1.712 0.00398 AAAGT 108.73 4.06 0.35

data.json:
{"chr01":{"7589":{"TTAACAA":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"chr01":{"7646":{"TGAACTA":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"chr01":{"7669":{"GAAACAT":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"chr01":{"7763":{"AGGACTT":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"chr01":{"7781":{"CTAACTG":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"chr01":{"7919":{"TAAACTT":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"chr01":{"8168":{"GAGACAA":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"chr01":{"8180":{"AGAACAT":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"chr01":{"8197":{"TAAACCC":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
{"chr01":{"8235":{"CAAACCA":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}
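
A quick way to see how widespread the NaN entries are is to scan the file line by line. This is only a sketch: it assumes the layout shown above (one JSON object per line, nested as contig, position, k-mer, then a list of per-read feature lists), and the second sample line uses made-up numbers. The function name is hypothetical.

```python
import json
import math


def count_nan_sites(lines):
    """Return (total_sites, sites_whose_features_are_all_NaN)."""
    total = 0
    all_nan = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Python's json module accepts the bare NaN token by default.
        record = json.loads(line)
        for positions in record.values():
            for kmers in positions.values():
                for reads in kmers.values():
                    total += 1
                    values = [v for read in reads for v in read]
                    if values and all(
                        isinstance(v, float) and math.isnan(v) for v in values
                    ):
                        all_nan += 1
    return total, all_nan


sample_lines = [
    '{"chr01":{"7589":{"TTAACAA":[[NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]]}}}',
    # Made-up numeric features, for contrast only.
    '{"chr01":{"7646":{"TGAACTA":[[0.01,3.3,110.5,0.004,1.9,112.7,0.004,1.7,110.1]]}}}',
]
print(count_nan_sites(sample_lines))
```

For a real file, pass an open file handle instead of `sample_lines`, since iterating over a handle yields one line at a time.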

Thanks,

Vadim


chrishendra93 commented on September 3, 2024

Hi @Vadim-Fed, I think I know the problem. To extract the features, m6anet needs the segmentation information from nanopolish eventalign, which is missing from your eventalign.txt. To get this information, you have to run nanopolish eventalign with the --signal-index option. In short, please run nanopolish again with the following command:

nanopolish eventalign --reads reads.fastq --bam alignment.sorted.bam --genome genome.fa --scale-events --signal-index --summary summary.txt --threads n_threads > eventalign.txt
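
Before rerunning dataprep, it may be worth confirming that the new eventalign.txt actually carries the signal index. The sketch below assumes that --signal-index adds `start_idx` and `end_idx` columns to the header; treat those column names as an assumption and compare them against your own file. The function name is hypothetical.

```python
def has_signal_index(header_line):
    """Check whether an eventalign header contains the signal index columns."""
    # Column names contain no spaces, so splitting on any whitespace
    # handles both tab-separated files and pasted, space-separated headers.
    columns = set(header_line.split())
    return {"start_idx", "end_idx"} <= columns


old_header = (
    "contig position reference_kmer read_index strand event_index "
    "event_level_mean event_stdv event_length model_kmer model_mean "
    "model_stdv standardized_level"
)
new_header = old_header + " start_idx end_idx"

print(has_signal_index(old_header))
print(has_signal_index(new_header))
```

For a real file, the header is the first line: `has_signal_index(open("eventalign.txt").readline())`.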


Vadim-Fed commented on September 3, 2024

OK, I'll try and update...

Thanks a lot!


Vadim-Fed commented on September 3, 2024

Now it works and we do get the probability column. Thanks!

Just a couple more questions regarding the 3rd query if I may:
You responded:
"The number of rows do not correspond to the number of modified sites, but rather each site that has minimum read number of 20 will be displayed. The modification status of each site should be reflected by the probability modified column"

I checked, and all predicted modified sites (all rows) are centered on the m6A consensus sequence DRACH, which is good. In your experience, which probability threshold would you use to decide whether a site is indeed modified?

Also, which arguments can I adjust in m6anet to get more predictions? I already noticed that running dataprep with the default n_neighbors is more permissive than with n_neighbors set to 10, as I got about half of the total predictions with n_neighbors 10.
Which additional arguments would you suggest experimenting with, and what are their value ranges?

Thanks,

Vadim


chrishendra93 commented on September 3, 2024

Hi Vadim, sorry for the late reply

  • By default, m6Anet only predicts modifications at DRACH consensus sequences.
  • If you really want to be stringent about false positives, a threshold of 0.8 is good, and 0.9 is even better in my experience.
  • One way to increase the number of predictions is perhaps to reduce the number of reads required for each site. I have not implemented this yet, since the model itself is not trained to handle this scenario. I will try to explore it in future releases.
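
To see how sensitive the number of called sites is to the cutoff, one can count sites at several thresholds. This is a small sketch with made-up probabilities; the function name and the extra 0.5 threshold are illustrative choices, not m6anet defaults.

```python
def count_sites_above(probabilities, thresholds=(0.5, 0.8, 0.9)):
    """Count how many sites have probability >= each threshold."""
    return {t: sum(1 for p in probabilities if p >= t) for t in thresholds}


# Made-up probability modified values for five sites.
example_probabilities = [0.95, 0.85, 0.6, 0.3, 0.92]
counts = count_sites_above(example_probabilities)
print(counts)
```

Raising the threshold trades sensitivity for fewer false positives, so it helps to look at the counts at 0.8 and 0.9 side by side before settling on one.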

thanks!


Vadim-Fed commented on September 3, 2024

Thanks @chrishendra93 for the reply.

We are using 4 Nanopore-sequenced datasets from 2 WT and 2 ime4-knockout yeast strains (the latter should be devoid of any m6A). When we perform the analysis, we get a very similar number of predicted m6A locations in the WT (869 on average) and the KO (877 on average), with few unique WT m6A locations.

It seems that thresholding on the modification probability is not an option (the numbers in brackets correspond to the number of predicted locations shared within each group and filtered to overlap with known m6A sites):

[Figure: m6A prediction]

Apart from reducing the number of reads required for each site (is it the --readcount_min parameter? It states that the default is 1), are there any additional arguments we can change to obtain more reasonable results, or maybe even perform some training for these tool…


chrishendra93 commented on September 3, 2024

Hi Vadim, thanks for trying this out on this particular dataset. As I mentioned before, the model can currently only predict sites with a minimum of 20 reads. The --readcount_min parameter of dataprep is the minimum read count for each gene, not for each segment that m6anet will predict. That said, there can be a variety of reasons for this result, but first let me clarify a couple of things with you:

  • Have you checked whether the known sites really match the ones in the m6anet prediction files? The coordinates might be misaligned. Can you check whether the known sites have DRACH motifs, so that we are sure they are aligned?
  • Can you check whether, among all the shared sites, m6anet generally gives a higher predicted probability in the wild-type samples than in the KO samples? In our own experiments, we find that m6anet tends to find sites with elevated signal intensity in wild type compared to knockout, and these sites are sometimes missed by the other experimental protocols used to label these "known" m6A sites.
  • Have you tried running other tools on this dataset, e.g. EpiNano?
  • Furthermore, is it possible for you to send me the data.json and data.readcount files so that I can take a look at them?
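
The DRACH check from the first point can be sketched with a regular expression, using the IUPAC codes D = A/G/T, R = A/G, H = A/C/T. The sketch assumes that the 7-mers in data.json are the DRACH 5-mer flanked by one base on each side, which matches the entries posted above but should be verified on your own data. The function names are hypothetical.

```python
import re

# D = A/G/T, R = A/G, A, C, H = A/C/T
DRACH = re.compile(r"[AGT][AG]AC[ACT]")


def center_5mer(kmer):
    """Return the central 5 bases of an odd-length k-mer."""
    mid = len(kmer) // 2
    return kmer[mid - 2 : mid + 3]


def is_drach(five_mer):
    """True if the 5-mer matches DRACH (RNA 'U' is treated as 'T')."""
    return DRACH.fullmatch(five_mer.upper().replace("U", "T")) is not None


# 7-mers taken from the data.json entries posted earlier in the thread.
for seven_mer in ["TTAACAA", "TGAACTA", "AGGACTT"]:
    print(seven_mer, center_5mer(seven_mer), is_drach(center_5mer(seven_mer)))
```

If the sequences extracted at the known-site coordinates mostly fail this check, a coordinate offset (for example 0-based versus 1-based positions) is a likely suspect.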

Meanwhile, you can also try the datasets used in this paper: https://www.biorxiv.org/content/10.1101/2020.06.18.160010v1.full.pdf. The datasets provided there have more reads and will give a better overview of m6anet's general performance.


Vadim-Fed commented on September 3, 2024

Hi @chrishendra93, thanks for the response.
  • The known sites are defined as windows, with a median of 337 bp across all windows. I counted a predicted m6A coordinate as valid if it fell within a window, and I was quite lenient: I also accepted predicted coordinates that overlap the edge of a window.
  • OK, so I did the same plot without intersecting with the reference, just looking at the predicted sites shared between the 2 datasets of each group:
    [Figure: Erase]
  • We are currently also trying to run EpiNano. I will keep you posted.
  • You can find the files at this link:
https://drive.google.com/drive/u/0/folders/1ZVkd_TmdAGccmvB1DRB7V3fAQIGLqoAJ

Thanks for all the help,

Vadim


chrishendra93 commented on September 3, 2024

Hi @Vadim-Fed, I have requested access to your Google Drive to take a look at the data. Regardless, can I clarify with you again how you classified the known sites? Is it possible that you have multiple predicted sites in a single window? It might be that there are very few modified positions within a window, but by aggregating all the sites in your boxplots you end up with many sites that have a low modification probability. Can you check what the boxplots look like if you only take the maximum probability per window?
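
The per-window maximum suggested above could be computed along these lines. This is a sketch under stated assumptions: windows are `(contig, start, end)` tuples with inclusive bounds, predicted sites are `(contig, position, probability)` tuples, and all names and values are made up for illustration.

```python
def max_probability_per_window(windows, sites):
    """Return {window: highest site probability inside that window}.

    Windows without any predicted site are left out of the result.
    """
    best = {}
    for contig, start, end in windows:
        hits = [
            prob
            for site_contig, position, prob in sites
            if site_contig == contig and start <= position <= end
        ]
        if hits:
            best[(contig, start, end)] = max(hits)
    return best


# Hypothetical known-site windows and predicted sites.
windows = [("chr01", 7500, 7800), ("chr01", 8100, 8300), ("chr02", 100, 400)]
sites = [("chr01", 7589, 0.12), ("chr01", 7646, 0.93), ("chr01", 8168, 0.40)]

per_window = max_probability_per_window(windows, sites)
for window, prob in per_window.items():
    print(window, prob)
```

Plotting these per-window maxima instead of every site keeps a single strongly modified position from being diluted by unmodified DRACH sites in the same window. The nested loop is fine for a few thousand windows; an interval index would be a better fit for larger inputs.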

