Giter VIP home page Giter VIP logo

Comments (14)

chrishendra93 avatar chrishendra93 commented on July 29, 2024

Thanks for posting the issue. It seems that the program does not automatically terminate after triggering the assertion error. I will update this on the next version. Nevertheless, the assertion triggered because there is a position in the eventalign.txt that contains more than one possible 5-mer. Do you know if it is possible with the reference you are using? Also can you show me the first few row of eventalign.txt?

Thank you!

Regards

Christopher Hendra

from m6anet.

YCCHEN23 avatar YCCHEN23 commented on July 29, 2024

Thanks for prompt reply

Here are the reference & software & commands I performed before using m6anet-dataprep:

First, I extracted the transcript fasta from genome.fa with gffread.

Reference Genome: http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-51/plants/fasta/arabidopsis_thaliana/dna/ (version: 05-Mar-2021 11:02)
Reference GTF file: http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-51/plants/gtf/arabidopsis_thaliana/ (version: 08-Mar-2021 22:31)

### (I added "chr" to each chromosome name in both genome.fa & gtf. ( ex: 1 -> chr1 ))

awk '{ if($1 !~ /^#|[A-Z]/){print "chr"$0} else{print $0} }' Arabidopsis_thaliana.TAIR10.51.gtf > renamed_chromosome.gtf

# gffread version 0.12.1
gffread -w /media/ycc/2AFC41ECFC41B2BD/Genome/Arabidopsis/cDNA/transcripts.fa \
        -g /media/ycc/2AFC41ECFC41B2BD/Genome/Arabidopsis/gDNA/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa \
        /media/ycc/2AFC41ECFC41B2BD/Genome/Arabidopsis/gff3/renamed_chromosome.gtf

Then aligned the merged fastq reads (cat all the passed reads from Guppy basecalling) to the transcript fasta and performed nanopolish.

#minimap2 version 2.17-r941
#samtools version 1.9
#nanopolish version 0.13.3

minimap2 -ax splice -uf -k14 -t 8 /media/ycc/2AFC41ECFC41B2BD/Genome/Arabidopsis/cDNA/transcripts.fa \
         ~/Desktop/WT_R2/WT2_merged.fastq | samtools view -bh -F 2324 | samtools sort -O bam > ~/Desktop/WT_R2/WT_R2_filtered.bam

nanopolish index -s /media/ycc/Transcend/Col_Dark_R2/sequencing_summary.txt \
                 -d /media/ycc/Transcend/Col_Dark_R2/workspace/ \
                 ~/Desktop/WT_R2/WT2_merged.fastq

nanopolish eventalign --reads ~/Desktop/WT_R2/WT2_merged.fastq \
                      --bam ~/Desktop/WT_R2/WT_R2_filtered.bam \
                      --genome /media/ycc/2AFC41ECFC41B2BD/Genome/Arabidopsis/cDNA/transcripts.fa \
                      --scale-events \
                      -t 4 \
                      --progress \
                      --signal-index > ~/Desktop/WT_R2/WT_Dark_R2_eventalign.txt

Here's a part of my eventalign results.

contig	position	reference_kmer	read_index	strand	event_index	event_level_mean	event_stdv	event_length	model_kmer	model_mean	model_stdv	standardized_level	start_idx	end_idx
AT1G01010.1	9	GATAT	1	t	3	70.52	5.003	0.00232	GATAT	88.95	6.28	-2.61	65275	65282
AT1G01010.1	10	ATATA	1	t	4	76.90	1.505	0.00299	ATATA	86.67	2.63	-3.31	65266	65275
AT1G01010.1	11	TATAC	1	t	5	97.65	2.540	0.00232	TATAC	95.09	6.28	0.36	65259	65266
AT1G01010.1	11	TATAC	1	t	6	92.51	2.549	0.00266	TATAC	95.09	6.28	-0.37	65251	65259
AT1G01010.1	12	ATACC	1	t	7	73.41	1.788	0.00930	ATACC	74.69	2.10	-0.54	65223	65251
AT1G01010.1	12	ATACC	1	t	8	71.55	1.331	0.00531	ATACC	74.69	2.10	-1.33	65207	65223
AT1G01010.1	13	TACCA	1	t	9	77.65	2.034	0.00266	TACCA	77.84	2.81	-0.06	65199	65207
AT1G01010.1	14	ACCAA	1	t	10	75.49	0.738	0.00564	ACCAA	74.58	2.11	0.38	65182	65199
AT1G01010.1	14	ACCAA	1	t	11	72.94	1.398	0.00697	ACCAA	74.58	2.11	-0.69	65161	65182
AT1G01010.1	15	CCAAA	1	t	12	84.03	2.852	0.00963	CCAAA	87.19	3.02	-0.93	65132	65161
AT1G01010.1	15	CCAAA	1	t	13	90.83	3.311	0.00299	CCAAA	87.19	3.02	1.08	65123	65132
AT1G01010.1	16	CAAAC	1	t	14	104.27	2.112	0.00199	CAAAC	104.22	2.68	0.02	65117	65123
AT1G01010.1	16	CAAAC	1	t	15	107.91	0.911	0.00299	CAAAC	104.22	2.68	1.23	65108	65117
AT1G01010.1	16	CAAAC	1	t	16	103.47	3.530	0.00365	CAAAC	104.22	2.68	-0.25	65097	65108
AT1G01010.1	17	AAACC	1	t	17	94.76	5.007	0.00266	AAACC	100.00	3.45	-1.35	65089	65097
AT1G01010.1	18	AACCA	1	t	18	82.91	2.254	0.00730	AACCA	82.92	2.81	-0.00	65067	65089
AT1G01010.1	19	ACCAG	1	t	19	75.00	1.675	0.00365	ACCAG	72.33	2.11	1.13	65056	65067
AT1G01010.1	19	ACCAG	1	t	20	73.95	1.884	0.00398	ACCAG	72.33	2.11	0.69	65044	65056
AT1G01010.1	19	ACCAG	1	t	21	73.16	2.420	0.00398	ACCAG	72.33	2.11	0.35	65032	65044
AT1G01010.1	20	CCAGA	1	t	22	94.73	10.344	0.00232	CCAGA	76.22	5.09	3.24	65025	65032
AT1G01010.1	20	CCAGA	1	t	23	71.16	1.432	0.00232	CCAGA	76.22	5.09	-0.88	65018	65025
AT1G01010.1	21	CAGAG	1	t	24	115.49	5.990	0.00232	CAGAG	106.73	5.87	1.33	65011	65018
AT1G01010.1	22	AGAGA	1	t	25	131.68	5.844	0.01926	AGAGA	127.71	5.69	0.62	64953	65011
AT1G01010.1	23	GAGAA	1	t	26	116.67	7.068	0.01394	GAGAA	125.76	5.87	-1.38	64911	64953
AT1G01010.1	24	AGAAA	1	t	27	129.21	4.249	0.00498	AGAAA	128.13	5.56	0.17	64896	64911

from m6anet.

chrishendra93 avatar chrishendra93 commented on July 29, 2024

hi @YCCHEN23, I have pushed a quick fix for this that will skip the problematic positions and record the positions along with the non-unique 5-mers in a file called data.error. I have yet to push this to the main branch since we have not tested it. Can I please trouble you with testing this with your data?

To do so, you can clone the repository and run python setup.py install on the branch release_1.01. From there you can run m6anet-dataprep again and it should finish preprocessing all the data. Then if you have managed to do so, please share the content of data.error here so we can understand the problem better. Let me know if you encounter any difficulties!

Regards

Christopher Hendra

from m6anet.

YCCHEN23 avatar YCCHEN23 commented on July 29, 2024

Hi, thanks for the efficient update, I run m6Anet-dataprep again with the release_1.01 version, and it took about 7 hours to finish the process.

Everything goes well and I successfully obtain the data.result.csv from m6anet-run_inference.

After I checked the data.error & data.log, it seemed that the problem of "non-unique 5-mers" was caused by "I mapped the reads to cDNA (transcript) reference". However, m6anet is preparing the data at gene level, so "non-unique k-mers caused by different isoform" is not expected, is that correct?

If so, can I use m6anet to analysis the alignment results from cDNA reference?

I can send you apart of my eventalign result, cDNA reference, and data.error if needed.

Best regards

YCCHEN

from m6anet.

chrishendra93 avatar chrishendra93 commented on July 29, 2024

hi @YCCHEN23, great to know that you have managed to run m6anet-dataprep on your samples. I hope that the inference results are useful for you

In any case, m6Anet prepares data on the transcript level, typically we require users to aggregate predictions made on the transcript level by themselves in order to get gene-level prediction. However during run-time, m6Anet-dataprep will run a check if a 5-mer is trully unique on the transcript level, so if you are aligning this to transcript, it confuses me as to why there can be non-unique 5-mers in that position.

You can perhaps show me the data.error results here? This way I can also check on my part if there are some bugs that we have to handle

Thanks!

Regards

Christopher Hendra

from m6anet.

YCCHEN23 avatar YCCHEN23 commented on July 29, 2024

Thank you for your clear explanation, but I'm just confused that I can't distinguish m6A sites information in transcript level from my result.

Take AT1G01040 as an example, I don't know which is the right transcript (AT1G01040.1 or AT1G01040.2?), because only gene id was recorded in the file. There is no isoform information showed in transcript_id (gene.1, gene.2, gene.3, etc)

Is the result recorded properly?

The following data are the first few lines of my result:

transcript_id,transcript_position,n_reads,probability_modified
AT1G01010,978,20,0.15509425
AT1G01010,1062,20,0.2124232
AT1G01010,1084,20,0.094096266
AT1G01010,1124,20,0.33268613
AT1G01010,1162,20,0.26048714
AT1G01010,1273,21,0.045965612
AT1G01010,1345,21,0.56280404
AT1G01010,1357,20,0.4143505
AT1G01040,4200,22,0.60451066
AT1G01040,4251,23,0.19029099
AT1G01040,4273,23,0.71105254
AT1G01040,4292,26,0.6048376
AT1G01040,4307,24,0.3069207
AT1G01040,4333,26,0.38802323
AT1G01040,4487,28,0.50886166
AT1G01040,4755,28,0.3398115
AT1G01040,4798,29,0.027187109
AT1G01040,4815,30,0.07250579
AT1G01040,4855,31,0.29441229

And sure, here's my result of data.error, hoping this comes in helpful for you.

There are 8835 sites recorded in the file, so I upload the zipped file.

data.error.zip

Best regards

YCCHEN

from m6anet.

chrishendra93 avatar chrishendra93 commented on July 29, 2024

hi @YCCHEN23

I understand what's going on.

Apology, I think this is because there's a code left over from my previous experiments that will automatically remove the .1, .2, .3 from the transcripts and so they are grouped together in such a way.

I have push the fix to m6anet_release_1.01 that should fix this issue. Do you mind running this again to check? By the way if you have run this before and you still keep eventalign.index, you can just pass the argument --skip_index so that the program will skip indexing the entire eventalign file again (do check the first few entries of eventalign.index first, the transcript in there should be in the form of AT1G01040.1 or AT1G01040.2 etc)

from m6anet.

YCCHEN23 avatar YCCHEN23 commented on July 29, 2024

Never mind, I have performed the latest version and it still running. But now, the m6A information seemed to be recorded properly at transcript-level in all the outputs and there's no data.error occurring.

I'll close the comment within few days if there has no further errors.

Thanks for all the kindly helping.

Best regards

YCCHEN

from m6anet.

chrishendra93 avatar chrishendra93 commented on July 29, 2024

Thanks for trying out! It's great to know that you have managed to run it without any problem. Do let me know if you encounter any other problems

Regards

Christopher Hendra

from m6anet.

Stakaitis avatar Stakaitis commented on July 29, 2024

head_eventalign.txt
head_eventalign.index.txt
head_data.readcount.txt
head_data.log.txt
head_data.json.txt
head_data.index.txt
Hi @chrishendra93,

I'm getting the same AssertionError while working with reads aligned to reference genome (I know that I should use transcriptome instead, but I'm curious how different the results will be if ref genome is used):

  Process Consumer-19:
  Traceback (most recent call last):
    File "/home/stakatis/miniconda3/envs/m6anet/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
      self. Run()
    File "/home/stakatis/miniconda3/envs/m6anet/lib/python3.9/site-packages/m6anet/scripts/helper.py", line 113, in run
      result = self.task_function(*next_task_args,self.locks)
   File "/home/stakatis/miniconda3/envs/m6anet/lib/python3.9/site-packages/m6anet/scripts/dataprep.py", line 310, in preprocess_tx
     assert(len(kmer) == 1)
 AssertionError

Oddly, only 1 out of 4 samples rises this error, others finish the inference step successfully.

I'm attaching the first lines of some input/output files of the erroneous sample

Versions:

  1. m6Anet - I believe I'm using the latest version of m6Anet, since I'm able to parse the --infer_mod_rate argument, though, I'm not sure how to check the precise version I'm using.
  2. nanopolish version 0.13.2
  3. samtools 1.15.1
  4. htslib 1.15.1
  5. minimap2 2.17-r941
  6. Python 3.9.0
  7. pandas 1.4.2

Commands:
Ref_file=GRCh38.primary_assembly.genome.fa

minimap2 -ax splice -k14 ${Ref_index} ${Reads_fq} | samtools sort -K 14 -o ${BAM}

samtools view --bam --with-header --output ${bam_filtered} --output-unselected ${bam_unselected} --min-MQ 1 --excl-flags UNMAP,SECONDARY,SUPPLEMENTARY ${BAM}

samtools index ${bam_filtered}

nanopolish index --directory ${Fast5_dir} -s ${Summary_file} ${Reads_fq}

nanopolish eventalign --reads ${Reads_fq} --bam ${bam_filtered} --genome ${Ref_file} --scale-events --signal-index > ${EVENTALIGN}

m6anet-dataprep --eventalign ${EVENTALIGN} --out_dir ${DATAPREP_DIR} --n_processes ${CPUS} --readcount_min 2 --min_segment_count 2

m6anet-run_inference --input_dir ${DATAPREP_DIR} --out_dir ${INFERENCE_DIR} --infer_mod_rate --n_processes ${CPUS}

What kind of help do I seek?
What changes and to which script files should I make to avoid this error and instead output an additional data.error file?

from m6anet.

chrishendra93 avatar chrishendra93 commented on July 29, 2024

hi @Stakaitis, seems like due to the genome mapping, there are two possible 5-mer in that specific site and so the code throws an error (which I set deliberately to prevent having multiple possible 5-mers in here). I think this might only be observed in one out of four of your samples, you can add a print statement to the code to see the specific site that throws this error. Usually what I do with genomic coordinates is to map them after m6anet inference and average m6A probability per genomic coordinate.

from m6anet.

Stakaitis avatar Stakaitis commented on July 29, 2024

Thank you for a quick reply.

  1. Where exactly should I add a print statement in the dataprep.py file?
  2. How this print statement should be written?
  3. Is there an easy way how to exclude these overlapping 5-mers, and avoid this error allowing inference step to finish successfully while working with reference genome?

For now, I will use your advice for getting m6A location at the genomic coordinates

Thanks!

from m6anet.

chrishendra93 avatar chrishendra93 commented on July 29, 2024

hi @Stakaitis sorry for the delay in my reply, I was busy the entire last week.

I suggested printing the kmer just to see if the splicing site is indeed the case. I think one way to handle this will be to modify the dataset class in m6anet.utils.data_utils.py to output the k-mer separately but currently, we need to modify the dataprep.py too in order to accommodate this behavior. On the other hand, mapping this to the transcriptome produces probability on the read-level that gives more overall control over the output

from m6anet.

Stakaitis avatar Stakaitis commented on July 29, 2024

I agree, it makes more sense to analyse reads mapped to the transcriptome. Thanks!

from m6anet.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.