Hi there, I am trying to run ribotish on regular riboseq data from o

This is one of them, they look like this: <div class="snippet-clipboard-content no

<a href="https://github.com/zhpn1024/ribotish/files/10098401/SRR1333393.sorted_indexed

No reads found!,about zhpn1024/ribotish

Comments (52)

zhpn1024 commented on July 19, 2024

Hi! The qc step uses known coding genes. You need to input coding annotations. The noncoding analysis is in the predict step.

from ribotish.

noepozzan commented on July 19, 2024

I see, didn't realize that.
Thanks a lot for the help!
Btw, I opened another issue since I have been using ribotish quite extensively.

from ribotish.

noepozzan commented on July 19, 2024

Getting back to you here because even though ribotish predict works with the following parameters: "--seq --aaseq --tpth 1 --fpth 1 --fspth 1 -v --longest", I do not get a single prediction out, when predicting on the RNAcentral gtf.. The pred(_all).txt files only have a header and nothing else. Do you have any idea why?

from ribotish.

noepozzan commented on July 19, 2024

I am asking this because I am specifically searching for these new ORFs that I see on IGV. There are clearly some reads lying around these ORFs and they even look quite periodic.
Why is ribotish not identifying any of them, even with these permissive parameters?
Thank you!

from ribotish.

zhpn1024 commented on July 19, 2024

What about the quality result? What is the content in the para.py file?

from ribotish.

noepozzan commented on July 19, 2024

This is one of them, they look like this:

offdict = {25: 7, 26: 7, 27: 10, 28: 7, 29: 7, 31: 7, 'm0': {25: 7, 26: 10, 28: 13, 31: 8, 33: 16}}

Looks kind of strange, or what do you think?

from ribotish.

zhpn1024 commented on July 19, 2024

It's strange. What about the pdf figure?

from ribotish.

noepozzan commented on July 19, 2024

SRR1333393.sorted_indexed_qual.pdf

from ribotish.

zhpn1024 commented on July 19, 2024

The reads length s are too short. The lengths should be around 29nt. How did you trim the reads?

from ribotish.

noepozzan commented on July 19, 2024

I got rid of the first 5 bases and then clipped at the adapter. Would it help you if I pasted some reads here?
Thanks btw, for really trying to help me!

from ribotish.

zhpn1024 commented on July 19, 2024

That's the reason. 5+7=12 is the most common offset value. Do not cut the first 5 bases.

from ribotish.

noepozzan commented on July 19, 2024

I see. I'll give it a try. Thanks again!

from ribotish.

noepozzan commented on July 19, 2024

SRR1333393.sorted_indexed_qual.pdf

What do you think of this? Definitely looks better, no?
Still doesn't find those regions that I see some reads and periodicity in..

from ribotish.

zhpn1024 commented on July 19, 2024

It's much better now. Are there any prediction results?

from ribotish.

noepozzan commented on July 19, 2024

Yes, there are results.
I should specify: The quality figure I posted above comes actually from a different experiment from CHX-treated riboseq data that I took from SRA trying to reproduce some paper's results. In that paper, they found 10 short ORFs. But using ribotish I only find 4 of them. What's also strange is that I found the same 4 ORFs even before you told me to correct the offset. So this doesn't seem to matter too much.
The problem is that I have no idea why ribotish doesn't see them. These regions of interest aren't even in the predict outputs. I expected them to be in there, just maybe not with a passing p-value.
Do you have any other idea why this may be?

from ribotish.

zhpn1024 commented on July 19, 2024

Run predict with --transprofile option can output signal data on all transcripts. Check if there is any signal in your transcript. At 2022-11-28 23:35:41, "Noè Pozzan" ***@***.***> wrote: Yes, there are results. I should specify: The quality figure I posted above comes actually from a different experiment from CHX-treated riboseq data that I took from SRA trying to reproduce some paper's results. In that paper, they found 10 short ORFs. But using ribotish I only find 4 of them. What's also strange is that I found the same 4 ORFs even before you told me to correct the offset. So this doesn't seem to matter too much. The problem is that I have no idea why ribotish doesn't see them. These regions of interest aren't even in the predict outputs. I expected them to be in there, just maybe not with a passing p-value. Do you have any other idea why this may be? — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you modified the open/close state.Message ID: ***@***.***>

from ribotish.

noepozzan commented on July 19, 2024

Hey @zhpn1024,

Sorry for the late reply..
I ran ribotish predict with the --transprofile option. Pretty handy!
I see signal in the transcripts im interested in, but not for the exact sequence I am interested in.

So, as an example this paper I am trying to replicate has the following:
They found ribosome profiling and mass spec evidence for the following peptide:
ATAAGLSAGLTR. They annotated this with id = ENST00000283195.
Ribotish finds many things with an id of ENST00000283195, but not this exact peptide I'd be interested in.
Below is a screenshot from IGV where I plot the full footprints and only their A-site for periodicity of the region of this peptide. You should be able to see it in the upper translation frame.

Do the reads not look good enough to be considered by ribotish?

Does it make sense what I am telling you?
And thanks again for your help!

from ribotish.

zhpn1024 commented on July 19, 2024

How do you generate the A site profile? From the A site result, the 394 data support your peptide, while 393 data supports translation in another frame (bottom).
My transprofile results are P site profile. What are the signals in this region?

from ribotish.

noepozzan commented on July 19, 2024

Good point. The A site profiles in the screenshot above were actually separately derived using a python script from transcriptome-aligned reads. The 393 is actually CHX-treated, while 394 is LTM-treated. Not sure how that should make a difference for periodicity, though...

Looking into the transprofile file, there is a lot in there for the entry with id = ENST00000283195. I only pasted about half of it due to the length.

ENSG00000153201 ENST00000283195 RANBP2  {}      {48:1, 63:2, 64:1, 65:1, 66:22, 69:12, 72:1, 79:3, 135:3, 137:2, 147:1, 154:2
, 156:5, 176:1, 285:5, 288:1, 294:3, 295:6, 345:2, 346:6, 348:1, 351:4, 354:2, 360:1, 462:1, 463:1, 468:2, 471:5, 480:1, 490:
2, 642:6, 645:3, 648:2, 651:11, 652:1, 653:3, 672:4, 673:3, 690:2, 717:5, 718:2, 720:2, 721:1, 726:1, 727:1, 729:19, 730:1, 7
32:4, 733:6, 734:1, 735:21, 736:3, 737:3, 738:6, 888:3, 891:7, 903:16, 904:1, 912:1, 913:1, 924:1, 927:1, 928:1, 930:1, 933:5
0, 934:9, 1029:4, 1033:1, 1035:1, 1042:2, 1047:4, 1048:1, 1050:1, 1221:2, 1225:3, 1248:1, 1251:1, 1272:1, 1344:1, 1347:2, 134
8:3, 1350:1, 1362:3, 1368:1, 1374:5, 1375:1, 1377:5, 1379:1, 1380:2, 1383:1, 1385:1, 1386:2, 1392:1, 1509:1, 1512:1, 1513:1, 
1527:1, 1671:2, 1752:4, 1755:8, 1758:25, 1767:1, 1915:2, 1917:1, 1918:1, 2068:1, 2100:1, 2118:1, 2195:2, 2196:3, 2202:2, 2208
:1, 2211:1, 2269:1, 2271:1, 2283:1, 2292:1, 2478:2, 2541:1, 2595:2, 2616:2, 2640:1, 2670:5, 2814:2, 2913:1, 2963:2, 2964:3, 2
979:3, 3026:2, 3039:2, 3077:1, 3081:3, 3090:2, 3102:1, 3111:1, 3135:1, 3165:2, 3247:1, 3282:2, 3312:1, 3354:1, 3355:1, 3360:3
, 3435:7, 3474:2, 3510:1, 3519:1, 3554:1, 3555:1, 3557:1, 3558:2, 3561:1, 3567:3, 3573:1, 3576:1, 3579:6, 3585:1, 3588:1, 359
1:2, 3594:4, 3597:2, 3600:3, 3612:3, 3615:1, 3618:1, 3633:4, 3657:2, 3660:10, 3663:10, 3684:2, 3687:1, 3699:1, 3711:3, 3738:2, 3753:4, 3754:1, 3755:1, 3759:2, 3765:6, 3774:3, 3825:1, 3830:1, 3834:1, 3837:5, 3840:2, 3864:3, 3871:2, 3909:1, 3918:6, 3945:1, 3981:1, 3984:4, 4042:3, 4090:1, 4098:2, 4116:3, 4119:1, 4131:1, 4138:1, 4171:1, 4176:1, 4227:1, 4228:1, 4230:1, 4251:7, 4260:2, 4320:1, 4327:1, 4335:1, 4338:3, 4341:1, 4356:1, 4359:1, 4362:2, 4368:1, 4374:2, 4421:2, 4427:2, 4437:3, 4440:1, 4449:6, 4509:17, 4542:1, 4545:5, 4550:1, 4551:1, 4553:1, 4561:1, 4575:1, 4593:2, 4605:1, 4629:2, 4632:1, 4635:3, 4645:1, 4647:10, 4678:3, 4701:8, 4704:1, 4725:5, 4726:2, 4737:6, 4738:2, 4744:5, 4752:4, 4755:2, 4767:2, 4785:1, 4794:6, 4797:1, 4803:6 ...

from ribotish.

zhpn1024 commented on July 19, 2024

Your peptide is starting at position 63. So the translation signal is right here. Why this region not reported may be because there's no start codon upstream. Did you use the --alt option?

from ribotish.

zhpn1024 commented on July 19, 2024

I think your script for the A site may be not correct.

from ribotish.

noepozzan commented on July 19, 2024

No, I did not use the --alt option before actually. Thanks for telling me about it.
I am running the workflow with it right now.
What makes you think that the script is wrong for inferring the A site?

from ribotish.

noepozzan commented on July 19, 2024

And how exactly do you see that the peptide is starting at position 63?

from ribotish.

noepozzan commented on July 19, 2024

Hi @zhpn1024

I have another question that I mentioned before.
So, I run ribotish quality on the ensembl gtf file with --th 0.4.
The quality output looks pretty good, the offsets for an exemplary file:

offdict = {25: 10, 26: 8, 28: 10, 29: 11, 31: 11, 32: 11, 33: 11, 'm0': {28: 7, 29: 11, 31: 11, 33: 8}}

But then I want to run ribotish predict on another gtf file from RNAcentral with the following parameters: --alt --transprofile TRANSPROFILE --seq --aaseq -v --longest.

This is the log:

Mon Dec 12 17:18:47 2022 Loading genome...
Making fasta index for Mus_musculus.GRCm39.dna.primary_assembly.fa...
No input TIS data!
Mon Dec 12 17:19:25 2022 Predicting...
1
10
11
12
13
14
15
16
17
18
19
2
3
4
5
6
7
8
9
GL456210.1
GL456211.1
GL456212.1
GL456219.1
GL456221.1
GL456233.2
GL456239.1
GL456354.1
GL456359.1
GL456360.1
GL456366.1
GL456367.1
GL456368.1
GL456370.1
GL456372.1
GL456378.1
GL456379.1
GL456381.1
GL456382.1
GL456383.1
GL456385.1
GL456387.1
GL456389.1
GL456390.1
GL456394.1
GL456396.1
JH584296.1
JH584297.1
JH584298.1
JH584299.1
JH584300.1
JH584301.1
JH584302.1
JH584303.1
JH584304.1
MT
MU069434.1
MU069435.1
X
Y
Mon Dec 12 18:36:50 2022 Checking overlap with known CDS..
Mon Dec 12 18:36:50 2022 BH correcting...
Mon Dec 12 18:36:50 2022 Done! Time used: 4682.619626283646 s.

Even though the log file seems alright, there is no output besides the header to both pred(_all).txt files.

I am really out of ideas on this one. Can you help?
Thanks for your help!

from ribotish.

zhpn1024 commented on July 19, 2024

No, I did not use the --alt option before actually. Thanks for telling me about it. I am running the workflow with it right now. What makes you think that the script is wrong for inferring the A site?

Based on the offset. The A-site is upstream of P-site, and the offset is around 9nt. But your A-sites are nearly the end of reads.
The position 63 is by calculating the mapping of genome location to transcript position.

from ribotish.

zhpn1024 commented on July 19, 2024

Are there any data in TRANSPROFILE file?

from ribotish.

noepozzan commented on July 19, 2024

Based on the offset. The A-site is upstream of P-site, and the offset is around 9nt. But your A-sites are nearly the end of reads. The position 63 is by calculating the mapping of genome location to transcript position.

I see. Yes, the A sites have not been computed based on the genome alignments but based on the transcriptome alignment. I would need to look more closely into what happened there exactly.

Btw, I really thought that the --alt option was by default enabled, haha... Read over that every time.
So, when running with the --alt option enabled and some additional alternative start codons, the peptide mentioned above is actually identified. It is also annotated with id = ENST00000283195:

ENSG00000153201	ENST00000283195	RANBP2	protein_coding	2:109335960-109400357:+	GCG	24	9801	Extended	0	None	3.32963423346377e-92	N	None	None	7.647073795067235e-89	None	3258

I didn't paste the sequence because of its length. So ATAAGLSAGLTR is really just a small part of that.

Coming back to calculating which positions this peptide is represented by in the TRANSPROFILE;
I still don't really get how the exact numbers are calculated. But ignoring this, this small part of the larger TRANSPROFILE does seem to indicate periodicity:

... 63:2, 64:1, 65:1, 66:22, 69:12 ...

I also made sure to include A as a possible start codon.
Why is ATAAGLSAGLTR not called by itself?

from ribotish.

noepozzan commented on July 19, 2024

Are there any data in TRANSPROFILE file?

For this other experiment, there is nothing in the TRANSPROFILE except the header, either.
Why does ribotish take more than an hour to not output anything at all?

from ribotish.

zhpn1024 commented on July 19, 2024

Glad to see the new results.
It is curious that GCG is not in the alt start codons. What's your command?
The transprofile data is position: count. The position is the 5' end of P-site.
Maybe you run with the --longest, so the most upstream start codon is used. You can try --framebest instead. This option will test all possible start codons and choose the most significant one.

from ribotish.

zhpn1024 commented on July 19, 2024

Are there any data in TRANSPROFILE file?

For this other experiment, there is nothing in the TRANSPROFILE except the header, either. Why does ribotish take more than an hour to not output anything at all?

It is curious. Are chromosome names consistent in bam, genome fa and gtf files? For example 'chr1' with '1'.

from ribotish.

noepozzan commented on July 19, 2024

Glad to see the new results. It is curious that GCG is not in the alt start codons. What's your command? The transprofile data is position: count. The position is the 5' end of P-site. Maybe you run with the --longest, so the most upstream start codon is used. You can try --framebest instead. This option will test all possible start codons and choose the most significant one.

I am glad, too. Thank you!

Since the --alt is described to only consider start codons differing in one nt from ATG, I added some that were relevant for the small peptides I am searching for:

--alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq --tpth 1 --fpth 1 --fspth 1 -v --longest

Thanks, I got that part. But the peptide I am interested in is starting at chr2:109'336'000. If I see that there were reads at 63, 66, 69 in the TRANSPROFILE of the respective entry, then how is this related to the absolute genome coordinates?

I'll certainly give --framebest a try, aswell!

from ribotish.

noepozzan commented on July 19, 2024

It is curious. Are chromosome names consistent in bam, genome fa and gtf files? For example 'chr1' with '1'.

All of the files have the usual names like 1,2,3 .. MT, X, Y.

But then the RNAcentral gtf has a couple entries like this: JH584298.1 which are only in there.

And the bam file header shows some entries like this: GL000192.1 which are also in the genome.

Does that tell you anything?

from ribotish.

zhpn1024 commented on July 19, 2024

Mapping transcript coordinate to genome coordinate need custom scripts. If you are fammilar with python, you can use my ribotish.zbio.gtf module. The Trans.genome_pos() function does this job.

from ribotish.

zhpn1024 commented on July 19, 2024

The small chromosomes like JH584298.1 and GL000192.1 do not affect. Can you show the QC figure?

from ribotish.

noepozzan commented on July 19, 2024

Mapping transcript coordinate to genome coordinate need custom scripts. If you are fammilar with python, you can use my ribotish.zbio.gtf module. The Trans.genome_pos() function does this job.

I see. May have a look at it but I'll just trust you on that for now. =)

from ribotish.

noepozzan commented on July 19, 2024

The small chromosomes like JH584298.1 and GL000192.1 do not affect. Can you show the QC figure?

This is the QC output for one exemplary file:
Sample_11_Concatenated_WT_30M_132443_RIBO_ALL.sorted_indexed_qual.pdf

from ribotish.

zhpn1024 commented on July 19, 2024

The data quality is not good. There are very few reads in coding regions, and the frame pattern is not clear.

from ribotish.

noepozzan commented on July 19, 2024

Well, alright... We thought it was good enough, but thanks for the hint.

from ribotish.

noepozzan commented on July 19, 2024

Hi @zhpn1024

So, I agree that the data mentioned just above is not very deep.
But now I have tried to predict on the RNAcentral gtf with other data.

This data is almost strangely periodic and deep, or not?:
SRR1333393.sorted_indexed_qual.pdf

Still, all output files (including the TRANSPROFILE) are empty.

I ran the following:

ribotish predict -b file.bam -g homo_sapiens.GRCh38.gtf -f Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --ribopara SRR1333393.sorted_indexed.bam.para.py --alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt

This is the log:

Wed Dec 14 13:03:49 2022 Loading genome...
Making fasta index for Homo_sapiens.GRCh37.75.dna.primary_assembly.fa...
No input TIS data!
Wed Dec 14 13:04:30 2022 Predicting...
.
.
CHR_HSCHRX_2_CTG3
GL000008.2
.
.
Wed Dec 14 13:04:37 2022 Checking overlap with known CDS..
Wed Dec 14 13:04:37 2022 BH correcting...
Wed Dec 14 13:04:37 2022 Done! Time used: 48.64457869529724 s.

I don't get it. Why doesn't ribotish predict anything on this gtf?
And why was it so fast if it took way later on that other data?

from ribotish.

zhpn1024 commented on July 19, 2024

This data looks good.
GRCh37 or 38? The version for gtf and genome fa is not consistent.

from ribotish.

noepozzan commented on July 19, 2024

Hmm... As far as I am aware RNAcentral doesn't exist for anything pre 38.

But I'll just run it with all files from 38 to make sure.

Will get back to you, haha. Thanks!

from ribotish.

noepozzan commented on July 19, 2024

Well, I tried running with all 3 files (ensemnbl gtf, RNAcentral gtf and genome.fa) from the same release and it didn't help. The output files and TRANSPROFILE are still empty.

ribotish predict -b SRR1333393.bam -g homo_sapiens.GRCh38.gtf -f Homo_sapiens.GRCh38.dna.primary_assembly.fa --ribopara SRR1333393.sorted_indexed.bam.para.py --alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt

The log pretty much looks the same as just above.
Quality as discussed earlier should also definitely not be an issue.

Again, if I aligned with the normal ensembl gtf, then estimate offsets with the normal gtf but then predict with the RNAcentral gtf this should work?

I mean mapping with STAR works fine (~50% of uniquely mapped reads), also if I only map using the RNAcentral gtf.

In summary; I am really out of ideas why ribotish doesn't predict with the RNAcentral gtf. Do you have an idea still?

from ribotish.

zhpn1024 commented on July 19, 2024

Try to predict with ensembl gtf?

from ribotish.

noepozzan commented on July 19, 2024

well, that works. That gives the results we saw further up in the thread.

from ribotish.

zhpn1024 commented on July 19, 2024

I found RNAcentral data in gff3 format. Where do you download th gtf format?

from ribotish.

noepozzan commented on July 19, 2024

You are right. I downloaded it in gff3 format, but then transformed to gtf using gffread.

I'll run it with the gff3 to make sure..

from ribotish.

zhpn1024 commented on July 19, 2024

Maybe ribotish failed to parse the converted gtf format.

from ribotish.

noepozzan commented on July 19, 2024

Hi @zhpn1024

So I ran ribotish with the RNAcentral gff3 file. Again the output files (TRANSPROFILE included) are empty.
But this time atleast the log is telling something new:

Fri Dec 16 14:18:23 2022 Loading genome...
Making fasta index for Homo_sapiens.GRCh38.dna.primary_assembly.fa...
No input TIS data!
Fri Dec 16 14:19:06 2022 Predicting...
.
.
.
Inconsistent trans strand:  URS00000235AA_9606.241936 - +
15
Inconsistent trans strand:  URS000027D32E_9606.274248 + -
16
Inconsistent trans strand:  URS00008B4EA9_9606.317462 + -
17
Inconsistent trans strand:  URS00001AB772_9606.350408 + -
.
.
.
Fri Dec 16 14:19:48 2022 Checking overlap with known CDS..
Fri Dec 16 14:19:48 2022 BH correcting...
Fri Dec 16 14:19:48 2022 Done! Time used: 85.3397159576416 s.

This is the command I ran. I had ribotish autodetect the gff3 file type. Is that a problem?

ribotish predict -b SRR1333393.bam -g homo_sapiens.GRCh38.gff3 -f Homo_sapiens.GRCh38.dna.primary_assembly.fa --ribopara SRR1333393.sorted_indexed.bam.para.py --alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt

Do you have an idea what the error is trying to tell? Thanks!

from ribotish.

zhpn1024 commented on July 19, 2024

I just checked the RNAcentral gff3 format. There are two major problems that my program fail to parse. The first is the 'exon' in the third column is changed to 'noncoding_exon', resulted in no exon in any transcript. The second is no gene annotation to organize transcripts.
I think using the RNAcentral bed annotation would be OK.

from ribotish.

noepozzan commented on July 19, 2024

Hi @zhpn1024

It really finally worked when running ribotish with the RNAcentral bed file.
Thanks so much for your help & persistence.

So, on our own samples, which are not that deeply sequenced (QC output in the thread above), none of the ORFs passed the thresholds, so there are only entries in the "_all.txt" files. Still, I am glad we found the error.

I ran:

ribotish predict -b Sample_11.bam -g mus_musculus.GRCm39.bed -f Mus_musculus.GRCm39.dna.primary_assembly.fa --ribopara Sample_11_Concatenated_WT_30M_132443_RIBO_ALL.sorted_indexed.bam.para.py --alt --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt

The log file has >600'000 warning messages, though:
I just copied you the tail of the log where 2 of those messages are visible.

Wrong CDS annotation: URS000028A7C4_10090 URS000028A7C4_10090 0 73 73
Wrong CDS annotation: URS000061C2F2_10090 URS000061C2F2_10090 0 73 73
Mon Dec 19 13:04:30 2022 Checking overlap with known CDS..
Mon Dec 19 13:04:30 2022 BH correcting...
Mon Dec 19 13:04:30 2022 Done! Time used: 5537.979197025299 s.

Seems like ribotish still doesn't like the annotation file..
I guess that means I am losing information. Is there any way around this?

from ribotish.

zhpn1024 commented on July 19, 2024

Haha. The warnings are because the ThickStart/Stop in 7-8 column of bed format represent coding regions, while RNAcentral used it to duplicate Start and Stop range, as if the whole transcript is coding. In fact they are all non-coding. It would not affect results except TISType classification.

from ribotish.

noepozzan commented on July 19, 2024

hmm, I see, haha..
Thanks for the help!

from ribotish.

No reads found! about ribotish HOT 52 CLOSED

Comments (52)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent