Comments (52)
Hi! The qc step uses known coding genes. You need to input coding annotations. The noncoding analysis is in the predict step.
from ribotish.
I see, didn't realize that.
Thanks a lot for the help!
Btw, I opened another issue since I have been using ribotish quite extensively.
from ribotish.
Getting back to you here because even though ribotish predict works with the following parameters: "--seq --aaseq --tpth 1 --fpth 1 --fspth 1 -v --longest", I do not get a single prediction out, when predicting on the RNAcentral gtf.. The pred(_all).txt files only have a header and nothing else. Do you have any idea why?
from ribotish.
I am asking this because I am specifically searching for these new ORFs that I see on IGV. There are clearly some reads lying around these ORFs and they even look quite periodic.
Why is ribotish not identifying any of them, even with these permissive parameters?
Thank you!
from ribotish.
What about the quality result? What is the content in the para.py file?
from ribotish.
This is one of them, they look like this:
offdict = {25: 7, 26: 7, 27: 10, 28: 7, 29: 7, 31: 7, 'm0': {25: 7, 26: 10, 28: 13, 31: 8, 33: 16}}
Looks kind of strange, or what do you think?
from ribotish.
It's strange. What about the pdf figure?
from ribotish.
SRR1333393.sorted_indexed_qual.pdf
from ribotish.
The reads length s are too short. The lengths should be around 29nt. How did you trim the reads?
from ribotish.
I got rid of the first 5 bases and then clipped at the adapter. Would it help you if I pasted some reads here?
Thanks btw, for really trying to help me!
from ribotish.
That's the reason. 5+7=12 is the most common offset value. Do not cut the first 5 bases.
from ribotish.
I see. I'll give it a try. Thanks again!
from ribotish.
SRR1333393.sorted_indexed_qual.pdf
What do you think of this? Definitely looks better, no?
Still doesn't find those regions that I see some reads and periodicity in..
from ribotish.
It's much better now. Are there any prediction results?
from ribotish.
Yes, there are results.
I should specify: The quality figure I posted above comes actually from a different experiment from CHX-treated riboseq data that I took from SRA trying to reproduce some paper's results. In that paper, they found 10 short ORFs. But using ribotish I only find 4 of them. What's also strange is that I found the same 4 ORFs even before you told me to correct the offset. So this doesn't seem to matter too much.
The problem is that I have no idea why ribotish doesn't see them. These regions of interest aren't even in the predict outputs. I expected them to be in there, just maybe not with a passing p-value.
Do you have any other idea why this may be?
from ribotish.
from ribotish.
Hey @zhpn1024,
Sorry for the late reply..
I ran ribotish predict
with the --transprofile
option. Pretty handy!
I see signal in the transcripts im interested in, but not for the exact sequence I am interested in.
So, as an example this paper I am trying to replicate has the following:
They found ribosome profiling and mass spec evidence for the following peptide:
ATAAGLSAGLTR. They annotated this with id = ENST00000283195.
Ribotish finds many things with an id of ENST00000283195, but not this exact peptide I'd be interested in.
Below is a screenshot from IGV where I plot the full footprints and only their A-site for periodicity of the region of this peptide. You should be able to see it in the upper translation frame.
Do the reads not look good enough to be considered by ribotish?
Does it make sense what I am telling you?
And thanks again for your help!
from ribotish.
How do you generate the A site profile? From the A site result, the 394 data support your peptide, while 393 data supports translation in another frame (bottom).
My transprofile results are P site profile. What are the signals in this region?
from ribotish.
Good point. The A site profiles in the screenshot above were actually separately derived using a python script from transcriptome-aligned reads. The 393 is actually CHX-treated, while 394 is LTM-treated. Not sure how that should make a difference for periodicity, though...
Looking into the transprofile file, there is a lot in there for the entry with id = ENST00000283195. I only pasted about half of it due to the length.
ENSG00000153201 ENST00000283195 RANBP2 {} {48:1, 63:2, 64:1, 65:1, 66:22, 69:12, 72:1, 79:3, 135:3, 137:2, 147:1, 154:2
, 156:5, 176:1, 285:5, 288:1, 294:3, 295:6, 345:2, 346:6, 348:1, 351:4, 354:2, 360:1, 462:1, 463:1, 468:2, 471:5, 480:1, 490:
2, 642:6, 645:3, 648:2, 651:11, 652:1, 653:3, 672:4, 673:3, 690:2, 717:5, 718:2, 720:2, 721:1, 726:1, 727:1, 729:19, 730:1, 7
32:4, 733:6, 734:1, 735:21, 736:3, 737:3, 738:6, 888:3, 891:7, 903:16, 904:1, 912:1, 913:1, 924:1, 927:1, 928:1, 930:1, 933:5
0, 934:9, 1029:4, 1033:1, 1035:1, 1042:2, 1047:4, 1048:1, 1050:1, 1221:2, 1225:3, 1248:1, 1251:1, 1272:1, 1344:1, 1347:2, 134
8:3, 1350:1, 1362:3, 1368:1, 1374:5, 1375:1, 1377:5, 1379:1, 1380:2, 1383:1, 1385:1, 1386:2, 1392:1, 1509:1, 1512:1, 1513:1,
1527:1, 1671:2, 1752:4, 1755:8, 1758:25, 1767:1, 1915:2, 1917:1, 1918:1, 2068:1, 2100:1, 2118:1, 2195:2, 2196:3, 2202:2, 2208
:1, 2211:1, 2269:1, 2271:1, 2283:1, 2292:1, 2478:2, 2541:1, 2595:2, 2616:2, 2640:1, 2670:5, 2814:2, 2913:1, 2963:2, 2964:3, 2
979:3, 3026:2, 3039:2, 3077:1, 3081:3, 3090:2, 3102:1, 3111:1, 3135:1, 3165:2, 3247:1, 3282:2, 3312:1, 3354:1, 3355:1, 3360:3
, 3435:7, 3474:2, 3510:1, 3519:1, 3554:1, 3555:1, 3557:1, 3558:2, 3561:1, 3567:3, 3573:1, 3576:1, 3579:6, 3585:1, 3588:1, 359
1:2, 3594:4, 3597:2, 3600:3, 3612:3, 3615:1, 3618:1, 3633:4, 3657:2, 3660:10, 3663:10, 3684:2, 3687:1, 3699:1, 3711:3, 3738:2, 3753:4, 3754:1, 3755:1, 3759:2, 3765:6, 3774:3, 3825:1, 3830:1, 3834:1, 3837:5, 3840:2, 3864:3, 3871:2, 3909:1, 3918:6, 3945:1, 3981:1, 3984:4, 4042:3, 4090:1, 4098:2, 4116:3, 4119:1, 4131:1, 4138:1, 4171:1, 4176:1, 4227:1, 4228:1, 4230:1, 4251:7, 4260:2, 4320:1, 4327:1, 4335:1, 4338:3, 4341:1, 4356:1, 4359:1, 4362:2, 4368:1, 4374:2, 4421:2, 4427:2, 4437:3, 4440:1, 4449:6, 4509:17, 4542:1, 4545:5, 4550:1, 4551:1, 4553:1, 4561:1, 4575:1, 4593:2, 4605:1, 4629:2, 4632:1, 4635:3, 4645:1, 4647:10, 4678:3, 4701:8, 4704:1, 4725:5, 4726:2, 4737:6, 4738:2, 4744:5, 4752:4, 4755:2, 4767:2, 4785:1, 4794:6, 4797:1, 4803:6 ...
from ribotish.
Your peptide is starting at position 63. So the translation signal is right here. Why this region not reported may be because there's no start codon upstream. Did you use the --alt option?
from ribotish.
I think your script for the A site may be not correct.
from ribotish.
No, I did not use the --alt option before actually. Thanks for telling me about it.
I am running the workflow with it right now.
What makes you think that the script is wrong for inferring the A site?
from ribotish.
And how exactly do you see that the peptide is starting at position 63?
from ribotish.
Hi @zhpn1024
I have another question that I mentioned before.
So, I run ribotish quality
on the ensembl gtf file with --th 0.4
.
The quality output looks pretty good, the offsets for an exemplary file:
offdict = {25: 10, 26: 8, 28: 10, 29: 11, 31: 11, 32: 11, 33: 11, 'm0': {28: 7, 29: 11, 31: 11, 33: 8}}
But then I want to run ribotish predict
on another gtf file from RNAcentral with the following parameters: --alt --transprofile TRANSPROFILE --seq --aaseq -v --longest
.
This is the log:
Mon Dec 12 17:18:47 2022 Loading genome...
Making fasta index for Mus_musculus.GRCm39.dna.primary_assembly.fa...
No input TIS data!
Mon Dec 12 17:19:25 2022 Predicting...
1
10
11
12
13
14
15
16
17
18
19
2
3
4
5
6
7
8
9
GL456210.1
GL456211.1
GL456212.1
GL456219.1
GL456221.1
GL456233.2
GL456239.1
GL456354.1
GL456359.1
GL456360.1
GL456366.1
GL456367.1
GL456368.1
GL456370.1
GL456372.1
GL456378.1
GL456379.1
GL456381.1
GL456382.1
GL456383.1
GL456385.1
GL456387.1
GL456389.1
GL456390.1
GL456394.1
GL456396.1
JH584296.1
JH584297.1
JH584298.1
JH584299.1
JH584300.1
JH584301.1
JH584302.1
JH584303.1
JH584304.1
MT
MU069434.1
MU069435.1
X
Y
Mon Dec 12 18:36:50 2022 Checking overlap with known CDS..
Mon Dec 12 18:36:50 2022 BH correcting...
Mon Dec 12 18:36:50 2022 Done! Time used: 4682.619626283646 s.
Even though the log file seems alright, there is no output besides the header to both pred(_all).txt files.
I am really out of ideas on this one. Can you help?
Thanks for your help!
from ribotish.
No, I did not use the --alt option before actually. Thanks for telling me about it. I am running the workflow with it right now. What makes you think that the script is wrong for inferring the A site?
Based on the offset. The A-site is upstream of P-site, and the offset is around 9nt. But your A-sites are nearly the end of reads.
The position 63 is by calculating the mapping of genome location to transcript position.
from ribotish.
Are there any data in TRANSPROFILE file?
from ribotish.
Based on the offset. The A-site is upstream of P-site, and the offset is around 9nt. But your A-sites are nearly the end of reads. The position 63 is by calculating the mapping of genome location to transcript position.
I see. Yes, the A sites have not been computed based on the genome alignments but based on the transcriptome alignment. I would need to look more closely into what happened there exactly.
Btw, I really thought that the --alt
option was by default enabled, haha... Read over that every time.
So, when running with the --alt
option enabled and some additional alternative start codons, the peptide mentioned above is actually identified. It is also annotated with id = ENST00000283195:
ENSG00000153201 ENST00000283195 RANBP2 protein_coding 2:109335960-109400357:+ GCG 24 9801 Extended 0 None 3.32963423346377e-92 N None None 7.647073795067235e-89 None 3258
I didn't paste the sequence because of its length. So ATAAGLSAGLTR is really just a small part of that.
Coming back to calculating which positions this peptide is represented by in the TRANSPROFILE;
I still don't really get how the exact numbers are calculated. But ignoring this, this small part of the larger TRANSPROFILE does seem to indicate periodicity:
... 63:2, 64:1, 65:1, 66:22, 69:12 ...
I also made sure to include A as a possible start codon.
Why is ATAAGLSAGLTR not called by itself?
from ribotish.
Are there any data in TRANSPROFILE file?
For this other experiment, there is nothing in the TRANSPROFILE except the header, either.
Why does ribotish take more than an hour to not output anything at all?
from ribotish.
Glad to see the new results.
It is curious that GCG is not in the alt start codons. What's your command?
The transprofile data is position: count. The position is the 5' end of P-site.
Maybe you run with the --longest, so the most upstream start codon is used. You can try --framebest instead. This option will test all possible start codons and choose the most significant one.
from ribotish.
Are there any data in TRANSPROFILE file?
For this other experiment, there is nothing in the TRANSPROFILE except the header, either. Why does ribotish take more than an hour to not output anything at all?
It is curious. Are chromosome names consistent in bam, genome fa and gtf files? For example 'chr1' with '1'.
from ribotish.
Glad to see the new results. It is curious that GCG is not in the alt start codons. What's your command? The transprofile data is position: count. The position is the 5' end of P-site. Maybe you run with the --longest, so the most upstream start codon is used. You can try --framebest instead. This option will test all possible start codons and choose the most significant one.
I am glad, too. Thank you!
Since the --alt
is described to only consider start codons differing in one nt from ATG, I added some that were relevant for the small peptides I am searching for:
--alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq --tpth 1 --fpth 1 --fspth 1 -v --longest
Thanks, I got that part. But the peptide I am interested in is starting at chr2:109'336'000. If I see that there were reads at 63, 66, 69 in the TRANSPROFILE of the respective entry, then how is this related to the absolute genome coordinates?
I'll certainly give --framebest
a try, aswell!
from ribotish.
It is curious. Are chromosome names consistent in bam, genome fa and gtf files? For example 'chr1' with '1'.
All of the files have the usual names like 1,2,3 .. MT, X, Y.
But then the RNAcentral gtf has a couple entries like this: JH584298.1 which are only in there.
And the bam file header shows some entries like this: GL000192.1 which are also in the genome.
Does that tell you anything?
from ribotish.
Mapping transcript coordinate to genome coordinate need custom scripts. If you are fammilar with python, you can use my ribotish.zbio.gtf module. The Trans.genome_pos() function does this job.
from ribotish.
The small chromosomes like JH584298.1 and GL000192.1 do not affect. Can you show the QC figure?
from ribotish.
Mapping transcript coordinate to genome coordinate need custom scripts. If you are fammilar with python, you can use my ribotish.zbio.gtf module. The Trans.genome_pos() function does this job.
I see. May have a look at it but I'll just trust you on that for now. =)
from ribotish.
The small chromosomes like JH584298.1 and GL000192.1 do not affect. Can you show the QC figure?
This is the QC output for one exemplary file:
Sample_11_Concatenated_WT_30M_132443_RIBO_ALL.sorted_indexed_qual.pdf
from ribotish.
The data quality is not good. There are very few reads in coding regions, and the frame pattern is not clear.
from ribotish.
Well, alright... We thought it was good enough, but thanks for the hint.
from ribotish.
Hi @zhpn1024
So, I agree that the data mentioned just above is not very deep.
But now I have tried to predict on the RNAcentral gtf with other data.
This data is almost strangely periodic and deep, or not?:
SRR1333393.sorted_indexed_qual.pdf
Still, all output files (including the TRANSPROFILE) are empty.
I ran the following:
ribotish predict -b file.bam -g homo_sapiens.GRCh38.gtf -f Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --ribopara SRR1333393.sorted_indexed.bam.para.py --alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt
This is the log:
Wed Dec 14 13:03:49 2022 Loading genome...
Making fasta index for Homo_sapiens.GRCh37.75.dna.primary_assembly.fa...
No input TIS data!
Wed Dec 14 13:04:30 2022 Predicting...
.
.
CHR_HSCHRX_2_CTG3
GL000008.2
.
.
Wed Dec 14 13:04:37 2022 Checking overlap with known CDS..
Wed Dec 14 13:04:37 2022 BH correcting...
Wed Dec 14 13:04:37 2022 Done! Time used: 48.64457869529724 s.
I don't get it. Why doesn't ribotish predict anything on this gtf?
And why was it so fast if it took way later on that other data?
from ribotish.
This data looks good.
GRCh37 or 38? The version for gtf and genome fa is not consistent.
from ribotish.
Hmm... As far as I am aware RNAcentral doesn't exist for anything pre 38.
But I'll just run it with all files from 38 to make sure.
Will get back to you, haha. Thanks!
from ribotish.
Well, I tried running with all 3 files (ensemnbl gtf, RNAcentral gtf and genome.fa) from the same release and it didn't help. The output files and TRANSPROFILE are still empty.
ribotish predict -b SRR1333393.bam -g homo_sapiens.GRCh38.gtf -f Homo_sapiens.GRCh38.dna.primary_assembly.fa --ribopara SRR1333393.sorted_indexed.bam.para.py --alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt
The log pretty much looks the same as just above.
Quality as discussed earlier should also definitely not be an issue.
Again, if I aligned with the normal ensembl gtf, then estimate offsets with the normal gtf but then predict with the RNAcentral gtf this should work?
I mean mapping with STAR works fine (~50% of uniquely mapped reads), also if I only map using the RNAcentral gtf.
In summary; I am really out of ideas why ribotish doesn't predict with the RNAcentral gtf. Do you have an idea still?
from ribotish.
Try to predict with ensembl gtf?
from ribotish.
well, that works. That gives the results we saw further up in the thread.
from ribotish.
I found RNAcentral data in gff3 format. Where do you download th gtf format?
from ribotish.
You are right. I downloaded it in gff3 format, but then transformed to gtf using gffread.
I'll run it with the gff3 to make sure..
from ribotish.
Maybe ribotish failed to parse the converted gtf format.
from ribotish.
Hi @zhpn1024
So I ran ribotish with the RNAcentral gff3 file. Again the output files (TRANSPROFILE included) are empty.
But this time atleast the log is telling something new:
Fri Dec 16 14:18:23 2022 Loading genome...
Making fasta index for Homo_sapiens.GRCh38.dna.primary_assembly.fa...
No input TIS data!
Fri Dec 16 14:19:06 2022 Predicting...
.
.
.
Inconsistent trans strand: URS00000235AA_9606.241936 - +
15
Inconsistent trans strand: URS000027D32E_9606.274248 + -
16
Inconsistent trans strand: URS00008B4EA9_9606.317462 + -
17
Inconsistent trans strand: URS00001AB772_9606.350408 + -
.
.
.
Fri Dec 16 14:19:48 2022 Checking overlap with known CDS..
Fri Dec 16 14:19:48 2022 BH correcting...
Fri Dec 16 14:19:48 2022 Done! Time used: 85.3397159576416 s.
This is the command I ran. I had ribotish autodetect the gff3 file type. Is that a problem?
ribotish predict -b SRR1333393.bam -g homo_sapiens.GRCh38.gff3 -f Homo_sapiens.GRCh38.dna.primary_assembly.fa --ribopara SRR1333393.sorted_indexed.bam.para.py --alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt
Do you have an idea what the error is trying to tell? Thanks!
from ribotish.
I just checked the RNAcentral gff3 format. There are two major problems that my program fail to parse. The first is the 'exon' in the third column is changed to 'noncoding_exon', resulted in no exon in any transcript. The second is no gene annotation to organize transcripts.
I think using the RNAcentral bed annotation would be OK.
from ribotish.
Hi @zhpn1024
It really finally worked when running ribotish with the RNAcentral bed file.
Thanks so much for your help & persistence.
So, on our own samples, which are not that deeply sequenced (QC output in the thread above), none of the ORFs passed the thresholds, so there are only entries in the "_all.txt" files. Still, I am glad we found the error.
I ran:
ribotish predict -b Sample_11.bam -g mus_musculus.GRCm39.bed -f Mus_musculus.GRCm39.dna.primary_assembly.fa --ribopara Sample_11_Concatenated_WT_30M_132443_RIBO_ALL.sorted_indexed.bam.para.py --alt --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt
The log file has >600'000 warning messages, though:
I just copied you the tail of the log where 2 of those messages are visible.
Wrong CDS annotation: URS000028A7C4_10090 URS000028A7C4_10090 0 73 73
Wrong CDS annotation: URS000061C2F2_10090 URS000061C2F2_10090 0 73 73
Mon Dec 19 13:04:30 2022 Checking overlap with known CDS..
Mon Dec 19 13:04:30 2022 BH correcting...
Mon Dec 19 13:04:30 2022 Done! Time used: 5537.979197025299 s.
Seems like ribotish still doesn't like the annotation file..
I guess that means I am losing information. Is there any way around this?
from ribotish.
Haha. The warnings are because the ThickStart/Stop in 7-8 column of bed format represent coding regions, while RNAcentral used it to duplicate Start and Stop range, as if the whole transcript is coding. In fact they are all non-coding. It would not affect results except TISType classification.
from ribotish.
hmm, I see, haha..
Thanks for the help!
from ribotish.
Related Issues (20)
- What dose this Horizontal line mean HOT 3
- how to edit the parameter (-d DIS Position range near start codon or stop codon)? HOT 3
- what the types of TisType mean? HOT 2
- ribotish tisdiff error?ZeroDivisionError: integer division or modulo by zero HOT 10
- --altcodons ACG,CUG
- ribotish quality: error: argument -d: expected one argument HOT 2
- ribotish quality "no reads found" for TIS reads HOT 10
- TypeError: '>=' not supported between instances of 'NoneType' and 'float' HOT 3
- TypeError: unsupported operand type(s) for +=: 'NoneType' and 'int' HOT 5
- run mistakes HOT 4
- TypeError: '<' not supported between instances of 'NoneType' and 'int' when running predict module HOT 5
- The meaning and differences in TisType HOT 10
- Wrong CDS annotation HOT 6
- AttributeError: 'Exp' object has no attribute 'sq' HOT 4
- Use of 5' mismatch reads in predict HOT 1
- ribotish predict module run error HOT 2
- TypeError: unsupported operand type(s) for +=: 'NoneType' and 'int' HOT 7
- Empty sequence error HOT 2
- Encountered error when using ribotish predict HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. ๐๐๐
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google โค๏ธ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from ribotish.