schatzlab / scikit-ribo Goto Github PK
View Code? Open in Web Editor NEWAccurate estimation and robust modelling of translation dynamics at codon resolution
License: GNU General Public License v2.0
Accurate estimation and robust modelling of translation dynamics at codon resolution
License: GNU General Public License v2.0
Hi,
I installed scikit-ribo using pip and am using it with human data. In order to generate the RNA fold file, I am first running gtf_process.py on our GTF file and in doing so obtained the following error:
[status] Reading the input file: XXX/single_CDS_translatedGenes.sorted.gtf
[execute] Starting the pre-processing module
[execute] Loading the the gtf file in to sql db
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/scikit-ribo/gtf_preprocess.py", line 266, in
worker.convertGtf()
File "/usr/local/lib/python3.5/dist-packages/scikit-ribo/gtf_preprocess.py", line 44, in convertGtf
gtfDedup = self.output + "/" + self.prefix + '.dedup.gtf'
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
I realised that the gtf_process.py file that is on github is different to the version that I installed with pip.
I managed to fix the problem by changing:
def init(self, gtf=None, fasta=None, prefix=None, output=None):
to:
def init(self, gtf=None, fasta=None, output=None, prefix=None):
and adding in the lines (which are in the version on github but not the version I installed using pip):
self.base = os.path.basename(self.gtf)
self.prefix = os.path.splitext(self.base)[0]
Regards,
Catherine
I got stuck with creating the RNAfold input file.Both call_rnafold.py and process_rnafold.py are not added to the path by setup and the documentation is not clear regarding the parameters of this function. Is RNAfold a dependency of scikit-ribo? Thanks for clarifying
I have been trying to use scikit-ribo on mouse ESC Ribo-Seq data but the output files generated from scikit-ribo-build do no correctly parse the given input GFF file. I am using the mm10_knownGene GFF file as input with genome file obtained from http://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/.
Specifically, there is inaccurate assignment of codons of transcripts.
For example, for gene uc008cjg.1, the start codon is at chr17: 35916361-35916363 while in the files mm10.codons.df and mm10.codons.bed, this chromosome location is at codon 2
$ cat mm10.codons.bed | grep 'uc008cjg.1' | head -12
chr17 35916390 35916393 uc008cjg.1 -8 - CGG
chr17 35916387 35916390 uc008cjg.1 -7 - CAG
chr17 35916384 35916387 uc008cjg.1 -6 - GCC
chr17 35916381 35916384 uc008cjg.1 -5 - GTA
chr17 35916378 35916381 uc008cjg.1 -4 - CGT
chr17 35916375 35916378 uc008cjg.1 -3 - CCG
chr17 35916372 35916375 uc008cjg.1 -2 - GTG
chr17 35916369 35916372 uc008cjg.1 -1 - TGC
chr17 35916366 35916369 uc008cjg.1 0 - GTC
chr17 35916363 35916366 uc008cjg.1 1 - AAG
chr17 35916360 35916363 uc008cjg.1 2 - ATG
chr17 35916357 35916360 uc008cjg.1 3 - GCT
Another example where the codon assignments are entirely different from the reference is uc007vnb.1. The start codon is at chr15:37003817-37003820 but the codon file has very different annotations.
$cat mm10.codons.bed | grep 'uc007vnb.1' | head -20
chr15 37007423 37007426 uc007vnb.1 -8 - CCG
chr15 37007420 37007423 uc007vnb.1 -7 - CAG
chr15 37007417 37007420 uc007vnb.1 -6 - GAA
chr15 37007414 37007417 uc007vnb.1 -5 - GGT
chr15 37007411 37007414 uc007vnb.1 -4 - GAG
chr15 37007408 37007411 uc007vnb.1 -3 - GGG
chr15 37007405 37007408 uc007vnb.1 -2 - CGG
chr15 37007402 37007405 uc007vnb.1 -1 - GAC
chr15 37007399 37007402 uc007vnb.1 0 - CGC
chr15 37007396 37007399 uc007vnb.1 1 - GCG
chr15 37007393 37007396 uc007vnb.1 2 - GGC
chr15 37007390 37007393 uc007vnb.1 3 - AAG
Assuming that codon 0 is the start codon in this file, I found that only 3380(7%) of 46355 transcripts have ATG as start codon
cat mm10.codons.bed | awk '{if($5==0)print $5,$7}' | sort | uniq -c
727 0 AAA
436 0 AAC
748 0 AAG
376 0 AAT
686 0 ACA
473 0 ACC
295 0 ACG
854 0 ACT
1768 0 AGA
1212 0 AGC
1207 0 AGG
1724 0 AGT
377 0 ATA
522 0 ATC
**3380 0 ATG**
704 0 ATT
228 0 CAA
377 0 CAC
493 0 CAG
167 0 CAT
400 0 CCA
569 0 CCC
399 0 CCG
477 0 CCT
188 0 CGA
602 0 CGC
637 0 CGG
136 0 CGT
202 0 CTA
831 0 CTC
625 0 CTG
594 0 CTT
990 0 GAA
779 0 GAC
1888 0 GAG
585 0 GAT
921 0 GCA
1177 0 GCC
1318 0 GCG
928 0 GCT
1856 0 GGA
1691 0 GGC
2329 0 GGG
1037 0 GGT
523 0 GTA
1015 0 GTC
1304 0 GTG
916 0 GTT
90 0 TAA
88 0 TAC
184 0 TAG
115 0 TAT
284 0 TCA
401 0 TCC
162 0 TCG
386 0 TCT
285 0 TGA
392 0 TGC
460 0 TGG
270 0 TGT
155 0 TTA
506 0 TTC
274 0 TTG
632 0 TTT
I hope this bug can be fixed so that scikit-ribo can be run on mm10 data.
Investigate if we need to match the strand of the read and the gene during bedtools intersect.
makeTrainingData
and makeCdsData
https://github.com/hanfang/scikit-ribo/blob/master/scripts/bam_preprocess.py
Pairing probability as a vector, left/right 5 codons
Whilst using gtf_preprocess.py to create the expandCDS.fasta file, I obtained the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/scikit-ribo/gtf_preprocess.py", line 280, in
worker.getSeq()
File "/usr/local/lib/python3.5/dist-packages/scikit-ribo/gtf_preprocess.py", line 154, in getSeq
self.fiveUtrDic[geneName] + self.fastaDic[geneName] + self.threeUtrDic[geneName] + "\n")
KeyError: 'ENSG00000230989'
This appears to be because in the 3utr.fasta, 5tr.fasta and cds.fasta files that were created have, for example, the following as a header:
ENSG00000187961::1:960586-965715(+)
Whereas the variable self.geneNames stores the IDs as only e.g. ENSG00000187961
Given the previous issue that I raised and solved, please can you confirm whether there is a problem in the code that is giving rise to this error?
Many thanks,
Catherine
Hi there,
I am trying to test scikit-ribo on some yeast ribo-seq data we have just generated.
When running scikit-ribo-build.py I get the following error:
File "/home/drew/anaconda3/envs/scikit-ribo/lib/python3.6/site-packages/gffutils/interface.py", line 227, in __getitem__ raise FeatureNotFoundError(key) gffutils.exceptions.FeatureNotFoundError: YDR106W_BY4741
I am using the yeast BY4741 Toronto genome, and a subset of the associated gff file that I converted to gtf using gffread. The subset just includes coding genes (and excludes some other Dubious annotated transcripts). This seemed to solve another issue I was having using the full gff or gtf made from this which were either
pandas.errors.ParserError: Too many columns specified: expected 9 and found 1.
when using a gff, or
KeyError: 'gene_id'
when using a gtf that included noncoding genes (the problem here is with missing "gene_id" descriptors for these kinds of genes or for exon annotations.)
However I ran into the problem described above.
I am not sure why gffutils is not finding the genes by these names (e.g. YDR106W_BY4741) since this is exactly how they are named in the gtf file (obviously, since the geneNames list in gtf_preprocess.py is built from the gtf itself).
Your help would be much appreciated!! Thanks.
Hi guys - to double check some of my own work, I've been trying to compare it to scikit-ribo's estimates of predicted protein abundance (PA). If I read the paper and code correctly, then it should be possible to get the PA estimates by multiplying the RNAseq TPMs by the TE estimates that scikit-ribo outputs (?). Doing this with your example data however, and comparing it to the proteomics data form Lawless et al, gives me a (much) worse correlation with the proteomics than just plain-old riboseq TPMs from salmon. From the paper, it should be much better, say .83 vs .77 with TPMs alone.
Am I calculating PA values correctly? what exactly should I multiply scikit-ribo's output by to get them?
Hi,
I'm testing scikit-ribo on a set of ribo-seq experiments from human cancer cell lines. I am experiencing the same issue reported by @aemoor some months ago while running gtf_preprocess.py (see #2) (error reported below).
[status] Reading the input file: /elaborazioni/sharedCO/SU2C_Data/scores_SU2C_v2/stuff/Homo_sapiens.GRCh37.75.gtf
[execute] Starting the pre-processing module
[execute] Loading the the gtf file in to sql db
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
File "scikit_ribo/gtf_preprocess.py", line 268, in
worker.convertGtf()
File "scikit_ribo/gtf_preprocess.py", line 55, in convertGtf
geneBed12 = self.db.bed12(geneName, name_field='gene_id')
File "/home/matteo.benelli/.local/lib/python3.5/site-packages/gffutils/interface.py", line 1218, in bed12
% (last, feature.stop))
ValueError: End of last exon (39127564) does not match end of feature (39127595)
Did you solve that issue? It would be great if you could share pre-built data for human model (Grch37).
Thanks!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.