schatzlab / scikit-ribo Goto Github PK

View Code? Open in Web Editor NEW

18.0 9.0 8.0 71.12 MB

Accurate estimation and robust modelling of translation dynamics at codon resolution

License: GNU General Public License v2.0

Python 99.75% Shell 0.25%

riboseq glm random-forest ridge-regression glmnet

scikit-ribo's Issues

Inconsistence Gene IDs used in gtf_preprocess.py

Whilst using gtf_preprocess.py to create the expandCDS.fasta file, I obtained the following error:

Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/scikit-ribo/gtf_preprocess.py", line 280, in
worker.getSeq()
File "/usr/local/lib/python3.5/dist-packages/scikit-ribo/gtf_preprocess.py", line 154, in getSeq
self.fiveUtrDic[geneName] + self.fastaDic[geneName] + self.threeUtrDic[geneName] + "\n")
KeyError: 'ENSG00000230989'

This appears to be because in the 3utr.fasta, 5tr.fasta and cds.fasta files that were created have, for example, the following as a header:

ENSG00000187961::1:960586-965715(+)

Whereas the variable self.geneNames stores the IDs as only e.g. ENSG00000187961

Given the previous issue that I raised and solved, please can you confirm whether there is a problem in the code that is giving rise to this error?

Many thanks,
Catherine

Incorrect codon files generated by scikit-ribo-build for mm10

I have been trying to use scikit-ribo on mouse ESC Ribo-Seq data but the output files generated from scikit-ribo-build do no correctly parse the given input GFF file. I am using the mm10_knownGene GFF file as input with genome file obtained from http://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/.
Specifically, there is inaccurate assignment of codons of transcripts.

For example, for gene uc008cjg.1, the start codon is at chr17: 35916361-35916363 while in the files mm10.codons.df and mm10.codons.bed, this chromosome location is at codon 2
$ cat mm10.codons.bed | grep 'uc008cjg.1' | head -12

chr17	35916390	35916393	uc008cjg.1	-8	-	CGG
chr17	35916387	35916390	uc008cjg.1	-7	-	CAG
chr17	35916384	35916387	uc008cjg.1	-6	-	GCC
chr17	35916381	35916384	uc008cjg.1	-5	-	GTA
chr17	35916378	35916381	uc008cjg.1	-4	-	CGT
chr17	35916375	35916378	uc008cjg.1	-3	-	CCG
chr17	35916372	35916375	uc008cjg.1	-2	-	GTG
chr17	35916369	35916372	uc008cjg.1	-1	-	TGC
chr17	35916366	35916369	uc008cjg.1	0	-	GTC
chr17	35916363	35916366	uc008cjg.1	1	-	AAG
chr17	35916360	35916363	uc008cjg.1	2	-	ATG
chr17	35916357	35916360	uc008cjg.1	3	-	GCT

Another example where the codon assignments are entirely different from the reference is uc007vnb.1. The start codon is at chr15:37003817-37003820 but the codon file has very different annotations.

$cat mm10.codons.bed | grep 'uc007vnb.1' | head -20

chr15	37007423	37007426	uc007vnb.1	-8	-	CCG
chr15	37007420	37007423	uc007vnb.1	-7	-	CAG
chr15	37007417	37007420	uc007vnb.1	-6	-	GAA
chr15	37007414	37007417	uc007vnb.1	-5	-	GGT
chr15	37007411	37007414	uc007vnb.1	-4	-	GAG
chr15	37007408	37007411	uc007vnb.1	-3	-	GGG
chr15	37007405	37007408	uc007vnb.1	-2	-	CGG
chr15	37007402	37007405	uc007vnb.1	-1	-	GAC
chr15	37007399	37007402	uc007vnb.1	0	-	CGC
chr15	37007396	37007399	uc007vnb.1	1	-	GCG
chr15	37007393	37007396	uc007vnb.1	2	-	GGC
chr15	37007390	37007393	uc007vnb.1	3	-	AAG

Assuming that codon 0 is the start codon in this file, I found that only 3380(7%) of 46355 transcripts have ATG as start codon

cat mm10.codons.bed | awk '{if($5==0)print $5,$7}' | sort | uniq -c

727 0 AAA
    436 0 AAC
    748 0 AAG
    376 0 AAT
    686 0 ACA
    473 0 ACC
    295 0 ACG
    854 0 ACT
   1768 0 AGA
   1212 0 AGC
   1207 0 AGG
   1724 0 AGT
    377 0 ATA
    522 0 ATC
   **3380 0 ATG**
    704 0 ATT
    228 0 CAA
    377 0 CAC
    493 0 CAG
    167 0 CAT
    400 0 CCA
    569 0 CCC
    399 0 CCG
    477 0 CCT
    188 0 CGA
    602 0 CGC
    637 0 CGG
    136 0 CGT
    202 0 CTA
    831 0 CTC
    625 0 CTG
    594 0 CTT
    990 0 GAA
    779 0 GAC
   1888 0 GAG
    585 0 GAT
    921 0 GCA
   1177 0 GCC
   1318 0 GCG
    928 0 GCT
   1856 0 GGA
   1691 0 GGC
   2329 0 GGG
   1037 0 GGT
    523 0 GTA
   1015 0 GTC
   1304 0 GTG
    916 0 GTT
     90 0 TAA
     88 0 TAC
    184 0 TAG
    115 0 TAT
    284 0 TCA
    401 0 TCC
    162 0 TCG
    386 0 TCT
    285 0 TGA
    392 0 TGC
    460 0 TGG
    270 0 TGT
    155 0 TTA
    506 0 TTC
    274 0 TTG
    632 0 TTT

I hope this bug can be fixed so that scikit-ribo can be run on mm10 data.

Bug creating file name for gtfDedup file in gtf_process.py

Hi,

I installed scikit-ribo using pip and am using it with human data. In order to generate the RNA fold file, I am first running gtf_process.py on our GTF file and in doing so obtained the following error:

[status] Reading the input file: XXX/single_CDS_translatedGenes.sorted.gtf
[execute] Starting the pre-processing module
[execute] Loading the the gtf file in to sql db
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/scikit-ribo/gtf_preprocess.py", line 266, in
worker.convertGtf()
File "/usr/local/lib/python3.5/dist-packages/scikit-ribo/gtf_preprocess.py", line 44, in convertGtf
gtfDedup = self.output + "/" + self.prefix + '.dedup.gtf'
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

I realised that the gtf_process.py file that is on github is different to the version that I installed with pip.

I managed to fix the problem by changing:
def init(self, gtf=None, fasta=None, prefix=None, output=None):

to:
def init(self, gtf=None, fasta=None, output=None, prefix=None):

and adding in the lines (which are in the version on github but not the version I installed using pip):
self.base = os.path.basename(self.gtf)
self.prefix = os.path.splitext(self.base)[0]

Regards,
Catherine

Issues with RNAfold

I got stuck with creating the RNAfold input file.Both call_rnafold.py and process_rnafold.py are not added to the path by setup and the documentation is not clear regarding the parameters of this function. Is RNAfold a dependency of scikit-ribo? Thanks for clarifying

gffutils error for bed12 conversion in scikit-ribo-build.py

Hi there,

I am trying to test scikit-ribo on some yeast ribo-seq data we have just generated.
When running scikit-ribo-build.py I get the following error:
File "/home/drew/anaconda3/envs/scikit-ribo/lib/python3.6/site-packages/gffutils/interface.py", line 227, in __getitem__ raise FeatureNotFoundError(key) gffutils.exceptions.FeatureNotFoundError: YDR106W_BY4741

I am using the yeast BY4741 Toronto genome, and a subset of the associated gff file that I converted to gtf using gffread. The subset just includes coding genes (and excludes some other Dubious annotated transcripts). This seemed to solve another issue I was having using the full gff or gtf made from this which were either
pandas.errors.ParserError: Too many columns specified: expected 9 and found 1.
when using a gff, or
KeyError: 'gene_id'
when using a gtf that included noncoding genes (the problem here is with missing "gene_id" descriptors for these kinds of genes or for exon annotations.)
However I ran into the problem described above.

I am not sure why gffutils is not finding the genes by these names (e.g. YDR106W_BY4741) since this is exactly how they are named in the gtf file (obviously, since the geneNames list in gtf_preprocess.py is built from the gtf itself).

Your help would be much appreciated!! Thanks.

strand matching

Investigate if we need to match the strand of the read and the gene during bedtools intersect.
makeTrainingData and makeCdsData
https://github.com/hanfang/scikit-ribo/blob/master/scripts/bam_preprocess.py
Pairing probability as a vector, left/right 5 codons

Reproducing the PA values from the paper

Hi guys - to double check some of my own work, I've been trying to compare it to scikit-ribo's estimates of predicted protein abundance (PA). If I read the paper and code correctly, then it should be possible to get the PA estimates by multiplying the RNAseq TPMs by the TE estimates that scikit-ribo outputs (?). Doing this with your example data however, and comparing it to the proteomics data form Lawless et al, gives me a (much) worse correlation with the proteomics than just plain-old riboseq TPMs from salmon. From the paper, it should be much better, say .83 vs .77 with TPMs alone.

Am I calculating PA values correctly? what exactly should I multiply scikit-ribo's output by to get them?

RNAfold / pre-built files for human

Hi,
I'm testing scikit-ribo on a set of ribo-seq experiments from human cancer cell lines. I am experiencing the same issue reported by @aemoor some months ago while running gtf_preprocess.py (see #2) (error reported below).

[status] Reading the input file: /elaborazioni/sharedCO/SU2C_Data/scores_SU2C_v2/stuff/Homo_sapiens.GRCh37.75.gtf
[execute] Starting the pre-processing module
[execute] Loading the the gtf file in to sql db
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
File "scikit_ribo/gtf_preprocess.py", line 268, in
worker.convertGtf()
File "scikit_ribo/gtf_preprocess.py", line 55, in convertGtf
geneBed12 = self.db.bed12(geneName, name_field='gene_id')
File "/home/matteo.benelli/.local/lib/python3.5/site-packages/gffutils/interface.py", line 1218, in bed12
% (last, feature.stop))
ValueError: End of last exon (39127564) does not match end of feature (39127595)

Did you solve that issue? It would be great if you could share pre-built data for human model (Grch37).
Thanks!

schatzlab / scikit-ribo Goto Github PK

scikit-ribo's Issues

Inconsistence Gene IDs used in gtf_preprocess.py

Incorrect codon files generated by scikit-ribo-build for mm10

Bug creating file name for gtfDedup file in gtf_process.py

Issues with RNAfold

gffutils error for bed12 conversion in scikit-ribo-build.py

strand matching

Reproducing the PA values from the paper

RNAfold / pre-built files for human

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent