schatzlab / scikit-ribo Goto Github PK

View Code? Open in Web Editor NEW

18.0 9.0 8.0 71.12 MB

Accurate estimation and robust modelling of translation dynamics at codon resolution

License: GNU General Public License v2.0

Python 99.75% Shell 0.25%

riboseq glm random-forest ridge-regression glmnet

scikit-ribo's People

Contributors

Stargazers

Watchers

Forkers

wckdouglas langerdan sherkinglee dgelsin hangzhang mt1022 biogeeker harel-coffee

scikit-ribo's Issues

Bug creating file name for gtfDedup file in gtf_process.py

Hi,

I installed scikit-ribo using pip and am using it with human data. In order to generate the RNA fold file, I am first running gtf_process.py on our GTF file and in doing so obtained the following error:

[status] Reading the input file: XXX/single_CDS_translatedGenes.sorted.gtf
[execute] Starting the pre-processing module
[execute] Loading the the gtf file in to sql db
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/scikit-ribo/gtf_preprocess.py", line 266, in
worker.convertGtf()
File "/usr/local/lib/python3.5/dist-packages/scikit-ribo/gtf_preprocess.py", line 44, in convertGtf
gtfDedup = self.output + "/" + self.prefix + '.dedup.gtf'
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

I realised that the gtf_process.py file that is on github is different to the version that I installed with pip.

I managed to fix the problem by changing:
def init(self, gtf=None, fasta=None, prefix=None, output=None):

to:
def init(self, gtf=None, fasta=None, output=None, prefix=None):

and adding in the lines (which are in the version on github but not the version I installed using pip):
self.base = os.path.basename(self.gtf)
self.prefix = os.path.splitext(self.base)[0]

Regards,
Catherine

Issues with RNAfold

I got stuck with creating the RNAfold input file.Both call_rnafold.py and process_rnafold.py are not added to the path by setup and the documentation is not clear regarding the parameters of this function. Is RNAfold a dependency of scikit-ribo? Thanks for clarifying

Incorrect codon files generated by scikit-ribo-build for mm10

I have been trying to use scikit-ribo on mouse ESC Ribo-Seq data but the output files generated from scikit-ribo-build do no correctly parse the given input GFF file. I am using the mm10_knownGene GFF file as input with genome file obtained from http://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/.
Specifically, there is inaccurate assignment of codons of transcripts.

For example, for gene uc008cjg.1, the start codon is at chr17: 35916361-35916363 while in the files mm10.codons.df and mm10.codons.bed, this chromosome location is at codon 2
$ cat mm10.codons.bed | grep 'uc008cjg.1' | head -12

chr17	35916390	35916393	uc008cjg.1	-8	-	CGG
chr17	35916387	35916390	uc008cjg.1	-7	-	CAG
chr17	35916384	35916387	uc008cjg.1	-6	-	GCC
chr17	35916381	35916384	uc008cjg.1	-5	-	GTA
chr17	35916378	35916381	uc008cjg.1	-4	-	CGT
chr17	35916375	35916378	uc008cjg.1	-3	-	CCG
chr17	35916372	35916375	uc008cjg.1	-2	-	GTG
chr17	35916369	35916372	uc008cjg.1	-1	-	TGC
chr17	35916366	35916369	uc008cjg.1	0	-	GTC
chr17	35916363	35916366	uc008cjg.1	1	-	AAG
chr17	35916360	35916363	uc008cjg.1	2	-	ATG
chr17	35916357	35916360	uc008cjg.1	3	-	GCT

Another example where the codon assignments are entirely different from the reference is uc007vnb.1. The start codon is at chr15:37003817-37003820 but the codon file has very different annotations.

$cat mm10.codons.bed | grep 'uc007vnb.1' | head -20

chr15	37007423	37007426	uc007vnb.1	-8	-	CCG
chr15	37007420	37007423	uc007vnb.1	-7	-	CAG
chr15	37007417	37007420	uc007vnb.1	-6	-	GAA
chr15	37007414	37007417	uc007vnb.1	-5	-	GGT
chr15	37007411	37007414	uc007vnb.1	-4	-	GAG
chr15	37007408	37007411	uc007vnb.1	-3	-	GGG
chr15	37007405	37007408	uc007vnb.1	-2	-	CGG
chr15	37007402	37007405	uc007vnb.1	-1	-	GAC
chr15	37007399	37007402	uc007vnb.1	0	-	CGC
chr15	37007396	37007399	uc007vnb.1	1	-	GCG
chr15	37007393	37007396	uc007vnb.1	2	-	GGC
chr15	37007390	37007393	uc007vnb.1	3	-	AAG

Assuming that codon 0 is the start codon in this file, I found that only 3380(7%) of 46355 transcripts have ATG as start codon

cat mm10.codons.bed | awk '{if($5==0)print $5,$7}' | sort | uniq -c

727 0 AAA
    436 0 AAC
    748 0 AAG
    376 0 AAT
    686 0 ACA
    473 0 ACC
    295 0 ACG
    854 0 ACT
   1768 0 AGA
   1212 0 AGC
   1207 0 AGG
   1724 0 AGT
    377 0 ATA
    522 0 ATC
   **3380 0 ATG**
    704 0 ATT
    228 0 CAA
    377 0 CAC
    493 0 CAG
    167 0 CAT
    400 0 CCA
    569 0 CCC
    399 0 CCG
    477 0 CCT
    188 0 CGA
    602 0 CGC
    637 0 CGG
    136 0 CGT
    202 0 CTA
    831 0 CTC
    625 0 CTG
    594 0 CTT
    990 0 GAA
    779 0 GAC
   1888 0 GAG
    585 0 GAT
    921 0 GCA
   1177 0 GCC
   1318 0 GCG
    928 0 GCT
   1856 0 GGA
   1691 0 GGC
   2329 0 GGG
   1037 0 GGT
    523 0 GTA
   1015 0 GTC
   1304 0 GTG
    916 0 GTT
     90 0 TAA
     88 0 TAC
    184 0 TAG
    115 0 TAT
    284 0 TCA
    401 0 TCC
    162 0 TCG
    386 0 TCT
    285 0 TGA
    392 0 TGC
    460 0 TGG
    270 0 TGT
    155 0 TTA
    506 0 TTC
    274 0 TTG
    632 0 TTT

I hope this bug can be fixed so that scikit-ribo can be run on mm10 data.

strand matching

Investigate if we need to match the strand of the read and the gene during bedtools intersect.
makeTrainingData and makeCdsData
https://github.com/hanfang/scikit-ribo/blob/master/scripts/bam_preprocess.py
Pairing probability as a vector, left/right 5 codons

Inconsistence Gene IDs used in gtf_preprocess.py

Whilst using gtf_preprocess.py to create the expandCDS.fasta file, I obtained the following error:

Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/scikit-ribo/gtf_preprocess.py", line 280, in
worker.getSeq()
File "/usr/local/lib/python3.5/dist-packages/scikit-ribo/gtf_preprocess.py", line 154, in getSeq
self.fiveUtrDic[geneName] + self.fastaDic[geneName] + self.threeUtrDic[geneName] + "\n")
KeyError: 'ENSG00000230989'

This appears to be because in the 3utr.fasta, 5tr.fasta and cds.fasta files that were created have, for example, the following as a header:

ENSG00000187961::1:960586-965715(+)

Whereas the variable self.geneNames stores the IDs as only e.g. ENSG00000187961

Given the previous issue that I raised and solved, please can you confirm whether there is a problem in the code that is giving rise to this error?

Many thanks,
Catherine

gffutils error for bed12 conversion in scikit-ribo-build.py

Hi there,

I am trying to test scikit-ribo on some yeast ribo-seq data we have just generated.
When running scikit-ribo-build.py I get the following error:
File "/home/drew/anaconda3/envs/scikit-ribo/lib/python3.6/site-packages/gffutils/interface.py", line 227, in __getitem__ raise FeatureNotFoundError(key) gffutils.exceptions.FeatureNotFoundError: YDR106W_BY4741

I am using the yeast BY4741 Toronto genome, and a subset of the associated gff file that I converted to gtf using gffread. The subset just includes coding genes (and excludes some other Dubious annotated transcripts). This seemed to solve another issue I was having using the full gff or gtf made from this which were either
pandas.errors.ParserError: Too many columns specified: expected 9 and found 1.
when using a gff, or
KeyError: 'gene_id'
when using a gtf that included noncoding genes (the problem here is with missing "gene_id" descriptors for these kinds of genes or for exon annotations.)
However I ran into the problem described above.

I am not sure why gffutils is not finding the genes by these names (e.g. YDR106W_BY4741) since this is exactly how they are named in the gtf file (obviously, since the geneNames list in gtf_preprocess.py is built from the gtf itself).

Your help would be much appreciated!! Thanks.

Reproducing the PA values from the paper

Hi guys - to double check some of my own work, I've been trying to compare it to scikit-ribo's estimates of predicted protein abundance (PA). If I read the paper and code correctly, then it should be possible to get the PA estimates by multiplying the RNAseq TPMs by the TE estimates that scikit-ribo outputs (?). Doing this with your example data however, and comparing it to the proteomics data form Lawless et al, gives me a (much) worse correlation with the proteomics than just plain-old riboseq TPMs from salmon. From the paper, it should be much better, say .83 vs .77 with TPMs alone.

Am I calculating PA values correctly? what exactly should I multiply scikit-ribo's output by to get them?

RNAfold / pre-built files for human

Hi,
I'm testing scikit-ribo on a set of ribo-seq experiments from human cancer cell lines. I am experiencing the same issue reported by @aemoor some months ago while running gtf_preprocess.py (see #2) (error reported below).

[status] Reading the input file: /elaborazioni/sharedCO/SU2C_Data/scores_SU2C_v2/stuff/Homo_sapiens.GRCh37.75.gtf
[execute] Starting the pre-processing module
[execute] Loading the the gtf file in to sql db
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
File "scikit_ribo/gtf_preprocess.py", line 268, in
worker.convertGtf()
File "scikit_ribo/gtf_preprocess.py", line 55, in convertGtf
geneBed12 = self.db.bed12(geneName, name_field='gene_id')
File "/home/matteo.benelli/.local/lib/python3.5/site-packages/gffutils/interface.py", line 1218, in bed12
% (last, feature.stop))
ValueError: End of last exon (39127564) does not match end of feature (39127595)

Did you solve that issue? It would be great if you could share pre-built data for human model (Grch37).
Thanks!

schatzlab / scikit-ribo Goto Github PK

scikit-ribo's People

Contributors

Stargazers

Watchers

Forkers

scikit-ribo's Issues

Bug creating file name for gtfDedup file in gtf_process.py

Issues with RNAfold

Incorrect codon files generated by scikit-ribo-build for mm10

strand matching

Inconsistence Gene IDs used in gtf_preprocess.py

gffutils error for bed12 conversion in scikit-ribo-build.py

Reproducing the PA values from the paper

RNAfold / pre-built files for human

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent