
pga's Introduction

PGA: a tool for ProteoGenomics Analysis

PGA is an R package for identifying novel peptides by searching a customized protein database derived from RNA-Seq or DNA-Seq data. It provides functions for constructing customized protein databases from RNA-Seq data (with or without a reference genome) or from DNA-Seq data, and for database searching, post-processing, and report generation. The customized protein database includes both the reference database (such as RefSeq or Ensembl) and the novel peptide sequences derived from the RNA-Seq or DNA-Seq data.

Usage

To learn how to use PGA, please read this document: PGA tutorial. If you have any questions about PGA, please open an issue here: open an issue.

Demo report of PGA output: Demo report.

Global or separate FDR estimation

Global FDR estimation at PSM or peptide level: example

Separate FDR estimation at PSM or peptide level: example
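
As a rough illustration of the difference between the two modes, the sketch below computes a target-decoy FDR on a toy PSM table, first over all PSMs together (global) and then within each peptide class (separate). The data values, the column names (score, is_decoy, class), and the helper estimate_fdr are made up for this example. This is a generic target-decoy calculation, not PGA's own implementation.

```r
# Toy PSM table: all values are invented for illustration only.
psm <- data.frame(
  score    = c(50, 45, 40, 35, 30, 25, 20, 15),
  is_decoy = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE),
  class    = c("known", "known", "novel", "known",
               "known", "novel", "novel", "known")
)

# Hypothetical helper: FDR at each score threshold is estimated as
# (number of decoys) / (number of targets) among PSMs scoring at least that high.
estimate_fdr <- function(d) {
  d <- d[order(-d$score), ]
  d$fdr <- cumsum(d$is_decoy) / pmax(cumsum(!d$is_decoy), 1)
  d
}

# Global: one FDR curve over all PSMs.
global_fdr <- estimate_fdr(psm)

# Separate: one FDR curve per class (known vs. novel).
separate_fdr <- do.call(rbind, lapply(split(psm, psm$class), estimate_fdr))

print(global_fdr)
print(separate_fdr)
```

With separate estimation, the novel class is judged only against its own decoys, so the same score can map to a different FDR than in the global curve.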

Wiki

More about PGA.

Installation

To install PGA:

# Install the development version from GitHub:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
install.packages("remotes")
BiocManager::install("wenbostar/PGA")

Use PGA in docker (recommended):

Find details at https://github.com/wenbostar/PGA/wiki/Use-PGA-docker.

Citation

To cite the PGA package in publications, please use:

Wen B, Xu S, Zhou R, et al. PGA: an R/Bioconductor package for identification of novel peptides using a customized database derived from RNA-Seq. BMC bioinformatics, 2016, 17(1): 244. DOI: 10.1186/s12859-016-1133-3

Wen, B., Xu, S., Sheynkman, G.M., Feng, Q., Lin, L., Wang, Q., Xu, X., Wang, J. and Liu, S., 2014. sapFinder: an R/Bioconductor package for detection of variant peptides in shotgun proteomics experiments. Bioinformatics, 30(21), pp.3136-3138. DOI: 10.1093/bioinformatics/btu397

List of citations

PGA/sapFinder has been cited in the following manuscripts:

  1. Ignatchenko, Alexandr, et al. "Detecting protein variants by mass spectrometry: a comprehensive study in cancer cell-lines." Genome medicine 9.1 (2017): 62.
  2. Luan, Ning, et al. "A combinational strategy upon RNA sequencing and peptidomics unravels a set of novel toxin peptides in scorpion Mesobuthus martensii." Toxins 8.10 (2016): 286.
  3. Ma, Chunwei, et al. "Improvement of peptide identification with considering the abundance of mRNA and peptide." BMC bioinformatics 18.1 (2017): 109.
  4. Zhang, Jia, et al. "GAPP: A Proteogenomic Software for Genome Annotation and Global Profiling of Post-translational Modifications in Prokaryotes." Molecular & Cellular Proteomics 15.11 (2016): 3529-3539.
  5. Proffitt, J. Michael, et al. "Proteomics in non-human primates: utilizing RNA-Seq data to improve protein identification by mass spectrometry in vervet monkeys." BMC genomics 18.1 (2017): 877.
  6. Dimitrakopoulos, Lampros, et al. "Onco-proteogenomics: Multi-omics level data integration for accurate phenotype prediction." Critical reviews in clinical laboratory sciences 54.6 (2017): 414-432.
  7. Ruggles, Kelly V., et al. "Methods, tools and current perspectives in proteogenomics." Molecular & Cellular Proteomics 16.6 (2017): 959-981.
  8. Peng, Xinxin, et al. "A-to-I RNA editing contributes to proteomic diversity in cancer." Cancer cell 33.5 (2018): 817-828.
  9. Zhang, Minying, et al. "RNA editing derived epitopes function as cancer antigens to elicit immune responses." Nature communications 9.1 (2018): 3919.
  10. Low, Teck Yew, et al. "Connecting Proteomics to Next‐Generation Sequencing: Proteogenomics and Its Current Applications in Biology." Proteomics (2018): 1800235.
  11. Wen, Bo, Xiaojing Wang, and Bing Zhang. "PepQuery enables fast, accurate, and convenient proteomic validation of novel genomic alterations." Genome research 29.3 (2019): 485-493.
  12. Lobas, Anna A., et al. "Proteogenomics of malignant melanoma cell lines: the effect of stringency of exome data filtering on variant peptide identification in shotgun proteomics." Journal of proteome research 17.5 (2018): 1801-1811.
  13. Robin, Thibault, et al. "Large-scale reanalysis of publicly available HeLa cell proteomics data in the context of the Human Proteome Project." Journal of proteome research 17.12 (2018): 4160-4170.
  14. Yang, Mingkun, et al. "Genome annotation of a model diatom Phaeodactylum tricornutum using an integrated proteogenomic pipeline." Molecular plant 11.10 (2018): 1292-1307.
  15. Cifani, Paolo, et al. "ProteomeGenerator: A framework for comprehensive proteomics based on de novo transcriptome assembly and high-accuracy peptide mass spectral matching." Journal of proteome research 17.11 (2018): 3681-3692.
  16. Misra, Biswapriya B. "Updates on resources, software tools, and databases for plant proteomics in 2016–2017." Electrophoresis 39.13 (2018): 1543-1557.
  17. Rong, Mingqiang, et al. "The defensive system of tree frog skin identified by peptidomics and RNA sequencing analysis." Amino Acids 51.2 (2019): 345-353.
  18. Rong, Mingqiang, et al. "PPIP: Automated Software for Identification of Bioactive Endogenous Peptides." Journal of proteome research (2019).
  19. Nagaraj, Shivashankar H., et al. "PGTools: a software suite for proteogenomic data analysis and visualization." Journal of proteome research 14.5 (2015): 2255-2266.
  20. Sheynkman, Gloria M., et al. "Proteogenomics: integrating next-generation sequencing and mass spectrometry to characterize human proteomic variation." Annual review of analytical chemistry 9 (2016): 521-545.
  21. Menschaert, Gerben, and David Fenyö. "Proteogenomics from a bioinformatics angle: A growing field." Mass spectrometry reviews 36.5 (2017): 584-599.
  22. Komor, Malgorzata A., et al. "Identification of differentially expressed splice variants by the proteogenomic pipeline Splicify." Molecular & Cellular Proteomics 16.10 (2017): 1850-1863.
  23. Hernandez-Valladares, Maria, et al. "Proteogenomics approaches for studying cancer biology and their potential in the identification of acute myeloid leukemia biomarkers." Expert review of proteomics 14.8 (2017): 649-663.
  24. Ischenko, Dmitry, et al. "Large scale analysis of amino acid substitutions in bacterial proteomics." BMC bioinformatics 17.1 (2016): 450.

Contribution

Contributions to the package are more than welcome.

pga's People

Contributors

mattdowle, shawn-xu, wenbostar

pga's Issues

Using PGA to create custom protein database from trinity.fasta file

Greetings @wenbostar,

I would like to use PGA's createProDB4DenovoRNASeq function to generate a custom protein database for proteomics data searching. Unfortunately, I cannot run PGA because of missing dependencies in R. I get the error message "Error: package ‘rTANDEM’ required by ‘PGA’ could not be found". Could you advise me on how to proceed?

Specifically, I would like to utilize this command below:

# Library
library("PGA")

# Create a custom protein database from the Trinity assembly output FASTA file.
# Forward slashes are used in the Windows paths because a single backslash
# starts an escape sequence inside an R string.
createProDB4DenovoRNASeq(infa = "F:/Trinity/MDAMB231/Trinity_Files/MDAMB231_Trinity_Processed_Sorted.fasta", bool_use_3frame = FALSE,
outmtab = "F:/Trinity/MDAMB231/Trinity_Files/MDAMB231_Novel_Transcripts_ntx.tab",
outfa = "F:/Trinity/MDAMB231/Trinity_Files/MDAMB231_denovo_Proteogeomics_Database.fasta", bool_get_longest = FALSE,
make_decoy = TRUE, decoy_tag = "#REV#", outfile_name = "ProGeo")

Regards,
Parthiban

Error when running PGA

Hi,

I am having a problem when using PGA to generate a protein database from my transcriptome.

Here is the code I used:

library(PGA)

transcript_seq_file <- system.file("extdata/input", "/Users/kim/Desktop/read_counts_overlapping/Transcriptome_Trinity.fasta",package="PGA")
outdb <- createProDB4DenovoRNASeq(infa=transcript_seq_file,outfile_name = "denovo")

Here is the error:

Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector") : 
  cannot open file ''

This is the output of sessionInfo()

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] PGA_1.8.0            rTANDEM_1.18.0       Rcpp_0.12.14         XML_3.98-1.9         data.table_1.10.4-3 
 [6] Biostrings_2.46.0    XVector_0.18.0       GenomicRanges_1.30.0 GenomeInfoDb_1.14.0  IRanges_2.12.0      
[11] S4Vectors_0.16.0     BiocGenerics_0.24.0 

loaded via a namespace (and not attached):
 [1] Biobase_2.38.0             RMySQL_0.10.13             tidyr_0.7.2                bit64_0.9-7               
 [5] splines_3.4.3              Formula_1.2-2              assertthat_0.2.0           latticeExtra_0.6-28       
 [9] blob_1.1.0                 BSgenome_1.46.0            Rsamtools_1.30.0           GenomeInfoDbData_0.99.1   
[13] progress_1.1.2             RSQLite_2.0                backports_1.1.1            lattice_0.20-35           
[17] glue_1.2.0                 digest_0.6.12              RColorBrewer_1.1-2         checkmate_1.8.5           
[21] colorspace_1.3-2           htmltools_0.3.6            Matrix_1.2-12              plyr_1.8.4                
[25] DESeq2_1.18.1              pkgconfig_2.0.1            pheatmap_1.0.8             customProDB_1.18.0        
[29] biomaRt_2.34.0             genefilter_1.60.0          zlibbioc_1.24.0            purrr_0.2.4               
[33] xtable_1.8-2               scales_0.5.0               BiocParallel_1.12.0        htmlTable_1.11.0          
[37] tibble_1.3.4               annotate_1.56.1            ggplot2_2.2.1              AhoCorasickTrie_0.1.0     
[41] GenomicFeatures_1.30.0     SummarizedExperiment_1.8.0 nnet_7.3-12                lazyeval_0.2.1            
[45] survival_2.41-3            magrittr_1.5               memoise_1.1.0              foreign_0.8-69            
[49] prettyunits_1.0.2          tools_3.4.3                matrixStats_0.52.2         stringr_1.2.0             
[53] munsell_0.4.3              locfit_1.5-9.1             cluster_2.0.6              DelayedArray_0.4.1        
[57] AnnotationDbi_1.40.0       bindrcpp_0.2               compiler_3.4.3             rlang_0.1.4               
[61] grid_3.4.3                 RCurl_1.95-4.8             rstudioapi_0.7             VariantAnnotation_1.24.2  
[65] htmlwidgets_0.9            bitops_1.0-6               base64enc_0.1-3            gtable_0.2.0              
[69] DBI_0.7                    R6_2.2.2                   GenomicAlignments_1.14.1   Nozzle.R1_1.1-1           
[73] gridExtra_2.3              rtracklayer_1.38.2         knitr_1.17                 dplyr_0.7.4               
[77] bit_1.1-12                 bindr_0.1                  Hmisc_4.0-3                stringi_1.1.6             
[81] geneplotter_1.56.0         rpart_4.1-11               acepack_1.4.1  

Could someone help me with this, please?
Thank you.
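
One possible explanation for the empty file name in the error, offered as a hedged guess rather than a confirmed diagnosis: system.file() only looks up files shipped inside an installed package and returns an empty string when nothing matches, so wrapping a path from your own disk in it would yield ''. The sketch below demonstrates that behaviour with base R only. The FASTA path is the one from the report above, the "stats" package is used only because it is always installed, and the final call is commented out because it needs PGA and the actual file.

```r
# system.file() searches inside an installed package; a path on your own
# disk is not found there, so the result is the empty string "".
user_path <- "/Users/kim/Desktop/read_counts_overlapping/Transcriptome_Trinity.fasta"
looked_up <- system.file("extdata/input", user_path, package = "stats")
print(looked_up)  # empty string

# For your own data, pass the path directly and check that it exists first.
transcript_seq_file <- user_path
if (!file.exists(transcript_seq_file)) {
  message("File not found: ", transcript_seq_file)
}

# With PGA installed and the file present, the call would then be:
# outdb <- createProDB4DenovoRNASeq(infa = transcript_seq_file,
#                                   outfile_name = "denovo")
```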

Download reference data from UCSC for RefSeq

The CDS and protein data were downloaded from UCSC on the same day that I ran the following code, which produced the warning message shown below:

library(PGA)
annotation_path <- tempdir()
pepfasta <- "~/Downloads/hg19_refGenePro.fa"
CDSfasta <- "~/Downloads/hg19_refGeneCDS.fa"
PrepareAnnotationRefseq2(genome='hg19', CDSfasta, pepfasta, annotation_path,
                         dbsnp=NULL, splice_matrix=FALSE, COSMIC=FALSE)
Build TranscriptDB object (txdb.sqlite) ... 
Download the refGene table ... OK
Download the hgFixed.refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
 done
Prepare gene/transcript/protein id mapping information (ids.RData) ...  done
Prepare exon annotation information (exon_anno.RData) ...  done
Prepare protein sequence (proseq.RData) ...  done
Prepare protein coding sequence (procodingseq.RData)...  done
Warning message:
In .extractCdsLocsFromUCSCTxTable(ucsc_txtable) :
  UCSC data anomaly in 433 transcript(s): the cds cumulative length is not a multiple of 3
  for transcripts ‘NM_033425’ ‘NM_006510’ ‘NM_001146344’ ‘NM_001010890’ ‘NM_001300891’
  ‘NM_001300891’ ‘NM_017940’ ‘NM_002537’ ‘NM_003954’ ‘NM_006510’ ‘NM_001278563’
  ‘NM_001291815’ ‘NM_001359231’ ‘NM_001354658’ ‘NM_001350198’ ‘NM_001243042’
  ‘NM_001243042’ ‘NM_002570’ ‘NM_001128590’ ‘NM_001271870’ ‘NM_001271872’ ‘NM_001329984’
  ‘NM_001037501’ ‘NM_001037675’ ‘NM_001277444’ ‘NM_001351365’ ‘NM_001297654’
  ‘NM_001288952’ ‘NM_001134939’ ‘NM_001301371’ ‘NM_153334’ ‘NM_001348286’ ‘NM_001348208’
  ‘NM_001348208’ ‘NM_001348208’ ‘NM_001348208’ ‘NM_001348208’ ‘NM_001289152’ ‘NM_199349’
  ‘NM_138324’ ‘NM_138323’ ‘NM_138322’ ‘NM_138319’ ‘NM_005671’ ‘NM_001143962’ ‘NM_000500’
  ‘NM_145171’ ‘NM_001318833’ ‘NM_006904� [... truncated]
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Amazon Linux AMI 2018.03

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] PGA_1.13.3           rTANDEM_1.22.1       Rcpp_1.0.1
 [4] XML_3.98-1.20        data.table_1.12.2    Biostrings_2.50.2
 [7] XVector_0.22.0       GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[10] IRanges_2.16.0       S4Vectors_0.20.1     BiocGenerics_0.28.0

loaded via a namespace (and not attached):
 [1] Biobase_2.42.0              httr_1.4.0
 [3] bit64_0.9-7                 assertthat_0.2.1
 [5] BiocManager_1.30.4          blob_1.1.1
 [7] BSgenome_1.50.0             GenomeInfoDbData_1.2.0
 [9] Rsamtools_1.34.1            remotes_2.0.4
[11] progress_1.2.2              pillar_1.4.1
[13] RSQLite_2.1.1               lattice_0.20-38
[15] glue_1.3.1                  digest_0.6.19
[17] RColorBrewer_1.1-2          colorspace_1.4-1
[19] Matrix_1.2-17               plyr_1.8.4
[21] pkgconfig_2.0.2             pheatmap_1.0.12
[23] customProDB_1.22.1          biomaRt_2.38.0
[25] zlibbioc_1.28.0             purrr_0.3.2
[27] scales_1.0.0                processx_3.3.1
[29] BiocParallel_1.16.6         tibble_2.1.3
[31] ggplot2_3.2.0               AhoCorasickTrie_0.1.0
[33] SummarizedExperiment_1.12.0 GenomicFeatures_1.34.8
[35] lazyeval_0.2.2              magrittr_1.5
[37] crayon_1.3.4                memoise_1.1.0
[39] ps_1.3.0                    MASS_7.3-51.4
[41] RMariaDB_1.0.6.9000         tools_3.5.3
[43] prettyunits_1.0.2           hms_0.4.2
[45] matrixStats_0.54.0          stringr_1.4.0
[47] munsell_0.5.0               DelayedArray_0.8.0
[49] AnnotationDbi_1.44.0        ade4_1.7-13
[51] compiler_3.5.3              rlang_0.3.4
[53] grid_3.5.3                  RCurl_1.95-4.12
[55] VariantAnnotation_1.28.13   bitops_1.0-6
[57] gtable_0.3.0                curl_3.3
[59] DBI_1.0.0.9001              R6_2.4.0
[61] GenomicAlignments_1.18.1    Nozzle.R1_1.1-1
[63] dplyr_0.8.1                 rtracklayer_1.42.2
[65] seqinr_3.4-5                bit_1.1-14
[67] readr_1.3.1                 stringi_1.4.3
[69] tidyselect_0.2.5

rTANDEM is unavailable

Hi,

I could not install PGA because rTANDEM has been removed from Bioconductor. Could you please let me know how to solve this?

Bioconductor version 3.12 (BiocManager 1.30.16), R 4.0.4 (2021-02-15)
Installing package(s) 'rTANDEM'
package ‘rTANDEM’ is not available for Bioconductor version '3.12'

Thanks
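
Before attempting the install, it can help to see which dependencies are already missing from the local library. The sketch below uses only base R. The package names are the two that appear in the sessionInfo() outputs on this page (rTANDEM and customProDB), and the list is illustrative rather than PGA's full dependency set.

```r
# Dependencies to look for (taken from the sessionInfo() outputs on this page;
# not a complete list of what PGA needs).
deps <- c("rTANDEM", "customProDB")

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missing_deps <- deps[!installed]

if (length(missing_deps) > 0) {
  message("Missing packages: ", paste(missing_deps, collapse = ", "))
} else {
  message("All listed dependencies are installed.")
}
```

If rTANDEM cannot be installed for the Bioconductor release in use, the Docker route recommended in the Installation section avoids installing it locally.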

Error with dbCreator

Hello,

I just tried to use the PGA package with my own data and got an error in dbCreator:

library(BSgenome.Hsapiens.UCSC.hg19)
dbfile <- dbCreator(gtfFile=gtffile,vcfFile=vcffile,bedFile=bedfile,annotation_path=annotation_path,outfile_name=outfile_name,genome=Hsapiens,outdir=outfile_path)
Error in y[, z] : subscript out of bounds
This is the VCF header:

##fileformat=VCFv4.1
##source=VarScan2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth of quality bases">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Indicates if record is a somatic mutation">
##INFO=<ID=SS,Number=1,Type=String,Description="Somatic status of variant (0=Reference,1=Germline,2=Somatic,3=LOH, or 5=Unknown)">
##INFO=<ID=SSC,Number=1,Type=String,Description="Somatic score in Phred scale (0-255) derived from somatic p-value">
##INFO=<ID=GPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor+normal versus no variant for Germline calls">
##INFO=<ID=SPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor versus normal for Somatic/LOH calls">
##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=<ID=DP4,Number=1,Type=String,Description="Strand read counts: ref/fwd, ref/rev, var/fwd, var/rev">
#CHROM POS ID REF ALT QUAL
chr1 10043516 . A G 0.01
chr1 10044378 . A G 0.01

iarc@iarc-H8QG6:/Processed_data/gabriel/amrita/tfRNA/EVs/derfinder-pipe$ head /New_data/pancreatic/new_data/junctions.bed
chr1 12697 13221 JUNC_0 0 + 12697 13221 255,0,0 2 50,50 0,575
chr1 14829 14970 JUNC_1 1 - 14829 14970 255,0,0 2 50,50 0,192
chr1 14829 15021 JUNC_2 0 - 14829 15021 255,0,0 2 50,50 0,243
chr1 15012 25233 JUNC_3 0 + 15012 25233 255,0,0 2 50,50 0,10272
chr1 15038 15796 JUNC_4 3 - 15038 15796 255,0,0 2 50,50 0,809
chr1 15947 16607 JUNC_5 3 - 15947 16607 255,0,0 2 50,50 0,711
chr1 16765 16854 JUNC_6 0 - 16765 16854 255,0,0 2 50,50 0,140
chr1 16765 16858 JUNC_7 0 - 16765 16858 255,0,0 2 50,50 0,144
chr1 17055 17233 JUNC_8 0 - 17055 17233 255,0,0 2 50,50 0,229
chr1 17055 17606 JUNC_9 0 - 17055 17606 255,0,0 2 50,50 0,602

iarc@iarc-H8QG6:/Processed_data/gabriel/amrita/tfRNA/EVs/derfinder-pipe$ head /New_data/pancreatic/new_data/merged.gtf
chr1 Cufflinks exon 852260 857889 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.6.1"; class_code "u"; tss_id "TSS1";
chr1 Cufflinks exon 857991 858025 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; oId "CUFF.6.1"; class_code "u"; tss_id "TSS1";
chr1 Cufflinks exon 878110 878438 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 878633 878757 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 879078 879188 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "3"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 879288 880180 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "4"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 880422 880526 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "5"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 880898 881033 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "6"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 881527 881666 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "7"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";
chr1 Cufflinks exon 881782 881925 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "8"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";

The annotation directory was prepared with PrepareAnnotationRefseq2.
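
One thing worth checking, as a hedged suggestion rather than a confirmed cause of the "subscript out of bounds" error: the #CHROM line pasted above stops at QUAL, whereas the VCF format defines eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The pasted header may simply have been trimmed for the post. The sketch below, in base R only, reports which of the fixed columns are absent from a file. The helper name missing_vcf_columns is made up, and the toy file reuses the lines shown in the report.

```r
# The eight fixed columns defined by the VCF format.
vcf_required <- c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")

# Hypothetical helper: return the fixed columns missing from the #CHROM header line.
missing_vcf_columns <- function(path) {
  lines <- readLines(path)
  header <- lines[startsWith(lines, "#CHROM")]
  if (length(header) == 0) stop("no #CHROM header line found in ", path)
  cols <- strsplit(header[1], "[ \t]+")[[1]]
  setdiff(vcf_required, cols)
}

# Toy file built from the lines pasted in the report above.
tmp <- tempfile(fileext = ".vcf")
writeLines(c("##fileformat=VCFv4.1",
             "#CHROM POS ID REF ALT QUAL",
             "chr1 10043516 . A G 0.01"), tmp)

missing_cols <- missing_vcf_columns(tmp)
print(missing_cols)
```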

PGAP_Genus_Species_problem

Hello.
Description

I want to use PGAP for structural annotation of nearly 2,000 genomes from Actinomyces. My list contains a mixture of species. When I use the -s organism option with only the genus, -s "Actinomyces", I get the error message "Fall to complete".

The documentation says that it is possible to give only the genus (Actinomyces), but this does not work. To complete the job, you must give both the genus and the species.

Could you help me with this?

Thanks

PrepareAnnotationRefseq2 error

I encountered an error while running a command from the tutorial with
R version 3.4.4,
customProDB_1.18.0, and PGA_1.8.1.
The command was:

transcript_ids <- c("NM_001126112", "NM_033360", "NR_073499")
pepfasta <- system.file("extdata", "refseq_pro_seq.fasta",
                        package="customProDB")
CDSfasta <- system.file("extdata", "refseq_coding_seq.fasta",
                        package="customProDB")
annotation_path <- tempdir()
PrepareAnnotationRefseq2(genome='hg19', CDSfasta, pepfasta, annotation_path,
                          dbsnp=NULL, transcript_ids=transcript_ids,
                          splice_matrix=FALSE, COSMIC=FALSE)

Build TranscriptDB object (txdb.sqlite) ...
Error in .tablename2track(tablename, session) :
UCSC table "refGene" is not supported

Install PGA on Windows10

R 3.6.1

After installing all the necessary dependencies by hand, the installer reports that PGA requires R version 4.0.0 or later.
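
A quick way to confirm the mismatch before installing anything is to compare the running R version against the minimum reported above. The sketch below uses base R only, and the 4.0.0 threshold is taken from the message described in this report.

```r
# Minimum R version reported by the installer in this issue.
required_version <- "4.0.0"

# getRversion() returns the running R version as a comparable object.
version_ok <- getRversion() >= required_version

if (version_ok) {
  message("R ", getRversion(), " meets the minimum of ", required_version)
} else {
  message("R ", getRversion(), " is older than ", required_version,
          "; upgrade R or use the Docker image.")
}
```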


Exception in thread "main" java.lang.NumberFormatException:

I ran rTANDEM successfully, but when I run parserGear I get the following error. What could be the problem?

parserGear(file = idfile, db = dbfile, decoyPrefix="#REV#",xmx=1,thread=4,outdir = "parser_outdir")
Exception in thread "main" java.lang.NumberFormatException: For input string: "PT5786.951S"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at java.lang.Double.valueOf(Double.java:502)
at cn.bgi.MainRun.run(MainRun.java:317)
at cn.bgi.MainRun.main(MainRun.java:58)
Warning message:
In system(command = tandemparser, intern = TRUE) :
running command '/opt/exp_soft/miniconda3/bin/java -Xmx1G -jar "/home/oknjav001/R/x86_64-pc-linux-gnu-library/3.6/PGA/parser4PGA.jar" ".//180731_C0_CH_T016BCG.mzML_xtandem.xml" "/scratch/oknjav001/transcriptomics/proteogenomics/dbcreation/customdb/T042PPD_merge-var.fasta" "pga" "parser_outdir" "#REV#" "VAR" 1 4' had status 1

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so

locale:
[1] LC_CTYPE=en_ZA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_ZA.UTF-8 LC_COLLATE=en_ZA.UTF-8
[5] LC_MONETARY=en_ZA.UTF-8 LC_MESSAGES=en_ZA.UTF-8
[7] LC_PAPER=en_ZA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_ZA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] PGA_1.16.0 rTANDEM_1.26.0 Rcpp_1.0.1
[4] XML_3.99-0.3 data.table_1.12.8 Biostrings_2.54.0
[7] XVector_0.26.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
[10] IRanges_2.20.2 S4Vectors_0.24.4 BiocGenerics_0.32.0

loaded via a namespace (and not attached):
[1] Biobase_2.46.0 httr_1.4.1
[3] bit64_0.9-7 assertthat_0.2.1
[5] askpass_1.1 BiocFileCache_1.10.2
[7] blob_1.2.1 BSgenome_1.54.0
[9] GenomeInfoDbData_1.2.2 Rsamtools_2.2.3
[11] progress_1.2.2 pillar_1.4.4
[13] RSQLite_2.2.0 lattice_0.20-38
[15] glue_1.3.1 digest_0.6.25
[17] RColorBrewer_1.1-2 colorspace_1.4-1
[19] Matrix_1.2-17 plyr_1.8.6
[21] pkgconfig_2.0.3 pheatmap_1.0.12
[23] customProDB_1.26.1 biomaRt_2.42.1
[25] zlibbioc_1.32.0 purrr_0.3.4
[27] scales_1.1.0 processx_3.4.2
[29] BiocParallel_1.20.1 tibble_3.0.1
[31] openssl_1.4.1 ggplot2_3.3.0
[33] AhoCorasickTrie_0.1.0 ellipsis_0.3.0
[35] SummarizedExperiment_1.16.1 GenomicFeatures_1.38.2
[37] magrittr_1.5 crayon_1.3.4
[39] memoise_1.1.0 ps_1.3.2
[41] MASS_7.3-51.4 tools_3.6.0
[43] prettyunits_1.1.1 hms_0.5.3
[45] lifecycle_0.2.0 matrixStats_0.56.0
[47] stringr_1.4.0 munsell_0.5.0
[49] DelayedArray_0.12.3 AnnotationDbi_1.48.0
[51] ade4_1.7-15 compiler_3.6.0
[53] rlang_0.4.6 grid_3.6.0
[55] RCurl_1.98-1.2 VariantAnnotation_1.32.0
[57] rappdirs_0.3.1 bitops_1.0-6
[59] gtable_0.3.0 DBI_1.1.0
[61] curl_4.3 R6_2.4.1
[63] GenomicAlignments_1.22.1 Nozzle.R1_1.1-1
[65] dplyr_0.8.5 rtracklayer_1.46.0
[67] seqinr_3.6-1 bit_1.1-15.2
[69] readr_1.3.1 stringi_1.4.3
[71] vctrs_0.2.4 dbplyr_1.4.3
[73] tidyselect_1.0.0
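
The string that fails to parse, "PT5786.951S", looks like an ISO 8601 duration (5786.951 seconds), while the stack trace shows the parser calling Double.parseDouble, which accepts only a plain number. As a sketch of what a conversion could look like, the base R function below turns a time-only duration of the form PT...H...M...S into seconds. The function name is made up, and whether rewriting such values in the input is an acceptable workaround for the parser is an assumption, not something confirmed in this thread.

```r
# Hypothetical helper: convert a time-only ISO 8601 duration such as
# "PT5786.951S" or "PT1H2M3.5S" into a number of seconds.
iso8601_duration_to_seconds <- function(x) {
  if (!grepl("^PT([0-9.]+H)?([0-9.]+M)?([0-9.]+S)?$", x)) {
    stop("not a PT...H...M...S duration: ", x)
  }
  # Extract the number that directly precedes a given unit letter, or 0 if absent.
  get_part <- function(unit) {
    m <- regmatches(x, regexpr(paste0("[0-9.]+(?=", unit, ")"), x, perl = TRUE))
    if (length(m) == 0) 0 else as.numeric(m)
  }
  get_part("H") * 3600 + get_part("M") * 60 + get_part("S")
}

seconds <- iso8601_duration_to_seconds("PT5786.951S")
print(seconds)  # 5786.951
```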

PrepareAnnotationEnsembl2 error with dbSNP enabled and related dbCreator error

I used the PrepareAnnotationEnsembl2 function to prepare annotation files with Ensembl release 101. An error occurs when dbSNP is enabled.

ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl",
                   host="aug2020.archive.ensembl.org", path="/biomart/martservice",
                   archive=FALSE)

PrepareAnnotationEnsembl2(mart = ensembl, "D:/data_analysis/KIRC_multiomics/KIRC_rna_seq_rsem/PGA/ensembl_annotation",
                         splice_matrix = T,
                         dbsnp = "snp151", COSMIC = FALSE)

Prepare gene/transcript/protein id mapping information (ids.RData) ...  done
Build TranscriptDB object (txdb.sqlite) ... 
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... FAILED! (=> skipped)
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
 done
Prepare exon annotation information (exon_anno.RData) ...  done
Prepare protein coding sequence (procodingseq.RData)...  done
Prepare protein sequence (proseq.RData) ...  done
Prepare dbSNP information (dbsnpinCoding.RData) ... Error in function (type, msg, asError = TRUE)  : 
  transfer closed with outstanding read data remaining
In addition: Warning message:
In .infer_chrominfo_from_transcripts_and_splicings(transcripts$tx_chrom,  :
  chromosome lengths and circularity flags are not available for this TxDb object

I then skipped this error and ran dbCreator.

##input files
vcf = paste(getwd(),list.files(getwd(),pattern = "\\.vcf$"),sep = "/")
tab = paste(getwd(),list.files(getwd(),pattern = "\\.tab$"),sep = "/")
gtf = paste(getwd(),list.files(getwd(),pattern = "\\.gtf$"),sep = "/")

##dbcreator
dbCreator(gtfFile = gtf, vcfFile  = vcf, tabFile = tab, 
          annotation_path = "D:/data_analysis/KIRC_multiomics/KIRC_rna_seq_rsem/PGA/ensembl_annotation",
          outfile_name = "a382607", outdir = "D:/data_analysis/KIRC_multiomics/KIRC_rna_seq_rsem/PGA/out",
          genome=Hsapiens,debug = T)

Output abberant protein FASTA file caused by short INDEL...  done
Output variation table and variant protein sequence caused bySNVs... Error in h(simpleError(msg, call)) : 
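  error in evaluating the argument 'x' in selecting a method for function 'unique': error in evaluating the argument 'x' in selecting a method for function 'substr': error in evaluating the argument 'x' in selecting a method for function 'translate': input must be a single non-NA string
In addition: There were 19 warnings (use warnings() to see them)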

I wonder whether this error is related to the failed construction of the SNP database by the PrepareAnnotationEnsembl2 function. Here is the workflow I used to generate the VCF, transcript GTF, and junction tab files.

####hisat2 build index
##extract exon 
python /home/rchenubuntu/software/hisat2-2.2.1/extract_exons.py /media/rchenubuntu/5618DC8B18DC6B8D/database/genome_and_annotation/ensembl_re101/Homo_sapiens.GRCh38.101.gtf > /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_index/genome.exon
##extract splice site
python /home/rchenubuntu/software/hisat2-2.2.1/extract_splice_sites.py /media/rchenubuntu/5618DC8B18DC6B8D/database/genome_and_annotation/ensembl_re101/Homo_sapiens.GRCh38.101.gtf > /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_index/genome.ss
##build index
hisat2-build -p 48 /media/rchenubuntu/5618DC8B18DC6B8D/database/genome_and_annotation/ensembl_re101/Homo_sapiens.GRCh38.dna.primary_assembly.fa --exon /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_index/genome.exon --ss /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_index/genome.ss /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_index/ensembl101


###hisat2 align  	JUNCTION.tab 
hisat2 -x /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_index/ensembl101 -p 40 -1 /media/rchenubuntu/5618DC8B18DC6B8D/clean_data/A_382607_1_clean.fq.gz -2 /media/rchenubuntu/5618DC8B18DC6B8D/clean_data/A_382607_2_clean.fq.gz -S /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607.sam --novel-splicesite-outfile /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607.tab

##samtools view 
samtools view -@ 46 -S -b /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607.sam > /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607.bam
##samtools sort
samtools sort -@ 46 /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607.bam -O /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607_sorted.bam

##cufflinks     TRANSCRIPT.gtf
cufflinks /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607_sorted.bam --library-type fr-firststrand -p 46 -g /media/rchenubuntu/5618DC8B18DC6B8D/database/genome_and_annotation/ensembl_re101/Homo_sapiens.GRCh38.101.gtf -o /media/rchenubuntu/5618DC8B18DC6B8D/cufflinks_output/xzm

###call SNPs and short indels
samtools mpileup -uf /media/rchenubuntu/5618DC8B18DC6B8D/database/genome_and_annotation/ensembl_re101/Homo_sapiens.GRCh38.dna.primary_assembly.fa /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607_sorted.bam | bcftools call -mv > /media/rchenubuntu/5618DC8B18DC6B8D/bcf_call_output/A382607.vcf

Running own data example

Could you give an example of how to run one's own data through this wonderful pipeline? This information is missing from the tutorial.

functions to use in loading own data

Hi,

I have a problem loading VCF, GTF, and BED files with this package. Could the README include a section that guides users through running their own data, in addition to the example data that comes with the package? I desperately need your help with this. I followed the procedure in the tutorial, but it does not work with my data.

Thanks,
Javan

Reverse peptides are not from reverse complement strands.

Hi,

We used the protein database generated by PGA and found that the 3 reverse reading frames were not derived from the reverse-complement strands of the transcriptome, but were the forward translations written backwards.

For example, I have this nucleotide sequence and want to find its 6-frame translation:

TRINITY_DN125758_c12_g1_short
GGTCAAGGTAATAAAGGTCAAGGTGAAATATCAAAAAGGTAATAAAAAAAAACCACCATTTTCTCAGAAACCCCTTGAGCTACAGTCACCAAATTTAT

I ran the code:

short <- "/Users/kim/Desktop/sample_short.fasta"
outdb <- createProDB4DenovoRNASeq(infa=short, bool_use_3frame = FALSE, bool_get_longest = FALSE, outfile_name = "short_6_frame")

And this was the result:

DNO|NTX1|TRINITY_DN125758_c12_g1_short|+|F2|1
VKVIKVKVKYQKGNKKKPPFSQKPLELQSPNL
#REV#DNO|NTX1|TRINITY_DN125758_c12_g1_short|+|F2|1
LNPSQLELPKQSFPPKKKNGKQYKVKVKIVKV

There were two problems:

  • It only gave out 2 peptides from frame 2 instead of 6 from all 3 frames.
  • The REV peptide was the same as the forward one written backwards. I was expecting the REV (from frame 2) to be something like this (which was obtained by translating the reverse complement of the initial nucleotide sequence from frame 2):

TRINITY_DN125758_c12_g1_short_reverse_complement
ATAAATTTGGTGACTGTAGCTCAAGGGGTTTCTGAGAAAATGGTGGTTTTTTTTTATTACCTTTTTGATATTTCACCTTGACCTTTATTACCTTGACC

REV_frame2
-IW-L-LKGFLRKWWFFFITFLIFHLDLYYLD

Please let me know how to fix this.

Thanks a lot.
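
The difference described above can be checked independently of PGA. The following is a minimal sketch, assuming the Biostrings package is installed; `translate_frame` is a hypothetical helper, not a PGA function. It translates the three forward frames and the three frames of the reverse complement, and builds the reversed string of forward frame 2 for comparison. Note that Biostrings writes stop codons as `*`, whereas the listing above uses `-`.

```r
library(Biostrings)

# Contig from the report above.
contig <- DNAString("GGTCAAGGTAATAAAGGTCAAGGTGAAATATCAAAAAGGTAATAAAAAAAAACCACCATTTTCTCAGAAACCCCTTGAGCTACAGTCACCAAATTTAT")

# Hypothetical helper: translate one reading frame (1, 2 or 3),
# dropping trailing bases that do not form a complete codon.
translate_frame <- function(dna, frame) {
  n_codons <- (length(dna) - frame + 1) %/% 3
  as.character(translate(subseq(dna, start = frame, width = 3 * n_codons)))
}

# Three frames on the forward strand.
forward <- vapply(1:3, function(f) translate_frame(contig, f), character(1))

# Three frames on the reverse-complement strand.
revcomp_frames <- vapply(
  1:3,
  function(f) translate_frame(reverseComplement(contig), f),
  character(1)
)

# Forward frame 2 written backwards (a plain string reversal).
reversed_frame2 <- intToUtf8(rev(utf8ToInt(forward[2])))

print(forward)
print(revcomp_frames)
print(reversed_frame2)
```

In this sketch `reversed_frame2` equals the `#REV#` entry in the output above, while `revcomp_frames[2]` equals the expected `REV_frame2` sequence with `*` in place of `-`.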

Avoid stopping the installation for warning messages

> BiocManager::install("wenbostar/PGA")
Bioconductor version 3.9 (BiocManager 1.30.10), R 3.6.0 (2019-04-26)
Installing github package(s) 'wenbostar/PGA'
Downloading GitHub repo wenbostar/PGA@master
√  checking for file 'C:\Users\wb\AppData\Local\Temp\Rtmp6fyOgl\remotes269073a869ae\wenbostar-PGA-98cac29/DESCRIPTION' (934ms)
-  preparing 'PGA': (601ms)
√  checking DESCRIPTION meta-information ... 
-  checking for LF line-endings in source and make files and shell scripts
-  checking for empty or unneeded directories
-  building 'PGA_1.15.1.tar.gz'
   
* installing *source* package 'PGA' ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
Error: (converted from warning) package 'IRanges' was built under R version 3.6.1
Execution halted
ERROR: lazy loading failed for package 'PGA'
* removing 'e:/R/R-3.6.0/library/PGA'
* restoring previous 'e:/R/R-3.6.0/library/PGA'
Error: Failed to install 'PGA' from GitHub:
  (converted from warning) installation of package ‘C:/Users/**/AppData/Local/Temp/***/PGA_1.15.1.tar.gz’ had non-zero exit status
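
The `(converted from warning)` lines indicate that a warning was promoted to an error during installation. A possible workaround, sketched here under the assumption that the installation goes through the remotes package (as the `remotes269073a869ae` temporary directory in the log suggests), is to set the `R_REMOTES_NO_ERRORS_FROM_WARNINGS` environment variable before installing. This has not been tested against the setup shown in this log.

```r
# Ask remotes not to turn warnings into errors for the current R session.
Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS = "true")

# Install as described in the README.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("wenbostar/PGA")
```

The warning itself comes from IRanges having been built under R 3.6.1 while R 3.6.0 is running, so updating R would likely remove it at the source.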

Error when preparing annotation files

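Hello,

I am trying to build a customized protein database with PGA. During a trial run of the package, the annotation preparation step failed, so I cannot proceed with the downstream analysis.
1. Using PrepareAnnotationRefseq2 (following the customProDB documentation), the commands and errors were as follows: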
> library(customProDB)
> library(PGA)
> transcript_ids <- c("NM_001126112", "NM_033360", "NR_073499", "NM_004448",
+                     "NM_000179", "NR_029605", "NM_004333", "NM_001127511")
> pepfasta <- system.file("extdata", "refseq_pro_seq.fasta",
+                         package="customProDB")
> CDSfasta <- system.file("extdata", "refseq_coding_seq.fasta",
+                         package="customProDB")
> annotation_path <- tempdir()
> PrepareAnnotationRefseq2(genome='hg19', CDSfasta, pepfasta, annotation_path,dbsnp = NULL, transcript_ids=transcript_ids,splice_matrix=FALSE, ClinVar=FALSE)
Build TranscriptDB object (txdb.sqlite) ...
Download the refGene table ... OK
Download the hgFixed.refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
done
Prepare gene/transcript/protein id mapping information (ids.RData) ... Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
  line 27180 did not have 16 elements
In addition: Warning message:
closing unused connection 3 ()

2. Using PrepareAnnotationEnsembl2, the commands and errors were as follows:
> ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl",
+                    host="sep2015.archive.ensembl.org", path="/biomart/martservice",
+                    archive=FALSE)
> annotation_path <- tempdir()
> transcript_ids <- c("ENST00000234420", "ENST00000269305", "ENST00000445888",
+                     "ENST00000257430", "ENST00000508376", "ENST00000288602",
+                     "ENST00000269571", "ENST00000256078", "ENST00000384871")
> PrepareAnnotationEnsembl2(mart=ensembl, annotation_path=annotation_path,splice_matrix=FALSE, dbsnp=NULL,transcript_ids=transcript_ids, COSMIC=FALSE)
Prepare gene/transcript/protein id mapping information (ids.RData) ... done
Build TranscriptDB object (txdb.sqlite) ...
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... FAILED! (=> skipped)
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
done
Prepare exon annotation information (exon_anno.RData) ... done
Prepare protein coding sequence (procodingseq.RData)... done
Prepare protein sequence (proseq.RData) ... done
Warning messages:
1: In download.file(url, destfile, quiet = TRUE) :
  InternetOpenUrl failed: '拒绝登录申请' (login request denied)
2: In .infer_chrominfo_from_transcripts_and_splicings(transcripts$tx_chrom, :
  chromosome lengths and circularity flags are not available for this TxDb object

Similarly, PrepareAnnotationEnsembl and PrepareAnnotationRefseq from the customProDB package also fail with errors.

Could you please help me work out how to resolve this?

Thank you.

Customized database construction for de novo transcript assembly result

Hello,

I am a graduate bioinformatics student at Hood College, working with the Bioconductor package PGA to identify novel peptides from a human brain sample.

I prepared the customized database without genome information, following the section "Based on the result from de novo assembly of RNASeq data without a reference genome".

I ran:

transcript_seq_file <- system.file("extdata/input", "Trinity.fasta",
                                   package="PGA")
outdb <- createProDB4DenovoRNASeq(infa=transcript_seq_file,
                                  outfile_name = "denovo")

valid models = 61
#Unique models = 51
#Estimated false positives = 3 +/- 2

Post-processing
I ran all the code successfully but could not generate the report.

I get this message when I try to run it:
SNV(DB) didn't exist!

I don't understand at which step I should include this SNV database. Please help me.

dbCreator error report

While running dbCreator with tabFile input, the following error is reported.
Thanks for any advice.

> dbCreator(gtfFile=NULL,vcfFile=NULL,bedFile=NULL,tabFile=tabfile,
+           annotation_path=annotation_path,outfile_name=outfile_name,
+           genome=Rnorvegicus,outdir=outfile_path)
Output novel junction peptides... Loading required package: GenomicFeatures
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with 'browseVignettes()'. To cite
    Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'.

Error in `$<-.data.frame`(`*tmp*`, "V4", value = "+") : 
  replacement has 1 row, data has 0
> traceback()
5: stop(sprintf(ngettext(N, "replacement has %d row, data has %d", 
       "replacement has %d rows, data has %d"), N, nrows), domain = NA)
4: `$<-.data.frame`(`*tmp*`, "V4", value = "+")
3: `$<-`(`*tmp*`, "V4", value = "+")
2: Tab2Range(tabFile)
1: dbCreator(gtfFile = NULL, vcfFile = NULL, bedFile = NULL, tabFile = tabfile, 
       annotation_path = annotation_path, outfile_name = outfile_name, 
       genome = Rnorvegicus, outdir = outfile_path)

Here is the session info.
I downloaded PGA from wenbostar/PGA.

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 19

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomicFeatures_1.40.0              AnnotationDbi_1.50.0               
 [3] Biobase_2.48.0                      BSgenome.Rnorvegicus.UCSC.rn6_1.4.1
 [5] BSgenome_1.56.0                     rtracklayer_1.48.0                 
 [7] PGA_1.15.1                          rTANDEM_1.27.0                     
 [9] Rcpp_1.0.4.6                        XML_3.99-0.3                       
[11] data.table_1.12.8                   Biostrings_2.56.0                  
[13] XVector_0.28.0                      GenomicRanges_1.40.0               
[15] GenomeInfoDb_1.24.0                 IRanges_2.22.2                     
[17] S4Vectors_0.26.1                    BiocGenerics_0.34.0                
[19] biomaRt_2.44.0                     

loaded via a namespace (and not attached):
 [1] httr_1.4.1                  bit64_0.9-7                 assertthat_0.2.1           
 [4] askpass_1.1                 BiocFileCache_1.12.0        blob_1.2.1                 
 [7] GenomeInfoDbData_1.2.3      Rsamtools_2.4.0             progress_1.2.2             
[10] pillar_1.4.4                RSQLite_2.2.0               lattice_0.20-41            
[13] glue_1.4.1                  digest_0.6.25               RColorBrewer_1.1-2         
[16] colorspace_1.4-1            Matrix_1.2-18               plyr_1.8.6                 
[19] pkgconfig_2.0.3             pheatmap_1.0.12             customProDB_1.28.0         
[22] zlibbioc_1.34.0             purrr_0.3.4                 scales_1.1.1               
[25] processx_3.4.2              BiocParallel_1.22.0         tibble_3.0.1               
[28] openssl_1.4.1               generics_0.0.2              ggplot2_3.3.1              
[31] AhoCorasickTrie_0.1.0       ellipsis_0.3.1              SummarizedExperiment_1.18.1
[34] magrittr_1.5                crayon_1.3.4                memoise_1.1.0              
[37] ps_1.3.3                    MASS_7.3-51.6               tools_4.0.0                
[40] prettyunits_1.1.1           hms_0.5.3                   lifecycle_0.2.0            
[43] matrixStats_0.56.0          stringr_1.4.0               munsell_0.5.0              
[46] DelayedArray_0.14.0         ade4_1.7-15                 compiler_4.0.0             
[49] rlang_0.4.6                 grid_4.0.0                  RCurl_1.98-1.2             
[52] rstudioapi_0.11             VariantAnnotation_1.34.0    rappdirs_0.3.1             
[55] bitops_1.0-6                gtable_0.3.0                DBI_1.1.0                  
[58] curl_4.3                    R6_2.4.1                    GenomicAlignments_1.24.0   
[61] Nozzle.R1_1.1-1             dplyr_1.0.0                 seqinr_3.6-1               
[64] bit_1.1-15.2                readr_1.3.1                 stringi_1.4.6              
[67] vctrs_0.3.0                 dbplyr_1.4.4                tidyselect_1.1.0           
> 

Here is the console history

> library("biomaRt")
> library("PGA")
Loading required package: IRanges
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap,
    parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call,
    duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted,
    lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: Biostrings
Loading required package: XVector

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

    strsplit

Loading required package: data.table
data.table 1.12.8 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com

Attaching package: ‘data.table’

The following object is masked from ‘package:GenomicRanges’:

    shift

The following object is masked from ‘package:IRanges’:

    shift

The following objects are masked from ‘package:S4Vectors’:

    first, second

Loading required package: rTANDEM
Loading required package: XML
Loading required package: Rcpp
> setwd("/home/rpm/evo/RS/PGA")
> ensembl <- biomaRt::useMart("ENSEMBL_MART_ENSEMBL", dataset="rnorvegicus_gene_ensembl",
+                             host="www.ensembl.org", path="/biomart/martservice",
+                             archive=FALSE)
Ensembl site unresponsive, trying useast mirror
Ensembl site unresponsive, trying uswest mirror
Ensembl site unresponsive, trying uswest mirror
Ensembl site unresponsive, trying asia mirror
> annotation_path <- tempdir()
> PrepareAnnotationEnsembl2(mart=ensembl, annotation_path=annotation_path,
+                           splice_matrix=TRUE, dbsnp=NULL, transcript_ids=NULL,
+                           COSMIC=FALSE)
Prepare gene/transcript/protein id mapping information (ids.RData) ...  done
Build TranscriptDB object (txdb.sqlite) ... 
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... FAILED! (=> skipped)
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
 done
Prepare exon annotation information (exon_anno.RData) ...  done
Prepare protein coding sequence (procodingseq.RData)...  done
Prepare protein sequence (proseq.RData) ...  done
Prepare exon splice information (splicemax.RData) ...  done
Warning message:
In .infer_chrominfo_from_transcripts_and_splicings(transcripts$tx_chrom,  :
  chromosome lengths and circularity flags are not available for this TxDb object
> library("BSgenome.Rnorvegicus.UCSC.rn6")
Loading required package: BSgenome
Loading required package: rtracklayer
> tabfile <- "/home/rpm/evo/RS/PGA/jum"
> outfile_path<-"db/"
> outfile_name<-"test"
> dbCreator(gtfFile=NULL,vcfFile=NULL,bedFile=NULL,tabFile=tabfile,
+           annotation_path=annotation_path,outfile_name=outfile_name,
+           genome=Rnorvegicus,outdir=outfile_path)
Output novel junction peptides... Loading required package: GenomicFeatures
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with 'browseVignettes()'. To cite
    Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'.

Error in `$<-.data.frame`(`*tmp*`, "V4", value = "+") : 
  replacement has 1 row, data has 0
> traceback()
5: stop(sprintf(ngettext(N, "replacement has %d row, data has %d", 
       "replacement has %d rows, data has %d"), N, nrows), domain = NA)
4: `$<-.data.frame`(`*tmp*`, "V4", value = "+")
3: `$<-`(`*tmp*`, "V4", value = "+")
2: Tab2Range(tabFile)
1: dbCreator(gtfFile = NULL, vcfFile = NULL, bedFile = NULL, tabFile = tabfile, 
       annotation_path = annotation_path, outfile_name = outfile_name, 
       genome = Rnorvegicus, outdir = outfile_path)
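
The message `replacement has 1 row, data has 0` is what base R reports when a column is assigned to a data frame with zero rows. That suggests `Tab2Range` ended up with an empty table, for example because `tabFile` is empty or no rows survived parsing. The sketch below reproduces the message in base R and counts the non-blank lines of a file. `count_nonempty_lines` is a hypothetical helper, and the internals of `Tab2Range` are not shown in this thread, so this only narrows the problem down.

```r
# Assigning a column to a zero-row data frame reproduces the reported message.
empty <- data.frame(V1 = character(0), V2 = integer(0), V3 = integer(0))
msg <- tryCatch({
  empty$V4 <- "+"
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Hypothetical helper: number of non-blank lines in a text file.
count_nonempty_lines <- function(path) {
  stopifnot(file.exists(path))
  sum(nzchar(trimws(readLines(path, warn = FALSE))))
}

# Demonstration on temporary files standing in for a real tabFile.
# The rows are arbitrary placeholders, not the real tabFile layout.
empty_file <- tempfile(fileext = ".tab")
invisible(file.create(empty_file))
filled_file <- tempfile(fileext = ".tab")
writeLines(c("chr1\t100\t200", "chr1\t300\t400"), filled_file)

print(count_nonempty_lines(empty_file))
print(count_nonempty_lines(filled_file))
```

Calling `count_nonempty_lines(tabfile)` on the path passed to `dbCreator` shows whether the input is empty before the call is made.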
