CONCUR

CONCUR (Codon counts from Ribo-seq) is a tool for calculating codon usage from Ribo-seq data.

Getting Started

These instructions will allow you to run CONCUR on your computer. CONCUR is a command line tool developed for Linux and macOS.

Prerequisites

You will need Perl and bedtools to be installed in your system.

In addition, you need to have R and the two R packages pheatmap and RColorBrewer installed to generate some of the figures. If you chose not to install them, you need to run CONCUR with the --withoutR parameter.

The following organisms are pre-installed from the Gencode [https://www.gencodegenes.org] project:

Human - hg38, hg19
Mouse - mm10, mm9

The following organism is pre-installed from Ensembl [http://ensemblgenomes.org/]:

Rat - rn9
Yeast - sc3

If you need to analyze another organism, you can easily do so provided you have a gtf file with annotated protein-coding genes for that organism. Please see additional instructions below.

Installation

Download the latest release (v1.0) from the release tab.

The following commands will install CONCUR in your current directory:

tar xvfc concur-1.0.tar.gz
cd concur-1.0

Verify that the tool is working with the example in the demo directory.

Alternatively, clone this repository for the latest version.

Running CONCUR

You need a bam file with read alignments, alignments.bam, to run CONCUR. You also need to specify the genome (-g) and an output directory (-o).

perl concur.pl -i alignment.bam -g hg38 -o project_name

Parameters

The following general parameters are available

Parameter	Description
-i / --input BAM_FILE	Input bam file [mandatory]
-g / --genome GENOME	Genome version (e.g., hg38, hg19, mm10, mm9, rn9 or sc3) [mandatory]
-o / --out FOLDER	Output folder name [mandatory]
-n / --name FILENAME	Output file name [optional]. Input file name is used by default.
-w / --withoutR	Run without creating figures using R. This is useful if R is not installed.
-h / --help	Print help message and quit
-m / --man	Print help message and quit
-v / --version	Print version and quit

The following parameters can be used to change some of the default behavour

Parameter	Description
-s / --size FROM-TO	This will alter the fragment size range included in the analysis (described in section 2.1 of the manuscript). The default range is 20-50. Non-informative lengths are automatically detected and excluded and the default range should be suitable for most datasets. [optional]
-r / --reads_min READS	This parameter sets the minimum number of reads near the TIS required to include a read set in the analysis (described in section 2.1 of the manuscript). The default threshold is 1000 reads. Increasing this threshold may improve the analysis of deeply sequenced libraries by excluding low-quality read sets that may affect the read set validation steps. [optional]
-f / --filter_outliers THRESHOLD	This option will change the final filtering of the selected read sets. By default, a read set is used in the final codon usage calculations if S_r >= 0.5S_r^max at the P and A site (described in section 2.2.3 of the manuscript). In a dataset where many read sets have passed the validation filters, this threshold would exclude read sets that are nevertheless outliers compared with the best ones. We believe it is generally useful to apply this filter to focus on the most informative reads. However, the threshold can be lowered if keeping as many read sets as possible is of higher importance (use a threshold <0.5), or increased if stricter filtering is desired (use a threshold >0.5). [optional*]

Installing Additional Genomes

These instructions will help you to install additional genomes. CONCUR can be run for any organism provided that you have a gtf file containing genes and their coding sequence. If available, CONCUR will use the reading frame information in column 8, otherwise frame can be calculated manually.

Using coding sequences fasta

This is the best option if there is a fasta file with the coding sequences available (e.g., Ensembl annotations).

First, download the coding sequence and annotation files (gtf or gff):

wget -O Saccer3.cds.fa.gz ftp://ftp.ensemblgenomes.org/pub/fungi/release-40/fasta/saccharomyces_cerevisiae/cds/Saccharomyces_cerevisiae.R64-1-1.cds.all.fa.gz
wget -O Saccer3.gff.gz ftp://ftp.ensemblgenomes.org/pub/fungi/release-40/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.40.gff3.gz

wget -O Ratnor9.cds.fa.gz ftp://ftp.ensembl.org/pub/release-95/fasta/rattus_norvegicus/cds/Rattus_norvegicus.Rnor_6.0.cds.all.fa.gz
wget -O Ratnor9.gtf.gz ftp://ftp.ensembl.org/pub/release-95/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_6.1.05.gtf.gz

Next, run the installation tool. Use --recalculate if you wish to disregard the reading frame information in column 8 of the gtf/gff file.

perl concur_install_genome.pl --gtf Saccer3.gff.gz --fasta Saccer3.cds.fa.gz --short sc3
perl concur_install_genome.pl --gtf Ratnor9.gtf.gz --fasta Ratnor9.cds.fa.gz --short rn9

This will create two files for yeast: data/sc3.bg.txt and data/sc3.bed.gz, and two files for rat: data/rn9.bg.txt and data/rn9.bed.gz.

Using protein coding gene fasta

This is the best option if there is not a separate fasta file with only the coding sequences available, but there is a fasta file with transcript sequences and information about the CDS position (e.g., Gencode annotations).

First, download the coding sequence and annotation files (gtf or gff):

wget -O Musmus10.pcg.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.pc_transcripts.fa.gz
wget -O Musmus10.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.primary_assembly.annotation.gtf.gz

wget -O Homsap38.pcg.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.pc_transcripts.fa.gz
wget -O Homsap38.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.primary_assembly.annotation.gtf.gz

Next, run the installation tool. The --pcg flag is used to extract the coding sequences from the full transcript sequences. Use --recalculate if you wish to disregard the reading frame information in column 8 of the gtf/gff file.

perl concur_install_genome.pl --gtf Musmus10.gtf.gz --fasta Musmus10.pcg.fa.gz --short mm10 --pcg
perl concur_install_genome.pl --gtf Homsap38.gtf.gz --fasta Homsap38.pcg.fa.gz --short hg38 --pcg

The --pcg tag will assume that there is a string in the "CDS:61-1041" format in each fasta header line. The start and end position of the coding sequence is retrieved from this string and is used to extract the coding sequence from the full transcript sequence. Coding sequences where the length is not a multiple of three nucleotides will not be used.

This will create two files for mouse: data/mm10.bg.txt and data/mm10.bed.gz, and two files for human: data/hg38.bg.txt and data/hg38.bed.gz.

Version

The current version is 1.0. For other the versions, see the releases on this repository.

Authors

Susanne Bornelöv - susbo

License

This project is licensed under the GNU AGPLv3 License - see the LICENSE.txt file for details.

Citation

If you use CONCUR for your work, please cite:

Michaela Frye, Susanne Bornelöv (2020) CONCUR: quick and robust calculation of codon usage from ribosome profiling data, Bioinformatics, bta733, https://doi.org/10.1093/bioinformatics/btaa733

Errors during correlation steps

I am trying to run this program, but I keep getting errors:

Running CONCUR v1.0

#####  General options  #######
Input file: /cluster/work/users/sjannies/finalbams2/riboseq_N24_starmap2goldfish.final.csort.bam
Genome: caur
Output folder: test
Output file name: riboseq_N24_starmap2goldfish.final.csort
Run without R: FALSE
#####  Analysis options  ######
Fragment size range tested: 20-50
Minimum number of reads: 1000
Outlier removal filter [2.2.3]: 0.5
###############################
[Step 1/10] Mapping genomic reads to transcripts...
[Step 2/10] Calculating periodicity...
[Step 3/10] Predicting offset per read set...
[Step 4/10] Calculating codon frequency per read set...
[Step 5/10] Calculating codon frequency (step 2) per read set...
[Step 6/10] Plotting codon correlations per read set...
Error in file(file, "rt") : cannot open the connection
Calls: read.csv -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'data/codon_to_AA.csv': No such file or directory
Execution halted
[Step 7/10] Calculating correlations between read sets...
Cannot open riboseq_N24_starmap2goldfish.final.csort.correlation.csv

I have changed the path to the R scripts in concur.pl, because I thought that was the reason.
I have generated a genome for my species, but I am unsure what the data/codon_to_AA.csv file called in the R scripts is - should it have been generated during the genome processing? I only got the bg.txt and bed.gz in the data folder.

I appreciate any advice.

susbo / concur Goto Github PK

concur's Introduction

CONCUR

Getting Started

Prerequisites

Installation

Running CONCUR

Parameters

Installing Additional Genomes

Using coding sequences fasta

Using protein coding gene fasta

Version

Authors

License

Citation

concur's People

Contributors

Stargazers

Watchers

concur's Issues

General options

Analysis options

Recommend Projects

Recommend Topics

Recommend Org