Giter VIP home page Giter VIP logo

concur's Introduction

CONCUR

CONCUR (Codon counts from Ribo-seq) is a tool for calculating codon usage from Ribo-seq data.

Getting Started

These instructions will allow you to run CONCUR on your computer. CONCUR is a command line tool developed for Linux and macOS.

Prerequisites

You will need Perl and bedtools to be installed in your system.

In addition, you need to have R and the two R packages pheatmap and RColorBrewer installed to generate some of the figures. If you chose not to install them, you need to run CONCUR with the --withoutR parameter.

The following organisms are pre-installed from the Gencode [https://www.gencodegenes.org] project:

  • Human - hg38, hg19
  • Mouse - mm10, mm9

The following organism is pre-installed from Ensembl [http://ensemblgenomes.org/]:

  • Rat - rn9
  • Yeast - sc3

If you need to analyze another organism, you can easily do so provided you have a gtf file with annotated protein-coding genes for that organism. Please see additional instructions below.

Installation

Download the latest release (v1.0) from the release tab.

The following commands will install CONCUR in your current directory:

tar xvfc concur-1.0.tar.gz
cd concur-1.0

Verify that the tool is working with the example in the demo directory.

Alternatively, clone this repository for the latest version.

Running CONCUR

You need a bam file with read alignments, alignments.bam, to run CONCUR. You also need to specify the genome (-g) and an output directory (-o).

perl concur.pl -i alignment.bam -g hg38 -o project_name

Parameters

The following general parameters are available

Parameter Description
-i / --input BAM_FILE Input bam file [mandatory]
-g / --genome GENOME Genome version (e.g., hg38, hg19, mm10, mm9, rn9 or sc3) [mandatory]
-o / --out FOLDER Output folder name [mandatory]
-n / --name FILENAME Output file name [optional]. Input file name is used by default.
-w / --withoutR Run without creating figures using R. This is useful if R is not installed.
-h / --help Print help message and quit
-m / --man Print help message and quit
-v / --version Print version and quit

The following parameters can be used to change some of the default behavour

Parameter Description
-s / --size FROM-TO This will alter the fragment size range included in the analysis (described in section 2.1 of the manuscript). The default range is 20-50. Non-informative lengths are automatically detected and excluded and the default range should be suitable for most datasets. [optional]
-r / --reads_min READS This parameter sets the minimum number of reads near the TIS required to include a read set in the analysis (described in section 2.1 of the manuscript). The default threshold is 1000 reads. Increasing this threshold may improve the analysis of deeply sequenced libraries by excluding low-quality read sets that may affect the read set validation steps. [optional]
-f / --filter_outliers THRESHOLD This option will change the final filtering of the selected read sets. By default, a read set is used in the final codon usage calculations if S_r >= 0.5*S_r^max at the P and A site (described in section 2.2.3 of the manuscript). In a dataset where many read sets have passed the validation filters, this threshold would exclude read sets that are nevertheless outliers compared with the best ones. We believe it is generally useful to apply this filter to focus on the most informative reads. However, the threshold can be lowered if keeping as many read sets as possible is of higher importance (use a threshold <0.5), or increased if stricter filtering is desired (use a threshold >0.5). [optional]

Installing Additional Genomes

These instructions will help you to install additional genomes. CONCUR can be run for any organism provided that you have a gtf file containing genes and their coding sequence. If available, CONCUR will use the reading frame information in column 8, otherwise frame can be calculated manually.

Using coding sequences fasta

This is the best option if there is a fasta file with the coding sequences available (e.g., Ensembl annotations).

First, download the coding sequence and annotation files (gtf or gff):

wget -O Saccer3.cds.fa.gz ftp://ftp.ensemblgenomes.org/pub/fungi/release-40/fasta/saccharomyces_cerevisiae/cds/Saccharomyces_cerevisiae.R64-1-1.cds.all.fa.gz
wget -O Saccer3.gff.gz ftp://ftp.ensemblgenomes.org/pub/fungi/release-40/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.40.gff3.gz

wget -O Ratnor9.cds.fa.gz ftp://ftp.ensembl.org/pub/release-95/fasta/rattus_norvegicus/cds/Rattus_norvegicus.Rnor_6.0.cds.all.fa.gz
wget -O Ratnor9.gtf.gz ftp://ftp.ensembl.org/pub/release-95/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_6.1.05.gtf.gz

Next, run the installation tool. Use --recalculate if you wish to disregard the reading frame information in column 8 of the gtf/gff file.

perl concur_install_genome.pl --gtf Saccer3.gff.gz --fasta Saccer3.cds.fa.gz --short sc3
perl concur_install_genome.pl --gtf Ratnor9.gtf.gz --fasta Ratnor9.cds.fa.gz --short rn9

This will create two files for yeast: data/sc3.bg.txt and data/sc3.bed.gz, and two files for rat: data/rn9.bg.txt and data/rn9.bed.gz.

Using protein coding gene fasta

This is the best option if there is not a separate fasta file with only the coding sequences available, but there is a fasta file with transcript sequences and information about the CDS position (e.g., Gencode annotations).

First, download the coding sequence and annotation files (gtf or gff):

wget -O Musmus10.pcg.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.pc_transcripts.fa.gz
wget -O Musmus10.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.primary_assembly.annotation.gtf.gz

wget -O Homsap38.pcg.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.pc_transcripts.fa.gz
wget -O Homsap38.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.primary_assembly.annotation.gtf.gz

Next, run the installation tool. The --pcg flag is used to extract the coding sequences from the full transcript sequences. Use --recalculate if you wish to disregard the reading frame information in column 8 of the gtf/gff file.

perl concur_install_genome.pl --gtf Musmus10.gtf.gz --fasta Musmus10.pcg.fa.gz --short mm10 --pcg
perl concur_install_genome.pl --gtf Homsap38.gtf.gz --fasta Homsap38.pcg.fa.gz --short hg38 --pcg

The --pcg tag will assume that there is a string in the "CDS:61-1041" format in each fasta header line. The start and end position of the coding sequence is retrieved from this string and is used to extract the coding sequence from the full transcript sequence. Coding sequences where the length is not a multiple of three nucleotides will not be used.

This will create two files for mouse: data/mm10.bg.txt and data/mm10.bed.gz, and two files for human: data/hg38.bg.txt and data/hg38.bed.gz.

Version

The current version is 1.0. For other the versions, see the releases on this repository.

Authors

  • Susanne Bornelöv - susbo

License

This project is licensed under the GNU AGPLv3 License - see the LICENSE.txt file for details.

Citation

If you use CONCUR for your work, please cite:

Michaela Frye, Susanne Bornelöv (2020) CONCUR: quick and robust calculation of codon usage from ribosome profiling data, Bioinformatics, bta733, https://doi.org/10.1093/bioinformatics/btaa733

concur's People

Contributors

susbo avatar

Stargazers

 avatar Fernando Duarte avatar  avatar YuciWang99 avatar Sjannie Lefevre avatar Fredrik Svensson avatar Qin Lin avatar

Watchers

James Cloos avatar  avatar

concur's Issues

Cannot open correlation.csv file?

Thanks a lot for providing this software to A-site counting;
I tried a lot of software, this is very user-friendly.
But meet an issue with the analysis of the dataset (Nedialkova and Leidel, 2015).

Here is my code:
System: Ubuntu 20.04 LTS

Input:
concur perl concur.pl -w -i ../geno_bam/SRR1944912.1_1.fastq.gz.sam.bam.rm.bam.long.bam -g sc3 -o WT_1
Running CONCUR v1.0

Output:

General options

Input file: ../geno_bam/SRR1944912.1_1.fastq.gz.sam.bam.rm.bam.long.bam
Genome: sc3
Output folder: WT_1
Output file name: SRR1944912.1_1.fastq.gz.sam.rm.bam.long.bam
Run without R: TRUE

Analysis options

Fragment size range tested: 20-50
Minimum number of reads: 1000
Outlier removal filter [2.2.3]: 0.5
###############################
[Step 1/10] Mapping genomic reads to transcripts...
[Step 2/10] Calculating periodicity...
[Step 3/10] Predicting offset per read set...
[Step 4/10] Calculating codon frequency per read set...
[Step 5/10] Calculating codon frequency (step 2) per read set...
[Step 7/10] Calculating correlations between read sets...

Cannot open SRR1944912.1_1.fastq.gz.sam.rm.bam.long.bam.correlation.csv

Errors during correlation steps

Hi

I am trying to run this program, but I keep getting errors:

Running CONCUR v1.0

#####  General options  #######
Input file: /cluster/work/users/sjannies/finalbams2/riboseq_N24_starmap2goldfish.final.csort.bam
Genome: caur
Output folder: test
Output file name: riboseq_N24_starmap2goldfish.final.csort
Run without R: FALSE
#####  Analysis options  ######
Fragment size range tested: 20-50
Minimum number of reads: 1000
Outlier removal filter [2.2.3]: 0.5
###############################
[Step 1/10] Mapping genomic reads to transcripts...
[Step 2/10] Calculating periodicity...
[Step 3/10] Predicting offset per read set...
[Step 4/10] Calculating codon frequency per read set...
[Step 5/10] Calculating codon frequency (step 2) per read set...
[Step 6/10] Plotting codon correlations per read set...
Error in file(file, "rt") : cannot open the connection
Calls: read.csv -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'data/codon_to_AA.csv': No such file or directory
Execution halted
[Step 7/10] Calculating correlations between read sets...
Cannot open riboseq_N24_starmap2goldfish.final.csort.correlation.csv

I have changed the path to the R scripts in concur.pl, because I thought that was the reason.
I have generated a genome for my species, but I am unsure what the data/codon_to_AA.csv file called in the R scripts is - should it have been generated during the genome processing? I only got the bg.txt and bed.gz in the data folder.

I appreciate any advice.

Start position is a negative number

Hi!

my command:

perl concur.pl -i HD.Aligned.sortedByCoord.out.bam -g GRCh38 -o HD.codon.occ -n HD -r 5000

and encountered an error as follows:

Error: Invalid record in file /dev/fd/63. Record is
chrM -7 21 AE01231101005:9:4P240301014UY1S3242SX:L01:R001C060:0115:7938 CACGATGGATCACAGGTCTATCACCCTA -
Error: Invalid record in file /dev/fd/63. Record is
chrM -3 22 AE01231101005:9:4P240301074UY1S3256SX:L01:R004C037:0067:1228 ATGGATCACAGGTCTATCACCCTAT -
Error in dimnames(x) <- dn : length of 'dimnames' [2] not equal to array extent
Calls: colnames<-
stop
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
Calls: plot -> plot.default -> xy.coords
stop

Meanwhile, the error I encountered resulted in a 'codons.txt' file that contains only 8 columns.

Thanks!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.