CONCUR (Codon counts from Ribo-seq) is a tool for calculating codon usage from Ribo-seq data.
These instructions will allow you to run CONCUR on your computer. CONCUR is a command line tool developed for Linux and macOS.
You need Perl and bedtools installed on your system.
In addition, you need R and the two R packages pheatmap and RColorBrewer to generate some of the figures. If you choose not to install them, run CONCUR with the --withoutR parameter.
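Before the first run, it can be handy to check that the required programs are on your PATH. The following is a minimal sketch, not part of CONCUR: the helper name `missing_tools` is hypothetical, and it assumes that R is invoked through the `Rscript` executable. It does not check whether the pheatmap and RColorBrewer packages are installed.

```python
import shutil


def missing_tools(required=("perl", "bedtools"), optional=("Rscript",)):
    """Return two lists: required and optional executables not found on PATH."""
    missing_required = [tool for tool in required if shutil.which(tool) is None]
    missing_optional = [tool for tool in optional if shutil.which(tool) is None]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = missing_tools()
    if required:
        print("Missing required tools:", ", ".join(required))
    if optional:
        # Assumption: without Rscript, CONCUR should be run with --withoutR.
        print("Missing optional tools:", ", ".join(optional))
    if not required and not optional:
        print("All tools found.")
```

If only the optional list is non-empty, running CONCUR with --withoutR should still work.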
The following organisms are pre-installed from the Gencode [https://www.gencodegenes.org] project:
- Human - hg38, hg19
- Mouse - mm10, mm9
The following organisms are pre-installed from Ensembl [http://ensemblgenomes.org/]:
- Rat - rn9
- Yeast - sc3
If you need to analyze another organism, you can easily do so provided you have a gtf file with annotated protein-coding genes for that organism. Please see additional instructions below.
Download the latest release (v1.0) from the release tab.
The following commands will install CONCUR in your current directory:
tar xzvf concur-1.0.tar.gz
cd concur-1.0
Verify that the tool is working with the example in the demo directory.
Alternatively, clone this repository for the latest version.
You need a bam file with read alignments, alignments.bam, to run CONCUR. You also need to specify the genome (-g) and an output directory (-o).
perl concur.pl -i alignments.bam -g hg38 -o project_name
The following general parameters are available:
Parameter | Description |
---|---|
-i / --input BAM_FILE | Input bam file [mandatory] |
-g / --genome GENOME | Genome version (e.g., hg38, hg19, mm10, mm9, rn9 or sc3) [mandatory] |
-o / --out FOLDER | Output folder name [mandatory] |
-n / --name FILENAME | Output file name [optional]. Input file name is used by default. |
-w / --withoutR | Run without creating figures using R. This is useful if R is not installed. |
-h / --help | Print help message and quit |
-m / --man | Print help message and quit |
-v / --version | Print version and quit |
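When processing many bam files, it may help to build the command line programmatically. The sketch below only assembles the argument list from the parameters documented in this README; the function name `concur_command` and its keyword names are hypothetical, and it assumes that concur.pl is in the current directory.

```python
import shlex


def concur_command(bam, genome, out_dir, name=None, without_r=False,
                   size=None, reads_min=None, filter_outliers=None,
                   script="concur.pl"):
    """Build the argument list for one CONCUR run (nothing is executed)."""
    cmd = ["perl", script, "-i", bam, "-g", genome, "-o", out_dir]
    if name is not None:
        cmd += ["-n", name]
    if without_r:
        cmd.append("-w")
    if size is not None:
        shortest, longest = size
        cmd += ["-s", f"{shortest}-{longest}"]  # FROM-TO format
    if reads_min is not None:
        cmd += ["-r", str(reads_min)]
    if filter_outliers is not None:
        cmd += ["-f", str(filter_outliers)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to actually start CONCUR.
    print(shlex.join(concur_command("alignments.bam", "hg38", "project_name")))
```

Keeping the command as a list avoids quoting problems with file names that contain spaces.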
The following parameters can be used to change some of the default behaviour:
Parameter | Description |
---|---|
-s / --size FROM-TO | This will alter the fragment size range included in the analysis (described in section 2.1 of the manuscript). The default range is 20-50. Non-informative lengths are automatically detected and excluded and the default range should be suitable for most datasets. [optional] |
-r / --reads_min READS | This parameter sets the minimum number of reads near the TIS required to include a read set in the analysis (described in section 2.1 of the manuscript). The default threshold is 1000 reads. Increasing this threshold may improve the analysis of deeply sequenced libraries by excluding low-quality read sets that may affect the read set validation steps. [optional] |
-f / --filter_outliers THRESHOLD | This option will change the final filtering of the selected read sets. By default, a read set is used in the final codon usage calculations if S_r >= 0.5*S_r^max at the P and A site (described in section 2.2.3 of the manuscript). In a dataset where many read sets have passed the validation filters, this threshold would exclude read sets that are nevertheless outliers compared with the best ones. We believe it is generally useful to apply this filter to focus on the most informative reads. However, the threshold can be lowered if keeping as many read sets as possible is of higher importance (use a threshold <0.5), or increased if stricter filtering is desired (use a threshold >0.5). [optional] |
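To illustrate the effect of the --filter_outliers threshold, here is a small sketch of the rule S_r >= THRESHOLD * S_r^max. It is not CONCUR's actual implementation: it assumes that every read set has one score for the P site and one for the A site, and that the condition must hold at both sites. The read set labels and scores are made up.

```python
def filter_read_sets(scores, threshold=0.5):
    """Keep read sets whose score is at least threshold * max score at both sites.

    scores maps a read-set label to {"P": score, "A": score}.
    """
    if not scores:
        return []
    max_p = max(s["P"] for s in scores.values())
    max_a = max(s["A"] for s in scores.values())
    return [label for label, s in scores.items()
            if s["P"] >= threshold * max_p and s["A"] >= threshold * max_a]


if __name__ == "__main__":
    # Hypothetical scores for three read sets.
    example = {
        "28nt": {"P": 0.9, "A": 0.8},
        "29nt": {"P": 0.5, "A": 0.6},
        "30nt": {"P": 0.3, "A": 0.7},
    }
    for threshold in (0.3, 0.5, 0.7):
        print(threshold, filter_read_sets(example, threshold))
```

In this toy example a lower threshold keeps all three read sets, the default of 0.5 drops the weakest one, and a higher threshold keeps only the best one.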
These instructions will help you to install additional genomes. CONCUR can be run for any organism provided that you have a gtf file containing genes and their coding sequences. If available, CONCUR uses the reading frame information in column 8 of the gtf file; otherwise the frame can be calculated manually.
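For reference, the frame in column 8 of a gtf file is the number of bases (0, 1 or 2) to skip at the start of a CDS segment to reach the first complete codon. The sketch below shows one way to derive it from the CDS segments of a single transcript. It assumes a complete CDS whose first segment starts at the first base of the start codon, and it is not necessarily how CONCUR recalculates the frame; the function name `cds_phases` is hypothetical.

```python
def cds_phases(segments, strand="+"):
    """Return (start, end, phase) per CDS segment in 5'-to-3' order.

    segments: (start, end) pairs, 1-based inclusive genomic coordinates.
    """
    # On the minus strand the 5' end is the segment with the highest coordinates.
    ordered = sorted(segments, reverse=(strand == "-"))
    result = []
    consumed = 0  # coding bases seen in the preceding segments
    for start, end in ordered:
        # Bases needed to finish the codon left open by previous segments.
        phase = (-consumed) % 3
        result.append((start, end, phase))
        consumed += end - start + 1
    return result


if __name__ == "__main__":
    for row in cds_phases([(100, 110), (200, 210), (300, 305)]):
        print(row)
```

Here the first segment is 11 bases long (three codons plus two bases), so the second segment needs one base to complete the open codon and gets frame 1.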
This is the best option if there is a fasta file with the coding sequences available (e.g., Ensembl annotations).
First, download the coding sequence and annotation files (gtf or gff):
wget -O Saccer3.cds.fa.gz ftp://ftp.ensemblgenomes.org/pub/fungi/release-40/fasta/saccharomyces_cerevisiae/cds/Saccharomyces_cerevisiae.R64-1-1.cds.all.fa.gz
wget -O Saccer3.gff.gz ftp://ftp.ensemblgenomes.org/pub/fungi/release-40/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.40.gff3.gz
wget -O Ratnor9.cds.fa.gz ftp://ftp.ensembl.org/pub/release-95/fasta/rattus_norvegicus/cds/Rattus_norvegicus.Rnor_6.0.cds.all.fa.gz
wget -O Ratnor9.gtf.gz ftp://ftp.ensembl.org/pub/release-95/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_6.0.95.gtf.gz
Next, run the installation tool. Use --recalculate if you wish to disregard the reading frame information in column 8 of the gtf/gff file.
perl concur_install_genome.pl --gtf Saccer3.gff.gz --fasta Saccer3.cds.fa.gz --short sc3
perl concur_install_genome.pl --gtf Ratnor9.gtf.gz --fasta Ratnor9.cds.fa.gz --short rn9
This will create two files for yeast: data/sc3.bg.txt and data/sc3.bed.gz, and two files for rat: data/rn9.bg.txt and data/rn9.bed.gz.
This is the best option if there is not a separate fasta file with only the coding sequences available, but there is a fasta file with transcript sequences and information about the CDS position (e.g., Gencode annotations).
First, download the transcript sequence and annotation files (gtf or gff):
wget -O Musmus10.pcg.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.pc_transcripts.fa.gz
wget -O Musmus10.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.primary_assembly.annotation.gtf.gz
wget -O Homsap38.pcg.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.pc_transcripts.fa.gz
wget -O Homsap38.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.primary_assembly.annotation.gtf.gz
Next, run the installation tool. The --pcg flag is used to extract the coding sequences from the full transcript sequences. Use --recalculate if you wish to disregard the reading frame information in column 8 of the gtf/gff file.
perl concur_install_genome.pl --gtf Musmus10.gtf.gz --fasta Musmus10.pcg.fa.gz --short mm10 --pcg
perl concur_install_genome.pl --gtf Homsap38.gtf.gz --fasta Homsap38.pcg.fa.gz --short hg38 --pcg
The --pcg flag assumes that each fasta header line contains a string in the "CDS:61-1041" format. The start and end positions of the coding sequence are retrieved from this string and used to extract the coding sequence from the full transcript sequence. Coding sequences whose length is not a multiple of three nucleotides are not used.
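The extraction described above can be sketched as follows. This is an illustration rather than the code used by concur_install_genome.pl: it assumes that the coordinates are 1-based and inclusive (under that reading, "CDS:61-1041" spans 981 nucleotides, a multiple of three), and the function name `extract_cds` and the example header are hypothetical.

```python
import re

# Matches the "CDS:61-1041" style field in a fasta header line.
CDS_PATTERN = re.compile(r"CDS:(\d+)-(\d+)")


def extract_cds(header, transcript_seq):
    """Return the coding sequence, or None if it cannot be used."""
    match = CDS_PATTERN.search(header)
    if match is None:
        return None
    start, end = int(match.group(1)), int(match.group(2))
    if start < 1 or end > len(transcript_seq) or start > end:
        return None
    # Convert 1-based inclusive coordinates to a Python slice.
    cds = transcript_seq[start - 1:end]
    if len(cds) % 3 != 0:
        return None  # length is not a multiple of three
    return cds


if __name__ == "__main__":
    print(extract_cds(">tx1|CDS:4-12|", "GGGATGAAATAGCC"))
```

For the toy transcript above, positions 4 to 12 give the nine-nucleotide sequence ATGAAATAG.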
This will create two files for mouse: data/mm10.bg.txt and data/mm10.bed.gz, and two files for human: data/hg38.bg.txt and data/hg38.bed.gz.
The current version is 1.0. For other versions, see the releases on this repository.
- Susanne Bornelöv - susbo
This project is licensed under the GNU AGPLv3 License - see the LICENSE.txt file for details.
If you use CONCUR for your work, please cite:
Michaela Frye, Susanne Bornelöv (2020) CONCUR: quick and robust calculation of codon usage from ribosome profiling data, Bioinformatics, btaa733, https://doi.org/10.1093/bioinformatics/btaa733
concur's Issues
Cannot open correlation.csv file?
Thanks a lot for providing this software for A-site counting.
I have tried a lot of software, and this is very user-friendly.
However, I ran into an issue when analyzing the dataset from Nedialkova and Leidel (2015).
Here is my code:
System: Ubuntu 20.04 LTS
Input:
concur perl concur.pl -w -i ../geno_bam/SRR1944912.1_1.fastq.gz.sam.bam.rm.bam.long.bam -g sc3 -o WT_1
Running CONCUR v1.0
Output:
General options
Input file: ../geno_bam/SRR1944912.1_1.fastq.gz.sam.bam.rm.bam.long.bam
Genome: sc3
Output folder: WT_1
Output file name: SRR1944912.1_1.fastq.gz.sam.rm.bam.long.bam
Run without R: TRUE
Analysis options
Fragment size range tested: 20-50
Minimum number of reads: 1000
Outlier removal filter [2.2.3]: 0.5
###############################
[Step 1/10] Mapping genomic reads to transcripts...
[Step 2/10] Calculating periodicity...
[Step 3/10] Predicting offset per read set...
[Step 4/10] Calculating codon frequency per read set...
[Step 5/10] Calculating codon frequency (step 2) per read set...
[Step 7/10] Calculating correlations between read sets...
Cannot open SRR1944912.1_1.fastq.gz.sam.rm.bam.long.bam.correlation.csv
Errors during correlation steps
Hi
I am trying to run this program, but I keep getting errors:
Running CONCUR v1.0
##### General options #######
Input file: /cluster/work/users/sjannies/finalbams2/riboseq_N24_starmap2goldfish.final.csort.bam
Genome: caur
Output folder: test
Output file name: riboseq_N24_starmap2goldfish.final.csort
Run without R: FALSE
##### Analysis options ######
Fragment size range tested: 20-50
Minimum number of reads: 1000
Outlier removal filter [2.2.3]: 0.5
###############################
[Step 1/10] Mapping genomic reads to transcripts...
[Step 2/10] Calculating periodicity...
[Step 3/10] Predicting offset per read set...
[Step 4/10] Calculating codon frequency per read set...
[Step 5/10] Calculating codon frequency (step 2) per read set...
[Step 6/10] Plotting codon correlations per read set...
Error in file(file, "rt") : cannot open the connection
Calls: read.csv -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file 'data/codon_to_AA.csv': No such file or directory
Execution halted
[Step 7/10] Calculating correlations between read sets...
Cannot open riboseq_N24_starmap2goldfish.final.csort.correlation.csv
I have changed the path to the R scripts in concur.pl, because I thought that was the reason.
I have generated a genome for my species, but I am unsure what the data/codon_to_AA.csv file called in the R scripts is - should it have been generated during the genome processing? I only got the bg.txt and bed.gz in the data folder.
I appreciate any advice.
Start position is a negative number
Hi!
My command:
perl concur.pl -i HD.Aligned.sortedByCoord.out.bam -g GRCh38 -o HD.codon.occ -n HD -r 5000
I encountered the following error:
Error: Invalid record in file /dev/fd/63. Record is
chrM -7 21 AE01231101005:9:4P240301014UY1S3242SX:L01:R001C060:0115:7938 CACGATGGATCACAGGTCTATCACCCTA -
Error: Invalid record in file /dev/fd/63. Record is
chrM -3 22 AE01231101005:9:4P240301074UY1S3256SX:L01:R004C037:0067:1228 ATGGATCACAGGTCTATCACCCTAT -
Error in dimnames(x) <- dn : length of 'dimnames' [2] not equal to array extent
Calls: colnames<-
stop
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
Calls: plot -> plot.default -> xy.coords
stop
Meanwhile, the error I encountered resulted in a 'codons.txt' file that contains only 8 columns.
Thanks!