russel88 / crisprcastyper


CCTyper: Automatic detection and subtyping of CRISPR-Cas operons

Home Page: https://typer.crispr.dk

License: MIT License

Python 99.37% Shell 0.63%

crispr-analysis cas crispr crispr-cas crispr-cas9 bioinformatics

crisprcastyper's Introduction

CRISPRCasTyper

Detect CRISPR-Cas genes and arrays, and predict the subtype based on both Cas genes and CRISPR repeat sequence.

CRISPRCasTyper and RepeatType are also available through a webserver

This software finds Cas genes with a large suite of HMMs, then groups these HMMs into operons, and predicts the subtype of the operons based on a scoring scheme. Furthermore, it finds CRISPR arrays with minced and by BLASTing a large suite of known repeats, and using a kmer-based machine learning approach (extreme gradient boosting trees) it predicts the subtype of the CRISPR arrays based on the consensus repeat. It then connects the Cas operons and CRISPR arrays, producing as output:

CRISPR-Cas loci, with consensus subtype prediction based on both Cas genes (mostly) and CRISPR consensus repeats
Orphan Cas operons, and their predicted subtype
Orphan CRISPR arrays, and their predicted associated subtype

It includes the following 50 subtypes/variants (find typing scheme here):

I-A, I-B, I-C, I-D, I-E, I-F, I-F (transposon), I-G, II-A, II-B, II-C, II-C2, II-D, III-A, III-B, III-C, III-D, III-E, III-F, IV-A1, IV-A2, IV-A3, IV-B, IV-C, IV-D, IV-E, V-A, V-B1, V-B2, V-C, V-D, V-E, V-F1, V-F2, V-F3, V-F (the rest), V-G, V-H, V-I, V-J, V-K, V-L, V-M, VI-A, VI-B1, VI-B2, VI-C, VI-D, VI-X, VI-Y.
All subtypes from the most recent Nature Reviews Microbiology (Makarova et al. 2020): Evolutionary classification of CRISPR–Cas systems: a burst of class 2 and derived variants
Updated type IV subtypes and variants based on: Type IV CRISPR–Cas systems are highly diverse and involved in competition between plasmids
Type V-K: RNA-guided DNA insertion with CRISPR-associated transposases
Transposon associated type I-F: Transposon-encoded CRISPR–Cas systems direct RNA-guided DNA integration
New V-A variants: Novel Type V-A CRISPR Effectors Are Active Nucleases with Expanded Targeting Capabilities
New Cas13s: Programmable RNA editing with compact CRISPR–Cas13 systems from uncultivated microbes
V-L (cas12l): A new family of CRISPR-type V nucleases with C-rich PAM recognition
V-M (cas12m): The miniature CRISPR-Cas12m effector binds DNA to block transcription
II-D and II-C2: Compact Cas9d and HEARO enzymes for genome editing discovered from uncultivated microbes

It can automatically draw gene maps of CRISPR-Cas systems, orphan Cas operons, and CRISPR arrays in vector graphics format for direct use in scientific manuscripts.

Citation

Jakob Russel, Rafael Pinilla-Redondo, David Mayo-Muñoz, Shiraz A. Shah, Søren J. Sørensen - CRISPRCasTyper: Automated Identification, Annotation and Classification of CRISPR-Cas loci. The CRISPR Journal Dec 2020

Find a free to read version on BioRxiv

Quick start
Installation
CRISPRCasTyper - How to
- Plotting
RepeatType - How to
- Updated models
RepeatType - Train
Troubleshoot

Quick start

conda create -n cctyper -c conda-forge -c bioconda -c russel88 cctyper
conda activate cctyper
cctyper my.fasta my_output

Installation

CRISPRCasTyper can be installed either through conda or pip.

It is advised to use conda, since this installs CRISPRCasTyper and all dependencies, and downloads the database in one go.

Conda

Use miniconda or anaconda to install.

Create the environment with CRISPRCasTyper and all dependencies and database

conda create -n cctyper -c conda-forge -c bioconda -c russel88 cctyper

pip

If you have the dependencies (Python >= 3.8, HMMER >= 3.2, Prodigal >= 2.6, minced, grep, sed) in your PATH you can install with pip

Install cctyper python module

python -m pip install cctyper

Upgrade cctyper python module to the latest version

python -m pip install cctyper --upgrade

When installing with pip, you need to download the database manually:

# Download and unpack
svn checkout https://github.com/Russel88/CRISPRCasTyper/trunk/data
tar -xvzf data/Profiles.tar.gz
mv Profiles/ data/
rm data/Profiles.tar.gz

# Tell CRISPRCasTyper where the data is:
# either by setting an environment variable (has to be done for each terminal session, or added to .bashrc):
export CCTYPER_DB="/path/to/data/"
# or by using the --db argument each time you run CRISPRCasTyper:
cctyper input.fa output --db /path/to/data/
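
If CRISPRCasTyper reports that it cannot find the database, a quick check of the folder can help. The following is a minimal sketch, not part of CRISPRCasTyper: the function name `find_missing_db_files` is hypothetical, and the two file names are only the ones mentioned elsewhere in this README (the real database contains more files).

```python
import os
from pathlib import Path

# File names taken from the "Use new model" sections of this README.
# Assumption: the real database holds more files than these two.
EXPECTED_FILES = ("type_dict.tab", "xgb_repeats.model")


def find_missing_db_files(db_dir=None):
    """Return the expected files that are absent from the database folder."""
    # Fall back to the CCTYPER_DB environment variable when no path is given
    db_dir = db_dir or os.environ.get("CCTYPER_DB")
    if not db_dir:
        raise RuntimeError("CCTYPER_DB is not set and no directory was given")
    root = Path(db_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Database directory not found: {root}")
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means both files were found; it does not prove that the database is complete.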

CRISPRCasTyper - How to

CRISPRCasTyper takes as input a nucleotide fasta, and produces outputs with CRISPR-Cas predictions

Activate environment

conda activate cctyper

Run with a nucleotide fasta as input

cctyper genome.fa my_output

If you have a complete circular genome (each entry in the fasta will be treated as having circular topology)

cctyper genome.fa my_output --circular

For metagenome assemblies and short contigs/plasmids/phages, change the prodigal mode

The default prodigal mode expects the input to be a single draft or complete genome

cctyper assembly.fa my_output --prodigal meta

Check the different options

cctyper -h

Output

CRISPR_Cas.tab: CRISPR_Cas loci, with consensus subtype prediction
- Contig: Sequence accession
- Operon: Operon ID (Sequence accession @ NUMBER)
- Operon_Pos: [Start, End] of operon
- Prediction: Consensus prediction based on both Cas operon and CRISPR arrays
- CRISPRs: CRISPRs adjacent to Cas operon
- Distances: Distances to CRISPRs from Cas operon
- Prediction_Cas: Subtype prediction based on Cas operon
- Prediction_CRISPRs: Subtype prediction of CRISPRs based on CRISPR repeat sequences
cas_operons.tab: All certain Cas operons
- Contig: Sequence accession
- Operon: Operon ID (Sequence accession @ NUMBER)
- Start: Start of operon
- End: End of operon
- Prediction: Subtype prediction
- Complete_Interference: Percent completion of the interference module(s). Can be a list if best_type is a list (Hybrid and Ambiguous)
- Complete_Adaptation: Percent completion of the adaptation module(s). Can be a list if best_type is a list (Hybrid and Ambiguous)
- Best_type: Subtype with the highest score. If the score is high then Prediction = Best_type
- Best_score: Score of the highest scoring subtype
- Genes: List of Cas genes
- Positions: List of Gene IDs for the genes
- E-values: List of E-values for the genes
- CoverageSeq: List of sequence coverages for the genes
- CoverageHMM: List of HMM coverages for the genes
- Strand_Interference: Strand of interference module. 1 is positive strand, -1 is negative strand, 0 is mixed, NA if no interference gene found
- Strand_Adaptation: Strand of adaptation module. 1 is positive strand, -1 is negative strand, 0 is mixed, NA if no adaptation gene found
crisprs_all.tab: All CRISPR arrays, also false positives
- Contig: Sequence accession
- CRISPR: CRISPR ID (minced: Sequence accession _ NUMBER; repeatBLAST: Sequence accession - NUMBER _ NUMBER)
- Start: Start of CRISPR
- End: End of CRISPR
- Consensus_repeat: Consensus repeat sequence
- N_repeats: Number of repeats
- Repeat_len: Length of repeat sequences
- Spacer_len_avg: Average spacer length
- Repeat_identity: Average identity of repeat sequences
- Spacer_identity: Average identity of spacer sequences
- Spacer_len_sem: Standard error of the mean of spacer lengths
- Trusted: TRUE/FALSE, is the array trusted. Based on repeat/spacer identity, spacer sem, prediction probability and adjacency to a cas operon
- Prediction: Prediction of the associated subtype based on the repeat sequence
- Subtype: Subtype with highest prediction probability. Prediction = Subtype if Subtype_probability is high
- Subtype_probability: Probability of subtype prediction
crisprs_near_cas.tab: CRISPRs part of CRISPR-Cas loci
- Same columns as crisprs_all.tab
crisprs_orphan.tab: Orphan CRISPRs (those not in CRISPR_Cas.tab)
- Same columns as crisprs_all.tab
crisprs_putative.tab: Low quality CRISPRs. Most likely false positives
- Same columns as crisprs_all.tab
cas_operons_orphan.tab: Orphan Cas operons (those not in CRISPR_Cas.tab)
- Same columns as cas_operons.tab
CRISPR_Cas_putative.tab: Putative CRISPR_Cas loci, often lonely Cas genes next to a CRISPR array
- Same columns as CRISPR_Cas.tab
cas_operons_putative.tab: Putative Cas operons, mostly false positives, but also some ambiguous and partial systems
- Same columns as cas_operons.tab
spacers/*.fa: Fasta files with all spacer sequences
hmmer.tab: All HMM vs. ORF matches, unfiltered results
- Hmm: HMM name
- ORF: ORF name (Sequence accession _ Gene ID)
- tlen: ORF length
- qlen: HMM length
- Eval: E-value of alignment
- score: Alignment score
- start: ORF start
- end: ORF end
- Acc: Sequence accession
- Pos: Gene ID
- Cov_seq: Sequence coverage
- Cov_hmm: HMM coverage
- strand: Coding strand is like input (1) or reverse complement (-1)
genes.tab: All genes and their positions
- Contig: Sequence accession
- Start: Start of ORF
- End: End of ORF
- Strand: Coding strand is like input (1) or reverse complement (-1)
- Pos: Gene ID
arguments.tab: File with arguments given to CRISPRCasTyper
hmmer.log: Error messages from HMMER (only produced if any errors were encountered)

If run with `--keep_tmp` the following is also produced

prodigal.log: Log from prodigal
proteins.faa: Protein sequences
hmmer/*.tab: Alignment output from HMMER for each Cas HMM
minced.out: CRISPR array output from minced
blast.tab: BLAST output from repeat alignment against flanking regions of cas operons
Flank....: Fasta of flanking regions near cas operons and BLAST database of this

Notes on output

Files are only created if there is any data. For example, the CRISPR_Cas.tab file is only created if there are any CRISPR-Cas loci.
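
Because output files can be absent, downstream scripts should not assume they exist. The sketch below is one way to handle that. It assumes the .tab files are tab-separated with a header row matching the column names listed above; the helper names `load_tab` and `trusted_orphan_crisprs` are hypothetical and not part of CRISPRCasTyper.

```python
import csv
from pathlib import Path


def load_tab(output_dir, filename):
    """Read an output table into a list of dicts; return [] if the file is absent."""
    path = Path(output_dir) / filename
    if not path.is_file():
        return []
    with path.open(newline="") as handle:
        # Assumption: tab-separated values with a single header row
        return list(csv.DictReader(handle, delimiter="\t"))


def trusted_orphan_crisprs(output_dir):
    """Return the rows of crisprs_orphan.tab whose Trusted column is true."""
    rows = load_tab(output_dir, "crisprs_orphan.tab")
    # Compare case-insensitively, since the exact spelling (TRUE/True) is assumed
    return [row for row in rows if row.get("Trusted", "").upper() == "TRUE"]
```

For example, `load_tab("my_output", "CRISPR_Cas.tab")` returns an empty list when no CRISPR-Cas loci were found.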

Plotting

CRISPRCasTyper will automatically plot a map of the CRISPR-Cas loci, orphan Cas operons, and orphan CRISPR arrays.

These maps can be expanded (--expand N) by adding unknown genes and genes with alignment scores below the thresholds. This can help identify potentially un-annotated genes in operons. You can generate new plots without re-running the entire pipeline by adding --redo_typing to the command. This re-uses the mappings, re-types the operons, and re-makes the plot based on the new thresholds and plot parameters.

The plot below is run with --expand 5000

Arrays are in alternating black/white displaying the actual number of repeats/spacers, and with their predicted subtype association based on the consensus repeat sequence.
The interference module is in yellow.
The adaptation module is in blue.
Cas6 is in red.
Accessory genes are in purple
Genes with alignment scores below the thresholds are lighter and with parentheses around names.
Unknown genes are in gray (the number matches the genes.tab file)

RepeatTyper - How to

With an input of CRISPR repeats (one per line, in a simple textfile) RepeatTyper will predict the subtype, based on the kmer composition of the repeat

Activate environment

conda activate cctyper

Run with a simple textfile, containing only CRISPR repeats (in capital letters), one repeat per line.

repeatType repeats.txt
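
If your repeats come from mixed sources, it may help to normalise them before running the command above. This is a minimal sketch: the function `clean_repeats` is hypothetical, and rejecting everything except A, C, G and T is an assumption of this example (the README only asks for capital letters, one repeat per line).

```python
def clean_repeats(lines):
    """Uppercase repeats, drop blank lines, and reject non-ACGT characters."""
    cleaned = []
    for number, line in enumerate(lines, start=1):
        repeat = line.strip().upper()
        if not repeat:
            continue  # skip empty lines
        # Assumption: only unambiguous nucleotides are wanted
        if set(repeat) - set("ACGT"):
            raise ValueError(f"line {number}: unexpected characters in {repeat!r}")
        cleaned.append(repeat)
    return cleaned
```

The cleaned list can then be written to repeats.txt, one repeat per line.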

Output

The script prints:

Repeat sequence
Predicted subtype
Probability of prediction

Notes on output

Predictions with probabilities below 0.75 are uncertain, and should be taken with a grain of salt.
Prior to version 1.4.0 the curated repeatTyper model was included in CCTyper
From version 1.4.0 and onwards updated repeatTyper models are included in CCTyper (see more information in the section below)
The following subtypes are included in the updated model as of December 2022:
- I-A, I-B, I-C, I-D, I-E, I-F, I-F (Transposon), I-G
- II-A, II-B, II-C
- III-A, III-B, III-C, III-D, III-E, III-F
- IV-A1, IV-A2, IV-A3, IV-D, IV-E
- V-A, V-B1, V-E, V-F1, V-F2, V-F3, V-F (the rest), V-G, V-I, V-J, V-K
- VI-A, VI-B1, VI-B2, VI-C, VI-D
This is the accuracy per subtype (on an unseen test dataset):
I-A 0.76
I-B 0.81
I-C 0.97
I-D 0.86
I-E 0.95
I-F 0.96
I-F_T 0.99
I-G 0.89
II-A 0.92
II-B 0.97
II-C 0.90
III-A 0.82
III-B 0.68
III-C 0.60
III-D 0.59
III-E 1.00
III-F 0.25
IV-A1 0.85
IV-A2 0.68
IV-A3 0.96
IV-D 0.85
IV-E 0.92
V-A 1.00
V-B1 0.90
V-E 0.30
V-F 0.87
V-F1 0.87
V-F2 0.90
V-F3 0.90
V-G 0.67
V-I 0.80
V-J 0.63
V-K 0.99
VI-A 0.96
VI-B1 0.96
VI-B2 1.00
VI-C 0.67
VI-D 0.97
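
The 0.75 cut-off mentioned above can be applied to saved RepeatTyper output with a few lines of Python. This sketch assumes each printed line holds three tab-separated fields (repeat, subtype, probability); that layout is an assumption, so check it against your own output first. The function name `split_by_confidence` is hypothetical.

```python
UNCERTAIN_BELOW = 0.75  # threshold from the notes above


def split_by_confidence(lines, threshold=UNCERTAIN_BELOW):
    """Split predictions into (confident, uncertain) lists of tuples."""
    confident, uncertain = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # not a prediction line
        repeat, subtype, probability = fields
        try:
            prob = float(probability)
        except ValueError:
            continue  # header or otherwise unexpected line
        target = confident if prob >= threshold else uncertain
        target.append((repeat, subtype, prob))
    return confident, uncertain
```

Keep the per-subtype accuracies in mind as well: a high probability for a subtype with low test accuracy (for example III-F or V-E) still deserves caution.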

Updated RepeatTyper models

The CCTyper webserver crowdsources subtyped repeats and includes an updated RepeatTyper model that is based on a much larger set of repeats and contains additional subtypes compared to the curated RepeatTyper model. This updated model is automatically retrained each month, and the models can be downloaded here.

From version 1.4.0 and onwards of CCTyper the newest repeatTyper model is included upon release of the version.

Each model contains a training report (xgb_report), where you can find the training log and, at the bottom, the accuracy, both overall and per subtype.

Use new model in CRISPRCasTyper

Save the original database files:

mv ${CCTYPER_DB}/type_dict.tab ${CCTYPER_DB}/type_dict_orig.tab
mv ${CCTYPER_DB}/xgb_repeats.model ${CCTYPER_DB}/xgb_repeats_orig.model

Move the new model into the database folder

mv repeat_model/* ${CCTYPER_DB}/

CRISPRCasTyper and RepeatTyper will now use the new model for repeat prediction!

RepeatTyper - Train

You can train the repeat classifier with your own set of subtyped repeats. With a tab-delimited input where the first column contains the subtypes and the second column contains the CRISPR repeat sequences, RepeatTrain will train a CRISPR repeat classifier that is directly usable for both RepeatTyper and CRISPRCasTyper.
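
A two-column training file can be produced from Python as sketched below. The helper `write_training_table` is hypothetical; writing no header row and uppercasing the repeats are assumptions of this example, not documented requirements of RepeatTrain.

```python
import csv


def write_training_table(pairs, path):
    """Write (subtype, repeat) pairs as a two-column, tab-delimited file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for subtype, repeat in pairs:
            # Column 1: subtype, column 2: repeat sequence
            writer.writerow([subtype.strip(), repeat.strip().upper()])
```

For example, `write_training_table([("I-E", "GTGTTCCCCGCGCCAGCGGGGATAAACCG")], "typed_repeats.tab")` writes one line with the subtype, a tab, and the repeat.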

Train

repeatTrain typed_repeats.tab my_classifier

Use new model in RepeatTyper

repeatType repeats.txt --db my_classifier

Use new model in CRISPRCasTyper

Save the original database files:

mv ${CCTYPER_DB}/type_dict.tab ${CCTYPER_DB}/type_dict_orig.tab
mv ${CCTYPER_DB}/xgb_repeats.model ${CCTYPER_DB}/xgb_repeats_orig.model

Move the new model into the database folder

mv my_classifier/* ${CCTYPER_DB}/

CRISPRCasTyper and RepeatTyper will now use the new model for repeat prediction!

Troubleshoot

Running out of memory

Large metagenomic assemblies with many small contigs can exhaust the RAM on your laptop. Fortunately, as metagenomic contigs are analysed separately (when run with --prodigal meta) a simple solution is to split the input into smaller chunks (e.g. with pyfasta)
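
If you prefer not to install an extra tool, the splitting can be sketched with the Python standard library. The function name `split_fasta`, the chunk file naming, and the default chunk size are all choices of this example.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir, records_per_chunk=1000):
    """Split a FASTA file into chunks of at most records_per_chunk sequences."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    out_handle = None
    n_records = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Start a new chunk file every records_per_chunk headers
                if n_records % records_per_chunk == 0:
                    if out_handle is not None:
                        out_handle.close()
                    path = out_dir / f"chunk_{len(chunk_paths) + 1:04d}.fa"
                    out_handle = path.open("w")
                    chunk_paths.append(path)
                n_records += 1
            # Lines before the first header are skipped
            if out_handle is not None:
                out_handle.write(line)
    if out_handle is not None:
        out_handle.close()
    return chunk_paths
```

Each chunk can then be run separately with `--prodigal meta`, as described above.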


crisprcastyper's Issues

Assembly accession in web server

Make it possible to submit an Assembly accession (GCA_* or GCF_*) to the web server in addition to the currently implemented nuccore accessions.

A complement to CRISPRCasFinder or its own thing?

Does this program stand on its own, or as a complement to CRISPRCasFinder? I have observed that in your papers you use both. Can CCTyper be used alone to detect CRISPR-Cas on sequences?

CrisprCasTyper webserver

The web server has reported a fatal error for every input for the past week.

File upload fail on web server with Chrome browser

Three separate solutions to this problem:

Disable hardware acceleration in the Chrome browser settings (restart browser)
Use another browser (Firefox, Opera, Safari, and Edge do not have this problem)
Copy-paste the sequence from the file into the submission field

cctyper dies with missing files

Hi Jakob,

I was running cctyper (fresh conda install) on a large fasta file (7.7GB) and it seems that it runs smoothly until minced step where it's trying to locate a missing file. Any idea on what might be the issue ?

Joseph.

cctyper -t 64 m64241e_210617_232502.hifi_reads.fasta cctyper.m64241e_210617_232502.hifi_reads.out
[2021-06-25 12:22:40] INFO: Running CRISPRCasTyper version 1.4.1
[2021-06-25 12:23:23] INFO: Predicting ORFs with prodigal
[2021-06-25 13:31:06] INFO: Running HMMER against Cas profiles
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 693/693 [13:06:04<00:00, 68.06s/it]
[2021-06-26 07:21:47] INFO: Parsing HMMER output
[2021-06-26 07:21:48] INFO: Subtyping putative operons
[2021-06-26 07:28:48] INFO: Predicting CRISPR arrays with minced
Traceback (most recent call last):
  File "/projects/codon_0000/apps/miniconda3/envs/cctyper/bin/cctyper", line 86, in <module>
    crispr_obj.run_minced()
  File "/projects/codon_0000/apps/miniconda3/envs/cctyper/lib/python3.9/site-packages/cctyper/minced.py", line 79, in run_minced
    self.write_spacers()
  File "/projects/codon_0000/apps/miniconda3/envs/cctyper/lib/python3.9/site-packages/cctyper/minced.py", line 156, in write_spacers
    f = open(self.out+'spacers/{}.fa'.format(crisp.crispr), 'w')
FileNotFoundError: [Errno 2] No such file or directory: 'cctyper.m64241e_210617_232502.hifi_reads.out/spacers/m64241e_210617_232502/165939000/ccs_1.fa'
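
The path in the traceback suggests that the sequence IDs contain "/" characters, so the spacer file name is interpreted as nested directories that do not exist. One possible workaround, sketched here with a hypothetical helper `sanitize_headers`, is to replace the slashes in the FASTA headers before running cctyper. Note the assumption that the renamed IDs stay unique.

```python
def sanitize_headers(in_path, out_path, replacement="_"):
    """Copy a FASTA file, replacing '/' in each sequence ID."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Only the ID (text before the first space) is rewritten
                seq_id, sep, rest = line[1:].rstrip("\n").partition(" ")
                line = ">" + seq_id.replace("/", replacement) + sep + rest + "\n"
            dst.write(line)
```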

![image](https://user-images.githubusercontent.com/121949837/212599561-d771b233-4bde-4850-8e5f-98a4164f78cb.png)

ValueError: invalid literal for int() with base 10: 'lengt'

Hello.
I first ran a single bin (2159 kb) and it produced the results properly, but when I merged the 1472 bins (3.39 Gb) into one fasta file and ran it again, the following error was reported:

cctyper ~/virus/viwrap_input/merge.fa ~/virus/CRISPRCasTyper --prodigal meta --threads 12
[2024-03-20 14:32:44] INFO: Running CRISPRCasTyper version 1.8.0
[2024-03-20 14:32:57] INFO: Predicting ORFs with prodigal
[2024-03-20 18:56:28] INFO: Running HMMER against Cas profiles
100%|████████████████████████████████████████████████████████████████████| 705/705 [4:57:48<00:00, 25.35s/it]
/data4/machuang/miniconda3/envs/cctyper/lib/python3.8/site-packages/cctyper/hmmer.py:85: DtypeWarning: Columns (28) have mixed types. Specify dtype option on import or set low_memory=False.
  hmm_df = pd.read_csv(self.out+'hmmer.tab', sep='\s+', header=None,
Traceback (most recent call last):
  File "/data4/machuang/miniconda3/envs/cctyper/bin/cctyper", line 85, in <module>
    hmmeri.main_hmm()
  File "/data4/machuang/miniconda3/envs/cctyper/lib/python3.8/site-packages/cctyper/hmmer.py", line 26, in main_hmm
    self.load_hmm()
  File "/data4/machuang/miniconda3/envs/cctyper/lib/python3.8/site-packages/cctyper/hmmer.py", line 100, in load_hmm
    hmm_df['Pos'] = [int(re.sub(".*_","",x)) for x in hmm_df['ORF']]
  File "/data4/machuang/miniconda3/envs/cctyper/lib/python3.8/site-packages/cctyper/hmmer.py", line 100, in <listcomp>
    hmm_df['Pos'] = [int(re.sub(".*_","",x)) for x in hmm_df['ORF']]
ValueError: invalid literal for int() with base 10: 'lengt'

The hmmer.log shows

Fatal exception (source file easel.c, line 2248):
unexpected getcwd() error

What is the reason for this? Thanks for your help.

Fusion proteins

Include HMMs for fusion proteins, such as Cas1-RT, Cas4-Cas1, and Cas6-RT-Cas1

Meaning of asterisk at the end of protein sequence

Given below are two protein sequences I have taken from protein.faa file obtained after running ccTyper, one with asterisk and other without asterisk at the end of protein sequence.

NODE_20793_length_1284_cov_1096.518308_2 # 431 # 1024 # -1 # ID=82_2;partial=00;start_type=GTG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.577
MKKKILSLAVVAVFGVMTMGPVMAGEVDPATVPEKKQTTLKLYLTAKEAYDMKKAEGDKV
LLIDVRTPEEIQYVGNLGDMMDANIPYQFNDISGYDEKKKVYASSLNSNFVAEVEELVNK
RGLDKDSTIIVSCRSGDRSAVSANLLAKAGYTHVYSVFDGFEGDLSKDGRRSVNGWKNAG
LPWTYNMDKAKMYFILR*

NODE_26395_length_1052_cov_597.440321_1 # 3 # 1052 # -1 # ID=98_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.368
LRRKNINNMIDKIYPYIHKIIKKTFSYLTLPQQKSLALTISAFFDPPSFSLYNIASKLPL
DTSNRHKHKHLIRFLDKLLINDDFWKSYITTIFLLPHITSRKKFLTLLIDATTLKDDVWI
LSASISYENRAVPIYMELWEGVNQKYDYWARVIGFVRNMRKYLPDKFSYVIIADRGFQGE
RLPKEFKKLKLDYIIRIGENYHIKTKNGEEWRELSLLDDGKYNEVVLGKTNSIEGVNVIV
SSIKDAENKKHLKWYLMSSIKDMEKEEVVGLYAKRMWIEESFKDLKGKLRWEEYTEKLPK
FDRIKKMVIISGLSYGIQLSLGSSKQVVEQRSKGESIIRGLQNALNGVSV

Asterisk in a fasta file generally means a stop codon. My question is does the asterisk mean that the protein sequence is incomplete?
Will the prediction that a given protein is a Cas protein be trusted if there is an asterisk at the end of protein sequence?
Can I estimate the size of the protein if there is an asterisk in the sequence?
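
The headers above are written by Prodigal, and the `partial` field records completeness: one digit per gene edge, where 0 means a true boundary (a start or stop codon) was found and 1 means the gene runs off the end of the contig. The trailing asterisk is the translated stop codon. The sketch below reads those fields; the function names are hypothetical and the parsing assumes the " # "-separated layout shown in the examples.

```python
def parse_prodigal_header(header):
    """Extract coordinates and completeness from a Prodigal protein header."""
    fields = header.lstrip(">").split(" # ")
    name, start, end, strand, attributes = fields[:5]
    # Attributes look like "ID=82_2;partial=00;start_type=GTG;..."
    attrs = dict(
        item.split("=", 1) for item in attributes.strip().split(";") if "=" in item
    )
    partial = attrs.get("partial", "")
    return {
        "name": name,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),
        "partial": partial,
        "complete": partial == "00",  # true boundary at both edges
    }


def protein_length(sequence):
    """Number of residues, ignoring line breaks and a trailing stop ('*')."""
    return len(sequence.replace("\n", "").rstrip("*"))
```

In the two examples above, the first protein has partial=00 and ends with a stop codon, so it is a complete gene; the second has partial=11 and lacks the asterisk, so it is truncated at both contig edges and its true length cannot be read from the sequence alone.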

Make it possible to enhance plot with custom HMM database

It would be useful if one could supplement CCTyper with an HMM database, e.g. Pfam, which would then be used to annotate previously unknown genes in the vicinity of CRISPR-Cas loci.

"Orphan" CRISPRs near low-quality single-effector HMM match -> Putative CRISPR-Cas

At the moment, low-quality single-effector HMM matches near a CRISPR are only visible in the plot, and are not included in the CRISPR_Cas_putative.tab output file. This should be fixed.

Problems running the program

First of all I have to say the program is great. Easy to use and the output is very comprehensive.
However, recently I have been having this error:

/Users/JFF/opt/miniconda3/envs/cctyper/lib/python3.9/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
from pandas import MultiIndex, Int64Index
[2022-04-04 14:29:15] INFO: Running CRISPRCasTyper version 1.3.0
[2022-04-04 14:29:15] INFO: Predicting ORFs with prodigal
[2022-04-04 14:29:15] INFO: Running HMMER against Cas profiles
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 691/691 [00:14<00:00, 48.44it/s]
[2022-04-04 14:29:31] INFO: Parsing HMMER output
[2022-04-04 14:29:31] INFO: Subtyping putative operons
/Users/JFF/opt/miniconda3/envs/cctyper/lib/python3.9/site-packages/cctyper/castyping.py:294: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
single_effector_hmms = self.scores[self.scores['Hmm'].isin(list(specifics))].drop('Hmm', 1)
[2022-04-04 14:29:31] INFO: Predicting CRISPR arrays with minced
[2022-04-04 14:29:31] INFO: No CRISPRs found.
[2022-04-04 14:29:31] INFO: Plotting map of CRISPR-Cas loci
[2022-04-04 14:29:31] INFO: Removing temporary files

It's strange because I have run this same sequence on other occasions with the program and I am sure there are CRISPRs.
Do you know what might be causing this?

Kind regards,
Javier

Add support for numeric fasta headers

Fasta headers which only contain numbers are currently not allowed, as they result in errors in several places in the script.

Include I-B CAST loci

Complete genome and draft genome

Hello, I am going to use this software to predict some Klebsiella pneumoniae genome data (some are fully assembled genome data, some are draft genome data at the Scaffolds level). Should I change the Prodigal Mode to make the prediction more accurate for different genomes?
Complete genome: cctyper KP1.fa my_output
A draft genome containing multiple sequences(scaffolds): cctyper KP2.fasta my_output2 --prodigal meta
Are these commands correct?
If I keep using the same command for both, will the results be much less reliable?
Best wishes!

New V-A variants

Include HMMs for new V-A variants described here: https://www.liebertpub.com/doi/10.1089/crispr.2020.0043

Error in BLASTing for CRISPR near cas operons

Hi, I fed the software with metagenomic datasets but an error was raised as follows:

[2022-06-25 15:58:22] INFO: BLASTing for CRISPR near cas operons
Traceback (most recent call last):
  File "/data/linzhl/anaconda3/envs/cctyper/bin/cctyper", line 95, in <module>
    repmatch.run()
  File "/data/linzhl/anaconda3/envs/cctyper/lib/python3.8/site-packages/cctyper/blast.py", line 37, in run
    self.write_gff()
  File "/data/linzhl/anaconda3/envs/cctyper/lib/python3.8/site-packages/cctyper/blast.py", line 369, in write_gff
    all_seqs[::2] = cr.repeats
ValueError: attempt to assign sequence of size 36 to extended slice of size 35

This problem only came up in one dataset. Do you know what might be causing this?

No database with conda install?

Hello, I installed cctyper with conda,
conda create -n cctyper -c conda-forge -c bioconda -c russel88 cctyper
activated the environment,
conda activate cctyper
and then tried to run cctyper
cctyper ~/Downloads/44_contigs_1000.fasta ~/Desktop/cctyper_output_44 --prodigal meta

However, it says that it cannot find the database directory. Is there an easy way to fix this?

[2022-08-03 13:40:32] INFO: Running CRISPRCasTyper version 1.3.0
[2022-08-03 13:40:32] ERROR: Could not find database directory

Thanks,

Nicholas

Orphan CRISPR definition

Dear Russel,

Thank you very much for this unique tool.

I have a simple question. How is an orphan CRISPR defined? I suppose that a lonely CRISPR in a contig could be considered an orphan, but how many ORFs are needed to consider an array an orphan CRISPR?

Speed up CRISPR stats calculation

Long CRISPRs take time to process, because each pair of both spacers and repeats are aligned to calculate the average identity of repeats and spacers. It should be sufficient to sample a subset of the repeats/spacers to estimate the identity. Then add a CLI argument to toggle exact versus approximate identity estimation.
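
The sampling idea could look roughly like the sketch below. It is not the CCTyper implementation: `difflib.SequenceMatcher` is used as a stand-in similarity measure instead of a pairwise alignment, and the names `pair_identity`, `mean_identity` and `max_pairs` are hypothetical.

```python
import random
from difflib import SequenceMatcher
from itertools import combinations


def pair_identity(a, b):
    """Stand-in similarity in [0, 1]; not the alignment identity CCTyper reports."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def mean_identity(seqs, max_pairs=None, seed=0):
    """Average pairwise identity, exact or estimated from sampled pairs."""
    n = len(seqs)
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        return float("nan")  # fewer than two sequences
    if max_pairs is None or max_pairs >= total_pairs:
        # Exact mode: every unordered pair is compared
        pairs = combinations(range(n), 2)
        count = total_pairs
    else:
        # Approximate mode: draw max_pairs random pairs of distinct indices
        rng = random.Random(seed)
        pairs = [tuple(rng.sample(range(n), 2)) for _ in range(max_pairs)]
        count = max_pairs
    return sum(pair_identity(seqs[i], seqs[j]) for i, j in pairs) / count
```

With `max_pairs=None` the result is exact; a fixed `max_pairs` caps the work for long arrays, which is what a CLI toggle between exact and approximate estimation could switch between.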

Different results from CRISPR Cas typer web server and standalone version

add repeated sequence to repeats.fa

Hi,
I wanted to use your program to get E. coli spacers. It works well on some of my strains, but for some of them spacers are missing due to mismatches in the last repeat sequence.
Is there a way to update the repeats.fa file in the db folder with my own list of repeat sequences?
Have a nice day,
Fabien

Option to use Prodigal-gv

Prodigal-gv is a fork of Prodigal v2.6.3 that was modified to improve detection of alternative genetic codes (especially 4 and 15), which are not uncommon in certain phage families. Standard Prodigal (with or without -p meta) doesn't do a good job of detecting these cases, resulting in highly fragmented gene predictions. Given that CRISPR systems have been found in phages, it would be nice if this gene caller was an option when using CCTyper. The command line options are identical to standard prodigal, except the name of the program is prodigal-gv instead of prodigal.

https://github.com/apcamargo/prodigal-gv
https://anaconda.org/bioconda/prodigal-gv

About spacer sequences

I use the software to construct a CRISPR-Cas database from the bacterial genomes in the RefSeq database.
But I found that some spacer sequences are not in the genome, for example (spacers/NC_015711.1_1.fa):

>NC_015711.1_1:1
TCAACCAGCATTAGCACCGTCCGCGTGGCGCCCGTGT
>NC_015711.1_1:2
CTGGAGTTGTCCCCCGAGGCTGAGCCGGTGTCCCGCGT
>NC_015711.1_1:3
TCTTCCATCTGCGTCTGCGTCTGACCCTTGAACTTCG
>NC_015711.1_1:4
ATGCAGAACAGCGGCAAGGAGGCGATTATCGACCT
>NC_015711.1_1:5
GGGCAGTGAAACCCTTGGGTGGGGAAGGAGTTCTGGGGGC
>NC_015711.1_1:6
AGGAGCGCCCGCCGGCCAGACGCATAGACGACGCA

Am I missing something?

Thanks!

GCF_000219105.zip

(Question) provide GFF file for pre computed gene calls?

Let's say you already ran prodigal and/or have gene calls in GFF format, can you skip the prodigal run and provide the GFF file?

antiSMASH provides a similar option since it requires gene calls and positions but allows for precomputed GFF file to be used.

Indicate completion in output

It would be useful if CCTyper in the output could give an indication of whether the systems are complete or partial

Extract spacer coordinates

Hey,

This tool (super cool btw!) extracts all spacers (true and false) in fasta file into a separate directory. Could you please let me know how to extract the coordinates for all these spacers? The coordinates available on the file crisprs_all only refer to the consensus repeat, from what I understood.

Thanks for your time!

XGBoost model incompatible

Hi! Great program, has been helping me a lot! I've been running cctyper for metagenomes and it has been working for the most part. For one of my FASTA files it has been erroring with XGBoost model incompatible. I am using cctyper v 1.8.0 via mamba on a fresh environment, created as instructed on the README

Thanks for any help in this situation!

cctyper part_001.fasta part_001_results --prodigal meta
/clusterfs/jgi/groups/science/homes/mbfiamenghi/.micromamba/envs/cctyper/bin/cctyper:7: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
/clusterfs/jgi/groups/science/homes/mbfiamenghi/.micromamba/envs/cctyper/lib/python3.8/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
[2024-02-14 06:27:58] INFO: Running CRISPRCasTyper version 1.8.0
[2024-02-14 06:28:01] INFO: Predicting ORFs with prodigal
[2024-02-14 07:20:09] INFO: Running HMMER against Cas profiles
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 705/705 [54:41<00:00,  4.65s/it]
[2024-02-14 08:17:52] INFO: Subtyping putative operons
[2024-02-14 08:18:08] INFO: Predicting CRISPR arrays with minced
/clusterfs/jgi/groups/science/homes/mbfiamenghi/.micromamba/envs/cctyper/lib/python3.8/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
/clusterfs/jgi/groups/science/homes/mbfiamenghi/.micromamba/envs/cctyper/lib/python3.8/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
/clusterfs/jgi/groups/science/homes/mbfiamenghi/.micromamba/envs/cctyper/lib/python3.8/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
/clusterfs/jgi/groups/science/homes/mbfiamenghi/.micromamba/envs/cctyper/lib/python3.8/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
[2024-02-14 08:18:50] INFO: BLASTing for CRISPR near cas operons
[2024-02-14 08:21:23] INFO: Predicting subtype of CRISPR repeats
[2024-02-14 08:21:23] ERROR: XGBoost model incompatible

Cas6 in red in visualization?

Dear all

Why is it that the Cas6 proteins are coloured in red in the visualization, as opposed to the other enzymes of either adaptation (blue) or interference (yellow) modules? Is there something functionally unique about Cas6 to be coloured separately?

Thanks

Marcus

Execution against protein fasta and gff

Hi,

First of all, I would like to thank you for developing this wonderful tool.

My question is about the tool's input files.
I currently have a genome sequence for which gene prediction is a bit tricky. Its genes cannot be predicted by prodigal, and the annotation requires manual curation.

I would like to run CRISPRCasTyper against this genome.
Is it possible to use a gbk file, or a protein fasta plus gff, as input?

Best,

Keigo

Expose minced options

Hello,

Is it possible to make the minced options available when calling cctyper? Options like -minNR, -minRL, etc. are very useful when optimizing CRISPR array annotations. I have noticed that certain repeats are missed because the array does not carry the minimum number of repeats required.

Cheers,
Jimmy

castyping.py having an issue with positional arguments.

I've had cctyper working before on another machine. Not sure if you have seen this error before.

(cctyper)user assembly % cctype CIA1\ copy.fna CIA1_cctyper
zsh: command not found: cctype
(cctyper)user assembly % cctyper CIA1\ copy.fna CIA1_cctyper
/Users/user/miniconda3/envs/cctyper/lib/python3.9/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
warnings.warn(
[2023-05-18 10:59:17] INFO: Running CRISPRCasTyper version 1.3.0
[2023-05-18 10:59:18] INFO: Predicting ORFs with prodigal
[2023-05-18 10:59:28] INFO: Running HMMER against Cas profiles
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 691/691 [01:11<00:00, 9.65it/s]
[2023-05-18 11:01:14] INFO: Parsing HMMER output
[2023-05-18 11:01:14] INFO: Subtyping putative operons
Traceback (most recent call last):
  File "/Users/user/miniconda3/envs/cctyper/bin/cctyper", line 80, in <module>
    castyper.typing()
  File "/Users/user/miniconda3/envs/cctyper/lib/python3.9/site-packages/cctyper/castyping.py", line 294, in typing
    single_effector_hmms = self.scores[self.scores['Hmm'].isin(list(specifics))].drop('Hmm', 1)
TypeError: drop() takes from 1 to 2 positional arguments but 3 were given
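
This error is consistent with pandas 2.0 and later, where `DataFrame.drop` no longer accepts the axis as a second positional argument. Below is a minimal sketch of the kind of change that avoids it, assuming the only purpose of the call on line 294 is to remove the `Hmm` column. The toy table and the HMM names are made up for illustration and are not CCTyper's real score table.

```python
import pandas as pd

# Toy stand-in for the score table; the real columns in castyping.py may differ.
scores = pd.DataFrame({
    "Hmm": ["Cas9_0_II", "Cas12a_0_V-A"],  # hypothetical HMM names
    "I-B": [0.0, 0.0],
    "II-A": [4.0, 0.0],
})
specifics = ["Cas9_0_II"]  # hypothetical list of single-effector HMMs

# Old call that fails on pandas >= 2.0: .drop('Hmm', 1)
# The keyword form works on both older and newer pandas versions.
single_effector_hmms = scores[scores["Hmm"].isin(specifics)].drop(columns="Hmm")

print(single_effector_hmms)
```

Pinning pandas below 2.0 in the cctyper environment is another possible workaround until the call is updated.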

Multiple cctyper runs on the same contig giving different outputs?

Dear all

I ran cctyper on a large number of contigs, and cas_operons.tab in the output listed the subgroup of contigs that had Cas hits.

I then selected only this set of Cas-positive contigs and re-ran cctyper on them. This time, the large majority of the contigs and the positions of their Cas operons were identical, except for a small subset in which some operons are now split into multiple smaller Cas operons. Interestingly, this small group of contigs now appears in cas_operons_putative.tab, meaning that these predictions have become less confident.

I wonder why this happens, given that the contigs in the second run were a subset of those in the first run and otherwise identical. Thanks

Marcus

(Question) How can I plot svg post-hoc after a run?

I just ran this on about 60k genomes before I realized that the Drawsvg version made changes that weren't addressed in the current CCTyper implementation (see #45).

Is there a way I can load the data that ran correctly so I can still plot it using CCTyper?

Optimize hybrid system annotation

Hybrid loci in which one of the subtypes is a Class 2 are sometimes missed. Hybrid classification should be optimized to fix these cases.

Refseq/Genbank accession as input

It would be useful if one could give an accession number as input and have CCTyper download the fasta automatically.
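
Until such an option exists, the download can be done before running cctyper. Below is a minimal sketch using NCBI's E-utilities `efetch` endpoint and only the Python standard library. The function names and the example accession are illustrative and are not part of CCTyper.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession):
    """Build an efetch URL that returns a nucleotide record as FASTA text."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)

def download_fasta(accession, path):
    """Download one accession to `path` (requires network access)."""
    with urlopen(efetch_url(accession)) as response, open(path, "wb") as out:
        out.write(response.read())

# Only builds the URL here; call download_fasta() to actually fetch the record.
print(efetch_url("NC_000913.3"))
```

The saved file can then be passed to cctyper as the usual nucleotide fasta input.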

alternative to minced?

Thanks for the great tool!

In some cases it would be nice to be able to use CRISPR repeat/spacer arrays that were identified with a different tool or in an earlier analysis. For example, we could pass a gff with array coordinates.

CCTyper results

Hello! I've been using CCTyper and I'm very happy with the performance and results, but I have a question about the results of an isolate assembly. This is the CRISPR_Cas.tab file:

Contig    Operon      Operon_Pos   Prediction  CRISPRs         Distances  Prediction_Cas  Prediction_CRISPRs
Contig51  Contig51@1  [785, 2608]  I-B         ['Contig51_1']  [177]      Ambiguous       ['I-B']

However, when I go to the cas_operons.tab file, I find:

Contig: Contig137
Operon: Contig137@1
Start: 40
End: 5517
Prediction: I-B
Complete_Interference: 100%
Complete_Adaptation: 0%
Best_type: I-B
Best_score: 15.0
Genes: ['Cas6_0_CAS-III-B-I-B', 'Cas8b1_10_CAS-I-B', 'Cas7_0_CAS-I-B', 'Cas5_0_IB', 'Cas3_0_I']
Positions: [1, 2, 3, 4, 5]
E-values: ['2.00e-53', '2.50e-250', '2.40e-48', '1.40e-19', '1.40e-34']
CoverageSeq: [0.979, 0.983, 0.935, 0.832, 0.545]
CoverageHMM: [0.979, 0.991, 0.967, 0.862, 0.475]
Strand_Interference: 1
Strand_Adaptation: NA

This file shows the same results as the cas_operons_orphan.tab file. My question is why CCTyper seems to identify a complete CRISPR/Cas locus when it doesn't find the Cas operon. Interestingly, when I annotate this isolate with Bakta, it identifies the Cas genes very close to the CRISPR array (less than 5000 nucleotides away and on the same contig). It seems that CCTyper recognizes that the CRISPR/Cas system is complete, but it doesn't display this operon. What could be happening? Thanks in advance!
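
One way to narrow this down is to check whether every operon ID listed in CRISPR_Cas.tab also appears in cas_operons.tab. Below is a small sketch, assuming both files are tab-separated and carry an `Operon` column as in the headers quoted above. The function name and the trimmed-down sample rows are illustrative only.

```python
import csv
import io

def missing_operons(crispr_cas_text, cas_operons_text):
    """Return operon IDs present in CRISPR_Cas.tab but absent from cas_operons.tab."""
    loci = csv.DictReader(io.StringIO(crispr_cas_text), delimiter="\t")
    operons = csv.DictReader(io.StringIO(cas_operons_text), delimiter="\t")
    known = {row["Operon"] for row in operons}
    return [row["Operon"] for row in loci if row["Operon"] not in known]

# Trimmed-down sample rows based on the two files quoted in this post.
crispr_cas = "Contig\tOperon\tPrediction\nContig51\tContig51@1\tI-B\n"
cas_operons = "Contig\tOperon\tPrediction\nContig137\tContig137@1\tI-B\n"

print(missing_operons(crispr_cas, cas_operons))
```

With real output, the two strings would be replaced by the contents of the files read from the cctyper output folder.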

Warning messages while running

Hi, I got warning messages while running this command

I still got the results, but I'm not sure whether the warnings from the program affect them.

Thank you in advance.

Optimize plot expansions

Would be useful if arrays were included when calculating which genes to add for plot expansions of CRISPR-Cas loci

faa file of individual Cas proteins?

Dear all

I notice that the typical cctyper output includes a cas_operons.tab file, in which each row describes an operon that may contain multiple Cas protein hits. However, the proteins.faa file contains all proteins used by cctyper (potentially even those not listed in cas_operons.tab).

Is there a way to take the information in the "Genes" column of cas_operons.tab and extract the matching amino acid sequences from proteins.faa, so that the result contains only the Cas protein sequences?

Thanks

Marcus
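
A possible post-processing sketch for this, using only the standard library. It rests on two assumptions that should be checked against real output: that the headers in proteins.faa follow prodigal's `<contig>_<index>` naming, and that the `Positions` column of cas_operons.tab holds those indices. The function names and the tiny inputs are made up for illustration.

```python
import ast
import csv
import io

def cas_protein_ids(cas_operons_text):
    """Collect '<contig>_<position>' IDs for every gene listed in cas_operons.tab."""
    ids = set()
    for row in csv.DictReader(io.StringIO(cas_operons_text), delimiter="\t"):
        # Assumption: Positions is a Python-style list such as "[1, 2, 3]".
        for pos in ast.literal_eval(row["Positions"]):
            ids.add(f"{row['Contig']}_{pos}")
    return ids

def filter_fasta(fasta_text, keep_ids):
    """Keep only FASTA records whose ID (first word of the header) is in keep_ids."""
    kept, keep = [], False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in keep_ids
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")

# Tiny made-up inputs; real files would be read from the cctyper output folder.
operons = "Contig\tOperon\tGenes\tPositions\nContig137\tContig137@1\t['Cas6', 'Cas3']\t[1, 5]\n"
proteins = ">Contig137_1 # 40 # 700\nMKL\n>Contig137_2 # 710 # 900\nMAA\n>Contig137_5 # 4000 # 5517\nMTT\n"

print(filter_fasta(proteins, cas_protein_ids(operons)), end="")
```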

Creating empty files when there's no data

Hello,
First of all, fantastic software! I've been trying it out and it's super cool.
I have a suggestion regarding the output files. As you mention in the README, files are only created if there is any data. I was wondering why not generate an empty file instead, perhaps with the header only. The main reason is that this would make it easier to integrate cctyper with workflow management systems such as snakemake, which expect an output file as a sign that the run succeeded. In general, I think it is easier to deal with files that indicate no results than with absent files.
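
Until this is supported, a workflow rule could create the missing files itself after cctyper finishes. Below is a minimal sketch, assuming a header-only, tab-separated file is an acceptable placeholder. The `ensure_outputs` helper is hypothetical, and the header is copied from the CRISPR_Cas.tab example quoted elsewhere on this page.

```python
import os
import tempfile

# Header taken from the CRISPR_Cas.tab example; other files would need their own entry.
HEADERS = {
    "CRISPR_Cas.tab": ["Contig", "Operon", "Operon_Pos", "Prediction", "CRISPRs",
                       "Distances", "Prediction_Cas", "Prediction_CRISPRs"],
}

def ensure_outputs(outdir, headers=HEADERS):
    """Create a header-only file for every expected output that is missing."""
    created = []
    for name, columns in headers.items():
        path = os.path.join(outdir, name)
        if not os.path.exists(path):
            with open(path, "w") as handle:
                handle.write("\t".join(columns) + "\n")
            created.append(name)
    return created

# Demonstration in a throwaway directory standing in for a cctyper output folder.
with tempfile.TemporaryDirectory() as outdir:
    first = ensure_outputs(outdir)
    second = ensure_outputs(outdir)  # nothing left to create on the second call
print(first, second)
```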

If minced detects a CRISPR with N's in the repeat sequence

.. xgboost will fail with an "Incompatible model" error. This should be handled more gracefully.
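
One graceful option would be to set aside repeats with ambiguous bases before prediction and report them, rather than passing them to the model. Below is a minimal sketch of such a pre-filter. The function name is illustrative, and this is not how CCTyper currently handles the case.

```python
def split_repeats(repeats):
    """Separate repeats made only of A, C, G, T from those with other symbols (e.g. N)."""
    valid, skipped = [], []
    for repeat in repeats:
        seq = repeat.strip().upper()
        if seq and set(seq) <= set("ACGT"):
            valid.append(seq)
        else:
            skipped.append(repeat)
    return valid, skipped

# Made-up repeats: the second one contains N's and is set aside.
valid, skipped = split_repeats(["GTTTTAGAGCTA", "GTTNNAGAGCTA"])
print(valid, skipped)
```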

Docker for cctyper

Dear all

Is there a pre-built Docker image for cctyper? I tried to look for it in BioContainers, but to no avail. Thanks

Marcus

Error when running cctyper: sed: cannot rename <output filename>/sed78a7GM: Permission denied

Right after the log line "INFO: Running HMMER against Cas profiles", the job terminates and I receive the following error:

sed: cannot rename <output filename>/sed78a7GM: Permission denied
Traceback (most recent call last):
  File "/opt/miniconda3/envs/cctyper/bin/cctyper", line 83, in <module>
    hmmeri.main_hmm()
  File "/opt/miniconda3/envs/cctyper/lib/python3.8/site-packages/cctyper/hmmer.py", line 26, in main_hmm
    self.load_hmm()
  File "/opt/miniconda3/envs/cctyper/lib/python3.8/site-packages/cctyper/hmmer.py", line 85, in load_hmm
    hmm_df = pd.read_csv(self.out+'hmmer.tab', sep='\s+', header=None,
  File "/opt/miniconda3/envs/cctyper/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/opt/miniconda3/envs/cctyper/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/opt/miniconda3/envs/cctyper/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/miniconda3/envs/cctyper/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/miniconda3/envs/cctyper/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/opt/miniconda3/envs/cctyper/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1735, in _make_engine
    self.handles = get_handle(
  File "/opt/miniconda3/envs/cctyper/lib/python3.8/site-packages/pandas/io/common.py", line 856, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '<output filename>/hmmer.tab'

I was just wondering if anyone has come across this error and if so, how they resolved it?

Thanks in advance!
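
The `sed: cannot rename ... Permission denied` line suggests that a temporary file could not be renamed inside the output directory, after which hmmer.tab was never written. That reading is an assumption, not a confirmed diagnosis. Below is a quick sketch for testing whether a directory allows files to be created and renamed. The function name is illustrative.

```python
import os
import tempfile

def can_create_and_rename(directory):
    """Try to create a temp file in `directory` and rename it, roughly what in-place sed needs."""
    try:
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        final_path = tmp_path + ".renamed"
        os.rename(tmp_path, final_path)
        os.remove(final_path)
        return True
    except OSError:
        return False

# A fresh temporary directory should pass; a directory that does not exist should fail.
with tempfile.TemporaryDirectory() as scratch:
    writable = can_create_and_rename(scratch)
missing = can_create_and_rename(os.path.join(tempfile.gettempdir(), "no-such-dir-for-demo"))
print(writable, missing)
```

Running the check against the cctyper output folder on the affected file system could show whether the problem lies with permissions rather than with cctyper itself.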

FileNotFoundError: blast.tab

Hi there

I was wondering if you know what caused this issue, "FileNotFoundError: [Errno 2] No such file or directory: DIR/blast.tab"?

Thanks!

Question about spacer direction

Dear Russel,

Are the spacers corrected for their orientation, as CRISPRDetect/CRISPRDirection does?

With kind regards,
Daan
