krasileva-group / plant_rgenes

Shell 2.52% Perl 97.48%

plant_rgenes's Introduction

plant_rgenes

Set of scripts to annotate Pfam domains and extract NLR plant immune receptors and their architectures as published in Sarris et al BMC Biology 2016: https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-016-0228-7

Our basic pipeline

Obtain protein sequences of species of interest and organise them into a directory.

We follow the Phytozome organisation of master_dir/species/annotation/species_version_proteins.fa, where each species is denoted by the first letter of the genus name followed by the full species name, for example Athaliana.
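As an illustration of the naming rule above, the following sketch derives a species label and composes the expected file path. The function names, the `v1` version tag, and the optional suffix argument are hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the directory convention; not part of the repository.

# Build the species label: first letter of the genus plus the full species name.
species_label() {
    local genus=$1 epithet=$2
    local initial
    initial=$(printf '%s' "${genus:0:1}" | tr '[:lower:]' '[:upper:]')
    printf '%s%s\n' "$initial" "$(printf '%s' "$epithet" | tr '[:upper:]' '[:lower:]')"
}

# Compose the expected FASTA path. The suffix is a parameter because the layout
# above says "proteins.fa" while the dependency notes mention "*_protein.fa".
protein_path() {
    local master_dir=$1 label=$2 version=$3 suffix=${4:-proteins.fa}
    printf '%s/%s/annotation/%s_%s_%s\n' "$master_dir" "$label" "$label" "$version" "$suffix"
}

label=$(species_label Arabidopsis thaliana)
protein_path master_dir "$label" v1   # "v1" is a placeholder version tag
```

Passing `protein.fa` as the fourth argument to `protein_path` produces the singular form instead.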

Pfam-based annotation of domains

usage: bash run_pfam_scan.sh dir

Dependencies:

HMMER software (http://hmmer.janelia.org/) and pfam_scan.pl (distributed by Pfam rather than with HMMER itself). Place pfam_scan.pl in the same directory as this script, or set its path in the command string.
Pfam database (http://pfam.xfam.org/)
File names should be consistent with Phytozome and include Species_*_protein.fa
Perl modules specified in the scripts (best installed with cpan: http://www.cpan.org/modules/)

Parsing the pfamscan output with K-parse_Pfam_domains_v3.1.pl

The script parses the output of pfam_scan.pl
The script extracts all domains for each protein and removes redundant nested hits with larger e-values.
Domains are printed out in the order of appearance in the query.
By default, Pfam_B domains are skipped.

usage: perl K-parse_Pfam_domains_v3.1.pl <options>

-p|--pfam <pfamscan.out>

-e|--evalue <evalue cutoff>

-o|--output

-v|--verbose <T/F> default F. Display more information about each domain (start, stop, evalue)

We usually parse all pfam outputs of interest in parallel using xargs
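A minimal sketch of such a parallel run is shown below. It uses only the options listed above (-p, -e, -o, -v). The input pattern `*pfamscan*.out`, the `.parsed.verbose` output suffix, the 1e-3 cutoff, the job count, and the variable and function names are assumptions chosen for illustration, and the parser is assumed to sit in the current directory.

```shell
#!/usr/bin/env bash
# Assumptions (not from the README): inputs are named *pfamscan*.out, each output
# is written next to its input with a ".parsed.verbose" suffix, and 4 jobs run at once.
PARSER=${PARSER:-"perl K-parse_Pfam_domains_v3.1.pl"}
EVALUE=${EVALUE:-1e-3}
JOBS=${JOBS:-4}

parse_all() {
    local in_dir=$1
    # -print0 / -0 keep file names with spaces intact; -P sets the number of parallel jobs.
    # Each file is handed to a small sh -c wrapper as $1, the e-value cutoff as $2.
    find "$in_dir" -type f -name '*pfamscan*.out' -print0 |
        xargs -0 -P "$JOBS" -I{} sh -c \
            "$PARSER"' -p "$1" -e "$2" -o "$1.parsed.verbose" -v T' _ {} "$EVALUE"
}

# Usage: bash this_script.sh master_dir
if [ "$#" -gt 0 ]; then
    parse_all "$1"
fi
```

Setting `JOBS`, `EVALUE`, or `PARSER` in the environment overrides the defaults without editing the script.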

Identification of non-canonical NLR-ID domain combinations with K-parse_Pfam_domains_NLR-fusions-v2.2.pl

This script is configured to find any parsed pfam files in the specified directory or its sub-directories.
The script will parse the output of K-parse_Pfam_domains_v3.1.pl.
Note that in the current configuration, the script specifically scans input directories for filenames matching "pfamscanparsed.verbose". If your naming scheme is different, you might want to modify line 62.
Configuration of 'db_description' is highly important, as the first check in the script is to match the species_id in db_description to the one in the name of the file. If successful, the script will print the species_id and family name to standard output.
NLR proteins are identified based on the presence of the NB-ARC domain.
Fusions are identified based on the presence of non-NBS non-LRR domains at the specified evalue cutoff (default 1e-3).
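Because the species_id check comes first, a quick sanity check before running the script can be useful. The sketch below counts, for every Species_ID in a db_description file, the parsed files whose names contain that ID. It assumes a tab-separated file with one header row and Species_ID in the second column (the column order listed for -d|--db_description); the function name is hypothetical and the script's own matching logic may differ.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; assumes db_description is tab-separated,
# has one header row, and holds Species_ID in column 2.
check_species_ids() {
    local db_description=$1 in_dir=$2
    local species_id n
    # Drop the header, strip Windows line endings, keep the Species_ID column.
    tail -n +2 "$db_description" | tr -d '\r' | cut -f2 |
        while IFS= read -r species_id; do
            [ -n "$species_id" ] || continue
            # Count parsed files whose name contains this Species_ID.
            n=$(find "$in_dir" -type f -name "*${species_id}*parsed.verbose" | wc -l | tr -d '[:space:]')
            printf '%s\t%s\n' "$species_id" "$n"
        done
}

# Usage: bash this_script.sh db_description.txt master_dir
if [ "$#" -eq 2 ]; then
    check_species_ids "$1" "$2"
fi
```

A count of 0 for a species suggests that no file name carries its Species_ID, so that species would likely not be matched.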

usage: perl K-parse_Pfam_domains_NLR-fusions-v2.2.pl <options>

-i|--indir directory for batch retrieval of input *pfamscan*.parsed.verbose files

-e|--evalue evalue cutoff for determining domain fusions [default 1e-3]

-o|--output output directory

-d|--db_description description of datasets used in the analyses, with the columns [Organism Species_ID NCBI_taxon_ID Family Database Date_aquired Restrictions Version Common_Name Source Reference]. For an example of this file, see Additional file 1 in Sarris et al BMC Biology 2016

Outputs:

Summary of the number of NLRs and NLR-IDs identified in each species (such as Additional file 2 in Sarris et al BMC Biology 2016)
Summary of integrated domains with species list for each domain (such as Additional file 3 in Sarris et al BMC Biology 2016)
Abundance list of integrated domains (counted once for each family) that can be used to generate a Wordcloud (such as Figure 2 in Sarris et al BMC Biology 2016)
Contingency tables (per ID domain) for each species as well as for all species, with a left-tailed Fisher's exact test

Example datasets:

The example dataset directory contains input Arabidopsis data as well as the corresponding db_description file. It also contains the outputs from each stage of the analyses, so you can check your pipeline against them or test individual scripts.


plant_rgenes's Issues

db_description.txt file is not working for species which have no annotation on public database

@krasileva, I was trying to run the script "K-parse_Pfam_domains_NLR-fusions-v2.2.pl" to extract NLRs and NLR-IDs from unannotated species. I have tried editing the db_description file for a few such species and running the script, but it gives empty outputs with no error report. I wonder if there is any way of using this species description input for unannotated species.

Kind regards,
Tamene

Confusion between https://github.com/krasileva/plant_rgenes

The paper https://doi.org/10.1186/s12915-016-0228-7 cites https://github.com/krasileva/plant_rgenes but that repository seems inactive with its issue tracker disabled.

Should https://github.com/krasileva-group/plant_rgenes be used instead? If so, could you ask GitHub to set up a redirect?

Kparse...v2.2.pl: defined(@array) is deprecated

When trying to parse the output, I have run into some troubles. After parsing the output of run_pfam_scan.sh using K-parse_Pfam_domains_v3.1.pl (using a verbose output), I am getting the following errors when using the K-parse_Pfam_domains_NLR-fusions-v2.2.pl script:
defined(@array) is deprecated at ./processing_scripts/K-parse_Pfam_domains_NLR-fusions-v2.2.pl line 172.
(Maybe you should just omit the defined()?)

defined(@array) is deprecated at ./processing_scripts/K-parse_Pfam_domains_NLR-fusions-v2.2.pl line 179.
(Maybe you should just omit the defined()?)

defined(@array) is deprecated at ./processing_scripts/K-parse_Pfam_domains_NLR-fusions-v2.2.pl line 249.
(Maybe you should just omit the defined()?)

I've been using the following command to use this script:
perl ./processing_scripts/K-parse_Pfam_domains_NLR-fusions-v2.2.pl -i ./master_dir/ -o ./master_dir/ -d ./master_dir/Ssisymbriifolium/pfam/db_description.txt

Any advice on what needs to be changed to get this to work?

Thank you for your time and help,

Alex
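For the warnings quoted in this report, Perl's own hint applies: testing the array directly (`if (@array)`) replaces `defined(@array)`, and Perl 5.22 and later treat the old form as a fatal error rather than a warning. The sketch below only previews such a rewrite without modifying the file. It handles just the parenthesised form `defined(@name)`; the function name is hypothetical, and the actual code on lines 172, 179 and 249 has not been checked here.

```shell
#!/usr/bin/env bash
# Hypothetical preview helper: prints each affected line (with its line number),
# followed by the same line with defined(@name) reduced to @name.
# Only the parenthesised form is handled; nothing is written back to the file.
preview_defined_fix() {
    local script=$1
    grep -nE 'defined[[:space:]]*\([[:space:]]*@[A-Za-z_]' "$script" |
        sed -E 'p; s/defined[[:space:]]*\([[:space:]]*(@[A-Za-z_][A-Za-z0-9_]*)[[:space:]]*\)/\1/g; s/^/  -> /'
}

# Usage: bash this_script.sh ./processing_scripts/K-parse_Pfam_domains_NLR-fusions-v2.2.pl
if [ "$#" -gt 0 ]; then
    preview_defined_fix "$1"
fi
```

Since the messages are deprecation warnings in this Perl version, they may not be the cause of any missing output; the preview is only meant to show what silencing them would involve.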

run_pfam_scan.sh wants protein.fa not proteins.fa

The script run_pfam_scan.sh looks for files ending ...protein.fa, https://github.com/krasileva-group/plant_rgenes/blob/master/bash_scripts/run_pfam_scan.sh#L24

protfiles=$(find $IN_DIR -name '*protein.fa')

The documentation https://github.com/krasileva-group/plant_rgenes/blob/master/README.md says to name files master_dir/species/annotation/species_version_proteins.fa (proteins with an s), so these files are not found and the script fails.
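One possible workaround, sketched here rather than taken from the repository, is to make the search accept both spellings. The function name is hypothetical; renaming the input files to end in protein.fa would work equally well.

```shell
#!/usr/bin/env bash
# Hypothetical variant of the file search: accepts both *protein.fa and *proteins.fa.
find_protein_files() {
    local in_dir=$1
    find "$in_dir" -type f \( -name '*protein.fa' -o -name '*proteins.fa' \)
}

# Usage: bash this_script.sh master_dir
if [ "$#" -gt 0 ]; then
    find_protein_files "$1"
fi
```

The parentheses group the two `-name` tests so that `-type f` applies to both of them.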

Use of run_pfam_scan.sh

Hello,

I'm attempting to use your scripts to follow the protocols described in 'Comparative analysis of plant immune receptor architectures uncovers host proteins likely targeted by pathogens' and I am getting some odd errors when I run this on my plant sequences.

Specifically:
Use of uninitialized value in numeric ge (>=) at PfamScan/Bio/Pfam/Scan/PfamScan.pm line 210.
Use of uninitialized value in numeric ge (>=) at PfamScan/Bio/Pfam/Scan/PfamScan.pm line 213.
Use of uninitialized value in printf at PfamScan/Bio/Pfam/HMM/HMMResultsIO.pm line 1131.

Unfortunately, I'm not experienced enough with scripts to know what's going on. These errors are printed thousands of times in the terminal window. Please let me know if I can provide any other information to help resolve this problem.

Alex Wixom

Output from K-parse_Pfam_domains_NLR-fusions-v2.2.pl missing

When I run K-parse_Pfam_domains_NLR-fusions-v2.2.pl on the verbose output from K-parse_Pfam_domains_v3.1.pl, I no longer get any errors, but the output is blank. All of the files that are supposed to be written contain only the header lines. They don't include any of the information from my db_description.tsv or any of the domain counts. I really have no idea what could be causing this issue.

The command being used is:
perl ./processing_scripts/K-parse_Pfam_domains_NLR-fusions-v2.2.pl -i ./master_dir/ -o ./master_dir/ -d ./master_dir/Ssisymbriifolium/pfam/db_description.tsv

Is there any information I can provide to help resolve this issue?

Thank you for your help,

Alex

empty outputs

I am trying to run 'K-parse_Pfam_domains_NLR-fusions-v2.2.pl' with -i .parsed.verbose -o outputs/ -d db_descriptions.txt.
The files are all from tests_example.
But I got empty outputs:
nlrid_by_prevalence_family_wordcloud_input**.txt
nlrid_by_prevalence***.tsv
nlrid_domains-.stats.tsv
nlrid_summary_table.tsv
How should this be solved? Thanks

No license stated for the scripts

Quoting the README file,

Set of scripts to annotate Pfam domains and extract NLR plant immune receptors and their architectures as published in Sarris et al BMC Biology 2016: https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-016-0228-7

As far as I could tell, there is no mention of any license for the scripts provided in this repository, nor in the paper Sarris et al.

BMC do expect this information to be included in the paper https://bmcbiol.biomedcentral.com/submission-guidelines/preparing-your-manuscript/software but since that was missed, the best option now is to clearly add a license to the repository.

See https://help.github.com/articles/adding-a-license-to-a-repository/ for GitHub's guidelines.

You may be constrained by your employer's policy, but I would urge you to pick a standard open source licence. My personal recommendation would be a BSD or MIT license (both very short and liberal). GitHub provides some guidance on which license to pick https://help.github.com/articles/licensing-a-repository/

If you do open source the scripts, it would be legally possible for the bioinformatics community to package them for automated installation (e.g. I would look at doing this for BioConda https://bioconda.github.io/ - see https://www.biorxiv.org/content/early/2017/10/27/207092 for background).

Versions of PFAM and HMMER etc not stated

README.md does not say which versions of the PFAM database, the HMMER tool, or the pfam_scan.pl script are used.

In particular, since they are not compatible, do you use HMMER 2 or 3? This in turn limits the versions of the PFAM database, and the pfam_scan.pl script.

e.g. http://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/OldPfamScan/HMMER2/README contains the HMMER2 version of pfam_scan.pl. It will not work with Pfam HMMER3 models (ie models from Pfam 24.0 onwards).

run_pfam_scan.sh uses obsolete -pfamB option

According to discussion in #6 these scripts should work with recent Pfam releases using HMMER3 and a recent pfam_scan.pl.

However, the current version of bash_scripts/run_pfam_scan.sh fails when calling pfam_scan.pl v1.6. It aborts with the message:

FATAL: As of release 28.0, Pfam no longer produces Pfam-B. The -pfamB and -only_pfamB options are now obsolete.

It is trivial to remove -pfamB from bash_scripts/run_pfam_scan.sh but does this break any downstream assumptions in the pipeline?