
plant_rgenes's Introduction

plant_rgenes

Set of scripts to annotate Pfam domains and extract NLR plant immune receptors and their architectures as published in Sarris et al BMC Biology 2016: https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-016-0228-7

Our basic pipeline

  1. Obtain protein sequences of species of interest and organise them into a directory.

We follow the Phytozome organisation of master_dir/species/annotation/species_version_proteins.fa, where each species is denoted by the first letter of the genus name followed by the full species name, for example Athaliana.
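As a minimal sketch of this naming convention, the helpers below build the species identifier and the expected FASTA path. The function names `species_id` and `proteins_path` and the version string `167` are hypothetical and are not part of the repository.

```shell
# Hypothetical helpers; these names are not part of the repository.

# species_id GENUS SPECIES -> first letter of the genus + full species name
species_id() {
    local genus="$1" species="$2"
    printf '%s%s\n' "$(printf '%s' "$genus" | cut -c1)" "$species"
}

# proteins_path MASTER_DIR GENUS SPECIES VERSION -> expected FASTA location
proteins_path() {
    local id
    id=$(species_id "$2" "$3")
    printf '%s/%s/annotation/%s_%s_proteins.fa\n' "$1" "$id" "$id" "$4"
}

species_id Arabidopsis thaliana
# "167" is a placeholder version string used only for illustration
proteins_path master_dir Arabidopsis thaliana 167
```

The first call prints Athaliana; the second prints the path where the pipeline would expect the protein FASTA file under this layout.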

  2. Pfam-based annotation of domains

usage: bash run_pfam_scan.sh dir
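The loop below is a rough sketch of what a per-species Pfam scan over the directory layout could look like; it is not the content of run_pfam_scan.sh. It assumes pfam_scan.pl is on the PATH, that results go into a `pfam` sub-directory of each species, and that outputs are named `*_pfamscan.out`. The function name `scan_all_species` and the `PFAM_SCAN_CMD` variable are hypothetical. Only the `-fasta`, `-dir` and `-outfile` options of pfam_scan.pl are used.

```shell
# Hypothetical sketch, NOT the repository's run_pfam_scan.sh.
# scan_all_species MASTER_DIR PFAM_DB_DIR
scan_all_species() {
    local master_dir="$1" pfam_db="$2" fasta species_dir out
    # PFAM_SCAN_CMD is an assumed override hook, e.g. "perl /path/to/pfam_scan.pl"
    local scan_cmd="${PFAM_SCAN_CMD:-pfam_scan.pl}"
    for fasta in "$master_dir"/*/annotation/*_proteins.fa; do
        [ -f "$fasta" ] || continue            # the glob matched nothing
        species_dir=$(dirname "$(dirname "$fasta")")
        mkdir -p "$species_dir/pfam"           # assumed output location
        out="$species_dir/pfam/$(basename "$fasta" .fa)_pfamscan.out"
        [ -s "$out" ] && continue              # skip species that were already scanned
        # Intentionally unquoted so that PFAM_SCAN_CMD may contain several words.
        $scan_cmd -fasta "$fasta" -dir "$pfam_db" -outfile "$out"
    done
}

# Example call (paths are placeholders):
# scan_all_species master_dir /path/to/pfam_db
```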

Dependencies:

  3. Parsing the pfamscan output with K-parse_Pfam_domains_v3.1.pl
  • The script parses the output of pfam_scan.pl.
  • The script extracts all domains for each protein and removes redundant nested hits with larger e-values.
  • Domains are printed out in the order of appearance in the query.
  • By default, Pfam_B domains are skipped.

usage: perl K-parse_Pfam_domains_v3.1.pl <options>

-p|--pfam <pfamscan.out>

-e|--evalue <evalue cutoff>

-o|--output

-v|--verbose <T/F> default F. Display more information about each domain (start, stop, evalue)

We usually parse all pfam outputs of interest in parallel using xargs
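One way this parallel parsing could be set up is sketched below. It assumes that pfam_scan outputs are named `*pfamscan.out`, that `-o` takes an output file path, and that an e-value cutoff of 1e-3 is wanted; none of these are confirmed by the README. The function name `parse_all_pfam` and the `PARSE_CMD` variable are hypothetical.

```shell
# Hypothetical sketch of parsing all pfam_scan outputs in parallel with xargs.
# parse_all_pfam DIR [JOBS] [EVALUE]
parse_all_pfam() {
    local dir="$1" jobs="${2:-4}" evalue="${3:-1e-3}"   # 1e-3 is an assumed cutoff
    # PARSE_CMD is an assumed override hook for the parser command line
    local parse_cmd="${PARSE_CMD:-perl K-parse_Pfam_domains_v3.1.pl}"
    # -print0 / -0 keep file names with spaces intact; -P runs JOBS parsers at once.
    # $parse_cmd is intentionally unquoted so it can hold "perl script.pl".
    find "$dir" -type f -name '*pfamscan.out' -print0 |
        xargs -0 -P "$jobs" -I{} $parse_cmd -p {} -e "$evalue" -v T -o {}.parsed.verbose
}

# Example call (directory is a placeholder):
# parse_all_pfam master_dir 8
```

Verbose output (`-v T`) is requested here because the NLR-fusion script reads the verbose parsed files.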

  4. Identification of non-canonical NLR-ID domain combinations with K-parse_Pfam_domains_NLR-fusions-v2.2.pl
  • This script is configured to find any parsed pfam files in the specified directory or its sub-directories.
  • The script will parse the output of K-parse_Pfam_domains_v3.1.pl.
  • Note that in the current configuration, the script will specifically scan input directories for filenames matching "pfamscanparsed.verbose". If your naming scheme is different, you might want to modify line 62.
  • Configuration of 'db_description' is highly important, as the first check in the script is to match the species_id in db_description to the one in the name of the file. If successful, the script will print the species_id and family name to standard out.
  • NLR proteins are identified based on the presence of the NB-ARC domain.
  • Fusions are identified based on the presence of non-NBS, non-LRR domains that pass the specified e-value cutoff (default 1e-3).

usage: perl K-parse_Pfam_domains_NLR-fusions-v2.2.pl <options>

-i|--indir directory for batch retrieval of input *pfamscan*.parsed.verbose files

-e|--evalue evalue cutoff for determining domain fusions [default 1e-3]

-o|--output output directory

-d|--db_description description of datasets used in the analyses [Organism Species_ID NCBI_taxon_ID Family Database Date_aquired Restrictions Version Common_Name Source Reference] for example of this dataset see Additional file 1 in Sarris et al BMC Biology 2016
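Because the script first matches the species ID in db_description against the input file names, a quick pre-flight check can help when outputs come back empty. The sketch below assumes the description file is tab-separated with Species_ID in the second column (following the column order listed above) and that parsed files end in `parsed.verbose`. The function name `check_species_ids` is hypothetical.

```shell
# Hypothetical pre-flight check: does every parsed file name contain a
# Species_ID listed in column 2 of the (assumed tab-separated) db_description?
# check_species_ids DB_DESCRIPTION INPUT_DIR
check_species_ids() {
    local db="$1" dir="$2" f base
    find "$dir" -type f -name '*parsed.verbose' | while IFS= read -r f; do
        base=$(basename "$f")
        if awk -F'\t' -v name="$base" '
               $2 != "" && index(name, $2) { found = 1 }
               END { exit !found }' "$db"; then
            echo "OK $base"
        else
            echo "MISSING $base"
        fi
    done
}

# Example call (paths are placeholders):
# check_species_ids db_description.txt master_dir
```

Any file reported as MISSING has no species ID from the description file in its name, which under the assumptions above would explain why it contributes nothing to the summaries.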

Outputs:

  • Summary of the number of NLRs and NLR-IDs identified in each species (such as Additional file 2 in Sarris et al BMC Biology 2016)

  • Summary of integrated domains with species list for each domain (such as Additional file 3 in Sarris et al BMC Biology 2016)

  • Abundance list of integrated domains (counted once for each family) that can be used to generate a Wordcloud (such as Figure 2 in Sarris et al BMC Biology 2016)

  • Contingency tables (per ID domain) for each species and for all species combined, together with the results of a left-tailed Fisher's exact test
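For readers who want to re-check a single contingency table by hand, the sketch below computes a left-tailed Fisher's exact test for a 2x2 table with awk. It is a generic textbook calculation, not the code used in the script, and which counts the script places in each cell is not specified here. The function name `fisher_left` is hypothetical.

```shell
# Generic left-tailed Fisher exact test for the 2x2 table
#     a  b
#     c  d
# i.e. P(X <= a) under the hypergeometric distribution with fixed margins.
# fisher_left A B C D
fisher_left() {
    awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" '
        # log(n!) by direct summation; adequate for moderate counts
        function lfact(n,    i, s) { s = 0; for (i = 2; i <= n; i++) s += log(i); return s }
        function lchoose(n, k)     { return lfact(n) - lfact(k) - lfact(n - k) }
        BEGIN {
            r1 = a + b; c1 = a + c; n = a + b + c + d
            lo = c1 - (n - r1); if (lo < 0) lo = 0    # smallest feasible top-left cell
            p = 0
            for (x = lo; x <= a; x++)
                p += exp(lchoose(r1, x) + lchoose(n - r1, c1 - x) - lchoose(n, c1))
            printf "%.6g\n", p
        }'
}

# Example with made-up counts:
fisher_left 1 9 11 3
```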

Example datasets:

The example dataset directory contains input Arabidopsis data as well as corresponding db_description file. It also contains the outputs from each stage of the analyses, so you can check your pipeline against them or test individual scripts.


plant_rgenes's Issues

Kparse...v2.2.pl: defined(@array) is deprecated

When trying to parse the output, I have run into some troubles. After parsing the output of run_pfam_scan.sh using K-parse_Pfam_domains_v3.1.pl (using a verbose output), I am getting the following errors when using the K-parse_Pfam_domains_NLR-fusions-v2.2.pl script:
defined(@array) is deprecated at ./processing_scripts/K-parse_Pfam_domains_NLR-fusions-v2.2.pl line 172.
(Maybe you should just omit the defined()?)

defined(@array) is deprecated at ./processing_scripts/K-parse_Pfam_domains_NLR-fusions-v2.2.pl line 179.
(Maybe you should just omit the defined()?)

defined(@array) is deprecated at ./processing_scripts/K-parse_Pfam_domains_NLR-fusions-v2.2.pl line 249.
(Maybe you should just omit the defined()?)

I've been using the following command to use this script:
perl ./processing_scripts/K-parse_Pfam_domains_NLR-fusions-v2.2.pl -i ./master_dir/ -o ./master_dir/ -d ./master_dir/Ssisymbriifolium/pfam/db_description.txt

Any advice on what needs to be changed to get this to work?

Thank you for your time and help,

Alex

Use of run_pfam_scan.sh

Hello,

I'm attempting to use your scripts to follow the protocols described in 'Comparative analysis of plant immune receptor architectures uncovers host proteins likely targeted by pathogens' and I am getting some odd errors when I run this on my plant sequences.

Specifically:
Use of uninitialized value in numeric ge (>=) at PfamScan/Bio/Pfam/Scan/PfamScan.pm line 210.
Use of uninitialized value in numeric ge (>=) at PfamScan/Bio/Pfam/Scan/PfamScan.pm line 213.
Use of uninitialized value in printf at PfamScan/Bio/Pfam/HMM/HMMResultsIO.pm line 1131.

Unfortunately, I'm not experienced enough with scripts to know what's going on. These errors are printed thousands of times in the terminal window. Please let me know if I can provide any other information to help resolve this problem.

Alex Wixom

Output from K-parse_Pfam_domains_NLR-fusions-v2.2.pl missing

When I run K-parse_Pfam_domains_NLR-fusions-v2.2.pl on the verbose output from K-parse_Pfam_domains_v3.1.pl I no longer have any errors listed, but I am getting only blank output. All of the files that are supposed to output, only have the defined header lines. It doesn't include any of the information from my db_description.tsv or any of the counts of the domains. I really have no idea what could be causing this issue.

The command being used is:
perl ./processing_scripts/K-parse_Pfam_domains_NLR-fusions-v2.2.pl -i ./master_dir/ -o ./master_dir/ -d ./master_dir/Ssisymbriifolium/pfam/db_description.tsv

Is there any information I can provide to help resolve this issue?

Thank you for your help,

Alex

empty outputs

I am trying to run 'K-parse_Pfam_domains_NLR-fusions-v2.2.pl' with -i .parsed.verbose -o outputs/ -d db_descriptions.txt.
The files are all from tests_example.
But I got empty outputs:
nlrid_by_prevalence_family_wordcloud_input
**.txt
nlrid_by_prevalence***.tsv
nlrid_domains-.stats.tsv
nlrid_summary_table
.tsv
How should this be solved? Thanks

No license stated for the scripts

Quoting the README file,

Set of scripts to annotate Pfam domains and extract NLR plant immune receptors and their architectures as published in Sarris et al BMC Biology 2016: https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-016-0228-7

As far as I could tell, there is no mention of any license for the scripts provided in this repository, nor in the paper Sarris et al.

BMC do expect this information to be included in the paper https://bmcbiol.biomedcentral.com/submission-guidelines/preparing-your-manuscript/software but since that was missed, the best option now is to clearly add a license to the repository.

See https://help.github.com/articles/adding-a-license-to-a-repository/ for GitHub's guidelines.

You may be constrained by your employer's policy, but I would urge you to pick a standard open source licence. My personal recommendation would be a BSD or MIT license (both very short and liberal). GitHub provides some guidance on which license to pick https://help.github.com/articles/licensing-a-repository/

If you do open source the scripts, it would be legally possible for the bioinformatics community to package them for automated installation (e.g. I would look at doing this for BioConda https://bioconda.github.io/ - see https://www.biorxiv.org/content/early/2017/10/27/207092 for background).

Versions of PFAM and HMMER etc not stated

README.md does not say which versions of the PFAM database, HMMER tool, nor pfam_scan.pl script are used.

In particular, since they are not compatible, do you use HMMER 2 or 3? This in turn limits the versions of the PFAM database, and the pfam_scan.pl script.

e.g. http://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/OldPfamScan/HMMER2/README contains the HMMER2 version of pfam_scan.pl. It will not work with Pfam HMMER3 models (ie models from Pfam 24.0 onwards).

run_pfam_scan.sh uses obsolete -pfamB option

According to discussion in #6 these scripts should work with recent Pfam releases using HMMER3 and a recent pfam_scan.pl.

However, the current version of bash_scripts/run_pfam_scan.sh fails when calling pfam_scan.pl v1.6; it aborts with the message:

FATAL: As of release 28.0, Pfam no longer produces Pfam-B. The -pfamB and -only_pfamB options are now obsolete.

It is trivial to remove -pfamB from bash_scripts/run_pfam_scan.sh but does this break any downstream assumptions in the pipeline?
