Perl homology searcher based on web scraping and heuristic approaches. It is designed to query HomoloGene, Ensembl and InParanoid after running the bidirectional best hit (BDBH) algorithm.
Clone the repository locally:
git clone https://github.com/carrascomj/gowsh
Add the script to your PATH (in your bash initialization file, e.g., ~/.bashrc):
export PATH=$PATH:"path/to/gowsh/bin"
The program requires additional Perl modules, which can be installed with cpanm if they are not already present:
cpanm JSON Data::Dumper Bio::SeqIO LWP::Simple File::Basename Getopt::Long XML::Parser
Alternatively, WebAPIsGOWSH can be installed as a regular Perl package (from the 'gowsh/' directory):
perl Makefile.PL
make
make install
Finally, formatdb and blast+ are both required.
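Before the first run it can help to confirm that these programs are actually reachable from your PATH. The following is a minimal sketch, not part of gowsh: the helper name `missing_tools` is hypothetical, and which blast+ binary gowsh.pl calls is not stated here, so `blastp` in the usage note is an assumption.

```shell
# Sketch: print the name of every program passed as an argument
# that cannot be found on PATH. Prints nothing if all are found.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, `missing_tools formatdb blastp cpanm` lists whichever of the three names are not installed (assuming `blastp` is the blast+ program you need).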
gowsh.pl is the main script. The program takes command-line arguments with the following options:
gowsh.pl --gfile|go|glist "path_to_file|GOid|list" --tfile|torg "path_to_file|organism"
[--modfile|modorg "path_to_file|organism"] [--out "outfile"] [--preserve]
--gfile path_to_file: input, genes as multiFASTA
--go GOid: input, Gene Ontology ID (as in AmiGO)
--glist list: input, blank-separated list of gene IDs
--tfile path_to_file: multiFASTA containing proteins of genome of target organism
--torg organism: target organism name (genus and species)
--modfile path_to_file: optional, multiFASTA containing proteins of genome of model organism
--modorg organism: optional, model organism name (genus and species)
--out "outfile": optional, name of output file; default "GOWSH_output.txt"
--preserve: optional, if given, (nearly) all generated files are preserved.
The script can be tested with the following command:
gowsh.pl --go 0048507 --modorg "arabidopsis thaliana" --torg "oryza sativa"
You can compare the output with the file "t/GOWSH_outputq1.tsv".
The program will then parse the input, download both genomes from NCBI and try to match homologues.
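The matching step relies on bidirectional best hits (BDBH): two proteins are treated as putative orthologues when each one is the other's top-scoring BLAST hit. The sketch below illustrates that idea only and is not the code used by gowsh.pl. It assumes two BLAST tabular result files (`-outfmt 6`, with query ID in column 1, subject ID in column 2 and bit score in column 12); the function name `bdbh_pairs` and the file roles are hypothetical.

```shell
# Sketch: reciprocal best hits from two BLAST tabular (-outfmt 6) files.
# Usage: bdbh_pairs FORWARD REVERSE
#   FORWARD: model proteins searched against the target proteome
#   REVERSE: target proteins searched against the model proteome
# Prints one "model_id<TAB>target_id" line per reciprocal best pair.
bdbh_pairs() {
    awk -F'\t' '
        # First file: keep the highest-scoring subject for each query.
        FILENAME == ARGV[1] {
            if (!($1 in fscore) || $12 + 0 > fscore[$1]) {
                fscore[$1] = $12 + 0
                fbest[$1] = $2
            }
            next
        }
        # Second file: same bookkeeping in the opposite direction.
        {
            if (!($1 in rscore) || $12 + 0 > rscore[$1]) {
                rscore[$1] = $12 + 0
                rbest[$1] = $2
            }
        }
        # A pair is kept only if the best hits point at each other.
        END {
            for (q in fbest) {
                s = fbest[q]
                if ((s in rbest) && rbest[s] == q) print q "\t" s
            }
        }
    ' "$1" "$2" | sort
}
```

Ranking by bit score is a simplification chosen for this sketch; ties are resolved in favour of the first hit seen, and a real pipeline would typically also apply E-value or coverage thresholds.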
This code was developed as a project for one of the subjects of my BSc in Biotechnology (UPM). To sum up, I learned the following concepts:
- Web scraping biological information using Perl and the mygene API.
- Use of Entrez E-utilities programmatic access API from NCBI.
- Use of Ensembl REST API.
- Running BLAST locally using blast+.
- Heuristic algorithms to account for homology.
- How to build a Perl package.
- How to write a README.md.