General Information
Data and code directory on the server:
/mnt/data/asis/prot-scriber
Note that all relative paths below are relative to this directory.
R-Code for evaluation:
prot.scriber-evaluation_R
Executable
exec/measurePerformance.R
can be executed with
Rscript exec/measurePerformance.R
Rust-Code of production version of prot-scriber:
prot-scriber-Rust
can be executed with
./target/release/prot-scriber --help
Note: You can create a symbolic link (ln -s) to the above executable in a directory on your $PATH
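A minimal sketch of the linking step. The helper name link_into_bin is hypothetical, and the paths in the commented call are assumptions assembled from the directories named above (a personal $HOME/bin that is listed in $PATH); adjust both to your setup.

```shell
#!/bin/sh
# link_into_bin EXECUTABLE BIN_DIR
# Creates (or replaces) a symbolic link to EXECUTABLE inside BIN_DIR.
# BIN_DIR only helps if it is listed in your $PATH.
link_into_bin() {
  exe="$1"
  bin_dir="$2"
  mkdir -p "$bin_dir" || return 1
  # -s: symbolic link, -f: replace an existing link of the same name
  ln -sf "$exe" "$bin_dir/$(basename "$exe")"
}

# Example call (assumed paths, uncomment and adjust):
# link_into_bin /mnt/data/asis/prot-scriber/prot-scriber-Rust/target/release/prot-scriber "$HOME/bin"
```

Afterwards, prot-scriber --help should work from any directory, provided the chosen bin directory really is on your $PATH.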
...
General approach
The following evaluation procedure is implemented in the evaluation R-script (exec/measurePerformance.R). The script
- reads in prot-scriber annotations and splits them into words,
- reads in reference annotations (Pfam-A and Mercator/MapMan4) and splits them into words,
- compares prot-scriber words with reference words to calculate true positives, false negatives, etc.
In other words, the "predictor" under evaluation is prot-scriber, i.e. the words of its generated annotations, and the reference (gold standard) is the set union of the words extracted from the Pfam-A and Mercator annotations.
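The word comparison can be illustrated with a small shell sketch. This is not the R implementation: the three annotations are made-up examples, and splitting on spaces and lower-casing the words are assumptions of this sketch only.

```shell
#!/bin/sh
# Hypothetical example annotations, not taken from the evaluation data.
prediction="Serine threonine protein kinase"
ref_pfam="Protein kinase domain"
ref_mercator="receptor-like protein kinase"

# Split text into unique, lower-cased, sorted words (one per line).
# Lower-casing and splitting on spaces are assumptions of this sketch.
to_words() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -s ' ' '\n' | LC_ALL=C sort -u
}

work=$(mktemp -d)
to_words "$prediction" > "$work/pred.txt"
# Reference = set union of the Pfam-A and Mercator words
to_words "$ref_pfam $ref_mercator" > "$work/ref.txt"

# comm expects sorted input and prints three columns:
# 1 = only in pred (false positives), 2 = only in ref (false negatives), 3 = shared (true positives)
tp=$(LC_ALL=C comm -12 "$work/pred.txt" "$work/ref.txt" | wc -l | tr -d ' ')
fp=$(LC_ALL=C comm -23 "$work/pred.txt" "$work/ref.txt" | wc -l | tr -d ' ')
fn=$(LC_ALL=C comm -13 "$work/pred.txt" "$work/ref.txt" | wc -l | tr -d ' ')
rm -r "$work"

echo "TP=$tp FP=$fp FN=$fn"   # prints: TP=2 FP=2 FN=2
```

Here "protein" and "kinase" are shared, "serine" and "threonine" appear only in the prediction, and "domain" and "receptor-like" appear only in the reference.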
Install the prot-scriber R version
Change to the project directory and open R
cd /mnt/data/asis/prot-scriber/prot.scriber-evaluation_R
R
In an interactive R-shell execute:
install.packages(c('data.table', 'optparse', 'brew', 'seqinr', 'ggplot2', 'RColorBrewer'))
q()
Finally in the BASH-shell execute
Gold standard data
This is the data on which we run prot-scriber and against which we evaluate its annotations.
Directory of evaluation data:
/mnt/data/asis/prot-scriber/evaluation
We have three data sets (P. coccineus, Faba, and MetaEuk) that were not yet in UniProt when the evaluation started.
Reference annotations
We compare the words in prot-scriber annotations with the words in reference annotations. Here, "annotations" means protein function predictions: the short human readable descriptions (HRDs) generated by prot-scriber, the Pfam-A annotations generated by running HMMER3 on each of the above protein sets, and the MapMan4 [2] annotations generated with Mercator [1].
For each of the three above protein sets you find the respective annotation files.
For P. coccineus:
- Pcoccineus_mercator_v4_results.txt: the Mercator annotations
- Pcoccineus_vs_PfamA_hmmscan_out.tsv: the Pfam annotations
For Faba:
- Faba_mercator_results.txt: the Mercator annotations
- Faba_vs_PfamA_hmmscan_out.tsv: the Pfam annotations
For MetaEuk:
Note that, for the performance measures, MetaEuk has been processed in eight batches (sub-sets).
- See the sub-directory MetaEuk_batches/Mercator_MapMan4_annotations for the Mercator (MapMan4) reference annotations.
- See the files MetaEuk_preds_Tara_vs_euk_profiles_uniqs_short_IDs_vs_PfamA_batch_1.txt (replace batch_1 with your batch number) for the Pfam-A annotations.
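As a small illustration of the batch naming, the following sketch prints the Pfam-A file name for each of the eight batches. The helper name pfam_batch_file is hypothetical; the file-name pattern is copied from the batch_1 example, and it is an assumption that all batches follow it.

```shell
#!/bin/sh
# pfam_batch_file BATCH_NO
# Prints the Pfam-A annotation file name for the given batch
# (pattern copied from the batch_1 example; assumed to hold for all batches).
pfam_batch_file() {
  echo "MetaEuk_preds_Tara_vs_euk_profiles_uniqs_short_IDs_vs_PfamA_batch_$1.txt"
}

# List the file names for all eight batches
batch=1
while [ "$batch" -le 8 ]; do
  pfam_batch_file "$batch"
  batch=$((batch + 1))
done
```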
prot-scriber input data
prot-scriber consumes the output of BLAST (or Diamond, a modern and very fast BLAST reimplementation) to generate its protein function predictions in the form of short human readable descriptions (HRDs).
The BLAST output tables that prot-scriber consumes were generated using the UniProtKB databases from April 2021.
If you run BLAST (Diamond) again at any point, you must use the BLAST databases in the following folder, because those do not yet contain the above reference proteins:
/mnt/data/asis/UniProt/previous/20210408
Blast results for the respective reference proteins
For P. coccineus:
- versus (searched against) SwissProt: Pcoccineus_vs_swisprot_blastp.txt
- versus trEMBL: Pcoccineus_vs_trembl_blastp.txt
For Faba:
- versus SwissProt: Faba_vs_swisprot_blastp.txt
- versus trEMBL: Faba_vs_trembl_blastp.txt
For MetaEuk batches, e.g. batch_1, in the folder MetaEuk_batches:
- versus SwissProt: MetaEuk_preds_Tara_vs_euk_profiles_uniqs_short_IDs_vs_Swissprot_batch_1.txt
- versus trEMBL: MetaEuk_preds_Tara_vs_euk_profiles_uniqs_vs_trembl_blastp_batch_1.txt
The job management system
Read the manual provided by our system administrators!
Most important commands:
- qsub: submit a script to the job system
- qstat: see the status of your running jobs. Use qstat -a ("all") to see terminated jobs, too.
- qhost: see available hosts (nodes), i.e. compute servers
For an example script that runs the evaluation R-script on the prot-scriber annotations generated for MetaEuk batch_1, see:
./evaluation/MetaEuk_batches/scripts/measure_prot-scriber_performance_on_MetaEuk_batch_1_oge.sh
Copy such a script and adjust to your needs. Consider the header:
#!/bin/bash
#$ -l mem_free=4G,h_vmem=4G
#$ -pe smp 20
#$ -e /mnt/data/asis/prot-scriber/evaluation/MetaEuk_batches/scripts/measure_prot-scriber_performance_on_MetaEuk_batch_1_oge.err
#$ -o /mnt/data/asis/prot-scriber/evaluation/MetaEuk_batches/scripts/measure_prot-scriber_performance_on_MetaEuk_batch_1_oge.out
- line 2: the required memory per core
- line 3: the number of cores you need (on a single node)
- line 4: the path to the error file; error messages (stderr) are written there
- line 5: the path to the output file; stdout is written there
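Copying and adjusting such a script for another batch can be sketched as follows. The helper name derive_batch_script is hypothetical, and the sketch assumes that every batch-specific part of the template, including its own file name, contains the literal string batch_1. Memory and core settings still need to be reviewed by hand.

```shell
#!/bin/sh
# derive_batch_script TEMPLATE BATCH_NO
# Copies TEMPLATE (a batch_1 job script) to a new file in which every
# occurrence of "batch_1" is replaced by "batch_<BATCH_NO>", both in the
# file name and in the file content. Prints the path of the new script.
derive_batch_script() {
  template="$1"
  batch_no="$2"
  target=$(printf '%s\n' "$template" | sed "s/batch_1/batch_${batch_no}/g")
  sed "s/batch_1/batch_${batch_no}/g" "$template" > "$target" || return 1
  echo "$target"
}

# Example call (uncomment and adjust):
# derive_batch_script ./evaluation/MetaEuk_batches/scripts/measure_prot-scriber_performance_on_MetaEuk_batch_1_oge.sh 2
```

The derived script can then be submitted with qsub and monitored with qstat.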
References
- Lohse, M., Nagel, A., Herter, T., May, P., Schroda, M., Zrenner, R., Tohge, T., Fernie, A. R., Stitt, M., & Usadel, B. (2014). Mercator: A fast and simple web server for genome scale functional annotation of plant sequence data. Plant, Cell & Environment, 37(5), 1250–1258. https://doi.org/10.1111/pce.12231
- Schwacke, R., Ponce-Soto, G. Y., Krause, K., Bolger, A. M., Arsova, B., Hallab, A., Gruden, K., Stitt, M., Bolger, M. E., & Usadel, B. (2019). MapMan4: A refined protein classification and annotation framework applicable to multi-omics data analysis. Molecular Plant. https://doi.org/10.1016/j.molp.2019.01.003