usadellab / prot-scriber

Assigns short human readable descriptions to biological sequences or gene families using references. For this, prot-scriber consumes sequence similarity search results in tabular format (Blast or Diamond).

License: GNU General Public License v3.0

Rust 98.47% R 1.53%
bioinformatics protein-function-prediction gene-families biological-sequence-analysis blastp blastx genomics

prot-scriber's People

Contributors: asishallab, atemia, coeit, derriesenotter, wunderbarr

prot-scriber's Issues

Consider phrases consisting only of non-informative words "unknown protein" or "unknown family"

Currently, phrases, i.e. sub-sets of candidate descriptions, that only consist of non-informative words are still scored and might be assigned as the final annotation. This should be changed.

If the set of informative words is empty, classify the protein or sequence family as "unknown".

Hint:

pub fn generate_human_readable_description(
    // [...]
    let mut informative_words_universe: Vec<String> = vec![];
    // If that universe is empty, do not produce an annotation, i.e. ignore phrases whose only constituents are non-informative words
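
A minimal sketch of the requested guard; the function and argument names are illustrative, and the fallback would be the default "unknown protein" / "unknown sequence family" description:

fn hrd_from_informative_words(informative_words_universe: &[String], unknown_fallback: &str) -> String {
    if informative_words_universe.is_empty() {
        // The universe of informative words is empty: do not score phrases consisting
        // only of non-informative words, classify the query (or family) as "unknown".
        return unknown_fallback.to_string();
    }
    // Placeholder for the real phrase scoring over the informative words:
    informative_words_universe.join(" ")
}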

Add a command-line option to exclude "unknown protein" or "unknown sequence family" from results

Currently prot-scriber writes out the above default descriptions for queries (protein or family) that could not be assigned a meaningful annotation (human readable description).

Add a command-line option to exclude these queries from the result table instead of calling them "unknown".

This option is particularly useful for the Mercator integration.

Use --exclude-not-annotated-queries with the short form -x.

Filter the word-sets (step 1) using an input vector of regular expressions, flagging words that should be scored and those that should not. The result should be a dictionary of all words with boolean values; true means the word should be scored.

This is step two in the main annotation function as explained in the development wiki.

Assume you get a list of regular expressions as input argument. For testing use default.rs::FILTER_REGEXS.

Recommended implementation

Write several functions:

  1. function for deciding whether a word is informative or not, signature: anyRegexMatches( word: &String, filterRegexs: &Vec<String> ) -> bool
  2. function for splitting candidate descriptions into word sets, where a word-set is represented as a vector that maintains the order of words from the original candidate description (input). Make sure the maximum word-set is included, i.e. the sub-set that equals the full set, and make sure you test the order of words.
  3. function combining 1. and 2. to actually construct "phrases" from an input candidate description, signature: phrasesFromCandidateDescription( hrd: &String, filterRegexs: &Vec<String>, ... ) -> PhrasesInfo
use std::collections::HashMap;

struct PhrasesInfo {
    phrases: Vec<String>,
    wordClassification: HashMap<String, bool>,
}

impl PhrasesInfo {
    pub fn getWordUniverse(&self) -> Vec<String> {
        self.wordClassification.keys().cloned().collect()
    }
}

Comment on above function 2.:

  1. First split into word-sets (phrases); after constructing each phrase, add all its words into a HashSet, so you have a representation of the "universe words", i.e. all words that appear across all phrases.
  2. On each of these words in the "universe" run function 1. to classify its informativeness (a sketch of functions 1.-3. follows below).
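
A minimal sketch of functions 1.-3. under the above recommendations (snake_case variants of the suggested names, assuming the regex crate). For brevity it only builds the maximum word-set and returns a tuple instead of PhrasesInfo; the real implementation would also generate the shorter phrases:

use regex::Regex;
use std::collections::HashMap;

/// 1. A word is non-informative if any of the filter regexs matches it.
fn any_regex_matches(word: &str, filter_regexs: &[Regex]) -> bool {
    filter_regexs.iter().any(|r| r.is_match(word))
}

/// 2. Split a candidate description into its ordered word vector (the maximum word-set).
fn split_into_words(hrd: &str) -> Vec<String> {
    hrd.split_whitespace().map(|w| w.to_string()).collect()
}

/// 3. Combine 1. and 2.: build the phrase and classify every word of the universe as
///    informative (true) or non-informative (false).
fn phrases_from_candidate_description(
    hrd: &str,
    filter_regexs: &[Regex],
) -> (Vec<String>, HashMap<String, bool>) {
    let words = split_into_words(hrd);
    let mut word_classification: HashMap<String, bool> = HashMap::new();
    for w in &words {
        word_classification.insert(w.clone(), !any_regex_matches(w, filter_regexs));
    }
    (words, word_classification)
}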

Remove deprecated code in Hit and related

In a previous experiment aimed at building a well-performing annotation procedure that generates human readable descriptions for query sequences from Blast results, code was developed that is now deprecated. This code needs to be removed.

Hit

In module hit.rs remove everything, fields and functions (methods), that are not related to the following fields of a Hit:

  • id
  • bitscore
  • description

This affects the following fields:

  • qstart
  • qend
  • overlap_with_query
  • query_similarity_score

This affects the following functions / methods:

  • impl Ord
  • impl PartialOrd
  • refactor function new so it does not require the removed fields any more
  • overlap_with_hit
  • description_similarity
  • check whether description_words is really used - I think it is not
  • similarity

Further TODOs:

  • Adjust tests.
  • Check rest of the code-base for un-used functions / methods (see output, i.e. warnings of cargo build).

Cargo.toml

Check which dependencies are still required and remove those that are no longer needed. Probable candidates are:

  • ndarray
  • num

default.rs

There might be a few constants in default.rs that are deprecated, i.e. not used any more:

  • everything related to markov clustering starting with MCL_
  • maybe ROUND_DECIMAL_DIGITS?
  • CLUSTER_CONSENSUS_DESCRIPTION_JOIN

Do not break existing code

Make sure you run all tests to ensure that prot-scriber still works after this clean-up.

Add optional input support

In main.rs the following optional inputs should be added; if given by the user, they replace the respective defaults defined in default.rs.

Implementation

For all non-mandatory options, use the default value from default.rs. In main, overwrite these defaults if and only if the user provides a custom value on the command line.

Multiple vs single options

Please note that prot-scriber uses clap, which supports passing certain command-line options multiple times. In prot-scriber these include, among others, --seq-sim-table. This means that a user can call prot-scriber with multiple Blast result tables like this:

prot-scriber --seq-sim-table my_proteins_vs_Swissprot_blast_tableout.txt --seq-sim-table my_proteins_vs_trEMBL_blast_tableout.txt ...

Inside prot-scriber the order of appearance of these multiple options is preserved. Because prot-scriber also allows specifying some options for a particular input, it is important to note that several input options can relate to each other by their order of appearance. See the blacklist regex list below for a clear example.

Allow keyword default for all non mandatory options

Any non-mandatory option should accept the keyword default. This enables any combination of position-sensitive custom and default options; see "Multiple vs single options" above for more details. So make sure that if a command-line option is provided with default, the default value from default.rs is used for that specific option.
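
A sketch of how this could look with the clap 2.x style API (value_of/values_of, as in the main.rs snippets quoted in other issues here); the option names are the real ones, while DEFAULT_BLACKLIST is a placeholder for the constant that would live in default.rs:

use clap::{App, Arg};

fn main() {
    let matches = App::new("prot-scriber")
        .arg(
            Arg::with_name("seq-sim-table")
                .long("seq-sim-table")
                .takes_value(true)
                .multiple(true)
                .required(true),
        )
        .arg(
            Arg::with_name("blacklist-regex-list")
                .long("blacklist-regex-list")
                .takes_value(true)
                .multiple(true),
        )
        .get_matches();

    // Values keep their order of appearance. The literal keyword "default" is replaced
    // with the built-in value from default.rs (placeholder constant here).
    const DEFAULT_BLACKLIST: &str = "<built-in blacklist>";
    let blacklists: Vec<String> = matches
        .values_of("blacklist-regex-list")
        .map(|vals| {
            vals.map(|v| {
                if v == "default" {
                    DEFAULT_BLACKLIST.to_string()
                } else {
                    v.to_string()
                }
            })
            .collect()
        })
        .unwrap_or_default();
    println!("blacklists in order of appearance: {:?}", blacklists);
}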

regex to split genes in gene families

This option should be named family-gene-separator-regex; the default has been discussed in issue #6.

regex to split gene identifier from gene-list in gene family input files

This option should be named family-id-gene-list-separator; the default has been discussed in the respective issue #6. Enable the user to provide a custom regular expression.

blacklist regex list

Imagine you want to provide a custom blacklist for a Blast input table from a search of your query proteins against a non-standard database, e.g. a genome that does not adhere to the UniProt stitle line standards. To apply such a blacklist only to the Blast input table at the same position at which the blacklist is stated, the user would do this:

prot-scriber  --seq-sim-table my_proteins_vs_alien_genome.txt --seq-sim-table my_proteins_vs_Swissprot_blast_tableout.txt \
  --blacklist-regex-list alien_genome_blacklist_regexs.txt

Because the argument --blacklist-regex-list appears only once, while the argument --seq-sim-table appears twice, the --blacklist-regex-list argument will be applied to the first --seq-sim-table (my_proteins_vs_alien_genome.txt) but not to the second one (my_proteins_vs_Swissprot_blast_tableout.txt).
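
A minimal sketch (illustrative function name) of this position-sensitive pairing, i.e. the i-th --blacklist-regex-list value, if present, belongs to the i-th --seq-sim-table:

fn pair_by_position<'a>(
    seq_sim_tables: &'a [String],
    blacklist_files: &'a [String],
) -> Vec<(&'a String, Option<&'a String>)> {
    seq_sim_tables
        .iter()
        .enumerate()
        // Tables without a blacklist at the same position fall back to the default (None here):
        .map(|(i, table)| (table, blacklist_files.get(i)))
        .collect()
}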

filter regex list

Enable custom lists of regular expressions with which the stitles of Blast Hits are filtered. Remember that filtering cuts out undesired parts of stitle strings. See model_funs.rs filter_stitle for more details.

Note that this option can be given multiple times and is position sensitive like the above blacklist-regex-list. So, depending on its position, i.e. the n-th time this option is provided with a value (a valid file path), it applies to the n-th --seq-sim-table.

informative regex list

A method to distinguish informative from un-informative words is currently being implemented as specified here. Un-informative words are not removed from phrases, but they are not scored either. They are detected by applying a list of regular expressions in sequence to each word; if any regex matches, the word is considered un-informative. Enable users to supply their own regex lists for un-informative words. Again, this option is position sensitive: its order of appearance ties it to the --seq-sim-table input at the same position (see the examples above for more details).

Performance evaluation

General Information

Data and code directory on the server:
/mnt/data/asis/prot-scriber
Note that in the following all relative paths are to be rooted in this directory.

R code for evaluation: prot.scriber-evaluation_R
Executable: exec/measurePerformance.R, which can be executed with
Rscript exec/measurePerformance.R

Rust code of the production version of prot-scriber: prot-scriber-Rust, which can be executed with
/target/release/prot-scriber --help

Note: You can link (ln -s) to the above executable in your $PATH ...

General approach

The following evaluation procedure is implemented in the R-script mentioned below. The script

  • reads in prot-scriber annotations, splits them into words,
  • reads in reference annotations (PFam-A and Mercator/MapMan4) and splits them into words
  • compares prot-scriber words with reference words to calculate true positives, false negatives, etc
  • This means that the "predictor" to be evaluated (obviously) is prot-scriber, i.e. the generated annotations (words) and the reference (gold standard) is the set union of words extracted from PFam-A and Mercator annotations.

Install the prot-scriber R version

Change to the project directory and open R

cd /mnt/data/asis/prot-scriber/prot.scriber-evaluation_R
R

In an interactive R-shell execute:

install.packages(c('data.table', 'optparse', 'brew', 'seqinr', 'ggplot2', 'RColorBrewer'))
q()

Finally in the BASH-shell execute

R CMD INSTALL .

gold standard data

This is the data we will run prot-scriber on and evaluate it with.

Directory of evaluation data:
/mnt/data/asis/prot-scriber/evaluation

We have three data-sets (P. coccineus, Faba, and MetaEuk) that at the time of starting the evaluation were not yet in UniProt:

Reference annotations

We compare the words in prot-scriber annotations with the words in reference annotations. Note that "annotations" here means protein function predictions in the form of short human readable descriptions (HRDs) generated by prot-scriber, Pfam-A annotations generated by running HMMER3 on each of the above protein sets, and finally MapMan4 [2] annotations generated using Mercator [1].

For each of the three above protein sets you find the respective annotation files.

For P. coccineus:

  • Pcoccineus_mercator_v4_results.txt the Mercator annotations
  • Pcoccineus_vs_PfamA_hmmscan_out.tsv the PFam annotations

For Faba:

  • Faba_mercator_results.txt the Mercator annotations
  • Faba_vs_PfamA_hmmscan_out.tsv the PFam annotations

MetaEuk:
Note that for the performance measures MetaEuk has been processed in batches (sub-sets); we used eight batches.

  • See sub-directory MetaEuk_batches/Mercator_MapMan4_annotations for Mercator (MapMan4) reference annotations
  • See files MetaEuk_preds_Tara_vs_euk_profiles_uniqs_short_IDs_vs_PfamA_batch_1.txt (replace batch_1 with your batch no) for PFam A annotations

prot-scriber input data

You know that prot-scriber consumes BLAST (or Diamond, a modern and very fast BLAST reimplementation) outputs to generate its protein function predictions in the form of short human readable descriptions (HRDs).

The above Blast output tables that prot-scriber consumes have been generated using UniProtKB databases from April 2021.

If you run BLAST (Diamond) at any point again, you must use the Blast databases in the following folder:
/mnt/data/asis/UniProt/previous/20210408, because those do not yet contain the above reference proteins.

Blast results for the respective reference proteins

For P. coccineus:

  • versus (searched) SwissProt Pcoccineus_vs_swisprot_blastp.txt
  • versus trEMBL Pcoccineus_vs_trembl_blastp.txt

For Faba:

  • versus SwissProt Faba_vs_swisprot_blastp.txt
  • versus trEMBL Faba_vs_trembl_blastp.txt

For the MetaEuk batches, e.g. batch_1 in the folder MetaEuk_batches:

  • versus SwissProt MetaEuk_preds_Tara_vs_euk_profiles_uniqs_short_IDs_vs_Swissprot_batch_1.txt
  • versus trEMBL MetaEuk_preds_Tara_vs_euk_profiles_uniqs_vs_trembl_blastp_batch_1.txt

The job management system

Read the manual provided by our system administrators!

Most important commands:

  • qsub to submit a script to the job system
  • qstat to see the status of your running jobs. Use qstat -a ("all") to see terminated jobs, too.
  • qhost to see available hosts (nodes), i.e. compute servers

To run a script that e.g. executes the evaluation R-script on prot-scriber annotations generated for the MetaEuk batch_1 see:
./evaluation/MetaEuk_batches/scripts/measure_prot-scriber_performance_on_MetaEuk_batch_1_oge.sh

Copy such a script and adjust it to your needs. Consider the header:

#!/bin/bash
#$ -l mem_free=4G,h_vmem=4G
#$ -pe smp 20
#$ -e /mnt/data/asis/prot-scriber/evaluation/MetaEuk_batches/scripts/measure_prot-scriber_performance_on_MetaEuk_batch_1_oge.err
#$ -o /mnt/data/asis/prot-scriber/evaluation/MetaEuk_batches/scripts/measure_prot-scriber_performance_on_MetaEuk_batch_1_oge.out
  • line 2: Set the required memory per core
  • line 3: How many cores you need (on one single node)
  • line 4: The path to the error file - error messages (stderr) will appear in there
  • line 5: The path to the output file - stdout will be saved there.

References

  1. Lohse, M., Nagel, A., Herter, T., May, P., Schroda, M., Zrenner, R., Tohge, T., Fernie, A. R., Stitt, M., & Usadel, B. (2014). Mercator: A fast and simple web server for genome scale functional annotation of plant sequence data. Plant, Cell & Environment, 37(5), 1250-1258. https://doi.org/10.1111/pce.12231
  2. Schwacke, R., Ponce-Soto, G. Y., Krause, K., Bolger, A. M., Arsova, B., Hallab, A., Gruden, K., Stitt, M., Bolger, M. E., & Usadel, B. (2019). MapMan4: A refined protein classification and annotation framework applicable to multi-omics data analysis. Molecular Plant. https://doi.org/10.1016/j.molp.2019.01.003

Implement an output-writer that generates the tabular output of prot-scriber

Implement a module that exposes a single function

use std::collections::HashMap;
use std::fs::File;
use std::io::{BufWriter, Write};

pub fn write_output_table(file_path: String, human_readable_descriptions: HashMap<String, String>) {
    // Stream-write the table line after line:
    let file = File::create(&file_path).expect("could not create output file");
    let mut writer = BufWriter::new(file);
    // Header line:
    writeln!(writer, "Annotee-Identifier\tHuman-Readable-Description").expect("could not write header");
    // Iterate over the entries in argument human_readable_descriptions:
    for (annotee_name, annotation) in &human_readable_descriptions {
        writeln!(writer, "{}\t{}", annotee_name, annotation).expect("could not write output line");
    }
}

Make sure a header line is present. Thus the final table should look something like this:

Annotee-Identifier <TAB> Human-Readable-Description
Seq-Family-1 <TAB> alien devouring protein
Protein-123 <TAB> human devouring protein

Note, that the table (output) does not care if gene families or single query sequences were annotated. Both of them are "annotee"s, i.e. entities being annotated.

Once this function is implemented and tested, actually invoke it in main() as indicated by this snippet taken from there:

    // Save output:
    if let Some(o) = matches.value_of("output") {
        // write_output_table(o, human_readable_descriptions);
    }

Improve error messages' print out

If input arguments are not given in the required format, prot-scriber prints an informative error message. However, the message appears in the middle of "exception-like" error reporting and is thus easily overlooked. Reproduce e.g. by giving a wrong number of -l arguments for a given number of -s arguments.

Please improve the error message printing, so they are more easily spotted and understood.

Handle parsed Blast results for a given query event

Implement a method that reacts to the parsed input for a given query.

Background

An instance of AnnotationProcess has several input sequence similarity search result (SSSR) files. prot-scriber expects these to be in tabular format and to be sorted by query identifiers (qacc in Blast terminology). Each input table is read and parsed in a dedicated thread. Once all rows for a given query have been parsed from a table, that data can be processed.

Actions to undertake in the "parsed query results" event

Check how many of the input SSSR tables returned results for the particular query id. If all of them have returned data, go ahead and process it. Processing differs between the two modes prot-scriber can work in:

  • mode "query sequence annotation" generates human readable description (HRD) annotations for query sequences
  • mode "gene family annotation" generates HRDs for gene families, i.e. sets of query sequences

Ensure SSSR tables are sorted by query-identifiers

Make sure that, if more data is parsed for this particular query later on, an error is raised informing the user that the input SSSR table is not sorted by query identifiers, as it should be.

Query Sequence Annotation

If all input SSSR tables have provided results for a given query, generate a human readable description for it, and delete its data from the in memory database for efficiency.

Gene Family Annotation

If all input SSSR tables have provided results for a given query, record that in a vector of indices of parsed queries in the GeneFamily instance. Once all genes of the GeneFamily have been parsed, use the data to generate a human readable description for that family, store the result, and drop all data of the queries and the GeneFamily for efficiency.
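
The following is a hedged sketch of that event logic; all types and names are illustrative, not the actual AnnotationProcess API:

use std::collections::{HashMap, HashSet};

/// Tracks which input SSSR tables have already delivered results for a query.
struct QueryState {
    tables_done: HashSet<String>,
}

/// React to "all rows of `table` for `query_id` have been parsed".
fn on_query_results_parsed(
    n_sssr_tables: usize,
    states: &mut HashMap<String, QueryState>,
    annotated: &mut HashSet<String>,
    query_id: &str,
    table: &str,
) -> Result<(), String> {
    if annotated.contains(query_id) {
        // Data for an already processed query arrived later: the table cannot be sorted
        // by query identifiers, which is a hard requirement.
        return Err(format!(
            "SSSR table '{}' is not sorted by query identifiers (query '{}' re-appeared)",
            table, query_id
        ));
    }
    let state = states
        .entry(query_id.to_string())
        .or_insert_with(|| QueryState {
            tables_done: HashSet::new(),
        });
    state.tables_done.insert(table.to_string());
    if state.tables_done.len() == n_sssr_tables {
        // All input tables delivered results for this query: annotate it (directly in
        // "query sequence annotation" mode, or via its GeneFamily in family mode), then
        // drop its data from the in-memory database.
        annotated.insert(query_id.to_string());
        states.remove(query_id);
    }
    Ok(())
}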

Rerun evaluations

This task depends on the successful completion of Issue #44. Please do not start it before #44 has been completed.

We need to rerun prot-scriber in its latest version (v0.1.5 or later) and rerun the R based evaluation scripts.
To be clear: we have a newer prot-scriber version and need to recreate the human readable descriptions (HRDs), i.e. the prot-scriber output. Next, we also have a new version of the evaluation and need to rerun it on the latest prot-scriber output.

The prot-scriber datasets that need reannotation and reevaluation are:

  • Phaseolus coccineus (Pcoccineus)
  • Faba - Two Faba variants
  • MetaEuk

Gene Families

  • Pcoccineus has new gene families
  • Faba has gene families, too. But they have been clustered from too few species.

Tasks:

  • Identify the location of all prot-scriber input data and verify it with the team
  • Rerun prot-scriber (master branch)
  • Rerun evaluation exec/measurePerformance.R (evaluation_R branch)
  • Plot evaluation results exec/visualize_p_coccineus_results.R (evaluation_R branch)

Note that the last script needs to be copied and adapted for each data-set.

Check for empty input vectors and hashmaps in functions of module `generate_hrd_associated_funcs.rs`

Both test runs (on gene families and single proteins) raised the same error:

thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', src/generate_hrd_associated_funcs.rs:303:30

Check the module and first introduce tests that reproduce the error, e.g. by using an empty HashMap as input argument for the function predicted_hrd; then fix the error.

Make sure all functions in the module are capable of handling the edge case of empty input arguments, or empty intermediate results.
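
A minimal sketch of the kind of guard the functions should apply (illustrative names; the concrete function in generate_hrd_associated_funcs.rs may differ): return None for empty input instead of indexing into an empty vector, which caused the reported panic.

use std::cmp::Ordering;

fn highest_scoring_phrase(scored_phrases: &[(String, f64)]) -> Option<&String> {
    scored_phrases
        .iter()
        // max_by returns None for an empty slice, so no unchecked indexing is needed:
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or(Ordering::Equal))
        .map(|(phrase, _)| phrase)
}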

Write a parser for input files holding biological sequence families

prot-scriber has the option to annotate families, i.e. mathematical sets, of biological sequences, so-called "Gene Families". These are read as input in main.rs; see option seq-families for details.

The expected format of such an optional input file is:

<family-name> TAB gene1,gene2,gene2

Note that above TAB stands for a single tab character, i.e. \t. The white-spaces before and after it are just there for better readability.

Write a simple parser function that consumes such a file, given as a String instance holding the path to the file, and reads it line by line. Each line will adhere to the above format and represent a gene family. Take inspiration from seq_sim_table_reader::parse_table. Consider the following code sketch:

use crate::annotation_process::AnnotationProcess; // module path assumed
use crate::seq_family::SeqFamily;
use std::{fs::File, io::{BufRead, BufReader}};

pub fn parse_seq_families_file(path: String, annotation_process: &mut AnnotationProcess) {
    // Read the file line by line; each line adheres to "<family-name>\tgene1,gene2,...":
    let reader = BufReader::new(File::open(&path).expect("could not open seq-families file"));
    for line in reader.lines().map(|l| l.expect("could not read line")) {
        // Each line is parsed into two variables:
        let mut cols = line.splitn(2, '\t');
        // the family name:
        let seq_fam_name: String = cols.next().unwrap_or("").trim().to_string();
        // the genes of the family (assumes SeqFamily has a new() constructor and a query_ids field):
        let mut seq_fam_instance = SeqFamily::new();
        seq_fam_instance.query_ids = cols.next().unwrap_or("").split(',').map(|g| g.trim().to_string()).collect();
        // Add the family to the argument annotation_process:
        annotation_process.insert_seq_family(seq_fam_name, seq_fam_instance);
    }
}

This new function should be called very early in main(). See the following snippet taken from there:

    // Add biological sequence families information, if provided as input by the user:
    if let Some(seq_families) = matches.value_of("seq-families") {
        // TODO: Parse gene families
        // parse_seq_families_file( seq_families, &mut annotation_process );
    }

Annotation typo

I believe I found another typo in one of your annotations. I used the CDS for the barley gene horvu.morex.r3.2hg0178420. For this I receive the annotation "dnaj subfamily b memer". I think it should say "member" at the end. I have attached the sequence file that yielded this result.

HORVU.MOREX.r3.4HG0178420.txt

Optimisation of generated HRDs

The assignment of HRDs for the following sequences showed room for improvement:

  • mercator4v5.0: dihydopyrimidine aminohydrolase & prot-scriber: amidohydro rel domain containing protein & swissprot: no annotation & original description: Dihydropyrimidinase (AHRD V3.3 *** A0A1U8EK03_CAPAN)
  • Solyc05g010420.2.1 ITAG 4.1
  • mercator4v5.0: S-adenosyl methionine decarboxylase & prot-scriber: s adenoylmethionine decarboxylae proenzyme & swissprot: no annotation & original description: S-adenosylmethionine decarboxylase 2 AND mercator4v5.0: EC_4.1 carbon-carbon lyase & prot-scriber: s adenoylmethionine decarboxylae proenzyme & swissprot: no annotation & original description: S-adenosylmethionine decarboxylase 2
  • Solyc03g118750.4.1
  • mercator4v5.0: phospho-base N-methyltransferase & prot-scriber: phosphoethanolamine n methyltrasferase & swissprot: no annotation & original description: phosphoethanolamine N-methyltransferase AND mercator4v5.0: EC_2.1 transferase transferring one-carbon group & prot-scriber: phosphoethanolamine n methyltrasferase & swissprot: no annotation & original description: phosphoethanolamine N-methyltransferase
  • mercator4v5.0: component *(PsaH) of PS-I complex & prot-scriber: photosystem i reacton center subunt v chloroplastc & swissprot: no annotation & original description: Photosystem I reaction center subunit VI, chloroplastic (AHRD V3.3 *** A0A2G2WLH8_CAPBA)
  • Solyc09g011520.3.1
  • ARATH should be removed ("ARATH raus")
  • mercator4v5.0: class tau glutathione S-transferase & prot-scriber: arath glutathione s tranferae & swissprot: no annotation & original description: Glutathione S-transferase-like protein (AHRD V3.3 *** A8DUB0_SOLLC)

For now, only proteins with a Solyc...g... identifier

Reference data can be found here: /mnt/data/asis/prot-scriber/evaluation/Solanum_lycopersicum_ITAG_4.0

Before running prot-scriber, the UniProt databanks in /mnt/data/asis/UniProt should be refreshed.

prot-scriber will be run twice, to compare the HRDs obtained with Swissprot alone and with Swissprot and trEMBL together.

Add information on how to generate gene families to the `.about` in `main.rs`

We should very briefly explain how to generate gene families, as follows:

  1. Run Blast all vs all and make sure the column pident is written to the output table
  2. Prepare markov clustering from the result table of step 1 using percent sequence identity
  3. Run mcl on the result of step 2
  4. Use result of step 3 as input for prot-scriber

Before you add this description to the .about in module main.rs, make sure that:

  1. You have tested the described procedure on some example protein data (fasta)
  2. You follow the general format the rest of the about is written in

Test different performance calculation

Currently we estimate the performance, in terms of prediction quality, of prot-scriber generated human readable descriptions (HRDs) like this:

Set A
Descriptions from Blast hits, i.e. descriptions of proteins from UniProtKB databases like Swissprot or trEMBL:

> alcohol dehydrogenase
> malate dehydrogenase
> unknown protein
> Geraniol dehydrogenase

Set B
Descriptions found using Hidden Markov Models (HMMs), i.e. Mercator (Mapman Bin Ontology) or HMMER3 on Pfam-A (InterPro conserved protein families or protein domains):

  • dehydrogenase domain
  • alcohol binding domain
  • family of geraniol oxidizing proteins

Algorithm:

  • For each set, hack each description into words.
  • Union all words into a single word set for input set A or B, respectively.
  • Output: word_set_A and word_set_B.
  • Estimate true positives, true negatives, false positives, and false negatives (see this sketch).

The problem is that the vocabulary used in Pfam-A and the Mapman Bin Ontology is quite different from that of the Blast Hit descriptions. The evaluation therefore expects the generated HRDs to contain words that do not occur in the source descriptions, i.e. the descriptions of the Blast Hits.

In the above example the word "binding" could only be scored as a true positive if prot-scriber generated it, but prot-scriber can never generate it, because it does not appear in any Blast Hit description; instead it is counted as a false negative. That means prot-scriber, and Best Blast, can in many cases never reach the maximum score of one (1.0).

Note that the scores are the F-score or Matthews Correlation Coefficient, calculated from true positives, true negatives, false positives, and false negatives.

We could try to never count words as false negatives if they do not appear in any Blast description but do appear in Pfam-A and/or Mapman Bin descriptions.
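
To make the idea concrete, here is a sketch in Rust (the evaluation code itself is R; function and argument names are illustrative) of such a "fair" false-negative count:

use std::collections::HashSet;

/// Reference words that were not predicted, excluding words that never appear in any
/// Blast Hit description and thus could never be predicted by prot-scriber.
fn fair_false_negatives(
    predicted_words: &HashSet<String>,
    reference_words: &HashSet<String>,
    blast_description_words: &HashSet<String>,
) -> usize {
    reference_words
        .iter()
        .filter(|w| !predicted_words.contains(*w) && blast_description_words.contains(*w))
        .count()
}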

To Dos

  • Create a new branch called evaluation_R_fair_false_negs
  • Have a look at the file evaluation.R and find the places where the rate or number of false negatives is calculated
  • Introduce a new function called falseNegatives that accepts the necessary arguments and calculates the "fair" number of false negatives
    • This excludes false negatives that do not appear in any Blast Hit description
  • Write another function that tests your new function for correctness (unit test)
  • Read the measurePerformance.R script to see in which order the functions are called and how the reference and prediction word sets are generated. Ask in Slack for help. Note the argument blacklist.regexs in function wordSet in source file funks.R
    • Get together with Mr Ender to understand how the file ./R/zzz.R and the RData files generate the above default argument.

Generate reference sets

Implement functions that extract reference human readable descriptions from predictions obtained from Mercator4 (MapMan ontology) and HMMER3 against PfamA. Split these reference annotations into word sets. Make them available as RData in our research R package.

Manually check the "universe" word list generated from Pfam-A and Mercator (Mapman Bin) reference annotations

Improve the performance assessment of prot-scriber generated Human Readable Descriptions (HRDs)

So far we have identified the following list of candidate words that should not be part of the reference "universe" used in the identification of true and false positives and negatives:

"factor"
"component"
"complex"
"targeting"
"processing"
"function"
"segment"
"reaction"
"terminus"
"terminal"
"superfamily"
"the"
"type"
"other"
"motif"

We should run the step of generating universes and manually check the lists for more potential "meaningless" candidates.

Method

Describe how to generate the above list from Pfam-A and Mercator (Mapman Bin) annotations using code in the branch "evaluation_R" (or sub-branches)

Tips: Use these R functions (branch evaluation_R_exclude_35_50, script ./R/funks.R) to get the above done:

  • parseHmmer3Tblout reads in the results (output tables) of HMMER3 searches against the Pfam-A database
  • parseMercator4Tblout reads in Mercator output tables (Mapman Bin annotations)
  • filterMercator4Table filters out Mapman Bins starting with the root Bins 35 and 50
  • referenceWordListFromPfamAAnnos generates reference word lists from Pfam-A annotations
  • referenceWordListFromMercator4Annos generates reference word lists from Mapman Bin annotations
  • note that the above referenceWordList... functions call the function wordSet. That function has an argument blacklist.regexs, a vector of regular expressions used to detect words that should be excluded from the final reference list. This list is loaded from the data ./data/regexLists.RData (see the respective argument of wordSet)

Current result file has been generated. Please link path to it here.

Rerun the above code and manually check the lists

Pass the new list to the team for revision.

After revision, extend the blacklist.word.regexs object in the above data ./data/regexLists.RData accordingly.

Generate regular expression for parsing UniRef

UniRef is a protein reference database from UniProtKB. There are two versions, UniRef50 and UniRef90, in which redundancy has been removed by excluding sequences with significant mutual similarity (clustering at 50% and 90% sequence identity, respectively).

The format of UniRef Fasta headers is different to those of Swissprot and trEMBL.

Please provide a file with Rust syntax regular expressions. This file will be used as the -l, --filter-regexs <filter-regexs> argument to prot-scriber.
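
Such a file presumably lists one Rust-syntax regular expression per line. As a hypothetical sketch (not the official filter list), assuming UniRef stitle fields of the form "UniRef90_<accession> <description> n=<N> Tax=<taxon> TaxID=<id> RepID=<id>", the patterns could look like this:

use regex::Regex;

fn uniref_filter_regexs() -> Vec<Regex> {
    vec![
        // strip the leading cluster identifier, e.g. "UniRef90_A0A021 ":
        Regex::new(r"^UniRef\d+_\S+\s+").unwrap(),
        // strip the trailing cluster-size and taxonomy fields, e.g. " n=2 Tax=... RepID=...":
        Regex::new(r"\s+n=\d+.*$").unwrap(),
    ]
}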

When the `generate_human_readable_description` function is implemented, connect it with the existing code

Once the function that generates human readable descriptions (HRDs) from an input vector of candidate descriptions is implemented according to this spec, it needs to be connected with the already existing code (a sketch of the wiring is given after the two snippets below).

In order to do this please implement the TODOs in the two respective annotate functions in the seq_family.rs and query.rs modules:

query.rs

    /// Generates and returns a human readable description (`String`) for this biological query
    /// sequence.
    ///
    /// # Arguments
    ///
    /// * `&self` - A reference to self, this instance of Query
    pub fn annotate(&self) -> String {
        // TODO: Gather all self.hits description fields and pass them to
        // `generate_human_readable_description`.
        (*UNKNOWN_PROTEIN_DESCRIPTION).to_string()
    }

seq_family.rs

    /// Generates and returns a human readable description (`String`) for this set (family) of
    /// biological query sequences.
    ///
    /// # Arguments
    ///
    /// * `&self` - A reference to self, this instance of SeqFamily
    /// * `queries: &HashMap<String, Query>` - A constant reference to the in memory database of
    /// `Query` instances. This is used to extract the `Hit.description`s from.
    pub fn annotate(&self, queries: &HashMap<String, Query>) -> String {
        let mut candidate_descriptions: Vec<String> = vec![];
        // Gather all candidate descriptions of all queries belonging to this sequence family. This
        // means collecting all queries' hit-descriptions:
        for qid in self.query_ids.iter() {
            for (_, hit) in &queries.get(qid).unwrap().hits {
                candidate_descriptions.push(hit.description.clone());
            }
        }
        // TODO: Invoke `generate_human_readable_description(candidate_descriptions, ...)`
        (*UNKNOWN_FAMILY_DESCRIPTION).to_string()
    }
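
For orientation, a hedged sketch of how the body of the annotate method in query.rs could look once wired up; the exact signature and return type of generate_human_readable_description are assumptions here:

    pub fn annotate(&self) -> String {
        // Gather all self.hits description fields:
        let candidate_descriptions: Vec<String> = self
            .hits
            .values()
            .map(|hit| hit.description.clone())
            .collect();
        if candidate_descriptions.is_empty() {
            (*UNKNOWN_PROTEIN_DESCRIPTION).to_string()
        } else {
            generate_human_readable_description(&candidate_descriptions)
        }
    }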

Wrong annotations in prot-scriber

I was again looking through the mercator annotation file and found some more misspelled prot-scriber annotations, in addition to the previous ones. So far, I have found the following:

wrong annotation → right annotation?
domai cotaiig protei → domain containing protein
trasferase → transferase
zinc finger ran binding → zinc finger rna binding
ran binding protein m hoolog → rna binding protein m homolog
ontaining → containing

It would be good to correct these errors so that users don't have to go and look for them by themselves.

Clean up and improve performance of `generate_HRD_associated_funcs.rs`

The module generate_HRD_associated_funcs.rs is functional.

However, there are a few minor issues that should be addressed:

  • all functions must have Rust-doc style code comments; generateHumanReadableDescription and mean, for example, still do not have them
  • in the line let mut word_frequencies_map = frequencies(universe.clone()); the word frequencies are calculated for all words, which is not necessary. Instead, calculate the frequencies only for the informative words.
  • After classifying words into "informative" and "not informative", a functional iterator retains only the informative words. Refactor the function word_classification_map to avoid this double loop over all words: iterate once and use retain to keep a word only if it is informative (see the sketch after this list).
  • Set up your editor to always run rustfmt when saving a .rs file. At least format this module.
  • Check whether maximum efficiency has been applied in all cases. Avoid multiple loops, as e.g. above in the word classification.
  • Please answer why the main argument universe_hrd_words: Vec<&str> is a vector of &str and not a vector of String. I believe the parsing of the Blast input tables returns String vectors (which might not be optimal, by the way...).
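
A sketch of the suggested single pass (illustrative names): classify and retain the informative words in one iteration instead of building a full classification map and filtering it afterwards.

use regex::Regex;

fn retain_informative_words(mut universe: Vec<String>, filter_regexs: &[Regex]) -> Vec<String> {
    // retain() visits each word exactly once and drops the non-informative ones in place:
    universe.retain(|w| !filter_regexs.iter().any(|r| r.is_match(w)));
    universe
}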

Aim at minimum length of HRDs

If the best scoring HRD is shorter than 3 characters, choose the next best HRD of length >= 3 characters, if such an HRD exists.
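
A minimal sketch (illustrative names), assuming the candidate HRDs are already sorted by descending score:

fn pick_hrd(ranked_hrds: &[String]) -> Option<&String> {
    ranked_hrds
        .iter()
        // prefer the best-scoring HRD with at least 3 characters ...
        .find(|hrd| hrd.chars().count() >= 3)
        // ... and fall back to the overall best one if no such HRD exists:
        .or_else(|| ranked_hrds.first())
}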
