cath-tools's Introduction

CATH Tools

Overview

Protein structure comparison tools such as SSAP, as used by the Orengo Group in curating CATH.

Executable downloads (for Linux/Mac; chmod them to be executable)
Docs
Code
Extras repo

Tools

cath-cluster
Complete-linkage cluster arbitrary data.
cath-map-clusters
Map names from previous clusters to new clusters based on (the overlaps between) their members (which may be specified as regions within a parent sequence). Renumber any clusters with no equivalents.
cath-resolve-hits
Collapse a list of domain matches to your query sequence(s) down to the non-overlapping subset (i.e. domain architecture) that maximises the sum of the hits' scores.
cath-ssap
Structurally align a pair of proteins.
cath-superpose
Superpose two or more protein structures using an existing alignment.
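
To illustrate what cath-resolve-hits optimises, here is a minimal sketch of the underlying idea: weighted interval scheduling solved by dynamic programming. It assumes every hit is a single contiguous segment with inclusive start/stop positions. The function name resolve_hits and the tuple layout are hypothetical, and this is not the tool's own implementation.

```python
from bisect import bisect_left

def resolve_hits(hits):
    """Pick the non-overlapping subset of hits with the highest total score.

    hits: list of (name, start, stop, score) with inclusive residue ranges.
    Returns (best_total_score, chosen_hits_in_sequence_order).
    """
    # Sort by stop so each hit only needs to look back at earlier-ending hits
    ordered = sorted(hits, key=lambda h: h[2])
    stops = [h[2] for h in ordered]

    # best[i] = (score, chosen hits) using only the first i hits
    best = [(0.0, [])]
    for i, (name, start, stop, score) in enumerate(ordered):
        # Number of earlier hits that end strictly before this one starts
        k = bisect_left(stops, start, 0, i)
        with_hit = (best[k][0] + score, best[k][1] + [ordered[i]])
        without_hit = best[i]
        best.append(with_hit if with_hit[0] > without_hit[0] else without_hit)
    return best[-1]
```

For three hits where the middle one overlaps both neighbours, the sketch keeps the two outer hits whenever their combined score beats the middle hit on its own.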

Extra Tools

  • build-test: Perform the cath-tools tests (which should all pass, albeit with a few warnings)
  • cath-assign-domains: Use an SVM model on SSAP+PRC data to form a plan for assigning the domains to CATH superfamilies/folds
  • cath-refine-align: Iteratively refine an existing alignment by attempting to optimise SSAP score
  • cath-score-align: Score an existing alignment using structural data

Authors

The SSAP algorithm (cath-ssap) was devised by Christine A Orengo and William R Taylor.

Please cite: Protein Structure Alignment, Taylor and Orengo, Journal of Molecular Biology 208, 1-22, PMID: 2769748. (PubMed, Elsevier)

Since then, many people have contributed to this code, most notably:

Acknowledgements

cath-ssap typically uses DSSP, either by reading DSSP files or via its own implementation of the DSSP algorithms.

cath-cluster uses Fionn Murtagh's reciprocal-nearest-neighbour algorithm (see Multidimensional clustering algorithms, volume 4 of Compstat Lectures. Physica-Verlag, Würzburg/ Wien, 1985. ISBN 3-7051-0008-4) as described and refined in Daniel Müllner's Modern hierarchical, agglomerative clustering algorithms (2011, arXiv:1109.2378).
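
For readers unfamiliar with complete-linkage clustering, the following is a deliberately naive sketch of the criterion itself. It is not the reciprocal-nearest-neighbour algorithm that cath-cluster uses, and it is far slower; the function name, the dict-of-pairs input and the max_dissim cutoff are all assumptions made for illustration.

```python
def complete_linkage(dissim, max_dissim):
    """Naive complete-linkage clustering.

    dissim: dict mapping frozenset({a, b}) -> dissimilarity for every pair of items.
    max_dissim: stop merging once the closest pair of clusters is further apart than this.
    Returns a sorted list of clusters, each a sorted list of items.
    """
    items = sorted({item for pair in dissim for item in pair})
    clusters = [[item] for item in items]

    def cluster_dist(c1, c2):
        # Complete linkage: the largest member-to-member dissimilarity
        return max(dissim[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters under the complete-linkage distance
        d, i, j = min(
            (cluster_dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_dissim:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return sorted(clusters)
```

Because the distance between two clusters is the worst pairwise distance, an item joins a cluster only if it is within the cutoff of every member, not just the nearest one.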

Feedback

Please tell us about your cath-tools bugs/suggestions here.

If you find this software useful, please spread the word and star the GitHub repo.

cath-tools's People

Contributors

anadon, sillitoe, tonyelewis

cath-tools's Issues

Strategy for domain-based superposition of full PDB structures (with ligands)

@toluadeyelu had a query that I said I would add here for future documentation.

He has generated a multiple structure superposition of CATH domains.

He would like to add ligands and binding sites back into the structure.

There may be a more elegant solution in the pipeline (e.g. #3), however my suggested approach in the meantime was something like the following:

  • generate the multiple superposition for each domain
  • output the json file describing the translation/rotation operations for each structure (see --sup-to-json-file)
  • get cath-superpose to apply the same operations to the full PDB structure (rather than the domain atoms)
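
The last step above amounts to applying one rigid-body transformation per structure to every atom, ligands included. A minimal sketch follows. The layout of the file written by --sup-to-json-file is not described here, so the function simply takes a translation vector and a 3x3 rotation matrix, and the order of operations (translate, then rotate) is an assumption that should be checked against the real output.

```python
def apply_transformation(coords, translation, rotation):
    """Apply one structure's rigid-body transformation to a list of (x, y, z) coords.

    Assumption: the translation is applied first, then the 3x3 rotation
    (given as three rows, multiplying column vectors). Check this against
    the real JSON output before relying on it.
    """
    moved = []
    for x, y, z in coords:
        tx, ty, tz = x + translation[0], y + translation[1], z + translation[2]
        moved.append(tuple(
            row[0] * tx + row[1] * ty + row[2] * tz
            for row in rotation
        ))
    return moved
```

Running the same function over the domain atoms and over the ligand atoms of the same PDB entry keeps both in the same frame of reference, which is the point of the suggested approach.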

Does that sound reasonable?

(edit: removed the mentions of 'foreach' for clarity)

[dependency: v14.2 onwards] {cath-superpose.ubuntu14.04} libstdc++.so.6: version `GLIBCXX_3.4.22' not found

Hi there,

I am having a problem getting "cath-resolve-hits.ubuntu14.04" to run on my Ubuntu system.

System: Ubuntu 16.04LTS
apt version: apt 1.2.18 (amd64)

Running

./cath-resolve-hits.ubuntu14.04 

downloaded from versions newer than v0.14.2 raises an exception saying:

./cath-resolve-hits.ubuntu14.04: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by ./cath-resolve-hits.ubuntu14.04)

I tried updating "libstdc++6" (along with "libstdc++6-4.7-dev" and "libstdc++6-5-dbg") without useful results:

sudo apt install libstdc++6
libstdc++6 is already the newest version (5.4.0-6ubuntu1~16.04.4).
libstdc++6-4.7-dev is already the newest version (4.7.4-3ubuntu12).
libstdc++6-5-dbg is already the newest version (5.4.0-6ubuntu1~16.04.4).

Although the legacy version from v0.14.1 works fine without this dependency, I am still wondering how I can install this dependency (for v0.14.2 onwards) on ubuntu quickly?

Kind regards
Feng
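
One way to confirm the diagnosis is to list the GLIBCXX version tags that the installed libstdc++.so.6 actually advertises. GLIBCXX_3.4.22 corresponds to the libstdc++ shipped with GCC 6, which would explain why the 5.4.0 package above does not provide it. The sketch below is a rough equivalent of running strings on the library and grepping for GLIBCXX; the function names are hypothetical.

```python
import re

def glibcxx_versions(library_bytes):
    """Return the sorted GLIBCXX_* version tags found in a libstdc++ binary's bytes."""
    tags = set(re.findall(rb"GLIBCXX_(\d+(?:\.\d+)+)", library_bytes))
    # Sort numerically so that 3.4.9 comes before 3.4.21
    return sorted(
        (tag.decode() for tag in tags),
        key=lambda v: tuple(int(part) for part in v.split(".")),
    )

def provides(library_bytes, required):
    """True if the library advertises the required version, e.g. '3.4.22'."""
    return required in glibcxx_versions(library_bytes)
```

Usage would be along the lines of reading /usr/lib/x86_64-linux-gnu/libstdc++.so.6 in binary mode and calling provides(data, "3.4.22") on the result.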

cath-ssap segfaults on chain 0 of 1br7

The following segfaults:

cath-ssap --prot-src-files PDB 1br7 1zdn --align-regions 'D[1br70]:0' --align-regions 'D[1zdnA]:A'

This is motivated by one of 31 failures, all of which involve 'D[1br7003]430-546:0' and some other different domain.

Exception: Angle value must be finite

When running:

./cath-ssap --dssp-path /cath/data/current/dssp/ --pdb-path /cath/data/current/pdb --sec-path /cath/data/current/sec 2zbjA00 1cn1A00

Returns a few warnings about overlaps in the SS assignments, then the exception "Angle value must be finite":

Note: This was tested on the latest executable (Compile time: Mar 14 2016 17:59:17). I tested this on a previous executable (Compile time: Jan 20 2015 16:17:35) - it reports the warnings, but doesn't throw an exception.

Full stack trace:

2016-03-15 11:22:38.304321 [cath-ssap|warning] Secondary structure starts at residue number 92, which overlaps with the end of the previous secondary structure at residue number 92 for protein 2zbjA00 - will use the previous secondary structure to label residue(s) within overlapping region.
2016-03-15 11:22:38.304370 [cath-ssap|warning] Secondary structure starts at residue number 207, which overlaps with the end of the previous secondary structure at residue number 207 for protein 2zbjA00 - will use the previous secondary structure to label residue(s) within overlapping region.
Whilst running program ./cath-ssap (via a program_exception_wrapper with typeid: "N4cath30ssap_program_exception_wrapperE"), caught a boost::exception:
../source/structure/geometry/angle.h(134): Throw in function cath::geom::angle<double>::angle(const angle_type &) [T = double]
Dynamic exception type: boost::exception_detail::clone_impl<cath::common::invalid_argument_exception>
std::exception::what: Angle value must be finite

Add new summary-output option to CRH

Add this option to the CRH output options block:

  • summary-output Output a brief text summary of the input data (rather than processing it)

Jon and I think this option would be useful.

Consider adding MDA information to cath-resolve-hits

Jon says:

One thing that might be useful for researchers in general, could be a summary of the different MDAs with counts ?
or a ranked list of how often different models are used in the final set of resolved MDAs.
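
As a rough idea of what such a summary could compute, here is a small sketch that counts multi-domain architectures (MDAs) and how often each model appears in the resolved sets. The input layout (query ID mapped to a list of model/start pairs) and the "-" separator in the MDA string are assumptions for illustration, not an existing CRH format.

```python
from collections import Counter

def mda_counts(resolved):
    """Count multi-domain architectures (MDAs) and how often each model is used.

    resolved: dict mapping query ID -> list of (model_id, start) for its resolved hits.
    Returns (Counter of MDA strings, Counter of model IDs).
    """
    mdas = Counter()
    models = Counter()
    for hits in resolved.values():
        # Order the domains along the sequence to form the architecture
        ordered = [model for model, _start in sorted(hits, key=lambda h: h[1])]
        mdas["-".join(ordered)] += 1
        models.update(ordered)
    return mdas, models
```

Calling most_common() on either counter then gives the ranked lists that Jon describes.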

Resolve conflict of CRH hidden options that are in Gene3D docs

The Gene3D docs in gene3d_hmmsearch/README_scan.txt of ftp://orengoftp.biochem.ucl.ac.uk/gene3d/CURRENT_RELEASE/gene3d_hmmsearch.tar.gz now include a CRH command line:

./cath-resolve-hits --min-dc-hmm-coverage=80 --output-hmmsearch-aln --input-format  hmmsearch_out seqs.hmmsearch > seqs.crh

...which includes currently hidden / non-public CRH options. Does this mean that the following now need to be made public?

  • --min-dc-hmm-coverage
  • --min-hmm-coverage
  • --output-hmmer-aln

Add a colourer for secondary-structure (and accessibility)

It would often be useful to colour by secondary structure. Now that this code is able to do its own secondary structure calculations, this can be more consistently applied. Some work may be required to work out how to handle if/when to do the necessary secondary-structure calculations.

Also, this would require a change to the options, which currently just offer --gradient-colour-alignment.

Along the same lines, it may be useful to be able to colour by accessibility.

The idea of a secondary-structure coloured alignment (--aln-to-html-file) relates to the 2DSEC issues in #29.

Run structural comparisons within cath-superpose

Currently, superposing structures with cath-superpose requires that the user has already run the matrix of structure comparisons (SSAP). This is handled by the wrapper script cath-superpose-multi-temp-script; however, it would be nice if cath-superpose could run its own SSAPs (where required).

CRH fails with "Cannot resolve_boundary for mis-ordered data" on sensible-looking data

Jon has found (what looks very much to be) an error with cath-resolve-hits. I've been able to reduce the input data required to reproduce the error message from Jon's ~4.8Gb file (!) down to the attached ~4.1Kb file.

The error can be reproduced with:

cath-resolve-hits --input-format hmmsearch_out jon_problem.20170308.hmmsearch.txt

...which gives an error:

2017-03-08 12:34:39.954581 [cath-resolve-hits|error  ] Unable to parse/process resolve-hits input data file "jon_problem.20170308.hmmsearch.txt" of format hmmsearch_out. Error was:
Cannot resolve_boundary for mis-ordered data

jon_problem.20170308.hmmsearch.txt

--html-output makes cath-resolve-hits reject previously accepted data

From Jon:

Just started to look at some issues with the old HMMs that are fixed in the new, using the new HTML output option :)

However, if I add the flag --html-output it crashes (though it works fine if I don't have that flag)

./cath-resolve-hits --input-format hmmsearch_out /cath/homes2/ucbcjle/zn56_human.fa.out.hmmsearch --output-file test.txt --html-output

2016-12-18 14:16:59.074481 [cath-resolve-hits|error ] Unable to parse/process resolve-hits input data file "zn56_human.fa.out.hmmsearch" of format hmmsearch_out. Error was:
Hit's score must be strictly greater than 0 else the algorithm doesn't work (because there's no way to no how to trade scores off against empty space)

Some pairs fail under cath-ssap with PDB_DSSP that work with PDB_DSSP_SEC

Note that these don't work with --prot-src-files of PDB_DSSP either.

Examples where both proteins have ≥ 30 residues :

1dleB02  2hntE00  142   67  39.85   55   38    7  11.26
2hntE00  4lk4A01   67  125  45.83   44   35    1  10.01
1l1jA01  2hntE00  118   67  35.48   59   50    1  14.44
2hntE00  3gdvC01   67  116  49.33   51   43    7  13.62
2hntE00  3tloA02   67   99  27.21   45   45    1  11.12
3wcyA01  4jqiL02   53   76  51.25   21   27    0   9.71
1hdlA00  4sgbI00   55   51  40.26   27   49    3  11.60
1ktkF02  1smoB00   47  110  57.71   46   41    6   7.62
1ktkF02  2dm3A00   47  110  42.51   26   23    0   4.93
1ktkF02  3irzA03   47   99  46.07   42   42    6   6.73
2hntE00  4fvdA01   67   42  52.51   32   47    7  10.35
3tbxB00  3wcyA01   41   53  57.29   26   49    2   8.93
2zuxA01  3tvmE02   88   32  55.38   30   34   12   4.95
1dx5I02  1yukB02   33   31  70.54   18   54   12   5.90

Add options to include/remove ligands

It would be useful if the user could specify which ligands to include/exclude in the superposed structure. Perhaps this could start as defaulting to something sensible, e.g. include any biologically relevant ligands that are close to the structural region of interest (see #1), then adding the ability to modify those selections later on.

Unexpected superposition

When running:

$ SSAP $1 $2
$ cath-superpose --ssap-aln-infile="$1$2.list" --pdb-infile="$DOMDIR/$1" --pdb-infile="$DOMDIR/$2" --sup-to-pymol-file="$1_$2.pml"
$ pymol $1_$2.pml

where $DOMDIR=/cath/data/current/pdb, $1=2lf0A01 and $2=1lrzA03

The structural comparison with SSAP reported: a good SSAP score (78.25), good SSAP overlap (88%), and a poor RMSD (8.73A).

The issue:

Upon loading the .pml file with pymol, the two domains are superposed at a strange angle and show very little overlap, despite the very good overlap value calculated.

(Please note that I've added a .pdb.txt suffix to the domain PDB files as it would not support the filetype of the original file)

1lrzA03.pdb.txt
2lf0A01.pdb.txt

Add options to cath-ssap for specifying regions to be aligned

It'd be useful if cath-ssap allowed users to specify sub-regions to be aligned within the structures.

It'd also be useful if the hacky script cath-superpose-multi-temp-script provided a simple way to specify regions.

This follows on from #1, which covered equivalent functionality in cath-superpose.

use release tags?

README currently provides links for the binary downloads.

The particular build used in these links is currently hard-coded. It happens to be a few versions back from 'current' (e.g. 171/171.1 rather than 175/175.1)

https://cath-tools.s3.amazonaws.com/UCLOrengoGroup/cath-tools/171/171.1/release_build/cath-superpose

Figured it might be time to make an official release so we can use that tag to link to a stable build, e.g.

https://github.com/UCLOrengoGroup/cath-tools/releases

?

Add max num query IDs option to CRH

Add this option to the CRH hit filtering block:

  • max-num_query-ids <num> (=1) Restrict processing to the first <num> query IDs

Use Boost program_options' implicit_value for the =1

Jon and I think this option would be useful.

Add regression test for overall performance (ROC)

It would be useful to have a regression test to check whether a code change has had an adverse effect on the algorithm's performance.

Presumably this test would take a long time to run so wouldn't be part of the regular build. Also, it may require some manual analysis (e.g. comparing ROC graphs).

Has this already been set up? If so, this might just be a case of adding some internal docs (point me in the right direction and I'll write it up).

Some cath-ssaps that work with --prot-src-files of PDB_DSSP, fail with PDB

Examples where both proteins have ≥ 30 residues :

2hntE00  4jniU01   67  128   1.68   35   27    2  12.85
3wcyA01  5faaA00   53  123  28.99   35   28    1  13.40
2hntE00  2qa9E02   67   95  35.98   58   61    2  13.74
2kl7A00  2wfxB02   71   80  53.21   44   55    5  12.97
3wcyA01  4h5sA00   53   96  35.51   43   44    5  14.97
1ktkF02  4lk4A03   47  105  13.36   31   29    4  15.32
1ktkF02  2qsvA02   47  105  52.89   43   40    2  10.93
1olzA03  3wcyA01   89   53  13.17   35   39    3  14.43
1ktkF02  2cspA01   47   98  49.98   32   32    4   5.83
1ktkF02  3s98A03   47   92  43.97   32   34    4   8.12
1kloA01  2kl7A00   53   71  39.80   33   46   13  10.33
3tvmE02  4ww1A01   32  108  49.63   18   16    0   1.84
3d85D01  3tvmE02   84   32  48.33   23   27    0   5.19
1kliL00  2kb9A00   61   44  49.38   37   60    6   8.28
3lohE06  3tbxB00   65   41  32.33   22   33    2   5.75
1kloA02  3k9xC02   56   44  52.58   32   57   11   8.80
1oxxK03  2nn6I01   45   51  59.88   19   37    4   5.42
1u5mA01  4b2rA02   36   63  71.72   36   57   11   3.15
1kloA03  3k6sH04   53   38  45.50   31   58    2   8.41
1kloA03  1yukB02   53   31  34.86   25   47   12   9.17

tally_residue_names() doesn't handle multiple chains

tally_residue_names() looks for a unique list of residue names but doesn't take chains into account so throws on PDBs with multiple chains that happen to have overlapping residue names.

Add failing test case and then fix.

Example PDB: 4uwe
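
A failing test case could be built around the following idea: the same residue name appearing on two different chains should not count as a duplicate. This is a sketch of that check only; the function name and the (chain, residue name) input layout are hypothetical and do not reflect how tally_residue_names() is written.

```python
from collections import Counter

def duplicated_residue_names(residues, use_chain):
    """Return the residue identifiers that occur more than once.

    residues: list of (chain_label, residue_name) pairs, e.g. ("A", "54").
    use_chain: if False, mimic the reported behaviour by ignoring the chain.
    """
    keys = [(chain, name) if use_chain else name for chain, name in residues]
    return sorted(key for key, count in Counter(keys).items() if count > 1)
```

With use_chain set to False, residue "1" on chain A and residue "1" on chain B collide; with it set to True, they are treated as distinct.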

Accessibility problems prevent cath-ssap aligning two similar structures

Natalie has found an interesting cath-ssap failure on 4tp8U01 versus 2vhpU01. The program runs but generates zero scores despite the pair being very similar:

4tp8U01  2vhpU01    0    0   0.00    0    0    0   0.00

Running under debug mode (--debug) reveals:

  • The algorithm goes straight into a slow_ssap, which is reasonable given that 2vhpU01.sec only lists one secondary structure.
  • When "populating upper_score_matrix", it compares "0 residue pairs out of a possible 1509" and then saves zero scores and stops.

Rummaging in the code reveals that :

  • The reason it hasn't compared any residue pairs is that they've previously all been rejected by residues_have_similar_area_angle_props()
  • residues_have_similar_area_angle_props() has issues (including that the comments currently say there are two problems and then list three).
  • residues_have_similar_area_angle_props() compares whether the sum of various property differences is below some threshold. Except that one of the properties, accessibility, is summed rather than differenced. It's clear that I have previously noticed this as strange when tidying up the legacy code I inherited - see the variable names, comments and the commented accessibility_difference lines of code at the top and bottom of this chunk of code.
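
To see why summing accessibilities rejects exposed residues, consider this toy filter. It is not the real residues_have_similar_area_angle_props(): the property names, the single angle term and the threshold are all hypothetical, chosen only to show the effect of summing versus differencing.

```python
def similar_props(res_i, res_j, threshold, difference_accessibility):
    """Toy version of an 'are these two residues similar enough to compare?' filter.

    res_i, res_j: dicts with hypothetical 'angle' and 'accessibility' values.
    difference_accessibility: if False, accessibility is summed (the reported
    behaviour); if True, the absolute difference is used instead.
    """
    angle_term = abs(res_i["angle"] - res_j["angle"])
    if difference_accessibility:
        acc_term = abs(res_i["accessibility"] - res_j["accessibility"])
    else:
        acc_term = res_i["accessibility"] + res_j["accessibility"]
    return angle_term + acc_term < threshold
```

Two identical, highly exposed residues fail the summed version (their accessibilities alone exceed the threshold) but pass the differenced version, which matches the pattern seen with this pair of structures.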

It would make sense that this would cause problems with Natalie's pair of structures because the accessibilities that DSSP gives for their residues (which is what cath-ssap is using) are high.

Changing the code to use the accessibility_difference lines makes it work correctly, with a good SSAP score, a good RMSD, 100% sequence identity:

4tp8U01  2vhpU01   39   39  81.94   39  100  100   1.65
LRRFKRSCEKAGVLAEVRRREFYEKPTTERKRAKASAVK
LRRFKRSCEKAGVLAEVRRREFYEKPTTERKRAKASAVK

and also a good superposition:

4tp8u01_2vhpu01

So this example gives motivation to look more carefully at this and consider fixing it. It's worth noting that the calculation is a bit more complicated because the buried_i and buried_j variables that are differenced also involve the DSSP accessibility scores (see here). If we are able to look into this, it would be important to at least compare before/after ROC curves to avoid a regression in quality.

Note:

  • 4tp8U01 consists of PDB residues 16-54 on chain U of PDB 4tp8
  • 2vhpU01 consists of PDB residues 15-54 on chain U of PDB 2vhp
  • The cath-ssap overlap is 100% despite the domains having different lengths because it's running off a DSSP file that has dropped the final residue 54 from 2vhpU01 because that residue only has one nitrogen atom record in the PDB file.

Run directly from PDB files

A placeholder for a known (non-trivial) issue.

Would be great if the algorithm can run directly from PDB (or mmCIF) files rather than requiring the user to generate their own *.sec and *.wolf files.

readthedocs build is currently failing (404)

The readthedocs link currently directs to a 404.

Looks like this is because the build is failing with the following error

https://readthedocs.org/projects/cath-tools/builds/3459390/

error: [Errno 104] Connection reset by peer
/home/docs/checkouts/readthedocs.org/user_builds/cath-tools/envs/latest/local/lib/python2.7/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning

Following that suggested link:

https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning

Seems to suggest we need to upgrade to a more recent version of python (v3).

cath-refine-align doesn't respect --align-regions when writing alignments

Performing the following:

cath-ssap --pdb-path $PDBDIR 1cuk 1bvs --align-regions 'D[1cukA03]156-203:A' --align-regions 'D[1bvsA03]149-199:A'
cath-refine-align --ssap-aln-infile 1cukA031bvsA03.list --pdb-infile $PDBDIR/1cuk --pdb-infile $PDBDIR/1bvs --align-regions 'D[1cukA03]156-203:A' --align-regions 'D[1bvsA03]149-199:A' --aln-to-ssap-file 1cukA031bvsA03.cath-refine-align.list

...generates a second alignment file in which the numbering starts at 1, not at the start of the specified domain.

Problem parsing CORA alignment file

Using this file: 1.10.287.810__FF_SSG9__1.aln_reps.cora.txt...

setenv PDBDIR /cath/data/v4_0_0/pdb
cath-superpose --cora-aln-infile 1.10.287.810__FF_SSG9__1.aln_reps.cora.txt --pdb-infile $PDBDIR/2bsk --pdb-infile $PDBDIR/2bsk --pdb-infile $PDBDIR/3dxr --pdb-infile $PDBDIR/3cjh --align-regions 'D[2bskA00]:A' --align-regions 'D[2bskF00]:F' --align-regions 'D[3dxrA00]:A' --align-regions 'D[3cjhB00]:B' --sup-to-pymol

...gives...

Cannot read CORA legacy alignment file [bad lexical cast: source type value could not be interpreted as target] : No such file or directory
2017-09-26 17:17:32.687629 [cath-superpose|error  ] Problem building alignment (and spanning tree) : Cannot read CORA legacy alignment file [bad lexical cast: source type value could not be interpreted as target]

Running cath-ssaps on files in other directories can cause errors

Eg cath-ssap ../1c0pA01 ../1hdoA00 gives:

Cannot open file "./../1c0pA01../1hdoA00.list" for writing [ios_base::clear: unspecified iostream_category error] : No such file or directory
Whilst running program /cath-tools/ninja_clang_relwithdebinfo/cath-ssap (via a program_exception_wrapper with typeid: "N4cath30ssap_program_exception_wrapperE"), caught a boost::exception:
../source/common/file/open_fstream.hpp(72): Throw in function void cath::common::detail::open_fstream_impl(fstream_t &, const boost::filesystem::path &, const std::ios_base::openmode &, const cath::common::detail::fstream_type &) [fstream_t = std::__1::basic_ofstream<char, std::__1::char_traits<char> >]
Dynamic exception type: boost::exception_detail::clone_impl<cath::common::runtime_error_exception>
std::exception::what: Cannot open file "./../1c0pA01../1hdoA00.list" for writing [ios_base::clear: unspecified iostream_category error] 
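
The path in the error message suggests that the output filename is formed by concatenating the two arguments exactly as given, directory parts included. One possible fix is to build the name from the final path components only, as in this sketch. The function name is hypothetical, and writing the file to the current directory is an assumption about the desired behaviour.

```python
from pathlib import PurePosixPath

def ssap_list_filename(protein_a, protein_b):
    """Build '<nameA><nameB>.list' from the final path components only."""
    name_a = PurePosixPath(protein_a).name
    name_b = PurePosixPath(protein_b).name
    return name_a + name_b + ".list"
```

For the failing example, the arguments ../1c0pA01 and ../1hdoA00 would give 1c0pA011hdoA00.list rather than a path containing "../" in the middle.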

Allow cath-ssap to restrict regions in PDB_DSSP mode

At present, a command like this:

cath-ssap --prot-src-files PDB_DSSP 3h5q 4giu --align-regions 'D[3h5qA01]-2-67:A' --align-regions 'D[4giuA01]24-90:A'

...generates an error with key message Cannot yet restrict to regions with this combination of input files.


Full error message:

Whilst running program cath-ssap (via a program_exception_wrapper with typeid: "N4cath30ssap_program_exception_wrapperE"), caught a boost::exception:
/ftp/software/cath-tools/source/structure/protein/protein.cpp(462): Throw in function void cath::restrict_to_regions(cath::protein&, const region_vec_opt&)
Dynamic exception type: boost::exception_detail::clone_impl<cath::common::not_implemented_exception>
std::exception::what: Cannot yet restrict to regions with this combination of input files

Specify regions/domains to use for superpositions

It would be really useful to be able to superpose large, multi-domain structures based on just a single, shared domain. Equally, it might be useful to superpose structures on the residues around a specific set of important residues (e.g. active site, surface patch).

Show "good" vs "bad" superposition

Might be nice to have a simple example to demonstrate the following statement:

The cath-superpose tool makes superpositions that look better (but have higher RMSDs).

e.g. "bad" superposition with "good" RMSD (and vice versa)

Add 2DSEC (multiple alignment plots)

2DSEC takes a multiple structural alignment file and outputs a postscript image of the conserved secondary structures.

We have had a request to make 2DSEC images available for others to use and IIRC the only dependency is that the input file is in CORA format (which cath-superpose should be able to do).

2dsec

Improve superpositions of multiple domains

  • Centre superposition on superposed domains (not context)
  • When doing in_chain or in_pdb try to use name of parent as the name of the whole object
  • Make selected regions a selection
  • Rainbow colour (possibly darkened) outside of superpose regions too

cath-superpose gives confusing options error message

cath-ssap aligns the two chains without problem:

$ cath-ssap 1bvs 1cuk --align-regions 'D[1bvsA]:A' --align-regions 'D[1cukA]:A'
 1bvsA   1cukA  183  190  85.27  179   94   30   2.93

cath-superpose the two chains without problem:

$ cath-superpose --pdb-infile $PDBDIR/1bvs --pdb-infile $PDBDIR/1cuk --align-regions 'D[1bvsA]:A' --align-regions 'D[1cukA]:A' --ssap-aln-infile 1bvsA1cukA.list
Standard RMSD is : 2.92621
Superposed using select_best_score_percent[70].ca_atoms and actual full RMSD is : 3.08064

Yet if the --ssap-aln-infile argument is omitted, the error message is about an option it previously accepted:

$ cath-superpose --pdb-infile $PDBDIR/1bvs --pdb-infile $PDBDIR/1cuk --align-regions 'D[1bvsA]:A' --align-regions 'D[1cukA]:A'
cath-superpose: the argument ('D[1cukA]:A') for option '--align-regions' is invalid
See 'cath-superpose --help' for usage.

Enable CRH to filter out hmmsearch output hits covering little of the HMM

Jon requests a CRH feature to filter out hmmsearch output hits covering little of the HMM

This would just apply to input from hmmsearch output files.

It would compare 100.0 * ( hmm_to +1 - hmm_from ) / hmm_length to some threshold (where hmm_length comes from the [M=296] part of the file) and filter out hits that failed the threshold.
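
The calculation above can be sketched as follows. The hit layout (dicts with hmm_from and hmm_to keys) and the function names are hypothetical, and treating a hit that exactly equals the threshold as a pass is an assumption.

```python
def hmm_coverage_percent(hmm_from, hmm_to, hmm_length):
    """Percentage of the HMM covered by a hit (hmm_from/hmm_to are inclusive)."""
    return 100.0 * (hmm_to + 1 - hmm_from) / hmm_length

def filter_by_hmm_coverage(hits, hmm_length, min_coverage):
    """Keep hits whose coverage meets the threshold.

    hits: list of dicts with 'hmm_from' and 'hmm_to' keys (a hypothetical layout).
    """
    return [
        hit for hit in hits
        if hmm_coverage_percent(hit["hmm_from"], hit["hmm_to"], hmm_length) >= min_coverage
    ]
```

For an HMM with M=296, a hit spanning HMM positions 1 to 148 covers exactly 50% and would be dropped by a threshold of 80.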

Jon has given me an example data file (hslu.hmmsearch) and ID (dc_4dc3ac2c5e4a70703d5d9cf0ba2f0ac9) that I can try to use to build the feature and that I can then add as a test case.

Investigate: why does dropping sec files make some SSAP worse

The result of benchmarking the dropping of sec files appears to indicate that the overall homology-discrimination is largely unaffected.

However results for specific pairs are affected - in some cases, the change results in lower scoring alignments and in some cases, higher scoring alignments. See scatter_plot.pdf. Depending on the causes, it may be possible to address some of the regressions without losing the improvements and hence improve the overall performance.

To find the most extreme changes... get the UCLOrengoGroup/cath-tools-supplementary repo, cd into it, then something like:

cd homologue_discrimination_benchmark/results_sets
join v0.12.24-21-g06fbfeb_PDB_DSSP_SEC.full v0.12.24-21-g06fbfeb_PDB_DSSP.full | awk '$6 >= 60 && $15 >=60 && $6 != $15 {print $2 " " $3 " " $6 " " $15 " " ($15 - $6) }' | sort -g  -k 5 | head
join v0.12.24-21-g06fbfeb_PDB_DSSP_SEC.full v0.12.24-21-g06fbfeb_PDB_DSSP.full | awk '$6 >= 60 && $15 >=60 && $6 != $15 {print $2 " " $3 " " $6 " " $15 " " ($15 - $6) }' | sort -rg -k 5 | head

Improve CRH HTML and options

Jon and I think these options would be useful...

In HTML output, add a single-line result to CRH

Add a new block of HTML options with these two options:

  • --html-max-num-non-soln-hits <num> In HTML output, only display up to <num> hits that aren't part of the solution
  • --html-exclude-rejected-hits In HTML output, don't display hits that fail the score filters
