arctic3d's Issues

handle non-existing pdb-renum pdb file

it may happen that a call to fetch_pdb fails because the selected pdb is not yet in the PDBrenum database.

The workaround here is to select the second-best pdb (and then the third, fourth, etc.) that has already been processed by PDBrenum.

In the future we may think of integrating PDBrenum into our pipeline to handle these cases.
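A possible shape for this fallback, sketched in Python. `fetch_first_available` is a hypothetical helper, `fetch_pdb` is passed in as a callable, and the assumption is that a failed PDBrenum fetch surfaces as an `OSError` (which `urllib`'s `HTTPError` subclasses):

```python
def fetch_first_available(ranked_pdb_ids, fetch_pdb):
    """Try each candidate pdb id in ranked order and return the first
    one that PDBrenum has already processed (hypothetical helper)."""
    for pdb_id in ranked_pdb_ids:
        try:
            return pdb_id, fetch_pdb(pdb_id)
        except OSError:
            # pdb not yet on the PDBrenum database: fall back to the next hit
            continue
    # no candidate could be fetched
    return None, None
```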

consider orthologs

we could think of giving more than one uniprot ID in input in order to consider homologs/orthologs.

as an example, the uniprot IDs P61970 and P61972 are associated with the same sequence, but one refers to the rat form and the other to the human one.

fix clustering output

PR #87 missed two points, which led to incorrect behavior:

cl_dict = get_clustering_dict(clusters)
write_clusters(cl_dict, cl_filename)
# write clustered residues
res_filename = "clustered_residues.out"
res_dict = get_residue_dict(cl_dict, interface_dict)
clustered_residues = write_residues(res_dict, res_filename)

get_clustering_dict should be called with entries, and clustered_residues should be retrieved via get_residue_dict, not via write_residues.

handle the case of input pdb but no interface file

if one gives a pdb file in input but no interface file, arctic3d currently crashes with the following error message:
UnboundLocalError: local variable 'filtered_interfaces' referenced before assignment
Of course, the interfaces retrieved by fetching the (retrieved) uniprot ID have never been filtered. How to proceed? Get the pdb ID of the input pdb, run PDBrenum, and use the renumbered pdb to validate and filter the interfaces.

The complicated part is getting the pdb ID of the input pdb.

create concise and clear output for end user

currently arctic3d outputs the following files:

  • the best pdb obtained from the search (if such search is performed)
  • the set of clustered interfaces
  • the set of clustered residues
  • the dendrogram of clustered interfaces
  • the interface matrix

It would be nice:

  1. to gather all this information in a clearer and more concise way
  2. to integrate the current info with further visualisation of the interfaces (on the structure? on the sequence?)
  3. to output more information?

check failed validation

the following call crashes the program:
arctic3d Q99895
with the following error message:

 [2022-08-25 16:37:00,515 pdb:L36 DEBUG] Fetching PDB file 4h4f from PDBrenum
 [2022-08-25 16:37:00,993 pdb:L36 DEBUG] Fetching PDB file 4h4f from PDBrenum
 [2022-08-25 16:37:01,481 pdb:L204 DEBUG] 4h4f failed validation
...a bunch of mdanalysis errors
FileNotFoundError: [Errno 2] No such file or directory: '4h4f.pdb'

refactor arctic3d bm5 execution

In light of PR #101 the execution of the bm5 benchmark can be adjusted and prettified.

  • the output folder must be a proper argument
  • this folder must contain the arctic3d output folders (so that when the benchmark is re-executed, arctic3d automatically stops without overwriting anything)
  • the checks on the pair must be done inside the run
  • stats must be clearer (and checked)
  • docstrings should be improved
  • the bm5 pdb is actually used in the search process
  • the bm5 pdb chain is used in the search process (thanks to #123)
  • there should be a complex directory in which we save the pairwise information and the relevant pdb files

improve filtering on input interface_file

the program should handle the following situations while reading the interface_file:

  1. empty line in the interface file -> skip it
  2. only a single value (possibly the name of the interface) -> abort the execution and throw an exception
  3. invalid residue numbers (e.g. 19A, 28-38) -> abort the execution and throw an exception
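A minimal reader implementing the three rules, assuming a whitespace-separated format (`<name> <res1> <res2> ...`); `parse_interface_file` is a hypothetical name:

```python
def parse_interface_file(lines):
    """Parse interface-file lines of the form '<name> <res1> <res2> ...'
    with integer residue numbers (format assumed for illustration)."""
    interfaces = {}
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            # 1. empty line -> skip it
            continue
        if len(fields) < 2:
            # 2. only a single value -> abort with an exception
            raise ValueError(f"line {lineno}: no residues given")
        try:
            # 3. invalid residue numbers -> abort with an exception
            residues = [int(f) for f in fields[1:]]
        except ValueError:
            raise ValueError(f"line {lineno}: invalid residue number")
        interfaces[fields[0]] = residues
    return interfaces
```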

better handling of bad calls

we should handle the following cases:

  1. no interface residues found in the first place: exit peacefully without downloading pdbs
  2. one single interface found: don't cluster, but write out the data as if the clustering was done
  3. no interfaces retained after filtering: exit peacefully with a meaningful message
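The three cases could be dispatched before clustering; a sketch with a hypothetical `check_interfaces` helper, where the single-interface case returns a one-cluster dict mimicking the clustering output:

```python
import logging
import sys

log = logging.getLogger("arctic3d")

def check_interfaces(interface_dict):
    """Decide how to proceed given the retrieved interfaces
    (hypothetical helper, names assumed)."""
    if not interface_dict:
        # case 1 and 3: nothing to cluster, exit peacefully
        log.info("No interfaces available, exiting.")
        sys.exit(0)
    if len(interface_dict) == 1:
        # case 2: single interface, skip clustering but emit
        # data shaped as if the clustering had been done
        name = next(iter(interface_dict))
        return {1: [name]}
    return None  # more than one interface: cluster as usual
```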

change get_best_pdb retrieval logic

currently the retrieval of the best pdb is done by maximizing the resolution, once the checks in the check_list are passed.

It makes more sense to look for the pdb (among those that pass the checks) that retrieves the most interface hits.
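A sketch of the proposed ranking, assuming we can obtain, for each candidate pdb that passed the checks, the set of residue numbers it resolves (all names hypothetical):

```python
def count_interface_hits(pdb_residues, interfaces):
    """Number of interface residues resolved in this pdb."""
    return sum(1 for residues in interfaces.values()
               for res in residues if res in pdb_residues)

def rank_pdbs_by_hits(candidates, interfaces):
    """candidates: {pdb_id: set of resolved residue numbers} for the
    pdbs that already passed the check_list. Returns the pdb id that
    covers the most interface hits."""
    return max(candidates,
               key=lambda pid: count_interface_hits(candidates[pid], interfaces))
```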

create clustering script

given the set of interfaces provided by get_interface_data, this script (or set of scripts) should

  1. create a distance matrix
  2. cluster it
  3. output the details of the clustering
  4. report the clustered interfaces in a clear, pretty, and possibly visual way
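Steps 1-3 can be sketched with scipy's hierarchical clustering; the linkage method and threshold below are illustrative, not arctic3d's actual settings:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_interfaces(int_matrix, threshold):
    """Hierarchically cluster a square interface distance matrix and
    return {cluster_id: [interface_index, ...]} (illustrative sketch)."""
    condensed = squareform(int_matrix)            # condensed distance vector
    dendrogram = linkage(condensed, method="average")
    labels = fcluster(dendrogram, t=threshold, criterion="distance")
    clusters = {}
    for idx, cl_id in enumerate(labels):
        clusters.setdefault(int(cl_id), []).append(idx)
    return clusters
```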

refactor interface matrix calculation

  • using numpy built-in functions it should be possible to substantially accelerate some calculations
  • add some tests as well
  • move read_int_matrix from the clustering module to the interface_matrix module
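As an illustration of the numpy speed-up, a fully vectorised pairwise Jaccard similarity over binary interface vectors (the actual arctic3d metric may differ; this only shows the vectorisation idea):

```python
import numpy as np

def interface_matrix(binary_interfaces):
    """binary_interfaces: (n_interfaces, n_residues) 0/1 array.
    Returns the pairwise Jaccard similarity with no Python loops."""
    inter = binary_interfaces @ binary_interfaces.T   # shared residues
    sizes = binary_interfaces.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter   # broadcast union sizes
    return inter / union
```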

renumbering haddock pdbs according to uniprot?

renumbering input pdbs according to a specific uniprot numbering can be quite painful. I wrote a script for that; should we include it in the benchmark? Might it be useful in other contexts?

convert distance to similarity

since our interface matrix is not truly a distance matrix but rather a similarity matrix (the triangle inequality is partially violated), we should change any misleading reference to a distance and call it "similarity"

handle pdbrenum failed calls

given that PDBrenum is a core element of arctic3d, we should take into account possible failures on the server side. To do that, a feasible option is to create and periodically update a database of all the renumbered structures.

retrieving best pdb

Using Jesus' script, we should be able to retrieve the best pdb associated with an input uniprot ID.

convert interface names to strings

If interface names in the interface_file are numbers, they are wrongly read as integers or doubles, causing problems later on in the execution.
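With pandas this can be avoided by forcing the name column to string on read; a minimal illustration (the column names and separator are made up):

```python
import io

import pandas as pd

# force the first column (interface names) to stay strings,
# so a name like "001" is not parsed as the integer 1
text = "name residues\n001 1,2,3\n2.5 7,8\n"
df = pd.read_csv(io.StringIO(text), sep=" ", dtype={"name": str})
```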

check PDB renumbering

wherever the pdb comes from, we want to be (decently) sure it matches the numbering of the interface API.

remove spaces in interface names

the interface identifiers can contain spaces, which cause problems when reading the txt with pandas (more columns than expected).

During the interface_matrix output phase the names of the interfaces should be stripped (maybe substituting spaces with "-" when present)
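A one-line sanitiser along these lines (hypothetical helper name):

```python
def sanitize_name(name):
    """Strip an interface identifier and replace internal whitespace
    with '-' so it stays a single pandas column later on."""
    return "-".join(name.strip().split())
```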

fix unpacking error

when no valid pdbs are found:
TypeError: cannot unpack non-iterable NoneType object
deal with it by returning a double None in get_best_pdb
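The proposed fix, sketched with a stub `get_best_pdb` (the selection logic is elided; only the return convention matters here):

```python
def get_best_pdb(uniprot_id, interface_residues, candidates=()):
    """Sketch: return (pdb_file, filtered_interfaces), or a double None
    when no candidate pdb survives validation, so the caller's tuple
    unpacking never sees a bare None."""
    for pdb_f, filtered_interfaces in candidates:
        return pdb_f, filtered_interfaces
    return None, None

# caller-side handling
pdb_f, filtered_interfaces = get_best_pdb("P99999", [1, 2])
if pdb_f is None:
    print("no valid pdb found")
```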

add arctic3d_resclust cli

add a cli to do the residue-based hierarchical clustering, something like

arctic3d_resclust PDB --residue_list 1,2,3,4,27,28,29,30

that outputs N clusters of residues (with N determined by the cutoff) for further analysis.

exclude pdb information

as is currently done for uniprot IDs, it should be possible to exclude information coming from a list of pdb IDs.

An example could be
arctic3d P01112 --out_pdb=1BKD
where 1BKD is the pdb with the structure of the complex.

receptor-ligand mismatch in BM5?

While executing the benchmark on the BM5 dataset, pdb 1JTG gives rise to this output

1JTG_B:A,3GMU_B,P35804,1ZG4_A,P62593

which seems to suggest that the receptor pdb (uniprot ID) is 3GMU (P35804). This is consistent with the BM5 table but not with the BM5 haddock repo, in which the two molecules seem to be swapped (3GMU is the ligand and 1ZG4 the receptor). This happens because in the haddock dataset the bigger protein is considered the receptor.

It is not trivial to take this into account; I am creating this issue so that we don't forget the problem exists.

refactor clustering and test_clustering

  • test_write_clusters and test_write_residues should be completed
  • write_clusters should be split into two functions, one gathering cl_dict and the other writing the clusters to file

handle missing coverage/resolution in get_best_pdb

with the following uniprot IDs P20023 P01075 O60880 P48551 P69786 P0ABS8
the arctic3d call fails with the following error message:

BestPDB hit for {uniprot_id}: {pdb_id}_{chain_id} {coverage:.2f} coverage {resolution:.2f} Angstrom / start {start} end {end}"
TypeError: unsupported format string passed to NoneType.__format__
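A defensive formatting sketch that tolerates a None coverage or resolution (helper name hypothetical, message shape taken from the log line above):

```python
def format_best_pdb_hit(uniprot_id, pdb_id, chain_id,
                        coverage, resolution, start, end):
    """Format the BestPDB log line, tolerating missing (None)
    coverage/resolution values from the API."""
    cov = f"{coverage:.2f}" if coverage is not None else "unknown"
    res = f"{resolution:.2f}" if resolution is not None else "unknown"
    return (f"BestPDB hit for {uniprot_id}: {pdb_id}_{chain_id} "
            f"{cov} coverage {res} Angstrom / start {start} end {end}")
```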

add calculation of pairwise distance

currently only the calculation of the sine similarity is done. The distance is indeed useless, as it is strongly dependent on the size of the interfaces, but it could be useful when the interfaces share the same number of residues.

improve documentation

we should improve the documentation, namely expand the README file and check the autogenerated docs.

create get_interface_data

get_interface_data should be a function that, given a uniprot ID and a reference pdb, outputs the set of interfaces present in the literature

  • is the reference pdb necessary?
  • how to implement the renumbering?
  • how to exclude things in a practical way?

include probability in the output

when several interfaces are clustered together, the list of clustered residues can be quite long.

An example can be found by running
arctic3d example/1ppe_E.fasta --db db/swissprot

48 interfaces are found, and 46 of them are clustered together. They are indeed similar, although they span a big set of 41 residues:
1: [43, 45, 46, 47, 63, 65, 66, 67, 68, 89, 91, 98, 99, 100, 101, 102, 103, 104, 114, 149, 150, 152, 153, 154, 156, 157, 159, 178, 194, 195, 197, 198, 200, 215, 217, 218, 219, 220, 222, 223, 225]
Of course, not all of these residues are present in every one of the 46 interfaces. Does it make sense to include each residue's probability of being in an interface in the output, to facilitate understanding?
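Such probabilities are just per-residue frequencies over the cluster members; a sketch (hypothetical helper):

```python
from collections import Counter

def residue_probabilities(clustered_interfaces):
    """clustered_interfaces: list of residue lists, one per interface in
    the cluster. Returns {residue: fraction of interfaces containing it}."""
    counts = Counter(res for interface in clustered_interfaces
                     for res in set(interface))
    n = len(clustered_interfaces)
    return {res: counts[res] / n for res in sorted(counts)}
```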

re-filter already filtered interfaces

interfaces filtered during the pdb retrieval process should be re-filtered, in case some amino acids have been removed from the retained pdb during the pdb tidying process.

This has the disadvantages of

  1. slowing down the calculations a bit (not really an issue)
  2. keeping interfaces whose full coverage ends up a bit lower than the coverage cutoff. Let's hope this is not dramatic.

specify the pdb to be used

currently we get the best pdb for a given uniprot ID

pdb_f, filtered_interfaces = get_best_pdb(uniprot_id, interface_residues)

but it should be possible to specify it. This

  1. would allow the user to use a pdb in a certain conformation
  2. would guarantee a much faster execution (a single pdb download vs many)
