haddocking / arctic3d
Automatic Retrieval and ClusTering of Interfaces in Complexes from 3D structural information
License: Apache License 2.0
it may happen that a call to fetch_pdb fails because the selected pdb is not yet on the pdbrenum database.
The workaround here is to select the second best pdb (and then the third/fourth/etc.) which has been already processed by pdbrenum.
In the future we may think of integrating pdbrenum in our pipeline to handle these cases
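A minimal sketch of the workaround, assuming the candidate pdbs are already ranked from best to worst. `fetch_first_available` and the `fetch` callable are hypothetical names used for illustration; `fetch` stands in for a function such as fetch_pdb and is assumed to return a file path on success and None (or raise) on failure.

```python
from typing import Callable, Iterable, Optional, Tuple


def fetch_first_available(
    ranked_pdb_ids: Iterable[str],
    fetch: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (pdb_id, path) for the first candidate that `fetch` retrieves.

    `fetch` is a hypothetical stand-in for fetch_pdb: it is assumed to
    return a path on success and None (or raise) when the entry is missing.
    """
    for pdb_id in ranked_pdb_ids:
        try:
            path = fetch(pdb_id)
        except Exception:  # e.g. network error or entry not yet renumbered
            path = None
        if path is not None:
            return pdb_id, path
    # no candidate could be fetched
    return None, None
```

With this shape the caller only has to deal with one failure case, namely that no candidate at all is available.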
we could think of giving more than one uniprot id in input in order to consider homologs/orthologs.
as an example, the uniprot IDs P61970 and P61972 are associated to the same sequence, but one refers to the rat form and the other to the human.
PR #87 missed two points in , which led to incorrect behavior
arctic3d/src/arctic3d/modules/clustering.py
Lines 182 to 187 in cdcc01c
get_clustering_dict should be called with entries, and clustered_residues is retrieved by get_residue_dict and not by write_residues
if one gives in input a pdb file but no interface file, currently arctic3d crashes with the following error message:
UnboundLocalError: local variable 'filtered_interfaces' referenced before assignment
of course, the interfaces retrieved fetching the (retrieved) uniprot ID have never been filtered. How to proceed? Get the pdb id of the input pdb, run pdbrenum and use the renumbered pdb to validate and filter the interfaces.
The complicated part is to get the pdb id of the input pdb
currently arctic3d outputs the following files:
It would be nice:
setup tox tests to run tests and lint locally
the following call crashes the program:
arctic3d Q99895
with the following error message
[2022-08-25 16:37:00,515 pdb:L36 DEBUG] Fetching PDB file 4h4f from PDBrenum
[2022-08-25 16:37:00,993 pdb:L36 DEBUG] Fetching PDB file 4h4f from PDBrenum
[2022-08-25 16:37:01,481 pdb:L204 DEBUG] 4h4f failed validation
...a bunch of mdanalysis errors
FileNotFoundError: [Errno 2] No such file or directory: '4h4f.pdb'
In light of PR #101 the execution of the bm5 benchmark can be adjusted and prettified.
the program should handle the following situations while reading the interface_file
we should handle the following cases:
related to #16: the user may have a pdb and a set of interfaces, which may be clustered. A simple if in the cli should do the trick.
some calls are highly verbose, remove them
currently the retrieval of the best pdb is done maximizing the resolution, once the checks in the check_list are passed.
It makes more sense to look for the pdb (that passes the checks) that retrieves the most interface hits.
given the set of interfaces provided by get_interface_data, this script or set of scripts should
read_int_matrix from clustering module to interface_matrix module
renumbering input pdbs according to a specific uniprot numbering can be quite painful. I wrote a script for that, should we include it in the benchmark? may it be useful in other contexts?
since our interface matrix is not fully a distance matrix but rather a similarity matrix (triangle inequality is partially violated) we should change any misleading reference to a distance and call it "similarity"
given that PDBrenum is a core element of arctic3d, we should take into account possible failures from the server side. In order to do that, a feasible option is to create and periodically update a db of all the renumbered structures.
Using Jesus' script, we should be able to retrieve the best pdb associated to an input uniprot ID.
If interfaces names in interface_file are numbers they are wrongly read as integers or doubles, thus causing problems later on in the execution
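One way to avoid the wrong type inference is to parse the file by hand and never convert the name field. The sketch below assumes each line holds an interface name followed by whitespace-separated residue numbers (this layout is an assumption), and `parse_interface_lines` is a hypothetical helper, not an arctic3d function.

```python
from typing import Dict, List


def parse_interface_lines(lines: List[str]) -> Dict[str, List[int]]:
    """Parse 'name res1 res2 ...' lines, keeping every name as a string.

    Assumed layout: first field is the interface name, the remaining
    fields are integer residue numbers. Blank lines are skipped.
    """
    interfaces: Dict[str, List[int]] = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        name = fields[0]  # never cast: "1" and "2.0" stay strings
        interfaces[name] = [int(res) for res in fields[1:]]
    return interfaces
```

For example, the lines `1 10 11 12` and `2.0 13 14` give the keys `"1"` and `"2.0"` rather than an integer and a float.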
wherever the pdb comes from, we want to be (decently) sure it matches the numbering of the interface API.
the interface identifiers can contain spaces, which cause problems when reading txt with pandas (more columns than expected).
During the interface_matrix output phase the names of the interfaces should be stripped (maybe substituting - to spaces when present)
If the number of interfaces exceeds a few thousand, the similarity matrix becomes too big and arctic3d will take forever to run.
I will insert a check to avoid this highly unlikely scenario.
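Such a check could look like the sketch below. The threshold value and the names `MAX_INTERFACES` and `check_interface_count` are hypothetical; the only fact used is that n interfaces give n(n-1)/2 pairwise similarities.

```python
MAX_INTERFACES = 5000  # hypothetical threshold, to be tuned


def check_interface_count(n_interfaces: int, max_interfaces: int = MAX_INTERFACES) -> int:
    """Return the number of pairwise similarities, or raise if too many interfaces."""
    if n_interfaces > max_interfaces:
        raise ValueError(
            f"{n_interfaces} interfaces found, more than the allowed {max_interfaces}"
        )
    # number of entries in the condensed (upper-triangular) matrix
    return n_interfaces * (n_interfaces - 1) // 2
```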
when no valid pdbs are found:
TypeError: cannot unpack non-iterable NoneType object
deal with it by returning a double None in get_best_pdb
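A sketch of the idea, assuming the caller unpacks two values from get_best_pdb. The function below is a simplified stand-in with a hypothetical input (a ranked list of dictionaries), not the real implementation.

```python
from typing import List, Optional, Tuple


def get_best_pdb_sketch(valid_pdbs: List[dict]) -> Tuple[Optional[str], Optional[dict]]:
    """Always return a 2-tuple so that the caller can unpack it safely."""
    if not valid_pdbs:
        # a bare `return None` here is what makes the unpacking fail
        return None, None
    best = valid_pdbs[0]  # assumed to be sorted from best to worst
    return best["pdb_f"], best["interfaces"]
```

The caller can then test `if pdb_f is None` and exit with a clear message instead of a traceback.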
when a pdb is submitted with no interface file a warning is raised as it's assumed the pdb is consistent with the corresponding uniprot numbering
add a cli to do the residue-based hierarchical clustering. something like
arctic3d_resclust PDB --residue_list 1,2,3,4,27,28,29,30
that gives in output N (found according to the cutoff) clusters of residues for further analysis.
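The argument parsing for such a cli could be sketched with argparse as below. Only the program name and the `--residue_list` option come from the example call above; `input_pdb`, `parse_residue_list` and `build_parser` are hypothetical names, and the clustering itself is left out.

```python
import argparse
from typing import List


def parse_residue_list(text: str) -> List[int]:
    """Turn a comma-separated string such as '1,2,3' into a list of ints."""
    return [int(tok) for tok in text.split(",") if tok.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="arctic3d_resclust")
    parser.add_argument("input_pdb", help="pdb file containing the residues")
    parser.add_argument(
        "--residue_list",
        type=parse_residue_list,  # argparse reports an error on non-integer tokens
        required=True,
        help="comma-separated residue numbers to cluster",
    )
    return parser
```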
as it is currently done for uniprot IDs, it should be possible to exclude information coming from a list of pdb IDs.
An example could be
arctic3d P01112 --out_pdb=1BKD
where 1BKD would be the pdb with the structure of the complex.
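The filtering step could be sketched as follows. The shape of the hits (a list of dictionaries with a `pdb_id` key) and the name `exclude_pdbs` are assumptions for illustration; accepting a comma-separated value would also allow more than one excluded ID.

```python
from typing import Dict, List


def exclude_pdbs(hits: List[Dict], out_pdb: str) -> List[Dict]:
    """Drop hits whose pdb ID appears in the comma-separated `out_pdb` string.

    The comparison is case-insensitive, so '1BKD' and '1bkd' are equivalent.
    """
    excluded = {p.strip().lower() for p in out_pdb.split(",") if p.strip()}
    return [hit for hit in hits if hit["pdb_id"].lower() not in excluded]
```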
correct naming of clusters
While executing the benchmark on the BM5 dataset, pdb 1JTG gives rise to this output
1JTG_B:A,3GMU_B,P35804,1ZG4_A,P62593
which seems to suggest that the receptor pdb (uniprot ID) is 3GMU (P35804). This is consistent with the BM5 table but not with the BM5 haddock repo, in which the two molecules seem to be swapped (3GMU is the ligand and 1ZG4 is the receptor). This happens because in the haddock dataset the bigger protein is considered as the receptor.
It is not trivial to take into account this, I create this issue so that we don't forget that the problem exists.
test_write_clusters and test_write_residues should be completed
currently we handle only one excluded value
I thought uniprot IDs could only have 6 characters, but that's wrong
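UniProt accessions can be 6 or 10 characters long. A validation sketch using the regular expression published in the UniProt help pages is shown below; `is_uniprot_accession` is a hypothetical helper name.

```python
import re

# accession pattern as documented by UniProt (6 or 10 characters)
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession(text: str) -> bool:
    """Return True if the whole string is a valid UniProt accession."""
    return UNIPROT_ACCESSION.fullmatch(text) is not None
```

Using `fullmatch` rather than `match` avoids accepting strings that merely start with a valid accession.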
with the following uniprot IDs P20023 P01075 O60880 P48551 P69786 P0ABS8
arctic call fails with the following error message:
BestPDB hit for {uniprot_id}: {pdb_id}_{chain_id} {coverage:.2f} coverage {resolution:.2f} Angstrom / start {start} end {end}"
TypeError: unsupported format string passed to NoneType.__format__
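The error says that one of the values formatted with `:.2f` is None; which one is not stated here. A defensive formatting helper could look like this sketch (`fmt_float` is a hypothetical name):

```python
from typing import Optional


def fmt_float(value: Optional[float], unit: str = "") -> str:
    """Format a float with two decimals, or 'n/a' when the value is missing."""
    if value is None:
        return "n/a"
    return f"{value:.2f}{unit}"
```

The log message can then use `{fmt_float(coverage)}` and `{fmt_float(resolution)}` instead of applying `:.2f` directly.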
currently only the calculation of the sine similarity is done. The distance is indeed useless as it is strongly dependent on the size of the interfaces. But it could be useful if the interfaces share the same number of residues.
given a set of interfaces, we should be able to quickly create an interface matrix
it can be that we find no or one interface. currently the clustering fails, which is good, as the distance matrix is empty.
Things should be handled in a more elegant way
we should improve the documentation, namely expand the README file and check the autogenerated docs.
update README and add an interface_file to example, compatible with the pdb file 1ppe_E.pdb
when one wants to benchmark the software, it might be useful to not consider some uniprot IDs, for example the one of the true complex.
cli is already a bit verbose, needs a couple of tests
get_interface_data should be a function that, given a uniprot ID and a reference pdb, gives in output the set of interfaces present in the literature
should have the form of a dictionary {"int1_name" : [ 1, 2, 3], "int2_name" : [2,3,4]}
add a script that takes a list of uniprot IDs and executes arctic3d on them
when several interfaces are clustered together, the list of clustered residues can be quite long.
An example can be found by running
arctic3d example/1ppe_E.fasta --db db/swissprot
48 interfaces can be found. 46 of them are clustered together. They are indeed similar, although they span a big set of 41 residues:
1: [43, 45, 46, 47, 63, 65, 66, 67, 68, 89, 91, 98, 99, 100, 101, 102, 103, 104, 114, 149, 150, 152, 153, 154, 156, 157, 159, 178, 194, 195, 197, 198, 200, 215, 217, 218, 219, 220, 222, 223, 225]
Of course not all of them are present in each of the 46 interfaces. Does it make sense to include their probability to be in an interface in the output to facilitate the understanding?
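One simple definition of that probability is the fraction of interfaces in the cluster that contain the residue. The sketch below assumes the cluster is available as a dictionary mapping interface names to residue lists; `residue_frequencies` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, List


def residue_frequencies(interfaces: Dict[str, List[int]]) -> Dict[int, float]:
    """Fraction of interfaces in which each residue appears."""
    # set() makes sure a residue is counted at most once per interface
    counts = Counter(
        res for residues in interfaces.values() for res in set(residues)
    )
    n_interfaces = len(interfaces)
    return {res: counts[res] / n_interfaces for res in sorted(counts)}
```

A residue present in all interfaces of the cluster gets 1.0, while one that appears in a single interface out of 46 gets about 0.02.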
currently arctic-3d only accepts a single sequence as input, but the user should be able to input a specific uniprot ID
interfaces filtered during the process of pdb retrieval should be refiltered, in case some amino acids have been removed from the retained pdb during the pdb tidying process.
This has the disadvantages of
currently we get the best pdb at a given uniprot ID
Line 129 in a061f56