arctic3d's Issues

handle non-existing pdb-renum pdb file

it may happen that a call to fetch_pdb fails because the selected pdb is not yet in the PDBrenum database.

The workaround here is to select the second-best pdb (and then the third, fourth, etc.) that has already been processed by PDBrenum.

In the future we may think of integrating PDBrenum into our pipeline to handle these cases.
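A possible shape for this fallback, sketched in Python. `fetch_first_available` is a hypothetical helper, `fetch_pdb` is passed in as a callable, and the assumption is that a failed PDBrenum fetch surfaces as an `OSError` (which `urllib`'s `HTTPError` subclasses):

```python
def fetch_first_available(ranked_pdb_ids, fetch_pdb):
    """Try each candidate pdb id in ranked order and return the first
    one that PDBrenum has already processed (hypothetical helper)."""
    for pdb_id in ranked_pdb_ids:
        try:
            return pdb_id, fetch_pdb(pdb_id)
        except OSError:
            # pdb not yet on the PDBrenum database: fall back to the next hit
            continue
    # no candidate could be fetched
    return None, None
```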

consider orthologs

we could think of giving more than one uniprot ID in input in order to consider homologs/orthologs.

as an example, the uniprot IDs P61970 and P61972 are associated with the same sequence, but one refers to the rat form and the other to the human one.

fix clustering output

PR #87 missed two points, which led to incorrect behavior:

cl_dict = get_clustering_dict(clusters)
write_clusters(cl_dict, cl_filename)
# write clustered residues
res_filename = "clustered_residues.out"
res_dict = get_residue_dict(cl_dict, interface_dict)
clustered_residues = write_residues(res_dict, res_filename)

get_clustering_dict should be called with entries, and clustered_residues should be retrieved via get_residue_dict, not via write_residues.

handle the case of input pdb but no interface file

if one gives a pdb file in input but no interface file, arctic3d currently crashes with the following error message:
UnboundLocalError: local variable 'filtered_interfaces' referenced before assignment
Of course, the interfaces retrieved by fetching the (retrieved) uniprot ID have never been filtered. How to proceed? Get the pdb ID of the input pdb, run PDBrenum, and use the renumbered pdb to validate and filter the interfaces.

The complicated part is getting the pdb ID of the input pdb.

create concise and clear output for end user

currently arctic3d outputs the following files:

  • the best pdb obtained from the search (if such search is performed)
  • the set of clustered interfaces
  • the set of clustered residues
  • the dendrogram of clustered interfaces
  • the interface matrix

It would be nice:

  1. to gather all this information in a clearer and more concise way
  2. to integrate the current info with further visualisation of the interfaces (on the structure? on the sequence?)
  3. to output more information?

check failed validation

the following call crashes the program:
arctic3d Q99895
with the following error message:

 [2022-08-25 16:37:00,515 pdb:L36 DEBUG] Fetching PDB file 4h4f from PDBrenum
 [2022-08-25 16:37:00,993 pdb:L36 DEBUG] Fetching PDB file 4h4f from PDBrenum
 [2022-08-25 16:37:01,481 pdb:L204 DEBUG] 4h4f failed validation
...a bunch of mdanalysis errors
FileNotFoundError: [Errno 2] No such file or directory: '4h4f.pdb'

refactor arctic3d bm5 execution

In light of PR #101 the execution of the bm5 benchmark can be adjusted and prettified.

  • the output folder must be a proper argument
  • this folder must contain the arctic3d output folders (so that when the benchmark is re-executed, arctic3d automatically stops without overwriting anything)
  • the checks on the pair must be done inside the run
  • stats must be clearer (and checked)
  • docstrings should be improved
  • the bm5 pdb is actually used in the search process
  • the bm5 pdb chain is used in the search process (thanks to #123)
  • there should be a complex directory in which we save the pairwise information and the relevant pdb files

improve filtering on input interface_file

the program should handle the following situations while reading the interface_file:

  1. empty line in the interface file -> skip it
  2. only a single value (possibly the name of the interface) -> abort the execution and throw an exception
  3. invalid residue numbers (e.g. 19A, 28-38) -> abort the execution and throw an exception
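A minimal reader implementing the three rules, assuming a whitespace-separated format (`<name> <res1> <res2> ...`); `parse_interface_file` is a hypothetical name:

```python
def parse_interface_file(lines):
    """Parse interface-file lines of the form '<name> <res1> <res2> ...'
    with integer residue numbers (format assumed for illustration)."""
    interfaces = {}
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            # 1. empty line -> skip it
            continue
        if len(fields) < 2:
            # 2. only a single value -> abort with an exception
            raise ValueError(f"line {lineno}: no residues given")
        try:
            # 3. invalid residue numbers -> abort with an exception
            residues = [int(f) for f in fields[1:]]
        except ValueError:
            raise ValueError(f"line {lineno}: invalid residue number")
        interfaces[fields[0]] = residues
    return interfaces
```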

better handling of bad calls

we should handle the following cases:

  1. no interface residues found in the first place: exit peacefully without downloading pdbs
  2. one single interface found: don't cluster, but write out the data as if the clustering was done
  3. no interfaces retained after filtering: exit peacefully with a meaningful message
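The three cases could be dispatched before clustering; a sketch with a hypothetical `check_interfaces` helper, where the single-interface case returns a one-cluster dict mimicking the clustering output:

```python
import logging
import sys

log = logging.getLogger("arctic3d")

def check_interfaces(interface_dict):
    """Decide how to proceed given the retrieved interfaces
    (hypothetical helper, names assumed)."""
    if not interface_dict:
        # case 1 and 3: nothing to cluster, exit peacefully
        log.info("No interfaces available, exiting.")
        sys.exit(0)
    if len(interface_dict) == 1:
        # case 2: single interface, skip clustering but emit
        # data shaped as if the clustering had been done
        name = next(iter(interface_dict))
        return {1: [name]}
    return None  # more than one interface: cluster as usual
```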

change get_best_pdb retrieval logic

currently the retrieval of the best pdb is done by maximizing the resolution, once the checks in the check_list are passed.

It makes more sense to look for the pdb (among those that pass the checks) that retrieves the most interface hits.
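A sketch of the proposed ranking, assuming we can obtain, for each candidate pdb that passed the checks, the set of residue numbers it resolves (all names hypothetical):

```python
def count_interface_hits(pdb_residues, interfaces):
    """Number of interface residues resolved in this pdb."""
    return sum(1 for residues in interfaces.values()
               for res in residues if res in pdb_residues)

def rank_pdbs_by_hits(candidates, interfaces):
    """candidates: {pdb_id: set of resolved residue numbers} for the
    pdbs that already passed the check_list. Returns the pdb id that
    covers the most interface hits."""
    return max(candidates,
               key=lambda pid: count_interface_hits(candidates[pid], interfaces))
```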

create clustering script

given the set of interfaces provided by get_interface_data, this script (or set of scripts) should

  1. create a distance matrix
  2. cluster it
  3. output the details of the clustering
  4. report the clustered interfaces in a clear, pretty, and possibly visual way
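Steps 1-3 can be sketched with scipy's hierarchical clustering; the linkage method and threshold below are illustrative, not arctic3d's actual settings:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_interfaces(int_matrix, threshold):
    """Hierarchically cluster a square interface distance matrix and
    return {cluster_id: [interface_index, ...]} (illustrative sketch)."""
    condensed = squareform(int_matrix)            # condensed distance vector
    dendrogram = linkage(condensed, method="average")
    labels = fcluster(dendrogram, t=threshold, criterion="distance")
    clusters = {}
    for idx, cl_id in enumerate(labels):
        clusters.setdefault(int(cl_id), []).append(idx)
    return clusters
```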

refactor interface matrix calculation

  • using numpy built-in functions it should be possible to substantially accelerate some calculations
  • add some tests as well
  • move read_int_matrix from the clustering module to the interface_matrix module
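As an illustration of the numpy speed-up, a fully vectorised pairwise Jaccard similarity over binary interface vectors (the actual arctic3d metric may differ; this only shows the vectorisation idea):

```python
import numpy as np

def interface_matrix(binary_interfaces):
    """binary_interfaces: (n_interfaces, n_residues) 0/1 array.
    Returns the pairwise Jaccard similarity with no Python loops."""
    inter = binary_interfaces @ binary_interfaces.T   # shared residues
    sizes = binary_interfaces.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter   # broadcast union sizes
    return inter / union
```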

renumbering haddock pdbs according to uniprot?

renumbering input pdbs according to a specific uniprot numbering can be quite painful. I wrote a script for that; should we include it in the benchmark? Might it be useful in other contexts?

convert distance to similarity

since our interface matrix is not truly a distance matrix but rather a similarity matrix (the triangle inequality is partially violated), we should change any misleading reference to a distance and call it "similarity"

handle pdbrenum failed calls

given that PDBrenum is a core element of arctic3d, we should take into account possible failures on the server side. To do that, a feasible option is to create and periodically update a database of all the renumbered structures.

retrieving best pdb

Using Jesus' script, we should be able to retrieve the best pdb associated with an input uniprot ID.

convert interface names to strings

If interface names in the interface_file are numbers, they are wrongly read as integers or doubles, causing problems later on in the execution.
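With pandas this can be avoided by forcing the name column to string on read; a minimal illustration (the column names and separator are made up):

```python
import io

import pandas as pd

# force the first column (interface names) to stay strings,
# so a name like "001" is not parsed as the integer 1
text = "name residues\n001 1,2,3\n2.5 7,8\n"
df = pd.read_csv(io.StringIO(text), sep=" ", dtype={"name": str})
```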

check PDB renumbering

wherever the pdb comes from, we want to be (decently) sure it matches the numbering of the interface API.

remove spaces in interface names

the interface identifiers can contain spaces, which cause problems when reading the txt with pandas (more columns than expected).

During the interface_matrix output phase the names of the interfaces should be stripped (maybe substituting spaces with "-" when present)
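A one-line sanitiser along these lines (hypothetical helper name):

```python
def sanitize_name(name):
    """Strip an interface identifier and replace internal whitespace
    with '-' so it stays a single pandas column later on."""
    return "-".join(name.strip().split())
```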

fix unpacking error

when no valid pdbs are found:
TypeError: cannot unpack non-iterable NoneType object
deal with it by returning a double None in get_best_pdb
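The proposed fix, sketched with a stub `get_best_pdb` (the selection logic is elided; only the return convention matters here):

```python
def get_best_pdb(uniprot_id, interface_residues, candidates=()):
    """Sketch: return (pdb_file, filtered_interfaces), or a double None
    when no candidate pdb survives validation, so the caller's tuple
    unpacking never sees a bare None."""
    for pdb_f, filtered_interfaces in candidates:
        return pdb_f, filtered_interfaces
    return None, None

# caller-side handling
pdb_f, filtered_interfaces = get_best_pdb("P99999", [1, 2])
if pdb_f is None:
    print("no valid pdb found")
```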

add arctic3d_resclust cli

add a cli to do the residue-based hierarchical clustering, something like

arctic3d_resclust PDB --residue_list 1,2,3,4,27,28,29,30

that outputs N clusters of residues (with N determined by the cutoff) for further analysis.

exclude pdb information

as is currently done for uniprot IDs, it should be possible to exclude information coming from a list of pdb IDs.

An example could be
arctic3d P01112 --out_pdb=1BKD
where 1BKD is the pdb with the structure of the complex.

receptor-ligand mismatch in BM5?

While executing the benchmark on the BM5 dataset, pdb 1JTG gives rise to this output

1JTG_B:A,3GMU_B,P35804,1ZG4_A,P62593

which seems to suggest that the receptor pdb (uniprot ID) is 3GMU (P35804). This is consistent with the BM5 table but not with the BM5 haddock repo, in which the two molecules seem to be swapped (3GMU is the ligand and 1ZG4 the receptor). This happens because in the haddock dataset the bigger protein is considered the receptor.

It is not trivial to take this into account; I am creating this issue so that we don't forget the problem exists.

refactor clustering and test_clustering

  • test_write_clusters and test_write_residues should be completed
  • write_clusters should be split into two functions, one gathering cl_dict and the other writing the clusters to file

handle missing coverage/resolution in get_best_pdb

with the following uniprot IDs P20023 P01075 O60880 P48551 P69786 P0ABS8
the arctic3d call fails with the following error message:

BestPDB hit for {uniprot_id}: {pdb_id}_{chain_id} {coverage:.2f} coverage {resolution:.2f} Angstrom / start {start} end {end}"
TypeError: unsupported format string passed to NoneType.__format__
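A defensive formatting sketch that tolerates a None coverage or resolution (helper name hypothetical, message shape taken from the log line above):

```python
def format_best_pdb_hit(uniprot_id, pdb_id, chain_id,
                        coverage, resolution, start, end):
    """Format the BestPDB log line, tolerating missing (None)
    coverage/resolution values from the API."""
    cov = f"{coverage:.2f}" if coverage is not None else "unknown"
    res = f"{resolution:.2f}" if resolution is not None else "unknown"
    return (f"BestPDB hit for {uniprot_id}: {pdb_id}_{chain_id} "
            f"{cov} coverage {res} Angstrom / start {start} end {end}")
```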

add calculation of pairwise distance

currently only the calculation of the sine similarity is done. The distance is indeed useless, as it is strongly dependent on the size of the interfaces, but it could be useful when the interfaces share the same number of residues.

improve documentation

we should improve the documentation, namely expand the README file and check the autogenerated docs.

create get_interface_data

get_interface_data should be a function that, given a uniprot ID and a reference pdb, outputs the set of interfaces present in the literature

  • is the reference pdb necessary?
  • how to implement the renumbering?
  • how to exclude things in a practical way?

include probability in the output

when several interfaces are clustered together, the list of clustered residues can be quite long.

An example can be found by running
arctic3d example/1ppe_E.fasta --db db/swissprot

48 interfaces are found, and 46 of them are clustered together. They are indeed similar, although they span a big set of 41 residues:
1: [43, 45, 46, 47, 63, 65, 66, 67, 68, 89, 91, 98, 99, 100, 101, 102, 103, 104, 114, 149, 150, 152, 153, 154, 156, 157, 159, 178, 194, 195, 197, 198, 200, 215, 217, 218, 219, 220, 222, 223, 225]
Of course, not all of these residues are present in every one of the 46 interfaces. Does it make sense to include each residue's probability of being in an interface in the output, to facilitate understanding?
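Such probabilities are just per-residue frequencies over the cluster members; a sketch (hypothetical helper):

```python
from collections import Counter

def residue_probabilities(clustered_interfaces):
    """clustered_interfaces: list of residue lists, one per interface in
    the cluster. Returns {residue: fraction of interfaces containing it}."""
    counts = Counter(res for interface in clustered_interfaces
                     for res in set(interface))
    n = len(clustered_interfaces)
    return {res: counts[res] / n for res in sorted(counts)}
```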

re-filter already filtered interfaces

interfaces filtered during the pdb retrieval process should be re-filtered, in case some amino acids have been removed from the retained pdb during the pdb tidying process.

This has the disadvantages of

  1. slowing down the calculations a bit (not really an issue)
  2. keeping interfaces whose full coverage ends up a bit lower than the coverage cutoff. Let's hope this is not dramatic.

specify the pdb to be used

currently we get the best pdb for a given uniprot ID

pdb_f, filtered_interfaces = get_best_pdb(uniprot_id, interface_residues)

but it should be possible to specify it. This

  1. would allow the user to use a pdb in a certain conformation
  2. would guarantee a much faster execution (a single pdb download vs many)
