
graphein's Introduction




Documentation | Paper | Tutorials | Installation

Protein & Interactomic Graph Library

This package provides functionality for producing geometric representations of protein and RNA structures, and biological interaction networks. We provide compatibility with standard PyData formats, as well as graph objects designed for ease of use with popular deep learning libraries.

What's New?

1.7.0 FoldComp Datasets
1.7.0 Creating Datasets from the PDB
1.6.0 Protein Tensor Module
1.5.0 Protein Graph Creation from AlphaFold2!
1.5.0 RNA Graph Construction from Dotbracket notation
1.4.0 Constructing molecular graphs
1.3.0 Ready-to-go Dataloaders for PyTorch Geometric
1.2.0 Extracting subgraphs from protein graphs
1.2.0 Protein Graph Analytics
1.2.0 Graphein CLI
1.2.0 Protein Graph Visualisation!
1.1.0 Protein-Protein Interaction Network Support & Structural Interactomics (using AlphaFold2!)
1.0.0 High- and low-level API for massive flexibility - create your own bespoke workflows!

Example usage

Graphein provides both a programmatic API and a command-line interface for constructing graphs.

CLI

Graphein configs can be specified as .yaml files to batch-process graphs from the command line.

Docs

graphein -c config.yaml -p path/to/pdbs -o path/to/output

Creating a Protein Graph

Tutorial (Residue-level) Tutorial (Atomic) Docs
from graphein.protein.config import ProteinGraphConfig
from graphein.protein.graphs import construct_graph

config = ProteinGraphConfig()
g = construct_graph(config=config, pdb_code="3eiy")
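The returned object is a standard NetworkX graph, so it can be inspected with the usual NetworkX API; for example:

print(g.number_of_nodes(), g.number_of_edges())

# Peek at the attributes attached to the first few nodes
for node, data in list(g.nodes(data=True))[:3]:
    print(node, sorted(data.keys()))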

Creating a Protein Graph from the AlphaFold Protein Structure Database

Tutorial Docs
from graphein.protein.config import ProteinGraphConfig
from graphein.protein.graphs import construct_graph
from graphein.protein.utils import download_alphafold_structure

config = ProteinGraphConfig()
fp = download_alphafold_structure("Q5VSL9", aligned_score=False)
g = construct_graph(config=config, path=fp)

Creating a Protein Mesh

Tutorial Docs
from graphein.protein.config import ProteinMeshConfig
from graphein.protein.meshes import create_mesh

config = ProteinMeshConfig()
verts, faces, aux = create_mesh(pdb_code="3eiy", config=config)

Creating Molecular Graphs

Graphein can create molecular graphs from SMILES strings as well as .sdf, .mol2, and .pdb files.

Tutorial Docs
from graphein.molecule.config import MoleculeGraphConfig
from graphein.molecule.graphs import construct_graph

config = MoleculeGraphConfig()
g = construct_graph(smiles="CC(=O)OC1=CC=CC=C1C(=O)O", config=config)

Creating an RNA Graph

Tutorial Docs
from graphein.rna.graphs import construct_rna_graph
# Build the graph from a dotbracket & optional sequence
rna = construct_rna_graph(dotbracket='..(((((..(((...)))..)))))...',
                          sequence='UUGGAGUACACAACCUGUACACUCUUUC')
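As with protein graphs, the result is a standard NetworkX graph and can be inspected directly:

print(rna.number_of_nodes(), rna.number_of_edges())

# Show a few edges and their attached attributes
for u, v, d in list(rna.edges(data=True))[:3]:
    print(u, v, d)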

Creating a Protein-Protein Interaction Graph

Tutorial Docs
from graphein.ppi.config import PPIGraphConfig
from graphein.ppi.graphs import compute_ppi_graph
from graphein.ppi.edges import add_string_edges, add_biogrid_edges

config = PPIGraphConfig()
protein_list = ["CDC42", "CDK1", "KIF23", "PLK1", "RAC2", "RACGAP1", "RHOA", "RHOB"]

g = compute_ppi_graph(config=config,
                      protein_list=protein_list,
                      edge_construction_funcs=[add_string_edges, add_biogrid_edges]
                     )

Creating a Gene Regulatory Network Graph

Tutorial Docs
from functools import partial

from graphein.grn.config import GRNGraphConfig
from graphein.grn.graphs import compute_grn_graph
from graphein.grn.edges import add_regnetwork_edges, add_trrust_edges

config = GRNGraphConfig()
gene_list = ["AATF", "MYC", "USF1", "SP1", "TP53", "DUSP1"]

g = compute_grn_graph(
    gene_list=gene_list,
    edge_construction_funcs=[
        partial(add_trrust_edges, trrust_filtering_funcs=config.trrust_config.filtering_functions),
        partial(add_regnetwork_edges, regnetwork_filtering_funcs=config.regnetwork_config.filtering_functions),
    ],
)

Installation

Pip

The simplest install is via pip. N.B. this does not install the ML/DL libraries, which are required for conversion to their data formats and for generating protein structure meshes with PyTorch3D. Further details

pip install graphein # For base install
pip install graphein[extras] # For additional featurisation dependencies
pip install graphein[dev] # For dev dependencies
pip install graphein[all] # To get the lot

However, there are a number of (optional) utilities (DSSP, PyMol, GetContacts) that are not available via PyPI:

conda install -c salilab dssp # Required for computing secondary structural features
conda install -c schrodinger pymol # Required for PyMol visualisations & mesh generation

# GetContacts - used as an alternative way to compute intramolecular interactions
conda install -c conda-forge vmd-python
git clone https://github.com/getcontacts/getcontacts

# Add folder to PATH
echo "export PATH=\$PATH:`pwd`/getcontacts" >> ~/.bashrc
source ~/.bashrc
To test the installation, run:

cd getcontacts/example/5xnd
get_dynamic_contacts.py --topology 5xnd_topology.pdb \
                        --trajectory 5xnd_trajectory.dcd \
                        --itypes hb \
                        --output 5xnd_hbonds.tsv

Conda environment

The dev environment includes GPU builds (CUDA 11.1) for each of the deep learning libraries integrated into graphein.

git clone https://www.github.com/a-r-j/graphein
cd graphein
conda env create -f environment-dev.yml
pip install -e .

A lighter install can be performed with:

git clone https://www.github.com/a-r-j/graphein
cd graphein
conda env create -f environment.yml
pip install -e .

Dockerfile

We provide two docker-compose files for local CPU (docker-compose.cpu.yml) and GPU (docker-compose.yml) usage. For GPU usage, please ensure that you have the NVIDIA Container Toolkit installed. After entering the container, install the locally mounted volume (pip install -e .); this will also set up the dev environment locally.

To build (GPU) run:

docker-compose up -d --build # start the container
docker-compose down # stop the container

Citing Graphein

Please consider citing graphein if it proves useful in your work.

@inproceedings{jamasb2022graphein,
  title={Graphein - a Python Library for Geometric Deep Learning and Network Analysis on Biomolecular Structures and Interaction Networks},
  author={Arian Rokkum Jamasb and Ramon Vi{\~n}as Torn{\'e} and Eric J Ma and Yuanqi Du and Charles Harris and Kexin Huang and Dominic Hall and Pietro Lio and Tom Leon Blundell},
  booktitle={Advances in Neural Information Processing Systems},
  editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
  year={2022},
  url={https://openreview.net/forum?id=9xRZlV6GfOX}
}

graphein's People

Contributors

1511878618, a-r-j, ah-merii, amorehead, anton-bushuiev, avivko, biochunan, cch1999, chaitjo, davidfstein, dependabot-preview[bot], eltociear, ericmjl, kamurani, kexinhuang12345, kierandidi, manonreau, olivert1, pre-commit-ci[bot], rg314, ricomnl, ruibin-liu, rvinas, stevenazy, timothystiles, yuanqidu


graphein's Issues

Sequencing Annotation Functions in Higher-level API

Again in construct_rna_graph, sequencing the annotation functions might be important for users that write their own metadata annotation functions. E.g. you might want to use some node properties to annotate some of your edges and you might also want to use some edge properties to annotate your nodes. Obviously, this is less of an issue with the lower-level API, but if we don't want to provide complete control over this (could be messy) at this level we should give some thought to the sequencing that will keep most people happy.

Originally posted by a-r-j in #31 (comment)

Apply same preprocessing as graphs to downloaded PDB in DSSP calculation

In the method Protein/features/nodes/dssp/add_dssp_df(), Biopython's DSSP calculation is invoked on the downloaded, unprocessed PDB. The resulting DSSP dataframe sometimes has a different number of residues than the protein graph generated, as in #98. I believe the same preprocessing steps that are performed in Protein/graph.py are needed in Protein/features/nodes/dssp/add_dssp_df().

E.g. PDB: 1utm, 2qrh

Thank you!

RNA Graphs Functionality

Todos:

  • We currently don't support pseudoknots: non-standard dotbracket symbols are converted to .

  • We could implement some checking of the validity of base pairing interactions

Originally posted by a-r-j in #31 (comment)

Problem in add_atomic_edges

from graphein.protein.config import ProteinGraphConfig
from graphein.protein.edges.atomic import add_atomic_edges

params_to_change = {"granularity": "atom", "edge_construction_functions": [add_atomic_edges]}

config = ProteinGraphConfig(**params_to_change)
config.dict()

from graphein.protein.graphs import construct_graph

g = construct_graph(config=config, pdb_code="3eiy")

When I execute the above code, an error occurs. The error is given below. How can I get rid of this type of error?

1 from graphein.protein.graphs import construct_graph
      2 
----> 3 g = construct_graph(config=config, pdb_code="3eiy")
      4 # To use a local file, you can do:
      5 # g = construct_graph(config=config, pdb_path="../examples/pdbs/3eiy.pdb")

/srv/conda/envs/notebook/lib/python3.8/site-packages/graphein/protein/graphs.py in construct_graph(config, pdb_path, pdb_code, chain_selection, df_processing_funcs, edge_construction_funcs, edge_annotation_funcs, node_annotation_funcs, graph_annotation_funcs)
    611 
    612     # Compute graph edges
--> 613     g = compute_edges(
    614         g,
    615         funcs=config.edge_construction_functions,

/srv/conda/envs/notebook/lib/python3.8/site-packages/graphein/protein/graphs.py in compute_edges(G, funcs, get_contacts_config)
    505 
    506     for func in funcs:
--> 507         func(G)
    508 
    509     return G

/srv/conda/envs/notebook/lib/python3.8/site-packages/graphein/protein/edges/atomic.py in add_atomic_edges(G)
     86     """
     87     TOLERANCE = 0.56  # 0.4 0.45, 0.56 This is the distance tolerance
---> 88     dist_mat = compute_distmat(G.graph["pdb_df"])
     89 
     90     # We assign bond states to the dataframe, and then map these to covalent radii

/srv/conda/envs/notebook/lib/python3.8/site-packages/graphein/protein/edges/distance.py in compute_distmat(pdb_df)
     55     )
     56     eucl_dists = pd.DataFrame(squareform(eucl_dists))
---> 57     eucl_dists.index = pdb_df.index
     58     eucl_dists.columns = pdb_df.index
     59 

/srv/conda/envs/notebook/lib/python3.8/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   5498         try:
   5499             object.__getattribute__(self, name)
-> 5500             return object.__setattr__(self, name, value)
   5501         except AttributeError:
   5502             pass

/srv/conda/envs/notebook/lib/python3.8/site-packages/pandas/_libs/properties.pyx in pandas._libs.properties.AxisProperty.__set__()

/srv/conda/envs/notebook/lib/python3.8/site-packages/pandas/core/generic.py in _set_axis(self, axis, labels)
    764     def _set_axis(self, axis: int, labels: Index) -> None:
    765         labels = ensure_index(labels)
--> 766         self._mgr.set_axis(axis, labels)
    767         self._clear_item_cache()
    768 

/srv/conda/envs/notebook/lib/python3.8/site-packages/pandas/core/internals/managers.py in set_axis(self, axis, new_labels)
    214     def set_axis(self, axis: int, new_labels: Index) -> None:
    215         # Caller is responsible for ensuring we have an Index object.
--> 216         self._validate_set_axis(axis, new_labels)
    217         self.axes[axis] = new_labels
    218 

/srv/conda/envs/notebook/lib/python3.8/site-packages/pandas/core/internals/base.py in _validate_set_axis(self, axis, new_labels)
     55 
     56         elif new_len != old_len:
---> 57             raise ValueError(
     58                 f"Length mismatch: Expected axis has {old_len} elements, new "
     59                 f"values have {new_len} elements"

ValueError: Length mismatch: Expected axis has 1 elements, new values have 0 elements  

bug: 'continue' not properly in loop

In /graphein/protein/graphs.py, line 271, the continue statement is used outside of a for/while loop.
Solution: replace the continue statement with a 'pass' statement to keep the functionality as needed.

Saving in gexf format

An error occurs when saving a protein atomic graph in GEXF format.

attribute value type is not allowed: <class 'numpy.ndarray'>

How can I overcome this problem?

Still unable to use `add_aromatic_interactions`

Hi, I'm sorry for reporting several situations at the same time. I ran into the same problem when using add_aromatic_interactions:

from graphein.protein.edges.distance import (add_peptide_bonds,
                                             add_hydrogen_bond_interactions,
                                             add_disulfide_interactions,
                                             add_ionic_interactions,
                                             add_aromatic_interactions,
                                             add_aromatic_sulphur_interactions,
                                             add_cation_pi_interactions
                                            )

new_edge_funcs = {"edge_construction_functions": [add_peptide_bonds,
                                                  add_hydrogen_bond_interactions,
                                                  add_disulfide_interactions,
                                                  add_ionic_interactions,
                                                  add_aromatic_interactions,
                                                  add_aromatic_sulphur_interactions,
                                                  add_cation_pi_interactions
                                                  ]
                 }

config = ProteinGraphConfig(**new_edge_funcs)
g = construct_graph(config=config, pdb_code="3ED8")
p = plot_protein_structure_graph(G=g, angle=0, colour_edges_by="kind", colour_nodes_by="seq_position", label_node_ids=False)
plt.suptitle("Protein graph with: peptide backbone, H-Bonds, \n Disulphide, ionic, aromatic, aromatic-sulphur and cation-pi interactions. \n  Nodes coloured by sequence position, edges by type")

returned

DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 1134 total nodes
INFO:graphein.protein.edges.distance:Found 372 hbond interactions.
INFO:graphein.protein.edges.distance:Found 25 hbond interactions.
INFO:graphein.protein.edges.distance:Found 10 disulfide interactions.
INFO:graphein.protein.edges.distance:Found 983 ionic interactions.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-183722266f5a> in <module>()
     19 
     20 config = ProteinGraphConfig(**new_edge_funcs)
---> 21 g = construct_graph(config=config, pdb_code="3ED8")
     22 p = plot_protein_structure_graph(G=g, angle=0, colour_edges_by="kind", colour_nodes_by="seq_position", label_node_ids=False)
     23 plt.suptitle("Protein graph with: peptide backbone, H-Bonds, \n Disulphide, ionic, aromatic, aromatic-sulphur and cation-pi interactions. \n  Nodes coloured by sequence position, edges by type")

4 frames
/usr/local/lib/python3.7/dist-packages/graphein/protein/utils.py in filter_dataframe(dataframe, by_column, list_of_values, boolean)
     81     :rtype: pd.DataFrame
     82     """
---> 83     df = dataframe.copy()
     84     df = df[df[by_column].isin(list_of_values) == boolean]
     85     df.reset_index(inplace=True, drop=True)

AttributeError: 'NoneType' object has no attribute 'copy'

And I checked my pandas version:

import pandas as pd
print(pd.__version__)
1.3.5

May I know if it is a problem with the pandas version?

Originally posted by @johnnytam100 in #81 (comment)

Architectural issues and remediation paths

As I go through the codebase to write tests, I'm noticing a few architectural issues that stand in the way of the package becoming a PyData-friendly package. I'm going to document the issues/anti-patterns I found inside the codebase here, and propose solutions.

Issue 1: Implicit dependencies

Found propy, biovec, ipymol, pytorch3d as imports inside functions.

This is a problem because, when the functions get called and the dependency isn't explicitly stated as part of environment.yml or requirements.txt, the code will unexpectedly error out for a user who does not have those packages installed.

One possible resolution is to add those packages as explicit dependencies. (Explicit is better than implicit, according to the Zen of Python.)

Issue 2: Highly nested codebase

To access protein amino acid featurization, one has to import graphein.protein.features.nodes.amino_acids. That's potentially quite a lot of boilerplate for the end-user.

The easiest resolution path is to import those submodules, or even explicit functions, into a higher-level namespace. For example, we might want to follow NetworkX's and PyMC3's pattern, which imports all of the important items into the top-level namespace. Or we might want to follow a pattern where things related to protein are imported into the protein namespace.
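For illustration, the flattening could be as simple as re-exporting submodule contents in the package's __init__.py. A sketch of the proposal (not the package's current layout):

# graphein/protein/__init__.py -- sketch of the proposed re-export pattern
from graphein.protein.features.nodes.amino_acids import *  # noqa: F401,F403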

Port DSSP Features to new API

This will take a little thought to do elegantly as the features are computed from sequence but assigned to nodes.

Please fix construct_graph.py arguments

Hi there,
Thank you for the very cool and useful package! I think on line 328 in construct_graph.py, you forgot to include chain_selection in your dssp = self._get_protein_features() call.
It should be:
dssp = self._get_protein_features(file_path=file_path, pdb_code=None, chain_selection=chain_selection)

Getting 61 descriptors for amino acid residues from ExPaSY ProtScale

@a-r-j I'm trying to get node features for a residue-level graph. Following is my code:

new_prot_nodes = {"node_metadata_functions":[meiler_embedding,expasy_protein_scale]}
config_full = ProteinGraphConfig(**new_prot_nodes,**new_edge_funcs)
print (config_full.dict())
g = construct_graph(config=config_full, pdb_path=df.path[0])

How do I access these features?

Installation

I tried installing your package on my Mac laptop as well as a Linux machine with CUDA. It works on the laptop, though I remember not installing everything according to your instructions, such as tk=8.5 (it's not available).

On the CUDA machine, it simply did not work, with the following error [link]. I installed PyTorch and torchvision for CUDA 10.1, torch 1.6. Everything else, including getcontacts and vmd-python, was installed as well. Do you have any idea for fixing this?

I also followed the instructions on your official website as well as the GitHub one.
The official website seems to not have the most up-to-date instructions.
There's a minor error in the installation of graphein that I think should be fixed.
On "Install Graphein" at the bottom of the official website, there's a typo in the GitHub link:
git clone https://github.com/a-r-j/grahein
It should be:
git clone https://github.com/a-r-j/graphein

Thank you!

Installation problem

There is a problem with installing graphein in Google Colab. How can I convert a protein sequence to its corresponding protein graph?

Gene Regulatory Network Support

We can prioritise:

These provide .csv files for download. We can parse these into graphs easily.

GRNdb also looks good (http://www.grndb.com) but seems tricky.

Misc Sources:

Contacts

Hi Arian,

I'm just wondering if there's a way to create a protein graph from dgl_graph_from_pdb_code() without needing the contact file generated by getContacts?

I have tried multiple ways to install getContacts and run it. However, I couldn't install vmd-python and run its module. vmd-python's Conda installation was successful, but I kept having a "module named 'vmd' not found" problem. And installing from source using vmd-python's GitHub repo yielded the problem "RuntimeError: Could not find include file 'netcdf.h' in standard include directories. Update $INCLUDE to include the directory containing this file, or make sure it is present on your system".

If Graphein depends on getContacts and vmd-python to generate an all-atom graph, anything wrong with those two dependencies will cause problems. I also have difficulty installing tk version 8.5 because it's no longer available as an Anaconda package.

If you know a better way to install and run getContacts (and vmd-python), please let me know! Also, if you know how to create a protein graph without a contact file, please also let me know. Thank you Arian!

ImportError on Pytorch3d

Hi Arian,

I followed the installation instructions in the README, and when I try to import the package (from pytorch3d.io import load_obj, save_obj) I ran into the following error:

ImportError: dlopen(/Users/jason.shi/miniconda3/envs/graphein/lib/python3.7/site-packages/pytorch3d/_C.cpython-37m-darwin.so, 2): Symbol not found: __ZN2at23getLegacyDeviceTypeInitEv
  Referenced from: /Users/jason.shi/miniconda3/envs/graphein/lib/python3.7/site-packages/pytorch3d/_C.cpython-37m-darwin.so
  Expected in: /Users/jason.shi/miniconda3/envs/graphein/lib/python3.7/site-packages/torch/lib/libtorch_cpu.dylib
 in /Users/jason.shi/miniconda3/envs/graphein/lib/python3.7/site-packages/pytorch3d/_C.cpython-37m-darwin.so

I'm using torch 1.6.0 and installed on Mac laptop. Here's what I installed for torch geometric.

pip install torch-scatter==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-sparse==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-cluster==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-spline-conv==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-geometric

Would appreciate if you can point me to the right direction. Thanks!

Sequence-based Edge function

Describe the solution you'd like
Currently, we add sequence-based edges in the form of graphein.protein.edges.distance.add_peptide_bonds, which joins adjacent residues with an edge (e.g. separation = 0). It would be useful to generalise this function to add edges at a specified interval, e.g.:

g = construct_graph(pdb_code="3eiy")
g = add_peptide_bonds(g, step = 1)

Would add edges between nodes:

1 - 3
2 - 4
3 - 5
4 - 6
5 - 7
...

and

g = construct_graph(pdb_code="3eiy")
g = add_peptide_bonds(g, step = 5)

Would add edges between nodes:

1 - 6
2 - 7
3 - 8
4 - 9
5 - 10
...
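A minimal sketch of such a generalised function, interpreting step as the sequence offset between joined nodes (the examples above are ambiguous between step and step + 1); add_sequence_edges is a proposal, not an existing graphein API:

import networkx as nx

def add_sequence_edges(g: nx.Graph, step: int = 1) -> nx.Graph:
    # Assumes g.nodes() iterates in sequence order, as in graphein's
    # residue graphs; the "kind" set follows graphein's edge convention.
    nodes = list(g.nodes())
    for i in range(len(nodes) - step):
        g.add_edge(nodes[i], nodes[i + step], kind={"sequence_edge"})
    return g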

Graphein Design Doc

Hey @a-r-j, just pinging back here as promised yesterday with what probably could be considered a design doc for Graphein. I'm excited about this! I also know that my memory is gradually degrading as I get older, so please correct me if I'm wrong anywhere, and don't hesitate to directly edit my post.


Authors

  • Arian Jamasb
  • Eric J. Ma

Introduction

Graphein is an awesome idea: basically, a package that flexibly handles the construction of graph representations of biological data, with the goal of democratizing access of biological graph data to all (machine learners and biologists alike).

In its current state, the package might be too opinionated on what backing packages are used, which affects the portability of the data structures. We would like to propose an improved software design that leverages the ecosystem of the Python data science stack.

Graph Preliminaries

Graphs can be represented in both object form and array form, and this gives us different things.

In the object form, we have node lists and metadata, as well as edge lists and metadata. NetworkX graph objects are one example of this. The object form is rich and flexible - it can store information of arbitrary depth. It can store arbitrary key-value paired metadata. Any hashable object can be a node, as long as the hashable object shows up only once in the data structure. Metadata values can be any Python object. This level of flexibility allows us to construct very human-readable graphs in Python.

In the array form, we have node feature arrays and adjacency matrix-like matrices. This pair of objects is more restricted. For downstream deep learning purposes, the data types of these arrays have to be numeric in nature. Nodes aren't easily indexed by a string name; integer indexing is required. Human auditability of the internal data structure is hampered by the fact that the data are represented in numbers. The metadata are also not easily indexed by string names. That said, the array form is structured and efficient for computation.

In terms of a principled data preparation flow:

  1. We ought to make it as easy as possible for users to inspect their data early on
  2. The raw-est, most "intrinsic" form of data ought to be established and fixed as early on in the data flow
  3. The derivative metadata (which can be calculated or looked up based on the intrinsic metadata) should come in a separate step afterwards.

Design goals and implications

  • Goal 1: Maximal compatibility with the rest of the Python data science stack.
  • Goal 2: Easy human auditability for the graphs.

Implication: Target NetworkX graphs, Pandas DataFrames, XArray DataArrays before dispatching/converting to specialized data structures like DGL graphs. This is in line with democratization, as these are the idiomatic and easily accessible data structures of the PyData community.

  • Goal 3: Follow principled data preparation flow

Implication: Cleanly separate data loading from calculation of derivative data (e.g. things that can be looked up in a lookup table, or other simple/complex calculations), and avoid mixing them up in the same step.

  • Goal 4: A simple installation protocol that is conda install graphein

Implication: Simplify the installation instructions to follow PyData community idioms as much as possible.

  • Goal 5: Enable flexibility in programming model to new categories of edges, create adjacency-like matrices of different kinds, and compute new ways of featurizing nodes
  • Goal 6: Allow easy and opinionated configuration (like black) to construct these graphs.

Implication: APIs that work at different levels are needed. A high-level API for ease-of-use that targets an opinionated subset of the range of possible edge types, adjacency-like matrices, and node features, leveraging a lower-level API that power users can drop down to for customization of how they want to construct a graph for their biological data. This will prevent Graphein from being a god-like package that makes things very easy for end-users, but hamstrings power users from customizing things.

Lower Level API

The key categories of steps that we probably need to be concerned about here are as follows:

  • 1. Reading PDB files into dataframes - handled by BioPandas
  • 2. Inserting nodes into a NetworkX graph object from PDB DataFrames, while also annotating their intrinsic properties -(e.g. amino acid identity, xyz coordinates).
  • 3. Annotating additional, external but nonetheless "intrinsic" node metadata into a NetworkX graph object based on their intrinsic properties, based on a collection of rules encoded as Python functions.
  • 4. Inserting edges into the same NetworkX graph object based on a collection of rules encoded as Python functions, while also annotating their intrinsic properties (e.g. "edge type" -- hydrogen bond? disulfide bridges? etc.)
  • 5. Conversion of node data into node feature dataframes, using a collection of functions to perform the processing.
  • 6. Conversion of edges into adacency-like XArrays, using a collection of functions to perform the processing of data.
  • 7. Optional conversion into raw NumPy arrays for both of the above.

Together, these probably form the low-level API, on which the high-level convenience API can be built, which essentially provides a mapping from keyword arguments to function call dispatches underneath the hood.

In a bit more detail, I'd like to propose additional structure for the steps.

Construction of NetworkX graphs from PDB DataFrames

This should be opinionated, and only have two flags: whether to use an "atom" graph, or whether to use a "biological unit" graph (amino acids for proteins, nucleotides for RNA/DNA).

Annotating node metadata

The pattern would be a dispatching function, possibly named annotate_node_metadata(G, funcs). funcs is a list of functions that gets looped over and called on for each node:

def annotate_node_metadata(G, funcs):
    for func in funcs:
        for n in G.nodes():
            func(G, n)
    return G

Each func modifies G in-place, and has a uniform signature of (G, n):

def annotate_some_metadata(G, n):
    G.nodes[n]["metadata_field"] = some_value
    return G
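As a usage sketch (the "residue_name" attribute and "is_glycine" field are illustrative):

def annotate_is_glycine(G, n):
    # Hypothetical metadata function flagging glycine residues;
    # assumes nodes carry a "residue_name" attribute.
    G.nodes[n]["is_glycine"] = G.nodes[n].get("residue_name") == "GLY"
    return G

G = annotate_node_metadata(G, funcs=[annotate_is_glycine])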

Inserting edges

The edges that are represented in a graph also form a "hyperparameter" of sorts, i.e. a thing that needs to be configured. Inserting edges of different types will probably be based off custom calculations from the PDB dataframe. However, because this might open an area of research, i.e. what types of edges should be present in a graph for the message passing step, the edge types inserted into a graph can be specified in Python code using custom functions that have a common signature:

def add_some_edge_type(G):
    # custom logic that leverages NetworkX's
    for element in some_iterator:
        kwargs = ... # some custom logic
        G.add_edge(u, v, **kwargs)
    return G

Now, we can have a parent function that loops over a bunch of add_some_edge_type(G)-style functions. In a functional programming paradigm, this can be partialled to some defaults for a bunch of default configuration settings:

def compute_edges(G, funcs):
    for func in funcs:
        func(G)
    return G

If the custom function relies on external data, that can be partialled out:

from functools import partial

def add_some_edge_type(G, some_external_data):
    # stuff
    return G

funcs = [
    partial(add_some_edge_type, some_external_data=some_external_data),
]

This takes care of making the function signature uniformly compatible with the `compute_edges` func above!

Construction of node feature matrix

Now, we explore how we can construct the node feature dataframe.

import pandas as pd

def generate_feature_dataframe(G, funcs, return_array=False):
    matrix = []
    for n, d in G.nodes(data=True):
        series = []
        for func in funcs:
            res = func(n, d)
            if res.name != n:
                raise NameError(
                    f"function {func.__name__} returns a series "
                    "that is not named after the node."
                )
            series.append(res)
        matrix.append(pd.concat(series))

    df = pd.DataFrame(matrix)
    if return_array:
        return df.values
    return df

In this design, we again ask that functions have a uniform signature and return a pandas Series named after the node. Then, construction of a DataFrame can happen. The return values in the Series should be numeric (float or int) and not null. This guarantees that the DataFrame contains only numeric values, so we can freely convert between the dataframe form and the array form.
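A conforming feature function might look like this sketch (the molecular-weight lookup is illustrative and truncated):

import pandas as pd

MOLECULAR_WEIGHTS = {"GLY": 57.05, "ALA": 71.08}  # illustrative, truncated

def molecular_weight(n, d) -> pd.Series:
    # Returns a numeric Series named after the node, as
    # generate_feature_dataframe requires.
    mw = MOLECULAR_WEIGHTS.get(d.get("residue_name"), 0.0)
    return pd.Series({"molecular_weight": mw}, name=n)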

Construction of adjacency-like matrices

Now, we explore how we can construct the adjacency-like matrices.

It's possible to construct a wide range of adjacency matrices on which message passing can be performed. For example, one might want to do message passing on the 1-degree adjacency matrix, or do it on a 2-degree adjacency matrix where anything within two degrees of separation is given a "1". One may want to leverage the graph Laplacian matrix too. As such, more than just a matrix, we might want to generate an adjacency tensor. A proposed function we could include in the library, or propose to the NetworkX library, is as follows:

from typing import Callable, List

import networkx as nx
import xarray as xr

def generate_adjacency_tensor(
    G: nx.Graph, funcs: List[Callable], return_array=False
) -> xr.DataArray:
    mats = []
    for func in funcs:
        mats.append(func(G))
    da = xr.concat(mats, dim="name")
    if return_array:
        return da.data
    return da

In doing this, we generate an XArray 3D tensor of adjacency-like matrices. A prerequisite is that each adjacency matrix generated from func should be an XArray DataArray with one tensor dimension being called "name", which houses the name of the adjacency-like matrix. This conversion can be made easier by providing one more function:

import numpy as np

def format_adjacency(G: nx.Graph, adj: np.ndarray, name: str) -> xr.DataArray:
    expected_shape = (len(G), len(G))
    if adj.shape != expected_shape:
        raise ValueError(
            "Adjacency matrix is not shaped correctly, "
            f"should be of shape {expected_shape}, "
            f"instead got shape {adj.shape}."
        )
    adj = np.expand_dims(adj, axis=-1)
    nodes = list(G.nodes())
    return xr.DataArray(
        adj,
        dims=["n1", "n2", "name"],
        coords={"n1": nodes, "n2": nodes, "name": [name]},
    )

Now, an end-user can calculate a NumPy array any way they like, then name it easily and stick it into a uniform data container, such as an XArray DataArray, that allows them to inspect:

  1. Whether the adjacency-like entries between any pair of nodes is correct
  2. Whether the adjacency-like matrix looks correct or not
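Putting the two together, a usage sketch (the wrapper functions are illustrative; generate_adjacency_tensor and format_adjacency are the functions proposed above):

import networkx as nx

def adjacency_1(G):
    # 1-degree adjacency, wrapped into a named DataArray
    return format_adjacency(G, nx.to_numpy_array(G), name="adjacency-1")

def laplacian(G):
    return format_adjacency(G, nx.laplacian_matrix(G).toarray().astype(float), name="laplacian")

da = generate_adjacency_tensor(G, funcs=[adjacency_1, laplacian])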

A brief pause...

We have spent a lot of time not on the fancy deep learning piece, for a very good reason - because graphs are such a flexible data structure, we have to have patterns in place to help build compatibility between different graph types. By targeting common data structures, like NetworkX graph objects, NumPy arrays, Pandas DataFrames and XArray DataArrays, we ensure native compatibility with a very broad swathe of data science computing environments. This fulfills the goal of democratization; use of Pandas and XArray ensures that we have the ability to easily inspect the array forms of the data before we use it for any deep learning purposes. And by not reinventing the wheel, we lessen the maintenance burden on ourselves.

The High Level API

We'll now go into what the high-level API could look like, with a dummy configuration that looks something like this:

graph_config = GrapheinConfig(
    graph_type="amino_acid",  # or "atom"
    edges=["adjacency-1", "adjacency-2", "laplacian"],  # and more for defaults
    node_features=["molecular_weight", "pKa", "pI"],  # and more for defaults
    # And possibly more here!
)

F, A = graphein.read_pdb("/path/to/file.pdb", config=graph_config)

This is what we think the high-level API might look like. It's simple (users only have to remember read_pdb and GrapheinConfig), comes with a lot of sane defaults in the Config object, and dispatches to the right routine.

Lessons learned from reimplementing UniRep tell me (EM) that having users manually write code to preprocess data into tensor form introduces the possibility that they could unexpectedly generate an adversarial-like example that has all the right tensor properties (shape, values, no nulls) but is semantically incorrect because the values are preprocessed incorrectly. Hence, a high-level API that accepts the idiomatic data structure (in this case a PDB file, just as a string is the idiomatic data structure for a protein sequence) and also handles any preprocessing correctly is going to be very enabling for reproducibility purposes.

Pip Installation Trouble

It looks like there is some issue with how graphein's setup.py interacts with temporary files.

When I run pip install graphein, I get an error:

$> pip install graphein
Collecting graphein==1.0.6
  Using cached graphein-1.0.6.tar.gz (102 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  ERROR: Command errored out with exit status 1:
   command: /home/cyeh/miniconda3/envs/mlprot/bin/python3.8 /home/cyeh/miniconda3/envs/mlprot/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_pr
ocess.py get_requires_for_build_wheel /tmp/tmpvrrujszc
       cwd: /tmp/pip-install-sh6rvu10/graphein_fa5f9d34ad2542e798361f4d27a1b533
  Complete output (24 lines):
  Traceback (most recent call last):
    File "/home/cyeh/miniconda3/envs/mlprot/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
      main()
    File "/home/cyeh/miniconda3/envs/mlprot/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/home/cyeh/miniconda3/envs/mlprot/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_whe
el
      return hook(config_settings)
    File "/tmp/pip-build-env-zrmcgwd7/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 162, in get_requires_for_build_wheel
      return self._get_build_requires(
    File "/tmp/pip-build-env-zrmcgwd7/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 143, in _get_build_requires
      self.run_setup()
    File "/tmp/pip-build-env-zrmcgwd7/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 267, in run_setup
      super(_BuildMetaLegacyBackend,
    File "/tmp/pip-build-env-zrmcgwd7/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 158, in run_setup
      exec(compile(code, __file__, 'exec'), locals())
    File "setup.py", line 83, in <module>
      INSTALL_REQUIRES = read_requirements(".requirements/base.in")
    File "setup.py", line 59, in read_requirements
      for line in read(*parts).splitlines():
    File "setup.py", line 44, in read
      return codecs.open(os.path.join(HERE, *parts), "r").read()
    File "/home/cyeh/miniconda3/envs/mlprot/lib/python3.8/codecs.py", line 905, in open
      file = builtins.open(filename, mode, buffering)
  FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-sh6rvu10/graphein_fa5f9d34ad2542e798361f4d27a1b533/.requirements/base.in'
  ----------------------------------------

For context, I'm running pip install graphein inside a fresh conda environment create from the following environment (.yml) file:

name: mlprot
channels:
- pyg
- pytorch3d
- pytorch
- conda-forge
- anaconda
- bioconda
- plotly
- schrodinger
- salilab
dependencies:
- python=3.8
- biopython
- biopandas
- click
- cudatoolkit=10.2
- dssp
- flake8
- ipympl
- jupyterlab
- matplotlib
- multipledispatch
- mypy
- networkx
- nodejs
- numpy
- pandas
- pip
- plotly
- pydantic
- pyg
- pymol
- pyyaml
- pytorch
- pytorch3d
- scikit-learn
- scipy
- torchvision
- tqdm
- versioneer
- python-wget
- xarray
- pip:
  - bioservices
  - biovec
  - propy3
  - pyaaisc

Simplify environment.yml to minimal set of packages required

Issue

Currently environment.yml and ubuntu_environment.yml are duplicates of one another, and could be simplified into a single file that uniquely specifies environment dependencies.

Proposal

I'd like to propose unifying the two environment spec files into an environment-dev.yml file, which only specifies the minimal set of packages necessary for development/hacking on graphein. Deployment onto PyPI and conda forge later can allow for installation with dependencies automatically pulled in. This would be in tandem with the removal of the current *environment.yml files.

A proposed starter environment spec could be:

name: graphein-dev
channels:
- conda-forge
dependencies:
- python=3.8
- biopandas
- pandas
- networkx
- numpy

Conversion to DGL/PyTorch/JAX can be enabled once the library of code has been rewritten.

Pre-processing code

Hey @ericmjl, yep we’re getting there 😁 ! Docs are on the horizon.

The plan for tomorrow is:

  1. Docs coverage (as close 100% as sanity allows)
  2. Feature Pre-processing

Re: pre-processing, I was thinking of a setup where users can pass a few standard functions (one-hot, mean, etc.) to a dictionary/config, as well as fitted sklearn scalers for normalising across graphs in a dataset (with some helpful functions for creating them from a list of graphs) or unfitted sklearn scalers (for single-graph normalisation). A config object seems nice and consistent, but a dictionary might be better for users who create their own features, as a config object would be a bit inflexible there.

Would be super keen to hear any thoughts/suggestions on this.

pre_processing_dict = {
    "molecular_weight": StandardScaler,
    "secondary_structure": partial(one_hot, vocab=SS_ELEMENTS),
}

G = process_graph(G, pre_processing_dict)

After this, I think the only outstanding task is a conversion toolkit to support various frameworks. Then it’s cleaning up & polishing before I think a V2.0 release is in order :D
EDIT: and tests! Will crack on with them this week.

Originally posted by @a-r-j in #45 (comment)

bug: add_aromatic_interactions get None DataFrame

In graphein/protein/edges/distance.py, in the function add_aromatic_interactions, the pdb_df argument is None by default. The following lines should probably be added (at line 261):

    if pdb_df is None:
        pdb_df = G.graph["pdb_df"]

The setup with Conda doesn't work as described

I just want to start by saying that I really like your work / the idea behind this project.

But when I tried to install it with Conda, as suggested by your README, I noticed some problems.

  1. In the README you say that you can install Graphein with the following commands, but "conda create env" is not a valid Conda command; it should be "conda env create".
git clone https://www.github.com/a-r-j/graphein
cd graphein
conda create env -f environment-dev.yml
pip install -e .
  2. When you do this, using the latest miniconda3 Linux release, Conda reports conflicts and the installation fails:
$ conda env create -f environment-dev.yml
Collecting package metadata (repodata.json): done
Solving environment: \ 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
<...a long list of conflicts...>

RNA graph construction and KNN representation

I’ve just started looking at RNA graph construction. Ideally, I’d like to generate a KNN representation of the RNA. This function is currently implemented for proteins by using the graphein.protein.edges.distance.add_k_nn_edges function. In short, the edges for the KNN method are added by:

  1. compute distance matrix
    a. To compute the distance matrix we need to know the x,y,z position of each basepair (BP) of RNA
  2. Compute N nearest neighbours using (sklearn.neighbors.kneighbors_graph)
  3. Join interacting nodes calculated form 2.
  4. Return graph

At the moment the x, y, z coords for protein structures are obtained from a PDB file. This is currently not built for RNA structures. For an RNA sequence, we must use the sequence and/or dot-bracket notation to get the 3D structural information.

If the dot-bracket notation is not provided, it can be calculated using the Nussinov algorithm (a DP approach; see https://github.com/cgoliver/Nussinov/blob/master/nussinov.py for a Python implementation). See the implementation at https://github.com/rg314/graphein/blob/35bd2297d28bf09bcf0fb98c10c3866d4be6cb83/graphein/rna/nussinov.py

Note that the Nussinov algorithm does not guarantee that the dot-bracket notation is correct; there are several other ways of computing it.
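For reference, a minimal sketch of Nussinov base-pair maximisation (no pseudoknots), returning dot-bracket notation; the pairing rules and minimum loop length here are simplifying assumptions, not graphein's implementation:

def nussinov(seq: str, min_loop: int = 3) -> str:
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    # dp[i][j] = max number of base pairs in seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: j unpaired
            for k in range(i, j - min_loop):  # case: j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    structure = ["."] * n

    def traceback(i: int, j: int) -> None:
        if j - i <= min_loop:
            return
        if dp[i][j] == dp[i][j - 1]:
            traceback(i, j - 1)
            return
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in pairs:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        traceback(i, k - 1)
                    traceback(k + 1, j - 1)
                    return

    traceback(0, n - 1)
    return "".join(structure)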

The PDB database contains some RNA structures (~5233). PandasPdb can be used to directly read in the PDB file. I suggest that the current protein config is adapted for RNA, to read in the RNA structure from a PDB file. @a-r-j what do you think? I have started to implement this; please see https://github.com/rg314/graphein/blob/35bd2297d28bf09bcf0fb98c10c3866d4be6cb83/graphein/rna/graphs.py#L209 (note: reading in the df is currently failing).

Then we can look at alternative sources for reading in the structure.

For example, it appears that the Xiao lab (http://biophy.hust.edu.cn/new/) has a RESTful API to return RNA structure. However, I have not investigated in detail whether it returns the correct 3D data. This could somewhat mimic the behaviour of graphein.protein.utils.download_alphafold_structure.

Does anyone have an idea of other databases that could be used?

I’m also open to creating a server that can be contacted with a RESTful API to predict RNA structure. However, we would need to figure out the best implementation for structure prediction (and make sure it doesn’t take too long 😉).

Atom information

How can I obtain the atoms present in an atomic graph of a protein?

Install problem

Dear maintainers,
I installed all the packages and ran the following code:

from graphein.construct_graphs import ProteinGraph
from vmd import *

pg = ProteinGraph(granularity='CA', insertions=False, keep_hets=True,
                  node_featuriser='meiler', get_contacts_path='./getcontacts',
                  pdb_dir='examples/pdbs/',
                  contacts_dir='examples/contacts/',
                  exclude_waters=True, covalent_bonds=False, include_ss=True)

graph = pg.dgl_graph_from_pdb_code('3eiy', chain_selection='all')

graph = pg.dgl_graph_from_pdb_file(file_path='examples/pdbs/pdb3eiy.pdb', contact_file='examples/contacts/3eiy_contacts.tsv', chain_selection='all')

graph = pg._make_atom_graph(pdb_code='3eiy', graph_type='bigraph')

and I got these errors:

Traceback (most recent call last):
File "./getcontacts/get_static_contacts.py", line 93, in
main()
File "./getcontacts/get_static_contacts.py", line 80, in main
ligand, solv, lipid, sele1, sele2)
File "/home/jojo/PycharmProjects/PPI/getcontacts/contact_calc/compute_contacts.py", line 274, in compute_contacts
output_fd = open(output, "w")
FileNotFoundError: [Errno 2] No such file or directory: 'examples/contacts/3eiy_contacts.tsv'

It seems it couldn't compute the contacts.
Could you please help me fix it?

Upgrade compatibility to DGL 0.5.0+

Currently, Graphein is built with DGL 0.4.3

This throws an error:

File "d:\desktop2.0\prot\graphein\graphein\construct_graphs.py", line 23, in <module> from dgllife.utils import mol_to_bigraph, mol_to_complete_graph, mol_to_nearest_neighbor_graph File "D:\Anaconda\envs\myenv\lib\site-packages\dgllife\__init__.py", line 9, in <module> from . import model File "D:\Anaconda\envs\myenv\lib\site-packages\dgllife\model\__init__.py", line 8, in <module> from .model_zoo import * File "D:\Anaconda\envs\myenv\lib\site-packages\dgllife\model\model_zoo\__init__.py", line 33, in <module> from .acnn import * File "D:\Anaconda\envs\myenv\lib\site-packages\dgllife\model\model_zoo\acnn.py", line 14, in <module> from dgl import BatchedDGLHeteroGraph ImportError: cannot import name 'BatchedDGLHeteroGraph' from 'dgl' (D:\Anaconda\envs\myenv\lib\site-packages\dgl\__init__.py)

See: dmlc/dgl#2104

This has also been mentioned in #16

Refactor granularity kwarg in `protein.process_dataframe`

I think you're right about granularity - it's doing too much. The idea is to have control over:

  1. What the nodes are in a residue graph (e.g. α-carbon "CA", β-carbon "CB")
  2. Whether to use that atom position or compute residue centroids
  3. Whether or not to build a residue-graph as above or an atom-graph

Originally posted by @a-r-j in #27 (comment)

Refactoring out this little piece of the function will probably help with longer-term maintenance.

Number of nodes and shape of node coordinates differ by 1

Hi @a-r-j

I've been observing that sometimes the number of nodes generated by the library differs from the coordinate data generated by it by exactly 1. Do you know why this happens? Following is an example:

configs = {
    "granularity": "CA",
    "keep_hets": False,
    "insertions": False,
    "verbose": False,
    "dssp_config": DSSPConfig(),
    "node_metadata_functions": [meiler_embedding, expasy_protein_scale],
    "edge_construction_functions": [add_peptide_bonds,
                                    add_hydrogen_bond_interactions,
                                    add_ionic_interactions,
                                    add_aromatic_sulphur_interactions,
                                    add_hydrophobic_interactions,
                                    add_cation_pi_interactions],
}
config = ProteinGraphConfig(**configs)
format_convertor = GraphFormatConvertor('nx', 'pyg', verbose='gnn', columns=None)
g = construct_graph(config=config, pdb_code='1c5y')
protdata = format_convertor(g)
print(protdata)

protdata.num_nodes == 256 ; protdata.coords[0].shape==257

Shouldn't these 2 be the same? What am I missing?
Thank you!

Structure-informed dataset splitting

Create good train/val/test sets based on SCOP/CATH classifications. Sequence-based approaches (e.g. identity thresholding or BLAST) are bad practice and should not be encouraged.

Cannot install environment-dev.yml on CPU

Describe the bug
Cannot develop on MacOS Big Sur (11.5.2) on CPU.

To Reproduce
Steps to reproduce the behavior:

  1. Run conda env create -f environment-dev.yml
  2. Error output:
Solving environment: failed

ResolvePackageNotFound: 
  - pytorch==1.9.0=py3.8_cuda11.1_cudnn8.0.5_0
  - cudatoolkit=11.1
  - pytorch3d

Expected behavior
Expected installation of pytorch, cudatoolkit, pytorch3d


Desktop (please complete the following information):

  • OS: Big Sur: (11.5.2)
  • Python Version: (
  • Graphein Version [e.g. 22] & how it was installed

Additional context
Installation works with the normal environment.yml file. I tried independently installing through pytorch.org with the command conda install pytorch torchvision torchaudio -c pytorch, but after running the example code:

from graphein.protein.config import ProteinGraphConfig
from graphein.protein.graphs import construct_graph

config = ProteinGraphConfig()
g = construct_graph(config=config, pdb_code="3eiy")

Error output:

Traceback (most recent call last):
  File "bin/test.py", line 1, in <module>
    from graphein.protein.config import ProteinGraphConfig
  File "/Users/josephgmaa/miniconda3/envs/graphein/lib/python3.8/site-packages/graphein/__init__.py", line 7, in <module>
    from .protein import *
  File "/Users/josephgmaa/miniconda3/envs/graphein/lib/python3.8/site-packages/graphein/protein/__init__.py", line 2, in <module>
    from .config import *
  File "/Users/josephgmaa/miniconda3/envs/graphein/lib/python3.8/site-packages/graphein/protein/config.py", line 16, in <module>
    from graphein.protein.features.nodes.amino_acid import meiler_embedding
  File "/Users/josephgmaa/miniconda3/envs/graphein/lib/python3.8/site-packages/graphein/protein/features/__init__.py", line 5, in <module>
    from .sequence import *
  File "/Users/josephgmaa/miniconda3/envs/graphein/lib/python3.8/site-packages/graphein/protein/features/sequence/__init__.py", line 1, in <module>
    from .embeddings import *
  File "/Users/josephgmaa/miniconda3/envs/graphein/lib/python3.8/site-packages/graphein/protein/features/sequence/embeddings.py", line 26, in <module>
    import biovec
  File "/Users/josephgmaa/miniconda3/envs/graphein/lib/python3.8/site-packages/biovec/__init__.py", line 1, in <module>
    from biovec import models
  File "/Users/josephgmaa/miniconda3/envs/graphein/lib/python3.8/site-packages/biovec/models/__init__.py", line 1, in <module>
    from biovec.models.prot_vec import *
  File "/Users/josephgmaa/miniconda3/envs/graphein/lib/python3.8/site-packages/biovec/models/prot_vec.py", line 1, in <module>
    from gensim.models import word2vec
  File "/Users/josephgmaa/miniconda3/envs/graphein/lib/python3.8/site-packages/gensim/__init__.py", line 5, in <module>
    from gensim import parsing, corpora, matutils, interfaces, models, similarities, summarization, utils  # noqa:F401
  File "/Users/josephgmaa/miniconda3/envs/graphein/lib/python3.8/site-packages/gensim/corpora/__init__.py", line 6, in <module>
    from .indexedcorpus import IndexedCorpus  # noqa:F401 must appear before the other classes
  File "/Users/josephgmaa/miniconda3/envs/graphein/lib/python3.8/site-packages/gensim/corpora/indexedcorpus.py", line 15, in <module>
    from gensim import interfaces, utils
  File "/Users/josephgmaa/miniconda3/envs/graphein/lib/python3.8/site-packages/gensim/interfaces.py", line 19, in <module>
    from gensim import utils, matutils
  File "/Users/josephgmaa/miniconda3/envs/graphein/lib/python3.8/site-packages/gensim/matutils.py", line 1054, in <module>
    from gensim._matutils import logsumexp, mean_absolute_difference, dirichlet_expectation
  File "__init__.pxd", line 198, in init gensim._matutils
ValueError: numpy.ndarray has the wrong size, try recompiling. Expected 88, got 96

Add distances as edge features

Describe the solution you'd like
Currently, when we compute edges (graphein.protein.edges.distance) we make use of the distance between the interacting nodes in assigning the edge. However, we do not store this information as an edge attribute. This would be very useful as it is a very standard feature in protein GNN models.

Converting Networkx graph to PyG

Thank you for this excellent work. I am trying to convert the graphs out of the high-level API:
g = construct_graph(config=config, pdb_code="3eiy")

I try to convert g further to PyG but get an error:
from torch_geometric.utils import from_networkx
data = from_networkx(g)

Error:
~/anaconda3/envs/graphein_dev/lib/python3.8/site-packages/torch_geometric/utils/convert.py in from_networkx(G, group_node_attrs, group_edge_attrs)
    162     for key, value in data.items():
    163         try:
--> 164             data[key] = torch.tensor(value)
    165         except ValueError:
    166             pass

RuntimeError: Could not infer dtype of set

Create dictionary of non-standard amino acids

I had a quick look at the PDB file for 5E5T. It seems they define the D-amino acids as HETATMs in the header:

HETNAM     DSN D-SERINE
HETNAM     DSG D-ASPARAGINE
HETNAM     DPN D-PHENYLALANINE
HETNAM     DCY D-CYSTEINE
HETNAM     DAS D-ASPARTIC ACID
HETNAM     DLY D-LYSINE
HETNAM     DLE D-LEUCINE
HETNAM     DAR D-ARGININE
HETNAM     DAL D-ALANINE
HETNAM     DTY D-TYROSINE
HETNAM     DIL D-ISOLEUCINE
HETNAM     DGL D-GLUTAMIC ACID
HETNAM     DVA D-VALINE
HETNAM     DPR D-PROLINE
HETNAM     DTH D-THREONINE
HETNAM     DHI D-HISTIDINE
HETNAM     FMT FORMIC ACID
HETNAM     EDO 1,2-ETHANEDIOL
HETSYN     EDO ETHYLENE GLYCOL

The PDB maintains a list of chemical entities and their abbreviations, so it should be a fairly straightforward lookup. Hopefully, these conventions are standard in practice. We'd probably want to label non-standard residues differently to the other HETATMs in the dataframe to allow for more nuanced filtering, e.g. keep non-standard residues but remove all the water molecules. In any case, I think we should probably keep one of those dictionaries in graphein for easy conversion between PDB ligand identifiers and SMILES/InChI, though not a priority at this point.

On that page there's also a list of the PDB structures that contain each ligand which might be useful, though I'm not sure how regularly updated it is.

Originally posted by @a-r-j in #19 (comment)


Some things we can do:

  • Add a source CSV file that gets loaded into a dictionary upon request.
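A sketch of the lazy-loading idea (the file name and column names are hypothetical):

import csv
from functools import lru_cache

@lru_cache(maxsize=1)
def load_ligand_smiles(path: str = "pdb_ligands.csv") -> dict:
    # Load the PDB-ligand-identifier -> SMILES mapping on first request only.
    with open(path, newline="") as f:
        return {row["code"]: row["smiles"] for row in csv.DictReader(f)}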

pytorch installation related issue

I set up the graphein environment according to the simple install instructions, but when I try to install PyTorch (CUDA version) I get several conflicts. I followed the instructions from the PyTorch website:

conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

How do I solve this issue?

Thanks

Installation in Colab notebook

Describe the bug
I did the following for installation:

!pip install graphein==1.0.11
!pip install folium==0.2.1

And I tried to run:

from graphein.protein.config import ProteinGraphConfig
from graphein.protein.graphs import construct_graph
config = ProteinGraphConfig()
g = construct_graph(config=config, pdb_code="3eiy")

which gave (note: this traceback corresponds to the snippet above):

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-1d85efe5422a> in <module>()
----> 1 from graphein.protein.config import ProteinGraphConfig
      2 from graphein.protein.graphs import construct_graph
      3 
      4 config = ProteinGraphConfig()
      5 g = construct_graph(config=config, pdb_code="3eiy")

9 frames
/usr/local/lib/python3.7/dist-packages/distributed/config.py in <module>()
     18 
     19 with open(fn) as f:
---> 20     defaults = yaml.load(f)
     21 
     22 dask.config.update_defaults(defaults)

TypeError: load() missing 1 required positional argument: 'Loader'

I also tried to run:

# NBVAL_SKIP
from graphein.utils.pymol import MolViewer
pymol = MolViewer()
pymol.fetch('3eiy')
pymol.display()

which gave:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-14-3aa322475106> in <module>()
      2 from graphein.utils.pymol import MolViewer
      3 pymol = MolViewer()
----> 4 pymol.fetch('3eiy')
      5 pymol.display()

3 frames
/usr/lib/python3.7/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
   1549                         if errno_num == errno.ENOENT:
   1550                             err_msg += ': ' + repr(err_filename)
-> 1551                     raise child_exception_type(errno_num, err_msg, err_filename)
   1552                 raise child_exception_type(err_msg)
   1553 

FileNotFoundError: [Errno 2] No such file or directory: 'pymol': 'pymol'


May I know how to fix it? Thanks!

Graphein Version Inconsistency?

In the setup.py file, the version is listed as 1.0.3:

graphein/setup.py

Lines 137 to 139 in 7c6b079

setup(
name="graphein",
version="1.0.3",

However, on PyPI, the current version for graphein is 1.0.6. What accounts for the difference in version numbers?

Also, is there a version history / changelog available anywhere? Thanks!

Odd Sklearn Dependency

The "base.in" file lists a "sklearn" dependency:

However, I think it should be "scikit-learn" instead of "sklearn." To my understanding, the package's name on PyPI and Conda is "scikit-learn," but for importing in actual Python code, its name is "sklearn".

Oddly, there is an actual package on PyPI called "sklearn", which is why I think this mistake doesn't trigger any error message. But clearly, the sklearn (version 0.0) package is not the intended package.

Visualisation of Protein Graphs

  • 3d Interactive Viewer (using coords) - plotly
  • 3d static plot (using coords) - Nx & Matplotlib
  • 2d projection

Colouring options

  • Residue type
  • Atom type
  • Edge types
  • Degree
  • Sequence position

Node Size options

  • Degree
  • By feature magnitude (single or several?)

Labelling options

  • Residue Name
  • Sequence Position
  • Identifier

Documentation

  • Docstrings
  • Docs

not all ELEMENTX-ELEMENTY bonds are present in BOND_ORDERS lookup dictionary

BOND_ORDERS includes only some of the possible bonds; e.g. S-S is not present but can indeed occur.
Thus, at line 223 in graphein.protein.edges.atomic.identify_bond_type_from_mapping, the lookup allowable_order = BOND_ORDERS[query] fails.

I assume the solution should be in graphein.protein.edges.atomic.add_bond_order (line 188), as follows:

            try:
                identify_bond_type_from_mapping(G, u, v, a, query)
            except KeyError:
                query = f"{atom_b}-{atom_a}"
                try:
                    identify_bond_type_from_mapping(G, u, v, a, query)
                except KeyError:
                    G.edges[u, v]["kind"].add("SINGLE")
