deepfindr / gnn-project

A Graph Neural Network project on HIV data

Python 100.00%

gnn-project's Introduction

Comments about the code

This is the code for this video series: https://www.youtube.com/watch?v=nAEb1lOf_4o

Installing RDKIT

You will need rdkit to run this code.

Follow these instructions to install rdkit. https://www.rdkit.org/docs/Install.html

If you run on Ubuntu / WSL you can simply run the following (on recent Ubuntu releases the Python 3 package is named python3-rdkit):

sudo apt-get install python-rdkit

Ideally, run the code in an Anaconda environment; that is the easiest way to get rdkit working.

Installing the other packages

For pytorch geometric follow this tutorial: https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html

Make sure your CUDA version and your torch version both match the PyG version you install. I used torch 1.6.0, as it seemed to be the most stable with the other libraries.
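Before debugging a version mismatch, it can help to print what is actually installed. The following is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical, and the distribution names "torch" and "torch-geometric" are assumed to be how the packages are registered in your environment.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not registered in this environment.
            versions[name] = None
    return versions


# Distribution names below are assumptions; adjust them to your setup.
for name, version in installed_versions(["torch", "torch-geometric"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Once torch is importable, torch.version.cuda reports the CUDA version that torch build was compiled against, which is the value to compare with the PyG installation instructions.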

Further things

It's highly recommended to set up a GPU (including CUDA) for this code.
Here is where I found ideas for node / edge features: https://www.researchgate.net/figure/Descriptions-of-node-and-edge-features_tbl1_339424976
There is also a Kaggle competition that used this dataset (from a University): https://www.kaggle.com/c/iml2019/overview

Dashboard (MLflow + Streamlit)

It is required to use conda for this setup, e.g.

wget https://repo.continuum.io/archive/Anaconda3-5.3.1-Linux-x86_64.sh

You need to start the following things:

Streamlit server

streamlit run dashboard.py

MLflow server

mlflow server \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./artifacts \
    --host 0.0.0.0 \
    --port 5000

MLflow served model

export MLFLOW_TRACKING_URI=http://localhost:5000
mlflow models serve -m "models:/YourModelName/Staging" -p 1234
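Once the model is served, it can be queried over HTTP. The sketch below only builds the request and does not send it. It assumes the served model listens on the /invocations endpoint at port 1234 (the port used above) and accepts the "dataframe_split" JSON layout used by MLflow 2.x; older MLflow versions expect a different body. The helper name build_invocation_request and the column names are hypothetical.

```python
import json
import urllib.request


def build_invocation_request(columns, rows, host="localhost", port=1234):
    """Build (but do not send) a POST request for an MLflow-served model."""
    # Assumption: MLflow 2.x style payload keyed by "dataframe_split".
    payload = {
        "dataframe_split": {
            "columns": list(columns),
            "data": [list(row) for row in rows],
        }
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical column names, for illustration only.
request = build_invocation_request(["feature_a", "feature_b"], [[0.1, 0.2]])
print(request.full_url)
# With the server running: urllib.request.urlopen(request).read()
```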

TODO: Check whether multi-input models work with MLflow.

gnn-project's Issues

link in the data folder readme opens a spam page

The link provided in the README opens a spam website.

Dataset downloaded from: http://moleculenet.ai/datasets-1

The error in the calculation of AUC

Hi,

Thank you for this excellent GNN project. However, in the train.py file, line 106 shows:

roc= roc_auc_score(y_pred,y_true)

According to sklearn, the first argument should be the true labels and the second argument should be the predicted scores, so is something wrong here? The correct version could be:

roc=roc_auc_score(y_true,all_preds_raw)
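To illustrate why the argument order matters, here is a small dependency-free sketch. pairwise_auc is a hypothetical stand-in for roc_auc_score: for binary labels it computes the same quantity by comparing every positive/negative pair, counting ties as one half. The labels and scores are made up for the example.

```python
def pairwise_auc(labels, scores):
    """ROC AUC for binary labels via pairwise comparison (ties count 0.5)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0, 0, 0]               # made-up ground truth
y_pred = [1, 0, 0, 0, 0, 0]               # made-up thresholded predictions
y_raw = [0.9, 0.4, 0.3, 0.2, 0.2, 0.1]    # made-up raw scores

correct = pairwise_auc(y_true, y_pred)    # labels first, predictions second
swapped = pairwise_auc(y_pred, y_true)    # arguments in the wrong order
from_raw = pairwise_auc(y_true, y_raw)    # labels first, raw scores second
print(correct, swapped, from_raw)
```

On this toy data the swapped call returns a different value from the correct one, and the raw scores give a different value again, because thresholding discards the ranking information that ROC AUC measures.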

Doubt regarding code execution

Hello sir, can you please explain how to execute the code? While running the train.py file I am getting the error.

Hope you will reply to this ASAP

Hi,
First, thank you for all these amazing materials!
For my part, I have a lot of difficulty following the videos alongside the code, which seems to be the final project.
Is there a way to get the scripts for each video, please?
Thank you!

Error in f.to_pyg_graph(): TypeError: type object got multiple values for keyword argument 'pos'

Hello, I am trying to run your code, but I am facing a problem when creating the dataset, specifically on line 54 of dataset_featurizer.py, where we convert to a PyTorch Geometric graph using:
data = f.to_pyg_graph()
I run into this error:
TypeError: type object got multiple values for keyword argument 'pos'

The f in this case looks like this:
GraphData(node_features=[46, 30], edge_index=[2, 108], edge_features=[108, 11], pos=[0])

This has been created from row = ('level_0', 0) ('Unnamed: 0', 3999) ('index', 3999) ('smiles', 'CSc1cc2[n+]3c(c1)-c1cccc[n+]1[Zn-4]314([n+]3ccccc3-2)[n+]2ccccc2-c2cc(SC)cc([n+]21)-c1cccc[n+]14.[O-]Cl+3([O-])[O-]') ('activity', 'CI') ('HIV_active', 0)

I don't really know what to do or how to solve it.
Maybe you could also upload the content in /data/processed, so that this issue is solved.

Help please, I am very stuck with this issue and I cannot run your code.

Thank you for your time.

dataset_featurizer.py referencing a base Class of DeepChem MolGraphConvFeaturizer?

Hi! Thanks for the great effort.

self.process()
  File "/..../dataset_featurizer.py", line 53, in process
    f = featurizer.featurize(mol["smiles"])
  > data = f[0].to_pyg_graph()
AttributeError: 'numpy.ndarray' object has no attribute 'to_pyg_graph'

It seems that featurizer.featurize returns a NumPy array whose first element is itself a numpy.ndarray, not a GraphData object.
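One defensive option is to check each returned element before converting it. The sketch below is duck-typed and runs without DeepChem; first_graph and FakeGraph are hypothetical names, and it assumes that a failed featurization yields an element without a to_pyg_graph method (for example an empty array), which is what the traceback above suggests.

```python
def first_graph(features):
    """Return the first element that can be converted to a PyG graph, else None."""
    for item in features:
        # Assumption: only successfully featurized molecules expose to_pyg_graph.
        if hasattr(item, "to_pyg_graph"):
            return item
    return None


class FakeGraph:
    """Hypothetical stand-in for a successfully featurized molecule."""

    def to_pyg_graph(self):
        return "pyg-graph"


# An empty list stands in for the empty array of a failed featurization.
failed = first_graph([[]])
succeeded = first_graph([[], FakeGraph()])
print(failed, succeeded.to_pyg_graph())
```

Skipping or logging the molecules for which this returns None avoids the AttributeError, at the cost of dropping those rows from the dataset.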

Preprocessing with deepchem. Issue with positions

I was running train.py with a recent installation of the libraries. I think there is a version mismatch, because I am getting:

 File "../venv2023/lib/python3.8/site-packages/deepchem/feat/graph_data.py", line 151, in to_pyg_graph
    return Data(x=torch.from_numpy(self.node_features).float(),
TypeError: type object got multiple values for keyword argument 'pos'

I found a workaround by ignoring the positional information since f=featurizer._featurize(mol) later shows:

>>>f
GraphData(node_features=[75, 30], edge_index=[2, 162], edge_features=[162, 11], pos=[0])

The workaround is to write a custom function to convert into pyg_graph from f

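    # Assumes module-level imports: os, torch, pandas as pd, deepchem as dc,
    # tqdm.tqdm and rdkit.Chem.
    def _custom_to_pyg_graph(self, graph_data):
        # Build the PyG Data object by hand and leave out the empty `pos`
        # field that triggers the TypeError in to_pyg_graph().
        from torch_geometric.data import Data
        return Data(x=torch.from_numpy(graph_data.node_features).float(),
                    edge_index=torch.from_numpy(graph_data.edge_index).long(),
                    edge_attr=torch.from_numpy(graph_data.edge_features).float())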

    def process(self):
        self.data = pd.read_csv(self.raw_paths[0]).reset_index()
        featurizer = dc.feat.MolGraphConvFeaturizer(use_edges=True)
        for index, row in tqdm(self.data.iterrows(), total=self.data.shape[0]):
            # Featurize molecule
            mol = Chem.MolFromSmiles(row["smiles"])
            f = featurizer._featurize(mol)
            data = self._custom_to_pyg_graph(f)
            # data = f.to_pyg_graph()
            data.y = self._get_label(row["HIV_active"])
            data.smiles = row["smiles"]
            if self.test:
                torch.save(data, 
                    os.path.join(self.processed_dir, 
                                 f'data_test_{index}.pt'))
            else:
                torch.save(data, 
                    os.path.join(self.processed_dir, 
                                 f'data_{index}.pt'))

So far this works for processing. However, a general-purpose solution would need to handle future versions in which positions are included.

Also, the requirements seem to be missing some packages, such as deepchem. Pinning a version for each package, or dockerizing the environment you used, would help.

'Tensor' object has no attribute 'head_transform1'

Hello,

I tried to replicate the code in your YouTube video 'GNN Project #3.1' and got this error. Could you please suggest how to fix it? Thank you.

sklearn.metrics.confusion_matrix

Thank you for this great work!
I just wanted to point out something about the confusion matrix function: y_true comes before y_pred in the signature of sklearn.metrics.confusion_matrix.
sklearn.metrics.confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None)
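The effect of swapping the two arguments can be seen in a small dependency-free sketch. confusion_counts is a hypothetical helper that follows the same convention as sklearn (rows are true classes, columns are predicted classes); the labels are made up.

```python
def confusion_counts(y_true, y_pred, n_classes=2):
    """Count matrix with true classes as rows and predicted classes as columns."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


y_true = [1, 1, 0, 0, 0, 0]   # made-up ground truth
y_pred = [1, 0, 0, 0, 0, 0]   # made-up predictions

right_order = confusion_counts(y_true, y_pred)
wrong_order = confusion_counts(y_pred, y_true)
print(right_order)
print(wrong_order)
```

Swapping the arguments transposes the matrix, so the single false negative in this toy data is reported as a false positive.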

Where does the 0 come from?

Hi
Thank you for all your effort on the Gnn-project.

Once the training is done:

Processing...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3999/3999 [00:27<00:00, 145.31it/s]
Done!
Loading model...
...
...
...
   if i % self.top_k_every_n == 0:
ZeroDivisionError: integer division or modulo by zero

Any insights, @deepfindr?

Thanks in advance
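The traceback only shows that self.top_k_every_n was 0 when the modulo was evaluated; where that value is set is not visible here. As a stopgap, the check can be guarded so that a zero interval disables the step instead of crashing. A minimal sketch, with should_log_top_k as a hypothetical helper name:

```python
def should_log_top_k(step, every_n):
    """Return True when `step` falls on the interval; a non-positive interval disables it."""
    if every_n <= 0:
        # Avoids `step % 0`, which raises ZeroDivisionError.
        return False
    return step % every_n == 0


for step in range(4):
    print(step, should_log_top_k(step, 2), should_log_top_k(step, 0))
```

This hides the symptom only. If the interval is derived from another setting (for example by integer division), the better fix is to make sure that computation cannot yield 0.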

question on TransformerConv

Hi Deepfindr,

Thank you so much for your great video and code!

I have a question about the TransformerConv you used. You defined the layer as:

self.conv1 = TransformerConv(feature_size, 
                                    embedding_size, 
                                    heads=n_heads, 
                                    dropout=dropout_rate,
                                    edge_dim=edge_dim,
                                    beta=True)

But according to the PyG website, the first parameter is supposed to be in_channels, which is either a tuple defining the shape of the input, or -1, which derives the size from the first input(s) to the forward method.

But feature_size is neither of them. Is it a version issue? Which version of PyG did you use for the tutorial?

Thank you so much!

Best,
Nicole

Can't find 'mango', which includes 'Tuner' and 'Scheduler'

I can't find it on PyPI either. Did I miss a file, or has the author already deleted it? I am confused.

all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 20

Hi!

I've been exploring some self-made datasets and I've managed to get the project up and running fine. The training runs well, except that this error sometimes happens:

F1 Score: 0.764872521246459
Accuracy: 0.7331189710610932
MCC: 0.48423538939278077
Precision: 0.6835443037974683
Recall: 0.8681672025723473
ROC AUC: 0.7331189710610932
Epoch 215 | Test Loss 0.5649742603302002
Early stopping due to no improvement.
  0%|                                                                        | 0/100 [14:09<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 191, in <module>
    results = tuner.minimize()
  File "/opt/conda/lib/python3.7/site-packages/mango/tuner.py", line 153, in minimize
    return self.run()
  File "/opt/conda/lib/python3.7/site-packages/mango/tuner.py", line 140, in run
    self.results = self.runBayesianOptimizer()
  File "/opt/conda/lib/python3.7/site-packages/mango/tuner.py", line 263, in runBayesianOptimizer
    X_sample = np.vstack((X_sample, X_next_batch))
  File "<__array_function__ internals>", line 6, in vstack
  File "/opt/conda/lib/python3.7/site-packages/numpy/core/shape_base.py", line 282, in vstack
    return _nx.concatenate(arrs, 0)
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 20

As you can see this happens at the 0th epoch.