barahona-research-group / hcga
Highly Comparative Graph Analysis - Code for network phenotyping
License: GNU General Public License v3.0
by default plot top 10 features?
plot one graph per class instead of random graphs
as it cannot be used without shap...
Rich club cannot be computed on graphs with self-loops - remove these before computing the rich-club coefficient.
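A minimal sketch of the suggested workaround, assuming networkx is the graph backend (the function name `rich_club_without_selfloops` is hypothetical, not part of hcga): copy the graph, strip its self-loops, and only then call `rich_club_coefficient`.

```python
# Sketch: remove self-loops on a copy before computing the rich club,
# so the caller's graph is left untouched.
import networkx as nx

def rich_club_without_selfloops(graph):
    """Return rich-club coefficients after removing self-loops."""
    g = graph.copy()  # do not mutate the caller's graph
    g.remove_edges_from(nx.selfloop_edges(g))
    return nx.rich_club_coefficient(g, normalized=False)
```

Using `normalized=False` avoids the randomized rewiring step, which also makes the output deterministic.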
add_feature is set up so that if an error occurs when a function is called, it returns NaN. However, sometimes we want to compute multiple things from a single function: e.g. I compute the community structure using asyn_fluid, and from that output I create 5 or 6 features. I can't put the community-structure computation directly into add_feature for each separate feature, because then the asyn_fluid computation would run multiple times. However, sometimes the original call to asyn_fluid raises an error, and in that case each derived feature should be given NaN. What is the best way around this?
If the computation of a feature fails, we need to record the exception and save it somewhere, so that we know what failed and where, and can trace back to correct the dataset/feature computation. We could use logging for that.
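One possible pattern addressing both points (the names `community_features` and `run_asyn_fluid` are hypothetical, not hcga's actual API): run the expensive call once inside a try/except, log the exception for later backtracing, and hand NaN to every derived feature if it failed.

```python
# Sketch: compute the expensive result once; on failure, log the
# exception and return NaN for every feature derived from it.
import logging
import math

logger = logging.getLogger("hcga.features")

def community_features(graph, run_asyn_fluid):
    """Derive several features from a single community computation."""
    try:
        communities = run_asyn_fluid(graph)
    except Exception:
        # logger.exception records the full traceback for backtracing later
        logger.exception("asyn_fluid failed on graph %s", getattr(graph, "name", "?"))
        communities = None

    if communities is None:
        return {"num_communities": math.nan, "largest_community": math.nan}

    sizes = [len(c) for c in communities]
    return {"num_communities": len(sizes), "largest_community": max(sizes)}
```

The key point is that the try/except wraps only the shared computation, so one failure marks all derived features as NaN without re-running anything.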
Why does this feature fail so often?
If a feature fails, the failure is caught and the feature is tried on the toy graph to see whether it would have returned a float or a list. The problem is that in some cases even the simple graph will fail (e.g. comm_asyn, which needs enough nodes to work), and this is not caught. I have commented out this piece of code for now, but it may break things down the line. Not sure what the best strategy is for that.
I'm more in favor of 3 for now.
Pickle is not good for large datasets; it causes memory errors. I am investigating that, and maybe using .h5 to save the processed dataset.
It is becoming quite messy, and now we have a clustering_analysis on top of distributions. We may want to rethink this a bit and get something cleaner.
I'm quite in favor of having an option on add_feature to compute statistics, clustering stats, or something else.
I was playing around with running the SHAP analysis multiple times on the same feature matrix. I noticed that there is some slight variation in the order of the SHAP values when ranked by importance. There is also a variation in the balanced accuracy: it is almost always exactly the same number, but sometimes it is off (0.86 instead of 0.90, for instance).
I guess this depends on the training of the classifier, and maybe on the way the SHAP values are computed, which requires some kind of sampling as I vaguely understand it. Anyway, one might want to look into it.
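The variation described above is consistent with unseeded randomness in the classifier training (and possibly in SHAP's sampling). In hcga one would presumably pass a fixed `random_state` to the sklearn estimator and seed the SHAP sampler; the exact parameter names depend on the versions used. A minimal stdlib illustration of the principle:

```python
# Stand-in for a training run whose result depends on random sampling:
# with the same seed, repeated runs give identical "accuracies".
import random

def noisy_accuracy(seed):
    rng = random.Random(seed)  # local RNG so runs are isolated
    return round(0.88 + 0.02 * (rng.random() - 0.5), 4)

run_a = noisy_accuracy(seed=42)
run_b = noisy_accuracy(seed=42)
assert run_a == run_b  # same seed, same balanced accuracy
```

If the numbers still differ after seeding everything, the remaining variation would point to genuinely non-deterministic code paths (e.g. parallel training).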
We never completely finished the node_feature option.
Do we want to compute features for node labels and node attributes independently, instead of stacking them together? They may encode quite different data, so it may be worth doubling the number of features.
update the plotting to use plotly express: https://plotly.com/python/plotly-express/
For user experience, people will want to simply open a notebook, import a class, and import their data from there. I think that forcing them to use the command line with a custom dataset is going to cause a lot of problems: they would need to generate a pickle of their data in the correct format (which they would have to do in Python anyway), make sure everything is in the correct folders, and then run it from the command line, with some difficulty getting the commands right.
I think we need a simple class they can import, from which we can return the feature results, analyse them, and plot them. Just return them, and don't save them on the class object, for memory reasons. This will provide much easier functionality for a user, I believe. We only need one class: the hcga object.
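A rough sketch of what such a single user-facing class could look like (the class and method names here are hypothetical, not the current hcga API): results are returned directly rather than stored on the object.

```python
# Sketch: one importable object holding the dataset, with methods that
# return results instead of caching them on the instance.
class Hcga:
    def __init__(self, graphs, labels):
        self.graphs = graphs
        self.labels = labels

    def extract_features(self, feature_funcs):
        """Return one feature dict per graph, without storing the matrix."""
        return [
            {name: func(graph) for name, func in feature_funcs.items()}
            for graph in self.graphs
        ]
```

A notebook user would then do `h = Hcga(graphs, labels)` followed by `features = h.extract_features(...)`, with no command line or folder layout involved.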
on features that failed
I will add a feature summary: it will show a violin plot for a chosen feature and then display example graphs drawn from different values across the feature. This will be a nice way to understand how a feature differentiates graphs.
graph = graph_to_plot[i].get_graph("networkx")
File "/home/robert/Documents/PythonCode/hcga/hcga/graph.py", line 123, in get_graph
self.set_networkx()
File "/home/robert/Documents/PythonCode/hcga/hcga/graph.py", line 137, in set_networkx
for node, node_data in self.nodes.iterrows()
AttributeError: 'list' object has no attribute 'iterrows'
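The traceback suggests `self.nodes` sometimes arrives as a plain list of node ids rather than a pandas DataFrame, so `.iterrows()` fails. A hypothetical guard (the helper name is mine, not hcga's) would normalise it once before iterating:

```python
# Sketch: accept either a list of node ids or a DataFrame of node data,
# and always return a DataFrame so .iterrows() is safe to call.
import pandas as pd

def ensure_nodes_dataframe(nodes):
    if isinstance(nodes, list):
        return pd.DataFrame(index=nodes)  # bare ids, no node attributes
    return nodes
```

The real fix may instead be upstream, wherever the list is created, but normalising at the point of use makes `set_networkx` robust either way.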
We cannot load the DD dataset due to a bug:
exception local variable 'node_attributes' referenced before assignment
This also occurs for collab with:
exception local variable 'node_labels' referenced before assignment
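Both errors above are the classic pattern where a variable is only bound inside a conditional branch that did not run for these datasets. A defensive default fixes it; this is a sketch using the variable names from the error messages, not hcga's actual loader:

```python
# Sketch: bind node_attributes and node_labels up front so later
# references never hit "referenced before assignment".
def load_node_metadata(raw):
    node_attributes = None  # safe defaults
    node_labels = None
    if "attributes" in raw:
        node_attributes = raw["attributes"]
    if "labels" in raw:
        node_labels = raw["labels"]
    return node_attributes, node_labels
```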
this feature always fails
In the benchmark datasets the node labels and node features are one-hot encoded. The new version that extracts these datasets simply adds the raw node label (see the DD dataset, for example). To make hcga comparable with the other methods, we must also binarize the benchmark datasets.
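A possible binarization step (a sketch; sklearn's `LabelBinarizer` would do the same job): map each integer node label to a one-hot vector so the re-extracted datasets match the one-hot encoded originals.

```python
# Sketch: one-hot encode a list of node labels.
def one_hot_encode(labels):
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [
        [1 if index[lab] == i else 0 for i in range(len(classes))]
        for lab in labels
    ]
```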
We could use this to stack all the images in a single PDF report: https://matplotlib.org/3.2.1/api/backend_pdf_api.html - it is quite neat!
Add the power-law sum of squared errors to the statistics of distributions such as centrality. This feature was quite good on DD. To compute it, we must first bin the data (histogram), then find the best-fitting power-law distribution for the binned data, and take the SSE from that fit.
We originally binned into 10, 20 and 50 bins, fitted the power-law distribution, and extracted the SSE.
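The steps above can be sketched with numpy (an assumption on my part: the power law is fitted by least squares on the log-log bin centres, which presumes positive-valued data such as centralities; the original fitting method may differ):

```python
# Sketch: histogram the data, fit y = c * x**slope on the log-log bin
# centres, and return the sum of squared errors of the fit.
import numpy as np

def powerlaw_sse(data, bins=10):
    counts, edges = np.histogram(data, bins=bins, density=True)
    centres = (edges[:-1] + edges[1:]) / 2
    mask = counts > 0  # log of zero counts is undefined
    slope, intercept = np.polyfit(np.log(centres[mask]), np.log(counts[mask]), 1)
    fitted = np.exp(intercept) * centres[mask] ** slope
    return float(np.sum((counts[mask] - fitted) ** 2))
```

Running this for bins in (10, 20, 50) would reproduce the three variants mentioned.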
I noticed that under some operations, for example centrality_degree, there are comments regarding normal and exponential distributions. Does this mean that other features will be calculated by fitting the data to a normal or an exponential distribution, or will the data be fitted to the optimal distribution?
We should run cProfile, and cache (lru_cache) computations that are used several times, for a speed-up.
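A sketch of the caching idea: memoize a computation reused by several features with `functools.lru_cache`, and wrap the run in cProfile to see where it pays off. Note one caveat for hcga: `lru_cache` requires hashable arguments, so a graph would need a hashable key rather than the graph object itself.

```python
# Sketch: lru_cache a repeated computation and profile the loop.
import cProfile
import functools

@functools.lru_cache(maxsize=None)
def expensive_statistic(values):
    # placeholder for a computation shared by several features
    return sum(v * v for v in values)

def run_profiled():
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(1000):
        expensive_statistic((1, 2, 3))  # computed once, then served from cache
    profiler.disable()
    return expensive_statistic.cache_info()
```

`cache_info()` makes the benefit visible: after the first call, every further call is a cache hit.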
with some options (basic, medium, advanced)
We need the option to retain original graph features that might exist (or allow users to add them). E.g. we might know a property of a molecule, such as its bulk modulus, and want to incorporate that feature into the statistical learning.
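One way this could work (a sketch, not hcga's design; the `user_` prefix is my assumption to avoid name clashes): merge a user-supplied dict of known graph properties into each graph's computed feature dict.

```python
# Sketch: append user-supplied graph properties to the computed features.
def merge_user_features(computed, user_supplied):
    """Return computed features plus prefixed user features."""
    merged = dict(computed)
    for name, value in user_supplied.items():
        merged["user_" + name] = value
    return merged
```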
I wanted to launch a discussion on how we would like the main command(s) of hcga to look, before we go off in multiple directions.
Currently there are two commands, feature_analysis and plot_analysis, to do the classification of the graphs and to plot it. Does it make sense to have a separate plot_analysis command? Do we want to give the user tons of different plotting options so that they will want to run it multiple times? In my view it is probably not necessary. As I understand it, the user would run the main classification once and then look at the standard plots produced. They would do their own plotting if they want something that looks different. They might play with the settings of the classification, but then it would have to run many times anyway, and producing the plots is not the computationally expensive part.
Another aspect is that it is currently a single command that takes as an option the kind of analysis to be done: either straight classification with sklearn, or with SHAP values. As I understand it, the SHAP analysis just computes the values on top of the same exact training of the random forest or boosted model, so it does not change things that much. But there is the option of making two different commands if we feel that the analyses are very different.
I would like to have a feel as to what the end goal is in terms of user experience. What do we want the user to be able to do and what kind of data do they want at the end to be able to look at?
For instance, while the user does not need to look at the initial dataset, they might want to have a look in the features extracted and the analysis results with some other programs. It might be interesting to save the relevant data as text files instead of python pickles then.
There was a mention earlier of implementing the plots with plotly such that one could hover over the data points and get detailed information about the features (using the descriptions). Is that something that we want to get into? Plots could be saved as html files for instance, which gives this option to hover later on.
I'm not sure if it is possible to change the plotting library when plotting using the functions in shap. It might not be possible, and then it is maybe not worth fighting too much over the default. What do you think?
We can produce a large dataset with lots of different graph types: real, classic, random, etc. We then compute all their features. For a given new graph, we can output a set of similar graph types with respect to the feature space. This will improve interpretability.
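A sketch of the similarity lookup (the function name and the choice of Euclidean distance are assumptions, not hcga's design): embed the new graph in the shared feature space and return the nearest reference graph types.

```python
# Sketch: rank reference graphs by Euclidean distance in feature space.
import math

def most_similar(query_features, reference, k=3):
    """reference: dict mapping graph-type name -> feature vector."""
    def dist(vec):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query_features, vec)))
    return sorted(reference, key=lambda name: dist(reference[name]))[:k]
```

In practice the features would need normalising first, since raw feature scales differ wildly.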
If the user gives a graph label that is not an integer, save it in another variable and assign a label_id to it.
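A possible implementation of the label_id idea (a sketch; `encode_labels` is a name I made up): keep the original labels in a side table and hand integer ids to the classifier.

```python
# Sketch: map arbitrary graph labels to integer ids, keeping a lookup
# table so the original labels can be recovered for reporting.
def encode_labels(labels):
    """Return (integer ids, id -> original label lookup)."""
    lookup = {}
    ids = []
    for label in labels:
        if label not in lookup:
            lookup[label] = len(lookup)
        ids.append(lookup[label])
    id_to_label = {i: lab for lab, i in lookup.items()}
    return ids, id_to_label
```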
This occurred before when the trivial graph did not have features and as such did not add the node_feature to the possible features.
"Feature {} does not exist in class {}".format(feature_name, self.name)
Exception: Feature node_feature1_mean does not exist in class node_features
We may want to consider sorting the pandas dataframe according to graph label for easier exploration.
Also, you mentioned something about info_ict, but I'm not sure what you're referring to.
To make hcga even better, I think we should input the list of graphs as adjacency matrices with scipy.sparse instead of networkx objects. Then, if a feature computation requires networkx, we convert it (perhaps only once, in the feature class). That way, adding features based not on networkx but on igraph or any custom package (some are faster to compute without networkx) would be easier than converting back from networkx.
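The convert-only-once idea can be sketched backend-agnostically (class and method names are hypothetical): store the adjacency in whatever form it arrived in and build the networkx graph lazily, on first request, via an injected converter such as networkx's scipy-sparse constructor.

```python
# Sketch: hold the raw adjacency and convert to networkx at most once,
# only when a feature actually asks for it.
class LazyGraph:
    def __init__(self, adjacency, to_networkx):
        self._adjacency = adjacency
        self._to_networkx = to_networkx  # e.g. a scipy.sparse -> nx converter
        self._nx_graph = None

    @property
    def adjacency(self):
        return self._adjacency

    def as_networkx(self):
        if self._nx_graph is None:  # convert once, then reuse
            self._nx_graph = self._to_networkx(self._adjacency)
        return self._nx_graph
```

Features that work directly on the sparse matrix never trigger the conversion at all.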
Not sure how to do this using shap?
so we can run sklearn/shap independently
We need the bare minimum in each feature class docstring, with a link to the networkx function if it is networkx-based, so the user can see there what the function does. Anything else the user may want to know could be added, too.
lint + docstring + pandas integration + graph class integration
add node degree as feature to convolute
so we can for example try to classify all pairs of classes to get a matrix of accuracies (classes by classes)
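A sketch of the pairwise idea (the callable-based interface is my assumption; in hcga this would wrap the sklearn training): for every pair of classes, restrict the dataset to those two classes and record the resulting accuracy in a classes-by-classes matrix.

```python
# Sketch: build a matrix of pairwise classification accuracies, given a
# train_and_score callable that returns the accuracy for a sub-dataset.
from itertools import combinations

def pairwise_accuracies(features, labels, train_and_score):
    classes = sorted(set(labels))
    matrix = {c: {d: None for d in classes} for c in classes}
    for a, b in combinations(classes, 2):
        subset = [(x, y) for x, y in zip(features, labels) if y in (a, b)]
        xs, ys = zip(*subset)
        acc = train_and_score(list(xs), list(ys))
        matrix[a][b] = matrix[b][a] = acc  # symmetric
    return matrix
```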
I thought it would be simpler to have the argument of the commands directly be the pickled file to load. As it is, both extract_features and feature_analysis require specifying both a folder name and a dataset name (without any file extension, as the io.py function later adds .pkl by hand).
Wouldn't it be easier and more straightforward to let the user specify the dataset (with its extension) at any path? Then one could use automatic completion in the terminal to add the file.
If other people agree I can implement it.