barahona-research-group / hcga
Highly Comparative Graph Analysis - Code for network phenotyping
License: GNU General Public License v3.0
by default plot top 10 features?
plot one graph per class instead of random graphs
as it cannot be used without shap...
Rich club cannot be computed on graphs with self-loops - remove these before computing the rich-club coefficient.
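A minimal sketch of the suggested workaround, assuming networkx is the graph backend (the function name `rich_club_without_selfloops` is hypothetical, not part of hcga): copy the graph, strip its self-loops, and only then call `rich_club_coefficient`.

```python
# Sketch: remove self-loops on a copy before computing the rich club,
# so the caller's graph is left untouched.
import networkx as nx

def rich_club_without_selfloops(graph):
    """Return rich-club coefficients after removing self-loops."""
    g = graph.copy()  # do not mutate the caller's graph
    g.remove_edges_from(nx.selfloop_edges(g))
    return nx.rich_club_coefficient(g, normalized=False)
```

Using `normalized=False` avoids the randomized rewiring step, which also makes the output deterministic.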
add_feature is set up so that if an error occurs when a function is called, it returns NaN. However, sometimes we want to compute multiple things from a single function: e.g. I compute the community structure using asyn_fluid, and from that output I create 5 or 6 features. I can't put the community-structure computation directly into add_feature for each separate feature, because then the asyn_fluid computation would run multiple times. However, sometimes the original call to asyn_fluid raises an error, and in that case each derived feature should be given NaN. What is the best way around this?
If the computation of a feature fails, we need to record the exception and save it somewhere, so that we know what failed and where, and can trace back to correct the dataset/feature computation. We could use logging for that.
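One possible pattern addressing both points (the names `community_features` and `run_asyn_fluid` are hypothetical, not hcga's actual API): run the expensive call once inside a try/except, log the exception for later backtracing, and hand NaN to every derived feature if it failed.

```python
# Sketch: compute the expensive result once; on failure, log the
# exception and return NaN for every feature derived from it.
import logging
import math

logger = logging.getLogger("hcga.features")

def community_features(graph, run_asyn_fluid):
    """Derive several features from a single community computation."""
    try:
        communities = run_asyn_fluid(graph)
    except Exception:
        # logger.exception records the full traceback for backtracing later
        logger.exception("asyn_fluid failed on graph %s", getattr(graph, "name", "?"))
        communities = None

    if communities is None:
        return {"num_communities": math.nan, "largest_community": math.nan}

    sizes = [len(c) for c in communities]
    return {"num_communities": len(sizes), "largest_community": max(sizes)}
```

The key point is that the try/except wraps only the shared computation, so one failure marks all derived features as NaN without re-running anything.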
Why does this feature fail so often?
If a feature fails, the failure is caught and the feature is tried on the toy graph to see whether it would have returned a float or a list. The problem is that in some cases even the simple graph will fail (e.g. comm_asyn, which needs enough nodes to work), and this is not caught. I have commented out this piece of code for now, but it may break things down the line. Not sure what the best strategy is for that.
I'm more in favor of 3 for now.
Pickle is not good for large datasets; it causes memory errors. I am investigating that, and maybe using .h5 to save the processed dataset.
It is becoming quite messy, and now we have a clustering_analysis on top of distributions. We may want to rethink this a bit and get something cleaner.
I'm quite in favor of having an option on add_feature to compute statistics, clustering stats, or something else.
I was playing around with running the SHAP analysis multiple times on the same feature matrix. I noticed that there is some slight variation in the order of the SHAP values when ranked by importance. There is also a variation in the balanced accuracy: it is almost always exactly the same number, but sometimes it is off (0.86 instead of 0.90, for instance).
I guess this depends on the training of the classifier, and maybe on the way the SHAP values are computed, which requires some kind of sampling as I vaguely understand it. Anyway, one might want to look into it.
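The variation described above is consistent with unseeded randomness in the classifier training (and possibly in SHAP's sampling). In hcga one would presumably pass a fixed `random_state` to the sklearn estimator and seed the SHAP sampler; the exact parameter names depend on the versions used. A minimal stdlib illustration of the principle:

```python
# Stand-in for a training run whose result depends on random sampling:
# with the same seed, repeated runs give identical "accuracies".
import random

def noisy_accuracy(seed):
    rng = random.Random(seed)  # local RNG so runs are isolated
    return round(0.88 + 0.02 * (rng.random() - 0.5), 4)

run_a = noisy_accuracy(seed=42)
run_b = noisy_accuracy(seed=42)
assert run_a == run_b  # same seed, same balanced accuracy
```

If the numbers still differ after seeding everything, the remaining variation would point to genuinely non-deterministic code paths (e.g. parallel training).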
We never completely finished the node_feature option.
Do we want to compute features for node labels and node attributes independently, instead of stacking them together? They may encode quite different data, so it may be worth doubling the number of features.
update the plotting to use plotly express: https://plotly.com/python/plotly-express/
For user experience, people will want to simply open a notebook, import a class, and import their data from there. I think that forcing them to use the command line with a custom dataset is going to cause a lot of problems: they would need to generate a pickle of their data in the correct format (which they would have to do in Python anyway), make sure everything is in the correct folders, and then run it from the command line, with some difficulty getting the commands right.
I think we need a simple class they can import, from which we can return the feature results, analyse them, and plot them. Just return them, and don't save them on the class object, for memory reasons. This will provide much easier functionality for a user, I believe. We only need one class: the hcga object.
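A rough sketch of what such a single user-facing class could look like (the class and method names here are hypothetical, not the current hcga API): results are returned directly rather than stored on the object.

```python
# Sketch: one importable object holding the dataset, with methods that
# return results instead of caching them on the instance.
class Hcga:
    def __init__(self, graphs, labels):
        self.graphs = graphs
        self.labels = labels

    def extract_features(self, feature_funcs):
        """Return one feature dict per graph, without storing the matrix."""
        return [
            {name: func(graph) for name, func in feature_funcs.items()}
            for graph in self.graphs
        ]
```

A notebook user would then do `h = Hcga(graphs, labels)` followed by `features = h.extract_features(...)`, with no command line or folder layout involved.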
on features that failed
I will add a feature summary: it will show a violin plot for a chosen feature and then display example graphs drawn from different values across the feature. This will be a nice way to understand how a feature differentiates graphs.
graph = graph_to_plot[i].get_graph("networkx")
File "/home/robert/Documents/PythonCode/hcga/hcga/graph.py", line 123, in get_graph
self.set_networkx()
File "/home/robert/Documents/PythonCode/hcga/hcga/graph.py", line 137, in set_networkx
for node, node_data in self.nodes.iterrows()
AttributeError: 'list' object has no attribute 'iterrows'
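The traceback suggests `self.nodes` sometimes arrives as a plain list of node ids rather than a pandas DataFrame, so `.iterrows()` fails. A hypothetical guard (the helper name is mine, not hcga's) would normalise it once before iterating:

```python
# Sketch: accept either a list of node ids or a DataFrame of node data,
# and always return a DataFrame so .iterrows() is safe to call.
import pandas as pd

def ensure_nodes_dataframe(nodes):
    if isinstance(nodes, list):
        return pd.DataFrame(index=nodes)  # bare ids, no node attributes
    return nodes
```

The real fix may instead be upstream, wherever the list is created, but normalising at the point of use makes `set_networkx` robust either way.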
We cannot load the DD dataset due to a bug:
exception local variable 'node_attributes' referenced before assignment
This also occurs for collab with:
exception local variable 'node_labels' referenced before assignment
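Both errors above are the classic pattern where a variable is only bound inside a conditional branch that did not run for these datasets. A defensive default fixes it; this is a sketch using the variable names from the error messages, not hcga's actual loader:

```python
# Sketch: bind node_attributes and node_labels up front so later
# references never hit "referenced before assignment".
def load_node_metadata(raw):
    node_attributes = None  # safe defaults
    node_labels = None
    if "attributes" in raw:
        node_attributes = raw["attributes"]
    if "labels" in raw:
        node_labels = raw["labels"]
    return node_attributes, node_labels
```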
this feature always fails
In the benchmark datasets the node labels and node features are one-hot encoded. The new version that extracts these datasets simply adds the raw node label (see the DD dataset, for example). To make hcga comparable with the other methods, we must also binarize the benchmark datasets.
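A possible binarization step (a sketch; sklearn's `LabelBinarizer` would do the same job): map each integer node label to a one-hot vector so the re-extracted datasets match the one-hot encoded originals.

```python
# Sketch: one-hot encode a list of node labels.
def one_hot_encode(labels):
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [
        [1 if index[lab] == i else 0 for i in range(len(classes))]
        for lab in labels
    ]
```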
We could use this to stack all the images in a single PDF report: https://matplotlib.org/3.2.1/api/backend_pdf_api.html - it is quite neat!
Add the power-law sum of squared errors to the statistics of distributions such as centrality. This feature was quite good on DD. To compute it, we must first bin the data (histogram), then find the best-fitting power-law distribution for the binned data, and take the SSE from that fit.
We originally binned into 10, 20 and 50 bins, fitted the power-law distribution, and extracted the SSE.
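The steps above can be sketched with numpy (an assumption on my part: the power law is fitted by least squares on the log-log bin centres, which presumes positive-valued data such as centralities; the original fitting method may differ):

```python
# Sketch: histogram the data, fit y = c * x**slope on the log-log bin
# centres, and return the sum of squared errors of the fit.
import numpy as np

def powerlaw_sse(data, bins=10):
    counts, edges = np.histogram(data, bins=bins, density=True)
    centres = (edges[:-1] + edges[1:]) / 2
    mask = counts > 0  # log of zero counts is undefined
    slope, intercept = np.polyfit(np.log(centres[mask]), np.log(counts[mask]), 1)
    fitted = np.exp(intercept) * centres[mask] ** slope
    return float(np.sum((counts[mask] - fitted) ** 2))
```

Running this for bins in (10, 20, 50) would reproduce the three variants mentioned.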
I noticed that under some operations, for example centrality_degree, there are comments regarding normal and exponential distributions. Does this mean that other features will be calculated by fitting the data to a normal or an exponential distribution, or will the data be fitted to the optimal distribution?
We should run cProfile, and cache (lru_cache) computations that are used several times, for a speed-up.
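A sketch of the caching idea: memoize a computation reused by several features with `functools.lru_cache`, and wrap the run in cProfile to see where it pays off. Note one caveat for hcga: `lru_cache` requires hashable arguments, so a graph would need a hashable key rather than the graph object itself.

```python
# Sketch: lru_cache a repeated computation and profile the loop.
import cProfile
import functools

@functools.lru_cache(maxsize=None)
def expensive_statistic(values):
    # placeholder for a computation shared by several features
    return sum(v * v for v in values)

def run_profiled():
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(1000):
        expensive_statistic((1, 2, 3))  # computed once, then served from cache
    profiler.disable()
    return expensive_statistic.cache_info()
```

`cache_info()` makes the benefit visible: after the first call, every further call is a cache hit.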
with some options (basic, medium, advanced)
We need the option to retain original graph features that might exist (or allow users to add them). E.g. we might know a property of a molecule, such as its bulk modulus, and want to incorporate that feature into the statistical learning.
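One way this could work (a sketch, not hcga's design; the `user_` prefix is my assumption to avoid name clashes): merge a user-supplied dict of known graph properties into each graph's computed feature dict.

```python
# Sketch: append user-supplied graph properties to the computed features.
def merge_user_features(computed, user_supplied):
    """Return computed features plus prefixed user features."""
    merged = dict(computed)
    for name, value in user_supplied.items():
        merged["user_" + name] = value
    return merged
```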
I wanted to launch a discussion on how we would like the main command(s) of hcga to look, before we go off in multiple directions.
Currently there are two commands, feature_analysis and plot_analysis, to do the classification of the graphs and to plot it. Does it make sense to have a separate plot_analysis command? Do we want to give the user tons of different plotting options so that they will want to run it multiple times? In my view it is probably not necessary. As I understand it, the user would run the main classification once and then look at the standard plots produced. They would do their own plotting if they want something that looks different. They might play with the settings of the classification, but then it would have to run many times anyway, and producing the plots is not the computationally expensive part.
Another aspect is that it is currently a single command that takes as an option the kind of analysis to be done: either straight classification with sklearn, or with SHAP values. As I understand it, the SHAP analysis just computes the values on top of the same exact training of the random forest or boosted model, so it does not change things that much. But there is the option of making two different commands if we feel that the analyses are very different.
I would like to have a feel as to what the end goal is in terms of user experience. What do we want the user to be able to do and what kind of data do they want at the end to be able to look at?
For instance, while the user does not need to look at the initial dataset, they might want to have a look in the features extracted and the analysis results with some other programs. It might be interesting to save the relevant data as text files instead of python pickles then.
There was a mention earlier of implementing the plots with plotly such that one could hover over the data points and get detailed information about the features (using the descriptions). Is that something that we want to get into? Plots could be saved as html files for instance, which gives this option to hover later on.
I'm not sure if it is possible to change the plotting library when plotting using the functions in shap. It might not be possible, and then it is maybe not worth fighting too much over the default. What do you think?
We can produce a large dataset with lots of different graph types: real, classic, random, etc. We then compute all their features. For a given new graph, we can output a set of similar graph types with respect to the feature space. This will improve interpretability.
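A sketch of the similarity lookup (the function name and the choice of Euclidean distance are assumptions, not hcga's design): embed the new graph in the shared feature space and return the nearest reference graph types.

```python
# Sketch: rank reference graphs by Euclidean distance in feature space.
import math

def most_similar(query_features, reference, k=3):
    """reference: dict mapping graph-type name -> feature vector."""
    def dist(vec):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query_features, vec)))
    return sorted(reference, key=lambda name: dist(reference[name]))[:k]
```

In practice the features would need normalising first, since raw feature scales differ wildly.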
If the user gives a graph label that is not an integer, save it in another variable and assign a label_id to it.
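A possible implementation of the label_id idea (a sketch; `encode_labels` is a name I made up): keep the original labels in a side table and hand integer ids to the classifier.

```python
# Sketch: map arbitrary graph labels to integer ids, keeping a lookup
# table so the original labels can be recovered for reporting.
def encode_labels(labels):
    """Return (integer ids, id -> original label lookup)."""
    lookup = {}
    ids = []
    for label in labels:
        if label not in lookup:
            lookup[label] = len(lookup)
        ids.append(lookup[label])
    id_to_label = {i: lab for lab, i in lookup.items()}
    return ids, id_to_label
```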
This occurred before when the trivial graph did not have features and as such did not add the node_feature to the possible features.
"Feature {} does not exist in class {}".format(feature_name, self.name)
Exception: Feature node_feature1_mean does not exist in class node_features
We may want to consider sorting the pandas dataframe according to graph label for easier exploration.
Also, you mentioned something about info_ict, but I'm not sure what you're referring to.
To make hcga even better, I think we should input the list of graphs as adjacency matrices with scipy.sparse instead of networkx objects. Then, if a feature computation requires networkx, we convert it (perhaps only once, in the feature class). That way, adding features based not on networkx but on igraph or any custom package (some are faster to compute without networkx) would be easier than converting back from networkx.
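The convert-only-once idea can be sketched backend-agnostically (class and method names are hypothetical): store the adjacency in whatever form it arrived in and build the networkx graph lazily, on first request, via an injected converter such as networkx's scipy-sparse constructor.

```python
# Sketch: hold the raw adjacency and convert to networkx at most once,
# only when a feature actually asks for it.
class LazyGraph:
    def __init__(self, adjacency, to_networkx):
        self._adjacency = adjacency
        self._to_networkx = to_networkx  # e.g. a scipy.sparse -> nx converter
        self._nx_graph = None

    @property
    def adjacency(self):
        return self._adjacency

    def as_networkx(self):
        if self._nx_graph is None:  # convert once, then reuse
            self._nx_graph = self._to_networkx(self._adjacency)
        return self._nx_graph
```

Features that work directly on the sparse matrix never trigger the conversion at all.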
Not sure how to do this using shap?
so we can run sklearn/shap independently
We need the bare minimum in each feature class docstring, with a link to the networkx function if it is networkx-based, so the user can see there what the function does. Anything else the user may want to know could be added, too.
lint + docstring + pandas integration + graph class integration
add node degree as feature to convolute
so we can for example try to classify all pairs of classes to get a matrix of accuracies (classes by classes)
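A sketch of the pairwise idea (the callable-based interface is my assumption; in hcga this would wrap the sklearn training): for every pair of classes, restrict the dataset to those two classes and record the resulting accuracy in a classes-by-classes matrix.

```python
# Sketch: build a matrix of pairwise classification accuracies, given a
# train_and_score callable that returns the accuracy for a sub-dataset.
from itertools import combinations

def pairwise_accuracies(features, labels, train_and_score):
    classes = sorted(set(labels))
    matrix = {c: {d: None for d in classes} for c in classes}
    for a, b in combinations(classes, 2):
        subset = [(x, y) for x, y in zip(features, labels) if y in (a, b)]
        xs, ys = zip(*subset)
        acc = train_and_score(list(xs), list(ys))
        matrix[a][b] = matrix[b][a] = acc  # symmetric
    return matrix
```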
I thought it would be simpler to have the argument of the commands directly be the pickled file to load. As it is, both extract_features and feature_analysis require specifying both a folder name and a dataset name (without any file extension, as the io.py function later adds .pkl by hand).
Wouldn't it be easier and more straightforward to let the user specify the dataset (with its extension) at any path? Then one could use automatic completion in the terminal to add the file.
If other people agree I can implement it.