
hcga's People

Contributors

arnaudon, ashermullokandov, daniel-mietchen, henrypalasciano, lzhan94swu, mauriciobarahona, misterblonde, nrbernier, peach-lucien

hcga's Issues

feature summary plot

By default, plot the top 10 features?
Plot one graph per class instead of random graphs.

Feature class add_feature

add_feature is set up so that if an error occurs when calling a function, it returns nan. However, sometimes we want to compute multiple things from a single function: e.g. I compute the community structure using asyn_fluid, and from that output I will create 5 or 6 features. I can't put the computation of the community structure directly into add_feature for each separate feature, because otherwise I'd need to run the asyn_fluid computation multiple times. However, sometimes the original call to asyn_fluid will raise an error, and each derived feature should then be given nan. What is the best way around this?
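One pattern is to wrap the single expensive call in a try/except and derive all features from its result, returning nan for every one of them on failure. A sketch using networkx's `asyn_fluidc` (the function and feature names below are illustrative, not hcga's API):

```python
import math

import networkx as nx


def community_features(graph, n_communities=2):
    """Derive several features from one community computation.

    asyn_fluidc is called once; if it raises (e.g. on a graph with too
    few nodes), every derived feature becomes nan, mirroring what
    add_feature does for a single feature.
    """
    try:
        communities = list(
            nx.algorithms.community.asyn_fluidc(graph, n_communities, seed=42)
        )
    except Exception:
        return {"num_communities": math.nan, "largest_community": math.nan}
    return {
        "num_communities": len(communities),
        "largest_community": max(len(c) for c in communities),
    }
```

The nan dict keeps the feature list identical whether or not the underlying call succeeded.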

add logging for failed feature computations

If the computation of a feature fails, we need to record the exceptions and save them somewhere, so that we know what failed and where, and can trace back to the relevant dataset/feature computations. We could use the logging module for that.
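A minimal sketch with the standard logging module (function and logger names are assumptions, not hcga's code): `exc_info=True` attaches the full traceback to the log record, so a failure can be traced back later.

```python
import logging
import math

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("hcga.features")


def safe_compute(feature_name, func, *args):
    """Run one feature computation, logging the traceback on failure."""
    try:
        return func(*args)
    except Exception:
        # the traceback is stored in the log record, so we know
        # what failed and where, without interrupting extraction
        logger.warning("feature %s failed", feature_name, exc_info=True)
        return math.nan
```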

failing features

If a feature fails, it gets caught and retried on the toy graph to see whether it would have returned a float or a list. The problem is that in some cases even the simple graph will fail (comm_asyn, which needs enough nodes to work), and this is not caught. I have commented out this piece of code for now, but it may break things down the line. Not sure what the best strategy is for that.

  1. try to somehow determine whether a function returns a float or a list (https://docs.python.org/3/library/typing.html, but it is a pain to annotate all the add_feature functions)
  2. add a flag in add_feature to say that the result is a list and compute statistics on it
  3. just return a nan, but the feature lists will then differ across graphs, so an extra postprocessing step is needed

I'm more in favor of 3 for now.
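Option 3's postprocessing step could be a single alignment pass after extraction. A sketch assuming each graph produces a plain dict of whichever features succeeded (feature names illustrative): building a DataFrame from the dicts aligns the differing feature lists, filling gaps with NaN.

```python
import pandas as pd

# each graph yields a dict of the features that succeeded
per_graph_features = [
    {"degree_mean": 2.0, "comm_asyn_n": 3},  # all computations succeeded
    {"degree_mean": 1.5},                    # comm_asyn failed on this graph
]

# stacking the dicts aligns columns across graphs; missing features become NaN
features = pd.DataFrame(per_graph_features)
```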

save datasets in .h5

Pickle does not work well for large datasets; it causes memory errors. We should investigate this, and maybe use .h5 to save the processed dataset.
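A sketch of writing the processed features to HDF5 with h5py (one assumed option; pandas' HDFStore would be another). Unlike pickle, HDF5 supports chunked, compressed storage and partial reads, so the whole dataset never has to sit in memory at once.

```python
import os
import tempfile

import h5py
import numpy as np

features = np.random.rand(100, 50)  # placeholder for a processed feature matrix

path = os.path.join(tempfile.mkdtemp(), "dataset.h5")
with h5py.File(path, "w") as f:
    # gzip compression keeps large feature matrices small on disk
    f.create_dataset("features", data=features, compression="gzip")

with h5py.File(path, "r") as f:
    first_rows = f["features"][:10]  # partial read, no full load into memory
```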

make a better add_feature function

It is becoming quite messy, and now we have a clustering_analysis thing on top of the distributions. We may want to rethink this a bit and get something cleaner.
I'm quite in favour of having an option on add_feature to compute statistics, clustering stats, or something else.

Variability in output of SHAP

I was playing around, running the SHAP analysis multiple times on the same feature matrix. I noticed some slight variation in the order of the SHAP values listed by importance. There is also variation in the "balanced accuracy": it's almost always the exact same number, but sometimes it's off (0.86 instead of 0.90, for instance).

I guess it depends on the training of the classifier, and maybe on the way the SHAP values are computed, which as I vaguely understand requires some kind of sampling. Anyway, one might want to look into it.
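Pinning `random_state` on both the train/test splitter and the classifier would isolate the remaining variation to the SHAP sampling itself. A sketch on synthetic data (not hcga's pipeline; hyperparameters are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=0)


def run(seed):
    """Train and score with every source of sklearn randomness pinned."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_tr, y_tr)
    return balanced_accuracy_score(y_te, clf.predict(X_te))
```

With the seeds fixed, repeated runs give identical balanced accuracy; any drift that remains must come from the SHAP computation.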

node labels/attributes

Do we want to compute features for node labels and node attributes independently, instead of stacking them together? They may encode quite different data, so it may be worth doubling the number of features.

class structure

For user experience, users will want to simply open a notebook, import a class, and load their data from there. I think that forcing them to use the command line with a custom dataset is going to cause a lot of problems: they would need to generate a pickle of their data in the correct format (which they would have to do in Python anyway), make sure everything is in the correct folders, and then run it from the command line, with some difficulty in getting the commands right.

I think we need a simple class they can import, from which we can return the feature results, analyse them, and plot them. Just return the results, and don't save them on the class object, for memory reasons. This will provide much easier functionality for a user, I believe. We only need one class: the hcga object.
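A toy sketch of that single importable object (class and method names are illustrative, not the final API). Note that the features are returned rather than stored on the instance, per the memory point above:

```python
class Hcga:
    """Toy stand-in for the single user-facing class proposed above."""

    def __init__(self, graphs, labels):
        self.graphs = graphs  # e.g. a list of graphs the user built in a notebook
        self.labels = labels

    def extract_features(self):
        # placeholder feature (graph size); results are returned, not cached,
        # to keep the object's memory footprint small
        return [{"n_nodes": len(g)} for g in self.graphs]
```

A notebook user would then do `Hcga(graphs, labels).extract_features()` with no command line, pickling, or folder layout involved.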

add feature summary

I will add a feature summary: it will show a violin plot for a chosen feature and then display example graphs from different values across that feature. This will be nice for understanding how a feature differentiates graphs.

bug in the plot feature summary function

graph = graph_to_plot[i].get_graph("networkx")

File "/home/robert/Documents/PythonCode/hcga/hcga/graph.py", line 123, in get_graph
self.set_networkx()
File "/home/robert/Documents/PythonCode/hcga/hcga/graph.py", line 137, in set_networkx
for node, node_data in self.nodes.iterrows()
AttributeError: 'list' object has no attribute 'iterrows'
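The traceback says `self.nodes` arrived as a plain list where a DataFrame was expected. One possible guard is to normalise before calling `iterrows()` (the helper name and list-of-node-ids assumption are illustrative, not the actual fix in graph.py):

```python
import pandas as pd


def as_node_frame(nodes):
    """Return nodes as a DataFrame so iterrows() is always available."""
    if isinstance(nodes, pd.DataFrame):
        return nodes
    # a bare list is treated as node ids with no attribute columns
    return pd.DataFrame(index=nodes)
```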

benchmark dataset bug

We cannot load the DD dataset due to a bug:
exception local variable 'node_attributes' referenced before assignment

This also occurs for collab with:
exception local variable 'node_labels' referenced before assignment
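Both errors are the classic "referenced before assignment" pattern: a variable is only bound inside a conditional branch that these datasets skip (DD has no node attributes, COLLAB no node labels). A sketch of the usual fix, with the loader name and arguments invented for illustration:

```python
def load_graph_data(raw, has_attributes):
    """Illustrative loader: optional fields are bound on every code path."""
    node_attributes = None  # initialised before the conditional
    node_labels = None
    if has_attributes:
        node_attributes = raw.get("attributes")
        node_labels = raw.get("labels")
    return node_attributes, node_labels
```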

node features are one hot encoded in benchmarks

In the benchmark datasets the node labels and node features are one-hot encoded. The new version that extracts these datasets simply adds the node label (see the DD dataset, for example). To make it comparable with the other methods, we must also binarize the benchmark datasets.
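Binarising integer node labels can be a one-liner with pandas (the `prefix` is assumed for readability): `get_dummies` expands each label value into a 0/1 indicator column.

```python
import pandas as pd

# integer node labels as they come out of the benchmark loader
node_labels = pd.Series([0, 2, 1, 2], name="label")

# one indicator column per distinct label value
one_hot = pd.get_dummies(node_labels, prefix="label")
```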

powerlaw feature

Add the power-law sum of squared errors (SSE) to the statistics of distributions such as centrality. This feature was quite good on DD. To compute it, we must first bin the data (histogram), then find the best-fitting power-law distribution for the binned data, and take the SSE from this.

We originally binned into 10, 20 and 50 bins, fitted the power-law distribution, and extracted the SSE.
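A sketch of the feature as described; the density normalisation and the `a * x**b` form of the fit are assumptions rather than the original implementation, and the 10/20/50-bin variants would just call this three times with different `n_bins`.

```python
import numpy as np
from scipy.optimize import curve_fit


def powerlaw_sse(values, n_bins=10):
    """Bin the data, fit a power law a * x**b to the binned density,
    and return the sum of squared errors of the fit."""
    counts, edges = np.histogram(values, bins=n_bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    mask = counts > 0  # only fit bins with mass; heavy tails can leave empty bins
    fit = lambda x, a, b: a * np.power(x, b)
    params, _ = curve_fit(fit, centers[mask], counts[mask], p0=[1.0, -1.0], maxfev=5000)
    return float(np.sum((counts[mask] - fit(centers[mask], *params)) ** 2))
```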

Distributions

I noticed that under some operations, for example centrality_degree, there are comments regarding normal and exponential distributions. Does this mean that other features will be calculated by fitting the data to a normal or an exponential distribution, or will the data be fitted to the optimal distribution?

profiling + caching

We should run cProfile, and cache (lru_cache) computations that are used several times, for a speed-up.
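A sketch combining the two standard-library tools (the profiled function is a stand-in for a computation shared by several features): `lru_cache` makes repeat calls free, and cProfile/pstats show where the remaining time goes.

```python
import cProfile
import io
import pstats
from functools import lru_cache


@lru_cache(maxsize=None)
def expensive(n):
    """Stand-in for a computation reused by several features."""
    return sum(i * i for i in range(n))


def profile_run():
    """Profile repeated calls; only the first actually computes."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(100):
        expensive(10_000)
    profiler.disable()
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
    return stream.getvalue()
```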

graph features

We need to allow the option of retaining original graph features (or allow users to add them) where they exist. E.g. we might know a property of a molecule, such as its bulk modulus, and want to incorporate that feature into the statistical learning.

User experience for feature analysis

I wanted to launch a discussion on how we would like the main command(s) of hcga to look, before we go in multiple directions.

Currently there are two commands, feature_analysis and plot_analysis, to classify the graphs and plot the results. Does it make sense to have a separate plot_analysis command? Do we want to give the user tons of different plotting options so that they will want to run it multiple times? In my view, probably not. As I understand it, the user would run the main classification once and then look at some standard plots; they would do their own plotting if they want something different. They might play with the classification settings, but then it would have to run many times anyway, and producing the plots is not the computationally expensive part.

Another aspect is that it is currently a single command that takes the kind of analysis as an option: either straight classification with sklearn, or with SHAP values. As I understand it, the SHAP analysis just computes the values on top of the same training of the random forest or boosting, so it does not change things that much. But there is the option of two different commands if we feel the analyses are very different.

I would like to get a feel for what the end goal is in terms of user experience. What do we want the user to be able to do, and what kind of data do they want to look at in the end?

For instance, while the user does not need to look at the initial dataset, they might want to inspect the extracted features and the analysis results with other programs. It might then be interesting to save the relevant data as text files instead of Python pickles.

Plot styles and feature readability

There was an earlier mention of implementing the plots with plotly, so that one could hover over the data points and get detailed information about the features (using the descriptions). Is that something we want to get into? Plots could be saved as html files, for instance, which preserves the hover option later on.

I'm not sure whether it is possible to change the plotting library when plotting via the functions in shap. It might not be, and then it may not be worth fighting too much with the default. What do you think?

compute a large dataset for interpretability comparisons

We can produce a large dataset with lots of different graph types: real, classic, random, etc. We then compute all their features. For a given new graph, we can output a set of similar graph types with respect to the feature space. This will improve interpretability.

node feature failing when compiling features after extraction

This occurred before when the trivial graph did not have node features, and as such did not add node_feature to the possible features.

"Feature {} does not exist in class {}".format(feature_name, self.name)
Exception: Feature node_feature1_mean does not exist in class node_features

pandas dataframe sorting

We may want to consider sorting the pandas dataframe according to graph label for easier exploration.
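The sorting could be as simple as the following sketch (the "label" column name is an assumption): a stable sort groups graphs of the same class together while preserving the original order within each class.

```python
import pandas as pd

# feature dataframe with the graph label kept as a column
features = pd.DataFrame(
    {"label": [1, 0, 1, 0], "feat_mean": [2.0, 1.5, 2.2, 1.1]}
)

# stable sort: same-class graphs become adjacent, original order kept within a class
features_sorted = features.sort_values("label", kind="stable")
```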

Also, you mentioned something about info_ict, but I'm not sure what you're referring to?

get rid of networkx from above

To make hcga even better, I think we should input the list of graphs as adjacency matrices in scipy.sparse format instead of networkx objects. Then, if a feature computation requires networkx, we convert (maybe only once, in the feature class or similar). That way, adding features not based on networkx (some are faster to compute without it) but on igraph or any custom package would be easier than converting back from networkx.
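A toy sketch of the lazy, convert-at-most-once idea, assuming networkx ≥ 3 (`from_scipy_sparse_array`); this is a stand-in, not hcga's actual Graph class:

```python
import networkx as nx
import scipy.sparse as sp


class Graph:
    """Store the adjacency sparsely; build a networkx view only on demand."""

    def __init__(self, adjacency):
        self.adjacency = sp.csr_matrix(adjacency)  # canonical representation
        self._nx = None  # lazily-built networkx view

    def to_networkx(self):
        if self._nx is None:  # convert at most once, then cache
            self._nx = nx.from_scipy_sparse_array(self.adjacency)
        return self._nx
```

Features based on igraph or custom code would read `graph.adjacency` directly and never pay for the conversion.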

clean the docs of features classes

We need the bare minimum in each feature class docstring, with a link to the networkx function if it is networkx-based, so the user can see there what the function is doing. Anything else the user may want to know can be added too.

Simplifying file inputs for app.py commands

I thought it would be simpler for the argument of the commands to be the pickled file to load directly. As it is, both extract_features and feature_analysis require specifying both a folder name and a dataset name (without any extension, as the io.py function later adds .pkl by hand).

Wouldn't it be easier and more straightforward to let the user specify the dataset (with its extension) at any path? Then one can use tab completion in the terminal to fill in the file.

If other people agree I can implement it.
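Assuming the commands are click-based, the proposed interface could look like the following sketch (the command name and echo are illustrative): `click.Path(exists=True)` validates the file up front, and tab-completed paths work without any extension handling in io.py.

```python
import click


@click.command()
@click.argument("dataset", type=click.Path(exists=True, dir_okay=False))
def extract_features(dataset):
    """Take the dataset file itself, extension included, at any path."""
    click.echo(f"loading {dataset}")
```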
