
jajupmochi / graphkit-learn


A python package for graph kernels, graph edit distances, and graph pre-image problem.

Home Page: https://graphkit-learn.readthedocs.io

License: GNU General Public License v3.0

Languages: Jupyter Notebook 95.84%, Python 3.80%, Shell 0.01%, C++ 0.16%, Cython 0.19%
Topics: chemoinformatics, graph-edit-distance, graph-kernels, graph-representations, kernel-methods, machine-learning, paths, pattern-recognition, pre-image, walks

graphkit-learn's People

Contributors

bgauzere, gitter-badger, jajupmochi


graphkit-learn's Issues

function generate_median_preimages_by_class() does not work correctly sometimes

When using the function generate_median_preimages_by_class(), the results after the first class are not correct when I run the code in an Ubuntu terminal with python3.

The answer here says it has something to do with Cython and conda. I do have conda installed, so I tried a fresh virtual environment without conda and the results seem correct, but I still do not know exactly why.

Reproducing code example:

Here is the test.py:

import multiprocessing
import functools
from gklearn.utils.kernels import deltakernel, gaussiankernel, kernelproduct
from gklearn.preimage.utils import generate_median_preimages_by_class


def xp_median_preimage_1_1():
	"""xp 1_1: Letter-high, sspkernel.
	"""
	# set parameters.
	ds_name = 'Letter-high'
	mpg_options = {'fit_method': 'k-graphs',
				   'init_ecc': [3, 3, 1, 3, 3],
				   'ds_name': ds_name,
				   'parallel': True, # False
				   'time_limit_in_sec': 0,
				   'max_itrs': 100,
				   'max_itrs_without_update': 3,
				   'epsilon_residual': 0.01,
				   'epsilon_ec': 0.1,
				   'verbose': 2}
	mixkernel = functools.partial(kernelproduct, deltakernel, gaussiankernel)
	sub_kernels = {'symb': deltakernel, 'nsymb': gaussiankernel, 'mix': mixkernel}
	kernel_options = {'name': 'structuralspkernel',
					  'edge_weight': None,
					  'node_kernels': sub_kernels,
					  'edge_kernels': sub_kernels, 
					  'compute_method': 'naive',
					  'parallel': 'imap_unordered', 
# 						  'parallel': None, 
					  'n_jobs': multiprocessing.cpu_count(),
					  'normalize': True,
					  'verbose': 2}
	ged_options = {'method': 'IPFP',
				   'initialization_method': 'RANDOM', # 'NODE'
				   'initial_solutions': 1, # 1
				   'edit_cost': 'LETTER2',
				   'attr_distance': 'euclidean',
				   'ratio_runs_from_initial_solutions': 1,
				   'threads': multiprocessing.cpu_count(),
				   'init_option': 'EAGER_WITHOUT_SHUFFLED_COPIES'}
	mge_options = {'init_type': 'MEDOID',
				   'random_inits': 10,
				   'time_limit': 600,
				   'verbose': 2,
				   'refine': False}
	save_results = True
	
	# print settings.
	print('parameters:')
	print('dataset name:', ds_name)
	print('mpg_options:', mpg_options)
	print('kernel_options:', kernel_options)
	print('ged_options:', ged_options)
	print('mge_options:', mge_options)
	print('save_results:', save_results)
	
	# generate preimages.
	for fit_method in ['k-graphs', 'expert', 'random', 'random', 'random']:
		print('\n-------------------------------------')
		print('fit method:', fit_method, '\n')
		mpg_options['fit_method'] = fit_method
		generate_median_preimages_by_class(ds_name, mpg_options, kernel_options, ged_options, mge_options, save_results=save_results, save_medians=True, plot_medians=True, load_gm='auto', dir_save='../results/xp_median_preimage/')


if __name__ == "__main__":
	
	#### xp 1_1: Letter-high, sspkernel.
	xp_median_preimage_1_1()

Error message:

When I run it in an Ubuntu terminal:

python3 test.py

The output results are not correct after the first class. However, if I remove the first class before the computation, then the results for the first class of the remainder (the original second class) are correct, and the results for the new second class (the original third class) are wrong. This problem does not occur in the Spyder 3 (4.1.1) console with IPython 7.0.1, nor in a fresh virtualenv with only the required Python modules installed.

graphkit-learn/Python version information:

Python 3.6.9
graphkit-learn 0.1
Ubuntu 18.04.4 LTS

A question about normalization in model_selection_precomputed.py

Hi Linlin, I have two questions:
1. In model_selection_precomputed.py, at line 144 I see that you want to delete graphs whose kernels with themselves are zero:

remove graphs whose kernels with themselves are zeros

i.e. the rows and columns of Kmatrix at the indices where Kmatrix_diag is zero are removed.
But at line 151, the normalization again reads Kmatrix_diag and uses it to rescale Kmatrix. With my own test data (I could not get your complete code to run), I found that if a zero element remains in Kmatrix_diag after the previous step, the product here is zero and the division fails (division by zero). What is the purpose of this formula, and have you run into this bug?

normalization

      Kmatrix[i][j] /= np.sqrt(Kmatrix_diag[i] * Kmatrix_diag[j])
      Kmatrix[j][i] = Kmatrix[i][j]
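
For reference, here is a minimal sketch of this cosine-style normalization with a guard against zero diagonal entries; the guard is my own assumption about how the division by zero could be avoided, not what model_selection_precomputed.py currently does:

import numpy as np

def normalize_kernel_matrix(Kmatrix):
    # Cosine-normalize a precomputed kernel matrix in place.
    Kmatrix_diag = Kmatrix.diagonal().copy()
    n = len(Kmatrix)
    for i in range(n):
        for j in range(i, n):
            denom = np.sqrt(Kmatrix_diag[i] * Kmatrix_diag[j])
            if denom == 0:
                continue  # assumed guard: skip pairs whose self-kernel is zero
            Kmatrix[i][j] /= denom
            Kmatrix[j][i] = Kmatrix[i][j]
    return Kmatrix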

2. Also, in several .ipynb files starting with "run_" and "test_" I found a few incorrect variable names (capitalization and the like; maybe leftovers from different versions, or maybe I just missed something). Could you tell me which parts of this code base are current and usable, and which are old and deprecated? During testing I ran into quite a few problems with the .ipynb files under the notebooks directory. Any guidance would be much appreciated, thanks!

Citing

How can I cite the library?

Key Error gklearn.kernels.treeletKernel

For some graphs, gklearn throws a KeyError when generating canonical keys. This does not happen for all graphs; I assume it is limited to this pattern. Help would be highly appreciated!

File ~\Anaconda3\lib\site-packages\gklearn\kernels\treeletKernel.py:128, in treeletkernel(sub_kernel, node_label, edge_label, parallel, n_jobs, chunksize, verbose, *args)
126 canonkeys = []
127 for g in (tqdm(Gn, desc='getting canonkeys', file=sys.stdout) if verbose else Gn):
--> 128 canonkeys.append(get_canonkeys(g, node_label, edge_label, labeled,
129 ds_attrs['is_directed']))
131 # compute kernels.
132 from itertools import combinations_with_replacement

File ~\Anaconda3\lib\site-packages\gklearn\kernels\treeletKernel.py:324, in get_canonkeys(G, node_label, edge_label, labeled, is_directed)
322 treelet = []
323 for pattern in patterns[str(i) + 'star']:
--> 324 canonlist = [tuple((G.nodes[leaf][node_label],
325 G[leaf][pattern[0]][edge_label])) for leaf in pattern[1:]]
326 canonlist.sort()
327 canonlist = list(chain.from_iterable(canonlist))

File ~\Anaconda3\lib\site-packages\gklearn\kernels\treeletKernel.py:325, in <listcomp>(.0)
322 treelet = []
323 for pattern in patterns[str(i) + 'star']:
324 canonlist = [tuple((G.nodes[leaf][node_label],
--> 325 G[leaf][pattern[0]][edge_label])) for leaf in pattern[1:]]
326 canonlist.sort()
327 canonlist = list(chain.from_iterable(canonlist))

File ~\Anaconda3\lib\site-packages\networkx\classes\coreviews.py:51, in AtlasView.__getitem__(self, key)
50 def __getitem__(self, key):
---> 51 return self._atlas[key]

KeyError: 1
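
For context, the failing line indexes G[leaf][pattern[0]][edge_label], and in networkx G[u][v] raises a KeyError whenever u and v are not adjacent, which appears to be what happens here. A minimal sketch with a toy graph of my own (not one of the failing graphs):

import networkx as nx

G = nx.Graph()
G.add_nodes_from([0, 1, 2], atom='C')
G.add_edge(0, 2, bond_type='1')

print(G[0][2]['bond_type'])      # works: nodes 0 and 2 are adjacent
try:
    G[0][1]['bond_type']         # nodes 0 and 1 are not adjacent
except KeyError as err:
    print('KeyError:', err)      # raised by AtlasView.__getitem__, as in the traceback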

Weisfeiler_Lehman graph kernel

Hello, I am just getting started with graphs. When using the Weisfeiler-Lehman graph kernel, must the adjacency matrix be symmetric (i.e. must the graph be undirected)?
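
Independent of gklearn's Weisfeiler-Lehman implementation, here is a small sketch of what symmetry means for the adjacency matrix, and how a directed graph can be symmetrized with plain networkx (the example graph is my own):

import networkx as nx
import numpy as np

DG = nx.DiGraph([(0, 1), (1, 2)])   # directed graph: adjacency matrix is not symmetric
A = nx.to_numpy_array(DG)
print(np.allclose(A, A.T))          # False

UG = DG.to_undirected()             # drop edge directions to obtain a symmetric matrix
B = nx.to_numpy_array(UG)
print(np.allclose(B, B.T))          # True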

Request for the atom types labels of NCI1 datasets

I have searched almost all of the literature, but have not found the atom types corresponding to the node labels of the NCI1 dataset. We know that NCI1 consists of molecular graphs whose atoms fall into 37 categories, just as the molecules in the MUTAG dataset are composed of atoms from 7 categories: {0: 'C', 1: 'N', 2: 'O', 3: 'F', 4: 'I', 5: 'Cl', 6: 'Br'}. I would sincerely appreciate any reply!

In ./notebooks/utils/plot_all_graphs.py, it plots graphs in MUTAG dataset with node(atom) labels like this:

# line [19 - 40]
    dataset, y = loadDataset("../../datasets/MUTAG/MUTAG_A.txt")
    for idx in [6]: #[65]:#
        G = dataset[idx]
        ncolors= []
        for node in G.nodes:
            if G.nodes[node]['atom'] == '0':
                G.nodes[node]['atom'] = 'C'
                ncolors.append('#bd3182')
            elif G.nodes[node]['atom'] == '1':
                G.nodes[node]['atom'] = 'N'
                ncolors.append('#3182bd')
            elif G.nodes[node]['atom'] == '2':
                G.nodes[node]['atom'] = 'O'
                ncolors.append('#82bd31')
            elif G.nodes[node]['atom'] == '3':
                G.nodes[node]['atom'] = 'F'
            elif G.nodes[node]['atom'] == '4':
                G.nodes[node]['atom'] = 'I'
            elif G.nodes[node]['atom'] == '5':
                G.nodes[node]['atom'] = 'Cl'
            elif G.nodes[node]['atom'] == '6':
                G.nodes[node]['atom'] = 'Br'
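
For comparison, the same relabeling can be written with a lookup table; the mapping below is the MUTAG one quoted above, and the analogous 37-entry table for NCI1 is exactly what this issue asks for. The loadDataset import path is taken from graphfiles.py as discussed in the issue below, so treat it as an assumption for gklearn 0.1:

from gklearn.utils.graphfiles import loadDataset  # import path assumed from graphfiles.py

# MUTAG node labels -> atom symbols, as quoted from plot_all_graphs.py above.
MUTAG_ATOMS = {'0': 'C', '1': 'N', '2': 'O', '3': 'F', '4': 'I', '5': 'Cl', '6': 'Br'}

dataset, y = loadDataset("../../datasets/MUTAG/MUTAG_A.txt")
for G in dataset:
    for node in G.nodes:
        G.nodes[node]['atom'] = MUTAG_ATOMS[G.nodes[node]['atom']]
# The missing piece requested here is the corresponding 37-entry table for NCI1.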

Is there any chance to add more node_label and edge_label?

Hi, Lin. In the file 'graphfiles.py' you use the 'loadDataset' function to read a dataset and build graphs. I found that all the functions for the different dataset formats, like 'loadCT', 'loadGXL', 'loadSDF', etc., usually add only one node label, 'atom', and one edge label, 'bond_type'. Is there any chance to add more node labels and edge labels to make the classification more accurate?
Thanks, man.
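
Until such an option exists, extra labels can be attached by hand to the networkx graphs returned by loadDataset; a minimal sketch, where the attribute names 'charge' and 'bond_order' are placeholders of my own, not attributes the loaders provide:

from gklearn.utils.graphfiles import loadDataset  # loader discussed above

dataset, y = loadDataset("../../datasets/MUTAG/MUTAG_A.txt")
for G in dataset:
    for n in G.nodes:
        G.nodes[n]['charge'] = 0            # hypothetical extra node label
    for u, v in G.edges:
        G.edges[u, v]['bond_order'] = 1.0   # hypothetical extra edge label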
