Hi Elior, I am trying to use the library to vectorise the nodes and

failing simple clustering of synthetic networks (node2vec issue, closed)

yuanmohe commented on June 18, 2024

Comments (2)

eliorc commented on June 18, 2024

I can totally see why you are puzzled by this :)

It is in the README, but apparently not clearly enough 😅: output node names are ALWAYS strings.
When you write m_node.wv[0], the integer is treated as a position, so you get the vector stored at index 0 of the model's vocabulary. That is not guaranteed to be (and most likely won't be) the vector of the node labelled 0 in the networkx graph.

In the line vectors = {node: m_node.wv[node] for node in G.nodes()}, cast each node to a string, i.e. change it to vectors = {node: m_node.wv[str(node)] for node in G.nodes()}, and it will work as expected.
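If you want the cast in one place, a small helper along these lines could work. This is only a sketch: node_vectors is a hypothetical name, not part of node2vec or gensim, and a plain dict stands in for m_node.wv so the example runs without training a model.

```python
def node_vectors(wv, nodes):
    """Map each original node label to its vector, looked up by str(node)."""
    vectors = {}
    for node in nodes:
        key = str(node)  # embeddings are keyed by string, whatever the node type
        if key not in wv:
            raise KeyError(f"node {node!r} (key {key!r}) has no embedding")
        vectors[node] = wv[key]
    return vectors


# Stand-in for m_node.wv (assumption: only `in` and [] lookups are needed).
toy_wv = {"0": [0.1, 0.2], "1": [0.3, 0.4]}
result = node_vectors(toy_wv, [0, 1])
print(result)
```

With the real model the call would presumably be node_vectors(m_node.wv, G.nodes()). The explicit KeyError makes a missing or mistyped node fail loudly instead of silently returning some other node's vector.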

Here is the code with this small change and the result, hope this makes sense :)

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from node2vec import Node2Vec

# create the synthetic network
sizes = [300, 300, 300]
probs = [[0.2, 0.01, 0.01], [0.01, 0.2, 0.01], [0.01, 0.01, 0.2]]
G = nx.stochastic_block_model(sizes, probs, seed=0)
node_colors = [0] * 300 + [1] * 300 + [2] * 300

# train the model
m_node = Node2Vec(G, dimensions = 128, p = 1, q = 1, walk_length = 80, num_walks = 10).fit()

# get the vectors
vectors = {node: m_node.wv[str(node)] for node in G.nodes()}
vectors = pd.DataFrame.from_dict(vectors, orient = 'index')

# prepare dataframe for visualisation
model = TSNE(n_components=2)
df = model.fit_transform(vectors)
df = pd.DataFrame(df, index = G.nodes())
df = df.rename(columns = {0: 'dim1', 1: 'dim2'})
df = df.assign(color = node_colors)

# plot the 2-d t-SNE projection, coloured by ground-truth block
fig, ax = plt.subplots()
sns.scatterplot(data = df, x = 'dim1', y = 'dim2', hue = 'color', palette = 'Set1', alpha = 0.6, ax = ax)
ax.set_xlabel("dim1")
ax.set_ylabel("dim2")
plt.show()