
node2vec's People

Contributors

dependabot[bot], despotovski01, eliorc, furkanakkurt1335, gerritjandebruin, jade-bejide, kkteru, lucacappelletti94, ndrus-softserve, neronuser, ninpnin, pg2455, raminqaf


node2vec's Issues

How would you update a model

I have a fairly large dataset (lower end of 1M nodes in a graph) with new nodes being added possibly every day or so.
I can save the graph, update it, and then run Node2Vec again on the whole thing, but I'd rather avoid that because it takes a while. Is there a way to generate an embedding for a new node in the graph without re-running Node2Vec over the whole graph?
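
One possible direction, not verified against this library: since fit returns a gensim Word2Vec model, gensim's incremental training could be used to continue training a saved model on walks generated only around the new nodes. A sketch, where generate_walks_around is a hypothetical helper you would write:

from gensim.models import Word2Vec

model = Word2Vec.load('node2vec.model')       # model previously saved after fit()
new_walks = generate_walks_around(new_nodes)  # hypothetical: lists of node-id strings
model.build_vocab(new_walks, update=True)     # add the new node ids to the vocabulary
model.train(new_walks, total_examples=len(new_walks), epochs=model.epochs)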

How can I handle a graph with more than 50k nodes?

Traceback (most recent call last):
File "/home/lib/python3.5/site-packages/joblib/externals/loky/backend/queues.py", line 157, in _feed
send_bytes(obj)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib/python3.5/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Hi. I really appreciate your library. I can get more accurate results with your code than even with the author's original.

However, I have a problem: I can't learn embeddings for graphs with more than about 50,000 nodes.

I suspect the joblib machinery behind "parallel_generate_walks" has a size limit for large datasets.

Is this an inherent limit of the code?

Computing transition probabilities extremely slow - not parallel

I am running n2v on a graph with around 40 million edges, with the default values from the example in the readme. The "Computing transition probabilities" step is estimated to take around 533 hours and runs on only one CPU, even though I have workers set to 30. Is it possible to parallelize this step or speed it up some other way?

ValueError: a must be non-empty


_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Users\Andrew\Anaconda3\lib\site-packages\joblib\externals\loky\process_executor.py", line 398, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "C:\Users\Andrew\Anaconda3\lib\site-packages\joblib_parallel_backends.py", line 561, in call
return self.func(*args, **kwargs)
File "C:\Users\Andrew\Anaconda3\lib\site-packages\joblib\parallel.py", line 224, in call
for func, args, kwargs in self.items]
File "C:\Users\Andrew\Anaconda3\lib\site-packages\joblib\parallel.py", line 224, in
for func, args, kwargs in self.items]
File "C:\Users\Andrew\Anaconda3\lib\site-packages\node2vec\node2vec.py", line 51, in parallel_generate_walks
walk_to = np.random.choice(walk_options, size=1)[0]
File "mtrand.pyx", line 1126, in mtrand.RandomState.choice
ValueError: a must be non-empty
"""

The above exception was the direct cause of the following exception:

ValueError Traceback (most recent call last)
in <module>()
----> 1 node2vec = Node2Vec(mail_n_basic, dimensions=64, walk_length=30, num_walks=200, workers=4)
2
3 model = node2vec.fit(window=10, min_count=1, batch_words=4)
4
5 model.wv.most_similar('2')

~\Anaconda3\lib\site-packages\node2vec\node2vec.py in __init__(self, graph, dimensions, walk_length, num_walks, p, q, weight_key, workers, sampling_strategy)
111
112 self.d_graph = self._precompute_probabilities()
--> 113 self.walks = self._generate_walks()
114
115 def _precompute_probabilities(self):

~\Anaconda3\lib\site-packages\node2vec\node2vec.py in _generate_walks(self)
178 self.NEIGHBORS_KEY,
179 self.PROBABILITIES_KEY) for idx, num_walks
--> 180 in enumerate(num_walks_lists, 1))
181
182 walks = flatten(walk_results)

~\Anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
960
961 with self._backend.retrieval_context():
--> 962 self.retrieve()
963 # Make sure that we get a last message telling us we are done
964 elapsed_time = time.time() - self._start_time

~\Anaconda3\lib\site-packages\joblib\parallel.py in retrieve(self)
863 try:
864 if getattr(self._backend, 'supports_timeout', False):
--> 865 self._output.extend(job.get(timeout=self.timeout))
866 else:
867 self._output.extend(job.get())

~\Anaconda3\lib\site-packages\joblib\_parallel_backends.py in wrap_future_result(future, timeout)
513 AsyncResults.get from multiprocessing."""
514 try:
--> 515 return future.result(timeout=timeout)
516 except LokyTimeoutError:
517 raise TimeoutError()

~\Anaconda3\lib\site-packages\joblib\externals\loky\_base.py in result(self, timeout)
429 raise CancelledError()
430 elif self._state == FINISHED:
--> 431 return self.__get_result()
432 else:
433 raise TimeoutError()

~\Anaconda3\lib\site-packages\joblib\externals\loky\_base.py in __get_result(self)
380 def __get_result(self):
381 if self._exception:
--> 382 raise self._exception
383 else:
384 return self._result

ValueError: a must be non-empty

Load model and embedding

Hi eliorc,
I am using Node2vec for my project. It's a good library for me.
But I got stuck loading the model and node embeddings after saving: I must run node2vec.fit before I can call node2vec_model.wv.load_word2vec_format(filename); otherwise I have no node2vec_model at all. I think this is a bug, because I always waste time training from scratch with fit instead of loading immediately.
This is my code:

node2vec = Node2Vec(graph=graph,
                        dimensions=embedding_dim,
                        walk_length=80,
                        num_walks=20,
                        workers=2)  # Use temp_folder for big graphs
node2vec_model = node2vec.fit()
if is_load_emb:
    embedding = node2vec_model.wv.load_word2vec_format(filename)
else:
    embedding = [np.array(node2vec_model[str(u)]) for u in sorted(graph.nodes)]
    if filename is not None:
        node2vec_model.wv.save_word2vec_format(filename)

Thanks
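
For the loading path, note that load_word2vec_format is a classmethod on gensim's KeyedVectors, so previously saved embeddings can be read back without constructing Node2Vec or calling fit at all; a minimal sketch of the same branch:

import numpy as np
from gensim.models import KeyedVectors

if is_load_emb:
    # no fit() needed: read the vectors straight from the saved file
    wv = KeyedVectors.load_word2vec_format(filename)
    embedding = [np.array(wv[str(u)]) for u in sorted(graph.nodes)]
else:
    model = node2vec.fit()
    model.wv.save_word2vec_format(filename)
    embedding = [np.array(model.wv[str(u)]) for u in sorted(graph.nodes)]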

Will node2vec return the same vector for the same node V?

Hi There,

I am new to node2vec and have a quick question: across different runs, will node2vec return the same vector for the same node V, given the default parameters? I am not sure about this because the random walk is a random process.
Looking forward to your reply. Thanks!

Best,
Tao

There is a problem with parallel jobs, even though my computer can do parallel calculations

For example, when I use this repo
https://github.com/HKUST-KnowComp/MNE

the parallel jobs do run. Here is example output:

C:\Users\sndr\Anaconda3\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
We are loading data from: e:\graphs ML\code\Scalable_Multiplex_Network_Embedding_MNE_27may_removed_LINE_Cpp_code\MNE-master\data\Vickers-Chan-7thGraders_multiplex.edges
Finish loading data
finish building the graph
2018-05-27 08:24:31,539 : WARNING : Slow version of MNE is being used
2018-05-27 08:24:31,539 : INFO : collecting all words and their counts
2018-05-27 08:24:31,539 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-05-27 08:24:31,542 : INFO : collected 29 word types from a corpus of 5800 raw words and 580 sentences
2018-05-27 08:24:31,542 : INFO : Loading a fresh vocabulary
2018-05-27 08:24:31,542 : INFO : min_count=0 retains 29 unique words (100% of original 29, drops 0)
2018-05-27 08:24:31,542 : INFO : min_count=0 leaves 5800 word corpus (100% of original 5800, drops 0)
2018-05-27 08:24:31,543 : INFO : deleting the raw counts dictionary of 29 items
2018-05-27 08:24:31,546 : INFO : sample=0.001 downsamples 29 most-common words
2018-05-27 08:24:31,546 : INFO : downsampling leaves estimated 1133 word corpus (19.5% of prior 5800)
2018-05-27 08:24:31,546 : INFO : estimated required memory for 29 words and 200 dimensions: 60900 bytes
2018-05-27 08:24:31,546 : INFO : resetting layer weights
E:\graphs ML\code\Scalable_Multiplex_Network_Embedding_MNE_27may_removed_LINE_Cpp_code\MNE-master\MNE.py:655: UserWarning: C extension not loaded for Word2Vec, training will be slow. Install a C compiler and reinstall gensim for fast training.
  warnings.warn("C extension not loaded for Word2Vec, training will be slow. "
2018-05-27 08:24:31,549 : INFO : training model with 4 workers on 29 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
2018-05-27 08:24:31,550 : INFO : expecting 580 sentences, matching count from corpus used for vocabulary survey
2018-05-27 08:24:35,637 : INFO : PROGRESS: at 1.72% examples, 477 words/s, in_qsize 8, out_qsize 0
2018-05-27 08:24:39,745 : INFO : PROGRESS: at 8.62% examples, 1189 words/s, in_qsize 7, out_qsize 0
2018-05-27 08:24:43,835 : INFO : PROGRESS: at 15.52% examples, 1445 words/s, in_qsize 7, out_qsize 0
2018-05-27 08:24:48,022 : INFO : PROGRESS: at 22.41% examples, 1559 words/s, in_qsize 7, out_qsize 0
2018-05-27 08:24:52,222 : INFO : PROGRESS: at 29.31% examples, 1613 words/s, in_qsize 7, out_qsize 0
2018-05-27 08:24:56,323 : INFO : PROGRESS: at 36.21% examples, 1664 words/s, in_qsize 7, out_qsize 0
2018-05-27 08:24:57,571 : INFO : PROGRESS: at 41.38% examples, 1813 words/s, in_qsize 8, out_qsize 0
2018-05-27 08:25:00,408 : INFO : PROGRESS: at 43.10% examples, 1701 words/s, in_qsize 7, out_qsize 0
2018-05-27 08:25:01,824 : INFO : PROGRESS: at 48.28% examples, 1813 words/s, in_qsize 7, out_qsize 0
2018-05-27 08:25:04,289 : INFO : PROGRESS: at 50.00% examples, 1735 words/s, in_qsize 8, out_qsize 0
2018-05-27 08:25:05,993 : INFO : PROGRESS: at 55.17% examples, 1817 words/s, in_qsize 8, out_qsize 0
2018-05-27 08:25:08,556 : INFO : PROGRESS: at 56.90% examples, 1745 words/s, in_qsize 8, out_qsize 0
2018-05-27 08:25:10,071 : INFO : PROGRESS: at 62.07% examples, 1828 words/s, in_qsize 8, out_qsize 0
2018-05-27 08:25:12,396 : INFO : PROGRESS: at 63.79% examples, 1770 words/s, in_qsize 8, out_qsize 0
2018-05-27 08:25:14,412 : INFO : PROGRESS: at 68.97% examples, 1825 words/s, in_qsize 7, out_qsize 0
2018-05-27 08:25:16,363 : INFO : PROGRESS: at 70.69% examples, 1789 words/s, in_qsize 8, out_qsize 0
2018-05-27 08:25:18,620 : INFO : PROGRESS: at 75.86% examples, 1828 words/s, in_qsize 8, out_qsize 0
2018-05-27 08:25:20,163 : INFO : PROGRESS: at 77.59% examples, 1809 words/s, in_qsize 7, out_qsize 0
2018-05-27 08:25:21,434 : INFO : PROGRESS: at 81.03% examples, 1839 words/s, in_qsize 7, out_qsize 0
2018-05-27 08:25:22,846 : INFO : PROGRESS: at 82.76% examples, 1827 words/s, in_qsize 7, out_qsize 0
2018-05-27 08:25:24,182 : INFO : PROGRESS: at 84.48% examples, 1817 words/s, in_qsize 7, out_qsize 0
2018-05-27 08:25:25,422 : INFO : PROGRESS: at 86.21% examples, 1812 words/s, in_qsize 8, out_qsize 0
2018-05-27 08:25:26,625 : INFO : PROGRESS: at 89.66% examples, 1843 words/s, in_qsize 6, out_qsize 0
2018-05-27 08:25:28,240 : INFO : PROGRESS: at 91.38% examples, 1824 words/s, in_qsize 5, out_qsize 0
2018-05-27 08:25:29,541 : INFO : PROGRESS: at 93.10% examples, 1818 words/s, in_qsize 4, out_qsize 0
2018-05-27 08:25:29,748 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-05-27 08:25:30,427 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-05-27 08:25:31,174 : INFO : PROGRESS: at 98.28% examples, 1866 words/s, in_qsize 1, out_qsize 1
2018-05-27 08:25:31,174 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-05-27 08:25:31,362 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-05-27 08:25:31,362 : INFO : training on 580000 raw words (113232 effective words) took 59.8s, 1894 effective words/s
2018-05-27 08:25:31,512 : WARNING : Slow version of MNE is being used
2018-05-27 08:25:31,512 : INFO : collecting all words and their counts
2018-05-27 08:25:31,512 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-05-27 08:25:31,512 : INFO : collected 29 word types from a corpus of 5800 raw words and 580 sentences
2018-05-27 08:25:31,512 : INFO : Loading a fresh vocabulary
2018-05-27 08:25:31,512 : INFO : min_count=0 retains 29 unique words (100% of original 29, drops 0)
2018-05-27 08:25:31,512 : INFO : min_count=0 leaves 5800 word corpus (100% of original 5800, drops 0)
2018-05-27 08:25:31,512 : INFO : deleting the raw counts dictionary of 29 items
2018-05-27 08:25:31,512 : INFO : sample=0.001 downsamples 29 most-common words
2018-05-27 08:25:31,512 : INFO : downsampling leaves estimated 1137 word corpus (19.6% of prior 5800)
2018-05-27 08:25:31,512 : INFO : estimated required memory for 29 words and 200 dimensions: 60900 bytes
2018-05-27 08:25:31,512 : INFO : resetting layer weights
2018-05-27 08:25:31,528 : INFO : training model with 4 workers on 29 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
2018-05-27 08:25:31,528 : INFO : expecting 580 sentences, matching count from corpus used for vocabulary survey
2018-05-27 08:25:35,376 : INFO : PROGRESS: at 17.24% examples, 488 words/s, in_qsize 5, out_qsize 0
2018-05-27 08:25:35,714 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-05-27 08:25:35,830 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-05-27 08:25:37,110 : INFO : PROGRESS: at 82.76% examples, 1694 words/s, in_qsize 1, out_qsize 1
2018-05-27 08:25:37,112 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-05-27 08:25:37,177 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-05-27 08:25:37,177 : INFO : training on 58000 raw words (11320 effective words) took 5.6s, 2016 effective words/s
2018-05-27 08:25:37,177 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
2018-05-27 08:25:37,277 : WARNING : Slow version of MNE is being used
2018-05-27 08:25:37,277 : INFO : collecting all words and their counts
2018-05-27 08:25:37,277 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-05-27 08:25:37,277 : INFO : collected 29 word types from a corpus of 5800 raw words and 580 sentences
2018-05-27 08:25:37,277 : INFO : Loading a fresh vocabulary
2018-05-27 08:25:37,277 : INFO : min_count=0 retains 29 unique words (100% of original 29, drops 0)
2018-05-27 08:25:37,277 : INFO : min_count=0 leaves 5800 word corpus (100% of original 5800, drops 0)
2018-05-27 08:25:37,277 : INFO : deleting the raw counts dictionary of 29 items
2018-05-27 08:25:37,277 : INFO : sample=0.001 downsamples 29 most-common words
2018-05-27 08:25:37,277 : INFO : downsampling leaves estimated 1091 word corpus (18.8% of prior 5800)
2018-05-27 08:25:37,277 : INFO : estimated required memory for 29 words and 200 dimensions: 60900 bytes
2018-05-27 08:25:37,277 : INFO : resetting layer weights
2018-05-27 08:25:37,277 : INFO : training model with 4 workers on 29 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
2018-05-27 08:25:37,277 : INFO : expecting 580 sentences, matching count from corpus used for vocabulary survey

A question about random parameter

Hello @eliorc:
First, thanks for your work on Node2Vec in Python, which makes this algorithm much easier to use. Now I am trying to generate 10 sets of node vectors, but I get the same data every time. What parameter could I set to get different data? If you could provide advice, I would be very grateful. The following is the code and the log.

for epoch in range(10):
    node2vec = Node2Vec(graph,
                        dimensions=2,
                        walk_length=5,
                        workers=1,
                        p=17.8,
                        q=0.001)
    model = node2vec.fit(window=2, min_count=1, batch_words=6)
    print(model['2'])

Computing transition probabilities: 100%|██████████████████████████████████████████████| 6/6 [00:00<00:00, 1999.99it/s]
Generating walks (CPU: 1): 100%|██████████████████████████████████████████████████████| 10/10 [00:00<00:00, 434.76it/s]
D:\Business\GNN\node2vec_pro\test_para.py:31: DeprecationWarning: Call to deprecated __getitem__ (Method will be removed in 4.0.0, use self.wv.getitem() instead).
print(model['2'])
[-0.20316663 -0.14337409]
Computing transition probabilities: 100%|██████████████████████████████████████████████| 6/6 [00:00<00:00, 3000.22it/s]
Generating walks (CPU: 1): 100%|██████████████████████████████████████████████████████| 10/10 [00:00<00:00, 499.97it/s]
[-0.20315658 -0.14337872]
Computing transition probabilities: 100%|██████████████████████████████████████████████| 6/6 [00:00<00:00, 2999.86it/s]
Generating walks (CPU: 1): 100%|██████████████████████████████████████████████████████| 10/10 [00:00<00:00, 384.59it/s]
[-0.2032127 -0.14340548]
Computing transition probabilities: 100%|██████████████████████████████████████████████| 6/6 [00:00<00:00, 2999.86it/s]
Generating walks (CPU: 1): 100%|██████████████████████████████████████████████████████| 10/10 [00:00<00:00, 416.64it/s]
[-0.20316675 -0.1433859 ]
Computing transition probabilities: 100%|██████████████████████████████████████████████| 6/6 [00:00<00:00, 2999.86it/s]
Generating walks (CPU: 1): 100%|██████████████████████████████████████████████████████| 10/10 [00:00<00:00, 555.52it/s]
[-0.20316663 -0.14337409]
Computing transition probabilities: 100%|██████████████████████████████████████████████| 6/6 [00:00<00:00, 1999.99it/s]
Generating walks (CPU: 1): 100%|██████████████████████████████████████████████████████| 10/10 [00:00<00:00, 526.29it/s]
[-0.2031702 -0.14337513]
Computing transition probabilities: 100%|██████████████████████████████████████████████| 6/6 [00:00<00:00, 2999.86it/s]
Generating walks (CPU: 1): 100%|██████████████████████████████████████████████████████| 10/10 [00:00<00:00, 476.16it/s]
[-0.20327514 -0.14351842]
Computing transition probabilities: 100%|██████████████████████████████████████████████| 6/6 [00:00<00:00, 2999.86it/s]
Generating walks (CPU: 1): 100%|██████████████████████████████████████████████████████| 10/10 [00:00<00:00, 588.20it/s]
[-0.20317353 -0.1433965 ]
Computing transition probabilities: 100%|██████████████████████████████████████████████| 6/6 [00:00<00:00, 2999.86it/s]
Generating walks (CPU: 1): 100%|██████████████████████████████████████████████████████| 10/10 [00:00<00:00, 555.52it/s]
[-0.20317145 -0.14337693]
Computing transition probabilities: 100%|██████████████████████████████████████████████| 6/6 [00:00<00:00, 2999.86it/s]
Generating walks (CPU: 1): 100%|██████████████████████████████████████████████████████| 10/10 [00:00<00:00, 384.59it/s]
[-0.20317672 -0.14338204]
Press any key to continue . . .
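
A sketch of one way to get different vectors on each run, assuming the constructor's seed argument (used elsewhere in these issues) drives the walk sampling, and that fit forwards extra keyword arguments, including gensim Word2Vec's own seed, to the underlying model:

from node2vec import Node2Vec

for epoch in range(10):
    node2vec = Node2Vec(graph, dimensions=2, walk_length=5, workers=1,
                        p=17.8, q=0.001,
                        seed=epoch)  # assumption: varies the random walks per run
    # gensim's own seed controls the embedding initialization
    model = node2vec.fit(window=2, min_count=1, batch_words=6, seed=epoch)
    print(model.wv['2'])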

Does node2vec take into account node attributes?

I have a networkx graph in which each node name is a string and its data is a dict of the form {'type': 'class A'}.
So, for example, a node looks like
NodeDataView({'dog': {'type': 'class A'}, ...)

Will this library take into account this data attribute of the node?

Embedding of a directed graph

The algorithm outputs poor quality embeddings when a directed graph is given as input. This is likely due to the fact that the random walks only travel one way along each edge and therefore produce biased neighborhoods.
Shouldn't there be a warning when passing in a directed graph? Or perhaps an automatic conversion to an undirected graph?
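
Until such a warning exists, a user-side stopgap is to convert explicitly before embedding; a minimal sketch using networkx (directed_graph is a stand-in name):

import networkx as nx
from node2vec import Node2Vec

undirected = directed_graph.to_undirected()  # keeps nodes, edges and edge data
node2vec = Node2Vec(undirected, dimensions=64, walk_length=30, num_walks=200, workers=4)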

cannot import name 'Node2Vec' from 'node2vec'

Hi,

I installed node2vec using pip. However, when I run the sample script from your example, it shows this error:
"from node2vec import Node2Vec
ImportError: cannot import name 'Node2Vec' from 'node2vec'"

Do you know why?

My python version: 3.7.1
OS: tried both on Mac and Red Hat 7.6

Regards,
Chan.

Why is the "Generating walks" stage stuck at 0%?

the code:

graph = nx.from_pandas_edgelist(
    answer_info,
    "author_id",
    "question_id",
    "like_nums",
)

node2vec = Node2Vec(graph, dimensions=64, walk_length=30, num_walks=100, workers=10,
                    temp_folder="/data/zhihu/node2vec")  # Use temp_folder for big graphs
model = node2vec.fit(window=10, min_count=1,
                     batch_words=4) 

and the log:

Computing transition probabilities: 100%|██████████| 2188614/2188614 [30:54<00:00, 1180.22it/s]
Generating walks (CPU: 4):   0%|          | 0/10 [00:00<?, ?it/s]
Generating walks (CPU: 3):   0%|          | 0/10 [00:00<?, ?it/s]
Generating walks (CPU: 5):   0%|          | 0/10 [00:00<?, ?it/s]
Generating walks (CPU: 2):   0%|          | 0/10 [00:00<?, ?it/s]
Generating walks (CPU: 6):   0%|          | 0/10 [00:00<?, ?it/s]
Generating walks (CPU: 7):   0%|          | 0/10 [00:00<?, ?it/s]
Generating walks (CPU: 8):   0%|          | 0/10 [00:00<?, ?it/s]
Generating walks (CPU: 9):   0%|          | 0/10 [00:00<?, ?it/s]
Generating walks (CPU: 1):   0%|          | 0/10 [00:00<?, ?it/s]
Generating walks (CPU: 10):   0%|          | 0/10 [00:00<?, ?it/s]

RuntimeError with parallel generating walks

Recently I used this to detect communities on a bipartite graph (5220 nodes, 7136 edges). Currently I am using it for the same task on a co-occurrence graph with 5131 nodes and over 565k edges. The script was able to generate the transition probabilities, but after that it stops and returns this:
RuntimeError: The task could not be sent to workers as it is too large for 'send_bytes'.
(Full screenshot of the traceback omitted.)

Here's my python script:

import networkx as nx
import node2vec

# Node2Vec

G = nx.read_gml('../data/co_occurrence_graph.gml')
# use this dictionary to change any argument value!
print("Loaded graph")
n2v_args = {"dimensions": 128, "walk_length": 10, "num_walks": 10, "p": 1, "q": 0.5, "workers": 4}
node2vec = node2vec.Node2Vec(G, **n2v_args)
print("model created")
model = node2vec.fit(window=10, min_count=1, batch_words=4)
print("model training complete")
print(model.wv.most_similar('Cornish'))
model.wv.save_word2vec_format('../data/co_graph.emb')
model.save('../data/co_graph.model')
print("formats saved")

Could it be that there are just "too many edges" for node2vec to handle? I'm not sure how else to fix this issue.

Memory Error

The following code almost takes all of the memory.
node2vec = Node2Vec(G_fb, dimensions=emb_size, walk_length=80, num_walks=10, workers=4)

I had processed only 7 thousand nodes when I got a memory error.

Here is the error:
MemoryError Traceback (most recent call last)
in <module>()
4
5 for emb_size in [32, 128]:
----> 6 node2vec = Node2Vec(G_fb, dimensions=emb_size, walk_length=80, num_walks=10, workers=3) # Use temp_folder for big graphs
7 model = node2vec.fit(window=10, min_count=1,batch_words=4)
8 model.wv.save_word2vec_format("mypath/%s_%d.deepwalk"%(graph_name,emb_size))

...site-packages\node2vec\node2vec.py in __init__(self, graph, dimensions, walk_length, num_walks, p, q, weight_key, workers, sampling_strategy, quiet, temp_folder)
68 self.require = "sharedmem"
69
---> 70 self._precompute_probabilities()
71 self.walks = self._generate_walks()
72

...site-packages\node2vec\node2vec.py in _precompute_probabilities(self)
117 if current_node not in first_travel_done:
118 first_travel_weights.append(self.graph[current_node][destination].get(self.weight_key, 1))
--> 119 d_neighbors.append(destination)
120
121 # Normalize

MemoryError

Is there a way to process big graphs with many nodes without getting a memory error?
I know I can use temp_folder for that, but I don't know how to apply it.
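
A minimal sketch of applying it; the path is an assumption, any writable directory with enough free space works, and joblib then memory-maps the shared data instead of copying it into every worker:

from node2vec import Node2Vec

node2vec = Node2Vec(G_fb, dimensions=emb_size, walk_length=80, num_walks=10,
                    workers=4,
                    temp_folder='/tmp/node2vec_tmp')  # assumed path with enough space
model = node2vec.fit(window=10, min_count=1, batch_words=4)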

Does Node2Vec code support community detection?

Hello,

I am new to Node2Vec. I have succeeded in creating node embeddings for a directed weighted graph. Now I want to know whether this code can perform community detection and, if so, how? If I am not mistaken, the paper mentions that it is possible to use Node2Vec for community detection.

Thanks in advance.
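
The library itself only outputs embeddings, so community detection would be a downstream step on top of them. A minimal sketch of one common recipe, clustering the node vectors with k-means; the cluster count and the gensim-4 attribute name are assumptions:

from sklearn.cluster import KMeans

# assumes `model` is the gensim model returned by node2vec.fit()
# (gensim 4 exposes model.wv.index_to_key; gensim 3 used model.wv.index2word)
nodes = list(model.wv.index_to_key)
vectors = [model.wv[n] for n in nodes]
labels = KMeans(n_clusters=5, n_init=10).fit_predict(vectors)  # 5 communities is an assumption
communities = dict(zip(nodes, labels))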

TypeError: __init__() got an unexpected keyword argument 'temp_folder'

n2v=Node2Vec(G, dimensions=100, workers=4, p=0.25, q=0.25, temp_folder='F:/n2v_v1/temp/')

I tried to set a temp_folder, but an error appeared, as below, saying there is no such parameter. Is this parameter still available?

TypeError Traceback (most recent call last)
in <module>
----> 1 n2v=Node2Vec(G, dimensions=100, workers=4, p=0.25, q=0.25, temp_folder='F:/n2v_v1/temp/')

TypeError: __init__() got an unexpected keyword argument 'temp_folder'

How to set the parameters of the cora dataset?

I've run node2vec on the cora dataset many times. For each run I used different node2vec parameter settings. However, I've never achieved the same, or even nearly the same, result as the paper 'Evaluating Network Embeddings: Node2Vec vs Spectral Clustering vs GCN'.

np.random.choice doesn't want a to be empty

while len(walk) < walk_length:
    walk_options = d_graph[walk[-1]][neighbors_key]

    if len(walk) == 1:  # For the first step
        walk_to = np.random.choice(walk_options, size=1)[0]
    else:
        probabilities = d_graph[walk[-1]][probabilities_key][walk[-2]]
        walk_to = np.random.choice(walk_options, size=1, p=probabilities)[0]

This is giving an error; can you please look into it?
In some cases walk_options is an empty list, hence the error. Is there any way around it? I tried adding an if condition for the empty case, but it did not work.
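
One hedged workaround, assuming it is acceptable to truncate a walk at a dead end (a node with no outgoing neighbors), is to bail out before sampling; a sketch of the loop with that guard added:

import numpy as np

while len(walk) < walk_length:
    walk_options = d_graph[walk[-1]][neighbors_key]

    if not walk_options:  # dead end: no outgoing neighbors, stop this walk early
        break

    if len(walk) == 1:  # for the first step
        walk_to = np.random.choice(walk_options, size=1)[0]
    else:
        probabilities = d_graph[walk[-1]][probabilities_key][walk[-2]]
        walk_to = np.random.choice(walk_options, size=1, p=probabilities)[0]

    walk.append(walk_to)  # elided in the snippet above, but needed to advance the walk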

generating node representation vector for individual nodes after embedding

Hi!

I am using node2vec to generate node representation vectors after a random walk. I am working with an adjacency matrix where I want to randomly add or remove 1's and 0's, see how the connectivity of the graph changes, and then inspect the resulting node representation vectors.

I run the random walk and then fit the model as follows:

from node2vec import Node2Vec
node2vec = Node2Vec(nx_graph, dimensions=128, walk_length=30, num_walks=50, workers=1, seed=30)
#fit
model = node2vec.fit(window=5, min_count=1, batch_words=4)

I can now look up each node's representation vector:
model.wv['nodex']

Now I want to edit the adjacency matrix, adding or removing a random set of nodes (say 5, for example), and see how the vectors for those 5 nodes change. After changing connections in the matrix I rebuild the graph:

nx.from_numpy_matrix - etc

However, I then have to re-run the walks and refit the model to see how the node representation vectors have changed.
This is computationally very slow. Is there a way of keeping the node vectors from the original model and only recomputing the vectors for the nodes I changed in the adjacency matrix? I hope this makes sense. Thank you.

Setting workers to more than 1 causes a TerminatedWorkerError

PC memory: 128 GB
node2vec = Node2Vec(G, dimensions=100, walk_length=30, num_walks=100, workers=10)

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {EXIT(2), EXIT(2), EXIT(2), EXIT(2), EXIT(2), EXIT(2), EXIT(2), EXIT(2)}

When I set workers=1 it runs successfully, but it is very slow.

Source distributions or release tags

Would it be possible to upload the source distribution (sdist) files to PyPI for the new releases? Alternatively, would it be possible to tag the releases in GitHub?

In order for the packages to be created in conda-forge we need to start from one of those two.

Get only the edge embeddings from my input edgelist

Hi,

Thank you already for this implementation, but I have a question.

In your readme in the Usage section, you do

edges_embs = HadamardEmbedder(keyed_vectors=model.wv)

# Get all edges in a separate KeyedVectors instance - use with caution could be huge for big networks
edges_kv = edges_embs.as_keyed_vectors()

And if I understood correctly, edges_kv here corresponds to all possible node pairs.

My problem is that I would like to retrieve only the edge embeddings that correspond to the edges of the edgelist I gave as input, loaded like this: graph = nx.read_edgelist(EDGELIST_FILENAME)

I haven't really found a way to do that. Do you know if there is a way to do this?

Thank you very much
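
For what it's worth, HadamardEmbedder supports direct indexing by a node-pair tuple (the readme shows edges_embs[('1', '2')]), so embeddings can be computed only for the edges that actually exist in the input graph; a minimal sketch:

from node2vec.edges import HadamardEmbedder

edges_embs = HadamardEmbedder(keyed_vectors=model.wv)

# only the edges from the input edgelist, not all possible node pairs
edge_vectors = {(u, v): edges_embs[(u, v)] for u, v in graph.edges()}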

Dimensionality Issue

Hey @eliorc ,

I am trying to generate the walks using the Node2Vec module (the precomputed probabilities were calculated without any error), but I am facing a dimensionality issue. Can you please help?

Graph: (screenshot omitted)

Error: (screenshots omitted)

[Question] Suggestions for hyperparameter tuning?

Do you have any recommendations for tuning hyperparameters based on network topology? Specifically, I usually deal with fully connected correlation-style networks with 100 - 5000 nodes.

From your experimentation, what is your thought process when tuning these hyperparameters?

node2vec = Node2Vec(graph, dimensions=64, walk_length=30, num_walks=200, workers=4) 

# Embed nodes
model = node2vec.fit(window=10, min_count=1, batch_words=4)
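
Not an authoritative recipe, but one common starting point is a small grid search over p and q, scoring each embedding on the downstream task; the grid values below follow the {0.25, 0.5, 1, 2, 4} range explored in the node2vec paper:

import itertools
from node2vec import Node2Vec

for p, q in itertools.product([0.25, 1, 4], [0.25, 1, 4]):
    n2v = Node2Vec(graph, dimensions=64, walk_length=30, num_walks=200,
                   workers=4, p=p, q=q)
    model = n2v.fit(window=10, min_count=1, batch_words=4)
    # score the embedding on your downstream task here (e.g. correlation with
    # known groupings, link prediction AUC) and keep the best (p, q)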

Node2Vec on Bipartite Graph outputs similar nodes from both node sets

1) I tried to apply Node2Vec to a bipartite graph, but I want the input to be a node from node set A and the output to be nodes from node set B. I tried changing the parameters (walk_length and num_walks), but I always get output nodes from both node sets of the graph.
How can I do that while still letting the model learn from the whole graph?
2) How can I apply node2vec to dynamic graphs?

Is this error critical: ImportError: [joblib]

Hello,
thank you very much for the cool code.
Could you just clarify whether my run is correct? I get this error message:

Computing transition probabilities: 85%|████████▌ | 85/100 [00:01<00:00, 71.36it/s]
Computing transition probabilities: 90%|█████████ | 90/100 [00:01<00:00, 75.52it/s]
Computing transition probabilities: 94%|█████████▍| 94/100 [00:01<00:00, 72.37it/s]
Computing transition probabilities: 99%|█████████▉| 99/100 [00:01<00:00, 76.07it/s]
Computing transition probabilities: 100%|██████████| 100/100 [00:01<00:00, 75.99it/s]
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\sndr\Anaconda3\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "C:\Users\sndr\Anaconda3\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "C:\Users\sndr\Anaconda3\lib\multiprocessing\spawn.py", line 223, in prepare
_fixup_main_from_name(data['init_main_from_name'])
File "C:\Users\sndr\Anaconda3\lib\multiprocessing\spawn.py", line 249, in _fixup_main_from_name
alter_sys=True)
File "C:\Users\sndr\Anaconda3\lib\runpy.py", line 205, in run_module
return _run_module_code(code, init_globals, run_name, mod_spec)
File "C:\Users\sndr\Anaconda3\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "C:\Users\sndr\Anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "E:\graphs ML\code\node2vec-master\node2vec-master\example.py", line 12, in
node2vec = Node2Vec(graph, dimensions=64, walk_length=30, num_walks=200, workers=4)
File "E:\graphs ML\code\node2vec-master\node2vec-master\node2vec\node2vec.py", line 121, in init
self.walks = self._generate_walks()
File "E:\graphs ML\code\node2vec-master\node2vec-master\node2vec\node2vec.py", line 204, in _generate_walks
in enumerate(num_walks_lists, 1))
File "C:\Users\sndr\Anaconda3\lib\site-packages\joblib\parallel.py", line 749, in call
n_jobs = self._initialize_backend()
File "C:\Users\sndr\Anaconda3\lib\site-packages\joblib\parallel.py", line 547, in _initialize_backend
**self._backend_args)
File "C:\Users\sndr\Anaconda3\lib\site-packages\joblib_parallel_backends.py", line 305, in configure
'[joblib] Attempting to do parallel computing '
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if name == 'main'". Please see the joblib documentation on Parallel for more information

(The same output and traceback then repeat for each additional spawned worker.)
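
As the traceback itself says, on platforms that spawn rather than fork (such as Windows) the script's entry point must be guarded so that worker processes do not re-execute the module's top-level code; a minimal sketch of restructuring example.py this way (the random graph is a stand-in for the example's own graph):

import networkx as nx
from node2vec import Node2Vec

def main():
    graph = nx.fast_gnp_random_graph(n=100, p=0.5)  # stand-in graph
    node2vec = Node2Vec(graph, dimensions=64, walk_length=30, num_walks=200, workers=4)
    model = node2vec.fit(window=10, min_count=1, batch_words=4)
    model.wv.save_word2vec_format('embeddings.emb')

if __name__ == '__main__':  # prevents spawned workers from re-running the module body
    main()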

Root nodes have no neighbors on directed graphs

Nodes with no in-degree generate walks containing only themselves, meaning they can never be reached while traveling; this affects directed graphs.

To reproduce:

import networkx as nx
from node2vec import Node2Vec

DG = nx.DiGraph()
DG.add_edges_from([('1', '2'), ('2', '5'), ('2', '6'), ('3', '4'),
                   ('4', '5'), ('4', '6'), ('5', '7'), ('5', '8'),
                   ('6', '7'), ('6', '8'), ('7', '9'), ('8', '10')])

node2vec = Node2Vec(DG, dimensions=10, walk_length=10, num_walks=400, workers=1)
model = node2vec.fit(window=10)


for walk in node2vec.walks:
    if '1' in walk:
        print(walk)

This results in output like:

['1']
['1']
['1']
['1']
.
.
.
.

Is the trained word2vec really skip-gram by default? or is it CBOW?

Thanks for putting this together!

The variable naming suggests that the word2vec model trained inside the node2vec model is a skip-gram model. However, the default setting of gensim's Word2Vec is CBOW. That said, one can easily pass the required parameter to choose a skip-gram model instead of CBOW.

This is a tiny technical detail that makes the default setting of this code different from the original node2vec, which explicitly trains a skip-gram model. Perhaps users should be aware of that. Or did I miss something?
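
Since fit forwards its keyword arguments to gensim's Word2Vec, explicitly opting into skip-gram should be a one-argument change; sg is gensim's switch between the two architectures:

# sg=1 selects skip-gram, sg=0 (gensim's default) selects CBOW
model = node2vec.fit(window=10, min_count=1, batch_words=4, sg=1)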

Question regarding word2vec

hi,

I am confused about the word2vec part of node2vec. There are different pre-trained word2vec models available, such as generic English and PubMed models. Which pre-trained word2vec model was used in this experiment?

Running Node2Vec on Google Colab is very slow

Hi,

Thank you so much for this handy library. I'm trying to run node2vec on Google Colab. However, when I embed a sample graph with 10,000 edges, it takes more than 1 hour 40 minutes to finish (embedding 80,000 edges takes more than 4 hours). I've used temp_folder to avoid memory issues. Since I plan to run it on a real-world dataset with more than 100,000,000 edges, may I ask whether there is any way to speed up training without sacrificing much performance?

The parameters I chose are the following:

# the dimension of node embedding
embedding_dimension = 50,

# temp folder, to avoid memory issues on big graphs
temp_folder = temp_folder_path,

# the length of each walk
walk_length=80,

# number of walks
num_walks=10,

# p,q
p=1, 
q=1,

# worker number
workers=4,

# window size
window=10,
min_count=1,

# batch size
batch_words=4

Thank you a lot in advance!

Long time computing transition probabilities for each node

I have a graph of 141382 nodes and 49296336 edges, but computing the transition probabilities for each node is taking a huge amount of time. The status of my code is as follows:
Computing transition probabilities: 0%| | 44/141382 [10:46<607:39:02, 15.48s/it]
Is there any way out to speed up the execution?

ValueError: probabilities contain NaN

Trying to run node2vec on my dataset using the example code from the ReadMe. During walk generation, I receive the following error:

File "mtrand.pyx", line 920, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN

Any idea where this is coming from? My dataset is ~120 MB, but I can provide it if helpful.
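
NaN transition probabilities usually trace back to NaN, infinite, or negative edge weights poisoning the normalization. A quick hedged check, assuming the weights live under networkx's default 'weight' key:

import math

bad_edges = [(u, v, d.get('weight', 1))
             for u, v, d in graph.edges(data=True)
             if not math.isfinite(d.get('weight', 1)) or d.get('weight', 1) < 0]
print(bad_edges[:10])  # any hits here are candidates for the NaN probabilities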

How to load a saved model?

Hi, I save the model by:

model.save('./test1.model')

But when I tried to reload it and wrote:

model = node2vec.load('./test1.model')

I got an AttributeError: 'Node2Vec' object has no attribute 'load'.

I'm wondering what is the right way to reload my model.
Thank you!
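
Since fit returns a gensim Word2Vec model, the saved file should be reloadable with gensim directly rather than through the Node2Vec class; a minimal sketch:

from gensim.models import Word2Vec

model = Word2Vec.load('./test1.model')  # reload the file saved with model.save(...)
print(model.wv.most_similar('2'))       # '2' is a placeholder node id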

Clarification about the walks from node2vec

Hi!

Just a quick question to clarify the corpus used for further downstream analysis, after running node2vec with the following parameters:

node2vec = Node2Vec(graph, dimensions=128, walk_length=30, num_walks=10, workers=1)

I then access the walks as I need them in a text file:

walks = node2vec.walks

Can I clarify: should the entire corpus contain 10 'sentences' for each node, with each sentence containing 30 words (the words being the surrounding nodes)?

So it should be saved so that each walk of 30 nodes is one line, as follows?

with open(f'walks_for_{graph.name}.txt', 'w') as f:
    for walk in node2vec.walks:
        for node in walk:
            f.write(node + " ")
        f.write("\n")

Thank you. I want to be sure the format is correct for embedding, since I want to embed using another package, and to check that this is also what Node2Vec itself would feed to the model as its corpus.

Saving edge embeddings as KeyedVectors in Python 3

Hi there,

Thanks for implementing this great algorithm in Python 3. For me everything works except saving the edge embeddings as KeyedVectors. It keeps giving me this error:

'Word2VecKeyedVectors' object has no attribute 'add'

It works well in Python 2, but not in Python 3.

Thanks!

Tested with Large Graphs?

Hi,
I am trying to use this library on a multigraph, which has around 28K nodes and 3,000,000 edges; it is a weighted graph.
The Node2Vec constructor runs for around 5-7 hours, after which the kernel restarts. I am running it on a MacBook Pro with 32 GB RAM and 6 cores, with a step size of 6, number of walks 100, and workers set to 3.
Is there a way I can debug this issue and understand why it might be happening? Can you please advise?
Thanks,
Karrtik
