Giter VIP home page Giter VIP logo

Comments (10)

mclegrand avatar mclegrand commented on July 17, 2024 3

I agree with @LeoLaugier that his test should pass, afaiu if a node is of k-value k, then it's connected to at least k nodes of k-value ≥k

I looked at the code and in particular at minheap implementation, and found 2 bugs and a question:

  • Why reimplement min-heap instead of using std::priority_queue
  • [BUG] decrease_key misses a p = parent(pos) in the loop
  • [BUG] pop_min misses a self.pos[self.val[0]] = 0.
index 7bb2423d..f9ae7ca2 100644
--- a/sknetwork/utils/minheap.pyx
+++ b/sknetwork/utils/minheap.pyx
@@ -75,6 +75,7 @@ cdef class MinHeap:
             while (pos != 0) and (scores[self.val[p]] > scores[self.val[pos]]):
                 self.swap(pos, p)
                 pos = p
+                p = parent(pos)
 
     cdef int pop_min(self, int[:] scores):
         """Remove and return the minimum element (or root) from the heap."""
@@ -85,6 +86,7 @@ cdef class MinHeap:
         # Store the minimum value, and remove it from heap
         cdef int root = self.val[0]
         self.val[0] = self.val[self.size-1]
+        self.pos[self.val[0]] = 0
         self.size -= 1
         self.min_heapify(0, scores)
 

However, even with those, I get the issue:

>>> import sknetwork
>>> import scipy
>>> x = scipy.sparse.load_npz("/tmp/biadjacency_full.npz")
>>> from sknetwork.utils import *
>>> from sknetwork.topology import *
>>> adjacency = bipartite2undirected(x)
>>> kcore = CoreDecomposition()
>>> kcore.fit(adjacency)
CoreDecomposition()
>>> for indice in range(adjacency.shape[0]):
...  k=kcore.labels_[indice]
...  x = 0
...  for adj in range(adjacency.indptr[indice], adjacency.indptr[indice+1]):
...   if kcore.labels_[adj] >= k:
...    x = x+1
...  print(indice, k, x)
...  assert(x>=k)
... 
0 5 4
Traceback (most recent call last):
  File "<stdin>", line 8, in <module>
AssertionError
>>> adjacency.indptr
array([      0,      10,      13, ..., 3593494, 3593495, 3593496],
      dtype=int32)
>>> adjacency.indices[0:10]
array([24389, 24390, 24391, 24392, 24393, 24394, 24395, 24396, 24397,
       24398], dtype=int32)
>>> kcore.labels_[24389:24398]
array([ 2, 38, 30,  5,  4, 38,  1,  1,  1], dtype=int32)

(4 values ≥5)

Still digging...

Edit: The error was in my test.

for indice in range(adjacency.shape[0]-1):
 k=kcore.labels_[indice]
 x = 0
 for adj in range(adjacency.indptr[indice], adjacency.indptr[indice+1]):
  if kcore.labels_[adjacency.indices[adj]] >= k:
   x = x+1
 print(indice, k, x)
 assert(x>=k)

passes with the diff above

from scikit-network.

TiphaineV avatar TiphaineV commented on July 17, 2024 1

Hi Leo,

I have assigned @tbonald because I suspect an error creeped in when @tbonald merged the Bi* functions with the regular ones a few weeks ago. I'll have a deeper look too, if the error doesn't seem obvious to Thomas.

Cheers,

from scikit-network.

mclegrand avatar mclegrand commented on July 17, 2024 1

(note: with a priority_queue, it would look like this: )

diff --git a/sknetwork/topology/kcore.pyx b/sknetwork/topology/kcore.pyx
index e180f219..adc7feaa 100755
--- a/sknetwork/topology/kcore.pyx
+++ b/sknetwork/topology/kcore.pyx
@@ -8,6 +8,8 @@ Created on Jun 3, 2020
 @author: Yohann Robert <[email protected]>
 """
 cimport cython
+from libcpp.queue cimport priority_queue
+from libcpp.pair cimport pair
 
 import numpy as np
 cimport numpy as np
@@ -15,7 +17,6 @@ cimport numpy as np
 from scipy import sparse
 
 from sknetwork.utils.base import Algorithm
-from sknetwork.utils.minheap cimport MinHeap
 
 
 @cython.boundscheck(False)
@@ -40,23 +41,27 @@ cdef fit_core(int[:] indptr, int[:] indices):
     cdef int min_node          # current node of minimum degree
     cdef int i, j, k
     cdef int[:] degrees = np.asarray(indptr)[1:] - np.asarray(indptr)[:n]
-    cdef np.ndarray[int, ndim=1] labels = np.empty((n,), dtype=np.int32)
-    cdef MinHeap mh = MinHeap.__new__(MinHeap, n)      # minimum heap with an update system
+    cdef np.ndarray[int, ndim=1] labels = np.full((n,), -1, dtype=np.int32)
+    cdef priority_queue[pair[int, int]] pq
 
-    # inserts all nodes in the heap
+    # inserts all nodes in the queue
     for i in range(n):
-        mh.insert_key(i, degrees)
+        pq.push((-degrees[i], i))
 
     i = n - 1          # index of the rear of the list/array
-    while not mh.empty():              # until the heap is emptied
-        min_node = mh.pop_min(degrees)
+    while not pq.empty():
+        min_node = pq.top().second
+        pq.pop()
+        if labels[min_node] > -1:
+            continue
         core_value = max(core_value, degrees[min_node])
 
         # decreases the degree of each neighbors of min_node to simulate its deletion
         for k in range(indptr[min_node], indptr[min_node+1]):
             j = indices[k]
             degrees[j] -= 1
-            mh.decrease_key(j, degrees)                # updates the heap to take into account the new degrees
+            if labels[j] < 0:
+                pq.push((-degrees[j], j))
 
         labels[min_node] = core_value
         i -= 1

from scikit-network.

TiphaineV avatar TiphaineV commented on July 17, 2024

Hi again @LeoLaugier ! Sorry for the delay, I finally could look into it. I'm not sure there is a bug in the CoreDecomposition module. Indeed, if I replace your last two lines of code with the following:

faulty_nodes = []
for indice in range(adjacency.shape[0]):
    if kcore_nodes[indice] and adjacency.indptr[indice + 1] - adjacency.indptr[indice] < k:
        faulty_nodes.append(indice)

then faulty_nodes is indeed empty at the end on your data. This is also true when checking the degree of the neighbours in the core (in other words, the core property is respected on the whole core subgraph):

faulty_nodes = []
for indice in range(adjacency.shape[0]):
    if kcore_nodes[indice] and adjacency.indptr[indice + 1] - adjacency.indptr[indice] < k:
        if len([degree(x) for x in adjacency.indices[adjacency.indptr[indice + 1] - adjacency.indptr[indice]] if kcore_nodes[x] ]) < k:
            faulty_nodes.append(indice)
len(faulty_nodes), faulty_nodes[:10]

from scikit-network.

QLutz avatar QLutz commented on July 17, 2024

Thanks @mclegrand, to answer your question, the priority queue cannot be cimported in Cython (see here for available std classes). There may be ways to make it work that I am not aware of though.

from scikit-network.

mclegrand avatar mclegrand commented on July 17, 2024

It is here, though: https://github.com/cython/cython/blob/master/Cython/Includes/libcpp/queue.pxd (it's a type of queue)

from scikit-network.

QLutz avatar QLutz commented on July 17, 2024

Then the min heap is probably superfluous indeed.

from scikit-network.

TiphaineV avatar TiphaineV commented on July 17, 2024

Thanks @mclegrand! Could you please commit your changes so we can close the issue? I think refactoring to use the queue instead of the minheap is a bit out of the scope of the issue itself.

from scikit-network.

LeoLaugier avatar LeoLaugier commented on July 17, 2024

Thank you @mclegrand @QLutz @TiphaineV @tbonald : the k-core decomposition now works as expected on my dataset 👍

from scikit-network.

mclegrand avatar mclegrand commented on July 17, 2024

Just to close this topic, I tried to time minheap vs priority_queue with

import sknetwork
from time import time
import scipy
x = scipy.sparse.load_npz("/home/mc/Downloads/biadjacency_full.npz")
from sknetwork.utils import *
from sknetwork.topology import *
adjacency = bipartite2undirected(x)
start_time = time()
kcore = CoreDecomposition()
kcore.fit(adjacency)
print("--- %s seconds ---" % (time() - start_time))

and got

pqueue
--- 0.5988922119140625 seconds ---
--- 0.6231446266174316 seconds ---
--- 0.5909104347229004 seconds ---
--- 0.6011207103729248 seconds ---
--- 0.6517014503479004 seconds ---
--- 0.6074094772338867 seconds ---
--- 0.6160576343536377 seconds ---
--- 0.639246940612793 seconds ---
--- 0.5900239944458008 seconds ---

minheap
--- 0.49435877799987793 seconds ---
--- 0.5180974006652832 seconds ---
--- 0.5038821697235107 seconds ---
--- 0.5666232109069824 seconds ---
--- 0.5060639381408691 seconds ---
--- 0.4990103244781494 seconds ---
--- 0.5349819660186768 seconds ---
--- 0.5090928077697754 seconds ---
--- 0.5094292163848877 seconds ---
--- 0.5129563808441162 seconds ---

15% better for minHeap so no need to get rid of it

from scikit-network.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.