CompNode, an effective and efficient node ranking framework for billion-scale heterogeneous graphs. It can accurately rank over 500 million nodes within a day by leveraging rich path-based information relevant to target conditions. The more relevant paths are used, the higher the ranking accuracy, and the addition of more path types does not significantly impact computation time. For detailed design and analysis, please refer to our paper "Complex-Path: Effective and Efficient Node Ranking with Paths in Billion-Scale Heterogeneous Graphs" (currently under review).
Three stages in the workflow of our CompNode framework.
Our framework is mainly based on a new generalized path schema called 'Complex-path' for defining non-linear conditional path types in a standard form. In real-world scenarios, we find that many non-linear conditional path types represent highly effective information. For example, when searching for potential high-value customers, if a customer can be associated with many existing high-value customers through the paths illustrated in the figure below, they are often likely to be a potential high-value customer themselves.
However, the commonly used path schema ’Meta-path‘ in heterogeneous graph neural networks can only represent linear path types. Therefore, we refine the Meta-path definition to convey nonlinear conditional paths while maintaining a generalized path schema suitable for heterogeneous graph neural networks. We propose four improvements based on the four situations shown below.
The final definition of Complex-path is:
A complex-path is defined on a heterogeneous graph HG = (V, E)
and is represented as: p = { [ (o1, L1, o2), ..., (oi, L(j-1), oj), ..., (on, L(m-1), om) ] | C }
. Each tuple (oi, L(j-1), oj)
connects the j
-th node oj
to the prior i
-th node oi
through any edge or complex-path in L(j-1) = [..., ek, ...]
, where ek = lk or pk
, i < j
. C
is the conditional constraints on the attributes of all nodes and edges.
This framework has been effectively deployed in JD Logistics for finding top potential high-value customers. The following figure illustrates that, based on our prediction results, we made over 262,720 phone calls in China before October 2023. This strategy outperforms the previously deployed method by 252% in success rate. We are still using this framework for recommending customers with more potential value. As of December 1, 2023, more than 50,000 phone calls have been made in the last week alone. For more comprehensive details about our framework, please refer to our paper.
The 'Utils' folder includes our methods for extracting information from heterogeneous graphs, developed using Spark and complex-path. This folder also includes baseline methods like meta-path and k-hop sampling.
The 'Model' folder contains our model implemented with PyTorch, along with the baseline models referenced in our paper.
The 'Main' folder includes code that utilizes the functions from the aforementioned folders to implement node ranking.
We are continuously refining and optimizing our code, and will keep updating it.
pyspark==3.0.0
pytorch==1.8.0