
graphblas-algorithms's Introduction

GraphBLAS Algorithms

graphblas-algorithms is a collection of GraphBLAS algorithms written using python-graphblas. It may be used directly or as an experimental backend to NetworkX.

Why use GraphBLAS Algorithms? Because it is fast, flexible, and familiar: it follows the NetworkX API.

Are we missing any algorithms that you want? Please let us know!

Benchmark comparisons:
  • GraphBLAS vs NetworkX
  • GraphBLAS vs igraph

Installation

conda install -c conda-forge graphblas-algorithms
pip install graphblas-algorithms

Basic Usage

First, create a GraphBLAS Matrix.

import graphblas as gb

M = gb.Matrix.from_coo(
  [0, 0, 1, 2, 2, 3],
  [1, 3, 0, 0, 1, 2],
  [1., 2., 3., 4., 5., 6.],
  nrows=4, ncols=4, dtype='float32'
)

Next, wrap the Matrix as a ga.Graph.

import graphblas_algorithms as ga

G = ga.Graph(M)

Finally, call an algorithm.

hubs, authorities = ga.hits(G)

When the result is a value per node, a gb.Vector will be returned. In the case of HITS, two Vectors are returned representing the hubs and authorities values.
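
For example, a gb.Vector can be exported with to_coo() and zipped into a plain dict (a small sketch; here the node ids are simply the integer positions used to build the Matrix above):

# Sketch: export the hubs Vector to {node id: score}
indices, values = hubs.to_coo()
hub_scores = dict(zip(indices.tolist(), values.tolist()))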

Algorithms whose result is a subgraph will return a ga.Graph.

Plugin for NetworkX

Dispatching to plugins is a new feature in NetworkX 3.0. When both networkx and graphblas-algorithms are installed in the same environment, calls to NetworkX algorithms can be dispatched to the equivalent version in graphblas-algorithms.

Dispatch Example

import networkx as nx
import graphblas_algorithms as ga

# Generate a random graph (5000 nodes, 1_000_000 edges)
G = nx.erdos_renyi_graph(5000, 0.08)

# Explicitly convert to ga.Graph
G2 = ga.Graph.from_networkx(G)

# Pass G2 to NetworkX's k_truss
T5 = nx.k_truss(G2, 5)

G2 is not an nx.Graph, but it does have an attribute __networkx_plugin__ = "graphblas". This tells NetworkX to dispatch the k_truss call to graphblas-algorithms. This connection exists because graphblas-algorithms registers itself as a "networkx.plugin" entry point.
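
To peek at the dispatch hookup yourself (a sketch: the attribute is what the paragraph above describes, while the entry-point group name may differ between NetworkX versions, so treat that string as an assumption):

from importlib.metadata import entry_points  # Python 3.10+ keyword form

print(G2.__networkx_plugin__)  # "graphblas"
print([ep.name for ep in entry_points(group="networkx.plugin")])  # group name assumed from above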

The result T5 is a ga.Graph representing the 5-truss structure of the original graph. To convert to a NetworkX Graph, use:

T5.to_networkx()

Note that even with the conversions to and from ga.Graph, this example still runs 10x faster than using the native NetworkX k-truss implementation. Speed improvements scale with graph size, so larger graphs will see an even larger speed-up relative to NetworkX.
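
To reproduce a rough comparison locally (a sketch; absolute timings vary by machine, and the 10x figure above is the README's claim rather than something this snippet guarantees):

import time

t0 = time.perf_counter()
nx.k_truss(G, 5)                              # pure NetworkX
print("networkx :", time.perf_counter() - t0)

t0 = time.perf_counter()
nx.k_truss(ga.Graph.from_networkx(G), 5)      # dispatched, including the conversion cost
print("graphblas:", time.perf_counter() - t0)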

Plugin Algorithms

The following NetworkX algorithms have been implemented by graphblas-algorithms and can be used following the dispatch pattern shown above.

graphblas_algorithms.nxapi
├── boundary
│   ├── edge_boundary
│   └── node_boundary
├── centrality
│   ├── degree_alg
│   │   ├── degree_centrality
│   │   ├── in_degree_centrality
│   │   └── out_degree_centrality
│   ├── eigenvector
│   │   └── eigenvector_centrality
│   └── katz
│       └── katz_centrality
├── cluster
│   ├── average_clustering
│   ├── clustering
│   ├── generalized_degree
│   ├── square_clustering
│   ├── transitivity
│   └── triangles
├── community
│   └── quality
│       ├── inter_community_edges
│       └── intra_community_edges
├── components
│   ├── connected
│   │   ├── is_connected
│   │   └── node_connected_component
│   └── weakly_connected
│       └── is_weakly_connected
├── core
│   └── k_truss
├── cuts
│   ├── boundary_expansion
│   ├── conductance
│   ├── cut_size
│   ├── edge_expansion
│   ├── mixing_expansion
│   ├── node_expansion
│   ├── normalized_cut_size
│   └── volume
├── dag
│   ├── ancestors
│   └── descendants
├── dominating
│   └── is_dominating_set
├── efficiency_measures
│   └── efficiency
├── generators
│   └── ego
│       └── ego_graph
├── isolate
│   ├── is_isolate
│   ├── isolates
│   └── number_of_isolates
├── isomorphism
│   └── isomorph
│       ├── fast_could_be_isomorphic
│       └── faster_could_be_isomorphic
├── linalg
│   ├── bethehessianmatrix
│   │   └── bethe_hessian_matrix
│   ├── graphmatrix
│   │   └── adjacency_matrix
│   ├── laplacianmatrix
│   │   ├── laplacian_matrix
│   │   └── normalized_laplacian_matrix
│   └── modularitymatrix
│       ├── directed_modularity_matrix
│       └── modularity_matrix
├── link_analysis
│   ├── hits_alg
│   │   └── hits
│   └── pagerank_alg
│       ├── google_matrix
│       └── pagerank
├── lowest_common_ancestors
│   └── lowest_common_ancestor
├── operators
│   ├── binary
│   │   ├── compose
│   │   ├── difference
│   │   ├── disjoint_union
│   │   ├── full_join
│   │   ├── intersection
│   │   ├── symmetric_difference
│   │   └── union
│   └── unary
│       ├── complement
│       └── reverse
├── reciprocity
│   ├── overall_reciprocity
│   └── reciprocity
├── regular
│   ├── is_k_regular
│   └── is_regular
├── shortest_paths
│   ├── dense
│   │   ├── floyd_warshall
│   │   ├── floyd_warshall_numpy
│   │   └── floyd_warshall_predecessor_and_distance
│   ├── generic
│   │   └── has_path
│   ├── unweighted
│   │   ├── all_pairs_shortest_path_length
│   │   ├── single_source_shortest_path_length
│   │   └── single_target_shortest_path_length
│   └── weighted
│       ├── all_pairs_bellman_ford_path_length
│       ├── bellman_ford_path
│       ├── bellman_ford_path_length
│       ├── negative_edge_cycle
│       └── single_source_bellman_ford_path_length
├── simple_paths
│   └── is_simple_path
├── smetric
│   └── s_metric
├── structuralholes
│   └── mutual_weight
├── tournament
│   ├── is_tournament
│   ├── score_sequence
│   └── tournament_matrix
├── traversal
│   └── breadth_first_search
│       ├── bfs_layers
│       └── descendants_at_distance
└── triads
    └── is_triad


graphblas-algorithms's Issues

multi-property graphs and handling of "weight" parameter

Currently, graphblas_algorithms.Graph objects only have a single edge attribute, and the "weight" parameter is ignored when a Graph object from our library is passed (it is used when a networkx graph is passed). We should strive to match networkx, so we should allow multi-property graphs and handle weight parameters.

To do this, it probably makes the most sense to use multiple graphblas matrices to hold edge data. One matrix should be boolean of all True to indicate the structure, and then we should have a new matrix for each edge property (which may have fewer elements than the structural matrix).

We should also consider how this will affect caching of properties (which probably needs to be reconsidered anyway). Perhaps we should create a pseudo-syntax (one that doesn't use eval) such as A.select(tril) for the structural matrix and A[myweight].select(tril) for a property.
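
A minimal sketch of that layout using plain python-graphblas (variable names such as structure and weight are illustrative only, not existing API):

import graphblas as gb
from graphblas import select

rows = [0, 0, 1, 2, 2, 3]
cols = [1, 3, 0, 0, 1, 2]

# boolean matrix of all True describing the structure
structure = gb.Matrix.from_coo(rows, cols, [True] * len(rows), nrows=4, ncols=4, dtype=bool)

# one matrix per edge property; it may have fewer entries than the structural matrix
weight = gb.Matrix.from_coo([0, 2], [1, 0], [1.5, 4.0], nrows=4, ncols=4, dtype="float64")

# roughly what the pseudo-syntax above would mean:
structural_tril = select.tril(structure).new()   # A.select(tril)
weight_tril = select.tril(weight).new()          # A[myweight].select(tril)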

add `id_to_key` as argument to graph constructor

In ga.Graph.__init__ we can specify key_to_id. I can't see any reason why this must be a stdlib dict rather than any other class that implements the typing.Mapping protocol (__getitem__, __len__, __iter__). I have a use case where I have a more memory-efficient mapping than a dict and would like to use it.

However, whenever the id_to_key property is used, this creates a full inverse mapping, undoing any memory efficiency. It would be great to optionally pass an inverse mapping (id_to_key) to avoid this computation when possible. Currently I'm doing:

G = ga.Graph(..., key_to_id=...)
G._id_to_key = ...

but obviously this relies on setting the 'private' _id_to_key member, which has no API stability guarantees. I'd be happy to implement this change!
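
For illustration, here is a hypothetical memory-lean Mapping that could serve as key_to_id under this proposal (the class name and the "n0", "n1", ... key scheme are made up for the example):

from collections.abc import Mapping

class RangeKeyToId(Mapping):
    # Maps node keys "n0", "n1", ..., "n{n-1}" to ids 0 .. n-1 without storing a dict
    def __init__(self, n):
        self.n = n

    def __getitem__(self, key):
        try:
            i = int(key[1:])
        except (TypeError, ValueError):
            raise KeyError(key) from None
        if key != f"n{i}" or not 0 <= i < self.n:
            raise KeyError(key)
        return i

    def __iter__(self):
        return (f"n{i}" for i in range(self.n))

    def __len__(self):
        return self.n

# An optional id_to_key argument would let a similarly cheap inverse mapping be passed in,
# instead of materializing a full dict.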

square_clustering output using networkx api

digraphTCC2 = ga.Graph.from_networkx(digraphTCC)
clus = nx.square_clustering(digraphTCC2)
print(clus)

for i in clus:
    print(i)

I've just found graphblas-algorithms as a way to speed up computations on a networkx graph, and I've integrated it into my work for finding the square clustering coefficient of a graph. The speedup is immense, but I'm not sure how to interpret the result.

Based on the code snippet above (digraphTCC is a valid networkx graph), the result of the function, clus, prints as
<graphblas_algorithms.classes.nodemap.NodeMap object at 0x00000206BC590550>.

I'm not sure how to interact with a NodeMap object, but when I iterate through it, the values seem to just be node labels. How should I interpret this result, and how can I transform it into the dictionary of {node: clustering} that you would expect from the slower NetworkX version?

Thank you for your time!
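
As noted in the NodeMap repr issue further down, a NodeMap is a MutableMapping, so converting the result to a plain dict should give the expected {node: coefficient} mapping (a sketch, assuming that Mapping behavior):

clus_dict = dict(clus)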

Redo/refactor calculations and caching of properties such as degrees

We compute and cache many properties such as has_self_edges, degrees- (w/o diagonals), degrees+ (with diagonals), etc., which is actually very handy. However, the code is a little complicated (cough cough, my bad) and may be error prone (see: #82 (comment)). It would be great if we could refactor this to be more clear, maintainable, and safe. I imagine such a solution would be more declarative and less procedural.
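
One declarative direction (a generic sketch of the idea, not the project's plan) is to compute each derived property lazily behind functools.cached_property:

from functools import cached_property
import graphblas as gb

class CachedProperties:
    # Hypothetical holder of derived graph properties, computed once on first access
    def __init__(self, A):
        self.A = A  # adjacency Matrix

    @cached_property
    def has_self_edges(self):
        # any entries on the diagonal indicate self-edges
        return gb.select.diag(self.A).new().nvals > 0

    @cached_property
    def offdiag(self):
        # adjacency with the diagonal removed (a building block for "degrees-" style properties)
        return gb.select.offdiag(self.A).new()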

Raw vs Wrapped objects

Algorithms should accept two types of input: a raw Matrix or a wrapped Graph.

  • When given a Matrix, the return type should be raw (i.e. a Matrix or Vector).
  • When given a Graph, the return type should be wrapped (i.e. a Graph or NodeMap).

This will make the interface more uniform when interacting with nxapi: Graphs will always be passed in, so a wrapped instance will always be returned. When using graphblas-algorithms without nxapi, the user controls whether results are wrapped by how they call the algorithms.
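
As a hypothetical illustration of the proposed convention (not necessarily current behavior):

import graphblas_algorithms as ga

v = ga.pagerank(M)   # raw Matrix in    -> raw result out (a gb.Vector)
d = ga.pagerank(G)   # wrapped Graph in -> wrapped result out (a NodeMap)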

`NodeMap` and other results need better reprs

From #82, showing <graphblas_algorithms.classes.nodemap.NodeMap object at 0x00000206BC590550> is not the most helpful even though a NodeMap is a MutableMapping.

So, the repr of NodeMap should be updated to look more dict-like. Same for VectorMap, VectorNodeMap, NodeNodeMap, and NodeSet (these could probably use basic docstrings too!).
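
A dict-like repr could be as small as the following method (a sketch, not the project's implementation; it relies only on the Mapping behavior mentioned above):

def __repr__(self):
    # dropped into NodeMap (and friends): show contents like a dict, prefixed by the class name
    return f"{type(self).__name__}({dict(self)!r})"

For NodeSet, dict(self) would be replaced by set(self).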

Improving onboarding and enabling contributions

It will be an ongoing effort to make it easier to contribute to graphblas-algorithms, so let's gather ideas and prioritize them in this thread. Heh, right now, I encourage potential contributors to be very patient. Things will get better/easier, I promise!

  • Improve the overall documentation of python-graphblas
  • Add contributing documentation to python-graphblas and graphblas-algorithms
    • How to set up local dev environment, etc. Should borrow liberally from contributing docs from other libraries.
  • Create issues for "starter algorithms"
  • Create "algorithm template" notebook that helps guide development
    • Create example data, load datasets, guidelines for benchmarking, run networkx tests, etc.
  • Create notebooks with exercises as e.g. self-guided tutorials
    • Example starter exercises:
      • Find neighbors of node X
      • Find "friends of friends"
      • Find predecessors (or successors) of directed graphs
      • etc.; there are lots of "simple" algorithms we can build up to
    • I think some notebooks should encourage hands-on experimentation, not just reading
  • I think dev notebooks could be a nice side-effect of adding algorithms
    • For example, for triangle counting, there were lots of possibilities to choose from, and I decided based on benchmarks
    • These notebooks can help demonstrate how algorithm development is done, and help revise algorithms later if necessary
    • These may be messy. Not intended as tutorial notebooks. I have some notebooks around that probably belong somewhere.
    • I don't want to require a notebook for each algorithm, but it's a nice-to-have.
  • Don't forget about contributions other than algorithms!

I'm the most familiar with both python-graphblas and graphblas-algorithms, so I trust my judgement the least regarding how to improve onboarding and enabling contributions. @MridulS, @z3y50n, @ParticularMiner, what do you think would be the most helpful? Also, thanks for your patience :)

Update README

The package is now available on conda-forge. Update the README to indicate that.

Automated testing to compare with networkx

It would be super-duper handy to be able to automatically generate input graphs and node selection lists (i.e., masks) to run with both NetworkX and graphblas-algorithms and compare results.

For example, it would be nice to cover:

  • Graph and DiGraph
  • With and w/o self-edges
  • Symmetric and asymmetric DiGraph
  • Purely structural (all edges 1)
  • Edge values contain all combinations of {positive, 0, negative}, and with ints or floats
  • Different edge densities, including full graph (w/ and w/o self-edges)
  • Empty graph, only self-edges
  • Some rows and/or columns are empty (i.e., no in-edges or out-edges or both)
  • Perhaps some graphs with specific shapes: ring, tree, DAG, etc
  • Bipartite graphs
  • More than one group of connected components (by construction)
  • etc.

Perhaps we could leverage hypothesis to help generate random inputs. I would be delighted if we began with a very small subset of the above.
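
For example, a tiny starting point with hypothesis might look like this (a sketch under a few assumptions: pagerank dispatches as in the README example above, the dispatched result converts cleanly to a dict, and the loose tolerance is adequate):

import hypothesis.strategies as st
from hypothesis import given, settings
import networkx as nx
import graphblas_algorithms as ga


@st.composite
def random_graphs(draw):
    # small Erdos-Renyi graphs to start; extend later with DiGraphs, self-edges, value ranges, etc.
    n = draw(st.integers(min_value=2, max_value=20))
    p = draw(st.floats(min_value=0.0, max_value=1.0))
    seed = draw(st.integers(min_value=0, max_value=2**32 - 1))
    return nx.gnp_random_graph(n, p, seed=seed)


@given(random_graphs())
@settings(max_examples=50, deadline=None)
def test_pagerank_matches_networkx(G):
    expected = nx.pagerank(G)
    result = dict(nx.pagerank(ga.Graph.from_networkx(G)))  # dispatched; assumed dict-convertible
    assert expected.keys() == result.keys()
    for node in expected:
        assert abs(expected[node] - result[node]) < 1e-4  # loose tolerance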

CC @jim22k who has done similar work in the past. Having this functionality would be incredibly useful in making sure we match NetworkX. For one thing, it would help us determine what to do about self-edges, which may be poorly defined at times for NetworkX (I really don't know if it is or not), but should be well-defined for us: do what NetworkX does.
