Giter VIP home page Giter VIP logo

xswap's Introduction

XSwap: Fast degree-preserving network permutation

Linux Build Status PyPI GitHub issues

Full documentation: https://hetio.github.io/xswap/

XSwap is an algorithm for degree-preserving network randomization (permutation) [1]. Permuted networks can be used for a number of purposes in network analysis, including for generating counterfactual distributions of features when only the network's degree sequence is maintained or for computing a prior probability of an edge given only the network's degree sequence. Overall, permuted networks allow one to quantify the effects of degree on analysis and prediction methods. Understanding this effect is useful when a network's degree sequence is subject to biases. This implementation is a modified version of the algorithm due to Hanhijärvi et al. with two additional parameters (allow_self_loops and allow_antiparallel), which enable greater generalizability to bipartite, directed, and undirected networks.

  1. Randomization Techniques for Graphs
    Sami Hanhijärvi, Gemma C. Garriga, Kai Puolamäki
    Proceedings of the 2009 SIAM International Conference on Data Mining (2009-04-30) https://doi.org/f3mn58
    DOI: 10.1137/1.9781611972795.67

Usage examples

Permuting an edge list

>>> edges = [(0, 1), (1, 0)]
>>> permuted_edges, permutation_statistics = xswap.permute_edge_list(
        edges, allow_self_loops=True, allow_antiparallel=True,
        multiplier=10)
>>> permuted_edges
[(0, 0), (1, 1)]
>>> permutation_statistics
{'swap_attempts': 20, 'same_edge': 10, 'self_loop': 0, 'duplicate': 1,
 'undir_duplicate': 0, 'excluded': 0}

Computing degree-sequence based prior probabilities of edges existing

>>> edges = [(0, 1), (1, 0)]
>>> prior_prob_df = xswap.prior.compute_xswap_priors(
        edges, n_permutations=10000, shape=(2, 2), allow_self_loops=True,
        allow_antiparallel=True)
>>> prior_prob_df
   source_id  target_id   edge  source_degree  target_degree  xswap_prior
0          0          0  False              1              1          0.5
1          0          1   True              1              1          0.5
2          1          0   True              1              1          0.5
3          1          1  False              1              1          0.5

Choice of parameters

Bipartite networks

Bipartite networks should be indexed using the bi-adjacency matrix, meaning that the edge (0, 0) is from source node 0 to target node 0, and is not a self-loop. Moreover, bipartite networks should be permuted using allow_self_loops=False and allow_antiparallel=True.

Directed and undirected networks

For non-bipartite networks, the decisions of allow_self_loops and allow_antiparallel are not always the same. For undirected networks, set allow_antiparallel=False, as otherwise the edges (1, 0) and (0, 1), which represent the same edge, will be treated as separate. Antiparallel edges may or may not be allowed for directed networks, depending on context. Similarly, self-loops may or may not be allowed for directed or undirected networks, depending on the specific network being permuted.

Libraries

The XSwap library includes Roaring Bitmaps, available under the Apache 2.0 license.

Acknowledgments

Development of this project has largely taken place in the Greene Lab at the University of Pennsylvania. As an open source project under the hetio organization, this repository is grateful for its community of maintainers, contributors, and users.

This work is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grants GBMF4552 to Casey Greene, GBMF4560 to Blair Sullivan, and the National Institutes of Health’s National Human Genome Research Institute R01 HG010067.

xswap's People

Contributors

dhimmel avatar zietzm avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

xswap's Issues

Tweaks to map_str_edges function

For map_str_edges, I think it will work with edges containing nodes as objects other than strings. Should we make this more clear in its name and docstring. Something like compress_edges_with_int_ids?

def _map_nodes_to_int(nodes):
"""
Return a dict mapping a list of nodes to their sorted indices. Nodes should
be a list of strings.
Returns:
--------
Dict[str, int]
"""
sorted_node_set = sorted(set(nodes))
name_to_id = {name: i for i, name in enumerate(sorted_node_set)}
return name_to_id
def _apply_map(edges, source_mapping, target_mapping):
"""
Maps edges according to new node names specified by source and target maps.
edges : List[Tuple[str, str]]
source_mapping : Dict[str, int]
target_mapping : Dict[str, int]
"""
source_nodes = [edge[0] for edge in edges]
target_nodes = [edge[1] for edge in edges]
mapped_nodes = [
map(source_mapping.get, source_nodes),
map(target_mapping.get, target_nodes),
]
return list(zip(*mapped_nodes))
def map_str_edges(edges, bipartite):
"""
Maps a list of edge tuples containing strings to a minimal set of
integer edges.
edges : List[Tuple[str, str]]
bipartite : bool
Whether to map source and target nodes using the same mapping.
For example, an edge like ('1', '1') may refer to a connection between
separate nodes, or it may be a self-loop. If `bipartite=True`, the
edge would be mapped like (0, 1), where the new node ids reflect the fact
that the same names do not indicate the same nodes. To ensure that names
are consistently mapped between source and target, put `bipartite=False`.
Returns:
--------
Tuple[List[Tuple[int, int]], Dict[int, str]]
Example:
--------
>>> map_str_edges([('a', 'b'), ('b', 'c')], bipartite=False)
([(0, 1), (1, 2)], {0: 'a', 1: 'b', 2: 'c'})
"""
source_nodes = [edge[0] for edge in edges]
target_nodes = [edge[1] for edge in edges]
# Two separate mappings to be used for source and target nodes
if bipartite:
source_map = _map_nodes_to_int(source_nodes)
target_map = _map_nodes_to_int(target_nodes)
# One single mapping to be used for both source and target nodes
if not bipartite:
combined_nodes = list(set(source_nodes + target_nodes))
source_map = target_map = _map_nodes_to_int(combined_nodes)
mapped_edges = _apply_map(edges, source_map, target_map)
return (mapped_edges, source_map, target_map)

Differentiating between edge types at the API level

I am thinking we want three user-facing functions to perform the permutation. They could be named like:

permute_undirected_edges
permute_directed_edges
permute_bipartite_edges

I think this would be clearest for the user. You could remove the confusing allow_self_loop from permute_bipartite_edges. allow_antiparallel would only be an option for permute_directed_edges. Also permute_undirected_edges could mix sources or targets when swapping to achieve greater potential permutations?

@zietzm, do you think this makes sense?

`import xswap` requires numpy but setup.py does not include this dependency

>>> import xswap
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dhimmel/anaconda3/envs/xswap/lib/python3.7/site-packages/xswap/__init__.py", line 1, in <module>
    from xswap import network_formats
  File "/home/dhimmel/anaconda3/envs/xswap/lib/python3.7/site-packages/xswap/network_formats.py", line 3, in <module>
    import numpy
ModuleNotFoundError: No module named 'numpy'

I think we wanted users without numpy to be able to use the package, right? So we should probably just make the numpy import lazy.

Also you may want to specify extras_require in setup.py for users who want to install the extra dependencies.

Use Roaring bitmap

I am considering switching the HashTable custom implementation to a compressed bitset implemented in C, Roaring Bitmap. Since the bitset is often extremely sparse, this should offer a great time/memory tradeoff and should be able to work more flexibly for networks of greater size.

There have been a few example networks I wanted to permute where an uncompressed bit array simply took too much memory.

Use successes rather than attempts as stop condition

Since only successful swaps advance the network randomization, it doesn't make a ton of sense to me to set an attempt threshold as the stop condition versus a success threshold. This has been bothering me since I initially implemented the attempt threshold in hetnetpy.

To prevent infinite running, we could enable a max_attempts parameter, perhaps that is a multiplier for based on the specified number of successes.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.