
an2vec's Introduction

Attributed Node to Vec

Bring those two things together! Do cool context- and language-aware stuff! Okay, let's fill this in a little more precisely now.

Setup

This project is written mostly in Julia, with left-over Python in scripts and, more so, in notebooks.

For Julia: we're running 1.1, but you can get the latest version. To install the required dependencies on your laptop, run julia environment.jl. On a headless server, or a machine on which you don't have the dependencies necessary for Makie.jl (notably GLFW), you can instead run julia environment.headless.jl to use Makie's Cairo backend.

For Python: using Anaconda, set up the environment with conda env create -f environment.lock.yml (this builds the Rust extensions in this package, and installs them locally).

Then activate the Anaconda environment, and run python -m julia.sysimage julia/sys-$(hostname).so to build a custom Julia system image, making sure that calls from Python to Julia will work.

Finally, still with the environment activated, run ./setup-datasets.sh to set up the test datasets in the datasets/ folder.

Updating and managing the Python environment

environment.lock.yml contains the full list of packages and their versions, used to duplicate the environment we use in development. To upgrade the packages in that environment, use environment.yml instead: it only includes the top-level required packages, without versions, so conda will resolve to updated package versions.

environment.lock-gpu.yml is a copy of environment.lock.yml that uses GPU-enabled TensorFlow.

Managing large data files

This repository uses git-annex to manage large data files, which can be a bit complicated. To use it, first make sure you have git-annex installed. If you don't, and you don't have root access (e.g. in an HPC environment), git-annex is available through Anaconda, so you can install it inside a dedicated Anaconda environment.

Then, from inside your copy of this repository, you tell git that you're using git-annex:

git annex init
git annex enableremote warehouse-rsync

(Note: the warehouse-rsync remote is a so-called "special" annex remote tracked in git, and was created using git annex initremote warehouse-rsync type=rsync rsyncurl=blondy.lip.ens-lyon.fr:/warehouse/COMPLEXNET/nw2vec/annex encryption=none chunk=10Mib. You don't need to do that again.)

Now get a copy of all the large files from the warehouse-rsync remote:

git annex get .

Now whenever you want to add files to the annex:

# Commit hashes of the large files
git annex add <the files you want to add>
git commit
# Actually copy your large files to `warehouse-rsync`:
git annex copy --to warehouse-rsync <the files you added>
# Publish your changes upstream (the `--all` makes sure the `git-annex` branch is also pushed)
git push --all

If the push operation fails because of a conflict on the git-annex branch, you need to merge the remote's version of the git-annex branch into yours. You can do that without having to check out that branch:

git pull
git fetch origin git-annex:git-annex  # Fast-forward your local git-annex branch with what was just pulled from the remote

then git pushing again should work. Note that since git-annex tracks which computer has which copy of which file, whenever someone downloads a large file from the annex, that download is recorded and the git-annex branch is updated. So this pull/fetch dance may be necessary quite often as more people and computers start working on the repository.

Analyses

You can have a look at the wiki. The projects/ folder contains all the notebooks and scripts for the repo's projects, and is probably the most interesting thing to look at. The julia/ folder contains the current implementation of AN2VEC. The data/ folder contains any data produced by the analyses in julia/ and projects/.

For now, we don't use the Twitter data in the datasets/ folder, so you can ignore the sections below and jump directly to the wiki or the projects after installing the environment. If, however, you want to play with the Twitter data, read on.

Datasets folders

After running ./setup-datasets.sh, the datasets/ folder contains pointers to all the data we process and the outputs of processing:

datasets
├── karate                   # Test networks
├── malariaDBLaNetworks2013  #
├── BlogCatalog-dataset      #
├── Flickr-dataset           #
├── YouTube-dataset          #
├── sosweet-raw              # Raw Twitter data
│     -> /warehouse/COMPLEXNET/TWITTER/data
├── sosweet-raw-retweets     # Raw Twitter data for the time range with retweets
│     -> /warehouse/COMPLEXNET/nw2vec/sosweet-raw-retweets
├── sosweet-network          # Files encoding the network(s) of users and its embedding(s)
│     -> /warehouse/COMPLEXNET/nw2vec/sosweet-network
├── sosweet-text             # Files encoding the text of tweets
│     -> /warehouse/COMPLEXNET/nw2vec/sosweet-text
└── sosweet-w2v              # Files encoding the word2vec embeddings of words from tweets
      -> /warehouse/COMPLEXNET/nw2vec/sosweet-w2v

To set these up, and assuming you have access to the raw Twitter data in datasets/sosweet-raw:

  • Compute the word2vec embeddings: TODO: document
  • Extract the networks of users: TODO: document
  • Compute the network embeddings by running the projects/scratchpads/sosweet-node2vec.ipynb notebook
  • Extract the raw text of tweets:
sosweet-scripts/iter-gz \
    --no-gzip \
    sosweet-scripts/pipe-user_timestamp_body-csv \
    datasets/sosweet-text \
    $(ls datasets/sosweet-raw/2016*.tgz datasets/sosweet-raw/2017*.tgz)
  • Extract the user ids for the 5-mutual-mention network:
cat datasets/sosweet-network/undir_weighted_mention_network_thresh_5.txt \
    | sosweet-scripts/pipe-unique_users \
    > datasets/sosweet-network/undir_weighted_mention_network_thresh_5.users.txt
  • Filter the raw text of tweets for only those users:
# Create the destination folder
mkdir -p datasets/sosweet-text/undir_weighted_mention_network_thresh_5
# Run the filter
sosweet-scripts/iter-gz \
    --no-gzip \
    "sosweet-scripts/pipe-filter_users datasets/sosweet-network/undir_weighted_mention_network_thresh_5.users.txt" \
    datasets/sosweet-text/undir_weighted_mention_network_thresh_5 \
    $(ls datasets/sosweet-text/*-csv)

License

This code is under Inria's copyright, and licensed under GPLv3+ in agreement with them.


an2vec's Issues

Left-over todos

  • We updated the non-gpu environment in e0b89bc, so once #29 and #31 have finished running we should update the gpu environment
  • Also after #29 has finished running, rename the output files so they include the feature noise scale (as introduced by beba46b)
  • finish Git-LFS history rewrite for all branches
  • recreate a PR to replace #12 which got closed by the history rewrite
  • remove issue-3-mini-batching-with-features if possible
  • remove or check history rewrite of issue-6-sensitivity-analysis
  • find out git-lfs quota on gitlab.com and on Inria's gitlab, as we are reaching the quota here

Rate trade-off between feature and network structure

In representing the network at the level of ξ, both network structure and feature structure can count, and they might not agree. Some networks, e.g. in #8, might be better described by giving more weight to adjacency reconstruction (p_adj_loss), and some by giving more weight to feature reconstruction (p_v_loss).

  • One way to explore this is to simply weigh α × p_adj_loss + β × p_v_loss with α + β = 1, and explore the output for different values of α and β (see the sketch after this list). But for comparisons to be made, we need a quality measure other than the total loss (which we affect when changing α and β).
  • Another way is to directly train α and β (use them as trainable parameters in the learning process), and see if different synthetic data sets give different best values.
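
As a rough sketch of the first option (plain NumPy; the function name and the example loss values here are illustrative, not part of the codebase), the constraint α + β = 1 can be enforced by construction by parametrising α through a sigmoid of an unconstrained parameter θ, which is also what would make the weighting trainable as in the second option:

import numpy as np

def combined_loss(p_adj_loss, p_v_loss, theta):
    # alpha = sigmoid(theta) lies in (0, 1) and beta = 1 - alpha, so
    # alpha + beta = 1 holds by construction; theta is unconstrained,
    # hence directly usable as a trainable parameter.
    alpha = 1.0 / (1.0 + np.exp(-theta))
    return alpha * p_adj_loss + (1.0 - alpha) * p_v_loss

# Exploring the trade-off over a few fixed values of theta:
for theta in [-2.0, 0.0, 2.0]:
    print(theta, combined_loss(p_adj_loss=1.3, p_v_loss=0.7, theta=theta))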

Test various fixes to BlogCatalog training

Run the training for only 2000 epochs, as most of the final quality seems to be reached by then.

  • Use the Bernoulli decoder for features
  • Use changes from #38
  • Remove u from embedding parameters

How much does the trained ξ improve on the untrained ξ?

When the features correlate a little with the network structure, even random weights seem to separate the distribution of ξ (i.e. the representation layer in the VAE) into communities. This is, as Thomas Kipf explains in his post on GCNs, because "a GCN model [can be interpreted] as a generalized, differentiable version of the well-known Weisfeiler-Lehman algorithm on graphs". It also makes sense since you basically average features from community neighbours, thereby recovering any local average that is consistent with the network structure.

The upshot is that even without training, in some cases the distribution of ξ seems to reflect network structure. So what's the use of training, apart from getting a generative model (i.e. training the layers that come after the ξ layer)?
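
As a minimal NumPy sketch of that untrained case (this is the standard GCN propagation rule with random weights, not the actual AN2VEC code; A is a binary adjacency matrix, X a feature matrix):

import numpy as np

def untrained_gcn_layer(A, X, dim_out, seed=0):
    # Symmetrically normalised propagation D^{-1/2} (A + I) D^{-1/2} X W,
    # with W random, i.e. no training at all.
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    W = np.random.default_rng(seed).normal(size=(X.shape[1], dim_out))
    return np.tanh(d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W)

Each output row is a normalised local average of X pushed through a random projection, so nodes with similar neighbourhood features land close to each other before any training happens.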

Sub-questions:

  • How does such a random ξ behave with non-random features that do not correlate with the network structure (e.g. feature communities have lower, higher, or displaced granularity compared to network communities): does one structure win over the other, or does it simply mess up the picture? The answer probably lies in criticisms of label-propagation algorithms for community detection (e.g. Raghavan, Albert, & Kumara, 2007, "Near linear time algorithm to detect community structures in large-scale networks").
  • What is a "good" ξ distribution (i.e. a good representation model), independently of its use in the generative model? One that visually reproduces the feature structure, or the network structure, or both? Can we build a measure of how good a given ξ is? Maybe there are answers to this in the pro-and-con arguments about why t-SNE is a good visualisation method (see e.g. How to Use t-SNE Effectively).

Minibatch effect and strategies

Go back to simple manual exploration of the effect of the current minibatch strategy on a small (200-500 nodes) network.

Try out other sampling strategies. First for the case where we know the graph generation parameters. Then for the case where we don't know the graph model.

Mini-batching optimisation

In #8 we're applying the current code to the BlogCatalog dataset (10,312 nodes, 333,983 edges), and it is SLOW: about 20s per mini-batch, so about 7 minutes per epoch. Some profiling insights:

On grunch, fit_generator_feed() using n_epochs=1, batch_size=500, max_walk_length=200 and neighbour_samples=30 spends > 52% of its time in the batch-generation code.

On grunch again, same parameters, looking only at the batch-generation code:

  • 4.4% time spent in jumpy_walks()
  • 94.6% time spent in _compute_batch() (21 calls, one per mini-batch), of which
    • 21% (of the 94.6%) time spent in _layer_csr_adj() (4 calls per call to _collect_layers_crops())
    • the rest is at the root of _compute_batch() (i.e. not in nested functions)

Hottest lines:

batching.py:20(_layer_csr_adj)
batching.py:30(_layer_csr_adj)
batching.py:37(_layer_csr_adj)
batching.py:84(_compute_batch)
batching.py:95(_compute_batch)
batching.py:179(jumpy_walks)

And inside _compute_batch() (loop unrolled): [profiler screenshot of _compute_batch()'s hottest lines]

Finally, note the slightly different numbers on keon (my machine):

  • 6.3% time spent in jumpy_walks()
  • 92% time spent in _compute_batch() (21 calls, one per mini-batch), of which
    • 49% (of the 92%) time spent in _layer_csr_adj() (4 calls per call to _collect_layers_crops())
    • the rest is at the root of _compute_batch() (i.e. not in nested functions)

Steps to improve:

  • Create tests for batching.py to make sure nothing breaks
  • Use a sparse adjacency matrix across all the functions (see the sketch after this list)
  • Profile again
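
As a sketch of what the sparse version of the adjacency cropping could look like (scipy.sparse; crop_adj() is a hypothetical name, not the current batching.py API):

import numpy as np
import scipy.sparse as sp

def crop_adj(adj_csr, rows):
    # Row-slicing a CSR matrix only copies the selected rows' index and
    # data arrays, instead of touching a full dense matrix.
    crop = adj_csr[rows, :]
    # The union of the cropped rows' column indices gives the neighbours
    # needed at the next layer.
    needed = np.unique(crop.indices)
    return crop, needed

adj = sp.random(10312, 10312, density=0.006, format="csr")  # BlogCatalog-sized
crop, needed = crop_adj(adj, rows=[3, 14, 159])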

Train BlogCatalog without node features

That is, train with an identity matrix as features. Maybe extend that to training with the BlogCatalog node features concatenated with an identity matrix, to make sure the model has enough information to separate the nodes based on features.

This tests if the bad adjacency reconstruction we're seeing in #31 comes from not being able to separate nodes based on their features.
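
A scipy.sparse sketch of both feature variants described above (the feature matrix here is a random placeholder, not the real BlogCatalog features):

import scipy.sparse as sp

n = 10312  # number of BlogCatalog nodes

# Variant 1: no node features at all, i.e. one-hot identity features.
X_identity = sp.identity(n, format="csr")

# Variant 2: real features with an identity matrix concatenated, so that
# the model can always separate nodes even if the features alone cannot.
features = sp.random(n, 39, density=0.1, format="csr")  # placeholder
X_augmented = sp.hstack([features, sp.identity(n)], format="csr")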

Support mini-batch training in VAE

Note that a mini-batch with convolutions needs access to all the neighbouring nodes that its convolutions will touch, on top of the nodes in the mini-batch for which we compute an update.
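
A pure-Python sketch of the bookkeeping this implies (names are ours; adjacency given as a dict of neighbour sets): the set of nodes whose activations must be computed grows by one ring of neighbours per convolution layer:

def required_nodes(batch, neighbours, n_layers):
    # Nodes whose values are needed to compute n_layers convolutions for
    # the nodes in `batch`: the batch plus its n_layers-hop neighbourhood.
    needed = set(batch)
    frontier = set(batch)
    for _ in range(n_layers):
        frontier = {v for u in frontier for v in neighbours[u]} - needed
        needed |= frontier
    return needed

neighbours = {0: {1, 2}, 1: {0, 3}, 2: {0}, 3: {1}}
print(required_nodes({0}, neighbours, n_layers=2))  # {0, 1, 2, 3}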

Test issue

This is a test issue to see if it shows up in Slack.

Full batch sensitivity analysis (adjacency reconstruction, feature reconstruction, and both)

Test the following parameters (a sweep sketch follows the lists below):

  • node features: exact network communities, noisy network communities, noisy network communities with redundancy, random
  • dim_ξ: [2, ..., 10]
  • dim_l1: [dim_ξ, ..., 16]
  • n_ξ_samples: [1, 5, 10]
  • bias in all layers: True/False

For each run, save:

  • the original and final-predicted adjacency and/or features
  • the evolution of the training losses
  • ξ distribution plots before and after training (for dim_ξ > 2, use a dimension-reduction technique such as t-SNE)
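
The sweep itself is a straightforward product over those ranges; a hypothetical sketch (run_and_save() stands in for the actual training-and-saving code):

import itertools

def run_and_save(features, dim_xi, dim_l1, n_xi_samples, bias):
    # Hypothetical stand-in for training a model and saving the outputs
    # listed above.
    print(features, dim_xi, dim_l1, n_xi_samples, bias)

feature_types = ["exact", "noisy", "noisy-redundant", "random"]
for features, dim_xi, n_xi_samples, bias in itertools.product(
        feature_types, range(2, 11), [1, 5, 10], [True, False]):
    for dim_l1 in range(dim_xi, 17):  # dim_l1 ranges from dim_xi to 16
        run_and_save(features, dim_xi, dim_l1, n_xi_samples, bias)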

Decide which adjacency reconstruction normalisation to use

This is the follow-up to point 2 of question 1 here.

In short, when using the straightforward cross-entropy value for adjacency reconstruction, we give equal weight to neighbours and non-neighbours in the prediction accuracy. Ideally, we'd like to predict, for each node, the set of neighbours and the set of non-neighbours, without penalising our prediction loss by the fact that there are (usually) many more non-neighbours than neighbours.

The current solution (in #10, soon to be merged) is to scale the items in the KL loss so that, for a given node, its neighbours contribute 1/2 on average, and its non-neighbours contribute 1/2 on average. We then average over all nodes to get a loss that contributes 1, but weighs neighbours and non-neighbours equally.

Another way would be to upscale true positives and true negatives, and downscale false positives and false negatives.
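
A NumPy sketch of the current reweighting (A is the binary adjacency matrix, P the predicted edge probabilities; we assume no isolated nodes):

import numpy as np

def balanced_adj_loss(A, P, eps=1e-8):
    n = A.shape[0]
    d = A.sum(axis=1, keepdims=True)  # degrees (assumed > 0)
    # Per-entry weights: for each node, the neighbours' weights sum to 1/2
    # and the non-neighbours' weights sum to 1/2.
    w = A / (2 * d) + (1 - A) / (2 * (n - d))
    cross_entropy = -(A * np.log(P + eps) + (1 - A) * np.log(1 - P + eps))
    # Each node's weights sum to 1; averaging over nodes keeps the loss at
    # a comparable scale while weighing both classes equally.
    return (w * cross_entropy).sum(axis=1).mean()

A = np.array([[0., 1.], [1., 0.]])
P = np.full((2, 2), 0.5)
print(balanced_adj_loss(A, P))  # ≈ 0.69, i.e. -log(1/2)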

Mini-batch sampling strategy

From question 3 here:

In the case where we have a sparse network, mini-batching with randomly selected nodes will almost always select nodes that are not directly connected, such that the adjacency matrix restricted to the mini-batch will almost always be the identity matrix. So training on this is bound to be very bad.

The best option now is probably to randomly sample a small number s of seeds, then make a random walk of length l around each of those seeds to also have nodes connected between each other, such that the mini-batch size is s * l.
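
A pure-Python sketch of that strategy (neighbours as a dict of lists; we assume every node has at least one neighbour):

import random

def walk_minibatch(neighbours, s, l, seed=0):
    # Sample s seeds, then take a random walk of length l from each, so
    # the mini-batch (size <= s * l) contains nodes connected to one another.
    rng = random.Random(seed)
    batch = set()
    for node in rng.sample(list(neighbours), s):
        for _ in range(l):
            batch.add(node)
            node = rng.choice(neighbours[node])
    return batch

neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(walk_minibatch(neighbours, s=2, l=3))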

Scale up the VAE tests

Blocked on #2 and #3.

With synthetic data:

  • increase network size
  • increase the number of features

For real data we can use a subset of the Twitter data, but better than that (at least to start with), the Colorado Index of Complex Networks has lots of good small data sets, e.g.:

  • Malaria var DBLa HVR networks
  • French financial elites (1990)
  • BlogCatalog, Flickr, YouTube (2009)

Reproduce Cora embedding from VGAE paper by Kipf & Welling 2016

Four things, each to be visualised using dimensionality reduction and projection (the paper doesn't specify the method used to produce their 2D figure from 16D embeddings):

  • run their gae implementation directly to reproduce their own procedure
  • run our implementation with their parameters:
    • no u parameter for embedding covariance matrix (remove the layer, and set u to 0 in the codec)
    • simple scalar product decoder
    • intermediate layer 32 dims, embedding layer 16 dims
    • train for 200 iterations
    • get the loading code from their gae implementation
  • reactivate kernel in scalar product
  • add the 2-layer decoder we've been using
  • add the u term in the embedding covariance matrix

Note that the current implementation mimics their gae more closely than what we'll have after implementing #7, so do this beforehand.

Possible model improvements / optimisations

None of these are necessary, I'm just listing them to keep track of all the possible small-ish changes that might improve performance or reduce the need for resources:

  • Normalizations
  • Connections
  • Reduction of necessary parameters and resources
  • Noise for training
  • Removing unneeded left-overs
  • Different models

Decide on a license

  • What licenses are authorised by Inria?
  • Who holds the copyright to this software?
  • What licenses are compatible with what we ship?

Use non-recursive importance sampling for mini-batches on dense graphs?

Problem

In the current mini-batch approach being worked on in #10, we sample the nodes that will be part of the mini-batch, then sample their neighbourhoods to define which nodes will be used in the last layer's convolution. We then walk back through the convolution layers, sampling, for each convolution, the neighbourhoods of the nodes that are needed at the layer after the current one. Therefore:

  • if our mini-batch computes a loss for B nodes,
  • and we sample M nodes at each layer,
  • and the network architecture is L convolutions deep,

in the worst case we need B * M^L nodes to compute that mini-batch, which can be a lot as L grows (e.g. with B = 500, M = 30 and L = 2, that is already up to 450,000 nodes).

Available solution

Chen, Ma, & Xiao, 2018, "FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling" show that for dense graphs it may not be necessary to sample recursively this way. At any given convolution layer, they estimate a node u's value by sampling over the whole graph proportionally to ||A(u, :)||² (where A is the (normalised?) adjacency matrix), irrespective of the nodes used in the preceding and following layers (this is the importance-sampling part). With the notations above, the worst case then uses B * M * L nodes to compute a given mini-batch. Much better. And the authors show that this method achieves prediction accuracy equivalent to GraphSAGE on several well-tested data sets.
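
A NumPy sketch of that sampling distribution (A here is a dense stand-in for the normalised adjacency matrix; names are ours):

import numpy as np

def fastgcn_layer_sample(A, M, seed=0):
    # Importance-sampling distribution over all nodes: q(u) ∝ ||A(u, :)||²,
    # independent of which nodes were sampled at the other layers.
    q = (A ** 2).sum(axis=1)
    q /= q.sum()
    nodes = np.random.default_rng(seed).choice(A.shape[0], size=M, p=q)
    # The sampled probabilities are returned too: they reweight each node's
    # contribution in the importance-sampling estimate.
    return nodes, q[nodes]

A = np.random.default_rng(1).random((100, 100))  # stand-in adjacency
nodes, probs = fastgcn_layer_sample(A, M=10)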

Caveat

This supposes that an importance sampling on the whole graph can provide enough nodes to estimate any node's value at a convolution layer. In other words, this works if the importance sampling provides enough nodes in the neighbourhood of any node in the graph.

Hamilton notes this in the review linked above:

The idea of sampling vertices at each layer makes sense in small graphs or dense graphs. However, I think it might become problematic in massive, sparse graphs. For example, in industry applications with 1+ billion nodes (or sparse graphs with many disconnected components), I would think the odds of a node in a batch having a neighbor in the sampled set will get quite small, unless the sampled set grows proportionally large in size.

Resulting question

The question then becomes: how do we know when this can be useful, i.e. when our graph is dense enough to warrant this sampling method?

Possible answers I see:

  • define a threshold on graph sparsity, and see if we need this for the Twitter data
  • discard this as a premature optimisation concern, or an engineering-level improvement that we don't wish to spend time on
