
an2vec's Introduction

Attributed Node to Vec

Bring those two things together! Do cool context- and language-aware stuff! Okay, let's fill this in a little more precisely now.

Setup

This project is written mostly in Julia, with left-over Python in scripts and, more so, in notebooks.

For Julia: we're running 1.1, but you can get the latest version. To install the required dependencies on your laptop, run julia environment.jl. On a headless server, or a machine on which you don't have the dependencies necessary for Makie.jl (notably GLFW), you can instead run julia environment.headless.jl to use Makie's Cairo backend.

For Python: using Anaconda, set up the environment with conda env create -f environment.lock.yml (this builds the Rust extensions in this package, and installs them locally).

Then activate the Anaconda environment, and run python -m julia.sysimage julia/sys-$(hostname).so to build a custom Julia system image, making sure that calls from Python to Julia will work.

Finally, still with the environment activated, run ./setup-datasets.sh to set up the test datasets in the datasets/ folder.

Updating and managing the Python environment

environment.lock.yml contains the full list of packages and their versions, used to duplicate the environment we use in development. To upgrade the packages in that environment, use environment.yml instead: it only includes the top-level required packages, without versions, so conda will resolve to updated package versions.

environment.lock-gpu.yml is a copy of environment.lock.yml that uses GPU-enabled TensorFlow.

Managing large data files

This repository uses git-annex to manage large data files, which can be a bit complicated. To use it, first make sure you have git-annex installed. If you don't, and you don't have root access (e.g. in an HPC environment), git-annex is available through Anaconda, so you can install it inside a dedicated Anaconda environment.

Then, from inside your copy of this repository, you tell git that you're using git-annex:

git annex init
git annex enableremote warehouse-rsync

(Note: the warehouse-rsync remote is a so-called "special" annex remote tracked in git, and was created using git annex initremote warehouse-rsync type=rsync rsyncurl=blondy.lip.ens-lyon.fr:/warehouse/COMPLEXNET/nw2vec/annex encryption=none chunk=10Mib. You don't need to do that again.)

Now get a copy of all the large files from the warehouse-rsync remote:

git annex get .

Now whenever you want to add files to the annex:

# Commit hashes of the large files
git annex add <the files you want to add>
git commit
# Actually copy your large files to `warehouse-rsync`:
git annex copy --to warehouse-rsync <the files you added>
# Publish your changes upstream (the `--all` makes sure the `git-annex` branch is also pushed)
git push --all

If the push operation fails because of a conflict on the git-annex branch, you need to merge the remote's version of the git-annex branch into yours. You can do that without having to check out that branch:

git pull
git fetch origin git-annex:git-annex  # Fast-forward your local git-annex branch with what was just pulled from the remote

then git pushing again should work. Note that since git-annex tracks which computer has which copy of which file, whenever someone downloads a large file from the annex, that download is recorded and the git-annex branch is updated. So this pull/fetch dance may be necessary quite often as more people and computers start working on the repository.

Analyses

You can have a look at the wiki. The projects/ folder contains all the notebooks and scripts for the repo's projects, and is probably the most interesting thing to look at. The julia/ folder contains the current implementation of AN2VEC. The data/ folder contains any data produced by the analyses in julia/ and projects/.

For now, we don't use the Twitter data in the datasets/ folder, so you can ignore the sections below and jump directly to the wiki or the projects after installing the environment. If, however, you want to play with the Twitter data, read on.

Datasets folders

After running ./setup-datasets.sh, the datasets/ folder contains pointers to all the data we process and the outputs of processing:

datasets
├── karate                   # Test networks
├── malariaDBLaNetworks2013  #
├── BlogCatalog-dataset      #
├── Flickr-dataset           #
├── YouTube-dataset          #
├── sosweet-raw              # Raw Twitter data
│     -> /warehouse/COMPLEXNET/TWITTER/data
├── sosweet-raw-retweets     # Raw Twitter data for the time range with retweets
│     -> /warehouse/COMPLEXNET/nw2vec/sosweet-raw-retweets
├── sosweet-network          # Files encoding the network(s) of users and its embedding(s)
│     -> /warehouse/COMPLEXNET/nw2vec/sosweet-network
├── sosweet-text             # Files encoding the text of tweets
│     -> /warehouse/COMPLEXNET/nw2vec/sosweet-text
└── sosweet-w2v              # Files encoding the word2vec embeddings of words from tweets
      -> /warehouse/COMPLEXNET/nw2vec/sosweet-w2v

To set these up, and assuming you have access to the raw Twitter data in datasets/sosweet-raw:

  • Compute the word2vec embeddings: TODO: document
  • Extract the networks of users: TODO: document
  • Compute the network embeddings by running the projects/scratchpads/sosweet-node2vec.ipynb notebook
  • Extract the raw text of tweets:
sosweet-scripts/iter-gz \
    --no-gzip \
    sosweet-scripts/pipe-user_timestamp_body-csv \
    datasets/sosweet-text \
    $(ls datasets/sosweet-raw/2016*.tgz datasets/sosweet-raw/2017*.tgz)
  • Extract the user ids for the 5-mutual-mention network:
cat datasets/sosweet-network/undir_weighted_mention_network_thresh_5.txt \
    | sosweet-scripts/pipe-unique_users \
    > datasets/sosweet-network/undir_weighted_mention_network_thresh_5.users.txt
  • Filter the raw text of tweets for only those users:
# Create the destination folder
mkdir -p datasets/sosweet-text/undir_weighted_mention_network_thresh_5
# Run the filter
sosweet-scripts/iter-gz \
    --no-gzip \
    "sosweet-scripts/pipe-filter_users datasets/sosweet-network/undir_weighted_mention_network_thresh_5.users.txt" \
    datasets/sosweet-text/undir_weighted_mention_network_thresh_5 \
    $(ls datasets/sosweet-text/*-csv)

License

This code is under Inria's copyright, and licensed under GPLv3+ in agreement with them.


an2vec's Issues

Left-over todos

  • We updated the non-gpu environment in e0b89bc, so once #29 and #31 have finished running we should update the gpu environment
  • Also after #29 has finished running, rename the output files so they include the feature noise scale (as introduced by beba46b)
  • finish Git-LFS history rewrite for all branches
  • recreate a PR to replace #12 which got closed by the history rewrite
  • remove issue-3-mini-batching-with-features if possible
  • remove or check history rewrite of issue-6-sensitivity-analysis
  • find out git-lfs quota on gitlab.com and on Inria's gitlab, as we are reaching the quota here

Rate trade-off between feature and network structure

In representing the network at the level of ξ, both network structure and feature structure can count, and they might not agree. Some networks, e.g. in #8, might be better described by giving more weight to adjacency reconstruction (p_adj_loss), and some by giving more weight to feature reconstruction (p_v_loss).

  • One way to explore this is to simply weigh α × p_adj_loss + β × p_v_loss with α + β = 1, and explore the output for different values of α and β (see the sketch after this list). But for comparisons to be made, we need a quality measure other than the total loss (which we affect when changing α and β).
  • Another way is to directly train α and β (use them as trainable parameters in the learning process), and see if different synthetic data sets give different best values.
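
As a rough sketch of the first option (plain NumPy; the function name and the example loss values here are illustrative, not part of the codebase), the constraint α + β = 1 can be enforced by construction by parametrising α through a sigmoid of an unconstrained parameter θ, which is also what would make the weighting trainable as in the second option:

import numpy as np

def combined_loss(p_adj_loss, p_v_loss, theta):
    # alpha = sigmoid(theta) lies in (0, 1) and beta = 1 - alpha, so
    # alpha + beta = 1 holds by construction; theta is unconstrained,
    # hence directly usable as a trainable parameter.
    alpha = 1.0 / (1.0 + np.exp(-theta))
    return alpha * p_adj_loss + (1.0 - alpha) * p_v_loss

# Exploring the trade-off over a few fixed values of theta:
for theta in [-2.0, 0.0, 2.0]:
    print(theta, combined_loss(p_adj_loss=1.3, p_v_loss=0.7, theta=theta))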

Test various fixes to BlogCatalog training

Run the training for only 2000 epochs, as most of the final quality seems to be reached by then.

  • Use the Bernoulli decoder for features
  • Use changes from #38
  • Remove u from embedding parameters

How much does the trained ξ improve on the untrained ξ?

When the features correlate a little with the network structure, even random weights seem to separate the distribution of ξ (i.e. the representation layer in the VAE) into communities. This is, as Thomas Kipf explains in his post on GCNs, because "a GCN model [can be interpreted] as a generalized, differentiable version of the well-known Weisfeiler-Lehman algorithm on graphs". It also makes sense since you basically average features from community neighbours, thereby recovering any local average that is consistent with the network structure.

The upshot is that even without training, in some cases the distribution of ξ seems to reflect network structure. So what's the use of training, apart from getting a generative model (i.e. training the layers that come after the ξ layer)?
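
As a minimal NumPy sketch of that untrained case (this is the standard GCN propagation rule with random weights, not the actual AN2VEC code; A is a binary adjacency matrix, X a feature matrix):

import numpy as np

def untrained_gcn_layer(A, X, dim_out, seed=0):
    # Symmetrically normalised propagation D^{-1/2} (A + I) D^{-1/2} X W,
    # with W random, i.e. no training at all.
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    W = np.random.default_rng(seed).normal(size=(X.shape[1], dim_out))
    return np.tanh(d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W)

Each output row is a normalised local average of X pushed through a random projection, so nodes with similar neighbourhood features land close to each other before any training happens.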

Sub-questions:

  • How does such a random ξ behave with non-random features that do not correlate with the network structure (e.g. feature communities have lower, higher, or displaced granularity compared to network communities): does one structure win over the other, or does it simply mess up the picture? The answer probably lies in criticisms of label-propagation algorithms for community detection (e.g. Raghavan, Albert, & Kumara, 2007, "Near linear time algorithm to detect community structures in large-scale networks").
  • What is a "good" ξ distribution (i.e. a good representation model), independently of its use in the generative model? One that visually reproduces the feature structure, or the network structure, or both? Can we build a measure of how good a given ξ is? Maybe there are answers to this in the pro-and-con arguments about why t-SNE is a good visualisation method (see e.g. How to Use t-SNE Effectively).

Minibatch effect and strategies

Go back to simple manual exploration of the effect of the current minibatch strategy on a small (200-500 nodes) network.

Try out other sampling strategies. First for the case where we know the graph generation parameters. Then for the case where we don't know the graph model.

Mini-batching optimisation

In #8 we're applying the current code to the BlogCatalog dataset (10,312 nodes, 333,983 edges), and it is SLOW: about 20s per mini-batch, so about 7 minutes per epoch. Some profiling insights:

On grunch, fit_generator_feed() using n_epochs=1, batch_size=500, max_walk_length=200 and neighbour_samples=30 spends > 52% of its time in the batch-generation code.

On grunch again, same parameters, looking only at the batch-generation code:

  • 4.4% time spent in jumpy_walks()
  • 94.6% time spent in _compute_batch() (21 calls, one per mini-batch), of which
    • 21% (of the 94.6%) time spent in _layer_csr_adj() (4 calls per call to _collect_layers_crops())
    • the rest is at the root of _compute_batch() (i.e. not in nested functions)

Hottest lines:

batching.py:20(_layer_csr_adj)
batching.py:30(_layer_csr_adj)
batching.py:37(_layer_csr_adj)
batching.py:84(_compute_batch)
batching.py:95(_compute_batch)
batching.py:179(jumpy_walks)

And inside _compute_batch() (loop unrolled): [profiler screenshot of _compute_batch()'s hottest lines]

Finally, note the slightly different numbers on keon (my machine):

  • 6.3% time spent in jumpy_walks()
  • 92% time spent in _compute_batch() (21 calls, one per mini-batch), of which
    • 49% (of the 92%) time spent in _layer_csr_adj() (4 calls per call to _collect_layers_crops())
    • the rest is at the root of _compute_batch() (i.e. not in nested functions)

Steps to improve:

  • Create tests for batching.py to make sure nothing breaks
  • Use a sparse adjacency matrix across all the functions (see the sketch after this list)
  • Profile again
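
As a sketch of what the sparse version of the adjacency cropping could look like (scipy.sparse; crop_adj() is a hypothetical name, not the current batching.py API):

import numpy as np
import scipy.sparse as sp

def crop_adj(adj_csr, rows):
    # Row-slicing a CSR matrix only copies the selected rows' index and
    # data arrays, instead of touching a full dense matrix.
    crop = adj_csr[rows, :]
    # The union of the cropped rows' column indices gives the neighbours
    # needed at the next layer.
    needed = np.unique(crop.indices)
    return crop, needed

adj = sp.random(10312, 10312, density=0.006, format="csr")  # BlogCatalog-sized
crop, needed = crop_adj(adj, rows=[3, 14, 159])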

Train BlogCatalog without node features

That is, train with an identity matrix as features. Maybe extend that to training with the BlogCatalog node features concatenated with an identity matrix, to make sure the model has enough information to separate the nodes based on features.

This tests if the bad adjacency reconstruction we're seeing in #31 comes from not being able to separate nodes based on their features.
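
A scipy.sparse sketch of both feature variants described above (the feature matrix here is a random placeholder, not the real BlogCatalog features):

import scipy.sparse as sp

n = 10312  # number of BlogCatalog nodes

# Variant 1: no node features at all, i.e. one-hot identity features.
X_identity = sp.identity(n, format="csr")

# Variant 2: real features with an identity matrix concatenated, so that
# the model can always separate nodes even if the features alone cannot.
features = sp.random(n, 39, density=0.1, format="csr")  # placeholder
X_augmented = sp.hstack([features, sp.identity(n)], format="csr")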

Support mini-batch training in VAE

Note that a mini-batch with convolutions needs access to all the neighbouring nodes that its convolutions will touch, on top of the nodes in the mini-batch for which we compute an update.
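
A pure-Python sketch of the bookkeeping this implies (names are ours; adjacency given as a dict of neighbour sets): the set of nodes whose activations must be computed grows by one ring of neighbours per convolution layer:

def required_nodes(batch, neighbours, n_layers):
    # Nodes whose values are needed to compute n_layers convolutions for
    # the nodes in `batch`: the batch plus its n_layers-hop neighbourhood.
    needed = set(batch)
    frontier = set(batch)
    for _ in range(n_layers):
        frontier = {v for u in frontier for v in neighbours[u]} - needed
        needed |= frontier
    return needed

neighbours = {0: {1, 2}, 1: {0, 3}, 2: {0}, 3: {1}}
print(required_nodes({0}, neighbours, n_layers=2))  # {0, 1, 2, 3}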

Test issue

This is a test issue to see if it shows up in Slack.

Full batch sensitivity analysis (adjacency reconstruction, feature reconstruction, and both)

Test the following parameters (a sweep sketch follows the lists below):

  • node features: exact network communities, noisy network communities, noisy network communities with redundancy, random
  • dim_ξ: [2, ..., 10]
  • dim_l1: [dim_ξ, ..., 16]
  • n_ξ_samples: [1, 5, 10]
  • bias in all layers: True/False

For each run, save:

  • the original and final-predicted adjacency and/or features
  • the evolution of the training losses
  • ξ distribution plots before and after training (for dim_ξ > 2, use a dimension-reduction technique such as t-SNE)
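
The sweep itself is a straightforward product over those ranges; a hypothetical sketch (run_and_save() stands in for the actual training-and-saving code):

import itertools

def run_and_save(features, dim_xi, dim_l1, n_xi_samples, bias):
    # Hypothetical stand-in for training a model and saving the outputs
    # listed above.
    print(features, dim_xi, dim_l1, n_xi_samples, bias)

feature_types = ["exact", "noisy", "noisy-redundant", "random"]
for features, dim_xi, n_xi_samples, bias in itertools.product(
        feature_types, range(2, 11), [1, 5, 10], [True, False]):
    for dim_l1 in range(dim_xi, 17):  # dim_l1 ranges from dim_xi to 16
        run_and_save(features, dim_xi, dim_l1, n_xi_samples, bias)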

Decide which adjacency reconstruction normalisation to use

This is the follow-up to point 2 of question 1 here.

In short, when using the straightforward cross-entropy value for adjacency reconstruction, we give equal weight to neighbours and non-neighbours in the prediction accuracy. Ideally, we'd like to predict, for each node, the set of neighbours and the set of non-neighbours, without penalising our prediction loss by the fact that there are (usually) many more non-neighbours than neighbours.

The current solution (in #10, soon to be merged) is to scale the items in the KL loss so that, for a given node, its neighbours contribute 1/2 on average, and its non-neighbours contribute 1/2 on average. We then average over all nodes to get a loss that contributes 1, but weighs neighbours and non-neighbours equally.

Another way would be to upscale true positives and true negatives, and downscale false positives and false negatives.
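
A NumPy sketch of the current reweighting (A is the binary adjacency matrix, P the predicted edge probabilities; we assume no isolated nodes):

import numpy as np

def balanced_adj_loss(A, P, eps=1e-8):
    n = A.shape[0]
    d = A.sum(axis=1, keepdims=True)  # degrees (assumed > 0)
    # Per-entry weights: for each node, the neighbours' weights sum to 1/2
    # and the non-neighbours' weights sum to 1/2.
    w = A / (2 * d) + (1 - A) / (2 * (n - d))
    cross_entropy = -(A * np.log(P + eps) + (1 - A) * np.log(1 - P + eps))
    # Each node's weights sum to 1; averaging over nodes keeps the loss at
    # a comparable scale while weighing both classes equally.
    return (w * cross_entropy).sum(axis=1).mean()

A = np.array([[0., 1.], [1., 0.]])
P = np.full((2, 2), 0.5)
print(balanced_adj_loss(A, P))  # ≈ 0.69, i.e. -log(1/2)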

Mini-batch sampling strategy

From question 3 here:

In the case where we have a sparse network, mini-batching with randomly selected nodes will almost always select nodes that are not directly connected, such that the adjacency matrix restricted to the mini-batch will almost always be the identity matrix. So training on this is bound to be very bad.

The best option now is probably to randomly sample a small number s of seeds, then make a random walk of length l around each of those seeds to also have nodes connected between each other, such that the mini-batch size is s * l.
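
A pure-Python sketch of that strategy (neighbours as a dict of lists; we assume every node has at least one neighbour):

import random

def walk_minibatch(neighbours, s, l, seed=0):
    # Sample s seeds, then take a random walk of length l from each, so
    # the mini-batch (size <= s * l) contains nodes connected to one another.
    rng = random.Random(seed)
    batch = set()
    for node in rng.sample(list(neighbours), s):
        for _ in range(l):
            batch.add(node)
            node = rng.choice(neighbours[node])
    return batch

neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(walk_minibatch(neighbours, s=2, l=3))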

Scale up the VAE tests

Blocked on #2 and #3.

With synthetic data:

  • increase network size
  • increase the number of features

For real data we can use a subset of the Twitter data, but better than that (at least to start with), the Colorado Index of Complex Networks has lots of good small data sets, e.g.:

  • Malaria var DBLa HVR networks
  • French financial elites (1990)
  • BlogCatalog, Flickr, YouTube (2009)

Reproduce Cora embedding from VGAE paper by Kipf & Welling 2016

Four things, each to be visualised using dimensionality reduction and projection (the paper doesn't specify the method used to produce their 2D figure from 16D embeddings):

  • run their gae implementation directly to reproduce their own procedure
  • run our implementation with their parameters:
    • no u parameter for embedding covariance matrix (remove the layer, and set u to 0 in the codec)
    • simple scalar product decoder
    • intermediate layer 32 dims, embedding layer 16 dims
    • train for 200 iterations
    • get the loading code from their gae implementation
  • reactivate kernel in scalar product
  • add the 2-layer decoder we've been using
  • add the u term in the embedding covariance matrix

Note that the current implementation mimics their gae more closely than what we'll have after implementing #7, so do this beforehand.

Possible model improvements / optimisations

None of these are necessary, I'm just listing them to keep track of all the possible small-ish changes that might improve performance or reduce the need for resources:

  • Normalizations
  • Connections
  • Reduction of necessary parameters and resources
  • Noise for training
  • Removing unneeded left-overs
  • Different models

Decide on a license

  • What licenses are authorised by Inria?
  • Who holds the copyright to this software?
  • What licenses are compatible with what we ship?

Use non-recursive importance sampling for mini-batches on dense graphs?

Problem

In the current mini-batch approach being worked on in #10, we sample the nodes that will be part of the mini-batch, then sample their neighbourhoods to define which nodes will be used in the last layer's convolution. We then walk back through the convolution layers, sampling, for each convolution, the neighbourhoods of the nodes that are needed at the layer after the current one. Therefore:

  • if our mini-batch computes a loss for B nodes,
  • and we sample M nodes at each layer,
  • and the network architecture is L convolutions deep,

in the worst case we need B * M^L nodes to compute that mini-batch, which can be a lot as L grows (e.g. with B = 500, M = 30 and L = 2, that is already up to 450,000 nodes).

Available solution

Chen, Ma, & Xiao, 2018, "FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling" show that for dense graphs it may not be necessary to sample recursively this way. At any given convolution layer, they estimate a node u's value by sampling over the whole graph proportionally to ||A(u, :)||² (where A is the (normalised?) adjacency matrix), irrespective of the nodes used in the preceding and following layers (this is the importance-sampling part). With the notations above, the worst case then uses B * M * L nodes to compute a given mini-batch. Much better. And the authors show that this method achieves prediction accuracy equivalent to GraphSAGE on several well-tested data sets.
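
A NumPy sketch of that sampling distribution (A here is a dense stand-in for the normalised adjacency matrix; names are ours):

import numpy as np

def fastgcn_layer_sample(A, M, seed=0):
    # Importance-sampling distribution over all nodes: q(u) ∝ ||A(u, :)||²,
    # independent of which nodes were sampled at the other layers.
    q = (A ** 2).sum(axis=1)
    q /= q.sum()
    nodes = np.random.default_rng(seed).choice(A.shape[0], size=M, p=q)
    # The sampled probabilities are returned too: they reweight each node's
    # contribution in the importance-sampling estimate.
    return nodes, q[nodes]

A = np.random.default_rng(1).random((100, 100))  # stand-in adjacency
nodes, probs = fastgcn_layer_sample(A, M=10)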

Caveat

This supposes that an importance sampling on the whole graph can provide enough nodes to estimate any node's value at a convolution layer. In other words, this works if the importance sampling provides enough nodes in the neighbourhood of any node in the graph.

Hamilton notes this in the review linked above:

The idea of sampling vertices at each layer makes sense in small graphs or dense graphs. However, I think it might become problematic in massive, sparse graphs. For example, in industry applications with 1+ billion nodes (or sparse graphs with many disconnected components), I would think the odds of a node in a batch having a neighbor in the sampled set will get quite small, unless the sampled set grows proportionally large in size.

Resulting question

The question then becomes: how do we know when this can be useful, i.e. when our graph is dense enough to warrant this sampling method?

Possible answers I see:

  • define a threshold on graph sparsity, and see if we need this for the Twitter data
  • discard this as a premature optimisation concern, or an engineering-level improvement that we don't wish to spend time on
