I just added a --seed option to the CLI, and
import random
random.seed(args.seed)
at the top of the communities command. Here's what I get for seed 1 to 4:
$ clics -t 3 -s 1 -f families communities
INFO loaded graph
INFO starting infomap
INFO converted graph...
INFO finished infomap
INFO computed cluster names
$ clics -t 3 -g infomap -f families graph-stats
----------- ----
nodes       1534
edges       2630
components  95
communities 247
----------- ----
$ clics -t 3 -s 2 -f families communities
INFO loaded graph
INFO starting infomap
INFO converted graph...
INFO finished infomap
INFO computed cluster names
$ clics -t 3 -g infomap -f families graph-stats
----------- ----
nodes       1534
edges       2634
components  96
communities 249
----------- ----
$ clics -t 3 -s 3 -f families communities
INFO loaded graph
INFO starting infomap
INFO converted graph...
INFO finished infomap
INFO computed cluster names
$ clics -t 3 -g infomap -f families graph-stats
----------- ----
nodes       1534
edges       2645
components  95
communities 249
----------- ----
$ clics -t 3 -s 4 -f families communities
INFO loaded graph
INFO starting infomap
INFO converted graph...
INFO finished infomap
INFO computed cluster names
$ clics -t 3 -g infomap -f families graph-stats
----------- ----
nodes       1534
edges       2637
components  96
communities 247
----------- ----
from clics2.
So I'd say this is rather a documentation issue, and we may want to think about adding this seed option to pyclics now.
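For reference, a minimal sketch of what such a seed option could look like. This is not the actual pyclics CLI: the parser, the function names build_parser and run_communities, and the None default are assumptions for illustration. Only the -s/--seed flag and the random.seed(args.seed) call are taken from the comment above.

```python
import argparse
import random


def build_parser():
    # Hypothetical parser; the real clics CLI has more options and subcommands.
    parser = argparse.ArgumentParser(prog="clics-sketch")
    parser.add_argument("-s", "--seed", type=int, default=None,
                        help="seed for Python's random module")
    return parser


def run_communities(argv):
    # Stand-in for the communities command: seed first, then do the random work.
    args = build_parser().parse_args(argv)
    random.seed(args.seed)
    # Placeholder for the clustering step: draw a few pseudo-random numbers.
    return [random.random() for _ in range(3)]
```

Calling run_communities(["-s", "1"]) twice returns the same list, while a different seed gives different numbers. With no seed given, random.seed(None) falls back to a system source and runs are not repeatable.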
from clics2.
Hah, probably hard to debug, but would be really interesting. I'm in the process of recreating clics2 sqlite, too, right now. What does pip freeze say in your virtualenv?
from clics2.
https://gist.github.com/chrzyki/13e11e8471791e6bd151b153ae294596
Just noticed: I installed pyclics from the clics2 repository - might that have something to do with it?
from clics2.
Hm, funny enough I'm getting
$ clics -t 3 -g infomap -f families graph-stats
----------- ----
nodes       1534
edges       2634
components  96
communities 248
----------- ----
from clics2.
As a first step towards debugging, I inserted a
print(len(edges), len(ignore_edges))
here (clics2/src/pyclics/commands.py, line 190 in 97b3121) to see whether we are removing different numbers of edges, or adding differently. My numbers are
50051 46467
from clics2.
Check!
50051 46467
from clics2.
Hi, the important point is that infomap is based on random walks, so different numbers of edges can happen. You need to seed the random function to capture this, but I'm not sure if igraph allows for this... So I'd say: use nx.connected_components to check for the same size, as this is the simplest clustering algorithm.
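A sketch of that check, written in plain Python so it runs without dependencies. With networkx the same numbers should come from sorted(len(c) for c in nx.connected_components(G)). The function name component_sizes and the toy edge list in the note below are made up for illustration.

```python
from collections import deque


def component_sizes(edges):
    """Return the sorted sizes of the connected components of an undirected graph."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = set()
    sizes = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from every node not yet assigned to a component.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        sizes.append(size)
    # Sorting makes the result independent of node and edge order.
    return sorted(sizes)
```

For the edges [("a", "b"), ("b", "c"), ("d", "e")] this gives [2, 3], whatever order the edges are listed in. If two installations report the same sizes here but different community counts, the difference lies in the clustering step and not in the graph.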
from clics2.
Ah. Ok, I was chasing down the wrong path! So the actual colexification network is created reproducibly, but the clustering isn't - in the absence of a fixed seed. So, (almost) nothing to see here, move on :)
Except, maybe, we want to find a seed that gives us the number of edges reported in the README :)
from clics2.
My bad, sorry, wasn't aware of the random walks for the clustering!
from clics2.
Hm. Turns out we already had
random.seed(123456)
numpy.random.seed(123456)
in src/pyclics/__main__.py. So then the question is where does infomap get its randomness from?
from clics2.
Stack Overflow seems to think what we do should be sufficient: https://stackoverflow.com/a/25726079
from clics2.
Ok, just ran a couple of tests: Running the community_infomap method multiple times immediately after setting the seed does indeed give identical clusters. But running the complete communities command multiple times - with the seed set again immediately before calling community_infomap - does not! So I guess something goes wrong in networkx2igraph - maybe we must iterate over the graph nodes, explicitly sorting them?
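To illustrate why a fixed seed alone may not be enough, here is a toy sketch: the same seed applied to the same nodes in a different order gives a different result. random.sample stands in for the random walk, and seeded_pick and the node labels are made up. Nothing here calls igraph or pyclics.

```python
import random


def seeded_pick(nodes, seed):
    # Seed immediately before the random step, as in the test described above.
    random.seed(seed)
    # Stand-in for a random walk: visit the nodes in a pseudo-random order.
    return random.sample(nodes, k=len(nodes))


nodes_run1 = ["hand", "arm", "foot", "leg"]
nodes_run2 = ["leg", "hand", "foot", "arm"]  # same nodes, different order

# Same seed, same input order: identical result.
same_order = seeded_pick(nodes_run1, 123456) == seeded_pick(nodes_run1, 123456)
# Same seed, different input order: the results differ.
different_order = seeded_pick(nodes_run1, 123456) == seeded_pick(nodes_run2, 123456)
```

Here same_order is True and different_order is False: the seed fixes the sequence of random choices, but those choices are applied to positions, so the input order has to be fixed as well.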
from clics2.
Ok, confirmed: The order in which vertices are added to igraph.Graph in pyclics.util.networkx2igraph varies across command calls.
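One possible way to pin the order down, as a sketch only: assign vertex ids in sorted label order before building the igraph graph. The helper to_index_graph is hypothetical, and the thread does not say whether networkx2igraph was actually changed this way. The varying order may come from iterating over unordered containers of strings, but that is an assumption, not something confirmed above.

```python
def to_index_graph(nodes, edges):
    """Map node labels to integer ids in sorted order and return (n, edge list)."""
    # Sorting the labels makes the id assignment independent of iteration order.
    index = {node: i for i, node in enumerate(sorted(nodes))}
    # Normalise each undirected edge and sort the list so edge order is stable too.
    edge_ids = sorted(
        tuple(sorted((index[a], index[b]))) for a, b in edges
    )
    return len(index), edge_ids
```

For the nodes {"hand", "arm", "foot"} and the edges ("hand", "arm") and ("foot", "arm") this returns (3, [(0, 1), (0, 2)]) regardless of input order. Assuming python-igraph is installed, the result could then be passed on as igraph.Graph(n=n, edges=edge_ids).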
from clics2.
That's a good find! I was dumbfounded by this ...
from clics2.