rglab / Rtsne.multicore
R wrapper for Multicore t-SNE
License: Other
Hi,
The documentation suggests that reproducible results can be achieved by setting the seed in R. This works for the original Rtsne function, but it doesn't seem to work for Rtsne.multicore.
library(Rtsne.multicore)
iris_unique <- unique(iris)
mat <- as.matrix(iris_unique[,1:4])
# repeat calculation
set.seed(42)
tsne_out1 <- Rtsne.multicore(mat)
set.seed(42)
tsne_out2 <- Rtsne.multicore(mat)
# plot results
plot(tsne_out1$Y, col=iris_unique$Species, main="first run")
plot(tsne_out2$Y, col=iris_unique$Species, main="second run")
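A quick way to confirm numerically that the two runs differ (using the objects created above; this is just a sanity check, not part of the original report):

```r
# With a working seed these two embeddings would be identical;
# with Rtsne.multicore the comparison fails.
identical(tsne_out1$Y, tsne_out2$Y)
max(abs(tsne_out1$Y - tsne_out2$Y))  # how far apart the two runs are
```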
> devtools::install_github("RGLab/Rtsne.multicore")
Downloading GitHub repo RGLab/Rtsne.multicore@master
from URL https://api.github.com/repos/RGLab/Rtsne.multicore/zipball/master
Installing Rtsne.multicore
'/Users/jespinoz/anaconda/lib/R/bin/R' --no-site-file --no-environ --no-save \
--no-restore --quiet CMD INSTALL \
'/private/var/folders/6z/5vbtz_gmkr76ftgc3149dvtr0003c0/T/RtmpCDQkG7/devtoolsd9a499115dc/RGLab-Rtsne.multicore-6789e40' \
--library='/Users/jespinoz/anaconda/lib/R/library' --install-tests
* installing *source* package ‘Rtsne.multicore’ ...
** libs
clang++ -I/Users/jespinoz/anaconda/lib/R/include -DNDEBUG -DROUT -fopenmp -I/Users/jespinoz/anaconda/include -I"/Users/jespinoz/anaconda/lib/R/library/Rcpp/include" -fPIC -I/Users/jespinoz/anaconda/include -c RcppExports.cpp -o RcppExports.o
clang: error: unsupported option '-fopenmp'
make: *** [RcppExports.o] Error 1
ERROR: compilation failed for package ‘Rtsne.multicore’
* removing ‘/Users/jespinoz/anaconda/lib/R/library/Rtsne.multicore’
Installation failed: Command failed (1)
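For what it's worth, the `clang: error: unsupported option '-fopenmp'` message usually means the compiler is Apple's stock clang, which ships without OpenMP support. A common workaround (a sketch, assuming Homebrew; the paths are typical for Homebrew's LLVM but may differ on your machine) is to install LLVM's clang and point R at it via `~/.R/Makevars`:

```shell
# Install an OpenMP-capable clang (Homebrew's llvm bundles libomp)
brew install llvm

# Tell R to use it when compiling packages (adjust paths for your setup)
cat >> ~/.R/Makevars <<'EOF'
CC=/usr/local/opt/llvm/bin/clang
CXX=/usr/local/opt/llvm/bin/clang++
CPPFLAGS=-I/usr/local/opt/llvm/include
LDFLAGS=-L/usr/local/opt/llvm/lib
EOF
```

After that, retrying `devtools::install_github("RGLab/Rtsne.multicore")` should pick up the OpenMP-capable compiler.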
I am using the Rtsne.multicore package and I noticed that when I set verbose=T, the progress doesn't get printed until the whole run is complete, making it more like a "log" than a progress indicator.
Example: Rtsne.multicore(as.matrix(unique(iris[, 1:4])), theta=0.01, verbose=T) runs very fast and prints the log after it has finished, but Rtsne.multicore(as.matrix(unique(iris[, 1:4])), theta=0.01, verbose=T, max_iter=1000000000) doesn't print anything at all while it is running. Since we know from the first run that each iteration is lightning fast, this means the verbose output is not being printed while the program is running.
Hi,
I've found Rtsne.multicore to be very useful in two dimensions; it produces embeddings that are very similar to those produced by the original Rtsne package. However, things seem to break down in higher dimensions (in particular three): the groupings produced by Rtsne.multicore aren't as coherent as those produced by Rtsne.
library(Rtsne)
library(Rtsne.multicore)
iris_unique <- unique(iris)
mat <- as.matrix(iris_unique[,1:4])
# run calculation
set.seed(42)
tsne_out <- Rtsne(mat, dims=3)
set.seed(42)
tsne_out_multi <- Rtsne.multicore(mat, dims=3)
# plot results
pairs(tsne_out$Y, col=iris_unique$Species, main="Rtsne")
pairs(tsne_out_multi$Y, col=iris_unique$Species, main="Rtsne.multicore")
I observed that the results differ based on the number of threads specified.
In my application, which uses BH-SNE to create a 2D embedding followed by automated clustering with DBSCAN, I have replaced the single-threaded Rtsne call with a call to your multi-threaded Rtsne.multicore. This was nice and easy thanks to the similarity of the two interfaces.
However, when I run the application, the results differ ever so slightly, as indicated below (just the first couple of points each time):
Using 1 thread
-4.3473001944841 -9.88816236259427
-0.264536173449281 2.26121958696939
-11.8037471711157 -1.23420653192463
18.5043209507443 -13.4638139443446
1.51823629529208 -27.2209786228982
8.44296382274354 11.5004388863181
17.0385503073606 -19.5842234534257
-1.80122124653633 -35.1542911986375
-14.9339466535662 11.4724805072396
-16.7179891732902 10.300907221322
Using 2 threads
-4.33102494052646 -9.94346771160292
-0.300330796745644 2.47627128482164
-14.4865548712467 3.83169546954971
18.0266761572745 -13.3481838170748
1.55009711170931 -27.3536683521347
8.57133969496983 11.704078885386
16.8146752705904 -19.4804761345993
-1.67702875389705 -35.6116919363096
-16.328562693303 10.9834569354747
-17.9212513482976 10.1738069116024
Using 3 threads
-4.15202535615338 -9.91628914440292
-0.266922842312901 2.30165398545058
-12.0458514750223 -1.26327092092668
18.3116039523395 -13.4472311793933
1.8728867702686 -27.0478452540983
8.21259960134093 11.338018514761
16.938103908809 -19.4664656504238
-1.51129210868152 -35.5926372619633
-15.7107052664802 10.622091607029
-16.9275577907434 10.5760540704756
Using 4 threads
-4.40493207317474 -10.2542865145978
-0.240311071414228 2.34386945654285
-11.613066543124 -1.22167721092907
17.978213066292 -13.6367838896947
1.68103298346623 -27.3950001130062
8.48320430773571 11.5841961868582
16.5975194709815 -19.6467988772466
-1.21063128661383 -35.6738754692542
-16.2962040171112 11.6000609166704
-16.4988660902924 10.7927849813962
The results using the same number of threads do seem to be consistent between different runs, though, which is good at least :)
Using 1 thread - a second run
-4.3473001944841 -9.88816236259427
-0.264536173449281 2.26121958696939
-11.8037471711157 -1.23420653192463
18.5043209507443 -13.4638139443446
1.51823629529208 -27.2209786228982
8.44296382274354 11.5004388863181
17.0385503073606 -19.5842234534257
-1.80122124653633 -35.1542911986375
-14.9339466535662 11.4724805072396
-16.7179891732902 10.300907221322
And computing the MD5 checksum over all the points:
cat ./one_threads/one.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
2410c2539be68ffe1f52d1be0f04bfac -
cat ./one_threads_old/one.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
2410c2539be68ffe1f52d1be0f04bfac -
cat ./two_threads/two.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
1f7dd4212d74b162420c79e619b3b91b -
cat ./three_threads/three.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
f659b3527318c9545766fed14fc72daa -
cat ./four_threads/four.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
0e7425b7acf3438d047fb1550bbd069f -
While the differences are hard to spot by eye in a 2D scatterplot, the automated clustering is affected by them.
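One way to quantify how much the embedding moves between thread counts (a sketch; the file names follow the directories used in the md5sum commands above and the columns are assumed to be the two embedding coordinates):

```r
# Read two embeddings and measure per-point displacement
emb1 <- read.table("./one_threads/one.bin.embedding.tsv")[, 1:2]
emb2 <- read.table("./two_threads/two.bin.embedding.tsv")[, 1:2]
d <- sqrt(rowSums((emb1 - emb2)^2))
summary(d)  # small but nonzero distances would explain the differing checksums
```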
Your input is greatly appreciated!
Best,
Cedric
Hi,
I am trying to install Rtsne.multicore on macOS Sierra and there are some problems while installing it. I know this is not a package problem, but I am hoping you might have some insight into it. I installed a clang compiler with OpenMP support, but I am still getting this error when I try to install the package:
devtools::install_github("RGLab/Rtsne.multicore")
The error is as follows:
Error in dyn.load(file, DLLpath = DLLpath, ...) :
  unable to load shared object '/usr/local/lib/R/3.3/site-library/Rtsne.multicore/libs/Rtsne.multicore.so':
  dlopen(/usr/local/lib/R/3.3/site-library/Rtsne.multicore/libs/Rtsne.multicore.so, 6): Symbol not found: ___kmpc_barrier
  Referenced from: /usr/local/lib/R/3.3/site-library/Rtsne.multicore/libs/Rtsne.multicore.so
  Expected in: flat namespace
Error: loading failed
Execution halted
A Stack Overflow answer gives some hints:
http://stackoverflow.com/questions/13715979/parallel-program-giving-error-undefined-reference-to-kmpc-ok-to-fork
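As the linked answer indicates, ___kmpc_barrier is a symbol from the OpenMP runtime, so the shared object was apparently compiled with OpenMP flags but not linked against an OpenMP runtime library. A possible `~/.R/Makevars` fix (a sketch only; the library path assumes a Homebrew LLVM install and may differ on your machine):

```shell
# Make sure packages are linked against an OpenMP runtime (adjust paths)
cat >> ~/.R/Makevars <<'EOF'
SHLIB_OPENMP_CFLAGS=-fopenmp
SHLIB_OPENMP_CXXFLAGS=-fopenmp
LDFLAGS=-L/usr/local/opt/llvm/lib -lomp
EOF
```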
Any suggestion is welcome!
Thanks!
Hena
> library(Rtsne.multicore) # Load package
> iris_unique <- unique(iris) # Remove duplicates
> mat <- as.matrix(iris_unique[,1:4])
> set.seed(42) # Sets seed for reproducibility
> tsne_out <- Rtsne.multicore(mat) # Run TSNE
*** caught segfault ***
address 0x6541, cause 'memory not mapped'
Traceback:
1: .Call("_Rtsne_multicore_Rtsne_cpp", PACKAGE = "Rtsne.multicore", X, no_dims_in, perplexity_in, theta_in, num_threads, max_iter, distance_precomputed)
2: Rtsne_cpp(X, dims, perplexity, theta, num_threads, max_iter, is_distance)
3: eval(expr, pf)
4: eval(expr, pf)
5: withVisible(eval(expr, pf))
6: evalVis(expr)
7: capture.output(res <- Rtsne_cpp(X, dims, perplexity, theta, num_threads, max_iter, is_distance))
8: Rtsne.multicore.default(mat)
9: Rtsne.multicore(mat)
I have no idea how to inspect a core dump...
This happens only sometimes; I cannot reliably reproduce the error...
What further info should I provide?
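For crashes like this, running R under a debugger is usually easier than inspecting a core dump. R's `-d` flag launches R under a debugger; the exact snippet below is just an illustration (it assumes lldb on macOS or valgrind on Linux is installed):

```shell
# Launch R under lldb; type 'run' at the lldb prompt, then reproduce the
# segfault to get a native backtrace into the C++ code.
R -d lldb

# Or check for memory errors with valgrind (much slower):
R -d valgrind -e 'library(Rtsne.multicore); set.seed(42); Rtsne.multicore(as.matrix(unique(iris)[,1:4]))'
```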