Comments (28)
Thanks @UnixJunkie. @Sudha247 will integrate the Multicore OCaml version of this benchmark into sandmark.
from sandmark.
@UnixJunkie: I see that this implementation of gram-matrix uses batteries, and batteries doesn't work with Multicore yet. We will have to remove the batteries dependency to build this with Multicore OCaml.
Ok, I'll get rid of the batteries dependency and let you know.
Should be quick.
By the way, can you open a bug against batteries?
If there is a problem to fix, we should know about this problem in the first place.
https://github.com/ocaml-batteries-team/batteries-included/issues
@Sudha247 I'm done; there is no longer a dependency on batteries in my proposed benchmark. If Multicore OCaml manages to beat Parmap's parallelization performance, please do drop me an e-mail. :)
GramMatrix is now included in Sandmark: #100. @Sudha247 can you post the performance comparison between multicore and the other parallel versions?
I am interested in seeing the exact command used to launch the Gram matrix bench with Multicore OCaml.
My current test:

```
~/src/sandmark/_opam/4.06.0/bin/orun -o test.bench -- taskset --cpu-list 0-15 ~/src/sandmark/_build/default/benchmarks/multicore-grammatrix/grammatrix.exe
```

just runs a sequential program.
These are the numbers for grammatrix_multicore.exe on 4.06.1+multicore:
Number of cores | time (s) | user_time (s) | sys_time (s) |
---|---|---|---|
1 | 96.726 | 96.349 | 0.392 |
2 | 74.348 | 97.717 | 0.804 |
4 | 44.624 | 98.737 | 0.980 |
8 | 30.742 | 99.169 | 1.712 |
12 | 28.991 | 99.287 | 1.448 |
16 | 27.517 | 99.189 | 1.296 |
20 | 27.221 | 98.986 | 1.755 |
24 | 26.404 | 98.874 | 1.376 |
@Sudha247 what is the command that you are using to get those results?
The table should include a speedup column (sequential_time / multi-thread_time).
@UnixJunkie This is the command used to run the multicore version:

```
~/src/sandmark/_opam/4.06.1+multicore+parallel/bin/orun -o ../../grammatrix_multicore.16.orun.bench -- taskset --cpu-list 2-13,16-27 chrt -r 1 ./grammatrix_multicore.exe 16
```

Here, 16 is the number of cores, which can be adjusted. To run the parallel benchmarks, you can use the `run_all_parallel.sh` script.
Here are the results I get with parmap:
Cores | Parmap time (s) | Speedup | Efficiency |
---|---|---|---|
1 | 57.83 | 1.00 | 1.00 |
2 | 37.93 | 1.52 | 0.76 |
4 | 24.94 | 2.32 | 0.58 |
8 | 16.79 | 3.44 | 0.43 |
12 | 14.93 | 3.87 | 0.32 |
16 | 13.45 | 4.30 | 0.27 |
Your results converted to speedup:

Cores | Multicore time (s) | Speedup | Efficiency |
---|---|---|---|
1 | 96.73 | 1.00 | 1.00 |
2 | 74.35 | 1.30 | 0.65 |
4 | 44.62 | 2.17 | 0.54 |
8 | 30.74 | 3.15 | 0.39 |
12 | 28.99 | 3.34 | 0.28 |
16 | 27.52 | 3.52 | 0.22 |
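For reference, the speedup and efficiency columns are simple ratios of the single-core time to the multi-core time; a minimal OCaml sketch of the arithmetic (the helper names are illustrative, not from the benchmark code):

```ocaml
(* Speedup and parallel efficiency relative to the 1-core run.
   These helper names are illustrative only. *)
let speedup ~seq_time ~par_time = seq_time /. par_time

let efficiency ~cores ~seq_time ~par_time =
  speedup ~seq_time ~par_time /. float_of_int cores
```

For example, the 16-core multicore row gives `96.73 /. 27.52`, i.e. a speedup of about 3.52 and an efficiency of about 0.22.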
Maybe your multicore implementation has a load balancing problem. I guess compute_gram_mat is not constant time (though I have no idea what 'Domain.Sync.poll ();' is doing). The job computing the first row of the Gram matrix (which has N elements) should also compute the last row (which has only 1 unknown element); the job computing the second row (N-1 elements) should also compute the second-to-last row (2 unknown elements); and so on. Parmap does load balancing when it is used correctly, so we don't need to be too clever when implementing parallel things with Parmap.
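The row-pairing idea above can be sketched in OCaml as follows; `dot`, the array-of-rows representation, and all function names here are illustrative assumptions, not the benchmark's actual code:

```ocaml
(* Naive dot product; assumes both rows have the same length. *)
let dot (a : float array) (b : float array) : float =
  let s = ref 0.0 in
  Array.iteri (fun k x -> s := !s +. x *. b.(k)) a;
  !s

(* Fill rows i and (n-1-i) of the symmetric Gram matrix [g].
   Row i of the upper triangle costs (n - i) dot products, so pairing
   it with row (n-1-i) gives every job a near-constant cost of about
   (n + 1) products. *)
let compute_row_pair g samples n i =
  let fill r =
    for j = r to n - 1 do
      let v = dot samples.(r) samples.(j) in
      g.(r).(j) <- v;
      g.(j).(r) <- v  (* symmetry: mirror into the lower triangle *)
    done
  in
  fill i;
  let i' = n - 1 - i in
  if i' <> i then fill i'

let gram samples =
  let n = Array.length samples in
  let g = Array.make_matrix n n 0.0 in
  (* Only the first half of the row indices need to be scheduled as
     jobs; each job also handles its mirror row. *)
  for i = 0 to (n - 1) / 2 do
    compute_row_pair g samples n i
  done;
  g
```

With this pairing, a static split of the indices `0 .. (n-1)/2` across workers is already well balanced, so no dynamic scheduler is needed.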
@Sudha247 can you post just the time numbers for multicore and parmap obtained on our benchmarking server? I assume #99 (comment) and #99 (comment) were obtained on different machines. Please also include the speedup relative to a sequential baseline.
@kayceesrk your assumption about different machines is correct (I am still unable to run the multicore-ocaml version of the bench...)
Indeed, this might not be the optimal multicore version of this benchmark. On our benchmarking server, the following numbers were obtained. ConcMinor (concurrent minor collector) is 4.06.1+multicore, which can be found here; ParMinor (parallel minor collector) can be found here. The parmap version was built with the 4.06.1 base compiler.
Cores | Parmap (s) | Speedup | ConcMinor (s) | Speedup | ParMinor (s) | Speedup |
---|---|---|---|---|---|---|
1 | 99.77 | 1 | 95.86 | 1 | 96.02 | 1 |
2 | 57.69 | 1.72 | 72.49 | 1.32 | 72.49 | 1.32 |
4 | 33.45 | 2.98 | 42.93 | 2.23 | 43.05 | 2.23 |
8 | 21.14 | 4.71 | 23.71 | 4.04 | 23.68 | 4.05 |
12 | 16.57 | 6.02 | 16.64 | 5.70 | 16.69 | 5.75 |
16 | 15.34 | 6.50 | 13.55 | 7.07 | 13.39 | 7.17 |
20 | 15.15 | 6.58 | 12.16 | 7.80 | 11.81 | 8.13 |
24 | 13.83 | 7.21 | 11.01 | 8.72 | 10.90 | 8.8 |
@UnixJunkie do your parmap numbers account for the overhead caused by reading data from the input file?
I just time the initialization of the Gram matrix:

```ocaml
let () = Gc.full_major () in
let curr_dt, curr_matrix =
  Utls.wall_clock_time (fun () ->
      compute_gram_matrix style ncores csize samples)
in
```
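For context, `Utls.wall_clock_time` runs a thunk and returns the elapsed wall-clock time alongside its result; a minimal sketch of such a helper (not the repo's actual code, and it assumes the `unix` library is linked for `Unix.gettimeofday`):

```ocaml
(* Minimal wall-clock timing helper: returns (elapsed seconds, result).
   Requires linking the unix library for Unix.gettimeofday. *)
let wall_clock_time (f : unit -> 'a) : float * 'a =
  let start = Unix.gettimeofday () in
  let res = f () in
  (Unix.gettimeofday () -. start, res)
```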
Note that in your results, up to 12 cores, Parmap is still better.
@Sudha247 if you manage to scale better, Gram matrix initialization is a useful real-world problem. Feel free to write the algorithm in whichever way is most efficient for Multicore OCaml.
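For what it's worth, here is one way such a row-parallel loop could be structured with the OCaml 5 `Domain` API (the 4.06.1+multicore prototype exposed a somewhat different API, so this is an illustrative sketch, not the benchmark's actual code):

```ocaml
(* Run [work i] for every i in [0, n), splitting the index range into
   [ncores] static, contiguous slices, one domain per slice. *)
let parallel_rows ~ncores ~n (work : int -> unit) =
  let spawn lo hi =
    Domain.spawn (fun () ->
        for i = lo to hi - 1 do work i done)
  in
  let chunk = (n + ncores - 1) / ncores in
  let domains =
    List.init ncores (fun d ->
        spawn (d * chunk) (min n ((d + 1) * chunk)))
  in
  List.iter Domain.join domains
```

Static slicing like this is only well balanced if each `work i` costs about the same, which is exactly what the row-pairing trick arranges.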
Also, the bench should be run several times for each multicore configuration, with the average and standard deviation computed.
> The parmap version was built with 4.10.0 base compiler.

We shouldn't benchmark against two different versions of the compiler. The parmap version should also use 4.06.1. Is the parmap version using batteries? If so, it must be ported to use whatever multicore is using.
Parmap does not depend on batteries.
> We shouldn't benchmark against two different versions of the compilers.

Noted, thanks! I have updated my original comment with the numbers for 4.06.1.
I've redone the multicore version with load balancing. Here are the results:
Running time (seconds)
Baseline = 92.55 seconds
Cores | Multicore (ConcMinor) | Parmap | Parany |
---|---|---|---|
1 | 96.06382418 | 105.02 | 92.95 |
2 | 48.66779995 | 54.73 | 54.34 |
4 | 24.92529607 | 31.57 | 28.08 |
8 | 12.95011806 | 18.57 | 21.52 |
12 | 9.000571966 | 14.7 | 27.85 |
16 | 7.06490612 | 13.97 | 23.85 |
20 | 5.884493113 | 11.46 | 24.06 |
24 | 5.25877285 | 10.61 | 24.66 |
Speedup
Cores | Multicore (ConcMinor) | Parmap | Parany |
---|---|---|---|
1 | 0.9634219832 | 0.8812607122 | 0.9956966111 |
2 | 1.901668045 | 1.691028686 | 1.703165256 |
4 | 3.713095313 | 2.931580615 | 3.295940171 |
8 | 7.146652991 | 4.983844911 | 4.300650558 |
12 | 10.28267985 | 6.295918367 | 3.323159785 |
16 | 13.09996176 | 6.624910523 | 3.880503145 |
20 | 15.72777778 | 8.07591623 | 3.846633416 |
24 | 17.59916289 | 8.722902922 | 3.753041363 |
Speedup graph is here
Note that if you play with the chunksize to get better performance with Multicore OCaml, you should also play with it for Parmap and Parany, to see which chunksize is best for them.
I was also busy. :)
Parany has a new branch called sendmsg. The version on this branch
is significantly faster than before (but only works on Linux for the moment).
Also, I updated the code in
https://github.com/UnixJunkie/gram-matrix-bench
so that the Parmap version is more efficient now.
I did a bit of experimentation with chunk sizes, and it looks like a chunk size of 3 is the most efficient for parmap and parany. For multicore, it is set to 16, though going lower does not impact performance. I ran the benchmark on a 56-core machine with the latest updates; parany is performing worse than before.
Cores | Multicore (ConcMinor) (chunk size = 32) | Speedup | ParMap (-c 3) | Speedup | Parany (-c 3) | Speedup |
---|---|---|---|---|---|---|
1 | 68.662 | 0.9471323294 | 70.89 | 0.9173649316 | 65.83 | 0.9878778672 |
2 | 36.426 | 1.785318179 | 38.13 | 1.7055337 | 41.23 | 1.577298084 |
3 | 24.991 | 2.602216798 | 27.58 | 2.357940537 | 29.3 | 2.219522184 |
4 | 18.631 | 3.490526542 | 20.61 | 3.155361475 | 23.45 | 2.773219616 |
5 | 15.668 | 4.150625479 | 17.22 | 3.776538908 | 23.26 | 2.795872743 |
6 | 13.418 | 4.846623938 | 15.09 | 4.309609013 | 24 | 2.709666667 |
7 | 11.574 | 5.61880076 | 13.66 | 4.760761347 | 23.66 | 2.748605241 |
8 | 10.356 | 6.27964465 | 12.41 | 5.240290089 | 28.48 | 2.283426966 |
9 | 9.467 | 6.869335587 | 11.83 | 5.497210482 | 31.57 | 2.059930314 |
10 | 9.694 | 6.708479472 | 10.81 | 6.015911193 | 31.06 | 2.093754024 |
11 | 8.013 | 8.115811806 | 10.22 | 6.363209393 | 30.19 | 2.154090759 |
12 | 7.395 | 8.794050034 | 9.44 | 6.888983051 | 30.34 | 2.143441002 |
13 | 6.996 | 9.295597484 | 9.09 | 7.154235424 | 28.36 | 2.293088858 |
14 | 6.534 | 9.952861953 | 8.61 | 7.553077816 | 28.15 | 2.310195382 |
15 | 6.108 | 10.6470203 | 9.46 | 6.874418605 | 30.94 | 2.101874596 |
16 | 5.886 | 11.04858987 | 9.28 | 7.007758621 | 66.2 | 0.9823564955 |
17 | 5.737 | 11.33554122 | 9.24 | 7.038095238 | 31.49 | 2.065163544 |
18 | 5.696 | 11.41713483 | 8.96 | 7.258035714 | 31.26 | 2.080358285 |
19 | 5.412 | 12.01626016 | 8.17 | 7.959853121 | 76.51 | 0.8499803947 |
20 | 5.218 | 12.46301265 | 9.37 | 6.940448239 | 83.57 | 0.7781739859 |
21 | 5.051 | 12.87507424 | 7.53 | 8.636387782 | 86.09 | 0.7553955163 |
22 | 4.885 | 13.31258956 | 7.94 | 8.190428212 | 90.24 | 0.7206560284 |
23 | 4.701 | 13.83365241 | 8 | 8.129 | 90.95 | 0.7150302364 |
24 | 4.718 | 13.7838067 | 7.52 | 8.64787234 | 32.77 | 1.984498016 |
25 | 4.598 | 14.14354067 | 7.37 | 8.823880597 | 94.9 | 0.6852687039 |
26 | 4.476 | 14.52904379 | 7.73 | 8.412936611 | 107.34 | 0.6058505683 |
27 | 4.285 | 15.17666278 | 7.95 | 8.180125786 | 96.03 | 0.6772050401 |
28 | 4.176 | 15.57279693 | 8.28 | 7.85410628 | 33.08 | 1.965900846 |
29 | 4.087 | 15.91191583 | 7.67 | 8.47874837 | 112.21 | 0.5795561893 |
30 | 4.017 | 16.18919592 | 7.39 | 8.8 | 109.14 | 0.5958585303 |
31 | 3.946 | 16.48048657 | 7.75 | 8.391225806 | 123.71 | 0.5256810282 |
32 | 3.851 | 16.88704233 | 6.82 | 9.535483871 | 36.97 | 1.759047877 |
33 | 3.667 | 17.73438778 | 7.02 | 9.263817664 | 35.16 | 1.84960182 |
34 | 3.516 | 18.4960182 | 6.61 | 9.838426626 | 139.72 | 0.4654451761 |
35 | 3.588 | 18.12486065 | 7.68 | 8.467708333 | 147.3 | 0.4414935506 |
36 | 3.572 | 18.20604703 | 6.92 | 9.397687861 | 148.56 | 0.4377490576 |
37 | 3.528 | 18.43310658 | 8 | 8.129 | 146.66 | 0.4434201555 |
38 | 3.28 | 19.82682927 | 6.56 | 9.913414634 | 144.38 | 0.4504224962 |
39 | 3.225 | 20.16496124 | 6.83 | 9.521522694 | 161.48 | 0.4027247956 |
40 | 3.19 | 20.3862069 | 6.5 | 10.00492308 | 156.26 | 0.4161781646 |
41 | 3.138 | 20.72402804 | 7.17 | 9.070013947 | 176.74 | 0.3679529252 |
42 | 3.091 | 21.03914591 | 6.77 | 9.605908419 | 164.59 | 0.3951151346 |
43 | 3.041 | 21.3850707 | 6.91 | 9.411287988 | 180.24 | 0.3608078118 |
44 | 3.104 | 20.95103093 | 6.75 | 9.63437037 | 32.59 | 1.99545873 |
45 | 3.086 | 21.07323396 | 6.38 | 10.19310345 | 143.49 | 0.453216252 |
46 | 2.943 | 22.09717975 | 6.63 | 9.808748115 | 154.82 | 0.4200490893 |
47 | 3.062 | 21.23840627 | 6.01 | 10.82063228 | 157.57 | 0.412718157 |
48 | 2.923 | 22.24837496 | 6.34 | 10.25741325 | 162.74 | 0.3996067347 |
49 | 3.074 | 21.15549772 | 6.2 | 10.48903226 | 154.36 | 0.4213008551 |
50 | 2.797 | 23.25062567 | 6.05 | 10.74909091 | 179.83 | 0.3616304287 |
51 | 2.83 | 22.9795053 | 6.59 | 9.868285281 | 169.94 | 0.3826762387 |
52 | 2.824 | 23.02832861 | 5.78 | 11.25121107 | 191.27 | 0.3400010456 |
53 | 2.689 | 24.18445519 | 6.7 | 9.706268657 | 166.43 | 0.3907468605 |
54 | 2.733 | 23.79509696 | 6.23 | 10.43852327 | 154.88 | 0.4198863636 |
55 | 2.697 | 24.11271783 | 6.01 | 10.82063228 | 178.34 | 0.3646517887 |
56 | 2.745 | 23.69107468 | 5.99 | 10.85676127 | 198.99 | 0.3268103925 |
Graph is here
I'll consider this issue closed for now since we have the Gram Matrix implementation in the sandmark suite :-)
Yes, and it looks like a clear win for multicore-OCaml.
I am impressed; nice work.
FTR, the results of my bench (just Parmap vs. Parany) can be seen in the picture in the README.md file here: https://github.com/UnixJunkie/gram-matrix-bench