
Comments (28)

kayceesrk commented on July 24, 2024

Thanks @UnixJunkie. @Sudha247 will integrate the Multicore OCaml version of this benchmark into sandmark.

from sandmark.

Sudha247 commented on July 24, 2024

@UnixJunkie: I see that this implementation of gram-matrix uses batteries, and batteries doesn't work with Multicore yet. We will have to remove the batteries dependency to build this with Multicore OCaml.

UnixJunkie commented on July 24, 2024

Ok, I'll get rid of the batteries dependency and let you know.
Should be quick.

UnixJunkie commented on July 24, 2024

By the way, can you open an issue against batteries? If there is a problem to fix, the maintainers should know about it in the first place:
https://github.com/ocaml-batteries-team/batteries-included/issues

UnixJunkie commented on July 24, 2024

@Sudha247 I'm done; my proposed benchmark no longer depends on batteries. If Multicore OCaml manages to beat parmap's parallelization performance, please do drop me an e-mail. :)

kayceesrk commented on July 24, 2024

GramMatrix is now included in Sandmark: #100. @Sudha247 can you post the performance comparison between multicore and the other parallel versions?

UnixJunkie commented on July 24, 2024

I'd like to see the exact command used to launch the Gram matrix bench with Multicore OCaml.
My current test:

```sh
~/src/sandmark/_opam/4.06.0/bin/orun -o test.bench -- taskset --cpu-list 0-15 ~/src/sandmark/_build/default/benchmarks/multicore-grammatrix/grammatrix.exe
```

just runs a sequential program.

Sudha247 commented on July 24, 2024

These are the numbers for grammatrix_multicore.exe on 4.06.1+multicore:

| Cores | Time (s) | User time (s) | Sys time (s) |
|---|---|---|---|
| 1 | 96.726 | 96.349 | 0.392 |
| 2 | 74.348 | 97.717 | 0.804 |
| 4 | 44.624 | 98.737 | 0.980 |
| 8 | 30.742 | 99.169 | 1.712 |
| 12 | 28.991 | 99.287 | 1.448 |
| 16 | 27.517 | 99.189 | 1.296 |
| 20 | 27.221 | 98.986 | 1.755 |
| 24 | 26.404 | 98.874 | 1.376 |

UnixJunkie commented on July 24, 2024

@Sudha247 what is the command that you are using to get those results?

UnixJunkie commented on July 24, 2024

The table should include a speedup column (sequential_time / multi-thread_time).
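The requested columns can be computed directly from the timings; a minimal sketch (the helper names are mine, not from the benchmark code):

```ocaml
(* speedup    = sequential_time /. multi_thread_time
   efficiency = speedup /. number_of_cores *)
let speedup ~t_seq ~t_n = t_seq /. t_n

let efficiency ~t_seq ~t_n ~cores =
  speedup ~t_seq ~t_n /. float_of_int cores
```

For example, with t_seq = 96.73 s and t_n = 27.52 s on 16 cores, the speedup is about 3.5 and the efficiency about 0.22.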

Sudha247 commented on July 24, 2024

@UnixJunkie This is the command used to run the multicore version:

```sh
~/src/sandmark/_opam/4.06.1+multicore+parallel/bin/orun -o ../../grammatrix_multicore.16.orun.bench -- taskset --cpu-list 2-13,16-27 chrt -r 1 ./grammatrix_multicore.exe 16
```

Here, 16 is the number of cores, which can be adjusted.

To run the parallel benchmarks, you can use the run_all_parallel.sh file.

UnixJunkie commented on July 24, 2024

Here are the results I get with parmap:

| Cores | Parmap time (s) | Speedup | Efficiency |
|---|---|---|---|
| 1 | 57.83 | 1.00 | 1.00 |
| 2 | 37.93 | 1.52 | 0.76 |
| 4 | 24.94 | 2.32 | 0.58 |
| 8 | 16.79 | 3.44 | 0.43 |
| 12 | 14.93 | 3.87 | 0.32 |
| 16 | 13.45 | 4.30 | 0.27 |

UnixJunkie commented on July 24, 2024

Your results converted to speedup:

| Cores | Multicore time (s) | Speedup | Efficiency |
|---|---|---|---|
| 1 | 96.73 | 1.00 | 1.00 |
| 2 | 74.35 | 1.30 | 0.65 |
| 4 | 44.62 | 2.17 | 0.54 |
| 8 | 30.74 | 3.15 | 0.39 |
| 12 | 28.99 | 3.34 | 0.28 |
| 16 | 27.52 | 3.52 | 0.22 |

UnixJunkie commented on July 24, 2024

Maybe your multicore implementation has a load-balancing problem. I guess compute_gram_mat is not constant time (though I have no idea what `Domain.Sync.poll ()` is doing).
The job computing the first row of the Gram matrix (which has N unknown elements) should also compute the last row (which has only 1 unknown element).
The job computing the second row (N-1 unknown elements) should also compute the second-to-last row (2 unknown elements), and so on.

Parmap does load balancing when used correctly, so we don't need to be too clever when implementing parallel things with Parmap.
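The pairing described above can be sketched as follows (a minimal illustration, not the benchmark's actual code):

```ocaml
(* Pair row i with row (n-1-i) of an n x n symmetric Gram matrix so that
   each task computes roughly n+1 upper-triangular elements. *)
let balanced_pairs n =
  let rec loop i j acc =
    if i > j then List.rev acc
    else if i = j then List.rev ((i, i) :: acc)
    else loop (i + 1) (j - 1) ((i, j) :: acc)
  in
  loop 0 (n - 1) []
```

For n = 6 this yields [(0, 5); (1, 4); (2, 3)]: each pair covers 7 unknown elements, so every task does the same amount of work.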

kayceesrk commented on July 24, 2024

@Sudha247 can you post just the time numbers for multicore and parmap obtained on our benchmarking server? I assume #99 (comment) and #99 (comment) were obtained on different machines. Please also include the speedup compared to a sequential baseline.

UnixJunkie commented on July 24, 2024

@kayceesrk your assumption about different machines is correct (I am still unable to run the Multicore OCaml version of the bench...)

Sudha247 commented on July 24, 2024

Indeed, this might not be the optimal multicore version of this benchmark. The following numbers were obtained on our benchmarking server.

ConcMinor (concurrent minor collector) is 4.06.1+multicore, which can be found here, and ParMinor (parallel minor collector) can be found here. The parmap version was built with the 4.06.1 base compiler.

| Cores | Parmap | Speedup | ConcMinor | Speedup | ParMinor | Speedup |
|---|---|---|---|---|---|---|
| 1 | 99.77 | 1 | 95.86 | 1 | 96.02 | 1 |
| 2 | 57.69 | 1.72 | 72.49 | 1.32 | 72.49 | 1.32 |
| 4 | 33.45 | 2.98 | 42.93 | 2.23 | 43.05 | 2.23 |
| 8 | 21.14 | 4.71 | 23.71 | 4.04 | 23.68 | 4.05 |
| 12 | 16.57 | 6.02 | 16.64 | 5.70 | 16.69 | 5.75 |
| 16 | 15.34 | 6.50 | 13.55 | 7.07 | 13.39 | 7.17 |
| 20 | 15.15 | 6.58 | 12.16 | 7.80 | 11.81 | 8.13 |
| 24 | 13.83 | 7.21 | 11.01 | 8.72 | 10.90 | 8.8 |

@UnixJunkie do your parmap numbers account for the overhead caused by reading data from the input file?

UnixJunkie commented on July 24, 2024

I just time the initialization of the Gram matrix:

```ocaml
let () = Gc.full_major () in
let curr_dt, curr_matrix =
  Utls.wall_clock_time (fun () ->
      compute_gram_matrix style ncores csize samples
    ) in
```
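For reference, here is a plausible sketch of such a helper (the real `Utls.wall_clock_time` lives in the benchmark's own `Utls` module, so this is an assumption about its behavior, not its actual definition):

```ocaml
(* Time a thunk with Unix.gettimeofday and return (elapsed_seconds, result);
   assumed to mirror the benchmark's Utls.wall_clock_time. *)
let wall_clock_time f =
  let start = Unix.gettimeofday () in
  let res = f () in
  (Unix.gettimeofday () -. start, res)
```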

Note that in your results, up to 12 cores, Parmap is still better.

UnixJunkie commented on July 24, 2024

@Sudha247 if you manage to scale better, Gram matrix initialization is a useful real-world problem.
Feel free to write the algorithm in whichever way is most efficient for Multicore OCaml.

UnixJunkie commented on July 24, 2024

Also, the bench should be run several times for each multi-core configuration, with the average and standard deviation computed.
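The suggested averaging step could look like this (plain OCaml; the helper names are mine, not from the benchmark):

```ocaml
(* Population mean and standard deviation over repeated run times. *)
let mean xs =
  List.fold_left ( +. ) 0.0 xs /. float_of_int (List.length xs)

let stddev xs =
  let m = mean xs in
  let var =
    List.fold_left (fun acc x -> acc +. ((x -. m) *. (x -. m))) 0.0 xs
    /. float_of_int (List.length xs)
  in
  sqrt var
```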

kayceesrk commented on July 24, 2024

> The parmap version was built with 4.10.0 base compiler.

We shouldn't benchmark against two different versions of the compilers; the parmap version should also be built with 4.06.1. Is the parmap version using batteries? If so, it must be ported to use whatever multicore is using.

UnixJunkie commented on July 24, 2024

Parmap does not depend on batteries.

Sudha247 commented on July 24, 2024

> We shouldn't benchmark against two different versions of the compilers.

Noted, thanks! I have updated my original comment with the numbers for 4.06.1.

kayceesrk commented on July 24, 2024

I've redone the multicore version with load balancing. Here are the results:

Running time (seconds)

Baseline = 92.55 seconds

| Cores | Multicore (ConcMinor) | Parmap | Parany |
|---|---|---|---|
| 1 | 96.06382418 | 105.02 | 92.95 |
| 2 | 48.66779995 | 54.73 | 54.34 |
| 4 | 24.92529607 | 31.57 | 28.08 |
| 8 | 12.95011806 | 18.57 | 21.52 |
| 12 | 9.000571966 | 14.7 | 27.85 |
| 16 | 7.06490612 | 13.97 | 23.85 |
| 20 | 5.884493113 | 11.46 | 24.06 |
| 24 | 5.25877285 | 10.61 | 24.66 |

Speedup

| Cores | Multicore (ConcMinor) | Parmap | Parany |
|---|---|---|---|
| 1 | 0.9634219832 | 0.8812607122 | 0.9956966111 |
| 2 | 1.901668045 | 1.691028686 | 1.703165256 |
| 4 | 3.713095313 | 2.931580615 | 3.295940171 |
| 8 | 7.146652991 | 4.983844911 | 4.300650558 |
| 12 | 10.28267985 | 6.295918367 | 3.323159785 |
| 16 | 13.09996176 | 6.624910523 | 3.880503145 |
| 20 | 15.72777778 | 8.07591623 | 3.846633416 |
| 24 | 17.59916289 | 8.722902922 | 3.753041363 |

Speedup graph is here

UnixJunkie commented on July 24, 2024

Note that if you tune the chunksize to get better performance with Multicore OCaml, you should also tune it for Parmap and Parany, to see which chunksize works best for them.
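Such a sweep could be sketched like this (a hedged sketch: it assumes Parmap's `parmap` function with its optional `~ncores` and `~chunksize` arguments and the `Parmap.L` list wrapper; `f` and `xs` are placeholders for the benchmark's kernel and input):

```ocaml
(* Try several chunk sizes and print the wall-clock time for each. *)
let sweep_chunksizes f xs =
  List.iter
    (fun cs ->
      let t0 = Unix.gettimeofday () in
      ignore (Parmap.parmap ~ncores:16 ~chunksize:cs f (Parmap.L xs));
      Printf.printf "chunksize=%d: %.2fs\n" cs (Unix.gettimeofday () -. t0))
    [ 1; 3; 8; 16; 32 ]
```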

I was also busy. :)

Parany has a new branch called sendmsg. The version on that branch is significantly faster than before (but only works on Linux for the moment). I also updated the code at
https://github.com/UnixJunkie/gram-matrix-bench
so that the Parmap version is more efficient now.

kayceesrk commented on July 24, 2024

I did a bit of experimentation with chunk sizes, and it looks like a chunk size of 3 is the most efficient for parmap and parany. For multicore it is set to 16, though going lower does not impact performance. I ran the benchmark on a 56-core machine.

I ran the benchmarks with the latest updates; parany is performing worse than before.

| Cores | Multicore (ConcMinor, chunk size = 32) | Speedup | ParMap (-c 3) | Speedup | Parany (-c 3) | Speedup |
|---|---|---|---|---|---|---|
| 1 | 68.662 | 0.9471323294 | 70.89 | 0.9173649316 | 65.83 | 0.9878778672 |
| 2 | 36.426 | 1.785318179 | 38.13 | 1.7055337 | 41.23 | 1.577298084 |
| 3 | 24.991 | 2.602216798 | 27.58 | 2.357940537 | 29.3 | 2.219522184 |
| 4 | 18.631 | 3.490526542 | 20.61 | 3.155361475 | 23.45 | 2.773219616 |
| 5 | 15.668 | 4.150625479 | 17.22 | 3.776538908 | 23.26 | 2.795872743 |
| 6 | 13.418 | 4.846623938 | 15.09 | 4.309609013 | 24 | 2.709666667 |
| 7 | 11.574 | 5.61880076 | 13.66 | 4.760761347 | 23.66 | 2.748605241 |
| 8 | 10.356 | 6.27964465 | 12.41 | 5.240290089 | 28.48 | 2.283426966 |
| 9 | 9.467 | 6.869335587 | 11.83 | 5.497210482 | 31.57 | 2.059930314 |
| 10 | 9.694 | 6.708479472 | 10.81 | 6.015911193 | 31.06 | 2.093754024 |
| 11 | 8.013 | 8.115811806 | 10.22 | 6.363209393 | 30.19 | 2.154090759 |
| 12 | 7.395 | 8.794050034 | 9.44 | 6.888983051 | 30.34 | 2.143441002 |
| 13 | 6.996 | 9.295597484 | 9.09 | 7.154235424 | 28.36 | 2.293088858 |
| 14 | 6.534 | 9.952861953 | 8.61 | 7.553077816 | 28.15 | 2.310195382 |
| 15 | 6.108 | 10.6470203 | 9.46 | 6.874418605 | 30.94 | 2.101874596 |
| 16 | 5.886 | 11.04858987 | 9.28 | 7.007758621 | 66.2 | 0.9823564955 |
| 17 | 5.737 | 11.33554122 | 9.24 | 7.038095238 | 31.49 | 2.065163544 |
| 18 | 5.696 | 11.41713483 | 8.96 | 7.258035714 | 31.26 | 2.080358285 |
| 19 | 5.412 | 12.01626016 | 8.17 | 7.959853121 | 76.51 | 0.8499803947 |
| 20 | 5.218 | 12.46301265 | 9.37 | 6.940448239 | 83.57 | 0.7781739859 |
| 21 | 5.051 | 12.87507424 | 7.53 | 8.636387782 | 86.09 | 0.7553955163 |
| 22 | 4.885 | 13.31258956 | 7.94 | 8.190428212 | 90.24 | 0.7206560284 |
| 23 | 4.701 | 13.83365241 | 8 | 8.129 | 90.95 | 0.7150302364 |
| 24 | 4.718 | 13.7838067 | 7.52 | 8.64787234 | 32.77 | 1.984498016 |
| 25 | 4.598 | 14.14354067 | 7.37 | 8.823880597 | 94.9 | 0.6852687039 |
| 26 | 4.476 | 14.52904379 | 7.73 | 8.412936611 | 107.34 | 0.6058505683 |
| 27 | 4.285 | 15.17666278 | 7.95 | 8.180125786 | 96.03 | 0.6772050401 |
| 28 | 4.176 | 15.57279693 | 8.28 | 7.85410628 | 33.08 | 1.965900846 |
| 29 | 4.087 | 15.91191583 | 7.67 | 8.47874837 | 112.21 | 0.5795561893 |
| 30 | 4.017 | 16.18919592 | 7.39 | 8.8 | 109.14 | 0.5958585303 |
| 31 | 3.946 | 16.48048657 | 7.75 | 8.391225806 | 123.71 | 0.5256810282 |
| 32 | 3.851 | 16.88704233 | 6.82 | 9.535483871 | 36.97 | 1.759047877 |
| 33 | 3.667 | 17.73438778 | 7.02 | 9.263817664 | 35.16 | 1.84960182 |
| 34 | 3.516 | 18.4960182 | 6.61 | 9.838426626 | 139.72 | 0.4654451761 |
| 35 | 3.588 | 18.12486065 | 7.68 | 8.467708333 | 147.3 | 0.4414935506 |
| 36 | 3.572 | 18.20604703 | 6.92 | 9.397687861 | 148.56 | 0.4377490576 |
| 37 | 3.528 | 18.43310658 | 8 | 8.129 | 146.66 | 0.4434201555 |
| 38 | 3.28 | 19.82682927 | 6.56 | 9.913414634 | 144.38 | 0.4504224962 |
| 39 | 3.225 | 20.16496124 | 6.83 | 9.521522694 | 161.48 | 0.4027247956 |
| 40 | 3.19 | 20.3862069 | 6.5 | 10.00492308 | 156.26 | 0.4161781646 |
| 41 | 3.138 | 20.72402804 | 7.17 | 9.070013947 | 176.74 | 0.3679529252 |
| 42 | 3.091 | 21.03914591 | 6.77 | 9.605908419 | 164.59 | 0.3951151346 |
| 43 | 3.041 | 21.3850707 | 6.91 | 9.411287988 | 180.24 | 0.3608078118 |
| 44 | 3.104 | 20.95103093 | 6.75 | 9.63437037 | 32.59 | 1.99545873 |
| 45 | 3.086 | 21.07323396 | 6.38 | 10.19310345 | 143.49 | 0.453216252 |
| 46 | 2.943 | 22.09717975 | 6.63 | 9.808748115 | 154.82 | 0.4200490893 |
| 47 | 3.062 | 21.23840627 | 6.01 | 10.82063228 | 157.57 | 0.412718157 |
| 48 | 2.923 | 22.24837496 | 6.34 | 10.25741325 | 162.74 | 0.3996067347 |
| 49 | 3.074 | 21.15549772 | 6.2 | 10.48903226 | 154.36 | 0.4213008551 |
| 50 | 2.797 | 23.25062567 | 6.05 | 10.74909091 | 179.83 | 0.3616304287 |
| 51 | 2.83 | 22.9795053 | 6.59 | 9.868285281 | 169.94 | 0.3826762387 |
| 52 | 2.824 | 23.02832861 | 5.78 | 11.25121107 | 191.27 | 0.3400010456 |
| 53 | 2.689 | 24.18445519 | 6.7 | 9.706268657 | 166.43 | 0.3907468605 |
| 54 | 2.733 | 23.79509696 | 6.23 | 10.43852327 | 154.88 | 0.4198863636 |
| 55 | 2.697 | 24.11271783 | 6.01 | 10.82063228 | 178.34 | 0.3646517887 |
| 56 | 2.745 | 23.69107468 | 5.99 | 10.85676127 | 198.99 | 0.3268103925 |

Graph is here

I'll consider this issue closed for now, since we have the Gram matrix implementation in the Sandmark suite :-)

UnixJunkie commented on July 24, 2024

Yes, and it looks like a clear win for multicore-OCaml.
I am impressed; nice work.

UnixJunkie commented on July 24, 2024

FTR, the results of my bench (just Parmap vs. Parany) can be seen in the image in the README.md file here:
https://github.com/UnixJunkie/gram-matrix-bench
