
Comments (28)

kayceesrk commented on July 24, 2024

Thanks @UnixJunkie. @Sudha247 will integrate the Multicore OCaml version of this benchmark into sandmark.

from sandmark.

Sudha247 commented on July 24, 2024

@UnixJunkie: I see that this implementation of gram-matrix uses batteries, and batteries doesn't work with Multicore yet. We will have to remove the batteries dependency to build this with Multicore OCaml.

UnixJunkie commented on July 24, 2024

Ok, I'll get rid of the batteries dependency and let you know.
Should be quick.

UnixJunkie commented on July 24, 2024

By the way, can you open an issue against batteries? If there is a problem to fix, the maintainers should know about it in the first place:
https://github.com/ocaml-batteries-team/batteries-included/issues

UnixJunkie commented on July 24, 2024

@Sudha247 I'm done; my proposed benchmark no longer depends on batteries. If Multicore OCaml manages to beat parmap's parallelization performance, please do drop me an e-mail. :)

kayceesrk commented on July 24, 2024

GramMatrix is now included in Sandmark: #100. @Sudha247 can you post the performance comparison between multicore and the other parallel versions?

UnixJunkie commented on July 24, 2024

I'd like to see the exact command used to launch the Gram matrix bench with Multicore OCaml.
My current test:

```sh
~/src/sandmark/_opam/4.06.0/bin/orun -o test.bench -- taskset --cpu-list 0-15 ~/src/sandmark/_build/default/benchmarks/multicore-grammatrix/grammatrix.exe
```

just runs a sequential program.

Sudha247 commented on July 24, 2024

These are the numbers for grammatrix_multicore.exe on 4.06.1+multicore:

| Cores | Time (s) | User time (s) | Sys time (s) |
|---|---|---|---|
| 1 | 96.726 | 96.349 | 0.392 |
| 2 | 74.348 | 97.717 | 0.804 |
| 4 | 44.624 | 98.737 | 0.980 |
| 8 | 30.742 | 99.169 | 1.712 |
| 12 | 28.991 | 99.287 | 1.448 |
| 16 | 27.517 | 99.189 | 1.296 |
| 20 | 27.221 | 98.986 | 1.755 |
| 24 | 26.404 | 98.874 | 1.376 |

UnixJunkie commented on July 24, 2024

@Sudha247 what is the command that you are using to get those results?

UnixJunkie commented on July 24, 2024

The table should include a speedup column (sequential_time / multi-thread_time).
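The requested columns can be computed directly from the timings; a minimal sketch (the helper names are mine, not from the benchmark code):

```ocaml
(* speedup    = sequential_time /. multi_thread_time
   efficiency = speedup /. number_of_cores *)
let speedup ~t_seq ~t_n = t_seq /. t_n

let efficiency ~t_seq ~t_n ~cores =
  speedup ~t_seq ~t_n /. float_of_int cores
```

For example, with t_seq = 96.73 s and t_n = 27.52 s on 16 cores, the speedup is about 3.5 and the efficiency about 0.22.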

Sudha247 commented on July 24, 2024

@UnixJunkie This is the command used to run the multicore version:

```sh
~/src/sandmark/_opam/4.06.1+multicore+parallel/bin/orun -o ../../grammatrix_multicore.16.orun.bench -- taskset --cpu-list 2-13,16-27 chrt -r 1 ./grammatrix_multicore.exe 16
```

Here, 16 is the number of cores, which can be adjusted.

To run the parallel benchmarks, you can use the run_all_parallel.sh file.

UnixJunkie commented on July 24, 2024

Here are the results I get with parmap:

| Cores | Parmap time (s) | Speedup | Efficiency |
|---|---|---|---|
| 1 | 57.83 | 1.00 | 1.00 |
| 2 | 37.93 | 1.52 | 0.76 |
| 4 | 24.94 | 2.32 | 0.58 |
| 8 | 16.79 | 3.44 | 0.43 |
| 12 | 14.93 | 3.87 | 0.32 |
| 16 | 13.45 | 4.30 | 0.27 |

UnixJunkie commented on July 24, 2024

Your results converted to speedup:

| Cores | Multicore time (s) | Speedup | Efficiency |
|---|---|---|---|
| 1 | 96.73 | 1.00 | 1.00 |
| 2 | 74.35 | 1.30 | 0.65 |
| 4 | 44.62 | 2.17 | 0.54 |
| 8 | 30.74 | 3.15 | 0.39 |
| 12 | 28.99 | 3.34 | 0.28 |
| 16 | 27.52 | 3.52 | 0.22 |

UnixJunkie commented on July 24, 2024

Maybe your multicore implementation has a load-balancing problem. I guess compute_gram_mat is not constant time (though I have no idea what `Domain.Sync.poll ()` is doing).
The job computing the first row of the Gram matrix (which has N unknown elements) should also compute the last row (which has only 1 unknown element).
The job computing the second row (N-1 unknown elements) should also compute the second-to-last row (2 unknown elements), and so on.

Parmap does load balancing when used correctly, so we don't need to be too clever when implementing parallel things with Parmap.
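The pairing described above can be sketched as follows (a minimal illustration, not the benchmark's actual code):

```ocaml
(* Pair row i with row (n-1-i) of an n x n symmetric Gram matrix so that
   each task computes roughly n+1 upper-triangular elements. *)
let balanced_pairs n =
  let rec loop i j acc =
    if i > j then List.rev acc
    else if i = j then List.rev ((i, i) :: acc)
    else loop (i + 1) (j - 1) ((i, j) :: acc)
  in
  loop 0 (n - 1) []
```

For n = 6 this yields [(0, 5); (1, 4); (2, 3)]: each pair covers 7 unknown elements, so every task does the same amount of work.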

kayceesrk commented on July 24, 2024

@Sudha247 can you post just the time numbers for multicore and parmap obtained on our benchmarking server? I assume #99 (comment) and #99 (comment) were obtained on different machines. Please also include the speedup compared to a sequential baseline.

UnixJunkie commented on July 24, 2024

@kayceesrk your assumption about different machines is correct (I am still unable to run the Multicore OCaml version of the bench...)

Sudha247 commented on July 24, 2024

Indeed, this might not be the optimal multicore version of this benchmark. The following numbers were obtained on our benchmarking server.

ConcMinor (concurrent minor collector) is 4.06.1+multicore, which can be found here, and ParMinor (parallel minor collector) can be found here. The parmap version was built with the 4.06.1 base compiler.

| Cores | Parmap | Speedup | ConcMinor | Speedup | ParMinor | Speedup |
|---|---|---|---|---|---|---|
| 1 | 99.77 | 1 | 95.86 | 1 | 96.02 | 1 |
| 2 | 57.69 | 1.72 | 72.49 | 1.32 | 72.49 | 1.32 |
| 4 | 33.45 | 2.98 | 42.93 | 2.23 | 43.05 | 2.23 |
| 8 | 21.14 | 4.71 | 23.71 | 4.04 | 23.68 | 4.05 |
| 12 | 16.57 | 6.02 | 16.64 | 5.70 | 16.69 | 5.75 |
| 16 | 15.34 | 6.50 | 13.55 | 7.07 | 13.39 | 7.17 |
| 20 | 15.15 | 6.58 | 12.16 | 7.80 | 11.81 | 8.13 |
| 24 | 13.83 | 7.21 | 11.01 | 8.72 | 10.90 | 8.8 |

@UnixJunkie do your parmap numbers account for the overhead caused by reading data from the input file?

UnixJunkie commented on July 24, 2024

I just time the initialization of the Gram matrix:

```ocaml
let () = Gc.full_major () in
let curr_dt, curr_matrix =
  Utls.wall_clock_time (fun () ->
      compute_gram_matrix style ncores csize samples
    ) in
```
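For reference, here is a plausible sketch of such a helper (the real `Utls.wall_clock_time` lives in the benchmark's own `Utls` module, so this is an assumption about its behavior, not its actual definition):

```ocaml
(* Time a thunk with Unix.gettimeofday and return (elapsed_seconds, result);
   assumed to mirror the benchmark's Utls.wall_clock_time. *)
let wall_clock_time f =
  let start = Unix.gettimeofday () in
  let res = f () in
  (Unix.gettimeofday () -. start, res)
```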

Note that in your results, up to 12 cores, Parmap is still better.

UnixJunkie commented on July 24, 2024

@Sudha247 if you manage to scale better, Gram matrix initialization is a useful real-world problem.
Feel free to write the algorithm in whichever way is most efficient for Multicore OCaml.

UnixJunkie commented on July 24, 2024

Also, the bench should be run several times for each multi-core configuration, with the average and standard deviation computed.
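The suggested averaging step could look like this (plain OCaml; the helper names are mine, not from the benchmark):

```ocaml
(* Population mean and standard deviation over repeated run times. *)
let mean xs =
  List.fold_left ( +. ) 0.0 xs /. float_of_int (List.length xs)

let stddev xs =
  let m = mean xs in
  let var =
    List.fold_left (fun acc x -> acc +. ((x -. m) *. (x -. m))) 0.0 xs
    /. float_of_int (List.length xs)
  in
  sqrt var
```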

kayceesrk commented on July 24, 2024

> The parmap version was built with 4.10.0 base compiler.

We shouldn't benchmark against two different versions of the compilers; the parmap version should also be built with 4.06.1. Is the parmap version using batteries? If so, it must be ported to use whatever multicore is using.

UnixJunkie commented on July 24, 2024

Parmap does not depend on batteries.

Sudha247 commented on July 24, 2024

> We shouldn't benchmark against two different versions of the compilers.

Noted, thanks! I have updated my original comment with the numbers for 4.06.1.

kayceesrk commented on July 24, 2024

I've redone the multicore version with load balancing. Here are the results:

Running time (seconds)

Baseline = 92.55 seconds

| Cores | Multicore (ConcMinor) | Parmap | Parany |
|---|---|---|---|
| 1 | 96.06382418 | 105.02 | 92.95 |
| 2 | 48.66779995 | 54.73 | 54.34 |
| 4 | 24.92529607 | 31.57 | 28.08 |
| 8 | 12.95011806 | 18.57 | 21.52 |
| 12 | 9.000571966 | 14.7 | 27.85 |
| 16 | 7.06490612 | 13.97 | 23.85 |
| 20 | 5.884493113 | 11.46 | 24.06 |
| 24 | 5.25877285 | 10.61 | 24.66 |

Speedup

| Cores | Multicore (ConcMinor) | Parmap | Parany |
|---|---|---|---|
| 1 | 0.9634219832 | 0.8812607122 | 0.9956966111 |
| 2 | 1.901668045 | 1.691028686 | 1.703165256 |
| 4 | 3.713095313 | 2.931580615 | 3.295940171 |
| 8 | 7.146652991 | 4.983844911 | 4.300650558 |
| 12 | 10.28267985 | 6.295918367 | 3.323159785 |
| 16 | 13.09996176 | 6.624910523 | 3.880503145 |
| 20 | 15.72777778 | 8.07591623 | 3.846633416 |
| 24 | 17.59916289 | 8.722902922 | 3.753041363 |

Speedup graph is here

UnixJunkie commented on July 24, 2024

Note that if you tune the chunksize to get better performance with Multicore OCaml, you should also tune it for Parmap and Parany, to see which chunksize works best for them.
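Such a sweep could be sketched like this (a hedged sketch: it assumes Parmap's `parmap` function with its optional `~ncores` and `~chunksize` arguments and the `Parmap.L` list wrapper; `f` and `xs` are placeholders for the benchmark's kernel and input):

```ocaml
(* Try several chunk sizes and print the wall-clock time for each. *)
let sweep_chunksizes f xs =
  List.iter
    (fun cs ->
      let t0 = Unix.gettimeofday () in
      ignore (Parmap.parmap ~ncores:16 ~chunksize:cs f (Parmap.L xs));
      Printf.printf "chunksize=%d: %.2fs\n" cs (Unix.gettimeofday () -. t0))
    [ 1; 3; 8; 16; 32 ]
```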

I was also busy. :)

Parany has a new branch called sendmsg. The version on that branch is significantly faster than before (but only works on Linux for the moment). I also updated the code at
https://github.com/UnixJunkie/gram-matrix-bench
so that the Parmap version is more efficient now.

kayceesrk commented on July 24, 2024

I did a bit of experimentation with chunk sizes, and it looks like a chunk size of 3 is the most efficient for parmap and parany. For multicore it is set to 16, though going lower does not impact performance. I ran the benchmark on a 56-core machine.

I ran the benchmarks with the latest updates; parany is performing worse than before.

| Cores | Multicore (ConcMinor, chunk size = 32) | Speedup | ParMap (-c 3) | Speedup | Parany (-c 3) | Speedup |
|---|---|---|---|---|---|---|
| 1 | 68.662 | 0.9471323294 | 70.89 | 0.9173649316 | 65.83 | 0.9878778672 |
| 2 | 36.426 | 1.785318179 | 38.13 | 1.7055337 | 41.23 | 1.577298084 |
| 3 | 24.991 | 2.602216798 | 27.58 | 2.357940537 | 29.3 | 2.219522184 |
| 4 | 18.631 | 3.490526542 | 20.61 | 3.155361475 | 23.45 | 2.773219616 |
| 5 | 15.668 | 4.150625479 | 17.22 | 3.776538908 | 23.26 | 2.795872743 |
| 6 | 13.418 | 4.846623938 | 15.09 | 4.309609013 | 24 | 2.709666667 |
| 7 | 11.574 | 5.61880076 | 13.66 | 4.760761347 | 23.66 | 2.748605241 |
| 8 | 10.356 | 6.27964465 | 12.41 | 5.240290089 | 28.48 | 2.283426966 |
| 9 | 9.467 | 6.869335587 | 11.83 | 5.497210482 | 31.57 | 2.059930314 |
| 10 | 9.694 | 6.708479472 | 10.81 | 6.015911193 | 31.06 | 2.093754024 |
| 11 | 8.013 | 8.115811806 | 10.22 | 6.363209393 | 30.19 | 2.154090759 |
| 12 | 7.395 | 8.794050034 | 9.44 | 6.888983051 | 30.34 | 2.143441002 |
| 13 | 6.996 | 9.295597484 | 9.09 | 7.154235424 | 28.36 | 2.293088858 |
| 14 | 6.534 | 9.952861953 | 8.61 | 7.553077816 | 28.15 | 2.310195382 |
| 15 | 6.108 | 10.6470203 | 9.46 | 6.874418605 | 30.94 | 2.101874596 |
| 16 | 5.886 | 11.04858987 | 9.28 | 7.007758621 | 66.2 | 0.9823564955 |
| 17 | 5.737 | 11.33554122 | 9.24 | 7.038095238 | 31.49 | 2.065163544 |
| 18 | 5.696 | 11.41713483 | 8.96 | 7.258035714 | 31.26 | 2.080358285 |
| 19 | 5.412 | 12.01626016 | 8.17 | 7.959853121 | 76.51 | 0.8499803947 |
| 20 | 5.218 | 12.46301265 | 9.37 | 6.940448239 | 83.57 | 0.7781739859 |
| 21 | 5.051 | 12.87507424 | 7.53 | 8.636387782 | 86.09 | 0.7553955163 |
| 22 | 4.885 | 13.31258956 | 7.94 | 8.190428212 | 90.24 | 0.7206560284 |
| 23 | 4.701 | 13.83365241 | 8 | 8.129 | 90.95 | 0.7150302364 |
| 24 | 4.718 | 13.7838067 | 7.52 | 8.64787234 | 32.77 | 1.984498016 |
| 25 | 4.598 | 14.14354067 | 7.37 | 8.823880597 | 94.9 | 0.6852687039 |
| 26 | 4.476 | 14.52904379 | 7.73 | 8.412936611 | 107.34 | 0.6058505683 |
| 27 | 4.285 | 15.17666278 | 7.95 | 8.180125786 | 96.03 | 0.6772050401 |
| 28 | 4.176 | 15.57279693 | 8.28 | 7.85410628 | 33.08 | 1.965900846 |
| 29 | 4.087 | 15.91191583 | 7.67 | 8.47874837 | 112.21 | 0.5795561893 |
| 30 | 4.017 | 16.18919592 | 7.39 | 8.8 | 109.14 | 0.5958585303 |
| 31 | 3.946 | 16.48048657 | 7.75 | 8.391225806 | 123.71 | 0.5256810282 |
| 32 | 3.851 | 16.88704233 | 6.82 | 9.535483871 | 36.97 | 1.759047877 |
| 33 | 3.667 | 17.73438778 | 7.02 | 9.263817664 | 35.16 | 1.84960182 |
| 34 | 3.516 | 18.4960182 | 6.61 | 9.838426626 | 139.72 | 0.4654451761 |
| 35 | 3.588 | 18.12486065 | 7.68 | 8.467708333 | 147.3 | 0.4414935506 |
| 36 | 3.572 | 18.20604703 | 6.92 | 9.397687861 | 148.56 | 0.4377490576 |
| 37 | 3.528 | 18.43310658 | 8 | 8.129 | 146.66 | 0.4434201555 |
| 38 | 3.28 | 19.82682927 | 6.56 | 9.913414634 | 144.38 | 0.4504224962 |
| 39 | 3.225 | 20.16496124 | 6.83 | 9.521522694 | 161.48 | 0.4027247956 |
| 40 | 3.19 | 20.3862069 | 6.5 | 10.00492308 | 156.26 | 0.4161781646 |
| 41 | 3.138 | 20.72402804 | 7.17 | 9.070013947 | 176.74 | 0.3679529252 |
| 42 | 3.091 | 21.03914591 | 6.77 | 9.605908419 | 164.59 | 0.3951151346 |
| 43 | 3.041 | 21.3850707 | 6.91 | 9.411287988 | 180.24 | 0.3608078118 |
| 44 | 3.104 | 20.95103093 | 6.75 | 9.63437037 | 32.59 | 1.99545873 |
| 45 | 3.086 | 21.07323396 | 6.38 | 10.19310345 | 143.49 | 0.453216252 |
| 46 | 2.943 | 22.09717975 | 6.63 | 9.808748115 | 154.82 | 0.4200490893 |
| 47 | 3.062 | 21.23840627 | 6.01 | 10.82063228 | 157.57 | 0.412718157 |
| 48 | 2.923 | 22.24837496 | 6.34 | 10.25741325 | 162.74 | 0.3996067347 |
| 49 | 3.074 | 21.15549772 | 6.2 | 10.48903226 | 154.36 | 0.4213008551 |
| 50 | 2.797 | 23.25062567 | 6.05 | 10.74909091 | 179.83 | 0.3616304287 |
| 51 | 2.83 | 22.9795053 | 6.59 | 9.868285281 | 169.94 | 0.3826762387 |
| 52 | 2.824 | 23.02832861 | 5.78 | 11.25121107 | 191.27 | 0.3400010456 |
| 53 | 2.689 | 24.18445519 | 6.7 | 9.706268657 | 166.43 | 0.3907468605 |
| 54 | 2.733 | 23.79509696 | 6.23 | 10.43852327 | 154.88 | 0.4198863636 |
| 55 | 2.697 | 24.11271783 | 6.01 | 10.82063228 | 178.34 | 0.3646517887 |
| 56 | 2.745 | 23.69107468 | 5.99 | 10.85676127 | 198.99 | 0.3268103925 |

Graph is here

I'll consider this issue closed for now, since we have the Gram matrix implementation in the Sandmark suite :-)

UnixJunkie commented on July 24, 2024

Yes, and it looks like a clear win for multicore-OCaml.
I am impressed; nice work.

UnixJunkie commented on July 24, 2024

FTR, the results of my bench (just Parmap vs. Parany) can be seen in the image in the README.md file here:
https://github.com/UnixJunkie/gram-matrix-bench
