
Comments (9)

GFleishman commented on June 29, 2024

Hi again!

I'm sorry for this, but the tutorials are simply not always up to date and not applicable in every scenario. Thanks for asking for help here; it's good to clarify this stuff and have it documented somewhere.

Your c dictionary, which I typically call cluster_kwargs, is meant to contain parameters that control how and where the distributed computations will be executed: things like the number of workers that will be created and what resources each of those workers will have. You can set things like the number of cpus a worker has access to, and how many threads dask can submit to that worker at a time.

The c dictionary you show here is meant to specify values for the janelia_lsf_cluster object defined here. However, you're not at Janelia, and you're probably running on a workstation, so bigstream has decided to create a local_cluster object instead. Your c dictionary should therefore only contain arguments that can be passed to the local_cluster constructor (its __init__ function). Local clusters are usually easier: dask has access to all the resources on the machine and can figure out how to create workers itself, so you don't need to specify much.

If you're processing full-size EASI-FISH data, however, you will need to think about the total size of your data and the resources available on your machine. Bigstream tries hard to make big data problems possible on workstations, but that could mean tying up resources on your computer for a long time, like days. It's always best to get a smaller test case working before committing to a huge computation on a small machine; for example, a 1024x1024x512 crop of your image, which you would treat as a 4x4x2 grid of blocks of size 256.
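To make the block arithmetic above concrete, here is the grid calculation as a quick sketch. This is illustrative only: the exact blocking scheme bigstream uses (including any block overlaps) may differ.

```python
import math

# How many blocks a test crop yields, ignoring overlaps.
# Crop shape and block size follow the suggestion above.
crop_shape = (1024, 1024, 512)
block_size = (256, 256, 256)

blocks_per_axis = [math.ceil(c / b) for c, b in zip(crop_shape, block_size)]
total_blocks = math.prod(blocks_per_axis)

print(blocks_per_axis, total_blocks)  # [4, 4, 2] 32
```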

If you do have access to a cluster, but not an LSF cluster like we have at Janelia, then you may want to consider implementing your own cluster object using dask-jobqueue. Here is an example of a SLURM cluster object that others have implemented (see line 322 here).


HomoPolyethylen commented on June 29, 2024

Hello and thanks for the quick and thorough response!
So far, I am merely trying to run the test images you provide, and then our own test set. So I am using my local machine (with limited resources, but I would guess enough for a test run).

After deleting the Janelia-specific kwargs, it runs and eats up all the memory until it crashes. Is there a kwarg to limit the memory? Does this mean that I have less than the minimally required memory?
And (last question, sorry): are the kwargs documented somewhere? (These get passed to dask, right?)


GFleishman commented on June 29, 2024

Happy to help. Running the test data, and then your own small tests, is a good way to start. The test data included in the package should be small enough to run on a typical workstation or even an average laptop.

The cluster_kwargs are just arguments to this class: https://github.com/GFleishman/ClusterWrap/blob/ecdcb7a419ff7261a1e7ebc5c88b3103e4b2abc3/ClusterWrap/clusters.py#L176

That class is a very thin wrapper around the dask.distributed.LocalCluster object, and you can see that kwargs are passed down to it. So your cluster_kwargs dictionary can contain any of the arguments documented here: https://distributed.dask.org/en/latest/api.html#distributed.LocalCluster
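To illustrate what "thin wrapper" means here, the sketch below shows the kwargs pass-through pattern with a hypothetical FakeLocalCluster stand-in; the real class forwards to dask.distributed.LocalCluster instead, and its actual attribute names may differ.

```python
# Hypothetical stand-in for dask.distributed.LocalCluster, used only to
# demonstrate the pass-through pattern without needing dask installed.
class FakeLocalCluster:
    def __init__(self, **kwargs):
        self.kwargs = kwargs

class LocalClusterWrapper:
    """Thin wrapper: anything in cluster_kwargs reaches the inner cluster."""
    def __init__(self, **kwargs):
        self.cluster = FakeLocalCluster(**kwargs)

wrapped = LocalClusterWrapper(n_workers=4, memory_limit=0.2)
print(wrapped.cluster.kwargs)  # {'n_workers': 4, 'memory_limit': 0.2}
```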
It sounds like the memory_limit argument might be an important one for you to set well.

Notice also that you can set any of the global dask configuration values in cluster_kwargs if you need to. That's any value you can find here: https://docs.dask.org/en/latest/configuration.html#configuration-reference
You shouldn't have to mess with those much at first, but it's good to know they are there.

So to have 4 workers processing the data in parallel, and each worker having access to 20% of the total system memory, you could have something like this:

cluster_kwargs = {
    'n_workers':4,
    'memory_limit':0.2,
}

And if you need to set any of the dask global configuration options (admittedly there are many, and it's not always transparent what they do, but eventually it's helpful to know how some of them work), you can add:

cluster_kwargs = {
    'n_workers':4,
    'memory_limit':0.2,
    'config':{
        'temporary-directory':'/path/to/where/you/want/cache/and/temporary/files/written/by/dask',
    }
}

I'm just setting temporary-directory there as an example, but anything in that dask config reference is a valid option in the config dictionary.
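Note that dask's configuration accepts both flat keys (like temporary-directory above) and dotted keys that address nested values. A rough sketch of how a dotted key maps onto nested config; this is only an illustration of the convention, not dask's actual implementation:

```python
def expand_dotted(config):
    """Expand dask-style dotted keys into nested dictionaries."""
    nested = {}
    for key, value in config.items():
        node = nested
        *parents, leaf = key.split('.')
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested

# A dotted key addresses a value several levels deep:
print(expand_dotted({'distributed.worker.memory.target': 0.9}))
```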


GFleishman commented on June 29, 2024

Last: I'm actively working on bigstream every day, so to keep up with major changes to the repository, resolved issues, and anything like that, click the star button for the repo. It also helps me keep track of who is using the software so I can reach out with updates and improvements, and show my own colleagues and funding sources how the software is being adopted.


GFleishman commented on June 29, 2024

If it's alright with you, I'm going to close this issue, since I think sharing documentation on how to use the cluster_kwargs dictionary for various compute environments was what you needed.


HomoPolyethylen commented on June 29, 2024

Thanks a lot! Using your suggested 'n_workers': 4, 'memory_limit': 0.2, the process no longer seems to get killed by the OS. However, there is still a memory issue. The memory allocation problem remains; it's just that now the nanny kills the workers instead of the OS killing the entire process...

stdout

Run ransac {'blob_sizes': [6, 20]}
Fix spots: 8172 Moving spots: 6584
Found enough spots to estimate the affine fix: 1080 , moving: 1080
Run affine {'shrink_factors': (2,), 'smooth_sigmas': (2.5,), 'optimizer_args': {'learningRate': 0.25, 'minStep': 0.0, 'numberOfIterations': 400}}
LEVEL:  0  ITERATION:  0  METRIC:  -0.4581419604727837
LEVEL:  0  ITERATION:  1  METRIC:  -0.46185173508006006
LEVEL:  0  ITERATION:  2  METRIC:  -0.4650156818305996
LEVEL:  0  ITERATION:  3  METRIC:  -0.4724639649044412
LEVEL:  0  ITERATION:  4  METRIC:  -0.4746615064736247
LEVEL:  0  ITERATION:  5  METRIC:  -0.47997929246244164
LEVEL:  0  ITERATION:  6  METRIC:  -0.48162338752752715
LEVEL:  0  ITERATION:  7  METRIC:  -0.48310909958720233
LEVEL:  0  ITERATION:  8  METRIC:  -0.4837360926167621
LEVEL:  0  ITERATION:  9  METRIC:  -0.4845070410365141
LEVEL:  0  ITERATION:  10  METRIC:  -0.4853417438750474
LEVEL:  0  ITERATION:  11  METRIC:  -0.4861468600349893
LEVEL:  0  ITERATION:  12  METRIC:  -0.4869357008039241
LEVEL:  0  ITERATION:  13  METRIC:  -0.48770793124780737
LEVEL:  0  ITERATION:  14  METRIC:  -0.4884545858080386
LEVEL:  0  ITERATION:  15  METRIC:  -0.48918206906538847
LEVEL:  0  ITERATION:  16  METRIC:  -0.4899023849128928
LEVEL:  0  ITERATION:  17  METRIC:  -0.4905931877795264
LEVEL:  0  ITERATION:  18  METRIC:  -0.49127124891444063
LEVEL:  0  ITERATION:  19  METRIC:  -0.4919326641657991
LEVEL:  0  ITERATION:  20  METRIC:  -0.49257102820812987
LEVEL:  0  ITERATION:  21  METRIC:  -0.4931839567895838
LEVEL:  0  ITERATION:  22  METRIC:  -0.49378765987179096
LEVEL:  0  ITERATION:  23  METRIC:  -0.49437435058566226
LEVEL:  0  ITERATION:  24  METRIC:  -0.49494681942904467
LEVEL:  0  ITERATION:  25  METRIC:  -0.4955013797582309
LEVEL:  0  ITERATION:  26  METRIC:  -0.496047288487377
LEVEL:  0  ITERATION:  27  METRIC:  -0.496568052816728
LEVEL:  0  ITERATION:  28  METRIC:  -0.49706922275872795
LEVEL:  0  ITERATION:  29  METRIC:  -0.4975690755774044
LEVEL:  0  ITERATION:  30  METRIC:  -0.49806048889227034
LEVEL:  0  ITERATION:  31  METRIC:  -0.4985324356268822
LEVEL:  0  ITERATION:  32  METRIC:  -0.4989858475626363
LEVEL:  0  ITERATION:  33  METRIC:  -0.4994252715046777
LEVEL:  0  ITERATION:  34  METRIC:  -0.49986243995316687
LEVEL:  0  ITERATION:  35  METRIC:  -0.5002695242474917
LEVEL:  0  ITERATION:  36  METRIC:  -0.5006717598982761
LEVEL:  0  ITERATION:  37  METRIC:  -0.5010587437426598
LEVEL:  0  ITERATION:  38  METRIC:  -0.5014281548923906
LEVEL:  0  ITERATION:  39  METRIC:  -0.5017930052510243
LEVEL:  0  ITERATION:  40  METRIC:  -0.5021445185715844
LEVEL:  0  ITERATION:  41  METRIC:  -0.5024947157532571
LEVEL:  0  ITERATION:  42  METRIC:  -0.5028207059244102
LEVEL:  0  ITERATION:  43  METRIC:  -0.5031372682895622
LEVEL:  0  ITERATION:  44  METRIC:  -0.5034437579646844
LEVEL:  0  ITERATION:  45  METRIC:  -0.503743909461261
LEVEL:  0  ITERATION:  46  METRIC:  -0.5040461875086483
LEVEL:  0  ITERATION:  47  METRIC:  -0.5043177381748524
LEVEL:  0  ITERATION:  48  METRIC:  -0.5045853248171887
LEVEL:  0  ITERATION:  49  METRIC:  -0.5048496952355569
LEVEL:  0  ITERATION:  50  METRIC:  -0.5051156617514692
LEVEL:  0  ITERATION:  51  METRIC:  -0.5053605919755874
LEVEL:  0  ITERATION:  52  METRIC:  -0.5056060504656191
LEVEL:  0  ITERATION:  53  METRIC:  -0.5058312688263044
LEVEL:  0  ITERATION:  54  METRIC:  -0.5060474193135633
LEVEL:  0  ITERATION:  55  METRIC:  -0.5062694684784214
LEVEL:  0  ITERATION:  56  METRIC:  -0.5064751121519155
LEVEL:  0  ITERATION:  57  METRIC:  -0.5066792230216223
LEVEL:  0  ITERATION:  58  METRIC:  -0.5068751869693193
LEVEL:  0  ITERATION:  59  METRIC:  -0.5070695888245187
LEVEL:  0  ITERATION:  60  METRIC:  -0.5072388482968267
LEVEL:  0  ITERATION:  61  METRIC:  -0.5074144847017321
LEVEL:  0  ITERATION:  62  METRIC:  -0.5075198276353228
LEVEL:  0  ITERATION:  63  METRIC:  -0.50768240549795
LEVEL:  0  ITERATION:  64  METRIC:  -0.5077227248678277
LEVEL:  0  ITERATION:  65  METRIC:  -0.5078342072736535
LEVEL:  0  ITERATION:  66  METRIC:  -0.5079034927879992
LEVEL:  0  ITERATION:  67  METRIC:  -0.5079739747202662
LEVEL:  0  ITERATION:  68  METRIC:  -0.5080430542376165
LEVEL:  0  ITERATION:  69  METRIC:  -0.5081134203488384
LEVEL:  0  ITERATION:  70  METRIC:  -0.5081774806039373
LEVEL:  0  ITERATION:  71  METRIC:  -0.5082466558220115
LEVEL:  0  ITERATION:  72  METRIC:  -0.5083135122591875
LEVEL:  0  ITERATION:  73  METRIC:  -0.5083821598539106
LEVEL:  0  ITERATION:  74  METRIC:  -0.5084394701495869
LEVEL:  0  ITERATION:  75  METRIC:  -0.5084981685382055
LEVEL:  0  ITERATION:  76  METRIC:  -0.5085490525356509
LEVEL:  0  ITERATION:  77  METRIC:  -0.5086023035425143
LEVEL:  0  ITERATION:  78  METRIC:  -0.5086574992694624
LEVEL:  0  ITERATION:  79  METRIC:  -0.5087085839545603
LEVEL:  0  ITERATION:  80  METRIC:  -0.5087576982524257
LEVEL:  0  ITERATION:  81  METRIC:  -0.5088074869079925
LEVEL:  0  ITERATION:  82  METRIC:  -0.5088526680721442
LEVEL:  0  ITERATION:  83  METRIC:  -0.5089088951493157
LEVEL:  0  ITERATION:  84  METRIC:  -0.5089517922795461
LEVEL:  0  ITERATION:  85  METRIC:  -0.5090037517402162
LEVEL:  0  ITERATION:  86  METRIC:  -0.5090403812749982
LEVEL:  0  ITERATION:  87  METRIC:  -0.509084437510761
LEVEL:  0  ITERATION:  88  METRIC:  -0.5091101356593934
LEVEL:  0  ITERATION:  89  METRIC:  -0.5091576075825929
LEVEL:  0  ITERATION:  90  METRIC:  -0.5091674538238428
LEVEL:  0  ITERATION:  91  METRIC:  -0.5092142034516426
LEVEL:  0  ITERATION:  92  METRIC:  -0.5092161895846403
LEVEL:  0  ITERATION:  93  METRIC:  -0.5092692290968145
LEVEL:  0  ITERATION:  94  METRIC:  -0.5092657332979338
LEVEL:  0  ITERATION:  95  METRIC:  -0.5093147763734823
LEVEL:  0  ITERATION:  96  METRIC:  -0.5093185071494335
LEVEL:  0  ITERATION:  97  METRIC:  -0.5093294135759087
LEVEL:  0  ITERATION:  98  METRIC:  -0.5093451798797414
LEVEL:  0  ITERATION:  99  METRIC:  -0.5093668808403666
LEVEL:  0  ITERATION:  100  METRIC:  -0.509383252021113
LEVEL:  0  ITERATION:  101  METRIC:  -0.5093982431109587
LEVEL:  0  ITERATION:  102  METRIC:  -0.5094087199070355
LEVEL:  0  ITERATION:  103  METRIC:  -0.5094233868828695
LEVEL:  0  ITERATION:  104  METRIC:  -0.5094362632815291
LEVEL:  0  ITERATION:  105  METRIC:  -0.5094517777421605
LEVEL:  0  ITERATION:  106  METRIC:  -0.5094609186328296
LEVEL:  0  ITERATION:  107  METRIC:  -0.5094730087572727
LEVEL:  0  ITERATION:  108  METRIC:  -0.5094845376039567
LEVEL:  0  ITERATION:  109  METRIC:  -0.5094973190389112
LEVEL:  0  ITERATION:  110  METRIC:  -0.5095113266061497
LEVEL:  0  ITERATION:  111  METRIC:  -0.509524012397447
LEVEL:  0  ITERATION:  112  METRIC:  -0.5095386844496351
LEVEL:  0  ITERATION:  113  METRIC:  -0.5095482922903689
LEVEL:  0  ITERATION:  114  METRIC:  -0.5095602501030922
LEVEL:  0  ITERATION:  115  METRIC:  -0.5095674961472751
LEVEL:  0  ITERATION:  116  METRIC:  -0.5095784583376415
LEVEL:  0  ITERATION:  117  METRIC:  -0.5095833147552528
LEVEL:  0  ITERATION:  118  METRIC:  -0.5095925119564692
LEVEL:  0  ITERATION:  119  METRIC:  -0.5095976952619861
LEVEL:  0  ITERATION:  120  METRIC:  -0.509606159238133
LEVEL:  0  ITERATION:  121  METRIC:  -0.5096178264521586
LEVEL:  0  ITERATION:  122  METRIC:  -0.5096245047607819
LEVEL:  0  ITERATION:  123  METRIC:  -0.5096347203004734
LEVEL:  0  ITERATION:  124  METRIC:  -0.5096431954787751
LEVEL:  0  ITERATION:  125  METRIC:  -0.5096479335080613
LEVEL:  0  ITERATION:  126  METRIC:  -0.5096586029186595
LEVEL:  0  ITERATION:  127  METRIC:  -0.5096636256486163
LEVEL:  0  ITERATION:  128  METRIC:  -0.5096711551089081
LEVEL:  0  ITERATION:  129  METRIC:  -0.5096715998881329
LEVEL:  0  ITERATION:  130  METRIC:  -0.5096807693991748
LEVEL:  0  ITERATION:  131  METRIC:  -0.509680323123783
LEVEL:  0  ITERATION:  132  METRIC:  -0.5096939514539075
LEVEL:  0  ITERATION:  133  METRIC:  -0.5096832004048439
LEVEL:  0  ITERATION:  134  METRIC:  -0.5096930428913706
LEVEL:  0  ITERATION:  135  METRIC:  -0.5096963343795077
LEVEL:  0  ITERATION:  136  METRIC:  -0.5096985480554834
LEVEL:  0  ITERATION:  137  METRIC:  -0.5097053285208764
LEVEL:  0  ITERATION:  138  METRIC:  -0.5097062599052983
LEVEL:  0  ITERATION:  139  METRIC:  -0.5097065341544831
LEVEL:  0  ITERATION:  140  METRIC:  -0.5097091262876804
LEVEL:  0  ITERATION:  141  METRIC:  -0.5097118115581087
LEVEL:  0  ITERATION:  142  METRIC:  -0.5097138723508882
LEVEL:  0  ITERATION:  143  METRIC:  -0.5097140698968139
LEVEL:  0  ITERATION:  144  METRIC:  -0.5097168842510917
LEVEL:  0  ITERATION:  145  METRIC:  -0.5097187543022861
LEVEL:  0  ITERATION:  146  METRIC:  -0.5097178910183721
LEVEL:  0  ITERATION:  147  METRIC:  -0.509717787232802
LEVEL:  0  ITERATION:  148  METRIC:  -0.509721114706433
LEVEL:  0  ITERATION:  149  METRIC:  -0.5097219704117902
LEVEL:  0  ITERATION:  150  METRIC:  -0.5097228688115688
LEVEL:  0  ITERATION:  151  METRIC:  -0.5097229399058182
LEVEL:  0  ITERATION:  152  METRIC:  -0.5097223881093069
LEVEL:  0  ITERATION:  153  METRIC:  -0.5097235058119421
LEVEL:  0  ITERATION:  154  METRIC:  -0.5097244887067044
LEVEL:  0  ITERATION:  155  METRIC:  -0.5097259018461415
LEVEL:  0  ITERATION:  156  METRIC:  -0.5097241831451017
LEVEL:  0  ITERATION:  157  METRIC:  -0.5097223707268944
LEVEL:  0  ITERATION:  158  METRIC:  -0.5097222373359249
LEVEL:  0  ITERATION:  159  METRIC:  -0.5097220499745916
LEVEL:  0  ITERATION:  160  METRIC:  -0.509722159214666
LEVEL:  0  ITERATION:  161  METRIC:  -0.5097215664996297
LEVEL:  0  ITERATION:  162  METRIC:  -0.5097227202824104
LEVEL:  0  ITERATION:  163  METRIC:  -0.5097227667023613
LEVEL:  0  ITERATION:  164  METRIC:  -0.5097228943788019
LEVEL:  0  ITERATION:  165  METRIC:  -0.5097234897438061
LEVEL:  0  ITERATION:  166  METRIC:  -0.5097231687651836
LEVEL:  0  ITERATION:  167  METRIC:  -0.5097243369825106
LEVEL:  0  ITERATION:  168  METRIC:  -0.5097235911748609
LEVEL:  0  ITERATION:  169  METRIC:  -0.5097204809735754
LEVEL:  0  ITERATION:  170  METRIC:  -0.509722428591006
LEVEL:  0  ITERATION:  171  METRIC:  -0.5097209717667393
LEVEL:  0  ITERATION:  172  METRIC:  -0.5097226721919533
LEVEL:  0  ITERATION:  173  METRIC:  -0.5097165765864129
LEVEL:  0  ITERATION:  174  METRIC:  -0.509721221019394
LEVEL:  0  ITERATION:  175  METRIC:  -0.5097211618658549
LEVEL:  0  ITERATION:  176  METRIC:  -0.5097204024095183
LEVEL:  0  ITERATION:  177  METRIC:  -0.5097211315653398
LEVEL:  0  ITERATION:  178  METRIC:  -0.5097208009936721
LEVEL:  0  ITERATION:  179  METRIC:  -0.5097187495403702
LEVEL:  0  ITERATION:  180  METRIC:  -0.5097178978806756
LEVEL:  0  ITERATION:  181  METRIC:  -0.5097182427254202
LEVEL:  0  ITERATION:  182  METRIC:  -0.5097184599600053
LEVEL:  0  ITERATION:  183  METRIC:  -0.5097191467443036
LEVEL:  0  ITERATION:  184  METRIC:  -0.5097183101747074
LEVEL:  0  ITERATION:  185  METRIC:  -0.5097175381184262
LEVEL:  0  ITERATION:  186  METRIC:  -0.5097182576723988
LEVEL:  0  ITERATION:  187  METRIC:  -0.5097178421670088
LEVEL:  0  ITERATION:  188  METRIC:  -0.5097172999477873
LEVEL:  0  ITERATION:  189  METRIC:  -0.5097183468666006
LEVEL:  0  ITERATION:  190  METRIC:  -0.50971923724084
LEVEL:  0  ITERATION:  191  METRIC:  -0.5097178872277622
LEVEL:  0  ITERATION:  192  METRIC:  -0.5097164868539895
LEVEL:  0  ITERATION:  193  METRIC:  -0.5097159221830665
LEVEL:  0  ITERATION:  194  METRIC:  -0.5097154886871554
LEVEL:  0  ITERATION:  195  METRIC:  -0.509713193716958
LEVEL:  0  ITERATION:  196  METRIC:  -0.5097122259648657
LEVEL:  0  ITERATION:  197  METRIC:  -0.5097114356051958
LEVEL:  0  ITERATION:  198  METRIC:  -0.509713728837721
LEVEL:  0  ITERATION:  199  METRIC:  -0.5097123956242398
LEVEL:  0  ITERATION:  200  METRIC:  -0.5097128240723408
LEVEL:  0  ITERATION:  201  METRIC:  -0.5097108193696624
LEVEL:  0  ITERATION:  202  METRIC:  -0.509710368410682
LEVEL:  0  ITERATION:  203  METRIC:  -0.5097091131155792
LEVEL:  0  ITERATION:  204  METRIC:  -0.5097076479044138
LEVEL:  0  ITERATION:  205  METRIC:  -0.5097093343482894
LEVEL:  0  ITERATION:  206  METRIC:  -0.5097087002993401
LEVEL:  0  ITERATION:  207  METRIC:  -0.5097083009614303
LEVEL:  0  ITERATION:  208  METRIC:  -0.5097071918920788
LEVEL:  0  ITERATION:  209  METRIC:  -0.5097067854194948
LEVEL:  0  ITERATION:  210  METRIC:  -0.5097057318179653
LEVEL:  0  ITERATION:  211  METRIC:  -0.5097081717519585
LEVEL:  0  ITERATION:  212  METRIC:  -0.509707855748597
LEVEL:  0  ITERATION:  213  METRIC:  -0.5097067545324598
LEVEL:  0  ITERATION:  214  METRIC:  -0.5097062380112495
LEVEL:  0  ITERATION:  215  METRIC:  -0.5097056528931018
LEVEL:  0  ITERATION:  216  METRIC:  -0.5097065519091667
LEVEL:  0  ITERATION:  217  METRIC:  -0.5097060156286227
LEVEL:  0  ITERATION:  218  METRIC:  -0.5097055661613286
LEVEL:  0  ITERATION:  219  METRIC:  -0.5097051993482833
Registration succeeded
Block index:  (0, 0, 0) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(0, 192, None))
Block index:  (0, 0, 4) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(448, 704, None))
Block index:  (0, 0, 1) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(64, 320, None))
Block index:  (0, 0, 5) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(576, 832, None))
Block index:  (0, 0, 3) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(320, 576, None))
Block index:  (0, 0, 2) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(192, 448, None))
Block index:  (0, 0, 6) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(704, 913, None))
Block index:  (0, 0, 7) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(832, 913, None))
Run ransac {'blob_sizes': [6, 20]}
Run ransac {'blob_sizes': [6, 20]}
Run ransac {'blob_sizes': [6, 20]}
Run ransac {'blob_sizes': [6, 20]}
Run ransac {'blob_sizes': [6, 20]}
Run ransac {'blob_sizes': [6, 20]}
Run ransac {'blob_sizes': [6, 20]}
Run ransac {'blob_sizes': [6, 20]}

stderr

2023-07-25 11:20:28,910 - distributed.worker.memory - WARNING - Worker is at 90% memory usage. Pausing worker.  Process memory: 2.68 GiB -- Worker memory limit: 2.97 GiB
2023-07-25 11:20:28,912 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 2.68 GiB -- Worker memory limit: 2.97 GiB
2023-07-25 11:20:29,854 - distributed.worker.memory - WARNING - Worker is at 90% memory usage. Pausing worker.  Process memory: 2.69 GiB -- Worker memory limit: 2.97 GiB
2023-07-25 11:20:29,855 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 2.69 GiB -- Worker memory limit: 2.97 GiB
2023-07-25 11:20:32,700 - distributed.nanny.memory - WARNING - Worker tcp://172.17.145.68:33033 (pid=130745) exceeded 95% memory budget. Restarting...
2023-07-25 11:20:32,827 - distributed.nanny - WARNING - Restarting worker
2023-07-25 11:20:33,250 - distributed.nanny.memory - WARNING - Worker tcp://172.17.145.68:32769 (pid=130743) exceeded 95% memory budget. Restarting...
2023-07-25 11:20:33,369 - distributed.nanny - WARNING - Restarting worker
Block index:  (0, 1, 0) 
Slices:  (slice(0, 192, None), slice(64, 320, None), slice(0, 192, None))
Block index:  (0, 0, 7) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(832, 913, None))
Block index:  (0, 0, 2) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(192, 448, None))
Block index:  (0, 0, 4) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(448, 704, None))

Run ransac {'blob_sizes': [6, 20]}
Run ransac {'blob_sizes': [6, 20]}
Run ransac {'blob_sizes': [6, 20]}
Run ransac {'blob_sizes': [6, 20]}
2023-07-25 11:20:47,535 - distributed.worker.memory - WARNING - Worker is at 90% memory usage. Pausing worker.  Process memory: 2.68 GiB -- Worker memory limit: 2.97 GiB
2023-07-25 11:20:47,536 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 2.68 GiB -- Worker memory limit: 2.97 GiB
2023-07-25 11:20:49,502 - distributed.nanny.memory - WARNING - Worker tcp://172.17.145.68:37465 (pid=130985) exceeded 95% memory budget. Restarting...
2023-07-25 11:20:49,577 - distributed.nanny - WARNING - Restarting worker
Block index:  (0, 0, 4) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(448, 704, None))
Block index:  (0, 0, 2) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(192, 448, None))
Run ransac {'blob_sizes': [6, 20]}
Run ransac {'blob_sizes': [6, 20]}
2023-07-25 11:21:03,030 - distributed.worker.memory - WARNING - Worker is at 90% memory usage. Pausing worker.  Process memory: 2.68 GiB -- Worker memory limit: 2.97 GiB
2023-07-25 11:21:03,031 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 2.68 GiB -- Worker memory limit: 2.97 GiB
2023-07-25 11:21:05,000 - distributed.nanny.memory - WARNING - Worker tcp://172.17.145.68:44825 (pid=131141) exceeded 95% memory budget. Restarting...
2023-07-25 11:21:05,083 - distributed.nanny - WARNING - Restarting worker
Block index:  (0, 0, 4) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(448, 704, None))
Block index:  (0, 0, 2) 
Slices:  (slice(0, 192, None), slice(0, 192, None), slice(192, 448, None))
Run ransac {'blob_sizes': [6, 20]}
Run ransac {'blob_sizes': [6, 20]}
Fix spots: 396 Moving spots: 176
Fewer than 50 spots found in fixed image, returning default 46
Run deform {'smooth_sigmas': (0.25,), 'control_point_spacing': 50.0, 'control_point_levels': (1,), 'optimizer_args': {'learningRate': 0.25, 'minStep': 0.0, 'numberOfIterations': 25}}
2023-07-25 11:21:22,724 - distributed.worker.memory - WARNING - Worker is at 90% memory usage. Pausing worker.  Process memory: 2.68 GiB -- Worker memory limit: 2.97 GiB
2023-07-25 11:21:22,727 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 2.68 GiB -- Worker memory limit: 2.97 GiB
LEVEL:  0  ITERATION:  0  METRIC:  -0.3962209122288713
2023-07-25 11:21:29,200 - distributed.nanny.memory - WARNING - Worker tcp://172.17.145.68:41689 (pid=131245) exceeded 95% memory budget. Restarting...
2023-07-25 11:21:29,470 - distributed.nanny - WARNING - Restarting worker
2023-07-25 11:21:32,792 - distributed.nanny - WARNING - Worker process still alive after 3.1999981689453127 seconds, killing
2023-07-25 11:21:32,793 - distributed.nanny - WARNING - Worker process still alive after 3.199998474121094 seconds, killing

error message

---------------------------------------------------------------------------
KilledWorker                              Traceback (most recent call last)
Cell In[3], line 10
      4 cluster_kwargs = {
      5     'n_workers':4,
      6     'memory_limit':0.2,
      7 }
      9 # run the pipeline
---> 10 affine, deform, aligned = easifish_registration_pipeline(
     11     fix_lowres, fix_highres, mov_lowres, mov_highres,
     12     fix_lowres_spacing, fix_highres_spacing,
     13     mov_lowres_spacing, mov_highres_spacing,
     14     blocksize=[128,]*3,
     15     write_directory='./',
     16     cluster_kwargs=cluster_kwargs
     17 )
     19 # the affine and deform are already saved to disk, but we also want to view the aligned
     20 # result to make sure it worked.
     21 # reformat the aligned data to open in fiji (or similar) - again this works for tutorial data
     22 # but you would do this differently for actually larger-than-memory data
     23 tifffile.imsave('./aligned.tiff', aligned[...])

File ~/Documents/Arbeit/2023_HiWi_QBiC/bigstream/bigstream/application_pipelines.py:232, in easifish_registration_pipeline(fix_lowres, fix_highres, mov_lowres, mov_highres, fix_lowres_spacing, fix_highres_spacing, mov_lowres_spacing, mov_highres_spacing, blocksize, write_directory, global_ransac_kwargs, global_affine_kwargs, local_ransac_kwargs, local_deform_kwargs, cluster_kwargs, cluster)
    229 if cluster is None:
    230     #with cluster_constructor(**cluster_kwargs) as cluster:      #NOTE: removed **{**c, **cluster_kwargs}
...
-> 2231         raise exception.with_traceback(traceback)
   2232     raise exc
   2233 if errors == "skip":

KilledWorker: Attempted to run task align_single_block-c433c39107c7a38ec826bbb66882de03 on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://172.17.145.68:35047. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.

Any idea in which direction I should proceed?


GFleishman commented on June 29, 2024

Yes, you are correct: the nanny and the dask scheduler together are causing this job to fail.

First, I notice that your machine has only 16GB of RAM. This should actually be fine for the test datasets once we get the configuration right, but it's honestly quite small for processing any real data. We'll need to really constrain the parallelism to fit the work onto this machine, and once you scale to real big datasets you'll find they take a very long time. More RAM would let you parallelize jobs better, so access to a bigger workstation or a cluster might be important depending on what kind of data you eventually intend this for.

So each worker is getting 0.2 * 16GB ≈ 3GB of RAM. Note that dask reports everything in gibibytes (GiB), not gigabytes, so its numbers look a little different. The dask scheduler is also trying to submit more than one block to a worker at a time: it thinks a single worker can solve multiple of the block alignments in parallel. Because of the small amount of RAM and the multiple tasks, the workers are exceeding their memory limits. The nanny notices this, pauses, and then shuts down the workers. Once this happens three times, the dask scheduler decides something is wrong and shuts the whole cluster down.
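The GB-vs-GiB arithmetic works out like this, assuming a nominal 16 GB (decimal) of total RAM; the small gap from the 2.97 GiB in your logs is just your machine's exact RAM total differing slightly from 16 * 10^9 bytes.

```python
# 0.2 of a nominal 16 GB machine, expressed in GiB (how dask reports memory).
total_bytes = 16 * 10**9          # 16 GB, decimal units
limit_bytes = 0.2 * total_bytes   # memory_limit = 0.2 -> fraction of total
limit_gib = limit_bytes / 2**30   # convert to GiB, binary units

print(round(limit_gib, 2))  # 2.98
```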

So we need to make sure that each worker only tries to process one task at a time. My first suggestion is to add 'threads_per_worker': 1 to your cluster_kwargs. Theoretically this should make dask submit only a single thread to each worker, and with only one thread it can only execute one task at a time. The tasks themselves (the alignment algorithms that will run) are multithreaded, so they will still use all the resources the worker can provide; you're not losing any parallelism here, you're just preventing dask from overloading the workers.
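Applied to the earlier example, the full cluster_kwargs for this first suggestion would look something like:

```python
# One dask thread per worker, so each worker runs a single
# (internally multithreaded) alignment task at a time.
cluster_kwargs = {
    'n_workers': 4,
    'memory_limit': 0.2,
    'threads_per_worker': 1,
}

print(sorted(cluster_kwargs))  # ['memory_limit', 'n_workers', 'threads_per_worker']
```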

However, I have noticed in the past, when doing similar things in a different computing environment, that telling dask to submit only one thread per worker does not always prevent it from trying to execute multiple tasks on that worker. So if the first suggestion above does not work, try this second one. Add this to cluster_kwargs:

'threads_per_worker':1,
'config':{
        'distributed.worker.memory.target':0.9,
        'distributed.worker.memory.spill':0.9,
        'distributed.worker.memory.pause':0.9,
        'distributed.scheduler.worker-saturation':0.5,
}

Finally, if that does not work, I have one last suggestion, which will also require a change to the source code. Add this to cluster_kwargs:

'threads_per_worker':1,
'config':{
        'distributed.worker.memory.target':0.9,
        'distributed.worker.memory.spill':0.9,
        'distributed.worker.memory.pause':0.9,
        'distributed.scheduler.worker-saturation':0.5,
        'distributed.worker.resources.concurrency':1,
}

Now to edit some source code. In the bigstream repository, find the file bigstream/piecewise_align.py and look at line 358. You should see:

futures = cluster.client.map(
    align_single_block, indices,
    static_transform_list=static_transform_list,
)

Change this to the following:

futures = cluster.client.map(
    align_single_block, indices,
    static_transform_list=static_transform_list,
    resources={'concurrency':1},
)

Save that change. If you installed bigstream with pip install -e ./, then this change should be available already. Be sure to restart the kernel in your Jupyter notebook after making this source code edit for it to take effect.

I haven't tried these tutorial datasets on a small machine like you are using, so I hope you'll forgive the extra work required to get them working in that context. It's a good learning experience for me and will help make the package more accommodating for future users as well. Let me know how these changes affect things. Ideally the first suggestion will just fix it, but if not, try the others and we'll figure out a way to make it work.


GFleishman commented on June 29, 2024

@HomoPolyethylen Just checking if you've had a chance to try out the suggestions above.


GFleishman commented on June 29, 2024

Since all of the solutions to "small machine memory issues" are presented in this thread, I'm going to close this for now, but I'm willing to reopen it if OP needs more help in the future.

