
Comments (21)

constantinpape commented on September 21, 2024

> I've been running multicut well on a cluster node with 1.5 TB of RAM, but it hits a segmentation fault, presumably from running out of RAM, on arrays larger than ~5k x 5k x 15.

Yes, this script does not scale well to large volumes.
Instead you will need to use the functionality from this repository.
You can find an example with some explanations here:
https://github.com/constantinpape/cluster_tools/blob/master/example/cremi/run_mc.py

Note that there are some important prerequisites to use this.

Also, does your cluster run any scheduling system?
For now, I support slurm and lsf, but it is straightforward to extend this to other schedulers, by implementing a class like https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/cluster_tasks.py#L374.
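
For illustration only, here is a very rough sketch of what such a scheduler class could look like for a qsub-based system. The base class name and the attribute names below are assumptions (the implementation linked above is the authoritative reference); the one thing visible later in this thread is that submit_jobs(n_jobs) is the hook that shells out to the scheduler's submit command.

import os
from subprocess import check_output

# NOTE: the base class name is an assumption; the real one lives in
# cluster_tools/cluster_tasks.py (see the link above).
from cluster_tools.cluster_tasks import BaseClusterTask


class QsubTask(BaseClusterTask):
    """Sketch of a qsub backend, analogous to the slurm/lsf classes."""

    def submit_jobs(self, n_jobs):
        # assumptions: the task has written one submit script per task type
        # into its tmp_folder, and exposes tmp_folder / task_name attributes
        script = os.path.join(self.tmp_folder, 'qsub_%s.sh' % self.task_name)
        for job_id in range(n_jobs):
            job_name = '%s_%i' % (self.task_name, job_id)
            log = os.path.join(self.tmp_folder, 'logs', '%s.log' % job_name)
            err = os.path.join(self.tmp_folder, 'error_logs', '%s.err' % job_name)
            command = ['qsub', '-o', log, '-e', err, '-N', job_name,
                       script, str(job_id)]
            check_output(command)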

MatthewBM commented on September 21, 2024

Yes we use slurm.

I do have the cluster_env conda environment built, but it wasn't finding the cluster_tools module, so I added this:
export PYTHONPATH="/home/mmadany/miniconda3/envs/cluster_env/bin:/home/mmadany/Multicut/cluster_tools-master:/home/mmadany/Multicut/cluster_tools-master/cluster_tools"

I have configured z5 and converted to n5 files. When I try to run that example script, I get this error:

import os
import json
import luigi
from cluster_tools import MulticutSegmentationWorkflow

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/__init__.py", line 1, in <module>
    from .workflows import MulticutSegmentationWorkflow
  File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/workflows.py", line 5, in <module>
    from .watershed import WatershedWorkflow
  File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/watershed/__init__.py", line 1, in <module>
    from .watershed_workflow import WatershedWorkflow
  File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/watershed/watershed_workflow.py", line 4, in <module>
    from . import watershed as watershed_tasks
  File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/watershed/watershed.py", line 11, in <module>
    from nifty.filters import nonMaximumDistanceSuppression
ImportError: cannot import name 'nonMaximumDistanceSuppression' from 'nifty.filters' (/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/site-packages/nifty/filters/__init__.py)

constantinpape commented on September 21, 2024

Yes, sorry, I just implemented nonMaximumDistanceSuppression and it's not in the conda package yet.
Please check out the latest commit 03ec3b8 and try again.
I added a check to skip nonMaximumDistanceSuppression if it's not available.
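
For reference, such a check is essentially a guarded import; a minimal sketch of the idea (not necessarily the exact code in the repository):

# Minimal sketch of a guarded import: use the filter if the installed
# nifty provides it, otherwise skip the suppression step.
try:
    from nifty.filters import nonMaximumDistanceSuppression
    HAVE_NMS = True
except ImportError:
    nonMaximumDistanceSuppression = None
    HAVE_NMS = False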

MatthewBM commented on September 21, 2024

Ok, that runs, and I see it's doing the job configuration within the program. This is what I'm getting:

(cluster_env) [mmadany@comet-ln2 cluster_tools-master]$ python ~/Multicut/runluigi.py
DEBUG: Checking if MulticutSegmentationWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=DummyTask, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/fxmemf.n5, input_key=dataset1, ws_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, ws_key=dataset1, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, node_labels_key=node_labels, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, output_key=segmentation/multicut, mask_path=, mask_key=, rf_path=, node_label_dict={}, max_jobs_merge=1, skip_ws=True, agglomerate_ws=False, two_pass_ws=False, sanity_checks=False, max_jobs_multicut=1, n_scales=1) is complete
DEBUG: Checking if WriteSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, output_key=segmentation/multicut, assignment_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, assignment_key=node_labels, dependency=MulticutWorkflow, identifier=multicut, offset_path=) is complete
INFO: Informed scheduler that task MulticutSegmentationWorkflow_False___config_mc_DummyTask_6d798a14ef has status PENDING
DEBUG: Checking if MulticutWorkflow(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, target=slurm, dependency=ProblemWorkflow, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, n_scales=1, assignment_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, assignment_key=node_labels) is complete
INFO: Informed scheduler that task WriteSlurm_node_labels__oasis_scratch_c___config_mc_4d42f4969f has status PENDING
DEBUG: Checking if SolveGlobalSlurm(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, assignment_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, assignment_key=node_labels, scale=1, dependency=ReduceProblemSlurm) is complete
INFO: Informed scheduler that task MulticutWorkflow_node_labels__oasis_scratch_c___config_mc_e52655bb6f has status PENDING
DEBUG: Checking if ReduceProblemSlurm(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, scale=0, dependency=SolveSubproblemsSlurm) is complete
INFO: Informed scheduler that task SolveGlobalSlurm_node_labels__oasis_scratch_c___config_mc_8b8648e259 has status PENDING
DEBUG: Checking if SolveSubproblemsSlurm(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, scale=0, dependency=ProblemWorkflow) is complete
INFO: Informed scheduler that task ReduceProblemSlurm___config_mc_SolveSubproblems_1_182aa76377 has status PENDING
DEBUG: Checking if ProblemWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=DummyTask, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/fxmemf.n5, input_key=dataset1, ws_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, ws_key=dataset1, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, rf_path=, node_label_dict={}, max_jobs_merge=1, compute_costs=True, sanity_checks=False) is complete
INFO: Informed scheduler that task SolveSubproblemsSlurm___config_mc_ProblemWorkflow_1_a1448fd645 has status PENDING
DEBUG: Checking if EdgeCostsWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=EdgeFeaturesWorkflow, features_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, features_key=features, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=s0/costs, node_label_dict={}, rf_path=) is complete
INFO: Informed scheduler that task ProblemWorkflow_True___config_mc_DummyTask_3f92ce107e has status PENDING
DEBUG: Checking if ProbsToCostsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, input_key=features, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=s0/costs, features_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, features_key=features, dependency=EdgeFeaturesWorkflow, node_label_dict={}) is complete
INFO: Informed scheduler that task EdgeCostsWorkflow___config_mc_EdgeFeaturesWork_features_2d838ae4dc has status PENDING
DEBUG: Checking if EdgeFeaturesWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=GraphWorkflow, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/fxmemf.n5, input_key=dataset1, labels_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, labels_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, graph_key=s0/graph, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=features, max_jobs_merge=1) is complete
INFO: Informed scheduler that task ProbsToCostsSlurm___config_mc_EdgeFeaturesWork_features_682c0950ab has status PENDING
DEBUG: Checking if MergeEdgeFeaturesSlurm(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, graph_key=s0/graph, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=features, dependency=BlockEdgeFeaturesSlurm) is complete
INFO: Informed scheduler that task EdgeFeaturesWorkflow___config_mc_GraphWorkflow_s0_graph_f1bc78dfbd has status PENDING
DEBUG: Checking if BlockEdgeFeaturesSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/fxmemf.n5, input_key=dataset1, labels_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, labels_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, dependency=GraphWorkflow) is complete
INFO: Informed scheduler that task MergeEdgeFeaturesSlurm___config_mc_BlockEdgeFeature_s0_graph_34ddff7acc has status PENDING
DEBUG: Checking if GraphWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=DummyTask, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=s0/graph, n_scales=1) is complete
INFO: Informed scheduler that task BlockEdgeFeaturesSlurm___config_mc_GraphWorkflow__oasis_scratch_c_8bd529565b has status PENDING
DEBUG: Checking if MapEdgeIdsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, input_key=s0/graph, scale=0, dependency=MergeSubGraphsSlurm) is complete
INFO: Informed scheduler that task GraphWorkflow___config_mc_DummyTask__oasis_scratch_c_cb70462974 has status PENDING
DEBUG: Checking if MergeSubGraphsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, scale=0, output_key=s0/graph, merge_complete_graph=True, dependency=InitialSubGraphsSlurm) is complete
INFO: Informed scheduler that task MapEdgeIdsSlurm___config_mc_MergeSubGraphsSl__oasis_scratch_c_6c607199dc has status PENDING
DEBUG: Checking if InitialSubGraphsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, dependency=DummyTask) is complete
INFO: Informed scheduler that task MergeSubGraphsSlurm___config_mc_InitialSubGraphs__oasis_scratch_c_8ef59ea786 has status PENDING
DEBUG: Checking if DummyTask() is complete
INFO: Informed scheduler that task InitialSubGraphsSlurm___config_mc_DummyTask__oasis_scratch_c_f2de7aaf60 has status PENDING
INFO: Informed scheduler that task DummyTask__99914b932b has status DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 16
INFO: [pid 21179] Worker Worker(salt=544906811, workers=1, host=comet-ln2.sdsc.edu, username=mmadany, pid=21179) running InitialSubGraphsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, dependency=DummyTask)
sbatch: error: bank_limit plugin: expired user, can't submit job
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
ERROR: [pid 21179] Worker Worker(salt=544906811, workers=1, host=comet-ln2.sdsc.edu, username=mmadany, pid=21179) failed InitialSubGraphsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, dependency=DummyTask)
Traceback (most recent call last):
  File "/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/site-packages/luigi/worker.py", line 139, in _run_get_new_deps
    task_gen = self.task.run()
  File "/home/mmadany/Multicut/cluster_tools2/cluster_tools-master/cluster_tools/cluster_tasks.py", line 93, in run
    raise e
  File "/home/mmadany/Multicut/cluster_tools2/cluster_tools-master/cluster_tools/cluster_tasks.py", line 79, in run
    self.run_impl()
  File "/home/mmadany/Multicut/cluster_tools2/cluster_tools-master/cluster_tools/graph/initial_sub_graphs.py", line 76, in run_impl
    self.submit_jobs(n_jobs)
  File "/home/mmadany/Multicut/cluster_tools2/cluster_tools-master/cluster_tools/cluster_tasks.py", line 443, in submit_jobs
    outp = check_output(command).decode().rstrip()
  File "/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/subprocess.py", line 376, in check_output
    **kwargs).stdout
  File "/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/subprocess.py", line 468, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['sbatch', '-o', './tmp_mc_A/logs/initial_sub_graphs_0.log', '-e', './tmp_mc_A/error_logs/initial_sub_graphs_0.err', '-J', 'initial_sub_graphs_0', './tmp_mc_A/slurm_initial_sub_graphs.sh', '0']' returned non-zero exit status 1.
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task InitialSubGraphsSlurm___config_mc_DummyTask__oasis_scratch_c_f2de7aaf60 has status FAILED
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: There are 16 pending tasks possibly being run by other workers
DEBUG: There are 16 pending tasks unique to this worker
DEBUG: There are 16 pending tasks last scheduled by this worker
INFO: Worker Worker(salt=544906811, workers=1, host=comet-ln2.sdsc.edu, username=mmadany, pid=21179) was stopped. Shutting down Keep-Alive thread
INFO:
===== Luigi Execution Summary =====

Scheduled 17 tasks of which:

  • 1 complete ones were encountered:
    • 1 DummyTask()
  • 1 failed:
    • 1 InitialSubGraphsSlurm(...)
  • 15 were left pending, among these:
    • 15 had failed dependencies:
      • 1 BlockEdgeFeaturesSlurm(...)
      • 1 EdgeCostsWorkflow(...)
      • 1 EdgeFeaturesWorkflow(...)
      • 1 GraphWorkflow(...)
      • 1 MapEdgeIdsSlurm(...)
        ...

This progress looks :( because there were failed tasks

===== Luigi Execution Summary =====

Looks like this is where the cluster configuration comes in. I need to change my group ID and such. Where do I change that and the other sbatch variables?

constantinpape commented on September 21, 2024

You can update the slurm config here:
https://github.com/constantinpape/cluster_tools/blob/master/example/cremi/run_mc.py#L69
Just add 'groupname': YOUR_GROUP_NAME.
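
In the example script that boils down to something like the following sketch (the get_config() call, the variable names and the global.config file name follow run_mc.py; treat them as assumptions if your version differs, and replace the placeholder values):

import json

from cluster_tools import MulticutSegmentationWorkflow

shebang = '#! /path/to/cluster_env/bin/python'  # placeholder
block_shape = [50, 512, 512]                    # placeholder

# fetch the default global config and add the slurm account / group name
global_config = MulticutSegmentationWorkflow.get_config()['global']
global_config.update({'shebang': shebang,
                      'block_shape': block_shape,
                      'groupname': 'YOUR_GROUP_NAME'})
with open('./config_mc/global.config', 'w') as f:
    json.dump(global_config, f)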

Also, for debugging, it might be useful to run the command that fails directly and see the error message:
sbatch -o ./tmp_mc_A/logs/initial_sub_graphs_0.log -e ./tmp_mc_A/error_logs/initial_sub_graphs_0.err -J initial_sub_graphs_0 ./tmp_mc_A/slurm_initial_sub_graphs.sh 0

MatthewBM commented on September 21, 2024

Ok this is what I'm getting now:


> Traceback (most recent call last):
>   File "./tmp_mc_A/initial_sub_graphs.py", line 152, in <module>
>     initial_sub_graphs(job_id, path)
>   File "./tmp_mc_A/initial_sub_graphs.py", line 144, in initial_sub_graphs
>     ignore_label)
>   File "./tmp_mc_A/initial_sub_graphs.py", line 117, in _graph_block
>     increaseRoi=True)
> RuntimeError: Request has wrong type
> 

That came from each of the 16 sbatch jobs. It looks like my data type might be off? I'm using the .n5 files, but here's what the .h5 file's data looks like when I get a snippet of data using h5ls -d:

Boundary Predictions, where 1 is the background and 0 are the boundaries:

    (0,58,2742) 0.890196078431372, 0.866666666666667, 0.815686274509804, 0.717647058823529, 0.725490196078431, 0.592156862745098, 0.392156862745098, 0.192156862745098, 0.0941176470588235, 0.0431372549019608, 0.0235294117647059, 0.0196078431372549, 0.0156862745098039,
    (0,58,2755) 0.0235294117647059, 0.0392156862745098, 0.0901960784313725, 0.203921568627451, 0.407843137254902, 0.592156862745098, 0.756862745098039, 0.882352941176471, 0.945098039215686, 0.980392156862745, 0.992156862745098, 0.996078431372549, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    (0,58,2779) 1, 0.996078431372549, 0.996078431372549, 0.996078431372549, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

Watershed file, uint32 values in sequence with no holes:

(0,3293,1530) 23660, 23660, 23660, 23660, 23660, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23368,
(0,3293,1568) 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368,
(0,3293,1606) 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124,
(0,3293,1644) 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643,
(0,3293,1682) 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653,
(0,3293,1720) 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653

constantinpape commented on September 21, 2024

The watershed needs to be stored in uint64.
Sorry for the late reply.

constantinpape commented on September 21, 2024

Also, to avoid issues you might get in the feature computation:
boundary maps need to be stored either in uint8 or in float32.

MatthewBM commented on September 21, 2024

Ok, I made sure my data is uint8 for the boundaries and uint64 for the ws, but I'm still getting the same error:

sbatch -o ./tmp_mc_A/logs/initial_sub_graphs_0.log -e ./tmp_mc_A/error_logs/initial_sub_graphs_0.err -J initial_sub_graphs_0 ./tmp_mc_A/slurm_initial_sub_graphs.sh 0
cat ./tmp_mc_A/logs/initial_sub_graphs_0.log

Mytype: d your type: m
2019-04-24 21:23:27.502097: start processing job 0
2019-04-24 21:23:27.502127: reading config from ./tmp_mc_A/initial_sub_graphs_job_0.config
2019-04-24 21:23:27.515858: start processing block 0

cat ./tmp_mc_A/error_logs/initial_sub_graphs_0.err

Traceback (most recent call last):
  File "./tmp_mc_A/initial_sub_graphs.py", line 152, in <module>
    initial_sub_graphs(job_id, path)
  File "./tmp_mc_A/initial_sub_graphs.py", line 144, in initial_sub_graphs
    ignore_label)
  File "./tmp_mc_A/initial_sub_graphs.py", line 117, in _graph_block
    increaseRoi=True)
RuntimeError: Request has wrong type

MatthewBM commented on September 21, 2024

Looks like this error message has occurred in your z5 repo:

> Merged #52, the issue should be fixed.

Originally posted by @constantinpape in constantinpape/z5#50 (comment)

constantinpape commented on September 21, 2024

Yes, this error message comes from z5 and indicates that some datatypes do not agree.
Are you sure both the boundaries and the superpixels are stored correctly?
Can you open them with z5py from Python?

import z5py
f = z5py.File('/path/to/data.n5')
ds = f['path/in/file']
print(ds.dtype)

If you do this, the dtype should be uint8 (or float32) for the boundaries and uint64 for the superpixels.
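
If one of them comes back with the wrong dtype, one way to fix it is to rewrite the dataset with z5py; a sketch (paths, keys and the output name are placeholders, and for very large volumes you would do this chunk-wise rather than in memory):

import z5py

f = z5py.File('/path/to/data.n5')

# load the existing watershed labels and re-save them as uint64
ws = f['path/in/file'][:]
out = f.create_dataset('watershed_uint64', shape=ws.shape,
                       chunks=f['path/in/file'].chunks,
                       dtype='uint64', compression='gzip')
out[:] = ws.astype('uint64')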

MatthewBM commented on September 21, 2024

Ok, I got the workflow up and running on my data, and it also worked end-to-end on the sample data. I'm just getting memory errors on my merge_graphs workers. I'd like to change the partition to the high-memory compute nodes, where we have 64 cores and 1.45 TB of RAM. I tried this:
global_config.update({'shebang': shebang, 'block_shape': block_shape, 'groupname': 'ddp140', 'partition': 'large-shared', 'mem': '1450G'})
but it seems the workers still get sent to the regular nodes. How can I see/edit exactly where the sbatch jobs are being submitted? I think I can also do '--mem=1G --ntasks=1' and it will split the jobs up along the resources on the node.

constantinpape commented on September 21, 2024

> Ok, I got the workflow up and running on my data, and it also worked end-to-end on the sample data.

Glad to hear it!

> I tried this:
> global_config.update({'shebang': shebang, 'block_shape': block_shape, 'groupname': 'ddp140', 'partition': 'large-shared', 'mem': '1450G'})

The global config does not support arbitrary arguments, but just the ones listed here:
https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/cluster_tasks.py#L217-L224.
Note that I added the partition option just now.

Also, you need to specify the memory limit for the individual tasks by updating the
mem_limit value in the task config; see also https://github.com/constantinpape/cluster_tools/blob/master/example/cremi/run_mc.py#L92.
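
Concretely, that looks roughly like the following (the get_config() defaults and the <task_name>.config file naming follow the linked example and are assumptions; the values are placeholders, and the units should be double-checked against your version):

import json

from cluster_tools import MulticutSegmentationWorkflow

configs = MulticutSegmentationWorkflow.get_config()

# bump the limits for the task that runs out of memory
# (mem_limit is assumed to be in GB, time_limit in minutes)
task_config = configs['merge_edge_features']
task_config.update({'mem_limit': 256, 'time_limit': 180})
with open('./config_mc/merge_edge_features.config', 'w') as f:
    json.dump(task_config, f)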

Hope this helps.

MatthewBM commented on September 21, 2024

Ok, thanks, that works great.

The furthest I've been able to get is solve_global, but that ran for over 48 hours, which is my job time limit. I've been experimenting with the parallel block size for my 7k x 5k x 400 xyz test volume (400 x 5k x 7k in the .n5 file format).

Right now I'm trying block_shape = [80, 1024, 1024].

But I was wondering if you could recommend a value here? I'm allocating 180 GB of RAM to the parallel workers, and I can allocate up to 1.45 TB to the single-worker steps. I've done that for 'solve_global', 'solve_subproblems', and 'merge_edge_features' so far, because they were giving 'out of memory' and 'segmentation fault' errors.

constantinpape commented on September 21, 2024

> Right now I'm trying block_shape = [80, 1024, 1024]

That sounds reasonable.

How many nodes are in the graph (i.e., how many super-voxel ids are there)?

Usually the solve_global step should be quite fast if the problem was reduced by solving the subproblems.
Have you tried running everything on a smaller cutout of the data (say 200 x 1024 x 1024) and checked the results?

One potential issue could be that your boundary maps follow a different convention than what I expect:
I assume boundaries to correspond to high values (i.e., 1 means maximal boundary probability for a pixel).
If your boundary maps have the opposite convention, you can set invert_inputs to True for probs_to_costs, see https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/costs/probs_to_costs.py#L51.

If you use the correct boundary convention and the cutout results look decent, there are 2 options to speed up the final multicut:

  1. Choose a different solver. This can be done by setting agglomerator to greedy-additive in the config of solve_global, see https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/multicut/solve_global.py#L43.
  2. Run with more hierarchy levels by setting n_scales > 1, see https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/workflows.py#L207.

These two options can also be combined. Note that both can reduce the quality of the resulting segmentation a bit, but from my experience the effect should not be very significant.
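
A sketch of what the two options look like in practice (again assuming the config conventions from the linked example):

from cluster_tools import MulticutSegmentationWorkflow

# option 1: switch the final solver to greedy-additive
# ('agglomerator' and its value are taken from solve_global.py, linked above)
solver_config = MulticutSegmentationWorkflow.get_config()['solve_global']
solver_config.update({'agglomerator': 'greedy-additive'})
# ...then write it to <config_dir>/solve_global.config as in the earlier sketch

# option 2: more hierarchy levels, via the workflow constructor:
# MulticutSegmentationWorkflow(..., n_scales=2, ...) instead of n_scales=1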

MatthewBM commented on September 21, 2024

I have 22 million superpixels

high boundary chance is 255, uint8. What's the difference between 'time_limit' and 'time_limit_solver'?
This is a cross-section of my current volume (lots of myelin, unfortunately):
[image]

After I get that to work and understand what the time/parallelization values should be, I'd like to try volumes closer to 15k x 15k x 1k, like this:
[image]
Does that look like it could handle a greedy solver with higher n_scales?

constantinpape commented on September 21, 2024

> I have 22 million superpixels

That should be fine; I have solved problems with about 2 orders of magnitude more superpixels with this pipeline.

> high boundary chance is 255, uint8

That's good, you don't need to change invert_inputs then.

> What's the difference between 'time_limit' and 'time_limit_solver'?

time_limit is the maximum time a job will run; it is passed as value for the -t parameter to slurm.
time_limit_solver is a time limit that is passed to the actual multicut solver.

I forgot to mention this parameter earlier; setting time_limit_solver might actually fix your problem.
You should set it to ~4 hours less than time_limit. (time_limit_solver is soft, which means that
the solver will not stop abruptly after the time has passed, but will only check for it after completing an internal iteration. Depending on the problem size, the iterations can take quite a while; that's why it's safer to give some leeway compared to time_limit.)

> Does that look like it could handle a greedy solver with higher n_scales?

Yes, this looks feasible.

MatthewBM commented on September 21, 2024

Ok, looks like time_limit is in minutes (or slurm/sbatch time format) and time_limit_solver is in seconds, correct?

constantinpape commented on September 21, 2024

Yes, that's correct.
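
For the 48-hour job limit mentioned above, that works out to something like this (config conventions as in the earlier sketches, i.e. assumptions based on the linked example):

import json

from cluster_tools import MulticutSegmentationWorkflow

configs = MulticutSegmentationWorkflow.get_config()
solver_config = configs['solve_global']
solver_config.update({
    'time_limit': 48 * 60,              # slurm job limit: 48 hours, in minutes
    'time_limit_solver': 44 * 60 * 60,  # solver limit: ~4 hours less, in seconds
})
with open('./config_mc/solve_global.config', 'w') as f:
    json.dump(solver_config, f)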

MatthewBM commented on September 21, 2024

Ok, the full process is working and looks great; I'll email you a video of what it looks like. Thanks again!

constantinpape commented on September 21, 2024

You're very welcome, and thanks for your patience. I am looking forward to seeing the results :).
