
cluster_tools's Introduction


Cluster Tools

Workflows for distributed bio-image analysis and segmentation. Supports Slurm, LSF, and local execution, and is easy to extend to other scheduling systems.

Installation

You can install the package via conda:

conda install -c conda-forge cluster_tools

To set up a development environment with all necessary dependencies, you can use the environment.yml file:

conda env create -f environment.yml

and then install the package in development mode via

pip install -e . --no-deps

Citation

If you use this software in a publication, please cite

Pape, Constantin, et al. "Solving large multicut problems for connectomics via domain decomposition." Proceedings of the IEEE International Conference on Computer Vision. 2017.

For the lifted multicut workflows, please cite

Pape, Constantin, et al. "Leveraging Domain Knowledge to improve EM image segmentation with Lifted Multicuts." arXiv preprint. 2019.

You can find code for the experiments in publications/lifted_domain_knowledge.

If you are using another algorithm that is not part of these two publications, please also cite the appropriate publication (see the links here).

Getting Started

This repository uses luigi for workflow management. We support different cluster schedulers; so far these are:

  • slurm
  • lsf
  • local (local execution based on ProcessPool)

The scheduler is selected via the keyword argument target. Inter-process communication is achieved through files stored in a temporary folder, and most workflows use n5 storage. You can use z5 to convert files to n5 from Python.
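For example, a minimal conversion from hdf5 to n5 with h5py and z5py could look like the sketch below (the file paths, dataset name and chunk shape are placeholders):

import h5py
import z5py

# copy a dataset from an hdf5 file into a new n5 container
with h5py.File('input.h5', 'r') as f_in, z5py.File('input.n5', 'w') as f_out:
    ds_in = f_in['data']
    ds_out = f_out.create_dataset('data', shape=ds_in.shape, chunks=(64, 64, 64),
                                  dtype=ds_in.dtype, compression='gzip')
    # for large volumes, copy in blocks instead of all at once
    ds_out[:] = ds_in[:]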

Simplified, running a workflow from this repository looks like this:

import json
import os

import luigi
from cluster_tools import SimpleWorkflow  # this is just a mock class, not actually part of this repository

# folder for temporary scripts and files
tmp_folder = 'tmp_wf'

# directory for the sub-task configurations, stored as json
config_dir = 'configs'
os.makedirs(config_dir, exist_ok=True)

# get the default configurations for all sub-tasks
default_configs = SimpleWorkflow.get_config()

# global configuration: shebang pointing to the python interpreter with all dependencies,
# group name and block shape
global_config = default_configs['global']
shebang = '#! /path/to/bin/python'
global_config.update({'shebang': shebang, 'groupname': 'mygroup'})
with open(os.path.join(config_dir, 'global.config'), 'w') as f:
    json.dump(global_config, f)

# run the example workflow with `max_jobs` number of jobs
max_jobs = 100
task = SimpleWorkflow(tmp_folder=tmp_folder, config_dir=config_dir,
                      target='slurm', max_jobs=max_jobs,
                      input_path='/path/to/input.n5', input_key='data',
                      output_path='/path/to/output.n5', output_key='data')
luigi.build([task])
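Configurations for the individual sub-tasks are written in the same way. A sketch, assuming a hypothetical sub-task called simple_task (threads_per_job and mem_limit are typical keys in these configs, but check the defaults returned by get_config() for the actual ones):

# 'simple_task' is a hypothetical sub-task name; inspect default_configs.keys() for the real ones
task_config = default_configs['simple_task']
task_config.update({'threads_per_job': 4, 'mem_limit': 8})
with open(os.path.join(config_dir, 'simple_task.config'), 'w') as f:
    json.dump(task_config, f)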

For a list of the available segmentation workflows, have a look at this. Unfortunately, there is no proper documentation yet. For more details, have a look at the examples, in particular this example. You can download the example data (also used for the tests) here.

cluster_tools's People

Contributors

constantinpape, martinschorb


cluster_tools's Issues

Pipeline has problems with non-contiguous numpy arrays

I use numpy version 1.16.1 from conda-forge.

Whenever vigra.analysis.relabelConsecutive is called within the Multicut workflow, I get an error that the argument types do not match any C++ signature. I checked that the datatype is correct. What fixes the problem for me is adding a line that makes the numpy arrays contiguous.

Below is a small code example that recreates the problem by writing a graph to n5, loading the graph, and then calling relabelConsecutive on the nodes.

import nifty.distributed as ndist
import z5py
import numpy as np
import vigra.analysis

filename = "test.n5"

with z5py.File(filename, "w") as f:
    g = f.create_group('/graph')

    g.attrs['numberOfEdges'] = 1
    g.attrs['numberOfNodes'] = 4

    g.create_dataset("labels", (4,), dtype="uint64", data=np.array([0, 1, 2, 3], dtype=np.uint64))
    g.create_dataset("edges", (1, 2), dtype="uint64", data=np.array([[0, 1]], dtype=np.uint64))

graph = ndist.Graph(filename + "/graph/")
nodes = graph.nodes()

# nodes = np.ascontiguousarray(nodes)
nodes_relabeled, _, _ = vigra.analysis.relabelConsecutive(nodes.astype(np.uint64))
print(nodes_relabeled)

If the line nodes = np.ascontiguousarray(nodes) is uncommented, it works perfectly. Without it, I get the following error:

Boost.Python.ArgumentError: Python argument types in
    vigra.analysis.relabelConsecutive(numpy.ndarray)
did not match C++ signature:
    relabelConsecutive(vigra::NumpyArray<1u, vigra::Singleband<unsigned char>, vigra::StridedArrayTag> labels, unsigned char start_label=1, bool keep_zeros=True, vigra::NumpyArray<1u, vigra::Singleband<unsigned char>, vigra::StridedArrayTag> out=None)
    relabelConsecutive(vigra::NumpyArray<1u, vigra::Singleband<unsigned int>, vigra::StridedArrayTag> labels, unsigned int start_label=1, bool keep_zeros=True, vigra::NumpyArray<1u, vigra::Singleband<unsigned int>, vigra::StridedArrayTag> out=None)
    relabelConsecutive(vigra::NumpyArray<1u, vigra::Singleband<unsigned long>, vigra::StridedArrayTag> labels, unsigned long start_label=1, bool keep_zeros=True, vigra::NumpyArray<1u, vigra::Singleband<unsigned long>, vigra::StridedArrayTag> out=None)
    relabelConsecutive(vigra::NumpyArray<1u, vigra::Singleband<unsigned long>, vigra::StridedArrayTag> labels, unsigned int start_label=1, bool keep_zeros=True, vigra::NumpyArray<1u, vigra::Singleband<unsigned int>, vigra::StridedArrayTag> out=None)
    relabelConsecutive(vigra::NumpyArray<2u, vigra::Singleband<unsigned char>, vigra::StridedArrayTag> labels, unsigned char start_label=1, bool keep_zeros=True, vigra::NumpyArray<2u, vigra::Singleband<unsigned char>, vigra::StridedArrayTag> out=None)
    relabelConsecutive(vigra::NumpyArray<2u, vigra::Singleband<unsigned int>, vigra::StridedArrayTag> labels, unsigned int start_label=1, bool keep_zeros=True, vigra::NumpyArray<2u, vigra::Singleband<unsigned int>, vigra::StridedArrayTag> out=None)
    relabelConsecutive(vigra::NumpyArray<2u, vigra::Singleband<unsigned long>, vigra::StridedArrayTag> labels, unsigned long start_label=1, bool keep_zeros=True, vigra::NumpyArray<2u, vigra::Singleband<unsigned long>, vigra::StridedArrayTag> out=None)
    relabelConsecutive(vigra::NumpyArray<2u, vigra::Singleband<unsigned long>, vigra::StridedArrayTag> labels, unsigned int start_label=1, bool keep_zeros=True, vigra::NumpyArray<2u, vigra::Singleband<unsigned int>, vigra::StridedArrayTag> out=None)
    relabelConsecutive(vigra::NumpyArray<3u, vigra::Singleband<unsigned char>, vigra::StridedArrayTag> labels, unsigned char start_label=1, bool keep_zeros=True, vigra::NumpyArray<3u, vigra::Singleband<unsigned char>, vigra::StridedArrayTag> out=None)
    relabelConsecutive(vigra::NumpyArray<3u, vigra::Singleband<unsigned int>, vigra::StridedArrayTag> labels, unsigned int start_label=1, bool keep_zeros=True, vigra::NumpyArray<3u, vigra::Singleband<unsigned int>, vigra::StridedArrayTag> out=None)
    relabelConsecutive(vigra::NumpyArray<3u, vigra::Singleband<unsigned long>, vigra::StridedArrayTag> labels, unsigned long start_label=1, bool keep_zeros=True, vigra::NumpyArray<3u, vigra::Singleband<unsigned long>, vigra::StridedArrayTag> out=None)
    relabelConsecutive(vigra::NumpyArray<3u, vigra::Singleband<unsigned long>, vigra::StridedArrayTag> labels, unsigned int start_label=1, bool keep_zeros=True, vigra::NumpyArray<3u, vigra::Singleband<unsigned int>, vigra::StridedArrayTag> out=None)

wrong kwarg passed on to elf

Hi,

when converting to MoBIE, I run into:

...
  File ".../cluster_tools/cluster_tools/copy_volume/copy_volume.py", line 138, in run_impl
    with vu.file_reader(self.output_path, mode="a", **file_kwargs) as f:
  File ".../cluster_tools/cluster_tools/utils/volume_utils.py", line 34, in file_reader
    return elf.io.open_file(path, mode=mode, **kwargs)
  File ".../elf/elf/io/files.py", line 45, in open_file
    return constructor(path, mode=mode, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'dimension_separator'

I have no idea where that dimension_separator comes from; it seems to be not None, otherwise

file_kwargs = {} if self.dimension_separator is None else dict(dimension_separator=self.dimension_separator)
should filter it. It does not appear in any of the luigi config files.
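One possible defensive fix (just a sketch, not the repository's actual code) would be to forward only the keyword arguments that the backend constructor actually accepts:

import inspect

def filter_kwargs(constructor, **kwargs):
    # keep only the kwargs that appear in the constructor's signature, so that
    # backends without dimension_separator support do not raise a TypeError
    params = inspect.signature(constructor).parameters
    return {k: v for k, v in kwargs.items() if k in params}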

Also, as in mobie/mobie-utils-python#57, I noticed the missing str() here:

Support for cloud-based datastores?

This looks like a super powerful tool; I'm looking forward to using it! I'd love to implement an API abstraction for cloud datastores like BossDB or CloudVolume, so that one could, in theory, generate peta-scale segmentations without having to download the data and reformat it into n5/hdf5.

These datastores tend to have client-side libraries that support numpy-like indexing, e.g.:

# Import intern (pip install intern)
from intern import array

# Save a cutout to a numpy array in ZYX order:
em = array("bossdb://microns/minnie65_8x8x40/em")
data = em[19000:19016, 56298:57322, 79190:80214]

My understanding is that this should be a simple drop-in replacement for the ws_path and ws_key if we had a class that looked something like this:

from intern import array

class BossDBAdapterFile:

    def __init__(self, filepath: str):
        self.array = array(filepath)

    def __getitem__(self, groupname: str):
        return self.array

    ...

(I expect I've forgotten a few key APIs / organization, but the gist is this)

Is this something that you imagine is feasible? Desirable? My hypothesis is that this would be pretty straightforward and open up a ton of cloud-scale capability, but I may be misunderstanding. Maybe there's a better place to plug in here than "pretending" to be an n5 file?
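For reference, a slightly more complete version of the adapter idea could look like the sketch below. Everything here is hypothetical: the class names are made up, and it only mimics the small subset of the z5py API (attrs, shape, dtype, numpy-style slicing) that the workflows presumably rely on, assuming intern's array exposes shape and dtype:

from intern import array

class BossDBDataset:
    # hypothetical wrapper that makes an intern array quack like a z5py dataset
    def __init__(self, uri: str, chunks=(64, 64, 64)):
        self._array = array(uri)
        self.chunks = chunks  # BossDB chunks server-side; this is a stand-in value
        self.attrs = {}

    @property
    def shape(self):
        return self._array.shape

    @property
    def dtype(self):
        return self._array.dtype

    def __getitem__(self, bounding_box):
        # intern supports numpy-like indexing, so slicing is forwarded directly
        return self._array[bounding_box]

class BossDBFile:
    # hypothetical file-level object that maps dataset keys to BossDB URIs
    def __init__(self, base_uri: str):
        self.base_uri = base_uri
        self.attrs = {}

    def __getitem__(self, key: str):
        return BossDBDataset(self.base_uri + "/" + key)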

dimension slicing error

Hi @constantinpape

I am trying to implement the slicing for mobie/mobie-utils-python#130

and get stuck in the Downscaling workflow.

File "code/cluster_tools/cluster_tools/copy_volume/copy_volume.py", line 151, in run_impl
    f.require_dataset(self.output_key, shape=out_shape, chunks=chunks,
  File "site-packages/z5py/group.py", line 232, in require_dataset
    return Dataset._require_dataset(self, name, shape, dtype, chunks,
  File "site-packages/z5py/dataset.py", line 155, in _require_dataset
    return cls._create_dataset(group, name, shape, dtype, data=data,
  File "site-packages/z5py/dataset.py", line 198, in _create_dataset
    raise RuntimeError("Chunks %s must have same length as shape %s" % (str(chunks),
RuntimeError: Chunks (3, 64) must have same length as shape (3, 128, 128)

(3, 128, 128) is the shape of my input file; I am trying to extract only one channel. Do I still need to provide chunks as (1, 64, 64)? That would then create 3D output data instead of the desired 2D.

That's what causes it to fail here:
https://github.com/mobie/mobie-utils-python/actions/runs/7845703615/job/21410747881?pr=130

Here are the configs for the luigi run:

configs.zip

How do I need to specify this correctly? Or is there something wrong in the handling of the ROI parameters? I could not spot anything obvious...
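For reference, the constraint that the error enforces is that len(chunks) must equal len(shape). A minimal sketch with placeholder names:

import z5py

with z5py.File('example.n5', 'w') as f:
    # a 3D dataset needs 3D chunks, e.g. with a singleton channel axis
    f.require_dataset('vol3d', shape=(3, 128, 128), chunks=(1, 64, 64), dtype='float32')
    # a single 2D channel needs 2D chunks
    f.require_dataset('slice2d', shape=(128, 128), chunks=(64, 64), dtype='float32')
    # chunks=(3, 64) with shape (3, 128, 128) raises the RuntimeError shown above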

Tasks do not respect thread limit

Tasks tend to spawn far more threads / processes than specified.
This might be related to the luigi scheduler or to Python's subprocess.call.
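Until this is fixed, a common mitigation (a sketch, not part of the repository) is to cap the BLAS/OpenMP thread pools per process via environment variables, before numpy is imported:

import os

# must be set before importing numpy/scipy for the variables to take effect
for var in ('OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS'):
    os.environ[var] = '1'

import numpy as np  # noqa: E402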

Axon-Dendrite Attribution code

Hi,

This file is currently empty:
cluster_tools/publications/leveraging_domain_knowledge/1_axon_dendrite_attribution/1_multicut.py
Is it possible to get this code and demo data?

Matthew

threadpool issue

Hi,

I am trying to set up mobie_utils on Windows with Python 3.9 (fresh mamba installation, no other packages).
I run into:

...\cluster_tools\watershed\watershed.py", line 140, in <module>
    @threadpool_limits.wrap(limits=1)  # restrict the numpy threadpool to 1 to avoid oversubscription
AttributeError: type object 'threadpool_limits' has no attribute 'wrap'

I remember having hit this before but could not find an issue for it.
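This looks like an older threadpoolctl release; as far as I know, the wrap decorator was only added in a later version. A possible workaround sketch is to use threadpool_limits as a context manager inside the function body, which older releases also support:

from threadpoolctl import threadpool_limits

def process_block(block):
    # restrict the numpy threadpool to 1 inside the function body instead of
    # decorating the function, to stay compatible with older threadpoolctl
    with threadpool_limits(limits=1):
        ...  # the actual watershed block code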

Pipeline does not work with HDF5 input

This is because graph computation and feature extraction expect n5 input data.
To fix it, we would need to support hdf5 inputs for these computations in nifty.

How to run large volumes

Hi Constantin,

I've been running multicut successfully on a cluster node with 1.5 TB RAM, but it hits a segmentation fault, presumably from running out of memory, on arrays larger than ~5k x 5k x 15.

How do I implement the large-volume processing? I believe that is what the luigi package is for, but I can't find a concise usage example in the repository.

For reference this is the script I'm running (that you were kind enough to send me):

import numpy as np
import vigra
import nifty
import nifty.graph.rag as nrag
import nifty.graph.opt.multicut as nmc

def probs_to_weights(edge_weights, edge_sizes):
    p_min = 0.001
    p_max = 1. - p_min
    edge_weights = (p_max - p_min) * edge_weights + p_min
    # probabilities to edge_weights
    edge_weights = np.log((1. - edge_weights) / edge_weights)
    edge_weights *= (edge_sizes / edge_sizes.max())
    return edge_weights

def normalize(input_):
    input_ = input_.astype('float32')
    input_ -= input_.min()
    input_ /= input_.max()
    return input_

def segment_mc(pmap_path, ws_path):
    print("Loading data ...")
    # need to invert to have multicut boundary conventions
    pmap = vigra.impex.readVolume(pmap_path).view(np.ndarray).squeeze().T
    # need to normalize the probability map to range 0, 1
    pmap = normalize(pmap)
    #pmap = 1. - pmap
    ws = vigra.impex.readVolume(ws_path).view(np.ndarray).squeeze().T.astype('uint32')
    # relabel the over-segmentation consecutively
    ws, max_id, _ = vigra.analysis.relabelConsecutive(ws, start_label=0, keep_zeros=False)
    assert pmap.shape == ws.shape
    print("Building graph ...")
    # build region adjacency graph
    rag = nrag.gridRag(ws, numberOfLabels=max_id+1)
    print("Extracting features ...")
    # extract features over the superpixel edges
    feats = nrag.accumulateEdgeMeanAndLength(rag, pmap)
    edge_weights, edge_sizes = feats[:, 0], feats[:, 1]
    # convert to multicut weights
    edge_weights = probs_to_weights(edge_weights, edge_sizes)
    print("Solving multicut ...")
    graph = nifty.graph.undirectedGraph(rag.numberOfNodes)
    graph.insertEdges(rag.uvIds())
    obj = nmc.multicutObjective(graph, edge_weights)
    # we use the most greedy solver to speed things up
    #solver = obj.kernighanLinFactory(warmStartGreedy=True).create(obj)
    solver = obj.greedyAdditiveFactory().create(obj)
    node_labels = solver.optimize()
    print("Map multicut solution to pixels")
    seg = nrag.projectScalarNodeDataToPixels(rag, node_labels)
    return seg.astype('uint32')

def segment_and_save(pmap_path, ws_path, out_path):
    seg = segment_mc(pmap_path, ws_path)
    vigra.impex.writeVolume(seg, out_path, '', compression='DEFLATE')

if __name__ == '__main__':
    # NOTE I can't read these tiffs with imageio, so I needed to fall back to vigra
    pmap_path = '/oasis/scratch/comet/mmadany/temp_project/MultiCut/lhb_sdscup/fxmemcrop/slice0001.png'
    ws_path = '/oasis/scratch/comet/mmadany/temp_project/MultiCut/lhb_sdscup/sig32_v2/slice0001.tif'
    out_path = '/oasis/scratch/comet/mmadany/temp_project/MultiCut/multiout/new_solver_nvcorr_n24sli15_segmentation.tif'
    # segment_mc(pmap_path, ws_path)
    segment_and_save(pmap_path, ws_path, out_path)


Thanks again,
Matthew
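For reference, the blockwise route follows the same pattern as the Getting Started example above. A hedged sketch; the class name MulticutSegmentationWorkflow and its parameter names are taken from memory of this repository's examples and may not match exactly, so check the cremi example for the actual signature:

import luigi
from cluster_tools import MulticutSegmentationWorkflow  # verify name and signature against the repo examples

# paths and keys are placeholders; further required arguments may apply
task = MulticutSegmentationWorkflow(
    tmp_folder='tmp_mc', config_dir='configs',
    target='slurm', max_jobs=100,
    input_path='/path/to/boundaries.n5', input_key='data',
    ws_path='/path/to/watershed.n5', ws_key='data',
    problem_path='/path/to/problem.n5',
    output_path='/path/to/segmentation.n5', output_key='data')
luigi.build([task])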

Sample data doesn't extract

I downloaded the sample CREMI data from the Google Drive link and tried to unpack it (I tried downloading twice):

tar -xzf ~/sampleA.n5.tar.gz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

AssertionError: [30, 256, 256], (16, 128, 128)

I am getting the following error when attempting to replicate the results in the example folder:

  File "/ibex/scratch/bogesdj/miniconda/envs/cluster_env/lib/python3.10/site-packages/cluster_tools/write/write.py", line 88, in run_impl
    assert all(bs % ch == 0 for bs, ch in zip(block_shape, chunks)), "%s, %s" % (str(block_shape), str(chunks))
AssertionError: [30, 256, 256], (16, 128, 128)

I am using the same dataset sampleA.n5 referenced in your repo. Please advise.
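For reference, the assertion in write.py requires each entry of block_shape to be a multiple of the corresponding chunk size. A minimal sketch of the check with the values from the error:

chunks = (16, 128, 128)

block_shape = [30, 256, 256]
print(all(bs % ch == 0 for bs, ch in zip(block_shape, chunks)))  # False: 30 % 16 != 0

block_shape = [32, 256, 256]
print(all(bs % ch == 0 for bs, ch in zip(block_shape, chunks)))  # True: passes the assertion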
