
3dunet-tensorflow-brats18's People

Contributors

tkuanlun350


3dunet-tensorflow-brats18's Issues

Please try group normalization

The first-place entry in BraTS18 used group normalization; it would be interesting to try it here. Looking forward to seeing your results. One more thing: why not use a smaller crop size in all three dimensions, for example 64x64x64? That would give you more augmented patches.

Update: to reduce N4 preprocessing time, you can use the N4 executable from ANTs directly and just call it from Python. In my experiment it takes 30 seconds instead of 7 minutes with the Python built-in interface.
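For reference, a minimal sketch of that shortcut, assuming the ANTs N4BiasFieldCorrection binary is on the PATH (the -d/-i/-o flags are standard ANTs options; the file names are only illustrative):

import subprocess

def n4_correct(in_file, out_file):
    # Call the ANTs N4 binary directly instead of going through the
    # Python wrapper: -d image dimensionality, -i input, -o output.
    subprocess.check_call([
        'N4BiasFieldCorrection',
        '-d', '3',
        '-i', in_file,
        '-o', out_file,
    ])

n4_correct('HGG147_T1.nii', 'HGG147_T1_n4.nii')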

Why is there no max-pooling layer?

[figure: 3D U-Net architecture diagram, copied from the paper]
The figure above illustrates the 3D U-Net architecture: each layer contains two 3×3×3 convolutions, each followed by a ReLU, and then a 2×2×2 max pooling with stride two in each dimension.
But in your model.py I can't find a max-pooling layer, so I want to ask whether you forgot it or whether there is something I am misunderstanding.
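One common reason, though I have not verified it against this model.py, is that the 2x2x2 max pool is replaced by a stride-2 convolution, which halves each spatial dimension just the same while making the downsampling learnable. A TF 1.x sketch of the two options:

import tensorflow as tf

x = tf.placeholder(tf.float32, [1, 32, 32, 32, 16])  # toy NDHWC input

# Option A: explicit 2x2x2 max pooling with stride 2, as in the paper figure.
pooled = tf.layers.max_pooling3d(x, pool_size=2, strides=2)

# Option B: a 3x3x3 convolution with stride 2; same 2x spatial reduction,
# but the reduction itself is learned.
strided = tf.layers.conv3d(x, filters=32, kernel_size=3, strides=2,
                           padding='same')

print(pooled.shape)   # (1, 16, 16, 16, 16)
print(strided.shape)  # (1, 16, 16, 16, 32)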

How to adapt learning rate with custom rules?

I changed the class Unet3dModel as below:

1. class Unet3dModel(ModelDesc):
2.     def __init__(self, modelType="training", inference_shape=config.INFERENCE_PATCH_SIZE):
3.         self.modelType = modelType
4.         self.inference_shape = inference_shape
5.         print(self.modelType)
6.         self.lr = tf.get_variable('learning_rate', initializer=config.BASE_LR, trainable=False)
7.         self.last_total_cost_ema = tf.get_variable('last_total_cost_ema', initializer=1.0, trainable=False)
8. 
9.     def optimizer(self):
10.         # lr = tf.get_variable('learning_rate', initializer=config.BASE_LR, trainable=False)
11.         # tf.summary.scalar('learning_rate', lr)
12.         # opt = tf.train.MomentumOptimizer(lr, 0.9)
13.         tf.summary.scalar('learning_rate', self.lr)
14.         opt = tf.train.MomentumOptimizer(self.lr, 0.9)
15.         return opt
16.         
17.     def preprocess(self, image):
18.         # transform to NCHW
19.         return tf.transpose(image, [0, 4, 1, 2, 3])
20. 
21.     def inputs(self):
22.         S = config.PATCH_SIZE
23.         if self.modelType == 'training':
24.             ret = [
25.                 tf.placeholder(tf.float32, (config.BATCH_SIZE, S[0], S[1], S[2], 4), 'image'),
26.                 tf.placeholder(tf.float32, (config.BATCH_SIZE, S[0], S[1], S[2], 1), 'weight'),
27.                 tf.placeholder(tf.float32, (config.BATCH_SIZE, S[0], S[1], S[2], 1), 'label')]
28.         else:
29.             S = self.inference_shape
30.             ret = [
31.                 tf.placeholder(tf.float32, (config.BATCH_SIZE, S[0], S[1], S[2], 4), 'image')]
32.         return ret
33. 
34.     def build_graph(self, *inputs):
35.         is_training = get_current_tower_context().is_training
36.         if is_training:
37.             image, weight, label = inputs
38.         else:
39.             image = inputs[0]
40.         image = self.preprocess(image)
41.         featuremap = unet3d('unet3d', image) # final upsampled featuremap
42.         if is_training:
43.             loss = Loss(featuremap, weight, label)
44.             wd_cost = regularize_cost(
45.                     '(?:unet3d)/.*kernel',
46.                     l2_regularizer(1e-5), name='wd_cost')
47. 
48.             total_cost = tf.add_n([loss, wd_cost], 'total_cost')
49. 
50.             add_moving_summary(total_cost, wd_cost)
51. 
52.             # keep an exponential moving average
53.             ema = tf.train.ExponentialMovingAverage(decay=1-1/30)
54.             ema.apply([total_cost])
55.             ave = ema.average(total_cost)
56.             self.lr = tf.cond(tf.less(self.last_total_cost_ema-ave, 5e-3), lambda: self.lr/5, lambda:self.lr)
57.             # self.lr /= 5
58.             self.last_total_cost_ema = ave
59.             
60.             return total_cost
61.         else:
62.             final_probs = tf.nn.softmax(featuremap, name="final_probs") #[b,d,h,w,num_class]
63.             final_pred = tf.argmax(final_probs, axis=-1, name="final_pred")

What I changed are lines 6, 7, 13, 14, and 52-58. I compute an exponential moving average (EMA) of the total cost and want to reduce the learning rate by a factor of 5 whenever the EMA does not improve by at least 5e-3. However, with this code the learning rate never changes.
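A likely cause: lines 56-58 only run at graph-construction time. In TF 1.x a variable changes only when an assign op is executed in the session; self.lr = tf.cond(...) merely rebinds a Python attribute, and ema.apply() returns an update op that is never run here. A minimal sketch of an assign-based version (variable names mirror the snippet above; wiring it into tensorpack's training loop is left out):

import tensorflow as tf

lr = tf.get_variable('learning_rate', initializer=0.01, trainable=False)
last_ema = tf.get_variable('last_total_cost_ema', initializer=1.0,
                           trainable=False)
total_cost = tf.reduce_mean(tf.random_normal([8]) ** 2)  # stand-in loss

ema = tf.train.ExponentialMovingAverage(decay=1 - 1 / 30)
ema_update = ema.apply([total_cost])   # update op: must run every step
ave = ema.average(total_cost)

with tf.control_dependencies([ema_update]):
    # tf.assign actually writes the variable; the assign only executes
    # on the branch that tf.cond takes.
    new_lr = tf.cond(tf.less(last_ema - ave, 5e-3),
                     lambda: tf.assign(lr, lr / 5.0),
                     lambda: tf.identity(lr))
with tf.control_dependencies([new_lr]):   # read last_ema before overwriting it
    record = tf.assign(last_ema, ave)

# Attach the updates to the cost so they run as part of every training step.
with tf.control_dependencies([new_lr, record]):
    total_cost = tf.identity(total_cost, name='total_cost')

Tensorpack also ships hyper-parameter callbacks that adjust 'learning_rate' from a monitored statistic outside the graph; that route is usually simpler than hand-rolled assign ops.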

dataset structure

Can you tell me what the dataset folder should look like? Mine looks like this:

--\data
----\dataset
-------\BRATS2018
-----------\training
---------------\HGG
-------------------HGG147_Flair.nii
-------------------HGG147_Label.nii
-------------------HGG147_T1nii
-------------------HGG147_T1c.nii
-------------------HGG147_T2.nii
---------------\LGG
-------------------LGG71_Flair.nii
-------------------LGG71_Label.nii
-------------------LGG71_T1nii
-------------------LGG71_T1c.nii
-------------------LGG71_T2.nii

But I get an error:

$ python preprocess.py                                                          
Processing HGG ...                                                               
0it [00:00, ?it/s]                                                               
Processing LGG ...                                                              
0it [00:00, ?it/s]
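The 0it output means the preprocessing loop found no subjects to iterate over. A quick hedged check (the paths follow the tree above; note that official BraTS releases use one folder per subject, e.g. training/HGG/Brats18_XXX_1/Brats18_XXX_1_flair.nii.gz, and the script's glob pattern may expect that layout):

import glob
import os

base = os.path.join('data', 'dataset', 'BRATS2018', 'training')

# If these lists come out empty, the folder structure or the file naming
# does not match what preprocess.py iterates over.
for grade in ('HGG', 'LGG'):
    subjects = sorted(glob.glob(os.path.join(base, grade, '*')))
    print(grade, len(subjects), subjects[:3])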

Training Error

Thank you for open-sourcing this! When I run the command [python3 train.py] directly, I get an 'OutOfRangeError'
(such as: "FIFOQueue '_0_QueueInput/input_queue' is closed and has insufficient elements").
Should I run [preprocess.py] first and then run [train.py]?

use 'None' for the batch dimension when defining Model inputs()

@tkuanlun350 Hi, thanks for your work. I ran into a problem when defining the model inputs. In TensorFlow we usually use None for the batch dimension of an input placeholder so the model can accept inputs of any batch size; in particular, a larger batch size can then be used for online evaluation. Your original code uses a fixed batch dimension in Unet3dModel.inputs(). I tried changing it to None, as I have seen in other tensorpack examples (they use 4-D tensor inputs, whereas we use 5-D), but unfortunately it raises errors. Could you please help me figure it out? The code and log are pasted below.

class Unet3dModel(ModelDesc):
    def __init__(self, model_name="unet3d", modelType="training",
                 inference_shape=config.INFERENCE_PATCH_SIZE):
        self.model_name = model_name
        self.modelType = modelType
        self.inference_shape = inference_shape
        print(self.modelType)

    def optimizer(self):
        lr = tf.get_variable('learning_rate', initializer=config.BASE_LR, trainable=False)
        tf.summary.scalar('learning_rate', lr)
        opt = tf.train.MomentumOptimizer(lr, 0.9)
        return opt
        
    def preprocess(self, image):
        # transform to NCDHW
        # original input is [batch, d, h, w, mod]
        return tf.transpose(image, [0, 4, 1, 2, 3])

    def inputs(self):
        S = config.PATCH_SIZE
        if self.modelType == 'training':
            ret = [
                tf.placeholder(tf.float32, (None, S[0], S[1], S[2], 4), 'image'),
                tf.placeholder(tf.float32, (None, S[0], S[1], S[2], 1), 'weight'),
                tf.placeholder(tf.float32, (None, S[0], S[1], S[2], 4), 'label')]
        else:
            S = self.inference_shape
            ret = [
                tf.placeholder(tf.float32, (config.BATCH_SIZE, S[0], S[1], S[2], 4), 'image')]
        return ret

[0217 21:24:33 @training.py:100] Building graph for training tower 0 on device /gpu:0 ...
[0217 21:24:33 @registry.py:121] unet3d input: [None, 4, 32, 32, 32]
Traceback (most recent call last):
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 518, in make_tensor_proto
str_values = [compat.as_bytes(x) for x in proto_values]
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 518, in
str_values = [compat.as_bytes(x) for x in proto_values]
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 68, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got None

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train.py", line 295, in
launch_train_with_config(cfg, trainer)
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorpack/train/interface.py", line 82, in launch_train_with_config
model._build_graph_get_cost, model.get_optimizer)
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorpack/utils/argtools.py", line 182, in wrapper
return func(*args, **kwargs)
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorpack/train/tower.py", line 165, in setup_graph
train_callbacks = self._setup_graph(input, get_cost_fn, get_opt_fn)
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorpack/train/trainers.py", line 167, in _setup_graph
self._make_get_grad_fn(input, get_cost_fn, get_opt_fn), get_opt_fn)
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorpack/graph_builder/training.py", line 213, in build
use_vs=[False] + [True] * (len(self.towers) - 1))
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorpack/graph_builder/training.py", line 107, in build_on_towers
ret.append(func())
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorpack/train/tower.py", line 192, in get_grad_fn
cost = get_cost_fn(*input.get_input_tensors())
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorpack/tfutils/tower.py", line 207, in call
output = self._tower_fn(*args)
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorpack/graph_builder/model_desc.py", line 234, in _build_graph_get_cost
ret = self.build_graph(*inputs)
File "train.py", line 97, in build_graph
featuremap = model(self.model_name, image)
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorpack/models/registry.py", line 124, in wrapped_func
outputs = func(*args, **actual_args)
File "/home/user/YangJing/brats/brats_chen/unet.py", line 23, in unet3d
name="init_conv")
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/layers/convolutional.py", line 826, in conv3d
return layer.apply(inputs)
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 825, in apply
return self.call(inputs, *args, **kwargs)
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 714, in call
outputs = self.call(inputs, *args, **kwargs)
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/layers/convolutional.py", line 186, in call
outputs_shape[4]])
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5782, in reshape
"Reshape", tensor=tensor, shape=shape, name=name)
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 513, in _apply_op_helper
raise err
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 510, in _apply_op_helper
preferred_dtype=default_dtype)
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1040, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 235, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 214, in constant
value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/home/user/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 522, in make_tensor_proto
"supported type." % (type(values), values))
TypeError: Failed to convert object of type <class 'list'> to Tensor. Contents: [None, 16, 1024, 32]. Consider casting elements to a supported type
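The crash happens while tf.layers.conv3d builds a reshape from the static output shape, which now contains None for the batch dimension; some TF versions cannot handle an unknown batch size in this code path, so keeping a fixed batch dimension (or upgrading TensorFlow) may be the pragmatic workaround. For reshapes in your own code, the usual pattern is to take the dynamic batch size from tf.shape(x) rather than the static shape, sketched here with a hypothetical helper:

import tensorflow as tf

def flatten_spatial(x):
    # Reshape [N, C, D, H, W] -> [N, C, D*H*W] with an unknown batch size.
    # The static shape returns None for N, and tf.reshape rejects None in
    # a plain Python list; tf.shape(x)[0] supplies it dynamically instead.
    static = x.get_shape().as_list()          # e.g. [None, 16, 32, 32, 32]
    n = tf.shape(x)[0]
    return tf.reshape(x, [n, static[1], static[2] * static[3] * static[4]])

x = tf.placeholder(tf.float32, [None, 16, 32, 32, 32])
print(flatten_spatial(x))                     # shape (?, 16, 32768)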

can only concatenate tuple (not "list") to tuple

Hi,
Thanks for your great work, first of all.
I ran into an issue when doing inference.

Below is the detailed information.

Traceback (most recent call last):
File "train.py", line 203, in
offline_pred([pred], args.evaluate)
File "train.py", line 117, in offline_pred
df, lambda img: segment_one_image(img, pred_func))
File "E:\network\3DUnet-Tensorflow-Brats18-master\eval.py", line 339, in pred_brats
final_label, probs = detect_func(data)
File "train.py", line 117, in
df, lambda img: segment_one_image(img, pred_func))
File "E:\network\3DUnet-Tensorflow-Brats18-master\eval.py", line 254, in segment_one_image
final_probs = np.zeros(temp_size + [config.NUM_CLASS], np.float32)
TypeError: can only concatenate tuple (not "list") to tuple

I don't know how to fix it.
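The message suggests temp_size is a tuple (e.g. an ndarray .shape) while [config.NUM_CLASS] is a list, and Python refuses to concatenate the two. A sketch of the likely one-line fix at eval.py line 254 (values are stand-ins):

import numpy as np

temp_size = (128, 128, 128)   # a tuple, as returned by ndarray.shape
NUM_CLASS = 4                 # stand-in for config.NUM_CLASS

# Raises: can only concatenate tuple (not "list") to tuple
# final_probs = np.zeros(temp_size + [NUM_CLASS], np.float32)

# Converting to a list first makes the concatenation types match:
final_probs = np.zeros(list(temp_size) + [NUM_CLASS], np.float32)
print(final_probs.shape)      # (128, 128, 128, 4)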

Why are the results of the code abnormally bad?

@tkuanlun350
Hello! I ran this code on the BraTS18 data. Why are the results so bad?

What I did:
1. Set up the environment
2. Downloaded the data
3. Ran train.py on a GPU

Results:
1. Console output:
[screenshots of the training log]

2. But when I submitted the results, I got:
Mean,0.0185,0.08663,0.03826,0.02226,0.09266,0.04924,0.97589,0.9431,0.95617,60.53925,54.86339,60.90156

The Dice scores are only around 0.0x, which is extremely bad. Could you take a look at what went wrong?

preprocess

Hi, thanks for sharing your code. I noticed you don't apply N4 bias correction to the "flair" modality during preprocessing. Could you please tell me why?

msgpack.exceptions.PackValueError: bytes is too large

Hi tkuanlun350,
I get the following error while predicting on my data (400, 600, 420).
Can you help me solve this issue? Thank you.

inference
[0821 16:16:15 @registry.py:121] unet3d input: [2, 4, 128, 128, 128]
1 layer (2, 32, 64, 64, 64)
1 layer (2, 64, 32, 32, 32)
1 layer (2, 128, 16, 16, 16)
1 layer (2, 256, 8, 8, 8)
1 layer (2, 256, 8, 8, 8)
final (2, 128, 128, 128, 3)
[0821 16:16:17 @registry.py:129] unet3d output: [2, 128, 128, 128, 3]
[0821 16:16:17 @sessinit.py:90] WRN The following variables are in the checkpoint, but not found in the graph: global_step:0, learning_rate:0
2018-08-21 16:16:18.057440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325
pciBusID: 0000:03:00.0
totalMemory: 10.91GiB freeMemory: 7.87GiB
2018-08-21 16:16:18.057509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-08-21 16:16:18.455432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-21 16:16:18.455512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-08-21 16:16:18.455541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-08-21 16:16:18.455899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11059 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
[0821 16:16:19 @sessinit.py:117] Restoring checkpoint from ./train_log/unet3d/model-3000 ...
Data Folder: ./data/val
Preprocessing Data ...
100%|█████████████████████████████████████████████████████████████████████████████████████████| 1/1 [02:28<00:00, 148.38s/it]
0%| |0/1[00:00<?,?it/s]Process _Worker-1:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/seg/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/ubuntu/anaconda3/envs/seg/lib/python3.6/site-packages/tensorpack/dataflow/parallel.py", line 259, in run
socket.send(dumps(dp), copy=False)
File "/home/ubuntu/anaconda3/envs/seg/lib/python3.6/site-packages/tensorpack/utils/serialize.py", line 32, in dumps_msgpack
return msgpack.dumps(obj, use_bin_type=True)
File "/home/ubuntu/anaconda3/envs/seg/lib/python3.6/site-packages/msgpack_numpy.py", line 166, in packb
return Packer(**kwargs).pack(o)
File "msgpack/_packer.pyx", line 284, in msgpack._packer.Packer.pack
File "msgpack/_packer.pyx", line 290, in msgpack._packer.Packer.pack
File "msgpack/_packer.pyx", line 287, in msgpack._packer.Packer.pack
File "msgpack/_packer.pyx", line 263, in msgpack._packer.Packer._pack
File "msgpack/_packer.pyx", line 234, in msgpack._packer.Packer._pack
File "msgpack/_packer.pyx", line 234, in msgpack._packer.Packer._pack
File "msgpack/_packer.pyx", line 205, in msgpack._packer.Packer._pack
msgpack.exceptions.PackValueError: bytes is too large
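For scale: the dataflow worker tries to serialize the whole preprocessed volume as one msgpack blob, which at (400, 600, 420) with four modalities is on the order of gigabytes and can exceed msgpack's per-object limits. A rough size check (shape and dtype assumed from the log above):

import numpy as np

shape = (4, 400, 600, 420)      # 4 modalities, full volume, float32
nbytes = int(np.prod(shape)) * np.dtype(np.float32).itemsize
print(nbytes / 2 ** 30, 'GiB')  # ~1.5 GiB before any padding or copies

Possible workarounds, untested here: run inference without the multiprocess prefetcher so nothing gets serialized between processes, or crop/resample the volume toward BraTS dimensions (240x240x155) before it enters the dataflow.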

"Dimension 0 in both shapes must be equal,but are 4 and 3"

I ran the code with a 128x128x128 patch and got CUDA out of memory, so I changed the patch size to 20x128x128, and it says:

ValueError: Dimension 0 in both shapes must be equal, but are 4 and 3. Shapes are [4,16,16] and [3,16,16]. for 'tower0/unet3d/concat' (op: 'ConcatV2') with input shapes: [1,128,4,16,16], [1,128,3,16,16], [] and with computed input tensors: input[2] = <1>.

But when I change the patch size to 64x64x64, it works. I am just not sure this is the best setting.
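A plausible explanation, assuming the network halves each spatial dimension four times: with 'same' padding, 20 shrinks as 20 -> 10 -> 5 -> 3 -> 2, and upsampling 2 by a factor of two gives 4, which no longer matches the 3 coming from the skip connection, exactly the [4,16,16] vs [3,16,16] concat mismatch above. Every patch dimension therefore needs to be divisible by 2^4 = 16, which 64 and 128 are but 20 is not. A quick check:

def valid_patch(patch, n_downsamples=4):
    # Assumption: the model halves each spatial dim n_downsamples times and
    # upsamples by exact factors of 2, so each dim must divide 2**n_downsamples.
    return all(s % 2 ** n_downsamples == 0 for s in patch)

for p in [(128, 128, 128), (20, 128, 128), (64, 64, 64)]:
    print(p, valid_patch(p))   # only (20, 128, 128) fails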

ResourceExhaustedError

Dear tkuanlun350,
I get the following error while running your code on the BraTS18 dataset.
Would you mind guiding me on how to solve the issue?
My laptop config: i5 7300HQ, GTX 1050, 8 GB DDR4 RAM.

[three screenshots of the ResourceExhaustedError]

Thank you
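A GTX 1050 has far less memory than this model expects, so the usual first step is shrinking the patch and batch settings in config.py. A hedged sketch (names taken from other snippets on this page, values are suggestions); keep each patch dimension divisible by 16, per the concat issue above:

# config.py -- reduced settings for a small GPU
PATCH_SIZE = [64, 64, 64]    # smaller crops need far less activation memory
BATCH_SIZE = 1               # smallest possible batch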

No idea how to save a model in Python, help :)

Hello!

I am trying to save the model generated once train.py has completed successfully. I am not familiar with TensorFlow or Keras, so I really don't know where to add new commands or where they would save any files (.h5 or whatever).

So, my last output is this:
[screenshot of the final training log]

What should I do next? I can't find that model-30000 in any of my folders.

Thank you!
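Tensorpack saves standard TensorFlow checkpoints (not a single .h5) into the training log directory, typically something like train_log/<model>/; model-30000 exists as a model-30000.index file plus model-30000.data-* shards listed in a checkpoint file. A hedged sketch of loading such a checkpoint for prediction; the tensor names 'image' and 'final_pred' come from the model code quoted earlier on this page, and the path is an assumption:

from tensorpack.predict import PredictConfig, OfflinePredictor
from tensorpack.tfutils.sessinit import get_model_loader

from train import Unet3dModel   # the ModelDesc defined in this repo

pred_config = PredictConfig(
    model=Unet3dModel(modelType='inference'),
    session_init=get_model_loader('./train_log/unet3d/model-30000'),
    input_names=['image'],          # placeholder name from inputs()
    output_names=['final_pred'],    # tensor named in build_graph()
)
predictor = OfflinePredictor(pred_config)  # call it with a batch of patches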

Getting "TypeError: environment can only contain strings" on Windows 10

(base) D:\newwork\tkuanlun350\3DUnet-Tensorflow-Brats18-master>python preprocess.py
Processing HGG ...
0%| | 0/210 [00:00<?, ?it/s]
Traceback (most recent call last):
File "preprocess.py", line 77, in
main()
File "preprocess.py", line 56, in main
N4BiasFieldCorrect(mod_file, output_path)
File "preprocess.py", line 15, in N4BiasFieldCorrect
normalized.run()
File "C:\ProgramData\Miniconda3\lib\site-packages\nipype\interfaces\base\core.py", line 522, in run
runtime = self._run_interface(runtime)
File "C:\ProgramData\Miniconda3\lib\site-packages\nipype\interfaces\ants\segmentation.py", line 420, in _run_interface
runtime, correct_return_codes)
File "C:\ProgramData\Miniconda3\lib\site-packages\nipype\interfaces\base\core.py", line 1035, in _run_interface
runtime = run_command(runtime, output=self.terminal_output)
File "C:\ProgramData\Miniconda3\lib\site-packages\nipype\interfaces\base\core.py", line 773, in run_command
close_fds=(not sys.platform.startswith('win')),
File "C:\ProgramData\Miniconda3\lib\subprocess.py", line 709, in init
restore_signals, start_new_session)
File "C:\ProgramData\Miniconda3\lib\subprocess.py", line 997, in _execute_child
startupinfo)
TypeError: environment can only contain strings


This seems related to Unicode (non-str) values in the environment being rejected on Windows.
Any fix for this?
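On Windows, subprocess.Popen requires every environment key and value to be a plain str, and nipype hands a copy of os.environ to the ANTs executable. A hedged workaround is to sanitize the environment near the top of preprocess.py, before any nipype call:

import os

# Coerce any non-str environment entries to str so the environment nipype
# builds from os.environ is acceptable to subprocess on Windows.
for key, value in list(os.environ.items()):
    if not isinstance(key, str) or not isinstance(value, str):
        del os.environ[key]
        os.environ[str(key)] = str(value)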

ValueError: Layer named InstanceNorm5d is already registered!

Hi @tkuanlun350, when I ran train.py this error occurred. The dataset and file paths have already been updated.

Traceback (most recent call last):
File "E:/CTAs/HeartSegmentaion/Codes/3DUnet-Tensorflow-Brats18-master/train.py", line 20, in
from model import ( unet3d, Loss )

File "E:\CTAs\HeartSegmentaion\Codes\3DUnet-Tensorflow-Brats18-master\model.py", line 14, in
from custom_ops import BatchNorm3d, InstanceNorm5d

File "E:\CTAs\HeartSegmentaion\Codes\3DUnet-Tensorflow-Brats18-master\custom_ops.py", line 29, in
def InstanceNorm5d(x, epsilon=1e-5, use_affine=True, gamma_init=None, data_format='channels_last'):

File "D:\ProgramData\Anaconda3\lib\site-packages\tensorpack\models\registry.py", line 138, in wrapper
_register(func.name, wrapped_func)

File "D:\ProgramData\Anaconda3\lib\site-packages\tensorpack\models\registry.py", line 24, in _register
raise ValueError("Layer named {} is already registered!".format(name))

ValueError: Layer named InstanceNorm5d is already registered!
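Tensorpack keeps a global registry of layer names and raises this when the same name is registered twice. The two usual causes: custom_ops.py ends up imported twice under different module names, or your tensorpack version already defines InstanceNorm5d itself. A small standard-library check for the double-import case:

import sys

# Put this at the top of custom_ops.py: seeing two different module names
# that both contain 'custom_ops' confirms the file is imported twice.
print('custom_ops imported as', __name__,
      'already loaded:', [m for m in sys.modules if 'custom_ops' in m])

If it is a duplicate import, spell the import the same way everywhere (always from custom_ops import ..., never a package-qualified variant); if tensorpack now ships the layer itself, rename the local one (say, a hypothetical InstanceNorm5dCustom) and update model.py to match.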

Training is killed or hangs after finishing epoch 1

Thanks for your open-source code.
First, I have a bug: the program is killed or hangs after finishing epoch 1 and does not continue training with epoch 2. Could you help me fix this?
Second, what test set did you evaluate your model on? Did you split the data into a training set and a test set, train on the former and test on the latter, or did you test on the private BraTS2018 test set?

Focal Loss Implementation

Hello, I have gone through your paper 'End-to-End Cascade Network for 3D Brain Tumor Segmentation in MICCAI 2018 BraTS Challenge'. Since you describe a focal loss in the architecture, can you point me to the file that implements the focal loss function? Thank you.
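While waiting for a pointer to the exact file, here is a generic TF 1.x sketch of multi-class focal loss (Lin et al., 2017) as it is commonly written for segmentation; it is not necessarily the exact form used in this repo:

import tensorflow as tf

def focal_loss(logits, one_hot_labels, gamma=2.0, alpha=0.25):
    # FL(p_t) = -alpha * (1 - p_t)**gamma * log(p_t), averaged over voxels;
    # the (1 - p_t)**gamma factor down-weights easy, well-classified voxels.
    probs = tf.nn.softmax(logits)                          # [..., num_class]
    p_t = tf.reduce_sum(one_hot_labels * probs, axis=-1)   # prob of true class
    loss = -alpha * tf.pow(1.0 - p_t, gamma) * tf.log(p_t + 1e-7)
    return tf.reduce_mean(loss)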

use 3d non-medical dataset

First of all, let me congratulate you on this great project, and many thanks for making it public! I am trying to make it work with my own 3D non-medical dataset. How can I build a standard training dataset from my original data, which is in ".raw" format (both the volumes and the labels)? Could you please share some code? I have never worked with 3D datasets, so I would really appreciate your help. Thanks a lot.
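Raw volumes carry no header, so the shape and dtype must be known out-of-band. A hedged sketch of turning a .raw volume/label pair into NumPy arrays and then NIfTI files (which the repo's .nii loaders expect); SHAPE, DTYPE, and the file names are placeholders you must fill in:

import numpy as np
import nibabel as nib

SHAPE = (128, 128, 128)      # placeholder: the true (D, H, W) of your data
DTYPE = np.float32           # placeholder: the dtype the .raw was written in

def load_raw(path, shape=SHAPE, dtype=DTYPE):
    # .raw is just a flat binary dump; reshape restores the volume.
    return np.fromfile(path, dtype=dtype).reshape(shape)

volume = load_raw('case0_data.raw')
label = load_raw('case0_label.raw', dtype=np.uint8)

# Save as NIfTI with an identity affine so the existing .nii loaders work.
nib.save(nib.Nifti1Image(volume, np.eye(4)), 'case0_t1.nii.gz')
nib.save(nib.Nifti1Image(label, np.eye(4)), 'case0_label.nii.gz')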

cannot import name 'get_tf_version_number' from 'tensorpack.tfutils.common'

Even if I change get_tf_version_number to get_tf_version_tuple, I still get this error. Why?

Traceback (most recent call last):
File "train.py", line 20, in
from model import ( unet3d, Loss )
File "\Desktop\3DUnet-Tensorflow-Brats18-master\model.py", line 14, in
from custom_ops import BatchNorm3d, InstanceNorm5d
File "\Desktop\3DUnet-Tensorflow-Brats18-master\custom_ops.py", line 18, in
from tensorpack.tfutils.common import get_tf_version_number
ImportError: cannot import name 'get_tf_version_number' from 'tensorpack.tfutils.common' (..\AppData\Roaming\Python\Python37\site-packages\tensorpack\tfutils\common.py)
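Note the traceback still shows custom_ops.py importing get_tf_version_number, so that import was not actually updated. In newer tensorpack the helper is get_tf_version_tuple, and it returns a tuple rather than a float, so any comparison has to change as well. A sketch of the edit (the 1.5 threshold is a guess at what the original check used):

# old, removed from tensorpack:
#   from tensorpack.tfutils.common import get_tf_version_number
#   if get_tf_version_number() >= 1.5: ...

from tensorpack.tfutils.common import get_tf_version_tuple

if get_tf_version_tuple() >= (1, 5):   # compare tuples, not floats
    pass                               # version-dependent branch goes here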

Running Op sync_variables_from_main_tower...

First, I appreciate your code! ୧(๑•̀◡•́๑)૭ It helps me a lot!!
When I run training, it is weird that the process stops at "Running Op sync_variables_from_main_tower..." without any further output.
[screenshot of the console output]
Do you have any suggestions? @tkuanlun350

out of memory after several online eval iterations

The training stage runs well, consuming about 10 GB of CPU memory. However, memory use grows quickly once online eval (called by EvalCallback) starts, reaching 60 GB after several eval iterations. Has anyone else observed this problem? How did you solve it?

ZMQ IPC ERROR

Traceback (most recent call last):
File "train.py", line 211, in
data=QueueInput(get_train_dataflow()),
File "C:\Users\idir.hired\Desktop\algoSeg\3DUnet-Tensorflow-Brats18-master\data_sampler.py", line 242, in get_train_dataflow
ds = PrefetchDataZMQ(ds, 6)
File "C:\Users\idir.hired\AppData\Local\Continuum\anaconda3\envs\table1\lib\site-packages\tensorpack\dataflow\parallel.py", line 274, in init
super(PrefetchDataZMQ, self).init()
File "C:\Users\idir.hired\AppData\Local\Continuum\anaconda3\envs\table1\lib\site-packages\tensorpack\dataflow\parallel.py", line 89, in init
assert os.name != 'nt', "ZMQ IPC doesn't support windows!"
AssertionError: ZMQ IPC doesn't support windows!
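As the assertion says, PrefetchDataZMQ relies on ZMQ IPC, which does not exist on Windows. A hedged sketch of a fallback in data_sampler.py's get_train_dataflow() (PrefetchData is tensorpack's multiprocessing-based prefetcher; it may have its own Windows caveats, and simply returning ds unwrapped is the safest, if slowest, option):

import os
from tensorpack.dataflow import PrefetchData, PrefetchDataZMQ

# ds is the dataflow already built earlier in get_train_dataflow().
if os.name == 'nt':
    ds = PrefetchData(ds, 16, 2)   # (dataflow, nr_prefetch, nr_proc)
else:
    ds = PrefetchDataZMQ(ds, 6)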

Low GPU utilization rate

I'm running this model on a single 1080 Ti, and nvidia-smi shows I've already loaded 8 GB of data onto the GPU, yet GPU utilization stays at 0% most of the time.
[nvidia-smi screenshots]
Is there anything wrong here, and how can I improve the situation?
Many thanks.

Training With Google Colab

Hi @tkuanlun350 ,

I'm trying to train the network on Google Colab with the BraTS 2017 dataset. When I try to do cross-validation, the virtual machine crashes or execution stops (the dataset is very large, so it can't load and preprocess all of it). So I thought of splitting the dataset into several sub-folders and training on each of them. My question is: after I train the first time and it finishes, if I switch the dataset folder and restart training while choosing to keep the log, does the model continue training or does it restart?
If it restarts, is there any way to split the dataset and train the same model on each sub-folder?

TypeError: join() argument must be str or bytes, not 'list'

Hey! I trained the model with CROSS_VALIDATION = False and it works fine. But when I try to train the model with CROSS_VALIDATION = True, it gives me this error:

Traceback (most recent call last):
File "train.py", line 211, in
data=QueueInput(get_train_dataflow()),
File "/home/faizad/3DUnet-Tensorflow-Brats18/data_sampler.py", line 221, in get_train_dataflow
imgs = BRATS_SEG.load_from_file(config.BASEDIR, config.TRAIN_DATASET)
File "/home/faizad/3DUnet-Tensorflow-Brats18/data_loader.py", line 113, in load_from_file
brats = BRATS_SEG(basedir, names)
File "/home/faizad/3DUnet-Tensorflow-Brats18/data_loader.py", line 33, in init
self.basedir = os.path.join(basedir, mode)
File "/usr/lib/python3.7/posixpath.py", line 94, in join
genericpath._check_arg_types('join', a, *p)
File "/usr/lib/python3.7/genericpath.py", line 149, in _check_arg_types
(funcname, s.class.name)) from None

TypeError: join() argument must be str or bytes, not 'list'

Note: I followed all the steps given below:

  1. First ran generate_5fold.py to save 5fold.pkl
  2. Set CROSS_VALIDATION to True
  3. Set CROSS_VALIDATION_PATH to /path/to/5fold.pkl
  4. Set FOLD to 4

Thanks in anticipation.
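From the traceback, load_from_file calls BRATS_SEG(basedir, names) with a list of subject names from the fold file, but BRATS_SEG.__init__ feeds it straight into os.path.join, which accepts only strings. A hedged sketch of the kind of guard data_loader.py would need (the attribute handling is guessed; adapt it to how the rest of the class uses basedir):

import os

class BRATS_SEG:
    def __init__(self, basedir, mode):
        # `mode` is a subdirectory name ('training') without cross-validation,
        # but a list of per-subject names when CROSS_VALIDATION is True.
        if isinstance(mode, str):
            self.basedir = os.path.join(basedir, mode)
            self.names = None
        else:
            self.basedir = basedir
            self.names = list(mode)  # resolve each subject under basedir later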

test failed

Firstly, I appreciate your code 👍 I really like it.
However, when I test it, the code has some bugs.
My data layout:
DIR/
  training/
    HGG/
    LGG/
  val/
    case1/
      t1.nii.gz
      t1c.nii.gz
      flair.nii.gz
      t2.nii.gz
    case2/
      t1.nii.gz
      t1c.nii.gz
      flair.nii.gz
      t2.nii.gz

The error is:

Traceback (most recent call last):
File "/home/xuhuaren/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/xuhuaren/anaconda3/lib/python3.6/site-packages/tensorpack-0.8.9-py3.6.egg/tensorpack/dataflow/parallel.py", line 269, in run
for dp in self.ds:
File "/home/xuhuaren/anaconda3/lib/python3.6/site-packages/tensorpack-0.8.9-py3.6.egg/tensorpack/dataflow/common.py", line 282, in iter
ret = self.func(copy(dp)) # shallow copy the list
File "/home/xuhuaren/anaconda3/lib/python3.6/site-packages/tensorpack-0.8.9-py3.6.egg/tensorpack/dataflow/common.py", line 317, in _mapper
r = self._func(dp[self._index])
File "/work/cvpr2019/experiment/brats/1/data_sampler.py", line 269, in f
volume_list, label, weight, original_shape, bbox = data
ValueError: too many values to unpack (expected 5)

Do you have any suggestions?

Training error: CUDA_ERROR_OUT_OF_MEMORY

I train with my own dataset and don't use 5-fold cross-validation.
Even though I set PATCH_SIZE = [24, 24, 128] and BATCH_SIZE = 1,
it keeps reporting the errors below. Could you please help me figure it out?
My GPU is a Tesla P100.

2019-04-29 11:34:02.983765: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 128.00M (134217728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.984872: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 115.20M (120796160 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.985950: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 103.68M (108716544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.987029: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 93.31M (97844992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.988006: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 83.98M (88060672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.988884: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 75.58M (79254784 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.989733: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 68.02M (71329536 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.991272: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 128.00M (134217728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.991653: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 61.22M (64196608 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.993080: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 115.20M (120796160 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.993504: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 55.10M (57777152 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.994662: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 103.68M (108716544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.995871: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 93.31M (97844992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.996989: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 83.98M (88060672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.998042: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 75.58M (79254784 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:02.999098: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 68.02M (71329536 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.000601: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 61.22M (64196608 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.358495: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 256.00M (268435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.358600: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 76.69MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-29 11:34:03.398668: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 256.00M (268435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.398760: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 76.69MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-29 11:34:03.445785: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 256.00M (268435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-04-29 11:34:03.445889: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 148.69MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

AssertionError

Traceback (most recent call last):
File "D:\1_Deep_learning\codes\Segmentation Deep learing\3DUnet-Tensorflow-Brats18-master\train.py", line 211, in
data=QueueInput(get_train_dataflow()),
File "D:\1_Deep_learning\codes\Segmentation Deep learing\3DUnet-Tensorflow-Brats18-master\data_sampler.py", line 240, in get_train_dataflow
ds = BatchData(MapData(ds, preprocess), config.BATCH_SIZE)
File "D:\1_Deep_learning\codes\Segmentation Deep learing\3DUnet-Tensorflow-Brats18-master\data_sampler.py", line 49, in init
assert batch_size < ds.size()
AssertionError
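The failing line is assert batch_size < ds.size() in data_sampler.py, i.e. the dataflow yields fewer datapoints than config.BATCH_SIZE; the usual cause is that the loader found little or no preprocessed data (compare the 0it preprocessing issue above). A minimal reproduction of the check:

def make_batches(num_datapoints, batch_size):
    # Mirrors data_sampler.py line 49: there must be strictly more
    # datapoints than the batch size.
    assert batch_size < num_datapoints, (
        'dataflow yields %d datapoints, batch size %d'
        % (num_datapoints, batch_size))

make_batches(100, 4)   # fine
make_batches(0, 4)     # AssertionError: the loader found no data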
