detectionteamucas / nas_fpn_tensorflow
NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection.
Home Page: https://arxiv.org/abs/1904.07392
License: MIT License
In the code:
def resnet_base(img_batch, scope_name, is_training=True):
    if scope_name.endswith('b'):
        get_resnet_fn = get_resnet_v1_b_base
    elif scope_name.endswith('d'):
        get_resnet_fn = get_resnet_v1_d_base
    else:
        raise ValueError("scope Name erro....")
    _, feature_dict = get_resnet_fn(input_x=img_batch, scope=scope_name,
                                    bottleneck_nums=BottleNeck_NUM_DICT[scope_name],
                                    base_channels=BASE_CHANNELS_DICT[scope_name],
                                    is_training=is_training, freeze_norm=True,
                                    freeze=cfgs.FREEZE_BLOCKS)
    pyramid_dict = {}
    with tf.variable_scope('build_pyramid'):
        with slim.arg_scope([slim.conv2d], weights_regularizer=slim.l2_regularizer(cfgs.WEIGHT_DECAY),
                            activation_fn=None, normalizer_fn=None):
            P5 = slim.conv2d(feature_dict['C5'],
                             num_outputs=cfgs.FPN_CHANNEL,
                             kernel_size=[1, 1],
                             stride=1, scope='build_P5')
            pyramid_dict['P5'] = P5
            for level in range(4, 1, -1):  # build [P4, P3, P2]
                pyramid_dict['P%d' % level] = fusion_two_layer(C_i=feature_dict["C%d" % level],
                                                               P_j=pyramid_dict["P%d" % (level + 1)],
                                                               scope='build_P%d' % level)
            for level in range(5, 1, -1):
                pyramid_dict['P%d' % level] = slim.conv2d(pyramid_dict['P%d' % level],
                                                          num_outputs=cfgs.FPN_CHANNEL, kernel_size=[3, 3],
                                                          padding="SAME", stride=1, scope="fuse_P%d" % level,
                                                          activation_fn=tf.nn.relu if cfgs.USE_RELU else None)
            if "P6" in cfgs.LEVLES:
                P6 = slim.avg_pool2d(P5, kernel_size=[1, 1], stride=2, scope='build_P6')
                pyramid_dict['P6'] = P6
    # for level in range(5, 1, -1):
    #     add_heatmap(pyramid_dict['P%d' % level], name='Layer%d/P%d_heat' % (level, level))
    for i in range(cfgs.NUM_FPN):
        pyramid_dict = fpn(pyramid_dict, i)
    for i in range(cfgs.NUM_NAS_FPN):
        pyramid_dict = nas_fpn(pyramid_dict, i)
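In cfgs.py you set the default value to 0. Why is the fpn function executed again after the FPN network has already been built?
My understanding is that the nas_fpn function performs the FPN search, so what is the purpose of the fpn call before it, and of this parameter?
Thank you!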
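On the loop structure quoted above: a `for` loop over `range(0)` never runs its body, so when the count read from `cfgs.NUM_FPN` is 0 the `fpn` call is skipped entirely and only `nas_fpn` is applied. A minimal sketch of that control flow, where the stub functions and the `build` helper are hypothetical stand-ins that only record which stages ran (they are not the repository's implementations):

```python
def fpn(pyramid, i):
    # Hypothetical stand-in for the repo's fpn(): records that it was called.
    return pyramid + ["fpn_%d" % i]


def nas_fpn(pyramid, i):
    # Hypothetical stand-in for the repo's nas_fpn().
    return pyramid + ["nas_fpn_%d" % i]


def build(num_fpn, num_nas_fpn):
    """Mirror the two trailing loops of resnet_base with stub stages."""
    pyramid = ["base_pyramid"]
    for i in range(num_fpn):       # range(0) is empty, so fpn() is never called
        pyramid = fpn(pyramid, i)
    for i in range(num_nas_fpn):
        pyramid = nas_fpn(pyramid, i)
    return pyramid


stages = build(num_fpn=0, num_nas_fpn=2)
print(stages)
```

Under this reading, the count simply controls how many extra plain FPN stages are stacked before the NAS-FPN cells; what the authors intended it for is not stated in the thread.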
tensorflow 'contrib' module not found
How can I fix this issue in Anaconda?
Hi! Thanks for sharing the code! I'm sorry, but I urgently need test-dev results and don't have time to run the test and submit them myself. If you have results ready, it would be great if you could share them. Thank you so much!
Please clarify whether it is based on ResNet or NAS.
Is the search code available? I think the most important part of NAS-FPN is the search method. Could you update the code to include the FPN search?
Hi,
I tried "resnet50_v1d" for single-class object detection. But since the pretrained model was trained on 80 classes, I am unable to train it for a single class.
Kindly help.
Training on my own dataset achieves 0.95 mAP with FPN. Why does NAS-FPN reach only 0.3?
Select a configuration file in the folder ($PATH_ROOT/libs/configs/) and copy its contents into cfgs.py, then download the corresponding weights.
Hello, the link is broken. Could you share it again? Thank you!
Good work!
The FPN module adds two feature maps with different numbers of channels, which requires unifying their channel counts first. I cannot find in the paper how the channels of the backbone features are unified. In the original FPN design, 1*1 convs are used to reduce the number of channels. However, I notice that your code uses 3*3 2D convs to handle the inconsistent channel counts. Is this design taken from the paper, or is it your own?
Thanks!
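For context on the question above, a 1*1 convolution is a per-pixel linear projection across channels, which is why it is the usual way to make two maps addable. A minimal pure-Python sketch with toy sizes; the helper names and the weight values are hypothetical, and nothing here is taken from the repository or the paper:

```python
def conv1x1(feature, weight):
    """Apply a 1x1 convolution to an H x W x C_in map stored as nested lists.

    weight has shape C_in x C_out; every pixel is projected independently.
    """
    c_in, c_out = len(weight), len(weight[0])
    return [[[sum(px[c] * weight[c][o] for c in range(c_in))
              for o in range(c_out)]
             for px in row]
            for row in feature]


def add_maps(a, b):
    """Element-wise sum of two maps that already share H, W and C."""
    return [[[x + y for x, y in zip(pa, pb)] for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]


# Toy example: the backbone map has 4 channels, the pyramid map has 2.
c_map = [[[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]]   # 1 x 2 x 4
p_map = [[[10.0, 20.0], [30.0, 40.0]]]                    # 1 x 2 x 2
# Hypothetical 4 -> 2 projection weights; a trained layer would learn these.
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

reduced = conv1x1(c_map, w)        # 1 x 2 x 2, now matches p_map
fused = add_maps(reduced, p_map)
print(reduced, fused)
```

A 3*3 convolution can change the channel count in the same way while also mixing neighbouring pixels, at roughly nine times the weights of the 1*1 version.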
I want to use resnet101_v1d, but when I try to train the model on my dataset it fails because the model cannot be restored. How should I use it? Thanks.
Hi, it seems that there is no top-down path for NAS-FPN in the official implementation (https://github.com/tensorflow/tpu/blob/master/models/official/detection/modeling/architecture/nasfpn.py)? I am not sure about this.
Does this implementation attach NAS-FPN to the feature maps produced by the top-down path? The official code does not seem to have the top-down step. In my own PyTorch implementation, NAS-FPN loses accuracy without the top-down path. Am I misunderstanding something?
I am trying to train on my own dataset and have followed the three steps in README.md. I use the pretrained weights resnet_v1_50.ckpt and have modified $PATH_ROOT/libs/configs/cfgs.py, $PATH_ROOT/libs/label_name_dict/lable_dict.py and $PATH_ROOT/data/io/read_tfrecord.py; NET_NAME has been changed to "resnet_v1_50". However, after running "python ./tools/multi_gpu_train.py" I get this error message:
++--++--++--++--++--++--++--++--++--++--++--++--++--++--++--++--++--++--++--++--
./NAS_FPN_Tensorflow-master
WARNING:tensorflow:From multi_gpu_train.py:123: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
tfrecord path is --> ./NAS_FPN_Tensorflow-master/data/tfrecord/occlusion_recognition_train*
Traceback (most recent call last):
File "./envs/anaconda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1589, in _create_c_op
c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimensions must be equal, but are 256 and 384 for 'tower_0/build_pyramid/build_P4/add' (op: 'Add') with input shapes: [1,?,?,256], [1,?,?,384].
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "multi_gpu_train.py", line 352, in <module>
train()
File "multi_gpu_train.py", line 212, in train
gtboxes_batch=gtboxes_and_label)
File "../libs/networks/build_whole_network.py", line 386, in build_whole_detection_network
P_list = self.build_base_network(input_img_batch)
File "../libs/networks/build_whole_network.py", line 35, in build_base_network
return resnet.resnet_base(input_img_batch, scope_name=self.base_network_name, is_training=self.is_training)
File "../libs/networks/resnet.py", line 189, in resnet_base
scope='build_P%d' % level)
File "../libs/networks/resnet.py", line 63, in fusion_two_layer
add_f = 0.5*upsample_p + 0.5*reduce_dim_c
File "./envs/anaconda/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 847, in binary_op_wrapper
return func(x, y, name=name)
File "./envs/anaconda/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 297, in add
"Add", x=x, y=y, name=name)
File "./envs/anaconda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "./envs/anaconda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
op_def=op_def)
File "./anaconda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1756, in __init__
control_input_ops)
File "./envs/anaconda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1592, in _create_c_op
raise ValueError(str(e))
ValueError: Dimensions must be equal, but are 256 and 384 for 'tower_0/build_pyramid/build_P4/add' (op: 'Add') with input shapes: [1,?,?,256], [1,?,?,384].
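The traceback above reports that the two inputs to the add in `build_P4` have 256 and 384 channels. A possible cause, which the log alone does not confirm, is that the two branches were built with different output-channel settings. The sketch below only reproduces the shape rule behind the message; `check_fusable` is a hypothetical helper, and broadcasting is deliberately ignored to keep it short:

```python
def check_fusable(shape_a, shape_b):
    """Return True if two NHWC shapes can be added element-wise.

    None marks a dimension that is unknown at graph-construction time
    (printed as '?' in the log). Broadcasting is not modelled here.
    """
    if len(shape_a) != len(shape_b):
        raise ValueError("Rank mismatch: %d vs %d" % (len(shape_a), len(shape_b)))
    for dim_a, dim_b in zip(shape_a, shape_b):
        if dim_a is not None and dim_b is not None and dim_a != dim_b:
            raise ValueError(
                "Dimensions must be equal, but are %d and %d" % (dim_a, dim_b))
    return True


# Shapes copied from the error message above.
upsample_p_shape = [1, None, None, 256]
reduce_dim_c_shape = [1, None, None, 384]

try:
    check_fusable(upsample_p_shape, reduce_dim_c_shape)
    message = "ok"
except ValueError as err:
    message = str(err)
print(message)
```

Printing the static shapes of both branches just before the addition is a cheap way to see which side carries the unexpected channel count.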
I rebuilt the code following the repo and found that nas_fpn is much slower than the standard FPN method (about 10 times slower). Has anyone else reached a similar conclusion?
Code breaks with tensorflow 1.14.0, module tensorflow.contrib was removed
I had to move to 1.14 because it has a fix for something else.
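To my knowledge, `tensorflow.contrib` shipped with the 1.x releases and was dropped in TensorFlow 2.0, so when the module is reported missing it can be worth confirming which version is actually being imported in the active environment. A minimal sketch of such a check; the helper name `contrib_expected` is hypothetical, and in a real session the string would come from `tf.__version__`:

```python
def contrib_expected(tf_version):
    """Return True if the version string has a major version below 2.

    Assumption: tf.contrib is only available in pre-2.0 releases.
    """
    major = int(tf_version.split(".")[0])
    return major < 2


# Example version strings; replace with tf.__version__ in a real environment.
results = {v: contrib_expected(v) for v in ("1.13.1", "1.14.0", "2.0.0")}
for version, expected in results.items():
    print(version, "contrib expected:", expected)
```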
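Hello,
I connected your NAS-FPN with Cascade, but the loss never goes down.
What I actually did was put nas_fpn.py into the Cascade project and then slightly modify the resnet.py there.
But the loss stays at around 3 and does not decrease. Please give me some advice. Thank you.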
What changes do we have to make in order to train?
The paper says there is an RNN controller, but I don't see it. Does this code not implement the controller?
Hi,
I was trying to train the NAS-FPN model on my own classes, but the weights could not be loaded: Tensor name "resnet50_v1d/C1/conv0/BatchNorm/beta" not found in checkpoint files resnet50_v1d.ckpt.
Note:
NUM_CLASSES = 1
Image_size = (500,500)
Backbone I am using = resnet50_v1d.ckpt
Pre-trained detection model I am using = FPN_Res50_COCO_20190429_v3
Kindly help.
Hi~
I ran into a problem when using the pre-trained model you supplied, 'resnet50_v1d.ckpt'.
The error is:
NotFoundError (see above for traceback): Tensor name "resnet50_v1d/C2/bottleneck_0/conv0/BatchNorm/beta" not found in checkpoint files /home/DATA3/user-work/NAS_FPN_Tensorflow/data/pretrained_weights/resnet50_v1d.ckpt
[[Node: save/RestoreV2_15 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_15/tensor_names, save/RestoreV2_15/shape_and_slices)]]
So, is the pretrained model correct? How can I fix it?
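A "Tensor name ... not found in checkpoint" error means the graph asks for a variable name that the checkpoint file does not contain. One way to narrow it down is to compare the two sets of names. The sketch below does the comparison on plain lists; `missing_in_checkpoint` is a hypothetical helper and the checkpoint names are made up for illustration. In TensorFlow the real names could be read with `tf.train.list_variables(checkpoint_path)`, which returns `(name, shape)` pairs.

```python
def missing_in_checkpoint(model_var_names, ckpt_var_names):
    """Return the model variable names that the checkpoint does not provide."""
    available = set(ckpt_var_names)
    return sorted(name for name in model_var_names if name not in available)


# Name taken from the error message above, plus one illustrative neighbour.
model_names = [
    "resnet50_v1d/C2/bottleneck_0/conv0/BatchNorm/beta",
    "resnet50_v1d/C2/bottleneck_0/conv0/weights",
]
# Made-up checkpoint contents, only to show the comparison.
ckpt_names = [
    "resnet50_v1d/C2/bottleneck_0/conv0/weights",
]

missing = missing_in_checkpoint(model_names, ckpt_names)
print(missing)
```

If the missing names differ from the checkpoint's names only by a scope prefix or a layer-naming convention, the usual remedy is a name mapping when building the restorer rather than a different checkpoint.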
I have modified the code for Resnet_152_v1 and ran python multi_gpu_train.py. Only 1234MB of GPU memory is used and the GPU is not utilized (I have set GPU_GROUP = "0"), while CPU utilization is above 230%. No steps/epochs are shown. I waited for an hour and there was still no progress (refer to the screenshot below). Could you please suggest what is missing in my code and config?
Thanks for your work!
In the Global pooling operation, the roles of the two input feature maps are not equivalent. Given only Figure 6 and Figure 7 in the paper, I cannot tell which feature map plays the role of channel attention, and I could not find any underlying rule in your implementation. How do you determine the positions of the two feature maps at each Global pooling operation? Did you ask the authors?
Another small question: why do you take the mean value in the Global pooling operation when, according to the paper, it uses max pooling?
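To make the asymmetry raised above concrete, here is a sketch of one common attention-style formulation of a global pooling merge: one map is pooled to a per-channel gate that scales the other map, and the first map is then added back. This formulation is an assumption for illustration, not the paper's or the repository's confirmed definition; `global_pool` and `gp_merge` are hypothetical names, and both maps are assumed to share the same height, width and channel count.

```python
import math


def global_pool(fm, mode="mean"):
    """Reduce an H x W x C map (nested lists) to one value per channel."""
    pixels = [px for row in fm for px in row]
    channels = len(pixels[0])
    if mode == "mean":
        return [sum(px[c] for px in pixels) / len(pixels) for c in range(channels)]
    if mode == "max":
        return [max(px[c] for px in pixels) for c in range(channels)]
    raise ValueError("mode must be 'mean' or 'max'")


def gp_merge(attn_src, target, mode="mean"):
    """Scale target by sigmoid(global_pool(attn_src)), then add attn_src."""
    gate = [1.0 / (1.0 + math.exp(-v)) for v in global_pool(attn_src, mode)]
    return [[[g * t + a for g, t, a in zip(gate, t_px, a_px)]
             for t_px, a_px in zip(t_row, a_row)]
            for t_row, a_row in zip(target, attn_src)]


a = [[[0.0], [2.0]]]   # 1 x 2 x 1 toy map
b = [[[1.0], [3.0]]]

ab = gp_merge(a, b)    # a provides the channel attention
ba = gp_merge(b, a)    # b provides the channel attention
print(ab, ba)
```

Swapping the arguments changes the result, which is why the order of the two inputs matters, and switching `mode` between `"mean"` and `"max"` changes the gate values.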
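@yangJirui @yangxue0827 Asking the experts for help: is there anything fundamentally wrong with my parameter settings or training?
I am training DOTA_H with nas=2 and FPN_CHANNEL=256. The dataset consists of images cropped to size 800, 10428 of them randomly sampled, and the model is resnet50_v1d.
The learning rate settings are as follows: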
IMGNUM_IN_DATASET = 10428
EPSILON = 1e-5
MOMENTUM = 0.9
BATCH_SIZE = 1
WARM_SETP = 4*IMGNUM_IN_DATASET
#make lr bigger
LR = 5e-3 * 2 * len(GPU_GROUP.strip().split(',')) * BATCH_SIZE
MAX_ITERATION = 20*SAVE_WEIGHTS_INTE
DECAY_STEP = [13*IMGNUM_IN_DATASET, 20*IMGNUM_IN_DATASET] # 50000, 70000
MAX_ITERATION = 20*IMGNUM_IN_DATASET
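For reference, the concrete numbers these settings evaluate to can be worked out directly, assuming the products are meant as multiplications and that a single GPU is used (`GPU_GROUP = "0"` is an assumption, since the post does not state it):

```python
IMGNUM_IN_DATASET = 10428
BATCH_SIZE = 1
GPU_GROUP = "0"  # assumption: one GPU; not stated in the post

num_gpus = len(GPU_GROUP.strip().split(','))
LR = 5e-3 * 2 * num_gpus * BATCH_SIZE
WARM_SETP = 4 * IMGNUM_IN_DATASET
DECAY_STEP = [13 * IMGNUM_IN_DATASET, 20 * IMGNUM_IN_DATASET]
MAX_ITERATION = 20 * IMGNUM_IN_DATASET  # the later definition in the post wins

print("LR            =", LR)
print("WARM_SETP     =", WARM_SETP)
print("DECAY_STEP    =", DECAY_STEP)
print("MAX_ITERATION =", MAX_ITERATION)
```

With these values the second decay step coincides with `MAX_ITERATION`, so whether it ever takes effect depends on how the training script compares the step counter with the boundaries.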
The anchor settings are as follows:
USE_CENTER_OFFSET = True
LEVLES = ['P2', 'P3', 'P4', 'P5', 'P6']
BASE_ANCHOR_SIZE_LIST = [16, 32, 64, 128, 256]
ANCHOR_STRIDE_LIST = [4, 8, 16, 32, 64]
ANCHOR_SCALES = [1.0]
ANCHOR_RATIOS = [1, 1 / 2, 2., 1 / 3., 3., 5., 1 / 4., 4., 1 / 5., 6., 1 / 6., 7., 1 / 7.]
ROI_SCALE_FACTORS = [10., 10., 5.0, 5.0, 5.0]
ANCHOR_SCALE_FACTORS = None
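As a quick sanity check on the anchor settings, the number of anchors per feature-map location is the number of scales times the number of ratios. The sketch below also estimates the total per level for an 800 x 800 input, assuming each level's feature map has side `ceil(800 / stride)`; that rounding rule and the `anchors_per_level` helper are assumptions, not taken from the repository.

```python
import math

LEVLES = ['P2', 'P3', 'P4', 'P5', 'P6']
ANCHOR_STRIDE_LIST = [4, 8, 16, 32, 64]
ANCHOR_SCALES = [1.0]
ANCHOR_RATIOS = [1, 1 / 2, 2., 1 / 3., 3., 5., 1 / 4., 4., 1 / 5., 6., 1 / 6., 7., 1 / 7.]

anchors_per_location = len(ANCHOR_SCALES) * len(ANCHOR_RATIOS)


def anchors_per_level(image_size):
    """Estimate anchor counts per pyramid level for a square input image."""
    counts = {}
    for level, stride in zip(LEVLES, ANCHOR_STRIDE_LIST):
        side = math.ceil(image_size / stride)   # assumed feature-map side
        counts[level] = side * side * anchors_per_location
    return counts


counts = anchors_per_level(800)
total = sum(counts.values())
print(anchors_per_location, counts, total)
```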
The FPN settings are as follows:
SHARE_HEADS = True
KERNEL_SIZE = 3
RPN_IOU_POSITIVE_THRESHOLD = 0.7
RPN_IOU_NEGATIVE_THRESHOLD = 0.3
TRAIN_RPN_CLOOBER_POSITIVES = False
RPN_MINIBATCH_SIZE = 256
RPN_POSITIVE_RATE = 0.5
RPN_NMS_IOU_THRESHOLD = 0.7
RPN_TOP_K_NMS_TRAIN = 12000
RPN_MAXIMUM_PROPOSAL_TARIN = 2000
RPN_TOP_K_NMS_TEST = 6000
RPN_MAXIMUM_PROPOSAL_TEST = 1000
The NAS_FPN and Fast R-CNN settings are as follows:
--------------------------------------------NAS FPN config
NUM_FPN = 0
NUM_NAS_FPN = 2
USE_RELU = True
#FPN_CHANNEL = 384
FPN_CHANNEL = 256
-------------------------------------------Fast-RCNN config
ROI_SIZE = 14
ROI_POOL_KERNEL_SIZE = 2
USE_DROPOUT = False
KEEP_PROB = 1.0
SHOW_SCORE_THRSHOLD = 0.6 # only show in tensorboard
#R2CNN CONFIG
FAST_RCNN_NMS_IOU_THRESHOLD = 0.1 # 0.6
FAST_RCNN_NMS_MAX_BOXES_PER_CLASS = 150
FAST_RCNN_IOU_POSITIVE_THRESHOLD = 0.4
FAST_RCNN_IOU_NEGATIVE_THRESHOLD = 0.0 # 0.1 < IOU < 0.5 is negative
FAST_RCNN_MINIBATCH_SIZE = 256 # if is -1, that is train with OHEM
FAST_RCNN_POSITIVE_RATE = 0.35
ADD_GTBOXES_TO_TRAIN = True
USE_ATTENTION = False
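Looking at the TensorBoard output during training, the detection results seem acceptable. But when I export the ckpt and run detection, the confidence scores for all 15 classes are very low, and once the display threshold is applied there are no results left to show.
For example, these are the detection results for ship: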
P0484_0000_0000 0.046 563.9 113.7 784.6 328.9
P0484_0000_0000 0.046 563.9 433.7 784.6 648.9
P0484_0000_0000 0.032 666.6 26.5 776.4 127.9
P0484_0000_0000 0.032 23.8 16.0 133.8 139.7
P0484_0000_0000 0.031 665.6 341.6 776.1 449.2
P0484_0000_0000 0.031 665.6 629.6 776.1 737.2
P0484_0000_0000 0.031 121.0 16.2 229.1 138.9
P0484_0000_0000 0.031 537.1 16.3 645.1 138.9
P0484_0000_0000 0.031 409.1 16.3 517.1 138.9
P0484_0000_0000 0.031 313.1 16.3 421.1 138.9
P0484_0000_0000 0.031 217.1 16.3 325.1 138.9
P0484_0000_0000 0.031 772.2 738.8 798.9 786.4
P0484_0000_0000 0.030 55.8 657.7 171.0 777.6
P0484_0000_0000 0.030 0.9 741.6 55.0 798.4
P0484_0000_0000 0.030 215.6 657.5 330.7 777.6
After setting the display threshold for detection boxes to 0.03, all of the boxes have the same shape:
Hi, thanks for your code.
I am trying to implement NAS-FPN in PyTorch, but there seems to be a problem with my BN layers (in both the 1-stage (RetinaNet) and 2-stage versions). I freeze the BN layers in the r50 backbone and fine-tune the BN layers in NAS-FPN (initialized with weight=1, bias=0). Is that right?
The log looks like this: training seems OK, but the eval result is always 0 (0.001).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
2019-08-08 01:31:15,314 - INFO - Epoch [1][7330/7330] lr: 0.02000, bbox_mAP: 0.0000, bbox_mAP_50: 0.0000, bbox_mAP_75: 0.0000, bbox_mAP_s: 0.0000, bbox_mAP_m: 0.0000, bbox_mAP_l: 0.0000, bbox_mAP_copypaste: 0.000 0.000 0.000 0.000 0.000 0.000
2019-08-08 01:31:39,196 - INFO - Epoch [2][50/7330] lr: 0.02000, eta: 8:01:08, time: 0.477, data_time: 0.070, memory: 4057, loss_rpn_cls: 0.0640, loss_rpn_bbox: 0.0223, loss_cls: 0.1917, acc: 96.7520, loss_bbox: 0.0561, loss: 0.3341
2019-08-08 01:31:56,286 - INFO - Epoch [2][100/7330] lr: 0.02000, eta: 8:00:42, time: 0.342, data_time: 0.007, memory: 4057, loss_rpn_cls: 0.0718, loss_rpn_bbox: 0.0185, loss_cls: 0.2137, acc: 96.5977, loss_bbox: 0.0619, loss: 0.3659