Giter VIP home page Giter VIP logo

ssd's Introduction

SSD single short object detector

basic concepts:

  • box, bounding box (contains an objec)
  • boundary coordinates (x_min, x_max, y_min, y_max), center size coordinates (x_c, y_c, width, height)
  • jaccard index 交并比
    • 1 means identiy, 0 means exclusive
  • multibox
    • return 2 things, corordinates and class-specifc score (plus background class)

rodeo

  • priors
    • some box manually assigned across the image
    • match patches on feature maps of some base and aux conv layers
    • they are fixed
  • base conv, auxiliary conv, prediction conv
    • base conv, adjust
    • auxiliary conv, to get multi scale info (understand how different stages match different region in the images)
    • prediction conv, loc & cls
  • how to match predictions (bounding box and class score) to the GT
    • rules of 'postive' and 'negative' priors
      • jaccard index (J) < 0.5 with any GT, negative
      • J > 0.5 with a GT, positive
      • what if J > 0.5 with more than one object?
    • find the targets: predictions - priors - GT
      • target for loc: positive - GT's coordinate; negative: dropped
      • tartget for cls: positive - GT's class; negative: backgroud class hard negative mining
  • dealing with the redundant predictions (during test time)
    • none maximum suppression

implementation (SSD300)

  • base-conv

    • load pretrained model in keras (vgg16 including top)

    • a new model, which:

      (4 modifications)

      • converts the fc_6 and fc_7 to conv layer, and drop fc_8
      • modify the input shape to (300, 300, 3)
      • 3rd pooling layer's output size
      • 5th pooling layer from 2x2_s2 to 3x3_s1
  • aux-conv

    • add 4 more layers (each contains 2 conv layers, downsampling as going)
  • predictiion conv

    • from 6 feature maps, total 8732 priors
    • for each feature map (N priors per tile, C+1 classes)
      • assign N*(C+1) kernels to classify
      • assign N*4 kernels to locate
    • 3x3 kernel is used, padding='same'
    • output is 8732 * (C+1+4)
  • matching the prediction output to GT

    • calculate the jarccard index between the 8732 priors and N GT objects, for each prior, get the max jarccard index

    • maximum > 0.5, positive match with the object

      (what if match more than one object?)

    • otherwise, negative match

    • then we got prediction-prior-object match

    • for each match, we have

      • type: postive or negative
      • class: bgd if negative else coresponding class
      • coordinate: GT coordinate if positive else none
  • loss

    • loss - cls
      • hard negative mininig, positive : negative = 3 : 1
      • averaged by num_positive
    • loss - loc vanilla
    • sum
  • test time

    • for each none-bgd class:
      • for each candidate
        • eliminate boxes that do not meet a certain threshold for this score
      • rank by higher likelihood, eliminate all candidates that have a jaccard overlap with current candidate of more than a threshold, until the end

tour

  • voc 数据集下载地址

  • 目标检测数据集下载 https://www.cnblogs.com/zi-wang/p/12325664.html

  • 最后使用tfds.load, for each example:

    • 'image' (original shape) uint8
    • 'objects' 'bbox' (N, 4) float32
    • 'objects' 'is_difficult' (N,) bool
    • 'objects' 'label' (N,) int64
  • One of the most tough thing to do is to prepare the dataset. We need to:

    • get the GT bounding boxes of each image, and create a label to match the output of SSD according to the bounding boxes. (We need to generate sth of a certain shape (probably like (8732, ...)) for each image)
    • generate a marix with shape == (8732, 4) to represent the location of the 8732 priors on one image, [x_c, y_c, w, h]
    • calculate the jarccard index between the 8732 priors and N GT bounding box. Then we have two rules to match the priors to GT bounding box
    1. for each GT, find the best match priors

    2. for each prior, get the max IoU, if:

      • 0.5, Positive match with the GT bounding box

      • <0.5, negative match with any GT bounding box

    • after matching, for each prior, we got match type(positive or negative), target label, target coords.
    • TPU can only access files in GCS, this is annoying because its impossible to read downloaded file to build a dataset when using TPU. We can only use tfds.
    • Finally, we realized that the matching work should be done in the loss calculation or prepare stage, since the dataset is universe for different object detection methods and should not restrict to some specific methods
  • transform:

    • parse the dict
    • resize
    • normalize by imagenet mean and std
    • data augmentation
    • match(anchors to gt)
  • model:

    • base conv
    • aux conv
    • prediction conv
  • Loss:

    • cls
    • loc
  • defination:

    • example

      : an elem in the dataset, is a dict

      • form:
        • 'image' (original shape) uint8
        • 'objects' 'bbox' (N, 4) float32
        • 'objects' 'is_difficult' (N,) bool # not for now
        • 'objects' 'label' (N,) int64
        • other things
      • flow:
        • from dataset
    • anchor: a group of bboxes that is pre-defined in a detection method

      • form:
        • a list of dict, for each dict:
          • 'bbox'
    • iou (jaccard index)

      • cal:
        • its easy to implement in the bc form
    • bbox: a bounding box

      • form:
        • can be specified by bc or cc
      • type:
        • bbox of a anchor (8732): prior or default box
        • bbox of a pred (8732): bboxes the model proposed
        • bbox of a gt (N): the real ones
        • bbox of a match (8732): the real ones after matching
    • target: the gt_bbox for each anchor_bbox

      • form:
        • a list of dict, each dict corresponds to a anchor, for each dict:
          • 'label': semi-positive value, 0 means bgd, otherwise coresponding label
          • 'offset': from the anchor to the gt_bbox
      • flow:
        • match anchors to gtbboxes
        • there are 2 rules:
          1. anchor wise
          2. gtbbox wise
        • should be done in the transform
    • pred(prediction):

      • form:
        • a list of dict , for each dict (pred):
          • 'label': semi-positive value
          • 'pred_bbox'
    • SSDLoss: calculate the loss according to the SSD paper by

      • cls loss: positive + hard negative
      • loc loss:
    • coords: coordinates

      • form:
        • bc (boundary coordinates): in the form of (x_min, y_min, x_max, y_max)
        • cc (center-size coordinates): in the form of (x_c, y_c, w, h)
      • flow:
        • bc form is used in the dataset
        • in the transform, coords might be changed, in cc form at last

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.