
Comments (9)

geaxgx commented on July 18, 2024

@tristanle22 Thanks to the mediapipe API, it may seem to the user that the pose estimation is done in one step, but behind the scenes it is actually a two-step process, as explained here: https://google.github.io/mediapipe/solutions/pose#ml-pipeline
The solution utilizes a two-step detector-tracker ML pipeline, proven to be effective in our MediaPipe Hands and MediaPipe Face Mesh solutions. Using a detector, the pipeline first locates the person/pose region-of-interest (ROI) within the frame. The tracker subsequently predicts the pose landmarks and segmentation mask within the ROI using the ROI-cropped frame as input.
For this repo, I took direct inspiration from the mediapipe implementation, but just adapted it to the depthai library.
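In other words, the detector only runs when there is no ROI carried over from the previous frame; otherwise only the landmark model runs on the tracked ROI. A minimal sketch of that control flow (hypothetical function names, not the repo's actual API):

    def run_pipeline(frames, detect_pose_roi, regress_landmarks):
        # Detector-tracker loop: detect_pose_roi(frame) returns an oriented ROI or None,
        # regress_landmarks(frame, roi) returns (landmarks_or_None, roi_for_next_frame_or_None).
        roi = None
        for frame in frames:
            if roi is None:
                roi = detect_pose_roi(frame)                # step 1: person/pose detection
                if roi is None:
                    yield None                              # nobody found in this frame
                    continue
            landmarks, roi = regress_landmarks(frame, roi)  # step 2: landmark regression on the ROI crop
            yield landmarks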

from depthai_blazepose.

geaxgx commented on July 18, 2024

This alignment is done either from the detection stage or from the previous frame's keypoints.
It consists of calculating the center and rotation of what I call in my code comments the "rectangle" or "rotated rectangle" (actually it is a square).
In host mode, this is done in:

def detections_to_rect(body, kp_pair=[0,1]):

In edge mode, the corresponding code is (a host-side Python paraphrase is given after these snippets):

  • when calculating from the detection stage:
    # sqn_rr_center_x/y and sqn_scale_x/y are the two keypoints inferred by the
    # detection model (normalized image coordinates)
    scale_center_x = sqn_scale_x - sqn_rr_center_x
    scale_center_y = sqn_scale_y - sqn_rr_center_y
    sqn_rr_size = 2 * ${_rect_transf_scale} * hypot(scale_center_x, scale_center_y)
    rotation = 0.5 * pi - atan2(-scale_center_y, scale_center_x)
    rotation = rotation - 2 * pi * floor((rotation + pi) / (2 * pi))
  • when calculating from the previous frame keypoints:
    # Calculate the ROI for next frame
    rrn_rr_center_x = lms[next_roi_lm_idx] / 256
    rrn_rr_center_y = lms[next_roi_lm_idx+1] / 256
    rrn_scale_x = lms[next_roi_lm_idx+5] / 256
    rrn_scale_y = lms[next_roi_lm_idx+6] / 256
    sqn_scale_x, sqn_scale_y = rr2img(rrn_scale_x, rrn_scale_y)
    sqn_rr_center_x, sqn_rr_center_y = rr2img(rrn_rr_center_x, rrn_rr_center_y)
    scale_center_x = sqn_scale_x - sqn_rr_center_x
    scale_center_y = sqn_scale_y - sqn_rr_center_y
    sqn_rr_size = 2 * ${_rect_transf_scale} * hypot(scale_center_x, scale_center_y)
    rotation = 0.5 * pi - atan2(-scale_center_y, scale_center_x)
    rotation = rotation - 2 * pi * floor((rotation + pi) / (2 * pi))
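For readability, here is the same computation as a host-side Python sketch (a paraphrase of the snippets above, not a copy of the repo code; rect_transf_scale stands in for the ${_rect_transf_scale} template variable, and its default value here is only an assumption):

    from math import atan2, floor, hypot, pi

    def rect_from_keypoints(center_x, center_y, scale_x, scale_y, rect_transf_scale=1.25):
        # center_x/y: ROI center keypoint, scale_x/y: the second keypoint,
        # all in normalized [0, 1] image coordinates.
        dx = scale_x - center_x
        dy = scale_y - center_y
        # Side of the square ROI: twice the keypoint distance, enlarged by the scale factor
        size = 2 * rect_transf_scale * hypot(dx, dy)
        # Angle between the body "up" direction and the image vertical axis
        rotation = 0.5 * pi - atan2(-dy, dx)
        # Normalize the angle to [-pi, pi)
        rotation -= 2 * pi * floor((rotation + pi) / (2 * pi))
        return center_x, center_y, size, rotation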

from depthai_blazepose.

RCpengnan commented on July 18, 2024

Thank you for your reply! I'd like to ask you a few questions about the skeleton adjustment.
I found that some pose estimation algorithms currently can't detect inverted (upside-down) poses, but blazepose detects them very well. I think this is the advantage of the skeleton alignment, but I don't know how the skeleton is adjusted in the detection stage. Do I need to detect the hips first? If the skeleton can't be detected in an inverted pose, how can the keypoints of the hips be detected? Since the skeleton alignment is not explained in detail in the paper, could you please explain how the skeleton is adjusted in the detection stage? Thank you~

from depthai_blazepose.

geaxgx commented on July 18, 2024

I think this paragraph clarifies how the detection stage works: https://google.github.io/mediapipe/solutions/pose.html#personpose-detection-model-blazepose-detector
The pose detector is adapted from the mediapipe face detector. In addition to the face bounding box, the model infers 2 keypoints. One keypoint is an estimate of the mid hip center. The other, combined with the mid hip center keypoint, encodes the size and rotation of the whole body bounding box. It may look like a bit of magic, but the hips don't need to be visible in the image for the detection model to infer these 2 keypoints. For instance, blazepose works even on a close-up face picture. It makes sense because knowing the size and orientation of the face is enough to estimate a realistic body position and orientation.

I hope I answered your question.

from depthai_blazepose.

RCpengnan commented on July 18, 2024

Thank you for your reply, but I still have two questions.
1. I would like to know how these two additional keypoints are inferred. Is it based on the method of the face detector? Do I have to understand the idea of face detection in order to understand where these two keypoints come from?
2. After the pose is aligned, the skeleton needs to be mapped back to the original pose when drawing it. I don't know if I understand correctly. If yes, I would like to ask where the code for this part is.
Thank you~

from depthai_blazepose.

geaxgx commented on July 18, 2024
  1. The detection model directly outputs the face bounding box and the 2 keypoints. In host mode, the parsing of the detection model output is done by this function:
    def decode_bboxes(score_thresh, scores, bboxes, anchors, best_only=False):

    More precisely, the model outputs an array of 12 floats for each detected body (896 bodies max); a small unpacking sketch is given after this list:
  • the first 4 floats describe the face bounding box (center_x, center_y, width, height);
  • the last 8 floats correspond to 4 keypoints with 2 floats (x, y) per keypoint:
    • the first keypoint corresponds to the mid hip center;
    • the 2nd keypoint is used to encode the size and rotation of the body bounding box;
    • the 3rd and 4th were used in a previous version of blazepose but are not used anymore.
  2. Yes. The landmark regression model yields coordinates in the square rotated body bounding box, so we need to map them back into the image coordinate system if we want to draw the skeleton. In host mode, this is done here:
    # body.norm_landmarks contains the normalized ([0:1]) 3D coordinates of landmarks in the square rotated body bounding box
    body.norm_landmarks = lm_raw[:,:3]
    # Now calculate body.landmarks = the landmarks in the image coordinate system (in pixels)
    src = np.array([(0, 0), (1, 0), (1, 1)], dtype=np.float32)
    dst = np.array([ (x, y) for x,y in body.rect_points[1:]], dtype=np.float32) # body.rect_points[0] is left bottom point and points going clockwise!
    mat = cv2.getAffineTransform(src, dst)
    lm_xy = np.expand_dims(body.norm_landmarks[:self.nb_kps+2,:2], axis=0)
    lm_xy = np.squeeze(cv2.transform(lm_xy, mat))
    # A segment of length 1 in the coordinates system of body bounding box takes body.rect_w_a pixels in the
    # original image. Then we arbitrarily divide by 4 for a more realistic appearance.
    lm_z = body.norm_landmarks[:self.nb_kps+2,2:3] * body.rect_w_a / 4
    lm_xyz = np.hstack((lm_xy, lm_z))

    body.landmarks = lm_xyz.astype(np.int32)

    if self.pad_h > 0:
        body.landmarks[:,1] -= self.pad_h

    if self.pad_w > 0:
        body.landmarks[:,0] -= self.pad_w
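To make the 12-float layout of point 1 concrete, here is a small unpacking sketch. It is not the repo's decode_bboxes(), which also decodes the raw outputs relative to the SSD anchors and applies the score threshold; it only shows how one already-decoded detection vector is interpreted:

    import numpy as np

    def unpack_detection(det):
        # det: one decoded detection of 12 floats, in normalized image coordinates
        det = np.asarray(det, dtype=np.float32)
        cx, cy, w, h = det[0:4]            # face bounding box (center_x, center_y, width, height)
        kps = det[4:12].reshape(4, 2)      # 4 keypoints, (x, y) each
        mid_hip = kps[0]                   # keypoint 0: mid hip center
        size_rot_kp = kps[1]               # keypoint 1: encodes body bbox size and rotation
        # keypoints 2 and 3 are no longer used
        return (cx, cy, w, h), mid_hip, size_rot_kp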

from depthai_blazepose.

RCpengnan commented on July 18, 2024

Thank you for your reply!
Hello, I have looked up a lot of material, but it only says that the model infers these two keypoints without explaining in detail how the inference works. The code is a little complicated and I don't understand it very well. Could you please explain the idea behind getting these two keypoints?

from depthai_blazepose.

geaxgx commented on July 18, 2024

The idea of getting these two keypoints is exactly what I said in my previous message. I don't know how to explain it differently. The 2 keypoints are, among other things, inferred by the detection model. The decode_bboxes() function is just processing the model output to store the information in an instance of a Body class. At the end, the 1st keypoint (mid hip center) is stored in Body.pd_kps[0] and the second keypoint is stored in Body.pd_kps[1] as normalized coordinates (between 0 and 1).
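If it helps to visualize them, here is a small hypothetical helper (it assumes, for illustration, that each entry of body.pd_kps is an (x, y) pair of normalized coordinates; check the Body class for the exact layout):

    import cv2

    def draw_detection_keypoints(frame, body):
        # body.pd_kps[0]: mid hip center, body.pd_kps[1]: size/rotation keypoint,
        # both assumed to be normalized (x, y) pairs as described above.
        h, w = frame.shape[:2]
        mid_hip = (int(body.pd_kps[0][0] * w), int(body.pd_kps[0][1] * h))
        size_rot_kp = (int(body.pd_kps[1][0] * w), int(body.pd_kps[1][1] * h))
        cv2.circle(frame, mid_hip, 5, (0, 255, 0), -1)
        cv2.circle(frame, size_rot_kp, 5, (0, 0, 255), -1)
        cv2.line(frame, mid_hip, size_rot_kp, (255, 255, 0), 2)
        return frame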

from depthai_blazepose.

tristanle22 commented on July 18, 2024

Hi there! I have a question somewhat relevant to this discussion. In this tutorial, https://google.github.io/mediapipe/solutions/pose#python-solution-api, you can obtain the pose with mediapipe.solutions.pose by directly passing in the image. However, in your implementation you keep the pose detection and landmark regression as 2 separate steps in the pipeline, basically re-implementing a feature from mediapipe. May I ask what the reason behind this is?
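For reference, the single-call usage I'm referring to looks roughly like this (standard mediapipe Python solution API; the image path is just a placeholder):

    import cv2
    import mediapipe as mp

    with mp.solutions.pose.Pose(static_image_mode=True) as pose:
        image = cv2.imread("person.jpg")                           # placeholder path
        results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            print(len(results.pose_landmarks.landmark), "landmarks detected")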

from depthai_blazepose.
