
Comments (11)

m-niemeyer commented on July 24, 2024

Hi @Kai-46, thanks a lot for your interest! Here are some answers to your questions:

World Matrix (Rt): We use "Rt" directly from the DTU dataset - if you download it, the matrices are provided in "Calibration/cal18/pos_***.txt". If you apply Rt to a point (x, y, z) and get (x', y', z'), the pixel location is (u, v) = (x', y') / z'. These values lie in the ranges [0, W] and [0, H], where W and H are the image resolution.

Camera Matrix (K): To be independent of the image resolution, we have the convention that we scale the pixel locations to [-1, 1] for all datasets. This is useful, e.g. for getting the ground truth pixel values in PyTorch with grid_sample. In the DTU case, we then only have to shift and rescale such that the ranges (see above) change to [-1, 1].

Scale Matrix (S): Finally, we use a scale matrix in our project. The DTU dataset does not use a canonical world coordinate system, so objects can be at very different locations. However, we want to center the object / volume of interest in the unit cube, which we do via the scale matrix S. The inverse S^-1 maps our volume of interest from the DTU world coordinates to the unit cube. We did not merge this matrix with Rt so that we can still transform points back to the DTU world.
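
For illustration only (this is an assumption about how such a matrix could be built, not code taken from the repository): if the volume of interest has a center and radius in DTU world coordinates, a scale matrix mapping the unit cube to that volume could look like this:

import numpy as np

# Hypothetical construction of a scale matrix S (unit cube -> DTU world).
# center and radius describe the volume of interest in DTU world coordinates.
def make_scale_mat(center, radius):
    S = np.eye(4)
    S[:3, :3] *= radius   # scale the unit cube to the object's extent
    S[:3, 3] = center     # translate it to the object's location in the world
    return S

# S^-1 (np.linalg.inv(S)) then maps points from the DTU world into the unit cube.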

How to transform pixels to 3D points and vice-versa: We always use homogeneous coordinates, so you can transform a homogeneous 3D point p from "our" world (the unit cube) to pixel coordinates (in [-1, 1]) by first calculating p_out = K @ Rt @ S @ p and then (u, v) = p_out[:2] / p_out[2]. This is exactly how we do it in the code. The @ means matrix multiplication.
For the other direction, you can transform a homogeneous pixel (u, v, 1, 1) to the world by first multiplying it with the depth value, pixel[:3] *= depth_value, and then going in the other direction: p = S^-1 @ (Rt)^-1 @ K^-1 @ pixel. Here is the respective code.
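
As a small numpy sketch of both directions (assuming 4x4 homogeneous matrices K, Rt and S as described above; the helper names are for illustration only):

import numpy as np

def project(p, K, Rt, S):
    # homogeneous 3D point p (shape (4,)) in the unit cube -> pixel coordinates in [-1, 1]
    p_out = K @ Rt @ S @ p
    return p_out[:2] / p_out[2]   # (u, v)

def unproject(u, v, depth_value, K, Rt, S):
    # pixel (u, v) in [-1, 1] with known depth -> homogeneous 3D point in the unit cube
    pixel = np.array([u, v, 1.0, 1.0])
    pixel[:3] *= depth_value
    return np.linalg.inv(S) @ np.linalg.inv(Rt) @ np.linalg.inv(K) @ pixel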

I hope this helps a little. Good luck with your research!


m-niemeyer commented on July 24, 2024

Why do you project here without scale: Because we directly transform npz_file['points'], which, as indicated before, are the sparse keypoints from SfM and therefore "live in world space".

Why do you transform 3D points using the inverted scale matrix here: As indicated above, "p lives in world space", and scale_mat defines the transformation from the unit cube to the world, so if we want to go from the world to the unit cube, we have to apply the inverse transformation. As a diagram:

(Unit Cube) - scale_mat -> (World) - world_mat -> (Camera) - camera_mat -> (Image)

Is my transformation correct: No. For projecting to the image, you would have to do what we do here from L126 to L129, i.e. you have to remove the scale_mat from your line

projected = (camera_mat @ world_mat @ scale_mat @ p.T).T
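
For reference, the corrected projection would then be (a sketch following the suggestion above, with p holding homogeneous world-space points as rows):

projected = (camera_mat @ world_mat @ p.T).T
projected = projected[:, :2] / projected[:, 2:3]   # normalized pixel coordinates in [-1, 1]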


Kai-46 commented on July 24, 2024

Thanks for the clarification! It's really helpful. The definition of K, R, t is quite different from OpenCV's. In OpenCV, R is a 3x3 orthonormal matrix, which I think matches the usual notion of a rotation matrix. A 3D point x (a 3x1 vector) in the world coordinate system is first transformed to the camera coordinate system via Rx + t, then projected to image space via (u, v, 1)' = K(Rx + t), where K is a 3x3 intrinsic matrix containing the focal length and principal point. In this project, Rt looks like the product of OpenCV's K and Rt (augmented to 4x4), i.e. a projection matrix, while K normalizes pixel coordinates to [-1, 1]. This is the first time I have seen this notation. Thanks again for the explanation.
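
To relate the two conventions more concretely, here is a rough sketch (an assumption for illustration only; K_cv, R, t are OpenCV-style intrinsics/extrinsics, W and H the image size, and the exact matrices stored by the project may differ, e.g. in axis orientation):

import numpy as np

def opencv_to_dvr_style(K_cv, R, t, W, H):
    # world_mat: the 4x4 augmented OpenCV projection matrix K_cv @ [R | t]
    P = K_cv @ np.hstack([R, t.reshape(3, 1)])        # 3x4
    world_mat = np.vstack([P, [0.0, 0.0, 0.0, 1.0]])  # 4x4

    # camera_mat: shifts and rescales pixel coordinates from [0, W] x [0, H] to [-1, 1].
    # The -1 entries sit in the z column so that after the perspective divide
    # (u', v') = (x', y') / z' we obtain 2u/W - 1 and 2v/H - 1.
    camera_mat = np.array([
        [2.0 / W, 0.0,     -1.0, 0.0],
        [0.0,     2.0 / H, -1.0, 0.0],
        [0.0,     0.0,      1.0, 0.0],
        [0.0,     0.0,      0.0, 1.0],
    ])
    return world_mat, camera_mat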


m-niemeyer commented on July 24, 2024

@Kai-46 , yes, you are right - for the DTU dataset, it is basically that product because we want to stick to the DTU data. For e.g. the ShapeNet renderings, the matrices should be what you have in mind, except that we define the image pixels in [-1, 1] instead of [0, W] and [0, H]. I hope this helps. Good luck!


Kai-46 commented on July 24, 2024

Good luck to you as well! One minor suggestion: if you could add some text describing your coordinate-system conventions to the README, it might help others as well. My personal experience with 3D reconstruction is that coordinate systems can be quite a headache without knowing a priori which convention is adopted, as there seem to be many different conventions :-) Personally, I work with the OpenCV or OpenGL conventions most of the time.


m-niemeyer commented on July 24, 2024

Thanks for the suggestion! I had something like this in mind - I will do it when I find time. If you have no further questions, you can go ahead and close the issue - thanks!


cortwave commented on July 24, 2024

Hi @m-niemeyer , I also have some difficulty understanding the camera format. As I understood from this thread, after projecting DTU object points to a camera we should get values in [-1, 1]. I wrote the following code for testing purposes:

import numpy as np

scan = 118
cameras = np.load(f'differentiable_volumetric_rendering/data/DTU/scan{scan}/scan{scan}/cameras.npz')
points = np.load(f'differentiable_volumetric_rendering/data/DTU/scan{scan}/scan{scan}/pcl.npz')['points']
# to homogeneous coordinates
points = np.hstack([points, np.ones((points.shape[0], 1))])

# let project points to camera 10
idx = 10
world_mat = cameras[f'world_mat_{idx}']
camera_mat = cameras[f'camera_mat_{idx}']
scale_mat = cameras[f'scale_mat_{idx}']

projected = (camera_mat @ world_mat @ scale_mat @ points.T).T
# from homogeneous to 2d
projected = projected[:, :2] / projected[:, 2:3]

As a result, I got projected values outside the [-1, 1] range. Can you please clarify what I'm doing wrong in the code above?


m-niemeyer commented on July 24, 2024

Hi @cortwave , thanks for your post.

In theory what you are doing is correct! However, the points you look at

points = np.load(f'differentiable_volumetric_rendering/data/DTU/scan{scan}/scan{scan}/pcl.npz')['points']

is the array of sparse keypoints which is a by-product of Structure-from-Motion. We use this in our project for investigating sparser types of depth supervision instead of a full depth map. In Section 3.5.3 of our supplementary, we write
"Another type of supervision which one encounters often in practice is the even sparser output of Structure-from-Motion (SfM). In particular, this is a small set of 3D keypoints with visibility masks for each view mainly used for camera pose estimation."

To train such a model, you have to use one of the ours_depth_sfm.yaml configs from the multi-view supervision experiments.

Now, coming back to your question: you have to filter the points so that only those visible in the respective image remain before projecting them into the view; otherwise, you project all keypoints into the view, but many of them will lie outside the image or could be occluded. For more details, please have a look at our data field to see how we process the points.


cortwave commented on July 24, 2024

@m-niemeyer thank you for your response. I've changed my code according to your notes.

import numpy as np

scan = 118
cameras = np.load(f'/home/cortwave/projects/differentiable_volumetric_rendering/data/DTU/scan{scan}/scan{scan}/cameras.npz')
npz_file = np.load(f'/home/cortwave/projects/differentiable_volumetric_rendering/data/DTU/scan{scan}/scan{scan}/pcl.npz')

idx = 10
p = npz_file['points']
is_in_visual_hull = npz_file['is_in_visual_hull']
c = npz_file['colors']
v = npz_file[f'visibility_{idx:>04}']

p = p[v][is_in_visual_hull[v]]
c = c[v][is_in_visual_hull[v]]

p = np.hstack([p, np.ones((p.shape[0], 1))])

world_mat = cameras[f'world_mat_{idx}']
camera_mat = cameras[f'camera_mat_{idx}']
scale_mat = cameras[f'scale_mat_{idx}']

projected = (camera_mat @ world_mat @ scale_mat @ p.T).T
projected = projected[:, :2] / projected[:, 2:3]
print(np.min(projected), np.max(projected))

But it still projects points outside the [-1, 1] range; e.g. it prints 0.387742314717342 2.184980055074516 for scan 118 and image 10.
Also, I have trouble understanding why here you project points without the scale matrix, while here you transform 3D points to the unit cube using the inverted scale matrix.


cortwave commented on July 24, 2024

Oh, thank you! I think I now understand these coordinate transformations. I didn't know that the points in the npz file are already in the unit cube. Thank you for the clarification!


tiexuedanxin commented on July 24, 2024

Hello, I have created my own dataset. Could you give me some advice on how to calculate the scale matrix? Thanks very much.
