kxhit / eschernet

[CVPR2024 Oral] EscherNet: A Generative Model for Scalable View Synthesis

Home Page: https://kxhit.github.io/EscherNet

License: Other

Python 99.65% C++ 0.01% Cuda 0.31% C 0.02% Shell 0.01%

3d-generation 3d-reconstruction diffusion generative-model novel-view-synthesis cvpr2024

eschernet's Issues

Memory requirements

Hi, I'd like to know how much memory EscherNet requires. Is an A100 needed, or will an RTX 3090 with 24 GB be enough?

3D Reconstruction for text-to-3D

When will this be available?

Generate target images from a single top-view reference image

Hi, thanks for your amazing work.

I am conducting research on using a single top-view image to generate an entire object. I've tried many models, including Zero-123 XL and Stable Zero-123, as SDS priors. However, none of them can faithfully generate the 3D object.

I have also tried your model to generate 100 target views from 1 reference view. However, for buildings, the outputs are not very good, especially for low-elevation views, as shown in the attached figure. Do you think this is because the model was not trained with top-view images?

Another question: do you think it is possible or suitable to use EscherNet as an SDS prior?

Again, thanks for your amazing work!

Why Use Camera Distance as a Dividing Form?

First of all, I want to express my gratitude for your outstanding research. I thoroughly enjoyed reading your paper.

However, I have a question regarding the encoding used in your work. I noticed that the encoding equation for [azimuth, elevation, orientation] differs from that of [camera distance]. Specifically, I am unsure why the camera distance requires a dividing form. Could you help me understand the reasoning behind this choice? I feel like I might be missing an important detail.

Thank you very much for your assistance.

Gradio Demo

Hi, great work!

Do you plan to release the code to do inference on real-world objects?

file not found

train_eschernet.py can't be found in the folder shown in readme.md.

Result on Franka dataset

Dear authors,
First of all, thank you for the amazing work.
I am trying to use the model on Franka16, but it's not clear to me how to do it.
First, the elevations annotated in the dataset do not seem correct: I see images labeled with elevation 0° whose actual elevation is greater than 0°.
Second, I would like to generate new views on a circular trajectory around the object at a fixed elevation.
I changed the code for the output poses, reusing the lines of code that handle the input views. However, the elevation of the produced images seems wrong to me.
This happens with the 6DoF model, while the 4DoF model does not work at all.
Could you please give me some suggestions on how to provide the angle information correctly?

Best regards,
Giuseppe
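
For readers with the same question, here is a minimal sketch of one way to build output poses on a fixed-elevation circle and to check the elevation a pose actually encodes. The axis convention (world z-up, OpenGL-style camera with +z pointing away from the target), the function names, and the radius are assumptions made for illustration. None of them are taken from the EscherNet code, so the matrices may need converting to the convention the 6DoF model expects.

```python
import numpy as np

def look_at_pose(azimuth_deg, elevation_deg, radius):
    """Camera-to-world 4x4 matrix for a camera on a sphere, looking at the origin.

    Assumed convention (not EscherNet's): world z-up; camera x = right,
    y = up, z = backward (OpenGL style).
    """
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    eye = radius * np.array([np.cos(el) * np.cos(az),
                             np.cos(el) * np.sin(az),
                             np.sin(el)])
    backward = eye / np.linalg.norm(eye)          # camera +z points away from the origin
    right = np.cross(np.array([0.0, 0.0, 1.0]), backward)
    right /= np.linalg.norm(right)                # degenerate at elevation = +/-90 degrees
    up = np.cross(backward, right)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = right, up, backward, eye
    return pose

def circular_trajectory(n_views, elevation_deg, radius):
    """Stack of poses evenly spaced in azimuth at one fixed elevation."""
    azimuths = np.linspace(0.0, 360.0, n_views, endpoint=False)
    return np.stack([look_at_pose(a, elevation_deg, radius) for a in azimuths])

def elevation_of(pose):
    """Elevation in degrees implied by the camera position of a camera-to-world pose."""
    position = pose[:3, 3]
    return np.rad2deg(np.arcsin(position[2] / np.linalg.norm(position)))

poses = circular_trajectory(n_views=8, elevation_deg=30.0, radius=1.5)
```

Running `elevation_of` on the dataset's annotated poses is one way to test whether the labeled elevations match the camera positions, provided the dataset uses the same up axis.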

License?

Amazing work, congrats!

Dust3r pose processing

Dear authors,
Thank you for the great work! I just checked the online demo that you released a few days ago. I have some questions regarding the use of Dust3r to compute initial input poses for EscherNet:

Do you apply any modifications to the output computed by Dust3r to obtain an orthogonal canonical frame? If yes, could you provide more details about them?
For reproducibility, could you point to the line in Dust3r's code from which you take and save the computed poses assigned to each input image for EscherNet?

Thanks in advance

Question about intrinsics for 3D reconstruction

Hi! Really nice work.

I'm using EscherNet 6DoF, and in my dataset I need to use different intrinsics for different images. I guess that is not an issue for the NeuS renderer.

My question, however, is about both the intrinsics and the range used by Objaverse:

EscherNet/6DoF/dataset.py

Lines 82 to 85 in 569240f

    downscale = 512 / 256.
    self.fx = 560. / downscale
    self.fy = 560. / downscale
    self.intrinsic = torch.tensor([[self.fx, 0, 128., 0, self.fy, 128., 0, 0, 1.]], dtype=torch.float64).view(3, 3)

EscherNet/3drecon/renderer/renderer.py

Line 493 in 569240f

self.K = np.array([[280.,0.,128.],[0.,280.,128.],[0.,0.,1.]], dtype=np.float32)

Do I need to use those specific Objaverse intrinsics or is there a workaround?

All the best,

Alberto
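
The two snippets quoted in this issue appear to describe the same camera: a focal length of 560 at 512 pixels becomes 280 at 256 pixels, with the principal point at 128. Below is a minimal sketch of that rescaling, assuming a uniform resize and a principal point at the image center. `scale_intrinsics` is a hypothetical helper, not part of the EscherNet repository, and the sketch does not address whether the pretrained model tolerates focal lengths other than the Objaverse one.

```python
import numpy as np

def scale_intrinsics(K, src_size, dst_size):
    """Rescale a 3x3 pinhole intrinsic matrix for a uniformly resized square image.

    Hypothetical helper: fx, fy, cx and cy all scale by dst_size / src_size.
    """
    scale = dst_size / src_size
    K_scaled = np.asarray(K, dtype=np.float64).copy()
    K_scaled[:2, :] *= scale   # first two rows hold fx, fy, cx, cy
    return K_scaled

# Assumed 512x512 intrinsics: focal length 560 (from the dataset snippet)
# and a principal point at the image center (an assumption).
K_512 = np.array([[560.0, 0.0, 256.0],
                  [0.0, 560.0, 256.0],
                  [0.0, 0.0, 1.0]])

K_256 = scale_intrinsics(K_512, src_size=512, dst_size=256)
```

Under these assumptions `K_256` has a focal length of 280 and a principal point of 128, the same values as the matrix hard-coded in the renderer. Intrinsics from another dataset can be rescaled the same way before comparing them with the Objaverse values.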

Training details

Hello! Congratulations on the great work.
I have one question about the training process. In Section 3.1 you say "It builds upon an existing 2D diffusion model, inheriting its strong web-scale prior through large-scale training". However, the rest of the paper does not make clear whether the overall architecture is trained from scratch on the Objaverse dataset (rendered as Zero123 does) or fine-tuned starting from some pre-trained modules of Stable Diffusion. Could you please clarify?
Thanks in advance.

Novel view synthesis from a single image

Dear Authors
First, thank you for your interesting and amazing work.

I would like to perform novel view synthesis using a single RGBA image of my own. In this case, should the data_type be Text2Img?
If I provide only one image as input, will it work with the other data_type options as well?

Thanks for the great work!!!
