def sds_loss(self, pred_depth, rgb_in, bs, view_num, guidance_scale=100, as_latent=False, grad_scale=1,
             save_guidance_path=None):
    """
    Score Distillation Sampling (SDS) loss on the predicted depth.

    pred_depth: the predicted depth, normalized to [-1, 1] (nearer is smaller),
        of size (bs * view_num, 1, h, w)
    rgb_in: the conditioning image, normalized to [-1, 1], of size (bs * view_num, 3, h, w)
    bs: batch size, i.e. the number of scenes
    view_num: number of views per scene; all views of a scene share one timestep
    guidance_scale: classifier-free guidance weight
    grad_scale: scaling factor applied to the SDS gradient
    as_latent, save_guidance_path: accepted for interface compatibility, unused here
    """
    # Lazily cache the cumulative alpha schedule on first use.
    if self.alphas is None:
        self.alphas = self.scheduler.alphas_cumprod.to(self.device)
    device = self.device

    # Encode the image and the depth into latent space; the single-channel
    # depth map is repeated to 3 channels so the RGB encoder accepts it.
    pred_depth = pred_depth.repeat(1, 3, 1, 1)
    rgb_latent = self.encode_rgb(rgb_in)
    depth_latent = self.encode_rgb(pred_depth)
    # Sample one diffusion timestep per scene, then repeat it so that all
    # view_num views of the same scene share that timestep.
    t = torch.randint(self.min_step, self.max_step + 1, (bs,), dtype=torch.long,
                      device=self.device)
    t = t.unsqueeze(-1).repeat(1, view_num).view(-1)  # (bs * view_num,)
    with torch.no_grad():
        # Sample Gaussian noise and diffuse the depth latent to timestep t.
        latent_noise = torch.randn(
            rgb_latent.shape,
            device=device,
            dtype=self.dtype,
            generator=None,
        )  # [B, 4, h, w]
        latents_noisy = self.scheduler.add_noise(depth_latent, latent_noise, t)
        # Predict the noise. For classifier-free guidance the unconditional
        # branch zeroes out the RGB latent; both branches are stacked into a
        # single batch so the UNet runs one forward pass.
        uncond_latent = torch.cat([torch.zeros_like(rgb_latent), latents_noisy], dim=1)
        cond_latent = torch.cat([rgb_latent, latents_noisy], dim=1)
        latent_model_input = torch.cat([uncond_latent, cond_latent], dim=0)
        tt = torch.cat([t] * 2)
        # Batched empty text embedding (the prompt is always empty; the
        # conditioning comes from the RGB latent instead).
        if self.empty_text_embed is None:
            self.encode_empty_text()
        batch_empty_text_embed = self.empty_text_embed.repeat(
            (latent_model_input.shape[0], 1, 1)
        ).to(device)  # [2B, 2, 1024]
        noise_pred = self.unet(
            latent_model_input, tt, encoder_hidden_states=batch_empty_text_embed
        ).sample  # [2B, 4, h, w]
        # Perform classifier-free guidance (high scale, as in the paper)
        noise_pred_uncond, noise_pred_pos = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_pos - noise_pred_uncond)
    # SDS weighting w(t) = 1 - alpha_bar_t, i.e. sigma_t^2
    w = (1 - self.alphas[t])
    grad = grad_scale * w[:, None, None, None] * (noise_pred - latent_noise)
    grad = torch.nan_to_num(grad)
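    # Surrogate-objective trick: `targets` below is detached, so
    # d(loss)/d(depth_latent) = (depth_latent - targets) / B = grad / B.
    # Backpropagating this MSE therefore injects exactly `grad` into the
    # depth latent without differentiating through the UNet.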
    targets = (depth_latent - grad).detach()
    loss = 0.5 * F.mse_loss(depth_latent.float(), targets, reduction='sum') / depth_latent.shape[0]
    return loss
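
# Minimal usage sketch (illustrative only): `guidance` is a hypothetical
# instance of the surrounding class, with its scheduler, UNet, latent encoder
# and min_step/max_step already set up; shapes follow the docstring above.
#
#   import torch
#
#   bs, view_num, h, w = 2, 4, 384, 384
#   rgb_in = torch.rand(bs * view_num, 3, h, w, device=guidance.device) * 2 - 1
#   # pred_depth comes from the model being optimized, so it carries gradients
#   pred_depth = torch.rand(bs * view_num, 1, h, w, device=guidance.device) * 2 - 1
#   pred_depth.requires_grad_(True)
#
#   loss = guidance.sds_loss(pred_depth, rgb_in, bs, view_num, guidance_scale=100)
#   loss.backward()  # pred_depth.grad now holds the (scaled) SDS gradient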