
unipad's Issues

Question about point feature

When I debugged your code, I found that there is no pts_voxel_encoder, so there is no pts_feat in the output.
The only place where I can see points supervising the SDF is in "sparse_points_sdf_supervised".

So how do you use the LiDAR input in the way your paper describes?
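To make my question concrete, this is roughly how I read the sparse-point SDF supervision; it is only a sketch with hypothetical names (sparse_sdf_loss, sdf_decoder), not code from the repo, and I may well be wrong:

```python
import torch
import torch.nn.functional as F

def sparse_sdf_loss(sdf_decoder, lidar_points):
    """Sketch of sparse-point SDF supervision (my assumption, not UniPAD code).

    LiDAR points are assumed to lie on object surfaces, so the predicted
    signed distance at those 3D locations should be close to zero.
    """
    pred_sdf = sdf_decoder(lidar_points)   # (N, 1) predicted SDF values
    target = torch.zeros_like(pred_sdf)    # surface points -> SDF == 0
    return F.l1_loss(pred_sdf, target)
```

Is something like this the only place where the LiDAR points enter the pipeline?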

Is image data a must?

Can I use this project to train a 3D backbone purely on LiDAR? I don't have corresponding image data.

A few questions on details

Thank you for releasing this amazing work! I just had a couple of questions about some of the camera-only outdoor details @Nightmare-n

  1. Are the same 6 images used for generating the 3D voxel grid and for rendering (as is mentioned to be done for ScanNet in PonderV2)?
  2. Were the ConvNeXt (V1?) backbones trained from scratch or initialized with IN1k weights?
  3. Was any data augmentation used for the 2D training stage, besides the regular MAE masking?
  4. When using the proposed depth-aware sampling, are the same 512 rays, sampled from the pixels with available LiDAR points, used for both color and depth rendering? (A rough sketch of what I mean follows this list.)
  5. In Table 8g, there appear to be trainable weights associated with the view transformation stage. Does the view transformation generally follow UVTR(?) with multi-scale sampling and depth weighting, or perhaps single-scale?
  6. In PonderV2, a supplementary document is mentioned. Would it be possible for it to be made public?
  7. What (total) batch size was used?
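For question 4, this is roughly the depth-aware sampling I have in mind; the function and variable names (sample_rays_with_lidar, depth_map) are my own and not taken from the codebase:

```python
import torch

def sample_rays_with_lidar(depth_map, num_rays=512):
    """Sample ray pixels only where a projected LiDAR depth exists (my assumption).

    depth_map: (H, W) tensor of projected LiDAR depths, 0 where no point falls.
    Returns the sampled pixel coordinates and their ground-truth depths, which
    I assume are shared by both the color and the depth rendering losses.
    """
    valid = torch.nonzero(depth_map > 0)              # (M, 2) pixel coordinates
    idx = torch.randperm(valid.shape[0])[:num_rays]   # random subset (<= 512)
    pixels = valid[idx]                               # (num_rays, 2)
    gt_depth = depth_map[pixels[:, 0], pixels[:, 1]]  # matching depth targets
    return pixels, gt_depth
```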

Apologies for the list of questions, but I'm really interested in the work. Again, thank you so much in advance!

Provide pre-trained models for ConvNeXt

The author proposes a great pre-training framework.

Could you provide the ConvNeXt weights for the other pre-training methods, to help other researchers follow your research?


Question about fusion pretraining

Hi, thank you for your amazing work.

As indicated in your project config files, the weights for finetuning the fusion-based model are obtained by "merging the weights of uvtr_lidar and uvtr_cam". From that statement, uvtr_lidar and uvtr_cam are pretrained individually, and you simply combine the state_dicts of those two pretrained models to initialize the fusion-based model for finetuning.

But when I first read your paper, I thought the fusion weights were obtained by training both the LiDAR and camera branches "simultaneously" (as indicated in your framework), which is quite different from the above implementation.

This confuses me, and I want to ask whether I am understanding your idea correctly.
Is "camera-based pretrained weights + LiDAR-based pretrained weights = fusion-based pretrained weights" also what you mean in your paper?
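If it helps to clarify what I mean, this is how I imagine the merge is performed; a minimal sketch assuming plain PyTorch checkpoints with a "state_dict" key, with placeholder file names:

```python
import torch

# Load the two single-modality pretrained checkpoints (paths are placeholders).
lidar_ckpt = torch.load("uvtr_lidar_pretrain.pth", map_location="cpu")
cam_ckpt = torch.load("uvtr_cam_pretrain.pth", map_location="cpu")

# Merge the parameter dicts; keys are disjoint except for any shared modules,
# where the camera weights would silently overwrite the LiDAR ones here.
merged = dict(lidar_ckpt["state_dict"])
merged.update(cam_ckpt["state_dict"])

torch.save({"state_dict": merged}, "uvtr_fusion_pretrain.pth")
```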

Thank you so much in advance.

About running code

Hello, thanks for your excellent work! I have downloaded the nuScenes dataset. Could you provide instructions for running the model on the test data?

Ablation for point_nsample

I noticed point_nsample == 512 in the code. Have you run ablation experiments on point_nsample? My understanding is that a larger point_nsample provides more supervision signal, so the results should improve.

More ablation experiments

Could you tell me whether the effectiveness of the UniPAD framework has been verified on other backbones?

For example, ResNet-50. If so, what do I need to do to pre-train ResNet-50 under the UniPAD framework?
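To make the question concrete, I imagine swapping in ResNet-50 would look roughly like the following mmdet-style backbone override; this is only a sketch, and the surrounding UniPAD config keys are my assumption, not copied from the repo:

```python
# Hypothetical override of the image backbone in an mmdet3d-style config.
model = dict(
    img_backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=-1,
        norm_cfg=dict(type='BN', requires_grad=True),
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50'),
    ),
)
```

Would pretraining with such a config be enough, or are other parts of the pipeline tied to ConvNeXt?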

Bug in the SSL testing

Hello,
Thank you for your excellent work!
There is a bug in the test of UVTRSSL:

results = self.pts_bbox_head(

The passed parameters do not match the definition:
def forward(self, pts_feats, img_feats, rays, img_metas, img_depth):

It seems that the forward_test function in render_head was left out of the commit.
Looking forward to your response.

Questions about cfgs & replicated results

Hello, thanks for your excellent work! I found your method concise and elegant, and these days I'm trying to replicate it. I have a few questions about the code.
1. It seems that the provided code only contains the camera modality, and the LiDAR modality is missing. Could you please provide the config files of UniPAD for the multi-modality and LiDAR-only UVTR settings? That would be very helpful.
2. I ran the code successfully, but the resulting metrics were lower than those in the paper. I ran the project in several ways, all based on the complete nuScenes dataset and your configs:
a. I pretrained and finetuned the model from scratch.
b. I downloaded your official pretrained .pth file from the provided Google Drive link and finetuned the model from it.
c. I downloaded your official pretrained and finetuned .pth files and directly tested the model with your finetuned checkpoint.
Sadly, the resulting mAP of all three approaches is approximately 32. Given the camera-only setting in the code, I think the corresponding number in the paper should be UVTR-C + UniPAD, i.e. 41.5 mAP. So I wonder, did I miss something critical?
Looking forward to your early reply, and thanks in advance!

Data Distribution for Training

Hi, thank you for your great work!
One simple question: during pre-training, do you use the full nuScenes training set? For a downstream task such as 3D object detection, do you then finetune on the same data used for pre-training?

About rendering

Thank you for your work. I would like to ask how sampling 512 rays per image view can achieve such a high rendering resolution in the final rendered image (Figure 3 in your paper). For a high-definition image of 1600 x 900, 512 seems like a very small number.
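To make the mismatch concrete: a 1600 x 900 image has 1,440,000 pixels, so 512 rays cover only about 0.036% of them. My guess is that the figure is produced by rendering every pixel in chunks at inference time, roughly as in the sketch below (render_full_image, render_fn, and the chunk size are hypothetical, not from the repo):

```python
import torch

def render_full_image(render_fn, H=900, W=1600, chunk=4096):
    """Render all H*W rays in chunks at inference time (my assumption).

    render_fn maps an (N, 2) tensor of pixel coordinates to (N, 3) colors;
    the 512-ray budget would then only apply to training, not to this kind
    of full-resolution visualization.
    """
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pixels = torch.stack([ys.reshape(-1), xs.reshape(-1)], dim=-1)  # (H*W, 2)
    colors = [render_fn(pixels[i:i + chunk])
              for i in range(0, pixels.shape[0], chunk)]
    return torch.cat(colors, dim=0).reshape(H, W, 3)
```

Is this roughly what was done, or is the figure rendered at a lower resolution and upsampled?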

Question about BEVDet results in the arXiv paper

Hello,
Thanks for the great work, and for making the code available.
In the arXiv paper, you show results with:

  • BEVDet: NDS 27.1, mAP 24.6
  • BEVDet+UniPAD: NDS 32.7, mAP 32.8

However, in the BEVDet repo, BEVDet with a ResNet-50 encoder is reported with NDS 35.0, mAP 28.3 (resolution 256x704).
Are you using the vanilla version of BEVDet? If so, can you provide the configuration (number of GPUs, learning rate, ...)?
Thank you.

[EDIT]:
Could it be that there is a transpose in the table, and that the results should be:

  • BEVDet: NDS 32.7, mAP 24.6
  • BEVDet+UniPAD: NDS 32.8, mAP 27.1?

However, it does not explain the lower mAP / NDS for BEVDet without pretraining.

smooth_sampler ops build error

Hi, thank you for sharing this excellent work! When I try to install this project using python setup.py develop --user, I get a compilation error (see the attached screenshot).
I wonder how I can solve this error. Hoping for your reply!

Question about rendering results in the paper (Figure 3)

Hi. Thank you for your great work.

May I ask some questions about how you obtained the rendering results in Figure 3 of your published paper?

Did you use the frozen pretrained UniPAD-C (without masking the input images) to render both the RGB and depth images? And does the number of sampled rays correspond to the image resolution, i.e. 928 x 1600 rays per camera view, or a downsampled 928//k x 1600//k grid of rays?

Thank you.

About ablation study on different view transform strategies (BEVDet, BEVDepth, BEVFormer)

Hi. Thank you for your interesting work.
I have a question about the ablation study in Table 5 of your paper.
Do you mean that you only take the view transformation from BEVDet, BEVDepth, and BEVFormer, while keeping the other components (image backbone, detection head, ...) the same as in UVTR-CAM?
Could you please provide the config files for these studies?
Thank you in advance.

Question about FP32 Training

Hi team, thanks for releasing this exceptional work.
In the released log (abl_uvtr_cam_vs0.1_finetune.log), a pretrained checkpoint can be seen being loaded on line 1236:

2023-11-18 18:34:11,643 - mmdet - INFO - load checkpoint from work_dirs/convnext_s_pretrain_enorm_nuscenes_fp32_d32_x1_nofar/epoch_12.pth

Does this indicate that the UniPAD model is pretrained with FP32?
Also, we noticed that the default pretraining setting in this repo is FP16 (fp16_enabled=True). Could you share a comparison between FP32 and FP16 pretraining?
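For context, we assume the precision is toggled along these lines in the config, following the usual mmcv/mmdet convention; the exact keys used by UniPAD may differ:

```python
# Mixed-precision training the usual mmcv way: a loss-scale config entry
# (enables the FP16 optimizer hook) plus fp16_enabled on the model.
fp16 = dict(loss_scale=32.0)   # or loss_scale='dynamic'

# For an FP32 run we would presumably drop the fp16 dict above and set
# fp16_enabled=False wherever the model config exposes it.
```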

Thanks again for your attention; we look forward to your reply.
