
butd_detr's People

Contributors

ayushjain1144, ishitamed19, nickgkan


butd_detr's Issues

Regarding Object Detection on ScanNet

Hey @ayushjain1144 @nickgkan , I wanted to confirm a query:
For the ScanNet dataset (only), the training and testing tasks are object detection.
The target_ids will be the objects from the DC18 vocab, but according to src/joint_det_dataset.py line 761, only the first object is passed as the target each time. Sometimes the constructed utterance (a concatenation of object names) may be irrelevant for items not occurring in DC18 (for example, in scene 'scene0405_00', objects like 'trash can', versus the utterance 'cabinet . bed . chair . couch . table . door . window . bookshelf . picture . counter . desk . curtain . refrigerator . shower curtain . toilet . sink . bathtub . other furniture').
Could you confirm if I am correct?
I also have the following queries:

  1. Can we extend the object detection to detect two or more objects?
  2. Further, I wanted to do an object detection experiment on DC (instead of DC18); could you give me a clue on how I should do this?
    Thanks!

Why 'base_xyz=cluster_xyz' in 'self.prediction_heads[i](...)'

In bdetr.py, you set base_xyz=cluster_xyz (the original coordinates of the queries) when calling self.prediction_heads[i](...). The code is shown below:

            # step box Prediction head
            base_xyz, base_size = self.prediction_heads[i](
                query.transpose(1, 2).contiguous(),     # ([B, F=288, V=256])
                base_xyz=cluster_xyz,                   # ([B, 256, 3])
                end_points=end_points,
                prefix=prefix
            )

However, for self.decoder[i](...), query_pos is generated from the base_xyz output by the prediction head of the previous layer. The code is shown below:

        for i in range(self.num_decoder_layers):
            prefix = 'last_' if i == self.num_decoder_layers-1 else f'{i}head_'

            # Position Embedding for Self-Attention
            if self.self_position_embedding == 'none':
                query_pos = None
            elif self.self_position_embedding == 'xyz_learned':
                query_pos = base_xyz
            elif self.self_position_embedding == 'loc_learned':
                query_pos = torch.cat([base_xyz, base_size], -1)
            else:
                raise NotImplementedError

            # step Transformer Decoder Layer
            query = self.decoder[i](
                query, points_features.transpose(1, 2).contiguous(),
                text_feats, query_pos,
                query_mask,
                text_padding_mask,
                detected_feats=(
                    detected_feats if self.butd
                    else None
                ),
                detected_mask=detected_mask if self.butd else None
            )  # (B, V, F)

The code for self.prediction_heads[i] implies that the prediction head at every layer modifies the original query coordinates, whereas the code for self.decoder[i](...) indicates that each decoder layer refines the query coordinates produced by the previous layer. This suggests that the two places model the process inconsistently. I think the parameters of the prediction head should be changed, specifically setting base_xyz=base_xyz in self.prediction_heads[i](...), as sketched below. What are your thoughts on my suggestion?
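A minimal sketch of the change I am suggesting (my own sketch, not the repository's code), assuming base_xyz is initialized to cluster_xyz before the decoder loop:

    # sketch of the suggested change: feed the coordinates refined by the
    # previous layer instead of the original cluster centers
    base_xyz, base_size = self.prediction_heads[i](
        query.transpose(1, 2).contiguous(),
        base_xyz=base_xyz,          # instead of base_xyz=cluster_xyz
        end_points=end_points,
        prefix=prefix
    )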

Training

Hi, I'm running the training part (sh scripts/train_test_cls.sh) and I'm getting the error below. Can you help me, please?

/home/exouser/.conda/envs/bdetr3d/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

FutureWarning,
usage: launch.py [-h] [--nnodes NNODES] [--nproc_per_node NPROC_PER_NODE] [--rdzv_backend RDZV_BACKEND] [--rdzv_endpoint RDZV_ENDPOINT] [--rdzv_id RDZV_ID] [--rdzv_conf RDZV_CONF] [--standalone]
[--max_restarts MAX_RESTARTS] [--monitor_interval MONITOR_INTERVAL] [--start_method {spawn,fork,forkserver}] [--role ROLE] [-m] [--no_python] [--run_path] [--log_dir LOG_DIR] [-r REDIRECTS]
[-t TEE] [--node_rank NODE_RANK] [--master_addr MASTER_ADDR] [--master_port MASTER_PORT] [--use_env]
training_script ...
launch.py: error: argument --master_port: invalid int value: 'train_dist_mod.py'

Soft token prediction

Hi, I have a general question regarding the soft token prediction used.

I wonder how the ground truth for soft token prediction is obtained. I know that in the case of MDETR, it is obtained via a complicated data combination process, since most datasets do not provide fine-level alignment between text tokens and boxes. I reckon that in 3D this fine-level alignment is also not easily obtained, so how is this level of ground truth obtained? I cannot find this part in the paper.
Best,
Zhening

Results visualization

Thanks for your great work. Could you please tell me how to visualize the results? Thanks!

Inference

Thank you for your well-written paper. I've been trying to test your model on our dataset but I'm having trouble figuring out how to do it. Could you please help me with this? Also, when I tried to add the "--eval" argument, it still starts training. Can you assist me in resolving this?

Concerning parsing prediction

Hi, thank you for your wonderful work.

I am trying to parse the predictions, and I am confused by the following lines.

butd_detr/train_dist_mod.py

Lines 227 to 232 in 10570e0

    end_points['last_sem_cls_scores'] = sem_scores
    # end contrast
    sem_cls = torch.zeros_like(end_points['last_sem_cls_scores'])[..., :19]
    for w, t in zip(wordidx, tokenidx):
        sem_cls[..., w] += end_points['last_sem_cls_scores'][..., t]
    end_points['last_sem_cls_scores'] = sem_cls

Why is last_sem_cls_scores replaced with the newly generated class scores based on tokenidx and wordidx? What do tokenidx and wordidx mean?

Thank you for your attention.

About the Label Map of object classes.

Dear Authors,

Thanks for your great work. I guess there are four types of class mappings used in the whole system, and I wonder what each of them is, i.e. the class id <-> class name mapping.

  1. The first is the predicted label used in gt box + predicted label setting. I guess they are from https://github.com/nickgkan/butd_detr/blob/main/data/cls_results.json

  2. The GT Box and GT labels.

  3. The Detected Box and Detected Labels from Group-Free Detectors.

  4. Your model does not predict class labels themselves. Am I correct?

I am new to your codebase and feel confused. I look forward to your answers.

Best,

the confidence threshold you set for the pretrained detector

As you mentioned in the paper:

... Group-Free detector [33] for 3D point clouds pre-trained on a vocabulary of 485 object categories on ScanNet [8]. The detected box proposals that surpass a confidence threshold are encoded using a box proposal encoder ...

Could you please tell me the confidence threshold you set for the pretrained Group-Free detector?

Fixing PyTorch Checkpoint Loading Issue in PointNet++

Hi,
Thank you for your great work on this project. I noticed an issue with loading the pre-trained checkpoint in the code. Currently, the load_state_dict method is being used to load the checkpoint, but it doesn't handle the key 'model' in the gf_detector_l6o256.pth file. Therefore, I suggest modifying the code to use the following:

Change:

self.backbone_net.load_state_dict(torch.load(pointnet_ckpt), strict=False)

to:

self.backbone_net.load_state_dict(torch.load(pointnet_ckpt)['model'], strict=False)

Additionally, the pre-trained weights have the prefix module.backbone_net, so we need to remove it to load the weights properly.
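For reference, this is roughly what I do on my side (a sketch, assuming the checkpoint stores its weights under the 'model' key with a module.backbone_net. prefix, as described above):

    import torch

    ckpt = torch.load(pointnet_ckpt, map_location='cpu')['model']
    # strip the 'module.backbone_net.' prefix left over from DistributedDataParallel
    prefix = 'module.backbone_net.'
    state_dict = {k[len(prefix):]: v for k, v in ckpt.items() if k.startswith(prefix)}
    missing, unexpected = self.backbone_net.load_state_dict(state_dict, strict=False)
    print('missing keys:', missing)        # should be empty if the prefix matches
    print('unexpected keys:', unexpected)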

Regarding the strict=False argument currently used in the code: because of the key mismatch, none of the pre-trained weights are actually loaded and the backbone stays randomly initialized. Is that intended?

Best regards.

shift on the detected boxes

Hi, I visualized the detected boxes (before the random shift augmentation) on the point cloud scene, but noticed that there is a consistent shift. After generating target boxes based on the point cloud scene, this resulted in 0.0 IoU because of the shift. Would you please release the detector backbone training code, or suggest another way to solve this problem? Thanks in advance!

errors in generating .pkl

Hi, I tried to generate the .pkl of all scans, but it always results in a multiprocessing error after loading all scans. Would you be able to upload a train.pkl and val.pkl?

Regarding the evaluation metrics

Hi @ayushjain1144 ,

While evaluating the model, the following analyses are available:

Testing evaluation.....................
[02/23 10:49:16 sr3d_cls_eval]: Eval: [1000/1478]
[02/23 10:49:16 sr3d_cls_eval]: loss 6.7607 loss_bbox 0.8711 loss_ce 9.3744 loss_constrastive_align 31.2625 loss_giou 2.2085 query_points_generation_loss 0.0022
last_ Box given span (soft-token) Acc: 0.682048967618188
last_ Box given span (contrastive) Acc: 0.6841927112715784
proposal_ Box given span (soft-token) Acc: 0.50462597314679
proposal_ Box given span (contrastive) Acc: 0.5195193501071872
0head_ Box given span (soft-token) Acc: 0.6286246192034299
0head_ Box given span (contrastive) Acc: 0.6341532212569108
1head_ Box given span (soft-token) Acc: 0.67149949227124
1head_ Box given span (contrastive) Acc: 0.6738688931513032
2head_ Box given span (soft-token) Acc: 0.6774794087780661
2head_ Box given span (contrastive) Acc: 0.6774794087780661
3head_ Box given span (soft-token) Acc: 0.6788897664447704
3head_ Box given span (contrastive) Acc: 0.68075143856482
4head_ Box given span (soft-token) Acc: 0.6806386099514837
4head_ Box given span (contrastive) Acc: 0.6831208394448832

Analysis
easy 0.6998390989541432
hard 0.6474697885196374
vd 0.5186170212765957
vid 0.6915282196300224
unique 0.0
multi 0.6841927112715784

The above log is extracted from the evaluation of BUTD-DETR (cls mode) on the SR3D dataset. I understand the analyses: easy, hard, vd, and vid. But Tables 6 and 7 additionally have another metric, "Overall (GT)". Could you also share the procedure for calculating the overall score? I wasn't able to locate the related code.
Thanks!

Hi, I have a follow-up question here for butd_cls mode.

          Hi, I have a follow-up question here for butd_cls mode.

In Link, you get detected_class_ids from 'data/cls_results.json', which, as you mentioned in other issues, is the predicted class result from PointNet++ (pretrained on ScanNet).

However, in the standard mode, where butd_cls = False and butd_gt = False, detected_class_ids comes from Link. It is also the predicted class result from PointNet++ (pretrained on ScanNet), but with an additional DC.nyu40id2class mapping.

I have checked both of the above detected_class_ids and found that they are different. Am I missing something here?

P.S.: I removed the corrupted Link to avoid mistakes.

Originally posted by @WeitaiKang in #11 (comment)

Question about point_object_class label

Dear authors,

Thank you so much for your influential work. Your patience in resolving so many issues helps a lot of new researchers in this 3D field. As a new researcher in this field, I want to ask a rather basic question about something I don't understand but find to be a widely used pattern.

In the compute_points_obj_cls_loss_hard_topk function, I find that you assign the object class label (a binary value) to just the topk closest points among all the object points and the background points. The result (objectness_label) serves as the ground truth for the model's seeds_obj_cls_logits.

I am confused about why we only treat the topk closest points as the object for seeds_obj_cls_logits, instead of all the object points (i.e., directly using object_assignment_one_hot as the ground truth). Is it because in the loss computation only one proposal is used to compute the loss for one target (assigned by the HungarianMatcher)? If so, why top-k instead of top-1? See the sketch after this paragraph for what I mean.
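To make the question concrete, this is how I understand the two options (my own sketch of the idea, not the repository's exact code; handling of padded boxes via box_label_mask is omitted):

    # euclidean_dist1: (B, K, K2) seed-to-GT-center distances, with pairs that
    # do not belong together pushed to a large constant

    # top-k variant (what the code seems to do): for every GT box, only the
    # topk closest seeds become positives
    _, topk_inds = torch.topk(euclidean_dist1, topk, dim=1, largest=False)  # (B, topk, K2)
    objectness_label = torch.zeros(B, K, device=euclidean_dist1.device)
    objectness_label.scatter_(1, topk_inds.reshape(B, -1), 1.0)

    # alternative I am asking about: every seed lying on an object is positive
    # objectness_label = (torch.gather(point_instance_label, 1, seed_inds) >= 0).float()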

why 'end_points['fp2_inds'] = end_points['sa1_inds'][:, 0:num_seed]' ?

In the models/backbone_module.py, you select the first 1024 out of 2048 sa1_inds as fp2_inds. I can understand that the intention behind this is to obtain the indices of these 1024 seed points in the entire point cloud, in order to participate in the loss calculation in the function compute_points_obj_cls_loss_hard_topk (which is in models/losses.py).

However, directly selecting the first 1024 of the 2048 sa1_inds does not correspond one-to-one with fp2_xyz. This mismatch would cause the euclidean_dist1 and object_assignment_one_hot variables in compute_points_obj_cls_loss_hard_topk to not be aligned one-to-one. Doesn't this introduce an error in the supervision signal for KPS?

models/backbone_module.py:

        # --------- 2 FEATURE UPSAMPLING LAYERS --------
        features = self.fp1(end_points['sa3_xyz'], end_points['sa4_xyz'], end_points['sa3_features'],
                            end_points['sa4_features'])
        features = self.fp2(end_points['sa2_xyz'], end_points['sa3_xyz'], end_points['sa2_features'], features)
        end_points['fp2_features'] = features
        end_points['fp2_xyz'] = end_points['sa2_xyz']
        num_seed = end_points['fp2_xyz'].shape[1]
        end_points['fp2_inds'] = end_points['sa1_inds'][:, 0:num_seed]  # indices among the entire input point clouds

        return end_points

models/losses.py:

def compute_points_obj_cls_loss_hard_topk(end_points, topk):
    box_label_mask = end_points['box_label_mask']
    seed_inds = end_points['seed_inds'].long()  # B, K
    seed_xyz = end_points['seed_xyz']  # B, K, 3
    seeds_obj_cls_logits = end_points['seeds_obj_cls_logits']  # B, 1, K
    gt_center = end_points['center_label'][:, :, 0:3]  # B, K2, 3
    gt_size = end_points['size_gts'][:, :, 0:3]  # B, K2, 3
    B = gt_center.shape[0]
    K = seed_xyz.shape[1]
    K2 = gt_center.shape[1]

    point_instance_label = end_points['point_instance_label']  # B, num_points
    object_assignment = torch.gather(point_instance_label, 1, seed_inds)  # B, num_seed
    object_assignment[object_assignment < 0] = K2 - 1  # set background points to the last gt bbox
    object_assignment_one_hot = torch.zeros((B, K, K2)).to(seed_xyz.device)
    object_assignment_one_hot.scatter_(2, object_assignment.unsqueeze(-1), 1)  # (B, K, K2)
    delta_xyz = seed_xyz.unsqueeze(2) - gt_center.unsqueeze(1)  # (B, K, K2, 3)
    delta_xyz = delta_xyz / (gt_size.unsqueeze(1) + 1e-6)  # (B, K, K2, 3)
    new_dist = torch.sum(delta_xyz ** 2, dim=-1)
    euclidean_dist1 = torch.sqrt(new_dist + 1e-6)  # BxKxK2
    euclidean_dist1 = euclidean_dist1 * object_assignment_one_hot + 100 * (1 - object_assignment_one_hot)  # BxKxK2

requirements.txt-butd_detr2d

hi
for these three lines in requirements.txt I'm getting errors (ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory). How can I fix it?
numpy @ file:///tmp/build/80754af9/numpy_and_numpy_base_1634095647912/work
Pillow @ file:///home/conda/feedstock_root/build_artifacts/pillow_1648857110829/work
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work

usage part

hi
thanks for your well-written paper.
I had some questions about your code:
when I try to run the usage part (sh butd_detr/scripts/train_test_det.sh), I get the error message below. Do you know what the reason could be? I have already completed the other steps.
best,
Nastaran

File "/home/exouser/.conda/envs/bdetr3d/lib/python3.7/site-packages/torch/nn/functional.py", line 5101, in multi_head_attention_forward
attn_output, attn_output_weights = _scaled_dot_product_attention(q, k, v, attn_mask, dropout_p)
File "/home/exouser/.conda/envs/bdetr3d/lib/python3.7/site-packages/torch/nn/functional.py", line 4844, in _scaled_dot_product_attention
attn = torch.bmm(q, k.transpose(-2, -1))
RuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 20.00 GiB total capacity; 16.10 GiB already allocated; 309.69 MiB free; 16.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7908) of binary: /home/exouser/.conda/envs/bdetr3d/bin/python
Traceback (most recent call last):
File "/home/exouser/.conda/envs/bdetr3d/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/exouser/.conda/envs/bdetr3d/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/exouser/.conda/envs/bdetr3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/exouser/.conda/envs/bdetr3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/exouser/.conda/envs/bdetr3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/exouser/.conda/envs/bdetr3d/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/exouser/.conda/envs/bdetr3d/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/exouser/.conda/envs/bdetr3d/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

butd_detr/train_dist_mod.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-03-27_21:54:47
host : butd-detr.js2local
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 7908)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

How to implement referit3d on scanrefer benchmark?

Dear Authors,

Thank you for your great work!

I noticed that you reported a result for ReferIt3DNet on ScanRefer, but ReferIt3DNet accepts object point clouds (GT boxes), and there are no GT boxes available for the ScanRefer benchmark. How did you do this?

I conjecture that for the ScanRefer benchmark, you use the Group-Free detector to detect boxes first and then feed these boxes to ReferIt3DNet without the scene point cloud. Is that correct?

Best,
Runsen

The pretrained PointNet++ backbone

Hi, is the file "gf_detector_l6o256.pth" the backbone weights of the Group-Free detector trained on ScanNet with a vocabulary of 485 object categories?

Question about loss

Hi authors, I found that in your loss function here, you sum up the number of ground-truth bboxes across all GPUs without averaging this number by the number of GPUs. It seems that this would lower the loss value when training with more GPUs, right? Is this the right way to do it? (A sketch of the averaging I have in mind is below.)
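For clarity, this is the kind of averaging I mean (just a sketch, assuming torch.distributed is initialized as in the training script; num_gt_boxes and device are placeholders):

    import torch
    import torch.distributed as dist

    num_boxes = torch.as_tensor([num_gt_boxes], dtype=torch.float, device=device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(num_boxes)                     # sum over all GPUs...
        num_boxes = num_boxes / dist.get_world_size()  # ...then average, so the divisor
                                                       # does not grow with the GPU count
    num_boxes = torch.clamp(num_boxes, min=1).item()
    # loss = loss_sum / num_boxes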

About the arguments of butd, butd_cls, butd_gt

Hi, I want to make sure I understand some of the arguments correctly.

args.butd: use the box stream or not. True: use the box stream.
args.butd_cls == True: use the GT boxes provided by ReferIt3D but no class labels. This corresponds to the "GT" setting in the paper.
args.butd_gt == True: use the GT boxes with class labels. This result is not included in the paper.
args.butd_cls == False and args.butd_gt == False: use the boxes obtained by the Group-Free detector (without class labels? I am not sure :( ). This corresponds to the "DET" setting in the paper.

Am I correct?

I have another question. In your code for Joint3DDataset:

        if self.butd_gt:
            all_detected_bboxes = all_bboxes
            all_detected_bbox_label_mask = all_bbox_label_mask
            detected_class_ids = class_ids

        # Assume a perfect object proposal stage
        if self.butd_cls:
            all_detected_bboxes = all_bboxes
            all_detected_bbox_label_mask = all_bbox_label_mask
            detected_class_ids = np.zeros((len(all_bboxes,)))
            classes = np.array(self.cls_results[anno['scan_id']])
            # detected_class_ids[all_bbox_label_mask] = classes[classes > -1]
            classes[classes == -1] = 325  # 'object' class
            _k = all_bbox_label_mask.sum()
            detected_class_ids[:_k] = classes[:_k]

What is the difference between classes and class_ids? I find that most of their valid labels are the same, but some are different. It seems that class labels are used in both the self.butd_cls and self.butd_gt settings?

I find it difficult to figure out the setting here.

Thanks!

detailed information of cls_results.json

Hi,

In a previous issue you said that a PointNet++ classifier's classification results (like prior works) will be loaded when using the argument butd_cls.
I wonder how exactly you got cls_results.json. Also, another work, NS3D, mentions that there are 607 classes among the object boxes of SR3D. Does cls_results.json contain classification results for all of these 607 classes?

Thanks again for this great work!

About the implementation of the loss function

Hi, thanks for your great work. I have some questions regarding the implementation of the loss function. Hope you can give me some hints.

  1. in loss_contrastive_align:

        # construct a map such that positive_map[k, i, j] = True
        # iff query i is associated to token j in batch item k
        positive_map = torch.zeros(logits.shape, device=logits.device)
    
        # handle 'not mentioned' # ? these two correspond to the last two tokens, which are " mentioned" and "</s>"?
        inds = tokenized['attention_mask'].sum(1) - 1
        positive_map[torch.arange(len(inds)), :, inds] = 0.5
        positive_map[torch.arange(len(inds)), :, inds - 1] = 0.5 
    

    I think the last two lines set the unmatched queries to correspond to "not mentioned", is that right? But inds and inds - 1 are the indices of "</s>" and " mentioned". I think they should be inds - 1 and inds - 2.

    Here is my debugging output:

        > tokenized['attention_mask'][0].sum() - 1 
        tensor(12, device='cuda:0') 
        > self.tokenizer.decode(tokenized['input_ids'][0][12]) 
        '</s>' 
        > self.tokenizer.decode(tokenized['input_ids'][0][11]) 
        ' mentioned' 
    
  2. in loss_contrastive_align:

        # Token mask for matches <> 'not mentioned' 
        tmask = torch.full(
            (len(logits), logits.shape[-1]),
            self.eos_coef,
            dtype=torch.float32, device=logits.device
        ) # * (B, max_token)
        tmask[torch.arange(len(inds)), inds] = 1.0 
    

    Why do we set the weight of the last token to 1.0? I think we should set the tokens with matched queries to 1.0.

  3. in loss_contrastive_align:

        # Loss 1: which tokens should each query match?
        boxes_with_pos = positive_map.any(2)
        pos_term = positive_logits.sum(2)
        neg_term = negative_logits.logsumexp(2)
        nb_pos = positive_map.sum(2) + 1e-6 
        entropy = -torch.log(nb_pos+1e-6) / nb_pos  # entropy of 1/nb_pos
        box_to_token_loss_ = (
            (entropy + pos_term / nb_pos + neg_term)
        ).masked_fill(~boxes_with_pos, 0)
        box_to_token_loss = (box_to_token_loss_ * mask).sum()
    

    Why do we directly add (entropy + pos_term / nb_pos + neg_term)? Should we take sum(exp(positive_logits)) to get pos_term and then compute pos_term / neg_term?

  4. in loss_labels_st:

         entropy = torch.log(target_sim + 1e-6) * target_sim 
         loss_ce = (entropy - logits * target_sim).sum(-1)
    

    What is entropy used for? Its elements seem to be near zero.

Questions of loading the pretrained checkpoint into GroupFree3D

Since the pretrained checkpoint you mentioned was trained on 485 classes, I want to directly load it into the Group-Free-3D model and run inference to see the performance. However, when I load the model, the following error occurs:
[screenshot of the error message]
Thanks very much!

About 'couch' in `DC18.type2class`

Hi, thank you very much for open-sourcing your code! It has really helped me in several of my projects.

In this line, DC18.type2class is used to map the nyu_labels to 18-class ids:

labels = [DC18.type2class.get(lbl, 17) for lbl in nyu_labels]

When I dig into the ScannetDatasetConfig class in model_util_scannet.py, I find that the key for id 3 is 'couch':

    self.type2class = {'cabinet': 0, 'bed': 1, 'chair': 2, 'couch': 3, 'table': 4, 'door': 5,
                       'window': 6, 'bookshelf': 7, 'picture': 8, 'counter': 9, 'desk': 10, 'curtain': 11,
                       'refrigerator': 12, 'shower curtain': 13, 'toilet': 14, 'sink': 15, 'bathtub': 16,
                       'other furniture': 17}

However, 'couch' is the raw ScanNet name; the nyu40class name should be 'sofa'.

This typo makes DC18.type2class map 'sofa' to id 17 instead of 3, as shown below (a sketch of a possible fix follows the pdb output):

(Pdb) nyu_labels
['window', 'window', 'table', 'counter', 'otherstructure', 'curtain', 'curtain', 'desk', 'cabinet', 'floor', 'sink', 'otherprop', 'otherfurniture', 'otherfurniture', 'otherfurniture', 'television', 'pillow', 'otherprop', 'otherprop', 'wall', 'wall', 'wall', 'wall', 'wall', 'wall', 'wall', 'wall', 'wall', 'otherprop', 'otherprop', 'otherprop', 'otherprop', 'sofa', 'refridgerator', 'table', 'table', 'toilet', 'bed', 'cabinet', 'cabinet', 'cabinet', 'cabinet', 'cabinet', 'cabinet', 'otherprop', 'otherprop', 'otherprop', 'otherprop', 'otherprop', 'night stand', 'night stand', 'otherprop', 'otherprop', 'otherprop', 'door', 'ceiling', 'shelves', 'otherprop', 'otherprop', 'otherprop', 'otherprop', 'otherprop', 'otherprop', 'door', 'mirror', 'door', 'otherprop']
(Pdb) labels
[6, 6, 4, 9, 17, 11, 11, 10, 0, 17, 15, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 4, 4, 14, 1, 0, 0, 0, 0, 0, 0, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 5, 17, 17, 17, 17, 17, 17, 17, 17, 5, 17, 5, 17]
(Pdb)
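A possible fix on my side (just a sketch, not the repository's code) is to alias the nyu40 name before the lookup:

    # alias the nyu40class name 'sofa' onto the key 'couch' that DC18.type2class uses
    NYU40_ALIASES = {'sofa': 'couch'}
    labels = [DC18.type2class.get(NYU40_ALIASES.get(lbl, lbl), 17) for lbl in nyu_labels]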

About the label ID

Hi, thanks for your great work!

Could you please give me some hints about where self.type2class and self.nyu40ids come from? I checked scannetv2-labels.combined.tsv and they do not seem well aligned.

class ScannetDatasetConfig:

    def __init__(self, num_class=485, agnostic=False):
        self.num_class = num_class if not agnostic else 1  # 18
        self.num_heading_bin = 1
        self.num_size_cluster = num_class
        if num_class == 18:
            self.type2class = {'cabinet': 0, 'bed': 1, 'chair': 2, 'couch': 3, 'table': 4, 'door': 5,
                               'window': 6, 'bookshelf': 7, 'picture': 8, 'counter': 9, 'desk': 10, 'curtain': 11,
                               'refrigerator': 12, 'shower curtain': 13, 'toilet': 14, 'sink': 15, 'bathtub': 16,
                               'other furniture': 17}
        else:
            self.type2class = {'wall': 0, 'chair': 1, 'floor': 2, ...}

        self.class2type = {self.type2class[t]: t for t in self.type2class}
        if num_class == 18:
            self.nyu40ids = np.array([3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 24, 28, 33, 34, 36, 39])
        else:
            # only train
            self.nyu40ids = np.array([1, 2, 3, 4, 5, 6, 7,... ])
        self.nyu40id2class = {nyu40id: i for i, nyu40id in enumerate(list(self.nyu40ids))}

Also, how did you get the cls_results.json?

CUDA kernel failed : initialization error


CUDA kernel failed : initialization error
void furthest_point_sampling_kernel_wrapper(int, int, int, const float*, float*, int*) at L:233 in _ext_src/src/sampling_gpu.cu
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 255) local_rank: 0 (pid: 15542) of binary: ~/miniconda3/envs/vg_3d/bin/python

I wonder if the authors have seen this error before. I ran the init.sh setup successfully but still got this error.

A potential bug of visual_data_handler.py

Hi, in src/visual_data_handlers.py:

    def load_point_cloud(self, keep_points=50000):
        """Load point-cloud information."""
        # Load labels
        label = None
        if osp.exists(self.scan_id + '_vh_clean_2.labels.ply'):  # This will always be False
            data = PlyData.read(osp.join(
                self.top_scan_dir,
                self.scan_id, self.scan_id + '_vh_clean_2.labels.ply'
            ))
            label = np.asarray(data.elements[0].data['label'])

osp.exists(self.scan_id + '_vh_clean_2.labels.ply') will always be False, since the self.top_scan_dir / self.scan_id directory prefix is missing from the path. This results in label being None and throws an error when calling self.get_object_semantic_label().
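A sketch of the fix I have in mind, mirroring the path that is already built inside the PlyData.read call:

    label_path = osp.join(
        self.top_scan_dir, self.scan_id,
        self.scan_id + '_vh_clean_2.labels.ply'
    )
    if osp.exists(label_path):  # check the same path that is actually read
        data = PlyData.read(label_path)
        label = np.asarray(data.elements[0].data['label'])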

Will this affect the experimental results, even though no error occurred when I ran the code?

Default Configuration for SR3D

Hi @ayushjain1144 @nickgkan ,

A small query: what is the starting learning rate used for the experiments in Tables 6 and 7? The default configuration in the code says --lr=1e-3, --lr_backbone=1e-4, and --text_encoder_lr=1e-5, but the script train_test_cls.sh uses --lr=1e-4, --lr_backbone=1e-3, and --text_encoder_lr=1e-5.

Could you clarify the required configuration, please?

Thanks in advance!

Regarding pre-training details

Hi @nickgkan , great work by the team!

I had a question about pre-training for the 3D architecture. For the 2D BUTD-DETR architecture, the pre-training details are specified in the paper. I wanted to experiment with the pre-training procedure for BUTD-DETR 3D. Could you share the details (e.g. dataset, scripts) for the pre-training, please?

Thanks!

Training on 3D referential datasets, evaluating on 3D detection datasets

In Table 10, I want to know whether a method like "BUTD-DETR trained on SR3D/NR3D/ScanRefer" is trained only on referential data, or on both referential data and detection prompts. If the latter, I wonder: if we train BUTD-DETR on referential data only, without detection prompts, can the trained model be directly evaluated on a 3D detection benchmark by replacing the visual grounding utterance with detection prompts?

Do we need self.text_encoder.eval()

Hi, it seems that in the code you freeze the text_encoder by setting requires_grad of its parameters to False. Do we also need self.text_encoder.eval()? There is dropout in the text_encoder (see the sketch below for what I mean).
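For concreteness, this is what I mean (a sketch; self.text_encoder stands for the pretrained text encoder in the model):

    # freezing the parameters only stops gradient updates...
    for param in self.text_encoder.parameters():
        param.requires_grad = False

    # ...but dropout stays active in training mode unless the module
    # is also switched to eval mode explicitly:
    self.text_encoder.eval()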

965 977 excluded?

Thank you for your work.

Could you please explain the meaning of the following code block? Why were scenes 965 and 977 excluded?

    if self.split == 'train':
        annos = [
            anno for a, anno in enumerate(annos)
            if a not in {965, 977}
        ]

About the soft token prediction

Since the results of evaluate_bbox_by_contrast are higher than those of evaluate_bbox_by_span, I have a question: if the soft-token prediction loss is removed and only the contrastive alignment loss is kept, what happens to the final results? Is the alignment loss enough for model training?

About Loading Nr3D annotations.

Dear Authors,

Thanks for your great work!

I am writing to ask why some of the samples in Nr3D are filtered out, so that we only test on samples with correct_guess == True. Is this the typical protocol for this benchmark?

    def load_nr3d_annos(self):
        """Load annotations of nr3d."""
        split = self.split
        if split == 'val':
            split = 'test'
        with open('data/meta_data/nr3d_%s_scans.txt' % split) as f:
            scan_ids = set(eval(f.read()))
        with open(self.data_path + 'nr3d_pred_spans.json', 'r') as f:
            pred_spans = json.load(f)
        with open(self.data_path + 'refer_it_3d/nr3d.csv') as f:
            csv_reader = csv.reader(f)
            headers = next(csv_reader)
            headers = {header: h for h, header in enumerate(headers)}
            annos = [
                {
                    'scan_id': line[headers['scan_id']],
                    'target_id': int(line[headers['target_id']]),
                    'target': line[headers['instance_type']],
                    'utterance': line[headers['utterance']],
                    'anchor_ids': [],
                    'anchors': [],
                    'dataset': 'nr3d',
                    'pred_pos_map': pred_spans[i]['span'],  # predicted span
                    'span_utterance': pred_spans[i]['utterance']  # for assert
                }
                for i, line in enumerate(csv_reader)
                if line[headers['scan_id']] in scan_ids
                and
                str(line[headers['mentions_target_class']]).lower() == 'true'
                and
                (
                    str(line[headers['correct_guess']]).lower() == 'true'
                    or split != 'test'
                )
            ]

Eval on Referit3D metrics

Hi authors,

First of all, thanks for the great work and the codebase; it is one of the few repos that can be run directly without problems.

I noticed that your work currently sits at the top of the official ReferIt3D benchmark, and I would like to use your eval code to run some tests on the ReferIt3D dataset. However, when I run your code with sh scripts/train_test_cls.sh (the one with ground-truth boxes), I find that the evaluation metric is still IoU-based instead of the accuracy required by the ReferIt3D benchmark. If I understand correctly, in the ReferIt3D setting the candidate 3D boxes are given, and accuracy is measured as picking the right one from the list of boxes given the referring text.

I wonder if there is a way to do that kind of evaluation within your codebase, and if not, what is the best way to adapt it? (A sketch of the protocol I mean is below.) Sorry if there are any misunderstandings; I would appreciate it if you could point them out.
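To be concrete about the protocol, something along these lines (my own sketch with hypothetical names, not code from this repo):

    import numpy as np

    def referit3d_accuracy(pred_boxes, candidate_boxes, target_idx, iou_fn):
        """pred_boxes: one predicted box per utterance; candidate_boxes: the GT
        candidate boxes of the corresponding scene; target_idx: index of the
        referred candidate; iou_fn(box, boxes) -> per-candidate 3D IoU."""
        correct = 0
        for pred, cands, tgt in zip(pred_boxes, candidate_boxes, target_idx):
            picked = int(np.argmax(iou_fn(pred, cands)))  # snap prediction to a candidate
            correct += int(picked == tgt)
        return correct / len(pred_boxes)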

Regarding predictions

Hi @ayushjain1144 @nickgkan
I wanted to do an error analysis of the trained model. To do so, I wanted to check the model's predictions against the ground-truth answers. Is there already a part of the code that can export the predictions for error analysis?
If not, I would be grateful if you could provide insights on how to write a script for the same.

Thanks in advance!

Issue regarding the adaptation of butd-detr for multiview images input

Hello Authors,

Firstly, I would like to extend my sincere appreciation for your exceptional work on "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds". Your work has greatly inspired me, and I am currently working on modifying your model, butd-detr, to accommodate multiview images as input while discarding point clouds.

In my modification, I'm loading multiview images and using ResNet50 to extract image features. These features are backprojected to uniformly sampled positions in the scene space to construct 3D voxels, as implemented in ImVoxelNet (https://github.com/SamsungLabs/imvoxelnet). I have replaced the original visual stream (pointnet++) with this backbone while leaving the rest of the code unchanged.

One difference is in the compute_points_obj_cls_loss_hard_topk function, where the voxels (the positions I sampled in space) do not belong to the objects' point clouds. So, I assign the four voxels (positions) nearest to each ground-truth center as positives (see the sketch after this paragraph).
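For reference, my assignment looks roughly like this (a sketch of my own modification, not code from this repository; padded GT boxes would still need masking):

    # voxel_xyz: (B, K, 3) sampled voxel centers, gt_center: (B, K2, 3) GT box centers
    dist = torch.cdist(gt_center, voxel_xyz)                    # (B, K2, K)
    nearest = torch.topk(dist, k=4, dim=-1, largest=False)[1]   # (B, K2, 4)
    objectness_label = torch.zeros(voxel_xyz.shape[:2], device=voxel_xyz.device)
    objectness_label.scatter_(1, nearest.flatten(1), 1.0)       # 4 positives per GT box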

Additionally, I have chosen not to use the box stream since my goal is to make the model image-based only.

Despite my efforts, I've been facing an issue where the performance of the model is consistently zero post-training. I've been trying to troubleshoot this problem and I am reaching out to you to ask if I may be missing something critical.

Any insights or suggestions you could provide would be of immense help. Thank you very much for your time and consideration.

Best regards,

Hiusam

Discussion about experimental evaluation in Table 1

Hi there,

Firstly, I want to say that your work is great. Good job!

I noticed in Table 1 of your paper that you evaluated your method against previous works using both ReferIt3D and ScanRefer benchmarks. I have a couple of thoughts on this.

Regarding the ReferIt3D benchmark, I don't think it's necessary to re-train the previous works with the Group-Free 3D detector. Instead, using ground truth (GT) boxes for all methods is sufficient to demonstrate the superiority of your approach.

As for the ScanRefer benchmark, it seems that you directly used the reported results from previous works, which may not be entirely fair since you used additional bounding boxes obtained by the Group-Free 3D detector. It might be better to retrain the prior methods with these additional bounding boxes to make the comparison more equitable.

What do you think about these points?

Thanks and best regards.
