GRiT: A Generative Region-to-text Transformer for Object Understanding (https://arxiv.org/abs/2212.00280)
License: MIT License
Hello, could you please provide a tutorial on evaluating the model, and explain which evaluation metrics are available?
I want to get output like the image in the repo, and I want to use the text results in my downstream work. How can I achieve this?
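For reference, a minimal sketch of collecting the text results from the predictor output. The field names below (pred_boxes, scores, pred_object_descriptions) are assumptions modeled on GRiT's demo output, not a verified API; a stub class stands in for detectron2's Instances so the sketch is self-contained:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MockInstances:
    """Stand-in for a detectron2-style Instances object holding GRiT outputs.

    Field names are assumptions based on the demo; adapt them to whatever
    your predictor actually returns.
    """
    pred_boxes: List[List[float]] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)
    pred_object_descriptions: List[str] = field(default_factory=list)

def extract_text_results(instances) -> List[dict]:
    # Zip the parallel per-region lists into plain dicts for downstream use.
    return [
        {"box": b, "score": s, "description": d}
        for b, s, d in zip(instances.pred_boxes,
                           instances.scores,
                           instances.pred_object_descriptions)
    ]

inst = MockInstances(
    pred_boxes=[[10.0, 20.0, 110.0, 220.0]],
    scores=[0.93],
    pred_object_descriptions=["a red car parked on the street"],
)
print(extract_text_results(inst))
```

The resulting list of dicts can be serialized to JSON or fed directly into whatever downstream step consumes the captions.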
Hello,
I am currently trying to reproduce GRiT's dense captioning results. I trained the model with the default settings and obtained a checkpoint, then ran inference on the VG test set and got the JSON results with:
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_DenseCap.yaml --output-dir-name ./output/grit_b_densecap --eval-only MODEL.WEIGHTS models/grit_b_densecap.pth
However, when installing the DenseCap environment, I got stuck installing Torch on my GPU machine, which has CUDA 12.0. I always hit this error:
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
CUDA_cublas_device_LIBRARY (ADVANCED)
linked by target "THC" in directory /root/torch/extra/cutorch/lib/THC
Could you tell me what platform you use to install DenseCap and perform evaluation?
Thanks a lot!
Hello,
I am trying to use the results produced by the provided dense captioning checkpoint to evaluate on VG. In densecap/eval_utils.lua, I replaced
logprobs (confidence/score), boxes, captions
in addResult(), as well as idx_to_token and vocab_size
in the model,
and got an mAP of 0.000609. I found that the number of 'ok=1' cases is very small, meaning few ground-truth boxes are matched to predictions, so it seems I did something wrong.
I combined GRiT's boxes, descriptions, and score predictions for each image and fed them into addResult() per image in DenseCap, but I got a relatively low mAP, and the IoU between ground-truth and predicted boxes was very small. Could you please tell me what I am doing wrong? Thank you!
Here is the process of replacement:
` while true do
  ------- single image -------
  counter = counter + 1
  -- Grab a batch of data and convert it to the right dtype (batch_size = 1)
  local loader_kwargs = {split=split, iterate=true}
  local img, gt_boxes, gt_labels, info, _ = loader:getBatch(loader_kwargs)
  info = info[1]
  -- Find the index of the corresponding predictions; image_id, box, score,
  -- and descriptions share the same index for a given image
  local index_
  for index, v in ipairs(my_results.image_id) do
    if tostring(v) == string.gsub(info.filename, '.jpg', '') then
      index_ = index
      print(index_)
    end
  end
  assert(string.gsub(info.filename, '.jpg', '') == tostring(my_results.image_id[index_]))
  -- Replace these with the predictions of the corresponding image from GRiT
  local boxes = torch.Tensor(my_results.box[index_])
  local logprobs = torch.Tensor(my_results.score[index_])
  local captions = my_results.descriptions[index_]
  local gt_captions = model.nets.language_model:decodeSequence(gt_labels[1]) -- seq: tensor of shape N x T, id_to_tokens, bs = 1
  evaluator:addResult(logprobs, boxes, captions, gt_boxes[1], gt_captions)`
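One common cause of near-zero IoU in this kind of setup is a box-format mismatch: GRiT's detectron2-style outputs are (x1, y1, x2, y2), while DenseCap-style code often works with (x, y, w, h) boxes. Whether that is the cause here is an assumption, but it is cheap to check with a quick conversion and IoU sanity check, sketched in plain Python:

```python
def xyxy_to_xywh(box):
    """Convert an (x1, y1, x2, y2) box to (x, y, w, h)."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]

def iou_xyxy(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

# Sanity check: a prediction and ground truth that visibly overlap should
# have a healthy IoU when both are interpreted as xyxy. If one side is
# secretly xywh, overlaps collapse toward zero, matching the symptom above.
pred = [50.0, 50.0, 150.0, 150.0]   # xyxy
gt = [60.0, 60.0, 160.0, 160.0]     # xyxy
print(iou_xyxy(pred, gt))           # healthy overlap in the correct format
print(xyxy_to_xywh(pred))           # what xywh-based code would expect
```

Comparing IoUs computed on raw GRiT boxes versus converted boxes against a few ground-truth boxes should quickly reveal which format the evaluator is actually receiving.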
I got this error:
`
File "/public/Medical_image_segmentation/lixi/detectron2/detectron2/data/common.py", line 90, in __getitem__
data = self._map_func(self._dataset[cur_idx])
File "/public/Medical_image_segmentation/lixi/GRiT-4/grit/data/custom_dataset_mapper.py", line 53, in __call__
dataset_dict_out = self.prepare_data(dataset_dict)
File "/public/Medical_image_segmentation/lixi/GRiT-4/grit/data/custom_dataset_mapper.py", line 99, in prepare_data
object_descriptions = [an['object_description'] for an in dataset_dict["annotations"]]
File "/public/Medical_image_segmentation/lixi/GRiT-4/grit/data/custom_dataset_mapper.py", line 99, in <listcomp>
object_descriptions = [an['object_description'] for an in dataset_dict["annotations"]]
KeyError: 'object_description'
`
What is going wrong?
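The KeyError suggests the annotations being loaded lack the 'object_description' field that GRiT's mapper reads. A hedged sketch of a preprocessing pass that back-fills a placeholder description on any annotation missing the key; both the placeholder text and the decision to back-fill at all are assumptions, and the proper fix may instead be to regenerate the annotation JSON in GRiT's expected format:

```python
def backfill_object_descriptions(dataset_dicts, placeholder="object"):
    """Ensure every annotation carries an 'object_description' key.

    GRiT's custom_dataset_mapper reads an['object_description'] for each
    annotation; plain detection annotations (e.g. vanilla COCO) lack it.
    Returns the number of annotations patched.
    """
    patched = 0
    for record in dataset_dicts:
        for an in record.get("annotations", []):
            if "object_description" not in an:
                an["object_description"] = placeholder
                patched += 1
    return patched

data = [{"annotations": [{"bbox": [0, 0, 10, 10]},
                         {"bbox": [5, 5, 20, 20],
                          "object_description": "a dog"}]}]
print(backfill_object_descriptions(data))  # count of annotations patched
```

Running a pass like this over the loaded dataset dicts (or checking that the count is zero) quickly confirms whether missing keys are the culprit.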
The installation instructions say to clone the detectron2 repo and install from the clone:
git clone https://github.com/facebookresearch/detectron2.git
cd detectron2
git checkout cc87e7ec
pip install -e .
However, detectron2 is already vendored inside the GRiT repo as third_party/CenterNet2, and pip install -e .
should be run from there.
Perhaps the instructions were written before the library was included.
Maybe the install instructions should be something like this:
conda create --name grit python=3.8 -y
conda activate grit
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
cd ..
git clone https://github.com/JialianW/GRiT.git
cd GRiT
pip install -r requirements.txt
cd third_party/CenterNet2
git checkout cc87e7ec
pip install -e .
If PR #16 is merged, then installation is automated and this may become obsolete.
Hello, Jialian.
I am currently training the model on 4 RTX 3090 GPUs and find that the batch size is small.
I changed SOLVER.IMS_PER_BATCH from 64 to 128 in config/base.yaml, but memory consumption doesn't seem to grow.
Could you please tell me how to increase it?
Thanks a lot.
Hey,
Thanks for your inspiring work.
How long will the training take with 8 x A100 GPUs?
Hi.
I notice that you present results in Table 1 of the paper, but those results are based on the bounding boxes predicted by your foreground object extractor, which might lead to error propagation.
As a result, the real performance of the text decoder is unclear and possibly underestimated. So I'm curious about the text decoder's true performance: can you provide results based on the GT boxes?
Hello. Could you please tell me how to use demo.py to generate captions for boxes I supply? I don't need it to do object detection.
Hi, @JialianW! Thanks for your wonderful work!
I tried to run GRiT on 4 nodes (32 GPUs) with the following command:
python train_deepspeed.py --num-machines 4 --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet
However, I notice that only one GPU is used on each node.
In the implementation, there is no mp.spawn
in the multi-node deepspeed launcher. Is this the reason, and is there any plan to fix it?
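The usual pattern in a multi-node launcher is to spawn one worker process per local GPU on each node (e.g. via torch.multiprocessing.spawn with nprocs equal to the local GPU count) and derive each worker's global rank from the machine rank. A minimal sketch of that rank arithmetic; the names here are illustrative, not GRiT's actual launcher code:

```python
def global_rank(machine_rank: int, num_gpus_per_machine: int, local_rank: int) -> int:
    """Rank of a worker in the world, given its node index and local GPU index."""
    return machine_rank * num_gpus_per_machine + local_rank

# 4 machines x 8 GPUs = world size 32. Each node must spawn 8 workers;
# if it spawns only one, only one GPU per node does any work, which
# matches the symptom described above.
world = [global_rank(m, 8, l) for m in range(4) for l in range(8)]
print(len(world), world[0], world[-1])
```

Checking how many worker processes each node actually launches (and the ranks they report) should confirm whether the launcher is the bottleneck.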
Despite the fact that I cloned the detectron2 repository, I get:
Traceback (most recent call last):
File "D:\Python\VisionGRIT\GRiT\demo.py", line 9, in
from detectron2.config import get_cfg
ModuleNotFoundError: No module named 'detectron2'
Hi, Jialian,
When I ran train_net.py with the default settings and reached 22,000 iterations, it reported a "CUDA out of memory" error. May I ask what the reason is? (Running on a single 48 GB GPU.)
How much data is required to fine-tune GRiT?
Hi,
Thanks for your amazing work. I tried to retrain the model on VG; however, there seems to be a corner case that raises an error:
[01/16 12:04:41 d2.utils.events]: eta: 1 day, 11:49:23 iter: 1360 total_loss: 2.975 loss_box_reg_stage0: 0.2477 loss_box_reg_stage1: 0.3255 loss_box_reg_stage2: 0.2068 loss_centernet_agn_neg: 0.0414 loss_centernet_agn_pos: 0.1851 loss_centernet_loc: 0.3947 loss_cls_stage0: 0.2062 loss_cls_stage1: 0.1867 loss_cls_stage2: 0.1439 loss_mask: 0.3913 text_decoder_loss: 0.6096 time: 0.7084 data_time: 0.0160 lr: 7.7501e-07 max_mem: 21398M
[01/16 12:04:42] grit.modeling.roi_heads.grit_roi_heads INFO: all proposals are background at stage 2
Traceback (most recent call last):
File "train_deepspeed.py", line 263, in <module>
launch_deepspeed(
File "/nvme/xxxxx/GRiT/lauch_deepspeed.py", line 67, in launch_deepspeed
mp.spawn(
File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/nvme/xxxxx/GRiT/lauch_deepspeed.py", line 133, in _distributed_worker
main_func(*args)
File "/nvme/xxxxx/GRiT/train_deepspeed.py", line 251, in main
do_train(cfg, model, resume=args.resume, train_batch_size=train_batch_size)
File "/nvme/xxxxx/GRiT/train_deepspeed.py", line 175, in do_train
loss_dict = model(data)
File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1656, in forward
loss = self.module(*inputs, **kwargs)
File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/nvme/xxxxx/GRiT/grit/modeling/meta_arch/grit.py", line 59, in forward
proposals, roihead_textdecoder_losses = self.roi_heads(
File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 302, in forward
losses = self._forward_box(features, proposals, targets, task=targets_task)
File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 173, in _forward_box
proposals = self.check_if_all_background(proposals, targets, k)
File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 142, in check_if_all_background
proposals[0].proposal_boxes.tensor[0, :] = targets[0].gt_boxes.tensor[0, :]
IndexError: index 0 is out of bounds for dimension 0 with size 0
The error seems to indicate there are no proposals at all for this batch, and it can be easily reproduced with single-node training at around iteration 1360.
Would you mind checking it? I'm not familiar enough with this repo.
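From the traceback, check_if_all_background writes the first ground-truth box into proposal slot 0, which fails when the proposal tensor is empty. A hedged sketch of the kind of guard that would sidestep the crash, using plain Python lists as stand-ins for the tensors; this is a workaround sketch, not a verified patch to grit_roi_heads.py:

```python
def patch_all_background(proposal_boxes, gt_boxes):
    """If every proposal is background, inject the first GT box.

    Mirrors the apparent intent of check_if_all_background: guarantee at
    least one foreground proposal. Unlike unconditionally overwriting
    index 0, this handles the empty-proposal corner case that triggers
    the IndexError at iteration ~1360.
    """
    if not gt_boxes:
        return proposal_boxes           # nothing to inject
    if not proposal_boxes:
        return [list(gt_boxes[0])]      # empty: append instead of indexing
    proposal_boxes[0] = list(gt_boxes[0])  # non-empty: original behavior
    return proposal_boxes

print(patch_all_background([], [[0, 0, 10, 10]]))             # empty case survives
print(patch_all_background([[1, 1, 2, 2]], [[0, 0, 10, 10]]))  # normal case
```

The same shape-check-before-index idea translates directly to the tensor code: test the proposal tensor's first dimension before assigning to row 0, and concatenate the GT box when it is zero.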
Under third_party/CenterNet2 there is detectron2, and yet another project/CenterNet2.
The installation, however, requires installing detectron2 separately.
It is a little confusing what is needed and what is not in the git submodule.
Thank you for the nice work!
Is it possible to use a larger ViT backbone for dense captioning?
Is there a reason that only the ViT-B backbone is provided for dense captioning?
Thank you.
Hi,
I need to perform batch inference for my use case. I followed this thread that extends the DefaultPredictor
class to enable batched inputs, but I end up with this error:
grit/modeling/roi_heads/grit_roi_heads.py:230, in GRiTROIHeadsAndTextDecoder._forward_box(self, features, proposals, targets, task)
227 predictor, predictions, proposals = head_outputs[-1]
228 boxes = predictor.predict_boxes(
229 (predictions[0], predictions[1]), proposals)
--> 230 assert len(boxes) == 1
231 pred_instances, _ = self.fast_rcnn_inference_GRiT(
232 boxes,
233 scores,
(...)
239 self.soft_nms_enabled,
240 )
242 assert len(pred_instances) == 1, "Only support one image"
AssertionError:
Commenting out the assertion doesn't help either.
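Given the explicit "Only support one image" assertion in _forward_box, a pragmatic workaround is to keep the model single-image and batch at the caller: run the predictor once per image and collect the outputs. A sketch with a stub predictor standing in for the real model; the stub is purely illustrative, substitute your actual predictor:

```python
from typing import Any, Callable, List

def batched_inference(predictor: Callable[[Any], Any], images: List[Any]) -> List[Any]:
    """Emulate batch inference over a single-image-only predictor.

    GRiT's ROI heads assert len(boxes) == 1, so instead of feeding a
    batched input, loop over images, trading some throughput for
    correctness.
    """
    return [predictor(img) for img in images]

# Stub predictor for illustration only: "describes" an image by its size.
stub = lambda img: f"{img[0]}x{img[1]} image"
print(batched_inference(stub, [(640, 480), (800, 600)]))
```

This sacrifices the GPU-level batching the thread was after, but it works with the model as shipped; true batched support would require reworking the single-image assumptions inside the ROI heads.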
Sorry to bother you, but it seems the official website is down, and I need the official dense captioning annotations. I'm hoping you can share them; that would be very useful. Thanks a lot!