
Image Captioning

Image Captioning is an img2txt tool built on the BLIP model. It exports captions for a folder of images.

Checkpoints [Required]

If there is no 'checkpoints' folder, the script will create it automatically and download the model file; you can also do this manually if you prefer.

Download the fine-tuned checkpoint and copy it into the 'checkpoints' folder (create it if it does not exist).
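The auto-download step described above can be sketched roughly as follows. The checkpoint URL below is a placeholder, not the project's real download link; substitute the actual URL from the release page.

```python
# Hedged sketch of the checkpoint bootstrap described above.
# CHECKPOINT_URL is a hypothetical placeholder, not the real link.
import os
import urllib.request

CHECKPOINT_DIR = "checkpoints"
CHECKPOINT_FILE = os.path.join(CHECKPOINT_DIR, "model_large_caption.pth")
CHECKPOINT_URL = "https://example.com/model_large_caption.pth"  # placeholder

def ensure_checkpoint(path=CHECKPOINT_FILE, url=CHECKPOINT_URL):
    """Create the checkpoints folder and download the model if it is missing."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if not os.path.exists(path):
        # Large file; this can take a while on a slow connection.
        urllib.request.urlretrieve(url, path)
    return path
```

If the file is already present, the function is a no-op apart from ensuring the folder exists, which mirrors the manual option mentioned above.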

Demo

datasets\0.jpg, a piece of cheese with figs and a piece of cheese
datasets\1002.jpg, a close up of a yellow flower with a green background
datasets\1005.jpg, a planter filled with lots of colorful flowers
datasets\1008.jpg, a teacher standing in front of a classroom full of children
datasets\1011.jpg, a tortoise on a white background with a white background
datasets\1014.jpg, a glass of wine sitting on top of a table
datasets\1017.jpg, a close up of a plant with pink flowers
datasets\102.jpg, a platter of different types of sushi
datasets\1020.jpg, a frog sitting on top of a bamboo stick
datasets\1023.jpg, a revolver on a white background
datasets\1026.jpg, a woman holding a small white dog in her arms
datasets\1029.jpg, a woman in a business suit standing in front of a building
datasets\1032.jpg, sliced cucumber on a white background
datasets\1035.jpg, a woman in glasses and a pair of boxing gloves
datasets\1038.jpg, a pile of sliced potatoes on a white surface
datasets\1041.jpg, two glasses of orange juice on a wooden table
datasets\1044.jpg, a woman sitting on the floor in front of a door
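The demo output above is one `path, caption` pair per line. A small sketch of parsing that format back into a dict (the exact output file name is an assumption; adjust to whatever the script writes):

```python
# Parse "path, caption" lines like the demo output above into a dict.
def parse_captions(lines):
    captions = {}
    for line in lines:
        # Split on the first ", " so commas inside captions survive.
        path, _, caption = line.partition(", ")
        captions[path] = caption.strip()
    return captions

demo = [
    r"datasets\0.jpg, a piece of cheese with figs and a piece of cheese",
    r"datasets\102.jpg, a platter of different types of sushi",
]
print(parse_captions(demo)[r"datasets\102.jpg"])
# -> a platter of different types of sushi
```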

Usage

usage: inference.py [-h] [-i INPUT] [-b BATCH] [-p PATHS] [-g GPU_ID]        

Image caption CLI

optional arguments:
  -h, --help                      show this help message and exit
  -i INPUT,  --input INPUT        Input directory path, such as ./images
  -b BATCH,  --batch BATCH        Batch size
  -p PATHS,  --paths PATHS        A text file (e.g. any.txt) containing all image paths.
  -g GPU_ID, --gpu-id GPU_ID      gpu device to use (default=0) can be 0,1,2 for multi-gpu
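The usage text above can be reconstructed as a standard argparse parser. This is a sketch based only on the help output shown; the defaults for `--batch` are assumptions.

```python
# Minimal reconstruction of the CLI described above (names taken from
# the usage text; the --batch default is an assumption).
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Image caption CLI")
    parser.add_argument("-i", "--input",
                        help="Input directory path, such as ./images")
    parser.add_argument("-b", "--batch", type=int, default=1,
                        help="Batch size")
    parser.add_argument("-p", "--paths",
                        help="A text file containing all image paths")
    parser.add_argument("-g", "--gpu-id", type=int, default=0,
                        help="GPU device to use (default=0)")
    return parser

args = build_parser().parse_args(["-i", "./images", "-b", "8", "-g", "0"])
print(args.input, args.batch, args.gpu_id)  # ./images 8 0
```

Note that argparse converts `--gpu-id` to the attribute `args.gpu_id`.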

Example

python inference.py -i /path/images/folder --batch 8 --gpu-id 0

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT


image-captioning's Issues

Multi-GPU does not work

Multi-GPU does not work because '0,1' is not a valid value for an integer parameter.
This results in:
inference.py: error: argument -g/--gpu-id: invalid int value: '0,1'
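The error occurs because `--gpu-id` is declared with `type=int`, so a comma-separated value like '0,1' cannot be parsed. One possible fix (an assumption, not the project's actual patch) is a custom argparse type that accepts a list of device ids:

```python
# Possible fix sketch: let --gpu-id accept '0' or '0,1,2'.
import argparse

def gpu_list(value):
    """Parse a comma-separated device string into a list of ints."""
    return [int(v) for v in value.split(",")]

parser = argparse.ArgumentParser()
parser.add_argument("-g", "--gpu-id", type=gpu_list, default=[0])

args = parser.parse_args(["-g", "0,1"])
print(args.gpu_id)  # [0, 1]
```

Downstream code would then need to iterate over `args.gpu_id` (e.g. via `torch.nn.DataParallel`) instead of treating it as a single int.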

RuntimeError: The size of tensor a (3) must match the size of tensor b (9) at non-singleton dimension 0

All of my images are 256x256 pixels (taken from the images sample folder, just some of the 256x256 ones).

C:\git\image-captioning>python inference.py -i C:\git\image-captioning\inputs --batch 3 --gpu 0
Device: cpu
Images found: 8
Split size: 2
Checkpoint loading...
load checkpoint from ./checkpoints/model_large_caption.pth

Model to cpu
Inference started
0batch [00:02, ?batch/s]
Traceback (most recent call last):
File "C:\git\image-captioning\inference.py", line 88, in <module>
caption = model.generate(
File "C:\git\image-captioning\models\blip.py", line 201, in generate
outputs = self.text_decoder.generate(
File "C:\Python\Python310\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Python\Python310\lib\site-packages\transformers\generation\utils.py", line 1752, in generate
return self.beam_search(
File "C:\Python\Python310\lib\site-packages\transformers\generation\utils.py", line 3091, in beam_search
outputs = self(
File "C:\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\git\image-captioning\models\med.py", line 886, in forward
outputs = self.bert(
File "C:\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\git\image-captioning\models\med.py", line 781, in forward
encoder_outputs = self.encoder(
File "C:\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\git\image-captioning\models\med.py", line 445, in forward
layer_outputs = layer_module(
File "C:\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\git\image-captioning\models\med.py", line 361, in forward
cross_attention_outputs = self.crossattention(
File "C:\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\git\image-captioning\models\med.py", line 277, in forward
self_outputs = self.self(
File "C:\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\git\image-captioning\models\med.py", line 178, in forward
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: The size of tensor a (3) must match the size of tensor b (9) at non-singleton dimension 0

RuntimeError: The size of tensor a (3) must match the size of tensor b (9) at non-singleton dimension 0

my images are 256x256 pixels

/content/image-captioning
2023-09-02 18:30:18.889829: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Device: cuda:0
Images found: 263
Split size: 263
Checkpoint loading...
load checkpoint from ./checkpoints/model_large_caption.pth

Model to cuda:0
Inference started
0batch [00:01, ?batch/s]
Traceback (most recent call last):
File "/content/image-captioning/inference.py", line 88, in <module>
caption = model.generate(
File "/content/image-captioning/models/blip.py", line 201, in generate
outputs = self.text_decoder.generate(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1675, in generate
return self.beam_search(
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 3014, in beam_search
outputs = self(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/image-captioning/models/med.py", line 886, in forward
outputs = self.bert(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/image-captioning/models/med.py", line 781, in forward
encoder_outputs = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/image-captioning/models/med.py", line 445, in forward
layer_outputs = layer_module(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/image-captioning/models/med.py", line 361, in forward
cross_attention_outputs = self.crossattention(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/image-captioning/models/med.py", line 277, in forward
self_outputs = self.self(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/image-captioning/models/med.py", line 178, in forward
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: The size of tensor a (3) must match the size of tensor b (9) at non-singleton dimension 0
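A plausible reading of both tracebacks above (an interpretation, not a confirmed diagnosis): beam search expands the text states by `num_beams`, but the image embeddings fed to cross-attention keep their original batch size, so dimension 0 no longer matches (3 images vs 3 images x 3 beams = 9). A torch-free sketch of the mismatch and the usual remedy (repeating each image embedding per beam, as `repeat_interleave` would):

```python
# Sketch of the dim-0 mismatch behind the RuntimeError above.
batch, num_beams = 3, 3
image_states = [f"img{i}" for i in range(batch)]              # dim 0 = 3
text_states = [f"beam{i}" for i in range(batch * num_beams)]  # dim 0 = 9

# Cross-attention needs matching leading dims; here they differ (3 vs 9).
assert len(image_states) != len(text_states)

# Remedy: repeat each image embedding num_beams times before generate()
# (the list comprehension mimics torch's repeat_interleave along dim 0).
expanded = [s for s in image_states for _ in range(num_beams)]
assert len(expanded) == len(text_states)  # dims now match (9 vs 9)
print(expanded[:4])  # ['img0', 'img0', 'img0', 'img1']
```

In the real model this expansion would happen on the encoder outputs passed into `text_decoder.generate`; whether the bug is there or in how the batch is split is something the maintainer would need to confirm.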
