Comments (3)
Hi @Jayce1kk,
Thank you for your interest in our work. To assist further, could you provide the image, the prompt used, and a screenshot highlighting the differences between the model outputs? We will try to replicate and address the issue. If necessary, we can also suggest an alternative checkpoint for you to try. Thank you.
from groundinglmm.
![ballon](https://private-user-images.githubusercontent.com/113454324/317103155-e89db1f5-08fa-4f15-80d8-d708449710b6.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIwMDkyOTcsIm5iZiI6MTcyMjAwODk5NywicGF0aCI6Ii8xMTM0NTQzMjQvMzE3MTAzMTU1LWU4OWRiMWY1LTA4ZmEtNGYxNS04MGQ4LWQ3MDg0NDk3MTBiNi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyNlQxNTQ5NTdaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0wYzQ1ZjU5MmYxMTA2NTQwMzQ5OGUzOTM4NzkzMmVkOTcxZDFjODFhNTM0YzYyODY1YmM4ZGNhNjc2Nzk1OGM1JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.yybfdpUTC6uYyyw0h0APwrXcRSa2scNBk4BbYuZtEYc)
Hi @Jayce1kk,
Thank you for providing more details about your setup. The difference you're noticing comes from the checkpoints and model versions used in our live demo versus the local setup: our demo model is built on LLaVA 1.0, whereas the released code and models, including `MBZUAI/GLaMM-FullScope`, are built on LLaVA 1.5.
To achieve results similar to our live demo, please use this checkpoint: `GLaMM-FullScope_v0`. You'll want to make a few adjustments:
- Change the `--vision_tower` from `openai/clip-vit-large-patch14-336` to `openai/clip-vit-large-patch14`. This modification is needed because the input size of the global image encoder (the CLIP image encoder) is 224 instead of 336. You'll also need to adjust the token lengths in these lines (here and here) from 575 to 255, reflecting the change in input dimensions (224/14 = 16, resulting in 16 × 16 = 256 tokens).
- Update the V-L projection layer (LLaVA 1.0 uses a single linear layer): replace
  `self.mm_projector = nn.Sequential(*modules)`
  with
  `self.mm_projector = nn.Linear(config.mm_hidden_size, config.hidden_size)`.
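As a quick sanity check of the token-length arithmetic in the first step, here is a minimal sketch in plain Python (no project code; the helper name `num_patch_tokens` is ours). The lengths 575 and 255 used in the repository are each one less than the raw patch counts computed below:

```python
# Patch-token count for a square image fed to a ViT with 14-pixel patches.
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    side = image_size // patch_size  # patches along one side
    return side * side               # total patches in the grid

print(num_patch_tokens(336))  # 24 * 24 = 576 -> length 575 in the code
print(num_patch_tokens(224))  # 16 * 16 = 256 -> length 255 in the code
```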
We hope these adjustments help you in reproducing the demo results locally. Please don't hesitate to reach out if you encounter any issues or have further questions.
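For reference, a hypothetical sketch of the two projector variants discussed above, assuming PyTorch and LLaVA-style config names (`mm_hidden_size`, `hidden_size`); the dimensions below are illustrative, not read from the actual GLaMM config:

```python
import torch
import torch.nn as nn

# Illustrative dimensions only (not from the real config).
mm_hidden_size = 1024  # CLIP ViT-L/14 output feature dim
hidden_size = 4096     # language-model hidden dim (e.g. a 7B LLM)

# LLaVA 1.5 style projector: a two-layer MLP assembled from a module list,
# as in `self.mm_projector = nn.Sequential(*modules)`.
modules = [
    nn.Linear(mm_hidden_size, hidden_size),
    nn.GELU(),
    nn.Linear(hidden_size, hidden_size),
]
projector_v15 = nn.Sequential(*modules)

# LLaVA 1.0 style projector: a single linear layer, which is what the
# GLaMM-FullScope_v0 checkpoint expects.
projector_v10 = nn.Linear(mm_hidden_size, hidden_size)

# Both map image features to the LLM hidden size; only the depth differs.
feats = torch.randn(1, 255, mm_hidden_size)  # 255 image tokens at 224 px
print(projector_v10(feats).shape)  # torch.Size([1, 255, 4096])
print(projector_v15(feats).shape)  # torch.Size([1, 255, 4096])
```

Loading the v0 checkpoint into the 1.5-style MLP (or vice versa) will fail with a state-dict shape/key mismatch, which is why the replacement above is required.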
Related Issues (20)
- About region caption
- may i ask your total parameter?
- Some bugs in the GranD_ReferringSegm_ds.py
- Online Demo Down
- Fine-tuning Grounded Conversation Generation (GCG) Task
- token_positives
- assertion error cur_len == total_len
- can not install mmcv
- Can not find file for glamm_conda_env.zip in the given Google Drive Link
- Training on New Data
- training V-L and L-P projection layer
- Can not download the train.json file for visual genome
- How can I let the model receive multiple images at once
- How should I train on the GranD dataset
- How can I finetune on combined tasks?
- Confusing referring segmentation results.
- mmcv failed to install
- AssertionError when running a demo
- Offline demo error
- Why is it that during the computation of segmentation results, the model() function is used instead of model.generate()? Wouldn't this mean that when predicting the next token, the information viewed is from the actual token rather than the predicted one?