vikhyat / moondream Goto Github PK

View Code? Open in Web Editor NEW

3.4K 33.0 299.0 3.15 MB

tiny vision language model

Home Page: https://moondream.ai

License: Apache License 2.0

Python 42.61% Jupyter Notebook 57.39%

moondream's Introduction

🌔 moondream

a tiny vision language model that kicks ass and runs anywhere

Website | Hugging Face | Demo

Benchmarks

moondream2 is a 1.86B parameter model initialized with weights from SigLIP and Phi 1.5.

Model	VQAv2	GQA	TextVQA	TallyQA (simple)	TallyQA (full)
moondream1	74.7	57.9	35.6	-	-
moondream2 (latest)	77.7	61.7	49.7	80.1	74.2

Examples

Image	Example
	What is the girl doing? The girl is sitting at a table and eating a large hamburger. What color is the girl's hair? The girl's hair is white.
	What is this? The image features a computer server rack, which is a large metal structure designed to hold and organize multiple computer components, such as motherboards, cooling systems, and other peripherals. The rack is filled with various computer parts, including multiple computer chips, wires, and other electronic components. The rack is placed on a carpeted floor, and there is a couch in the background, suggesting that the setup is likely in a living or working space. What is behind the stand? There is a brick wall behind the stand.

Image

Example

What is the girl doing?
The girl is sitting at a table and eating a large hamburger.

What color is the girl's hair?
The girl's hair is white.

What is this?
The image features a computer server rack, which is a large metal structure designed to hold and organize multiple computer components, such as motherboards, cooling systems, and other peripherals. The rack is filled with various computer parts, including multiple computer chips, wires, and other electronic components. The rack is placed on a carpeted floor, and there is a couch in the background, suggesting that the setup is likely in a living or working space.

What is behind the stand?
There is a brick wall behind the stand.

Usage

Using transformers (recommended)

pip install transformers einops

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-04-02"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))

The model is updated regularly, so we recommend pinning the model version to a specific release as shown above.

To enable Flash Attention on the text model, pass in attn_implementation="flash_attention_2" when instantiating the model.

model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision,
    torch_dtype=torch.float16, attn_implementation="flash_attention_2"
).to("cuda")

Batch inference is also supported.

answers = moondream.batch_answer(
    images=[Image.open('<IMAGE_PATH_1>'), Image.open('<IMAGE_PATH_2>')],
    prompts=["Describe this image.", "Are there people in this image?"],
    tokenizer=tokenizer,
)

Using this repository

Clone this repository and install dependencies.

pip install -r requirements.txt

sample.py provides a CLI interface for running the model. When the --prompt argument is not provided, the script will allow you to ask questions interactively.

python sample.py --image [IMAGE_PATH] --prompt [PROMPT]

Use gradio_demo.py script to start a Gradio interface for the model.

python gradio_demo.py

webcam_gradio_demo.py provides a Gradio interface for the model that uses your webcam as input and performs inference in real-time.

python webcam_gradio_demo.py

Limitations

The model may generate inaccurate statements, and struggle to understand intricate or nuanced instructions.
The model may not be free from societal biases. Users should be aware of this and exercise caution and critical thinking when using the model.
The model may generate offensive, inappropriate, or hurtful content if it is prompted to do so.

moondream's People

Contributors

Stargazers

Watchers

Forkers

creativebuilds tomchapin lee-b rminz keyman9848 sroecker arunpattnaik maekawatoshiki techthiyanes yvrjsharma suryatmodulus shadowcz007 dmillner dnth lucataco umairkhan40 mikebirdtech shuhangchen buckedunicorn ayushmodi123 spartanhaden pexavc rkp64 ysurs ailabteam yanxg akdev96 mrcodechef cyrilmagsuci kijai be-farad fastrocket newmedia2 wesley7137 henri-edh princetrunks tymczas zhenghax kustomzone yacov auxon ilyamk natyren barleyj21 azmisahincom guezpatzchio mocy leosapucaia foxalabs mazzzystar alexwd26 elmspark yuanzhongqiao gj-raza codewithsadaf unclecode f901107 xyncgas miss-mickeymouse pink-black-bear eltociear kendrickcheung007 abdoiiii alidjanka nie3e mrpoou gopalanj fiditenemini andreclaudino yihuangdb regireis7 goldenbeef edwinyzh likewaterg freakynit daddyunikii leevaleeth mbakpur123 adeliavale noorchauhan mrnobody700 amir2pl mrsnobody84 stackshareiodev axrwl clic-ethiopia mbreault nkarpov kobishaha hhy5277 chengyi051790 limeng101192 apollohuang1 soon14 sennoy11012 antoinemarcel vidallia apprikatai timsamart cyannrogue

moondream's Issues

Training Code

Hi there,

Great work. I wanted to enquire if we can train the model on custom dataset and if the training code be made available.

Regards,

Phi Training Dataset

Hi,
I saw in the readme that you mentioned it's trained on the Phi dataset, would you mind sharing more about where the dataset is available? Is it public?
Thanks!

Please add a license for the code in this repo

First, thank you for sharing this project with us!

I see the weights license is addressed in the README:

moondream/README.md

Line 8 in 9d678c9

Weights are licensed under CC-BY-SA due to using the LLaVA dataset. Try it out

However there's no statement as to the license of the code of this repo.

Could you please add a license statement, and ideally a LICENSE file to the repo to clarify?

Per GitHub docs on licensing:

[...] without a license, the default copyright laws apply, meaning that you
retain all rights to your source code and no one may reproduce, distribute,
or create derivative works from your work. If you're creating an open source
project, we strongly encourage you to include an open source license.

Thanks!

Update example images

Use public domain / permissively licensed images for the examples in the README. I just picked random images from my downloads folder, don't even know where they came from. Would be a good idea to use a consistent aspect ratio so the table looks nice.

Any response starting with "No" is is dropping the "N"

While using gradio_demo.py, I noticed that anytime time a response starts with "No" it seems to drop the first character. For example : "o, UPS does not offer pickup services for returns at home. The return summary card indicates that the UPS store is the location where the return should be made."

When using sample.py, the response is complete. ""No, UPS does not offer..."

Any Thoughts?

Response cut off early, still very good though

Excellent speed improvements with this, thank you for the great work!

I noticed quite a few results are being cut off mid sentence. Using the image at the top of this article, the description gets cut off early.

"The image features a modern and well-lit home office with a large desk situated in the center of the room. The desk is equipped with a computer monitor, keyboard, and mouse, creating a functional workspace. A chair is placed in front of the desk, providing a comfortable seating option for the user.

In addition to the main desk, there is a bookshelf filled with various books, adding a touch of organization and intellectual ambiance to the room. A potted plant is also present, adding a touch of greenery and life to the office.

A chair is placed in front of the desk, and a"

Inference batch size size for querying multiple images at the same time?

I'd like to be able to query the model and have it respond separately to each image in the provided batch. Is this possible right now?

Looks like I can potentially add with the following changes:

In the vision encoder we add support for multiple PIL images in the vision encoder:

    def __call__(self, images) -> torch.Tensor:
        # Convert a single image to a list if needed
        if isinstance(images, Image.Image):
            images = [images]

        with torch.no_grad():
            # Preprocess images and stack them into a single tensor
            x = torch.stack([self.preprocess(img.convert("RGB")) for img in images])

            x = x.to(self.device, dtype=self.dtype)

            # Rearrange the tensor for patch embedding
            x = rearrange(x, "b c (h p1) (w p2) -> b (h w) (c p1 p2)", p1=14, p2=14)

            # Pass through the encoder
            x = self.encoder(x)

            # Projection
            x = self.projection(x)

            return x

The next modification happens in moondream.py at:

def input_embeds(self, prompt, image_embeds, tokenizer):
...

        if "<image>" not in prompt:
            embeds.append(text_emb(_tokenize(prompt)))
            return  return torch.cat(embeds, dim=1)
        else:
            embeds = []
            assert prompt.count("<image>") == 1
            before, after = prompt.split("<image>")
            for i in range(image_embeds.shape[0]):
                embeds_batch = []
                embeds_batch.append(
                    text_emb((torch.tensor([[tokenizer.bos_token_id]], device=self.device)))
                )
                embeds_batch.append(text_emb(_tokenize(f"{before}<image>")))
                embeds_batch.append(image_embeds[i:i+1].to(self.device))
                embeds_batch.append(text_emb(_tokenize(f"</image>{after}")))
                embeds.append(torch.cat(embeds_batch, dim=1))
            
            return torch.cat(embeds, dim=0)

The final modification appears to be required for the caption outputs in moondream.py:


    def answer_question(
        self,
        image_embeds,
        question,
        tokenizer,
        chat_history="",
        result_queue=None,
        **kwargs,
    ):
        prompt = f"<image>\n\n{chat_history}Question: {question}\n\nAnswer: "
        answers = self.generate(
            image_embeds,
            prompt,
            eos_text="<END>",
            tokenizer=tokenizer,
            max_new_tokens=256,
            **kwargs,
        )
        results = []
        for answer in answers:
            cleaned_answer = re.sub("<$|<END$", "", answer).strip()

            # Use the result_queue to pass the result if it is provided
            if result_queue:
                result_queue.put(cleaned_answer)
            else:
                results.append(cleaned_answer)
        return results

Could you give some more info about the Model?

Hey there, thanks for the repo.
Could you give more info on the image encoder that is used here?

BUG: RunTime ERROR running on 64 bit Raspberry PI OS

code to replicate:

python3 sample.py --image demo-1.jpg --interactive

error :

  File "/home/vizuosense/sensei/engine.py", line 25, in <module>
    print(text_model.answer_question(image_embeds, question))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/moondream/text_model.py", line 102, in answer_question
    answer = self.generate(
             ^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/moondream/text_model.py", line 74, in generate
    output_ids = self.model.generate(
                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/transformers/generation/utils.py", line 1718, in generate
    return self.greedy_search(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
    outputs = self(
              ^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/moondream/phi/modeling_phi.py", line 956, in forward
    hidden_states = self.transformer(
                    ^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/moondream/phi/modeling_phi.py", line 915, in forward
    hidden_states = layer(
                    ^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/moondream/phi/modeling_phi.py", line 732, in forward
    hidden_states = self.ln(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/torch/nn/modules/normalization.py", line 196, in forward
    return F.layer_norm(
           ^^^^^^^^^^^^^
  File "/home/vizuosense/sensei/venv/lib/python3.11/site-packages/torch/nn/functional.py", line 2543, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: mixed dtype (CPU): expect parameter to have scalar type of Float

tranformers pipeline

please add tranformers pipeline as per merge request
https://huggingface.co/vikhyatk/moondream1/discussions/6
I checked and ?doesn't seem to be implemented yet?

Could you compare with MoE-LLaVA-1.6B×4-Top2? It seems more better?

Model	Activated Param	Resolution	VQAv2	GQA	VizWiz	T-VQA
MoE-LLaVA-1.6B×4-Top2	2.0B	336	76.7	60.3	36.2	50.1
moondream	1.6B	384	74.3	56.3	30.3	39.8

I just found its results from https://github.com/PKU-YuanGroup/MoE-LLaVA/tree/main?tab=readme-ov-file#-model-zoo

Error "At least one of the model submodule will be offloaded to disk, please pass along an `offload_folder`"

Hi,

This project looks cool & interesting so was trying it out - I don't know if this is an actual issue or just an issue with my configuration, but when I run it (after doing pip install -r requirements.txt) I get the following error (I get the same error both on my 8GB M2 Mac Mini Sonoma 14.1.2 and my 8GB M1 MacBook Air Monterey 12.7.1, and I get it both if I try run in a virtual environment, or just with the global Python setup):

Have I done something wrong? Is there something extra I am meant to download? Anyone else get this? How much disk space should I have?

(moon) david@Davids-Mac-mini moondream % ./python3 sample.py --image assets/demo1.jpg --interactive
Fetching 12 files: 100%|...| 12/12 [00:00<00:00, 82646.38it/s]
model_path=/Users/david/.cache/huggingface/hub/models--vikhyatk--moondream0/snapshots/c875d938535a26218a62d8a4c21818f3622dba6c
Traceback (most recent call last):
  File "/Users/david/moondream/sample.py", line 12, in <module>
    text_model = TextModel(model_path)
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/david/moondream/moondream/text_model.py", line 22, in __init__
    self.model = load_checkpoint_and_dispatch(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/david/moondream/moon/lib/python3.11/site-packages/accelerate/big_modeling.py", line 545, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "/Users/david/moondream/moon/lib/python3.11/site-packages/accelerate/utils/modeling.py", line 1373, in load_checkpoint_in_model
    raise ValueError(
ValueError: At least one of the model submodule will be offloaded to disk, please pass along an `offload_folder`.

More info:

(moon) david@Davids-Mac-mini moondream % python3 --version
Python 3.11.6
(moon) david@Davids-Mac-mini moondream % which python3
/Users/david/moondream/moon/bin/python3
(moon) david@Davids-Mac-mini moondream % which pip    
/Users/david/moondream/moon/bin/pip

Use Phi implementation from transformers

The text model weights provided are for the old version of Phi before it was integrated into the Huggingface transformers library. Need to write a script to convert the weight keys into the new format and switch to using the Huggingface implementation directly.

Suggestions for future model

Thank you for your model!

If you want to get inspiration from the best multimodal atm for future improvements for moondream, check out qwen-vl-max.

If you want to squeeze out more details from your model maybe you can have it auto slice the image into several pieces, caption each slice, then caption the whole image, then combine all the captions into a single caption.

I do this manually (cropping the image several times in photoshop) for problem images and I am able to get the model to see details it would normally miss or ignore.

can it use bitsandbytes, better transformer, and/or flash attention 2?

Hello, the title explains my question, but here's a link to where I ask the same thing on the huggingface repo and provide a code snippet from a script of mine, albeit a different model, that can use those libraries...

https://huggingface.co/vikhyatk/moondream2/discussions/1

If the creator or anyone else could let me know I'd be very interested!

Multiple image embeds in one prompt?

By slightly augmenting the code I was trying to embed two images into the prompt in the hope that the model would be able to make comparisons between them, but so far it looks like it always just sees the last embed. I am wondering if this approach is feasible at all and what would be required to make this work?

This is my change in sample.py:

parser = argparse.ArgumentParser()
parser.add_argument("--image", type=str, required=True)
parser.add_argument("--image2", type=str, required=True)
parser.add_argument("--prompt", type=str, required=False)
args = parser.parse_args()

image = Image.open(args.image)
image_embeds = vision_encoder(image)
print("image_embeds",image_embeds.size())

image2 = Image.open(args.image2)
image_embeds2 = vision_encoder(image2)
print("image_embeds",image_embeds2.size())

image_embeds = torch.cat((image_embeds,image_embeds2),0)
print("image_embeds",image_embeds.size())

And this is my change in text_model.py

def input_embeds(self, prompt, image_embeds):
        embeds = []

        def _add_toks(toks):
            embeds.append(self.text_emb(toks))

        def _tokenize(txt):
            return self.tokenizer(
                txt, return_tensors="pt", add_special_tokens=False
            ).input_ids.to(self.model.device)

        # Add BOS token
        _add_toks(
            torch.tensor([[self.tokenizer.bos_token_id]], device=self.model.device)
        )

        if "<image>" not in prompt:
            embeds.append(self.text_emb(_tokenize(prompt)))
        else:
            assert prompt.count("<image>") == 1
            before, after = prompt.split("<image>")
            
            if image_embeds.size(0)==1:
                embeds.append(self.text_emb(_tokenize(f"{before}<image>")))
                embeds.append(image_embeds.to(self.model.device))
                embeds.append(self.text_emb(_tokenize(f"</image>{after}")))
            else:
                if len(before)>0:    
                    embeds.append(self.text_emb(_tokenize(f"{before}")))
                for i in range(image_embeds.size(0)):
                    embeds.append(self.text_emb(_tokenize(f"Image #{i+1}: <image>")))
                    embeds.append(image_embeds[i].unsqueeze(0).to(self.model.device))
                    embeds.append(self.text_emb(_tokenize(f"</image>")))
                if len(after)>0:
                    embeds.append(self.text_emb(_tokenize(f"{after}")))

        return torch.cat(embeds, dim=1)

It won't roast me. Remove censorship.

If I ask it to roast me (my picture), it just says "no". Even Twitter's Grok allows roasting.

Wrong Output

ModuleNotFoundError: No module named 'moondream'

D:\AI\mdr>python -m venv moondr

D:\AI\mdr>cd moondr

D:\AI\mdr\moondr>cd scripts

D:\AI\mdr\moondr\Scripts>activate.bat

(moondr) D:\AI\mdr\moondr\Scripts>pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting torch
Downloading https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-win_amd64.whl (2704.3 MB)
---------------------------------------- 2.7/2.7 GB 380.5 kB/s eta 0:00:00
Collecting torchvision
Using cached https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-win_amd64.whl (4.9 MB)
Collecting torchaudio
Using cached https://download.pytorch.org/whl/cu118/torchaudio-2.2.0%2Bcu118-cp310-cp310-win_amd64.whl (4.0 MB)
Collecting networkx
Downloading https://download.pytorch.org/whl/networkx-3.2.1-py3-none-any.whl (1.6 MB)
---------------------------------------- 1.6/1.6 MB 14.9 MB/s eta 0:00:00
Collecting filelock
Using cached https://download.pytorch.org/whl/filelock-3.9.0-py3-none-any.whl (9.7 kB)
Collecting jinja2
Using cached https://download.pytorch.org/whl/Jinja2-3.1.2-py3-none-any.whl (133 kB)
Collecting sympy
Using cached https://download.pytorch.org/whl/sympy-1.12-py3-none-any.whl (5.7 MB)
Collecting fsspec
Using cached https://download.pytorch.org/whl/fsspec-2023.4.0-py3-none-any.whl (153 kB)
Collecting typing-extensions>=4.8.0
Using cached https://download.pytorch.org/whl/typing_extensions-4.8.0-py3-none-any.whl (31 kB)
Collecting pillow!=8.3.*,>=5.3.0
Downloading https://download.pytorch.org/whl/pillow-10.2.0-cp310-cp310-win_amd64.whl (2.6 MB)
---------------------------------------- 2.6/2.6 MB 12.9 MB/s eta 0:00:00
Collecting numpy
Downloading https://download.pytorch.org/whl/numpy-1.26.3-cp310-cp310-win_amd64.whl (15.8 MB)
---------------------------------------- 15.8/15.8 MB 11.5 MB/s eta 0:00:00
Collecting requests
Using cached https://download.pytorch.org/whl/requests-2.28.1-py3-none-any.whl (62 kB)
Collecting MarkupSafe>=2.0
Using cached https://download.pytorch.org/whl/MarkupSafe-2.1.3-cp310-cp310-win_amd64.whl (17 kB)
Collecting urllib3<1.27,>=1.21.1
Using cached https://download.pytorch.org/whl/urllib3-1.26.13-py2.py3-none-any.whl (140 kB)
Collecting certifi>=2017.4.17
Using cached https://download.pytorch.org/whl/certifi-2022.12.7-py3-none-any.whl (155 kB)
Collecting idna<4,>=2.5
Using cached https://download.pytorch.org/whl/idna-3.4-py3-none-any.whl (61 kB)
Collecting charset-normalizer<3,>=2
Using cached https://download.pytorch.org/whl/charset_normalizer-2.1.1-py3-none-any.whl (39 kB)
Collecting mpmath>=0.19
Using cached https://download.pytorch.org/whl/mpmath-1.3.0-py3-none-any.whl (536 kB)
Installing collected packages: mpmath, urllib3, typing-extensions, sympy, pillow, numpy, networkx, MarkupSafe, idna, fsspec, filelock, charset-normalizer, certifi, requests, jinja2, torch, torchvision, torchaudio
Successfully installed MarkupSafe-2.1.3 certifi-2022.12.7 charset-normalizer-2.1.1 filelock-3.9.0 fsspec-2023.4.0 idna-3.4 jinja2-3.1.2 mpmath-1.3.0 networkx-3.2.1 numpy-1.26.3 pillow-10.2.0 requests-2.28.1 sympy-1.12 torch-2.2.0+cu118 torchaudio-2.2.0+cu118 torchvision-0.17.0+cu118 typing-extensions-4.8.0 urllib3-1.26.13

(moondr) D:\AI\mdr\moondr\Scripts>cd..

(moondr) D:\AI\mdr\moondr>cd git

(moondr) D:\AI\mdr\moondr\git>pip install -r requirements.txt
Collecting accelerate==0.25.0
Using cached accelerate-0.25.0-py3-none-any.whl (265 kB)
Collecting huggingface-hub==0.20.1
Using cached huggingface_hub-0.20.1-py3-none-any.whl (330 kB)
Collecting Pillow==10.1.0
Using cached Pillow-10.1.0-cp310-cp310-win_amd64.whl (2.6 MB)
Collecting torch==2.1.2
Using cached torch-2.1.2-cp310-cp310-win_amd64.whl (192.3 MB)
Collecting torchvision==0.16.2
Using cached torchvision-0.16.2-cp310-cp310-win_amd64.whl (1.1 MB)
Collecting transformers==4.36.2
Using cached transformers-4.36.2-py3-none-any.whl (8.2 MB)
Collecting einops==0.7.0
Using cached einops-0.7.0-py3-none-any.whl (44 kB)
Collecting gradio==4.15.0
Using cached gradio-4.15.0-py3-none-any.whl (16.6 MB)
Collecting timm==0.9.12
Using cached timm-0.9.12-py3-none-any.whl (2.2 MB)
Collecting safetensors>=0.3.1
Using cached safetensors-0.4.2-cp310-none-win_amd64.whl (269 kB)
Requirement already satisfied: numpy>=1.17 in d:\ai\mdr\moondr\lib\site-packages (from accelerate==0.25.0->-r requirements.txt (line 1)) (1.26.3)
Collecting psutil
Using cached psutil-5.9.8-cp37-abi3-win_amd64.whl (255 kB)
Collecting pyyaml
Using cached PyYAML-6.0.1-cp310-cp310-win_amd64.whl (145 kB)
Collecting packaging>=20.0
Using cached packaging-23.2-py3-none-any.whl (53 kB)
Collecting tqdm>=4.42.1
Using cached tqdm-4.66.1-py3-none-any.whl (78 kB)
Requirement already satisfied: requests in d:\ai\mdr\moondr\lib\site-packages (from huggingface-hub==0.20.1->-r requirements.txt (line 2)) (2.28.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in d:\ai\mdr\moondr\lib\site-packages (from huggingface-hub==0.20.1->-r requirements.txt (line 2)) (4.8.0)
Requirement already satisfied: filelock in d:\ai\mdr\moondr\lib\site-packages (from huggingface-hub==0.20.1->-r requirements.txt (line 2)) (3.9.0)
Collecting fsspec>=2023.5.0
Using cached fsspec-2023.12.2-py3-none-any.whl (168 kB)
Requirement already satisfied: sympy in d:\ai\mdr\moondr\lib\site-packages (from torch==2.1.2->-r requirements.txt (line 4)) (1.12)
Requirement already satisfied: jinja2 in d:\ai\mdr\moondr\lib\site-packages (from torch==2.1.2->-r requirements.txt (line 4)) (3.1.2)
Requirement already satisfied: networkx in d:\ai\mdr\moondr\lib\site-packages (from torch==2.1.2->-r requirements.txt (line 4)) (3.2.1)
Collecting tokenizers<0.19,>=0.14
Using cached tokenizers-0.15.1-cp310-none-win_amd64.whl (2.2 MB)
Collecting regex!=2019.12.17
Using cached regex-2023.12.25-cp310-cp310-win_amd64.whl (269 kB)
Requirement already satisfied: markupsafe~=2.0 in d:\ai\mdr\moondr\lib\site-packages (from gradio==4.15.0->-r requirements.txt (line 8)) (2.1.3)
Collecting orjson~=3.0
Using cached orjson-3.9.12-cp310-none-win_amd64.whl (134 kB)
Collecting ffmpy
Using cached ffmpy-0.3.1-py3-none-any.whl
Collecting python-multipart
Using cached python_multipart-0.0.6-py3-none-any.whl (45 kB)
Collecting typer[all]<1.0,>=0.9
Using cached typer-0.9.0-py3-none-any.whl (45 kB)
Collecting altair<6.0,>=4.2.0
Using cached altair-5.2.0-py3-none-any.whl (996 kB)
Collecting ruff>=0.1.7
Downloading ruff-0.1.15-py3-none-win_amd64.whl (7.3 MB)
---------------------------------------- 7.3/7.3 MB 10.0 MB/s eta 0:00:00
Collecting httpx
Using cached httpx-0.26.0-py3-none-any.whl (75 kB)
Collecting pandas<3.0,>=1.0
Using cached pandas-2.2.0-cp310-cp310-win_amd64.whl (11.6 MB)
Collecting pydantic>=2.0
Using cached pydantic-2.6.0-py3-none-any.whl (394 kB)
Collecting importlib-resources<7.0,>=1.3
Using cached importlib_resources-6.1.1-py3-none-any.whl (33 kB)
Collecting fastapi
Using cached fastapi-0.109.0-py3-none-any.whl (92 kB)
Collecting gradio-client==0.8.1
Using cached gradio_client-0.8.1-py3-none-any.whl (305 kB)
Collecting matplotlib~=3.0
Using cached matplotlib-3.8.2-cp310-cp310-win_amd64.whl (7.6 MB)
Collecting tomlkit==0.12.0
Using cached tomlkit-0.12.0-py3-none-any.whl (37 kB)
Collecting aiofiles<24.0,>=22.0
Using cached aiofiles-23.2.1-py3-none-any.whl (15 kB)
Collecting pydub
Using cached pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Collecting uvicorn>=0.14.0
Downloading uvicorn-0.27.0.post1-py3-none-any.whl (60 kB)
---------------------------------------- 60.7/60.7 kB 3.2 MB/s eta 0:00:00
Collecting semantic-version~=2.0
Using cached semantic_version-2.10.0-py2.py3-none-any.whl (15 kB)
Collecting websockets<12.0,>=10.0
Using cached websockets-11.0.3-cp310-cp310-win_amd64.whl (124 kB)
Collecting toolz
Using cached toolz-0.12.1-py3-none-any.whl (56 kB)
Collecting jsonschema>=3.0
Using cached jsonschema-4.21.1-py3-none-any.whl (85 kB)
Collecting fonttools>=4.22.0
Using cached fonttools-4.47.2-cp310-cp310-win_amd64.whl (2.2 MB)
Collecting pyparsing>=2.3.1
Using cached pyparsing-3.1.1-py3-none-any.whl (103 kB)
Collecting contourpy>=1.0.1
Using cached contourpy-1.2.0-cp310-cp310-win_amd64.whl (186 kB)
Collecting cycler>=0.10
Using cached cycler-0.12.1-py3-none-any.whl (8.3 kB)
Collecting python-dateutil>=2.7
Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting kiwisolver>=1.3.1
Using cached kiwisolver-1.4.5-cp310-cp310-win_amd64.whl (56 kB)
Collecting pytz>=2020.1
Using cached pytz-2023.4-py2.py3-none-any.whl (506 kB)
Collecting tzdata>=2022.7
Using cached tzdata-2023.4-py2.py3-none-any.whl (346 kB)
Collecting pydantic-core==2.16.1
Using cached pydantic_core-2.16.1-cp310-none-win_amd64.whl (1.9 MB)
Collecting annotated-types>=0.4.0
Using cached annotated_types-0.6.0-py3-none-any.whl (12 kB)
Collecting colorama
Using cached colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting click<9.0.0,>=7.1.1
Using cached click-8.1.7-py3-none-any.whl (97 kB)
Collecting shellingham<2.0.0,>=1.3.0
Using cached shellingham-1.5.4-py2.py3-none-any.whl (9.8 kB)
Collecting rich<14.0.0,>=10.11.0
Using cached rich-13.7.0-py3-none-any.whl (240 kB)
Collecting h11>=0.8
Using cached h11-0.14.0-py3-none-any.whl (58 kB)
Collecting starlette<0.36.0,>=0.35.0
Using cached starlette-0.35.1-py3-none-any.whl (71 kB)
Requirement already satisfied: certifi in d:\ai\mdr\moondr\lib\site-packages (from httpx->gradio==4.15.0->-r requirements.txt (line 8)) (2022.12.7)
Requirement already satisfied: idna in d:\ai\mdr\moondr\lib\site-packages (from httpx->gradio==4.15.0->-r requirements.txt (line 8)) (3.4)
Collecting anyio
Using cached anyio-4.2.0-py3-none-any.whl (85 kB)
Collecting sniffio
Using cached sniffio-1.3.0-py3-none-any.whl (10 kB)
Collecting httpcore==1.*
Using cached httpcore-1.0.2-py3-none-any.whl (76 kB)
Requirement already satisfied: charset-normalizer<3,>=2 in d:\ai\mdr\moondr\lib\site-packages (from requests->huggingface-hub==0.20.1->-r requirements.txt (line 2)) (2.1.1)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in d:\ai\mdr\moondr\lib\site-packages (from requests->huggingface-hub==0.20.1->-r requirements.txt (line 2)) (1.26.13)
Requirement already satisfied: mpmath>=0.19 in d:\ai\mdr\moondr\lib\site-packages (from sympy->torch==2.1.2->-r requirements.txt (line 4)) (1.3.0)
Collecting rpds-py>=0.7.1
Using cached rpds_py-0.17.1-cp310-none-win_amd64.whl (205 kB)
Collecting jsonschema-specifications>=2023.03.6
Using cached jsonschema_specifications-2023.12.1-py3-none-any.whl (18 kB)
Collecting referencing>=0.28.4
Downloading referencing-0.33.0-py3-none-any.whl (26 kB)
Collecting attrs>=22.2.0
Using cached attrs-23.2.0-py3-none-any.whl (60 kB)
Collecting six>=1.5
Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting markdown-it-py>=2.2.0
Using cached markdown_it_py-3.0.0-py3-none-any.whl (87 kB)
Collecting pygments<3.0.0,>=2.13.0
Using cached pygments-2.17.2-py3-none-any.whl (1.2 MB)
Collecting exceptiongroup>=1.0.2
Using cached exceptiongroup-1.2.0-py3-none-any.whl (16 kB)
Collecting mdurl~=0.1
Using cached mdurl-0.1.2-py3-none-any.whl (10.0 kB)
Installing collected packages: pytz, pydub, ffmpy, websockets, tzdata, toolz, tomlkit, sniffio, six, shellingham, semantic-version, safetensors, ruff, rpds-py, regex, pyyaml, python-multipart, pyparsing, pygments, pydantic-core, psutil, Pillow, packaging, orjson, mdurl, kiwisolver, importlib-resources, h11, fsspec, fonttools, exceptiongroup, einops, cycler, contourpy, colorama, attrs, annotated-types, aiofiles, tqdm, torch, referencing, python-dateutil, pydantic, markdown-it-py, httpcore, click, anyio, uvicorn, typer, torchvision, starlette, rich, pandas, matplotlib, jsonschema-specifications, huggingface-hub, httpx, tokenizers, timm, jsonschema, gradio-client, fastapi, accelerate, transformers, altair, gradio
Attempting uninstall: Pillow
Found existing installation: pillow 10.2.0
Uninstalling pillow-10.2.0:
Successfully uninstalled pillow-10.2.0
Attempting uninstall: fsspec
Found existing installation: fsspec 2023.4.0
Uninstalling fsspec-2023.4.0:
Successfully uninstalled fsspec-2023.4.0
Attempting uninstall: torch
Found existing installation: torch 2.2.0+cu118
Uninstalling torch-2.2.0+cu118:
Successfully uninstalled torch-2.2.0+cu118
Attempting uninstall: torchvision
Found existing installation: torchvision 0.17.0+cu118
Uninstalling torchvision-0.17.0+cu118:
Successfully uninstalled torchvision-0.17.0+cu118
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.2.0+cu118 requires torch==2.2.0+cu118, but you have torch 2.1.2 which is incompatible.
Successfully installed Pillow-10.1.0 accelerate-0.25.0 aiofiles-23.2.1 altair-5.2.0 annotated-types-0.6.0 anyio-4.2.0 attrs-23.2.0 click-8.1.7 colorama-0.4.6 contourpy-1.2.0 cycler-0.12.1 einops-0.7.0 exceptiongroup-1.2.0 fastapi-0.109.0 ffmpy-0.3.1 fonttools-4.47.2 fsspec-2023.12.2 gradio-4.15.0 gradio-client-0.8.1 h11-0.14.0 httpcore-1.0.2 httpx-0.26.0 huggingface-hub-0.20.1 importlib-resources-6.1.1 jsonschema-4.21.1 jsonschema-specifications-2023.12.1 kiwisolver-1.4.5 markdown-it-py-3.0.0 matplotlib-3.8.2 mdurl-0.1.2 orjson-3.9.12 packaging-23.2 pandas-2.2.0 psutil-5.9.8 pydantic-2.6.0 pydantic-core-2.16.1 pydub-0.25.1 pygments-2.17.2 pyparsing-3.1.1 python-dateutil-2.8.2 python-multipart-0.0.6 pytz-2023.4 pyyaml-6.0.1 referencing-0.33.0 regex-2023.12.25 rich-13.7.0 rpds-py-0.17.1 ruff-0.1.15 safetensors-0.4.2 semantic-version-2.10.0 shellingham-1.5.4 six-1.16.0 sniffio-1.3.0 starlette-0.35.1 timm-0.9.12 tokenizers-0.15.1 tomlkit-0.12.0 toolz-0.12.1 torch-2.1.2 torchvision-0.16.2 tqdm-4.66.1 transformers-4.36.2 typer-0.9.0 tzdata-2023.4 uvicorn-0.27.0.post1 websockets-11.0.3

[notice] A new release of pip is available: 23.0.1 -> 23.3.2
[notice] To update, run: python.exe -m pip install --upgrade pip

(moondr) D:\AI\mdr\moondr\git>python gradio_demo.py
Traceback (most recent call last):
File "D:\AI\mdr\moondr\git\gradio_demo.py", line 4, in
from moondream import Moondream, detect_device
ModuleNotFoundError: No module named 'moondream'

encode_image() broken after latest commits

Ubuntu 22.04, gtx 2080 and m40 both result in an all NaN tensor when running the encode_image() function.

device, dtype = detect_device()
model_id = "vikhyatk/moondream1"
tokenizer = Tokenizer.from_pretrained(model_id)
moondream = Moondream.from_pretrained(model_id).to(device=device, dtype=dtype)
moondream.eval()
image = Image.open("/test.png")
image_embeds = moondream.encode_image(image)

A temporary fix is :

Moondream.from_pretrained(model_id, revision="1e62d51745be03c0d3a5c582afcada6c1f98f454")

No Errors were printed.
packages are according to the requirements.txt .

Tensor size mismatch

The expanded size of the tensor (754) must match the existing size (755) at non-singleton dimension 1. Target sizes: [1, 754]. Tensor sizes: [1, 755]

        tokenizer = Tokenizer.from_pretrained(model_id)
        moondream = Moondream.from_pretrained(model_id).to(
            device=device, dtype=dtype
        )

...

async def vision(
    image: UploadFile = File(...),
    prompt: str = "Describe the image in a few words.",
):
    try:
        img = Image.open(io.BytesIO(image.file.read()))
        moondream: Moondream = plugin.resources["moondream"]
        tokenizer: Tokenizer = plugin.resources["tokenizer"]
        print("Encoding image...")
        image_embeds = moondream.encode_image(img)
        print("Getting response...")
        response = moondream.answer_question(image_embeds, prompt, tokenizer)

        ...

It gets through the image encoding step but fails on the response. I have tried this with lots of images, mostly 512x512 or 786x786 but also some other odd sizes.

Please Add a Documentation Page

I think it would nice to checkout how the code works, adding custom models, etc...

How do I uninstall and remove the model it downloaded

How do I uninstall and remove the model it downloaded as it was installed on a raspberry pi and it took all my space lol please help undo the install esp the models i cant find where the model files downloaded to iv tried using find and gui search tool aswell i need the install directories the default folders for the following files install to (i dont mind some of the obvious stuff im concerned wiuth the hard to find big model files .safesearch files for example there like 5 gb and 2 gb

(myenv) jay@jnetai:~/Documents/Scripts/AI/MoonDream $ python sample.py --image /home/jay/Documents/Scripts/AI/MoonDream/Brick.jpg --prompt "what is this a picture of"
tokenizer_config.json: 100%|███████████████| 7.34k/7.34k [00:00<00:00, 29.1MB/s]
vocab.json: 100%|████████████████████████████| 798k/798k [00:00<00:00, 10.4MB/s]
merges.txt: 100%|████████████████████████████| 456k/456k [00:00<00:00, 2.04MB/s]
tokenizer.json: 100%|██████████████████████| 2.11M/2.11M [00:00<00:00, 6.75MB/s]
added_tokens.json: 100%|███████████████████| 1.08k/1.08k [00:00<00:00, 6.85MB/s]
special_tokens_map.json: 100%|█████████████████| 441/441 [00:00<00:00, 2.64MB/s]
config.json: 100%|█████████████████████████████| 323/323 [00:00<00:00, 2.17MB/s]
model.safetensors.index.json: 100%|█████████| 59.0k/59.0k [00:00<00:00, 804kB/s]
model-00001-of-00002.safetensors: 100%|████| 5.00G/5.00G [03:14<00:00, 25.7MB/s]
model-00002-of-00002.safetensors: 100%|████| 2.57G/2.57G [01:59<00:00, 21.5MB/s]
Downloading shards: 100%|████████████████████████| 2/2 [05:14<00:00, 157.14s/it]
Loading checkpoint shards: 50%|████████▌ | 1/2 [01:56<01:56, 116.48s/it]Killed

Multiple images

Moondream2 is incredible! Is it possible to support multiple images? Thanks!

Guide me to Run it locally, please.

I found out about this model through comfy ui, the captioning of this model blew me.
I want to run this model with my python script and without a comfyui interface.
how can I go about it?

Things I have tried:

cloned this repository and runned sample.py
It throws the following error

Please consider creating a quantized version or even better a CoreML model

Hi. Thanks for this great small model! Is it in your plans to provide a ggml version? A coreML version?

exclamation response !!!!!!!!

Thanks for this work, it was working great on my local machine until today. I don't know what I changed*, I only get a string of exclamation signs as response to any question that I ask. Any hints?
* conda env is the same. !!! happens regardless cpu or cuda, or the image resolution.
* After some debugging I see that the image embedding has all nans

License

Hi,
What's the license for Moondream?
Thanks!

Try openchat LLM instead of Phi1.5

Its a truly amazing model but Hi I was disappointed by the instructability of the phi1.5 element of your model.
For instance, if asked to ignore a particular object in a photo it doesn't follow this and also when asked to write a certain number of words doesn't do this reliably. when asked to start a new paragraph does not do this. Doesnt respond well to text later in the prompt.

I recently reviewed a lot of models looking for one for a chatbot.

Nous-Hermes-2-SOLAR-10.7B
https://huggingface.co/NousResearch/Nous-Hermes-2-SOLAR-10.7B

OpenChat3.5(0106)
https://huggingface.co/openchat/openchat-3.5-0106
https://github.com/imoneoi/openchat

are much more instructible and urge you to look at these for connecting your visual encoder.

Killed

Hello, I keep on getting this when i run the model

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Killed

How can I solve this?

Memory requirements

Hi,

I'm trying to run this model in a Google Colab but it is not working due to an OOM issue. Do you know how much memory is required to run this model?

I'm trying to run the following notebook:
https://colab.research.google.com/drive/1kghOrIzykPePCUeeZQyo9r5pxfx3V3lo?usp=sharing

Does not understands questions

It does not understand questions like where is this image taken or what meal is this

Please add Hardware Requirements

Hello,

Hardware is maybe for people important. Please consider to make a section in Readme.

Move inference code into Huggingface

Having to clone this repository to run inference is unnecessary friction - if we move it into the Huggingface repository users will be able to use it directly via the transformers library.

running sample.py fails, quick fix was to pass tokenizer as third param

Hi,
running sample.py fails, quick fix was to pass tokenizer as third param
answer = moondream.answer_question(image_embeds, prompt,tokenizer)

moondream/sample.py

Line 76 in de750a6

answer = moondream.answer_question(image_embeds, prompt)

thanks
Senthil

Bulk prompting

Hey, it looks like this could be used for captioning images, so i thought if you could maybe add a function to caption images or adjust the prompting so it can be done in bulk and the output can be saved to files?

How do we use video_gradio_demo.py ??

I've tried several times to use video_gradio_demo.py

Each time, the server seems to launch but when I browse the page, processing is counting up, and the upload button never becomes available while an icon continues to spin in its place.

Not sure how to use it.

LM Studio / Llamafile / ollama integration

Integrating the model with LM Studio and Llamafile would make it more accessible for users.

Memory/continuity?

Sorry if this is a silly question, but is it possible for the model to somehow keep a memory of the previous images it got? To put it simply, can i give it different frames from a single video and it would answer questions understanding that it is a single video?

GGML/GGUF?

StableLM

Hi,
Is this model based on StableLM?
Thanks!

Better support for GPU and Flash Attention during inference

The inference code provided in this repository forces moondream to run on CPU. We should allow the user to leverage GPUs and Flash Attention for faster inference if they want to.

Missing tokenizer parameter

The sample.py code complain about not giving "tokenizer" as an parameter so please add it.

else:
print(">", prompt)
answer = moondream.answer_question(image_embeds, prompt, tokenizer)
print(answer)

PILLOW out of date

I'm getting errors relating to PILLOW, it seems you are using an old version as PILLOW_VERSION has been discontinued.

python sample.py 
Traceback (most recent call last):
  File "/home/chris/ai/moondream/sample.py", line 1, in <module>
    from moondream import VisionEncoder, TextModel
  File "/home/chris/ai/moondream/moondream/__init__.py", line 1, in <module>
    from .vision_encoder import VisionEncoder
  File "/home/chris/ai/moondream/moondream/vision_encoder.py", line 4, in <module>
    from torchvision.transforms.v2 import (
  File "/home/chris/anaconda3/envs/whisper/lib/python3.12/site-packages/torchvision/__init__.py", line 2, in <module>
    from torchvision import datasets
  File "/home/chris/anaconda3/envs/whisper/lib/python3.12/site-packages/torchvision/datasets/__init__.py", line 9, in <module>
    from .fakedata import FakeData
  File "/home/chris/anaconda3/envs/whisper/lib/python3.12/site-packages/torchvision/datasets/fakedata.py", line 3, in <module>
    from .. import transforms
  File "/home/chris/anaconda3/envs/whisper/lib/python3.12/site-packages/torchvision/transforms/__init__.py", line 1, in <module>
    from .transforms import *
  File "/home/chris/anaconda3/envs/whisper/lib/python3.12/site-packages/torchvision/transforms/transforms.py", line 17, in <module>
    from . import functional as F
  File "/home/chris/anaconda3/envs/whisper/lib/python3.12/site-packages/torchvision/transforms/functional.py", line 5, in <module>
    from PIL import Image, ImageOps, ImageEnhance, PILLOW_VERSION
ImportError: cannot import name 'PILLOW_VERSION' from 'PIL' (/home/chris/anaconda3/envs/whisper/lib/python3.12/site-packages/PIL/__init__.py)

Please add a custom model folder

I think it would nice for users to user their custom text models beside the default ones

How to fine-tune it?

Any fine-tuning example notebooks?

Torch + cuda on windows 11 with last nvidia drivers + last cuda toolkit

Fyi

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121

Change device_map to cuda on text_model.py file.

Change cpu

class TextModel:
    def __init__(self, model_path: str = "model") -> None:
        super().__init__()
        self.tokenizer = Tokenizer.from_pretrained(f"{model_path}/tokenizer")
        phi_config = PhiConfig.from_pretrained(f"{model_path}/text_model_cfg.json")

        with init_empty_weights():
            self.model = PhiForCausalLM(phi_config)

        self.model = load_checkpoint_and_dispatch(
            self.model,
            f"{model_path}/text_model.pt",
            device_map={"": "cpu"},
        )

To cuda

class TextModel:
    def __init__(self, model_path: str = "model") -> None:
        super().__init__()
        self.tokenizer = Tokenizer.from_pretrained(f"{model_path}/tokenizer")
        phi_config = PhiConfig.from_pretrained(f"{model_path}/text_model_cfg.json")

        with init_empty_weights():
            self.model = PhiForCausalLM(phi_config)

        self.model = load_checkpoint_and_dispatch(
            self.model,
            f"{model_path}/text_model.pt",
            device_map={"": "cuda"},
        )

Thanks for sharing your amazing work!