
mlx-vlm's Introduction

MLX-VLM

MLX-VLM is a package for running Vision LLMs on your Mac using MLX.

Get started

The easiest way to get started is to install the mlx-vlm package:

With pip:

pip install mlx-vlm

Inference

CLI

python -m mlx_vlm.generate --model qnguyen3/nanoLLaVA --max-tokens 100 --temp 0.0

Chat UI with Gradio

python -m mlx_vlm.chat_ui --model qnguyen3/nanoLLaVA

Script

import mlx.core as mx
from mlx_vlm import load, generate

model_path = "mlx-community/llava-1.5-7b-4bit"
model, processor = load(model_path)

prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": f"<image>\nWhat are these?"}],
    tokenize=False,
    add_generation_prompt=True,
)

output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)

mlx-vlm's People

Contributors

blaizzy, gabewillen, iamshubhamgupto, josefalbers


mlx-vlm's Issues

[Feature Request] Only generate output when --verbose False

import subprocess
import sys

command = [
    sys.executable, 
    '-m', 'mlx_vlm.generate',
    '--model', 'qnguyen3/nanoLLaVA',
    '--max-tokens', '100',
    '--temp', '0.0',
    '--image', "http://images.cocodataset.org/val2017/000000039769.jpg",
]

result = subprocess.run(command, capture_output=True, text=True)
caption = result.stdout
print(caption)

Currently:
The command prints the input (including the chat template), the output, and the generation speeds.

Proposed:
With --verbose False, print only the generated output to the terminal, which makes the result easier to consume for further use.

[Feature Request] Direct Python Interface for mlx_vlm.generate

Currently, the mlx_vlm.generate function can only be called from the command line using python -m mlx_vlm.generate. I would like to request a direct Python interface for this function, allowing me to call it from my Python code without having to use the command line.

The desired API would be similar to the following:

from mlx_vlm import generate # Maybe add a load function just as in mlx_lm.generate?

image = "http://images.cocodataset.org/val2017/000000039769.jpg"
caption = generate(model='qnguyen3/nanoLLaVA',
                image=image,
                # processor = Automatically determined by the model choice
                # image_processor = Automatically determined by the model choice
                prompt="Describe this image.",
                temp=0.0,
                max_tokens=100,
                verbose=False,
                formatter=None,
                repetition_penalty=None,
                repetition_context_size=None,
                top_p=1
                )

This would allow me to easily integrate the mlx_vlm.generate function into my Python code without calling a subprocess, and use it to generate captions for images programmatically.
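For what it's worth, the README script near the top of this page already exposes a Python-level API through load and generate. A hedged sketch of how that covers this request follows; the model id and prompt are illustrative, the keyword arguments mirror usage seen elsewhere on this page, and where the chat template lives (processor vs. its tokenizer) varies by model and version.

from mlx_vlm import load, generate

# Sketch only: model id and prompt are illustrative; generate's keyword
# arguments (max_tokens, temp, verbose) follow the usage shown elsewhere on
# this page and may differ between mlx-vlm versions.
model, processor = load("qnguyen3/nanoLLaVA")

prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": "<image>\nDescribe this image."}],
    tokenize=False,
    add_generation_prompt=True,
)

caption = generate(
    model,
    processor,
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    prompt,
    max_tokens=100,
    temp=0.0,
    verbose=False,
)
print(caption)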

ValueError: The number of images in the text [3] and images [1] should be the same.

I'm getting this error: ValueError: The number of images in the text [3] and images [1] should be the same.

The first image I drag in, together with a prompt, works. When I try another image I get that error, and I have to clear the page to get it to work again.

Trying idefics2

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gradio/queueing.py", line 527, in process_events
    response = await route_utils.call_process_api(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gradio/route_utils.py", line 270, in call_process_api
    output = await app.get_blocks().process_api(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gradio/blocks.py", line 1847, in process_api
    result = await self.call_function(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gradio/blocks.py", line 1445, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gradio/utils.py", line 629, in async_iteration
    return await iterator.__anext__()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gradio/utils.py", line 755, in asyncgen_wrapper
    response = await iterator.__anext__()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gradio/chat_interface.py", line 551, in _stream_fn
    first_response = await async_iteration(generator)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gradio/utils.py", line 629, in async_iteration
    return await iterator.__anext__()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gradio/utils.py", line 622, in __anext__
    return await anyio.to_thread.run_sync(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gradio/utils.py", line 605, in run_sync_iterator_async
    return next(iterator)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/mlx_vlm/chat_ui.py", line 116, in chat
    for chunk in generate(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/mlx_vlm/chat_ui.py", line 57, in generate
    input_ids, pixel_values = prepare_inputs(image_processor, processor, image, prompt)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/mlx_vlm/utils.py", line 636, in prepare_inputs
    inputs = processor(prompt, image, return_tensors="np")
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/transformers/models/idefics2/processing_idefics2.py", line 225, in __call__
    raise ValueError(
ValueError: The number of images in the text [3] and images [1] should be the same.
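The exception comes from the Idefics2 processor finding more image placeholders in the assembled text (here 3, presumably accumulated from the chat history) than images actually passed in (1). A minimal guard, purely as a sketch; the variable names are illustrative and the literal <image> marker is an assumption about what this processor counts.

# Hypothetical pre-check before calling the processor.
num_placeholders = prompt.count("<image>")
num_images = 1  # number of images actually supplied alongside the prompt

if num_placeholders != num_images:
    raise ValueError(
        f"Prompt contains {num_placeholders} <image> placeholders "
        f"but {num_images} image(s) were supplied."
    )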

LLaVA 1.6 is quite good, but could be better

I have now managed to get LLAVA 1.6 running nicely on my MacBook Pro (M3 Max 48Gb).

import mlx.core as mx
from mlx_vlm import load, generate

import os
from pathlib import Path

# model_path = "mlx-community/llava-1.5-7b-4bit"
model_path = "mlx-community/llava-v1.6-mistral-7b-8bit"
#model_path = "mlx-community/llava-v1.6-34b-8bit"
model, processor = load(model_path)

prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": f"<image>\nProvide a formal caption and keywords for this image, suitable for microstock"}],
    tokenize=False,
    add_generation_prompt=True,
)

picpath = "/Users/home/Pictures/Processed"
pics = sorted(Path(picpath).iterdir(), key=os.path.getmtime, reverse=True)
pic = str(pics[0])
print(pic)


#output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=True)
output = generate(model, processor, pic, prompt, max_tokens=200, verbose=True)

print(output)

Interestingly, the 34B and 7B+mistral models seem to produce almost the same result. As the former is very much slower than the latter, you can guess which I use.

The generated keywords are limited; expanding the number of tokens you ask it to generate just leads to repetition of the keywords rather than the model stopping.

If I try to generate keywords using Google services instead, this approach produces more relevant keywords, including the mood of the picture, whereas the Google approach generates keywords such as "sky" or "vehicle registration", which are technically correct but not generally what the photo is about.

The development of models seems to be quite rapid at present. Are any of the newer models better than LLaVA 1.6?

For example, do any models use GPS coordinates, or keywords that are already added? (I usually include location data on photo ingestion).
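As a side note on the snippet above, a hedged extension that captions every file in the folder instead of only the newest one, reusing the model, processor, prompt and pics objects already defined; files the processor cannot open are simply skipped.

for pic in pics:
    try:
        result = generate(model, processor, str(pic), prompt, max_tokens=200, verbose=False)
        print(pic.name, "->", result)
    except Exception as exc:
        # Skip sidecar or non-image files that the processor cannot open.
        print(f"Skipping {pic}: {exc}")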

Example in Readme doesn't work

mlx 0.13.1
mlx-lm 0.13.1
mlx-vlm 0.0.5

import mlx.core as mx
from mlx_vlm import load, generate

model_path = "mlx-community/llava-1.5-7b-4bit"
model, processor = load(model_path)

prompt = processor.apply_chat_template(
    [{"role": "user", "content": f"<image>\nWhat are these?"}],
    tokenize=False,
    add_generation_prompt=True,
)

output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)

Got error:


AttributeError Traceback (most recent call last)
Cell In[44], line 7
4 model_path = "mlx-community/llava-1.5-7b-4bit"
5 model, processor = load(model_path)
----> 7 prompt = processor.apply_chat_template(
8 [{"role": "user", "content": f"\nWhat are these?"}],
9 tokenize=False,
10 add_generation_prompt=True,
11 )
13 output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)

AttributeError: 'LlavaProcessor' object has no attribute 'apply_chat_template'
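A possible workaround, assuming (as in the README script higher up on this page) that with this transformers version the chat template lives on the wrapped tokenizer rather than on LlavaProcessor itself:

# Fall back to the wrapped tokenizer when the processor has no
# apply_chat_template; both attribute layouts appear in issues on this page.
tokenizer = getattr(processor, "tokenizer", processor)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "<image>\nWhat are these?"}],
    tokenize=False,
    add_generation_prompt=True,
)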

FileNotFoundError: No safetensors found

Hello

I was trying out the mlx-vlm package and was able to run the default example here

However, replacing the model card with Efficient-Large-Model/VILA-13b-4bit-awq, it fails. Here's the stack trace:

(mlx) shubham@Shubhams-MBP ~ % python -m mlx_vlm.generate --model Efficient-Large-Model/VILA-13b-4bit-awq \
--prompt "what are these?" --image "http://images.cocodataset.org/val2017/000000039769.jpg" \
--max-tokens 100 --temp 0.0

Fetching 7 files: 100%|████████████████████████| 7/7 [00:00<00:00, 14761.25it/s]
ERROR:root:No safetensors found in /Users/shubham/.cache/huggingface/hub/models--Efficient-Large-Model--VILA-13b-4bit-awq/snapshots/ab335be6c2a5b6a08d4784491bf270fe0ce7a41d
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/shubham/anaconda3/envs/mlx/lib/python3.12/site-packages/mlx_vlm/generate.py", line 107, in <module>
    main()
  File "/Users/shubham/anaconda3/envs/mlx/lib/python3.12/site-packages/mlx_vlm/generate.py", line 69, in main
    model, processor, image_processor, config = get_model_and_processors(args.model)
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shubham/anaconda3/envs/mlx/lib/python3.12/site-packages/mlx_vlm/generate.py", line 55, in get_model_and_processors
    model, processor = load(model_path, {"trust_remote_code": True})
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shubham/anaconda3/envs/mlx/lib/python3.12/site-packages/mlx_vlm/utils.py", line 212, in load
    model = load_model(model_path, lazy)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shubham/anaconda3/envs/mlx/lib/python3.12/site-packages/mlx_vlm/utils.py", line 123, in load_model
    raise FileNotFoundError(f"No safetensors found in {model_path}")
FileNotFoundError: No safetensors found in /Users/shubham/.cache/huggingface/hub/models--Efficient-Large-Model--VILA-13b-4bit-awq/snapshots/ab335be6c2a5b6a08d4784491bf270fe0ce7a41d
(mlx) shubham@Shubhams-MBP ~ %

I am using an M1 MacBook Pro. I'm new to mlx-vlm but I'm happy to work on this to add support.
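A quick, hedged way to confirm what the error reports is to list the weight files in the cached snapshot. The repo id is taken from the command above; the huggingface_hub call is an assumption about the environment, and per the log the files are already cached, so this should not re-download anything.

from pathlib import Path
from huggingface_hub import snapshot_download

# List the cached snapshot and check whether any *.safetensors shards exist,
# which is what mlx_vlm's load_model requires per the error above.
snapshot = Path(snapshot_download("Efficient-Large-Model/VILA-13b-4bit-awq"))
for path in sorted(snapshot.iterdir()):
    print(path.name)
print("safetensors present:", any(snapshot.glob("*.safetensors")))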

LLaVA documentation?

Running the script on the front page, I get:

config.json: 100%|███████████████████████████████████| 1.13k/1.13k [00:00<00:00, 4.87MB/s]
added_tokens.json: 100%|████████████████████████████████| 41.0/41.0 [00:00<00:00, 131kB/s]
special_tokens_map.json: 100%|███████████████████████████| 552/552 [00:00<00:00, 5.37MB/s]
preprocessor_config.json: 100%|██████████████████████████| 819/819 [00:00<00:00, 9.76MB/s]
model.safetensors.index.json: 100%|████████████████████| 129k/129k [00:00<00:00, 1.53MB/s]
tokenizer_config.json: 100%|█████████████████████████| 1.31k/1.31k [00:00<00:00, 9.20MB/s]
tokenizer.model: 100%|█████████████████████████████████| 500k/500k [00:00<00:00, 6.96MB/s]
tokenizer.json: 100%|████████████████████████████████| 1.84M/1.84M [00:00<00:00, 4.65MB/s]
model.safetensors: 100%|█████████████████████████████| 3.98G/3.98G [06:43<00:00, 9.85MB/s]
Fetching 9 files: 100%|█████████████████████████████████████| 9/9 [06:44<00:00, 44.89s/it]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.

Is this OK??

I assume that to run the LLaVA v1.6 models I just look at the versions available on Hugging Face. Are there any further details on how much memory is required to run the 7B vs. 34B variants, and how much better the 8-bit vs. 4-bit quantizations are, please?
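On the memory question, a rough, hedged rule of thumb is that quantized weights take about parameter count times bits per weight divided by 8 bytes, ignoring activations, the KV cache and the vision tower; the figures below are nominal model sizes, not measurements.

def approx_weight_gb(params_billion: float, bits: int) -> float:
    # Weights only: parameters * bits per weight / 8 bits per byte.
    return params_billion * 1e9 * bits / 8 / 1e9

for name, params in [("7B", 7), ("34B", 34)]:
    for bits in (4, 8):
        print(f"{name} @ {bits}-bit ~= {approx_weight_gb(params, bits):.1f} GB")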

Chat template error with MLX Community LLava models (moved from FastMLX)

Continued from: Blaizzy/fastmlx#6


When I tried this at the command line: "python -m mlx_vlm.chat_ui --model mlx-community/llava-1.5-7b-4bit", I get the same chat template errors with all of the following:

models--mlx-community--llava-1.5-7b-4bit
models--mlx-community--llava-llama-3-8b-v1_1-8bit
models--mlx-community--llava-phi-3-mini-4bit
models--mlx-community--llava-v1.6-mistral-7b-8bit


Logs:

mlx-community/llava-1.5-7b-4bit

(rbuild) (base) Stewarts-MacBook-Pro:vmlx stewart$ python -m mlx_vlm.chat_ui --model mlx-community/llava-1.5-7b-4bit
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Fetching 9 files: 100%|████████████████████████| 9/9 [00:00<00:00, 88820.56it/s]
Fetching 9 files: 100%|████████████████████████| 9/9 [00:00<00:00, 30740.01it/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Fetching 9 files: 100%|████████████████████████| 9/9 [00:00<00:00, 68759.08it/s]
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/queueing.py", line 541, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/route_utils.py", line 276, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/blocks.py", line 1928, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/blocks.py", line 1526, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 657, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 783, in asyncgen_wrapper
    response = await iterator.__anext__()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/chat_interface.py", line 592, in _stream_fn
    first_response = await async_iteration(generator)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 657, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 650, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 859, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 633, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/mlx_vlm/chat_ui.py", line 103, in chat
    messages = processor.apply_chat_template(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/transformers/processing_utils.py", line 926, in apply_chat_template
    raise ValueError(
ValueError: No chat template is set for this processor. Please either set the `chat_template` attribute, or provide a chat template as an argument. See https://huggingface.co/docs/transformers/main/en/chat_templating for more information.

mlx-community/llava-v1.6-mistral-7b-8bit

(rbuild) (base) Stewarts-MacBook-Pro:vmlx stewart$ python -m mlx_vlm.chat_ui --model mlx-community/llava-v1.6-mistral-7b-8bit
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Fetching 10 files: 100%|████████████████████| 10/10 [00:00<00:00, 110960.42it/s]
Fetching 10 files: 100%|█████████████████████| 10/10 [00:00<00:00, 34865.37it/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Fetching 10 files: 100%|████████████████████| 10/10 [00:00<00:00, 108942.96it/s]
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/queueing.py", line 541, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/route_utils.py", line 276, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/blocks.py", line 1928, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/blocks.py", line 1526, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 657, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 783, in asyncgen_wrapper
    response = await iterator.__anext__()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/chat_interface.py", line 592, in _stream_fn
    first_response = await async_iteration(generator)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 657, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 650, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 859, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 633, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/mlx_vlm/chat_ui.py", line 103, in chat
    messages = processor.apply_chat_template(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/transformers/processing_utils.py", line 926, in apply_chat_template
    raise ValueError(
ValueError: No chat template is set for this processor. Please either set the `chat_template` attribute, or provide a chat template as an argument. See https://huggingface.co/docs/transformers/main/en/chat_templating for more information.
^CKeyboard interruption in main thread... closing server.

mlx-community/llava-llama-3-8b-v1_1-8bit

(rbuild) (base) Stewarts-MacBook-Pro:vmlx stewart$ python -m mlx_vlm.chat_ui --model mlx-community/llava-llama-3-8b-v1_1-8bit
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Fetching 8 files: 100%|████████████████████████| 8/8 [00:00<00:00, 74731.47it/s]
Fetching 8 files: 100%|█████████████████████████| 8/8 [00:00<00:00, 9742.87it/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Fetching 8 files: 100%|████████████████████████| 8/8 [00:00<00:00, 34344.35it/s]
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/queueing.py", line 541, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/route_utils.py", line 276, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/blocks.py", line 1928, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/blocks.py", line 1526, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 657, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 783, in asyncgen_wrapper
    response = await iterator.__anext__()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/chat_interface.py", line 592, in _stream_fn
    first_response = await async_iteration(generator)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 657, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 650, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 859, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 633, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/mlx_vlm/chat_ui.py", line 103, in chat
    messages = processor.apply_chat_template(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/transformers/processing_utils.py", line 926, in apply_chat_template
    raise ValueError(
ValueError: No chat template is set for this processor. Please either set the `chat_template` attribute, or provide a chat template as an argument. See https://huggingface.co/docs/transformers/main/en/chat_templating for more information.
^CKeyboard interruption in main thread... closing server.

mlx-community/llava-phi-3-mini-4bit

(rbuild) (base) Stewarts-MacBook-Pro:vmlx stewart$ python -m mlx_vlm.chat_ui --model mlx-community/llava-phi-3-mini-4bit
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
preprocessor_config.json: 100%|████████████████| 819/819 [00:00<00:00, 8.36MB/s]
added_tokens.json: 100%|███████████████████████| 978/978 [00:00<00:00, 19.6MB/s]
config.json: 100%|█████████████████████████| 1.33k/1.33k [00:00<00:00, 18.7MB/s]
special_tokens_map.json: 100%|█████████████████| 615/615 [00:00<00:00, 3.51MB/s]
model.safetensors.index.json: 100%|██████████| 129k/129k [00:00<00:00, 9.68MB/s]
tokenizer_config.json: 100%|███████████████| 8.45k/8.45k [00:00<00:00, 46.1MB/s]
tokenizer.model: 100%|███████████████████████| 500k/500k [00:00<00:00, 12.5MB/s]
tokenizer.json: 100%|██████████████████████| 1.85M/1.85M [00:00<00:00, 8.59MB/s]
model.safetensors: 100%|███████████████████| 2.47G/2.47G [00:57<00:00, 43.2MB/s]
Fetching 9 files: 100%|███████████████████████████| 9/9 [00:57<00:00,  6.41s/it]
Fetching 9 files: 100%|███████████████████████| 9/9 [00:00<00:00, 110054.62it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Fetching 9 files: 100%|████████████████████████| 9/9 [00:00<00:00, 27453.63it/s]
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/queueing.py", line 541, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/route_utils.py", line 276, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/blocks.py", line 1928, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/blocks.py", line 1526, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 657, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 783, in asyncgen_wrapper
    response = await iterator.__anext__()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/chat_interface.py", line 592, in _stream_fn
    first_response = await async_iteration(generator)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 657, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 650, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 859, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/gradio/utils.py", line 633, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/mlx_vlm/chat_ui.py", line 103, in chat
    messages = processor.apply_chat_template(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/stewart/Dropbox/dev/vmlx/rbuild/lib/python3.11/site-packages/transformers/processing_utils.py", line 926, in apply_chat_template
    raise ValueError(
ValueError: No chat template is set for this processor. Please either set the `chat_template` attribute, or provide a chat template as an argument. See https://huggingface.co/docs/transformers/main/en/chat_templating for more information.
^CKeyboard interruption in main thread... closing server.
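A hedged workaround sketch, based only on the suggestion in the error message itself (set the chat_template attribute or pass a template): copy the tokenizer's template onto the processor before anything calls processor.apply_chat_template. Attribute names depend on the transformers version, and whether the tokenizer actually carries a usable template varies by checkpoint.

from mlx_vlm import load

model, processor = load("mlx-community/llava-1.5-7b-4bit")

# If the processor has no chat template but its tokenizer does, reuse it.
if getattr(processor, "chat_template", None) is None:
    tokenizer = getattr(processor, "tokenizer", None)
    template = getattr(tokenizer, "chat_template", None) if tokenizer else None
    if template is not None:
        processor.chat_template = template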

unable to run paligemma

I am trying to run the following code, but it gives an error. Please assist!

import mlx.core as mx
from mlx_vlm import load, generate

model_path = "google/paligemma-3b-mix-448"
model, processor = load(model_path)

print(processor)

prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": f"<image>\nWhat are these?"}],
    tokenize=False,
    add_generation_prompt=True,
)

output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)
Traceback (most recent call last):
  File "/Users/namanjain/Desktop/repos/local-recall/models.py", line 15, in <module>
    output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/mlx_vlm/utils.py", line 809, in generate
    logits, cache = model(input_ids, pixel_values, mask)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 139, in __call__
    input_embeddings, final_attention_mask_4d = self.get_input_embeddings(
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 82, in get_input_embeddings
    self._prepare_inputs_for_multimodal(
  File "/opt/anaconda3/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 115, in _prepare_inputs_for_multimodal
    final_embedding[image_mask_expanded] = scaled_image_features.flatten()
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
ValueError: NumPy boolean array indexing assignment cannot assign 2097152 input values to the 2099200 output values where the mask is true

Error running inference script in readme using paligemma-3b-mix-448-8bit

mlx-vlm Version: 0.0.7
mlx Version: 0.14.0

Great work with this; it's working well apart from when used with PaliGemma in the supplied inference Python script. I'm experiencing an error when running the script found in the readme file, using the paligemma-3b-mix-448-8bit model, as per the code below:

import mlx.core as mx
from mlx_vlm import load, generate

model_path = "mlx-community/paligemma-3b-mix-448-8bit"

model, processor = load(model_path)

prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": f"<image>\nWhat are these?"}],
    tokenize=False,
    add_generation_prompt=True,
)

output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)
print(output)

The "CLI" and "Chat UI with Gradio" inference steps in the readme are working correctly, with the model set as "mlx-community/paligemma-3b-mix-448-8bit". I'm using Conda and MLX and MLX-VLM has been installed using PIP.

The error is as follows:

NumPy boolean array indexing assignment cannot assign 2097152 input values to the 2099200 output values where the mask is true
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 115, in _prepare_inputs_for_multimodal
    final_embedding[image_mask_expanded] = scaled_image_features.flatten()
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 82, in get_input_embeddings
    self._prepare_inputs_for_multimodal(
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 139, in __call__
    input_embeddings, final_attention_mask_4d = self.get_input_embeddings(
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_vlm/utils.py", line 809, in generate
    logits, cache = model(input_ids, pixel_values, mask)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "mlx-vlm-test.py", line 15, in <module>
    output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "/opt/anaconda3/envs/mlx/lib/python3.11/runpy.py", line 198, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: NumPy boolean array indexing assignment cannot assign 2097152 input values to the 2099200 output values where the mask is true
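The numbers in the error are suggestive. A hedged bit of arithmetic, assuming a text hidden size of 2048 for this model: the features fill 1024 image positions while the mask flags 1025, i.e. exactly one surplus image placeholder, which points at an off-by-one between the <image> tag added manually in the prompt and the image tokens the processor inserts itself.

hidden_size = 2048  # assumption about the PaliGemma text model width
print(2097152 / hidden_size)              # 1024.0 positions' worth of image features
print(2099200 / hidden_size)              # 1025.0 positions flagged by the mask
print((2099200 - 2097152) / hidden_size)  # 1.0 -> one extra image placeholder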

Inconsistent OCR results with Idefics2 model in mlx_vlm compared to other environments

I am currently evaluating the OCR capabilities of the Idefics2 model, specifically for extracting text from comic book speech balloons.

(Attached image: Strange_Tales_172005_7_Default, grey pad)

The model performs as expected on various platforms including my local Linux environment, HF Playground, and Google Colab across different hardware configurations. However, when using the mlx_vlm implementation, the results are inconsistent and generally nonsensical.

  • mlx_vlm version: "0.0.6", dev install
  • Model Used: mlx-community/idefics2-8b-4bit (also tested with 8bit)
  • Code Snippet:
    model, processor = load("mlx-community/idefics2-8b-4bit")
    prompt_text_tmpl = "Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English."
    resulting_messages = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text_tmpl}]}
    ]
    prompt = processor.apply_chat_template(resulting_messages, add_generation_prompt=True)
    output = generate(model, processor, image, prompt, temp=0.4, max_tokens=512, top_p=0.8, verbose=True)

The expected output should closely match the results from other environments, such as:

["THE ECHO OF THE OLD MAN'S FOOTSTEPS FADES DOWN THE HALL AS ..."]

I can give you the code used to generate that, but it closely follows the code from the HF Idefics2 model card.

The output from mlx_vlm is significantly different and less accurate:

==========
Image: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=185x145 at 0x537C49490> 

Prompt: User:<image>Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English.<end_of_utterance>
Assistant:
The text consists of a single word: "down".<end_of_utterance>
==========
Prompt: 76.531 tokens-per-sec
Generation: 49.216 tokens-per-sec

Additional Information

  • The issue persists across different quantizations.
  • Similar tests with the llava-1.5-7b model in HF, my Linux rig, and mlx_vlm environments show consistent and more accurate results:
==========
Image: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=185x145 at 0x174971490> 

Prompt: <s>[INST] <image>
Perform optical character recognition OCR on this image, which contains speech balloons from a comic book. The text is in English. [/INST]
</s>
The echo of the old man's footsteps fades down the hall.
==========
Prompt: 16.869 tokens-per-sec
Generation: 8.546 tokens-per-sec

Could you please investigate why the mlx_vlm Idefics2 model yields such different results compared to other environments? As far as I can tell, the inputs generated by the processor with transformers and mlx_vlm are the same. It seems the issue might be related to the generation process or detokenization, but I am unsure due to the complexity of the transformers code and my limited familiarity with mlx_vlm.

Llava v1.6 support

Which LLaVA models are supported? The mlx-examples repo supports only v1.5. Does this one support v1.6? Does it generate keywords for an image, for example?

Idefics2: Inconsistent converted model outputs and special token leakage

I am encountering significant discrepancies in the outputs of two models when testing OCR that should theoretically perform similarly, as they share the same configuration. The first model is a pre-trained community model, and the second is a version I converted. Additionally, there's an issue with special tokens (<end_of_utterance>) not being filtered out in the outputs.

Below are the details and comparisons of the outputs using both models. I've also noted the same behavior across different quantization settings. Could there be an underlying issue with the conversion process or the initial model weights that might explain these differences?

Consider the following snippet:

prompt_text = (
    "Do perform optical character recognition OCR on the image, which contains speech "
    "balloons from a comic book. The text is in English. Carefully extract the text exactly "
    "as it appears, ensuring that you preserve the original capitalization, punctuation, and "
    "formatting."
)

prompt = processor.apply_chat_template(resulting_messages, add_generation_prompt=True)

output = generate(
    model, processor, image, prompt,  # type: ignore
    verbose=True
)

Image: (attached: PIKE_BOYLOVEGIRLS_T41_012_3_Default), but I've used many others.

Here are two different outputs using different models:

Output 1 (Community Model):

==========
Image: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=185x145 at 0x55F1D7C10> 

Prompt: User:<image>Do perform optical character recognition OCR on the image, which contains speech balloons from a comic book. The text is in English. Carefully extract the text exactly as it appears, ensuring that you preserve the original capitalization, punctuation, and formatting.<end_of_utterance>
Assistant:
YOU'VE SEEMED SO UNHAPPY LATELY, CYNTHY! I WISH THERE WAS SOMETHING I COULD DO! I WISH YOU'D LET ME TRY AND MAKE YOU HAPPY!<end_of_utterance>
==========
Prompt: 112.710 tokens-per-sec
Generation: 27.255 tokens-per-sec

Output 2 (My Converted Model):

==========
Image: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=185x145 at 0x55F1D7C10> 

Prompt: User:<image>Do perform optical character recognition OCR on the image, which contains speech balloons from a comic book. The text is in English. Carefully extract the text exactly as it appears, ensuring that you preserve the original capitalization, punctuation, and formatting.<end_of_utterance>
Assistant:
YOU'VE SEEMED LATELY, CYNTHY! "<end_of_utterance>
==========
Prompt: 60.653 tokens-per-sec
Generation: 27.033 tokens-per-sec

The only difference between the two runs is the model used:

  • Output 1 uses the community model: model_path = 'mlx-community/idefics2-8b-8bit'
  • Output 2 uses the model converted by me: model_path = '../model/idefics2-8b-8bit'

There are two issues:

  1. Minor, the output does not filter out the special token <end_of_utterance>.
  2. There is a significant difference in the quality of the outputs, which is curious given that both models use the same config.json. The conversion command used was:
    python -m mlx_vlm.convert --hf-path HuggingFaceM4/idefics2-8b --mlx-path model/idefics2-8b-8bit --q-bits 8 --quantize
    The quantization process in MLX is a basic affine quantization, which should be deterministic. This leads me to suspect that the input weights might differ, although they shouldn't.

I've also tested other quantizations (4-bit and casting to bfloat16), and the (bad) outputs are consistently the same (YOU'VE SEEMED LATELY, CYNTHY! "<end_of_utterance> or very similar) across all quantizations, which differs substantially from the more accurate results I typically get with CUDA on Linux. This discrepancy might be due to the different quantization approaches between Hugging Face's bitsandbytes and MLX, but I didn't expect results this different.

Despite my efforts, I might be overlooking a simple explanation for these discrepancies. Could there be an aspect of the conversion process or setup that I'm missing? Any insights or suggestions would be greatly appreciated.
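One hedged way to test the suspicion that the input weights differ is to diff a few tensors between the community checkpoint and the locally converted one. The shard file names and paths below are illustrative assumptions; mx.load reads a safetensors file into a dict of arrays.

import mlx.core as mx

# Illustrative paths and shard names; adjust to the actual files on disk.
converted = mx.load("model/idefics2-8b-8bit/model-00001-of-00002.safetensors")
community = mx.load("community/idefics2-8b-8bit/model-00001-of-00002.safetensors")

# Compare a handful of tensors that exist in both checkpoints.
for name in sorted(set(converted) & set(community))[:10]:
    same = mx.array_equal(converted[name], community[name]).item()
    print(name, "identical" if same else "DIFFERS")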

Too sensitive to prompting

I found some VLMs are too sensitive to the prompt. For example, when I use mlx-community/llava-1.5-7b-4bit with the attached image:

python -m mlx_vlm.generate --model mlx-community/llava-1.5-7b-4bit --prompt "how many dogs in the image?" --image "/Users/xxx/Pictures/xx.jpg" --max-tokens 100 --temp 0.0

the response is correct:
There are nine dogs in the image.

but if I change the prompt to "How many dogs in the image?"..

python -m mlx_vlm.generate --model mlx-community/llava-1.5-7b-4bit --prompt "How many dogs in the image?" --image "/Users/xxx/Pictures/xx.jpg" --max-tokens 100 --temp 0.0

the response is wrong:
There are seven dogs in the image.
I also tried llava-llama-3-8b-v1_1-8bit, llava-phi-3-mini-8bit, and idefics2-8b-chatty-8bit with both "how..." and "How...", but the responses were wrong every time.

Can't run deepseek-vl with script on M2

import mlx.core as mx
from mlx_vlm import load, generate

model_path = "mlx-community/deepseek-vl-7b-chat-4bit"
model, processor = load(model_path)

prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": f"<image>\nWhat are these?"}],
    tokenize=False,
    add_generation_prompt=True,
)

output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)
print(output)

Traceback (most recent call last):
  File "/Users/wayne/2-learning/Projects/gpt/DeepSeek-VL/inference2.py", line 8, in <module>
    prompt = processor.tokenizer.apply_chat_template(
AttributeError: 'LlamaTokenizerFast' object has no attribute 'tokenizer'

>>> processor
LlamaTokenizerFast(name_or_path='/Users/wayne/.cache/huggingface/hub/models--mlx-community--deepseek-vl-7b-chat-4bit/snapshots/79feff56645faf5f145c834118ca3d43c8c55984', vocab_size=100000, model_max_length=16384, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<|begin▁of▁sentence|>', 'eos_token': '<|end▁of▁sentence|>', 'additional_special_tokens': ['<image>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
        100000: AddedToken("<|begin▁of▁sentence|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
        100001: AddedToken("<|end▁of▁sentence|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
        100002: AddedToken("ø", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
        100003: AddedToken("ö", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
        100004: AddedToken("ú", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
        100005: AddedToken("ÿ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
        100006: AddedToken("õ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
        100007: AddedToken("÷", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
        100008: AddedToken("û", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
        100009: AddedToken("ý", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
        100010: AddedToken("À", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
        100011: AddedToken("ù", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
        100012: AddedToken("Á", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
        100013: AddedToken("þ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
        100014: AddedToken("ü", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
        100015: AddedToken("<image_placeholder>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        100016: AddedToken("<image>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
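For this checkpoint, load() evidently returns the tokenizer itself (a LlamaTokenizerFast), so there is no .tokenizer attribute to reach through. A hedged fallback that covers both layouts; whether this tokenizer ships a usable chat template is a separate question.

# Use the wrapped tokenizer if present, otherwise treat the processor as the tokenizer.
tokenizer = getattr(processor, "tokenizer", processor)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "<image>\nWhat are these?"}],
    tokenize=False,
    add_generation_prompt=True,
)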

generate is a function but it can get loaded as a module in a "corner" case.

I ran into this while experimenting in a rather disorganized manner in a Jupyter notebook.

from mlx_vlm import load, generate
from mlx_vlm.generate import get_model_and_processors

# do some coding, run inference
# and then later

from mlx_vlm import load, generate # do this again by mistake "redundantly"

Then I get this error:

output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)

TypeError: 'module' object is not callable

This is Python 3.10.9, and I bet it has something to do with how modules get loaded. generate exists both as a function and as a module, so in this rather contrived way it caused a problem.

This is just documentation of this behaviour in case it helps others.
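For reference, a hedged way to avoid the shadowing is to import the callable from the module that defines it (the tracebacks elsewhere on this page place generate in mlx_vlm/utils.py), so that importing the mlx_vlm.generate submodule cannot rebind the name you actually call.

# Importing the mlx_vlm.generate submodule sets the package attribute
# mlx_vlm.generate to the module object, which can shadow the function of the
# same name on a later re-import.
from mlx_vlm.utils import generate                      # the callable
from mlx_vlm.generate import get_model_and_processors   # the CLI helper module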

Chat UI example with Gradio is not working

Sending a text prompt or a text+image prompt through the Gradio UI results in an error:

messages = processor.apply_chat_template(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'LlavaProcessor' object has no attribute 'apply_chat_template'

Here's the code used to start the chat UI:

python -m mlx_vlm.chat_ui  --model mlx-community/llava-llama-3-8b-v1_1-4bit

Running the mlx_vlm.generate example through the command line works fine.

Models to port to MLX-VLM

  • MiniCPM-Llama3-V-2_5
  • Florence 2
  • Phi-3-vision
  • Bunny
  • Dolphin-vision-72b
  • Llava Next
  • Idefics 3
  • Llava Interleave
  • Llava onevision
  • internlm-xcomposer2d5-7b
  • InternVL
  • CogVLM2
  • ColPali
  • MoonDream2
  • Yi-VL
  • CuMo
  • Kosmos-2.5

Instructions:

  1. Select the model and comment below with your selection
  2. Create a Draft PR titled: "Add support for X"
  3. Read Contribution guide
  4. Check existing models
  5. Tag @Blaizzy for code reviews and questions.

If the model you want is not listed, please suggest it and I will add it.

Feature Request: Support for `phi-3-vision-128k-instruct`

Hi, I've been exploring this repo for the past couple of days and I find your work here really amazing. I'm curious if there are any plans to add support for the Phi-3-vision-128k-instruct model to this library? I'd be happy to contribute in any way I can to help make this happen.

Batch Processing Feature

Overview

The goal is to add support for efficient batch processing of inputs to the MLX-VLM library. This will allow users to process multiple images and text prompts simultaneously to generate corresponding outputs in a single batch, improving performance.

Use cases:

  1. Generating captions for a large dataset of images.
  2. Localizing objects or regions in a batch of images based on textual descriptions.
  3. Classifying a large number of images into predefined categories, considering accompanying text information.
  4. Answering questions based on a batch of images (single and multiple question prompts).
  5. Video processing.

Note: Tag @Blaizzy for code reviews and questions.

Requirements

Support batched inputs:

  • Accept a batch of images as input, provided as a list or array of image objects.
  • Accept a batch of text prompts as input, provided as a list or array of strings.
  • Accept a single text prompt as input, provided as a string.

Perform batch processing:

  • Process the batch of images and text prompts simultaneously (async) using the MLX-VLM model.
  • Utilize parallel processing or GPU acceleration to optimize batch processing performance.
  • Ensure that the processing of one input in the batch does not affect the processing of other inputs.

Generate batched outputs:

  • Return the generated outputs for each input in the batch.
  • Maintain the order of the outputs corresponding to the order of the inputs.
  • Support different output formats such as text, embeddings, or visual representations based on the specific task.

Error handling:

  • Handle errors gracefully during batch processing.
  • Provide informative error messages for invalid inputs or processing failures.
  • Continue processing the remaining inputs in the batch if an error occurs for a specific input.

API design:

  • Provide a clear and intuitive API for users to perform batch processing.
  • Allow users to specify the maximum batch size supported by their system.
  • Provide options to control the batch processing behavior, such as enabling/disabling parallel processing.

Documentation and examples:

  • Update the library documentation to include information about the batch processing feature.
  • Provide code examples demonstrating how to use the batch processing API effectively.
  • Include performance benchmarks and guidelines for optimal batch sizes based on system resources.

Implementation

  • Modify the existing input handling logic to accept batches of images and text prompts.
  • Implement batch processing functionality using parallel processing techniques or GPU acceleration libraries.
  • Optimize memory usage and performance for efficient batch processing.
  • Update the output generation logic to handle batched outputs and maintain the correct order.
  • Implement error handling mechanisms to gracefully handle and report errors during batch processing.
  • Design and expose a user-friendly API for performing batch processing.
  • Write unit tests to verify the correctness and performance of the batch processing implementation.
  • Update the library documentation and provide code examples for using the batch processing feature.

Testing

  • Prepare a comprehensive test suite to validate the batch processing functionality.
  • Test with different batch sizes and input variations to ensure robustness.
  • Verify that the generated outputs match the expected results for each input in the batch.
  • Measure the performance improvement gained by batch processing compared to individual processing.
  • Conduct error handling tests to ensure graceful handling of invalid inputs and processing failures.

Delivery

  • Integrate the batch processing feature into the existing MLX-VLM library codebase.
  • Ensure backward compatibility with previous versions of the library.
  • Provide release notes highlighting the new batch processing capability and any breaking changes.
  • Update the library version number following semantic versioning conventions.
  • Publish the updated library package to the relevant package repositories or distribution channels.

By implementing this batch processing feature, MLX-VLM will provide users with the ability to efficiently process multiple inputs simultaneously, improving performance and usability of the library for various vision-language tasks.
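A minimal sketch of the interface shape described above, purely illustrative: it loops over (image, prompt) pairs sequentially, preserves output order, and isolates per-item failures, but it does not implement true parallel or fused-batch inference.

from mlx_vlm import load, generate

def batch_generate(model, processor, images, prompts, **kwargs):
    # Accept a single shared prompt or one prompt per image.
    if isinstance(prompts, str):
        prompts = [prompts] * len(images)
    results = []
    for image, prompt in zip(images, prompts):
        try:
            results.append(generate(model, processor, image, prompt, **kwargs))
        except Exception as exc:
            # Keep going and record the failure in place of an output.
            results.append(f"<error: {exc}>")
    return results

# Example usage (model id illustrative):
# model, processor = load("qnguyen3/nanoLLaVA")
# captions = batch_generate(model, processor, ["a.jpg", "b.jpg"], "Describe this image.", max_tokens=100)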
