epfllm / meditron Goto Github PK

View Code? Open in Web Editor NEW

1.8K 1.8K 163.0 23.21 MB

Meditron is a suite of open-source medical Large Language Models (LLMs).

Home Page: https://huggingface.co/epfl-llm

License: Apache License 2.0

Python 83.09% Shell 9.02% TypeScript 7.89%

meditron's People

Contributors

Stargazers

Watchers

Forkers

codeaudit 3x0dv5 josephrmartinez techthiyanes keyman9848 ia35 wanasyraf4 hurricanejin polya20 shreezus rockywangxiaolei vgees rajaramkuberan sephett xli-github kub12 grasool epfl-nlp abambhroliya gerardjimmyp ihanif mrubash1 ai-mou vineetp6 rkp64 karteekpalhirnwal tighthe32 izederry-q npcoc eltociear mahmedturk ivanes1234 soon14 menglizhao marfaniko mikelmanro ignitewind lixiangccst tomchapin sporkseneta selfmoff46socialwil ethicalsecurity-agency boardtwinkle-baseat wplayergy tchirpik messagesta-r ifallen-riseraci missingol-promnica theevilchariyesmessages hirorolli centragate machineska gavinchen1314 jhsygg cab789 omarquess cruisecai raptorspace jeffara lespet tfuzi miladansari joskid apollohuang1 roshanivijayan akast7 jcho19 bmedi ghdwlsdn-dev wuguo5982 efelem devmacsanalytics alex-sconjecture tangtc1981 jaedukseo popupbuddy ahmedjemaa-tech hamiltionbert blueoceandevops 54yyyu rocapp graphgrailai kalchakra13 zzgw zheleska daruishi intuitionmachine smartjoy-tech mathieulaiking qusaifarraj rezaoptic suranig technoculture athatheo ecrespo akiessner prec-co cpa2001 rorosan gvc0461082002

meditron's Issues

Missing acc.py in evaluation

The inference_pipeline need to call acc.py in the end, which is missing in this repo.

Small error in the Gap_Replay download scripts, related to the red_pijama dataset.

In this line - https://github.com/epfLLM/meditron/blob/main/gap-replay/replay/dataset.py#L210

format_number - function missing.

Data preparation not working

Overall this part doesn't works, the scripts seems to have wrong paths, and the part of selenium.. etc

If you sort it out we'll be making a UNA version of your model, hope it can help on your research.

Issue with generation with standard HF generation

Great work and repo - however there is a tokenizer issue with the base version of the model.

When trying to just simple prompt the base model to do something with the suggested format, it runs into cuda issues which seem to indicate weird tokenizer/embedding mismatches

Working example:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "epfl-llm/meditron-7b"

# BitsAndBytesConfig int-4 config 
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, use_cache=False, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_id)

def format_prompt(prompt):


    system_msg = "You are a helpful, respectful and honest assistant." + \
    "Always answer as helpfully as possible, while being safe." + \
    "Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content." + \
    "Please ensure that your responses are socially unbiased and positive in nature.\n\n" + \
    "If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct." + \
    "If you don't know the answer to a question, please don't share false information."
    
    system_msg = "You are a helpful, respectful and honest assistant."
    
    return f"<|im_start|> system\n{system_msg}<|im_end|>\n <|im_start|> user\n{prompt}<|im_end|>\n <|im_start|> assistant\n"


 med_prompt = format_prompt("What is a possible treatment for high blood pressure in a pregnant woman?")

Gives us this prompt:

'<|im_start|> system\nYou are a helpful, respectful and honest assistant.<|im_end|>\n <|im_start|> user\nmake a clinical note<|im_end|>\n <|im_start|> assistant\n'

Use vanilla HF pipeline:

# Use a pipeline for later
from transformers import pipeline

pipe = pipeline("text-generation",
                model=model,
                tokenizer= tokenizer,    
                max_new_tokens = 1024,
                do_sample=True,
                top_k=30,
                num_return_sequences=2,
                eos_token_id=tokenizer.eos_token_id,
                return_full_text=False,
                )

# generate from prompt
generated = pipe(med_prompt)

Leads to:

../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [642,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

But it all works fine if the special formatting is not provided. I understand the special formatting was only for the finetuned versions, but the tokenizer has these special tokens added for the base model too, which seems problematic.

I hope this is enough detail to go on, but its throwing me a bit - seems like the special tokens do not play nice.

Envrionment details:

Python 3.9

Pip packages:

Package Version

accelerate 0.20.3
aiofiles 23.2.1
aiohttp 3.8.4
aiosignal 1.3.1
altair 5.1.2
annotated-types 0.6.0
anyio 3.7.1
asttokens 2.2.1
async-timeout 4.0.2
attrs 23.1.0
backcall 0.2.0
bertopic 0.16.0
blis 0.7.11
catalogue 2.0.10
certifi 2023.5.7
charset-normalizer 3.1.0
click 8.1.7
cloudpathlib 0.16.0
cmake 3.26.4
colorama 0.4.6
comm 0.1.3
confection 0.1.4
contourpy 1.2.0
cycler 0.12.1
cymem 2.0.8
Cython 0.29.36
datasets 2.13.1
debugpy 1.6.7
decorator 5.1.1
dill 0.3.6
einops 0.6.1
en-core-web-sm 3.7.1
exceptiongroup 1.1.3
executing 1.2.0
fastapi 0.104.1
fastjsonschema 2.19.0
ffmpy 0.3.1
filelock 3.12.2
fonttools 4.44.0
frozenlist 1.3.3
fsspec 2023.6.0
gradio 4.2.0
gradio_client 0.7.0
h11 0.14.0
hdbscan 0.8.33
httpcore 1.0.2
httpx 0.25.1
huggingface-hub 0.15.1
idna 3.4
importlib-metadata 6.7.0
importlib-resources 6.1.1
ipykernel 6.23.3
ipython 8.14.0
jedi 0.18.2
Jinja2 3.1.2
joblib 1.3.2
jsonschema 4.19.2
jsonschema-specifications 2023.7.1
jupyter_client 8.3.0
jupyter_core 5.3.1
kiwisolver 1.4.5
langcodes 3.3.0
lit 16.0.6
llvmlite 0.41.1
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib 3.8.1
matplotlib-inline 0.1.6
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.14
murmurhash 1.0.10
nbformat 5.9.2
nest-asyncio 1.5.6
networkx 3.1
nltk 3.8.1
numba 0.58.1
numpy 1.25.0
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu11 10.9.0.58
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu11 10.2.10.91
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu11 11.7.4.91
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu11 2.14.3
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.52
nvidia-nvtx-cu11 11.7.91
nvidia-nvtx-cu12 12.1.105
orjson 3.9.10
packaging 23.1
pandas 2.0.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.1.0
pip 23.1.2
platformdirs 3.8.0
plotly 5.18.0
preshed 3.0.9
prompt-toolkit 3.0.38
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 12.0.1
pydantic 2.4.2
pydantic_core 2.10.1
pydub 0.25.1
Pygments 2.15.1
pynndescent 0.5.11
pyparsing 3.1.1
python-dateutil 2.8.2
python-multipart 0.0.6
pytz 2023.3
PyYAML 6.0
pyzmq 25.1.0
referencing 0.30.2
regex 2023.6.3
requests 2.31.0
rich 13.6.0
rpds-py 0.12.0
safetensors 0.3.1
scikit-learn 1.3.2
scipy 1.11.4
semantic-version 2.10.0
sentence-transformers 2.2.2
sentencepiece 0.1.99
setuptools 58.1.0
shellingham 1.5.4
six 1.16.0
smart-open 6.4.0
sniffio 1.3.0
spacy 3.7.2
spacy-legacy 3.0.12
spacy-loggers 1.0.5
srsly 2.4.8
stack-data 0.6.2
starlette 0.27.0
sympy 1.12
tenacity 8.2.3
thinc 8.2.1
threadpoolctl 3.2.0
tokenizers 0.13.3
tomlkit 0.12.0
toolz 0.12.0
torch 2.1.1
torchvision 0.16.1
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformers 4.30.2
triton 2.1.0
typer 0.9.0
typing_extensions 4.8.0
tzdata 2023.3
umap-learn 0.5.5
urllib3 2.0.3
uvicorn 0.24.0.post1
wasabi 1.1.2
wcwidth 0.2.6
weasel 0.3.4
websockets 11.0.3
wheel 0.40.0
xxhash 3.2.0
yarl 1.9.2
zipp 3.15.0

Accuracy calculation failure

I see that here in evaluate.py accuracy is calculated in two different ways. And there is an assert statements to make sure they match.
In my tests, these assertion sometimes fails. Why is this assertion there? and what does it failure signify?

Thank you,

eval generation path issue

Hello, I'm trying to use your eval pipeline.
I ran ./inference_pipeline.sh -b pubmedqa -c gpt2 -s 0 -m 0 -out_dir out_dir
After it is done with generation I get the following error:

Stored pubmedqa generations to the following path: ../benchmarks/generations/pubmedqa-gpt2.jsonl
Traceback (most recent call last):
  File "evaluate.py", line 475, in <module>
    main(args)
  File "evaluate.py", line 390, in main
    data = load_jsonl(path)
  File "evaluate.py", line 39, in load_jsonl
    with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../benchmarks/generations/pubmedqa/pubmedqa-gpt2.jsonl'

It looks like the generations are saved in: meditron/benchmarks/generations/pubmedqa-gpt2.jsonl
But eval looks for them in: meditron/benchmarks/generations/pubmedqa/pubmedqa-gpt2.jsonl'

prompt Meditron for NER

Hi, thank you for releasing the model. I believe it's a great resource for biomedical research and applications.

I am trying to prompt Meditron (7b, 70b-4b) for a medical NER task (using the example prompt, i.e., you are a helpful...) and couldn't get good results. Any suggestions on how to do NER with this model?

missing medqa_cot_train.jsonl

Could you release your training and validation SFT data?

load model size mismatch error

operations

I download the model file from https://huggingface.co/epfl-llm/meditron-7b/tree/main
then load the model using:
model = transformers.AutoModelForCausalLM.from_pretrained('./meditron-7b/', trust_remote_code=True, use_cache=True)

get the error:

size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32000, 4096]) from checkpoint, the shape in current model is torch.Size([32017, 4096]).

package

transformer version is 4.25.2

where is mycode error?

there are not any output. where is mycode error?
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

tokenizer = AutoTokenizer.from_pretrained("/Users/yutao/Documents/MyCode/meditron_7b/model")
model = AutoModelForCausalLM.from_pretrained("/Users/yutao/Documents/MyCode/meditron_7b/model")

question = "What to do about high blood pressure?"
inputs = tokenizer(question, return_tensors="pt")
set_seed(42)

answers = []

with torch.inference_mode():
beam_output = model.generate(**inputs,
max_new_tokens=1024,
num_beams=1,
pad_token_id=2,
eos_token_id=2,
early_stopping=False,
do_sample=False,
)
answers.append(tokenizer.decode(beam_output[0], skip_special_tokens=True))

print("answers: "+answers)

Loading the guidelines with huggingface datasets fails

Running the following code

from datasets import load_dataset

dataset = load_dataset("epfl-llm/guidelines")

Gives me this error:

{
	"name": "DatasetGenerationError",
	"message": "An error occurred while generating the dataset",
	"stack": "---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/builder.py:1932, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1925     writer = writer_class(
   1926         features=writer._features,
   1927         path=fpath.replace(\"SSSSS\", f\"{shard_id:05d}\").replace(\"JJJJJ\", f\"{job_id:05d}\"),
   (...)
   1930         embed_local_files=embed_local_files,
   1931     )
-> 1932 writer.write_table(table)
   1933 num_examples_progress_update += len(table)

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/arrow_writer.py:573, in ArrowWriter.write_table(self, pa_table, writer_batch_size)
    572 pa_table = pa_table.combine_chunks()
--> 573 pa_table = table_cast(pa_table, self._schema)
    574 if self.embed_local_files:

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:2332, in table_cast(table, schema)
   2331 if table.schema != schema:
-> 2332     return cast_table_to_schema(table, schema)
   2333 elif table.schema.metadata != schema.metadata:

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:2291, in cast_table_to_schema(table, schema)
   2290     raise ValueError(f\"Couldn't cast\
{table.schema}\
to\
{features}\
because column names don't match\")
-> 2291 arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
   2292 return pa.Table.from_arrays(arrays, schema=schema)

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:2291, in <listcomp>(.0)
   2290     raise ValueError(f\"Couldn't cast\
{table.schema}\
to\
{features}\
because column names don't match\")
-> 2291 arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
   2292 return pa.Table.from_arrays(arrays, schema=schema)

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:1834, in _wrap_for_chunked_arrays.<locals>.wrapper(array, *args, **kwargs)
   1833 if isinstance(array, pa.ChunkedArray):
-> 1834     return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
   1835 else:

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:1834, in <listcomp>(.0)
   1833 if isinstance(array, pa.ChunkedArray):
-> 1834     return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
   1835 else:

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:2147, in cast_array_to_feature(array, feature, allow_number_to_str)
   2146 elif not isinstance(feature, (Sequence, dict, list, tuple)):
-> 2147     return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
   2148 raise TypeError(f\"Couldn't cast array of type\
{array.type}\
to\
{feature}\")

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:1836, in _wrap_for_chunked_arrays.<locals>.wrapper(array, *args, **kwargs)
   1835 else:
-> 1836     return func(array, *args, **kwargs)

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:2029, in array_cast(array, pa_type, allow_number_to_str)
   2028 if pa.types.is_null(pa_type) and not pa.types.is_null(array.type):
-> 2029     raise TypeError(f\"Couldn't cast array of type {array.type} to {pa_type}\")
   2030 return array.cast(pa_type)

TypeError: Couldn't cast array of type string to null

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
model_playground.ipynb Cell 6 line 3
      model_playground.ipynb#W5sdnNjb2RlLXJlbW90ZQ%3D%3D?line=2'>3</a> dataset = load_dataset(\"epfl-llm/guidelines\")

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/load.py:2152, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   2149 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
   2151 # Download and prepare data
-> 2152 builder_instance.download_and_prepare(
   2153     download_config=download_config,
   2154     download_mode=download_mode,
   2155     verification_mode=verification_mode,
   2156     try_from_hf_gcs=try_from_hf_gcs,
   2157     num_proc=num_proc,
   2158     storage_options=storage_options,
   2159 )
   2161 # Build dataset for splits
   2162 keep_in_memory = (
   2163     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   2164 )

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/builder.py:948, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    946     if num_proc is not None:
    947         prepare_split_kwargs[\"num_proc\"] = num_proc
--> 948     self._download_and_prepare(
    949         dl_manager=dl_manager,
    950         verification_mode=verification_mode,
    951         **prepare_split_kwargs,
    952         **download_and_prepare_kwargs,
    953     )
    954 # Sync info
    955 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/builder.py:1043, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
   1039 split_dict.add(split_generator.split_info)
   1041 try:
   1042     # Prepare split will record examples associated to the split
-> 1043     self._prepare_split(split_generator, **prepare_split_kwargs)
   1044 except OSError as e:
   1045     raise OSError(
   1046         \"Cannot find data file. \"
   1047         + (self.manual_download_instructions or \"\")
   1048         + \"\
Original error:\
\"
   1049         + str(e)
   1050     ) from None

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/builder.py:1805, in ArrowBasedBuilder._prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1803 job_id = 0
   1804 with pbar:
-> 1805     for job_id, done, content in self._prepare_split_single(
   1806         gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1807     ):
   1808         if done:
   1809             result = content

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/builder.py:1950, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1948     if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1949         e = e.__context__
-> 1950     raise DatasetGenerationError(\"An error occurred while generating the dataset\") from e
   1952 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset"
}

Cannot find article data in papers-PubMed.jsonl

It is great to see you have done an open-source Medical LLM with SOTA performance. When I ran "python load.py --dataset papers --key_path keys.json". It outputs papers-PubMed.jsonl. But I cannot find any paper in this dataset. Only some basic info of each article. Anyone knows what's wrong?
Thank you!

llama.cpp Integration to Support Low-End Hardware Compatibility

Request for llama.cpp Integration to Support Low-End Hardware Compatibility

Description

I'm currently trying to integrate llama.cpp with Meditron for running models on lower-end hardware. Meditron is based on Llama, so in theory, this should be possible. However, I'm encountering issues when attempting to convert the Meditron model using llama.cpp.

Steps to Reproduce

Either run python3 convert-hf-to-gguf.py ../meditron-7b/

Output:

Loading model: meditron-7b
Traceback (most recent call last):
...
NotImplementedError: Architecture "LlamaForCausalLM" not supported!

Or directly launching with llama.cpp using:

./build/bin/main --rope-freq-scale 8.0 -m ../meditron-7b/pytorch_model-00008-of-00008.bin -p "I have pain in my leg from toes to hip"

Output:

Log start
...
error loading model: llama_model_loader: failed to load model from ../meditron-7b/pytorch_model-00008-of-00008.bin

Expected Behavior

Successful integration of llama.cpp with Meditron, allowing the model to run on lower-end hardware.

Actual Behavior

Encountering a NotImplementedError for the architecture "LlamaForCausalLM" when trying to convert the model, and an error loading the model when launching directly with llama.cpp.

Possible Solution

Adjustments in llama.cpp to support the "LlamaForCausalLM" architecture used by Meditron. This could involve modifying the model conversion script or the model loading mechanism in llama.cpp.

Additional Context

Link to llama.cpp

Request

I kindly request the team to consider adding support for llama.cpp integration with Meditron. Or to give advices on how to implement it. This would be a significant enhancement, enabling the use of Meditron models on more diverse hardware setups, especially those at the lower end.

Errors with three of the scrapers

Hello,

I was trying to scrape magic, drugs and guidelinecentral without success, while some others were fine. Any idea how to make them work? Drugs seemed to work but 0 article was in the JSONL. GuidelineCentral got some click issues. FInally, Magic printed errors for each article but one.

Thanks in advance,

Question about training hour

Hi I've tried to calculate the training hours by myself and get following result

it shows 430 hours, which is inconsistent with the 332 hours that given in your appendix A

Can't benchmark on medqa

In the evaluation folder, running python inference.py --checkpoint mistral --checkpoint_name mistral --benchmark medqa fails with the error datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset.

I managed to fix the error by setting the MedQA benchmark attribute self.subsets = ['med_qa_en_source]

By the way, the repository requirements.txt is missing the packages wandb, scikit-learn and openai (unused though), to be able to run the benchmark suite.

share python scripts for processing PubMed full articles

It is great to see you have done an open-source Medical LLM with SOTA performance. I searched online the python scripts to process PubMed full articles are not complete. Could you share your python scripts of processing PubMed full articles? Thank you very much!

Mismatch in vocab_size between .bin files and .safetensors files

Hey !

I'm sorry if this is not an issue and it's just me not understanding the problem, I'm not an expert, rather a novice, in this field.

I'm trying to deploy the project according to your deployment guide.
However, since I don't have access to enough memory for the -70B version of the model, I want to use the --load-8bit parameter to enable model compression. (I shall specify that I run the model using the CPU, with the --device cpu flag)

When I use this, I get the following error:

ValueError: Trying to set a tensor of shape torch.Size([32000, 8192]) in "weight" (which has shape torch.Size([32017, 8192])), this look incorrect

If I look in the HF's upload log, I see that there were two main upload of the model:

The first one with the .bin files, including the vocab_size value set to 32000
The seconde one with the .safetensors files, including the vocab_size value set to 32017

My understanding is that to enable model compression, the .bin files are needed, which do not match to the model configuration anymore.

This is supported by a manual edit of the config.json file to set vocab_size back to 32000, which allows the model to load properly using --load-8bit.

Meditron-7b doesn't behave as expected

I've been experimenting with Meditron-7b for answering medical queries, but its performance seems not as expected compared to other LLM models.

I loaded the model and tokenizer and then used the standard HF pipeline:

pipeline = transformers.pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    temperature=0.01,
    do_sample=True,
    top_k=3,
    top_p=0.01,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=200,
)

Then I used langchain wrapper:

from langchain.llms import HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=pipeline)

For a simple greeting with llm(prompt="Hi, how are you?"), the model repetitively echoed the prompt:

'\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi,'

When asked about lung cancer risk factors with llm(prompt="What are the risk factors for lung cancer?"),, it provided a list of related questions instead of direct answers:

What are the symptoms of lung cancer?

What causes lung cancer?

What are the stages of lung cancer?

When to seek urgent medical care?

How to diagnose lung cancer?

How to treat lung cancer?

How to prevent lung cancer?

What to expect (Outlook/Prognosis)?

Further, using a formatted prompt based on a GitHub repository example, the response included the prompt format instructions verbatim, without addressing the medical query.

def format_prompt(prompt):
    system_msg = "You are a helpful, respectful and honest assistant." + \
    "Always answer as helpfully as possible, while being safe." + \
    "Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content." + \
    "Please ensure that your responses are socially unbiased and positive in nature.\n\n" + \
    "If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct." + \
    "If you don't know the answer to a question, please don't share false information."""
    return f"<|im_start|> system\n{system_msg}<|im_end|>\n <|im_start|> user\n{prompt}<|im_end|>\n <|im_start|> assistant\n"

example = {
        "prompt": """Four weeks after starting hydrochlorothiazide, a 49-year-old man with hypertension comes to the physician because of muscle cramps and weakness. His home medications also include amlodipine. His blood pressure today is 176/87 mm Hg. Physical examination shows no abnormalities. The precordial leads of a 12-lead ECG are shown. The addition of which of the following is most likely to have prevented this patient's condition?\n\nOptions:\nA. Torsemide \nB. Nifedipine \nC. Eplerenone \nD. Hydralazine""",
        "gold": "C",
        "steps": [
            "The patient has started hydrochlorothiazide.",
            "He now presents with muscle cramps and weakness and an ECG that supports the diagnosis of hypokalemia.",
            "(A) Torsemide is a loop diuretic and would likely aggravate the hypokalemia.",
            "(B) Nifedipine is a calcium antagonist and would not alleviate the hypocalcemia.",
            "(C) Eplerenone is a potassium-sparing diuretic and would likely decrease the chance of hypokalemia.",
            "(D) Hydralazine is a potent vasodilator and would not decrease the risk of hypokalemia.",
        ]
    }

prompt = format_prompt(example['prompt'])
res = llm(prompt=prompt )
print(res)

And this returned

You are a helpful, respectful and honest assistant.Always answer as helpfully as possible, while being safe.Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.If you don't know the answer to a question, please don't share false information.<|im_end|>
<|im_start|> user
A 65-year-old man with a history of hypertension and hyperlipidemia presents with a 2-week history of progressive dyspnea on exertion. He has a history of smoking 1 pack of cigarettes per day for 30 years. He has no history of diabetes mellitus, coronary artery disease, or peripheral vascular disease. His blood pressure is 150/90 mm Hg, and his pulse is 80 beats per minute. Physical examination reveals a grade 3/6 systolic murmur at the apex. The precordial leads of a 12-lead ECG are shown. The addition of which of the following is most likely to have prevented this patient's condition?

Options:
A. Amlodipine
B. Lisinopril
C. Metoprolol
D. Nifedipine<|im_end|>
<|im_start|> assistant
You are a helpful, respectful and honest assistant.Always answer as helpfully as possible, while being safe.Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.Please ensure that your responses are socially unbiased and positive in nature.

Is this behavior typical for Meditron-7b, or might it be an issue with my prompting technique? Additionally, would Meditron-70b potentially yield better results?

Eval results aren't matching the paper

I'm not able to match the 3-shot eval results reported in the paper for the pretrained model.
I downloaded the Meditron-7b model from HF.
For example, for MedQA I get 0.353, while the paper reports 0.287±0.008
My command was: ./inference_pipeline.sh -b medqa4 -c meditron-7b -s 3 -m 0 -out_dir out_dir

On PubMedQA, I got 0.486, but the paper reports .693±.151.

Abnormal evaluation result

I evaluation llama-2-70b model on pubmedqa with cot, sc_cot, and multi_seed + sc_cot inference modes, but I got some abnormal evaluation results.

For the cot inference mode: There are only 26 correct answers with 476 ignored, does it normal?
For the sc_cot and multi_seed + sc_cot result, I got about 52% acc, different from the result in your paper.

I want to know does the evaluation code is completely same as that you used?

My evaluation result:
cot:

====================================
Report accuracy for pubmedqa-cot-llama2-70b-base on pubmedqa:
Accuracy: 0.032
Accuracy (calibrated): 0.6153846153846154
Precision: 0.03709090909090909
Recall: 0.032
F1: 0.033303703703703696
------------------------------------
Correct: 16
Counted: 26
Total: 500
Unable to find answer: 474
Ignored prompts: 474
 ====================================

sc_cot

====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa:
Accuracy: 0.458
Accuracy (calibrated): 0.5240274599542334
Precision: 0.36550423868216514
Recall: 0.458
F1: 0.4052266991967127
------------------------------------
Correct: 229
Counted: 437
Total: 500
Unable to find answer: 63
Ignored prompts: 63
====================================

Multi-seed + sc_cot:

====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa:
Accuracy: 0.458
Accuracy (calibrated): 0.5240274599542334
Precision: 0.36550423868216514
Recall: 0.458
F1: 0.4052266991967127
------------------------------------
Correct: 229
Counted: 437
Total: 500
Unable to find answer: 63
Ignored prompts: 63
====================================
====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa-1234:
Accuracy: 0.458
Accuracy (calibrated): 0.5240274599542334
Precision: 0.36550423868216514
Recall: 0.458
F1: 0.4052266991967127
------------------------------------
Correct: 229
Counted: 437
Total: 500
Unable to find answer: 63
Ignored prompts: 63
====================================
====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa-432:
Accuracy: nan
Accuracy (calibrated): -1
Precision: nan
Recall: nan
F1: nan
------------------------------------
Correct: 0
Counted: 0
Total: 0
Unable to find answer: 0
Ignored prompts: 0
====================================
====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa-32:
Accuracy: nan
Accuracy (calibrated): -1
Precision: nan
Recall: nan
F1: nan
------------------------------------
Correct: 0
Counted: 0
Total: 0
Unable to find answer: 0
Ignored prompts: 0
====================================

What system prompt do you recommend?

Hello Meditron team,

Happy New Year! Hope you are doing well. Thank you so much for releasing Meditron! I have the following questions on the recommended system prompt and how to input it to the model.

It seems that during evaluation, different system prompts are used for different datasets (https://github.com/epfLLM/meditron/blob/main/evaluation/inference.py#L121). In general, when using the meditron models posted on huggingface (https://huggingface.co/epfl-llm/meditron-7b and https://huggingface.co/epfl-llm/meditron-70b), what system prompt do you recommend?
Given a system prompt, is the following the proper way to input it to the model?

import torch
from transformers import pipeline

model_path = "epfl-llm/meditron-7b"
prompt = "What are the symptoms of diabetes?"
prompt_template = f"system prompt beginning... {prompt} ... system prompt end"

#load model and tokenizer
model = transformers.AutoModelForCausalLM.from_pretrained(
   model_path,
   torch_dtype=torch.float16,
   device_map="auto",
   )
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)

#tokenize input
prompt_tokens = tokenizer(prompt_template, return_tensors='pt')["input_ids"].to(device)

#generate output
output = model.generate(
   inputs=prompt_tokens,
   temperature=0.1,
   do_sample=True,
   eos_token_id=tokenizer.eos_token_id,
   pad_token_id=tokenizer.pad_token_id,
   max_new_tokens=512
   )

print(tokenizer.decode(output[0]))

Thank you very much!

Question: llama3

Hi,
On the meta-site https://ai.meta.com/blog/llama-2-3-meditron-yale-medicine-epfl-open-source-llm/
I saw that you also published a model based on llama3? However, I could not find it and the link provided on the meta-site does not work: https://meditron-ddx.github.io/llama3-meditron.github.io/
I was just curious, if you are going to publish it?
Kind regards!

Issue with using model

Trying the prompt give in the paper but the model just repeats the question without any helpful answers. I am prompting it wrong?

Question about Figure 1

I read the paper and found 70.2 is task specific fine-tuned version:

Are the values of chatgpt(60.2) and gpt-4(82.3) coming from in-context learning result or fine-tuning? If they are coming from in-context learning, then I think this graph is misleading and unfair.

Did you use llama-2-7B base model or the chat model ?

Are you planing to release fine-tuned models?

Thank you for this great work and very detailed paper! In the paper, you write:

MEDITRON models (7B and 70B) with and without fine-tuning to the public to ensure access for real-world evaluation and to facilitate similar efforts in other domains.

Should we expect fine-tuned models to be released soon?

Can't run finetuning script (wrong paths?)