
scifive's Introduction

SciFive


SciFive provides a text-to-text framework for biomedical and natural-language NLP tasks. Built on the T5 framework and described in the paper SciFive: a text-to-text transformer model for biomedical literature, SciFive achieves state-of-the-art and competitive results on multiple biomedical natural-language tasks.

🎉 UPDATE Jan 2023

📝 Our example BioT5X fine-tuning notebook for the BLURB tasks: finetunning_biot5x_blurb.ipynb

🤗 HuggingFace

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-base-Pubmed")
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-base-Pubmed")
model.cuda()  # move the model to GPU so it matches the inputs below

sentence = "Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor ."
text = sentence + " </s>"

encoding = tokenizer.encode_plus(text, padding="max_length", max_length=256, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=256,
    early_stopping=True
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)

Google Cloud Storage

Our base Google Cloud Storage URI is at gs://scifive

As described in our paper, we release 6 versions of SciFive, each benchmarked to achieve state-of-the-art results on different biomedical tasks. They are all available in our Google Cloud bucket, and we are also working on releasing the models on HuggingFace.

Instructions for accessing Cloud Storage from the command line with the Python library gsutil are described here; a minimal Python sketch for listing checkpoint files is given below the table.

gsutil URIs for the 6 SciFive models:

The following table contains pretrained SciFive checkpoints.

Model | Size | Step | Config | Checkpoint
SciFive Pubmed | base & large | 1194600 & 1196500 | T5 configs | gs://scifive/models/pubmed/{size}/
SciFive Pubmed+PMC | base & large | 1200000 | T5 configs | gs://scifive/models/pubmed_pmc/{size}/
SciFive PMC | base & large | 1200000 | T5 configs | gs://scifive/models/pmc/{size}/
  • {size} is either base or large
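
For convenience, here is a minimal Python sketch (not part of the original instructions) for browsing a checkpoint directory with TensorFlow's GCS-aware file API; it assumes anonymous read access to the public gs://scifive bucket and a TensorFlow build with GCS support.

# Minimal sketch: list the files of a SciFive checkpoint on the public bucket.
# Assumes anonymous read access to gs://scifive and TensorFlow built with GCS support.
import tensorflow as tf

# choose pubmed, pubmed_pmc, or pmc, and base or large
ckpt_dir = "gs://scifive/models/pubmed/base/"

for name in tf.io.gfile.listdir(ckpt_dir):
    print(name)

# A single file can be copied locally with tf.io.gfile.copy(src, dst);
# downloading a whole checkpoint directory is usually easier with gsutil -m cp -r.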

gsutil URI for pretraining data:

Example

Below, we give an example of how to use SciFive on HuggingFace to generate MedNLI outputs. We also publish our SciFive model fine-tuned on MedNLI for reproducing experiments.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")  
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")
model.cuda()

sent_1 = "In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA."
sent_2 = "The patient is hemodynamically stable"
text =  f"mednli: sentence1: {sent_1} sentence2: {sent_2}"

encoding = tokenizer.encode_plus(text, padding='max_length', max_length=256, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=8,
    early_stopping=True
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)

Datasets

All of the fine-tuning datasets, already pre-processed into the text-to-text format, are also available at this link.
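
For illustration, the text-to-text format pairs a task-prefixed source string with a plain-text target, as in the MedNLI example above. The sketch below is hypothetical; the label string and the exact file layout are assumptions, not taken from the released files.

# Hypothetical text-to-text training pair (field layout and label string are assumptions).
source = (
    "mednli: sentence1: In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA. "
    "sentence2: The patient is hemodynamically stable"
)
target = "entailment"  # MedNLI labels are entailment / neutral / contradiction

# Many T5-style pipelines store such pairs as tab-separated lines:
print(f"{source}\t{target}")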

📊 Expected Results

Citations

If you use SciFive model or our code for publications, please cite:

@misc{phan2021scifive,
      title={SciFive: a text-to-text transformer model for biomedical literature}, 
      author={Long N. Phan and James T. Anibal and Hieu Tran and Shaurya Chanana and Erol Bahadroglu and Alec Peltekian and Grégoire Altan-Bonnet},
      year={2021},
      eprint={2106.03598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

scifive's People

Contributors

alecpeltekian, heraclex12, justinphan3110


scifive's Issues

biot5x/examples/finetunning_biot5x_blurb.ipynb has a problem related to typing_extensions.

After completing the gsutil step, while executing biot5x/src/finetune_biot5x.py, I encountered the following error:

~/.local/lib/python3.8/site-packages/tensorflow/python/types/trace.py in <module>
     29 from typing import Any, List, Optional, Sequence
     30 
---> 31 from typing_extensions import Protocol
     32 from typing_extensions import runtime_checkable
     33 

ModuleNotFoundError: No module named 'typing_extensions'

I confirmed that pip install typing_extensions has been done successfully.

  • Ubuntu 20.04.6 LTS (Focal Fossa)
  • 5.15.133.1-microsoft-standard-WSL2
  • Python 3.11.5

hugging face models do not work

Hi,
Thanks for your great contribution to the biomedical domain.
I tried all the models in the Hugging Face format and I couldn't replicate any of the results or even get a reasonable output. Is there something wrong with the code or the model, or is anything missing?

I ran the following code:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-base-PMC")
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-base-PMC")
model.to(device)
sentence = "Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor ."
text =  "ncbi_ner: " + sentence + " </s>"

encoding = tokenizer.encode_plus(text, pad_to_max_length=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=256,
    early_stopping=True
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)

And this is the output:

ncbi_ner: ncbi_ner: ncbi_ner:

The expected output (based on the paper) should be as follows:

Identification of APC2 , a homologue of the entity* adenomatous polyposis coli tumour *entity suppressor .

I replaced the model with all other available large, base, pubmed, pmc, pubmed+pmc models (basically all 6 hugging face variations) but I didn't get any reasonable outputs.

Could you give me a solution?

BioT5X for NLI

Hi! Very interesting work!
Could you also provide a script for fine-tuning an NLI task with BioT5X, describe the MedNLI data format required by BioT5X, and explain exactly how it should be preprocessed?
Thanks!

Fine-tuning model configuration

I want to double-check that the HOC task was fine-tuned with learning rate = 0.01 and 45000 steps (as shown in the notebook), with batch size = 64 for the base model, and that the last-step checkpoint is used to compute F1 and the other evaluation metrics on the test set. Is this correct? Please let me know if I am missing any other details.

Vocab is not accessible, it is in gs://t5-data

Hello, first of all, I want to say nice work!

When trying to reproduce your results on ChemProt, I noticed the following auth issue in the code:

model.finetune(
    mixture_or_task_name="re_all",
    pretrained_model_dir=PRETRAINED_DIR,
    finetune_steps=FINETUNE_STEPS
)
2022-11-30 14:46:16.639835: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".

It turns out this is caused by not being able to find the vocab, which is at 'gs://t5-data/vocabs/cc_all.32000/sentencepiece.model', while currently only gs://scifive is accessible.

Could you please release the vocab, or share exactly how you obtained the sentencepiece vocab, so that we can reproduce the results? Thank you!

fine-tuned models

Hi Justin, I need a QA model, which needs to be fine-tuned, but I wanted to ask whether already fine-tuned NER models are available? Thanks

the authentication error

Hello, it appears that your DDI example (SciFive/finetune/re/ddi_1.ipynb) contains some errors. Could you please double-check it? For example, the line 'tensorflow_gcs_config.configure_gcs_from_colab_auth()' produces the following authentication error:

Setting up GCS access...
Running on TPU: grpc://10.20.166.202:8470
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
[<ipython-input-3-8279b3908dbb>](https://localhost:8080/#) in <module>()
     16   auth.authenticate_user()
     17   tf.config.experimental_connect_to_host(TPU_ADDRESS)
---> 18   tensorflow_gcs_config.configure_gcs_from_colab_auth()
     19 
     20 tf.disable_v2_behavior()

[/usr/local/lib/python3.7/dist-packages/tensorflow_gcs_config/__init__.py](https://localhost:8080/#) in configure_gcs_from_colab_auth(device)
    130   adc_filename = os.environ.get(
    131       "GOOGLE_APPLICATION_CREDENTIALS", "/content/adc.json")
--> 132   with open(adc_filename) as f:
    133     data = json.load(f)
    134   return configure_gcs(credentials=data, device=device)

FileNotFoundError: [Errno 2] No such file or directory: '/content/adc.json'

Please let us know if you have any suggestion to fix the issue.
Thank you in advance.

HOC dataset for SCIFIVE

Hi, I noticed that there are two versions of the HoC dataset: the original sentence-level one, and the one modified by the BLURB benchmark to abstract level. Which HoC dataset does SciFive use for fine-tuning? I think it is the sentence-level one, because the input length is 256.

Do the gs: files in the code still exist?

I am trying out the code in scifive_pretrain_base.ipynb.

I got "OSError: Unable to open file" for all the gs: files in the code:

  • gs://t5_training/t5-data/config/pretrained_models_google_base_operative_config.gin
  • gs://mindxhack/bio_sentence_piece_small.txt

I tried looking them up using the Google Cloud Storage browser and don't see these files.

The browser does find the model files like
gs://scifive/models/pubmed_pmc/base

So the question is whether this is working code as is. Do these dependent files still exist on the cloud?

Weights for MedNLI

Is it possible to share the fine-tuned MedNLI classifier, along with a simple code snippet showing how to perform inference given a text and premise pair? I saw the notebook for fine-tuning but was wondering if the output could be open-sourced. Many thanks.

The link for finetune dataset has no data

Hi, thank you for sharing the code.

Is it possible to share the pre-processed fine-tuning datasets as well?
At the Google Cloud link, there is no dataset available.

About the question_answer?

How can I test the model on the QA task? Do I just input text like "How many teeth do humans have?", or do I need to add a prefix, like "QA: How many teeth do humans have?"?

SciFive pre-training not using the init checkpoint

I am a PhD student trying to use your model for a research project.

Looking at the pre-training notebooks, it seems you do not use an init checkpoint to continue training the T5 model. Is this because you already have checkpoints in your model directory, or because you train T5 from scratch instead of starting from an already pre-trained T5 model?
