
scifive's Introduction

SciFive


SciFive provides a text-to-text framework for biomedical and natural-language NLP tasks. Built on the T5 framework and described in the paper SciFive: a text-to-text transformer model for biomedical literature, SciFive achieves state-of-the-art and competitive results on multiple biomedical natural-language tasks.

🎉 UPDATE Jan 2023

📝 Our example BioT5X fine-tuning notebook for the BLURB tasks: finetunning_biot5x_blurb.ipynb

🤗 HuggingFace

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-base-Pubmed")
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-base-Pubmed")
model.cuda()  # move the model to GPU so it matches the inputs below

sentence = "Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor ."
text = sentence + " </s>"

encoding = tokenizer.encode_plus(text, padding="max_length", max_length=256, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=256,
    early_stopping=True
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)

Google Cloud Storage

Our base Google Cloud Storage URI is at gs://scifive

As described in our paper, we release 6 versions of SciFive, each benchmarked to achieve state-of-the-art results on different biomedical tasks. They are all available in our Google Cloud bucket, and we are also working on releasing the models on HuggingFace.

Instructions for accessing Cloud Storage from the command line with the Python library gsutil are described here; a minimal Python sketch for listing checkpoint files is given below the table.

gsutil URIs for the 6 SciFive models:

The following table contains pretrained SciFive checkpoints.

Model | Size | Step | Config | Checkpoint
SciFive Pubmed | base & large | 1194600 & 1196500 | T5 configs | gs://scifive/models/pubmed/{size}/
SciFive Pubmed+PMC | base & large | 1200000 | T5 configs | gs://scifive/models/pubmed_pmc/{size}/
SciFive PMC | base & large | 1200000 | T5 configs | gs://scifive/models/pmc/{size}/
  • {size} is either base or large
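
For convenience, here is a minimal Python sketch (not part of the original instructions) for browsing a checkpoint directory with TensorFlow's GCS-aware file API; it assumes anonymous read access to the public gs://scifive bucket and a TensorFlow build with GCS support.

# Minimal sketch: list the files of a SciFive checkpoint on the public bucket.
# Assumes anonymous read access to gs://scifive and TensorFlow built with GCS support.
import tensorflow as tf

# choose pubmed, pubmed_pmc, or pmc, and base or large
ckpt_dir = "gs://scifive/models/pubmed/base/"

for name in tf.io.gfile.listdir(ckpt_dir):
    print(name)

# A single file can be copied locally with tf.io.gfile.copy(src, dst);
# downloading a whole checkpoint directory is usually easier with gsutil -m cp -r.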

gsutil URI for pretraining data:

Example

Below, we give an example of how to use SciFive on HuggingFace to generate MedNLI outputs. We also publish our SciFive model fine-tuned on MedNLI for reproducing experiments.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")  
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")
model.cuda()

sent_1 = "In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA."
sent_2 = "The patient is hemodynamically stable"
text =  f"mednli: sentence1: {sent_1} sentence2: {sent_2}"

encoding = tokenizer.encode_plus(text, padding='max_length', max_length=256, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=8,
    early_stopping=True
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)

Datasets

All of the fine-tuning datasets, already pre-processed into the text-to-text format, are also available at this link.
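
For illustration, the text-to-text format pairs a task-prefixed source string with a plain-text target, as in the MedNLI example above. The sketch below is hypothetical; the label string and the exact file layout are assumptions, not taken from the released files.

# Hypothetical text-to-text training pair (field layout and label string are assumptions).
source = (
    "mednli: sentence1: In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA. "
    "sentence2: The patient is hemodynamically stable"
)
target = "entailment"  # MedNLI labels are entailment / neutral / contradiction

# Many T5-style pipelines store such pairs as tab-separated lines:
print(f"{source}\t{target}")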

📊 Expected Results

Citations

If you use SciFive model or our code for publications, please cite:

@misc{phan2021scifive,
      title={SciFive: a text-to-text transformer model for biomedical literature}, 
      author={Long N. Phan and James T. Anibal and Hieu Tran and Shaurya Chanana and Erol Bahadroglu and Alec Peltekian and Grégoire Altan-Bonnet},
      year={2021},
      eprint={2106.03598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

scifive's People

Contributors

alecpeltekian, heraclex12, justinphan3110


scifive's Issues

biot5x/examples/finetunning_biot5x_blurb.ipynb has a problem related to typing_extensions.

After completing the gsutil step, while executing biot5x/src/finetune_biot5x.py, I encountered the following error:

~/.local/lib/python3.8/site-packages/tensorflow/python/types/trace.py in <module>
     29 from typing import Any, List, Optional, Sequence
     30 
---> 31 from typing_extensions import Protocol
     32 from typing_extensions import runtime_checkable
     33 

ModuleNotFoundError: No module named 'typing_extensions'

I confirmed that pip install typing_extensions has been done successfully.

  • Ubuntu 20.04.6 LTS (Focal Fossa)
  • 5.15.133.1-microsoft-standard-WSL2
  • Python 3.11.5

hugging face models do not work

Hi,
Thanks for your great contribution to the biomedical domain.
I tried all the models in the Hugging Face format and I couldn't replicate any of the results or even get a reasonable output. Is there something wrong with the code or the model, or is anything missing?

I ran the following code:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-base-PMC")
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-base-PMC")
model.to(device)
sentence = "Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor ."
text =  "ncbi_ner: " + sentence + " </s>"

encoding = tokenizer.encode_plus(text, pad_to_max_length=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to(device), encoding["attention_mask"].to(device)

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=256,
    early_stopping=True
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)

And this is the output:

ncbi_ner: ncbi_ner: ncbi_ner:

The expected output (based on the paper) should be as follows:

Identification of APC2 , a homologue of the entity* adenomatous polyposis coli tumour *entity suppressor .

I replaced the model with all other available large, base, pubmed, pmc, pubmed+pmc models (basically all 6 hugging face variations) but I didn't get any reasonable outputs.

Could you give me a solution?

BioT5X for NLI

Hi! Very interesting work!
Could you also provide a script for fine-tuning an NLI task with BioT5X, describe the MedNLI data format required by BioT5X, and explain exactly how it should be preprocessed?
Thanks!

Fine-tuning model configuration

I want to double-check that the HOC task was fine-tuned with learning rate = 0.01 and 45000 steps (as shown in the notebook), with batch size = 64 for the base model, and that the last-step checkpoint is used to compute F1 and the other evaluation metrics on the test set. Is this correct? Please let me know if I am missing any other details.

Vocab is not accessible, it is in gs://t5-data

Hello, first of all, I want to say nice work!

When trying to reproduce your results on ChemProt, I noticed the following auth issue in the code:

model.finetune(
    mixture_or_task_name="re_all",
    pretrained_model_dir=PRETRAINED_DIR,
    finetune_steps=FINETUNE_STEPS
)
2022-11-30 14:46:16.639835: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".

It turns out this is caused by not being able to find the vocab, which is at 'gs://t5-data/vocabs/cc_all.32000/sentencepiece.model', while currently only gs://scifive is accessible.

Could you please release the vocab, or share exactly how you obtained the sentencepiece vocab, so that we can reproduce the results? Thank you!

fine-tuned models

Hi Justin, I need a QA model, which needs to be fine-tuned, but I wanted to ask whether already fine-tuned NER models are available? Thanks

the authentication error

Hello, it appears that your DDI example (SciFive/finetune/re/ddi_1.ipynb) contains some errors. Could you please double-check it? For example, the line 'tensorflow_gcs_config.configure_gcs_from_colab_auth()' produces the following authentication error:

Setting up GCS access...
Running on TPU: grpc://10.20.166.202:8470
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
[<ipython-input-3-8279b3908dbb>](https://localhost:8080/#) in <module>()
     16   auth.authenticate_user()
     17   tf.config.experimental_connect_to_host(TPU_ADDRESS)
---> 18   tensorflow_gcs_config.configure_gcs_from_colab_auth()
     19 
     20 tf.disable_v2_behavior()

[/usr/local/lib/python3.7/dist-packages/tensorflow_gcs_config/__init__.py](https://localhost:8080/#) in configure_gcs_from_colab_auth(device)
    130   adc_filename = os.environ.get(
    131       "GOOGLE_APPLICATION_CREDENTIALS", "/content/adc.json")
--> 132   with open(adc_filename) as f:
    133     data = json.load(f)
    134   return configure_gcs(credentials=data, device=device)

FileNotFoundError: [Errno 2] No such file or directory: '/content/adc.json'

Please let us know if you have any suggestion to fix the issue.
Thank you in advance.

HOC dataset for SCIFIVE

Hi, I noticed that there are two versions of the HoC dataset: the original sentence-level one, and the one modified by the BLURB benchmark to abstract level. Which HoC dataset does SciFive use for fine-tuning? I think it is the sentence-level one, because the input length is 256.

Do the gs: files in the code still exist?

I am trying out the code in scifive_pretrain_base.ipynb.

I got "OSError: Unable to open file" for all the gs: files in the code:

  • gs://t5_training/t5-data/config/pretrained_models_google_base_operative_config.gin
  • gs://mindxhack/bio_sentence_piece_small.txt

I tried looking them up using the Google Cloud Storage browser and don't see these files.

The browser does find the model files like
gs://scifive/models/pubmed_pmc/base

So the question is whether this is working code as is. Do these dependent files still exist on the cloud?

Weights for MedNLI

Is it possible to share the fine-tuned MedNLI classifier, along with a simple code snippet showing how to perform inference given a text and premise pair? I saw the notebook for fine-tuning but was wondering if the output could be open-sourced. Many thanks.

The link for finetune dataset has no data

Hi, thank you for sharing the code.

Is it possible to share the pre-processed fine-tuning datasets as well?
At the Google Cloud link, there is no dataset available.

About the question_answer?

How can I test the model on the QA task? Do I just input text like "How many teeth do humans have?", or do I need to add a prefix, like "QA: How many teeth do humans have?"?

SciFive pre-training not using the init checkpoint

I am a PhD student trying to use your model for a research project.

Looking at the pre-training notebooks, it seems you do not use an init checkpoint to continue training the T5 model. Is this because you already have checkpoints in your model directory, or because you train T5 from scratch instead of starting from an already pre-trained T5 model?
