biomed_hackathon's Introduction

Welcome to the BigScience🌸 Biomedical NLP Hackathon!

Hugging Face's BigScience🌸 initiative is an open scientific collaboration of nearly 600 researchers from 50 countries and 250 institutions. They work on projects across the natural language processing (NLP) space, broadening access to language datasets while tackling challenging scientific questions around language modeling.

We are running a Biomedical Datasets hackathon to centralize many NLP datasets in the biological and medical space. Biological data is often diverse, so a unified location that joins multiple sources while keeping the data as close as possible to its original form can greatly improve accessibility.

Goals of this hackathon

We want to create as many biomedical dataset dataloader scripts as possible for use with the datasets library.

The datasets library contains many different datasets, from a variety of domains, along with the unique natural language attributes that describe them. The strength of this library is that all these datasets can be downloaded and accessed with a single line of code.

Our goal is to do the same for biomedical datasets, so that practitioners can have easy access to them.
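As a rough illustration of the single-line access described above, the sketch below wraps the datasets library's load_dataset call. The helper name load_split is hypothetical and not part of the library or of this hackathon's templates; the dataset name is whatever hosted dataset you want to fetch.

```python
def load_split(dataset_name, split="train"):
    """Download (or reuse the cached copy of) one split of a hosted dataset.

    Hypothetical helper for illustration only; it is not part of the
    datasets library or of the hackathon templates.
    """
    # Imported lazily so this file can be read without the library installed.
    from datasets import load_dataset

    # This call is the "single line": it fetches, caches, and returns the split.
    return load_dataset(dataset_name, split=split)
```

Assuming the library is installed and the machine is online, a call such as load_split("squad", "validation") would return that split of a publicly hosted dataset (squad is used here only as an example name).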

There are two broad categories of biomedical datasets:

1. Publicly licensed data
2. Externally licensed data

We will accept dataloader scripts for either data type; please see the FAQs for more explicit details.

How will this data be used?

Contribution Guidelines

Hugging Face (🤗) publishes official guides to contributing to the datasets library: one for sharing a dataset and one for adding a dataset. Our guide follows these closely, with adaptations to suit this initiative.

To be guaranteed acknowledgement, contributors must have a dataloading script accepted into the library for at least one dataset. All submitted PRs are subject to code review prior to acceptance.

Details for contributor acknowledgements can be found here

Get started

A step-by-step guide on how you can implement a dataset can be found here.

Please ensure your dataloader follows our expected biomedical schema

Pre-Requisites

Please make a GitHub account prior to implementing a dataset; you can follow instructions to install git here.

You will also need Python 3.6 or later. If you are installing Python, we recommend downloading Anaconda to manage a Python environment with the necessary packages.

For Mac users: if you run into compilation errors such as error: command 'x86_64-apple-darwin13.4.0-clang' failed with exit status 254, please try setting the compilers:

    export CC=/usr/bin/clang
    export CXX=/usr/bin/clang++

All commands in the guide are run in a terminal.

Template scripts

You can find template scripts and examples as follows:

  1. Template for publicly licensed data
  2. Template for externally licensed data
  3. Example of publicly licensed data
  4. Example of externally licensed data
  5. Example of BioC format annotation
  6. Example of BRAT format annotation (coming soon)

Community channels

We welcome contributions from a wide variety of backgrounds; we are more than happy to guide you through the process. For instructions on how to get involved or ask for help, check out the following:

Join BigScience

Please join the BigScience initiative here; there is a Google form to fill out to get access to the biomedical working group Slack. Once you have filled out this form, you'll get access to BigScience's Google Drive. There is a document where you can add your name next to a working group; be sure to add your name to the "Biomedical" group.

Join our Discord Server

Alternatively, you can ping us on the Biomedical Discord server. The Discord server can be used to share information quickly or ask code-related questions.

Make a GitHub Issue

For quick questions and clarifications, you can make an issue via GitHub.

FAQs

What if my dataset does not have a public license?

We understand that some biomedical datasets require external licensing. To respect the license agreement, we recommend implementing a dataloader script that works when the user has a locally downloaded copy of the file. You can find an example here and follow the local template.
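The idea of a loader that only works on a locally downloaded file can be sketched with the standard library alone. This is not the hackathon's local template: the function name generate_examples, the file name corpus.tsv, and the two-column layout (document id, then text) are all assumptions made for illustration.

```python
import csv
from pathlib import Path

# Hypothetical file name; a real loader would use whatever the licensed
# download actually contains.
EXPECTED_FILE = "corpus.tsv"


def generate_examples(data_dir):
    """Yield (key, example) pairs from a file the user downloaded themselves."""
    if data_dir is None:
        raise ValueError(
            "This dataset is externally licensed; pass the folder that "
            "holds your local copy."
        )
    path = Path(data_dir) / EXPECTED_FILE
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected {EXPECTED_FILE} in {data_dir}; obtain it from the "
            "license holder first."
        )
    with path.open(encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for key, row in enumerate(reader):
            # Assumed layout: document id, then text, separated by a tab.
            doc_id, text = row
            yield key, {"document_id": doc_id, "text": text}
```

Nothing is downloaded by the script itself, so the data never leaves the user's machine and the license terms stay between the user and the data provider.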

What types of libraries can we import?

Can I upload the data directly?

My dataset is complicated, can you help me?

Thank you!

We greatly appreciate your help. As a token of our gratitude, contributors can get the following rewards:

biomed_hackathon's People

Contributors

hakunanatasha, ruisi-su
