
bio-datasets's People

Contributors

adsodemelk, delfosseaurelien, hayfabm, martinp7, theomeb


bio-datasets's Issues

Manage first datasets

  • Add open-source NetMHCpan-4.1 dataset

    • Compute contextual sequence embeddings with either ESM or ProtBert and add them to the netmhcpan-4.1-train dataset (a sketch follows this list)
  • Make the SwissProt dataset usable with the API.

  • Make the pathogen dataset usable with the API.
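
The embedding computation in the first sub-task could look like the following minimal sketch, using the public Rostlab/prot_bert checkpoint from HuggingFace transformers (ESM would work similarly). The CLS-token pooling matches the get_embeddings() docstring in the POC below, but the batching and preprocessing details here are assumptions:

import re

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

def embed_sequences(sequences):
    """Return one CLS embedding per amino-acid sequence (2D tensor)."""
    # ProtBert expects space-separated residues, with rare residues mapped to X.
    spaced = [" ".join(re.sub(r"[UZOB]", "X", seq)) for seq in sequences]
    batch = tokenizer(spaced, padding=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0, :]  # CLS token embedding per sequence

embeddings = embed_sequences(["MKTAYIAKQR", "GSHMSLF"])
print(embeddings.shape)  # (2, 1024) for prot_bert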

Add configuration file for a dataset

Configuration file defining the dataset and embeddings files, as well as the input/target variable names (added as attributes).

  • Also add an attribute for the case where there is only one input sequence, and use it as the default in embeddings (a hypothetical example follows below).
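
A hypothetical example of what such a configuration file could contain, loaded here from an inline string with PyYAML for illustration. The keys, file names, and the `default_input` attribute are illustrative assumptions, not a fixed spec:

import yaml

CONFIG = """
name: netmhcpan-4.1-train
dataset_file: netmhcpan-4.1-train.csv
embeddings_files:
  peptide: peptide_protbert_cls.npy
inputs: [peptide, mhc_allele]
targets: [binding_affinity]
default_input: peptide   # used when there is only one input sequence
"""

config = yaml.safe_load(CONFIG)
print(config["inputs"], "->", config["targets"])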

Implement first POC

Simple implementation:

from pathlib import Path

# Local cache directory; the exact location is an implementation detail.
CACHE_DIR = Path.home() / ".cache" / "bio-datasets"


class Dataset:
    def __init__(self, name: str):
        # 1. Check if the dataset is available locally (in the cache).
        # 2. Otherwise fetch it from GS and store it in the cache.
        self.name = name
        self.path = CACHE_DIR / name
        if not self.path.exists():
            self._fetch_from_gs()

    def _fetch_from_gs(self):
        """Fetch the dataset from GS and store it in cache for later use."""
        raise NotImplementedError

    def to_npy_arrays(self, inputs, targets):
        # If 1 input: X is a single numpy array.
        # If several inputs: X is a list of numpy arrays.
        # This format is usable by tf.keras.Model.fit(...),
        # cf. https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
        raise NotImplementedError

    def get_embeddings(self, column):
        """Return a 2D numpy array with the CLS ProtBert embeddings of each sequence."""
        raise NotImplementedError

    def to_torch_data_loader(self):
        raise NotImplementedError
  • Possibility to cache the dataset
  • load_dataset() function which instantiates this Dataset class (sketched below)
  • Progress bar when downloading from GCP

Initially we only support .csv datasets, with aligned .npy files for the embeddings of a sequence column.
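
A minimal sketch of the load_dataset() entry point and the GCP download with a progress bar, assuming the Dataset class above plus the google-cloud-storage and tqdm packages; the bucket and blob names are hypothetical, and Blob.open requires a reasonably recent google-cloud-storage:

from pathlib import Path

from google.cloud import storage
from tqdm import tqdm

def download_from_gs(bucket_name: str, blob_name: str, dest: Path) -> None:
    """Stream a blob to disk in chunks so tqdm can report progress."""
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.reload()  # populates blob.size, used as the progress bar total
    dest.parent.mkdir(parents=True, exist_ok=True)
    with blob.open("rb") as src, dest.open("wb") as dst, tqdm(
        total=blob.size, unit="B", unit_scale=True, desc=blob_name
    ) as bar:
        for chunk in iter(lambda: src.read(1024 * 1024), b""):
            dst.write(chunk)
            bar.update(len(chunk))

def load_dataset(name: str) -> Dataset:
    """Convenience entry point: return the cached (or freshly fetched) dataset."""
    return Dataset(name)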

Update the dataset workflow with new structure/format

The main idea (to be confirmed) is that the user follows this process:

  • The user adds raw data files (csv + npy for the embeddings; to be extended to other formats as well)
  • The user defines a schema for the variable types
  • The library converts the raw data files into a format used to load the data in memory
  • The dataset instance can return a native tf.data.Dataset or torch.utils.data.DataLoader in order to train models with this dataset

For the last point, there are mainly two options (a plain in-memory sketch follows this list):

  • convert the dataset (csv with npy files) to HDF5, then use Apache Arrow or vaex to load it in memory
  • or, if we want native tf/torch tensors in the end: convert the datasets into Parquet and then use petastorm
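
Whatever storage format is chosen, the end result of the last workflow step can be sketched with in-memory NumPy arrays and the native tf.data / torch.utils.data APIs; the array shapes here are arbitrary placeholders:

import numpy as np
import tensorflow as tf
import torch
from torch.utils.data import DataLoader, TensorDataset

X = np.random.rand(100, 1024).astype("float32")  # e.g. sequence embeddings
y = np.random.randint(0, 2, size=100)

# Native TensorFlow input pipeline.
tf_ds = tf.data.Dataset.from_tensor_slices((X, y)).shuffle(100).batch(32)

# Native PyTorch data loader.
torch_dl = DataLoader(
    TensorDataset(torch.from_numpy(X), torch.from_numpy(y)),
    batch_size=32,
    shuffle=True,
)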

A brainstorm has been done in a Notion doc. The next step is to properly investigate the different options.

Other points:

  • TensorFlow and PyTorch should not be hard dependencies of the project; we need to list them in environment.yaml rather than requirements.txt
  • The user needs to be able to use the biodatasets package with either PyTorch or TensorFlow installed, so we need to catch import errors in both to_torch_dataset() and to_tf_dataset() and display a message saying that the corresponding library must be installed when the user calls one of these functions (see the sketch below).
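
A minimal sketch of that import handling, assuming to_tf_dataset()/to_torch_dataset() methods on the Dataset class, the to_npy_arrays() helper from the POC above, and inputs/targets attributes coming from the dataset configuration:

class Dataset:
    # ... rest of the class as in the POC above ...

    def to_tf_dataset(self):
        try:
            import tensorflow as tf
        except ImportError as err:
            raise ImportError(
                "to_tf_dataset() requires TensorFlow; install it or use to_torch_dataset()."
            ) from err
        X, y = self.to_npy_arrays(self.inputs, self.targets)
        return tf.data.Dataset.from_tensor_slices((X, y))

    def to_torch_dataset(self):
        try:
            import torch
            from torch.utils.data import TensorDataset
        except ImportError as err:
            raise ImportError(
                "to_torch_dataset() requires PyTorch; install it or use to_tf_dataset()."
            ) from err
        X, y = self.to_npy_arrays(self.inputs, self.targets)
        return TensorDataset(torch.from_numpy(X), torch.from_numpy(y))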

Add issue templates

An issue template is a good way to define the structure of the issue based on the type: bug, feature request, documentation, ...

How to store unusual labels / Y values

I've successfully uploaded a dataset (a subset of PDB), but it has unusual labels in that they are matrices. Storing matrices/ndarrays/sparse arrays as a column in a .csv is not ideal: if you're writing to and reading from these files with pandas, you quickly end up with issues where \t and \n characters mess up the parsing. For now I have uploaded a separate pickle file with a dictionary of my labels, but it is probably something the team should consider if you want full datasets to be available in a single file.

Perhaps we could consider whether there is some way to automatically pull separate label files when calling a dataset. This would make no difference to the end user, as we could hide some computation behind the API. Let me know your thoughts 😃.
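
One possible alternative to a separate pickle, sketched here: store the per-example matrix labels in a single compressed .npz archive keyed by example id (the file and key names are hypothetical; scipy.sparse.save_npz would cover the sparse-array case):

import numpy as np

labels = {
    "1abc": np.random.rand(64, 64),    # e.g. a residue-residue distance matrix
    "2xyz": np.random.rand(128, 128),
}

# Write: one named array per example, in a single file next to the .csv.
np.savez_compressed("pdb_subset_labels.npz", **labels)

# Read: NpzFile gives lazy, per-key access when the dataset is loaded.
archive = np.load("pdb_subset_labels.npz")
y = archive["1abc"]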
