
bio-datasets's Issues

Implement first POC

Simple implementation:

class Dataset:
    def __init__(self, name: str):
        self.name = name
        # 1. Check if the dataset is available in the local cache.
        # 2. Otherwise, fetch it from GS (Google Cloud Storage) into the cache.
        if not is_available_in_cache(name):
            self._fetch_from_gs()
        self.path = cache_path(name)

    def _fetch_from_gs(self):
        """Fetch the dataset from GS and store it in the cache for later use."""
        pass

    def to_npy_arrays(self, inputs, targets):
        # If 1 input: X is a numpy array.
        # If several inputs: X is a list of numpy arrays.
        # This format is directly usable by tf.keras.Model.fit(...),
        # cf. https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
        return X, y

    def get_embeddings(self, column):
        """Return a 2D numpy array with the CLS ProtBert embedding of each sequence in `column`."""
        return X

    def to_torch_data_loader(self):
        # Return a torch.utils.data.DataLoader wrapping the dataset.
        return data_loader
  • Possibility to cache the dataset
  • A load_dataset() function that instantiates this class
  • Progress bar when downloading from GCP

At first, we only support .csv datasets with aligned .npy files for the embeddings of a sequence column.
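
A minimal sketch of what "aligned" means here, using hypothetical file names (sequences.csv, sequences_embeddings.npy): one csv row per sequence, and a .npy array whose first axis follows the csv row order.

```python
import os
import tempfile

import numpy as np
import pandas as pd

tmp = tempfile.mkdtemp()
csv_path = os.path.join(tmp, "sequences.csv")
npy_path = os.path.join(tmp, "sequences_embeddings.npy")

# Hypothetical raw data: row i of the csv corresponds to embeddings[i].
pd.DataFrame({"sequence": ["MKT", "GAV"], "label": [0, 1]}).to_csv(csv_path, index=False)
np.save(npy_path, np.random.rand(2, 1024).astype(np.float32))  # 2 sequences, 1024-dim embeddings

df = pd.read_csv(csv_path)
embeddings = np.load(npy_path)
assert len(df) == embeddings.shape[0], "csv and npy files must stay row-aligned"
```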

Update the dataset workflow with new structure/format

The main idea (still to be confirmed) is the following user-facing process:

  • The user adds raw data files (csv + npy for the embeddings; to be extended to other formats as well)
  • The user defines a schema for the variable types
  • The library converts the raw data files into an intermediate format used to load the data in memory
  • The dataset instance can return a native tf.data.Dataset or torch.utils.data.DataLoader in order to train models with this dataset

For the last point, there are mainly two options:

  • convert the dataset (csv with npy files) to HDF5, then use Apache Arrow or vaex to load it in memory
  • or, if we want native tf/torch tensors in the end: convert the datasets into Parquet and then use petastorm

Brainstorming has been done in a Notion doc. The next step is to investigate the different options properly.

Other points:

  • TensorFlow and PyTorch should not be hard dependencies of the project; we should list them in environment.yaml rather than requirements.txt.
  • The user needs to be able to use the biodatasets package with either PyTorch or TensorFlow installed, so we need to catch import errors in both to_torch_dataset() and to_tf_dataset() and display a message explaining that one of these libraries must be installed when the user calls either function.

Manage first datasets

  • Add open-source NetMHCpan-4.1 dataset

    • Compute contextual sequence embeddings with either ESM or ProtBert and add them to the netmhcpan-4.1-train dataset
  • Make the swissProt dataset usable with the API.

  • Make pathogen dataset usable with the API.

Add configuration file for a dataset

Configuration file to define the dataset and embeddings files, as well as the inputs/targets variable names (add them as attributes).

  • Also add an attribute for the case where there is only one input sequence, and use it as the default column in get_embeddings().
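
A sketch of what such a configuration file could look like; the YAML layout and every field name below are assumptions for illustration, not a settled schema:

```yaml
# dataset.yaml -- hypothetical schema
name: netmhcpan-4.1-train
files:
  data: netmhcpan_train.csv
  embeddings:
    peptide: netmhcpan_train_peptide_embeddings.npy
inputs: [peptide, mhc_allele]
targets: [binding_affinity]
# Single input sequence, used as the default column in get_embeddings()
default_sequence_column: peptide
```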

Add issue templates

An issue template is a good way to define the structure of an issue based on its type: bug, feature request, documentation, etc.

How to store unusual labels / Y values

I've successfully uploaded a dataset (a subset of the PDB), but it has unusual labels in that they are matrices. Storing matrices/ndarrays/sparse arrays as a column in a .csv is not ideal: if you're writing to and reading from these files with pandas, you quickly end up with issues where \t and \n characters mess up the parsing. For now I have just uploaded a separate pickle file with a dictionary of my labels, but it is probably something the team should consider if you want full datasets available in a single file.

Perhaps we could consider whether there is some way to automatically pull separate label files when calling a dataset. This would make no difference to the end user, as we could hide that computation behind the API. Let me know your thoughts 😃.
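
As one possible interim step (file names and ids below are made up for illustration), matrix labels round-trip safely through a compressed .npz archive keyed by example id, avoiding both csv escaping issues and pickle:

```python
import os
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()
labels_path = os.path.join(tmp, "labels.npz")

# Hypothetical matrix labels keyed by PDB id.
labels = {"1abc": np.eye(3), "2xyz": np.ones((2, 4))}
np.savez_compressed(labels_path, **labels)

# Loading gives a dict-like archive; no \t or \n parsing problems.
loaded = np.load(labels_path)
assert set(loaded.files) == {"1abc", "2xyz"}
```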
