
bio-datasets's People

Contributors

adsodemelk, delfosseaurelien, hayfabm, martinp7, theomeb


bio-datasets's Issues

Manage first datasets

  • Add open-source NetMHCpan-4.1 dataset

    • Compute contextual sequence embeddings with either ESM or ProtBert and add them to the netmhcpan-4.1-train dataset (a sketch follows this list)
  • Make the SwissProt dataset usable with the API.

  • Make the pathogen dataset usable with the API.
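
The embedding computation in the first sub-task could look like the following minimal sketch, using the public Rostlab/prot_bert checkpoint from HuggingFace transformers (ESM would work similarly). The CLS-token pooling matches the get_embeddings() docstring in the POC below, but the batching and preprocessing details here are assumptions:

import re

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

def embed_sequences(sequences):
    """Return one CLS embedding per amino-acid sequence (2D tensor)."""
    # ProtBert expects space-separated residues, with rare residues mapped to X.
    spaced = [" ".join(re.sub(r"[UZOB]", "X", seq)) for seq in sequences]
    batch = tokenizer(spaced, padding=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0, :]  # CLS token embedding per sequence

embeddings = embed_sequences(["MKTAYIAKQR", "GSHMSLF"])
print(embeddings.shape)  # (2, 1024) for prot_bert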

Add configuration file for a dataset

Configuration file defining the dataset and embeddings files, as well as the input/target variable names (added as attributes).

  • Also add an attribute for the case where there is only one input sequence, and use it as the default in embeddings (a hypothetical example follows below).
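
A hypothetical example of what such a configuration file could contain, loaded here from an inline string with PyYAML for illustration. The keys, file names, and the `default_input` attribute are illustrative assumptions, not a fixed spec:

import yaml

CONFIG = """
name: netmhcpan-4.1-train
dataset_file: netmhcpan-4.1-train.csv
embeddings_files:
  peptide: peptide_protbert_cls.npy
inputs: [peptide, mhc_allele]
targets: [binding_affinity]
default_input: peptide   # used when there is only one input sequence
"""

config = yaml.safe_load(CONFIG)
print(config["inputs"], "->", config["targets"])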

Implement first POC

Simple implementation:

from pathlib import Path

# Local cache directory; the exact location is an implementation detail.
CACHE_DIR = Path.home() / ".cache" / "bio-datasets"


class Dataset:
    def __init__(self, name: str):
        # 1. Check if the dataset is available locally (in the cache).
        # 2. Otherwise fetch it from GS and store it in the cache.
        self.name = name
        self.path = CACHE_DIR / name
        if not self.path.exists():
            self._fetch_from_gs()

    def _fetch_from_gs(self):
        """Fetch the dataset from GS and store it in cache for later use."""
        raise NotImplementedError

    def to_npy_arrays(self, inputs, targets):
        # If 1 input: X is a single numpy array.
        # If several inputs: X is a list of numpy arrays.
        # This format is usable by tf.keras.Model.fit(...),
        # cf. https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
        raise NotImplementedError

    def get_embeddings(self, column):
        """Return a 2D numpy array with the CLS ProtBert embeddings of each sequence."""
        raise NotImplementedError

    def to_torch_data_loader(self):
        raise NotImplementedError
  • Possibility to cache the dataset
  • load_dataset() function which instantiates this Dataset class (sketched below)
  • Progress bar when downloading from GCP

Initially we only support .csv datasets, with aligned .npy files for the embeddings of a sequence column.
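
A minimal sketch of the load_dataset() entry point and the GCP download with a progress bar, assuming the Dataset class above plus the google-cloud-storage and tqdm packages; the bucket and blob names are hypothetical, and Blob.open requires a reasonably recent google-cloud-storage:

from pathlib import Path

from google.cloud import storage
from tqdm import tqdm

def download_from_gs(bucket_name: str, blob_name: str, dest: Path) -> None:
    """Stream a blob to disk in chunks so tqdm can report progress."""
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.reload()  # populates blob.size, used as the progress bar total
    dest.parent.mkdir(parents=True, exist_ok=True)
    with blob.open("rb") as src, dest.open("wb") as dst, tqdm(
        total=blob.size, unit="B", unit_scale=True, desc=blob_name
    ) as bar:
        for chunk in iter(lambda: src.read(1024 * 1024), b""):
            dst.write(chunk)
            bar.update(len(chunk))

def load_dataset(name: str) -> Dataset:
    """Convenience entry point: return the cached (or freshly fetched) dataset."""
    return Dataset(name)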

Update the dataset workflow with new structure/format

The main idea (to be confirmed) is that the user follows this process:

  • The user adds raw data files (csv + npy for the embeddings; to be extended to other formats as well)
  • The user defines a schema for the variable types
  • The library converts the raw data files into a format used to load the data in memory
  • The dataset instance can return a native tf.data.Dataset or torch.utils.data.DataLoader in order to train models with this dataset

For the last point, there are mainly two options (a plain in-memory sketch follows this list):

  • convert the dataset (csv with npy files) to HDF5, then use Apache Arrow or vaex to load it in memory
  • or, if we want native tf/torch tensors in the end: convert the datasets into Parquet and then use petastorm
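
Whatever storage format is chosen, the end result of the last workflow step can be sketched with in-memory NumPy arrays and the native tf.data / torch.utils.data APIs; the array shapes here are arbitrary placeholders:

import numpy as np
import tensorflow as tf
import torch
from torch.utils.data import DataLoader, TensorDataset

X = np.random.rand(100, 1024).astype("float32")  # e.g. sequence embeddings
y = np.random.randint(0, 2, size=100)

# Native TensorFlow input pipeline.
tf_ds = tf.data.Dataset.from_tensor_slices((X, y)).shuffle(100).batch(32)

# Native PyTorch data loader.
torch_dl = DataLoader(
    TensorDataset(torch.from_numpy(X), torch.from_numpy(y)),
    batch_size=32,
    shuffle=True,
)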

A brainstorm has been done in a Notion doc. The next step is to properly investigate the different options.

Other points:

  • TensorFlow and PyTorch should not be hard dependencies of the project; we need to list them in environment.yaml rather than requirements.txt
  • The user needs to be able to use the biodatasets package with either PyTorch or TensorFlow installed, so we need to catch import errors in both to_torch_dataset() and to_tf_dataset() and display a message saying that the corresponding library must be installed when the user calls one of these functions (see the sketch below).
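
A minimal sketch of that import handling, assuming to_tf_dataset()/to_torch_dataset() methods on the Dataset class, the to_npy_arrays() helper from the POC above, and inputs/targets attributes coming from the dataset configuration:

class Dataset:
    # ... rest of the class as in the POC above ...

    def to_tf_dataset(self):
        try:
            import tensorflow as tf
        except ImportError as err:
            raise ImportError(
                "to_tf_dataset() requires TensorFlow; install it or use to_torch_dataset()."
            ) from err
        X, y = self.to_npy_arrays(self.inputs, self.targets)
        return tf.data.Dataset.from_tensor_slices((X, y))

    def to_torch_dataset(self):
        try:
            import torch
            from torch.utils.data import TensorDataset
        except ImportError as err:
            raise ImportError(
                "to_torch_dataset() requires PyTorch; install it or use to_tf_dataset()."
            ) from err
        X, y = self.to_npy_arrays(self.inputs, self.targets)
        return TensorDataset(torch.from_numpy(X), torch.from_numpy(y))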

Add issue templates

An issue template is a good way to define the structure of the issue based on the type: bug, feature request, documentation, ...

How to store unusual labels / Y values

I've successfully uploaded a dataset (a subset of PDB), but it has unusual labels in that they are matrices. Storing matrices/ndarrays/sparse arrays as a column in a .csv is not ideal: if you're writing to and reading from these files with pandas, you quickly end up with issues where \t and \n characters mess up the parsing. For now I have uploaded a separate pickle file with a dictionary of my labels, but it is probably something the team should consider if you want full datasets to be available in a single file.

Perhaps we could consider whether there is some way to automatically pull separate label files when calling a dataset. This would make no difference to the end user, as we could hide some computation behind the API. Let me know your thoughts 😃.
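
One possible alternative to a separate pickle, sketched here: store the per-example matrix labels in a single compressed .npz archive keyed by example id (the file and key names are hypothetical; scipy.sparse.save_npz would cover the sparse-array case):

import numpy as np

labels = {
    "1abc": np.random.rand(64, 64),    # e.g. a residue-residue distance matrix
    "2xyz": np.random.rand(128, 128),
}

# Write: one named array per example, in a single file next to the .csv.
np.savez_compressed("pdb_subset_labels.npz", **labels)

# Read: NpzFile gives lazy, per-key access when the dataset is loaded.
archive = np.load("pdb_subset_labels.npz")
y = archive["1abc"]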
