deepchainbio / bio-datasets
Free collection of Bio datasets and embeddings
License: Apache License 2.0
Simple implementation:

class Dataset:
    def __init__(self, name: str):
        # 1. Check if the dataset is available locally.
        # 2. Otherwise, fetch it from GS and store it in the cache.
        if not is_available_in_cache(name):
            self._fetch_from_gs()
        self.path = cache_path

    def _fetch_from_gs(self):
        """Fetch the dataset from GS and store it in cache for later use."""
        pass

    def to_npy_arrays(self, inputs, targets):
        # If there is a single input, X is a numpy array.
        # If there are several inputs, X is a list of numpy arrays.
        # This format is directly usable by tf.keras.Model.fit(...),
        # cf. https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
        return X, y

    def get_embeddings(self, column):
        """Return a 2D numpy array with the CLS ProtBert embedding of each sequence."""
        return X

    def to_torch_data_loader(self):
        return Data
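Hypothetical usage of this sketch (the dataset name comes from the issues below; the column names are illustrative and may not match the final API):

dataset = Dataset("netmhcpan-4.1-train")
X, y = dataset.to_npy_arrays(inputs=["sequence"], targets=["label"])
embeddings = dataset.get_embeddings("sequence")
loader = dataset.to_torch_data_loader()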
We first only support .csv datasets with aligned .npy files for the embeddings of a sequence column.
Get inspiration from this: googleapis/python-storage#27 (comment).
We may need to use requests instead of download_to_filename, or to call gsutil via a subprocess call?
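A minimal sketch of the requests-based option, assuming the datasets live in a publicly readable GCS bucket (the bucket and blob names below are placeholders, not the real ones):

import requests

def fetch_from_gs(bucket: str, blob_name: str, destination: str) -> None:
    """Stream a public GCS object to a local file without google-cloud-storage."""
    url = f"https://storage.googleapis.com/{bucket}/{blob_name}"
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(destination, "wb") as f:
            for chunk in response.iter_content(chunk_size=1 << 20):
                f.write(chunk)

fetch_from_gs("bio-datasets", "netmhcpan-4.1-train/dataset.csv", "dataset.csv")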
It seems like the pip installation instruction is pip install bio-datasets, but the import uses import biodatasets. We might want to clarify this in the README.md for new users.
The main idea (to be confirmed though) is to give the user the following process: end up with a tf.data.Dataset or a torch DataLoader in order to train models with this dataset.

For the last point, there are mainly two options:
- convert the dataset (csv with npy files) to hdf5, and then use Apache Arrow or vaex to load it in memory;
- to get tf/torch tensors in the end, convert the datasets into Parquet and then use petastorm.

Brainstorming has been done in a Notion doc. The next step is to investigate the different options properly.
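A hedged sketch of the starting point for the second option: packing a csv dataset and its aligned .npy embeddings into a single Parquet file via pandas/pyarrow (the file and column names are assumptions):

import numpy as np
import pandas as pd

df = pd.read_csv("dataset.csv")
embeddings = np.load("sequence_embeddings.npy")  # shape: (n_rows, embedding_dim)

# Store each row's embedding as a list column so pyarrow can serialize it.
df["sequence_embedding"] = embeddings.tolist()
df.to_parquet("dataset.parquet")  # requires pyarrow (or fastparquet)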
Other points:
- Use environment.yaml rather than requirements.txt.
- Users may have the biodatasets package with either PyTorch or TF installed, so we need to manage import errors in both to_torch_dataset() and to_tf_dataset(), and catch them to display that one of these libraries should be installed if the user tries to call one of these functions (see the sketch after this list).
- In order to be able to load our data with to_npy_array in memory
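A minimal sketch of the optional-dependency handling mentioned above, assuming to_torch_dataset() wraps the arrays returned by to_npy_arrays (the method body is illustrative, not the actual implementation):

def to_torch_dataset(self):
    try:
        import torch
        from torch.utils.data import TensorDataset
    except ImportError as err:
        raise ImportError(
            "PyTorch is required for to_torch_dataset(); "
            "install it with `pip install torch` (or use to_tf_dataset())."
        ) from err
    X, y = self.to_npy_arrays(self.inputs, self.targets)
    return TensorDataset(torch.as_tensor(X), torch.as_tensor(y))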
We should clarify the structure of the description.md file for a dataset.
Given that structure, we would have different functions (e.g. display_description(), display_summary(), etc.) that display different parts of the dataset description.
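A hedged sketch of how one of these functions could read a single section of description.md, assuming the file is split by level-2 markdown headings (the heading names are assumptions):

from pathlib import Path

def display_description(dataset_path: str, section: str = "Description") -> None:
    text = Path(dataset_path, "description.md").read_text()
    for block in text.split("\n## "):
        if block.lstrip("# ").startswith(section):
            print(block.strip())
            return
    print(f"No '{section}' section found in description.md")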
Add the open-source NetMHCpan-4.1 (netmhcpan-4.1-train) dataset.
Make the swissProt dataset usable with the API.
Make the pathogen dataset usable with the API.
Add a configuration file to define the dataset and embeddings files, as well as the inputs/targets variable names (add them as attributes); a sketch of what it could contain follows below.
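A hedged example of what such a configuration file could look like and how it could be attached to the Dataset object as attributes (the field names are assumptions, not an agreed format):

import yaml  # PyYAML

CONFIG_EXAMPLE = """
dataset_file: dataset.csv
embeddings_files:
  sequence: sequence_embeddings.npy
inputs: [sequence]
targets: [label]
"""

config = yaml.safe_load(CONFIG_EXAMPLE)
# e.g. set dataset.inputs = config["inputs"] and dataset.targets = config["targets"]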
An issue template is a good way to define the structure of an issue based on its type: bug, feature request, documentation, ...
I've successfully uploaded a dataset (a subset of PDB), but it has unusual labels in that they are matrices. Storing matrices/ndarrays/sparse arrays as a column in a .csv is not ideal: if you're writing to and reading from these files with pandas, you quickly end up with issues where \t and \n characters mess up the parsing. I have just uploaded a separate pickle file with a dictionary of my labels, but it is probably something the team should consider if you want the full datasets available in a single file.
Perhaps we could consider whether there is some way to automate pulling separate label files when calling a dataset. This would make no difference to the end user, as we could hide some computation behind the API. Let me know your thoughts.
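A minimal sketch of the workaround described above: keeping the matrix labels in a separate pickle keyed by example id and joining them at load time (the file and column names are illustrative, not the actual upload's):

import pickle
import pandas as pd

df = pd.read_csv("pdb_subset.csv")
with open("pdb_subset_labels.pkl", "rb") as f:
    labels = pickle.load(f)  # dict: example id -> label matrix (ndarray)

# Align the matrix labels with the csv rows.
y = [labels[example_id] for example_id in df["id"]]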