biofam / cellij

Implementation of a Modular Multi-Omics Factor Model Framework

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

cellij's Introduction

Multi-Omics Factor Analysis

MOFA is a factor analysis model that provides a general framework for the integration of multi-omic data sets in an unsupervised fashion.

Please visit our website for installation instructions, tutorials, and much more!

cellij's Issues

Port MSc thesis repo to official cellij repo

The repo got quite dirty during the final push for the MSc thesis, so I have to partially clean it up before we can push it over.

Fix GH actions
Create a release tag for the MSc thesis state
Port it over as a PR to main
Replace all occurrences of mfmf with cellij

Define data/model/training options

As discussed in Slack

2023-03-31: ToDos

Prioritized

Switch get_w/get_z to pull from pyro param storage @timtreis
Provide save and load functionality @timtreis
Fix CI @timtreis
Skip missings in obs during inference @timtreis
na_strategy: add impute with means @timtreis
Change Black formatter to only work on merge @timtreis
Logging
Add OrderedDict for moments in mofa model @martinrohbeck
Log number of missings when new dataset is added
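
The two data-handling items above (mean imputation as an na_strategy, and logging the number of missing values) could look roughly like the following. This is a minimal sketch on plain lists of lists rather than on the MuData object; the function names and the logger name are hypothetical, not cellij's API.

```python
import logging
import math

logger = logging.getLogger("cellij_sketch")  # hypothetical logger name


def count_missing(matrix):
    """Return the number of NaN entries in a samples-by-features list of lists."""
    return sum(math.isnan(value) for row in matrix for value in row)


def impute_with_means(matrix):
    """Replace each NaN with the mean of the observed values in its feature column."""
    logger.info("Found %d missing values", count_missing(matrix))

    n_features = len(matrix[0])
    means = []
    for j in range(n_features):
        observed = [row[j] for row in matrix if not math.isnan(row[j])]
        # A fully missing feature has no mean; fall back to 0.0 (an arbitrary choice).
        means.append(sum(observed) / len(observed) if observed else 0.0)

    return [
        [means[j] if math.isnan(value) else value for j, value in enumerate(row)]
        for row in matrix
    ]
```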

Done

EarlyStopping for training models @martinrohbeck
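
For reference, a patience-based early-stopping rule can be sketched as below. This is not the implementation that was merged; the class layout and the parameter names patience and min_delta are assumptions.

```python
class EarlyStopping:
    """Signal a stop when the loss has not improved for `patience` consecutive checks."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_steps = 0

    def step(self, loss):
        """Record a new loss value and return True if training should stop."""
        if loss < self.best_loss - self.min_delta:
            # Improvement: remember the new best loss and reset the counter.
            self.best_loss = loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.patience
```

In a training loop this would be called once per epoch (or every few epochs) with the current loss, breaking out of the loop when step() returns True.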

Unclear

scale all modalities according to features and likelihoods?

Fix inheritance problem with cascading `PyroModule`s

Guide needs to execute Generative once to get the site names and shapes for the initialization. However, storing the Generative model object as a submodule leads to some inheritance error.
See: https://github.com/pyro-ppl/pyro/blob/dev/pyro/infer/autoguide/guides.py#L69

Prevent random shuffling of rows when constructing combined dataset
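
One possible cause of shuffled rows (an assumption, the issue does not state it) is building the union of observation names through an unordered set. A sketch of an order-preserving union, with a hypothetical function name:

```python
def merged_obs_names(obs_names_per_modality):
    """Union of observation names, kept in order of first appearance."""
    merged = {}
    for names in obs_names_per_modality:
        for name in names:
            merged.setdefault(name, None)  # dicts preserve insertion order
    return list(merged)
```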

Implement various Sparsity and Shrinkage Priors

Title says it all.

Implement script to perform basic benchmark

Write an argparse or click based python script to generate synthetic data, perform a training instance, and save the output.
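
A rough outline of such a script, using argparse and only the standard library. The flag names, the defaults, and the simple Gaussian data-generating process are illustrative assumptions, and the training step is left as a placeholder since it depends on the final model API.

```python
import argparse
import json
import random


def generate_data(n_samples, n_features, n_factors, sparsity, noise_sd, seed):
    """Simulate Y = Z W + noise with a fraction `sparsity` of the loadings set to zero."""
    rng = random.Random(seed)
    z = [[rng.gauss(0, 1) for _ in range(n_factors)] for _ in range(n_samples)]
    w = [
        [0.0 if rng.random() < sparsity else rng.gauss(0, 1) for _ in range(n_features)]
        for _ in range(n_factors)
    ]
    y = [
        [
            sum(z[i][k] * w[k][j] for k in range(n_factors)) + rng.gauss(0, noise_sd)
            for j in range(n_features)
        ]
        for i in range(n_samples)
    ]
    return z, w, y


def main(argv=None):
    parser = argparse.ArgumentParser(description="Synthetic benchmark sketch")
    parser.add_argument("--n-samples", type=int, default=20)
    parser.add_argument("--n-features", type=int, default=10)
    parser.add_argument("--n-factors", type=int, default=3)
    parser.add_argument("--sparsity", type=float, default=0.5)
    parser.add_argument("--noise-sd", type=float, default=0.1)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--out", default=None, help="optional path for a JSON summary")
    args = parser.parse_args(argv)

    z, w, y = generate_data(
        args.n_samples, args.n_features, args.n_factors,
        args.sparsity, args.noise_sd, args.seed,
    )
    # Placeholder: a real script would fit the factor model on `y` here.
    summary = {
        "n_samples": len(y),
        "n_features": len(y[0]),
        "n_factors": len(w),
        "zero_loadings": sum(value == 0.0 for row in w for value in row),
    }
    if args.out is not None:
        with open(args.out, "w") as handle:
            json.dump(summary, handle)
    return summary


if __name__ == "__main__":
    print(main())
```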

Merge only if >1 modalities are present

If the MuData object contains only 1 modality we don't need to merge the metadata

cellij/cellij/core/factormodel.py

Line 562 in 5844535

for modality_name, anndata_object in data.mod.items():

Implement standalone prior distributions

Lasso @martinrohbeck
Horseshoe @arberqoku

Notes:
Resources:

main paper: https://proceedings.mlr.press/v5/carvalho09a/carvalho09a.pdf
pyro distributions: https://github.com/pyro-ppl/pyro/tree/dev/pyro/distributions
TF implementation: https://github.com/tensorflow/probability/blob/v0.19.0/tensorflow_probability/python/distributions/horseshoe.py#L39-L237
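
To make the target concrete: the horseshoe prior draws each weight from a normal distribution whose scale is the product of a global scale tau and a local half-Cauchy scale lambda_j. The sketch below samples from that hierarchy with the standard library only; it is not a Pyro distribution, and drawing tau itself from a half-Cauchy with scale global_scale is an assumption made for illustration.

```python
import math
import random


def half_cauchy(rng, scale=1.0):
    """Draw from a half-Cauchy distribution via its inverse CDF."""
    return scale * math.tan(math.pi * rng.random() / 2)


def sample_horseshoe(n_weights, global_scale=1.0, seed=0):
    """Draw weights w_j ~ Normal(0, (tau * lambda_j)^2) with half-Cauchy local scales."""
    rng = random.Random(seed)
    tau = half_cauchy(rng, global_scale)  # global shrinkage, shared by all weights
    weights = []
    for _ in range(n_weights):
        lam = half_cauchy(rng)  # local shrinkage, one per weight
        weights.append(rng.gauss(0.0, tau * lam))
    return weights
```

The heavy tails of the local scales let a few weights escape shrinkage while most are pulled close to zero, which is the behaviour a standalone distribution would need to reproduce.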

Test and Integrate Optimized Einsum

We should give this optimized einsum for CUDA a shot: https://pypi.org/project/opt-einsum-torch/

Users will very likely bring large matrices, so this could improve performance; without something like it we might even run into memory issues.

Merge tensors that have same likelihood and sparsity during training

Recreate original object incl. data after loading model

Currently we are stripping the ._data attribute of the FA model before saving. I think that makes sense, to save memory.
However, when reloading the model, we should be able to re-add the data so we can still run downstream analysis that uses properties from self._data. Otherwise, we have to make sure that the important downstream functionality works w/o self._data.
This currently does not work, because add_data() throws an error. @ data expert @timtreis, maybe you can have a look at this?

Feel free to use the notebook in the current PR from branch feature/issue-x/example-notebook as an example. I already added the cell, but commented it out.
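
The intended round trip could look roughly like this. FactorModelSketch, save() and load() are hypothetical stand-ins rather than the cellij classes; the sketch only shows the data being dropped on save and optionally re-attached on load.

```python
import pickle


class FactorModelSketch:
    """Hypothetical stand-in for the FA model; only the save/load behaviour matters here."""

    def __init__(self, n_factors):
        self.n_factors = n_factors
        self._data = None

    def add_data(self, data):
        self._data = data


def save(model, path):
    """Pickle every attribute except the (potentially large) `_data`."""
    state = {key: value for key, value in vars(model).items() if key != "_data"}
    with open(path, "wb") as handle:
        pickle.dump(state, handle)


def load(path, data=None):
    """Rebuild the model and optionally re-attach data for downstream analysis."""
    with open(path, "rb") as handle:
        state = pickle.load(handle)
    model = FactorModelSketch(state["n_factors"])
    vars(model).update(state)
    if data is not None:
        model.add_data(data)
    return model
```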

Implement GPs

Going to keep some notes here for reference

Notes

https://docs.gpytorch.ai/en/stable/examples/07_Pyro_Integration/Pyro_GPyTorch_Low_Level.html#Overview
High level interface
- Base class is gpytorch.models.PyroGP
- GPyTorch automatically defines the model and guide functions for Pyro
- Best used when prediction is the primary goal
Low level interface
- Base class is gpytorch.models.ApproximateGP
- User defines the model and guide functions for Pyro
- Best used when inference is the primary goal

Make GPs part of the cellij model, both 1D and 2D

Fix Data related Warning

Running

# Afterwards, we need to add the data
model.add_data(data=mdata)

raises
/home/m015k/code/cellij/cellij/core/_factormodel.py:251: FutureWarning: Passing 'suffixes' which cause duplicate columns {'T6_x', 'T5_x', 'treatedAfter_x', 'Gender_x', 'IGHV_x', 'Diagnosis_x', 'ConsClust_x', 'died_x', 'IC50beforeTreatment_x', 'Age4Main_x'} in the result is deprecated and will raise a MergeError in a future version.

coming from some suffixes. See code line

cellij/cellij/core/_factormodel.py

Line 251 in a0385d7

anndata_object.obs = anndata_object.obs.merge(

Implement the necessary data structures for training

Convert the preprocessed, clean MuData into a torch.utils.data.Dataset wrapped in a torch.utils.data.DataLoader to facilitate training, e.g. when introducing mini-batching for SVI. Keep in mind the sample- and feature-wise metadata stored in the .obs and .var fields.

Implement MOFA model with sparsity in the latent factor loadings

Implement a version of MOFA with structured sparsity in the factor loadings.

Make sparsity priors reproducible

Implement `CellijModel` class

The CellijModel class describes the generative model and hence implements Pyro's model() function.

Implement `CellijGuide` class

The CellijGuide class describes the variational distribution and hence implements Pyro's guide() function.

Add minibatching

Make use of DataLoaders to implement efficient minibatch training to handle larger datasets.
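
As a library-free illustration of what the data loader contributes, the sketch below yields minibatches of sample indices; in the real implementation this role would be played by torch.utils.data.DataLoader. The function name and parameters are hypothetical.

```python
import random


def iterate_minibatches(n_samples, batch_size, shuffle=True, seed=None):
    """Yield lists of sample indices that together cover range(n_samples) once."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        # The last batch may be smaller than batch_size.
        yield indices[start:start + batch_size]
```

With stochastic variational inference, the log-likelihood of each minibatch is typically rescaled by n_samples / batch_size so that the ELBO estimate stays unbiased.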

ToDos Deadline

Everybody:

Read draft, make suggestions, draw conclusions from Figures

Tim:

#68
Run GP on some test data as a POC
Fill in Table in Appendix with established methods
Make Plots Features (x-axis) vs Factor Norms (y-axis) for Non-negativity vs. HS & SNS & Laplace for non-negative DGP
1st plot: 2 UMAPs of latent space z (one colored by time, one colored by diff stage). Inference w/o any covariates
2nd plot: 2 UMAPs of latent space z (one colored by time, one colored by diff stage). Inference only with time in GP
3rd plot: 2 UMAPs of latent space z (one colored by time, one colored by diff stage). Inference with time and diff stage in GP
Gridsearch over lengthscales [0.001, 0.01, 0.1, 1, 2, 5, 10]
Predict only 1 or 2 factors from the GP, estimate other factors separately

Arber:

Refactor generative models with multiple plate combinations (SnS etc...)
Write subsection about DGP of synthetic data
Add benchmark results on sparsity and recon error as a table (prec, rec, f1, rmse)

Martin:

Make Plots Features (x-axis) vs R2 Reconstruction (y-axis) for different Sparsity Priors
CLL Data
Depending on GPU, repeat all plots but with samples/views/missings on y-axis
Make Heatmaps: Features x Samples vs Time until Convergence for different Priors

Not assigned (up for grabs):

add merged_obs_names + implications

https://github.com/bioFAM/cellij/blob/fae36150bac5aa1bae23687771619241f23e1ed7/cellij/core/_data.py#LL30C29-L30C29

Evaluation metrics for benchmarks

Factorwise correlation values
- Tests
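
A possible shape for the factor-wise correlation metric, as a sketch: because factor models are identifiable only up to sign and permutation, each true factor is compared against every inferred factor and the largest absolute Pearson correlation is kept. The function names are hypothetical, and this greedy matching does not enforce a one-to-one assignment.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def factorwise_correlation(true_factors, inferred_factors):
    """For each true factor, the largest absolute correlation with any inferred factor."""
    return [
        max(abs(pearson(true, inferred)) for inferred in inferred_factors)
        for true in true_factors
    ]
```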

Steal PCA/orthogonal initialization from MOFA

Create Example Notebook with Real Data

Perform initial benchmark evaluation

Focus on two main objectives: the reconstruction loss (RMSE, R2) and modeling structured sparsity (precision, recall, F1).
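
The metrics named above can be sketched on flat lists of values as follows. The threshold used to call an inferred loading "active" is an assumption that would need tuning per prior, and the function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and reconstructed values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def sparsity_scores(w_true, w_pred, threshold=0.1):
    """Precision, recall and F1 for recovering which loadings are active (nonzero)."""
    true_active = [abs(w) > 0 for w in w_true]
    pred_active = [abs(w) > threshold for w in w_pred]
    tp = sum(t and p for t, p in zip(true_active, pred_active))
    fp = sum(p and not t for t, p in zip(true_active, pred_active))
    fn = sum(t and not p for t, p in zip(true_active, pred_active))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```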

ToDo for First Code Release

Make Sparsity Prior Benchmark reproducible
Polish Code
get_w() and get_z() must work for all priors (probably related to #20)
GPs (#68)
MOFA+ (#44)

Setup Poetry dependency management

Add entry to Readme how to set the environment up/use it for new users

Migrate unittests to pytest

pytest requires less boilerplate and yields cleaner tests, among other features such as support for existing unittest test cases, tox, etc.
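
To illustrate the difference in boilerplate, here is the same check written in both styles. The function under test is a toy example, not part of cellij.

```python
import unittest


def scale(values, factor):
    """Toy function under test (hypothetical, not part of cellij)."""
    return [value * factor for value in values]


# unittest style: a class, inheritance and assert* methods.
class TestScale(unittest.TestCase):
    def test_scale(self):
        self.assertEqual(scale([1, 2], 2), [2, 4])


# pytest style: a plain function and a bare assert.
def test_scale():
    assert scale([1, 2], 2) == [2, 4]
```

pytest also collects unittest.TestCase subclasses, so existing tests can be migrated incrementally.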

Road to MVP

The following boxes should be checked for an MVP:

Each checkbox will have its own issue. This issue serves solely as an overview; feel free to edit it according to your thoughts.

Priority Task List

Priority High

Priority Medium

Implement and Experiment with Custom Guide class (#10)
Implement tests for benchmarking metrics
Add OrderedDict for moments in mofa model @martinrohbeck
Log number of missings when new dataset is added @timtreis

Priority Low

(unclear) scale all modalities according to features and likelihoods?

Basic use case:

model = FactorModel(n_factors)
model.add_data(...)
# implicitly create model with normal priors and normal likelihoods...
model.fit()

Advanced use case:

model = FactorModel(n_factors)
model.add_data(...)
model.set_data_options(...)
model.set_model_options(...)
model.set_training_options(...)
model.fit()
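
One way the two use cases could share a code path is to give every option group defaults that the setters merely override. The sketch below shows this for training options only; the class names, field names and default values are all hypothetical and not taken from the cellij code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainingOptions:
    # Hypothetical fields and defaults, chosen for illustration only.
    epochs: int = 1000
    learning_rate: float = 0.003
    early_stopping: bool = True


class FactorModelSketch:
    """Hypothetical stand-in showing how defaults and overrides could interact."""

    def __init__(self, n_factors):
        self.n_factors = n_factors
        self.training_options = TrainingOptions()  # used if the user sets nothing

    def set_training_options(self, **overrides):
        # replace() raises TypeError on unknown option names, which catches typos early.
        self.training_options = replace(self.training_options, **overrides)
```

With this layout the basic use case never touches the setters, while the advanced one calls them with only the options it wants to change.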

number of factors, features, samples, datasets
likelihoods
sparsity levels
noise levels

Line 86 in fafb6a1

self._device = device

Implement the horseshoe prior as a standalone pyro distribution

Resources:

main paper: https://proceedings.mlr.press/v5/carvalho09a/carvalho09a.pdf
pyro distributions: https://github.com/pyro-ppl/pyro/tree/dev/pyro/distributions
TF implementation: https://github.com/tensorflow/probability/blob/v0.19.0/tensorflow_probability/python/distributions/horseshoe.py#L39-L237

Save missingness mask and pass it to Generative so we don't recompute
