scverse / scvi-tools
Deep probabilistic analysis of single-cell and spatial omics data
Home Page: http://scvi-tools.org/
License: BSD 3-Clause "New" or "Revised" License
Issue by Edouard360
Tuesday Apr 03, 2018 at 17:01 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/13
Moving some logic to test.py, removing default values, separating training from the dataset, adding an option for dataset dropout, preparing for the imputation test, and adding a requirement for the Travis build.
Edouard360 included the following code: https://github.com/YosefLab/scVI-dev/pull/13/commits
The main suggestion we've gotten is to display progress bars rather than a lot of text output.
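As an illustration of this suggestion, here is a minimal sketch using tqdm as the progress-bar library (an assumption; the repo may use a different one), with train_step standing in for a hypothetical per-batch helper:

from tqdm import trange

def train(model, data_loader, n_epochs=200):
    # Display a single progress bar over epochs instead of printing per-epoch text.
    with trange(n_epochs, desc="training") as pbar:
        for epoch in pbar:
            epoch_loss = 0.0
            for batch in data_loader:
                epoch_loss += train_step(model, batch)  # train_step is a hypothetical helper
            # Report the running loss in the bar's postfix rather than via print().
            pbar.set_postfix(loss=f"{epoch_loss:.2f}")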
Issue by jeff-regier
Friday Mar 30, 2018 at 16:59 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/5
Issue by Edouard360
Thursday Apr 05, 2018 at 05:39 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/16
Solved the model’s last issues.
Moved all the benchmark logic into a single file: run_benchmarks.py. The tests now only use a toy dataset and run the benchmarks after training for 1 epoch. It can also be run from the command line.
Added a contrib folder for Python scripts for loading/preprocessing.
PS: Sorry, I used about 5 Travis builds to correct minor errors (in particular, make lint didn't warn about Python files in the new contrib folder).
Edouard360 included the following code: https://github.com/YosefLab/scVI-dev/pull/16/commits
See #37: BrainSmallDataset should inherit from Dataset10X (the same way RetinaDataset inherits from LoomDataset).
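A minimal sketch of the proposed inheritance; the constructor arguments shown (a 10X dataset name and a save path) are assumptions, not the repo's actual signature:

class BrainSmallDataset(Dataset10X):
    # Thin wrapper: all downloading/parsing logic lives in Dataset10X.
    def __init__(self, save_path="data/"):
        super().__init__("neuron_9k", save_path=save_path)  # dataset name is a placeholder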
Issue by jeff-regier
Friday Mar 30, 2018 at 17:00 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/7
We can still use the reparameterization trick to differentiate through a beta random variable:
https://arxiv.org/pdf/1805.08498v1.pdf
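A minimal sketch of what this looks like in PyTorch, assuming a torch version whose Beta distribution supports reparameterized sampling via rsample:

import torch
from torch.distributions import Beta

# Variational parameters we want gradients for.
alpha = torch.tensor(2.0, requires_grad=True)
beta = torch.tensor(3.0, requires_grad=True)

# rsample draws a reparameterized sample, so gradients flow back to alpha and beta.
sample = Beta(alpha, beta).rsample()
loss = (sample - 0.5) ** 2  # toy objective
loss.backward()
print(alpha.grad, beta.grad)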
Hello,
Could you share the form you are using for the negative binomial distribution? I find it weird that there is no factorial term in the likelihood function.
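For context, a common parameterization of the negative binomial log-likelihood (a generic reference form, not necessarily the exact code in this repo) writes the log-factorial of the counts as a lgamma term:

import torch

def log_nb_positive(x, mu, theta, eps=1e-8):
    # Negative binomial log-likelihood with mean mu and inverse dispersion theta.
    # The factorial of x appears as torch.lgamma(x + 1), which generalizes log(x!).
    log_theta_mu = torch.log(theta + mu + eps)
    return (
        theta * (torch.log(theta + eps) - log_theta_mu)
        + x * (torch.log(mu + eps) - log_theta_mu)
        + torch.lgamma(x + theta)
        - torch.lgamma(theta)
        - torch.lgamma(x + 1)
    )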
Issue by jeff-regier
Friday Apr 06, 2018 at 19:35 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/20
jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/20/commits
There's a lot of duplicated code in the VAE, VAEC, and SVAEC classes. To get rid of it, let's make VAEC and SVAEC subclasses of VAE. In addition to avoiding duplicated code (which is hard to maintain), it's nice conceptually to express our semi-supervised models as specializations of our unsupervised scVI model.
Also, I wonder whether SVAEC could inherit from VAEC, or whether we even need both SVAEC and VAEC.
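A rough sketch of the proposed hierarchy; the constructor arguments are placeholders, not the actual signatures:

import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, n_genes, n_latent=10):
        super().__init__()
        # Encoder/decoder shared by all variants would be built here.
        ...

class VAEC(VAE):
    # Conditional variant: reuses the VAE encoder/decoder and adds label information.
    def __init__(self, n_genes, n_labels, n_latent=10):
        super().__init__(n_genes, n_latent=n_latent)
        self.n_labels = n_labels

class SVAEC(VAEC):
    # Semi-supervised variant; could specialize VAEC, or VAE directly if VAEC turns out to be unnecessary.
    def __init__(self, n_genes, n_labels, n_latent=10):
        super().__init__(n_genes, n_labels, n_latent=n_latent)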
For large datasets, e.g. the 1.3 million cells in the mouse brain dataset, is there support for the h5 format?
Thanks
Issue by jeff-regier
Friday Mar 30, 2018 at 16:59 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/3
Let's support reading from AnnData files.
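A minimal sketch of what reading an AnnData file could look like, using the anndata package (the return values and how they would feed into scVI's dataset classes are assumptions):

import anndata

def load_anndata(path):
    # Read an .h5ad file and return the pieces a dataset wrapper would need.
    adata = anndata.read_h5ad(path)
    X = adata.X                        # cells x genes count matrix
    gene_names = adata.var_names.tolist()
    return X, gene_names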
Currently run_benchmarks.py just runs one benchmark per call. It'd be nice if it could, optionally, run groups of benchmarks.
For example,
./run_benchmarks.py --annotation
might run all the annotation (semi-supervised) benchmarks, and then print out a nice table afterwards with all the annotation results for all the datasets.
And
./run_benchmarks.py --harmonization
might run all the unsupervised harmonization benchmarks.
And
./run_benchmarks.py --basic
might run all of the original seven benchmarks.
And
./run_benchmarks.py --all --epoch 1
would be useful for testing that everything works.
One way to implement this without a mess of if statements (cf. load_dataset) might be to create a new Benchmark class. It would have as fields all the arguments to train(), including one dataset instance (e.g. a CbmcDataset) and one model instance (e.g. a VAE instance). These Benchmark objects could then be organized into groups.
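A rough sketch of what such a Benchmark grouping could look like (the field names and the train() call are illustrative assumptions, not the repo's API):

from dataclasses import dataclass, field

@dataclass
class Benchmark:
    # Bundles one dataset, one model, and the training arguments for a single run.
    name: str
    dataset: object            # e.g. a CbmcDataset instance
    model: object              # e.g. a VAE instance
    train_kwargs: dict = field(default_factory=dict)

    def run(self):
        # train() stands in for whatever training entry point the repo exposes.
        return train(self.model, self.dataset, **self.train_kwargs)

# Benchmarks can then be organized into named groups selectable from the CLI.
BENCHMARK_GROUPS = {
    "annotation": [],      # semi-supervised benchmarks
    "harmonization": [],   # unsupervised harmonization benchmarks
    "basic": [],           # the original seven benchmarks
}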
Let's change run_benchmarks.py, as well as the run_benchmarks function, so they can take as an argument the URL of an arbitrary loom file, e.g.,
./run_benchmarks.py --url http://loom.linnarssonlab.org/dataset/cellmetadata/Previously%20Published/Cortex.loom
In this case, the output should be the same as running
./run_benchmarks.py --dataset cortex
We also want to test that it works with
http://loom.linnarssonlab.org/dataset/cellmetadata/osmFISH/osmFISH_SScortex_mouse_all_cells.loom
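A minimal sketch of how the command-line handling could look, assuming argparse and a LoomDataset class that accepts a URL (both the flag wiring and the constructor are assumptions):

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Run scVI benchmarks")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--dataset", help="name of a built-in dataset, e.g. cortex")
    group.add_argument("--url", help="URL of an arbitrary loom file")
    parser.add_argument("--epochs", type=int, default=200)
    return parser.parse_args()

def load_from_args(args):
    if args.url is not None:
        # Hypothetical: LoomDataset downloads the file and parses it with loompy.
        return LoomDataset(url=args.url)
    return load_datasets(args.dataset)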
Issue by Edouard360
Friday Apr 06, 2018 at 07:10 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/18
Removed the approximation for log_zinb_positive since we don't need it anymore (we have a fully working lgamma), and added the compilation and its dependency.
Added CUDA support (model/device distinction) - thanks Maxime.
Removed the sklearn dependency and the train_test_split. Using SubsetRandomSampler could be another option, but it is more complicated for now.
Created a dataset module with a subclass for each dataset, all inheriting from the same base class, with downloading/preprocessing.
Moved the benchmark logic into scvi/benchmark.py.
Edouard360 included the following code: https://github.com/YosefLab/scVI-dev/pull/18/commits
There's a lot of code duplicated across our four methods for training models: train, train_semi_supervised_jointly, train_semi_supervised_alternately, and train_classifier. How about combining all these methods into one method, train, and passing it additional arguments (e.g., boolean flags) to control how it behaves?
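A rough sketch of what the combined signature could look like; the flag names and the helper functions are illustrative, not the repo's actual API:

def train(model, dataset, n_epochs=100,
          semi_supervised=False, alternate=False, classifier_only=False):
    # Single entry point replacing the four separate training methods.
    # semi_supervised: use labeled and unlabeled data together.
    # alternate:       alternate between the VAE and classifier objectives.
    # classifier_only: train only the classifier.
    if classifier_only:
        return _train_classifier(model, dataset, n_epochs)    # hypothetical helper
    if semi_supervised and alternate:
        return _train_alternately(model, dataset, n_epochs)   # hypothetical helper
    if semi_supervised:
        return _train_jointly(model, dataset, n_epochs)       # hypothetical helper
    return _train_unsupervised(model, dataset, n_epochs)      # hypothetical helper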
Issue by romain-lopez
Monday Apr 02, 2018 at 22:39 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/12
romain-lopez included the following code: https://github.com/YosefLab/scVI-dev/pull/12/commits
Possibly stopped working after #54?
jeff@dean ~/git/scVI $ ./run_benchmarks.py --dataset cbmc
Traceback (most recent call last):
File "./run_benchmarks.py", line 54, in <module>
dataset = load_datasets(args.dataset, url=args.url)
File "./run_benchmarks.py", line 22, in load_datasets
gene_dataset = CiteSeqDataset('cmbc', save_path=save_path)
File "/home/jeff/git/scVI/scvi/dataset/cite_seq.py", line 17, in __init__
s = available_datasets[name]
KeyError: 'cmbc'
It would be great if you guys could share more insight into what's going on in the code and how to use it, maybe through a more complete README file?
The method GeneDataset.download will now download multiple URLs. So can we delete the code below and instead use GeneDataset.download for all the downloading?
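A sketch of what a multi-URL download helper could look like; the class name follows the issue, but the exact signature and attributes are assumptions:

import os
import urllib.request

class GeneDataset:
    def download(self, urls, filenames, save_path="data/"):
        # Download each URL to save_path/filename, skipping files already present.
        os.makedirs(save_path, exist_ok=True)
        for url, filename in zip(urls, filenames):
            target = os.path.join(save_path, filename)
            if os.path.exists(target):
                print(f"File {target} already downloaded")
                continue
            print(f"Downloading file at {target}")
            urllib.request.urlretrieve(url, target)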
Issue by jeff-regier
Monday Apr 02, 2018 at 21:11 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/11
Normalizing flows should let us better approximate the posterior distribution. Sylvester normalizing flows seem like the first thing to try.
Create a new class called LoomDataset that uses the loompy library to load an arbitrary dataset in the loom format: https://github.com/linnarsson-lab/loompy
As a test case --- and so we can easily use it --- convert the RETINA dataset to loom format and upload it to the YosefLab/scVI-data repository.
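A minimal sketch of loading a loom file with loompy; the row-attribute key for gene names varies by file, so "Gene" is an assumption:

import loompy

def load_loom(path):
    # Read the full expression matrix and gene names from a loom file.
    with loompy.connect(path) as ds:
        matrix = ds[:, :]              # genes x cells in loom convention
        gene_names = ds.ra["Gene"]     # row attribute; key depends on the file
    return matrix.T, gene_names        # transpose to cells x genes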
It's better if the models (i.e. VAE, VAEC, SVAEC) don't "know" whether they are using cuda or not. In the __init__ method for these models, we can just load everything on the CPU. Then, call the .cuda() method (e.g. vae.cuda()) right after instantiating the model. It's kind of a detail, but the distinction between the model and what chip it runs on is a helpful one to maintain.
See
As described in #40
Documentation, generated with Sphinx, and posted at https://scvi.readthedocs.io/
We primarily want to document the functions that Chenling used in the demo notebook.
Examples:
http://scanpy.readthedocs.io/en/latest/
Issue by jeff-regier
Friday Mar 30, 2018 at 17:00 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/6
Let's move the load_dataset function from datasets/__init__.py to run_benchmarks.py. Then let's only call that function in run_benchmarks.py. Everywhere else load_dataset is called (e.g. in unit tests), just instantiate the right dataset class directly. For example, rather than
gene_dataset = load_datasets('cortex')
do
gene_dataset = CortexDataset()
Issue by jeff-regier
Sunday Apr 01, 2018 at 21:50 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/10
Modeling p(x | z) with an invertible conditional distribution may give us a latent space that is more interpretable.
Issue by jeff-regier
Friday Apr 06, 2018 at 18:21 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/19
I was getting this error from run_benchmarks.py. Easily fixed.
Traceback (most recent call last):
File "run_benchmarks.py", line 19, in <module>
run_benchmarks(gene_dataset, n_epochs=args.epochs)
File "/home/jeff/git/scVI-dev/scvi/benchmark.py", line 32, in run_benchmarks
imputation_score = imputation(vae, gene_dataset)
File "/home/jeff/git/scVI-dev/scvi/imputation.py", line 45, in imputation
mae = imputation_error(px_rate.data.numpy(), X, i, j, ix)
RuntimeError: can't convert CUDA tensor to numpy (it doesn't support GPU arrays). Use .cpu() to move the tensor to host memory first.
jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/19/commits
Let's change the value returned by imputation to distance_list here, rather than its median:
https://github.com/YosefLab/scVI/blob/1c01132164b3cbebe0de06057e1ed6652602ee89/scvi/metrics/imputation.py#L26
Then, after that minor change, we can revise the "Checking imputation accuracy" section of scvi-dev.ipynb to make it easier to follow, by removing all the code in that section and just calling imputation(vae, rate=0.3).
Make scVI available through conda.
Should we use the bioconda channel rather than the default channel?
Anywhere we're currently using a unit_test flag, let's make save_path an argument instead. By default save_path = "data/", but for unit tests call, for example, run_benchmarks("cortex", save_path="tests/data/").
And in retina.py, for example, rather than
def __init__(self, unit_test=False):
use
def __init__(self, save_path="data/"):
Issue by jeff-regier
Friday Mar 30, 2018 at 17:00 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/8
Issue by jeff-regier
Friday Mar 30, 2018 at 20:55 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/9
jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/9/commits
Issue by jeff-regier
Wednesday Apr 04, 2018 at 03:57 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/14
To start, maybe figure out when VaDE does/doesn't work.
Our imputation results are based on just one sample from the variational distribution currently:
That makes imputation not much of a metric for assessing changes to our model --- I'm surprised it works as well as it does. We should be sure to fix it before relying on imputation scores to guide any modeling decisions.
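A sketch of one way to address this: averaging the imputed rate over several posterior samples instead of relying on a single draw. The forward-pass call and the name px_rate are assumptions about the model's interface:

import torch

def imputed_rate(vae, batch, n_samples=25):
    # Average the model's imputed expression rate over multiple samples
    # from the variational posterior, rather than using one draw.
    rates = []
    with torch.no_grad():
        for _ in range(n_samples):
            px_rate = vae(batch)   # hypothetical: forward pass returns the rate
            rates.append(px_rate)
    return torch.stack(rates).mean(dim=0)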
Issue by jeff-regier
Thursday Mar 29, 2018 at 02:16 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/1
jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/1/commits
Issue by jeff-regier
Wednesday Apr 04, 2018 at 22:00 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/15
jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/15/commits
In the paper we report that the imputation error for cortex is around 2.2, but when I run the current version of scVI the imputation error is much higher.
jeff@dean ~/git/scVI $ ./run_benchmarks.py --dataset cortex --epochs 200
File data/expression.bin already downloaded
Preprocessing Cortex data
training: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:29<00:00, 6.69it/s]
Total runtime for 201 epochs is: 29.987644910812378 seconds for a mean per epoch runtime of 0.14919226323787252 seconds.
Best ll was : 1288.2259785091362
Log-likelihood Train: 1261.4138140255177
Log-likelihood Test: 1289.2478976328903
Imputation score on test (MAE) is: 4.208409309387207
Issue by maxime1310
Thursday Mar 29, 2018 at 17:33 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/2
maxime1310 included the following code: https://github.com/YosefLab/scVI-dev/pull/2/commits
Issue by jeff-regier
Thursday Apr 05, 2018 at 17:10 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/17
jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/17/commits
Issue by jeff-regier
Friday Mar 30, 2018 at 16:59 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/4
A very minor issue in the example IPython notebook: latent_dimension = 10 is defined, but the following model definition does not include n_latent=latent_dimension, so changing the value defined in the code has no effect. I think the same may be true of batch_size.
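A one-line illustration of the suggested fix; the constructor arguments (including gene_dataset.nb_genes) are assumptions about the notebook's code:

latent_dimension = 10
# Pass the variable through so changing it actually affects the model:
vae = VAE(gene_dataset.nb_genes, n_latent=latent_dimension)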
Thank you for this algorithm!
Create a notebook named examples/data_loading.ipynb that shows users all the ways they can load data into scVI. They can load
@imyiningliu -- I think Maxime, Eddie, and Chenling are all working with datasets now that aren't yet "wrapped" by scVI. It'd be great if you could talk with all three of them to find out what datasets they're using, and add one class (that inherits from GeneExpressionDataset) for each dataset they plan to keep using. They may already have some code, which they haven't committed yet, that you can start with. We'd also want some documentation for each dataset, and a unit test if the new dataset requires a non-trivial amount of code.
These new datasets may have some characteristics that are different from the dataset we've seen so far.
-- Maxime's smFISH datasets have position information. I think he had some ideas for how to modify GeneExpressionDataset to include that information.
-- Eddie's pbmc donor data may already be accessible through the Dataset10x class.
-- Chenling's data from the simulator isn't available at any public URL yet, and maybe it's too early to make it public. But if not, we could share it through our scVI-dev repo.
Also, I'm interested in getting access through scVI to the dataset mentioned in this paper: https://www.nature.com/articles/nmeth.4636
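A skeleton of what one of these wrapper classes could look like; the base-class constructor, the download_and_preprocess hook, and the file names are all assumptions for illustration:

import numpy as np

class DonorPbmcDataset(GeneExpressionDataset):
    # Hypothetical wrapper for the PBMC donor data mentioned above.

    def __init__(self, save_path="data/"):
        self.save_path = save_path
        expression_data, gene_names = self.download_and_preprocess()
        # Assumed base-class constructor: counts matrix plus gene names.
        super().__init__(expression_data, gene_names=gene_names)

    def preprocess(self):
        # Placeholder: load whatever format the donor data actually ships in.
        data = np.load(self.save_path + "pbmc_donor_counts.npy")
        gene_names = list(np.load(self.save_path + "pbmc_donor_genes.npy"))
        return data, gene_names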
Looks like one of the files for hemato isn't getting downloaded automatically:
jeff@dean ~/git/scVI $ ./run_benchmarks.py --dataset hemato
Downloading file at data/bBM.raw_umifm_counts.csv.gz
Downloading file at data/data.zip
Preprocessing Hemato data
Traceback (most recent call last):
File "./run_benchmarks.py", line 54, in <module>
dataset = load_datasets(args.dataset, url=args.url)
File "./run_benchmarks.py", line 26, in load_datasets
gene_dataset = HematoDataset(save_path=save_path)
File "/home/jeff/git/scVI/scvi/dataset/hemato.py", line 20, in __init__
expression_data, gene_names = self.download_and_preprocess()
File "/home/jeff/git/scVI/scvi/dataset/dataset.py", line 47, in download_and_preprocess
return self.preprocess()
File "/home/jeff/git/scVI/scvi/dataset/hemato.py", line 35, in preprocess
spring_and_pba = pd.read_csv(self.save_path + self.spring_and_pba_filename)
File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
self._make_engine(self.engine)
File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 384, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 695, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: File b'data/bBM.spring_and_pba.csv' does not exist
It looks like run_benchmarks uses the same VAE architecture (e.g. default n_layers) for every dataset:
https://github.com/YosefLab/scVI/blob/cbd6f41bc6e40bbb0c1082ec9bf1d5b32bcd3f7d/scvi/benchmark.py#L38-L39
Instead, let's instantiate the VAE class with the number of layers etc. from Table 2:
https://www.biorxiv.org/content/biorxiv/early/2018/03/30/292037.full.pdf
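One way to wire this up is a per-dataset hyperparameter table keyed by dataset name; the numbers below are placeholders, not the actual values from Table 2 of the paper, and the VAE keyword arguments are assumed:

# Placeholder hyperparameters; replace with the per-dataset values from Table 2.
VAE_HYPERPARAMS = {
    "cortex": {"n_layers": 1, "n_hidden": 128, "n_latent": 10},
    "retina": {"n_layers": 1, "n_hidden": 128, "n_latent": 10},
    "brain_large": {"n_layers": 2, "n_hidden": 128, "n_latent": 10},
}

def make_vae(dataset_name, n_genes):
    # Fall back to the defaults if a dataset isn't listed.
    kwargs = VAE_HYPERPARAMS.get(dataset_name, {})
    return VAE(n_genes, **kwargs)   # hypothetical constructor keywords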