scverse / scvi-tools
Deep probabilistic analysis of single-cell and spatial omics data
Home Page: http://scvi-tools.org/
License: BSD 3-Clause "New" or "Revised" License
Issue by Edouard360
Tuesday Apr 03, 2018 at 17:01 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/13
Moving some logic to test.py, removing default values, separating training from the dataset, adding an option for dataset dropout, preparing for the imputation test, and adding a requirement for the Travis build.
Edouard360 included the following code: https://github.com/YosefLab/scVI-dev/pull/13/commits
The main suggestion we've gotten is to display progress bars rather than a lot of text output.
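As an illustration of this suggestion, here is a minimal sketch using tqdm as the progress-bar library (an assumption; the repo may use a different one), with train_step standing in for a hypothetical per-batch helper:

from tqdm import trange

def train(model, data_loader, n_epochs=200):
    # Display a single progress bar over epochs instead of printing per-epoch text.
    with trange(n_epochs, desc="training") as pbar:
        for epoch in pbar:
            epoch_loss = 0.0
            for batch in data_loader:
                epoch_loss += train_step(model, batch)  # train_step is a hypothetical helper
            # Report the running loss in the bar's postfix rather than via print().
            pbar.set_postfix(loss=f"{epoch_loss:.2f}")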
Issue by jeff-regier
Friday Mar 30, 2018 at 16:59 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/5
Issue by Edouard360
Thursday Apr 05, 2018 at 05:39 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/16
Solved the model’s last issues.
Moved all the benchmark logic into a single file: run_benchmarks.py. The tests now only use a toy dataset and run the benchmarks after training for 1 epoch. It can also be run from the command line.
Added a contrib folder for Python scripts for loading/preprocessing.
PS: Sorry, I used about 5 Travis builds to correct minor errors (in particular, make lint didn't warn about Python files in the new contrib folder).
Edouard360 included the following code: https://github.com/YosefLab/scVI-dev/pull/16/commits
See #37: BrainSmallDataset should inherit from Dataset10X (the same way RetinaDataset inherits from LoomDataset).
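A minimal sketch of the proposed inheritance; the constructor arguments shown (a 10X dataset name and a save path) are assumptions, not the repo's actual signature:

class BrainSmallDataset(Dataset10X):
    # Thin wrapper: all downloading/parsing logic lives in Dataset10X.
    def __init__(self, save_path="data/"):
        super().__init__("neuron_9k", save_path=save_path)  # dataset name is a placeholder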
Issue by jeff-regier
Friday Mar 30, 2018 at 17:00 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/7
We can still use the reparameterization trick to differentiate through a beta random variable:
https://arxiv.org/pdf/1805.08498v1.pdf
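A minimal sketch of what this looks like in PyTorch, assuming a torch version whose Beta distribution supports reparameterized sampling via rsample:

import torch
from torch.distributions import Beta

# Variational parameters we want gradients for.
alpha = torch.tensor(2.0, requires_grad=True)
beta = torch.tensor(3.0, requires_grad=True)

# rsample draws a reparameterized sample, so gradients flow back to alpha and beta.
sample = Beta(alpha, beta).rsample()
loss = (sample - 0.5) ** 2  # toy objective
loss.backward()
print(alpha.grad, beta.grad)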
Hello,
Could you share the form you are using for the negative binomial distribution? I find it weird that there is no factorial term in the likelihood function.
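For context, a common parameterization of the negative binomial log-likelihood (a generic reference form, not necessarily the exact code in this repo) writes the log-factorial of the counts as a lgamma term:

import torch

def log_nb_positive(x, mu, theta, eps=1e-8):
    # Negative binomial log-likelihood with mean mu and inverse dispersion theta.
    # The factorial of x appears as torch.lgamma(x + 1), which generalizes log(x!).
    log_theta_mu = torch.log(theta + mu + eps)
    return (
        theta * (torch.log(theta + eps) - log_theta_mu)
        + x * (torch.log(mu + eps) - log_theta_mu)
        + torch.lgamma(x + theta)
        - torch.lgamma(theta)
        - torch.lgamma(x + 1)
    )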
Issue by jeff-regier
Friday Apr 06, 2018 at 19:35 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/20
jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/20/commits
There's a lot of duplicated code in the VAE, VAEC, and SVAEC classes. To get rid of it, let's make VAEC and SVAEC subclasses of VAE. In addition to avoiding duplicated code (which is hard to maintain), it's nice conceptually to express our semi-supervised models as specializations of our unsupervised scVI model.
Also, I wonder whether SVAEC could inherit from VAEC, or whether we even need both SVAEC and VAEC.
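A rough sketch of the proposed hierarchy; the constructor arguments are placeholders, not the actual signatures:

import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, n_genes, n_latent=10):
        super().__init__()
        # Encoder/decoder shared by all variants would be built here.
        ...

class VAEC(VAE):
    # Conditional variant: reuses the VAE encoder/decoder and adds label information.
    def __init__(self, n_genes, n_labels, n_latent=10):
        super().__init__(n_genes, n_latent=n_latent)
        self.n_labels = n_labels

class SVAEC(VAEC):
    # Semi-supervised variant; could specialize VAEC, or VAE directly if VAEC turns out to be unnecessary.
    def __init__(self, n_genes, n_labels, n_latent=10):
        super().__init__(n_genes, n_labels, n_latent=n_latent)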
For large datasets, e.g. the 1.3 million cells in the mouse brain dataset, is there support for the h5 format?
Thanks
Issue by jeff-regier
Friday Mar 30, 2018 at 16:59 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/3
Let's support reading from AnnData files.
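A minimal sketch of what reading an AnnData file could look like, using the anndata package (the return values and how they would feed into scVI's dataset classes are assumptions):

import anndata

def load_anndata(path):
    # Read an .h5ad file and return the pieces a dataset wrapper would need.
    adata = anndata.read_h5ad(path)
    X = adata.X                        # cells x genes count matrix
    gene_names = adata.var_names.tolist()
    return X, gene_names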
Currently run_benchmarks.py just runs one benchmark per call. It'd be nice if it could, optionally, run groups of benchmarks.
For example,
./run_benchmarks.py --annotation
might run all the annotation (semi-supervised) benchmarks, and then print out a nice table afterwards with all the annotation results for all the datasets.
And
./run_benchmarks.py --harmonization
might run all the unsupervised harmonization benchmarks.
And
./run_benchmarks.py --basic
might run all of the original seven benchmarks.
And
./run_benchmarks.py --all --epoch 1
would be useful for testing that everything works.
One way to implement this without a mess of if statements (cf. load_dataset) might be to create a new Benchmark class. It would have as fields all the arguments to train(), including one dataset instance (e.g. a CbmcDataset) and one model instance (e.g. a VAE instance). These Benchmark objects could then be organized into groups.
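A rough sketch of what such a Benchmark grouping could look like (the field names and the train() call are illustrative assumptions, not the repo's API):

from dataclasses import dataclass, field

@dataclass
class Benchmark:
    # Bundles one dataset, one model, and the training arguments for a single run.
    name: str
    dataset: object            # e.g. a CbmcDataset instance
    model: object              # e.g. a VAE instance
    train_kwargs: dict = field(default_factory=dict)

    def run(self):
        # train() stands in for whatever training entry point the repo exposes.
        return train(self.model, self.dataset, **self.train_kwargs)

# Benchmarks can then be organized into named groups selectable from the CLI.
BENCHMARK_GROUPS = {
    "annotation": [],      # semi-supervised benchmarks
    "harmonization": [],   # unsupervised harmonization benchmarks
    "basic": [],           # the original seven benchmarks
}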
Let's change run_benchmarks.py, as well as the run_benchmarks function, so they can take as an argument the URL of an arbitrary loom file, e.g.,
./run_benchmarks.py --url http://loom.linnarssonlab.org/dataset/cellmetadata/Previously%20Published/Cortex.loom
In this case, the output should be the same as running
./run_benchmarks.py --dataset cortex
We also want to test that it works with
http://loom.linnarssonlab.org/dataset/cellmetadata/osmFISH/osmFISH_SScortex_mouse_all_cells.loom
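A minimal sketch of how the command-line handling could look, assuming argparse and a LoomDataset class that accepts a URL (both the flag wiring and the constructor are assumptions):

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Run scVI benchmarks")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--dataset", help="name of a built-in dataset, e.g. cortex")
    group.add_argument("--url", help="URL of an arbitrary loom file")
    parser.add_argument("--epochs", type=int, default=200)
    return parser.parse_args()

def load_from_args(args):
    if args.url is not None:
        # Hypothetical: LoomDataset downloads the file and parses it with loompy.
        return LoomDataset(url=args.url)
    return load_datasets(args.dataset)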
Issue by Edouard360
Friday Apr 06, 2018 at 07:10 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/18
Removed the approximation for log_zinb_positive since we don't need it anymore (we have a fully working lgamma), and added the compilation and its dependency.
Added CUDA support (model/device distinction) - thanks Maxime.
Removed the sklearn dependency and the train_test_split. Using SubsetRandomSampler could be another option, but it is more complicated for now.
Created a dataset module with a subclass for each dataset, all inheriting from the same base class, with downloading/preprocessing.
Moved the benchmark logic into scvi/benchmark.py.
Edouard360 included the following code: https://github.com/YosefLab/scVI-dev/pull/18/commits
There's a lot of code duplicated across our four methods for training models: train, train_semi_supervised_jointly, train_semi_supervised_alternately, and train_classifier. How about combining all these methods into one method, train, and passing it additional arguments (e.g., boolean flags) to control how it behaves?
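A rough sketch of what the combined signature could look like; the flag names and the helper functions are illustrative, not the repo's actual API:

def train(model, dataset, n_epochs=100,
          semi_supervised=False, alternate=False, classifier_only=False):
    # Single entry point replacing the four separate training methods.
    # semi_supervised: use labeled and unlabeled data together.
    # alternate:       alternate between the VAE and classifier objectives.
    # classifier_only: train only the classifier.
    if classifier_only:
        return _train_classifier(model, dataset, n_epochs)    # hypothetical helper
    if semi_supervised and alternate:
        return _train_alternately(model, dataset, n_epochs)   # hypothetical helper
    if semi_supervised:
        return _train_jointly(model, dataset, n_epochs)       # hypothetical helper
    return _train_unsupervised(model, dataset, n_epochs)      # hypothetical helper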
Issue by romain-lopez
Monday Apr 02, 2018 at 22:39 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/12
romain-lopez included the following code: https://github.com/YosefLab/scVI-dev/pull/12/commits
Possibly stopped working after #54?
jeff@dean ~/git/scVI $ ./run_benchmarks.py --dataset cbmc
Traceback (most recent call last):
File "./run_benchmarks.py", line 54, in <module>
dataset = load_datasets(args.dataset, url=args.url)
File "./run_benchmarks.py", line 22, in load_datasets
gene_dataset = CiteSeqDataset('cmbc', save_path=save_path)
File "/home/jeff/git/scVI/scvi/dataset/cite_seq.py", line 17, in __init__
s = available_datasets[name]
KeyError: 'cmbc'
It would be great if you guys could share more insight into what's going on in the code and how to use it, maybe through a more complete README file?
The method GeneDataset.download will now download multiple URLs. So can we delete the code below and instead use GeneDataset.download for all the downloading?
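A sketch of what a multi-URL download helper could look like; the class name follows the issue, but the exact signature and attributes are assumptions:

import os
import urllib.request

class GeneDataset:
    def download(self, urls, filenames, save_path="data/"):
        # Download each URL to save_path/filename, skipping files already present.
        os.makedirs(save_path, exist_ok=True)
        for url, filename in zip(urls, filenames):
            target = os.path.join(save_path, filename)
            if os.path.exists(target):
                print(f"File {target} already downloaded")
                continue
            print(f"Downloading file at {target}")
            urllib.request.urlretrieve(url, target)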
Issue by jeff-regier
Monday Apr 02, 2018 at 21:11 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/11
Normalizing flows should let us better approximate the posterior distribution. Sylvester normalizing flows seem like the first thing to try.
Create a new class called LoomDataset that uses the loompy library to load an arbitrary dataset in the loom format: https://github.com/linnarsson-lab/loompy
As a test case --- and so we can easily use it --- convert the RETINA dataset to loom format and upload it to the YosefLab/scVI-data repository.
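A minimal sketch of loading a loom file with loompy; the row-attribute key for gene names varies by file, so "Gene" is an assumption:

import loompy

def load_loom(path):
    # Read the full expression matrix and gene names from a loom file.
    with loompy.connect(path) as ds:
        matrix = ds[:, :]              # genes x cells in loom convention
        gene_names = ds.ra["Gene"]     # row attribute; key depends on the file
    return matrix.T, gene_names        # transpose to cells x genes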
It's better if the models (i.e. VAE, VAEC, SVAEC) don't "know" whether they are using cuda or not. In the __init__ method for these models, we can just load everything on the CPU. Then, call the .cuda() method (e.g. vae.cuda()) right after instantiating the model. It's kind of a detail, but the distinction between the model and what chip it runs on is a helpful one to maintain.
See
As described in #40
Documentation, generated with Sphinx, and posted at https://scvi.readthedocs.io/
We primarily want to document the functions that Chenling used in the demo notebook.
Examples:
http://scanpy.readthedocs.io/en/latest/
Issue by jeff-regier
Friday Mar 30, 2018 at 17:00 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/6
Let's move the load_dataset function from datasets/__init__.py to run_benchmarks.py. Then let's only call that function in run_benchmarks.py. Everywhere else load_dataset is called (e.g. in unit tests), just instantiate the right dataset class directly. For example, rather than
gene_dataset = load_datasets('cortex')
do
gene_dataset = CortexDataset()
Issue by jeff-regier
Sunday Apr 01, 2018 at 21:50 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/10
Modeling p(x | z) with an invertible conditional distribution may give us a latent space that is more interpretable.
Issue by jeff-regier
Friday Apr 06, 2018 at 18:21 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/19
I was getting this error from run_benchmarks.py. Easily fixed.
Traceback (most recent call last):
File "run_benchmarks.py", line 19, in <module>
run_benchmarks(gene_dataset, n_epochs=args.epochs)
File "/home/jeff/git/scVI-dev/scvi/benchmark.py", line 32, in run_benchmarks
imputation_score = imputation(vae, gene_dataset)
File "/home/jeff/git/scVI-dev/scvi/imputation.py", line 45, in imputation
mae = imputation_error(px_rate.data.numpy(), X, i, j, ix)
RuntimeError: can't convert CUDA tensor to numpy (it doesn't support GPU arrays). Use .cpu() to move the tensor to host memory first.
jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/19/commits
Let's change the value returned by imputation to distance_list here, rather than its median:
https://github.com/YosefLab/scVI/blob/1c01132164b3cbebe0de06057e1ed6652602ee89/scvi/metrics/imputation.py#L26
Then, after that minor change, we can revise the "Checking imputation accuracy" section of scvi-dev.ipynb to make it easier to follow, by removing all the code in that section and just calling imputation(vae, rate=0.3).
Make scVI available through conda.
Should we use the bioconda channel rather than the default channel?
Anywhere we're currently using a unit_test flag, let's make save_path an argument instead. By default save_path = "data/", but for unit tests call, for example, run_benchmarks("cortex", save_path="tests/data/").
And in retina.py, for example, rather than
def __init__(self, unit_test=False):
use
def __init__(self, save_path="data/"):
Issue by jeff-regier
Friday Mar 30, 2018 at 17:00 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/8
Issue by jeff-regier
Friday Mar 30, 2018 at 20:55 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/9
jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/9/commits
Issue by jeff-regier
Wednesday Apr 04, 2018 at 03:57 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/14
To start, maybe figure out when VaDE does/doesn't work.
Our imputation results are based on just one sample from the variational distribution currently:
That makes imputation not much of a metric for assessing changes to our model --- I'm surprised it works as well as it does. We should be sure to fix it before relying on imputation scores to guide any modeling decisions.
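A sketch of one way to address this: averaging the imputed rate over several posterior samples instead of relying on a single draw. The forward-pass call and the name px_rate are assumptions about the model's interface:

import torch

def imputed_rate(vae, batch, n_samples=25):
    # Average the model's imputed expression rate over multiple samples
    # from the variational posterior, rather than using one draw.
    rates = []
    with torch.no_grad():
        for _ in range(n_samples):
            px_rate = vae(batch)   # hypothetical: forward pass returns the rate
            rates.append(px_rate)
    return torch.stack(rates).mean(dim=0)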
Issue by jeff-regier
Thursday Mar 29, 2018 at 02:16 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/1
jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/1/commits
Issue by jeff-regier
Wednesday Apr 04, 2018 at 22:00 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/15
jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/15/commits
In the paper we report that the imputation error for cortex is around 2.2, but when I run the current version of scVI the imputation error is much higher.
jeff@dean ~/git/scVI $ ./run_benchmarks.py --dataset cortex --epochs 200
File data/expression.bin already downloaded
Preprocessing Cortex data
training: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:29<00:00, 6.69it/s]
Total runtime for 201 epochs is: 29.987644910812378 seconds for a mean per epoch runtime of 0.14919226323787252 seconds.
Best ll was : 1288.2259785091362
Log-likelihood Train: 1261.4138140255177
Log-likelihood Test: 1289.2478976328903
Imputation score on test (MAE) is: 4.208409309387207
Issue by maxime1310
Thursday Mar 29, 2018 at 17:33 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/2
maxime1310 included the following code: https://github.com/YosefLab/scVI-dev/pull/2/commits
Issue by jeff-regier
Thursday Apr 05, 2018 at 17:10 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/17
jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/17/commits
Issue by jeff-regier
Friday Mar 30, 2018 at 16:59 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/4
A very minor issue in the example IPython notebook: latent_dimension = 10 is defined, but the following model definition does not include n_latent=latent_dimension, so changing the value defined in the code has no effect. I think the same may be true of batch_size.
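A one-line illustration of the suggested fix; the constructor arguments (including gene_dataset.nb_genes) are assumptions about the notebook's code:

latent_dimension = 10
# Pass the variable through so changing it actually affects the model:
vae = VAE(gene_dataset.nb_genes, n_latent=latent_dimension)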
Thank you for this algorithm!
Create a notebook named examples/data_loading.ipynb that shows users all the ways they can load data into scVI. They can load
@imyiningliu -- I think Maxime, Eddie, and Chenling are all working with datasets now that aren't yet "wrapped" by scVI. It'd be great if you could talk with all three of them to find out what datasets they're using, and add one class (that inherits from GeneExpressionDataset) for each dataset they plan to keep using. They may already have some code, which they haven't committed yet, that you can start with. We'd also want some documentation for each dataset, and a unit test if the new dataset requires a non-trivial amount of code.
These new datasets may have some characteristics that are different from the dataset we've seen so far.
-- Maxime's smFISH datasets have position information. I think he had some ideas for how to modify GeneExpressionDataset to include that information.
-- Eddie's pbmc donor data may already be accessible through the Dataset10x class.
-- Chenling's data from the simulator isn't available at any public URL yet, and maybe it's too early to make it public. But if not, we could share it through our scVI-dev repo.
Also, I'm interested in getting access through scVI to the dataset mentioned in this paper: https://www.nature.com/articles/nmeth.4636
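A skeleton of what one of these wrapper classes could look like; the base-class constructor, the download_and_preprocess hook, and the file names are all assumptions for illustration:

import numpy as np

class DonorPbmcDataset(GeneExpressionDataset):
    # Hypothetical wrapper for the PBMC donor data mentioned above.

    def __init__(self, save_path="data/"):
        self.save_path = save_path
        expression_data, gene_names = self.download_and_preprocess()
        # Assumed base-class constructor: counts matrix plus gene names.
        super().__init__(expression_data, gene_names=gene_names)

    def preprocess(self):
        # Placeholder: load whatever format the donor data actually ships in.
        data = np.load(self.save_path + "pbmc_donor_counts.npy")
        gene_names = list(np.load(self.save_path + "pbmc_donor_genes.npy"))
        return data, gene_names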
Looks like one of the files for hemato isn't getting downloaded automatically:
jeff@dean ~/git/scVI $ ./run_benchmarks.py --dataset hemato
Downloading file at data/bBM.raw_umifm_counts.csv.gz
Downloading file at data/data.zip
Preprocessing Hemato data
Traceback (most recent call last):
File "./run_benchmarks.py", line 54, in <module>
dataset = load_datasets(args.dataset, url=args.url)
File "./run_benchmarks.py", line 26, in load_datasets
gene_dataset = HematoDataset(save_path=save_path)
File "/home/jeff/git/scVI/scvi/dataset/hemato.py", line 20, in __init__
expression_data, gene_names = self.download_and_preprocess()
File "/home/jeff/git/scVI/scvi/dataset/dataset.py", line 47, in download_and_preprocess
return self.preprocess()
File "/home/jeff/git/scVI/scvi/dataset/hemato.py", line 35, in preprocess
spring_and_pba = pd.read_csv(self.save_path + self.spring_and_pba_filename)
File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
self._make_engine(self.engine)
File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 384, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 695, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: File b'data/bBM.spring_and_pba.csv' does not exist
It looks like run_benchmarks uses the same VAE architecture (e.g. default n_layers) for every dataset:
https://github.com/YosefLab/scVI/blob/cbd6f41bc6e40bbb0c1082ec9bf1d5b32bcd3f7d/scvi/benchmark.py#L38-L39
Instead, let's instantiate the VAE class with the number of layers etc. from Table 2:
https://www.biorxiv.org/content/biorxiv/early/2018/03/30/292037.full.pdf
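One way to wire this up is a per-dataset hyperparameter table keyed by dataset name; the numbers below are placeholders, not the actual values from Table 2 of the paper, and the VAE keyword arguments are assumed:

# Placeholder hyperparameters; replace with the per-dataset values from Table 2.
VAE_HYPERPARAMS = {
    "cortex": {"n_layers": 1, "n_hidden": 128, "n_latent": 10},
    "retina": {"n_layers": 1, "n_hidden": 128, "n_latent": 10},
    "brain_large": {"n_layers": 2, "n_hidden": 128, "n_latent": 10},
}

def make_vae(dataset_name, n_genes):
    # Fall back to the defaults if a dataset isn't listed.
    kwargs = VAE_HYPERPARAMS.get(dataset_name, {})
    return VAE(n_genes, **kwargs)   # hypothetical constructor keywords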