
tabula-muris-senis's Introduction

tabula-muris-senis

Overview

Tabula Muris Senis is a comprehensive resource for the cell biology community that offers a detailed molecular and cell-type-specific portrait of aging. You can read our preprint here!

We view such a cell atlas as an essential companion to the genome: the genome provides a blueprint for the organism but does not explain how genes are used in a cell type specific manner or how the usage of genes changes over the lifetime of the organism. The cell atlas provides a deep characterization of phenotype and physiology which can serve as a reference for understanding many aspects of the cell biological changes that mammals undergo during their lifespan.

For a quick overview, check out the lightning talk slides presented at SciPy 2019.

This is a resource for the community! The repo is continuously being updated with the code used for the publication - let us know your priorities and we will prioritize that code release!

Data access

Raw data

Since October 2019, Tabula Muris Senis data have been made available to all users free of charge. AWS has made the data freely available on Amazon S3 so that anyone can download the resource to perform analysis and advance medical discovery without needing to worry about the cost of storing Tabula Muris Senis data or the time required to download it.

Processed data

Our ready-to-use data for scanpy are available from figshare: https://figshare.com/projects/Tabula_Muris_Senis/64982

Online data browsing

Interactive browsers for the data are available from the Tabula Muris Senis portal using cellxgene and, thanks to Max Haeussler and Matt Speir, at the UCSC Cell Browser.

Contact

If you have questions about the data, please create a new Issue.

License

There are no restrictions on the use of data received from the Chan Zuckerberg Biohub, unless expressly identified prior to or at the time of receipt.

tabula-muris-senis's People

Contributors

ahmetcansolak, aopisco, martinjzhang


tabula-muris-senis's Issues

Lost some cells after import? And how are cell markers defined?

Hi!
Thanks a lot for this amazing work!

I recently encountered an issue when trying to analyze data from specific organs. For example:

https://tabula-muris-senis.ds.czbiohub.org/thymus/droplet/
here it is described that the whole thymus analyzed through the Droplet pipeline contains 9275 cells and includes DN3, DN4, double negative T cell, immature T cell, professional APC and thymocyte.

but when I download the "h5ad" file from https://figshare.com/articles/dataset/Processed_files_to_use_with_scanpy_/8273102/2

and run:
"

thymus2 = sc.read_h5ad("C:/.../32669714/Thymus_droplet.h5ad")
C:...\miniconda3\lib\site-packages\anndata\compat_init_.py:180: FutureWarning: Moving element from .uns['neighbors']['distances'] to .obsp['distances'].

This is where adjacency matrices should go now.
warn(
C:...\miniconda3\lib\site-packages\anndata\compat_init_.py:180: FutureWarning: Moving element from .uns['neighbors']['connectivities'] to .obsp['connectivities'].

This is where adjacency matrices should go now.
warn(

thymus2
AnnData object with n_obs × n_vars = 7570 × 19860
obs: 'age', 'batch', 'cell', 'cell_ontology_class', 'cell_ontology_id', 'free_annotation', 'method', 'mouse.id', 'n_genes', 'sex', 'subtissue', 'tissue', 'tissue_free_annotation', 'n_counts', 'louvain', 'cluster_names', 'leiden'
var: 'n_cells', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'leiden', 'louvain', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_pca', 'X_umap', 'X_tsne'
varm: 'PCs'
obsp: 'distances', 'connectivities'
"

So it seems the cell number is 7570, which is lower than 9275 (a difference of ~1700)?

And when I went through "cell_ontology_class", I think the "immature T cell" annotation is missing, as I could not find it (that population should be about ~1700 cells).

So I wonder which step I might have done wrong to lead to such a problem?

Also, this might be a naive question (maybe I missed some details regarding the methods?): in theory we should expect DN, DP and SP populations within the thymus, so why is only DN annotated here? Additionally, DN3, DN4, double negative T cell, immature T cell, and thymocyte sound a bit confusing; based on conventional flow cytometry analysis, there might be some overlap between these populations. For example, should the thymocyte category be a mixture of SP cells, etc.?

Thank you very much!
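One quick way to see which populations made it into a download is to tally the annotation column directly. A minimal sketch, with toy labels standing in for the real `cell_ontology_class` values (the scanpy call at the end assumes scanpy is installed and is shown only as a comment):

```python
from collections import Counter

# Toy stand-in for adata.obs['cell_ontology_class'] of the thymus object;
# the real labels would come from the downloaded h5ad file.
labels = [
    "DN3 thymocyte", "DN4 thymocyte", "double negative T cell",
    "professional antigen presenting cell", "thymocyte",
    "thymocyte", "DN3 thymocyte",
]

# Tally cells per annotation to see which populations are present
# (and whether e.g. "immature T cell" is missing entirely).
counts = Counter(labels)
for cell_type, n in counts.most_common():
    print(f"{cell_type}: {n}")

# Equivalent on a real object (not run here):
#   adata = sc.read_h5ad("Thymus_droplet.h5ad")
#   adata.obs["cell_ontology_class"].value_counts()
```

Comparing this tally (and its total) against the numbers on the portal makes it immediately visible which annotation accounts for a missing ~1700 cells.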

Request for the marker genes

Hello !
We are annotating scRNA-seq data of mouse lung, but we can't find detailed marker genes corresponding to each cell type of lung tissue in this article.
If it is convenient for you, could you please provide the corresponding marker genes? Thank you for any reply!

How can I find the marker genes used to annotate each cell type?

Hello, first I wanted to thank you for this amazing job. It is very helpful for the aging research community.

I am trying to do some analysis using the brain data, but I need to know the exact markers used for each cell type. I have been looking around in the GitHub account and I haven't found the precise genes used (maybe I just missed those lines of code?).

Also, I tried to access the AWS data just in case there were more code notebooks there; however, it asks me every time to create an account requesting billing data and credit card information (for example this link: S3 Public Bucket). It is the first time I have used AWS; is that required to access these public datasets?

Thank you in advance for your help and attention.

How to access the cell type annotation

Dear Tabula team,
I downloaded the processed data from https://figshare.com/projects/Tabula_Muris_Senis/64982 and loaded it into R using Seurat's ReadH5AD function. When I checked the meta.data I could not find the cell type annotation. For example, https://github.com/czbiohub/tabula-muris/blob/master/22_markers/droplet_Lung_cell_ontology_class_classes.csv lists the cell type annotation, but I don't know which column it corresponds to in the meta.data. Here is an example of the meta.data of the lung droplet data set:

age cell cell.ontology.class cell.ontology.id free.annotation method mouse.id nFeatures_RNA sex subtissue tissue tissue.free.annotation nCount_RNA louvain cluster.names leiden
AAACCTGAGCGTAATA-1-11-0-0 2 MACA_18m_F_LUNG_50_AAACCTGAGCGTAATA 0 4 0 0 6 2470 0 2 0 0 6594.1533203125 33 35 35
AAACGGGTCGCCCTTA-1-11-0-0 2 MACA_18m_F_LUNG_50_AAACGGGTCGCCCTTA 0 4 0 0 6 1821 0 2 0 0 5138.74755859375 8 8 8
AAAGATGAGCAGACTG-1-11-0-0 2 MACA_18m_F_LUNG_50_AAAGATGAGCAGACTG 6 7 6 0 6 1333 0 2 0 0 4399.15380859375 5 2 2
AAAGATGAGCCGTCGT-1-11-0-0 2 MACA_18m_F_LUNG_50_AAAGATGAGCCGTCGT 8 8 8 0 6 1455 0 2 0 0 4933.3740234375 4 1 1
AAAGCAACATGGTAGG-1-11-0-0 2 MACA_18m_F_LUNG_50_AAAGCAACATGGTAGG 5 3 5 0 6 3322 0 2 0 0 7091.1279296875 6 3 3
AAATGCCAGGAGTTGC-1-11-0-0 2 MACA_18m_F_LUNG_50_AAATGCCAGGAGTTGC 6 7 6 0 6 1300 0 2 0 0 4612.63916015625 5 2 2

Thanks!

Code availability?

Just wondering if the code for the analysis will be available any time soon. I downloaded the processed droplet scRNA-seq data in h5ad format. I need some info about how the raw data was processed in order to do some downstream analysis.

Cheers,
Gary

Are the bam files missing for some cells of the FACS data? What are the 'TK' parameters?

Hello!

I have downloaded the whole FACS data, but I met a little problem. In the paper, I noticed that there should be 110824 cells sequenced for FACS. When I sorted out all the bam files according to the provided 'tabula-muris-senis-full-metadata.csv' file, I got a total of 106273 bam files and I didn't find those missing bam files in the AWS database.

Another question is that I found a part of fastq.gz files of these missing cell files. So, I want to generate bam files using these fastq.gz files. I noticed that you used STAR v.2.5.2b with parameters TK. Can you tell me what the 'TK parameters' mean since I didn't find them in the STAR?

Thank you very much!

Do I need to normalize the processed files (to use with scanpy) downloaded from figshare?

Hi,

I am wondering whether I need to normalize the processed files (.h5ad) downloaded from figshare. I did not find documentation on this, so I compared the downstream results with and without normalization. I think I do need to normalize the downloaded files, but I am not sure. Thank you!

For normalization, I mean:
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

Best regards,
MD
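For anyone facing the same question, a rough heuristic (a sketch, not an official check from the authors): raw UMI counts are non-negative integers, while data that have already gone through `normalize_total` + `log1p` generally are not:

```python
import math

# Heuristic: raw UMI counts are non-negative integers; size-factor-
# normalized + log1p values generally are not.
def looks_like_raw_counts(values):
    return all(v >= 0 and float(v).is_integer() for v in values)

raw_row = [0.0, 1.0, 3.0, 0.0, 2.0]  # toy row of a raw count matrix

# What the same row looks like after the normalization in question:
# scale the cell to 1e4 total counts, then log1p.
total = sum(raw_row)
normalized_row = [math.log1p(v / total * 1e4) for v in raw_row]

print(looks_like_raw_counts(raw_row))         # True  -> normalization still needed
print(looks_like_raw_counts(normalized_row))  # False -> already processed
```

Spot-checking a few entries of `adata.X` this way tells you whether the object you downloaded still holds raw counts.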

Mitochondrial genes and ERCC genes

Hello,

I have downloaded one of the FACS datasets and am looking through it. I notice there are no mitochondrial genes or ERCC spike-in controls. Why is this? Are there objects available that already include them?

Thanks,
Ronnie

FACS Brain Non-Myeloid 3-month data are NaN and 18, 24 data are Neuronal

Dear Tabula Muris team,

I am trying to understand an aspect of your aligned raw read dataset from Gene Expression Omnibus. Namely, FACS Brain Non-Myeloid cell age classes seem to be divided as explained and illustrated below.

only Neuronal cells:
sc.pl.umap(nonan,color=['age','subtissue'])
#this returns 18 and 24 months data only, and 2x each subtissue (i.e. 2xcerebellum, 2xcortex, 2xhippocampus, 2xstriatum)
umapnonan_age_subtissue.pdf

only NaN cells:
sc.pl.umap(yesnan,color=['age','subtissue'])
#this returns 3 months data only, and 1x each subtissue (i.e. 1xcerebellum, 1xcortex, 1xhippocampus, 1xstriatum)
umapyesnan_age_subtissue.pdf

In the Nature paper you say: "Data from the 3-month time point—which has previously been published and constitutes the Tabula Muris5 represents approximately 20% of the cells in the entire dataset, and was used as a basis from which to perform semi-automated cell-type annotation of the data from the additional time points (Fig. 1b, Extended Data Fig. 4b)."

Does this mean that the 3-month data points come from the original Tabula Muris Consortium, were not analysed with FACS for Tabula Muris Senis, and the GEO files contain combined batch corrected experiments from both Tabula Muris and Tabula Muris Senis?

I hope you can answer my question.

Sincerely,
Hanna

Options for UMAP generation

Hello!

I have been working with the raw counts data, from the facs-official-raw-obj.h5ad file, and wanted to regenerate the FACS-specific clustering and UMAP projection from Extended Data Fig. 1A-C as an internal check. I'm not clear which code within the provided ipynb files was used to generate these plots in particular - would you be able to point me towards it? Specifically, which options were used to run the PCA and generate the neighborhood graph?

Thank you very much!

Forbidden BAMs

Hi,

Thanks for this amazing resource! There is a set of a few thousand BAM files in the AWS bucket under Plate_seq/3_month/ that are locked down with access disabled. I've been able to access all but these BAMs.

Here is one example:
https://czb-tabula-muris-senis.s3.us-west-2.amazonaws.com/Plate_seq/3_month/170910_A00111_0054_AH2HGWDMXX__170910_A00111_0053_BH2HGKDMXX/results_gencode_ercc/P9-MAA000930-3_8_M-1-1.gencode.vM19.ERCC.Aligned.out.sorted.bam

I'm happy to share the full list if it would help debug the settings for these files to make them available. Thanks!

Best,
David

WBC

Hi!

When you download the experimental design there is a group named "White_Blood_Cells". I have now tried to retrieve the fastq files and I don't see the same annotation. Could you point out which files contain the information linking the fastq file names and the Cell Ranger outputs? Or just the WBC fastq files would be sufficient. I can only find the metadata corresponding to fastq files for the Plate_seq data sets; where could I find the same file for the 10x fastq files?

Thank you!

Ana

Raw data sample information

I have not found the corresponding tissue and age information for the 10X_P1_ to 10X_P6_ samples in the AWS S3 10x data. Is there any corresponding sample information?

An expression matrix for the SCAT tissue is required

I would like the SCAT data you measured for further analysis, preferably its expression matrix.
I have tried to download it from AWS. The steps were to first enter Plate_seq and then download the data for each month. For the data in 3_month, I selected the SCAT tissue paths via fastqs_annotated.csv in 3_month, downloaded the files with the suffix gencode.vM19.ERCC.htseq-count.txt, and finally merged them. This method could not find all the 3_month cells of SCAT.
In the 24_month and 18_month folders, I tried to download the CSV files under each folder and to filter the SCAT cells using the annotation information in tabula-Muris-Senis-facs-official-raw-obj_cell-metadata.csv. But in each case fewer than 100 cells were selected.
This was the only way I could think of to get the SCAT expression matrix for each month, but it didn't work. Could you tell me how I can get it? Or can you send the expression matrix of SCAT to my mailbox? My email is [email protected]

h5ad to BAM matching

Hi, perhaps related to this issue in the human data (czbiohub-sf/tabula-sapiens#42), I have also found apparently mismatched BAMs in the mouse data.

For example, cell row 96552 in the tabula-muris-senis-facs-official-raw-obj.h5ad has 'cell' column 'L22.MAA000593.3_8_M.1.1'. The table reports 2099 n_genes and 189281 n_counts.

That appears to match s3://czb-tabula-muris-senis/Plate_seq/3_month/170907_A00111_0051_BH2HWLDMXX/results_gencode_ercc/L22-MAA000593-3_8_M-1-1.gencode.vM19.ERCC.Aligned.out.sorted.bam. However, this BAM has only 486 aligned read rows, so it can't account for those reported gene expression values.

Could you help me understand where I'm going wrong here? Thanks!

Where can I find the fastq files for adata_with_ercc_genecode_counts_for_gatk_with_metadata.h5ad?

Hi,

thanks for this organized repository!

I want to analyze the adata object for FACS which includes the ERCC sequences. This should be the following file:

(https://s3.console.aws.amazon.com/s3/object/czb-tabula-muris-senis?region=us-west-2&prefix=Data-to-reproduce-figures/mutation-analysis-objs/adata_with_ercc_genecode_counts_for_gatk_with_metadata.h5ad)

But the metadata contain no information about the fastq files, so I can't get them from SRA. However, I see that there is the folder:

s3://czb-tabula-muris-senis/Plate_seq/3_month/

But I cannot map the cell_ids to the fastq files. How do I get the corresponding fastq files?

Best,
Tolga

Raw data and processed data not matching?

Hi,

I was trying to find the cells' raw counts based on the cell IDs from the processed bone marrow FACS-sorted data (downloaded from figshare). I can manage to find the 24m age group's cell IDs but not the 3m age group's, as the two groups are labelled quite differently, i.e. A2_B003031_S98_L004.mus-5-0 for the 24m group but D13.D042193.3_8_M.1.1-1 for the 3m group. Is there any way I can match the cell IDs or map back to the raw counts?

Thanks in advance,
Huiwen

How were the droplet (and FACS) data processed/normalized?

Hello,

Charlotte(@csoneson), Federico (@federicomarini) and I are trying to convert some of the h5ad files into objects to be read into R to be used in R/Bioconductor. We are particularly looking at these 2 files from the droplet data:

https://figshare.com/articles/dataset/Processed_files_to_use_with_scanpy_/8273102?file=23938934
https://figshare.com/articles/dataset/Processed_files_to_use_with_scanpy_/8273102?file=23936684

Extracting the matrices of these files, it seems one has the raw counts (23938934) and the other (23936684) has some processed form of the counts. Could you elaborate on what transformations and processing the counts underwent? We were interested in using this h5ad file (23936684) since it also contains the reduced dimensionalities, which would be nice to include with the raw data.

Thank you and best,
Dania

Question about data downloading

Dear Tabula Muris team,

I'm trying to download the raw data from AWS, but I am really confused by the file names of the datasets. I want to know the meaning of subfolder names like "170907_A00111_0051_BH2HWLDMXX".

Furthermore, I only need raw data from brain cells for my study, but I could not figure out which files are the brain cells. Could you please tell me how to distinguish the brain cells from other tissues according to the filename?

I really hope you can answer my question. Thanks a lot !

Sincerely,
Vikki

Help downloading data in other ways

It's great to be able to download the Tabula Muris Senis data from Amazon. However, registering an AWS account requires binding a credit card. For some reasons I cannot obtain a valid credit card, which prevents me from registering an AWS account. Would it be convenient for you to provide other ways to download the data? Thank you so much!

Help downloading data from AWS

Hi,

I am trying to download FASTQ files for 10x data from AWS S3

What is the best way to do this? On the AWS page it will only let me download each file individually; I can't download the entire directory.

How do I gain access to the files so that I can start aligning things?

Tabula Muris vs Tabula Muris Senis

Hi, thanks for the wonderful resource! I'm wondering if the S3 bucket s3://czb-tabula-muris-senis/ also contains the data from the original publication (s3://czbiohub-tabula-muris/). Many of the files in czbiohub-tabula-muris/ are labeled "Glacier Deep Archive" and unavailable.

Is 'import util' an author-defined function?

The downstream_tissue_cell.aging_score.final code uses `import util`, and it includes a lot of functions that are needed for downstream analysis. However, `util` seems to be an author-defined module, so we cannot proceed without it.
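If `util.py` ships alongside the notebooks in the repo checkout, putting that directory on `sys.path` should let `import util` resolve. A self-contained sketch that fakes the repo directory with a stub (the real path and the contents of `util` are assumptions here):

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Fake the repo checkout with a stub util.py in a temp directory; in a real
# session repo_dir would be the cloned tabula-muris-senis directory that
# contains the notebook, and util.py's contents are an assumption.
repo_dir = Path(tempfile.mkdtemp())
(repo_dir / "util.py").write_text("def hello():\n    return 'util found'\n")

# Putting the directory on sys.path is what makes `import util` resolve.
sys.path.insert(0, str(repo_dir))
util = importlib.import_module("util")
print(util.hello())  # prints: util found
```

If no `util.py` exists anywhere in the checkout, then the module was indeed never released and only the authors can provide it.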

How are the 10X files named in AWS?

I'm trying to download the 10X files from AWS. The files in this folder (Amazon S3/czb-tabula-muris-senis/10x//1_month/)
are named "10X_P5_0/ 10X_P5_1/ 10X_P5_2/..."
Do these numbers correspond to different tissues? Where can I find the ID info?
Thank you,
Xu

mislabeling of marrow clusters?

Hello,

Thanks for providing what promises to be a great resource! I primarily study hematopoiesis and was quickly checking out your bone marrow droplet data. I noticed that several of the cell_ontology_class labels seem incorrect, at least based on my experience with similar datasets. I suspect the issue is the result of a simple human error in the matching of cell_ontology_class labels with cluster IDs rather than outright misclassification, since some class labels have been assigned to quite unrelated clusters (e.g., "granulocytopoietic cell").

Below is a figure with cells colored by the labels from the h5ad files/cellxgene (left panel, arrows marking labels that I'm confident are incorrect) and by marker gene expression (right panel).

tab_mur_clusts_labeled

If you prefer to receive feedback through other channels, let me know.

Best,
Sam

question about age coefficient (coef (age.logFC))

Dear all,

Thank you for sharing the great data!

I have probably a naive question, as I'm not very familiar with MAST and the details of your methods, and I definitely don't want to mess up something by reinventing your methods.

In your great paper https://www.nature.com/articles/s41586-020-2496-1 it's written multiple times "age coefficient threshold of 0.005 (corresponding to an approximately 10%-fold change)".

In the file facs.Brain_Non-Myeloid.neuron.gz from https://figshare.com/articles/dataset/tms_gene_data_rv1/12827615?file=27856758 I see "coef (age.logFC)", which is the "age coefficient" from the paper as far as I understand.

My questions:

  1. Which transformation is applied to go from the age coefficient of 0.005 to a 10% fold change?
  2. Does a 10% fold change in this case mean 100% × (old - young)/young or 100% × old/young?
  3. Which base of the log is implied throughout the paper when it is not specified and given just as log?

In other words, what I need is to get rid of the log scale in the age coefficient, and since none of the common base choices produce 0.1 from 0.005, I thought I might be missing something about the methods.

Thank you very much for the clarifications!

Best regards,
Polina
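For what it's worth, one arithmetic reading that lands near 10% is to treat the coefficient as a natural-log change per month of age, accumulated over the ~21 months between the 3m and 24m time points. This is purely a guess at the convention, not a statement of the authors' actual method:

```python
import math

# Purely a sketch of one arithmetic reading that lands near 10% -- NOT a
# statement of the paper's method. If the age coefficient is a natural-log
# fold change *per month*, accumulating it over the ~21 months between the
# 3m and 24m time points gives:
coef_per_month = 0.005
age_span_months = 24 - 3

fold_change = math.exp(coef_per_month * age_span_months)
percent_change = (fold_change - 1) * 100  # (old - young) / young convention

print(f"{fold_change:.3f}")      # ~1.111
print(f"{percent_change:.1f}%")  # ~11.1%
```

Under this (assumed) reading, a per-month coefficient of 0.005 corresponds to an approximately 10% change over the study's age span; confirming the actual convention with the authors is still the reliable route.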

Are the raw fastqs unprocessed or re-extracted from BAM?

I noticed issue 223.

As @jamestwebber mentioned, the previous released raw fastq in AWS was extracted from BAM, which may exclude chimeric reads.

But that issue was posted two years ago and the download directory has changed since, so I was wondering whether this issue still exists.

What about the fastq files stored here? https://s3.console.aws.amazon.com/s3/buckets/czb-tabula-muris-senis?region=us-west-2&tab=objects

Are these files unprocessed or extracted from BAM?

Request for Data Link for Figures 1 and 3 in the article "A single-cell transcriptomic atlas characterizes ageing tissues in the mouse"

Hello!
I am writing to request assistance in obtaining the data for Figures 1 and 3 in your article titled "A single-cell transcriptomic atlas characterizes ageing tissues in the mouse."
The original link to the data has failed.
I am participating in a competition where I intend to utilize this data, and I would greatly appreciate your support in providing a downloadable method.
Many thanks!

Greetings,
Dnj

Missing genes in raw data

Hi there,
Are there unprocessed h5ad files available for droplet and facs containing all genes (especially mt-genes)?
I am looking for something like the "tabula-muris-senis-droplet-official-raw-obj.h5ad" file but with all genes.

Thanks,
Dennis

Mapping from cell ontology class number to labels?

Hello! Thanks for making this dataset well-organized. I might be missing something obvious, but is there a file that provides a mapping of cell ontology class numbers (int) to the actual name of the cell type ("B cell")? Also, what is the difference between "cell.ontology.class", "cell.ontology.id", and "free.annotation" columns in the @meta.data in droplet_filtered.h5ad? Thanks so much!

Mismatch between raw and processed data

Hello, I performed pre-processing on the raw data and could not reproduce the processed data.
For instance, I downloaded the "Bladder_droplet" file from figshare and "tabula-muris-senis-droplet-processed-official-annotations-Bladder" from Amazon. Then I applied size-factor normalization (with 10000 counts per cell) and a log transformation to the count matrix of the raw data, and it did not match the processed data.
Specifically, in the first cell ("AAACCTGAGTACGTTC-1-24-0-0"), the genes "Snhg6" and "Tram1" each have 1 UMI; after normalization I got 0.45 for both genes, while in the processed data (from figshare) the values are 0.96 and 0.71, respectively.
So my question is, did I do something wrong, or is there a problem in the processed data?
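For reference, the arithmetic of this normalization can be checked by hand. A sketch using a hypothetical cell total of 20000 (the real total is the sum of that cell's row in the raw matrix):

```python
import math

# Sketch of the normalization in question. For a gene with `umi` counts in
# a cell with `cell_total` total counts, normalize_total(target_sum=1e4)
# followed by log1p gives:
def normalized_value(umi, cell_total, target_sum=1e4):
    return math.log1p(umi * target_sum / cell_total)

# Hypothetical cell total chosen only for illustration.
print(round(normalized_value(umi=1, cell_total=20000), 3))  # 0.405
```

Note that for a single cell this transform gives the same value for every gene with 1 UMI, so seeing two different processed values (0.96 and 0.71) for two 1-UMI genes suggests the processed object was normalized on a matrix with different cell totals or gene sets, or underwent extra steps; confirming the pipeline order with the authors is the reliable route.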

Fastq annotations for Plate_seq fastq files

There is a fastq_annotated.csv file in s3://czb-tabula-muris-senis/Plate_seq/3_month/, which lists the samples and corresponding s3 URIs.
However, there is no such file for the other age groups - 18, 21 and 24 months (Plate_seq/18_month/, Plate_seq/21_month, Plate_seq/24_month, respectively).
So I tried to work around the problem by searching for the cell IDs from tabula-muris-senis-facs-official-raw-obj__cell-metadata.csv in /metadata and comparing each cell ID against the URIs of all plate-seq files retrieved with the AWS CLI command aws s3 ls s3://czb-tabula-muris-senis/Plate_seq/${month}_month/.
Here I ran into another problem: some cells have two different fastqs with the same cell ID, where only the parent directory names differ.
This is the case for the following example:

s3://czb-tabula-muris-senis/Plate_seq/3_month/170925_A00111_0066_AH3TKNDMXX/fastqs/A1-B000126-3_39_F-1-1_R1_001.fastq.gz
s3://czb-tabula-muris-senis/Plate_seq/3_month/170925_A00111_0066_AH3TKNDMXX__170925_A00111_0067_BH3M5YDMXX/fastqs/A1-B000126-3_39_F-1-1_R1_001.fastq.gz

Could you please provide the FASTQ annotation data?
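A small stdlib sketch for flagging such duplicates across run directories (the bucket and run names below are shortened placeholders; real URIs would come from `aws s3 ls`):

```python
import posixpath
from collections import defaultdict

# Shortened placeholder URIs standing in for the `aws s3 ls` output.
uris = [
    "s3://bucket/Plate_seq/3_month/runA/fastqs/A1-B000126-3_39_F-1-1_R1_001.fastq.gz",
    "s3://bucket/Plate_seq/3_month/runA__runB/fastqs/A1-B000126-3_39_F-1-1_R1_001.fastq.gz",
    "s3://bucket/Plate_seq/3_month/runA/fastqs/A2-B000126-3_39_F-1-1_R1_001.fastq.gz",
]

# Group URIs by fastq basename; any basename with more than one URI is a
# cell whose fastq appears under several run directories.
by_basename = defaultdict(list)
for uri in uris:
    by_basename[posixpath.basename(uri)].append(uri)

duplicates = {name: paths for name, paths in by_basename.items() if len(paths) > 1}
for name, paths in duplicates.items():
    print(name, "->", len(paths), "locations")
```

This at least produces a definitive list of affected cell IDs to attach to the issue, even though only the authors can say which copy is authoritative.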

About data interpretation

Hi,
I want to know how I can understand how the expression of a gene changes with age in a particular tissue from Tabula Muris Senis. Is there any information available online that I can refer to in order to understand it?

I tried selecting the tissue/age/gender/cell type/gene I am interested in. It just seems the same to me and is not understandable.

thanks!

Facs sorted data - assayed genes

Hi,

we (Charlotte and I) are trying to convert the existing h5ad files to a merged SingleCellExperiment object to be used in R/Bioconductor via the https://github.com/csoneson/TabulaMurisData package.

I noticed upon loading the files via scanpy that not all subsets share the same set of genes.

>>> adata_facs_bat.X
<1561x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 2728411 stored elements in Compressed Sparse Row format>
>>> adata_facs_Bladder.X
<1740x16553 sparse matrix of type '<class 'numpy.float32'>'
    with 8153961 stored elements in Compressed Sparse Row format>
>>> adata_facs_Brain_Myeloid.X
<8956x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 15979574 stored elements in Compressed Sparse Row format>
>>> adata_facs_Brain_NonMyeloid.X
<4614x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 12401088 stored elements in Compressed Sparse Row format>
>>> adata_facs_Diaphragm.X
<1608x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 2776493 stored elements in Compressed Sparse Row format>
>>> adata_facs_gat.X
<2531x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 7571811 stored elements in Compressed Sparse Row format>
>>> adata_facs_Heart.X
<3104x21190 sparse matrix of type '<class 'numpy.float32'>'
    with 10448154 stored elements in Compressed Sparse Row format>
>>> adata_facs_Kidney.X
<1400x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 2173284 stored elements in Compressed Sparse Row format>
>>> adata_facs_Large_Intestine.X
<5942x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 22461504 stored elements in Compressed Sparse Row format>
>>> adata_facs_Limb_Muscle.X
<2334x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 4077488 stored elements in Compressed Sparse Row format>
>>> adata_facs_Liver.X
<1679x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 5148416 stored elements in Compressed Sparse Row format>
>>> adata_facs_Lung.X
<3532x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 7385690 stored elements in Compressed Sparse Row format>
>>> adata_facs_Mammary_Gland.X
<3132x17232 sparse matrix of type '<class 'numpy.float32'>'
    with 11823042 stored elements in Compressed Sparse Row format>
>>> adata_facs_Marrow.X
<9734x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 26197974 stored elements in Compressed Sparse Row format>
>>> adata_facs_mat.X
<1960x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 5013970 stored elements in Compressed Sparse Row format>
>>> adata_facs_Pancreas.X
<2551x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 9105733 stored elements in Compressed Sparse Row format>
>>> adata_facs_scat.X
<2723x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 7491027 stored elements in Compressed Sparse Row format>
>>> adata_facs_Skin.X
<3468x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 10444947 stored elements in Compressed Sparse Row format>
>>> adata_facs_Spleen.X
<2812x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 5185231 stored elements in Compressed Sparse Row format>
>>> adata_facs_Thymus.X
<2629x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 5565327 stored elements in Compressed Sparse Row format>
>>> adata_facs_Tongue.X
<2776x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 11608627 stored elements in Compressed Sparse Row format>
>>> adata_facs_Trachea.X
<2353x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 6583927 stored elements in Compressed Sparse Row format>

I see that the majority have 22899 genes, so I was wondering whether additional steps were applied to the files that have fewer - ideally, were genes filtered out if not detected in any cell?

For the droplet data, this problem does not show up and all subsets have 19860 genes.

Would it be possible to have the "original data" uploaded also for Bladder, MammaryGland, and Heart? (assuming the genes are all in the same order)

I'm tagging @csoneson to follow up on this one.

Thanks in advance!

Federico
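Until harmonized objects are available, one workaround (a sketch with toy gene lists, not the authors' pipeline) is to restrict every per-tissue object to the genes shared by all of them before merging:

```python
# Toy gene lists standing in for each per-tissue object's var_names.
genes_per_tissue = {
    "Bladder": ["Actb", "Gapdh", "Cd3e"],
    "Heart": ["Actb", "Gapdh", "Myh6", "Cd3e"],
    "Lung": ["Actb", "Gapdh", "Cd3e", "Sftpc"],
}

# Intersect all gene sets to get the genes present in every object.
shared = set.intersection(*(set(g) for g in genes_per_tissue.values()))
print(sorted(shared))  # ['Actb', 'Cd3e', 'Gapdh']

# On real objects (assumes anndata; not run here):
#   shared = set.intersection(*(set(a.var_names) for a in adatas))
#   adatas = [a[:, sorted(shared)] for a in adatas]
```

Subsetting to the intersection loses the genes unique to the larger objects, so it is a stopgap rather than a substitute for the original unfiltered uploads requested above.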

Brain Non-myeloid data analysis questions

Hi!
Firstly, I want to thank you for providing such an amazing dataset!

I've been learning and working with the Brain Non-Myeloid dataset, and over time I've accumulated quite a few questions that I'd like to ask someone. I am relatively new to RNA-seq data analysis and would love to understand the data more in depth than I do now. Would it be possible to contact someone to clear a few things up?
Thank you very much!

Best wishes,
Gabriele

How to find the raw data of cardiomyocytes and heart? (in AWS)

There are three questions here:

  1. Can you tell me which folder contains the raw data of the heart at 1m and 30m? (figure 1)
  2. Are the raw data for 3m, 18m, 21m and 24m in the folders with HEART in their names? (figure 2)
  3. Myocardial cells are in a separate folder; how do I distinguish between the months? (figure 3)

Cluster labels for droplet data

Is there an easy way to get the cluster labels for the droplet data without parsing through all of the individual scanpy files? The AWS bucket contains such a metadata file, but it only contains cells from the FACS protocol. Thanks!

Cannot load Brain_Non-Myeloid_facs.h5ad

Hi,

really nice resource and cool that you provide pre-processed anndata files.
I downloaded the brain datasets: the myeloid one loads without any problem, the non-myeloid one throws an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-d7f4867a0914> in <module>
----> 1 adata_nonmyeloid = sc.read_h5ad("/Users/giovanni.palla/Projects/spatial-scripts/dat/tabula-muris-senis/Brain_Non-Myeloid_facs.h5ad")

~/miniconda3/envs/scanpy-issues/lib/python3.6/site-packages/anndata/_io/h5ad.py in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    427     _clean_uns(d)  # backwards compat
    428 
--> 429     return AnnData(**d)
    430 
    431 

~/miniconda3/envs/scanpy-issues/lib/python3.6/site-packages/anndata/_core/anndata.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx)
    296                 varp=varp,
    297                 filename=filename,
--> 298                 filemode=filemode,
    299             )
    300 

~/miniconda3/envs/scanpy-issues/lib/python3.6/site-packages/anndata/_core/anndata.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, varp, obsp, raw, layers, dtype, shape, filename, filemode)
    495             self._raw = None
    496         elif isinstance(raw, cabc.Mapping):
--> 497             self._raw = Raw(self, **raw)
    498         else:  # is a Raw from another AnnData
    499             self._raw = Raw(self, raw._X, raw.var, raw.varm)

~/miniconda3/envs/scanpy-issues/lib/python3.6/site-packages/anndata/_core/raw.py in __init__(self, adata, X, var, varm)
     30             self._X = X
     31             self._var = _gen_dataframe(var, self.X.shape[1], ["var_names"])
---> 32             self._varm = AxisArrays(self, 1, varm)
     33         elif X is None:  # construct from adata
     34             self._X = adata.X.copy()

~/miniconda3/envs/scanpy-issues/lib/python3.6/site-packages/anndata/_core/aligned_mapping.py in __init__(self, parent, axis, vals)
    229         self._data = dict()
    230         if vals is not None:
--> 231             self.update(vals)
    232 
    233 

~/miniconda3/envs/scanpy-issues/lib/python3.6/_collections_abc.py in update(*args, **kwds)
    844                     self[key] = other[key]
    845             else:
--> 846                 for key, value in other:
    847                     self[key] = value
    848         for key, value in kwds.items():

ValueError: not enough values to unpack (expected 2, got 1)

I'm using scanpy 1.4.5.1 and anndata 0.7.1
With a quick check I could not figure out if it's an issue with anndata, was wondering if you observe the same problem.

Thanks a lot!
Giovanni
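Failures like this were reported around anndata 0.7.x reading files written by older versions, so upgrading anndata is the first thing to try. If that doesn't help, a workaround some users apply is to delete the raw `varm` group from a copy of the file with h5py, since the traceback dies while rebuilding `Raw` from that mapping. The group paths below (`raw.varm` vs `raw/varm`) are assumptions about how different anndata versions lay out the HDF5 file, not verified against this exact dataset, so inspect the file first and only ever edit a copy.

```python
import os
import shutil

import h5py

SRC = "Brain_Non-Myeloid_facs.h5ad"
DST = "Brain_Non-Myeloid_facs.fixed.h5ad"


def drop_group(path, keys=("raw.varm", "raw/varm")):
    """Delete any matching HDF5 group in place (layout differs by anndata version)."""
    with h5py.File(path, "r+") as f:
        for key in keys:
            if key in f:
                del f[key]


if os.path.exists(SRC):
    shutil.copyfile(SRC, DST)  # never edit the original download
    drop_group(DST)
```

After this, `sc.read_h5ad(DST)` should at least get past `Raw.__init__`; the raw per-gene annotations are lost, but `raw.X` and everything else remains.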

mm10plus

Hi,

I see that tabula muris senis used "mm10plus" as genome reference. I am assuming it is a modified version of mm10. If so, may I know what modifications/adjustments were made?

Thanks!

bbknn metadata file

Hi,

thank you for providing all these beautiful single cell data.
I downloaded the "tabula-muris-senis-bbknn-processed-official-annotations.h5ad".
I am more familiar with Seurat, which is why I converted the file with the new
"seurat-disk" tool. Everything works fine, but the annotations are missing.
As far as I can tell, this is probably because the `uns` slot of the *.h5ad files
is not recognized by Seurat (or the intermediate h5seurat format).
However, this issue has been frequently brought up
(#6; satijalab/seurat#1427),
but for now not with the new created bbknn file (https://figshare.com/articles/dataset/Processed_files_to_use_with_scanpy_/8273102/3?file=23936555).

Since you provided a metadata.csv file for the facs data here (https://s3.console.aws.amazon.com/s3/buckets/czb-tabula-muris-senis?region=us-west-2&prefix=Metadata/), would it also be possible to create a metadata file for the bbknn file?
I already looked at GEO, but this one (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4505404) seems to have
about 100,000 fewer annotations than the bbknn file needs.
So if someone could provide the correct metadata file as *.csv, this would be highly appreciated.

Thanks

Information about liver dataset

Hello!
Thank you for the enormous work you have done and the great dataset you have created!
In my project I am investigating ageing cells in liver tissue and their interaction with the immune system. Since I have some specific questions about the data that I need to answer in order to use the dataset, I was wondering if I could contact someone to resolve these questions?
Many thanks!

Greetings,
Helene

Some FASTQ files on AWS missing reads?

Hi,

I have been trying to download the raw data for selected FACS-isolated cells. I downloaded a metadata spreadsheet containing the FASTQ locations from AWS here:

s3://czb-tabula-muris-senis/Metadata/tabula-muris-senis-facs-official-raw-obj__cell-metadata__cleaned_ids__read1_read2.csv

I filtered for particular cell types of interest, and then additionally filtered for cells with high read/gene counts by cross-referencing the metadata with another spreadsheet that I downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4505405):

GSM4505405_tabula-muris-senis-facs-official-raw-obj-metadata.csv

I joined the two spreadsheets on the index field (GEO spreadsheet)/obs_names field (AWS spreadsheet).

I noticed that at least some of the FASTQ files from AWS have many fewer reads than I expected based on the GEO metadata. For example, cell A10-D042044-3_9_M-1-1 has 4,111 genes detected and 11,915,818 counts according to the GEO metadata, but when I download the FASTQ files from the following S3 keys given in the AWS spreadsheet the resulting files have fewer than 200 reads:

s3://czb-tabula-muris-senis/Plate_seq/3_month/170907_A00111_0052_AH2HTCDMXX/fastqs/A10-D042044-3_9_M-1-1_R1_001.fastq.gz
s3://czb-tabula-muris-senis/Plate_seq/3_month/170907_A00111_0052_AH2HTCDMXX/fastqs/A10-D042044-3_9_M-1-1_R2_001.fastq.gz

I have double-checked to make sure that this is not just a corrupted download issue. The size of the two FASTQ files in the AWS S3 bucket is quite small -- about 28 KB -- and I also noticed that the other FASTQs in that S3 directory are also small. Most other FASTQ files in the other directories are in the tens of MB, but these are in the tens of KB. Was there an issue uploading some of the FASTQ files to S3? Or am I trying to download the wrong FASTQ files? Any help you could provide would be greatly appreciated! Thank you!
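To rule out truncated downloads systematically, the read count of each downloaded file can be checked with the standard library alone (a FASTQ record is exactly 4 lines). The helper name below is made up; run it over the `*_R1_001.fastq.gz` / `*_R2_001.fastq.gz` files quoted above. Given the GEO count of ~12 M for this cell, a result under 200 points at the S3 object itself rather than the download.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a (possibly gzipped) FASTQ file: 4 lines per read."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as fh:
        n_lines = sum(1 for _ in fh)
    if n_lines % 4:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4
```

A non-zero remainder from the modulo check would additionally indicate a file truncated mid-record, as opposed to one that was simply uploaded with too few reads.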

Metadata labels

Hi,

I am importing the .h5ad files into R using Seurat and all the metadata information, e.g. cell.ontology.class, is stored as integer codes (0, 1, 2, 3) for each factor rather than their names. Is there a way that I can get the name information?
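This usually happens because `.h5ad` files store categorical obs columns as integer codes plus a separate list of category names, and some converters only carry the codes across. If the names can be recovered (e.g. from `uns["<column>_categories"]` in older files), pandas can map codes back to labels; the arrays below are purely illustrative, not taken from the dataset.

```python
import pandas as pd

# Illustrative stand-ins: in older .h5ad files the obs column holds
# integer codes and uns["<column>_categories"] holds the label list.
codes = [0, 2, 1, 0]
categories = ["B cell", "T cell", "macrophage"]

labels = pd.Categorical.from_codes(codes, categories=categories)
print(list(labels))  # ['B cell', 'macrophage', 'T cell', 'B cell']
```

Alternatively, exporting `adata.obs.to_csv(...)` from scanpy before conversion keeps the labels as strings, which can then be re-attached to the Seurat object by cell barcode.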
