allencellmodeling / pytorch_integrated_cell
Integrated Cell project implemented in pytorch
can latent dims be associated with cell size etc?
Another aspect of dimensionality reduction that is woefully underexplored is the possibility of interpreting the latent space dimensions. Since they are sorted by the amount of variance explained, the first dimension (for all beta) should be something like cell size. This can be easily confirmed by generating images walking along that axis, where the value for z1 is varied from, say, -3 to +3 (in increments of 0.5) and the values for z2, z3 ... zn are kept at zero. I was intrigued to learn from Rory on Friday that the dimensions in the latent space should also be roughly orthogonal to each other. If this is correct, then the second dimension should be something like cell aspect ratio. Again, this can easily be checked by generating images where z2 is varied and all the other values are kept at zero. If the beta-VAE is working as advertised, it should be possible to substitute it rationally for traditional dimensionality reduction methods like PCA. Just inspecting the variation along the various latent space dimensions should go a long way toward demonstrating that utility.
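A minimal sketch of the traversal described above, assuming a trained decoder (the `decode(z)` call in the comment is a hypothetical name); the code only builds the grid of latent vectors:

```python
import numpy as np

def axis_traversal(dim, n_dims=512, lo=-3.0, hi=3.0, step=0.5):
    """Build latent vectors that walk along one latent dimension.

    All other dimensions are held at zero, as described above.
    """
    values = np.arange(lo, hi + step / 2, step)  # -3.0, -2.5, ..., +3.0
    zs = np.zeros((len(values), n_dims))
    zs[:, dim] = values
    return zs

# Walk along z1 (index 0); each row would be passed to the trained
# decoder to generate one image, e.g. images = [decode(z) for z in zs]
zs = axis_traversal(dim=0)
```

The same call with `dim=1` gives the z2 (aspect-ratio) traversal.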
Clean up figure and move to white-background cell images
Get rid of the black boxes with white Xs; lack of data is better conveyed by a simple white space. The faint magenta cell membrane label on the black background is very, very hard to see; I recommend that (essentially all) images be presented on a white background instead, like in Fig. 1.
better definition of "sampling"
It is not at all clear from the text what is meant by “sampling” (7 instances in part A, 3 in part B). What determines the variation in the “samples”? It is very striking that the independent samples of the nuclear lamin all look just about exactly alike, while the sampled mitochondrial images are quite different (blobbier or fuzzier). I suspect this must reflect the intrinsic underlying variation in the organelle distributions, which varies by organelle type, but this is not explained at all.
Largely naming corrections
Table 1: “Golgi” should probably be called “Golgi apparatus”. The gene name for “Plasma membrane” should not be AAVS1. This is the “safe harbor” locus. The fluorescent protein label in that cell line is fused to a membrane-targeting CAAX tag (a site for prenylation). Please ask Ru how this should be abbreviated in the “gene name” column. Also please ask Susanne and Chris whether it would be better to say “nuclear envelope” or “nuclear lamina” (the lamina is, technically, part of the envelope, but people often are thinking of the double membrane when they say “envelope”, and we have a different cell line (nucleoporin Nup153) that has a tag embedded in the nuclear membrane).
conda env create --name int_cell_refactor2 -f=environment.yml
Using Anaconda API: https://api.anaconda.org
Solving environment: failed
ResolvePackageNotFound:
- cuda91=1.0
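`cuda91` was historically published on the `pytorch` Anaconda channel, so one likely fix (an assumption; this environment has not been verified) is to make sure that channel is listed in environment.yml:

```yaml
# hypothetical environment.yml fragment; the key point is the pytorch channel
channels:
  - pytorch
  - defaults
dependencies:
  - cuda91=1.0
```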
we underplay how good the encoded / generated cells look
It is a HUGE WIN that the model is learning appropriate organelle distributions. This is underplayed in the text to a point well beyond modesty or even diffidence, just mentioned offhand in lines 127-128 about how the images “appear”, without even any comment about whether the appearance matches with what is known to be correct for those structures. It is TOTALLY AWESOME that the model learned to put the tight junctions at the appropriate apical location! Remember that this was a big part of the motivation to undertake the full-on 3D retraining of the model in January. I think that another panel should be added to the figure to emphasize this point. I suggest asking Thao to draw a “typical” interphase iPS cell with a bunch of the organelles labeled, and show LARGER IMAGES for a single cell (close to the middle of the distribution) with the “most likely” reconstructions for a bunch of structures where there is a simple rule that can be articulated. For example, mitochondria should be distributed throughout the cytoplasm, but are never found inside the nucleus. The nuclear lamina forms a closed shell around the DNA. The tight junction is at the apical side and around the cell periphery. The nucleolus (both fibrillarin and nucleophosmin) form blobs that are always inside of the nucleus, never outside. Paxillin is always at the basal surface.
Lots of stuff
It might be worth considering including some images for the Golgi in brefeldin-treated cells as well as for microtubules in taxol-treated cells. It is clear that there is a population for the treated cells that is missing from the untreated (between -2 and -4 on the horizontal axis), as well as some cells that fall into the bulk of the distribution for untreated cells. It would be helpful to have at least one image for each. This is a low dose of brefeldin as such things go, so it is to be expected that some cells are not responding.
Is there a clear explanation for why the contingent latent spaces for tight junctions and for the Golgi are so narrowly distributed compared to the microtubules? I expect this has something to do with the actual intrinsic cell-to-cell variation for those structures. It should be at least noted.
It seems inappropriate to “interpret” the blue and pink bars for tight junctions in part C as revealing a role for microtubules in tight junction formation (lines 268-269). The actual difference between this and the brefeldin result is minuscule, regardless of which bar gets the star, and the magnitude (<0.5 standard deviations) is not sufficient to draw a biological conclusion without further support.
just a draft list
For cell+nuc, struct, and all three together:
save as csv.
color-code by mito annotations
What is there is fine. I think it would be nice to color-code the mitotic cells by the four annotated phases. For example, the dots that are very far away from the bulk of the interphase cells for the nuclear envelope are probably M4 or later, and those that remain clustered up with the interphase cells are probably M1/M2/M3.
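A sketch of the color-coding, assuming each cell carries an annotation string alongside its 2-D latent coordinates (the phase labels and colors below are hypothetical placeholders for the real annotation terms):

```python
# Hypothetical phase labels and colors; swap in the real annotation terms.
PHASE_COLORS = {
    "interphase": "lightgray",
    "M1": "tab:blue",
    "M2": "tab:green",
    "M3": "tab:orange",
    "M4": "tab:red",
}

def colors_for(annotations):
    """Map per-cell annotations to plot colors (unknown labels get black)."""
    return [PHASE_COLORS.get(a, "black") for a in annotations]

# Usage with matplotlib (not run here):
#   plt.scatter(z[:, 0], z[:, 1], c=colors_for(annotations), s=4)
```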
discuss new model re old one(s)
There are a lot of interesting things to say about the beta-VAE as compared to the GAN used in both previous manuscripts (arXiv:1705.00092 for the 2D version and bioRxiv 238378 for the 3D version). In particular, the ability to order the dimensions within the latent space rationally with respect to the amount of variance explained is a game changer for making the output of the autoencoder biologically interpretable. This really needs to be emphasized in the discussion as being useful both for the generative applications and for the dimensionality reduction applications.
Both the beta-VAE and the adversarial network end up enforcing a Gaussian prior in all latent space dimensions, but my understanding is that this is through very different mechanisms: the beta-VAE simply includes the KL divergence as part of the loss function (simple and elegant) while the GAN was trying to make the reconstructions similar between real images and constructed images drawn from an n-dimensional Gaussian (works, but is kind of fussy and ad hoc). I suspect that this also puts the generated images drawn from the beta-VAE on a much firmer statistical footing with respect to drawing conclusions about differences between cell populations. It would be nice to discuss this a little bit in the discussion, particularly in the context of facilitating the use of this for cell biological research. In more personal terms, Greg was very excited about the GANs a couple of years ago but then soured on them to the extent that in this manuscript he didn’t really even want to acknowledge the profound difference between the prior art and this one. He went on some kind of intellectual journey that convinced him that GANs should be abandoned and beta-VAEs embraced instead. It would be a service to the community to communicate what was learned that led to this conclusion.
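The beta-VAE mechanism is compact enough to state in a few lines. A minimal numpy sketch of the objective for a diagonal-Gaussian posterior, where `recon_loss` stands in for whatever reconstruction term the model actually uses:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

def beta_vae_loss(recon_loss, mu, logvar, beta):
    # The only change from a plain VAE is the beta weight on the KL term.
    return recon_loss + beta * kl_to_standard_normal(mu, logvar)
```

The KL term is zero when the posterior already matches the prior, which is what makes the Gaussian enforcement "simple and elegant" compared to the adversarial route.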
feature histograms are all messed up
There is something deeply wrong with the output in part C. Look at “cell shape area” (which I would call, um, cell area, or projected cell area). These are z-scored so I don't know the actual numbers, but for sure the actual data should vary over at least a two-fold range (for cells at different times in their cell cycle). As beta is increased, the generated cell areas systematically increase (?) until, by the time we get to beta = 0.99, the average cell area is 7.5 standard deviations (!!!) off from the actual cell size; that is, it must be either extremely large or extremely tiny. There is no way this is right; it is flatly contradicted by the images in part B, and it doesn't make any sense. Can we please look at the actual numbers here (area in square micrometers), not the z-scores? Similarly, it seems unreasonable to have this systematic march in z-scores for the DNA intensity to, again, 7.5 standard deviations away from the median of the actual data. Again, this makes no sense at all given the images in part B.
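For reference, converting z-scores back to raw units only needs the feature's population mean and standard deviation; the numbers below are made up purely for illustration:

```python
def z_to_raw(z, mean, std):
    """Invert z-scoring: raw = mean + z * std."""
    return mean + z * std

# Illustration with made-up numbers: if real cells had mean projected
# area 550 um^2 with std 150 um^2, a z-score of 7.5 would imply
area = z_to_raw(7.5, mean=550.0, std=150.0)  # 1675 um^2, ~3x the mean
```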
don't be ashamed
You all seem a little abashed that this is a “plain vanilla” beta-VAE and not some kind of breakthrough in the ML space. This is FINE! It is almost always better to use the simplest possible alternative. The merit of this paper is the cell biological application, and the way that the beta-VAE makes it possible to put quantitative values (and probabilities of specific observations) on things that are very hard to describe in anything other than qualitative terms. We are aiming for a computational biology audience, not a CS/ML audience.
top-2 dim latent space plots colored by mito annotation
I strongly suggest that we include the plot Rory made showing the distribution of the four labeled mitotic phases vs interphase in the top two dimensions of the latent space as a new part A. This is a very, very important sanity check showing that the model is “learning” appropriate variations within the data for subpopulations of cells that we know for sure have distinct shapes and internal organizations. Another very useful sanity check would be to show a few key structures in the top two dimensions of their target structure (conditional) latent space zt, also color-coded by mitotic phase. Specifically, for the microtubule zt the metaphase cells should look very “uncommon” but clustered, just as they do for the reference shape plot, but for organelles that are more uniformly distributed through the cytoplasm (mitochondria, peroxisomes, lysosomes) their distributions in the metaphase cells should more closely follow the underlying distribution for all cells (contingent on reference shape). Again it might be useful to get some drawings from Thao to indicate what is expected here.
Make q-q plots of real vs fake
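A sketch of the underlying computation, on synthetic stand-in data; plotting `q_real` against `q_fake` with matplotlib (plus the identity line) would give the q-q figure:

```python
import numpy as np

def qq_points(real, fake, n_quantiles=99):
    """Matched quantiles of two samples; points on y=x mean the distributions agree."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    return np.quantile(real, probs), np.quantile(fake, probs)

rng = np.random.default_rng(0)
q_real, q_fake = qq_points(rng.normal(size=1000), rng.normal(size=1000))
# plt.plot(q_real, q_fake, "."); plt.plot(q_real, q_real)  # identity line
```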
gan -> bVAE
remove LaTeX compilation errors
The last flurry of edits before Greg left added dozens of errors and warnings.
hard truncation of latent space and look at reconstructions
Again by analogy with PCA, it should be possible to truncate the latent space (force all dimensions beyond some chosen zn to go to zero) and still get pretty decent reconstructions, if they are really sorted in order in a meaningful way such that the later dimensions are “noise”. This could also be given a simple reality check by choosing an intermediate value of beta (say 0.4 or 0.5), truncating at some modest number of dimensions (I’m gonna say 10, but this should be done empirically) and then examining how much the reconstruction loss suffers by truncation. The reason this is useful is because it is much easier to do meaningful comparative statistics (cell population A is different from or indistinguishable from cell population B) in fewer dimensions than 512.
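The truncation itself is trivial, assuming the encoded data `z` is an (n_cells, n_dims) array with dimensions already sorted by explained variance; the interesting part is re-decoding and measuring how much the reconstruction loss suffers:

```python
import numpy as np

def truncate_latent(z, keep):
    """Zero out all latent dimensions beyond the first `keep`."""
    z_trunc = np.array(z, copy=True)
    z_trunc[:, keep:] = 0.0
    return z_trunc

# e.g. compare recon_loss(decode(z)) against
#      recon_loss(decode(truncate_latent(z, 10)))   # decode() is the trained decoder
```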
Add some citations and buffer text to the intro
Convolutional neural networks are becoming more commonly used for image analysis in cell biology. Most applications use CNNs to perform classification tasks. These generally fall into two categories:
- Pixel-by-pixel classification such as for segmentation (D. A. van Valen et al., 2016, PLoS Comp Biol 12: e1005177), for label-free prediction (C. Ounkomol et al., Ref. 6; E. M. Christiansen et al., 2018, Cell 173: 792), for image restoration (M. Weigert et al., Ref. 7; W. Ouyang et al., 2018, Nat. Biotech. 36: 460).
- Cell-by-cell classification such as for predicting cell fates (A. Waisman et al., 2019, Stem Cell Rep. 12: 845), for classifying cell cycle status (P. Eulenberg et al., 2017, Nat. Commun., 8: 463), for distinguishing motility behaviors of different cell types (J. Kimmel et al., 2019, IEEE/ACM Trans. Comp. Biol. Bioinf. doi: 10.1109/TCBB.2019.2919307) (probably also many other examples).
as beta changes, plot KL vs sorted(dims) and show what happens (maybe more peaked?)
Another feature of the reconstruction vs. compactness trade-off explored by tuning beta should, I think, be reflected in the amount of variance encapsulated in each of the (ordered) latent space dimensions. At some point I saw a plot of variance vs. dimension that showed a pretty strong fall-off over the first few dimensions, flattening down to some kind of noise baseline at about 40. I think that as beta is increased, the falloffs in these curves should get steeper and steeper, as the model is shoving more information into a more sparse latent space. I think it would be good to include these curves in the figure, or at the very least in the supplement, and comment on the utility of tuning beta for the purposes of dimensionality reduction.
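A sketch of one such curve, assuming `z` holds the encoded training set as an (n_cells, n_dims) array; overlaying one curve per beta on the same (log) axes would show whether the falloff steepens as predicted:

```python
import numpy as np

def variance_curve(z):
    """Per-dimension variance of the encoded data, sorted descending."""
    return np.sort(z.var(axis=0))[::-1]

# e.g. plt.semilogy(variance_curve(z_beta)) for each beta, then compare falloffs
```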
Make our two key contributions stand out
Here we explore the use of a CNN for a completely distinct kind of application in cell biology, NOT a classification task. Instead we develop a conditional autoencoder framework that has two applications:
- Statistically accurate image generation based on learning correlated features within a large image data set, to predict distributions of fluorescent labels that are not directly observed. It is important to emphasize that this approach is COMPLETELY DIFFERENT from the pixel-by-pixel classification approaches described above, as we can use this to learn and measure population distributions of organelles within cells, explore their relationships to one another, etc., not just predict a distribution in a given transmitted light image. The VAE architecture also makes the generative function much more flexible than other alternatives, in that we can generate expected organelle distributions for an artificial chosen reference shape (e.g. a cubical cell with a spherical nucleus).
- Practical, nonlinear dimensionality reduction for extremely high dimensional image data (number of voxels * number of channels). This enables us to, for example, construct a statistically meaningful “average” cell from a population, determine whether a particular cell represents a common or unusual phenotype, and quantitatively measure changes in cell organization as a function of cell state (mitotic state, drug treatment, etc.).
Fewer images / organelles / conditions
Too many sample images, way too teeny. Five columns should be enough in part B to make the point. The label at the bottom in part B should be “beta” not B. Labels in part C are goofy (dna intensity intensity std). It is weird and confusing to have mitotic cells sampled in some but not all of the values of beta. Either we should deliberately include them in all betas, or we should not have any mitotic images (mitosis can be better explored in Fig. 4).
Add a discriminator on the latent space representation to enforce that latent space representations are not distinguishable from each other on a class basis.
Script to make slices from 3D images
minor aesthetics but otherwise figure 1 is fine
Figure 1 is fine. I don’t love the red boxes around “Reference structure variation” and “Image structure variation” as these make it look like a PowerPoint slide. Maybe check in with Thao about aesthetics.
pass in parent directory
save things in a row-per-comparison way
talk up coupling summary stats
The contingency results summarized in part B are really great, again the specific couplings observed are a HUGE WIN for demonstrating that the model is learning what we think it should be learning in terms of important cell biological relationships. There needs to be a lot more explanation in the text (beyond the examples casually tossed off in lines 206-213). I am happy to help with this.
inner loop and outer loop both use i
Both use i for the loop variable; change the inner one to j, or ii, or whatever.
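For the record, the shadowing bug and the fix look like this (names here are hypothetical, not the actual code):

```python
# Bug: the inner loop reuses i, clobbering the outer index.
# for i in range(n_rows):
#     for i in range(n_cols):   # shadows the outer i
#         process(i, i)         # the outer index is lost

# Fix: rename the inner loop variable.
pairs = []
for i in range(2):
    for j in range(3):
        pairs.append((i, j))
```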
orthogonal latent space ==> reproducible training?
If it is really true that the latent space dimensions also end up tending towards being orthogonal to each other, as Rory mentioned Friday, then it should also be the case that retraining a model with the same data set permuted into a different order should give a pretty congruent latent space. I am explicitly not asking for any more retraining associated with this manuscript but I do think this is something we should definitely explore moving forward, ideally with the simplified 2D reference-only data set used in Fig. 3.
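One cheap congruence check, assuming `za` and `zb` are the two models' encodings of the same cells: correlate each leading dimension of one model against all dimensions of the other and ask whether every leading dimension has a strong partner (up to sign and permutation):

```python
import numpy as np

def best_matches(za, zb, n_lead=10):
    """Max |correlation| of each leading dim of za against all dims of zb."""
    a = (za - za.mean(axis=0)) / za.std(axis=0)
    b = (zb - zb.mean(axis=0)) / zb.std(axis=0)
    corr = np.abs(a.T @ b) / len(za)  # (dims_a, dims_b) correlation matrix
    return corr[:n_lead].max(axis=1)

# Congruent latent spaces should give values near 1 for the leading dims.
```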
During the 'pip install -e' step, the following error was encountered:
Collecting lkaccess (from pytorch-integrated-cell==0.1)
ERROR: Could not find a version that satisfies the requirement lkaccess (from pytorch-integrated-cell==0.1) (from versions: none)
ERROR: No matching distribution found for lkaccess (from pytorch-integrated-cell==0.1)
An online search on 'lkaccess' returned no results. From utils.py it looks like the package may be specific to AICS.
decide how to handle raw vs seg as interesting or not.
I think the emphasis on raw image vs. segmentation input in the current introduction is a red herring. First, not all feature extraction requires segmentation (e.g. very few of the hundreds of CellProfiler features do). Second, it would be fine (desirable!!!) to use segmented images as input into the autoencoder, so don’t make it sound like segmentations are bad.
Listing PlosCompBiol submission guidelines
Full guidelines:
https://journals.plos.org/ploscompbiol/s/submission-guidelines
Talk about how loss function can affect results
It would also be nice in the discussion to offer some thoughts and speculations about other ways to tailor the loss function for specific cell biological applications. Rory had some interesting suggestions along these lines when he spoke to us on Friday.
Allen Institute Contribution Agreement
This document describes the terms under which you may make “Contributions” — which may include without limitation, software additions, revisions, bug fixes, configuration changes, documentation, or any other materials — to any of the projects owned or managed by the Allen Institute. If you have questions about these terms, please contact us at [email protected].
You certify that:
· Your Contributions are either:
1. Created in whole or in part by you and you have the right to submit them under the designated license (described below); or
2. Based upon previous work that, to the best of your knowledge, is covered under an appropriate open source license and you have the right under that license to submit that work with modifications, whether created in whole or in part by you, under the designated license; or
3. Provided directly to you by some other person who certified (1) or (2) and you have not modified them.
· You are granting your Contributions to the Allen Institute under the terms of the 2-Clause BSD license (the “designated license”).
· You understand and agree that the Allen Institute projects and your Contributions are public and that a record of the Contributions (including all metadata and personal information you submit with them) is maintained indefinitely and may be redistributed consistent with the Allen Institute’s mission and the 2-Clause BSD license.
3D movie:
Fix up figure pretty significantly
I had not appreciated that there were only 23 cells with the Golgi tag in the paclitaxel-treated group. I think perhaps we should leave these out as it makes that class so imbalanced with respect to the others. Part B should show the marginal distributions as well as just the dots, and please label the damned axes (z1 and z2, I presume). Part D has way too many tiny, tiny images, and the conclusion is stepped on a bit by only walking from centroid to centroid. I think it would be better to take a slice across the whole distribution and walk from -4, -2 up to +4, +2 (i.e. along the same direction as the current traverse, but capturing both sides of the distribution). 7 images should be plenty. I don’t like using the “centroid distance from untreated” as a metric for statistical significance here in Part C, and the explanation given in the methods (lines 431-433) is wildly insufficient to understand what was done (also not a sentence). We really need some kind of measurement that compares the whole distribution. Let’s discuss alternatives. This is a place where truncation of the latent space to a smaller number of dimensions might offer better options for summary statistics (I think right now the “distance” is calculated in all dimensions, but please correct me if I’m wrong).
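As one concrete alternative (a sketch, not a proposal to settle on): a two-sample Kolmogorov-Smirnov statistic on a 1-D projection of the (possibly truncated) latent space, which compares whole distributions rather than centroids:

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    both = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), both, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), both, side="right") / len(y)
    return np.max(np.abs(cdf_x - cdf_y))

# e.g. project each population onto z1 (or a discriminant axis) first,
# then get a p-value by permuting the population labels.
```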