
Comments (6)

jlause commented on June 16, 2024

Hey @pavlin-policar,

Thanks for your thoughts and questions!

> How would you recommend handling multiple batches with this normalization scheme? In the past, I've used scTransform, which puts the batch variable into the model itself, so I guess it kind of regresses out the batch effect. It's worked very well for me in the past. [..]

I've quickly checked scTransform and its batch integration (which I have not used before myself). Seems like putting in the batch_var can only account for simple mean-shift batch effects. Christoph Hafemeister wrote about that e.g. here: satijalab/sctransform#41 (comment), and he only recommends it for cases where the cell type composition is roughly similar between the batches.

> I also noticed that in your comment on the scanpy PR you mention applying this normalization to each batch separately and then just concatenating the results. Wouldn't this also be problematic? For instance, if I have two batches of different cell populations and one gene is never expressed in one batch, its residuals will always be zero, since it never deviates. In the other batch, say, the gene is always expressed; there the residuals will also be near zero, since the model mean can fit the constant expression.

Yes, I think what you describe is what would happen here. Yet I assume something similarly problematic would happen with the batch_var correction in scTransform: the model would mistake the marker gene that distinguishes the two populations for a batch-effect gene, modelling the two batches with different means, which should also result in low residuals. This fits with Christoph's recommendation not to use the batch_var method for batches that differ in cell type composition and are therefore expected to differ in biologically relevant ways!
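For concreteness, here is a minimal numpy sketch of this effect. It assumes the rank-1 mean model behind the analytic residuals (mu_cg = n_c * n_g / n, theta = 100); the toy counts and the marker/batch setup are invented for illustration:

```python
import numpy as np

def pearson_residuals(X, theta=100):
    """Analytic Pearson residuals with mu_cg = n_c * n_g / n
    (X: cells x genes count matrix, theta: NB overdispersion)."""
    n_c = X.sum(axis=1, keepdims=True)   # per-cell totals
    n_g = X.sum(axis=0, keepdims=True)   # per-gene totals
    mu = n_c * n_g / X.sum()             # expected counts under the null
    with np.errstate(invalid="ignore", divide="ignore"):
        r = (X - mu) / np.sqrt(mu + mu**2 / theta)
    return np.nan_to_num(r)  # genes with zero counts get residual 0, not NaN

rng = np.random.default_rng(0)
background = rng.poisson(5, size=(200, 10))               # genes shared by both batches
marker = np.concatenate([np.zeros((100, 1), int),         # silent in batch 1
                         rng.poisson(5, size=(100, 1))])  # expressed in batch 2
X = np.hstack([background, marker])

joint = pearson_residuals(X)
per_batch = np.vstack([pearson_residuals(X[:100]), pearson_residuals(X[100:])])

# Joint analysis: the marker's residuals separate the two batches.
print(joint[:100, -1].mean(), joint[100:, -1].mean())          # well separated, ~ -1.5 vs ~ +1.5
# Per-batch analysis: the marker's signal is flattened away.
print(per_batch[:100, -1].mean(), per_batch[100:, -1].mean())  # both ~ 0
```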

To be honest, we have not thought about this much (yet), and have not implemented or benchmarked any batch-corrected version of Pearson residuals as a data transform. We did include a batch option in the (upcoming) scanpy implementation of Pearson residuals for HVG selection. There we simply followed scanpy's default batch handling, which (I think) essentially makes the same assumption as scTransform's batch_var: that all batch-to-batch variation is purely technical and should be removed.
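For reference, a sketch of how that batch-aware HVG selection is meant to be called, assuming the implementation lands under scanpy.experimental as in the PR; the PBMC example dataset and the random batch labels are placeholders:

```python
import numpy as np
import scanpy as sc

adata = sc.datasets.pbmc3k()                 # any AnnData with raw counts
adata.obs["batch"] = np.random.default_rng(0).choice(
    ["a", "b"], size=adata.n_obs)            # toy batch labels for illustration

# With batch_key set, genes are selected within each batch and the
# per-batch rankings are then combined (scanpy's default batch handling).
sc.experimental.pp.highly_variable_genes(
    adata, flavor="pearson_residuals", n_top_genes=2000, batch_key="batch"
)
```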

> In your paper, in the Cao figure, I noticed you identified batch-specific genes and just removed them from the dataset. I'm unsure about this; isn't it possible these genes are biologically relevant?

I think that in this particular case this is rather unlikely: the batches were individual embryos, and each of them should have a comparable cell composition. Additionally, the vast majority of genes we excluded showed a strange, embryo-specific pattern (see figure below, from our Cao analysis notebook here).

[Figure: counts of the excluded genes plotted per embryo, showing the strange embryo-specific expression pattern; from the Cao analysis notebook.]

We also communicated with the original authors, who mentioned that the batch effect could be caused by alignment problems: the embryo-specific part of a UMI tag can be too similar to the sequence of an existing gene, causing spuriously high counts of certain genes in single embryos, as we observed. Of course, this pattern could be superimposed on a weaker biological signal; I agree that recovering it would require more sophisticated integration.

Overall, I want to stress that we were not after a general solution for batch effects in this analysis of the Cao data: the point was rather to demonstrate that Pearson residuals were the only method that revealed such a weird batch effect in the data, allowing us to investigate and remove it. One could use other, more sophisticated removal methods instead, and I agree there will be cases where simply dropping the affected genes is too harsh. To us, however, the important part was that this batch effect became visible only with Pearson residuals.

Let me know what you think!


pavlin-policar commented on June 16, 2024

Thanks for your very thorough answer!

> I've quickly checked scTransform and its batch integration (which I have not used before myself). Seems like putting in the batch_var can only account for simple mean-shift batch effects. Christoph Hafemeister wrote about that e.g. here: satijalab/sctransform#41 (comment), and he only recommends it for cases where the cell type composition is roughly similar between the batches.

That makes sense. I suspected this, but never actually dug into it. My understanding is that they add a batch term to the per-gene regression for the mean, i.e. to something like

$$\log \mu_{cg} = \beta_{0g} + \beta_{1g} \log_{10} m_c$$

(with $m_c$ the total UMI count of cell $c$), so I suppose this term is different for every gene. Or is this term shared across all genes?

Still, isn't this approach better than simply running the normalization on each batch separately, then concatenating the results? You'd be using more data to estimate the mean, which should alleviate overfitting (but I guess overfitting isn't an issue for this model). But then again, if you added a batch term, would the model still have an analytical solution?

Regarding the Cao dataset -- I thought as much. Did you have a chance to see whether scTransform is also able to show this batch effect? That would probably be extremely slow to compute, but it would still be interesting, even on a smaller subset.

In any case, the data I'm working with has a weak batch effect, and the composition is very similar across batches; that's probably why scTransform worked so well. Unfortunately, I haven't been able to find a good Python implementation of scTransform, and from what I can tell, this method works equally well, or at least the results are usually very similar to scTransform's (I remember reading this somewhere). However, if I do nothing about the batch effect, it is noticeable, so I want to remove it in some way. I suppose I'll try the concatenation approach, since that seems to be the most straightforward thing to implement.


dkobak commented on June 16, 2024

My feeling is that using gene-specific batch offsets should be equivalent to computing Pearson residuals on each batch separately and then concatenating the results...
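A hedged sketch of why, assuming the rank-1 mean model behind the analytic residuals: without batches the estimate is

$$\hat\mu_{cg} = \frac{n_c \, n_g}{n},$$

with $n_c$, $n_g$, $n$ the cell, gene, and grand totals. Allowing a gene- and batch-specific offset replaces the global gene fraction $n_g/n$ by the within-batch fraction, so for cell $c$ in batch $b(c)$, with $n_{g,b}$ and $n_b$ the within-batch gene and grand totals,

$$\hat\mu_{cg} = \frac{n_c \, n_{g,b(c)}}{n_{b(c)}},$$

which is exactly the mean one would compute on batch $b(c)$ alone; the residuals then coincide as well.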


pavlin-policar commented on June 16, 2024

I'm thinking of this: we could fake multiple batches. For instance, take a dataset with 1000 cells and randomly assign the cells to 10 batches of 100 cells each. Wouldn't doing the estimation on the dataset as a whole lead to more robust estimates and less noisy residuals than treating each of these "batches" of 100 cells separately?


dkobak commented on June 16, 2024

In our formulation, the model has only one parameter (an offset) per gene. So if you have 10 batches and allow batch-specific offsets (as scTransform apparently does), you end up with 10 parameters per gene when analyzing the data together. That's exactly the same as analyzing the 10 batches separately.
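For what it's worth, a quick numerical check of this equivalence (a minimal sketch; the toy counts and batch labels are made up, theta = 100 as in the paper):

```python
import numpy as np

def pearson_residuals(X, theta=100):
    """Analytic Pearson residuals with mu_cg = n_c * n_g / n."""
    n_c = X.sum(axis=1, keepdims=True)
    n_g = X.sum(axis=0, keepdims=True)
    mu = n_c * n_g / X.sum()
    return (X - mu) / np.sqrt(mu + mu**2 / theta)

rng = np.random.default_rng(1)
X = rng.poisson(3, size=(60, 20))
batches = np.repeat([0, 1, 2], 20)   # three toy batches, contiguous rows
theta = 100

# (a) one joint pass with gene- and batch-specific offsets: each cell's
# expected counts use its own batch's gene fractions p_{g,b} = n_{g,b} / n_b
n_c = X.sum(axis=1, keepdims=True)
p_gb = np.stack([X[batches == b].sum(0) / X[batches == b].sum() for b in (0, 1, 2)])
mu = n_c * p_gb[batches]             # each cell uses its batch's offsets
resid_offsets = (X - mu) / np.sqrt(mu + mu**2 / theta)

# (b) residuals computed on each batch separately, then concatenated
resid_separate = np.vstack([pearson_residuals(X[batches == b]) for b in (0, 1, 2)])

print(np.allclose(resid_offsets, resid_separate))  # True
```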


pavlin-policar commented on June 16, 2024

I guess that makes sense. I'll have to think about it a bit more. Thanks for the clarifications!

