combat-seq's Issues

Multiple corrections

I have an RNA experiment with 2 factors, one of which is confounded with the batch effect. Running ComBat-seq on my data as ComBat_seq(count_matrix, batch=batch, group=group), with group set to the factor that is not a linear combination of the batch, results in poor separation of my batches (only 2 out of 6 batches are separated). Running ComBat_seq(count_matrix, batch=batch) still does not yield a good separation of the data.
I have managed to obtain what I think is a good separation of my data only after running two consecutive corrections on the raw counts: ComBat_seq(count_matrix, batch=batch), followed by ComBat_seq(count_matrix, batch=batch, group=group). The same batch vector is used for the second correction.

My question is: is this doubly adjusted data usable for any DE gene analysis?

Genes reduced after ComBat-seq

Hi,
I have been using ComBat-seq for batch adjustment of a TCGA RNA-seq counts dataset. Before adjustment there were about 37,000 genes in the dataset, yet somehow only about 25,000 remained afterwards. Does ComBat-seq quietly perform any filtering? Or is this expected because of the negative binomial regression applied in the function?
Thanks.
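A quick sanity check for this kind of question (a sketch only; the object names tcga_counts and batch are assumptions, not from the thread) is to compare dimensions immediately before and after the ComBat_seq call, to see whether the gene loss happens inside ComBat-seq or elsewhere in the pipeline:

# Sketch: check whether ComBat_seq itself changes the number of genes.
dim(tcga_counts)                                    # e.g. 37000 x n samples
adjusted <- sva::ComBat_seq(tcga_counts, batch = batch, group = NULL)
dim(adjusted)                                       # compare the number of rows
setdiff(rownames(tcga_counts), rownames(adjusted))  # genes missing after adjustment, if any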

different output after changing group parameter

Hi! I get a very different output every time I try to specify the biological variable using the "group" parameter of ComBat-seq. We have RNA sequencing results from two different sites. I have 3 different groups and two different ages. I first got RNA-seq results for one age; then we increased the number of replicates and included another age, but using a different sequencing company. I used ComBat-seq to correct the batch effect for the samples from the 'young' age, and that resulted in 508 DEGs with DESeq2. To compare the two ages I need to correct the batch effect across all my samples: with group=age the results decreased to 181 DEGs; if I additionally take group into account via covar_mod, the results decrease even more (98 DEGs); and if I only remove the batch effect with group=NULL, I get 349 DEGs.

I would appreciate any input/thoughts on this.

Working with normalized datasets

I suppose we should input the raw count matrix rather than a normalized count matrix to ComBat-seq, so that the negative binomial assumption is satisfied. However, sometimes we may only have access to a normalized dataset, without the raw count matrix. Are there any suggested strategies for applying ComBat-seq in that scenario? Thank you.
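A commonly used alternative for already-normalized (log-scale) data, not part of ComBat-seq itself, is limma::removeBatchEffect. A minimal sketch, assuming logexpr is a genes x samples matrix of log-normalized values and batch/condition are per-sample factors:

library(limma)

# Assumed inputs: logexpr (genes x samples, log-scale normalized values),
# batch (factor of batch labels), condition (factor of biological groups).
design <- model.matrix(~condition)            # biology to preserve
corrected <- removeBatchEffect(logexpr,
                               batch  = batch,
                               design = design)
# 'corrected' stays on the log scale; it is intended for visualization/clustering,
# not as input for count-based DE tools such as DESeq2 or edgeR.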

Error in phi_matrix

Hi.
I tried to use the code from the tutorial (section 8 of "The sva package for removing batch effects and other unwanted variation in high-throughput experiments"), but the ComBat_seq function was not defined. So I defined the ComBat_seq function using ComBat-seq/ComBat_seq.R from GitHub.

However, the following error occurred:

" adjusted <- ComBat_seq(count_matrix, batch=batch, group=NULL)
Found 2 batches
Using null model in ComBat-seq.
Adjusting for 0 covariate(s) or covariate level(s)
Estimating dispersions

Error in phi_matrix[, batches_ind[[k]]] <- vec2mat(genewise_disp_lst[[k]], :
number of items to replace is not a multiple of replacement length
In addition: Warning message:
In result[lower.tri(result, diag = FALSE)] <- data :
number of items to replace is not a multiple of replacement length

Can you check this problem?
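A couple of quick sanity checks that are sometimes relevant to this error (purely a diagnostic sketch, not a confirmed cause): verify that the batch vector lines up with the columns of the count matrix and that no batch contains only one or two samples.

# Diagnostic sketch (assumed objects: count_matrix, batch)
stopifnot(is.matrix(count_matrix), is.numeric(count_matrix))
stopifnot(length(batch) == ncol(count_matrix))   # one batch label per sample
table(batch)                                     # look for batches with very few samples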

Using ComBat-Seq to adjust for batch effect/ colony ID

I am currently working on an RNA-Seq dataset with an unbalanced design and two factors: treatment and colony ID (I am working with social insects, so colony ID is important to control for because of pseudoreplication).
Unfortunately, the samples were dissected and extracted on three different days, and those days align with colony ID.

When I plot a PCA, I clearly see an influence of colony ID/extraction day, so there is a potential batch effect I would like to remove, although I cannot be certain whether it is due to extraction day or colony ID.

That's how I found out about ComBat-Seq, and I thought it might solve my problem by removing the strong batch effect from my data. But I am unsure whether I still need to include colony ID in my DESeq2 model afterwards, since I would already have removed the batch effect, which is the same as colony ID. In my case, however, individuals from the same colony are pseudoreplicates, so not controlling for colony ID seems a bit odd to me.

So, should I run

a) ComBat-Seq + DESeq2 (gene expression ~ treatment) or
b) ComBat-Seq + DESeq2 (gene expression ~ colony ID + treatment)

Thank you a lot in advance, I am happy for any feedback on this.

Run combat-seq on a continuous variable

Hello,

I was wondering if it would be possible to run ComBat-seq with a continuous variable as the batch, for example the age of patients. Or, alternatively, would it be possible to modify the algorithm to do this? (I am willing to try this myself, obviously.)

Kind regards and thank you.
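One workaround sometimes used (an assumption on my part, not an official feature of ComBat-seq, which expects a categorical batch) is to discretize the continuous variable into bins and pass the bins as the batch. A sketch, where counts and age are assumed objects and the quartile cut points are arbitrary:

# Sketch: discretize a continuous variable (e.g. age) into a categorical "batch".
age_bins <- cut(age,
                breaks = quantile(age, probs = seq(0, 1, by = 0.25)),
                include.lowest = TRUE)   # quartile bins; the cut points are arbitrary
table(age_bins)                          # make sure each bin contains several samples
adjusted <- sva::ComBat_seq(counts, batch = age_bins, group = NULL)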

no data rows with required number of counts

Hi,
I hope to apply ComBat_seq to my microbial OTU table. My table has 214 columns and 6663 rows, but when I put it into the command below it returns 'no data rows with required number of counts'. Could anyone help me check it? Is something going wrong with my data? It is all numeric and contains some zeros.

`> nrow(a)
[1] 6663

adjusted_counts <- ComBat_seq(a, batch=metadata$Souce, group=NULL, full_mod=FALSE)
Found 3 batches
Using null model in ComBat-seq.
Adjusting for 0 covariate(s) or covariate level(s)
Estimating dispersions
Error in dispCoxReid(y, design = design, offset = offset, subset = subset, :
no data rows with required number of counts`

I will appreciate any help!
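This edgeR error usually points to rows with too few counts for dispersion estimation; a pre-filtering sketch (the thresholds are arbitrary assumptions and should be tuned to the data, and metadata$Souce is taken as-is from the command above):

# Sketch: drop very sparse OTUs/genes before running ComBat_seq.
keep <- rowSums(a >= 5) >= 2    # feature must have >=5 counts in at least 2 samples (arbitrary)
a_filtered <- a[keep, ]
dim(a_filtered)
adjusted_counts <- sva::ComBat_seq(a_filtered, batch = metadata$Souce,
                                   group = NULL, full_mod = FALSE)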

A strange problem

Hello! I have run into a strange problem. My desktop broke, so I reinstalled my R packages. Before reinstalling, ComBat_seq gave me the processed matrix, but now it doesn't: after adjusting the matrix, I only get a huge list (614.1 GB), which I can't understand.
Here is the code:
adjusted <- ComBat_seq(fullcounts, batch = Data_source, group = Medium,
                       full_mod = TRUE)

Found 10 batches
Using full model in ComBat-seq.
Adjusting for 1 covariate(s) or covariate level(s)
Estimating dispersions
Fitting the GLM model
Shrinkage off - using GLM estimates for parameters
Adjusting the data

head(adjusted)
[[1]]
[1] 256

[[2]]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 12
[21] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[41] 0 0 661 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
... (long, mostly zero integer vector; remaining rows not reproduced here)
[ reached getOption("max.print") -- omitted 33755 entries ]

[[3]]
[1] 766

[[4]]
[1] 175

[[5]]
[1] 77

[[6]]
[1] 1
I really need help. Thank you!
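A check worth trying first (consistent with the later issue "the vignette returns a matrix | array, real data returns a list", where ComBat_seq returned a list when the input was a data.frame rather than a matrix); a minimal sketch using the object names from the code above:

# Sketch: make sure the input is a numeric matrix, not a data.frame.
class(fullcounts)                       # if this is "data.frame", coerce it
fullcounts_mat <- as.matrix(fullcounts)
stopifnot(is.numeric(fullcounts_mat))
adjusted <- ComBat_seq(fullcounts_mat, batch = Data_source,
                       group = Medium, full_mod = TRUE)
class(adjusted)                         # should now be "matrix" "array"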

Can combat-seq set a reference batch like ComBat?

Hi yuqing,

I know ComBat can set a reference batch. I wonder whether ComBat-seq can also set a reference batch by modifying some code. If yes, is it easy to implement or non-trivial to do so? I am looking forward to your reply.

Best,

Jinsheng

ComBat-Seq and pseudobulk counts

Thank you for this major improvement, Yuqing!

Since you mentioned in previous issues that it is valid to use corrected values as input for DESeq2, I am wondering whether ComBat-seq can be used to correct pseudobulk counts from single-cell data, followed by DESeq2 DGE analysis?

In addition, if the experimental design includes control and condition samples processed in multiple batches, is separate batch correction (i.e. correcting the control samples and the condition samples separately, then merging the matrices) a valid approach? Or can that be handled by specifying groups?

Thank you in advance for your help!

Kind regards,
Amel Zulji

ComBat-seq + DESeq2

Hi,
I am wondering if it is valid to input batch adjusted counts from ComBat-seq into DESeq2's standard workflow.
Thank you.
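For reference, a minimal sketch of what that downstream step might look like (assuming adjusted_counts from ComBat_seq and a coldata data frame with a condition column; batch is deliberately left out of the design because it has already been removed upstream):

library(DESeq2)

# Sketch: feed ComBat-seq-adjusted counts into the standard DESeq2 workflow.
dds <- DESeqDataSetFromMatrix(countData = round(adjusted_counts),  # DESeq2 expects whole numbers
                              colData   = coldata,
                              design    = ~ condition)             # batch already removed upstream
dds <- DESeq(dds)
res <- results(dds)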

Plus one (well, I mean more than one) samples

Hi, I utilized ComBat-Seq for a meta-analysis of 200 RNA-Seq datasets, and it delivered exceptional results! However, a question arises: what if I have additional projects to incorporate into the adjusted dataset? Would I need to run ComBat-Seq each time, or do you have any suggestions?

Question concerning further analysis

Thank you for implementing the ComBat_seq function. I am eager to try it out.
I have a question concerning the downstream analysis workflow. Can I simply adjust the counts and then continue with standard workflows, i.e. filter lowly expressed genes (or should that already be done before batch correction?), normalize e.g. with TMM, and then convert to logCPM or apply the voom transformation, and so on?
How would the workflow/input exactly differ between ComBat_seq and ComBat?
Thanks so much!
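For illustration, one possible ordering of those steps (a sketch only, assuming adjusted_counts and a groups factor; whether to filter before or after correction is exactly the open question above):

library(edgeR)
library(limma)

# Sketch: standard edgeR/limma steps applied to ComBat-seq-adjusted counts.
dge  <- DGEList(counts = adjusted_counts, group = groups)
keep <- filterByExpr(dge)                   # filter lowly expressed genes
dge  <- dge[keep, , keep.lib.sizes = FALSE]
dge  <- calcNormFactors(dge)                # TMM normalization
design <- model.matrix(~groups)
v    <- voom(dge, design)                   # logCPM with precision weights
fit  <- eBayes(lmFit(v, design))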

scRNAseq - CombatSeq

Hi,

First, I'd like to thank you for this wonderful tool that has helped me a lot with RNA-seq analysis! I am currently running some analyses on scRNA-seq data that are kind of messed up by batch effects.
I have used ComBat-seq to correct for the batch on the pseudobulk counts, but I was wondering whether a similar approach could be used on the raw single-cell counts instead of the pseudobulk matrices?
I found an adaptation in singleCellTK for scRNA-seq, but with my large dataset it keeps crashing, probably because it tries to create a matrix from my very large SCE object.
Do you have experience with this type of analysis on scRNA-seq? Is this something doable in your view?

Many thanks for the insights !

All the best,

Salim.

running combat-seq with DESeq2

I have 2 batches: batch 1 with 4 samples (normal tissue) and batch 2 with 8 samples (normal tissue). These 2 batches differ in library preparation (one is unstranded and the other is reverse stranded); both are RNA-seq datasets downloaded from GEO. I have a third dataset from TCGA (cancer tissue), and I wish to perform DESeq2 analysis. How do I combine them to get true DEGs? Should I use ComBat-seq together with DESeq2, and if so, what would the code look like?

ComBat_seq took a very long time

Hi, I was using ComBat_seq for RNA-seq data with covar_mod as an argument containing 8 variables; however, it has been running for more than 10 hours and still has not completed, which is weird.

I wonder why it took so long?

Implementation with DESeq2 > "invalid class “DESeqDataSet” object: the count data is not in integer mode"

Dear Yuqing,

I'm working on an RNA-seq meta-analysis and trying to use ComBat-seq to batch-correct read counts prior to DESeq2 analysis. I'm using the following command to correct the counts within the dds object:
assay(dds) <- ComBat_seq(assay(dds), batch=batch, group=NULL)

And then run the following command for the DESeq analysis:
dds <- DESeq(dds)

I'm obtaining the following error when running the DESeq() function on the corrected counts: "invalid class "DESeqDataSet" object: the count data is not in integer mode". Upon inspecting the count data, it does not seem that I have any non-integer values, but I am still wondering whether I should be using any specific argument in ComBat-seq prior to DESeq2.

A second question I'd like to ask: I'm comparing datasets from vastly different genotypes; should I specify the genotypes with the group variable?

Thank you!

Samuel
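One possible explanation (an assumption, not a confirmed diagnosis): ComBat_seq can return a numeric (double) matrix even when every value is a whole number, and DESeq2's validity check looks at the storage mode. Coercing the adjusted matrix to integer storage before rebuilding the DESeqDataSet is a cheap thing to try; a sketch using the objects from the commands above:

# Sketch: force integer storage mode before (re)building the DESeqDataSet.
adj <- ComBat_seq(assay(dds), batch = batch, group = NULL)
adj <- round(adj)
storage.mode(adj) <- "integer"      # values are whole numbers; only the storage type changes
dds2 <- DESeqDataSetFromMatrix(countData = adj,
                               colData   = colData(dds),
                               design    = design(dds))
dds2 <- DESeq(dds2)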

including more than one feature to be corrected

Hello, thank you for developing ComBat-seq; it is definitely useful. I would like to ask whether there is a way to use ComBat-seq to correct for two different features: one would be the typical batch effect, and the other would be the different sample concentrations in the libraries.

Best,
Leonor

ComBat-seq + DESeq2 / WGCNA Or DESeq2 (batch as covariate) / ComBat + WGCNA

Hi,

I am using ComBat-seq to remove batch effects from my dataset and then running DESeq2 on it. I was wondering if I could use the same data, after rlog transformation, for WGCNA?

Which pipeline would be better (to get both differentially expressed genes and WGCNA results) -

  1. ComBat-seq -> DESeq2 -> rlog -> WGCNA
  2. DESeq2 (batch as covariate) -> rlog -> ComBat -> WGCNA

Thank you!

install issue on R 4.0.2 and R 4.0.3

Error: Failed to install 'sva' from GitHub:
(converted from warning) installation of package ‘C:/Users/AppData/Local/Temp/Rtmp6rciWQ/filef5220635c4627/sva_3.35.2.tar.gz’ had non-zero exit status
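Since ComBat_seq is included in the released sva package on Bioconductor (sva >= 3.36.0, as mentioned in other issues in this list), installing the release instead of building the GitHub development version sidesteps the compilation step; a sketch:

# Sketch: install released sva (which includes ComBat_seq) from Bioconductor,
# avoiding a local compile of the GitHub development version.
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("sva")
library(sva)
exists("ComBat_seq")   # should be TRUE with sva >= 3.36.0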

How to adjust by multiple variables using ComBat-Seq?

Dear @zhangyuqing ,

I am trying to adjust my RNA data using ComBat-Seq, since I realised that there are 3 batch variables that I need to adjust for:

  • Place (2 levels: place 1, place 2)
  • Library Preparation Date (16 levels - different dates)
  • Type of tube (2 levels: A, B)

I have 960 samples and around 62000 genes.

In my biological matrix, I have: Age, Sex, Group (cases,controls..) and WBC counts.

biol_mat = model.matrix(~Age + as.factor(Sex) + as.factor(Group) + LYMPH + MONO + NEUT, data=phenotype)

The ComBat-Seq tutorial shows how to adjust for one batch variable, but not how to adjust for more than one.

I have seen many posts about ComBat suggesting that the only way is to adjust for one variable first and then, using those results, adjust for the second variable, and so on.

That would be:

Adjust for library prep.

raw.cts_adjustedLibPrep <- ComBat_seq(counts = raw_cts_matrix, batch=batch_libraryprep, group=NULL, covar_mod = biol_mat)

Adjust by library prep + type of tube.

raw.cts_adjusted_LibPrep_TypeTube <- ComBat_seq(counts = raw.cts_adjustedLibPrep, batch=batch_type_tube, group=NULL, covar_mod = biol_mat)

Adjust by library prep + type of tube + place

 raw.cts_adjusted_LibPrep_TypeTube_Place <- ComBat_seq(counts = raw.cts_adjusted_LibPrep_TypeTube, batch=batch_place, group=NULL, covar_mod = biol_mat)

The first adjustment (library prep) takes around 15 minutes, but the second one had been running for more than 2 days. I stopped it and launched it again, changing the order of the adjustments, but it is still taking far too much time, and it seems that something is wrong.

Here I attach a similar example of my data:

Covariates (variables to adjust + biological information)

set.seed(11344)
covariates_df <- data.frame(
  Sample = paste0("A", seq(1,960)),
  Age = sample(seq(18, 70), size = 960, replace = TRUE),
  Group = sample(seq(1, 3), size = 960, replace = TRUE),
  Sex = sample(seq(1, 2), size = 960, replace = TRUE),
  WBC = sample(seq(0.5, 35, by=0.3), size = 960, replace = TRUE),
  Place = sample(seq(1, 2), size = 960, replace = TRUE),
  Tubes = sample(seq(1, 2), size = 960, replace = TRUE),
  LibPrep = sample(seq(1, 16), size = 960, replace = TRUE)
)
rownames(covariates_df) <- covariates_df$Sample

categorical_variables <- c("Sample", "Group", "Sex", "Place", "Tubes", "LibPrep")
covariates_df[,categorical_variables] <- lapply(covariates_df[,categorical_variables],as.factor)


>str(covariates_df)

'data.frame':	960 obs. of  8 variables:
 $ Sample : Factor w/ 960 levels "A1","A10","A100",..: 1 112 223 334 445 556 667 778 889 2 ...
 $ Age    : int  40 66 37 40 25 42 46 51 45 61 ...
 $ Group  : Factor w/ 3 levels "1","2","3": 2 2 2 3 2 2 3 3 3 3 ...
 $ Sex    : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 1 2 1 ...
 $ WBC    : num  12.8 29.9 25.7 0.8 30.8 5.9 30.5 19.1 2.9 12.2 ...
 $ Place  : Factor w/ 2 levels "1","2": 1 1 2 1 2 1 2 2 1 2 ...
 $ Tubes  : Factor w/ 2 levels "1","2": 1 1 2 1 2 2 2 2 1 2 ...
 $ LibPrep: Factor w/ 16 levels "1","2","3","4",..: 14 1 13 4 10 5 1 8 7 5 ...

The 3 variables that I want to adjust for are Place, Tubes and LibPrep; the biological information that I want to preserve is Age, Group, Sex and WBC.

RNA raw counts

exp_df <- data.frame(replicate(960,sample(0:5216979,66000,rep=TRUE)))
colnames(exp_df) <- covariates_df$Sample
rownames(exp_df) <- paste0("Gene", rownames(exp_df))
exp_df_mat <- as.matrix(exp_df)

> str(exp_df)
'data.frame':	66000 obs. of  960 variables:
 $ A1  : int  3687756 4638393 4315502 316404 2209492 831342 4261323 1283200 1522808 1088673 ...
 $ A2  : int  4645779 3495763 4782397 2869628 2402288 1012125 1267979 1361720 1919367 4859438 ...
 $ A3  : int  2883976 4770076 228011 915940 19038 4650368 112632 1316246 1926498 484384 ...
 $ A4  : int  3496840 3676764 2763012 2723528 944224 3809955 5054054 1122139 116722 4090191 ...
 $ A5  : int  3140122 650422 2075888 2987814 1656462 2863317 155120 1086391 3163073 625809 ...
 $ A6  : int  4084796 1932277 3491059 4898410 4183070 4470479 4193882 928271 3282841 4418068 ...
 $ A7  : int  765797 3153554 5075853 4775395 3582194 4274642 1530455 1433179 4757168 2209519 ...
 $ A8  : int  1652346 3505656 111027 3170748 5087383 912180 2693545 1907366 3135340 548296 ...
 $ A9  : int  441053 5132021 1857530 2413007 1218852 3614836 4388581 106105 3270886 3840996 ...
 $ A10 : int  995597 2076013 1576689 1857073 1435772 3788330 3983860 816969 733090 1357226 ...
 $ A11 : int  4944929 5182067 4415296 3224862 145068 3533346 4174175 2406340 4572529 4820674 ...
 $ A12 : int  2731408 1896439 5063233 4621405 2686720 1507796 4764591 887296 2257893 384470 ...

The biological matrix would be:
biol_mat = model.matrix(~Age + as.factor(Sex) + as.factor(Group) + WBC, data=covariates_df)

head(biol_mat)

(Intercept) Age as.factor(Sex)2 as.factor(Group)2 as.factor(Group)3  WBC
A1           1  40               0                 1                 0 12.8
A2           1  66               1                 1                 0 29.9
A3           1  37               1                 1                 0 25.7
A4           1  40               0                 0                 1  0.8
A5           1  25               0                 1                 0 30.8
A6           1  42               1                 1                 0  5.9

Could you kindly help me, please? I have tried all the possibilities, and I do not really know whether something is wrong with my code or whether ComBat_seq cannot handle this amount of data.

Thanks very much in advance
Regards
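One approach sometimes suggested for multiple categorical batch variables (an assumption on my part, not something the ComBat-seq tutorial prescribes) is to collapse them into a single composite batch factor and run ComBat_seq once, rather than chaining sequential corrections. A sketch using the simulated objects above (covariates_df, exp_df_mat, biol_mat):

# Sketch: combine several categorical batch variables into one composite batch
# and run a single ComBat_seq call. Note: Place (2) x Tubes (2) x LibPrep (16)
# can create up to 64 batch levels, so every observed combination needs enough samples.
combined_batch <- interaction(covariates_df$Place,
                              covariates_df$Tubes,
                              covariates_df$LibPrep,
                              drop = TRUE)
table(combined_batch)                       # check that no level is nearly empty
adjusted <- sva::ComBat_seq(exp_df_mat,
                            batch     = combined_batch,
                            group     = NULL,
                            covar_mod = biol_mat)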

ComBat (negative value) and ComBat_seq (lib-size)

Hello, I have some questions about ComBat and ComBat_seq.
My analysis is based on CPM, so I use ComBat to adjust the logCPM data and then convert it back to CPM. But there end up being a lot of negative numbers, and I can't use them for subsequent analysis. What should I do with these negative numbers?
I have also tried using ComBat_seq to adjust the count data instead. But the library sizes of the samples then vary greatly: some library sizes become really large, so the CPM values calculated from them are inaccurate. How can I get accurate CPM values?
Thank you.

Error in qr.default(design) : NA/NaN/Inf in foreign function call (arg 1)

I encountered the error when I ran the following:
ComBat_seq(cts, batch = coldata_filtered$Rep2, group=NULL, covar_mod=cbind(coldata_filtered$Cancer, coldata_filtered$Treatment))
Error in qr.default(design) : NA/NaN/Inf in foreign function call (arg 1)

I did a brief Google search and found that it might be due to zero-variance rows. I am not sure how to mitigate the error. Thanks!
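A diagnostic sketch along the lines of that guess (checking the covariate columns for NA/NaN, which would propagate into the design matrix, and the count matrix for constant rows; both are assumptions about the cause, not a confirmed fix):

# Diagnostic sketch: look for NA/NaN in the covariates that feed the design matrix,
# and for constant (zero-variance) rows in the count matrix.
covars <- cbind(coldata_filtered$Cancer, coldata_filtered$Treatment)
any(is.na(covars))                              # NA here would break qr(design)
any(!is.finite(as.matrix(cts)))                 # non-finite counts
constant_rows <- apply(cts, 1, var) == 0
sum(constant_rows)                              # consider dropping these before ComBat_seq
cts_clean <- cts[!constant_rows, ]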

Install error

devtools::install_github("zhangyuqing/sva-devel")
Downloading GitHub repo zhangyuqing/sva-devel@master
trying URL 'https://mirrors.tuna.tsinghua.edu.cn/CRAN/bin/windows/Rtools/Rtools35.exe'
Content type 'application/octet-stream' length 108622512 bytes (103.6 MB)
downloaded 40.5 MB

Error: Failed to install 'unknown package' from GitHub:
Could not find tools necessary to compile a package
Call pkgbuild::check_build_tools(debug = TRUE) to diagnose the problem.
In addition: Warning message:
In utils::download.file("https://mirrors.tuna.tsinghua.edu.cn/CRAN/bin/windows/Rtools/Rtools35.exe", :
downloaded length 42446848 != reported length 108622512

Combat-seq, Pseudoaligners and DESeq2

Hi

I have been working with ComBat-seq for a little while: I quantify with kallisto, import with tximport, load into DESeq2 to get "raw" counts, then run ComBat-seq on that data.

I noticed that when I use ComBat_seq, I need to load my data into DESeq2 as a plain matrix, which removes all the special normalization DESeq2 would otherwise do with tximport-ed data.

Would this be the proper way to analyze pseudo-counts? I feel I am losing a lot of the transcript/size-factor normalization that is expected with pseudoaligners when I feed the data into ComBat-seq this way.

bug in getting adjusted counts matrix

Hi,
I have a dat matrix (1000 x 12). When I try to obtain the adjusted counts using the command ComBat_seq(dat, batch=batch, group=group), I get a list with a length of 1000 x 12. Each list element can be a single number or a vector of numbers. Can you check this?

batch <- c(rep(1, 6), rep(2, 6))
group <- c(0,0,0,1,1,1,0,0,0,2,2,2)
dat[1:10, ]
1533 | 1503 | 1387 | 1258 | 1657 | 1139 | 989 | 935 | 902 | 829 | 913 | 1153
16 | 19 | 19 | 32 | 48 | 17 | 9 | 12 | 10 | 16 | 7 | 7
7 | 216 | 83 | 65 | 30 | 3 | 4 | 4 | 0 | 5 | 5 | 2
1 | 0 | 2 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0
54963 | 47549 | 45226 | 39082 | 53703 | 47144 | 16344 | 18281 | 15508 | 16260 | 16299 | 21966
1177 | 1066 | 1079 | 843 | 933 | 930 | 785 | 667 | 805 | 650 | 704 | 710
63 | 65 | 56 | 38 | 45 | 39 | 17 | 12 | 15 | 10 | 20 | 20
165 | 179 | 276 | 233 | 351 | 81 | 78 | 51 | 46 | 57 | 60 | 50
162 | 135 | 163 | 216 | 138 | 136 | 90 | 79 | 78 | 65 | 93 | 79
1809 | 1613 | 1378 | 2227 | 1752 | 1561 | 408 | 482 | 297 | 394 | 311 | 368

Batch effect present for some biotypes after running CombatSeq

I'm working with small RNA data and performed batch correction with ComBat_seq (Bioconductor version).

I used the below command -

library('sva')
corrected_batch <- ComBat_seq(as.matrix(counts), batch = coldata$Small_Seq_Batch)
corrected_batch <- as.data.frame(corrected_batch)

Here's an MDS plot before correcting:
[MDS plot: MDS_before_correction]

And here's the one after:
[MDS plot: MDS_after_correction]

What could cause this? I tried separating the biotypes first and then performing the batch correction, but the outcome was pretty much the same.

Keep getting the error that rows do not match

I have a matrix with about 2300 samples and 34 batches, arranged with genes in rows and samples in columns, and I am trying to use ComBat_seq on it. I have a vector to be used as the batches that has the same length as the number of columns of the count matrix, but I keep getting the following error:

    > length(use_batch)
    [1] 2312 
    > ncol(ori_matrix)
    [1] 2312
    > corrected_mat <- ComBat_seq(ori_matrix, batch = use_batch)
    Found 34 batches
    Using null model in ComBat-seq.
    Error in cbind(batchmod, mod) : 
      number of rows of matrices must match (see arg 2)

Am I missing something? What is the way around this? The ori_matrix of original count values is a matrix.
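A possible cause to rule out (an assumption, since the lengths clearly match): NA values in the batch vector make model.matrix silently drop rows, so the batch design matrix ends up with fewer rows than the number of samples and cbind(batchmod, mod) fails. A quick check:

# Diagnostic sketch: NA in the batch vector would shrink the model matrix.
any(is.na(use_batch))
table(use_batch, useNA = "ifany")
nrow(model.matrix(~ -1 + factor(use_batch)))   # should equal ncol(ori_matrix), i.e. 2312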

Can I use ComBat-Seq to adjust multi-group RNA-seq?

Hi, I am confused about how to use ComBat_seq on my RNA-seq data. I have two batches and three groups, with comparisons such as group A vs group B and group A vs group C. How should I use ComBat_seq?
Below is my code:
group=c(rep(0,1,2),6)
batch=c(rep(0,1),9)
adjusted_counts=combat_seq(counts,batch=batch,group=group)
Is this right?
Thank you very much.
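For comparison, a sketch of how the batch and group vectors are usually constructed (the sample numbers below are made-up assumptions; the key points are one entry per column of the count matrix and a single group factor covering all three groups):

# Sketch with hypothetical sample numbers: 2 batches of 9 samples each,
# 3 biological groups (A, B, C) spread across the batches.
batch <- rep(c(1, 2), each = 9)                       # length 18 = ncol(counts)
group <- factor(c(rep("A", 3), rep("B", 3), rep("C", 3),
                  rep("A", 3), rep("B", 3), rep("C", 3)))
stopifnot(length(batch) == ncol(counts), length(group) == ncol(counts))
adjusted_counts <- sva::ComBat_seq(counts, batch = batch, group = group)
# Pairwise comparisons (A vs B, A vs C) are then done downstream, e.g. in DESeq2/edgeR.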

readcount to FPKM

Hi,
Can I use the read counts from ComBat-seq to calculate FPKM or TPM?

Best wishes
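If one did compute TPM or FPKM from the adjusted counts (whether that is advisable is exactly the question above), the arithmetic would look like the sketch below; adjusted_counts and gene_lengths (effective gene lengths in bp, same order as the matrix rows) are assumed objects:

# Sketch: TPM from a counts matrix (genes x samples) and per-gene lengths (bp).
rate <- adjusted_counts / (gene_lengths / 1000)     # reads per kilobase
tpm  <- t(t(rate) / colSums(rate)) * 1e6            # scale each sample to one million
# FPKM instead: fpkm <- t(t(rate) / colSums(adjusted_counts)) * 1e6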

the vignette returns a matrix | array, real data returns a list

I've resolved the issue, but will post this brief note anyway on the chance that it might be helpful.

In R v4.0.3, with sva v3.36.0, if I run the ComBat_seq example, it outputs a 50 x 8 "matrix" "array" -
adjusted_counts <- ComBat_seq(count_matrix, batch=batch, group=NULL, full_mod=FALSE)
...
class(adjusted_counts)

"matrix" "array"

dim(adjusted_counts)

50 8

If I run real read-count data consisting of 2103 genes x 270 samples, with two batches (156 and 114 samples), but input a data.frame, I get a list whose length is the product of 2103 * 270.

But if I follow the help docs and make the input data.frame into a matrix, then I get a "matrix" "array" out.
system.time(
  adjusted_counts.old.new <- ComBat_seq(
    as.matrix(input),
    batch = batch,
    group = NULL,
    full_mod = FALSE  # Boolean, if TRUE include condition of interest in model
  )
)

ComBat_seq cannot be found

Hi, I installed the package from the GitHub repo sva-devel. I can see the documentation page for ComBat_seq, but when I try to run the function I get "could not find function "ComBat_seq"".

Common not tagwise dispersion used when one level of covariate - Code error?

When a batch has only one level of a covariate (i.e. group) variable, the common dispersion is used for ALL genes. Why? Shouldn't the code default to the tagwise dispersion (without the group adjustment)? Genes are not adequately debatched by using the common dispersion across all genes, at least in my data. I noticed that my covariate-adjusted run of ComBat-seq still showed batch effects at PC3/PC4 (but not when the group variable was NULL). I ran the dispersion estimates and GLMs manually, and this seems to be the issue in the ComBat-seq wrapper function.

## Estimate gene-wise dispersion within each batch
genewise_disp_lst <- lapply(1:n_batch, function(j){
  if((n_batches[j] <= ncol(design)-ncol(batchmod)+1) | qr(mod[batches_ind[[j]], ])$rank < ncol(mod)){
    # not enough residual degrees of freedom - use the common dispersion
    return(rep(disp_common[j], nrow(counts)))
  }else{
    return(estimateGLMTagwiseDisp(counts[, batches_ind[[j]]], design=mod[batches_ind[[j]], ],
                                  dispersion=disp_common[j], prior.df=0))
  }
})

Why not this instead:
genewise_disp_lst <- lapply(1:n_batch, function(j){
  if((n_batches[j] <= ncol(design)-ncol(batchmod)+1) | qr(mod[batches_ind[[j]], ])$rank < ncol(mod)){
    # not enough residual degrees of freedom - use the common dispersion
    return(estimateGLMTagwiseDisp(counts[, batches_ind[[j]]], design=NULL,
                                  dispersion=disp_common[j], prior.df=0))
  }else{
    return(estimateGLMTagwiseDisp(counts[, batches_ind[[j]]], design=mod[batches_ind[[j]], ],
                                  dispersion=disp_common[j], prior.df=0))
  }
})

This modification runs the batch effect adjustment as usual, with covariate adjustment when a covariate level is present in a batch. For example, my data is designed as below:

Batch 1 - Tumor --> tagwise
Batch 2 - Normal --> tagwise
Batch 3 - Tumor/Normal --> tagwise-adjusted
Batch 4 - Tumor --> tagwise

This corrects the residual batch effect seen at PC3/PC4. I haven't checked, but my guess is that the genes with remaining batch effects are those that are not adequately modeled using the common dispersion in batches 1, 2, and 4. Please let me know if I am missing a reason why we would want to use the common dispersion across all genes in this scenario. Based on the regression described in the paper, however, I think the current code does not reflect the regression as described.

combat-seq produces very large values after batch adjustment for some genes

Hi,

I am trying out Combat-seq as a way to reduce batch effects in a dataset prior to use with downstream packages that do not support direct modeling of batch effects. I have found that for a few genes, it produces very strange results. For example:

Original counts:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 4., 0., 0., 0., 0., 0.])

Combat-seq adjusted counts:
array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 5.82247031e+11, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00])

The value of 4 has been adjusted to 5.8e11! This is causing errors in my downstream analysis.

I am just running Combat-seq with the default parameters and a covariate matrix:

adjusted_all_cov <- ComBat_seq(all_m, batch = combo_batches4, group = NULL, covar_mod = cov_mat_all)

I was wondering if you could explain why this is happening and if there are any options in the Combat-seq package that might help me avoid it.

I'm using sva_3.36.0.

Thanks!

Error in .compressOffsets(y, lib.size = lib.size, offset = offset) : offsets must be finite values

Hi there,

I previously used the sva::ComBat function and it worked perfectly:

adjusted <- sva::ComBat(dat=raw_counts,
                        batch=metadata$source,
                        mod=cbind(metadata$tissue_coded))

But since I would love to use DESeq2 for downstream analysis I was now trying the new function:

adjusted <- sva::ComBat_seq(counts=raw_counts,
                            group=NULL,
                            batch=metadata$source)

Unfortunately, as you can see, I got this error:

Error in .compressOffsets(y, lib.size = lib.size, offset = offset) : 
  offsets must be finite values
In addition: Warning message:
In DGEList(counts = counts) : library size of zero detected

What am I doing wrong?
Thank you in advance!
Luca
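The accompanying warning ("library size of zero detected") suggests at least one sample column sums to zero, which would produce a non-finite offset; a quick diagnostic sketch, assuming raw_counts is genes x samples and metadata rows are in the same sample order:

# Diagnostic sketch: find and drop samples with zero total counts before running ComBat_seq.
libsizes <- colSums(raw_counts)
which(libsizes == 0)                               # offending samples, if any
raw_counts_ok <- raw_counts[, libsizes > 0]
metadata_ok  <- metadata[libsizes > 0, ]
adjusted <- sva::ComBat_seq(counts = raw_counts_ok,
                            batch  = metadata_ok$source,
                            group  = NULL)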

Error in order(y) : unimplemented type 'list' in 'orderVector1'

Hi @zhangyuqing ,

I am trying to run batch correction with the ComBat-seq package but keep running into an error. The dataset has 14 samples (the first 10 belong to one sequencing batch, the other 4 come from a different sequencing batch). The data is read into R "as is", and when I try to run batch correction I get an error.

library(readxl)   # for read_xlsx
library(sva)      # for ComBat_seq

uncorrected = read_xlsx("genes_file.xlsx", sheet = "genes")
sample_info = read_xlsx("info_file.xlsx", sheet = "sample_info")

batches = sapply(as.character(sample_info$library), switch, "My_data" = 1, "lab2_et_al_2020" = 2, USE.NAMES = F)
corrected = ComBat_seq(counts = as.matrix(uncorrected[, sample_info$file]), batch = batches)

When I run the line of code that computes the correction, R returns "Error in order(y) : unimplemented type 'list' in 'orderVector1'" as the error message.

The same thing happens when I try to create a group vector using the following:

samples = c (0,0,0,0,0,0,0,0,0,0,1,1,1,1)

corrected = ComBat_seq (counts = as.matrix(uncorrected[,sample_info$file]), batch = batches, group = samples)

The same error message, "Error in order(y) : unimplemented type 'list' in 'orderVector1'", is what R returns.

I have uninstalled R (I was on v4.1.2), reinstalled the latest version, and also installed the dev version of your sva package. The error still comes up, unfortunately.

Do you have any idea what could be causing this?

Thanks a lot!
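One thing that may be worth checking (an assumption based on read_xlsx returning a tibble): if any selected column is non-numeric, as.matrix() produces a character or list matrix, which can trigger exactly this kind of type error downstream. A sketch, where the gene_id column name is hypothetical:

# Sketch: make sure the count input is a purely numeric matrix.
counts_df <- uncorrected[, sample_info$file]
sapply(counts_df, class)                      # every column should be numeric/integer
counts_mat <- as.matrix(counts_df)
storage.mode(counts_mat)                      # should be "double"/"integer", not "character" or "list"
rownames(counts_mat) <- uncorrected$gene_id   # hypothetical gene ID column name
corrected <- ComBat_seq(counts = counts_mat, batch = batches)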

ComBat-Seq correction creating very large count values--cannot feed into DESeq2

Hey,

Thanks for making this software; it seems really helpful (and I'd love to use it if I can get this to work). The matrix generated by ComBat-Seq cannot be read into DESeq2. My current guess is that something about the batch correction creates very large values in the matrix: looking at the highest values, some very low raw counts (<10) are being corrected to very large numbers (on the order of 1e17).

When DESeq2 tries to convert these to integers, it cannot, because the values are too high for DESeq2 to handle. That isn't really the main concern, however; I can't imagine a correction should ever change a count value that drastically, so I may be doing something incorrect in my implementation of ComBat-Seq.

A little information on the project: these are technical replicates where the same library was sequenced twice, the second batch being at much higher read depth. Each batch was analyzed separately and counts generated by HTSeq-Count were used for batch correction. I used the covar_mod to account for my two treatment variables.

Please let me know what you think or what other information would be helpful to provide here.

Thanks in advance--I really appreciate your help.

Christine

The covariates are confounded

Dear @zhangyuqing , creator of ComBat-seq,

for experimental reasons, I was trying to run ComBat-seq on only two batches of the GFRN dataset you provided. I cropped the matrix accordingly (99% sure this is not the issue). Everything works perfectly as long as I do not use the 'group' parameter. I made sure to use only the correct labels here. See my batch and group input:

batch <- as.integer(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2))

group <- factor(c("gfp", "gfp", "gfp", "gfp", "gfp", "gfp", "gfp", "gfp", "gfp", "gfp", "gfp", "gfp", "her2",
"her2", "her2", "her2", "her2", "gfp", "gfp", "gfp", "gfp", "gfp", "gfp", "egfr", "egfr",
"egfr", "egfr", "egfr", "egfr", "gfp", "gfp", "gfp", "gfp", "gfp", "gfp", "gfp", "gfp",
"gfp", "kraswt", "kraswt", "kraswt", "kraswt", "kraswt", "kraswt", "kraswt", "kraswt", "kraswt"))

Both have the same length of 29, which matches with my amount of samples/columns in the count dataset (attached).
main1_2.csv

Be aware that I am a programmer myself and double-checked everything. Everything should work, yet I get the following error:
The covariates are confounded! Please remove one or more of the covariates so the design is not confounded

Is there any hidden restriction on the covariates (groups) I use? With all three batches and respective metadata it works fine on the full GFRN dataset. In my example above, Batch 1 contains gfp and her2 and Batch 2 contains gfp and kraswt.

I do need your input on this, I believe.

Thank you in advance!

Simon Schlumbohm
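A small check that makes confounding (or a mismatch between the vectors) visible at a glance, purely as a diagnostic sketch; count_matrix here stands in for the cropped GFRN matrix attached above:

# Diagnostic sketch: cross-tabulate batch and group to see which levels co-occur,
# and confirm both vectors have one entry per sample column.
length(batch); length(group); ncol(count_matrix)   # all three should match
table(batch, group)   # a group level appearing in only one batch is confounded with batch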

support for multiple cpus?

I use kallisto output files and trim some features, but many genes still remain; for instance, my input count matrix is 91,897 x 32. ComBat-Seq runs for 2+ days, and it seems to be busy in match_quantiles(). Are there any plans to make it faster or to add parallelization?
