xihaoli / staarpipeline-tutorial Goto Github PK

The tutorial for performing single-/multi-trait association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using FAVORannotator, STAARpipeline and STAARpipelineSummary

License: GNU General Public License v3.0

R 93.28% Shell 6.72%

statistical-genetics whole-exome-sequencing whole-genome-sequencing functional-annotation rare-variant-analysis staar-pipeline multi-trait-analysis

staarpipeline-tutorial's Issues

dsyMatrix is not supported

Dear authors,

Thank you so much for developing the tool. After I created aGDS, I encountered an error of "dsyMatrix is not supported" when performing STAARpipeline_Gene_Centric_Noncoding.r.

The complete error message is

Error in STAAR_O_SMMAT_sparse(G, Sigma_i, Sigma_iX, cov, residuals.phenotype,  :
  dsyMatrix is not supported.

Any idea how to fix the issue Thanks!

Best,

Question about Sliding_Window function

Hello,

I'm trying to run Sliding_Window() with ~2k variants but everytime it returns "Error: $ operator is invalid for atomic vectors." This happened even when just given obj_nullmodel argument, like this Sliding_Window(chr="chr3",start_loc=11919,end_loc=1239497,sliding_window_length=6000,obj_nullmodel="gds/obj_nullmodel_GENESIS.Rdata").
Do you have any idea how this happened? Is it due to a misformatted obj_nullmodel? Thanks!

I have followed the tutorial to preparing this
covar <- read.csv("covar.csv")
obj_nullmodel_GENESIS <- fitNullModel(covar,outcome="rate",covars=c("PC1","PC2","PC3","PC4"),family="binomial", cov.mat = NULL,group.var = NULL)

obj_nullmodel <- genesis2staar_nullmodel(obj_nullmodel_GENESIS)
save(obj_nullmodel,obj_nullmodel_GENESIS.Rdata")

"Coding mask" AND "Noncoding mask"

Dear Dr. Li
I am confused with the "Coding mask" AND "Noncoding mask" in R script STAARpipelineSummary_Gene_Centric_Coding_Annotation.r and STAARpipelineSummary_Gene_Centric_Noncoding_Annotation.r

## Chr
chr_seq <- c(1,7,19,19,9)
## Gene name
gene_name_seq <- c("PCSK9","NPC1L1","LDLR","APOE","RNF20")
## Coding mask
category_seq <- c("plof","missense","missense","missense","synonymous")

Why did you choose these five genes? If I want to run these annotation steps, should I choose other genes and how to choose these genes?
Looking forward to your reply.

Small number of variants are passed for analysis

Hello Xihao,

I tried to run both Individual_Analysis() and Sliding_Window() on my aGDS file (178 individuals and 14055 variants on chr22) but the analyses were only performed on 7/NULL variants. I didn't see any other filterings other than "QC_label" in the wrapper methods and the sites are all "PASS" in my case. See below.

I'm not sure if there are additional filterings performed on the backend and most of my sites failed. Can you help figure this out or potentially direct me to these functions so that I may see how they actually worked? Thanks!

A side question is how is STAAR handling imputed SNP dosage if it can?

results <- try(Sliding_Window(chr=22,start_loc=start_loc_sub,end_loc=200000,
                                  sliding_window_length=sliding_window_length,type="multiple",
                                  genofile=genofile,obj_nullmodel=obj_nullmodel,Use_annotation_weights=Use_annotation_weights,
                                  QC_label=QC_label,variant_type=variant_type,geno_missing_imputation=geno_missing_imputation,
                                  Annotation_name_catalog=Annotation_name_catalog,
                                  rare_maf_cutoff=1,rv_num_cutoff=0, Annotation_name=Annotation_name))
# of selected samples: 178
# of selected variants: 7
# of selected samples: 178
# of selected variants: 14,055

results_individual_analysis<-Individual_Analysis(chr=22,start_loc=start_loc,end_loc=20000,
                                                     genofile=genofile,obj_nullmodel=obj_nullmodel,mac_cutoff=20, variant_type=variant_type,
                                                     geno_missing_imputation=geno_missing_imputation)
> results_individual_analysis
NULL

Error in fit STAAR null model

hello,may I ask why does this error occur

obj_nullmodel <- fit_nullmodel(V6~age+sex+V5,

                           data=phenotype,kins=as.matrix(sgrm),use_sparse=TRUE,kins_cutoff=0.022,id="V1",

                           family=gaussian(link="identity"),verbose=TRUE)

[1] "kins is a dense matrix, transforming it into a sparse matrix using cutoff 0.022."
Using 500 samples provided
Identifying clusters of relatives...
13 relatives in 2 clusters; largest cluster = 9
Creating block matrices for clusters...
487 samples with no relatives included
Putting all samples together into one block diagonal matrix
Error in glmmkin(fixed = fixed, data = data, kins = kins_sp, id = id, :
Error: kins matrix 1 does not include all individuals in the data.

Null outputs from Sliding_Window()

Dear @xihaoli,

In my datasets, I have 11201 samples and there are no errors/problems with performing the other STAARpipeline steps, but only with the Sliding_Window step. Each arrayid gives the same warnings continuously and eventually the R crushes.

I tried to re-install each R package or to change the sliding_window_length in various sizes, and the same for the MAF cut-off.

A small example from the errors/warnings:

[1] 1
# of selected samples: 11,021
# of selected variants: 73
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
# of selected samples: 11,021
# of selected variants: 1,307,186
[1] 2
# of selected samples: 11,021
# of selected variants: 46
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
.
.

I would like to inform you that there were no significant results from the Dynamic_Window analysis. Is there any possible reason for the NULL outputs and crushes?

Thank you for your time!
Bests,
Pelin

Lack of Input Examples for running STAARpipeline

Thank you for designing such a useful tool with STAARpipeline.

I am stuck on the summary portion of the pipeline due to the not knowing what the known loci info csv should look like or contain. This is the line in the the R script file: "STAARpipelineSummary_Known_Loci_Pruning.r" within step 0 of the summary step: known_loci_info <- read.csv("/path_to_the_file/TOPMed_F5_LDL_Known_Loci_info.csv"). I was hoping to see what the csv format should look like, what information/columns it should contain and how to format it. In addition, is there any way to include some example input files for reference along the way in the other steps too (such as an example phenotype file or any other csv inputs).

The example R scripts and Shell scripts are very helpful in knowing how to run the pipeline but I have found myself stuck on steps with additional required inputs.

Thank You!

Questions about individual analysis results

Hello,

I have some questions about the output of the individual analysis.

What's "# of selected variants: 23"? Given my input of 14,055 variants, I assume the "selected variants" are not selected by the user-given arguments (e.g., mac_cutoff as I set it to 0) rather indicating the final variants after the individual analysis. So do we have the intermediate results of those unselected variants? Are they stored somewhere and why are they not selected (maybe by some p-values)?
For the error message "(!all(CHR == chr)) ", is it indicating the entire 22 chromosomes are required for the analysis since I only give one chromosome?

> start_loc <- 1
> end_loc <- start_loc + 10e7 - 1
> results_individual_analysis<-Individual_Analysis(chr=22,start_loc=start_loc,end_loc=end_loc,
                                                     genofile=genofile,obj_nullmodel=obj_nullmodel,mac_cutoff=0, variant_type=variant_type,
                                                     geno_missing_imputation=geno_missing_imputation)
# of selected samples: 178
# of selected variants: 23
Error in if (!all(CHR == chr)) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In Individual_Analysis(chr = 22, start_loc = start_loc, end_loc = end_loc,  :
  NAs introduced by coercion

Thanks and I hope I'm not bothering you too much.

Error in Gene_Centric_Coding analysis

Dear Xihao,

While I performed Gene_Centric_Coding analysis, the error below showed up and there were no outputs for some arrayid.

Error in seqGetData(genofile, paste0(Annotation_dir, Annotation_name_catalog$dir[which(Annotation_name_catalog$name ==  : 
  The GDS node "annotation/info/FunctionalAnnotation/genecode_comprehensive_exonic_category" does not exist.

There was no output from every arrayid but I thought it might run the code with the existing output files. I performed the STAARpipelineSummary Gene Centric Coding analysis and got this error:

Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
Calls: Gene_Centric_Coding_Results_Summary -> get -> load -> readChar
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
  cannot open compressed file '..../Gene_Centric_Coding_1.Rdata', probable reason 'No such file or direc$
Execution halted

I have the same problem with Gene_Centric_Noncoding, Sliding Window, and Dynamic Window analysis. However, there was no problem with the individual analysis.

I hope you can help me :)

Thank you!
Pelin

Error in seqGetData: Invalid dimension 'Start' and 'Length'

Hi Xihao,

Thanks for designing such a useful tool. I'm trying to run STAARpipeline_Gene_Centric_Noncoding.r in _Step 3.2: Gene-centric noncoding analysis_in my way, and the error below showed up:

i=1
Rscript STAARpipeline_Gene_Centric_Noncoding.r ${i}
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 273677 14.7     660845 35.3   451833 24.2
Vcells 459327  3.6    8388608 64.0  1800859 13.8
[1] 379
[1] 401
Error in seqGetData(genofile, paste0(Annotation_dir, Annotation_name_catalog$dir[which(Annotation_name_catalog$name ==  : 
  Invalid dimension 'Start' and 'Length'.
Calls: Gene_Centric_Noncoding -> noncoding -> seqGetData
Execution halted

And I run the code in noncoding.R line by line and it seems to report an error when I run

GENCODE.Category <- seqGetData(genofile, paste0(Annotation_dir,Annotation_name_catalog$dir[which(Annotation_name_catalog$name=="GENCODE.Category")])).

In this code, paste0(Annotation_dir,Annotation_name_catalog$dir[which(Annotation_name_catalog$name=="GENCODE.Category")]) refers to "annotation/info/FunctionalAnnotation/genecode_comprehensive_category". And I double check the information in my genofile, and "annotation/info/FunctionalAnnotation/genecode_comprehensive_category" is indeed in genofile.

r$> genofile
Object of class "SeqVarGDSClass"
File: ADNI.808_indiv.minGQ_21.pass.ADNI_ID.chr1.gds (108.7M)
+    [  ] *
|--+ description   [  ] *
|--+ sample.id   { Str8 808 LZMA_ra(29.9%), 2.6K } *
|--+ variant.id   { Int32 2949744 LZMA_ra(3.99%), 459.6K } *
|--+ position   { Int32 2949744 LZMA_ra(34.4%), 3.9M } *
|--+ chromosome   { Str8 2949744 LZMA_ra(0.02%), 1009B } *
|--+ allele   { Str8 2949744 LZMA_ra(11.3%), 1.3M } *
|--+ genotype   [  ] *
|  |--+ data   { Bit2 2x808x2949744 LZMA_ra(3.23%), 36.8M } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
|  \--+ extra   { Int16 0 LZMA_ra, 18B }
|--+ phase   [  ]
|  |--+ data   { Bit1 808x2949744 LZMA_ra(0.01%), 42.5K } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
|  \--+ extra   { Bit1 0 LZMA_ra, 18B }
|--+ annotation   [  ]
|  |--+ id   { Str8 2949744 LZMA_ra(33.6%), 6.1M } *
|  |--+ qual   { Float32 2949744 LZMA_ra(60.5%), 6.8M } *
|  |--+ filter   { Int32,factor 2949744 LZMA_ra(0.02%), 1.8K } *
|  |--+ info   [  ]
|  |  |--+ AC   { Int32 0 LZMA_ra, 18B } *
|  |  |--+ AF   { Float32 0 LZMA_ra, 18B } *
|  |  |--+ AN   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ BaseQRankSum   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ CCC   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ ClippingRankSum   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ DB   { Bit1 2949744 LZMA_ra(0.05%), 209B } *
|  |  |--+ DP   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ DS   { Bit1 2949744 LZMA_ra(0.05%), 209B } *
|  |  |--+ END   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ FS   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ GQ_MEAN   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ GQ_STDDEV   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ HWP   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ HaplotypeScore   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ InbreedingCoeff   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ MLEAC   { Int32 0 LZMA_ra, 18B } *
|  |  |--+ MLEAF   { Float32 0 LZMA_ra, 18B } *
|  |  |--+ MQ   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ MQ0   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ MQRankSum   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ NCC   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ NEGATIVE_TRAIN_SITE   { Bit1 2949744 LZMA_ra(0.05%), 209B } *
|  |  |--+ POSITIVE_TRAIN_SITE   { Bit1 2949744 LZMA_ra(0.05%), 209B } *
|  |  |--+ QD   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ ReadPosRankSum   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ VQSLOD   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ culprit   { Str8 2949744 LZMA_ra(0.02%), 581B } *
|  |  \--+ FunctionalAnnotation   [ spec_tbl_df,tbl_df,tbl,data.frame,list ] *
|  |     |--+ VarInfo   { Str8 2946875 LZMA_ra(14.2%), 6.2M }
|  |     |--+ apc_conservation   { Float64 2946875 LZMA_ra(24.6%), 5.5M }
|  |     |--+ apc_epigenetics   { Float64 2946875 LZMA_ra(24.8%), 5.6M }
|  |     |--+ apc_epigenetics_active   { Float64 2946875 LZMA_ra(24.0%), 5.4M }
|  |     |--+ apc_epigenetics_repressed   { Float64 2946875 LZMA_ra(19.1%), 4.3M }
|  |     |--+ apc_epigenetics_transcription   { Float64 2946875 LZMA_ra(17.4%), 3.9M }
|  |     |--+ apc_local_nucleotide_diversity   { Float64 2946875 LZMA_ra(24.5%), 5.5M }
|  |     |--+ apc_mappability   { Float64 2946875 LZMA_ra(10.3%), 2.3M }
|  |     |--+ apc_protein_function   { Float64 2946875 LZMA_ra(2.23%), 514.1K }
|  |     |--+ apc_transcription_factor   { Float64 2946875 LZMA_ra(4.59%), 1.0M }
|  |     |--+ cage_tc   { Str8 2946875 LZMA_ra(3.13%), 100.6K }
|  |     |--+ metasvm_pred   { Str8 2946875 LZMA_ra(0.60%), 17.2K }
|  |     |--+ rsid   { Str8 2946875 LZMA_ra(17.3%), 764.3K }
|  |     |--+ fathmm_xf   { Float64 2946875 LZMA_ra(18.9%), 4.3M }
|  |     |--+ genecode_comprehensive_category   { Str8 2946875 LZMA_ra(5.23%), 503.0K }
|  |     |--+ genecode_comprehensive_info   { Str8 2946875 LZMA_ra(11.1%), 1.9M }
|  |     |--+ genecode_comprehensive_exonic_category   { Str8 2946875 LZMA_ra(0.86%), 26.3K }
|  |     |--+ genecode_comprehensive_exonic_info   { Str8 2946875 LZMA_ra(5.72%), 249.5K }
|  |     |--+ genehancer   { Str8 2946875 LZMA_ra(2.04%), 649.1K }
|  |     |--+ linsight   { Float64 2946875 LZMA_ra(10.7%), 2.4M }
|  |     |--+ cadd_phred   { Float64 2946875 LZMA_ra(8.62%), 1.9M }
|  |     \--+ rdhs   { Str8 2946875 LZMA_ra(7.30%), 346.1K }
|  \--+ format   [  ]
\--+ sample.annotation   [  ]

As I can run STAARpipeline_Gene_Centric_Coding.r without any error and I find there is a similar code in coding.R(line 64), which filter variants and samples before seqGetData(). And the code GENCODE.Category <- seqGetData(genofile, paste0(Annotation_dir,Annotation_name_catalog$dir[which(Annotation_name_catalog$name=="GENCODE.Category")]))
can run successfully after restrict variants. There were 2,949,744 variants before filter in my genofile and I guess is it possible there was too much variants in my genofile which cause this error? Should I filter some variants (such as MAF or other quality control indicators) before run this code?

Thank you very much for your time and help,
Bangsheng

#SNVs per case/control?

Hello,

For the gene-centric analyses, is there any possible way to obtain the number of SNVs per case and control samples? The output from the summary functions only lists the total number of SNVs found per gene+category. I cannot seem to locate where this SNV count is actually generated in any of the functions to even try to modify this myself.

GENESIS null model not working

When genesis null model is used in the code below, it returns: dsyMatrix not supported

results_individual_analysis <- Individual_Analysis(chr=chr,start_loc=start_loc,end_loc=end_loc,
genofile=genofile,obj_nullmodel=obj_nullmodel,mac_cutoff=20,
QC_label=QC_label,variant_type=variant_type,
geno_missing_imputation=geno_missing_imputation)

Known_Loci_Pruning error

When I execute the script STAARpipelineSummary_Known_Loci_Pruning.r, there are several chromosomes that did not generate corresponding data, and the error is as follows:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 283209 15.2 664320 35.5 450929 24.1
Vcells 487090 3.8 8388608 64.0 1815701 13.9
[1] 8

of selected samples: 3,508

of selected variants: 2

of selected samples: 3,508

of selected variants: 2

of selected samples: 3,508

of selected variants: 4,303,007

of selected samples: 3,508

of selected variants: 1

of selected samples: 3,508

of selected variants: 4,303,007

of selected samples: 3,508

of selected variants: 2

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
没有(非NA)案例可用
Calls: LD_pruning -> Individual_Analysis_cond -> lm -> lm.fit
停止执行
Is the reason for this that the provided known pathogenic loci do not exist in my data? If so, how can I avoid conflicts between removing certain regions during QC and this step?

Thank you for your help.

VCF2GDS

hello，may I ask whether the vcf file needs to be divided by chromosome before vcf2GDS processing?

Running staarpipeline docker image

Hi Xihaoli,
Thanks for this seemingly great tool, I’d like to try it out. I’m having challenges with installing the staarpipeline dependencies. I’ve now pulled the docker image into an HPC as a singularity image. How do I run the image with the Rscripts that you have provided on GitHub?

Generate GDS files

Hi Xihao,

We are trying to run STAARpipeline for rare non-coding variants. I had a question for "Step 1: Generate the variants list to be annotated". Is the user meant to generate GDS files on their input data using SeqArray (before using Varinfo_gds.R), or are these files already provided? If so, do you have example code on how to do generate them? Is something like the example here on pages 3-5 intended? I am not familiar with GDS files or the SeqArray package and I wanted to check first. Thanks.

Daniel

fit_nullmodel Output is mostly Null and 0

Hi,
I'm not sure if I'm getting a reasonable output of the null model.

My genotyping (WGS) data has 1026 samples.
`
genofile
Object of class "SeqVarGDSClass"
File: /analysis/21-WGS/analysis/STAARpipeline/gds_with_AVGDP/chr11.gds (283.1M)

[ ] *
|--+ description [ ] *
|--+ sample.id { Str8 1026 LZMA_ra(13.0%), 1.1K } *
|--+ variant.id { Int32 2496036 LZMA_ra(4.50%), 439.1K } *
|--+ position { Int32 2496036 LZMA_ra(30.8%), 2.9M } *
|--+ chromosome { Str8 2496036 LZMA_ra(0.02%), 1.2K } *
|--+ allele { Str8 2496036 LZMA_ra(15.8%), 1.7M } *
|--+ genotype [ ] *
| |--+ data { Bit2 2x1026x2496036 LZMA_ra(4.87%), 59.5M } *
| |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
| --+ extra { Int16 0 LZMA_ra, 18B }
|--+ phase [ ]
| |--+ data { Bit1 1026x2496036 LZMA_ra(0.01%), 45.6K } *
| |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
| --+ extra { Bit1 0 LZMA_ra, 18B }
|--+ annotation [ ]
| |--+ id { Str8 2496036 LZMA_ra(9.76%), 2.8M } *
| |--+ qual { Float32 2496036 LZMA_ra(62.4%), 5.9M } *
| |--+ filter { Int32,factor 2496036 LZMA_ra(2.17%), 211.1K } *
| |--+ info [ ]
| | |--+ AC { Int32 2496036 LZMA_ra(18.8%), 1.8M } *
| | |--+ AF { Float32 2496036 LZMA_ra(36.3%), 3.5M } *
| | |--+ ALLELE_END { Bit1 2496036 LZMA_ra(0.06%), 201B } *
| | |--+ AN { Int32 2496036 LZMA_ra(12.2%), 1.2M } *
| | |--+ AS_BaseQRankSum { Float32 0 LZMA_ra, 18B } *
| | |--+ AS_FS { Float32 1531413 LZMA_ra(26.4%), 1.5M } *
| | |--+ AS_InbreedingCoeff { Float32 3060955 LZMA_ra(23.0%), 2.7M } *
| | |--+ AS_MQ { Float32 0 LZMA_ra, 18B } *
| | |--+ AS_MQRankSum { Float32 0 LZMA_ra, 18B } *
| | |--+ AS_QD { Float32 3060955 LZMA_ra(35.3%), 4.1M } *
| | |--+ AS_ReadPosRankSum { Float32 0 LZMA_ra, 18B } *
| | |--+ AS_SB_TABLE { Str8 2496036 LZMA_ra(0.02%), 517B } *
| | |--+ AS_SOR { Float32 1531413 LZMA_ra(28.4%), 1.7M } *
| | |--+ BaseQRankSum { Float32 2496036 LZMA_ra(39.8%), 3.8M } *
| | |--+ ClippingRankSum { Float32 2496036 LZMA_ra(34.4%), 3.3M } *
| | |--+ DB { Bit1 2496036 LZMA_ra(0.06%), 201B } *
| | |--+ DP { Int32 2496036 LZMA_ra(40.8%), 3.9M } *
| | |--+ DS { Bit1 2496036 LZMA_ra(0.06%), 201B } *
| | |--+ END { Int32 2496036 LZMA_ra(0.02%), 1.6K } *
| | |--+ ExcessHet { Float32 2496036 LZMA_ra(24.4%), 2.3M } *
| | |--+ FS { Float32 2496036 LZMA_ra(35.9%), 3.4M } *
| | |--+ InbreedingCoeff { Float32 2496036 LZMA_ra(24.2%), 2.3M } *
| | |--+ MLEAC { Int32 2642114 LZMA_ra(17.2%), 1.7M } *
| | |--+ MLEAF { Float32 2642114 LZMA_ra(18.6%), 1.9M } *
| | |--+ MQ { Float32 2496036 LZMA_ra(12.3%), 1.2M } *
| | |--+ MQ0 { Int32 2496036 LZMA_ra(0.02%), 1.6K } *
| | |--+ MQRankSum { Float32 2496036 LZMA_ra(35.7%), 3.4M } *
| | |--+ NEGATIVE_TRAIN_SITE { Bit1 2496036 LZMA_ra(45.5%), 138.7K } *
| | |--+ POSITIVE_TRAIN_SITE { Bit1 2496036 LZMA_ra(96.5%), 294.1K } *
| | |--+ QD { Float32 2496036 LZMA_ra(37.5%), 3.6M } *
| | |--+ RAW_MQ { Float32 2496036 LZMA_ra(0.02%), 1.6K } *
| | |--+ ReadPosRankSum { Float32 2496036 LZMA_ra(39.5%), 3.8M } *
| | |--+ SOR { Float32 2496036 LZMA_ra(36.3%), 3.5M } *
| | |--+ VQSLOD { Float32 2496036 LZMA_ra(41.1%), 3.9M } *
| | |--+ culprit { Str8 2496036 LZMA_ra(7.96%), 1.0M } *
| | |--+ F_MISSING { Float32 2496036 LZMA_ra(13.2%), 1.3M } *
| | |--+ AN_case { Int32 2496036 LZMA_ra(6.09%), 593.7K } *
| | |--+ AN_control { Int32 2496036 LZMA_ra(11.0%), 1.0M } *
| | |--+ AC_case { Int32 2496036 LZMA_ra(9.24%), 900.6K } *
| | |--+ AC_control { Int32 2496036 LZMA_ra(19.1%), 1.8M } *
| | |--+ AC_Het_case { Int32 2496036 LZMA_ra(8.40%), 818.7K } *
| | |--+ AC_Het_control { Int32 2496036 LZMA_ra(17.7%), 1.7M } *
| | |--+ AC_Het { Int32 2496036 LZMA_ra(17.7%), 1.7M } *
| | |--+ AF_case { Float32 2496036 LZMA_ra(13.4%), 1.3M } *
| | |--+ AF_control { Float32 2496036 LZMA_ra(34.4%), 3.3M } *
| | |--+ HWE_case { Float32 2496036 LZMA_ra(7.81%), 761.3K } *
| | |--+ HWE_control { Float32 2496036 LZMA_ra(22.8%), 2.2M } *
| | |--+ HWE { Float32 2496036 LZMA_ra(23.6%), 2.2M } *
| | |--+ FunctionalAnnotation [ spec_tbl_df,tbl_df,tbl,data.frame,list ] *
| | | |--+ VarInfo { Str8 2496036 LZMA_ra(17.0%), 6.7M }
| | | |--+ apc_conservation { Float64 2496036 LZMA_ra(79.1%), 15.1M }
| | | |--+ apc_epigenetics { Float64 2496036 LZMA_ra(81.4%), 15.5M }
| | | |--+ apc_epigenetics_active { Float64 2496036 LZMA_ra(75.9%), 14.5M }
| | | |--+ apc_epigenetics_repressed { Float64 2496036 LZMA_ra(50.5%), 9.6M }
| | | |--+ apc_epigenetics_transcription { Float64 2496036 LZMA_ra(43.6%), 8.3M }
| | | |--+ apc_local_nucleotide_diversity { Float64 2496036 LZMA_ra(79.9%), 15.2M }
| | | |--+ apc_mappability { Float64 2496036 LZMA_ra(29.7%), 5.7M }
| | | |--+ apc_protein_function { Float64 2496036 LZMA_ra(2.15%), 419.9K }
| | | |--+ apc_transcription_factor { Float64 2496036 LZMA_ra(7.10%), 1.4M }
| | | |--+ cage_tc { Str8 2496036 LZMA_ra(5.19%), 183.1K }
| | | |--+ metasvm_pred { Str8 2496036 LZMA_ra(0.82%), 20.1K }
| | | |--+ rsid { Str8 2496036 LZMA_ra(36.4%), 9.4M }
| | | |--+ fathmm_xf { Float64 2496036 LZMA_ra(50.6%), 9.6M }
| | | |--+ genecode_comprehensive_category { Str8 2496036 LZMA_ra(0.57%), 147.1K }
| | | |--+ genecode_comprehensive_info { Str8 2496036 LZMA_ra(6.10%), 3.2M }
| | | |--+ genecode_comprehensive_exonic_category { Str8 2496036 LZMA_ra(1.19%), 34.6K }
| | | |--+ genecode_comprehensive_exonic_info { Str8 2496036 LZMA_ra(7.53%), 489.1K }
| | | |--+ genehancer { Str8 2496036 LZMA_ra(0.39%), 365.7K }
| | | |--+ linsight { Float64 2496036 LZMA_ra(21.7%), 4.1M }
| | | |--+ cadd_phred { Float64 2496036 LZMA_ra(23.7%), 4.5M }
| | | --+ rdhs { Str8 2496036 LZMA_ra(3.81%), 317.0K }
| | --+ AVGDP { Int32 2496036 LZMA_ra(40.8%), 3.9M } *
| --+ format [ ]
--+ sample.annotation [ ]
`

The sgrm matrix is 1026x1026 . I made this using --degree 2 (everything else the same, I know non of my cases are related, and only 1 pair of controls are 1st degree relatives, which the SGRM pipeline identified)

> sgrm[100:105,100:105]  ### Example of few lines of the matrix
> 6 x 6 sparse Matrix of class "dsCMatrix"
>           0_HG00151 0_HG00154 0_HG00155 0_HG00157 0_HG00158 0_HG00159
> 0_HG00151 0.5043822 .         .         .         .         .        
> 0_HG00154 .         0.5009275 .         .         .         .        
> 0_HG00155 .         .         0.4998904 .         .         .        
> 0_HG00157 .         .         .         0.4966408 .         .        
> 0_HG00158 .         .         .         .         0.4979511 .        
> 0_HG00159 .         .         .         .         .         0.5044529

Phenotype has 1004 samples, with a binary variable, which I've created as follows:

tail(phenotype)   #### The PCs here are those that are generated by the SGRM output .score file
                   FID SNPSEX pheno group_race         PC1        PC2           PC3
999  0_NA20821      2     1        EUR -0.02332929 0.01675661 -0.0010476483
1000 0_NA20822      2     1        EUR -0.02330287 0.01741464  0.0001754674
1001 0_NA20826      2     1        EUR -0.02354092 0.01799696 -0.0002679973
1002 0_NA20827      1     1        EUR -0.02345230 0.01764050 -0.0008549907
1003 0_NA20828      2     1        EUR -0.02335791 0.01736181 -0.0007909704
1004 0_NA20832      2     1        EUR -0.02320004 0.01698723  0.0004000635
              PC4          PC5         PC6           PC7           PC8
999  -0.003812300 -0.007291464 -0.04352340  0.0010770846 -0.0009705049
1000 -0.003096957 -0.006086234 -0.04654090  0.0005451567 -0.0014752338
1001 -0.004336742 -0.004473674 -0.04533632  0.0002382286  0.0016030985
1002 -0.003471963 -0.004603992 -0.03881987 -0.0001682949 -0.0005360977
1003 -0.003189036 -0.004899655 -0.04413306  0.0002276412 -0.0001268270
1004 -0.003445883 -0.008064412 -0.04676934  0.0007819886 -0.0020845665
....
... PC9 to PC20 ....
```> 

My genotyping data is a mix of Europeans, hispanics, and African races, which I've coded as the "group_race" in the phenotype file. 
From another answered question I understood that "groups" need only be used if my phenotype is a continuous variable.
So I skipped it.
`## fit null model
obj_nullmodel <- fit_nullmodel(pheno~SNPSEX+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10, data=phenotype,kins=sgrm,use_sparse=TRUE,
kins_cutoff=0.022,id="FID",family=gaussian(link="identity"),verbose=TRUE)
`

This is the output. I am wondering if this is reasonable. 
obj_nullmodel$coefficients
(Intercept)      SNPSEX         PC1         PC2         PC3         PC4 
 1.35457458 -0.22051452 -2.45884174  1.31956283 -0.12508581  0.21253838 
        PC5         PC6         PC7         PC8         PC9        PC10 
 0.40876625  1.82775475 -0.27736805 -0.14491347  0.03741783 -0.01117903 


table( obj_nullmodel$residuals )

   0 
1004 
table( obj_nullmodel$scaled.residuals)
< table of extent 0 >
head(obj_nullmodel$scaled.residuals)
  1   2   3   4   5   6 
NaN NaN NaN NaN NaN NaN 
> obj_nullmodel$converged
[1] TRUE
>  obj_nullmodel$sparse_kins
[1] TRUE
>  obj_nullmodel$relatedness
[1] TRUE
>  obj_nullmodel$use_SPA
[1] FALSE


I'm concerned because when I ran the Individual analysis, I got NaN Pvalues for all variants.
head analysis/results_individual_analysis_sig.csv 
"","CHR","POS","REF","ALT","ALT_AF","MAF","N","pvalue","pvalue_log10","Score","Score_se","Est","Est_se"
"NA",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.1",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.2",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.3",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.4",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.5",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.6",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA


Here is a one part of the individual analysis just to see the NaNs
ind<- get(load("analysis/DILIN_Individual_Analysis_78.Rdata"))
head(ind)
    CHR      POS REF ALT     ALT_AF        MAF    N pvalue pvalue_log10 Score Score_se Est. Est_se
989   4 70019914   G   A 0.02442672 0.02442672 1004    NaN          NaN   NaN. 26.30146 NaN 0.03802071
1     4 70019928   G   A 0.55276382 0.44723618 1004    NaN          NaN   NaN.    77.90924 NaN 0.01283545
990   4 70020508  GA   G 0.01846307 0.01846307 1004    NaN          NaN   NaN. 22.09448 NaN 0.04526017
2     4 70020586   G   A 0.06000000 0.06000000 1004    NaN          NaN   NaN.   38.48158 NaN 0.02598646
3     4 70020606   T   C 0.09450000 0.09450000 1004    NaN          NaN   NaN 46.11840 NaN 0.02168332
4     4 70020771   G   A 0.09530938 0.09530938 1004    NaN          NaN   NaN.    46.27219 NaN 0.02161125

Error in Fit STAAR null model step: asMethod(object) : matrix is not symmetric [1,2]

Hello,

When I ran the command in script STAARpipeline_Null_Model.r, I got this error:
obj_nullmodel <- fit_nullmodel(phenotype~age+sex+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10,
data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="sample.id",
family=gaussian(link="identity"),verbose=TRUE)
[1] "kins is a sparse matrix."
Fixed-effect coefficients:
(Intercept) age sex PC1 PC2 PC3
-0.09710790 0.01004128 0.01336379 0.59188157 0.82643259 -0.88178803
PC4 PC5 PC6 PC7 PC8 PC9
-0.38470585 0.74543827 0.04345983 0.81811801 -1.34958786 -0.52993129
PC10
0.13472704
Error in asMethod(object) : matrix is not symmetric [1,2]

Then I generated the sgrm data by FastSparseGRM with no error again and the result for command "isSymmetric(as.matrix(sgrm))" was "TRUE". I also check the colnames and rownames by running command "unique(colnames(sgrm) == rownames(sgrm))" and result was "TRUE". I don't know what lead to this error. Could you help me to resolve this problem?

Best
Xubing

Ancestry PCs / kinship

This is more of a question than an issue regarding including ancestry information in the model.

I see there is the step of generating the GRM to pass to the null model function using the kins argument. When defining the model, however, the example code in STAARpipeline_Null_Model.r also includes 10 principal components as covariates. Are these also the ancestry components? If so, would they be accounted for twice then when also supplying the GRM?

I've generated the null model with including ancestry PCs as covariates and excluded the GRM (i.e., setting kins=NULL). After running the analyses, I'm seeing some genomic deflation in my QQ plots, especially in the gene-centric noncoding analysis. I'm wondering if my setup could be the issue?

info and tutorial files

Dear authors,
I am trying to reproduce the analysis provided in this tutorial but it is a bit tricky without the same input files you used and/or a bit more explanations for reproducibility.
For example, in the "STAARpipeline_Null_Model.r" it is not clear how the covariates are coded, as well as the tested phenotype.
Also, it is not clear how the sparse GRM is produced from the gds file.
Thank you for the help.

About rank-based inverse normal transformation

Dear Xihao,
I have some questions about when I perform rank-based inverse normal transformation for my phenotype before fitting the null model.
The phenotype is directly performed to the inverse normal transformation? like using the R package RNOmni. OR I need to perform two steps. The first step is that the null model is fit using the original outcome values. And the second step is using residuals to rank-based inverse Normal transform, and then fit the model again, but it seems not to be rescaled.
So, which one is more recommended?

Best,
Qilong

Question about selecting variants to calculate sparse GRM

Dear authors,

I'm trying to use your pipeline to our association study, but got questions about the way to build GRM.
Is it correct to use the common (MAF > 5%) variants to build GRM, so that we can capture population structure among individuals?
Or should I use the rare variants (those to be used in the main analysis), so we capture the similarity between individuals as will be observed in the main analysis?

Thank you,
In-Hee

error in running "STAARpipeline_Null_Model.r"

Hi Xihao,a very impressive job!!but when I run the program to this point--“STAARpipeline_Null_Model.r”，I get this error
Iteration 20 : Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'chol2inv': error in evaluating the argument 'x' in selecting a method for function 't': leading principal minor of order 1 is not positive Calls: fit_nullmodel ... .Call -> .handleSimpleError -> h -> .handleSimpleError -> h In addition: Warning message: In .local(A, ...) : CHOLMOD warning 'not positive definite' at file '../Cholesky/t_cholmod_rowfac.c', line 430 Execution halted

When generating the GRM matrix, I input the bed file that I completed pruning on Plink myself, and then follow your process to get a sparse matrix,I don't know how to fix it

No Manhattan or QQ plots from STAARpipelineSummary_Dynamic_Window.r?

Hello,

While running through the summary scripts, it seems the dynamic window analysis summary does not produce any plots? The README does state that it should...

Within the function Dynamic_Window_Results_Summary.R in the STAARpipelineSummary repo, should the results from each separate SCANG table within SCANG_res be plotted?

as.numeric(commandArgs(TRUE)[1])

Dear xihaoli,

To perform rare-variant analysis from genotype data, I have been sent batch jobs to our uni cluster using your R scripts, and I came until Individual_analysis, however, I have an issue about:

arrayid <- as.numeric(commandArgs(TRUE)[1])
print(arrayid)
[1] NA

It returns as "NA_real_" and the following codes don't work. Do you have any suggestions to solve this issue?

Thanks!

Here is R info;

sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Gene Centric Coding Annotation fails

Hello,

When running the script STAARpipelineSummary_Gene_Centric_Coding_Annotation.r, I am encountering this error:

Error in seqGetData(genofile, paste0(Annotation_dir, Annotation_name_catalog$dir[which(Annotation_name_catalog$name ==  :
  The GDS node "annotation/info/FunctionalAnnotation/rsid" does not exist.
Calls: Gene_Centric_Coding_Info -> info_plof -> seqGetData

The rsids for each variant are not included when annotating the gds files in the earlier steps. Are the rsids supposed to be within the FAVOR database files? I'm using the csv files downloaded from the FAVOR database website and the rsids don't seem to be in there..

error: run FastSparseGRM

Dear Dr. Li
I encountered a problem when I run one of the steps in R package FastSparseGRM to calculate the GRM. There were no output files after running the script runPCA_wrapper.R. And I checked the log file runPCA.Rout, here is the part of the log file:

file.unrels <- opt$file.unrels
prefix.in <- opt$prefix.in
no_pcs <- opt$no_pcs
no_cores <- opt$no_cores
num_threads <- opt$num_threads
prefix.out <- opt$prefix.out
no_iter <- opt$no_iter

library('FastSparseGRM')
runPCA(prefix.in,file.unrels,no_pcs,num_threads,prefix.out,no_iter)
[1] "PCA started at 2022-12-04 17:53:17"
[1] "4 relateds, 768 unrelateds, and 24928532 variants"
Reading the file: pruned.bed
Reading BED file complete
Number of SNPs read: 24928532
Number of samples: 772
Using 40 threads
Processing samples complete
Allele frequencies calculated
[1] "Reading Completed in 16 seconds"
[1] "1 iterations completed"
[1] "2 iterations completed"
[1] "3 iterations completed"
[1] "4 iterations completed"
[1] "5 iterations completed"
[1] "6 iterations completed"
[1] "7 iterations completed"
[1] "8 iterations completed"
[1] "9 iterations completed"
[1] "10 iterations completed"
[1] "Iterations Completed in 683 seconds"
Error in svd(H, nv = 0) : infinite or missing values in 'x'
Calls: runPCA -> drpca -> svd
Execution halted

Has this happened before? I have no idea to solve this problem, please give me some advice.
Hope to your reply :)

vcf to gds

Hello, @xihaoli.
In the script of "Varinfo_gds.R" and "gds2agds.R", it has two gds name.
gds_file_name_1 <- "freeze.5.chr"
gds_file_name_2 <- ".pass_and_fail.gtonly.minDP0.gds"

I wonder what the two documents represent and how they are produced? Thank you.

Expected input Parameters for Gene Centric Coding Step

Whenever I run the Gene Centric Coding Step I get this error regardless of how many arguments or what the argument is that I input. What is the pipeline expecting as input?

[1] 379
[1] 701

of selected samples: 0

of selected variants: 207

of selected samples: 0

of selected variants: 29

of selected samples: 0

of selected variants: 0

Error in data.frame(id.genotype, index = seq(1, length(id.genotype))) :
arguments imply differing number of rows: 0, 2
Calls: Gene_Centric_Coding -> coding -> data.frame
Execution halted

Number of jobs

Hi Xihao,
I am using a different dataset and have slightly different number of jobs for the following analyses:
sum(jobs_num$individual_analysis_num) = 294
sum(jobs_num$sliding_window_num) = 574
sum(jobs_num$scang_num) = 1883
I'm not sure about how to determine the number of jobs for gene-centric coding and noncoding analyses. Please advise on whether I have to use similar number of jobs as you for gene-centric coding and noncoding analyses.
Thanks.

Phenofile in STAARpipeline_Null_Model.r

Hi,

Thanks for the amazing tool!
When constructing the null model using STAARpipeline_Null_Model.r, should the sample in 'pheno.csv' be the same as input vcf (which is used in the very beginning and converted to GDS files)? Or the sample in 'pheno.csv' could be a subset of that in input vcf?

Now I only want to analyze the samples in 'pheno.csv'. When I used STAARpipeline_STAAR2SCANG.r, it reported Error in y %*% diag(eigen_diff) : non-integral arguments. So I am wondering if the inconsistency of pheno.csv and input vcf is the reason for the error?

Thanks!
Shea

Is there any example file for test?

Hi,

I am wondering is there any example file that I could use for test?

Thanks a lot!

Shea

LD_pruning error

Hello Xihao,
I tried to run both Summary_Known_Loci_Pruning.r and Summary_Known_Loci_Individual_Analysis.r,but I came across an issue. LD pruning halted on some chromosomes. Below is the output log, along with the error message. Do you know what caused the following error message, and how it will affect my final result, and how I can fix this problem


         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 283149 15.2     664202 35.5   450931 24.1
Vcells 486990  3.8    8388608 64.0  1815700 13.9
[1] 1
# of selected samples: 378,834
# of selected variants: 3
# of selected samples: 378,834
# of selected variants: 3
# of selected samples: 380,943
# of selected variants: 2,093,696
# of selected samples: 378,834
# of selected variants: 0

Error in apply(Geno_adjusted, 2, mean) : 
  dim(X) must have a positive length
Calls: LD_pruning -> Individual_Analysis_cond -> apply
Execution halted
Error in apply(Geno_adjusted, 2, mean) : 
  dim(X) must have a positive length
Calls: LD_pruning -> Individual_Analysis_cond -> apply
Execution halted
Error in apply(Geno_adjusted, 2, mean) : 
  dim(X) must have a positive length
Calls: LD_pruning -> Individual_Analysis_cond -> apply
Execution halted
Error in apply(Geno_adjusted, 2, mean) : 
  dim(X) must have a positive length
Calls: LD_pruning -> Individual_Analysis_cond -> apply
Execution halted
Error in apply(Geno_adjusted, 2, mean) : 
  dim(X) must have a positive length
Calls: LD_pruning -> Individual_Analysis_cond -> apply
Execution halted

STAAR sliding window errors

Dear Xihao Li,

I have found your package very useful, and I thank you for your clear and easy to follow pipeline tutorial.

I would like to ask you about an issue I've encountered. When running the fixed sliding window analysis I get the following odd Manhattan plot:

It seems to tend to "work" at the start of chromosomes and then skip large distances for the rest of each chromosome, resulting in a sparse number of windows evaluated.

I concomitantly get the following error as well while running the analysis:

warning: eig_sym(): given matrix is not symmetric Error : eig_sym(): decomposition failed

This occurs repeatedly, though the code still seems to progress, outputting ascending integers (presumably indices), e.g. 2, bunch of errors, 3, bunch of errors, etc. See below for an example:

[1] 3
# of selected samples: 1,177
# of selected variants: 292

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed
# of selected samples: 1,177
# of selected variants: 1,007,368
[1] 4
# of selected samples: 1,177
# of selected variants: 243

Do you have any idea what the issue might be?

I would be very much grateful for any help you could provide.

Many thanks,

Alex

Errors when running Sliding_Window()

Hello,

I'm testing 2k variant sites among 1373 individuals using sliding_window(). I first set window_length 6k, which seems to run twice but returns different numbers of samples and variants each time, then stop with "rare variant number less than 2" error (see below). Then I set window_length 2k, this time it throws out an error "Error in if (sum(is.in) >= 2) { : missing value where TRUE/FALSE needed".

Why it returns different errors when setting different window lengths? Why it didn't read 2000 variants and 1373 individuals initially in my first test? Is it a format thing with genofile? What do you think? Thanks

Sliding_Window(chr=3,start_loc=11919,end_loc=1239497,sliding_window_length=6000, genofile=genofile,obj_nullmodel=obj_nullmodel)

of selected samples: 1,372

of selected variants: 0

of selected samples: 1,373

of selected variants: 2,000

Error in Sliding_Window_Single(chr = chr, start_loc = start_loc, end_loc = end_loc, :
Number of rare variant in the set is less than 2!

Sliding_Window(chr=3,start_loc=11919,end_loc=1239497,sliding_window_length=2000, genofile=genofile,obj_nullmodel=obj_nullmodel)

of selected samples: 1,372

of selected variants: 0

Error in if (sum(is.in) >= 2) { : missing value where TRUE/FALSE needed

"CSV error" when running Annotate.R

Hi,

Recently I was trying to use our in-house data in STAARpipeline for WGS burden test. When I ran the Annotate.R, the log showed errors like this:

[1] 1
CSV error: record 4 (line: 5, byte: 46): found record with 2 fields, but the previous record has 1 fields
[1] 2
CSV error: record 29 (line: 30, byte: 439): found record with 2 fields, but the previous record has 1 fields
[1] 3
CSV error: record 1 (line: 2, byte: 8): found record with 2 fields, but the previous record has 1 fields
...

However, it did not stop executing the job, and generated the output "Anno_chr1_STAARpipeline.csv".

I wondered how these errors happen (is it because of the VCF format?) and whether these errors would affect the final burden result.

Attached is the log file.
slurm-59206.log

Looking forward to your reply! Thanks a lot!

Shea

Fitting NULL model for binary outcomes

Hello!
I'm trying to fit my null_model (with genesis and STAAR). This is my phenotype:

sex	diagnosis_phenotype	ID
f	Disease	1
f	Disease	2
f	Control	3
m	Disease	4
m	Disease	5
f	Control	6
f	Control	7
m	Control	8
f	Control	9
m	Control	10
f	Control	11

When using:

fit_nullmodel(as.factor(diagnosis_phenotype)~as.factor(sex),
                               data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="ID",
                               family=gaussian(link="identity"),verbose=TRUE)

I get

Error in glm.fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  :
  NA/NaN/Inf in 'y'
In addition: Warning messages:
1: In Ops.factor(y, mu) : ‘-’ not meaningful for factors
2: In Ops.factor(eta, offset) : ‘-’ not meaningful for factors
3: In Ops.factor(y, mu) : ‘-’ not meaningful for factors

With GENESIS, i get:


fitNullModel(data_GENESIS,outcome="diagnosis_phenotype",
+                                       cov.mat=sgrm,AIREML.tol=1e-4,verbose=TRUE, family = "binomial")
Error in svd(x, 0, 0) : a dimension is zero

How can I generate the null model for binary/categorical outcomes?

Question about the results of Dynamic_Window_SCANG() and Dynamic_Window_Results_Summary()

Hi!

It's me again.)

Finally get to the final step.
After successfully getting the output using the 1000Genome dataset as toy example, I used an in-house dataset of 100 samples. My purpose is to do WGS burden using dynamic window analysis. As I ran the last step (STAARpipeline_Dynamic_Window.r), I got a strange result.

I got 22 files named "sscpv2_1.Rdata"..."sscpv2_22.Rdata", I think it should stand for the analysis result of chr1 to chr22. However, when I opened "sscpv2_13.Rdata", the SCANG_B_top1 was showed as below:

      -logp chr start_pos end_pos  GWER SNV_num
 [1,]     0   1         0       0 5e-04       1

I have checked that the inputs of STAARpipeline_Dynamic_Window.r (i.e. genofile, obj_nullmodel...)are good. If it is the result of chr13, it's strange the "chr" column is 1. And the "start_pos" and "end_pos" are 0. Does it mean the no significant output? After all, the test sample size was still sooo small.

The Rdata is attached here.
result_scang.zip

Another question: I tried to run STAARpipelineSummary_Dynamic_Window_run.r, but it reported error "cannot open '/mnt/work/research/geyx/STAARpipeline/FAVOR/sscpv2_test_cx/QC/burden/result/sscpv2_23.Rdata'". It's strange because the sscpv2_23.Rdata should not exist.

Looking forward to your suggestion!! Thanks a lot!!

Shea

Error in "Step 3: Generate the annotated GDS (aGDS) file."

Hi Xihao, we're trying to run STAARpipeline but am running into an issue in "Step 3: Generate the annotated GDS (aGDS) file". Below is the command testing on chromosome 22 with paths changed and the error:

Rscript gds2agds.R 22

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 262458 14.1     641551 34.3   431252 23.1
Vcells 443021  3.4    8388608 64.0  1754506 13.4
[1] 1641932      22
Error in add.gdsn(ans, nm[i], val[[i]], compress = compress, closezip = closezip,  :
  The GDS node "apc_protein_function" exists.
Calls: add.gdsn -> add.gdsn
In addition: Warning messages:
1: One or more parsing issues, see `problems()` for details
2: In add.gdsn(ans, nm[i], val[[i]], compress = compress, closezip = closezip,  :
  Missing characters are converted to "".
3: In add.gdsn(ans, nm[i], val[[i]], compress = compress, closezip = closezip,  :
  Missing characters are converted to "".
Execution halted

If I run the same command again I actually get a different error, and it seems to stay like this (and the runtime also shortened to ~5 seconds from a couple minutes):

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 262458 14.1     641551 34.3   431252 23.1
Vcells 443021  3.4    8388608 64.0  1754506 13.4
[1] 1641932      22
Error in add.gdsn(Anno.folder, "FunctionalAnnotation", val = FunctionalAnnotation,  :
  The GDS node "FunctionalAnnotation" exists.
Execution halted

Would you know what the problem is? Thanks.

Daniel

No known loci for conditional analysis?

Hi there!
Thank you again for providing this very useful pipeline.
I have a question regarding the later steps of analysis (Summarization and visualization).
I am trying to analyse a WES dataset of around 5500 samples (~ 1200 case vs 4300 controls).
I was wondering how I would proceed with the analysis if I lack a set of known loci for my binary trait?
Is this an issue or is this part somewhat optional?
Furthermore, what would the optimal path / pipeline be for this kind of analysis, which has the sole focus of coding variants?
And do you perhaps have a recommendation as to how to proceed with a suitable pathway analysis after?

Sorry is this a lot, and thank you in advance!

Best wishes,
Daniel

Dynamic_Window_SCANG error(s)

Hello, I have been receiving two errors.

Error 1 occurs when "res" (a SCANG output) all equal zero, or not applicable, the pipeline breaks before moving to next sub_seq_num.
This is usually fixed by increasing the rare_maf_cutoff value.

Error 2 I am not sure what is going on:
Error in resmost_SCANG_O[1, 3] - resmost_SCANG_O[1, 2] : non-numeric argument to binary operator Calls: Dynamic_Window_SCANG -> matrix Execution halted

Appreciate any help!
Thanks
lj

Gene_Centric_Coding output ERROR messages

Hi Dr. Li
Sorry to bother you again. I met some ERROR messages when I ran Gene Centric Coding analysis. Here are the part of the log file.

     used (Mb) gc trigger (Mb) max used (Mb)

Ncells 269062 14.4 654557 35 444157 23.8
Vcells 448534 3.5 8388608 64 1771722 13.6
[1] 379
[1] 51
# of selected samples: 772
# of selected variants: 280
# of selected samples: 772
# of selected variants: 6
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 1
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
Number of rare variant in the set is less than 2!
# of selected samples: 772
# of selected variants: 5
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 3,197,953
[1] 52
# of selected samples: 772
# of selected variants: 24
# of selected samples: 772
# of selected variants: 6
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
\ # of selected variants: 3
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
Number of rare variant in the set is less than 2!
# of selected samples: 772
# of selected variants: 3
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 3,197,953
[1] 53
# of selected samples: 772
# of selected variants: 163
# of selected samples: 772
# of selected variants: 10
# of selected samples: 772
# of selected variants: 1
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
Number of rare variant in the set is less than 2!
# of selected samples: 772
# of selected variants: 1
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
Number of rare variant in the set is less than 2!
# of selected samples: 772
# of selected variants: 5
# of selected samples: 772
# of selected variants: 4
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 3,197,953
[1] 54
# of selected samples: 772
# of selected variants: 37
# of selected samples: 772
# of selected variants: 1
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 1
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
Number of rare variant in the set is less than 2!
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 3,197,953
[1] 55
# of selected samples: 772
# of selected variants: 76
# of selected samples: 772
# of selected variants: 11
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 4
# of selected samples: 772
# of selected variants: 7
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 3,197,953

I don't know if the Error messages will interrupt the next steps. Could you give some advice?
Looking forward to yout reply.
Damien

RESULT: gene centric coding

Dear Dr. Li
When I ran the gene_centric_coding analysis, I find that the Q-Q plot is strange. Some of first half lines are under the red line, instead of closing the red line.
Hear are the manhattan and Q-Q plot of gene_centric_coding analysis.
Is that a bad analysis result? If is bad, what I should do to adjust the analysis step.
Hope to receive your reply.

run this pipline in crop population

Dear Xihao Li,
I want to know whether we could run this pipline in crop population?

Step 3.2: Gene-centric noncoding analysis

Hi Xihao,

Sorry to keep bothering you. I'm trying to run the association analyses now, specifically the rare non-coding analyses. I had a couple questions/issues:

As an example I ran the following:
Rscript STAARpipeline_Gene_Centric_Noncoding.r 1

which the full output/error is:

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 262458 14.1     641551 34.3   431252 23.1
Vcells 443021  3.4    8388608 64.0  1754506 13.4
[1] 379
[1] 1
# of selected samples: 0
# of selected variants: 0
# of selected samples: 6,280
# of selected variants: 9,400,898
# of selected samples: 0
# of selected variants: 0
Error in data.frame(id.genotype, index = seq(1, length(id.genotype))) :
  arguments imply differing number of rows: 0, 2
Calls: Gene_Centric_Noncoding -> noncoding -> data.frame
Execution halted

Would you know what the issue is? If it's any help, 6,280 is the number of samples with genetic data, and 9,400,898 is the number of variants on chromosome 1.

I saw the script STAARpipeline_Gene_Centric_Noncoding.sh to run STAARpipeline on a cluster. However this seems to be for SLURM, and our cluster uses LSF. Is essentially what is being done in this script and in STAARpipeline_Gene_Centric_Noncoding.r is it runs 50 genes each for 379 jobs? So if I just run STAARpipeline_Gene_Centric_Noncoding.r in a loop from {1..379} this should do the same thing? And likewise for the rest of the analyses in 3.2 for however many "SLURM_ARRAY_TASK_ID"s there are. Thanks.

fitnullmodel error

Dear Dr. Li,
I am facing another problem when I ran fitnullmodel step using STAARpipeline_Null_Model_GENESIS.r, there was an error:

Error in .checkSampleId(cov.mat, x) :
all sample names in dimnames of cov.mat must be present in x$sample.id
Calls: fitNullModel -> fitNullModel -> .local -> .checkSampleId
Execution halted

the cov.mat points sGRM.RData, so I check the content of the sGRM.Rdata. I find the sGRM@Dimnames[[1]]
output looks strange:

"D2109086378_D2109086378" "D2109086379_D2109086379" "D2109086380_D2109086380"

It looks paste FID and IID using "_" as the rownames and colnames of the sparse GRM. Not sure if I understand correctly?
After I did an operation to delect the "_" and the string after using sGRM@Dimnames[[1]] <- gsub("_D[0-9]+", "", sGRM@Dimnames[[1]]), I ran fitnullmodel again, but It still occurred the same error.

for "x", I think it is the phenotype I am concerned. Is it ringht? Meanwhile, I am not sure the phenotype file should include what columns?
Here are the colnames in my phenotype file:

FID IID sex age first_transfusion PLT PLT_zscore pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10 pc11 pc12 pc13 pc14 pc15 pc16 pc17 pc18 pc19 pc20

the first two columns are FID and IID and they are indentical, Not sure it should remove one of them and rename the colname of the remain to sample.id?

How to get the variable genes_info in STAARpipeline_Gene_Centric_Coding.r

Hi, I am trying to learn this STAAR pipeline to deal with rare variant analysis. However, when I ran the Gene_Centric_Coding script, I found that I did not have the genes_info as the input file. So, how to get the genes_info file.

###########################################################
#           Main Function 
###########################################################
## gene number in job
gene_num_in_array <- 50 
group.num.allchr <- ceiling(table(genes_info[,2])/gene_num_in_array)
sum(group.num.allchr)

gnomAD AF & family-based vcf analysis

Hi Xihao,

Thanks for your useful tool! And I've met some problems really confused me:

I noticed that allele frequencies of gnomAD and 1000G are contained in FAVOR full database but not in FAVOR essential database. Are there any approaches that I could make use of these AF annotations in the STAAR procedure? (Like the variants with gnomAD AF < 5‰ will be given priority).
I'm now working on a rare disease and my cohort contains ~700 trio families (only the child has the disease, parents are healthy), so my data is a vcf file with ~2000 samples. Does STAAR pipeline support family-based analysis? If so, how can I represent this family relationship while analyzing? (maybe in the pheno.csv?)
Do I need to add more irrelevant healthy samples as control?

Thanks a lot!!!

Gene_Centric_Coding function fails due to character-formatted annotation data

Hello,

I've been working my way through each step in the pipeline and have encountered an error when running STAARpipeline_Gene_Centric_Coding.r

The error occurs in the internal coding function within the Gene_Centric_Coding function. When the current annotation name is "aPC.LocalDiversity", the script attempts to perform a transformation of the data:

if (Annotation_name[k] == "aPC.LocalDiversity") {
                    Annotation.PHRED.2 <- -10 * log10(1 - 10^(-Annotation.PHRED/10))
                    Annotation.PHRED <- cbind(Annotation.PHRED,
                      Annotation.PHRED.2)
                    Anno.Int.PHRED.sub.name <- c(Anno.Int.PHRED.sub.name,
                      paste0(Annotation_name[k], "(-)"))
                  }

However, the data in Annotation.PHRED is of the character class, causing this error:

Error in -Annotation.PHRED : invalid argument to unary operator
Calls: Gene_Centric_Coding -> coding

aPC.LocalDiversity is not the only character annotation. Of those picked from the name catalog, the only other character values are 3 other annotation PCs:

/cadd_phred  # numeric
/linsight  # numeric
/fathmm_xf  # numeric
/apc_epigenetics_active  # character
/apc_epigenetics_repressed  # character
/apc_epigenetics_transcription  # numeric
/apc_conservation  # numeric
/apc_local_nucleotide_diversity  # character (causing this error)
/apc_mappability  # character
/apc_transcription_factor  # numeric
/apc_protein_function  # numeric

Would anyone know why these values would be written to (or read from) the GDS file as characters?

Questions/issues in "Step 1: Fit STAAR null model"

Hi Xihao,

Thanks for helping me out earlier. I had a couple questions/issues when running "Step 1: Fit STAAR null model":

If I try running STAARpipeline_Null_Model.r, I get an error:

Error in glmmkin(fixed = fixed, data = data, kins = kins, id = id, random.slope = random.slope,  :
  Error: "id" must be one of the variables in the names of "data".
Calls: fit_nullmodel -> glmmkin
Execution halted

The error seems to be in the line:

obj_nullmodel <- fit_nullmodel(BMI_IRNT~Age+AgeSq+Subject_Information.Sex..str+PC1+PC2,data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="IID",family=gaussian(link="identity"),verbose=TRUE)

for the field 'id="IID"'. However "IID" (no quotes) is the column name that the sample IDs are in. I tried running the script without quotes for "IID" and it doesn't work either. Would you know what the issue is?

I then tried using STAARpipeline_Null_Model_GENESIS.r and it gave an error:

Error in .checkSampleId(cov.mat, x) :
  all sample names in dimnames of cov.mat must be present in x$sample.id
Calls: fitNullModel -> fitNullModel -> .local -> .checkSampleId
Execution halted

I saw in another issue this was caused by the IDs being converted to something like "ID_ID" -- I tried the fix in the other issue but still had problem. However, I do have samples in the sparse GRM that are not in the phenotype file -- we will probably run ~20 phenotypes which will have different numbers of individuals with available phenotypes. I suppose it would be preferable to just make one sparse GRM for all phenotypes, but it may not be too much more effort to make a new sparse GRM for each phenotype. Do you have any recommendation here? Thanks.

Daniel

xihaoli / staarpipeline-tutorial Goto Github PK

staarpipeline-tutorial's Issues

of selected samples: 3,508

of selected variants: 2

of selected samples: 3,508

of selected variants: 2

of selected samples: 3,508

of selected variants: 4,303,007

of selected samples: 3,508

of selected variants: 1

of selected samples: 3,508

of selected variants: 4,303,007

of selected samples: 3,508

of selected variants: 2

of selected samples: 0

of selected variants: 207

of selected samples: 0

of selected variants: 29

of selected samples: 0

of selected variants: 0

of selected samples: 1,372

of selected variants: 0

of selected samples: 1,373

of selected variants: 2,000

of selected samples: 1,372

of selected variants: 0

Recommend Projects

Recommend Topics

Recommend Org