xihaoli / staarpipeline-tutorial Goto Github PK

The tutorial for performing association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using FAVORannotator, STAARpipeline and STAARpipelineSummary

License: GNU General Public License v3.0

R 93.18% Shell 6.82%

statistical-genetics whole-exome-sequencing whole-genome-sequencing functional-annotation rare-variant-analysis staar-pipeline

staarpipeline-tutorial's Introduction

STAARpipeline-Tutorial

This is a tutorial for (1) automatically functionally annotating the variants of whole-genome/whole-exome sequencing (WGS/WES) studies and integrating the functional annotations with the genotype data using FAVORannotator and (2) performing association analysis of WGS/WES studies, summarizing and visualization results using STAARpipeline and STAARpipelineSummary. The software prerequisites, dependencies and installation can be found in STAARpipeline and STAARpipelineSummary packages.

FAVORannotator, STAARpipeline and STAARpipelineSummary are implemented as a collection of apps. Please see the apps favorannotator, staarpipeline, staarpipelinesummary_varset and staarpipelinesummary_indvar that run on the UK Biobank Research Analysis Platform for more details.

Pre-step of association analysis using STAARpipeline

Generate Genomic Data Structure (GDS) file

R/Bioconductor package SeqArray provides functions to convert the genotype data (in VCF/BCF/PLINK BED/SNPRelate format) to SeqArray GDS format. For more details on usage, please see the R/Bioconductor package SeqArray [manual]. A wrapper for the seqVCF2GDS/seqBCF2GDS function in the SeqArray package can be found here (Credit: Michael R. Brown and Jennifer A. Brody).

R package gds2bgen provides functions to convert the genotype data (in BGEN format) to SeqArray GDS format. For more details on usage, please see the R package gds2bgen. An example for the seqBGEN2GDS function in the gds2bgen package can be found here (Credit: Xiuwen Zheng).

Note 1: As a file integrity check, it is expected that variant in the GDS file can be uniquely identified based on its CHR-POS-REF-ALT combination. That is, there shouldn't be two variants in the GDS file with identical CHR-POS-REF-ALT records. It is also expected that the physical positions of variants in the GDS file (of each chromosome) should be sorted in ascending order.

Note 2: After the GDS file is generated, there is supposed to be a channel in the GDS file (default is annotation/filter) where all variants passing the quality control (QC) should be labeled as "PASS". If there is no such channel for a given post-QC GDS file (where all variants in the GDS file are pass variants), one can create a new channel in the GDS file by setting the value of all variants as "PASS". An example script can be found here. Then, in all scripts of STAARpipeline, QC_label <- "annotation/filter" should be updated to QC_label <- "annotation/info/QC_label".

Generate annotated GDS (aGDS) file using FAVORannotator

Prerequisites:

FAVORannotator (CSV version 1.0.0) depends on the xsv software and the FAVOR database in CSV format. Please install the xsv software and download the FAVOR essential database CSV files from FAVOR website (under the "FAVORannotator" tab's top panel, 31.2 GB for chr1 CSV) or Harvard Dataverse before using FAVORannotator (CSV version 1.0.0).

Step 0: Install xsv

The following steps are for the widely used operating system (Ubuntu) on a virtual machine.

Install Rust and Cargo:

$ curl https://sh.rustup.rs -sSf | sh

Source the environment:

$ source $HOME/.cargo/env

Install xsv using Cargo:

$ cargo install xsv

Step 1: Generate the variants list to be annotated

Script: Varinfo_gds.R

Input: GDS files of each chromosome and the FAVOR database information FAVORdatabase_chrsplit.csv. For more details, please see the R script.

Output: CSV files of the variants list. For each chromosome, the number of CSV files is listed in FAVORdatabase_chrsplit.csv.

Note: The physical positions of variants in the GDS file (of each chromosome) should be sorted in ascending order.

Step 2: Annotate the variants using the FAVOR database through xsv software

Script: Annotate.R

Input: CSV files of the variants list to be annotated, the FAVOR database information FAVORdatabase_chrsplit.csv,

the FAVOR database, and the directory xsv software. For more details, please see the R script.

Output: CSV files of the annotated variants list.

Anno_chrXX.csv: a CSV file containing annotated variants list of chromosome XX.
Anno_chrXX_STAARpipeline.csv: a CSV file containing the variants list with annotations required for STAARpipeline of chromosome XX. The annotations in this file is a subset of Anno_chrXX.csv.

Step 3: Generate the annotated GDS (aGDS) file

Script: gds2agds.R

Input: GDS files and the CSV files of annotated variants list (`Anno_chrXX.csv` or `Anno_chrXX_STAARpipeline.csv`). For more details, please see the R script.

Output: aGDS files including both the genotype and annotation information.

Note: FAVORannotator also supports the database in SQL format. Please see the FAVORannotator tutorial for detailed usage of FAVORannotator (SQL version).

Generate sparse Genetic Relatedness Matrix (GRM)

R package FastSparseGRM provides functions and a pipeline to efficiently calculate genetic principal components (PCs) and the ancestry-adjusted sparse genetic relatedness matrix (GRM). It accounts for population heterogeneity using genetic PCs which are automatically calculated as part of the pipeline. The genetic PCs can be used as fixed effect covariates to account for the population stratification and the sparse GRM can be used to model the random effects to account for the sample relatedness in a mixed effects phenotype-genotype association testing model implemented in STAARpipeline. For more details on usage, please see the R package FastSparseGRM.

Association analysis using STAARpipeline

Step 0: Preparation for association analysis of whole-genome/whole-exome sequencing studies

Script: Association_Analysis_PreStep.r

Input: aGDS files of all 22 chromosomes. For more details, please see the R script.

Output: `agds_dir.Rdata`, `Annotation_name_catalog.Rdata`, `jobs_num.Rdata`.

agds_dir.Rdata: a vector containing directory of GDS/aGDS files of all chromosomes.
Annotation_name_catalog.Rdata: a data frame containing the annotation name and the corresponding channel name in the aGDS file. Alternatively, one can skip this part in the R script by providing Annotation_name_catalog.csv with the same information. An example of Annotation_name_catalog.csv can be found here.
jobs_num.Rdata: a data frame containing the number of jobs for association analysis, including individual analysis, sliding window analysis and dynamic window analysis (SCANG-STAAR).

Step 1: Fit STAAR null model

Script: STAARpipeline_Null_Model.r or STAARpipeline_Null_Model_GENESIS.r

STAARpipeline_Null_Model.r fits the STAAR null model using the STAARpipeline package.
STAARpipeline_Null_Model_GENESIS.r fits the null model using the GENESIS package and convert it to the STAAR null model using the STAARpipeline package.

Input: Phenotype data and (sparse) genetic relatedness matrix. For more details, please see the R scripts.

Output: a Rdata file of the STAAR null model.

Step 2: Individual (single-variant) analysis

Script: STAARpipeline_Individual_Analysis.r

Perform single-variant analysis for common and low-frequency variants across the genome using the STAARpipeline package.

Input: aGDS files and the STAAR null model. For more details, please see the R script.

Output: Rdata files with the user-defined names.

The number of output files is the summation of the column "individual_analysis_num" for the object in jobs_num.Rdata.

Step 3.1: Gene-centric coding analysis

Script: STAARpipeline_Gene_Centric_Coding.r and STAARpipeline_Gene_Centric_Coding_Long_Masks.r

Perform gene-centric analysis for coding rare variants using the STAARpipeline package. The gene-centric coding analysis provides five functional categories to aggregate coding rare variants of each protein-coding gene: (1) putative loss of function (stop gain, stop loss, and splice) RVs, (2) missense RVs, (3) disruptive missense RVs, (4) putative loss of function and disruptive missense RVs, and (5) synonymous RVs.

STAARpipeline_Gene_Centric_Coding.r performs gene-centric coding analysis for all protein-coding genes across the genome. There are 379 jobs using this script.
STAARpipeline_Gene_Centric_Coding_Long_Masks.r performs gene-centric coding analysis for some specific long masks, and might require larger memory compared to STAARpipeline_Gene_Centric_Coding.r. There are 2 jobs using this script.

Input: aGDS files and the STAAR null model. For more details, please see the R scripts.

Output: 381 Rdata files with the user-defined names.

Step 3.2: Gene-centric noncoding analysis

Script: STAARpipeline_Gene_Centric_Noncoding.r, STAARpipeline_Gene_Centric_Noncoding_Long_Masks.r, STAARpipeline_Gene_Centric_ncRNA.r and STAARpipeline_Gene_Centric_ncRNA_Long_Masks.r

Perform gene-centric analysis for noncoding rare variants using the STAARpipeline package. The gene-centric noncoding analysis provides eight functional categories of regulatory regions to aggregate noncoding rare variants: (1) promoter RVs overlaid with CAGE sites, (2) promoter RVs overlaid with DHS sites, (3) enhancer RVs overlaid with CAGE sites, (4) enhancer RVs overlaid with DHS sites, (5) untranslated region (UTR) RVs, (6) upstream region RVs, (7) downstream region RVs, and (8) noncoding RNA (ncRNA) RVs.

STAARpipeline_Gene_Centric_Noncoding.r performs gene-centric noncoding analysis for all protein-coding genes across the genome. There are 379 jobs using this script.
STAARpipeline_Gene_Centric_Noncoding_Long_Masks.r performs gene-centric noncoding analysis for some specific long masks, and might require larger memory compared to STAARpipeline_Gene_Centric_Noncoding.r. There are 8 jobs using this script.
STAARpipeline_Gene_Centric_ncRNA.r performs gene-centric noncoding analysis for ncRNA genes across the genome. There are 222 jobs using this script.
STAARpipeline_Gene_Centric_ncRNA_Long_Masks.r performs gene-centric noncoding analysis for some specific long masks, and might require larger memory compared to STAARpipeline_Gene_Centric_ncRNA.r. There is 1 job using this script.

Input: aGDS files and the STAAR null model. For more details, please see the R scripts.

Output: 387 Rdata files with the user-defined names for protein-coding genes and 223 Rdata files with the user-defined names for ncRNA genes.

Step 4: Sliding window analysis

Script: STAARpipeline_Sliding_Window.r

Perform sliding window analysis using the STAARpipeline package.

Input: aGDS files and the STAAR null model. For more details, please see the R script.

Output: Rdata files with the user-defined names.

The number of output files is the summation of the column "sliding_window_num" for the object in jobs_num.Rdata.

Step 5.0: Obtain SCANG-STAAR null model

Script: STAARpipeline_STAAR2SCANG.r

Generate the SCANG-STAAR null model using the STAAR null model.

Input: STAAR null model. For more details, please see the R script.

Output: a Rdata file of the SCANG-STAAR null model.

Step 5: Dynamic window analysis using SCANG-STAAR

Script: STAARpipeline_Dynamic_Window.r

Perform dynamic window analysis using the STAARpipeline package.

Input: SCANG-STAAR null model. For more details, please see the R script.

Output: Rdata files with the user-defined names.

The number of output files is the summation of the column "scang_num" for the object in jobs_num.Rdata.

Summarization and visualization of association analysis results using STAARpipelineSummary

Step 0 (Optional): Select independent variants from a known variants list to be used in conditional analysis

Script: STAARpipelineSummary_Known_Loci_Pruning.r

Perform LD pruning (stepwise selection) to select the subset of independent variants from a known variants list to be used in conditional analysis.

Input: aGDS files, a list of known variants (4-column "CHR-POS-REF-ALT" format), and the STAAR null model.

STAARpipelineSummary_Known_Loci_Info.r extracts the information of CHR, POS, REF, and ALT from #rs. For more details, please see the R script.

Output: a Rdata file containing a list of independent variants to be used in conditional analysis.

STAARpipelineSummary_Known_Loci_Pruning_Combination.r combines chromosome-wide results into genome-wide.

Step 1: Summarize individual (single-variant) analysis results

Script: STAARpipelineSummary_Individual_Analysis.r

Summarize single-variant analysis results and perform conditional analysis of unconditionally significant variants by adjusting a list of known variants.

Input: aGDS files, individual analysis results generated by STAARpipeline, the STAAR null model, and a list of known variants. For more details, please see the R script.

Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant variants.

Note: STAARpipelineSummary_Known_Loci_Individual_Analysis_Pruning.r and STAARpipelineSummary_Known_Loci_Individual_Analysis_Pruning_Combination.r show an example to select independent variants from both the known variants in literature and significant single variants detected in individual analysis, which can be used for variant-set conditional analysis.

Step 2.1: Summarize gene-centric coding analysis results

Script: STAARpipelineSummary_Gene_Centric_Coding.r

Summarize gene-centric coding analysis results and perform conditional analysis of unconditionally significant coding masks by adjusting a list of known variants.

Input: aGDS files, gene-centric coding analysis results generated by STAARpipeline, the STAAR null model, and a list of known variants. For more details, please see the R script.

Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant coding masks.

Step 2.2: Summarize gene-centric noncoding analysis results

Script: STAARpipelineSummary_Gene_Centric_Noncoding.r

Summarize gene-centric noncoding analysis results and perform conditional analysis of unconditionally significant noncoding masks by adjusting a list of known variants.

Input: aGDS files, gene-centric noncoding analysis results generated by STAARpipeline, the STAAR null model, and a list of known variants. For more details, please see the R script.

Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant noncoding masks.

Step 3: Summarize sliding window analysis results

Script: STAARpipelineSummary_Sliding_Window.r

Summarize sliding window analysis results and perform conditional analysis of unconditionally significant genetic regions by adjusting a list of known variants.

Input: aGDS files, sliding window analysis results generated by STAARpipeline, the STAAR null model, and a list of known variants. For details, see the R scripts.

Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant sliding windows.

Step 4: Summarize dynamic window analysis results

Script: STAARpipelineSummary_Dynamic_Window.r

Summarize dynamic window analysis results and perform conditional analysis of unconditionally significant genetic regions by adjusting a list of known variants.

Input: aGDS files, dynamic window analysis results generated by STAARpipeline, the STAAR null model, and a list of known variants. For more details, please see the R script.

Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant dynamic windows.

Step 5.1: Functionally annotate a list of variants

Script: STAARpipelineSummary_Individual_Analysis_Annotation.r

Functionally annotate a list of variants.

Input: aGDS files and a list of variants.

The list of variants could be the individual analysis results generated by STAARpipelineSummary.

Output: a Rdata file containing the input variants together with the corresponding functional annotations.

Step 5.2: Functionally annotate rare variants in coding masks

Script: STAARpipelineSummary_Gene_Centric_Coding_Annotation.r

Functionally annotate rare variants of each of the input coding masks.

Input: aGDS files and coding masks (chr, gene name, and functional category).

Output: For each input coding mask, the script outputs a Rdata file containing the rare variants and the corresponding functional annotations.

Step 5.3: Functionally annotate rare variants in noncoding masks

Script: STAARpipelineSummary_Gene_Centric_Noncoding_Annotation.r

Functionally annotate rare variants of each of the input noncoding masks.

Input: aGDS files and noncoding masks (chr, gene name, and functional category).

Output: For each input noncoding mask, the script outputs a Rdata file containing the rare variants and the corresponding functional annotations.

Step 5.4: Functionally annotate rare variants in genetic regions

Script: STAARpipelineSummary_Genetic_Region_Annotation.r

Functionally annotate rare variants of each of the input genetic regions.

Input: aGDS files and noncoding masks (chr, start position, and end position).

Output: For each input genetic region, the script outputs a Rdata file containing the rare variants and the corresponding functional annotations.

An example of batch job submission scripts for these analyses can be found here.

staarpipeline-tutorial's People

Contributors

Stargazers

Watchers

Forkers

shicheng-guo zhouhufeng byoo nhasbani gaochengprc damientan das2000sidd jingxuanbao kevinluolk nsafarian zjmanmu wangyujian123 ztsin kwdoyle milad-bastami socrates1100 yuxinyuanqt

staarpipeline-tutorial's Issues

Question about selecting variants to calculate sparse GRM

Dear authors,

I'm trying to use your pipeline to our association study, but got questions about the way to build GRM.
Is it correct to use the common (MAF > 5%) variants to build GRM, so that we can capture population structure among individuals?
Or should I use the rare variants (those to be used in the main analysis), so we capture the similarity between individuals as will be observed in the main analysis?

Thank you,
In-Hee

"CSV error" when running Annotate.R

Hi,

Recently I was trying to use our in-house data in STAARpipeline for WGS burden test. When I ran the Annotate.R, the log showed errors like this:

[1] 1
CSV error: record 4 (line: 5, byte: 46): found record with 2 fields, but the previous record has 1 fields
[1] 2
CSV error: record 29 (line: 30, byte: 439): found record with 2 fields, but the previous record has 1 fields
[1] 3
CSV error: record 1 (line: 2, byte: 8): found record with 2 fields, but the previous record has 1 fields
...

However, it did not stop executing the job, and generated the output "Anno_chr1_STAARpipeline.csv".

I wondered how these errors happen (is it because of the VCF format?) and whether these errors would affect the final burden result.

Attached is the log file.
slurm-59206.log

Looking forward to your reply! Thanks a lot!

Shea

Is there any example file for test?

Hi,

I am wondering is there any example file that I could use for test?

Thanks a lot!

Shea

vcf to gds

Hello, @xihaoli.
In the script of "Varinfo_gds.R" and "gds2agds.R", it has two gds name.
gds_file_name_1 <- "freeze.5.chr"
gds_file_name_2 <- ".pass_and_fail.gtonly.minDP0.gds"

I wonder what the two documents represent and how they are produced? Thank you.

Phenofile in STAARpipeline_Null_Model.r

Hi,

Thanks for the amazing tool!
When constructing the null model using STAARpipeline_Null_Model.r, should the sample in 'pheno.csv' be the same as input vcf (which is used in the very beginning and converted to GDS files)? Or the sample in 'pheno.csv' could be a subset of that in input vcf?

Now I only want to analyze the samples in 'pheno.csv'. When I used STAARpipeline_STAAR2SCANG.r, it reported Error in y %*% diag(eigen_diff) : non-integral arguments. So I am wondering if the inconsistency of pheno.csv and input vcf is the reason for the error?

Thanks!
Shea

About rank-based inverse normal transformation

Dear Xihao,
I have some questions about when I perform rank-based inverse normal transformation for my phenotype before fitting the null model.
The phenotype is directly performed to the inverse normal transformation? like using the R package RNOmni. OR I need to perform two steps. The first step is that the null model is fit using the original outcome values. And the second step is using residuals to rank-based inverse Normal transform, and then fit the model again, but it seems not to be rescaled.
So, which one is more recommended?

Best,
Qilong

Error in Gene_Centric_Coding analysis

Dear Xihao,

While I performed Gene_Centric_Coding analysis, the error below showed up and there were no outputs for some arrayid.

Error in seqGetData(genofile, paste0(Annotation_dir, Annotation_name_catalog$dir[which(Annotation_name_catalog$name ==  : 
  The GDS node "annotation/info/FunctionalAnnotation/genecode_comprehensive_exonic_category" does not exist.

There was no output from every arrayid but I thought it might run the code with the existing output files. I performed the STAARpipelineSummary Gene Centric Coding analysis and got this error:

Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
Calls: Gene_Centric_Coding_Results_Summary -> get -> load -> readChar
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
  cannot open compressed file '..../Gene_Centric_Coding_1.Rdata', probable reason 'No such file or direc$
Execution halted

I have the same problem with Gene_Centric_Noncoding, Sliding Window, and Dynamic Window analysis. However, there was no problem with the individual analysis.

I hope you can help me :)

Thank you!
Pelin

VCF2GDS

hello，may I ask whether the vcf file needs to be divided by chromosome before vcf2GDS processing?

Error in seqGetData: Invalid dimension 'Start' and 'Length'

Hi Xihao,

Thanks for designing such a useful tool. I'm trying to run STAARpipeline_Gene_Centric_Noncoding.r in _Step 3.2: Gene-centric noncoding analysis_in my way, and the error below showed up:

i=1
Rscript STAARpipeline_Gene_Centric_Noncoding.r ${i}
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 273677 14.7     660845 35.3   451833 24.2
Vcells 459327  3.6    8388608 64.0  1800859 13.8
[1] 379
[1] 401
Error in seqGetData(genofile, paste0(Annotation_dir, Annotation_name_catalog$dir[which(Annotation_name_catalog$name ==  : 
  Invalid dimension 'Start' and 'Length'.
Calls: Gene_Centric_Noncoding -> noncoding -> seqGetData
Execution halted

And I run the code in noncoding.R line by line and it seems to report an error when I run

GENCODE.Category <- seqGetData(genofile, paste0(Annotation_dir,Annotation_name_catalog$dir[which(Annotation_name_catalog$name=="GENCODE.Category")])).

In this code, paste0(Annotation_dir,Annotation_name_catalog$dir[which(Annotation_name_catalog$name=="GENCODE.Category")]) refers to "annotation/info/FunctionalAnnotation/genecode_comprehensive_category". And I double check the information in my genofile, and "annotation/info/FunctionalAnnotation/genecode_comprehensive_category" is indeed in genofile.

r$> genofile
Object of class "SeqVarGDSClass"
File: ADNI.808_indiv.minGQ_21.pass.ADNI_ID.chr1.gds (108.7M)
+    [  ] *
|--+ description   [  ] *
|--+ sample.id   { Str8 808 LZMA_ra(29.9%), 2.6K } *
|--+ variant.id   { Int32 2949744 LZMA_ra(3.99%), 459.6K } *
|--+ position   { Int32 2949744 LZMA_ra(34.4%), 3.9M } *
|--+ chromosome   { Str8 2949744 LZMA_ra(0.02%), 1009B } *
|--+ allele   { Str8 2949744 LZMA_ra(11.3%), 1.3M } *
|--+ genotype   [  ] *
|  |--+ data   { Bit2 2x808x2949744 LZMA_ra(3.23%), 36.8M } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
|  \--+ extra   { Int16 0 LZMA_ra, 18B }
|--+ phase   [  ]
|  |--+ data   { Bit1 808x2949744 LZMA_ra(0.01%), 42.5K } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
|  \--+ extra   { Bit1 0 LZMA_ra, 18B }
|--+ annotation   [  ]
|  |--+ id   { Str8 2949744 LZMA_ra(33.6%), 6.1M } *
|  |--+ qual   { Float32 2949744 LZMA_ra(60.5%), 6.8M } *
|  |--+ filter   { Int32,factor 2949744 LZMA_ra(0.02%), 1.8K } *
|  |--+ info   [  ]
|  |  |--+ AC   { Int32 0 LZMA_ra, 18B } *
|  |  |--+ AF   { Float32 0 LZMA_ra, 18B } *
|  |  |--+ AN   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ BaseQRankSum   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ CCC   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ ClippingRankSum   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ DB   { Bit1 2949744 LZMA_ra(0.05%), 209B } *
|  |  |--+ DP   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ DS   { Bit1 2949744 LZMA_ra(0.05%), 209B } *
|  |  |--+ END   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ FS   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ GQ_MEAN   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ GQ_STDDEV   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ HWP   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ HaplotypeScore   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ InbreedingCoeff   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ MLEAC   { Int32 0 LZMA_ra, 18B } *
|  |  |--+ MLEAF   { Float32 0 LZMA_ra, 18B } *
|  |  |--+ MQ   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ MQ0   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ MQRankSum   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ NCC   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ NEGATIVE_TRAIN_SITE   { Bit1 2949744 LZMA_ra(0.05%), 209B } *
|  |  |--+ POSITIVE_TRAIN_SITE   { Bit1 2949744 LZMA_ra(0.05%), 209B } *
|  |  |--+ QD   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ ReadPosRankSum   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ VQSLOD   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ culprit   { Str8 2949744 LZMA_ra(0.02%), 581B } *
|  |  \--+ FunctionalAnnotation   [ spec_tbl_df,tbl_df,tbl,data.frame,list ] *
|  |     |--+ VarInfo   { Str8 2946875 LZMA_ra(14.2%), 6.2M }
|  |     |--+ apc_conservation   { Float64 2946875 LZMA_ra(24.6%), 5.5M }
|  |     |--+ apc_epigenetics   { Float64 2946875 LZMA_ra(24.8%), 5.6M }
|  |     |--+ apc_epigenetics_active   { Float64 2946875 LZMA_ra(24.0%), 5.4M }
|  |     |--+ apc_epigenetics_repressed   { Float64 2946875 LZMA_ra(19.1%), 4.3M }
|  |     |--+ apc_epigenetics_transcription   { Float64 2946875 LZMA_ra(17.4%), 3.9M }
|  |     |--+ apc_local_nucleotide_diversity   { Float64 2946875 LZMA_ra(24.5%), 5.5M }
|  |     |--+ apc_mappability   { Float64 2946875 LZMA_ra(10.3%), 2.3M }
|  |     |--+ apc_protein_function   { Float64 2946875 LZMA_ra(2.23%), 514.1K }
|  |     |--+ apc_transcription_factor   { Float64 2946875 LZMA_ra(4.59%), 1.0M }
|  |     |--+ cage_tc   { Str8 2946875 LZMA_ra(3.13%), 100.6K }
|  |     |--+ metasvm_pred   { Str8 2946875 LZMA_ra(0.60%), 17.2K }
|  |     |--+ rsid   { Str8 2946875 LZMA_ra(17.3%), 764.3K }
|  |     |--+ fathmm_xf   { Float64 2946875 LZMA_ra(18.9%), 4.3M }
|  |     |--+ genecode_comprehensive_category   { Str8 2946875 LZMA_ra(5.23%), 503.0K }
|  |     |--+ genecode_comprehensive_info   { Str8 2946875 LZMA_ra(11.1%), 1.9M }
|  |     |--+ genecode_comprehensive_exonic_category   { Str8 2946875 LZMA_ra(0.86%), 26.3K }
|  |     |--+ genecode_comprehensive_exonic_info   { Str8 2946875 LZMA_ra(5.72%), 249.5K }
|  |     |--+ genehancer   { Str8 2946875 LZMA_ra(2.04%), 649.1K }
|  |     |--+ linsight   { Float64 2946875 LZMA_ra(10.7%), 2.4M }
|  |     |--+ cadd_phred   { Float64 2946875 LZMA_ra(8.62%), 1.9M }
|  |     \--+ rdhs   { Str8 2946875 LZMA_ra(7.30%), 346.1K }
|  \--+ format   [  ]
\--+ sample.annotation   [  ]

As I can run STAARpipeline_Gene_Centric_Coding.r without any error and I find there is a similar code in coding.R(line 64), which filter variants and samples before seqGetData(). And the code GENCODE.Category <- seqGetData(genofile, paste0(Annotation_dir,Annotation_name_catalog$dir[which(Annotation_name_catalog$name=="GENCODE.Category")]))
can run successfully after restrict variants. There were 2,949,744 variants before filter in my genofile and I guess is it possible there was too much variants in my genofile which cause this error? Should I filter some variants (such as MAF or other quality control indicators) before run this code?

Thank you very much for your time and help,
Bangsheng

error: run FastSparseGRM

Dear Dr. Li
I encountered a problem when I run one of the steps in R package FastSparseGRM to calculate the GRM. There were no output files after running the script runPCA_wrapper.R. And I checked the log file runPCA.Rout, here is the part of the log file:

file.unrels <- opt$file.unrels
prefix.in <- opt$prefix.in
no_pcs <- opt$no_pcs
no_cores <- opt$no_cores
num_threads <- opt$num_threads
prefix.out <- opt$prefix.out
no_iter <- opt$no_iter

library('FastSparseGRM')
runPCA(prefix.in,file.unrels,no_pcs,num_threads,prefix.out,no_iter)
[1] "PCA started at 2022-12-04 17:53:17"
[1] "4 relateds, 768 unrelateds, and 24928532 variants"
Reading the file: pruned.bed
Reading BED file complete
Number of SNPs read: 24928532
Number of samples: 772
Using 40 threads
Processing samples complete
Allele frequencies calculated
[1] "Reading Completed in 16 seconds"
[1] "1 iterations completed"
[1] "2 iterations completed"
[1] "3 iterations completed"
[1] "4 iterations completed"
[1] "5 iterations completed"
[1] "6 iterations completed"
[1] "7 iterations completed"
[1] "8 iterations completed"
[1] "9 iterations completed"
[1] "10 iterations completed"
[1] "Iterations Completed in 683 seconds"
Error in svd(H, nv = 0) : infinite or missing values in 'x'
Calls: runPCA -> drpca -> svd
Execution halted

Has this happened before? I have no idea to solve this problem, please give me some advice.
Hope to your reply :)

Gene_Centric_Coding function fails due to character-formatted annotation data

Hello,

I've been working my way through each step in the pipeline and have encountered an error when running STAARpipeline_Gene_Centric_Coding.r

The error occurs in the internal coding function within the Gene_Centric_Coding function. When the current annotation name is "aPC.LocalDiversity", the script attempts to perform a transformation of the data:

if (Annotation_name[k] == "aPC.LocalDiversity") {
                    Annotation.PHRED.2 <- -10 * log10(1 - 10^(-Annotation.PHRED/10))
                    Annotation.PHRED <- cbind(Annotation.PHRED,
                      Annotation.PHRED.2)
                    Anno.Int.PHRED.sub.name <- c(Anno.Int.PHRED.sub.name,
                      paste0(Annotation_name[k], "(-)"))
                  }

However, the data in Annotation.PHRED is of the character class, causing this error:

Error in -Annotation.PHRED : invalid argument to unary operator
Calls: Gene_Centric_Coding -> coding

aPC.LocalDiversity is not the only character annotation. Of those picked from the name catalog, the only other character values are 3 other annotation PCs:

/cadd_phred  # numeric
/linsight  # numeric
/fathmm_xf  # numeric
/apc_epigenetics_active  # character
/apc_epigenetics_repressed  # character
/apc_epigenetics_transcription  # numeric
/apc_conservation  # numeric
/apc_local_nucleotide_diversity  # character (causing this error)
/apc_mappability  # character
/apc_transcription_factor  # numeric
/apc_protein_function  # numeric

Would anyone know why these values would be written to (or read from) the GDS file as characters?

Generate GDS files

Hi Xihao,

We are trying to run STAARpipeline for rare non-coding variants. I had a question for "Step 1: Generate the variants list to be annotated". Is the user meant to generate GDS files on their input data using SeqArray (before using Varinfo_gds.R), or are these files already provided? If so, do you have example code on how to do generate them? Is something like the example here on pages 3-5 intended? I am not familiar with GDS files or the SeqArray package and I wanted to check first. Thanks.

Daniel

LD_pruning error

Hello Xihao,
I tried to run both Summary_Known_Loci_Pruning.r and Summary_Known_Loci_Individual_Analysis.r,but I came across an issue. LD pruning halted on some chromosomes. Below is the output log, along with the error message. Do you know what caused the following error message, and how it will affect my final result, and how I can fix this problem


         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 283149 15.2     664202 35.5   450931 24.1
Vcells 486990  3.8    8388608 64.0  1815700 13.9
[1] 1
# of selected samples: 378,834
# of selected variants: 3
# of selected samples: 378,834
# of selected variants: 3
# of selected samples: 380,943
# of selected variants: 2,093,696
# of selected samples: 378,834
# of selected variants: 0

Error in apply(Geno_adjusted, 2, mean) : 
  dim(X) must have a positive length
Calls: LD_pruning -> Individual_Analysis_cond -> apply
Execution halted
Error in apply(Geno_adjusted, 2, mean) : 
  dim(X) must have a positive length
Calls: LD_pruning -> Individual_Analysis_cond -> apply
Execution halted
Error in apply(Geno_adjusted, 2, mean) : 
  dim(X) must have a positive length
Calls: LD_pruning -> Individual_Analysis_cond -> apply
Execution halted
Error in apply(Geno_adjusted, 2, mean) : 
  dim(X) must have a positive length
Calls: LD_pruning -> Individual_Analysis_cond -> apply
Execution halted
Error in apply(Geno_adjusted, 2, mean) : 
  dim(X) must have a positive length
Calls: LD_pruning -> Individual_Analysis_cond -> apply
Execution halted

"Coding mask" AND "Noncoding mask"

Dear Dr. Li
I am confused with the "Coding mask" AND "Noncoding mask" in R script STAARpipelineSummary_Gene_Centric_Coding_Annotation.r and STAARpipelineSummary_Gene_Centric_Noncoding_Annotation.r

## Chr
chr_seq <- c(1,7,19,19,9)
## Gene name
gene_name_seq <- c("PCSK9","NPC1L1","LDLR","APOE","RNF20")
## Coding mask
category_seq <- c("plof","missense","missense","missense","synonymous")

Why did you choose these five genes? If I want to run these annotation steps, should I choose other genes and how to choose these genes?
Looking forward to your reply.

RESULT: gene centric coding

Dear Dr. Li
When I ran the gene_centric_coding analysis, I find that the Q-Q plot is strange. Some of first half lines are under the red line, instead of closing the red line.
Hear are the manhattan and Q-Q plot of gene_centric_coding analysis.
Is that a bad analysis result? If is bad, what I should do to adjust the analysis step.
Hope to receive your reply.

Error in "Step 3: Generate the annotated GDS (aGDS) file."

Hi Xihao, we're trying to run STAARpipeline but am running into an issue in "Step 3: Generate the annotated GDS (aGDS) file". Below is the command testing on chromosome 22 with paths changed and the error:

Rscript gds2agds.R 22

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 262458 14.1     641551 34.3   431252 23.1
Vcells 443021  3.4    8388608 64.0  1754506 13.4
[1] 1641932      22
Error in add.gdsn(ans, nm[i], val[[i]], compress = compress, closezip = closezip,  :
  The GDS node "apc_protein_function" exists.
Calls: add.gdsn -> add.gdsn
In addition: Warning messages:
1: One or more parsing issues, see `problems()` for details
2: In add.gdsn(ans, nm[i], val[[i]], compress = compress, closezip = closezip,  :
  Missing characters are converted to "".
3: In add.gdsn(ans, nm[i], val[[i]], compress = compress, closezip = closezip,  :
  Missing characters are converted to "".
Execution halted

If I run the same command again I actually get a different error, and it seems to stay like this (and the runtime also shortened to ~5 seconds from a couple minutes):

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 262458 14.1     641551 34.3   431252 23.1
Vcells 443021  3.4    8388608 64.0  1754506 13.4
[1] 1641932      22
Error in add.gdsn(Anno.folder, "FunctionalAnnotation", val = FunctionalAnnotation,  :
  The GDS node "FunctionalAnnotation" exists.
Execution halted

Would you know what the problem is? Thanks.

Daniel

info and tutorial files

Dear authors,
I am trying to reproduce the analysis provided in this tutorial but it is a bit tricky without the same input files you used and/or a bit more explanations for reproducibility.
For example, in the "STAARpipeline_Null_Model.r" it is not clear how the covariates are coded, as well as the tested phenotype.
Also, it is not clear how the sparse GRM is produced from the gds file.
Thank you for the help.

Null outputs from Sliding_Window()

Dear @xihaoli,

In my datasets, I have 11201 samples and there are no errors/problems with performing the other STAARpipeline steps, but only with the Sliding_Window step. Each arrayid gives the same warnings continuously and eventually the R crushes.

I tried to re-install each R package or to change the sliding_window_length in various sizes, and the same for the MAF cut-off.

A small example from the errors/warnings:

[1] 1
# of selected samples: 11,021
# of selected variants: 73
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
# of selected samples: 11,021
# of selected variants: 1,307,186
[1] 2
# of selected samples: 11,021
# of selected variants: 46
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
Error in STAAR(G, obj_nullmodel, phred_sub, rare_maf_cutoff = rare_maf_cutoff,  : 
  Number of rare variant in the set is less than 2!
.
.

I would like to inform you that there were no significant results from the Dynamic_Window analysis. Is there any possible reason for the NULL outputs and crushes?

Thank you for your time!
Bests,
Pelin

Error in Fit STAAR null model step: asMethod(object) : matrix is not symmetric [1,2]

Hello,

When I ran the command in script STAARpipeline_Null_Model.r, I got this error:
obj_nullmodel <- fit_nullmodel(phenotype~age+sex+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10,
data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="sample.id",
family=gaussian(link="identity"),verbose=TRUE)
[1] "kins is a sparse matrix."
Fixed-effect coefficients:
(Intercept) age sex PC1 PC2 PC3
-0.09710790 0.01004128 0.01336379 0.59188157 0.82643259 -0.88178803
PC4 PC5 PC6 PC7 PC8 PC9
-0.38470585 0.74543827 0.04345983 0.81811801 -1.34958786 -0.52993129
PC10
0.13472704
Error in asMethod(object) : matrix is not symmetric [1,2]

Then I generated the sgrm data by FastSparseGRM with no error again and the result for command "isSymmetric(as.matrix(sgrm))" was "TRUE". I also check the colnames and rownames by running command "unique(colnames(sgrm) == rownames(sgrm))" and result was "TRUE". I don't know what lead to this error. Could you help me to resolve this problem?

Best
Xubing

dsyMatrix is not supported

Dear authors,

Thank you so much for developing the tool. After I created aGDS, I encountered an error of "dsyMatrix is not supported" when performing STAARpipeline_Gene_Centric_Noncoding.r.

The complete error message is

Error in STAAR_O_SMMAT_sparse(G, Sigma_i, Sigma_iX, cov, residuals.phenotype,  :
  dsyMatrix is not supported.

Any idea how to fix the issue Thanks!

Best,

Small number of variants are passed for analysis

Hello Xihao,

I tried to run both Individual_Analysis() and Sliding_Window() on my aGDS file (178 individuals and 14055 variants on chr22) but the analyses were only performed on 7/NULL variants. I didn't see any other filterings other than "QC_label" in the wrapper methods and the sites are all "PASS" in my case. See below.

I'm not sure if there are additional filterings performed on the backend and most of my sites failed. Can you help figure this out or potentially direct me to these functions so that I may see how they actually worked? Thanks!

A side question is how is STAAR handling imputed SNP dosage if it can?

results <- try(Sliding_Window(chr=22,start_loc=start_loc_sub,end_loc=200000,
                                  sliding_window_length=sliding_window_length,type="multiple",
                                  genofile=genofile,obj_nullmodel=obj_nullmodel,Use_annotation_weights=Use_annotation_weights,
                                  QC_label=QC_label,variant_type=variant_type,geno_missing_imputation=geno_missing_imputation,
                                  Annotation_name_catalog=Annotation_name_catalog,
                                  rare_maf_cutoff=1,rv_num_cutoff=0, Annotation_name=Annotation_name))
# of selected samples: 178
# of selected variants: 7
# of selected samples: 178
# of selected variants: 14,055

results_individual_analysis<-Individual_Analysis(chr=22,start_loc=start_loc,end_loc=20000,
                                                     genofile=genofile,obj_nullmodel=obj_nullmodel,mac_cutoff=20, variant_type=variant_type,
                                                     geno_missing_imputation=geno_missing_imputation)
> results_individual_analysis
NULL

Fitting NULL model for binary outcomes

Hello!
I'm trying to fit my null_model (with genesis and STAAR). This is my phenotype:

sex	diagnosis_phenotype	ID
f	Disease	1
f	Disease	2
f	Control	3
m	Disease	4
m	Disease	5
f	Control	6
f	Control	7
m	Control	8
f	Control	9
m	Control	10
f	Control	11

When using:

fit_nullmodel(as.factor(diagnosis_phenotype)~as.factor(sex),
                               data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="ID",
                               family=gaussian(link="identity"),verbose=TRUE)

I get

Error in glm.fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  :
  NA/NaN/Inf in 'y'
In addition: Warning messages:
1: In Ops.factor(y, mu) : ‘-’ not meaningful for factors
2: In Ops.factor(eta, offset) : ‘-’ not meaningful for factors
3: In Ops.factor(y, mu) : ‘-’ not meaningful for factors

With GENESIS, i get:


fitNullModel(data_GENESIS,outcome="diagnosis_phenotype",
+                                       cov.mat=sgrm,AIREML.tol=1e-4,verbose=TRUE, family = "binomial")
Error in svd(x, 0, 0) : a dimension is zero

How can I generate the null model for binary/categorical outcomes?

GENESIS null model not working

When genesis null model is used in the code below, it returns: dsyMatrix not supported

results_individual_analysis <- Individual_Analysis(chr=chr,start_loc=start_loc,end_loc=end_loc,
genofile=genofile,obj_nullmodel=obj_nullmodel,mac_cutoff=20,
QC_label=QC_label,variant_type=variant_type,
geno_missing_imputation=geno_missing_imputation)

Question about the results of Dynamic_Window_SCANG() and Dynamic_Window_Results_Summary()

Hi!

It's me again.)

Finally get to the final step.
After successfully getting the output using the 1000Genome dataset as toy example, I used an in-house dataset of 100 samples. My purpose is to do WGS burden using dynamic window analysis. As I ran the last step (STAARpipeline_Dynamic_Window.r), I got a strange result.

I got 22 files named "sscpv2_1.Rdata"..."sscpv2_22.Rdata", I think it should stand for the analysis result of chr1 to chr22. However, when I opened "sscpv2_13.Rdata", the SCANG_B_top1 was showed as below:

      -logp chr start_pos end_pos  GWER SNV_num
 [1,]     0   1         0       0 5e-04       1

I have checked that the inputs of STAARpipeline_Dynamic_Window.r (i.e. genofile, obj_nullmodel...)are good. If it is the result of chr13, it's strange the "chr" column is 1. And the "start_pos" and "end_pos" are 0. Does it mean the no significant output? After all, the test sample size was still sooo small.

The Rdata is attached here.
result_scang.zip

Another question: I tried to run STAARpipelineSummary_Dynamic_Window_run.r, but it reported error "cannot open '/mnt/work/research/geyx/STAARpipeline/FAVOR/sscpv2_test_cx/QC/burden/result/sscpv2_23.Rdata'". It's strange because the sscpv2_23.Rdata should not exist.

Looking forward to your suggestion!! Thanks a lot!!

Shea

Questions/issues in "Step 1: Fit STAAR null model"

Hi Xihao,

Thanks for helping me out earlier. I had a couple questions/issues when running "Step 1: Fit STAAR null model":

If I try running STAARpipeline_Null_Model.r, I get an error:

Error in glmmkin(fixed = fixed, data = data, kins = kins, id = id, random.slope = random.slope,  :
  Error: "id" must be one of the variables in the names of "data".
Calls: fit_nullmodel -> glmmkin
Execution halted

The error seems to be in the line:

obj_nullmodel <- fit_nullmodel(BMI_IRNT~Age+AgeSq+Subject_Information.Sex..str+PC1+PC2,data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="IID",family=gaussian(link="identity"),verbose=TRUE)

for the field 'id="IID"'. However "IID" (no quotes) is the column name that the sample IDs are in. I tried running the script without quotes for "IID" and it doesn't work either. Would you know what the issue is?

I then tried using STAARpipeline_Null_Model_GENESIS.r and it gave an error:

Error in .checkSampleId(cov.mat, x) :
  all sample names in dimnames of cov.mat must be present in x$sample.id
Calls: fitNullModel -> fitNullModel -> .local -> .checkSampleId
Execution halted

I saw in another issue this was caused by the IDs being converted to something like "ID_ID" -- I tried the fix in the other issue but still had problem. However, I do have samples in the sparse GRM that are not in the phenotype file -- we will probably run ~20 phenotypes which will have different numbers of individuals with available phenotypes. I suppose it would be preferable to just make one sparse GRM for all phenotypes, but it may not be too much more effort to make a new sparse GRM for each phenotype. Do you have any recommendation here? Thanks.

Daniel

Number of jobs

Hi Xihao,
I am using a different dataset and have slightly different number of jobs for the following analyses:
sum(jobs_num$individual_analysis_num) = 294
sum(jobs_num$sliding_window_num) = 574
sum(jobs_num$scang_num) = 1883
I'm not sure about how to determine the number of jobs for gene-centric coding and noncoding analyses. Please advise on whether I have to use similar number of jobs as you for gene-centric coding and noncoding analyses.
Thanks.

Gene Centric Coding Annotation fails

Hello,

When running the script STAARpipelineSummary_Gene_Centric_Coding_Annotation.r, I am encountering this error:

Error in seqGetData(genofile, paste0(Annotation_dir, Annotation_name_catalog$dir[which(Annotation_name_catalog$name ==  :
  The GDS node "annotation/info/FunctionalAnnotation/rsid" does not exist.
Calls: Gene_Centric_Coding_Info -> info_plof -> seqGetData

The rsids for each variant are not included when annotating the gds files in the earlier steps. Are the rsids supposed to be within the FAVOR database files? I'm using the csv files downloaded from the FAVOR database website and the rsids don't seem to be in there..

How to get the variable genes_info in STAARpipeline_Gene_Centric_Coding.r

Hi, I am trying to learn this STAAR pipeline to deal with rare variant analysis. However, when I ran the Gene_Centric_Coding script, I found that I did not have the genes_info as the input file. So, how to get the genes_info file.

###########################################################
#           Main Function 
###########################################################
## gene number in job
gene_num_in_array <- 50 
group.num.allchr <- ceiling(table(genes_info[,2])/gene_num_in_array)
sum(group.num.allchr)

Errors when running Sliding_Window()

Hello,

I'm testing 2k variant sites among 1373 individuals using sliding_window(). I first set window_length 6k, which seems to run twice but returns different numbers of samples and variants each time, then stop with "rare variant number less than 2" error (see below). Then I set window_length 2k, this time it throws out an error "Error in if (sum(is.in) >= 2) { : missing value where TRUE/FALSE needed".

Why it returns different errors when setting different window lengths? Why it didn't read 2000 variants and 1373 individuals initially in my first test? Is it a format thing with genofile? What do you think? Thanks

Sliding_Window(chr=3,start_loc=11919,end_loc=1239497,sliding_window_length=6000, genofile=genofile,obj_nullmodel=obj_nullmodel)

of selected samples: 1,372

of selected variants: 0

of selected samples: 1,373

of selected variants: 2,000

Error in Sliding_Window_Single(chr = chr, start_loc = start_loc, end_loc = end_loc, :
Number of rare variant in the set is less than 2!

Sliding_Window(chr=3,start_loc=11919,end_loc=1239497,sliding_window_length=2000, genofile=genofile,obj_nullmodel=obj_nullmodel)

of selected samples: 1,372

of selected variants: 0

Error in if (sum(is.in) >= 2) { : missing value where TRUE/FALSE needed

Ancestry PCs / kinship

This is more of a question than an issue regarding including ancestry information in the model.

I see there is the step of generating the GRM to pass to the null model function using the kins argument. When defining the model, however, the example code in STAARpipeline_Null_Model.r also includes 10 principal components as covariates. Are these also the ancestry components? If so, would they be accounted for twice then when also supplying the GRM?

I've generated the null model with including ancestry PCs as covariates and excluded the GRM (i.e., setting kins=NULL). After running the analyses, I'm seeing some genomic deflation in my QQ plots, especially in the gene-centric noncoding analysis. I'm wondering if my setup could be the issue?

STAAR sliding window errors

Dear Xihao Li,

I have found your package very useful, and I thank you for your clear and easy to follow pipeline tutorial.

I would like to ask you about an issue I've encountered. When running the fixed sliding window analysis I get the following odd Manhattan plot:

It seems to tend to "work" at the start of chromosomes and then skip large distances for the rest of each chromosome, resulting in a sparse number of windows evaluated.

I concomitantly get the following error as well while running the analysis:

warning: eig_sym(): given matrix is not symmetric Error : eig_sym(): decomposition failed

This occurs repeatedly, though the code still seems to progress, outputting ascending integers (presumably indices), e.g. 2, bunch of errors, 3, bunch of errors, etc. See below for an example:

[1] 3
# of selected samples: 1,177
# of selected variants: 292

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed

warning: eig_sym(): given matrix is not symmetric
Error : eig_sym(): decomposition failed
# of selected samples: 1,177
# of selected variants: 1,007,368
[1] 4
# of selected samples: 1,177
# of selected variants: 243

Do you have any idea what the issue might be?

I would be very much grateful for any help you could provide.

Many thanks,

Alex

No Manhattan or QQ plots from STAARpipelineSummary_Dynamic_Window.r?

Hello,

While running through the summary scripts, it seems the dynamic window analysis summary does not produce any plots? The README does state that it should...

Within the function Dynamic_Window_Results_Summary.R in the STAARpipelineSummary repo, should the results from each separate SCANG table within SCANG_res be plotted?

as.numeric(commandArgs(TRUE)[1])

Dear xihaoli,

To perform rare-variant analysis from genotype data, I have been sent batch jobs to our uni cluster using your R scripts, and I came until Individual_analysis, however, I have an issue about:

arrayid <- as.numeric(commandArgs(TRUE)[1])
print(arrayid)
[1] NA

It returns as "NA_real_" and the following codes don't work. Do you have any suggestions to solve this issue?

Thanks!

Here is R info;

sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Gene_Centric_Coding output ERROR messages

Hi Dr. Li
Sorry to bother you again. I met some ERROR messages when I ran Gene Centric Coding analysis. Here are the part of the log file.

     used (Mb) gc trigger (Mb) max used (Mb)

Ncells 269062 14.4 654557 35 444157 23.8
Vcells 448534 3.5 8388608 64 1771722 13.6
[1] 379
[1] 51
# of selected samples: 772
# of selected variants: 280
# of selected samples: 772
# of selected variants: 6
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 1
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
Number of rare variant in the set is less than 2!
# of selected samples: 772
# of selected variants: 5
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 3,197,953
[1] 52
# of selected samples: 772
# of selected variants: 24
# of selected samples: 772
# of selected variants: 6
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
\ # of selected variants: 3
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
Number of rare variant in the set is less than 2!
# of selected samples: 772
# of selected variants: 3
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 3,197,953
[1] 53
# of selected samples: 772
# of selected variants: 163
# of selected samples: 772
# of selected variants: 10
# of selected samples: 772
# of selected variants: 1
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
Number of rare variant in the set is less than 2!
# of selected samples: 772
# of selected variants: 1
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
Number of rare variant in the set is less than 2!
# of selected samples: 772
# of selected variants: 5
# of selected samples: 772
# of selected variants: 4
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 3,197,953
[1] 54
# of selected samples: 772
# of selected variants: 37
# of selected samples: 772
# of selected variants: 1
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 1
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
Number of rare variant in the set is less than 2!
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 3,197,953
[1] 55
# of selected samples: 772
# of selected variants: 76
# of selected samples: 772
# of selected variants: 11
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 4
# of selected samples: 772
# of selected variants: 7
# of selected samples: 772
# of selected variants: 0
Error in STAAR(Geno, obj_nullmodel, Anno.Int.PHRED.sub.category, rare_maf_cutoff = rare_maf_cutoff, :
genotype is not a matrix!
# of selected samples: 772
# of selected variants: 3,197,953

I don't know if the Error messages will interrupt the next steps. Could you give some advice?
Looking forward to yout reply.
Damien

Questions about individual analysis results

Hello,

I have some questions about the output of the individual analysis.

What's "# of selected variants: 23"? Given my input of 14,055 variants, I assume the "selected variants" are not selected by the user-given arguments (e.g., mac_cutoff as I set it to 0) rather indicating the final variants after the individual analysis. So do we have the intermediate results of those unselected variants? Are they stored somewhere and why are they not selected (maybe by some p-values)?
For the error message "(!all(CHR == chr)) ", is it indicating the entire 22 chromosomes are required for the analysis since I only give one chromosome?

> start_loc <- 1
> end_loc <- start_loc + 10e7 - 1
> results_individual_analysis<-Individual_Analysis(chr=22,start_loc=start_loc,end_loc=end_loc,
                                                     genofile=genofile,obj_nullmodel=obj_nullmodel,mac_cutoff=0, variant_type=variant_type,
                                                     geno_missing_imputation=geno_missing_imputation)
# of selected samples: 178
# of selected variants: 23
Error in if (!all(CHR == chr)) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In Individual_Analysis(chr = 22, start_loc = start_loc, end_loc = end_loc,  :
  NAs introduced by coercion

Thanks and I hope I'm not bothering you too much.

Dynamic_Window_SCANG error(s)

Hello, I have been receiving two errors.

Error 1 occurs when "res" (a SCANG output) all equal zero, or not applicable, the pipeline breaks before moving to next sub_seq_num.
This is usually fixed by increasing the rare_maf_cutoff value.

Error 2 I am not sure what is going on:
Error in resmost_SCANG_O[1, 3] - resmost_SCANG_O[1, 2] : non-numeric argument to binary operator Calls: Dynamic_Window_SCANG -> matrix Execution halted

Appreciate any help!
Thanks
lj

Error in fit STAAR null model

hello,may I ask why does this error occur

obj_nullmodel <- fit_nullmodel(V6~age+sex+V5,

                           data=phenotype,kins=as.matrix(sgrm),use_sparse=TRUE,kins_cutoff=0.022,id="V1",

                           family=gaussian(link="identity"),verbose=TRUE)

[1] "kins is a dense matrix, transforming it into a sparse matrix using cutoff 0.022."
Using 500 samples provided
Identifying clusters of relatives...
13 relatives in 2 clusters; largest cluster = 9
Creating block matrices for clusters...
487 samples with no relatives included
Putting all samples together into one block diagonal matrix
Error in glmmkin(fixed = fixed, data = data, kins = kins_sp, id = id, :
Error: kins matrix 1 does not include all individuals in the data.

Expected input Parameters for Gene Centric Coding Step

Whenever I run the Gene Centric Coding Step I get this error regardless of how many arguments or what the argument is that I input. What is the pipeline expecting as input?

[1] 379
[1] 701

of selected samples: 0

of selected variants: 207

of selected samples: 0

of selected variants: 29

of selected samples: 0

of selected variants: 0

Error in data.frame(id.genotype, index = seq(1, length(id.genotype))) :
arguments imply differing number of rows: 0, 2
Calls: Gene_Centric_Coding -> coding -> data.frame
Execution halted

gnomAD AF & family-based vcf analysis

Hi Xihao,

Thanks for your useful tool! And I've met some problems really confused me:

I noticed that allele frequencies of gnomAD and 1000G are contained in FAVOR full database but not in FAVOR essential database. Are there any approaches that I could make use of these AF annotations in the STAAR procedure? (Like the variants with gnomAD AF < 5‰ will be given priority).
I'm now working on a rare disease and my cohort contains ~700 trio families (only the child has the disease, parents are healthy), so my data is a vcf file with ~2000 samples. Does STAAR pipeline support family-based analysis? If so, how can I represent this family relationship while analyzing? (maybe in the pheno.csv?)
Do I need to add more irrelevant healthy samples as control?

Thanks a lot!!!

No known loci for conditional analysis?

Hi there!
Thank you again for providing this very useful pipeline.
I have a question regarding the later steps of analysis (Summarization and visualization).
I am trying to analyse a WES dataset of around 5500 samples (~ 1200 case vs 4300 controls).
I was wondering how I would proceed with the analysis if I lack a set of known loci for my binary trait?
Is this an issue or is this part somewhat optional?
Furthermore, what would the optimal path / pipeline be for this kind of analysis, which has the sole focus of coding variants?
And do you perhaps have a recommendation as to how to proceed with a suitable pathway analysis after?

Sorry is this a lot, and thank you in advance!

Best wishes,
Daniel

Step 3.2: Gene-centric noncoding analysis

Hi Xihao,

Sorry to keep bothering you. I'm trying to run the association analyses now, specifically the rare non-coding analyses. I had a couple questions/issues:

As an example I ran the following:
Rscript STAARpipeline_Gene_Centric_Noncoding.r 1

which the full output/error is:

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 262458 14.1     641551 34.3   431252 23.1
Vcells 443021  3.4    8388608 64.0  1754506 13.4
[1] 379
[1] 1
# of selected samples: 0
# of selected variants: 0
# of selected samples: 6,280
# of selected variants: 9,400,898
# of selected samples: 0
# of selected variants: 0
Error in data.frame(id.genotype, index = seq(1, length(id.genotype))) :
  arguments imply differing number of rows: 0, 2
Calls: Gene_Centric_Noncoding -> noncoding -> data.frame
Execution halted

Would you know what the issue is? If it's any help, 6,280 is the number of samples with genetic data, and 9,400,898 is the number of variants on chromosome 1.

I saw the script STAARpipeline_Gene_Centric_Noncoding.sh to run STAARpipeline on a cluster. However this seems to be for SLURM, and our cluster uses LSF. Is essentially what is being done in this script and in STAARpipeline_Gene_Centric_Noncoding.r is it runs 50 genes each for 379 jobs? So if I just run STAARpipeline_Gene_Centric_Noncoding.r in a loop from {1..379} this should do the same thing? And likewise for the rest of the analyses in 3.2 for however many "SLURM_ARRAY_TASK_ID"s there are. Thanks.

error in running "STAARpipeline_Null_Model.r"

Hi Xihao,a very impressive job!!but when I run the program to this point--“STAARpipeline_Null_Model.r”，I get this error
Iteration 20 : Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'chol2inv': error in evaluating the argument 'x' in selecting a method for function 't': leading principal minor of order 1 is not positive Calls: fit_nullmodel ... .Call -> .handleSimpleError -> h -> .handleSimpleError -> h In addition: Warning message: In .local(A, ...) : CHOLMOD warning 'not positive definite' at file '../Cholesky/t_cholmod_rowfac.c', line 430 Execution halted

When generating the GRM matrix, I input the bed file that I completed pruning on Plink myself, and then follow your process to get a sparse matrix,I don't know how to fix it

Running staarpipeline docker image

Hi Xihaoli,
Thanks for this seemingly great tool, I’d like to try it out. I’m having challenges with installing the staarpipeline dependencies. I’ve now pulled the docker image into an HPC as a singularity image. How do I run the image with the Rscripts that you have provided on GitHub?

#SNVs per case/control?

Hello,

For the gene-centric analyses, is there any possible way to obtain the number of SNVs per case and control samples? The output from the summary functions only lists the total number of SNVs found per gene+category. I cannot seem to locate where this SNV count is actually generated in any of the functions to even try to modify this myself.

run this pipline in crop population

Dear Xihao Li,
I want to know whether we could run this pipline in crop population?

fit_nullmodel Output is mostly Null and 0

Hi,
I'm not sure if I'm getting a reasonable output of the null model.

My genotyping (WGS) data has 1026 samples.
`
genofile
Object of class "SeqVarGDSClass"
File: /analysis/21-WGS/analysis/STAARpipeline/gds_with_AVGDP/chr11.gds (283.1M)

[ ] *
|--+ description [ ] *
|--+ sample.id { Str8 1026 LZMA_ra(13.0%), 1.1K } *
|--+ variant.id { Int32 2496036 LZMA_ra(4.50%), 439.1K } *
|--+ position { Int32 2496036 LZMA_ra(30.8%), 2.9M } *
|--+ chromosome { Str8 2496036 LZMA_ra(0.02%), 1.2K } *
|--+ allele { Str8 2496036 LZMA_ra(15.8%), 1.7M } *
|--+ genotype [ ] *
| |--+ data { Bit2 2x1026x2496036 LZMA_ra(4.87%), 59.5M } *
| |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
| --+ extra { Int16 0 LZMA_ra, 18B }
|--+ phase [ ]
| |--+ data { Bit1 1026x2496036 LZMA_ra(0.01%), 45.6K } *
| |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
| --+ extra { Bit1 0 LZMA_ra, 18B }
|--+ annotation [ ]
| |--+ id { Str8 2496036 LZMA_ra(9.76%), 2.8M } *
| |--+ qual { Float32 2496036 LZMA_ra(62.4%), 5.9M } *
| |--+ filter { Int32,factor 2496036 LZMA_ra(2.17%), 211.1K } *
| |--+ info [ ]
| | |--+ AC { Int32 2496036 LZMA_ra(18.8%), 1.8M } *
| | |--+ AF { Float32 2496036 LZMA_ra(36.3%), 3.5M } *
| | |--+ ALLELE_END { Bit1 2496036 LZMA_ra(0.06%), 201B } *
| | |--+ AN { Int32 2496036 LZMA_ra(12.2%), 1.2M } *
| | |--+ AS_BaseQRankSum { Float32 0 LZMA_ra, 18B } *
| | |--+ AS_FS { Float32 1531413 LZMA_ra(26.4%), 1.5M } *
| | |--+ AS_InbreedingCoeff { Float32 3060955 LZMA_ra(23.0%), 2.7M } *
| | |--+ AS_MQ { Float32 0 LZMA_ra, 18B } *
| | |--+ AS_MQRankSum { Float32 0 LZMA_ra, 18B } *
| | |--+ AS_QD { Float32 3060955 LZMA_ra(35.3%), 4.1M } *
| | |--+ AS_ReadPosRankSum { Float32 0 LZMA_ra, 18B } *
| | |--+ AS_SB_TABLE { Str8 2496036 LZMA_ra(0.02%), 517B } *
| | |--+ AS_SOR { Float32 1531413 LZMA_ra(28.4%), 1.7M } *
| | |--+ BaseQRankSum { Float32 2496036 LZMA_ra(39.8%), 3.8M } *
| | |--+ ClippingRankSum { Float32 2496036 LZMA_ra(34.4%), 3.3M } *
| | |--+ DB { Bit1 2496036 LZMA_ra(0.06%), 201B } *
| | |--+ DP { Int32 2496036 LZMA_ra(40.8%), 3.9M } *
| | |--+ DS { Bit1 2496036 LZMA_ra(0.06%), 201B } *
| | |--+ END { Int32 2496036 LZMA_ra(0.02%), 1.6K } *
| | |--+ ExcessHet { Float32 2496036 LZMA_ra(24.4%), 2.3M } *
| | |--+ FS { Float32 2496036 LZMA_ra(35.9%), 3.4M } *
| | |--+ InbreedingCoeff { Float32 2496036 LZMA_ra(24.2%), 2.3M } *
| | |--+ MLEAC { Int32 2642114 LZMA_ra(17.2%), 1.7M } *
| | |--+ MLEAF { Float32 2642114 LZMA_ra(18.6%), 1.9M } *
| | |--+ MQ { Float32 2496036 LZMA_ra(12.3%), 1.2M } *
| | |--+ MQ0 { Int32 2496036 LZMA_ra(0.02%), 1.6K } *
| | |--+ MQRankSum { Float32 2496036 LZMA_ra(35.7%), 3.4M } *
| | |--+ NEGATIVE_TRAIN_SITE { Bit1 2496036 LZMA_ra(45.5%), 138.7K } *
| | |--+ POSITIVE_TRAIN_SITE { Bit1 2496036 LZMA_ra(96.5%), 294.1K } *
| | |--+ QD { Float32 2496036 LZMA_ra(37.5%), 3.6M } *
| | |--+ RAW_MQ { Float32 2496036 LZMA_ra(0.02%), 1.6K } *
| | |--+ ReadPosRankSum { Float32 2496036 LZMA_ra(39.5%), 3.8M } *
| | |--+ SOR { Float32 2496036 LZMA_ra(36.3%), 3.5M } *
| | |--+ VQSLOD { Float32 2496036 LZMA_ra(41.1%), 3.9M } *
| | |--+ culprit { Str8 2496036 LZMA_ra(7.96%), 1.0M } *
| | |--+ F_MISSING { Float32 2496036 LZMA_ra(13.2%), 1.3M } *
| | |--+ AN_case { Int32 2496036 LZMA_ra(6.09%), 593.7K } *
| | |--+ AN_control { Int32 2496036 LZMA_ra(11.0%), 1.0M } *
| | |--+ AC_case { Int32 2496036 LZMA_ra(9.24%), 900.6K } *
| | |--+ AC_control { Int32 2496036 LZMA_ra(19.1%), 1.8M } *
| | |--+ AC_Het_case { Int32 2496036 LZMA_ra(8.40%), 818.7K } *
| | |--+ AC_Het_control { Int32 2496036 LZMA_ra(17.7%), 1.7M } *
| | |--+ AC_Het { Int32 2496036 LZMA_ra(17.7%), 1.7M } *
| | |--+ AF_case { Float32 2496036 LZMA_ra(13.4%), 1.3M } *
| | |--+ AF_control { Float32 2496036 LZMA_ra(34.4%), 3.3M } *
| | |--+ HWE_case { Float32 2496036 LZMA_ra(7.81%), 761.3K } *
| | |--+ HWE_control { Float32 2496036 LZMA_ra(22.8%), 2.2M } *
| | |--+ HWE { Float32 2496036 LZMA_ra(23.6%), 2.2M } *
| | |--+ FunctionalAnnotation [ spec_tbl_df,tbl_df,tbl,data.frame,list ] *
| | | |--+ VarInfo { Str8 2496036 LZMA_ra(17.0%), 6.7M }
| | | |--+ apc_conservation { Float64 2496036 LZMA_ra(79.1%), 15.1M }
| | | |--+ apc_epigenetics { Float64 2496036 LZMA_ra(81.4%), 15.5M }
| | | |--+ apc_epigenetics_active { Float64 2496036 LZMA_ra(75.9%), 14.5M }
| | | |--+ apc_epigenetics_repressed { Float64 2496036 LZMA_ra(50.5%), 9.6M }
| | | |--+ apc_epigenetics_transcription { Float64 2496036 LZMA_ra(43.6%), 8.3M }
| | | |--+ apc_local_nucleotide_diversity { Float64 2496036 LZMA_ra(79.9%), 15.2M }
| | | |--+ apc_mappability { Float64 2496036 LZMA_ra(29.7%), 5.7M }
| | | |--+ apc_protein_function { Float64 2496036 LZMA_ra(2.15%), 419.9K }
| | | |--+ apc_transcription_factor { Float64 2496036 LZMA_ra(7.10%), 1.4M }
| | | |--+ cage_tc { Str8 2496036 LZMA_ra(5.19%), 183.1K }
| | | |--+ metasvm_pred { Str8 2496036 LZMA_ra(0.82%), 20.1K }
| | | |--+ rsid { Str8 2496036 LZMA_ra(36.4%), 9.4M }
| | | |--+ fathmm_xf { Float64 2496036 LZMA_ra(50.6%), 9.6M }
| | | |--+ genecode_comprehensive_category { Str8 2496036 LZMA_ra(0.57%), 147.1K }
| | | |--+ genecode_comprehensive_info { Str8 2496036 LZMA_ra(6.10%), 3.2M }
| | | |--+ genecode_comprehensive_exonic_category { Str8 2496036 LZMA_ra(1.19%), 34.6K }
| | | |--+ genecode_comprehensive_exonic_info { Str8 2496036 LZMA_ra(7.53%), 489.1K }
| | | |--+ genehancer { Str8 2496036 LZMA_ra(0.39%), 365.7K }
| | | |--+ linsight { Float64 2496036 LZMA_ra(21.7%), 4.1M }
| | | |--+ cadd_phred { Float64 2496036 LZMA_ra(23.7%), 4.5M }
| | | --+ rdhs { Str8 2496036 LZMA_ra(3.81%), 317.0K }
| | --+ AVGDP { Int32 2496036 LZMA_ra(40.8%), 3.9M } *
| --+ format [ ]
--+ sample.annotation [ ]
`

The sgrm matrix is 1026x1026 . I made this using --degree 2 (everything else the same, I know non of my cases are related, and only 1 pair of controls are 1st degree relatives, which the SGRM pipeline identified)

> sgrm[100:105,100:105]  ### Example of few lines of the matrix
> 6 x 6 sparse Matrix of class "dsCMatrix"
>           0_HG00151 0_HG00154 0_HG00155 0_HG00157 0_HG00158 0_HG00159
> 0_HG00151 0.5043822 .         .         .         .         .        
> 0_HG00154 .         0.5009275 .         .         .         .        
> 0_HG00155 .         .         0.4998904 .         .         .        
> 0_HG00157 .         .         .         0.4966408 .         .        
> 0_HG00158 .         .         .         .         0.4979511 .        
> 0_HG00159 .         .         .         .         .         0.5044529

Phenotype has 1004 samples, with a binary variable, which I've created as follows:

tail(phenotype)   #### The PCs here are those that are generated by the SGRM output .score file
                   FID SNPSEX pheno group_race         PC1        PC2           PC3
999  0_NA20821      2     1        EUR -0.02332929 0.01675661 -0.0010476483
1000 0_NA20822      2     1        EUR -0.02330287 0.01741464  0.0001754674
1001 0_NA20826      2     1        EUR -0.02354092 0.01799696 -0.0002679973
1002 0_NA20827      1     1        EUR -0.02345230 0.01764050 -0.0008549907
1003 0_NA20828      2     1        EUR -0.02335791 0.01736181 -0.0007909704
1004 0_NA20832      2     1        EUR -0.02320004 0.01698723  0.0004000635
              PC4          PC5         PC6           PC7           PC8
999  -0.003812300 -0.007291464 -0.04352340  0.0010770846 -0.0009705049
1000 -0.003096957 -0.006086234 -0.04654090  0.0005451567 -0.0014752338
1001 -0.004336742 -0.004473674 -0.04533632  0.0002382286  0.0016030985
1002 -0.003471963 -0.004603992 -0.03881987 -0.0001682949 -0.0005360977
1003 -0.003189036 -0.004899655 -0.04413306  0.0002276412 -0.0001268270
1004 -0.003445883 -0.008064412 -0.04676934  0.0007819886 -0.0020845665
....
... PC9 to PC20 ....
```> 

My genotyping data is a mix of Europeans, hispanics, and African races, which I've coded as the "group_race" in the phenotype file. 
From another answered question I understood that "groups" need only be used if my phenotype is a continuous variable.
So I skipped it.
`## fit null model
obj_nullmodel <- fit_nullmodel(pheno~SNPSEX+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10, data=phenotype,kins=sgrm,use_sparse=TRUE,
kins_cutoff=0.022,id="FID",family=gaussian(link="identity"),verbose=TRUE)
`

This is the output. I am wondering if this is reasonable. 
obj_nullmodel$coefficients
(Intercept)      SNPSEX         PC1         PC2         PC3         PC4 
 1.35457458 -0.22051452 -2.45884174  1.31956283 -0.12508581  0.21253838 
        PC5         PC6         PC7         PC8         PC9        PC10 
 0.40876625  1.82775475 -0.27736805 -0.14491347  0.03741783 -0.01117903 


table( obj_nullmodel$residuals )

   0 
1004 
table( obj_nullmodel$scaled.residuals)
< table of extent 0 >
head(obj_nullmodel$scaled.residuals)
  1   2   3   4   5   6 
NaN NaN NaN NaN NaN NaN 
> obj_nullmodel$converged
[1] TRUE
>  obj_nullmodel$sparse_kins
[1] TRUE
>  obj_nullmodel$relatedness
[1] TRUE
>  obj_nullmodel$use_SPA
[1] FALSE


I'm concerned because when I ran the Individual analysis, I got NaN Pvalues for all variants.
head analysis/results_individual_analysis_sig.csv 
"","CHR","POS","REF","ALT","ALT_AF","MAF","N","pvalue","pvalue_log10","Score","Score_se","Est","Est_se"
"NA",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.1",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.2",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.3",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.4",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.5",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.6",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA


Here is a one part of the individual analysis just to see the NaNs
ind<- get(load("analysis/DILIN_Individual_Analysis_78.Rdata"))
head(ind)
    CHR      POS REF ALT     ALT_AF        MAF    N pvalue pvalue_log10 Score Score_se Est. Est_se
989   4 70019914   G   A 0.02442672 0.02442672 1004    NaN          NaN   NaN. 26.30146 NaN 0.03802071
1     4 70019928   G   A 0.55276382 0.44723618 1004    NaN          NaN   NaN.    77.90924 NaN 0.01283545
990   4 70020508  GA   G 0.01846307 0.01846307 1004    NaN          NaN   NaN. 22.09448 NaN 0.04526017
2     4 70020586   G   A 0.06000000 0.06000000 1004    NaN          NaN   NaN.   38.48158 NaN 0.02598646
3     4 70020606   T   C 0.09450000 0.09450000 1004    NaN          NaN   NaN 46.11840 NaN 0.02168332
4     4 70020771   G   A 0.09530938 0.09530938 1004    NaN          NaN   NaN.    46.27219 NaN 0.02161125

fitnullmodel error

Dear Dr. Li,
I am facing another problem when I ran fitnullmodel step using STAARpipeline_Null_Model_GENESIS.r, there was an error:

Error in .checkSampleId(cov.mat, x) :
all sample names in dimnames of cov.mat must be present in x$sample.id
Calls: fitNullModel -> fitNullModel -> .local -> .checkSampleId
Execution halted

the cov.mat points sGRM.RData, so I check the content of the sGRM.Rdata. I find the sGRM@Dimnames[[1]]
output looks strange:

"D2109086378_D2109086378" "D2109086379_D2109086379" "D2109086380_D2109086380"

It looks paste FID and IID using "_" as the rownames and colnames of the sparse GRM. Not sure if I understand correctly?
After I did an operation to delect the "_" and the string after using sGRM@Dimnames[[1]] <- gsub("_D[0-9]+", "", sGRM@Dimnames[[1]]), I ran fitnullmodel again, but It still occurred the same error.

for "x", I think it is the phenotype I am concerned. Is it ringht? Meanwhile, I am not sure the phenotype file should include what columns?
Here are the colnames in my phenotype file:

FID IID sex age first_transfusion PLT PLT_zscore pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10 pc11 pc12 pc13 pc14 pc15 pc16 pc17 pc18 pc19 pc20

the first two columns are FID and IID and they are indentical, Not sure it should remove one of them and rename the colname of the remain to sample.id?

Question about Sliding_Window function

Hello,

I'm trying to run Sliding_Window() with ~2k variants but everytime it returns "Error: $ operator is invalid for atomic vectors." This happened even when just given obj_nullmodel argument, like this Sliding_Window(chr="chr3",start_loc=11919,end_loc=1239497,sliding_window_length=6000,obj_nullmodel="gds/obj_nullmodel_GENESIS.Rdata").
Do you have any idea how this happened? Is it due to a misformatted obj_nullmodel? Thanks!

I have followed the tutorial to preparing this
covar <- read.csv("covar.csv")
obj_nullmodel_GENESIS <- fitNullModel(covar,outcome="rate",covars=c("PC1","PC2","PC3","PC4"),family="binomial", cov.mat = NULL,group.var = NULL)

obj_nullmodel <- genesis2staar_nullmodel(obj_nullmodel_GENESIS)
save(obj_nullmodel,obj_nullmodel_GENESIS.Rdata")

Known_Loci_Pruning error

When I execute the script STAARpipelineSummary_Known_Loci_Pruning.r, there are several chromosomes that did not generate corresponding data, and the error is as follows:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 283209 15.2 664320 35.5 450929 24.1
Vcells 487090 3.8 8388608 64.0 1815701 13.9
[1] 8

of selected samples: 3,508

of selected variants: 2

of selected samples: 3,508

of selected variants: 2

of selected samples: 3,508

of selected variants: 4,303,007

of selected samples: 3,508

of selected variants: 1

of selected samples: 3,508

of selected variants: 4,303,007

of selected samples: 3,508

of selected variants: 2

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
没有(非NA)案例可用
Calls: LD_pruning -> Individual_Analysis_cond -> lm -> lm.fit
停止执行
Is the reason for this that the provided known pathogenic loci do not exist in my data? If so, how can I avoid conflicts between removing certain regions during QC and this step?

Thank you for your help.

Lack of Input Examples for running STAARpipeline

Thank you for designing such a useful tool with STAARpipeline.

I am stuck on the summary portion of the pipeline due to the not knowing what the known loci info csv should look like or contain. This is the line in the the R script file: "STAARpipelineSummary_Known_Loci_Pruning.r" within step 0 of the summary step: known_loci_info <- read.csv("/path_to_the_file/TOPMed_F5_LDL_Known_Loci_info.csv"). I was hoping to see what the csv format should look like, what information/columns it should contain and how to format it. In addition, is there any way to include some example input files for reference along the way in the other steps too (such as an example phenotype file or any other csv inputs).

The example R scripts and Shell scripts are very helpful in knowing how to run the pipeline but I have found myself stuck on steps with additional required inputs.

Thank You!

xihaoli / staarpipeline-tutorial Goto Github PK

staarpipeline-tutorial's Introduction

STAARpipeline-Tutorial

Pre-step of association analysis using STAARpipeline

Generate Genomic Data Structure (GDS) file

Generate annotated GDS (aGDS) file using FAVORannotator

Prerequisites:

Step 0: Install xsv

Step 1: Generate the variants list to be annotated

Script: Varinfo_gds.R

Input: GDS files of each chromosome and the FAVOR database information FAVORdatabase_chrsplit.csv. For more details, please see the R script.

Output: CSV files of the variants list. For each chromosome, the number of CSV files is listed in FAVORdatabase_chrsplit.csv.

Step 2: Annotate the variants using the FAVOR database through xsv software

Script: Annotate.R

Input: CSV files of the variants list to be annotated, the FAVOR database information FAVORdatabase_chrsplit.csv,

Output: CSV files of the annotated variants list.

Step 3: Generate the annotated GDS (aGDS) file

Script: gds2agds.R

Input: GDS files and the CSV files of annotated variants list (Anno_chrXX.csv or Anno_chrXX_STAARpipeline.csv). For more details, please see the R script.

Output: aGDS files including both the genotype and annotation information.

Generate sparse Genetic Relatedness Matrix (GRM)

Association analysis using STAARpipeline

Step 0: Preparation for association analysis of whole-genome/whole-exome sequencing studies

Script: Association_Analysis_PreStep.r

Input: aGDS files of all 22 chromosomes. For more details, please see the R script.

Output: agds_dir.Rdata, Annotation_name_catalog.Rdata, jobs_num.Rdata.

Step 1: Fit STAAR null model

Script: STAARpipeline_Null_Model.r or STAARpipeline_Null_Model_GENESIS.r

Input: Phenotype data and (sparse) genetic relatedness matrix. For more details, please see the R scripts.

Output: a Rdata file of the STAAR null model.

Step 2: Individual (single-variant) analysis

Script: STAARpipeline_Individual_Analysis.r

Input: aGDS files and the STAAR null model. For more details, please see the R script.

Output: Rdata files with the user-defined names.

Step 3.1: Gene-centric coding analysis

Script: STAARpipeline_Gene_Centric_Coding.r and STAARpipeline_Gene_Centric_Coding_Long_Masks.r

Input: aGDS files and the STAAR null model. For more details, please see the R scripts.

Output: 381 Rdata files with the user-defined names.

Step 3.2: Gene-centric noncoding analysis

Script: STAARpipeline_Gene_Centric_Noncoding.r, STAARpipeline_Gene_Centric_Noncoding_Long_Masks.r, STAARpipeline_Gene_Centric_ncRNA.r and STAARpipeline_Gene_Centric_ncRNA_Long_Masks.r

Input: aGDS files and the STAAR null model. For more details, please see the R scripts.

Output: 387 Rdata files with the user-defined names for protein-coding genes and 223 Rdata files with the user-defined names for ncRNA genes.

Step 4: Sliding window analysis

Script: STAARpipeline_Sliding_Window.r

Input: aGDS files and the STAAR null model. For more details, please see the R script.

Output: Rdata files with the user-defined names.

Step 5.0: Obtain SCANG-STAAR null model

Script: STAARpipeline_STAAR2SCANG.r

Input: STAAR null model. For more details, please see the R script.

Output: a Rdata file of the SCANG-STAAR null model.

Step 5: Dynamic window analysis using SCANG-STAAR

Script: STAARpipeline_Dynamic_Window.r

Input: SCANG-STAAR null model. For more details, please see the R script.

Output: Rdata files with the user-defined names.

Summarization and visualization of association analysis results using STAARpipelineSummary

Step 0 (Optional): Select independent variants from a known variants list to be used in conditional analysis

Script: STAARpipelineSummary_Known_Loci_Pruning.r

Input: aGDS files, a list of known variants (4-column "CHR-POS-REF-ALT" format), and the STAAR null model.

Output: a Rdata file containing a list of independent variants to be used in conditional analysis.

Step 1: Summarize individual (single-variant) analysis results

Script: STAARpipelineSummary_Individual_Analysis.r

Input: aGDS files, individual analysis results generated by STAARpipeline, the STAAR null model, and a list of known variants. For more details, please see the R script.

Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant variants.

Step 2.1: Summarize gene-centric coding analysis results

Script: STAARpipelineSummary_Gene_Centric_Coding.r

Input: aGDS files, gene-centric coding analysis results generated by STAARpipeline, the STAAR null model, and a list of known variants. For more details, please see the R script.

Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant coding masks.

Step 2.2: Summarize gene-centric noncoding analysis results

Script: STAARpipelineSummary_Gene_Centric_Noncoding.r

Input: aGDS files, gene-centric noncoding analysis results generated by STAARpipeline, the STAAR null model, and a list of known variants. For more details, please see the R script.

Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant noncoding masks.

Step 3: Summarize sliding window analysis results

Script: STAARpipelineSummary_Sliding_Window.r

Input: aGDS files, sliding window analysis results generated by STAARpipeline, the STAAR null model, and a list of known variants. For details, see the R scripts.

Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant sliding windows.

Step 4: Summarize dynamic window analysis results

Script: STAARpipelineSummary_Dynamic_Window.r

Input: aGDS files, dynamic window analysis results generated by STAARpipeline, the STAAR null model, and a list of known variants. For more details, please see the R script.

Output: The summary includes the Manhattan plot, Q-Q plot, and conditional p-values of unconditionally significant dynamic windows.

Step 5.1: Functionally annotate a list of variants

Input: GDS files and the CSV files of annotated variants list (`Anno_chrXX.csv` or `Anno_chrXX_STAARpipeline.csv`). For more details, please see the R script.

Output: `agds_dir.Rdata`, `Annotation_name_catalog.Rdata`, `jobs_num.Rdata`.