Hi! It's me again.) Finally get to the final step.

Question about the results of Dynamic_Window_SCANG() and Dynamic_Window_Results_Summary() about staarpipeline-tutorial HOT 7 CLOSED

xihaoli commented on June 26, 2024

Question about the results of Dynamic_Window_SCANG() and Dynamic_Window_Results_Summary()

from staarpipeline-tutorial.

Comments (7)

zilinli1988 commented on June 26, 2024

These results indicate that there might be some bugs in your analysis. One possible reason might be the small sample sizes. Another possible reason might be some errors in your GDS file. As mentioned in the STAARpipeline manuscript, the dynamic window analysis is designed to detect associations in the noncoding genome, especially for the intergenic region. Therefore, we recommend the users run gene-centric coding analysis for WES data. If your sample size is small, an alternative approach to gene-centric coding analysis is the gene-based analysis based on the genes' position. We could achieve it through the sliding window analysis code in STAARpipeline. We will close the comment if you have no more questions.

from staarpipeline-tutorial.

SheaCheng2000 commented on June 26, 2024

Hi,

Thanks for your reply!
The data are all WGS. I agree that the 100 sample size is still very small. And I want to ask how can I check whether the GDS files are good or not?

from staarpipeline-tutorial.

xihaoli commented on June 26, 2024

Hi Shea,

Checking the validity of an aGDS file include (but not limited to) (1) whether the dimension of genotypes and each annotation is aligned; (2) whether you have handled multi-allelic variants; (3) whether there are any missing values (NAs) in any of the field for any of the variants in the aGDS file; (4) whether the chromosome field is numeric (e.g. 13) rather than character ("chr13"), etc.

In addition, how is your QC_label defined in your aGDS file? If all of the variants in your aGDS file are assumed to be PASS variants (after QC), did you follow the section in the STAARpipeline-Tutorial to create a new channel in the aGDS file by setting the value of all variants as "PASS"?

Hope this helps.

Best,
Xihao

from staarpipeline-tutorial.

SheaCheng2000 commented on June 26, 2024

Hi Xihao,

Thanks for your helpful reply and sorry again for bothering you.

I have checked aGDS file,
(2) I have handled multi-allelic variants in the QC step ✅
(4) The chromosome field is numeric ✅

However, (1)(3) seem to have some problems:
(1) the dimension of genotypes and each annotation seems not to be aligned in format field? (i.e. AD/DP/GQ/PGT/PID/PL)

File: /mnt/work/research/geyx/STAARpipeline/FAVOR/sscpv2_test_cx/QC/STAARtest_sscpv2.split.hard.AB.VQSR.QDSOR.samFlt.renew.chr13.vcf.gz.gds (248.4M)

[ ] *
|--+ description [ ] *
|--+ sample.id { Str8 100 LZMA_ra(18.6%), 397B } *
|--+ variant.id { Int32 527398 LZMA_ra(6.26%), 129.1K } *
|--+ position { Int32 527398 LZMA_ra(40.5%), 833.4K } *
|--+ chromosome { Str8 527398 LZMA_ra(0.02%), 381B } *
|--+ allele { Str8 527398 LZMA_ra(17.9%), 466.0K } *
|--+ genotype [ ] *
| |--+ data { Bit2 2x100x527398 LZMA_ra(9.11%), 2.3M } *
| |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
| --+ extra { Int16 0 LZMA_ra, 18B }
|--+ phase [ ]
| |--+ data { Bit1 100x527398 LZMA_ra(0.08%), 5.1K } *
| |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
| --+ extra { Bit1 0 LZMA_ra, 18B }
|--+ annotation [ ]
| |--+ id { Str8 527398 LZMA_ra(33.6%), 1.8M } *
| |--+ qual { Float32 527398 LZMA_ra(69.7%), 1.4M } *
| |--+ filter { Int32,factor 527398 LZMA_ra(0.02%), 465B } *
| |--+ info [ ]
| | |--+ BaseQRankSum { Float32 527398 LZMA_ra(46.0%), 948.3K } *
| | |--+ ClippingRankSum { Float32 527398 LZMA_ra(1.86%), 38.4K } *
| | |--+ DB { Bit1 527398 LZMA_ra(35.5%), 22.9K } *
| | |--+ DP { Int32 527398 LZMA_ra(33.6%), 692.1K } *
| | |--+ END { Int32 527398 LZMA_ra(0.02%), 465B } *
| | |--+ ExcessHet { Float32 527398 LZMA_ra(22.5%), 464.0K } *
| | |--+ FS { Float32 527398 LZMA_ra(33.9%), 699.1K } *
| | |--+ InbreedingCoeff { Float32 527398 LZMA_ra(34.5%), 710.6K } *
| | |--+ MLEAC { Int32 527398 LZMA_ra(16.6%), 342.5K } *
| | |--+ MLEAF { Float32 527398 LZMA_ra(20.1%), 414.2K } *
| | |--+ MQ { Float32 527398 LZMA_ra(6.75%), 139.1K } *
| | |--+ MQRankSum { Float32 527398 LZMA_ra(1.99%), 41.0K } *
| | |--+ NEGATIVE_TRAIN_SITE { Bit1 527398 LZMA_ra(2.70%), 1.7K } *
| | |--+ POSITIVE_TRAIN_SITE { Bit1 527398 LZMA_ra(94.8%), 61.0K } *
| | |--+ QD { Float32 527398 LZMA_ra(39.8%), 820.3K } *
| | |--+ ReadPosRankSum { Float32 527398 LZMA_ra(40.9%), 842.9K } *
| | |--+ SOR { Float32 527398 LZMA_ra(35.5%), 731.4K } *
| | |--+ VQSLOD { Float32 527398 LZMA_ra(34.9%), 718.4K } *
| | |--+ culprit { Str8 527398 LZMA_ra(0.88%), 44.0K } *
| | |--+ AC { Int32 527398 LZMA_ra(16.9%), 348.2K } *
| | |--+ AF { Float32 527398 LZMA_ra(24.5%), 504.1K } *
| | |--+ NS { Int32 527398 LZMA_ra(0.02%), 465B } *
| | |--+ AN { Int32 527398 LZMA_ra(5.76%), 118.7K } *
| | |--+ FunctionalAnnotation [ spec_tbl_df,tbl_df,tbl,data.frame,list ] *
| | | |--+ VarInfo { Str8 527398 LZMA_ra(18.9%), 1.6M }
| | | |--+ apc_conservation { Float64 527398 LZMA_ra(78.5%), 3.2M }
| | | |--+ apc_epigenetics { Float64 527398 LZMA_ra(78.8%), 3.2M }
| | | |--+ apc_epigenetics_active { Float64 527398 LZMA_ra(77.5%), 3.1M }
| | | |--+ apc_epigenetics_repressed { Float64 527398 LZMA_ra(52.3%), 2.1M }
| | | |--+ apc_epigenetics_transcription { Float64 527398 LZMA_ra(43.8%), 1.8M }
| | | |--+ apc_local_nucleotide_diversity { Float64 527398 LZMA_ra(78.4%), 3.2M }
| | | |--+ apc_mappability { Float64 527398 LZMA_ra(24.9%), 1.0M }
| | | |--+ apc_protein_function { Float64 527398 LZMA_ra(1.84%), 75.8K }
| | | |--+ apc_transcription_factor { Float64 527398 LZMA_ra(7.30%), 300.9K }
| | | |--+ cage_tc { Str8 527398 LZMA_ra(5.04%), 32.3K }
| | | |--+ metasvm_pred { Str8 527398 LZMA_ra(0.46%), 2.4K }
| | | |--+ rsid { Str8 527398 LZMA_ra(34.7%), 1.7M }
| | | |--+ fathmm_xf { Float64 527398 LZMA_ra(51.0%), 2.1M }
| | | |--+ genecode_comprehensive_category { Str8 527398 LZMA_ra(0.95%), 51.9K }
| | | |--+ genecode_comprehensive_info { Str8 527398 LZMA_ra(7.40%), 957.8K }
| | | |--+ genecode_comprehensive_exonic_category { Str8 527398 LZMA_ra(0.79%), 4.3K }
| | | |--+ genecode_comprehensive_exonic_info { Str8 527398 LZMA_ra(5.56%), 41.5K }
| | | |--+ genehancer { Str8 527398 LZMA_ra(1.33%), 125.9K }
| | | |--+ linsight { Float64 527398 LZMA_ra(20.5%), 844.7K }
| | | |--+ cadd_phred { Float64 527398 LZMA_ra(23.3%), 960.0K }
| | | --+ rdhs { Str8 527398 LZMA_ra(7.16%), 118.3K }
| | --+ QC_label { Int32,factor 527398 LZMA_ra(0.02%), 465B } *
| --+ format [ ]
| |--+ AD [ ] *
| | --+ data { VL_Int 100x1054796 LZMA_ra(40.4%), 40.7M } *
| |--+ DP [ ] *
| | --+ data { VL_Int 100x527398 LZMA_ra(56.6%), 28.7M } *
| |--+ GQ [ ] *
| | --+ data { VL_Int 100x527398 LZMA_ra(20.2%), 19.2M } *
| |--+ PGT [ ] *
| | --+ data { Str8 100x147456 LZMA_ra(3.77%), 732.5K } *
| |--+ PID [ ] *
| | --+ data { Str8 100x147456 LZMA_ra(3.91%), 1.4M } *
| --+ PL [ ] *
| --+ data { VL_Int 100x1580553 LZMA_ra(46.7%), 115.8M } *
--+ sample.annotation [ ]

(3) As for missing values (NAs), I did find some NAs in the aGDS file, e.g., I checked the field of FunctionalAnnotation as follows:

library(SeqArray)
agds_dir <- get(load("/mnt/work/research/geyx/STAARpipeline/FAVOR/sscpv2_test_cx/QC/burden/agds_dir.Rdata"))
gds.path <- agds_dir[13]
genofile <- seqOpen(gds.path)
head(seqGetData(genofile, "annotation/info/FunctionalAnnotation/apc_protein_function"))

The result is:

[1] NA NA 2.969487 2.969487 NA 2.969487

Is this key to the problem? But I thought it may be normal for some variants to have no annotation in certain fields.

BTW, My QC_label is defined as default in annotation/filter so I did not create a new channel.

Thanks again!
Shea

from staarpipeline-tutorial.

xihaoli commented on June 26, 2024

Hi Shea,

Thanks for your input. First of all, for your questions:

(1) The dimension of genotypes and each annotation seems not to be aligned in format field? (i.e. AD/DP/GQ/PGT/PID/PL)
This doesn't matter, as these fields would not be used as part of STAARpipeline.

(3) As for missing values (NAs), I did find some NAs in the aGDS file
These missing values come from indels, as long as you specify variant_type <- "SNV" in STAARpipeline_Dynamic_Window.r, there shouldn't be any problem.

Now, we have identified the issue of

      -logp chr start_pos end_pos  GWER SNV_num
 [1,]     0   1         0       0 5e-04       1

If you are following STAARpipeline-tutorial, you can see that the number of output files should be the summation of the column "scang_num" for the object in jobs_num.Rdata. For example, in our case, we submitted a total of 1879 jobs instead of 22 jobs. As such, your sscpv2_13.Rdata corresponds to the 13th chunk of chromosome 1 but not the results for chromosome 13, and that is why the chr column in the results is 1 but not 13.

As a follow-up question, did you convert your STAAR null model to SCANG-STAAR null model before running dynamic window analysis using SCANG-STAAR?

Hope this helps.

Best,
Xihao

from staarpipeline-tutorial.

SheaCheng2000 commented on June 26, 2024

Hi Xihao,

That's right! Sorry that I mistook the job number. I am re-running now. )

BTW, for the function Dynamic_Window_SCANG(), if there is no significant result, will the "start_loc" and "end_loc" in the output be 0? I am not sure if it's the default setting of the function itself or if I messed up other things.

For your follow-up question, I converted the STAAR null model to SCANG-STAAR null model before running dynamic window analysis. )

Thanks!
Shea

from staarpipeline-tutorial.

xihaoli commented on June 26, 2024

Hi Shea,

BTW, for the function Dynamic_Window_SCANG(), if there is no significant result, will the "start_loc" and "end_loc" in the output be 0? I am not sure if it's the default setting of the function itself or if I messed up other things.

Yes, that is correct. In this case, 0 is used for placeholder.

Best,
Xihao

from staarpipeline-tutorial.

Question about the results of Dynamic_Window_SCANG() and Dynamic_Window_Results_Summary() about staarpipeline-tutorial HOT 7 CLOSED

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent