Comments (7)
These results indicate that there might be some bugs in your analysis. One possible reason might be the small sample sizes. Another possible reason might be some errors in your GDS file. As mentioned in the STAARpipeline manuscript, the dynamic window analysis is designed to detect associations in the noncoding genome, especially for the intergenic region. Therefore, we recommend the users run gene-centric coding analysis for WES data. If your sample size is small, an alternative approach to gene-centric coding analysis is the gene-based analysis based on the genes' position. We could achieve it through the sliding window analysis code in STAARpipeline. We will close the comment if you have no more questions.
from staarpipeline-tutorial.
Hi,
Thanks for your reply!
The data are all WGS. I agree that the 100 sample size is still very small. And I want to ask how can I check whether the GDS files are good or not?
from staarpipeline-tutorial.
Hi Shea,
Checking the validity of an aGDS file include (but not limited to) (1) whether the dimension of genotypes and each annotation is aligned; (2) whether you have handled multi-allelic variants; (3) whether there are any missing values (NAs) in any of the field for any of the variants in the aGDS file; (4) whether the chromosome field is numeric (e.g. 13) rather than character ("chr13"), etc.
In addition, how is your QC_label
defined in your aGDS file? If all of the variants in your aGDS file are assumed to be PASS variants (after QC), did you follow the section in the STAARpipeline-Tutorial to create a new channel in the aGDS file by setting the value of all variants as "PASS"?
Hope this helps.
Best,
Xihao
from staarpipeline-tutorial.
Hi Xihao,
Thanks for your helpful reply and sorry again for bothering you.
I have checked aGDS file,
(2) I have handled multi-allelic variants in the QC step ✅
(4) The chromosome field is numeric ✅
However, (1)(3) seem to have some problems:
(1) the dimension of genotypes and each annotation seems not to be aligned in format
field? (i.e. AD/DP/GQ/PGT/PID/PL)
File: /mnt/work/research/geyx/STAARpipeline/FAVOR/sscpv2_test_cx/QC/STAARtest_sscpv2.split.hard.AB.VQSR.QDSOR.samFlt.renew.chr13.vcf.gz.gds (248.4M)
- [ ] *
|--+ description [ ] *
|--+ sample.id { Str8 100 LZMA_ra(18.6%), 397B } *
|--+ variant.id { Int32 527398 LZMA_ra(6.26%), 129.1K } *
|--+ position { Int32 527398 LZMA_ra(40.5%), 833.4K } *
|--+ chromosome { Str8 527398 LZMA_ra(0.02%), 381B } *
|--+ allele { Str8 527398 LZMA_ra(17.9%), 466.0K } *
|--+ genotype [ ] *
| |--+ data { Bit2 2x100x527398 LZMA_ra(9.11%), 2.3M } *
| |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
| --+ extra { Int16 0 LZMA_ra, 18B }
|--+ phase [ ]
| |--+ data { Bit1 100x527398 LZMA_ra(0.08%), 5.1K } *
| |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
| --+ extra { Bit1 0 LZMA_ra, 18B }
|--+ annotation [ ]
| |--+ id { Str8 527398 LZMA_ra(33.6%), 1.8M } *
| |--+ qual { Float32 527398 LZMA_ra(69.7%), 1.4M } *
| |--+ filter { Int32,factor 527398 LZMA_ra(0.02%), 465B } *
| |--+ info [ ]
| | |--+ BaseQRankSum { Float32 527398 LZMA_ra(46.0%), 948.3K } *
| | |--+ ClippingRankSum { Float32 527398 LZMA_ra(1.86%), 38.4K } *
| | |--+ DB { Bit1 527398 LZMA_ra(35.5%), 22.9K } *
| | |--+ DP { Int32 527398 LZMA_ra(33.6%), 692.1K } *
| | |--+ END { Int32 527398 LZMA_ra(0.02%), 465B } *
| | |--+ ExcessHet { Float32 527398 LZMA_ra(22.5%), 464.0K } *
| | |--+ FS { Float32 527398 LZMA_ra(33.9%), 699.1K } *
| | |--+ InbreedingCoeff { Float32 527398 LZMA_ra(34.5%), 710.6K } *
| | |--+ MLEAC { Int32 527398 LZMA_ra(16.6%), 342.5K } *
| | |--+ MLEAF { Float32 527398 LZMA_ra(20.1%), 414.2K } *
| | |--+ MQ { Float32 527398 LZMA_ra(6.75%), 139.1K } *
| | |--+ MQRankSum { Float32 527398 LZMA_ra(1.99%), 41.0K } *
| | |--+ NEGATIVE_TRAIN_SITE { Bit1 527398 LZMA_ra(2.70%), 1.7K } *
| | |--+ POSITIVE_TRAIN_SITE { Bit1 527398 LZMA_ra(94.8%), 61.0K } *
| | |--+ QD { Float32 527398 LZMA_ra(39.8%), 820.3K } *
| | |--+ ReadPosRankSum { Float32 527398 LZMA_ra(40.9%), 842.9K } *
| | |--+ SOR { Float32 527398 LZMA_ra(35.5%), 731.4K } *
| | |--+ VQSLOD { Float32 527398 LZMA_ra(34.9%), 718.4K } *
| | |--+ culprit { Str8 527398 LZMA_ra(0.88%), 44.0K } *
| | |--+ AC { Int32 527398 LZMA_ra(16.9%), 348.2K } *
| | |--+ AF { Float32 527398 LZMA_ra(24.5%), 504.1K } *
| | |--+ NS { Int32 527398 LZMA_ra(0.02%), 465B } *
| | |--+ AN { Int32 527398 LZMA_ra(5.76%), 118.7K } *
| | |--+ FunctionalAnnotation [ spec_tbl_df,tbl_df,tbl,data.frame,list ] *
| | | |--+ VarInfo { Str8 527398 LZMA_ra(18.9%), 1.6M }
| | | |--+ apc_conservation { Float64 527398 LZMA_ra(78.5%), 3.2M }
| | | |--+ apc_epigenetics { Float64 527398 LZMA_ra(78.8%), 3.2M }
| | | |--+ apc_epigenetics_active { Float64 527398 LZMA_ra(77.5%), 3.1M }
| | | |--+ apc_epigenetics_repressed { Float64 527398 LZMA_ra(52.3%), 2.1M }
| | | |--+ apc_epigenetics_transcription { Float64 527398 LZMA_ra(43.8%), 1.8M }
| | | |--+ apc_local_nucleotide_diversity { Float64 527398 LZMA_ra(78.4%), 3.2M }
| | | |--+ apc_mappability { Float64 527398 LZMA_ra(24.9%), 1.0M }
| | | |--+ apc_protein_function { Float64 527398 LZMA_ra(1.84%), 75.8K }
| | | |--+ apc_transcription_factor { Float64 527398 LZMA_ra(7.30%), 300.9K }
| | | |--+ cage_tc { Str8 527398 LZMA_ra(5.04%), 32.3K }
| | | |--+ metasvm_pred { Str8 527398 LZMA_ra(0.46%), 2.4K }
| | | |--+ rsid { Str8 527398 LZMA_ra(34.7%), 1.7M }
| | | |--+ fathmm_xf { Float64 527398 LZMA_ra(51.0%), 2.1M }
| | | |--+ genecode_comprehensive_category { Str8 527398 LZMA_ra(0.95%), 51.9K }
| | | |--+ genecode_comprehensive_info { Str8 527398 LZMA_ra(7.40%), 957.8K }
| | | |--+ genecode_comprehensive_exonic_category { Str8 527398 LZMA_ra(0.79%), 4.3K }
| | | |--+ genecode_comprehensive_exonic_info { Str8 527398 LZMA_ra(5.56%), 41.5K }
| | | |--+ genehancer { Str8 527398 LZMA_ra(1.33%), 125.9K }
| | | |--+ linsight { Float64 527398 LZMA_ra(20.5%), 844.7K }
| | | |--+ cadd_phred { Float64 527398 LZMA_ra(23.3%), 960.0K }
| | | --+ rdhs { Str8 527398 LZMA_ra(7.16%), 118.3K }
| | --+ QC_label { Int32,factor 527398 LZMA_ra(0.02%), 465B } *
| --+ format [ ]
| |--+ AD [ ] *
| | --+ data { VL_Int 100x1054796 LZMA_ra(40.4%), 40.7M } *
| |--+ DP [ ] *
| | --+ data { VL_Int 100x527398 LZMA_ra(56.6%), 28.7M } *
| |--+ GQ [ ] *
| | --+ data { VL_Int 100x527398 LZMA_ra(20.2%), 19.2M } *
| |--+ PGT [ ] *
| | --+ data { Str8 100x147456 LZMA_ra(3.77%), 732.5K } *
| |--+ PID [ ] *
| | --+ data { Str8 100x147456 LZMA_ra(3.91%), 1.4M } *
| --+ PL [ ] *
| --+ data { VL_Int 100x1580553 LZMA_ra(46.7%), 115.8M } *
--+ sample.annotation [ ]
(3) As for missing values (NAs), I did find some NAs in the aGDS file, e.g., I checked the field of FunctionalAnnotation as follows:
library(SeqArray)
agds_dir <- get(load("/mnt/work/research/geyx/STAARpipeline/FAVOR/sscpv2_test_cx/QC/burden/agds_dir.Rdata"))
gds.path <- agds_dir[13]
genofile <- seqOpen(gds.path)
head(seqGetData(genofile, "annotation/info/FunctionalAnnotation/apc_protein_function"))
The result is:
[1] NA NA 2.969487 2.969487 NA 2.969487
Is this key to the problem? But I thought it may be normal for some variants to have no annotation in certain fields.
BTW, My QC_label
is defined as default in annotation/filter
so I did not create a new channel.
Thanks again!
Shea
from staarpipeline-tutorial.
Hi Shea,
Thanks for your input. First of all, for your questions:
(1) The dimension of genotypes and each annotation seems not to be aligned in format field? (i.e. AD/DP/GQ/PGT/PID/PL)
This doesn't matter, as these fields would not be used as part of STAARpipeline.
(3) As for missing values (NAs), I did find some NAs in the aGDS file
These missing values come from indels, as long as you specify variant_type <- "SNV"
in STAARpipeline_Dynamic_Window.r, there shouldn't be any problem.
Now, we have identified the issue of
-logp chr start_pos end_pos GWER SNV_num
[1,] 0 1 0 0 5e-04 1
If you are following STAARpipeline-tutorial, you can see that the number of output files should be the summation of the column "scang_num" for the object in jobs_num.Rdata
. For example, in our case, we submitted a total of 1879 jobs instead of 22 jobs. As such, your sscpv2_13.Rdata
corresponds to the 13th chunk of chromosome 1 but not the results for chromosome 13, and that is why the chr
column in the results is 1
but not 13
.
As a follow-up question, did you convert your STAAR null model to SCANG-STAAR null model before running dynamic window analysis using SCANG-STAAR?
Hope this helps.
Best,
Xihao
from staarpipeline-tutorial.
Hi Xihao,
That's right! Sorry that I mistook the job number. I am re-running now. )
BTW, for the function Dynamic_Window_SCANG(), if there is no significant result, will the "start_loc" and "end_loc" in the output be 0? I am not sure if it's the default setting of the function itself or if I messed up other things.
For your follow-up question, I converted the STAAR null model to SCANG-STAAR null model before running dynamic window analysis. )
Thanks!
Shea
from staarpipeline-tutorial.
Hi Shea,
BTW, for the function Dynamic_Window_SCANG(), if there is no significant result, will the "start_loc" and "end_loc" in the output be 0? I am not sure if it's the default setting of the function itself or if I messed up other things.
Yes, that is correct. In this case, 0 is used for placeholder.
Best,
Xihao
from staarpipeline-tutorial.
Related Issues (20)
- Gene Centric Coding Annotation fails HOT 2
- Error in Fit STAAR null model step: asMethod(object) : matrix is not symmetric [1,2] HOT 6
- error in running "STAARpipeline_Null_Model.r" HOT 7
- Error in seqGetData: Invalid dimension 'Start' and 'Length' HOT 5
- Null outputs from Sliding_Window() HOT 2
- Known_Loci_Pruning error HOT 2
- vcf to gds HOT 7
- #SNVs per case/control? HOT 1
- fit_nullmodel Output is mostly Null and 0 HOT 16
- Fitting NULL model for binary outcomes HOT 5
- Error in Gene Centric Analysis HOT 1
- Error in results_plof_genome[, "cMAC"] : subscript out of bounds HOT 2
- Followup Question to Issue #28 HOT 2
- STAARpipeline_Gene_Centric_Noncoding HOT 2
- Dynamic Window dim(X) error HOT 2
- Can't annotate individual variant results HOT 2
- [Suggestion-Implementation] Add information to summary and annotations of results
- Conditional analysis - Summary Gene Centric Noncoding not running to completion HOT 6
- Ukbiobank Agds files generation HOT 16
- Plots for gene centric ncRNA regions
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from staarpipeline-tutorial.