Giter VIP home page Giter VIP logo

Comments (7)

zilinli1988 avatar zilinli1988 commented on June 26, 2024

These results indicate that there might be some bugs in your analysis. One possible reason might be the small sample sizes. Another possible reason might be some errors in your GDS file. As mentioned in the STAARpipeline manuscript, the dynamic window analysis is designed to detect associations in the noncoding genome, especially for the intergenic region. Therefore, we recommend the users run gene-centric coding analysis for WES data. If your sample size is small, an alternative approach to gene-centric coding analysis is the gene-based analysis based on the genes' position. We could achieve it through the sliding window analysis code in STAARpipeline. We will close the comment if you have no more questions.

from staarpipeline-tutorial.

SheaCheng2000 avatar SheaCheng2000 commented on June 26, 2024

Hi,

Thanks for your reply!
The data are all WGS. I agree that the 100 sample size is still very small. And I want to ask how can I check whether the GDS files are good or not?

from staarpipeline-tutorial.

xihaoli avatar xihaoli commented on June 26, 2024

Hi Shea,

Checking the validity of an aGDS file include (but not limited to) (1) whether the dimension of genotypes and each annotation is aligned; (2) whether you have handled multi-allelic variants; (3) whether there are any missing values (NAs) in any of the field for any of the variants in the aGDS file; (4) whether the chromosome field is numeric (e.g. 13) rather than character ("chr13"), etc.

In addition, how is your QC_label defined in your aGDS file? If all of the variants in your aGDS file are assumed to be PASS variants (after QC), did you follow the section in the STAARpipeline-Tutorial to create a new channel in the aGDS file by setting the value of all variants as "PASS"?

Hope this helps.

Best,
Xihao

from staarpipeline-tutorial.

SheaCheng2000 avatar SheaCheng2000 commented on June 26, 2024

Hi Xihao,

Thanks for your helpful reply and sorry again for bothering you.

I have checked aGDS file,
(2) I have handled multi-allelic variants in the QC step ✅
(4) The chromosome field is numeric ✅

However, (1)(3) seem to have some problems:
(1) the dimension of genotypes and each annotation seems not to be aligned in format field? (i.e. AD/DP/GQ/PGT/PID/PL)

File: /mnt/work/research/geyx/STAARpipeline/FAVOR/sscpv2_test_cx/QC/STAARtest_sscpv2.split.hard.AB.VQSR.QDSOR.samFlt.renew.chr13.vcf.gz.gds (248.4M)

  • [ ] *
    |--+ description [ ] *
    |--+ sample.id { Str8 100 LZMA_ra(18.6%), 397B } *
    |--+ variant.id { Int32 527398 LZMA_ra(6.26%), 129.1K } *
    |--+ position { Int32 527398 LZMA_ra(40.5%), 833.4K } *
    |--+ chromosome { Str8 527398 LZMA_ra(0.02%), 381B } *
    |--+ allele { Str8 527398 LZMA_ra(17.9%), 466.0K } *
    |--+ genotype [ ] *
    | |--+ data { Bit2 2x100x527398 LZMA_ra(9.11%), 2.3M } *
    | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
    | --+ extra { Int16 0 LZMA_ra, 18B }
    |--+ phase [ ]
    | |--+ data { Bit1 100x527398 LZMA_ra(0.08%), 5.1K } *
    | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
    | --+ extra { Bit1 0 LZMA_ra, 18B }
    |--+ annotation [ ]
    | |--+ id { Str8 527398 LZMA_ra(33.6%), 1.8M } *
    | |--+ qual { Float32 527398 LZMA_ra(69.7%), 1.4M } *
    | |--+ filter { Int32,factor 527398 LZMA_ra(0.02%), 465B } *
    | |--+ info [ ]
    | | |--+ BaseQRankSum { Float32 527398 LZMA_ra(46.0%), 948.3K } *
    | | |--+ ClippingRankSum { Float32 527398 LZMA_ra(1.86%), 38.4K } *
    | | |--+ DB { Bit1 527398 LZMA_ra(35.5%), 22.9K } *
    | | |--+ DP { Int32 527398 LZMA_ra(33.6%), 692.1K } *
    | | |--+ END { Int32 527398 LZMA_ra(0.02%), 465B } *
    | | |--+ ExcessHet { Float32 527398 LZMA_ra(22.5%), 464.0K } *
    | | |--+ FS { Float32 527398 LZMA_ra(33.9%), 699.1K } *
    | | |--+ InbreedingCoeff { Float32 527398 LZMA_ra(34.5%), 710.6K } *
    | | |--+ MLEAC { Int32 527398 LZMA_ra(16.6%), 342.5K } *
    | | |--+ MLEAF { Float32 527398 LZMA_ra(20.1%), 414.2K } *
    | | |--+ MQ { Float32 527398 LZMA_ra(6.75%), 139.1K } *
    | | |--+ MQRankSum { Float32 527398 LZMA_ra(1.99%), 41.0K } *
    | | |--+ NEGATIVE_TRAIN_SITE { Bit1 527398 LZMA_ra(2.70%), 1.7K } *
    | | |--+ POSITIVE_TRAIN_SITE { Bit1 527398 LZMA_ra(94.8%), 61.0K } *
    | | |--+ QD { Float32 527398 LZMA_ra(39.8%), 820.3K } *
    | | |--+ ReadPosRankSum { Float32 527398 LZMA_ra(40.9%), 842.9K } *
    | | |--+ SOR { Float32 527398 LZMA_ra(35.5%), 731.4K } *
    | | |--+ VQSLOD { Float32 527398 LZMA_ra(34.9%), 718.4K } *
    | | |--+ culprit { Str8 527398 LZMA_ra(0.88%), 44.0K } *
    | | |--+ AC { Int32 527398 LZMA_ra(16.9%), 348.2K } *
    | | |--+ AF { Float32 527398 LZMA_ra(24.5%), 504.1K } *
    | | |--+ NS { Int32 527398 LZMA_ra(0.02%), 465B } *
    | | |--+ AN { Int32 527398 LZMA_ra(5.76%), 118.7K } *
    | | |--+ FunctionalAnnotation [ spec_tbl_df,tbl_df,tbl,data.frame,list ] *
    | | | |--+ VarInfo { Str8 527398 LZMA_ra(18.9%), 1.6M }
    | | | |--+ apc_conservation { Float64 527398 LZMA_ra(78.5%), 3.2M }
    | | | |--+ apc_epigenetics { Float64 527398 LZMA_ra(78.8%), 3.2M }
    | | | |--+ apc_epigenetics_active { Float64 527398 LZMA_ra(77.5%), 3.1M }
    | | | |--+ apc_epigenetics_repressed { Float64 527398 LZMA_ra(52.3%), 2.1M }
    | | | |--+ apc_epigenetics_transcription { Float64 527398 LZMA_ra(43.8%), 1.8M }
    | | | |--+ apc_local_nucleotide_diversity { Float64 527398 LZMA_ra(78.4%), 3.2M }
    | | | |--+ apc_mappability { Float64 527398 LZMA_ra(24.9%), 1.0M }
    | | | |--+ apc_protein_function { Float64 527398 LZMA_ra(1.84%), 75.8K }
    | | | |--+ apc_transcription_factor { Float64 527398 LZMA_ra(7.30%), 300.9K }
    | | | |--+ cage_tc { Str8 527398 LZMA_ra(5.04%), 32.3K }
    | | | |--+ metasvm_pred { Str8 527398 LZMA_ra(0.46%), 2.4K }
    | | | |--+ rsid { Str8 527398 LZMA_ra(34.7%), 1.7M }
    | | | |--+ fathmm_xf { Float64 527398 LZMA_ra(51.0%), 2.1M }
    | | | |--+ genecode_comprehensive_category { Str8 527398 LZMA_ra(0.95%), 51.9K }
    | | | |--+ genecode_comprehensive_info { Str8 527398 LZMA_ra(7.40%), 957.8K }
    | | | |--+ genecode_comprehensive_exonic_category { Str8 527398 LZMA_ra(0.79%), 4.3K }
    | | | |--+ genecode_comprehensive_exonic_info { Str8 527398 LZMA_ra(5.56%), 41.5K }
    | | | |--+ genehancer { Str8 527398 LZMA_ra(1.33%), 125.9K }
    | | | |--+ linsight { Float64 527398 LZMA_ra(20.5%), 844.7K }
    | | | |--+ cadd_phred { Float64 527398 LZMA_ra(23.3%), 960.0K }
    | | | --+ rdhs { Str8 527398 LZMA_ra(7.16%), 118.3K }
    | | --+ QC_label { Int32,factor 527398 LZMA_ra(0.02%), 465B } *
    | --+ format [ ]
    | |--+ AD [ ] *
    | | --+ data { VL_Int 100x1054796 LZMA_ra(40.4%), 40.7M } *
    | |--+ DP [ ] *
    | | --+ data { VL_Int 100x527398 LZMA_ra(56.6%), 28.7M } *
    | |--+ GQ [ ] *
    | | --+ data { VL_Int 100x527398 LZMA_ra(20.2%), 19.2M } *
    | |--+ PGT [ ] *
    | | --+ data { Str8 100x147456 LZMA_ra(3.77%), 732.5K } *
    | |--+ PID [ ] *
    | | --+ data { Str8 100x147456 LZMA_ra(3.91%), 1.4M } *
    | --+ PL [ ] *
    | --+ data { VL_Int 100x1580553 LZMA_ra(46.7%), 115.8M } *
    --+ sample.annotation [ ]

(3) As for missing values (NAs), I did find some NAs in the aGDS file, e.g., I checked the field of FunctionalAnnotation as follows:

library(SeqArray)
agds_dir <- get(load("/mnt/work/research/geyx/STAARpipeline/FAVOR/sscpv2_test_cx/QC/burden/agds_dir.Rdata"))
gds.path <- agds_dir[13]
genofile <- seqOpen(gds.path)
head(seqGetData(genofile, "annotation/info/FunctionalAnnotation/apc_protein_function"))

The result is:

[1] NA NA 2.969487 2.969487 NA 2.969487

Is this key to the problem? But I thought it may be normal for some variants to have no annotation in certain fields.

BTW, My QC_label is defined as default in annotation/filter so I did not create a new channel.

Thanks again!
Shea

from staarpipeline-tutorial.

xihaoli avatar xihaoli commented on June 26, 2024

Hi Shea,

Thanks for your input. First of all, for your questions:

(1) The dimension of genotypes and each annotation seems not to be aligned in format field? (i.e. AD/DP/GQ/PGT/PID/PL)
This doesn't matter, as these fields would not be used as part of STAARpipeline.

(3) As for missing values (NAs), I did find some NAs in the aGDS file
These missing values come from indels, as long as you specify variant_type <- "SNV" in STAARpipeline_Dynamic_Window.r, there shouldn't be any problem.

Now, we have identified the issue of

      -logp chr start_pos end_pos  GWER SNV_num
 [1,]     0   1         0       0 5e-04       1

If you are following STAARpipeline-tutorial, you can see that the number of output files should be the summation of the column "scang_num" for the object in jobs_num.Rdata. For example, in our case, we submitted a total of 1879 jobs instead of 22 jobs. As such, your sscpv2_13.Rdata corresponds to the 13th chunk of chromosome 1 but not the results for chromosome 13, and that is why the chr column in the results is 1 but not 13.

As a follow-up question, did you convert your STAAR null model to SCANG-STAAR null model before running dynamic window analysis using SCANG-STAAR?

Hope this helps.

Best,
Xihao

from staarpipeline-tutorial.

SheaCheng2000 avatar SheaCheng2000 commented on June 26, 2024

Hi Xihao,

That's right! Sorry that I mistook the job number. I am re-running now. )

BTW, for the function Dynamic_Window_SCANG(), if there is no significant result, will the "start_loc" and "end_loc" in the output be 0? I am not sure if it's the default setting of the function itself or if I messed up other things.

For your follow-up question, I converted the STAAR null model to SCANG-STAAR null model before running dynamic window analysis. )

Thanks!
Shea

from staarpipeline-tutorial.

xihaoli avatar xihaoli commented on June 26, 2024

Hi Shea,

BTW, for the function Dynamic_Window_SCANG(), if there is no significant result, will the "start_loc" and "end_loc" in the output be 0? I am not sure if it's the default setting of the function itself or if I messed up other things.

Yes, that is correct. In this case, 0 is used for placeholder.

Best,
Xihao

from staarpipeline-tutorial.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.