Giter VIP home page Giter VIP logo

Comments (16)

samreenzafer avatar samreenzafer commented on June 23, 2024

I would like to add that my WGS VCFs did not have the AVGDP annotation, so I added that manually by copying the DP values for every variant as follows for every chromosome.

`
#open GDS
gds.path <- paste0(dir_geno,gds_file_name_1,chr,gds_file_name_2)
genofile <- seqOpen(gds.path, readonly = FALSE)

avgdp<- seqGetData(genofile , "annotation/info/DP")
print( length(avgdp)) # for chr22 [1] 730895
seqAddValue(genofile, "annotation/info/AVGDP", avgdp ,"dummy sequencing depth added by samreen")
seqClose(genofile)

`

from staarpipeline-tutorial.

xihaoli avatar xihaoli commented on June 23, 2024

Hi @samreenzafer,

Thanks for reaching out. It's OK that your aGDS files do not have a specific annotation named AVGDP (average DP), and what you have done makes sense. Generally, we would want to use a set of high-quality genetic variants to compute ancestry PCs and sparse GRM for null model fitting, and AVGDP is one of the filtering criteria.

For your question, I wonder whether the sample.id from your phenotype data can be matched with the sample.id stored in your aGDS file. It is also OK if your phenotype sample ids are a subset of the sample ids in the aGDS files. Could you please double check?

Best,
Xihao

from staarpipeline-tutorial.

samreenzafer avatar samreenzafer commented on June 23, 2024

Thank you for responding.

for the sampleID question , yes they are overlapping.

phenotype.id <- obj_nullmodel$id_include
genotype.id<- seqGetData(genofile, "sample.id")
print(length(phenotype.id))
[1] 1004
print(length(seqGetData(genofile, "sample.id")))
[1] 1026
length(intersect( phenotype.id,genotype.id))
[1] 1004

from staarpipeline-tutorial.

xihaoli avatar xihaoli commented on June 23, 2024

It seems like your outcome variable is binary, and if so, you should use family=binomial(link="logit") instead of family=gaussian(link="identity") and make sure that your outcome is coded as numeric 0/1 instead of a factor variable. Also, while it is true that "groups" need only be used if your outcome is a continuous variable, you could still consider adding your "group" variable as a fixed effect term in your logistic mixed model.

Hope this helps.

from staarpipeline-tutorial.

samreenzafer avatar samreenzafer commented on June 23, 2024

For group, do you mean I should use my "group_race" variable from the phenotype file for "groups"

I will try the family=binomial(link="logit") and let you know.
And yes my outcome variable was coded as 1/2 instead of 0/1. I will also change that.

I hadn't seen any instructions on the tutorial page to format the phenotype file, and had just assumed the plink-like notation. That was my bad.

from staarpipeline-tutorial.

samreenzafer avatar samreenzafer commented on June 23, 2024

I got this error while using the suggested logit

Error in glmmkin(fixed = fixed, data = data, kins = kins, id = id, random.slope = random.slope, :
Error: heteroscedastic linear mixed models are only applicable when "family" is gaussian.
Calls: fit_nullmodel -> glmmkin
Execution halted

obj_nullmodel <- fit_nullmodel(pheno~SNPSEX+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+as.factor(group_race), data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID", groups="group_race",family=binomial(link="logit"),verbose=TRUE)

from staarpipeline-tutorial.

xihaoli avatar xihaoli commented on June 23, 2024

Hi @samreenzafer, for binary traits, please leave groups parameter as NULL.

from staarpipeline-tutorial.

samreenzafer avatar samreenzafer commented on June 23, 2024

I'm not sure what I'm doing wrong but I get errors.

obj_nullmodel <- fit_nullmodel(pheno~SNPSEX+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+as.factor(group_race), data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID", family=binomial(link="logit"),verbose=TRUE, groups=NULL)
it ran for 3 iterations and then gave the error
Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'chol2inv': not a positive definite matrix Calls: fit_nullmodel ... chol2inv -> chol -> chol -> .local -> as -> asMethod Execution halted

Then I edited the command to remove the as.factor(group_race)

obj_nullmodel <- fit_nullmodel(pheno~as.factor(SNPSEX)+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10 ,
                               data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID",
                                family=binomial(link="logit"),verbose=TRUE, groups=NULL)

And I still get an error, but after 500 iterations.


[1] "kins is a sparse matrix."
Fixed-effect coefficients:
(Intercept)      SNPSEX         PC1         PC2         PC3         PC4 
 16.0408287 -18.3676970 -27.0664801   9.4036154  -5.5903104   3.7215324 
        PC5         PC6         PC7         PC8         PC9        PC10 
  6.9741816  20.4901888  -6.8478820  -2.8771294   0.6520779   1.4433908 

Iteration  1 :
Variance component estimates:
[1] 1.000000 7.234874
Fixed-effect coefficients:
 [1]  17.1054427 -19.5066252 -30.2674194   9.8406604  -6.9357156   4.8385081
 [7]   8.1287788  24.0488017 -12.4647225  -3.3129108  -0.1664124   2.1897960
...
...
...
...

Iteration  499 :
Variance component estimates:
[1] 1.000000 1.040721
Fixed-effect coefficients:
 [1]  12.9362534 -15.2646269 -27.1436580   9.3835255  -5.6920925   3.8344001
 [7]   7.0144809  20.7832368  -7.6375962  -2.8396810   0.5219339   1.5484195

Iteration  500 :
Variance component estimates:
[1] 1.0000000 0.5132395
Fixed-effect coefficients:
 [1]  13.9159314 -16.2561242 -27.1981325   9.4333121  -5.6752385   3.8187186
 [7]   7.0241264  20.8224936  -7.4864314  -2.8504011   0.5512372   1.5232672
Fixed-effect coefficients:
(Intercept)      SNPSEX         PC1         PC2         PC3         PC4 
 16.0408287 -18.3676970 -27.0664801   9.4036154  -5.5903104   3.7215324 
        PC5         PC6         PC7         PC8         PC9        PC10 
  6.9741816  20.4901888  -6.8478820  -2.8771294   0.6520779   1.4433908 

Iteration  1 :
Error: Not a matrix.
In addition: Warning message:
Average Information REML not converged, refitting model using Brent method... 
Execution halted


I don't understand how the program runs, so not sure what is it that is " not a matrix " here.

from staarpipeline-tutorial.

samreenzafer avatar samreenzafer commented on June 23, 2024

Is it even okay for me to have all 3 of my ethnic group of samples analysed together (the creating of the SGRM, fitting null model and all the variant analyses)? Or should this pipeline be run separately for the 3 ethnic groups?

from staarpipeline-tutorial.

samreenzafer avatar samreenzafer commented on June 23, 2024

And here is my GRM object


> grm<- get(load("chrall.degree2.sparseGRM.sGRM.RData"))
> str(grm)
Formal class 'dsCMatrix' [package "Matrix"] with 7 slots
  ..@ i       : int [1:1027] 0 0 1 2 3 4 5 6 7 8 ...
  ..@ p       : int [1:1027] 0 1 3 4 5 6 7 8 9 10 ...
  ..@ Dim     : int [1:2] 1026 1026
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:1026] "0_NA19331" "0_NA19334" "0_BKP000674" "0_BKP000684" ...
  .. ..$ : chr [1:1026] "0_NA19331" "0_NA19334" "0_BKP000674" "0_BKP000684" ...
  ..@ x       : num [1:1027] 0.5 0.23 0.494 0.499 0.497 ...
  ..@ uplo    : chr "U"
  ..@ factors : list()


> min(grm@x)
[1] 0.2297566
> max(grm@x)
[1] 0.5668696


Which I had created using the following command.

Rscript ../extdata/runPipeline_wrapper.R --prefix.in chrall_pruned --prefix.in.unfiltered chrall --prefix.out chrall.degree2.sparseGRM --degree 2 --num_threads 8 --no_pcs 20 --KINGformat.out TRUE

outputting
chrall.degree2.sparseGRM.kins chrall.degree2.sparseGRM.score chrall.degree2.sparseGRM.sGRM.RData

from staarpipeline-tutorial.

samreenzafer avatar samreenzafer commented on June 23, 2024

Hi @xihaoli

I tried a few things, and realized that the "SNPSEX" in the phenotype is causing the problems for me.

I refitted the null object without using SNPSEX as a covariate (also did NOT use group_race)

obj_nullmodel <- fit_nullmodel(pheno~ PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10 ,
                               data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID",
                                family=binomial(link="logit"),verbose=TRUE, groups=NULL)

The code ran to completion in only 16 iterations (as opposed to 500 iterations when I used SNPSEX which had given me the "not a matrix" error).

Then I tried one randome Individual_Analysis on arrayid=130
Rscript STAARpipeline_Individual_Analysis_newSampleIDs.noSexnoRace.R 130

which gave the output log

[1] 1004
[1] 1026
[1] "IDs of phenotype: "
[1] "0_NA20821" "0_NA20822" "0_NA20826" "0_NA20827" "0_NA20828" "0_NA20832"
[1] "IDs of genotype: "
[1] "0_NA20821" "0_NA20822" "0_NA20826" "0_NA20827" "0_NA20828" "0_NA20832"
[1] 1004
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 2,290
# of selected samples: 1,026
# of selected variants: 2,941,749
Time difference of 14.57882 secs

and output of


> df<- get(load("analysis/Individual_Analysis.noSexnoRace_130.Rdata"))
> head(df)
  CHR      POS    REF ALT     ALT_AF        MAF    N    pvalue pvalue_log10
1   7 20022978      G   A 0.08850000 0.08850000 1004 0.1671100   0.77699757
2   7 20023277 AAAAAC   A 0.18661972 0.18661972 1004 0.7529397   0.12323983
3   7 20023350      G   A 0.08133733 0.08133733 1004 0.7110876   0.14807688
4   7 20026620      T   C 0.05700000 0.05700000 1004 0.7886436   0.10311921
5   7 20026886     CT   C 0.19687815 0.19687815 1004 0.7687004   0.11424288
6   7 20027095      T   C 0.08941059 0.08941059 1004 0.9202903   0.03607516
       Score Score_se         Est    Est_se
1  3.2924486 2.383156  0.57971434 0.4196116
2 -1.4022395 4.454869 -0.07065652 0.2244735
3 -1.1257320 3.039268 -0.12187006 0.3290266
4  0.8054780 3.004703  0.08921759 0.3328116
5 -1.3251899 4.506296 -0.06525875 0.2219118
6  0.3447801 3.445455  0.02904349 0.2902374

which now has numeric Pvalues in the PValue column. I think I can now run for all the entire job array.

Question: Why do you think SNPSEX could be causing a problem?
This is what my phenotype file is coded as

FID	SNPSEX	pheno	group_race	PC1	PC2	...PC10
0_NA20814	1	0	EUR	-0.0233194936208413	0.0177841478492174
0_NA20815	1	0	EUR	-0.0232876963243618	0.0178482237390411
0_NA20818	2	0	EUR	-0.023596640369937	0.0174856248889635
0_NA20819	2	0	EUR	-0.0235735234473422	0.016707921936772
0_NA20821	2	0	EUR	-0.0233292853163099	0.0167566107200051
0_NA20822	2	0	EUR	-0.0233028712187108	0.0174146357365599
0_NA20826	2	0	EUR	-0.0235409212779821	0.0179969624535894
0_NA20827	1	0	EUR	-0.0234523001514743	0.0176404968800983
0_NA20828	2	0	EUR	-0.0233579084370098	0.0173618077100926
0_NA20832	2	0	EUR	-0.0232000420926164	0.0169872294492342

where SNPSEX is 1 (for male) or 2 (for female),
pheno is 0 (control) or 1 (case) - I made this change after you corrected my error.

from staarpipeline-tutorial.

xihaoli avatar xihaoli commented on June 23, 2024

Hi @samreenzafer,

Thanks for following up. First of all, your output log of arrayid=130 looks good to me, so you may proceed with running other arrayid's and summarize/visualize all results.

For your question, yes it is possible that the non-convergence issue is on SNPSEX due to the collinearity between SNPSEX and intercept or some PCs. A similar issue is here. Could you please check?

Lastly, even if you end up not including SNPSEX, you may also consider adding as.factor(group_race) to the fixed effect formula when fitting the null model.

from staarpipeline-tutorial.

samreenzafer avatar samreenzafer commented on June 23, 2024

That makes sense. The cases in our cohort are all male, whereas controls have males and females (only for the European cohort). You can see the tables below.
Also we have small number of cases in the Afr and Hispanic cohorts as opposed to the European cohort.
So I think adding as.factor(group_race) to the fixed effect formula when fitting the null model , and Excluding SNPSEX makes most sense.


pheno == 0. (i.e. controls)   
                        AFR EUR HISP
  Num males    321 213  169
  Num females    0 212    0

pheno == 1 (i.e. cases)   
                      AFR EUR HISP
  Num males  12  61   16
  Num males   0   0    0

from staarpipeline-tutorial.

xihaoli avatar xihaoli commented on June 23, 2024

Hi @samreenzafer,

Thanks for checking. It all makes sense now. Here you may not be able to include SNPSEX in the null model fitting, because knowing a sample with SNPSEX=2 guarantees that pheno=0.

Also we have small number of cases in the Afr and Hispanic cohorts as opposed to the European cohort.
So I think adding as.factor(group_race) to the fixed effect formula when fitting the null model , and Excluding SNPSEX makes most sense.

This sounds good.

from staarpipeline-tutorial.

samreenzafer avatar samreenzafer commented on June 23, 2024

from staarpipeline-tutorial.

xihaoli avatar xihaoli commented on June 23, 2024

Hi Samreen,

This is great to hear. You may close this case for now, and if you have any questions, please feel free to reopen it.

Best,
Xihao

from staarpipeline-tutorial.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.