stefanedwards / siccuracy

Stefan's imputation accuracy package

R 19.30% Fortran 4.14% HTML 76.26% C 0.31%

genotype imputation imputation-accuracy

siccuracy's Issues

Make cbind SNP chip

New name for rowconcatenate: cbind_SNP.

Do not use cbind.SNP, as this will cause the S3 method dispatcher to call this function for any object of class 'SNP'.

Add deprecated descriptions with new names.
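The dispatch problem can be reproduced in a few lines. This is only an illustration: the 'SNP' object below is a made-up stand-in, and both function bodies are placeholders rather than the package's code.

```r
# Hypothetical object of class 'SNP', used only to illustrate the naming problem.
x <- structure(list(id = 1L), class = "SNP")

# A function named cbind.SNP is treated as the cbind() method for class 'SNP'.
cbind.SNP <- function(...) "cbind.SNP was dispatched"

# A name with an underscore is an ordinary function and is never dispatched to.
cbind_SNP <- function(...) "cbind_SNP was called explicitly"

dispatched <- cbind(x)     # silently routed to cbind.SNP
explicit <- cbind_SNP(x)   # only runs when called by name

print(dispatched)
print(explicit)
```

With the underscore name, calling cbind() on such an object falls back to the default behaviour instead of the file-concatenating function.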

Reading integers vs reals, and writing integers vs reals

I have noticed the following:

1: Using a format statement (Iw) does not work if the variable is a real.

2: Fortran throws a fit when trying to read a real-formatted number into an integer. No help there.

Int to int is easy, and real to real equally so. But what do we do about cross-overs, i.e. reading integers and outputting reals (easy), and reading reals and outputting integers?
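One possible policy for the hard cross-over (reals in, integers out) is to convert only when every value is a whole number within a tolerance, and to refuse otherwise. The sketch below expresses that idea in R rather than Fortran; the function name `reals_to_integers` and the `tol` argument are assumptions for illustration and are not part of the package.

```r
# Hypothetical helper: convert reals to integers only if nothing is lost.
reals_to_integers <- function(x, tol = 1e-8) {
  rounded <- round(x)
  # Flag non-missing values that are not whole numbers within the tolerance.
  not_whole <- !is.na(x) & abs(x - rounded) > tol
  if (any(not_whole)) {
    stop("non-integer values found; write these as reals instead")
  }
  as.integer(rounded)
}

whole <- reals_to_integers(c(0, 1, 2.0, NA))
# Gene dosages such as 1.5 are rejected rather than silently truncated.
dosage <- try(reals_to_integers(c(0, 1.5, 2)), silent = TRUE)
```

The same decision (reject, round, or fall back to real output) would have to be made on the Fortran side before choosing the format statement.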

Tests in test_convert_phases failed

Using commit 2e6503f6, the first two tests of test_convert_phase failed.

Check that adaptive imputation accuracy stores true matrix efficiently

Intuitively, the true matrix has samples as rows and SNPs as columns. Data is, however, retrieved as rows. Simple transposing should do the trick, but this needs to be checked in the calculations.

Add correct call % to imputation accuracy

imputation_accuracy should also count correctly called genotypes (column-wise, row-wise), and entries where the genotype is missing in true, in imputed, or false.
Allow a tolerance for the comparison (e.g. tol = 0.1), so that it also works with gene dosages.

Update return value of imputation_accuracy to have:

List of 2
 $ snps : data.frame(means, sds, cors, correct, true.na, imputed.na, both.na)
 $ samples : data.frame(rowID, cors, correct, true.na, imputed.na, both.na)
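A minimal in-memory sketch of the counting part is shown here. It works on matrices rather than files, leaves out the means, sds, cors and rowID columns, and assumes that `true.na` and `imputed.na` count every missing entry in the respective matrix (including those also counted in `both.na`). The function name `count_calls` is hypothetical.

```r
# Hypothetical helper: count correct calls and missing entries per SNP and per sample.
# Rows are samples, columns are SNPs.
count_calls <- function(true, imputed, tol = 0.1) {
  stopifnot(identical(dim(true), dim(imputed)))
  true_na <- is.na(true)
  imputed_na <- is.na(imputed)
  both_na <- true_na & imputed_na
  diff <- true - imputed
  # An entry is correct if both values are present and agree within tol.
  correct <- !is.na(diff) & abs(diff) <= tol

  summarise <- function(sums) {
    data.frame(
      correct    = sums(correct),
      true.na    = sums(true_na),
      imputed.na = sums(imputed_na),
      both.na    = sums(both_na)
    )
  }
  list(snps = summarise(colSums), samples = summarise(rowSums))
}

true <- matrix(c(0, 1, 2,
                 1, NA, 2), nrow = 2, byrow = TRUE)
imputed <- matrix(c(0.05, 1, 1,
                    NA, NA, 1.95), nrow = 2, byrow = TRUE)
res <- count_calls(true, imputed, tol = 0.1)
str(res)
```

Dividing `correct` by the number of entries that are present in both matrices would give the correct call percentage named in the issue title.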

Make convert_plinkA

Use plink --bfile <name stem> --recode A to recode a PLINK binary file to a text-formatted file, coding genotypes as 0, 1, and 2. Two issues exist:

The recoded file has a leading line with column names.
The recoded file has leading columns with family ID, sample ID, paternal ID, maternal ID, sex, and phenotype. Columns 3-6 need to be stripped. Columns 1-2 will need to be converted to an integer ID.

Options:

Give arguments converting familyID and sampleID into an integer.
Automatically decide the conversion, i.e. if familyID is the same throughout, drop it. If sampleID is the same throughout, drop it. If familyID == sampleID, drop one. Conversion: e.g. familyIDs are thousands, sampleIDs are singles. Count the maximum number of sampleIDs per familyID to determine the minimum radix for familyID.

Return:

List with n as number of rows converted, data.frame with mapping, m as number of columns.

This method will also work for converting files for DMU (although with argument --recode 12 [?]).
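The radix idea can be sketched as follows. This is not the package's implementation: `convert_plinkA_sketch` is a hypothetical name, the whole file is read into memory with read.table, and the "drop an ID that is constant" rules are left out. Families and samples within a family are simply numbered in order of appearance.

```r
# Hypothetical sketch: strip the six leading columns of a `--recode A` file
# and build one integer ID per row from familyID and sampleID.
convert_plinkA_sketch <- function(rawfile) {
  raw <- read.table(rawfile, header = TRUE, stringsAsFactors = FALSE)
  fid <- raw[[1]]
  iid <- raw[[2]]
  genotypes <- as.matrix(raw[, -(1:6), drop = FALSE])

  # Number families, and samples within each family, in order of appearance.
  fam_index <- match(fid, unique(fid))
  within <- ave(seq_along(iid), fid, FUN = seq_along)
  # Smallest power of ten that leaves room for the largest family.
  radix <- 10^ceiling(log10(max(within) + 1))
  newID <- as.integer(fam_index * radix + within)

  mapping <- data.frame(familyID = fid, sampleID = iid, newID = newID,
                        stringsAsFactors = FALSE)
  list(n = nrow(genotypes), m = ncol(genotypes), mapping = mapping,
       genotypes = cbind(newID, genotypes))
}

# Tiny made-up input file for illustration.
rawfile <- tempfile(fileext = ".raw")
writeLines(c(
  "FID IID PAT MAT SEX PHENOTYPE snp1_A snp2_G",
  "F1 a 0 0 1 -9 0 1",
  "F1 b 0 0 2 -9 2 NA",
  "F2 a 0 0 1 -9 1 1"
), rawfile)
res <- convert_plinkA_sketch(rawfile)
print(res$mapping)
```

A streaming version would have to make two passes over the file, since the radix is only known once the largest family has been seen.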

Update imputation tests to new return value.

Update `imputation_accuracy`'s `fast` argument to `adaptive`

The "fast" method is not really fast. It does, however, have a really low memory footprint, although it is not adaptive.

Row with no variance drops returned element

In the adaptive routine, when a row has no variance, the corresponding element in rowcors disappears. In the fast routine, this is not the case.

Standardization must be FALSE, else it adds variance by scaling and shifting each element in the row separately.

# Build a test data set and give row 2 of the true matrix zero variance.
ts <- Siccuracy:::make.test(15, 21)
true <- ts$true
true[2,] <- 2
write.snps(true, ts$truefn)

# No standardization, as this changes each element of row 2 -- and it gets variance!
imputed <- ts$imputed
# Reference correlations computed in plain R.
mat1 <- cor(as.vector(true), as.vector(imputed), use = 'complete.obs')
suppressWarnings(row1 <- sapply(1:nrow(true), function(i) cor(true[i,], imputed[i,], use='na.or.complete')))
suppressWarnings(col1 <- sapply(1:ncol(true), function(i) cor(true[,i], imputed[,i], use='na.or.complete')))

# Adaptive routine: the element for row 2 goes missing from rowcors.
res <- imputation_accuracy(ts$truefn, ts$imputedfn, standardized = FALSE, adaptive = TRUE)
expect_equal(res$matcor, mat1, tolerance=1e-9)
expect_equal(res$rowcors, row1, tolerance=1e-9)
expect_equal(res$colcors, col1, tolerance=1e-9)

# Non-adaptive routine: rowcors keeps the element for row 2.
res <- imputation_accuracy(ts$truefn, ts$imputedfn, standardized = FALSE, adaptive = FALSE)
expect_equal(res$matcor, mat1, tolerance=1e-9)
expect_equal(res$rowcors, row1, tolerance=1e-9)
expect_equal(res$colcors, col1, tolerance=1e-9)
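Why standardization must be switched off for this test can be seen without the package. The sketch below assumes that standardization centers and scales each SNP (column) separately, which is what base R's scale() does; the package's own standardization may differ in detail.

```r
# Small made-up matrix: rows are samples, columns are SNPs; row 2 is constant.
true <- matrix(c(0, 1, 2, 1,
                 2, 2, 2, 2,
                 1, 0, 1, 2), nrow = 3, byrow = TRUE)

row_var_before <- apply(true, 1, var)

# Column-wise centering and scaling shifts every element of row 2 differently.
standardized <- scale(true)
row_var_after <- apply(standardized, 1, var)

# A zero-variance row has an undefined correlation (NA, with a warning).
constant_cor <- suppressWarnings(cor(true[2, ], c(0, 1, 2, 1)))

print(row_var_before)
print(row_var_after)
```

After standardization row 2 is no longer constant, so its correlation becomes defined and the test would stop exercising the zero-variance case.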

Convert PLINK binary format

Learn to parse the binary format of PLINK without the need to use --recode A (or --recode 12).
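As a starting point, here is a small sketch of reading a PLINK 1 .bed file in SNP-major mode with base R. The layout used (three magic bytes 0x6c 0x1b 0x01, then two bits per genotype, four samples per byte, lowest-order bits first) follows the PLINK 1.9 file format description. The function name `read_bed_sketch` is hypothetical, the number of samples and SNPs must be supplied by the caller (they would normally come from the .fam and .bim files), and genotypes are returned as counts of the first allele in the .bim file.

```r
# Hypothetical sketch: decode a SNP-major PLINK .bed file into a samples x SNPs matrix.
read_bed_sketch <- function(bedfile, n_samples, n_snps) {
  con <- file(bedfile, "rb")
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 3)
  stopifnot(identical(magic, as.raw(c(0x6c, 0x1b, 0x01))))

  bytes_per_snp <- ceiling(n_samples / 4)
  # 2-bit code -> count of first allele: 00 = 2, 01 = missing, 10 = 1, 11 = 0.
  lookup <- c(2L, NA, 1L, 0L)
  geno <- matrix(NA_integer_, nrow = n_samples, ncol = n_snps)
  for (j in seq_len(n_snps)) {
    b <- as.integer(readBin(con, what = "raw", n = bytes_per_snp))
    # Split each byte into four 2-bit codes, lowest-order bits first.
    codes <- as.vector(rbind(b %% 4L, (b %/% 4L) %% 4L,
                             (b %/% 16L) %% 4L, (b %/% 64L) %% 4L))
    # Drop the padding codes at the end of the last byte.
    geno[, j] <- lookup[codes[seq_len(n_samples)] + 1L]
  }
  geno
}

# Tiny hand-built file: 3 samples, 2 SNPs.
# SNP 1: hom first allele, het, missing -> byte 0x18
# SNP 2: hom second allele, hom second allele, het -> byte 0x2f
bedfile <- tempfile(fileext = ".bed")
writeBin(as.raw(c(0x6c, 0x1b, 0x01, 0x18, 0x2f)), bedfile)
geno <- read_bed_sketch(bedfile, n_samples = 3, n_snps = 2)
print(geno)
```

Because the file is SNP-major, a sample-per-row text output still needs a transpose or a second pass, which ties in with how the true matrix is stored.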

Streamline parameter descriptions and names.

ncol, nSNPs, nAnimals, naval, NAval, numeric.format, etc. are just some of the different names for the same concepts. These need to be streamlined.

This task also requires removing nSNPs and ncol, as setting these to anything other than the actual number of columns will lead to weird results.

Make maskSNPs function.

Required documentation: `maskSNPs` returns a bunch of stuff in a list.

Should probably be invisible, and be documented in help.

Supplying user allele frequencies results in `NA` column correlations being wrong when compiled as i386

This problem is not replicated when compiled as x64.

The test routine has at least 1 column that has zero variance. Under x64, the correlation of this column is NA. Only when compiled under i386, and when supplying a vector of scaling (just one of center, scaling, or p), does this constant column get an odd correlation.
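If the odd value comes from rounding noise turning an exactly zero variance into a tiny non-zero one (an assumption; the issue does not establish the cause), one defensive option is to treat variances below a tolerance as zero. The sketch below shows the idea in R; `safe_cor` and its `tol` argument are hypothetical, and the real fix would belong in the Fortran routine.

```r
# Hypothetical guard: return NA when either vector is constant up to a tolerance.
safe_cor <- function(x, y, tol = 1e-12) {
  ok <- !is.na(x) & !is.na(y)
  x <- x[ok]
  y <- y[ok]
  if (length(x) < 2 || var(x) < tol || var(y) < tol) {
    return(NA_real_)
  }
  cor(x, y)
}

# A column that is constant apart from simulated rounding noise.
noisy_constant <- c(2, 2 + 1e-10, 2, 2, 2)
guarded <- safe_cor(noisy_constant, c(0, 1, 2, 1, 0))
regular <- safe_cor(1:5, c(2, 4, 6, 8, 10))
```

An absolute tolerance like this depends on the scale of the data, so a relative threshold may be the better choice once centering and scaling vectors are supplied.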