quantgen / bgdata Goto Github PK

A Suite of Packages for Analysis of Big Genomic Data

License: Other

R 82.96% C 17.04%

bgdata's Introduction

BGData: A Suite of Packages for Analysis of Big Genomic Data

BGData (Grueneberg & de los Campos, 2019) is an R package that provides scalable and efficient computational methods for large genomic datasets, e.g., genome-wide association studies (GWAS) or genomic relationship matrices (G matrices). It also contains a container class called BGData that holds genotypes, sample information, and variant information.

Modern genomic datasets are big (large n), high-dimensional (large p), and multi-layered. The challenges that need to be addressed are memory requirements and computational demands. Our goal is to develop software that will enable researchers to carry out analyses with big genomic data within the R environment.

We have identified several approaches to tackle those challenges within R:

File-backed matrices: The data is stored in on the hard drive and users can read in smaller chunks when they are needed.
Linked arrays: For very large datasets a single file-backed array may not be enough or convenient. A linked array is an array whose content is distributed over multiple file-backed nodes.
Multiple dispatch: Methods are presented to users so that they can treat these arrays pretty much as if they were RAM arrays.
Multi-level parallelism: Exploit multi-core and multi-node computing.
Inputs: Users can create these arrays from standard formats (e.g., PLINK .bed).

The BGData package is an umbrella package that comprises several packages: BEDMatrix, LinkedMatrix, and symDMatrix.

Examples

Loading the package

Load the BGData package:

library(BGData)

Inspecting the example dataset

The inst/extdata folder contains example files that were generated from the 250k SNP and phenotype data in Atwell et al. (2010). Only the first 300 SNPs of chromosome 1, 2, and 3 were included to keep the size of the example dataset small enough for CRAN. PLINK was used to convert the data to .bed and .raw files. FT10 has been chosen as a phenotype and is provided as an alternate phenotype file. The file is intentionally shuffled to demonstrate that the additional phenotypes are put in the same order as the rest of the phenotypes.

path <- system.file("extdata", package = "BGData")
list.files(path)
#>  [1] "chr1.bed"  "chr1.bim"  "chr1.fam"  "chr1.raw"  "chr2.bed"  "chr2.bim"
#>  [7] "chr2.fam"  "chr2.raw"  "chr3.bed"  "chr3.bim"  "chr3.fam"  "chr3.raw"
#> [13] "pheno.txt"

Loading example dataset

Loading individual PLINK .bed files

Load the .bed file for chromosome 1 (chr1.bed) using the BEDMatrix package:

chr1 <- BEDMatrix(paste0(path, "/chr1.bed"))
#> Extracting number of individuals and rownames from .fam file...
#> Extracting number of markers and colnames from .bim file...

BEDMatrix objects behave similarly to regular matrices:

dim(chr1)
#> [1] 199 300
rownames(chr1)[1:10]
#> [1] "5837_5837" "6008_6008" "6009_6009" "6016_6016" "6040_6040" "6042_6042"
#> [7] "6043_6043" "6046_6046" "6064_6064" "6074_6074"
colnames(chr1)[1:10]
#> [1] "snp1_T"  "snp2_G"  "snp3_A"  "snp4_T"  "snp5_G"  "snp6_T"  "snp7_C"
#> [8] "snp8_C"  "snp9_C"  "snp10_G"
chr1["6008_6008", "snp5_G"]
#> [1] 0

Linking multiple BEDMatrix objects together

Load the other two .bed files:

chr2 <- BEDMatrix(paste0(path, "/chr2.bed"))
#> Extracting number of individuals and rownames from .fam file...
#> Extracting number of markers and colnames from .bim file...
chr3 <- BEDMatrix(paste0(path, "/chr3.bed"))
#> Extracting number of individuals and rownames from .fam file...
#> Extracting number of markers and colnames from .bim file...

Combine the BEDMatrix objects by columns using the LinkedMatrix to avoid the inconvenience of having three separate matrices:

wg <- ColumnLinkedMatrix(chr1, chr2, chr3)

Just like BEDMatrix objects, LinkedMatrix objects also behave similarly to regular matrices:

dim(wg)
#> [1] 199 900
rownames(wg)[1:10]
#> [1] "5837_5837" "6008_6008" "6009_6009" "6016_6016" "6040_6040" "6042_6042"
#> [7] "6043_6043" "6046_6046" "6064_6064" "6074_6074"
colnames(wg)[1:10]
#> [1] "snp1_T"  "snp2_G"  "snp3_A"  "snp4_T"  "snp5_G"  "snp6_T"  "snp7_C"
#> [8] "snp8_C"  "snp9_C"  "snp10_G"
wg["6008_6008", "snp5_G"]
#> [1] 0

Creating a BGData object

BGData objects can be created from individual BEDMatrix objects or a collection of BEDMatrix objects as a LinkedMatrix object using the as.BGData() function. This will read the .fam and .bim file that comes with the .bed files. The alternatePhenotypeFile parameter points to the file that contains the FT10 phenotype:

bg <- as.BGData(wg, alternatePhenotypeFile = paste0(path, "/pheno.txt"))
#> Extracting phenotypes from .fam file, assuming that the .fam file of the first BEDMatrix instance is representative of all the other nodes...
#> Extracting map from .bim files...
#> Merging alternate phenotype file...

The bg object will use the LinkedMatrix object as genotypes, the .fam file augmented by the FT10 phenotype as sample information, and the .bim file as variant information.

str(bg)
#> Formal class 'BGData' [package "BGData"] with 3 slots
#>   ..@ geno :Formal class 'ColumnLinkedMatrix' [package "LinkedMatrix"] with 1 slot
#>   .. .. ..@ .Data:List of 3
#>   .. .. .. ..$ :BEDMatrix: 199 x 300 [/home/agrueneberg/.pkgs/R/BGData/extdata/chr1.bed]
#>   .. .. .. ..$ :BEDMatrix: 199 x 300 [/home/agrueneberg/.pkgs/R/BGData/extdata/chr2.bed]
#>   .. .. .. ..$ :BEDMatrix: 199 x 300 [/home/agrueneberg/.pkgs/R/BGData/extdata/chr3.bed]
#>   ..@ pheno:'data.frame':       199 obs. of  7 variables:
#>   .. ..$ FID      : int [1:199] 5837 6008 6009 6016 6040 6042 6043 6046 6064 6074 ...
#>   .. ..$ IID      : int [1:199] 5837 6008 6009 6016 6040 6042 6043 6046 6064 6074 ...
#>   .. ..$ PAT      : int [1:199] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..$ MAT      : int [1:199] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..$ SEX      : int [1:199] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..$ PHENOTYPE: int [1:199] -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 ...
#>   .. ..$ FT10     : num [1:199] 57 60 98 75 71 56 90 93 96 91 ...
#>   ..@ map  :'data.frame':       900 obs. of  6 variables:
#>   .. ..$ chromosome        : int [1:900] 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..$ snp_id            : chr [1:900] "snp1" "snp2" "snp3" "snp4" ...
#>   .. ..$ genetic_distance  : int [1:900] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..$ base_pair_position: int [1:900] 657 3102 4648 4880 5975 6063 6449 6514 6603 6768 ...
#>   .. ..$ allele_1          : chr [1:900] "T" "G" "A" "T" ...
#>   .. ..$ allele_2          : chr [1:900] "C" "A" "C" "C" ...

Saving a BGData object

A BGData object can be saved like any other R object using the save function:

save(bg, file = "BGData.RData")

Loading a BGData object

The genotypes in a BGData object can be of various types, some of which need to be initialized in a particular way. The load.BGData takes care of reloading a saved BGData object properly:

load.BGData("BGData.RData")
#> Loaded objects: bg

Summarizing data

Use chunkedApply to count missing values (among others):

countNAs <- chunkedApply(X = geno(bg), MARGIN = 2, FUN = function(x) sum(is.na(x)))

Use the summarize function to calculate minor allele frequencies and frequency of missing values:

summarize(geno(bg))

Running GWASes with different regression methods

A data structure for genomic data is useful when defining methods that act on both phenotype and genotype information. We have implemented a GWAS function that supports various regression methods. The formula takes phenotypes from the sample information of the BGData object and inserts one marker at a time.

gwas <- GWAS(formula = FT10 ~ 1, data = bg)

Generating the G Matrix

G <- getG(geno(bg))

Installation

Install the stable version from CRAN:

install.packages("BGData")

Alternatively, install the development version from GitHub:

# install.packages("remotes")
remotes::install_github("QuantGen/BGData")

Documentation

Further documentation can be found on RDocumentation.

Contributing

Issue Tracker: https://github.com/QuantGen/BGData/issues
Source Code: https://github.com/QuantGen/BGData

bgdata's People

Contributors

Stargazers

Watchers

Forkers

lian0090 sudhaveturi minghao2016 dnharry funkhou9 davan690 feraguate jingali ruixiangliu geneticresources quanrd gcostaneto rnaimehaom wook2014

bgdata's Issues

Handle duplicate row names in @pheno

The BC has three duplicate individual IDs that cause an error message when printing @pheno. readPED runs without complaining.

Upload bundled package to GitHub

This should make installation easier for users that do not have the devtools package.

Change na.strings of readPED back to NA to support raw files

That was a misunderstanding on my part, thinking that regular PED files are more common for this type of analysis than the ones that are generated by recodeA.

Implement show method for mmMatrix

This is the method that is run when just the variable name is typed into the console. For larger matrices, printing the whole thing can be quite annoying, so I think just printing the dimensions and number of chunks should be enough.

write.plink?

Thanks for implementing this package!

Is there already a workflow to write merged or subset data back to the file system in plink binary format?

Error in .local(.Object, ...) : No such device

Hi,

I am getting this error and cannot figure out why it is happening:

Extracting number of samples and rownames from FAM file...
Extracting number of variants and colnames from BIM file...
Error in .local(.Object, ...) : No such device

obviously, it can find the fam and bim and the bed file is there too but I don't see why I get this error. I would be thankful if you could help.

Thanks,
Kayhan

Consider covariate files in as.BGData

We have a formula interface, so the distinction between phenotypes and covariates is not as important as in PLINK. Maybe we should have an alias for alternatePhenotypeFile in as.BGData for covariates? They have the same file format.

Add options for i and j in getG.symDMatrix

Add summarize function

To calculate NA and MA frequencies.

GWAS: Check for columns with too many NAs / with zero variance

Fix signature of rayOLS

rayOLS should have a signature that starts with x to not confuse apply etc. I don't think this is causing an issue right now because we are explicitly passing y which takes y out of the rotation, but it may bite back in the future.

Treat dataType of readPED as a string

For now we could support a subset of allowed types in R, say integer, double, and character. character will only be available in readPED.matrix for now.

Add i and j to summarize

GWAS: attempt to set 'colnames' on an object with less than two dimensions

I'm attempting to run GWAS using BGData. I first went through the tutorial in the readme with test data and it worked properly. I'm now trying to run my data and am getting the following error.

> gwas <- GWAS(formula = my_phenotype ~ 1, data = bg)
Error in `colnames<-`(`*tmp*`, value = colnames(data@geno)[j]) : 
  attempt to set 'colnames' on an object with less than two dimensions
In addition: Warning message:
In parallel::mclapply(X = seq_len(nChunks), FUN = chunkApply, ...,  :
  all scheduled cores encountered errors in user code

The only thing I did differently from the tutorial was that I loaded the entire genome as a single .bed, rather than loading each chromosome separately and then combining them using ColumnLinkedMatrix. I don't know a good way to load and combine chromosomes/scaffolds in my case since there are over 1000 for the reference genome I'm working with. I ran the following:

wg <- BEDMatrix("my_filepath.bed")
bg <- as.BGData(wg, alternatePhenotypeFile = "my_phenodata")

...and I can't see what the problem with the object "geno" is. It appears to have two dimensions, contrary to the error I'm getting from the GWAS function.

> bg
An object of class "BGData"
Slot "geno":
BEDMatrix: 882 x 4428726 [my_filepath.bed]

Thanks for the help getting this to run with my data.

getG.symDMatrix: Support minVar

getG filters out constant columns based on the minVar parameter, getG.symDMatrix doesn't. If the X matrix contains constant columns, the G matrices created by getG.symDMatrix will be NaN while the ones created getG will be fine.

Thanks to @yrubio for reporting the problem.

Implement custom save function

This is necessary for #12.

getGi versus getGij ( problems in getGi )

These examples illustrate a problem that we need to fix in getGi (note, getGij works fine)

library(BGData)
library(BGLR)
data(mice)
X=mice.X
scales=apply(FUN=sd,MARGIN=2,X=X) 
means=apply(FUN=mean,MARGIN=2,X=X)

## Computing G using tcrossprod
  G0=tcrossprod(scale(X))
## Computing G using getG with scales and centers computed internally
 G1=getG(X,scaleG=F,scaleCol=T)
 all.equal(G0,G1)
## Computing G using getG with user provided scales and centers
 G2=getG(X,scaleG=F,scaleCol=T,i=1:nrow(X),centers=centers,scales=scales)
 all.equal(G0,G2)
## Computing G using i and i2
 G12=getG(X,scaleG=F,scaleCol=T,centers=centers,scales=scales,i=11:20,i2=11:20)
 all.equal(as.vector(G12),as.vector(G0[rownames(G12),colnames(G12)]))
## Computin G using i only
 G12=getG(X,scaleG=F,scaleCol=T,centers=centers,scales=scales,i=11:20)
 all.equal(as.vector(G12),as.vector(G0[rownames(G12),colnames(G12)]))

as.BGData: Read BIM file

Handle character data

ff does not handle character data. We discussed recoding, but it would slow things down. Instead, we decided to throw an error with the respective PLINK command: plink --file ped.txt --recodeA?

getG imputes by 0 if center = FALSE

This implies that the data has already been centered, but since this is not the case, impute by mean instead.

Try to improve error handling for mclapply calls

Check for try-error objects, etc.

Support phenotype files

To support more than one phenotype. Used with --pheno in PLINK. The file does not depend on the order of the PED file, rather it should be merged by FID and IID with missing values if a value for an entry in PED is not found. There is also support for headers, but they have to start with FID and IID.

Add Y parameter to getG

Currently i2 is taken from X, assuming that all samples are in X. For distributing data to nodes, this can be inefficient. Need to add parameter Y: if Y is NULL, i2 will be taken from X as is, otherwise from Y.

Some PED files use both Family ID and Subject ID to uniquely identify individuals

A single parameter idCol might not be enough: "The combination of family and individual ID should uniquely identify a person." (http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml)

The following PED file will break @pheno because of duplicates if idCol = 2 (the current default):

1 1 0 0 1  0    G G    2 2    C C
1 2 0 0 1  0    A A    0 0    A C
1 3 1 2 1  2    0 0    1 2    A C
2 1 0 0 1  0    A A    2 2    0 0
2 2 0 0 1  2    A A    2 2    0 0
2 3 1 2 1  2    A A    2 2    A A

Quick fix: allow vector as idCol?

getG.symDMatrix: dimnames[[i]] is neither NULL nor matches dim[i]

This happens when I run getG.symDMatrix on a larger dataset:

 Working pair 1-1 (0% 2274.482 seconds).
Error in `dimnames<-.ff_array`(`*tmp*`, value = dn) : 
  dimnames[[i]] is neither NULL nor matches dim[i]

Debugging information:

> dim(bg@geno)
[1]  13113 841820
> head(rownames(bg@geno))
[1] "0_230853" "0_147875" "0_373389" "0_354405" "0_500565" "0_383668"
> head(colnames(bg@geno))
[1] "SNP_A-4256704_C" "SNP_A-1964387_A" "SNP_A-1965574_A" "SNP_A-1968937_G"
[5] "SNP_A-1968946_C" "SNP_A-4212582_C"

nChunks, nChunks2, nTastks, mc.cores, chunkSize

We need to consolidate this.

Clearly we need one keyword to define the number of columns brought to RAM at a time, we need a different term to define how many cores used to process each chunk and probably a different term to define the block size in symDMatrix.

Here is a proposal

nChunks: define the size of the chunk to be brought to RAM.
chunkSize: used also to define the size of the chunk to brought to RAM. I suggest chunkSize over-rides nChunks.
nTasks used to define the number of tasks used to process one chunk.
mc.cores, I suggest to set it by default to min(detctCores(), nTasks)
blockSize: the number of columns/rows of a block in a symDMatrix

Problem with j in getG

Hi,

I'm using a genomic data set with 180K SNPs to create G matrices. I want to exclude around 3K of these SNPs from the calculation, so I used the j argument to include all other columns than these 3K. This gave a wildly different answer than when not excluding columns, with generally much larger entries than what we should be seeing.

This problem persisted even when I excluded just 1 single randomly selected SNP column (again, out of 180K SNPs, so this should have a large effect, right?). This also shows that the problem is not specific to the columns I wanted to exclude, but to excluding columns with j at all.

Luckily I was able to get around this issue by instead making a new BGData object where the unwanted SNP columns were not included in the first place, and then not specifying j in getG. This gave the expected/right answer, so there seems to be some problem with using j.

Thank you for the package, it is very helpful.

'mc.cores' > 1 is not supported on Windows

I'm running RStudio on windows and have hit an error at the Summarizing Data step:

Summarising data

Use chunkedApply to count missing values (among others):

countNAs <- chunkedApply(X = geno(bg), MARGIN = 2, FUN = function(x) sum(is.na(x)))
Error in mclapply(X = seq_len(nChunks), FUN = chunkApply, ..., mc.cores = nCores) :
'mc.cores' > 1 is not supported on Windows

Relax R version requirement

3.1.2 is not really needed, try something lower.

Support LinkedMatrix in as.BGData

For now just with BEDMatrix elements.

getG.symDMatrix: blockSize does not use all individuals

library(BGLR)
data(mice)
X <- mice.X
rownames(X) <- paste0("ID_", 1:nrow(X))
G3 <- getG.symDMatrix(X, scaleCol = TRUE, centerCol = TRUE, folder = "tmp", blockSize = 300, vmode = "double")
dim(G3) # returns 1500 1500 but should be 1814 1814

Add rbind and cbind methods for mmMatrix

rbind will be easy for rmmMatrix, cbind for cmmMatrix. The other ones will be tricky and probably involve creating new chunks.

Add ff-backed LinkedMatrix constructor

use gpuR

For computations of G matrices gpuR can give use better performance that the current multi-core approach. I suggest to add a function crossprods_gpu(), see template below.

Problems to consider:
- Will we require gpuR? For now, while we test it and learn, we may not require gpuR
- Detecting whether gpu is availalbe, if not throw an error.
- Modifying getG by adding an argument useGPU, if TRUE, we call crossprods_gpu, that authomatically will override, nCores.

Gustavo

crossprods_gpu <- function(x, y = NULL, use_tcrossprod = FALSE) { #*# remove some args
dx <- dim(x)
if (!is.null(y)) {
y <- as.matrix(y)
dy <- dim(y)
if (use_tcrossprod) {
if (dx[2L] != ncol(y)) {
stop("Error in tcrossprod_gpu: non-conformable arguments.")
}
} else {
if (dx[1L] != dy[1L]) {
stop("Error in crossprod_gpu: non-conformable arguments.")
}
}
}

x <- gpuMatrix(x,"float")
if (!is.null(y)) {
    y <- gpuMatrix(y,"float")
}
if (use_tcrossprod) {
   Xy <- tcrossprod(x, y)
} else {
	Xy <- crossprod(x, y)
}

return(Xy)

}

Add centerCol parameter for getG

Right now centering can only be disabled by using a null vector as centers.

Add a way to load an mmMatrix independent of BGData

I have to think about this a little, but this could be a separate function or in the constructor.

Add option to transpose input data in readPED

Some input files have the markers in the rows and the individuals in the columns. A simple transpose parameter that defaults to false and some logic that write rows into columns should do the trick.

Allow transformations during getG

@agugonrey brought this up, could also be useful for other functions. A view feature for BEDMatrix could also be useful to implement this.

Drop .RData extension of BGData object

Over in #5 I mentioned that we need the storage location of an mmMatrix object to add more chunks. This could be done using a parameter of *bind, or as a property of the mmMatrix object itself.
I think the latter is more convenient if we are planning to append to the object instead of creating a new one. I could image some attribute that is set when the mmMatrix is created or loaded. The problem is that currently the object can be loaded using the load function without us having a chance to interfere. In fact, if we rely on the parameter, this will have a negative impact on portability (paths could be wrong).
I think we have discussed this before, but I would like to hide the fact that the BGData object is saved using the save command in favor of our loadBGData and load2 functions. We would need a dedicated save function as well (that strips the attribute before saving, for instance).