openmendel / bgen.jl Goto Github PK

View Code? Open in Web Editor NEW

10.0 10.0 0.0 13.48 MB

License: MIT License

Julia 100.00%

bgen.jl's People

Stargazers

Watchers

bgen.jl's Issues

Features to add

The following are the items raised so far.

Importing haplotypes.bgen throws assertion error

It seems like the example haplotypes.bgen cannot be read:

using BGEN
b = Bgen(BGEN.datadir("haplotypes.bgen"); idx_path=BGEN.datadir("haplotypes.bgen.bgi"))
variants = parse_variants(b; from_bgen_start=true);

julia> minor_allele_dosage!(b, variants[1]; T=Float32)
ERROR: AssertionError: p.phased == 0
Stacktrace:
 [1] minor_allele_dosage!(::Bgen, ::Variant; T::Type{T} where T, mean_impute::Bool, clear_decompressed::Bool) at /Users/biona001/.julia/packages/BGEN/QxEG0/src/genotypes.jl:490
 [2] top-level scope at REPL[4]:1

Is this expected?

Individual ID extracted from wrong column

Hi,

When reading BGEN files, I think the sample id is currently extracted from the wrong column. According to this documentation, the column ID_2 corresponds to the individual ID. I think the origin might be this function.

Potential fix? replace [1] by [2]

Always treat REF as `major_allele` and ALT as `minor_allele`

Similar to #16, is there a way to always read the ALT allele (ATCG letter) and the REF allele regardless of which one is the minor allele?

I believe major_allele and minor_allele first checks which ALT/REF is less frequent, before assigning one as the minor allele.

Here's a test data:
Archive 2.zip

MWE

using BGEN, VCFTools

# import VCF data (always read ALT as minor allele)
Gtrue, sampleID, record_chr, record_pos, record_ids, record_ref, 
    record_alt = convert_gt(Float64, "target.typedOnly.masked.vcf.gz",
    trans=true, save_snp_info=true)

# import BGEN and look at 2nd marker
b = Bgen("target.typedOnly.masked.bgen")
variants = parse_variants(b; from_bgen_start=true)
v = variants[2]
minor_allele_dosage!(b, v)

# VCF and BGEN disagree
major_allele(v), minor_allele(v) # this outputs ("G", "T")
record_ref[2], record_alt[2] # this outputs ("T", ["G"]) where ALT is a vector because it is possible to have >1 ALT allele

minor_allele_dosage!'s outputs oscillates for some variants

Hi,

I am quite surprised to see that minor_allele_dosage! does not yield the same result when called multiple times for some variants. Could you please explain what exactly it represents and why it is the case?

using BGEN

b = Bgen(
    BGEN.datadir("example.8bits.bgen"); 
    sample_path=BGEN.datadir("example.sample"), 
    idx_path=BGEN.datadir("example.8bits.bgen.bgi")
    )
# Not reproducible
v = variant_by_rsid(b, "RSID_110")
mean(minor_allele_dosage!(b, v)) == 0.9621725f0
mean(minor_allele_dosage!(b, v)) == 1.0378275f0

# Reproducible
v = variant_by_rsid(b, "RSID_198")
mean(minor_allele_dosage!(b, v)) == 0.48411763f0
mean(minor_allele_dosage!(b, v)) == 0.48411763f0

Read all ALT allele as 1 and REF allele as 0 in `minor_allele_dosage!`?

Here is a test data (unphased VCF file converted to BGEN using qctool v2)
Archive 2.zip

I want to import all BGEN genotypes into numeric matrix, so I wrote the following function

using BGEN, VCFTools
function convert_gt(t::Type{T}, b::Bgen) where T <: Real
    n = n_samples(b)
    p = n_variants(b)
    G = Matrix{t}(undef, p, n)

    # loop over each variant
    i = 1
    for v in iterator(b; from_bgen_start=true)
        dose = minor_allele_dosage!(b, v; T=t)
        copyto!(@view(G[i, :]), dose)
        i += 1
    end
    return G
end

But imported genotype matrix does not agree with VCF file:

Gtest = convert_gt(Float64, Bgen("target.typedOnly.masked.bgen"))
Gtrue = VCFTools.convert_gt(Float64, "target.typedOnly.masked.vcf.gz", trans=true)
julia> all(Gtrue .== Gtest)
false

This seems to be because BGEN is checking which allele is the minor allele (so a 2 is swapped with 0 and 0 is swapped with 2 compared to VCF file)

idx = findall(skipmissing(Gtrue .!= Gtest)) # index where Gtrue and Gtest does not agree
julia> [Gtrue[idx] Gtest[idx]]
66811×2 Matrix{Union{Missing, Float64}}:
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 0.0  2.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 ⋮
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 0.0  2.0
 0.0  2.0
 0.0  2.0
 0.0  2.0

Is there a way to instead read all ALT alleles as 1 and all REF allele as 0?

Typo for haplotype documentation

"On the other hand, if the data are phased, probabilities!(b, v)[i, j] represents the probability that haplotype (i - 1) ÷ p + 1 has allele (i - 1) % p + 1"

Surely there are typos here? If i = 1, then haplotype 1 has allele 1 with probability probabilities!(b, v)[i, j]? I can't understand how to parse extract haplotypes for each sample

maf, hwe, info_score, counts! for general cases

Currently, we only have it for 8-bit files. We need to have a fallback implementation.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Cannot get ref/alt allele label for haplotype data

I think I'm running into the same problem as #7.

After parsing a variant's haplotypes, I want to additionally store ref/alt label (ie call major_allele and minor_allele), but I get Error: minor_allele_dosage!() must be called before minor_allele(). But calling minor_allele_dosage!() throws another error.

MWE

using BGEN
b = Bgen(BGEN.datadir("haplotypes.bgen"))
variants = parse_variants(b; from_bgen_start=true)
v = variants[1]

dose = probabilities!(b, v) # I will parse `dose` using some other routine to get haplotypes

julia> major_allele(v) # now try to get REF allele
┌ Error: `minor_allele_dosage!()` must be called before `minor_allele()`
└ @ BGEN ~/.julia/dev/BGEN/src/variant.jl:92

Try calling minor_allele_dosage!() throws not phased error

julia> minor_allele_dosage!(b, v)
ERROR: AssertionError: p.phased == 0
Stacktrace:
 [1] ref_allele_dosage!(b::Bgen, v::Variant; T::Type, mean_impute::Bool, clear_decompressed::Bool, data::Nothing, decompressed::Nothing, is_decompressed::Bool)
   @ BGEN ~/.julia/dev/BGEN/src/genotypes.jl:546
 [2] minor_allele_dosage!(b::Bgen, v::Variant; T::Type, mean_impute::Bool, clear_decompressed::Bool, data::Nothing, decompressed::Nothing, is_decompressed::Bool)
   @ BGEN ~/.julia/dev/BGEN/src/genotypes.jl:603
 [3] minor_allele_dosage!(b::Bgen, v::Variant)
   @ BGEN ~/.julia/dev/BGEN/src/genotypes.jl:589
 [4] top-level scope
   @ REPL[57]:1

but variant is phased:

julia> phased(v) == true
true

Here is another test data (VCF haplotypes converted to BGEN using qctools v2)
Archive.zip

convert to called genotypes

Hi,

Sorry for all the feature requests, I'm working a lot with BGEN files and would love to move all of my preprocessing to Julia if possible.

In SnpArrays, there is the possibility to convert the snp array to genotypes and I was wondering if it could be possible to have the same function here using the probabilities contained in the BGEN file, possibly with a "calling" threshold.

Cheers,
Olivier

Feature request: import genotype/haplotypes into numeric matrix

Below is some prototype code.

This convert_gt runs without error but is untested

"""
    convert_gt(b::Bgen, T=Float32)

Imports dosage information and chr/pos/snpID/ref/alt into numeric arrays.

# Input
- `b`: a `Bgen` object
- `T`: Type for genotype array

# Output
- `G`: a `p × n` matrix of type `T`. Each column is a genotype
- `Gchr`: Vector of `String`s holding chromosome number for each variant
- `Gpos`: Vector of `Int` holding each variant's position
- `GsnpID`: Vector of `String`s holding variant ID for each variant
- `Gref`: Vector of `String`s holding reference allele for each variant
- `Galt`: Vector of `String`s holding alterante allele for each variant
"""
function convert_gt(b::Bgen, T=Float32)
    n = n_samples(b)
    p = n_variants(b)

    # return arrays
    G = Matrix{T}(undef, p, n)
    Gchr = Vector{String}(undef, p)
    Gpos = Vector{Int}(undef, p)
    GsnpID = Vector{String}(undef, p)
    Gref = Vector{String}(undef, p)
    Galt = Vector{String}(undef, p)

    # loop over each variant
    i = 1
    for v in iterator(b; from_bgen_start=true)
        dose = minor_allele_dosage!(b, v; T=T)
        copyto!(@view(G[i, :]), dose)
        Gchr[i], Gpos[i], GsnpID[i], Gref[i], Galt[i] =
            chrom(v), pos(v), rsid(v), major_allele(v), minor_allele(v)
        i += 1
        clear!(v)
    end
    return G, Gchr, Gpos, GsnpID, Gref, Galt
end

This convert_ht does not work due to major_allele and minor_allele not working.

"""
    convert_ht(b::Bgen)

Import phased haplotypes as a `BitMatrix`, and store chr/pos/snpID/ref/alt.

# Input
- `b`: a `Bgen` object. Each variant must be phased and samples must be diploid

# Output
- `H`: a `p × 2n` matrix of type `T`. Each column is a haplotype. 
- `Hchr`: Vector of `String`s holding chromosome number for each variant
- `Hpos`: Vector of `Int` holding each variant's position
- `HsnpID`: Vector of `String`s holding variant ID for each variant
- `Href`: Vector of `String`s holding reference allele for each variant
- `Halt`: Vector of `String`s holding alterante allele for each variant
"""
function convert_ht(b::Bgen)
    n = 2n_samples(b)
    p = n_variants(b)

    # return arrays
    H = BitMatrix(undef, p, n)
    Hchr = Vector{String}(undef, p)
    Hpos = Vector{Int}(undef, p)
    HsnpID = Vector{String}(undef, p)
    Href = Vector{String}(undef, p)
    Halt = Vector{String}(undef, p)

    # loop over each variant
    i = 1
    for v in iterator(b; from_bgen_start=true)
        dose = probabilities!(b, v)
        phased(v) || error("variant $(rsid(v)) at position $(pos(v)) not phased!")
        for j in 1:n_samples(b)
            Hi = @view(dose[:, j])
            H[i, 2j - 1] = read_haplotype1(Hi)
            H[i, 2j] = read_haplotype2(Hi)
        end
        Hchr[i], Hpos[i], HsnpID[i], Href[i], Halt[i] =
            chrom(v), pos(v), rsid(v), major_allele(v), minor_allele(v)
        i += 1
        clear!(v)
    end
    return H, Hchr, Hpos, HsnpID, Href, Halt
end

read_haplotype1(Hi::AbstractVector) = Hi[2] ≥ 0.5 ? true : false
read_haplotype2(Hi::AbstractVector) = Hi[4] ≥ 0.5 ? true : false

Do you know of a dataset that is in both VCF and BGEN format? That would be great for testing. Also, I'm guessing probabilities! and minor_allele_dosage! are allocating for every variant. If so, probably we need to modify it so it accepts preallocated vectors.

openmendel / bgen.jl Goto Github PK

bgen.jl's People

Stargazers

Watchers

bgen.jl's Issues

Features to add

Importing haplotypes.bgen throws assertion error

Individual ID extracted from wrong column

Always treat REF as `major_allele` and ALT as `minor_allele`

minor_allele_dosage!'s outputs oscillates for some variants

Read all ALT allele as 1 and REF allele as 0 in `minor_allele_dosage!`?

Typo for haplotype documentation

maf, hwe, info_score, counts! for general cases

TagBot trigger issue

Cannot get ref/alt allele label for haplotype data

convert to called genotypes

Feature request: import genotype/haplotypes into numeric matrix

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent