Giter VIP home page Giter VIP logo

bgen.jl's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

bgen.jl's Issues

Features to add

The following are the items raised so far.

  • select a variant by rsid or index within the region selected

  • getindex function

  • major/minor alleles for haplotypes

  • ref allele dosage (avoid confusion with minor allele)

  • minor allele frequency

  • HWE test

  • comparison: parse timing vs. VCF format

  • import values in a matrix (#7)

  • write functions

  • expose interface for non-allocating probability reading functions

  • documentation: define RSID

  • documentation/code: pos -> position

  • allow integer arguments for chromosome selection

  • show() function for Bgen, Variant, Genotypes

  • documentation: slash (/) for unphased, bar (|) for phased genotypes and haplotypes

  • documentation: clarify the explanation of probabilities!(b, v)[i, j]

  • documentation: biallelic diploid examples then polyploid exceptions

  • same name for similar functionalities to SnpArrays.jl

Importing haplotypes.bgen throws assertion error

It seems like the example haplotypes.bgen cannot be read:

using BGEN
b = Bgen(BGEN.datadir("haplotypes.bgen"); idx_path=BGEN.datadir("haplotypes.bgen.bgi"))
variants = parse_variants(b; from_bgen_start=true);

julia> minor_allele_dosage!(b, variants[1]; T=Float32)
ERROR: AssertionError: p.phased == 0
Stacktrace:
 [1] minor_allele_dosage!(::Bgen, ::Variant; T::Type{T} where T, mean_impute::Bool, clear_decompressed::Bool) at /Users/biona001/.julia/packages/BGEN/QxEG0/src/genotypes.jl:490
 [2] top-level scope at REPL[4]:1

Is this expected?

Always treat REF as `major_allele` and ALT as `minor_allele`

Similar to #16, is there a way to always read the ALT allele (ATCG letter) and the REF allele regardless of which one is the minor allele?

I believe major_allele and minor_allele first checks which ALT/REF is less frequent, before assigning one as the minor allele.

Here's a test data:
Archive 2.zip

MWE

using BGEN, VCFTools

# import VCF data (always read ALT as minor allele)
Gtrue, sampleID, record_chr, record_pos, record_ids, record_ref, 
    record_alt = convert_gt(Float64, "target.typedOnly.masked.vcf.gz",
    trans=true, save_snp_info=true)

# import BGEN and look at 2nd marker
b = Bgen("target.typedOnly.masked.bgen")
variants = parse_variants(b; from_bgen_start=true)
v = variants[2]
minor_allele_dosage!(b, v)

# VCF and BGEN disagree
major_allele(v), minor_allele(v) # this outputs ("G", "T")
record_ref[2], record_alt[2] # this outputs ("T", ["G"]) where ALT is a vector because it is possible to have >1 ALT allele

minor_allele_dosage!'s outputs oscillates for some variants

Hi,

I am quite surprised to see that minor_allele_dosage! does not yield the same result when called multiple times for some variants. Could you please explain what exactly it represents and why it is the case?

using BGEN

b = Bgen(
    BGEN.datadir("example.8bits.bgen"); 
    sample_path=BGEN.datadir("example.sample"), 
    idx_path=BGEN.datadir("example.8bits.bgen.bgi")
    )
# Not reproducible
v = variant_by_rsid(b, "RSID_110")
mean(minor_allele_dosage!(b, v)) == 0.9621725f0
mean(minor_allele_dosage!(b, v)) == 1.0378275f0

# Reproducible
v = variant_by_rsid(b, "RSID_198")
mean(minor_allele_dosage!(b, v)) == 0.48411763f0
mean(minor_allele_dosage!(b, v)) == 0.48411763f0

Read all ALT allele as 1 and REF allele as 0 in `minor_allele_dosage!`?

Here is a test data (unphased VCF file converted to BGEN using qctool v2)
Archive 2.zip

I want to import all BGEN genotypes into numeric matrix, so I wrote the following function

using BGEN, VCFTools
function convert_gt(t::Type{T}, b::Bgen) where T <: Real
    n = n_samples(b)
    p = n_variants(b)
    G = Matrix{t}(undef, p, n)

    # loop over each variant
    i = 1
    for v in iterator(b; from_bgen_start=true)
        dose = minor_allele_dosage!(b, v; T=t)
        copyto!(@view(G[i, :]), dose)
        i += 1
    end
    return G
end

But imported genotype matrix does not agree with VCF file:

Gtest = convert_gt(Float64, Bgen("target.typedOnly.masked.bgen"))
Gtrue = VCFTools.convert_gt(Float64, "target.typedOnly.masked.vcf.gz", trans=true)
julia> all(Gtrue .== Gtest)
false

This seems to be because BGEN is checking which allele is the minor allele (so a 2 is swapped with 0 and 0 is swapped with 2 compared to VCF file)

idx = findall(skipmissing(Gtrue .!= Gtest)) # index where Gtrue and Gtest does not agree
julia> [Gtrue[idx] Gtest[idx]]
66811×2 Matrix{Union{Missing, Float64}}:
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 0.0  2.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 2.0  0.0
 0.0  2.0
 0.0  2.0
 0.0  2.0
 0.0  2.0

Is there a way to instead read all ALT alleles as 1 and all REF allele as 0?

Typo for haplotype documentation

"On the other hand, if the data are phased, probabilities!(b, v)[i, j] represents the probability that haplotype (i - 1) ÷ p + 1 has allele (i - 1) % p + 1"

Surely there are typos here? If i = 1, then haplotype 1 has allele 1 with probability probabilities!(b, v)[i, j]? I can't understand how to parse extract haplotypes for each sample

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Cannot get ref/alt allele label for haplotype data

I think I'm running into the same problem as #7.

After parsing a variant's haplotypes, I want to additionally store ref/alt label (ie call major_allele and minor_allele), but I get Error: minor_allele_dosage!() must be called before minor_allele(). But calling minor_allele_dosage!() throws another error.

MWE

using BGEN
b = Bgen(BGEN.datadir("haplotypes.bgen"))
variants = parse_variants(b; from_bgen_start=true)
v = variants[1]

dose = probabilities!(b, v) # I will parse `dose` using some other routine to get haplotypes

julia> major_allele(v) # now try to get REF allele
┌ Error: `minor_allele_dosage!()` must be called before `minor_allele()`
└ @ BGEN ~/.julia/dev/BGEN/src/variant.jl:92

Try calling minor_allele_dosage!() throws not phased error

julia> minor_allele_dosage!(b, v)
ERROR: AssertionError: p.phased == 0
Stacktrace:
 [1] ref_allele_dosage!(b::Bgen, v::Variant; T::Type, mean_impute::Bool, clear_decompressed::Bool, data::Nothing, decompressed::Nothing, is_decompressed::Bool)
   @ BGEN ~/.julia/dev/BGEN/src/genotypes.jl:546
 [2] minor_allele_dosage!(b::Bgen, v::Variant; T::Type, mean_impute::Bool, clear_decompressed::Bool, data::Nothing, decompressed::Nothing, is_decompressed::Bool)
   @ BGEN ~/.julia/dev/BGEN/src/genotypes.jl:603
 [3] minor_allele_dosage!(b::Bgen, v::Variant)
   @ BGEN ~/.julia/dev/BGEN/src/genotypes.jl:589
 [4] top-level scope
   @ REPL[57]:1

but variant is phased:

julia> phased(v) == true
true

Here is another test data (VCF haplotypes converted to BGEN using qctools v2)
Archive.zip

convert to called genotypes

Hi,

Sorry for all the feature requests, I'm working a lot with BGEN files and would love to move all of my preprocessing to Julia if possible.

In SnpArrays, there is the possibility to convert the snp array to genotypes and I was wondering if it could be possible to have the same function here using the probabilities contained in the BGEN file, possibly with a "calling" threshold.

Cheers,
Olivier

Feature request: import genotype/haplotypes into numeric matrix

Below is some prototype code.

This convert_gt runs without error but is untested

"""
    convert_gt(b::Bgen, T=Float32)

Imports dosage information and chr/pos/snpID/ref/alt into numeric arrays.

# Input
- `b`: a `Bgen` object
- `T`: Type for genotype array

# Output
- `G`: a `p × n` matrix of type `T`. Each column is a genotype
- `Gchr`: Vector of `String`s holding chromosome number for each variant
- `Gpos`: Vector of `Int` holding each variant's position
- `GsnpID`: Vector of `String`s holding variant ID for each variant
- `Gref`: Vector of `String`s holding reference allele for each variant
- `Galt`: Vector of `String`s holding alterante allele for each variant
"""
function convert_gt(b::Bgen, T=Float32)
    n = n_samples(b)
    p = n_variants(b)

    # return arrays
    G = Matrix{T}(undef, p, n)
    Gchr = Vector{String}(undef, p)
    Gpos = Vector{Int}(undef, p)
    GsnpID = Vector{String}(undef, p)
    Gref = Vector{String}(undef, p)
    Galt = Vector{String}(undef, p)

    # loop over each variant
    i = 1
    for v in iterator(b; from_bgen_start=true)
        dose = minor_allele_dosage!(b, v; T=T)
        copyto!(@view(G[i, :]), dose)
        Gchr[i], Gpos[i], GsnpID[i], Gref[i], Galt[i] =
            chrom(v), pos(v), rsid(v), major_allele(v), minor_allele(v)
        i += 1
        clear!(v)
    end
    return G, Gchr, Gpos, GsnpID, Gref, Galt
end

This convert_ht does not work due to major_allele and minor_allele not working.

"""
    convert_ht(b::Bgen)

Import phased haplotypes as a `BitMatrix`, and store chr/pos/snpID/ref/alt.

# Input
- `b`: a `Bgen` object. Each variant must be phased and samples must be diploid

# Output
- `H`: a `p × 2n` matrix of type `T`. Each column is a haplotype. 
- `Hchr`: Vector of `String`s holding chromosome number for each variant
- `Hpos`: Vector of `Int` holding each variant's position
- `HsnpID`: Vector of `String`s holding variant ID for each variant
- `Href`: Vector of `String`s holding reference allele for each variant
- `Halt`: Vector of `String`s holding alterante allele for each variant
"""
function convert_ht(b::Bgen)
    n = 2n_samples(b)
    p = n_variants(b)

    # return arrays
    H = BitMatrix(undef, p, n)
    Hchr = Vector{String}(undef, p)
    Hpos = Vector{Int}(undef, p)
    HsnpID = Vector{String}(undef, p)
    Href = Vector{String}(undef, p)
    Halt = Vector{String}(undef, p)

    # loop over each variant
    i = 1
    for v in iterator(b; from_bgen_start=true)
        dose = probabilities!(b, v)
        phased(v) || error("variant $(rsid(v)) at position $(pos(v)) not phased!")
        for j in 1:n_samples(b)
            Hi = @view(dose[:, j])
            H[i, 2j - 1] = read_haplotype1(Hi)
            H[i, 2j] = read_haplotype2(Hi)
        end
        Hchr[i], Hpos[i], HsnpID[i], Href[i], Halt[i] =
            chrom(v), pos(v), rsid(v), major_allele(v), minor_allele(v)
        i += 1
        clear!(v)
    end
    return H, Hchr, Hpos, HsnpID, Href, Halt
end

read_haplotype1(Hi::AbstractVector) = Hi[2]  0.5 ? true : false
read_haplotype2(Hi::AbstractVector) = Hi[4]  0.5 ? true : false

Do you know of a dataset that is in both VCF and BGEN format? That would be great for testing. Also, I'm guessing probabilities! and minor_allele_dosage! are allocating for every variant. If so, probably we need to modify it so it accepts preallocated vectors.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.