openmendel / bgen.jl Goto Github PK
View Code? Open in Web Editor NEWLicense: MIT License
License: MIT License
The following are the items raised so far.
select a variant by rsid or index within the region selected
getindex function
major/minor alleles for haplotypes
ref allele dosage (avoid confusion with minor allele)
minor allele frequency
HWE test
comparison: parse timing vs. VCF format
import values in a matrix (#7)
write functions
expose interface for non-allocating probability reading functions
documentation: define RSID
documentation/code: pos -> position
allow integer arguments for chromosome selection
show()
function for Bgen
, Variant
, Genotypes
documentation: slash (/
) for unphased, bar (|
) for phased genotypes and haplotypes
documentation: clarify the explanation of probabilities!(b, v)[i, j]
documentation: biallelic diploid examples then polyploid exceptions
same name for similar functionalities to SnpArrays.jl
It seems like the example haplotypes.bgen
cannot be read:
using BGEN
b = Bgen(BGEN.datadir("haplotypes.bgen"); idx_path=BGEN.datadir("haplotypes.bgen.bgi"))
variants = parse_variants(b; from_bgen_start=true);
julia> minor_allele_dosage!(b, variants[1]; T=Float32)
ERROR: AssertionError: p.phased == 0
Stacktrace:
[1] minor_allele_dosage!(::Bgen, ::Variant; T::Type{T} where T, mean_impute::Bool, clear_decompressed::Bool) at /Users/biona001/.julia/packages/BGEN/QxEG0/src/genotypes.jl:490
[2] top-level scope at REPL[4]:1
Is this expected?
Hi,
When reading BGEN files, I think the sample id is currently extracted from the wrong column. According to this documentation, the column ID_2 corresponds to the individual ID. I think the origin might be this function.
Potential fix? replace [1] by [2]
Similar to #16, is there a way to always read the ALT allele (ATCG letter) and the REF allele regardless of which one is the minor allele?
I believe major_allele
and minor_allele
first checks which ALT/REF is less frequent, before assigning one as the minor allele.
Here's a test data:
Archive 2.zip
MWE
using BGEN, VCFTools
# import VCF data (always read ALT as minor allele)
Gtrue, sampleID, record_chr, record_pos, record_ids, record_ref,
record_alt = convert_gt(Float64, "target.typedOnly.masked.vcf.gz",
trans=true, save_snp_info=true)
# import BGEN and look at 2nd marker
b = Bgen("target.typedOnly.masked.bgen")
variants = parse_variants(b; from_bgen_start=true)
v = variants[2]
minor_allele_dosage!(b, v)
# VCF and BGEN disagree
major_allele(v), minor_allele(v) # this outputs ("G", "T")
record_ref[2], record_alt[2] # this outputs ("T", ["G"]) where ALT is a vector because it is possible to have >1 ALT allele
Hi,
I am quite surprised to see that minor_allele_dosage!
does not yield the same result when called multiple times for some variants. Could you please explain what exactly it represents and why it is the case?
using BGEN
b = Bgen(
BGEN.datadir("example.8bits.bgen");
sample_path=BGEN.datadir("example.sample"),
idx_path=BGEN.datadir("example.8bits.bgen.bgi")
)
# Not reproducible
v = variant_by_rsid(b, "RSID_110")
mean(minor_allele_dosage!(b, v)) == 0.9621725f0
mean(minor_allele_dosage!(b, v)) == 1.0378275f0
# Reproducible
v = variant_by_rsid(b, "RSID_198")
mean(minor_allele_dosage!(b, v)) == 0.48411763f0
mean(minor_allele_dosage!(b, v)) == 0.48411763f0
Here is a test data (unphased VCF file converted to BGEN using qctool v2)
Archive 2.zip
I want to import all BGEN genotypes into numeric matrix, so I wrote the following function
using BGEN, VCFTools
function convert_gt(t::Type{T}, b::Bgen) where T <: Real
n = n_samples(b)
p = n_variants(b)
G = Matrix{t}(undef, p, n)
# loop over each variant
i = 1
for v in iterator(b; from_bgen_start=true)
dose = minor_allele_dosage!(b, v; T=t)
copyto!(@view(G[i, :]), dose)
i += 1
end
return G
end
But imported genotype matrix does not agree with VCF file:
Gtest = convert_gt(Float64, Bgen("target.typedOnly.masked.bgen"))
Gtrue = VCFTools.convert_gt(Float64, "target.typedOnly.masked.vcf.gz", trans=true)
julia> all(Gtrue .== Gtest)
false
This seems to be because BGEN is checking which allele is the minor allele (so a 2 is swapped with 0 and 0 is swapped with 2 compared to VCF file)
idx = findall(skipmissing(Gtrue .!= Gtest)) # index where Gtrue and Gtest does not agree
julia> [Gtrue[idx] Gtest[idx]]
66811×2 Matrix{Union{Missing, Float64}}:
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
0.0 2.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
⋮
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
2.0 0.0
0.0 2.0
0.0 2.0
0.0 2.0
0.0 2.0
Is there a way to instead read all ALT alleles as 1 and all REF allele as 0?
"On the other hand, if the data are phased, probabilities!(b, v)[i, j] represents the probability that haplotype (i - 1) ÷ p + 1 has allele (i - 1) % p + 1"
Surely there are typos here? If i = 1
, then haplotype 1
has allele 1
with probability probabilities!(b, v)[i, j]
? I can't understand how to parse extract haplotypes for each sample
Currently, we only have it for 8-bit files. We need to have a fallback implementation.
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml
to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment TagBot fix
on this issue.
I'll open a PR within a few hours, please be patient!
I think I'm running into the same problem as #7.
After parsing a variant's haplotypes, I want to additionally store ref/alt label (ie call major_allele
and minor_allele
), but I get Error: minor_allele_dosage!() must be called before minor_allele()
. But calling minor_allele_dosage!()
throws another error.
MWE
using BGEN
b = Bgen(BGEN.datadir("haplotypes.bgen"))
variants = parse_variants(b; from_bgen_start=true)
v = variants[1]
dose = probabilities!(b, v) # I will parse `dose` using some other routine to get haplotypes
julia> major_allele(v) # now try to get REF allele
┌ Error: `minor_allele_dosage!()` must be called before `minor_allele()`
└ @ BGEN ~/.julia/dev/BGEN/src/variant.jl:92
Try calling minor_allele_dosage!()
throws not phased error
julia> minor_allele_dosage!(b, v)
ERROR: AssertionError: p.phased == 0
Stacktrace:
[1] ref_allele_dosage!(b::Bgen, v::Variant; T::Type, mean_impute::Bool, clear_decompressed::Bool, data::Nothing, decompressed::Nothing, is_decompressed::Bool)
@ BGEN ~/.julia/dev/BGEN/src/genotypes.jl:546
[2] minor_allele_dosage!(b::Bgen, v::Variant; T::Type, mean_impute::Bool, clear_decompressed::Bool, data::Nothing, decompressed::Nothing, is_decompressed::Bool)
@ BGEN ~/.julia/dev/BGEN/src/genotypes.jl:603
[3] minor_allele_dosage!(b::Bgen, v::Variant)
@ BGEN ~/.julia/dev/BGEN/src/genotypes.jl:589
[4] top-level scope
@ REPL[57]:1
but variant is phased:
julia> phased(v) == true
true
Here is another test data (VCF haplotypes converted to BGEN using qctools v2)
Archive.zip
Hi,
Sorry for all the feature requests, I'm working a lot with BGEN files and would love to move all of my preprocessing to Julia if possible.
In SnpArrays, there is the possibility to convert
the snp array to genotypes and I was wondering if it could be possible to have the same function here using the probabilities contained in the BGEN file, possibly with a "calling" threshold.
Cheers,
Olivier
Below is some prototype code.
This convert_gt
runs without error but is untested
"""
convert_gt(b::Bgen, T=Float32)
Imports dosage information and chr/pos/snpID/ref/alt into numeric arrays.
# Input
- `b`: a `Bgen` object
- `T`: Type for genotype array
# Output
- `G`: a `p × n` matrix of type `T`. Each column is a genotype
- `Gchr`: Vector of `String`s holding chromosome number for each variant
- `Gpos`: Vector of `Int` holding each variant's position
- `GsnpID`: Vector of `String`s holding variant ID for each variant
- `Gref`: Vector of `String`s holding reference allele for each variant
- `Galt`: Vector of `String`s holding alterante allele for each variant
"""
function convert_gt(b::Bgen, T=Float32)
n = n_samples(b)
p = n_variants(b)
# return arrays
G = Matrix{T}(undef, p, n)
Gchr = Vector{String}(undef, p)
Gpos = Vector{Int}(undef, p)
GsnpID = Vector{String}(undef, p)
Gref = Vector{String}(undef, p)
Galt = Vector{String}(undef, p)
# loop over each variant
i = 1
for v in iterator(b; from_bgen_start=true)
dose = minor_allele_dosage!(b, v; T=T)
copyto!(@view(G[i, :]), dose)
Gchr[i], Gpos[i], GsnpID[i], Gref[i], Galt[i] =
chrom(v), pos(v), rsid(v), major_allele(v), minor_allele(v)
i += 1
clear!(v)
end
return G, Gchr, Gpos, GsnpID, Gref, Galt
end
This convert_ht
does not work due to major_allele
and minor_allele
not working.
"""
convert_ht(b::Bgen)
Import phased haplotypes as a `BitMatrix`, and store chr/pos/snpID/ref/alt.
# Input
- `b`: a `Bgen` object. Each variant must be phased and samples must be diploid
# Output
- `H`: a `p × 2n` matrix of type `T`. Each column is a haplotype.
- `Hchr`: Vector of `String`s holding chromosome number for each variant
- `Hpos`: Vector of `Int` holding each variant's position
- `HsnpID`: Vector of `String`s holding variant ID for each variant
- `Href`: Vector of `String`s holding reference allele for each variant
- `Halt`: Vector of `String`s holding alterante allele for each variant
"""
function convert_ht(b::Bgen)
n = 2n_samples(b)
p = n_variants(b)
# return arrays
H = BitMatrix(undef, p, n)
Hchr = Vector{String}(undef, p)
Hpos = Vector{Int}(undef, p)
HsnpID = Vector{String}(undef, p)
Href = Vector{String}(undef, p)
Halt = Vector{String}(undef, p)
# loop over each variant
i = 1
for v in iterator(b; from_bgen_start=true)
dose = probabilities!(b, v)
phased(v) || error("variant $(rsid(v)) at position $(pos(v)) not phased!")
for j in 1:n_samples(b)
Hi = @view(dose[:, j])
H[i, 2j - 1] = read_haplotype1(Hi)
H[i, 2j] = read_haplotype2(Hi)
end
Hchr[i], Hpos[i], HsnpID[i], Href[i], Halt[i] =
chrom(v), pos(v), rsid(v), major_allele(v), minor_allele(v)
i += 1
clear!(v)
end
return H, Hchr, Hpos, HsnpID, Href, Halt
end
read_haplotype1(Hi::AbstractVector) = Hi[2] ≥ 0.5 ? true : false
read_haplotype2(Hi::AbstractVector) = Hi[4] ≥ 0.5 ? true : false
Do you know of a dataset that is in both VCF
and BGEN
format? That would be great for testing. Also, I'm guessing probabilities!
and minor_allele_dosage!
are allocating for every variant. If so, probably we need to modify it so it accepts preallocated vectors.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.