illumina / beadarrayfiles
Python library to parse file formats related to Illumina bead arrays
Hello,
Do you have recommendations for reading a cluster file with this library when it raises this error? Exception: Data block version in cluster file 7 not supported
I am trying to extract the GenTrain and cluster separation scores.
Thank you so much for making this repo. The B allele frequencies and log R values are important features for our studies, but our GTC files are version 3, which means we cannot get those two features from your library. Do you know of any other way to get these two features, or a way to convert IDAT files to GTC version 4 or 5?
Thank you very much!
Will support for Python 3 be added? Thank you.
Hi folks,
I am having an issue when using the IlluminaBeadArrayFiles library to convert a GTC file to PED format. A lot of SNPs in the converted PLINK file have 0|G when they should be A|G. Below is the part of my script related to this conversion.
import sys
import os
from IlluminaBeadArrayFiles import GenotypeCalls, BeadPoolManifest, code2genotype

def outputPlink(gtc_file, manifest_file, sample_name, plink_out_dir, genoThresh = 0.15):
    manifest = BeadPoolManifest(manifest_file)
    gtc = GenotypeCalls(gtc_file)
    GenoScores = gtc.get_genotype_scores()
    top_strand_genotypes = gtc.get_base_calls()
    outBase = plink_out_dir + '/' + sample_name
    allGenotypes = []
    with open(outBase + '.ped', 'w') as pedOut, open(outBase +'.map','w') as mapOut:
        for (name, chrom, map_info, source_strand_genotype, genoScore) in zip(manifest.names, manifest.chroms, manifest.map_infos, top_strand_genotypes, GenoScores):
            mapOut.write(' '.join([chrom, name, '0', str(map_info)]) + '\n')
            if source_strand_genotype == '--':
                geno = ['0', '0']
            else:
                geno = [source_strand_genotype[0], source_strand_genotype[1]]
            allGenotypes += geno
        pedOut.write(' '.join([sample_name, sample_name, '0', '0', '0', '-9'] + allGenotypes) + '\n')
Can this program be used to determine log r ratios and b allele frequencies? I have a set of .gtc files from an Illumina BeadChip run, and their corresponding manifest file in .bpm format. I’m interested in determining the log r ratios and b allele frequencies for each SNP, but when I use the applicable GenotypeCalls functions those values are returned as zeroes. I see how to use the NormalizationTransform functions to get the normalized x and y values, and the R and theta values from those. What would be the recommended next steps to obtain the log r ratios and b allele frequencies?
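For reference, a rough sketch of the commonly described next step, not the library's or GenomeStudio's actual implementation: R and theta are derived from the normalized intensities, then the B allele frequency and log R ratio are obtained by linear interpolation between the AA, AB and BB cluster centres. The function names and the `clusters` dictionary layout are hypothetical, and the cluster means are assumed to come from the cluster file for the same array.

```python
import math

def r_theta(x_norm, y_norm):
    """Convert normalized intensities to (R, theta), with theta scaled to [0, 1]."""
    r = x_norm + y_norm
    theta = (2.0 / math.pi) * math.atan2(y_norm, x_norm)
    return r, theta

def _interpolate(theta, t0, t1, v0, v1):
    """Linear interpolation of a value between two cluster centres."""
    return v0 + (theta - t0) / (t1 - t0) * (v1 - v0)

def baf_and_lrr(r, theta, clusters):
    """Estimate (BAF, LRR) for one locus.

    clusters is a hypothetical layout: {'AA': (theta_mean, r_mean), 'AB': ..., 'BB': ...}
    """
    (t_aa, r_aa), (t_ab, r_ab), (t_bb, r_bb) = clusters['AA'], clusters['AB'], clusters['BB']
    if theta <= t_aa:
        baf, r_expected = 0.0, r_aa
    elif theta < t_ab:
        baf = _interpolate(theta, t_aa, t_ab, 0.0, 0.5)
        r_expected = _interpolate(theta, t_aa, t_ab, r_aa, r_ab)
    elif theta < t_bb:
        baf = _interpolate(theta, t_ab, t_bb, 0.5, 1.0)
        r_expected = _interpolate(theta, t_ab, t_bb, r_ab, r_bb)
    else:
        baf, r_expected = 1.0, r_bb
    if r <= 0 or r_expected <= 0:
        return baf, float('nan')  # log ratio undefined for non-positive intensities
    return baf, math.log2(r / r_expected)

# Made-up cluster centres, for illustration only
example_clusters = {'AA': (0.1, 1.0), 'AB': (0.5, 1.2), 'BB': (0.9, 1.0)}
example_r, example_theta = r_theta(0.6, 0.6)
example_baf, example_lrr = baf_and_lrr(example_r, example_theta, example_clusters)
```

Values produced this way may not match GenomeStudio exactly, since its handling of edge cases is not described here.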
This line assigns a generator:
BeadArrayFiles/module/LocusAggregate.py
Lines 196 to 197 in dc4eb37
Which is then used here:
BeadArrayFiles/module/LocusAggregate.py
Lines 200 to 201 in dc4eb37
Here one GenerateLocusAggregate is created for each index in the loci group. However, when the first GenerateLocusAggregate is called it consumes buffer, meaning the subsequent GenerateLocusAggregate calls are empty.
Probably the simplest way to fix this is for the generator results to be copied into a list so they can be used multiple times:
buffer = list(LocusAggregate.load_buffer(
samples, loci_group[0], loci_group[-1] - loci_group[0] + 1, normalization_lookups))
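The behaviour described above can be reproduced without the library. This is a minimal illustration in which `load_buffer` is a hypothetical stand-in for any function that returns a generator; it is not the real `LocusAggregate.load_buffer`.

```python
def load_buffer(n):
    """Hypothetical stand-in for a loader that yields records lazily."""
    for i in range(n):
        yield i * 10

# A generator can only be iterated once.
gen = load_buffer(3)
first_pass = list(gen)    # consumes every item
second_pass = list(gen)   # nothing left to yield

# Materializing the results in a list allows repeated iteration.
buffer = list(load_buffer(3))
passes = [[value for value in buffer] for _ in range(2)]
```

The trade-off is memory: the list holds every record at once, whereas the generator held only one at a time.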
Hi! It was my understanding that the GenTrain score is "project" dependent and does not come from the cluster file. Is this correct?
If so, what does the score calculated in the GenTrain.py module refer to?
If not, am I correct in assuming that, when creating a GenomeStudio project with the same manifest and cluster file, I should always be getting the same GenTrain score for a certain SNP regardless of the samples?
Thank you for making this repo. Is there any way to get the file specification for cluster files (EGT files), in addition to the one that is already provided for GTC files in docs/? Thank you!
I am trying to convert .gtc data to .txt with the reference allele and alternative allele, so I am using GenotypeCalls and the code2genotype module. I got some weird results.
For example, the last snp of GSAMD-24v1-0_20011747_A1 is
index : 700078
chromosome : X
position : 99912338
SNP : [T/A]
strand : +
When I convert my .gtc file with this SNP, the resulting genotype call is BB and the base call is TT. It should be AA, right? I get a lot of this kind of strange output. The saddest part is that it works for some other rows... I have no idea why.
Greetings!
I have a question about changing the columns that are printed in the output txt file. How do I get two columns, one with GC scores and another with GT scores?
output_handle.write("[Data]\n")
output_handle.write(delim.join(["SNP Name", "Sample ID", "Chr", "Position", "GC Score", "Allele 1 - Plus", "Allele 2 - Plus", "GT Score"]) + "\n")
Something like this. How do I select the relevant attributes?
Hi team, I have some questions on the "get_control_x_intensities()" method.
from IlluminaBeadArrayFiles import GenotypeCalls, BeadPoolManifest, code2genotype
import sys
gtc_file=r"C:\202309880087_R01C02.gtc"
GenotypeCalls(gtc_file).get_control_x_intensities()
array([22163, 1337, 1341, 1025, 45830, 48170, 3044, 4004, 1773,
1749, 1357, 2127, 1042, 1086, 1287, 972, 1075, 1241,
1435, 1055, 1209, 874, 1398], dtype=uint16)
GenotypeCalls(gtc_file).get_control_y_intensities()
array([ 1412, 1090, 17303, 775, 1341, 912, 32024, 31040, 741,
23807, 16977, 8654, 776, 721, 743, 684, 737, 758,
1904, 833, 1092, 562, 2295], dtype=uint16)
len(GenotypeCalls(gtc_file).get_control_y_intensities())
23
Hi, thanks for providing this Python library for array genotype calling.
I wanted to include top-strand genotypes in the final report by adding top_strand_genotypes = gtc.get_base_calls() to gtc_final_report.py, as shown below.
for gtc_file in samples:
    sys.stderr.write("Processing " + gtc_file + "\n")
    gtc_file = os.path.join(args.gtc_directory, gtc_file)
    gtc = GenotypeCalls(gtc_file)
    genotypes = gtc.get_genotypes()
    top_strand_genotypes = gtc.get_base_calls()
    plus_strand_genotypes = gtc.get_base_calls_plus_strand(manifest.snps, manifest.ref_strands)
    forward_strand_genotypes = gtc.get_base_calls_forward_strand(manifest.snps, manifest.source_strands)
    normalized_intensities = gtc.get_normalized_intensities(manifest.normalization_lookups)
    b_allele_freq = gtc.get_ballele_freqs()
    logr_ratio = gtc.get_logr_ratios()
    assert len(genotypes) == len(manifest.names)
    for (name, chrom, map_info, genotype, top_strand_genotype, ref_strand_genotype, source_strand_genotype, (x_norm, y_norm), b_freq, log_r_ratio) in zip(manifest.names, manifest.chroms, manifest.map_infos, genotypes, top_strand_genotypes, plus_strand_genotypes, forward_strand_genotypes, normalized_intensities, b_allele_freq, logr_ratio):
        output_handle.write(delim.join([name, os.path.basename(gtc_file)[:-4], chrom, str(map_info), code2genotype[genotype], top_strand_genotype, ref_strand_genotype, source_strand_genotype, str(x_norm), str(y_norm), str(b_freq), str(log_r_ratio)]) + "\n")
However, I encountered the issue below.
Traceback (most recent call last):
  File "gtc_gp2_final_report.py", line 57, in <module>
    output_handle.write(delim.join([name, os.path.basename(gtc_file)[:-4], chrom, str(map_info), code2genotype[genotype], top_strand_genotype, ref_strand_genotype, source_strand_genotype, str(x_norm), str(y_norm), str(b_freq), str(log_r_ratio)]) + "\n")
TypeError: sequence item 5: expected str instance, bytes found
When I removed the parts related to top_strand_genotype, the script worked.
I am not sure what went wrong and how to modify it.
If someone has experience calling TOP genotypes, I would appreciate your input on this matter.
Thanks in advance.
Zih-Hua
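The TypeError says that item 5 of the joined list (top_strand_genotype in the call above) is a bytes object rather than a str. One possible workaround, sketched here under the assumption that the values are plain ASCII base calls, is to decode any bytes field before joining. The helper name `to_text` and the sample row are hypothetical.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is a bytes object."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)

# Illustrative row mixing str and bytes fields, similar in shape to the report line
fields = ["rs123", "sample1", "1", "12345", "AB", b"AG", "AG", "AG"]
line = "\t".join(to_text(field) for field in fields)
```

Applying the same conversion to every field also removes the need for the separate str() calls on numeric values.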
I have a question about a small difference I observe in the normalized X and Y signals in final reports generated with GenomeStudio vs. the Illumina AutoConvert tool + the Python package IlluminaBeadArrayFiles::GenotypeCalls. I analyzed 96 samples on a GSA array, GSAMD-24v1-0_20011747_A5. For 99.9% of the SNPs the X and Y signals are exactly the same, but for about 100 SNPs the X and Y signals differ even though the X_RAW and Y_RAW values are the same.
Genome Studio results:
SNP | X_RAW | Y_RAW | X | Y |
---|---|---|---|---|
GSA-21:9871186 | 7248 | 2236 | 1.173 | 0.326 |
GSA-21:9952707 | 6435 | 2797 | 1.036 | 0.421 |
GSA-rs1006435 | 5119 | 2310 | 0.813 | 0.338 |
Illumina:BeadArrayFiles tool results for the same snps:
SNP | X_RAW | Y_RAW | X | Y |
---|---|---|---|---|
GSA-21:9871186 | 7248 | 2236 | 1.385 | 0.433 |
GSA-21:9952707 | 6435 | 2797 | 1.230 | 0.542 |
GSA-rs1006435 | 5119 | 2310 | 0.978 | 0.448 |
As you can see, the X_RAW and Y_RAW values are the same, but the normalized X and Y differ. Do you have an explanation for why we see this for only a small number of SNPs?
Hi there,
We have written a GWAS QC pipeline starting with GTC files.
Recently, when we tried to run this pipeline with a new .bpm file (Illumina Omni5Exome-4 v1.3), we encountered some errors.
The error is caused by a call to a function from IlluminaBeadArrayFiles, the gtc parsing library that Illumina made. It’s this call that’s causing the error:
norm = gtc.get_normalized_intensities(manifest.normalization_lookups)
Below is how I run the related code:
>>> from IlluminaBeadArrayFiles import GenotypeCalls, BeadPoolManifest, code2genotype
>>> manifest_file = 'InfiniumOmni5Exome-4v1-3_A1.bpm'
>>> gtc_file = '203423200004_R01C01.gtc'
>>> gtc = GenotypeCalls(gtc_file)
>>> manifest = BeadPoolManifest(manifest_file)
>>> norm = gtc.get_normalized_intensities(manifest.normalization_lookups)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "IlluminaBeadArrayFiles/GenotypeCalls.py", line 581, in get_normalized_intensities
    return [normalization_transforms[lookup].normalize_intensities(x_raw, y_raw) for (x_raw, y_raw, lookup) in zip(self.get_raw_x_intensities(), self.get_raw_y_intensities(), normalization_lookups)]
IndexError: list index out of range
Here is the .bpm file used:
https://support.illumina.com/array/array_kits/infinium_humanomni5exome_beadchip_kit/downloads.html
Infinium Omni5Exome-4 v1.3 Manifest File (BPM Format - GRCh37)
185 MB
Nov 20, 2016
Attached is the test GTC file used in the above code. Please note that you need to change the file extension from ".gtc.txt" to ".gtc" for testing.
Would you please take a look at the issue and help me with this request?
Thanks.
Jia
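The traceback shows the IndexError coming from normalization_transforms[lookup], so at least one lookup value from the manifest is not a valid index into the list of transforms read from the GTC file. A small diagnostic along these lines could locate the offending loci. The function name is hypothetical and it works on plain lists, so the lookups and the number of transforms would need to be extracted from the two files first.

```python
def out_of_range_lookups(lookups, num_transforms):
    """Return (locus_index, lookup) pairs whose lookup is not a valid transform index."""
    return [
        (locus_index, lookup)
        for locus_index, lookup in enumerate(lookups)
        if not 0 <= lookup < num_transforms
    ]

# Illustrative values only: three transforms, one locus pointing at index 5
bad = out_of_range_lookups([0, 1, 2, 5], 3)
```

A non-empty result would be consistent with the manifest and the GTC file not describing the same product or version, although that is only one possible cause.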
Hi,
I'm trying to use BeadArrayFiles to generate a report that includes information about the Illumina (design) strand, as well as the source strand. It looks like there is not a class corresponding to the Illumina strand, is this correct? Can you tell me how I might extract this information?
Thanks!
I used the iaap-cli software to convert IDAT files to GTC files.
When I use a GTC file produced by iaap-cli, the BeadArrayFiles package raises the error Exception: GTC file is incomplete.
This is the same issue as #20.
Are the two pieces of software incompatible?
Looking forward to your reply!
Dear all,
I have many GTC files and want to convert them into final reports. Sadly, I can't get the program running. I have the manifest and the cluster file.
Do you have an example for running gtc_final_report.py with all the parameters?
Best,