
beadarrayfiles's People

Contributors

daveware-nv, jjzieve, jzieve, kelleyryanm, lmtani, malam1-illumina, mialam24, minsungking

beadarrayfiles's Issues

GTC file version 3 can't output B allele frequencies and logr

Thank you so much for making this repo. The B allele frequencies and logR values are important features for our studies, but our GTC files are version 3, which means we cannot get those two features from your library. Do you know of any other way to get these two features, or a way to convert IDAT files to GTC version 4 or 5?

Thank you very much!

gtc to plink ped format - 0|G instead of A|G

Hi folks,

I am having an issue when using the IlluminaBeadArrayFiles library to convert a GTC file to PLINK ped format. Many SNPs in the converted PLINK file have 0|G when they should be A|G. Below is the part of my script related to this conversion.

import sys
import os
from IlluminaBeadArrayFiles import GenotypeCalls, BeadPoolManifest, code2genotype

def outputPlink(gtc_file, manifest_file, sample_name, plink_out_dir, genoThresh = 0.15):
    manifest = BeadPoolManifest(manifest_file)
    gtc = GenotypeCalls(gtc_file)
    GenoScores = gtc.get_genotype_scores()
    top_strand_genotypes = gtc.get_base_calls()
    outBase = plink_out_dir + '/' + sample_name
    allGenotypes = []
    with open(outBase + '.ped', 'w') as pedOut, open(outBase +'.map','w') as mapOut:
        for (name, chrom, map_info, source_strand_genotype, genoScore) in zip(manifest.names, manifest.chroms, manifest.map_infos, top_strand_genotypes, GenoScores):
            # One .map line per locus
            mapOut.write(' '.join([chrom, name, '0', str(map_info)]) + '\n')
            if source_strand_genotype == '--':
                # No call: PLINK missing genotype
                geno = ['0', '0']
            else:
                geno = [source_strand_genotype[0], source_strand_genotype[1]]
            allGenotypes += geno
        # One .ped line for the sample, written after all loci are collected
        pedOut.write(' '.join([sample_name, sample_name, '0', '0', '0', '-9'] + allGenotypes) + '\n')

Determining log r ratio and b allele frequency

Can this program be used to determine log r ratios and b allele frequencies? I have a set of .gtc files from an Illumina BeadChip run, and their corresponding manifest file in .bpm format. I’m interested in determining the log r ratios and b allele frequencies for each SNP, but when I use the applicable GenotypeCalls functions those values are returned as zeroes. I see how to use the NormalizationTransform functions to get the normalized x and y values, and the R and theta values from those. What would be the recommended next steps to obtain the log r ratios and b allele frequencies?
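For reference, a minimal sketch of the commonly described calculation, not code taken from this library: R and theta come from a polar-style transform of the normalized intensities, and the B allele frequency and log R ratio are then obtained by linear interpolation between the genotype cluster centres. The function names and the `theta_means` / `r_means` inputs (AA, AB, BB cluster centres, which would normally come from a cluster file) are assumptions made for illustration, and the values a given Illumina tool reports may differ in detail.

```python
import math

def r_theta(x_norm, y_norm):
    """Convert normalized intensities to (R, theta), with theta scaled to 0..1."""
    r = x_norm + y_norm
    theta = (2.0 / math.pi) * math.atan2(y_norm, x_norm)
    return r, theta

def baf_lrr(r, theta, theta_means, r_means):
    """Interpolate BAF and log R ratio between hypothetical cluster centres.

    theta_means and r_means are (AA, AB, BB) tuples supplied by the caller.
    """
    t_aa, t_ab, t_bb = theta_means
    r_aa, r_ab, r_bb = r_means
    if theta <= t_aa:
        baf, r_expected = 0.0, r_aa
    elif theta < t_ab:
        frac = (theta - t_aa) / (t_ab - t_aa)
        baf = 0.5 * frac
        r_expected = r_aa + frac * (r_ab - r_aa)
    elif theta < t_bb:
        frac = (theta - t_ab) / (t_bb - t_ab)
        baf = 0.5 + 0.5 * frac
        r_expected = r_ab + frac * (r_bb - r_ab)
    else:
        baf, r_expected = 1.0, r_bb
    # The log is undefined for non-positive intensities
    if r <= 0 or r_expected <= 0:
        return baf, float("nan")
    return baf, math.log2(r / r_expected)

r, theta = r_theta(1.0, 1.0)
baf, lrr = baf_lrr(r, theta, (0.0, 0.5, 1.0), (1.0, 2.0, 1.0))
```

With equal normalized intensities the locus sits on the heterozygote centre in this toy setup, so the sketch gives a BAF of 0.5 and a log R ratio of 0.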

LocusAggregate.aggregate_samples only populates first LocusAggregate

This line assigns a generator:

buffer = LocusAggregate.load_buffer(
samples, loci_group[0], loci_group[-1] - loci_group[0] + 1, normalization_lookups)

Which is then used here:

aggregates = map(GenerateLocusAggregate(
buffer, loci_group[0]), loci_group)

Here GenerateLocusAggregate is invoked once for each index in the loci group. However, the first call consumes buffer, so the subsequent calls find it empty.

Probably the simplest fix is to copy the generator results into a list so they can be used multiple times:

buffer = list(LocusAggregate.load_buffer(
                samples, loci_group[0], loci_group[-1] - loci_group[0] + 1, normalization_lookups))
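The behaviour can be reproduced without the library. The sketch below uses stand-in names (`load_buffer` and `PickLocus` are hypothetical and only mimic the shape of the code above): a callable that iterates a shared generator gets data on its first call only, while the same callable over a list gets data every time.

```python
def load_buffer(num_samples, num_loci):
    """Stand-in generator: yields one row of per-locus values per sample."""
    for sample in range(num_samples):
        yield [sample * 100 + locus for locus in range(num_loci)]

class PickLocus:
    """Stand-in callable: collects one locus column from every buffered row."""
    def __init__(self, buffer):
        self.buffer = buffer

    def __call__(self, locus):
        return [row[locus] for row in self.buffer]

loci = [0, 1, 2]

# Generator passed directly: exhausted after the first call
lazy = list(map(PickLocus(load_buffer(2, 3)), loci))

# Generator materialized into a list: every call sees all rows
eager = list(map(PickLocus(list(load_buffer(2, 3))), loci))
```

The trade-off of the list approach is memory: the whole buffer for the loci group is held at once rather than streamed.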

Gentrain score in ClusterFile.py

Hi! It was my understanding that the GenTrain score is "project" dependent and does not come from the cluster file. Is this correct?
If so, what does the score calculated in the GenTrain.py module refer to?

If not, am I correct in assuming that, when creating a GenomeStudio project with the same manifest and cluster file, I should always be getting the same GenTrain score for a certain SNP regardless of the samples?

File specification for cluster files

Thank you for making this repo. Is there any way to get the file specification for cluster files (EGT files), in addition to the one already provided for GTC files in docs/? Thank you!

Weird Basecall result

I am trying to convert .gtc data to .txt with the reference allele and alternative allele, so I am using the GenotypeCalls and code2genotype modules. I got some weird results.

For example, the last snp of GSAMD-24v1-0_20011747_A1 is

index : 700078
chromosome : X
position : 99912338
SNP : [T/A]
strand : +

When I convert my .gtc file with this SNP, the resulting genotype call was BB and the base call was TT. It should be AA, right? I got a lot of this kind of strange output. The saddest part is that it works for some other rows. I have no idea why.

Obtaining GC and GT scores as columns in gtc_final_report.py

Greetings!
I have a question about changing the columns that are printed in the output txt file. How do I get two columns, one with GC scores and another with GT scores?

output_handle.write("[Data]\n")
output_handle.write(delim.join(["SNP Name", "Sample ID", "Chr", "Position", "GC Score", "Allele 1 - Plus", "Allele 2 - Plus", "GT Score"]) + "\n")

Something like this. How do I select the relevant attributes?
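One way to think about it is that every column is a per-locus list in manifest order, and the row loop simply zips those lists in the same order as the header. The sketch below is self-contained and uses placeholder lists; the function name `write_data_section` is hypothetical. Another issue on this page calls `gtc.get_genotype_scores()`, which may be a source for the GC score column, while where the GT score values would come from is not stated here and is left as an input.

```python
import io

HEADER = ["SNP Name", "Sample ID", "Chr", "Position", "GC Score",
          "Allele 1 - Plus", "Allele 2 - Plus", "GT Score"]

def write_data_section(handle, sample_id, names, chroms, positions,
                       gc_scores, plus_calls, gt_scores, delim="\t"):
    """Write a [Data] section with one row per locus."""
    handle.write("[Data]\n")
    handle.write(delim.join(HEADER) + "\n")
    for name, chrom, pos, gc, call, gt in zip(
            names, chroms, positions, gc_scores, plus_calls, gt_scores):
        # call is a two-character string such as "AG"; split it into two columns
        row = [name, sample_id, chrom, str(pos), "%.4f" % gc,
               call[0], call[1], "%.4f" % gt]
        handle.write(delim.join(row) + "\n")

# Placeholder data for a single locus
out = io.StringIO()
write_data_section(out, "sample1", ["rs1"], ["1"], [12345],
                   [0.9123], ["AG"], [0.8450])
report_text = out.getvalue()
```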

Need further clarifications on the return values of GenotypeCalls.py module

Hi team, I have some questions on the "get_control_x_intensities()" method.

  1. It returns a numpy array with a length of 23. What are these 23 numbers standing for? We can always see the control types and their values on GenomeStudio's control dashboard, such as DNP (High) | DNP (Bgnd) | Biotin (High) | Biotin (Bgnd) | Extension (A) | Extension (T) and so on.
  2. Does x_intensity stand for the red channel values and y_intensity for the green?
  3. The MethylationEPIC array has more control types than other genotyping arrays. Does that mean the method will return a numpy array longer than 23 when get_control_x_intensities() is applied to methylation array GTCs? If yes, could you specify the corresponding control types?
  4. Are these return values the same as what we can see in GenomeStudio's control dashboard?

See below what I pulled out from 1 MethylationEpic gtc, Thanks in advance!

from IlluminaBeadArrayFiles import GenotypeCalls, BeadPoolManifest, code2genotype
import sys
gtc_file=r"C:\202309880087_R01C02.gtc"
GenotypeCalls(gtc_file).get_control_x_intensities()

array([22163, 1337, 1341, 1025, 45830, 48170, 3044, 4004, 1773,
1749, 1357, 2127, 1042, 1086, 1287, 972, 1075, 1241,
1435, 1055, 1209, 874, 1398], dtype=uint16)

GenotypeCalls(gtc_file).get_control_y_intensities()

array([ 1412, 1090, 17303, 775, 1341, 912, 32024, 31040, 741,
23807, 16977, 8654, 776, 721, 743, 684, 737, 758,
1904, 833, 1092, 562, 2295], dtype=uint16)

len(GenotypeCalls(gtc_file).get_control_y_intensities())

23


Error on calling TOP_strand genotypes

Hi thanks for providing this python library for array genotype calling.

I wanted to include top_strand genotypes as a part of the final report by including top_strand_genotypes = gtc.get_base_calls() in the gtc_final_report.py as shown below.

for gtc_file in samples:
        sys.stderr.write("Processing " + gtc_file + "\n")
        gtc_file = os.path.join(args.gtc_directory, gtc_file)
        gtc = GenotypeCalls(gtc_file)
        genotypes = gtc.get_genotypes()
        top_strand_genotypes = gtc.get_base_calls()
        plus_strand_genotypes = gtc.get_base_calls_plus_strand(manifest.snps, manifest.ref_strands)
        forward_strand_genotypes = gtc.get_base_calls_forward_strand(manifest.snps, manifest.source_strands)
        normalized_intensities = gtc.get_normalized_intensities(manifest.normalization_lookups)
        b_allele_freq = gtc.get_ballele_freqs()
        logr_ratio = gtc.get_logr_ratios()

        assert len(genotypes) == len(manifest.names)
        for (name, chrom, map_info, genotype, top_strand_genotype, ref_strand_genotype, source_strand_genotype, (x_norm, y_norm), b_freq, log_r_ratio) in zip(manifest.names, manifest.chroms, manifest.map_infos, genotypes, top_strand_genotypes, plus_strand_genotypes, forward_strand_genotypes, normalized_intensities, b_allele_freq, logr_ratio):
            output_handle.write(delim.join([name, os.path.basename(gtc_file)[:-4], chrom, str(map_info), code2genotype[genotype], top_strand_genotype, ref_strand_genotype, source_strand_genotype, str(x_norm), str(y_norm), str(b_freq), str(log_r_ratio)])  + "\n")

However, I encountered the issue below.

Traceback (most recent call last):
  File "gtc_gp2_final_report.py", line 57, in <module>
    output_handle.write(delim.join([name, os.path.basename(gtc_file)[:-4], chrom, str(map_info), code2genotype[genotype], top_strand_genotype, ref_strand_genotype, source_strand_genotype, str(x_norm), str(y_norm), str(b_freq), str(log_r_ratio)])  + "\n")
TypeError: sequence item 5: expected str instance, bytes found

When I removed the parts related to top_strand_genotype, the script worked.
I am not sure what went wrong and how to modify it.
If someone has experience calling TOP genotypes, I would appreciate your input on this matter.

Thanks in advance.
Zih-Hua
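The traceback says that item 5 of the joined list, which is top_strand_genotype, is a bytes object, while str.join accepts only str. A minimal sketch of one workaround, assuming the values are ASCII base calls such as b"AG" (the helper name `to_text` is made up for this example):

```python
def to_text(value, encoding="ascii"):
    """Return value as str, decoding it first if it is bytes."""
    return value.decode(encoding) if isinstance(value, bytes) else value

# Mixed str/bytes fields, as in the failing join
fields = ["rs123", "sample1", b"AG", "AG"]
line = "\t".join(to_text(field) for field in fields)
```

In the report loop this would amount to wrapping top_strand_genotype in the same kind of decode before passing it to delim.join.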

Small difference observed in the normalized X and Y signals between GenomeStudio vs Illumina:BeadArrayFiles

I have a question about a small difference I observe in the normalized X and Y signals between final reports generated with GenomeStudio and those generated with the Illumina AutoConvert tool plus the Python package IlluminaBeadArrayFiles (GenotypeCalls). I analyzed 96 samples on a GSA array, GSAMD-24v1-0_20011747_A5. For 99.9% of the SNPs the X and Y signals are exactly the same. But for about 100 SNPs the X and Y signals differ, even though the X_RAW and Y_RAW values are the same.

Genome Studio results:

SNP | X_RAW | Y_RAW | X | Y
-- | -- | -- | -- | --
GSA-21:9871186 | 7248 | 2236 | 1.173 | 0.326
GSA-21:9952707 | 6435 | 2797 | 1.036 | 0.421
GSA-rs1006435 | 5119 | 2310 | 0.813 | 0.338

Illumina:BeadArrayFiles tool results for the same snps:

SNP | X_RAW | Y_RAW | X | Y
-- | -- | -- | -- | --
GSA-21:9871186 | 7248 | 2236 | 1.385 | 0.433
GSA-21:9952707 | 6435 | 2797 | 1.230 | 0.542
GSA-rs1006435 | 5119 | 2310 | 0.978 | 0.448

As you can see, the X_RAW and Y_RAW values are the same, but the normalized X and Y differ. Do you have an explanation for why we see this, and only for a small number of SNPs?

Error from a call to a function from IlluminaBeadArrayFiles

Hi there,

We have written a GWAS QC pipeline starting with GTC files.

Recently, when we tried to run this pipeline with a new .bpm file (Illumina Omni5Exome-4 v1.3), we encountered some errors.

The error is caused by a call to a function from IlluminaBeadArrayFiles, the gtc parsing library that Illumina made. It’s this call that’s causing the error:
norm = gtc.get_normalized_intensities(manifest.normalization_lookups)

Below is how I run the related code:

>>> from IlluminaBeadArrayFiles import GenotypeCalls, BeadPoolManifest, code2genotype
>>> manifest_file = 'InfiniumOmni5Exome-4v1-3_A1.bpm'
>>> gtc_file = '203423200004_R01C01.gtc'
>>> gtc = GenotypeCalls(gtc_file)
>>> manifest = BeadPoolManifest(manifest_file)
>>> norm = gtc.get_normalized_intensities(manifest.normalization_lookups)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "IlluminaBeadArrayFiles/GenotypeCalls.py", line 581, in get_normalized_intensities
    return [normalization_transforms[lookup].normalize_intensities(x_raw, y_raw) for (x_raw, y_raw, lookup) in zip(self.get_raw_x_intensities(), self.get_raw_y_intensities(), normalization_lookups)]
IndexError: list index out of range

Here is the .bpm file used:
https://support.illumina.com/array/array_kits/infinium_humanomni5exome_beadchip_kit/downloads.html
Infinium Omni5Exome-4 v1.3 Manifest File (BPM Format - GRCh37)
185 MB
Nov 20, 2016

Attached is the test gtc file used in the above code. Please note that you need to change the file extension from ".gtc.txt" to ".gtc" for testing.

Would you please take a look at the issue and help me with this request?

Thanks.

Jia
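The failing expression is normalization_transforms[lookup], so the IndexError means at least one lookup value from the manifest is not a valid index into the list of transforms read from the GTC. A small diagnostic sketch, assuming you have the lookups as a sequence of integers and know how many transforms the GTC holds (the function name and both inputs are hypothetical stand-ins, not library calls):

```python
def find_bad_lookups(normalization_lookups, num_transforms):
    """Return (locus_index, lookup) pairs that fall outside the transform list."""
    return [(index, lookup)
            for index, lookup in enumerate(normalization_lookups)
            if not 0 <= lookup < num_transforms]

# Toy data: three transforms available, one locus points at transform 5
bad = find_bad_lookups([0, 1, 2, 5], 3)
```

If such a check reports offending loci, one possibility worth ruling out is that the GTC was generated with a different manifest than the one being loaded.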

Illumina strand from manifest

Hi,

I'm trying to use BeadArrayFiles to generate a report that includes information about the Illumina (design) strand, as well as the source strand. It looks like there is not a class corresponding to the Illumina strand, is this correct? Can you tell me how I might extract this information?

Thanks!

Error: Exception: GTC file is incomplete

I used the iaap-cli software to convert an IDAT file to a GTC file.
When I use the GTC file produced by iaap-cli, the BeadArrayFiles package raises the error "Exception: GTC file is incomplete".
This is the same issue as #20.
Are the two tools incompatible?
Looking forward to your reply!

gtc_final_report.py

Dear all,

I have many GTC files and want to convert them into final reports. Sadly, I can't get the program running. I have the manifest and the cluster file.

Do you have an example for running gtc_final_report.py with all the parameters?

Best,
