mbekritsky / useq Goto Github PK
View Code? Open in Web Editor NEWLeveraging population information to detect, align, genotype, and detect de novo mutations in Illumina sequencing data
Leveraging population information to detect, align, genotype, and detect de novo mutations in Illumina sequencing data
1. PREREQUISITES Before running uSeq, the following libraries and software must be installed: 1. BamTools 2. BWA 3. SamTools 4. Picard 5. R Please make sure to follow the instructions for building and installing each of these files. BamTools must be installed in the uSeq root directory. 3. INSTALLING USEQ The primary deetction and read processing components of uSeq are written in C++ and must be compiled before the first time uSeq is used. Once you have downloaded uSeq to your uSeq root directory (USEQ_ROOT), you will have one directory containing the C++, Perl, R, and BASH code used by uSeq. The core programs used by uSeq's individual module are all written in C++. Any files ending in .h or .cc are either the classes and object used by one or more pipeline components, or the main programs themselves. To build the programs used by uSeq, simply run the following command: make all Afterwards, there should be four programs: tetrascan, reindexBam, profiler, and summarizer. Tetrascan is uSeq's custom microsatellite detection algorithm, reindexBam converts alignments from a condensed reference genome to a normal reference genome, profiler creates microsatellite profile (*.msp), and summarizer creates a summary table for all microsatellite loci observed in a population. The individual component is composed of a master Perl script which handles the calling of component-specific Perl scripts, as well as logging and tracking STDERR and STDOUT from all commands that are called. It is currently called MicroSeq2.pl. Several Perl modules are available for use with uSeq. They may be found in perllib. To aid in the preparation and analysis of uSeq output, several short scripts are provided in the scripts directory. For SSC families, there is a copyPilot2Files.sh script that takes a target directory and file as input, which is then used to create a directory with the following setup: project/family/individual/ Inside each individualName directory, soft links are created to the raw sequencing input uSeq will take as input. These can be in BAM or FASTQ format. uSeq has been extensively tested with BAM input data. In addition, each directory has a relationship.txt file, which defines an individual's relationship to the affected child (since this is a study of de novo microsatellite mutations in children with autism). Once the project directory has been created by copyPilot2Files.sh, uSeq can be launched over SGE using runFamilies.sh. runFamilies must have a complete path specified to the project directory. If uSeq is being run over a distributed filesystem, that includes the path from the directory with temporary mount points for other nodes on the filesystem, typically the /mnt directory. uSeqStatusCheck.sh can be used to report the number of SGE jobs waiting, holding, or running while the individual component is running. Once uSeq has completed running on a study population, findErrors.sh will look through the log files and STDERR files generated by uSeq to identify whether any individual in the population has had any errors in the course of the individual component processing. If any individuals need to be rerun, the complete paths to the individual's directory, as well as the stage to start from (scan or profile are the most common options) can be specified in a file that is then taken as input by rerunPeople.sh. Microsatellite profile counts for the entire population can be obtained using getProfileReplicateCounts.sh, which takes a project directory as input. The number of reads processed by each stage in the individual component are reported by getPipelineCounts.sh. This allows for tracking to make sure the number of reads is consistent for each study individual throughout the pipeline. In addition, data regarding the processing of PCR duplicates is obtained using getReplicateCounts.sh. Please note that many of the files in the scripts directory are tailored specifically to the SSC project. While they do have abstract concepts that may be suitable to other projects, they must be adapted for a particular project. The uSeq population module starts by summarizing all the microsatellite loci observed in a study population using summarizer, then identifying well-covered loci using table2mat.pl. Once a set of well-covered loci have been identified, uSeq calls microsatellite genotypes. The uSeq genotyper is written in R, and can be found in the emGenotyper directory. Each R script takes a directory as input and looks for the following files: 1. allele_matrix.txt - the file generated by table2mat.pl with coverage for each individual at each well-covered locus. 2. allele_matrix_info.txt - the file generated by table2mat.pl with information about the well-covered alleles it has identified 3. person_info.txt - a tab-delimited file providing population information for the genotyper (IDs and relations), which is identical to microsat.refOnly.columns.txt The coverage modeler and EM genotyper can be found in emGenotyper in the uSeq directory. 2. CREATING A CONDENSED REFERENCE GENOME To create a condensed reference genome, you must first download a copy of the appropriate reference genome (e.g., hg19). Once the download has completed, the genome is processed using the tetrascan microsatellite detection algorithm. Tetrascan must be run with the same microsatellite detection thresholds that will be used to detect microsatellites in sequencing reads. In the simplest case, tetrascan can be run using the following command: /path/to/uSeq/tetrascan -f /path/to/reference/genome/referenceGenome.fa If you would like to change any default detection settings, you can find the appropriate optional commands by running /path/to/uSeq/tetrascan -h Once microsatellite detection of the reference genome is complete, there should be a file with a condensed reference genome with the following path: /path/to/reference/genome/referenceGenome.mod.fa You can create a Burrows-Wheeler transform for use with bwa using the command: /path/to/bwa index /path/to/reference/genome/referenceGenome.mod.fa Be sure to select the appropriate indexing algorithm for the length of the genome being transformed. 3. SETTING UP CONFIG.TXT uSeq uses a config file to specify the paths to the applications it uses (i.e. BWA, SamTools, and Picard). If the programs are in the user's path, config.pl can be run in auto-detect mode. If the programs are not in the user's path, their locations must be specified using the appropriate flags (--bwa,--samtools, or --picard). The config file is generated by the following command while in the uSeq directory: ./config.pl [optional flags with paths to programs] 4. RUNNING USEQ INDIVIDUAL MODULE 4A. RUNNING USEQ ON POPULATION DATA uSeq has been used extensively on SSC exome sequencing data, and has a sample means of launching for an entire population. Currently, uSeq generally requires 4 - 5 arguments. It requires: 1. A path to the condensed reference genome (if you're on a server with distributed memory, be sure to include the hostname of the server where the condensed genome is stored) 2. A project ID which is used to store some files throughout a distributed cluster 3. An argument defining a job manager to use to launch uSeq instances (currently, only SGE is implemented) 4. A directory containing sequencing information for a single individual (uSeq has been run extensively using paired-end BAM files) 5. A load balance scheme. The default load balance scheme is 2, which minimizes the number of threads used for each stage of the pipeline. This allows for many samples to be processed simultaneously. In addition, other commands can be used to change different default settings, such as detector thresholds. To see a complete list of uSeq command-line options, run the following command: /path/to/uSeq/MicroSeq2.pl --help (still needs to be implemented) uSeq instances can be run using SGE as well. It is recommended that uSeq be run with the -V flag in SGE, as well as the -e and -o flags. In addition the -v flag must be specified with the following argument: -v USEQ_CONFIG=/path/to/uSeq/config.txt Currently, uSeq instances are submitted to a special queue that limits the number of concurrently running instances to 120. This ensures that uSeq instances do not overtake the cluster. The current command used to submit uSeq instances to SGE looks like this: qsub -V -P [queue identifier] -q [uSeq dedicated queuename] -S /bin/bash -v USEQ_CONFIG=/path/to/uSeq/config.txt \ -e [path to STDERR for job submission] -o [path to STDOUT for job submission] /path/to/uSeq/MicroSeq2.pl --dir \ /path/to/individual/sequencing/directory --genome /path/to/condensed/reference/genome --sge --project_id [project ID]\ --lb_scheme 2 In its latest instance, the uSeq individual module has processed exome sequencing data from over 5,300 individuals in a little under a week. Each directory for each individual will have log files describing the commands run and the actions taken in the course of the uSeq individual module, as well as the STDERR from each module component. Rather than look at potentially tens of thousands of log files and STDERR files, a simple BASH script is able to go through a directory containing individual sequencing directories and look for any errors. This is currently implemented as findErrors.sh. This will write to STDOUT for each individual in the directory and report whether any errors were seen and where the errors occurred. If any errors occurred during a uSeq run, please contact Mitchell Bekritsky ([email protected]). 4B. RUNNING USEQ ON INDIVIDUALS TO BE WRITTEN 5. RUNNING USEQ POPULATION MODULE -- CALLING GENOTYPES USING POPULATION-INFERRED MODEL PARAMETERS 5A. SUMMARIZING MICROSATELLITE PROFILES IN A STUDY POPULATION Once a population has completed processing and any errors are resolved, uSeq aggregates data from the entire population to infer model parameters and call genotypes. Currently, a large summary table for all loci detected in any individual within the population is construted using summarizer in the uSeq directory. The only argument summarizer takes is a path to a directory containing microsatellite profiles for a study population. Currently, summarizer is specifically implemented for SSC data, but adjustments can be made later. It can be run with the following coomand: /path/to/uSeq/summarizer /path/to/uSeq/population uSeq's summarizer currently only reports microsatellite loci that are found in its reference set, and produces three files as output: 1. microsat.refOnly.columns.txt - contains all the pertinent information for each individual in the population (in this case, SSC family and individual ID, as well as relation to proband). 2. microsat.refOnly.index.txt - contains all the pertinent information for each microsatellite locus observed in any individual from the study population (chromosome, start position, motif, reference length, number of population individuals with coverage at the locus, maximum coverage at the locus in any individual, and total coverage at the locus throughout the population). 3. microsat.refOnly.counts.txt - a tab-delimited matrix with one row for each locus and one column for each study individual. Each entry reports the number of alleles observed, the the allele lengths observed, and their coverage. Each entry is semicolon-delimited. If any entry has more than 2 alleles, the allele lengths and their coverages are comma-delimited. The summarizer can run for anywhere from a few days to a few weeks depending on the study population size. 5B. IDENTIFYING WELL-COVERED LOCI Once the summarizer has completed, well-covered loci are identified throughout the population. This is currently done using a simple Perl script called table2mat.pl. table2mat takes the counts and index file produced as output by summarizer, as well as thresholds that will be used to determine which loci are considered well-covered. Loci may be filtered by the percent of the population with coverage at the locus, its maximum individual coverage, or its average coverage across the population. Run table2mat.pl with the following command: /path/to/uSeq/table2mat.pl --counts <counts file> --index <locus index file> [ --minCount <threshold for maximum \ individual coverage at a locus> --minPopPct <minimum percent of study population with coverage> --meanPopCov \ <mean coverage at a locus>] 5C. SETTING UP THE USEQ GENOTYPER CONFIG FILE The uSeq genotyper is in a separate directory currently, but needs to be merged with the rest of uSeq. It has its own tab-delimited config file with three lines: 1. src_dir - the directory with R source files for the uSeq genotyper, which should be /path/to/uSeq/genotyper/src/ 2. anno_dir - a directory with BED files providing RefSeq and CCDS annotations for microsatellite loci (should make into a a data pack for uSeq). 3. cran_repo - the default repository to download R packages from if they are not already installed 5D. GENERATING SVD-DERIVED EXPECTED COVERAGE ESTIMATORS FOR EACH LOCUS AND INDIVIDUAL Expected coverage estimators are generated by coverageModeler.R. It only takes two options as input--the directory where the files generated by table2mat are found, and an optional verbosity. If not already installed, it will install IRLBA and R's getopt. To run coverageModeler, use the following command: /path/to/uSeq/genotyper/coverageModeler.R --directory=/path/to/directory/with/table2mat/output [--verbose=1] A verbosity of 0 will limit the output of the coverage modeler. The program will generate one file with an RData object, as well as several graphs. The graphs are of the total and mean coverage for every study individual, the Pearson correlation of observed and expected coverage values for every study individual, and the expected vs observed coverage. The RData object is a list with 3 items: the expected per-allele coverage matrix, the Pearson correlation of expected vs observed coverage for each study individual, and the number of components used to model expected coverage. 5E. CALLING MICROSATELLITE GENOTYPES In its current form, uSeq splits the well-covered loci into 10,000 locus chunks for genotyping. This serves two purpose: it allows the genotyper to achieve some degree of parallelism, and it allows genotypes to be called on an arbitrary number of loci. The genotyping array would otherwise be constrained by R's limit on the maximum number of elements in an array or matrix. Each genotype chunk is run using the following command: /path/to/uSeq/genotyper/chunkGenotyper.R --directory=/path/to/directory/with/table2mat/output --start.locus=[start locus] \| --num.loci=[number of loci per chunk] There is currently a barebones shell script that will automate this process if provided a directory that has the table2mat output files. It can be called as: /path/to/uSeq/genotyper/genotypeChunks.sh /path/to/directory/with/table2mat/output [number of loci per chunk] Genotype chunks will then use SGE to run the genotyper in parallel on well-covered loci. The genotyper will create a GenotypeInformation directory in /path/to/directory/with/table2mat/output, where RData files will be stored with the genotyper data structures for each chunk, as well as attendant locus information for each chunk. (Note: this would be very amenable to HDF5 files if we put in the effort to convert the format). 5F. CALCULATING MENDEL OBEDIENCE SCORES Once genotypes have been calculated for the entire population, Mendel obedience scores can be calculated on the same chunks. The command to run an individual chunk is /path/to/uSeq/genotyper/chunkFamilyGenotyper.R --directory=/path/to/directory/with/table2mat/output --start.locus=[start locus] \| --num.loci=[number of loci per chunk] As before, there is a barebones shell script that will automate this process and submit parallelized jobs to SGE. That command is /path/to/uSeq/genotyper/famGenotypeChunks.sh /path/to/directory/with/table2mat/output [number of loci per chunk] This command will add a RData files with family genotypes to the GenotypeInformation directory, which can then be evaluated for potential de novo mutations. Note: This would also be a good candidate for HDF5. It is also probably smarter to run the EM genotyper and calculate Mendel obedience scores simultaneously, with a Boolean flag allowing someone to only call genotypes or only calculate Mendel obedience scores. Also, this algorithm is currently only implemented for quads. We are in the midst of adapting it to arbitrary family structures. Finally, the kinship score could be calculated better (talk to Mitch). 5G. FINDING DE NOVO MUTATIONS Once family genotypes have been called, potential de novo mutations can be identified. This will depend on strong evidence of Mendel disobedience and kinship, as well as strong genotypes. To identify potential de novo mutations in any chunk, we can run /path/to/uSeq/genotyper/loadAndGraphDenovos.R --directory=/path/to/directory/with/table2mat/output --start.locus=[start locus] \| --num.loci=[number of loci per chunk] --mendel.threshold=[minimum Mendel obedience score threshold] --swap.threshold=[minimum kinship] \| --trio.confidence=[minimum trio confidence] --trio.p.null=[minimum trio null probability] --trio.noise.fit=[minimum trio noise fit] \| --trio.allele.fit=[minimum trio allele fit] --max.dn.noise=[max EM-estimated locus error rate] Defaults are set for all parameters, although the user can set their own thresholds to identify potential de novo microsatellite mutations. There is another barebones shell script to enable this to run on each chunk in parallel using SGE. That command is: /path/to/uSeq/genotyper/dnChunks.sh /path/to/directory/with/table2mat/output [number of loci per chunk] It does not currently take threshold arguments to filter de novo mutations. This command will create a set of graphs for each potential de novo mutation, including a family graph showing detailed coverage and family information for the family with the de novo mutation, a scatter plot of coverage for the reference and de novo allele, a scatter plot with the observed/expected coverage ratio for the reference and de novo allele, a 3D scatter plot of coverage, a genotype frequency table for the entire population at the locus with the de novo mutation, and a population heatmap showing the coverage throughout the population at the locus. OUTSTANDING ISSUES: TEST USEQ FOR FASTQ FILES AND ZIPPED FASTQ FILES, TEST LOCAL RUNS OF USEQ, TEST USEQ ON OTHER CLUSTERS TRY NEW KINSHIP SCORE USING MARGINAL PROBABILITIES FOR PARENTAL ALLELES
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.