Giter VIP home page Giter VIP logo

useq's Introduction

1. PREREQUISITES

Before running uSeq, the following libraries and software must be installed:
1. BamTools
2. BWA
3. SamTools
4. Picard
5. R

Please make sure to follow the instructions for building and installing each of these
files.  BamTools must be installed in the uSeq root directory.

3. INSTALLING USEQ

The primary deetction and read processing components of uSeq are written in C++ and must
be compiled before the first time uSeq is used.  Once you have downloaded uSeq to your 
uSeq root directory (USEQ_ROOT), you will have one directory containing the C++, Perl,
R, and BASH code used by uSeq.

The core programs used by uSeq's individual module are all written in C++.  Any files
ending in .h or .cc are either the classes and object used by one or more pipeline
components, or the main programs themselves.  To build the programs used by uSeq, simply
run the following command:

  make all
	
Afterwards, there should be four programs: tetrascan, reindexBam, profiler, and summarizer.
Tetrascan is uSeq's custom microsatellite detection algorithm, reindexBam converts alignments
from a condensed reference genome to a normal reference genome, profiler creates microsatellite
profile (*.msp), and summarizer creates a summary table for all microsatellite loci observed in
a population.

The individual component is composed of a master Perl script which handles the calling of
component-specific Perl scripts, as well as logging and tracking STDERR and STDOUT from
all commands that are called.  It is currently called MicroSeq2.pl.  Several Perl modules
are available for use with uSeq.  They may be found in perllib.

To aid in the preparation and analysis of uSeq output, several short scripts are provided
in the scripts directory.  For SSC families, there is a copyPilot2Files.sh script that takes
a target directory and file as input, which is then used to create a directory with the
following setup:

project/family/individual/

Inside each individualName directory, soft links are created to the raw sequencing input
uSeq will take as input.  These can be in BAM or FASTQ format.  uSeq has been extensively
tested with BAM input data.  In addition, each directory has a relationship.txt file, which
defines an individual's relationship to the affected child (since this is a study of de novo
microsatellite mutations in children with autism).

Once the project directory has been created by copyPilot2Files.sh, uSeq can be launched over
SGE using runFamilies.sh.  runFamilies must have a complete path specified to the project
directory.  If uSeq is being run over a distributed filesystem, that includes the path from
the directory with temporary mount points for other nodes on the filesystem, typically the
/mnt directory.  uSeqStatusCheck.sh can be used to report the number of SGE jobs waiting, 
holding, or running while the individual component is running.

Once uSeq has completed running on a study population, findErrors.sh will look through the
log files and STDERR files generated by uSeq to identify whether any individual in the
population has had any errors in the course of the individual component processing.  If any
individuals need to be rerun, the complete paths to the individual's directory, as well as the
stage to start from (scan or profile are the most common options) can be specified in a file
that is then taken as input by rerunPeople.sh.

Microsatellite profile counts for the entire population can be obtained using 
getProfileReplicateCounts.sh, which takes a project directory as input.  The number of reads 
processed by each stage in the individual component are reported by getPipelineCounts.sh.  
This allows for tracking to make sure the number of reads is consistent for each study 
individual throughout the pipeline.  In addition, data regarding the processing of PCR 
duplicates is obtained using getReplicateCounts.sh.

Please note that many of the files in the scripts directory are tailored specifically to
the SSC project.  While they do have abstract concepts that may be suitable to other
projects, they must be adapted for a particular project.

The uSeq population module starts by summarizing all the microsatellite loci observed in a
study population using summarizer, then identifying well-covered loci using table2mat.pl.
Once a set of well-covered loci have been identified, uSeq calls microsatellite genotypes.
The uSeq genotyper is written in R, and can be found in the emGenotyper directory.  Each
R script takes a directory as input and looks for the following files: 
1. allele_matrix.txt - the file generated by table2mat.pl with coverage for each individual
   at each well-covered locus.
2. allele_matrix_info.txt - the file generated by table2mat.pl with information about the
   well-covered alleles it has identified
3. person_info.txt - a tab-delimited file providing population information for the
   genotyper (IDs and relations), which is identical to microsat.refOnly.columns.txt
	 
The coverage modeler and EM genotyper can be found in emGenotyper in the uSeq directory.


2. CREATING A CONDENSED REFERENCE GENOME

To create a condensed reference genome, you must first download a copy of the appropriate
reference genome (e.g., hg19).  Once the download has completed, the genome is processed
using the tetrascan microsatellite detection algorithm.  Tetrascan must be run with the
same microsatellite detection thresholds that will be used to detect microsatellites in
sequencing reads.  In the simplest case, tetrascan can be run using the following command:

/path/to/uSeq/tetrascan -f /path/to/reference/genome/referenceGenome.fa

If you would like to change any default detection settings, you can find the appropriate
optional commands by running
/path/to/uSeq/tetrascan -h

Once microsatellite detection of the reference genome is complete, there should be a file with
a condensed reference genome with the following path:
/path/to/reference/genome/referenceGenome.mod.fa

You can create a Burrows-Wheeler transform for use with bwa using the command:

/path/to/bwa index /path/to/reference/genome/referenceGenome.mod.fa

Be sure to select the appropriate indexing algorithm for the length of the genome being
transformed.

3. SETTING UP CONFIG.TXT

uSeq uses a config file to specify the paths to the applications it uses (i.e. BWA, SamTools, 
and Picard).  If the programs are in the user's path, config.pl can be run in auto-detect mode.
If the programs are not in the user's path, their locations must be specified using the 
appropriate flags (--bwa,--samtools, or --picard).  The config file is generated by the
following command while in the uSeq directory:

./config.pl [optional flags with paths to programs]

4. RUNNING USEQ INDIVIDUAL MODULE

4A. RUNNING USEQ ON POPULATION DATA
uSeq has been used extensively on SSC exome sequencing data, and has a sample means of launching
for an entire population.  Currently, uSeq generally requires 4 - 5 arguments.  It requires:
1. A path to the condensed reference genome (if you're on a server with distributed memory, be sure to 
	 include the hostname of the server where the condensed genome is stored)
2. A project ID which is used to store  some files throughout a distributed cluster
3. An argument defining a job manager to use to launch uSeq instances (currently, only SGE is implemented)
4. A directory containing sequencing information for a single individual (uSeq has been run extensively
	 using paired-end BAM files)
5. A load balance scheme.  The default load balance scheme is 2, which minimizes the number of threads
	 used for each stage of the pipeline.  This allows for many samples to be processed simultaneously.

In addition, other commands can be used to change different default settings, such as detector
thresholds.  To see a complete list of uSeq command-line options, run the following command:

/path/to/uSeq/MicroSeq2.pl --help (still needs to be implemented)

uSeq instances can be run using SGE as well.  It is recommended that uSeq be run with the -V flag in SGE,
as well as the -e and -o flags.  In addition the -v flag must be specified with the following argument:

-v USEQ_CONFIG=/path/to/uSeq/config.txt

Currently, uSeq instances are submitted to a special queue that limits the number of concurrently running
instances to 120.  This ensures that uSeq instances do not overtake the cluster.

The current command used to submit uSeq instances to SGE looks like this:

qsub -V -P [queue identifier] -q [uSeq dedicated queuename] -S /bin/bash -v USEQ_CONFIG=/path/to/uSeq/config.txt \
-e [path to STDERR for job submission] -o [path to STDOUT for job submission] /path/to/uSeq/MicroSeq2.pl --dir \
/path/to/individual/sequencing/directory --genome /path/to/condensed/reference/genome --sge --project_id [project ID]\
--lb_scheme 2

In its latest instance, the uSeq individual module has processed exome sequencing data from over 5,300 individuals
in a little under a week.

Each directory for each individual will have log files describing the commands run and the actions taken in the course
of the uSeq individual module, as well as the STDERR from each module component.  Rather than look at potentially tens
of thousands of log files and STDERR files, a simple BASH script is able to go through a directory containing individual
sequencing directories and look for any errors.  This is currently implemented as findErrors.sh.  This will write to STDOUT
for each individual in the directory and report whether any errors were seen and where the errors occurred.  If any errors
occurred during a uSeq run, please contact Mitchell Bekritsky ([email protected]).


4B. RUNNING USEQ ON INDIVIDUALS

TO BE WRITTEN


5. RUNNING USEQ POPULATION MODULE -- CALLING GENOTYPES USING POPULATION-INFERRED MODEL PARAMETERS


5A. SUMMARIZING MICROSATELLITE PROFILES IN A STUDY POPULATION

Once a population has completed processing and any errors are resolved, uSeq aggregates data from the entire 
population to infer model parameters and call genotypes.  Currently, a large summary table for all loci detected
in any individual within the population is construted using summarizer in the uSeq directory.  The only argument 
summarizer takes is a path to a directory containing microsatellite profiles for a study population.  Currently,
summarizer is specifically implemented for SSC data, but adjustments can be made later.  It can be run with the
following coomand:

/path/to/uSeq/summarizer /path/to/uSeq/population

uSeq's summarizer currently only reports microsatellite loci that are found in its reference set, and produces
three files as output:

1. microsat.refOnly.columns.txt - contains all the pertinent information for each individual in the population
	 (in this case, SSC family and individual ID, as well as relation to proband).
2. microsat.refOnly.index.txt   - contains all the pertinent information for each microsatellite locus observed
	 in any individual from the study population (chromosome, start position, motif, reference length, number of
	 population individuals with coverage at the locus, maximum coverage at the locus in any individual, and total
	 coverage at the locus throughout the population).
3. microsat.refOnly.counts.txt  - a tab-delimited matrix with one row for each locus and one column for each study
	 individual.  Each entry reports the number of alleles observed, the the allele lengths observed, and their 
	 coverage.  Each entry is semicolon-delimited.  If any entry has more than 2 alleles, the allele lengths and their
	 coverages are comma-delimited.
	 
The summarizer can run for anywhere from a few days to a few weeks depending on the study population size.


5B. IDENTIFYING WELL-COVERED LOCI

Once the summarizer has completed, well-covered loci are identified throughout the population.  This is currently
done using a simple Perl script called table2mat.pl.  table2mat takes the counts and index file produced as output
by summarizer, as well as thresholds that will be used to determine which loci are considered well-covered.  Loci
may be filtered by the percent of the population with coverage at the locus, its maximum individual coverage, or its
average coverage across the population.  Run table2mat.pl with the following command:

/path/to/uSeq/table2mat.pl --counts <counts file>  --index <locus index file> [ --minCount <threshold for maximum \
individual coverage at a locus> --minPopPct <minimum percent of study population with coverage> --meanPopCov \
<mean coverage at a locus>]


5C. SETTING UP THE USEQ GENOTYPER CONFIG FILE

The uSeq genotyper is in a separate directory currently, but needs to be merged with the rest of uSeq.  It has its own
tab-delimited config file with three lines:

1. src_dir   - the directory with R source files for the uSeq genotyper, which should be /path/to/uSeq/genotyper/src/
2. anno_dir  - a directory with BED files providing RefSeq and CCDS annotations for microsatellite loci (should make into a
							 a data pack for uSeq).
3. cran_repo - the default repository to download R packages from if they are not already installed

5D. GENERATING SVD-DERIVED EXPECTED COVERAGE ESTIMATORS FOR EACH LOCUS AND INDIVIDUAL

Expected coverage estimators are generated by coverageModeler.R.  It only takes two options as input--the directory where
the files generated by table2mat are found, and an optional verbosity.  If not already installed, it will install IRLBA
and R's getopt.  To run coverageModeler, use the following command:

/path/to/uSeq/genotyper/coverageModeler.R --directory=/path/to/directory/with/table2mat/output [--verbose=1]

A verbosity of 0 will limit the output of the coverage modeler.  The program will generate one file with an RData object,
as well as several graphs.  The graphs are of the total and mean coverage for every study individual, the Pearson correlation
of observed and expected coverage values for every study individual, and the expected vs observed coverage.  The RData object
is a list with 3 items: the expected per-allele coverage matrix, the Pearson correlation of expected vs observed coverage for
each study individual, and the number of components used to model expected coverage.

5E. CALLING MICROSATELLITE GENOTYPES

In its current form, uSeq splits the well-covered loci into 10,000 locus chunks for genotyping.  This serves two purpose: it
allows the genotyper to achieve some degree of parallelism, and  it allows genotypes to be called on an arbitrary number of
loci.  The genotyping array would otherwise be constrained by R's limit on the maximum number of elements in an array or matrix.

Each genotype chunk is run using the following command:

/path/to/uSeq/genotyper/chunkGenotyper.R --directory=/path/to/directory/with/table2mat/output --start.locus=[start locus] \|
--num.loci=[number of loci per chunk]

There is currently a barebones shell script that will automate this process if provided a directory that has the table2mat output
files.  It can be called as:

/path/to/uSeq/genotyper/genotypeChunks.sh /path/to/directory/with/table2mat/output [number of loci per chunk]

Genotype chunks will then use SGE to run the genotyper in parallel on well-covered loci.  

The genotyper will create a GenotypeInformation directory in /path/to/directory/with/table2mat/output, where RData files will be
stored with the genotyper data structures for each chunk, as well as attendant locus information for each chunk.  (Note: this would
be very amenable to HDF5 files if we put in the effort to convert the format).

5F. CALCULATING MENDEL OBEDIENCE SCORES

Once genotypes have been calculated for the entire population, Mendel obedience scores can be calculated on the same chunks.  The
command to run an individual chunk is

/path/to/uSeq/genotyper/chunkFamilyGenotyper.R --directory=/path/to/directory/with/table2mat/output --start.locus=[start locus] \|
--num.loci=[number of loci per chunk]

As before, there is a barebones shell script that will automate this process and submit parallelized jobs to SGE.  That command is

/path/to/uSeq/genotyper/famGenotypeChunks.sh /path/to/directory/with/table2mat/output [number of loci per chunk]

This command will add a RData files with family genotypes to the GenotypeInformation directory, which can then be evaluated for
potential de novo mutations.

Note:  This would also be a good candidate for HDF5.  It is also probably smarter to run the EM genotyper and calculate Mendel
obedience scores simultaneously, with a Boolean flag allowing someone to only call genotypes or only calculate Mendel obedience
scores.  Also, this algorithm is currently only implemented for quads.  We are in the midst of adapting it to arbitrary family
structures.  Finally, the kinship score could be calculated better (talk to Mitch).

5G. FINDING DE NOVO MUTATIONS

Once family genotypes have been called, potential de novo mutations can be identified.  This will depend on strong evidence of
Mendel disobedience and kinship, as well as strong genotypes.  To identify potential de novo mutations in any chunk, we can run

/path/to/uSeq/genotyper/loadAndGraphDenovos.R --directory=/path/to/directory/with/table2mat/output --start.locus=[start locus] \|
--num.loci=[number of loci per chunk] --mendel.threshold=[minimum Mendel obedience score threshold] --swap.threshold=[minimum kinship] \|
--trio.confidence=[minimum trio confidence] --trio.p.null=[minimum trio null probability] --trio.noise.fit=[minimum trio noise fit] \|
--trio.allele.fit=[minimum trio allele fit] --max.dn.noise=[max EM-estimated locus error rate]

Defaults are set for all parameters, although the user can set their own thresholds to identify potential de novo microsatellite
mutations.

There is another barebones shell script to enable this to run on each chunk in parallel using SGE.  That command is:

/path/to/uSeq/genotyper/dnChunks.sh /path/to/directory/with/table2mat/output [number of loci per chunk]

It does not currently take threshold arguments to filter de novo mutations.

This command will create a set of graphs for each potential de novo mutation, including a family graph showing detailed coverage and
family information for the family with the de novo mutation, a scatter plot of coverage for the reference and de novo allele, a scatter
plot with the observed/expected coverage ratio for the reference and de novo allele, a 3D scatter plot of coverage, a genotype frequency
table for the entire population at the locus with the de novo mutation, and a population heatmap showing the coverage throughout the population
at the locus.

OUTSTANDING ISSUES:

TEST USEQ FOR FASTQ FILES AND ZIPPED FASTQ FILES, TEST LOCAL RUNS OF USEQ, TEST USEQ ON OTHER CLUSTERS
TRY NEW KINSHIP SCORE USING MARGINAL PROBABILITIES FOR PARENTAL ALLELES

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.