genome / genome Goto Github PK

Core modules used by the GMS

License: GNU Lesser General Public License v3.0

Perl 67.47% Shell 0.29% HTML 18.65% XSLT 2.92% Makefile 0.01% Java 0.07% C++ 0.04% CSS 0.74% JavaScript 7.18% R 1.87% Python 0.02% MATLAB 0.05% CoffeeScript 0.03% ActionScript 0.13% PHP 0.35% Ruby 0.01% C 0.13% PLpgSQL 0.05% Raku 0.01%

genome's Introduction

# Genome

Genome analysis software from The Genome Institute at Washington University
School of Medicine, funded by the National Human Genome Research Institute.

## Testing

Running tests currently depends on many TGI resources so testing cannot be
performed outside TGI at this time.  Within TGI you can run tests with:

    genome-test-env test-tracker prove --lsf --git

This chains the `genome-test-env` and `test-tracker` commands together.  See
`test-tracker prove --help` for its usage.  For example, you may need to pass a
value to the `--git` option if your branch is not set up to track
`origin/master`.

genome's Issues

Unarchive InstrumentData when md5 does not match.

When the md5 for a LIMS unarchive doesn't match, compare flagstat to minimally ensure the read counts match and protect from truncated BAM files.
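A minimal sketch of the read-count comparison, assuming the `samtools flagstat` text for both the original and the unarchived BAM is already in hand. The function names are hypothetical and are not part of Genome; only the "in total" line of the flagstat report is used.

```python
import re

# The flagstat report contains a line such as:
#   "1000 + 0 in total (QC-passed reads + QC-failed reads)"
_TOTAL_RE = re.compile(r"^(\d+) \+ (\d+) in total", re.MULTILINE)


def total_reads(flagstat_text):
    """Return (qc_passed, qc_failed) totals parsed from flagstat output."""
    match = _TOTAL_RE.search(flagstat_text)
    if match is None:
        raise ValueError("no 'in total' line found in flagstat output")
    return int(match.group(1)), int(match.group(2))


def read_counts_match(original_flagstat, unarchived_flagstat):
    """True when both flagstat reports agree on the total read counts."""
    return total_reads(original_flagstat) == total_reads(unarchived_flagstat)
```

A truncated BAM would normally report fewer reads, so a mismatch here is a reason to reject the unarchived copy even before looking at why the md5 differs.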

Support external alignment results.

Remove references to Workflow

Finish taking out references to Workflow in Genome (and thus end support for the `genome config set-env workflow_builder_backend workflow` case).

Allow builds to start with missing instrument data files.

Instead of failing immediately with an Unstartable status, allow builds to start with missing instrument data files. Often the data is not realigned, but reprocessed through a new pipeline type or with parameters that do not change the alignment strategy. If the data IS realigned or the raw instrument data IS required by a downstream step, fail at that point of the pipeline instead.

Remove legacy workflow submodules

Remove the workflow submodule from the Genome repo and update our build processes to no longer care about it.

Support GRCh38 with alts

Initial work was done by @ernfrid here:
#1294

BWA docs : https://github.com/lh3/bwa/blob/master/README-alt.md

GATK Blog : https://software.broadinstitute.org/gatk/blog?id=8180

ReferenceSequence models will need to support auxiliary files required by downstream tools to correctly process the alt alignments.

A new version of SpeedSeq is near completion. This includes a new version of BWA that supports alt contigs.

Completion of this Epic requires successful SomaticValidation, RNA-seq, ClinSeq and SingleSampleGenotype builds run using HCC1395 test data. See existing ClinSeq AnP for data:
http://spectacle.gsc.wustl.edu/analysis_projects/ef89d9e8c1f942c492a288e5b1b4b078

Refactor ClinSeq to take advantage of UR::Role with typechecking

Move Test DB to new disk volumes

Support Cromwell as CWL backend

Eventually we may want to support WDL, but for now add cromwell as a backend replacement for Toil when running CwlPipeline models.

Improve error handling of move-allocations

When a build has zero allocations a less than helpful error message is returned:

2017/05/26 08:27:52 Genome::Command: Can't call method "disk_group_name" on an undefined value at /gsc/scripts/opt/genome/snapshots/genome-3750/lib/perl/Genome/Model/Build/Command/MoveAllocations.pm line 82.
ERROR: Can't call method "disk_group_name" on an undefined value at /gsc/scripts/opt/genome/snapshots/genome-3750/lib/perl/Genome/Model/Build/Command/MoveAllocations.pm line 82.

Improve error handling. Another edge case is if the command fails to resolve the disk group from the build/AnP.

Make ClinSeq unit tests not depend on live db

From JIRA CI-157

The following unit tests depend on getting builds from the live db:
Somatic variation builds:

Model/ClinSeq/Command/CreateMutationDiagrams.t
Model/ClinSeq/Command/CreateMutationSpectrum-exome.t
Model/ClinSeq/Command/CreateMutationSpectrum-wgs.t
Model/ClinSeq/Command/GenerateClonalityPlots.t
Model/ClinSeq/Command/GetBamReadCountsMatrix.t
Model/ClinSeq/Command/GetVariantSources.t
Model/ClinSeq/Command/ImportSnvsIndels.t
Model/ClinSeq/Command/RunCnView.t
Model/ClinSeq/Command/SummarizeSvs.t

Rna-seq builds:

Model/ClinSeq/Command/CufflinksDifferentialExpression.t
Model/ClinSeq/Command/CufflinksExpressionAbsolute.t
Model/ClinSeq/Command/TophatJunctionsAbsolute.t
Model/ClinSeq/Command/Converge/CufflinksDe.t

SomVar + Rna-Seq:

Model/ClinSeq/Command/GetBamReadCounts.t

Clin-seq build:

Model/ClinSeq/Command/DumpIgvXml.t
Model/ClinSeq/Command/SummarizeBuilds.t
Model/ClinSeq/Command/SummarizeCnvs.t
Model/ClinSeq/Command/SummarizeModels.t
Model/ClinSeq/Command/Converge/AllEvents.t

Multiple types?

Model/ClinSeq/Command/TestGenomeCommands.t
Model/ClinSeq/Command/UpdateAnalysis.t

CwlPipeline AnP Config Should Support `input_data` in `instrument_data_properties`

The instrument_data_properties are currently processed here. We have a scalar and an ARRAY case. Adding a HASH case to allow for one level of nesting could be one way to solve this. We need to be careful to merge with any existing hash but not leak one instrument data's values back to the original configuration hash.
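One way the HASH case could merge without leaking is sketched below. This is an illustration only: the function name and the example keys are hypothetical, and the real configuration is handled in Perl rather than Python. The idea is to deep-copy the original configuration before overlaying each instrument data's values.

```python
import copy


def merge_properties(base_config, per_instrument_data):
    """Return a new dict: base_config overlaid with per_instrument_data.

    Dict values are merged one level deep. base_config is never modified,
    so values from one instrument data cannot leak into the next.
    """
    merged = copy.deepcopy(base_config)
    for key, value in per_instrument_data.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # HASH case: merge into the existing (copied) hash.
            merged[key].update(copy.deepcopy(value))
        else:
            # Scalar and ARRAY cases: plain replacement.
            merged[key] = copy.deepcopy(value)
    return merged
```

Because every call starts from a fresh copy, calling it repeatedly with different instrument data leaves the original configuration hash unchanged.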

Add Exome Variant Project MAF information to the Clinseq SnvIndelReport

From JIRA CI-110 requested by @chrisamiller :
We'd like to have the results from "gmt annotate add-evs-maf" added to the SnvIndel report. That's the one set of info that my group often uses that is absent from the current implementation.

Support CRAM alignment results.

There has been internal discussion about porting existing alignments from BAM to CRAM to reduce file size. Instead of re-importing CRAM as External alignment results, we can make a filetype attribute (using metric) on all AlignedBamResults. Then bam_path will resolve the correct file extension based on this attribute, bam or cram (with default value bam). A command like genome instrument-data alignment-result compress would be added to the code base. Builds and callers that use the bam_path accessor would then get CRAM file paths instead.
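The accessor behaviour described above could look roughly like this. The class, the metric key `filetype` as stored here, and the file name `all_sequences` are assumptions made for illustration; the real AlignedBamResult classes are Perl.

```python
import posixpath


class AlignedResult:
    """Hypothetical stand-in for an aligned BAM result."""

    VALID_FILETYPES = ("bam", "cram")

    def __init__(self, output_dir, metrics=None):
        self.output_dir = output_dir
        self.metrics = dict(metrics or {})

    @property
    def filetype(self):
        # Results created before the attribute existed carry no metric,
        # so they default to "bam".
        return self.metrics.get("filetype", "bam")

    def bam_path(self):
        """Path to the alignment file, with the extension set by filetype."""
        if self.filetype not in self.VALID_FILETYPES:
            raise ValueError("unsupported filetype: %r" % self.filetype)
        return posixpath.join(self.output_dir, "all_sequences." + self.filetype)
```

Callers keep using `bam_path` unchanged; only results that have been compressed report a `.cram` path.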

Refactor ClinSeq workflow to use SnvIndelReport instead of ImportSnvsIndels

Revise data import process

Right now, there's kind of a dead space in between the standard import (which does a ton of qc, sanitizing, linking, etc) and the trusted-importer (which does none of that). As our sequencing partners evolve, and we start to accept more data that doesn't come through the established LIMS link, we need to revamp this process to:

- accept key metadata attributes in some defined import format (csv?)
  - subtask - decide what the scope of supported properties is, and which ones are mandatory
- create or link sample and library objects and populate them appropriately
- import raw sequence data and link the instrument data to these samples/libs with a minimum of fuss

Human readable file names for CWL workflows

Add a build step (or a workflow step if possible) to create human readable symlinks to the output files of CWL workflows.

Generic auxiliary file importer

Either revamp something like genome db or make a new generic file importer. The idea is to simplify the process of storing VCF, GTF, BED, etc. for use in future CwlPipelines. This could lower the burden of creating specialty model types or custom importers for specific files, recent example gnomAD #1777 .

The solution should retain minimal metadata. One of the requirements should be a valid reference sequence. I'm open to debate on the ease of validating that the refseq matches the imported file. It's possible we simply have to rely on the honor system, since validation of many file types could be cumbersome.

PrepDataForTransfer and CWL Pipeline Outputs

Before we process the next round of MDS data, we should add the build ID and a timestamp (creation date) for each file in the MANIFEST.

Input BAM file instead of build for `gmt transcriptome ercc-map-unaligned`

The original implementation assumed a RnaSeq build would be used. Instead, provide the BAM path so this tool will work with CwlPipeline builds just as well as RnaSeq (or any BAM file).

Add Total Disk Usage to AnP command

The new command genome analysis-project disk-usage sums the usage for each config item. Those values do not add up to the total for the AnP since some results are shared between config items. Also add a master list of results and sum the total disk usage for the AnP without duplicate counting results shared across multiple config items.
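The difference between the per-config-item sums and the AnP total can be sketched as follows, assuming each config item maps to a list of result IDs and each result has a known size. The function and argument names are hypothetical.

```python
def disk_usage_summary(results_by_config_item, size_by_result):
    """Return (per_item_usage, total_usage) for an analysis project.

    per_item_usage sums every result used by each config item.
    total_usage counts each result once, even when config items share it.
    """
    per_item = {
        item: sum(size_by_result[r] for r in set(results))
        for item, results in results_by_config_item.items()
    }

    # Master list of results: the union across all config items.
    unique_results = set()
    for results in results_by_config_item.values():
        unique_results.update(results)

    total = sum(size_by_result[r] for r in unique_results)
    return per_item, total
```

With two config items that share one result, the per-item sums add up to more than the total, which is exactly the discrepancy described above.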

Remove old mysql code

There are a few remaining commands that either do not use the db_ensembl_* config or make direct connections to mysql servers. For the former, they should be updated to use the config variables. For the latter, we should consider removing the code or disabling the connection to mysql servers for old data sets like build36.

Consider updating:
./Model/Tools/Ensembl/Base.pm
./Model/Tools/ImportAnnotation/UpdateAnnotation.pm

Consider removing or disabling mysql connections from:
./Model/Tools/Pcap/Ace.t
./Model/Tools/Pcap/Config.pm
./Model/Tools/Sv/BreakAnnot.pl
./Model/Tools/Sv/SvAnnot.pm

Update default paths in `gmt transcriptome ercc-map-unaligned`

The default values are no longer accessible on the MGI filesystem.

The new paths:

--ercc-fasta-file /gscmnt/gc2560/core/model_data/2861523156/build0bfc1bd5fcfc474c9db737a520ae109d/appended_sequences.fa --ercc-spike-in-file=/gscmnt/gc2560/core/RNASeq/ERCC/metadata/ERCC_Controls_Analysis-v1.txt

Move PerLaneTophat Allocations

When moving allocations for a build, the class Genome::InstrumentData::AlignmentResult::PerLaneTophat is not in the whitelist of acceptable allocations to move. Should these files be moved along with the other allocations, e.g. MergedAlignmentResults?

Replace Breakdancer for SV detection.

Implement a new SV caller and replace Breakdancer in somatic SV detection strategies.

https://jira.gsc.wustl.edu/browse/CI-34

Add FASTA symlink (or gzip of file) to support CRAM conversion

For the following command:
genome model cwl-pipeline prep-for-transfer

add an option to include the FASTA to support CRAM files and conversions.

Somatic Variant Reporting support for GRCm38

See #1623 for human equivalent.

Add ClinSeq model inputs for tumor/normal microarray.

Somatic Validation models do not have microarray inputs. This means that ClinSeq cannot run microarray copy number, etc. when SomVal model types are inputs. Rather than supplying the microarray build, ClinSeq should have tumor/normal microarray inputs that are the paths to a VCF or the VCF result.

More careful splitting of lock information.

Please see Pull Request #1652 for details.

AnP command to move allocations

The new command would move ALL allocations for an AnP from one disk group to another. The implementation should allow for the move of individual models/builds as well.

See #1687 for a discussion of disk_usage_allocations. Once implemented for all model types using a list of subclass names that "belong" to a model, this new command should be fairly straightforward when an allocation is unique to an AnP.

Jenkins genome model test: 5.10-clinseq-wer - Build 2129 - Diffs Found

Project: 2-Genome-Model-Tests-2-Run-Models/TEST_SPEC=5.10-clinseq-wer
Build: https://apipe-ci.gsc.wustl.edu/job/2-Genome-Model-Tests-2-Run-Models/TEST_SPEC=5.10-clinseq-wer/2129/
Console: https://apipe-ci.gsc.wustl.edu/job/2-Genome-Model-Tests-2-Run-Models/TEST_SPEC=5.10-clinseq-wer/2129/console

DIFFERENCES FOUND:
Comparing new object 7712d85d4956475e9649622d2b93165d to blessed object 0bc8c6ff5d4c4fc0ae645335f738a29d
File: AML109/mutation-spectrum/b20_q10/exome/mutation_spectrum_sequence_context/AML109.prop.test.2type
Reason: files are not the same (diff -u {/gscmnt/gc9026/test/model_data/8a175224cae14b80ab36d749216efb2b/build0bc8c6ff5d4c4fc0ae645335f738a29d,/gscmnt/gc9026/test/model_data/8a175224cae14b80ab36d749216efb2b/build7712d85d4956475e9649622d2b93165d}/AML109/mutation-spectrum/b20_q10/exome/mutation_spectrum_sequence_context/AML109.prop.test.2type)

If you want to bless this object (7712d85d4956475e9649622d2b93165d) update and commit the DB file (Model/Build/Command/DiffBlessed.pm.YAML).

InstrumentData import requires large resource requests.

Using NovaSeq as an example, importing the FASTQ files required requests of 92GB of RAM to ensure success.

See internal JIRA issue CIS-93 for a few examples.

Handle index during BAM/CRAM conversion

The index file is not useful after going round trip BAM->CRAM->BAM. The BAM file must be reindexed. Appropriately handle or fix the index creation to make this easier on the end user. For now, indexing can be handled by running samtools index.

In fact, there is a crai index we should make that allows for viewing of the CRAM directly in IGV 3 or greater.
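A sketch of the round trip with indexing at each step, assuming `samtools` is on the PATH and the reference FASTA used for the alignments is available. The helper names are hypothetical; the `samtools view` options (`-C`, `-b`, `-T`, `-o`) and `samtools index` are standard.

```python
import subprocess


def cram_round_trip_commands(bam, cram, restored_bam, reference_fasta):
    """Build the samtools commands for BAM -> CRAM -> BAM with fresh indexes."""
    return [
        # Compress to CRAM against the reference.
        ["samtools", "view", "-C", "-T", reference_fasta, "-o", cram, bam],
        # Index the CRAM; this writes a .crai file next to it.
        ["samtools", "index", cram],
        # Convert back to BAM.
        ["samtools", "view", "-b", "-T", reference_fasta, "-o", restored_bam, cram],
        # The old BAM index is not reusable, so index the restored BAM again.
        ["samtools", "index", restored_bam],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the command list separately from running it keeps the steps easy to inspect or log before anything touches the files.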

Require 2FA for genome org

Similar to the University IT requirement to enable 2FA, we should also require 2FA for push access to repositories. The team @genome/unauth includes the users that have not enabled 2FA. If you'd like to remain in your teams that have push access to genome repos, please enable 2FA as soon as possible. Using the University deadline of October 31st, I'll revisit the list of users that have enabled 2FA at that time. If you have not enabled 2FA by that time, you will be removed from the teams that allow push access to any genome repo. Also, if you'd like to be removed from the Genome org at this time to avoid future communications, please comment and I'll remove you from the Genome GitHub Organization.

Evaluate effect of BQSR on somatic variant pipelines.

How to install

Hi, could you tell me how to install and use this software on Ubuntu?
I did not find the README file, so I tried to install genome but failed. The main installation steps are as follows:
1. run git clone https://github.com/genome/genome.git
2. go into the bin directory and run ./genome; then an error message appears:
unable to locate spec: config at /usr/local/lib/x86_64-linux-gnu/perl/5.22.1/Genome/Site.pm line 19. BEGIN failed--compilation aborted at /usr/local/lib/x86_64-linux-gnu/perl/5.22.1/Genome/Site.pm line 53. Compilation failed in require at /usr/local/lib/x86_64-linux-gnu/perl/5.22.1/Genome.pm line 40. Compilation failed in require at (eval 1) line 2. BEGIN failed--compilation aborted at (eval 1) line 2. BEGIN failed--compilation aborted at ./genome line 23.
I really need your help. Looking forward to your reply!

ClinSeq with Somatic Validation Model Inputs

ClinSeq currently uses Somatic Variation, Reference Alignment and RNA-seq models as input. Somatic Variation models require Reference Alignment as inputs as well. The model dependency chain introduces several manual steps in the analysis process. Somatic Validation models combine the pipeline steps of RefAlign and SomVar in one workflow. This removes one of the manual steps to performing ClinSeq analysis.

Relevant tasks:

#1363

Analysis Project 'view' command does not show store-only 'inactive' config

Using the --disabled-configs option it is displayed, but inactive configs are not disabled... it's a little confusing for end-users.

Use disk configuration of AnP when importing.

The command genome instrument-data import basic should use the disk group defined by the AnP used in the command.

New coverage report (simplified) using QC framework results.

No longer rely on CoverageStats or RefCov, but generate a coverage report from QC results (CollectHsMetrics).

VcfToBed.pm 1-based flag appears broken

Hi, the --one-based flag of gmt bed convert vcf-to-bed appears to still output 0-based coordinates:

input:
9 34795908 . CT C . .

example output:
9 34795908 34795909 T/* - -

should be:
9 34795909 34795909 T/* - -
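The expected coordinates for this example can be reproduced with the sketch below. It is not the VcfToBed.pm logic: the function name is hypothetical, and it only handles a simple deletion where the ALT allele is a prefix of the REF allele.

```python
def deletion_to_bed(chrom, pos, ref, alt, one_based=False):
    """Convert a simple VCF deletion to (chrom, start, end, alleles).

    pos is the 1-based VCF position of the first REF base.
    """
    if not (ref.startswith(alt) and len(ref) > len(alt)):
        raise ValueError("only simple deletions are handled in this sketch")

    deleted = ref[len(alt):]
    # 0-based start of the first deleted base.
    start = pos + len(alt) - 1
    # End of the deleted bases: exclusive in 0-based half-open coordinates,
    # which is the same number as the inclusive 1-based end.
    end = pos + len(ref) - 1
    if one_based:
        start += 1
    return (chrom, start, end, deleted + "/*")
```

For the input above, the default call gives start 34795908 and end 34795909, while `one_based=True` gives 34795909 for both, matching the "should be" line.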

Add AnP command to update status

Currently owner is inferred as created_by. Do we want a new owner?

Either way, this issue is specifically about adding a command that allows the created_by/owner to update the status from 'Completed' back to 'In Progress'.

Add Somatic Validation tracks to DumpIgvXml ClinSeq command.

The DumpIgvXml command relies on finding the tumor_build and normal_build which only exist as ReferenceAlignment builds on Somatic Variation inputs. The DumpIgvXml command should resolve the path to the tumor/normal bed/bam file for all somatic model types.

Move-allocations that supports DeNovoAssembly Model Types

Somatic Variant Reporting support for GRCh38

Containerize core GMS functions.

A few core tools, e.g. samtools, bwa, bedtools, joinx, liftover, etc., are used in basic GMS functions. There seem to be a few options here:

- include the tool in the genome image
- execute a simple workflow step using the correct image
- execute a CWL workflow to produce the desired result

In general, this is a modern solution to the complete removal of legacy /gsc installed software, see #1560

Find and re-locate legacy software/data hiding out under `/gsc/`.

The end goal is to remove the dependency on /gsc/ as a special directory for the GMS.

Some software/versions might be old enough we don't need them any more.
Some will probably need to be incorporated into our Docker image somehow (maybe in a local /gsc/ directory?)

Refactor SummarizeTier1SnvSummary to use SnvIndelReport result

This step originally ran bam-readcount redundantly with the SnvIndelReport. The original reworking of the ClinSeq workflow did NOT add this step back to the process. If needed, the R script could be rewritten to use the existing read counts and VAFs in the SnvIndelReport. Here is an example file:
/gscmnt/gc13001/info/model_data/9960e32e7f344f17b476b728ac487bb3/build24cd53cfd5ab449abb2e3991f846c2cd/BRC251/snv_indel_report/b1_q1/BRC251_final_filtered_coding.tsv

Modify ClinSeq update analysis to use Analysis Projects (AnPs).

genome model clin-seq update-analysis does not handle Somatic Validation models as inputs to ClinSeq. Rather than add Somatic Validation models, we can leverage the Analysis Project configuration that was used for the project. ClinSeq models are never assigned to an AnP anyway. Even when Somatic Variation models are created they should be assigned to an AnP.

Additional info from JIRA CI-81:
UpdateAnalysis has a number of defaults for its inputs. Additionally, there is a hard-coded translation in the method get_roi_name
Other commands that may have hard-coded things include:
DumpIgvXml
GenerateClonalityPlots
SummarizeCnvs

New docker image with updated SSL implementation for GDC queries

See the following work-around PR for details:
#1511

@tmooney : "The underlying problem is that lucid is too old for the required OpenSSL version"
