genome's Introduction

# Genome

Genome analysis software from The Genome Institute at Washington University
School of Medicine, funded by the National Human Genome Research Institute.

## Testing

Running tests currently depends on many TGI resources, so testing cannot be
performed outside TGI at this time.  Within TGI, you can run tests with:

    genome-test-env test-tracker prove --lsf --git

This chains the `genome-test-env` and `test-tracker` commands together.  See
`test-tracker prove --help` for its usage.  For example, you may need to pass a
value to the `--git` option if your branch is not set up to track
`origin/master`.

genome's People

Contributors

acoffman, apipe-tester, apregier, brummett, chrisamiller, clu76, davidlmorton, ddgenome, dufeiyu, ebelter, ernfrid, gatoravi, guesu, iferguson90, indraniel, jasonwalker80, jeldred, johnegarza, johnmaruska, jweible, kkrysiak, kkyung, malachig, mark-burnett, mkiwala, obigriffith, sakoht, sleongmgi, susannasiebert, tmooney

genome's Issues

Remove references to Workflow

Finish taking out references to Workflow in Genome (and thus end support for the `genome config set-env workflow_builder_backend workflow` case).

Allow builds to start with missing instrument data files.

Instead of failing immediately with an Unstartable status, allow builds to start with missing instrument data files. Often the data is not realigned, but reprocessed through a new pipeline type or with parameters that do not change the alignment strategy. If the data IS realigned or the raw instrument data IS required by a downstream step, fail at that point of the pipeline instead.

Support GRCh38 with alts

Initial work was done by @ernfrid here:
#1294

BWA docs : https://github.com/lh3/bwa/blob/master/README-alt.md

GATK Blog : https://software.broadinstitute.org/gatk/blog?id=8180

ReferenceSequence models will need to support auxiliary files required by downstream tools to correctly process the alt alignments.

A new version of SpeedSeq is near completion. This includes a new version of BWA that supports alt contigs.

Completion of this Epic requires successful SomaticValidation, RNA-seq, ClinSeq and SingleSampleGenotype builds run using HCC1395 test data. See existing ClinSeq AnP for data:
http://spectacle.gsc.wustl.edu/analysis_projects/ef89d9e8c1f942c492a288e5b1b4b078

Support Cromwell as CWL backend

Eventually we may want to support WDL, but for now add Cromwell as a backend replacement for Toil when running CwlPipeline models.

Improve error handling of move-allocations

When a build has zero allocations, an unhelpful error message is returned:

2017/05/26 08:27:52 Genome::Command: Can't call method "disk_group_name" on an undefined value at /gsc/scripts/opt/genome/snapshots/genome-3750/lib/perl/Genome/Model/Build/Command/MoveAllocations.pm line 82.
ERROR: Can't call method "disk_group_name" on an undefined value at /gsc/scripts/opt/genome/snapshots/genome-3750/lib/perl/Genome/Model/Build/Command/MoveAllocations.pm line 82.

Improve error handling. Another edge case is if the command fails to resolve the disk group from the build/AnP.
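
As a rough illustration of the two guards described above, here is a minimal Python sketch. The real command is the Perl module `MoveAllocations.pm`; the `Build` and `Allocation` classes and the `plan_moves` function below are hypothetical stand-ins, not Genome APIs.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Allocation:
    # Hypothetical stand-in for a disk allocation record.
    id: str
    disk_group_name: str


@dataclass
class Build:
    # Hypothetical stand-in for a build and its allocations.
    id: str
    allocations: List[Allocation] = field(default_factory=list)


def plan_moves(build: Build, target_group: Optional[str]) -> List[Tuple[str, str, str]]:
    """Return (allocation id, current group, target group) for each move.

    Raises ValueError with a readable message instead of failing on an
    undefined value deep inside the command.
    """
    if not build.allocations:
        raise ValueError(f"Build {build.id} has no allocations to move.")
    if not target_group:
        raise ValueError(
            f"Could not resolve a disk group for build {build.id}; "
            "pass one explicitly."
        )
    return [(a.id, a.disk_group_name, target_group) for a in build.allocations]
```

Checking both conditions up front means the user sees which input was missing rather than a stack location.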

Make ClinSeq unit tests not depend on live db

From JIRA CI-157

The following unit tests depend on getting builds from the live db:
Somatic variation builds:

  • Model/ClinSeq/Command/CreateMutationDiagrams.t
  • Model/ClinSeq/Command/CreateMutationSpectrum-exome.t
  • Model/ClinSeq/Command/CreateMutationSpectrum-wgs.t
  • Model/ClinSeq/Command/GenerateClonalityPlots.t
  • Model/ClinSeq/Command/GetBamReadCountsMatrix.t
  • Model/ClinSeq/Command/GetVariantSources.t
  • Model/ClinSeq/Command/ImportSnvsIndels.t
  • Model/ClinSeq/Command/RunCnView.t
  • Model/ClinSeq/Command/SummarizeSvs.t

Rna-seq builds:

  • Model/ClinSeq/Command/CufflinksDifferentialExpression.t
  • Model/ClinSeq/Command/CufflinksExpressionAbsolute.t
  • Model/ClinSeq/Command/TophatJunctionsAbsolute.t
  • Model/ClinSeq/Command/Converge/CufflinksDe.t

SomVar + Rna-Seq:

  • Model/ClinSeq/Command/GetBamReadCounts.t

Clin-seq build:

  • Model/ClinSeq/Command/DumpIgvXml.t
  • Model/ClinSeq/Command/SummarizeBuilds.t
  • Model/ClinSeq/Command/SummarizeCnvs.t
  • Model/ClinSeq/Command/SummarizeModels.t
  • Model/ClinSeq/Command/Converge/AllEvents.t

Multiple types?

  • Model/ClinSeq/Command/TestGenomeCommands.t
  • Model/ClinSeq/Command/UpdateAnalysis.t

Support CRAM alignment results.

There has been internal discussion about porting existing alignments from BAM to CRAM to reduce file size. Instead of re-importing CRAM as External alignment results, we can make a filetype attribute (using metric) on all AlignedBamResults. Then bam_path will resolve the correct file extension based on this attribute, bam or cram (with default value bam). A command like genome instrument-data alignment-result compress would be added to the code base. Builds and callers that use the bam_path accessor would then get CRAM file paths instead.
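
The accessor change proposed above could look roughly like the following Python sketch. It is only an illustration of the idea (the real classes are Perl); the `AlignmentResult` class, the `all_sequences` base name, and the dictionary used for metrics are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict

ALLOWED_FILETYPES = ("bam", "cram")


@dataclass
class AlignmentResult:
    # Hypothetical stand-in for an aligned BAM result.
    output_dir: str
    basename: str = "all_sequences"  # assumed file base name
    metrics: Dict[str, str] = field(default_factory=dict)

    @property
    def filetype(self) -> str:
        # The filetype is stored as a metric and defaults to "bam".
        return self.metrics.get("filetype", "bam")

    @property
    def bam_path(self) -> str:
        # Callers keep using bam_path; only the extension changes.
        if self.filetype not in ALLOWED_FILETYPES:
            raise ValueError(f"Unsupported filetype: {self.filetype}")
        return f"{self.output_dir}/{self.basename}.{self.filetype}"
```

With this shape, a compress command would only need to convert the file and set the `filetype` metric to `cram`; existing callers of `bam_path` would pick up the new path without changes.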

Revise data import process

Right now, there's kind of a dead space in between the standard import (which does a ton of qc, sanitizing, linking, etc) and the trusted-importer (which does none of that). As our sequencing partners evolve, and we start to accept more data that doesn't come through the established LIMS link, we need to revamp this process to:

  1. accept key metadata attributes in some defined import format (csv?)

    • subtask - decide what the scope of supported properties is, and which ones are mandatory
  2. create or link sample and library objects and populate them appropriately

  3. import raw sequence data and link the instrument data to these samples/libs with a minimum of fuss
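
Step 1 above could be prototyped along these lines. This is a minimal Python sketch that assumes a CSV manifest; the column names in `MANDATORY` are hypothetical placeholders, since the scope of supported and mandatory properties is still to be decided.

```python
import csv
import io
from typing import Dict, List

# Hypothetical mandatory columns; the real set is an open subtask.
MANDATORY = ("sample_name", "library_name", "data_path")


def read_import_manifest(text: str) -> List[Dict[str, str]]:
    """Parse a CSV manifest and check that mandatory fields are present."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in MANDATORY if c not in header]
    if missing:
        raise ValueError(f"Missing mandatory column(s): {', '.join(missing)}")

    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        empty = [c for c in MANDATORY if not (row[c] or "").strip()]
        if empty:
            raise ValueError(
                f"Line {line_no}: empty mandatory field(s): {', '.join(empty)}"
            )
        rows.append(row)
    return rows
```

Rejecting a bad manifest before any sample or library objects are created keeps steps 2 and 3 from having to clean up after a partial import.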

Generic auxiliary file importer

Either revamp something like genome db or make a new generic file importer. The idea is to simplify the process of storing VCF, GTF, BED, etc. for use in future CwlPipelines. This could lower the burden of creating specialty model types or custom importers for specific files; a recent example is gnomAD (#1777).

The solution should retain minimal metadata. One of the requirements should be a valid reference sequence. I'm open to debate on the ease of validating the refseq matches the imported file. It's possible we simply have to rely on the honor system since validation of many file types could be cumbersome....

Add Total Disk Usage to AnP command

The new command genome analysis-project disk-usage sums the usage for each config item. Those values do not add up to the total for the AnP since some results are shared between config items. Also add a master list of results and sum the total disk usage for the AnP without duplicate counting results shared across multiple config items.
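
The difference between the per-item sums and the de-duplicated total can be shown with a small Python sketch. The input shape (config item name mapped to result id and size) and the function name are assumptions for illustration, not the command's actual data model.

```python
from typing import Dict, Tuple


def summarize_disk_usage(
    config_items: Dict[str, Dict[str, int]]
) -> Tuple[Dict[str, int], int]:
    """Return per-config-item usage and the AnP total without double counting.

    config_items maps a config item name to {result_id: size}.
    """
    per_item = {name: sum(results.values()) for name, results in config_items.items()}

    # Master list of results: a result shared by several config items
    # appears here exactly once.
    unique_results: Dict[str, int] = {}
    for results in config_items.values():
        unique_results.update(results)

    return per_item, sum(unique_results.values())


usage = {
    "exome": {"result_1": 100, "result_2": 50},
    "rnaseq": {"result_2": 50, "result_3": 25},
}
per_item, total = summarize_disk_usage(usage)
# per_item sums to 225, but total is 175 because result_2 is shared.
```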

Remove old mysql code

There are a few remaining commands that either do not use the db_ensembl_* config or make direct connections to mysql servers. For the former, they should be updated to use the config variables. For the latter, we should consider removing the code or disabling the connection to mysql servers for old data sets like build36.

Consider updating:
./Model/Tools/Ensembl/Base.pm
./Model/Tools/ImportAnnotation/UpdateAnnotation.pm

Consider removing or disabling mysql connections from:
./Model/Tools/Pcap/Ace.t
./Model/Tools/Pcap/Config.pm
./Model/Tools/Sv/BreakAnnot.pl
./Model/Tools/Sv/SvAnnot.pm

Update default paths in `gmt transcriptome ercc-map-unaligned`

The default values are no longer accessible on the MGI filesystem.

The new paths:

--ercc-fasta-file /gscmnt/gc2560/core/model_data/2861523156/build0bfc1bd5fcfc474c9db737a520ae109d/appended_sequences.fa --ercc-spike-in-file=/gscmnt/gc2560/core/RNASeq/ERCC/metadata/ERCC_Controls_Analysis-v1.txt

Move PerLaneTophat Allocations

When moving allocations for a build, the class Genome::InstrumentData::AlignmentResult::PerLaneTophat is not in the whitelist of acceptable allocations to move. Should these files be moved along with the other allocations, e.g. MergedAlignmentResults?

Add ClinSeq model inputs for tumor/normal microarray.

Somatic Validation models do not have microarray inputs. This means that ClinSeq cannot run microarray copy number, etc. when SomVal model types are inputs. Rather than supplying the microarray build, ClinSeq should have tumor/normal microarray inputs that are the paths to a VCF or the VCF result.

AnP command to move allocations

The new command would move ALL allocations for an AnP from one disk group to another. The implementation should allow for the move of individual models/builds as well.

See #1687 for a discussion of disk_usage_allocations. Once implemented for all model types using a list of subclass names that "belong" to a model, this new command should be fairly straightforward when an allocation is unique to an AnP.

Jenkins genome model test: 5.10-clinseq-wer - Build 2129 - Diffs Found

Project: 2-Genome-Model-Tests-2-Run-Models/TEST_SPEC=5.10-clinseq-wer
Build: https://apipe-ci.gsc.wustl.edu/job/2-Genome-Model-Tests-2-Run-Models/TEST_SPEC=5.10-clinseq-wer/2129/
Console: https://apipe-ci.gsc.wustl.edu/job/2-Genome-Model-Tests-2-Run-Models/TEST_SPEC=5.10-clinseq-wer/2129/console


DIFFERENCES FOUND:
Comparing new object 7712d85d4956475e9649622d2b93165d to blessed object 0bc8c6ff5d4c4fc0ae645335f738a29d
File: AML109/mutation-spectrum/b20_q10/exome/mutation_spectrum_sequence_context/AML109.prop.test.2type
Reason: files are not the same (diff -u {/gscmnt/gc9026/test/model_data/8a175224cae14b80ab36d749216efb2b/build0bc8c6ff5d4c4fc0ae645335f738a29d,/gscmnt/gc9026/test/model_data/8a175224cae14b80ab36d749216efb2b/build7712d85d4956475e9649622d2b93165d}/AML109/mutation-spectrum/b20_q10/exome/mutation_spectrum_sequence_context/AML109.prop.test.2type)

If you want to bless this object (7712d85d4956475e9649622d2b93165d) update and commit the DB file (Model/Build/Command/DiffBlessed.pm.YAML).

Handle index during BAM/CRAM conversion

The index file is not useful after going round trip BAM->CRAM->BAM. The BAM file must be reindexed. Appropriately handle or fix the index creation to make this easier on the end user. For now, indexing can be handled by running samtools index.

In fact, there is a crai index we should make that allows for viewing of the CRAM directly in IGV 3 or greater.
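
Until the conversion handles this itself, re-indexing can be wrapped in a small helper. The Python sketch below only builds the `samtools index` command and the index file name that samtools writes by default (`.bai` next to a BAM, `.crai` next to a CRAM); the function names are hypothetical, and running the command assumes `samtools` is on the PATH.

```python
import subprocess
from typing import List


def index_command(alignment_path: str) -> List[str]:
    """Build the samtools command that (re)creates the index."""
    return ["samtools", "index", alignment_path]


def expected_index_path(alignment_path: str) -> str:
    """Return the default index file name for a BAM or CRAM file."""
    if alignment_path.endswith(".cram"):
        return alignment_path + ".crai"
    if alignment_path.endswith(".bam"):
        return alignment_path + ".bai"
    raise ValueError(f"Not a BAM or CRAM file: {alignment_path}")


def reindex(alignment_path: str) -> str:
    """Run samtools index and return the path of the index it writes."""
    subprocess.run(index_command(alignment_path), check=True)
    return expected_index_path(alignment_path)
```

Calling `reindex` after each conversion step would replace a stale index left over from the previous format.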

Require 2FA for genome org

Similar to the University IT requirement to enable 2FA, we should also require 2FA for push access to repositories. The team @genome/unauth includes the users that have not enabled 2FA. If you'd like to remain in your teams that have push access to genome repos, please enable 2FA as soon as possible. Using the University deadline of October 31st, I'll revisit the list of users that have enabled 2FA at that time. If you have not enabled 2FA by that time, you will be removed from the teams that allow push access to any genome repo. Also, if you'd like to be removed from the Genome org at this time to avoid future communications, please comment and I'll remove you from the Genome GitHub Organization.

How to install

Hi, could you tell me how to install and use this software on Ubuntu?
I did not find the README file, so I tried to install genome but failed. The main installation steps are as follows:
1. Run git clone https://github.com/genome/genome.git
2. Go into the bin directory and run ./genome; an error message then appears:
unable to locate spec: config at /usr/local/lib/x86_64-linux-gnu/perl/5.22.1/Genome/Site.pm line 19. BEGIN failed--compilation aborted at /usr/local/lib/x86_64-linux-gnu/perl/5.22.1/Genome/Site.pm line 53. Compilation failed in require at /usr/local/lib/x86_64-linux-gnu/perl/5.22.1/Genome.pm line 40. Compilation failed in require at (eval 1) line 2. BEGIN failed--compilation aborted at (eval 1) line 2. BEGIN failed--compilation aborted at ./genome line 23.
I really need your help. Looking forward to your reply!

ClinSeq with Somatic Validation Model Inputs

ClinSeq currently uses Somatic Variation, Reference Alignment and RNA-seq models as input. Somatic Variation models require Reference Alignment models as inputs as well. The model dependency chain introduces several manual steps in the analysis process. Somatic Validation models combine the pipeline steps of RefAlign and SomVar in one workflow. This removes one of the manual steps in performing ClinSeq analysis.

Relevant tasks:

VcfToBed.pm 1-based flag appears broken

Hi, the --one-based flag of gmt bed convert vcf-to-bed appears to still output 0-based coordinates:

input:
9 34795908 . CT C . .

example output:
9 34795908 34795909 T/* - -

should be:
9 34795909 34795909 T/* - -
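
The coordinate arithmetic behind the expected output can be written down as a short Python sketch. This is not the implementation in VcfToBed.pm; the function name is hypothetical and it only handles simple left-anchored deletions like the one in the report.

```python
from typing import Tuple


def deletion_to_bed(
    chrom: str, pos: int, ref: str, alt: str, one_based: bool = False
) -> Tuple[str, int, int, str]:
    """Convert a left-anchored VCF deletion to (chrom, start, stop, allele).

    VCF positions are 1-based. BED starts are 0-based and stops are
    exclusive, which is numerically the 1-based position of the last base.
    """
    if not (ref.startswith(alt) and len(ref) > len(alt)):
        raise ValueError("Only simple left-anchored deletions are handled here.")

    deleted = ref[len(alt):]
    first = pos + len(alt)             # 1-based position of first deleted base
    last = first + len(deleted) - 1    # 1-based position of last deleted base

    start = first if one_based else first - 1
    return (chrom, start, last, f"{deleted}/*")


zero_based = deletion_to_bed("9", 34795908, "CT", "C")
one_based = deletion_to_bed("9", 34795908, "CT", "C", one_based=True)
# zero_based -> ("9", 34795908, 34795909, "T/*")
# one_based  -> ("9", 34795909, 34795909, "T/*")
```

Only the start coordinate differs between the two modes, which matches the reported and expected lines above.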

Add AnP command to update status

Currently owner is inferred as created_by. Do we want a new owner?

Either way, this issue is specifically about adding a command that allows the created_by/owner to update the status from 'Completed' back to 'In Progress'.

Add Somatic Validation tracks to DumpIgvXml ClinSeq command.

The DumpIgvXml command relies on finding the tumor_build and normal_build which only exist as ReferenceAlignment builds on Somatic Variation inputs. The DumpIgvXml command should resolve the path to the tumor/normal bed/bam file for all somatic model types.

Containerize core GMS functions.

A few core tools, e.g. samtools, bwa, bedtools, joinx, liftover, etc., are used in basic GMS functions. There seem to be a few options here:

  1. include the tool in the genome image
  2. execute a simple workflow step using the correct image
  3. execute a CWL workflow to produce the desired result

In general, this is a modern solution to the complete removal of legacy /gsc installed software, see #1560

Find and re-locate legacy software/data hiding out under `/gsc/`.

The end goal is to remove the dependency on /gsc/ as a special directory for the GMS.

  • Some software/versions might be old enough we don't need them any more.
  • Some will probably need to be incorporated into our Docker image somehow (maybe in a local /gsc/ directory?)

Refactor SummarizeTier1SnvSummary to use SnvIndelReport result

This step originally ran bam-readcount redundantly with the SnvIndelReport. The original reworking of the ClinSeq workflow did NOT add this step back to the process. If needed, the R script could be rewritten to use the existing read counts and VAFs in the SnvIndelReport. Here is an example file:
/gscmnt/gc13001/info/model_data/9960e32e7f344f17b476b728ac487bb3/build24cd53cfd5ab449abb2e3991f846c2cd/BRC251/snv_indel_report/b1_q1/BRC251_final_filtered_coding.tsv

Modify ClinSeq update analysis to use Analysis Projects (AnPs).

genome model clin-seq update-analysis does not handle Somatic Validation models as inputs to ClinSeq. Rather than add Somatic Validation models, we can leverage the Analysis Project configuration that was used for the project. ClinSeq models are never assigned to an AnP anyway. Even when Somatic Variation models are created they should be assigned to an AnP.

Additional info from JIRA CI-81:
UpdateAnalysis has a number of defaults for its inputs. Additionally, there is a hard-coded translation in the method get_roi_name
Other commands that may have hard-coded things include:
DumpIgvXml
GenerateClonalityPlots
SummarizeCnvs
