genome / genome
Core modules used by the GMS
License: GNU Lesser General Public License v3.0
# Genome

Genome analysis software from The Genome Institute at Washington University School of Medicine, funded by the National Human Genome Research Institute.

## Testing

Running tests currently depends on many TGI resources, so testing cannot be performed outside TGI at this time. Within TGI you can run tests with:

    genome-test-env test-tracker prove --lsf --git

This chains the `genome-test-env` and `test-tracker` commands together. See `test-tracker prove --help` for its usage. For example, you may need to pass a value to the `--git` option if your branch is not set up to track `origin/master`.
When the md5 for a LIMS unarchive doesn't match, compare flagstat to minimally ensure the read counts match and protect from truncated BAM files.
Finish taking out references to `Workflow` in `Genome` (and thus end support for the `genome config set-env workflow_builder_backend workflow` case).
Instead of failing immediately with an Unstartable status, allow builds to start with missing instrument data files. Often the data is not realigned, but reprocessed through a new pipeline type or with parameters that do not change the alignment strategy. If the data IS realigned or the raw instrument data IS required by a downstream step, fail at that point of the pipeline instead.
Remove the workflow submodule from the `Genome` repo and update our build processes to no longer care about it.
Initial work was done by @ernfrid here: #1294
BWA docs : https://github.com/lh3/bwa/blob/master/README-alt.md
GATK Blog : https://software.broadinstitute.org/gatk/blog?id=8180
ReferenceSequence models will need to support auxiliary files required by downstream tools to correctly process the alt alignments.
A new version of SpeedSeq is near completion. This includes a new version of BWA that supports alt contigs.
Completion of this Epic requires successful SomaticValidation, RNA-seq, ClinSeq and SingleSampleGenotype builds run using HCC1395 test data. See the existing ClinSeq AnP for data:
http://spectacle.gsc.wustl.edu/analysis_projects/ef89d9e8c1f942c492a288e5b1b4b078
Eventually we may want to support WDL, but for now add Cromwell as a backend replacement for Toil when running CwlPipeline models.
When a build has zero allocations, an unhelpful error message is returned:
2017/05/26 08:27:52 Genome::Command: Can't call method "disk_group_name" on an undefined value at /gsc/scripts/opt/genome/snapshots/genome-3750/lib/perl/Genome/Model/Build/Command/MoveAllocations.pm line 82.
ERROR: Can't call method "disk_group_name" on an undefined value at /gsc/scripts/opt/genome/snapshots/genome-3750/lib/perl/Genome/Model/Build/Command/MoveAllocations.pm line 82.
Improve error handling. Another edge case is if the command fails to resolve the disk group from the build/AnP.
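The two edge cases above (no allocations, and an allocation whose disk group cannot be resolved) could be guarded roughly as follows. This is a minimal sketch in Python rather than the Perl of `MoveAllocations.pm`; the function name, the exception class, and the dict-shaped allocations are all hypothetical and only illustrate the checks.

```python
class MoveAllocationsError(Exception):
    """Raised with a message the end user can act on."""


def resolve_disk_group_names(build_id, allocations):
    """Return the sorted disk group names used by a build's allocations.

    `allocations` is a list of dicts standing in for allocation objects
    (hypothetical shape: each has a 'disk_group_name' key).
    """
    # Edge case 1: the build has no allocations at all.
    if not allocations:
        raise MoveAllocationsError(
            f"Build {build_id} has no allocations; nothing to move."
        )
    names = set()
    for allocation in allocations:
        name = allocation.get("disk_group_name")
        # Edge case 2: the disk group cannot be resolved.
        if name is None:
            raise MoveAllocationsError(
                f"Could not resolve a disk group for an allocation of build {build_id}."
            )
        names.add(name)
    return sorted(names)
```

The point is only that both failures produce a message naming the build, instead of a "Can't call method on an undefined value" trace.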
From JIRA CI-157
The following unit tests depend on getting builds from the live db:
Somatic variation builds:
Rna-seq builds:
SomVar + Rna-Seq:
Clin-seq build:
Multiple types?
The `instrument_data_properties` are currently processed here. We have a scalar and an `ARRAY` case. Adding a `HASH` case to allow for one level of nesting could be one way to solve this. We need to be careful to merge with any existing hash but not leak one instrument data's values back to the original configuration hash.
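The "merge without leaking" concern can be illustrated with a small sketch. It is written in Python rather than Perl, and the function name and the example keys are hypothetical; the idea is simply to copy the configuration before merging so that per-instrument-data values never land in the shared hash.

```python
import copy


def merge_instrument_data_properties(config_properties, overrides):
    """Merge `overrides` into a copy of `config_properties`.

    Supports one level of nesting: when both sides hold a dict under the
    same key, the inner dicts are merged; otherwise the override wins.
    The original configuration is never modified.
    """
    merged = copy.deepcopy(config_properties)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key].update(copy.deepcopy(value))
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical configuration and per-instrument-data values.
config = {"species": "human", "extra": {"a": 1}}
merged = merge_instrument_data_properties(config, {"extra": {"b": 2}})
```

After the call, `merged["extra"]` holds both keys while `config["extra"]` still holds only `a`.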
From JIRA CI-110 requested by @chrisamiller :
We'd like to have the results from `gmt annotate add-evs-maf` added to the SnvIndel report. That's the one set of info that my group often uses that is absent from the current implementation.
There has been internal discussion about porting existing alignments from BAM to CRAM to reduce file size. Instead of re-importing CRAM as External alignment results, we can make a filetype attribute (using metric) on all AlignedBamResults. Then `bam_path` will resolve the correct file extension based on this attribute, `bam` or `cram` (with default value `bam`). A command like `genome instrument-data alignment-result compress` would be added to the code base. Builds and callers that use the `bam_path` accessor would then get CRAM file paths instead.
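A minimal sketch of the proposed resolution logic, in Python rather than Perl. The function name, the `basename` argument, and the directory layout are assumptions for illustration; only the `bam`/`cram` values and the `bam` default come from the proposal above.

```python
from pathlib import PurePosixPath

VALID_FILETYPES = ("bam", "cram")


def alignment_file_path(output_dir, basename, filetype="bam"):
    """Build the alignment file path from a stored filetype attribute.

    `filetype` stands in for the proposed metric on the alignment result;
    results without the attribute fall back to the default, "bam".
    """
    if filetype not in VALID_FILETYPES:
        raise ValueError(f"Unsupported alignment filetype: {filetype!r}")
    return str(PurePosixPath(output_dir) / f"{basename}.{filetype}")
```

Callers keep using a single accessor and receive whichever extension the result currently has, which is what lets existing builds pick up CRAM paths without code changes.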
Right now, there's kind of a dead space in between the standard import (which does a ton of QC, sanitizing, linking, etc.) and the trusted-importer (which does none of that). As our sequencing partners evolve, and we start to accept more data that doesn't come through the established LIMS link, we need to revamp this process to:
- accept key metadata attributes in some defined import format (csv?)
- create or link sample and library objects and populate them appropriately
- import raw sequence data and link the instrument data to these samples/libs with a minimum of fuss
Add a build step (or a workflow step if possible) to create human readable symlinks to the output files of CWL workflows.
Either revamp something like `genome db` or make a new generic file importer. The idea is to simplify the process of storing VCF, GTF, BED, etc. for use in future CwlPipelines. This could lower the burden of creating specialty model types or custom importers for specific files; a recent example is gnomAD #1777.
The solution should retain minimal metadata. One of the requirements should be a valid reference sequence. I'm open to debate on the ease of validating the refseq matches the imported file. It's possible we simply have to rely on the honor system since validation of many file types could be cumbersome....
Before we process the next round of MDS data, we should add the build ID and a timestamp (creation date) for each file in the MANIFEST.
The original implementation assumed an RnaSeq build would be used. Instead, provide the BAM path so this tool will work with CwlPipeline builds just as well as RnaSeq (or any BAM file).
The new command `genome analysis-project disk-usage` sums the usage for each config item. Those values do not add up to the total for the AnP, since some results are shared between config items. Also add a master list of results and sum the total disk usage for the AnP without double-counting results shared across multiple config items.
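The difference between the per-config-item sums and the deduplicated total can be sketched as below. This is Python, not the command's Perl; the function name and the input shape (config item ID mapped to result ID mapped to size) are hypothetical.

```python
def summarize_disk_usage(results_by_config_item):
    """Summarize disk usage per config item and for the whole AnP.

    `results_by_config_item` maps a config item ID to a dict of
    {result_id: size}. A result shared by several config items appears
    under each of them, always with the same size.
    """
    # Per-item sums count shared results once per config item.
    per_item = {
        item: sum(results.values())
        for item, results in results_by_config_item.items()
    }
    # The master list keys on result ID, so shared results appear once.
    master = {}
    for results in results_by_config_item.values():
        master.update(results)
    return per_item, master, sum(master.values())


usage = {
    "config-1": {"result-a": 10, "result-b": 5},
    "config-2": {"result-b": 5, "result-c": 1},
}
per_item, master, total = summarize_disk_usage(usage)
```

Here the per-item sums are 15 and 6, which add up to 21, while the deduplicated total is 16 because `result-b` is counted only once.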
There are a few remaining commands that either do not use the `db_ensembl_*` config or make direct connections to MySQL servers. The former should be updated to use the config variables. For the latter, we should consider removing the code or disabling the connection to MySQL servers for old data sets like build36.
Consider updating:
./Model/Tools/Ensembl/Base.pm
./Model/Tools/ImportAnnotation/UpdateAnnotation.pm
Consider removing or disabling mysql connections from:
./Model/Tools/Pcap/Ace.t
./Model/Tools/Pcap/Config.pm
./Model/Tools/Sv/BreakAnnot.pl
./Model/Tools/Sv/SvAnnot.pm
The default values are no longer accessible on the MGI filesystem.
The new paths:
--ercc-fasta-file /gscmnt/gc2560/core/model_data/2861523156/build0bfc1bd5fcfc474c9db737a520ae109d/appended_sequences.fa --ercc-spike-in-file=/gscmnt/gc2560/core/RNASeq/ERCC/metadata/ERCC_Controls_Analysis-v1.txt
When moving allocations for a build, the class `Genome::InstrumentData::AlignmentResult::PerLaneTophat` is not in the whitelist of acceptable allocations to move. Should these files be moved along with the other allocations, i.e. MergedAlignmentResults?
Implement a new SV caller and replace Breakdancer in somatic SV detection strategies.
For the command `genome model cwl-pipeline prep-for-transfer`, add an option to include the FASTA to support CRAM files and conversions.
See #1623 for human equivalent.
Somatic Validation models do not have microarray inputs. This means that ClinSeq cannot run microarray copy number, etc. when SomVal model types are inputs. Rather than supplying the microarray build, ClinSeq should have tumor/normal microarray inputs that are the paths to a VCF or the VCF result.
Please see Pull Request #1652 for details.
The new command would move ALL allocations for an AnP from one disk group to another. The implementation should allow for the move of individual models/builds as well.
See #1687 for a discussion of `disk_usage_allocations`. Once implemented for all model types using a list of subclass names that "belong" to a model, this new command should be fairly straightforward when an allocation is unique to an AnP.
Project: 2-Genome-Model-Tests-2-Run-Models/TEST_SPEC=5.10-clinseq-wer
Build: https://apipe-ci.gsc.wustl.edu/job/2-Genome-Model-Tests-2-Run-Models/TEST_SPEC=5.10-clinseq-wer/2129/
Console: https://apipe-ci.gsc.wustl.edu/job/2-Genome-Model-Tests-2-Run-Models/TEST_SPEC=5.10-clinseq-wer/2129/console
DIFFERENCES FOUND:
Comparing new object 7712d85d4956475e9649622d2b93165d to blessed object 0bc8c6ff5d4c4fc0ae645335f738a29d
File: AML109/mutation-spectrum/b20_q10/exome/mutation_spectrum_sequence_context/AML109.prop.test.2type
Reason: files are not the same (diff -u {/gscmnt/gc9026/test/model_data/8a175224cae14b80ab36d749216efb2b/build0bc8c6ff5d4c4fc0ae645335f738a29d,/gscmnt/gc9026/test/model_data/8a175224cae14b80ab36d749216efb2b/build7712d85d4956475e9649622d2b93165d}/AML109/mutation-spectrum/b20_q10/exome/mutation_spectrum_sequence_context/AML109.prop.test.2type)
If you want to bless this object (7712d85d4956475e9649622d2b93165d) update and commit the DB file (Model/Build/Command/DiffBlessed.pm.YAML).
Using NovaSeq as an example, importing the FASTQ files required requests of 92GB of RAM to ensure success.
See internal JIRA issue CIS-93 for a few examples.
The index file is not useful after a BAM->CRAM->BAM round trip. The BAM file must be reindexed. Appropriately handle or fix the index creation to make this easier on the end user. For now, indexing can be handled by running `samtools index`.
In fact, there is a `crai` index we should make that allows for viewing of the CRAM directly in IGV 3 or greater.
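As an interim helper, the reindexing step could be wrapped as sketched below. The function name is hypothetical and this is not part of the Genome code base; it only builds the `samtools index` command and the index path that samtools writes by default (`.bai` appended for BAM, `.crai` appended for CRAM).

```python
import os

# Default index suffix that `samtools index` appends for each file type.
INDEX_SUFFIX = {".bam": ".bai", ".cram": ".crai"}


def index_command(alignment_path):
    """Return (command, expected_index_path) for a BAM or CRAM file."""
    extension = os.path.splitext(alignment_path)[1].lower()
    if extension not in INDEX_SUFFIX:
        raise ValueError(f"Not a BAM or CRAM file: {alignment_path}")
    command = ["samtools", "index", alignment_path]
    return command, alignment_path + INDEX_SUFFIX[extension]
```

The returned command can be passed to `subprocess.run(command, check=True)` on a host where samtools is installed. Removing any stale index left over from before the conversion first avoids a mismatched index being picked up.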
Similar to the University IT requirement to enable 2FA, we should also require 2FA for push access to repositories. The team @genome/unauth includes the users that have not enabled 2FA. If you'd like to remain in your teams that have push access to genome repos, please enable 2FA as soon as possible. Using the University deadline of October 31st, I'll revisit the list of users that have enabled 2FA at that time. If you have not enabled 2FA by that time, you will be removed from the teams that allow push access to any genome repo. Also, if you'd like to be removed from the Genome org at this time to avoid future communications, please comment and I'll remove you from the Genome GitHub Organization.
Hi, could you tell me how to install and use this software on Ubuntu?
I did not find the `README` file, so I tried to install genome but failed. The main installation steps are as follows:
1. Run `git clone https://github.com/genome/genome.git`
2. Go into the `bin` directory and run `./genome`, then an error message appears:
unable to locate spec: config at /usr/local/lib/x86_64-linux-gnu/perl/5.22.1/Genome/Site.pm line 19.
BEGIN failed--compilation aborted at /usr/local/lib/x86_64-linux-gnu/perl/5.22.1/Genome/Site.pm line 53.
Compilation failed in require at /usr/local/lib/x86_64-linux-gnu/perl/5.22.1/Genome.pm line 40.
Compilation failed in require at (eval 1) line 2.
BEGIN failed--compilation aborted at (eval 1) line 2.
BEGIN failed--compilation aborted at ./genome line 23.
I really need your help. Looking forward to your reply!
ClinSeq currently uses Somatic Variation, Reference Alignment and RNA-seq models as input. Somatic Variation models require Reference Alignment models as inputs as well. The model dependency chain introduces several manual steps in the analysis process. Somatic Validation models combine the pipeline steps of RefAlign and SomVar in one workflow. This removes one of the manual steps in performing ClinSeq analysis.
Relevant tasks:
Using the `--disabled-configs` option it is displayed, but inactive configs are not disabled... it's a little confusing for end-users.
The command `genome instrument-data import basic` should use the disk group defined by the AnP used in the command.
No longer rely on CoverageStats or RefCov, but generate a coverage report from QC results (CollectHsMetrics).
Hi, using `gmt bed convert vcf-to-bed` with the `--one-based` flag appears to still output 0-based coordinates:
input:
9 34795908 . CT C . .
example output:
9 34795908 34795909 T/* - -
should be:
9 34795909 34795909 T/* - -
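The expected coordinates can be worked through with a small sketch. It is written in Python and is not the `gmt` implementation; the function name is hypothetical, it handles only simple deletions where ALT is a prefix of REF, and the `T/*` allele notation is copied from the output above.

```python
def deletion_to_bed(chrom, pos, ref, alt, one_based=False):
    """Convert a simple VCF deletion (e.g. CT -> C) to BED-style fields.

    VCF `pos` is 1-based and points at the first base of REF, which is
    the anchor base retained in ALT.
    """
    if not (ref.startswith(alt) and len(ref) > len(alt)):
        raise ValueError("Only simple deletions (ALT is a prefix of REF) are handled")
    deleted = ref[len(alt):]
    # 0-based index of the first deleted base: skip the anchor base(s).
    zero_based_start = pos - 1 + len(alt)
    # The end is the same in both conventions (exclusive vs. inclusive).
    end = zero_based_start + len(deleted)
    start = zero_based_start + 1 if one_based else zero_based_start
    return (chrom, start, end, f"{deleted}/*")


zero = deletion_to_bed("9", 34795908, "CT", "C")
one = deletion_to_bed("9", 34795908, "CT", "C", one_based=True)
```

For the record above, the zero-based call gives start 34795908 and end 34795909, and the one-based call gives 34795909 for both, matching the "should be" line.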
Currently the owner is inferred as `created_by`. Do we want a new owner?
Either way, this issue is specifically about adding a command that allows the created_by/owner to update the status from 'Completed' back to 'In Progress'.
The DumpIgvXml command relies on finding the tumor_build and normal_build which only exist as ReferenceAlignment builds on Somatic Variation inputs. The DumpIgvXml command should resolve the path to the tumor/normal bed/bam file for all somatic model types.
A few core tools, e.g. samtools, bwa, bedtools, joinx, liftover, etc., are used in basic GMS functions. There seem to be a few options here:
In general, this is a modern solution to the complete removal of legacy `/gsc` installed software, see #1560.
The end goal is to remove the dependency on `/gsc/` as a special directory for the GMS.
`/gsc/` directory?)
This step originally ran bam-readcount redundantly with the SnvIndelReport. The original reworking of the ClinSeq workflow did NOT add this step back to the process. If needed, the R script could be rewritten to use the existing read counts and VAFs in the SnvIndelReport. Here is an example file:
/gscmnt/gc13001/info/model_data/9960e32e7f344f17b476b728ac487bb3/build24cd53cfd5ab449abb2e3991f846c2cd/BRC251/snv_indel_report/b1_q1/BRC251_final_filtered_coding.tsv
`genome model clin-seq update-analysis` does not handle Somatic Validation models as inputs to ClinSeq. Rather than add Somatic Validation models, we can leverage the Analysis Project configuration that was used for the project. ClinSeq models are never assigned to an AnP anyway. Even when Somatic Variation models are created they should be assigned to an AnP.
Additional info from JIRA CI-81:
UpdateAnalysis has a number of defaults for its inputs. Additionally, there is a hard-coded translation in the method `get_roi_name`.
Other commands that may have hard-coded things include:
DumpIgvXml
GenerateClonalityPlots
SummarizeCnvs