nf-core / phaseimpute
License: MIT License
Add STITCH software as one of the imputation modes.
Add QUILT software as one of the imputation modes.
Most of the small tests we use have phased = true. A test should be added that evaluates the output of the VCF_PHASE_PANEL subworkflow. Specifically this part:
if (params.phased == false) {
    VCF_PHASE_SHAPEIT5(
        ch_vcf.map { meta, vcf, csi -> [meta, vcf, csi, [], meta.region] },
        Channel.of([[],[],[]]).collect(),
        Channel.of([[],[],[]]).collect(),
        Channel.of([[],[]]).collect()
    )
    ch_versions = ch_versions.mix(VCF_PHASE_SHAPEIT5.out.versions)
    ch_panel_phased = VCF_PHASE_SHAPEIT5.out.variants_phased
        .combine(VCF_PHASE_SHAPEIT5.out.variants_index, by: 0)
} else {
    ch_panel_phased = ch_vcf
}
Currently, the pipeline requires the channels from --step panelprep to run the following modes. To start the pipeline simply from different steps (such as --step impute), some files are necessary depending on the tool used (such as chunks, posfile, bams and panel). The bams and panel can be added via a csv with --input and --panel. We should add the other files as external params.
Create the chromosome renaming list automatically.
Use the fai and add or remove the "chr" prefix.
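As a rough illustration of that idea, the sketch below derives old/new contig pairs from the first column of a .fai index and writes them one pair per line, which is the layout `bcftools annotate --rename-chrs` reads. The function names and the sample .fai rows are hypothetical, not part of the pipeline.

```python
def rename_map_from_fai(fai_lines, add_prefix=True):
    """Build (old, new) contig-name pairs from the lines of a .fai index."""
    pairs = []
    for line in fai_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name = line.split("\t")[0]  # first .fai column is the contig name
        if add_prefix:
            new = name if name.startswith("chr") else "chr" + name
        else:
            new = name[3:] if name.startswith("chr") else name
        pairs.append((name, new))
    return pairs


def format_rename_file(pairs):
    """One 'old new' pair per line, whitespace separated."""
    return "".join(f"{old} {new}\n" for old, new in pairs)


# Illustrative .fai rows (name, length, offset, linebases, linewidth).
example_fai = ["21\t1000\t4\t60\t61\n", "22\t2000\t1025\t60\t61\n"]
print(format_rename_file(rename_map_from_fai(example_fai)), end="")
```

This only toggles the literal "chr" prefix; special cases such as MT versus chrM would need their own rule.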
While the current testing of the pipeline works, the CSVs are not stored in the nf-core repository. This should be corrected for reproducibility.
Example phaseimpute/tests/csv:
sample,vcf,csi
NA12878,https://raw.githubusercontent.com/louislenezet/test-datasets/imputation/data/individuals/NA12878/NA12878.s.1x.bcf,https://raw.githubusercontent.com/louislenezet/test-datasets/imputation/data/individuals/NA12878/NA12878.s.1x.bcf.csi
NA19401,https://raw.githubusercontent.com/louislenezet/test-datasets/imputation/data/individuals/NA19401/NA19401.s.1x.bcf,https://raw.githubusercontent.com/louislenezet/test-datasets/imputation/data/individuals/NA19401/NA19401.s.1x.bcf.csi
NA20359,https://raw.githubusercontent.com/louislenezet/test-datasets/imputation/data/individuals/NA20359/NA20359.s.1x.bcf,https://raw.githubusercontent.com/louislenezet/test-datasets/imputation/data/individuals/NA20359/NA20359.s.1x.bcf.csi
We need to normalize the names of tools used such as BCFTOOLS_INDEX.
Same for the use of GAWK.
What can be done:
The panel preparation and the imputation are not yet done separately for glimpse and quilt.
The aim would be to do all the preprocessing in the get_panel subworkflow, to avoid duplicated modules and improve readability.
There are several modules that are used many times in the pipeline. However, in modules.config the configuration for these modules is global and not specific for the subworkflows. An example:
withName: GLIMPSE_CHUNK {
    ext.args = [
        "--window-size 200000",
        "--buffer-size 20000"
    ].join(' ')
    ext.prefix = { "${meta.id}" }
}
This global config affects all the subworkflows that use the same module, even when those have their own config.
Specific config:
withName: 'NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:GLIMPSE_CHUNK' {
    ext.prefix = { "${meta.id}_${meta.chr}" }
}
Example of error:
Caused by:
Process `NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:GLIMPSE_CHUNK (1000G_phased)` terminated with an error exit status (1)
Command executed:
GLIMPSE_chunk \
--window-size 200000 --buffer-size 20000 \
--input ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--region chr22 \
--thread 2 \
--output 1000G_phased_chr22.txt
Notice how --window-size and --buffer-size are NOT defined in the NFCORE_PHASEIMPUTE:PHASEIMPUTE:MAKE_CHUNKS:GLIMPSE_CHUNK config, but they are applied anyway.
Therefore, we should strive to add the corresponding subworkflow for each module configuration. This should be solved before adding new functionality as it can have a snowball effect.
The metromap is not yet updated to match the current pipeline.
This should be addressed for the first release.
Move to the new plugin nf-schema instead of the actual nf-validation
Proposal: Adding an optional subworkflow to remove a specified sample from the reference panel. For example, NA12878 is always included in the reference panel and is the typical sample used for assessing performance.
Edit the test.config file and fix a folder for storing the output files with the "--outdir" argument
Before the first release, here are the different issues that need to be addressed:
The full scale test should work.
Which test should we run at full scale?
I think the most exhaustive would be a simulation, pre-processing of the panel, phasing of the latter, imputation and validation.
Some params are mandatory for some steps, while others are not.
For instance, users may want to run --step panelprep to obtain all the necessary panel files. However, they do not need an --input or --tools in this case.
nextflow run phaseimpute -profile test_panelprep,docker --outdir test
ERROR ~ No tools provided. Expression: params.tools
-- Check script 'phaseimpute/./subworkflows/local/utils_nfcore_phaseimpute_pipeline/main.nf' at line: 291 or see '.nextflow.log' file for more details
The schema_input_panel.json requires that the "panel index file cannot contain spaces and must have extension '.vcf' or '.bcf' with '.csi' or '.tbi' extension". However, 1000G index files, such as s3://1000genomes/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi, end with the extension vcf.gz.tbi.
Solution:
"index": {
    "type": "string",
    "pattern": "^\\S+\\.(vcf|bcf)(\\.gz)?\\.(tbi|csi)$",
    "errorMessage": "Panel index file must be provided, cannot contain spaces and must have extension '.vcf' or '.bcf' with optional '.gz' extension and with '.csi' or '.tbi' extension"
}
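A quick way to sanity-check the proposed pattern outside the pipeline is to run it against a few file names. The sketch below uses Python's re module, whose syntax agrees with the schema's for this particular expression; the helper name and the local file names are made up for illustration.

```python
import re

# The proposed schema pattern, after JSON unescaping of the backslashes.
INDEX_PATTERN = re.compile(r"^\S+\.(vcf|bcf)(\.gz)?\.(tbi|csi)$")


def is_valid_panel_index(path):
    """Return True if the path satisfies the proposed index pattern."""
    return INDEX_PATTERN.match(path) is not None


candidates = [
    "s3://1000genomes/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi",
    "panel.bcf.csi",         # hypothetical local file
    "panel.vcf.gz",          # a VCF, not an index
    "my panel.vcf.gz.tbi",   # contains a space
]
for path in candidates:
    print(is_valid_panel_index(path), path)
```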
It's not a good practice to have two main.nf scripts. It is better to name the main workflow as phaseinput.nf
nextflow.config file: test_sim & test_panelprep
Add glimpse2 as an alternative imputation tool
Develop a module that can handle non PAR chr X regions and perform imputation.
In the panel preparation phase, we generate two types of tsv:
-f'%CHROM\t%POS\t%REF,%ALT\\n'
-f'%CHROM\t%POS\t%REF\t%ALT\\n'
I think it would be convenient to generate only a single type of tsv. This would be useful when using these files as independent inputs with param --posfile, so that they can have the same post-processing.
To make them specific to each tool, we could add a pre-processing step where we replace the last \t with , for example.
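The conversion step suggested here is small. A minimal sketch, assuming the single tsv keeps the four tab-separated columns CHROM, POS, REF, ALT and that the function name is only illustrative:

```python
def tab_to_comma_alleles(line):
    """Turn 'CHROM<TAB>POS<TAB>REF<TAB>ALT' into 'CHROM<TAB>POS<TAB>REF,ALT'."""
    head, sep, alt = line.rstrip("\n").rpartition("\t")
    if not sep:
        raise ValueError(f"no tab found in line: {line!r}")
    # Only the last tab is replaced; earlier separators are left untouched.
    return f"{head},{alt}"


print(tab_to_comma_alleles("chr21\t16570000\tA\tG"))
```

Keeping the tab-separated file as the single stored format means the comma variant can always be derived from it, whereas the reverse would require knowing that neither allele contains a comma.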
The main workflow takes 6 parameters as inputs: ch_input, ch_fasta, ch_panel, ch_region, ch_map, and ch_versions. These variables are not assigned values from params.
The pipeline currently allows for selecting only one single imputation tool. This behavior should be modified as users may want to use more than one tool for comparison.
ERROR ~ ERROR: Validation of pipeline parameters failed!
-- Check '.nextflow.log' file for details
The following invalid input values have been detected:
* --tools: 'glimpse1,quilt' is not a valid choice (Available choices: glimpse1, glimpse2, quilt)
nextflow run phaseimpute -profile test,singularity --outdir test_both --tools glimpse1,quilt
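One way to accept a value such as glimpse1,quilt is to split it on commas and validate each entry against the available choices. The sketch below shows that logic in isolation; the function name is hypothetical, and how it would be wired into the pipeline's parameter schema is left open.

```python
# Choices as listed in the validation error above.
AVAILABLE_TOOLS = ("glimpse1", "glimpse2", "quilt")


def parse_tools(value):
    """Split a comma-separated --tools value and validate every entry."""
    tools = [t.strip() for t in value.split(",") if t.strip()]
    if not tools:
        raise ValueError("No tools provided")
    unknown = [t for t in tools if t not in AVAILABLE_TOOLS]
    if unknown:
        raise ValueError(
            f"Invalid tools {unknown}; available choices: {', '.join(AVAILABLE_TOOLS)}"
        )
    # Drop duplicates while keeping the order given by the user.
    return list(dict.fromkeys(tools))


print(parse_tools("glimpse1,quilt"))
```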
The reference panel currently accepts a BCF. Example:
panel = "https://raw.githubusercontent.com/nf-core/test-datasets/phaseimpute/data/panel/21_22/1000GP.chr21_22.s.norel.bcf"
Reference panels, such as 1000G, are generally stored per chromosome.
It would be useful if the reference panel input flag would accept a CSV. The csv could contain the chromosome names and the URL to the reference panel. For example:
The samplesheet can have as many columns as you desire, however, there is a strict requirement for at least 2 columns to match those defined in the table below.
A final samplesheet file for the reference panel may look something like the one below. This is for 3 chromosomes.
chr,vcf
1,ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
2,ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
3,ALL.chr3.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
Column | Description |
---|---|
chr | Name of the chromosome. Use the prefix 'chr' if the panel uses the prefix. |
vcf | Full path to a VCF file for that chromosome. File has to be gzipped and have the extension ".vcf.gz". |
Each row represents a chromosome with its corresponding VCF file, containing information about the reference haplotype panel. You can obtain reference panels from publicly available sources such as the 1000 Genomes Project phase 3.
The second column, vcf, can directly point to publicly available remote S3 buckets with the 1000G reference panels.
An example of this is: https://github.com/atrigila/quilt_nextflow/blob/master/docs/usage.md#structure-1
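To make the proposed format concrete, here is a minimal sketch of reading such a panel samplesheet and enforcing the two required columns and the ".vcf.gz" extension. The function name is hypothetical, and the real pipeline would perform this validation through its JSON schema rather than in Python.

```python
import csv
import io

REQUIRED_COLUMNS = {"chr", "vcf"}


def read_panel_sheet(handle):
    """Return (chr, vcf) tuples from a panel CSV; extra columns are ignored."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    rows = []
    for row in reader:
        chrom, vcf = row["chr"].strip(), row["vcf"].strip()
        if not vcf.endswith(".vcf.gz"):
            raise ValueError(f"panel file for {chrom} must end with .vcf.gz: {vcf}")
        rows.append((chrom, vcf))
    return rows


sheet = io.StringIO(
    "chr,vcf\n"
    "1,ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz\n"
    "2,ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz\n"
)
for chrom, vcf in read_panel_sheet(sheet):
    print(chrom, vcf)
```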
The nf-co2footprint plugin would be an interesting addition to the pipeline.
For the moment, the glimpse1 subworkflow is used as a standalone unit, as available in nf-core.
But it is not compatible with the pipeline as it is.
We should create a new one, as for glimpse2, to allow the chunks to be computed outside the subworkflow.
It would be nice to test whether the pipeline produces the expected files each time we run it.