Meta-pipeline to identify transposable element insertions using next generation sequencing data

McClintock: A meta-pipeline to identify transposable element insertions using short-read whole genome sequencing data

Getting Started

# INSTALL (Requires Conda and Mamba to be installed)
git clone git@github.com:bergmanlab/mcclintock.git
cd mcclintock
mamba env create -f install/envs/mcclintock.yml --name mcclintock
conda activate mcclintock
python3 mcclintock.py --install
python3 test/download_test_data.py

# RUN
python3 mcclintock.py \
    -r test/sacCer2.fasta \
    -c test/sac_cer_TE_seqs.fasta \
    -g test/reference_TE_locations.gff \
    -t test/sac_cer_te_families.tsv \
    -1 test/SRR800842_1.fastq.gz \
    -2 test/SRR800842_2.fastq.gz \
    -p 4 \
    -o /path/to/output/directory

Getting Started
Introduction
Installing Conda/Mamba
Installing McClintock
McClintock Usage
McClintock Input
McClintock Output
Run Examples
Citation
License

Introduction

Many methods have been developed to detect transposable element (TE) insertions from short-read whole genome sequencing (WGS) data, each of which has different dependencies, run interfaces, and output formats. McClintock provides a meta-pipeline to reproducibly install, execute, and evaluate multiple TE detectors, and to generate output in standardized formats. A description of the original McClintock 1 pipeline and evaluation of the original six TE detectors on the yeast genome can be found in Nelson, Linheiro and Bergman (2017) G3 7:2763-2778. A description of the re-implemented McClintock 2 pipeline, the reproducible simulation system, and evaluation of 12 TE detectors on the yeast genome can be found in Chen, Basting, Han, Garfinkel and Bergman (2023) Mobile DNA 14:8. The TE detectors currently included in McClintock 2 are ngs_te_mapper, ngs_te_mapper2, RelocaTE, RelocaTE2, TEMP, TEMP2, RetroSeq, PoPoolationTE, PoPoolationTE2, TE-locate, TEFLoN, and TEbreak.

Installing Conda/Mamba

McClintock is written in Python3 leveraging the Snakemake workflow system and is designed to run on Linux operating systems. Installation of software dependencies for McClintock and its component methods is automated by Conda, so a working installation of Conda is required to install McClintock. Conda can be installed via the Miniconda installer.

Installing Miniconda (Python 3.X)

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O $HOME/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda # silent mode
echo "export PATH=\$PATH:\$HOME/miniconda/bin" >> $HOME/.bashrc # add to .bashrc
source $HOME/.bashrc
conda init

conda init takes effect only after you close the terminal and open a new one.

Update Conda

conda update -y conda

Install Mamba

conda install -c conda-forge mamba

Installing McClintock

After installing and updating Conda/Mamba, McClintock and its component methods can be installed by: 1. cloning the repository, 2. creating the Conda environment, and 3. running the install script.

Clone McClintock Repository

git clone git@github.com:bergmanlab/mcclintock.git
cd mcclintock

Create McClintock Conda Environment

mamba env create -f install/envs/mcclintock.yml --name mcclintock

This installs the base dependencies (Snakemake, Python3, BioPython) needed to run the main McClintock script into the McClintock Conda environment.

Activate McClintock Conda Environment

conda activate mcclintock

This adds the dependencies installed in the McClintock conda environment to the environment PATH so that they can be used by the McClintock scripts.
This environment must always be activated prior to running any of the McClintock scripts.
NOTE: Activating a conda environment with conda activate myenv sometimes fails when run through a script submitted to a queueing system. This can be fixed by activating the environment in the script as shown below:

CONDA_BASE=$(conda info --base)
source ${CONDA_BASE}/etc/profile.d/conda.sh
conda activate mcclintock

For more on Conda: see the Conda User Guide

Install McClintock Component Methods

To install all of the component methods and create their associated conda environments, use the following command:

python3 mcclintock.py --install

If you only want to install specific methods to save space and time, you can specify method(s) using the -m flag:

python3 mcclintock.py --install -m <method1>,<method2>

NOTE: If you re-run either the full installation or installation of specific methods, the installation script will do a clean installation and remove previously installed components.
If you want to install missing methods to an already existing mcclintock installation, you can use the --resume flag:

python3 mcclintock.py --install --resume

NOTE: If you use the --resume flag when installing specific method(s) with -m, the installation script will only install the specified method(s) if they haven't previously been installed. Do not use the --resume flag if you want to do a clean installation of a specific method.

McClintock Usage

Running the complete McClintock pipeline requires a fasta reference genome (option -r), a set of TE consensus/canonical sequences present in the organism (option -c), and fastq paired-end sequencing reads (options -1 and -2). If only single-end fastq sequencing data are available, they can be supplied using option -1 alone; however, only the TE detectors that handle single-end data will be run. Optionally, if a detailed annotation of TE sequences in the reference genome has been performed, a GFF file with annotated reference TEs (option -g) and a tab-delimited "taxonomy" file linking annotated insertions to their TE family (option -t) can be supplied. Example input files are included in the test directory.

##########################
##       Required       ##
##########################
  -r, --reference REFERENCE
                        A reference genome sequence in fasta format
  -c, --consensus CONSENSUS
                        The consensus sequences of the TEs for the species in
                        fasta format
  -1, --first FIRST
                        The path of the first fastq file from paired end read
                        sequencing or the fastq file from single read
                        sequencing

##########################
##       Optional       ##
##########################
  -h, --help            show this help message and exit
  -2, --second SECOND
                        The path of the second fastq file from a paired end
                        read sequencing
  -p, --proc PROC       The number of processors to use for parallel stages of
                        the pipeline [default = 1]
  -o, --out OUT         An output folder for the run. [default = '.']
  -m, --methods METHODS
                        A comma-delimited list containing the software you
                        want the pipeline to use for analysis. e.g. '-m
                        relocate,TEMP,ngs_te_mapper' will launch only those
                        three methods. If this option is not set, all methods
                        will be run [options: ngs_te_mapper, ngs_te_mapper2, 
                        relocate, relocate2, temp, temp2, retroseq, 
                        popoolationte, popoolationte2, te-locate, teflon, 
                        coverage, trimgalore, map_reads, tebreak]

  -g, --locations LOCATIONS
                        The locations of known TEs in the reference genome in
                        GFF 3 format. This must include a unique ID attribute
                        for every entry. If this option is not set, a file of 
                        reference TE locations in GFF format will be produced 
                        using RepeatMasker
  -t, --taxonomy TAXONOMY
                        A tab delimited file with one entry per ID in the GFF
                        file and two columns: the first containing the ID and
                        the second containing the TE family it belongs to. The
                        family should correspond to the names of the sequences
                        in the consensus fasta file. If this option is not set, 
                        a file mapping reference TE instances to TE families 
                        in TSV format will be produced using RepeatMasker
  -s, --coverage_fasta COVERAGE_FASTA
                        A fasta file that will be used for TE-based coverage
                        analysis, if not supplied then the consensus sequences
                        of the TEs set by -c/--consensus will be used for the 
                        analysis
  -a, --augment AUGMENT
                        A fasta file of TE sequences that will be included as
                        extra chromosomes in the reference file (useful if the
                        organism is known to have TEs that are not present in
                        the reference strain)
  -k, --keep_intermediate KEEP_INTERMEDIATE
                        This option determines which intermediate files are 
                        preserved after McClintock completes [default: general]
                        [options: minimal, general, methods, <list,of,methods>, 
                        all]
  --sample_name SAMPLE_NAME
                        The sample name to use for output files [default: 
                        fastq1 name]
  -n, --config CONFIG   This option determines which config files to use for 
                        your McClintock run [default: config in McClintock 
                        Repository]
  -v, --vcf VCF         This option determines which format of VCF output will 
                        be created [default: siteonly][options: siteonly,sample]
  --install             This option will install the dependencies of McClintock
  --resume              This option will attempt to use existing intermediate 
                        files from a previous McClintock run
  --debug               This option will allow snakemake to print progress to 
                        stdout
  --serial              This option runs without attempting to optimize thread 
                        usage to run rules concurrently. Each multithread rule 
                        will use the max processors designated by -p/--proc
  --make_annotations    This option will only run the pipeline up to the 
                        creation of the repeat annotations
  --comments            If this option is specified then fastq comments (e.g.
                        barcode) will be incorporated to SAM output. Warning:
                        do not use this option if the input fastq files do not
                        have comments

Available methods to use with -m/--methods:
- trimgalore : Runs Trim Galore to QC the fastq file(s) and trim the adaptors prior to running the component methods
- coverage : Estimates copy number based on normalized coverage and creates coverage plots for each TE in the fasta provided by -c/--consensus or -s/--coverage_fasta if provided
- map_reads : Maps the reads to the reference genome. This is useful to ensure the BAM alignment file is produced regardless of whether another method requires it as input
- ngs_te_mapper : Runs the ngs_te_mapper component method
- ngs_te_mapper2: Runs the ngs_te_mapper2 component method
- relocate : Runs the RelocaTE component method
- relocate2 : Runs the RelocaTE2 component method
- temp : Runs the TEMP component method (Paired-End Only)
- temp2 : Runs the TEMP2 component method (Paired-End Only)
- retroseq : Runs the RetroSeq component method (Paired-End Only)
- popoolationte : Runs the PoPoolation TE component method (Paired-End Only)
- popoolationte2 : Runs the PoPoolation TE2 component method (Paired-End Only)
- te-locate : Runs the TE-locate component method (Paired-End Only)
- teflon : Runs the TEFLoN component method (Paired-End Only)
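
When McClintock is launched from another script, the options above can be assembled programmatically. The following is a minimal sketch, not part of McClintock itself: the helper name `build_mcclintock_command` is hypothetical, and it covers only a subset of the documented options.

```python
import shlex


def build_mcclintock_command(reference, consensus, fastq1, fastq2=None,
                             locations=None, taxonomy=None, methods=None,
                             proc=1, out="."):
    """Assemble an argument list for mcclintock.py (hypothetical helper).

    Only flags documented in the usage text are emitted; optional inputs
    are skipped when they are not supplied.
    """
    cmd = ["python3", "mcclintock.py", "-r", reference, "-c", consensus,
           "-1", fastq1]
    if fastq2:
        cmd += ["-2", fastq2]  # omitted for single-end data
    if locations:
        cmd += ["-g", locations]
    if taxonomy:
        cmd += ["-t", taxonomy]
    if methods:
        cmd += ["-m", ",".join(methods)]  # comma-delimited list of methods
    cmd += ["-p", str(proc), "-o", out]
    return cmd


cmd = build_mcclintock_command(
    "test/sacCer2.fasta", "test/sac_cer_TE_seqs.fasta",
    "test/SRR800842_1.fastq.gz", fastq2="test/SRR800842_2.fastq.gz",
    methods=["ngs_te_mapper2", "temp2"], proc=4, out="out")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the mcclintock conda environment is active.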

McClintock Input Files

Warning

Feature names (contig IDs, TE IDs, Family IDs) in input files must not contain any invalid symbols to ensure compatibility with all component methods.
INVALID_SYMBOLS: ; & ( ) | * ? [ ] ~ { } < ! ^ " , \ $ / + - #
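
Feature names can be screened before a run. The sketch below is an illustration rather than McClintock's own validation code: the function names are hypothetical, and it assumes the sequence name is the first whitespace-delimited token of each FASTA header.

```python
# Symbols listed above as invalid in feature names.
INVALID_SYMBOLS = set(';&()|*?[]~{}<!^",\\$/+-#')


def find_invalid_symbols(name):
    """Return the sorted list of invalid symbols that occur in a feature name."""
    return sorted(set(name) & INVALID_SYMBOLS)


def check_fasta_names(fasta_lines):
    """Map each offending FASTA sequence name to the invalid symbols it contains.

    Assumption: the sequence name is the first whitespace-delimited token
    after '>' on a header line.
    """
    problems = {}
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            bad = find_invalid_symbols(name)
            if bad:
                problems[name] = bad
    return problems


# Hypothetical sequence names, for illustration only.
example = [">chrI", "ACGT", ">TY1-LTR", "ACGT", ">TY3_1p", "ACGT"]
print(check_fasta_names(example))
```

The same `find_invalid_symbols` check can be applied to the ID= attributes of the locations GFF and to both columns of the taxonomy file.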

Required

Reference FASTA (-r/--reference)
- The genome sequence of the reference genome in FASTA format. The reads from the FASTQ file(s) will be mapped to this reference genome to predict TE insertions
- example: test/sacCer2.fasta
Consensus FASTA (-c/--consensus)
- A FASTA file containing a consensus sequence for each family
- example: test/sac_cer_TE_seqs.fasta
FASTQ File 1 (-1/--first)
- Either the Read1 FASTQ file from a paired-end sequencing run or the FASTQ file from an unpaired sequencing run

Optional

FASTQ File 2 (-2/--second)
- The Read2 FASTQ file from a paired-end sequencing run. Not required if using unpaired data.
Locations (-g/--locations)
- A GFF file containing the locations of the reference TEs.
- Each annotation should contain an ID= attribute that contains a unique identifier that does not match any other annotation.
- If both the locations GFF and taxonomy TSV are not provided, McClintock will produce them using RepeatMasker with the consensus sequences
- example: test/reference_TE_locations.gff
Taxonomy (-t/--taxonomy)
- A tab-delimited file that maps each unique reference TE to the family it belongs to.
- This file should contain two columns, the first corresponding to the reference TE identifier which should match the ID= attribute from the locations GFF(-g). The second column contains the reference TE's family which should match the name of a sequence in the consensus fasta (-c)
- If both the locations GFF and taxonomy TSV are not provided, McClintock will produce them using RepeatMasker with the consensus sequences
- example: test/sac_cer_te_families.tsv
Coverage FASTA (-s/--coverage_fasta)
- A fasta file of TE sequences to be used for the coverage analysis.
- By default, McClintock estimates the coverage and creates coverage plots of the consensus TE sequences (-c). This option allows you to use a custom set of TEs for the coverage estimations and plots.
Augment FASTA (-a/--augment)
- A FASTA file of TE sequences that will be included as extra chromosomes in the reference genome file (-r)
- Some methods leverage the reference TE sequences to find non-reference TE insertions. The augment FASTA can be used to augment the reference genome with additional TEs that can be used to locate non-reference TE insertions that do not have a representative in the reference genome.
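
The relationships between the locations GFF, the taxonomy file, and the consensus FASTA described above can be checked before a run. This is a minimal sketch under stated assumptions, not McClintock's own validation: the function names and the miniature inputs are hypothetical, and the GFF is assumed to carry its attributes in column 9 as `key=value` pairs separated by semicolons.

```python
def gff_ids(gff_lines):
    """Collect the ID= attribute from column 9 of each GFF feature line."""
    ids = []
    for line in gff_lines:
        columns = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(columns) < 9:
            continue
        for field in columns[8].split(";"):
            key, _, value = field.strip().partition("=")
            if key == "ID":
                ids.append(value)
    return ids


def check_annotation_inputs(gff_lines, taxonomy_lines, consensus_names):
    """Return a list of problems; an empty list means the inputs agree."""
    problems = []
    ids = gff_ids(gff_lines)
    for te_id in sorted(set(ids)):
        if ids.count(te_id) > 1:
            problems.append(f"duplicate GFF ID: {te_id}")
    taxonomy = {}
    for line in taxonomy_lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 2:
            taxonomy[columns[0]] = columns[1]
    for te_id in sorted(set(ids)):
        if te_id not in taxonomy:
            problems.append(f"GFF ID missing from taxonomy: {te_id}")
    for te_id, family in sorted(taxonomy.items()):
        if family not in consensus_names:
            problems.append(f"family not in consensus FASTA: {family} ({te_id})")
    return problems


# Hypothetical miniature inputs, for illustration only.
gff = [
    "chrI\tsrc\tTE\t100\t500\t.\t+\t.\tID=TY1_1\n",
    "chrI\tsrc\tTE\t900\t1200\t.\t-\t.\tID=TY3_1\n",
]
taxonomy = ["TY1_1\tTY1\n", "TY3_1\tTY3\n"]
print(check_annotation_inputs(gff, taxonomy, {"TY1"}))
```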

McClintock Output

The results of McClintock component methods are output to the directory <output>/<sample>/results.

Summary files from the run can be located at <output>/<sample>/results/summary/.
Each component method has raw output files which can be found at <output>/<sample>/results/<method>/unfiltered/.
Raw results are filtered by parameters defined in the <method>_post.py postprocessing configuration files for each method, then standardized into BED and VCF formats. Post-processing config files can be found in /path/to/mcclintock/config/<method> and can be modified if you want to adjust default filtering parameters.
Standardized results in BED format contain non-reference and (when available for a method) reference TE predictions and can be found in <output>/<sample>/results/<method>/*.bed, where <output>/<sample>/results/<method>/*_nonredundant.bed has any redundant predictions removed.
Standardized results in VCF format contain only non-reference TE predictions and can be found in <output>/<sample>/results/<method>/*_nonredundant_non-reference_*.vcf. Two types of VCF are supported: (i) VCF files with "site-only" information (*_nonredundant_non-reference_siteonly.vcf) and (ii) VCF files that contain a "sample" column (*_nonredundant_non-reference_sample.vcf). The position of TE insertion variants in VCF files corresponds to the start position of predicted intervals in nonredundant BED files. Note that the genotype (GT) field of *_nonredundant_non-reference_sample.vcf only indicates the presence/absence of a non-reference TE insertion variant call in the sample, and does not contain information about the ploidy or zygosity of the variant in the sample.
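
The directory layout above makes it straightforward to collect the standardized BED files from a finished run. The following sketch is illustrative only: the function names are hypothetical, and it assumes that non-reference predictions carry the string "non-reference" in the BED name column, which should be confirmed against real output.

```python
from pathlib import Path


def count_predictions(bed_lines):
    """Count reference and non-reference predictions in BED lines.

    Assumption: non-reference predictions carry the string "non-reference"
    in the BED name column (column 4); everything else is treated as reference.
    """
    counts = {"reference": 0, "non-reference": 0}
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip track/comment headers and anything that is not a full BED record.
        if line.startswith(("track", "#")) or len(fields) < 4:
            continue
        kind = "non-reference" if "non-reference" in fields[3] else "reference"
        counts[kind] += 1
    return counts


def summarize_run(output_dir, sample):
    """Count predictions in every <method>/*_nonredundant.bed of one run."""
    results = Path(output_dir) / sample / "results"
    summary = {}
    for bed in sorted(results.glob("*/*_nonredundant.bed")):
        with open(bed) as handle:
            summary[bed.parent.name] = count_predictions(handle)
    return summary


# Hypothetical BED lines; the name format shown here is illustrative.
example = [
    "track name=example\n",
    "chrI\t99\t104\tTY1_non-reference_example\t0\t+\n",
    "chrI\t500\t900\tTY3_reference_example\t0\t-\n",
]
print(count_predictions(example))
```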

HTML Summary Report: `<output>/<sample>/results/summary/`

McClintock generates an interactive HTML summary report that contains information on how the run was executed, read mapping information, QC information, and a summary of component method predictions. <output>/<sample>/results/summary/summary.html
This page also links to the pages that summarize the predictions from each method: all predictions by method, predictions for each family, predictions for each contig. <output>/<sample>/results/summary/html/<method>.html
The HTML report also summarizes reference and non-reference predictions for all families. <output>/<sample>/results/summary/html/families.html
A page is also generated for each family, which summarizes the coverage for the family consensus sequence and the family-specific predictions from each component method. <output>/<sample>/results/summary/html/<family>.html

Raw Summary files : `<output>/<sample>/results/summary/`

<output>/<sample>/results/summary/data/run/summary_report.txt : Summary Report of McClintock run. Contains information on the McClintock command used, when and where the script was run, details about the mapped reads, and table that shows the number of TE predictions produced from each method.
<output>/<sample>/results/summary/data/run/te_prediction_summary.txt : A comma-delimited table showing reference and non-reference predictions for each component method
<output>/<sample>/results/summary/data/families/family_prediction_summary.txt : a comma-delimited table showing TE predictions (all, reference, non-reference) from each method for each TE family
<output>/<sample>/results/summary/data/coverage/te_depth.txt : (Only produced if coverage module is run) a comma-delimited table showing normalized depth for each consensus TE or TE provided in coverage fasta.
All tables and plots contain a link to the raw data so that users can manually filter or visualize it with other programs.

TrimGalore : `<output>/<sample>/results/trimgalore/`

<fastq>_trimming_report.txt : Information on parameters used and statistics related to adapter trimming with cutadapt. Provides an overview of sequences removed via the adapter trimming process.
<fastq>_fastqc.html : FastQC report of the trimmed fastq files. Provides information on the results of steps performed by FastQC to assess the quality of the trimmed reads.
<fastq>_fastqc.zip : FastQC graph images and plain-text summary statistics compressed into a single .zip file

Coverage : `<output>/<sample>/results/coverage/`

plots/*.png : Coverage plots showing the normalized read coverage across each TE either from the consensus fasta (-c) or the coverage fasta (-s) if provided. Coverage of uniquely mapping reads (MAPQ > 0) is in dark gray, while coverage of all reads (MAPQ >= 0) is in light gray. Raw coverage at each position in a TE is normalized to the average mapping depth at unique regions of the hard-masked reference genome. The average normalized coverage is shown as a black line, and is estimated from the central region of each TE omitting regions at the 5' and 3' ends equal to the average read length to prevent biases due to mapping at TE edges.
te-depth-files/*.allQ.cov : Raw read coverage at each position in a TE sequence. (Output of samtools depth)
te-depth-files/*.highQ.cov : coverage of mapped reads with MAPQ > 0 at each position, omitting multi-mapped reads.
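
The normalization described for the coverage plots can be sketched as follows. This is a simplified illustration rather than McClintock's implementation: the function names are hypothetical, the depth file is assumed to list every position of the TE in three whitespace-separated columns (name, position, depth), and the genome-wide depth at unique regions is taken as a given number.

```python
def parse_depth(lines):
    """Parse samtools-depth style lines (name, 1-based position, depth)."""
    depths = []
    for line in lines:
        fields = line.split()
        if len(fields) == 3:
            depths.append(int(fields[2]))
    return depths


def average_normalized_coverage(depths, read_length, genome_depth):
    """Mean depth over the central region of a TE, divided by genome depth.

    The first and last `read_length` positions are omitted to avoid edge
    effects from reads mapping at the TE boundaries.
    """
    if genome_depth <= 0:
        raise ValueError("genome_depth must be positive")
    central = depths[read_length:len(depths) - read_length]
    if not central:
        raise ValueError("TE is too short to trim read_length from both ends")
    return sum(central) / len(central) / genome_depth


# Hypothetical depth values for an 8 bp sequence, for illustration only.
cov_lines = [f"TY1\t{pos}\t{depth}"
             for pos, depth in enumerate([5, 5, 20, 20, 20, 20, 5, 5], start=1)]
depths = parse_depth(cov_lines)
print(average_normalized_coverage(depths, read_length=2, genome_depth=10))
```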

ngs_te_mapper : `<output>/<sample>/results/ngs_te_mapper/`

unfiltered/<sample>_insertions.bed : BED file containing raw 0-based intervals corresponding to TSDs for non-reference predictions and 0-based intervals corresponding to the reference TEs. Reference TE intervals are inferred from the data, not from the reference TE annotations. Strand information is present for both non-reference and reference TEs.
<sample>_ngs_te_mapper_nonredundant.bed : BED file containing 0-based intervals corresponding to TSDs for non-reference predictions and 0-based intervals corresponding to the reference TEs. This file contains the same predictions from unfiltered/<sample>_insertions.bed with the BED line name adjusted to match the standard McClintock naming convention. By default, no filtering is performed on the raw ngs_te_mapper predictions aside from removing redundant predictions. However, the config file: (/path/to/mcclintock/config/ngs_te_mapper/ngs_te_mapper_post.py) can be modified to increase the minimum read support threshold if desired.

ngs_te_mapper2 : `<output>/<sample>/results/ngs_te_mapper2/`

unfiltered/<sample>.nonref.bed: BED file containing raw 0-based intervals corresponding to the 5' and 3' breakpoints for non-reference predictions.
unfiltered/<sample>.ref.bed: BED file containing raw 0-based intervals corresponding to the reference TE annotations predicted by ngs_te_mapper2
<sample>_ngs_te_mapper2_nonredundant.bed: BED file containing predictions from unfiltered/<sample>.nonref.bed and unfiltered/<sample>.ref.bed with BED line names matching the standard McClintock naming convention.

PoPoolationTE : `<output>/<sample>/results/popoolationTE/`

unfiltered/te-poly-filtered.txt : Tab-delimited table with non-reference and reference TE predictions and support values. Predictions are annotated as 1-based intervals on either end of the predicted insertion, and also as a midpoint between the inner coordinates of the two terminal spans (which can lead to half-base midpoint coordinates)
<sample>_popoolationte_nonredundant.bed : BED file containing only TE predictions with read support on both ends ("FR") and with percent read support >0.1 for both ends. The entire interval between the inner coordinates of the two terminal spans (not the midpoint) is converted to 0-based coordinates. Filtering parameters can be modified in the config file: (/path/to/mcclintock/config/popoolationte/popoolationte_post.py)
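
The difference between the midpoint and the interval between the inner coordinates can be shown with a small worked example. This sketch rests on assumptions: the function name is hypothetical, each terminal span is treated as a 1-based inclusive (start, end) pair, and the inner coordinates are taken to be the end of the left span and the start of the right span.

```python
def popoolationte_interval(left_span, right_span):
    """Return the midpoint and a 0-based BED interval between two spans.

    left_span and right_span are 1-based inclusive (start, end) tuples
    flanking the predicted insertion.
    """
    inner_left = left_span[1]    # end of the left terminal span
    inner_right = right_span[0]  # start of the right terminal span
    midpoint = (inner_left + inner_right) / 2
    # 1-based inclusive -> 0-based half-open: shift only the start.
    bed_interval = (inner_left - 1, inner_right)
    return midpoint, bed_interval


# Hypothetical spans; an odd sum of inner coordinates gives a half-base midpoint.
print(popoolationte_interval((100, 150), (155, 200)))
```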

PoPoolationTE2 : `<output>/<sample>/results/popoolationTE2/`

unfiltered/teinsertions.txt : Tab-delimited table with TE predictions and TE frequency values (ratio of physical coverage supporting a TE insertion to the total physical coverage). PoPoolationTE2 does not indicate which predictions are reference and non-reference TEs. Also, only a single position is reported for each prediction, so the TSD is not predicted. Predictions may only have support from one side of the junction ("F" or "R") or both sides ("FR"). Prediction coordinates are 1-based.
<sample>_popoolationte2_nonredundant.bed : BED file containing all of the predictions from unfiltered/teinsertions.txt that have support on both ends ("FR") and have a frequency >0.1. The filtering criteria can be modified in the config file: (/path/to/mcclintock/config/popoolationte2/popoolationte2_post.py). If predictions overlap a TE in the reference genome, that reference TE is reported in this file using the positions of the reference TE annotation (not the position reported by PoPoolationTE2). If the prediction does not overlap a reference TE, it is designated a non-reference TE insertion (_non-reference_). The coordinates for all predictions are adjusted to be 0-based.

RelocaTE : `<output>/<sample>/results/relocaTE/`

unfiltered/combined.gff : GFF containing 1-based TSDs for non-reference predictions and 1-based intervals for reference TEs. The reference intervals are based on the reference TE annotations.
<sample>_relocate_nonredundant.bed : BED file containing predictions from unfiltered/combined.gff converted into 0-based intervals with BED line names matching the standard McClintock naming convention. By default, no filtering is performed on the raw predictions aside from removing redundant predictions. However, the config file: (/path/to/mcclintock/config/relocate/relocate_post.py) can be modified to increase the minimum left and right prediction support thresholds for both reference and non-reference predictions.
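
Several component methods report 1-based coordinates that are converted to 0-based BED intervals. The sketch below shows the standard conversion from a 1-based inclusive GFF feature to a 0-based half-open BED line; it is not McClintock's post-processing code, the function name is hypothetical, and the BED name here is simply the GFF ID rather than the McClintock naming convention.

```python
def gff_to_bed(gff_line):
    """Convert one GFF feature line to a six-column BED line.

    GFF coordinates are 1-based and inclusive; BED coordinates are 0-based
    and half-open, so only the start is shifted by one.
    """
    columns = gff_line.rstrip("\n").split("\t")
    chrom, strand = columns[0], columns[6]
    start, end = int(columns[3]), int(columns[4])
    name = "."
    for field in columns[8].split(";"):
        key, _, value = field.strip().partition("=")
        if key == "ID":
            name = value
    return "\t".join([chrom, str(start - 1), str(end), name, "0", strand])


# Hypothetical GFF record describing a 5 bp TSD, for illustration only.
gff_line = "chrI\tRelocaTE\tinsertion_site\t1001\t1005\t.\t+\t.\tID=example_1"
print(gff_to_bed(gff_line))
```

The interval length is preserved: the GFF feature spans 1005 - 1001 + 1 = 5 bases and the BED interval spans 1005 - 1000 = 5 bases.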

RelocaTE2 : `<output>/<sample>/results/relocaTE2/`

unfiltered/repeat/results/ALL.all_ref_insert.gff : GFF file containing reference TE predictions with 1-based coordinates. The final column also contains read counts supporting the junction (split-read) and read counts supporting the insertion (read pair).
unfiltered/repeat/results/ALL.all_nonref_insert.gff : GFF file containing non-reference TE predictions with 1-based coordinates. The final column also contains read counts supporting the junction (split-read) and read counts supporting the insertion (read pair).
<sample>_relocate2_nonredundant.bed : BED file containing all reference and non-reference predictions from ALL.all_ref_insert.gff and ALL.all_nonref_insert.gff. Coordinates are adjusted to be 0-based. By default, no filtering is performed on split-read and read-pair evidence. However, the config file: (/path/to/mcclintock/config/relocate2/relocate2_post.py) can be modified to increase the default threshold for these values.

RetroSeq : `<output>/<sample>/results/retroseq/`

unfiltered/<sample>.call : VCF file containing non-reference TE predictions. Non-reference TEs are annotated as 1-based intervals in the POS column and two consecutive coordinates in the INFO field. No predictions are made for reference TEs. Strand information is not provided.
<sample>_retroseq_nonredundant.bed : BED file containing non-reference TE predictions from unfiltered/<sample>.call. Only predictions with a breakpoint confidence threshold of >6 are retained in this file. This filtering threshold can be changed by modifying the config file: (/path/to/mcclintock/config/retroseq/retroseq_post.py). The position interval reported in the INFO column of the unfiltered/<sample>.call VCF file is converted to 0-based positions and used as the start and end positions in the BED lines in this file.

TEbreak : `<output>/<sample>/results/tebreak/`

unfiltered/<sample>.sorted.tebreak.table.txt : Tab-delimited table containing non-reference TE predictions with 0-based coordinates. No predictions are made for reference TEs. Strand information is provided.
<sample>_tebreak_nonredundant.bed : BED file containing non-reference TE predictions from unfiltered/<sample>.sorted.tebreak.table.txt with no additional filtering by default. Filtering can be changed by modifying the config file: (/path/to/mcclintock/config/tebreak/tebreak_post.py).

TEMP : `<output>/<sample>/results/TEMP/`

unfiltered/<sample>.absence.refined.bp.summary : Tab-delimited table containing reference TEs that are predicted to be absent from the short read data. Position intervals are 1-based.
unfiltered/<sample>.insertion.refined.bp.summary : Tab-delimited table containing non-reference TE predictions. Position intervals are 1-based.
<sample>_temp_nonredundant.bed : BED file containing all reference TEs not reported as absent by TEMP in the unfiltered/<sample>.absence.refined.bp.summary file. Also contains non-reference TE predictions from unfiltered/<sample>.insertion.refined.bp.summary formatted as a BED line using the McClintock naming convention. Positions for both reference and non-reference predictions are 0-based. Non-reference predictions from unfiltered/<sample>.insertion.refined.bp.summary are only added to this file if the prediction has read support on both ends ("1p1") and has a sample frequency of > 10%. These filtering restrictions can be modified in the config file: (/path/to/mcclintock/config/TEMP/temp_post.py). Non-reference TEs with split-read support at both ends are marked in the BED line name with "sr" and the Junction1 and Junction2 columns from unfiltered/<sample>.insertion.refined.bp.summary are used as the start and end positions of the TSD in this file (converted to 0-based positions). If the non-reference TE prediction does not have split-read support on both ends of the insertion, the designation "rp" is added to the BED line name and the Start and End columns from unfiltered/<sample>.insertion.refined.bp.summary are used as the start and end positions of the TSD in this file (converted to 0-based). Note: TEMP reference insertions are labeled nonab in the BED line name since they are inferred by no evidence of absence, in contrast to reference insertions detected by other components that are inferred from direct evidence of their presence.

TEMP2 : `<output>/<sample>/results/temp2/`

unfiltered/<sample>.absence.refined.bp.summary : Tab-delimited table containing reference TEs that are predicted to be absent from the short read data. Position intervals are 1-based.
unfiltered/<sample>.insertion.bed: BED file containing 0-based coordinates of the non-reference predictions.
<sample>_temp2_nonredundant.bed: BED file containing all reference TEs not reported in the unfiltered/<sample>.absence.refined.bp.summary. Also contains non-reference TE predictions from unfiltered/<sample>.insertion.bed. Non-reference predictions from unfiltered/<sample>.insertion.bed are only added to this file if the prediction has read support on both ends ("1p1") and has a sample frequency of > 10%. These filtering restrictions can be modified in the config file: (/path/to/mcclintock/config/temp2/temp2_post.py).

TE-Locate : `<output>/<sample>/results/te-locate/`

unfiltered/te-locate-raw.info : A tab-delimited table containing reference ("old") and non-reference ("new") predictions using 1-based positions. TSD intervals are not predicted for non-reference TEs, instead a single position is reported.
<sample>_telocate_nonredundant.bed : BED file containing all reference and non-reference predictions from unfiltered/te-locate-raw.info. Coordinates for both reference and non-reference TE predictions are converted to a 0-based interval. The reference TE end position is extended by the len column in unfiltered/te-locate-raw.info. Non-reference TE predictions are a single position as TE-Locate does not predict the TSD size.

TEFLoN : `<output>/<sample>/results/teflon/`

unfiltered/genotypes/sample.genotypes.txt : A tab-delimited table containing all of the breakpoints and support information for insertion predictions. Predictions are treated as reference predictions if they contain a TE ID in column 7.

# from: https://github.com/jradrion/TEFLoN
C1: chromosome
C2: 5' breakpoint estimate ("-" if estimate not available)
C3: 3' breakpoint estimate ("-" if estimate not available)
C4: search level id (Usually TE family)
C5: cluster level id (Usually TE order or class)
C6: strand ("." if strand could not be detected)
C7: reference TE ID ("-" if novel insertion)
C8: 5' breakpoint is supported by soft-clipped reads (if TRUE "+" else "-")
C9: 3' breakpoint is supported by soft-clipped reads (if TRUE "+" else "-")
C10: read count for "presence reads"
C11: read count for "absence reads"
C12: read count for "ambiguous reads"
C13: genotype for every TE (allele frequency for pooled data, present/absent for haploid, present/absent/heterozygous for diploid) #Note: haploid/diploid caller is under construction, use "pooled" for presence/absence read counts
C14: numbered identifier for each TE in the population

<sample>_teflon_nonredundant.bed : BED file containing all reference and non-reference predictions from unfiltered/genotypes/sample.genotypes.txt. Reference predictions use the coordinates for the TE with the reference ID from column 7. By default, only non-reference predictions with both breakpoints (C2 and C3) are kept in this file. Non-reference predictions must also have at least 3 presence reads (C10) and an allele frequency greater than 0.1 (C13). These filtering restrictions can be changed by modifying the TEFLoN config file: /path/to/mcclintock/config/teflon/teflon_post.py
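
The default TEFLoN filter described above can be expressed against the column layout C1 to C14. This is a sketch of the stated criteria, not the code in teflon_post.py: the function name and example lines are hypothetical, the genotype column (C13) is assumed to hold a pooled allele frequency, and reference predictions are assumed to pass without further filtering.

```python
def keep_teflon_prediction(line, min_presence=3, min_freq=0.1):
    """Apply the default filter described above to one genotypes.txt line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 14:
        return False
    if cols[6] != "-":
        return True  # C7 holds a reference TE ID: treated as a reference call
    # Non-reference calls need both breakpoint estimates (C2 and C3).
    if cols[1] == "-" or cols[2] == "-":
        return False
    try:
        presence = int(cols[9])   # C10: presence read count
        freq = float(cols[12])    # C13: allele frequency (pooled mode assumed)
    except ValueError:
        return False
    return presence >= min_presence and freq > min_freq


# Hypothetical lines following the C1-C14 layout, for illustration only.
novel_ok = "chrI\t1000\t1005\tTY1\tLTR\t+\t-\t+\t+\t12\t3\t1\t0.8\t1"
novel_one_bp = "chrI\t2000\t-\tTY1\tLTR\t.\t-\t+\t-\t12\t3\t1\t0.8\t2"
novel_low = "chrI\t3000\t3004\tTY3\tLTR\t+\t-\t+\t+\t2\t9\t0\t0.18\t3"
reference = "chrI\t4000\t4500\tTY3\tLTR\t-\tTY3_1\t-\t-\t0\t0\t0\t1.0\t4"
for record in (novel_ok, novel_one_bp, novel_low, reference):
    print(keep_teflon_prediction(record))
```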

Run Examples

Running McClintock with test data

This repository also provides test data to ensure your McClintock installation is working. Test data can be found in the test/ directory which includes a yeast reference genome (UCSC sacCer2) and an annotation of TEs in this version of the yeast genome from Carr et al. (2012). A pair of fastq files can be downloaded from SRA using the test/download_test_data.py script:

python3 test/download_test_data.py

Once the fastq files have been downloaded, McClintock can be run on the test data as follows:

python3 mcclintock.py \
    -r test/sacCer2.fasta \
    -c test/sac_cer_TE_seqs.fasta \
    -g test/reference_TE_locations.gff \
    -t test/sac_cer_te_families.tsv \
    -1 test/SRR800842_1.fastq.gz \
    -2 test/SRR800842_2.fastq.gz \
    -p 4 \
    -o /path/to/output/directory

Change /path/to/output/directory to a real path where you want the McClintock output to be written.
You can also increase -p 4 to a higher number if you have more CPU threads available.
A working installation of McClintock applied to the test data should yield the following results in /path/to/output/directory/SRR800842_1/results/summary/data/run/summary_report.txt:

----------------------------------
MAPPED READ INFORMATION
----------------------------------
read1 sequence length:  94
read2 sequence length:  94
read1 reads:            18547818
read2 reads:            18558408
median insert size:     302
avg genome coverage:    268.185
----------------------------------

-----------------------------------------------------
METHOD          ALL       REFERENCE    NON-REFERENCE 
-----------------------------------------------------
ngs_te_mapper   35        21           14            
ngs_te_mapper2  87        49           38            
relocate        80        63           17            
relocate2       139       41           98            
temp            365       311          54            
temp2           367       311          56            
retroseq        58        0            58            
popoolationte   141       130          11            
popoolationte2  186       164          22            
te-locate       713       164          549           
teflon          414       390          24            
tebreak         60        0            60            
-----------------------------------------------------

NOTE: popoolationte and popoolationte2 exhibit run-to-run variation so numbers for these methods will differ slightly on replicate runs of the test data.
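
To compare your own summary_report.txt against the table above, the per-method counts can be read into a dictionary. The sketch below assumes the report keeps the whitespace-separated four-column layout shown here; the function names parse_method_table and inconsistent_methods are hypothetical helpers, not part of McClintock.

```python
def parse_method_table(report_text):
    """Extract {method: {"all", "reference", "non_reference"}} from the
    METHOD table of a summary report (layout assumed from the example above)."""
    counts = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Table rows have a method name followed by exactly three integers;
        # header, separator and mapped-read lines do not match this shape.
        if len(fields) == 4 and all(f.isdigit() for f in fields[1:]):
            counts[fields[0]] = {
                "all": int(fields[1]),
                "reference": int(fields[2]),
                "non_reference": int(fields[3]),
            }
    return counts


def inconsistent_methods(counts):
    """List methods whose ALL column is not REFERENCE + NON-REFERENCE."""
    return [
        method
        for method, c in counts.items()
        if c["all"] != c["reference"] + c["non_reference"]
    ]
```

When comparing against the expected values, it is probably sensible to skip popoolationte and popoolationte2, given the run-to-run variation noted above.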

Running McClintock with specific component methods

By default, McClintock runs all components (all 12 TE detection methods plus the coverage module using the output of the trimgalore method).
If you only want to run a specific component method, you can use the -m flag to specify which method to run:

python3 mcclintock.py \
    -r test/sacCer2.fasta \
    -c test/sac_cer_TE_seqs.fasta \
    -g test/reference_TE_locations.gff \
    -t test/sac_cer_te_families.tsv \
    -1 test/SRR800842_1.fastq.gz \
    -2 test/SRR800842_2.fastq.gz \
    -p 4 \
    -m temp \
    -o /path/to/output/directory

You can also specify an arbitrary set of multiple component methods to run using a comma-separated list of the methods after the -m flag:

python3 mcclintock.py \
    -r test/sacCer2.fasta \
    -c test/sac_cer_TE_seqs.fasta \
    -g test/reference_TE_locations.gff \
    -t test/sac_cer_te_families.tsv \
    -1 test/SRR800842_1.fastq.gz \
    -2 test/SRR800842_2.fastq.gz \
    -p 4 \
    -m trimgalore,temp,ngs_te_mapper,retroseq \
    -o /path/to/output/directory

Note: if the -m option is set, you must specify the trimgalore method explicitly if you want the other component methods to use trimmed reads as input.

Running McClintock with multiple samples using same reference genome

When running McClintock on multiple samples that use the same reference genome and consensus TEs, it is advised to generate the reference TE annotation GFF and TE Taxonomy TSV files as a pre-processing step. Otherwise, these files will be created by McClintock for each sample, which can lead to increased run time and disk usage.
To create the reference TE annotation GFF and TE Taxonomy TSV files, you can run McClintock with the --make_annotations flag, which will use RepeatMasker to produce only these files, then exit:

python3 mcclintock.py \
    -r test/sacCer2.fasta \
    -c test/sac_cer_TE_seqs.fasta \
    -p 4 \
    -o <output> \
    --make_annotations

The locations of the reference TE annotation GFF and TE Taxonomy TSV files generated using --make_annotations are as follows:
- Reference TE locations GFF: <output>/<reference_name>/reference_te_locations/unaugmented_inrefTEs.gff
- TE Taxonomy TSV: <output>/<reference_name>/te_taxonomy/unaugmented_taxonomy.tsv
You can then use the --resume flag for future runs with the same reference genome and output directory without having to redundantly generate these files for each run:

python3 mcclintock.py \
    -r test/sacCer2.fasta \
    -c test/sac_cer_TE_seqs.fasta \
    -1 /path/to/sample1_1.fastq.gz \
    -2 /path/to/sample1_2.fastq.gz \
    -p 4 \
    -o <output> \
    --resume

python3 mcclintock.py \
    -r test/sacCer2.fasta \
    -c test/sac_cer_TE_seqs.fasta \
    -1 /path/to/sample2_1.fastq.gz \
    -2 /path/to/sample2_2.fastq.gz \
    -p 4 \
    -o <output> \
    --resume

## etc ##

Individual samples can be run in a serial manner as shown in the example above, or run in parallel, such as through separate jobs on an HPC cluster.
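
For many samples, the per-sample commands can be generated rather than written by hand. The following is a minimal sketch that only builds the argument lists using the flags shown above (-r, -c, -1, -2, -p, -o, --resume); the function name build_resume_commands and the sample paths are hypothetical, and how the commands are launched (serially or as cluster jobs) is left to the user.

```python
import shlex


def build_resume_commands(reference, consensus, samples, out_dir, threads=4):
    """Build one McClintock --resume command per (fastq1, fastq2) pair.

    Hypothetical helper: it only assembles argument lists and runs nothing.
    """
    commands = []
    for fq1, fq2 in samples:
        commands.append([
            "python3", "mcclintock.py",
            "-r", reference,
            "-c", consensus,
            "-1", fq1,
            "-2", fq2,
            "-p", str(threads),
            "-o", out_dir,
            "--resume",
        ])
    return commands


if __name__ == "__main__":
    # Placeholder sample paths for illustration only.
    pairs = [
        ("/path/to/sample1_1.fastq.gz", "/path/to/sample1_2.fastq.gz"),
        ("/path/to/sample2_1.fastq.gz", "/path/to/sample2_2.fastq.gz"),
    ]
    for cmd in build_resume_commands(
        "test/sacCer2.fasta", "test/sac_cer_TE_seqs.fasta", pairs, "<output>"
    ):
        print(shlex.join(cmd))  # shell-quoted command line, ready for a job script
```

Each list can be passed to subprocess.run for serial execution, or the printed lines can be written into separate job scripts for a scheduler.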

Citation

To cite McClintock 1, the general TE detector meta-pipeline concept, or the single synthetic insertion simulation framework, please use: Nelson, M.G., R.S. Linheiro & C.M. Bergman (2017) McClintock: An integrated pipeline for detecting transposable element insertions in whole genome shotgun sequencing data. G3. 7:2763-2778.

To cite McClintock 2 or the reproducible simulation system, please use: Chen, J., P.J. Basting, S. Han, D.J. Garfinkel & C.M. Bergman (2023) Reproducible evaluation of transposable element detectors with McClintock 2 guides accurate inference of Ty insertion patterns in yeast. Mobile DNA 14:8.

License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

mcclintock's Issues

Investigate adding TEPID

paper is here: https://elifesciences.org/content/5/e20777
code is here: https://github.com/ListerLab/TEPID

Investigate adding discord-retro to the pipeline

https://github.com/adamewing/discord-retro

Improve annotation of TSDs in RelocaTE canonical TE fasta

TSD=... is required for RelocaTE input.
Currently, McClintock automatically annotates all TE fasta headers to be TSD=UNK
User may have file with TSDs curated already, so need to add option to not edit canonical TE fasta for input to RelocaTE.
Could also use results from ngs_te_mapper to find modal TSD length to supply to RelocaTE

Investigate adding TE-tracker

code is here: http://www.genoscope.cns.fr/externe/tetracker/
paper is here: http://www.biomedcentral.com/1471-2105/15/377/abstract

Any chance RelocaTE could be replaced with RelocaTE2

Hi Bergman Lab folks,
Thanks for developing McClintock! I'm in the process of testing it for use in my research. While I'm finding it useful, it also looks like RelocaTE is prohibitively slow for moderately sized datasets. Is there any chance that you could replace RelocaTE with the much faster RelocaTE2 (or simply add it to the workflow)?

See RelocaTE2 on Github
Also, here is the RelocaTE2 paper.

Just a thought.
Thanks!
Dave

Investigate adding Jitterbug to pipeline

Paper is here: http://www.biomedcentral.com/1471-2164/16/768
Code is here: http://sourceforge.net/projects/jitterbug/

RepeatMasker modifies some TE names

The RepeatMasker script ProcessRepeats overrides the custom TE library naming in some instances.
In D. melanogaster, this appears to involve only the pogo and McClintock elements.
- pogo: most hits are named pogo, but a few are renamed to POGON1
- McClintock: most hits are named McClintock, but a minority of hits are named McClintock-int
The rules for renaming TE hits are hard-coded in the RepeatAnnotationData.pm script. The rules seem to be based on the consensus coordinates of the TE. e.g.:

    'pogo' => {
                'equiv' => [
                             {
                               'name'   => 'POGON1',
                               'ranges' => [
                                             {
                                               'div'     => '0.00',
                                               'score'   => '85',
                                               'eqend'   => 88,
                                               'eqstart' => 1,
                                               'end'     => 88,
                                               'start'   => 1
                                             },
                                             {
                                               'div'     => '0.00',
                                               'score'   => '86',
                                               'eqend'   => 186,
                                               'eqstart' => 98,
                                               'end'     => 2121,
                                               'start'   => 2033
                                             }
                               ]
                             }
                ],
                'subtype'   => 'TcMar-Pogo',
                'conlength' => '2121',
                'type'      => 'DNA'
    },

The addition of -int to a TE name is performed by the ProcessRepeats script of RepeatMasker. This is likely a way of accounting for highly divergent internal regions of LTR TEs when producing a summary of the total number of elements.
Here is the block of code in ProcessRepeats that renames the McClintock element.

# Now change the names
        while ( $nextInChain ) {
          if ( $currentAnnot->getClassName() =~ /LTR/ ) {
            if ( isInternal( $nextInChain ) ) {
              if (    $newLTRName
                   && $nextInChain->getClassName() =~ /ERVL-MaLR/ )
              {

                # Name after LTR
                if ( $newLTRName !~ /-I|-int/ ) {
                  $nextInChain->setHitName( $newLTRName . "-int" );
                }
                else {
                  $nextInChain->setHitName( $newLTRName );
                }
              }
              elsif ( $newLTRIntName ) {

                # Name after highest scoring internal
                if ( $newLTRIntName !~ /-I|-int/ ) {
                  $nextInChain->setHitName( $newLTRIntName . "-int" );
                }
                else {
                  $nextInChain->setHitName( $newLTRIntName );
                }

              }
            }
            else {
              $nextInChain->setHitName( $newLTRName );
            }
          }
          else {
            $nextInChain->setSubjName( $newLINEName );
          }
          if ( $DEBUG ) {
            print "Fixing conPosCorrection(*$highestConCorr): ";
            $nextInChain->print();
          }

McClintock's current fix for this involves renaming these elements back to their original names in the TE consensus file.
https://github.com/bergmanlab/mcclintock/blob/master/scripts/mcclintock.sh

# RepeatMasker appears to override the custom database names during the ProcessRepeats step so this changes them back for
# Drosophila, more rules like this may be needed for other reference genomes
sed "s/McClintock-int/McClintock/g" $reference_genome".out.gff" > $referencefolder/tmp
sed "s/POGON1/pogo/g" $referencefolder/tmp > $reference_genome".out.gff"

This ensures that the RepeatMasker result contains naming identical to the custom library provided.
A similar solution might be necessary when analyzing other genomes.
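
For other genomes, the two sed substitutions above could be generalized to a lookup table. The sketch below is a Python equivalent under that assumption; the function name restore_te_names and the RENAMED_TO_ORIGINAL table are hypothetical, and the table holds only the two D. melanogaster cases reported in this issue.

```python
import re

# RepeatMasker name -> name used in the custom TE library.
# Only the two cases reported above; extend for other genomes as needed.
RENAMED_TO_ORIGINAL = {
    "McClintock-int": "McClintock",
    "POGON1": "pogo",
}


def restore_te_names(gff_text, renamed_to_original=RENAMED_TO_ORIGINAL):
    """Replace every renamed TE name with its original library name.

    Like the sed commands, this is a plain substring substitution. All names
    are handled in a single pass, longest first, so that one replacement
    cannot be re-matched by another rule.
    """
    names = sorted(renamed_to_original, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda m: renamed_to_original[m.group(0)], gff_text)
```

Because it mirrors sed's substring behaviour, a TE whose name merely contains one of the keys would also be rewritten, so the table should be reviewed against the TE library before use.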

Genotype or merge TIPs across samples?

Hi @cbergman
Thanks for developing such a great pipeline to detect TE insertions. I found that the pipeline can only detect TIPs in one sample at a time. So after all samples are finished, is there any method to genotype or merge all the TIPs across samples into a bed file to facilitate downstream analysis?
Regards,
Tao

Standardize output files and directory names in test.sh

change popoolationte directory from SRR800842_1 -> SRR800842
sort bed files by chr and start columns so different methods make predictions in same order
change ngs_te_mapper family name from e.g. TY1#LTR_Copia_new -> TY1_new
change relocate family name from e.g. TY1#LTR_Copia_new -> TY1_new
change RetroSeq family name from e.g. TY4#LTR_Copia==SRR800842.bam -> TY4_new
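
The sorting and renaming steps listed above might look like the following in Python. This is a sketch under two assumptions that are not stated in the issue: the family is taken to be the text before the first "#" in the name, and the BED input contains only tab-delimited data lines. Both function names are hypothetical.

```python
def standardize_family_name(name, suffix="new"):
    """Reduce a method-specific name to <family>_<suffix>.

    Assumes the family is everything before the first '#', which matches
    the three examples listed above.
    """
    family = name.split("#", 1)[0]
    return f"{family}_{suffix}"


def sort_bed_lines(lines):
    """Sort BED data lines by chromosome name, then by numeric start."""
    records = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    records.sort(key=lambda fields: (fields[0], int(fields[1])))
    return ["\t".join(fields) for fields in records]
```

Chromosomes are ordered as plain strings here, which is enough to make different methods agree with each other even though it is not a natural chromosome order.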

Compile papers using McClintock pipeline

mcclintock.sh Segmentation fault

I am trying to use the mcclintock tool for TE analysis, but I am receiving the errors below.

This is my command:

~/applications/mcclintock/mcclintock/mcclintock.sh -r SpeciesA.genome.fa -c SpeciesA.consensi.fa.classified -1 DNAreads_1.fastq -I

I am using an HPC system, so I don’t have memory problems.

Errors are:

Running McClintock version: b9d86c1

Date of run is 04_03_20

Creating directory structure...

~/applications/mcclintock/mcclintock/mcclintock.sh: line 216: 54800 Segmentation fault (core dumped) samtools faidx $referencefolder/$reference_genome_file
cp: cannot stat ‘DNAreads_1.fastq’: No such file or directory
cp: cannot stat ‘DNAreads_1.fastq’: No such file or directory
RepeatMasker::setspecies: Could not find user specified library /lustre6/home/lustre2/user1/Projects/Species/Active_Repeats/SpeciesA/reference/consensi.fa.classified.
sed: can't read /lustre6/home/lustre2/user1/Projects/Species/Active_Repeats/SpeciesA/reference/SpeciesA.genome.fa.out.gff: No such file or directory
index file /lustre6/home/lustre2/user1/Projects/Species/Active_Repeats/SpeciesA/reference/SpeciesA.genome.fa.fai not found, generating...
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check
/home/user1/applications/mcclintock/mcclintock/mcclintock.sh: line 488: 74803 Aborted (core dumped) bedtools getfasta -name -fi $reference_genome -bed $te_locations -fo $referencefolder/reference_te_seqs.fasta
rm: cannot remove ‘/lustre6/home/lustre2/user1/Projects/Species/Active_Repeats/SpeciesA/reference/SpeciesA.genome.fa.fai’: No such file or directory

Performing FastQC analysis...

Skipping '/lustre6/home/lustre2/user1/Projects/Species/Active_Repeats/SpeciesA/DNAreads_1/reads/Ip1583_1.fastq' which didn't exist, or couldn't be read
/home/user1/applications/mcclintock/mcclintock/mcclintock.sh: line 511: 77039 Segmentation fault (core dumped) samtools faidx $reference_genome
/home/user1/applications/mcclintock/mcclintock/mcclintock.sh: line 511: 77086 Segmentation fault (core dumped) samtools faidx $popoolationte_reference_genome
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.4-r385
[main] CMD: bwa index /lustre6/home/lustre2/user1/Projects/Species/Active_Repeats/SpeciesA/reference/SpeciesA.genome.fa
[main] Real time: 0.009 sec; CPU: 0.005 sec
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.4-r385
[main] CMD: bwa index /lustre6/home/lustre2/user1/Projects/Species/Active_Repeats/SpeciesA/reference/popoolationte_full_SpeciesA.fasta
[main] Real time: 0.008 sec; CPU: 0.005 sec

Running RelocaTE pipeline...

/home/user1/applications/mcclintock/mcclintock/mcclintock.sh: line 767: source: deactivate: file not found

Investigate adding Grasper

code is here: https://github.com/COL-IU/GRASPER
paper is here: http://online.liebertpub.com/doi/abs/10.1089/cmb.2013.0129

Investigate adding STEAK

paper is here: https://academic.oup.com/ve/article/4090953
code is here: https://github.com/applevir/STEAK

Investigate adding method from Cridland et al to pipeline

Samtools argument error in mcclintock.sh

Hi McClintock devs,

I've been trying to run the pipeline on both the test data as well as some of my own, and I've encountered issues using methods that require samtools. After debugging a bit, I believe there is an error on line 495 of the mcclintock.sh script in one of the arguments passed to samtools sort:

samtools view -Sb -t $reference_genome".fai" $sam | samtools sort - $samplefolder/bam/$sample

I believe the samtools sort command should actually be:

samtools sort -o $samplefolder/bam/$sample.bam

(emphasis added)

Once I make this change, the pipeline seems to work properly again.

Thanks!
Dave

Add option to include/exclude individual methods from McClintock run

Prefer to do this as a command line parameter rather than a config file, since config files are not used elsewhere in the system.

Error running RelocaTE with test data

Hi Bergman Lab,

Thanks for this pipeline, it was really necessary! I've been running the pipeline with the test data (sacCer2), and I got no results from RelocaTE. Both SRR800842_relocate_nonredundant.bed and SRR800842_relocate_redundant.bed files are empty.

The errors I got are these (beside subroutine redefined warnings for the module Bio::DB::IndexedBase.pm) :

Error: The requested bed file (/scratch/bnavarr2/analysis/mcclintock_test/mcclintock/sacCer2/SRR800842/RelocaTE/SRR800842_relocate_presort_redundant.txt) could not be opened. Exiting!
Error: The requested bed file (/scratch/bnavarr2/analysis/mcclintock_test/mcclintock/sacCer2/SRR800842/RelocaTE/tmp) could not be opened. Exiting!

In the originalmethodsresults folder of RelocaTE, the following files are empty too:

not.given.*.confident_nonref_genomeflanks.fa
not.given.*.confident_nonref_reads_list.txt
not.given.*.confident_nonref.txt

Any hints?

Add option to run all split-read or paired-end methods

This will make it easier for a user to execute the pipeline for all split-read or paired-end methods, respectively, without having to specify all of the individual component methods. This option should also execute a module that parses the relevant split-read or paired-end predictions from methods like TEMP that produce both.

Refactor branch consistently failing with test data on RelocaTE2

Hi Bergman lab,

I'm really excited about the recent updates you've made to the pipeline. I'm looking forward to using it with my own data, which I hope to try soon.

However, I wanted to note that I'm consistently getting errors when trying to run mcclintock on the test dataset. It will fail consistently running RelocaTE2. Oddly enough, sometimes the relocaTE2_run rule fails, while sometimes the relocaTE2_post rule fails (unfortunately after wiping my environment and reinstalling mcclintock, I don't currently have any info from runs where the relocaTE2_post rule failed).

I've seen this happen on two different systems (one is a cluster, while the other is a server), both running CentOS 7.x. I've followed the installation instructions in the README, including updating to the latest release of conda.

Here is the script I used to run the test analysis:

#!/usr/bin/env bash

python3 mcclintock.py \
-r test/sacCer2.fasta \
-c test/sac_cer_TE_seqs.fasta \
-g test/reference_TE_locations.gff \
-t test/sac_cer_te_families.tsv \
-1 test/SRR800842_1.fastq.gz \
-2 test/SRR800842_2.fastq.gz \
-p 25 \
-o test_out 2>&1 | tee test.log

Here is the error message I'm getting from the main log:

[Fri Jun 19 14:07:59 2020]
Error in rule relocaTE2_run:
    jobid: 24
    output: /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/results/ALL.all_nonref_insert.gff, /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/results/ALL.all_ref_insert.gff
    conda-env: /datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa

RuleException:
CalledProcessError in line 579 of /datahome/oenothera/mcclintock/test_out/snakemake/3412670/Snakefile:
Command 'source /home/progs/anaconda2/bin/activate '/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa'; set -euo pipefail;  /home/progs/anaconda2/envs/mcclintock/bin/python3.8 /datahome/oenothera/mcclintock/test_out/snakemake/3412670/.snakemake/scripts/tmp196igscz.relocate2_run.py' returned non-zero exit status 1.
  File "/datahome/oenothera/mcclintock/test_out/snakemake/3412670/Snakefile", line 579, in __rule_relocaTE2_run
  File "/home/progs/anaconda2/envs/mcclintock/lib/python3.8/concurrent/futures/thread.py", line 57, in run

I'm also seeing the following error message from the relocaTE2.log file:

[main] CMD: /datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/bwa sampe /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/input/reference.fasta /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/bwa_aln/reference.SRR800842_1.te_repeat.flankingReads.fq.matched.fullreads.bwa.mates.sai /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/bwa_aln/reference.SRR800842_2.te_repeat.flankingReads.fq.matched.fullreads.bwa.mates.sai /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_1.te_repeat.flankingReads.fq.matched.fullreads.fq /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_2.te_repeat.flankingReads.fq.matched.fullreads.fq
[main] Real time: 3.828 sec; CPU: 3.338 sec
[W::bam_merge_core2] No @HD tag found.
/datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_1.te_repeat.flankingReads.fq
/datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_2.te_repeat.flankingReads.fq
/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/clean_pairs_memory.py --fq1 /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_1.te_repeat.flankingReads.fq --fq2 /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_2.te_repeat.flankingReads.fq --repeat /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/te_containing_fq --fq_dir /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/input/fastq --seqtk /datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/seqtk
mergeing bam file: 2/2 files
mergeing fullread bam file: 1/1 files
job: sh /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/shellscripts/step_3/0.te_repeat.blat.sh
job: sh /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/shellscripts/step_3/1.te_repeat.blat.sh
job: sh /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/shellscripts/step_4/step_4.reference.repeat.align.sh
Traceback (most recent call last):
  File "/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/relocaTE2.py", line 741, in <module>
    main()
  File "/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/relocaTE2.py", line 643, in main
    existingTE_RM_ALL(top_dir, reference_ins_flag)
  File "/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/relocaTE2.py", line 155, in existingTE_RM_ALL
    if unit[9] == '+':
IndexError: list index out of range

It looks like the existingTE.bed file is empty.

I've also attached the full log of my most recent attempt, along with the full relocaTE2 log. If I can provide any more information or if you would like me to send an archive of the output directory, please let me know.

Thanks,
Dave

test.log
relocaTE2.log

ID problems in taxonomy and locations files

Hello,

I am a little confused about how to make the taxonomy file and the location gff file. I tried to make them from scratch from the RepeatMasker out file.
But the tool complains that it cannot find a TE ID:

TE ID: ID=35_44 not found in IDs from GFF: ~/locations.gff3 please make sure each ID in: ~/taxonomy.tsv is found in: ~/locations.gff3

The ID is present in both files:

locations.gff3:

Chr1 RepeatModeler RC/Helitron 66707 66797 . . . ID=35_44

and taxonomy.tsv:

ID=35_44 rnd-6_family-6781#RC/Helitron

Any suggestions on how I need to make those two files?

Thanks in advance!
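
One thing that may be worth checking is whether the two files agree on the bare ID. The sketch below assumes, without confirmation from the source, that IDs are read from the ID= attribute in the last GFF column and compared verbatim against the first taxonomy column; under that assumption a taxonomy entry written as "ID=35_44" would not match a GFF ID parsed as "35_44". The function names gff_ids and missing_taxonomy_ids are hypothetical, and both files are assumed to be tab-delimited.

```python
def gff_ids(gff_lines):
    """Collect the values of ID= attributes from the last GFF column."""
    ids = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        attributes = line.rstrip("\n").split("\t")[-1]
        for attribute in attributes.split(";"):
            key, sep, value = attribute.strip().partition("=")
            if key == "ID" and sep:
                ids.add(value)
    return ids


def missing_taxonomy_ids(taxonomy_lines, gff_lines):
    """Return first-column taxonomy IDs that are absent from the GFF IDs."""
    known = gff_ids(gff_lines)
    missing = []
    for line in taxonomy_lines:
        if line.strip():
            te_id = line.split("\t")[0].strip()
            if te_id not in known:
                missing.append(te_id)
    return missing
```

If this check reports "ID=35_44" as missing, writing the taxonomy entry as "35_44" might resolve the mismatch, though that depends on how the tool actually parses the files.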

Separate test script from a 'normal' run script

At the moment the test script is the only way to run the pipeline, so a user must edit it to use the software on their own data. I am working on a separate run script that takes 6 inputs (reference, consensus TEs, gff, TE hierarchy, fastq 1 and fastq 2) and runs the pipeline. The new test script will download the data and call this script to run the pipeline. These new scripts will be separate from the existing install script, which will remain.

Investigate adding PopoolationTE2

Paper is here: https://academic.oup.com/mbe/article/33/10/2759/2925581
Software is here: https://sourceforge.net/p/popoolation-te2/wiki/Home/

Add ability to merge output from different component methods

use bedtools cluster: http://bedtools.readthedocs.org/en/latest/content/tools/cluster.html
use library insert size as clustering parameter
need to store information about source methods that support merged insertion
possibly may need to use more complex file format (e.g. GFF3)
should allow user to specify which methods to include in merging & support level
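
The merging idea above can be sketched in pure Python, loosely analogous to distance-based clustering in bedtools cluster rather than a wrapper around it. The Insertion record, the merge_predictions function, and the choice to merge only predictions of the same TE family are assumptions made for illustration; the window would be set from the library insert size.

```python
from collections import namedtuple

# Hypothetical record for one prediction from one component method.
Insertion = namedtuple("Insertion", ["chrom", "start", "end", "family", "method"])


def merge_predictions(insertions, window, min_methods=1):
    """Cluster predictions of the same family lying within `window` bp.

    Each merged cluster keeps the set of methods that support it, and
    clusters supported by fewer than `min_methods` methods are dropped.
    """
    clusters = []
    ordered = sorted(insertions, key=lambda i: (i.chrom, i.family, i.start))
    for ins in ordered:
        last = clusters[-1] if clusters else None
        if (
            last is not None
            and last["chrom"] == ins.chrom
            and last["family"] == ins.family
            and ins.start - last["end"] <= window
        ):
            # Close enough to the current cluster: extend it.
            last["end"] = max(last["end"], ins.end)
            last["methods"].add(ins.method)
        else:
            clusters.append({
                "chrom": ins.chrom,
                "start": ins.start,
                "end": ins.end,
                "family": ins.family,
                "methods": {ins.method},
            })
    return [c for c in clusters if len(c["methods"]) >= min_methods]
```

To restrict merging to chosen methods, the input list can be filtered by ins.method before calling the function, and the methods set of each cluster is the information that a richer output format such as GFF3 would need to carry.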

Add option to generate VCF output for all methods

Since we are calling variants we should consider adding support to generate VCF files for each method in addition to the BED files.

see VCF format specification here: http://vcftools.sourceforge.net/specs.html
Retroseq already generates output in VCF format
need to think about best way to represent TSD and non integer based predictions
need to understand how to incorporate reference insertions into VCF (currently Retroseq doesn't include)

install.sh: 108: install.sh: Syntax error: "(" unexpected (expecting "}")

Hi,
I typed sh install.sh and I don't know if this is a problem as the folders were not empty.

  inflating: samplot-1de65afd22e88c5cb5122ae638e8ba4cf6f75144/test/func/samplot_test.sh  
mv: cannot move 'samplot-1de65afd22e88c5cb5122ae638e8ba4cf6f75144' to 'samplot/samplot-1de65afd22e88c5cb5122ae638e8ba4cf6f75144': Directory not empty
Copying run scripts...
patching file PoPoolationTE/Modules/TEInsertUtility.pm
Reversed (or previously applied) patch detected!  Assume -R? [n] y
patching file PoPoolationTE/Modules/TEInsert.pm
Reversed (or previously applied) patch detected!  Assume -R? [n] y
patching file PoPoolationTE/samro.pl
Reversed (or previously applied) patch detected!  Assume -R? [n] y
patching file PoPoolationTE/identify-te-insertsites.pl
Reversed (or previously applied) patch detected!  Assume -R? [n] y
patching file RelocaTE/scripts/relocaTE_insertionFinder.pl
Reversed (or previously applied) patch detected!  Assume -R? [n] y
patching file TEMP/scripts/TEMP_Absence.sh
Reversed (or previously applied) patch detected!  Assume -R? [n] y
install.sh: 66: install.sh: source: not found
install.sh: 69: install.sh: source: not found
install.sh: 108: install.sh: Syntax error: "(" unexpected (expecting "}")

Add sample allele frequency information to BED output when available

TEMP and a few other methods provide an estimate of non-reference TE insertion frequency in the raw output
this information is useful for filtering/analyzing predictions and could be encoded in the BED file

Investigate adding SPLITREADER

paper is here: https://elifesciences.org/content/5/e15716
code is here: https://github.com/LeanQ/SPLITREADER

WARNING. chromosome (chrXV) was not found in the FASTA file. Skipping.

Investigate adding Tangram to the pipeline

The paper for Tangram has been released with the system located at:
https://github.com/jiantao/Tangram
I will look into including this in the pipeline now that MOSAIK is not required for input.

Investigate adding mobster to pipeline

Code is here: http://sourceforge.net/projects/mobster/
Paper is here: http://genomebiology.com/content/15/10/488

Investigate adding TIDAL

paper here: http://nar.oxfordjournals.org/content/early/2015/11/16/nar.gkv1193.long
software here: https://github.com/laulabbrandeis/TIDAL

Generate Docker image for McClintock

Since we are aiming to make a reproducible and easy to use system, let's package McClintock as a Docker image with McClintock, all of the TE detection systems, and all of the dependencies. See the Docker documentation for more details: https://docs.docker.com/

Handle single ended sequencing data

Running the whole pipeline requires paired-end sequencing data; however, ngs_te_mapper can take single-ended data as input.
If single-ended data is supplied, stop the pipeline from failing and force it to run only ngs_te_mapper.
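
The proposed behaviour amounts to a small selection rule, sketched below. The function name select_methods and its arguments are hypothetical and do not describe the actual McClintock implementation.

```python
def select_methods(requested, fq2=None, single_end_methods=("ngs_te_mapper",)):
    """Choose which methods to run for paired-end or single-ended input.

    With a second fastq file the requested methods are kept as they are;
    without one, only the methods able to use single-ended reads are run.
    """
    if fq2 is not None:
        return list(requested)
    return list(single_end_methods)
```

Keeping the single-end-capable methods in a tuple makes it easy to extend the rule if other component methods gain single-ended support.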

Random error

Hi there,

I just started using mcclintock and have run into some issues.

Here it is:

Command:

    python3 ~/bin/mcclintock/mcclintock.py \\
		-r ${params.genome} \\
		-c ${params.con_seqs} \\
		-g ${params.ref_locs} \\
		-t ${params.families} \\
		-1 ${fq1} \\
		-2 ${fq2} \\
		-p ${task.cpus} \\
		-m ngs_te_mapper,temp,retroseq,te-locate \\
		-o mcclintock_out

Error:

Traceback (most recent call last):
  File "/projects/b1042/AndersenLab/work/stefan/ca/ff5ae5af5922d73544acee2c85de68/mcclintock_out/snakemake/1096582/.snakemake/scripts/tmpx20ygtgw.temp_post.py", line 154, in <module>
    main()
  File "/projects/b1042/AndersenLab/work/stefan/ca/ff5ae5af5922d73544acee2c85de68/mcclintock_out/snakemake/1096582/.snakemake/scripts/tmpx20ygtgw.temp_post.py", line 29, in main
    mccutils.make_nonredundant_bed(insertions, sample_name, out_dir, method="temp")
  File "/home/szs315/bin/mcclintock/scripts/mccutils.py", line 394, in make_nonredundant_bed
    if uniq_inserts[key].temp.support != "!" and insert.temp.support > uniq_inserts[key].temp.support:
TypeError: '>' not supported between instances of 'str' and 'float'
[Thu Jul 30 12:16:40 2020]
Error in rule process_temp:
    jobid: 0
    output: /projects/b1042/AndersenLab/work/stefan/ca/ff5ae5af5922d73544acee2c85de68/mcclintock_out/results/TEMP/BGI2-RET6-JU1440-trim-1P_temp_redundant.bed, /projects/b1042/AndersenLab/work/stefan/ca/ff5ae5af5922d73544acee2c85de68/mcclintock_out/results/TEMP/BGI2-RET6-JU1440-trim-1P_temp_nonredundant.bed
    conda-env: /home/szs315/bin/mcclintock/install/envs/conda/716bd1a3

RuleException:
CalledProcessError in line 498 of /projects/b1042/AndersenLab/work/stefan/ca/ff5ae5af5922d73544acee2c85de68/mcclintock_out/snakemake/1096582/Snakefile:
Command 'source /home/szs315/.pyenv/versions/miniconda3-4.3.27/bin/activate '/home/szs315/bin/mcclintock/install/envs/conda/716bd1a3'; set -euo pipefail;  /home/szs315/.pyenv/versions/miniconda3-4.3.27/envs/mcclintock/bin/python3.8 /projects/b1042/AndersenLab/work/stefan/ca/ff5ae5af5922d73544acee2c85de68/mcclintock_out/snakemake/1096582/.snakemake/scripts/tmpx20ygtgw.temp_post.py' returned non-zero exit status 1.
  File "/projects/b1042/AndersenLab/work/stefan/ca/ff5ae5af5922d73544acee2c85de68/mcclintock_out/snakemake/1096582/Snakefile", line 498, in __rule_process_temp
  File "/home/szs315/.pyenv/versions/miniconda3-4.3.27/envs/mcclintock/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Exiting because a job execution failed. Look above for error message

I understand this is not the most useful error report, but I am running about 1000 samples and decided to show you the error in case it makes sense to you. In the meantime, I am just skipping the sample that caused the issue.
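The traceback above ends in a `>` comparison between a `str` and a `float` on the TEMP support value. Below is a minimal sketch of one defensive way to make such a comparison safe. `coerce_support` and `has_more_support` are hypothetical helpers written for illustration, not functions from `mccutils.py`, and treating unparseable values as missing is an assumption.

```python
def coerce_support(value, placeholder="!"):
    """Return the support value as a float, or None when it is the
    placeholder or cannot be parsed as a number."""
    if value == placeholder:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def has_more_support(new_value, old_value):
    """Mirror the comparison shown in the traceback, but only after
    both sides have been coerced to float."""
    new_support = coerce_support(new_value)
    old_support = coerce_support(old_value)
    if new_support is None or old_support is None:
        return False
    return new_support > old_support


# A support value read as text no longer raises TypeError.
print(has_more_support("3.5", 2.0))
print(has_more_support(1.0, "!"))
```

Coercing at the point of comparison is only one option; converting the value once, when the record is parsed, would keep the types consistent everywhere else as well.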

Best

Missing --make_annotations and --mem arguments

Dear Bergman,

I am Haidong, a user of McClintock. I find the tool more convenient to use than before. Thanks for the update!
However, I found that there are no --make_annotations and --mem arguments in the current version, or perhaps I am using it the wrong way. Thanks!

Best wishes,
Haidong

No such file or directory

Hi!
I'm getting errors and warnings saying that certain files don't exist. My dataset consists of fastq.gz files; is it necessary to unzip them first? This is the stdout:

[bwa_index] Pack FASTA... 0.08 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 3.75 seconds elapse.
[bwa_index] Update BWT... 0.08 sec
[bwa_index] Pack forward-only FASTA... 0.06 sec
[bwa_index] Construct SA from BWT and Occ... 1.16 sec
[main] Version: 0.7.4-r385
[main] CMD: bwa index -p /hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/reference/sacCer2 /hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/reference/sacCer2.fasta
[main] Real time: 5.568 sec; CPU: 5.139 sec
mkdir: cannot create directory '/hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/SRR5678544_paired_R1/ngs_te_mapper': File exists
[main] Version: 0.7.4-r385
[main] CMD: bwa mem -t 36 /hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/reference/sac_cer_TE_seqs /hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/SRR5678544_paired_R1/reads/SRR5678544_paired_R1.fastq.gz
[main] Real time: 0.002 sec; CPU: 0.003 sec
[main] Version: 0.7.4-r385
[main] CMD: bwa mem -t 36 /hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/reference/sac_cer_TE_seqs /hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/SRR5678544_paired_R1/reads/SRR5678544_paired_R2.fastq.gz
[main] Real time: 0.002 sec; CPU: 0.002 sec
Error in aLength:max(lengths) : result would be too long a vector
Calls: GetSamFile
In addition: Warning message:
In max(lengths) : no non-missing arguments to max; returning -Inf
Execution halted
awk: fatal: cannot open file `/hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/SRR5678544_paired_R1/ngs_te_mapper/bed_tsd/SRR5678544_paired_R1_SRR5678544_paired_R2insertions.bed' for reading (No such file or directory)
Error: The requested bed file (/hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/SRR5678544_paired_R1/ngs_te_mapper/SRR5678544_paired_R1_ngs_te_mapper_presort.bed) could not be opened. Exiting!
DeprecationWarning: 'source deactivate' is deprecated. Use 'conda deactivate'.
cp: cannot stat '/hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/SRR5678544_paired_R1/ngs_te_mapper/bed_tsd/*.bed': No such file or directory

Running PoPoolationTE pipeline...

DeprecationWarning: 'source deactivate' is deprecated. Use 'conda deactivate'.
awk: fatal: cannot open file `/hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/SRR5678544_paired_R1/reads/SRR5678544_paired_R1.fastq.gz' for reading (No such file or directory)
awk: fatal: cannot open file `/hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/SRR5678544_paired_R1/reads/SRR5678544_paired_R2.fastq.gz' for reading (No such file or directory)
[main] Version: 0.7.4-r385
[main] CMD: bwa bwasw -t 36 /hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/reference/popoolationte_full_sacCer2.fasta /hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/SRR5678544_paired_R1/PoPoolationTE/reads1.fastq
[main] Real time: 0.032 sec; CPU: 0.032 sec
[main] Version: 0.7.4-r385
[main] CMD: bwa bwasw -t 36 /hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/reference/popoolationte_full_sacCer2.fasta /hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/SRR5678544_paired_R1/PoPoolationTE/reads2.fastq
[main] Real time: 0.029 sec; CPU: 0.028 sec

Median insert size = 

runpopoolationte.sh: line 41: * 3 + -1: syntax error: operand expected (error token is "* 3 + -1")
DeprecationWarning: 'source deactivate' is deprecated. Use 'conda deactivate'.
mv: cannot stat '/hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/SRR5678544_paired_R1/PoPoolationTE/SRR5678544_paired_R1_popoolationte*': No such file or directory
cp: cannot stat '/hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/SRR5678544_paired_R1/PoPoolationTE/te-poly-filtered.txt': No such file or directory
unzip:  cannot find or open *.zip, *.zip.zip or *.zip.ZIP.

No zipfiles found.
ls: cannot access '/hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/SRR5678544_paired_R1/results/summary/fastqc_analysis/*_1*/fastqc_data.txt': No such file or directory
ls: cannot access '/hosts/linuxhome/chaperone/silviav/reads/Gallone/McClintock_results/sacCer2/SRR5678544_paired_R1/results/summary/fastqc_analysis/*_2*/fastqc_data.txt': No such file or directory

Thank you for your help
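Most of the messages in the log above follow from input files not being found where later steps expect them. Here is a minimal sketch of a pre-flight check that reports whether each input is missing, empty, gzip-compressed, or plain text before anything is launched. `describe_input` and `preflight` are hypothetical names, and the check says nothing about whether a given McClintock version accepts gzipped input.

```python
import os

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def describe_input(path):
    """Return 'missing', 'empty', 'gzip', or 'plain' for one input path."""
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    with open(path, "rb") as handle:
        return "gzip" if handle.read(2) == GZIP_MAGIC else "plain"


def preflight(paths):
    """Map each path to its status so problems surface before a long run."""
    return {path: describe_input(path) for path in paths}
```

Calling `preflight([...])` on the two read files and printing the result shows immediately whether the paths passed on the command line resolve to readable files.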

RelocaTE and RelocaTE2 run infinitely or freeze

Hello Casey and others,

I have been working with your test data as well as some other data with your pipeline since I've been using your older version of McClintock for quite some time. I was excited to include RelocaTE and RelocaTE2 but sadly these are the programs that are giving me the most headaches.

Everything runs smoothly on the test data after installation, and I haven't seen any problems in the log outputs there. However, when I run the pipeline on my own data, either with the default methods or with specific programs selected, there are some problems. RelocaTE shows output in the log file but takes a very long time: I ran the default pipeline for 120 hours on our remote cluster, and it timed out before RelocaTE finished, even though every other program had completed hours after I started the job (except RelocaTE2, which I'll explain below). The fastq files are larger, 7.5 GB for each of the paired ends, but with multi-threading this still seems too long.

Additionally, RelocaTE2 runs indefinitely, but there's almost no information I can gather about where the problem is occurring. This is the log file I have so far after running the pipeline with just trimgalore and RelocaTE2 for nearly 48 hours (and all of this text was available after 30 minutes of run time).

output/48_all
Job counts:
count jobs
1 index_reference_genome
1 make_consensus_fasta
1 make_reference_fasta
1 map_reads
1 median_insert_size
1 relocaTE2_post
1 relocaTE2_run
1 repeatmask
1 setup_reads
1 summary_report
10
PROCESSING making consensus fasta
PROCESSING consensus fasta created
PROCESSING making reference fasta
PROCESSING reference fasta created
PROCESSING making samtools and bwa index files for reference fasta &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/logs/20200720.120349.7542720/processing.log
PROCESSING samtools and bwa index files for reference fasta created
PROCESSING prepping reads for McClintock
PROCESSING running trim_galore &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/logs/20200720.120349.7542720/trimgalore.log
PROCESSING read setup complete
PROCESSING Running RepeatMasker for RelocaTE2 &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/logs/20200720.120349.7542720/processing.log
PROCESSING Repeatmasker for RelocaTE2 complete
PROCESSING mapping reads to reference &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/logs/20200720.120349.7542720/bwa.log
PROCESSING read mapping complete
PROCESSING calculating median insert size of reads
PROCESSING median insert size of reads calculated
Conda environment defines Python version < 3.5. Using Python of the master process to execute script. Note that this cannot be avoided, because the script uses data structures from Snakemake which are Python >=3.5 only.

So there is no explanation of what might be going wrong. The last line does look like an error:
Conda environment defines Python version < 3.5. Using Python of the master process to execute script. Note that this cannot be avoided, because the script uses data structures from Snakemake which are Python >=3.5 only.

However, this line is also present in the log output of the entire McClintock pipeline on the test dataset, though not in the RelocaTE2 log output. I'm wondering, then, why RelocaTE2 stalls completely or runs indefinitely without any visible progress when it works on the test data. The only other clue I can give is the error message produced when the job hits the time limit, although I see similar error messages whenever I stop the pipeline prematurely.

RuleException:
CalledProcessError in line 578 of /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/snakemake/3427427/Snakefile:
Command 'source /scratch/lhemmer/programs/miniconda3/bin/activate '/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/32bdc8a2'; set -euo pipefail; /scratch/lhemmer/programs/miniconda3/envs/mcclintock/bin/python3.8 /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/snakemake/3427427/.snakemake/scripts/tmp70q2ux3x.relocate2_run.py' died with <Signals.SIGTERM: 15>.
File "/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/snakemake/3427427/Snakefile", line 578, in __rule_relocaTE2_run
File "/scratch/lhemmer/programs/miniconda3/envs/mcclintock/lib/python3.8/concurrent/futures/thread.py", line 57, in run
RuleException:
CalledProcessError in line 525 of /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/snakemake/3427427/Snakefile:
Command 'source /scratch/lhemmer/programs/miniconda3/bin/activate '/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/272a644b'; set -euo pipefail; /scratch/lhemmer/programs/miniconda3/envs/mcclintock/bin/python3.8 /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/snakemake/3427427/.snakemake/scripts/tmppmwkuerv.relocate_run.py' died with <Signals.SIGTERM: 15>.
File "/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/snakemake/3427427/Snakefile", line 525, in __rule_relocaTE_run
File "/scratch/lhemmer/programs/miniconda3/envs/mcclintock/lib/python3.8/concurrent/futures/thread.py", line 57, in run

Any help is appreciated, and if there are any additional outputs or files you need let me know. Thanks!

Lucas Hemmer

Add TEMP to pipeline

The TEMP paper has now been released:
http://nar.oxfordjournals.org/content/early/2014/04/21/nar.gku323.full
This should be integrated into the mcclintock pipeline.

No such file or directory

Hello,
I have ~2000 files of paired-end reads. I divided them into four folders in order to run McClintock on different CPUs. The runs on two of the folders completed successfully, but the runs on the other two folders fail with this error:

silviav@wildtype1:/hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/From701_1000:for i in *paired_R1.fastq ; do /home/silviav/mcclintock/mcclintock.sh -r /home/silviav/mcclintock/test/sacCer2.fasta -c /home/silviav/mcclintock/test/sac_cer_TE_seqs.fasta -g /home/silviav/mcclintock/test/reference_TE_locations.gff -t /home/silviav/mcclintock/test/sac_cer_te_families.tsv -1 $i -2 "${i:0:19}2.fastq" -p 36; done

Running McClintock version: 
Running McClintock version: 1d06a94bf7bcd5a1e17dfc8336fb82895780c713
1d06a94bf7bcd5a1e17dfc8336fb82895780c713


Date of run is 22_03_20



Date of run is 22_03_20


Creating directory structure...


Creating directory structure...

cp: cannot stat 'ERR1309028_paired_R1.fastq': No such file or directory
cp: cannot stat 'ERR1309028_paired_R2.fastq': No such file or directory

Performing FastQC analysis...


Performing FastQC analysis...

Skipping '/hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/From701_1000/sacCer2/ERR1309028_paired_R1/reads/ERR1309028_paired_R1.fastq' which didn't exist, or couldn't be read
Skipping '/hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/From701_1000/sacCer2/ERR1309028_paired_R1/reads/ERR1309028_paired_R2.fastq' which didn't exist, or couldn't be read
[bwa_index] Pack FASTA... 0.10 sec
[bwa_index] Construct BWT for the packed sequence...
^X^C
silviav@wildtype1:/hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/From701_1000:ls -ltrhs ERR1309028_paired_R1.fastq
5.8G -rw-r--r-- 1 silviav binf 5.8G Dec 14 10:01 ERR1309028_paired_R1.fastq
silviav@wildtype1:/hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/From701_1000:ls -ltrhs ERR1309028_paired_R2.fastq
5.8G -rw-r--r-- 1 silviav binf 5.8G Dec 14 10:01 ERR1309028_paired_R2.fastq

I also tried running on absolute paths but again I got an error.

silviav@wildtype1:/hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/From701_1000:for i in /hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/From701_1000/*paired_R1.fastq ; do /home/silviav/mcclintock/mcclintock.sh -r /home/silviav/mcclintock/test/sacCer2.fasta -c /home/silviav/mcclintock/test/sac_cer_TE_seqs.fasta -g /home/silviav/mcclintock/test/reference_TE_locations.gff -t /home/silviav/mcclintock/test/sac_cer_te_families.tsv -1 /hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/from701_1000/$i -2 /hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/from701_1000/"${i:0:19}2.fastq" -p 36; done

Running McClintock version: 
Running McClintock version: 1d06a94bf7bcd5a1e17dfc8336fb82895780c713
1d06a94bf7bcd5a1e17dfc8336fb82895780c713


Date of run is 22_03_20



Date of run is 22_03_20


Creating directory structure...


Creating directory structure...

cp: cannot stat '/hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/from701_1000//hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/From701_1000/ERR1309028_paired_R1.fastq': No such file or directory
cp: cannot stat '/hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/from701_1000//hosts/linuxhome/ch2.fastq': No such file or directory

Performing FastQC analysis...


Performing FastQC analysis...

Skipping '/hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/From701_1000/sacCer2/ERR1309028_paired_R1/reads/ERR1309028_paired_R1.fastq' which didn't exist, or couldn't be read
Skipping '/hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/From701_1000/sacCer2/ERR1309028_paired_R1/reads/ch2.fastq' which didn't exist, or couldn't be read
[bwa_index] Pack FASTA... 0.10 sec
[bwa_index] Construct BWT for the packed sequence...
^X^C

Additionally, miniconda is activated:

silviav@wildtype1:/hosts/linuxhome/chaperone/silviav/reads/Peter/paired_IMPORTANTE/From701_1000:which activate 
activate is /home/silviav/miniconda/bin/activate
activate is /home/silviav/anaconda2/bin/activate

Thank you in advance for your help.
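In the second command above, the glob is already absolute, so `$i` holds a full path. Prefixing the directory again doubles the path, and `${i:0:19}` takes the first 19 characters of the whole path rather than of the file name, which is consistent with the `ch2.fastq` in the `cp` error. The directory prefix is also typed as `from701_1000` while the directory is `From701_1000`. Below is a minimal sketch of deriving the mate file by suffix rather than by fixed offset. `mate_path` is a hypothetical helper, and the `_R1.fastq`/`_R2.fastq` naming is taken from the file names in the post.

```python
from pathlib import Path


def mate_path(r1_path, r1_suffix="_R1.fastq", r2_suffix="_R2.fastq"):
    """Derive the R2 path from an R1 path by swapping the file-name suffix."""
    r1 = Path(r1_path)
    if not r1.name.endswith(r1_suffix):
        raise ValueError(f"{r1.name} does not end with {r1_suffix}")
    stem = r1.name[: -len(r1_suffix)]
    return r1.with_name(stem + r2_suffix)


relative = "ERR1309028_paired_R1.fastq"
absolute = (
    "/hosts/linuxhome/chaperone/silviav/reads/Peter/"
    "paired_IMPORTANTE/From701_1000/" + relative
)

# Fixed-offset slicing, as in ${i:0:19}2.fastq, only works for a bare file name.
print(relative[:19] + "2.fastq")  # ERR1309028_paired_R2.fastq
print(absolute[:19] + "2.fastq")  # /hosts/linuxhome/ch2.fastq

# Suffix replacement gives the same answer for relative and absolute paths.
print(mate_path(absolute))
```

Because the suffix swap operates on the file name only, it is unaffected by how long the directory part of the path is.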

Add option to save BAM files of mapped reads

Currently there is only the option to store either all or no intermediate files. It would be useful to retain BAM files of mapped reads from large runs to reuse for other projects.

Parse mcclintock.sh parameters with getopts

See example here: http://rsalveti.wordpress.com/2007/04/03/bash-parsing-arguments-with-getopts/
print help menu with -h option
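The request above is for `getopts` in the shell script. As an analogue only, here is a minimal sketch of the same idea using Python's standard `argparse` module, which generates the `-h` help menu automatically and names any missing required option in its error message. The flag letters come from the run commands quoted on this page. The `dest` names, the help strings, and the choice of which flags are required are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build an illustrative parser; -h/--help is added by argparse itself."""
    parser = argparse.ArgumentParser(
        prog="mcclintock-sketch",
        description="Illustrative option parsing only.",
    )
    # Which options are required is an assumption for this sketch.
    parser.add_argument("-r", dest="reference", required=True,
                        help="reference genome FASTA")
    parser.add_argument("-c", dest="consensus", required=True,
                        help="TE sequences FASTA")
    parser.add_argument("-g", dest="locations",
                        help="reference TE locations GFF")
    parser.add_argument("-t", dest="families",
                        help="TE families TSV")
    parser.add_argument("-1", dest="first", required=True,
                        help="first FASTQ file")
    parser.add_argument("-2", dest="second",
                        help="second FASTQ file")
    parser.add_argument("-p", dest="proc", type=int, default=1,
                        help="number of processors")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Running the sketch without `-r`, for example, exits with a usage message that names the missing option, which is more specific than a generic "required parameter is missing" message.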

Incorporate new version of ngs_te_mapper that uses BWA instead of blat

Need to update README.md and install.sh accordingly once this change is made.

"A required parameter is missing"

Thank you very much for putting this pipeline together. However, I am having difficulty running the entire pipeline (sh mcclintock.sh). I have checked and confirmed that all necessary parameters are stated (-r, -c, -g, -t, -1, -2), but I continue to get an error message: "A required parameter is missing". I have also compared my run command with your example pipeline run. Everything seems to be in order, and I have tried many variations. Any ideas why I keep getting this error?

Thank you!

The fasta consensus set of TE sequences

Hi, thanks for developing such a great pipeline to detect TE insertions. I am confused about the FASTA consensus set of TE sequences. Does it mean the consensus sequence of each TE or of each TE family? And could you please give some advice on how to construct the consensus sequences? Thanks,
Tao

Galaxy wrapper for mcclintock pipeline

Hello @cbergman @nelson42,

I've been following your efforts for a while, and I have to say this looks great.
How would you feel if I made a Galaxy (https://github.com/galaxyproject/galaxy) wrapper for your pipeline?
I'd be interested in doing this as a project with a student that will join our lab soon.

Add a quality control step for fastq input

In large runs, some errors may be caused by 'bad' fastq input files.
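As a rough illustration of what such a quality-control step could check, here is a minimal sketch that validates the record structure of a FASTQ stream. It assumes strict four-line records (no wrapped sequences). `check_fastq` is a hypothetical function, not part of McClintock, and it accepts any iterable of text lines, such as an open file handle or `gzip.open(path, "rt")`.

```python
from itertools import zip_longest


def check_fastq(lines, max_problems=10):
    """Return (records_seen, problems) for an iterable of FASTQ text lines."""
    problems = []
    records = 0
    stripped = (line.rstrip("\r\n") for line in lines)
    # Group the stream into 4-line records; a short final group is padded with None.
    for header, seq, plus, qual in zip_longest(*[stripped] * 4):
        records += 1
        if None in (seq, plus, qual):
            problems.append(f"record {records}: truncated record")
        elif not header.startswith("@"):
            problems.append(f"record {records}: header does not start with '@'")
        elif not plus.startswith("+"):
            problems.append(f"record {records}: separator does not start with '+'")
        elif len(seq) != len(qual):
            problems.append(f"record {records}: sequence and quality lengths differ")
        if len(problems) >= max_problems:
            break
    if records == 0:
        problems.append("no records found")
    return records, problems
```

A structural check like this catches truncated or corrupted files early; it does not assess base quality, which a tool such as FastQC reports separately.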

Refactor to run on many samples in parallel

When scaling up to run on a cluster, we will run into two problems with the current design:

reference files will be created by each instance
we will accumulate many intermediate result files in each method directory

To avoid these issues, we need to add:

a file test to check if reference files exist (and are complete)
creation of a results folder in the species/sample directory
movement of bed files from method/sample to species/sample
a flag to allow user to specify whether method/sample folders should be deleted
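The first item in the list above, a file test for existing reference files, could be sketched as follows. The sketch assumes the reference index is the set of files written by `bwa index -p <prefix>` and treats "complete" as "every expected file exists and is non-empty", which is only an approximation. `missing_index_files` and `reference_is_complete` are hypothetical names.

```python
import os

# Files produced by `bwa index` for a given prefix.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_index_files(prefix, suffixes=BWA_INDEX_SUFFIXES):
    """List the expected index files that are absent or empty."""
    missing = []
    for suffix in suffixes:
        path = prefix + suffix
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(path)
    return missing


def reference_is_complete(prefix, suffixes=BWA_INDEX_SUFFIXES):
    """True when every expected index file exists and is non-empty."""
    return not missing_index_files(prefix, suffixes)
```

When many instances start at once, a non-empty file may still be in the middle of being written by another instance, so a check like this would probably need to be paired with building the index in a temporary location and renaming it into place once finished.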

