Giter VIP home page Giter VIP logo

multiprime's Introduction

multiPrime: version 2.1.1

Multi PCR primer pairs design processing pipeline

MultiPrime is a pipeline designed for broad-spectrum detection of target sequences using tNGS. It is implemented in Python and Snakemake and takes a FASTA format file as input. The pipeline has three main steps: classification by identity, primer design, and primer set combination. In the classification step, redundant sequences are removed and clusters are formed by identity. Rare sequence clusters are compared to others by average nucleotide identity, and if they are deemed similar enough, they are merged. In the primer design step, multi-alignment is performed using MUSCLE or MAFFT, and candidate primers are designed using the nearest-neighbor model. Primer pairs are selected based on PCR product length, melting temperature, dimer examination, coverage with errors, and other factors. Finally, a greedy algorithm is used to combine primer pairs into a minimal primer set according to dimer examination.

If you only require primer design without the need for primer set combination, you may use the primer design module of MultiPrime, which is accessible through scripts/multiPrime-core.py or pip install multiPrime (version >=2.3.8) and utilize the DPrime function.

multiPrime1: Degenerate primer design by DEGEPRIME (MC-DPD).

mulitPrime2: Degenerate primer design by multiPrime-core (MC-EDPD or MC-DPD). It allows for avoidance of mismatches at 3'end region.

mulitPrime3: It is similar to multiPrime2, but it allows for easy avoidance of mismatches at any position, making it flexible for experienced users.

Scripts and pipelines provided in this repository aid to design multiplex PCR primer and return a minimal primerset for multi-PCR. It contains all scripts to allow a self-assembled processing and additionally provides pipeline scripts that run the entire processing automatically.

Requirements

To run this pipeline, your computer requires 30 GB of available memory (RAM) to process larger number of sequence (e.g. 1,000,000). Note: We don't suggest that Input sequences contains those sequences whose length is greater than 100K, if necessary, you'd better set the Maxseq in yaml file as small as possible, but do not smaller than 200. Alternatively, you may consider using conserved genes/regions instead of whole genomes. Snakemake was used to facilitate the automated execution of all analysis steps. The easiest way to make use of the pipeline is to set up a python 3.9 virtual environment and run the pipeline in this environment.

Download/Provide all necessary files:

Comparison

DEGEPRIME-1.1.0: DOI: 10.1128/AEM.01403-14; Please cite: "DegePrime, a program for degenerate primer design for broad-taxonomic-range PCR in microbial ecology studies." Links: https://github.com/EnvGen/DegePrime; please move this directory into scripts.

mfeprimer-3.2.6: DOI: 10.1093/nar/gkz351; Please cite: "MFEprimer-3.0: quality control for PCR primers." please move this it into scripts. Please add "execute" to mfeprimer-3.2.6

Programs we employed:

biopython: Not required in v1.0.1 and the subsequent version.

The method for calculating Tm values in this study is a slightly modified version of primer3-py here. Reference paper: Owczarzy et al., 2004; Owczarzy et al., 2008.

The method for calculating deltaG in this study is a slightly modified version of the approach proposed by Martin et al., 2020. "Base-Pairing and Base-Stacking Contributions to Double-Stranded DNA Formation."

The method for dimer examination in this study is a slightly modified version of the approach proposed by Xie et al., 2022. "Designing highly multiplex PCR primer sets with Simulated Annealing Design using Dimer Likelihood Estimation (SADDLE)"

MUSCLE: It is already in the requirement.txt. version=v3.8.1551. http://www.drive5.com/muscle This software is donated to the public domain. Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

MAFFT: It is already in the requirement.txt. version=v7.508 (2022/Sep/07). Please cite: "MAFFT multiple sequence alignment software version 7: improvements in performance and usability".

fastANI: It is already in the requirement.txt. version=version 1.33. Please cite: "FastANI, Mash and Dashing equally differentiate between Klebsiella species."

blast+: It is already in the requirement.txt. version=BLAST 2.13.0+. Links: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastNews.

bowtie2: It is already in the requirement.txt. version=version 2.2.5. DOI:10.1038/nmeth.1923; Please cite: "Fast gapped-read alignment with Bowtie 2." Links: https://www.nature.com/articles/nmeth.1923

Installation and Snakemake

Snakemake is a workflow management system that helps to create and execute data processing pipelines. It requires python3 and dependent environment (multiPrime == multiPrime2) can be most easily installed via the bioconda package of the python anaconda distribution.

conda create -n multiPrime -c bioconda -c conda-forge --file requirement.txt

if conflicts:

conda create -n multiPrime -c bioconda -c conda-forge --file requirement2.txt

Activate and exit the environment

To activate the envitoment

source activate multiPrime

To exit the environment (after finishing the usage of the pipeline), just execute

source deactivate

Run the pipeline

Configure input parameters

The working directory contains files named multiPrime.yaml, multiPrime2.yaml and multiPrime3.yaml. These are the central file in which all user settings, paramter values and path specifications are stored. multiPrime.yaml employs DEGEPRIME-1.1.0 for maximum coverage degenerate primer design (MC-DPD), multiPrime2.yaml and multiPrime3.yaml use multiPrime-core.py for MC-DPD or MC-DPD with error. During a run, all steps of the pipeline will retrieve their paramter values from these file. It follows the yaml syntax (find more information about yaml and it's syntax here) what makes it easy to read and edit. The main principles are:

  • everything that comes after a # symbol is considered as comment and will not be interpreted
  • paramters are given as key-value pair, with key being the name and value the value of any paramter

Before starting the pipeline, open the multiPrime.yaml configuration file and set all options according as required. This should at least include:

  • name of the input directory - where are your input fasta files stored -input_dir: ["abs_path_to_input_dir"]
  • name of the output directory - where should the pipeline store the output files (the direcotry is created if not existing) -results_dir: ["abs_path_to_results_dir"]
  • name of the log directory - where should the pipeline store the log files -log_dir: ["abs_path_to_log_dir"]
  • name of the scripts directory - where should the pipeline store the scripts files -scripts_dir: ["abs_path_to"]/multiPrime/scripts
  • name(s) of your input samples - please note: If your sample is named sample1.fa then sample1 will be kept as naming scheme throughout the entire run to indicate output files that belong to this input file, e.g. the pipeline will create a file called sample1.fa. If you have multiple input files, just follow the given pattern with one sample name per line (and a dash that indicates another list item).
  • identity - threshold for classification. please note: If you set 1, multiPrime will design candidate primer pairs for each fasta in input files. Suggestion: 0.7-0.8.
  • others - for more information on the parameters, please refer to the YAML file.

Start a run

Once you set up your configuration file, running the pipeline locally on your computer is as easy as invoking:

sh run.sh

maximal coverage degenerate primer design (MC-DPD). The approach employed DegePrime to design degenerate primers for the target sequence.

snakemake --configfile multiPrime.yaml -s multiPrime.py --cores 10 --resources disk_mb=80000

maximal coverage degenerate primer design with errors tolerant (MC-EDPD) or MC-DPD. MultiPrime2 is capable of avoiding mismatches that occur at the 3'end position. The approach used in multiPrime2.yaml depends on the value of the "variation" parameter.

If "variation" is set to 0, then multiPrime uses the MC-DPD approach to design degenerate primers for the target sequence. In this approach, the primer sequences are designed with prefect match (0-mismatch).

If "variation" is set to a value greater than 0, then multiPrime uses the MC-EDPD approach to design degenerate primers with errors (mismatches) tolerant (1-mismatch or 2-mismatches). In this approach, the primer sequences are allowed to contain a limited number of errors (mismatches), which increases the probability of finding suitable primer sequences for the target sequence.

snakemake --configfile multiPrime2.yaml -s multiPrime2.py --cores 10 --resources disk_mb=80000

multiPrime3 is similiar to multiPrime2, but it enables easy avoidance of mismatches at any position.

snakemake --configfile multiPrime3.yaml -s multiPrime3.py --cores 10 --resources disk_mb=80000

Start a run independently

Setting default parameters may not always be suitable for all conditions. If you want to design primers with more flexible parameters, you can install the multiPrime package through PyPI (Python Package Index) using pip.

pip install multiPrime==2.3.8 (make sure version >=2.3.8)
multiPrime --help

The multiPrime package includes various functions for designing primers, selecting primer pairs, and calculating the coverage of these primer pairs. These features have been developed to assist researchers in performing PCR experiments efficiently and accurately. The package is continuously being improved and expanded upon, with additional functions expected to be added in the future. All manual instruction for multiPrime can be found in here.

If you have already cloned this repository and do not wish to install the multiPrime package from PyPI, you can use the scripts included in this repository to perform primer design. These scripts have the same functionality as the multiPrime package and can be run locally on your computer without the need for a separate installation.

python {path to script}/{target}.py --help

or

python {path to script}/{target}.py

For example:

Primer design with MC-DPD (--variation 0) or MC-EDPD (--variation 1 or 2. We do not recommend setting the --variation parameter greater than 2, as amplification efficiency can be severely inhibited when there are more than 2 mismatches between the primers and their targets).

python scripts/multiPrime-core.py
Usage: multiPrime-core.py -i input -o output -p 20
               Options: { -l [18] -n [4] -d [10] -v [1] -e [3.6] -g [0.2,0.7] -f [0.8] -c [4] -p [10] -a [4] }

Options:
-h, --help            show this help message and exit
-i INPUT, --input=INPUT
                      Input file: multi-alignment output (muscle or others).
-l PLEN, --plen=PLEN  Length of primer. Default: 18.
-n DNUM, --dnum=DNUM  Number of degenerate. Default: 4.
-d DEGENERACY, --degeneracy=DEGENERACY
                      degeneracy of primer. Default: 10.
-v VARIATION, --variation=VARIATION
                      Max mismatch number of primer. Default: 1.
-e ENTROPY, --entropy=ENTROPY
                      Entropy is a measurement of the degree of randomness or disorder in a system. 
                      This parameter is utilized to determine whether a window is conserved or not. 
                      Entropy of primer-length window. Default: 3.6.
-g GC, --gc=GC        Filter primers by GC content. Default [0.2,0.7].
-s SIZE, --size=SIZE  Filter primers by mini PRODUCT size. Default 100.
-f FRACTION, --fraction=FRACTION
                      Filter primers by match fraction. Default: 0.8.
-c COORDINATE, --coordinate=COORDINATE
                      Mismatch index is not allowed to locate in start or
                      stop region. otherwise, it won't be regard as the mis-
                      coverage. With this param, you can control the index
                      of Y-distance (number and position of mismatch) when calculate
                      coverage with error.Default: 4.
-p PROC, --proc=PROC  Number of process to launch. Default: 20.
-a AWAY, --away=AWAY  Filter hairpin structure, which means distance of the
                      minimal paired bases. Default: 4. Example:(number of
                      X) AGCT[XXXX]AGCT. Primers should not have
                      complementary sequences (no consecutive 4 bp
                      complementarities),otherwise the primers themselves
                      will fold into hairpin structure.
-o OUT, --out=OUT     Output file: candidate primers. e.g.
                      [*].candidate.primers.txt.

To get candidate degenerate primer pairs with high coverage.

python scripts/get_multiPrime.py
Usage: get_multiPrime.py -i input -r sequence.fa -o output
               Options: {-f [0.6] -m [500] -n [200] -e [4] -p [9] -s [250,500] -g [0.2,0.7] -d [4] -a ","}.

Options:
-h, --help            show this help message and exit
-i INPUT, --input=INPUT
                      Input file: degeprimer out.
-r REF, --ref=REF     Reference sequence file: all the sequence in 1 fasta,
                      for example: (Cluster_96_171.fa).
-g GC, --gc=GC        Filter primers by GC content. Default [0.2,0.7].
-f FRACTION, --fraction=FRACTION
                      Filter primers by match fraction. Default: 0.6.
                      Sometimes you need a small fraction to get output.
-e END, --end=END     Filter primers by degenerate base position. e.g. [-t
                      4] means I dont want degenerate base appear at the end
                      four bases when primer pre-filter. Default: 4.
-p PROC, --proc=PROC  Number of process to launch.  default: 10.
-s SIZE, --size=SIZE  Filter primers by PRODUCT size. Default [250,500].
-d DIST, --dist=DIST  Filter param of hairpin, which means distance of the
                      minimal paired bases. Default: 4. Example:(number of
                      X) AGCT[XXXX]AGCT.
-a ADAPTOR, --adaptor=ADAPTOR
                      Adaptor sequence, which is used for NGS next. Hairpin
                      or dimer detection for [adaptor--primer]. For example: 
                      TCTTTCCCTACACGACGCTCTTCCGATCT,TCTTTCCCTACACGACGCTCTTCCGATCT 
                      (Default). If you dont want adaptor,
                      use [","]
-m MAXSEQ, --maxseq=MAXSEQ
  		The default value for the limit of sequence number is set at 500. 
  		However, if the value is set to 0, then all sequences will be taken into consideration. 
  		It is important that this parameter remains consistent with the [max_seq] 
  		parameter used in multi-alignment [muscle]. 
-o OUT, --out=OUT     Output file: candidate primers. e.g.
                      [*].candidate.primers.txt.

To extract primers of your amplicons from Oxford Nanopore Technology (ONT) reads. FindONTprimerV2.py = FindONTprimerV3.py. If your primers are degenerate, you can use the "FindONTexpandprimer.py" script included in the multiPrime package. This script is designed specifically to identify degenerate primer sequences from ONT reads and expand them into their full, non-degenerate forms.:

python scripts/FindONTprimerV3.py
Usage: FindONTprimerV3.py -i [input] -s [primer set] -p [20] -l [primer length] -m [0.6] -f [fq] -o [output].

Options:
-h, --help            show this help message and exit
-i INPUT, --input=INPUT
                      Input file: fastq or fasta or fq.gz or fa.gz.
-s SET, --set=SET     primer set file.
-p NPROC, --nproc=NPROC
                      Primer set file. option. Default: 10
-l LEN, --len=LEN     Primer length. Default: 18
-m MIN_IDENT, --min_ident=MIN_IDENT
                      min identity. Default: 0.6
-f FORMAT, --format=FORMAT
                      Input format can be fasta, fastq, fa.gz and fq.gz. Default: fastq
-o OUT, --out=OUT     Output file: candidate primers. e.g.
                      [*].candidate.primers.txt.

To extract PCR products with perfect matches from your input FASTA file:

python scripts/extract_PCR_product.py
Usage: extract_PCR_product.py -i [input] -r [reference] -p [10] -f [format] -o [output]

Options:
--version             show program's version number and exit
-h, --help            show this help message and exit
-r REF, --ref=REF     reference file: template fasta or reference fasta.
-i INPUT, --input=INPUT
                      Primer file. One of the followed three types:
                      final_maxprimers_set.xls   primer.fa
                      primer_F,primer_R.
-f FORMAT, --format=FORMAT
                      Format of primer file: xls or fa or seq; default: xls.
                      xls: final_primer_set.xls, output of multiPrime.  fa:
                      fasta format.  seq: sequence format, comma seperate.
                      e.g. primer_F,Primer_R.
-o OUT, --out=OUT     Output_dir. default: PCR_product.
-p PROCESS, --process=PROCESS
                      Number of process to launch.  default: 10.
-s STAST, --stast=STAST
                      Stast information: number of coverage and total.
                      default: Coverage.xls

To extract PCR products with mismatches from your input FASTA file

python scripts/primer_coverage_validation_by_BWT.py
Usage: primer_coverage_validation_by_BWT.py -i [input] -r [bowtie index] -l [150,2000] -p [10]-o [output]

Options:
-h, --help            show this help message and exit
-i INPUT_FILE, --input=INPUT_FILE
                      input file: primer.fa.
-r REF, --ref=REF     reference file: template fasta or reference fasta.
-l LEN, --len=LEN     Length of primer, which is used for mapping. Default:
                      18
-t TERM, --term=TERM  Position of mismatch is not allowed in the 3 term of
                      primer. Default: 4
-s SIZE, --s=SIZE     Length of PCR product, default: 150,2000.
-p PROC, --proc=PROC  Number of process. Default: 20
-b BOWTIE, --bowtie=BOWTIE
                      bowtie or bowtie2 was employed for mapping. Default:
                      bowtie2
-m SEEDMMS, --seedmms=SEEDMMS
                      Bowtie: Mismatches in seed (can be 0 - 3, default: -n
                      1).Bowtie2: Gap or mismatches in seed (can be 0 - 1,
                      default: -n 1).
-o OUT, --out=OUT     Prodcut of PCR product with primers.

Others ...

Output

logs: log file of the multiPrime.py

results: results directory

-cluster.identities: identity of each sequence.

-cluster.txt: cluster information. for example: Cluster_0_222.fa, 0 ==> cluster rank; 222 ==> sequence number.

-history.txt:  history of clusters with rare sequence numbers are compared with others by average nucleotide identity.

-Total_fa: genome file and cluster of genome file.

-Clusters_fa: genome file split by each cluster.
	--*.fa: fasta of each cluster
	--*.tfa: top N {default: 500 randomly selected. Always contain the representative seq} fasta of each cluster
	--*.txt: accession id of each cluster
	--*.db: directory of database (for bowtie2).
	--*.number: number of fasta in each cluster

-Clusters_msa: alginment by muscle
	--*.tmsa: muscle output of the top N {default: 500 randomly selected. Always contain the representative seq}

-Clusters_trim_msa: trimmed alignment by degePrimer
	--*.trim.tmsa: trimmed muscle by degePrimer

-Clusters_primer: get_degePrimer from degePrimer out
	--*.out: paired primers designed by the top N {default: 500} fasta
	--*.gap_seq_id_json: Positions and non-contained sequences caused by gap.
	--*.non_coverage_seq_id_json: Positions and non-contained sequences caused by others.

-Clusters_cprimer: candidate primers for each cluster.
	--*.bed: candidate PCR product (1 mismatch and mismatch position must 4bp away from 3'end at least.)
	--*.fa: candidate primers in fa format
	--*.txt: candidate primers in txt format (1 line)
	--*.Check: tmp file; primers filter by bowtie2 (1 mismatch and mismatch position must 4bp away from 3'end at least.)

-Primers_set:
	--candidate_primers_sets.txt: all candidate primers in each cluster
	--candidate_primers_sets: directory contain all candidate primers in fasta
	--sort.candidate_primers_sets.txt: sorted by the number of candidate primers in each line (cluster)
	--final_maxprimers_set.fa: fasta format of primers set for multiPCR
	--final_maxprimers_set.xls: primers information of primers set
	--final_maxprimers_set.next.xls: primer set 2
	--Coverage_stast.xls: coverage of all primers in primer set (perfect match)
	--final_maxprimers_set.fa.dimer: dimer check by mfePrimer 
	--final_maxprimers_set.fa.hairpin: hairpin check by mfePrimer
	--PCR_product: perfect PCR product of each primer

-Core_primers_set:
	--BWT_coverage: coverage of all primers in core primer set (up to 2-mismatch)
	--core_candidate_primers_sets.txt: core candidate primers in each cluster
	--core_candidate_primers_sets:  directory contain all core candidate primers in fasta
	--sort.core_candidate_primers_sets.txt: sorted by the number of core candidate primers in each line (cluster)
	--core_final_maxprimers_set.fa: fasta format of core primers set for multiPCR
	--core_final_maxprimers_set.xls: primers information of core primers set
	--core_final_maxprimers_set.next.xls: primer set 2
	--core_Coverage_stast.xls: coverage of all primers in core primer set (perfect match)
	--core_candidate_primers_sets.fa: core primer set fasta
	--core_candidate_primers_sets.number: candidate primer number of each core cluster
	--core_final_maxprimers_set.fa.dimer: dimer check by mfePrimer
	--core_final_maxprimers_set.fa.hairpin: hairpin check by mfePrimer
	--core_PCR_product: core PCR product of each primer

Contact

The project was conceptualized and scripted by Junbo Yang.Please send comments, suggestions, bug reports and bug fixes to [email protected] / [email protected].

Todo

This repository is updated frequently. Expect breaking changes. More functions will be improved in the future and a new version is comming.

multiprime's People

Contributors

joybio avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.