
catch's Introduction

CATCH

Compact Aggregation of Targets for Comprehensive Hybridization

CATCH is a Python package for designing probe sets to use for nucleic acid capture of diverse sequence.

  • Comprehensive coverage: CATCH accepts any collection of unaligned sequences — typically whole genomes of all known genetic diversity of one or more microbial species. It designs oligo sequences that guarantee coverage of this diversity, enabling rapid design of exhaustive probe sets for customizable targets.
  • Compact designs: CATCH can design with a specified constraint on the number of oligos (e.g., array size). It searches a space of probe sets, which may pool many species, to find an optimal design. This allows its designs to scale well with known genetic diversity, and also supports cost-effective applications.
  • Flexibility: CATCH supports applications beyond whole genome enrichment, such as differential identification of species. It allows avoiding sequence from the design (e.g., background in microbial enrichment), supports customized models of hybridization, enables weighting the sensitivity for different species, and more.

Setting up CATCH

Python dependencies

CATCH requires Python along with the NumPy and SciPy packages.

CATCH is tested with Python 3.8, 3.9, and 3.10 on Linux (Ubuntu) and macOS. CATCH may also work with older versions of Python, NumPy, and SciPy, as well as with other operating systems, but is not tested with them.

Installing CATCH with pip (or conda), as described below, will install NumPy and SciPy if they are not already installed.

Setting up a conda environment

Note: This section is optional, but may be useful to users who are new to Python.

It is generally helpful to install and run Python packages inside of a virtual environment, especially if you have multiple versions of Python installed or use multiple packages. This can prevent problems when upgrading, conflicts between packages with different requirements, installation issues that arise from having different Python versions available, and more.

One option to manage packages and environments is to use conda. A fast way to obtain conda is to install Miniconda: you can download it here and find installation instructions for it here. For example, on Linux you would run:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Once you have conda, you can create an environment for CATCH with Python 3.8:

conda create -n catch python=3.8

Then, you can activate the catch environment:

conda activate catch

After the environment is created and activated, you can install CATCH as described immediately below or by using the conda package. You will need to activate the environment each time you use CATCH.

Downloading and installing

An easy way to set up CATCH is to clone the repository and install with pip:

git clone https://github.com/broadinstitute/catch.git
cd catch
pip install -e .

If you do not have write permissions in the installation directory, you may need to supply --user to pip install.

Testing

Note: This section is optional and not required for using CATCH.

CATCH uses Python's unittest framework. To execute all tests, run:

python -m unittest discover

Alternative installation approach: conda

CATCH is also available through the conda package manager as part of the bioconda channel. If you use conda, the easiest way to install CATCH is by running:

conda install -c bioconda catch

Using CATCH

There are three ways of using CATCH to consider. They are all related — Options #2 and #3 wrap around the command described in Option #1 — but they can differ considerably in their required computational resources (runtime and memory) and in the design criteria they use. If you want a command that is quick to run and does not require close familiarity with CATCH, we recommend Option #2. If you are willing to read the documentation more carefully and invest the time to design a complex probe set that works well in practice, we recommend Option #1 or Option #3.

Option #1: Cautious defaults and one setting of hybridization criteria

This way of running CATCH uses design.py. To see details on all of the arguments that the program accepts, run:

design.py --help

design.py requires one or more datasets that specify input sequence data to target, as well as a path to which the probe sequences are written:

design.py [dataset] [dataset ...] -o OUTPUT

Each dataset can be one of two input formats:

  • A path to a FASTA file.
  • An NCBI taxonomy ID, for which sequences will be automatically downloaded. This is specified as download:TAXID where TAXID is the taxonomy ID. CATCH will fetch all accessions (representing whole genomes) for this taxonomy and download the sequences. For viruses, NCBI taxonomy IDs can be found via the Taxonomy Browser.

The probe sequences are written to OUTPUT in FASTA format.
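
As a quick sanity check on the output, below is a minimal sketch (standard library only; the helper function and the file name zika-probes.fasta are purely illustrative, not part of CATCH) that loads the designed probes and summarizes them:

def read_fasta(path):
    # Minimal FASTA reader: map each record name to its (possibly line-wrapped) sequence.
    seqs, name = {}, None
    with open(path) as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                name = line[1:]
                seqs[name] = []
            elif name is not None:
                seqs[name].append(line)
    return {n: ''.join(parts) for n, parts in seqs.items()}

probes = read_fasta('zika-probes.fasta')  # illustrative output path
print(len(probes), 'probes; lengths:', sorted({len(s) for s in probes.values()}))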

This program sets cautious hybridization criteria by default (e.g., tolerate no probe-target mismatches) and does not automatically enable options that, in practice, reduce runtime and memory usage. The program was built and described in our publication as a subroutine of CATCH applied to one viral species at a time. Nevertheless, it can be applied to multiple taxa and to non-viral taxa with appropriate settings: if datasets encompass multiple taxa and/or include large genomes, please (i) consider the arguments described below that reduce runtime and memory usage or (ii) consider running CATCH as described in Option #2 or Option #3.

Below are several important arguments to design.py:

  • -pl/--probe-length PROBE_LENGTH and -ps/--probe-stride PROBE_STRIDE: Design probes to be PROBE_LENGTH nt long, and generate candidate probes using a stride of PROBE_STRIDE nt. (Defaults: 100 and 50, respectively.)
  • -m/--mismatches MISMATCHES: Tolerate up to MISMATCHES mismatches when determining whether a probe covers a target sequence. Higher values lead to fewer probes. This value can considerably affect runtime, with lower runtime at smaller values. Also, see -l/--lcf-thres and --island-of-exact-match for adjusting additional hybridization criteria, as described by the output of design.py --help. (Default: 0.)
  • -c/--coverage COVERAGE: Guarantee that at least COVERAGE of each target genome is captured by probes, where COVERAGE is either a fraction of a genome or a number of nucleotides. Higher values lead to more probes. (Default: 1.0 — i.e., whole genome.)
  • -e/--cover-extension COVER_EXTENSION: Assume that a probe will capture both the region of the sequence to which it hybridizes, as well as COVER_EXTENSION nt on each side of that. This parameter is reasonable because library fragments are generally longer than the capture probes, and its value may depend on the library fragment length. Higher values lead to fewer probes, whereas lower values are more stringent in modeling capture. Values of around 50 are commonly used and work well in practice. (Default: 0.)
  • -i/--identify: Design probes to perform differential identification. This is typically used with small values of COVERAGE and >1 specified datasets. Probes are designed such that each dataset should be captured by probes that are unlikely to hybridize to other datasets. Also, see -mt/--mismatches-tolerant, -lt/--lcf-thres-tolerant, and --island-of-exact-match-tolerant, as described by the output of design.py --help.
  • --avoid-genomes dataset [dataset ...]: Design probes to be unlikely to hybridize to any of these datasets. Also, see -mt/--mismatches-tolerant, -lt/--lcf-thres-tolerant, and --island-of-exact-match-tolerant, as described by the output of design.py --help.
  • --add-adapters: Add PCR adapters to the ends of each probe sequence. This selects adapters to add to probe sequences so as to minimize overlap among probes that share an adapter, allowing probes with the same adapter to be amplified together. (See --adapter-a and --adapter-b too, as described by the output of design.py --help.)
  • --custom-hybridization-fn PATH FN: Specify a function, for CATCH to dynamically load, that implements a custom model of hybridization between a probe and target sequence. See design.py --help for details on the expected input and output of this function. If not set, CATCH uses its default model of hybridization based on -m/--mismatches, -l/--lcf-thres, and --island-of-exact-match. Relatedly, see --custom-hybridization-fn-tolerant as described by the output of design.py --help. (A hedged sketch of such a function follows this list.)
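
As a purely illustrative sketch of what --custom-hybridization-fn points to: a small Python module exposing one function that decides whether a probe hybridizes to a stretch of target sequence. The parameter names and the boolean return value below are assumptions made for illustration; design.py --help documents the inputs and outputs CATCH actually expects.

# my_hyb_model.py -- hypothetical module, used as:
#   design.py ... --custom-hybridization-fn my_hyb_model.py hybridizes
# NOTE: the signature and return value here are illustrative assumptions;
# check design.py --help for CATCH's actual interface.
def hybridizes(probe_seq, target_subseq):
    # Toy model: count mismatched bases and allow up to 10% of the probe length.
    mismatches = sum(1 for a, b in zip(probe_seq, target_subseq) if a != b)
    return mismatches <= 0.1 * len(probe_seq)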

Arguments that often lower runtime and memory usage

Several arguments change the design process in a way that can considerably reduce the computational burden needed for design — especially for large and highly diverse inputs. The trade-off of these arguments is an increase in the number of output probes, but this increase is usually small (<10%). The arguments are:

  • --filter-with-lsh-minhash FILTER_WITH_LSH_MINHASH: Use locality-sensitive hashing (LSH) to reduce the space of candidate probes that are considered, by detecting and filtering near-duplicate candidate probes. This can significantly improve runtime and memory requirements when the input is especially large and diverse. FILTER_WITH_LSH_MINHASH should generally be around 0.5 to 0.7; 0.6 is a reasonable choice based on probe-target divergences that are typically desired in practice. In particular, the argument tells CATCH to use LSH with a MinHash family to detect near-duplicates within the set of candidate probes, and FILTER_WITH_LSH_MINHASH gives the maximum Jaccard distance (1 minus Jaccard similarity) at which to call, and subsequently filter, near-duplicates; the similarity between two probes is computed by treating each probe as a set of 10-mers and measuring the Jaccard similarity between the two sets. Its value should accordingly be commensurate with parameter values for determining whether a probe hybridizes to a target sequence (i.e., with CATCH's default hybridization model using -m MISMATCHES, and letting the probe-target divergence D be MISMATCHES divided by PROBE_LENGTH, the value should be, at most, roughly 1 - 1/(2*e^(10*D) - 1); see Ondov et al. 2016 and solve Eq. 4 for 1-j with k=10). A small sketch of this calculation appears after the recommended starting point below. One caveat: when requiring low probe-target divergences (e.g., MISMATCHES of ~0, 1, or 2), using this argument may cause the coverage provided for each genome to be less than desired with --coverage because too many candidate probes are filtered, so the argument should be used with caution or avoided in this case; --print-analysis or --write-analysis-to-tsv reports the resulting coverage. Values of FILTER_WITH_LSH_MINHASH above ~0.7 can require significant memory and runtime that offset the benefits of using this argument, but such values should not be needed in practice.
  • --filter-with-lsh-hamming FILTER_WITH_LSH_HAMMING: Similar to --filter-with-lsh-minhash, except it uses a different technique for locality-sensitive hashing. Namely, the argument tells CATCH to filter near-duplicate candidate probes using LSH with a Hamming distance family. FILTER_WITH_LSH_HAMMING is the maximum Hamming distance between two probes at which to call those probes near-duplicates. With CATCH's default hybridization model using -m MISMATCHES, FILTER_WITH_LSH_HAMMING should be commensurate with, but not greater than, MISMATCHES. We recommend setting the value to MISMATCHES - 1 or MISMATCHES - 2; for example, if MISMATCHES is 5, we recommend setting this value to 3 or 4. It can be more intuitive to determine an appropriate value for this argument than for --filter-with-lsh-minhash. However, the improvement in runtime and memory usage is usually less with this argument than with --filter-with-lsh-minhash because the method for near-duplicate detection used by --filter-with-lsh-minhash is more sensitive; thus, in general, we recommend --filter-with-lsh-minhash for most use cases. The same caveat regarding coverage noted for --filter-with-lsh-minhash also applies to this argument.
  • --cluster-and-design-separately CLUSTER_AND_DESIGN_SEPARATELY: Cluster input sequences prior to design and design probes separately on each cluster, merging the resulting probes to produce the final output; input sequence comparisons are performed rapidly, and in an alignment-free manner, using locality-sensitive hashing. Like the above arguments, this is another option to improve runtime and memory requirements on large and diverse input. CLUSTER_AND_DESIGN_SEPARATELY gives the nucleotide dissimilarity at which to cluster input sequences (i.e., 1-ANI, where ANI is the average nucleotide identity between two sequences). Values must be in (0, 0.5] and generally should be around 0.1 to 0.2 because, for probe-target divergences typically desired in practice, it is reasonable to design probes independently on clusters of sequences determined at this threshold; in general, we recommend using 0.15 (i.e., cluster at 15% nucleotide dissimilarity). In particular, this argument tells CATCH to (1) compute a MinHash signature for each input sequence; (2) cluster the input sequences by comparing their signatures (i.e., the Mash approach); (3) design probes, as usual, separately on each cluster; and (4) merge the resulting probes. The particular clustering method (step (2)) is set by --cluster-and-design-separately-method [choose|simple|hierarchical] (see design.py --help for details on this argument); the default value, choose, uses a heuristic to decide among them. With the simple clustering method, clusters correspond to connected components of a graph in which each vertex is a sequence and two sequences are connected if and only if their nucleotide dissimilarity (estimated by MinHash signatures) is within CLUSTER_AND_DESIGN_SEPARATELY. With the hierarchical clustering method, clusters are computed by agglomerative hierarchical clustering and CLUSTER_AND_DESIGN_SEPARATELY is the maximum inter-cluster distance at which to merge clusters. With both clustering methods, higher values of CLUSTER_AND_DESIGN_SEPARATELY result in fewer, larger clusters of input sequences; consequently, higher values may result in better solutions (i.e., when using CATCH with fixed hybridization criteria, fewer probes for the same coverage) at the expense of greater computational requirements.
  • --cluster-from-fragments CLUSTER_FROM_FRAGMENTS: Break input sequences into fragments of length CLUSTER_FROM_FRAGMENTS nt, and proceed with clustering as described for --cluster-and-design-separately, except cluster these fragments rather than whole input sequences. When using this argument, you must also set --cluster-and-design-separately because this argument tells CATCH to cluster from fragments of the input sequences rather than whole input sequences; see the information on --cluster-and-design-separately, above, for information about clustering. This option can be useful for large genomes (e.g., bacterial genomes) in which probes for different chunks can be designed independently. The fragment length must balance a trade-off between (a) yielding too many fragments (owing to a short fragment length), which would slow clustering and potentially lead to outputs that are suboptimal in terms of the number of probes; and (b) yielding too few fragments (owing to a long fragment length), which negates the benefit of this argument in speeding design on large genomes. In practice, we have found that a fragment length of around 50,000 nt achieves a reasonable balance, i.e., setting CLUSTER_FROM_FRAGMENTS to 50000 is our recommendation in general.

In general, a reasonable starting point for enabling these options is to try: --filter-with-lsh-minhash 0.6 --cluster-and-design-separately 0.15 --cluster-from-fragments 50000. Specific applications may benefit from different values.
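
To make the rule of thumb from --filter-with-lsh-minhash concrete, below is a small sketch (illustration only) that computes the suggested upper bound 1 - 1/(2*e^(10*D) - 1) for a given -m MISMATCHES and -pl PROBE_LENGTH:

import math

def max_lsh_minhash_dist(mismatches, probe_length, k=10):
    # Probe-target divergence D under CATCH's default hybridization model.
    d = mismatches / probe_length
    # Rough upper bound on the Jaccard distance threshold
    # (Ondov et al. 2016, Eq. 4 solved for 1-j, with k=10).
    return 1 - 1 / (2 * math.exp(k * d) - 1)

# e.g., 5 mismatches on 100 nt probes:
print(round(max_lsh_minhash_dist(5, 100), 2))  # ~0.56, within the ~0.5-0.7 range above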

Note that these arguments may slightly increase runtime for small inputs so, when also considering the trade-offs noted above, we do not recommend them for every application. Nevertheless, we recommend trying these arguments for large and diverse input (e.g., large genomes, many genomes, or highly divergent genomes), or in any case when runtime or memory usage presents an obstacle to running CATCH.

See the output of design.py --help for additional details on arguments.

design.py accepts many other arguments that can be useful for particular design applications, to improve resource requirements, or for debugging; design.py --help describes all of them.

Option #2: Pragmatic defaults for large, diverse input and one setting of hybridization criteria

This way of running CATCH uses design_large.py. This program wraps design.py, setting default hybridization criteria that are reasonable in practice and enabling, by default, options that lower runtime and memory usage on large, highly diverse input. The format for specifying inputs and outputs follows the description above.

In particular, design_large.py:

  • Sets default hybridization criteria that have worked well in practice: --mismatches 5 --cover-extension 50
  • Turns on options to lower runtime and memory usage: --filter-with-lsh-minhash 0.6 --cluster-and-design-separately 0.15 --cluster-from-fragments 50000
  • Uses all available CPUs to parallelize computations

The values of these arguments, as well as of all the arguments described above, can be overridden by explicitly specifying the argument.

To see details on all of the arguments that the program accepts, including their default values, as well as more information about input and output, run:

design_large.py --help

Option #3: Variable hybridization criteria across many taxa

This way of using CATCH combines multiple runs of design.py with one run of pool.py. It is most true to how we built CATCH, described it in our publication, and apply it in practice.

design.py and design_large.py require a particular selection of hybridization criteria, but it may not be appropriate to use a single criterion uniformly across all taxa in a multi-taxa probe set. Using pool.py, CATCH can find optimal hybridization settings that are allowed to vary across taxa, under a specified limit on the total number of probes (e.g., synthesis array size). This process accounts for taxa having different degrees of variation, so that CATCH can balance a probe set's composition between the less diverse species, which need few probes, and the more diverse species, which might otherwise dominate the probe set. In other words, more diverse species are allowed lower enrichment efficiencies, in order to reduce the number of probes they require, relative to the less diverse species. Species can also be weighted during this process if there is a greater likelihood of encountering some in a sample than others.

pool.py searches over a space of potential probe sets to solve a constrained optimization problem. To see details on all the arguments that the program accepts, run:

pool.py --help

In practice, the process follows a MapReduce-like model. First, run design.py on each taxon (or other choice of dataset) over a grid of parameter values that spans a reasonable domain. These parameters could encompass the number of mismatches (--mismatches) or cover extension (--cover-extension), or other parameters in a custom hybridization model. Then, create a table that provides a probe count for each taxon and choice of parameters; the table must be a TSV, in a format like this. Now, use this table as input to pool.py:

pool.py INPUT_TSV TARGET_PROBE_COUNT OUTPUT_TSV

where INPUT_TSV is a path to the table described above, TARGET_PROBE_COUNT is a constraint on the number of probes to allow in the pool, and OUTPUT_TSV is a path to a file to which the program will write the optimal parameter values.

Below are two arguments that generalize the search:

  • --loss-coeffs COEFF [COEFF ...]: Specify coefficients on parameters in the objective function. This allows you to adjust how conservatively each parameter is treated relative to the others. (Default: 1 for mismatches and 1/100 for cover extension.)
  • --dataset-weights WEIGHTS_TSV: Assign a weight for each dataset to use in the objective function, where WEIGHTS_TSV is a path to a table that provides a weight for each dataset. This lets you ask that the pooled design be more sensitive for some taxa than for others, for instance if you believe you are more likely to encounter some taxa in a sample. (Default: 1 for all datasets.)

When running design.py, be sure to store the probes output for each choice of parameter values in the search grid. Then, combine the output FASTA files corresponding to the parameter values written to OUTPUT_TSV.

Each run of pool.py may yield a slightly different output based on the (random) initial guess. You could run it multiple times and select the output that has the smallest loss, which is written to standard output at the end of the program.
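
As a rough sketch of the first (map) step described above, here is one way to run design.py over a small parameter grid for a single dataset and tabulate probe counts for pool.py. The grid values, file names, and TSV column names are assumptions made for illustration; see the linked example table for the format pool.py actually expects.

import subprocess

dataset = 'zika.fasta'  # or, e.g., 'download:64320'
rows = []
for m in [1, 2, 3, 4, 5]:
    for e in [0, 25, 50]:
        out = f'probes.zika.m{m}-e{e}.fasta'
        # Run design.py for this choice of parameters (design.py must be on PATH).
        subprocess.run(['design.py', dataset, '-m', str(m), '-e', str(e), '-o', out],
                       check=True)
        # Count the probes in the output FASTA.
        with open(out) as f:
            n = sum(1 for line in f if line.startswith('>'))
        rows.append(('zika', m, e, n))

with open('num-probes.tsv', 'w') as f:
    f.write('dataset\tmismatches\tcover_extension\tnum_probes\n')  # assumed header
    for r in rows:
        f.write('\t'.join(str(x) for x in r) + '\n')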

Examples

Example of running design.py

Below is an example of designing probes to target a single taxon.

design.py download:64320 -pl 75 -m 2 -l 60 -e 50 -o zika-probes.fasta --verbose

This will download whole genomes of Zika virus (NCBI taxonomy ID 64320) and design probes that:

  • are 75 nt long (-pl 75)
  • capture the entirety of each genome under a model in which a probe hybridizes to a region if the longest common substring, with up to 2 mismatches (-m 2), between the probe and target is at least 60 nt (-l 60)
  • assume 50 nt on each side of the hybridization is captured as well (-e 50)

and will save them to zika-probes.fasta.

It will provide detailed output during runtime (--verbose) and yield about 600 probes. Note that using -l 75 here (or, equivalently, leaving out -l) will run significantly faster, but result in more probes. Also, note that the input can be a path to any custom FASTA file.

For large input, please consider notes above regarding options that lower runtime and memory usage.

Example of running design_large.py

Below is an example of designing probes with larger and more diverse input than that provided in the above example:

design_large.py download:64320 download:12637 -o zika-and-dengue-probes.fasta --verbose

This will download whole genomes of Zika virus (NCBI taxonomy ID 64320) and dengue virus (NCBI taxonomy ID 12637). It will then design probes to enrich both species, and will save the probes to zika-and-dengue-probes.fasta.

The command will provide detailed output during runtime (--verbose) and yield about 3,200 probes. It will take about 1 hour to run (with 8 CPUs).

For hybridization criteria, design_large.py will use default values (listed in the output of design_large.py --help), which can be overridden with custom values if desired.

Example of running pool.py

Here is a table listing probe counts used in the design of the V-WAfr probe set (provided here and described in our publication). It provides counts for each dataset (species) and combination of two parameters (mismatches and cover extension) that were allowed to vary in CATCH's design; the table was produced by running design.py for each species over a grid of parameter values. Below is an example of designing the V-WAfr probe set using this table as input:

pool.py num-probes.V-WAfr.201506.tsv 90000 params.V-Wafr.201506.tsv --round-params 1 10

This will search for parameters that yield at most 90,000 probes across the species, and will output those to params.V-Wafr.201506.tsv. Because the search is over a continuous space, here we use --round-params 1 10 to set each value of the mismatches parameter to an integer and each value of the cover extension parameter to a multiple of 10 while still meeting the constraint on probe count. The pooled design yields about 89,950 probes.
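
After pool.py selects parameter values, the probe FASTAs from the corresponding design.py runs need to be combined into the final pool. Below is a rough sketch, assuming the file-naming scheme from the grid sketch in Option #3 and assuming OUTPUT_TSV has per-dataset columns named dataset, mismatches, and cover_extension; check the OUTPUT_TSV actually written by pool.py for its real layout.

import csv

with open('params.V-Wafr.201506.tsv') as params, open('pooled-probes.fasta', 'w') as out:
    for row in csv.DictReader(params, delimiter='\t'):
        # Illustrative path, following the naming used in the earlier grid sketch.
        path = f"probes.{row['dataset']}.m{row['mismatches']}-e{row['cover_extension']}.fasta"
        with open(path) as f:
            out.write(f.read())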

Contributing

We welcome contributions to CATCH. This can be in the form of an issue or pull request. If you have questions, please create an issue or email Hayden Metsky <[email protected]>.

Citation

For details on how CATCH works, please refer to our publication in Nature Biotechnology. If you find CATCH useful to your work, please cite our paper as:

  • Metsky HC and Siddle KJ et al. Capturing sequence diversity in metagenomes with comprehensive and scalable probe design. Nature Biotechnology, 37(2), 160–168 (2019). doi: 10.1038/s41587-018-0006-x

License

CATCH is licensed under the terms of the MIT license.

catch's People

Contributors

dpark01, haydenm, nekoui, yesimon


catch's Issues

clarification of -e and -m flag function

Hello catch developers,

Could you please clarify how the probe stride parameter (-ps) is affected by the coverage extension parameter (-e)? Also, does the mismatch (-m) flag make a distinction between contiguous and non-contiguous mismatches?

Thank you in advance for your time.

Best,
Enrique

build from clustered segments results in probes w/ less than target coverage

I'm running into an issue where Catch generates a probe set with < 1.0x coverage across the input genomes. I narrowed the issue down to the --cluster-and-design-separately and --cluster-from-fragments flags. When I use these flags on the sample input and map the probes back to the inputs with minimap2, I see only 1x coverage on the 2nd half of the inputs. Note: this only happens when I set --cluster-from-fragments and not just --cluster-and-design-separately by itself.

[Screenshot: pileup of probes mapped back to the inputs — top is default, bottom is with cluster and design separately]

To reproduce this issue, I created a small input.fasta containing just two 6-kb Astroviridae sequences (see input.fasta). The issue was originally identified on a larger set of 157 genomes which contained these two sequences.

Commands used to generate probes:

../bin/design.py \
	--verbose \
	--coverage 1.0 \
	--output-probes=input.probes.fasta \
	input.fasta

And generating probes with the clustering flags:

../bin/design.py \
	--verbose \
	--coverage 1.0 \
	--cluster-and-design-separately 0.1 \
	--cluster-from-fragments 10000 \
	--output-probes=input.clustered-probes.fasta \
	input.fasta

Commands used to map probes back to inputs:

# (repeat for clustered probes)
minimap2 \
	-a \
	--sam-hit-only \
	-x sr \
	input.fasta \
	input.probes.fasta \
	| samtools sort \
	> input.probes.bam
samtools index input.probes.bam
samtools mpileup input.probes.bam > input.probes.bam.pileup

Catch logs:

Default Parameters

2020-05-12 16:39:06,859 - catch.utils.seq_io [INFO] Reading fasta file data/input.fasta
2020-05-12 16:39:06,860 - catch.filter.probe_designer [INFO] Building candidate probes from target sequences
2020-05-12 16:39:06,871 - catch.filter.probe_designer [INFO] Starting filter DuplicateFilter
2020-05-12 16:39:06,871 - catch.filter.probe_designer [INFO] Starting filter SetCoverFilter
2020-05-12 16:39:06,871 - catch.filter.set_cover_filter [INFO] Building set cover sets input
2020-05-12 16:39:06,871 - catch.filter.set_cover_filter [INFO] Building map from k-mers to probes
2020-05-12 16:39:06,900 - catch.filter.set_cover_filter [INFO] Computing coverage in grouping 1 (of 1), with target genome 1 (of 2)
2020-05-12 16:39:06,914 - catch.filter.set_cover_filter [INFO] Computing coverage in grouping 1 (of 1), with target genome 2 (of 2)
2020-05-12 16:39:07,020 - catch.filter.set_cover_filter [INFO] Building set cover ranks input
2020-05-12 16:39:07,020 - catch.filter.set_cover_filter [INFO] Building set cover costs input
2020-05-12 16:39:07,020 - catch.filter.set_cover_filter [INFO] Building set cover universe_p input
2020-05-12 16:39:07,020 - catch.filter.set_cover_filter [INFO] Building universe_p directly from desired fractional coverage
2020-05-12 16:39:07,020 - catch.filter.set_cover_filter [INFO] Approximating the solution to a single set cover instance across all groupings
2020-05-12 16:39:07,021 - catch.utils.set_cover [INFO] Selected 0 sets with a total of 13050 elements remaining to be covered
2020-05-12 16:39:07,025 - catch.utils.set_cover [INFO] Selected 10 sets with a total of 12050 elements remaining to be covered
2020-05-12 16:39:07,026 - catch.utils.set_cover [INFO] Selected 20 sets with a total of 11050 elements remaining to be covered
2020-05-12 16:39:07,027 - catch.utils.set_cover [INFO] Selected 30 sets with a total of 10050 elements remaining to be covered
2020-05-12 16:39:07,028 - catch.utils.set_cover [INFO] Selected 40 sets with a total of 9050 elements remaining to be covered
2020-05-12 16:39:07,030 - catch.utils.set_cover [INFO] Selected 50 sets with a total of 8050 elements remaining to be covered
2020-05-12 16:39:07,031 - catch.utils.set_cover [INFO] Selected 60 sets with a total of 7050 elements remaining to be covered
2020-05-12 16:39:07,032 - catch.utils.set_cover [INFO] Selected 70 sets with a total of 6050 elements remaining to be covered
2020-05-12 16:39:07,034 - catch.utils.set_cover [INFO] Selected 80 sets with a total of 5050 elements remaining to be covered
2020-05-12 16:39:07,035 - catch.utils.set_cover [INFO] Selected 90 sets with a total of 4050 elements remaining to be covered
2020-05-12 16:39:07,037 - catch.utils.set_cover [INFO] Selected 100 sets with a total of 3050 elements remaining to be covered
2020-05-12 16:39:07,039 - catch.utils.set_cover [INFO] Selected 110 sets with a total of 2050 elements remaining to be covered
2020-05-12 16:39:07,040 - catch.utils.set_cover [INFO] Selected 120 sets with a total of 1050 elements remaining to be covered
2020-05-12 16:39:07,042 - catch.utils.set_cover [INFO] Selected 130 sets with a total of 50 elements remaining to be covered
132

Clustered

2020-05-12 16:39:07,420 - catch.utils.seq_io [INFO] Reading fasta file data/input.fasta
2020-05-12 16:39:07,421 - catch.filter.probe_designer [INFO] Clustering 2 sequences using MinHash signatures, at an average nucleotide dissimilarity threshold of 0.100000
2020-05-12 16:39:07,421 - catch.utils.cluster [INFO] Producing signatures of 2 sequences
2020-05-12 16:39:07,427 - catch.utils.cluster [INFO] Creating condensed distance matrix of 2 sequences
2020-05-12 16:39:07,451 - catch.utils.cluster [INFO] Clustering 2 sequences at Jaccard distance threshold of 0.822702
2020-05-12 16:39:07,456 - catch.filter.probe_designer [INFO] Found 2 clusters with sizes: [1, 1]
2020-05-12 16:39:07,456 - catch.filter.probe_designer [INFO] Building candidate probes from target sequences
2020-05-12 16:39:07,460 - catch.filter.probe_designer [INFO] Starting filter DuplicateFilter
2020-05-12 16:39:07,460 - catch.filter.probe_designer [INFO] Starting filter SetCoverFilter
2020-05-12 16:39:07,460 - catch.filter.set_cover_filter [INFO] Building set cover sets input
2020-05-12 16:39:07,460 - catch.filter.set_cover_filter [INFO] Building map from k-mers to probes
2020-05-12 16:39:07,468 - catch.filter.set_cover_filter [INFO] Computing coverage in grouping 1 (of 1), with target genome 1 (of 1)
2020-05-12 16:39:07,590 - catch.filter.set_cover_filter [INFO] Building set cover ranks input
2020-05-12 16:39:07,591 - catch.filter.set_cover_filter [INFO] Building set cover costs input
2020-05-12 16:39:07,591 - catch.filter.set_cover_filter [INFO] Building set cover universe_p input
2020-05-12 16:39:07,591 - catch.filter.set_cover_filter [INFO] Building universe_p directly from desired fractional coverage
2020-05-12 16:39:07,591 - catch.filter.set_cover_filter [INFO] Approximating the solution to a single set cover instance across all groupings
2020-05-12 16:39:07,591 - catch.utils.set_cover [INFO] Selected 0 sets with a total of 3390 elements remaining to be covered
2020-05-12 16:39:07,593 - catch.utils.set_cover [INFO] Selected 10 sets with a total of 2390 elements remaining to be covered
2020-05-12 16:39:07,594 - catch.utils.set_cover [INFO] Selected 20 sets with a total of 1390 elements remaining to be covered
2020-05-12 16:39:07,594 - catch.utils.set_cover [INFO] Selected 30 sets with a total of 390 elements remaining to be covered
2020-05-12 16:39:07,595 - catch.filter.probe_designer [INFO] Building candidate probes from target sequences
2020-05-12 16:39:07,597 - catch.filter.probe_designer [INFO] Starting filter DuplicateFilter
2020-05-12 16:39:07,598 - catch.filter.probe_designer [INFO] Starting filter SetCoverFilter
2020-05-12 16:39:07,598 - catch.filter.set_cover_filter [INFO] Building set cover sets input
2020-05-12 16:39:07,598 - catch.filter.set_cover_filter [INFO] Building map from k-mers to probes
2020-05-12 16:39:07,605 - catch.filter.set_cover_filter [INFO] Computing coverage in grouping 1 (of 1), with target genome 1 (of 1)
2020-05-12 16:39:07,728 - catch.filter.set_cover_filter [INFO] Building set cover ranks input
2020-05-12 16:39:07,728 - catch.filter.set_cover_filter [INFO] Building set cover costs input
2020-05-12 16:39:07,728 - catch.filter.set_cover_filter [INFO] Building set cover universe_p input
2020-05-12 16:39:07,728 - catch.filter.set_cover_filter [INFO] Building universe_p directly from desired fractional coverage
2020-05-12 16:39:07,728 - catch.filter.set_cover_filter [INFO] Approximating the solution to a single set cover instance across all groupings
2020-05-12 16:39:07,728 - catch.utils.set_cover [INFO] Selected 0 sets with a total of 3560 elements remaining to be covered
2020-05-12 16:39:07,730 - catch.utils.set_cover [INFO] Selected 10 sets with a total of 2560 elements remaining to be covered
2020-05-12 16:39:07,731 - catch.utils.set_cover [INFO] Selected 20 sets with a total of 1560 elements remaining to be covered
2020-05-12 16:39:07,732 - catch.utils.set_cover [INFO] Selected 30 sets with a total of 560 elements remaining to be covered
70

I'm not sure what could be causing this behavior. The cluster and design separately flag should have no effect as the genomes are under the fragment size for splitting.

Please let me know if you need any other information.

Replicability Zika probes Metsky et al Nature Biotechnology

Dear Hayden,

I am quite interested in both CATCH and your paper in Nature Biotech (https://www.nature.com/articles/s41587-018-0006-x#MOESM3) as a method to define custom probes for viruses. CATCH is exactly the tool we were looking for!

Our first tryouts were with HHV-5, but as we were unsuccessful in obtaining whole-genome probe coverage (10-20 probes depending on the parameters), we decided to replicate the Zika probe design offered on GitHub, using the parameters from your paper in NB.

Your recommended parameters there are


Name: zika
Taxonomic lineage(s) included: Flaviviridae, Flavivirus, Zika virus
Input sequences: https://github.com/broadinstitute/catch/blob/24bf7d4ea924c1afbaa02f59c9a8c4c7051b1a7b/hybseldesign/datasets/data/zika.fasta
mismatches: 2
cover-extension: 10
probe-length: 75
probe-stride: 75
lcf-thres: 75
island-of-exact-match: 30
Include reverse complement probes: No
Add adapters: No
Expand Ns: Yes
Number of probes: 1081

So we ran: design.py download:64320 -pl 75 -ps 75 --lcf-thres 75 --island-of-exact-match 30 -m 2 -e 10 --max-num-process 5 --expand-n -o zika-probes.fasta --verbose

This yielded 4898 probes, rather than the 1081 probes listed in your paper. Even if more Zika genomes are available now, we don't think that the number of probes should change this much. Could you guide us to fully replicate it?

Additionally, we mapped the probes back to a reference for Zika (KY415991.1) and we realized that the tiling is quite uneven. Just to confirm, is this the behaviour we should expect? Does this tiling come from taking into account the diversity in the sample?

Looking forward to hearing from you soon.

Joan

catch_zika_test.pdf

Cluster input sequences by producing signatures of them

Sometimes when the input data to design.py includes large numbers of divergent sequences (e.g., all sequences for all eight segments of Influenza A virus), solving the set cover instance (here) can be slow. One solution is to run design.py separately on clusters that can be easily identified (e.g., each run contains all sequences for one segment of Influenza A virus), and then pool the results, either using the same choice of parameter values or via a search.

In other cases, a user may want to input all divergent sequences to design.py. For cases like this, there is no need to solve a single set cover instance across all sequences. Instead, we can cluster the input sequences and solve a separate instance on each cluster, and then pool the resulting probes. This may slightly raise the size of the final output (if there is some homology and shared probe sequences between the clusters), but could dramatically improve runtime.

Ideally the clustering should be alignment-free. One option is to produce a signature (or "sketch") of each input sequence using MinHash (as done in Mash) or HyperLogLog (as done in Dashing). Then, the sequences can be clustered using these signatures.

Relatedly, an option to reduce runtime on long genomes (even in cases where the input sequences are not completely divergent) is to chop the sequences into fragments, cluster these fragments, and solve a separate instance on each cluster.

Error using design.py

Hi, I got an error message when running 'design.py download:64320 -pl 75 -m 2 -l 60 -e 50 -o zika-probes.fasta --verbose' (the example command):
2022-10-27 10:58:02,644 - catch.utils.ncbi_neighbors [INFO] Creating a FASTA file for taxid 64320
2022-10-27 10:58:02,645 - catch.utils.ncbi_neighbors [INFO] Constructing a list of neighbors for taxid 64320
2022-10-27 10:58:09,090 - catch.utils.ncbi_neighbors [INFO] There are 1602 neighbors, 802 of which have unique accessions
2022-10-27 10:58:38,846 - catch.utils.seq_io [INFO] Reading fasta file /var/folders/gx/_k3jcq4x2pzby696p4t0wxqc0000gq/T/tmp69wuu5st
2022-10-27 10:58:39,270 - catch.filter.probe_designer [INFO] Building candidate probes from target sequences
2022-10-27 10:58:48,822 - catch.filter.probe_designer [INFO] Starting filter DuplicateFilter
2022-10-27 10:59:05,183 - catch.filter.probe_designer [INFO] Starting filter SetCoverFilter
2022-10-27 10:59:05,184 - catch.filter.set_cover_filter [INFO] Building set cover inputs for 1 groups
2022-10-27 10:59:05,185 - catch.filter.set_cover_filter [INFO] Building set cover sets input (group 1 of 1)
2022-10-27 10:59:05,186 - catch.filter.set_cover_filter [INFO] Building map from k-mers to probes
2022-10-27 10:59:13,004 - catch.filter.set_cover_filter [INFO] Computing coverage in target genome 1 (of 802)
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/labpc/miniconda3/envs/comenv/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/labpc/miniconda3/envs/comenv/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/Users/labpc/miniconda3/envs/comenv/lib/python3.10/site-packages/catch/probe.py", line 1050, in _find_probe_covers_in_subsequence
if _pfp_kmer_probe_map_use_native:
NameError: name '_pfp_kmer_probe_map_use_native' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/labpc/miniconda3/envs/comenv/bin/design.py", line 971, in
main(args)
File "/Users/labpc/miniconda3/envs/comenv/bin/design.py", line 423, in main
pb.design()
File "/Users/labpc/miniconda3/envs/comenv/lib/python3.10/site-packages/catch/filter/probe_designer.py", line 236, in design
candidates, probes = self._design_for_genomes(self.genomes,
File "/Users/labpc/miniconda3/envs/comenv/lib/python3.10/site-packages/catch/filter/probe_designer.py", line 223, in _design_for_genomes
probes = self._pass_through_filters(candidates, genomes, filters)
File "/Users/labpc/miniconda3/envs/comenv/lib/python3.10/site-packages/catch/filter/probe_designer.py", line 159, in _pass_through_filters
probes = f.filter(probes, genomes, input_is_grouped=True)
File "/Users/labpc/miniconda3/envs/comenv/lib/python3.10/site-packages/catch/filter/base_filter.py", line 103, in filter
return self._filter(input, target_genomes)
File "/Users/labpc/miniconda3/envs/comenv/lib/python3.10/site-packages/catch/filter/set_cover_filter.py", line 902, in _filter
input_paths = self._construct_and_pickle_set_cover_input(
File "/Users/labpc/miniconda3/envs/comenv/lib/python3.10/site-packages/catch/filter/set_cover_filter.py", line 812, in _construct_and_pickle_set_cover_input
sets = self._make_sets(possible_probes, target_genomes)
File "/Users/labpc/miniconda3/envs/comenv/lib/python3.10/site-packages/catch/filter/set_cover_filter.py", line 408, in _make_sets
probe_cover_ranges = probe.find_probe_covers_in_sequence(
File "/Users/labpc/miniconda3/envs/comenv/lib/python3.10/site-packages/catch/probe.py", line 1248, in find_probe_covers_in_sequence
all_subseq_probe_cover_ranges = _pfp_pool.map(scan_subsequence,
File "/Users/labpc/miniconda3/envs/comenv/lib/python3.10/multiprocessing/pool.py", line 367, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/Users/labpc/miniconda3/envs/comenv/lib/python3.10/multiprocessing/pool.py", line 774, in get
raise self._value
NameError: name '_pfp_kmer_probe_map_use_native' is not defined

Any ideas on solving the problem?

Kind regards,
Brian

Error running design.py

I get the following error while trying to run design.py from within the catch directory:

Traceback (most recent call last):
  File "/home/janani/.local/bin/design.py", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/home/janani/bin/tools/catch/bin/design.py", line 12, in <module>
    from catch.datasets import hg19
  File "/home/janani/bin/tools/catch/catch/datasets/hg19.py", line 15, in <module>
    ds = GenomesDatasetMultiChrom(__name__, __file__, __spec__,
NameError: name '__spec__' is not defined

Running design.py --help also returns

Traceback (most recent call last):
  File "/home/janani/.local/bin/design.py", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/home/janani/bin/tools/catch/bin/design.py", line 12, in <module>
    from catch.datasets import hg19
  File "/home/janani/bin/tools/catch/catch/datasets/hg19.py", line 15, in <module>
    ds = GenomesDatasetMultiChrom(__name__, __file__, __spec__,
NameError: name '__spec__' is not defined

In case it helps, this was the error message after testing:

======================================================================
ERROR: test_standard_search_vwafr_with_dataset_weights (catch.pool.tests.test_param_search.TestSearchFunctions)
Integration test with the V-Wafr probe set data.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/pool/tests/test_param_search.py", line 266, in test_standard_search_vwafr_with_dataset_weights
    set(['hiv1_without_ltr', 'hepatitis_c'])):
TypeError: unsupported operand type(s) for -: 'list' and 'set'

======================================================================
ERROR: catch.filter.tests.test_adapter_filter (unittest.loader.ModuleImportFailure)
----------------------------------------------------------------------
ImportError: Failed to import test module: catch.filter.tests.test_adapter_filter
Traceback (most recent call last):
  File "/home/janani/anaconda3/lib/python2.7/unittest/loader.py", line 254, in _find_tests
    module = self._get_module_from_name(name)
  File "/home/janani/anaconda3/lib/python2.7/unittest/loader.py", line 232, in _get_module_from_name
    __import__(name)
  File "catch/filter/tests/test_adapter_filter.py", line 8, in <module>
    from catch.filter import candidate_probes as cp
  File "catch/filter/candidate_probes.py", line 13, in <module>
    from catch.utils import seq_io
  File "catch/utils/seq_io.py", line 223
    yield from process(f)
             ^
SyntaxError: invalid syntax


======================================================================
ERROR: test_basic (catch.filter.tests.test_reverse_complement_filter.TestReverseComplementFilter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_reverse_complement_filter.py", line 26, in test_basic
    self.assertCountEqual(f.input_probes, input_probes)
AttributeError: 'TestReverseComplementFilter' object has no attribute 'assertCountEqual'

======================================================================
ERROR: catch.filter.tests.test_probe_designer (unittest.loader.ModuleImportFailure)
----------------------------------------------------------------------
ImportError: Failed to import test module: catch.filter.tests.test_probe_designer
Traceback (most recent call last):
  File "/home/janani/anaconda3/lib/python2.7/unittest/loader.py", line 254, in _find_tests
    module = self._get_module_from_name(name)
  File "/home/janani/anaconda3/lib/python2.7/unittest/loader.py", line 232, in _get_module_from_name
    __import__(name)
  File "catch/filter/tests/test_probe_designer.py", line 8, in <module>
    from catch.filter import probe_designer
  File "catch/filter/probe_designer.py", line 6, in <module>
    from catch.filter import candidate_probes
  File "catch/filter/candidate_probes.py", line 13, in <module>
    from catch.utils import seq_io
  File "catch/utils/seq_io.py", line 223
    yield from process(f)
             ^
SyntaxError: invalid syntax


======================================================================
ERROR: test_no_shift_no_mismatch (catch.filter.tests.test_naive_redundant_filter.TestNaiveRedundantFilterShiftAndMismatchCount)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_naive_redundant_filter.py", line 40, in test_no_shift_no_mismatch
    self.compare_input_with_desired_output(0, 0, input, desired_output)
  File "catch/filter/tests/test_naive_redundant_filter.py", line 31, in compare_input_with_desired_output
    self.assertCountEqual(f.input_probes, input_probes)
AttributeError: 'TestNaiveRedundantFilterShiftAndMismatchCount' object has no attribute 'assertCountEqual'

======================================================================
ERROR: test_no_shift_one_mismatch (catch.filter.tests.test_naive_redundant_filter.TestNaiveRedundantFilterShiftAndMismatchCount)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_naive_redundant_filter.py", line 46, in test_no_shift_one_mismatch
    self.compare_input_with_desired_output(0, 1, input, desired_output)
  File "catch/filter/tests/test_naive_redundant_filter.py", line 31, in compare_input_with_desired_output
    self.assertCountEqual(f.input_probes, input_probes)
AttributeError: 'TestNaiveRedundantFilterShiftAndMismatchCount' object has no attribute 'assertCountEqual'

======================================================================
ERROR: test_one_shift_one_mismatch (catch.filter.tests.test_naive_redundant_filter.TestNaiveRedundantFilterShiftAndMismatchCount)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_naive_redundant_filter.py", line 52, in test_one_shift_one_mismatch
    self.compare_input_with_desired_output(1, 1, input, desired_output)
  File "catch/filter/tests/test_naive_redundant_filter.py", line 31, in compare_input_with_desired_output
    self.assertCountEqual(f.input_probes, input_probes)
AttributeError: 'TestNaiveRedundantFilterShiftAndMismatchCount' object has no attribute 'assertCountEqual'

======================================================================
ERROR: test_basic (catch.filter.tests.test_duplicate_filter.TestDuplicateFilter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_duplicate_filter.py", line 26, in test_basic
    self.assertCountEqual(f.input_probes, input_probes)
AttributeError: 'TestDuplicateFilter' object has no attribute 'assertCountEqual'

======================================================================
ERROR: catch.filter.tests.test_fasta_filter (unittest.loader.ModuleImportFailure)
----------------------------------------------------------------------
ImportError: Failed to import test module: catch.filter.tests.test_fasta_filter
Traceback (most recent call last):
  File "/home/janani/anaconda3/lib/python2.7/unittest/loader.py", line 254, in _find_tests
    module = self._get_module_from_name(name)
  File "/home/janani/anaconda3/lib/python2.7/unittest/loader.py", line 232, in _get_module_from_name
    __import__(name)
  File "catch/filter/tests/test_fasta_filter.py", line 8, in <module>
    from catch.filter import fasta_filter as ff
  File "catch/filter/fasta_filter.py", line 14, in <module>
    from catch.utils import seq_io
  File "catch/utils/seq_io.py", line 223
    yield from process(f)
             ^
SyntaxError: invalid syntax


======================================================================
ERROR: test_all_similar_but_one_too_far (catch.filter.tests.test_near_duplicate_filter.TestNearDuplicateFilterWithHammingDistance)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_near_duplicate_filter.py", line 57, in test_all_similar_but_one_too_far
    f = ndf.NearDuplicateFilterWithHammingDistance(2, 10)
  File "catch/filter/near_duplicate_filter.py", line 122, in __init__
    super().__init__(k=20)
TypeError: super() takes at least 1 argument (0 given)

======================================================================
ERROR: test_all_similar_but_zero_dist_thres (catch.filter.tests.test_near_duplicate_filter.TestNearDuplicateFilterWithHammingDistance)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_near_duplicate_filter.py", line 48, in test_all_similar_but_zero_dist_thres
    f = ndf.NearDuplicateFilterWithHammingDistance(0, 10)
  File "catch/filter/near_duplicate_filter.py", line 122, in __init__
    super().__init__(k=20)
TypeError: super() takes at least 1 argument (0 given)

======================================================================
ERROR: test_all_similar_no_exact_dup (catch.filter.tests.test_near_duplicate_filter.TestNearDuplicateFilterWithHammingDistance)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_near_duplicate_filter.py", line 25, in test_all_similar_no_exact_dup
    f = ndf.NearDuplicateFilterWithHammingDistance(2, 10)
  File "catch/filter/near_duplicate_filter.py", line 122, in __init__
    super().__init__(k=20)
TypeError: super() takes at least 1 argument (0 given)

======================================================================
ERROR: test_all_similar_with_exact_dup (catch.filter.tests.test_near_duplicate_filter.TestNearDuplicateFilterWithHammingDistance)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_near_duplicate_filter.py", line 36, in test_all_similar_with_exact_dup
    f = ndf.NearDuplicateFilterWithHammingDistance(2, 10)
  File "catch/filter/near_duplicate_filter.py", line 122, in __init__
    super().__init__(k=20)
TypeError: super() takes at least 1 argument (0 given)

======================================================================
ERROR: test_two_clusters (catch.filter.tests.test_near_duplicate_filter.TestNearDuplicateFilterWithHammingDistance)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_near_duplicate_filter.py", line 72, in test_two_clusters
    f = ndf.NearDuplicateFilterWithHammingDistance(2, 10)
  File "catch/filter/near_duplicate_filter.py", line 122, in __init__
    super().__init__(k=20)
TypeError: super() takes at least 1 argument (0 given)

======================================================================
ERROR: test_all_similar_but_one_too_far (catch.filter.tests.test_near_duplicate_filter.TestNearDuplicateFilterWithMinHash)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_near_duplicate_filter.py", line 126, in test_all_similar_but_one_too_far
    f = ndf.NearDuplicateFilterWithMinHash(0.8, 3)
  File "catch/filter/near_duplicate_filter.py", line 158, in __init__
    super().__init__(k=3)
TypeError: super() takes at least 1 argument (0 given)

======================================================================
ERROR: test_all_similar_but_zero_dist_thres (catch.filter.tests.test_near_duplicate_filter.TestNearDuplicateFilterWithMinHash)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_near_duplicate_filter.py", line 117, in test_all_similar_but_zero_dist_thres
    f = ndf.NearDuplicateFilterWithMinHash(0, 3)
  File "catch/filter/near_duplicate_filter.py", line 158, in __init__
    super().__init__(k=3)
TypeError: super() takes at least 1 argument (0 given)

======================================================================
ERROR: test_all_similar_no_exact_dup (catch.filter.tests.test_near_duplicate_filter.TestNearDuplicateFilterWithMinHash)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_near_duplicate_filter.py", line 94, in test_all_similar_no_exact_dup
    f = ndf.NearDuplicateFilterWithMinHash(0.8, 3)
  File "catch/filter/near_duplicate_filter.py", line 158, in __init__
    super().__init__(k=3)
TypeError: super() takes at least 1 argument (0 given)

======================================================================
ERROR: test_all_similar_with_exact_dup (catch.filter.tests.test_near_duplicate_filter.TestNearDuplicateFilterWithMinHash)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_near_duplicate_filter.py", line 105, in test_all_similar_with_exact_dup
    f = ndf.NearDuplicateFilterWithMinHash(0.8, 3)
  File "catch/filter/near_duplicate_filter.py", line 158, in __init__
    super().__init__(k=3)
TypeError: super() takes at least 1 argument (0 given)

======================================================================
ERROR: test_two_clusters (catch.filter.tests.test_near_duplicate_filter.TestNearDuplicateFilterWithMinHash)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_near_duplicate_filter.py", line 140, in test_two_clusters
    f = ndf.NearDuplicateFilterWithMinHash(0.8, 3)
  File "catch/filter/near_duplicate_filter.py", line 158, in __init__
    super().__init__(k=3)
TypeError: super() takes at least 1 argument (0 given)

======================================================================
ERROR: catch.filter.tests.test_candidate_probes (unittest.loader.ModuleImportFailure)
----------------------------------------------------------------------
ImportError: Failed to import test module: catch.filter.tests.test_candidate_probes
Traceback (most recent call last):
  File "/home/janani/anaconda3/lib/python2.7/unittest/loader.py", line 254, in _find_tests
    module = self._get_module_from_name(name)
  File "/home/janani/anaconda3/lib/python2.7/unittest/loader.py", line 232, in _get_module_from_name
    __import__(name)
  File "catch/filter/tests/test_candidate_probes.py", line 6, in <module>
    from catch.datasets import ebola_zaire_with_2014
  File "catch/datasets/ebola_zaire_with_2014.py", line 15, in <module>
    ds = GenomesDatasetSingleChrom(__name__, __file__, __spec__)
NameError: name '__spec__' is not defined


======================================================================
ERROR: test_one_shift_one_mismatch (catch.filter.tests.test_dominating_set_filter.TestDominatingSetFilter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_dominating_set_filter.py", line 36, in test_one_shift_one_mismatch
    self.compare_input_with_desired_output(1, 1, input, desired_output)
  File "catch/filter/tests/test_dominating_set_filter.py", line 30, in compare_input_with_desired_output
    self.assertCountEqual(f.input_probes, input_probes)
AttributeError: 'TestDominatingSetFilter' object has no attribute 'assertCountEqual'

======================================================================
ERROR: catch.filter.tests.test_set_cover_filter (unittest.loader.ModuleImportFailure)
----------------------------------------------------------------------
ImportError: Failed to import test module: catch.filter.tests.test_set_cover_filter
Traceback (most recent call last):
  File "/home/janani/anaconda3/lib/python2.7/unittest/loader.py", line 254, in _find_tests
    module = self._get_module_from_name(name)
  File "/home/janani/anaconda3/lib/python2.7/unittest/loader.py", line 232, in _get_module_from_name
    __import__(name)
  File "catch/filter/tests/test_set_cover_filter.py", line 10, in <module>
    from catch.filter import set_cover_filter as scf
  File "catch/filter/set_cover_filter.py", line 54, in <module>
    from catch.utils import seq_io
  File "catch/utils/seq_io.py", line 223
    yield from process(f)
             ^
SyntaxError: invalid syntax


======================================================================
ERROR: test_load_function (catch.utils.tests.test_dynamic_load.TestLoadFunction)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/utils/tests/test_dynamic_load.py", line 46, in test_load_function
    sum_fn = dynamic_load.load_function_from_path(self.module_path, 'sum')
  File "catch/utils/dynamic_load.py", line 49, in load_function_from_path
    module = load_module_from_path(path)
  File "catch/utils/dynamic_load.py", line 26, in load_module_from_path
    spec = importlib.util.spec_from_file_location(module_name, path)
AttributeError: 'module' object has no attribute 'util'

======================================================================
ERROR: test_load_module (catch.utils.tests.test_dynamic_load.TestLoadModule)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/utils/tests/test_dynamic_load.py", line 24, in test_load_module
    module = dynamic_load.load_module_from_path(self.module_path)
  File "catch/utils/dynamic_load.py", line 26, in load_module_from_path
    spec = importlib.util.spec_from_file_location(module_name, path)
AttributeError: 'module' object has no attribute 'util'

======================================================================
ERROR: test_varied_k (catch.utils.tests.test_lsh.TestHammingNearNeighborLookup)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/utils/tests/test_lsh.py", line 186, in test_varied_k
    self.assertCountEqual(nnl.query(a), {a, b, c})
AttributeError: 'TestHammingNearNeighborLookup' object has no attribute 'assertCountEqual'

======================================================================
ERROR: test_varied_k (catch.utils.tests.test_lsh.TestMinHashNearNeighborLookup)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/utils/tests/test_lsh.py", line 227, in test_varied_k
    self.assertCountEqual(nnl.query(a), {a, b, c})
AttributeError: 'TestMinHashNearNeighborLookup' object has no attribute 'assertCountEqual'

======================================================================
ERROR: catch.utils.tests.test_seq_io (unittest.loader.ModuleImportFailure)
----------------------------------------------------------------------
ImportError: Failed to import test module: catch.utils.tests.test_seq_io
Traceback (most recent call last):
  File "/home/janani/anaconda3/lib/python2.7/unittest/loader.py", line 254, in _find_tests
    module = self._get_module_from_name(name)
  File "/home/janani/anaconda3/lib/python2.7/unittest/loader.py", line 232, in _get_module_from_name
    __import__(name)
  File "catch/utils/tests/test_seq_io.py", line 9, in <module>
    from catch.datasets import ebola_zaire_with_2014
  File "catch/datasets/ebola_zaire_with_2014.py", line 15, in <module>
    ds = GenomesDatasetSingleChrom(__name__, __file__, __spec__)
NameError: name '__spec__' is not defined


======================================================================
ERROR: test_probe_cover_ranges (catch.tests.test_coverage_analysis.TestAnalyzerCoversWithCoverExtension)
Test the probe cover ranges that are found.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_coverage_analysis.py", line 255, in test_probe_cover_ranges
    self.assertCountEqual(self.analyzer.target_covers[0][0][False],
AttributeError: 'TestAnalyzerCoversWithCoverExtension' object has no attribute 'assertCountEqual'

======================================================================
ERROR: test_probe_cover_ranges (catch.tests.test_coverage_analysis.TestAnalyzerCoversWithoutReverseComplement)
Test the probe cover ranges that are found.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_coverage_analysis.py", line 300, in test_probe_cover_ranges
    self.assertCountEqual(self.analyzer.target_covers[0][0][False],
AttributeError: 'TestAnalyzerCoversWithoutReverseComplement' object has no attribute 'assertCountEqual'

======================================================================
ERROR: test_probe_cover_ranges (catch.tests.test_coverage_analysis.TestAnalyzerWithTwoTargetGenomes)
Test the probe cover ranges that are found.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_coverage_analysis.py", line 49, in test_probe_cover_ranges
    self.assertCountEqual(self.analyzer.target_covers[0][0][False],
AttributeError: 'TestAnalyzerWithTwoTargetGenomes' object has no attribute 'assertCountEqual'

======================================================================
ERROR: test_one_mismatch (catch.tests.test_probe.TestConstructPigeonholedKmerProbeMap)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 304, in test_one_mismatch
    self.assertCountEqual(kmer_map['ABCDE'], [a])
AttributeError: 'TestConstructPigeonholedKmerProbeMap' object has no attribute 'assertCountEqual'

======================================================================
ERROR: test_positions (catch.tests.test_probe.TestConstructPigeonholedKmerProbeMap)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 329, in test_positions
    self.assertCountEqual(kmer_map['AB'], [(a, 0), (b, 6)])
AttributeError: 'TestConstructPigeonholedKmerProbeMap' object has no attribute 'assertCountEqual'

======================================================================
ERROR: test_shared_kmer (catch.tests.test_probe.TestConstructPigeonholedKmerProbeMap)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 317, in test_shared_kmer
    self.assertCountEqual(kmer_map['ABCDE'], [a, b])
AttributeError: 'TestConstructPigeonholedKmerProbeMap' object has no attribute 'assertCountEqual'

======================================================================
ERROR: test_positions (catch.tests.test_probe.TestConstructRandKmerProbeMap)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 251, in test_positions
    self.assertCountEqual(kmer_map['DEF'], [(a, 3), (b, 3)])
AttributeError: 'TestConstructRandKmerProbeMap' object has no attribute 'assertCountEqual'

======================================================================
ERROR: test_island_with_exact_match1 (catch.tests.test_probe.TestFindProbeCoversInSequence)
Tests the 'island_with_exact_match' argument for
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 629, in test_island_with_exact_match1
    self.assertCountEqual(found[a], [(2, 8), (16, 22)])
AttributeError: 'TestFindProbeCoversInSequence' object has no attribute 'assertCountEqual'

======================================================================
ERROR: test_island_with_exact_match2 (catch.tests.test_probe.TestFindProbeCoversInSequence)
Tests the 'island_with_exact_match' argument for
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 653, in test_island_with_exact_match2
    probe.open_probe_finding_pool(kmer_map, fn, n_workers)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_more_than_cover (catch.tests.test_probe.TestFindProbeCoversInSequence)
Tests with short sequence and short probes
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 580, in test_more_than_cover
    probe.open_probe_finding_pool(kmer_map, f, n_workers)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_multiple_searches_with_same_pool (catch.tests.test_probe.TestFindProbeCoversInSequence)
Tests more than one call to find_probe_covers_in_sequence()
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 719, in test_multiple_searches_with_same_pool
    probe.open_probe_finding_pool(kmer_map, f, n_workers)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_one_or_no_occurrence (catch.tests.test_probe.TestFindProbeCoversInSequence)
Tests with short sequence and short probes
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 534, in test_one_or_no_occurrence
    probe.open_probe_finding_pool(kmer_map, f, n_workers)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_open_close_pool_without_work (catch.tests.test_probe.TestFindProbeCoversInSequence)
Tests opening a probe finding pool and closing it without doing
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 740, in test_open_close_pool_without_work
    probe.open_probe_finding_pool(kmer_map, f, n_workers)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_pigeonhole_with_mismatch (catch.tests.test_probe.TestFindProbeCoversInSequence)
Tests with short sequence and short probes
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 679, in test_pigeonhole_with_mismatch
    probe.open_probe_finding_pool(kmer_map, f, n_workers)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_random_large_genome1 (catch.tests.test_probe.TestFindProbeCoversInSequence)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 759, in test_random_large_genome1
    lcf_thres=100, seed=1)
  File "catch/tests/test_probe.py", line 860, in run_random
    use_native_dict=use_native_dict)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_random_large_genome2 (catch.tests.test_probe.TestFindProbeCoversInSequence)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 763, in test_random_large_genome2
    lcf_thres=80, seed=2)
  File "catch/tests/test_probe.py", line 860, in run_random
    use_native_dict=use_native_dict)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_random_large_genome3 (catch.tests.test_probe.TestFindProbeCoversInSequence)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 767, in test_random_large_genome3
    lcf_thres=75, seed=3)
  File "catch/tests/test_probe.py", line 860, in run_random
    use_native_dict=use_native_dict)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_random_large_genome_native_dict (catch.tests.test_probe.TestFindProbeCoversInSequence)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 776, in test_random_large_genome_native_dict
    lcf_thres=100, seed=4, use_native_dict=True)
  File "catch/tests/test_probe.py", line 860, in run_random
    use_native_dict=use_native_dict)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_random_large_genome_varied_k (catch.tests.test_probe.TestFindProbeCoversInSequence)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 772, in test_random_large_genome_varied_k
    lcf_thres=100, kmer_probe_map_k=k, seed=1)
  File "catch/tests/test_probe.py", line 860, in run_random
    use_native_dict=use_native_dict)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_random_small_genome1 (catch.tests.test_probe.TestFindProbeCoversInSequence)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 746, in test_random_small_genome1
    self.run_random(100, 15000, 25000, 300, seed=1)
  File "catch/tests/test_probe.py", line 860, in run_random
    use_native_dict=use_native_dict)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_random_small_genome2 (catch.tests.test_probe.TestFindProbeCoversInSequence)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 750, in test_random_small_genome2
    lcf_thres=75, seed=2)
  File "catch/tests/test_probe.py", line 860, in run_random
    use_native_dict=use_native_dict)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_random_small_genome_varied_k (catch.tests.test_probe.TestFindProbeCoversInSequence)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 755, in test_random_small_genome_varied_k
    kmer_probe_map_k=k, seed=1)
  File "catch/tests/test_probe.py", line 860, in run_random
    use_native_dict=use_native_dict)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_repetitive (catch.tests.test_probe.TestFindProbeCoversInSequence)
Tests with short sequence and short probes
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 603, in test_repetitive
    probe.open_probe_finding_pool(kmer_map, f, n_workers)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_two_occurrences (catch.tests.test_probe.TestFindProbeCoversInSequence)
Tests with short sequence and short probes
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 556, in test_two_occurrences
    probe.open_probe_finding_pool(kmer_map, f, n_workers)
  File "catch/probe.py", line 830, in open_probe_finding_pool
    raise RuntimeError("Probe finding pool is already open")
RuntimeError: Probe finding pool is already open

======================================================================
ERROR: test_pigeonholed_kmer_map (catch.tests.test_probe.TestSharedKmerProbeMap)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 388, in test_pigeonholed_kmer_map
    self.assertCountEqual(shared_kmer_map.get('AB'),
AttributeError: 'TestSharedKmerProbeMap' object has no attribute 'assertCountEqual'

======================================================================
ERROR: test_rand_kmer_map (catch.tests.test_probe.TestSharedKmerProbeMap)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/tests/test_probe.py", line 364, in test_rand_kmer_map
    self.assertCountEqual(shared_kmer_map.get('DEF'),
AttributeError: 'TestSharedKmerProbeMap' object has no attribute 'assertCountEqual'

======================================================================
FAIL: test_limit_expansion_0 (catch.filter.tests.test_n_expansion_filter.TestNExpansionFilter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_n_expansion_filter.py", line 79, in test_limit_expansion_0
    limit_n_expansion_randomly=0)
  File "catch/filter/tests/test_n_expansion_filter.py", line 29, in check_output
    self.assertEqual(f.output_probes, desired_output_probes)
AssertionError: Lists differ: [ATCG, AGCG, ACTG] != [ATCG, AACG, ACAG]

First differing element 1:
AGCG
AACG

- [ATCG, AGCG, ACTG]
?         ^      ^

+ [ATCG, AACG, ACAG]
?         ^      ^


======================================================================
FAIL: test_limit_expansion_0_with_two_N (catch.filter.tests.test_n_expansion_filter.TestNExpansionFilter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_n_expansion_filter.py", line 85, in test_limit_expansion_0_with_two_N
    limit_n_expansion_randomly=0)
  File "catch/filter/tests/test_n_expansion_filter.py", line 29, in check_output
    self.assertEqual(f.output_probes, desired_output_probes)
AssertionError: Lists differ: [ATCG, AGTG, ACTG] != [ATCG, AAAG, ACTG]

First differing element 1:
AGTG
AAAG

- [ATCG, AGTG, ACTG]
?          --

+ [ATCG, AAAG, ACTG]
?        ++


======================================================================
FAIL: test_limit_expansion_1_with_all_N (catch.filter.tests.test_n_expansion_filter.TestNExpansionFilter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_n_expansion_filter.py", line 103, in test_limit_expansion_1_with_all_N
    limit_n_expansion_randomly=1)
  File "catch/filter/tests/test_n_expansion_filter.py", line 29, in check_output
    self.assertEqual(f.output_probes, desired_output_probes)
AssertionError: Lists differ: [ATCG, GTAT, GTTT, GTCT, GTGT,... != [ATCG, AAAG, TAAG, CAAG, GAAG,...

First differing element 1:
GTAT
AAAG

- [ATCG, GTAT, GTTT, GTCT, GTGT, ACTG]
+ [ATCG, AAAG, TAAG, CAAG, GAAG, ACTG]

======================================================================
FAIL: test_limit_expansion_1_with_two_N (catch.filter.tests.test_n_expansion_filter.TestNExpansionFilter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_n_expansion_filter.py", line 97, in test_limit_expansion_1_with_two_N
    limit_n_expansion_randomly=1)
  File "catch/filter/tests/test_n_expansion_filter.py", line 29, in check_output
    self.assertEqual(f.output_probes, desired_output_probes)
AssertionError: Lists differ: [ATCG, AGAG, AGTG, AGCG, AGGG,... != [ATCG, AAAG, AATG, AACG, AAGG,...

First differing element 1:
AGAG
AAAG

- [ATCG, AGAG, AGTG, AGCG, AGGG, ACTG]
?         ^     ^     ^       -

+ [ATCG, AAAG, AATG, AACG, AAGG, ACTG]
?         ^     ^     ^    +


======================================================================
FAIL: test_limit_expansion_2_with_all_N (catch.filter.tests.test_n_expansion_filter.TestNExpansionFilter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/filter/tests/test_n_expansion_filter.py", line 114, in test_limit_expansion_2_with_all_N
    limit_n_expansion_randomly=2)
  File "catch/filter/tests/test_n_expansion_filter.py", line 29, in check_output
    self.assertEqual(f.output_probes, desired_output_probes)
AssertionError: Lists differ: [ATCG, GAAT, GATT, GACT, GAGT,... != [ATCG, AAAA, AAAT, AAAC, AAAG,...

First differing element 1:
GAAT
AAAA

  [ATCG,
+  AAAA,
+  AAAT,
+  AAAC,
+  AAAG,
+  TAAA,
+  TAAT,
+  TAAC,
+  TAAG,
+  CAAA,
+  CAAT,
+  CAAC,
+  CAAG,
+  GAAA,
   GAAT,
-  GATT,
-  GACT,
?     -

+  GAAC,
?    +

-  GAGT,
?     -

+  GAAG,
?    +

-  GTAT,
-  GTTT,
-  GTCT,
-  GTGT,
-  GCAT,
-  GCTT,
-  GCCT,
-  GCGT,
-  GGAT,
-  GGTT,
-  GGCT,
-  GGGT,
   ACTG]

======================================================================
FAIL: test_similar (catch.utils.tests.test_lsh.TestHammingDistanceFamily)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "catch/utils/tests/test_lsh.py", line 42, in test_similar
    self.assertGreater(collision_count, 8)
AssertionError: 8 not greater than 8

----------------------------------------------------------------------
Ran 212 tests in 457.176s

FAILED (failures=6, errors=53)

Any ideas on how to get this up and running?
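Judging from the tracebacks, the tests appear to have been run under Python 2.7: the loader paths point to .../anaconda3/lib/python2.7/, and every failure involves a construct that exists only in Python 3 (argument-less super(), yield from, importlib.util, unittest's assertCountEqual). A quick check of which interpreter is actually collecting and running the tests:

import sys

# The failures above all involve Python-3-only features, so confirm which
# interpreter is running the test suite.
print(sys.executable)
print(sys.version)
assert sys.version_info[0] >= 3, "the test suite must be run under Python 3"

If the reported version is 2.x, re-running the suite with a Python 3 interpreter should clear most of these errors.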

Problem installing CATCH

Dear CATCH team,

I get the following error when installing CATCH:

catch$ pip install -e .

Obtaining file:///home/ins-bio/programas/catch
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [62 lines of output]
/home/ins-bio/programas/anaconda_install/lib/python3.11/site-packages/setuptools/dist.py:520: SetuptoolsDeprecationWarning: Invalid version: 'v1.5.1-17-g9c4696c2'.
!!

          ********************************************************************************
          The version specified is not a valid version according to PEP 440.
          This may not work as expected with newer versions of
          setuptools, pip, and PyPI.

          This deprecation is overdue, please update your project and remove deprecated
          calls to avoid build errors in the future.

          See https://peps.python.org/pep-0440/ for details.
          ********************************************************************************

  !!
    self._validate_version(self.metadata.version)
  running egg_info
  /home/ins-bio/programas/anaconda_install/lib/python3.11/site-packages/setuptools/command/egg_info.py:131: SetuptoolsDeprecationWarning: Invalid version: 'v1.5.1-17-g9c4696c2'.
  !!

          ********************************************************************************
          Version 'v1.5.1-17-g9c4696c2' is not valid according to PEP 440.

          Please make sure to specify a valid version for your package.
          Also note that future releases of setuptools may halt the build process
          if an invalid version is given.

          This deprecation is overdue, please update your project and remove deprecated
          calls to avoid build errors in the future.

          See https://peps.python.org/pep-0440/ for details.
          ********************************************************************************

  !!
    return _normalization.best_effort_version(tagged)
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/home/ins-bio/programas/catch/setup.py", line 11, in <module>
      setup(name='catch',
    File "/home/ins-bio/programas/anaconda_install/lib/python3.11/site-packages/setuptools/__init__.py", line 107, in setup
      return distutils.core.setup(**attrs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ins-bio/programas/anaconda_install/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 185, in setup
      return run_commands(dist)
             ^^^^^^^^^^^^^^^^^^
    File "/home/ins-bio/programas/anaconda_install/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/home/ins-bio/programas/anaconda_install/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
      self.run_command(cmd)
    File "/home/ins-bio/programas/anaconda_install/lib/python3.11/site-packages/setuptools/dist.py", line 1244, in run_command
      super().run_command(command)
    File "/home/ins-bio/programas/anaconda_install/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
      cmd_obj.ensure_finalized()
    File "/home/ins-bio/programas/anaconda_install/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 111, in ensure_finalized
      self.finalize_options()
    File "/home/ins-bio/programas/anaconda_install/lib/python3.11/site-packages/setuptools/command/egg_info.py", line 219, in finalize_options
      parsed_version = packaging.version.Version(self.egg_version)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ins-bio/programas/anaconda_install/lib/python3.11/site-packages/setuptools/_vendor/packaging/version.py", line 197, in __init__
      raise InvalidVersion(f"Invalid version: '{version}'")
  setuptools.extern.packaging.version.InvalidVersion: Invalid version: 'v1.5.1-17-g9c4696c2'
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
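The version string 'v1.5.1-17-g9c4696c2' comes from a git checkout that is several commits past the v1.5.1 tag, and recent setuptools releases reject version strings that are not PEP 440 compliant (older releases only warned, as the deprecation notice above says). Two possible workarounds, assuming nothing else is wrong with the environment: downgrade setuptools in that environment before retrying the editable install (roughly, any release predating the point where the deprecation became an error), or install a tagged, packaged release of CATCH whose version string is valid. To see which setuptools is doing the rejecting:

import setuptools

# Newer setuptools turns the PEP 440 deprecation shown above into a hard error;
# older releases only warn. This prints the version in use.
print(setuptools.__version__)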

Error when running design.py example

design.py download:64320 -pl 75 -m 2 -l 60 -e 50 -o zika-probes.fasta --verbose

2022-02-23 17:02:05,101 - catch.utils.ncbi_neighbors [INFO] Creating a FASTA file for taxid 64320
2022-02-23 17:02:05,102 - catch.utils.ncbi_neighbors [INFO] Constructing a list of neighbors for taxid 64320
2022-02-23 17:03:56,248 - catch.utils.ncbi_neighbors [INFO] There are 1596 neighbors, 799 of which have unique accessions
Traceback (most recent call last):
  File "/Users/yfeng/opt/anaconda3/lib/python3.8/http/client.py", line 555, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/Users/yfeng/opt/anaconda3/lib/python3.8/http/client.py", line 522, in _read_next_chunk_size
    return int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/yfeng/opt/anaconda3/lib/python3.8/http/client.py", line 572, in _readall_chunked
    chunk_left = self._get_chunk_left()
  File "/Users/yfeng/opt/anaconda3/lib/python3.8/http/client.py", line 557, in _get_chunk_left
    raise IncompleteRead(b'')
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/yfeng/opt/anaconda3/bin/design.py", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/Users/yfeng/Desktop/catch/bin/design.py", line 856, in <module>
    main(args)
  File "/Users/yfeng/Desktop/catch/bin/design.py", line 66, in main
    ds_fasta_tf = ncbi_neighbors.construct_fasta_for_taxid(taxid,
  File "/Users/yfeng/Desktop/catch/catch/utils/ncbi_neighbors.py", line 465, in construct_fasta_for_taxid
    seqs_tf = fetch_fastas(acc_to_fetch)
  File "/Users/yfeng/Desktop/catch/catch/utils/ncbi_neighbors.py", line 207, in fetch_fastas
    raw_data = r.read()
  File "/Users/yfeng/opt/anaconda3/lib/python3.8/http/client.py", line 465, in read
    return self._readall_chunked()
  File "/Users/yfeng/opt/anaconda3/lib/python3.8/http/client.py", line 579, in _readall_chunked
    raise IncompleteRead(b''.join(value))
http.client.IncompleteRead: IncompleteRead(536576 bytes read)
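http.client.IncompleteRead here means the connection to NCBI was cut off partway through a chunked transfer; this tends to be transient, and re-running the same design.py command often succeeds. For illustration only (a hypothetical helper, not part of CATCH), a download step can be made more tolerant of this kind of failure with a simple retry loop:

import time
import urllib.request
from http.client import IncompleteRead

def fetch_with_retries(url, retries=3, delay=10):
    # Retry the request if the chunked response from the server is cut off.
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url) as r:
                return r.read()
        except (IncompleteRead, ConnectionError):
            if attempt == retries - 1:
                raise
            time.sleep(delay)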

Update of the all-viruses dataset

Hi,

Thanks for this tool!
I would like to design probes for capturing viruses that infect humans, and I am planning to use your V-all dataset. I see that the sequences come from a pull in October 2018. Would it be possible for you to update this?
Thanks!
Juliette

Having trouble accessing preloaded datasets

Hi, novice Linux user here.

I work for the CDC, and our scientific computing people have installed CATCH on our Biolinux platform. I have loaded CATCH and was trying to run the command to have the program design probes for the installed Zika virus dataset. I'm getting error messages that seem to say the .gz file can't be found, although if I move around the directories, I can see the .gz file that is supposed to be used to generate the probe designs.

Here is the line of code and the error messages:
fph6@biolinux> design.py zika -pl 75 -m 2 -l 60 -e 50 -o zika-probes.fasta --verbose
2019-03-26 15:09:11,298 - catch.utils.seq_io [INFO] Reading fasta file /apps/x86_64/python/3.6.1/lib/python3.6/site-packages/catch-v1.2.0_20_gbf97305_dirty-py3.6.egg/catch/datasets/data/zika.fasta.gz
Traceback (most recent call last):
  File "/apps/x86_64/catch/catch/bin/design.py", line 811, in <module>
    main(args)
  File "/apps/x86_64/catch/catch/bin/design.py", line 60, in main
    genomes_grouped += [seq_io.read_dataset_genomes(dataset)]
  File "/apps/x86_64/python/3.6.1/lib/python3.6/site-packages/catch-v1.2.0_20_gbf97305_dirty-py3.6.egg/catch/utils/seq_io.py", line 71, in read_dataset_genomes
    seqs = list(read_fasta(fn).values())
  File "/apps/x86_64/python/3.6.1/lib/python3.6/site-packages/catch-v1.2.0_20_gbf97305_dirty-py3.6.egg/catch/utils/seq_io.py", line 152, in read_fasta
    with gzip.open(fn, 'rt') as f:
  File "/apps/x86_64/python/3.6.1/lib/python3.6/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/apps/x86_64/python/3.6.1/lib/python3.6/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/apps/x86_64/python/3.6.1/lib/python3.6/site-packages/catch-v1.2.0_20_gbf97305_dirty-py3.6.egg/catch/datasets/data/zika.fasta.gz'

Can you help?
Thanks,
Linda
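The traceback shows the installed egg looking for catch/datasets/data/zika.fasta.gz inside site-packages, and that file is simply not present there, which suggests the packaged data files were not copied during installation (the _dirty suffix in the egg name also hints at a non-release build); reinstalling the package is worth trying first. A quick way to confirm what data actually shipped with the installed copy, assuming the catch package itself imports:

import os
from catch import datasets

# List the data files installed alongside the catch.datasets package; if
# zika.fasta.gz is missing here, the installation (not the dataset) is at fault.
data_dir = os.path.join(os.path.dirname(datasets.__file__), 'data')
print(data_dir)
print(sorted(os.listdir(data_dir)) if os.path.isdir(data_dir) else 'no data directory installed')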

catch on large input

I am trying to run CATCH on a large number of sequences. I have read the documentation and am trying to use the --cluster-and-design-separately option, but I continually get the following error:

design.py: error: argument --cluster-and-design-separately: 0.85 is an invalid average nucleotide dissimilarity

I also tried 85 rather than 0.85 and got the same error. What type of value should this parameter be?

Best,
Christina

Issue running design.py on large files

Hello catch developers,

We've had a great experience using your software on a database of smaller genes, but now I am trying to design probes for more than 60 genomes from a single bacterial species, and the design.py script seems to hang.

The FASTA file is currently formatted so that each genome is a single header and sequence (mean length 2.7 Mb), so I was wondering whether this could be causing issues. I tried using flags that should improve performance, such as a longer probe stride and the --cluster-and-design-separately flag, but I still could not get a result after letting it run for more than a week. Memory does not appear to be the issue: htop shows plenty of memory still available and nothing has jumped to swap. Do I just need to let the process keep running for as long as needed? What would you suggest I do to run the probe design on the entire file at once?

For now, I'm splitting the probe design into groups of 5-10 genomes and this seems to be working fine, but I was wondering if I was missing a simple solution.

Thank you for your time,
Enrique

CATCH fails on macOS starting with Python 3.8

Problem

The issue is likely caused by a behavioral change between Python 3.7 and 3.8: on macOS, multiprocessing in Python 3.8 switched to spawning processes rather than forking them. See here for an issue reported to Python. Apparently forking processes on macOS can cause crashes, but CATCH with older versions of Python has not experienced those issues.

Parts of CATCH (in particular, the pools in the probe module) are written to rely on the fork behavior, in which child processes inherit memory from their parent. When processes are spawned instead, the children do not inherit that memory, and CATCH crashes because its global variables are not accessible to the child processes. See #48 for example output of the failure.

Fix

Quick fix

On macOS, call multiprocessing.set_start_method('fork') so that processes are forked by default, as in the sketch below.
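A minimal sketch of that call, assuming it runs in the main module before any worker pools are created (the start method can only be set once per process):

import multiprocessing
import sys

if __name__ == '__main__':
    # Restore the pre-3.8 default on macOS so child processes inherit the
    # parent's memory, as the probe-finding pools expect.
    if sys.platform == 'darwin':
        multiprocessing.set_start_method('fork')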

Longer term fix

When global variables are shared among processes (even if read-only) and accessed by children, use multiprocessing.shared_memory.SharedMemory objects; a sketch of the idea follows below.
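A minimal sketch of that idea (not CATCH's actual implementation), assuming the shared state can be flattened into a byte buffer that children attach to by name:

from multiprocessing import Process, shared_memory

def child(shm_name, size):
    # Attach to the block created by the parent; nothing is copied on spawn.
    shm = shared_memory.SharedMemory(name=shm_name)
    print(bytes(shm.buf[:size]).decode())
    shm.close()

if __name__ == '__main__':
    payload = b'ACGTACGT'
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    p = Process(target=child, args=(shm.name, len(payload)))
    p.start()
    p.join()
    shm.close()
    shm.unlink()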

Error running design.py: unexpected keyword argument 'avoided_genomes'

Dear Dr. Metsky,

I am contacting you regarding CATCH, which you developed and which is a very interesting tool for me to use on my own data.
I have installed the software, but when I run the example as written on GitHub, I get the following output and error message:

design.py download:64320 -pl 75 -m 2 -l 60 -e 50 -o zika-probes.fasta --verbose

2023-08-08 14:14:53,897 - catch.utils.ncbi_neighbors [INFO] Creating a FASTA file for taxid 64320
2023-08-08 14:14:53,898 - catch.utils.ncbi_neighbors [INFO] Constructing a list of neighbors for taxid 64320
2023-08-08 14:14:59,166 - catch.utils.ncbi_neighbors [INFO] There are 1602 neighbors, 802 of which have unique accessions
2023-08-08 14:15:16,135 - catch.utils.seq_io [INFO] Reading fasta file /tmp/tmpcnemrqnp
Traceback (most recent call last):
  File "/data/antwerpen/209/vsc20911/Miniconda3/bin/catch/bin/design.py", line 958, in <module>
    main(args)
  File "/data/antwerpen/209/vsc20911/Miniconda3/bin/catch/bin/design.py", line 325, in main
    scf = set_cover_filter.SetCoverFilter(
TypeError: __init__() got an unexpected keyword argument 'avoided_genomes'

Is there a way to solve this error?

Thank you in advance for your response!

Kind regards,
MGeraerts
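The traceback shows bin/design.py from a git checkout passing avoided_genomes to SetCoverFilter, while the catch package that actually gets imported does not accept that argument; this usually means the script and the installed library come from different versions of CATCH (for example, a newer checkout run against an older conda or pip install, or vice versa). One way to see which copy of the library the script is importing:

# Print which installed copy of the catch package is imported; if it does not
# live in the same checkout as the bin/design.py being run, the two are out of sync.
import catch
print(catch.__file__)
print(getattr(catch, '__version__', 'unknown'))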

Incorrect fasta paths for some genomes in directories

These human-host virus genome datasets have virus.fasta as their FASTA path, but their actual path is virus/[0-9a-z]+.fasta:

alethinophid_reptarenavirus
amapari
ambe
bear_canyon
bwamba
cao_bang
caraparu
cupixi
hughes
human_picobirnavirus
keterah
lujo
marituba
oliveros
oriboca
parana
toros
yogue
zerdali
