eclarke / swga

Select primer sets for selective whole genome amplification (SWGA)

License: GNU General Public License v3.0

Makefile 0.80% C 29.46% C++ 43.75% Python 25.86% R 0.13%

swga's Introduction

Selective whole genome amplification

Introduction

This is an easy-to-use, start-to-end package for finding sets of primers that selectively amplify a particular genome (the "foreground" genome) over a background genome. For instance, we can design a set of primers that amplify a parasite's genome in a sample that is overwhelmingly composed of host DNA.

You can run SWGA on hardware ranging from a Mac laptop to a high-end server.

Features:

Counts all the possible primers in a size range in both genomes
Filters primers based on:
- foreground and background genome binding rates
- melting temperatures (with a built-in melt temp calculator that accounts for mono- and divalent cation solutions!)
- possible homodimerization
Finds primer sets containing primers that are compatible with each other using graph theory (largest clique formation). The process ensures:
- no two primers in a set form a heterodimer
- even binding site spacing in the foreground genome
- low total binding to the background genome
Scores each set based on certain binding metrics and allows exploration of high-scoring sets via output to common formats.

Installation

Follow the installation instructions here

Using SWGA

Follow the guide on our Wiki/Quick Start to get started!

Updates

New features and bugfixes are released frequently. To update, simply follow steps 3-5 of the installation instructions.

3rd-party code

SWGA incorporates code from other open-source projects:

cliquer, a clique-finding library by Sampo Niskanen and Patric Ostergard
- http://users.aalto.fi/~pat/cliquer.html
DSK, a disk-based kmer-counting tool by G. Rizk
- http://minia.genouest.org/dsk/
- Citation: (Rizk, G., Lavenier, D. and Chikhi, R. DSK: k-mer counting with very low memory usage, Bioinformatics, 2013.)

DSK is licensed under the CeCILL license, which can be found in src/dsk/LICENSE, and is GPL compatible.

swga's Issues

test_correct_write error

ERROR: test_correct_write (__main__.GraphTests)

Traceback (most recent call last):
File "test_swga.py", line 86, in test_correct_write
write_graph(self.good_set, self.good_edges, tmp)
File "/home/eloy/PrimerSets/swga/heterodimer_graph.py", line 62, in write_graph
file_handle.write('p sp {} {}\n'.format(len(primers), len(edges)))
ValueError: zero length field name in format

Ran 12 tests in 0.031s

FAILED (errors=1)

Also, I thought the installer checked for the required Python modules, but it failed to notice that importlib was missing.
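
The `ValueError: zero length field name in format` above is what Python 2.6 raises for bare `{}` fields, since automatic field numbering in `str.format` arrived in Python 2.7. A minimal sketch of a 2.6-compatible header line, assuming the format string from the traceback (the helper name `format_header` is hypothetical and not part of swga):

```python
def format_header(primers, edges):
    """Build the 'p sp <n_primers> <n_edges>' header line.

    Explicit field indices ({0}, {1}) work on Python 2.6 as well as
    2.7+ and 3.x; bare '{}' fields need Python 2.7 or later.
    """
    return 'p sp {0} {1}\n'.format(len(primers), len(edges))
```

Running the tests under Python 2.7 would likely avoid the error without any code change.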

Set IDs should be included in the output of `swga export sets`

Since these IDs can be used to export a given set

memory usage for kmers > 14 too high

The code currently loads all kmers into memory; it should be changed to a generator function.
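
A rough sketch of the generator approach, assuming a hypothetical text format of one `<kmer> <count>` pair per line (the real DSK output is binary and would need its own parser; `iter_kmers` is an illustrative name, not an swga function):

```python
def iter_kmers(lines):
    """Yield (kmer, count) pairs one at a time from an iterable of lines.

    Only one record is held in memory at once, so peak memory no longer
    grows with the number of kmers. The '<kmer> <count>' line format is
    an assumption made for this sketch.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        kmer, count = line.split()
        yield kmer, int(count)
```

Because an open file handle is itself an iterable of lines, `iter_kmers(open(path))` would stream the file rather than read it whole.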

swga flatten issue

The swga flatten command produces this error:

Traceback (most recent call last):
File "/media/RAID/home/sesh/.local/bin/swga", line 9, in <module>
load_entry_point('swga==0.2.0', 'console_scripts', 'swga')()
File "/media/RAID/home/sesh/SWGAScriptTesting/PrimerSets/swga/commands/swga_entry.py", line 47, in main
command_opts[args.command](remaining, args.config, args.quiet)
File "/media/RAID/home/sesh/SWGAScriptTesting/PrimerSets/swga/commands/flatten.py", line 37, in main
if os.path.isfile(args.output):
File "/usr/lib/python2.7/genericpath.py", line 29, in isfile
st = os.stat(path)
TypeError: coercing to Unicode: need string or buffer, file found
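
The traceback suggests `args.output` is already an open file object rather than a path, which `os.path.isfile` cannot accept. A minimal sketch of a check that tolerates both, assuming the file object exposes a `.name` attribute (`output_exists` is a hypothetical helper, not part of swga):

```python
import os


def output_exists(output):
    """Return True if `output` refers to an existing regular file.

    Accepts either a path string or an already-open file object
    (such as argparse.FileType produces), falling back to the
    object's .name attribute when one is present.
    """
    path = getattr(output, 'name', output)
    return isinstance(path, str) and os.path.isfile(path)
```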

Filter primers that bind a given genome or from a list

We should be able to provide a fasta sequence or a list of primers and filter against them. For the fasta, it would be good to set a threshold for filtering, i.e., filter a primer if it occurs more than once in the file. Primers in the list should always be filtered, since the list names them explicitly.

Primer locations not stored as integers

When trying to export bedfiles for a primer set, this error occurs:

swga export bedfile --limit 1 --order_by score
...
Exporting set 49575 and associated primers as bedfiles...
Traceback (most recent call last):
  File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/bin/swga", line 9, in <module>
    load_entry_point('swga==0.3.2', 'console_scripts', 'swga')()
  File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/swga/commands/swga_entry.py", line 47, in main
    command_opts[args.command](remaining, args.config)
  File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/swga/commands/export.py", line 61, in main
    export_bedfiles(set, cmd.fg_genome_fp, outpath)
  File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/swga/commands/export.py", line 172, in export_bedfiles
    for location in primer.locations[record]:
TypeError: string indices must be integers
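
"String indices must be integers" is what Python reports when a string is indexed with a record name, so `primer.locations` was probably still a serialized string at this point. A sketch of decoding before iterating, assuming the locations are stored as JSON keyed by record name (the helper `iter_locations` is hypothetical):

```python
import json


def iter_locations(raw_locations):
    """Yield (record, position) pairs from a primer's stored locations.

    `raw_locations` may be a JSON string such as '{"chr1": [5, 42]}'
    or an already-decoded dict; the JSON layout is an assumption.
    Positions are coerced to int so BED coordinates are numeric.
    """
    if isinstance(raw_locations, str):
        raw_locations = json.loads(raw_locations)
    for record, positions in raw_locations.items():
        for pos in positions:
            yield record, int(pos)
```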

update installations - should be in virtual environment rather than just steps 3-5

The current instructions send people through a virtual environment for the initial install, then require root access for updating. They should clarify that updates are also done inside the virtual environment (a very minor addition to installation step 3, or an additional note in step 2, should do it).

case insensitive

When looking for primers in a genome (swga locate), the code is currently case sensitive. Since "a" and "A" refer to the same nucleotide, it would be better if it ignored case.
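
A minimal sketch of a case-insensitive search, assuming a plain in-memory sequence string and forward-strand matching only (`find_binding_sites` is an illustrative name, not the swga implementation):

```python
def find_binding_sites(primer, sequence):
    """Return 0-based start positions of `primer` in `sequence`.

    Both inputs are upper-cased first, so soft-masked (lowercase)
    regions match the same as uppercase ones. Overlapping matches
    are included.
    """
    primer = primer.upper()
    sequence = sequence.upper()
    sites = []
    start = sequence.find(primer)
    while start != -1:
        sites.append(start)
        start = sequence.find(primer, start + 1)
    return sites
```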

Unknown error swga

Hi,

I'm having trouble running the program. Installation appears to be successful, but I get the following error when I run the command swga. The program downloads python2.6 into swga_workspace, but claims to need >2.7. Could this be the issue?

Thanks for any help you can give.

Cheers, Iain

swga
Traceback (most recent call last):
File "/global/home/users/icclark/swga_workspace/bin/swga", line 9, in <module>
load_entry_point('swga==0.4.3', 'console_scripts', 'swga')()
File "/global/home/users/icclark/swga_workspace/lib/python2.6/site-packages/pkg_resources.py", line 468, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/global/home/users/icclark/swga_workspace/lib/python2.6/site-packages/pkg_resources.py", line 2563, in load_entry_point
return ep.load()
File "/global/home/users/icclark/swga_workspace/lib/python2.6/site-packages/pkg_resources.py", line 2254, in load
['name'])
File "/global/home/users/icclark/swga_workspace/lib/python2.6/site-packages/swga/main.py", line 9, in <module>
import workspace
File "/global/home/users/icclark/swga_workspace/lib/python2.6/site-packages/swga/workspace.py", line 34
m = {key: getattr(m, key) for key in m.fieldnames() if key != 'id'}
^
SyntaxError: invalid syntax
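
The `SyntaxError` points at a dict comprehension, a syntax that only exists from Python 2.7 onward, so running under Python 2.6 is very likely the cause. For illustration only, the same mapping can be written in a 2.6-compatible form (the function name and arguments are hypothetical stand-ins for the line in the traceback):

```python
def fields_without_id(obj, fieldnames):
    """Map each field name except 'id' to its value on `obj`.

    dict() over a generator of pairs is the Python 2.6 spelling of
    {key: getattr(obj, key) for key in fieldnames if key != 'id'}.
    """
    return dict((key, getattr(obj, key))
                for key in fieldnames if key != 'id')
```

Creating the workspace with a Python 2.7 interpreter is probably the more practical fix, since other parts of the code may rely on 2.7 features as well.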

Problem with swga count

I am now stuck on the swga count line

I keep getting this message after entering the command:

This workspace was created with a different version of swga.
Workspace version: 0.4.4
swga version: 0.4.4
Please re-initialize the workspace with swga init or use a different version of swga.

Can anyone help?

Thanks!

Could not reproduce example

Hi Developers!!

I am trying to use SWGA, which I downloaded via conda.
I was testing the tool, but swga find_sets seems to take forever to run. Please help!!

Problem scoring old sets

Sets appear not to be reading in and/or activating correctly.

The command "swga count --input oldsets.txt" appears to recognize that the primers have already been found in a previous swga run:

(swga_workspace)[eloy@hem-shwhn-050-med-upenn-edu 101415oldsets]$ swga count --input Sesholdprimers.txt 
Command: count
- fg_genome_fp: /home/eloy/swga_workspace/swga-master/SWGA_Pvivax/101215-testingoldsets/Pvivax.fasta
- exclude_fp: /home/eloy/swga_workspace/swga-master/SWGA_Pvivax/101215-testingoldsets/SalIMitochondrialsequence.fasta
- max_dimer_bp: 3
- bg_genome_fp: /home/eloy/swga_workspace/swga-master/SWGA_Pvivax/101215-testingoldsets/humangenome.fasta
- min_fg_bind: 226
- min_size: 5
- input: <open file 'Sesholdprimers.txt', mode 'r' at 0x18fbc90>
- exclude_threshold: 1
- max_bg_bind: 21037
- max_size: 12
- primer_db: primers.db
AACGAAAAAA already exists in db, skipping...
ACGAAACG already exists in db, skipping...
CCGTTCG already exists in db, skipping...
CGAAAAAAAA already exists in db, skipping...
CGAAACGG already exists in db, skipping...
CGAACGAA already exists in db, skipping...
CGCAACG already exists in db, skipping...
CGCACGA already exists in db, skipping...
CGTTGCG already exists in db, skipping...
CGTTTCGC already exists in db, skipping...
CGTTTCGT already exists in db, skipping...
CGTTTTCG already exists in db, skipping...
CGTTTTTTTT already exists in db, skipping...
GCGAAAAAA already exists in db, skipping...
GCGAAATG already exists in db, skipping...
TCGTGCG already exists in db, skipping...
TTTTTTCGC already exists in db, skipping...
TTTTTTTCGT already exists in db, skipping...
Binary kmer file already exists at .swga_tmp/Pvivax.fasta-8mers.solid_kmers_binary; parsing...
Binary kmer file already exists at .swga_tmp/humangenome.fasta-8mers.solid_kmers_binary; parsing...
Writing 0 8-mers into db in blocks of 199...
Binary kmer file already exists at .swga_tmp/Pvivax.fasta-15mers.solid_kmers_binary; parsing...
Binary kmer file already exists at .swga_tmp/humangenome.fasta-15mers.solid_kmers_binary; parsing...
TCGTTCGTCGGCGAA does not exist in foreground genome, skipping...
Writing 0 15-mers into db in blocks of 199...

But when I try to activate them, it says they are not in the database (see below), and attempts to score them do not work.

(swga_workspace)[eloy@hem-shwhn-050-med-upenn-edu 101415oldsets]$ swga activate --input Sesholdprimers.txt 
Command: activate
- reset: False
- input: <open file 'Sesholdprimers.txt', mode 'r' at 0x1928d20>
- bg_genome_fp: /home/eloy/swga_workspace/swga-master/SWGA_Pvivax/101215-testingoldsets/humangenome.fasta
- fg_genome_fp: /home/eloy/swga_workspace/swga-master/SWGA_Pvivax/101215-testingoldsets/Pvivax.fasta
- primer_db: primers.db
TCGTTCGTCGGCGAA not in the database; skipping. Add it manually with `swga count --input <file>` 
AACGAACG not in the database; skipping. Add it manually with `swga count --input <file>` 
Traceback (most recent call last):
  File "/home/eloy/swga_workspace/bin/swga", line 9, in <module>
    load_entry_point('swga==0.3.6', 'console_scripts', 'swga')()
  File "/home/eloy/swga_workspace/lib/python2.7/site-packages/swga/commands/swga_entry.py", line 61, in main
    command_opts[args.command](remaining, args.config)
  File "/home/eloy/swga_workspace/lib/python2.7/site-packages/swga/commands/activate.py", line 17, in main
    swga.database.Primer.update_Tms(primers)
AttributeError: type object 'Primer' has no attribute 'update_Tms'

SWGA error related to dependencies

From @saclarkPHE:

Hi,

I'm also having problems running the programme. After a seemingly successful installation, I run the swga command and get:

Traceback (most recent call last):
File "/home/ste/miniconda2/bin/swga", line 4, in <module>
import swga.main
File "/home/ste/miniconda2/lib/python2.7/site-packages/swga/main.py", line 9, in <module>
import workspace
File "/home/ste/miniconda2/lib/python2.7/site-packages/swga/workspace.py", line 161, in <module>
class Set(SwgaModel):
File "/home/ste/miniconda2/lib/python2.7/site-packages/swga/workspace.py", line 170, in Set
primers = ManyToManyField(Primer, related_name='sets')
TypeError: __init__() got an unexpected keyword argument 'related_name'

I've tried the method outlined in the installation instructions (within Virtualenv) and also miniconda, but I get the same TypeError in __init__ either way. I'm running Python 2.7.14 and GCC 5.4.0.

Any help would be appreciated. I'm a complete Linux/command line noob so the more idiot proof the better.

Many thanks!

No troubleshooting section in README

Identifying gaps

It would be helpful for the program to report the largest gaps (regions of sequence that are N's) in the foreground. This would help ensure that the user sets the max fg distance higher than that value.

This would also be helpful with the repeat masking as a repeat masking enhancement would increase the number and size of gaps.
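
A rough sketch of how such gaps could be listed, assuming the sequence is available as a plain string and that a gap means an uninterrupted run of `N` or `n` (`find_gaps` is a hypothetical helper, not an existing swga command):

```python
import re


def find_gaps(sequence, min_length=1):
    """Return (start, end, length) for each run of N/n, longest first.

    Coordinates are 0-based and end-exclusive. Runs shorter than
    `min_length` are dropped.
    """
    gaps = [(m.start(), m.end(), m.end() - m.start())
            for m in re.finditer(r'[Nn]+', sequence)]
    gaps = [g for g in gaps if g[2] >= min_length]
    return sorted(gaps, key=lambda g: g[2], reverse=True)
```

The length of the first tuple returned would then be a lower bound to keep in mind when choosing the maximum foreground distance.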

Backwards compatible, Python error with new version of swga and old version files

Hi,
I created primer sets using an older version of swga. When I updated swga to the new version (for bug fixes) commands no longer seem to work on the old files. Py.test produces no errors and it works properly on files created under the same version. The python error is:
Traceback (most recent call last):
File "/home/smalls/anaconda/bin/swga", line 9, in <module>
load_entry_point('swga==0.3.6', 'console_scripts', 'swga')()
File "/home/smalls/anaconda/lib/python2.7/site-packages/swga/commands/swga_entry.py", line 61, in main
command_opts[args.command](remaining, args.config)
File "/home/smalls/anaconda/lib/python2.7/site-packages/swga/commands/summary.py", line 13, in main
summary(**cmd.args)
File "/home/smalls/anaconda/lib/python2.7/site-packages/swga/commands/summary.py", line 48, in summary
bs = Set.select().order_by(Set.score).limit(1).get()
File "/home/smalls/anaconda/lib/python2.7/site-packages/peewee.py", line 2595, in get
return clone.execute().next()
File "/home/smalls/anaconda/lib/python2.7/site-packages/peewee.py", line 2636, in execute
self._qr = ResultWrapper(model_class, self._execute(), query_meta)
File "/home/smalls/anaconda/lib/python2.7/site-packages/peewee.py", line 2325, in _execute
return self.database.execute_sql(sql, params, self.require_commit)
File "/home/smalls/anaconda/lib/python2.7/site-packages/peewee.py", line 3015, in execute_sql
self.commit()
File "/home/smalls/anaconda/lib/python2.7/site-packages/peewee.py", line 2864, in __exit__
reraise(new_type, new_type(*exc_value.args), traceback)
File "/home/smalls/anaconda/lib/python2.7/site-packages/peewee.py", line 3007, in execute_sql
cursor.execute(sql, params or ())
peewee.OperationalError: no such column: t1.bg_dist_mean

any thoughts?

Gini plot

Since we calculate the gini index, is it possible to then output the data so we could generate the CDF plot from which the index is derived?
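
As an illustration of the data such a plot needs, the sketch below computes Lorenz-curve points (cumulative share of primers against cumulative share of total distance) and a Gini index from them. How swga itself derives `fg_dist_gini` is not shown in this thread, so the trapezoid-rule formula here is an assumption, and both function names are hypothetical:

```python
def lorenz_curve(distances):
    """Return (x, y) points of the Lorenz curve for non-negative values.

    x is the cumulative fraction of observations (sorted ascending),
    y is the cumulative fraction of the total they account for.
    """
    values = sorted(distances)
    total = float(sum(values))
    if not values or total == 0:
        raise ValueError("need at least one positive distance")
    points = [(0.0, 0.0)]
    running = 0.0
    for i, value in enumerate(values, start=1):
        running += value
        points.append((i / float(len(values)), running / total))
    return points


def gini(distances):
    """Gini index as 1 - 2 * (area under the Lorenz curve)."""
    pts = lorenz_curve(distances)
    area = sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1.0 - 2.0 * area
```

Writing the `lorenz_curve` points to a two-column file would be enough to draw the curve in any plotting tool.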

bash swga_init.sh error

Running bash swga_init.sh gives this error:

cp swga/parameters.cfg .
cp: cannot stat ‘swga/parameters.cfg’: No such file or directory

Problem with swga count

Dear swga community,

I have a problem when I use swga.

First of all, I didn't install virtualenv, as the archive seems to have some problems. However, I succeeded in installing swga, and the swga init command works properly (it accepts the foreground and background genomes). Two files are created.

Now, when I run the swga count command, I have a series of errors:

swga count, v0.4.4
  - force: False
  - max_dimer_bp: 3
  - exclude_threshold: 1
  - min_size: 5
  - input: None
  - min_fg_bind: 237
  - max_bg_bind: 21128
  - max_size: 12
Traceback (most recent call last):
  File "/home/maintenance-gg/.local/bin/swga", line 8, in <module>
    sys.exit(main())
  File "/home/maintenance-gg/.local/lib/python2.7/site-packages/swga/main.py", line 81, in main
    setup_and_run(cmd_class, args.command, remaining)
  File "/home/maintenance-gg/.local/lib/python2.7/site-packages/swga/main.py", line 49, in setup_and_run
    cmd.run()
  File "/home/maintenance-gg/.local/lib/python2.7/site-packages/swga/commands/count.py", line 27, in run
    self.count_kmers()
  File "/home/maintenance-gg/.local/lib/python2.7/site-packages/swga/commands/count.py", line 85, in count_kmers
    fg = swga.kmers.count_kmers(k, self.fg_genome_fp, output_dir)
  File "/home/maintenance-gg/.local/lib/python2.7/site-packages/swga/kmers.py", line 42, in count_kmers
    cmdarr = [swga.utils.dsk(), genome_fp, str(k), '-o', out, '-t', str(threshold)]
  File "/home/maintenance-gg/.local/lib/python2.7/site-packages/swga/utils.py", line 93, in dsk
    return _get_resource_file('dsk')
  File "/home/maintenance-gg/.local/lib/python2.7/site-packages/swga/utils.py", line 72, in _get_resource_file
    swga.error("Could not find `{}': try re-installing swga.".format(rs))
  File "/home/maintenance-gg/.local/lib/python2.7/site-packages/swga/__init__.py", line 25, in error
    raise SWGAError(msg)
swga.SWGAError: Could not find `dsk': try re-installing swga.

I uninstalled dsk and then reinstalled it properly, but the error remains.
Do you know what causes this error?

Thank you in advance!

Regards,

Asternosis

init should prompt for "fasta files from which to exclude primers (e.g. mitochondria)"

At the very least the QuickStart should mention the command for including a file from which to exclude primers that bind to the mitochondria as this will be common to nearly all projects.

swga count fails if no exclude file is present

Running swga count with exclude_fp = None (as generated by swga init) results in this error:

Binary kmer file already exists at .swga_tmp/Plasmodium_vivax.ASM241v1.27.dna.toplevel.fasta-5mers.solid_kmers_binary; parsing...
Binary kmer file already exists at .swga_tmp/humangenome.fasta-5mers.solid_kmers_binary; parsing...
Traceback (most recent call last):
File "/home/eloy/swga_workspace/bin/swga", line 9, in <module>
load_entry_point('swga==0.3.6', 'console_scripts', 'swga')()
File "/home/eloy/swga_workspace/lib/python2.7/site-packages/swga/commands/swga_entry.py", line 61, in main
command_opts[args.command](remaining, args.config)
File "/home/eloy/swga_workspace/lib/python2.7/site-packages/swga/commands/count.py", line 24, in main
count_kmers(**cmd.args)
File "/home/eloy/swga_workspace/lib/python2.7/site-packages/swga/commands/count.py", line 104, in count_kmers
assert os.path.isfile(exclude_fp)
AssertionError
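
A minimal sketch of a guard that treats a missing exclusion file as "nothing to exclude" instead of asserting. It assumes the config may hold either a real `None` or the literal string `None` written by `swga init`; `resolve_exclude_fp` is a hypothetical helper:

```python
import os


def resolve_exclude_fp(exclude_fp):
    """Return a usable exclusion-file path, or None if none is configured.

    A real None, an empty string, or the literal text 'None' all mean
    that no exclusion file was specified. A path that is specified but
    missing raises a descriptive error instead of a bare assertion.
    """
    if exclude_fp is None or str(exclude_fp).strip() in ('', 'None'):
        return None
    if not os.path.isfile(exclude_fp):
        raise ValueError(
            "Exclusion file not found: {0}".format(exclude_fp))
    return exclude_fp
```

The caller would then skip the exclusion step whenever the function returns `None`.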

unrecognized command "-v"

The command -v wasn't recognized when I ran the script on the Linux Mint OS. Deleted -v and the locations were stored without trouble. Typo?

find locations of filtered primers in foreground genome

find_fg_locations -v -i filtered_primers --fg_genome fg-genome.fasta.flattened

to
find_fg_locations -i filtered_primers --fg_genome fg-genome.fasta.flattened

swga locate error

after input:
--input filtered_primers --genome fg-genome.fasta.flattened

I receive the following error:

{} <ConfigParser.SafeConfigParser instance at 0x10b3d40>
swga locate config file: /root/.swga/parameters.cfg
swga locate parameters: {'ncores': 4, 'input': <open file 'filtered_primers', mode 'r' at 0x1098270>, 'passthrough': False, 'genome': 'fg-genome.fasta.flattened', 'locations_store': None}
[====================] 100%
Traceback (most recent call last):
File "/usr/local/bin/swga", line 9, in <module>
load_entry_point('swga==0.2.0', 'console_scripts', 'swga')()
File "/home/steph/Desktop/PrimerSets-master/swga/commands/swga_entry.py", line 47, in main
command_opts[args.command](remaining, args.config, args.quiet)
File "/home/steph/Desktop/PrimerSets-master/swga/commands/locate.py", line 73, in main
swga.save_locations(locations, args.locations_store, not quiet)
File "/home/steph/Desktop/PrimerSets-master/swga/__init__.py", line 274, in save_locations
with gzip.GzipFile(filename, 'w') as dest:
File "/usr/lib/python2.7/gzip.py", line 94, in __init__
fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
TypeError: coercing to Unicode: need string or buffer, NoneType found

Installation issue

Hi, I'm getting the error below when I get to step 3 of the installation, do you have any advice on how I install correctly?

Many thanks

(swga_workspace)CommandPrompt:~/swga_workspace$ pip install git+https://github.com/eclarke/swga
Collecting git+https://github.com/eclarke/swga
Cloning https://github.com/eclarke/swga to /tmp/pip-ugn2Y6-build
Error [Errno 2] No such file or directory while executing command git clone -q https://github.com/eclarke/swga /tmp/pip-ugn2Y6-build
Cannot find command 'git'

bedgraph output sliding window

The current bedgraph output uses a window to calculate the number of binding sites and a step size to move the window.

The output positions should be based on the step size rather than the window size. The current output generates overlapping segments in the bedgraph that are confusing.

track type=bedGraph None
gi 0 5001 0
gi 1000 6001 0
gi 2000 7001 0
gi 3000 8001 0
gi 4000 9001 0
gi 5000 10001 0
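
One way to produce non-overlapping intervals is to emit one row per step while still counting binding sites over the full window. This is only a sketch of the idea under that assumption (the function `bedgraph_rows` and its parameters are hypothetical, with defaults matching the 5000 bp window and 1000 bp step in the output above):

```python
from bisect import bisect_left


def bedgraph_rows(chrom, sites, chrom_length, window=5000, step=1000):
    """Yield (chrom, start, end, count) rows with non-overlapping intervals.

    Each row spans one step, [start, start + step), and reports the
    number of binding sites in the window [start, start + window),
    clipped to the chromosome length.
    """
    sites = sorted(sites)
    for start in range(0, chrom_length, step):
        end = min(start + step, chrom_length)
        window_end = min(start + window, chrom_length)
        # Sorted positions let bisect count sites in the half-open window.
        count = bisect_left(sites, window_end) - bisect_left(sites, start)
        yield chrom, start, end, count
```

With this layout each row's end equals the next row's start, so genome browsers draw the track without overlaps.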

Dimer issue and melt temp query

Hi Erik,

The find_sets command initially produced one set containing two primers:

_id | score | set_size | bg_dist_mean | fg_max_dist | fg_dist_mean | fg_dist_std | fg_dist_gini | scoring_fn | primers
1 | 0.0023844142 | 2 | 568770.802169 | 20284 | 2579.2951191828 | 2854.2363783882 | 0.5257968313 | (fg_dist_mean * fg_dist_gini) / (bg_dist_mean) | CGAAATCG,CTTCGTCG

Despite the max_dimer_bp parameter being set to 3, these two primers appear to have four consecutive bases that are complementary to each other.

After reducing the stringency of the parameters, the programme generated only one set again. The first primer in this set has 5 and 4 consecutive complementary bases with the second and third primers, respectively:

_id | score | set_size | bg_dist_mean | fg_max_dist | fg_dist_mean | fg_dist_std | fg_dist_gini | scoring_fn | primers
1 | 0.0413338986 | 3 | 116572.114135 | 47397 | 9468.1625 | 9495.5703900324 | 0.5089033852 | (fg_dist_mean * fg_dist_gini) / (bg_dist_mean) | ATTAATCG, CTTAATCG,GATAATCG

I'm slightly confused why these have been put in the same set. Perhaps I'm misunderstanding the process but shouldn't those primers with hetero dimers be filtered and not included in the same set?
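
For checking pairs like these by hand, one common definition of a heterodimer run is the longest ungapped stretch of one primer that can base-pair, antiparallel, with the other, which equals the longest common substring between the first primer and the reverse complement of the second. Whether swga applies `max_dimer_bp` with exactly this definition is not confirmed here, so treat the sketch as an independent check (both function names are hypothetical):

```python
_COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return ''.join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def max_complementary_run(primer_a, primer_b):
    """Longest ungapped antiparallel base-pairing run between two primers.

    Computed as the longest common substring of primer_a and the
    reverse complement of primer_b, using row-by-row dynamic programming.
    """
    a = primer_a.upper()
    b = reverse_complement(primer_b)
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run ending here
                best = max(best, cur[j])
        prev = cur
    return best
```

Under this definition the pairs quoted above give runs of 4, 5 and 4 bases, which agrees with the counts reported in this issue.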

I also had a question about the melting temperatures. I used Oligoanalyzer 3.1 (IDT) to calculate the melt temps of these primers based on the approximate primer/salt concentrations within the SWGA reaction as outlined at http://selectivewga.blogspot.co.uk/p/swga-protocol.html (although the salt concentrations could be a little off). In some cases the temperatures are very low (10-15 °C) and fall below the min_tm specified in the swga parameters. I am not sure how to export the calculated melt temps for the individual primers to check, but I suspect the difference could be due to the default concentrations (oligo, Na, Mg, etc.) being different within swga and thus producing a higher Tm. Is there any way of checking these default concentrations for my notes?

Thanks for your help,

All the best!

Change parameters named bg_ratio

Historically we have used bg_ratio (bg genome length / # of binding sites) as an approximation of average bg binding distance. We have called this bg_ratio to be mathematically correct. I think we should simply change this to bg_dist_mean in the scores output.

We can mention that our calculation is an approximation of mean distance in the docs and that we use the approximation because calculating a true average would get very complicated when using genomes that are not complete assemblies, etc.

Also our parameter for this is actually:
min_bg_bind_dist: minimum allowable AVERAGE distance between primers in the background dataset. Larger distances between priming sites result in lower levels of amplification by the phi29 enzyme. However, this enzyme has high processivity (~70kb), so the primers may need to be quite far apart.
Note that this is an AVERAGE distance, thus some areas of the background may be amplified despite a large value for this parameter.

so we are already halfway to considering it as an approximation of the average.
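
The approximation described above is a single division. A tiny sketch, with a hypothetical function name and an assumed convention of returning infinity when a primer set has no background binding sites:

```python
def approx_bg_dist_mean(bg_genome_length, n_binding_sites):
    """Approximate mean distance between background binding sites.

    This is genome length divided by the number of binding sites
    (the quantity historically called bg_ratio), not a true mean of
    the measured gaps between neighbouring sites.
    """
    if n_binding_sites == 0:
        return float('inf')  # no binding sites: treat distance as unbounded
    return bg_genome_length / float(n_binding_sites)
```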

Duplicate key error while parsing background genome fasta file

Is there a maximum file size for background genomes? My genome fasta file is 7.6GB. It reads the foreground easily, but I get the following error when inputting my background genome file:

Enter path to background genome file, in FASTA format: /mnt/1tb/thesis_data/SWGA/infestans_genome.fasta

Error reading /mnt/1tb/thesis_data/SWGA/infestans_genome.fasta: invalid FASTA format?
Traceback (most recent call last):
File "/home/alex/swga_workspace_2015.03.11/bin/swga", line 9, in <module>
load_entry_point('swga==0.3.2', 'console_scripts', 'swga')()
File "/home/alex/swga_workspace_2015.03.11/local/lib/python2.7/site-packages/swga/commands/swga_entry.py", line 47, in main
command_opts[args.command](remaining, args.config)
File "/home/alex/swga_workspace_2015.03.11/local/lib/python2.7/site-packages/click/core.py", line 610, in __call__
return self.main(*args, **kwargs)
File "/home/alex/swga_workspace_2015.03.11/local/lib/python2.7/site-packages/click/core.py", line 590, in main
rv = self.invoke(ctx)
File "/home/alex/swga_workspace_2015.03.11/local/lib/python2.7/site-packages/click/core.py", line 782, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/alex/swga_workspace_2015.03.11/local/lib/python2.7/site-packages/click/core.py", line 416, in invoke
return callback(*args, **kwargs)
File "/home/alex/swga_workspace_2015.03.11/local/lib/python2.7/site-packages/swga/commands/init.py", line 71, in main
bg_length, bg_nrecords = fasta_stats(bg_genome_fp)
File "/home/alex/swga_workspace_2015.03.11/local/lib/python2.7/site-packages/swga/commands/init.py", line 118, in fasta_stats
fasta = Fasta(fasta_fp)
File "/home/alex/swga_workspace_2015.03.11/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 540, in __init__
filt_function=filt_function)
File "/home/alex/swga_workspace_2015.03.11/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 216, in __init__
self.read_fai(split_char)
File "/home/alex/swga_workspace_2015.03.11/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 244, in read_fai
raise ValueError('Duplicate key "%s"' % rname)
ValueError: Duplicate key "['M00281:39:000000000-AAGY4:1:1101:15888:1332']"
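
The error is about a repeated record name rather than file size. The duplicated key also resembles a sequencer read name more than a chromosome or contig name, which may mean the file holds unassembled reads, though that is only a guess. A sketch for listing repeated names before running `swga init`, assuming the record ID is the header text up to the first whitespace (`duplicate_fasta_ids` is a hypothetical helper):

```python
from collections import Counter


def duplicate_fasta_ids(lines):
    """Return the sorted record IDs that appear more than once.

    `lines` is any iterable of FASTA lines, such as an open file.
    Only header lines (starting with '>') are inspected, so the
    sequence data is never held in memory.
    """
    counts = Counter()
    for line in lines:
        if line.startswith('>'):
            fields = line[1:].split()
            counts[fields[0] if fields else ''] += 1
    return sorted(name for name, n in counts.items() if n > 1)
```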

Using a custom list of primers for finding or scoring specific sets

Hi Eric,

I tried to evaluate a set of primers following the instructions provided in this page. However, I got an error message for several primers like this:

TATATATATATT does not exist in foreground genome, skipping...

This primer does exist in the foreground genome when I search for it manually. I changed the parameters several times, and it is always the same set of primers that swga fails to find in the foreground.

Do you know what could be the issue here?

I can send you the data.

Thanks,
Fred

Filter output confusing

Need to indicate how many primers fulfill both fg and bg binding parameters, not just how many fulfill each separately.

15148 primers bind foreground genome with avg freq >= 18 sites
27501 primers bind background genome with avg freq <= 10 sites
5871 of those primers have a melting temp within given range

Need a better error message for trying to score a set whose primers are not in the database or are not active

Running SWGA score for a set of primers that have not been added to or activated in the database produces a useless error.

Can we make what the user did wrong clearer?

swga score --input Set.txt

Traceback (most recent call last):
 File "/home/eloy/swga_workspace/bin/swga", line 9, in <module>
   load_entry_point('swga==0.3.4', 'console_scripts', 'swga')()
 File "/home/eloy/swga_workspace/lib/python2.7/site-packages/swga/commands/swga_entry.py", line 47, in main
   command_opts[args.command](remaining, args.config)
 File "/home/eloy/swga_workspace/lib/python2.7/site-packages/swga/commands/score.py", line 41, in main
   interactive=True)
 File "/home/eloy/swga_workspace/lib/python2.7/site-packages/swga/commands/score.py", line 56, in score_set
   binding_locations = swga.locate.linearize_binding_sites(primers, chr_ends)
 File "/home/eloy/swga_workspace/lib/python2.7/site-packages/swga/locate.py", line 36, in linearize_binding_sites
   for rec, locs in json.loads(primer.locations).iteritems():
 File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
   return _default_decoder.decode(s)
 File "/usr/lib64/python2.7/json/decoder.py", line 365, in decode
   obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer
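
The `TypeError` arises because `json.loads` is handed a value that is not a string, presumably `None` for a primer whose locations were never stored. A sketch of a clearer failure, assuming that interpretation (the exception class and function are hypothetical; the two commands named in the message are the ones mentioned elsewhere in these issues):

```python
import json


class PrimerNotReadyError(Exception):
    """Raised when a primer lacks the stored data needed for scoring."""


def decode_locations(seq, raw_locations):
    """Decode a primer's stored binding locations, or explain what is missing."""
    if raw_locations is None:
        raise PrimerNotReadyError(
            "Primer {0} has no stored binding locations. Add it with "
            "`swga count --input <file>` and activate it with "
            "`swga activate --input <file>` before scoring.".format(seq))
    return json.loads(raw_locations)
```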

Sets with overlapping primers

I am getting sets that have overlapping primers (see below for full output):

For example:
CGAATCGTTCTA
GCGAATCGTTCT

Is this allowed in swga? or am I setting the parameters wrong? or could it be a bug?

PRIMER SUMMARY

There are 43704 primers in the database.

500 are marked as active (i.e., they passed filter steps and will be
used to find sets of compatible primers.)

The average number of foreground genome binding sites is 1.
(avg binding / genome_length = 0.000110)
The average number of background genome binding sites is 4727.
(avg binding / genome_length = 0.000001)

The melting temp of the primers ranges between 49.50C and 64.81C with an average of 58.54C.

SETS SUMMARY

There are 84463 sets in the database.
The best scoring set is #7111, with 10 primers and a score of 0.000011.
Various statistics:

set_size........: 10
bg_dist_mean....: 4.61052E+07
fg_max_dist.....: 3,584
fg_dist_mean....: 834.545
fg_dist_std.....: 1,082.58
fg_dist_gini....: 0.596059
scoring_fn......: (fg_dist_mean * fg_dist_gini) / (bg_dist_mean)
The primers in Set 7111 are:
ACTGCGAATCGT, AGACCGGTTCTA, AGGCGTTACTCG, CGAATCGTTCTA, GCGAATCGTTCT, GGCGTTACTCGA, GTAGCAAGCTCG, GTCTATCGGCTC, TACCGTCAGCGT, TCCGCAGATCGT
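
To screen a set for primers that are shifted copies of one another, one option is to measure the longest suffix of one primer that equals a prefix of the other. This is an independent check, not something swga is known to do, and `overlap_length` is a hypothetical name:

```python
def overlap_length(primer_a, primer_b):
    """Longest proper suffix/prefix overlap between two primers.

    Returns the largest n, smaller than the shorter primer's length,
    for which the last n bases of one primer equal the first n bases
    of the other. Returns 0 when there is no such overlap.
    """
    def suffix_prefix(x, y):
        for n in range(min(len(x), len(y)) - 1, 0, -1):
            if x[-n:] == y[:n]:
                return n
        return 0

    a, b = primer_a.upper(), primer_b.upper()
    return max(suffix_prefix(a, b), suffix_prefix(b, a))
```

For the pair highlighted in this issue, `overlap_length("CGAATCGTTCTA", "GCGAATCGTTCT")` is 11 of 12 bases, so the two primers would bind at almost exactly the same sites.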

Parameters for similar genomes

Hi, I am working with the genomes of two similar, co-infecting species as the foreground and background genomes. So far I have not obtained any primer sets at all, even after adjusting some of the parameters. I have had between 125 and 196 primers marked as active after filtering, but nothing comes up with the swga find_sets command. Are there specific parameters I can adjust, and how stringent or relaxed can I go when the genomes are similar?

GC content

Dear author,
It is clear that the GC content of the foreground genome is very important for swga primer identification, as you mentioned for the P. vivax genome in the first swga publication (Cowell et al. 2017; mBio 8:e02257-16). I wonder how the authors identified the reference/foreground genomic regions with high GC content and poor coverage when designing swga primers. The authors mentioned:

"longer regions (>195,000 bp) of the P. vivax reference genome (Sal-1) that were GC rich (48.5 to 50.6%) and yielded low genome coverage"

I am unable to find the methodology the authors used to identify these long genome regions, or what methodology I should use to calculate the GC content in such a region (e.g. kmer size). I am standardising the methodology so that I can use the swga tool for my own Plasmodium sp.; however, I am unable to replicate the P. vivax genome results.

Please help me. Thank you.
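
The thread does not say how the publication identified those regions, so the following is only a generic sliding-window GC calculation that could serve as a starting point. The window and step sizes are arbitrary assumptions and `gc_windows` is a hypothetical helper:

```python
def gc_windows(sequence, window=1000, step=1000):
    """Yield (start, end, gc_fraction) for windows along a sequence.

    GC fraction is (G + C) / (A + C + G + T) within the window, so
    N's and other ambiguity codes are ignored. The fraction is None
    for a window with no called bases.
    """
    seq = sequence.upper()
    for start in range(0, max(len(seq) - window + 1, 1), step):
        chunk = seq[start:start + window]
        called = sum(chunk.count(base) for base in 'ACGT')
        gc = chunk.count('G') + chunk.count('C')
        fraction = gc / float(called) if called else None
        yield start, start + len(chunk), fraction
```

Consecutive windows above a chosen GC threshold could then be merged to find long GC-rich regions, and compared against sequencing coverage computed separately.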

Missing table error when using `swga count --input ...`

Trying to specify a primer set to count using swga count --input yields this error:

Traceback (most recent call last):
File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/bin/swga", line 9, in <module>
load_entry_point('swga==0.3.1', 'console_scripts', 'swga')()
File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/swga/commands/swga_entry.py", line 47, in main
command_opts[args.command](remaining, args.config)
File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/swga/commands/count.py", line 20, in main
count_specific_kmers(kmers, cmd.fg_genome_fp, cmd.bg_genome_fp)
File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/swga/commands/count.py", line 27, in count_specific_kmers
existing = [p.seq for p in Primer.select().where(Primer.seq << kmers)]
File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/peewee.py", line 2558, in __iter__
return iter(self.execute())
File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/peewee.py", line 2551, in execute
self._qr = ResultWrapper(model_class, self._execute(), query_meta)
File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/peewee.py", line 2243, in _execute
return self.database.execute_sql(sql, params, self.require_commit)
File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/peewee.py", line 2877, in execute_sql
self.commit()
File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/peewee.py", line 2732, in exit
reraise(new_type, new_type(*exc_value.args), traceback)
File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/peewee.py", line 2869, in execute_sql
cursor.execute(sql, params or ())
peewee.OperationalError: no such table: primer

Primers without gini and temp

Hi. Any idea why I have my primers without gini score and temp?

exclude fp error in `swga count` when no exclusionary file is specified

Running swga count with no exclude_fp yields this error.

Traceback (most recent call last):
File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/bin/swga", line 9, in <module>
load_entry_point('swga==0.3.2', 'console_scripts', 'swga')()
File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/swga/commands/swga_entry.py", line 47, in main
command_opts[args.command](remaining, args.config)
File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/swga/commands/count.py", line 21, in main
count_kmers(**cmd.args)
File "/media/8TB_PLAYGROUND/home/sesh/SWGAScriptTesting/swga_workspace/local/lib/python2.7/site-packages/swga/commands/count.py", line 77, in count_kmers
assert os.path.isfile(exclude_fp)
AssertionError
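The traceback shows a bare `assert os.path.isfile(exclude_fp)` failing when no exclusion file is configured. A minimal sketch of a more forgiving check is below; `check_exclude_fp` is a hypothetical helper, not part of swga, and it assumes an unset `exclude_fp` arrives as `None` or an empty string.

```python
import os


def check_exclude_fp(exclude_fp):
    """Return True if an exclusion file should be used, False if none was given.

    Hypothetical helper: raises IOError only when a path was supplied
    but does not point to an existing file.
    """
    if not exclude_fp:
        # No exclusionary sequences file specified: skip that step.
        return False
    if not os.path.isfile(exclude_fp):
        raise IOError("Exclusion file not found: {}".format(exclude_fp))
    return True
```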

Masking repetitive regions

Short sequences repeated many times (e.g. telomeres) can result in poor sets whose primers bind many times within a very short distance. Masking these regions would probably improve set finding.

Problem with swga filter

Hi,

I keep getting an error message when I attempt to use the "swga filter" command. Here's the end of the error message -- happy to post the entire traceback error message if needed:

" File "~/swga_workspace/lib/python2.7/site-packages/peewee.py", line 3485, in execute_sql
cursor.execute(sql, params or ())
peewee.OperationalError: near "?": syntax error"

No problems with swga init or swga count, but I get this message when I try swga filter every time. I'm using python 2.7.6, as suggested in the install instructions, and I've installed the most recent swga update. Looks like there may be a problem with the melting temperature filters or with peewee?

Open to ideas/suggestions. Thanks!

Jonathan

Support for Python 3

Hello.
I would like to ask whether there is a plan to update this code to make it compatible with Python 3. I can no longer install pip2, as support for it ended last January. I tried installing swga with the current version of pip (pip3), but it keeps reporting that Python 3 is not supported.
I hope you can help me with this problem as I could really use your very helpful code for my research.
Thank you very much and looking forward to your answers.

SWGA init error from peewee

We had SWGA working previously, but it has recently been giving errors. We have tried an update, but it gives the same error from peewee. Please see below.

(swga_workspace)[mts11@bert Aug12]$ swga init

swga v0.4.4 - interactive setup

This will set up a new swga workspace in the current directory (/ibers/ernie/scratch/mts11/swga_workspace/Aug12).

Enter path to foreground FASTA file: T.fa
Checking /ibers/ernie/scratch/mts11/swga_workspace/ToxoDB-35_TgondiiME49_Genome.fasta...
Foreground filepath: /ibers/ernie/scratch/mts11/swga_workspace/ToxoDB-35_TgondiiME49_Genome.fasta
Length: 66765379 bp
Records: 2265

Enter path to background FASTA file: h.fa
Checking /ibers/ernie/scratch/mts11/swga_workspace/human-toxo/human.fa...
Background filepath: /ibers/ernie/scratch/mts11/swga_workspace/human-toxo/human.fa
Length: 3101788253 bp

Do you want to add a FASTA file of sequences that will be used to exclude
primers? For instance, to avoid primers that bind to a mitochondrial genome,
you would add the path to that genome file. There can be multiple sequences in
the file, but only one file can be specified.
[y/N]: N
No exclusionary sequences file specified.
Existing file `parameters.cfg' will be overwritten. Continue? [y/N]: y
This directory already appears to be a swga workspace. Do you want to re-initialize it? [y/N]: y
Traceback (most recent call last):
File "/ibers/ernie/scratch/mts11/swga_workspace/bin/swga", line 9, in <module>
load_entry_point('swga==0.4.4', 'console_scripts', 'swga')()
File "/ibers/ernie/scratch/mts11/swga_workspace/lib/python2.7/site-packages/swga/main.py", line 78, in main
command_optsargs.command
File "/ibers/ernie/scratch/mts11/swga_workspace/lib/python2.7/site-packages/swga/commands/init.py", line 174, in main
ws.create_tables()
File "/ibers/ernie/scratch/mts11/swga_workspace/lib/python2.7/site-packages/swga/workspace.py", line 45, in create_tables
super(SwgaWorkspace, self).create_tables(_tables, safe=safe)
File "/ibers/ernie/scratch/mts11/swga_workspace/lib/python2.7/site-packages/peewee.py", line 3917, in create_tables
create_model_tables(models, fail_silently=safe)
File "/ibers/ernie/scratch/mts11/swga_workspace/lib/python2.7/site-packages/peewee.py", line 5356, in create_model_tables
m.create_table(**create_table_kwargs)
File "/ibers/ernie/scratch/mts11/swga_workspace/lib/python2.7/site-packages/peewee.py", line 5028, in create_table
if fail_silently and cls.table_exists():
File "/ibers/ernie/scratch/mts11/swga_workspace/lib/python2.7/site-packages/peewee.py", line 5024, in table_exists
return cls._meta.db_table in cls._meta.database.get_tables(**kwargs)
File "/ibers/ernie/scratch/mts11/swga_workspace/lib/python2.7/site-packages/peewee.py", line 4074, in get_tables
'type = ? ORDER BY name;', ('table',))
File "/ibers/ernie/scratch/mts11/swga_workspace/lib/python2.7/site-packages/peewee.py", line 3837, in execute_sql
self.commit()
File "/ibers/ernie/scratch/mts11/swga_workspace/lib/python2.7/site-packages/peewee.py", line 3656, in exit
reraise(new_type, new_type(*exc_args), traceback)
File "/ibers/ernie/scratch/mts11/swga_workspace/lib/python2.7/site-packages/peewee.py", line 3830, in execute_sql
cursor.execute(sql, params or ())
peewee.OperationalError: disk I/O error
(swga_workspace)[mts11@bert Aug12]$

removing large numbers of sets takes too long

In the dev branch, sets should be removed in bulk.
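As a rough illustration of what bulk removal could look like, the sketch below deletes many rows per statement inside a single transaction instead of issuing one DELETE per set. It uses the standard-library `sqlite3` module rather than peewee, and the table name `primer_set`, the column `id`, and the function name are hypothetical, not swga's actual schema.

```python
import sqlite3


def delete_sets_bulk(conn, set_ids, chunk_size=500):
    """Delete rows from the (hypothetical) primer_set table in chunks.

    Returns the number of rows deleted. Chunking keeps the number of
    bound parameters per statement below SQLite's variable limit.
    """
    set_ids = list(set_ids)
    deleted = 0
    with conn:  # one transaction for all chunks; commits on success
        for i in range(0, len(set_ids), chunk_size):
            chunk = set_ids[i:i + chunk_size]
            placeholders = ",".join("?" * len(chunk))
            cur = conn.execute(
                "DELETE FROM primer_set WHERE id IN ({})".format(placeholders),
                chunk,
            )
            deleted += cur.rowcount
    return deleted
```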

Sliding window step size doesn't work

In swga export bedgraph, the sliding window step size cannot be set manually; it always defaults to 2000.
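To make the expected behaviour concrete, here is a minimal sketch of counting binding sites in sliding windows where the step size is an explicit argument. It is not swga's export code; the function name, the half-open window convention, and the truncation of the last window at the genome end are assumptions.

```python
from bisect import bisect_left


def sliding_window_counts(positions, genome_length, window_size, step_size):
    """Return (start, end, count) rows for half-open windows [start, end).

    positions: iterable of 0-based binding-site coordinates.
    The final windows are truncated at genome_length.
    """
    positions = sorted(positions)
    rows = []
    for start in range(0, genome_length, step_size):
        end = min(start + window_size, genome_length)
        # Sites with start <= position < end fall in this window.
        n = bisect_left(positions, end) - bisect_left(positions, start)
        rows.append((start, end, n))
    return rows
```

With this shape, changing `step_size` changes how many rows are emitted, which is the behaviour the option is expected to control.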

Change fg_min_avg_rate and bg_max_avg_rate

These parameters are confusing, especially since they serve a similar purpose as min_fg_bind and max_bg_bind but are named differently.

So let's change them to min_fg_bind and max_bg_bind.
Per Erik, they will still be defined separately in parameters.cfg to allow for less stringent settings at the count step and more stringent settings at the filter step. This allows a user to run count (slow) once, filtering out only very bad primers, and then rerun filter (fast) many times as they change their cutoffs.

Should be easy to fix in filter.py since these rates are used to derive binding frequency cutoffs.

So that the defaults scale with genome length, we will set the defaults (not seen by the user) as min_avg_rate and max_avg_rate, and use these together with the genome sizes to populate the min_fg_bind and max_bg_bind parameters.

We will use rates of 1/150000 for bg and 1/100000 for fg, for both count and filter, and recommend that the user change the filter parameters.
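A small sketch of how rate-based defaults could be turned into binding-count cutoffs is given below. It assumes the 1/100000 rate applies to the foreground and the 1/150000 rate to the background, and that the cutoffs are obtained by multiplying each rate by the genome length and rounding; the function name and the rounding choice are assumptions.

```python
def default_bind_cutoffs(fg_length, bg_length,
                         min_avg_rate=1.0 / 100000,
                         max_avg_rate=1.0 / 150000):
    """Derive (min_fg_bind, max_bg_bind) from average binding rates.

    min_avg_rate: minimum sites per bp required in the foreground genome.
    max_avg_rate: maximum sites per bp tolerated in the background genome.
    """
    min_fg_bind = int(round(min_avg_rate * fg_length))
    max_bg_bind = int(round(max_avg_rate * bg_length))
    return min_fg_bind, max_bg_bind
```

For example, a 1,000,000 bp foreground and a 3,000,000 bp background would give a minimum of 10 foreground sites and a maximum of 20 background sites under these assumptions.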

swga filter output

Since the parameters specify how many primers to choose after filtering (default 200), it might be nice to have the program output a warning if there are fewer primers than specified left after filtering.

For example, if you want to choose the top 200 primers but only 10 pass all the filters, the program would print a warning at the bottom indicating that fewer primers than desired passed the filters and that the user should consider changing the parameters of swga count and swga filter.
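The requested warning could look roughly like the sketch below. The function name and message wording are hypothetical; only the default of 200 comes from the description above.

```python
import warnings


def check_primer_count(n_passed, n_requested=200):
    """Warn if fewer primers passed filtering than were requested.

    Returns True when enough primers passed, False otherwise.
    """
    if n_passed < n_requested:
        warnings.warn(
            "Only {} primers passed all filters, fewer than the {} requested; "
            "consider relaxing the parameters of swga count and swga filter."
            .format(n_passed, n_requested)
        )
        return False
    return True
```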

DSK merges kmers and reverse complements

Since most properties of kmers and their RCs will be the same, this is fine through swga filter.

However, sets may exist that can include primerA and primerB(RC) but not primerA and primerB (due to heterodimer checks).

So we need to include the RCs of active primers in swga find_sets.
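A minimal sketch of expanding a primer list with reverse complements is shown below. The function names are hypothetical and the code assumes primers contain only A, C, G, and T; palindromic primers (equal to their own RC) are listed once.

```python
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def with_reverse_complements(primers):
    """Return primers plus their reverse complements, without duplicates.

    Order is preserved: each primer is followed by its RC if not yet seen.
    """
    expanded = []
    seen = set()
    for primer in primers:
        for seq in (primer.upper(), reverse_complement(primer)):
            if seq not in seen:
                seen.add(seq)
                expanded.append(seq)
    return expanded
```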

However, I still can't bypass the python 3 error.
Can you advise?

Do you have a version for python 3?