
arc's People

Contributors

bsarver-fluxion, rlyon, samhunter


arc's Issues

bowtie2 only writes reads to a single target

Bowtie2 is being run in default mapping mode, which means it reports only one hit per read. Add the ability to report multiple hits (the -k switch).

This has been implemented and is in need of testing.

Speed up SPAdes

Alessandro asks:

"One option which could be really useful in ARC config file is to disable the hammer steps during spades assembler (read error correction, option --only-assembler) for each iteration."

It would probably be good to disable read error correction on all but the final assembly; this way the final assembly would (in theory) still be generated from error-corrected reads.
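One way this could look in the assembler wrapper (a sketch with placeholder paths and a hypothetical helper name, not ARC's actual code): build the SPAdes argument list and add --only-assembler on every iteration except the last.

```python
def spades_args(iteration, numcycles, target_dir="./t__000001"):
    """Build a SPAdes command line, skipping BayesHammer read error
    correction (--only-assembler) on all but the final iteration so
    the last assembly is still built from corrected reads.
    Hypothetical helper; paths are placeholders."""
    args = ["spades.py", "-t", "1",
            "-1", target_dir + "/PE1.fastq",
            "-2", target_dir + "/PE2.fastq",
            "-o", target_dir + "/assembly"]
    if iteration < numcycles:
        # intermediate iteration: trade accuracy for speed
        args.insert(1, "--only-assembler")
    return args
```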

Keep contigs from previous assembly if current assembly is killed

Currently when an assembly is killed for a target, any contigs assembled in a previous iteration for that target will not be written out to the finished folder. It would instead be nice to get the contigs assembled at a previous iteration.

Alternatively mapping coverage could be calculated at each step and used to mask bases/contigs which have higher than expected coverage (indicating repeats) however these calculations would likely be costly and slow ARC down.

ARC doesn't properly detect missing PE files

In config.py, line 203 (if not (pe or se)) there is a small logic error which can cause ARC to fail to identify incorrectly formatted config files which contain an SE file and only one of the two PE files. This also causes a weird edge case when the config file is formatted with comment characters in front of the column headers, e.g.:

# Sample_ID FileName FileType
Sample1 ./reads/Sample1_R1.fasta PE1

instead of:

Sample_ID FileName FileType
Sample1 ./reads/Sample1_R1.fasta PE1

Fix this logic.
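A sketch of the intended validation (the helper and its dict input are hypothetical, not config.py's actual structure): PE1 and PE2 must appear together, and a sample needs either a full PE pair or an SE file.

```python
def validate_sample(files):
    """Check the FileTypes declared for one sample.
    `files` maps FileType -> FileName, e.g. {"PE1": path, "SE": path}.
    Returns a list of error strings (empty means valid).
    Sketch of the intended logic, not ARC's actual code."""
    errors = []
    has_pe1 = "PE1" in files
    has_pe2 = "PE2" in files
    has_se = "SE" in files
    if has_pe1 != has_pe2:
        # an SE plus only one PE file must be rejected
        errors.append("PE1 and PE2 must both be provided")
    if not ((has_pe1 and has_pe2) or has_se):
        errors.append("sample needs either a PE pair or an SE file")
    return errors
```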

Generate mapping statistics

Build a table during the final finishing stage (or write to it at each iteration) which is formatted like:

iteration, target1, target2, ..., targetN
1, 12, 15, ..., N

Each cell contains the number of reads mapped to that target at that iteration.
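A minimal sketch of such a table writer, assuming per-iteration counts are collected into a dict (the helper is hypothetical, not part of ARC):

```python
import csv
import io

def write_mapping_stats(counts, out):
    """Write a per-iteration read-count table.
    `counts` maps iteration -> {target: reads_mapped}; the layout is
    one row per iteration and one column per target, as above."""
    targets = sorted({t for row in counts.values() for t in row})
    writer = csv.writer(out)
    writer.writerow(["iteration"] + targets)
    for it in sorted(counts):
        # missing targets default to 0 reads for that iteration
        writer.writerow([it] + [counts[it].get(t, 0) for t in targets])

buf = io.StringIO()
write_mapping_stats({1: {"target1": 12, "target2": 15}}, buf)
```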

Output Qual values for contigs

In some cases it is useful to have quality values for assembled contigs. These are available from Newbler (determine whether SPAdes produces them).

Track expected coverage

Maintain information about contig lengths and the calculated expected coverage; using this, add the -e switch to Newbler calls (SPAdes doesn't seem to support expected coverage, but research this further).

Reduce disk space requirements when using bowtie2

It might be possible to reduce space requirements when using Bowtie2 by only writing out mapped reads (--no-unal flag). Before this is done it is necessary to double-check that when only one member of a pair maps, both reads of the pair are still written.
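The pair condition to verify is just SAM flag bit 0x4 (segment unmapped) on each record; a minimal sketch of the test (the function is hypothetical):

```python
UNMAPPED = 0x4  # SAM FLAG bit: this segment is unmapped

def pair_should_be_kept(flag1, flag2):
    """A half-mapped pair (exactly one mate unmapped) must still have
    both records written, or read splitting will lose the mate.
    Returns True if at least one mate of the pair is mapped."""
    return not (flag1 & UNMAPPED) or not (flag2 & UNMAPPED)
```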

In the future, using a pipe instead of a file to get output from bowtie2 into the parser would be an even better option. This would require re-writing the mapper + splitter however, and it doesn't appear that Blat can output to stdout. So some other strategy will need to be developed (e.g. creating another Blat patch to enable output to stdout).

Generate more summary tables

It would be nice to get a final set of summary tables or datasets with details like:

  1. Final status for all Samples X Targets combinations e.g.:
Sample Target Status Iteration Reads Contigs ContigLength
S1 T1 Finished 5 2300 1 2000
S1 T2 NoContigs 1 5 0 0
S1 T3 Killed 3 15232 12 9400
S1 T4 Repeat 6 12000 11 15000
... ... ... ... ... ... ...
SN TM Finished 12 300 1 1500

This would make it much easier to generate a final set of summary statistics, and facilitate many other comparisons without the need to parse log files.

Allow ARC to resume a run if it is terminated/crashes

In some cases it may be useful to allow ARC to be run for a few more iterations or to resume a run which was terminated for some reason. This could be controlled by adding something to the ARC_config.txt (i.e. restart = True).

Do something like the following:

  1. Clean up: old assemblies, intermediate mapping results, and other junk (if any)
  2. Check for IX_contigs.fasta where X is range(1, numcycles), choose the last one as the targets to start with (ignore the targets in ARC_config.txt).
  3. Set up a Config object with all of the necessary pieces (iteration etc)
  4. Run the spawner which should kick off mapping etc effectively resuming the process.

idiot proof dependencies

Currently ARC will install even when its dependencies are not satisfied:

running install
running build
running build_py
running build_scripts
running install_lib
running install_scripts
changing mode of /usr/bin/ARC to 775
running install_egg_info
Removing /usr/lib/python2.6/site-packages/ARC-1.1.0-py2.6.egg-info
Writing /usr/lib/python2.6/site-packages/ARC-1.1.0-py2.6.egg-info

"KeyError" on failed target assembly after previous iteration killed for that target

I'm running into an issue where I get a python KeyError for targets that fulfill the following conditions:

1). Target assembly fails in iteration 1. Then "Writing reads as contigs."
2). Target assembly killed (spades times out) on iteration 2. Then "Writing contigs from previous iteration."
3). Target recruits fewer reads in iteration 3 than in 2. Target assembly fails in iteration 3.

evan@maven:/mnt/Data1/Gary/ARC$ grep Contig40947 ARC--try2.log 
[2015-02-05 08:54:19,745 INFO 21734] Sample: ALL5 target: 36948|Contig40947 iteration: 1 Split 180 reads in 0.0773031711578 seconds
[2015-02-05 08:54:56,345 INFO 21744] Sample: ALL5 target: 36948|Contig40947 iteration: 1 Assembly failed after 7.53394508362 seconds
[2015-02-05 08:58:46,348 INFO 21761] Sample: ALL5 target: 36948|Contig40947 finishing target..
[2015-02-05 08:58:46,348 INFO 21761] Sample: ALL5 target: 36948|Contig40947 iteration: 1 Assembly reports status: assembly_failed.
[2015-02-05 08:58:46,348 INFO 21761] Sample ALL5 target 36948|Contig40947: Writing reads as contigs.
[2015-02-05 10:47:54,027 INFO 21740] Sample: ALL5 target: 36948|Contig40947 iteration: 2 Split 411178 reads in 240.437613964 seconds
[2015-02-05 12:56:23,086 WARNING 21751] Sample: ALL5 target: 36948|Contig40947 Assembly killed after 7200.12375808 seconds.
[2015-02-05 12:56:23,109 INFO 21751] Sample: ALL5 target: 36948|Contig40947 iteration: 2 Assembly killed after 7200.14657593 seconds
[2015-02-06 11:39:30,387 INFO 21740] Sample: ALL5 target: 36948|Contig40947 finishing target..
[2015-02-06 11:39:30,387 INFO 21740] Sample: ALL5 target: 36948|Contig40947 iteration: 2 Assembly reports status: assembly_killed.
[2015-02-06 11:40:16,442 INFO 21740] Sample: ALL5 target: 36948|Contig40947 iteration: 2 Writing contigs from previous iteration.
[2015-02-08 10:39:37,060 INFO 21736] Sample: ALL5 target: 36948|Contig40947 iteration: 3 Setting last_assembly to True
[2015-02-08 10:39:37,060 INFO 21736] Sample: ALL5 target: 36948|Contig40947 iteration: 3 Split 2511 reads in 2.27638602257 seconds
[2015-02-08 10:40:18,630 INFO 21753] Sample: ALL5 target: 36948|Contig40947 iteration: 3 Assembly failed after 41.5684621334 seconds
[2015-02-09 08:02:30,358 INFO 21752] Sample: ALL5 target: 36948|Contig40947 finishing target..
[2015-02-09 08:02:30,358 INFO 21752] Sample: ALL5 target: 36948|Contig40947 iteration: 3 Assembly reports status: assembly_failed.
[2015-02-09 08:02:30,358 INFO 21752] Sample ALL5 target 36948|Contig40947 did not incorporate any more reads, no more mapping will be done

KeyError: '36948|Contig40947'

Bottom of the ARC output:

[2015-02-09 08:02:30,358 INFO 21752] Sample: ALL5 target: 36948|Contig40947 finishing target..
[2015-02-09 08:02:30,358 INFO 21752] Sample: ALL5 target: 36948|Contig40947 iteration: 3 Assembly reports status: assembly_failed.
[2015-02-09 08:02:30,358 INFO 21752] Sample ALL5 target 36948|Contig40947 did not incorporate any more reads, no more mapping will be done
[2015-02-09 08:02:30,584 ERROR 21752] Traceback (most recent call last):

  File "/home/evan/anaconda/lib/python2.7/site-packages/ARC/process_runner.py", line 62, in run
    self.launch()

  File "/home/evan/anaconda/lib/python2.7/site-packages/ARC/process_runner.py", line 43, in launch
    job.runner()

  File "/home/evan/anaconda/lib/python2.7/site-packages/ARC/runners/base.py", line 58, in runner
    self.start()

  File "/home/evan/anaconda/lib/python2.7/site-packages/ARC/runners/finisher.py", line 135, in start
    self.write_target(target, target_folder, outf=fin_outf, finished=True)

  File "/home/evan/anaconda/lib/python2.7/site-packages/ARC/runners/finisher.py", line 291, in write_target
    targetLength=self.params['summary_stats'][target]['targetLength'],

KeyError: '36948|Contig40947'

[2015-02-09 08:02:30,584 ERROR 21752] An unhandled exception occurred
[2015-02-09 08:02:30,585 ERROR 21684] Terminating processes
[2015-02-09 08:02:30,728 ERROR 21684] ARC.app unexpectedly terminated
[2015-02-09 08:02:30,732 INFO 21684] process shutting down

Any thoughts? Thanks very much!
Evan

ARC sometimes incorporates a lot of off-target and/or repetitive reads

Because of the "sloppy mapping" approach, targets sometimes pull in a few repetitive regions which then pull in a few more etc, causing big problems for assembly speed and slowing down the whole process. Currently this is partially handled by repeat detection and removal based on % difference in read incorporation from iteration to iteration.
Some alternative, smarter approaches to dealing with this might include:

  1. More stringent mapping parameters which go into effect after the first iteration. There isn't really any need for "sloppy" mapping once a set of initial contigs has been established.
  2. Some sort of a contig composition filtering step to screen low-complexity contigs. This might be as simple as a 2-mer frequency table followed by some outlier detection, or something like a "Dusty score" calculation might work better.

Do a better job of detecting corrupted read indexes

If a user kills ARC during read indexing (e.g. with ctrl+c) the read index (a SQLite database) will only contain a subset of the reads in the Fastq file. At this point the user should delete the partial indexes before running ARC again. However, if ARC is run a second time without deleting the partial indexes, ARC improperly detects that the FastQ has already been indexed and skips re-indexing in order to save time. When read splitting takes place, reads exist in the fastq which are not in the index, causing a crash.
This problem has come up frequently enough that it would be nice to fix it. One option that immediately comes to mind is to modify app.py so that each indexing step is wrapped in a try/except/finally block which explicitly deletes the index file.

Refactor code for passing params

Instead of passing a list of certain params from class to class, pass all params (and delete those which aren't necessary i.e. read_dict for the assemblers). This will make future additions/enhancements to the code much easier to implement and also require less maintenance.

Modify map_against_reads behavior

Modify map_against_reads behavior so that it includes the contigs AND reads on the 2nd iteration. In other words, it does mapping, attempts to do assemblies, and then includes contigs if any, as well as all reads (assembled or not) as targets for the second round of mapping.

SPAdes assembly fails on low coverage (<=10X) contigs (killing ARC)

Hi there,

Excellent work so far on ARC - I'm really looking forward to using this! In playing around with the approach, I've run into an issue...

When assembling with SPAdes (v.2.4.0), there appears to be a problem where SPAdes/ARC choke on very low-coverage contigs. The console log from ARC is:

[2013-07-02 14:57:33,035 INFO 6413] Reading config file...
[2013-07-02 14:57:33,043 INFO 6413] Setting up working directories and building indexes...
[2013-07-02 14:58:50,929 INFO 6413] Setting up multiprocessing...
[2013-07-02 14:58:50,929 INFO 6413] Starting...
[2013-07-02 14:58:51,465 INFO 6420] Running bowtie2 for Sample1
[2013-07-02 14:58:51,471 INFO 6420] Calling bowtie2-build for sample: Sample1
[2013-07-02 14:58:51,472 INFO 6420] bowtie2-build -f /path/to/uce-probes-arc-format.fasta /path/to/working_Sample1/idx/idx
[2013-07-02 14:58:52,066 INFO 6420] Calling bowtie2 for sample: Sample1
[2013-07-02 14:58:52,066 INFO 6420] nice -n 19 bowtie2 -I 0 -X 1500 --local -p 12 -x /path/to/working_Sample1/idx/idx -1 /path/to/reads/gallus-gallus-READ1.fastq -2 /path/to/reads/gallus-gallus-READ2.fastq -U /path/to/reads/gallus-gallus-READ-singleton.fastq -S /path/to/working_Sample1/mapping.sam
[2013-07-02 14:59:25,197 INFO 6420] Sample: Sample1, Processed 2643335 lines in 5.86183905602 seconds.
[2013-07-02 14:59:25,637 INFO 6420] Running splitreads for Sample1
[2013-07-02 14:59:28,473 INFO 6420] Split 84 reads for sample Sample1 target uce1119 in 2.83518695831 seconds
[2013-07-02 14:59:29,083 INFO 6421] Running Spades for sample: Sample1 target: uce1119
[2013-07-02 14:59:29,094 INFO 6421] Calling spades for sample: Sample1 target uce1119
[2013-07-02 14:59:29,094 INFO 6421] spades.py -t 1 -1 /path/to/working_Sample1/t__001379/PE1.fastq -2 /path/to/working_Sample1/t__001379/PE2.fastq -s /path/to/working_Sample1/t__001379/SE.fastq -o /path/to/working_Sample1/t__001379/assembly
[2013-07-02 14:59:30,401 INFO 6420] Split 4 reads for sample Sample1 target uce4473 in 4.76373004913 seconds
[2013-07-02 14:59:30,554 INFO 6423] Running Spades for sample: Sample1 target: uce4473
[2013-07-02 14:59:30,555 ERROR 6423] An unhandled exception occured
Process ProcessRunner-4:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/bcf/git/ARC/bin/../ARC/process_runner.py", line 66, in run
    raise e
KeyError: 'assembly_SE'
[2013-07-02 14:59:30,590 ERROR 6413] Fatal error returned from ProcessRunner-4
[2013-07-02 14:59:30,590 ERROR 6413] Terminating processes
[2013-07-02 14:59:30,590 ERROR 6413] A fatal error was encountered.
    'Unrecoverable error'

I figured this might be related to the 4 reads split into the PE files for the target, so I ran SPAdes against just those reads split into t__002555 (the directory containing the PE files for the 4 reads) with:

spades.py -t 1 -1 /path/to/working_Sample1/t__002555/PE1.fastq -2 /path/to/working_Sample1/t__002555/PE2.fastq -o /path/to/working_Sample1/t__002555/assembly

The error returned from SPAdes was (snipped):

Verification of expression 'cov_.size() > 10' failed in function 'void cov_model::KMerCoverageModel::Fit()'. In file '/home/yasha/gitrep/algorithmic-biology/assembler/src/debruijn/kmer_coverage_model.cpp' on line 192. Message 'Invalid kmer coverage histogram'.
Verification of expression 'cov_.size() > 10' failed in function 'void cov_model::KMerCoverageModel::Fit()'. In file '/home/yasha/gitrep/algorithmic-biology/assembler/src/debruijn/kmer_coverage_model.cpp' on line 192. Message 'Invalid kmer coverage histogram'.
Exception caught std::exception

I looked around for an easy way to turn off the coverage limitation for the assembly to see if that would allow the process to continue, but SPAdes seems to have no CLI flags to do that (at least not readily apparent).

Thanks very much and keep up the excellent work!

best,
b

Improve repeat detection

In the MultiMite test, 9 target X sample combinations were flagged as hitting a repeat and further assembly was stopped at iteration 2. In actuality this occurred because a small number of reads were recruited on the first iteration followed by a large number on the second. In 8 of 9 cases a reduced number of contigs was produced on iteration 2 compared to 1, and in the 9th case the number was equal.

Based on these results:
Set up a new criterion for repeat detection which includes the number of contigs. For example:

if NumReads > lastNumReads * multiplier and NumContigs > lastNumContigs:
    isRepeat = True

This should guard against most cases of false repeat detection.

Fasta as input

Bowtie2 does not accept fasta files as input. In this case either Blat must be used, or we could produce bogus qualities and generate a fastq. At minimum, check that if format=fasta is set, Blat is the configured mapper.
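The bogus-quality workaround is simple to sketch; a real implementation would more likely go through Biopython, but the transformation itself is just this (constant dummy quality 'I'):

```python
def fasta_to_fastq(fasta_text, qual_char="I"):
    """Convert fasta records to fastq with a constant dummy quality
    character. Plain-string sketch of the workaround, not ARC code."""
    records = []
    name, seq = None, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(seq)))
            name, seq = line[1:], []
        else:
            seq.append(line.strip())
    if name is not None:
        records.append((name, "".join(seq)))
    out = []
    for name, s in records:
        # fastq record: @name / sequence / + / fake qualities
        out += ["@" + name, s, "+", qual_char * len(s)]
    return "\n".join(out) + "\n"
```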

enhancement request to deal with unusual read names in SRA datasets (causes KeyError termination)

Hi there,

Thanks for writing and making ARC available - I am beginning to use it and finding it VERY useful. I have an enhancement suggestion I'd love you to consider, if it's easy to implement. If it's not easy, not to worry!

I've been using ARC on some datasets I downloaded from NCBI's SRA using their SRA toolkit. Single-end data was working well in ARC, but I was having some trouble with paired-end data until I realized that read naming in those SRA files is not like standard Illumina format. I cooked up a script that fixes the read names to be suitable for ARC, but it'd be nice to avoid that step if possible.

When I download paired-end reads from SRA, pairs of reads get named like this:
SRR505874.1.1 and SRR505874.1.2 (first pair in the set)
SRR505874.51.1 and SRR505874.51.2 (51st pair in the set)
etc

ARC doesn't seem to like the .1 and .2 extensions (understandable), but if I fix the read names so that both members of the pair have the same name (i.e. SRR505874.1 for the first pair, and SRR505874.51 for the 51st pair) then ARC works fine. It'd be nice to be able to specify some option telling ARC to strip off the .1/.2 extensions from the names itself. I've pasted at the bottom the error ARC gives when I leave the .1/.2 extensions on.
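For what it's worth, the requested stripping is a one-line transformation (a sketch of the suggested option, not an existing ARC feature):

```python
import re

def normalize_sra_name(read_id):
    """Strip a trailing .1/.2 mate suffix from an SRA-style read name
    (e.g. SRR505874.51.1 -> SRR505874.51) so both members of a pair
    share the same ID."""
    return re.sub(r"\.[12]$", "", read_id)
```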

If you want some test datasets, please let me know and I should be able to cook something up.

all the best,

Janet Young


Dr. Janet Young

Malik lab
http://research.fhcrc.org/malik/en.html

Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., A2-025,
P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 4512
email: jayoung ...at... fhcrc.org



ARC Version: v1.1.3 2014-09-02
[2015-04-20 18:32:19,484 INFO 20261] Reading config file...
[2015-04-20 18:32:19,608 INFO 20261] max_incorporation not specified in ARC_config.txt, defaulting to 10
[2015-04-20 18:32:19,608 INFO 20261] workingdirectory not specified in ARC_config.txt, defaulting to ./
[2015-04-20 18:32:19,608 INFO 20261] fastmap not specified in ARC_config.txt, defaulting to False
[2015-04-20 18:32:19,608 INFO 20261] keepassemblies not specified in ARC_config.txt, defaulting to False
[2015-04-20 18:32:20,111 INFO 20261] Setting up working directories and building indexes...
/home/jayoung/testRNAseqReads/FASTQfiles/tempUnpack/combined_1.fastq
/home/jayoung/testRNAseqReads/FASTQfiles/tempUnpack/combined_2.fastq
[2015-04-20 20:21:46,540 INFO 20261] Sample: T_malaccensis, indexed reads in 6566.36648893 seconds.
[2015-04-20 20:21:48,152 INFO 20261] allocating a new mmap of length 4096
[2015-04-20 20:21:48,153 INFO 20261] Running ARC.
[2015-04-20 20:21:48,153 INFO 20261] Submitting initial mapping runs.
[2015-04-20 20:21:48,153 INFO 20261] Starting...
[2015-04-20 20:21:48,156 INFO 12215] child process calling self.run()
[2015-04-20 20:21:48,156 INFO 12216] child process calling self.run()
[2015-04-20 20:21:48,158 INFO 12217] child process calling self.run()
[2015-04-20 20:21:48,158 INFO 12215] Sample: T_malaccensis Running bowtie2.
[2015-04-20 20:21:48,159 INFO 12218] child process calling self.run()
[2015-04-20 20:21:48,201 INFO 12215] Sample: T_malaccensis Calling bowtie2-build.
[2015-04-20 20:21:48,201 INFO 12215] bowtie2-build -f /home/jayoung/testARCgene1/firstTry/working_T_malaccensis/I000_contigs.fasta /home/jayoung/testARCgene1/firstTry/working_T_malaccensis/idx/idx
[2015-04-20 20:21:53,044 INFO 12215] Sample: T_malaccensis Calling bowtie2 mapper
[2015-04-20 20:21:53,044 INFO 12215] bowtie2 -I 0 -X 1500 --very-fast-local --mp 12 --rdg 12,6 --rfg 12,6 -p 4 -x /home/jayoung/testARCgene1/firstTry/working_T_malaccensis/idx/idx -k 3 -1 /home/jayoung/testRNAseqReads/FASTQfiles/tempUnpack/combined_1.fastq -2 /home/jayoung/testRNAseqReads/FASTQfiles/tempUnpack/combined_2.fastq -S /home/jayoung/testARCgene1/firstTry/working_T_malaccensis/mapping.sam
[2015-04-20 22:38:52,969 INFO 12215] Sample: T_malaccensis, Processed 126931493 lines from SAM in 6139.76893306 seconds.
[2015-04-20 22:38:54,263 INFO 12215] Sample: T_malaccensis Running splitreads.
[2015-04-20 22:52:42,095 ERROR 12215] Traceback (most recent call last):

File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/process_runner.py", line 62, in run
self.launch()

File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/process_runner.py", line 43, in launch
job.runner()

File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/runners/base.py", line 58, in runner
self.start()

File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/runners/mapper.py", line 55, in start
self.splitreads()

File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/runners/mapper.py", line 392, in splitreads
read2 = idx_PE2[readID]

File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/biopython-1.60-py2.7-linux-x86_64.egg/Bio/SeqIO/_index.py", line 423, in __getitem__
if not row: raise KeyError

KeyError

[2015-04-20 22:52:42,095 ERROR 12215] An unhandled exception occurred
[2015-04-20 22:52:42,095 ERROR 20261] Terminating processes
[2015-04-20 22:52:42,166 ERROR 20261] ARC.app unexpectedly terminated
[2015-04-20 22:52:42,167 INFO 20261] process shutting down

Check for idx file and don't re-index if it exists

Often it is advantageous to be able to restart ARC, either because it became obvious that a few more iterations were necessary, or because a different set of targets could be used with the same reads.

For very large projects, it can take minutes/hours to index the massive reads files, so rather than do this, check whether working_dir and idx files exist already, and don't create them if they do.

Currently these folders/files are not deleted when ARC exits (it is up to the user to clean them up), so the change should be easy to implement.

Log file is missing entries

It turns out that the original version of the log handler wasn't thread-safe and was either garbling entries, or failing to write them entirely. A modified version is now in testing which should fix this.

bowtie2-build crashes if a contig is all 'n'

With the recent implementation of repeat masking, it is now possible to get a contig which looks like this:

>masked
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn

When bowtie2-build is called on a file containing a contig like this, it will crash:
*** glibc detected *** bowtie2-build: double free or corruption (out): 0x000000000481b210 ***

Solution: avoid writing out any contigs which are all 'n' (these won't recruit reads anyway).
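The filter is trivial; a sketch of the proposed fix:

```python
def keep_contig(seq):
    """Return False for contigs that are entirely 'n'/'N' (fully
    masked), which crash bowtie2-build and can't recruit reads."""
    return any(base not in "nN" for base in seq)
```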

Update Install Requirement

Update the install requirements to include the python modules required by ARC which are not part of the standard install: Biopython for sure, but what about subprocess and logger?

Indexing is really slow.

Indexing is very slow. Currently only one file is indexed at any given time (limiting ARC to using only a single processor during indexing). Further tests need to be done to determine whether indexing multiple files at the same time will overwhelm disk I/O and/or result in overall improvements to indexing speed.

Ideas:

  1. Create an adaptive strategy where parallel indexing processes are launched until the I/O overhead becomes significant (see python psutil).
  2. Launch a fixed number of N indexing processes with N <= nprocs. Maybe make this configurable by the user.
  3. Develop a new strategy for indexing the fastq files and/or recruiting reads (address #23, #43, and other issues in the way the reads are recruited).

Do a better job of cleaning up intermediate files if ARC is restarted

Currently ARC can be run in a folder where it has been run before and it will skip re-indexing reads. If ARC previously crashed or was terminated prematurely it won't have cleaned up a variety of intermediary files however (bowtie idx, assembly folder, etc).
If ARC detects that it is being re-run, it should clean up all of these files so that it will run successfully.

ARC is much too slow for very large sets of reads

When very large datasets are used (a full HiSeq lane for example), ARC is incredibly slow at splitting reads, e.g.:

[2013-06-20 11:35:09,603 INFO 21595] Split 3 reads for sample Sample1 target HWI-ST522_0060:7:2108:11503:138410#0/1_Cluster-3254_M072 in 3139.90650702 seconds

It might be necessary to re-think the current indexing scheme, perhaps going back to a simpler approach where the splitter runs through the whole file, pulling out every read that was hit and writing it either to memory or to a temporary folder on disk. The drawback is that no assemblies could be kicked off until all reads had been processed.

Alternatively, we could dump support for BLAT and pull the reads directly from the SAM file, making it unnecessary to go to the original reads files entirely. This would also require that the entire SAM file was parsed before any assemblies could be started.

Stopping Criteria

Add stopping criteria: a target is finished when no additional progress is being made.

Allow alternative working folder

The speed of ARC is heavily dependent on disk storage speed. Putting the working directory on a flash-based drive speeds ARC up tremendously, as does putting the reads on a flash-based drive. Even better on high memory systems is to store everything in RAM (i.e. /dev/shm on CentOS).

Currently some tricks can be used to make this happen, for example creating a set of "working" directories and then symlinking them to the location where ARC is running. A better alternative would be to just tell ARC where to put the working folder with a parameter which defaults to './'.

Improve speed and reduce disk IO for read recruitment

This will involve re-writing the index_db functionality in Biopython.

Currently a SQLite database is being generated by SeqIO.index_db() to provide fast random access to reads in the fasta/fastq input files. If this database were modified to include an additional column indicating the target against which the read has been mapped in previous iterations, only newly mapped reads would need to be split out on each iteration. This would drastically improve speed because lookups in the SQLite database are very quick. The current major bottleneck in ARC is the splitting step, so it is the obvious target for further optimization.
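A sketch of the proposed scheme using an extra target column (the schema here is hypothetical, not Biopython's actual index_db layout):

```python
import sqlite3

def new_reads_for_target(db, read_ids, target):
    """Given a table reads(name TEXT PRIMARY KEY, target TEXT), return
    only the read IDs not already assigned to `target`, then record the
    assignment so later iterations only split newly mapped reads."""
    new = []
    for rid in read_ids:
        row = db.execute(
            "SELECT target FROM reads WHERE name=?", (rid,)).fetchone()
        if row is None or row[0] != target:
            db.execute(
                "INSERT INTO reads(name, target) VALUES(?, ?) "
                "ON CONFLICT(name) DO UPDATE SET target=excluded.target",
                (rid, target))
            new.append(rid)
    return new
```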

Keep final set of reads

The final set of reads may be necessary for mapping & variant discovery as well as a number of other analysis steps. Instead of discarding the last set of reads, keep them (maybe gzipped) in the final_Sample folder.

Bug with read recruitment when targets have an ARC-like name

Due to kind of a hack used to stop ARC from recruiting reads which were already incorporated into a target, there exists a bug where ARC will refuse to recruit reads on the first iteration if the targets are named in a certain way.

The code for this is in the mapper:

# handle references built using assembled contigs:
if len(target.split(":")) == 3:
    target, status = target.split(":")[1:]
    # This keeps ARC from writing reads which mapped to finished contigs
    if status.startswith("Contig") or status.startswith("isogroup"):
        continue

Biopython method is deprecated

Fix call to get rid of this warning:
/opt/modules/devel/python/2.7.5/lib/python2.7/site-packages/Bio/Seq.py:302: BiopythonDeprecationWarning: This method is obsolete; please use str(my_seq) instead of my_seq.tostring().

Keep final set of assemblies

In the finished_sample folders, create a folder for each sample, in this folder keep a reads and assembly folder for each target. This is the final set of reads and final set of assembly folders.

Make urt switch for Newbler dynamic

Currently the "urt" (use read tips) switch for Newbler can be enabled so that it is used during all assemblies except for the last one. This doesn't function properly however if assemblies are terminated early because no more reads will map. In this case the final assembly will be made with the "urt" switch enabled, causing the final contigs to have low-quality tips.
