gabaldonlab / redundans


Redundans is a pipeline that assists an assembly of heterozygous/polymorphic genomes.

License: GNU General Public License v3.0

Python 14.75% Shell 3.58% Perl 4.42% Dockerfile 0.10% Makefile 0.88% C++ 73.50% C 0.11% R 0.66% Java 1.99%

genome-assembly python pipeline fasta contigs heterozygous polymorphic docker-image mate-pairs paired-end

redundans's Introduction

Redundans
- Prerequisites
  - Official conda package
  - UNIX installer
  - Docker image
- Running the pipeline
  - Parameters
  - Test run
- Support
- Citation

Redundans

The Redundans pipeline assists the assembly of heterozygous genomes.
The program takes assembled contigs, sequencing libraries and/or a reference sequence as input and returns a scaffolded homozygous genome assembly. The final assembly should be less fragmented and have a smaller total size than the input contigs. In addition, Redundans will automatically close the gaps resulting from genome assembly or scaffolding.

The pipeline consists of several steps (modules):

- de novo contig assembly (optional; run only if no contigs are given)
- redundancy reduction: detection and selective removal of redundant contigs from an initial de novo assembly
- scaffolding: joining of genome fragments using paired-end reads, mate pairs, long reads and/or reference chromosomes
- gap closing: filling the gaps after scaffolding using paired-end and/or mate-pair reads

Redundans is:

- fast & lightweight: multi-core support and memory optimisation mean it can run even on a laptop for small-to-medium size genomes
- flexible toward many sequencing technologies (Illumina, 454, Sanger, PacBio & Nanopore) and library types (paired-end, mate pairs, fosmids, long reads)
- modular: every step can be omitted or replaced by other tools
- reliable: it has already been used to improve genome assemblies varying in size (several Mb to several Gb) and complexity (fungi, animals & plants)

For more information have a look at the documentation, poster, publication, test dataset or manual.

Prerequisites

Redundans uses several programs (all except the interpreters and its submodules are provided within this repository):

Resource	Type	Version
Python	Language interpreter	≥ 3.8, < 3.11
Platanus	Genome assembler	v1.2.4
Miniasm	Genome assembler	≥ v0.3 (r179)
Minimap2	Sequence aligner	≥ v2.2.4 (r1122)
LAST	Sequence aligner	≥ v800
BWA	Sequence aligner	≥ v0.7.12
SNAP aligner	Sequence aligner	v2.0.1
SSPACE3	Scaffolding software	v3.0
GapCloser	Gapclosing software	v1.12
GFAstats	Stats software	≥ v1.3.6
Meryl	K-mer counter software	≥ v1.3
Merqury	Assembly evaluation software	v1.3
k8	Javascript shell based on V8	v0.2.4
R	Language interpreter	≥ 3.6
ggplot2	R package	≥ 3.3.2
scales	R package	≥ 3.3.2
argparser	R package	≥ 3.6
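
The Python range in the table can be checked before installing anything else. The following is a minimal sketch; the helper name `python_supported` is hypothetical and not part of Redundans.

```python
import sys

# Supported range taken from the table above: >= 3.8 and < 3.11.
MIN_VERSION = (3, 8)
MAX_VERSION_EXCLUSIVE = (3, 11)


def python_supported(version=None):
    """Return True if (major, minor) falls inside the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    status = "supported" if python_supported() else "NOT supported"
    print("Python %d.%d is %s" % (sys.version_info[0], sys.version_info[1], status))
```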

On most Linux distros, the installation should be as easy as:

git clone --recursive https://github.com/Gabaldonlab/redundans/
cd redundans && bin/.compile.sh

If it fails, make sure the following dependencies are installed:

- Perl [SSPACE3]
- make, gcc & g++ [BWA, GFAstats, Miniasm & LAST], e.g. sudo apt-get install make gcc g++
- zlib including zlib.h headers [BWA], e.g. sudo apt-get install zlib1g-dev
- R ≥ 3.6 and additional packages [ggplot2, scales, argparser] for plotting the Merqury results
- optionally, numpy and matplotlib for additional plotting, e.g. sudo -H pip install -U matplotlib numpy

For user convenience, we provide a UNIX installer and a Docker image that can be used instead of manual installation.

Official conda package

If you are familiar with conda, this will be by far the easiest way of installing redundans:

# create new Python3 >=3.8,<3.11 environment
conda create -n redundans python=3.10
# activate it
conda activate redundans
# and install redundans
conda install -c bioconda redundans

UNIX installer

The UNIX installer will automatically fetch, compile and configure Redundans together with all dependencies. It should work on all modern Linux systems, provided that Python >= 3, commonly used programs (i.e. wget, make, curl, git, perl, gcc, g++, ldconfig) and libraries (zlib including zlib.h) are installed.

source <(curl -Ls https://github.com/Gabaldonlab/redundans/raw/master/INSTALL.sh)
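
Before running the installer, it may help to confirm that the programs listed above are on PATH. This is only a sketch: `missing_tools` is a hypothetical helper, it does not check for the zlib headers, and on some systems `ldconfig` lives in /sbin and may not be on a regular user's PATH even though it is installed.

```python
import shutil

# Programs named in the installer description above.
REQUIRED_TOOLS = ["wget", "make", "curl", "git", "perl", "gcc", "g++", "ldconfig"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH")
```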

Docker image

First, you need to install docker: wget -qO- https://get.docker.com/ | sh
Then, you can run the test example by executing:

#Pull the image directly from dockerhub
docker pull cgenomics/redundans:latest

# process the data inside the image - all data will be lost at the end
docker run -it -w /root/src/redundans cgenomics/redundans:latest ./redundans.py -v -i test/{600,5000}_{1,2}.fq.gz -f test/contigs.fa -o test/run1

# if you wish to process local files, you need to mount the volume with -v
## make sure you are in redundans repo directory (containing test/ directory)
docker run -v `pwd`/test:/test:rw -it cgenomics/redundans:latest /root/src/redundans/redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1

Singularity image

Redundans also supports Singularity. First, install Singularity.

You can either pull the image from our Singularity repository or build it from the Docker image. Then run the first example:

#Pull from the singularity repo
singularity pull --arch amd64 library://cgenomics/redundans/redundans:2.0

#Build the image based on the docker repo
singularity build redundans.sif docker://cgenomics/redundans

#Use exec instead of run to account for shell-based wildcards * and ?
singularity exec redundans.sif bash -c "/root/src/redundans/redundans.py -v -i /root/src/redundans/test/*_?.fq.gz -f /root/src/redundans/test/contigs.fa -o /tmp/run1"

Running the pipeline

Redundans input consists of any combination of:

- assembled contigs (FastA)
- paired-end and/or mate-pair reads (FastQ*)
- long reads (FastQ/FastA*): both PacBio and Nanopore are supported for scaffolding
- and/or reference chromosomes/contigs (FastA)

* gzipped files are also accepted.

Redundans will return a homozygous genome assembly in scaffolds.filled.fa (FastA). It will also report the heterozygous contigs that were not discarded during the reduction step. In addition, the program reports statistics for every pipeline step, including the number of contigs removed, GC content, N50, N90 and the size of gap regions.
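
For readers unfamiliar with these statistics, the sketch below shows one common way to compute them from a set of sequences. It is an illustration only, not the code Redundans uses: the function names are hypothetical, gaps are assumed to be runs of `N`, and GC content is computed over non-`N` bases, which may differ from how the pipeline reports it.

```python
def nx(lengths, fraction):
    """Return the length L such that sequences of length >= L cover
    at least `fraction` of the total assembly size (N50 for 0.5)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    return 0  # empty input


def assembly_stats(sequences):
    """Summarise an iterable of sequence strings (assumes A/C/G/T/N only)."""
    seqs = [s.upper() for s in sequences]
    lengths = [len(s) for s in seqs]
    total = sum(lengths)
    gaps = sum(s.count("N") for s in seqs)
    gc = sum(s.count("G") + s.count("C") for s in seqs)
    non_gap = total - gaps
    return {
        "sequences": len(seqs),
        "size": total,
        "gc_percent": 100.0 * gc / non_gap if non_gap else 0.0,
        "N50": nx(lengths, 0.5),
        "N90": nx(lengths, 0.9),
        "gap_size": gaps,
    }
```

For example, for contig lengths 10, 8, 6, 4 and 2 (total 30), the two longest contigs already cover half of the assembly, so N50 is 8, while N90 is 4.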

Parameters

For user convenience, Redundans is equipped with a wrapper that automatically estimates run parameters and executes all steps/modules. You should specify some sequencing libraries (FastA/FastQ) or a reference sequence (FastA) in order to perform scaffolding. If you don't specify contigs with -f (FastA), Redundans will assemble contigs de novo, but you'll have to provide paired-end and/or mate-pair reads (FastQ). Most of the pipeline parameters can be adjusted manually (default values are given in square brackets []):
HINT: If your run fails, you may try to resume it by adding the --resume parameter.

General options:

  -h, --help            show this help message and exit
  -v, --verbose         verbose
  --version             show program's version number and exit
  -i FASTQ, --fastq FASTQ
                        FASTQ PE / MP files
  -f FASTA, --fasta FASTA
                        FASTA file with contigs / scaffolds
  -o OUTDIR, --outdir OUTDIR
                        output directory [redundans]
  -t THREADS, --threads THREADS
                        no. of threads to run [4]
  --resume              resume previous run
  --log LOG             output log to [stderr]
  --nocleaning

De novo assembly options:

  -m MEM, --mem MEM     max memory to allocate (in GB) for the Platanus assembler [2]
  --tmp TMP             tmp directory [/tmp]

Reduction options:

  --identity IDENTITY   min. identity [0.51]
  --overlap OVERLAP     min. overlap  [0.80]
  --minLength MINLENGTH
                        min. contig length [200]
  --minimap2reduce      Use minimap2 for the initial and final Reduction step. Recommended for input assembled contigs from long reads or larger contigs using --preset[asm5] by default. By default LASTal is used for Reduction.
  -x INDEX, --index INDEX
                        Minimap2 parameter -i used to load at most INDEX target bases into RAM for indexing [4G]. It has to be provided as a string INDEX ending with k/K/m/M/g/G.
  --noreduction         Skip reduction

Short-read scaffolding options:

  -j JOINS, --joins JOINS
                        min pairs to join contigs [5]
  -a LINKRATIO, --linkratio LINKRATIO
                        max link ratio between two best contig pairs [0.7]
  --limit LIMIT         align subset of reads [0.2]
  -q MAPQ, --mapq MAPQ  min mapping quality [10]
  --iters ITERS         iterations per library [2]
  --noscaffolding       Skip short-read scaffolding
  -b, --usebwa          use bwa mem for alignment [use snap-aligner]

Long-read scaffolding options:

  -l LONGREADS, --longreads LONGREADS
                        FastQ/FastA files with long reads
  -s, --populateScaffolds
                        Run populateScaffolds mode for long read scaffolding, else generate a dirty assembly for reference-based scaffolding. Not recommended for highly repetitive genomes. Default False.
  --minimap2scaffold         Use Minimap2 for aligning long reads. Preset usage dependant on file name convention (case insensitive): ont, nanopore, pb, pacbio, hifi, hi_fi, hi-fi. ie: s324_nanopore.fq.gz. Else it uses LASTal.

Reference-based scaffolding options:

  -r REFERENCE, --reference REFERENCE
                        reference FastA file
  --norearrangements    high identity mode (rearrangements not allowed)
  -p PRESET, --preset PRESET
                        Preset option for Minimap2-based Reduction and/or Reference-based scaffolding. Possible options: asm5 (5 percent sequence divergence), asm10 (10 percent sequence divergence) and asm20(20 percent sequence divergence). Default [asm5]

Gap closing options:

  --nogapclosing

Meryl and Merqury options:

  --runmerqury           Run meryldb and merqury for assembly kmer multiplicity stats. [False] by default.
  -k KMER, --kmer KMER  K-mer size for meryl [21]

Redundans is extremely flexible. All steps of the pipeline can be omitted using the --noreduction, --noscaffolding and/or --nogapclosing parameters, while the Meryl/Merqury evaluation is only run when --runmerqury is given.
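
To make the --identity and --overlap thresholds more concrete, here is a simplified sketch of how such a filter could work. It is an assumption-laden illustration, not the Redundans implementation: the hit tuples, the function name `redundant_contigs` and the rule that only the shorter contig of a pair is a removal candidate are all hypothetical simplifications.

```python
def redundant_contigs(hits, lengths, min_identity=0.51, min_overlap=0.80):
    """Return names of contigs that look redundant under the two thresholds.

    hits:    iterable of (query, target, identity, aligned_bases) tuples,
             a hypothetical, simplified alignment summary.
    lengths: dict mapping contig name to its length in bases.
    """
    redundant = set()
    for query, target, identity, aligned in hits:
        if query == target:
            continue  # ignore self-hits
        # Only the shorter contig of a pair is a removal candidate;
        # ties are broken by name so that both are never removed.
        if (lengths[query], query) > (lengths[target], target):
            continue
        overlap = aligned / lengths[query]  # fraction of the query covered
        if identity >= min_identity and overlap >= min_overlap:
            redundant.add(query)
    return redundant
```

With the defaults, a 400 bp contig aligned over 390 bp (overlap 0.975) at 95% identity to a longer contig would be flagged, while a 500 bp contig aligned over only 200 bp (overlap 0.40) would be kept.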

Test run

To run the test example, execute:

./redundans.py -v -i test/*_?.fq.gz -f test/contigs.fa -o test/run1

#Test it using minimap2 for the reduction step, increasing performance for large genomes
./redundans.py -v -i test/*_?.fq.gz -f test/contigs.fa --minimap2reduce -o test/run2

# if your run failed for any reason, you can try to resume it
rm test/run1/_sspace.2.1.filled.fa
./redundans.py -v -i test/*_?.fq.gz -f test/contigs.fa -o test/run1 --resume

# if you have no contigs assembled, just run without `-f`
./redundans.py -v -i test/*_?.fq.gz -o test/run.denovo

Note that the order of libraries (-i/--fastq) is not important, as long as read1 and read2 from each library are given one after another, e.g. -i 600_1.fq.gz 600_2.fq.gz 5000_1.fq.gz 5000_2.fq.gz would be interpreted the same as -i 5000_1.fq.gz 5000_2.fq.gz 600_1.fq.gz 600_2.fq.gz.
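
The pairing rule in the note above can be sketched as follows. This is a minimal illustration under the assumption that files are simply paired in the order given; `pair_libraries` is a hypothetical helper, and raising an error on an odd number of files is a choice made here, not documented Redundans behaviour.

```python
def pair_libraries(fastq_files):
    """Group a flat list of FASTQ paths into (read1, read2) libraries,
    taking consecutive files as mates of the same library."""
    if len(fastq_files) % 2:
        raise ValueError(
            "expected an even number of FASTQ files, got %d" % len(fastq_files)
        )
    return [
        (fastq_files[i], fastq_files[i + 1])
        for i in range(0, len(fastq_files), 2)
    ]


if __name__ == "__main__":
    first = pair_libraries(["600_1.fq.gz", "600_2.fq.gz", "5000_1.fq.gz", "5000_2.fq.gz"])
    second = pair_libraries(["5000_1.fq.gz", "5000_2.fq.gz", "600_1.fq.gz", "600_2.fq.gz"])
    # Same libraries either way; only their order differs.
    print(sorted(first) == sorted(second))
```

When relying on shell wildcards, it is worth checking that the expanded list really alternates read1 and read2, since extra files matching the pattern would shift the pairing.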

You can play with any combination of inputs, e.g. paired-end, mate pairs, long reads and/or reference-based scaffolding, and select minimap2 for each step or default to LASTal, for example:

# reduction, scaffolding with paired-end, mate pairs and long reads used to generate a miniasm assembly to do reference-based scaffolding, and gap closing with paired-end and mate pairs using as an aligner minimap2
./redundans.py -v -i test/*_?.fq.gz -l test/nanopore.fa.gz -f test/contigs.fa -o test/run_short_long_ref --minimap2scaffold

# reduction, scaffolding with paired-end, mate pairs and long reads, and gap closing with paired-end and mate pairs using populateScaffolds method using as aligner minimap2
./redundans.py -v -i test/*_?.fq.gz -l test/pacbio.fq.gz test/nanopore.fa.gz -f test/contigs.fa -o test/run_short_long_populatescaffold --minimap2scaffold --populateScaffolds

# scaffolding and gap closing with paired-end and mate pairs (no reduction)
./redundans.py -v -i test/*_?.fq.gz -f test/contigs.fa -o test/run_short-scaffolding-closing --noreduction

# reduction, reference-based scaffolding and gap closing with paired-end reads (--noscaffolding disables only short-read scaffolding)
./redundans.py -v -i test/600_?.fq.gz -r test/ref.fa -f test/contigs.fa -o test/run_ref_pe-closing --noscaffolding
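
The --minimap2scaffold help text says that the preset depends on keywords in the long-read file name. A rough sketch of such a lookup is shown below. The keywords come from the help text and `map-ont`, `map-pb` and `map-hifi` are standard minimap2 presets, but which keyword maps to which preset is an assumption here, as is the helper name `guess_preset`.

```python
import os

# Keywords from the --minimap2scaffold help text. The preset on the right of
# each entry is a standard minimap2 preset; the mapping itself is an assumption.
KEYWORD_TO_PRESET = [
    (("hifi", "hi_fi", "hi-fi"), "map-hifi"),
    (("ont", "nanopore"), "map-ont"),
    (("pb", "pacbio"), "map-pb"),
]


def guess_preset(path):
    """Return a minimap2 preset guessed from the file name, or None.

    Matching is a naive, case-insensitive substring search, so a name such
    as 'contigs.fa' would also match 'ont'.
    """
    name = os.path.basename(path).lower()
    for keywords, preset in KEYWORD_TO_PRESET:
        if any(keyword in name for keyword in keywords):
            return preset
    return None  # per the help text, LASTal is used in this case
```

Naming long-read files explicitly, as in s324_nanopore.fq.gz, avoids relying on accidental matches.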

For more details have a look in the test directory.

Support

If you have any issues or doubts, check the documentation and FAQ (Frequently Asked Questions). You may also want to sign up to our forum.

Citation

Leszek P. Pryszcz and Toni Gabaldón (2016) Redundans: an assembly pipeline for highly heterozygous genomes. NAR. doi: 10.1093/nar/gkw294


redundans's Issues

fasta2homozygous.py error

Dear Leszek,
Your Redundans assembly pipeline looks very promising. Unfortunately I got stuck when running the test run:

./redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1

I got the following error message:
File "/home/mgalland/workspace/wild_tomato_genomes/scr/redundans/bin/fasta2homozygous.py", line 126 contig2skip = {c: 0 for c in faidx}
SyntaxError: invalid syntax

The `for` looks like the problem here.

Can you help there?
Thanks in advance,
Marc
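
A likely explanation, assuming the script was run with an interpreter older than Python 2.7: dict comprehensions were only introduced in Python 2.7 and 3.0, so older interpreters reject that line with a SyntaxError. The snippet below shows equivalent forms that older versions accept; `faidx` is a stand-in list here, not the real FASTA index object.

```python
# `faidx` stands in for the FASTA index from the traceback; here it is
# just a list of contig names (an assumption for illustration).
faidx = ["contig1", "contig2", "contig3"]

# Dict comprehension: needs Python 2.7 or newer.
contig2skip = {c: 0 for c in faidx}

# Equivalent forms that also work on older interpreters.
contig2skip_old = dict((c, 0) for c in faidx)
contig2skip_keys = dict.fromkeys(faidx, 0)

assert contig2skip == contig2skip_old == contig2skip_keys
```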

Canu assembly (PacBio)

Hi,
I have got a PacBio assembly produced by Canu. It has created 3 output files (BAFB.contigs.fasta, BAFB.unassembled.fasta, BAFB.unitigs.fasta). The meaning of the output files is described here.

Which of the files should I use for redundans?

Thank you in advance.

Michal

TypeError: 'generator' object has no attribute '__getitem__'

Hi,
I ran:

$ python fasta2homozygous.py -i /scratch/fruit-data/BAFB.contigs.fasta -t 25
Homozygous assembly/ies will be written with input name + '.homozygous.fa.gz'
#file name      genome size     contigs heterozygous size       [%]     heterozygous contigs    [%]     identity [%]    possible joins  homozygous size [%]     homozygous contigs      [%]
Traceback (most recent call last):
  File "fasta2homozygous.py", line 239, in <module>
    main()
  File "fasta2homozygous.py", line 233, in main
    o.threads, o.verbose)
  File "fasta2homozygous.py", line 151, in fasta2homozygous
    contig2skip = fasta2skip(out, fasta, faidx, threads, identity, overlap, verbose)
  File "fasta2homozygous.py", line 71, in fasta2skip
    sys.stderr.write(' [ERROR] `%s` (%s) not in contigs!\n'%(q, str(hits[i-1])))
TypeError: 'generator' object has no attribute '__getitem__'
ubuntu@waterhouse-1:/mnt/apps/redundans/bin$ lastal: write error

Any idea what I did wrong?

Thank you in advance.

Michal
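
The traceback shows `hits[i-1]` being evaluated while an error message is formatted, and the TypeError suggests that `hits` is a generator at that point. The sketch below reproduces that general Python behaviour with a stand-in generator; it does not claim to be the fix applied upstream, and it does not address whatever triggered the error message in the first place.

```python
def hits_generator():
    """Stand-in for a function that yields alignment hits lazily."""
    for hit in ("hit1", "hit2", "hit3"):
        yield hit


hits = hits_generator()
try:
    hits[0]  # generators do not support indexing
except TypeError as error:
    message = str(error)

# Materialising the generator into a list first makes indexing possible.
hits = list(hits_generator())
last_hit = hits[-1]
```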

lastal: can't interpret TAB

Hello,

Ran into this error:

@BioPower3-IBM ~/programs/redundans $ ./redundans.py -v -i /shared/reads/Pseudoloma/GDR-24_R*_trim100.fastq -f ~/Pseudoloma/reads/mira_merged_sample50X_try1_out.unpadded_above500.fasta -o test/Pseudo_mira_100bp_Scaffold_try1 --sspacebin ~/programs/SSPACE-STANDARD-3.0_linux-x86_64/SSPACE_Standard_v3.0.pl -t 8
Options: Namespace(fasta='/home/adrian/Pseudoloma/reads/mira_merged_sample50X_try1_out.unpadded_above500.fasta', fastq=['/shared/reads/Pseudoloma/GDR-24_R1_trim100.fastq', '/shared/reads/Pseudoloma/GDR-24_R2_trim100.fastq'], identity=0.51000000000000001, iters=2, joins=5, limit=0.20000000000000001, linkratio=0.69999999999999996, log=<open file '', mode 'w' at 0x7f84863ad1e0>, mapq=10, minLength=200, nocleaning=True, nogapclosing=True, noreduction=True, noscaffolding=True, outdir='test/Pseudo_mira_100bp_Scaffold_try1', overlap=0.66000000000000003, sspacebin='/home/adrian/programs/SSPACE-STANDARD-3.0_linux-x86_64/SSPACE_Standard_v3.0.pl', threads=8, verbose=True)
Aligning 3594477 mates per library...
Insert size statistics Mates orientation stats
FastQ files median mean stdev FF FR RF RR
/shared/reads/Pseudoloma/GDR-24_R1_trim100.fastq /shared/reads/Pseudoloma/GDR-24_R2_trim100.fastq 264 265.46 14.71 1 9985 13 1

[Sat Mar 26 14:51:11 2016] Reduction...

file name genome size contigs heterozygous size [%] heterozygous contigs [%] identity [%] possible joins homozygous size [%] homozygous contigs [%]

lastal: can't interpret: TAB
test/Pseudo_mira_100bp_Scaffold_try1/contigs.fa 17972387 5935 0 0.00 0 0.00 0.000 0 17972387 100.00 5935 100.00
Aligning 3594477 mates per library...

[Sat Mar 26 14:51:40 2016] Scaffolding...
iteration 1.1 ...

It doesn't look like it removed redundancies. Any idea what is causing this?
Adrian

the position of the original contigs on the final output

Dear colleague, is there a file, like .agp, that specifies the position of the original contigs in the final output?

lastal: Input/output error

Redundans is working fine on most of my samples but for some I'm getting a lastal: Input/output error during scaffolding. Redundans seems to go ahead with gap closing without error, but it never terminates.

Here's an example log including the error. I terminated this job early but typically jobs with this error get through the second gap closing iteration also. I assigned the job 160 Gb of memory.

Options: Namespace(fasta='output/assembly3/GBSC.fasta', fastq=['output/filter_contamination2/GBSC.R1.fastq.gz', 'output/filter_contamination2/GBSC.R2.fastq.gz'], identity=0.51, iters=2, joins=5, limit=0.2, linkratio=0.7, log=<open file '<stderr>', mode 'w' at 0x7f47d9b0a1e0>, longreads=[], mapq=10, minLength=200, nocleaning=True, nogapclosing=True, norearrangements=True, noreduction=True, noscaffolding=True, outdir='output/redundans/GB', overlap=0.8, reference='input/ref/magna.nuc.fa', resume=False, threads=16, verbose=True)

##################################################
[Tue Apr 10 14:43:19 2018] Reduction...
#file name	genome size	contigs	heterozygous size	[%]	heterozygous contigs	[%]	identity [%]	possible joins	homozygous size	[%]	homozygous contigs	[%]
[WARNING] numpy or matplotlib missing! Cannot plot histogram
output/redundans/GB/contigs.fa	109746840	27555	7715577	7.03	20565	74.63	89.048	0	102031263	92.97	6990	25.37

##################################################
[Tue Apr 10 14:45:42 2018] Estimating parameters of libraries...
 Aligning 20406252 mates per library...
Insert size statistics				Mates orientation stats
FastQ files	read length	median	mean	stdev	FF	FR	RF	RR
output/filter_contamination2/GBSC.R1.fastq.gz output/filter_contamination2/GBSC.R2.fastq.gz	139	344	358.25	70.94	17	9114	859	10

##################################################
[Tue Apr 10 14:45:42 2018] Scaffolding...
 iteration 1.1: output/redundans/GB/contigs.reduced.fa	6990	102031263	40.535	6705	101849352	28763	6567	1007	201015
   20406253 pairs. 14761812 passed filtering [72.34%]. 207904 in different contigs [1.02%].
    1544643 pairs. 1476486 in different contigs [95.59%].
 iteration 1.2: output/redundans/GB/_sspace.1.1.fa	5923	102046669	40.535	5753	101939137	35144	7779	30809	272440
   20406253 pairs. 14868644 passed filtering [72.86%]. 184593 in different contigs [0.90%].

/usr/bin/bash: /mnt/lfs2/schaack/fmacrae/megadaph.private/megadaph/pipe/util/redundans/bin/last/src/lastal: Input/output error

    1476033 pairs. 1406284 in different contigs [95.27%].

##################################################
[Tue Apr 10 17:05:56 2018] Scaffolding based on reference...

##################################################
[Tue Apr 10 17:08:11 2018] Gap closing...
 iteration 1.1: output/redundans/GB/scaffolds.ref.fa	5781	119263174	40.535	5638	119173859	44959	9000	17253219	1690113

Thanks for building Redundans!

Error while installing redundans

Hello !

I've tried to install redundans the easy way. I've performed the following steps:

git clone --recursive https://github.com/lpryszcz/redundans.git
cd redundans && bin/.compile.sh

As it is written that every dependency is already included, I did not expect the following error, which I got while trying to run:

./redundans.py -v -i /media/loutre/SUZUKII/illumina_reads/*.fastq -l '/media/loutre/SUZUKII/fasta_reads_pacbio/filtered_subreads_clean.fasta' -f '/media/loutre/SUZUKII/assembly/Drosophila-suzukii-p.fasta' -t 30 -o '/media/loutre/SUZUKII/redundans'

Loutre:~/redundans$ ./redundans.py -v -i /media/loutre/SUZUKII/illumina_reads/*.fastq -l '/media/loutre/SUZUKII/fasta_reads_pacbio/filtered_subreads_clean.fasta' -f '/media/loutre/SUZUKII/assembly/Drosophila-suzukii-p.fasta' -t 30 -o '/media/loutre/SUZUKII/redundans'
Options: Namespace(fasta='/media/loutre/SUZUKII/assembly/Drosophila-suzukii-p.fasta', fastq=['/media/loutre/SUZUKII/illumina_reads/SRR1002946_1.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR1002946_1_trimmed.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR1002946_2.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR1002946_2_trimmed.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942797_1.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942797_2.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942798_1.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942798_2.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942799_1.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942799_2.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942800_1.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942800_2.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942801_1.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942801_2.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942802_1.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942802_2.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942803_1.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942803_2.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942804_1.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942804_2.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942805_1.fastq', '/media/loutre/SUZUKII/illumina_reads/SRR942805_2.fastq'], identity=0.51, iters=2, joins=5, limit=0.2, linkratio=0.7, log=<open file '<stderr>', mode 'w' at 0x7f73392991e0>, longreads=['/media/loutre/SUZUKII/fasta_reads_pacbio/filtered_subreads_clean.fasta'], mapq=10, minLength=200, nocleaning=True, nogapclosing=True, norearrangements=False, noreduction=True, noscaffolding=True, outdir='/media/loutre/SUZUKII/redundans', overlap=0.8, reference='', resume=False, threads=30, verbose=True)
[ERROR] lastdb: not found

[ERROR] lastal: not found

When I look in the redundans installation, I can see the tool LAST, so I guess it was downloaded, but maybe not compiled correctly.

Did I make a mistake or forget something in the process?

Cheers,

Roxane

error running test and personal data set

I initially posted a comment on BioStar here. But for convenience, and because of a recent error, I'll continue the dialogue here.
I've downloaded all of the dependencies for redundans and, after assembling with SPAdes, I attempted to run:

./redundans.py -i /home/molecularecology/Desktop/zcpb/CPBassembly/Ldec_180bp_male_1CLEAN.fq /home/molecularecology/Desktop/zcpb/CPBassembly/Ldec_180bp_male_2CLEAN.fq -f /home/molecularecology/Desktop/zcpb/CPBassembly/CPB_spades/CPBscaffolds.fasta -o redundansCPB

but received the error:

[ERROR] GapCloser: not found


Make sure you have installed all dependencies from https://github.com/lpryszcz/redundans#manual-installation !

and I'm fairly certain I have GapCloser. In the src directory:

41SSPACE-STANDARD-3.0_linux-x86_64.tar.gz  last-714
bwa                                        last-714.zip
bwa-0.7.12                                 redundans
bwa-0.7.12.tar.bz2                         redundans.install.log
GapCloser                                  redundans.tgz
GapCloser-bin-v1.12-r6.tgz                 SSPACE
GapCloser_Manual.pdf                       SSPACE-STANDARD-3.0_linux-x86_64
last

and redundans directory:

CHANGELOG.md          fastq2insert_size.py   GapFiller_v1-10_linux-x86_64
docs                  fastq2insert_size.pyc  INSTALL.sh
fasta2homozygous.py   fastq2mates.py         LICENSE
fasta2homozygous.pyc  fastq2sspace.py        pyScaf.py
FastaIndex.py         fastq2sspace.pyc       README.md
FastaIndex.pyc        filterReads.py         redundans.py
fasta_stats.py        filterReads.pyc        SSPACE-STANDARD-3.0_linux-x86_64
fasta_stats.pyc       GapCloser              test

So, any advice you could give would be greatly appreciated!

new features

confusion with contigs.reduced.fa.hetero.tsv file

Hello,

I've tried redundans before, but now I need it for a different purpose, so I downloaded latest version.

I suppose that the first column of this auxiliary file lists all the contigs that were removed because they met the --identity and --overlap criteria. I confirmed that by observing their absence from the contigs.reduced.fa file. My doubt concerns the longer contigs listed in column 3. I see that not all of them are present in the final reduced fasta file, which would mean that contigs that are the longer representation of some contigs in column 1 are also the shorter representation of some contigs in column 3, and thus they also get removed. Is that right?

Just want to clear this up before I proceed with my analysis.
Thanks in advance,
Pedro

test fails after new install

Greetings. On CentOS (64 bit), I installed automake 1.15 and autoconf 2.65 in /usr/local, then successfully installed redundans (according to the log file), but the test failed. Suggestions? The devtoolset-4 is needed for a compiler that is recent enough to understand all the g++ command line switches. Typically, programs produced this way need no special treatment when they are run.

cd ~/src
scl enable devtoolset-4 'source <(curl -Ls http://bit.ly/redundans_installer)' 2>&1 | tee redundans_installer.log
#worked!  Says to test it like so:
cd redundans
./redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1
Options: Namespace(fasta='test/contigs.fa', fastq=['test/5000_1.fq.gz', 'test/5000_2.fq.gz', 'test/600_1.fq.gz', 'test/600_2.fq.gz', 'test/pacbio.fq.gz'], identity=0.51, iters=2, joins=5, limit=0.2, linkratio=0.7, log=<open file '<stderr>', mode 'w' at 0x7fad26fc2270>, longreads=[], mapq=10, minLength=200, nocleaning=True, nogapclosing=True, norearrangements=False, noreduction=True, noscaffolding=True, outdir='test/run1', overlap=0.8, reference='', resume=False, threads=4, verbose=True)

##################################################
[Tue Jan  9 15:17:39 2018] Reduction...
#file name      genome size     contigs heterozygous size       [%]     heterozygous contigs    [%]     identity [%]    possible joins  homozygous size    [%]     homozygous contigs      [%]
test/run1/contigs.fa    163897  245     66377   40.50   221     90.20   94.854  0       97520   59.50   24      9.80

##################################################
[Tue Jan  9 15:17:40 2018] Estimating parameters of libraries...
 Aligning 19504 mates per library...
Insert size statistics                          Mates orientation stats
FastQ files     read length     median  mean    stdev   FF      FR      RF      RR
test/5000_1.fq.gz test/5000_2.fq.gz     50      4986    4981.70 692.22  0       4067    14      0
test/600_1.fq.gz test/600_2.fq.gz       100     599     598.74  47.22   0       10000   0       0

##################################################
[Tue Jan  9 15:17:42 2018] Scaffolding...
 iteration 1.1: test/run1/contigs.reduced.fa    24      97520   39.355  17      94157   7321    2195    0       29603
   19505 pairs. 17302 passed filtering [88.71%]. 1627 in different contigs [8.34%].
    1526 pairs. 558 in different contigs [36.57%].
 iteration 1.2: test/run1/_sspace.1.1.fa        3       97626   39.344  3       97626   87536   6063    821     87536
   19505 pairs. 17607 passed filtering [90.27%]. 182 in different contigs [0.93%].
    1077 pairs. 124 in different contigs [11.51%].
 iteration 2.1: test/run1/_sspace.1.2.fa        3       97626   39.344  3       97626   87536   6063    821     87536
   19505 pairs. 15112 passed filtering [77.48%]. 1295 in different contigs [6.64%].
    3417 pairs. 396 in different contigs [11.59%].
 iteration 2.2: test/run1/_sspace.2.1.fa        1       99115   39.344  1       99115   99115   99115   2310    99115
   19505 pairs. 15151 passed filtering [77.68%]. 0 in different contigs [0.00%].
    3398 pairs. 0 in different contigs [0.00%].

##################################################
[Tue Jan  9 15:17:48 2018] Gap closing...
 iteration 1.1: test/run1/scaffolds.fa  1       99115   39.344  1       99115   99115   99115   2310    99115

##################################################
[Tue Jan  9 15:17:49 2018] Final reduction...
#file name      genome size     contigs heterozygous size       [%]     heterozygous contigs    [%]     identity [%]    possible joins  homozygous size    [%]     homozygous contigs      [%]
Traceback (most recent call last):
  File "./redundans.py", line 521, in <module>
    main()
  File "./redundans.py", line 516, in main
    o.norearrangements, o.verbose, o.log)
  File "./redundans.py", line 391, in redundans
    info = fasta2homozygous(out, open(lastOutFn), identity, overlap,  minLength, threads, verbose=0, log=log)
  File "/home/mathog/src/redundans/bin/fasta2homozygous.py", line 207, in fasta2homozygous
    contig2skip = fasta2skip(out, fasta, faidx, threads, identity, overlap, minLength, verbose)
  File "/home/mathog/src/redundans/bin/fasta2homozygous.py", line 136, in fasta2skip
    plot_histograms(out.name, contig2skip, identities, sizes)
  File "/home/mathog/src/redundans/bin/fasta2homozygous.py", line 160, in plot_histograms
    for i, isize in zip(np.digitize(best, bins, right=1), bestalgsizes):
ValueError: Both x and bins must have non-zero length
rm -rf test/run1
scl enable devtoolset-4 './redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1'
#fails exactly the same way

Suggestions?

Thanks.

bioconda package for redundans

Hi,
Were you able to create a bioconda package for redundans?

Thank you in advance.

Best wishes,

Michal

Help info

README
- I/O paragraph
- program parameters
- performance info
test/README
- simulations
- de novo genome assembly
- redundans pipeline
  - run statistics
  - method description
- accuracy estimation

invalid literal for int() with base 10: 'option'

Hi,

When running redundans, I got this error:

invalid literal for int() with base 10: 'option'

Any solutions ?

Please make the installation relocatable

While I appreciate the automation provided with INSTALL.sh, it assumes several things:

You're installing for your own user only.
You're installing in your home directory.
All required libraries are present.
None of the dependencies are installed already.

In a typical HPC environment, multiple users want to use a single installation of the same tool on multiple servers, typically on a shared volume over NFS or similar. Most of them also implement dynamic management of environment variables through Environment Modules, rather than relying on ~/.bashrc or the like.

Why not provide a standard ./configure and Makefile to allow system administrators to choose where to install (via --prefix)?
Alternatively, you could let end users install the dependencies themselves but still check for their presence in the environment PATH (the way you do now after compiling them).

Thanks

How is multiple redundancy handled?

How does the reduction step handle redundancy that involves more than 2 sequences? When mapped to itself, my assembly had some contigs that aligned to multiple other shorter contigs at high confidence (>99% identity, longer than 250nt, as shown in the attached image, which is actually after running fasta2homozygous.py). I want to remove all but the longest contig. When I ran redundans (only fasta2homozygous.py), I saw messages like "matched already removed contig". Does this mean that redundans removes only the first matched sequence?

Thanks,

Takeo
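
To make the question concrete, here is a minimal sketch of one possible greedy reduction in which the longest contig of a group is kept and hits against an already removed contig are skipped. This is an assumption about how such a step could behave, not the actual fasta2homozygous.py code; greedy_reduce, its arguments and its default thresholds are hypothetical.

```python
def greedy_reduce(lengths, hits, min_identity=0.99, min_overlap=0.66):
    """Return the set of contigs to remove.

    lengths: dict mapping contig name -> length
    hits: iterable of (contig_a, contig_b, identity, overlap), where overlap
          is the aligned fraction of the shorter contig
    """
    removed = set()
    # Visit hits involving the longest contigs first, so they act as representatives.
    ordered = sorted(hits, key=lambda h: max(lengths[h[0]], lengths[h[1]]), reverse=True)
    for a, b, identity, overlap in ordered:
        if identity < min_identity or overlap < min_overlap:
            continue
        longer, shorter = (a, b) if lengths[a] >= lengths[b] else (b, a)
        if longer in removed:
            # The would-be representative is already gone: skip this hit
            # (comparable to a "matched already removed contig" message).
            continue
        removed.add(shorter)
    return removed
```

In this sketch, a long contig matching several shorter ones causes all of the shorter ones to be removed, and a later hit between two of those shorter contigs is simply skipped.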

Paired end and mate pair for reduction

Hi,

If I want to perform only reduction (using --noscaffolding --nogapclosing), should I include mate pairs in the -i option, or only PE?

Redundans error- OSError: [Errno 2] No such file or directory

Please check the above screen shot and help me in resolving the error.

How is scaffolds.reduced.fa created?

I ran redundans.py using the options --noscaffolding and --nogapclosing.
Below is the full run command I used.

python redundans.py --noscaffolding --nogapclosing -t 64 -f draft.contig.fa --identity 0.90 --overlap 0.95 --log draft.contig.log -o draft.contig

I used --noscaffolding, yet scaffolds.reduced.fa was written as output in addition to contigs.reduced.fa.
I did not use paired-end reads as input. What are the criteria for writing scaffolds.reduced.fa?
What is the difference between contigs.reduced.fa and scaffolds.reduced.fa?

Not multiprocessing?

Hi,
This doesn't seem to have been covered before. I'm running redundans with '-t 32', yet when I look at processor usage in 'htop', I see 32 instances of lastal all sharing a single processor.

I did some speed tests with -t 2 and -t 32 and got no improvement in run time. I've tried two different systems with different Linux distributions. Same problem.

When I run the lastal command at the commandline, I see the expected behaviour. My python skills are not particularly strong. I spoke with a colleague who said that getting python to run multiprocessor jobs can be a pain. Is there a problem?

Thanks

James
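
One thing that may be worth checking in a situation like this is the CPU affinity of the Python process, since child processes inherit it. Whether affinity is the cause here is purely an assumption; the sketch below is only a Linux-specific diagnostic, and allowed_cpus is a hypothetical helper.

```python
import os


def allowed_cpus():
    """Return the set of CPUs this process may run on, or None if unsupported."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return os.sched_getaffinity(0)
    return None


if __name__ == "__main__":
    cpus = allowed_cpus()
    print("CPUs visible to the OS:", os.cpu_count())
    if cpus is None:
        print("CPU affinity cannot be queried on this platform.")
    else:
        print("CPUs this process may use:", sorted(cpus))
```

If the second number is 1 while the first is 32, every subprocess started from that interpreter will be confined to the same core regardless of -t.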

Run reduction only - no scaffolding libraries

If I understand correctly, Illumina libraries are not required for the reduction step. I am running redundans with --noscaffolding --nogapclosing because I want to scaffold the reduced assembly with PacBio data. However, redundans always seems to require -i/--fastq and then checks insert sizes for the libraries. Is there a way around that?

Redundans dockerfile

Hello,
Is it possible to get the redundans Dockerfile somewhere? I'm not looking only for the image, as it is available on Docker Hub, but for the Dockerfile itself.
Our cluster architecture requires that we make some modifications to the Dockerfile in order to deploy the tools.
Best Regards

blurfl/quux for blurfl/quux: No such file or directory at ~/redundans/SSPACE/SSPACE_Standard_v3.0.pl

Hi,
I received the below error when executing /home/lorencm/apps/redundans/redundans/redundans.py -i *.fq -f StriDe-contigs.fa -o redundans --sspacebin /home/lorencm/apps/redundans/SSPACE/SSPACE_Standard_v3.0.pl -t 7

redundans/contigs.fa    372614  301     75352   20.22   87      28.90   93.679  0       297773  79.91   214     71.10
   59555 pairs. 35742 passed filtering [60.02%]. 12505 in different contigs [21.00%].
blurfl/quux for blurfl/quux: No such file or directory at /home/lorencm/apps/redundans/SSPACE/SSPACE_Standard_v3.0.pl line 761.
blurfl/quux for blurfl/quux: No such file or directory at /home/lorencm/apps/redundans/SSPACE/SSPACE_Standard_v3.0.pl line 761.
blurfl/quux for blurfl/quux: No such file or directory at /home/lorencm/apps/redundans/SSPACE/SSPACE_Standard_v3.0.pl line 761.
blurfl/quux for blurfl/quux: No such file or directory at /home/lorencm/apps/redundans/SSPACE/SSPACE_Standard_v3.0.pl line 385.
   59555 pairs. 36473 passed filtering [61.24%]. 10051 in different contigs [16.88%].
blurfl/quux for blurfl/quux: No such file or directory at /home/lorencm/apps/redundans/SSPACE/SSPACE_Standard_v3.0.pl line 761.
blurfl/quux for blurfl/quux: No such file or directory at /home/lorencm/apps/redundans/SSPACE/SSPACE_Standard_v3.0.pl line 761.
blurfl/quux for blurfl/quux: No such file or directory at /home/lorencm/apps/redundans/SSPACE/SSPACE_Standard_v3.0.pl line 761.
blurfl/quux for blurfl/quux: No such file or directory at /home/lorencm/apps/redundans/SSPACE/SSPACE_Standard_v3.0.pl line 385.
#fname  contigs bases   GC [%]  contigs >1kb    bases in contigs >1kb   N50     N90     Ns      longest
redundans/contigs.fa    301     372614  34.815  228     320138  1203    978     0       6074
redundans/contigs.reduced.fa    214     297773  33.639  197     281296  1289    1018    0       6074
redundans/_sspace.1.1.fa        188     299268  33.639  175     286585  1416    1027    1495    8690
redundans/_sspace.1.2.fa        188     299268  33.639  175     286585  1416    1027    1495    8690
redundans/scaffolds.fa  188     299268  33.639  175     286585  1416    1027    1495    8690
redundans/_gapcloser.1.1.fa     188     299625  33.656  175     286942  1416    1031    195     8955
redundans/_gapcloser.1.2.fa     188     299644  33.655  175     286961  1416    1031    195     8956
redundans/scaffolds.filled.fa   188     299644  33.655  175     286961  1416    1031    195     8956
#Time elapsed: 0:00:39.005123
redundans/contigs.fa    372614  301     75352   20.22   87      28.90   93.679  0       297773  79.91   214

What did I do wrong?

Thank you in advance
mic

sspace

Is there any way around using SSPACE?

Docker - OSError: [Errno 2] No such file or directory

Hi,
I used your latest docker container but for some reasons I got OSError: [Errno 2] No such file or directory

$ docker run -v `pwd`:/test:rw -it lpryszcz/redundans

root@7bf273fe0c52:/# head /test/BAFB.contigs.fasta
>tig00000003 len=68902 reads=220 covStat=35.22 gappedBases=no class=contig suggestRepeat=no suggestCircular=no
AAAAGTTTTCAGAGAAGTAAGCTTCTGTGGTTTTATCATGGATAGTGCAGTTATGTCCAGCGACAATAGATCTCTATCTTATTGAAGGAATTTTGTGTTA
TTCACACTCTTGTTTTTGGCCTTCAGGAAGACTGATCGAGTCATCTTCAGCTAGAGTGGTAATTCTTCTGTCTGTGGCATCTAGCCATTCTTCCAACCTG
TTTTCTTTTCTTCTTCTTTTTCTTGTTACATTACACTGTAGTTGATGCGTCGATAAGAAGCTAAGATTTTATTTGCTGTCATTGCTTAATGTCTTTGTCA
TGTTCAAACTGACTGTAAGAAAATACTTTGTTGGAAATTCATGCTTCTATGCTTCAAGTCACACCTTATATAGATTCTGATCAAAAGCTTTAAAGGTAAA
AGATGATATAAGGCTTATAGAACGACTCTGATAGTTTCTGGGAGCAATACTGTACCAAGACATCTTATTACTTGCCAACAGTTTAATCCATTCATATATT
CAGCTAGTGCATTGCTATGGTCTGGTAAATTTTCTTTTTCTTTTGAGTAAACAGTTTTGTCTAAAGGCTCTTCTGTTTTTTATTGTAATGTCAGATGCTA
AGTTGAGAAATAACACTAAATTTATGCTGCATCAATTCTAACTCATTCTAGCTTGCTGCCTGTTATCTTATATCAAAAGGCTGATATGGAGGAGACGGAG
AGAGACTTGAAAAAGAACTACGAGAGTATGGGTATTTCATAGACATGCATATTTCAGTGTGATATAGTCAGGTAAATGCATATTGCAAGTGGACTTGAAC
GACAATTTTAAGAATATTAAATTATGTCTTGCAAATGAACATATTCTAAAGTATCAAAAATCAAAGTTGTCAAGCTAAAGAGAAAGAAACTAAAGGTTCT

root@7bf273fe0c52:/# /root/src/redundans/bin/fasta2homozygous.py -i /test/BAFB.contigs.fasta -t 25
Homozygous assembly/ies will be written with input name + '.homozygous.fa.gz'
#file name      genome size     contigs heterozygous size       [%]     heterozygous contigs    [%]     identity [%]    possible joins  homozygous size [%]     homozygous contigs      [%]
Traceback (most recent call last):
  File "/root/src/redundans/bin/fasta2homozygous.py", line 238, in <module>
    main()
  File "/root/src/redundans/bin/fasta2homozygous.py", line 232, in main
    o.threads, o.verbose)
  File "/root/src/redundans/bin/fasta2homozygous.py", line 150, in fasta2homozygous
    contig2skip = fasta2skip(out, fasta, faidx, threads, identity, overlap, verbose)
  File "/root/src/redundans/bin/fasta2homozygous.py", line 63, in fasta2skip
    for i, (score, t, q, algLen, identity, overlap) in enumerate(hits, 1):
  File "/root/src/redundans/bin/fasta2homozygous.py", line 36, in fasta2hits
    last = run_last(fasta.name, identityTh, threads, verbose)
  File "/root/src/redundans/bin/fasta2homozygous.py", line 30, in run_last
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=sys.stderr)        
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1343, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

What did I miss?

Thank you in advance.

Michal
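
The traceback ends inside subprocess.Popen, where an [Errno 2] error typically means that the program to be launched could not be found; the report does not show which executable was missing. A minimal sketch of a wrapper that names the missing program is given below (Python 3 syntax; start_process is a hypothetical helper, not a redundans function).

```python
import subprocess


def start_process(args):
    """Start a subprocess, naming the executable if it cannot be found."""
    try:
        return subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except FileNotFoundError as err:  # OSError with errno 2 (ENOENT)
        raise RuntimeError(
            "Executable not found: %s. Is it installed and on PATH?" % args[0]
        ) from err
```

Reporting args[0] makes it immediately clear which dependency is absent from the container or from PATH.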

Some questions

Hi,

I'm not sure I understand how redundans works.
a) For 2 contigs of the same length: it makes a consensus: ok
b) For contigs of different lengths: does it make a consensus using the longest as a seed?
Can redundans be applied to scaffolds instead of contigs?

Issue in Scaffolding

Hi,

When running redundans on the test dataset it passes reduction and library parameters fine, but then returns an error during scaffold construction:

Traceback (most recent call last):
  File "./redundans.py", line 512, in <module>
    main()
  File "./redundans.py", line 507, in main
    o.norearrangements, o.verbose, o.log)
  File "./redundans.py", line 314, in redundans
    identity, overlap, minLength, resume)
  File "./redundans.py", line 121, in run_scaffolding
    sspacebin, verbose=0, log=log)
  File "/home/sb36g09/software/redundans/bin/fastq2sspace.py", line 191, in fastq2sspace
    tabFnames = get_tab_files(out, fasta, libnames, libFs, libRs, libIS, libISStDev, libreadlen, cores, mapq, upto, verbose, log)
  File "/home/sb36g09/software/redundans/bin/fastq2sspace.py", line 144, in get_tab_files
    proc = _get_aligner_proc(f1.name, f2.name, ref, cores, verbose, bwalog)
  File "/home/sb36g09/software/redundans/bin/fastq2sspace.py", line 117, in _get_snap_proc
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=log)
  File "/local/software/python/2.7.5/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/local/software/python/2.7.5/lib/python2.7/subprocess.py", line 1308, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

Command line was just the one for running the test data:

./redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1

Cheers,
Steve

IndexError: list index out of range

Hi,
Thanks for the useful software. I am having an issue where redundans returns an index error.
Python 2.7.11; I also installed all programs manually from versions/links listed in INSTALL.sh. The test case ran without error.

redundans.py -v -i ../../Anf.1.fastq.gz ../../Anf.2.fastq.gz -f redundans.in.fasta -t 5 -o run1 --sspacebin ~/SSPACE-STANDARD-3.0_linux-x86_64/SSPACE_Standard_v3.0.pl --noscaffolding --nogapclosing

Options: Namespace(fasta='redundans.in.fasta', fastq=['../../Anf.1.fastq.gz', '../../Anf.2.fastq.gz'], identity=0.51, iters=2, joins=5, limit=0.2, linkratio=0.7, log=<open file '', mode 'w' at 0x7f47e31fe1e0>, mapq=10, minLength=200, nocleaning=True, nogapclosing=False, noreduction=True, noscaffolding=False, outdir='run1', overlap=0.66, sspacebin='~/SSPACE-STANDARD-3.0_linux-x86_64/SSPACE_Standard_v3.0.pl', threads=5, verbose=True)
Aligning 69476742 mates per library...
Insert size statistics Mates orientation stats
FastQ files median mean stdev FF FR RF RR
../../Anf.1.fastq.gz ../../Anf.2.fastq.gz 403 399.70 135.09 4 9974 22 0

[Wed Jul 6 15:45:19 2016] Reduction...

file name genome size contigs heterozygous size [%] heterozygous contigs [%] identity [%] possible joins homozygous size [%] homozygous contigs [%]

run1/contigs.fa 347383710 121440 88802556 25.56 49234 40.54 79.471 0 260148921 74.89 72206 59.46
Traceback (most recent call last):
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 403, in <module>
    main()
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 398, in main
    o.verbose, o.log)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 263, in redundans
    limit = get_read_limit(reducedFname, readLimit, verbose, log)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 99, in get_read_limit
    stats = fasta_stats(open(fasta))
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/fasta_stats.py", line 18, in fasta_stats
    faidx = FastaIndex(handle)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 37, in __init__
    self._generate_index()
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 70, in _generate_index
    stats = self.get_stats(header, seq, offset)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 186, in get_stats
    linebases, linebytes = len(seq[0].strip()), len(seq[0])
IndexError: list index out of range
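
The failing expression is seq[0], which raises IndexError when seq is an empty list. One situation that would produce this, stated here as an assumption rather than a diagnosis, is a FASTA header that is not followed by any sequence line. A small sketch for spotting such records (empty_records is a hypothetical helper):

```python
def empty_records(lines):
    """Return the names of FASTA records that have a header but no sequence."""
    empty = []
    name, has_seq = None, False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if name is not None and not has_seq:
                empty.append(name)
            fields = line[1:].split()
            name, has_seq = (fields[0] if fields else ""), False
        elif line:
            has_seq = True
    # Do not forget the last record in the file.
    if name is not None and not has_seq:
        empty.append(name)
    return empty
```

It can be run on the input with something like `with open("redundans.in.fasta") as handle: print(empty_records(handle))`.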

Identity may be wrongly estimated by fasta2homozygous.py

Unix install failure

I ran the suggested unix install command (on Ubuntu 16.04) and got install error from Lastal:

'
Redundans and its dependencies will be installed in: /media/data/software/redundans

Installation will take 5-10 minutes.
To track the installation status execute in the new terminal:
tail -f /media/data/software/redundans/install.log

Do you want to proceed with installation (y/n)? y

Fri Feb 23 08:41:57 CST 2018 Checking dependencies...
Everything looks good :) Let's proceed...
Fri Feb 23 08:41:57 CST 2018 Downloading Redundans...
Fri Feb 23 08:41:58 CST 2018 Updating submodules...
Fri Feb 23 08:41:59 CST 2018 Compiling dependencies...
=== You can find log in: /media/data/software/redundans/install.log ===
I'll use 63 thread(s) for compiling
Fri Feb 23 08:41:59 CST 2018 BWA
Fri Feb 23 08:42:04 CST 2018 LASTal
[2] ERROR!
^
lastal.cc:761:33: error: no matching function for call to ‘cbrc::Alignment::makeXdrop(cbrc::Centroid&, cbrc::GreedyXdropAligner&, bool&, const uchar*&, const uchar*&, int&, const int (&)[64], int&, cbrc::GeneralizedAffineGapCosts&, int&, int&, size_t&, const int (&)[64], const cbrc::TwoQualityScoreMatrix&, const uchar*&, const uchar*&, cbrc::Alphabet&, cbrc::AlignmentExtras&, double&, int&)’
args.gamma, args.outputType );
^
lastal.cc:761:33: note: candidate is:
In file included from AlignmentPot.hh:8:0,
from lastal.cc:16:
Alignment.hh:80:8: note: void cbrc::Alignment::makeXdrop(cbrc::GappedXdropAligner&, cbrc::Centroid&, const uchar*, const uchar*, int, const int ()[64], int, const cbrc::GeneralizedAffineGapCosts&, int, int, size_t, const int ()[64], const cbrc::TwoQualityScoreMatrix&, const uchar*, const uchar*, const cbrc::Alphabet&, cbrc::AlignmentExtras&, double, int)
void makeXdrop( GappedXdropAligner& aligner, Centroid& centroid,
^
Alignment.hh:80:8: note: candidate expects 19 arguments, 20 provided
lastal.cc: In function ‘void eraseWeakAlignments(LastAligner&, cbrc::AlignmentPot&, size_t, char, const uchar*)’:
lastal.cc:775:12: error: ‘struct cbrc::Alignment’ has no member named ‘hasGoodSegment’
if (!a.hasGoodSegment(dis.a, dis.b, args.minScoreGapped, dis.m, gapCosts,
^
makefile:100: recipe for target 'lastal.o8' failed
make[1]: *** [lastal.o8] Error 1
make[1]: Leaving directory '/media/data/software/redundans/bin/last/src'
makefile:3: recipe for target 'all' failed
make: *** [all] Error 2
'

How to continue the redundans command?

Hi,
My redundans.py run was in the "closing gaps ..." step, but for some reason the process was killed after running for nearly a month. How can I continue this run? I really don't want to rerun it from scratch.

Thanks.

Maximum sequences limited to 327068

Is there an upper limit on the number of sequences redundans can store (first step)? I have a few assemblies with more than 327,068 contigs, but by default redundans only reads that many sequences for processing.

Thanks for any input!

FastaIndex.py and fasta_stats.py missing?

I just installed redundans both through the git repository as well as with the UNIX installer, but both versions give me the following error when I try to run the test:

~/redundans$ ./redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1
Traceback (most recent call last):
  File "./redundans.py", line 34, in <module>
    from fasta2homozygous import fasta2homozygous
  File "/home/pepijn/redundans/bin/fasta2homozygous.py", line 17, in <module>
    from FastaIndex import FastaIndex

When I look at the bin folder it seems that the symlinks for both python scripts are present, but they seem empty. Any idea what the problem might be?

Remove hardcoded path for SSPACE

In https://github.com/lpryszcz/redundans/blob/master/redundans.py#L370 the assumption is that SSPACE will be installed under ~/src/SSPACE. This is the case when using the automated installer, but not when the user installs the dependencies manually.

While --sspacebin can be used at invocation time, it would be better to treat SSPACE the same way other dependencies are treated, i.e. if it's not in the PATH, warn the user and exit.
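
A rough sketch of the behaviour requested here: prefer a path given by the user, otherwise look the script up on PATH, and report failure instead of assuming a fixed location. resolve_executable is a hypothetical helper, not existing redundans code.

```python
import os
import shutil


def resolve_executable(name, user_path=None):
    """Return an absolute path to the executable, or None if it cannot be found."""
    if user_path:
        # Expand a literal "~" in case the shell did not do so.
        candidate = os.path.abspath(os.path.expanduser(user_path))
        return candidate if os.path.isfile(candidate) else None
    found = shutil.which(name)
    return os.path.abspath(found) if found else None
```

A caller would then print a warning and exit when the function returns None, just as for any other missing dependency.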

code/syntax improvement

threading
- fasta2homozygous.py
code polishing
- fasta2homozygous.py
  - removed unnecessary sort
  - memory-optimised (generator instead of list)
- fasta_stats.py
  - headers in .stats
- fastq2sspace.py
- sspace intermediate files/folders in cur dir instead of in outdir
- make sure subprocesses are closed if no longer needed i.e. bwa
deprecate
- fasta2diverged.py
- Biopython & SQLite: causing problems
- scipy, numpy

Bug: automake-1.15 needed

Regarding your installation instructions:

(cd bin/parallel && make clean && ./configure && make)

I believe make clean should come after the configure command.

Additionally, for some reason your configure script inside bin/parallel requires the executable automake-1.15 instead of automake. Is it important that we use specifically version 1.15 of automake? If not, perhaps you can fix this? My current workaround is to create a symbolic link: ln -s /bin/automake automake-1.15.

lastdb invalid argument

In Ubuntu 14.04 Trusty the correct command is:
lastdb -v

While the script uses --:

aln@notik:~/Science/schools/ngschool2016/ngschool2016-materials/src/redundans$ python redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1 --sspacebin $SSPACEBIN
Options: Namespace(fasta='test/contigs.fa', fastq=['test/5000_1.fq.gz', 'test/5000_2.fq.gz', 'test/600_1.fq.gz', 'test/600_2.fq.gz'], identity=0.51, iters=2, joins=5, limit=0.2, linkratio=0.7, log=<open file '<stderr>', mode 'w' at 0x7f72fd1cc1e0>, mapq=10, minLength=200, nocleaning=True, nogapclosing=True, noreduction=True, noscaffolding=True, outdir='test/run1', overlap=0.66, resume=False, sspacebin='/home/aln/Science/schools/ngschool2016/ngschool2016-materials/src/SSPACE/SSPACE_Standard_v3.0.pl', threads=4, verbose=True)
[WARNING] Problem checking lastdb version: lastdb: invalid option -- '-'
lastdb: bad option
Traceback (most recent call last):
  File "redundans.py", line 450, in <module>
    main()
  File "redundans.py", line 438, in main
    _check_dependencies(dependencies)
  File "redundans.py", line 370, in _check_dependencies
    if int(curver)<version:
ValueError: invalid literal for int() with base 10: 'option'
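
The ValueError arises because the last word of the error message ('option') is passed to int(). A defensive version parser could look like the sketch below, which assumes (as the error suggests) that the version is expected as the last whitespace-separated token of the tool's output. parse_version is a hypothetical name, and the exact output format of lastdb is not confirmed here.

```python
def parse_version(output):
    """Return the trailing integer of a tool's version output, or None if absent."""
    tokens = output.split()
    if tokens and tokens[-1].isdigit():
        return int(tokens[-1])
    return None
```

Returning None lets the caller print a warning about an undetermined version instead of crashing with a traceback.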

Use for iontorrent data

I want to modify this for use with Ion Torrent data. Should I do that, or is it a no-go because some dependencies work only with paired-end data?

redundans failed

Dear all,

I launched the following script to assist my assembly with redundans :

#!/bin/bash
redundans.py -i /home/raw/S04[56]_[12].trimmed.fastq -f /home/contig_k116.fasta -o /data_sra_home/scaffolding_redundans/run1/ -t 40 --resume -m 200 -l /data_sra_home/scaffolding_redundans/SR_subset_1K.fa

Below is the terminal output after the failed attempt:
##################################################
[Sun Mar 18 10:18:32 2018] Resuming previous run from /data_sra_home/scaffolding_redundans/run1/...
[WARNING] numpy or matplotlib missing! Cannot plot histogram
/data_sra_home/scaffolding_redundans/run1/contigs.fa 665152391 83469 141609580 21.29 42509 50.93 88.051 0 523542811 78.71 40960 49.07
Insert size statistics Mates orientation stats
FastQ files read length median mean stdev FF FR RF RR
Traceback (most recent call last):
  File "/home1/software/redundans/redundans.py", line 539, in <module>
    main()
  File "/home1/software/redundans/redundans.py", line 534, in main
    o.norearrangements, o.verbose, o.usebwa, o.log, o.tmp)
  File "/home1/software/redundans/redundans.py", line 321, in redundans
    libraries = get_libraries(fastq, lastOutFn, mapq, threads, verbose, log, usebwa=usebwa)
  File "/home1/software/redundans/redundans.py", line 61, in get_libraries
    libdata = fastq2insert_size(log, fastq, fasta, mapq, threads, limit/100, genomeFrac, stdfracTh, maxcfracTh, usebwa=usebwa)
  File "/home1/software/redundans/bin/fastq2insert_size.py", line 191, in fastq2insert_size
    isstats = get_isize_stats(fq1, fq2, fasta, mapq, threads, limit, verbose, stdfracTh, maxcfracTh)
  File "/home1/software/redundans/bin/fastq2insert_size.py", line 124, in get_isize_stats
    rname, flag, chrom, pos, mapq, cigar, mchrom, mpos, isize, seq = sam.split('\t')[:10]
ValueError: need more than 1 value to unpack

Thank you for your help with this issue!
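
The unpacking fails because the line read from the aligner contained no tab at all, so it was not a regular SAM record (it may have been a message or an empty line; the report does not show it). A sketch of a more tolerant parser, with parse_sam_line as a hypothetical helper:

```python
def parse_sam_line(line):
    """Return the first ten SAM fields, or None for headers and malformed lines."""
    if not line.strip() or line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return None
    return fields[:10]
```

Logging the rejected lines instead of silently dropping them would also reveal what the aligner actually printed.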

fasta2homozygous fails for --identity 0.33 --overlap 0.51

~/src/redundans/fasta2homozygous.py -i redundans/MAGSPI*/contigs.fa --identity 0.33 --overlap 0.33

Homozygous assembly/ies will be written with input name + '.homozygous.fa.gz'

file name genome size contigs heterozygous size [%] heterozygous contigs [%] identity [%] possible joins homozygous size [%] homozygous contigs [%]

[ERROR] seq21018_len7789_cov189 (('seq1_len104_cov121', 104, 0, 70, 'seq9735_len101_cov148', 101, 31, 101, '+', 0.8285714285714286, 0.693069306930693, 46)) not in contigs!
[ERROR] seq26477_len7324_cov148 (('seq1_len104_cov121', 104, 0, 81, 'seq9735_len101_cov148', 101, 20, 101, '+', 0.7777777777777778, 0.801980198019802, 45)) not in contigs!
[ERROR] seq20577_len9015_cov141 (('seq3_len238_cov168', 238, 0, 238, 'seq10_len238_cov189', 238, 0, 238, '+', 0.9831932773109243, 1.0, 230)) not in contigs!
[ERROR] seq19649_len8979_cov175 (('seq3_len238_cov168', 238, 0, 84, 'seq16825_len172_cov141', 172, 88, 172, '+', 1.0, 0.4883720930232558, 84)) not in contigs!
[ERROR] seq20482_len6873_cov155 (('seq8_len6349_cov128', 6349, 588, 743, 'seq27924_len149_cov114', 149, 0, 149, '+', 0.8456375838926175, 1.0, 103)) not in contigs!
[ERROR] seq20999_len4731_cov189 (('seq8_len6349_cov128', 6349, 5653, 5935, 'seq5386_len269_cov141', 269, 0, 269, '+', 0.6505576208178439, 1.0, 81)) not in contigs!
[ERROR] seq24053_len5120_cov168 (('seq8_len6349_cov128', 6349, 4540, 4646, 'seq27404_len104_cov189', 104, 0, 104, '+', 0.8653846153846154, 1.0, 77)) not in contigs!
[ERROR] seq24131_len10134_cov155 (('seq8_len6349_cov128', 6349, 6249, 6349, 'seq7853_len154_cov296', 154, 0, 100, '+', 0.84, 0.6493506493506493, 68)) not in contigs!
[ERROR] seq25312_len13069_cov148 (('seq8_len6349_cov128', 6349, 419, 679, 'seq5386_len269_cov141', 269, 0, 269, '+', 0.620817843866171, 1.0, 65)) not in contigs!
[ERROR] seq18534_len7476_cov148 (('seq8_len6349_cov128', 6349, 5849, 5999, 'seq27924_len149_cov114', 149, 0, 149, '+', 0.6912751677852349, 1.0, 57)) not in contigs!
[ERROR] seq24935_len5793_cov162 (('seq8_len6349_cov128', 6349, 493, 613, 'seq14809_len119_cov94', 119, 0, 119, '+', 0.6890756302521008, 1.0, 45)) not in contigs!
Traceback (most recent call last):
  File "/home/lpryszcz/src/redundans/fasta2homozygous.py", line 234, in <module>
    main()
  File "/home/lpryszcz/src/redundans/fasta2homozygous.py", line 228, in main
    libraries, limit, o.threads, o.joinOverlap, o.endTrimming, o.verbose)
  File "/home/lpryszcz/src/redundans/fasta2homozygous.py", line 122, in fasta2homozygous
    contig2skip, identity = hits2skip(hits, faidx, verbose)
  File "/home/lpryszcz/src/redundans/fasta2homozygous.py", line 74, in hits2skip
    if contig2skip[q]:
KeyError: 'seq18807_len4442_cov148'

error: unrecognized arguments: /mnt/md1/REFERENCE.fa

I would like to try a reference-guided assembly to compare with my de novo assembly.

I am doing something wrong and would like any guidance that you can offer.

My command is:
/home/rob/src/redundans/redundans.py -v -i Pcri_800bp_12repaired_R1.fastq Pcri_800bp_12repaired_R2.fastq Pcri_3kb_12repaired_R1.fastq Pcri_3kb_12repaired_R2.fastq Pcri_5kb_12repaired_R1.fastq Pcri_5kb_12repaired_R2.fastq Pcri_10kb_12repaired_R1.fastq Pcri_10kb_12repaired_R2.fastq Pcri_470bp_40pct_R1.fastq Pcri_470bp_40pct_R2.fastq -r /mnt/md1/Mguttatus.fa -f /mnt/md1/redundans_1Kcut_u.5_iden0.91/scaffolds.filled.fa -o /mnt/md1/redundans_1Kcut_u.5_iden0.90_overlap0.66_REF --noscaffolding --threads 80 --identity 0.90 --overlap 0.66 --minLength 1000 --mapq 15 --iters 4 --sspacebin /home/rob/src/SSPACE/SSPACE_Standard_v3.0.pl

Which throws the following error:
redundans.py: error: unrecognized arguments: /mnt/md1/Mguttatus.fa

Mguttatus.fa is the fasta file for the reference assembly.

Why does it not recognize my Mguttatus.fa file?

Rob
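
One way argparse produces exactly this message is when an option is defined as a boolean switch, so the value following it is left over as an unexpected positional argument. Whether -r was defined that way in the version used here is an assumption (other runs on this page show a resume option); `redundans.py -h` will show the actual definition. The parser below is a hypothetical stand-in, not the redundans parser.

```python
import argparse

parser = argparse.ArgumentParser(prog="demo.py")
# Hypothetical definition: -r is a boolean switch that takes no value.
parser.add_argument("-r", "--resume", action="store_true")
parser.add_argument("-f", "--fasta")

# parse_known_args returns the leftovers instead of exiting with
# "error: unrecognized arguments: ..." as parse_args would.
options, leftover = parser.parse_known_args(
    ["-r", "/mnt/md1/Mguttatus.fa", "-f", "contigs.fa"]
)
print(options.resume, leftover)
```

Here the switch is consumed on its own and the FASTA path ends up in the leftover list, which parse_args would report as unrecognized.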

optimize `sort` for large data sets

Hi Leszek,

I am currently trying to use redundans with a 3 Gbp diploid fragmented plant genome. I have 39 .psl files totalling 138 GB, and I am now stuck sorting them. Do you have an idea how to optimize the sorting? Currently it uses only 5 MB tmp files and does not run in parallel.

Cheers
Thomas
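
GNU sort has options that address both points: --parallel for threads, -S for the in-memory buffer and -T for the temporary directory. The sketch below only builds such a command line; build_sort_command is a hypothetical helper, and the sort key used by redundans is not known here, so key is a placeholder.

```python
def build_sort_command(inputs, output, tmpdir, threads=4, buffer_size="8G", key="1,1"):
    """Build a GNU sort command line (as a list) for large tab-separated files."""
    return (
        ["sort", "-t", "\t", "-k", key,
         "--parallel=%d" % threads,  # sort using several threads
         "-S", buffer_size,          # in-memory buffer before spilling to disk
         "-T", tmpdir,               # where temporary files are written
         "-o", output]
        + list(inputs)
    )
```

The resulting list can be passed to subprocess.check_call; setting LC_ALL=C in the environment usually speeds up comparisons as well.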

Feature: Log file

Hello,

Just thought I would suggest a main log file that starts with the command issued and all parameters used. The log file would then be followed by all standard output the program generates.

As a little extra, you could also add the basic stats (n50, # contigs, longest contig) for both the input contigs and the output generated.

Adrian
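
A minimal sketch of the log header suggested above, assuming the options come from argparse. write_log_header is a hypothetical helper, not an existing redundans function.

```python
import argparse
import sys
from datetime import datetime


def write_log_header(log, argv, options):
    """Write the command line and every parsed parameter to an open log handle."""
    log.write("# %s\n" % datetime.now().ctime())
    log.write("# Command: %s\n" % " ".join(argv))
    for name, value in sorted(vars(options).items()):
        log.write("# %s = %r\n" % (name, value))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-t", "--threads", type=int, default=4)
    options = parser.parse_args([])
    write_log_header(sys.stderr, sys.argv, options)
```

Because defaults are part of the parsed options, the header records parameters the user did not set explicitly as well.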

Scaffolding without Illumina reads

Hello,

First, I want to thank you for this remarkable tool.

I used Redundans with only PacBio reads since I do not have Illumina reads. Here is the command:

../redundans/redundans/redundans.py -v -t 6 --log LOG.redundans -l Reads_?.fastq -f contigs.fasta -o redundans_results

I was surprised to find that Redundans can generate scaffolds without the pairing information provided by Illumina reads. Could you explain how this works? How does it determine the number of "N"s?

Thank you

reduce Pacbio assembly ?

Hi,

I have a fasta file after a pacbio assembly. I would like to run redundans but only the reducing option (no scaffolding and no gapclosing) with that command:

redundans.py -i reads.fastq -f assembly.fasta -o output --identity 0.9 --overlap 0.9 --minLength 200 --noscaffolding --nogapclosing -t 10 2> redundans.log

Does it work with PacBio? Does redundans need the input files (-i) when only the reduction step is applied?

implement gap2seq instead of gapcloser

time Gap2Seq -scaffolds scaffolds.fa -filled scaffolds.gap2seq.fa -reads ../600_1.fastq.gz,../600_2.fastq.gz -nb-cores 2
http://www.cs.helsinki.fi/u/lmsalmel/Gap2Seq/

I/O checks

PREREQUISITES
- dependencies installed
INPUT FILES
- catch non-existing files
  - catching [Errno 95] Operation not supported in samba
- check if supported format
LIBRARY QUALITY WARNINGS
- high fraction of paired-end in mate pairs, or more generally inconsistent read orientation
- insert size stdev is large
OUTPUT FILES
- check if all output files exist
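
The input-file checks in this list could be sketched as follows. The list of accepted extensions is an assumption made for illustration, and check_inputs is a hypothetical helper rather than existing redundans code.

```python
import os

# Assumed set of accepted extensions; a trailing .gz is stripped before checking.
SUPPORTED = (".fa", ".fasta", ".fq", ".fastq")


def check_inputs(paths):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    for path in paths:
        name = path[:-3] if path.endswith(".gz") else path
        if not os.path.isfile(path):
            problems.append("No such file: %s" % path)
        elif os.path.getsize(path) == 0:
            problems.append("Empty file: %s" % path)
        elif not name.endswith(SUPPORTED):
            problems.append("Unsupported extension: %s" % path)
    return problems
```

Running such a check before any aligner is started turns a late traceback into an early, readable message.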

Error during execution of redundans

Hi!

I have successfully used redundans on one of my genome assemblies, but doing it on another one I get this error. Any help with this would be highly appreciated.

Philipp


Traceback (most recent call last):
  File "/home/philipp/src/redundans/redundans.py", line 450, in <module>
    main()
  File "/home/philipp/src/redundans/redundans.py", line 445, in main
    o.verbose, o.log)
  File "/home/philipp/src/redundans/redundans.py", line 311, in redundans
    identity, overlap, minLength, resume)
  File "/home/philipp/src/redundans/redundans.py", line 150, in run_scaffolding
    threads, limit, iters=1, resume=resume, verbose=0, log=log, basename=basename)
  File "/home/philipp/src/redundans/redundans.py", line 242, in run_gapclosing
    fasta_stats(index)
  File "/home/philipp/src/redundans/fasta_stats.py", line 26, in fasta_stats
    A, C, G, T = map(sum, zip(*[stats[-4:] for stats in id2stats.itervalues()]))
ValueError: need more than 0 values to unpack

redundans.py check if input files exist and are correct

subprocess error

Hi all, I am working on a polymorphic genome, and tried to use this software. But after installing everything, I kept getting the following error when using the test data. I am not sure what is the reason. Can anyone help me to solve this problem? Thank you so much.

Traceback (most recent call last):
  File "./redundans.py", line 521, in <module>
    main()
  File "./redundans.py", line 516, in main
    o.norearrangements, o.verbose, o.log)
  File "./redundans.py", line 315, in redundans
    libraries = get_libraries(fastq, lastOutFn, mapq, threads, verbose, log)
  File "./redundans.py", line 62, in get_libraries
    genomeFrac, stdfracTh, maxcfracTh)
  File "/home/xinw/software/redundans/bin/fastq2insert_size.py", line 189, in fastq2insert_size
    isstats = get_isize_stats(fq1, fq2, fasta, mapq, threads, limit, verbose, stdfracTh, maxcfracTh)
  File "/home/xinw/software/redundans/bin/fastq2insert_size.py", line 111, in get_isize_stats
    aligner = _get_snap_proc(fq1, fq2, fasta, threads, verbose, alignerlog)
  File "/home/xinw/software/redundans/bin/fastq2sspace.py", line 133, in _get_snap_proc
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=log)
  File "/usr/local/lib/python2.7/subprocess.py", line 390, in __init__
    errread, errwrite)
  File "/usr/local/lib/python2.7/subprocess.py", line 1024, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
