
Long-reads Gap-free Chromosome-scale Assembler

License: MIT License

Python 85.39% Shell 14.61%

pacbio nanopore genome-assembly genome-analysis scaffolding long-reads gap-filling

gala's Introduction

Gap-Free Long-read Assembler (GALA)

GALA is a Gap-free Long-read Assembler. GALA builds a multi-layer graph from different preliminary assemblies, long reads, and potentially other sources of information, such as Hi-C assemblies. During this process, it identifies mis-assembled contigs and trims them. The corrected data are then partitioned into multiple scaffolding groups, each representing a single chromosome. Each scaffolding group is assembled independently with existing assembly tools, and a simplified version of an overlap-graph-based merging algorithm is used to merge multiple contigs if necessary.

GALA has three modules, each of which can be used separately.

GALA performance with a human genome

GALA assembled a human genome using HiFi reads. The Canu draft for CHM13 and the current human reference genome GRCh38.p13 were used as input. In this way, GALA essentially created a reference-guided de novo assembly. The GALA assembly comprised 37 continuous contigs, including 8 telomere-to-telomere gap-free pseudomolecular sequences, 4 near-complete chromosomes each with a small telomeric fragment unanchored, 3 with only gapped centromeric regions, and the long arms of the acrocentric chromosomes.

Dependency

Installation

GALA can be run directly from the gala folder

git clone https://github.com/ganlab/gala.git
cd gala

Alternatively, you can run install to add it to your PATH.

Usage

Assembling a genome with the GALA pipeline involves a preliminary step and three main steps.

Preliminary step and Inputs

Preliminary step

Construct preliminary assemblies from long reads with several different assemblers, e.g. Canu, Flye, MECAT, Miniasm, and Wtdbg2.

Inputs:

Raw reads and corrected reads if available.
The user needs to prepare draft_names_paths.txt for preliminary assemblies. Here is an example:

draft_01=path/to/draft_fasta_file
draft_02=path/to/draft_fasta_file
draft_03=path/to/draft_fasta_file
draft_n=path/to/draft_fasta_file
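
Several of the reported problems come from a draft list that cannot be read or that points to missing files. The following is a minimal Python 3 sketch, not part of GALA, that parses the name=path format shown above and reports paths that do not exist. The function names `read_draft_paths` and `missing_drafts` and the example paths are hypothetical.

```python
from pathlib import Path


def read_draft_paths(text):
    """Parse 'name=path' lines into a dict of draft name -> path (hypothetical helper)."""
    drafts = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, sep, path = line.partition("=")
        if not sep or not name.strip() or not path.strip():
            raise ValueError(f"line {line_number}: expected name=path, got {line!r}")
        drafts[name.strip()] = path.strip()
    return drafts


def missing_drafts(drafts):
    """Return the draft names whose FASTA path does not exist on disk."""
    return [name for name, path in drafts.items() if not Path(path).is_file()]


# Example paths are placeholders, not real files.
example = "draft_01=assemblies/canu.fasta\ndraft_02=assemblies/flye.fasta\n"
drafts = read_draft_paths(example)
print(drafts)
print("missing:", missing_drafts(drafts))
```

Running a check like this before starting the pipeline makes a wrong path visible early instead of surfacing later as a traceback.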

GALA Single Command Mode

To run GALA with a single command, use:

gala draft_names_paths.txt fa/fq reads_file platform

In single command mode, GALA uses Canu for Chromosome-by-Chromosome assembly by default.

To use another assembler or several assemblers, pass one or more of the three supported choices (Canu, Flye, and Miniasm) to the -a argument, separated by single spaces.

For sequencing_platform, pass one of the following values without a leading dash: pacbio-raw, pacbio-corrected, nanopore-raw, nanopore-corrected.

usage: gala -h  [options] <draft_names & paths> <fa/fq> <reads> <platform>

GALA Gap-free Long-read Assembler

positional arguments:
  draft_names           Draft names and paths [required]
  input_file            input type (fq/fa) [required]
  reads                 raw/corrected reads [required]
  sequencing_platform   -pacbio-raw -pacbio-corrected -nanopore-raw -nanopore-
                        corrected [required]

optional arguments:
  -h, --help            show this help message and exit
  -a [ASSEMBLER [ASSEMBLER ...]]
                        Chr-by_Chr assembler (canu flye miniasm) [default
                        canu]
  -b Alignment block length	 [default 5000]
  -p Alignment identity percentage	 [default 70%]
  -c Shortest contig length	 [default 5000]
  -q Mapping quality	 [default 20]
  -f Output files name	[default gathering]
  -o output files path	[default current directory]
  -v, --version         show program's version number and exit

GALA Step-by-Step Mode (recommended)

Mis Assembly Detector Module (MDM)

Use the comp module to generate a draft_comparison file

comp draft_names_paths.txt
Run draft_comparison file to produce drafts comparison paf files

sh draft_compare.sh
Use the mdm module to identify mis-assembled contigs.

mdm comparison_folder number of assembly drafts
Use the newgenome module to produce misassembly-free drafts.

newgenome draft_names_paths.txt cut_folder
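
The comparison script produced by comp aligns every draft against every other draft in both directions, so n drafts give n × (n − 1) minimap2 runs (6 for three drafts, 12 for four). The sketch below is not GALA's own code. It only illustrates that pairing, using the `minimap2 -x asm5` command and the `AvsB.paf` output naming that appear in the generated script quoted in the issues. The function name `comparison_commands` and the example paths are hypothetical, and the sketch writes the paths literally into each command.

```python
from itertools import permutations


def comparison_commands(drafts, preset="asm5"):
    """Build one minimap2 command per ordered (target, query) pair of drafts."""
    commands = []
    for target, query in permutations(drafts, 2):
        commands.append(
            f"minimap2 -x {preset} {drafts[target]} {drafts[query]} "
            f"> {target}vs{query}.paf"
        )
    return commands


# Placeholder paths for illustration only.
drafts = {
    "draft_01": "canu.fasta",
    "draft_02": "flye.fasta",
    "draft_03": "wtdbg2.fasta",
}
for command in comparison_commands(drafts):
    print(command)
```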

Contig Clustering Module (CCM)

Use the comp module to generate a draft_comparison file for misassembly-free drafts.

comp new_draft_names_paths.txt
Run draft_comparison file to produce new drafts comparison paf files.

sh draft_compare.sh
Run the ccm module to produce contigs scaffolding groups.

ccm comparison_folder number of assembly drafts
- Note: You can also use the reformat module to generate reformatted paf files and use them to confirm Scaffolding groups.

Scaffolding Group Assembly Module (SGAM)

Map all drafts against raw long reads and self-corrected reads if available.

bwa index misassembly-free draft
bwa mem -x pacbio/ont2d misassembly-free draft long-reads
Use the following commands to separate the read names mapped to each contig

samtools view -H bam_file |grep "SQ"|cut -f 2|cut -d : -f 2 > contig_names

seprator contig_names mapping.bam

sh bam_seprator.sh

for i in bams/*; do samtools view $i | cut -f 1 > $i.read_names;done;
Use the cat command to concatenate the read name files belonging to the same scaffolding group.
- For example:
  
  cat contig_1.bam.read_names contig_3.bam.read_names contig_7.bam.read_names > scaffold_1.read_names
Use the readsep module to extract the reads that belong to each scaffold.

for i in scaffold_*.read_names; do readsep raw/corrected-reads $i -f fa/fq; done
Apply the Chromosome-by-Chromosome assembly approach to retrieve the gap-free chromosome-scale assembly:
```
Assemble each read set from scaffold_*.read.fq with different
assembly software, e.g.(Canu, Flye, Mecat, Miniasm, and Wtdbg).
```
We recommend trying different assembly tools, especially Flye, MECAT/NECAT, and Miniasm.
Finally, map the SGAM outcomes against one of the preliminary draft assemblies to confirm that all the contigs in the scaffolding group are assembled to the right chromosome/Scaffold.
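
The read-name concatenation step above can also be scripted when there are many scaffolding groups. The following Python 3 sketch is an illustration only and is not part of GALA. It pools the read names of the contigs in each group and, unlike plain cat, drops duplicate names, on the assumption that a read mapped to two contigs of the same group only needs to be extracted once. The function name `merge_read_names` and the example data are hypothetical.

```python
def merge_read_names(groups, read_names_by_contig):
    """Pool read names per scaffolding group, keeping first-seen order without duplicates."""
    merged = {}
    for scaffold, contigs in groups.items():
        seen = {}  # dict keys preserve insertion order
        for contig in contigs:
            for name in read_names_by_contig.get(contig, []):
                seen.setdefault(name, None)
        merged[scaffold] = list(seen)
    return merged


# Hypothetical grouping: scaffold_1 is made of contig_1, contig_3 and contig_7.
groups = {"scaffold_1": ["contig_1", "contig_3", "contig_7"]}
read_names_by_contig = {
    "contig_1": ["read_a", "read_b"],
    "contig_3": ["read_b", "read_c"],
    "contig_7": ["read_d"],
}
print(merge_read_names(groups, read_names_by_contig))
```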

Description

comp:

The comp module is used to generate a genome comparison file when the user wants to compare multiple genomes against each other.

usage: comp -h  [options] <draft_names & paths> 

Generate genome comparison files, part of GALA Gap-free Long-read Assembler

positional arguments:
  drafts                Draft names and paths [required]

optional arguments:
  -h, --help            show this help message and exit
  -o output files path	[default current directory]
  -v, --version         show program's version number and exit

mdm:

The Mis-assembly Detector Module is used to detect misassembled contigs. The algorithm relies on contradictory information in the alignments.

The mis-assembly detection module should be applicable for error correction regardless of the specific assembly algorithm used, and it can differentiate between misassemblies and structural variation.

usage: mdm -h  [options] path/to/mapping_files number of drafts
MDM Mis-assembly Detector Module, part of GALA Gap-free Long-read Assembler

positional arguments:
  mapping_files         mapping paf file [required]
  drafts                Number of drafts [required]

optional arguments:
  -h, --help            show this help message and exit
  -b Alignment block length	 [default 5000]
  -p Alignment identity percentage	 [default 70%]
  -c Shortest contig length	 [default 5000]
  -q Mapping quality	 [default 20]
  -f Output files name	[default gathering]
  -o output files path	[default current directory]
  -v, --version         show program's version number and exit
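
The -b, -p, -c and -q options above act as thresholds on the PAF alignments. The Python 3 sketch below shows one plausible reading of them using the standard PAF columns (query length in column 2, matching bases in column 10, alignment block length in column 11, mapping quality in column 12). It is an assumption that GALA computes identity as matches divided by block length, applies the contig length to the query, and uses inclusive comparisons. The function names and the example records are hypothetical.

```python
def parse_paf_line(line):
    """Extract the PAF columns needed for filtering (standard 12-column layout)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "query": fields[0],
        "query_length": int(fields[1]),
        "target": fields[5],
        "matches": int(fields[9]),
        "block_length": int(fields[10]),
        "mapq": int(fields[11]),
    }


def passes_filters(record, block=5000, identity=70.0, contig=5000, quality=20):
    """Apply thresholds mirroring the -b, -p, -c and -q defaults (assumed semantics)."""
    percent_identity = 100.0 * record["matches"] / record["block_length"]
    return (
        record["block_length"] >= block
        and percent_identity >= identity
        and record["query_length"] >= contig
        and record["mapq"] >= quality
    )


# Two made-up alignment records; the second has a low mapping quality.
good = "\t".join(["ctg1", "20000", "100", "9100", "+", "ctg9", "50000", "400", "9400", "8100", "9000", "60"])
poor = "\t".join(["ctg2", "20000", "100", "9100", "+", "ctg9", "50000", "400", "9400", "8100", "9000", "5"])
for line in (good, poor):
    record = parse_paf_line(line)
    print(record["query"], passes_filters(record))
```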

newgenome:

The newgenome module trims the misassembled contigs and produces a misassembly-free genome. This module is used only with multiple samples.

usage: newgenome -h  [options] <draft_names & paths> <path to cut files>

Produce mis-assembly free genomes, part of GALA Gap-free Long-read Assembler 

positional arguments:
  draft                 Draft names and paths [required]
  cut_files             path_to_cut_files" [required]

optional arguments:
  -h, --help            show this help message and exit
  -f Output files name	[default new_genome]
  -o output files path	[default current directory]
  -v, --version         show program's version number and exit

ccm:

The Contig Clustering Module is used to identify the scaffolding groups and the contig overlap information in multiple preliminary assemblies.

CCM could have extended applications in generating a consensus assembly from multiple sequences. It is also useful in reference-guided scaffolding to determine chromosome scaffolding groups.

usage: ccm -h  [options] <path/to/mapping_files> <number of drafts>

CCM Contig Clustering Module, part of GALA Gap-free Long-read Assembler

positional arguments:
  mapping_files         mapping paf file [required]
  drafts                Number of drafts [required]

optional arguments:
  -h, --help            show this help message and exit
  -b Alignment block length	 [default 5000]
  -p Alignment identity percentage	 [default 70%]
  -c Shortest contig length	 [default 5000]
  -q Mapping quality	 [default 20]
  -f Output files name	[default scaffolds]
  -o output files path	[default current directory]
  -v, --version         show program's version number and exit
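
Grouping contigs that are linked by alignments can be pictured as finding connected components in a graph. The Python 3 sketch below uses a small union-find structure to illustrate that idea. It is a simplified stand-in and not the CCM implementation, which also uses overlap information that this sketch ignores. The function name `linkage_groups` and the contig names are hypothetical.

```python
def linkage_groups(contigs, links):
    """Group contigs into connected components; every link must name listed contigs."""
    parent = {contig: contig for contig in contigs}

    def find(contig):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[contig] != contig:
            parent[contig] = parent[parent[contig]]
            contig = parent[contig]
        return contig

    for left, right in links:
        parent[find(left)] = find(right)

    groups = {}
    for contig in contigs:
        groups.setdefault(find(contig), []).append(contig)
    return sorted(sorted(group) for group in groups.values())


# Made-up contigs from two drafts, linked by alignments between them.
contigs = ["d1_ctg1", "d1_ctg2", "d2_ctg1", "d2_ctg2", "d2_ctg3"]
links = [("d1_ctg1", "d2_ctg1"), ("d2_ctg1", "d1_ctg2"), ("d2_ctg2", "d2_ctg3")]
print(linkage_groups(contigs, links))
```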

reformat:

The reformat module filters the alignment data in paf mapping files and merges overlapping and contiguous alignment intervals into a single mapping interval, so that each contig in the query draft has one alignment interval with the subject draft.

usage: reformat -h  [options] <path/to/mapping_files> <number of drafts>

Re-formatting mapping files module, part of GALA Gap-free Long-read Assembler

positional arguments:
  mapping_files         mapping paf file [required]
  drafts                Number of drafts [required]

optional arguments:
  -h, --help            show this help message and exit
  -b Alignment block length	 [default 5000]
  -p Alignment identity percentage	 [default 70%]
  -c Shortest contig length	 [default 5000]
  -q Mapping quality	 [default 20]
  -f Output files name	[default reformated]
  -o output files path	[default current directory]
  -v, --version         show program's version number and exit
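
Merging overlapping and contiguous intervals is a standard sort-and-sweep operation. The Python 3 sketch below shows the general technique on plain (start, end) pairs. It is not taken from the reformat source, and the `max_gap` parameter, which lets nearby intervals be joined as well, is a hypothetical addition for illustration.

```python
def merge_intervals(intervals, max_gap=0):
    """Merge (start, end) intervals that overlap or lie within max_gap of each other."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + max_gap:
            # Extend the previous interval instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(pair) for pair in merged]


# Made-up alignment intervals on one contig.
alignments = [(0, 100), (50, 200), (200, 300), (500, 600)]
print(merge_intervals(alignments))
print(merge_intervals(alignments, max_gap=200))
```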

seprator:

The seprator module is used to split the contig alignments into individual BAM files and to write the names of the reads mapped to each contig to a separate file.

usage: seprator -h  [options] <contig_names> <bam_file>

Separate each contig correlated read names, part of GALA Gap-free Long-read Assembler

positional arguments:
  contig_names          contig_names [required]
  bam_file              mapping bam file [required]

optional arguments:
  -h, --help            show this help message and exit
  -o output files path	[default current directory]
  -f Output files name	[default bam_seprator]
  -b output folder name	 [default bams]
  -v, --version         show program's version number and exit

Use the following command to produce contig_names file:
	samtools view -H <bam_file> |grep 'SQ'|cut -f 2|cut -d : -f 2 > contig_names
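
The same contig list can be obtained in Python from the header text printed by `samtools view -H`. The sketch below is a hedged alternative to the shell pipeline above and is not part of GALA. It reads the SN field of each @SQ line, which keeps contig names intact even when they contain a colon (`cut -d : -f 2` would return only the part before the second colon). The function name `contig_names_from_header` and the example header are hypothetical.

```python
def contig_names_from_header(header_text):
    """Return the SN value of every @SQ line in a SAM/BAM header."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
                break
    return names


# Made-up header text in the format printed by `samtools view -H`.
header = "@HD\tVN:1.6\n@SQ\tSN:contig_1\tLN:1000\n@SQ\tSN:contig_2\tLN:2500\n@PG\tID:bwa\n"
print(contig_names_from_header(header))
```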

readsep:

The readsep module separates a set of reads from a sequencing dataset according to the read name in the definition line.

usage: readsep -h [options] <reads> <read_titles>

Extract reads from fasta or fastq, part of GALA Gap-free Long-read Assembler

positional arguments:
  reads                 raw/corrected reads [required]
  read_titles           read names [required]

optional arguments:
  -h, --help            show this help message and exit
  -f input file format (fa/fq)
  -v, --version         show program's version number and exit
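
Selecting reads by the name in the definition line can be sketched in a few lines of Python 3. The example below handles FASTQ only and assumes each record occupies exactly four lines. It is an illustration of the idea rather than the readsep implementation, and the function name `extract_fastq_reads` and the example reads are hypothetical.

```python
def extract_fastq_reads(lines, wanted):
    """Yield 4-line FASTQ records whose read name (first word after '@') is in wanted."""
    # Group the input into chunks of four lines; an incomplete trailing chunk is dropped.
    for header, sequence, plus, quality in zip(*[iter(lines)] * 4):
        if not header.startswith("@"):
            continue  # not a valid record start under the 4-line assumption
        name = header[1:].split()[0]
        if name in wanted:
            yield header, sequence, plus, quality


# Made-up reads for illustration.
fastq_lines = [
    "@read_1 runid=abc", "ACGT", "+", "IIII",
    "@read_2 runid=abc", "GGCC", "+", "IIII",
]
for record in extract_fastq_reads(fastq_lines, {"read_2"}):
    print("\n".join(record))
```

Keeping the wanted names in a set makes each lookup constant time, which matters when the read name file contains millions of entries.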

Licence

GALA is distributed under MIT license. See the LICENSE file for details.


gala's Issues

Single command error, then issues with individual commands.

Hiya,

I saw the preprint on bioRxiv and thought I'd give GALA a go. I've run into two issues pretty much right off the bat, though.

I've two different genome assemblies listed in drafts.txt, with full file paths and formatted as per the GitHub instructions.

The first issue was when I run the following command:
./gala drafts.txt fa cns_final.fasta.gz nanorepore-corrected

I get this error:
Traceback (most recent call last):
  File "./gala", line 51, in <module>
    comp_generator(genomes=draft_names,output=workdir)
  File "/scale_wlg_nobackup/filesets/nobackup/landcare00072/Genome/Programs/GALA/src/comp_generator.py", line 11, in comp_generator
    z=list(open(genomes))
IOError: [Errno 2] No such file or directory: 'drafts.txt'

Apparently it can't find drafts.txt even though it's in the working directory, as all the other files are! Any thoughts?

Following that I tried running the steps individually and ran into an error at the mdm step.

./comp drafts.txt worked fine, as did sh draft_comp.sh, however when I ran ./mdm comparison/ 2 I get this issue:

./mdm comparison/ 2
Traceback (most recent call last):
  File "./mdm", line 14, in <module>
    from cut_gathering import cut_gathering
ImportError: No module named cut_gathering

Cheers,
Chris

Possible to maintain haplotypes?

Hi, I wondered whether GALA is capable of keeping and improving the haplotigs of an assembly. I've sequenced a triploid species and would like to phase the haplotypes, but I'm not sure if GALA could help. Thanks! @mawad89

KeyError: 'contig_2601_awad'

Hi thanks for this great tool.

I am having failure at newgenome stage in both step-by-step and direct running mode.
Base environment: Python 2.7.15
Dependencies were installed using conda in the Python 2.7.15 environment.

tail of std.err

`Traceback (most recent call last):
  File "/home/apps/gala/gala", line 98, in <module>
    scaffolding(path='comparison',number_of_drafts=number_of_drafts,block=block,percentage=percent,shortage_contig=contig,quality=qty,out_file=True,output_name=name,output=outpath)
  File "/home/apps/gala/src/scaffolding.py", line 250, in scaffolding
    zom=zom+int(a[ien][bac][0].split('\t')[1])
KeyError: 'contig_2601_awad'`

Any suggestions please

How to assemble gap-free chromosomes with GALA from HiFi and ONT data?

I have assembled 3 sets of scaffolds with hifiasm, HiCanu and Falcon from PacBio HiFi data. I have also assembled another 3 sets of scaffolds with NextDenovo, NECAT and Flye, and polished them with NextPolish. I want to know how to assemble gap-free chromosomes with GALA using all 6 sets of scaffolds.
Thanks,
Best regards.
Wei

minimap commands in the auto-generated draft_comp.sh file resulting in usage errors

Hi,

Thanks for the interesting tool. The very first Minimap2 step in the Gala pipeline is throwing usage errors.

This is what my drafts file essentially looks like (paths shortened for simpler appearance here):

draft_01=/selected_asms/canu.fasta
draft_02=/selected_asms/shasta.fasta
draft_03=/selected_asms/wtdbg2.fasta
draft_04=/selected_asms/flye.fasta

As part of the pipeline, it auto-generates the draft_comp.sh file, which looks like this:

mkdir -p preliminary_comparison
cd preliminary_comparison
minimap2 -x asm5 $draft_01 $draft_02 > draft_01vsdraft_02.paf
minimap2 -x asm5 $draft_01 $draft_03 > draft_01vsdraft_03.paf
minimap2 -x asm5 $draft_01 $draft_04 > draft_01vsdraft_04.paf
minimap2 -x asm5 $draft_02 $draft_01 > draft_02vsdraft_01.paf
minimap2 -x asm5 $draft_02 $draft_03 > draft_02vsdraft_03.paf
minimap2 -x asm5 $draft_02 $draft_04 > draft_02vsdraft_04.paf
minimap2 -x asm5 $draft_03 $draft_01 > draft_03vsdraft_01.paf
minimap2 -x asm5 $draft_03 $draft_02 > draft_03vsdraft_02.paf
minimap2 -x asm5 $draft_03 $draft_04 > draft_03vsdraft_04.paf
minimap2 -x asm5 $draft_04 $draft_01 > draft_04vsdraft_01.paf
minimap2 -x asm5 $draft_04 $draft_02 > draft_04vsdraft_02.paf
minimap2 -x asm5 $draft_04 $draft_03 > draft_04vsdraft_03.paf

Since Minimap2 is throwing the usage errors, my gut feeling is that the pipeline intends to export the variables draft_01 ... draft_04 (with the values being the FILE_PATHs associated with them) into the user's environment outside of Python, but that it's not working. Otherwise, I'd assume Gala would want to write draft_comp.sh with the paths to those files rather than variable names like "$draft_01". Either way, something is not working correctly.

Any advice is appreciated.

Best,

John

bwa still running but no new output?

I used bwa to map the reads to new_draft_01.fa with the following command:
../../bwa-0.7.17/bwa mem -t 12 -x ont2d new_draft_01.fa crabA_nano_pass.fastq |../../samtools-1.9/bin/samtools view -@ 12 -b -o draft1_map.bam
I checked the result: it generated draft1_map.bam yesterday at 3 pm, and since then there has been no new output in the bam file. The other bam files have also stopped growing. Can I stop it now and go on to the next step? By the way, the genome is 1 G, the sequencing data is ~200G, and bwa has been running for 5 days.

Single Command Mode: `KeyError:...

Here I shared my files
https://drive.google.com/drive/folders/1CUCO74e_YArHYTAT92UdLcGiIgfPb9c-?usp=sharing

Another bug!
'[M::main] CMD: minimap2 -x asm5 /home/crciv/AcerChrAssemb/GALA/gala/new_genomes/new_draft_3.fa /home/crciv/AcerChrAssemb/GALA/gala/new_genomes/new_draft_2.fa
[M::main] Real time: 16.140 sec; CPU: 39.736 sec; Peak RSS: 1.634 GB
Traceback (most recent call last):
File "/home/crciv/soft/gala/gala", line 82, in <module>
scaffolding(path='comparison',number_of_drafts=number_of_drafts,block=block,percentage=percent,shortage_contig=contig,quality=qty,out_file=True,output_name=name,output=outpath)
File "/home/crciv/soft/gala/src/scaffolding.py", line 163, in scaffolding
for bas in a[e][ba]:
KeyError: 'Ctg8_pilon_1_awad'

If you can share the files in gala_results/gap_free_comp/comparison folder with me (I tried different data-sets but I don't see this error)

Originally posted by @mawad89 in #1 (comment)

How to get misassembly-free draft

Hi,
I ran the ccm module and got three different new_draft.fa files. Next I want to run bwa index misassembly-free draft, but I don't understand how to get the misassembly-free draft. Should I merge the three new_draft.fa files? I hope to get your advice.

Can it be installed through Conda

This software has many dependencies, can it be installed through Conda?

CCM output

Hello,

Thank you for developing GALA, it's a very interesting tool. I'm confused about how CCM works - please could you help me out with this?

In the bioRxiv preprint, it seems like all of the preliminary assemblies are collapsed into one set of linkage groups ("the contig-clustering module (CCM) pools the linked nodes within different layers and those inside the same layer into different linkage groups"); the Online Methods for CCM suggest that CCM uses the raw reads to connect contigs, and "pools all connected nodes into a linkage group", when nodes can be linked across assemblies (layers).

But in actual use, the ccm tool doesn't seem to use the raw reads at all, and it outputs a separate set of scaffolds for each draft assembly. It seems to do a good job of grouping contigs within one assembly, but it doesn't group contigs from different assemblies. So I have a different number of linkage groups for each draft assembly.

(I'm giving GALA a set of 5 preliminary assemblies, and running, for example, gala/ccm comparison 5, where comparison is a folder containing the PAF files from running draft_comp.sh.)

The paper says "GALA modelled the preliminary assemblies and raw reads into 14 independent linkage groups" for C. elegans (Online Methods) - but how did you use the raw reads, and how did you identify one set of 14 linkage groups, from the separate sets of linkage groups output for each assembly? Am I missing something about running CCM?

Many thanks
John

Multithreading option in GALA

Hi,
I am trying to use GALA in single command mode on a cluster using the Sun Grid Engine, but it is very slow because GALA has no multithreading option. Is there any way to speed up the job using multithreading?

Here is my script.

#!/bin/bash

#$ -N gala
#$ -cwd
#$ -l h_vmem=20G
#$ -o gala.log
#$ -e gala.error
### number of cores to be used
#$ -pe smp 26


#####   Actual script and data dir ########
WD=/mnt/data/

export PATH="/mnt/bin/hifiasm/hifiasm-0.16.1/:$PATH"
export PATH="/mnt/bin/canu/canu-2.2/bin/:$PATH"
export PATH="/mnt/bin/minimap2/minimap2-2.17_x64-linux/:/mnt/bin/samtools/samtools-1.9/:$PATH"
export PATH="/mnt/bin/bwa/bwa-0.7.17/:$PATH"
export PATH="/mnt/bin/flye/Flye-2.9.1/bin/:$PATH"

python2 /mnt/tool/GALA/gala draft_names_paths.txt fq raw pacbio-raw -f Final_combine -a canu flye hifiasm

Problem ccm

Hi,
I have a problem with the ccm module, I ran:

python ./GALA/ccm comparison_new 16

and got:
`

new_draft_121623
new_draft_135853
new_draft_101723
new_draft_112381
new_draft_164056
new_draft_145149
new_draft_153244
new_draft_82379
new_draft_95088
Traceback (most recent call last):
  File "./GALA/ccm", line 35, in <module>
    scaffolding(path=paf,number_of_drafts=drafts,block=block,percentage=percent,shortage_contig=contig,quality=qty,out_file=True,output_name=name,output=output)
  File "/scratch/GALA/src/scaffolding.py", line 211, in scaffolding
    elif (d[base][n].split('\t')[-3].replace('\n','')==d[base][n+1].split('\t')[-3].replace('\n','')) and (d[base][n].split('\t')[5]==d[base][n+1].split('\t')[5]):
IndexError: list index out of range
`

Thanks,
Best.

Is the input genome contig version or scaffold (Hic) version

I have assembled multiple versions of contig genomes, and I would like to use the gala software. Do these genomes require chromosome scaffolding before I input them into the gala software?

how to deal with the ccm results

I am currently trying to split the raw reads by chromosome, and I have a few draft assemblies based on Canu, Nextdenovo, Flye, and Necat. However, the scaffold numbers ended up differently based on draft assemblies after the ccm module, and I also didn't know the real chromosome number of this species (no reference). It would be helpful if you could tell me which scaffolding results (or how to choose) should I use, since they will be of great importance to the assembly step after the raw reads have been split based on scaffolding.

Py2.7

Hi,

interesting tool, but why require Python2.7 seeing as it is not supported any more.

Are there any plans to upgrade to Python3 ?

Thanks

How to run Gala

Hello,
I have assembled a genome of 1.4 Gbases from MinION reads using Canu and Flye, and then polished the assemblies with Illumina short reads using Pilon, POLCA and NextPolish. Now I want to try your tool, but I am not sure whether I am running it properly. In the file draft_names_paths.txt I listed all the polished fasta genomes and the raw data (fastq.gz) from the MinION platform.

Then I run the command :

./gala /data2/maria/assembles/draft_names_paths.txt fq/fa corrected reads
Is it right ?

Please help me,
Maria

The amount of data eventually generated

After running the lgam module, 201 GB of data (fq.gz) was generated, while the original data is 93 GB (fq.gz).

Calculation time of bwa step

I performed GALA using the genome assembly files output by Canu (num_seqs: 45,165; sum_len: 5,369,907,803), Flye (num_seqs: 19,212; sum_len: 3,453,428,464) and wtdbg2 (num_seqs: 15,658; sum_len: 3,366,331,150), respectively.
The MDM module completed the calculation relatively quickly, but the later steps to use BWA took a lot of time.
The bwa calculation has been going on for about two weeks, is it possible to speed up this step?

I ran the following command.
$ vi draft_paths.txt
draft_01=/home/local/GALA/Assemblies/canu211-no2.contigs.fasta
draft_02=/home/local/GALA/Assemblies/flye.fasta
draft_03=/home/local/GALA/Assemblies/wtdbg2.fasta
$ gala draft_paths.txt fa /home/local/reads_files/1.fasta.gz pacbio-raw

I want to try CHM13 assembly

I want to try assembling CHM13 before beginning my own project. Do you provide the code used for the CHM13 assembly in your article? Thanks~

Computational requirements and time

Hi, thank you for sharing this program! I was wondering if you can give any estimates of the computational resources and time required to run the single command program?

Parameters for multi-threads?

Hi! @mawad89 GALA seems powerful, I'm trying this software out.. however I don't see a parameter for multi-threads(?).
For example, running bwa takes pretty long if not multi-threaded. Thanks!

usage: gala -h [options] <draft_names & paths> <fa/fq>

GALA Gap-free Long-reads Assembler

positional arguments:
draft_names Draft names and paths [required]
input_file input type (fq/fa) [required]
reads raw/corrected reads [required]
sequencing_platform pacbio-raw pacbio-corrected nanopore-raw nanopore-
corrected [required]

optional arguments:
-h, --help show this help message and exit
-a [ASSEMBLER [ASSEMBLER ...]]
Chr-by_Chr assembler (canu flye miniasm) [default
canu]
-b Alignment block length [default 5000]
-p Alignment identity percentage [default 70%]
-l lowest number of misassemblies indecator [default 1]
-c Shortest contig length [default 5000]
-k Mis-assembly block [default 175]
It is better to extend the misassembly block in case of
unpolished assemblies or expected mis-assemblies
in highly repetative regions (5000-10000)
-q Mapping quality [default 20]
-f Output files name [default gathering]
-t cut on a threshold passed by -u [default False]
-u threshold cut value [default 3]
-o output files path [default current directory]
-v, --version show program's version number and exit

newgenome throws ValueError: invalid literal for int() with base 10

Hello,

Thank you for developing GALA. I have a problem with newgenome as follows.

$ /media/Users/src/GALA/newgenome drafts.txt gathering
Traceback (most recent call last):
  File "/media/Users/src/GALA/newgenome", line 28, in <module>
    genomes(genomes=draft,gathering=gathering,gathering_name=name,outpath=output)
  File "/media/Users/src/GALA/src/new_genome.py", line 82, in genomes
    b=new_genome(cut_file=gathering+gathering_name+'_'+a+'_cuts.txt',old_genome=aa,out_path=outpath,name='new_'+a)
  File "/media/Users/src/GALA/src/new_genome.py", line 48, in new_genome
    i[base+'_'+str(h)+'_'+'awad']=e[base][int(g[j]):int(g[j+1])]
ValueError: invalid literal for int() with base 10: '6067162.0'

Single command mode throws a similar error.
The content of gathering/gathering_draft_01_cuts.txt:

scaffold0003	6067162.0	4	3	0
...
scaffold0016	1397294.3333333333	3	2	0

I edited the *_cuts.txt files so that all float values were truncated to integer values:

scaffold0003	6067162	4	3	0
...
scaffold0016	1397294	3	2	0

Then newgenome exited successfully. However, will the final result be as expected?

Best

How many draft assemblies should I need?

Hi,

I have hifi and hic data of a species, I want to know that how many draft assemblies should I need if I want to use GALA? 4 assemblies generated from hifi data using different tools (hifiasm, flye, wtdbg2, raven) are enough?

Best,
Kun

The single command mode

Hi, could I ask when the single command mode can be used?
I tried the single command mode, but it failed because of some bugs in the main script.

Best,

Syntax confusion regarding sequencing_platform

Hi,

Thanks for the cool tool. I hope to get it running.

I have been trying to get it run with the following command syntax:

${GALA} ${DRAFTS} -a canu fq ${READS} -nanopore-corrected

It gave this error:

usage: gala -h  [options] <draft_names & paths> <fa/fq> <reads> <platform>
gala: error: the following arguments are required: draft_names, input_file, reads, sequencing_platform

I tried taking out -a canu, it gave a new error:

${GALA} ${DRAFTS} -fq ${READS} -nanopore-corrected


usage: gala -h  [options] <draft_names & paths> <fa/fq> <reads> <platform>
gala: error: the following arguments are required: sequencing_platform

To get past this error, I had to remove the dash "-" in front of nanopore-corrected:

${GALA} ${DRAFTS} fq ${READS} nanopore-corrected

That got it past that error, but I am currently dealing with Minimap2 now yelling about something, which I will diagnose and report back on.

Speaking to that error though, it seems to me that the syntax guidance on the frontpage of this github repo says to use the following for sequence platforms:

-pacbio-raw -pacbio-corrected -nanopore-raw -nanopore-corrected

...yet that front dash causes a problem.

So perhaps the syntax guidance needs changing, or the argparse code regarding this positional argument.

Best,

John Urban

strange format of 'draft_comp.sh' generated by comp

As you can see, the lines are broken, and what I don't understand is why there seem to be 12 minimap genome comparisons.
Shouldn't there be at most 9 when using 3 draft genomes?

TabError: inconsistent use of tabs and spaces in indentation

Hi,

Thanks for the cool program. I am giving it a try.

First thing that pops up is this error:

Traceback (most recent call last):
  File "/central/groups/carnegie_poc/jurban/software/gala/GALA/gala", line 19, in <module>
    from read_extract import read_extract
  File "/central/groups/carnegie_poc/jurban/software/gala/GALA/src/read_extract.py", line 23
    if read_file[-2:]=='gz':
                           ^
TabError: inconsistent use of tabs and spaces in indentation

I am guessing Python didn't complain in the past about this, but now it is, which is annoying.

I fixed it with:

cp read_extract.py save_original_read_extract.py
awk '{gsub("\t","    "); print}' save_original_read_extract.py > read_extract.py

I will report back if other files start reporting the same error.

As an aside, the dependencies on the Github front page of this Repo says it depends on Python 2.7, yet that file calls for python 3. Can you update the list of dependencies?

Many thanks!

Best,

John Urban

one of the comparison file is empty

Hi, when running GALA on two draft assemblies, I found that the .paf, scaffolding and gathering files of one of the drafts are all empty. Is this normal, or does it reflect some error? Thanks for your help.

it is too slow running 'bwa mem -x ont2d misassembly-free draft long-reads'

Hi,
thank you for the nice pipeline GALA. The main problem I have met these days is that the 'bwa mem -x ont2d' step is too slow. It has been running for two days and has not finished.

I want to know whether there is a faster way to map my ONT reads to my clean draft contigs?

Best,
Xu

Step-by-Step Mode: newgenome: cut_file error

cat draft_names_paths.txt
draft_1=/home/crciv/AcerChrAssemb/Pilon/Ra_assembly_Pilon_polished/Ra_assembly_Pilon_polished.fasta
draft_2=/home/crciv/AcerChrAssemb/Pilon/Flye_assembly_Pilon_polished/Flye_assembly_Pilon_polished.fasta
draft_3=/home/crciv/AcerChrAssemb/NextPolish/Acer_data/01_rundir/02.kmer_count/05.polish.ref.sh.work/genome.nextpolish.part000_part001.fasta
$ comp draft_names_paths.txt
comp 1.0.0
$ sh draft_comp.sh
[M::mm_idx_gen::4.388*1.18] collected minimizers
[M::mm_idx_gen::4.849*1.35] sorted minimizers
[M::main::4.849*1.35] loaded/built the index for 119 target sequence(s)
[M::mm_mapopt_update::5.107*1.33] mid_occ = 100
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 119
[M::mm_idx_stat::5.299*1.32] distinct minimizers: 19721415 (95.65% are singletons); average occurrences: 1.106; average spacing: 9.935
[M::worker_pipeline::21.596*2.28] mapped 226 sequences
[M::main] Version: 2.14-r892-dirty
[M::main] CMD: minimap2 -x asm5 /home/crciv/AcerChrAssemb/Pilon/Ra_assembly_Pilon_polished/Ra_assembly_Pilon_polished.fasta /home/crciv/AcerChrAssemb/Pilon/Flye_assembly_Pilon_polished/Flye_assembly_Pilon_polished.fasta
[M::main] Real time: 21.671 sec; CPU: 49.280 sec; Peak RSS: 1.685 GB
[M::mm_idx_gen::4.040*1.31] collected minimizers
[M::mm_idx_gen::4.543*1.49] sorted minimizers
[M::main::4.543*1.49] loaded/built the index for 119 target sequence(s)
[M::mm_mapopt_update::4.803*1.47] mid_occ = 100
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 119
[M::mm_idx_stat::4.994*1.45] distinct minimizers: 19721415 (95.65% are singletons); average occurrences: 1.106; average spacing: 9.935
[M::worker_pipeline::23.551*2.32] mapped 64 sequences
[M::main] Version: 2.14-r892-dirty
[M::main] CMD: minimap2 -x asm5 /home/crciv/AcerChrAssemb/Pilon/Ra_assembly_Pilon_polished/Ra_assembly_Pilon_polished.fasta /home/crciv/AcerChrAssemb/NextPolish/Acer_data/01_rundir/02.kmer_count/05.polish.ref.sh.work/genome.nextpolish.part000_part001.fasta
[M::main] Real time: 23.626 sec; CPU: 54.694 sec; Peak RSS: 1.651 GB
[M::mm_idx_gen::4.043*1.26] collected minimizers
[M::mm_idx_gen::4.497*1.43] sorted minimizers
[M::main::4.497*1.43] loaded/built the index for 226 target sequence(s)
[M::mm_mapopt_update::4.747*1.41] mid_occ = 100
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 226
[M::mm_idx_stat::4.936*1.39] distinct minimizers: 19752353 (96.20% are singletons); average occurrences: 1.101; average spacing: 9.938
[M::worker_pipeline::16.736*2.42] mapped 119 sequences
[M::main] Version: 2.14-r892-dirty
[M::main] CMD: minimap2 -x asm5 /home/crciv/AcerChrAssemb/Pilon/Flye_assembly_Pilon_polished/Flye_assembly_Pilon_polished.fasta /home/crciv/AcerChrAssemb/Pilon/Ra_assembly_Pilon_polished/Ra_assembly_Pilon_polished.fasta
[M::main] Real time: 16.764 sec; CPU: 40.572 sec; Peak RSS: 1.458 GB
[M::mm_idx_gen::4.037*1.26] collected minimizers
[M::mm_idx_gen::4.507*1.43] sorted minimizers
[M::main::4.507*1.43] loaded/built the index for 226 target sequence(s)
[M::mm_mapopt_update::4.781*1.41] mid_occ = 100
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 226
[M::mm_idx_stat::4.981*1.39] distinct minimizers: 19752353 (96.20% are singletons); average occurrences: 1.101; average spacing: 9.938
[M::worker_pipeline::22.634*2.40] mapped 64 sequences
[M::main] Version: 2.14-r892-dirty
[M::main] CMD: minimap2 -x asm5 /home/crciv/AcerChrAssemb/Pilon/Flye_assembly_Pilon_polished/Flye_assembly_Pilon_polished.fasta /home/crciv/AcerChrAssemb/NextPolish/Acer_data/01_rundir/02.kmer_count/05.polish.ref.sh.work/genome.nextpolish.part000_part001.fasta
[M::main] Real time: 22.706 sec; CPU: 54.389 sec; Peak RSS: 1.668 GB
[M::mm_idx_gen::4.049*1.32] collected minimizers
[M::mm_idx_gen::4.547*1.49] sorted minimizers
[M::main::4.547*1.49] loaded/built the index for 64 target sequence(s)
[M::mm_mapopt_update::4.818*1.47] mid_occ = 100
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 64
[M::mm_idx_stat::5.015*1.45] distinct minimizers: 19800975 (95.79% are singletons); average occurrences: 1.138; average spacing: 9.943
[M::worker_pipeline::15.880*2.38] mapped 119 sequences
[M::main] Version: 2.14-r892-dirty
[M::main] CMD: minimap2 -x asm5 /home/crciv/AcerChrAssemb/NextPolish/Acer_data/01_rundir/02.kmer_count/05.polish.ref.sh.work/genome.nextpolish.part000_part001.fasta /home/crciv/AcerChrAssemb/Pilon/Ra_assembly_Pilon_polished/Ra_assembly_Pilon_polished.fasta
[M::main] Real time: 15.901 sec; CPU: 37.778 sec; Peak RSS: 1.474 GB
[M::mm_idx_gen::4.095*1.32] collected minimizers
[M::mm_idx_gen::4.590*1.49] sorted minimizers
[M::main::4.590*1.49] loaded/built the index for 64 target sequence(s)
[M::mm_mapopt_update::4.891*1.46] mid_occ = 100
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 64
[M::mm_idx_stat::5.118*1.44] distinct minimizers: 19800975 (95.79% are singletons); average occurrences: 1.138; average spacing: 9.943
[M::worker_pipeline::19.194*2.24] mapped 226 sequences
[M::main] Version: 2.14-r892-dirty
[M::main] CMD: minimap2 -x asm5 /home/crciv/AcerChrAssemb/NextPolish/Acer_data/01_rundir/02.kmer_count/05.polish.ref.sh.work/genome.nextpolish.part000_part001.fasta /home/crciv/AcerChrAssemb/Pilon/Flye_assembly_Pilon_polished/Flye_assembly_Pilon_polished.fasta
[M::main] Real time: 19.270 sec; CPU: 43.076 sec; Peak RSS: 1.610 GB
$ ls comparison
draft_1vsdraft_2.paf  draft_1vsdraft_3.paf  draft_2vsdraft_1.paf  draft_2vsdraft_3.paf  draft_3vsdraft_1.paf  draft_3vsdraft_2.paf
$ mdm comparison 3
mdm 1.0.0
$ newgenome draft_names_paths.txt cut_folder
Traceback (most recent call last):
  File "/home/crciv/soft/gala/newgenome", line 28, in <module>
    genomes(genomes=draft,gathering=gathering,gathering_name=name,outpath=output)
  File "/home/crciv/soft/gala/src/new_genome.py", line 82, in genomes
    b=new_genome(cut_file=gathering+gathering_name+'_'+a+'_cuts.txt',old_genome=aa,out_path=outpath,name='new_'+a)
  File "/home/crciv/soft/gala/src/new_genome.py", line 11, in new_genome
    a=list(open(cut_file))
IOError: [Errno 2] No such file or directory: 'gathering/new_genome_draft_1_cuts.txt'
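
The traceback shows `newgenome` opening one cut file per draft, here `gathering/new_genome_draft_1_cuts.txt`. A minimal sketch, assuming that naming pattern (inferred from the traceback only, not from GALA's documentation) and the `name=path` layout of `draft_names_paths.txt`, for listing which expected cut files are missing before `newgenome` is run. The helper names `read_draft_names` and `missing_cut_files` are hypothetical and are not part of GALA.

```python
import os


def read_draft_names(list_path):
    """Return the draft names from a file of `name=path` lines."""
    names = []
    with open(list_path) as handle:
        for line in handle:
            line = line.strip()
            if not line or "=" not in line:
                continue  # skip blank or malformed lines
            name, _path = line.split("=", 1)
            names.append(name.strip())
    return names


def missing_cut_files(list_path, gathering_dir="gathering", prefix="new_genome"):
    """List expected cut files that are not present on disk.

    Assumption: the file name pattern <gathering_dir>/<prefix>_<draft>_cuts.txt
    is taken from the traceback above and may differ in other GALA versions.
    """
    missing = []
    for name in read_draft_names(list_path):
        expected = os.path.join(gathering_dir, f"{prefix}_{name}_cuts.txt")
        if not os.path.isfile(expected):
            missing.append(expected)
    return missing


if __name__ == "__main__":
    for path in missing_cut_files("draft_names_paths.txt"):
        print("missing:", path)
```

If every draft is reported as missing, the folder or name passed to the earlier step probably differs from what `newgenome` is looking for.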

How to deal with read names shared between chromosomes?

When I checked the read names for the different chromosomes, I found that some read names belong to more than one chromosome. Should we remove these shared reads, or keep them in every chromosome they are assigned to? Chromosome 1 of Arabidopsis is 29.9 Mb. When I assembled the chromosome 1 FASTQ with Canu 2.0 including the shared reads, the contig size was 36.2 Mb; when I assembled it without the shared reads, the contig size was 19.6 Mb.
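
To see how many reads are affected, the overlap can be counted directly. A minimal sketch, assuming one plain-text, four-lines-per-record FASTQ file per chromosome; the function names and the example file names are hypothetical, and the sketch only reports shared names without deciding whether they should be removed.

```python
from collections import defaultdict


def fastq_read_names(path):
    """Yield read names (first token of each header, without '@')."""
    with open(path) as handle:
        for index, line in enumerate(handle):
            # In a 4-line-per-record FASTQ, headers sit on lines 0, 4, 8, ...
            if index % 4 == 0 and line.startswith("@"):
                fields = line[1:].split()
                if fields:
                    yield fields[0]


def shared_read_names(fastq_by_chromosome):
    """Map each read name seen in more than one file to its chromosomes."""
    seen = defaultdict(set)
    for chrom, path in fastq_by_chromosome.items():
        for name in fastq_read_names(path):
            seen[name].add(chrom)
    return {name: sorted(chroms) for name, chroms in seen.items() if len(chroms) > 1}


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    shared = shared_read_names({"chr1": "chr1.fq", "chr2": "chr2.fq"})
    print(len(shared), "read names occur in more than one chromosome")
```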

How to prepare the ref FASTA file

When I checked your ref file (ref=../ref/GRCh38.p13.genome.fa) in the human test, I found that it has many lines that belong to 'chromosome 1'. For example, $ grep 'chromosome 1 ' GRCh38.p13.genome.fa shows many lines like the following:

NT_187361.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p13 Primary Assembly HSCHR1_CTG1_UNLOCALIZED
NT_187362.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p13 Primary Assembly HSCHR1_CTG2_UNLOCALIZED
NT_187363.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p13 Primary Assembly HSCHR1_CTG3_UNLOCALIZED
NT_187364.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p13 Primary Assembly HSCHR1_CTG4_UNLOCALIZED
......
So, could you please tell me how to prepare such a ref file?
Thanks,
Best regards.
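
For inspecting a reference like this, it can help to count how many FASTA records mention each chromosome. A minimal sketch, assuming headers start with `>` and contain the text `chromosome <label>` as in the lines quoted above; the function name is hypothetical, and the sketch only tallies headers. It does not show how the GALA authors prepared their ref file.

```python
import re
from collections import Counter

# Matches e.g. "chromosome 1" or "chromosome X" inside a FASTA header.
CHROM_RE = re.compile(r"chromosome (\w+)")


def count_headers_by_chromosome(fasta_path):
    """Count FASTA records per chromosome label found in the header text."""
    counts = Counter()
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                continue  # sequence line
            match = CHROM_RE.search(line)
            label = match.group(1) if match else "unassigned"
            counts[label] += 1
    return counts


if __name__ == "__main__":
    for label, n in sorted(count_headers_by_chromosome("GRCh38.p13.genome.fa").items()):
        print(label, n)
```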

Can we replace bwa with minimap2?

Hi,

The BWA process in LGAM seems to be very slow. Can we replace bwa with minimap2 and the "-x map-pb/ont" option?
I think they are compatible.

Best wishes,
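
The substitution proposed here would look roughly as follows. A minimal sketch, assuming minimap2 is on the PATH and that SAM output is wanted; it uses the minimap2 options `-a` (SAM output), `-x map-ont` / `-x map-pb` (long-read presets) and `-t` (threads). The function names and file names are hypothetical, and whether GALA's later steps accept minimap2 output in place of bwa output is not confirmed in this thread.

```python
import subprocess

# minimap2 presets for the two long-read platforms mentioned above.
PRESETS = {"ont": "map-ont", "pacbio": "map-pb"}


def minimap2_command(draft_fasta, reads, platform="ont", threads=4):
    """Build the argument list for mapping long reads to a draft assembly."""
    if platform not in PRESETS:
        raise ValueError(f"unknown platform: {platform!r}")
    return [
        "minimap2", "-a",          # write alignments in SAM format
        "-x", PRESETS[platform],   # platform-specific preset
        "-t", str(threads),        # number of worker threads
        draft_fasta, reads,
    ]


def run_to_sam(command, sam_path):
    """Run the command and write its standard output to sam_path."""
    with open(sam_path, "w") as out:
        subprocess.run(command, stdout=out, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = minimap2_command("clean_draft.fasta", "ont_reads.fq", platform="ont", threads=8)
    print(" ".join(cmd))
```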
