tangerzhang / allhic

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data

Perl 66.18% Python 31.15% Shell 2.66%

allhic's Introduction

ALLHiC

See wiki for details (https://github.com/tangerzhang/ALLHiC/wiki).

allhic's Issues

questions about subsequent processing of allhic assembly results

Hi, @tanghaibao and @tangerzhang ,

Compared to the reference genome, I found many inversions and other structural errors in the ALLHiC results. Are there any post-processing scripts in your pipeline to correct these errors and determine chromosome boundaries?

best,
Xu

How to speed up the "Optimize" steps

Hello,
Thanks for this great Hi-C pipeline. The "Optimize" step is taking too much time for me. My target genome is about 500 Mb with 11 chromosomes. I first tried K=16 for the partition; so far it has taken 13 days of real time (331 hours) and 384 hours of CPU time, and it is still running.
I am using an Intel E7-8870 2.40 GHz machine with 80 CPU cores and 2 TB of memory, and I submit the jobs through PBS:
#!/bin/bash
#PBS -l nodes=1:ppn=40

So I wonder whether ALLHiC is forced to use only one core, or whether there is anything I can do to speed up the "Optimize" step. After searching the previous issues, I learned that K=11 is expected in my case, so I will rerun with that value. If the information I have provided isn't enough, just let me know.

Thank you

Assembling a highly heterozygous genome

Hi, I have assembled a highly heterozygous genome. The assembled size is larger than the estimated size. Do I need to remove the alternative contigs before using ALLHiC?
Thanks for any advice~

Global rescue: format of user-prepared.clusters.txt

Could you please elaborate on the format of the user-prepared.clusters.txt file to carry out the globally rescue step?
thanks,
Diego

identify allelic contigs: Allele.ctg.table

Hi @tangerzhang
Thanks for the great tool. I am wondering how to choose a well-assembled, closely related species.
For example, I study Brassica napus (AACC, 2n = 4x = 38), an allopolyploid formed by interspecific hybridization between B. rapa (AA, 2n = 2x = 20) and B. oleracea (CC, 2n = 2x = 18). It is also closely related to Arabidopsis. So can I just choose Arabidopsis, B. rapa or B. oleracea as the closely related species? As far as I know, Arabidopsis is the best assembled, followed by B. rapa. Thanks!

What should I do with a diploid genome with low heterozygosity?

I have a diploid insect genome of about 480 Mb with low heterozygosity. I notice that ALLHiC is nearly the only Hi-C assembly tool that is well maintained, so I want to use it to finish my assembly. My question is how I should handle this case: just follow the wiki, or skip the Prune step?

Multiple restriction enzyme

Dear Xingtan,
Do you have any plans to add support for multiple restriction enzymes?
I love you.
Won

Detecting scaffold misjoins

Hi, @tanghaibao and @tangerzhang ,

My scaffolds have many misjoins. I know you have a lot of experience in processing Hi-C data. Do you have any suggestions for performing error correction before running ALLHiC?

best,
he

DNase as enzyme parameter

Hello,
I was wondering whether ALLHiC supports libraries generated using DNase as the fragmenting enzyme.
If so, what should be given for the enzyme parameters during the filtering and partitioning steps of the pipeline?
Thanks,
Kevin

How to determine the K value for a diploid?

Thank you for the algorithm!
When I scaffolded a simple diploid genome (1n = 2x = 16) with K = 8, one of the clusters was much bigger than the others.
How should the K value be determined for a diploid?

Rescue stage: Use of uninitialized value in concatenation (.) or string at /bin/ALLHiC/bin/ALLHiC_rescue line 165.

Hi, I am running ALLHiC on a polyploid genome. Everything went fine until the rescue stage, where a few Perl messages like these showed up:

Number of original contigs in contig000650: 0
Number of rescued contigs in contig000650: 1
Use of uninitialized value in concatenation (.) or string at /bin/ALLHiC/bin/ALLHiC_rescue line 165.
Use of uninitialized value in concatenation (.) or string at /bin/ALLHiC/bin/ALLHiC_rescue line 165.
Number of original contigs in group3: 822

According to the help and the instructions on GitHub, I used the correct files. I rechecked what I did at the prune and partition stages, and the input files were correct.

Any tips?

Partition and build problem

Hi, dear Tanger Zhang. I am confused about how to set -k for the partition step. Should K be the number of chromosomes or the number of chromosome sets (e.g., for a diploid, should K be 2)? And if -k is the total number of chromosomes (for example, in a highly heterozygous diploid where the haploid set has 20 chromosomes and the diploid has 40), I would set -k to 40. How can I then get the haploid chromosomes from the total of 40 chromosomes?

Thanks for sharing.

allhic prune.pl fail

Hi,

I would like to test all-hic and am stuck at the pruning step using the command:

ALLHiC/scripts/prune.pl -i Allele.ctg.table -b bam.files.txt -r Simon_Salmon.flye15K_pilon.fasta

The files remove.log, removedb_Allele.txt and removedb_nonBest.txt are created, but afterwards the command dies saying there is no prunning.bam. Indeed, there is no such file, as it was never created.
The code that creates it is commented out in the prune.pl script.

Am I missing something in the ALLHiC pipeline?

Cheers,
Michel

Does the tool produce full-length haplotypes?

Hi,
I was wondering whether this tool produces full-length haplotypes, with the common regions included in each FASTA file, or whether it only outputs the sequences that differ between the haplotypes.

Gmap-based method to identify allelic contig table

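Hello!
To identify alleles among contigs, ALLHiC needs an annotation file and a CDS file for the contigs. In other words, the contigs have to be annotated de novo to obtain these two files before alleles can be identified and the noise between chromosomes removed. Is that right?
As I see it, annotating a large genome is very expensive in time and computing resources. I am wondering whether I could instead simply blastn already-published CDS sequences against the contigs to identify alleles among the contigs. Would that work?
Best wishes!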

Are the results in sample.bwa_aln.REduced.paired_only.flagstat normal?

The output of samtools flagstat sample.bwa_aln.REduced.paired_only.bam (that is, sample.bwa_aln.REduced.paired_only.flagstat) contains the line "19691 + 0 properly paired (0.01% : N/A)".

Is this result normal for the subsequent analysis?
Thank you!
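
For readers who want to pull such numbers out of a flagstat report programmatically, here is a minimal sketch. It assumes the report follows the text layout of the line quoted above ("N + M label (pct% : ...)"); the function names are illustrative and are not part of ALLHiC or samtools. Note that Hi-C mates often map far apart or to different contigs, so the "properly paired" field on its own may say little about library quality.

```python
import re

# Matches flagstat lines such as "19691 + 0 properly paired (0.01% : N/A)".
# The percentage group is optional because not every line carries one.
FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+?)(?: \(([\d.]+)% : .*\))?$")


def parse_flagstat(text):
    """Return {label: (qc_passed, qc_failed, percent_or_None)} for each line."""
    stats = {}
    for line in text.splitlines():
        match = FLAGSTAT_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a "N + M label" line
        passed, failed, label, pct = match.groups()
        stats[label] = (int(passed), int(failed), float(pct) if pct else None)
    return stats


def properly_paired_percent(text):
    """Percentage on the 'properly paired' line, or None if it is absent."""
    entry = parse_flagstat(text).get("properly paired")
    return entry[2] if entry else None


# The single line quoted in the issue:
print(properly_paired_percent("19691 + 0 properly paired (0.01% : N/A)"))
```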

panic: runtime error: index out of range

Hi, I got panic: runtime error: index out of range when I ran ALLHiC scripts sh allhic_pbs.sh QMg_NbQ4P_RN.fasta 38 GATC:

#!/bin/bash
#Usage: sh allhic_pbs.sh QMg_NbQ4P_RN.fasta 38 GATC
cat << EOF  |qsub
#!/bin/bash -l
#PBS -N allhic
#PBS -l walltime=150:00:00
#PBS -j oe
#PBS -l mem=200G
#PBS -l ncpus=8
#PBS -l cpuarch=avx2
#PBS -M [email protected]
##PBS -m bea

cd \$PBS_O_WORKDIR

conda activate allhic 
export PATH=/work/waterhouse_team/apps/ALLHiC/bin:/work/waterhouse_team/apps/ALLHiC/scripts:$PATH

echo "samtools merge and index"
#samtools merge all.hic.sorted.dedup.bam *.bam
#samtools index all.hic.sorted.dedup.bam

echo "extract"
allhic extract all.hic.sorted.dedup.bam $1 --RE $3

echo "partition"
allhic partition all.hic.sorted.dedup.counts_${3}.txt all.hic.sorted.dedup.pairs.txt $2

echo "optimize"
for o in \$(find . -name "all.hic.sorted.dedup.counts_${3}.*g*.txt");
do
   echo \$o
   allhic optimize \$o all.hic.sorted.dedup.clm
done

echo "build"
allhic build all.hic.sorted.dedup.counts_${3}.${2}g*.tour $1 asm-${2}g.chr.fasta

Here are the logs:

samtools merge and index
/work/waterhouse_team/apps/ALLHiC/scripts/PreprocessSAMs.pl all.hic.sorted.dedup.bam QMg_NbQ4P_RN.fasta MBOI

Fri Jan 11 09:41:09 2019: PreprocessSAMs.pl: make_bed_around_RE_site.pl QMg_NbQ4P_RN.fasta GATC 500
Fri Jan 11 09:41:09 2019: Reading file QMg_NbQ4P_RN.fasta...
Fri Jan 11 09:42:02 2019: Done!  Found 8338823 total instances of motif GATC.  Created files:
QMg_NbQ4P_RN.fasta.near_GATC.500.bed
QMg_NbQ4P_RN.fasta.pos_of_GATC.txt
Fri Jan 11 09:42:02 2019: PreprocessSAMs.pl: bedtools intersect -abam all.hic.sorted.dedup.bam -b QMg_NbQ4P_RN.fasta.near_GATC.500.bed > all.hic.sorted.dedup.REduced.bam
Fri Jan 11 17:51:41 2019: PreprocessSAMs.pl: samtools view -F12 all.hic.sorted.dedup.REduced.bam -b -o all.hic.sorted.dedup.REduced.paired_only.bam
Sat Jan 12 00:58:47 2019: PreprocessSAMs.pl: samtools flagstat all.hic.sorted.dedup.REduced.paired_only.bam > all.hic.sorted.dedup.REduced.paired_only.flagstat
partition
Extract function: calculate an empirical distribution of Hi-C link size based on intra-contig links
CMD: allhic extract sample.clean.bam QMg_NbQ4P_RN.fasta --RE GATC
02:48:59 writeRE | NOTICE  RE counts in 1512 contigs (total: 8340335, avg 1 per 332 bp) written to `sample.clean.counts_GATC.txt`
02:48:59 extractContigLinks | NOTICE  Parse bamfile `sample.clean.bam`
02:48:59 extractContigLinks | NOTICE  Extracted 0 intra-contig link groups (total = 0)
02:48:59 extractContigLinks | NOTICE  Extracted 0 inter-contig groups to `sample.clean.clm` (total = 0, maxLinks = 0)
panic: runtime error: index out of range

goroutine 1 [running]:
_/Users/bao/code/allhic.(*LinkDensityModel).countBinDensities(0xc00011a4d0, 0xc000148000, 0x5e8, 0x6a0)
        /Users/bao/code/allhic/model.go:149 +0x576
_/Users/bao/code/allhic.(*Extracter).makeModel(0xc00022f740, 0xc000178620, 0x1d)
        /Users/bao/code/allhic/extract.go:155 +0x312
_/Users/bao/code/allhic.(*Extracter).Run(0xc00022f740)
        /Users/bao/code/allhic/extract.go:139 +0xaf
main.main.func1(0xc0000fcc60, 0xc00001cf00, 0xc0000fcc60)
        /Users/bao/code/allhic/cmd/allhic.go:143 +0x17a
github.com/urfave/cli.HandleAction(0x90f200, 0x9da328, 0xc0000fcc60, 0x0, 0xc00001cfc0)
        /Users/bao/go/src/github.com/urfave/cli/app.go:501 +0xc8
github.com/urfave/cli.Command.Run(0x9b7b0c, 0x7, 0x0, 0x0, 0x0, 0x0, 0x0, 0x9c5253, 0x23, 0x9d0617, ...)
        /Users/bao/go/src/github.com/urfave/cli/command.go:165 +0x459
github.com/urfave/cli.(*App).Run(0xc000140000, 0xc00001c180, 0x6, 0x6, 0x0, 0x0)
        /Users/bao/go/src/github.com/urfave/cli/app.go:259 +0x6bb
main.main()
        /Users/bao/code/allhic/cmd/allhic.go:400 +0xb31
Partition contigs based on prunning bam file
CMD: allhic partition sample.clean.counts_GATC.txt sample.clean.pairs.txt 38 --minREs 25
02:48:59 ReadCSVLines | NOTICE  Parse csvfile `sample.clean.counts_GATC.txt`
02:48:59 readRE | NOTICE  Loaded 1512 contig RE lengths for normalization from `sample.clean.counts_GATC.txt`
02:48:59 skipContigsWithFewREs | NOTICE  skipContigsWithFewREs with MinREs = 25 (RE = GATC)
Contig #20 (QM4NbP_21) has 4 RE sites -> MARKED SHORT
Contig #32 (QM4NbP_33) has 18 RE sites -> MARKED SHORT
Contig #41 (QM4NbP_42) has 1 RE sites -> MARKED SHORT
Contig #72 (QM4NbP_73) has 2 RE sites -> MARKED SHORT
Contig #75 (QM4NbP_76) has 12 RE sites -> MARKED SHORT
Contig #80 (QM4NbP_81) has 15 RE sites -> MARKED SHORT
Contig #89 (QM4NbP_90) has 23 RE sites -> MARKED SHORT
Contig #113 (QM4NbP_114) has 22 RE sites -> MARKED SHORT
Contig #153 (QM4NbP_154) has 16 RE sites -> MARKED SHORT
Contig #169 (QM4NbP_170) has 6 RE sites -> MARKED SHORT
Contig #177 (QM4NbP_178) has 9 RE sites -> MARKED SHORT
Contig #199 (QM4NbP_200) has 2 RE sites -> MARKED SHORT
Contig #243 (QM4NbP_244) has 1 RE sites -> MARKED SHORT
Contig #282 (QM4NbP_283) has 2 RE sites -> MARKED SHORT
Contig #283 (QM4NbP_284) has 1 RE sites -> MARKED SHORT
Contig #291 (QM4NbP_292) has 1 RE sites -> MARKED SHORT
Contig #293 (QM4NbP_294) has 3 RE sites -> MARKED SHORT
Contig #316 (QM4NbP_317) has 2 RE sites -> MARKED SHORT
Contig #323 (QM4NbP_324) has 24 RE sites -> MARKED SHORT
Contig #343 (QM4NbP_344) has 3 RE sites -> MARKED SHORT
Contig #374 (QM4NbP_375) has 5 RE sites -> MARKED SHORT
Contig #379 (QM4NbP_380) has 10 RE sites -> MARKED SHORT
Contig #384 (QM4NbP_385) has 7 RE sites -> MARKED SHORT
Contig #391 (QM4NbP_392) has 7 RE sites -> MARKED SHORT
Contig #450 (QM4NbP_451) has 7 RE sites -> MARKED SHORT
Contig #470 (QM4NbP_471) has 4 RE sites -> MARKED SHORT
Contig #477 (QM4NbP_478) has 12 RE sites -> MARKED SHORT
Contig #480 (QM4NbP_481) has 23 RE sites -> MARKED SHORT
Contig #488 (QM4NbP_489) has 13 RE sites -> MARKED SHORT
Contig #505 (QM4NbP_506) has 4 RE sites -> MARKED SHORT
02:48:59 skipContigsWithFewREs | NOTICE  Marked 30 contigs (avg 8.6 RE sites, len 41439) since they contain too few REs (MinREs = 25)
02:48:59 ReadCSVLines | NOTICE  Parse csvfile `sample.clean.pairs.txt`
02:48:59 mustOpen | CRITIC  open sample.clean.pairs.txt: no such file or directory
optimize
02:49:02 writeRE | NOTICE  RE counts in 1512 contigs (total: 8340335, avg 1 per 332 bp) written to `sample.clean.counts_GATC.txt`
02:49:02 extractContigLinks | NOTICE  Parse bamfile `sample.clean.bam`
02:49:02 extractContigLinks | NOTICE  Extracted 0 intra-contig link groups (total = 0)
02:49:02 extractContigLinks | NOTICE  Extracted 0 inter-contig groups to `sample.clean.clm` (total = 0, maxLinks = 0)
panic: runtime error: index out of range

goroutine 1 [running]:
_/Users/bao/code/allhic.(*LinkDensityModel).countBinDensities(0xc00010e4d0, 0xc000150000, 0x5e8, 0x6a0)
        /Users/bao/code/allhic/model.go:149 +0x576
_/Users/bao/code/allhic.(*Extracter).makeModel(0xc000239740, 0xc000182620, 0x1d)
        /Users/bao/code/allhic/extract.go:155 +0x312
_/Users/bao/code/allhic.(*Extracter).Run(0xc000239740)
        /Users/bao/code/allhic/extract.go:139 +0xaf
main.main.func1(0xc0000f0c60, 0xc00001d000, 0xc0000f0c60)
        /Users/bao/code/allhic/cmd/allhic.go:143 +0x17a
github.com/urfave/cli.HandleAction(0x90f200, 0x9da328, 0xc0000f0c60, 0x0, 0xc00001d020)
        /Users/bao/go/src/github.com/urfave/cli/app.go:501 +0xc8
github.com/urfave/cli.Command.Run(0x9b7b0c, 0x7, 0x0, 0x0, 0x0, 0x0, 0x0, 0x9c5253, 0x23, 0x9d0617, ...)
        /Users/bao/go/src/github.com/urfave/cli/command.go:165 +0x459
github.com/urfave/cli.(*App).Run(0xc000134000, 0xc00001c1e0, 0x6, 0x6, 0x0, 0x0)
        /Users/bao/go/src/github.com/urfave/cli/app.go:259 +0x6bb
main.main()
        /Users/bao/code/allhic/cmd/allhic.go:400 +0xb31
build
1. tour format to agp ...

What did I miss?

Thank you in advance,

Michal
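
The log above reports "Extracted 0 intra-contig link groups" immediately before the panic, and it names sample.clean.bam rather than the all.hic.sorted.dedup.bam used in the script. One way to sanity-check an alignment file before running extract is to count how many records have their mate on the same contig. The sketch below works on SAM text lines and is only an illustration under the assumption that zero usable pairs is related to the failure; count_pair_types is a hypothetical helper, not part of allhic.

```python
def count_pair_types(sam_lines):
    """Count records whose mate maps to the same contig vs. a different one.

    Expects SAM text lines (for example the output of `samtools view -h`).
    Returns a tuple (intra_contig_records, inter_contig_records).
    """
    intra = inter = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        rname, rnext = fields[2], fields[6]
        if rname == "*" or rnext == "*":  # read or mate is unmapped
            continue
        if rnext == "=" or rnext == rname:
            intra += 1
        else:
            inter += 1
    return intra, inter
```

A possible way to use it is to pipe `samtools view -h` output into a small script that passes `sys.stdin` to `count_pair_types`. If both counts come back as zero, the BAM file (or the file name passed to extract) is worth checking first.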

Example data cannot be downloaded

Hi Xingtan,

thanks for providing the wiki on ALLHiC.

I would like to run ALLHiC with the example data you provided at ftp://59.79.232.12/, but the FTP server does not seem to be working, so I cannot access the data.

Could you please check if it is offline? Or, is there any other place where I can download the data?

Bests,
Hequan

Allele.ctg.table is empty!

I ran blastn and blastn_parse.pl on two draft genomes, and two *.blast.out files were generated. When I then ran classify.pl, all three output files were empty: Allele.ctg.table, remove.log and Allele.gene.table. Can you help me solve this problem?
Another question: if I have a new reference genome from a long-read assembly, can I use ALLHiC to scaffold it with Hi-C data?
Thanks!

Problem about mitochondrial genome

I am reporting a problem: ALLHiC inserted the mitochondrial genome into a chromosome. The mitochondrial genome should therefore be excluded from the FASTA file before Hi-C assembly.
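
Excluding specific contigs from a FASTA file can be done in a few lines. This is a minimal sketch, assuming the organelle contigs have already been identified by name; drop_contigs and the contig name in the comment are hypothetical and not part of ALLHiC.

```python
def drop_contigs(fasta_lines, names_to_drop):
    """Yield FASTA lines, skipping every record whose ID is in names_to_drop.

    The record ID is taken as the first whitespace-delimited token after ">".
    """
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            record_id = header[0] if header else ""
            keep = record_id not in names_to_drop
        if keep:
            yield line


# Example usage (file names and "mito_contig" are placeholders):
# with open("draft.asm.fasta") as src, open("draft.nuclear.fasta", "w") as dst:
#     dst.writelines(drop_contigs(src, {"mito_contig"}))
```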

classify.pl error

Hi,

I try to create allele.ctg.table files but run into trouble running the classify step.

classify.pl returns following errors:

Use of uninitialized value $tgene in substitution (s///) at /mnt/users/michelmo/tools/ALLHiC/scripts/classify.pl line 54, <IN> line 14417.
Use of uninitialized value $tgene in hash element at mnt/users/michelmo/tools/ALLHiC/scripts/classify.pl line 55, <IN> line 14417.
Use of uninitialized value $tgene in substitution (s///) at mnt/users/michelmo/tools/ALLHiC/scripts/classify.pl line 54, <IN> line 14418.
Use of uninitialized value $tgene in hash element at /mnt/users/michelmo/tools/ALLHiC/scripts/classify.pl line 55, <IN> line 14418.

I thought it was a formatting problem, but I can't spot anything obvious in my input files.

I formatted the CDS files according to the tutorials, with names identical to those in the GFF:
target gff3:

Flye15k_scaffold_584    Ssal_COMBO_polished     gene    640018  642089  .       -       .       ID=ENSSSAT00000004279_1_path1;Name=ENSSSAT00000004279.1

target cds

>ENSSSAT00000004279_1_path1

ref gff3

ssa29   ensembl gene    42450663        42460615        .       -       .       ID=gene_ENSSSAG00000067226;biotype=protein_coding;gene_id=ENSSSAG00000067226;logic_name=ensembl;version=1

ref cds

>gene_ENSSSAG00000067226

after blastn_parse.pl Eblast.out

gene_ENSSSAG00000004085 ENSSSAT00000008969_1_path1      97.113  381     11      0       1       381     1       381     0.0     643
gene_ENSSSAG00000004089 ENSSSAT00000075924_1_path1      99.383  324     2       0       1       324     1       324     4.68e-167       588

Is there a problem using underscores as gene names?
thanks,
Michel

Allele.ctg.table format ?

How to create Allele.ctg.table file ?

➜  ALLHiC_assembly ~/Tools/ALLHiC/bin/ALLHiC_prune -i Allele.ctg.table -b sample.clean.bam -r A_and_B.fa 
Cannot open Allele.ctg.table

Are there other methods to speed up the 'bwa sampe' step?

Hi,
Thanks for sharing this program. I am trying to use ALLHiC to help with my genome assembly; however, the first step, mapping Hi-C reads to the draft assembly, has been running for almost a week and is still not finished. The bottleneck is bwa sampe, which has been running for 5 days and is generating the .sam file slowly. Is that OK? Are there more efficient ways to speed up this step?

Best,
WTarabi

How should I set cds names when I generate Allele.ctg.table?

Hello,

Thank you so much for designing such excellent software for Hi-C assembly! I am a beginner in genome assembly. I am about to generate the Allele.ctg.table, but I am confused by the guidance saying "Please modify cds name before running BLAST. The cds name should be same with gene name present in GFF3". Do you mean that the ">" line in cds.fasta should be exactly the same as the original GFF3 gene name before running BLAST? Is the gene ID OK? Can you give me an example? Thanks a lot!
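
One way to check the naming requirement quoted above before running BLAST is to compare the FASTA headers of the CDS file with the IDs of the gene features in the GFF3 file. The sketch below assumes that "gene name present in GFF3" refers to the ID= attribute of rows whose third column is "gene" (as in the examples shown in the classify.pl issue); that interpretation and the function names are assumptions, not something the ALLHiC documentation confirms here.

```python
def fasta_ids(fasta_lines):
    """Collect record IDs: the first token after '>' on each header line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def gff3_gene_ids(gff_lines):
    """Collect the ID= attribute of every 'gene' feature in tab-separated GFF3."""
    ids = set()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        for attr in cols[8].split(";"):
            key, _, value = attr.partition("=")
            if key == "ID":
                ids.add(value)
    return ids


def unmatched_cds(fasta_lines, gff_lines):
    """Return the sorted CDS names that have no gene with the same ID."""
    return sorted(fasta_ids(fasta_lines) - gff3_gene_ids(gff_lines))
```

An empty list from `unmatched_cds` means every CDS header has a gene feature with an identical ID.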

Questions about multiple libraries and no closely related chromosome-scale assembly

Dear Dr. Zhang,

This is Xiaofei Yang, from Xi'an Jiaotong University. First of all, thanks for your tool ALLHiC.
I am trying to use it to correct our draft assembly. We have sequenced six Hi-C libraries. Should I merge the results (the BAM files after bwa mem alignment, using samtools merge) into a single BAM file before running PreprocessSAMs.pl? Or do you have another solution?

Another question: ours is a new genome, so we cannot find a closely related chromosome-scale assembly. How should we do the prune step?

Best
Xiaofei

The algorithm details about the ordering and orientations for contigs

hi,
I am very interested in the algorithmic details of contig ordering and orientation. However, allhic is distributed as a binary. I would like to see readable source code to understand how the ordering and orientation process is implemented.
I would be very grateful if I can get your reply.
Thanks.

Majority of contigs in the first partition

Hi @tangerzhang ,

I am working on a highly heterozygous diploid genome. We used long reads in the assembly, and the assembly has been corrected for misassemblies. We mapped the Hi-C reads, and the ALLHiC results showed that more than 40% of all contigs were placed in the first partition. Is there somewhere in the log or intermediate files where I can see why that is happening? Do you have any suggestions for how to fix this?
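
To quantify how unbalanced the groups are, the per-group contig counts can be turned into fractions. The sketch below assumes a log containing lines of the form "Number of original contigs in group3: 822", which is the format quoted in the rescue-stage issue earlier on this page; whether your own partition log contains such lines is an assumption you would need to check, and group_shares is a hypothetical helper.

```python
import re

# Matches lines such as "Number of original contigs in group3: 822".
GROUP_LINE = re.compile(r"Number of original contigs in (\S+): (\d+)")


def group_shares(log_text):
    """Return {group_name: fraction_of_all_counted_contigs}."""
    counts = {name: int(n) for name, n in GROUP_LINE.findall(log_text)}
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


def largest_group(log_text):
    """Return (group_name, share) for the biggest group, or None if no data."""
    shares = group_shares(log_text)
    if not shares:
        return None
    name = max(shares, key=shares.get)
    return name, shares[name]
```

A share far above 1/K for a single group is the kind of imbalance described in this issue; the code only measures it and does not explain the cause.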

allhic build run error

Hi there,

I got a run time error from building fasta step, details are below:

12:06:45 getFastaSizes | NOTICE  Parse FASTA file `MR_H.fasta`
12:06:48 mergeTours | NOTICE  Import `REduced.paired_only.srt.counts_GATC.12g10.tour` => g1
12:06:48 parseTourFile | NOTICE  Parse tour file `REduced.paired_only.srt.counts_GATC.12g10.tour`
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x881785]

goroutine 1 [running]:
github.com/shenwei356/bio/seq.(*Seq).Length(...)
	/Users/bao/go/src/github.com/shenwei356/bio/seq/seq.go:75
_/Users/bao/code/allhic.(*OO).parseLastTour(0xc0000576e8, 0x7fff3eaa2e54, 0x46, 0xc038f95290, 0x2)
	/Users/bao/code/allhic/build.go:155 +0xf5
_/Users/bao/code/allhic.(*OO).mergeTours(0xc0000576e8, 0xc0000894f0, 0x1, 0x1)
	/Users/bao/code/allhic/build.go:136 +0x230
_/Users/bao/code/allhic.(*Builder).Run(0xc000057788)
	/Users/bao/code/allhic/build.go:124 +0x80
main.main.func5(0xc000110c60, 0xc000098f00, 0xc000110c60)
	/Users/bao/code/allhic/cmd/allhic.go:279 +0x234
github.com/urfave/cli.HandleAction(0x90f200, 0x9da348, 0xc000110c60, 0x0, 0xc000098f00)
	/Users/bao/go/src/github.com/urfave/cli/app.go:501 +0xc8
github.com/urfave/cli.Command.Run(0x9b70f6, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, 0x9bcac5, 0x14, 0x9d048a, ...)
	/Users/bao/go/src/github.com/urfave/cli/command.go:165 +0x459
github.com/urfave/cli.(*App).Run(0xc000124380, 0xc00009a000, 0x5, 0x5, 0x0, 0x0)
	/Users/bao/go/src/github.com/urfave/cli/app.go:259 +0x6bb
main.main()
	/Users/bao/code/allhic/cmd/allhic.go:400 +0xb31

There are no issues on the server side. Any ideas?

Thank you!
Chen

Problems with ALLHiC

Hi, I ran into a problem when following the protocol. I finished generating the SAM file from the fq1 and fq2 files; it is about 100 GB. But when I ran PreprocessSAMs.pl, it reported no errors and produced no results. The generated BAM file is only 1.83 KB, and all values in sample.bwa_aln.REduced.paired_only.flagstat are zero. Can anybody help me resolve this problem? Thanks!

Skipping ALLHiC_prune without Allele.ctg.table

Hi @tanghaibao and @tangerzhang ,
I am working on a de novo plant genome and I can't create Allele.ctg.table. In this case, would the following workflow be correct?

PreprocessSAMs.pl sample.bwa_aln.sam draft.asm.fasta MBOI
filterBAM_forHiC.pl sample.bwa_aln.REduced.paired_only.bam sample.clean.sam  
samtools view -bt draft.asm.fasta.fai sample.clean.sam > sample.clean.bam 
ALLHiC_partition -b sample.clean.bam -r draft.asm.fasta -e AAGCTT -k 16
ALLHiC_rescue -b sample.clean.bam -r draft.asm.fasta -c clusters.txt -i counts_AAGCTT.txt
allhic extract sample.clean.bam draft.asm.fasta --RE AAGCTT
for K in {1..16};do allhic optimize group${K}.txt sample.clean.clm;done
ALLHiC_build draft.asm.fasta

Thank you in advance,

Michal

bwa mem

Can I use bwa mem instead of bwa aln to align reads to the genome?

Allele.ctg.table for genome with no annotation

Hi,

I have a fresh, highly polymorphic de novo assembly, and I am not planning to annotate it before I phase the genome. Is there a way to create Allele.ctg.table without annotation (CDS and GFF3) files?

BR,
Pezhman

assembly workflow using falcon-phase output and ALLHiC

Hi, @tanghaibao and @tangerzhang ,

Thanks for developing ALLHiC, a tool in urgent need and handy to use.

From your expertise, could you please kindly provide some advice about my assembly workflow?

I am doing a genome assembly (two separate nuclei in one cell and each nucleus with a set of chromosomes) and my aim is to generate a chromosome-level assembly which is fully phased. So I am actually expecting 2 fasta assembly files, each for one nucleus.

Now I have phased.0.fa and phased.1.fa for the diploid genome generated by falcon-phase, and hic data, and I am wondering what's the best approach to get the final 2 fasta assembly files for each nucleus?

Shall I just use phased.0.fa for 3d-dna for misjoin correction, then allhic for scaffolding and do the same thing for phased.1.fa separately?

My concern is that for phased.0.fa, although each contig is fully phased (which means all bases within the contig come from one nucleus), two contigs may come from two different nuclei.
Same problem for phased.1.fa.
So if I used phased.0.fa and phased.1.fa separately, I won't be able to get chromosome sequence fully phased.

To get 2 FASTA assembly files, one for each nucleus, I have the following questions:

1. Would it be better, or even possible, to concatenate phased.0.fa and phased.1.fa and then run 3d-dna misjoin correction and allhic?
2. Will 3d-dna work on two concatenated copies of the genome? If so, what would be the recommended 3d-dna parameter settings that go well with downstream allhic?
3. For allhic, is an Allele.ctg.table required, since we now have two copies in the input?

Looking forward to your advice

Best wishes

Jeannie

Cannot open bamfile `sample.clean.bam` (sam: reference already used)

I was using ALLHiC on a haploid genome, so I skipped the prune step. But when I ran the ALLHiC_partition command, I got this error.

Can I use a scaffold-level draft assembly for ALLHiC?

Dear community,

I understand that the draft assembly used as input for ALLHiC scaffolding is contig-level. Is it possible to use a scaffold-level assembly as input instead? I have a draft assembly from the Flye assembler with 223 contigs (N50 = 12 Mb), and the estimated genome size is 500 Mb. Flye also provides a scaffolding step that orders contigs into scaffolds based on the repeat graph structure. After that step I have an assembly of 193 fragments (30 scaffolds and 163 contigs, with 100 Ns in each gap within a scaffold).
So, can I use these 193 fragments for scaffolding with ALLHiC? Or is there any other advice on what I should do?
Thank you for the help.

Typical assembly workflow for ALLHiC scaffolding

Hi @tangerzhang ,

I have a highly heterozygous polyploid genome to assemble. After reading the ALLHiC manuscript and the issues on GitHub, I want to confirm the typical workflow for an ALLHiC scaffolding assembly. Here are my thoughts. Do you think any steps need to be added or corrected?

1. Use Canu to assemble allelic contigs from PB / ONT reads (100x of the polyploid genome size). Should the Canu parameters be the canu-poly settings or something else?
2. Run purge_haplotigs to remove alternative haplotigs.
3. Use 10x, BioNano, BAC data etc. to improve the contig assembly.
4. Run the 3D-DNA pipeline to correct misassemblies in the contigs.
5. Perform repeat masking and gene annotation on the contig assembly.
6. Get the allelic table by blastn against the progenitor diploid genome species. (If there are two progenitor diploid genomes, do I need to align to each genome to get two allelic tables?)
7. Run the ALLHiC pipeline.

Best,
Zhigui

Length mismatch after Pilon

Dear Xingtan,
After I ran Pilon for error correction, I got an error like this.
It looks like allhic extract cannot read my FASTA lengths. Do you have any solutions?

Thanks.

Terminating ... no more ACCEPT

Hi, @tanghaibao and @tangerzhang ,
I used ParaFly to run optimize for ordering and orientation, and in the log file I found a strange "Terminating" message.
16:49:58 flipLog | NOTICE  FLIPONE (400/562): -420786031.95846 => -420819527.78346 REJECT
16:50:00 flipLog | NOTICE  FLIPONE (450/562): -420786031.95846 => -420786031.95846 REJECT
16:50:02 flipLog | NOTICE  FLIPONE (500/562): -420786031.95846 => -420786031.95846 REJECT
16:50:04 flipLog | NOTICE  FLIPONE (550/562): -420786031.95846 => -420786031.95846 REJECT
16:50:05 flipOne | NOTICE  FLIPONE: N_accepts=0 N_rejects=562
16:50:05 Run | NOTICE  Terminating ... no more ACCEPT

How can I resolve this "Terminating ... no more ACCEPT" message?

Thank you in advance,

Best
He

Can we create .hic and .assembly files from groups.agp and groups.asm.fasta?

Hi,
At the end of the Build step, we get both groups.agp and groups.asm.fasta. Is it possible to generate the .hic and .assembly files that Juicebox takes as input? If we can create these files, we can load them and further correct possible assembly errors with Juicebox Assembly Tools.
Thanks

Died at ALLHiC/bin/ALLHiC_build line 12

Hi,
When I used ALLHiC to assemble a draft genome, I got the following error:
1. tour format to agp ...
Died at /ALLHiC/bin/ALLHiC_build line 12.
I followed this https://github.com/tangerzhang/ALLHiC/wiki/ALLHiC:-scaffolding-of-a-simple-diploid-genome.
Could you help me to solve this? Thank you very much!
wangzhennan

How to determine the K value for an autopolyploid genome?

Hello,
I am assembling an autopolyploid genome. After colchicine treatment, the 28 chromosomes doubled to 56. Is K=28 the right value?

ALLHiC on diploid draft genome questions

Hi,

I am currently using ALLHiC on a diploid draft assembly, and I have two questions:

1. Is the number of predefined groups (k) in the ALLHiC partition step the same as the number of chromosomes in the genome? For example, with 12 chromosomes, is it k=12, or is it 2n, i.e. k=24?
2. I also see that the number of single-copy BUSCOs has increased in the ALLHiC-scaffolded assembly compared to the draft assembly. I expected the change to be minimal. Is this normal?

Can this process be applied to diploid plants?

Hi, @tanghaibao and @tangerzhang ,

From the descriptions, this pipeline seems to work well, and I am wondering whether it can be applied to diploid plants. If so, could you give me some advice on running it?

Best,
He

can the

the group parameter

Dear Dr. Zhang,
I have tried ALLHiC to scaffold our genome with Hi-C data. We skipped the ALLHiC_prune and ALLHiC_rescue steps since our genome is diploid.
The ALLHiC_partition step has a parameter -k, which sets the number of groups. Should the group number be the same as the chromosome number? For example, our genome has 20 chromosomes; should we set k to 20? Or do you have other suggestions for setting the k value?

In my current project, I found that the final group.asm.fasta has 20 main scaffolds and a lot of contigs. Also, can I consider group.asm.fasta to be the final assembled genome?

Best
Xiaofei

superscaffolds

Could you please explain in a bit more detail what input the script ALLHiC/scripts/link_superscaffold.pl expects?
Thanks
Diego

download testdata problem

Hi, I can't connect to the host 59.79.232.12. Could you help me? Here is what I did:
I used FileZilla to connect, with the host set to ftp://59.79.232.12/, the username AllHiCData, the password allhic and port 21.

Format Allele.ctg.table

Could you please provide a full description of Allele.ctg.table?

I have assembled a highly polymorphic diploid species, and used haploMerger2 to separate the two haplotypes. I have a file that tracks which contigs are alternative versions of each other. I think I can use that to generate Allele.ctg.table, I just need a good description of this file. Will appreciate your help. Thanks.

Something wrong about "Optimize" steps

Hello,
Thanks for this great Hi-C pipeline. A month ago, I had trouble with the "Optimize" step taking too much time. You suggested checking the data format, the command lines and the log file for any reported errors. I reran it, but it has again used over 400 hours of CPU time and is still running!

So I wonder whether it takes this long because the number of Hi-C contacts is too low, or whether my Hi-C data quality is simply too poor.

Here are my Hi-C statistics from Juicer:

 Sequenced Read Pairs:  441,811,039
 Normal Paired: 199,286,377 (45.11%)
 Chimeric Paired: 25,363,801 (5.74%)
 Chimeric Ambiguous: 153,132,631 (34.66%)
 Unmapped: 64,028,230 (14.49%)
 Ligation Motif Present: 174,921,480 (39.59%)
Alignable (Normal+Chimeric Paired): 224,650,178 (50.85%)
Unique Reads: 62,625,023 (14.17%)
PCR Duplicates: 161,848,665 (36.63%)
Optical Duplicates: 176,490 (0.04%)

Library Complexity Estimate: 64,629,613
Intra-fragment Reads: 21,622,234 (4.89% / 34.53%)
Below MAPQ Threshold: 17,483,057 (3.96% / 27.92%)
Hi-C Contacts: 23,519,732 (5.32% / 37.56%)
 Ligation Motif Present: 5,254,662  (1.19% / 8.39%)
 3' Bias (Long Range): 67% - 33%
 Pair Type %(L-I-O-R): 25% - 27% - 24% - 24%
Inter-chromosomal: 5,968,399  (1.35% / 9.53%)
Intra-chromosomal: 17,551,333  (3.97% / 28.03%)
Short Range (<20Kb): 17,151,373  (3.88% / 27.39%)
Long Range (>20Kb): 399,954  (0.09% / 0.64%)

Thank you so much.
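
As a quick check on the numbers above, the share of long-range contacts can be recomputed directly from the reported counts. This is plain arithmetic on the values in the Juicer report; it does not by itself show that the low share is what slows down the "Optimize" step.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Counts copied from the Juicer statistics above.
sequenced_pairs = 441_811_039
hic_contacts = 23_519_732
long_range = 399_954  # Long Range (>20Kb)

print(f"Hi-C contacts / sequenced pairs: {percent(hic_contacts, sequenced_pairs):.2f}%")
print(f"Long range / sequenced pairs:    {percent(long_range, sequenced_pairs):.2f}%")
print(f"Long range / Hi-C contacts:      {percent(long_range, hic_contacts):.2f}%")
```

The first two values reproduce the 5.32% and 0.09% printed in the report; the third expresses the long-range contacts relative to all Hi-C contacts rather than to all sequenced pairs.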

enzyme_sites problem

Hi,
In the "Filtering SAM file" step (PreprocessSAMs.pl sample.bwa_aln.sam draft.asm.fasta MBOI), you use the MBOI enzyme.
But in the Partition step (ALLHiC_partition -b prunning.bam -r draft.asm.fasta -e AAGCTT -k 16), why do you use the -e AAGCTT parameter?
I got errors when I followed your commands; when I used -e GATC instead, it worked.
However, in the final result there seem to be some mistakes in group15 and group16.
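
The two commands in this issue refer to different enzymes: MboI recognizes GATC, while AAGCTT is the HindIII site. A small lookup can catch this kind of mismatch before a run. The sketch below covers only these two enzymes; the dictionary keys are spelled the way "MBOI" appears above (the "HINDIII" spelling is an assumption), and the helper names are hypothetical rather than part of ALLHiC.

```python
# Recognition sites for the two enzymes discussed in this issue.
RECOGNITION_SITES = {
    "MBOI": "GATC",
    "HINDIII": "AAGCTT",
}


def site_for(enzyme_name):
    """Return the recognition site for an enzyme name (case-insensitive)."""
    return RECOGNITION_SITES[enzyme_name.upper()]


def is_consistent(enzyme_name, motif):
    """True if the motif passed to -e matches the enzyme used for filtering."""
    return site_for(enzyme_name) == motif.upper()


print(is_consistent("MBOI", "AAGCTT"))  # the combination from the two commands above
print(is_consistent("MBOI", "GATC"))    # the combination that worked for the poster
```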
