dfguan / asset Goto Github PK

View Code? Open in Web Editor NEW

33.0 33.0 6.0 199 KB

assembly evaluation tool

Python 8.39% C 90.83% Makefile 0.78%

asset's People

Contributors

Stargazers

Watchers

Forkers

kant arangrhie yuzhenpeng pythseq ethering china-acadamy-ioz

asset's Issues

bin/col_conts don't find the script

Dear @dfguan,

I don't find the col_conts script in the asset webpage, do you have any clue where should I look at?

Thanks a lot!

Marc

pchlst_final.bed Result

in final line ,what`s means of the "S" and "C" ?

10X Chromium pipeline

Hi Dengfeng,

I am trying to evaluate a 10X Chromium assembly done with SUPERNOVA.

I can get the gap.bed file, but the following never finished:
#####################################################

bwa index Andean12nov19_SUPERNOVA_output_pseudohap2.1.fasta
while read -r r1 r2
do
prefix=basename $r1 .fastq.gz
/home/sgod/GenomicsPrograms/asset/bin/10x -c -p $prefix $r1 $r2 # generate trimmed read files $prefix_{1,2}.fq.gz
bwa mem -t12 Andean12nov19_SUPERNOVA_output_pseudohap2.1.fasta $prefix_1.fq.gz $prefix_2.fq.gz | samtools view -b - > $prefix.bam
done < $10xlist

/home/sgod/GenomicsPrograms/asset/bin/ast_10x gaps.bed $bam1 $bam2 $bam3 ... >10x.bed 2>ast_10x.log

#####################################################
My 10Xlist is a file with the directory and 4 fastq files from the original 10X sequencing (R1 and R2 X 2 lanes)

Is there something simple I am missing in the syntax?

Thank you for your input!
Sher

Output of ast_hic, Unexpected Truncation

Hi Dengfeng, @dfguan

I want to evaluate an assembly (not scaffolded yet) with a HiC data. It seems that using ast_hic I should be able to find the reliable blocks. I used the default parameters for ast_hic

Usage: ast_hic [options] <GAP_BED> <BAM_FILEs>
Options:
         -c    INT      minimum coverage [7]
         -C    INT      maximum coverage [inf]
         -q    INT      minimum alignment quality [0]
         -L    INT      maximum insertion length, gap excluded [15000]
         -h             help

So I expected a block to be present in the output BED file if at least 7 paired reads are spanning over it and those reads are having an insertion length less than 15Kb. I tried to investigate the output BED file for some regions manually. I found some regions that I couldn't explain why they are not present in the BED file.

Here is what I did:

detgaps asm.fa > gaps.bed
ast_hic gaps.bed hic.bam > hic.bed

The assembly and the bam file are attached HiC_asset.zip and you can reproduce the BED file. ( I reduced the whole assembly and bam file to just one contig for simplicity )

The first lines of the BED file is as below:

track name="HC" description="hic data"
h1tg000188l	302	205890
h1tg000188l	206041	206104
h1tg000188l	206177	206215
h1tg000188l	206234	215904

As you can see the locus 205900 is not included so there should be a problem with this locus. But It seems that the reads spanning 205880 (included) and 205900 (not included) are exactly the same.

To find the spanning reads I ran the line below once for pos=205900 and once for pos=205880 :
samtools view hic.bam | awk '{if (($4 < pos) && ($8 > pos)) print $0 }'

For either of them I got the same 7 paired reads (all of them are having insertion lengths less than 15Kb):

A00675:75:H5CNNDSXY:1:1166:22761:14544	81	h1tg000188l	192537	19	10S128M13S	=	206017	13354	AATATCAAAGTGTAATCCCAGCACTTTTGGAGGCCAAGGCAAGAGGATCACAAGGTGAGGAGATCAAGACCATCCTGGCCAATACAGTGAAACGCTGTCACTATTAAAAATCCAAAAAATTAGCTAGGCATGGTGGCAAGGTTCGTCCATC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF::,FFFFFFFFFFFFFFFFF:FFFFFFFFFFFF	NM:i:0	MD:Z:128	MC:Z:39S112M	AS:i:128	XS:i:117	XA:Z:h1tg000132l,-2800693,10S128M13S,1;h1tg000188l,-185735,10S128M13S,1;h1tg000132l,-2807711,10S128M13S,1;h1tg000144l,-2327679,10S128M13S,2;
A00675:75:H5CNNDSXY:2:1308:13982:24298	145	h1tg000188l	197070	0	50M4I1M4D92M3S	=	206108	8893	TACCCAAAATATGTATAATATACTGTACATAAATTATGGAAGTACCCAAAGATTCATAATCAACTGTACATAAAATAACAAAGTACTCAAACTATATATTCTATACAGTACATAAAATATCAAAGTACCCATACTGTATATTATATACTG	FFFFFFFFFF,FFFFFFFFF:FFFFFFFFFFFFF:F,FFFFFFFFFF::FF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:12	MD:Z:38A12^TTTT5A17T52A15	MC:Z:48S66M34S	AS:i:79	XS:i:78
A00675:75:H5CNNDSXY:2:1645:30481:33207	2177	h1tg000188l	199232	0	94H57M	=	206104	6873	CAAAATATGTATAATATACTGTACATAAATTATGAAAGTAACCAAACATTTATAATA	,FFF:,FF,,F,FFF:FFFF,FFFFFFFFF:F:FFFF,FF,FFFFF,:F,:FFFF:F	NM:i:1	MD:Z:40C16	MC:Z:5H88M58H	AS:i:46	XS:i:46	SA:Z:h1tg000006l,78306568,+,71M80S,49,1;
A00675:75:H5CNNDSXY:2:2671:9109:8876	2129	h1tg000188l	200829	0	106H32M13H	=	211484	10625	ACTGTACATAATATATTAAAGTCGCCTAAATA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:32	MC:Z:151M	AS:i:32	XS:i:26	SA:Z:h1tg000157l,793787,+,95S56M,0,0;h1tg000188l,235190,-,55S62M34S,0,1;
A00675:75:H5CNNDSXY:3:2152:14000:15969	113	h1tg000188l	202107	60	82M69S	=	211218	9167	TATGTATTATATATTGTACATAAAATATCAAACTATGCCAAATATATATTATATACTGTACATAAAATATCAAATTACCCAATATGAAGCTTTTATTAAAAATTCTAGGAAGTGAATACTAATCTATAGTGAAAAAGCAGACTAGGAAGAC	FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:82	MC:Z:14S137M	AS:i:82	XS:i:54	SA:Z:h1tg000144l,5760537,+,70M81S,0,0;
A00675:75:H5CNNDSXY:4:1214:15637:20964	2161	h1tg000188l	203773	0	30M118H	=	207353	3674	TACATAAAATATCAAAGTACCCAAACTTCA	FFFFFF,FFFFFFFFFFFFFFFFFFFF:FF	NM:i:0MD:Z:30	MC:Z:28H100M1D1M1I21M	AS:i:30	XS:i:28	SA:Z:h1tg000188l,207450,+,89M59S,0,0;
A00675:75:H5CNNDSXY:4:1618:3486:17018	161	h1tg000188l	204379	43	151M	=	208464	4213	TCAAATATATATATTATTCTGTACATAAAATATCACATTACACCATATATATATTATATACTGTACATAAAATATCAAAGTACCCAAAATATGTATAATATACTGTATATAAATTATGAAAGTACCCAAACCTTTATAATAAAGTGTACAT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFF:FFF:FF:FFFF:F	NM:i:0	MD:Z:151	MC:Z:23S128M	AS:i:151	XS:i:131	XA:Z:h1tg000188l,+199979,131M20S,0;h1tg000132l,+2822475,4S147M,2;

I also found these lines in HC.cov.base indicating a reduction in the coverage (from 7 to 6) at 205890. But I couldn't figure out why? Could you please explain what's happening at 205890 that makes ast_hic truncate the BED file there?

205863	205890	7
205891	205962	6

Thanks in advance,
Mobin

ONT and HiC only

Hi Dengfeng,

My output with pchlst_final.bed file looks strange. Could you take a look at my workflow to ensure I have run it correctly? I've also attached the outputs

MY WORKFLOW WITH ONT AND HiC data only:

/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/detgaps gapclosed.fasta > gaps.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/split_fa gapclosed.fasta > split.fa
samtools faidx split.fa
bwa index split.fa
bwa mem -SP -B10 -t 18 split.fa /nesi/project/landcare02730/HiC_Data/Rewa_trimgalore_output/Rewa_trimgalore_adaptortrim/Rewarewa_HiC_R1_trimmed_val_1.fq /nesi/project/landcare02730/HiC_Data/Rewa_trimgalore_output/Rewa_trimgalore_adaptortrim/Rewarewa_HiC_R2_trimmed_val_2.fq | samtools view -b - > asset_aln.bam
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/col_conts asset_aln.bam > links.mat
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/ast_hic2 split.fa.fai links.mat > hic2.bed 2>ast_hic.log

minimap2 -x map-ont -t 18 *.fasta /nesi/nobackup/uoa02642/Rewa_Assemblies/rewarewa_rebasecalled.fasta > ont.paf
/nesi/nobackup/uoa02642/Rewa_Assemblies/purge_dups/src/pbcstat ont.paf
/nesi/nobackup/uoa02642/Rewa_Assemblies/purge_dups/src/calcuts PB.stat > ont.cutoffs 2>calcults.log
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/ast_pb -m 5 -M 252 ont.paf > ont.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/acc gaps.bed hic2.bed ont.bed > hic2_ont.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/pchlst gaps.bed hic2_ont.bed > pchlst_scaf.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/pchlst -c gaps.bed hic2_ont.bed > pchlst_ctg.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/union_brks gaps.bed pchlst_scaf.bed pchlst_ctg.bed > pchlst_final.bed

Thanks for all the help and patience with this. Very much appreciated.

A
Re__Additional_query.zip

ast_10x fails

Hi,

I got stuck with the 10x steps of the Asset pipeline. All other parts of the pipeline (i.e. identifying gaps and ends, PacBio support, HiC support) work well.

However, when I get to the
$asset/bin/ast_10x -x gaps.bed $prefix.bam > 10x.bed
step, I get the following error message:

Program starts
[M::aa_10x] processing bam file
[M::proc_bam] finish processing 2121039345 bams
[M::proc_bam] read pair counter: 1000860477
[W::aa_10x] none useful information, quit

As I have a (admittingly unnecessary) large number of 10x reads, my $prefix.bam file is 260GB large. Also, what I saw is that in the gaps.bed file, all scaffolds have a gap at the first (0 0) and last position. Is that normal?

Best,
Oliver

Add descriptions of expected output in .bed files

Hello,

I'm using your tool to evaluate assemblies produced with PacBio HiFi reads, 10x reads, and Omni-C reads.

Can you provide descriptions of the expected output columns in the various bed files?

My final punchlist bed file has only "C"s present in the final column, which I assume means "conflict". I don't have any "S", which I'm assuming means "supported".

If I knew what the upstream file fields are, I can possibly troubleshoot there.

Thanks!
Kim

Having only two sources of data

Hi there,
I generated an assembly using hifi reads along with hi-c data but there were no additional data e.g. 10x or bn. The assembly is of high quality with N50 of 220 Mb and is free of gaps. I tried the following codes to check mis-joins and mis-assemblies but the pchlst_scaf.bed is empty. Accordingly, the output is not reasonable and there are many C in the final results.

./bin/acc gaps.bed pb.bed > pb_acc.bed
./bin/pchlst -c gaps.bed pb_acc.bed > pchlst_ctg.bed

./bin/acc gaps.bed pb.bed hic2.bed > hic_acc.bed
./bin/pchlst -c gaps.bed hic_acc.bed > pchlst_scaf.bed

./bin/union_brks gaps.bed pchlst_{ctg,scaf}.bed > pchlst_final.bed

Can you please see the codes and let me know if there is any mistake.
I have 10x reads from another individual. This individual is related to main sample that was used for assembly. I wonder if it is possible to use sequences from another sample to asses the genome assembly.
Thank you

zlib.h: No such file or directory

I got a error when I have runned make，below this error.

gcc -g -Wall -D VERBOSE -O2 -D PRINT_COVERAGE -c -o paf.o paf.c
paf.c:1:10: fatal error: zlib.h: No such file or directory
#include <zlib.h>
^~~~~~~~
compilation terminated.
make: *** [paf.o] Error 1

my zlib.h path is /usr/include/zlib.h，but i dont know how to fix this problem，i hope you can help me ,thank you!!

dfguan / asset Goto Github PK

asset's People

Contributors

Stargazers

Watchers

Forkers

asset's Issues

bin/col_conts don't find the script

pchlst_final.bed Result

10X Chromium pipeline

Output of ast_hic, Unexpected Truncation

ONT and HiC only

ast_10x fails

Add descriptions of expected output in .bed files

Having only two sources of data

zlib.h: No such file or directory

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent