dfguan / asset Goto Github PK
View Code? Open in Web Editor NEWassembly evaluation tool
assembly evaluation tool
Dear @dfguan,
I don't find the col_conts script in the asset webpage, do you have any clue where should I look at?
Thanks a lot!
Marc
Hi Dengfeng,
I am trying to evaluate a 10X Chromium assembly done with SUPERNOVA.
I can get the gap.bed file, but the following never finished:
#####################################################
bwa index Andean12nov19_SUPERNOVA_output_pseudohap2.1.fasta
while read -r r1 r2
do
prefix=basename $r1 .fastq.gz
/home/sgod/GenomicsPrograms/asset/bin/10x -c -p $prefix $r1 $r2 # generate trimmed read files $prefix_{1,2}.fq.gz
bwa mem -t12 Andean12nov19_SUPERNOVA_output_pseudohap2.1.fasta $prefix_1.fq.gz $prefix_2.fq.gz | samtools view -b - > $prefix.bam
done < $10xlist
/home/sgod/GenomicsPrograms/asset/bin/ast_10x gaps.bed $bam1 $bam2 $bam3 ... >10x.bed 2>ast_10x.log
#####################################################
My 10Xlist is a file with the directory and 4 fastq files from the original 10X sequencing (R1 and R2 X 2 lanes)
Is there something simple I am missing in the syntax?
Thank you for your input!
Sher
Hi Dengfeng, @dfguan
I want to evaluate an assembly (not scaffolded yet) with a HiC data. It seems that using ast_hic
I should be able to find the reliable blocks. I used the default parameters for ast_hic
Usage: ast_hic [options] <GAP_BED> <BAM_FILEs>
Options:
-c INT minimum coverage [7]
-C INT maximum coverage [inf]
-q INT minimum alignment quality [0]
-L INT maximum insertion length, gap excluded [15000]
-h help
So I expected a block to be present in the output BED file if at least 7 paired reads are spanning over it and those reads are having an insertion length less than 15Kb. I tried to investigate the output BED file for some regions manually. I found some regions that I couldn't explain why they are not present in the BED file.
Here is what I did:
detgaps asm.fa > gaps.bed
ast_hic gaps.bed hic.bam > hic.bed
The assembly and the bam file are attached HiC_asset.zip and you can reproduce the BED file. ( I reduced the whole assembly and bam file to just one contig for simplicity )
The first lines of the BED file is as below:
track name="HC" description="hic data"
h1tg000188l 302 205890
h1tg000188l 206041 206104
h1tg000188l 206177 206215
h1tg000188l 206234 215904
As you can see the locus 205900 is not included so there should be a problem with this locus. But It seems that the reads spanning 205880 (included) and 205900 (not included) are exactly the same.
To find the spanning reads I ran the line below once for pos=205900 and once for pos=205880 :
samtools view hic.bam | awk '{if (($4 < pos) && ($8 > pos)) print $0 }'
For either of them I got the same 7 paired reads (all of them are having insertion lengths less than 15Kb):
A00675:75:H5CNNDSXY:1:1166:22761:14544 81 h1tg000188l 192537 19 10S128M13S = 206017 13354 AATATCAAAGTGTAATCCCAGCACTTTTGGAGGCCAAGGCAAGAGGATCACAAGGTGAGGAGATCAAGACCATCCTGGCCAATACAGTGAAACGCTGTCACTATTAAAAATCCAAAAAATTAGCTAGGCATGGTGGCAAGGTTCGTCCATC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF::,FFFFFFFFFFFFFFFFF:FFFFFFFFFFFF NM:i:0 MD:Z:128 MC:Z:39S112M AS:i:128 XS:i:117 XA:Z:h1tg000132l,-2800693,10S128M13S,1;h1tg000188l,-185735,10S128M13S,1;h1tg000132l,-2807711,10S128M13S,1;h1tg000144l,-2327679,10S128M13S,2;
A00675:75:H5CNNDSXY:2:1308:13982:24298 145 h1tg000188l 197070 0 50M4I1M4D92M3S = 206108 8893 TACCCAAAATATGTATAATATACTGTACATAAATTATGGAAGTACCCAAAGATTCATAATCAACTGTACATAAAATAACAAAGTACTCAAACTATATATTCTATACAGTACATAAAATATCAAAGTACCCATACTGTATATTATATACTG FFFFFFFFFF,FFFFFFFFF:FFFFFFFFFFFFF:F,FFFFFFFFFF::FF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:12 MD:Z:38A12^TTTT5A17T52A15 MC:Z:48S66M34S AS:i:79 XS:i:78
A00675:75:H5CNNDSXY:2:1645:30481:33207 2177 h1tg000188l 199232 0 94H57M = 206104 6873 CAAAATATGTATAATATACTGTACATAAATTATGAAAGTAACCAAACATTTATAATA ,FFF:,FF,,F,FFF:FFFF,FFFFFFFFF:F:FFFF,FF,FFFFF,:F,:FFFF:F NM:i:1 MD:Z:40C16 MC:Z:5H88M58H AS:i:46 XS:i:46 SA:Z:h1tg000006l,78306568,+,71M80S,49,1;
A00675:75:H5CNNDSXY:2:2671:9109:8876 2129 h1tg000188l 200829 0 106H32M13H = 211484 10625 ACTGTACATAATATATTAAAGTCGCCTAAATA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:32 MC:Z:151M AS:i:32 XS:i:26 SA:Z:h1tg000157l,793787,+,95S56M,0,0;h1tg000188l,235190,-,55S62M34S,0,1;
A00675:75:H5CNNDSXY:3:2152:14000:15969 113 h1tg000188l 202107 60 82M69S = 211218 9167 TATGTATTATATATTGTACATAAAATATCAAACTATGCCAAATATATATTATATACTGTACATAAAATATCAAATTACCCAATATGAAGCTTTTATTAAAAATTCTAGGAAGTGAATACTAATCTATAGTGAAAAAGCAGACTAGGAAGAC FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:82 MC:Z:14S137M AS:i:82 XS:i:54 SA:Z:h1tg000144l,5760537,+,70M81S,0,0;
A00675:75:H5CNNDSXY:4:1214:15637:20964 2161 h1tg000188l 203773 0 30M118H = 207353 3674 TACATAAAATATCAAAGTACCCAAACTTCA FFFFFF,FFFFFFFFFFFFFFFFFFFF:FF NM:i:0MD:Z:30 MC:Z:28H100M1D1M1I21M AS:i:30 XS:i:28 SA:Z:h1tg000188l,207450,+,89M59S,0,0;
A00675:75:H5CNNDSXY:4:1618:3486:17018 161 h1tg000188l 204379 43 151M = 208464 4213 TCAAATATATATATTATTCTGTACATAAAATATCACATTACACCATATATATATTATATACTGTACATAAAATATCAAAGTACCCAAAATATGTATAATATACTGTATATAAATTATGAAAGTACCCAAACCTTTATAATAAAGTGTACAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFF:FFF:FF:FFFF:F NM:i:0 MD:Z:151 MC:Z:23S128M AS:i:151 XS:i:131 XA:Z:h1tg000188l,+199979,131M20S,0;h1tg000132l,+2822475,4S147M,2;
I also found these lines in HC.cov.base
indicating a reduction in the coverage (from 7 to 6) at 205890. But I couldn't figure out why? Could you please explain what's happening at 205890 that makes ast_hic
truncate the BED file there?
205863 205890 7
205891 205962 6
Thanks in advance,
Mobin
Hi Dengfeng,
My output with pchlst_final.bed file looks strange. Could you take a look at my workflow to ensure I have run it correctly? I've also attached the outputs
MY WORKFLOW WITH ONT AND HiC data only:
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/detgaps gapclosed.fasta > gaps.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/split_fa gapclosed.fasta > split.fa
samtools faidx split.fa
bwa index split.fa
bwa mem -SP -B10 -t 18 split.fa /nesi/project/landcare02730/HiC_Data/Rewa_trimgalore_output/Rewa_trimgalore_adaptortrim/Rewarewa_HiC_R1_trimmed_val_1.fq /nesi/project/landcare02730/HiC_Data/Rewa_trimgalore_output/Rewa_trimgalore_adaptortrim/Rewarewa_HiC_R2_trimmed_val_2.fq | samtools view -b - > asset_aln.bam
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/col_conts asset_aln.bam > links.mat
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/ast_hic2 split.fa.fai links.mat > hic2.bed 2>ast_hic.log
minimap2 -x map-ont -t 18 *.fasta /nesi/nobackup/uoa02642/Rewa_Assemblies/rewarewa_rebasecalled.fasta > ont.paf
/nesi/nobackup/uoa02642/Rewa_Assemblies/purge_dups/src/pbcstat ont.paf
/nesi/nobackup/uoa02642/Rewa_Assemblies/purge_dups/src/calcuts PB.stat > ont.cutoffs 2>calcults.log
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/ast_pb -m 5 -M 252 ont.paf > ont.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/acc gaps.bed hic2.bed ont.bed > hic2_ont.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/pchlst gaps.bed hic2_ont.bed > pchlst_scaf.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/pchlst -c gaps.bed hic2_ont.bed > pchlst_ctg.bed
/nesi/nobackup/uoa02642/Rewa_Assemblies/asset/bin/union_brks gaps.bed pchlst_scaf.bed pchlst_ctg.bed > pchlst_final.bed
Thanks for all the help and patience with this. Very much appreciated.
Hi,
I got stuck with the 10x steps of the Asset pipeline. All other parts of the pipeline (i.e. identifying gaps and ends, PacBio support, HiC support) work well.
However, when I get to the
$asset/bin/ast_10x -x gaps.bed $prefix.bam > 10x.bed
step, I get the following error message:
Program starts
[M::aa_10x] processing bam file
[M::proc_bam] finish processing 2121039345 bams
[M::proc_bam] read pair counter: 1000860477
[W::aa_10x] none useful information, quit
As I have a (admittingly unnecessary) large number of 10x reads, my $prefix.bam
file is 260GB large. Also, what I saw is that in the gaps.bed
file, all scaffolds have a gap at the first (0 0) and last position. Is that normal?
Best,
Oliver
Hello,
I'm using your tool to evaluate assemblies produced with PacBio HiFi reads, 10x reads, and Omni-C reads.
Can you provide descriptions of the expected output columns in the various bed files?
My final punchlist bed file has only "C"s present in the final column, which I assume means "conflict". I don't have any "S", which I'm assuming means "supported".
If I knew what the upstream file fields are, I can possibly troubleshoot there.
Thanks!
Kim
Hi there,
I generated an assembly using hifi reads along with hi-c data but there were no additional data e.g. 10x or bn. The assembly is of high quality with N50 of 220 Mb and is free of gaps. I tried the following codes to check mis-joins and mis-assemblies but the pchlst_scaf.bed is empty. Accordingly, the output is not reasonable and there are many C in the final results.
./bin/acc gaps.bed pb.bed > pb_acc.bed
./bin/pchlst -c gaps.bed pb_acc.bed > pchlst_ctg.bed
./bin/acc gaps.bed pb.bed hic2.bed > hic_acc.bed
./bin/pchlst -c gaps.bed hic_acc.bed > pchlst_scaf.bed
./bin/union_brks gaps.bed pchlst_{ctg,scaf}.bed > pchlst_final.bed
Can you please see the codes and let me know if there is any mistake.
I have 10x reads from another individual. This individual is related to main sample that was used for assembly. I wonder if it is possible to use sequences from another sample to asses the genome assembly.
Thank you
I got a error when I have runned make,below this error.
gcc -g -Wall -D VERBOSE -O2 -D PRINT_COVERAGE -c -o paf.o paf.c
paf.c:1:10: fatal error: zlib.h: No such file or directory
#include <zlib.h>
^~~~~~~~
compilation terminated.
make: *** [paf.o] Error 1
my zlib.h path is /usr/include/zlib.h,but i dont know how to fix this problem,i hope you can help me ,thank you!!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.