markhilt / arbitr
ARBitR: Assembly Refinement with Barcode-identity-tagged Reads
License: Other
Hi,
Could you provide more information on how to scaffold the FASTA files generated by a Supernova assembly?
Hi, I ran arbitr.py -i pilon.fasta pilon.bam -m 75000 -s 40000 -B 20 -F 70 -Q 60 -n 3 and it fails with the error below. Could you please help me? Thank you very much.
$ arbitr.py -i pilon.fasta pilon.bam -m 75000 -s 40000 -B 20 -F 70 -Q 60 -n 3
[Mon May 3 16:56:34 2021] Starting ARBitR.
[Mon May 3 16:56:35 2021] Collecting contigs.
[Mon May 3 16:56:35 2021] Collecting barcodes for linkgraph.
[Mon May 3 16:56:35 2021] Starting barcode collection. Found 262 contigs.
[Mon May 3 16:59:41 2021] [ BARCODE COLLECTION ] Completed: 100.0% (524 out of 524)
[Mon May 3 16:59:41 2021] Creating link graph.
Traceback (most recent call last):
File "/xxx/ARBitR/src/arbitr.py", line 250, in <module>
main()
File "/xxx/ARBitR/src/arbitr.py", line 191, in main
backbone_graph = graph_building.main(backbone_contig_lengths, \
File "/xxx/ARBitR/src/graph_building.py", line 507, in main
GEMcomparison = pairwise_comparisons(GEMlist)
File "/xxx/ARBitR/src/graph_building.py", line 390, in pairwise_comparisons
misc.printstatus("[ BARCODE COMPARISON ]\t" + misc.reportProgress(idx+1, len(GEMlist)))
File "/xxx/ARBitR/src/misc.py", line 29, in reportProgress
return "Completed: {0}% ({1} out of {2})".format( str(round( (current / total) * 100, 2)), current, total)
ZeroDivisionError: division by zero
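For what it's worth, this ZeroDivisionError means len(GEMlist) was 0 when the barcode comparison started, i.e. no barcode windows were collected at all. A common cause (an assumption on my part, not confirmed from this log) is a BAM whose reads lack the 10x BX barcode tags ARBitR expects. A minimal sketch of a guarded version of the formatting helper, so the empty case produces a clear message instead of a traceback:

```python
def report_progress(current, total):
    """Progress formatter that tolerates an empty work list.

    ARBitR's misc.reportProgress divides current by total; if no
    barcode windows were collected, total is 0 and the division
    raises ZeroDivisionError. Guarding the zero case surfaces the
    real problem (an empty GEMlist) instead of crashing.
    """
    if total == 0:
        return "Completed: nothing to do (0 items collected)"
    pct = round(current / total * 100, 2)
    return "Completed: {0}% ({1} out of {2})".format(pct, current, total)
```

Even with the guard, an empty GEMlist means the input should be checked for BX tags before rerunning.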
Dear all,
When I run ARBitR, I get an error that I have not yet been able to solve. After generating the sorted ".bam" and ".bai" files (genome.nextpolish.sorted.bam and genome.nextpolish.sorted.bam.bai), I ran the following command:
/okyanus/users/veldem/01.Direct_Projects/06.Anchovy_Genome_Projects/04.Genome_Scaffolding/02.arbitR_scaffolding/ARBitR-master/src/arbitr.py -i /okyanus/users/veldem/01.Direct_Projects/06.Anchovy_Genome_Projects/03.Genome_Polishing/01.NextPolish/NextPolish/genome.nextpolish.fa genome.nextpolish.sorted.bam
and got this error (after it had run for nearly an hour):
[Sun Apr 25 19:24:54 2021] Collecting contigs.
[Sun Apr 25 19:24:54 2021] Collecting barcodes for linkgraph.
[Sun Apr 25 19:24:54 2021] Starting barcode collection. Found 3949 contigs.
[Sun Apr 25 19:30:00 2021] [ BARCODE COLLECTION ] Completed: 100.0% (7898 out of 7898)
[Sun Apr 25 19:30:00 2021] Creating link graph.
[Sun Apr 25 19:55:45 2021] [ BARCODE COMPARISON ] Completed: 100.0% (7707 out of 7707)
[Sun Apr 25 19:55:45 2021] Number of windows: 7707
[Sun Apr 25 19:56:52 2021] [ BARCODE LINKING ] Completed: 100.0% (7707 out of 7707)
[Sun Apr 25 19:56:54 2021] Writing link graph to genome.nextpolish.sorted.ARBitR.backbone.gfa.
[Sun Apr 25 19:56:54 2021] Finding paths.
[Sun Apr 25 19:57:05 2021] Found 741 paths.
[Sun Apr 25 19:57:05 2021] Collecting barcodes from short contigs.
[Sun Apr 25 19:57:05 2021] Starting barcode collection. Found 7568 contigs.
[Sun Apr 25 20:00:32 2021] [ BARCODE COLLECTION ] Completed: 100.0% (7568 out of 7568)
[Sun Apr 25 20:20:17 2021] [ PATH FILLING ] Completed: 100.0% (741 out of 741)
[Sun Apr 25 20:20:17 2021] Found fasta file for merging: /okyanus/users/veldem/01.Direct_Projects/06.Anchovy_Genome_Projects/03.Genome_Polishing/01.NextPolish/NextPolish/genome.nextpolish.fa
[Sun Apr 25 20:20:17 2021] Trimming contig ends...
[E::fai_retrieve] Failed to retrieve block: unexpected end of file
[Sun Apr 25 20:22:27 2021] [ TRIMMING ] Completed: 100.0% (741 out of 741)
Traceback (most recent call last):
File "/okyanus/users/veldem/01.Direct_Projects/06.Anchovy_Genome_Projects/04.Genome_Scaffolding/02.arbitR_scaffolding/ARBitR-master/src/arbitr.py", line 250, in <module>
main()
File "/okyanus/users/veldem/01.Direct_Projects/06.Anchovy_Genome_Projects/04.Genome_Scaffolding/02.arbitR_scaffolding/ARBitR-master/src/arbitr.py", line 228, in main
bed = merge_fasta.main( args.input_fasta,
File "/okyanus/users/veldem/01.Direct_Projects/06.Anchovy_Genome_Projects/04.Genome_Scaffolding/02.arbitR_scaffolding/ARBitR-master/src/merge_fasta.py", line 1011, in main
trimmed_fasta[tig] = fastafile.fetch(reference=tig,
File "pysam/libcfaidx.pyx", line 319, in pysam.libcfaidx.FastaFile.fetch
ValueError: failure when retrieving sequence on 'ctg27_np12'
My python version: Python 3.8.8
Best wishes
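The "unexpected end of file" from fai_retrieve typically points to a stale .fai index: if the FASTA was modified after samtools/pysam indexed it, fetch() reads past a truncated block and fails exactly as in the traceback above. A stdlib-only sketch (a hypothetical helper, assuming a standard samtools-style .fai whose second column is the sequence length) for checking whether the index still matches the FASTA:

```python
def fai_matches_fasta(fasta_path):
    """Check that a samtools-style .fai index agrees with its FASTA.

    Counts each sequence's length from the FASTA itself and compares
    it against column 2 of the .fai. A mismatch indicates a stale
    index, which makes pysam's FastaFile.fetch() fail.
    """
    lengths = {}
    name = None
    with open(fasta_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    with open(fasta_path + ".fai") as fh:
        for line in fh:
            fields = line.split("\t")
            name, indexed_len = fields[0], int(fields[1])
            if lengths.get(name) != indexed_len:
                return False
    return True
```

If this returns False, deleting the .fai so that it is regenerated from the current FASTA is worth trying before rerunning.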
Hi @markhilt !
I am running ARBitR on a very large and quite fragmented genome assembly. I have gotten to the point where the program runs many processes in parallel:
[Wed Oct 14 18:45:43 2020] Trimming contig ends...
[Wed Oct 14 20:25:35 2020] [ TRIMMING ] Completed: 100.0% (24098 out of 24098)
[Wed Oct 14 20:29:38 2020] Creating scaffolds...
[Wed Oct 14 20:29:38 2020] Number of paths: 24098
[Thu Oct 15 22:52:48 2020] [ SCAFFOLDING ] Completed: 99.67% (24018 out of 24098)
All 16 of my processes are running at 55-99% CPU. However, the analysis seems to have been sitting at 99.67% complete for over 10 hours now; up until this point the scaffolding was progressing quite quickly.
Is there a particularly time- or CPU-consuming step at the very end of the scaffolding process?
Also, seeing as the program has already produced various output files, including the GFA and path text files, I wonder whether it is possible to resume the analysis from these checkpoint datasets if I need to cancel it (it has now been running for a week)?
Hi Markus,
I'm testing ARBitR against different versions of my assembly. However, the final assembly is often larger than the input: for instance, the input assembly was 688 Mbp and the output was 711 Mbp, which I found strange. I mapped the input and the output against each other using minimap2 and got this:
scaffold_8 7150795 2288900 2867235 - scaffold_8010 578385 0 578335 578235 578235 60 NM:i:100 ms:i:578135 AS:i:578135 nn:i:100 tp:A:P cm:i:50580 s1:i:518937 s2:i:23132 de:f:0 rl:i:740701 cs:Z::578335
scaffold_49 11269021 10570709 10571427 - scaffold_8010 578385 530867 531484 610 618 28 NM:i:108 ms:i:298 AS:i:270 nn:i:10tp:A:P cm:i:43 s1:i:450 s2:i:443 de:f:-0.1753 rl:i:1163772 cs:Z::80+aa:88*ct:8*gc:45*tc:34+nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn*tn:184*gt:2*gc:104*ta:65
scaffold_169 7216630 2354735 2933070 - scaffold_8010 578385 0 578335 578235 578235 60 NM:i:100 ms:i:578135 AS:i:578135 nn:i:100 tp:A:P cm:i:50580 s1:i:518937 s2:i:23132 de:f:0 rl:i:744527 cs:Z::578335
The ARBitR assembly is the one with scaffold_8, scaffold_49 and scaffold_169 (the query columns); the input assembly has scaffold_8010. Here you can see that scaffold_8010 has been included twice in the ARBitR assembly, at its full 578,235 bp length.
This is the corresponding backbone.gfa:
L scaffold_8010 - contig_3241 + *
L contig_3211 + scaffold_8010 - *
L scaffold_8010 + contig_3211 + *
L contig_5534 + scaffold_8010 - *
L contig_3241 - scaffold_8010 + *
L contig_3211 - scaffold_8010 - *
L scaffold_8010 + contig_5534 - *
L scaffold_8010 + contig_3211 - *
If I look at the pre-merge.paths.txt:
8 [start: contig_8098e, target: contig_4707s, connections: [], start: contig_4707e, target: contig_7789s, connections: ['contig_2054'], start: contig_7789e, target: contig_7773s, connections: [], start: contig_7773e, target: contig_5534s, connections: [], start: contig_5534e, target: scaffold_8010e, connections: ['contig_3211'], start: scaffold_8010s, target: contig_3241s, connections: [], start: contig_3241e, target: contig_3582s, connections: [], start: contig_3582e, target: contig_2101s, connections: ['contig_2100'], start: contig_2101e, target: contig_7801s, connections: ['contig_2100'], start: contig_7801e, target: contig_8203e, connections: ['contig_3922', 'contig_3920'], start: contig_8203s, target: contig_7560e, connections: [], start: contig_7560s, target: contig_7827e, connections: ['contig_3638', 'contig_3639', 'contig_7562'], start: contig_7827s, target: contig_1322e, connections: [], start: contig_1322s, target: contig_2866s, connections: [], start: contig_2866e, target: contig_1751s, connections: ['contig_2864']]
169 [start: contig_8100s, target: contig_8098s, connections: [], start: contig_8098e, target: contig_4707s, connections: [], start: contig_4707e, target: contig_7789s, connections: ['contig_2054'], start: contig_7789e, target: contig_7773s, connections: [], start: contig_7773e, target: contig_5534s, connections: [], start: contig_5534e, target: scaffold_8010e, connections: ['contig_3211'], start: scaffold_8010s, target: contig_3241s, connections: [], start: contig_3241e, target: contig_3582s, connections: [], start: contig_3582e, target: contig_2101s, connections: ['contig_2100'], start: contig_2101e, target: contig_7801s, connections: ['contig_2100'], start: contig_7801e, target: contig_8203e, connections: ['contig_3922', 'contig_3920'], start: contig_8203s, target: contig_7560e, connections: [], start: contig_7560s, target: contig_7827e, connections: ['contig_3638', 'contig_3639', 'contig_7562'], start: contig_7827s, target: contig_1322e, connections: [], start: contig_1322s, target: contig_2866s, connections: [], start: contig_2866e, target: contig_1751s, connections: ['contig_2864']]
These two are quite similar. Indeed, if I map them against each other, scaffold_169 has only 65 kbp not in scaffold_8 (both are longer than 7 Mbp).
How do I avoid cases like this?
Thank you.
Ole
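Cases like the one above can be spotted systematically by scanning the minimap2 PAF for input contigs that align near-full-length into more than one output scaffold. A small sketch, assuming (as in the records above) that the ARBitR output is the PAF query and the input assembly the target; the 0.9 alignment-fraction threshold is an arbitrary choice:

```python
import collections

def duplicated_input_contigs(paf_lines, min_frac=0.9):
    """Find input contigs placed more than once in the output.

    Assumes the output assembly is the PAF query (column 1) and the
    input assembly the target (column 6). An input contig whose
    near-full length (>= min_frac of it) aligns into two or more
    different output scaffolds is reported.
    """
    hits = collections.defaultdict(set)
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        out_scaf = f[0]                   # query: output scaffold
        in_tig, in_len = f[5], int(f[6])  # target: input contig + length
        t_start, t_end = int(f[7]), int(f[8])
        if (t_end - t_start) / in_len >= min_frac:
            hits[in_tig].add(out_scaf)
    return {tig: sorted(scafs) for tig, scafs in hits.items()
            if len(scafs) > 1}
```

Run on the three records above, it flags scaffold_8010 as appearing in both scaffold_8 and scaffold_169, while the short secondary hit on scaffold_49 is filtered out.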
Hello,
I can get your test dataset to run fine using
python3 ../src/arbitr.py -m 12000 -s 5000 -i ecoli.broken.fasta -o ecoli.test_data ecoli.broken.bam
but on my own data I get the following error. My BAM was generated by mapping with bwa mem.
python3 /mnt/griffin/chrwhe/software/ARBitR/src/arbitr.py -m 45000 -s 5000 -i Polygonia-scaffolds.fa -o Pcalbum_10X Pcalbum_10X.sorted.bam
[Sun Dec 13 21:09:47 2020] Starting ARBitR.
[Sun Dec 13 21:09:48 2020] Collecting contigs.
[Sun Dec 13 21:09:48 2020] Collecting barcodes for linkgraph.
[Sun Dec 13 21:09:49 2020] Starting barcode collection. Found 1016 contigs.
[Sun Dec 13 21:10:23 2020] [ BARCODE COLLECTION ] Completed: 100.0% (2032 out of 2032)
[Sun Dec 13 21:10:24 2020] Creating link graph.
Traceback (most recent call last):
File "/mnt/griffin/chrwhe/software/ARBitR/src/arbitr.py", line 250, in <module>
main()
File "/mnt/griffin/chrwhe/software/ARBitR/src/arbitr.py", line 194, in main
barcode_fraction)
File "/mnt/griffin/chrwhe/software/ARBitR/src/graph_building.py", line 507, in main
GEMcomparison = pairwise_comparisons(GEMlist)
File "/mnt/griffin/chrwhe/software/ARBitR/src/graph_building.py", line 390, in pairwise_comparisons
misc.printstatus("[ BARCODE COMPARISON ]\t" + misc.reportProgress(idx+1, len(GEMlist)))
File "/mnt/griffin/chrwhe/software/ARBitR/src/misc.py", line 29, in reportProgress
return "Completed: {0}% ({1} out of {2})".format( str(round( (current / total) * 100, 2)), current, total)
ZeroDivisionError: division by zero
Hello,
I'm going to use AnVIL to scaffold my genome with 10X Genomics sequencing, but I don't know what '-f BARCODE_FRACTION' means. Could you please explain the meaning and use of this parameter?
Thanks a lot.