arbitr's Issues

Scaffolding supernova 10x

Hi,

Could you provide more information on how to scaffold the FASTA files generated by a Supernova assembly?

graph_building error

Hi, I ran arbitr.py -i pilon.fasta pilon.bam -m 75000 -s 40000 -B 20 -F 70 -Q 60 -n 3 and got the error below. Could you please help me? Thank you very much.

$ arbitr.py -i pilon.fasta pilon.bam -m 75000 -s 40000 -B 20 -F 70 -Q 60 -n 3
[Mon May  3 16:56:34 2021]	Starting ARBitR.
[Mon May  3 16:56:35 2021]	Collecting contigs.
[Mon May  3 16:56:35 2021]	Collecting barcodes for linkgraph.
[Mon May  3 16:56:35 2021]	Starting barcode collection. Found 262 contigs.
[Mon May  3 16:59:41 2021]	[ BARCODE COLLECTION ]	Completed: 100.0% (524 out of 524)
[Mon May  3 16:59:41 2021]	Creating link graph.
Traceback (most recent call last):
  File "/xxx/ARBitR/src/arbitr.py", line 250, in <module>
    main()
  File "/xxx/ARBitR/src/arbitr.py", line 191, in main
    backbone_graph = graph_building.main(backbone_contig_lengths, \
  File "/xxx/ARBitR/src/graph_building.py", line 507, in main
    GEMcomparison = pairwise_comparisons(GEMlist)
  File "/xxx/ARBitR/src/graph_building.py", line 390, in pairwise_comparisons
    misc.printstatus("[ BARCODE COMPARISON ]\t" + misc.reportProgress(idx+1, len(GEMlist)))
  File "/xxx/ARBitR/src/misc.py", line 29, in reportProgress
    return "Completed: {0}% ({1} out of {2})".format( str(round( (current / total) * 100, 2)), current, total)
ZeroDivisionError: division by zero
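
The traceback shows that the division inside misc.reportProgress fails because its second argument, len(GEMlist), is 0. A guarded version of that helper could look like the sketch below. The name report_progress and the "n/a" wording are illustrative assumptions, not ARBitR's actual code. The guard only avoids the crash; an empty GEMlist still means there is nothing to compare, so the input BAM is probably worth inspecting as well.

```python
def report_progress(current: int, total: int) -> str:
    """Format a progress string without dividing by zero.

    Hypothetical stand-in for misc.reportProgress as seen in the traceback.
    """
    if total <= 0:
        # Nothing to process: report counts only instead of a percentage.
        return "Completed: n/a ({0} out of {1})".format(current, total)
    percent = round((current / total) * 100, 2)
    return "Completed: {0}% ({1} out of {2})".format(percent, current, total)


if __name__ == "__main__":
    print(report_progress(524, 524))
    print(report_progress(1, 0))
```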

A question regarding "ValueError: failure when retrieving sequence on"

Dear all,

When I run ARBitR, I get an error that I have not been able to solve. After generating the sorted ".bam" and ".bai" files (genome.nextpolish.sorted.bam and genome.nextpolish.sorted.bam.bai), I ran the following command:

/okyanus/users/veldem/01.Direct_Projects/06.Anchovy_Genome_Projects/04.Genome_Scaffolding/02.arbitR_scaffolding/ARBitR-master/src/arbitr.py -i /okyanus/users/veldem/01.Direct_Projects/06.Anchovy_Genome_Projects/03.Genome_Polishing/01.NextPolish/NextPolish/genome.nextpolish.fa genome.nextpolish.sorted.bam

and got the following error after nearly one hour:

[Sun Apr 25 19:24:54 2021] Collecting contigs.
[Sun Apr 25 19:24:54 2021] Collecting barcodes for linkgraph.
[Sun Apr 25 19:24:54 2021] Starting barcode collection. Found 3949 contigs.
[Sun Apr 25 19:30:00 2021] [ BARCODE COLLECTION ] Completed: 100.0% (7898 out of 7898)
[Sun Apr 25 19:30:00 2021] Creating link graph.
[Sun Apr 25 19:55:45 2021] [ BARCODE COMPARISON ] Completed: 100.0% (7707 out of 7707)
[Sun Apr 25 19:55:45 2021] Number of windows: 7707
[Sun Apr 25 19:56:52 2021] [ BARCODE LINKING ] Completed: 100.0% (7707 out of 7707)
[Sun Apr 25 19:56:54 2021] Writing link graph to genome.nextpolish.sorted.ARBitR.backbone.gfa.
[Sun Apr 25 19:56:54 2021] Finding paths.
[Sun Apr 25 19:57:05 2021] Found 741 paths.
[Sun Apr 25 19:57:05 2021] Collecting barcodes from short contigs.
[Sun Apr 25 19:57:05 2021] Starting barcode collection. Found 7568 contigs.
[Sun Apr 25 20:00:32 2021] [ BARCODE COLLECTION ] Completed: 100.0% (7568 out of 7568)
[Sun Apr 25 20:20:17 2021] [ PATH FILLING ] Completed: 100.0% (741 out of 741)
[Sun Apr 25 20:20:17 2021] Found fasta file for merging: /okyanus/users/veldem/01.Direct_Projects/06.Anchovy_Genome_Projects/03.Genome_Polishing/01.NextPolish/NextPolish/genome.nextpolish.fa
[Sun Apr 25 20:20:17 2021] Trimming contig ends...
[E::fai_retrieve] Failed to retrieve block: unexpected end of file(741 out of 741)
[Sun Apr 25 20:22:27 2021] [ TRIMMING ] Completed: 100.0% (741 out of 741)
Traceback (most recent call last):
  File "/okyanus/users/veldem/01.Direct_Projects/06.Anchovy_Genome_Projects/04.Genome_Scaffolding/02.arbitR_scaffolding/ARBitR-master/src/arbitr.py", line 250, in <module>
    main()
  File "/okyanus/users/veldem/01.Direct_Projects/06.Anchovy_Genome_Projects/04.Genome_Scaffolding/02.arbitR_scaffolding/ARBitR-master/src/arbitr.py", line 228, in main
    bed = merge_fasta.main( args.input_fasta,
  File "/okyanus/users/veldem/01.Direct_Projects/06.Anchovy_Genome_Projects/04.Genome_Scaffolding/02.arbitR_scaffolding/ARBitR-master/src/merge_fasta.py", line 1011, in main
    trimmed_fasta[tig] = fastafile.fetch(reference=tig,
  File "pysam/libcfaidx.pyx", line 319, in pysam.libcfaidx.FastaFile.fetch
ValueError: failure when retrieving sequence on 'ctg27_np12'
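
The "[E::fai_retrieve] Failed to retrieve block: unexpected end of file" line suggests that the FASTA index points past the end of the FASTA file. One possible cause (an assumption, not confirmed in this thread) is a stale .fai left over from an earlier version of the file, or a truncated FASTA. The sketch below checks each entry of a .fai index against the size of the FASTA file, using only the five standard faidx columns (name, length, offset, bases per line, bytes per line). The function names are illustrative.

```python
import os


def required_end(length, offset, linebases, linewidth):
    """Byte position just past the last base of a sequence, per its .fai entry."""
    if length == 0:
        return offset
    full_lines_before_last = (length - 1) // linebases
    bases_on_last_line = (length - 1) % linebases + 1
    return offset + full_lines_before_last * linewidth + bases_on_last_line


def sequences_past_eof(fai_lines, fasta_size):
    """Return names of sequences whose indexed extent exceeds the FASTA size."""
    bad = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed or empty lines
        name = fields[0]
        length, offset, linebases, linewidth = map(int, fields[1:5])
        if required_end(length, offset, linebases, linewidth) > fasta_size:
            bad.append(name)
    return bad


if __name__ == "__main__":
    fasta = "genome.nextpolish.fa"  # path taken from the report above
    if os.path.exists(fasta) and os.path.exists(fasta + ".fai"):
        with open(fasta + ".fai") as handle:
            print(sequences_past_eof(handle, os.path.getsize(fasta)))
```

If any names are reported, deleting the .fai and rebuilding it with samtools faidx would be a reasonable first thing to try. If the FASTA itself is truncated, it would need to be regenerated.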

My python version: Python 3.8.8

Best wishes

Scaffolding steps and checkpoints

Hi @markhilt !

I am running ARBitR on a very large and quite fragmented genome assembly. I have gotten to the point where the program runs many processes in parallel:

[Wed Oct 14 18:45:43 2020] Trimming contig ends...
[Wed Oct 14 20:25:35 2020] [ TRIMMING ] Completed: 100.0% (24098 out of 24098)
[Wed Oct 14 20:29:38 2020] Creating scaffolds...
[Wed Oct 14 20:29:38 2020] Number of paths: 24098
[Thu Oct 15 22:52:48 2020] [ SCAFFOLDING ] Completed: 99.67% (24018 out of 24098)

All 16 of my processes are running at 55-99% CPU. However, the analysis seems to have been sitting at 99.67% completed for over 10 hours now. Up until this point, scaffolding was progressing quite fast.

Is there a particularly time/CPU consuming step at the very end of the scaffolding process?

Also, since the program has already produced various output files, including the GFA and path text files, I wonder whether it is possible to resume the analysis from these checkpoint datasets if I need to cancel the run (it has now been running for one week)?

Sequences occur more than once in the output

Hi Markus,
I'm testing ARBitR against different versions of my assembly. However, the final assembly is often larger than the input. For instance, the input assembly was 688 Mbp and the output was 711 Mbp, which I thought was strange. I mapped the input and the output against each other using minimap2 and got this:

scaffold_8	7150795	2288900	2867235	-	scaffold_8010	578385	0	578335	578235	578235	60	NM:i:100	ms:i:578135	AS:i:578135	nn:i:100	tp:A:P	cm:i:50580	s1:i:518937	s2:i:23132	de:f:0	rl:i:740701	cs:Z::578335
scaffold_49	11269021	10570709	10571427	-	scaffold_8010	578385	530867	531484	610	618	28	NM:i:108	ms:i:298	AS:i:270	nn:i:10tp:A:P	cm:i:43	s1:i:450	s2:i:443	de:f:-0.1753	rl:i:1163772	cs:Z::80+aa:88*ct:8*gc:45*tc:34+nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn*tn:184*gt:2*gc:104*ta:65
scaffold_169	7216630	2354735	2933070	-	scaffold_8010	578385	0	578335	578235	578235	60	NM:i:100	ms:i:578135	AS:i:578135	nn:i:100	tp:A:P	cm:i:50580	s1:i:518937	s2:i:23132	de:f:0	rl:i:744527	cs:Z::578335

The ARBitR assembly is the subject, with scaffold_8, scaffold_49 and scaffold_169. The input assembly has scaffold_8010. Here, you can see that scaffold_8010 has been included twice in the ARBitR assembly, in its whole 578235 bp length.

This is the corresponding backbone.gfa:
L scaffold_8010 - contig_3241 + *
L contig_3211 + scaffold_8010 - *
L scaffold_8010 + contig_3211 + *
L contig_5534 + scaffold_8010 - *
L contig_3241 - scaffold_8010 + *
L contig_3211 - scaffold_8010 - *
L scaffold_8010 + contig_5534 - *
L scaffold_8010 + contig_3211 - *

If I look at the pre-merge.paths.txt:

8	[start: contig_8098e, target: contig_4707s, connections: [], start: contig_4707e, target: contig_7789s, connections: ['contig_2054'], start: contig_7789e, target: contig_7773s, connections: [], start: contig_7773e, target: contig_5534s, connections: [], start: contig_5534e, target: scaffold_8010e, connections: ['contig_3211'], start: scaffold_8010s, target: contig_3241s, connections: [], start: contig_3241e, target: contig_3582s, connections: [], start: contig_3582e, target: contig_2101s, connections: ['contig_2100'], start: contig_2101e, target: contig_7801s, connections: ['contig_2100'], start: contig_7801e, target: contig_8203e, connections: ['contig_3922', 'contig_3920'], start: contig_8203s, target: contig_7560e, connections: [], start: contig_7560s, target: contig_7827e, connections: ['contig_3638', 'contig_3639', 'contig_7562'], start: contig_7827s, target: contig_1322e, connections: [], start: contig_1322s, target: contig_2866s, connections: [], start: contig_2866e, target: contig_1751s, connections: ['contig_2864']]
169	[start: contig_8100s, target: contig_8098s, connections: [], start: contig_8098e, target: contig_4707s, connections: [], start: contig_4707e, target: contig_7789s, connections: ['contig_2054'], start: contig_7789e, target: contig_7773s, connections: [], start: contig_7773e, target: contig_5534s, connections: [], start: contig_5534e, target: scaffold_8010e, connections: ['contig_3211'], start: scaffold_8010s, target: contig_3241s, connections: [], start: contig_3241e, target: contig_3582s, connections: [], start: contig_3582e, target: contig_2101s, connections: ['contig_2100'], start: contig_2101e, target: contig_7801s, connections: ['contig_2100'], start: contig_7801e, target: contig_8203e, connections: ['contig_3922', 'contig_3920'], start: contig_8203s, target: contig_7560e, connections: [], start: contig_7560s, target: contig_7827e, connections: ['contig_3638', 'contig_3639', 'contig_7562'], start: contig_7827s, target: contig_1322e, connections: [], start: contig_1322s, target: contig_2866s, connections: [], start: contig_2866e, target: contig_1751s, connections: ['contig_2864']]

These two are quite similar. Indeed, if I map them against each other, scaffold_169 has only 65 kbp not in scaffold_8 (both are longer than 7 Mbp).

How do I avoid cases like this?
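
As a diagnostic for cases like this, the paths file can be scanned for contigs that appear in more than one path. The sketch below assumes the pre-merge.paths.txt layout shown above: a path ID, then a bracketed list of "start: ..., target: ..., connections: [...]" entries, where start and target names carry a trailing s or e for the contig end. The function names are illustrative and not part of ARBitR, and this only reports duplicates; it does not prevent them.

```python
import re
from collections import defaultdict

# "start: contig_8098e" / "target: scaffold_8010s" -> captures the name with its s/e suffix
ENDPOINT = re.compile(r"(?:start|target): ([^,\]\s]+)")
# "connections: ['contig_3211', ...]" -> captures each quoted contig name
FILLER = re.compile(r"'([^']+)'")


def contigs_in_path(path_text):
    """Collect contig names in one path, dropping the assumed s/e end suffix."""
    names = {name[:-1] for name in ENDPOINT.findall(path_text)}
    names.update(FILLER.findall(path_text))
    return names


def shared_contigs(lines):
    """Map each contig found in two or more paths to the sorted IDs of those paths."""
    seen = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue
        path_id, _, path_text = line.strip().partition("[")
        for name in contigs_in_path(path_text):
            seen[name].add(path_id.strip())
    return {name: sorted(ids) for name, ids in seen.items() if len(ids) > 1}
```

Running shared_contigs over the lines of the paths file should list scaffold_8010 under both path 8 and path 169 for the example above.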

Thank you.

Ole

trouble running

Hello,

I can get your test dataset to run fine using
python3 ../src/arbitr.py -m 12000 -s 5000 -i ecoli.broken.fasta -o ecoli.test_data ecoli.broken.bam

but on my own data I get the following error. My bam was generated using bwa mem for mapping.

python3 /mnt/griffin/chrwhe/software/ARBitR/src/arbitr.py -m 45000 -s 5000 -i Polygonia-scaffolds.fa -o Pcalbum_10X Pcalbum_10X.sorted.bam
[Sun Dec 13 21:09:47 2020] Starting ARBitR.
[Sun Dec 13 21:09:48 2020] Collecting contigs.
[Sun Dec 13 21:09:48 2020] Collecting barcodes for linkgraph.
[Sun Dec 13 21:09:49 2020] Starting barcode collection. Found 1016 contigs.
[Sun Dec 13 21:10:23 2020] [ BARCODE COLLECTION ] Completed: 100.0% (2032 out of 2032)
[Sun Dec 13 21:10:24 2020] Creating link graph.
Traceback (most recent call last):
  File "/mnt/griffin/chrwhe/software/ARBitR/src/arbitr.py", line 250, in <module>
    main()
  File "/mnt/griffin/chrwhe/software/ARBitR/src/arbitr.py", line 194, in main
    barcode_fraction)
  File "/mnt/griffin/chrwhe/software/ARBitR/src/graph_building.py", line 507, in main
    GEMcomparison = pairwise_comparisons(GEMlist)
  File "/mnt/griffin/chrwhe/software/ARBitR/src/graph_building.py", line 390, in pairwise_comparisons
    misc.printstatus("[ BARCODE COMPARISON ]\t" + misc.reportProgress(idx+1, len(GEMlist)))
  File "/mnt/griffin/chrwhe/software/ARBitR/src/misc.py", line 29, in reportProgress
    return "Completed: {0}% ({1} out of {2})".format( str(round( (current / total) * 100, 2)), current, total)
ZeroDivisionError: division by zero
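
Here the division fails because len(GEMlist) is 0, so no barcode-bearing windows reached the comparison step. One thing worth checking, under the assumption that ARBitR reads linked-read barcodes from the BX tag of each alignment, is whether the BAM carries that tag at all. Plain bwa mem output typically does not unless the barcode was placed in the read comment and carried over with bwa mem -C. The sketch below counts BX:Z: tags in SAM-formatted text read from standard input. The function name is illustrative.

```python
import sys


def bx_tag_fraction(sam_lines):
    """Fraction of alignment records that carry a BX:Z: optional field."""
    total = 0
    tagged = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        total += 1
        # Optional fields start at column 12 of a SAM record.
        if any(field.startswith("BX:Z:") for field in fields[11:]):
            tagged += 1
    return tagged / total if total else 0.0


if __name__ == "__main__":
    print("Fraction of records with a BX tag:", bx_tag_fraction(sys.stdin))
```

If the block is saved under a name such as check_bx.py (hypothetical), it could be run as: samtools view Pcalbum_10X.sorted.bam | head -n 100000 | python3 check_bx.py. A fraction near zero would point to missing barcodes rather than a problem in the scaffolding step itself.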

about barcode fraction

Hello,
I'm going to use AnVIL to scaffold my genome with 10X Genomics sequencing data, but I don't know what '-f BARCODE_FRACTION' means. Could you please explain the meaning and use of this parameter?
Thanks a lot.
