Hi ! we followed these steps in a pipeline in order to improve a draft genome rela

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Hello <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-ur

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

empty _original.gv file,about bcgsc/arcs

Comments (35)

benvvalk commented on June 13, 2024 1

@mossishahi Lauren recently fixed some whitespace errors in makeTSVfile.py. I made a new ARCS release (1.0.5) that includes Lauren's fix: https://github.com/bcgsc/arcs/releases/download/v1.0.5/arcs-1.0.5.tar.gz

Can you try the makeTSVfile.py in that tarball and see if it solves your problem?

from arcs.

burri-wildlight commented on June 13, 2024 1

Hi Lauren,

I have now tried to run arcs-make, but still run into problems in the very first steps of the pipeline
'make: *** No rule to make target oenMenl_1.3.fa.fa', needed by oenMenl_1.3.fa.renamed.fa'. Stop.'

But never mind. - I've found the source of the issue with my own pipeline that both a collaborator and I ran into. As silly as it is, I'll still post it here in case it can be useful for somebody: We both thought that the tsv file output by ARCS can be used as input for LINKS, and that the step over the makeTSVfile.py script can be skipped. Well, it can't, as we found out.

Thanks again for the swift support & very best wishes,
Reto

from arcs.

lcoombe commented on June 13, 2024

Hello @mossishahi,

Just to confirm, you are using longranger align, and aligning your longranger basic processed reads to your draft genome? Have you sorted the bamfile(s) by name (samtools sort -n), and is the barcode available either in the BX:Z tag or in the form readname_barcode?

It might also help to take a look at our ARCS makefile (available in Examples/arcs-make), which can be used to do many of the different steps in the ARCS pipeline for you.

Lauren

from arcs.

mossishahi commented on June 13, 2024

@lcoombe
As the supernova support told us, it's not required to runlongranger alignon the outputs of longranger basic and it process the barcodes by itself.
We have only one bamfile made by longranger align and so no need to be sorted . We expect the longranger align to put the barcodes in the headers in a form like you mentioned.
I've not got familiar with the arcs-make and its difference with the arcs.

from arcs.

lcoombe commented on June 13, 2024

@mossishahi : I ask because we normally use longranger basic followed by bwa mem for the alignments - Then we can re-use that read set with a bunch of different tools that are not neccessarily from 10x Genomics, and know that the barcode has been removed from the first read of each read pair.

ARCS requires the input BAM file to be sorted by read name, and I don't how the output BAM from longranger align is sorted. An unsorted input will lead to an error like skipped 531501621 unpaired reads similar to what you mentioned above. I would definitely check how the BAM is sorted, and run samtools sort -n if it is not sorted by read name.

If you want to be sure that the format outputted by longranger align conforms to the format expected by ARCS, feel free to post a couple of lines of it here.

arcs-make is a Makefile that will run all the different steps of the ARCS pipeline for you. The inputs it expects are longranger basic processed reads and a draft genome. So for example, it will run the BWA mem alignments, launch the arcs binary, the makeTSVfile.py script, etc.
Since you are using a different aligner from the one that we normally use, you might not be able to run the Makefile directly, but you can use it to see what steps need to be run in the pipeline.

from arcs.

mossishahi commented on June 13, 2024

@lcoombe
as I noticed your purposed pipeline is somehow like this: ??

run longranger basic on fastq files
run the bwa mem on the output files of longranger basic (I don't have an exact idea, do you know how should I specify the inputs of it? )
run samtools sort -n on the bam files
Q we have 4 fastq files which are separated into two groups, each group is related to a sample and they yield to two individual results in longranger basic process. we decided to run this script on the second file in order to index the barcodes:
gunzip -c lib2.fq.gz| sed '1~4s/-1$/-2/' | pigz >lib2.bx.fq.gz
the question is that does bwa mem accepts two inputs and how does it make the outputs?
specify the bamfile as the input of Arcs (is it required to sort the bamfile list.txt ?)

some lines of our bamfile extracted by samtools view

SN7001394:302:HMMVKBCXY:1:2114:10044:41109      163     DjScaffold1     6000    60      18S133M =       6180    308     TGCCGTTTTCCAAATCACAAAACGGCAATCCCCATCAATTTTTCTCTATAATTGTACACACAACTACTACGCAACAGCAATTTTGTTCCAGAGAGAAAATGGGCTTTTAATATATTCCTCTGACTTAAATATAACGATCTCTGTCCGGTTT DDDDDEHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIGIIIIHIIIIHIIIIIIIIIIIIGHHHIIIIIIIHIIHIHGIIIIHIIIHIIIIIIIHIIHIIIEEHIIF@HFHHIHIIIIHHH RX:Z:ATCATGGAGATGTGGC   QX:Z:DDDDDIIIIIIIIIII   XS:i:-85        AS:i:-16       XM:Z:0   AM:Z:1  XT:i:0  BX:Z:ATCATGGAGATGTGGC-1 DM:Z:1.100000   RG:Z:PG2103:LibraryNotSpecified:1:unknown_fc:0  OM:i:60
SN7001394:302:HMMVKBCXY:1:1113:8868:36728       163     DjScaffold1     6011    60      151M    =       6291    376     CCGCATCAATTTTTCTCTATAATTGTACACACAACTACTACGCAACAGCAATTTTGTTCCAGAGAGAAAATGGGCTTTTAATATATTCCTCTGACTTAAATATAACGATCTCTGTCCGATTTTTATATCAGTTAATTCAGAGATTCAACCA DDDDDIHIIIIIGIIIHIIIIIIIIIIIIIIIIIIIIIIIICCHIHIIIHIIIIIIIIHIHHGHHHHHIIIIIIIIIHIGIIHIIIIIIIIEHFHGIIEHIGHHIIIIIIHIIIIHHHIIGIIHIIIHHIIIIIHGIHHHHEHHHIIIGID RX:Z:TAGACACAGTCGACTT   QX:Z:DDDDDIIIIIIIIIII   XS:i:-105       AS:i:-27       XM:Z:0   AM:Z:1  XT:i:0  BX:Z:TAGACACAGTCGACTT-1 DM:Z:3.250000   RG:Z:PG2103:LibraryNotSpecified:1:unknown_fc:0  OM:i:60
SN7001394:302:HMMVKBCXY:2:1207:11417:81454      147     DjScaffold1     6011    60      151M    =       5802    -360    CCCCATCAATTTTTCTCTATAATTGTACACACAACTACTACGCAACAGCAATTTTGTTCCAGAGAGAAAATGGGCTTTTAATATATTCCTCTGACTTAAATATAACGATCTCTGTCCGATTTTTATATCAGTTAATTCAGAGATTCAACCA IHFIGHHHHHIIIIIIIHHG?EFHIIIIHHIHHGIHHIHHIIGIIIIHIHIHHIHHIIHIIIIIIGIIIIH?HHHHHHEHHGHHFFHIHHFHHG@1IIIIIIIIIIIHIHGIIIIIIIIIHIIIIIIIIIIIIIHIIIIIIIIIIIDDCDD RX:Z:GCATCTCTCAGCTCTC   QX:Z:DDDDDIIIIIIIIIII   XS:i:-84        AS:i:-4 XM:Z:0 AM:Z:1   XT:i:0  BX:Z:GCATCTCTCAGCTCTC-1 DM:Z:2.125000   RG:Z:PG2103:LibraryNotSpecified:1:unknown_fc:0  OM:i:60
SN7001394:302:HMMVKBCXY:1:2101:7075:63131       99      DjScaffold1     6014    60      128M    =       6138    275     CATCAATTTTTCTCTATAATTGTACACACAACTACTACGCAACAGCAATTTTGTTCCAGAGAGAAAATGGGCTTTTAATATATTCCTCTGACTTAAATATAACGATCTCTGTCCGGTTTTTATATCAG        IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHIIGIIIHIHIIIIFHIIIIHIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIG        RX:Z:GTACGTACATGTATGC   QX:Z:DDDDDIIIIIIIIIHI   TR:Z:CCGAGAG    TQ:Z:IIIIIII    XS:i:-71        AS:i:0  XM:Z:0 AM:Z:1   XT:i:0  BX:Z:GTACGTACATGTATGC-1 DM:Z:0.833333   RG:Z:PG2103:LibraryNotSpecified:1:unknown_fc:0  OM:i:60
SN7001394:302:HMMVKBCXY:1:1210:18584:48645      99      DjScaffold1     6016    60      128M    =       6174    310     TCAATTTTTCTCTATAATTGTACACACAACTACTACGCAACAGCAATTTTGTTCCAGAGAGAAAATGGGCTTTTAATATATTCCTCTGACTTAAATATAACGATCTCTGTCCGATTTTTATATCAGTT        IIIIIIIIIIGIIIIIIIIIIGIIIIIIIHIIIIIIIIHIIIHIIIIIIIIIHIIIIIIIHIIIIIIIIIIIIIIIIIGIIHIIHIIIIIHIIIHIIIIIIIIIIIIIIIIIICGCHHIIIIIIIHII        RX:Z:ACATCTTTCGCCTGTT   QX:Z:DDDDDIIIIIIIIIII   TR:Z:ATCCCCA    TQ:Z:IIIIIII    XS:i:-74        AS:i:-5 XM:Z:0 AM:Z:0   XT:i:0  BX:Z:ACATCTTTCGCCTGTT-1 RG:Z:PG2103:LibraryNotSpecified:1:unknown_fc:0  OM:i:60
SN7001394:302:HMMVKBCXY:1:2114:9274:77248       99      DjScaffold1     6025    60      128M    =       6212    338     CTCTATAATTGTACACACAACTACTACGCAACAGCAATTTTGTTCCAG

do you recommend to run your suggested pipeline or there is only a small fault with our pipeline?

from arcs.

lcoombe commented on June 13, 2024

@mossishahi : Here's a sample command for the arcs-make Makefile, and the commands that would be run. (Using -n option with a makefile triggers a 'dry-run' -- the commands that would have been run are printed but not executed). Hopefully this will clear up your questions about how to run the ARCS pipeline.

arcs-make arcs -n draft=celegans_LRSIM_pseudohap_noNewLines reads=celegans-N2-LRSIM_longrangerBarcoded
Here, the draft assembly is celegans_LRSIM_pseudohap_noNewLines.fa, and the longranger basic processed reads are celegans-N2-LRSIM_longrangerBarcoded.fq.gz

Here are the commands run:

perl -ne 'chomp; if(/>/){$ct+=1; print ">$ct\n";}else{print "$_\n";} ' < celegans_LRSIM_pseudohap_noNewLines.fa > celegans_LRSIM_pseudohap_noNewLines.renamed.fa 
/usr/bin/time -v bwa index celegans_LRSIM_pseudohap_noNewLines.renamed.fa |& tee celegans_LRSIM_pseudohap_noNewLines_bwa_index.log
/usr/bin/time -v sh -c 'bwa mem -t8 -C -p celegans_LRSIM_pseudohap_noNewLines.renamed.fa celegans-N2-LRSIM_longrangerBarcoded.fq.gz | samtools view -Sb - | samtools sort -@8 -n - -o celegans_LRSIM_pseudohap_noNewLines.sorted.bam' |& tee celegans_LRSIM_pseudohap_noNewLines_bwa_mem.log
echo celegans_LRSIM_pseudohap_noNewLines.sorted.bam > celegans_LRSIM_pseudohap_noNewLines_bamfiles.fof
/usr/bin/time -v arcs --bx -v -f celegans_LRSIM_pseudohap_noNewLines.renamed.fa -a celegans_LRSIM_pseudohap_noNewLines_bamfiles.fof -c 5 -m 50-10000 -s 98 -r 0.05 -e 30000 -z 500 -d 0 --gap 100 -b celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500 |& tee celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500_arcs.log
python /projects/btl/lcoombe/git/arcs/Examples/../Examples/makeTSVfile.py celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500_original.gv celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500.tigpair_checkpoint.tsv celegans_LRSIM_pseudohap_noNewLines.renamed.fa 
ln -s celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500.tigpair_checkpoint.tsv celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.3.tigpair_checkpoint.tsv
/usr/bin/time -v LINKS -f celegans_LRSIM_pseudohap_noNewLines.renamed.fa -s empty.fof -b celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.3 -l 5 -a 0.3 -z 500 |& tee celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500_l5_a0.3_links.log

If you have multiple read files, you can run the bwa mem command (the output is piped to samtools sort -n, as you can see above) separately for each read file. Then, make a file (named celegans_LRSIM_pseudohap_noNewLines_bamfiles.fof above) which lists the paths to each individual BAM file. This list of file names is provided to ARCS with the -a option.

Looking at the sample lines from your alignment file, I can see that the alignments aren't sorted by name, which ARCS does assume.

Hope that helps -- let me know if you have further questions!

from arcs.

mossishahi commented on June 13, 2024

@lcoombe I was supposed that running samtools sort -n on the present bamfile could be useful, I did it but it didn't conduct to a successful Arcs run. It skipped the paired reads as the previous run and it was printing out some messages like this which I break the process.

Warning: Skipping an unpaired read. Read pairs should be consecutive in the SAM/BAM file.
  Prev read: SN7001394:302:HMMVKBCXY:1:2210:14481:27671
  Curr read: SN7001394:302:HMMVKBCXY:2:2206:21254:21263
Warning: Skipped 1000000 unpaired reads.
Warning: Skipped 2000000 unpaired reads.

now I decided to follow your pipeline using bwa mem but there is a question:
As I contacted the support of Supernova, they expressed that our present fastq files are standard fastq files while your pipeline ideal files are interleaved fastq files. Does it matter?
note: a part of supernova support message about the Arcs:

However, it does appear that they are using the "interleaved" fastq format, where the R1 and R2 reads are present in the same fastq file, rather than in separate files like you have. (Interleaved format is an older format that we no longer use).

from arcs.

lcoombe commented on June 13, 2024

Hi @mossishahi,

Did you let ARCS finish, or did you stop when you saw those error messages? It is possible that you might still see that warning message but still get a non-empty graph output. That could happen if your aligned outputs supplementary alignments - ARCS reads the BAM file a pair at a time, so will output that warning when it sees a pair of reads that do NOT have the same name, but it will continue reading through the file and identifying correct read pairs. Feel free to post or send me ([email protected]) a portion of your BAM file if you'd like another set of eyes.
Sounds good to use BWA mem. I'm not sure which stage of fastq-processing you were discussing with the Supernova support -- The input reads for the Supernova de novo assembler are the raw reads (barcode still in read 1 sequence). For ARCS, the input is the interleaved fastq reads file post longranger basic. Longranger basic outputs an interleaved fastq file called barcoded.fastq.gz,
where the barcode sequence is in a BX:Z comment of the read headers.
(https://support.10xgenomics.com/genome-exome/software/pipelines/latest/advanced/other-pipelines)

from arcs.

mossishahi commented on June 13, 2024

@lcoombe , My apologies for not responding to your comment sooner.
thanks for your guidance.

Actually, I stopped the Arcs and didn't let it continue because I was supposed that It will result in errors again.
Now I'm doing some preparations on our data to run the bwa mem on them. Support of Supernova informed me that they made a mistake and actually, there is no problem with fastq files.
already I have some questions about the steps you mentioned in above comments:

one step in the pipeline is:
samtools sort -@8 -n - -o celegans_LRSIM_pseudohap_noNewLines.sorted.bam ,
Q1: I would ask if it is the - between -n and -o correct, and what does it mean? (it seems that the bam file name specified at the end of the command is input name, no ouput name is required? )
Q2: I didn't run tee functions in the pipeline. is it necessary to run? no .fof format is generated. is it ok to specify the file names in .txt file?
Q3: in below arcs command, the input parameters seem a bit ambiguous? would you please give me some explanation about the --bx parameter?
arcs --bx -v -f celegans_LRSIM_pseudohap_noNewLines.renamed.fa -a celegans_LRSIM_pseudohap_noNewLines_bamfiles.fof -c 5 -m 50-10000 -s 98 -r 0.05 -e 30000 -z 500 -d 0 --gap 100 -b celegans_LRSIM_pseudohap_noNewLines_c5_m50-10000_s98_r0.05_e30000_z500
Q4: it seems that in above command the default values has been set again, is it necessary? and is it better to chage the values according to our data?

regards

from arcs.

lcoombe commented on June 13, 2024

Hi @mossishahi,

Yes, the - between -n and -o is correct. The output of bwa mem is being piped directly to the samtools sort, so it indicates that it should read from standard input, not from a specified file. The -o indicates the output file name.
The tee functions aren't necessary -- they are just used in the pipeline for capturing the standard out and standard error to log files.
I'm not too sure what you mean about the input parameters being ambiguous -- If you are unsure what they mean, arcs has a help page which explains them: arcs --help. --bx just indicates that the barcode is read out of the BX:Z tag of the alignments (although that option is deprecated -- ARCS now will look for the barcode in that tag, and then in the read header in the form "read_name"_barcode if it doesn't find it in the SAM tag)
You can certainly set the parameters for ARCS as you wish based on your data - Doing runs with various parameters is a good idea to ensure you find the best parameters you can for your particular data. If you want to use the arcs-make file, take a look at the help page -- arcs-make help. Parameters can be specified like this:
arcs-make arcs draft=test_scaffolds reads=test_reads m=50-6000 a=0.9

Hope that helps!

from arcs.

mossishahi commented on June 13, 2024

@lcoombe I'm grateful for your guidance.
as you recommended I followed this pipe line:

run the longranger basic on the fastq files
run the bwa index on our genome reference
run the bwa mem on the 10x libraries and reference
conver the output to bam
sorting bamfiles
and now, the below command:
arcs --bx -v -f ~/e/Planaria_10X/Fastq/For_10x_Denovo_Data/PG2103_03BE5/ref/refdata-DjScaff_fnl20141213/PG2103/outs/pre-ref/DjScaff_fnl20141213.fa -a bam-libs.txt -c 5 -m 50-10000 -s 98 -r 0.05 -e 30000 -z 500 -d 0 --gap 100 -b ImprovedDjScaff_fnl20141213
but however the output message was Done and the messages during the process was like:
On line 40000000
On line 50000000
On line 60000000
On line 70000000
On line 80000000
but the ImprovedDjScaff_fnl20141213_original.gv file is empty! eventhough the ImprovedDjScaff_fnl20141213.dist.gv is about 11MB.

from arcs.

mossishahi commented on June 13, 2024

@lcoombe as I saw you had requested a sample of the genome in similar issues, I prepared it:

   15 >DjScaffold8
   16 TTCTTTTCCTTTAATGCTTCCTCTAATCAGGCTTCCTTTAGTGGGCTTACTTACGTGGTCTACCTTTAACAATCTGGTTAATTTACTTATGACTCAGTTACCAAAGTAACCTGTACAAATGGTTCTACTGAAAACTGTCTAGGCAAAATGCAATTGAATCTA
   17 >DjScaffold9
   18 TCAATATAATTCACAATATAAACAGATAATTTAAAATAATTAATAACGAAATAACTAATTAATTGTTCAATAAGAACCAAATAAAGAATTTTTTTTCAAAAGTTATTTCTTAAAGACGTTGAGTATAAAATTTACAGAAGACCATAACTATTGAGAATGGCG
   19 >DjScaffold10
   20 AAAAAAAATAATAAATAAAATTTTTTTTATGAATTATTTTCCCTAAATTTTACGTGGGATTTTAAAGAGTTTTATCTGCTTATATGTAATTTATGAAACATTTTATCAACTTATATGTAATGTTTGACAAATTTGTTCTAATAAATCAAATAAGTTACCAAA
   21 >DjScaffold11
   22 TTATAATAACAGTTTGTATTTAACTGATTCTTTGTAGTGGAAATAATGATTTTAAATACTTCAATCATTCAGTGAATAAACTAGTATTACAAATGAGAATTCTTGATCGACTGGTTTTTAATAGTGAAAATTTTCAATTTAATCTGGATATTTATATTATAA
   23 >DjScaffold12
   24 AAAAGGTAATGTTTAAATCGTGTTGCAAAATAATATATTGCCAGGAGTTCATTTCCTAGTTACACAGTACATCTGCTCTGTCTTTGTAAACTTTTTTTGATCCATAGCAGGTAACTTTCTCTTCATTATTTTGTAACTGGCTAAGAACTACCCCCATAGCAT
   25 >DjScaffold13
   26 AATTGTTGGACATAACCATTGATCGTTTCTTAGAATTCATTTAAAATAAAAGCTTCGATCAAAAAAATTTTTTTTTGTTTTTTTGAGATTGATTCGATATTTTGACCTTGTTATTGTCAGAAATTGAAAGTCAATTTTCGGTTCCACCCAATTATTTCTAAA
   27 >DjScaffold14
   28 TGGTTGATGGAATTTGTTGGATGATGTGTTGTTTGAGAGGTAATGTGATGGAGAAAATTTTGGATGCCGATTGTGACTTAATTTACGATTTTTACGGCTTGAATTTGATTGGTTGATTTTTTGAAGTAATATAAGTGTGTTAAAAAATATTTTCGAGCGGAG
   29 >DjScaffold15

also we had generated 2 fastq files using longranger basic,
first lines of barcoded_1.fastq:

      1 @SN7001394:302:HMMVKBCXY:1:1110:9233:53199
      2 TCAGAAGAAGATAATAAGAAGGGGATAACCAATCCGAAAGTGACAGTGAAACTGAAATGTGTGAGTAGCATAAATATGACAGCGAAGAATTGGTAGATCGTAGCTGGTTGCCGTCATGTGAAGTAAGT
      3 +
      4 IHGIIIIHIIHIIIIIIIGIIIGIIIIHIHHHIIIIIIIGIHGIFHGHHIIIIIFHHHHIIIIIIHHIGHHHEHIIGHGGHHGHHIHIHHHIIHIHIIHIIHIIIIIIIIGHHHHHFHHHEHHHHII<
      5 @SN7001394:302:HMMVKBCXY:1:1110:9233:53199
      6 ATGTAGATAATATAAAAGTCCGTTTAATTCCGTTCAACGTTGCTTCATCCGGGTCCTCCAGTTTTTTCTATTTAAAACTTATAACAGTAAGGTTTGTGCCCTTAAAACAAAGAGCTTCAAAAGTTGCCTTAATAGGTGTTAAAGCATCAAA
      7 +
      8 BDDDDIIIIIIIIIHIIIIIIHEHIIIIIIEHHIGIIHHIIIIIIIIIIIIIIIIIGIIIIIGIIHIIIIIIIIIIIIGHIFHGIIIIIIIIIHIHIEHHIGHHEHGHIHHGHIIIIIIGIIIHHHGEIIHHIIHIHHHHCHHIIIIIIGH
      9 @SN7001394:302:HMMVKBCXY:1:2107:13745:91712
     10 TCAACAATTATCTGAGCGATGTGATATTCTCACCGAAATGCTTAAAAACCAATTGAGATCAAAGCTAGCCAGGAACTTCATCTCATTGATGTCAAGACTCCATATGATATTGCTAGCTGTATGGAAAA
     11 +
     12 HHIIIIIIIIEHIIIIIIIIIIIIIIGIIGIIIIIIIHIIIIIIIHIIIIIIIFHIIIIHIIIIIIHIGHGHIHHGHIFHHHHIIHIGIIIIIIIIGIIIIIHHHHIIIIIIHHIHIIIIIIIIIIII
     13 @SN7001394:302:HMMVKBCXY:1:2107:13745:91712
     14 TCAAACATCAATTGGAGTGCGCAAACTTGGAGAATACATAATACTATGGTTGCCATCGACAATTGATGTTTGATGAAATGCATTTGCATTTAATGTTGATTTGATTTTTCTTGGAGATTTTCAATCGTGAGAGAATCTGATTGTTGTTGGA
     15 +
     16 DDDDDIIIIEHIIHIHIHHHDHIIIIIIHIHIIIHGHIIIIIIIIIGHHIHIIIIHIIIIIIHIHIHCGEGHIHHICHEHHCHHIIIIGIHIIIIEEEHHHCHHHHIHICHHFIIIF?GHIHHHHHIHHHHHHHHIIIIIIIIIIHIIGHC
     17 @SN7001394:302:HMMVKBCXY:1:2111:3028:90075
     18 AGATTGAAAATCGCACAAGGAGTTTCAATATGAAGGGTAGGAAAGCATCTCCTGAACTGAACTCGTATTATTATAACTACTAACGTATTTCTCATTGTATATTTACAGACATTTCATATCGTAATTTT
     19 +
     20 IIHHIIIIIIIIIIIIIIIIIIHHHIIIIGIHIIIIIIIHHIIIHIIIIIHIIIIIIIHHIIIIGHHHHHIIIIIIHIHHHIIIIIHHHHFHHHEHHHIHHHIIIGHHHIIIIHI?FG@HHHIFIHII
     21 @SN7001394:302:HMMVKBCXY:1:2111:3028:90075
     22 TTTTGCGTTTTTTTTGTTGGCTACAAAAATAAAAAAATTATTTTTAAAAAAAAAAAAAGGAGCAAAAAAAAATAAATATTTTTTTTTTTAAAAAAAAAAATTTTTTTTTTTATAATTAATTTTAGTTTTTTTTATGTGGTTTTAAGTTATT
     23 +
     24 0<0001//</<1//</11111111<1<111111<</C111<111<<1DC?F//</<//00111<<C@H/CE/111111<<<GE<CCE</1<1C1//</</1011D//////00000000000<000</0///:://///8//.//////;.
     25 @SN7001394:302:HMMVKBCXY:1:2103:4787:39729
     26 CCATGGCAAGGGAGGAGAACAAGGAAATATGGAGACCGACTTGATTACACCTAGAAGCAATGTGCTGGGCAGATAAAATATGCCACTGCTGTGGCAGAAAAGGGCACATATCAAAAAATTGTAGCAAT
     27 +
     28 IIIIIIIIIIIIIIIHIIGIIIIIHHHIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIHHEHHHIIGHIIIIIIIIIIIIIIIHIIIIIIHIIIIIIGIHIIIIIIIIHIGIHII0

and first lines of barcoded_2.fastq:

      1 @SN7001394:302:HMMVKBCXY:2:2215:18724:23724
      2 GATCAATTTACCAGATATATGTGTCGGCACAACATCAAAATTAAAAAGACCATGGCAAGGACCATATGAGGTGATAGAGGTGACGGAAACGAACCTCAAGGTTCGGAAAAAAGGGAATTTACTGGAGA
      3 +
      4 IEFFEEHGHIHEGEEHHEHFH<<GHHICHHFEHCFCGHHI@@CHEGHDEHHIHEHIIHHGI1D?@GEH?CGCGH@@HFEH1<FHGHECCEHICFC11DGIH<1<<C0DCC0F@=@/<//FH@//CCH/
      5 @SN7001394:302:HMMVKBCXY:2:2215:18724:23724
      6 ATTAATCCTTCCCCGTTTTCTTCAAACTCAGCGTCCTCGCTGAGACCGTTTTTATTTATCATGGTTCCTTTACCCCTTGGCAAAGAAAGTATTAGTGTTTTACTGAGCACCGTCATGGCTTGTTGTGATCGTAATGGATAACTGTGTGATA
      7 +
      8 <<@@<<11CEHE1FE0D1<DG@H<@111DFCE/<D<CCGHEH@1GHEC<0<1<1<11<1D<G1D<GC1<1<1<GHII?<1<C<<E1D1D<<F111<1DC1<1<1<1<<1DFDHEG1@@HH?C111<DE1F1<<C<1<11111<<1<<1111
      9 @SN7001394:302:HMMVKBCXY:2:2105:15021:65237
     10 AGAGATTAATGAACACAGTATTGATCAATGTAGAACATGTAATACTTTATATGGATGACATTTTGATTGCATCAGAATCATGTGAGCAACATTTAATAAACATAGAGAATATGATGTCCAAATTACAA
     11 +
     12 IIIIIIIHHIHIIIIIIHIIIIIIHIHHIIIIIIIIIHHHIIIIIIIIIIHIIHGHIIIIIIIIIIIIIIIIHIIIIIGIIIIIIIIIIIIIIGIIIHHHIIGHHIIIGGHIHHIIIIIIIIIIIIII
     13 @SN7001394:302:HMMVKBCXY:2:2105:15021:65237
     14 TGTTATTGTTTAATGAAAGAAAAACATCTTTGAACAGAGTATGAAACCCATAGTTGCTCGGTATTAATTTTTCCTTTAACGTATCAAATTTTTTATTGATTGTTTCCCCATAGTTTTTGTATTTTCGCTGAATTTTCATCAAGTTTTTTGG
     15 +
     16 DDDBBHIIIIIIIIIIIIIIIIHIIIHIIIIIIIIIIIIIIIIIIIIIHHIIIIIHIIIIIIHHHIHHIGIIIIHHIGIIGHHIIIIIGIFHIIIIIGIIIGHHHHHIIHIIGHHHHIGIIIHIIEHIHIHHIIHHIHHIHIIHHIHH<E<
     17 @SN7001394:302:HMMVKBCXY:2:1205:17368:72333
     18 AATTATAAATATATTGTTAAAAGTTTAATAAGTTTTTAATCTTATCTATAATGTGTGTATCTATACGTGCTTTTGGTGTGTATTATTTTTCTTATATCTTTCTATAATGATATAGGTACTTATTAAAC
     19 +
     20 IHIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIHIIIIIIIHIHIIIIIIIIIIIGIIIEHFEFHHGHDEHHHIGIIIIIIIIGIHIHIIIIIHIIIIIIIHGHIIIIHHHHIIIIIIIF
     21 @SN7001394:302:HMMVKBCXY:2:1205:17368:72333
     22 ACTAAATTAACCATATTTTATTCATTTAATTTTATATTTTATTCTTTTTAAAAATATTTGATTTTATAAAAATACATTAAAATAAACATCTTTTATTCATTTAATTTTATATTTTAAAAATCCTTAAATTTTAGTAATGTATTTTATAAAT
     23 +
     24 DDDDDIHHIIIIIHIIIIIIHHIIHIHIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIHIIIIIIIIIFGHHIIHIIIIIHIIGHHHIIHIHHIIIIIIIDHHIIIHIIIIIHHIHHH<GHHEHHIIIGHEFIHHIIIIIIIIIIIIGHH
     25 @SN7001394:302:HMMVKBCXY:2:2213:3592:77185
     26 ATATTTAAATTTCAAATAATTAATATATTACGTTGTGGTCTTTTATTATACATATATATATATTGCATTTCATTATTATTTCTTGTAATTAAATTAAAAAAATCGTATATTAAAAAAAAAAAAATAAT
     27 +
     28 IFHHHHIIIEEFEGHHGHIFEHH@G1FFEH?ECCCG00CGEGFHFHICFE1CF@C@GHHFHHHH@1<CHE1<GEHHFE<CF1<<1<<CCGFEFH@EGF@ECCG10DFEEGH?@E<C/C/C<C</<C<<

from arcs.

mossishahi commented on June 13, 2024

@lcoombe now I noticed that unfortunately I've forgot to run the simple command:
perl -ne 'chomp; if(/>/){$ct+=1; print ">$ct\n";}else{print "$_\n";} ' < x.fa > x.renamed.fa
and it might be because of that. if it's the reason, whiche steps should be rerun?
should 'bwa index' be done again?

from arcs.

lcoombe commented on June 13, 2024

@mossishahi - Just to be safe, yes I do rename the scaffolds to numbers before running the ARCS pipeline. If you rename the scaffolds, you need to re-run the bwa indexing, alignment and ARCS steps again.

For the read files -- I'm assuming that reads further down the file do have a BX:Z: tag, just not the first few that you posted?

If that doesn't solve the issue, it would be helpful for you to post the exact commands you ran, the log from ARCS and perhaps a few lines from one of your alignment files. It would also be helpful to know the expected size of your genome, the N50 of your draft assembly and the sequencing coverage of your 10x data.

Also just a head's up that I'm on vacation this week so I won't be checking my e-mails regularly until next week.

from arcs.

mossishahi commented on June 13, 2024

@lcoombe I'm really thankful for your attention and sorry because of the distractions four you on your vacation.
I've renamed the scaffolds and re-run the process.
all of the reads have barcodes on their headers.
I hopefully wait for the end of the process and will inform you of the results.

regards.

from arcs.

mossishahi commented on June 13, 2024

@lcoombe , I hope not to distract you on your vacation, I put the issue messages here just to be saved by the time I'm facing them and not to forget them.

during this command: bwa mem -t8 -C -p DjScaff_fnl20141213.renamed.fa barcoded_2.fq.gz | samtools view -Sb - | samtools sort -@8 -n - -o DjScaff_fnl20141213.sorted_2.bam
I noticed a suspicious message:

However there is some commands like:
candidate unique pairs for (FF, FR, RF, RR): (2167, 75752, 649, 1871)
analyzing insert size distribution for orientation FF ...
(25, 50, 75) percentile: (112, 191, 285)
low and high boundaries for proper pairs: (1, 804)
and also some lines like this for FR,RF, RR
but as you see more details in the screen, there was some lines like:
skip orientation FF
skip orientation RF
skip orientation RR
and this code fragment is repeated in the process.

does it indicate a failiure in our process?

from arcs.

mossishahi commented on June 13, 2024

@lcoombe I re-run the pipeline and fortunately now the_original.gvfile is not empty. but while I try to run the makeTSVfile it has below error:

  File "/s/chopin/a/grad/asharifi/e/Applications/arcs/Examples/makeTSVfile.py", line 82
    estDist = True
                 ^
TabError: inconsistent use of tabs and spaces in indentation

kind regards

from arcs.

benvvalk commented on June 13, 2024

@mossishahi This looks like normal behaviour of bwa. I would not worry about it unless it says something like "warning" or "error".

from arcs.

mossishahi commented on June 13, 2024

@benvvalk thanks alot but it seems that the folder of URL you mentioned does not contain Example/makeTSVfile.py and I also couldn't find the file some where else.

from arcs.

benvvalk commented on June 13, 2024

@mossishahi Oh dear, I see you are right. That file is not included in the release tarball.

You can download the latest version of the file from here instead: https://github.com/bcgsc/arcs/blob/master/Examples/makeTSVfile.py

(click the "Raw" link)

from arcs.

lcoombe commented on June 13, 2024

Thanks @benvvalk!

@mossishahi - Were you able to successfully run the ARCS pipeline using the updated makeTSVfile.py script?

from arcs.

mossishahi commented on June 13, 2024

@lcoombe thanks for your follow up. We had a problem with the LINKS, because of the bloomfilter module which I hope to be solved soon.

from arcs.

mossishahi commented on June 13, 2024

@lcoombe I'm grateful for all of your friendly and kind guidance.
It seems our process is done, the result is below:

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.3651  0.1349  0.1349  0.3651  0.3573  0.0000  0.0000  0.2698  0.0515

Main genome scaffold total:             200912
Main genome contig total:               450272
Main genome scaffold sequence total:    1565.210 MB
Main genome contig sequence total:      1006.033 MB     35.725% gap
Main genome scaffold N/L50:             11387/28.531 KB
Main genome contig N/L50:               62048/4.169 KB
Main genome scaffold N/L90:             69220/3.615 KB
Main genome contig N/L90:               275294/907
Max scaffold length:                    760.01 KB
Max contig length:                      406.522 KB
Number of scaffolds > 50 KB:            5411
% main genome in scaffolds > 50 KB:     35.91%


Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                200,912         450,272   1,565,209,624   1,006,032,708    64.27%
    250                200,912         450,272   1,565,209,624   1,006,032,708    64.27%
    500                200,886         450,246   1,565,197,188   1,006,020,272    64.27%
   1 KB                124,602         371,431   1,510,377,283     951,541,159    63.00%
 2.5 KB                 81,422         322,785   1,447,707,331     889,406,218    61.44%
   5 KB                 56,648         276,090   1,356,610,958     833,018,628    61.40%
  10 KB                 36,233         228,443   1,191,766,199     756,926,814    63.51%
  25 KB                 13,403         147,175     836,431,662     562,693,175    67.27%
  50 KB                  5,411          95,363     562,067,174     395,565,653    70.38%
 100 KB                  2,081          54,534     330,835,144     238,452,291    72.08%
 250 KB                    178           8,876      57,545,169      43,193,791    75.06%
 500 KB                      7             468       4,187,947       3,457,694    82.56%

though the detail statistics of our previous draft genome is:

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.3651  0.1349  0.1349  0.3651  0.3572  0.0000  0.0000  0.2698  0.0513

Main genome scaffold total:             202925
Main genome contig total:               450272
Main genome scaffold sequence total:    1565.189 MB
Main genome contig sequence total:      1006.033 MB     35.725% gap
Main genome scaffold N/L50:             13220/27.741 KB
Main genome contig N/L50:               62048/4.169 KB
Main genome scaffold N/L90:             71225/3.615 KB
Main genome contig N/L90:               275294/907
Max scaffold length:                    760.01 KB
Max contig length:                      406.522 KB
Number of scaffolds > 50 KB:            5780
% main genome in scaffolds > 50 KB:     32.61%


Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                202,925         450,272   1,565,189,494   1,006,032,708    64.28%
    250                202,925         450,272   1,565,189,494   1,006,032,708    64.28%
    500                202,899         450,246   1,565,177,058   1,006,020,272    64.28%
   1 KB                126,612         371,427   1,510,354,975     951,539,230    63.00%
 2.5 KB                 83,429         322,776   1,447,679,403     889,398,825    61.44%
   5 KB                 58,645         276,062   1,356,542,964     832,983,212    61.40%
  10 KB                 38,209         228,363   1,191,510,678     756,784,954    63.51%
  25 KB                 14,951         145,472     828,163,693     556,904,021    67.25%
  50 KB                  5,780          86,260     510,443,509     358,821,201    70.30%
 100 KB                  1,589          36,264     222,480,828     160,529,419    72.15%
 250 KB                     37           1,626      12,140,407       9,503,035    78.28%
 500 KB                      2              20       1,479,103       1,463,639    98.95%

the statistics are extracted form data by bbstats command of bbmap

One point we noticed after analyzing the result of Arcs in comparison with the previous draft genome is that Arcs utilizes 10X data to make bigger scaffolds but the total size of the genome has no change, Actually, it doesn't insert the 10X reads in the genome string and only links some scaffolds to make bigger scaffolds.
It's important for us to find a way to do that. Do you have any idea?
the other point that you can see the N50 which is reduced and it's surprising.

Thanks.

from arcs.

lcoombe commented on June 13, 2024

@mossishahi - Glad the pipeline finished!

That is correct - ARCS is purely a scaffolder using 10X reads, so will join sequences together and introduce Ns between the sequences (with or without the gap sizes estimated, depending on if you used the -D option). ARCS looks for sequences that have many 10X barcodes in common as evidence that the sequences are near to each other in the genome, and should thus be joined - It doesn't reconstruct any sequence between the joined contigs. (If you want more information about how the algorithm works, take a look at our paper in Bioinformatics: https://doi.org/10.1093/bioinformatics/btx675).

To attempt to fill in the gaps introduced by ARCS in your scaffolded assembly, you would need to use a downstream gap-filling tool such as Sealer (https://github.com/bcgsc/abyss/tree/master/Sealer).

Yes, that is surprising that the N50 is lower - Do you see the same for the NG50 metric? (To calculate this stat I use abyss-fac from the ABySS tool suite.) I don't know what parameters you played with, but in our runs we commonly vary -a (Higher = less stringent). (Note that in varying parameters for LINKS, such as -a that you only need to re-run the final LINKS step)

from arcs.

mossishahi commented on June 13, 2024

@lcoombe thanks.
as you suggested I ran the abyss-fac, the result:
the new draft:

n       n:500   L50     min     N80     N50     N20     E-size  max     sum     name
200912  199536  9264    500     3920    21174   82789   47857   745447  1.006e9 --

the old draft:

n       n:500   L50     min     N80     N50     N20     E-size  max     sum     name
202925  201549  11026   500     3919    20210   62251   36572   745447  1.006e9 --

now, the N50 is reaonable, but if n:500 meaning is the number of scaffolds longer than 500, why it's decreased? whats the meaning of E-size and sum?

the other point, we didn't run Tigmint before the Arcs.
should we re-run the whole process suggested by the tigmint-make arcs or there is a shortcut to use the previous generated files? (as I noticed there is some differences in commands options, for example in samtoosl view, or bwa mem and etc. )
I'm studying in order to find a way to make toal size of our draft genome bigger. there are some options: one is to use gap-filling tools like sealer (the point is that the draft genome is made of paired-end and mate-pair reads extracted from the library that are different from 10x data library.) the other option is to run a denovo assembly using both the paired-end, mate-pair lib and 10x library (I'm not sure if abyss assembler uses the 10X data only for scaffolding (like the Arcs) or inserts the 10x reads in genome string).
one other idea is to use the draft improved by Tigmint and Arcs and the utilize gap-fillers using paired-end libraries as the inputs.

thanks for your patience.

from arcs.

lcoombe commented on June 13, 2024

Hi @mossishahi,

Since ARCS performs scaffolding (merging sequences), you do expect the number of scaffolds to decrease. The n:500 decreasing just indicates that ARCS was able to merge sequences >=500 bp.
The sum is the total number of bases in scaffolds >= 500bp, and the E-size is the expected size of a randomly chosen scaffold (By randomly choosing a base in the assembly). Note that by default, abyss-fac does NOT count ambiguity codes when calculating these stats.

If you wanted to run Tigmint before ARCS, you can use the same arcs-make file I mentioned above, but use arcs-tigmint as the target instead of arcs. Tigmint does also use BWA alignments of reads to the draft genome, so you could potentially re-use the same alignments as you used for ARCS, but you would need to re-run samtools sort as Tigmint requires the alignments to be sorted by BX:Z tag. However, you would need to re-do all the other Tigmint steps and all the ARCS steps. If the alignments didn't take too long, it it probably simpler to just re-do all the steps.

Yes, any of the options that you mention would potentially work.
For the gap-filling approach, if you are using Sealer, you can use any set of reads that you want - Ie. you could use the PE reads and the 10X reads if you want (we often use this approach for our genome assemblies).
For running a new de novo assembly, take a look at the newest release of ABySS. You can specify all the read sets that you have (PE, MP, 10X), and it will perform the ABySS assembly, then run tigmint and ARCS (https://github.com/bcgsc/abyss#scaffolding-with-linked-reads). It will use the 10X reads for building the de Bruijn graph, so in that sense bases from the 10X reads will be incorporated into the assembly. However, it uses ARCS for scaffolding with the 10X data following the initial ABySS assembly, so these merges will introduce gaps as we discussed above.

Hope that helps!

from arcs.

mossishahi commented on June 13, 2024

@lcoombe I tried to run tigmint-arcs process, but the result was unexpected for me. Surprisingly, the important factors such as N50 and Sum has been decreased by a large magnitude. Is it normal?

n       n:500   L50     min     N80     N50     N20     E-size  max     sum     name
223600  217764  13453   500     3392    16324   51247   31755   745447  1.004e9 DjScaff_fnl20141213.renamed.tigmint.arcs.fa

It's also amazing for me that the factors have been decreased toward the experiment which the Arcs had been only run!

from arcs.

lcoombe commented on June 13, 2024

@mossishahi - So if I understand correctly, you are saying that the N50/sum are lower for Tigmint+ARCS vs. just ARCS? That is possible depending on how many cuts Tigmint made. We do normally see that breaking with Tigmint first does end up increasing the N50 post-ARCS (vs. no Tigmint), but this will likely depend on different factors such as the parameters used for the tools, coverage of 10x data, etc.
Comparing the metrics of baseline, post-Tigmint, post-Tigmint/ARCS, post-ARCS assemblies is a good idea to best understand your results.
Here's the Tigmint biorxiv paper which shows examples of running Tigmint/ARCS on various different assemblies: https://doi.org/10.1101/304253

from arcs.

warrenlr commented on June 13, 2024

We are seeing the behaviour your describe (drop in contiguity) in some genome assemblies, @mossishahi. If you have a way of assessing the assembly correctness (using QUAST and a reference or a close reference, I strongly suggests that you do). This is independent of ARCS, however.

It looks like the original issue you were having (empty _original.gv file) has been resolved. Accordingly, I will now close the ticket. Feel free to open a new issue, as needed.

from arcs.

burri-wildlight commented on June 13, 2024

Hi,

we are facing a similar issue, but with inconsistent outcomes depending on the sample/bam file. We have two 10X libraries from the reference individual, one ca 60x and one ca 15x, and two 10X libraries from individuals from the same species with ca 60x and 25x coverage. However, arcs only outputs a complete *original.gv file for the 60x coverage library of the reference individual, for the others it is empty. We use the exactly same pipeline to produce them. Any insights?
(Note that also for the one that outputs the file and the according tsv table, LINKS fails to produce scaffolds; opened a separate issue on the LINKS page).

Thanks a lot for any insights & help!
Reto

from arcs.

lcoombe commented on June 13, 2024

If you have a fully empty output graph file, that usually indicates an issue with the input file formatting. Take a look at the log files to see if there are any warning messages, and/or where many alignments are being filtered out.
Are you using the provided ARCS makefile (Examples/arcs-make)? I recommend that for issues like this -- using that helps to fix formatting errors that could be causing the empty graph file and perhaps the LINKS issue you are experiencing.

from arcs.

burri-wildlight commented on June 13, 2024

Dear Lauren,

Thanks a lot for your swift reply. Unfortunately, I have to admit that I have pretty much no clue how to run the make pipeline, for which reason I have followed the procedure outlined under issue #9.

ARCS seems to run just fine, with no error messages whatsoever. Same holds for LINKS. One thing I see in the ARCS log is that there seem to be no scaffold-end barcodes retained (see also below):
{ "All_barcodes_unfiltered":2314172, "All_barcodes_filtered":1549248, "Scaffold_end_barcodes":0, "Min_barcode_reads_threshold":5, "Max_barcode_reads_threshold":10000 }

Is this what your question about where alignments are filtered out refers to?

Our aim of using ARCS+LINKS is to scaffold a ca 1.3 Gb PacBio assembly with 1967 scaffolds that is polished with Pilon and Tigmint. The contig N50 of this assembly is 8.6 Mb, with contigs up to 45 Mb. For scaffolding we have 10X data from the same individual (ca 60x) that has been sequenced from UHMW DNA (mean fragment length 200kb) extracted from blood, DNA from the same individual extracted from liver (ca 15x), but could also use similar data from different individuals (which is something I started trying because we got no scaffolding from the former data).

As already mentioned, we are facing two problems:

For our main data with 60x coverage, ARCS seems to work, but LINKS produces no scaffolds
For the other data, the original.gv file is empty, although alignments are produced with the exactly same script.

I have checked both the interleaved reads files and the bam files, and they seem to look fine. Here a couple of lines from one of the individuals on which ARCS fails:

Interleaved fastq.gz:

@A00621:126:HMH3GDSXX:2:2468:19840:5854_AAACACCAGACAATAC TCATAAAAGTTTCTATTACTAGAGTTTAGGATAAATAATAGCAGAACATAAAACAACCAGCCAGGATGTGGCTCCCAAATCCAGAATTTAAATACAGGGATAAAAACTTCCATAAAATATAAATTCTT + FFFFF,:F,:,,FF:F,,,F:,FF:FFFF,:,,,FF,,,,,,F,FFF,,FFFFF,FFFF,F,F,FF:,F,F,:FFFFF,F,,F:,F,,:FF,F:FFFF,F,FFF,,F,,,F:,FFF::FF,FF,FFF, @A00621:126:HMH3GDSXX:2:2468:19840:5854_AAACACCAGACAATAC TCCAAAATGTTTATATAAAATTTATACTGTGAGGAAATGGACAAGATTTTACATTTACATACAAAAATTAGGGGGAAGATCTTTTCATCACATTATAAATTTTATATAAACATTTTGGAAGGGAAATAATATGTGGCAATAAGATAGACTG + FFFFF:F::FFFF:FFFFFFFFFFFF,:FFFFFFFF:FF:F:FFFF::FFFFF,F:FFF::FFFFF:FFFFF:FFF,FF:F:F:FFFFFF::,FF::F:F::FFFFFFF::,,,F,,FF,FFFFFFFFF,:FF,FFF,F,F:,FF,F,FFF @A00621:126:HMH3GDSXX:1:2565:19759:29982_AAACACCAGACAATAC GAAATAAAATATATGCAAACACCAATCCTGGAGAGCTCAAAATGATGCTTCACTACATACCTAAAACACAGACATCACATGTTGCCTCAACACACAACATCGACACTCAATCCCTTCAAAACATCATT + F,FF:,:F::FFFF,,,FFFFFFF:FF,F,,:,:,FFFF,F,F,FFFF,F,,::,:F,FF:FF,,,FFF:,:,:FF:,,,,F,,FFFF:,FFF,F,,,,F,,,,F,F,F,FFFFF,F,,F,F,F,,FF @A00621:126:HMH3GDSXX:1:2565:19759:29982_AAACACCAGACAATAC CTGCTCAGGTCCGTGCAGCTGATCTGAAGCTGCGGGTGTCTGCGTGTGCCGCGGCGCCGAGTGCGTGCGCGCTCCCCCTGCGCTGCACAGCTGCTACCCAAGAAAGGCTAAATATACTCATTCCTCCGCTTTCTGCTGAGAACAGAAACTC + ,F:FFFFF:,,,FFFFF,FF,,,F::FFFFFFF:F::::F,F,FF,,,:FF:,FFFF:FFFFFF,FFFF:FFF::FFFFFFFFFFFFFFF,::F,,FFF::FF:FF:FF::F:FFF,FF,F,FFFFFFF:,FFF,F:::FFFF,FFF:F:

bam file:

A00621:126:HMH3GDSXX:1:1101:1000:12962_AGTTGGTTCACGGTAT 65 289 434235 24 105S23M 191 468286 0 TGTGGTCAGGTGGGGCTGGTCTCCTTCTCCAGGCAGCAACTGACAGAAAGAGAGGACACATTTTCAAGCTGTGTCAAGGGAAAATTAGGTTCGAAATTAGAAAAAAGTTTTTCACAGAAAGAGTGATT A00621:126:HMH3GDSXX:1:1101:1000:12962_AGTTGGTTCACGGTAT 129 191 468286 1 78S57M16S 289 434235 0 CCACAGAATCACTGCAATGTAAGAAAACTTCAAGATCATCGAGTCTAAACCTGCTTTAACACCTGAACTAGAGGATGACACCAAGTGCCATATCCAGCCTTGTTTTAAACACATCCAGGGATGGTGACTCCACCAGCTCCCTGGGCAGACC A00621:126:HMH3GDSXX:1:1101:1000:13275_AGGGAGTGTACCGTTA 69 84 1521596 0 * = 1521596 0 GAAAAATCAGGTACAGAATGAATAACATTAATTAAAAGATTTCTTCAAGAGAGATTCTTTATTTGGACTTCAGTCTTATTTCTATACCCTGATATAGACCTCTGTTGGTCTTATGGTCTAATTTACAG A00621:126:HMH3GDSXX:1:1101:1000:13275_AGGGAGTGTACCGTTA 137 84 1521596 30 25S43M83S = 1521596 0 ACATCTGAGTACAGATTTTTTACTATGTGTGTATATATATATATATATATATATATACACATATATATCTATCTCTCTCTCTCTCAAATTTTTCTTAATATAAAAAGAATTTTCAGAAAATTGTGCTGCTCATAAAAAAAATGAGTCTGTA
Regarding the issue that LINKS does not scaffold any contigs, with e up to 50000 there are mainly single shared barcodes. Still, there are contigs with quite a couple of barcodes. Here the ones with most shared barcodes from the tsv file:

1097-	501+	T	1016	14615	46856	1576606
501-	1097+	T	1016	46856	14615	1576606
1001+	752-	T	622	20798	19404	1576606
752+	1001-	T	622	19404	20798	1576606
1946+	752-	T	587	25938	19404	1576606
752+	1946-	T	587	19404	25938	1576606
1859-	1946-	T	554	17820	25938	1576606
1946+	1859+	T	554	25938	17820	1576606
1859-	752-	T	532	17820	19404	1576606
752+	1859+	T	532	19404	17820	1576606

`
I have run both ARCS with a range of parameters, including different values for c (2,5), m (5-10000), and e (up to 100000). One thing I observe is that with e up to 50000 there are no scaffold-end barcodes. This starts changing from 75000 on, but to no effect on scaffolding.

from arcs.

lcoombe commented on June 13, 2024

Hi @burri-wildlight,

If you go to the README documentation, you see this line:

To run the pipeline in default mode, run Examples/arcs-make arcs. For more info check Examples/arcs-make help.

Running that command Examples/arcs-make help as documented will give you a very detailed description of how to use the Makefile. Using it is much easier and less prone to errors than following the multi-step process, and I suggest that you start there because it is otherwise hard for me to know if the issue is formatting-related or data-related (There are no commands or full logs for me to look at).

Running the makefile is as easy as:
Example: To run arcs with myDraft.fa, myReads.fq.gz (interleaved, processed by longranger basic with the barcode in the BX tag), and a custom c value and multiplicity range, run:

	./arcs-make arcs draft=myDraft reads=myReads m=[User defined multiplicity range] c=[c val]
To ensure that the pipeline runs correctly, make sure that the following tools are in your PATH: bwa, tigmint, samtools, arcs (>= v1.1.0), LINKS (>= v1.8.6)

As stated, any non-default parameters you want are specified there and you can add -n to do a dry-run.

If using the Makefile doesn't solve the problem, maybe open a fresh issue so we can have the discussion on a dedicated thread to keep things more organized.

from arcs.

lcoombe commented on June 13, 2024

I'm very glad you figured it out and thank you for posting your solution! Yes, usually the issues with no scaffolding end up being an issue with one step of the pipeline or file formatting, which is why we made the Makefile in the hope of making these things easier for the user.

Also for the benefit of future users -- without seeing your full arcs-make command, it looks like you probably ran arcs-make arcs draft=oenMen1_1.3.fa reads=reads or something like that. Looking at the example I put above, you'll see that the Makefile expects the draft and reads files to be provided without their respective .fa and .fq.gz suffixes.

from arcs.

empty _original.gv file about arcs HOT 35 CLOSED

Comments (35)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent