Giter VIP home page Giter VIP logo

salsa's People

Contributors

arangrhie avatar edharry avatar feedmewifi avatar ghuryejay avatar godotgildor avatar hyphaltip avatar pickettbd avatar skoren avatar varir avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

salsa's Issues

Intrepreting misassembly output: count misassemblies

Hey,

This is a short question. How do we interpret the misassembly outputs?

If I want to count the number of detected misassemblies, do I count the number of total number of joins, the number of suspicious joins, or the number of scaffolds listed?

And are these numbers cumulative in iterations or should I add them to have the final misassembly number?

Cheers,
Ricardo

-rwxrwx--- 1 guerreiro grp_stich 0 Nov 22 10:34 misasm_2.DONE*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 22 11:04 misasm_3.DONE*
-rwxrwx--- 1 guerreiro grp_stich 339 Nov 22 11:04 misasm_3.log*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 22 11:34 misasm_4.DONE*
-rwxrwx--- 1 guerreiro grp_stich 1569 Nov 22 11:34 misasm_4.log*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 22 12:03 misasm_5.DONE*
-rwxrwx--- 1 guerreiro grp_stich 3417 Nov 22 12:03 misasm_5.log*
-rwxrwx--- 1 guerreiro grp_stich 3166 Nov 22 10:27 misasm_iteration_2.report*
-rwxrwx--- 1 guerreiro grp_stich 3179 Nov 22 10:57 misasm_iteration_3.report*
-rwxrwx--- 1 guerreiro grp_stich 2949 Nov 22 11:27 misasm_iteration_4.report*
-rwxrwx--- 1 guerreiro grp_stich 2563 Nov 22 11:56 misasm_iteration_5.report*

tail misasm_iteration_2.report

scaffold_171
scaffold_5684 #Or maybe I have to count the number of scaffolds that are listed here?
scaffold_2227
scaffold_1463
scaffold_5450
scaffold_517
scaffold_2751
Total Joins = 16
Suspicious Joins = 4
Percent Suspicious Joins = 25

tail misasm_3.log

first part : ['scaffold43|size1912476:E', 'scaffold43|size1912476:B', 'scaffold231|size1220199_2:E', 'scaffold231|size1220199_2:B']
second part : ['scaffold397|size522461:B', 'scaffold397|size522461:E', 'scaffold106|size1423362:B', 'scaffold106|size1423362:E', 'scaffold440|size435153:B', 'scaffold440|size435153:E']

"-t" option

Hi,
I am wondering whether "-t" option refers to 10X Genomics sequencing data? I have a draft assembly generated by ONT sequencing data, 10X Genomics data, and HiC data. Can I take advantage of this "-t" option by using 10X Genomics data when I am planning to scaffolding the draft assembly? Thanks!

How to get scaffold

Currently my data has contigs.fasta , unitigs.fasta and HiC data , and I got the bam file using the https://github.com/ArimaGenomics method.
Next, I ran python run_"pipeline.py -a contigs.fasta -l contigs.fasta.fai -b alignment.bed -e {Your Enzyme} -o scaffolds "and "python run_pipeline.py -a contigs.fasta -l contigs.fasta .fai -b alignment.bed -e {Your Enzyme} -o scaffolds -m yes".
Since there are no unitigs_graph.gfa and contigs_graph.gfa files I ran directly "python stitch.py -c contigs.fasta -u unitigs.fasta -p scaffolds_iteration_x.p -o output_scaffolds.fasta", and I got an error without getting output_scaffolds.fasta, error message Is "Traceback (most recent call last):
   File "/home/SALSA-master/stitch.py", line 46, in
     With open(args.bed,'r') as f:
TypeError: coercing to Unicode: need string or buffer, NoneType found

Can you give me some suggestions?

Key Error?

Greetings,

I was running the SALSA pipeline and recieved the following error:

python /home/software/SALSA/refactor_breaks.py -d scaffolds.v1 -i 5 > scaffolds.v1/misasm_5.log
Traceback (most recent call last):
  File "/home/software/SALSA/get_seq.py", line 84, in <module>
    curr_contig += revcompl(id2seq[curr[0]])
  File "/home/software/SALSA/get_seq.py", line 31, in <lambda>
    revcompl = lambda x: ''.join([{'A':'T','C':'G','G':'C','T':'A','N':'N','R':'N','M':'N','Y':'N','S':'N','W':'N','K':'N','a':'t','c':'g','g':'c','t':'a','n':'n',' ':'',}[B] for B in x][::-1])
KeyError: '>'

The commands used to run SALSA:
python /home/software/SALSA/run_pipeline.py -a falcon_4.0.2_complete.quiver.p_ctgV2.clean.fasta -l falcon_4.0.2_complete.quiver.p_ctgV2.clean.fasta.fai -b alignment.bed -e GATC -o scaffolds.v1

ordering and orientation

Is it possible to use SALSA only for ordering and orientation of contigs and not for clustering?

create_link1.py typo and no output from new_links.py

The create_links1.py argparse description of the -m option claims it requires a .bam file. However, it should say .bed file created from the bamToBed script. Not a big issue but could be confusing to some. I also cannot seem to get any output from the new_links.py script (which is the default in the run.py script) which is why I am trying the create_links1.py to contract the edges/edge weights.

It might also be nice to add in an option for Restriction site sequence to the RE_sites.py script so the user doesn't have to manually change it in the script.

Cheers

canu + salsa

Hi I am using salsa with assemblies generated by canu. Somehow salsa is exiting after 3 iterations when I used either of unitigs or contigs without any scaffolding. I read in another post you mention that is because salsa did not find any more links and probably it is due to the coverage of hic data.

I also ran salsa with a falcon assembly from same data and it ran for 16 iterations and increased longest scaffold from 3mb to 22mb.

I am not sure what exactly is the issue here. Any help please?

Salsa did not write scaffolds_FINAL.fasta

Hi,
I am trying to run Salsa on a mouse (2.7 GB), but it did not write the final scaffolds fasta.
I am running using the following command:

$ python run_pipeline.py -a data/M_musculus_129_v1.assembly.fasta -l data/M_musculus_129_v1.assembly.fasta.fai -b salsa/create_bed/bwa.bed -e DpnII -o scaffolds

I don't get any errors, and the last file written is contig_links_scaled_sorted_iteration_2.
If I check the scaffold_length_iteration_2, the n50 is identical to the original fasta, I guess this means
that no links and no scaffolds have been made, and as a consequence the scaffolds_FINAL.fasta is not printed out?
Below more details.

Thank you very much,
Francesca

The files I get are:
alignment_iteration_1.bed breakpoints_iteration_2.txt contig_links_iteration_2
contig_links_scaled_sorted_iteration_1 misasm_2.DONE re_counts_iteration_2 scaffolds_iteration_1.p
alignment_iteration_2.bed commands.log contig_links_scaled_iteration_1
contig_links_scaled_sorted_iteration_2 misasm_iteration_2.report scaffold_length_iteration_1
assembly.cleaned.fasta contig_links_iteration_1 contig_links_scaled_iteration_2
links_avoid_iteration_2 re_counts_iteration_1 scaffold_length_iteration_2

The output was:

python /lustre/scratch117/sciops/team117/hpag/fg6/bin/SALSA/RE_sites.py -a scaffolds/assembly.cleaned.fasta -e DpnII > scaffolds/re_counts_iteration_1
Starting Iteration 1
python /lustre/scratch117/sciops/team117/hpag/fg6/bin/SALSA/make_links.py -b scaffolds/alignment_iteration_1.bed -d scaffolds -i 1 -x abc
bedfile started
bedfile loaded
python /lustre/scratch117/sciops/team117/hpag/fg6/bin/SALSA/fast_scaled_scores.py -d scaffolds -i 1
sort -k 5 -gr scaffolds/contig_links_scaled_iteration_1 > scaffolds/contig_links_scaled_sorted_iteration_1
python /lustre/scratch117/sciops/team117/hpag/fg6/bin/SALSA/layout_unitigs.py -x abc -u abc -t abc -l scaffolds/contig_links_scaled_sorted_iteration_1 -c 1000 -i 1 -d scaffolds
Loading Hi-C links
Finished loading Hi-C links, Loading unitig links now..
Hybrid scaffold graph loaded, nodes = 5412 edges = 2724
Hi-C implied edges = 0
Unitig tiling implied edges = 0
Assembly graph implied edges = 0
10x implied edges = 0
/lustre/scratch117/sciops/team117/hpag/fg6/bin/SALSA/break_contigs -a scaffolds/alignment_iteration_2.bed -b scaffolds/breakpoints_iteration_2.txt -l scaffolds/scaffold_length_iteration_2 -i 2 > scaffolds/misasm_iteration_2.report
python /lustre/scratch117/sciops/team117/hpag/fg6/bin/SALSA/refactor_breaks.py -d scaffolds -i 2
Starting Iteration 2
python /lustre/scratch117/sciops/team117/hpag/fg6/bin/SALSA/make_links.py -b scaffolds/alignment_iteration_2.bed -d scaffolds -i 2
bedfile started
bedfile loaded
Starting Iteration 2

Number of iterations is always 3, even with -i flag

Hi,

The number of iterations SALSA runs is always 3, even though I pass it the -i flag with different number of iterations. The exact command I ran and the output is below:

sudo python run_pipeline.py -a /home/shourj/mylerlab/canu.contigs.chromosomes.fasta -l /home/shourj/mylerlab/canu.contigs.chromosomes.fasta.fai -b /home/shourj/mylerlab/canu_arima.bed -e GATC -o scaffoldTEST -i 35 -m yes
bedfile loaded
python /home/shourj/SALSA/RE_sites.py -a scaffoldTEST/assembly.cleaned.fasta -e GATC > scaffoldTEST/re_counts_itera
tion_1
Starting Iteration 1
python /home/shourj/SALSA/make_links.py -b scaffoldTEST/alignment_iteration_1.bed -d scaffoldTEST -i 1 -x abc
bedfile started
bedfile loaded
python /home/shourj/SALSA/fast_scaled_scores.py -d scaffoldTEST -i 1
sort -k 5 -gr scaffoldTEST/contig_links_scaled_iteration_1 > scaffoldTEST/contig_links_scaled_sorted_iteration_1
python /home/shourj/SALSA/layout_unitigs.py -x abc -u abc -t abc -l scaffoldTEST/contig_links_scaled_sorted_iterati
on_1 -c 1000 -i 1 -d scaffoldTEST
Loading Hi-C links
Finished loading Hi-C links, Loading unitig links now..
Hybrid scaffold graph loaded, nodes = 134 edges = 105
Hi-C implied edges = 0
Unitig tiling implied edges = 0
Assembly graph implied edges = 0
10x implied edges = 0
/home/shourj/SALSA/break_contigs -a scaffoldTEST/alignment_iteration_2.bed -b scaffoldTEST/breakpoints_iteration_2.
txt -l scaffoldTEST/scaffold_length_iteration_2 -i 2 > scaffoldTEST/misasm_iteration_2.report
python /home/shourj/SALSA/refactor_breaks.py -d scaffoldTEST -i 2
Starting Iteration 2
python /home/shourj/SALSA/make_links.py -b scaffoldTEST/alignment_iteration_2.bed -d scaffoldTEST -i 2
bedfile started
bedfile loaded
Starting Iteration 2
python /home/shourj/SALSA/layout_unitigs.py -x abc -u abc -t abc -l scaffoldTEST/contig_links_scaled_sorted_iterati
on_2 -c 1000 -i 2 -d scaffoldTEST
Loading Hi-C links
Finished loading Hi-C links, Loading unitig links now..
Hybrid scaffold graph loaded, nodes = 36 edges = 27
Hi-C implied edges = 0
Unitig tiling implied edges = 0
Assembly graph implied edges = 0
10x implied edges = 0
/home/shourj/SALSA/break_contigs -a scaffoldTEST/alignment_iteration_3.bed -b scaffoldTEST/breakpoints_iteration_3.
txt -l scaffoldTEST/scaffold_length_iteration_3 -i 3 > scaffoldTEST/misasm_iteration_3.report
python /home/shourj/SALSA/refactor_breaks.py -d scaffoldTEST -i 3 > scaffoldTEST/misasm_3.log
Starting Iteration 3
python /home/shourj/SALSA/make_links.py -b scaffoldTEST/alignment_iteration_3.bed -d scaffoldTEST -i 3
bedfile started
bedfile loaded
Starting Iteration 3
python /home/shourj/SALSA/layout_unitigs.py -x abc -u abc -t abc -l scaffoldTEST/contig_links_scaled_sorted_iterati
on_3 -c 1000 -i 3 -d scaffoldTEST
Loading Hi-C links
Finished loading Hi-C links, Loading unitig links now..
Hybrid scaffold graph loaded, nodes = 16 edges = 11
Hi-C implied edges = 0
Unitig tiling implied edges = 0
Assembly graph implied edges = 0
10x implied edges = 0
/home/shourj/SALSA/break_contigs -a scaffoldTEST/alignment_iteration_4.bed -b scaffoldTEST/breakpoints_iteration_4.
txt -l scaffoldTEST/scaffold_length_iteration_4 -i 4 > scaffoldTEST/misasm_iteration_4.report
python /home/shourj/SALSA/refactor_breaks.py -d scaffoldTEST -i 4 > scaffoldTEST/misasm_4.log

Am i just running the command wrong?

Thanks!

about input file

Hello,
In the process of assembling contig, I did not produce the file of contigs_graph.gfa and unitigs_graph.gfa, only the data of contig and HIC. How should I assemble the scaffold with SALSA?
best wishes

New version parameters

Hi Machinegun,
I have a similar issue as the one reported in #10. That one was related to an older version of SALSA. Is there a way to apply the same changes you suggested also in the new code?
Thanks in advance

Inconsistency with Hi-C scaffolds vs. genetic map

I'm working with a dataset in which we have a relatively good genetic map derived from an F2 panel. It seems like the Hi-C is doing well in clustering contigs into scaffold groups that are consistent with linkage groups but within these groupings there can be quite a lot of disagreement. Is there a way to tinker with stringency parameters in order to determine what might be going wrong with the Hi-C scaffolding?

Lots of missing data

Hey

I think I need to figure out which params to adjust... I have 66 contigs, total length 1.8Mb

When I run SALSA, I end up with 3 scaffolds (YAY!) but with only 400kb length (boo!)

The AGP shows that each scaffold is a merging of two pairs of contigs (OK...) but the new_links_sorted file suggests there are 1938 unique links between contigs

These 3 pairs did have hundreds of links between them, but they were by no means the highest and other pairs had as high or higher number of links.

Any idea what might be happening?

Here is the AGP:

scaffold_1      1       62436   1       W       NODE_225_length_62436_cov_43.0636       1       62436   +
scaffold_1      62437   62936   2       N       500     scaffold        yes     na
scaffold_1      62937   171443  3       W       NODE_27_length_108507_cov_43.5609       1       108507  -
scaffold_1      171444  171943  4       N       500     scaffold        yes     na
scaffold_2      1       91316   1       W       NODE_59_length_91316_cov_42.7408        1       91316   +
scaffold_2      91317   91816   2       N       500     scaffold        yes     na
scaffold_2      91817   204445  3       W       NODE_25_length_112629_cov_42.9302       1       112629  -
scaffold_2      204446  204945  4       N       500     scaffold        yes     na
scaffold_3      1       56446   1       W       NODE_287_length_56446_cov_44.2677       1       56446   +
scaffold_3      56447   56946   2       N       500     scaffold        yes     na
scaffold_3      56947   89705   3       W       NODE_1218_length_32759_cov_43.2384      1       32759   -
scaffold_3      89706   90205   4       N       500     scaffold        yes     na

Here is the head of new_links_sorted

NODE_225_length_62436_cov_43.0636:E NODE_27_length_108507_cov_43.5609:E 3.0930812669 226
NODE_184_length_66629_cov_43.5219:E NODE_27_length_108507_cov_43.5609:E 3.05840488864 273
NODE_25_length_112629_cov_42.9302:E NODE_59_length_91316_cov_42.7408:E 3.04631625286 383
NODE_25_length_112629_cov_42.9302:E NODE_27_length_108507_cov_43.5609:E 2.79715011944 310
NODE_27_length_108507_cov_43.5609:E NODE_59_length_91316_cov_42.7408:E 2.57322847893 266
NODE_184_length_66629_cov_43.5219:E NODE_59_length_91316_cov_42.7408:E 2.32333197141 242
NODE_184_length_66629_cov_43.5219:E NODE_25_length_112629_cov_42.9302:E 2.26670520893 253
NODE_158_length_69544_cov_43.9225:E NODE_27_length_108507_cov_43.5609:E 2.12493864459 149
NODE_225_length_62436_cov_43.0636:E NODE_25_length_112629_cov_42.9302:E 1.80256030452 172
NODE_242_length_60045_cov_43.2957:E NODE_27_length_108507_cov_43.5609:E 1.78590949468 195

Cheers
Mick

correct_links aborted: Assertion `(std::size_t)i < pm.n' failed.

Hi,
I am running SALSA on a GFA created by Flye when correct_links crashes. Stderr:
bedfile loaded Starting Iteration 1 bedfile started bedfile loaded Done loading GFA file Number of Nodes = 27928 Number of Edges = 461997442 correct_links: /usit/abel/u1/olekto/miniconda2/include/boost/graph/two_bit_color_map.hpp:86: void boost::put(const boost::two_bit_color_map<IndexMap>&, typename boost::property_traits<PMap>::key_type, boost::two_bit_color_type) [with IndexMap = boost::vec_adj_list_vertex_id_map<boost::property<boost::vertex_name_t, std::__cxx11::basic_string<char> >, long unsigned int>; typename boost::property_traits<PMap>::key_type = long unsigned int]: Assertion (std::size_t)i < pm.n' failed.
/bin/sh: line 1: 85700 Aborted (core dumped) /projects/cees/bin/SALSA/correct_links -g
assembly_graph.gfa -l scaffolds/contig_links_scaled_sorted_iteration_1 > scaffolds/tmp.links`

This is the output of stderr:
python /path/RE_sites.py -a scaffolds/assembly.cleaned.fasta -e GATC > scaffolds/re_counts_iteration_1 python /path/make_links.py -b scaffolds/alignment_iteration_1.bed -d scaffolds -i 1 -x abc python /pathfast_scaled_scores.py -d scaffolds -i 1 sort -k 5 -gr scaffolds/contig_links_scaled_iteration_1 > scaffolds/contig_links_scaled_sorted_iteration_1

Do you know what the issue might be? How many edges could I expect?

Thank you.

Ole

how to calculate the score value for contig length shorter than 300,000bp?

In the scripts :new_links.py ,there are two assignments:l1 = 300000.0 l2 = 300000.0;if my contig length short than 300,000bp, does the l1 and l2 value is meaningful?

if prev_attrs[0] != attrs[0]:
prev_read = prev_attrs[3]
curr_read = attrs[3]
#print prev_read.split('/')[0]
if prev_read.split('/')[0] == curr_read.split('/')[0]:
#print 'here'
pos1 = (long(prev_attrs[1]) + long(prev_attrs[2]))/2.0
pos2 = (long(attrs[1]) + long(attrs[2]))/2.0
l1 = 300000.0
l2 = 300000.0

                    first = ''
                    second = ''
                    if prev_attrs[0] < attrs[0]:
                        first = prev_attrs[0]
                        second = attrs[0]
                    else:
                        first = attrs[0]
                        second = prev_attrs[0]
                        pos1,pos2 = pos2, pos1

It does not iterate beyond 3 !

I tried -i 10 but it stop at 3rd iteration !

➜  SALSA git:(master) ✗ python2 run_pipeline.py -a /media/urbe/MyPassport/ONTPoreChopped/concatPacBioONT50K/mince/A_plus_collapsed.fa -l /media/urbe/MyPassport/ONTPoreChopped/concatPacBioONT50K/mince/A_plus_collapsed.fa.fai -b /media/urbe/MyCDrive/JitDATA/adineta_reads/alignment.bed -e GATC -o scaffolds -i 10 -c 100

bedfile loaded
python2 /home/urbe/Tools/SALSA/RE_sites.py -a scaffolds/assembly.cleaned.fasta -e GATC > scaffolds/re_counts_iteration_1
Starting Iteration 1
python2 /home/urbe/Tools/SALSA/make_links.py -b scaffolds/alignment_iteration_1.bed -d scaffolds -i 1 -x abc
bedfile started
bedfile loaded
python2 /home/urbe/Tools/SALSA/fast_scaled_scores.py -d scaffolds -i 1
sort -k 5 -gr scaffolds/contig_links_scaled_iteration_1 > scaffolds/contig_links_scaled_sorted_iteration_1
python2 /home/urbe/Tools/SALSA/layout_unitigs.py -x abc -l scaffolds/contig_links_scaled_sorted_iteration_1 -c 100 -i 1 -d scaffolds
Loading Hi-C links 
Hybrid scaffold graph loaded, nodes = 0 edges = 0
Hi-C implied edges = 0
/home/urbe/Tools/SALSA/break_contigs -a scaffolds/alignment_iteration_2.bed -b scaffolds/breakpoints_iteration_2.txt -l scaffolds/scaffold_length_iteration_2 -i 2 -s 100   > scaffolds/misasm_iteration_2.report
python2 /home/urbe/Tools/SALSA/refactor_breaks.py -d scaffolds -i 2
Starting Iteration 2
python2 /home/urbe/Tools/SALSA/make_links.py -b scaffolds/alignment_iteration_2.bed -d scaffolds -i 2
bedfile started
bedfile loaded
Starting Iteration 2
python2 /home/urbe/Tools/SALSA/layout_unitigs.py -x abc -l scaffolds/contig_links_scaled_sorted_iteration_2 -c 100 -i 2 -d scaffolds
Loading Hi-C links 
Hybrid scaffold graph loaded, nodes = 0 edges = 0
Hi-C implied edges = 0
/home/urbe/Tools/SALSA/break_contigs -a scaffolds/alignment_iteration_3.bed -b scaffolds/breakpoints_iteration_3.txt -l scaffolds/scaffold_length_iteration_3 -i 3 -s 100  > scaffolds/misasm_iteration_3.report
python2 /home/urbe/Tools/SALSA/refactor_breaks.py -d scaffolds -i 3 > scaffolds/misasm_3.log

Interestingly, some of the intermediate files were empty ! Is that alright ?

➜  scaffolds git:(master) ✗ ls -lh
total 4,6G
-rw-rw-r-- 1 urbe urbe 4,4G Okt  2 10:10 alignment_iteration_1.bed
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:13 alignment_iteration_2.bed
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:16 alignment_iteration_3.bed
-rw-rw-r-- 1 urbe urbe 114M Okt  2 10:10 assembly.cleaned.fasta
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:11 breakpoints_iteration_2.txt
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:13 breakpoints_iteration_3.txt
-rw-rw-r-- 1 urbe urbe 2,1K Okt  2 10:16 commands.log
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:11 contig_links_iteration_1
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:13 contig_links_iteration_2
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:11 contig_links_scaled_iteration_1
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:13 contig_links_scaled_iteration_2
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:11 contig_links_scaled_sorted_iteration_1
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:13 contig_links_scaled_sorted_iteration_2
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:08 input_breaks
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:13 links_avoid_iteration_2
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:16 links_avoid_iteration_3
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:13 misasm_2.DONE
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:16 misasm_3.DONE
-rw-rw-r-- 1 urbe urbe    0 Okt  2 10:16 misasm_3.log
-rw-rw-r-- 1 urbe urbe   69 Okt  2 10:13 misasm_iteration_2.report
-rw-rw-r-- 1 urbe urbe   69 Okt  2 10:16 misasm_iteration_3.report
-rw-rw-r-- 1 urbe urbe  34K Okt  2 10:10 re_counts_iteration_1
-rw-rw-r-- 1 urbe urbe  26K Okt  2 10:13 re_counts_iteration_2
-rw-rw-r-- 1 urbe urbe  26K Okt  2 10:16 re_counts_iteration_3
-rwxrwxr-x 1 urbe urbe  34K Okt  2 10:10 scaffold_length_iteration_1
-rw-rw-r-- 1 urbe urbe  25K Okt  2 10:13 scaffold_length_iteration_2
-rw-rw-r-- 1 urbe urbe  25K Okt  2 10:16 scaffold_length_iteration_3
-rw-rw-r-- 1 urbe urbe  72K Okt  2 10:16 scaffolds_FINAL.agp
-rw-rw-r-- 1 urbe urbe 114M Okt  2 10:16 scaffolds_FINAL.fasta
-rw-rw-r-- 1 urbe urbe  45K Okt  2 11:13 scaffolds_FINAL.fasta.fai
-rw-rw-r-- 1 urbe urbe 120K Okt  2 10:13 scaffolds_iteration_1.p
-rw-rw-r-- 1 urbe urbe 120K Okt  2 10:16 scaffolds_iteration_2.p

Did I do anything wrong ?
Note: My HiC reads are just 33*2

I can not get the scaffolds.fasta

I made the alignment_unique.bed, contig_length_new, new_links_sorted and RE_counts files.
There are 32 lines in the new_links_sorted file.
And then, I run the command :
python links_to_graph3.py -a assembled_contigs.fa -f scaffolds.fasta -l new_links_sorted -n contig_length_new -o scaffold.agp -c 1 > paths

But, the result files (ambigous_contigs, paths, scaffold.agp, scaffolds.fasta) are empty.
How can I get the fasta file containing final scaffolds?

Error in links_to_graph3.py", line 413

Got the first error ;)

Seems related to lower and upper case bases int the contig file. Test with upper-cases only is running.

started scaffolding
Traceback (most recent call last):
File "links_to_graph3.py", line 413, in
main()
File "links_to_graph3.py", line 402, in main
curr_contig += revcompl(id2seq[curr[0]])
File "links_to_graph3.py", line 7, in
revcompl = lambda x: ''.join([{'A':'T','C':'G','G':'C','T':'A'}[B] for B in x][::-1])
KeyError: 'a'
done scaffolding, sequences written to file

GFA file loaded with Nodes = 0 and Edges = 0

I have used SALSA before with GFA file, and it seemed like it worked properly.

This time I used SALSA with a idba_ud assembly and the GFA file obtained with the script IDBA-to-GFA.

I get the following :

('/home/nadege/meta_romania/salsa_scaffolding_AATT_idba_with_gfa/salsa_scaffolding_idba_with_gfa.alignment.bed', '/home/nadege/meta_romania/salsa_scaffolding_AATT_idba_with_gfa/scaffolds/alignment_iteration_1.bed')
('/home/nadege/meta_romania/salsa_scaffolding_AATT_idba_with_gfa/salsa_scaffolding_idba_with_gfa.alignment.bed', '/home/nadege/meta_romania/salsa_scaffolding_AATT_idba_with_gfa/scaffolds/alignment_iteration_1.bed')
bedfile loaded
python /home/nadege/SALSA/RE_sites.py -a /home/nadege/meta_romania/salsa_scaffolding_AATT_idba_with_gfa/scaffolds/assembly.cleaned.fasta -e AATT > /home/nadege/meta_romania/salsa_scaffolding_AATT_idba_with_gfa/scaffolds/re_counts_iteration_1
Starting Iteration 1
python /home/nadege/SALSA/make_links.py -b /home/nadege/meta_romania/salsa_scaffolding_AATT_idba_with_gfa/scaffolds/alignment_iteration_1.bed -d /home/nadege/meta_romania/salsa_scaffolding_AATT_idba_with_gfa/scaffolds -i 1 -x abc
bedfile started
bedfile loaded
python /home/nadege/SALSA/fast_scaled_scores.py -d /home/nadege/meta_romania/salsa_scaffolding_AATT_idba_with_gfa/scaffolds -i 1
sort -k 5 -gr /home/nadege/meta_romania/salsa_scaffolding_AATT_idba_with_gfa/scaffolds/contig_links_scaled_iteration_1 > /home/nadege/meta_romania/salsa_scaffolding_AATT_idba_with_gfa/scaffolds/contig_links_scaled_sorted_iteration_1
Done loading GFA file
Number of Nodes = 0
Number of Edges = 0
python /home/nadege/SALSA/layout_unitigs.py -x /home/nadege/meta_romania/contig-100.gfa -l /home/nadege/meta_romania/salsa_scaffolding_AATT_idba_with_gfa/scaffolds/contig_links_scaled_sorted_iteration_1 -c 1000 -i 1 -d /home/nadege/meta_romania/salsa_scaffolding_AATT_idba_with_gfa/scaffolds
Loading Hi-C links
Hybrid scaffold graph loaded, nodes = 876 edges = 626
Hi-C implied edges = 0

So it says that the GFA is loaded, but with Nodes = 0 and Edges = 0, which I then understand as the GFA not being loaded properly? What I find even more surprising is that I get different outputs with and without the GFA.

Genome sequence always have lowercase ’n‘ error

The Genome sequence always have lowercase ’n‘, however the code did not have:
revcompl = lambda x: ''.join([{'A':'T','C':'G','G':'C','T':'A','N':'N','R':'N','M':'N','Y':'N','S':'N','W':'N','K':'N','a':'t','c':'g','g':'c','t':'a',' ':'',}[B] for B in x][::-1])
This is a bug. Hope the author fixed!

make error

I tried compiling the code provided on a CentOS 7 machine. Initial run of make results in the following:

Makefile:2: *** missing separator (did you mean TAB instead of 8 spaces?). Stop.

replacement of the spaces with a tab in the Makefile results in the following error:

g++ break_contigs.cpp -std=c++11 -o break_contigs break_contigs.cpp:12:21: fatal error: cmdline.h: No such file or directory #include "cmdline.h" ^ compilation terminated. make: *** [all] Error 1

How to generate the interaction picture after scaffolding?

Hi,
How to generate the interaction picture after scaffolding?
I want to ask is there file/files in the output available to make the interaction picture? Or do I need to align the HiC reads to the scaffold and then make the picture by my self?

multiple libraries with different motifs

If I have two libraries for one sample, one from Arima (motif=GATC,GANTC) and one from Dovetail (motif=GATC), can give both sets to SALSA in one run (combine the bed files as input to SALSA and set motif=GATC,GANTC), or would it be better to run serially?

mapping example

Hi,
I have the following Hi-C data

> ls -1
N_Ben_HiC2.R1.fq.gz
N_Ben_HiC2.R2.fq.gz
N_Ben_HiC4.R1.fq.gz
N_Ben_HiC4.R2.fq.gz

How to align all the above files with ArimaGenomics pipeline (I have also asked here) or do you recommend something different?
Do I have to change the filenames?
Will ArimaGenomics recognize all of my 4 files and align them in one go?

Thank you in advance.

Michal

Large discrepancies between versions within the last 30 days.

Hi there,

First of all, I want to say say thank you for this program! Its pretty great!

However, I am having a lot of trouble with the discrepancy between the different updates within the last 25 days. For my scaffolding, of the genome I working with. I am getting differences between the N50 and largest scaffold size for the assemblies when using the latest code, code posted 10 days ago and code posted 25 days ago. I am not sure what to trust as to what is real.

For example, with the break code enable and and un-enabled..

I get the following.

TODAY: with break (-m yes)
sum = 131656186, n = 115, ave = 1144836.40, largest = 20897536
N50 = 3495147, n = 11

TODAY WITHOUT BREAK
sum = 131656186, n = 115, ave = 1144836.40, largest = 20897536
N50 = 3495147, n = 11

WITH BREAK 10 DAYS
sum = 131659186, n = 109, ave = 1207882.44, largest = 16599318
N50 = 4168397, n = 10

WITHOUTBREAK 10 DAYS
sum = 131659186, n = 109, ave = 1207882.44, largest = 16599318
N50 = 4168397, n = 10

WITH BREAK 30 DAYS
sum = 131649371, n = 101, ave = 1303459.12, largest = 46276934
N50 = 5032004, n = 4

WITHOUTBREAK 30 DAYS
sum = 131637371, n = 115, ave = 1144672.79, largest = 28565372
N50 = 4576694, n = 8

I am just not sure what update to trust. The latest? Or the previosus?

Thanks for all the help!

bam2Bed & contig lengths

I just started to play around with the tool. Couple of things popped up.

  1. Where to report issues and commit/send pull requests to machinegun/hi-c-scaffold or pb-jchin/hi-c-scaffold?
  2. Biopython is missing in the requirements section.
  3. I had to manually index the contigs (samtools faidx and extract the contig lenghts) and call bam2Bed since the code block doing it is commented out.

I have plenty of data sets for different species (some Dovetail'd) and happy to benchmark your tool 👍

No scaffolds_FINAL.fasta

This issue seems to be the same as #16 . However, I have already used enzyme sequences:

python run_pipeline.py -a D2.contig.fa -l D2.contig.fa.fai -b HiC.bed -e AAGCTT -o scaffolds -m yes

I only get one iteration and get a assembly.cleaned.fasta file (by default there should be 3 iterations), which is almost the same as D2.contig.fa. Any other possible solutions?

list index out of range

Hi all:
when I run this protocol, I get a error:
bedfile loaded
Starting Iteration 1
bedfile started
bedfile loaded
Loading Hi-C links
Finished loading Hi-C links, Loading unitig links now..
Hybrid scaffold graph loaded, nodes = 1268 edges = 1059
Hi-C implied edges = 0
Unitig tiling implied edges = 0
Assembly graph implied edges = 0
10x implied edges = 0
Traceback (most recent call last):
File "/public1/home/stu_yanhansong/software/Hi-C/SALSA/layout_unitigs.py", line 1056, in
update_bed(expanded_scaffold_paths)
File "/public1/home/stu_yanhansong/software/Hi-C/SALSA/layout_unitigs.py", line 801, in update_bed
first = prev_attrs[3].split('/')[1]
IndexError: list index out of range
any suggestion?
thanks
FAFU
Hansong Yan

ValueError: need more than 1 value to unpack

Hello!

Just trying this for the first time.

Output is:

convertng bam file to bed file
finished conversion
finished conversion
started generating links between contigs
bedfile started
bedfile loaded
done generating links between contigs
started scaffolding
Traceback (most recent call last):
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/SALSA/links_to_graph3.py", line 457, in <module>
    main()
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/SALSA/links_to_graph3.py", line 103, in main
    v1, v2 = row[0:2]
ValueError: need more than 1 value to unpack
done scaffolding, sequences written to file

Output folder contains

-rw-r--r-- 1 mwatson9 datastore_eb_groups_watson_grp 65854190 Sep  1 19:19 alignment_unique.bed
-rw-r--r-- 1 mwatson9 datastore_eb_groups_watson_grp     2622 Sep  1 19:19 contig_length_new
-rw-r--r-- 1 mwatson9 datastore_eb_groups_watson_grp     2526 Sep  1 19:19 RE_counts
-rw-r--r-- 1 mwatson9 datastore_eb_groups_watson_grp 32423329 Sep  1 19:19 new_links
-rw-r--r-- 1 mwatson9 datastore_eb_groups_watson_grp 32423329 Sep  1 19:20 new_links_sorted
-rw-r--r-- 1 mwatson9 datastore_eb_groups_watson_grp        0 Sep  1 19:20 paths

Any idea what is causing this?

Thanks
Mick

layout_unitigs.py Key Error

I'm running the program in a python 2.7 virtualenv on a cluster, and it produces initial output to STDERR that seems promising:
bedfile loaded
Starting Iteration 1
Loading Hi-C links
Finished loading Hi-C links, Loading unitig links now..
Hybrid scaffold graph loaded, nodes = 0 edges = 0
Hi-C implied edges = 0
Unitig tiling implied edges = 0
Assembly graph implied edges = 0
10x implied edges = 0
After this point the output directory contains files called assembly_cleaned.fasta, commands.log, scaffolds_iteration_1.p, and scaffold_length_iteration_1 with non-zero sizes, as well as several others with size 0. Shortly after that point, though, the job fails with exit code 1, and the error output contains the messages
Traceback (most recent call last):
File "/home/rwhetten/tools/SALSA-2.0/layout_unitigs.py", line 1056, in
update_bed(expanded_scaffold_paths)
File "/home/rwhetten/tools/SALSA-2.0/layout_unitigs.py", line 752, in update_bed
scaffold_re[key] = re_counts[contig]
KeyError: 'contig_12719214'
Grep finds a sequence with the header line >contig_12719214 in the assembly_cleaned.fasta file, but the sequence is very short (<100 nt) and does not contain a recognition site for the enzyme used to prepare the HiC library. Should the reference genome sequence file be filtered to remove short contigs and those without restriction enzyme recognition sites before running SALSA2?

FINAL.fasta generated from penultimate iteration.

Hi,

Thanks for this tool.

I have been using the latest version to scaffold an assembly, however I have noticed that the statistics of the output assembly do not equal the statistics of the final iteration reported. Is there are reason for this, or could SALSA have continued into additional iterations?

I notice in previous issues that it has been mentioned that the -i parameter is now overwritten and iterations continue until the data is exhausted (#44 (comment) & #24 (comment)), but the example below would suggest that further scaffolding could be performed and that there is evidence for the largest scaffold still increasing by 35%. My command did not specify number of iterations.

Would appreciate any help you can give.

Thanks

Example data:

#stats.sh from bbmap shows longest scaffold ~4.5 Mb and assembly contains 5064 scaffolds.
$ stats.sh scaffolds/scaffolds_FINAL.fasta
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2690  0.2310  0.2312  0.2688  0.0065  0.0000  0.0000  0.4623  0.0351

Main genome scaffold total:             5064
Main genome contig total:               7811
Main genome scaffold sequence total:    126.785 MB
Main genome contig sequence total:      125.957 MB      0.653% gap
Main genome scaffold N/L50:             53/597.009 KB
Main genome contig N/L50:               469/74.661 KB
Main genome scaffold N/L90:             854/12.856 KB
Main genome contig N/L90:               2220/8.207 KB
Max scaffold length:                    4.526 MB
Max contig length:                      697.806 KB
Number of scaffolds > 50 KB:            282
% main genome in scaffolds > 50 KB:     78.55%


Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                  5,064           7,811     126,785,198     125,957,098    99.35%
    500                  5,064           7,811     126,785,198     125,957,098    99.35%
   1 KB                  5,058           7,805     126,779,354     125,951,254    99.35%
 2.5 KB                  2,510           4,990     122,884,307     122,080,714    99.35%
   5 KB                  1,541           3,800     119,528,414     118,790,761    99.38%
  10 KB                    986           3,001     115,618,480     114,968,507    99.44%
  25 KB                    532           2,294     108,290,175     107,756,368    99.51%
  50 KB                    282           1,824      99,591,345      99,140,714    99.55%
 100 KB                    177           1,515      92,416,321      92,043,085    99.60%
 250 KB                     91           1,157      79,149,858      78,866,455    99.64%
 500 KB                     61             972      67,866,447      67,631,495    99.65%
   1 MB                     25             577      42,364,736      42,229,476    99.68%
 2.5 MB                      3             121      10,664,405      10,638,478    99.76%

#Longest scaffold from iteration 5 is larger and contains fewer scaffolds.
$ wc -l scaffolds/scaffold_length_iteration_5 && sort -nrk2,2 scaffolds/scaffold_length_iteration_5 | head -1
4970 scaffolds/scaffold_length_iteration_5
scaffold_1643   6128428

#Longest scaffold from iteration 4 more similar to the output assembly. Scaffold count the same.
$ wc -l scaffolds/scaffold_length_iteration_4 && sort -nrk2,2 scaffolds/scaffold_length_iteration_4 | head -1
5064 scaffolds/scaffold_length_iteration_4
scaffold_3642   4518477

The original SALSA command was:

python run_pipeline.py -a jelly.out.fasta -l jelly.out.fasta.fai -b alignment.bed -e GATC -o scaffolds &> salsa.log

The log file shows that iteration 5 began, but maybe didn't complete?:

bedfile loaded
Starting Iteration 1
bedfile started
bedfile loaded
Loading Hi-C links
Hybrid scaffold graph loaded, nodes = 6444 edges = 4125
Hi-C implied edges = 0
Starting Iteration 2
bedfile started
bedfile loaded
Starting Iteration 2
Loading Hi-C links
Hybrid scaffold graph loaded, nodes = 4398 edges = 2470
Hi-C implied edges = 0
Starting Iteration 3
bedfile started
bedfile loaded
Starting Iteration 3
Loading Hi-C links
Hybrid scaffold graph loaded, nodes = 4364 edges = 2346
Hi-C implied edges = 0
Starting Iteration 4
bedfile started
bedfile loaded
Starting Iteration 4
Loading Hi-C links
Hybrid scaffold graph loaded, nodes = 4270 edges = 2248
Hi-C implied edges = 0
python RE_sites.py -a scaffolds/assembly.cleaned.fasta -e GATC > scaffolds/re_counts_iteration_1
python make_links.py -b scaffolds/alignment_iteration_1.bed -d scaffolds -i 1 -x abc
python fast_scaled_scores.py -d scaffolds -i 1
sort -k 5 -gr scaffolds/contig_links_scaled_iteration_1 > scaffolds/contig_links_scaled_sorted_iteration_1
python layout_unitigs.py -x abc -l scaffolds/contig_links_scaled_sorted_iteration_1 -c 1000 -i 1 -d scaffolds
break_contigs -a scaffolds/alignment_iteration_2.bed -b scaffolds/breakpoints_iteration_2.txt -l scaffolds/scaffold_length_iteration_2 -i 2 -s 100   > scaffolds/misasm_iteration_2.report
python refactor_breaks.py -d scaffolds -i 2
python make_links.py -b scaffolds/alignment_iteration_2.bed -d scaffolds -i 2
python layout_unitigs.py -x abc -l scaffolds/contig_links_scaled_sorted_iteration_2 -c 1000 -i 2 -d scaffolds
break_contigs -a scaffolds/alignment_iteration_3.bed -b scaffolds/breakpoints_iteration_3.txt -l scaffolds/scaffold_length_iteration_3 -i 3 -s 100  > scaffolds/misasm_iteration_3.report
python refactor_breaks.py -d scaffolds -i 3 > scaffolds/misasm_3.log
python make_links.py -b scaffolds/alignment_iteration_3.bed -d scaffolds -i 3
python layout_unitigs.py -x abc -l scaffolds/contig_links_scaled_sorted_iteration_3 -c 1000 -i 3 -d scaffolds
break_contigs -a scaffolds/alignment_iteration_4.bed -b scaffolds/breakpoints_iteration_4.txt -l scaffolds/scaffold_length_iteration_4 -i 4 -s 100  > scaffolds/misasm_iteration_4.report
python refactor_breaks.py -d scaffolds -i 4 > scaffolds/misasm_4.log
python make_links.py -b scaffolds/alignment_iteration_4.bed -d scaffolds -i 4
python layout_unitigs.py -x abc -l scaffolds/contig_links_scaled_sorted_iteration_4 -c 1000 -i 4 -d scaffolds
break_contigs -a scaffolds/alignment_iteration_5.bed -b scaffolds/breakpoints_iteration_5.txt -l scaffolds/scaffold_length_iteration_5 -i 5 -s 100  > scaffolds/misasm_iteration_5.report
python refactor_breaks.py -d scaffolds -i 5 > scaffolds/misasm_5.log

break_contigs.py error

when I run the command :
python hi-c-scaffold-master/break_contigs.py -b breakpoints -a genome.fa -o final.assembly -l contig_length_new
Error occure as below
Traceback (most recent call last):
File hi-c-scaffold-master/break_contigs.py", line 46, in
main()
File "hi-c-scaffold-master/break_contigs.py", line 30, in main
seq = id2seq[curr_contig]
KeyError: '000003F_020|quiver'
Could you help me?

Contig Headers with Colons

Hi there,

I have a few contigs in my assembly whose header contains a colon. For example:

transfrag_1:293290599-293291609

This appears to be causing an error in the generate_scaffold_graph() function of layout_unitigs.py.

I specifically get the following error

Traceback (most recent call last):
  File "... /SALSA/layout_unitigs.py", line 904, in <module>
    generate_scaffold_graph()
  File "... /SALSA/layout_unitigs.py", line 397, in generate_scaffold_graph
    if contig_length[c1] <= int(args.cutoff) or contig_length[c2] <= int(args.cutoff):
KeyError: 'transfrag_1'

From looking around it looks like there is some extra information encoded in the contig_links_scaled_sorted_iteration_1 which relies on the addition of a colon to the fasta headers. Then, this causes problems starting at 389 of layout_unitigs.py.

I could mess around and get it to work but I figure the presence of colons in fasta headers is common enough that I would bring it to your attention.

Thanks

unexpected stop

Hi -
I am trying to tun salsa on a CANU assembly. I tried a first time with contigs and assembly graph (.gfa).
However, the pipeline stops during iteration 2 at this step:
make_links.py -b test.salsa/alignment_iteration_2.bed -d test.salsa -i 2
there is no error message. The log says:

bedfile loaded
Starting Iteration 1
bedfile started
bedfile loaded
Loading Hi-C links
Finished loading Hi-C links, Loading unitig links now..
Hybrid scaffold graph loaded, nodes = 2300 edges = 1300
Hi-C implied edges = 0
Unitig tiling implied edges = 0
Assembly graph implied edges = 0
10x implied edges = 0
Starting Iteration 2
bedfile started
bedfile loaded
Starting Iteration 2

Do you have any idea?
My other question is that the N50 of my CANU assembly is quite low (70kb), do you think that could be the cause of the problem?

Thanks a lot for your help!

-u option usage is not very clear

The -u option specifies "The tiling of unitigs to contigs in bed format"
But its use should be made more explicit.
First, the guide should describe how this bed file is structured, as the bed format has optional fields Based on the bed format specs, I thought the format should be (in column order)

  • contig name
  • start coordinate of unitig in contig
  • end coordinate of unitig in contig
  • unitig name
  • feature score (arbitrary?)
  • orientation of unitig in contig

Is this correct?

Also, it's not very clear what the reference genome to map Hi-C reads onto must be in this case. Must Hi-C reads be mapped onto the contigs or onto the unitigs?

I am trying salsa with Hi-C reads mapped on unitigs and a tiling bed file specified as above. (I have also specified a gfa file generated during unitig assembly). The script is running, but the output says "Unitig tiling implied edges = 0" at each iteration, so I wonder if I did something wrong.

Thanks.

AttributeError: 'dict' object has no attribute 'iteritems'

It terminated after loading bed file with following error

➜  SALSA git:(master) ✗ python2 run_pipeline.py -a A_plus_collapsed.fa -l A_plus_collapsed.fa.fai -b alignment.bed -e GATC -o scaffolds -i 3
bedfile loaded
Traceback (most recent call last):
  File "/home/urbe/Tools/SALSA/correct.py", line 28, in <module>
    input_seqs = parse_fasta(open(sys.argv[1],'r'))
  File "/home/urbe/Tools/SALSA/correct.py", line 20, in parse_fasta
    for short_name, nuc_list in fa.iteritems():
AttributeError: 'dict' object has no attribute 'iteritems'

mv: cannot stat 'scaffolds//alignment_iteration_1.tmp.bed': No such file or directory
mv: cannot stat 'scaffolds//asm.cleaned.fasta': No such file or directory
Starting Iteration 1
python /home/urbe/Tools/SALSA/make_links.py -b scaffolds/alignment_iteration_1.bed -d scaffolds -i 1 -x abc
  File "/home/urbe/Tools/SALSA/make_links.py", line 158
    print len(contig_links)
            ^
SyntaxError: invalid syntax

How to fix that?

scaffolds growing in each iteration

I have a pacbio vertebrate assembly (NgG50 4.5M) which was scaffolded with a two enzyme bionano hybrid scaffolding (NG50 28M).
To get full chromosomes I applied SALSA with a 46x coverage HIC data set. And I did the Arima HiC
mapping approach.
The longest scaffold is growing in each iteration starting from 46M (1st iteration) to 228M in the final scaffold file. I also have a genetic map, that disagrees with all iterative joining steps for this scaffold. For the other scaffolds, the merging works quite well.

Is it possible to run SALSA in a more conservative way?

how to generate bed file from other format

Hi,
I'm trying to convert mnd file (from juicer pipeline) to bed format.
I remove intra-fragment interaction and below-mapQ interaction. Finally, each paired contact was divide into two lines with same manually-add read id (4th column in bed file).
But it seems that this doesn't work in salsa.

Any suggestions?

Thanks,
Eric

How to use the -t option?

-t TENX, --tenx TENX  10x links tab separated file, sorted by last columnls

I have 10X genomic data, but I did not know use to use this option. There are no manuals

Using SALSA2 in conjunction with Chicago data

Hi,

I noticed in your recent publication that you used Chicago data in conjunction with SALSA2.
Did you need to make any modifications to the Chicago data prior to using it with SALSA2 or were you able to use it directly?

Best,
Chad

Does not find misassemblies

EDIT:
Please ignore this. The bed file was corrupted. I remade it and resorted by column 4 (read name) and it is working fine.

I figured I should make my own issue, since no one is struggling specifically with misassemblies like me.
I am running the pipeline exactly as you describe (with previous arima mapping) but I can't get SALSA to find misassemblies.

I even went as far as manually creating misassemblies in some of the assembly contigs (namely contig1 which has 2510111 maped HiC reads) and nothing.

Total Joins = 0
Suspicious Joins = 0
Percent Suspicious Joins = -nan

There are no error messages. Salsa runs for 3 iterations and stops, creating plenty of empty files:

-rwxrwx--- 1 guerreiro grp_stich 6222829134 Nov 18 17:52 alignment_iteration_1.bed*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 18:01 alignment_iteration_2.bed*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 18:09 alignment_iteration_3.bed*
-rwxrwx--- 1 guerreiro grp_stich 1145528767 Nov 18 17:52 assembly.cleaned.fasta*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 17:55 breakpoints_iteration_2.txt*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 18:03 breakpoints_iteration_3.txt*
-rwxrwx--- 1 guerreiro grp_stich 2544 Nov 18 18:12 commands.log*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 17:55 contig_links_iteration_1*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 18:03 contig_links_iteration_2*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 17:55 contig_links_scaled_iteration_1*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 18:03 contig_links_scaled_iteration_2*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 17:55 contig_links_scaled_sorted_iteration_1*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 18:03 contig_links_scaled_sorted_iteration_2*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 17:40 input_breaks*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 18:01 links_avoid_iteration_2*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 18:09 links_avoid_iteration_3*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 18:03 misasm_2.DONE*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 18:11 misasm_3.DONE*
-rwxrwx--- 1 guerreiro grp_stich 0 Nov 18 18:09 misasm_3.log*
-rwxrwx--- 1 guerreiro grp_stich 69 Nov 18 18:01 misasm_iteration_2.report*
-rwxrwx--- 1 guerreiro grp_stich 69 Nov 18 18:09 misasm_iteration_3.report*
-rwxrwx--- 1 guerreiro grp_stich 513853 Nov 18 17:53 re_counts_iteration_1*
-rwxrwx--- 1 guerreiro grp_stich 358372 Nov 18 18:03 re_counts_iteration_2*
-rwxrwx--- 1 guerreiro grp_stich 358372 Nov 18 18:11 re_counts_iteration_3*
-rwxr-x--- 1 guerreiro grp_stich 521484 Nov 18 17:52 scaffold_length_iteration_1*
-rwxrwx--- 1 guerreiro grp_stich 366003 Nov 18 18:03 scaffold_length_iteration_2*
-rwxrwx--- 1 guerreiro grp_stich 366003 Nov 18 18:11 scaffold_length_iteration_3*
-rwxrwx--- 1 guerreiro grp_stich 1090646 Nov 18 18:12 scaffolds_FINAL.agp*
-rwxrwx--- 1 guerreiro grp_stich 1145373286 Nov 18 18:12 scaffolds_FINAL.fasta*
-rwxrwx--- 1 guerreiro grp_stich 1947223 Nov 18 18:01 scaffolds_iteration_1.p*
-rwxrwx--- 1 guerreiro grp_stich 1947223 Nov 18 18:09 scaffolds_iteration_2.p*

It does create the final scaffolds fasta, but how is it relevant without missassembly detection?

scaffold_1
TACTTTTAAGCGGAAGTATAAGCACGTTTTCCGCTATTAGAAAGAAGCTTAAGATTTCTGCTTCTTACAA
ATAAGCAGAAAATCATTTATTTTAAAATCAAAATATTTCTGTCATATAATATTTATCAATATATCCCTTATATTT

This is the what salsa prints. I do think it is strange that all those nodes and edges are 0:

bedfile loaded
Starting Iteration 1
bedfile started
bedfile loaded
Loading Hi-C links
Hybrid scaffold graph loaded, nodes = 0 edges = 0
Hi-C implied edges = 0
Starting Iteration 2
bedfile started
bedfile loaded
Starting Iteration 2
Loading Hi-C links
Hybrid scaffold graph loaded, nodes = 0 edges = 0
Hi-C implied edges = 0
python /netscratch/irg/grp_stich/C34_PS/SALSA/RE_sites.py -a scaffolds3_SALSA/assembly.cleaned.fasta -e GATC > scaffolds3_SALSA/re_counts_iteration_1
python /netscratch/irg/grp_stich/C34_PS/SALSA/make_links.py -b scaffolds3_SALSA/alignment_iteration_1.bed -d scaffolds3_SALSA -i 1 -x abc
python /netscratch/irg/grp_stich/C34_PS/SALSA/fast_scaled_scores.py -d scaffolds3_SALSA -i 1
sort -k 5 -gr scaffolds3_SALSA/contig_links_scaled_iteration_1 > scaffolds3_SALSA/contig_links_scaled_sorted_iteration_1
python /netscratch/irg/grp_stich/C34_PS/SALSA/layout_unitigs.py -x abc -l scaffolds3_SALSA/contig_links_scaled_sorted_iteration_1 -c 5000 -i 1 -d scaffolds3_SALSA
/netscratch/irg/grp_stich/C34_PS/SALSA/break_contigs -a scaffolds3_SALSA/alignment_iteration_2.bed -b scaffolds3_SALSA/breakpoints_iteration_2.txt -l scaffolds3_SALSA/scaffold_length_iteration_2 -i 2 -s 100 > scaffolds3_SALSA/misasm_iteration_2.report
python /netscratch/irg/grp_stich/C34_PS/SALSA/refactor_breaks.py -d scaffolds3_SALSA -i 2
python /netscratch/irg/grp_stich/C34_PS/SALSA/make_links.py -b scaffolds3_SALSA/alignment_iteration_2.bed -d scaffolds3_SALSA -i 2
python /netscratch/irg/grp_stich/C34_PS/SALSA/layout_unitigs.py -x abc -l scaffolds3_SALSA/contig_links_scaled_sorted_iteration_2 -c 5000 -i 2 -d scaffolds3_SALSA
/netscratch/irg/grp_stich/C34_PS/SALSA/break_contigs -a scaffolds3_SALSA/alignment_iteration_3.bed -b scaffolds3_SALSA/breakpoints_iteration_3.txt -l scaffolds3_SALSA/scaffold_length_iteration_3 -i 3 -s 100 > scaffolds3_SALSA/misasm_iteration_3.report
python /netscratch/irg/grp_stich/C34_PS/SALSA/refactor_breaks.py -d scaffolds3_SALSA -i 3 > scaffolds3_SALSA/misasm_3.log

I have a variable HiC coverage of my assembly contigs:

contig __ size ______ HiC reads
scaffold1 size18622731 2510111
scaffold2 size7953947 942093
scaffold3 size7445178 977533
scaffold4 size7379637 799571
scaffold5 size7181573 863834
scaffold6 size6983120 917527
scaffold7 size6406061 568697
..
scaffold18436 size1003 117
scaffold18447 size1001 309
scaffold18449 size1001 10
scaffold18452 size1001 117
scaffold18456 size1001 27
scaffold18459 size1000 14
scaffold18462 size1000 8
scaffold18465 size1000 26
scaffold18466 size1000 55
scaffold18469 size1000 7

Due to this, I thought about limiting the considered contigs by length with the -c 5000 option, but it doesn't seem to work, since the generated scaffold_length_iteration_3 includes small scaffolds:

scaffold_16656 1002
scaffold_10401 1002
scaffold_8692 1001
scaffold_18424 1001
scaffold_16809 1001
scaffold_12828 1001
scaffold_18021 1000
scaffold_1800 1000
scaffold_16956 1000

Can you please help me? I am really putting a lot of time and effort into this but can't figure it out.

I suspect the problem might be in the input.
I had a bam like this:

guerreiro@dell-node-7:~/Potatodenovo/Scaffolding/SALSA/duplicated_files$ samtools view SALSA_rep1.bam | head
J00137:130:HWL5JBBXX:6:2127:5112:6238 115 scaffold1|size18622731 4 60 151M scaffold18|size4026230 2661754 0 TTTTAAGCGGAAGTATAAGCACGTTTTCCGCTATTAGAAAGAAGCTTAAGATTTCTGCTTCTTACAAAAAGCAAAGAATAAGCAGAAAATCATTTATTTTAAAATCAAAATATTTCTGTCATATAATATTTATCAATATATCCCTTATATT JFAFJFJFJJFJF<JJJFJJJJJJJF<JJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJFJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJFFFAA MD:Z:151 PG:Z:MarkDuplicates RG:Z:HiC NM:i:0 AS:i:151 XS:i:118
J00137:130:HWL5JBBXX:6:2204:28270:48192 163 scaffold1|size18622731 96 48 105M4I2M4D39M1S = 314 218 TTTATTTTAAAATCAAAATATTTCTGTCATATAATATTTATCAATATATCCCTTATATTTAAATCATCATATTAATAGTATATCAATATTTAATTTATTTTTTTATTTTTTTATGTATAATATTGATTGAAAATTTAAATATGTACATTTT AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJ-AJ--F-F-A-A7FA-F<FF<-<F---7AJAFA--7-A7A7-7AJF- XA:Z:scaffold1421|size194778,+78505,33M1D64M1I7M4I2M4D39M1S,11; MD:Z:107^CATC39 PG:Z:MarkDuplicates RG:Z:HiC NM:i:8 AS:i:126 XS:i:102
J00137:130:HWL5JBBXX:6:2117:3457:5235 131 scaffold1|size18622731 239 60 58M93H scaffold959|size259327 72948 0 TACATTTCAATTTAATAAAATAAAATAAAATTAAAATTTGATAATATATATTTACGAT AAAFFJJJJ-JJJJJJJJJJJJJJ<AJF<-FFJJJFF<JJJFJJJAJ-FFFJJJJF7A SA:Z:scaffold959|size259327,73222,-,93M58S,60,2; MD:Z:58 PG:Z:MarkDuplicates RG:Z:HiC NM:i:0 AS:i:58 XS:i:35
J00137:130:HWL5JBBXX:6:1207:19441:45379 83 scaffold1|size18622731 295 21 83H68M scaffold1062|size197374 7538 0 ATCATTTTAATATTAAAAGTAAATTGATACTTGTCATTTTTTGTAATTTGACTCTCAACTGTGTTTTT JAFJJJJJJJFJJJJJFJJJJJJJ<JJJJFFFJFJJJJJJJJJFJJJJJAAJJJJJJAA7FFJF<<A< SA:Z:scaffold1062|size197374,7685,-,83M68S,60,0; XA:Z:scaffold781|size327351,-232740,83S68M,1; MD:Z:68 PG:Z:MarkDuplicates RG:Z:HiC NM:i:0 AS:i:68 XS:i:60
J00137:130:HWL5JBBXX:6:2224:4148:21201 179 scaffold1|size18622731 295 60 43S108M scaffold18|size4026230 537316 0 AATCTTAATGCTTTAACCACTAAAGAAGACACAGGAAATGATCATCATTTTAATATTAAAAGTAAATTGATACTTGTCATTTTTTGTAATTTGACTCTCAACTGTGTTTTTAAAAAAAATAGACAAACAAAATATGCTTATCAAAAGCACT <7--FJAA<JJFF<JF<-7FJJFJFFJJFA<FJ<FFJAAFAA-JFJJAFAJJJJJJJJJJJJAJJJJJJJFJ<J<FJJJJJJFJJJJJJJJJJJJFJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJFFFAA SA:Z:scaffold18|size4026230,537316,+,108S40M3S,60,0; MD:Z:108 PG:Z:MarkDuplicates RG:Z:HiC NM:i:0 AS:i:108 XS:i:83

and transformed it into bed , sorting it by the 4 column (the read name).

scaffold1|size18622731 294 362 J00137:130:HWL5JBBXX:6:1207:19441:45379/1 21 -
scaffold1|size18622731 428 579 J00137:130:HWL5JBBXX:6:1207:3955:42636/2 39 +
scaffold1|size18622731 414 522 J00137:130:HWL5JBBXX:6:1214:3397:36921/2 16 +
scaffold1|size18622731 495 611 J00137:130:HWL5JBBXX:6:1219:14692:7152/2 60 +
scaffold1|size18622731 238 296 J00137:130:HWL5JBBXX:6:2117:3457:5235/2 60 +
scaffold1|size18622731 490 641 J00137:130:HWL5JBBXX:6:2121:18781:10370/1 60 +

assembly error detection isn't much efficient on my data

I'm using a salsa 2 version that I downloaded about a week ago and I'm trying to correct/assemble a genome that we know contains thousands of errors, in which different chromosomes are joined in the same scaffolds. (See https://academic.oup.com/gbe/article/10/2/507/4817508#110136596, second-to-last paragraph of the discussion.)

Using Hi-C (chicago) data mapped with the recommended Arima genomics pipeline on this genome, salsa 2 detected only 63 errors (as listed in the input_breaks file).
Note that the draft assembly onto which I mapped the Hi-C reads is made of scaffolds (not unitigs) and doesn't come with a gfa file (it was assembled by somebody else a long time ago).
Note also that HiRise corrected ~7700 errors on this genome (I haven't run HiRise myself, dovetail did. Salsa is the only program I used so far).

I looked more closely at the data, focusing on a scaffold that we know contains at least 4 assembly errors (the scaffold outlined in the supplementary material:
https://academic.oup.com/gbe/article/10/2/507/4817508#110136605). I computed the physical coverage of Hi-C reads on this scaffold, based on the bed file that was given to salsa.

image

The plot shows the physical coverage of all read pairs whose mates are >500bp appart (excluding these pairs just makes a cleaner curve). The only clear drop in coverage is around 180 kb. It was actually among the very few to be detected by salsa.
I then plotted an image like those made by Dudchenko et al. 2016
(http://science.sciencemag.org/highwire/filestream/691788/field_highwire_adjunct_files/0/Dudchenko_SM.pdf, figure S5.)
image
The window size is 5000bp. Five breaks appear starkly, 4 of which correspond to the 4 known assembly errors. It's also clear that, in this scaffold, parts of the X chromosome were somehow inserted into an autosome during assembly, as the number of links drops to zero in a large region, then increases again. This is why the physical coverage over an assembly errors can still be substantial.
To show this, I plotted the physical coverage on this scaffold, but excluding pairs of mates that are distant from more than 200kb:
image

Unless I'm mistaken, the current method used by salsa to detect assembly errors could be improved, as it may be fooled by certain types of errors.

Running Time

Hello,

I am running SALSA on a Flye (https://github.com/fenderglass/Flye) assembly and it is in the Iteration 1 since 3 days :

bedfile loaded
Starting Iteration 1
bedfile started
bedfile loaded
Done loading GFA file
Number of Nodes = 49542
Number of Edges = 2255685850

Is there anyway to know the remaining time of analysis ? Do i made something wrong in my pipeline ?

#bwa index scaffolds.fasta
#samtools faidx scaffolds.fasta
#bwa mem -t40 -SPM scaffolds.fasta 20160711.A-60444_HiC_R1.fastq.gz 20160711.A-60444_HiC_R2.fastq.gz > hic.sam
#samtools view -Sb hic.sam > hic-samtools.bam
#bamToBed -i hic-samtools.bam > hic-samtools.bed
#sort -k 4 hic-samtools.bed > tmp && mv tmp hic-samtools.bed
#python /media/vol1/apps/SALSA/run_pipeline.py -a scaffolds.fasta -l scaffolds.fasta.fai -b hic-samtools.bed -e GCTCTTCN -o salsa1 -m yes -g assembly_graph.gfa

Thanks,
Luc

when I use this tools ,I met a mistakes,I need some help.

python /public/tools/SALSA/run_pipeline.py -a polsih.fasta -l polsih.fasta.fai -b temp1.bed -e GATC -o scaffolds1 -m yes
bedfile loaded
Traceback (most recent call last):
File "/public/msq/tools/SALSA/correct.py", line 69, in
oline += str(contig2new[attrs[0]] +'\t'+attrs[1]+'\t'+attrs[2]+'\t'+attrs[3]+'\n')
KeyError: '000017FP01'

mv: cannot stat ‘scaffolds1//asm.cleaned.fasta’: No such file or directory

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.