jltsiren / gcsa2 Goto Github PK

View Code? Open in Web Editor NEW

70.0 70.0 11.0 1004 KB

BWT-based index for graphs

License: MIT License

C++ 69.72% Makefile 1.11% TeX 28.75% Shell 0.42%

gcsa2's People

Contributors

Stargazers

Watchers

Forkers

ekg adamnovak pombredanne buske sujunhao glennhickey zihua wangdi2014 wook2014 jeizenga yufeng98

gcsa2's Issues

Binary input format

The text format we've used so far has been useful for debugging various issues during GCSA development. Now as we're moving toward larger graphs, generating, transferring, and parsing the text has become a significant bottleneck. I think it's time to start using a binary format instead.

I've never developed file formats for production use, so I'm open to suggestions. In particular, I'm interested in best practices for not breaking backward compatibility when the requirements for the file format change.

This is what I've been thinking so far:

General

Keep the file format simple. Every field should be an unsigned 64-bit integer.
Make it efficient for GCSA. The bulk of the file should be directly readable into a std::vector<KMer>.
Make it efficient for vg. What does this imply?
The text format apparently has an useful property. If we generate the files for different chromosomes separately with the same kmer length, and then concatenate the files, the result is a valid input for indexing the entire genome. This should be true with the binary format as well.
The file consists of one or more sections. Each section starts with a header, followed by data.

Header

One field for future use. Right now, we should set it to 0.
The number of paths in the section.
The length of path labels.

Data

KMer structures in arbitrary order.
Each structure consists of three fields: key, from, and to.

`key`

Created by Key::encode(const Alphabet& alpha, const std::string& kmer, byte_type pred, byte_type succ).
alpha is an object that maps between characters and the comp values (the internal alphabet of GCSA). The default constructor of Alphabet creates an object that maps $ACGTN# to 0-6.
kmer is the fixed-length label of the path.
pred encodes the set of characters that occur immediately before the path. pred & (1 << comp) is true iff comp is a comp value for a character in the set.
succ encodes the set of characters occurring immediately after the path in the same way.

`from`

Encodes the starting position of the path.
Created by Node::encode(size_type node_id, size_type node_offset).
node_id is the identifier of the vg node where the path starts. We have 54 bits available for encoding the identifier, but using excessively large identifiers is probably a bad idea. The identifiers of different nodes do not have to form a contiguous range.
node_offset is the starting offset (in the label of the node) for the path. We have 10 bits available, so acceptable offsets range from 0 to 1023. The offsets in the same node should obviously be contiguous.

`to`

Encodes a position immediately after the end of the path in the same way as from.
If the path has multiple successors, a separate KMer object with different to fields is created for each of them.
If multiple KMer objects are created for the same path, the succ field of the key can encode either all successor characters or just the one at position to, whichever is more convenient.

build results for k=11, 3 doubling steps

eg10@vr-4-1-14:~/graphs/1kgp+h37$ time build_gcsa-b433cc28 -D k11e3l32.work -l 750 -o wg.k88 $(for i in $(seq 1 22; echo X; echo Y); do echo -n " k11e3l32/$i"; done)                                             
GCSA builder                                                                                             

Input:              k11e3l32/1.graph (binary format)                                                     
Input:              k11e3l32/2.graph (binary format)                                                     
Input:              k11e3l32/3.graph (binary format)                                                     
Input:              k11e3l32/4.graph (binary format)                                                     
Input:              k11e3l32/5.graph (binary format)                                                     
Input:              k11e3l32/6.graph (binary format)                                                     
Input:              k11e3l32/7.graph (binary format)                                                     
Input:              k11e3l32/8.graph (binary format)                                                     
Input:              k11e3l32/9.graph (binary format)                                                     
Input:              k11e3l32/10.graph (binary format)                                                    
Input:              k11e3l32/11.graph (binary format)                                                    
Input:              k11e3l32/12.graph (binary format)                                                    
Input:              k11e3l32/13.graph (binary format)                                                    
Input:              k11e3l32/14.graph (binary format)                                                    
Input:              k11e3l32/15.graph (binary format)                                                    
Input:              k11e3l32/16.graph (binary format)                                                    
Input:              k11e3l32/17.graph (binary format)                                                    
Input:              k11e3l32/18.graph (binary format)                                                    
Input:              k11e3l32/19.graph (binary format)                                                    
Input:              k11e3l32/20.graph (binary format)                                                    
Input:              k11e3l32/21.graph (binary format)                                                    
Input:              k11e3l32/22.graph (binary format)                                                    
Input:              k11e3l32/X.graph (binary format)                                                     
Input:              k11e3l32/Y.graph (binary format)                                                     
Output:             wg.k88.gcsa, wg.k88.lcp                                                              
Doubling steps:     3                                                                                    
Size limit:         750 GB                                                                               
Branching factor:   64                                                                                   
Threads:            32                                                                                   
Temp directory:     k11e3l32.work                                                                        

Index built in 123545 seconds                                                                            
Memory usage: 129.722 GB                                                                                 
I/O volume: 6495.02 GB read, 5396.42 GB write                                                            

Paths:            12277185468                                                                            
Edges:            12724670284                                                                            
Samples:          2170004049 (at 1389202089 positions, 41 bits each)                                     
Max query:        88                                                                                     
Index size:       25282.4 MB                                                                             
Without samples:  14676.4 MB                                                                             
LCP size:         10407.5 MB                                                                             

Final memory usage: 129.722 GB                                                                           


real    2078m0.570s                                                                                      
user    6524m29.209s                                                                                     
sys     648m32.652s

This seems to have worked, but I cannot load the indexes it generated. There is no clear error message from the vg side, so I'm curious how I should debug them using the tools in gcsa2.

Probably I should have verified them in the build.

how to generate a file in .gcsa2 format and in .graph format?

Hello, how can I use the .fa file and the .vcf file to generate a .gcsa2 format or a .graph file? thanks a lot!

debugging output via option rather than macro

Right now GCSA2 uses #ifdef VERBOSE_STATUS_INFO macros to enable or disable debugging information during builds.

In vg we sometimes want to see the debugging information (say, when building a whole genome index) and other times it is superfluous (such as when we're building 10 GCSAs a second in vg msga).

I guess I'd like to also make output dependent on a runtime-configurable debug flag. Any ideas about the best way to do it?

Running out of disk space during construction causes a segmentation fault

The DiskIO writing code doesn't check if its writes succeed. It's possible the files it writes are truncated by the backing disk running out of space. This causes problems when the files are read back.

I'm not entirely clear on the mechanism (maybe acting on garbage data that wasn't replaced by data that was supposed to have been read?), but this generates segfaults like this in vg:

Program received signal SIGSEGV, Segmentation fault.
gcsa::PathGraphMerger::read (this=this@entry=0x7fffffff7a20, path=...) at path_graph.cpp:643
643           path.label[i] = this->rank_files[path.file][path.node.pointer() + i];
(gdb) bt
#0  gcsa::PathGraphMerger::read (this=this@entry=0x7fffffff7a20, path=...) at path_graph.cpp:643
#1  0x00000000009c9c7c in gcsa::PathGraphMerger::bufferNext (this=this@entry=0x7fffffff7a20) at path_graph.cpp:623
#2  0x00000000009c9ec5 in gcsa::PathGraphMerger::rangeEnd (this=this@entry=0x7fffffff7a20, start=start@entry=11)
    at path_graph.cpp:612
#3  0x00000000009ca09f in gcsa::PathGraphMerger::next (this=this@entry=0x7fffffff7a20) at path_graph.cpp:534
#4  0x00000000009cb8a6 in gcsa::PathGraph::prune (this=this@entry=0x7fffffff8330, lcp=..., size_limit=<optimized out>)
    at path_graph.cpp:830
#5  0x000000000098ca4f in gcsa::GCSA::GCSA (this=0x23dc860, graph=..., parameters=...) at gcsa.cpp:488
#6  0x00000000006c8234 in main_index (argc=<optimized out>, argv=<optimized out>) at src/subcommand/index_main.cpp:873
#7  0x0000000000418dd2 in main (argc=13, argv=0x7fffffffcea8) at src/main.cpp:7684

The writing code ought to bring down the whole program if it can't get its data on disk. Alternately, the reading code could have an idea of the expected file size and complain if the file is too small.

position counting (like document counting)

@jltsiren and I have discussed this several times, but in the interest of not losing track of it I'll document it here.

Currently in vg we're using GCSA2 like a kind of kmer index. We sample a series of kmers from the query. Hopefully, some of these have a very small number of positions in the genome. Then, clustering heuristics are used to establish a few targets and a local partial order alignment is executed for each and the best scoring alignment is returned as the mapping.

GCSA2 is preferable to a literal kmer table due to its low memory consumption, but we could be using it much more efficiently if it were possible to find maximal exact matches (MEMs) and use these rather than kmers to drive the mapping process. To quote @lh3 (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436841/):

MEMs are exact matches that cannot be extended in either direction without allowing a mismatch.

A given query may be represented as a series of MEMs broken by differences from the reference (errors or variants).

The goal would be to be able to determine the number of positions in the input graph that lie in a given BWT range. Doing this properly will require a structure that @jltsiren has used before for document counting. However, a stop-gap solution is to record if a given BWT position uniquely maps to a given graph position or not--- which would require the addition of a bitvector of the order of the BWT. This could be populated or during construction and then queried to determine if a given pattern has a single match. The algorithm to find (approximate) MEMs would then increase the query pattern length until obtaining a BWT range of a single position and would then check if this BWT position corresponds to a single position in the input graph.

Backward search returns non-empty interval even if pattern is not in the graph

Hi Jouni,
I was playing with the gcsa2 index and I encountered a strange behaviour: if I manually backward search a pattern in the index or by using the find(), it returns a non-empty interval of size 1 altough I'm pretty sure that it's not in the graph.. The pattern is quite long (>1600bp).

I can prepare the test data and send you my code to test it, but before doing that I'd like to do some more tests and also know if there is a limit on the pattern size. If that's the case, then there is nothing we can do.

I built the gcsa2 index using vg index (version: v1.46.0 "Altamura").

Thanks,
Luca

How do I generate a .gcsa2 format file without reverse complement?

Hello, How do I generate a .gcsa2 format file without reverse complement?

Originally posted by @xd-hbj in #21 (comment)

Adding Support for Cycles

I'm looking to replace the graph indexing code in my string-to-graph aligner that lives over in adamnovak/sequencegraphs. @ekg recommended I look at this project.

My current codebase uses an approach based on keeping a plain string BWT index of several paths through the graph, and then searching in that collection of paths. This provides a modicum of support for cyclic graphs, which I need, for example, in order to be able to find sequences covering the endpoints of inversions or translocations in the index.

I would need some similar partial support for cyclic graphs in GCSA2. What is the state with respect to cyclic graphs so far? Are cycles merely not indexed, or will they break the indexer if they are present? I note that you plan to add cycle support; how are you planning to design that? What constraints (i.e. only going around a cycle once) would you plan to impose? Would it be possible for me to contribute to that part of the project, in order to get it working for my aligner?

How do I do to combine many gcsa files into one file?

I got many gcsa files. Each of them is to one specific chromosome. How do I do to combine many gcsa files into one file?

pre-subsampling results from locate

In some cases it's helpful to get a small subset of exact matches for a given gcsa range even when there are a very large number of matches. However, it can be problematic to actually complete the locate call and fill out the set of hits. Could there be a simple way to obtain a subset that isn't simply the first N such hits?

thread safety

Via xg/vg integration, I've noticed that some features of sdsl-lite are not thread safe (at least this is referred to in sdsl-lite issues). Is GCSA2 thread safe? In scaling up mapping we need to run with multiple threads.

Compilation fails

Hi, when I compile this codes ,it fails , the screen print like this , what can I for this issue? thanks for your help!

Installing GCSA2 to '/data/home/xuyuan/biosoft/gcsa2'.
Running 20 parallel make jobs.
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c algorithms.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c dbg.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c files.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c gcsa.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c internal.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c lcp.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c path_graph.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c support.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c utils.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c build_gcsa.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c convert_graph.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c gcsa_format.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c try_extend.cpp
In file included from include/gcsa/dbg.h:28:0,
from dbg.cpp:26:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from internal.cpp:26:0:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
In file included from support.cpp:26:0:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from support.cpp:27:0:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
make: *** [Makefile:47: internal.o] Error 1
make: *** Waiting for unfinished jobs....
In file included from utils.cpp:36:0:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
make: *** [Makefile:47: dbg.o] Error 1
In file included from include/gcsa/dbg.h:28:0,
from include/gcsa/path_graph.h:29,
from path_graph.cpp:26:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from include/gcsa/files.h:29:0,
from convert_graph.cpp:25:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from include/gcsa/path_graph.h:31:0,
from path_graph.cpp:26:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
In file included from include/gcsa/files.h:29:0,
from files.cpp:26:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from include/gcsa/files.h:29:0,
from include/gcsa/gcsa.h:29,
from gcsa_format.cpp:28:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from files.cpp:27:0:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
In file included from include/gcsa/files.h:29:0,
from include/gcsa/gcsa.h:29,
from include/gcsa/algorithms.h:29,
from gcsa.cpp:26:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from include/gcsa/files.h:29:0,
from include/gcsa/gcsa.h:29,
from include/gcsa/algorithms.h:29,
from algorithms.cpp:26:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from include/gcsa/files.h:29:0,
from include/gcsa/lcp.h:29,
from lcp.cpp:26:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from gcsa.cpp:27:0:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
In file included from lcp.cpp:30:0:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
In file included from algorithms.cpp:30:0:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
make: *** [Makefile:47: support.o] Error 1
include/gcsa/internal.h: In instantiation of 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&) [with ValueType = gcsa::PathNode; Getter = gcsa::FromGetter]':
path_graph.cpp:959:54: required from here
include/gcsa/internal.h:126:29: error: no matching function for call to 'sdsl::sd_vector_builder::sd_vector_builder(gcsa::size_type, std::vectorgcsa::PathNode::size_type, bool)'
sdsl::sd_vector_builder builder(Getter::get(input.back()) + 1, input.size(), true);
^~~~~~~
In file included from /data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/bit_vectors.hpp:11:0,
from /data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/wt_pc.hpp:25,
from /data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/wavelet_trees.hpp:32,
from include/gcsa/utils.h:36,
from include/gcsa/support.h:29,
from include/gcsa/dbg.h:28,
from include/gcsa/path_graph.h:29,
from path_graph.cpp:26:
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:80:9: note: candidate: sdsl::sd_vector_builder::sd_vector_builder(sdsl::sd_vector_builder::size_type, sdsl::sd_vector_builder::size_type)
sd_vector_builder(size_type n, size_type m);
^~~~~~~~~~~~~~~~~
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:80:9: note: candidate expects 2 arguments, 3 provided
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:74:9: note: candidate: sdsl::sd_vector_builder::sd_vector_builder()
sd_vector_builder();
^~~~~~~~~~~~~~~~~
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:74:9: note: candidate expects 0 arguments, 3 provided
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:56:7: note: candidate: sdsl::sd_vector_builder::sd_vector_builder(const sdsl::sd_vector_builder&)
class sd_vector_builder
^~~~~~~~~~~~~~~~~
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:56:7: note: candidate expects 1 argument, 3 provided
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:56:7: note: candidate: sdsl::sd_vector_builder::sd_vector_builder(sdsl::sd_vector_builder&&)
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:56:7: note: candidate expects 1 argument, 3 provided
make: *** [Makefile:47: utils.o] Error 1
In file included from include/gcsa/dbg.h:28:0,
from include/gcsa/path_graph.h:29,
from try_extend.cpp:25:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from include/gcsa/files.h:29:0,
from include/gcsa/gcsa.h:29,
from include/gcsa/algorithms.h:29,
from build_gcsa.cpp:29:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from include/gcsa/path_graph.h:31:0,
from try_extend.cpp:25:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
make: *** [Makefile:47: convert_graph.o] Error 1
make: *** [Makefile:47: gcsa_format.o] Error 1
make: *** [Makefile:47: lcp.o] Error 1
make: *** [Makefile:47: build_gcsa.o] Error 1
make: *** [Makefile:47: files.o] Error 1
make: *** [Makefile:47: try_extend.o] Error 1
make: *** [Makefile:47: algorithms.o] Error 1
make: *** [Makefile:47: path_graph.o] Error 1
make: *** [Makefile:47: gcsa.o] Error 1
Error: Could not compile GCSA2

Cannot serialize when empty

Constructing a gcsa::GCSA with the default constructor, and then immediately calling serialize() to serialize the (empty) index to an ostream, results in a segfault.

If it is forbidden to serialize a default-constructed gcsa::GCSA, that needs to be documented in a doc comment for serialize(). Otherwise, the serialization needs to be fixed to work.

utils.h keys on compiler and not C++ standard library to decide whether to try and use parallel sort

I'm building GCSA2 with GNU GCC and Apple libc++ (instead of GNU libstdc++), in order to try and get vg builds working with GNU GCC and Homebrew's binary Protobuf library on Mac (which links against libc++).

GCSA2 checks the compiler identity, and uses that to decide whether the standard library has GNU's parallel/algorithm nonstandard header. This doesn't work if you are using GNU GCC with anything other than libstdc++, and it misses out on using the header if you are using libstdc++ with any other compiler.

gcsa2/include/gcsa/utils.h

Lines 41 to 44 in 145c4ee

 // Parallel sorting is only available with GCC. 

 #if (defined(__GNUC__) && !defined(__clang__)) 

 #include <parallel/algorithm> 

 #endif

It looks like the Right Way to detect libc++ is to include any standard library header and then check if _LIBCPP_VERSION is defined. For detecting libstdc++, you can incluse nearly any standard header and check for __GLIBCXX__.

We should only use parallel/algorithm if __GLIBCXX__ is defined, instead of keying on the compiler.

build_gcsa accepts two kmer files separately but not concatenated

Over in vgteam/vg, @ekg and I are trying to create a GCSA2 index from only part of a graph. What we're doing is throwing out part of a graph, getting GCSA2 kmers for that, throwing out a different part of the graph, getting GCSA2 kmers for that, and then combining the kmers into one GCSA2 index for the graph.

This is operating under the assumption that you can just concatenate GCSA2 kmer files. It looks like you are supposed to be able to do this, but in practice you can't in some cases. I've attached some example files here.

smaller.zip

I have a graph (smaller.vg) that looks like this:

In JSON format, it looks like:

{"edge": [{"to": 4972, "from": 4969}, {"to": 4972, "from": 4970}, {"to": 4973, "from": 4972}, {"to": 4974, "from": 4972}], "node": [{"id": 4969, "sequence": "G"}, {"id": 4970, "sequence": "A"}, {"id": 4971, "sequence": "T"}, {"id": 4972, "sequence": "AGTATC"}, {"id": 4973, "sequence": "C"}, {"id": 4974, "sequence": "T"}]}

If I prune out some nodes I don't like, it looks like this:

I can make GCSA2 text or binary format kmer files from those two graphs, with vg. I can index each graph separately with build_gcsa, and I can index them together by passing both file names to build_gcsa. But if I actually concatenate the kmer files, the index build fails.

To generate the attached files:

KMER=8
vg view -Jv smaller.json > smaller.vg
vg mod -pl ${KMER} -e 1 smaller.vg | vg kmers -g -k ${KMER} -H 1000000000 -T 1000000001 - > smaller-pruned.gcsa2
vg kmers -g -k ${KMER} -H 1000000000 -T 1000000001 smaller.vg > smaller-all.gcsa2

And to test:

build_gcsa -d 1 -t -v smaller-pruned
# Works


build_gcsa -d 1 -t -v smaller-all
# Works

build_gcsa -d 1 -t -v smaller-pruned smaller-all
# Works with both sets of kmers

cat smaller-pruned.gcsa2 smaller-all.gcsa2 > smaller-cat.gcsa2
build_gcsa -d 1 -t -v smaller-cat
# Fails verification, although it should use the same kmers as above

Is there something different about how kmers from different files versus from the same file are combined? I peeked at the code but it looked like they were all read into a single massive std::vector together.

Is it not supposed to work with these sets of kmers in the same index at all? (And is the fact that it does work when the files are separate just an exploitation of undefined behavior?)

Rare issues in edge generation

Assume that we have the following subgraphs:

1: B[CD]F[GH]J
2: [AB]C[EF]GI

There are the following paths of length 4:

(1,1): BCFG, BCFH, BDFG, BDFH
(1,2): CFGJ, CFHJ, DFGJ, DFHJ
(2,1): ACEG, ACFG, BCEG, BCFG
(2,2): CEGI, CFGI

If we start from paths of length 2 and do one step of prefix-doubling, we end up with the following list of paths:

(2,1): ACEG to BCEG
(1,1): BCFG
(2,1): BCFG
(1,1): BCFH to BDFH
(2,2): CEGI to CFGI
(1,2): CFGJ to DFHJ

The path graph has six nodes and four edges:

[(2,1): ACEG to BCEG] -> [(2,2): CEGI to CFGI] on A, B
[(2,1): BCFG]         -> [(2,2): CEGI to CFGI] on B
[(1,1): BCFG]         -> [(1,2): CFGJ to DFHJ] on B
[(1,1): BCFH to BDFH] -> [(1,2): CFGJ to DFHJ] on B

If we perform the final merge-by-label step, we end up with five nodes and four edges:

[(2,1): ACEG to BCEG] -> [(2,2): CEGI to CFGI] on A, B
[(1,1), (2,1): BCFG]  -> [(2,2): CEGI to CFGI] on B
[(1,1), (2,1): BCFG]  -> [(1,2): CFGJ to DFHJ] on B
[(1,1): BCFH to BDFH] -> [(1,2): CFGJ to DFHJ] on B

By doing backward searching in this graph, we could end up at (1,1) with pattern BCE and at (2,1) with pattern BD, which are both wrong.

The real issue is that the the merging step in prefix-doubling and the final merge-by-label step are incompatible. In some very rare cases (such as the example above), they will result in an incorrect path graph. The merge-by-label step is less useful, so we are probably going to drop it.

The remaining issue is how we can match the destination node of an edge with the correct source node, out of a set of path nodes with identical labels.

CHAR_WIDTH is a reserved macro

CHAR_WIDTH is defined as a macro in (gcc7's) limits.h, but is used as a variable name in GCSA.

We need to rename the variable.

GCSA compression

Many components of GCSA can be compressed to reduce the size of the index in memory. There is always a trade-off: compressing the components will make the queries slower. The following numbers are for a whole-genome graph (-e 4, path length 64, 5.27 billion paths).

The GCSA takes 9444 MB, but it can be compressed to 6174 MB. This can be broken down as follows:

BWT: 2142 MB -> 1140 MB, high impact on performance
Node/edge mapping: 1907 MB -> 367 MB, high impact
Sampled BWT positions: 785 MB -> 328 MB, moderate impact
Sampled nodes: 4610 MB -> 4342 MB, low impact

(There are also SDSL visualizations of the size breakdowns of the original and compressed GCSA.)

There are 918 million samples representing 593 million distinct nodes. Because the samples will take at least 3 GB, no matter how we compress them, compressing the index doesn't sound too beneficial.

results for a whole-genome index

I just build a GCSA for the 1000G + GRCh37 graph in less than 250G of RAM.

First, I pruned regions of the graph with high complexity as described vgteam/vg#92 (comment).

Then, I input the binary graphs into build_gcsa. It just barely fit on the available system. Here's the output log:

GCSA builder                                                                           

Input: 1.k16e3l100.graph (binary format)                                               
Input: 2.k16e3l100.graph (binary format)                                               
Input: 3.k16e3l100.graph (binary format)                                               
Input: 4.k16e3l100.graph (binary format)                                               
Input: 5.k16e3l100.graph (binary format)                                               
Input: 6.k16e3l100.graph (binary format)                                               
Input: 7.k16e3l100.graph (binary format)                                               
Input: 8.k16e3l100.graph (binary format)                                               
Input: 9.k16e3l100.graph (binary format)                                               
Input: 10.k16e3l100.graph (binary format)                                              
Input: 11.k16e3l100.graph (binary format)                                              
Input: 12.k16e3l100.graph (binary format)                                              
Input: 13.k16e3l100.graph (binary format)                                              
Input: 14.k16e3l100.graph (binary format)                                              
Input: 15.k16e3l100.graph (binary format)                                              
Input: 16.k16e3l100.graph (binary format)                                              
Input: 17.k16e3l100.graph (binary format)                                              
Input: 18.k16e3l100.graph (binary format)                                              
Input: 19.k16e3l100.graph (binary format)                                              
Input: 20.k16e3l100.graph (binary format)                                              
Input: 21.k16e3l100.graph (binary format)                                              
Input: 22.k16e3l100.graph (binary format)                                              
Input: X.k16e3l100.graph (binary format)                                               
Input: Y.k16e3l100.graph (binary format)                                               
Output: wg.k16e3l100d1.gcsa                                                            
Doubling steps: 1                                                                      
Size limit: 200 GB                                                                     

readKMers(): Read 4320594988 kmers of length 16                                        
GCSA::prefixDoubling(): Initial path length 16                                         
  mergePaths(): 4320594988 -> 4248730291 paths                                         
  mergePaths(): 742599350 unique, 3452817276 unsorted, 53313665 nondeterministic paths 
GCSA::prefixDoubling(): Step 1 (path length 16 -> 32)                                  
  joinPaths(): 4248730291 -> 5591514526 paths (15978630563 ranks)                      
  mergePaths(): 5591514526 -> 5507683427 paths                                         
  mergePaths(): 4630717465 unique, 815148663 unsorted, 61817299 nondeterministic paths 
GCSA::mergeByLabel(): 5507683427 -> 4729741298 paths                                   
GCSA::build(): 4988594815 edges                                                        
GCSA::sample(): 1185424502 samples at 515920142 positions                              
Index built in 18530.8 seconds                                                         

Paths:            4729741298                                                           
Edges:            4988594815                                                           
Samples:          1185424502 (at 515920142 positions, 41 bits each)                    
Max query:        32                                                                   
Index size:       10350.2 MB                                                           
Without samples:  4556.36 MB                                                           

readKMers(): Read 4320594988 kmers of length 16                                                                  
Index verification complete.                                                                                     

Memory usage: 224537 MB                                                                                          

        Command being timed: "build_gcsa -d 1 -o wg.k16e3l100d1 1.k16e3l100 2.k16e3l100 3.k16e3l100 4.k16e3l100 5
.k16e3l100 6.k16e3l100 7.k16e3l100 8.k16e3l100 9.k16e3l100 10.k16e3l100 11.k16e3l100 12.k16e3l100 13.k16e3l100 14
.k16e3l100 15.k16e3l100 16.k16e3l100 17.k16e3l100 18.k16e3l100 19.k16e3l100 20.k16e3l100 21.k16e3l100 22.k16e3l10
0 X.k16e3l100 Y.k16e3l100"                                                                                       
        User time (seconds): 153966.49                                                                           
        System time (seconds): 2748.86                                                                           
        Percent of CPU this job got: 674%                                                                        
        Elapsed (wall clock) time (h:mm:ss or m:ss): 6:27:24

An extra doubling step would be nice, but I don't think it will work on this system without further disk-based optimizations.

Tests on a single chromosome suggest this parameter configuration for pruning should be fine. Now that I know that this is possible, I'll do some exploration of the pruning parameters to see what else might work and still yield a viable mapper.

Support for Node Orientation?

Does GCSA2 support the concept of orientation for nodes? For example, if you have a k=5 de Bruijn graph for something like GATTACACCCTGTAATC, a sequence that starts with a 7-base motif and ends with its reverse complement, you would expect the node representing the first 5-mer in the de Bruijn graph and that representing the last 5-mer to be opposite orientations of the same node. I would expect this sort of structure, where you come out of a node, go around a loop, and head back into the same node in the opposite direction, would occur with some frequency in de Bruijn graphs with low k. However, I'm not sure how to articulate it in a form that GCSA2 will understand. Do you know how? Or are graphs not allowed to have that topology?

The reason I need this is that I'm working on node orientation and cycle support in vg. I already have node orientation in, but when you add cycles you are able to create graphs where you can't re-spell the graph to get around visiting a node in multiple orientations along the same path.

k<16 input graph

In discussion you mentioned there are problems in using kmers shorter than 16. I am interested in using them for various reasons, so I'd like to understand the limitations.

./configure; make; make install? :-)

Hello!

I am packaging this library for Debian as it is a dependency for VG.

To support this I had to make some changes to the Makefile. It would be nice if there was a standard ./configure; make; make install build system.

My patch is at https://salsa.debian.org/med-team/gcsa/blob/master/debian/patches/add_ldflags_cppflags but it is a bit Debian specific at the moment

Support for non-de-Bruijn graphs?

Looking through the code, it seems that there is only support for loading and indexing graphs composed of kmers of length 16 or less, where each kmer corresponds to a single unique node. There are some interesting tricks to pack that particular graph structure very small.

Is that the only kind of graph that the library is intended to be able to index? No mention of these constraints is made in the readme.

If not, when would there be support for graphs of more general topologies (for example, having the same 16-mer appear in 2 or more nodes)?

	// Parallel sorting is only available with GCC.
	#if (defined(__GNUC__) && !defined(__clang__))
	#include <parallel/algorithm>
	#endif