jltsiren / gcsa2 Goto Github PK
View Code? Open in Web Editor NEWBWT-based index for graphs
License: MIT License
BWT-based index for graphs
License: MIT License
The text format we've used so far has been useful for debugging various issues during GCSA development. Now as we're moving toward larger graphs, generating, transferring, and parsing the text has become a significant bottleneck. I think it's time to start using a binary format instead.
I've never developed file formats for production use, so I'm open to suggestions. In particular, I'm interested in best practices for not breaking backward compatibility when the requirements for the file format change.
This is what I've been thinking so far:
std::vector<KMer>
.KMer
structures in arbitrary order.key
, from
, and to
.key
Key::encode(const Alphabet& alpha, const std::string& kmer, byte_type pred, byte_type succ)
.alpha
is an object that maps between characters and the comp values (the internal alphabet of GCSA). The default constructor of Alphabet
creates an object that maps $ACGTN#
to 0-6.kmer
is the fixed-length label of the path.pred
encodes the set of characters that occur immediately before the path. pred & (1 << comp)
is true iff comp
is a comp value for a character in the set.succ
encodes the set of characters occurring immediately after the path in the same way.from
Node::encode(size_type node_id, size_type node_offset)
.node_id
is the identifier of the vg node where the path starts. We have 54 bits available for encoding the identifier, but using excessively large identifiers is probably a bad idea. The identifiers of different nodes do not have to form a contiguous range.node_offset
is the starting offset (in the label of the node) for the path. We have 10 bits available, so acceptable offsets range from 0 to 1023. The offsets in the same node should obviously be contiguous.to
from
.KMer
object with different to
fields is created for each of them.KMer
objects are created for the same path, the succ
field of the key
can encode either all successor characters or just the one at position to
, whichever is more convenient.eg10@vr-4-1-14:~/graphs/1kgp+h37$ time build_gcsa-b433cc28 -D k11e3l32.work -l 750 -o wg.k88 $(for i in $(seq 1 22; echo X; echo Y); do echo -n " k11e3l32/$i"; done)
GCSA builder
Input: k11e3l32/1.graph (binary format)
Input: k11e3l32/2.graph (binary format)
Input: k11e3l32/3.graph (binary format)
Input: k11e3l32/4.graph (binary format)
Input: k11e3l32/5.graph (binary format)
Input: k11e3l32/6.graph (binary format)
Input: k11e3l32/7.graph (binary format)
Input: k11e3l32/8.graph (binary format)
Input: k11e3l32/9.graph (binary format)
Input: k11e3l32/10.graph (binary format)
Input: k11e3l32/11.graph (binary format)
Input: k11e3l32/12.graph (binary format)
Input: k11e3l32/13.graph (binary format)
Input: k11e3l32/14.graph (binary format)
Input: k11e3l32/15.graph (binary format)
Input: k11e3l32/16.graph (binary format)
Input: k11e3l32/17.graph (binary format)
Input: k11e3l32/18.graph (binary format)
Input: k11e3l32/19.graph (binary format)
Input: k11e3l32/20.graph (binary format)
Input: k11e3l32/21.graph (binary format)
Input: k11e3l32/22.graph (binary format)
Input: k11e3l32/X.graph (binary format)
Input: k11e3l32/Y.graph (binary format)
Output: wg.k88.gcsa, wg.k88.lcp
Doubling steps: 3
Size limit: 750 GB
Branching factor: 64
Threads: 32
Temp directory: k11e3l32.work
Index built in 123545 seconds
Memory usage: 129.722 GB
I/O volume: 6495.02 GB read, 5396.42 GB write
Paths: 12277185468
Edges: 12724670284
Samples: 2170004049 (at 1389202089 positions, 41 bits each)
Max query: 88
Index size: 25282.4 MB
Without samples: 14676.4 MB
LCP size: 10407.5 MB
Final memory usage: 129.722 GB
real 2078m0.570s
user 6524m29.209s
sys 648m32.652s
This seems to have worked, but I cannot load the indexes it generated. There is no clear error message from the vg side, so I'm curious how I should debug them using the tools in gcsa2.
Probably I should have verified them in the build.
Hello, how can I use the .fa file and the .vcf file to generate a .gcsa2 format or a .graph file? thanks a lot!
Right now GCSA2 uses #ifdef VERBOSE_STATUS_INFO
macros to enable or disable debugging information during builds.
In vg
we sometimes want to see the debugging information (say, when building a whole genome index) and other times it is superfluous (such as when we're building 10 GCSAs a second in vg msga
).
I guess I'd like to also make output dependent on a runtime-configurable debug flag. Any ideas about the best way to do it?
The DiskIO
writing code doesn't check if its writes succeed. It's possible the files it writes are truncated by the backing disk running out of space. This causes problems when the files are read back.
I'm not entirely clear on the mechanism (maybe acting on garbage data that wasn't replaced by data that was supposed to have been read?), but this generates segfaults like this in vg
:
Program received signal SIGSEGV, Segmentation fault.
gcsa::PathGraphMerger::read (this=this@entry=0x7fffffff7a20, path=...) at path_graph.cpp:643
643 path.label[i] = this->rank_files[path.file][path.node.pointer() + i];
(gdb) bt
#0 gcsa::PathGraphMerger::read (this=this@entry=0x7fffffff7a20, path=...) at path_graph.cpp:643
#1 0x00000000009c9c7c in gcsa::PathGraphMerger::bufferNext (this=this@entry=0x7fffffff7a20) at path_graph.cpp:623
#2 0x00000000009c9ec5 in gcsa::PathGraphMerger::rangeEnd (this=this@entry=0x7fffffff7a20, start=start@entry=11)
at path_graph.cpp:612
#3 0x00000000009ca09f in gcsa::PathGraphMerger::next (this=this@entry=0x7fffffff7a20) at path_graph.cpp:534
#4 0x00000000009cb8a6 in gcsa::PathGraph::prune (this=this@entry=0x7fffffff8330, lcp=..., size_limit=<optimized out>)
at path_graph.cpp:830
#5 0x000000000098ca4f in gcsa::GCSA::GCSA (this=0x23dc860, graph=..., parameters=...) at gcsa.cpp:488
#6 0x00000000006c8234 in main_index (argc=<optimized out>, argv=<optimized out>) at src/subcommand/index_main.cpp:873
#7 0x0000000000418dd2 in main (argc=13, argv=0x7fffffffcea8) at src/main.cpp:7684
The writing code ought to bring down the whole program if it can't get its data on disk. Alternately, the reading code could have an idea of the expected file size and complain if the file is too small.
@jltsiren and I have discussed this several times, but in the interest of not losing track of it I'll document it here.
Currently in vg we're using GCSA2 like a kind of kmer index. We sample a series of kmers from the query. Hopefully, some of these have a very small number of positions in the genome. Then, clustering heuristics are used to establish a few targets and a local partial order alignment is executed for each and the best scoring alignment is returned as the mapping.
GCSA2 is preferable to a literal kmer table due to its low memory consumption, but we could be using it much more efficiently if it were possible to find maximal exact matches (MEMs) and use these rather than kmers to drive the mapping process. To quote @lh3 (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436841/):
MEMs are exact matches that cannot be extended in either direction without allowing a mismatch.
A given query may be represented as a series of MEMs broken by differences from the reference (errors or variants).
The goal would be to be able to determine the number of positions in the input graph that lie in a given BWT range. Doing this properly will require a structure that @jltsiren has used before for document counting. However, a stop-gap solution is to record if a given BWT position uniquely maps to a given graph position or not--- which would require the addition of a bitvector of the order of the BWT. This could be populated or during construction and then queried to determine if a given pattern has a single match. The algorithm to find (approximate) MEMs would then increase the query pattern length until obtaining a BWT range of a single position and would then check if this BWT position corresponds to a single position in the input graph.
Hi Jouni,
I was playing with the gcsa2 index and I encountered a strange behaviour: if I manually backward search a pattern in the index or by using the find()
, it returns a non-empty interval of size 1 altough I'm pretty sure that it's not in the graph.. The pattern is quite long (>1600bp).
I can prepare the test data and send you my code to test it, but before doing that I'd like to do some more tests and also know if there is a limit on the pattern size. If that's the case, then there is nothing we can do.
I built the gcsa2 index using vg index
(version: v1.46.0 "Altamura").
Thanks,
Luca
Hello, How do I generate a .gcsa2 format file without reverse complement?
Originally posted by @xd-hbj in #21 (comment)
I'm looking to replace the graph indexing code in my string-to-graph aligner that lives over in adamnovak/sequencegraphs. @ekg recommended I look at this project.
My current codebase uses an approach based on keeping a plain string BWT index of several paths through the graph, and then searching in that collection of paths. This provides a modicum of support for cyclic graphs, which I need, for example, in order to be able to find sequences covering the endpoints of inversions or translocations in the index.
I would need some similar partial support for cyclic graphs in GCSA2. What is the state with respect to cyclic graphs so far? Are cycles merely not indexed, or will they break the indexer if they are present? I note that you plan to add cycle support; how are you planning to design that? What constraints (i.e. only going around a cycle once) would you plan to impose? Would it be possible for me to contribute to that part of the project, in order to get it working for my aligner?
I got many gcsa files. Each of them is to one specific chromosome. How do I do to combine many gcsa files into one file?
In some cases it's helpful to get a small subset of exact matches for a given gcsa range even when there are a very large number of matches. However, it can be problematic to actually complete the locate call and fill out the set of hits. Could there be a simple way to obtain a subset that isn't simply the first N such hits?
Via xg/vg integration, I've noticed that some features of sdsl-lite are not thread safe (at least this is referred to in sdsl-lite issues). Is GCSA2 thread safe? In scaling up mapping we need to run with multiple threads.
Hi, when I compile this codes ,it fails , the screen print like this , what can I for this issue? thanks for your help!
Installing GCSA2 to '/data/home/xuyuan/biosoft/gcsa2'.
Running 20 parallel make jobs.
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c algorithms.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c dbg.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c files.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c gcsa.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c internal.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c lcp.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c path_graph.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c support.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c utils.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c build_gcsa.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c convert_graph.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c gcsa_format.cpp
/opt/anaconda3/bin/x86_64-conda_cos6-linux-gnu-c++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -std=c++11 -Wall -Wextra -DNDEBUG -fopenmp -pthread -O3 -ffast-math -funroll-loops -msse4.2 -march=native -msse4.2 -march=native -DHAVE_CXA_DEMANGLE -Iinclude -I/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include -c try_extend.cpp
In file included from include/gcsa/dbg.h:28:0,
from dbg.cpp:26:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from internal.cpp:26:0:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
In file included from support.cpp:26:0:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from support.cpp:27:0:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
make: *** [Makefile:47: internal.o] Error 1
make: *** Waiting for unfinished jobs....
In file included from utils.cpp:36:0:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
make: *** [Makefile:47: dbg.o] Error 1
In file included from include/gcsa/dbg.h:28:0,
from include/gcsa/path_graph.h:29,
from path_graph.cpp:26:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from include/gcsa/files.h:29:0,
from convert_graph.cpp:25:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from include/gcsa/path_graph.h:31:0,
from path_graph.cpp:26:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
In file included from include/gcsa/files.h:29:0,
from files.cpp:26:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from include/gcsa/files.h:29:0,
from include/gcsa/gcsa.h:29,
from gcsa_format.cpp:28:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from files.cpp:27:0:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
In file included from include/gcsa/files.h:29:0,
from include/gcsa/gcsa.h:29,
from include/gcsa/algorithms.h:29,
from gcsa.cpp:26:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from include/gcsa/files.h:29:0,
from include/gcsa/gcsa.h:29,
from include/gcsa/algorithms.h:29,
from algorithms.cpp:26:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from include/gcsa/files.h:29:0,
from include/gcsa/lcp.h:29,
from lcp.cpp:26:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from gcsa.cpp:27:0:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
In file included from lcp.cpp:30:0:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
In file included from algorithms.cpp:30:0:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
make: *** [Makefile:47: support.o] Error 1
include/gcsa/internal.h: In instantiation of 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&) [with ValueType = gcsa::PathNode; Getter = gcsa::FromGetter]':
path_graph.cpp:959:54: required from here
include/gcsa/internal.h:126:29: error: no matching function for call to 'sdsl::sd_vector_builder::sd_vector_builder(gcsa::size_type, std::vectorgcsa::PathNode::size_type, bool)'
sdsl::sd_vector_builder builder(Getter::get(input.back()) + 1, input.size(), true);
^~~~~~~
In file included from /data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/bit_vectors.hpp:11:0,
from /data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/wt_pc.hpp:25,
from /data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/wavelet_trees.hpp:32,
from include/gcsa/utils.h:36,
from include/gcsa/support.h:29,
from include/gcsa/dbg.h:28,
from include/gcsa/path_graph.h:29,
from path_graph.cpp:26:
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:80:9: note: candidate: sdsl::sd_vector_builder::sd_vector_builder(sdsl::sd_vector_builder::size_type, sdsl::sd_vector_builder::size_type)
sd_vector_builder(size_type n, size_type m);
^~~~~~~~~~~~~~~~~
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:80:9: note: candidate expects 2 arguments, 3 provided
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:74:9: note: candidate: sdsl::sd_vector_builder::sd_vector_builder()
sd_vector_builder();
^~~~~~~~~~~~~~~~~
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:74:9: note: candidate expects 0 arguments, 3 provided
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:56:7: note: candidate: sdsl::sd_vector_builder::sd_vector_builder(const sdsl::sd_vector_builder&)
class sd_vector_builder
^~~~~~~~~~~~~~~~~
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:56:7: note: candidate expects 1 argument, 3 provided
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:56:7: note: candidate: sdsl::sd_vector_builder::sd_vector_builder(sdsl::sd_vector_builder&&)
/data/home/xuyuan/biosoft/sdsl/sdsl-lite/include/sdsl/sd_vector.hpp:56:7: note: candidate expects 1 argument, 3 provided
make: *** [Makefile:47: utils.o] Error 1
In file included from include/gcsa/dbg.h:28:0,
from include/gcsa/path_graph.h:29,
from try_extend.cpp:25:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from include/gcsa/files.h:29:0,
from include/gcsa/gcsa.h:29,
from include/gcsa/algorithms.h:29,
from build_gcsa.cpp:29:
include/gcsa/support.h: In constructor 'gcsa::SadaSparse::SadaSparse(const Container&)':
include/gcsa/support.h:351:52: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
if(source[i] > 0) { tail += source[i]; builder.set_unsafe(tail - 1); }
^~~~~~~~~~
In file included from include/gcsa/path_graph.h:31:0,
from try_extend.cpp:25:
include/gcsa/internal.h: In constructor 'gcsa::ValueIndex<ValueType, Getter>::ValueIndex(const std::vector<_RealType>&)':
include/gcsa/internal.h:127:51: error: 'class sdsl::sd_vector_builder' has no member named 'set_unsafe'
for(const ValueType& value : input) { builder.set_unsafe(Getter::get(value)); }
^~~~~~~~~~
include/gcsa/internal.h: In member function 'gcsa::size_type gcsa::ValueIndex<ValueType, Getter>::find(gcsa::size_type) const':
include/gcsa/internal.h:136:30: error: 'const class sdsl::sd_vector<>' has no member named 'successor'
auto iter = this->values.successor(value);
^~~~~~~~~
include/gcsa/internal.h:137:64: error: 'const class sdsl::sd_vector<>' has no member named 'ones'
return (iter->second == value ? iter->first : this->values.ones());
^~~~
make: *** [Makefile:47: convert_graph.o] Error 1
make: *** [Makefile:47: gcsa_format.o] Error 1
make: *** [Makefile:47: lcp.o] Error 1
make: *** [Makefile:47: build_gcsa.o] Error 1
make: *** [Makefile:47: files.o] Error 1
make: *** [Makefile:47: try_extend.o] Error 1
make: *** [Makefile:47: algorithms.o] Error 1
make: *** [Makefile:47: path_graph.o] Error 1
make: *** [Makefile:47: gcsa.o] Error 1
Error: Could not compile GCSA2
Constructing a gcsa::GCSA
with the default constructor, and then immediately calling serialize()
to serialize the (empty) index to an ostream
, results in a segfault.
If it is forbidden to serialize a default-constructed gcsa::GCSA
, that needs to be documented in a doc comment for serialize()
. Otherwise, the serialization needs to be fixed to work.
I'm building GCSA2 with GNU GCC and Apple libc++ (instead of GNU libstdc++), in order to try and get vg builds working with GNU GCC and Homebrew's binary Protobuf library on Mac (which links against libc++).
GCSA2 checks the compiler identity, and uses that to decide whether the standard library has GNU's parallel/algorithm
nonstandard header. This doesn't work if you are using GNU GCC with anything other than libstdc++, and it misses out on using the header if you are using libstdc++ with any other compiler.
Lines 41 to 44 in 145c4ee
It looks like the Right Way to detect libc++ is to include any standard library header and then check if _LIBCPP_VERSION
is defined. For detecting libstdc++, you can incluse nearly any standard header and check for __GLIBCXX__
.
We should only use parallel/algorithm
if __GLIBCXX__
is defined, instead of keying on the compiler.
Over in vgteam/vg, @ekg and I are trying to create a GCSA2 index from only part of a graph. What we're doing is throwing out part of a graph, getting GCSA2 kmers for that, throwing out a different part of the graph, getting GCSA2 kmers for that, and then combining the kmers into one GCSA2 index for the graph.
This is operating under the assumption that you can just concatenate GCSA2 kmer files. It looks like you are supposed to be able to do this, but in practice you can't in some cases. I've attached some example files here.
I have a graph (smaller.vg) that looks like this:
In JSON format, it looks like:
{"edge": [{"to": 4972, "from": 4969}, {"to": 4972, "from": 4970}, {"to": 4973, "from": 4972}, {"to": 4974, "from": 4972}], "node": [{"id": 4969, "sequence": "G"}, {"id": 4970, "sequence": "A"}, {"id": 4971, "sequence": "T"}, {"id": 4972, "sequence": "AGTATC"}, {"id": 4973, "sequence": "C"}, {"id": 4974, "sequence": "T"}]}
If I prune out some nodes I don't like, it looks like this:
I can make GCSA2 text or binary format kmer files from those two graphs, with vg. I can index each graph separately with build_gcsa
, and I can index them together by passing both file names to build_gcsa
. But if I actually concatenate the kmer files, the index build fails.
To generate the attached files:
KMER=8
vg view -Jv smaller.json > smaller.vg
vg mod -pl ${KMER} -e 1 smaller.vg | vg kmers -g -k ${KMER} -H 1000000000 -T 1000000001 - > smaller-pruned.gcsa2
vg kmers -g -k ${KMER} -H 1000000000 -T 1000000001 smaller.vg > smaller-all.gcsa2
And to test:
build_gcsa -d 1 -t -v smaller-pruned
# Works
build_gcsa -d 1 -t -v smaller-all
# Works
build_gcsa -d 1 -t -v smaller-pruned smaller-all
# Works with both sets of kmers
cat smaller-pruned.gcsa2 smaller-all.gcsa2 > smaller-cat.gcsa2
build_gcsa -d 1 -t -v smaller-cat
# Fails verification, although it should use the same kmers as above
Is there something different about how kmers from different files versus from the same file are combined? I peeked at the code but it looked like they were all read into a single massive std::vector
together.
Is it not supposed to work with these sets of kmers in the same index at all? (And is the fact that it does work when the files are separate just an exploitation of undefined behavior?)
Assume that we have the following subgraphs:
1: B[CD]F[GH]J
2: [AB]C[EF]GI
There are the following paths of length 4:
(1,1): BCFG, BCFH, BDFG, BDFH
(1,2): CFGJ, CFHJ, DFGJ, DFHJ
(2,1): ACEG, ACFG, BCEG, BCFG
(2,2): CEGI, CFGI
If we start from paths of length 2 and do one step of prefix-doubling, we end up with the following list of paths:
(2,1): ACEG to BCEG
(1,1): BCFG
(2,1): BCFG
(1,1): BCFH to BDFH
(2,2): CEGI to CFGI
(1,2): CFGJ to DFHJ
The path graph has six nodes and four edges:
[(2,1): ACEG to BCEG] -> [(2,2): CEGI to CFGI] on A, B
[(2,1): BCFG] -> [(2,2): CEGI to CFGI] on B
[(1,1): BCFG] -> [(1,2): CFGJ to DFHJ] on B
[(1,1): BCFH to BDFH] -> [(1,2): CFGJ to DFHJ] on B
If we perform the final merge-by-label step, we end up with five nodes and four edges:
[(2,1): ACEG to BCEG] -> [(2,2): CEGI to CFGI] on A, B
[(1,1), (2,1): BCFG] -> [(2,2): CEGI to CFGI] on B
[(1,1), (2,1): BCFG] -> [(1,2): CFGJ to DFHJ] on B
[(1,1): BCFH to BDFH] -> [(1,2): CFGJ to DFHJ] on B
By doing backward searching in this graph, we could end up at (1,1) with pattern BCE and at (2,1) with pattern BD, which are both wrong.
The real issue is that the the merging step in prefix-doubling and the final merge-by-label step are incompatible. In some very rare cases (such as the example above), they will result in an incorrect path graph. The merge-by-label step is less useful, so we are probably going to drop it.
The remaining issue is how we can match the destination node of an edge with the correct source node, out of a set of path nodes with identical labels.
CHAR_WIDTH is defined as a macro in (gcc7's) limits.h, but is used as a variable name in GCSA.
We need to rename the variable.
Many components of GCSA can be compressed to reduce the size of the index in memory. There is always a trade-off: compressing the components will make the queries slower. The following numbers are for a whole-genome graph (-e 4
, path length 64, 5.27 billion paths).
The GCSA takes 9444 MB, but it can be compressed to 6174 MB. This can be broken down as follows:
(There are also SDSL visualizations of the size breakdowns of the original and compressed GCSA.)
There are 918 million samples representing 593 million distinct nodes. Because the samples will take at least 3 GB, no matter how we compress them, compressing the index doesn't sound too beneficial.
I just build a GCSA for the 1000G + GRCh37 graph in less than 250G of RAM.
First, I pruned regions of the graph with high complexity as described vgteam/vg#92 (comment).
Then, I input the binary graphs into build_gcsa
. It just barely fit on the available system. Here's the output log:
GCSA builder
Input: 1.k16e3l100.graph (binary format)
Input: 2.k16e3l100.graph (binary format)
Input: 3.k16e3l100.graph (binary format)
Input: 4.k16e3l100.graph (binary format)
Input: 5.k16e3l100.graph (binary format)
Input: 6.k16e3l100.graph (binary format)
Input: 7.k16e3l100.graph (binary format)
Input: 8.k16e3l100.graph (binary format)
Input: 9.k16e3l100.graph (binary format)
Input: 10.k16e3l100.graph (binary format)
Input: 11.k16e3l100.graph (binary format)
Input: 12.k16e3l100.graph (binary format)
Input: 13.k16e3l100.graph (binary format)
Input: 14.k16e3l100.graph (binary format)
Input: 15.k16e3l100.graph (binary format)
Input: 16.k16e3l100.graph (binary format)
Input: 17.k16e3l100.graph (binary format)
Input: 18.k16e3l100.graph (binary format)
Input: 19.k16e3l100.graph (binary format)
Input: 20.k16e3l100.graph (binary format)
Input: 21.k16e3l100.graph (binary format)
Input: 22.k16e3l100.graph (binary format)
Input: X.k16e3l100.graph (binary format)
Input: Y.k16e3l100.graph (binary format)
Output: wg.k16e3l100d1.gcsa
Doubling steps: 1
Size limit: 200 GB
readKMers(): Read 4320594988 kmers of length 16
GCSA::prefixDoubling(): Initial path length 16
mergePaths(): 4320594988 -> 4248730291 paths
mergePaths(): 742599350 unique, 3452817276 unsorted, 53313665 nondeterministic paths
GCSA::prefixDoubling(): Step 1 (path length 16 -> 32)
joinPaths(): 4248730291 -> 5591514526 paths (15978630563 ranks)
mergePaths(): 5591514526 -> 5507683427 paths
mergePaths(): 4630717465 unique, 815148663 unsorted, 61817299 nondeterministic paths
GCSA::mergeByLabel(): 5507683427 -> 4729741298 paths
GCSA::build(): 4988594815 edges
GCSA::sample(): 1185424502 samples at 515920142 positions
Index built in 18530.8 seconds
Paths: 4729741298
Edges: 4988594815
Samples: 1185424502 (at 515920142 positions, 41 bits each)
Max query: 32
Index size: 10350.2 MB
Without samples: 4556.36 MB
readKMers(): Read 4320594988 kmers of length 16
Index verification complete.
Memory usage: 224537 MB
Command being timed: "build_gcsa -d 1 -o wg.k16e3l100d1 1.k16e3l100 2.k16e3l100 3.k16e3l100 4.k16e3l100 5
.k16e3l100 6.k16e3l100 7.k16e3l100 8.k16e3l100 9.k16e3l100 10.k16e3l100 11.k16e3l100 12.k16e3l100 13.k16e3l100 14
.k16e3l100 15.k16e3l100 16.k16e3l100 17.k16e3l100 18.k16e3l100 19.k16e3l100 20.k16e3l100 21.k16e3l100 22.k16e3l10
0 X.k16e3l100 Y.k16e3l100"
User time (seconds): 153966.49
System time (seconds): 2748.86
Percent of CPU this job got: 674%
Elapsed (wall clock) time (h:mm:ss or m:ss): 6:27:24
An extra doubling step would be nice, but I don't think it will work on this system without further disk-based optimizations.
Tests on a single chromosome suggest this parameter configuration for pruning should be fine. Now that I know that this is possible, I'll do some exploration of the pruning parameters to see what else might work and still yield a viable mapper.
Does GCSA2 support the concept of orientation for nodes? For example, if you have a k=5 de Bruijn graph for something like GATTACACCCTGTAATC
, a sequence that starts with a 7-base motif and ends with its reverse complement, you would expect the node representing the first 5-mer in the de Bruijn graph and that representing the last 5-mer to be opposite orientations of the same node. I would expect this sort of structure, where you come out of a node, go around a loop, and head back into the same node in the opposite direction, would occur with some frequency in de Bruijn graphs with low k. However, I'm not sure how to articulate it in a form that GCSA2 will understand. Do you know how? Or are graphs not allowed to have that topology?
The reason I need this is that I'm working on node orientation and cycle support in vg. I already have node orientation in, but when you add cycles you are able to create graphs where you can't re-spell the graph to get around visiting a node in multiple orientations along the same path.
In discussion you mentioned there are problems in using kmers shorter than 16. I am interested in using them for various reasons, so I'd like to understand the limitations.
Hello!
I am packaging this library for Debian as it is a dependency for VG.
To support this I had to make some changes to the Makefile. It would be nice if there was a standard ./configure; make; make install
build system.
My patch is at https://salsa.debian.org/med-team/gcsa/blob/master/debian/patches/add_ldflags_cppflags but it is a bit Debian specific at the moment
Looking through the code, it seems that there is only support for loading and indexing graphs composed of kmers of length 16 or less, where each kmer corresponds to a single unique node. There are some interesting tricks to pack that particular graph structure very small.
Is that the only kind of graph that the library is intended to be able to index? No mention of these constraints is made in the readme.
If not, when would there be support for graphs of more general topologies (for example, having the same 16-mer appear in 2 or more nodes)?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.