
hdt-cpp's People

Contributors

artob, bjonnh, blake-regalia, chrysn, colinmaudry, d063520, dinikolop, donpellegrino, drobilla, florianludwig, joachimvh, kinghuang, laurensdv, laurensrietveld, lucafabbian, marioariasga, mielvds, osma, ptorrestr, rubensworks, rubenverborgh, sharpeeeeeeee, sharperpaper, vandenoever, vlcinsky, webdata, wouterbeek


hdt-cpp's Issues

Untracked file `.make-senitel`

I sometimes get the following file in my hdt-cpp directory: hdt-lib/.make-senitel. Should this be added to .gitignore or should this file not be created in the first place?

hdtSearch improperly handles spaces in input literals

I noticed problems with both hdtSearch (cpp) and hdtSearch.sh (java). Both apparently give incorrect results in some cases when looking for a specific literal value. I will report the Java results in a separate issue for hdt-java.

My test dataset is this NT file with only 3 triples:

<http://example.org/000046085> <http://schema.org/name> "Raamattu" .
<http://example.org/000146854> <http://schema.org/name> "Ajan lyhyt historia" .
<http://example.org/000019643> <http://schema.org/name> "Seitsemän veljestä" .

I converted it to HDT using the cpp version of rdf2hdt. Then I query it for the literal values using hdtSearch:

$ rdf2hdt hdt-test.nt hdt-test.hdt
HDT Successfully generated.                                                    
Total processing time: Clock(801 us)  User()  System()

$ hdtSearch hdt-test.hdt
Predicate Bitmap in 34 us
Count predicates in 16 us
Count Objects in 9 us Max was: 1
Bitmap in 7 us
Bitmap bits: 3 Ones: 3
Object references in 13 us
Sort lists in 8 us
Index generated in 136 us
>> ? ? "Raamattu"
http://example.org/000046085 http://schema.org/name "Raamattu"
1 results in 40 us
>> ? ? "Ajan lyhyt historia"
0 results in 5 us
>> ? ? "Seitsemän veljestä"
0 results in 6 us

As you can see from the above output, the first query (for "Raamattu") gives the correct result, but the two subsequent ones give zero results even though they should match one triple in the data.

I'm not sure whether the problem is in the HDT generation, index file generation, or querying. However, I did try generating the index with hdtSearch.sh from hdt-java instead. It didn't make a difference to the results.
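To help narrow down whether the problem lies in the generated index or in how the interactive prompt parses literals containing spaces, here is a minimal sketch using the C++ library API directly (assuming the HDTManager / search entry points; the file name is the test file above):

#include <HDTManager.hpp>
#include <iostream>

int main() {
    // Map the HDT file together with its index.
    hdt::HDT *h = hdt::HDTManager::mapIndexedHDT("hdt-test.hdt");
    // Literal objects are passed with their surrounding quotes.
    hdt::IteratorTripleString *it = h->search("", "", "\"Seitsemän veljestä\"");
    while (it->hasNext()) {
        hdt::TripleString *t = it->next();
        std::cout << t->getSubject() << " " << t->getPredicate()
                  << " " << t->getObject() << std::endl;
    }
    delete it;
    delete h;
    return 0;
}

If this programmatic lookup returns the triple while the interactive prompt does not, the prompt's handling of spaces is the more likely culprit.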

Compiling errors on gcc 4.9.2 (Debian 4.9.2-10)

Got errors like:

In file included from src/bitsequence/BitSequence375.cpp:30:0:
src/bitsequence/../util/bitutil.h: In function ‘void hdt::bitset(uint32_t*, size_t)’:
src/bitsequence/../util/bitutil.h:93:13: error: redefinition of ‘void hdt::bitset(uint32_t*, size_t)’
inline void bitset(uint32_t * e, size_t p) {
         ^
src/bitsequence/../util/bitutil.h:74:13: note: ‘void hdt::bitset(size_t*, size_t)’ previously defined here
 inline void bitset(size_t * e, size_t p) {
         ^
src/bitsequence/../util/bitutil.h: In function ‘void hdt::bitclean(uint32_t*, size_t)’:
src/bitsequence/../util/bitutil.h:98:13: error: redefinition of ‘void hdt::bitclean(uint32_t*, size_t)’
 inline void bitclean(uint32_t * e, size_t p) {
         ^
src/bitsequence/../util/bitutil.h:79:13: note: ‘void hdt::bitclean(size_t*, size_t)’ previously defined here
inline void bitclean(size_t * e, size_t p) {

I commented out these sections in bitutil.h and the compilation worked.
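A less invasive alternative to commenting the overloads out might be to guard the 32-bit variants so they are only declared when uint32_t and size_t are distinct types; this is only a sketch, the real bitutil.h layout differs:

#include <stdint.h>
#include <stddef.h>

namespace hdt {

inline void bitset(size_t *e, size_t p) {
    e[p / (8 * sizeof(size_t))] |= ((size_t)1) << (p % (8 * sizeof(size_t)));
}

#if SIZE_MAX != UINT32_MAX  // skip the duplicate overload where size_t is itself 32-bit
inline void bitset(uint32_t *e, size_t p) {
    e[p / 32] |= ((uint32_t)1) << (p % 32);
}
#endif

} // namespace hdt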

Also the Makefile contained an option not recognized by the GCC compiler:

cc1plus: warning: unrecognized command line option "-Wno-unknown-warning-option"

Cannot create HDT file when numbers appear in the IRI scheme

I'm unable to create an HDT file for the following N-Triples content:

<a:b> <a:b> <a2:b>

The problem is caused by the 2, which is allowed to appear in IRI scheme components, but gives the following error:

syntax does not support relative IRIs
Catch exception load: Error parsing input.

The serdi command-line tool parses this content correctly.

I'm using Serd 0.26.0 and HDT commit d4b5244

Merge HDT files or update HDT with rdf

Is it possible, or within the scope of HDT, to merge multiple HDT files or to update an HDT file with another RDF file?

This would be great for analysis pipelines that generate new RDF triples based on the content of the input RDF/HDT files.
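If no built-in merge is available, a rough workaround sketch (assuming the HDTManager / IteratorTripleString API) is to dump the triples of both files into one N-Triples file and regenerate a merged HDT with rdf2hdt; literal escaping and blank-node subjects are not handled here:

#include <HDTManager.hpp>
#include <fstream>
#include <string>

static void dumpAsNTriples(hdt::HDT *h, std::ostream &out) {
    hdt::IteratorTripleString *it = h->search("", "", "");
    while (it->hasNext()) {
        hdt::TripleString *t = it->next();
        const std::string &o = t->getObject();
        out << "<" << t->getSubject() << "> <" << t->getPredicate() << "> ";
        if (!o.empty() && (o[0] == '"' || o[0] == '_')) out << o;  // literal or blank node
        else out << "<" << o << ">";                               // IRI
        out << " .\n";
    }
    delete it;
}

int main() {
    hdt::HDT *a = hdt::HDTManager::mapHDT("a.hdt");
    hdt::HDT *b = hdt::HDTManager::mapHDT("b.hdt");
    std::ofstream out("merged.nt");
    dumpAsNTriples(a, out);
    dumpAsNTriples(b, out);
    delete a;
    delete b;
    // Afterwards: rdf2hdt merged.nt merged.hdt
    return 0;
}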

Error compiling test on Ubuntu 14.04 LTS 64 bit

Hi,

after compiling hdt-lib, compiling the tests (with make tests) results in the following compilation error.

$ make tests
 [HDT] Compiling test tests/bit375.cpp
 [HDT] Compiling test tests/bitutiltest.cpp
 [HDT] Compiling test tests/cmp.cpp
tests/cmp.cpp: In member function ‘virtual unsigned char* MyIterator::next()’:
tests/cmp.cpp:33:65: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
   sprintf((char*)buffer, "AAA %015lld FINNN", (uint64_t) count++);
                                                                 ^
 [HDT] Compiling test tests/confm.cpp
/tmp/cc4opceY.o: In function `main':
confm.cpp:(.text.startup+0xc79): undefined reference to `hdt::LiteralDictionary::LiteralDictionary()'
confm.cpp:(.text.startup+0xc92): undefined reference to `hdt::LiteralDictionary::import(hdt::Dictionary*, hdt::ProgressListener*)'
confm.cpp:(.text.startup+0xd7b): undefined reference to `hdt::LiteralDictionary::save(std::ostream&, hdt::ControlInformation&, hdt::ProgressListener*)'
confm.cpp:(.text.startup+0xdf6): undefined reference to `hdt::LiteralDictionary::~LiteralDictionary()'
confm.cpp:(.text.startup+0xee1): undefined reference to `hdt::LiteralDictionary::~LiteralDictionary()'
collect2: error: ld returned 1 exit status
make: *** [tests/confm] Error 1

Cristian

Remove built-in N-Triples parser/serializer

As discussed in #12, @RubenVerborgh and I would propose to remove the buggy built-in N-Triples parser and serializer implementation and rely on Serd instead.

This does mean that Serd would be required in order for the command-line utilities to be of any use (and it might hence make sense to make Serd a required dependency, instead of optional), but that's practically the case right now as well for anything nontrivial, given that the built-in parser is buggy.

Also, Serd is a compact enough library that potentially (if really needed) we could import it into our source tree, at which point there would be no external dependency to install.

Thoughts?

Replace libcds with sdsl-lite

Hi all!

When I first reviewed hdt-cpp last winter, hoping to find a way to get it packaged for Debian, I noticed the libcds dependency, which caused me some grief when compiling (but never mind that) and which I think would also cause problems for packaging. I made a thread in the forums, but I suppose opening an issue is better for the devs' workflow. :-) I hope my view here isn't outdated. :-)

It seems like libcds is a fork of somebody else's code, and the link in there is dead. It also seems the upstream author has started a new version but seemingly not finished it. The author has made a post about the motivation for it.

To get this code into a distro, which I think is very important for widespread adoption, libcds should not be bundled but kept separate, so that it can be packaged independently by distros. Also, if any patches are needed, they should be sent upstream for incorporation. And since upstream has abandoned version 1 of their library, hdt-cpp should probably move to libcds2.

Cheers,

Kjetil

Add a version number to the .hdt.index files

Different .hdt.index files for the same .hdt file are incompatible with each other; this causes problems if different applications read them. Maybe they should receive some kind of version number.

But this then creates the problem: how to find the index?
Perhaps, instead of supplying the .hdt file name, we can provide the index file name.

New hdt version

What about merging develop to master? I've been using the latest develop branch intensively and it seems there are no issues that should stop it from becoming the next version.
If we do the merge soon, we can use the develop branch to start properly testing the new compilation procedure (#81)

Build with OpenMP fails on Ubuntu 14.10 for devel branch

[CFG] OS= Linux
[CFG] CPP = g++
[CFG] FLAGS = -O3 -Wno-deprecated -fopenmp
[CFG] LDFLAGS = 
[CFG] DEFINES =  -DHAVE_RAPTOR -DHAVE_LIBZ -DHAVE_SERD
[CFG] INCLUDES = -I ../libcds-v1.0.12/includes/ -I /usr/local/include -I ./include -I /opt/local/include -I /usr/include
[CFG] LIB= ../libcds-v1.0.12/lib/libcds.a -L/usr/local/lib -lstdc++ -lraptor2 -lz -lserd-0

[HDT] Compiling src/dictionary/PlainDictionary.cpp
src/dictionary/PlainDictionary.cpp: In member function ‘void hdt::PlainDictionary::lexicographicSort(hdt::ProgressListener*)’:
src/dictionary/PlainDictionary.cpp:453:9: error: expected ‘#pragma omp section’ or ‘}’ before ‘{’ token
         { sort(shared.begin(), shared.end(), DictionaryEntry::cmpLexicographic); }
         ^
src/dictionary/PlainDictionary.cpp: In member function ‘void hdt::PlainDictionary::idSort()’:
src/dictionary/PlainDictionary.cpp:481:9: error: expected ‘#pragma omp section’ or ‘}’ before ‘{’ token
         { sort(subjects.begin(), subjects.end(), DictionaryEntry::cmpID); }
         ^
Makefile:94: recipe for target 'src/dictionary/PlainDictionary.o' failed
make[1]: *** [src/dictionary/PlainDictionary.o] Error 1
make[1]: Leaving directory '/hdt-cpp/hdt-lib'
Makefile:57: recipe for target 'all' failed
make: *** [all] Error 2
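For reference, a minimal sketch of the construct the compiler expects here (illustrative only, not the actual PlainDictionary code): every block inside a parallel sections region must be introduced by its own #pragma omp section.

#include <algorithm>
#include <vector>

void sortBoth(std::vector<int> &a, std::vector<int> &b) {
    #pragma omp parallel sections
    {
        #pragma omp section
        { std::sort(a.begin(), a.end()); }

        #pragma omp section
        { std::sort(b.begin(), b.end()); }
    }
}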

Unicode sequences are incorrectly escaped

This triple:
<http://dbpedia.org/resource/Barack_Obama> <http://www.w3.org/2002/07/owl#sameAs> <http://el.dbpedia.org/resource/\u039C\u03C0\u03B1\u03C1\u03AC\u03BA_\u039F\u03BC\u03C0\u03AC\u03BC\u03B1> .
results in
<http://dbpedia.org/resource/Barack_Obama> <http://www.w3.org/2002/07/owl#sameAs> <http://el.dbpedia.org/resource/\\u039C\\u03C0\\u03B1\\u03C1\\u03AC\\u03BA_\\u039F\\u03BC\\u03C0\\u03AC\\u03BC\\u03B1>.
This is particularly a problem when using HDT via LDF, as the n3 parser throws an exception.

Parsing of blank nodes

Parsing of blank nodes in the object position returns invalid terms. To reproduce:

echo "<http://sub> <http://pred> _:b0." > test.nt && ./tools/rdf2hdt test.nt test.hdt && ./tools/hdtSearch -q '? ? ?' test.hdt 2>/dev/null

HDT with the rapper parser returns http://sub http://pred b0. (notice the missing _:)
HDT with the serdi parser returns http://sub http://pred _:b0. (notice the trailing dot, which should not be there)

As a result, when you export this triple to RDF using hdt2rdf, you'll get invalid ntriples[1]

I'm confused about this issue though. Running echo "<http://sub> <http://pred> _:b0." | serdi - returns valid ntriples, i.e. serdi seems to parse/serialize this correctly.
But when I log the SerdNode value in the HDT serializer (https://github.com/rdfhdt/hdt-cpp/blob/develop/hdt-lib/src/rdf/RDFParserSerd.cpp#L18) I do get the dot as part of the bnode value.

Some help is appreciated ;)

[1]
serdi: <http://sub> <http://pred> _:b0. .
rapper: <http://sub> <http://pred> <b0.> .

FMIndex loading fails on OS X

Loading an HDT file with an FMIndex fails on OSX. This Travis run shows it works on Linux.

The failure seems to be as follows:

  • in CSD_FMIndex::load, we convert an in-memory range to a stringstream by calling rdbuf()->pubsetbuf.
  • when we attempt to read from this stream on OS X, the fail and bad bits of the stream are set.

Now there seems to be quite some discussion about whether calling pubsetbuf is actually a good idea. So I have tried the often-suggested alternative of implementing a custom subclass instead. However, this then fails at another point (apparently because seeking is not implemented).

Who can help us get rid of the pubsetbuf hack?
The original code was written by @MarioAriasGa.
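One possible direction, sketched under the assumption that CSD_FMIndex::load only needs sequential reads plus seeking: a small read-only streambuf over the in-memory range that also implements seekoff/seekpos, instead of calling pubsetbuf.

#include <streambuf>
#include <istream>
#include <ios>

class MemReadBuf : public std::streambuf {
public:
    MemReadBuf(char *data, std::size_t size) { setg(data, data, data + size); }
protected:
    // istream::seekg() ends up here; the default streambuf implementation just fails.
    virtual std::streampos seekoff(std::streamoff off, std::ios_base::seekdir dir,
                                   std::ios_base::openmode) {
        char *base = eback(), *end = egptr(), *target;
        if (dir == std::ios_base::beg)      target = base + off;
        else if (dir == std::ios_base::cur) target = gptr() + off;
        else                                target = end + off;
        if (target < base || target > end) return std::streampos(std::streamoff(-1));
        setg(base, target, end);
        return std::streampos(target - base);
    }
    virtual std::streampos seekpos(std::streampos pos, std::ios_base::openmode which) {
        return seekoff(std::streamoff(pos), std::ios_base::beg, which);
    }
};

// Usage sketch: MemReadBuf buf(ptr, length); std::istream in(&buf); then read as before.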


Relevant backstory: I found this out while testing HDT-node. Strangely, the exact same code works for Node.js 4 and 6 on Linux and on OS X, but fails for Node.js 7 on OS X (but works on Linux). And as you can see above, it always fails in standalone mode on OS X. So somehow, the Node.js 4 and 6 compilations on OS X are doing something different. I wonder if this issue is related.


PS To test this: just fork from the bug-literal-dictionary branch, and when you commit and push, Travis CI will test for you.

Expected HDT compressibility?

In published examples I've seen, HDT file compressibility (by doing an additional gzip or bzip2 step) has been a few percent at the most. Given that baseline, the following seems anomalous:

-rw-rw-r-- 1 arto arto  21813380 Dec 10 12:43 pc_biosystem.hdt
-rw-rw-r-- 1 arto arto   5054074 Dec 10 12:44 pc_biosystem.hdt.bz2
-rw-rw-r-- 1 arto arto   6988217 Dec 10 12:43 pc_biosystem.hdt.gz
-rw-rw-r-- 1 arto arto 833444759 Dec 10 12:42 pc_biosystem.nt
-rw-rw-r-- 1 arto arto  15820798 Dec 10 12:46 pc_biosystem.nt.bz2
-rw-rw-r-- 1 arto arto  21045372 Dec 10 12:44 pc_biosystem.nt.gz
-rw-rw-r-- 1 arto arto 327940461 Dec 10 12:42 pc_biosystem.ttl

That is, the .hdt file is 20 MB, the gzip-compressed .hdt.gz is 6.66 MB, and the bzip2-compressed .hdt.bz2 is 4.82 MB.

Instructions to reproduce these figures from scratch follow here:

wget ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/biosystem/pc_biosystem.ttl.gz
gunzip pc_biosystem.ttl.gz
rapper -i turtle -o ntriples pc_biosystem.ttl > pc_biosystem.nt
rdf2hdt -f ntriples pc_biosystem.nt pc_biosystem.hdt
gzip -9 < pc_biosystem.hdt > pc_biosystem.hdt.gz
bzip2 -9 < pc_biosystem.hdt > pc_biosystem.hdt.bz2

The question here is, are these results expected or anomalous? (Pinging @MarioAriasGa.)

ModifiableHDT?

I'm working on a SWI-Prolog interface to this library. Read access now mostly works. Now I am trying to make it possible to create HDTs from triples available in Prolog. I could of course first write an N-Triples file, but that seems a waste of resources. I tried including HDTFactory.hpp, creating an instance of a BasicModifiableHDT using HDTFactory::createDefaultModifiableHDT(), and calling ->insert() and ->save() on that. This creates a file, but trying to open it using hdt2rdf says

ERROR: This software cannot open this version of HDT File

What am I missing?
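A heavily hedged workaround sketch: write the triples to a temporary N-Triples file and let the library build the HDT from it, which yields a file the standard loader accepts. This assumes the HDTManager::generateHDT entry point, the NTRIPLES notation constant, and saveToHDT; check the headers of your checkout before relying on it.

#include <HDTManager.hpp>
#include <HDTSpecification.hpp>
#include <fstream>

int main() {
    {
        // Triples coming from Prolog would be written here instead.
        std::ofstream nt("tmp.nt");
        nt << "<http://example.org/s> <http://example.org/p> \"o\" .\n";
    }
    hdt::HDTSpecification spec;
    hdt::HDT *h = hdt::HDTManager::generateHDT("tmp.nt", "http://example.org/base",
                                               hdt::NTRIPLES, spec);
    h->saveToHDT("out.hdt");
    delete h;
    return 0;
}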

OFFSET is not working in constant time

@momo54 reported that although he was expecting nearly constant time for OFFSET (using the "goto" methods in BitmapTriplesIterators.cpp), in practice he observed large delays using Linked Data Fragments + HDT with particular OFFSET queries: SP?o, S?p?o, ?sPO and ?s?pO + OFFSET.

  • Note that ?sP?o +OFFSET cannot run in constant time with our current implementation.

Fix URI encoding problems

I've converted an NT file to HDT and loaded the HDT for querying in Jena. However, some resources are not queryable although they are present in the dataset.

The dataset is

<http://ru.dbpedia.org/resource/Список_римско-католических_епархий_(структурный_вид)> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Абаэтетуба> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Бразилия> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/25_ноября> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/1961_год> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Собор_Непорочного_зачатия_Пресвятой_Девы_Марии_(Абаэтетуба)> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Архиепархия_Белен-до-Пара> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Епархия> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Римско-католическая_церковь> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Кафедра_(христианство)> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Собор_(храм)> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Папство> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Иоанн_XXIII> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Булла> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Территориальное_аббатство> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/4_августа> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/1981_год> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Иоанн_Павел_II> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Священник> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Annuario_Pontificio> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Acta_Apostolicae_Sedis> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Категория:Католические_епархии_Бразилии> .
<http://ru.dbpedia.org/resource/Архиепархия_Белен-до-Пара> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://www.w3.org/2000/01/rdf-schema#label> "Епархия Абаэтетубы"@ru .

In Jena the following query does not match anything:

StmtIterator tmpIter = model.listStatements(model.getResource("http://ru.dbpedia.org/resource/Епархия_Абаэтетубы"), null, (RDFNode)null);
if(!tmpIter.hasNext())
    System.out.println("No statements for this resource");

Loading the same dataset in Jena but as a ttl file works.

Maybe a bug in the HDT conversion process? Note that we are dealing with Russian Cyrillic characters.

Cannot save Index if the HDT is not saved

Using the rdf2hdt tool with the -i option I keep getting this warning/notice (it is not presented as an error). This message is quite confusing; how should the program be called to generate the HDT index?

Thanks,

C

Ensure test suite can be run without Kyoto Cabinet and Raptor

Currently you need to have Kyoto Cabinet and Raptor installed to build the test suite with make tests. I've attached the error output in make-tests.log.txt.

The test suite should be amended such that it doesn't require any of the optional dependencies of the library to be installed. This could be done either by restructuring the test suite and/or by ensuring that tests for optional dependencies only activate when those dependencies are present.

Error parsing a turtle file with certain language tags

When using rdf2hdt on the following turtle file:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skos-thes: <http://purl.org/iso25964/skos-thes#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<urn:x-skosprovider:trees/3> a skos:Collection ;
    dcterms:identifier 3 ;
    skos:inScheme <urn:x-skosprovider:trees> ;
    skos:member <urn:x-skosprovider:trees/1>,
        <urn:x-skosprovider:trees/2> ;
    skos:prefLabel "Trees by species"@en,
        "Bomen per soort"@nl-BE .

<urn:x-skosprovider:trees/1> a skos:Concept ;
    dcterms:identifier 1 ;
    skos:closeMatch <http://id.python.org/different/types/of/trees/nr/1/the/larch> ;
    skos:definition "A type of tree."@en ;
    skos:inScheme <urn:x-skosprovider:trees> ;
    skos:prefLabel "The Larch"@en,
        "De Lariks"@nl .

<urn:x-skosprovider:trees/2> a skos:Concept ;
    dcterms:identifier 2 ;
    skos:altLabel "la châtaigne"@fr,
        "De Paardekastanje"@nl ;
    skos:definition "A different type of tree."@en ;
    skos:inScheme <urn:x-skosprovider:trees> ;
    skos:prefLabel "The Chestnut"@en ;
    skos:relatedMatch <http://id.python.org/different/types/of/trees/nr/17/the/other/chestnut> .

<urn:x-skosprovider:trees> a skos:ConceptScheme ;
    dcterms:identifier "TREES" ;
    skos:hasTopConcept <urn:x-skosprovider:trees/1>,
        <urn:x-skosprovider:trees/2> ;
    skos:prefLabel "Different types of trees"@en,
        "Verschillende soorten bomen"@nl .

I get the following error:

[21:34:00] $ /home/koen/Projecten/general/hdt-cpp/hdt-lib/tools/rdf2hdt TREES-full.ttl TREES-full.hdt
expected `(', not `B'
caught here??
Catch exception load: Error parsing input.
ERROR: Error parsing input.

Using an rdfxml version of the same file I have no problems:

[21:39:04] $ /home/koen/Projecten/general/hdt-cpp/hdt-lib/tools/rdf2hdt -f rdfxml TREES-full.rdf TREES-full.hdt
RDF format: rdfxml
HDT Successfully generated.                                                    
Total processing time: Clock(10 ms 901 us)  User(10 ms 516 us)  System(552 us)

Both the Turtle and the rdfxml version were generated from Python using RDFLIB, on Ubuntu 14.04

I believe rdf2hdt uses Serd for reading Turtle files. Using serdi on the same file gives me the following error:

[21:42:26] 1 $ serdi TREES-full.ttl
<urn:x-skosprovider:trees/3> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Collection> .
<urn:x-skosprovider:trees/3> <http://purl.org/dc/terms/identifier> "3"^^<http://www.w3.org/2001/XMLSchema#integer> .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#inScheme> <urn:x-skosprovider:trees> .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#member> <urn:x-skosprovider:trees/1> .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#member> <urn:x-skosprovider:trees/2> .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#prefLabel> "Trees by species"@en .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#prefLabel> "Bomen per soort"@nl- .
error: TREES-full.ttl:16:29: expected `.', not `B'

rdf2hdt issues

There are a few issues with the rdf2hdt command:

  • It does not always print to stdout (using rdf2hdt file.hdt -).
  • The built-in N-Triples serializer works, but when the Raptor serializer is included it does not.
  • Do we want raptor as a serializer anyway? (considering that the parser lib was changed to serd #31)

How to load a gzipped N-Triples file?

On branch stable I am able to create an HDT file from a gzipped N-Triples file (this probably uses Raptor). On branch master this no longer works (this probably uses Serd).

What is the best way to create an HDT file from a gzipped N-Triples file on master? Could the library be taught to do the right thing automatically?

determine node type from id

A subject can be a blank node or an iri. An object can be a blank node, an iri or a literal. A TripleID contains unsigned int for subject, object and predicate.

Is there a cheap way to determine the type of the node without retrieving the string value?

The nodes are sorted in the dictionary. Blank nodes start with '_', literals start with '"' and IRIs start with '<'. So by knowing the number of blank nodes, IRIs and literals in the dictionary, it is possible to determine the node type from the id.
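A small sketch of that range-check idea; the counts per section are assumptions that would have to come from the concrete dictionary implementation, and the exact ordering of literals, blank nodes and IRIs within a section depends on how the strings are stored, so treat this as illustrative only.

#include <cstddef>

enum NodeType { NODE_LITERAL, NODE_BLANK, NODE_IRI };

// Assuming the object section assigns ids 1..numLiterals to literals,
// then numBlanks blank nodes, and IRIs afterwards.
NodeType objectNodeType(size_t id, size_t numLiterals, size_t numBlanks) {
    if (id <= numLiterals)             return NODE_LITERAL;
    if (id <= numLiterals + numBlanks) return NODE_BLANK;
    return NODE_IRI;
}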

enable KCB option for rdf2hdt

I am wondering how to run the rdf2hdt conversion with the Kyoto Cabinet database (KCB) option. We have a very large RDF N-Triples file that exceeds the memory limit for conversion, and the job got killed due to memory use. I cannot find anything about this in the README file.

In addition, I took a look at the BasicHDT.cpp file and found there is no option for KCB:
void BasicHDT::createComponents() {
    // HEADER
    header = new PlainHeader();

    // DICTIONARY
    std::string dictType = spec.get("dictionary.type");
    if(dictType==HDTVocabulary::DICTIONARY_TYPE_FOUR) {
        dictionary = new FourSectionDictionary(spec);
    } else if(dictType==HDTVocabulary::DICTIONARY_TYPE_PLAIN) {
        dictionary = new PlainDictionary(spec);
    } else if(dictType==HDTVocabulary::DICTIONARY_TYPE_LITERAL) {
#ifdef HAVE_CDS
        dictionary = new LiteralDictionary(spec);
#else
        throw "This version has been compiled without support for this dictionary";
#endif
    } else {
        dictionary = new FourSectionDictionary(spec);
    }

    // TRIPLES
    std::string triplesType = spec.get("triples.type");
    if(triplesType==HDTVocabulary::TRIPLES_TYPE_BITMAP) {
        triples = new BitmapTriples(spec);
    } else if(triplesType==HDTVocabulary::TRIPLES_TYPE_COMPACT) {
        triples = new CompactTriples(spec);
    } else if(triplesType==HDTVocabulary::TRIPLES_TYPE_PLAIN) {
        triples = new PlainTriples(spec);
    } else if(triplesType==HDTVocabulary::TRIPLES_TYPE_TRIPLESLIST) {
        triples = new TriplesList(spec);
#ifndef WIN32
    } else if (triplesType == HDTVocabulary::TRIPLES_TYPE_TRIPLESLISTDISK) {
        triples = new TripleListDisk();
#endif
    } else {
        triples = new BitmapTriples(spec);
    }
}

Best,
Gang

Improve API to production quality

After the already-merged enhancements in pull requests #16 and #17, these are the remaining immediate pain points I have encountered so far in attempting to use hdt-lib in a production application:

  1. The library does not throw standard exceptions, i.e., exceptions derived from the C++ standard library's std::exception base class. Instead, the library throws raw C strings directly (e.g., src/hdt/BasicHDT.cpp:106). This is not great practice, and these exceptions will go uncaught by many applications, or else will land in a top-level catch-all handler catch (...) at which point they can no longer be interpreted at all.
  2. The library in multiple locations eats up exceptions and prints them out to the standard output stream (e.g., src/hdt/BasicHDT.cpp:171), instead of letting the application handle them. Unilaterally printing out stuff to the application's standard output or standard error is a definite no-no for a production-quality library. Also, error conditions really ought to propagate up the stack to where they can be properly observed and logged.
  3. The library is overall missing careful const-qualification of methods and parameters, resulting in spurious temporaries and in the inability to hold const references to hdt-lib objects, complicating const-correct application code by necessitating const_cast<> casts that undermine type safety.
  4. Additionally, from a performance standpoint, the library currently overall performs quite a bit of spurious memory allocations (that will slow things down and fragment the heap), in particular by overuse of std:string copies where a const std::string& reference or a char* pointer would suffice.

I'd like to address each of these points in order to make the library suitable for use from production code, but wanted to express my intentions here first, as in particular the first item would necessitate changes to any existing application code that uses hdt-lib and assumes that it will throw raw C strings instead of standard exceptions.

Soliciting feedback from @mielvds and @RubenVerborgh in particular.
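To illustrate point 1, a minimal sketch of what a library-specific exception type could look like; the class name and usage are assumptions, not existing hdt-lib API.

#include <stdexcept>
#include <string>

namespace hdt {
class HDTException : public std::runtime_error {
public:
    explicit HDTException(const std::string &msg) : std::runtime_error(msg) {}
};
}

// Instead of:  throw "Error loading HDT file";
// one would write:  throw hdt::HDTException("Error loading HDT file");
// so applications can catch it generically:
//   try { ... } catch (const std::exception &e) { /* log e.what() */ }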

Cannot compile due to narrowing conversion

Compiling hdt-lib using GCC 6.1.1 gives me the following warnings:

static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-1’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
const DTsucc RMQ_succinct::HighestBitsSet[8] = {~0, ~1, ~3, ~7, ~15, ~31, ~63, ~127};
                                                                             ^
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-2’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-4’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-8’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-16’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-32’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-64’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-128’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
Makefile:35: recipe for target 'static/suffixtree/RMQ_succinct.o' failed
make[3]: *** [static/suffixtree/RMQ_succinct.o] Error 1
Makefile:39: recipe for target 'all' failed
make[2]: *** [all] Error 2
Makefile:13: recipe for target 'libcompact' failed
make[1]: *** [libcompact] Error 2
make[1]: Leaving directory '/home/wbeek/Git/hdt-cpp/libcds-v1.0.12'
Makefile:63: recipe for target 'all' failed
make: *** [all] Error 2

In order to compile the library I have to make the following change in two files. From:

const DTsucc RMQ_succinct_lcp::HighestBitsSet[8] = {~0, ~1, ~3, ~7, ~15, ~31, ~63, ~127};

To:

const DTsucc RMQ_succinct_lcp::HighestBitsSet[8] = {static_cast<DTsucc>(~0), static_cast<DTsucc>(~1), static_cast<DTsucc>(~3), static_cast<DTsucc>(~7), static_cast<DTsucc>(~15), static_cast<DTsucc>(~31), static_cast<DTsucc>(~63), static_cast<DTsucc>(~127)};

Since my C++ skills are non-existent, a capable programmer may fix this in a less hacky way :-P
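A slightly less hacky alternative (a sketch, not a committed patch) is to spell out the intended 8-bit values directly, which avoids both the narrowing error and the casts:

const DTsucc RMQ_succinct_lcp::HighestBitsSet[8] =
    {0xFF, 0xFE, 0xFC, 0xF8, 0xF0, 0xE0, 0xC0, 0x80};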

Change interfaces to work with larger data (required when merging long-dict-id branch to devel)

Hi all,

We have recently created the LOD-a-lot HDT dataset including 28B triples. @MarioAriasGa performed some datatype changes in the branch https://github.com/rdfhdt/hdt-cpp/tree/long-dict-id in order to manage such an amount of triples.

We should then merge this branch into develop (https://github.com/rdfhdt/hdt-cpp/tree/develop) and later on into master. Given that the interface has changed, we suggest creating a new major revision.

Opinions?

See the required changes (some of them are only for the HDT-it app, please filter by hdt-lib/): https://github.com/rdfhdt/hdt-cpp/compare/develop...long-dict-id?diff=unified&name=long-dict-id

getSuggestions without max_count argument

Function getSuggestions is very nice since it implements the following two common use cases:

  1. Auto-completion for typing IRIs and literals
  2. IRI namespace matching

Currently a maximum size must be given to the function, but this maximum is not always known beforehand. For example, if you want to find all DOI IRIs (use case category 2), there are over 4M matches for prefix http://dx.doi.org/ in LOD-a-lot.

Would it be possible to return the list of suggestions as an iterator or lazy list of arbitrary size?
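One possible API shape for this, written as an assumption rather than existing hdt-cpp code: an iterator over dictionary entries sharing a prefix, so the caller decides when to stop.

#include <string>

// Hypothetical interface; neither this class nor the method below exist in hdt-cpp today.
class SuggestionIterator {
public:
    virtual ~SuggestionIterator() {}
    virtual bool hasNext() = 0;
    virtual std::string next() = 0;  // next IRI or literal starting with the requested prefix
};

// Proposed addition to Dictionary (hypothetical signature):
// virtual SuggestionIterator *getSuggestions(const char *prefix, TripleComponentRole role) = 0;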

Extract random walks efficiently

Since it's possible to extract the estimated number of results for an arbitrary triple pattern, we were wondering whether it would also be possible to efficiently retrieve, for a given node X, a random neighbor node Y of X (in constant time).

Looking at the Web Semantics paper from 2013, it seems that this should be doable by querying the bitmap triples representation directly.

Thanks in advance!
@airobert and @wouterbeek
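A rough sketch of the idea, assuming the string-level iterator exposes canGoTo()/goTo() for positional jumps (as the ID-level iterators over BitmapTriples do); if it does not for a given pattern, this degrades to skipping forward.

#include <HDTManager.hpp>
#include <cstdlib>

// Fill `out` with a random (x, p, o) triple for subject x; returns false if x has no neighbors.
bool randomNeighbor(hdt::HDT *h, const char *x, hdt::TripleString &out) {
    hdt::IteratorTripleString *it = h->search(x, "", "");
    size_t n = it->estimatedNumResults();     // exact when the subject is bound (assumption)
    if (n == 0) { delete it; return false; }
    size_t k = std::rand() % n;
    if (it->canGoTo()) it->goTo((unsigned int) k);   // constant-time jump (assumption)
    else while (k-- > 0 && it->hasNext()) it->next();
    if (!it->hasNext()) { delete it; return false; }
    out = *it->next();
    delete it;
    return true;
}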

Ubuntu make error

Tried to compile hdt-lib in Docker using the latest Ubuntu image.
I get the error:

[HDT] Linking libhdt.a
 [HDT] Compiling tool tools/hdtInfo.cpp
../libcds-v1.0.12/lib/libcds.a: error adding symbols: Archive has no index; run ranlib to add one
collect2: error: ld returned 1 exit status
Makefile:82: recipe for target 'tools/hdtInfo' failed
make[1]: Leaving directory '/hdt-cpp/hdt-lib'
make[1]: *** [tools/hdtInfo] Error 1
make: *** [all] Error 2

Any idea how to fix it?

Question about .gz parsing

Dear all,
I have a short question. I checked that libz is installed and I have an ntriples file called core.nt.gz

Will the program automatically recognize that the file is zipped, if I call
./rdf2hdt core.nt.gz core.hdt

because I receive a lot of parser errors. Maybe I am missing an option.
I also tried -f ntriples

If I unzip the file first, it works.

Crash due to incorrect exception handling

Using the latest master, trying to load http://lod.labs.vu.nl/brt-clean.nt.gz (be careful, it is almost 8 GB and you need to uncompress it before loading), we get a SEGV at line 295 of FourSectionDictionary.cpp, where we see delete objects.

This is quite understandable, as objects has possibly already been deleted in the try section. I'm not familiar enough with the code base to see the clean way out.
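The usual fix for this pattern is to null the pointer where it is deleted inside the try block, so that the cleanup path cannot free it a second time; a small illustrative sketch (not the actual FourSectionDictionary code):

#include <cstddef>

struct Section { };  // stand-in for the real csd object

void loadSketch() {
    Section *objects = NULL;
    try {
        objects = new Section();
        // ... work that may throw ...
        delete objects;
        objects = NULL;   // prevents the catch block from freeing it again
        // ... more work that may throw ...
    } catch (...) {
        delete objects;   // safe even if already freed above, because it is NULL
        objects = NULL;
        throw;
    }
}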

Arbitrary estimated number of results

I generated a dataset of about 1M triples using the watdiv data generator and converted the resulting file using HDT.
When querying for certain triples, it seems like some estimated counts are not very "estimated", but rather arbitrary.

For example, there should be 433615 triples with the predicate http://db.uwaterloo.ca/~galuc/wsdbm/friendOf, but when executing the query ?v0 http://db.uwaterloo.ca/~galuc/wsdbm/friendOf ?v2 I get an estimated count of 3878 triples, which is off by about two orders of magnitude.

In order to get these estimated counts, IteratorTripleID::estimatedNumResults() is used (see https://github.com/RubenVerborgh/HDT-Node/blob/master/lib/HdtDocument.cc#L139).

You can find the hdt file I'm using to test querying here: https://dl.dropboxusercontent.com/u/16059961/watdiv_10_dataset.zip
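When consuming these numbers from the C++ API, it may also help to check how reliable the estimate claims to be; a short sketch, assuming the string iterator exposes both estimatedNumResults() and numResultEstimation() (EXACT / UP_TO / APPROXIMATE), with a placeholder file name:

#include <HDTManager.hpp>
#include <iostream>

int main() {
    hdt::HDT *h = hdt::HDTManager::mapIndexedHDT("watdiv_10.hdt");
    hdt::IteratorTripleString *it =
        h->search("", "http://db.uwaterloo.ca/~galuc/wsdbm/friendOf", "");
    std::cout << "estimate: " << it->estimatedNumResults()
              << "  estimation type: " << it->numResultEstimation() << std::endl;
    delete it;
    delete h;
    return 0;
}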

Avoid allocation of strings in Dictionary::idToString.

The Dictionary class has this function:
virtual std::string idToString(unsigned int id, TripleComponentRole role)=0;

Changing that to
virtual void idToString(unsigned int id, TripleComponentRole role, std::string& string);
would allow the caller to supply an existing string object and avoid memory allocation.

Slow compilation speed

This library (re-)compiles very slowly compared to other libraries. I wonder if we can find causes/fixes for that. Perhaps some of the header files are doing overly complex things?

libserd should be listed as a dependency

Hi,

Installing on Ubuntu 14.04.1 LTS leads to an error of
serd.h: No such file or directory
which can be resolved by installing:
apt-get install libserd-dev

Cristian

Alter Makefile to make code position-independent?

Compilation in SWI-Prolog's HDT package requires me to add -fPIC to
the FLAGS variable (on lines 14 and 16). I'm not sure whether this
is a bug or not. According to an answer on Stack Overflow, PIC stands for:

> Position Independent Code means that the generated machine code
> is not dependent on being located at a specific address in order
> to work.

My editor also complains about a spurious horizontal tab character on
line 111. This may be removed if only for aesthetic reasons.

Building test classes failed in devel branch (OpenMP disabled)

filterSearch.cpp:(.text.startup+0x70c): undefined reference to `typeinfo for hdt::LiteralDictionary'
filterSearch.cpp:(.text.startup+0x75b): undefined reference to `hdt::LiteralDictionary::substringToId(unsigned char*, unsigned int, unsigned int**)'
collect2: error: ld returned 1 exit status
Makefile:89: recipe for target 'tests/filterSearch' failed
make: *** [tests/filterSearch] Error 1

Implement a proper build system

Right now, the hdt-lib build process is pretty brittle with regards to platform quirks, and produces only tool binaries and static libraries, not any shared library. Further, building with custom paths or with a selection of features to be enabled/disabled (e.g., Serd, Raptor, zlib, KCB) requires manually editing Makefiles--not at all a welcome practice from a downstream user's point of view.

On POSIX systems, the currently produced static libraries aren't portably linkable with shared libraries that would like to make use of libhdt; moreover, on e.g. x86-64 the individual object files that constitute the static libraries would generally need to be compiled with -fPIC for the libraries to be usable from modern projects. All this requires manually fiddling with flags in both libcds-v1.0.12/src/Makefile and hdt-lib/Makefile.

Additionally, no actual installation facility is currently provided, meaning that users have to manually hunt down built artifacts--which are currently placed in different output directories--and copy them one by one to some installation destination.

All this argues for converting the project to use a proper configuration and build system. The two obvious and realistic choices are Autotools and CMake. Doing this would also greatly facilitate packaging the project for downstream distribution (#19).

I would myself volunteer to convert the project to using standard Autotools, which would give users the familiar ./configure && make && sudo make install workflow that works reliably on all platforms and resolves all aforementioned problems.
