HDT C++ Library and Tools
Serd is now supported for both parsing and serializing. We should probably not maintain two parsers, and only keep Raptor around for XML.
Discussed in #53.
I sometimes get the following file in my hdt-cpp directory: hdt-lib/.make-senitel. Should this be added to .gitignore, or should this file not be created in the first place?
https://github.com/rdfhdt/hdt-cpp/blob/master/hdt-lib/src/util/bitutil.h#L42
The function defined at this line causes problems on 32-bit machines, where size_t and uint32_t are the same type, so the second overload amounts to a redefinition.
I noticed problems with both hdtSearch (cpp) and hdtSearch.sh (java). Both apparently give incorrect results in some cases when looking for a specific literal value. I will report the Java results in a separate issue for hdt-java.
My test dataset is this NT file with only 3 triples:
<http://example.org/000046085> <http://schema.org/name> "Raamattu" .
<http://example.org/000146854> <http://schema.org/name> "Ajan lyhyt historia" .
<http://example.org/000019643> <http://schema.org/name> "Seitsemän veljestä" .
I converted it to HDT using the cpp version of rdf2hdt. Then I query it for the literal values using hdtSearch:
$ rdf2hdt hdt-test.nt hdt-test.hdt
HDT Successfully generated.
Total processing time: Clock(801 us) User() System()
$ hdtSearch hdt-test.hdt
Predicate Bitmap in 34 us
Count predicates in 16 us
Count Objects in 9 us  Max was: 1
Bitmap in 7 us
Bitmap bits: 3  Ones: 3
Object references in 13 us
Sort lists in 8 us
Index generated in 136 us
>> ? ? "Raamattu" %
http://example.org/000046085 http://schema.org/name "Raamattu"
1 results in 40 us
>> ? ? "Ajan lyhyt historia"
0 results in 5 us
>> ? ? "Seitsemän veljestä"
0 results in 6 us
As you can see from the above output, the first query (for "Raamattu") gives the correct result, but the two subsequent ones give zero results even though they should match one triple in the data.
I'm not sure whether the problem is in the HDT generation, index file generation, or querying. However, I did try generating the index with hdtSearch.sh from hdt-java instead. It didn't make a difference to the results.
Got errors like:
In file included from src/bitsequence/BitSequence375.cpp:30:0:
src/bitsequence/../util/bitutil.h: In function ‘void hdt::bitset(uint32_t*, size_t)’:
src/bitsequence/../util/bitutil.h:93:13: error: redefinition of ‘void hdt::bitset(uint32_t*, size_t)’
inline void bitset(uint32_t * e, size_t p) {
^
src/bitsequence/../util/bitutil.h:74:13: note: ‘void hdt::bitset(size_t*, size_t)’ previously defined here
inline void bitset(size_t * e, size_t p) {
^
src/bitsequence/../util/bitutil.h: In function ‘void hdt::bitclean(uint32_t*, size_t)’:
src/bitsequence/../util/bitutil.h:98:13: error: redefinition of ‘void hdt::bitclean(uint32_t*, size_t)’
inline void bitclean(uint32_t * e, size_t p) {
^
src/bitsequence/../util/bitutil.h:79:13: note: ‘void hdt::bitclean(size_t*, size_t)’ previously defined here
inline void bitclean(size_t * e, size_t p) {
I commented out these sections in bitutil.h and the compilation then worked.
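For reference, one possible fix that keeps both overloads where they make sense: compile the uint32_t versions only when size_t has a different width, using the SIZE_MAX/UINT32_MAX macros from stdint.h. This is only a sketch under that assumption (equal width implying the same underlying type); the real bitutil.h bodies differ.

#include <stddef.h>
#include <stdint.h>

namespace hdt {

inline void bitset(size_t *e, size_t p) {
    e[p / (8 * sizeof(size_t))] |= (size_t)1 << (p % (8 * sizeof(size_t)));
}

#if SIZE_MAX != UINT32_MAX
// Only compiled when size_t and uint32_t differ; on 32-bit platforms this
// overload would be a redefinition of the one above.
inline void bitset(uint32_t *e, size_t p) {
    e[p / 32] |= (uint32_t)1 << (p % 32);
}
#endif

} // namespace hdt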
Also the Makefile contained an option not recognized by the GCC compiler:
cc1plus: warning: unrecognized command line option "-Wno-unknown-warning-option"
I'm unable to create an HDT file for the following N-Triples content:
<a:b> <a:b> <a2:b>
The problem is caused by the 2 in a2:, which is allowed to appear in IRI scheme components but gives the following error:
syntax does not support relative IRIs
Catch exception load: Error parsing input.
The serdi command-line tool parses this content correctly.
I'm using Serd 0.26.0 and HDT commit d4b5244
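For reference, RFC 3986 defines scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), so the digit in a2: is legal and the IRI is absolute. A minimal check along those lines (illustrative only; this is neither hdt-cpp nor Serd code):

#include <cctype>
#include <string>

// Returns true if the string starts with a valid RFC 3986 scheme and ':',
// i.e. it is an absolute IRI rather than a relative reference.
static bool has_valid_scheme(const std::string &iri) {
    if (iri.empty() || !std::isalpha((unsigned char)iri[0]))
        return false;
    for (std::string::size_type i = 1; i < iri.size(); ++i) {
        char c = iri[i];
        if (c == ':')
            return true;   // scheme terminated
        if (!std::isalnum((unsigned char)c) && c != '+' && c != '-' && c != '.')
            return false;
    }
    return false;          // no ':' found
}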
Is it possible, or is it in the scope of HDT, to merge multiple HDT files or to update an HDT file with another RDF file?
This would be great for analysis pipelines that generate new RDF triples based on the content of the input RDF/HDT files.
Hi,
after compiling hdt-lib, compiling the tests (with make tests) results in the following compilation error.
$ make tests
[HDT] Compiling test tests/bit375.cpp
[HDT] Compiling test tests/bitutiltest.cpp
[HDT] Compiling test tests/cmp.cpp
tests/cmp.cpp: In member function ‘virtual unsigned char* MyIterator::next()’:
tests/cmp.cpp:33:65: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
sprintf((char*)buffer, "AAA %015lld FINNN", (uint64_t) count++);
^
[HDT] Compiling test tests/confm.cpp
/tmp/cc4opceY.o: In function `main':
confm.cpp:(.text.startup+0xc79): undefined reference to `hdt::LiteralDictionary::LiteralDictionary()'
confm.cpp:(.text.startup+0xc92): undefined reference to `hdt::LiteralDictionary::import(hdt::Dictionary*, hdt::ProgressListener*)'
confm.cpp:(.text.startup+0xd7b): undefined reference to `hdt::LiteralDictionary::save(std::ostream&, hdt::ControlInformation&, hdt::ProgressListener*)'
confm.cpp:(.text.startup+0xdf6): undefined reference to `hdt::LiteralDictionary::~LiteralDictionary()'
confm.cpp:(.text.startup+0xee1): undefined reference to `hdt::LiteralDictionary::~LiteralDictionary()'
collect2: error: ld returned 1 exit status
make: *** [tests/confm] Error 1
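As an aside, the %015lld format warning shown above can be silenced portably with the <cinttypes> macros; a sketch, with buffer and count standing in for the test's variables:

#include <cinttypes>
#include <cstdio>

int main() {
    uint64_t count = 0;
    char buffer[64];
    // PRIu64 expands to the right conversion specifier for uint64_t
    // on every platform, unlike a hard-coded %lld.
    snprintf(buffer, sizeof(buffer), "AAA %015" PRIu64 " FINNN", count++);
    puts(buffer);
    return 0;
}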
Cristian
As discussed in #12, @RubenVerborgh and I would propose to remove the buggy built-in N-Triples parser and serializer implementation and rely on Serd instead.
This does mean that Serd would be required in order for the command-line utilities to be of any use (and it might hence make sense to make Serd a required dependency, instead of optional), but that's practically the case right now as well for anything nontrivial, given that the built-in parser is buggy.
Also, Serd is a compact enough library that potentially (if really needed) we could import it into our source tree, at which point there would be no external dependency to install.
Thoughts?
Hi all!
When I first reviewed hdt-cpp last winter, hoping to find a way to get it packaged for Debian, I noticed the libcds dependency, which caused me some grief when compiling (but never mind that) and which I think would also cause problems for packaging. I made a thread in the forums, but I suppose opening an issue is better for the devs' workflow. :-) I hope my view here isn't outdated. :-)
It seems like libcds is a fork of somebody else's code, and the link in there is dead. It also seems the upstream author has started a new version but has not finished it. The author has made a post about the motivation for it.
I think that to get this code into a distro, and I think that is very important for widespread adoption, it would be needed to not bundle libcds, but keep it separate, so that it can be packaged independently by distros. Also, if any patches are needed, they should be sent there for incorporation. And since the upstream has abandoned version 1 of their library, HDT-CPP should probably move to use libcds2.
Cheers,
Kjetil
Different .hdt.index files for the same .hdt file are incompatible with each other; this causes problems if different applications read them. Maybe they should receive some kind of version number.
But this then creates the problem: how to find the index?
Perhaps, instead of supplying the .hdt file name, we can provide the index file name.
What about merging develop to master? I've been using the latest develop branch intensively and it seems there are no issues that should stop it from becoming the next version.
If we do the merge soon, we can use the develop branch to start properly testing the new compilation procedure (#81).
[CFG] OS= Linux
[CFG] CPP = g++
[CFG] FLAGS = -O3 -Wno-deprecated -fopenmp
[CFG] LDFLAGS =
[CFG] DEFINES = -DHAVE_RAPTOR -DHAVE_LIBZ -DHAVE_SERD
[CFG] INCLUDES = -I ../libcds-v1.0.12/includes/ -I /usr/local/include -I ./include -I /opt/local/include -I /usr/include
[CFG] LIB= ../libcds-v1.0.12/lib/libcds.a -L/usr/local/lib -lstdc++ -lraptor2 -lz -lserd-0
[HDT] Compiling src/dictionary/PlainDictionary.cpp
src/dictionary/PlainDictionary.cpp: In member function ‘void hdt::PlainDictionary::lexicographicSort(hdt::ProgressListener*)’:
src/dictionary/PlainDictionary.cpp:453:9: error: expected ‘#pragma omp section’ or ‘}’ before ‘{’ token
{ sort(shared.begin(), shared.end(), DictionaryEntry::cmpLexicographic); }
^
src/dictionary/PlainDictionary.cpp: In member function ‘void hdt::PlainDictionary::idSort()’:
src/dictionary/PlainDictionary.cpp:481:9: error: expected ‘#pragma omp section’ or ‘}’ before ‘{’ token
{ sort(subjects.begin(), subjects.end(), DictionaryEntry::cmpID); }
^
Makefile:94: recipe for target 'src/dictionary/PlainDictionary.o' failed
make[1]: *** [src/dictionary/PlainDictionary.o] Error 1
make[1]: Leaving directory '/hdt-cpp/hdt-lib'
Makefile:57: recipe for target 'all' failed
make: *** [all] Error 2
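The error message hints at the fix: inside an omp sections region, each block needs its own #pragma omp section directive. A sketch of the shape the compiler expects (illustrative; not the actual PlainDictionary code):

#include <algorithm>
#include <vector>

void sortBoth(std::vector<int> &shared, std::vector<int> &subjects) {
    #pragma omp parallel sections
    {
        #pragma omp section
        { std::sort(shared.begin(), shared.end()); }

        #pragma omp section
        { std::sort(subjects.begin(), subjects.end()); }
    }
}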
This triple:
<http://dbpedia.org/resource/Barack_Obama> <http://www.w3.org/2002/07/owl#sameAs> <http://el.dbpedia.org/resource/\u039C\u03C0\u03B1\u03C1\u03AC\u03BA_\u039F\u03BC\u03C0\u03AC\u03BC\u03B1> .
results in
<http://dbpedia.org/resource/Barack_Obama> <http://www.w3.org/2002/07/owl#sameAs> <http://el.dbpedia.org/resource/\\u039C\\u03C0\\u03B1\\u03C1\\u03AC\\u03BA_\\u039F\\u03BC\\u03C0\\u03AC\\u03BC\\u03B1>.
This is particularly a problem when using HDT via LDF, as the n3 parser throws an exception.
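To see the round trip: if the parser keeps the \u039C escape verbatim, a serializer that escapes backslashes will emit \\u039C. One fix is to decode \uXXXX escapes into UTF-8 at parse time; a rough sketch for BMP code points (illustrative only, no input validation, not the actual hdt-cpp code):

#include <cstdlib>
#include <string>

static std::string decode_u_escapes(const std::string &in) {
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 5 < in.size() && in[i + 1] == 'u') {
            unsigned cp =
                (unsigned)std::strtoul(in.substr(i + 2, 4).c_str(), 0, 16);
            if (cp < 0x80) {
                out += (char)cp;
            } else if (cp < 0x800) {
                out += (char)(0xC0 | (cp >> 6));
                out += (char)(0x80 | (cp & 0x3F));
            } else {
                out += (char)(0xE0 | (cp >> 12));
                out += (char)(0x80 | ((cp >> 6) & 0x3F));
                out += (char)(0x80 | (cp & 0x3F));
            }
            i += 5;  // consumed "\uXXXX" (the loop adds the final +1)
        } else {
            out += in[i];
        }
    }
    return out;
}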
Parsing of blank nodes in the object position returns invalid terms. To reproduce:
echo "<http://sub> <http://pred> _:b0." > test.nt && ./tools/rdf2hdt test.nt test.hdt && ./tools/hdtSearch -q '? ? ?' test.hdt 2>/dev/null
HDT with the rapper parser returns http://sub http://pred b0. (notice the missing _: prefix).
HDT with the serdi parser returns http://sub http://pred _:b0. (notice the trailing dot, which should not be there).
As a result, when you export this triple to RDF using hdt2rdf, you'll get invalid N-Triples [1].
I'm confused about this issue though. Running echo "<http://sub> <http://pred> _:b0." | serdi - returns valid N-Triples, i.e. serdi seems to parse/serialize this correctly.
But when I log the SerdNode value in the HDT serializer (https://github.com/rdfhdt/hdt-cpp/blob/develop/hdt-lib/src/rdf/RDFParserSerd.cpp#L18) I do get the dot as part of the bnode value.
Some help is appreciated ;)
[1]
serdi: <http://sub> <http://pred> _:b0. .
rapper: <http://sub> <http://pred> <b0.> .
Loading an HDT file with an FMIndex fails on OSX. This Travis run shows it works on Linux.
The failure seems to be as follows:
In CSD_FMIndex::load, we convert an in-memory range to a stringstream by calling rdbuf()->pubsetbuf. The fail and bad bits of the stream are then set.
Now there seems to be quite some discussion on whether calling pubsetbuf is actually a good idea. So I have tried the often-suggested alternative of implementing a custom subclass instead. However, this then fails at another point (apparently because seeking is not implemented). Who can rid us of the pubsetbuf hack?
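For what it's worth, the custom-subclass route can work if the subclass also implements seeking, which std::streambuf's defaults do not. A minimal read-only sketch (assumes C++11; untested against CSD_FMIndex):

#include <cstddef>
#include <ios>
#include <streambuf>

class MemBuf : public std::streambuf {
public:
    MemBuf(char *data, std::size_t size) { setg(data, data, data + size); }

protected:
    // istream::seekg lands here; the default returns -1, which is likely
    // why a naive subclass fails "because seeking is not implemented".
    pos_type seekoff(off_type off, std::ios_base::seekdir dir,
                     std::ios_base::openmode) override {
        char *base = eback();
        char *pos = dir == std::ios_base::beg ? base + off
                  : dir == std::ios_base::cur ? gptr() + off
                                              : egptr() + off;
        if (pos < base || pos > egptr())
            return pos_type(off_type(-1));
        setg(base, pos, egptr());
        return pos_type(pos - base);
    }

    pos_type seekpos(pos_type pos, std::ios_base::openmode which) override {
        return seekoff(off_type(pos), std::ios_base::beg, which);
    }
};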
The original code was written by @MarioAriasGa.
Relevant backstory: I found this out while testing HDT-node. Strangely, the exact same code works for Node.js 4 and 6 on Linux and on OS X, but fails for Node.js 7 on OS X (but works on Linux). And as you can see above, it fails always in standalone mode on OS X. So somehow, the Node.js 4 and 6 compilations on OS X are doing something different. I wonder if this issue is related.
PS To test this: just fork from the bug-literal-dictionary branch, and when you commit and push, Travis CI will test for you.
In published examples I've seen, HDT file compressibility (by doing an additional gzip or bzip2 step) has been a few percent at the most. Given that baseline, the following seems anomalous:
-rw-rw-r-- 1 arto arto 21813380 Dec 10 12:43 pc_biosystem.hdt
-rw-rw-r-- 1 arto arto 5054074 Dec 10 12:44 pc_biosystem.hdt.bz2
-rw-rw-r-- 1 arto arto 6988217 Dec 10 12:43 pc_biosystem.hdt.gz
-rw-rw-r-- 1 arto arto 833444759 Dec 10 12:42 pc_biosystem.nt
-rw-rw-r-- 1 arto arto 15820798 Dec 10 12:46 pc_biosystem.nt.bz2
-rw-rw-r-- 1 arto arto 21045372 Dec 10 12:44 pc_biosystem.nt.gz
-rw-rw-r-- 1 arto arto 327940461 Dec 10 12:42 pc_biosystem.ttl
That is, the .hdt file is 20 MB, the gzip-compressed .hdt.gz is 6.66 MB, and the bzip2-compressed .hdt.bz2 is 4.82 MB.
Instructions to reproduce these figures from scratch:
wget ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/biosystem/pc_biosystem.ttl.gz
gunzip pc_biosystem.ttl.gz
rapper -i turtle -o ntriples pc_biosystem.ttl > pc_biosystem.nt
rdf2hdt -f ntriples pc_biosystem.nt pc_biosystem.hdt
gzip -9 < pc_biosystem.hdt > pc_biosystem.hdt.gz
bzip2 -9 < pc_biosystem.hdt > pc_biosystem.hdt.bz2
The question here is, are these results expected or anomalous? (Pinging @MarioAriasGa.)
I'm working on a SWI-Prolog interface to this library. Read access now mostly works. Now I am trying to make it possible to create HDTs from triples available in Prolog. I could of course first write an N-Triples file, but that seems a waste of resources. I tried including HDTFactory.hpp, creating an instance of a BasicModifiableHDT using HDTFactory::createDefaultModifiableHDT(), and calling ->insert() and ->save() on that. This creates a file, but trying to open it using hdt2rdf says
ERROR: This software cannot open this version of HDT File
What am I missing?
Hello,
Basically every URL mentioned on the rdfhdt.org website points to the Google Code project page, but it seems to me that the latest developments are happening here. So it should be clarified which is the official/main repo for the project.
Thanks,
Cristian
@momo54 reported that although he was expecting nearly constant time for OFFSETs (using the "goto" methods in BitmapTriplesIterators.cpp), in practice he observed large delays using Linked Data Fragments + HDT with particular OFFSET queries: SP?o, S?p?o, ?sPO and ?s?pO + OFFSET.
I've converted an NT file to HDT and loaded the HDT for querying in Jena. However, some resources are not queryable although they are present in the dataset.
The dataset is
<http://ru.dbpedia.org/resource/Список_римско-католических_епархий_(структурный_вид)> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Абаэтетуба> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Бразилия> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/25_ноября> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/1961_год> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Собор_Непорочного_зачатия_Пресвятой_Девы_Марии_(Абаэтетуба)> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Архиепархия_Белен-до-Пара> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Епархия> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Римско-католическая_церковь> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Кафедра_(христианство)> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Собор_(храм)> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Папство> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Иоанн_XXIII> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Булла> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Территориальное_аббатство> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/4_августа> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/1981_год> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Иоанн_Павел_II> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Священник> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Annuario_Pontificio> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Acta_Apostolicae_Sedis> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Категория:Католические_епархии_Бразилии> .
<http://ru.dbpedia.org/resource/Архиепархия_Белен-до-Пара> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://www.w3.org/2000/01/rdf-schema#label> "Епархия Абаэтетубы"@ru .
In Jena the following query does not match anything:
StmtIterator tmpIter = model.listStatements(model.getResource("http://ru.dbpedia.org/resource/Епархия_Абаэтетубы"), null, (RDFNode)null);
if(!tmpIter.hasNext())
System.out.println("No statements for this resource");
Loading the same dataset in Jena but as a ttl file works.
Maybe a bug in the HDT conversion process? Note that we are dealing with Russian Cyrillic characters.
Using the rdf2hdt tool with the -i option, I keep getting this warning/notice (it is not presented as an error). This message is quite confusing; how should the program be called to generate the HDT index?
Thanks,
C
Currently you need to have Kyoto Cabinet and Raptor installed to build the test suite with make tests. I've attached the error output in make-tests.log.txt.
The test suite should be amended such that it doesn't require any of the optional dependencies of the library to be installed. This could be done either by restructuring the test suite and/or by ensuring that tests for optional dependencies only activate when those dependencies are present.
Building the test suite with make tests emits a bunch of compiler warnings that should be fixed: make-tests.log.txt
When using rdf2hdt on the following turtle file:
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skos-thes: <http://purl.org/iso25964/skos-thes#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<urn:x-skosprovider:trees/3> a skos:Collection ;
dcterms:identifier 3 ;
skos:inScheme <urn:x-skosprovider:trees> ;
skos:member <urn:x-skosprovider:trees/1>,
<urn:x-skosprovider:trees/2> ;
skos:prefLabel "Trees by species"@en,
"Bomen per soort"@nl-BE .
<urn:x-skosprovider:trees/1> a skos:Concept ;
dcterms:identifier 1 ;
skos:closeMatch <http://id.python.org/different/types/of/trees/nr/1/the/larch> ;
skos:definition "A type of tree."@en ;
skos:inScheme <urn:x-skosprovider:trees> ;
skos:prefLabel "The Larch"@en,
"De Lariks"@nl .
<urn:x-skosprovider:trees/2> a skos:Concept ;
dcterms:identifier 2 ;
skos:altLabel "la châtaigne"@fr,
"De Paardekastanje"@nl ;
skos:definition "A different type of tree."@en ;
skos:inScheme <urn:x-skosprovider:trees> ;
skos:prefLabel "The Chestnut"@en ;
skos:relatedMatch <http://id.python.org/different/types/of/trees/nr/17/the/other/chestnut> .
<urn:x-skosprovider:trees> a skos:ConceptScheme ;
dcterms:identifier "TREES" ;
skos:hasTopConcept <urn:x-skosprovider:trees/1>,
<urn:x-skosprovider:trees/2> ;
skos:prefLabel "Different types of trees"@en,
"Verschillende soorten bomen"@nl .
I get the following error:
[21:34:00] $ /home/koen/Projecten/general/hdt-cpp/hdt-lib/tools/rdf2hdt TREES-full.ttl TREES-full.hdt
expected `(', not `B'
caught here??
Catch exception load: Error parsing input.
ERROR: Error parsing input.
Using an rdfxml version of the same file I have no problems:
[21:39:04] $ /home/koen/Projecten/general/hdt-cpp/hdt-lib/tools/rdf2hdt -f rdfxml TREES-full.rdf TREES-full.hdt
RDF format: rdfxml
HDT Successfully generated.
Total processing time: Clock(10 ms 901 us) User(10 ms 516 us) System(552 us)
Both the Turtle and the rdfxml version were generated from Python using RDFLIB, on Ubuntu 14.04
I believe rdf2hdt uses Serd for reading Turtle files. Using serdi on the same file gives me the following error:
[21:42:26] 1 $ serdi TREES-full.ttl
<urn:x-skosprovider:trees/3> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Collection> .
<urn:x-skosprovider:trees/3> <http://purl.org/dc/terms/identifier> "3"^^<http://www.w3.org/2001/XMLSchema#integer> .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#inScheme> <urn:x-skosprovider:trees> .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#member> <urn:x-skosprovider:trees/1> .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#member> <urn:x-skosprovider:trees/2> .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#prefLabel> "Trees by species"@en .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#prefLabel> "Bomen per soort"@nl- .
error: TREES-full.ttl:16:29: expected `.', not `B'
(Note that the error points at the B in the @nl-BE language tag: serdi stops right after emitting "Bomen per soort"@nl-, which suggests this Serd version does not accept uppercase letters after the hyphen in a language tag.)
There are a few issues with the rdf2hdt command; one of them involves printing to stdout with rdf2hdt file.hdt -.
On branch stable I am able to create an HDT file from a gzipped N-Triples file (this probably uses Raptor). On branch master this no longer works (this probably uses Serd).
What is the best way to create an HDT file from a gzipped N-Triples file on master? Could the library be taught to do the right thing automatically?
Introduced in: 686e25d
A subject can be a blank node or an IRI. An object can be a blank node, an IRI, or a literal. A TripleID contains an unsigned int for the subject, the predicate, and the object.
Is there a cheap way to determine the type of the node without retrieving the string value?
The nodes are sorted in the dictionary. Blank nodes start with '_', literals start with '"' and IRIs start with '<'. So by knowing the number of blank nodes, IRIs and literals in the dictionary, it is possible to determine the node type from the id.
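A sketch of that idea, with hypothetical boundary fields rather than the hdt-cpp API. In byte order, '"' < '<' < '_', so a lexicographically sorted section would hold literals first, then IRIs, then blank nodes:

enum NodeType { LITERAL, IRI, BLANK };

struct SectionCounts {          // assumed to be obtainable from the dictionary
    unsigned numLiterals;       // ids [1, numLiterals]
    unsigned numIris;           // ids (numLiterals, numLiterals + numIris]
    unsigned numBlanks;         // all remaining ids
};

NodeType nodeType(unsigned id, const SectionCounts &c) {
    if (id <= c.numLiterals) return LITERAL;
    if (id <= c.numLiterals + c.numIris) return IRI;
    return BLANK;               // constant time, no string retrieval
}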
I am wondering how I can run the rdf2hdt conversion with the Kyoto Cabinet database (KCB) option. We have a very large RDF NT file that exceeds the memory limit for conversion, and the job got killed for using too much memory. I cannot find anything about this in the README file.
In addition, I took a look at the BasicHDT.cpp file and found there is no option for KCB:
void BasicHDT::createComponents() {
    // HEADER
    header = new PlainHeader();

    // DICTIONARY
    std::string dictType = spec.get("dictionary.type");
    if(dictType==HDTVocabulary::DICTIONARY_TYPE_FOUR) {
        dictionary = new FourSectionDictionary(spec);
    } else if(dictType==HDTVocabulary::DICTIONARY_TYPE_PLAIN) {
        dictionary = new PlainDictionary(spec);
    } else if(dictType==HDTVocabulary::DICTIONARY_TYPE_LITERAL) {
        dictionary = new LiteralDictionary(spec);
        throw "This version has been compiled without support for this dictionary";
    } else {
        dictionary = new FourSectionDictionary(spec);
    }

    // TRIPLES
    std::string triplesType = spec.get("triples.type");
    if(triplesType==HDTVocabulary::TRIPLES_TYPE_BITMAP) {
        triples = new BitmapTriples(spec);
    } else if(triplesType==HDTVocabulary::TRIPLES_TYPE_COMPACT) {
        triples = new CompactTriples(spec);
    } else if(triplesType==HDTVocabulary::TRIPLES_TYPE_PLAIN) {
        triples = new PlainTriples(spec);
    } else if(triplesType==HDTVocabulary::TRIPLES_TYPE_TRIPLESLIST) {
        triples = new TriplesList(spec);
    } else if (triplesType == HDTVocabulary::TRIPLES_TYPE_TRIPLESLISTDISK) {
        triples = new TripleListDisk();
    } else {
        triples = new BitmapTriples(spec);
    }
}
Best,
Gang
After the already-merged enhancements in pull requests #16 and #17, these are the remaining immediate pain points I have encountered so far in attempting to use hdt-lib in a production application:
1. Exceptions do not derive from the std::exception base class. Instead, the library throws raw C strings directly (e.g., src/hdt/BasicHDT.cpp:106). This is not great practice, and these exceptions will go uncaught by many applications, or else will land in a top-level catch-all handler catch (...), at which point they can no longer be interpreted at all.
2. The API is not const-correct on hdt-lib objects, complicating const-correct application code by necessitating const_cast<> casts that undermine type safety.
3. The API forces std::string copies where a const std::string& reference or a char* pointer would suffice.
I'd like to address each of these points in order to make the library suitable for use from production code, but wanted to express my intentions here first, as in particular the first item would necessitate changes to any existing application code that uses hdt-lib and assumes that it will throw raw C strings instead of standard exceptions.
Soliciting feedback from @mielvds and @RubenVerborgh in particular.
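To illustrate the first point, a std::exception subclass could wrap the existing C-string messages; a hypothetical sketch (HDTException is not part of hdt-lib):

#include <exception>
#include <string>

namespace hdt {

class HDTException : public std::exception {
public:
    explicit HDTException(const std::string &msg) : msg_(msg) {}
    virtual ~HDTException() throw() {}
    virtual const char *what() const throw() { return msg_.c_str(); }

private:
    std::string msg_;
};

} // namespace hdt

// Call sites would change from
//     throw "Error parsing input.";
// to
//     throw hdt::HDTException("Error parsing input.");
// so applications can catch const std::exception&.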
Compiling hdt-lib using GCC 6.1.1 gives me the following errors:
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-1’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
const DTsucc RMQ_succinct::HighestBitsSet[8] = {~0, ~1, ~3, ~7, ~15, ~31, ~63, ~127};
^
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-2’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-4’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-8’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-16’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-32’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-64’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-128’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
Makefile:35: recipe for target 'static/suffixtree/RMQ_succinct.o' failed
make[3]: *** [static/suffixtree/RMQ_succinct.o] Error 1
Makefile:39: recipe for target 'all' failed
make[2]: *** [all] Error 2
Makefile:13: recipe for target 'libcompact' failed
make[1]: *** [libcompact] Error 2
make[1]: Leaving directory '/home/wbeek/Git/hdt-cpp/libcds-v1.0.12'
Makefile:63: recipe for target 'all' failed
make: *** [all] Error 2
In order to compile the library I have to make the following change in two files. From:
const DTsucc RMQ_succinct_lcp::HighestBitsSet[8] = {~0, ~1, ~3, ~7, ~15, ~31, ~63, ~127};
To:
const DTsucc RMQ_succinct_lcp::HighestBitsSet[8] = {static_cast<DTsucc>(~0), static_cast<DTsucc>(~1), static_cast<DTsucc>(~3), static_cast<DTsucc>(~7), static_cast<DTsucc>(~15), static_cast<DTsucc>(~31), static_cast<DTsucc>(~63), static_cast<DTsucc>(~127)};
Since my C++ skills are non-existent, a capable programmer may fix this in a less hacky way :-P
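A slightly less hacky alternative with the same eight values (assuming DTsucc is an 8-bit unsigned type): write the masks as hex literals, which fit without narrowing:

// ~0, ~1, ~3, ~7, ~15, ~31, ~63, ~127 as unsigned 8-bit values
const DTsucc RMQ_succinct_lcp::HighestBitsSet[8] =
    {0xFF, 0xFE, 0xFC, 0xF8, 0xF0, 0xE0, 0xC0, 0x80};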
Bugs such as LinkedDataFragments/Client.js#26 (comment) occur because people use the latest master as opposed to the latest release.
Proposal A (vote with 👍):
- develop becomes the branch where development happens
- master always points to the latest release (which we tag)
Proposal B (vote with 😄):
- stable becomes the default branch
Hi all,
We have recently created the LOD-a-lot HDT dataset including 28B triples. @MarioAriasGa performed some datatype changes in the branch https://github.com/rdfhdt/hdt-cpp/tree/long-dict-id in order to manage such an amount of triples.
We should then merge this branch into develop (https://github.com/rdfhdt/hdt-cpp/tree/develop) and later on into master. Given that the interface has changed, we suggest creating a new major revision.
Opinions?
See the required changes (some of them are only for the HDT-it app, please filter by hdt-lib/): https://github.com/rdfhdt/hdt-cpp/compare/develop...long-dict-id?diff=unified&name=long-dict-id
Function getSuggestions is very nice since it implements two common use cases.
Currently a maximum size must be given to the function, but this maximum is not always known beforehand. For example, if you want to find all DOI IRIs (use case category 2), there are over 4M matches for the prefix http://dx.doi.org/ in LOD-a-lot.
Would it be possible to return the list of suggestions as an iterator or lazy list of arbitrary size?
Since it's possible to extract the estimated number of results for an arbitrary triple pattern, we were wondering whether it would also be possible to efficiently retrieve, for a given node X, a random neighbor node Y of X (in constant time).
Looking at the Web Semantics paper from 2013, it seems that this should be doable by querying the bitmap triples representation directly.
Thanks in advance!
@airobert and @wouterbeek
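Conceptually, and with a hypothetical helper API rather than the actual hdt-cpp interface: if the objects adjacent to a node occupy a contiguous range [begin, end) of the object sequence in BitmapTriples, then a uniform random neighbor is one random position in that range, which is constant time once the range is known:

#include <cstddef>
#include <random>

struct ObjectRange { std::size_t begin, end; };  // assumed helper: [begin, end)

// objects: the dictionary-encoded object sequence of the triples component.
// Assumes a non-empty range (begin < end).
unsigned randomNeighbor(const ObjectRange &r, const unsigned *objects,
                        std::mt19937 &rng) {
    std::uniform_int_distribution<std::size_t> pick(r.begin, r.end - 1);
    return objects[pick(rng)];
}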
Tried to compile hdt-lib in Docker using the latest Ubuntu image.
I get the error:
[HDT] Linking libhdt.a
[HDT] Compiling tool tools/hdtInfo.cpp
../libcds-v1.0.12/lib/libcds.a: error adding symbols: Archive has no index; run ranlib to add one
collect2: error: ld returned 1 exit status
Makefile:82: recipe for target 'tools/hdtInfo' failed
make[1]: Leaving directory '/hdt-cpp/hdt-lib'
make[1]: *** [tools/hdtInfo] Error 1
make: *** [all] Error 2
Any idea how to fix it?
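For what it's worth, the linker message itself suggests a workaround (untested here): run ranlib on the archive before building hdt-lib, e.g.

ranlib ../libcds-v1.0.12/lib/libcds.a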
Dear all,
I have a short question. I checked that libz is installed and I have an ntriples file called core.nt.gz
Will the program automatically recognize that the file is zipped if I call the following?
./rdf2hdt core.nt.gz core.hdt
I ask because I receive a lot of parser errors; maybe I am missing an option. I also tried -f ntriples.
If I unzip the file first, it works.
Using the latest master, trying to load http://lod.labs.vu.nl/brt-clean.nt.gz (be careful: almost 8 GB, and you need to uncompress it before loading), we get a SEGV at line 295 of FourSectionDictionary.cpp, where we see delete objects.
This is quite understandable, as objects has possibly already been deleted in the try section. I'm not familiar enough with the code base to see the clean way out.
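The usual pattern for this class of bug is to null the pointer immediately after deleting it, so any later cleanup delete becomes a harmless no-op. An illustrative sketch only, not the actual FourSectionDictionary code:

#include <cstddef>

struct Section { /* ... */ };

void load(Section *&objects) {
    try {
        // ... work that may throw ...
        delete objects;
        objects = NULL;   // later cleanup can no longer double-delete
    } catch (...) {
        delete objects;   // safe: deleting a null pointer is a no-op
        objects = NULL;
        throw;
    }
}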
I generated a dataset of about 1M triples using the watdiv data generator and converted the resulting file using HDT.
When querying for certain triples, it seems like some estimated counts are not very "estimated", but rather arbitrary.
For example: there should be 433615 triples with the predicate http://db.uwaterloo.ca/~galuc/wsdbm/friendOf
but when executing the query ?v0 http://db.uwaterloo.ca/~galuc/wsdbm/friendOf ?v2
I get an estimated count of 3878 triples, which is off by more than an order of magnitude.
In order to get these estimated counts, IteratorTripleID.estimatedNumResults() is used (see https://github.com/RubenVerborgh/HDT-Node/blob/master/lib/HdtDocument.cc#L139).
You can find the hdt file I'm using to test querying here: https://dl.dropboxusercontent.com/u/16059961/watdiv_10_dataset.zip
The Dictionary class has this function:
virtual std::string idToString(unsigned int id, TripleComponentRole role)=0;
Changing that to
virtual void idToString(unsigned int id, TripleComponentRole role, std::string& string);
would allow the caller to supply an existing string object and avoid memory allocation.
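A caller could then reuse one buffer across many lookups; a hypothetical usage sketch (dict, n, and process are placeholders, and hdt::SUBJECT stands for a TripleComponentRole value):

#include <string>

void dumpSubjects(hdt::Dictionary *dict, unsigned int n) {
    std::string str;
    for (unsigned int id = 1; id <= n; ++id) {
        dict->idToString(id, hdt::SUBJECT, str);  // fills str in place
        process(str);                             // placeholder consumer
    }
}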
This library (re-)compiles very slowly compared to other libraries. I wonder if we can find causes/fixes for that. Perhaps some of the header files are doing overly complex things?
It does not always print to stdout (using rdf2hdt file.hdt -).
Ported from #53.
Hi,
Installing on Ubuntu 14.04.1 LTS leads to the error
serd.h: No such file or directory
which can be resolved by installing:
apt-get install libserd-dev
Cristian
Compilation in SWI-Prolog's HDT package requires me to add -fPIC to the FLAGS variable (on lines 14 and 16). I'm not sure whether this is a bug or not. According to someone on Stack Overflow, PIC stands for:
> Position Independent Code means that the generated machine code
> is not dependent on being located at a specific address in order
> to work.
My editor also complains about a spurious horizontal tab character on
line 111. This may be removed if only for aesthetic reasons.
There doesn't seem to be any Makefile available.
filterSearch.cpp:(.text.startup+0x70c): undefined reference to `typeinfo for hdt::LiteralDictionary'
filterSearch.cpp:(.text.startup+0x75b): undefined reference to `hdt::LiteralDictionary::substringToId(unsigned char*, unsigned int, unsigned int**)'
collect2: error: ld returned 1 exit status
Makefile:89: recipe for target 'tests/filterSearch' failed
make: *** [tests/filterSearch] Error 1
What about making the progress output of the command-line clients optional? E.g., adding a --progress/-p flag to show the progress. IMO, the command-line clients are more usable with the progress output disabled by default.
Right now, the hdt-lib build process is pretty brittle with regards to platform quirks, and produces only tool binaries and static libraries, not any shared library. Further, building with custom paths or with a selection of features to be enabled/disabled (e.g., Serd, Raptor, zlib, KCB) requires manually editing Makefiles--not at all a welcome practice from a downstream user's point of view.
On POSIX systems, the currently produced static libraries aren't portably linkable with shared libraries that would like to make use of libhdt
; moreover, on e.g. x86-64 the individual object files that constitute the static libraries would generally need to be compiled with -fPIC
for the libraries to be usable from modern projects. All this requires manually fiddling with flags in both libcds-v1.0.12/src/Makefile
and hdt-lib/Makefile
.
Additionally, no actual installation facility is currently provided, meaning that users have to manually hunt down built artifacts--which are currently placed in different output directories--and copy them one by one to some installation destination.
All this argues for converting the project to use a proper configuration and build system. The two obvious and realistic choices are Autotools and CMake. Doing this would also greatly facilitate packaging the project for downstream distribution (#19).
I would myself volunteer to convert the project to using standard Autotools, which would give users the familiar ./configure && make && sudo make install
workflow that works reliably on all platforms and resolves all aforementioned problems.