
hdt-cpp's People

Contributors

artob, bjonnh, blake-regalia, chrysn, colinmaudry, d063520, dinikolop, donpellegrino, drobilla, florianludwig, joachimvh, kinghuang, laurensdv, laurensrietveld, lucafabbian, marioariasga, mielvds, osma, ptorrestr, rubensworks, rubenverborgh, sharpeeeeeeee, sharperpaper, vandenoever, vlcinsky, webdata, wouterbeek


hdt-cpp's Issues

Untracked file `.make-senitel`

I sometimes get the following file in my hdt-cpp directory: hdt-lib/.make-senitel. Should this be added to .gitignore or should this file not be created in the first place?

hdtSearch improperly handles spaces in input literals

I noticed problems with both hdtSearch (cpp) and hdtSearch.sh (java). Both apparently give incorrect results in some cases when looking for a specific literal value. I will report the Java results in a separate issue for hdt-java.

My test dataset is this NT file with only 3 triples:

<http://example.org/000046085> <http://schema.org/name> "Raamattu" .
<http://example.org/000146854> <http://schema.org/name> "Ajan lyhyt historia" .
<http://example.org/000019643> <http://schema.org/name> "Seitsemän veljestä" .

I converted it to HDT using the cpp version of rdf2hdt. Then I query it for the literal values using hdtSearch:

$ rdf2hdt hdt-test.nt hdt-test.hdt
HDT Successfully generated.                                                    
Total processing time: Clock(801 us)  User()  System()

$ hdtSearch hdt-test.hdt
Predicate Bitmap in 34 us
Count predicates in 16 us
Count Objects in 9 us Max was: 1
Bitmap in 7 us
Bitmap bits: 3 Ones: 3
Object references in 13 us
Sort lists in 8 us
Index generated in 136 us
>> ? ? "Raamattu"
http://example.org/000046085 http://schema.org/name "Raamattu"
1 results in 40 us
>> ? ? "Ajan lyhyt historia"
0 results in 5 us
>> ? ? "Seitsemän veljestä"
0 results in 6 us

As you can see from the above output, the first query (for "Raamattu") gives the correct result, but the two subsequent ones give zero results even though they should match one triple in the data.

I'm not sure whether the problem is in the HDT generation, index file generation, or querying. However, I did try generating the index with hdtSearch.sh from hdt-java instead. It didn't make a difference to the results.
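To help narrow down whether the problem lies in the generated index or in how the interactive prompt parses literals containing spaces, here is a minimal sketch using the C++ library API directly (assuming the HDTManager / search entry points; the file name is the test file above):

#include <HDTManager.hpp>
#include <iostream>

int main() {
    // Map the HDT file together with its index.
    hdt::HDT *h = hdt::HDTManager::mapIndexedHDT("hdt-test.hdt");
    // Literal objects are passed with their surrounding quotes.
    hdt::IteratorTripleString *it = h->search("", "", "\"Seitsemän veljestä\"");
    while (it->hasNext()) {
        hdt::TripleString *t = it->next();
        std::cout << t->getSubject() << " " << t->getPredicate()
                  << " " << t->getObject() << std::endl;
    }
    delete it;
    delete h;
    return 0;
}

If this programmatic lookup returns the triple while the interactive prompt does not, the prompt's handling of spaces is the more likely culprit.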

Compiling errors on gcc 4.9.2 (Debian 4.9.2-10)

Got errors like:

In file included from src/bitsequence/BitSequence375.cpp:30:0:
src/bitsequence/../util/bitutil.h: In function ‘void hdt::bitset(uint32_t*, size_t)’:
src/bitsequence/../util/bitutil.h:93:13: error: redefinition of ‘void hdt::bitset(uint32_t*, size_t)’
inline void bitset(uint32_t * e, size_t p) {
         ^
src/bitsequence/../util/bitutil.h:74:13: note: ‘void hdt::bitset(size_t*, size_t)’ previously defined here
 inline void bitset(size_t * e, size_t p) {
         ^
src/bitsequence/../util/bitutil.h: In function ‘void hdt::bitclean(uint32_t*, size_t)’:
src/bitsequence/../util/bitutil.h:98:13: error: redefinition of ‘void hdt::bitclean(uint32_t*, size_t)’
 inline void bitclean(uint32_t * e, size_t p) {
         ^
src/bitsequence/../util/bitutil.h:79:13: note: ‘void hdt::bitclean(size_t*, size_t)’ previously defined here
inline void bitclean(size_t * e, size_t p) {

I commented out these sections in bitutil.h and the compilation worked.
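A less invasive alternative to commenting the overloads out might be to guard the 32-bit variants so they are only declared when uint32_t and size_t are distinct types; this is only a sketch, the real bitutil.h layout differs:

#include <stdint.h>
#include <stddef.h>

namespace hdt {

inline void bitset(size_t *e, size_t p) {
    e[p / (8 * sizeof(size_t))] |= ((size_t)1) << (p % (8 * sizeof(size_t)));
}

#if SIZE_MAX != UINT32_MAX  // skip the duplicate overload where size_t is itself 32-bit
inline void bitset(uint32_t *e, size_t p) {
    e[p / 32] |= ((uint32_t)1) << (p % 32);
}
#endif

} // namespace hdt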

Also the Makefile contained an option not recognized by the GCC compiler:

cc1plus: warning: unrecognized command line option "-Wno-unknown-warning-option"

Cannot create HDT file when numbers appear in the IRI scheme

I'm unable to create an HDT file for the following N-Triples content:

<a:b> <a:b> <a2:b>

The problem is caused by the 2, which is allowed to appear in IRI scheme components, but gives the following error:

syntax does not support relative IRIs
Catch exception load: Error parsing input.

The serdi command-line tool parses this content correctly.

I'm using Serd 0.26.0 and HDT commit d4b5244

Merge HDT files or update HDT with rdf

Is it possible, or within the scope of HDT, to merge multiple HDT files or to update an HDT file with another RDF file?

This would be great for analysis pipelines that generate new RDF triples based on the content of the input RDF/HDT files.
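If no built-in merge is available, a rough workaround sketch (assuming the HDTManager / IteratorTripleString API) is to dump the triples of both files into one N-Triples file and regenerate a merged HDT with rdf2hdt; literal escaping and blank-node subjects are not handled here:

#include <HDTManager.hpp>
#include <fstream>
#include <string>

static void dumpAsNTriples(hdt::HDT *h, std::ostream &out) {
    hdt::IteratorTripleString *it = h->search("", "", "");
    while (it->hasNext()) {
        hdt::TripleString *t = it->next();
        const std::string &o = t->getObject();
        out << "<" << t->getSubject() << "> <" << t->getPredicate() << "> ";
        if (!o.empty() && (o[0] == '"' || o[0] == '_')) out << o;  // literal or blank node
        else out << "<" << o << ">";                               // IRI
        out << " .\n";
    }
    delete it;
}

int main() {
    hdt::HDT *a = hdt::HDTManager::mapHDT("a.hdt");
    hdt::HDT *b = hdt::HDTManager::mapHDT("b.hdt");
    std::ofstream out("merged.nt");
    dumpAsNTriples(a, out);
    dumpAsNTriples(b, out);
    delete a;
    delete b;
    // Afterwards: rdf2hdt merged.nt merged.hdt
    return 0;
}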

Error compiling test on Ubuntu 14.04 LTS 64 bit

Hi,

after compiling hdt-lib, compiling the tests (with make tests) results in the following compilation error.

$ make tests
 [HDT] Compiling test tests/bit375.cpp
 [HDT] Compiling test tests/bitutiltest.cpp
 [HDT] Compiling test tests/cmp.cpp
tests/cmp.cpp: In member function ‘virtual unsigned char* MyIterator::next()’:
tests/cmp.cpp:33:65: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
   sprintf((char*)buffer, "AAA %015lld FINNN", (uint64_t) count++);
                                                                 ^
 [HDT] Compiling test tests/confm.cpp
/tmp/cc4opceY.o: In function `main':
confm.cpp:(.text.startup+0xc79): undefined reference to `hdt::LiteralDictionary::LiteralDictionary()'
confm.cpp:(.text.startup+0xc92): undefined reference to `hdt::LiteralDictionary::import(hdt::Dictionary*, hdt::ProgressListener*)'
confm.cpp:(.text.startup+0xd7b): undefined reference to `hdt::LiteralDictionary::save(std::ostream&, hdt::ControlInformation&, hdt::ProgressListener*)'
confm.cpp:(.text.startup+0xdf6): undefined reference to `hdt::LiteralDictionary::~LiteralDictionary()'
confm.cpp:(.text.startup+0xee1): undefined reference to `hdt::LiteralDictionary::~LiteralDictionary()'
collect2: error: ld returned 1 exit status
make: *** [tests/confm] Error 1

Cristian

Remove built-in N-Triples parser/serializer

As discussed in #12, @RubenVerborgh and I would propose to remove the buggy built-in N-Triples parser and serializer implementation and rely on Serd instead.

This does mean that Serd would be required in order for the command-line utilities to be of any use (and it might hence make sense to make Serd a required dependency, instead of optional), but that's practically the case right now as well for anything nontrivial, given that the built-in parser is buggy.

Also, Serd is a compact enough library that potentially (if really needed) we could import it into our source tree, at which point there would be no external dependency to install.

Thoughts?

Replace libcds with sdsl-lite

Hi all!

When I first reviewed hdt-cpp last winter, hoping to find a way to get it packaged for Debian, I noticed the libcds dependency, which caused me some grief when compiling (but never mind that) and which I think would also cause problems for packaging. I made a thread in the forums, but I suppose opening an issue is better for the devs' workflow. :-) I hope my view here isn't outdated. :-)

It seems like libcds is a fork of somebody else's code, and the link in there is dead. It also seems the upstream author has started a new version but seemingly not finished it. The author has made a post about the motivation for it.

To get this code into a distro, which I think is very important for widespread adoption, libcds should not be bundled but kept separate, so that it can be packaged independently by distros. Also, if any patches are needed, they should be sent upstream for incorporation. And since upstream has abandoned version 1 of their library, hdt-cpp should probably move to libcds2.

Cheers,

Kjetil

Add a version number to the .hdt.index files

Different .hdt.index files for the same .hdt file are incompatible with each other; this causes problems if different applications read them. Maybe they should receive some kind of version number.

But this then creates the problem: how to find the index?
Perhaps, instead of supplying the .hdt file name, we can provide the index file name.

New hdt version

What about merging develop to master? I've been using the latest develop branch intensively and it seems there are no issues that should stop it from becoming the next version.
If we do the merge soon, we can use the develop branch to start properly testing the new compilation procedure (#81)

Build with OpenMP fails on Ubuntu 14.10 for devel branch

[CFG] OS= Linux
[CFG] CPP = g++
[CFG] FLAGS = -O3 -Wno-deprecated -fopenmp
[CFG] LDFLAGS = 
[CFG] DEFINES =  -DHAVE_RAPTOR -DHAVE_LIBZ -DHAVE_SERD
[CFG] INCLUDES = -I ../libcds-v1.0.12/includes/ -I /usr/local/include -I ./include -I /opt/local/include -I /usr/include
[CFG] LIB= ../libcds-v1.0.12/lib/libcds.a -L/usr/local/lib -lstdc++ -lraptor2 -lz -lserd-0

[HDT] Compiling src/dictionary/PlainDictionary.cpp
src/dictionary/PlainDictionary.cpp: In member function ‘void hdt::PlainDictionary::lexicographicSort(hdt::ProgressListener*)’:
src/dictionary/PlainDictionary.cpp:453:9: error: expected ‘#pragma omp section’ or ‘}’ before ‘{’ token
         { sort(shared.begin(), shared.end(), DictionaryEntry::cmpLexicographic); }
         ^
src/dictionary/PlainDictionary.cpp: In member function ‘void hdt::PlainDictionary::idSort()’:
src/dictionary/PlainDictionary.cpp:481:9: error: expected ‘#pragma omp section’ or ‘}’ before ‘{’ token
         { sort(subjects.begin(), subjects.end(), DictionaryEntry::cmpID); }
         ^
Makefile:94: recipe for target 'src/dictionary/PlainDictionary.o' failed
make[1]: *** [src/dictionary/PlainDictionary.o] Error 1
make[1]: Leaving directory '/hdt-cpp/hdt-lib'
Makefile:57: recipe for target 'all' failed
make: *** [all] Error 2
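For reference, a minimal sketch of the construct the compiler expects here (illustrative only, not the actual PlainDictionary code): every block inside a parallel sections region must be introduced by its own #pragma omp section.

#include <algorithm>
#include <vector>

void sortBoth(std::vector<int> &a, std::vector<int> &b) {
    #pragma omp parallel sections
    {
        #pragma omp section
        { std::sort(a.begin(), a.end()); }

        #pragma omp section
        { std::sort(b.begin(), b.end()); }
    }
}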

Unicode sequences are incorrectly escaped

This triple:
<http://dbpedia.org/resource/Barack_Obama> <http://www.w3.org/2002/07/owl#sameAs> <http://el.dbpedia.org/resource/\u039C\u03C0\u03B1\u03C1\u03AC\u03BA_\u039F\u03BC\u03C0\u03AC\u03BC\u03B1> .
results in
<http://dbpedia.org/resource/Barack_Obama> <http://www.w3.org/2002/07/owl#sameAs> <http://el.dbpedia.org/resource/\\u039C\\u03C0\\u03B1\\u03C1\\u03AC\\u03BA_\\u039F\\u03BC\\u03C0\\u03AC\\u03BC\\u03B1>.
This is particularly a problem when using HDT via LDF, as the n3 parser throws an exception.

Parsing of blank nodes

Parsing of blank nodes in the object position returns invalid terms. To reproduce:

echo "<http://sub> <http://pred> _:b0." > test.nt && ./tools/rdf2hdt test.nt test.hdt && ./tools/hdtSearch -q '? ? ?' test.hdt 2>/dev/null

HDT with the rapper parser returns http://sub http://pred b0. (notice the missing _:)
HDT with the serdi parser returns http://sub http://pred _:b0. (notice the trailing dot, which should not be there)

As a result, when you export this triple to RDF using hdt2rdf, you'll get invalid ntriples[1]

I'm confused about this issue though. Running echo "<http://sub> <http://pred> _:b0." | serdi - returns valid ntriples, i.e. serdi seems to parse/serialize this correctly.
But when I log the SerdNode value in the HDT serializer (https://github.com/rdfhdt/hdt-cpp/blob/develop/hdt-lib/src/rdf/RDFParserSerd.cpp#L18) I do get the dot as part of the bnode value.

Some help is appreciated ;)

[1]
serdi: <http://sub> <http://pred> _:b0. .
rapper: <http://sub> <http://pred> <b0.> .

FMIndex loading fails on OS X

Loading an HDT file with an FMIndex fails on OSX. This Travis run shows it works on Linux.

The failure seems to be as follows:

  • in CSD_FMIndex::load, we convert an in-memory range to a stringstream by calling rdbuf()->pubsetbuf.
  • when we attempt to read from this stream on OS X, the fail and bad bits of the stream are set.

Now there seems to be quite some discussion about whether calling pubsetbuf is actually a good idea. So I have tried the often-suggested alternative of implementing a custom subclass instead. However, this then fails at another point (apparently because seeking is not implemented).

Who can help us get rid of the pubsetbuf hack?
The original code was written by @MarioAriasGa.
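One possible direction, sketched under the assumption that CSD_FMIndex::load only needs sequential reads plus seeking: a small read-only streambuf over the in-memory range that also implements seekoff/seekpos, instead of calling pubsetbuf.

#include <streambuf>
#include <istream>
#include <ios>

class MemReadBuf : public std::streambuf {
public:
    MemReadBuf(char *data, std::size_t size) { setg(data, data, data + size); }
protected:
    // istream::seekg() ends up here; the default streambuf implementation just fails.
    virtual std::streampos seekoff(std::streamoff off, std::ios_base::seekdir dir,
                                   std::ios_base::openmode) {
        char *base = eback(), *end = egptr(), *target;
        if (dir == std::ios_base::beg)      target = base + off;
        else if (dir == std::ios_base::cur) target = gptr() + off;
        else                                target = end + off;
        if (target < base || target > end) return std::streampos(std::streamoff(-1));
        setg(base, target, end);
        return std::streampos(target - base);
    }
    virtual std::streampos seekpos(std::streampos pos, std::ios_base::openmode which) {
        return seekoff(std::streamoff(pos), std::ios_base::beg, which);
    }
};

// Usage sketch: MemReadBuf buf(ptr, length); std::istream in(&buf); then read as before.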


Relevant backstory: I found this out while testing HDT-node. Strangely, the exact same code works for Node.js 4 and 6 on Linux and on OS X, but fails for Node.js 7 on OS X (but works on Linux). And as you can see above, it always fails in standalone mode on OS X. So somehow, the Node.js 4 and 6 compilations on OS X are doing something different. I wonder if this issue is related.


PS To test this: just fork from the bug-literal-dictionary branch, and when you commit and push, Travis CI will test for you.

Expected HDT compressibility?

In published examples I've seen, HDT file compressibility (by doing an additional gzip or bzip2 step) has been a few percent at the most. Given that baseline, the following seems anomalous:

-rw-rw-r-- 1 arto arto  21813380 Dec 10 12:43 pc_biosystem.hdt
-rw-rw-r-- 1 arto arto   5054074 Dec 10 12:44 pc_biosystem.hdt.bz2
-rw-rw-r-- 1 arto arto   6988217 Dec 10 12:43 pc_biosystem.hdt.gz
-rw-rw-r-- 1 arto arto 833444759 Dec 10 12:42 pc_biosystem.nt
-rw-rw-r-- 1 arto arto  15820798 Dec 10 12:46 pc_biosystem.nt.bz2
-rw-rw-r-- 1 arto arto  21045372 Dec 10 12:44 pc_biosystem.nt.gz
-rw-rw-r-- 1 arto arto 327940461 Dec 10 12:42 pc_biosystem.ttl

That is, the .hdt file is 20 MB, the gzip-compressed .hdt.gz is 6.66 MB, and the bzip2-compressed .hdt.bz2 is 4.82 MB.

Instructions to reproduce these figures from scratch follow here:

wget ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/biosystem/pc_biosystem.ttl.gz
gunzip pc_biosystem.ttl.gz
rapper -i turtle -o ntriples pc_biosystem.ttl > pc_biosystem.nt
rdf2hdt -f ntriples pc_biosystem.nt pc_biosystem.hdt
gzip -9 < pc_biosystem.hdt > pc_biosystem.hdt.gz
bzip2 -9 < pc_biosystem.hdt > pc_biosystem.hdt.bz2

The question here is, are these results expected or anomalous? (Pinging @MarioAriasGa.)

ModifiableHDT?

I'm working on a SWI-Prolog interface to this library. Read access now mostly works. Now I am trying to make it possible to create HDTs from triples available in Prolog. I could of course first write an N-Triples file, but that seems a waste of resources. I tried including HDTFactory.hpp, creating an instance of a BasicModifiableHDT using HDTFactory::createDefaultModifiableHDT(), and calling ->insert() and ->save() on that. This creates a file, but trying to open it using hdt2rdf says

ERROR: This software cannot open this version of HDT File

What am I missing?
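A heavily hedged workaround sketch: write the triples to a temporary N-Triples file and let the library build the HDT from it, which yields a file the standard loader accepts. This assumes the HDTManager::generateHDT entry point, the NTRIPLES notation constant, and saveToHDT; check the headers of your checkout before relying on it.

#include <HDTManager.hpp>
#include <HDTSpecification.hpp>
#include <fstream>

int main() {
    {
        // Triples coming from Prolog would be written here instead.
        std::ofstream nt("tmp.nt");
        nt << "<http://example.org/s> <http://example.org/p> \"o\" .\n";
    }
    hdt::HDTSpecification spec;
    hdt::HDT *h = hdt::HDTManager::generateHDT("tmp.nt", "http://example.org/base",
                                               hdt::NTRIPLES, spec);
    h->saveToHDT("out.hdt");
    delete h;
    return 0;
}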

OFFSET is not working in constant time

@momo54 reported that although he was expecting nearly constant time for OFFSET (using the "goto" methods in BitmapTriplesIterators.cpp), in practice he observed large delays using Linked Data Fragments + HDT with particular OFFSET queries: SP?o, S?p?o, ?sPO and ?s?pO + OFFSET.

  • Note that ?sP?o +OFFSET cannot run in constant time with our current implementation.

Fix URI encoding problems

I've converted an NT file to HDT and loaded the HDT for querying in Jena. However, some resources are not queryable although they are present in the dataset.

The dataset is

<http://ru.dbpedia.org/resource/Список_римско-католических_епархий_(структурный_вид)> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Абаэтетуба> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Бразилия> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/25_ноября> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/1961_год> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Собор_Непорочного_зачатия_Пресвятой_Девы_Марии_(Абаэтетуба)> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Архиепархия_Белен-до-Пара> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Епархия> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Римско-католическая_церковь> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Кафедра_(христианство)> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Собор_(храм)> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Папство> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Иоанн_XXIII> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Булла> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Территориальное_аббатство> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/4_августа> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/1981_год> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Иоанн_Павел_II> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Священник> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Annuario_Pontificio> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Acta_Apostolicae_Sedis> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Категория:Католические_епархии_Бразилии> .
<http://ru.dbpedia.org/resource/Архиепархия_Белен-до-Пара> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> .
<http://ru.dbpedia.org/resource/Епархия_Абаэтетубы> <http://www.w3.org/2000/01/rdf-schema#label> "Епархия Абаэтетубы"@ru .

In Jena the following query does not match anything:

StmtIterator tmpIter = model.listStatements(model.getResource("http://ru.dbpedia.org/resource/Епархия_Абаэтетубы"), null, (RDFNode)null);
if(!tmpIter.hasNext())
    System.out.println("No statements for this resource");

Loading the same dataset in Jena but as a ttl file works.

Maybe a bug in the HDT conversion process? Note that we are dealing with Russian Cyrillic characters.

Cannot save Index if the HDT is not saved

Using the rdf2hdt tool with the -i option I keep getting this warning/notice (it is not presented as an error). This message is quite confusing; how should the program be called to generate the HDT index?

Thanks,

C

Ensure test suite can be run without Kyoto Cabinet and Raptor

Currently you need to have Kyoto Cabinet and Raptor installed to build the test suite with make tests. I've attached the error output in make-tests.log.txt.

The test suite should be amended such that it doesn't require any of the optional dependencies of the library to be installed. This could be done either by restructuring the test suite and/or by ensuring that tests for optional dependencies only activate when those dependencies are present.

Error parsing a turtle file with certain language tags

When using rdf2hdt on the following turtle file:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix skos-thes: <http://purl.org/iso25964/skos-thes#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<urn:x-skosprovider:trees/3> a skos:Collection ;
    dcterms:identifier 3 ;
    skos:inScheme <urn:x-skosprovider:trees> ;
    skos:member <urn:x-skosprovider:trees/1>,
        <urn:x-skosprovider:trees/2> ;
    skos:prefLabel "Trees by species"@en,
        "Bomen per soort"@nl-BE .

<urn:x-skosprovider:trees/1> a skos:Concept ;
    dcterms:identifier 1 ;
    skos:closeMatch <http://id.python.org/different/types/of/trees/nr/1/the/larch> ;
    skos:definition "A type of tree."@en ;
    skos:inScheme <urn:x-skosprovider:trees> ;
    skos:prefLabel "The Larch"@en,
        "De Lariks"@nl .

<urn:x-skosprovider:trees/2> a skos:Concept ;
    dcterms:identifier 2 ;
    skos:altLabel "la châtaigne"@fr,
        "De Paardekastanje"@nl ;
    skos:definition "A different type of tree."@en ;
    skos:inScheme <urn:x-skosprovider:trees> ;
    skos:prefLabel "The Chestnut"@en ;
    skos:relatedMatch <http://id.python.org/different/types/of/trees/nr/17/the/other/chestnut> .

<urn:x-skosprovider:trees> a skos:ConceptScheme ;
    dcterms:identifier "TREES" ;
    skos:hasTopConcept <urn:x-skosprovider:trees/1>,
        <urn:x-skosprovider:trees/2> ;
    skos:prefLabel "Different types of trees"@en,
        "Verschillende soorten bomen"@nl .

I get the following error:

[21:34:00] $ /home/koen/Projecten/general/hdt-cpp/hdt-lib/tools/rdf2hdt TREES-full.ttl TREES-full.hdt
expected `(', not `B'
caught here??
Catch exception load: Error parsing input.
ERROR: Error parsing input.

Using an rdfxml version of the same file I have no problems:

[21:39:04] $ /home/koen/Projecten/general/hdt-cpp/hdt-lib/tools/rdf2hdt -f rdfxml TREES-full.rdf TREES-full.hdt
RDF format: rdfxml
HDT Successfully generated.                                                    
Total processing time: Clock(10 ms 901 us)  User(10 ms 516 us)  System(552 us)

Both the Turtle and the rdfxml version were generated from Python using RDFLIB, on Ubuntu 14.04

I believe rdf2hdt uses Serd for reading Turtle files. Using serdi on the same file gives me the following error:

[21:42:26] 1 $ serdi TREES-full.ttl
<urn:x-skosprovider:trees/3> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Collection> .
<urn:x-skosprovider:trees/3> <http://purl.org/dc/terms/identifier> "3"^^<http://www.w3.org/2001/XMLSchema#integer> .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#inScheme> <urn:x-skosprovider:trees> .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#member> <urn:x-skosprovider:trees/1> .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#member> <urn:x-skosprovider:trees/2> .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#prefLabel> "Trees by species"@en .
<urn:x-skosprovider:trees/3> <http://www.w3.org/2004/02/skos/core#prefLabel> "Bomen per soort"@nl- .
error: TREES-full.ttl:16:29: expected `.', not `B'

rdf2hdt issues

There are a few issues with the rdf2hdt command:

  • It does not always print to stdout (using rdf2hdt file.hdt -).
  • The built-in N-Triples serializer works, but when the Raptor serializer is included it does not.
  • Do we want raptor as a serializer anyway? (considering that the parser lib was changed to serd #31)

How to load a gzipped N-Triples file?

On branch stable I am able to create an HDT file from a gzipped N-Triples file (this probably uses Raptor). On branch master this no longer works (this probably uses Serd).

What is the best way to create an HDT file from a gzipped N-Triples file on master? Could the library be taught to do the right thing automatically?

determine node type from id

A subject can be a blank node or an iri. An object can be a blank node, an iri or a literal. A TripleID contains unsigned int for subject, object and predicate.

Is there a cheap way to determine the type of the node without retrieving the string value?

The nodes are sorted in the dictionary. Blank nodes start with '_', literals start with '"' and IRIs start with '<'. So by knowing the number of blank nodes, IRIs and literals in the dictionary, it is possible to determine the node type from the id.
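A small sketch of that range-check idea; the counts per section are assumptions that would have to come from the concrete dictionary implementation, and the exact ordering of literals, blank nodes and IRIs within a section depends on how the strings are stored, so treat this as illustrative only.

#include <cstddef>

enum NodeType { NODE_LITERAL, NODE_BLANK, NODE_IRI };

// Assuming the object section assigns ids 1..numLiterals to literals,
// then numBlanks blank nodes, and IRIs afterwards.
NodeType objectNodeType(size_t id, size_t numLiterals, size_t numBlanks) {
    if (id <= numLiterals)             return NODE_LITERAL;
    if (id <= numLiterals + numBlanks) return NODE_BLANK;
    return NODE_IRI;
}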

enable KCB option for rdf2hdt

I am wondering how to run the rdf2hdt conversion with the Kyoto Cabinet database (KCB) option. We have a very large RDF N-Triples file that exceeds the memory limit for conversion, and the job got killed due to memory use. I cannot find anything about this in the README file.

In addition, I took a look at the BasicHDT.cpp file and found there is no option for KCB:
void BasicHDT::createComponents() {
    // HEADER
    header = new PlainHeader();

    // DICTIONARY
    std::string dictType = spec.get("dictionary.type");
    if(dictType==HDTVocabulary::DICTIONARY_TYPE_FOUR) {
        dictionary = new FourSectionDictionary(spec);
    } else if(dictType==HDTVocabulary::DICTIONARY_TYPE_PLAIN) {
        dictionary = new PlainDictionary(spec);
    } else if(dictType==HDTVocabulary::DICTIONARY_TYPE_LITERAL) {
#ifdef HAVE_CDS
        dictionary = new LiteralDictionary(spec);
#else
        throw "This version has been compiled without support for this dictionary";
#endif
    } else {
        dictionary = new FourSectionDictionary(spec);
    }

    // TRIPLES
    std::string triplesType = spec.get("triples.type");
    if(triplesType==HDTVocabulary::TRIPLES_TYPE_BITMAP) {
        triples = new BitmapTriples(spec);
    } else if(triplesType==HDTVocabulary::TRIPLES_TYPE_COMPACT) {
        triples = new CompactTriples(spec);
    } else if(triplesType==HDTVocabulary::TRIPLES_TYPE_PLAIN) {
        triples = new PlainTriples(spec);
    } else if(triplesType==HDTVocabulary::TRIPLES_TYPE_TRIPLESLIST) {
        triples = new TriplesList(spec);
#ifndef WIN32
    } else if (triplesType == HDTVocabulary::TRIPLES_TYPE_TRIPLESLISTDISK) {
        triples = new TripleListDisk();
#endif
    } else {
        triples = new BitmapTriples(spec);
    }
}

Best,
Gang

Improve API to production quality

After the already-merged enhancements in pull requests #16 and #17, these are the remaining immediate pain points I have encountered so far in attempting to use hdt-lib in a production application:

  1. The library does not throw standard exceptions, i.e., exceptions derived from the C++ standard library's std::exception base class. Instead, the library throws raw C strings directly (e.g., src/hdt/BasicHDT.cpp:106). This is not great practice, and these exceptions will go uncaught by many applications, or else will land in a top-level catch-all handler catch (...) at which point they can no longer be interpreted at all.
  2. The library in multiple locations eats up exceptions and prints them out to the standard output stream (e.g., src/hdt/BasicHDT.cpp:171), instead of letting the application handle them. Unilaterally printing out stuff to the application's standard output or standard error is a definite no-no for a production-quality library. Also, error conditions really ought to propagate up the stack to where they can be properly observed and logged.
  3. The library is overall missing careful const-qualification of methods and parameters, resulting in spurious temporaries and in the inability to hold const references to hdt-lib objects, complicating const-correct application code by necessitating const_cast<> casts that undermine type safety.
  4. Additionally, from a performance standpoint, the library currently overall performs quite a bit of spurious memory allocations (that will slow things down and fragment the heap), in particular by overuse of std:string copies where a const std::string& reference or a char* pointer would suffice.

I'd like to address each of these points in order to make the library suitable for use from production code, but wanted to express my intentions here first, as in particular the first item would necessitate changes to any existing application code that uses hdt-lib and assumes that it will throw raw C strings instead of standard exceptions.

Soliciting feedback from @mielvds and @RubenVerborgh in particular.
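To illustrate point 1, a minimal sketch of what a library-specific exception type could look like; the class name and usage are assumptions, not existing hdt-lib API.

#include <stdexcept>
#include <string>

namespace hdt {
class HDTException : public std::runtime_error {
public:
    explicit HDTException(const std::string &msg) : std::runtime_error(msg) {}
};
}

// Instead of:  throw "Error loading HDT file";
// one would write:  throw hdt::HDTException("Error loading HDT file");
// so applications can catch it generically:
//   try { ... } catch (const std::exception &e) { /* log e.what() */ }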

Cannot compile due to narrowing conversion

Compiling hdt-lib using GCC 6.1.1 gives me the following warnings:

static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-1’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
const DTsucc RMQ_succinct::HighestBitsSet[8] = {~0, ~1, ~3, ~7, ~15, ~31, ~63, ~127};
                                                                             ^
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-2’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-4’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-8’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-16’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-32’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-64’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
static/suffixtree/RMQ_succinct.cpp:116:85: error: narrowing conversion of ‘-128’ from ‘int’ to ‘DTsucc {aka unsigned char}’ inside { } [-Wnarrowing]
Makefile:35: recipe for target 'static/suffixtree/RMQ_succinct.o' failed
make[3]: *** [static/suffixtree/RMQ_succinct.o] Error 1
Makefile:39: recipe for target 'all' failed
make[2]: *** [all] Error 2
Makefile:13: recipe for target 'libcompact' failed
make[1]: *** [libcompact] Error 2
make[1]: Leaving directory '/home/wbeek/Git/hdt-cpp/libcds-v1.0.12'
Makefile:63: recipe for target 'all' failed
make: *** [all] Error 2

In order to compile the library I have to make the following change in two files. From:

const DTsucc RMQ_succinct_lcp::HighestBitsSet[8] = {~0, ~1, ~3, ~7, ~15, ~31, ~63, ~127};

To:

const DTsucc RMQ_succinct_lcp::HighestBitsSet[8] = {static_cast<DTsucc>(~0), static_cast<DTsucc>(~1), static_cast<DTsucc>(~3), static_cast<DTsucc>(~7), static_cast<DTsucc>(~15), static_cast<DTsucc>(~31), static_cast<DTsucc>(~63), static_cast<DTsucc>(~127)};

Since my C++ skills are non-existent, a capable programmer may fix this in a less hacky way :-P
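A slightly less hacky alternative (a sketch, not a committed patch) is to spell out the intended 8-bit values directly, which avoids both the narrowing error and the casts:

const DTsucc RMQ_succinct_lcp::HighestBitsSet[8] =
    {0xFF, 0xFE, 0xFC, 0xF8, 0xF0, 0xE0, 0xC0, 0x80};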

Change interfaces to work with larger data (required when merging long-dict-id branch to devel)

Hi all,

We have recently created the LOD-a-lot HDT dataset including 28B triples. @MarioAriasGa performed some datatype changes in the branch https://github.com/rdfhdt/hdt-cpp/tree/long-dict-id in order to manage such an amount of triples.

We should then merge this branch into develop (https://github.com/rdfhdt/hdt-cpp/tree/develop) and later on into master. Given that the interface has changed, we suggest creating a new major revision.

Opinions?

See the required changes (some of them are only for the HDT-it app, please filter by hdt-lib/): https://github.com/rdfhdt/hdt-cpp/compare/develop...long-dict-id?diff=unified&name=long-dict-id

getSuggestions without max_count argument

Function getSuggestions is very nice since it implements the following two common use cases:

  1. Auto-completion for typing IRIs and literals
  2. IRI namespace matching

Currently a maximum size must be given to the function, but this maximum is not always known beforehand. For example, if you want to find all DOI IRIs (use case category 2), there are over 4M matches for prefix http://dx.doi.org/ in LOD-a-lot.

Would it be possible to return the list of suggestions as an iterator or lazy list of arbitrary size?
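One possible API shape for this, written as an assumption rather than existing hdt-cpp code: an iterator over dictionary entries sharing a prefix, so the caller decides when to stop.

#include <string>

// Hypothetical interface; neither this class nor the method below exist in hdt-cpp today.
class SuggestionIterator {
public:
    virtual ~SuggestionIterator() {}
    virtual bool hasNext() = 0;
    virtual std::string next() = 0;  // next IRI or literal starting with the requested prefix
};

// Proposed addition to Dictionary (hypothetical signature):
// virtual SuggestionIterator *getSuggestions(const char *prefix, TripleComponentRole role) = 0;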

Extract random walks efficiently

Since it's possible to extract the estimated number of results for an arbitrary triple pattern, we were wondering whether it would also be possible to efficiently retrieve, for a given node X, a random neighbor node Y of X (in constant time).

Looking at the Web Semantics paper from 2013, it seems that this should be doable by querying the bitmap triples representation directly.

Thanks in advance!
@airobert and @wouterbeek
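A rough sketch of the idea, assuming the string-level iterator exposes canGoTo()/goTo() for positional jumps (as the ID-level iterators over BitmapTriples do); if it does not for a given pattern, this degrades to skipping forward.

#include <HDTManager.hpp>
#include <cstdlib>

// Fill `out` with a random (x, p, o) triple for subject x; returns false if x has no neighbors.
bool randomNeighbor(hdt::HDT *h, const char *x, hdt::TripleString &out) {
    hdt::IteratorTripleString *it = h->search(x, "", "");
    size_t n = it->estimatedNumResults();     // exact when the subject is bound (assumption)
    if (n == 0) { delete it; return false; }
    size_t k = std::rand() % n;
    if (it->canGoTo()) it->goTo((unsigned int) k);   // constant-time jump (assumption)
    else while (k-- > 0 && it->hasNext()) it->next();
    if (!it->hasNext()) { delete it; return false; }
    out = *it->next();
    delete it;
    return true;
}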

Ubuntu make error

Tried to compile hdt-lib in Docker using the latest Ubuntu image.
I get the error:

[HDT] Linking libhdt.a
 [HDT] Compiling tool tools/hdtInfo.cpp
../libcds-v1.0.12/lib/libcds.a: error adding symbols: Archive has no index; run ranlib to add one
collect2: error: ld returned 1 exit status
Makefile:82: recipe for target 'tools/hdtInfo' failed
make[1]: Leaving directory '/hdt-cpp/hdt-lib'
make[1]: *** [tools/hdtInfo] Error 1
make: *** [all] Error 2

Any idea how to fix it?

Question about .gz parsing

Dear all,
I have a short question. I checked that libz is installed and I have an ntriples file called core.nt.gz

Will the program automatically recognize that the file is zipped, if I call
./rdf2hdt core.nt.gz core.hdt

because I receive a lot of parser errors. Maybe I am missing an option.
I also tried -f ntriples

If I unzip the file first, it works.

Crash due to incorrect exception handling

Using the latest master, trying to load http://lod.labs.vu.nl/brt-clean.nt.gz (be careful, it is almost 8 GB and you need to uncompress it before loading), we get a SEGV at line 295 of FourSectionDictionary.cpp, where we see delete objects.

This is quite understandable, as objects has possibly already been deleted in the try section. I'm not familiar enough with the code base to see the clean way out.
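The usual fix for this pattern is to null the pointer where it is deleted inside the try block, so that the cleanup path cannot free it a second time; a small illustrative sketch (not the actual FourSectionDictionary code):

#include <cstddef>

struct Section { };  // stand-in for the real csd object

void loadSketch() {
    Section *objects = NULL;
    try {
        objects = new Section();
        // ... work that may throw ...
        delete objects;
        objects = NULL;   // prevents the catch block from freeing it again
        // ... more work that may throw ...
    } catch (...) {
        delete objects;   // safe even if already freed above, because it is NULL
        objects = NULL;
        throw;
    }
}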

Arbitrary estimated number of results

I generated a dataset of about 1M triples using the watdiv data generator and converted the resulting file using HDT.
When querying for certain triples, it seems like some estimated counts are not very "estimated", but rather arbitrary.

For example, there should be 433615 triples with the predicate http://db.uwaterloo.ca/~galuc/wsdbm/friendOf, but when executing the query ?v0 http://db.uwaterloo.ca/~galuc/wsdbm/friendOf ?v2 I get an estimated count of 3878 triples, which is off by about two orders of magnitude.

In order to get these estimated counts, IteratorTripleID::estimatedNumResults() is used (see https://github.com/RubenVerborgh/HDT-Node/blob/master/lib/HdtDocument.cc#L139).

You can find the hdt file I'm using to test querying here: https://dl.dropboxusercontent.com/u/16059961/watdiv_10_dataset.zip
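When consuming these numbers from the C++ API, it may also help to check how reliable the estimate claims to be; a short sketch, assuming the string iterator exposes both estimatedNumResults() and numResultEstimation() (EXACT / UP_TO / APPROXIMATE), with a placeholder file name:

#include <HDTManager.hpp>
#include <iostream>

int main() {
    hdt::HDT *h = hdt::HDTManager::mapIndexedHDT("watdiv_10.hdt");
    hdt::IteratorTripleString *it =
        h->search("", "http://db.uwaterloo.ca/~galuc/wsdbm/friendOf", "");
    std::cout << "estimate: " << it->estimatedNumResults()
              << "  estimation type: " << it->numResultEstimation() << std::endl;
    delete it;
    delete h;
    return 0;
}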

Avoid allocation of strings in Dictionary::idToString.

The Dictionary class has this function:
virtual std::string idToString(unsigned int id, TripleComponentRole role)=0;

Changing that to
virtual void idToString(unsigned int id, TripleComponentRole role, std::string& string);
would allow the caller to supply an existing string object and avoid memory allocation.

Slow compilation speed

This library (re-)compiles very slowly compared to other libraries. I wonder if we can find causes/fixes for that. Perhaps some of the header files are doing overly complex things?

libserd should be listed as a dependency

Hi,

Installing on Ubuntu 14.04.1 LTS leads to an error of
serd.h: No such file or directory
which can be resolved by installing:
apt-get install libserd-dev

Cristian

Alter Makefile to make code position-independent?

Compilation in SWI-Prolog's HDT package requires me to add -fPIC to
the FLAGS variable (on lines 14 and 16). I'm not sure whether this
is a bug or not. According to an answer on Stack Overflow, PIC stands for:

> Position Independent Code means that the generated machine code
> is not dependent on being located at a specific address in order
> to work.

My editor also complains about a spurious horizontal tab character on
line 111. This may be removed if only for aesthetic reasons.

Building test classes failed in devel branch (OpenMP disabled)

filterSearch.cpp:(.text.startup+0x70c): undefined reference to `typeinfo for hdt::LiteralDictionary'
filterSearch.cpp:(.text.startup+0x75b): undefined reference to `hdt::LiteralDictionary::substringToId(unsigned char*, unsigned int, unsigned int**)'
collect2: error: ld returned 1 exit status
Makefile:89: recipe for target 'tests/filterSearch' failed
make: *** [tests/filterSearch] Error 1

Implement a proper build system

Right now, the hdt-lib build process is pretty brittle with regards to platform quirks, and produces only tool binaries and static libraries, not any shared library. Further, building with custom paths or with a selection of features to be enabled/disabled (e.g., Serd, Raptor, zlib, KCB) requires manually editing Makefiles--not at all a welcome practice from a downstream user's point of view.

On POSIX systems, the currently produced static libraries aren't portably linkable with shared libraries that would like to make use of libhdt; moreover, on e.g. x86-64 the individual object files that constitute the static libraries would generally need to be compiled with -fPIC for the libraries to be usable from modern projects. All this requires manually fiddling with flags in both libcds-v1.0.12/src/Makefile and hdt-lib/Makefile.

Additionally, no actual installation facility is currently provided, meaning that users have to manually hunt down built artifacts--which are currently placed in different output directories--and copy them one by one to some installation destination.

All this argues for converting the project to use a proper configuration and build system. The two obvious and realistic choices are Autotools and CMake. Doing this would also greatly facilitate packaging the project for downstream distribution (#19).

I would myself volunteer to convert the project to using standard Autotools, which would give users the familiar ./configure && make && sudo make install workflow that works reliably on all platforms and resolves all aforementioned problems.
