
hdt-java's Introduction

Join the chat at https://gitter.im/rdfhdt

HDT Library, Java Implementation. http://www.rdfhdt.org

Overview

HDT-lib is a Java library that implements the W3C Member Submission (http://www.w3.org/Submission/2011/03/) of the RDF HDT (Header-Dictionary-Triples) binary format for publishing and exchanging RDF data at large scale. Its compact representation allows storing RDF in less space while providing direct access to the stored information. This is achieved by representing the RDF graph with three main components: Header, Dictionary and Triples. The Header holds extensible metadata describing the RDF data set and details of its internals. The Dictionary organizes the vocabulary of strings present in the RDF graph by assigning a numerical ID to each different string. The Triples component encodes the internal structure of the RDF graph in compressed form.
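The Dictionary/Triples split can be illustrated with a small self-contained sketch (conceptual only — this is not the hdt-java API): each distinct string gets a numeric ID, and triples are then stored as compact ID tuples instead of repeated strings.

```java
import java.util.*;

// Conceptual sketch of the Dictionary/Triples split (not the hdt-java API):
// every distinct string is mapped to a numeric ID, and triples are stored
// as compact ID tuples instead of repeated strings.
public class DictionaryTriplesSketch {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> strings = new ArrayList<>();
    private final List<int[]> triples = new ArrayList<>();

    // Assign a numeric ID to each distinct string (1-based, as in HDT).
    public int idFor(String s) {
        return ids.computeIfAbsent(s, k -> { strings.add(k); return strings.size(); });
    }

    public void add(String s, String p, String o) {
        triples.add(new int[]{idFor(s), idFor(p), idFor(o)});
    }

    // Resolve an ID back to its string.
    public String stringFor(int id) { return strings.get(id - 1); }

    public int distinctStrings() { return strings.size(); }
    public int tripleCount() { return triples.size(); }
}
```

The real format additionally compresses both components (front-coded dictionary sections, bitmap-encoded triples), but the ID indirection above is the core idea that makes that compression possible.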

It provides several components:

  • hdt-java-api: Abstract interface for dealing with HDT files.
  • hdt-java-core: Core library for accessing HDT files programmatically from Java. It allows creating HDT files from RDF and converting HDT files back to RDF. It also provides a search interface to find triples that match a specific triple pattern.
  • hdt-java-cli: Command-line tools to convert RDF to HDT, merge two HDT files, and access HDT files from a terminal.
  • hdt-jena: Jena integration. Provides a Jena Graph implementation that allows accessing HDT files as normal Jena Models. In turn, this can be used with Jena ARQ to provide more advanced searches, such as SPARQL, and even setting up SPARQL Endpoints with Fuseki.
  • hdt-java-package: Generates a package with all the components and launcher scripts.
  • hdt-fuseki (< 2.2.0): Packages Apache Jena Fuseki with the HDT jars and a fast launcher, to start a SPARQL endpoint out of HDT files very easily.

Compiling

Use mvn install to let Apache Maven install the required jars in your system.

You can also run mvn assembly:single under hdt-java-package to generate a distribution directory with all the jars and launcher scripts.

Usage

Please refer to hdt-java-package/README for more information on how to use the library. You can also find useful information on our Web Page http://www.rdfhdt.org

License

Each module has a different license: the core is LGPL; examples and tools are Apache.

  • hdt-api: Apache License
  • hdt-java-cli: (Commandline tools and examples): Apache License
  • hdt-java-core: Lesser General Public License
  • hdt-jena: Lesser General Public License
  • hdt-fuseki(< 2.2.0): Apache License

Note that hdt-fuseki has been removed in version 2.2.0 and might be re-added later once it is made compatible with Fuseki 2.

Authors

Acknowledgements

RDF/HDT is a project developed by the Insight Centre for Data Analytics (www.insight-centre.org), University of Valladolid (www.uva.es), University of Chile (www.uchile.cl). Funded by Science Foundation Ireland: Grant No. SFI/08/CE/I1380, Lion-II; the Spanish Ministry of Economy and Competitiveness (TIN2009-14009-C02-02); Chilean Fondecyt's 1110287 and 1-110066; and the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No 642795.

hdt-java's People

Contributors

afs, alexishuf, andrefs, artob, ate47, athalhammer, bdecarne, callidon, d063520, dependabot[bot], dymil, ebremer, fkleedorfer, flamingofugang, jervenbolleman, jm-gimenez-garcia, joernhees, marioariasga, mielvds, osma, rubenverborgh, shawnsmith, the-alchemist, webdata

hdt-java's Issues

Fix missing libraries when building with Fuseki 2.3.1

Hi,

I got the following error message when building the current version with Fuseki 2.3.1:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.1:compile (default-compile) on project hdt-fuseki: Compilation failure: Compilation failure:
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[21,36] error: package org.apache.jena.fuseki does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[21,0] error: static import only from classes and interfaces
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[31,29] error: package org.apache.jena.fuseki does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[32,33] error: package org.apache.jena.fuseki.mgt does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[33,36] error: package org.apache.jena.fuseki.server does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[34,36] error: package org.apache.jena.fuseki.server does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[35,36] error: package org.apache.jena.fuseki.server does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[40,31] error: package org.eclipse.jetty.server does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[46,14] error: package arq.cmd does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[47,18] error: cannot find symbol
[ERROR] 
[ERROR] package arq.cmdline
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[51,28] error: package com.hp.hpl.jena.graph does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[52,28] error: package com.hp.hpl.jena.query does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[53,28] error: package com.hp.hpl.jena.query does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[54,34] error: package com.hp.hpl.jena.sparql.core does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[55,34] error: package com.hp.hpl.jena.sparql.core does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[56,34] error: package com.hp.hpl.jena.sparql.core does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[57,26] error: package com.hp.hpl.jena.tdb does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[58,26] error: package com.hp.hpl.jena.tdb does not exist
[ERROR] 
[ERROR] /home/fug2/hdt-java/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[59,38] error: package com.hp.hpl.jena.tdb.transaction does not exist
[ERROR] 

Errors trying to process any other file format than NT

I compiled the HDT-Java library to transform a set of N3 and RDF/XML files to HDT.
When launching the script I get an error:

/hdt-java/hdt-java-package/target/hdt-java-package-2.0-distribution/hdt-java-package-2.0/bin# ./rdf2hdt.sh -rdftype n3 <n3 file> <hdt file>
Converting <n3file> to <hdt> as n3
Exception in thread "main" org.rdfhdt.hdt.exceptions.ParserException
	at org.rdfhdt.hdt.rdf.parsers.RDFParserRIOT.doParse(RDFParserRIOT.java:89)
	at org.rdfhdt.hdt.hdt.impl.TempHDTImporterOnePass.loadFromRDF(TempHDTImporterOnePass.java:100)
	at org.rdfhdt.hdt.hdt.HDTManagerImpl.doGenerateHDT(HDTManagerImpl.java:103)
	at org.rdfhdt.hdt.hdt.HDTManager.generateHDT(HDTManager.java:129)
	at org.rdfhdt.hdt.tools.RDF2HDT.execute(RDF2HDT.java:106)
	at org.rdfhdt.hdt.tools.RDF2HDT.main(RDF2HDT.java:167)

I tried all shell scripts available after the Maven install, and the error still exists.
The raw files can be imported into Virtuoso without any errors.
How do I process non-NT files using the Java library?

Parser exception hidden

In RDFParserRIOT, the "parser not found for" exception is hidden by the catches in lines 87-91. Ideally, ParserException should take a Throwable as an argument to its constructor, so that exceptions can be chained.
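A minimal sketch of the proposed fix (hypothetical class body — the real ParserException lives in org.rdfhdt.hdt.exceptions and may differ): adding cause-taking constructors lets the original RIOT exception travel up the chain instead of being swallowed.

```java
// Sketch of the proposed fix: give ParserException cause-taking
// constructors so the underlying parser error is preserved in the chain
// rather than hidden by a bare `throw new ParserException()`.
public class ParserException extends Exception {
    public ParserException() { super(); }
    public ParserException(String message) { super(message); }
    // New: chain the underlying parser error instead of discarding it.
    public ParserException(Throwable cause) { super(cause); }
    public ParserException(String message, Throwable cause) { super(message, cause); }
}
```

The catch blocks could then do `throw new ParserException(e);` and callers would see the root cause in the stack trace.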

Too many open files

When I perform a

HDT hdt = HDTManager.mapIndexedHDT(rdfFile.getAbsolutePath(), null);

followed by a

hdt.close();

the number of open files keeps increasing. As I have 70,000 HDT files to query, I have to run the program multiple times on subsets; otherwise I end up with too many open files in the OS. I tried the 2.1 branch but was unable to compile the code.
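The leak itself would be a library issue, but on the caller side a try-with-resources block at least guarantees close() runs for every handle, even when processing throws. A self-contained sketch with a mock resource standing in for an HDT handle (assuming, as in hdt-java, that the handle is AutoCloseable):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Demonstrates the try-with-resources pattern for closing one handle per
// file. MockHdt is a stand-in for an HDT handle, not the real class.
public class CloseDemo {
    static final AtomicInteger openHandles = new AtomicInteger();

    static class MockHdt implements AutoCloseable {
        MockHdt() { openHandles.incrementAndGet(); }
        @Override public void close() { openHandles.decrementAndGet(); }
    }

    // Process many files, closing each handle even if processing throws.
    public static void processAll(int files) {
        for (int i = 0; i < files; i++) {
            try (MockHdt hdt = new MockHdt()) {
                // ... query the file ...
            }
        }
    }
}
```

If the descriptor count still climbs with this pattern, the descriptors are being retained inside the library (e.g. by the memory-mapped index), which is what this report suggests.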

hdtSearch.sh and Fuseki give incorrect results for non-ASCII literals

I noticed problems with both hdtSearch (cpp) and hdtSearch.sh (java). Both apparently give incorrect results in some cases when looking for a specific literal value. I have reported the problems with the cpp version in a separate issue.

My test dataset is this NT file with only 3 triples:

<http://example.org/000046085> <http://schema.org/name> "Raamattu" .
<http://example.org/000146854> <http://schema.org/name> "Ajan lyhyt historia" .
<http://example.org/000019643> <http://schema.org/name> "Seitsemän veljestä" .

I converted it to HDT using the Java version of rdf2hdt. Then I query it for the literal values using hdtSearch.sh:

$ rdf2hdt.sh hdt-test.nt hdt-test.hdt
Converting hdt-test.nt to hdt-test.hdt as null
File converted in: 47 ms 744 us 0.0                            
Total Triples: 3
Different subjects: 3
Different predicates: 1
Different objects: 3
Common Subject/Object:0
HDT saved to file in: 3 ms 441 us

$ hdtSearch.sh hdt-test.hdt
Could not read .hdt.index, Generating a new one.
Predicate Bitmap in 319 us
Count predicates in 26 us
Count Objects in 41 us Max was: 1
Bitmap in 23 us
Object references in 39 us
Sort object sublists in 8 us
Count predicates in 17 us
Index generated in 314 us
>> ? ? "Raamattu"
Query: |?| |?| |"Raamattu"|
http://example.org/000046085 http://schema.org/name "Raamattu"
Iterated 1 triples in 7 ms 384 us
>> ? ? "Ajan lyhyt historia"
Query: |?| |?| |"Ajan lyhyt historia"|
http://example.org/000146854 http://schema.org/name "Ajan lyhyt historia"
Iterated 1 triples in 261 us
>> ? ? "Seitsemän veljestä"
Query: |?| |?| |"Seitsemän veljestä"|
No results found.

As you can see from the above output, the first and second queries (for "Raamattu" and "Ajan lyhyt historia") give the correct result, but the last one gives zero results even though it should match one triple in the data.

Likewise, if I start up Fuseki using hdtEndpoint.sh and perform this SPARQL query:

SELECT * { ?s ?p "Seitsemän veljestä" }

I get no results, but similar queries for the two other literal values do give the correct result. I tried this query both directly via the Fuseki UI and via YASGUI.org, just in case there would be some problem with character encodings. The query appears as it should in the Fuseki log/console, there are no obvious encoding problems.

I'm not sure whether the problem is in the HDT generation, index file generation, or querying.

hdtsparql.sh output format

Looks like hdtsparql.sh only returns CSV values. Is it possible to return results in another format, for example json or json-ld?

Exception when processing DBpedia

I have recently added an HDTProcessor in Luzzu; however, I am getting a GC overhead limit exceeded error when parsing the HDT version of DBpedia (the index was used as well):

java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.lang.Long.valueOf(Long.java:840)
	at pl.edu.icm.jlargearrays.LongLargeArray.get(LongLargeArray.java:148)
	at org.rdfhdt.hdt.compact.sequence.SequenceLog64Big.getField(SequenceLog64Big.java:129)
	at org.rdfhdt.hdt.compact.sequence.SequenceLog64Big.get(SequenceLog64Big.java:239)
	at org.rdfhdt.hdt.dictionary.impl.section.PFCDictionarySectionBig.extract(PFCDictionarySectionBig.java:344)
	at org.rdfhdt.hdt.dictionary.impl.BaseDictionary.idToString(BaseDictionary.java:219)
	at org.rdfhdt.hdtjena.NodeDictionary.getNode(NodeDictionary.java:114)
	at io.github.luzzu.io.impl.HDTProcessor.startProcessing(HDTProcessor.java:95)
	at io.github.luzzu.io.AbstractIOProcessor.processorWorkFlow(AbstractIOProcessor.java:134)
	at io.github.luzzu.communications.resources.v4.AssessmentResource$1.call(AssessmentResource.java:210)
	at io.github.luzzu.communications.resources.v4.AssessmentResource$1.call(AssessmentResource.java:206)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I implemented the processor [1] in a streaming fashion, and the error seems to be in this line: Node o = this.nodeDictionary.getNode(this.hdtDictionary.stringToId(triple.getObject(), TripleComponentRole.OBJECT), TripleComponentRole.OBJECT);

Do you know what might have triggered that exception?

Thanks!
Jeremy

[1] https://gist.github.com/jerdeb/60ce2a8c07413c0a3b6c816124590e57

hdt2rdf usage message is confusing

When I run hdt2rdf.sh without parameters I get this:

Usage: hdt2rdf [options] <input RDF> <output HDT>
  Options:
    -version
       Prints the HDT version number
       Default: false

The first line is clearly wrong; it should be <input HDT> <output RDF>.

However, the tool seems to be limited to N-Triples output; the usage message could state this as well, as it wasn't obvious without looking at the source code.

Problem querying large hdt dataset in fuseki

This might or might not be the right project for this issue...

I'm trying to query a large dataset (5GB .hdt file, 266 M-Triples) and have a problem searching for untyped literals. SPARQL queries with typed literals or URIs in the object position run fine. Also, when I create a small dataset (13 triples), SPARQL queries for typed literals run fine, so I assume that it's an issue with the hdt file size. The hdt files were created using hdt-cpp.

I have integrated HDT support into Fuseki as described and the service as a whole works fine.

The problem looks like this: I first run a DESCRIBE in order to get some triples:

PREFIX gndo:  <http://d-nb.info/standards/elementset/gnd#>
PREFIX rdau:  <http://rdaregistry.info/Elements/u/> 
PREFIX dct:   <http://purl.org/dc/terms/> 
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX dcterm: <http://purl.org/dc/terms/> 
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#> 
PREFIX skos:  <http://www.w3.org/2004/02/skos/core#> 
PREFIX dcterms: <http://purl.org/dc/terms/> 
PREFIX gnd:   <http://d-nb.info/gnd/> 
PREFIX dc:    <http://purl.org/dc/elements/1.1/> 
PREFIX dnbt: <http://d-nb.info/standards/elementset/dnb#>

DESCRIBE <http://d-nb.info/1000000354>
FROM <http://d-nb.info/dnb-all>

That query returns

@prefix gndo:  <http://d-nb.info/standards/elementset/gnd#> .
@prefix dnbt:  <http://d-nb.info/standards/elementset/dnb#> .
@prefix rdau:  <http://rdaregistry.info/Elements/u/> .
@prefix dct:   <http://purl.org/dc/terms/> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcterm: <http://purl.org/dc/terms/> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos:  <http://www.w3.org/2004/02/skos/core#> .
@prefix gnd:   <http://d-nb.info/gnd/> .
@prefix dc:    <http://purl.org/dc/elements/1.1/> .

<http://d-nb.info/1000000354>
        a                <http://purl.org/ontology/bibo/Collection> ;
        dc:identifier    "(OCoLC)723788590" , "(DE-101)1000000354" ;
        dc:publisher     "A. F. W. Sommer" ;
        dc:subject       "830"^^dnbt:ddc-subject-category , "B"^^dnbt:ddc-subject-category ;
        dc:title         "Neuere Gedichte" ;
        dcterms:creator  gnd:118569317 ;
        dcterms:medium   <http://rdaregistry.info/termList/RDACarrierType/1044> ;
        rdau:P60163      "Wien" ;
        rdau:P60327      "August Friedrich Ernst Langbein" ;
        rdau:P60333      "Wien : A. F. W. Sommer" ;
        rdau:P60493      "1814" ;
        rdau:P60539      "30 cm" ;
        <http://www.w3.org/2002/07/owl#sameAs>
                <http://hub.culturegraph.org/resource/DNB-1000000354> .

When I then try to query for a literal like this:

SELECT ?entity
FROM <http://d-nb.info/dnb-all>
WHERE {
  ?entity ?p "A. F. W. Sommer"
}

I get zero results.
Adding the datatype `xsd:string` to the literal does not help either:

SELECT ?entity
FROM <http://d-nb.info/dnb-all>
WHERE {
  ?entity ?p "A. F. W. Sommer"^^<http://www.w3.org/2001/XMLSchema#string>
}

If I inspect the hdt file using hdt-it!, a search for "A. F. W. Sommer"^^<http://www.w3.org/2001/XMLSchema#string> returns 231 hits, so the data is obviously present.

As a verification, I created a dataset consisting only of this one entity and configured Fuseki to run a separate service with that dataset (15 triples) in a single named graph. With that configuration, SPARQL queries for untyped literals work, so I guess that it's a problem with the hdt file size.

Any insights are much appreciated.

Thanks,

Lars

Problems generating or using empty HDT files

I noticed that the hdt-java tools cannot handle empty HDT files, i.e. files with zero triples.

Trying to generate a HDT file based on an empty N-Triple file fails:

$ touch empty.nt # create an empty N-Triples file
$ rdf2hdt.sh empty.nt empty.hdt
Converting empty.nt to empty.hdt as null
Exception in thread "main" java.lang.IllegalArgumentException: Adjacency list bitmap and array should have the same size
	at org.rdfhdt.hdt.compact.bitmap.AdjacencyList.<init>(AdjacencyList.java:50)
	at org.rdfhdt.hdt.triples.impl.BitmapTriples.load(BitmapTriples.java:207)
	at org.rdfhdt.hdt.triples.impl.BitmapTriples.load(BitmapTriples.java:224)
	at org.rdfhdt.hdt.hdt.impl.HDTImpl.loadFromModifiableHDT(HDTImpl.java:377)
	at org.rdfhdt.hdt.hdt.HDTManagerImpl.doGenerateHDT(HDTManagerImpl.java:107)
	at org.rdfhdt.hdt.hdt.HDTManager.generateHDT(HDTManager.java:129)
	at org.rdfhdt.hdt.tools.RDF2HDT.execute(RDF2HDT.java:106)
	at org.rdfhdt.hdt.tools.RDF2HDT.main(RDF2HDT.java:167)

Another way of triggering the same exception is to generate the zero-triple HDT file using hdt-cpp (which works) and then attempt to use it using hdtsparql.sh:

$ touch empty.nt # create an empty N-Triples file
$ rdf2hdt empty.nt empty.hdt # make a HDT file out of it using rdf2hdt from the hdt-cpp suite
$ hdtsparql.sh empty.hdt "select * {?s ?p ?o}"
Exception in thread "main" java.lang.IllegalArgumentException: Adjacency list bitmap and array should have the same size
	at org.rdfhdt.hdt.compact.bitmap.AdjacencyList.<init>(AdjacencyList.java:50)
	at org.rdfhdt.hdt.triples.impl.BitmapTriples.mapFromFile(BitmapTriples.java:372)
	at org.rdfhdt.hdt.hdt.impl.HDTImpl.mapFromHDT(HDTImpl.java:260)
	at org.rdfhdt.hdt.hdt.HDTManagerImpl.doMapIndexedHDT(HDTManagerImpl.java:62)
	at org.rdfhdt.hdt.hdt.HDTManager.mapIndexedHDT(HDTManager.java:93)
	at org.rdfhdt.hdtjena.cmd.HDTSparql.main(HDTSparql.java:38)

While one can argue about the usefulness of empty (i.e. zero triples) HDT files, I don't think this special case should trigger an exception. I noticed this while writing unit tests for my application; the tests exercise some special situations, and one of them happens to generate an empty NT file which will then be converted to HDT and queried using hdtsparql.sh.

IndexOutOfBoundsException

Hi,
I tried next query from command line tool on the dataset DBLP 2017 from http://www.rdfhdt.org/datasets/:
SELECT DISTINCT ?property_type WHERE {?p a ?property_type . ?s ?p ?o .} LIMIT 10

and this is what I got:

Exception in thread "main" java.lang.IndexOutOfBoundsException
	at org.rdfhdt.hdt.compact.sequence.SequenceLog64Map.get(SequenceLog64Map.java:190)
	at org.rdfhdt.hdt.triples.impl.PredicateIndexArray.getOccurrence(PredicateIndexArray.java:44)
	at org.rdfhdt.hdt.triples.impl.BitmapTriplesIteratorYFOQ.goToStart(BitmapTriplesIteratorYFOQ.java:158)
	at org.rdfhdt.hdt.triples.impl.BitmapTriplesIteratorYFOQ.<init>(BitmapTriplesIteratorYFOQ.java:77)
	at org.rdfhdt.hdt.triples.impl.BitmapTriples.search(BitmapTriples.java:239)
	at org.rdfhdt.hdtjena.solver.StageMatchTripleID.makeNextStage(StageMatchTripleID.java:140)
	at org.rdfhdt.hdtjena.solver.StageMatchTripleID.makeNextStage(StageMatchTripleID.java:53)
	at org.apache.jena.atlas.iterator.RepeatApplyIterator.hasNext(RepeatApplyIterator.java:48)
	at org.rdfhdt.hdtjena.util.IterAbortable.hasNext(IterAbortable.java:62)
	at org.apache.jena.atlas.iterator.Iter$4.hasNext(Iter.java:303)
	at org.apache.jena.sparql.engine.iterator.QueryIterPlainWrapper.hasNextBinding(QueryIterPlainWrapper.java:53)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.iterator.QueryIterConvert.hasNextBinding(QueryIterConvert.java:58)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.iterator.QueryIterDistinct.getInputNextUnseen(QueryIterDistinct.java:104)
	at org.apache.jena.sparql.engine.iterator.QueryIterDistinct.hasNextBinding(QueryIterDistinct.java:70)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.iterator.QueryIterSlice.hasNextBinding(QueryIterSlice.java:76)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.ResultSetStream.hasNext(ResultSetStream.java:74)
	at org.apache.jena.sparql.engine.ResultSetCheckCondition.hasNext(ResultSetCheckCondition.java:59)
	at org.apache.jena.sparql.resultset.CSVOutput.format(CSVOutput.java:81)
	at org.apache.jena.query.ResultSetFormatter.outputAsCSV(ResultSetFormatter.java:624)
	at org.rdfhdt.hdtjena.cmd.HDTSparql.execute(HDTSparql.java:66)
	at org.rdfhdt.hdtjena.cmd.HDTSparql.main(HDTSparql.java:131)

I have tried other HDT files generated by me, and it is still happening.

Regards

UTF-8/unicode issues

Hi all,
I managed to convert DBpedia language versions to HDT with the CPP develop branch, see e.g. here:
http://downloads.dbpedia.org/2016-10/tmp/data/ja/

using this commit: rdfhdt/hdt-cpp@b0bb661

The N-Triples are in Unicode, which is fine according to the 1.1 spec. However, the Unicode does not seem to be supported; the output below is from the Japanese version.
So I wrote this with CPP and then read it with Java. Not sure where the incompatibility is.

http://wikidata.dbpedia.org/resource/Q11178088 http://xmlns.com/foaf/0.1/name "������������"@ja
http://wikidata.dbpedia.org/resource/Q11178088 http://xmlns.com/foaf/0.1/name "������������������"@ja
http://wikidata.dbpedia.org/resource/Q11178276 http://dbpedia.org/ontology/address "������������������"@ja

Please update to a recent 2.x Jena version

For example, 2.12.1... I gave it a go, but ran into Jena API changes around the ReorderTransformationBase class :/ However, I am not familiar with these APIs, so I cannot update this myself...

IllegalArgumentException when using BIND or VALUES with nonexisting resource

My example data is this triple:

<http://example.org/> <http://schema.org/name> "Example" .

I converted it to HDT and created an index. Then I executed this SPARQL query using hdtsparql.sh:

$ hdtsparql.sh example.hdt "SELECT * { BIND(<http://example.org/> AS ?s)  ?s ?p ?o }"
s,p,o
http://example.org/,http://schema.org/name,Example

So far so good. But when I change the bound URI to something nonexistent, I get an error:

$ hdtsparql.sh example.hdt "SELECT * { BIND(<http://example.org/2> AS ?s)  ?s ?p ?o }"
Exception in thread "main" java.lang.IllegalArgumentException: (?s,null)
	at org.rdfhdt.hdtjena.bindings.BindingHDTId.put(BindingHDTId.java:79)
	at org.rdfhdt.hdtjena.solver.HDTSolverLib$3.apply(HDTSolverLib.java:202)
	at org.rdfhdt.hdtjena.solver.HDTSolverLib$3.apply(HDTSolverLib.java:178)
	at org.apache.jena.atlas.iterator.Iter$4.next(Iter.java:308)
	at org.apache.jena.atlas.iterator.RepeatApplyIterator.hasNext(RepeatApplyIterator.java:47)
	at org.rdfhdt.hdtjena.util.IterAbortable.hasNext(IterAbortable.java:62)
	at org.apache.jena.atlas.iterator.Iter$4.hasNext(Iter.java:303)
	at org.apache.jena.sparql.engine.iterator.QueryIterPlainWrapper.hasNextBinding(QueryIterPlainWrapper.java:53)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.ResultSetStream.hasNext(ResultSetStream.java:74)
	at org.apache.jena.sparql.engine.ResultSetCheckCondition.hasNext(ResultSetCheckCondition.java:59)
	at org.apache.jena.sparql.resultset.CSVOutput.format(CSVOutput.java:81)
	at org.apache.jena.query.ResultSetFormatter.outputAsCSV(ResultSetFormatter.java:624)
	at org.rdfhdt.hdtjena.cmd.HDTSparql.main(HDTSparql.java:53)

I also tested via Fuseki and got the same errors.

Note that this variant, where BIND is not used, works fine:

SELECT * { <http://example.org/2> ?p ?o }

However, if I use VALUES, I can get the same error as with BIND:

SELECT * { VALUES ?s { <http://example.org/2> }  ?s ?p ?o }

Add support for more RDF formats in Fuseki

The output format of this Fuseki server is limited; the following do not work:
output=json-ld
output=json-rdf
output=nt
output=ttl

If we serve it as a REST interface service with content negotiation, we need another dependent library to change format...which is not convenient.

Any comment?

Querying HDT for big datasets

I was trying to query the LOD-a-lot dataset. The dataset is so big that ints do not suffice as indexes for the data structures used in HDT, so presumably the FourSectionDictionaryBig is used. The problem is that while the HDT file can be loaded, it cannot be queried, since the
idToString(int id, TripleComponentRole role)
method uses an int for the id. A long should be possible, since the file cannot be queried otherwise.
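The overflow argument can be made concrete: Integer.MAX_VALUE is about 2.1 billion, so any dictionary with more distinct terms than that cannot be addressed by an int id, and narrowing a long id to int silently corrupts it. A small self-contained illustration (names here are illustrative, not the library's):

```java
// Illustrates why the id parameter must be long for very large dictionaries:
// ids above Integer.MAX_VALUE cannot survive a narrowing cast to int.
public class IdWidth {
    public static final long MAX_INT_ID = Integer.MAX_VALUE; // 2_147_483_647

    // A long-based id comfortably addresses dictionaries beyond 2^31 entries.
    public static boolean fitsInInt(long id) {
        return id >= Integer.MIN_VALUE && id <= Integer.MAX_VALUE;
    }
}
```

Any dictionary entry whose id fails this check would be mangled by an int-typed idToString, which matches the symptom reported here: the file loads but lookups fail.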

Where are the CLI tools?

I've run mvn install. Then... what? Where are the compiled binaries? The bin/ directories only have a bunch of .bat and .sh scripts, but the README doesn't explain how to use them.

I'd really appreciate any help (or more info in the README) about how to use the CLI tools after compilation.

Fuseki default union graph

I think I've been able to set up a Fuseki endpoint with HDT files. I'm wondering, though, if it's possible to set up a "union graph"? According to the HDT documentation, I can choose a default graph like this: ja:defaultGraph <#graph-name> ;, but there is no mention of any "union" graph as the union of all HDT files. Any idea? Or do I have to create an additional HDT file containing the same content as all other HDT files, and use this as ja:defaultGraph?

PREFIX http: kills results

Hi,

I have a setup with Fuseki on top of HDT, with HTTP URIs (surprise, surprise) in my data set, and when I query using:

SELECT * WHERE {?s ?p ?o} LIMIT 5

I get results. However, when I add the prefix definition for HTTP like

PREFIX http: <http://www.w3.org/2011/http#>
SELECT * WHERE {?s ?p ?o} LIMIT 5

I don't get results any more. When I change the prefix to something else like

PREFIX htt: <http://www.w3.org/2011/http#>
SELECT * WHERE {?s ?p ?o} LIMIT 5

I have results again.

When I fire a query that does not have HTTP URIs in the results (eg. only blank nodes), then PREFIX http: <...> does not disturb: I get results with, without, and with the amended prefix.

Cheers!

Stream CONSTRUCT results in hdtsparql

I implemented support for CONSTRUCT queries in the hdtsparql command line tool (#27). However, the CONSTRUCT results are first loaded to an in-memory Model. This causes problems when the results are large and don't fit in memory.

I propose adding a -stream command line argument to hdtsparql that would output result triples immediately instead of loading them into a Model. The downside is that some triples may be duplicated in the output. I'm planning to do a PR implementing this feature.
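The trade-off between the two modes can be sketched in plain Java (illustrative only — not the hdtsparql implementation): buffering into a set mimics a Model, which deduplicates but holds everything in memory, while streaming forwards each triple immediately and may emit duplicates.

```java
import java.util.*;
import java.util.function.Consumer;

// Sketch of buffered vs streamed CONSTRUCT output. Names are illustrative;
// triples are simplified to strings.
public class StreamConstruct {
    // Buffering: materializes all results first, deduplicating like a Model
    // (O(n) memory).
    public static Set<String> buffered(Iterator<String> results) {
        Set<String> model = new LinkedHashSet<>();
        results.forEachRemaining(model::add);
        return model;
    }

    // Streaming: forwards each triple immediately (O(1) memory, but
    // duplicates are no longer filtered out). Returns the emitted count.
    public static int streamed(Iterator<String> results, Consumer<String> sink) {
        int n = 0;
        while (results.hasNext()) { sink.accept(results.next()); n++; }
        return n;
    }
}
```

With a duplicate in the input, buffered yields two distinct triples while streamed emits all three, which is exactly the documented downside of the -stream mode.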

escape string containing backslash

Hi,

When I convert an HDT file using the hdt-cpp tool hdt2rdf, the generated NT file cannot be read by hdt-mr.

I got the following error:

Error: java.lang.IllegalArgumentException: Unescaped backslash in: "buttpark 63\09-92"@en
        at org.rdfhdt.hdt.util.UnicodeEscape.unescapeString(UnicodeEscape.java:225)
        at org.rdfhdt.hdt.triples.TripleString.read(TripleString.java:217)
        at org.rdfhdt.mrbuilder.dictionary.DictionarySamplerMapper.map(DictionarySamplerMapper.java:40)
        at org.rdfhdt.mrbuilder.dictionary.DictionarySamplerMapper.map(DictionarySamplerMapper.java:33)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

Shall we make the rule in the class org.rdfhdt.hdt.util.UnicodeEscape less strict?

Best,
Gang

LOD-a-lot hdt file does not work

I just downloaded the hdt file from http://lod-a-lot.lod.labs.vu.nl/data/LOD_a_lot_v1.hdt
When reading the file, the following error appears:

1936 [main] ERROR org.rdfhdt.hdtjena.bindings.BindingHDTNode - get1(?o)
java.lang.NegativeArraySizeException
at org.rdfhdt.hdtjena.cache.DictionaryCacheArray.put(DictionaryCacheArray.java:63)
at org.rdfhdt.hdtjena.NodeDictionary.getNode(NodeDictionary.java:127)
at org.rdfhdt.hdtjena.NodeDictionary.getNode(NodeDictionary.java:110)
at org.rdfhdt.hdtjena.bindings.BindingHDTNode.get1(BindingHDTNode.java:115)
at org.apache.jena.sparql.engine.binding.BindingBase.get(BindingBase.java:121)
at org.rdfhdt.hdtjena.bindings.BindingHDTNode.format(BindingHDTNode.java:133)
at org.apache.jena.sparql.engine.binding.BindingBase.format1(BindingBase.java:163)
at org.apache.jena.sparql.engine.binding.BindingBase.toString(BindingBase.java:138)
at org.apache.jena.sparql.core.ResultBinding.toString(ResultBinding.java:91)
at java.lang.String.valueOf(String.java:2994)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at TestHDT.main(TestHDT.java:46)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)
java.lang.NullPointerException
at org.rdfhdt.hdtjena.bindings.BindingHDTNode.format(BindingHDTNode.java:135)
at org.apache.jena.sparql.engine.binding.BindingBase.format1(BindingBase.java:163)
at org.apache.jena.sparql.engine.binding.BindingBase.toString(BindingBase.java:138)
at org.apache.jena.sparql.core.ResultBinding.toString(ResultBinding.java:91)
at java.lang.String.valueOf(String.java:2994)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at TestHDT.main(TestHDT.java:46)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)

Also, Jena's Model.size() returns a negative number.

My code:

import java.io.File;
import java.io.IOException;

import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.impl.ModelCom;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdtjena.HDTGraph;

public class TestHDT {
	public static void main(String[] args) throws IOException {
		File file = new File("LOD_a_lot_v1.hdt");
		HDT hdt = null;
		try {
			hdt = HDTManager.mapHDT(file.getAbsolutePath(), null);
			HDTGraph graph = new HDTGraph(hdt);
			Model model = new ModelCom(graph);
			String sparql = "select * where {?s ?p ?o} limit 10";

			Query query = QueryFactory.create(sparql);

			QueryExecution qe = QueryExecutionFactory.create(query, model);
			ResultSet results = qe.execSelect();

			int count = 0;
			System.out.println("Model.size(): " + results.getResourceModel().size());
			while (results.hasNext()) {
				QuerySolution thisRow = results.next();
				System.out.println("Row " + (++count) + ": " + thisRow);
			}
			qe.close();
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			if (hdt != null) {
				hdt.close();
			}
		}
	}
}

Add support for named graphs

I just modified fuseki_example.ttl slightly to point it at two HDT files I have generated, but I got the following error message when I tried to start Fuseki with the config file:

[fug2@virtuosodev11 hdt-fuseki]$ bin/hdtEndpoint.sh --config=fuseki_example.ttl
com.hp.hpl.jena.assembler.exceptions.AssemblerException: caught: Adjacency list bitmap and array should have the same size
doing:
root: file:///home/fug2/hdt-java/hdt-fuseki/fuseki_example.ttl#graph1 with type: http://www.rdfhdt.org/fuseki#HDTGraph assembler class: class org.rdfhdt.hdtjena.HDTGraphAssembler
root: file:///home/fug2/hdt-java/hdt-fuseki/fuseki_example.ttl#dataset with type: http://jena.hpl.hp.com/2005/11/Assembler#RDFDataset assembler class: class com.hp.hpl.jena.sparql.core.assembler.DatasetAssembler


The changed fuseki_example.ttl file is as follows:

<#graph1> rdfs:label "RDF Graph1 from HDT file" ;
        rdf:type hdt:HDTGraph ;
        hdt:fileName "/export/home/SSD/BIGDATA/hdt/pc_compound_0.hdt" ;
    .
<#graph2> rdfs:label "RDF Graph2 from HDT file" ;
        rdf:type hdt:HDTGraph ;
        hdt:fileName "/export/home/SSD/BIGDATA/hdt/pc_compound_1.hdt" ;
    .
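For context, a sketch of how the two graphs could be wired into one dataset as named graphs (the graph URIs below are placeholders I made up; getting an assembly of this shape to work with HDT graphs is what this issue asks for):

```ttl
<#dataset> rdf:type ja:RDFDataset ;
    ja:namedGraph [ ja:graphName <http://example.org/graph1> ;
                    ja:graph     <#graph1> ] ;
    ja:namedGraph [ ja:graphName <http://example.org/graph2> ;
                    ja:graph     <#graph2> ] ;
    .
```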

Return instead of write

I have used the HDT package for a while via my own fork, but it would be great if the following could be implemented, if not already present:

cc85269

When a query is executed I would like to have an option to return the results like:

ResultSet results = qe.execSelect();
return results;

Such that it can be directly incorporated into other programs. Or is this already possible?


I think this should do it?

    String query = createQueryFromFile("queries/" + queryFile, args).getQuery().toString();
    long millis = System.currentTimeMillis();

    HDTGraph graph = new HDTGraph(hdtFile);
    Model model = ModelFactory.createModelForGraph(graph);
    QueryExecution qe = QueryExecutionFactory.create(query, model);

    ResultSet result = qe.execSelect();
    ResultIteratorRaw walker = new ResultIteratorRaw(result); // Iteration<HashMap<String, RDFNode>>
    LinkedList<HashMap<String, RDFNode>> res = new LinkedList<HashMap<String, RDFNode>>();

java.lang.IllegalArgumentException: (?u,null)

Hello,
After generating an HDT file with https://github.com/rdfhdt/hdt-docker and my own Turtle file, I used the HDT file with hdt-fuseki (command: ./bin/hdtEndpoint.sh --hdt ./ALL.hdt /dataset). The SPARQL endpoint works well for most queries I have tried, but it throws an exception for the following SPARQL request:

curl -H "Accept: application/json" http://<myserver>:3030/dataset/sparql --data-urlencode query='PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> SELECT ?u ?l ?p WHERE { VALUES ?u { <fr.mgdis.odata.data.plagesBREfrType> } OPTIONAL { ?u rdfs:label ?l . BIND (rdfs:label AS ?p) } }'

server logs:

15:38:36 INFO [25] Query = PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> SELECT ?u ?l ?p WHERE { VALUES ?u { <fr.mgdis.odata.data.plagesBREfrType> } OPTIONAL { ?u rdfs:label ?l . BIND (rdfs:label AS ?p) } }
15:38:36 WARN [25] RC = 500 : (?u,null)
java.lang.IllegalArgumentException: (?u,null)
	at org.rdfhdt.hdtjena.bindings.BindingHDTId.put(BindingHDTId.java:79)
	at org.rdfhdt.hdtjena.solver.HDTSolverLib$3.apply(HDTSolverLib.java:202)
	at org.rdfhdt.hdtjena.solver.HDTSolverLib$3.apply(HDTSolverLib.java:178)
	at org.apache.jena.atlas.iterator.Iter$4.next(Iter.java:308)
	at org.apache.jena.atlas.iterator.RepeatApplyIterator.hasNext(RepeatApplyIterator.java:47)
	at org.rdfhdt.hdtjena.util.IterAbortable.hasNext(IterAbortable.java:62)
	at org.apache.jena.atlas.iterator.Iter$4.hasNext(Iter.java:303)
	at org.apache.jena.sparql.engine.iterator.QueryIterPlainWrapper.hasNextBinding(QueryIterPlainWrapper.java:53)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.iterator.QueryIterProcessBinding.hasNextBinding(QueryIterProcessBinding.java:66)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.iterator.QueryIterDefaulting.hasNextBinding(QueryIterDefaulting.java:54)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.iterator.QueryIterRepeatApply.hasNextBinding(QueryIterRepeatApply.java:74)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.iterator.QueryIterConvert.hasNextBinding(QueryIterConvert.java:58)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:39)
	at org.apache.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:111)
	at org.apache.jena.sparql.engine.ResultSetStream.hasNext(ResultSetStream.java:74)
	at org.apache.jena.sparql.engine.ResultSetCheckCondition.hasNext(ResultSetCheckCondition.java:59)
	at org.apache.jena.fuseki.servlets.SPARQL_Query.executeQuery(SPARQL_Query.java:297)
	at org.apache.jena.fuseki.servlets.SPARQL_Query.execute(SPARQL_Query.java:252)
	at org.apache.jena.fuseki.servlets.SPARQL_Query.executeWithParameter(SPARQL_Query.java:205)
	at org.apache.jena.fuseki.servlets.SPARQL_Query.perform(SPARQL_Query.java:100)
	at org.apache.jena.fuseki.servlets.SPARQL_ServletBase.executeLifecycle(SPARQL_ServletBase.java:227)
	at org.apache.jena.fuseki.servlets.SPARQL_ServletBase.executeAction(SPARQL_ServletBase.java:204)
	at org.apache.jena.fuseki.servlets.SPARQL_ServletBase.execCommonWorker(SPARQL_ServletBase.java:186)
	at org.apache.jena.fuseki.servlets.SPARQL_ServletBase.doCommon(SPARQL_ServletBase.java:79)
	at org.apache.jena.fuseki.servlets.SPARQL_Query.doPost(SPARQL_Query.java:60)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:755)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1496)
	at org.eclipse.jetty.servlets.UserAgentFilter.doFilter(UserAgentFilter.java:82)
	at org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:256)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:229)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:370)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.nio.BlockingChannelConnector$BlockingChannelEndPoint.run(BlockingChannelConnector.java:298)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:748)
15:38:36 INFO [25] 500 (?u,null) (5 ms)

Documented Maven command does not work

The following command from the readme file does not work on my system:

$ mvn assembly:single
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.4:single (default-cli) on project hdt-java-parent: Error reading assemblies: No assembly descriptors found. -> [Help 1]

I'm on the following Java and Maven versions:

$ java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
$ mvn -version
Apache Maven 3.5.0 (Red Hat 3.5.0-6)
Maven home: /usr/share/maven
Java version: 1.8.0_151, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc27.x86_64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.14.6-300.fc27.x86_64", arch: "amd64", family: "unix"

The other documented Maven command (mvn install) does work.

Performance: more RAM or a better disk?

Sorry, I understand this is not an issue, but I don't see any HDT mailing list or forum where I can ask this. Basically I'm using HDT with Fuseki, and I'd like to understand how HDT files are mapped into memory. When I start Fuseki, what part of an HDT file is loaded into RAM? Only the index? Once I start submitting queries, are HDT triples loaded into RAM or read from the hard disk? What happens if the HDT file can't be loaded completely into RAM? Will it swap, or will it keep moving new data from disk to RAM? Finally, to improve HDT query performance, would I be better off buying more RAM or a faster disk (SSD), or something else?
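As a rough intuition (my understanding, not an authoritative answer): HDTManager.mapHDT relies on OS memory mapping, so pages of the file are faulted into RAM on demand as queries touch them and can be evicted under memory pressure, whereas HDTManager.loadHDT reads the whole file into the heap up front. The underlying mechanism is the same one java.nio exposes:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        // Write a small file, then map it instead of reading it into the heap.
        Path p = Files.createTempFile("mmap-demo", ".bin");
        Files.write(p, new byte[] {10, 20, 30, 40});

        try (RandomAccessFile raf = new RandomAccessFile(p.toFile(), "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // Only the page backing this access is faulted into RAM;
            // untouched regions of a large file stay on disk.
            System.out.println(buf.get(2)); // prints 30
        } finally {
            Files.delete(p);
        }
    }
}
```

This is why, for mapped files larger than RAM, access patterns keep paging data in from disk rather than swapping the JVM heap, and why a faster disk mainly helps the cold/over-RAM case while more RAM helps keep the hot pages cached.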

Thank you.

NotFoundException

Hi all,

try {
    IteratorTripleString it = hdt.search(identifier, property, "");
    while (it.hasNext()) {
        TripleString ts = it.next();
        ValAgg.put(identifier, ts.getObject().toString(), lang);
    }
} catch (NotFoundException nfe) {
    // Intentionally left blank:
    // hdt.search throws NotFoundException instead of returning null
    // or an iterator with hasNext() == false.
}

This is how my code looks. I find it weird that I have to leave the catch block empty, but the code works.
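One way to avoid the empty catch block is a small wrapper that turns the "not found" case into an empty iterator. The helper below is a generic, hypothetical sketch (not part of hdt-java); with hdt.search you would catch NotFoundException specifically rather than Exception:

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.concurrent.Callable;

public class SafeSearch {
    // Run a search and return an empty iterator when it signals "no match"
    // by throwing, so call sites can loop without a try/catch.
    static <T> Iterator<T> orEmpty(Callable<Iterator<T>> search) {
        try {
            return search.call();
        } catch (Exception notFound) { // with hdt-java: catch (NotFoundException nfe)
            return Collections.emptyIterator();
        }
    }

    public static void main(String[] args) {
        Iterator<String> it = orEmpty(() -> { throw new Exception("not found"); });
        System.out.println(it.hasNext()); // prints false
    }
}
```

The calling code then becomes a plain while loop over `orEmpty(() -> hdt.search(identifier, property, ""))`.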

Maven build fails

[INFO] Scanning for projects...                                                                                                                                                                                                               
[INFO] ------------------------------------------------------------------------                                                                                                                                                               
[INFO] Reactor Build Order:                                                                                                                                                                                                                   
[INFO]                                                                                                                                                                                                                                        
[INFO] RDF/HDT                                                                                                                                                                                                                                
[INFO] HDT API                                                                                                                                                                                                                                
[INFO] HDT Java Core                                                                                                                                                                                                                          
[INFO] HDT Java Command line Tools                                                                                                                                                                                                            
[INFO] HDT Jena                                                                                                                                                                                                                               
[INFO] HDT Java Package                                                                                                                                                                                                                       
[INFO] HDT Fuseki                                                                                                                                                                                                                             
[INFO]                                                                                                                                                                                                                                        
[INFO] ------------------------------------------------------------------------                                                                                                                                                               
[INFO] Building RDF/HDT 2.1-SNAPSHOT                                                                                                                                                                                                          
[INFO] ------------------------------------------------------------------------                                                                                                                                                               
[INFO]                                                                                                                                                                                                                                        
[INFO] --- maven-assembly-plugin:3.1.0:single (default-cli) @ hdt-java-parent ---                                                                                                                                                             
[INFO] ------------------------------------------------------------------------                                                                                                                                                               
[INFO] Reactor Summary:                                                                                                                                                                                                                       
[INFO]                                                                                                                                                                                                                                        
[INFO] RDF/HDT ........................................... FAILURE [1.062s]                                                                                                                                                                   
[INFO] HDT API ........................................... SKIPPED                                                                                                                                                                            
[INFO] HDT Java Core ..................................... SKIPPED                                                                                                                                                                            
[INFO] HDT Java Command line Tools ....................... SKIPPED                                                                                                                                                                            
[INFO] HDT Jena .......................................... SKIPPED                                                                                                                                                                            
[INFO] HDT Java Package .................................. SKIPPED                                                                                                                                                                            
[INFO] HDT Fuseki ........................................ SKIPPED                                                                                                                                                                            
[INFO] ------------------------------------------------------------------------                                                                                                                                                               
[INFO] BUILD FAILURE                                                                                                                                                                                                                          
[INFO] ------------------------------------------------------------------------                                                                                                                                                               
[INFO] Total time: 1.972s                                                                                                                                                                                                                     
[INFO] Finished at: Sun Nov 11 00:07:06 CET 2018                                                                                                                                                                                              
[INFO] Final Memory: 8M/245M                                                                                                                                                                                                                  
[INFO] ------------------------------------------------------------------------                                                                                                                                                               
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (default-cli) on project hdt-java-parent: Error reading assemblies: No assembly descriptors found. -> [Help 1]                                     
[ERROR]                                                                                                                                                                                                                                       
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.                                                                                                                                                           
[ERROR] Re-run Maven using the -X switch to enable full debug logging.                                                                                                                                                                        
[ERROR]                                                                                                                                                                                                                                       
[ERROR] For more information about the errors and possible solutions, please read the following articles:                                                                                                                                     
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Support for default graph as the union of all named graphs

Hi All,

I found this feature very helpful in our case, but it is only available for the TDB backend. Could we make it available for HDT files?

This service offers SPARQL query access only to a TDB database. The TDB database can have specific features set, such as making the default graph the union of all named graphs.

<#service3>  rdf:type fuseki:Service ;
    fuseki:name              "tdb" ;       # http://host:port/tdb
    fuseki:serviceQuery      "sparql" ;    # SPARQL query service
    fuseki:dataset           <#dataset> ;
    .

<#dataset> rdf:type      tdb:DatasetTDB ;
    tdb:location "DB" ;
    # Query timeout on this dataset (1s, 1000 milliseconds)
    ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "1000" ] ;
    # Make the default graph be the union of all named graphs.
    ## tdb:unionDefaultGraph true ;
     .

Experiences converting DBpedia's bz2 files

Hi all,
I just tried to convert DBpedia to HDT and would like to share my experience. Some things did not work for me, although I am not sure whether it is my fault or just missing features.

Java

  1. I saw that bz2 works for Java. Good!
  2. I tried to read in several .bz2 files in Java as input, and this didn't work (multiple input files are not allowed)
  3. I tried to read one file via stdin in Java, which worked
  4. I tried to read in several files in Java via stdin with process substitution, mvn exec <(lbzip2 file1.bz2) <(lbzip2 file2.bz2), but it didn't work
  5. I tried to create a big Java jar with all dependencies with mvn assembly:single, but it just created the bin files (I was too lazy to adjust the descriptors in pom.xml)

cpp

  1. Although I did apt-get install serdi serd-dbg, make failed; I disabled it in the Makefile and then it compiled
  2. bz2 didn't work; the unzipping itself seemed fine, but the parser threw a lot of errors
  3. Reading via stdin with process substitution failed, as the C++ tool doesn't accept stdin
  4. Reading from several input files worked. Well, it said "Sorting triples" and I lost patience as the percentage didn't go up

Overall, I solved it now by sort -um <(lbzip file1.bz2) .... | gzip2 > core.gz

My question is simply whether all of the above is intended and expected behaviour, or whether I should try some of these methods again, as I might have done something wrong.
All the best,
Sebastian

Update Jena?

hdt-java currently uses Jena 3.0.1, which was released in December 2015. Jena 3.2.0 was just released so hdt-java is now three Jena releases behind. There has been a lot of work on Jena, including many fixes to the SPARQL query engine.

I think hdt-java should be upgraded to the newest Jena release. Jena APIs are pretty stable so I don't expect this to be difficult, though I haven't tried (yet).

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
	at java.lang.StringBuilder.<init>(StringBuilder.java:89)
	at org.rdfhdt.hdt.hdt.impl.TempHDTImporterOnePass$TripleAppender.processTriple(TempHDTImporterOnePass.java:75)
	at org.rdfhdt.hdt.rdf.parsers.RDFParserSimple.doParse(RDFParserSimple.java:80)
	at org.rdfhdt.hdt.hdt.impl.TempHDTImporterOnePass.loadFromRDF(TempHDTImporterOnePass.java:100)
	at org.rdfhdt.hdt.hdt.HDTManagerImpl.doGenerateHDT(HDTManagerImpl.java:103)
	at org.rdfhdt.hdt.hdt.HDTManager.generateHDT(HDTManager.java:129)
	at org.rdfhdt.hdt.tools.RDF2HDT.execute(RDF2HDT.java:110)
	at org.rdfhdt.hdt.tools.RDF2HDT.main(RDF2HDT.java:175)

I get this error when trying to convert a 1 GB .nt graph to HDT using rdf2hdt.sh. My computer has 8 GB of RAM, and this is not even the largest graph I need to convert.

How can I convert large graphs to HDT? For example, how do you convert Wikidata, which is ~200 GB?

FILTER does not work :(

Hello there,

I am using hdt-jena to query over the GeoNames HDT, and I am unable to use a FILTER.

Look at this :

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> 
SELECT ?s ?lat ?long 
WHERE {
   ?s geo:lat ?lat. 
   ?s geo:long ?long
} LIMIT 10

I successfully get :

s,lat,long
_:b0,35.325,-71.085641
_:b0,35.325,-80.00
_:b0,35.325,13.286104
_:b0,35.325,2.15898513793945
_:b0,35.325,2.312922
_:b0,35.325,25.13
_:b0,35.325,9.221907
_:b0,40.44,-71.085641
_:b0,40.44,-80.00
_:b0,40.44,13.286104

Now I try to filter on geo:lat:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> 
SELECT ?s ?lat ?long 
WHERE {
   ?s geo:lat ?lat. 
   ?s geo:long ?long. 
   FILTER(?lat > 40)
} LIMIT 10

Unfortunately, there are no more results:

s,lat,long

So, can you tell me whether FILTER is beyond HDT's capabilities?
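One thing worth checking (an assumption on my side, not a confirmed diagnosis): if the lat/long values are stored as plain, untyped literals, a numeric comparison such as ?lat > 40 is a type error under SPARQL semantics and yields no solutions; casting before comparing may help:

```sparql
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?lat ?long
WHERE {
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   FILTER(xsd:double(?lat) > 40)
} LIMIT 10
```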

I crawled the code, and I found this line :

//      // FIXME: Allow a filter here.     

https://github.com/rdfhdt/hdt-java/blob/master/hdt-jena/src/main/java/org/rdfhdt/hdtjena/solver/StageMatchTripleID.java#L143

Regards,

hdt-fuseki maven error

I'm trying to compile hdt-fuseki with Maven, but I'm getting the following error:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.7.0:compile (default-compile) on project hdt-fuseki: Compilation failure
[ERROR] /C:/Users/BS/DBpedia/hdt-java-master/hdt-fuseki/src/main/java/org/rdfhdt/hdt/fuseki/FusekiHDTCmd.java:[331,47] cannot access com.hp.hpl.jena.graph.impl.GraphBase
[ERROR] class file for com.hp.hpl.jena.graph.impl.GraphBase not found

I've changed the pom.xml to work with Jena 3.8.0. However, Eclipse still reports the following for FusekiHDTCmd.java:
The type com.hp.hpl.jena.graph.impl.GraphBase cannot be resolved. It is indirectly referenced from required .class files.

Can you help me please ?

Thanks

maven errors about missing hdt-java-parent

Hello,

I tried to follow the instructions at https://github.com/rdfhdt/hdt-java/tree/master/hdt-jena but I get build errors:

$ cd hdt-api/       
$ mvn install
...
[INFO] BUILD SUCCESS
...
$ cd ../hdt-java-core 
$ mvn install
...
[ERROR] Failed to execute goal on project hdt-java-core: Could not resolve dependencies for project org.rdfhdt:hdt-java-core:jar:2.0-SNAPSHOT: Failed to collect dependencies at org.rdfhdt:hdt-api:jar:2.0-SNAPSHOT: Failed to read artifact descriptor for org.rdfhdt:hdt-api:jar:2.0-SNAPSHOT: Failure to find org.rdfhdt:hdt-java-parent:pom:2.0-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of sonatype-nexus-snapshots has elapsed or updates are forced -> [Help 1]

Can anyone give me a hint how to fix it?

Hdt-jena not usable with Lucene in Fuseki?

I am loading a dataset made of a graph indexed with lucene (graphb) and another graph (grapha).
I tried to use an HDT file to replace grapha (grapha_hdt).

But as soon as I load the grapha_hdt, Lucene queries stop working on graphb.

Even if I create an HDT dataset on another service, Lucene queries stop working.

As soon as I remove the HDT-only dataset or the HDT-graph from my combined dataset, Lucene queries work again.

I am using Fuseki 3.8.0 (and updated pom.xml accordingly).

There are no errors logged and I can query the HDT graph properly, as well as the Lucene-backed graph.

I'm guessing that HDTgraph doesn't really like to be in a text dataset. But it doesn't work even when it is in a separate service.

I don't know how I could query my grapha and graphb differently (I use both in my queries, looking for things from grapha into graphb)

@prefix :        <#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb2:     <http://jena.apache.org/2016/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix hdt: <http://www.rdfhdt.org/fuseki#> .

hdt:HDTGraph rdfs:subClassOf ja:Graph .

tdb2:DatasetTDB2  rdfs:subClassOf  ja:RDFDataset .

[] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
[] ja:loadClass "org.rdfhdt.hdtjena.HDTGraphAssembler" .

text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

<#mixed> rdf:type fuseki:Service ;
    rdfs:label                      "mixed" ;
    fuseki:name                     "mixed" ;
    fuseki:serviceQuery             "query" ;
    fuseki:serviceQuery             "sparql" ;
    fuseki:serviceReadGraphStore       "get" ;
    fuseki:dataset           :indexed_textset ;    
    .


:indexed_textset rdf:type     text:TextDataset ;
    text:dataset   :mixed_dataset ;
    text:index     <#indexLucene> ;
    tdb2:unionDefaultGraph true ;
    .

:grapha a tdb2:GraphTDB2 ;
    tdb2:location "DB1" .


:grapha_hdt rdfs:label "RDF Graph1 from HDT file" ;
    rdf:type hdt:HDTGraph ;
    hdt:fileName "grapha.hdt" .

:graphb a tdb2:GraphTDB2 ;
    tdb2:location "DB2" .


:mixed_dataset a ja:RDFDataset ;
    ja:namedGraph
        [ ja:graphName <http://grapha> ;
# Switching back to :grapha makes it work
#         ja:graph :grapha ] ;
          ja:graph :grapha_hdt ] ;
    ja:namedGraph
        [ ja:graphName <http://graphb> ;
          ja:graph :graphb ] ;
    tdb2:unionDefaultGraph true ;
    .


# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:DB2_index> ;
    text:entityMap <#entMap> ;
    text:storeValues true ; 
    text:analyzer [ a text:StandardAnalyzer ] ;
    text:queryAnalyzer [ a text:KeywordAnalyzer ] ;
    text:queryParser text:AnalyzingQueryParser ;
    text:multilingualSupport true ;
 .


<#entMap> a text:EntityMap ;
    text:defaultField     "label" ;
    text:entityField      "uri" ;
    text:uidField         "uid" ;
    text:langField        "lang" ;
    text:graphField       "graph" ;
    text:map (
         [ text:field "label" ; 
           text:predicate rdfs:label ]
         ) .
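For context, a cross-graph query of the kind described might look like the sketch below (graph names are taken from the config above; the linking predicate `ex:relatedTo` is a placeholder for whatever actually connects the two graphs):

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/>

SELECT ?a ?b ?label WHERE {
  # Full-text match in the Lucene-indexed graph
  GRAPH <http://graphb> {
    ?b text:query (rdfs:label "keyword") ;
       rdfs:label ?label .
  }
  # Join against the HDT-backed graph
  GRAPH <http://grapha> {
    ?a ex:relatedTo ?b .    # ex:relatedTo is a placeholder predicate
  }
}
```

With the reported bug, the `text:query` pattern returns no results as soon as the HDT graph is part of the dataset.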

Fuseki+HDT graceful "File not found"

Hi,

I'm following this guide http://www.rdfhdt.org/manual-of-hdt-integration-with-jena/#fuseki
to run a Fuseki server on top of HDT files. It works fine, but I have to move my config file
around between several instances, and some graphs (HDT files) might not be available on some
of them. Right now, if one HDT file is not available, Fuseki blows up with a "File not found"
exception and the entire endpoint is unusable because of one missing file. Would it be possible,
please, to add a flag that simply ignores any HDT graph defined in the config file whose
corresponding HDT file is missing?
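The requested behavior could be sketched roughly as the guard below (the class and method names are hypothetical; the real change would presumably live in `org.rdfhdt.hdtjena.HDTGraphAssembler`):

```java
import java.io.File;
import java.util.Optional;

public class OptionalHdtSource {

    // Hypothetical helper: return the path only when the HDT file actually
    // exists on disk, so an assembler could skip the graph instead of
    // throwing "File not found" and taking the whole endpoint down.
    static Optional<String> available(String path) {
        File f = new File(path);
        return f.isFile() ? Optional.of(path) : Optional.empty();
    }

    public static void main(String[] args) {
        // With a flag like this, a missing file would simply yield no graph.
        available("grapha.hdt").ifPresentOrElse(
            p  -> System.out.println("loading HDT graph from " + p),
            () -> System.out.println("skipping missing HDT file")
        );
    }
}
```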

Thank you very much!

