Comments (5)
I guess it strongly depends on the dataset. It would be good to know for what characteristics it behaves that way. Anything peculiar about pc_biosystem?
from hdt-cpp.
@RubenVerborgh, the data is a tad unusual in that it is extremely homogeneous, like so:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix biosystem: <http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/> .
@prefix bp: <http://www.biopax.org/release/biopax-level3.owl#> .
biosystem:BSID100065 rdf:type bp:Pathway .
biosystem:BSID100093 rdf:type bp:Pathway .
biosystem:BSID100334 rdf:type bp:Pathway .
biosystem:BSID100512 rdf:type bp:Pathway .
biosystem:BSID100519 rdf:type bp:Pathway .
biosystem:BSID100569 rdf:type bp:Pathway .
biosystem:BSID100645 rdf:type bp:Pathway .
...
biosystem:BSID982021 cito:isDiscussedBy reference:PMID8680961 .
biosystem:BSID1109889 cito:isDiscussedBy reference:PMID8680961 .
biosystem:BSID546971 cito:isDiscussedBy reference:PMID8702268 .
biosystem:BSID703134 cito:isDiscussedBy reference:PMID9003397 .
biosystem:BSID1268789 cito:isDiscussedBy reference:PMID9023878 .
biosystem:BSID137924 cito:isDiscussedBy reference:PMID9295307 .
biosystem:BSID138023 cito:isDiscussedBy reference:PMID9446561 .
biosystem:BSID1268894 cito:isDiscussedBy reference:PMID9927040 .
biosystem:BSID1268900 cito:isDiscussedBy reference:PMID9927040 .
biosystem:BSID1268902 cito:isDiscussedBy reference:PMID9927040 .
from hdt-cpp.
That's probably it. Very easy for regular compression algorithms in any case.
from hdt-cpp.
Problem is, as soon as you have to compress HDT files, and would hence make them available as .hdt.gz or .hdt.bz2 dumps, downstream users lose the ability to perform HTTP byte-range requests on HDT files served from a remote server. Ideally, the data structures in HDT files would be maximally compressed, such that gzip or bzip2 couldn't do much further on them. But I realize that's the opening to a more fundamental discussion.
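For concreteness, a byte-range request against a plain .hdt file looks like the libcurl sketch below (the URL and range are placeholders); against a .hdt.gz or .hdt.bz2 dump the same range is just an opaque slice of a compressed stream and can't be interpreted without decompressing from the start.

// Sketch: fetch only the first 4 KiB of a remote HDT file via an HTTP
// byte-range request (e.g. enough to inspect the header/control information).
// The URL is a placeholder for illustration.
#include <curl/curl.h>
#include <cstdio>
#include <string>

static size_t collect(char *data, size_t size, size_t nmemb, void *userp) {
    static_cast<std::string *>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    std::string buffer;
    curl_easy_setopt(curl, CURLOPT_URL, "http://example.org/pc_biosystem.hdt");
    curl_easy_setopt(curl, CURLOPT_RANGE, "0-4095");        // bytes 0..4095 only
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &buffer);

    CURLcode res = curl_easy_perform(curl);
    if (res == CURLE_OK)
        std::printf("received %zu bytes\n", buffer.size());
    else
        std::fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}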
from hdt-cpp.
Hi bendiken,
The key feature of HDT is that, in addition to the compression, you can search. If you do more aggressive compression, then searching becomes slower.
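As a minimal sketch of what that search looks like with hdt-cpp (the file name and pattern are placeholders, following the repository's basic usage pattern):

// Sketch: open an HDT file and run a triple-pattern search with hdt-cpp.
// Empty strings act as wildcards, so ("", rdf:type, "") enumerates all
// typed resources without decompressing the whole file.
#include <iostream>
#include <string>
#include <HDT.hpp>
#include <HDTManager.hpp>

using namespace hdt;

int main() {
    try {
        // Memory-maps the file; only the parts touched by the query are read.
        HDT *hdt = HDTManager::mapHDT("pc_biosystem.hdt", NULL);

        IteratorTripleString *it = hdt->search(
            "", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "");
        while (it->hasNext()) {
            TripleString *t = it->next();
            std::cout << t->getSubject() << " "
                      << t->getPredicate() << " "
                      << t->getObject() << std::endl;
        }
        delete it;
        delete hdt;
    } catch (std::exception &e) {
        std::cerr << "error: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}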
If you would like to see the size/speed tradeoff more explicitly, open this paper [1] and jump to the figures on page 29. You can see several techniques (for the dictionary) with different parameters. The default in the HDT code is PFC, with the point around the bottom-left corner (block size 16-32). We chose this one as the default because it is very simple and works quite well.
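To make the block-size tradeoff concrete, here is a toy version of the front-coding idea (illustration only, not hdt-cpp's actual CSD_PFC code): within a block, only the first string is stored in full and the rest keep a shared-prefix length plus the differing suffix, so a larger block compresses better but extracting one entry has to replay more of the block.

// Toy front-coding sketch: sorted dictionary strings, every `block`-th entry
// kept verbatim, the others stored as (shared prefix length, suffix).
// Extracting entry i decodes from the start of its block, which is why a
// bigger block size trades extraction speed for space.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Entry { size_t shared; std::string suffix; };

static size_t commonPrefix(const std::string &a, const std::string &b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
    return n;
}

static std::vector<Entry> encode(const std::vector<std::string> &sorted, size_t block) {
    std::vector<Entry> out;
    for (size_t i = 0; i < sorted.size(); ++i) {
        if (i % block == 0) {
            out.push_back({0, sorted[i]});            // block header: full string
        } else {
            size_t s = commonPrefix(sorted[i - 1], sorted[i]);
            out.push_back({s, sorted[i].substr(s)});  // shared length + suffix
        }
    }
    return out;
}

// Decode entry i by replaying the block from its header.
static std::string extract(const std::vector<Entry> &enc, size_t block, size_t i) {
    size_t start = (i / block) * block;
    std::string cur = enc[start].suffix;
    for (size_t j = start + 1; j <= i; ++j)
        cur = cur.substr(0, enc[j].shared) + enc[j].suffix;
    return cur;
}

int main() {
    std::vector<std::string> dict = {
        "http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/BSID100065",
        "http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/BSID100093",
        "http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/BSID100334",
        "http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/BSID100512"};
    std::vector<Entry> enc = encode(dict, 4);          // block size 4
    std::cout << extract(enc, 4, 3) << std::endl;      // round-trips BSID100512
}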
I opened your file and it's a particularly good case because it's very regular. HDT sorts the data and applies a first lightweight compression, and the remainder is handled by the generic compressor. For example, HDT doesn't exploit the pattern where each subject usually has the same set of properties; it leaves that pattern as runs in the binary sequences, which gzip/bzip2 then pick up.
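A toy illustration of those runs (again, not the library's BitmapTriples code): after sorting, the predicate layer is a sequence of predicate IDs grouped per subject plus a bitmap marking where each subject's list ends, and with data shaped like the rdf:type sample above both arrays degenerate into constant runs.

// Toy sketch of the bitmap-triples idea. When every subject has the same
// single predicate (as in the rdf:type sample), both Sp and Bp become
// constant runs -- exactly what gzip/bzip2 then squeeze almost to nothing.
#include <iostream>
#include <vector>

struct Triple { int s, p, o; };  // term IDs after dictionary encoding

int main() {
    // Sorted by subject; mirrors "biosystem:BSIDxxxxx rdf:type bp:Pathway",
    // i.e. every subject has the single predicate 1 and object 1.
    std::vector<Triple> triples;
    for (int s = 1; s <= 8; ++s) triples.push_back({s, 1, 1});

    std::vector<int> Sp;  // predicate IDs, grouped per subject
    std::vector<int> Bp;  // 1 marks the last predicate of a subject
    for (std::size_t i = 0; i < triples.size(); ++i) {
        Sp.push_back(triples[i].p);
        bool last = (i + 1 == triples.size()) || (triples[i + 1].s != triples[i].s);
        Bp.push_back(last ? 1 : 0);
    }

    std::cout << "Sp: "; for (int p : Sp) std::cout << p;
    std::cout << "\nBp: "; for (int b : Bp) std::cout << b;
    std::cout << "\n";  // both print 11111111: constant runs
}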
Our logic for publishing hdt.gz on the Web is that in many cases it halves the size (and the bandwidth and waiting time), and once you decompress it, queries are very fast to execute. I always pipe the download anyway, something like
$ curl http://web/file.hdt.gz | gzip -cd > file.hdt
I'm also working on a subset-extract operation: given an HDT, you can generate another HDT with just some of the original's triples, with the output already in HDT format. I think this will be more useful than fetching individual bytes.
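The closest you can get today is a round trip, sketched below under the assumption that the cito: prefix is the usual http://purl.org/spar/cito/ namespace and that all matched terms are IRIs (true for the biosystem sample; literals would need proper N-Triples escaping): select the triples with search(), dump them, and rebuild with rdf2hdt. A native subset-extract operation would skip that detour.

// Sketch of today's workaround for an "HDT subset": dump the triples matching
// a pattern as N-Triples, then rebuild with rdf2hdt.
#include <fstream>
#include <string>
#include <HDT.hpp>
#include <HDTManager.hpp>

using namespace hdt;

int main() {
    HDT *hdt = HDTManager::mapHDT("pc_biosystem.hdt", NULL);
    std::ofstream out("subset.nt");

    // Keep only the cito:isDiscussedBy triples (all IRIs in this dataset).
    IteratorTripleString *it = hdt->search(
        "", "http://purl.org/spar/cito/isDiscussedBy", "");
    while (it->hasNext()) {
        TripleString *t = it->next();
        out << "<" << t->getSubject() << "> <" << t->getPredicate()
            << "> <" << t->getObject() << "> .\n";
    }
    delete it;
    delete hdt;
    // Then: rdf2hdt subset.nt subset.hdt  (rebuilds an HDT from the dump)
    return 0;
}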
[1] http://dataweb.infor.uva.es/wp-content/uploads/2015/09/paper.pdf
from hdt-cpp.