Comments (5)

RubenVerborgh avatar RubenVerborgh commented on May 9, 2024

I guess it strongly depends on the dataset. It would be good to know for which characteristics it behaves that way. Is there anything peculiar about pc_biosystem?

from hdt-cpp.

artob avatar artob commented on May 9, 2024

@RubenVerborgh, the data is a tad unusual in that it is extremely homogeneous, like so:

@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix biosystem:  <http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/> .
@prefix bp: <http://www.biopax.org/release/biopax-level3.owl#> .
biosystem:BSID100065    rdf:type    bp:Pathway .
biosystem:BSID100093    rdf:type    bp:Pathway .
biosystem:BSID100334    rdf:type    bp:Pathway .
biosystem:BSID100512    rdf:type    bp:Pathway .
biosystem:BSID100519    rdf:type    bp:Pathway .
biosystem:BSID100569    rdf:type    bp:Pathway .
biosystem:BSID100645    rdf:type    bp:Pathway .
...
biosystem:BSID982021    cito:isDiscussedBy  reference:PMID8680961 .
biosystem:BSID1109889   cito:isDiscussedBy  reference:PMID8680961 .
biosystem:BSID546971    cito:isDiscussedBy  reference:PMID8702268 .
biosystem:BSID703134    cito:isDiscussedBy  reference:PMID9003397 .
biosystem:BSID1268789   cito:isDiscussedBy  reference:PMID9023878 .
biosystem:BSID137924    cito:isDiscussedBy  reference:PMID9295307 .
biosystem:BSID138023    cito:isDiscussedBy  reference:PMID9446561 .
biosystem:BSID1268894   cito:isDiscussedBy  reference:PMID9927040 .
biosystem:BSID1268900   cito:isDiscussedBy  reference:PMID9927040 .
biosystem:BSID1268902   cito:isDiscussedBy  reference:PMID9927040 .

RubenVerborgh avatar RubenVerborgh commented on May 9, 2024

That's probably it. Data like this is very easy for general-purpose compression algorithms in any case.
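To get a feel for how well a general-purpose compressor handles this kind of regularity, here is a minimal sketch using Python's zlib. The triples are synthetic (sequential, made-up BSID numbers modeled on the excerpt above), so the ratio is only indicative and the real pc_biosystem data will behave differently.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    if not data:
        raise ValueError("data must not be empty")
    return len(zlib.compress(data, 9)) / len(data)


# Synthetic, highly homogeneous triples: only the numeric ID changes per line.
# The IDs are invented for illustration and are not taken from the dataset.
lines = [
    f"biosystem:BSID{100000 + i}\trdf:type\tbp:Pathway .\n"
    for i in range(10000)
]
regular = "".join(lines).encode("utf-8")

print(f"compressed/original = {compression_ratio(regular):.3f}")
```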

artob avatar artob commented on May 9, 2024

The problem is that as soon as you have to compress HDT files, and hence make them available as .hdt.gz or .hdt.bz2 dumps, downstream users lose the ability to perform HTTP byte-range requests on HDT files served from a remote server. Ideally, the data structures in HDT files would be compressed so thoroughly that gzip or bzip2 couldn't shrink them much further. But I realize that opens a more fundamental discussion.
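For readers unfamiliar with byte-range requests, the following sketch shows how a client would ask for a slice of an uncompressed file. The URL and offsets are hypothetical, and the request is only built, not sent. It also assumes the server honours the Range header, which is exactly what stops being useful once the file is wrapped in gzip or bzip2, because offsets into the compressed stream no longer correspond to offsets in the HDT structures.

```python
from urllib.request import Request


def range_request(url: str, start: int, length: int) -> Request:
    """Build (but do not send) a GET request for `length` bytes from `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical URL and offsets, for illustration only.
req = range_request("http://example.org/data.hdt", 1024, 512)
print(req.get_header("Range"))  # bytes=1024-1535
```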

MarioAriasGa avatar MarioAriasGa commented on May 9, 2024

Hi bendiken,

The key point of HDT is that, in addition to compression, you can search. If you compress more aggressively, search becomes slower.

If you would like to see the size/speed tradeoff more explicitly, open this paper [1] and jump to the figures on page 29. They show several techniques (for the dictionary) with different parameters. The default in the HDT code is PFC, at the point near the bottom left corner (block size 16-32). We chose this one as the default because it is both very simple and works quite well.
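As a rough illustration of the idea behind PFC (plain front coding), here is a simplified sketch and not the hdt-cpp implementation: sorted terms are cut into blocks, the first term of each block is stored in full, and each following term is stored as the length of the prefix it shares with its predecessor plus the remaining suffix. The function names and the tuple-based layout are assumptions made for readability; a real encoder would pack these values into a byte array.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pfc_encode(sorted_terms, block_size=16):
    """Front-code sorted terms into (head, [(shared_len, suffix), ...]) blocks."""
    blocks = []
    for i in range(0, len(sorted_terms), block_size):
        chunk = sorted_terms[i:i + block_size]
        prev = chunk[0]  # block head, stored uncompressed
        entries = []
        for term in chunk[1:]:
            k = common_prefix_len(prev, term)
            entries.append((k, term[k:]))
            prev = term
        blocks.append((chunk[0], entries))
    return blocks


def pfc_lookup(blocks, block_size, index):
    """Recover the term at `index` by decoding forward from its block head."""
    head, entries = blocks[index // block_size]
    term = head
    for k, suffix in entries[: index % block_size]:
        term = term[:k] + suffix
    return term


terms = sorted(
    f"biosystem:BSID{n}"
    for n in (100065, 100093, 100334, 100512, 100519, 100569, 100645)
)
blocks = pfc_encode(terms, block_size=4)
print(pfc_lookup(blocks, 4, 5))  # biosystem:BSID100569
```

The block size is where the tradeoff shows up in this sketch: larger blocks store fewer full-length heads, but a lookup has to decode more entries sequentially before it reaches the requested term.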

I opened your file and it's a particularly good case because it's very regular. HDT sorts and applies a first lightweight compression, and the rest is handled by the generic compressor. For example, HDT doesn't exploit the pattern where subjects usually have the same set of properties; instead it leaves that pattern as runs of binary sequences that gzip/bzip2 pick up.

Our logic for publishing hdt.gz on the Web is that in many cases it halves the size (and the bandwidth and waiting time), and once you decompress it, queries execute very fast. I always pipe the download anyway, something like:

$ curl http://web/file.hdt.gz | gzip -cd > file.hdt
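The same pipe can be sketched in Python with a streaming decompressor, so that the .gz file never has to be stored in full. In this sketch the chunks come from an in-memory buffer to keep it self-contained; in practice they would come from reading an HTTP response in pieces. The function name gunzip_stream is made up for this example.

```python
import gzip
import io
import zlib


def gunzip_stream(chunks, out) -> int:
    """Decompress gzip data arriving in chunks, writing to `out` as it goes.

    Returns the number of decompressed bytes written.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    written = 0
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        out.write(data)
        written += len(data)
    tail = decompressor.flush()
    out.write(tail)
    return written + len(tail)


# Stand-in for a download: compress a payload, then feed it in small pieces.
payload = b"biosystem:BSID100065 rdf:type bp:Pathway .\n" * 1000
compressed = gzip.compress(payload)
pieces = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]

sink = io.BytesIO()
total = gunzip_stream(pieces, sink)
print(total == len(payload))  # True
```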

I'm also working on a subset extraction operation: given an HDT, you can generate another HDT containing just some of the original's triples, with the output already in HDT format. I think this will be more useful than fetching individual bytes.

[1] http://dataweb.infor.uva.es/wp-content/uploads/2015/09/paper.pdf
