Comments (5)
I guess it strongly depends on the dataset. It would be good to know for what characteristics it behaves that way. Anything peculiar about pc_biosystem?
from hdt-cpp.
@RubenVerborgh, the data is a tad unusual in that it is extremely homogeneous, like so:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix biosystem: <http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/> .
@prefix bp: <http://www.biopax.org/release/biopax-level3.owl#> .
biosystem:BSID100065 rdf:type bp:Pathway .
biosystem:BSID100093 rdf:type bp:Pathway .
biosystem:BSID100334 rdf:type bp:Pathway .
biosystem:BSID100512 rdf:type bp:Pathway .
biosystem:BSID100519 rdf:type bp:Pathway .
biosystem:BSID100569 rdf:type bp:Pathway .
biosystem:BSID100645 rdf:type bp:Pathway .
...
biosystem:BSID982021 cito:isDiscussedBy reference:PMID8680961 .
biosystem:BSID1109889 cito:isDiscussedBy reference:PMID8680961 .
biosystem:BSID546971 cito:isDiscussedBy reference:PMID8702268 .
biosystem:BSID703134 cito:isDiscussedBy reference:PMID9003397 .
biosystem:BSID1268789 cito:isDiscussedBy reference:PMID9023878 .
biosystem:BSID137924 cito:isDiscussedBy reference:PMID9295307 .
biosystem:BSID138023 cito:isDiscussedBy reference:PMID9446561 .
biosystem:BSID1268894 cito:isDiscussedBy reference:PMID9927040 .
biosystem:BSID1268900 cito:isDiscussedBy reference:PMID9927040 .
biosystem:BSID1268902 cito:isDiscussedBy reference:PMID9927040 .
from hdt-cpp.
That's probably it. Very easy for regular compression algorithms in any case.
from hdt-cpp.
Problem is, as soon as you have to compress HDT files, and would hence make them available as .hdt.gz or .hdt.bz2 dumps, downstream users lose the ability to perform HTTP byte-range requests on HDT files served from a remote server. Ideally, the data structures in HDT files would be maximally compressed, such that gzip or bzip2 couldn't do much further on them. But I realize that's the opening to a more fundamental discussion.
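For concreteness, a byte-range request against a plain .hdt file looks like the libcurl sketch below (the URL and range are placeholders); against a .hdt.gz or .hdt.bz2 dump the same range is just an opaque slice of a compressed stream and can't be interpreted without decompressing from the start.

// Sketch: fetch only the first 4 KiB of a remote HDT file via an HTTP
// byte-range request (e.g. enough to inspect the header/control information).
// The URL is a placeholder for illustration.
#include <curl/curl.h>
#include <cstdio>
#include <string>

static size_t collect(char *data, size_t size, size_t nmemb, void *userp) {
    static_cast<std::string *>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    std::string buffer;
    curl_easy_setopt(curl, CURLOPT_URL, "http://example.org/pc_biosystem.hdt");
    curl_easy_setopt(curl, CURLOPT_RANGE, "0-4095");        // bytes 0..4095 only
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &buffer);

    CURLcode res = curl_easy_perform(curl);
    if (res == CURLE_OK)
        std::printf("received %zu bytes\n", buffer.size());
    else
        std::fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}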
from hdt-cpp.
Hi bendiken,
The key feature of HDT is that, in addition to the compression, you can search. If you do more aggressive compression, then searching becomes slower.
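As a minimal sketch of what that search looks like with hdt-cpp (the file name and pattern are placeholders, following the repository's basic usage pattern):

// Sketch: open an HDT file and run a triple-pattern search with hdt-cpp.
// Empty strings act as wildcards, so ("", rdf:type, "") enumerates all
// typed resources without decompressing the whole file.
#include <iostream>
#include <string>
#include <HDT.hpp>
#include <HDTManager.hpp>

using namespace hdt;

int main() {
    try {
        // Memory-maps the file; only the parts touched by the query are read.
        HDT *hdt = HDTManager::mapHDT("pc_biosystem.hdt", NULL);

        IteratorTripleString *it = hdt->search(
            "", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "");
        while (it->hasNext()) {
            TripleString *t = it->next();
            std::cout << t->getSubject() << " "
                      << t->getPredicate() << " "
                      << t->getObject() << std::endl;
        }
        delete it;
        delete hdt;
    } catch (std::exception &e) {
        std::cerr << "error: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}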
If you would like to see the size/speed tradeoff more explicitly, open this paper [1] and jump to the figures on page 29. You can see several techniques (for the dictionary) with different parameters. The default in the HDT code is PFC, with the point around the bottom-left corner (block size 16-32). We chose this one as the default because it is very simple and works quite well.
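To make the block-size tradeoff concrete, here is a toy version of the front-coding idea (illustration only, not hdt-cpp's actual CSD_PFC code): within a block, only the first string is stored in full and the rest keep a shared-prefix length plus the differing suffix, so a larger block compresses better but extracting one entry has to replay more of the block.

// Toy front-coding sketch: sorted dictionary strings, every `block`-th entry
// kept verbatim, the others stored as (shared prefix length, suffix).
// Extracting entry i decodes from the start of its block, which is why a
// bigger block size trades extraction speed for space.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Entry { size_t shared; std::string suffix; };

static size_t commonPrefix(const std::string &a, const std::string &b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
    return n;
}

static std::vector<Entry> encode(const std::vector<std::string> &sorted, size_t block) {
    std::vector<Entry> out;
    for (size_t i = 0; i < sorted.size(); ++i) {
        if (i % block == 0) {
            out.push_back({0, sorted[i]});            // block header: full string
        } else {
            size_t s = commonPrefix(sorted[i - 1], sorted[i]);
            out.push_back({s, sorted[i].substr(s)});  // shared length + suffix
        }
    }
    return out;
}

// Decode entry i by replaying the block from its header.
static std::string extract(const std::vector<Entry> &enc, size_t block, size_t i) {
    size_t start = (i / block) * block;
    std::string cur = enc[start].suffix;
    for (size_t j = start + 1; j <= i; ++j)
        cur = cur.substr(0, enc[j].shared) + enc[j].suffix;
    return cur;
}

int main() {
    std::vector<std::string> dict = {
        "http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/BSID100065",
        "http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/BSID100093",
        "http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/BSID100334",
        "http://rdf.ncbi.nlm.nih.gov/pubchem/biosystem/BSID100512"};
    std::vector<Entry> enc = encode(dict, 4);          // block size 4
    std::cout << extract(enc, 4, 3) << std::endl;      // round-trips BSID100512
}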
I opened your file and it's a particularly good case because it's very regular. HDT sorts the data and applies a first lightweight compression, and the remainder is handled by the generic compressor. For example, HDT doesn't exploit the pattern where each subject usually has the same set of properties; it leaves that pattern as runs in the binary sequences, which gzip/bzip2 then pick up.
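A toy illustration of those runs (again, not the library's BitmapTriples code): after sorting, the predicate layer is a sequence of predicate IDs grouped per subject plus a bitmap marking where each subject's list ends, and with data shaped like the rdf:type sample above both arrays degenerate into constant runs.

// Toy sketch of the bitmap-triples idea. When every subject has the same
// single predicate (as in the rdf:type sample), both Sp and Bp become
// constant runs -- exactly what gzip/bzip2 then squeeze almost to nothing.
#include <iostream>
#include <vector>

struct Triple { int s, p, o; };  // term IDs after dictionary encoding

int main() {
    // Sorted by subject; mirrors "biosystem:BSIDxxxxx rdf:type bp:Pathway",
    // i.e. every subject has the single predicate 1 and object 1.
    std::vector<Triple> triples;
    for (int s = 1; s <= 8; ++s) triples.push_back({s, 1, 1});

    std::vector<int> Sp;  // predicate IDs, grouped per subject
    std::vector<int> Bp;  // 1 marks the last predicate of a subject
    for (std::size_t i = 0; i < triples.size(); ++i) {
        Sp.push_back(triples[i].p);
        bool last = (i + 1 == triples.size()) || (triples[i + 1].s != triples[i].s);
        Bp.push_back(last ? 1 : 0);
    }

    std::cout << "Sp: "; for (int p : Sp) std::cout << p;
    std::cout << "\nBp: "; for (int b : Bp) std::cout << b;
    std::cout << "\n";  // both print 11111111: constant runs
}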
Our logic for publishing hdt.gz on the Web is that in many cases it halves the size (and the bandwidth and waiting time), and once you decompress it, queries are very fast to execute. I always pipe the download anyway, something like
$ curl http://web/file.hdt.gz | gzip -cd > file.hdt
I'm also working on a subset-extract operation: given an HDT, you can generate another HDT with just some of the original's triples, with the output already in HDT format. I think this will be more useful than fetching individual bytes.
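The closest you can get today is a round trip, sketched below under the assumption that the cito: prefix is the usual http://purl.org/spar/cito/ namespace and that all matched terms are IRIs (true for the biosystem sample; literals would need proper N-Triples escaping): select the triples with search(), dump them, and rebuild with rdf2hdt. A native subset-extract operation would skip that detour.

// Sketch of today's workaround for an "HDT subset": dump the triples matching
// a pattern as N-Triples, then rebuild with rdf2hdt.
#include <fstream>
#include <string>
#include <HDT.hpp>
#include <HDTManager.hpp>

using namespace hdt;

int main() {
    HDT *hdt = HDTManager::mapHDT("pc_biosystem.hdt", NULL);
    std::ofstream out("subset.nt");

    // Keep only the cito:isDiscussedBy triples (all IRIs in this dataset).
    IteratorTripleString *it = hdt->search(
        "", "http://purl.org/spar/cito/isDiscussedBy", "");
    while (it->hasNext()) {
        TripleString *t = it->next();
        out << "<" << t->getSubject() << "> <" << t->getPredicate()
            << "> <" << t->getObject() << "> .\n";
    }
    delete it;
    delete hdt;
    // Then: rdf2hdt subset.nt subset.hdt  (rebuilds an HDT from the dump)
    return 0;
}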
[1] http://dataweb.infor.uva.es/wp-content/uploads/2015/09/paper.pdf
from hdt-cpp.