Giter VIP home page Giter VIP logo

diskhash's Introduction

Disk-based hashtable

Build & test (Haskell) Build & test (Python) License: MIT

A simple disk-based hash table (i.e., persistent hash table).

It is a hashtable implemented on memory-mapped disk, so that it can be loaded with a single mmap() system call and used in memory directly (being as fast as an in-memory hashtable once it is loaded from disk).

The code is in C, wrappers are provided for Python, Haskell, and C++. The wrappers follow similar APIs with variations to accommodate the language specificity. They all use the same underlying code, so you can open a hashtable created in C from Haskell, modify it within your Haskell code, and later open the result in Python.

Cross-language functionality will only work for simple types where you can control their binary representation (64-bit integers, for example).

Reading does not touch the disk representation at all and, thus, can be done on top of read-only files or using multiple threads (and different processes will share the memory: the operating system does that for you). Writing or modifying values is, however, not thread-safe.

Examples

The following examples all create a hashtable to store longs (int64_t), then set the value associated with the key "key" to 9. In the current API, the maximum size of the keys needs to be pre-specified, which is the value 15 below.

Raw C

#include <stdio.h>
#include <inttypes.h>
#include "diskhash.h"

int main(void) {
    HashTableOpts opts;
    opts.key_maxlen = 15;
    opts.object_datalen = sizeof(int64_t);
    char* err = NULL;
    HashTable* ht = dht_open("testing.dht", opts, O_RDWR|O_CREAT, &err);
    if (!ht) {
        if (!err) err = "Unknown error";
        fprintf(stderr, "Failed opening hash table: %s.\n", err);
        return 1;
    }
    long i = 9;
    dht_insert(ht, "key", &i);
    
    long* val = (long*) dht_lookup(ht, "key");
    printf("Looked up value: %l\n", *val);

    dht_free(ht);
    return 0;
}

The C API relies on error codes and error strings (the &err argument above). The header file has decent documentation.

Haskell

In Haskell, you have different types/functions for read-write and read-only hashtables. Read-write operations are IO operations, read-only hashtables are pure.

Read write example:

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRW "testing.dht" 15
    htInsertRW ht "key" (9 :: Int64)
    val <- htLookupRW "key" ht
    print val

Read only example (htLookupRO is pure in this case):

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRO "testing.dht" 15
    let val :: Int64
        val = htLookupRO "key" ht
    print val

Python

Python's interface is based on the struct module. For example, 'll' refers to a pair of 64-bit ints (longs):

import diskhash

tb = diskhash.StructHash(
    fname="testing.dht", 
    keysize=15, 
    structformat='ll',  # store pairs of longs
    mode='rw',
)
value = [1, 2]  # pair of longs
tb.insert("key", *value)
print(tb.lookup("key"))

The Python interface is currently Python 3 only. Patches to extend it to 2.7 are welcome, but it's not a priority.

C++

In C++, a simple wrapper is defined, which provides a modicum of type-safety. You use the DiskHash<T> template. Additionally, errors are reported through exceptions (both std::bad_alloc and std::runtime_error can be thrown) and not return codes.

#include <iostream>
#include <string>

#include <diskhash.hpp>

int main() {
    const int key_maxlen = 15;
    dht::DiskHash<uint64_t> ht("testing.dht", key_maxlen, dht::DHOpenRW);
    std::string line;
    uint64_t ix = 0;
    while (std::getline(std::cine, line)) {
        if (line.length() > key_maxlen) {
            std::cerr << "Key too long: '" << line << "'. Aborting.\n";
            return 2;
        }
        const bool inserted = ht.insert(line.c_str(), ix);
        if (!inserted) {
            std::cerr  << "Found repeated key '" << line << "' (ignored).\n";
        }
        ++ix;
    }
    return 0;
}

Stability

This is beta software. It is good enough that I am using it, but the API can change in the future with little warning. The binary format is versioned (the magic string encodes its version, so changes can be detected and you will get an error message in the future rather than some silent misbehaviour.

Automated unit testing ensures that basic mistakes will not go uncaught.

Limitations

  • You must specify the maximum key size. This can be worked around either by pre-hashing the keys (with a strong hash) or using multiple hash tables for different key sizes. Neither is currently implemented in diskhash.

  • You cannot delete objects. This was not a necessity for my uses, so it was not implemented. A simple implementation could be done by marking objects as "deleted" in place and recompacting when the hash table size changes or with an explicit dht_gc() call. It may also be important to add functionality to shrink hashtables so as to not waste disk space.

  • The algorithm is a rather naïve implementation of linear addression. It would not be hard to switch to robin hood hashing and this may indeed happen in the near future.

License: MIT

diskhash's People

Contributors

dvzubarev avatar gukoff avatar luispedro avatar thsfs avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

diskhash's Issues

Question about POD data

Hi, thank you for this library.
I was trying to use c++ API of this library for this data: key = uint32_t and my value is

struct temp_t {
    std::uint32_t    _cnt   =0 ;
    std::uint32_t    _cnt2 =0;
};

But it fails to compile, because of static_assert:

 static_assert(std::is_pod<T>::value,
            "DiskHash only works for POD (plain old data) types that can be mempcy()ed around");

My struct is not POD because it has user-defined constructor. But, I believe it still can be memcpyed.
Maybe its better to replace std::is_pod with std::is_standard_layout? What do you think? Also is_pod is deprecated in c++20.

How can I update?

I see that the provided API only has dht-insert, but there is no dht-update.
If there is a key in the hashtable, then the value cannot update the value.
What should I do?

Key types

Thanks for sharing the good work.

I'm looking at the C++ version:

I'd give a suggestion for you to accept more key types. double, float, int, long, ...

Also, a method for clearing the hashmap, so we can better manage unit tests.

... ah, and there is missing a method to remove members.

... keys, once inserted, cannot be updated?

-- luiz

Installation issue

C:\Users\user1>pip install diskhash
Collecting diskhash
Using cached https://files.pythonhosted.org/packages/1e/a4/672ea698c480c44025dd6f82ea2fe1482814d18cc33f0ddac97f8972b860/diskhash-0.4.0.tar.gz
Building wheels for collected packages: diskhash
Building wheel for diskhash (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: 'c:\users\user1\appdata\local\programs\python\python37-32\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\user1\AppData\Local\Temp\pip-install-n8q6wuhq\diskhash\setup.py'"'"'; file='"'"'C:\Users\user1\AppData\Local\Temp\pip-install-n8q6wuhq\diskhash\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\user1\AppData\Local\Temp\pip-wheel-g297yx0g' --python-tag cp37
cwd: C:\Users\user1\AppData\Local\Temp\pip-install-n8q6wuhq\diskhash
Complete output (25 lines):
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win32-3.7
creating build\lib.win32-3.7\diskhash
copying python\diskhash\diskhash_version.py -> build\lib.win32-3.7\diskhash
copying python\diskhash_init_.py -> build\lib.win32-3.7\diskhash
creating build\lib.win32-3.7\diskhash\tests
copying python\diskhash\tests\test_larger.py -> build\lib.win32-3.7\diskhash\tests
copying python\diskhash\tests\test_smoke.py -> build\lib.win32-3.7\diskhash\tests
copying python\diskhash\tests_init_.py -> build\lib.win32-3.7\diskhash\tests
running build_ext
building 'diskhash._diskhash' extension
creating build\temp.win32-3.7
creating build\temp.win32-3.7\Release
creating build\temp.win32-3.7\Release\python
creating build\temp.win32-3.7\Release\python\diskhash
creating build\temp.win32-3.7\Release\src
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Ic:\users\user1\appdata\local\programs\python\python37-32\include -Ic:\users\user1\appdata\local\programs\python\python37-32\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /Tcpython/diskhash/_diskhash.c /Fobuild\temp.win32-3.7\Release\python/diskhash/_diskhash.obj
_diskhash.c
python/diskhash/_diskhash.c(121): error C2078: too many initializers
python/diskhash/_diskhash.c(117): error C2078: too many initializers
python/diskhash/_diskhash.c(196): warning C4028: formal parameter 1 different from declaration
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\cl.exe' failed with exit status 2

ERROR: Failed building wheel for diskhash
Running setup.py clean for diskhash
Failed to build diskhash
Installing collected packages: diskhash
Running setup.py install for diskhash ... error
ERROR: Command errored out with exit status 1:
command: 'c:\users\user1\appdata\local\programs\python\python37-32\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\user1\AppData\Local\Temp\pip-install-n8q6wuhq\diskhash\setup.py'"'"'; file='"'"'C:\Users\user1\AppData\Local\Temp\pip-install-n8q6wuhq\diskhash\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\user1\AppData\Local\Temp\pip-record-nrcm6gyo\install-record.txt' --single-version-externally-managed --compile
cwd: C:\Users\user1\AppData\Local\Temp\pip-install-n8q6wuhq\diskhash
Complete output (25 lines):
running install
running build
running build_py
creating build
creating build\lib.win32-3.7
creating build\lib.win32-3.7\diskhash
copying python\diskhash\diskhash_version.py -> build\lib.win32-3.7\diskhash
copying python\diskhash_init_.py -> build\lib.win32-3.7\diskhash
creating build\lib.win32-3.7\diskhash\tests
copying python\diskhash\tests\test_larger.py -> build\lib.win32-3.7\diskhash\tests
copying python\diskhash\tests\test_smoke.py -> build\lib.win32-3.7\diskhash\tests
copying python\diskhash\tests_init_.py -> build\lib.win32-3.7\diskhash\tests
running build_ext
building 'diskhash._diskhash' extension
creating build\temp.win32-3.7
creating build\temp.win32-3.7\Release
creating build\temp.win32-3.7\Release\python
creating build\temp.win32-3.7\Release\python\diskhash
creating build\temp.win32-3.7\Release\src
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Ic:\users\user1\appdata\local\programs\python\python37-32\include -Ic:\users\user1\appdata\local\programs\python\python37-32\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /Tcpython/diskhash/_diskhash.c /Fobuild\temp.win32-3.7\Release\python/diskhash/_diskhash.obj
_diskhash.c
python/diskhash/_diskhash.c(121): error C2078: too many initializers
python/diskhash/_diskhash.c(117): error C2078: too many initializers
python/diskhash/_diskhash.c(196): warning C4028: formal parameter 1 different from declaration
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\cl.exe' failed with exit status 2
----------------------------------------
ERROR: Command errored out with exit status 1: 'c:\users\user1\appdata\local\programs\python\python37-32\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\user1\AppData\Local\Temp\pip-install-n8q6wuhq\diskhash\setup.py'"'"'; file='"'"'C:\Users\user1\AppData\Local\Temp\pip-install-n8q6wuhq\diskhash\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\user1\AppData\Local\Temp\pip-record-nrcm6gyo\install-record.txt' --single-version-externally-managed --compile Check the logs for full command output.

C:\Users\user1>

Compile failure

As can be seen at e.g. https://matrix.hackage.haskell.org/package/diskhash#GHC-7.8/diskhash-0.0.1.2

Configuring component lib from diskhash-0.0.1.2...
Preprocessing library diskhash-0.0.1.2...
[1 of 1] Compiling Data.DiskHash    ( haskell/Data/DiskHash.hs, /tmp/matrix-worker/1498509481/dist-newstyle/build/x86_64-linux/ghc-7.8.4/diskhash-0.0.1.2/build/Data/DiskHash.o )

haskell/Data/DiskHash.hs:63:34: Not in scope: ‘<$>’

haskell/Data/DiskHash.hs:67:34: Not in scope: ‘<$>’

haskell/Data/DiskHash.hs:94:69: Not in scope: ‘<$>’

haskell/Data/DiskHash.hs:138:27: Not in scope: ‘<$>’

haskell/Data/DiskHash.hs:149:26: Not in scope: ‘<$>’

haskell/Data/DiskHash.hs:165:30: Not in scope: ‘<$>’

This is a result of inaccurate version bounds; Proper bounds would be e.g.:

  build-depends: base > 4.8 && < 5, bytestring == 0.10.* 

You can correct the version bounds yourself via
https://hackage.haskell.org/package/diskhash-0.0.1.2/diskhash.cabal/edit

Python Unicode key

It seems that the package does not handle python Unicode string keys correctly.

For example, when I insert a Unicode key using the following code:

import diskhash
tb = diskhash.StructHash("test.dht", 15, 'l', 'rw')
tb.insert("가", 1)
tb.lookup("가")

I get the correct value 1 in the same process where the (key, value) pair is inserted. But if I stop the python shell, restart a new python shell, and execute the following code:

import diskhash
tb = diskhash.StructHash("test.dht", 15, 'l', 'r')
tb.lookup("가")

Nothing is returned from the code. I do not experience a similar problem if the key is an ASCII string.

My environment is:

  • diskahash 0.0.4.0
  • python 3.7 (anaconda)
  • Ubuntu 18.04

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.