erthink / libmdbx

One of the fastest embeddable key-value ACID databases without WAL. libmdbx surpasses the legendary LMDB in terms of reliability, features and performance.

Home Page: https://erthink.github.io/libmdbx/

License: Other



The Future will (be) Positive. Everything will be fine.

Please refer to the online documentation with C API description and pay attention to the C++ API.

Questions, feedback and suggestions are welcome in the Telegram group.

For news, take a look at the ChangeLog.

libmdbx

libmdbx is an extremely fast, compact, powerful, embedded, transactional key-value database with a permissive license. libmdbx has a specific set of properties and capabilities, focused on creating unique lightweight solutions.

Allows a swarm of multi-threaded processes to ACIDly read and update several key-value maps and multimaps in a locally-shared database.
Provides extraordinary performance and minimal overhead through memory-mapping, with O(log N) operation cost by virtue of the B+ tree.
Requires no maintenance and no crash recovery since it doesn't use WAL, but that might be a caveat for write-intensive workloads with durability requirements.
Compact and friendly for full embedding. Only ≈25 KLOC of C11, ≈64K of x86 binary code for the core, no internal threads and no server process(es), yet it implements a simplified variant of the Berkeley DB and dbm API.
Enforces serializability for writers with just a single mutex and offers wait-free parallel readers without atomic/interlocked operations, while writing and reading transactions do not block each other.
Guarantees data integrity after a crash unless this was explicitly neglected in favour of write performance.
Supports Linux, Windows, MacOS, Android, iOS, FreeBSD, DragonFly, Solaris, OpenSolaris, OpenIndiana, NetBSD, OpenBSD and other systems compliant with POSIX.1-2008.

Historically, libmdbx is a deeply revised and extended descendant of the amazing Lightning Memory-Mapped Database. libmdbx inherits all benefits from LMDB, but resolves some issues and adds a set of improvements.

The next version is under active non-public development from scratch and will be released as MithrilDB, with libmithrildb as the name for libraries & packages. Mithril is admittedly mythical: it resembles silver but is stronger and lighter than steel. Therefore MithrilDB is a fitting name.

MithrilDB will differ radically from libmdbx in its new database format and an API based on C++17, as well as in the Apache 2.0 License. The goal of this revolution is to provide a clearer and more robust API, and to add more features and new valuable properties to the database.

Characteristics
Usage
Performance comparison

Characteristics

Features

Key-value data model, keys are always sorted.
Fully ACID-compliant, through MVCC and CoW.
Multiple key-value sub-databases within a single datafile.
Range lookups, including range query estimation.
Efficient support for short fixed length keys, including native 32/64-bit integers.
Ultra-efficient support for multimaps. Multi-values sorted, searchable and iterable. Keys stored without duplication.
Data is memory-mapped and accessible directly/zero-copy. Traversal of database records is extremely-fast.
Transactions for readers and writers; neither blocks the other.
Writes are strongly serialized. No transaction conflicts nor deadlocks.
Readers are non-blocking, notwithstanding snapshot isolation.
Nested write transactions.
Reads scale linearly across CPUs.
Continuous zero-overhead database compactification.
Automatic on-the-fly database size adjustment.
Customizable database page size.
O(log N) cost of lookup, insert, update, and delete operations by virtue of B+ tree characteristics.
Online hot backup.
Append operation for efficient bulk insertion of pre-sorted data.
No WAL nor any transaction journal. No crash recovery needed. No maintenance is required.
No internal cache and/or memory management, all done by basic OS services.
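
To give a feel for what the O(log N) cost means in practice, here is a minimal sketch that counts how many B+ tree levels are needed for a given number of records. The function name and the fixed fanout are assumptions made for illustration only; the real fanout in libmdbx depends on page size and on key/value sizes, and this is not libmdbx code.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: number of B+ tree levels needed to reach n_items entries when
 * every page holds up to `fanout` entries (an assumed constant, >= 2). */
unsigned btree_levels(uint64_t n_items, uint64_t fanout) {
  assert(fanout >= 2);
  unsigned levels = 1;
  uint64_t capacity = fanout; /* entries reachable with the current depth */
  while (capacity < n_items) {
    if (capacity > UINT64_MAX / fanout)
      return levels + 1; /* one more level already covers any 64-bit count */
    capacity *= fanout;
    ++levels;
  }
  return levels;
}
```

With an assumed fanout of 100, a million records fit in three levels, so a lookup touches only a handful of pages even for large databases.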

Limitations

Page size: a power of 2, minimum 256 (mostly for testing), maximum 65536 bytes, default 4096 bytes.
Key size: minimum 0, maximum ≈½ pagesize (2022 bytes for default 4K pagesize, 32742 bytes for 64K pagesize).
Value size: minimum 0, maximum 2146435072 (0x7FF00000) bytes for maps, ≈½ pagesize for multimaps (2022 bytes for default 4K pagesize, 32742 bytes for 64K pagesize).
Write transaction size: up to 1327217884 pages (4.944272 TiB for default 4K pagesize, 79.108351 TiB for 64K pagesize).
Database size: up to 2147483648 pages (≈8.0 TiB for default 4K pagesize, ≈128.0 TiB for 64K pagesize).
Maximum sub-databases: 32765.
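
The size limits above follow directly from the page-count limits multiplied by the page size. The sketch below reproduces that arithmetic; the macro and function names are invented for this example and are not part of the libmdbx API.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DB_PAGES  UINT64_C(2147483648) /* database size limit, in pages */
#define MAX_TXN_PAGES UINT64_C(1327217884) /* write transaction limit, in pages */

/* Convert a page count into bytes; page size is a power of 2 in 256..65536. */
uint64_t pages_to_bytes(uint64_t pages, uint32_t pagesize) {
  assert(pagesize >= 256 && pagesize <= 65536);
  assert((pagesize & (pagesize - 1)) == 0);
  return pages * (uint64_t)pagesize;
}

/* The same quantity expressed in TiB (2^40 bytes). */
double pages_to_tib(uint64_t pages, uint32_t pagesize) {
  return (double)pages_to_bytes(pages, pagesize) / 1099511627776.0;
}
```

For example, pages_to_tib(MAX_DB_PAGES, 4096) gives 8.0 and pages_to_tib(MAX_TXN_PAGES, 4096) gives roughly 4.944272, matching the figures in the list.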

Gotchas

There cannot be more than one writer at a time, i.e. no more than one write transaction at a time.
libmdbx is based on B+ tree, so access to database pages is mostly random. Thus SSDs provide a significant performance boost over spinning disks for large databases.
libmdbx uses shadow paging instead of WAL. Thus syncing data to disk might be a bottleneck for write intensive workload.
libmdbx uses copy-on-write for snapshot isolation during updates, but read transactions prevent recycling of old retired/freed pages, since they are still reading them. Thus altering data during a parallel long-lived read operation will increase the process working set, may exhaust the entire free database space, can make the database grow quickly, and results in performance degradation. Try to avoid long-running read transactions.
libmdbx is extraordinarily fast and provides minimal overhead for data access, so you should reconsider using brute-force techniques and double-check your code. On the one hand, in the case of libmdbx, a simple linear search may be more profitable than complex indexes. On the other hand, if you do something suboptimally, you may notice the harm only on sufficiently large data.

Comparison with other databases

For now please refer to the chapter "BoltDB comparison with other databases", which is also (mostly) applicable to libmdbx.

Improvements beyond LMDB

libmdbx is superior to the legendary LMDB in terms of features and reliability, and is not inferior in performance. In comparison to LMDB, libmdbx makes things "just work" perfectly and out-of-the-box, rather than silently and catastrophically breaking down. The list below is pruned down to the improvements most notable and obvious from the user's point of view.

Added Features

Keys can be more than 2 times longer than in LMDB.

For a DB with the default page size, libmdbx supports keys up to 2022 bytes, and up to 32742 bytes for a 64K page size. LMDB allows key sizes up to 511 bytes and may silently lose data with large values.

Up to 30% faster than LMDB in CRUD benchmarks.

Benchmarks of the in-tmpfs scenarios, which test the speed of the engine itself, showed that libmdbx is 10-20% faster than LMDB, and up to 30% faster when libmdbx is compiled with specific build options that downgrade several runtime checks to match LMDB's behaviour.

These and other results can easily be reproduced with ioArena just by running the make bench-quartet command, including comparisons with RocksDB and WiredTiger.

Automatic on-the-fly database size adjustment, both increment and reduction.

libmdbx manages the database size according to the parameters specified via the mdbx_env_set_geometry() function, which include the growth step and the truncation threshold.

Unfortunately, on-the-fly database size adjustment doesn't work under Wine due to its internal limitations and unimplemented functions, i.e. the MDBX_UNABLE_EXTEND_MAPSIZE error will be returned.

Automatic continuous zero-overhead database compactification.

During each commit, libmdbx merges freed pages that are adjacent to the unallocated area at the end of the file, and then truncates the unused space once enough of it has accumulated.
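
The idea can be illustrated with a toy model, which is not libmdbx's actual implementation: free pages that touch the unallocated tail are absorbed into it, and the file is truncated only once the unused tail reaches a threshold. All names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: is_free[i] tells whether page i is currently free. Pages
 * [0, allocated) belong to the used part of the file, everything beyond is
 * unallocated. Returns the new `allocated` after absorbing the run of free
 * pages that touches the unallocated tail. */
size_t absorb_free_tail(const bool *is_free, size_t allocated) {
  assert(is_free != NULL || allocated == 0);
  while (allocated > 0 && is_free[allocated - 1])
    --allocated;
  return allocated;
}

/* Give pages back to the filesystem only when at least `threshold` pages
 * of the file are unused; otherwise keep the file as it is. */
size_t pages_to_truncate(size_t file_pages, size_t allocated, size_t threshold) {
  assert(allocated <= file_pages);
  size_t unused = file_pages - allocated;
  return unused >= threshold ? unused : 0;
}
```

Note that a free page in the middle of the file (not adjacent to the tail) is left alone in this model; it stays available for reuse by later write transactions.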

The same database format for 32- and 64-bit builds.

libmdbx database format depends only on the endianness but not on the bitness.

LIFO policy for Garbage Collection recycling. In the best case this can increase write performance several times over, thanks to the write-back disk cache.

LIFO means that the most recently freed pages are taken for reuse first. Therefore the loop of database page circulation becomes as short as possible. In other words, the set of pages that are (over)written in memory and on disk during a series of write transactions will be as small as possible. This creates ideal conditions for the efficiency of a battery-backed or flash-backed disk cache.
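
A small simulation makes the effect visible. This is a deliberately simplified model, assuming one page written and one page retired per transaction and no concurrent readers; it does not reproduce libmdbx's real GC, and the names and pool size are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_PAGES 8 /* assumed size of the reusable page pool */

/* Toy model: every write transaction takes one page from the free list,
 * writes it, and retires the page written by the previous transaction.
 * Returns how many distinct pages were written over `txns` transactions. */
size_t distinct_pages_written(bool lifo, size_t txns) {
  size_t ring[POOL_PAGES]; /* free list kept as a ring buffer */
  bool written[POOL_PAGES] = {false};
  size_t head = 0, count = POOL_PAGES, distinct = 0;
  size_t live = POOL_PAGES; /* POOL_PAGES means "no live page yet" */

  for (size_t i = 0; i < POOL_PAGES; ++i)
    ring[i] = i;

  for (size_t t = 0; t < txns; ++t) {
    assert(count > 0);
    size_t page;
    if (lifo) { /* take the page at the tail, i.e. the latest retired one */
      page = ring[(head + count - 1) % POOL_PAGES];
    } else { /* FIFO: take the page that has been free the longest */
      page = ring[head];
      head = (head + 1) % POOL_PAGES;
    }
    --count;
    if (!written[page]) {
      written[page] = true;
      ++distinct;
    }
    if (live != POOL_PAGES) { /* retire the previous page to the tail */
      ring[(head + count) % POOL_PAGES] = live;
      ++count;
    }
    live = page;
  }
  return distinct;
}
```

In this model, 100 transactions under LIFO keep rewriting the same two pages, while FIFO cycles through all eight pages of the pool, which is the larger write set that a disk cache would have to absorb.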

Fast estimation of range query result volume, i.e. how many items can be found between a KEY1 and a KEY2. This is a prerequisite for building and/or optimizing query execution plans.

libmdbx performs a rough estimate based on common B-tree pages of the paths from root to corresponding keys.

mdbx_chk utility for database integrity check. Since version 0.9.1, the utility supports checking the database using any of the three meta pages and the ability to switch to it.
Support for opening databases in the exclusive mode, including on a network share.
Zero-length for keys and values.
Ability to determine whether particular data is on a dirty page or not, which allows avoiding a copy-out before updates.
Extended information of whole-database, sub-databases, transactions, readers enumeration.

libmdbx provides a lot of information, including dirty and leftover pages for a write transaction, reading lag and holdover space for read transactions.

Extended update and delete operations.

libmdbx allows performing these in a single call together with getting the previous value, and addressing a particular item among the multi-values with the same key.

Useful runtime options for tuning the engine to the application's requirements and specific use cases.
Automated steady sync-to-disk upon several thresholds and/or timeout via cheap polling.
Sequence generation and three persistent 64-bit markers.
Handle-Slow-Readers callback to resolve database full/overflow issues caused by long-lived read transaction(s).
Ability to determine whether the cursor is pointed to a key-value pair, to the first, to the last, or not set to anything.

Other fixes and specifics

Fixed more than 10 significant errors, in particular: page leaks, wrong sub-database statistics, segfault in several conditions, nonoptimal page merge strategy, updating an existing record with a change in data size (including for multimap), etc.
All cursors can be reused and should be closed explicitly, regardless of whether they were opened within a write or a read transaction.
Opening database handles is free from race conditions, and pre-opening is not needed.
Returning MDBX_EMULTIVAL error in case of ambiguous update or delete.
Guarantee of database integrity even in asynchronous unordered write-to-disk mode.

libmdbx proposes an additional trade-off with MDBX_SAFE_NOSYNC, which performs updates in an append-like manner and, contrary to LMDB, avoids database corruption after a system crash. Nevertheless, the MDBX_UTTERLY_NOSYNC mode is available to match LMDB's behaviour for MDB_NOSYNC.

On MacOS & iOS the fcntl(F_FULLFSYNC) syscall is used by default to synchronize data with the disk, as this is the only way to guarantee data durability in case of power failure. Unfortunately, in scenarios with high write intensity, the use of F_FULLFSYNC significantly degrades performance compared to LMDB, where the fsync() syscall is used. Therefore, libmdbx allows you to override this behavior by defining the MDBX_OSX_SPEED_INSTEADOF_DURABILITY=1 option when building the library.
On Windows the LockFileEx() syscall is used for locking, since it allows placing the database on network drives and provides protection against incompetent user actions (aka poka-yoke). Therefore libmdbx may lag slightly in performance tests behind LMDB, which uses named mutexes.

History

Historically, libmdbx is a deeply revised and extended descendant of the Lightning Memory-Mapped Database. At first the development was carried out within the ReOpenLDAP project. About a year later libmdbx was separated into a standalone project, which was presented at Highload++ 2015 conference.

Since 2017 libmdbx is used in Fast Positive Tables, and development is funded by Positive Technologies.

Acknowledgments

Howard Chu is the author of LMDB, from which libmdbx originated in 2015.

Martin Hedenfalk is the author of the btree.c code, which was used to begin development of LMDB.

Usage

Currently, libmdbx is only available in source code form. Package support for common Linux distributions is planned for the future, starting with the release of version 1.0.

Never use tarballs or zips automatically provided by Github!

Please don't use the tarballs or zips that Github provides automatically. These archives do not contain version information and thus are unfit for building libmdbx. Instead, just clone the git repository, or download a tarball or zip with the properly amalgamated source code. Moreover, please vote for the ability to disable auto-creation of such unsuitable archives.

Source code embedding

libmdbx provides two official ways for integration in source code form:

Using the amalgamated source code.

The amalgamated source code includes all files required to build and use libmdbx, but not for testing libmdbx itself.

Adding the complete original source code as a git submodule.

This allows you to build both libmdbx and the testing tool. On the other hand, this way requires you to pull git tags and to use a C++11 compiler for the test tool.

Please, avoid using any other techniques. Otherwise, at least don't ask for support and don't name such chimeras libmdbx.

The amalgamated source code could be created from the original clone of git repository on Linux by executing make dist. As a result, the desired set of files will be formed in the dist subdirectory.

Building and Testing

Both the amalgamated and the original source code can be built with CMake or with GNU Make plus bash. All build methods are completely traditional and have minimal prerequisites such as build-essential, i.e. a non-obsolete C/C++ compiler and an SDK for the target platform. Obviously you also need the build tools themselves, i.e. git, cmake or GNU make with bash. For your convenience, make help and make options are also available for listing existing targets and build options respectively.

The only significant peculiarity is that git tags are required to build from the complete (not amalgamated) source code. Executing git fetch --tags --force --prune is enough to get them, or git fetch --unshallow --tags --prune --force after Github's actions/checkout@v2; alternatively, set fetch-depth: 0 for it.

So just use CMake or GNU Make in your habitual manner, and feel free to file an issue or make a pull request if something turns out to be unexpected or broken.

Testing

The amalgamated source code does not contain any tests, for several reasons. Please read the explanation and don't ask to alter this. So to test libmdbx itself you need the full source code, i.e. a clone of the git repository; there is no other option.

The full source code of libmdbx has a test subdirectory with a minimalistic test "framework". It actually holds the source code of mdbx_test, a console utility with a set of command-line options that allow constructing and running reasonably thorough test scenarios. This test utility is intended for libmdbx's developers to test the library itself, not for use by users. Therefore, only basic information is provided:

There are a few CRUD-based test cases (hill, TTL, nested, append, jitter, etc), which can be combined to test concurrent operations within a shared database in a multi-process environment. This is the basic test scenario.
The Makefile provides several self-describing targets for testing: smoke, test, check, memcheck, test-valgrind, test-asan, test-leak, test-ubsan, cross-gcc, cross-qemu, gcc-analyzer, smoke-fault, smoke-singleprocess, test-singleprocess, long-test. Please run make --help if in doubt.
In addition to the mdbx_test utility, there is the script long_stochastic.sh, which calls mdbx_test while going through a set of modes and options, gradually increasing the number of operations and the size of transactions. This script is used for most automatic testing, including Makefile targets and Continuous Integration.
Brief information on the available command-line options is printed by --help. However, to learn everything you have to dive into the source code; there is no other option.

Anyway, no matter how thoroughly the libmdbx is tested, you should rely only on your own tests for a few reasons:

Almost all use cases are unique, so there is no warranty that your use case was properly tested, even though libmdbx's tests use a stochastic approach.
If there are problems, then your test will on the one hand help to verify whether you are using libmdbx correctly, and on the other hand allow reproducing the problem and insure against regressions in the future.
Ultimately, you should rely on what you have checked yourself, or accept the risk.

Common important details

Build reproducibility

By default libmdbx tracks the build time via the MDBX_BUILD_TIMESTAMP build option and macro. So for reproducible builds you should predefine/override it with a known fixed string value. For instance:

for reproducible build with make: make MDBX_BUILD_TIMESTAMP=unknown ...
or during configure by CMake: cmake -DMDBX_BUILD_TIMESTAMP:STRING=unknown ...

Of course, in addition to this, your toolchain must ensure the reproducibility of builds. For more information please refer to reproducible-builds.org.

Containers

There are no special traits or quirks if you use libmdbx ONLY inside a single container. But in cross-container cases, or with a mix of host and container(s), two major things MUST be guaranteed:

Coherence of memory mapping content and a unified page cache inside the OS kernel for the host and all container(s) operating on a DB. Basically this means there must be only a single physical copy of each memory-mapped DB page in system memory.
Uniqueness of PID values and/or a common space for them:
- for POSIX systems: PID uniqueness for all processes operating on a DB. I.e. --pid=host is required to run DB-aware processes inside Docker, or, without host interaction, --pid=container:<name|id> with the same name/id.
- for non-POSIX (i.e. Windows) systems: inter-visibility of processes handles. I.e. the OpenProcess(SYNCHRONIZE, ..., PID) must return reasonable error, including ERROR_ACCESS_DENIED, but not the ERROR_INVALID_PARAMETER as for an invalid/non-existent PID.

DSO/DLL unloading and destructors of Thread-Local-Storage objects

When building libmdbx as a shared library, or using static libmdbx as part of another dynamic library, it is advisable to make sure that your system correctly calls destructors of Thread-Local-Storage objects when unloading dynamic libraries.

If this is not the case, then unloading a dynamic-link library with libmdbx code inside can result in either a resource leak or a crash due to calling destructors from an already unloaded DSO/DLL object. The problem can only manifest in a multithreaded application that unloads shared dynamic libraries with libmdbx code inside after using libmdbx. It is known that TLS-destructors are properly maintained in the following cases:

On all modern versions of Windows (Windows 7 and later).
On systems with the __cxa_thread_atexit_impl() function in the standard C library, including systems with GNU libc version 2.18 and later.
On systems with libpthread/nptl from GNU libc with bug fixes #21031 and #21032, or where there are no similar bugs in the pthreads implementation.

Linux and other platforms with GNU Make

To build the library it is enough to execute make all in the directory of source code, and make check to execute the basic tests.

If the make installed on the system is not GNU Make, there will be a lot of errors from make when trying to build. In this case, perhaps you should use gmake instead of make, or even gnu-make, etc.

FreeBSD and related platforms

As a rule, on BSD and its derivatives Berkeley Make is the default and Bash is not installed.

So you need to install the required components: GNU Make, Bash, C and C++ compilers compatible with GCC or CLANG. After that, to build the library, it is enough to execute gmake all (or make all) in the directory with source code, and gmake check (or make check) to run the basic tests.

Windows

To build libmdbx on Windows, the original CMake and Microsoft Visual Studio 2019 are recommended. Please use recent versions of CMake, Visual Studio and the Windows SDK to avoid trouble with C11 support and the alignas() feature.

To build with MinGW, version 10.2 or later coupled with a modern CMake is required. It is recommended to use chocolatey to install and/or update them.

Other ways to build are potentially possible, but they are not supported and will not be. The CMakeLists.txt or GNUMakefile scripts will probably need to be modified accordingly. When using other methods, do not forget to add ntdll.lib to the link step.

It should be noted that efforts were made in libmdbx to avoid runtime dependencies on the CRT and other MSVC libraries. To take advantage of this, it is enough to pass the -DMDBX_WITHOUT_MSVC_CRT:BOOL=ON option when configuring with CMake.

An example of running a basic test script can be found in the CI-script for AppVeyor. To run the long stochastic test scenario, bash is required, and such testing is recommended with placing the test data on the RAM-disk.

Windows Subsystem for Linux

libmdbx could be used in WSL2 but NOT in WSL1 environment. This is a consequence of the fundamental shortcomings of WSL1 and cannot be fixed. To avoid data loss, libmdbx returns the ENOLCK (37, "No record locks available") error when opening the database in a WSL1 environment.

MacOS

Current native build tools for MacOS include GNU Make, CLANG and an outdated version of Bash. Therefore, to build the library, it is enough to run make all in the directory with source code, and run make check to execute the base tests. If something goes wrong, it is recommended to install Homebrew and try again.

To run the long stochastic test scenario, you will need to install the current (not outdated) version of Bash. To do this, we recommend that you install Homebrew and then execute brew install bash.

Android

We recommend using CMake to build libmdbx for Android. Please refer to the official guide.

iOS

To build libmdbx for iOS, we recommend using CMake with the "toolchain file" from the ios-cmake project.

API description

Please refer to the online libmdbx API reference and/or see the mdbx.h++ and mdbx.h headers.

Bindings

Runtime	Repo	Author
Scala	mdbx4s	David Bouyssié
Haskell	libmdbx-hs	Francisco Vallarino
NodeJS, Deno	lmdbx-js	Kris Zyp
NodeJS	node-mdbx	Сергей Федотов
Ruby	ruby-mdbx	Mahlon E. Smith
Go	mdbx-go	Alex Sharov
Nim	NimDBX	Jens Alfke
Rust	libmdbx-rs	Artem Vorotnikov
Rust	mdbx	gcxfd
Java	mdbxjni	Castor Technologies
Python (draft)	python-bindings branch	Noel Kuntze
.NET (obsolete)	mdbx.NET	Jerry Wang

Performance comparison

All benchmarks were done in 2015 by IOArena and multiple scripts runs on Lenovo Carbon-2 laptop, i7-4600U 2.1 GHz (2 physical cores, 4 HyperThreading cores), 8 Gb RAM, SSD SAMSUNG MZNTD512HAGL-000L1 (DXT23L0Q) 512 Gb.

Integral performance

Shown here is the sum of performance metrics across 3 benchmarks:

Read/Search on the machine with 4 logical CPUs in HyperThreading mode (i.e. actually 2 physical CPU cores);
Transactions with CRUD operations in sync-write mode (fdatasync is called after each transaction);
Transactions with CRUD operations in lazy-write mode (moment to sync data to persistent storage is decided by OS).

Reasons why asynchronous mode isn't benchmarked here:

It doesn't make sense, since such a comparison should be made against DB engines oriented toward keeping data in memory (e.g. Tarantool, Redis, etc.).
Performance gap is too high to compare in any meaningful way.

Read Scalability

Summary performance with concurrent read/search queries in 1-2-4-8 threads on the machine with 4 logical CPUs in HyperThreading mode (i.e. actually 2 physical CPU cores).

Sync-write mode

Linear scale on left and dark rectangles mean arithmetic mean transactions per second;
Logarithmic scale on right is in seconds and yellow intervals mean execution time of transactions. Each interval shows minimal and maximum execution time, cross marks standard deviation.

10,000 transactions in sync-write mode. In case of a crash all data is consistent and conforms to the last successful transaction. The fdatasync syscall is used after each write transaction in this mode.

In the benchmark each transaction contains combined CRUD operations (2 inserts, 1 read, 1 update, 1 delete). Benchmark starts on an empty database and after full run the database contains 10,000 small key-value records.

Lazy-write mode

Linear scale on left and dark rectangles mean arithmetic mean of thousands transactions per second;
Logarithmic scale on right in seconds and yellow intervals mean execution time of transactions. Each interval shows minimal and maximum execution time, cross marks standard deviation.

100,000 transactions in lazy-write mode. In case of a crash all data is consistent and conforms to the one of last successful transactions, but transactions after it will be lost. Other DB engines use WAL or transaction journal for that, which in turn depends on order of operations in the journaled filesystem. libmdbx doesn't use WAL and hands I/O operations to filesystem and OS kernel (mmap).

Async-write mode

Linear scale on left and dark rectangles mean arithmetic mean of thousands transactions per second;
Logarithmic scale on right in seconds and yellow intervals mean execution time of transactions. Each interval shows minimal and maximum execution time, cross marks standard deviation.

1,000,000 transactions in async-write mode. In case of a crash all data is consistent and conforms to one of the last successful transactions, but the number of lost transactions is much higher than in lazy-write mode. All DB engines in this mode do as few writes as possible to persistent storage. libmdbx uses msync(MS_ASYNC) in this mode.

Cost comparison

Summary of used resources during lazy-write mode benchmarks:

Read and write IOPs;
Sum of user CPU time and sys CPU time;
Used space on persistent storage after the test and closed DB, but not waiting for the end of all internal housekeeping operations (LSM compactification, etc).

ForestDB is excluded because the benchmark showed that its consumption of each resource (CPU, IOPs) is much higher than that of the other engines, which prevents a meaningful comparison with them.

All benchmark data is gathered by getrusage() syscall and by scanning the data directory.

This is a mirror of the origin repository that was moved to gitflic.ru because of discriminatory restrictions for Russian Crimea.


libmdbx's Issues

PVS Studio report

PVS-Studio Version: 7.01.31225.1583
Total Warnings (GA): 106
Total Warnings (OP): 9
Total Warnings (64): 32

http://www.fly-server.ru/pvs-studio/libmdbx/

Rename to "Mithril DB"?

CRITICAL: license & copyright issues must be resolved ASAP

Some changes from @hyc and from other contributors must be re-copyrighted at 2016.
Possibility of the dual licensing (AGPL + OpenLDAP Foundation) should be verified.

Key and Data sizes with DUPSORT

Hello @leo-yuriev ,

We previously ran into an issue with LMDB when handling post-quantum signature schemes. Since the keys were so big, we hit a limitation in LMDB. When a db is created with the DUPSORT flag enabled, the key and data values are both limited to the same size of 0 - 511 bytes.

So my question would be does libmdbx inherit the LMDB key size limitation or is it different?

osal.c mdbx_panic error: #error FIXME

A compilation error #error FIXME was detected in "osal.c" function __cold void mdbx_panic(...) by MinGW GCC 7.1 on Windows 7 - x64.

I think it's because the relevant code fragment

if (IsDebuggerPresent()) {
    OutputDebugString("\r\n" FIXME "\r\n");
    FatalExit(ERROR_UNHANDLED_ERROR);
}

is defined only under #ifdef _MSC_VER, which is not true in this environment. But I'm not sure.

Remove dependency on MSVC CRT

In other words: replace all CRT functions with alternatives from ntdll.dll and kernel32.dll, and don't link mdbx.dll with the MSVC CRT libraries:

replace all CRT functions with mdbx-defines;
add corresponding defines (ntdll.dll and kernel32.dll functions) to OSAL;
add implementations of the missing functions or drop/disable the corresponding functionality.

Does mdbx_strerror_r's return value require free-ing?

After something like:

// assume we define appropriate variables
const char * error_text = mdbx_strerror_r(errnum, buffer, 1024);

Do I need to call free on error_text?

Crash after DB auto-shrink on Windows 10 UCRT

The Toolhelp API can return "hidden" ThreadID(s). The OpenThread() function then returns a valid handle for such a ThreadID with THREAD_SUSPEND_RESUME, but SuspendThread() and other functions fail with ERROR_ACCESS_DENIED. Moreover, such threads are NOT visible in the Visual Studio debugger, but they are visible in ProcessHacker, which shows that such threads are executing UCRT initialization code.
Since some thread(s) are not suspended while the DB file is unmapped from RAM for shrinking, NtMapViewOfSection() may fail to map it at the previously used address.

A proper workaround with a lot of testing is needed for this Windows bug.

mdbx_chk and mdbx_env_pgwalk() report MDB_CORRUPTED for empty sub-db

For instance:

$ ./mdbx_chk -vvvn ut_schema.fpta 
Running mdbx_chk for 'ut_schema.fpta' in read-only mode...
 - monopolistic mode
 - map size 1048576 (1.00 Mb)
 - pagesize 4096, max keysize 511 (default), max readers 126
 - transactions: last 1, bottom 1, lag reading 0
 - meta-1: steady 0, tail
 - meta-2: steady 1, head
 - perform full check last-txn-id with meta-pages
Traversal b-tree...
 - found 'lmdb' area
     meta-span 0[2] of lmdb: header 32, payload 288, unused 7872
 - found '@_' area
     leaf-page 3 of @_: header 18, payload 72, unused 4006
mdbx_env_pgwalk failed, error -30796 MDB_CORRUPTED: Located page was wrong type

FeatureRequest: OSX support

I'm really excited about this project, it looks very promising, and my team and I are looking into building elixir/erlang bindings for this, but I'm having trouble compiling for OSX and I saw you mentioned in another issue that there is no OSX support yet.

Is there a timeline or an estimate as to when you're planning to make this OSX compatible?

Provide driver for ardb

yinqiwen/ardb#464

An error occurred during the second run (sample-mdb) mdbx_env_open: (22) Invalid argument

#include <stdio.h>
#include "mdbx.h"

int main(int argc,char * argv[])
{
	int rc;
	MDBX_env *env;
	MDBX_dbi dbi;
	MDBX_val key, data;
	MDBX_txn *txn;
	MDBX_cursor *cursor;
	char sval[32];

	rc = mdbx_env_create(&env);
	if (rc != MDBX_SUCCESS) {
		fprintf(stderr, "mdbx_env_create: (%d) %s\n", rc, mdbx_strerror(rc));
		return 0;
	}
	rc = mdbx_env_open(env, "./testdb", 0, 0664);
	if (rc != MDBX_SUCCESS) {
		mdbx_env_close(env);
		fprintf(stderr, "mdbx_env_open: (%d) %s\n", rc, mdbx_strerror(rc));
		return 0;
	}

	rc = mdbx_txn_begin(env, NULL, 0, &txn);
	if (rc != MDBX_SUCCESS) {
		fprintf(stderr, "mdbx_txn_begin: (%d) %s\n", rc, mdbx_strerror(rc));
		goto leave;
	}
	rc = mdbx_dbi_open(txn, NULL, 0, &dbi);
	if (rc != MDBX_SUCCESS) {
		fprintf(stderr, "mdbx_dbi_open: (%d) %s\n", rc, mdbx_strerror(rc));
		goto leave;
	}

	key.iov_len = sizeof(int);
	key.iov_base = sval;
	data.iov_len = sizeof(sval);
	data.iov_base = sval;

	sprintf(sval, "%03x %d foo bar", 32, 3141592);
	rc = mdbx_put(txn, dbi, &key, &data, 0);
	if (rc != MDBX_SUCCESS) {
		fprintf(stderr, "mdbx_put: (%d) %s\n", rc, mdbx_strerror(rc));
		goto leave;
	}
	rc = mdbx_txn_commit(txn);
	if (rc) {
		fprintf(stderr, "mdbx_txn_commit: (%d) %s\n", rc, mdbx_strerror(rc));
		goto leave;
	}
	rc = mdbx_txn_begin(env, NULL, MDBX_RDONLY, &txn);
	rc = mdbx_cursor_open(txn, dbi, &cursor);
	while ((rc = mdbx_cursor_get(cursor, &key, &data, MDBX_NEXT)) == 0) {
		printf("key: %p %.*s, data: %p %.*s\n",
			key.iov_base,  (int) key.iov_len,  (char *) key.iov_base,
			data.iov_base, (int) data.iov_len, (char *) data.iov_base);
	}
	mdbx_cursor_close(cursor);
	mdbx_txn_abort(txn);
leave:
	mdbx_dbi_close(env, dbi);
	mdbx_env_close(env);
	return 0;
}

Provide driver for fastonosql

Related to fastogt/fastonosql#36

Crash (Access Violation) on Windows10 while extending DB size

I could not reproduce this bug on older Windows versions (e.g. XP).
But on modern Windows 10 (1803, build 17134.407) it is easy to reproduce with ut_fpta9_thread --gtest_repeat=-1 --gtest_shuffle --gtest_break_on_failure (a unit test from libfpta).

I have found the reason: the entire mapped section is unavailable for a short time while NtExtendSection() or VirtualAlloc() executes.

It seems this bug is not in libmdbx but in Windows itself.
Nevertheless, a workaround is required.

Document the differences/goals of this fork

Or maybe they're somewhere and I didn't find it.

Anyway, it'd be better to have a README file with this and other info as well.

Opening a txn fails if env was opened with MDBX_NOTLS

I'm porting some LMDB-based code to libmdbx, but hitting an assertion failure as soon as I open the first transaction. This looks like it might be a mistake in an assertion in the library.

My code is equivalent to:

        mdbx_env_create(&_env);
        mdbx_env_set_userctx(_env, this);
        mdbx_env_set_maxdbs(_env, 20);
        mdbx_env_open(_env, string(path).c_str(), MDBX_NOTLS, 0600);
        mdbx_txn_begin(_env, nullptr, MDBX_RDONLY, &_txn);

This last line triggers an assertion failure at core.c line 5322:

    mdbx_assert(env, (txn->mt_flags &
                      ~(MDBX_RDONLY | MDBX_WRITEMAP | MDBX_SHRINK_ALLOWED |
                        MDBX_NOMETASYNC | MDBX_NOSYNC | MDBX_MAPASYNC)) == 0);

At the time, txn->mt_flags is MDBX_NOTLS | MDBX_RDONLY.

My hypothesis is that the transaction inherits the flags of the environment, but the assertion does not consider the possibility that the environment is opened with MDBX_NOTLS. I experimentally added MDBX_NOTLS to the list of flags in the assertion, and the failure went away (now I'm failing somewhere farther on...)

My checkout of libmdbx is at commit 61d2e07.

Any plans to make bindings for other langs?

An extremely interesting and promising fork of LMDB!

Are there plans to make bindings for other languages (for example, Python) in the near future?
I'm trying to get the community involved, since I fully understand that bindings are not among the priorities. But I'm afraid nothing will come of it: there are few bindings for the original LMDB, and most of them are unmaintained or "on the edge".

Segmentation fault when the `MDBX_RDONLY` flag is set and the DB file does not exist

#include "mdbx.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  (void)argc;
  (void)argv;

  int rc;
  MDBX_env *env = NULL;
  MDBX_dbi dbi = 0;
  MDBX_val key, data;
  MDBX_txn *txn = NULL;
  MDBX_cursor *cursor = NULL;
  char sval[32];

  rc = mdbx_env_create(&env);
  if (rc != MDBX_SUCCESS) {
    fprintf(stderr, "mdbx_env_create: (%d) %s\n", rc, mdbx_strerror(rc));
    goto bailout;
  }
  mdbx_env_set_mapsize(env, 256 * 1024 * 1024);
  mdbx_env_set_maxdbs(env, 256);
  rc = mdbx_env_open(env, "./example-db", MDBX_NOSUBDIR | MDBX_COALESCE | MDBX_LIFORECLAIM, 0664);
  if (rc != MDBX_SUCCESS) {
    fprintf(stderr, "mdbx_env_open: (%d) %s\n", rc, mdbx_strerror(rc));
    goto bailout;
  }

  rc = mdbx_txn_begin(env, NULL, MDBX_RDONLY, &txn);
  if (rc != MDBX_SUCCESS) {
    fprintf(stderr, "mdbx_txn_begin: (%d) %s\n", rc, mdbx_strerror(rc));
    goto bailout;
  }
  rc = mdbx_dbi_open(txn, "test", MDBX_CREATE, &dbi);
  if (rc != MDBX_SUCCESS) {
    fprintf(stderr, "mdbx_dbi_open: (%d) %s\n", rc, mdbx_strerror(rc));
    goto bailout;
  }

  key.iov_len = sizeof(int);
  key.iov_base = sval;
  data.iov_len = sizeof(sval);
  data.iov_base = sval;

  sprintf(sval, "%03x %d foo bar", 32, 3141592);
  rc = mdbx_put(txn, dbi, &key, &data, 0);
  if (rc != MDBX_SUCCESS) {
    fprintf(stderr, "mdbx_put: (%d) %s\n", rc, mdbx_strerror(rc));
    goto bailout;
  }
  rc = mdbx_txn_commit(txn);
  if (rc) {
    fprintf(stderr, "mdbx_txn_commit: (%d) %s\n", rc, mdbx_strerror(rc));
    goto bailout;
  }
  txn = NULL;

  rc = mdbx_txn_begin(env, NULL, MDBX_RDONLY, &txn);
  if (rc != MDBX_SUCCESS) {
    fprintf(stderr, "mdbx_txn_begin: (%d) %s\n", rc, mdbx_strerror(rc));
    goto bailout;
  }
  rc = mdbx_cursor_open(txn, dbi, &cursor);
  if (rc != MDBX_SUCCESS) {
    fprintf(stderr, "mdbx_cursor_open: (%d) %s\n", rc, mdbx_strerror(rc));
    goto bailout;
  }

  int found = 0;
  while ((rc = mdbx_cursor_get(cursor, &key, &data, MDBX_NEXT)) == 0) {
    printf("key: %p %.*s, data: %p %.*s\n", key.iov_base, (int)key.iov_len,
           (char *)key.iov_base, data.iov_base, (int)data.iov_len,
           (char *)data.iov_base);
    found += 1;
  }
  if (rc != MDBX_NOTFOUND || found == 0) {
    fprintf(stderr, "mdbx_cursor_get: (%d) %s\n", rc, mdbx_strerror(rc));
    goto bailout;
  } else {
    rc = MDBX_SUCCESS;
  }
bailout:
  if (cursor)
    mdbx_cursor_close(cursor);
  if (txn)
    mdbx_txn_abort(txn);
  if (dbi)
    mdbx_dbi_close(env, dbi);
  if (env)
    mdbx_env_close(env);
  return (rc != MDBX_SUCCESS) ? EXIT_FAILURE : EXIT_SUCCESS;
}

Fully-fledged and impeccable testset

supports both Linux and Windows environments;
supports all of libmdbx's bits/modes and features;
provides at least the same set of testing features as ioarena;
supersedes ioarena in concurrency testing;

Setup CD to build libmdbx for mainstream platforms

mdbx.NET ships libmdbx binaries along with its package management. mdbx.NET is a .NET Standard assembly and it is compatible for Linux/Mac as well. When mdbx.NET starts, it checks OS type and loads corresponding edition binary automatically. Hence its user does not care about deployment at all.

For this reason, mdbx.NET needs to embed libmdbx binaries for the mainstream platforms, e.g. Mac OS, Windows, and Linux distributions (Debian / CentOS / Alpine). Currently I build libmdbx manually, and only for Windows x64/x86. I want to include other platforms, but building libmdbx on all of them would cost a lot of time. And that is not a one-time job.

It would be better to have an automated CI process that builds libmdbx for each platform in a Docker container and publishes the binaries directly in the Releases section on GitHub.

Nested transactions are broken now.

mdbx.h: conflicting declaration of mode_t

In "mdbx.h", the declaration

typedef unsigned mode_t;

conflicts with the declaration of the same type in <sys/types.h>

typedef _mode_t mode_t;

The error was detected by GCC 7.1 (Mingw-w64) on Windows 7 - x64.

Changing the conflicting declaration to

typedef unsigned short mode_t;

seems to compile.
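
One way to avoid this kind of clash is to define the type only where the toolchain lacks it, and to reuse the system definition everywhere else. The fragment below is a minimal sketch, assuming that MSVC is the only target without mode_t (not verified against every Windows toolchain); example_mode_t is a hypothetical name chosen for illustration, not a claim about what mdbx.h does.

```c
#include <assert.h>

#if defined(_MSC_VER)
/* Assumption: MSVC does not provide mode_t, so a local type is needed. */
typedef unsigned short example_mode_t;
#else
/* MinGW and POSIX systems already declare mode_t here; reuse it
 * instead of redeclaring it with a possibly different width. */
#include <sys/types.h>
typedef mode_t example_mode_t;
#endif

/* Compile-time sanity check: the type must hold at least 16 bits. */
static_assert(sizeof(example_mode_t) >= sizeof(unsigned short),
              "example_mode_t is too narrow for permission bits");
```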

Reimplement MDBX on the 1Hippeus basis

The 1Hippeus project provides outstanding management of shared memory, in-memory message queues, and zero-overhead IPC with a zero-copy infrastructure.

Using modified ART trees (http://www3.informatik.tu-muenchen.de/~leis/papers/ART.pdf) may also be a very good idea, but it needs more study.

Best practice of mdbx_txn_commit()

It turns out mdbx_txn_commit() is the most IO-bound operation. On my machine it takes 200-300 ms even when there is only one mdbx_put in each txn.

I want changes to accumulate in a single txn, and the txn not to be committed until mdbx_put cannot accept more changes.

int err = mdbx_put(txn, dbi, key, value, 0);
if (MDBX_TXN_FULL == err || MDBX_MAP_FULL == err) {
     mdbx_txn_commit(txn);  // now commit the transaction
     mdbx_txn_begin(env, NULL, 0, &txn); // start a new transaction
     // add the failed one; it will be committed when MDBX_TXN_FULL or MDBX_MAP_FULL occurs again
     mdbx_put(txn, dbi, key, value, 0);
}

But mdbx_txn_commit fails with error MDBX_BAD_TXN.
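
The MDBX_BAD_TXN result is consistent with LMDB-style behaviour, where a write transaction that has hit an error is likely marked as failed and can only be aborted. An alternative is to commit after a fixed number of successful puts instead of waiting for a failure. The sketch below shows only the counting logic. batch_writer, batch_note_put, batch_flush and count_commit are hypothetical names, and the commit callback is a stand-in for what real code would do with mdbx_txn_commit() followed by mdbx_txn_begin().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for "commit the current txn and begin a new one".
 * Must return 0 on success. */
typedef int (*commit_fn)(void *ctx);

typedef struct {
  size_t pending;   /* successful puts since the last commit */
  size_t limit;     /* commit once this many puts are pending */
  commit_fn commit; /* called to end the current batch */
  void *ctx;        /* opaque argument passed to commit */
} batch_writer;

/* Call after every successful put; commits when the batch is full. */
static int batch_note_put(batch_writer *w) {
  assert(w != NULL && w->limit > 0 && w->commit != NULL);
  if (++w->pending < w->limit)
    return 0;
  w->pending = 0;
  return w->commit(w->ctx);
}

/* Call once at the end to commit a partially filled batch, if any. */
static int batch_flush(batch_writer *w) {
  assert(w != NULL && w->commit != NULL);
  if (w->pending == 0)
    return 0;
  w->pending = 0;
  return w->commit(w->ctx);
}

/* Demo callback: only counts how many commits were requested. */
static int count_commit(void *ctx) {
  ++*(int *)ctx;
  return 0;
}
```

The batch size trades commit latency against the amount of work lost if the process dies before the next commit; the right value depends on the workload and is not something this sketch can decide.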

error: "MDBX_VERSION_MINOR" redefined

The

make
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Configuring done
-- Generating done
-- Build files have been written to: /home/server/work/libmdbx
Scanning dependencies of target mdbx_STATIC
[  2%] Building C object CMakeFiles/mdbx_STATIC.dir/src/lck-posix.c.o
In file included from /home/server/work/libmdbx/src/./bits.h:61:0,
                 from /home/server/work/libmdbx/src/lck-posix.c:15:
/home/server/work/libmdbx/src/./../mdbx.h:168:0: error: "MDBX_VERSION_MINOR" redefined [-Werror]
 #define MDBX_VERSION_MINOR 1

Is 2-byte alignment of values guaranteed?

A minor issue I had with LMDB is that values are not 2-byte aligned. But libmdbx does appear to 2-byte-align values. Is this guaranteed, or is it an implementation detail you don't want clients to depend on? Either way, it would be nice to document that.

Windows support for `libfpta`

Mostly done in the devel branch, but more testing is needed.

test-internals broken on Windows now

Support building for Windows by MinGW and (maybe) by other compilers.

Note that at least one additional set of tests must be implemented and integrated into appveyor.yml:

validation of cleanup on thread termination (and so on), with respect to glibc bugs #21031 and #21032.
validation of interoperability with libmdbx.dll and libmdbx.lib, which were built by MSVC 2013/2015/2017.

Database corruption by cross-merging LEAF and BRANCH pages (inherited from LMDB)

This bug was discovered by a long-running test that adds/updates/deletes key-value pairs with stochastically chosen key and data lengths.

Running such a test with a transaction count in the range from 100,000 to 1,000,000 fails with a probability of about 0.1-1%.

The investigation showed that the damage to the database structure is due to erroneous merging of pages of different types, i.e. merging LEAF into BRANCH or BRANCH into LEAF.

Sparse file support (mark garbage pages as "holes")

Some filesystems allow page-bounded regions of a file to be "holes" that look as though they're filled with zeroes but actually occupy no disk space. This is of course ideal for free space in a db file.

The last few releases of macOS and iOS have a new filesystem (APFS) that supports this. Holes are created by calling fcntl(fd, F_PUNCHHOLE, ...). It would be great if libmdbx would support this, as it would nearly eliminate wasted space in databases, without compaction.

The drawback is that overwriting a hole can cause a disk-full error even if the file doesn't grow, which isn't otherwise possible, so there may be some new edge cases to handle with nearly-full volumes.

(I know some Linux filesystems support this too, but I don't know the API. I think it's different, since googling for "F_PUNCHHOLE" turned up only Mac and iOS-related info.)
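
For reference, the Linux counterpart is fallocate() with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE; whether it succeeds depends on the filesystem. The sketch below is Linux-only and is not taken from libmdbx: punch_hole and punch_hole_demo are hypothetical helpers that punch a hole in a temporary file and check that the range reads back as zeroes.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for fallocate() and FALLOC_FL_* in glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Deallocates [offset, offset + length) without changing the file size.
 * Returns 0 on success, otherwise the errno value (for example
 * EOPNOTSUPP when the filesystem cannot punch holes). */
static int punch_hole(int fd, off_t offset, off_t length) {
  assert(fd >= 0 && offset >= 0 && length > 0);
  if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset,
                length) == 0)
    return 0;
  return errno;
}

/* Writes 8 KiB of 'x', punches the first 4 KiB, verifies the result.
 * Returns 0 if verified, a positive errno if punching is unsupported,
 * and -1 on any other failure. */
static int punch_hole_demo(void) {
  char path[] = "/tmp/punch-demo-XXXXXX";
  char buf[8192];
  int rc = -1;
  int fd = mkstemp(path);
  if (fd < 0)
    return -1;
  unlink(path); /* the file disappears once fd is closed */
  memset(buf, 'x', sizeof(buf));
  if (write(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf)) {
    rc = punch_hole(fd, 0, 4096);
    if (rc == 0) {
      if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
        rc = -1;
      } else {
        for (size_t i = 0; i < 4096; ++i)
          if (buf[i] != 0)
            rc = -1; /* punched range must read as zeroes */
        if (buf[4096] != 'x')
          rc = -1; /* data after the hole must be untouched */
      }
    }
  }
  close(fd);
  return rc;
}
```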

UTF-8-encoded paths support for win-32/64

At this moment mdbx_env_open calls CreateDirectoryA and mdbx_openfile calls CreateFileA, thus it's impossible to open a DB using a path containing Unicode characters (e.g. username in Windows home folder).

It would be great to treat the path as a UTF-8 encoded string in the functions above, so it would work the same way on Windows and Linux/Mac, as it was implemented in lmdb (they use utf8_to_utf16 and then CreateFileW).
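
On Windows the conversion itself would normally be delegated to MultiByteToWideChar() with CP_UTF8, with the result passed to CreateFileW(). Since that cannot be run elsewhere, here is a portable sketch of the UTF-8 to UTF-16 step only. utf8_to_utf16_sketch is a hypothetical helper written for this illustration; it is not lmdb's function of a similar name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts NUL-terminated UTF-8 into NUL-terminated UTF-16 code units.
 * Returns the number of code units written (terminator excluded), or -1
 * on malformed input or when dst_cap is too small. */
static long utf8_to_utf16_sketch(const char *src, uint16_t *dst,
                                 size_t dst_cap) {
  /* Smallest code point allowed for each sequence length (by number of
   * continuation bytes); anything lower is an overlong encoding. */
  static const uint32_t min_cp[4] = {0, 0x80, 0x800, 0x10000};
  const unsigned char *s = (const unsigned char *)src;
  size_t n = 0;
  assert(src != NULL && dst != NULL);
  while (*s) {
    uint32_t cp;
    int extra;
    if (*s < 0x80) { cp = *s; extra = 0; }
    else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
    else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
    else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
    else return -1; /* stray continuation byte or invalid lead byte */
    ++s;
    for (int i = 0; i < extra; ++i, ++s) {
      if ((*s & 0xC0) != 0x80)
        return -1; /* truncated sequence (this also catches the NUL) */
      cp = (cp << 6) | (uint32_t)(*s & 0x3F);
    }
    if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return -1; /* overlong, out of range, or an encoded surrogate */
    if (cp < 0x10000) {
      if (n + 1 >= dst_cap)
        return -1; /* keep room for the terminator */
      dst[n++] = (uint16_t)cp;
    } else {
      if (n + 2 >= dst_cap)
        return -1;
      cp -= 0x10000; /* encode as a surrogate pair */
      dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
      dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    }
  }
  if (n >= dst_cap)
    return -1;
  dst[n] = 0;
  return (long)n;
}
```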

non-WRITEMAP mode is broken on OpenBSD (this is OpenBSD kernel's bugfeature)

On OpenBSD, writes to the file descriptor of an mmap'ed file are not (always and/or immediately) visible in the corresponding mmap'ed memory region.

For this reason:

all tests in modes with the WRITEMAP flag always pass.
otherwise (in modes without the WRITEMAP flag) most tests fail with very high probability.
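
The coherence property described above can be probed with a small standalone check: map a file with MAP_SHARED, write through the file descriptor, and compare. This is a minimal sketch using only POSIX calls; mmap_sees_pwrite is a hypothetical helper, not part of the libmdbx test suite, and a single passing run does not prove coherence in general.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if data written with pwrite() is immediately visible through
 * a shared read-only mapping of the same file, 0 if it is not, and -1 if
 * the probe could not be set up. */
static int mmap_sees_pwrite(void) {
  char path[] = "/tmp/mmap-coherence-XXXXXX";
  static const char msg[] = "written via fd";
  const size_t len = 4096;
  int result = -1;
  int fd = mkstemp(path);
  if (fd < 0)
    return -1;
  unlink(path); /* the file disappears once fd is closed */
  if (ftruncate(fd, (off_t)len) == 0) {
    char *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map != MAP_FAILED) {
      /* Touch the page first so it is already resident in the mapping
       * before the write goes through the descriptor. */
      volatile char before = map[0];
      (void)before;
      if (pwrite(fd, msg, sizeof(msg), 0) == (ssize_t)sizeof(msg))
        result = memcmp(map, msg, sizeof(msg)) == 0;
      munmap(map, len);
    }
  }
  close(fd);
  return result;
}
```

On systems with a unified page cache the function is expected to return 1; a return value of 0 would match the behaviour reported for OpenBSD.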

The current version of MDBX is slower than the original LMDB.

This is a warning rather than an error.

MDBX is currently under development and has no releases yet, and it includes a few additional checks, TODOs and bottlenecks.
Please wait for the release to compare performance and feature set.

mdbx_sync_locked: Assertion `!(((head)->mm_datasync_sign) > 1u) || env->me_sync_pending != 0' failed.

Version 0.3.1.
This happens when using MDBX_CREATE | MDBX_DUPFIXED:

mdbx:5338: mdbx_sync_locked: Assertion `!(((head)->mm_datasync_sign) > 1u) || env->me_sync_pending != 0' failed.

Source:

	rc = mdbx_env_create(&intern->env);
	if (rc != MDBX_SUCCESS) {
		printf("%s\n", mdbx_strerror(rc));
		return;
	}

	mdbx_env_set_maxreaders(intern->env, 1024);
	mdbx_env_set_mapsize(intern->env, 1024 * 1024 * 1024);
	mdbx_env_set_maxdbs(intern->env, 256);

	rc = mdbx_env_open(intern->env, "lmdbfrontend/", 0, 0664);
	if (rc != MDBX_SUCCESS) {
		printf("%s\n", mdbx_strerror(rc));
		return;
	}

	rc = mdbx_txn_begin(intern->env, NULL, 0, &intern->txn);
	if (rc != MDBX_SUCCESS) {
		printf("%s\n", mdbx_strerror(rc));
		return;
	}

	rc = mdbx_dbi_open(intern->txn, NULL, MDBX_CREATE | MDBX_DUPSORT | MDBX_DUPFIXED, &intern->dbi);
	if (rc != MDBX_SUCCESS) {
		printf("%s\n", mdbx_strerror(rc));
		return;
	}
	rc = mdbx_txn_commit(intern->txn);

Translate README to English

An error occurs when committing the second write

The first and second runs write the same 100000 items. If the second run writes only 1 item, it succeeds, and changing it back to 100000 afterwards also succeeds.

mdbx_env_set_mapsize(env, 256 * 1024 * 1024);
mdbx_env_set_maxdbs(env, 256);
rc = mdbx_env_open(env, "./example-db", 0, 0664);

bits.h:1122: NODEPTR: Assertion `NUMKEYS(p) > (unsigned)(i)' failed.

(gdb) bt
#0  0x00007ffff6428428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff642a02a in __GI_abort () at abort.c:89
#2  0x00007ffff6420bd7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x7fffec062fbe "NUMKEYS(p) > (unsigned)(i)", file=file@entry=0x7fffec062cee "./storage/libmdbx/bits.h", 
    line=line@entry=1122, function=function@entry=0x7fffec064170 <__PRETTY_FUNCTION__.32848> "NODEPTR") at assert.c:92
#3  0x00007ffff6420c82 in __GI___assert_fail (assertion=assertion@entry=0x7fffec062fbe "NUMKEYS(p) > (unsigned)(i)", file=file@entry=0x7fffec062cee "./storage/libmdbx/bits.h", line=line@entry=1122, 
    function=function@entry=0x7fffec064170 <__PRETTY_FUNCTION__.32848> "NODEPTR") at assert.c:101
#4  0x00007fffec029e5b in NODEPTR (i=0, p=<optimized out>) at ./storage/libmdbx/bits.h:1122
#5  0x00007fffec032f8f in NODEPTR (i=0, p=<optimized out>) at /home/server/work/cphalcon7/ext/storage/libmdbx/mdbx.c:6533
#6  mdbx_cursor_first (mc=mc@entry=0x7fffffff9c20, key=key@entry=0x7fffffff9b50, data=data@entry=0x0) at /home/server/work/cphalcon7/ext/storage/libmdbx/mdbx.c:7026
#7  0x00007fffec0341a5 in mdbx_cursor_get (mc=mc@entry=0x7fffffff9c20, key=key@entry=0x7fffffff9b50, data=data@entry=0x0, op=op@entry=MDBX_FIRST)
    at /home/server/work/cphalcon7/ext/storage/libmdbx/mdbx.c:7228
#8  0x00007fffec031262 in mdbx_page_alloc (num=num@entry=2, mp=mp@entry=0x7fffffff9df0, flags=flags@entry=15, mc=<optimized out>, mc=<optimized out>)
    at /home/server/work/cphalcon7/ext/storage/libmdbx/mdbx.c:2204
#9  0x00007fffec034ae0 in mdbx_page_new (mc=mc@entry=0x7fffffffa0a0, flags=flags@entry=4, num=2, mp=mp@entry=0x7fffffff9e60) at /home/server/work/cphalcon7/ext/storage/libmdbx/mdbx.c:8022
#10 0x00007fffec03506d in mdbx_node_add (mc=mc@entry=0x7fffffffa0a0, indx=0, key=key@entry=0x7fffffffa080, data=<optimized out>, pgno=pgno@entry=0, flags=<optimized out>)
    at /home/server/work/cphalcon7/ext/storage/libmdbx/mdbx.c:8169
#11 0x00007fffec0396c0 in mdbx_cursor_put (mc=mc@entry=0x7fffffffa0a0, key=key@entry=0x7fffffffa080, data=data@entry=0x7fffffffa090, flags=flags@entry=65536)
    at /home/server/work/cphalcon7/ext/storage/libmdbx/mdbx.c:7768
#12 0x00007fffec03bc1d in mdbx_freelist_save (txn=0x555555f5b170) at /home/server/work/cphalcon7/ext/storage/libmdbx/mdbx.c:3614
#13 mdbx_txn_commit (txn=0x555555f5b170) at /home/server/work/cphalcon7/ext/storage/libmdbx/mdbx.c:4194

Unable to create a read-only transaction while a write transaction is open

MDBX doesn't allow opening a read-only (MDBX_RDONLY) transaction if a write transaction is already open (LMDB allows that).

Code snippet to illustrate the issue:

MDBX_txn* txn;
int rc1 = ::mdbx_txn_begin(env, NULL, 0, &txn);

MDBX_txn* txn2;
int rc2 = ::mdbx_txn_begin(env, NULL, MDBX_RDONLY, &txn2); // << returns MDBX_BUSY

This behavior was introduced by commit 6b55333.

Latest developments in LSM

Not sure whether this is the right place.
I came across this discussion:
https://forum.golangbridge.org/t/leveldb-written-in-go-build-from-scratch/4431
In the end, this guy built his own storage engine: https://github.com/dgraph-io/badger
It is based on the latest developments in LSM (as I understood from the README, libmdbx uses LSM too):
https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf
Judging by what he achieved
https://blog.dgraph.io/post/badger/
Badger has beaten RocksDB hands down.
And RocksDB is one of the fastest storage engines, judging by these tests:
https://www.influxdata.com/benchmarking-leveldb-vs-rocksdb-vs-hyperleveldb-vs-lmdb-performance-for-influxdb/

This might be useful if you weren't aware of it.

rust bindings by bindgen

https://github.com/rust-lang/rust-bindgen

Tag a (new) Release

Hey @leo-yuriev,

I'd like to use this as a vendored dep in a project, but the last release was eons of commits ago. Could you tag a release at the latest point that you think feels reasonable? It'd be great to be able to refer to a proper tagged release as opposed to just pulling HEAD or some random commit hash.

Thanks!

About the relocation function

Have you thought about implementing the currently unimplemented LMDB mdb_set_relfunc?

I thought about that because in Rust (and probably many other languages) interpreting a value through a type whose alignment it does not satisfy is undefined behavior. Therefore, when reading entries from LMDB/MDBX, we need to check the alignment and reallocate the data if it is badly aligned. With this relocation feature on the database side, it would have been much easier to deal with alignment.
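
The client-side check described here can be sketched in C as follows: test the pointer against the required alignment, and copy the bytes out instead of dereferencing when the check fails. is_aligned and load_u32 are hypothetical helpers for illustration; neither is part of LMDB or libmdbx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns nonzero if p is a multiple of alignment (a power of two). */
static int is_aligned(const void *p, size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Reads a uint32_t from storage of any alignment. memcpy has no
 * alignment requirement, so this stays well defined where a direct
 * *(const uint32_t *)p would not. */
static uint32_t load_u32(const void *p) {
  uint32_t v;
  memcpy(&v, p, sizeof(v));
  return v;
}
```

A caller could use is_aligned() to decide whether a zero-copy view is safe and fall back to load_u32() (or a bulk memcpy into aligned storage) otherwise.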

Docs/question: What are the intended/optimized-for use cases for the next version?

I see mdbx targets more than POSIX platforms (not reflected in the README?), so it's not going to be limited to/optimized for a specific kernel's memory management style?
If I just want a sparse/hash set, does that affect some limit on the number of keys in the database?
What is it not designed/good for? Highly parallel writes? Larger-than-RAM random-access reads?

Address-Sanitizer error, after reopening a db

One of my tests closes a database and then re-opens it. During the second call to mdbx_env_open, the Clang Address Sanitizer reports an invalid memory access and aborts the process:

==18372==ERROR: AddressSanitizer: use-after-poison on address 0x00010fc0401c at pc 0x0001001631a5 bp 0x7ffeefbed280 sp 0x7ffeefbed278
READ of size 8 at 0x00010fc0401c thread T0
atos(18374,0x1000e0dc0) malloc: enabling scribbling to detect mods to free blocks
2019-12-27 13:47:51.699425-0800 atos[18374:3016390] examining /Users/USER/*/FleeceDB [18372]
    #0 0x1001631a4 in meta_txnid core.c:912
    #1 0x100195a61 in mdbx_meta_txnid_fluid core.c:3359
    #2 0x1002feebd in mdbx_meta_eq core.c:3436
    #3 0x10033fa53 in mdbx_meta_eq_mask core.c:3453
    #4 0x100e1caad in mdbx_setup_dxb core.c:8420
    #5 0x100e06cfa in mdbx_env_open core.c:9127
    ...

There's a comment in core.c (just above stack frame 4) that says "AddressSanitizer (at least GCC 7.x, 8.x) could generate false-positive alarm here. I have no other explanation for this except due to an internal ASAN error..."

If I comment out the ASAN_POISON_MEMORY_REGION call a few lines above, the problem goes away and the test completes.

I understand that this is probably a false positive, but it's likely to get in the way of my development since I always use the Address Sanitizer while developing. Having to turn it off to avoid this issue would make it harder for me to make my own code reliable. So it would be nice to get to the bottom of this and find a workaround.

ERROR_INVALID_ADDRESS while extending DB under Windows

It seems that some internals of virtual memory management changed in Windows 10 1703 (Fall Creators Update).

In any case, the current method of extending the file mapping no longer works.

So either a new approach is required, or we should abandon dynamic changes of database size under Windows.

Q: Performance/size impact of custom key comparisons

Hope you don't mind another question. Does using a custom MDBX_cmp_func key comparison function lower performance (aside from the time taken by the custom function itself) or increase database size?

It looks like there's no overhead for calling the function — both built-in and custom comparisons are called through the md_cmp function pointer — unless I missed something.

But from the bit I know about B-trees, I suspect that key prefix/suffix compression might not be possible if the keys aren't in a predefined order known to the B-tree manager. So if my custom-sorted keys have a lot of common prefixes, they might not be packed into nodes as efficiently. True?
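
For concreteness, this is the general shape of a comparator that would be invoked through such a function pointer: two key descriptors in, a negative/zero/positive int out. The sketch models keys with POSIX struct iovec, an assumption based on the iov_base/iov_len fields used in the samples above; cmp_len_then_bytes is a hypothetical ordering and the code does not call libmdbx. It says nothing about how pages are packed.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>

/* Orders shorter keys before longer ones; keys of equal length are
 * compared bytewise. Returns <0, 0 or >0 like memcmp(). */
static int cmp_len_then_bytes(const struct iovec *a, const struct iovec *b) {
  assert(a != NULL && b != NULL);
  if (a->iov_len != b->iov_len)
    return a->iov_len < b->iov_len ? -1 : 1;
  /* Avoid calling memcmp() with possibly NULL pointers for empty keys. */
  return a->iov_len ? memcmp(a->iov_base, b->iov_base, a->iov_len) : 0;
}
```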

AIX environment is needed for testing (i.e. ssh access)

Multithread deadlock on Windows (windows OS kernel bug)

This seems to be a Windows kernel bug: a deadlock occurs when LockFileEx() is called from one thread and WriteFile() from another.

First thread:

nt!KiSwapContext+0x7a
nt!KiCommitThreadWait+0x1d2
nt!KeWaitForSingleObject+0x19f
nt!IopSynchronousServiceTail+0x2a9
nt!NtLockFile+0x514
nt!KiSystemServiceCopyEnd+0x13
0x77bee1ea

Second thread:

nt!KiSwapContext+0x7a
nt!KiCommitThreadWait+0x1d2
nt!KeWaitForSingleObject+0x19f
nt!IopAcquireFileObjectLock+0x84
nt! ?? ::NNGAKEGL::`string'+0x491d5
nt!KiSystemServiceCopyEnd+0x13
0x77bed43a

But there is no reason for IopAcquireFileObjectLock() to be blocked.

Make libmdbx SWIG compatible

I've tried to bind libmdbx to Python using SWIG, with no success, since mdbx.h does not contain all the necessary definitions, such as MDBX_env.
Can you make mdbx.h suitable for use with SWIG so that it produces correct output files?

When I tried to call mdbx.create_env(env) from Python, I got:
TypeError: in method 'mdbx_env_create', argument 1 of type 'MDBX_env **'
But there is no way to obtain the MDBX_env ** type, as these types are not imported by SWIG.

Short-term TODO list

Use CreateFileW() for Windows.
Use boot_id to decide whether to rollback unsynced changes when opening the database.
"Pickup" mode for opening DB (i.e. use the mode in which the database is already opened by another process).
Provide --help for mdbx_test.
Use statfs(), statvfs() to prevent opening DB in non-exclusive mode over a network.
Check/fix build on Elbrus arch (E2K).
Check/fix builds for OpenBSD, NetBSD, DragonFly, Solaris.
Update driver for ioarena (switch to cmake).
Packaging?
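
As an illustration of the statfs() item in the list above, a Linux-only check could compare the filesystem type against known network-filesystem magic numbers. This is a sketch under stated assumptions: is_network_fs is a hypothetical helper, the magic values are the ones documented in the Linux statfs(2) manual page, and the list is deliberately incomplete (other network or cluster filesystems are not covered).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/vfs.h>

/* Filesystem magic numbers as listed in the Linux statfs(2) man page. */
static const unsigned long MAGIC_NFS = 0x6969ul;
static const unsigned long MAGIC_SMB = 0x517Bul;
static const unsigned long MAGIC_CIFS = 0xFF534D42ul;

/* Returns 1 if path lives on NFS/SMB/CIFS, 0 if it does not,
 * and -1 if statfs() fails (for example, the path does not exist). */
static int is_network_fs(const char *path) {
  struct statfs st;
  assert(path != NULL);
  if (statfs(path, &st) != 0)
    return -1;
  /* f_type may be a signed long; keep only the low 32 bits so the
   * comparison with values above 0x7FFFFFFF is well behaved. */
  unsigned long type = (unsigned long)st.f_type & 0xFFFFFFFFul;
  return type == MAGIC_NFS || type == MAGIC_SMB || type == MAGIC_CIFS;
}
```

A caller might refuse non-exclusive mode, or merely warn, when the function returns 1; which policy is appropriate is left open here.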