Add an async flag to the DB to set whether writes are

Async Writes about bolt HOT 10 CLOSED

boltdb commented on August 26, 2024

Async Writes

from bolt.

Comments (10)

tv42 commented on August 26, 2024

My understanding is that the writes are currently all async. I don't see an fsync or fdatasync call in the source. This means two things:

boltdb, as it is right now, can lose data on power failure / kernel crash
this ticket should be titled "Sync writes" ;)

Also, whether sync or async, the writes need to be careful about ordering; with the current POSIX APIs, that means bolt can't write the new meta page until all the dependencies have actually hit the disk.

(And if you're thinking about the actual async AIO api, just don't -- it's not worth the trouble.)

tv42 commented on August 26, 2024

And now I see O_SYNC. That's probably over-eager. I guess now I see what you mean. Sorry for the noise. This ticket is correct.

So, as far as I can see, what you need is

commit: write non-meta pages, fdatasync, write meta page, fdatasync or sync_file_range
size change: fsync
create: fsync, fsync containing dir

and with those, O_SYNC isn't needed.

As has happened before, I'm surprised by the quality I see in BoltDB. Good job!

benbjohnson commented on August 26, 2024

lol, thanks for the ticket. Bolt actually implements what LMDB calls METASYNC. Only the meta page is written with the sync file descriptor. The other pages are being written without O_SYNC.

I'm not sure how to do testing for this yet. Or even if I can do testing without unplugging my hard drive's power source.

The async mode could result in lost data, but it's mainly there for someone who wants to implement a WAL (or who doesn't really care about lost data during failures).

tv42 commented on August 26, 2024

To my best reading, that mode of lmdb works like this:

me_fd is opened normally
me_mfd is opened O_SYNC
non-meta pages are flushed to disk with just writes
MDB_FDATASYNC(env->me_fd)
meta page written with O_SYNC

(The above is a good setup because it lets the kernel write the non-meta pages in arbitrary order.)

But that's not safe without the fdatasync! If you don't have the fdatasync in the above, you can end up with this:

App:

submit write for non-meta pages X, Y, Z
submit write for meta page A
start waiting for meta write to complete
power loss

Disk:

write Z
write A
power loss

Now you have a committed transaction pointing to garbage.
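
The scenario above can be modeled with a toy simulation. This is only an illustration: the `write` type and `danglingRefs` function are made up for the sketch, and the "disk" is just the list of writes that happened to persist before power loss.

```go
package main

import "fmt"

// write is a hypothetical record of one page write submitted by the app.
type write struct {
	page string
	meta bool     // true if this is a meta page
	refs []string // pages a meta page points at
}

// danglingRefs takes the writes that reached the disk before power loss and
// returns every page that a persisted meta page references but that never
// made it to disk.
func danglingRefs(persisted []write) []string {
	onDisk := map[string]bool{}
	for _, w := range persisted {
		onDisk[w.page] = true
	}
	var missing []string
	for _, w := range persisted {
		if !w.meta {
			continue
		}
		for _, r := range w.refs {
			if !onDisk[r] {
				missing = append(missing, r)
			}
		}
	}
	return missing
}

func main() {
	x, y, z := write{page: "X"}, write{page: "Y"}, write{page: "Z"}
	a := write{page: "A", meta: true, refs: []string{"X", "Y", "Z"}}

	// No barrier: the disk persisted only Z and A before losing power.
	fmt.Println(danglingRefs([]write{z, a})) // prints [X Y]

	// With a sync between the data pages and the meta page, X, Y and Z are
	// all on disk (in any order) before A is even submitted.
	fmt.Println(len(danglingRefs([]write{y, z, x, a}))) // prints 0
}
```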

tv42 commented on August 26, 2024

Not really related to this ticket but now that I brought it up: here's a commit that adds the fdatasync/fsync: tv42/bolt@5ce378b

benbjohnson commented on August 26, 2024

@tv42 Thanks for the fdatasync() changes. I merged them in via #76.

LMDB says that MDB_NOSYNC preserves ACI of ACID if the file system preserves write order:

 *  <li>#MDB_NOSYNC
 *      Don't flush system buffers to disk when committing a transaction.
 *      This optimization means a system crash can corrupt the database or
 *      lose the last transactions if buffers are not yet flushed to disk.
 *      The risk is governed by how often the system flushes dirty buffers
 *      to disk and how often #mdb_env_sync() is called.  However, if the
 *      filesystem preserves write order and the #MDB_WRITEMAP flag is not
 *      used, transactions exhibit ACI (atomicity, consistency, isolation)
 *      properties and only lose D (durability).  I.e. database integrity
 *      is maintained, but a system crash may undo the final transactions.
 *      Note that (#MDB_NOSYNC | #MDB_WRITEMAP) leaves the system with no
 *      hint for when to write transactions to disk, unless #mdb_env_sync()
 *      is called. (#MDB_MAPASYNC | #MDB_WRITEMAP) may be preferable.
 *      This flag may be changed at any time using #mdb_env_set_flags().

But I need to read up on that further to understand it better. I'm not sure how the meta can be in sync but the data pages not be sync'd and you'd only lose the previous transaction.

I'm wondering if an "async" mode is even a good idea for Bolt. It seems that with fdatasync() the file system can still optimize the write order, and it can be left up to the end user to bulk load or coalesce transactions as needed. That way everything in Bolt is ACID.

What do you think?

tv42 commented on August 26, 2024

"However, if the filesystem preserves write order" -- yeah, it's not going to (yes in memory, not on disk), so most of that paragraph is irrelevant. Not sure what the LMDB authors were thinking of. The writes will go in the buffer cache, which will flush them out in somewhat arbitrary order, and the IO scheduler can explicitly reorder them to minimize seeking. If you want ordering, you use fdatasync/sync_file_range etc.

I can only think of two settings where async commits make sense: 1) Redis-style "I don't care about my data" and 2) distributed systems that can set policies like "on disk on 1 node and in memory on 2".

1) is probably not an ideal fit for Bolt anyway, because natively in-memory systems will probably always be faster, and the single-writer limit of Bolt is probably going to get in the way. Plus, the hybrid between the two worlds is a silly thing to want.
2) can probably just use an in-memory queue of operations to be done, and a single goroutine flushing them out to Bolt, including batching multiple operations into one Bolt transaction.

I wouldn't burn any effort in worse durability guarantees. To me personally, Bolt is valuable because it's simple, has a good API, and performs well for what it is.

benbjohnson commented on August 26, 2024

@tv42 That makes sense. Thanks for all the feedback! I'm going to close this one out and keep Bolt simple and ACID compliant. A "no sync" option can be implemented by the end user as a cache or WAL or whatever. :)

akotlar commented on August 26, 2024

Don't mean to reopen the issue, wanted to follow up on @tv42's comments that file systems do not preserve write order. Is this true? ext4, one of the more commonly used fs, explicitly states that it does (default data=ordered).

https://www.kernel.org/doc/Documentation/filesystems/ext4.txt

tv42 commented on August 26, 2024

data=ordered	(*)	All data are forced directly out to the main file
			system prior to its metadata being committed to the
			journal.

Nothing in that says write to data block A is done before write to data block B, just that both data block writes are done before the corresponding metadata is written to the journal.
