
Comments (10)

tv42 commented on August 26, 2024

My understanding is that the writes are currently all async. I don't see an fsync or fdatasync call in the source. This means two things:

  1. boltdb, as it is right now, can lose data on power failure / kernel crash
  2. this ticket should be titled "Sync writes" ;)

Also, whether sync or async, the writes need to be careful about ordering; with the current POSIX APIs, that means bolt can't write the new meta page until all the dependencies have actually hit the disk.

(And if you're thinking about the actual async AIO api, just don't -- it's not worth the trouble.)

from bolt.

tv42 commented on August 26, 2024

And now I see O_SYNC. That's probably over-eager. I guess now I see what you mean. Sorry for the noise. This ticket is correct.

So, as far as I can see, what you need is

  1. commit: write non-meta pages, fdatasync, write meta page, fdatasync or sync_file_range
  2. size change: fsync
  3. create: fsync, fsync containing dir

and with those, O_SYNC isn't needed.
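
The create and commit sequences listed above can be sketched in Go roughly as follows. This is only an illustration under stated assumptions, not Bolt's actual code: `createDB`, `commit`, `syncDir`, the 4096-byte page size, and the choice of page 0 as the meta page are all hypothetical. It also uses `os.File.Sync` (an fsync) everywhere for portability, which is stronger than the fdatasync the list asks for and incidentally covers the size-change case too.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

const pageSize = 4096 // assumed page size for this sketch

// syncDir fsyncs a directory so a newly created entry in it is durable.
func syncDir(dir string) error {
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

// createDB creates the file, then fsyncs the file and its containing directory.
func createDB(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_EXCL, 0600)
	if err != nil {
		return nil, err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return nil, err
	}
	if err := syncDir(filepath.Dir(path)); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

// commit writes the non-meta pages, syncs, then writes the meta page and syncs again.
// pages maps page number to contents; page 0 is the meta page in this sketch.
func commit(f *os.File, pages map[int64][]byte, meta []byte) error {
	for id, buf := range pages {
		if _, err := f.WriteAt(buf, id*pageSize); err != nil {
			return err
		}
	}
	// Barrier: every data page must be on disk before the meta page points at it.
	if err := f.Sync(); err != nil {
		return err
	}
	if _, err := f.WriteAt(meta, 0); err != nil {
		return err
	}
	return f.Sync()
}

// demo creates a database in a temp dir, commits two pages, and reads the meta page back.
func demo() error {
	dir, err := os.MkdirTemp("", "syncdemo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	f, err := createDB(filepath.Join(dir, "demo.db"))
	if err != nil {
		return err
	}
	defer f.Close()
	meta := bytes.Repeat([]byte{0xAB}, pageSize)
	pages := map[int64][]byte{
		1: bytes.Repeat([]byte{1}, pageSize),
		2: bytes.Repeat([]byte{2}, pageSize),
	}
	if err := commit(f, pages, meta); err != nil {
		return err
	}
	got := make([]byte, pageSize)
	if _, err := f.ReadAt(got, 0); err != nil {
		return err
	}
	if !bytes.Equal(got, meta) {
		return fmt.Errorf("meta page mismatch")
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("error:", err)
		os.Exit(1)
	}
	fmt.Println("committed")
}
```

The data pages are written through a plain descriptor, so the kernel is free to reorder them among themselves; only the sync between the data pages and the meta page imposes an order.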

As has happened before, I'm surprised by the quality I see in BoltDB. Good job!


benbjohnson commented on August 26, 2024

lol, thanks for the ticket. Bolt actually implements what LMDB calls METASYNC. Only the meta page is written with the sync file descriptor. The other pages are being written without O_SYNC.

I'm not sure how to do testing for this yet. Or even if I can do testing without unplugging my hard drive's power source.

The Async could result in lost data but it's mainly there if someone wants to implement a WAL (or if they don't really care about lost data during failures).


tv42 commented on August 26, 2024

To my best reading, that mode of lmdb works like this:

  • me_fd is opened normally
  • me_mfd is opened O_SYNC
  • non-meta pages are flushed to disk with just writes
  • MDB_FDATASYNC(env->me_fd)
  • meta page written with O_SYNC

(The above is a good setup because it lets the kernel write the non-meta pages in arbitrary order.)

But that's not safe without the fdatasync! If you don't have the fdatasync in the above, you can end up with this:

App:

  • submit write for non-meta pages X, Y, Z
  • submit write for meta page A
  • start waiting for meta write to complete
  • power loss

Disk:

  • write Z
  • write A
  • power loss

Now you have a committed transaction pointing to garbage.
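
The failure above can be reproduced with a toy model. The sketch below is purely illustrative: the `disk` type, its `write` and `flush` methods, and the page names are made up, and `flush` merely stands in for fdatasync. It enumerates every subset of still-pending writes that could reach the disk before power loss and checks whether a durable meta page A ever refers to missing data pages.

```go
package main

import "fmt"

// disk models a device whose writes sit in a volatile queue until flushed.
type disk struct {
	durable map[string]bool
	pending []string
}

func (d *disk) write(page string) { d.pending = append(d.pending, page) }

// flush is the fdatasync stand-in: everything pending becomes durable.
func (d *disk) flush() {
	for _, p := range d.pending {
		d.durable[p] = true
	}
	d.pending = nil
}

// consistent reports whether a durable meta page A implies durable X, Y and Z.
func consistent(state map[string]bool) bool {
	if !state["A"] {
		return true // old meta still in effect; the transaction is merely lost
	}
	return state["X"] && state["Y"] && state["Z"]
}

// survivesAnyCrash tries every subset of pending writes landing before power loss.
func survivesAnyCrash(withBarrier bool) bool {
	d := &disk{durable: map[string]bool{}}
	for _, p := range []string{"X", "Y", "Z"} {
		d.write(p)
	}
	if withBarrier {
		d.flush()
	}
	d.write("A")

	n := uint(len(d.pending))
	for mask := uint(0); mask < 1<<n; mask++ {
		state := map[string]bool{}
		for k, v := range d.durable {
			state[k] = v
		}
		for i, p := range d.pending {
			if mask&(1<<uint(i)) != 0 {
				state[p] = true
			}
		}
		if !consistent(state) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("without barrier:", survivesAnyCrash(false)) // prints "without barrier: false"
	fmt.Println("with barrier:", survivesAnyCrash(true))     // prints "with barrier: true"
}
```

Without the barrier, the subset {Z, A} is one of the reachable crash states, which is exactly the case described above.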


tv42 commented on August 26, 2024

Not really related to this ticket but now that I brought it up: here's a commit that adds the fdatasync/fsync: tv42/bolt@5ce378b


benbjohnson commented on August 26, 2024

@tv42 Thanks for the fdatasync() changes. I merged them in via #76.

LMDB says that MDB_NOSYNC preserves ACI of ACID if the file system preserves write order:

 *  <li>#MDB_NOSYNC
 *      Don't flush system buffers to disk when committing a transaction.
 *      This optimization means a system crash can corrupt the database or
 *      lose the last transactions if buffers are not yet flushed to disk.
 *      The risk is governed by how often the system flushes dirty buffers
 *      to disk and how often #mdb_env_sync() is called.  However, if the
 *      filesystem preserves write order and the #MDB_WRITEMAP flag is not
 *      used, transactions exhibit ACI (atomicity, consistency, isolation)
 *      properties and only lose D (durability).  I.e. database integrity
 *      is maintained, but a system crash may undo the final transactions.
 *      Note that (#MDB_NOSYNC | #MDB_WRITEMAP) leaves the system with no
 *      hint for when to write transactions to disk, unless #mdb_env_sync()
 *      is called. (#MDB_MAPASYNC | #MDB_WRITEMAP) may be preferable.
 *      This flag may be changed at any time using #mdb_env_set_flags().

But I need to read up on that further to understand it better. I'm not sure how the meta can be in sync but the data pages not be sync'd and you'd only lose the previous transaction.

I'm wondering if an "async" mode is even a good idea for Bolt. It seems that with fdatasync() the file system can still optimize the write order, and it can be left up to the end user to bulk load or coalesce transactions as needed. That way everything in Bolt is ACID.

What do you think?


tv42 commented on August 26, 2024

"However, if the filesystem preserves write order" -- yeah, it's not going to (yes in memory, not on disk), so most of that paragraph is irrelevant. Not sure what the LMDB authors were thinking of. The writes will go in the buffer cache, which will flush them out in somewhat arbitrary order, and the IO scheduler can explicitly reorder them to minimize seeking. If you want ordering, you use fdatasync/sync_file_range etc.

I can only think of two settings where async commits make sense: 1) Redis-style "I don't care about my data" and 2) distributed systems that can set policies like "on disk on 1 node and in memory on 2".

  1. Is probably not an ideal fit for Bolt anyway, because natively in-memory systems will probably always be faster, and the single-writer limit of Bolt is probably going to get in the way. Plus, the hybrid between the two worlds is a silly thing to want.

  2. Can probably just use an in-memory queue of operations to be done, and a single goroutine flushing them out to Bolt, including batching multiple operations into one Bolt transaction.
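
The second option could look roughly like the sketch below. It is a minimal illustration, not a Bolt API: `op`, `commitFunc`, `flusher`, and `runDemo` are hypothetical names, and `commitFunc` stands in for "apply this batch inside one Bolt read-write transaction". A single goroutine drains the queue and groups whatever is already waiting, up to `maxBatch`, into one commit.

```go
package main

import (
	"fmt"
	"sync"
)

// op is a hypothetical queued write plus a channel that reports the commit result.
type op struct {
	key, value string
	done       chan error
}

// commitFunc stands in for applying a batch in a single Bolt transaction.
type commitFunc func(batch []op) error

// flusher drains the queue from one goroutine, batching ops that are already waiting.
func flusher(queue <-chan op, maxBatch int, commit commitFunc) {
	for first := range queue {
		batch := []op{first}
	fill:
		for len(batch) < maxBatch {
			select {
			case o, ok := <-queue:
				if !ok {
					break fill // queue closed; commit what we have
				}
				batch = append(batch, o)
			default:
				break fill // nothing else waiting right now
			}
		}
		err := commit(batch)
		for _, o := range batch {
			o.done <- err // acknowledge only after the commit returned
		}
	}
}

// runDemo enqueues n ops and returns how many commits were needed and what was stored.
func runDemo(n, maxBatch int) (commits int, stored map[string]string) {
	stored = map[string]string{}
	queue := make(chan op, n)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		flusher(queue, maxBatch, func(batch []op) error {
			commits++
			for _, o := range batch {
				stored[o.key] = o.value
			}
			return nil
		})
	}()
	dones := make([]chan error, n)
	for i := 0; i < n; i++ {
		dones[i] = make(chan error, 1)
		queue <- op{key: fmt.Sprintf("k%d", i), value: fmt.Sprintf("v%d", i), done: dones[i]}
	}
	close(queue)
	for _, d := range dones {
		<-d
	}
	wg.Wait()
	return commits, stored
}

func main() {
	commits, stored := runDemo(100, 16)
	fmt.Printf("%d ops stored in %d commits\n", len(stored), commits)
}
```

The number of commits depends on scheduling, so it varies between runs. A caller that waits on `done` still gets full durability; a caller that does not wait gets the "in memory on this node" behaviour without Bolt itself having to weaken its guarantees.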

I wouldn't burn any effort in worse durability guarantees. To me personally, Bolt is valuable because it's simple, has a good API, and performs well for what it is.


benbjohnson commented on August 26, 2024

@tv42 That makes sense. Thanks for all the feedback! I'm going to close this one out and keep Bolt simple and ACID compliant. A "no sync" option can be implemented by the end user as a cache or WAL or whatever. :)


akotlar commented on August 26, 2024

I don't mean to reopen the issue, but I wanted to follow up on @tv42's comment that file systems do not preserve write order. Is this true? ext4, one of the more commonly used file systems, explicitly states that it does (default data=ordered).

https://www.kernel.org/doc/Documentation/filesystems/ext4.txt


tv42 commented on August 26, 2024

data=ordered	(*)	All data are forced directly out to the main file
			system prior to its metadata being committed to the
			journal.

Nothing in that says write to data block A is done before write to data block B, just that both data block writes are done before the corresponding metadata is written to the journal.

