petermattis commented on August 20, 2024

In addition to preallocating WAL file space, we should investigate reusing WAL files. RocksDB does this via the recycle_log_file_num option.

  // If non-zero, we will reuse previously written log files for new
  // logs, overwriting the old data.  The value indicates how many
  // such files we will keep around at any point in time for later
  // use.  This is more efficient because the blocks are already
  // allocated and fdatasync does not need to update the inode after
  // each write.

@ajkr expanded on this in another comment:

  // On ext4 and xfs, at least, `fallocate()`ing a large empty WAL is not enough
  // to avoid inode writeback on every `fdatasync()`. Although `fallocate()` can
  // preallocate space and preset the file size, it marks the preallocated
  // "extents" as unwritten in the inode to guarantee readers cannot be exposed
  // to data belonging to others. Every time `fdatasync()` happens, an inode
  // writeback happens for the update to split an unwritten extent and mark part
  // of it as written.
  //
  // Setting `recycle_log_file_num > 0` circumvents this as it'll eventually
  // reuse WALs where extents are already all marked as written. When the DB
  // opens, the first WAL will have its space preallocated as unwritten extents,
  // so will still incur frequent inode writebacks. The second WAL will as well
  // since the first WAL cannot be recycled until the first flush completes.
  // From the third WAL onwards, however, we will have a previously written WAL
  // readily available to recycle.
  //
  // We could pick a higher value if we see memtable flush backing up, or if we
  // start using column families (WAL changes every time any column family
  // initiates a flush, and WAL cannot be reused until that flush completes).
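
The bookkeeping behind a `recycle_log_file_num`-style limit can be sketched as a small bounded queue. This is only an illustration under stated assumptions, not the RocksDB or Pebble implementation: `LogRecycler`, `Add`, and `Take` are hypothetical names, and the sketch tracks file numbers only (no file handles, no locking).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical pool of obsolete WAL file numbers that are available for reuse.
// `limit` plays the role of recycle_log_file_num: zero disables recycling.
class LogRecycler {
 public:
  explicit LogRecycler(std::size_t limit) : limit_(limit) {}

  // Offers an obsolete WAL for reuse. Returns false when the pool is already
  // full (or recycling is disabled), in which case the caller deletes the file.
  bool Add(std::uint64_t file_number) {
    // The same file must not be offered twice.
    assert(std::find(pool_.begin(), pool_.end(), file_number) == pool_.end());
    if (pool_.size() >= limit_) {
      return false;
    }
    pool_.push_back(file_number);
    return true;
  }

  // Hands out the oldest recyclable WAL, or nullopt when none is available
  // and the caller has to create (and preallocate) a fresh file instead.
  std::optional<std::uint64_t> Take() {
    if (pool_.empty()) {
      return std::nullopt;
    }
    const std::uint64_t file_number = pool_.front();
    pool_.pop_front();
    return file_number;
  }

 private:
  std::size_t limit_;
  std::deque<std::uint64_t> pool_;
};
```

With a limit of 1, the first two WALs find the pool empty and are created fresh, which matches the observation above that recycling only starts to help from the third WAL onwards.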

@ajkr also notes that there is a small possibility of badness with the RocksDB implementation of WAL reuse:

There appears to be an infinitesimally small chance of a wrong record to be replayed during recovery -- a user key or value written to an old WAL could contain bytes that form a valid entry for the recycled WAL, and those bytes would have to immediately follow the final entry written to the recycled WAL.

from pebble.

petermattis commented on August 20, 2024

In order to support recycling WAL files, RocksDB extends the WAL entry to include the log number:

 * Legacy record format:
 *
 * +---------+-----------+-----------+--- ... ---+
 * |CRC (4B) | Size (2B) | Type (1B) | Payload   |
 * +---------+-----------+-----------+--- ... ---+
 *
 * CRC = 32bit hash computed over the record type and payload using CRC
 * Size = Length of the payload data
 * Type = Type of record
 *        (kZeroType, kFullType, kFirstType, kLastType, kMiddleType )
 *        The type is used to group a bunch of records together to represent
 *        blocks that are larger than kBlockSize
 * Payload = Byte stream as long as specified by the payload size
 *
 * Recyclable record format:
 *
 * +---------+-----------+-----------+----------------+--- ... ---+
 * |CRC (4B) | Size (2B) | Type (1B) | Log number (4B)| Payload   |
 * +---------+-----------+-----------+----------------+--- ... ---+
 *
 * Same as above, with the addition of
 * Log number = 32bit log file number, so that we can distinguish between
 * records written by the most recent log writer vs a previous one.
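
As a rough illustration of the recyclable layout, the header can be packed into 11 bytes as follows. This is a sketch under assumptions: `EncodeRecyclableHeader` is a hypothetical helper, the multi-byte fields are assumed to be stored little-endian, and the CRC is taken as an already-computed input rather than calculated here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Header sizes implied by the two layouts above.
constexpr std::size_t kLegacyHeaderSize = 4 + 2 + 1;                  // CRC, Size, Type
constexpr std::size_t kRecyclableHeaderSize = kLegacyHeaderSize + 4;  // plus Log number

// Type values that RocksDB reserves for recycled log files.
constexpr std::uint8_t kRecyclableFullType = 5;
constexpr std::uint8_t kRecyclableLastType = 8;

using RecyclableHeader = std::array<std::uint8_t, kRecyclableHeaderSize>;

// Hypothetical helper: packs the header fields in the order shown above.
// Assumption: every multi-byte field is little-endian.
inline RecyclableHeader EncodeRecyclableHeader(std::uint32_t crc,
                                               std::uint16_t payload_size,
                                               std::uint8_t type,
                                               std::uint32_t log_number) {
  assert(type >= kRecyclableFullType && type <= kRecyclableLastType);
  RecyclableHeader h{};
  for (std::size_t i = 0; i < 4; ++i) {
    h[i] = static_cast<std::uint8_t>(crc >> (8 * i));             // bytes 0..3
    h[7 + i] = static_cast<std::uint8_t>(log_number >> (8 * i));  // bytes 7..10
  }
  h[4] = static_cast<std::uint8_t>(payload_size & 0xff);  // Size, low byte
  h[5] = static_cast<std::uint8_t>(payload_size >> 8);    // Size, high byte
  h[6] = type;
  return h;
}
```

The only difference from the legacy header is the trailing four bytes, so a recyclable record costs 4 extra bytes per chunk.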

petermattis commented on August 20, 2024

Supporting the recyclable record format looks relatively straightforward. The Type field is extended with "recyclable" versions:

enum RecordType {
  // Zero is reserved for preallocated files
  kZeroType = 0,
  kFullType = 1,

  // For fragments
  kFirstType = 2,
  kMiddleType = 3,
  kLastType = 4,

  // For recycled log files
  kRecyclableFullType = 5,
  kRecyclableFirstType = 6,
  kRecyclableMiddleType = 7,
  kRecyclableLastType = 8,
};

Log reading examines the first 7 bytes of the record (CRC, Size, and Type). If the Type is one of the recyclable types, it then verifies that the Log number matches the expected value; if it does not match, log reading is considered to have reached EOF. As is often the case, adding tests will likely be the largest chunk of work.
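
The reader-side check described above might look roughly like this. It is a minimal sketch, not the RocksDB reader: `HeaderStatus` and `CheckHeader` are hypothetical names, the log number is assumed to be little-endian, and CRC verification is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Type values from the RocksDB enum quoted above.
enum RecordType : std::uint8_t {
  kZeroType = 0,
  kFullType = 1,
  kFirstType = 2,
  kMiddleType = 3,
  kLastType = 4,
  kRecyclableFullType = 5,
  kRecyclableFirstType = 6,
  kRecyclableMiddleType = 7,
  kRecyclableLastType = 8,
};

// Hypothetical result of inspecting a record header.
enum class HeaderStatus { kOk, kEof, kTruncated };

inline bool IsRecyclableType(std::uint8_t type) {
  return type >= kRecyclableFullType && type <= kRecyclableLastType;
}

// Inspects the header at `buf` (`len` bytes available). A recyclable record
// whose log number differs from the expected one is stale data left over from
// the file's previous life, so it is reported as EOF.
inline HeaderStatus CheckHeader(const std::uint8_t* buf, std::size_t len,
                                std::uint32_t expected_log_number) {
  assert(buf != nullptr);
  if (len < 7) {
    return HeaderStatus::kTruncated;  // CRC (4B) + Size (2B) + Type (1B)
  }
  const std::uint8_t type = buf[6];
  if (!IsRecyclableType(type)) {
    return HeaderStatus::kOk;  // legacy record: no log number to check
  }
  if (len < 11) {
    return HeaderStatus::kTruncated;  // log number (4B) is missing
  }
  const std::uint32_t log_number =
      static_cast<std::uint32_t>(buf[7]) |
      (static_cast<std::uint32_t>(buf[8]) << 8) |
      (static_cast<std::uint32_t>(buf[9]) << 16) |
      (static_cast<std::uint32_t>(buf[10]) << 24);
  return log_number == expected_log_number ? HeaderStatus::kOk
                                           : HeaderStatus::kEof;
}
```

Whether a truncated header should be treated as EOF or as corruption is a policy decision this sketch leaves to the caller.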

petermattis commented on August 20, 2024

The benchmarks added in #76 point to WAL file reuse being a win. They also point to direct IO being an additional win, providing more regular sync performance across supported filesystems. Direct IO comes with caveats under Linux; see "Clarifying direct IO semantics". In particular, direct IO should be viewed as an additional specialization on top of WAL file reuse.

Even with grouping of write batches, most WAL syncs are small. Instrumentation of cockroach shows that on a TPCC workload 90% of WAL syncs are for less than 4KB of data, and 99% are for less than 16KB.

Direct IO requires writing full filesystem pages aligned on page boundaries. Under ext4 this is 4KB aligned (TODO check what the alignment requirements are for xfs, though they are likely similar). There is a question of what to do with the tail of the WAL that doesn't fill a page. We could overwrite that data repeatedly, though there might be a problem with concurrently modifying that tail buffer while it is being written to disk (it is unclear if this is safe). An alternative is to pad the WAL to a page boundary whenever a sync occurs. On a TPCC workload, this would increase space usage for the WAL by 30%. For a KV write-only workload, such padding would increase space usage by nearly 100%. Neither increase seems problematic as the WAL files are on the order of hundreds of megabytes in size, a small fraction of the database size. Note that the amount of data being written to the SSD isn't changing. And padding to page boundaries might actually be kinder to the SSD hardware (I'm always a bit unclear on how wear leveling is done).

How to pad to a page boundary? Add a LogData chunk of the desired size.
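
The size of the padding needed at a sync is simple arithmetic on the current file offset. A minimal sketch, assuming the 4KB page size mentioned above for ext4; `kPageSize` and `PaddingToPageBoundary` are hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 4KB pages, as stated above for ext4.
constexpr std::uint64_t kPageSize = 4096;

// Returns how many bytes must follow `offset` so that the next write starts
// on a page boundary. Returns 0 when `offset` is already aligned.
inline std::uint64_t PaddingToPageBoundary(std::uint64_t offset,
                                           std::uint64_t page_size = kPageSize) {
  assert(page_size > 0);
  return (page_size - offset % page_size) % page_size;
}
```

For example, a sync at offset 5000 needs 3192 bytes of padding to reach offset 8192. Because the padding chunk carries its own record header, an implementation would also have to handle the case where the gap is smaller than that header.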
