Giter VIP home page Giter VIP logo

appendlog's People

Contributors

naasking avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar

Forkers

dhilip89 ak800i

appendlog's Issues

Async append

IAppendLog.Append is currently an async operation, but should probably be async.

FileLog in particular uses Monitor.Enter, which should perhaps be a a more async-friendly mutual exclusion mechanism.

Write parallelism and group commit

It's possible to improve the write parallelism with some simple changes.

  1. Decouple header update from append: after a writer closes its stream, the next writer can begin writing immediately. This will probably require two FileStreams, one for writing the header, one for appending, and a semaphore for each. This might only be a problem when the log is initially started, because of block device semantics: the header and the end of the log are in the same sector, so flushing either header or append stream will overwrite the other. Hint: probably use FileOptions.WriteThrough for the header stream -- no sense making two calls for the same purpose.

  2. Group commit: have each thread that's waiting to write to the header publish the transaction id it will commit. The writer that acquires the header FileStream checks the publication list and only writes the highest one to the header. In theory, N writers will be merged into 1 flush instead of N sequential flushes. In practice, N>1 writers waiting to commit is probably very unlikely because the append stream will still require a flush to disk before its writer will try to update header, so this sequence of events would seem to be extremely unlikely:

    1. writer w0 flushes append stream to disk
    2. w0 acquires header stream lock
    3. w0 releases append stream lock
    4. w0 interrupted
    5. writer w1 acquires append stream lock
    6. w1 completes writes
    7. w1 flushes to disk
    8. w1 can't acquire header stream lock, so publishes its higher txid
    9. w0 resumes and checks publication list and writes w1's txid

w0 remaining suspended while w1 executes steps v-viii is exceedingly unlikely.

Async commit protocol

IAppendLog.Append returns an IDisposable handle which, when disposed, performs all of the flushing needed to persist data to disk.

However, this is a synchronous operation where the rest of the IAppendLog API is largely asynchronous. Switch this to something like the following:

public interface ITransaction : IDisposable
{
  Task Commit();
}
public interface IAppendLog
{
  ...
  ITransaction Append(out Stream output, out TransactionId tx);
}

This way, the client must explicitly commit changes to the log and can opt to do so asynchronously. IDisposable.Dispose is then reserved just for resource cleanup as intended, rather than doing double-duty as transaction commit.

No-seek, append-only disk layout

FileLog must currently:

  1. write the entry length at the end of the content,
  2. force a flush to disk to ensure the entry is persisted,
  3. seek back to the header and write a pointer to the new entry in the header
  4. force a flush to disk to persist the update.

A power failure can occur at any step of the above, and only when step 4 is complete is the new transaction actually acknowledged. If we remove or reorder even one flush, bogus data might get through.

For instance, it might seem possible that we could recover from a power failure after step 2, but there's no persisted/reliable indicator that step 2 actually completed. So even if we find a valid entry header, there's no guarantee that all data before the header was persisted because the OS can reorder writes. Only after step 4 actually completes do we have this indicator.

So with all of this in mind, there might be a way to at least reduce seeking and make the log truly forward-only.

Chunked log

  • The log is broken up into fixed-size chunks of a certain (optimal) size for persisting to disk (like a sector).
  • The log header contains a GUID, and this GUID is replicated in each chunk header to tag it as valid data. It's vanishingly unlikely that a random sector dump on power loss will precisely replicate the log GUID in precisely the right spot.
  • There are internal chunks, indicating more data to write/read, and terminal chunks indicating a transaction end.
  • On commit, whatever padding is left in a chunk is zeroed, then Flush(true) is called. Upon completion, the terminal header is written, then Flush(true) is called again. While the total number flushes is the same, no large seeking occurs, just a small seek to rewrite the header at the end of the chunk.
  • On log load, we seek to the last chunk given the file length, then work backwards until we find a terminal chunk and resume writing from there.

Note: this layout may also permit concurrent writing to the same file. Each writer basically just allocates a free chunk via an atomic operation on the log object. This is tricky though, so I won't develop it further at this point.

Supporting concurrent writers

FileLog is intentionally limited to single writers for simplicity and robustness. However, we can build a simple concurrent extension on this foundation by implementing another instance of IAppendLog, which we'll call ConcurrentLog.

Design

The idea is that ConcurrentLog points to a folder which is used to store a set of temporary files, one file per concurrent writer:

  1. On write completion, the written file is renamed to a known extension which is then migrated to an external FileLog by a worker thread.
  2. The ConcurrentLog has a ticket field set at log initialization, which is used to generate unique file names via an atomic increment on each Append operation.
  3. ConcurrentLog can also replay some incomplete transactions, ie. writers that had committed successfully but hadn't yet completed their copy process can simply restart.
  4. No particular migration order is guaranteed, technically the same guarantee provided by Monitor.Enter/Exit, which is used to mutually exclude writers in FileLog. However, I think ordering the writes by ticket number would be most robust.

Open Questions

I'm not sure whether to enforce that all lower numbered tickets are migrated before migrating later ones. Technically, these events should be independent, but it would be easy for programmers to trigger events during a long write that cause a concurrent writer to complete a short event before the earlier event is finished writing. Enforcing this ordering protects against these scenarios, however:

  1. We sacrifice write throughput.
  2. Writes are no longer assumed independent, which means replay can't just skip an incomplete write on replay after power loss. This potentially loses many other independent successful writes.

Given #2 particularly, I'm leaning towards to enforcing sequential replay.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.