fengb / zee_alloc Goto Github PK

View Code? Open in Web Editor NEW

83.0 6.0 6.0 604 KB

tiny Zig allocator primarily targeting WebAssembly

Home Page: https://fengb.github.io/zee_alloc/#root

License: MIT License

Zig 100.00%

memory-allocation wasm tiny zig

zee_alloc's Introduction

zee_alloc — zig wee allocator

A tiny general purpose allocator targeting WebAssembly.

This allocator has not been well tested. Use at your own peril.

Getting Started

In zig:

const zee_alloc = @import("zee_alloc");

pub fn foo() void {
    var mem = zee_alloc.ZeeAllocDefaults.wasm_allocator.alloc(u8, 1000);
    defer zee_alloc.ZeeAllocDefaults.wasm_allocator.free(mem);
}

Exporting into wasm:

const zee_alloc = @import("zee_alloc");

comptime {
    (zee_alloc.ExportC{
        .allocator = zee_alloc.ZeeAllocDefaults.wasm_allocator,
        .malloc = true,
        .free = true,
        .realloc = false,
        .calloc = false,
    }).run();
  }

Goals

(inspired by Rust's wee_alloc)

Tiny compiled output
Tiny compiled output x2
Malloc/free compatibility
Avoid long-term fragmentation
Reasonably fast alloc and free
Code simplicity — probably goes in hand with tiny output

Non-goals

Debugging — this library probably will never do a good job identifying errors. Zig has a great debug allocator in the works, and zig programs should be able to easily swap allocators.
Compact memory — fixed allocation frames are used for speed and simplicity. Memory usage will never be optimum unless the underlying algorithm completely changes
Thread performance — wasm is single-threaded

Benchmarks

Benchmark                                   Mean(ns)
----------------------------------------------------
DirectAllocator.0                              50842
DirectAllocator.1                              98343
DirectAllocator.2                             203980
DirectAllocator.3                              49908
DirectAllocator.4                             103635
DirectAllocator.5                             195941
DirectAllocator.6                              47367
DirectAllocator.7                             101733
DirectAllocator.8                             202697
Arena_DirectAllocator.0                        11837
Arena_DirectAllocator.1                        19591
Arena_DirectAllocator.2                        30689
Arena_DirectAllocator.3                        30916
Arena_DirectAllocator.4                        52425
Arena_DirectAllocator.5                        75673
Arena_DirectAllocator.6                        44874
Arena_DirectAllocator.7                        67557
Arena_DirectAllocator.8                        96276
ZeeAlloc_DirectAllocator.0                     15892
ZeeAlloc_DirectAllocator.1                     24435
ZeeAlloc_DirectAllocator.2                     49564
ZeeAlloc_DirectAllocator.3                     26656
ZeeAlloc_DirectAllocator.4                     52462
ZeeAlloc_DirectAllocator.5                     93854
ZeeAlloc_DirectAllocator.6                     51493
ZeeAlloc_DirectAllocator.7                     95223
ZeeAlloc_DirectAllocator.8                    250187
FixedBufferAllocator.0                           177
FixedBufferAllocator.1                           412
FixedBufferAllocator.2                          1006
FixedBufferAllocator.3                           296
FixedBufferAllocator.4                           785
FixedBufferAllocator.5                          1721
FixedBufferAllocator.6                           848
FixedBufferAllocator.7                          1546
FixedBufferAllocator.8                          3331
Arena_FixedBufferAllocator.0                     299
Arena_FixedBufferAllocator.1                     573
Arena_FixedBufferAllocator.2                    1624
Arena_FixedBufferAllocator.3                    1115
Arena_FixedBufferAllocator.4                    1868
Arena_FixedBufferAllocator.5                    4422
Arena_FixedBufferAllocator.6                    1706
Arena_FixedBufferAllocator.7                    3389
Arena_FixedBufferAllocator.8                    8430
ZeeAlloc_FixedBufferAllocator.0                  232
ZeeAlloc_FixedBufferAllocator.1                  577
ZeeAlloc_FixedBufferAllocator.2                 1165
ZeeAlloc_FixedBufferAllocator.3                  443
ZeeAlloc_FixedBufferAllocator.4                  907
ZeeAlloc_FixedBufferAllocator.5                 1848
ZeeAlloc_FixedBufferAllocator.6                  907
ZeeAlloc_FixedBufferAllocator.7                 1721
ZeeAlloc_FixedBufferAllocator.8                 3836

Architecture — Buddy memory allocation

Caveat: I knew nothing about memory allocation when starting this project. Any semblence of competence is merely a coincidence.

idx frame_size
 0  >65536  jumbo
 1   65536  wasm page size
 2   32768
 3   16384
 4    8192
 5    4096
 6    2048
 7    1024
 8     512
 9     256
10     128
11      64
12      32
13      16  smallest frame

Size order is reversed because 0 and 1 are special. I believe counting down had slightly better semantics than counting up but I haven't actually tested it.

Wasm only allows for allocating entire pages (64K) at a time. Current architecture is heavily influenced by this behavior. In a real OS, the page size is much smaller at 4K. Everything should work as expected even if it does not run as efficient as possible.

Each allocation frame consists of 2 usizes of metadata: the frame size and a pointer to the next free node. This enables some rudimentary debugging as well as a simple lookup when we only have the allocated data-block (especially important for C compatibility).

For allocations <=64K in size, we find the smallest usable free frame. If it's bigger than necessary, we grab the smallest power of 2 we need and resize the rest to toss back as free nodes. This is O(log k) which is O(1).

For allocations >64K, we iterate through list 0 to find a matching size, O(n). Free jumbo frames are never divided into smaller allocations.

ZeeAlloc only supports pointer alignment at 2x usize — 8 bytes in wasm. There are a few ideas to expand this to up-to half page_size but nothing concrete yet.

zee_alloc's People

Contributors

Stargazers

Watchers

Forkers

leroycep zigster64 vesim987 icodein dophist pedrocasf

zee_alloc's Issues

Debug options

It’d be nice to be able to tune separately from compiler options such that upstream can use release-safe while zee_alloc can stay tiny:

Debug — enable all checks
External — disable internal consistency checks, but retain external checks, e.g. double free or freeing wrong memory
Unsafe — no checks

Note: non-release builds should keep on all debugging.

Buddy coalesce on free

Now that we track allocated frames, we can apply a coalesce algorithm on free. Unfortunately, removing from a freelist is currently O(n) so we’ll have to do trade off analysis:

Don’t coalesce. This is the status quo and handles consistent allocations well, but it degenerates upon mixing between large and small allocations.
O(n) coalesce. This is simple and easy, but it starts degenerating when freelists are long.
Hybrid coalesce. Try merging but give up after a few iterations of search. Possibly a decent trade off for performance consistency at the cost of data consistency.
Doubly linked list. Takes extra memory space per frame and minor bookkeeping cycles but guarantees O(1) coalesce.

Note: if implemented with packed pointers (#4), metatdata should be the same size (2x usize).

Jumbo allocation performance

ZeeAlloc does really poorly with variable-sized jumbo allocations. We should investigate a better way to reference these frames.

Detailed problem:

Jumbo frames are all stuck into a (singly) linked list
Only specific matches will fit jumbo frames (search size == frame size)

Potential solutions:

Make search size <= frame size — this speeds up variable reuse at the cost of hurting exact size reuse and internal fragmentation
Increase "page size" — this only punts the problem as well as causing more internal fragmentation
Powers of 2 for jumbo — more internal fragmentation, still poor performance for lots of simultaneous allocations

Support std.WasmPageAllocator

Now that we have a built-in page-based allocator in std, we should start consuming it.

Benefits:

standardize around std — properly coordinate with any wasm allocator
really freeing memory — this should be identical to ArenaAllocator.deinit
potentially ignore jumbo allocations entirely

Drawbacks:

the std allocator requires an additional 1-2 KB
need new metadata architecture for C-like malloc()

Questions:

Is it possible to detect using "free()"? In theory, wasm_page_allocator can be tiny if it optimizes away the free/reuse bookkeeping, so it'd be amazing if it could detect at comptime whether free is needed.
Should we keep around the current dumb page allocator and only enable the new path on "deinit()"? std.WasmPageAllocator was designed to handle freeing any memory page (including ones it didn't originally allocate) so this works around question 1.

Packed doubly-linked-list

Once we have #3, we have a really nice characteristic:

Small allocations are always align(16). The extra bits can denote the small size bitshift... which incidentally is less than 16.
Large allocations are always align(32K) or align(64K). The extra bits can denote the total pages, which maps to 2GB or 4GB respectively.

With this type of packing, we only need a usize word (4B) for the metadata. Thus current minimum payload is expanded to 12B.

~~We might be able to support 4B payloads — it allows for align(8) which is not enough space to store the length, but we can possibly assume any align(8) has to be 4B allocations.~~
Inconsistent metadata is very hard to detect so we'll stick with 12B minimum payloads.

Align within payload

Alternative to #3

All payloads have natural alignment of 8. If we pad the payload by 8, we can get align 16.

This sounds simpler than shifting every block, but it wastes a lot more space.

Experiment with slabs

Buddy allocation is generally pretty efficient but suffers from internal fragmentation. Can slab allocation have similar code efficiency?

Metadata layout — alignment and large pages

Current design is reasonably simple and compact, but we have two major issues:

Alignment is fixed at double-usize (8 bytes). It'd be nice if we can at least align to 16 bytes, and this might give us up to 32KB alignment for "free".
Payload size of 2^n - 8 is super clunky. A lot of my personal usecases require 2^n exactly, and having this limitation requires an extra half page to a page.

Metadata layout

Need to decide on a strategy for exporting the free() function. This would be for external compatibility usage and not necessarily be a guiding principle so poorer performance or efficiency is acceptable.

Store "extern malloc" values in a data structure — probably HashMap. This would be outside of the core allocator and probably easier to maintain, since we can use memory easier without fear of trashing the internals.
Each allocated block has a wrapping parent struct that can store metadata. This would also solve the "free cannot find unused node" problem... by chewing up more memory, and Zig seems to provide enough metadata from the language itself that I want to try avoiding this strategy.

Top offenders

Generated from twiggy

 Shallow Bytes │ Shallow % │ Item
───────────────┼───────────┼─────────────────────────────────────────────────────
           808 ┊    41.41% ┊ main.ZeeAlloc((struct main.Config constant)).realloc
           252 ┊    12.92% ┊ "function names" subsection
           173 ┊     8.87% ┊ realloc
           102 ┊     5.23% ┊ malloc
            98 ┊     5.02% ┊ data[1]
            79 ┊     4.05% ┊ main.WasmPageAllocator.realloc
            60 ┊     3.08% ┊ main.ZeeAlloc((struct main.Config constant)).shrink
            57 ┊     2.92% ┊ main.ZeeAlloc((struct main.Config constant)).free
            47 ┊     2.41% ┊ memset
            40 ┊     2.05% ┊ calloc
            29 ┊     1.49% ┊ free