Hardened allocator designed for modern systems. It has integration into Android's Bionic libc and can be used externally with musl and glibc as a dynamic library for use on other Linux-based platforms. It will gain more portability / integration over time.
One option to keep it simple is having 2 separate reserved regions and moving the table between them every time it grows, rather than growing it in place. That way, the security property of completely isolated metadata can be provided by not reusing memory used for large allocations for the hash table and vice versa.
These are purged of pages so they aren't hot, and they're memory protected so keeping them that way for as long as possible will generally improve security.
It will add an extra system call with synchronization when it fails, so it isn't necessarily a good idea at the smallest sizes if it almost always fails in practice. The Linux mmap heap uses best-fit and grows downwards which tends to eliminate common cases where this can actually work.
It's also unclear if in-place growth is a net positive or negative for security, similar to in-place shrinking. It's probably best to focus only on the performance aspects of this for now because there are too many variables when it comes to security. A large virtual memory quarantine could potentially turn avoiding in-place growth into a security feature so there may end up being an option to disable the optimizations.
PDEP allows selecting the nth unset bit efficiently (a couple cycles) so it's a fantastic way of implementing this. There's no clear way to do it at all efficiently elsewhere, which is why the current portable implementation only randomizes the search start index and then uses the ffs intrinsic.
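A sketch of both approaches mentioned above (function name is illustrative): with BMI2, PDEP deposits the single bit `1 << n` into the zero positions of the bitmap, and a trailing-zero count recovers its index, all in a few cycles. The portable fallback shown here is a plain linear scan, which is the kind of thing the randomized-start-plus-ffs approach avoids paying on every call.

```c
#include <stdint.h>

#ifdef __BMI2__
#include <immintrin.h>
/* index of the nth (0-based) zero bit: deposit 1<<n into the zero
   positions of the bitmap, then count trailing zeros */
static unsigned nth_unset_bit(uint64_t bitmap, unsigned n) {
    return __builtin_ctzll(_pdep_u64(1ULL << n, ~bitmap));
}
#else
/* portable fallback: linear scan over the zero bits */
static unsigned nth_unset_bit(uint64_t bitmap, unsigned n) {
    for (unsigned i = 0; i < 64; i++) {
        if (!(bitmap & (1ULL << i)) && n-- == 0)
            return i;
    }
    return 64; /* fewer than n+1 zero bits present */
}
#endif
```

For example, with the bitmap `0b1011` the zero bits are at indices 2, 4, 5, …, so `nth_unset_bit(0xb, 1)` yields 4 on either path.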
The current implementation will lead to only being able to provide size results for offsets into the initial page(s) of a large allocation, since with a hash table there isn't a way to find the closest previous match.
This would have a performance cost compared to the hash table for allocation, and would also make malloc_object_size significantly slower. It might make sense as a configuration option, or perhaps it's not worth doing at all.
Generating the tags is going to require careful thought. Adjacent allocations should be guaranteed to have different tags, which could be approached by dedicating a bit to whether the slot is at an odd or even index (wasting entropy), or by generating a new random tag until there's no collision, which requires either a way to fetch the tags from the shadow region or spending memory on metadata in the allocator.
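The retry-until-no-collision approach could look like the following sketch, assuming 4-bit tags and that the neighbors' tags are available by some means (shadow region or allocator metadata, as discussed above). `rand()` stands in for the ChaCha8-based CSPRNG, and a real implementation might also exclude any reserved tag values.

```c
#include <stdint.h>
#include <stdlib.h>

/* hypothetical sketch: pick a 4-bit tag distinct from both
   adjacent slots' tags by retrying on collision */
static uint8_t choose_tag(uint8_t prev_tag, uint8_t next_tag) {
    uint8_t tag;
    do {
        tag = rand() & 0xf; /* placeholder for the ChaCha8 CSPRNG */
    } while (tag == prev_tag || tag == next_tag);
    return tag;
}
```

With 16 possible tags and at most 2 excluded, each attempt succeeds with probability at least 7/8, so the expected number of retries is low.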
The existing CSPRNG (ChaCha8 keystream with a small cache) is very efficient and will make generating these tags low overhead, but using a couple more bytes on every allocation will further increase the pressure to optimize it, especially alongside further slab allocation randomization beyond the existing slot selection randomization.
The feature is currently too aggressive and assigns half of the slab address space to guard slabs by default. It would be better to choose a much more conservative proportion by default, such as a guard slab after every 9 usable slabs.
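As a sketch of the more conservative default, the decision could simply use 1-in-10 odds per slab slot instead of 1-in-2 (the function and the use of `rand()` as a stand-in for the allocator's CSPRNG are illustrative assumptions):

```c
#include <stdbool.h>
#include <stdlib.h>

/* hypothetical sketch: randomly make roughly 1 in 10 slab slots a
   guard slab, rather than the current 1 in 2 */
static bool is_guard_slab(void) {
    return rand() % 10 == 0; /* placeholder for the ChaCha8 CSPRNG */
}
```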
It would be good to cover up to at least 64k and perhaps higher to cover a much higher percentage of the common cases with the proper core allocation scheme.
Using libdivide is one option, or figuring out some way to avoid it altogether. Size classes and slab sizes aren't all powers of 2, for extremely important reasons, so it's non-trivial.
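The underlying trick, which libdivide automates, is replacing the runtime division by a constant with a precomputed multiply-and-shift reciprocal. A hypothetical sketch for computing a slot index from a slab offset, exact under the stated bounds (offset below 2^18, size class between 16 and 2^14, 48-bit shift):

```c
#include <stdint.h>

#define RECIP_SHIFT 48

/* precompute a round-up reciprocal for a given size class */
static uint64_t recip_gen(uint32_t size) {
    return ((UINT64_C(1) << RECIP_SHIFT) + size - 1) / size;
}

/* slot index = offset / size, without a hardware divide; exact for
   offset < 2^18 and 16 <= size < 2^14 with this shift */
static uint32_t slot_index(uint32_t offset, uint64_t recip) {
    return (uint32_t)(((uint64_t)offset * recip) >> RECIP_SHIFT);
}
```

The division in `recip_gen` happens once per size class at initialization; the hot path is a multiply and a shift, which is why this works for non-power-of-2 size classes where a simple right shift can't.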
It's nice not needing to grab any locks to return a size for slabs, but it's more important to check for memory corruption. It's a very rarely used API and has little use with this allocator design due to the tight bounds on internal fragmentation and the fact that it currently tracks sizes precisely for large allocations instead of rounding them at all.
It can easily return results for pointers within the first page of a memory allocation. It could go beyond that but it would require an extra hash table lookup for each extra page checked.
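The cost structure described above can be sketched as follows: resolving an interior pointer means probing page-aligned addresses backwards, one hash lookup per page covered. Everything here is a toy stand-in (the table, the lookup, the names) rather than the allocator's actual data structures.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define TABLE_SIZE 64 /* hypothetical tiny table for illustration */

struct entry { void *base; size_t size; };
static struct entry table[TABLE_SIZE];

/* toy open-addressing lookup keyed on a page-aligned base address */
static struct entry *hash_lookup(void *base) {
    size_t i = ((uintptr_t)base / PAGE_SIZE) % TABLE_SIZE;
    for (size_t n = 0; n < TABLE_SIZE; n++, i = (i + 1) % TABLE_SIZE) {
        if (table[i].base == base)
            return &table[i];
        if (table[i].base == NULL)
            return NULL;
    }
    return NULL;
}

/* resolve an interior pointer by probing backwards one page at a
   time: each extra page covered costs one more hash lookup */
static struct entry *find_allocation(void *ptr, unsigned max_pages) {
    uintptr_t page = (uintptr_t)ptr & ~(uintptr_t)(PAGE_SIZE - 1);
    for (unsigned i = 0; i < max_pages; i++) {
        struct entry *e = hash_lookup((void *)(page - (uintptr_t)i * PAGE_SIZE));
        if (e)
            return e;
    }
    return NULL;
}
```

With `max_pages` of 1 this is the cheap first-page-only behavior; every additional page of coverage adds one lookup to the worst case.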
It needs to move the surrounding guard pages with the allocation so it will be necessary to have extra mprotect calls as part of the implementation. It may end up being cheaper to allocate and copy for the smaller range of large allocations, so there could be a threshold determining when to start using mremap. It's important to leave the guard pages intact at the old location until the allocation has been remapped, to avoid having this reduce security compared to before.
This optimization may end up being lost for at least some ranges of allocations when the implementation is more sophisticated, but that's acceptable.
It's essentially a quarantine already (FIFO queue) and the reuse order doesn't have much impact on performance, so a randomized array would be a good fit.
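A randomized array quarantine can be very small: a freed pointer is swapped into a random slot and the evicted pointer is the one that actually proceeds to be freed. In this sketch the slot count is an assumption and `rand()` again stands in for the allocator's CSPRNG.

```c
#include <stddef.h>
#include <stdlib.h>

#define QUARANTINE_SLOTS 64 /* hypothetical quarantine size */

static void *quarantine[QUARANTINE_SLOTS];

/* swap the freed pointer with a random slot; the caller frees the
   evicted pointer (NULL until that slot has been filled once) */
static void *quarantine_swap(void *freed) {
    size_t i = (size_t)rand() % QUARANTINE_SLOTS; /* placeholder RNG */
    void *evicted = quarantine[i];
    quarantine[i] = freed;
    return evicted;
}
```

This randomizes the reuse order with O(1) work per free, unlike a FIFO queue which makes reuse order fully predictable.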
There are many ways to approach this, but for now it just needs to be a trivial implementation rather than trying to figure out and implement the best design choices for performance and security immediately.