Hi there, great work writing oat++. I am the author of Passenger, an application server that powers 650,000 companies and sites world-wide, such as Apple, Pixar, and Juniper. Passenger is mostly written in C++, and like you, we consider performance a priority.
I've done a quick code review, and I thought I'd share some of my findings with you. These are based on my experience with writing high-performance servers. You can read about some of my experience in these blog posts.
Hopefully you can use my feedback to make oat++ even faster and better.
- EDIT: elaborated on the documentation feedback
- EDIT: added feedback about zero-copy architecture
Performance feedback
Pools and contention
You are using memory pools and object pools everywhere. This is good, because avoiding dynamic memory allocation is key to performance. It is also cool that you've attempted to write a standardized framework to allow every object to easily make use of pools, including extras like counters.
One issue with your approach is the fact that your pools are static, i.e. all objects of a certain type are allocated from that single pool. In a multithreaded program this creates contention. You can avoid this by using one pool per thread.
I see you already tried to mitigate this somewhat with ThreadDistributedMemoryPool, but that is still dependent on a single contention point, namely the statically allocated atomic counter. Nothing is faster than no contention.
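As an illustration, here is a minimal sketch of the one-pool-per-thread idea using C++11 thread_local. All names (LocalPool, threadPool, the sizes) are hypothetical, not oat++'s actual API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size free-list pool. Each thread owns its own
// instance, so obtain()/release() need no lock and no atomic counter.
class LocalPool {
public:
    LocalPool(std::size_t entrySize, std::size_t count)
        : m_entrySize(entrySize), m_storage(entrySize * count) {
        // Pre-link every entry into the free list.
        for (std::size_t i = 0; i < count; ++i) {
            m_free.push_back(m_storage.data() + i * entrySize);
        }
    }

    void* obtain() {
        // No lock needed: this pool is never touched by another thread.
        if (m_free.empty()) return nullptr; // caller falls back to operator new
        void* p = m_free.back();
        m_free.pop_back();
        return p;
    }

    void release(void* p) { m_free.push_back(static_cast<char*>(p)); }

private:
    std::size_t m_entrySize;
    std::vector<char> m_storage;
    std::vector<char*> m_free;
};

// One pool instance per thread: allocations never contend with other
// threads, not even on a shared atomic counter.
inline LocalPool& threadPool() {
    thread_local LocalPool pool(/*entrySize=*/64, /*count=*/1024);
    return pool;
}
```

The trade-off is that an object must be released on the thread that obtained it (or released through some hand-back mechanism); in exchange, the common path has zero contention.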
Shared_ptr and atomics everywhere
Shared_ptr is great for safety, but it uses atomic counters under the hood. Atomic operations are expensive; they even impose memory barriers.
What I do in Passenger is use smart pointers that do not use atomic counters. This makes it unsafe to change the refcount from multiple threads, but performance is much better.
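A minimal sketch of such a non-atomic refcounted pointer (LocalPtr is a hypothetical name; this is not Passenger's actual implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Single-threaded refcounted pointer: the count is a plain std::size_t,
// so copies cost an ordinary increment with no atomic ops or barriers.
// Sharing a LocalPtr across threads is deliberately NOT supported.
template <typename T>
class LocalPtr {
public:
    explicit LocalPtr(T* p = nullptr)
        : m_ptr(p), m_count(p ? new std::size_t(1) : nullptr) {}

    LocalPtr(const LocalPtr& other) : m_ptr(other.m_ptr), m_count(other.m_count) {
        if (m_count) ++*m_count; // plain increment, not std::atomic
    }

    LocalPtr& operator=(LocalPtr other) { swap(other); return *this; }

    ~LocalPtr() {
        if (m_count && --*m_count == 0) { delete m_ptr; delete m_count; }
    }

    void swap(LocalPtr& other) {
        std::swap(m_ptr, other.m_ptr);
        std::swap(m_count, other.m_count);
    }

    T* get() const { return m_ptr; }
    T& operator*() const { return *m_ptr; }
    T* operator->() const { return m_ptr; }
    std::size_t useCount() const { return m_count ? *m_count : 0; }

private:
    T* m_ptr;
    std::size_t* m_count;
};
```

Compared to std::shared_ptr, every copy and destruction avoids a locked read-modify-write on the counter, which is where the savings come from.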
Using shared_ptr has another implication, namely the use of pointers. Pointers take up space too. I am not familiar enough with your codebase, but maybe you can reduce the amount of pointer-chasing? Your cache locality is probably not that bad thanks to the use of pools, but I suspect it could be even better if you reduce the number of pointers and make more use of fixed storage. Not sure how to do this, but maybe you can think about it.
List vs vector
You use std::list in some places, such as MemoryPool::m_chunks. std::vector has better memory allocation behavior and better cache locality, making it often faster even in the cases where std::list's theoretical complexity is better (e.g. removing from the middle).
In any case, it appears you are not relying on the algorithmic complexity properties of std::list, so consider replacing with std::vector.
Or even better: along the lines of improving cache locality and reducing allocations, consider using a small_vector. See boost::container::small_vector, or the LLVM codebase's ADT SmallVector.
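As an illustration of the std::list-to-std::vector switch: if the order of the chunks does not matter, a vector plus swap-and-pop even gives O(1) removal from anywhere. A sketch, where MemoryPoolSketch and removeChunk are hypothetical names mirroring the m_chunks field mentioned above:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Chunk { /* chunk payload */ };

struct MemoryPoolSketch {
    std::vector<Chunk*> m_chunks; // was: std::list<Chunk*>

    // Swap-and-pop: overwrite the victim with the last element, then shrink.
    // Removes from the "middle" in O(1) after the O(n) find, with contiguous
    // storage and none of std::list's per-node allocations.
    void removeChunk(Chunk* c) {
        auto it = std::find(m_chunks.begin(), m_chunks.end(), c);
        if (it != m_chunks.end()) {
            *it = m_chunks.back();
            m_chunks.pop_back();
        }
    }
};
```

With std::list the find is also O(n) but chases a pointer per node, so the vector version typically wins on both halves of the operation.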
Zero-copy architecture
You are making limited use of a zero-copy architecture. You already have heap-allocated I/O buffers which you can move around without copying the data, and this is good. However, your stream interfaces read data by copying to a void* buffer.
In Passenger we make use of "mbufs": slices/substrings of an I/O buffer. I/O buffers are non-atomically reference counted (making them only usable within 1 thread). Mbufs act as smart pointers, manipulating the refcount of the underlying I/O buffer. This way we do not need to copy around any data.
You may ask: what about non-contiguous data that is spread over multiple I/O buffers? In Passenger we embrace the fact that data may not be contiguous, using a data structure called LString. We then build algorithms around such a data structure instead of around contiguous strings. Only in cases where we can't build such an algorithm (and thus have no choice but to use a contiguous string, e.g. when we need to call an external library or a system call like stat()) will we turn the LString into a contiguous string.
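A rough sketch of the mbuf idea described above; all names and the layout are illustrative, not Passenger's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// The IOBuffer owns the bytes; its refcount is a plain int, so it is
// only usable within one thread (as described above).
struct IOBuffer {
    char* data;
    std::size_t size;
    int refcount;
};

// An mbuf is a slice: a pointer + length into an IOBuffer, plus a
// reference on it. Copying an mbuf copies a few words, never the data.
class MBuf {
public:
    MBuf(IOBuffer* buf, std::size_t off, std::size_t len)
        : m_buf(buf), m_data(buf->data + off), m_len(len) { ++buf->refcount; }

    MBuf(const MBuf& o) : m_buf(o.m_buf), m_data(o.m_data), m_len(o.m_len) {
        ++m_buf->refcount; // non-atomic: single-thread only
    }

    MBuf& operator=(const MBuf&) = delete; // omitted to keep the sketch short

    ~MBuf() {
        // Last slice out frees the underlying buffer.
        if (--m_buf->refcount == 0) { delete[] m_buf->data; delete m_buf; }
    }

    const char* data() const { return m_data; }
    std::size_t size() const { return m_len; }

private:
    IOBuffer* m_buf;
    const char* m_data;
    std::size_t m_len;
};

// Helper for the sketch: wrap a string in a freshly allocated IOBuffer.
inline IOBuffer* makeBuffer(const std::string& s) {
    char* d = new char[s.size()];
    std::memcpy(d, s.data(), s.size());
    return new IOBuffer{d, s.size(), 0};
}
```

For example, a request parser can hand out one mbuf for the method and one for the path out of a single received buffer, with zero copying of the payload bytes.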
Other feedback
SpinLock everywhere
You're using spinlocks everywhere instead of OS mutexes. Spinlocks are often faster than mutexes in microbenchmarks, as well as smaller, but they can be dangerous depending on the amount of contention you have. With little to no contention, they are great. With large amounts of contention you risk spending a significant amount of time spinning, which is slower than having the OS pause the process and do the scheduling for you.
I see that you use spinlocks in MemoryPool's obtain() method. That method performs a non-trivial amount of work. For example, you call into the native memory allocator using new, which can perform an arbitrary amount of work.
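One way to shrink that critical section is to take only the fast path under the lock, and call the allocator with the lock released. A sketch; the names are hypothetical, not oat++'s actual MemoryPool code, and a std::mutex stands in for the spinlock to keep it portable:

```cpp
#include <cassert>
#include <mutex>
#include <vector>

struct Chunk { char bytes[4096]; };

class PoolSketch {
public:
    Chunk* obtain() {
        {
            std::lock_guard<std::mutex> guard(m_lock);
            if (!m_free.empty()) {
                Chunk* c = m_free.back();
                m_free.pop_back();
                return c; // fast path: the critical section is tiny
            }
        }
        // Slow path: call the native allocator with NO lock held, since
        // new can take an arbitrary amount of time.
        return new Chunk();
    }

    void release(Chunk* c) {
        std::lock_guard<std::mutex> guard(m_lock);
        m_free.push_back(c);
    }

private:
    std::mutex m_lock; // a spinlock in the original; keep it guarding only cheap work
    std::vector<Chunk*> m_free;
};
```

With the critical section reduced to a pointer pop/push, a spinlock becomes much safer, because no thread can be stuck spinning while another thread is inside the allocator.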
New does not return nullptr?
There is some code which allocates something with new and then checks for a nullptr result, such as ChunkedBuffer::obtainNewEntry(). But new does not return nullptr on failure; it throws std::bad_alloc.
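A minimal sketch of the distinction; the Entry type and function names are hypothetical, not oat++'s actual code:

```cpp
#include <cassert>
#include <new>

struct Entry { char data[64]; };

Entry* obtainEntryThrowing() {
    // Plain new never returns nullptr; it throws std::bad_alloc on
    // failure, so a nullptr check after this call is dead code.
    return new Entry();
}

Entry* obtainEntryNothrow() {
    // If a nullptr result is genuinely wanted, the nothrow form is the
    // one that provides it.
    return new (std::nothrow) Entry();
}
```

So either drop the nullptr checks and let the exception propagate (or catch std::bad_alloc), or switch those allocations to new (std::nothrow) so the checks actually mean something.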
More documentation
It can be a bit hard to figure out what a class is supposed to do or how it is supposed to be used. More code documentation, both on the class level and on the architecture/concept level, would be nice.
On the class level, I do not understand what ChunkedBuffer is supposed to be and how it differs from FifoBuffer. I also did not understand what ParsingCaret is and how it is used until I had studied quite a lot of code that uses ParsingCaret. It also took me a while to figure out how SHARED_OBJECT_POOL and OBJECT_POOL are used.
On the conceptual level, I have no idea how the coroutine stuff works and is supposed to be used. I also have no idea what concurrency model you are using. Is it one-client-per-thread? Your benchmarks seem to suggest that you can't be using that. But I grepped for select/poll/epoll/kqueue/O_NONBLOCK and I could not find anything in your code using any of those mechanisms, so apparently you aren't using non-blocking evented I/O either. It is a mystery.
More tests
Unit tests have saved my life so many times. Debugging regressions that are discovered after release is not fun. The amount of tests you have is fairly small compared to the size of your library.