Comments (13)
Yep. The performance isn't as bad as first reported, though; further testing has shown we are on par with, and in some cases better than, nanomsg.
There may be some opportunities to reduce use of the global aio lock, or even break it up. For example, I can see we grab it in aio_abort, but all we do is collect the value of a function pointer -- most likely that can be done without a lock, or by using some kind of atomic instead.
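A minimal sketch of that idea, using C11 atomics instead of the global lock to read the cancellation callback (all names here are hypothetical stand-ins, not nng's actual internals):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical sketch: store the cancellation callback in an _Atomic
 * pointer so abort can load it without serializing on a global mutex.
 * The setter stores with release; the reader loads with acquire. */
typedef void (*cancel_fn)(void *arg, int rv);

struct fake_aio {
    _Atomic(cancel_fn) cancel; /* written with release semantics */
    void              *arg;
};

static void fake_aio_abort(struct fake_aio *aio, int rv)
{
    /* acquire load pairs with the release store in the setter,
     * so no lock is needed just to read the pointer */
    cancel_fn fn = atomic_load_explicit(&aio->cancel, memory_order_acquire);

    if (fn != NULL) {
        fn(aio->arg, rv);
    }
}
```

The remaining difficulty, as noted, is that lock-free reads only help if the callback's lifetime is also safe without the lock.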
I'm remembering now that the global lock for aios was introduced to support expiration. I had trouble figuring out how to detangle multiple locks that would be required otherwise, because we have just a single global expiration thread.
It may be possible to utilize the capabilities in the poller(s) to support expiration, and thus eliminate some of this, but that requires more careful planning and thought, and research.
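One way a poller could absorb expiration duties is to pass the nearest pending deadline as its wait timeout, instead of relying on a dedicated expiration thread. A small sketch of just the timeout computation (names are illustrative, not nng internals):

```c
#include <limits.h>

/* Compute a poll()/epoll_wait()-style timeout (in ms) from the nearest
 * pending aio deadline. A negative deadline means nothing is pending,
 * so the poller may block indefinitely (-1). */
int timeout_for(long long now_ms, long long nearest_deadline_ms)
{
    if (nearest_deadline_ms < 0) {
        return -1; /* no pending expirations: block until I/O */
    }
    long long d = nearest_deadline_ms - now_ms;
    if (d <= 0) {
        return 0;  /* already expired: poll without blocking */
    }
    return (d > INT_MAX) ? INT_MAX : (int)d;
}
```

The hard part this sketch glosses over is tracking "nearest deadline" across many aios without reintroducing a contended shared structure.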
from nng.
If you're considering websockets, you might as well give up on the question of performance. WebSockets are abysmal for performance, everywhere -- because the specification requires it. (The extra bogus "encryption" layer means unavoidable data copies, etc.)
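The "bogus encryption" here is the mandatory client-to-server masking from RFC 6455 §5.3: every payload byte must be XOR'd with a 4-byte key, which forces a pass (and typically a copy) over the whole message. The transform itself is trivial:

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 6455 §5.3 masking: every client frame payload byte is XOR'd with
 * a rotating 4-byte key. Applying it twice restores the original data,
 * but either way it costs a full pass over the payload. */
void ws_mask(uint8_t *payload, size_t len, const uint8_t key[4])
{
    for (size_t i = 0; i < len; i++) {
        payload[i] ^= key[i % 4];
    }
}
```

Real implementations vectorize this, but the per-byte touch (and the copy it usually implies) cannot be specified away.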
Having said that, I probably have a different idea about performance than you do. I'm concerned about insane levels of scaling, and super low latencies. The goal is to support single digit microsecond latencies. (We are not there yet.)
For your use case described above, I don't think it much matters which way you go, and I'd go for whatever is easiest for you. Probably the PUB/SUB approach with NNG will be easier, as with vanilla websockets you still have to do your own message framing etc. (Basically the upper layer protocol bits.)
The bit of profiling I've done has pointed pretty clearly to contention on the static nni_mtx nni_aio_lk in aio.c. I need to understand the aio system better to formulate a concrete suggestion, but it stands to reason that a global lock for I/O operations could be a bottleneck.
So there are some important things. First off, I've found ways to significantly improve performance, shaving 15-20 usec per operation. This comes from eliminating an extra set of context switches: completions that are already running asynchronously now call other completions synchronously, avoiding a pointless round of thread rescheduling.
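A toy model of that change, with the cross-thread handoff reduced to a counter (this is an illustration of the idea, not nng's actual scheduler code):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct task {
    void (*cb)(struct task *);
    struct task *next;  /* completion to run after this one finishes */
} task;

static int handoffs;  /* counts simulated cross-thread dispatches */

/* Old path (simulated): every completion is queued back to a worker,
 * costing a wakeup and a context switch per hop. */
static void reschedule(task *t)
{
    handoffs++;  /* stand-in for enqueue + worker thread wakeup */
    t->cb(t);
}

/* New path: a callback that is already running asynchronously calls the
 * next completion directly, skipping the reschedule entirely. */
static void complete(task *t, bool already_async)
{
    if (t->next == NULL) {
        return;
    }
    if (already_async) {
        t->next->cb(t->next);  /* synchronous call: no handoff */
    } else {
        reschedule(t->next);
    }
}
```

The trade-off is stack depth and re-entrancy: calling completions inline means each callback must tolerate running in its caller's context.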
Second, my own performance tests indicate that nanomsg is still faster when using its own performance tools. But I've come to believe that these synthetic benchmarks are actually useless, and that in the real world, nng is probably faster than nanomsg.
There are two reasons for believing this:
First is that almost everyone integrates nanomsg into a poll loop, where they use nn_poll or the NN_RECVFD or NN_SENDFD options to get descriptors that they can integrate into their own poll loop. This means extra system calls before the application gets to know about things. With nng, we can use the aio structure to get notifications via condition variable, bypassing two extra system calls per operation. This should be rather huge. (Note that the synthetic benchmarks don't use these at all.)
Second, nanomsg is inherently single threaded in the backend. This means it does not scale at all, failing to engage multiple cores. For some applications this is fine, but for large numbers of applications this becomes severely limiting. (Worse, nanomsg steals the CPU from the application, by running significant amounts of protocol processing on the application's thread. This leads to faster single threaded performance by avoiding context switches, but it prevents the application from doing anything else useful at the same time.)
If your application is inherently single threaded, using only blocking nn_send and nn_recv calls, then you will see slightly reduced performance compared with nanomsg. While these types of applications are common, they are rarely performance sensitive. Far more common in performance sensitive areas are asynchronous application consumers.
Things should be much better now... but there is still work to do.
Pollers could utilize multiple threads for increased scalability. (There are some tricky race-related considerations, though.) We also need to do a better job of auto-scaling based on the underlying system (more CPUs == more threads).
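A sketch of what that auto-scaling could look like on POSIX systems -- sizing the poller pool from the online CPU count, clamped to a sane range (the bounds here are illustrative, not anything nng has committed to):

```c
#include <unistd.h>

/* Pick a poller thread count from the online CPU count.
 * Floor of 2 so one blocked callback can't stall all I/O;
 * cap of 8 to bound contention between poller threads. */
int poller_threads(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);

    if (ncpu < 1) {
        ncpu = 1;  /* sysconf can fail; assume a single core */
    }
    if (ncpu < 2) {
        return 2;
    }
    if (ncpu > 8) {
        return 8;
    }
    return (int)ncpu;
}
```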
Is there a reason this is still open?
Probably I should close it and replace it with specific tickets for specific enhancements.
That would be a helpful communication.
I'm oscillating back and forth between nng and libwebsockets for my next venture. It's backed by SQLite running in a write-only thread, with several readers. My app lives mostly inside the SQLite database, using triggers for the workload, and my thought is to send messages out through a virtual table of some sort. I like the pub/sub stuff in NNG, and I like the ease of establishing a web service using LWS. It all comes down to which will give me more manageable IPC. Comparing the two approaches is NOT straightforward; the biggest thing is thinking through the memory and message management scenarios. Anything I can glean on performance is definitely helpful...
libwebsockets helps with the sub-protocols. It's pretty interesting and has some really unique features: static serving of compressed files directly from inside a zip file, and its stacking of protocol mounts. But your messaging patterns are pretty nifty. I think I'll end up trying both before I decide the best course of action...
I'm going to close this -- we've broken this up by identifying a bunch of additional work items, and on the way to v1.3 we've actually made more significant strides.
The white-hot aio lock is still a problem, but I have been experimenting with ways to reduce that -- more to come later.
@gdamore any recent measurements to look at?
I've been looking mostly at micro benchmarks on my mac and PC at this point. I can say for some workloads I've seen latencies drop by more than half -- up to 75% in one case -- though that was a somewhat contrived test. The smallest improvement I saw was about 5%.
If the dominant factor in your workload is actually moving the message across the wire, and you're using pipeline or pair (not polyamorous), then you will probably see the smallest benefit.
The pair and pipeline protocols leave the most still on the table. REQ/REP, PUB/SUB, and BUS have the most gains so far. The data copy reductions will improve workloads moving large messages the most. The micro-optimizations and contention improvements will probably show the biggest gains on workloads with small messages.
I can tell you that my changes shaved 1-2 dozen microseconds in round-trip latency for typical workloads on my hardware.
The problem with generating "real comparisons" is deciding what workloads to model, and then actually having dedicated hardware to run repeatable benchmarks.
Actually this short overview is fine for me, thanks 😉.