
Comments (13)

madolson commented on June 12, 2024

If a command like KEYS is yielding, we can only execute some subset of commands, in particular no write commands, so yielding sounds like a somewhat difficult way forward.

I would say no read commands either. I think the intention is just serve ping/pong messages for health. The point is not to make the core multi-threaded, just allow it to be responsive while processing a command.
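
The "yield only to serve health checks" idea could be modeled roughly like this. This is a conceptual Python sketch, not Valkey internals; every name here (`serve_health_checks`, `keys_command`, the constants) is a hypothetical illustration of the mechanism being discussed:

```python
# Conceptual sketch: a long-running command periodically "yields" so the
# server can answer a tiny allowlist of health commands (e.g. PING).
# All names are hypothetical; this is not Valkey code.

YIELD_EVERY = 10_000          # iterations between yield points
HEALTH_ALLOWLIST = {"PING"}   # the only commands servable mid-command

def serve_health_checks(pending, replies):
    """Answer only allowlisted commands; leave everything else queued."""
    still_pending = []
    for cmd in pending:
        if cmd in HEALTH_ALLOWLIST:
            replies.append("+PONG")
        else:
            still_pending.append(cmd)   # reads and writes must wait
    pending[:] = still_pending

def keys_command(db, pending, replies):
    """A KEYS-like scan that yields every YIELD_EVERY iterations."""
    result = []
    for i, key in enumerate(db):
        if i % YIELD_EVERY == 0:
            serve_health_checks(pending, replies)
        result.append(key)
    return result
```

The point the sketch makes is that only the allowlisted command is answered mid-scan; a queued read or write stays pending until the long command completes.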

from valkey.

madolson commented on June 12, 2024

When doing a latency check you don't want PING to be given special priority. Is there any risk that someone is using PING this way?

We use ping that way :D. That is a tradeoff, but I would rather have INFO give a realistic picture of latency as opposed to just relying on the time it takes a PING to respond.


madolson commented on June 12, 2024

So, the main problem we are trying to solve is that the Valkey engine should be "reliable" for health checks and management operations, and not be stalled by long-running developer applications. An alternative port works well; we could even have it return a very small subset of data in the RESP format from a separate thread so it is always reliable.

However, I think the ideal solution would be to make it so that the engine is practically always available to commands. We can do this in three parts:

  1. Make it so that health checks can always be served during long-running commands. Basically, implement some way to "yield" out of a command to serve a very specific subset of commands. We started this work in Redis by allowing yielding module commands to disallow any commands from being executed.
  2. Add support to mark a client as "higher priority" on the event loop, so that it gets served with some regular cadence. There are a few strategies we could use to mark the clients (local connections, unix connections, based off of an admin flag like in #469). There are also ways to make the client higher priority; we could introduce a second event loop just to poll for those higher-priority connections.
  3. Make sure that we don't spend too much time on anything else. This includes cleaning up deep command pipelines, and making sure anything like a flush also periodically does this yielding.
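
Point 2 above could be modeled along these lines. This is a minimal sketch of priority scheduling on an event loop under stated assumptions; the `Client` class and `event_loop_tick` function are hypothetical names, not the actual Valkey event loop:

```python
# Conceptual sketch of priority clients: connections flagged as high
# priority (e.g. admin or local connections) are fully drained before
# normal clients on every event-loop iteration. Hypothetical names only.

from collections import deque

class Client:
    def __init__(self, name, priority=False):
        self.name = name
        self.priority = priority
        self.inbox = deque()   # commands read but not yet executed

def event_loop_tick(clients, served):
    """Serve all priority clients first, then one normal client."""
    for c in clients:
        if c.priority:
            while c.inbox:
                served.append((c.name, c.inbox.popleft()))
    for c in clients:
        if not c.priority and c.inbox:
            served.append((c.name, c.inbox.popleft()))
            break   # normal clients share the remaining per-tick budget
```

Whatever mechanism is chosen to flag the client, the scheduling property is the same: an admin command never sits behind a deep pipeline of application traffic.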

IO Threads might also be a pathway here. We could have a separate thread start polling on the main event loop if a command has been running too long.

The benefit of prioritization is that it requires no end-user changes and also helps solve the cluster-bus issues.


ranshid commented on June 12, 2024

I agree yielding inside "long commands" (we probably need to consider all commands lacking the CMD_FAST flag) is a good option. I think we will have to think about which commands we will allow when yielding. For example, it might not be good to start a BGSAVE process while we yield, since that would break save atomicity. I would suggest we start by introducing client/user prioritization and then build a good yielding infrastructure allowing a minimal set of commands. I guess we can start by using the same mechanism used for lua timeout, where we use a minimized cron and handle some commands like SCRIPT KILL and SHUTDOWN NOSAVE. Maybe we can work to incrementally extend the commands we allow?
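
The minimal, incrementally extendable allowlist described here could look something like this. The `YIELD_SAFE` set and the function name are illustrative assumptions, not an actual Valkey mechanism:

```python
# Sketch of dispatch during a yield: commands are checked against a small
# set that is safe to run mid-command; unsafe ones (e.g. BGSAVE, which
# needs an atomic snapshot) are deferred rather than rejected with an
# error. All names here are hypothetical.

YIELD_SAFE = {"PING", "SHUTDOWN NOSAVE", "SCRIPT KILL"}

def dispatch_during_yield(cmd):
    """Decide what to do with a command that arrives mid-yield."""
    if cmd.upper() in YIELD_SAFE:
        return "EXECUTE"
    return "DEFER"   # queued until the long command finishes; no error
```

Deferring instead of erroring matches the concern raised below about the lua-timeout mechanism throwing errors at end users, and the set can grow one command at a time as each is proven safe.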


madolson commented on June 12, 2024

guess we can start by using the same mechanism used for lua timeout

One problem with lua timeout is we throw errors to end users, let's not do that here.


daniel-house commented on June 12, 2024

Is there anything to read that would help me understand more about any existing thoughts on how to make long-running commands (e.g., KEYS *) yield to some other commands (e.g., PING)? Requirements, designs, or PRs (rejected or paused)?


madolson commented on June 12, 2024

Is there anything to read that would help me understand more about any existing thoughts on how to make long-running commands (e.g., KEYS *) yield to some other commands (e.g., PING)? Requirements, designs, or PRs (rejected or paused)?

Nothing in the open AFAIK.


daniel-house commented on June 12, 2024

The Async IO threads refactoring says "Read Operation: The IO thread will only read and parse a single command." This means that they will not "look ahead" to see if there is a PING that needs to be prioritized. This is one possible solution: add another thread and a work buffer so that we can dequeue commands, scan them to see if they are PING, then put them in the work buffer. I do not claim that this is the only way, or even a tolerable way, just "a" way. Can anyone suggest a better way, so that we can compare it to adding a management port?
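
The scan-and-reorder step described here might be sketched as follows. This is a hypothetical staging function between the IO threads and the work buffer, not part of the actual IO-threads design:

```python
# Sketch of a "look ahead" staging step: dequeue raw commands, move any
# PINGs to the front of the work buffer, and keep everything else in its
# original order. Hypothetical; not Valkey code.

def prioritize_pings(raw_queue):
    """Return a work buffer with PING commands ahead of everything else."""
    pings, rest = [], []
    for cmd in raw_queue:
        first_word = cmd.split()[0].upper() if cmd.split() else ""
        (pings if first_word == "PING" else rest).append(cmd)
    return pings + rest
```

One design cost worth noting: this adds a parsing pass and a reordering point between read and execute, which is exactly the kind of complexity the alternative (a separate management port) avoids.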


zuiderkwast commented on June 12, 2024

Can we allow I/O threads to serve PING? Next we can allow them to serve cached CLUSTER SLOTS...

If a command like KEYS is yielding, we can only execute some subset of commands, in particular no write commands, so yielding sounds like a somewhat difficult way forward.


daniel-house commented on June 12, 2024

There are two ways that PING could be used: health check and latency check. When doing a latency check you don't want PING to be given special priority. Is there any risk that someone is using PING this way?


daniel-house commented on June 12, 2024

INFO cannot measure total latency because total latency includes the entire network round trip time.

It might be useful for INFO to report an average latency, but even measuring that is problematic. For example, if it isn't a fast command, the latency is different. Would you want INFO to be a command that gets high priority? What window (1 min, 5 min, etc.) would you use for calculating the average latency? Perhaps p99 latency would be better. So many options.

Even if the right information is available from INFO, how do you get all of the existing users of PING to switch? Would that make it a breaking change?

The output from INFO is long and ugly when such a small amount of data is being requested. Perhaps this use case would be improved by either new options to INFO or a new SPECIFIC-INFO command that accepts a list of fields instead of, or in addition to, a list of sections, so that clients could fetch just one field if that is all they need. Still, if the goal is to measure the total round trip, PING remains superior to INFO if only because it doesn't take any arguments at all.
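
A windowed p99 as discussed above could be computed along these lines. This is an illustrative sketch only; the class name, window size, and nearest-rank percentile method are assumptions, not a proposed INFO field:

```python
# Sketch: keep the most recent command latencies in a fixed-size sliding
# window and report the nearest-rank p99 over that window. Hypothetical
# names; integer arithmetic avoids float rounding in the rank.

from collections import deque

class LatencyWindow:
    def __init__(self, max_samples=1000):
        self.samples = deque(maxlen=max_samples)  # sliding window

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def p99(self):
        ordered = sorted(self.samples)
        n = len(ordered)
        if n == 0:
            return None
        idx = (99 * n + 99) // 100 - 1   # nearest-rank ceil(0.99*n), 0-based
        return ordered[idx]
```

Even this tiny sketch surfaces the open questions in the comment above: the window size, the percentile choice, and which commands count as samples are all policy decisions, not implementation details.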


madolson commented on June 12, 2024

Reading this again now, I might have totally misunderstood the suggestion. Is the proposal just to support a secondary port that also puts commands on the same main event loop? I thought the suggestion was to have a secondary event loop dedicated to serving requests on this port.


daniel-house commented on June 12, 2024

I'm thinking of this as being in the requirements stage. At this point I hope it will be read as a wish for a new feature and, if possible, should be read as implementation neutral. So, if there is a way to get all the same advantages without actually adding a new port, that's fine. PING needs to be prioritized, but how that is done is not a requirement.

Since a "proposed solution" was part of the template, I put one in. It does make it less of a whine, but it is somewhat misleading.

The problem, slow response to PING, is real and a great motivation for this, and prioritizing PING would be an immediate and easy way to take advantage of a new port.

What I really want is a new port. It is expected that it will only be used by administrators and that they will not expose it to anything they do not keep tightly locked down (such as an untrusted network). As such, I would like commands on this new port to be given the highest possible priority. This port might be used, for example, to shut down the server, or to temporarily suspend serving clients on the main port while dealing with an emergency, or killing a client that is misbehaving.

An additional requirement: the new port should be allowed to be on a different network card. This would allow for the type of data-center control plane described in Release It! by Michael T. Nygard. In my own mind, this is the most valuable consequence of adding a new port. The control-plane network can be kept clear of data and totally invisible to all but the administrators, and can be used to control both Valkey and non-Valkey servers at the same time. The control plane becomes a natural way for a data-center dashboard to provide administrators with the support they need.
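
Binding the admin listener to a dedicated interface, as this requirement describes, is straightforward at the socket level. A minimal sketch; the function name is hypothetical and the address is a placeholder for a control-plane NIC address:

```python
# Sketch: open a second listening socket bound to a specific interface
# address, separate from the main client port. In practice bind_addr
# would be the control-plane NIC's address; 127.0.0.1 is a placeholder.

import socket

def open_admin_listener(bind_addr="127.0.0.1", port=0):
    """Listen on a dedicated address; port=0 lets the OS pick a free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((bind_addr, port))
    srv.listen(8)
    return srv
```

The harder part of the requirement is not the bind but the scheduling: connections accepted here would still need the prioritization discussed earlier to be useful during a stall.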

