
Comments (3)

wdbaruni avatar wdbaruni commented on June 12, 2024

Thanks, Forrest, for your deep dive, though I am not in agreement with most of the issues and concerns you've raised.

First, there is precedent for what we are doing and for how we filter system-defined fields out of job submissions. Nomad's job spec is very close to ours: it interleaves system-defined fields in the job spec. Similarly, Kubernetes allows users to define `metadata` and `spec` fields when submitting a job, but also returns a `status` field when the job is queried, and it adds its own metadata; it is standard practice for CRDs to inject additional metadata into the same top-level `metadata` field. While not recommended, users can submit jobs with a `status` field set, and Kubernetes will simply ignore it.
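To make the Kubernetes pattern concrete, here is a minimal sketch in Go. The `Job` shape, the `admit` function, and the `bacalhau.org/requester-id` key are all illustrative assumptions, not Bacalhau's actual types: the point is only that a server-owned field submitted by the client is silently dropped, and system metadata is injected next to the user's keys.

```go
package main

import "fmt"

// Hypothetical job shape following the Kubernetes pattern described above:
// user-writable Metadata and Spec, plus a server-owned Status.
type Job struct {
	Metadata map[string]string
	Spec     string
	Status   string
}

// admit mimics submission handling: a client-supplied Status is silently
// discarded, and the server injects its own metadata alongside user keys.
func admit(j Job) Job {
	j.Status = "" // ignore user-supplied status, as Kubernetes does
	if j.Metadata == nil {
		j.Metadata = map[string]string{}
	}
	j.Metadata["bacalhau.org/requester-id"] = "node-1" // system-injected key (illustrative)
	return j
}

func main() {
	got := admit(Job{
		Spec:     "echo hello",
		Status:   "Completed", // user tried to set a server-owned field
		Metadata: map[string]string{"team": "data"},
	})
	fmt.Println(got.Status == "")     // true: status was ignored
	fmt.Println(got.Metadata["team"]) // data: user metadata preserved
}
```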

Second, we need to think about our API design. Nomad, Kubernetes, and our V2 APIs are RESTful APIs with a unified endpoint for interacting with jobs/pods, where users submit jobs with PUT and query with GET. Under your proposal there would be separate endpoints for submitting and others for querying.

As for your specific concerns:

There are several different methods on the job type that must be called to validate and sanitize it. Some of these methods must only be called on the client side, while others must only be called on the server side. This is poor DevEx.

I don't understand why this is poor DevEx. Again, this is what both Kubernetes and Nomad do, and I don't see the problem. Besides, we allow the requester to set default fields if they are missing from the job, so we benefit from having different validation layers: a more relaxed one at submission time, and a stricter one after defaults are populated.
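The two-layer validation described above can be sketched as follows. The field names, the 1800-second default, and the function split are illustrative assumptions: submission-time validation only rejects what can never be defaulted, while the final check runs after defaults and can demand a complete job.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative job fields relevant to validation; not Bacalhau's real model.
type Job struct {
	Type    string
	Timeout int // seconds; 0 means "not set by the user"
}

// validateSubmission is the relaxed, client-facing check: reject only
// what can never be defaulted later (e.g. a missing job type).
func validateSubmission(j Job) error {
	if j.Type == "" {
		return errors.New("job type is required")
	}
	return nil
}

// applyDefaults fills fields the requester is allowed to default.
func applyDefaults(j Job) Job {
	if j.Timeout == 0 {
		j.Timeout = 1800 // illustrative default
	}
	return j
}

// validateFinal is the strict server-side check run after defaults,
// so it can require a fully populated job.
func validateFinal(j Job) error {
	if j.Timeout <= 0 {
		return errors.New("timeout must be positive")
	}
	return nil
}

func main() {
	j := Job{Type: "batch"} // user set no timeout
	if err := validateSubmission(j); err != nil {
		panic(err)
	}
	j = applyDefaults(j)
	if err := validateFinal(j); err != nil {
		panic(err)
	}
	fmt.Println(j.Timeout) // 1800
}
```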

The meta field of a job has various "reserved" keys that cannot be set by users, if they are we delete them: https://github.com/bacalhau-project/bacalhau/blob/main/pkg/models/job.go#L307. This is both poor DevEx and UX.

Again, there is precedent for this in Kubernetes, and it is the main way for CRDs to inject their own state and metadata. I would prefer this to having multiple places to define metadata for the same job.

We only disallow user submissions with the JobID set if the job type is batch or ops. system and daemon jobs may provide their own jobID on submission: https://github.com/bacalhau-project/bacalhau/blob/main/pkg/models/job.go#L295-L300.

Yeah, we can remove this logic as it can be confusing. The original intention was to allow users to provide their own job names and to update long-running jobs, but not batch and ops jobs. Job updates are a bigger discussion and we have other issues tracking them, including #3871 and #3864. Most likely we will allow users to update batch and ops jobs as well.

We inefficiently store the entire job structure, and its tasks(!!), repeatedly to disk when only modifying a small portion of the structure: https://github.com/bacalhau-project/bacalhau/blob/main/pkg/jobstore/boltdb/store.go#L848-L852. The State of a job doesn't need to be in the same structure as the user's submission. It would be significantly more efficient to store the user submission once and make updates to the system state representation separately.

You need to understand the access pattern of our jobs and APIs before concluding whether we are doing things inefficiently. We almost never retrieve the job spec without its state. Previously, when we had the job spec and state as separate objects, we always had to make two database queries to fetch both. It is true that we are doing heavier writes, but our access pattern is read-heavy. Still, we could benefit from a model closer to Kubernetes, with top-level fields for metadata, spec, and state. That would allow job store implementations to optimize their storage for partial updates.
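The top-level split suggested above might look like this sketch, where all names are illustrative: the submitted spec is written once, the system-owned state is updated independently, and a single read returns both together, matching the read-heavy access pattern.

```go
package main

import "fmt"

// StoredJob sketches a Kubernetes-style layout: Spec holds the user
// submission, State holds system-owned data, and the two can be
// written independently by a job store.
type StoredJob struct {
	Metadata map[string]string
	Spec     string // written once at submission
	State    string // updated frequently by the system
}

// updateState touches only the state portion, leaving the submitted
// spec untouched; a store could persist just this part.
func updateState(j *StoredJob, state string) {
	j.State = state
}

func main() {
	j := StoredJob{Spec: "echo hello", State: "pending"}
	updateState(&j, "running")
	// A single read returns spec and state together.
	fmt.Println(j.Spec)  // echo hello (unchanged)
	fmt.Println(j.State) // running
}
```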

Whenever the time comes to make changes to the system state representation of the job or task (e.g. add a new field, remove a field, modify a field's type, etc.), we will need to modify everything from the client to the server to accommodate it. Oh, and let's not forget the various SDKs!

I don't understand this. If we add a new system state field, we will need to update the SDKs regardless so they can display it. That is true whether those system fields are interleaved in the job model or sit under a top-level state field.

Running bacalhau job describe may return a job spec that is different from what the users actually submitted. For example, we modify the count field depending on the job type a user submitted: https://github.com/bacalhau-project/bacalhau/blob/main/pkg/models/job.go#L166

We allow requester nodes to set job defaults if they are missing, such as timeout, job resources, and publisher. I don't see this as a problem. I am more concerned with understanding how the system handled my job and what fields it assumed when running it, rather than making sure bacalhau describe returns exactly what I submitted. Also, Kubernetes and Nomad do inject system information into the submitted job, which you will see when you get the job.

Running bacalhau job describe --output yaml | bacalhau job run and getting several warning messages back about fields being ignored is confusing. As a user I would expect this to "just work" without warnings.

First, Kubernetes ignores those fields without a warning; we can remove the warnings in the future if they are painful. Second, this model of re-submitting jobs may not be useful or common anyway. Third, users are different, and we need to collect more feedback before making drastic changes to our APIs. Saying "as a user I would expect this to just work without warning" is a big assumption.

Users must carry the cognitive overhead of knowing which fields they can set and which they can't when submitting jobs. This is called out plainly in our documentation: https://expanso.gitbook.io/bacalhau-docs/setting-up/jobs/job#server-generated-parameters.

I think you are overcomplicating this. Having reserved prefixes for metadata doesn't sound painful for users to work with, especially as we follow a pattern where all reserved metadata keys are prefixed with bacalhau.org/.
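A sketch of that prefix convention, with illustrative names: any user-supplied key under the reserved prefix is dropped on submission, so the system's namespace never collides with the user's.

```go
package main

import (
	"fmt"
	"strings"
)

// ReservedPrefix follows the convention mentioned above: all
// system-owned metadata keys share one well-known prefix.
const ReservedPrefix = "bacalhau.org/"

// stripReserved drops user-supplied keys that collide with the
// reserved prefix, a sketch of the filtering discussed in this thread.
func stripReserved(meta map[string]string) map[string]string {
	out := make(map[string]string, len(meta))
	for k, v := range meta {
		if strings.HasPrefix(k, ReservedPrefix) {
			continue // reserved for the system; ignored on submission
		}
		out[k] = v
	}
	return out
}

func main() {
	meta := map[string]string{
		"team":                      "data",
		"bacalhau.org/requester-id": "spoofed", // users may not set this
	}
	fmt.Println(stripReserved(meta)) // map[team:data]
}
```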

from bacalhau.

wdbaruni avatar wdbaruni commented on June 12, 2024

My overall feedback is that I am sure our APIs will go through another iteration at some point, but let's do that after we have more users and have collected more feedback to make informed decisions. The current APIs are functional, have precedent in Nomad, and I don't see them causing us problems for at least two years.

We have already committed to fixing a lot of technical debt this quarter, and we need to move forward and deliver new features rather than getting stuck in this phase. When we have more capacity and more users, we will explore not only the structure of our API models, but also the need for separate APIs and CLIs for the different job types, as we've discussed before.


frrist avatar frrist commented on June 12, 2024

Closing due to the controversial nature of this change, as discussed here and synchronously.

