This is a pretty involved feature request centred around cloud usage for fully versioned pipelines (i.e. where both the code and the data are versioned, through a combination such as Git+Dud). The domain is data engineering rather than pure ML, though ML is quite often one of the stages of a larger pipeline.
Suppose that the business logic described by the pipeline is heavily reliant on the cloud, i.e. successive stages are expected to be performed in the cloud on demand. This necessitates storing the data in the cloud as well. If data usage is heavy, emulating a "local" solution such as a shared disk is impractical due to bandwidth limitations (and in a certain sense also runs counter to cloud-native ideology), so we shall assume that the on-demand workers have to fetch their data independently from object storage.
Dud already backs up the cache to object storage by default; however, that in itself does not accomplish the goal. The issue is that `dud run` currently expects all the work to be done locally, with the remote cache acting as a backup to the local cache and local stage files acting as synchronization objects. To work around this, the user is expected to manually execute the following on an on-demand cloud worker node (a rough sketch of such a wrapper script follows the list):
- (git-)pull the stage files from remote
- (dud-)pull the required inputs
- run the job
- (dud-)commit the outputs to local cache
- (dud-)push the outputs to cloud storage
- (git-)commit the Dud state (stage files)
- (git-)push the stage files to remote
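For concreteness, here is a minimal sketch of what such a manual wrapper script might look like on a worker node. The branch name, commit message, and the exact Dud subcommands used for each step (e.g. fetch/checkout rather than a single pull) are illustrative assumptions, not taken verbatim from Dud's documentation:

```sh
#!/usr/bin/env sh
# Manual per-worker workflow as described above; all names are illustrative.
set -eu

# 1. Get the latest stage files (pipeline state) from the Git remote.
git pull --ff-only origin main

# 2. Fetch the required inputs from the remote (object-storage) cache and
#    link them into the workspace.
dud fetch
dud checkout

# 3. Run the job for this stage.
dud run

# 4. Commit the outputs to the local cache, then push them to object storage.
dud commit
dud push

# 5. Record the updated Dud state (stage files) in Git and publish it, so
#    downstream stages can see the new outputs.
git add .
git commit -m "pipeline: outputs of this stage"
git push origin main
```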
It would be great if Dud could do all of that in `dud run`, perhaps with a suitable set of flags!
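To make the ask concrete, one possible shape for such an invocation is sketched below; every flag name is hypothetical and exists only to illustrate the proposal:

```sh
# Hypothetical flags, purely illustrative of the request:
#   --git-pull   first git-pull the stage files from the remote
#   --fetch      dud-pull the required inputs from the remote cache
#   --commit     dud-commit the outputs to the local cache after the run
#   --push       dud-push the outputs to object storage
#   --git-push   git-commit and git-push the updated stage files
dud run --git-pull --fetch --commit --push --git-push
```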
I appreciate that this runs counter to having Dud as a single-purpose tool, but it would take a lot of weight off the developer in a very common (albeit advanced) use case. `dud run` at the moment is already a composite tool, devoted to pipelining rather than data versioning per se. This is natural, as pipelining works much better with full versioning of both code and data, and the feature request is a natural extension of this functionality.
In this solution, synchronisation between the stages is accomplished using Git -- a bit crude, but in keeping with the spirit of reproducible pipelines. The resulting "pipeline log" can be squashed, so I don't see that as a major issue, but there are certainly alternatives, such as using a cloud-based key-value store to synchronize.
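For example, the per-stage commits produced by a run could be collapsed after the fact with an ordinary squash merge (the branch name and message below are illustrative):

```sh
# Collapse the per-stage "pipeline log" into a single commit on the main branch.
git checkout main
git merge --squash pipeline-run
git commit -m "pipeline: squashed results of the run"
```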
I would be interested in your thoughts on the matter.