There are several issues with the current backup/restore workflow:
- It is very complex and involves many manual steps.
- Backups require API downtime.
- We can't restore into a different environment.
- We can't restore into a live foundation.
- Because of the complexity and the API downtime, we can't test our backup/restore procedure in a production environment. And that is a real problem for us - we tested backup/restore in our infra environment, but we never tested it in prod. When a production issue occurred that required a restore, we discovered that we couldn't use our production backups. (It was our own fault, not a BBR issue, but if we had been able to test the whole procedure in advance, we could have avoided it.)
- Backup/restore is very slow and the backup files are huge (almost a terabyte in our case). Because of that we have to set up dedicated Concourse workers for backup and allocate a huge S3 bucket. We also can't back up often - at most once a day - so our backups may already be outdated by the time we need to restore.
- Restore is an "all or nothing" process - we can't do partial restores, for example restoring only the apps in a particular org if they got corrupted for whatever reason, or restoring only PCF on a different BOSH director.
Because of those issues, Pivotal anchors at PCF dojos usually recommend repaving the foundation and re-pushing all your apps. This works very well if you automate your deployments and require your developers to push all their apps via a single CI/CD pipeline.
In our case we have automation in place, but, because of our internal processes, we can't easily re-push all applications.
Because of that, I started to think about a different way to back up/restore PCF. Like most PCF consumers, we don't really need a BOSH or opsman backup, because we can easily redeploy opsman and BOSH using the Concourse pipeline. We also don't need a full backup of the UAA and Cloud Controller databases, because, like most PCF users, we use the cf-mgmt pipeline to configure orgs, spaces, users, permissions, etc. - just run the pipeline on a fresh foundation and we will have identically configured orgs and users.
What we really need is some way to backup the following:
1. Application artifacts (jar files)
2. Application configuration
3. Services configuration
4. Service data
We can't back up/restore service data in a generic way - each service has its own backup mechanism - so we can skip that part.
Points 1-3, however, can be backed up via the CF API in a generic way: there is an API call to download application artifacts, one to get application configuration, and one to retrieve service instance parameters.
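To make the idea concrete, here is a minimal sketch of the per-app calls such a generic backup could make. The v2 endpoint paths below are my reading of the public Cloud Controller API docs, and the function itself is a hypothetical helper, not part of any existing plugin:

```python
# Sketch of the Cloud Controller calls a backup plugin could make.
# The endpoint paths are assumptions based on the v2 API docs,
# not a tested implementation.

def backup_endpoints(app_guid, service_instance_guid):
    """Return the Cloud Controller paths that cover points 1-3."""
    return {
        # 1. Application artifacts (the app's bits, e.g. the jar file)
        "artifact": f"/v2/apps/{app_guid}/download",
        # 2. Application configuration (env vars, routes, bound services)
        "configuration": f"/v2/apps/{app_guid}/summary",
        # 3. Service instance parameters - the call that doesn't really
        #    work today, since the Cloud Controller doesn't store them
        "service_parameters": (
            f"/v2/service_instances/{service_instance_guid}/parameters"
        ),
    }
```

Each path would then be fetched with an authenticated GET, e.g. via `cf curl <path>` or any HTTP client carrying the OAuth token.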
Originally I thought I could easily write a simple backup/restore CF CLI plugin myself. The problem, however, is that the last API call (the one that retrieves service instance parameters) doesn't really work. At the moment the Cloud Controller doesn't store service instance parameters in its own database, and there is no generic way to retrieve them from brokers either.
From my point of view, storing service instance parameters in the Cloud Controller database should be easy, so if you could coordinate with the Cloud Controller team and work on such a plugin, it could be extremely beneficial for the Cloud Foundry community.
Does all of this make sense? Any thoughts?