The PGE Cloudburst Framework is used to automate the execution of programs and scripts at scale using Amazon AWS Batch. It can be used to run a series of programs across tens, hundreds or even thousands of servers concurrently. The framework makes it easy to run a process that gathers input data, processes it, and stores results for later analysis.
To fully use the framework, you will need to install and configure a number of tools.
- Git, to interact with github
- An editor, such as JetBrains PyCharm or Microsoft Visual Studio Code
- the Amazon AWS CLI (command-line interface)
- Terraform, for configuring AWS
- Python 3.9 or higher, with a couple of python modules
The installation steps are documented below.
- decide on a unit of work. For instance, a single run of a single program, or a sequence of programs operating on common data. Is there a parameter or two that specifies this unit of work? The unit of work should take less than 10 hours to run.
- decide on a name for your project e.g. "projectx"
- develop the programs/scripts to perform the unit of work
- define a set of input data
- define a parameter (or several parameters) to control the execution.
installation involves a number of packages. These can be either downloaded as installers from the web or installed by a command-line tool. For command-line installation on the mac, homebrew is recommended. On the pc, chocolatey is recommended. In Linux, you can use your favorite package manager.
- on windows?
- install chocolatey
- install make: from an admin powershell prompt:
choco install make
- install git
- https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
- mac: (included when you install xcode command line tools)
- win:
choco install git
- Configure your GIT name and email address. Open a command prompt
git config --global user.name "John Doe"
git config --global user.email [email protected]
- install python 3.9+
- https://www.python.org/downloads/
- mac:
brew install python
- win:
choco install python
- install a python package manager, e.g. pip. Add the boto3 and json schema packages
pip install boto3 jsonschema
- install the AWS CLI
- https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
- mac:
brew install awscli
- win:
choco install awscli
- install terraform
- https://www.terraform.io/downloads.html
- mac:
brew install terraform
- win:
choco install terraform
- install 7zip and ensure
7z
is in the path- https://www.7-zip.org/download.html
- mac:
brew install p7zip
- win:
choco install 7zip.install
- install docker desktop
- install an editor of your choice (optional)
- JetBrains Pycharm https://www.jetbrains.com/pycharm/
- Microsoft Visual Studio Code https://code.visualstudio.com/
a. are you creating a new repo? you can fork the cloudburst repo and use this structure to develop your project. This can be done from the github: https://github.com/bh3791/cloudburst
To get updates from the cloudburst repo later, use the github site to merge the updates, or follow these manual steps:
git fetch upstream master
git pull upstream master --squash
-
use an existing github repo. Use
git subtree
to clone the cloudburst repo as a directory subtree in your repo. From a command prompt in your repo, first define a remote (shortcut):git remote add -f cloudburst https://github.com/bh3791/cloudburst.git
-
Next, add the subtree:
git subtree add --prefix cloudburst cloudburst master
-
To update the subtree later on:
git fetch cloudburst master git subtree pull --prefix cloudburst cloudburst master --squash
-
you can even push changes back up to the remote cloudburst repo, being careful to merge any customizations. This is safest to do if you first fork the cloudburst repo, push to your fork, and then use a pull request to request the changes get into the main repo.
git subtree push --prefix=cloudburst https://github.com/<your_name>/cloudburst.git master --squash
-
when you use subtree, you will get the default Makefile, Dockerfile and docker-compose.yml files in the cloudburst subdirectory. It is best to copy these files to your repo root directory, and then customize them, or merge them with your existing files, rather than use them in-place.
-
run the
aws configure
command to set up your security credentials. You will need to generate these in the AWS console https://console.aws.amazon.com/iamv2/home#/users : user : security credentials : access keys -
edit the
Makefile
, naming the TAG and REPO_ID 'projectx' and your AWS_ID (account id) at the top -
configure terraform by editing the
terraform.tfvars
file, naming the deployment 'projectx'. Create the infrastructure when you are satisfied. The project is initially configured to work out of the box in the us-west-2 region. To modify it for another region, networking.tf would need to be edited to apply the correct availability zones and subnets.cd terraform terraform init terraform plan terraform apply
-
create one or more buckets to store input/output data e.g. 'projectx_test'
aws s3 mb projectx_test
The easiest way to approach this is to start by creating a functioning docker container that contains the scripts/executables you want. However, you may not yet be accessing S3 to retrieve and store the input and output data.
To use the cloudburst framework, you will need to edit your Dockerfile to add the scripts
and define the entrypoint as the framework entrypoint. The framework then uses
the file scripts/tasks.json
to retrieve data, run your scripts, and then store the data.
For example, a simple scenario might be:
- copy input data from S3 (requiring you to set up an S3 bucket and load it with input data)
- run a series of programs/scripts,
- and store output data back to S3.
These tasks are configured in the json file. You can refer to the json schema file using a schema
friendly editor to edit the file. The tasks types are fetch, tasks, move, store
and can be
specified in any order:
- fetch: retrieves a file, or files, from an S3 storage bucket and (optionally) unzip it with 7zip.
- tasks: runs a shell script or other command of your choice. This can be multithreaded across the n cores on the instance when a list of input files is specified.
- store: stores a file, or files, to an S3 storage bucket, optionally using 7zip to compress.
- move: moves files from one folder to another.
The extras/
folder contains a few examples of the tasks.json from simple to complex. The tasks.json file
uses parameter substitution using the form {ENV_VARIABLE_NAME}
.
When you start a new job, the most important parameter is the unit of work. This is a string passed by the
start_jobs.py
script to the batch framework and set as the WORK_ITEM
environment variable in the docker container.
This parameter is what allows you to define the unit of work. It is intended to be used in your tasks.json
file to
control the process logic. The unit of work is user-defined, and may be a "site number," "file path," or some other
unit that we are processing. Thus it can be used to fetch a specific zip file or directory from S3, pass to your
program, or store files to S3. The following example shows a fetch task that will download and unzip the specified
WORK_ITEM.7z file to the ./Sites/WORK_ITEM folder:
{
"name" : "get-input-files",
"bucket" : "${BUCKET_NAME}",
"key" : "input/${WORK_ITEM}.7z",
"dest" : "./Sites/${WORK_ITEM}.7z",
"expand" : true
},
When you start a batch of jobs, you can specify what mode
to run them in. Mode is an optional, user-defined
parameter ("default" when not specified) that allows you to conditionally run tasks. This MODE_STR
environment variable is
set in the docker container at the beginning of each task. Each task type includes the options includeWhenMode
and excludeWhenMode
. This provides control over when a task is run.
For example, a process involves a main fortran program and a separate post-processor program, storing the results of both to S3. At times, you want to rerun the post-processor but not the main program. So you can define a mode called full, and one called post. To the main program entry, you can add the directive:
"excludeWhenMode": "post"
You will also need to add a conditional fetch
to copy in the previous main output from S3. You will likely also need
to add a condition to exclude the main "store" task when in "post" mode. The example tasks-haz-dams.json
provides an
example that illustrates this concept.
The following example shows a hazard analysis program looping over a set of input files matching 'haz-*.in' in the
./Sites/WORK_ITEM/ folder. When there are n VCPUs present, n processes will be run concurrently until they are all
complete. When specified, the outputFolder will be pre-created before running the program. This is useful for
processes (such as fortran programs) that expect directories to be pre-created. It is also possible to provide your
own shell script if you wish to further customize this process. The special macro [INPUT_FILE_PATH]
is used to
substitute the name of each input file from the input files matched in inputFilePattern
so it can be used in a
command string. In this example, it is shown with two prepended zero-linebreak sequences, supporting programs
that traditionally takes user input in response to prompts, but which must do so in an automated way here.
{
"name" : "haz",
"excludeWhenMode" : "post",
"command" : "echo '0\n0\n[INPUT_FILE_PATH]' | /opt/haz/haz",
"inputFolder" : "Sites/${WORK_ITEM}",
"inputFilePattern" : "haz-*.in",
"outputFolder" : "./Output/${WORK_ITEM}"
},
It can be useful to ensure a step is always taken regardless of earlier errors. There are two mechanisms
to help with this: "exitOnError": false
and "required": true
. When exitOnError is set to false, an error
that occurs during a processing step may be reported but it will not prevent the execution of subsequent tasks.
Furthermore, it will not by iteself cause the overall processing to end in an error.
The "required": true
setting is especially useful if you wish to ensure storage tasks run, for storing logs to
S3 or ensuring that partial results can be stored.
The framework passes parameters as environment variables into docker. This is true for local testing using
docker-compose.yml
and also true when starting an AWS batch job with start_jobs.py
. The framework
only absolutely requires a single parameter: AWS_REGION=us-west-2
(or the region of your choice)
but other parameters that the start_jobs.py
script pass through are BUCKET_NAME
and MODE_STR
which
define the name of the S3 bucket to use and the mode that the program is running in, if specified.
BUCKET_NAME=my-bucket
DISK_STATS=0
MODE_STR=full
In most cases it will be necessary to define a parameter that defines the unit of work. In the start_jobs.py
script this is anticipated and is called WORK_ITEM
, though it is an optional parameter. If more parameters
are necessary the script must be modified.
WORK_ITEM=test1
The start_jobs.py
script has a parameter called -name-value
, which can be used to specify custom environment
variables in your AWS Batch docker container. The framework can access these variables by using parameter
substitution in your tasks.json
file, e.g. to specify the location of an S3 path or an addition parameter to
pass into your program/script. See the example below for more details.
The following steps illustrate the use of some of the client scripts. A typical cloudburst process would involve storing input data in S3, running a series of programs in the docker container, and then storing results to S3. But you could design a process to pull from other web servers and store to whatever place you wish.
jsonschema scripts/tasks.schema.json -i scripts/tasks.json -o plain
make test
make push
python scripts/compress_inputs.py -source-folder ./source.tmp -zip-folder ./input.tmp -filter 'faultfiles'
python scripts/compress_inputs.py -source-folder ./source.tmp -zip-folder ./input.tmp -filter 'Dam*/Period*'
You can upload files using the upload_inputs script or with the AWS CLI:
python scripts/upload_inputs.py -bucket projectx_test -local-folder ./input.tmp -prefix 'input/' -threads 10
aws s3 cp ./input.tmp/ s3://projectx_test/input/
The start_jobs
script is a powerful tool and has a number of options. Some examples are listed below:
The command will be preview-only until the -apply
parameter is added.
python scripts/start_jobs.py -bucket projectx_test -queue projectx_queue -jobdef projectx_jobdef -workitems Dam001/Period03 -mode full
python scripts/start_jobs.py -bucket projectx_test -queue projectx_queue -jobdef projectx_jobdef -workitems 'Dam001/Period02,Dam001/Period02' -mode post
Use zero -padding
of 2 with a prefix to specify a range of 16 work items of the form 'Site001/Period01 - Site001/Period16'
python scripts/start_jobs.py -bucket bruceh_test -queue cloudburst_queue -jobdef cloudburst_jobdef -workitems 01-16 -mode post -prefix Dam001/Period -padding 2
Pass the name of a work item file:
python scripts/start_jobs.py -bucket projectx_test -queue projectx_queue -jobdef projectx_jobdef -mode full -workitemfile worklist.tmp
The -workitemfile
parameter can use stdin, so you can also pipe a list of work items into start_jobs:
cat worklist.tmp | python scripts/start_jobs.py -bucket projectx_test -queue projectx_queue -jobdef projectx_jobdef -mode full -workitemfile -
The AWS cli can be used to create a list of input files from S3 and pass the work item list to start_jobs.py
:
aws s3 ls s3://test-bruceh/input/ --recursive | cut -c 38-52 | python scripts/start_jobs.py -bucket projectx_test -queue projectx_queue -jobdef projectx_jobdef -mode full -workitemfile -
A custom work item generator can work to start_jobs:
python extras/periodlist-gen.py -dams 001-010 -periods 01-16 | python scripts/start_jobs.py -bucket projectx_test -queue projectx_queue -jobdef projectx_jobdef -mode post -workitemfile -
Custom variables can be passed to the server as environment variables using the -name-value
parameter. These can be used
in the tasks.json
file using environment variable substitution, or accessed from your program/script. The example below shows
two variables being set:
python scripts/start_jobs.py ... -name-value STORAGE_BUCKET=my_storage_bucket -name-value SERIES_NUMBER=21
aws batch list-jobs --job-queue projectx_queue
python scripts/stop_jobs.py -queue projectx_queue -all
python scripts/stop_jobs.py -queue projectx_queue -job
aws batch terminate-job --job-id <jobId> --reason 'user requested stop'
aws batch describe-jobs --jobs <jobId>
python scripts/get_outputs.py -bucket projectx_test -prefix 'output/' -local-folder output.tmp -threads 10
aws s3 cp s3://projectx_test/output/ ./output.tmp/
python scripts/unzip_folder.py -zipdir 'output.tmp'
python scripts/unzip_folder.py --help
python scripts/unzip_folder.py --help
The files within this repository demonstrate the use of cloudburst with
pyStrata
to perform site response simulations. The following steps are used to
run the calculations.
-
First, create the necessary input files. These provide the profiles and motions used in the analysis. For a specific study these would be modified. From the root directory run the following commands:
python scripts/make_profiles.py python scripts/make_motions.py
We can check that input files will have been created:
❯ ls data/input/\* data/input/motions: m5p0-r10km.pkl m5p0-r30km.pkl m5p0-r80km.pkl m6p0-r10km.pkl m6p0-r30km.pkl m6p0-r80km.pkl m7p0-r10km.pkl m7p0-r30km.pkl m7p0-r80km.pkl data/input/profiles: lower.pkl middle.pkl upper.pkl
The naming convention uses '-' to separate components, which is also used by
clilib.py
to specify ranges. -
The
docker-compose.yml
file describes an example configuration state. UsingLOCAL_MODE=1
has the calculation skip the fetch or store tasks, which are mapped into the docker environment using the volumes entry. TheWORK_ITEM
describes the attributes of the job. If this example, I use a '/' to separate the name of the profile and the file name pattern for the motions to compute. This could be adjusted based on the specific workflow. -
From the directory (and if docker is installed), the docker environment can be started with
make test
. For this example, it should start the environment and run the calculation for the three motions. If you have an issue, or want to tinker inside the docker environment runmake run
. This will start abash
shell at the working directory with the environment set by thedocker-compose.yml
. At this point, the calculation details can be finalized prior to involving AWS. -
To use S3, we first need to upload the input files into an S3 bucket so that they can be retrieved by the image.
aws s3 mb s3://docker-pystrata-test aws s3 sync data/input s3://docker-pystrata-test/input
The tasks associated with fetching these files can be found in
scripts/tasks.json
file. Note that all profiles and motions are downloaded for each calculation. Alternatively, selected files could be downloaded by using the '-name-value' parameter, see Start a batch job. In thetasks.json
file steps showing the input and output files are added for illustrative purposes. -
Now, edit
docker-compose.yml
and setLOCAL_MODE=0
and comment out attaching the input volume. Afterwards, rerun the calculation withmake test
. The inputs will now be downloaded from S3 and the outputs uploaded:docker-pystrata on master [✘!?] via (py311) took 1m22s ❯ aws s3 ls s3://docker-pystrata-test/output/ PRE middle/ docker-pystrata on master [✘!?] via (py311) ❯ aws s3 ls s3://docker-pystrata-test/output/middle PRE middle/ docker-pystrata on master [✘!?] via (py311) ❯ aws s3 ls s3://docker-pystrata-test/output/middle/ PRE m5p0-r10km/ PRE m5p0-r30km/ PRE m5p0-r80km/ docker-pystrata on master [✘!?] via (py311) ❯ aws s3 ls s3://docker-pystrata-test/output/middle/m5p0-r10km/ 2024-05-15 21:30:58 551311 output.pkl 2024-05-15 21:30:58 263040 response_spectrum-surf.csv.gz
-
We push the image to the cloud, first by defining
AWS_ID
in theMakefile
and then running:make terraform-apply make push
-
We can then start the calculation queue with:
python scripts/start_jobs.py -bucket docker-pystrata-test -queue cloudburst_queue -jobdef cloudburst_jobdef -workitems 'middle/m5p0-*'
If that is successful and looks good. Repeat it and add
-apply
. Once it is started, you can go the AWS Batch console for the proper region, and watch the queue.
Docker is a tool that allows you, by specifying a single configuration file (the Dockerfile
), build, test and deploy
a working linux image with the tools needed to get your work done. These docker images create "container instances"
that can be run on your own PC or on a cloud server.
Start with a docker image with a docker image that includes python 3.9 or greater (debian:11, ubuntu:20.04, python:3.9-slim-bullseye, continuumio/miniconda3). With the Dockerfile you would then add any pip-installable python libraries to the image, and copy your own files into the image.
For a Gnu Fortran program, you would start with an image that includes python, and in the Dockerfile, add dependencies such as gfortran and make. You can build the fortran programs as part of the image build process.
With a built docker image, you can run a container on your own machine and test it until you are satisfied it works. Once satisfied, the same exact image can be deployed to a DockerHub Repository or Amazon ECR (Elastic Container Repository) so that it can be pulled when as needed when running in the cloud.
AWS Batch uses docker containers to run batch jobs. These containers are stored and retrieved from Amazon ECR (preferred) or DockerHub. Internally, AWS Batch uses the AWS Elastic Container Service (ECS) to run the containers. Logs are written to Cloudwatch Logs. The AWS Batch dashboard is the place to go to monitor the status of a batch of jobs.
Batch uses a "compute environment", a "queue" and a "job definition." When you run a batch job, the job references the job definition and you add the job to a queue, which allocates jobs to a compute environment. Each time a job runs, a new container instance is created and the parameters from the job you create are passed in.
The batch compute environment specifies whether you will run jobs in Fargate or EC2, and in standard or spot instances. The "job definition" that specifies the docker container, CPU/Memory (and to some extent Disk) resources, and any default parameters such as environment variables.
Batch allows you to monitor the running processes in a dashboard that shows a count of running, successful and failed jobs. You can drill down to a particular job and see the logs being generated in real-time, or review them after they have run.
S3 is a cloud storage technology provided by Amazon. A common use case for the framework involves taking some input files, running one or more executables/scripts, and then storing the outputs. The input and output files (and even the executables) can be stored in S3. The framework makes it easy to retrieve and store zipped or uncompressed data to S3.
Terraform is a company that developed a tool that allows you to create networking and computational infrastructure based on .tf configuration files. Terraform works with many cloud providers such as AWS and Azure. The framework provides terraform configuration files that create and manage the infrastructure necessary to run AWS Batch jobs in AWS Fargate, from networking (VPCs and security groups) to the creation of the ECR repository and Batch job definition.
The terraform config files sets up everything needed by AWS Batch. It can be complicated to set up AWS
Batch without such a template. The framework uses the "Fargate Spot" configuration by default, which means that
your process may be stopped during execution. A retry strategy is specified to permit the restart. You can change
the configuration to be a non-spot instance by configuring the terraform/batch.tf
file.
Framework Author: Bruce Hearn 2021 [email protected]