- trainer directory: containing the training package to be submitted to AI Platform
- __init__.py which is an empty file. It is needed to make this directory a Python package.
- task.py contains the training code. It creates a simple dummy linear dataset, trains a linear regression model with scikit-learn, and saves the trained model object to a directory (local or GCS) provided by the user.
- scripts directory: command-line scripts to train the model locally or on AI Platform.
We recommend running the scripts in this directory in the following order, and using the source command to run them, so that the environment variables they export are available at each step:
- train-local.sh trains the model locally using gcloud. It is always a good idea to try training the model locally for debugging before submitting it to AI Platform.
- train-cloud.sh submits a training job to AI Platform.
- deploy.sh creates a model resource, and a model version for the newly trained model.
- cleanup.sh deletes all the resources created in this tutorial.
- prediction directory: containing Python sample code to invoke the model for prediction.
- predict.py invokes the model for some predictions.
- setup.py: containing all the required Python packages for this tutorial (see the sketch after this list).
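For reference, such a setup.py typically follows the pattern sketched below. This is only a minimal sketch assuming a standard setuptools package; the package name and dependency list shown here are placeholders, not the tutorial's exact file.

from setuptools import find_packages, setup

setup(
    name='trainer',          # hypothetical package name
    version='0.1',
    packages=find_packages(),
    # Assumed dependencies for a scikit-learn training job; adjust as needed.
    install_requires=['scikit-learn', 'joblib'],
)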
TODO: update
In this section, we'll highlight the main elements of this sample.
In this sample, we are not passing the input dataset as a parameter. However, we need to save the trained model. To keep things simple, the code expects one argument to be passed to it: the path to store the model in. In other examples, we will use argparse to process the input arguments. However, in this sample, we simply read the input argument from sys.argv[1].
Also note that we save the model as model.joblib, which is the name that AI Platform expects for models saved with joblib.
Finally, we are using tf.gfile from TensorFlow to upload the model to GCS. This does not mean we are actually using TensorFlow to train a model in this sample. You may also use the google.cloud.storage library for uploading to and downloading from GCS. The advantage of using tf.gfile is that it works seamlessly whether the file path is local or a GCS bucket.
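Putting these pieces together, the core of task.py looks roughly like the sketch below. It is illustrative only: the dummy dataset, its shape, and its coefficients are assumptions, and the tf.gfile.Copy call is the TensorFlow 1.x-style API (newer TensorFlow versions expose the same operation as tf.io.gfile.copy).

import os
import sys

import joblib
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LinearRegression

# The single input argument: the directory (local or gs://) to store the model in.
model_dir = sys.argv[1]

# Create a simple dummy linear dataset (assumed shape, for illustration only).
X = np.random.rand(100, 1)
y = 3.0 * X[:, 0] + 1.0 + 0.1 * np.random.randn(100)

# Train a linear regression model with scikit-learn.
model = LinearRegression()
model.fit(X, y)

# AI Platform expects joblib-saved models to be named model.joblib.
joblib.dump(model, 'model.joblib')

# tf.gfile works the same way whether the destination is local or a GCS bucket.
tf.gfile.Copy('model.joblib', os.path.join(model_dir, 'model.joblib'), overwrite=True)

Note that the sketch uses the standalone joblib package; older scikit-learn versions also shipped sklearn.externals.joblib, so the import depends on the versions pinned in setup.py.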
TODO: update
The command to run the training job locally is:
gcloud ai-platform local train \
--module-name=trainer.task \
--package-path=${PACKAGE_PATH} \
-- \
${MODEL_DIR}
- module-name is the name of the Python module inside the package that runs the training job.
- package-path determines where the training Python package is.
- -- is just a separator. Anything after it will be passed to the training job as input arguments.
- ${MODEL_DIR} will be passed to task.py as sys.argv[1].
TODO: update
To submit a training job to AI Platform, the main command is:
gcloud ai-platform jobs submit training ${JOB_NAME} \
--job-dir=${MODEL_DIR} \
--runtime-version=${RUNTIME_VERSION} \
--region=${REGION} \
--scale-tier=${TIER} \
--module-name=trainer.task \
--package-path=${PACKAGE_PATH} \
--python-version=${PYTHON_VERSION} \
-- \
${MODEL_DIR}
- ${JOB_NAME} is a unique name for each job. We create one with a timestamp to make it unique each time.
- scale-tier chooses the machine tier. For this sample, we use BASIC. However, if you need accelerators, for instance, or want to do distributed training, you will need a different tier.