This repo provides examples of how to launch distributed training on A3-Mega (H100x8) on Vertex.
This repo contains:
Follow these steps to run this example:
- Dockerfile: Use the provided Dockerfile to build an image based on the NVIDIA NeMo image.
- NeMo Config: Use the llama2-7b.yaml NeMo config file as a reference. You can use it as is or modify it to suit your needs.
- Entry script: The job.sh bash script sets the required environment variables for TCPXO and uses the torchrun launcher to start the NeMo pretraining job.
- Vertex payload: The vertex-payload.json file defines the job to start in Vertex. Make sure the NNODES environment variable matches the number of nodes participating in your training job.
- Submit the job: Use the following curl command to kick off the job on Vertex:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @nemo/llama2-7b/vertex-payload.json \
"https://<region>-aiplatform.googleapis.com/v1/projects/<project-id>/locations/<region>/customJobs"
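The region appears twice in the endpoint URL, so it can help to wrap the request above in a small shell helper that fills in the region and project ID from arguments. The following is a minimal sketch and is not part of this repo: the function names `custom_jobs_url` and `submit_job` are hypothetical, and it assumes `curl` and an authenticated `gcloud` are available on the machine submitting the job.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the customJobs endpoint for a region and project.
# The URL pattern is the one used in the curl command above.
custom_jobs_url() {
  local region="$1" project_id="$2"
  printf 'https://%s-aiplatform.googleapis.com/v1/projects/%s/locations/%s/customJobs\n' \
    "$region" "$project_id" "$region"
}

# Hypothetical helper: POST a payload file to the customJobs endpoint.
# Arguments: region, project ID, path to the payload JSON file.
submit_job() {
  local region="$1" project_id="$2" payload="$3"
  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @"$payload" \
    "$(custom_jobs_url "$region" "$project_id")"
}
```

After sourcing the snippet, a call such as `submit_job us-central1 my-project nemo/llama2-7b/vertex-payload.json` would submit the job. The region and project ID shown here are placeholder values, so substitute your own.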
Coming soon