Comments (22)

Smahane commented on July 26, 2024

@xpillons I tested this on the gpu_gen2 branch and it worked. Thank you very much!

from azurehpc.

xpillons commented on July 26, 2024

@Smahane do you have quota for ND40rs_v2 in eastus? You can check it from the portal. If not, file a support request to increase your quota.
I guess you made a typo when you pasted the init command, as it's missing the s in eastus.

Smahane commented on July 26, 2024

Hello @xpillons,
I do have a quota of up to 16 nodes in this region for ND40rs_v2. Also, the missing s only happened when I pasted the command here.
I'm still getting the same "badRequest" error.

Is there any way to debug this issue? I ran azhpc build --debug but the output was not helpful.

xpillons commented on July 26, 2024

Thanks for checking. One way of troubleshooting is to look at the deployment in the resource group from the Azure portal. I think this can come from the fact that ND40rs_v2 needs Gen2 images, and the default configuration file is using a Gen1 image. You may use the image OpenLogic:CentOS:7_7-gen2:latest instead.
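
To make the Gen1/Gen2 distinction concrete, here is a minimal sketch that splits an image URN into its publisher:offer:sku:version parts and flags SKUs whose name contains "gen2". The helper names are hypothetical, and the name check is only a naming-convention heuristic (an assumption, not something azhpc or Azure exposes); the authoritative generation is a property of the image itself.

```python
from typing import NamedTuple


class ImageUrn(NamedTuple):
    publisher: str
    offer: str
    sku: str
    version: str


def parse_urn(urn: str) -> ImageUrn:
    """Split a marketplace image URN of the form publisher:offer:sku:version."""
    parts = urn.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected publisher:offer:sku:version, got {urn!r}")
    return ImageUrn(*parts)


def looks_like_gen2(urn: str) -> bool:
    """Heuristic (assumption): Gen2 SKUs often carry 'gen2' in their name."""
    return "gen2" in parse_urn(urn).sku.lower()


if __name__ == "__main__":
    # The two URNs mentioned in this thread.
    for urn in ("OpenLogic:CentOS:7.7:latest", "OpenLogic:CentOS:7_7-gen2:latest"):
        print(urn, "looks like Gen2:", looks_like_gen2(urn))
```

A check like this could run before deployment to catch a Gen1 URN paired with a VM size that needs Gen2, but it should not replace looking at the failed deployment in the portal.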

Smahane commented on July 26, 2024

@xpillons using gen2 solved the problem. Thank you.

How do I get the correct image id for the image I just created, please?
I need to use it to create an HPC cluster of ND40rs_v2 compute nodes. I will be using the image I created here in this config.json. Is that the recommended way to create a GPU cluster with the Mellanox interconnect?

xpillons commented on July 26, 2024

@Smahane great to hear.
You should capture that image; look at the image example to see how to create a VM and capture an image. Now that you have a GPU node, I suggest creating a separate config file, based on that example, to capture an image.

Smahane commented on July 26, 2024

@xpillons the image examples are very helpful. However, I still need to replace the variables below in the config.json with the GPU image I just created. How do I get these values?

"hpc_image": "OpenLogic:CentOS-HPC:7.7:latest",
"image": "OpenLogic:CentOS:7.7:latest",

Smahane commented on July 26, 2024

I'm trying az vm image list --offer Centos -p OpenLogic --output table --all but I don't see the image I created here. I tried my user name as the publisher but it didn't work.

xpillons commented on July 26, 2024

You haven't created a GPU image yet, unless you captured it manually outside of azhpc.
If you haven't captured your GPU VM into an image, the image example shows how to do it. Because that example also deploys the VM it captures, you will have to remove the whole resource part so that it only captures the image. Alternatively, you can do the capture from the portal.

Then, to use an image from an azurehpc config file, look at use_image.json in the image example. The trick is to use the syntax image.. to get the image id.
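
For context, az vm image list only searches marketplace images, which is why a captured image never shows up there; a captured image is a managed image resource in your own subscription (az image list shows those). The sketch below builds the Azure resource ID that such an image has. The function name and sample values are hypothetical, and whether azhpc's image syntax resolves to exactly this string is an assumption.

```python
def managed_image_id(subscription_id: str, resource_group: str, image_name: str) -> str:
    """Return the ARM resource ID of a managed (captured) image."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/images/{image_name}"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only; substitute your own.
    print(
        managed_image_id(
            "00000000-0000-0000-0000-000000000000",
            "my-images-rg",
            "centos77-gpu",
        )
    )
```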

Smahane commented on July 26, 2024

@xpillons in the nvidia config.json, I added:

    {
        "script": "deprovision.sh",
        "tag": "deprovision",
        "sudo": true
    },
    {
        "type": "local_script",
        "script": "create_image.sh",
        "args": [
            "variables.resource_group",
            "master",
            "variables.image_name",
            "variables.image_resource_group"
        ]
    }

The intention was to create a GPU VM and an image, but I got this error:

[2020-09-15 16:14:26] Step 08 : deprovision.sh (jumpbox_script)
[2020-09-15 16:14:27]     duration: 1 seconds
[2020-09-15 16:14:29] Step 09 : create_image.sh (local_script)
[2020-09-15 16:14:41] error: invalid returncode
    args=['azhpc_install_config/install/09_create_image.sh']
    return code=3
    stdout=
    stderr=

I will just capture the image via the web portal for now, but it would be nice to get this working.

xpillons commented on July 26, 2024

Regarding the "master" argument: you have to adapt the parameters of the config you added to your environment.
So you need to:

  • add the "deprovision" tag to the gpumaster resource
  • add the missing variables for image_name and image_resource_group if not done
  • use gpumaster instead of master in the 2nd parameter of the create_image script
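
The three fixes above can be sketched as a small patch over the parsed config.json. This is only an illustration: the function name is hypothetical, and the layout it relies on (a resources dict whose entries carry a tags list, a variables dict, and an install list of steps) is an assumption based on the snippet posted earlier in this thread, so check it against your actual file.

```python
import copy


def apply_capture_fixes(config: dict, vm_name: str, image_name: str, image_rg: str) -> dict:
    """Return a patched copy of config; the input dict is left untouched.

    Assumed layout: config["resources"][vm_name]["tags"] is a list,
    config["variables"] is a dict, config["install"] is a list of steps.
    """
    patched = copy.deepcopy(config)

    # 1. Tag the GPU VM so the step tagged "deprovision" runs on it.
    tags = patched["resources"][vm_name].setdefault("tags", [])
    if "deprovision" not in tags:
        tags.append("deprovision")

    # 2. Define the variables the create_image step refers to, if missing.
    variables = patched.setdefault("variables", {})
    variables.setdefault("image_name", image_name)
    variables.setdefault("image_resource_group", image_rg)

    # 3. Point create_image.sh at the real VM name instead of "master".
    for step in patched.get("install", []):
        if step.get("script") == "create_image.sh":
            step["args"] = [vm_name if a == "master" else a for a in step.get("args", [])]
    return patched


if __name__ == "__main__":
    sample = {
        "resources": {"gpumaster": {"type": "vm"}},
        "install": [{"type": "local_script", "script": "create_image.sh",
                     "args": ["variables.resource_group", "master",
                              "variables.image_name", "variables.image_resource_group"]}],
    }
    # "my-gpu-image" and "my-images-rg" are placeholder values.
    print(apply_capture_fixes(sample, "gpumaster", "my-gpu-image", "my-images-rg"))
```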

xpillons commented on July 26, 2024

> Also,
>
>   • is there a way to stop/start the cluster and preserve the settings (ssh connection, shared directories ...)?
>   • Can I resize the cluster (number of nodes)?
>     Thank you

No, that scenario is not supported. However, once you have captured an image, if you keep it in a separate resource group, you can delete the resource group containing your whole cluster and rebuild it with azhpc.
Resizing is possible with SLURM, or you may want to use CycleCloud.

Smahane commented on July 26, 2024

@xpillons do you see anything wrong with my config.json please?
config.txt

I'm getting this error:

[2020-09-15 18:26:31] Provising succeeded
[2020-09-15 18:26:31] re-evaluating the config
[2020-09-15 18:26:31] building host lists
[2020-09-15 18:26:31] building install scripts
[2020-09-15 18:26:35] Step 00 : install_node_setup.sh (jumpbox_script)
[2020-09-15 18:27:58] duration: 83 seconds
[2020-09-15 18:28:00] Step 01 : disable-selinux.sh (jumpbox_script)
[2020-09-15 18:28:02] duration: 2 seconds
[2020-09-15 18:28:03] Step 02 : update_kernel.sh (jumpbox_script)
[2020-09-15 18:29:51] duration: 108 seconds
[2020-09-15 18:29:53] Step 03 : wait.sh (jumpbox_script)
[2020-09-15 18:30:26] duration: 32 seconds
[2020-09-15 18:30:27] Step 04 : install_lis.sh (jumpbox_script)
[2020-09-15 18:31:04] duration: 37 seconds
[2020-09-15 18:31:06] Step 05 : wait.sh (jumpbox_script)
[2020-09-15 18:31:38] duration: 32 seconds
[2020-09-15 18:31:40] Step 06 : cuda_drivers.sh (jumpbox_script)

[2020-09-15 18:36:31] duration: 291 seconds
[2020-09-15 18:36:33] Step 07 : check_gpu.sh (jumpbox_script)
[2020-09-15 18:36:56] duration: 23 seconds
[2020-09-15 18:36:58] Step 08 : deprovision.sh (jumpbox_script)
[2020-09-15 18:37:00] duration: 2 seconds
[2020-09-15 18:37:02] Step 09 : create_image.sh (local_script)

[2020-09-15 18:39:57] error: invalid returncode
args=['azhpc_install_config/install/09_create_image.sh']
return code=3
stdout=
stderr=

xpillons commented on July 26, 2024

@Smahane I've found the issue: in the create image script, the Hyper-V generation value should be set to V2. I'm working on a fix to detect that automatically; I will let you know once it's done.
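
As an illustration of where the Hyper-V generation enters a capture, the sketch below assembles a typical Azure CLI capture sequence (deallocate, generalize, create image) as argument lists without running anything. It is not the content of azhpc's create_image.sh, which is not shown in this thread; the function name is hypothetical, and the sample ID is a placeholder.

```python
import shlex


def capture_commands(source_vm_id: str, image_name: str, image_rg: str,
                     hyperv_generation: str = "V2") -> list:
    """Build (but do not run) az CLI commands to capture a VM as a managed image."""
    if hyperv_generation not in ("V1", "V2"):
        raise ValueError("hyperv_generation must be 'V1' or 'V2'")
    return [
        # The VM must be stopped and marked generalized before capture.
        ["az", "vm", "deallocate", "--ids", source_vm_id],
        ["az", "vm", "generalize", "--ids", source_vm_id],
        # A Gen2 source VM needs the image created with generation V2.
        ["az", "image", "create",
         "--resource-group", image_rg,
         "--name", image_name,
         "--source", source_vm_id,
         "--hyper-v-generation", hyperv_generation],
    ]


if __name__ == "__main__":
    # Placeholder resource ID and names for illustration only.
    vm_id = ("/subscriptions/00000000-0000-0000-0000-000000000000"
             "/resourceGroups/my-cluster-rg/providers/Microsoft.Compute"
             "/virtualMachines/gpumaster")
    for cmd in capture_commands(vm_id, "my-gpu-image", "my-images-rg"):
        print(shlex.join(cmd))
```

Passing the full VM resource ID as the source keeps the sketch valid when the image lives in a different resource group from the VM, which matches the advice elsewhere in this thread to store the image separately from the cluster.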

xpillons commented on July 26, 2024

@Smahane please try on the gpu_gen2 branch after pulling it from the repo

xpillons commented on July 26, 2024

@Smahane can you please confirm this is solved?

NOTE: The gpu_gen2 branch is now merged into master.

Smahane commented on July 26, 2024

Hello @xpillons, when I create a cluster with the image I generated above, I get the error below. Any idea, please?

[2020-09-17 16:37:00] Step 10 : pbsclient.sh (jumpbox_script)
[2020-09-17 16:37:13]     duration: 13 seconds
[2020-09-17 16:37:15] Step 11 : node_healthchecks.sh (jumpbox_script)
[2020-09-17 16:37:17] error: invalid returncode
    args=['ssh', '-o', 'StrictHostKeyChecking=no', '-o', 'UserKnownHostsFile=/dev/null', '-i', 'hpcadmin_id_rsa', '[email protected]', 'azhpc_install_config/install/11_node_healthchecks.sh', 'pbsclient']
    return code=5
    stdout=
    stderr=Warning: Permanently added 'headnode93975e.eastus.cloudapp.azure.com,52.152.238.36' (ECDSA) to the list of known hosts.

xpillons commented on July 26, 2024

@Smahane can you please check the content of azhpc_install_config/install/11_node_healthchecks.log? You may have a bad node.
If so, since you reached the last step, you can either discard this node if you need fewer of them for your tests, or destroy and redeploy the cluster.

Smahane commented on July 26, 2024

@xpillons I already deleted and recreated the cluster, and it failed again in the node_healthchecks.sh script. Also, what was created was not fully set up (at least NFS wasn't working).

xpillons commented on July 26, 2024

@Smahane I think you can remove that script, as I don't think it can handle GPU nodes correctly.

xpillons commented on July 26, 2024

@Smahane I've fixed that script so that it no longer fails on an unknown VM size. You should be fine using the latest master.

Smahane commented on July 26, 2024

Thank you for the fix. I haven't tested it yet, but I'm closing this issue for now.
