amdacceleratorcloudguides's Introduction

Welcome to AMD Accelerator Cloud (AAC) Reference Documentation

Getting Started

Contact your AMD Sponsor to sign up for access to AMD Accelerator Cloud resources.

How to Log In to the Web Interface

Go to https://aac.amd.com

How to SSH to the Plano Slurm Cluster

  1. From the laptop or system used to generate the SSH keys for AAC User Account Registration, enter the following at the Terminal or PowerShell prompt to SSH to the AAC Plano Slurm cluster:
    ssh <your_userid>@aac1.amd.com
    
    The SSH keys should be accessible under the $HOME/.ssh directory, or through the PuTTY or MobaXterm tool used to generate them.
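If the login is rejected, it can help to confirm which key pairs exist locally before retrying. The following is a minimal sketch, assuming OpenSSH and public keys stored next to their private keys with a .pub suffix; list_ssh_keys is a hypothetical helper and not part of AAC.

```shell
# list_ssh_keys: hypothetical helper that prints the private-key path for
# every *.pub file found in a directory (default: $HOME/.ssh).
list_ssh_keys() {
    dir="${1:-$HOME/.ssh}"
    for pub in "$dir"/*.pub; do
        # Skip the unexpanded glob pattern when no public keys exist.
        [ -e "$pub" ] || continue
        printf '%s\n' "${pub%.pub}"
    done
    return 0
}

# Example (not run here): pass one of the listed keys to ssh explicitly.
#   ssh -i "$HOME/.ssh/<your_key>" <your_userid>@aac1.amd.com
```

The -i option only tells ssh which identity file to offer; the key must still be the one registered with the AAC account.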

Contacting AAC Support Team

For questions or support requests, please email them to [email protected]

The AAC Web Interface has moved to https://aac.amd.com

1. How to Fix "Host key verification failed" When Running ssh <USERID>@aac1.amd.com

The Slurm login node was changed during maintenance, so its host key fingerprint is different. SSH users may fail to log in and see a WARNING message. To fix this, remove the existing entries for dell-r08-01 from $HOME/.ssh/known_hosts, retry ssh <USERID>@aac1.amd.com, and accept the new fingerprints:

ECDSA key fingerprint is SHA256:u1u0/uh0GLcs19KNHrmZIA6EDLMvJACK5y2fMkVg1fg.
ECDSA key fingerprint is MD5:76:6a:a4:34:56:c0:04:fa:7f:84:e6:85:0b:f1:65:e5.

Solution:

  1. Remove the existing fingerprint of the old Slurm login host: ssh-keygen -R aac1.amd.com
  2. Log in to the AAC Plano Slurm cluster: ssh <USERID>@aac1.amd.com
  3. Enter "yes" at the prompt to accept the new host key fingerprint and continue logging in.
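The first step above can be wrapped so that it only rewrites known_hosts when an old entry is actually present. This is a minimal sketch, assuming OpenSSH's ssh-keygen; reset_host_key is a hypothetical helper name, and the optional second argument exists only so the function can be tried on a copy of the file.

```shell
# reset_host_key: hypothetical wrapper around "ssh-keygen -R".
# $1 = host name, $2 = known_hosts file (default: $HOME/.ssh/known_hosts).
reset_host_key() {
    host="$1"
    file="${2:-$HOME/.ssh/known_hosts}"
    # "ssh-keygen -F" exits non-zero when the host has no entry in the file.
    if ssh-keygen -F "$host" -f "$file" >/dev/null 2>&1; then
        # "-R" removes all keys for the host and keeps a backup as <file>.old.
        ssh-keygen -R "$host" -f "$file" >/dev/null 2>&1
        echo "removed old entry for $host"
    else
        echo "no entry for $host"
    fi
}

# Example (not run here):
#   reset_host_key aac1.amd.com
#   reset_host_key dell-r08-01
#   ssh <USERID>@aac1.amd.com    # answer "yes" to accept the new fingerprint
```

Before answering "yes", compare the fingerprint shown at the prompt with the values listed in this section.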

2. How to Fix the "The selected queue is no longer available" Error

Solution:

The Slurm partition/queue names were changed during the maintenance to remove duplicate queue names and standardize on one set. The new partition names below can be used to allocate a single node or a multi-node cluster with the Slurm commands.

1CN128C8G2H_2IB_MI210_RHEL9
1CN128C8G2H_2IB_MI210_RHEL8
1CN128C8G2H_2IB_MI210_SLES15
1CN128C8G2H_2IB_MI210_Ubuntu22
1CN96C8G1H_4IB_MI250_Ubuntu22
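
A job script or wrapper can check a partition name against this list before submitting, which catches old queue names early. This is a minimal sketch; is_known_partition is a hypothetical helper, the list is copied from above and may go stale, and the sinfo/salloc lines assume the standard Slurm client tools on the login node.

```shell
# Partition names copied from the list above (may change after maintenance).
AAC_PARTITIONS="1CN128C8G2H_2IB_MI210_RHEL9
1CN128C8G2H_2IB_MI210_RHEL8
1CN128C8G2H_2IB_MI210_SLES15
1CN128C8G2H_2IB_MI210_Ubuntu22
1CN96C8G1H_4IB_MI250_Ubuntu22"

# is_known_partition: hypothetical check that succeeds only on an exact,
# whole-line match against the list.
is_known_partition() {
    printf '%s\n' "$AAC_PARTITIONS" | grep -Fxq -- "$1"
}

# Example (not run here; requires Slurm):
#   sinfo -p 1CN128C8G2H_2IB_MI210_RHEL9           # show partition state
#   salloc -p 1CN128C8G2H_2IB_MI210_RHEL9 -N 1     # allocate a single node
#   salloc -p 1CN96C8G1H_4IB_MI250_Ubuntu22 -N 2   # allocate two nodes
```

Running sinfo on the cluster remains the authoritative way to see which partitions currently exist.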

amdacceleratorcloudguides's People

Contributors

naimishared, amddcgpuce, sree-harsha-assk, antentus, arpitkhard, ozziemoreno, gurumohan123, jagadish-amd

amdacceleratorcloudguides's Issues

rocgdb not working due to missing libpython3.8 library

Partition: 1CN96C8G1H_4IB_MI250_Ubuntu22

$ module load rocm-5.7.1
$ rocgdb
rocgdb: error while loading shared libraries: libpython3.8.so.1.0: cannot open shared object file: No such file or directory

Tried a few other rocm modules, got the same error
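
One way to confirm which shared libraries the loader cannot resolve is to run ldd on the binary. This is a minimal diagnostic sketch, not a fix; missing_libs is a hypothetical helper, and it assumes ldd is available on the compute node.

```shell
# missing_libs: hypothetical helper that prints every shared library the
# dynamic linker cannot resolve for the given binary (no output = none).
missing_libs() {
    ldd "$1" 2>/dev/null | grep 'not found'
    return 0
}

# Example (not run here), after loading the ROCm module:
#   module load rocm-5.7.1
#   missing_libs "$(command -v rocgdb)"
```

If libpython3.8.so.1.0 is the only line reported, the image on that partition most likely lacks a Python 3.8 runtime, which would be worth including in a support request.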

Failed to run rocgdb on AAC

I am trying to debug a program on a compute node using rocgdb, but it fails to launch and returns an error:
fatal error: KFD_IOC_DBG_TRAP_GET_VERSION failed Could not attach to process 2569193 (A fatal error has occurred)
I've already run scl enable gcc-toolset-11 bash and module load rocm-5.4.3, but the error still occurs.
The full output is as follows:

liuxs@smc-r06-07:~/bioinfo/minimap2$ rocgdb --args ./minimap2 -t 1 --max-chain-skip=2147483647 --gpu-chain /shareddata/umich_folder/data/ONT/hg38.mmi /shareddata/umich_folder/data/ONT/reads_4f452f4a-d82a-4580-981b-32d14b997217.fa
GNU gdb (rocm-rel-5.4-121) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./minimap2...
(gdb) run
Starting program: /shared/prod/home/liuxs/bioinfo/minimap2/minimap2 -t 1 --max-chain-skip=2147483647 --gpu-chain /shareddata/umich_folder/data/ONT/hg38.mmi /shareddata/umich_folder/data/ONT/reads_4f452f4a-d82a-4580-981b-32d14b997217.fa
fatal error: KFD_IOC_DBG_TRAP_GET_VERSION failed
Could not attach to process 2569193 (A fatal error has occurred)
(gdb) q
A debugging session is active.

        Inferior 1 [process 2569193] will be killed.

Quit anyway? (y or n) y

What is the cause of this issue, and how can I deal with it?

Script run with sbatch is much slower than running interactively with salloc

Hello!

I'm trying to submit a job using sbatch. I can successfully submit the job, however it seems like the script I'm running is hanging for quite some time. When I try to run the same script interactively with salloc the script doesn't hang. Based on the output from the sbatch and salloc runs, it seems as though the script is roughly 10x faster when run interactively with salloc.

Here is the sbatch script I'm trying to submit:

#!/bin/bash

#SBATCH -p 1CN128C8G2H_2IB_MI210_Ubuntu22
#SBATCH --time=24:00:00

singularity exec \
--bind /shareddata/utexas_data/james/input_folder:/input \
--bind /shareddata/utexas_data/james/output_folder:/output \
/shared/prod/home/james/images/some_image.sif \
python /some_python_script.py \
--in-folder /input \
--out-folder /output \
--num-proc 32

Is this expected? Should I be using something other than sbatch to submit non-interactive jobs?

Thanks in advance!
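
One way to narrow this down is to compare the CPU resources each launch method actually sees. This is a minimal sketch, assuming nproc (GNU coreutils) is available on the compute node; report_cpus is a hypothetical helper, and a difference in CPU count is only one possible explanation, not a confirmed cause.

```shell
# report_cpus: hypothetical diagnostic that prints the CPUs usable by the
# current shell and the Slurm variables describing the allocation.
report_cpus() {
    echo "nproc=$(nproc)"
    echo "SLURM_CPUS_ON_NODE=${SLURM_CPUS_ON_NODE:-unset}"
    echo "SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-unset}"
}

# Example (not run here): call report_cpus at the top of the sbatch script
# and again inside the salloc session, then compare the two outputs.
```

If the batch job reports far fewer CPUs than the interactive session, requesting them explicitly (for example with #SBATCH --cpus-per-task=32 to match --num-proc 32) may be worth trying.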
