
auks's People

Contributors

arno, fdiakh, fihuer, hautreux, kenshin33, rezib, robberteggermont


auks's Issues

Auks (0.5.3) on RH7 issues with slurm passing the credential

Hi. I'm very new to getting AUKS working, and this is likely some configuration issue that I don't understand.
This is AUKS 0.5.3 on RHEL 7 with Slurm 20.11.7.
I've set up auksd on a management node and it appears to be functional. I set up our cluster head node and configured it to use the spank plugin for slurm and that appears to function as well.

  • ssh into the cluster head node: klist shows my tickets, auks -p verifies, and auks -a / auks -r load the keys into the cache. Watching the debug logs on the server shows that my credential [email protected] accesses auks.
  • submitting a job with Slurm like srun --auks=yes also shows accesses to auksd, so I believe the plugin works
  • the compute node that the job runs on, however, only generates these messages: auks_krb5_stream: authentication failed : Software caused connection abort

I can sort of reproduce the issue on my head node: if I do a kdestroy -A and then try to ping auks, it fails. So it almost seems like my compute node is not receiving the auks credential.

I'll attach my auks.conf and auks.acl files
auksconf.txt
auks.acl.txt

Thanks for any insight you have on this. Let me know if there's any other info I can provide that would help.

Auks API request failed : krb5 cred : unable to read credential cache

Hi,

I am trying to make Slurm work with auks. Right now I am testing it on three CentOS 7.1 VMs. The first one is a mgmt node running all three auks components (auksd, aukspriv and auksdrenewer). All three boxes also act as compute and login nodes.

When I run srun /bin/hostname as a regular user, it says it is unable to read the credential cache. Right now, as shown below, permissions on /var/cache/auks are just 700. I even changed them to 755 and the error is the same. I suspect some other user (is it the slurm user?) is trying to read these files and is rightly denied. I think I'm missing something simple here. Can you take a look? I really appreciate any help.

It does print the hostname, but I am not sure whether I'm supposed to see this error. I followed your howto article to create the configuration. I am not entirely sure whether my auks.acl is correct either.

It also complained "unable to parse configuration file", and I had to symlink /etc/auks/auks.conf into /usr/local/etc. I saw another issue stating this was fixed in the latest commit; yesterday I downloaded the zip file and built RPMs from it.

I do have the latest version of slurm.
root@slurmdev1:/u/sreedhar$ srun -V
slurm 15.08.6

My configuration looks like this (I replaced the real names):

root@slurmdev1:/u/sreedhar$ rpm -qa | grep auks
auks-devel-0.4.4-1.el7.centos.x86_64
auks-slurm-0.4.4-1.el7.centos.x86_64
auks-0.4.4-1.el7.centos.x86_64

root@slurmdev1:/u/sreedhar$ cat /etc/auks/auks.conf

#------------------------------------------------------------------------------
# auks client and server configuration file
#------------------------------------------------------------------------------

#-
# Common client/server elements
#-
common {

# Primary daemon configuration
PrimaryHost = "slurmdev1" ;
#PrimaryAddress = "" ;
PrimaryPort = 12345 ;
PrimaryPrincipal = "host/[email protected]" ;

# Secondary daemon configuration
#SecondaryHost = "auks2" ;
#SecondaryAddress = "" ;
#SecondaryPort = "12345" ;
#SecondaryPrincipal = "host/[email protected]" ;

# Enable/Disable NAT traversal support (yes/no)
# this value must be the same on every node
NAT = no ;

# max connection retries number
Retries = 3 ;

# connection timeout
Timeout = 10 ;

# delay in seconds between retries
Delay = 3 ;

}

#-
# API only elements
#-
api {

# log file and level
LogFile = /tmp/auksapi.log ;
LogLevel = 3 ;

# optional debug file and level
DebugFile = /tmp/auksapi.log ;
DebugLevel = 3 ;

}

#-
# Auks daemon only elements
#-
auksd {

# Primary daemon configuration
PrimaryKeytab = "/etc/krb5.keytab" ;

# Secondary daemon configuration
#SecondaryKeytab = "/etc/krb5.keytab" ;

# log file and level
LogFile = "/var/log/auksd.log" ;
LogLevel = "2" ;

# optional debug file and level
DebugFile = "/var/log/auksd.log" ;
DebugLevel = "0" ;

# directory in which daemons store the creds
CacheDir = "/var/cache/auks" ;

# ACL file for cred repo access authorization rules
ACLFile = "/etc/auks/auks.acl" ;

# default size of incoming requests queue
# (it grows dynamically)
QueueSize = 500 ;

# default repository size (number of creds)
# (it grows dynamically)
RepoSize = 1000 ;

# number of workers for incoming request processing
Workers = 1000 ;

# delay in seconds between 2 repository clean stages
CleanDelay = 300 ;

# use kerberos replay cache system (slows things down)
ReplayCache = no ;

}

#-
# Auksd renewer only elements
#-
renewer {

# log file and level
LogFile = "/var/log/auksdrenewer.log" ;
LogLevel = "1" ;

# optional debug file and level
DebugFile = "/var/log/auksdrenewer.log" ;
DebugLevel = "0" ;

# delay between two renew loops
Delay = "60" ;

# min lifetime for credentials to be renewed
# (this value is also used as the grace trigger to renew creds)
MinLifeTime = "600" ;

}

root@slurmdev1:/u/sreedhar$ cat /etc/auks/auks.acl

#-------------------------------------------------------------------------------
rule {
principal = ^host/slurmdev[1-3][email protected]$ ;
host = * ;
role = admin ;
}
#-------------------------------------------------------------------------------

#-------------------------------------------------------------------------------
rule {
principal = ^[[:alnum:]]*@REALM.A$ ;
host = * ;
role = user ;
}
#-------------------------------------------------------------------------------

root@caslurmdev1:/u/sreedhar$ cat /etc/slurm/plugstack.conf
include /etc/slurm/plugstack.conf.d/*.conf
root@caslurmdev1:/u/sreedhar$ cat /etc/slurm/plugstack.conf.d/auks.conf
optional /usr/lib64/slurm/auks.so default=enabled spankstackcred=yes minimum_uid=1024

root@caslurmdev1:/u/sreedhar$ auks -p
Auks API request succeed
root@caslurmdev1:/u/sreedhar$ exit
exit

sreedhar@slurmdev1:$ auks -p
Auks API init failed : unable to parse configuration file
sreedhar@slurmdev1:
$ auks -vvv -p
Thu Jan 21 11:26:22 2016 [INFO2] [euid=564800185,pid=53870] auks_engine: unable to parse configuration file /usr/local/etc/auks.conf : No such file or directory
Auks API init failed : unable to parse configuration file

sreedhar@slurmdev1:~$ sudo bash
root@slurmdev1:/u/sreedhar$ ln -s /etc/auks/auks.conf /usr/local/etc/auks.conf
root@slurmdev1:/u/sreedhar$ exit
exit

sreedhar@slurmdev1:~$ auks -p
Auks API request succeed

sreedhar@slurmdev1:~$ srun /bin/hostname
Auks API request failed : krb5 cred : unable to read credential cache
slurmdev1.realm.a

sreedhar@slurmdev1:~$ sudo bash
root@slurmdev1:/u/sreedhar$ ls -ld /var/cache/auks
drwx------ 2 root root 71 Jan 21 10:01 /var/cache/auks
root@slurmdev1:/u/sreedhar$ ls -l /var/cache/auks/*
-rw------- 1 root root 1.3K Jan 19 11:10 /var/cache/auks/aukscc_16180
-rw------- 1 root root 1.2K Jan 21 10:01 /var/cache/auks/aukscc_564800185
-rw------- 1 root root 1.2K Jan 20 15:32 /var/cache/auks/aukscc_564800186
root@slurmdev1:/u/sreedhar$ srun /bin/hostname
slurmdev1.realm.a
root@slurmdev1:/u/sreedhar$

sreedhar@slurmdev1:~$ srun klist
Auks API request failed : krb5 cred : unable to read credential cache
Ticket cache: FILE:/tmp/krb5cc_564800185_DMViMeYf6S
Default principal: [email protected]

Valid starting Expires Service principal
01/20/2016 17:40:11 01/25/2016 17:40:08 krbtgt/[email protected]
renew until 01/27/2016 17:40:08
01/21/2016 11:09:55 01/25/2016 17:40:08 host/[email protected]
renew until 01/27/2016 17:40:08
sreedhar@slurmdev1:~$

Compiling on RHEL/CENTOS 7

Hi there,

thanks for the package.
I had to use the following commands to get it to compile on RHEL 7 and CentOS 7 with the latest code from the repository.
Maybe this info can be added to the INSTALL notes, or just left here.

aclocal && libtoolize --force && autoreconf --install
./configure
make

Cheers
Axel

Licensing uncertainties

Hi,

I'm currently in the process of packaging auks for Debian/Ubuntu, and I found some strangeness regarding the licensing of auks.

What was intended here?

(Changing or dual-licensing to GPL or any other well-established license would make things significantly easier. I know, of course, that there are sometimes good reasons or requirements to choose a specific license.)

auks cred: input buffer is too large

Using the latest auks from github, "auks -a" gives me the "auks cred: input buffer is too large" (AUKS_ERROR_CRED_INIT_BUFFER_TOO_LARGE) error.

It seems my ticket length is 2077 bytes while AUKS_CRED_DATA_MAX_LENGTH is defined as 2048.

I tried upping AUKS_CRED_DATA_MAX_LENGTH to 3072, and that seems to make things work, but I would like to make sure this won't break anything else. Also, what would be a sensible value here?

diff -urN auks-0.4.3.1427832275.31aadac/src/api/auks/auks_cred.h auks-0.4.3/src/api/auks/auks_cred.h
--- auks-0.4.3.1427832275.31aadac/src/api/auks/auks_cred.h  2015-06-14 15:00:27.000000000 +0200
+++ auks-0.4.3_patched/src/api/auks/auks_cred.h 2015-06-14 15:08:25.983836640 +0200
@@ -84,7 +84,7 @@
 #define AUKS_CRED_INVALID_TIME       0
 #define AUKS_CRED_FILE_MAX_LENGTH  128

-#define AUKS_CRED_DATA_MAX_LENGTH 2048
+#define AUKS_CRED_DATA_MAX_LENGTH 3072

 typedef struct auks_cred_info {
    char principal[AUKS_PRINCIPAL_MAX_LENGTH + 1];

rpmbuild on centos7.6 failed

Hi, when I build the RPM package from the 0.5.0 tarball, I get a compile error. Am I missing some configuration?

[root@hpc02 caosw]# rpmbuild -ta auks-0.5.0.tar.gz
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.IklB0i
+ umask 022
+ cd /root/rpmbuild/BUILD
+ cd /root/rpmbuild/BUILD
+ rm -rf auks-0.5.0
+ /usr/bin/gzip -dc /mnt/lustrefs/home/caosw/auks-0.5.0.tar.gz
+ /usr/bin/tar -xf -
+ STATUS=0
+ '[' 0 -ne 0 ']'
+ cd auks-0.5.0
+ /usr/bin/chmod -Rf a+rX,u+w,g-w,o-w .
+ exit 0
Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.87kahm
+ umask 022
+ cd /root/rpmbuild/BUILD
+ cd auks-0.5.0
+ autoreconf -fvi
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force -I m4
autoreconf: configure.ac: tracing
autoreconf: running: libtoolize --copy --force
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, `build-aux'.
libtoolize: copying file `build-aux/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIR, `m4'.
libtoolize: copying file `m4/libtool.m4'
libtoolize: copying file `m4/ltoptions.m4'
libtoolize: copying file `m4/ltsugar.m4'
libtoolize: copying file `m4/ltversion.m4'
libtoolize: copying file `m4/lt~obsolete.m4'
autoreconf: running: /usr/bin/autoconf --force
autoreconf: running: /usr/bin/autoheader --force
autoreconf: running: automake --add-missing --copy --force-missing
configure.ac:20: installing 'build-aux/ar-lib'
configure.ac: installing 'build-aux/ylwrap'
automake: warnings are treated as errors
src/auks/Makefile.am:5: warning: compiling 'auks.c' with per-target flags requires 'AM_PROG_CC_C_O' in 'configure.ac'
autoreconf: automake failed with exit status: 1
error: Bad exit status from /var/tmp/rpm-tmp.87kahm (%build)


RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.87kahm (%build)

High availability configuration for auks?

Auks comes with some support for high availability (using a secondary auksd). The auksd man page does not mention a secondary auksd at all, but the auks.conf man page does mention this:

Current version of auksd assumes that when the secondary daemon is started, the primary is stopped. Secondary daemon starts processing requests as soon as it starts. External Fail-Over mechanism must be used to ensure that only one daemon is active at a time.

However, I have not been able to find a recipe for an external fail-over mechanism for auksd. Does one exist? (This talk mentions PaceMaker, but nothing on configuration.)

The reason that only one auksd can be active is that there is no mechanism to synchronize added, renewed or removed credentials between multiple auksd. /var/cache/auks is only read once at the start, after that auksd works from an in-memory copy of the credentials. Any changes (additions, renewals, removals) are written back to /var/cache/auks, but auksd does not have a mechanism to detect changes to /var/cache/auks. (A crude workaround would be to manually add the credentials from /var/cache/auks to the secondary auksd every couple of minutes.)

But what about high availability for the auksdrenewer? The auksdrenewer man page states:

Only one auksdrenewer should be active at a time. This daemon automatically switch to the backup auksd daemon in case of problems with the first one.

So in case auksdrenewer fails, you again need an external fail-over mechanism to a secondary auksdrenewer.

Did anyone set up auks with a fail over to a secondary auksd (and auksdrenewer), and if so, how?

Are there any plans to add built-in support for automatic fail-over, or for supporting an active/active configuration with automatic synchronization?

Doesn't check if ticket is addressless

I'm testing slurm with auks. In my Kerberos setup, tickets aren't addressless by default. If I submit a slurm job with auks enabled, the job will get a ticket that looks addressless but cannot be used, e.g. I get this error:

kinit -R

kinit: krb5_get_kdc_cred: Incorrect net address

If I get an addressless ticket with kinit -A before submitting to slurm, the ticket can be renewed with kinit -R just fine.
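For reference, MIT Kerberos can be told to request addressless tickets by default, making a plain kinit behave like kinit -A. A client-side krb5.conf sketch (whether the KDC honors it depends on site policy):

```ini
# /etc/krb5.conf (client side)
[libdefaults]
    # do not put client addresses in tickets (equivalent to kinit -A)
    noaddresses = true
```

This is a workaround on the client side; the issue as reported is that auks itself doesn't check whether the ticket it ships is addressless.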

AUKS on CentOS 8

I compiled the auks rpms on our data management node (CentOS 8) which is also able to submit jobs to slurm and is "bound" to our active directory domain. No rpms have been installed on the slurm head node or compute nodes yet as I was initially just trying to get it to work locally on a login node.

As this is a newer OS, sssd has implemented yet another way of managing the Kerberos cache, called KCM. I saw that there had previously been some challenges with the cache, so I disabled the KCM and KEYCHAIN caches and used FILE instead.

I installed the following rpms although I'm sure not all are needed. I'm following the HOWTO in the git repository for the login/mgmt node role.

[root@gpcc-node01 ~]# rpm -qa | grep auk
auks-debuginfo-0.5.0-1.el8.x86_64
auks-0.5.0-1.el8.x86_64
auks-devel-0.5.0-1.el8.x86_64
auks-slurm-0.5.0-1.el8.x86_64
auks-slurm-debuginfo-0.5.0-1.el8.x86_64
[root@gpcc-node01 ~]# cat /etc/auks/auks.conf
#------------------------------------------------------------------------------
# auks client and server configuration file
#------------------------------------------------------------------------------

#-
# Common client/server elements
#-
common {


# Primary daemon configuration
PrimaryHost        =   "gpcc-node01.domain.name" ;
#PrimaryAddress     =  "" ;
PrimaryPort        =   12345 ;
PrimaryPrincipal   =   "host/[email protected]" ;

# Secondary daemon configuration
#SecondaryHost      =  "auks2" ;
#SecondaryAddress   =  "" ;
#SecondaryPort      =  "12345" ;
#SecondaryPrincipal =  "host/[email protected]" ;

# Enable/Disable NAT traversal support (yes/no)
# this value must be the same on every nodes
NAT                =   no ;
#other entries redacted as they’re default

[root@gpcc-node01 ~]# cat /etc/auks/auks.acl
    rule {
            principal = ^host/[email protected]$ ;
            host = * ;
            role = admin ;
    }
    rule {
            principal = ^host/gpcc-node[0-9][0-9][email protected]$ ;
            host = * ;
            role = admin ;
    }
    rule {
            principal = ^[[:alnum:]]*/DOMAIN.NAME$ ;
            host = * ;
            role = user ;
    }

[root@gpcc-node01 ~]# realm list
DOMAIN.NAME
  type: kerberos
  realm-name: DOMAIN.NAME
  domain-name: DOMAIN.NAME
  configured: kerberos-member
  server-software: active-directory
  client-software: sssd
  required-package: oddjob
  required-package: oddjob-mkhomedir
  required-package: sssd
  required-package: adcli
  required-package: samba-common-tools
  login-formats: %U
  login-policy: allow-permitted-logins

[root@gpcc-node01 ~]# klist -a
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: [email protected]

Valid starting       Expires              Service principal
07/16/2020 17:19:42  07/17/2020 03:19:42  krbtgt/[email protected]
        renew until 07/23/2020 17:19:40
        Addresses: (none)
07/16/2020 17:20:02  07/17/2020 03:19:42  host/[email protected]
        renew until 07/23/2020 17:19:40
        Addresses: (none)
[root@gpcc-node01 ~]# kvno host/[email protected]
host/[email protected]: kvno = 2

[root@gpcc-node01 ~]# systemctl restart auks*
[root@gpcc-node01 ~]# auks -avvvvvvvvvvvvvvvvvvvv
Thu Jul 16 17:29:19 2020 [INFO2] [euid=0,pid=10932] auks_engine: initializing engine from 'common' block of file /etc/auks/auks.conf
Thu Jul 16 17:29:19 2020 [INFO2] [euid=0,pid=10932] auks_engine: initializing engine from 'api' block of file /etc/auks/auks.conf
Thu Jul 16 17:29:19 2020 [INFO2] [euid=0,pid=10932] auks_engine: initializing engine from 'renewer' block of file /etc/auks/auks.conf
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine primary daemon is 'gpcc-node01.DOMAIN.NAME'
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine primary daemon address is 'gpcc-node01.DOMAIN.NAME'
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine primary daemon port is 12345
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine primary daemon principal is host/[email protected]
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine secondary daemon is 'localhost'
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine secondary daemon address is 'localhost'
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine secondary daemon port is 12345
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine secondary daemon principal is
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine logfile is /tmp/auksapi.log
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine loglevel is 3
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine debugfile is /tmp/auksapi.log
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine debuglevel is 3
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine retry number is 3
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine timeout is 10
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine delay is 3
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine NAT traversal mode is disabled
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine renewer_logfile is /var/log/auksdrenewer.log
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine renewer_loglevel is 3
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine renewer_debugfile is /var/log/auksdrenewer.log
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine renewer_debuglevel is 3
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine renewer delay is 60
Thu Jul 16 17:29:19 2020 [INFO3] [euid=0,pid=10932] auks_engine: engine renewer min cred lifetime is 600
Thu Jul 16 17:29:25 2020 [INFO3] [euid=0,pid=10932] auks_api: add request processing failed : auks api : connection failed
Auks API request failed : auks api : connection failed

Port 12345 is open/listening on the host.

[root@gpcc-node01 ~]# ss -lanp | grep auk
u_str            ESTAB                  0                   0                                                                                * 193984                                                      * 0                                   users:(("aukspriv",pid=14632,fd=4))
u_str            ESTAB                  0                   0                                                                                * 161465                                                      * 228544                              users:(("auksdrenewer",pid=14610,fd=2),("auksdrenewer",pid=14610,fd=1))
u_str            ESTAB                  0                   0                                                                                * 219748                                                      * 194183                              users:(("auksd",pid=15549,fd=2),("auksd",pid=15549,fd=1))
tcp              LISTEN                 0                   50                                                                   192.168.21.61:12345                                                 0.0.0.0:*                                   users:(("auksd",pid=15549,fd=3))

The auks services (priv, renewer, auksd) appear to be active, but I did notice some errors in the system's log as shown below.

Jul 16 17:45:15 gpcc-node01 systemd[1]: Started Auks External Kerberos Credential Support Daemon.
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO1] [euid=0,pid=13000] auksd     : worker threads stacksize is 49152
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO1] [euid=0,pid=13000] auksd     : worker args array successfully allocated
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO2] [euid=0,pid=13000] auksd     : worker[0] successfully launched
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO2] [euid=0,pid=13000] auksd     : worker[1] successfully launched
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO2] [euid=0,pid=13000] auksd     : worker[2] successfully launched
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO2] [euid=0,pid=13000] worker[0] : auks cred repo cleaned in ~0s (0 creds removed)
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO2] [euid=0,pid=13000] auksd     : worker[3] successfully launched
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO2] [euid=0,pid=13000] auksd     : worker[4] successfully launched
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO2] [euid=0,pid=13000] auksd     : worker[5] successfully launched
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO2] [euid=0,pid=13000] auksd     : worker[6] successfully launched
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO2] [euid=0,pid=13000] auksd     : worker[7] successfully launched
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO2] [euid=0,pid=13000] auksd     : worker[8] successfully launched
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO2] [euid=0,pid=13000] auksd     : worker[9] successfully launched
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO2] [euid=0,pid=13000] auksd     : worker[10] successfully launched
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO1] [euid=0,pid=13000] auksd     : 11/11 workers launched
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO1] [euid=0,pid=13000] dispatcher: auksd stream created on gpcc-node01.domain.name:12345 (fd is 3)
Jul 16 17:45:15 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:15 2020 [INFO1] [euid=0,pid=13000] dispatcher: socket 3 listening queue successfully specified
Jul 16 17:45:15 gpcc-node01 atftpd[5639]: Invalid request <0> from 192.168.21.140
Jul 16 17:45:17 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:17 2020 [INFO3] [euid=0,pid=13000] dispatcher: incoming connection (4) successfully added to pending queue
Jul 16 17:45:17 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:17 2020 [INFO3] [euid=0,pid=13000] worker[1] : incoming socket 4 successfully dequeued
Jul 16 17:45:17 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:17 2020 [INFO3] [euid=0,pid=13000] worker[1] : krb5 stream successfully initialized for socket 4
Jul 16 17:45:17 gpcc-node01 auksd[13000]: Thu Jul 16 17:45:17 2020 [INFO2] [euid=0,pid=13000] worker[1] : authentication failed on socket 4 (192.168.21.61) : krb5 stream : recvauth stage failed (server side)
Jul 16 17:45:17 gpcc-node01 auksd[13000]: free(): double free detected in tcache 2
Jul 16 17:45:17 gpcc-node01 auksdrenewer[10854]: Thu Jul 16 17:45:17 2020 [INFO3] [euid=0,pid=10854] auks_api: dump request processing failed : auks api : connection failed
Jul 16 17:45:17 gpcc-node01 auksdrenewer[10854]: Thu Jul 16 17:45:17 2020 [INFO1] [euid=0,pid=10854] renewer: unable to dump auksd creds : auks api : request processing failed
Jul 16 17:45:17 gpcc-node01 auksdrenewer[10854]: Thu Jul 16 17:45:17 2020 [INFO1] [euid=0,pid=10854] renewer: 21845 creds renewed in ~6s
Jul 16 17:45:17 gpcc-node01 auksdrenewer[10854]: Thu Jul 16 17:45:17 2020 [INFO2] [euid=0,pid=10854] renewer: sleeping 54 seconds before next renew
Jul 16 17:45:17 gpcc-node01 systemd[1]: Started Process Core Dump (PID 13013/UID 0).
Jul 16 17:45:17 gpcc-node01 systemd[1]: auksd.service: Main process exited, code=killed, status=6/ABRT
Jul 16 17:45:17 gpcc-node01 systemd[1]: auksd.service: Failed with result 'signal'.
Jul 16 17:45:17 gpcc-node01 systemd-coredump[13014]: Process 13000 (auksd) of user 0 dumped core.

Stack trace of thread 13002:
#0  0x00007ffff70e070f raise (libc.so.6)
#1  0x00007ffff70cab25 abort (libc.so.6)
#2  0x00007ffff7123897 __libc_message (libc.so.6)
#3  0x00007ffff7129fdc malloc_printerr (libc.so.6)
#4  0x00007ffff712bd4d _int_free (libc.so.6)
#5  0x00007ffff7bb6d68 auks_krb5_stream_free_contents (libauksapi.so.0)
#6  0x0000555555558128 auksd_process_req (auksd)
#7  0x0000555555556ca7 processor_main_function (auksd)
#8  0x00007ffff74732de start_thread (libpthread.so.0)
#9  0x00007ffff71a4e83 __clone (libc.so.6)

Stack trace of thread 13000:
#0  0x00007ffff747ccb7 accept (libpthread.so.0)
#1  0x00007ffff7bc3afc xstream_accept (libauksapi.so.0)
#2  0x0000555555556e15 dispatcher_main_function (auksd)
#3  0x00005555555571a6 auksd_main_loop (auksd)
#4  0x00005555555566a3 main (auksd)
#5  0x00007ffff70cc6a3 __libc_start_main (libc.so.6)
#6  0x00005555555569ae _start (auksd)

Stack trace of thread 13001:
#0  0x00007ffff7171238 __nanosleep (libc.so.6)
#1  0x00007ffff717113e sleep (libc.so.6)
#2  0x0000555555556b67 cleaner_main_function (auksd)
#3  0x00007ffff74732de start_thread (libpthread.so.0)
#4  0x00007ffff71a4e83 __clone (libc.so.6)

Stack trace of thread 13008:
#0  0x00007ffff747948c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007ffff7bc4eb6 xqueue_dequeue_base (libauksapi.so.0)
#2  0x0000555555556d10 processor_main_function (auksd)
#3  0x00007ffff74732de start_thread (libpthread.so.0)
#4  0x00007ffff71a4e83 __clone (libc.so.6)

Stack trace of thread 13003:
#0  0x00007ffff747948c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007ffff7bc4eb6 xqueue_dequeue_base (libauksapi.so.0)
#2  0x0000555555556d10 processor_main_function (auksd)
#3  0x00007ffff74732de start_thread (libpthread.so.0)
#4  0x00007ffff71a4e83 __clone (libc.so.6)

Stack trace of thread 13007:
#0  0x00007ffff747948c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007ffff7bc4eb6 xqueue_dequeue_base (libauksapi.so.0)
#2  0x0000555555556d10 processor_main_function (auksd)
#3  0x00007ffff74732de start_thread (libpthread.so.0)
#4  0x00007ffff71a4e83 __clone (libc.so.6)

Stack trace of thread 13010:
#0  0x00007ffff747948c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007ffff7bc4eb6 xqueue_dequeue_base (libauksapi.so.0)
#2  0x0000555555556d10 processor_main_function (auksd)
#3  0x00007ffff74732de start_thread (libpthread.so.0)
#4  0x00007ffff71a4e83 __clone (libc.so.6)

Stack trace of thread 13011:
#0  0x00007ffff747948c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007ffff7bc4eb6 xqueue_dequeue_base (libauksapi.so.0)
#2  0x0000555555556d10 processor_main_function (auksd)
#3  0x00007ffff74732de start_thread (libpthread.so.0)
#4  0x00007ffff71a4e83 __clone (libc.so.6)

Stack trace of thread 13005:
#0  0x00007ffff747948c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007ffff7bc4eb6 xqueue_dequeue_base (libauksapi.so.0)
#2  0x0000555555556d10 processor_main_function (auksd)
#3  0x00007ffff74732de start_thread (libpthread.so.0)
#4  0x00007ffff71a4e83 __clone (libc.so.6)

Stack trace of thread 13004:
#0  0x00007ffff747948c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007ffff7bc4eb6 xqueue_dequeue_base (libauksapi.so.0)
#2  0x0000555555556d10 processor_main_function (auksd)
#3  0x00007ffff74732de start_thread (libpthread.so.0)
#4  0x00007ffff71a4e83 __clone (libc.so.6)

Stack trace of thread 13009:
#0  0x00007ffff747948c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007ffff7bc4eb6 xqueue_dequeue_base (libauksapi.so.0)
#2  0x0000555555556d10 processor_main_function (auksd)
#3  0x00007ffff74732de start_thread (libpthread.so.0)
#4  0x00007ffff71a4e83 __clone (libc.so.6)

Stack trace of thread 13006:
#0  0x00007ffff747948c pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007ffff7bc4eb6 xqueue_dequeue_base (libauksapi.so.0)
#2  0x0000555555556d10 processor_main_function (auksd)
#3  0x00007ffff74732de start_thread (libpthread.so.0)
#4  0x00007ffff71a4e83 __clone (libc.so.6)
Jul 16 17:45:18 gpcc-node01 systemd[1]: auksd.service: Service RestartSec=100ms expired, scheduling restart.
Jul 16 17:45:18 gpcc-node01 systemd[1]: auksd.service: Scheduled restart job, restart counter is at 53.

I'm not sure why auksd dumped core. Please let me know what additional information I can supply to help determine the cause of the issue. I appreciate any help anyone might be able to offer.

spank-auks: mode disabled

Recently a significant number of jobs failed. The common factor seems to be 'spank-auks: mode disabled' entries in the logs for these jobs. The Slurm and auks configuration was not changed at all around this time span.

I'm now wondering what chain of events could lead to the spank-auks plugin being loaded but not being able to read the configuration parameters. It seems both the loading of spank-auks and the configuration happen via the same /etc/slurm/plugstack.conf.d/auks.conf file. So spank-auks should either work, or not be loaded at all?

Is the /etc/slurm/plugstack.conf.d/auks.conf file loaded only once, or every time a new job is started? Is it loaded by slurmd or by slurmstepd (or both)?

Are there other reasons (besides the default parameter) for 'mode disabled' (such as hostname resolver problems, network connection problems, job parameter problems, ...)?

Any ideas?

Rare issue causing complete SLURM jobs to remain in the queue

This is an issue that happens maybe every second week. The result is finished Slurm jobs remaining in the queue with RUNNING status.

The relevant error messages I'm getting are

error: spank-auks: Error while initializing a new unique
and
error: spank-auks: Unable to destroy ccache

Any idea what's happening here?

systemd not starting aukspriv on compute node

The aukspriv.service file defines WantedBy=auksdrenewer.service.
However, on compute nodes auksdrenewer is not enabled, so systemd leaves aukspriv inactive.

Changing the line to WantedBy=auksdrenewer.service slurmd.service makes systemd start aukspriv alongside slurmd, as would a Wants=aukspriv.service line in slurmd.service. But maybe a more general solution like WantedBy=multi-user.target would be more appropriate?
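A lighter-weight variant that patches neither packaged unit file is a systemd drop-in on slurmd. A sketch, assuming the stock unit names aukspriv.service and slurmd.service:

```ini
# /etc/systemd/system/slurmd.service.d/aukspriv.conf
[Unit]
# Pull in aukspriv whenever slurmd starts, and order it first
Wants=aukspriv.service
After=aukspriv.service
```

After adding the drop-in, run systemctl daemon-reload and restart slurmd. Drop-ins survive RPM upgrades of the packaged unit files, which is why this may be preferable to editing aukspriv.service directly.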

slurmstepd not killing auks process

With recent versions of Slurm (17.11.3 and later, including 18.08.0-0pre1), the auks -R loop process on the compute node doesn't seem to terminate properly. This causes slurmctld to believe a process is still running, so it leaves the resources allocated: the job goes into a completing state but never actually completes. I believe, but am not 100% certain, this is related to a change in signal blocking/processing in slurmstepd.c introduced in 17.11.3 per this commit:

-- Make sure the slurmstepd blocks signals like SIGTERM correctly.

In slurm-spank-auks.c, changing:

kill(renewer_pid, SIGTERM);

to:

kill(renewer_pid, SIGKILL);

seems to result in the expected behavior (i.e., auks -R loop exits when the job completes naturally or is scanceled and the resources are freed.) I'm not sure if this is really the best way to go about it though.

Mark

RPM packaging issues

After building the RPMs and installing them, I noticed a couple things. Sorry, I don't know enough to easily provide patches, or fixes to the SPEC file.

  1. Currently, both the systemd service files and the auks command look for auks.conf at /etc/auks.conf but the documentation and example files have it at /etc/auks/auks.conf. I think /etc/auks/auks.conf is probably the preferred location so the results of rpmbuild ought to reflect that. FWIW, I made a symlink from /etc/ which seems to have made everything work.

  2. The ACLFile line in auks.conf refers to /etc/auks/auksd.acl but the example file is auks.acl.example (no 'd') and the man page is also for auks.acl (no 'd'). rpmbuild ought to create auks.conf with a modified ACLFile line. The workaround here is obvious. :-)

  3. The auksd systemd service file (/usr/lib/systemd/system/auksd.service) does "ExecStart=/usr/sbin/auksd -F $AUKSPRIV_OPTIONS" but it probably ought to be "ExecStart=/usr/sbin/auksd -F $AUKSD_OPTIONS". This doesn't really matter as long as you use the same name in /etc/sysconfig/auksd but it is confusing to have the variable named AUKSPRIV when configuring auksd.

I hope this feedback is helpful/useful.

Thanks to the developers for all their work! We are excited to be making progress towards having slurm/auks working on our cluster.

IPv6 support

Slurm recently added support for IPv6. We tried AUKS with a Slurm IPv6 cluster and it does not work. Any thoughts on what it would take to add IPv6 support to AUKS ? Just trying to get an idea of the amount of changes / effort involved.

sbatch in loop is waiting on auks replies for 300 seconds

Hi Matthew,

I came across this new issue just today. I am trying to submit a simple test script 50 times in a loop.

for i in {1..50};do sbatch test.sh;done

When the cluster is busy, around 40 of the sbatch calls succeed immediately and then the remaining ones just sit there for 300 seconds. First I thought something was going on with slurm. I checked netstat and realized all sbatch calls were waiting for replies from auks (all trying to talk to port 12345 on the auks server).

Then I restarted auks daemon and immediately they finished. If I don't restart they just sit there for exactly 5 minutes before they finish.

Now that the cluster is a bit less busy, I increased it to 100 and 500 submissions, and most of the time they all finish. This makes me think that I need to increase the number of threads. Right now I have 1000 workers, a queue size of 500, a repo size of 1000, a clean delay of 300, and the reply cache set to no.

What do you recommend? Increasing workers or some other value? I am not sure whether these 300 seconds have anything to do with the clean delay.

I am just not sure how many threads/workers I can allocate for auks. We definitely submit many jobs in a short amount of time.

If you have any recommendation please let me know. If you think it has nothing to do with workers at all, then please let me know if you have any thoughts on how I can fix this if there is another configuration variable I need to adjust.

Thanks,
Sreedhar.

Building fails on CentOS 8

Some symbols cannot be found:

../api/auks/.libs/libauksapi.so: undefined reference to `krb5_rc_close'                                                                                                                      
../api/auks/.libs/libauksapi.so: undefined reference to `krb5_rc_resolve_full'         

Probably caused by #46?

Full log:

Making all in auks
make[3]: Entering directory '/root/rpmbuild/BUILD/auks-0.5.0_20.11.4_1/src/auks'
gcc -DHAVE_CONFIG_H -I. -I../..  -I/usr/include/tirpc -I./../api   -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -DSYSCONFDIR=\"/etc/auks\" -c -o auks-auks.o `test -f 'auks.c' || echo './'`auks.c
/bin/sh ../../libtool  --tag=CC   --mode=link gcc  -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -DSYSCONFDIR=\"/etc/auks\"  -lkrb5 -pthread -Wl,-z,relro  -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -o auks auks-auks.o ../api/auks/libauksapi.la
libtool: link: gcc -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -DSYSCONFDIR=\"/etc/auks\" -pthread -Wl,-z -Wl,relro -Wl,-z -Wl,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -o .libs/auks auks-auks.o  ../api/auks/.libs/libauksapi.so -lkrb5 -ltirpc -pthread
../api/auks/.libs/libauksapi.so: undefined reference to `krb5_rc_close'
../api/auks/.libs/libauksapi.so: undefined reference to `krb5_rc_resolve_full'
collect2: error: ld returned 1 exit status
make[3]: *** [Makefile:397: auks] Error 1
make[3]: Leaving directory '/root/rpmbuild/BUILD/auks-0.5.0_20.11.4_1/src/auks'
make[2]: *** [Makefile:358: all-recursive] Error 1
make[2]: Leaving directory '/root/rpmbuild/BUILD/auks-0.5.0_20.11.4_1/src'
make[1]: *** [Makefile:413: all-recursive] Error 1
make[1]: Leaving directory '/root/rpmbuild/BUILD/auks-0.5.0_20.11.4_1'
make: *** [Makefile:345: all] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.nPE4wc (%build)
    Bad exit status from /var/tmp/rpm-tmp.nPE4wc (%build)

Auksd & SLURM el7

Hey there,

Hope this is the right place for this, if not please tell me.

I am trying to bring up auks with our newly installed slurm implementation but I am having a few problems getting the initial services started. I was hoping you could assist, or help point me in the right direction.

Setting it up on:
Mgmt node
Login node
Compute node

Installing first on mgmt node:

Installed auks via RPM's:

auks-0.4.0-1.x86_64.rpm
auks-debuginfo-0.4.0-1.x86_64.rpm
auks-devel-0.4.0-1.x86_64.rpm
auks-slurm-0.4.0-1.x86_64.rpm

And enabled the auks plugin by adding this to plugstack.conf:

optional /usr/lib64/slurm/auks.so default=enabled spankstackcred=yes minimum_uid=1024

Inside the auks.conf file I have configured the:

PrimaryHost
PrimaryPrincipal

No secondary

Inside the auks.acl file (I am a bit confused here) I have the admin line set up, and currently it is set to myself. I know that this is not correct; should this be the slurm user? Also, I am not entirely sure what to set for the guest and user roles, or if they need to be defined at all.
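
For what it's worth, a hedged sketch of auks.acl rules, based on the format of the shipped auks.acl.example (realm and regexps are placeholders to adapt): the admin role is normally granted to host principals of the nodes that run auksd/slurmctld so they may fetch any user's credential, while ordinary users fall under the user role, which only lets them manage their own credentials.

```
rule {
        principal = ^host/.*@EXAMPLE\.ORG$
        host = *
        role = admin
}

rule {
        principal = ^[[:alnum:]]*@EXAMPLE\.ORG$
        host = *
        role = user
}
```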

When trying to start the auksd service it hangs on activating and eventually fails. Looking at the auks.log it shows a failure at the krb5_recvauth step:

Wed May 20 10:32:08 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: connection authentication context initialisation succeed
Wed May 20 10:32:08 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: authentication context addrs set up succeed
Wed May 20 10:32:08 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: default kstream initialisation succeed
Wed May 20 10:32:08 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: kstream basic initialisation succeed
Wed May 20 10:32:08 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: keytab initialisation succeed
Wed May 20 10:32:08 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: server kstream initialisation succeed
Wed May 20 10:32:08 2020 [INFO3] [euid=0,pid=31256] worker[6] : krb5 stream successfully initialized for socket 4
Wed May 20 10:32:08 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: authentication failed : Software caused connection abort
Wed May 20 10:32:08 2020 [INFO2] [euid=0,pid=31256] worker[6] : authentication failed on socket 4 (10.232.128.65) : krb5 stream : recvauth stage failed (server side)
Wed May 20 10:32:08 2020 [INFO3] [euid=0,pid=31256] worker[6] : incoming socket 4 processing failed
Wed May 20 10:32:11 2020 [INFO3] [euid=0,pid=31256] dispatcher: incoming connection (3) successfully added to pending queue
Wed May 20 10:32:11 2020 [INFO3] [euid=0,pid=31256] worker[8] : incoming socket 3 successfully dequeued
Wed May 20 10:32:11 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: local endpoint stream 3 informations request succeed
Wed May 20 10:32:11 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: remote endpoint stream 3 informations request succeed
Wed May 20 10:32:11 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: remote host is 10.232.128.65
Wed May 20 10:32:11 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: context initialization succeed
Wed May 20 10:32:11 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: connection authentication context initialisation succeed
Wed May 20 10:32:11 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: authentication context addrs set up succeed
Wed May 20 10:32:11 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: default kstream initialisation succeed
Wed May 20 10:32:11 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: kstream basic initialisation succeed
Wed May 20 10:32:11 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: keytab initialisation succeed
Wed May 20 10:32:11 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: server kstream initialisation succeed
Wed May 20 10:32:11 2020 [INFO3] [euid=0,pid=31256] worker[8] : krb5 stream successfully initialized for socket 3
Wed May 20 10:32:11 2020 [INFO4] [euid=0,pid=31256] auks_krb5_stream: authentication failed : Software caused connection abort
Wed May 20 10:32:11 2020 [INFO2] [euid=0,pid=31256] worker[8] : authentication failed on socket 3 (10.232.128.65) : krb5 stream : recvauth stage failed (server side)

Aukspriv does not seem happy either: it complains that it is unable to get the ccache for the host using the keytab file, which I "believe" is a good keytab file, but my Kerberos knowledge is not very good.

unable to get ccache for host ____ using ktfile /etc/krb5.keytab : kinit: Client not found in Kerberos database while getting initial credentials.

Any suggestions or pointers on where to be looking to resolve this would be so helpful.

Best,

John Hudson

systemd reports aukspriv fails to start (timeout)

Using the latest auks from github on CentOS-7.1, systemd reports aukspriv fails to start (timeout).

systemd[1]: Starting SYSV: Auks aukspriv ccache from keytab scripted daemon...
aukspriv[26963]: Starting aukspriv:[  OK  ]
systemd[1]: PID file /var/run/aukspriv.pid not readable (yet?) after start.
systemd[1]: aukspriv.service operation timed out. Terminating.
systemd[1]: Failed to start SYSV: Auks aukspriv ccache from keytab scripted daemon.
systemd[1]: Unit aukspriv.service entered failed state.

I see an aukspriv process running afterwards, but no /var/run/aukspriv.pid.

LogFile ignored

Since upgrading to the latest version on github (commit aa2eb6b), auksd and auksdrenewer ignore the specified LogFile but instead log to stdout (/var/log/messages).

Logging to the specified LogFile was working at least up to and including commit 31aadac.

Auks not working with gssproxy tickets, needs private ticket cache

After upgrading to CentOS7.4, auks stopped functioning properly because rpc.gssd stores a GSSPROXY ticket in /tmp/krb5cc_0 (service principal 'Encrypted/Credentials/v1@X-GSSPROXY:') and auks can't use this ticket. Restarting aukspriv will fix the problem temporarily (new service principal 'krbtgt/DOMAIN@DOMAIN') until rpc.gssd overwrites the cache again.

My workaround is to make auks use its private ticket cache (/tmp/krb5cc_0_auks). This required quite some searching and trying, however:
For aukspriv, I added 'AUKS_PRIV_CCACHE_APPEND=_auks' to /etc/sysconfig/aukspriv.
For auksdrenewer, I added 'KRB5CCNAME=FILE:/tmp/krb5cc_0_auks' to /etc/sysconfig/auksdrenewer.
For the SLURM spank plugin, I added 'hostcredcache=FILE:/tmp/krb5cc_0_auks' to /etc/slurm/plugstack.conf.d/auks.conf.
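
Collected in one place, the three settings described above (verbatim from this workaround, paths as given):

```
# /etc/sysconfig/aukspriv
AUKS_PRIV_CCACHE_APPEND=_auks

# /etc/sysconfig/auksdrenewer
KRB5CCNAME=FILE:/tmp/krb5cc_0_auks

# /etc/slurm/plugstack.conf.d/auks.conf (appended to the auks.so line)
hostcredcache=FILE:/tmp/krb5cc_0_auks
```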

It would have been nice to have a common setting for this, and even better to use a private ticket cache by default...

slurmd version 20.11.0 fails to start

I have upgraded slurm to 20.11.0 from 20.02.6 and now it fails to start:

slurmd: debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
slurmd: debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
slurmd: debug:  spank: opening plugin stack /etc/slurm/plugstack.conf.d/auks.conf
slurmd: debug:  _establish_config_source: using config_file=/etc/slurm/slurm.conf (environment)
slurmd: debug:  slurm_conf_init: using config_file=/etc/slurm/slurm.conf
slurmd: debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
slurmd: debug3: Couldn't find sym 'slurm_spank_job_prolog' in the plugin
slurmd: debug3: Couldn't find sym 'slurm_spank_task_init_privileged' in the plugin
slurmd: debug3: Couldn't find sym 'slurm_spank_task_init' in the plugin
slurmd: debug3: Couldn't find sym 'slurm_spank_task_post_fork' in the plugin
slurmd: debug3: Couldn't find sym 'slurm_spank_job_epilog' in the plugin
slurmd: debug3: Couldn't find sym 'slurm_spank_slurmd_exit' in the plugin
slurmd: debug2: spank: /usr/lib64/slurm/auks.so: no callbacks in this context
slurmd: debug3: plugin_context_create: no uler type
slurmd: error: cannot create cred context for (null)
slurmd: error: slurmd initialization failed

my /etc/slurm/plugstack.conf.d/auks.conf looks like this:
optional auks.so default=enabled spankstackcred=yes minimum_uid=1000 sync=no

What am I doing wrong?

Auks API request failed : auks api : request processing failed

I've had AUKS and SLURM running successfully on CentOS 7, and it still is.

I've built a new cluster using RHEL 7

Most things are controlled on these servers by puppet, so configs and settings should be pretty much the same.

Kerberos is working fine on these machines for NFS shares, so I'm assuming that's not the issue.

I can't successfully run auks -p on any node but for now just trying on the master node.

[tanderson@slurm-cont01 ~]$ sudo klist -fna
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: [email protected]

Valid starting Expires Service principal
01/07/20 20:33:11 02/07/20 06:33:11 krbtgt/[email protected]
renew until 29/09/20 20:33:11, Flags: RIA
Addresses: (none)
01/07/20 20:33:17 02/07/20 06:33:11 host/[email protected]
renew until 29/09/20 20:33:11, Flags: RA
Addresses: (none)
[tanderson@slurm-cont01 ~]$ klist -fna
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: [email protected]

Valid starting Expires Service principal
01/07/20 20:34:20 02/07/20 20:34:17 krbtgt/[email protected]
renew until 29/09/20 20:34:17, Flags: RIA
Addresses: (none)
01/07/20 20:34:32 02/07/20 06:34:32 host/[email protected]
renew until 29/09/20 20:34:17, Flags: RA
Addresses: (none)

Here are some very verbose logs. Thanks for any help.

[tanderson@slurm-cont01 ~]$ auks -vvv -p
Wed Jul 1 20:47:08 2020 [INFO2] [euid=1001,pid=8880] auks_engine: initializing engine from 'common' block of file /etc/auks/auks.conf
Wed Jul 1 20:47:08 2020 [INFO2] [euid=1001,pid=8880] auks_engine: initializing engine from 'api' block of file /etc/auks/auks.conf
Wed Jul 1 20:47:08 2020 [INFO2] [euid=1001,pid=8880] auks_engine: initializing engine from 'renewer' block of file /etc/auks/auks.conf
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine primary daemon is 'slurm-cont01.svi.edu.au'
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine primary daemon address is 'slurm-cont01.svi.edu.au'
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine primary daemon port is 12345
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine primary daemon principal is host/[email protected]
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine secondary daemon is 'localhost'
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine secondary daemon address is 'localhost'
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine secondary daemon port is 12345
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine secondary daemon principal is
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine logfile is /tmp/auksapi.log
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine loglevel is 10
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine debugfile is /tmp/auksapi.log
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine debuglevel is 10
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine retry number is 3
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine timeout is 10
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine delay is 3
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine NAT traversal mode is disabled
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine renewer_logfile is /var/log/auksdrenewer.log
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine renewer_loglevel is 10
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine renewer_debugfile is /var/log/auksdrenewer.log
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine renewer_debuglevel is 10
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine renewer delay is 60
Wed Jul 1 20:47:08 2020 [INFO3] [euid=1001,pid=8880] auks_engine: engine renewer min cred lifetime is 600
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_api: starting retry 1 of 3
Wed Jul 1 20:47:08 2020 [INFO7] [euid=1001,pid=8880] xstream: socket creation succeed
Wed Jul 1 20:47:08 2020 [INFO7] [euid=1001,pid=8880] xstream: socket non-blocking flag is now set
Wed Jul 1 20:47:08 2020 [INFO7] [euid=1001,pid=8880] xstream: connect (10.80.0.141:12345) succeed while polling
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_api: successfully connected to auks server slurm-cont01.svi.edu.au:12345
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: local endpoint stream 4 informations request succeed
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: remote endpoint stream 4 informations request succeed
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: remote host is 10.80.0.141
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: context initialization succeed
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: connection authentication context initialisation succeed
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: authentication context addrs set up succeed
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: default kstream initialisation succeed
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: kstream basic initialisation succeed
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: ccache initialisation succeed
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: client kstream initialisation succeed
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: authentication failed : Server rejected authentication (during sendauth exchange)
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_api: authentication failed : krb5 sendauth stage failed (client side)
Wed Jul 1 20:47:08 2020 [INFO7] [euid=1001,pid=8880] xstream: socket creation succeed
Wed Jul 1 20:47:08 2020 [INFO7] [euid=1001,pid=8880] xstream: socket non-blocking flag is now set
Wed Jul 1 20:47:08 2020 [INFO7] [euid=1001,pid=8880] xstream: connect (127.0.0.1:12345) failed while polling : Connection refused
Wed Jul 1 20:47:08 2020 [INFO7] [euid=1001,pid=8880] xstream: socket creation succeed
Wed Jul 1 20:47:08 2020 [INFO7] [euid=1001,pid=8880] xstream: socket non-blocking flag is now set
Wed Jul 1 20:47:08 2020 [INFO7] [euid=1001,pid=8880] xstream: connect (127.0.0.1:12345) failed while polling : Connection refused
Wed Jul 1 20:47:08 2020 [INFO4] [euid=1001,pid=8880] auks_api: unable to connect to auks server localhost:12345
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_api: starting retry 2 of 3
Wed Jul 1 20:47:11 2020 [INFO7] [euid=1001,pid=8880] xstream: socket creation succeed
Wed Jul 1 20:47:11 2020 [INFO7] [euid=1001,pid=8880] xstream: socket non-blocking flag is now set
Wed Jul 1 20:47:11 2020 [INFO7] [euid=1001,pid=8880] xstream: connect (10.80.0.141:12345) succeed while polling
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_api: successfully connected to auks server slurm-cont01.svi.edu.au:12345
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: local endpoint stream 4 informations request succeed
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: remote endpoint stream 4 informations request succeed
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: remote host is 10.80.0.141
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: context initialization succeed
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: connection authentication context initialisation succeed
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: authentication context addrs set up succeed
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: default kstream initialisation succeed
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: kstream basic initialisation succeed
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: ccache initialisation succeed
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: client kstream initialisation succeed
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: authentication failed : Server rejected authentication (during sendauth exchange)
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_api: authentication failed : krb5 sendauth stage failed (client side)
Wed Jul 1 20:47:11 2020 [INFO7] [euid=1001,pid=8880] xstream: socket creation succeed
Wed Jul 1 20:47:11 2020 [INFO7] [euid=1001,pid=8880] xstream: socket non-blocking flag is now set
Wed Jul 1 20:47:11 2020 [INFO7] [euid=1001,pid=8880] xstream: connect (127.0.0.1:12345) failed while polling : Connection refused
Wed Jul 1 20:47:11 2020 [INFO7] [euid=1001,pid=8880] xstream: socket creation succeed
Wed Jul 1 20:47:11 2020 [INFO7] [euid=1001,pid=8880] xstream: socket non-blocking flag is now set
Wed Jul 1 20:47:11 2020 [INFO7] [euid=1001,pid=8880] xstream: connect (127.0.0.1:12345) failed while polling : Connection refused
Wed Jul 1 20:47:11 2020 [INFO4] [euid=1001,pid=8880] auks_api: unable to connect to auks server localhost:12345
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_api: starting retry 3 of 3
Wed Jul 1 20:47:14 2020 [INFO7] [euid=1001,pid=8880] xstream: socket creation succeed
Wed Jul 1 20:47:14 2020 [INFO7] [euid=1001,pid=8880] xstream: socket non-blocking flag is now set
Wed Jul 1 20:47:14 2020 [INFO7] [euid=1001,pid=8880] xstream: connect (10.80.0.141:12345) succeed while polling
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_api: successfully connected to auks server slurm-cont01.svi.edu.au:12345
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: local endpoint stream 4 informations request succeed
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: remote endpoint stream 4 informations request succeed
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: remote host is 10.80.0.141
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: context initialization succeed
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: connection authentication context initialisation succeed
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: authentication context addrs set up succeed
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: default kstream initialisation succeed
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: kstream basic initialisation succeed
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: ccache initialisation succeed
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: client kstream initialisation succeed
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_krb5_stream: authentication failed : Server rejected authentication (during sendauth exchange)
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_api: authentication failed : krb5 sendauth stage failed (client side)
Wed Jul 1 20:47:14 2020 [INFO7] [euid=1001,pid=8880] xstream: socket creation succeed
Wed Jul 1 20:47:14 2020 [INFO7] [euid=1001,pid=8880] xstream: socket non-blocking flag is now set
Wed Jul 1 20:47:14 2020 [INFO7] [euid=1001,pid=8880] xstream: connect (127.0.0.1:12345) failed while polling : Connection refused
Wed Jul 1 20:47:14 2020 [INFO7] [euid=1001,pid=8880] xstream: socket creation succeed
Wed Jul 1 20:47:14 2020 [INFO7] [euid=1001,pid=8880] xstream: socket non-blocking flag is now set
Wed Jul 1 20:47:14 2020 [INFO7] [euid=1001,pid=8880] xstream: connect (127.0.0.1:12345) failed while polling : Connection refused
Wed Jul 1 20:47:14 2020 [INFO4] [euid=1001,pid=8880] auks_api: unable to connect to auks server localhost:12345
Wed Jul 1 20:47:14 2020 [INFO3] [euid=1001,pid=8880] auks_api: ping request processing failed : auks api : connection failed
Auks API request failed : auks api : request processing failed

current host is not an auks server

To anyone who sees the error exiting : auksd : current host is not an auks server:
PrimaryHost (and SecondaryHost) must exactly match the name returned by uname -n.

So if uname -n returns only the hostname, PrimaryHost must be set to that hostname; if uname -n returns the FQDN, PrimaryHost must be set to that FQDN.

libkrb5 1.18 missing krb5_rc_resolve_full krb5_rc_close

Building the scibian auks package for debian bullseye runs into

/usr/bin/ld: ../api/auks/.libs/libauksapi.so: undefined reference to `krb5_rc_close'
/usr/bin/ld: ../api/auks/.libs/libauksapi.so: undefined reference to `krb5_rc_resolve_full'

The symbols are missing from libkrb5; they were renamed to k5_rc_close & k5_rc_resolve in krb5/krb5@dcb853a.

This is the code in question:

/* disable replay cache if asked to (better scalability without it) */
if ( kstream->flags & AUKS_KRB5_STREAM_NO_RCACHE ) {
        krb5_rcache rcache;
        kstatus = krb5_rc_resolve_full(kstream->context,&rcache,
                                       "none:");
        if (kstatus) {
                auks_error("rcache resolve failed : %s",
                           error_message(kstatus));
                fstatus = AUKS_ERROR_KRB5_STREAM_CTX_SETRCACHE ;
                goto auth_ctx_exit;
        }
        kstatus = krb5_rc_initialize(kstream->context,rcache,0);
        if (kstatus) {
                auks_error("rcache initialisation failed : %s",
                           error_message(kstatus));
                krb5_rc_close(kstream->context,rcache);
#ifdef LIBKRB5_MEMORY_LEAK_WORKAROUND
                /* memory leak in libkrb5 ? */
                /* valgrind says that it was not freed correctly */
                free((char*)rcache);
#endif
                fstatus = AUKS_ERROR_KRB5_STREAM_CTX_SETRCACHE ;
                goto auth_ctx_exit;
        }
        kstatus = krb5_auth_con_setrcache(kstream->context,
                                          kstream->auth_context,
                                          rcache);
        if (kstatus) {
                auks_error("unable to set rcache : %s",
                           error_message(kstatus));
                krb5_rc_close(kstream->context,rcache);
#ifdef LIBKRB5_MEMORY_LEAK_WORKAROUND
                /* memory leak in libkrb5 ? */
                /* valgrind says that it was not freed correctly */
                free((char*)rcache);
#endif
                fstatus = AUKS_ERROR_KRB5_STREAM_CTX_SETRCACHE ;
                goto auth_ctx_exit;
        }
}

According to gentoo#738968, www-apache/mod_auth_kerb-5.4-r2 references krb5_rc_resolve_full, but it is removed from libkrb5.so

RCACHE is mandatory these days so hardcode

Therefore I assume the code in question can be removed (as with the gentoo patch to mod_auth_kerb).
Is this correct?

auksd on Centos6

I have another issue ....

I am running my auksd on a Scientific Linux 6 box. I am observing a file-descriptor leak; once the process runs out of fds, it stops working.

lsof shows that the file in question is /var/lib/sss/mc/passwd. For me this indicates the problem is with sssd, when called in auks_cred.c ... maybe getpwnam_r ... but that is just a guess.

I will try to find out whether the problem also exists in CentOS 7, which should have a different version of sssd ... If you have some other suggestion, I am happy to try it.

Cheers, Dietrich

/etc/sysconfig/aukspriv: No such file or directory

Commit 94f4efa makes aukspriv.service fail: when /etc/sysconfig/aukspriv doesn't exist, sourcing fails and aukspriv is not started. /etc/sysconfig/aukspriv didn't need to exist before, and it's not created by the rpm during updating.

Correct me if I'm wrong but that commit seems to be a hack to provide something that systemd doesn't support, and I'm wondering if that is the smart thing to do. Is this something that needs to be in the official tree?

Compute node reports: `auks api : request processing failed`

Hello,

I have trouble debugging the auks -p call on the compute node node2:

[root@node2 auks]# auks -p
Auks API request failed : auks api : request processing failed

On this node, aukspriv is running and it adds the correct default principal:

[root@node2 auks]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: host/[email protected]

Valid starting       Expires              Service principal
12.08.2022 09:45:11  12.08.2022 19:45:11  krbtgt/[email protected]
        renew until 19.08.2022 09:45:11

On the login and management node node1, aukspriv, auksd & auksdrenewer are running, here the auks call works successfully:

[root@node1 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: host/[email protected]

Valid starting       Expires              Service principal
12.08.2022 09:55:13  12.08.2022 19:55:13  krbtgt/[email protected]
        renew until 19.08.2022 09:55:13
[root@node1 ~]# auks -p
Auks API request succeed
[root@node1 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: host/[email protected]

Valid starting       Expires              Service principal
12.08.2022 09:55:13  12.08.2022 19:55:13  krbtgt/[email protected]
        renew until 19.08.2022 09:55:13
12.08.2022 09:55:19  12.08.2022 19:55:13  host/[email protected]
        renew until 19.08.2022 09:55:13

Any ideas how to debug this further?

I am using these configuration files:

rejecting jobs when there are no kerberos tickets

Hi,

When I don't have a Kerberos credential, srun shows an error but the job still runs. How do we make the job get rejected instead?

After doing kdestroy, jobs still run fine, with sbatch and srun printing an error that the auks cred extraction failed. Is it possible to reject such jobs? If so, what needs to be done in the auks configuration?

Right now this is what I have in /etc/slurm/plugstack.d/auks.conf (referenced via PlugStackConfig in slurm.conf):

optional /usr/lib64/slurm/auks.so default=enabled spankstackcred=yes minimum_uid=1024 sync=no
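One thing I plan to try, based on my reading of the SPANK docs (unverified): the first word of a plugstack line decides what happens on plugin failure. With optional, errors from the plugin are ignored and the job proceeds; with required, a plugin failure aborts the job. A sketch with the same options as above:

```conf
# /etc/slurm/plugstack.d/auks.conf
# "required" instead of "optional" should make Slurm reject the job when
# the auks plugin reports a failure (e.g. no credential could be obtained).
required /usr/lib64/slurm/auks.so default=enabled spankstackcred=yes minimum_uid=1024 sync=no
```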

I have this only on submit and compute nodes. Do I need this on the front end where auksd runs?

I would really appreciate any advice.

Thanks in advance,
Sreedhar.

Auks, Slurm and OpenAFS

Hi!

Thanks for your effort. This is less an issue, more a question. I am setting up SLURM and AUKS on an OpenAFS system at a small site in Vienna. Evidently auks keeps the Kerberos ticket alive, but with AFS one also has to convert the Kerberos ticket into an AFS token using aklog on a regular basis.

I succeeded in setting up the initial AFS token using the pam_afs_session module, which also solves
this problem during an SSH login. But now the AFS token must be renewed on a regular
basis ... which requires detaching another process.

I see that you have a solution for that problem with the auks spank plugin, so I could write my own
plugin to solve that. But I have the impression that I am trying to reinvent the wheel. Is there a standard solution for this issue with AFS?
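Absent a standard solution, the renew loop itself would be small; a generic sketch, where the refresh command would be aklog on our OpenAFS nodes (an assumption — it is a parameter here, so the loop itself runs anywhere):

```shell
#!/bin/sh
# Generic token-refresh loop: run a refresh command every $2 seconds,
# $3 times. On an OpenAFS node the command would be "aklog" (assumption);
# it is parameterized here so the loop itself is testable.
refresh_loop() {
    cmd=$1; interval=$2; count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        $cmd || return 1        # stop as soon as the refresh command fails
        i=$((i + 1))
        sleep "$interval"
    done
}

refresh_loop true 0 3 && echo "loop ok"   # -> loop ok
```

In practice the loop would be started per job (e.g. from a SPANK task hook or a wrapper script) and terminated with the job.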

Cheers, Dietrich

Questions regarding ticket renewal and alternative Slurm AuthType

Hi,

We're in the process of evaluating AUKS for Kerberized deployments and I had few questions:

  • From my understanding, auksdrenewer is responsible for renewing the tickets in auksd, while the SPANK plugin process will renew those on each compute nodes (with auks -R loop). What happens when the ticket expires and is not renewable for long-running jobs? Is there a way to update the ticket ahead of time if the user got a fresh one? If so, does the user have to do this manually with the auks API or can it be automated somehow (something similar to GSS rekeying with PAM maybe?). Does it matter whether the job is running or not and is there any race we need to watch for?

  • Has there been any effort towards replacing Munge with a Kerberos-based approach as the Slurm AuthType? It doesn't look like this project addresses that, but I guess most of its infrastructure could be reused for it.
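For the first point, one pattern we are considering (a sketch, not verified against the auks internals) is to have the user refresh the TGT locally ahead of the renew-till limit and push it again; auks -a should then replace the credential stored in auksd for that principal:

```shell
# Sketch only -- requires a live KDC, so not runnable standalone.
kinit -R || kinit    # refresh the local TGT, re-authenticating if renewal fails
auks -a              # re-upload the fresh TGT to auksd
```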

error: Failed to get current user environment variables

Hi,

I'm trying to use auks with slurm, but can't make it work in a specific case.
The users' home directories in my cluster are mounted with Kerberos security, which is why auks is needed here. In a simple case like srun ls ~, everything is fine: I can access my home using my Kerberos ticket.

But when using "--get-user-env" with sbatch, "_run_prolog" doesn't have access to the home. In other words, the command su - my_username on the node running the job doesn't work in the context of _run_prolog when it tries to execute the .bash_profile in my home. Some Slurm users need to load their environment this way. It feels like auks loads credentials only during the real job and not during "_run_prolog", even when it's required there.

I can see in slurm's logs:

_run_prolog: run job script took usec=31976
_run_prolog: prolog with lock for job 978965 ran for 0 seconds
error: Failed to get current user environment variables
error: _get_user_env: Unable to get user's local environment, running only with passed environment

With a pstree -p I can see something like this:

           |-slurmd(3437)-+-su(21166)---bash(21167)
           |              `-{slurmd}(21163)

Maybe it's because the su isn't launched by slurmstepd?

Do you know how to solve this issue ? Is there a configuration parameter I missed ?

Thanks

slurm-auks doesn't switch back to the old primary ccache on exit

  1. I start a Slurm job, then ssh to the host. The ssh session's KRB5CCNAME is KEYRING:persistent:1275601911.
  2. I start an auks job; the job finishes.
  3. In my ssh session, klist now reports that the ccache is not found:
klist: Credentials cache keyring 'persistent:1275601911:krb_ccache_UXiwRMG' not found

I found that the auks job switches the primary ccache, and this behavior disturbs my other ssh krb5 environment.

I solved my problem by commenting out the switch_cc_cache code in auks_krb5_cred.c.

Do you have a better idea?
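Another workaround I am considering, instead of patching auks (untested sketch; kswitch ships with MIT krb5 and changes which cache in a collection is primary):

```shell
# Restore the ssh session's own cache as the collection's primary cache
# after an auks job has switched it (cache name taken from the klist above).
kswitch -c "KEYRING:persistent:1275601911:krb_ccache_UXiwRMG"
klist
```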

CentOS 7 - Segmentation Fault

I am receiving a segmentation fault from auksd on CentOS 7. Has anyone encountered something similar with AUKS?

[root@somevm ~]# auksd -v
Mon Aug 13 10:03:11 2018 [INFO1] [euid=0,pid=1384] auksd_engine: initializing engine from 'common' block of file /etc/auks/auks.conf
Mon Aug 13 10:03:11 2018 [INFO1] [euid=0,pid=1384] auksd_engine: initializing engine from 'auksd' block of file /etc/auks/auks.conf
Segmentation fault

RHEL 8 with version 0.5.0 and patch #53

Hi all,

I'm upgrading our SLURM cluster and wanted to use the latest version of auks too. The first problem I ran into was solved by applying patch #53

Now everything works fine on the master node, but all API commands fail from the other nodes.

[trenttest@test-sl01 ~]$ auks -p
Auks API request failed : auks api : request processing failed
[trenttest@test-sl01 ~]$ auks -a
Auks API request failed : auks api : connection failed

Here is a verbose auks -p from the same login node https://pastebin.com/wxUV3xYM

With the debug and log levels both set to 5 on the master node running auksd, nothing is logged. The machines are on the same subnet and no firewall is involved. They are rebuilds of the same machines on which I ran RHEL 7 and a previous version of auks successfully.

Any help would be great.

auks acl : unable to parse acl file

Hello,
Can someone help me please? We are running FreeIPA as the domain controller and Slurm to push jobs to the nodes. In order to use srun, we need to configure auks to manage the tickets on the Slurm server.
I changed the principal name and domain name.

  • cat /etc/auks/auks.conf:
#------------------------------------------------------------------------------
# auks client and server configuration file
#------------------------------------------------------------------------------

#-
# Common client/server elements
#-
common {


 # Primary daemon configuration
 PrimaryHost        =   "hostname.realm.com" ;
 #PrimaryAddress     =  "" ;
 PrimaryPort        =   12345 ;
 PrimaryPrincipal   =   "host/[email protected]" ;

 # Enable/Disable NAT traversal support (yes/no)
 # this value must be the same on every nodes
 NAT                =   yes ;

 # max connection retries number
 Retries            =    3 ;

 # connection timeout
 Timeout            =   10 ;

 # delay in seconds between retries
 Delay              =    3 ;

}

auksd {


 # Primary daemon configuration
 PrimaryKeytab      =   "/etc/krb5.keytab" ;

 # log file and level
 LogFile            =   "/var/log/auksd.log" ;
 LogLevel           =   "5" ;

 # optional debug file and level
 DebugFile          =   "/var/log/auksd.log" ;
 DebugLevel         =   "5" ;

 # directory in which daemons store the creds
 CacheDir           =   "/var/cache/auks" ;

 # ACL file for cred repo access authorization rules
 ACLFile            =   "/etc/auks/auksd.acl" ;

 # default size of incoming requests queue
 # it grows up dynamically
 QueueSize          =   50 ;

 # default repository size (number of creds)
 # it grows dynamically
 RepoSize           =   500 ;

 # number of workers for incoming request processing
 Workers            =   10 ;

 # delay in seconds between 2 repository clean stages
 CleanDelay         =   300 ;

 # use kerberos replay cache system (slow down)
 ReplayCache        =   yes ;

}

#-
# Auksd renewer only elements
#-
renewer {

 # log file and level
 LogFile            =   "/var/log/auksdrenewer.log" ;
 LogLevel           =   "1" ;

 # optional debug file and level
 DebugFile          =   "/var/log/auksdrenewer.log" ;
 DebugLevel         =   "0" ;

 # delay between two renew loops
 Delay              = "60" ;

 # Min Lifetime for credentials to be renewed
 # This value is also used as the grace trigger to renew creds
 MinLifeTime        = "600" ;

}

#-
# API only elements
#-
api {

 # log file and level
 LogFile            =   "/tmp/auksapi.log" ;
 LogLevel           =   "3" ;

 # optional debug file and level
 DebugFile          =   "/tmp/auksapi.log" ;
 DebugLevel         =   "3" ;

}
  • cat /etc/auks/auksd.acl:
rule {
            principal = ^host/[email protected]$ ;
            host = * ;
            role = admin ;
rule {
            principal = ^[[:alnum:]]*@REALM.COM$ ;
            host = * ;
            role = user ;
    }
  • cat /etc/sysconfig/aukspriv:
AUKS_PRIV_PRINC="host/hostname.realm.com"
#AUKS_PRIV_KEYTAB="/etc/auks/auks.keytab"
AUKS_PRIV_KEYTAB="/etc/krb5.keytab"
  • the auksd daemon is down:
# systemctl  status auksd -l
● auksd.service - Auks External Kerberos Credential Support Daemon
  Loaded: loaded (/usr/lib/systemd/system/auksd.service; enabled; vendor preset: disabled)
  Active: failed (Result: start-limit) since Tue 2019-01-15 15:10:31 EST; 1h 53min ago
 Process: 14286 ExecStart=/usr/sbin/auksd -F $AUKSPRIV_OPTIONS (code=exited, status=150)
Main PID: 14286 (code=exited, status=150)

Jan 15 15:10:30 hostname.realm.com systemd[1]: Unit auksd.service entered failed state.
Jan 15 15:10:30 hostname.realm.com systemd[1]: auksd.service failed.
Jan 15 15:10:31 hostname.realm.com systemd[1]: auksd.service holdoff time over, scheduling restart.
Jan 15 15:10:31 hostname.realm.com systemd[1]: Stopped Auks External Kerberos Credential Support Daemon.
Jan 15 15:10:31 hostname.realm.com systemd[1]: start request repeated too quickly for auksd.service
Jan 15 15:10:31 hostname.realm.com systemd[1]: Failed to start Auks External Kerberos Credential Support Daemon.
Jan 15 15:10:31 hostname.realm.com systemd[1]: Unit auksd.service entered failed state.
Jan 15 15:10:31 hostname.realm.com systemd[1]: auksd.service failed.
  • auksdrenewer is active but logs errors:
# systemctl  status auksdrenewer -l
● auksdrenewer.service - Auks Credentials Renewer Daemon
   Loaded: loaded (/usr/lib/systemd/system/auksdrenewer.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2019-01-15 15:10:26 EST; 1h 56min ago
 Main PID: 14267 (auksdrenewer)
   CGroup: /system.slice/auksdrenewer.service
           └─14267 /usr/sbin/auksdrenewer -F

Jan 15 17:02:32 hostname.realm.com auksdrenewer[14267]: Tue Jan 15 17:02:32 2019 [INFO1] [euid=0,pid=14267] renewer: unable to dump auksd creds : auks api : request processing failed
Jan 15 17:02:32 hostname.realm.com auksdrenewer[14267]: Tue Jan 15 17:02:32 2019 [INFO1] [euid=0,pid=14267] renewer: 32727 creds renewed in ~6s
Jan 15 17:03:32 hostname.realm.com auksdrenewer[14267]: Tue Jan 15 17:03:32 2019 [INFO1] [euid=0,pid=14267] renewer: unable to dump auksd creds : auks api : request processing failed
Jan 15 17:03:32 hostname.realm.com auksdrenewer[14267]: Tue Jan 15 17:03:32 2019 [INFO1] [euid=0,pid=14267] renewer: 32727 creds renewed in ~6s
Jan 15 17:04:32 hostname.realm.com auksdrenewer[14267]: Tue Jan 15 17:04:32 2019 [INFO1] [euid=0,pid=14267] renewer: unable to dump auksd creds : auks api : request processing failed
Jan 15 17:04:32 hostname.realm.com auksdrenewer[14267]: Tue Jan 15 17:04:32 2019 [INFO1] [euid=0,pid=14267] renewer: 32727 creds renewed in ~6s
Jan 15 17:05:32 hostname.realm.com auksdrenewer[14267]: Tue Jan 15 17:05:32 2019 [INFO1] [euid=0,pid=14267] renewer: unable to dump auksd creds : auks api : request processing failed
Jan 15 17:05:32 hostname.realm.com auksdrenewer[14267]: Tue Jan 15 17:05:32 2019 [INFO1] [euid=0,pid=14267] renewer: 32727 creds renewed in ~6s
Jan 15 17:06:32 hostname.realm.com auksdrenewer[14267]: Tue Jan 15 17:06:32 2019 [INFO1] [euid=0,pid=14267] renewer: unable to dump auksd creds : auks api : request processing failed
Jan 15 17:06:32 hostname.realm.com auksdrenewer[14267]: Tue Jan 15 17:06:32 2019 [INFO1] [euid=0,pid=14267] renewer: 32727 creds renewed in ~6s
  • aukspriv is up without errors:
# systemctl  status aukspriv -l
● aukspriv.service - Auks ccache from keytab scripted daemon
   Loaded: loaded (/usr/lib/systemd/system/aukspriv.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2019-01-15 15:10:19 EST; 1h 58min ago
  Process: 14238 ExecStart=/usr/sbin/aukspriv $AUKSPRIV_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 14239 (aukspriv)
   CGroup: /system.slice/aukspriv.service
           ├─14239 /bin/bash /usr/sbin/aukspriv
           └─14248 sleep 35000

Jan 15 15:10:19 hostname.realm.com systemd[1]: Starting Auks ccache from keytab scripted daemon...
Jan 15 15:10:19 hostname.realm.com systemd[1]: Started Auks ccache from keytab scripted daemon.
  • The /var/log/auksd.log file isn't created and /var/cache/auks doesn't exist either; below is the output of some commands:

$auks -a

-bash-4.2$ auks -a
Tue Jan 15 16:52:16 2019 [INFO3] [euid=1500000047,pid=19292] auks_api: add request processing failed : auks api : connection failed
Auks API request failed : auks api : connection failed
-bash-4.2$ auks -vvv -a
Tue Jan 15 16:52:27 2019 [INFO2] [euid=1500000047,pid=19307] auks_engine: initializing engine from 'common' block of file /etc/auks/auks.conf
Tue Jan 15 16:52:27 2019 [INFO2] [euid=1500000047,pid=19307] auks_engine: initializing engine from 'api' block of file /etc/auks/auks.conf
Tue Jan 15 16:52:27 2019 [INFO2] [euid=1500000047,pid=19307] auks_engine: initializing engine from 'renewer' block of file /etc/auks/auks.conf
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine primary daemon is 'hostname.realm.com'
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine primary daemon address is 'hostname.realm.com'
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine primary daemon port is 12345
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine primary daemon principal is host/[email protected]
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine secondary daemon is 'localhost'
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine secondary daemon address is 'localhost'
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine secondary daemon port is 12345
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine secondary daemon principal is 
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine logfile is /tmp/auksapi.log
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine loglevel is 3
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine debugfile is /tmp/auksapi.log
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine debuglevel is 3
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine retry number is 3
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine timeout is 10
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine delay is 3
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine NAT traversal mode is enabled
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine renewer_logfile is /var/log/auksdrenewer.log
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine renewer_loglevel is 1
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine renewer_debugfile is /var/log/auksdrenewer.log
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine renewer_debuglevel is 0
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine renewer delay is 60
Tue Jan 15 16:52:27 2019 [INFO3] [euid=1500000047,pid=19307] auks_engine: engine renewer min cred lifetime is 600
Tue Jan 15 16:52:33 2019 [INFO3] [euid=1500000047,pid=19307] auks_api: add request processing failed : auks api : connection failed
Auks API request failed : auks api : connection failed

$auksd -vvv

bash-4.2$ auksd -vvv
Tue Jan 15 16:55:57 2019 [INFO1] [euid=1500000047,pid=19477] auksd_engine: initializing engine from 'common' block of file /etc/auks/auks.conf
Tue Jan 15 16:55:57 2019 [INFO1] [euid=1500000047,pid=19477] auksd_engine: initializing engine from 'auksd' block of file /etc/auks/auks.conf
Tue Jan 15 16:55:57 2019 [INFO1] [euid=1500000047,pid=19477] auksd_engine: unable to init auksd engine ACL from file /etc/auks/auksd.acl
Tue Jan 15 16:55:57 2019 [INFO1] [euid=1500000047,pid=19477] auksd_engine: initialization failed
Tue Jan 15 16:55:57 2019 [INFO1] [euid=1500000047,pid=19477] exiting : auks acl : unable to parse acl file

$ klist

Ticket cache: KEYRING:persistent:1500000047:krb_ccache_iqWq6Zl
Default principal: [email protected]

Valid starting       Expires              Service principal
01/15/2019 16:47:40  01/16/2019 16:47:36  krbtgt/[email protected]

vi /tmp/auksapi.log

Mon Jan 14 12:07:50 2019 [INFO3] [euid=0,pid=24313] auks_api: auks cred extraction failed : krb5 cred : unable to read credential cache
Mon Jan 14 12:07:52 2019 [INFO3] [euid=0,pid=24320] auks_api: auks cred extraction failed : krb5 cred : unable to read credential cache
Mon Jan 14 12:07:58 2019 [INFO3] [euid=0,pid=24339] auks_api: auks cred extraction failed : krb5 cred : unable to read credential cache
Mon Jan 14 13:16:14 2019 [INFO3] [euid=0,pid=6201] auks_api: add request processing failed : auks api : connection failed
Tue Jan 15 12:16:33 2019 [INFO3] [euid=0,pid=5408] auks_api: auks cred extraction failed : krb5 cred : unable to read credential cache
Tue Jan 15 12:18:48 2019 [INFO3] [euid=0,pid=5427] auks_api: add request processing failed : auks api : connection failed
Tue Jan 15 12:19:59 2019 [INFO3] [euid=0,pid=5454] auks_api: add request processing failed : auks api : connection failed
Tue Jan 15 12:22:26 2019 [INFO3] [euid=0,pid=5508] auks_api: add request processing failed : auks api : connection failed
Tue Jan 15 13:07:43 2019 [INFO3] [euid=0,pid=7786] auks_api: ping request processing failed : auks api : connection failed
Tue Jan 15 14:05:00 2019 [INFO3] [euid=0,pid=10769] auks_api: ping request processing failed : auks api : connection failed
Tue Jan 15 14:18:11 2019 [INFO3] [euid=0,pid=11552] auks_api: add request processing failed : auks api : connection failed
Tue Jan 15 14:18:34 2019 [INFO3] [euid=0,pid=11572] auks_api: ping request processing failed : auks api : connection failed
Tue Jan 15 16:15:48 2019 [INFO3] [euid=0,pid=17458] auks_api: ping request processing failed : auks api : connection failed
Tue Jan 15 16:19:02 2019 [INFO3] [euid=0,pid=17621] auks_api: dump request processing failed : auks api : connection failed
Tue Jan 15 16:19:02 2019 [INFO1] [euid=0,pid=17621] renewer: unable to dump auksd creds : auks api : request processing failed
Tue Jan 15 16:19:02 2019 [INFO1] [euid=0,pid=17621] renewer: 32647 creds renewed in ~6s
Tue Jan 15 16:19:02 2019 [INFO2] [euid=0,pid=17621] renewer: sleeping 54 seconds before next renew
Tue Jan 15 16:19:28 2019 [INFO1] [euid=0,pid=17621] renewer: ending main loop

Thanks,

auks with keyring credential cache type rather than file type

Hi,

How do I work around the case where we use the keyring credential cache type for Kerberos rather than the file type? Right now it doesn't work until I comment out the keyring option in our krb5.conf:

default_ccache_name = KEYRING:persistent:%{uid}

Then the cache defaults to the file type and things work fine.
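A possible middle ground I am considering (a sketch, not yet validated on our side): keep KEYRING as the site-wide default but override the ccache type via krb5.conf on the nodes where auks is involved:

```conf
# /etc/krb5.conf on auks-using nodes -- force file-based caches so auks
# can read and serialize them (sketch; %{uid} expands to the user's uid).
[libdefaults]
    default_ccache_name = FILE:/tmp/krb5cc_%{uid}
```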

Do you have any suggestions? Thanks in advance

Best,
Sreedhar.

Another how-to question: not all Kerberos principals showing, and no access to NFS shares

Hello again,

I now have auks working with slurm.

[trenttesttwo@slurm-login01 tmp]$ auks -p
Auks API request succeed

[trenttesttwo@slurm-login01 tmp]$ srun klist -a
Ticket cache: FILE:/tmp/tktrJUXoi
Default principal: [email protected]

Valid starting     Expires            Service principal
03/07/20 11:43:12  04/07/20 11:43:12  krbtgt/[email protected]
        renew until 01/10/20 11:43:12
        Addresses: (none)

[trenttesttwo@slurm-login01 tmp]$ auks -r
Auks API request succeed

[trenttesttwo@slurm-login01 tmp]$ srun --auks=no klist -a
klist: No credentials cache found (filename: /tmp/krb5cc_1430606966_3dvoPL)
srun: error: scn01: task 0: Exited with exit code 1

[trenttesttwo@slurm-login01 tmp]$ srun klist -a
Ticket cache: FILE:/tmp/tktdgFelH
Default principal: [email protected]

Valid starting     Expires            Service principal
03/07/20 11:43:12  04/07/20 11:43:12  krbtgt/[email protected]
        renew until 01/10/20 11:43:12
        Addresses: (none)

I have NFSv4 shares with sec=krb5, and I am using these successfully in other locations in the business. The shares are mounted correctly, and when a user is directly on a compute node they can access them. However, accessing them via Slurm isn't working. If I su to a user on a compute node and run kinit, a Slurm command run by that user does then have access to the share.

User: trenttesttwo on login node

[trenttesttwo@slurm-login01 tmp]$ klist -a
Ticket cache: FILE:/tmp/krb5cc_1430606966_3dvoPL
Default principal: [email protected]

Valid starting     Expires            Service principal
03/07/20 11:43:12  04/07/20 11:43:12  krbtgt/[email protected]
        renew until 01/10/20 11:43:12
        Addresses: (none)
03/07/20 11:43:24  03/07/20 21:43:24  host/[email protected]
        renew until 01/10/20 11:43:12
        Addresses: (none)
03/07/20 11:43:29  03/07/20 21:43:29  nfs/files08.svi.edu.au@
        renew until 01/10/20 11:43:12
        Addresses: (none)
03/07/20 11:43:29  03/07/20 21:43:29  nfs/[email protected]
        renew until 01/10/20 11:43:12
        Addresses: (none)
03/07/20 11:43:29  03/07/20 21:43:29  nfs/files12.svi.edu.au@
        renew until 01/10/20 11:43:12
        Addresses: (none)
03/07/20 11:43:29  03/07/20 21:43:29  nfs/[email protected]
        renew until 01/10/20 11:43:12
        Addresses: (none)

[trenttesttwo@slurm-login01 tmp]$ srun klist -a
Ticket cache: FILE:/tmp/tkt8dzGl2
Default principal: [email protected]

Valid starting     Expires            Service principal
03/07/20 11:43:12  04/07/20 11:43:12  krbtgt/[email protected]
        renew until 01/10/20 11:43:12
        Addresses: (none)


[trenttesttwo@slurm-login01 tmp]$ ls -l /mnt
total 22
drwxrws---. 17 root        50047 19 Mar 11 16:53 mannbiofiles
drwxrws---. 12 root mccart_files 14 Jun 15 15:35 mcfiles
drwxrws---. 13 root mccart_files 17 Feb 10 11:39 mcscratch


[trenttesttwo@slurm-login01 tmp]$ srun ls -l /mnt
/usr/bin/ls: cannot access /mnt/mcfiles: Permission denied
/usr/bin/ls: cannot access /mnt/mcscratch: Permission denied
total 0
d????????? ? ? ? ?            ? mannbiofiles
d????????? ? ? ? ?            ? mcfiles
d????????? ? ? ? ?            ? mcscratch
/usr/bin/ls: cannot access /mnt/mannbiofiles: Permission denied
srun: error: scn01: task 0: Exited with exit code 1

Same user directly on that compute node: This is just to show that the compute node has the right access setup.

[trenttesttwo@scn01 ~]$ ls -l /mnt
total 22
drwxrws---. 17 root        50047 19 Mar 11 16:53 mannbiofiles
drwxrws---. 12 root mccart_files 14 Jun 15 15:35 mcfiles
drwxrws---. 13 root mccart_files 17 Feb 10 11:39 mcscratch

And now again from the login node

[trenttesttwo@slurm-login01 tmp]$ srun klist -a
Ticket cache: FILE:/tmp/tktpppRvy
Default principal: [email protected]

Valid starting     Expires            Service principal
03/07/20 11:43:12  04/07/20 11:43:12  krbtgt/[email protected]
        renew until 01/10/20 11:43:12
        Addresses: (none)

[trenttesttwo@slurm-login01 tmp]$ srun ls -l /mnt
total 22
drwxrws---. 17 root        50047 19 Mar 11 16:53 mannbiofiles
drwxrws---. 12 root mccart_files 14 Jun 15 15:35 mcfiles
drwxrws---. 13 root mccart_files 17 Feb 10 11:39 mcscratch

Grateful for any help.

Cheers

Trent

Unable to deserialize credential data

Hi,

I have been trying to get auks working on our cluster and I'm currently hitting this error; I'm not quite sure where the problem might be. I'm on CentOS 7.6.

→ auks -vvvvvvv -g -C /tmp/ccache
auks_engine: initializing engine from 'common' block of file /etc/auks/auks.conf
auks_engine: initializing engine from 'api' block of file /etc/auks/auks.conf
auks_engine: initializing engine from 'renewer' block of file /etc/auks/auks.conf
auks_engine: engine primary daemon is 'primary.example.org'
auks_engine: engine primary daemon address is 'primary.example.org'
auks_engine: engine primary daemon port is 5000
auks_engine: engine primary daemon principal is host/[email protected]
auks_engine: engine secondary daemon is 'secondary.example.org'
auks_engine: engine secondary daemon address is 'secondary.example.org'
auks_engine: engine secondary daemon port is 5000
auks_engine: engine secondary daemon principal is host/[email protected]
auks_engine: engine logfile is /var/log/auksapi.log
auks_engine: engine loglevel is 10
auks_engine: engine debugfile is /var/log/auksapi.log
auks_engine: engine debuglevel is 10
auks_engine: engine retry number is 3
auks_engine: engine timeout is 10
auks_engine: engine delay is 3
auks_engine: engine NAT traversal mode is disabled
auks_engine: engine renewer_logfile is /var/log/auksdrenewer.log
auks_engine: engine renewer_loglevel is 10
auks_engine: engine renewer_debugfile is /var/log/auksdrenewer.log
auks_engine: engine renewer_debuglevel is 10
auks_engine: engine renewer delay is 60
auks_engine: engine renewer min cred lifetime is 600
auks_api: starting retry 1 of 3
xstream: socket creation succeed
xstream: socket non-blocking flag is now set
xstream: connect (123.456.789.10:5000) succeed while polling
auks_api: successfully connected to auks server primary.example.org:5000
auks_krb5_stream: local endpoint stream 4 informations request succeed
auks_krb5_stream: remote endpoint stream 4 informations request succeed
auks_krb5_stream: remote host is 123.456.789.10
auks_krb5_stream: context initialization succeed
auks_krb5_stream: connection authentication context initialisation succeed
auks_krb5_stream: authentication context addrs set up succeed
auks_krb5_stream: default kstream initialisation succeed
auks_krb5_stream: kstream basic initialisation succeed
auks_krb5_stream: ccache initialisation succeed
auks_krb5_stream: client kstream initialisation succeed
auks_krb5_stream: authentication succeed
auks_krb5_stream: message encryption succeed
auks_krb5_stream: message transmission succeed : 8 bytes sended
auks_krb5_stream: message reception succeed
auks_krb5_stream: message decryption succeed
auks_krb5_stream: message reception succeed : 8361 bytes stored
auks_krb5_stream: message encryption succeed
auks_krb5_stream: message transmission succeed : 4 bytes sended
auks_krb5_cred: kerberos context successfully initialized
auks_krb5_cred: kerberos authentication context successfully initialized
auks_krb5_cred: unable to deserialize credential data : ASN.1 encoding ended unexpectedly
auks_api: unable to store cred in file '/tmp/ccache' : krb5 cred : unable to load credential from memory
Auks API request failed : auks api : reply processing failed

With auks -a and auks -p I get "Auks API request succeed", so those work. Any insight into what the problem could be?

Is there a consensus on which commit to use for RHEL7 with Kerberized NFS?

We spent some time this week trying to deploy Slurm 20.11.3 with auks 0.5.0 on our CentOS 7 cluster. It seems clear that we want a release that is in between 0.4.4 and 0.5.0 and I'm hoping somebody knows where the sweet spot is!

0.5.0 doesn't work with Kerberized NFS because auks creates caches of the form /tmp/tkt* and NFS only looks for /tmp/krb5cc_uid*. See issue #43. This took embarrassingly long to discover for ourselves. :-)

We are also experiencing the issue described in #23 where gssproxy and auks fight over root's ticket cache (RHEL7.4 and newer). The workaround described there looks perfectly fine for 0.4.4 but the relevant environment variable seems to have been removed from 0.5.0. Disabling gssproxy isn't an attractive option for us (but I won't rule it out) so I'd like to avoid a version where AUKS_PRIV_CCACHE_APPEND has been removed.

Unfortunately 0.4.4 apparently doesn't terminate jobs when running newer versions of Slurm. See issue #24 (which has a commit that purports to fix it). We haven't actually tested this ourselves but obviously we'd want that patch.

Does anyone who has already gone down this road have advice? @kenshin33 @trenta

ticket renewal works only once

I have a problem with ticket renewal: only the first attempt seems to work, and every further attempt fails:
[pax80] ~ % kinit -A --lifetime=11m
[email protected]'s Password:
[pax80] ~ % auks -a
Auks API request succeed
[pax80] ~ % auks -R once
Auks API request failed : auks cred : credential is still valid

After the renewal on the auks server I get this:
[pax80] ~ % auks -R once
Auks API request succeed
[pax80] ~ % auks -R once
Auks API request failed : krb5 cred : no TGT found in credential cache

The cache directory on the auks server has a valid TGT:
[chap-vm7] /var/cache/auks # klist -v --cache=aukscc_12884
Credentials cache: FILE:aukscc_12884
Principal: [email protected]
Cache version: 4

Server: krbtgt/[email protected]
Client: [email protected]
Ticket etype: aes256-cts-hmac-sha1-96, kvno 120
Ticket length: 346
Auth time: Jun 4 12:54:01 2014
Start time: Jun 4 12:59:21 2014
End time: Jun 4 13:10:21 2014
Renew till: Jul 4 12:54:01 2014
Ticket flags: transited-policy-checked, pre-authent, renewable, forwardable
Addresses: addressless

You can get it on the client computer as well:
[pax80] ~ % auks -g
Auks API request succeed
[pax80] ~ % klist -v
Credentials cache: FILE:/tmp/krb5cc_12884_Wedlsp8119
Principal: [email protected]
Cache version: 4

Server: krbtgt/[email protected]
Client: [email protected]
Ticket etype: aes256-cts-hmac-sha1-96, kvno 120
Ticket length: 346
Auth time: Jun 4 12:54:01 2014
Start time: Jun 4 13:02:20 2014
End time: Jun 4 13:13:20 2014
Renew till: Jul 4 12:54:01 2014
Ticket flags: transited-policy-checked, pre-authent, renewable, forwardable
Addresses: addressless

compile on EL8

Hello, I am trying to compile auks on CentOS 8.
First, I had to run "autoreconf -i" to install "ar-lib" and "ylwrap".

configure works, but during make the following error appears:
xmessage.c:82:10: fatal error: rpc/types.h: No such file or directory
#include <rpc/types.h>
^~~~~~~~~~~~~

/usr/include/rpc/types.h came from the glibc-devel package in EL7
Now it's included in libtirpc-devel and the path has changed to /usr/include/tirpc/rpc/types.h.

I am not really familiar with changing configure or the Makefile to get it to work. I tried setting CFLAGS to "-I/usr/include/tirpc" in the Makefile, but this did not help.

Can anyone help me?
Thanks in Advance
Jens
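In case it helps: the usual fix for the glibc-to-libtirpc move is to pass the include path and library at configure time rather than editing the generated Makefile (values set there are easily overridden on rebuild). A sketch, assuming a standard autotools build (requires the auks sources, so shown untested):

```shell
dnf install -y libtirpc-devel   # provides /usr/include/tirpc/rpc/types.h
autoreconf -i                   # installs ar-lib, ylwrap, etc.
./configure CPPFLAGS="-I/usr/include/tirpc" LIBS="-ltirpc"
make
```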

spank-auks: unable to get user xxxxxxx cred : auks api : reply seems corrupted

Hi all,

I'm trying to configure auks on a CentOS 7.5 Slurm cluster, but I'm facing these errors:

When running auks -a as root, it shows 'No translation available for requested principal':

[root@pood006 alex]# klist
Ticket cache: KEYRING:persistent:0:0
Default principal: host/[email protected]

Valid starting Expires Service principal
08/15/2018 18:03:59 08/16/2018 18:03:06 host/[email protected]
08/15/2018 18:03:06 08/16/2018 18:03:06 krbtgt/[email protected]

[root@pood006 alex]# /usr/local/bin/auks -avvvvvvvvv
Wed Aug 15 18:24:39 2018 [INFO2] [euid=0,pid=25130] auks_engine: initializing engine from 'common' block of file /usr/local/etc/auks.conf
Wed Aug 15 18:24:39 2018 [INFO2] [euid=0,pid=25130] auks_engine: initializing engine from 'api' block of file /usr/local/etc/auks.conf
Wed Aug 15 18:24:39 2018 [INFO2] [euid=0,pid=25130] auks_engine: initializing engine from 'renewer' block of file /usr/local/etc/auks.conf
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine primary daemon is 'pood006.int.34tech.io'
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine primary daemon address is 'pood006.int.34tech.io'
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine primary daemon port is 12345
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine primary daemon principal is host/[email protected]
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine secondary daemon is 'localhost'
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine secondary daemon address is 'localhost'
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine secondary daemon port is 12345
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine secondary daemon principal is
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine logfile is /tmp/auksapi.log
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine loglevel is 9
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine debugfile is /tmp/auksapi.log
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine debuglevel is 9
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine retry number is 3
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine timeout is 10
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine delay is 3
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine NAT traversal mode is enabled
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine renewer_logfile is /var/log/auksdrenewer.log
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine renewer_loglevel is 9
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine renewer_debugfile is /var/log/auksdrenewer.log
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine renewer_debuglevel is 9
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine renewer delay is 60
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine renewer min cred lifetime is 600
Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: kerberos context successfully initialized
Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: credential cache successfully resolved
Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: credential cache sequential read successfully started
Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: credential cache sequential read successfully stopped
Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: TGT found in credential cache
Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: kerberos authentication context successfully initialized
Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: credential successfully dumped into buffer
Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: credential successfully stored in output buffer
Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_cred: kerberos context successfully initialized
Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_cred: input buffer credential successfully unserialized
Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_cred: principal successfully unparse
Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_cred: unable to get username from principal host/[email protected] : No translation available for requested principal
Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_api: auks cred extraction failed : auks cred : unable to convert principal to local name
Auks API request failed : auks cred : unable to convert principal to local name

But when running as a normal user, it seems to work.

[alex@pood006 ~]$ klist
Ticket cache: KEYRING:persistent:67200010:krb_ccache_rDbOwm0
Default principal: [email protected]

Valid starting Expires Service principal
08/15/2018 18:26:25 08/16/2018 18:20:50 host/[email protected]
08/15/2018 18:20:50 08/16/2018 18:20:50 krbtgt/[email protected]

[alex@pood006 ~]$ auks -avvvvvvvvv
Wed Aug 15 18:26:25 2018 [INFO2] [euid=67200010,pid=25262] auks_engine: initializing engine from 'common' block of file /usr/local/etc/auks.conf
Wed Aug 15 18:26:25 2018 [INFO2] [euid=67200010,pid=25262] auks_engine: initializing engine from 'api' block of file /usr/local/etc/auks.conf
Wed Aug 15 18:26:25 2018 [INFO2] [euid=67200010,pid=25262] auks_engine: initializing engine from 'renewer' block of file /usr/local/etc/auks.conf
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine primary daemon is 'pood006.int.34tech.io'
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine primary daemon address is 'pood006.int.34tech.io'
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine primary daemon port is 12345
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine primary daemon principal is host/[email protected]
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine secondary daemon is 'localhost'
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine secondary daemon address is 'localhost'
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine secondary daemon port is 12345
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine secondary daemon principal is
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine logfile is /tmp/auksapi.log
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine loglevel is 9
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine debugfile is /tmp/auksapi.log
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine debuglevel is 9
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine retry number is 3
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine timeout is 10
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine delay is 3
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine NAT traversal mode is enabled
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine renewer_logfile is /var/log/auksdrenewer.log
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine renewer_loglevel is 9
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine renewer_debugfile is /var/log/auksdrenewer.log
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine renewer_debuglevel is 9
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine renewer delay is 60
Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine renewer min cred lifetime is 600
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: kerberos context successfully initialized
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: credential cache successfully resolved
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: credential cache sequential read successfully started
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: credential cache sequential read successfully stopped
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: TGT found in credential cache
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: kerberos authentication context successfully initialized
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: credential successfully dumped into buffer
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: credential successfully stored in output buffer
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_cred: kerberos context successfully initialized
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_cred: input buffer credential successfully unserialized
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_cred: principal successfully unparse
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_api: starting retry 1 of 3
Wed Aug 15 18:26:25 2018 [INFO7] [euid=67200010,pid=25262] xstream: socket creation succeed
Wed Aug 15 18:26:25 2018 [INFO7] [euid=67200010,pid=25262] xstream: socket non-blocking flag is now set
Wed Aug 15 18:26:25 2018 [INFO7] [euid=67200010,pid=25262] xstream: connect (192.168.3.46:12345) succeed while polling
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_api: successfully connected to auks server pood006.int.34tech.io:12345
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: local endpoint stream 7 informations request succeed
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: remote endpoint stream 7 informations request succeed
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: remote host is 192.168.3.46
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: context initialization succeed
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: connection authentication context initialisation succeed
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: authentication context addrs set up succeed
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: default kstream initialisation succeed
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: kstream basic initialisation succeed
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: ccache initialisation succeed
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: client kstream initialisation succeed
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: authentication succeed
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: NAT traversal required, setting dummy addresses
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: message encryption succeed
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: message transmission succeed : 616 bytes sended
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: message reception succeed
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: message decryption succeed
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: message reception succeed : 4 bytes stored
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: message encryption succeed
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: message transmission succeed : 4 bytes sended
Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_api: auks cred added using default file
Auks API request succeed

When launching jobs using auks plugin, I get the following error on slurmd.log (one compute node):

[2018-08-15T18:20:28.159] [551] debug: spank: /etc/slurm/plugstack.conf:1: Loaded plugin auks.so
[2018-08-15T18:20:28.159] [551] debug: SPANK: appending plugin option "auks"
[2018-08-15T18:20:28.165] [551] error: spank-auks: unable to get user 67200010 cred : auks api : reply seems corrupted
[2018-08-15T18:20:28.165] [551] debug2: spank: auks.so: init = -200302
[2018-08-15T18:20:28.166] [551] debug2: spank: auks.so: init_post_opt = 0
[2018-08-15T18:20:28.166] [551] debug2: After call to spank_init()
......
......
[2018-08-15T18:31:47.328] [552.0] error: couldn't chdir to `/home/test/motorBike2-64p3': Permission denied: going to /tmp instead

As expected, it is not able to write to the kerberized NFS home directory. :'(

Also, trying to issue the following, with no luck:

[alex@pood006 ~]$ auks -vvvvvvvv -g -u 67200010

Wed Aug 15 18:43:53 2018 [INFO2] [euid=67200010,pid=27110] auks_engine: initializing engine from 'common' block of file /usr/local/etc/auks.conf
Wed Aug 15 18:43:53 2018 [INFO2] [euid=67200010,pid=27110] auks_engine: initializing engine from 'api' block of file /usr/local/etc/auks.conf
Wed Aug 15 18:43:53 2018 [INFO2] [euid=67200010,pid=27110] auks_engine: initializing engine from 'renewer' block of file /usr/local/etc/auks.conf
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine primary daemon is 'pood006.int.34tech.io'
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine primary daemon address is 'pood006.int.34tech.io'
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine primary daemon port is 12345
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine primary daemon principal is host/[email protected]
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine secondary daemon is 'localhost'
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine secondary daemon address is 'localhost'
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine secondary daemon port is 12345
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine secondary daemon principal is
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine logfile is /tmp/auksapi.log
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine loglevel is 9
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine debugfile is /tmp/auksapi.log
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine debuglevel is 9
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine retry number is 3
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine timeout is 10
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine delay is 3
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine NAT traversal mode is enabled
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine renewer_logfile is /var/log/auksdrenewer.log
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine renewer_loglevel is 9
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine renewer_debugfile is /var/log/auksdrenewer.log
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine renewer_debuglevel is 9
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine renewer delay is 60
Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine renewer min cred lifetime is 600
Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_api: starting retry 1 of 3
Wed Aug 15 18:43:53 2018 [INFO7] [euid=67200010,pid=27110] xstream: socket creation succeed
Wed Aug 15 18:43:53 2018 [INFO7] [euid=67200010,pid=27110] xstream: socket non-blocking flag is now set
Wed Aug 15 18:43:53 2018 [INFO7] [euid=67200010,pid=27110] xstream: connect (192.168.3.46:12345) succeed while polling
Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_api: successfully connected to auks server pood006.int.34tech.io:12345
Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: local endpoint stream 4 informations request succeed
Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: remote endpoint stream 4 informations request succeed
Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: remote host is 192.168.3.46
Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: context initialization succeed
Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: connection authentication context initialisation succeed
Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: authentication context addrs set up succeed
Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: default kstream initialisation succeed
Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: kstream basic initialisation succeed
Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: ccache initialisation succeed
Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: client kstream initialisation succeed
Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: authentication succeed
Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: NAT traversal required, setting dummy addresses
Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: message encryption succeed
Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: message transmission succeed : 8 bytes sended
Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: message reception succeed
Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: message decryption succeed
Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: message reception succeed : 4 bytes stored
Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: message encryption succeed
Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: message transmission succeed : 4 bytes sended
Wed Aug 15 18:43:54 2018 [INFO3] [euid=67200010,pid=27110] auks_api: get request failed : bad reply type (21)
Wed Aug 15 18:43:54 2018 [INFO3] [euid=67200010,pid=27110] auks_api: unable to unpack auks cred from reply : auks api : reply type is invalid
Auks API request failed : auks api : reply seems corrupted

[alex@pood006 ~]$ id
uid=67200010(alex) gid=67200010(alex) groups=67200010(alex),67200000(admins),67200004(sudoers),67200008(vpnusers)

[alex@pood006 ~]$ ls -la /tmp/krb5cc*
-rw------- 1 root root 1015 Aug 15 17:58 /tmp/krb5cc_0
-rw------- 1 root root 1015 Aug 15 17:47 /tmp/krb5cc_0_auks
-rw------- 1 root root 1152 Aug 15 14:50 /tmp/krb5ccmachine_INT.34TECH.IO

[root@pood006 alex]# ls -al /var/lib/sss/db/
total 5668
drwx------. 2 sssd sssd 168 Aug 15 18:31 .
drwxr-xr-x. 10 root root 120 May 24 13:36 ..
-rw------- 1 root root 1613824 Aug 15 18:30 cache_int.34tech.io.ldb
-rw------- 1 root root 1150 Aug 15 18:31 ccache_INT.34TECH.IO
-rw-------. 1 root root 1286144 Jun 13 17:47 config.ldb
-rw------- 1 root root 598 Aug 15 14:50 fast_ccache_INT.34TECH.IO
-rw-------. 1 root root 1286144 May 24 13:37 sssd.ldb
-rw------- 1 root root 1609728 Aug 15 18:43 timestamps_int.34tech.io.ldb

Could somebody give me a hand?

Note that compute nodes are centos + sssd + autofs + krb5 + nfs.

Thank you!
Alex
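
The root failure in the first trace is expected: root's credential cache holds the service principal host/[email protected], and krb5 has no auth_to_local rule mapping that principal to a local account, hence 'No translation available for requested principal'. Either obtain a ticket for a real user principal before running auks -a as root, or add an explicit mapping in /etc/krb5.conf. A sketch of the latter, with the rule and principal as assumptions to adapt for your realm (see krb5.conf(5) for the RULE syntax):

```ini
[realms]
 INT.34TECH.IO = {
  # Hypothetical rule: map only this host's principal to the local root user.
  # [2:$1;$2] builds the string "host;pood006.int.34tech.io" for matching.
  auth_to_local = RULE:[2:$1;$2](^host;pood006\.int\.34tech\.io$)s/.*/root/
  auth_to_local = DEFAULT
 }
```

Note this only addresses the principal-translation error; the later 'bad reply type (21)' from auks -g -u is a separate problem.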

Fixing a few typos in the HOWTO

It's mostly stuff like

host/mngt.realm.a/REALM.A -> host/[email protected]

patch file attached
HOWTO.txt

I didn't put this in the patch but line 406 says to run "auks -a" as root. In general this won't work. See issue #27 . Maybe just delete that piece of the component test?

auksdrenewer needed on compute nodes?

Hello,
We have slurm with auks working, but one thing is not 100% clear to me from reading the documentation: does auksdrenewer just renew the tickets stored in auksd, so that one or two auksdrenewer processes on, for example, the login nodes are good enough? Or are they for some reason also needed on each compute node?
Kind regards and thank you in advance for any information.
Dries

Auks API init failed : unable to parse configuration file

After applying the latest auks commit (31aadac) from github, auks and auksd ignore /etc/auks/auks.conf and instead return 'unable to parse configuration file'.

It seems both auks and auksd are looking for /etc/auks.conf instead (creating a symlink from /etc/auks.conf to /etc/auks/auks.conf makes both work).
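
The workaround mentioned above can be sketched as follows; the demo uses a scratch directory because the real paths (/etc/auks/auks.conf and /etc/auks.conf) require root to modify, and the file content is just a placeholder:

```shell
# Recreate the layout: the real config lives in a subdirectory,
# but the binaries look for the top-level path.
mkdir -p /tmp/auks-demo/auks
echo "# auks configuration" > /tmp/auks-demo/auks/auks.conf
# Link the path the binaries expect to the actual file; the real-world
# equivalent is: ln -s /etc/auks/auks.conf /etc/auks.conf
ln -sf /tmp/auks-demo/auks/auks.conf /tmp/auks-demo/auks.conf
cat /tmp/auks-demo/auks.conf
```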

auks_api: auks cred extraction failed : krb5 cred : unable to read credential cache

I'm running 0.5 with the crude patch for ticket names to address #43 .

I see

auks_api: auks cred extraction failed : krb5 cred : unable to read credential cache

in my slurm output and /tmp/auksapi.log (if the file is writeable to the user running the command) for every srun/sbatch invocation.

I traced slurmd …

strace --no-abbrev --decode-fds=all --stack-traces --follow-forks --output-separately -e trace=getuid,geteuid,open,openat,read,write,fork,clone,execve,send,recv,stat -o /tmp/trace-205 slurmd -vvvvvvD

Using grep auksapi.log /tmp/trace-205.* to find a relevant trace.

execve("/usr/bin/auks", ["/usr/bin/auks", "-R", "loop"], [… "USER=root" … "SLURM_CONF=/etc/slurm/slurm.conf", "SLURM_MPI_TYPE=none", "KRB5CCNAME=/tmp/krb5cc_11652_211"...]) = 0

… 

 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(execve+0x7) [0xcb6c7]
 > /usr/lib/x86_64-linux-gnu/slurm/auks.so(slurm_spank_user_init+0xcf) [0x2b2f]
 > /usr/lib/x86_64-linux-gnu/slurm-wlm/libslurmfull.so(optz_append+0x405) [0x1726e5]
 > /usr/sbin/slurmstepd-wlm(close_slurmd_conn+0x242a) [0x105aa]
 > /usr/sbin/slurmstepd-wlm(job_manager+0x35e) [0x12d3e]
 > /usr/sbin/slurmstepd-wlm(main+0x13a7) [0xcc87]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(__libc_start_main+0xea) [0x26d0a]
 > /usr/sbin/slurmstepd-wlm(_start+0x2a) [0xd95a]

…

geteuid()                               = 11652
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(geteuid+0x7) [0xcc107]
 > /usr/bin/auks() [0x11e5]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(__libc_start_main+0xea) [0x26d0a]
 > /usr/bin/auks() [0x15ea]

…

getuid()                                = 0
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(getuid+0x7) [0xcc0f7]
 > /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3(krb5_set_password_using_ccache+0xcd9) [0x70579]
 > /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3(k5_expand_path_tokens_extra+0x2c5) [0x709c5]
 > /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3(krb5_cc_default_name+0xb0) [0x6f230]
 > /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3(krb5_cc_default+0x1f) [0x2ce9f]
 > /usr/lib/x86_64-linux-gnu/libauksapi.so.0.0.1(auks_krb5_cred_get+0x32f) [0x9e7f]
 > /usr/lib/x86_64-linux-gnu/libauksapi.so.0.0.1(auks_cred_extract+0x33) [0xbad3]
 > /usr/lib/x86_64-linux-gnu/libauksapi.so.0.0.1(auks_api_renew_cred+0x49) [0x11779]
 > /usr/bin/auks() [0x14a9]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(__libc_start_main+0xea) [0x26d0a]
 > /usr/bin/auks() [0x15ea]

openat(AT_FDCWD, "/tmp/krb5cc_0", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(__open64+0x57) [0xeebe7]
 > /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3(k5_ccselect_free_context+0x2601) [0x30db1]
 > /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3(k5_ccselect_free_context+0x32b9) [0x31a69]
 > /usr/lib/x86_64-linux-gnu/libauksapi.so.0.0.1(auks_krb5_cred_get+0x9f) [0x9bef]
 > /usr/lib/x86_64-linux-gnu/libauksapi.so.0.0.1(auks_cred_extract+0x33) [0xbad3]
 > /usr/lib/x86_64-linux-gnu/libauksapi.so.0.0.1(auks_api_renew_cred+0x49) [0x11779]
 > /usr/bin/auks() [0x14a9]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(__libc_start_main+0xea) [0x26d0a]
 > /usr/bin/auks() [0x15ea]

…

read(4</etc/krb5.conf>, "", 4096)       = 0
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(__read+0xe) [0xeee8e]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(_IO_file_underflow+0x17a) [0x8189a]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(_IO_default_uflow+0x32) [0x82b02]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(_IO_getline_info+0xac) [0x75a5c]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(fgets+0x96) [0x74a56]
 > /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3(krb5_write_message+0x23f0) [0x7d780]
 > /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3(krb5_write_message+0x2f03) [0x7e293]
 > /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3(krb5_write_message+0x13ff) [0x7c78f]
 > /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3(krb5_write_message+0x148c) [0x7c81c]
 > /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3(krb5_write_message+0x1ba2) [0x7cf32]
 > /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3(profile_init_flags+0xf4) [0x803c4]
 > /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3(k5_os_init_context+0x158) [0x727b8]
 > /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3(krb5_init_context_profile+0xa0) [0x4eb30]
 > /usr/lib/x86_64-linux-gnu/libauksapi.so.0.0.1(auks_krb5_cred_get+0x37) [0x9b87]
 > /usr/lib/x86_64-linux-gnu/libauksapi.so.0.0.1(auks_cred_extract+0x33) [0xbad3]
 > /usr/lib/x86_64-linux-gnu/libauksapi.so.0.0.1(auks_api_renew_cred+0x49) [0x11779]
 > /usr/bin/auks() [0x14a9]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(__libc_start_main+0xea) [0x26d0a]
 > /usr/bin/auks() [0x15ea]

write(3</tmp/auksapi.log>, "Mon Nov 15 20:54:33 2021 [INFO3]"..., 139) = 139
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(__write+0x13) [0xeef33]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(_IO_file_write+0x25) [0x80665]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(_IO_file_setbuf+0xf6) [0x7f9d6]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(_IO_do_write+0x19) [0x81709]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(_IO_file_sync+0xa8) [0x7f818]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(fflush+0x82) [0x74782]
 > /usr/lib/x86_64-linux-gnu/libauksapi.so.0.0.1(xverbose_base+0x10d) [0x13b6d]
 > /usr/lib/x86_64-linux-gnu/libauksapi.so.0.0.1(xverboseN+0x9f) [0x13e9f]
 > /usr/lib/x86_64-linux-gnu/libauksapi.so.0.0.1(auks_api_renew_cred+0x1e5) [0x11915]
 > /usr/bin/auks() [0x14a9]
 > /usr/lib/x86_64-linux-gnu/libc-2.31.so(__libc_start_main+0xea) [0x26d0a]
 > /usr/bin/auks() [0x15ea]

The env KRB5CCNAME is ignored, and the filename is constructed using the uid of the process.
Adding default_ccache_name = FILE:/tmp/krb5cc_%{euid} to /etc/krb5.conf does not help, as the constructed file name does not match the one in the KRB5CCNAME environment variable.

https://github.com/krb5/krb5/blob/34625d594c339a077899fa01fc4b5c331a1647d0/src/lib/krb5/os/ccdefname.c#L289-L320

The actual problem is that krb5 ignores KRB5CCNAME in krb5_cc_default_name, due to the use of secure_getenv starting with Kerberos 1.18.

       The GNU-specific secure_getenv() function is just like getenv() except that it returns NULL in cases where "secure execution" is required. Secure execution is required if one of the following conditions was true when the program run by the calling process was loaded:

       *  the process's effective user ID did not match its real user ID or the process's effective group ID did not match its real group ID (typically this is the result of executing a set-user-ID or set-group-ID program);

This is the problem here: euid != uid, so KRB5CCNAME is ignored.

Setting the KRB5CCNAME as default manually helps:

--- auks-0.5.0.orig/src/api/auks/auks_krb5_cred.c
+++ auks-0.5.0/src/api/auks/auks_krb5_cred.c
@@ -319,6 +319,8 @@ auks_krb5_cred_get(char *ccachefilename,
        }
        auks_log("kerberos context successfully initialized");
 
+       krb5_cc_set_default_name(context, getenv("KRB5CCNAME"));
+
        /* initialize kerberos credential cache structure */
        if (ccachefilename == NULL)
                err_code = krb5_cc_default(context, &ccache);
