oar-team / oar

OAR is a versatile resource and task manager (also called a batch scheduler) for clusters and other computing infrastructures.

Home Page: http://oar.imag.fr/

License: GNU General Public License v2.0

Makefile 2.35% Shell 10.77% Ruby 10.50% TeX 2.85% Perl 64.94% Lua 0.26% C 0.50% C++ 5.95% QMake 0.02% PHP 1.02% CSS 0.05% HTML 0.33% Gherkin 0.01% Python 0.38% Raku 0.06%

oar's Introduction

OAR is a versatile resource and task manager (also called a batch scheduler) for HPC clusters and other computing infrastructures (such as distributed computing experimental testbeds, where versatility is key).

OAR is suitable for production use.

OAR also supports scientific research in the field of distributed computing.

See the OAR web site (http://oar.imag.fr/) for further information.

oar's People

Contributors

acourbet, alxmerlin, augu5te, bluke, bzizou, capitn, christoph-conrads, emmanuelthome, jonglezb, mickours, npf, phlb, salemharrache, shurakai, snoir, tbarbette, vdanjean


oar's Issues

Do not suspect a node when an error is due to the connection to the deploy/cosystem frontend

If a job is of type cosystem or deploy and the connection to the cosystem or deploy frontend fails, the following message is shown and the first node of the job is suspected.

 server |       oar.log : [debug] [2015-07-05 00:19:11.897] [bipbip 27] execute oarexec on node 127.0.0.1
  server |       oar.log : ssh: connect to host 127.0.0.1 port 6667: Connection refused
  server |       oar.log : [debug] [2015-07-05 00:19:11.903] [bipbip 27] Job 27 is ended
  server |       oar.log : [debug] [2015-07-05 00:19:11.917] [bipbip 27] error of oarexec, exit value = 255; the job 27 is in Error and the node node3 is Suspected; If this job is of type cosystem or deploy, check if the oar server is able to connect to the corresponding nodes, oar-node started

There is actually no reason to suspect that node.
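
A minimal sketch of the kind of guard this would call for in the error handling path, assuming the job types are available at that point (the names $job, $exit_value, $first_node and suspect_node are illustrative, not OAR's actual internals):

    # Hypothetical guard: a frontend connection failure for cosystem/deploy jobs
    # says nothing about the health of the job's first node.
    my %job_types = map { $_ => 1 } @{ $job->{types} };
    if ($exit_value == 255 and ($job_types{cosystem} or $job_types{deploy})) {
        warn "[bipbip] frontend connection failed; not suspecting node $first_node\n";
    } else {
        suspect_node($first_node);    # current behaviour: mark the node Suspected
    }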

hard coded path for /dev/cpuset

There are a lot of hard-coded paths to /dev/cpuset/, as this command shows:

~/Project/oar (git)-[2.5] % grep -r "dev/cpuset" *
CHANGELOG:  - cpuset are now stored in /dev/cpuset/oar
sources/core/scripts/oar_server_proepilogue.pl:if (! system("diff /dev/cpuset/oar/'.$struct->{cpuset_name}.'/cpus /dev/cpuset/cpus > /dev/null 2>&1")){
sources/core/scripts/oar_server_proepilogue.pl:if (! system("diff /dev/cpuset/oar/'.$struct->{cpuset_name}.'/cpus /dev/cpuset/cpus > /dev/null 2>&1")){
sources/core/tools/job_resource_manager_altix450.pl:            if (system('oardodo mount -t cpuset | grep " /dev/cpuset " > /dev/null 2>&1')){
sources/core/tools/job_resource_manager_altix450.pl:                if (system('oardodo mkdir -p /dev/cpuset && oardodo mount -t cpuset none /dev/cpuset')){
sources/core/tools/job_resource_manager_altix450.pl:            if (!(-d '/dev/cpuset/'.$Cpuset->{cpuset_path})){
sources/core/tools/job_resource_manager_altix450.pl:                if (system( 'oardodo mkdir -p /dev/cpuset/'.$Cpuset->{cpuset_path}.' &&'. 
sources/core/tools/job_resource_manager_altix450.pl:                            'oardodo chown -R oar /dev/cpuset/'.$Cpuset->{cpuset_path}.' &&'.
sources/core/tools/job_resource_manager_altix450.pl:                            '/bin/echo 0 | cat > /dev/cpuset/'.$Cpuset->{cpuset_path}.'/notify_on_release && '.
sources/core/tools/job_resource_manager_altix450.pl:                            '/bin/echo 0 | cat > /dev/cpuset/'.$Cpuset->{cpuset_path}.'/cpu_exclusive && '.
sources/core/tools/job_resource_manager_altix450.pl:                            'cat /dev/cpuset/mems > /dev/cpuset/'.$Cpuset->{cpuset_path}.'/mems &&'.
sources/core/tools/job_resource_manager_altix450.pl:                            'cat /dev/cpuset/cpus > /dev/cpuset/'.$Cpuset->{cpuset_path}.'/cpus'
sources/core/tools/job_resource_manager_altix450.pl:#'for c in '."@Cpuset_cpus".';do cat /sys/devices/system/cpu/cpu$c/topology/physical_package_id > /dev/cpuset/'.$Cpuset_path_job.'/mems; done && '.
sources/core/tools/job_resource_manager_altix450.pl:        if (system( 'oardodo mkdir -p /dev/cpuset/'.$Cpuset_path_job.' && '.
sources/core/tools/job_resource_manager_altix450.pl:                    'oardodo chown -R oar /dev/cpuset/'.$Cpuset_path_job.' && '.
sources/core/tools/job_resource_manager_altix450.pl:                    'oardodo "echo 0 > /dev/cpuset/mem_exclusive" && '.
sources/core/tools/job_resource_manager_altix450.pl:                    '/bin/echo 0 | cat > /dev/cpuset/'.$Cpuset_path_job.'/notify_on_release && '.
sources/core/tools/job_resource_manager_altix450.pl:                    '/bin/echo 0 | cat > /dev/cpuset/'.$Cpuset_path_job.'/cpu_exclusive && '.
sources/core/tools/job_resource_manager_altix450.pl:#                    'cat /dev/cpuset/mems > /dev/cpuset/'.$Cpuset_path_job.'/mems && '.
sources/core/tools/job_resource_manager_altix450.pl:                    'echo ${pmems#,} > /dev/cpuset/'.$Cpuset_path_job.'/mems && '.
sources/core/tools/job_resource_manager_altix450.pl:                    '/bin/echo '.join(",",@Cpuset_cpus).' | cat > /dev/cpuset/'.$Cpuset_path_job.'/cpus'
sources/core/tools/job_resource_manager_altix450.pl:                    if (-d "/dev/cpuset/$1"){
sources/core/tools/job_resource_manager_altix450.pl:        system('PROCESSES=$(cat /dev/cpuset/'.$Cpuset_path_job.'/tasks)
sources/core/tools/job_resource_manager_altix450.pl:                    PROCESSES=$(cat /dev/cpuset/'.$Cpuset_path_job.'/tasks)
sources/core/tools/job_resource_manager_altix450.pl:        if (system('oardodo rmdir /dev/cpuset'.$Cpuset_path_job)){
sources/core/tools/oarnodecheck/oarnodecheckrun.in:CPUSET_DIR=/dev/cpuset/oar
sources/core/tools/trace_collect/clustermon.pl:   #print "/dev/cpuset/oar/$dir/tasks \n";
sources/core/tools/trace_collect/clustermon.conf:CPUSETDIR: /dev/cpuset/oar/
sources/core/tools/job_resource_manager.pl:            if (system('oardodo grep " /dev/cpuset " /proc/mounts > /dev/null 2>&1')){
sources/core/tools/job_resource_manager.pl:                if (system('oardodo mkdir -p /dev/cpuset && oardodo mount -t cpuset none /dev/cpuset')){
sources/core/tools/job_resource_manager.pl:            if (!(-d '/dev/cpuset/'.$Cpuset->{cpuset_path})){
sources/core/tools/job_resource_manager.pl:                if (system( 'oardodo mkdir -p /dev/cpuset/'.$Cpuset->{cpuset_path}.' &&'. 
sources/core/tools/job_resource_manager.pl:                            'oardodo chown -R oar /dev/cpuset/'.$Cpuset->{cpuset_path}.' &&'.
sources/core/tools/job_resource_manager.pl:                            '/bin/echo 0 | cat > /dev/cpuset/'.$Cpuset->{cpuset_path}.'/notify_on_release && '.
sources/core/tools/job_resource_manager.pl:                            '/bin/echo 0 | cat > /dev/cpuset/'.$Cpuset->{cpuset_path}.'/cpu_exclusive && '.
sources/core/tools/job_resource_manager.pl:                            'cat /dev/cpuset/mems > /dev/cpuset/'.$Cpuset->{cpuset_path}.'/mems &&'.
sources/core/tools/job_resource_manager.pl:                            'cat /dev/cpuset/cpus > /dev/cpuset/'.$Cpuset->{cpuset_path}.'/cpus'
sources/core/tools/job_resource_manager.pl:        if (system( 'oardodo mkdir -p /dev/cpuset/'.$Cpuset_path_job.' && '.
sources/core/tools/job_resource_manager.pl:                    'oardodo chown -R oar /dev/cpuset/'.$Cpuset_path_job.' && '.
sources/core/tools/job_resource_manager.pl:                    '/bin/echo 0 | cat > /dev/cpuset/'.$Cpuset_path_job.'/notify_on_release && '.
sources/core/tools/job_resource_manager.pl:                    '/bin/echo 0 | cat > /dev/cpuset/'.$Cpuset_path_job.'/cpu_exclusive && '.
sources/core/tools/job_resource_manager.pl:                    'cat /dev/cpuset/mems > /dev/cpuset/'.$Cpuset_path_job.'/mems && '.
sources/core/tools/job_resource_manager.pl:                    '/bin/echo '.join(",",@Cpuset_cpus).' | cat > /dev/cpuset/'.$Cpuset_path_job.'/cpus'
sources/core/tools/job_resource_manager.pl:#'MEM= ;for c in '."@Cpuset_cpus".';do MEM=$(cat /sys/devices/system/cpu/cpu$c/topology/physical_package_id),$MEM; done; echo $MEM > /dev/cpuset/'.$Cpuset_path_job.'/mems && '.
sources/core/tools/job_resource_manager.pl:#'MEM= ;for c in '."@Cpuset_cpus".';do for n in /sys/devices/system/node/node* ;do if [ -r "$n/cpu$c" ]; then MEM=$(basename $n | sed s/node//g),$MEM; fi; done; done;echo $MEM > /dev/cpuset/'.$Cpuset_path_job.'/mems && '.
sources/core/tools/job_resource_manager.pl:                    if (-d "/dev/cpuset/$1"){
sources/core/tools/job_resource_manager.pl:        system('PROCESSES=$(cat /dev/cpuset/'.$Cpuset_path_job.'/tasks)
sources/core/tools/job_resource_manager.pl:                    PROCESSES=$(cat /dev/cpuset/'.$Cpuset_path_job.'/tasks)
sources/core/tools/job_resource_manager.pl:            if (system('oardodo rmdir /dev/cpuset'.$Cpuset_path_job)){
sources/core/tools/job_resource_manager.pl:                if (opendir(DIR, "/dev/cpuset/".$Cpuset->{cpuset_path}.'/')) {
sources/core/tools/job_resource_manager.pl:                 exit_myself(18,"Can't opendir: /dev/cpuset/$Cpuset->{cpuset_path}");
sources/core/tools/job_resource_manager_cgroups.pl:                if (system('oardodo mkdir -p '.$Cgroup_mount_point.' && oardodo mount -t cgroup -o cpuset,cpu,cpuacct,devices,freezer,net_cls,blkio none '.$Cgroup_mount_point.'; oardodo rm -f /dev/cpuset; oardodo ln -s '.$Cgroup_mount_point.' /dev/cpuset')){
sources/core/tools/job_resource_manager_cgroups.pl:#'MEM= ;for c in '."@Cpuset_cpus".';do MEM=$(cat /sys/devices/system/cpu/cpu$c/topology/physical_package_id),$MEM; done; echo $MEM > /dev/cpuset/'.$Cpuset_path_job.'/mems && '.
sources/core/tools/job_resource_manager_cgroups.pl:#'MEM= ;for c in '."@Cpuset_cpus".';do for n in /sys/devices/system/node/node* ;do if [ -r "$n/cpu$c" ]; then MEM=$(basename $n | sed s/node//g),$MEM; fi; done; done;echo $MEM > /dev/cpuset/'.$Cpuset_path_job.'/mems && '.
sources/core/tools/job_resource_manager_cgroups.pl:                    if (-d "/dev/cpuset/$1"){
sources/core/tools/job_resource_manager_g5k.pl:            if (system('oardodo mount -t cpuset | grep " /dev/cpuset " > /dev/null 2>&1')){
sources/core/tools/job_resource_manager_g5k.pl:                if (system('oardodo mkdir -p /dev/cpuset && oardodo mount -t cpuset none /dev/cpuset')){
sources/core/tools/job_resource_manager_g5k.pl:            if (!(-d '/dev/cpuset/'.$Cpuset->{cpuset_path})){
sources/core/tools/job_resource_manager_g5k.pl:                if (system( 'oardodo mkdir -p /dev/cpuset/'.$Cpuset->{cpuset_path}.' &&'. 
sources/core/tools/job_resource_manager_g5k.pl:                            'oardodo chown -R oar /dev/cpuset/'.$Cpuset->{cpuset_path}.' &&'.
sources/core/tools/job_resource_manager_g5k.pl:                            '/bin/echo 0 | cat > /dev/cpuset/'.$Cpuset->{cpuset_path}.'/notify_on_release && '.
sources/core/tools/job_resource_manager_g5k.pl:                            '/bin/echo 0 | cat > /dev/cpuset/'.$Cpuset->{cpuset_path}.'/cpu_exclusive && '.
sources/core/tools/job_resource_manager_g5k.pl:                            'cat /dev/cpuset/mems > /dev/cpuset/'.$Cpuset->{cpuset_path}.'/mems &&'.
sources/core/tools/job_resource_manager_g5k.pl:                            'cat /dev/cpuset/cpus > /dev/cpuset/'.$Cpuset->{cpuset_path}.'/cpus'
sources/core/tools/job_resource_manager_g5k.pl:#'for c in '."@Cpuset_cpus".';do cat /sys/devices/system/cpu/cpu$c/topology/physical_package_id > /dev/cpuset/'.$Cpuset_path_job.'/mems; done && '.
sources/core/tools/job_resource_manager_g5k.pl:            if (system( 'oardodo mkdir -p /dev/cpuset/'.$Cpuset_path_job.' && '.
sources/core/tools/job_resource_manager_g5k.pl:                        'oardodo chown -R oar /dev/cpuset/'.$Cpuset_path_job.' && '.
sources/core/tools/job_resource_manager_g5k.pl:                        '/bin/echo 0 | cat > /dev/cpuset/'.$Cpuset_path_job.'/notify_on_release && '.
sources/core/tools/job_resource_manager_g5k.pl:                        '/bin/echo 0 | cat > /dev/cpuset/'.$Cpuset_path_job.'/cpu_exclusive && '.
sources/core/tools/job_resource_manager_g5k.pl:                        'cat /dev/cpuset/mems > /dev/cpuset/'.$Cpuset_path_job.'/mems && '.
sources/core/tools/job_resource_manager_g5k.pl:                        '/bin/echo '.join(",",@Cpuset_cpus).' | cat > /dev/cpuset/'.$Cpuset_path_job.'/cpus'
sources/core/tools/job_resource_manager_g5k.pl:        if (defined($Cpuset->{types}->{$Allow_SSH_type}) and ! system('diff /dev/cpuset/'.$Cpuset_path_job.'/cpus /dev/cpuset/'.$Cpuset->{cpuset_path}.'/cpus > /dev/null 2>&1')){
sources/core/tools/job_resource_manager_g5k.pl:                    if (-d "/dev/cpuset/$1"){
sources/core/tools/job_resource_manager_g5k.pl:        if (defined($Cpuset->{types}->{$Allow_SSH_type}) and ! system('diff /dev/cpuset/'.$Cpuset_path_job.'/cpus /dev/cpuset/'.$Cpuset->{cpuset_path}.'/cpus > /dev/null 2>&1')){
sources/core/tools/job_resource_manager_g5k.pl:        system('PROCESSES=$(cat /dev/cpuset/'.$Cpuset_path_job.'/tasks)
sources/core/tools/job_resource_manager_g5k.pl:                    PROCESSES=$(cat /dev/cpuset/'.$Cpuset_path_job.'/tasks)
sources/core/tools/job_resource_manager_g5k.pl:            if (system('oardodo rmdir /dev/cpuset'.$Cpuset_path_job)){
sources/core/tools/job_resource_manager_g5k.pl:             if (opendir(DIR, "/dev/cpuset/".$Cpuset->{cpuset_path}.'/')) {
sources/core/tools/job_resource_manager_g5k.pl:                 exit_myself(18,"Can't opendir: /dev/cpuset/$Cpuset->{cpuset_path}");
sources/core/tools/oarsh/oarsh_shell.in:CPUSET_MOUNT_POINT="/dev/cpuset"
sources/core/tools/oarmonitor_sensor.pl:my $Cpuset_name = "/dev/cpuset/";
sources/core/tools/suspend_resume_manager.pl:                for p in $(cat /dev/cpuset/oar/'.$Cpuset_name.'/tasks)
sources/core/tools/suspend_resume_manager.pl:        system('PROCESSES=$(cat /dev/cpuset/oar/'.$Cpuset_name.'/tasks)

A transition to cgroups requires removing them all!
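
A minimal sketch of what factoring the path out could look like, so that a later move to cgroups only touches one definition (the OAR_CPUSET_MOUNT_POINT environment variable is an assumption, not an existing OAR setting):

    # One configurable mount point instead of dozens of literal "/dev/cpuset" strings.
    my $Cpuset_mount_point = $ENV{OAR_CPUSET_MOUNT_POINT} || "/dev/cpuset";

    # Same logic as in job_resource_manager.pl, but written against the variable.
    if (system("oardodo grep \" $Cpuset_mount_point \" /proc/mounts > /dev/null 2>&1")) {
        if (system("oardodo mkdir -p $Cpuset_mount_point"
                 . " && oardodo mount -t cpuset none $Cpuset_mount_point")) {
            exit_myself(4, "Failed to mount the cpuset pseudo filesystem on $Cpuset_mount_point");
        }
    }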

oarsh from node to node if not using job_key ?

I can't remember how it is supposed to work when not using a job key (-k or --use-job-key, or globally OARSUB_FORCE_JOB_KEY in oar.conf).

docker@frontend ~
$ oarsub -I -l nodes=2
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=28
Interactive mode : waiting...
Starting...

Connect to OAR job 28 via the node node2
docker@node2 ~
$ oarsh node3
oarsh: Cannot connect using job id from this host.

Removing oar is not working properly

I tried to remove oar completely using apt-get, but some errors occurred: the oar user was not deleted because many processes owned by oar were still running.

Here is the entire command:

apt-get remove --purge oar-common oar-node oar-server oar-server-pgsql
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libxml-dumper-perl liboar-perl libjson-xs-perl libjson-perl libfcgi-perl
  libsort-versions-perl libcgi-fast-perl libdbd-pg-perl libcommon-sense-perl
  libyaml-libyaml-perl libyaml-perl
Use 'apt-get autoremove' to remove them.
The following packages will be REMOVED:
  oar-common* oar-node* oar-restful-api* oar-server* oar-server-pgsql* oar-user*
  oar-user-pgsql*
0 upgraded, 0 newly installed, 7 to remove and 3 not upgraded.
After this operation, 1,918 kB of disk space will be freed.
Do you want to continue [Y/n]? 
(Reading database ... 65863 files and directories currently installed.)
Removing oar-user-pgsql ...
Removing oar-restful-api ...
Purging configuration files for oar-restful-api ...
Removing oar-user ...
Removing oar-server ...
Purging configuration files for oar-server ...
ucfr: Association belongs to oar-common, not oar-server
ucfr: Aborting
dpkg: error processing oar-server (--purge):
 subprocess installed post-removal script returned error exit status 5
Removing oar-node ...
Purging configuration files for oar-node ...
Removing oar-common ...
Purging configuration files for oar-common ...
Removing oar system user..userdel: user oar is currently logged in
/usr/sbin/deluser: `/usr/sbin/userdel oar' returned error code 8. Exiting.
..done
Removing oar-server-pgsql ...
Processing triggers for man-db ...
Processing triggers for ureadahead ...
Errors were encountered while processing:
 oar-server
E: Sub-process /usr/bin/dpkg returned an error code (1)

I eventually succeeded in removing the oar user using these commands:

killall -KILL -u oar
userdel -f oar

I am using Ubuntu Precise and the latest OAR (2.5.3-2).

IO.pm ymdhms_to_local function uses timelocal_nocheck?

Code is:
sub ymdhms_to_local($$$$$$) {
    my ($year,$mon,$mday,$hour,$min,$sec)=@_;
    return Time::Local::timelocal_nocheck($sec,$min,$hour,$mday,$mon,$year);
}
Would it really be too costly to use the timelocal (with check) function?

Since it is used to convert input from users (advance reservation dates, etc.), checking seems relevant.
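
A sketch of a checked variant, wrapping Time::Local::timelocal in an eval so that an out-of-range user date is reported instead of being silently normalized (the error wording is illustrative):

    use Time::Local ();

    # timelocal() croaks on out-of-range fields (e.g. a 32nd day), whereas
    # timelocal_nocheck() silently normalizes them into a wrong but valid date.
    sub ymdhms_to_local_checked($$$$$$) {
        my ($year, $mon, $mday, $hour, $min, $sec) = @_;
        my $t = eval { Time::Local::timelocal($sec, $min, $hour, $mday, $mon, $year) };
        die "Invalid date $year-$mon-$mday $hour:$min:$sec: $@" if $@;
        return $t;
    }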

OAR restful API missing dependency on Debian: libyaml-perl

The package libyaml-perl is marked as recommended for the oar-restful-api Debian package, but the default behavior is to show content as YAML (for /oarapi/jobs.html, for example):

Status: 400 Content-Type: text/html; charset=ISO-8859-1
YAML not enabled

YAML perl module not loaded!

It should either be required (better, I think), or the default content should be JSON (less readable).
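
A minimal sketch of the fallback option, probing the optional serializers once instead of failing per request (the helper name is illustrative, not the API's actual code):

    # Check which serializer modules are installed at startup.
    my $yaml_enabled = eval { require YAML; 1 } ? 1 : 0;
    my $json_enabled = eval { require JSON; 1 } ? 1 : 0;

    sub default_content_type {
        return "text/yaml"        if $yaml_enabled;   # keep YAML when available
        return "application/json" if $json_enabled;   # otherwise fall back to JSON
        die "Neither the YAML nor the JSON Perl module is installed\n";
    }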

provide nicer job states

A job which is terminated because its walltime was reached is in the Error state, which is quite confusing for users.

Either we could rework/rename the job states, or provide a translation table in OAR's configuration file, giving administrators the ability to rename the states if desired.
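
A sketch of the translation-table idea, using a hypothetical alias map read from the configuration; the database would keep the internal state names and only user-facing output would be translated:

    # Hypothetical mapping loaded from oar.conf (e.g. a JOB_STATE_ALIASES entry).
    my %state_alias = (
        "Error" => "Terminated (walltime reached or error)",
    );

    sub display_state {
        my ($state) = @_;
        return exists $state_alias{$state} ? $state_alias{$state} : $state;
    }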

uniformize command names

Some commands use "_", others "-", and others nothing:
oar-database
oar_resource_init
oarremoveresource

oar-node init script not always working because of an order issue

The oar-node startup script may trigger the following error:

"Starting OAR node: Missing privilege separation directory: /var/run/sshd"

The following patch solves the issue (you need to run 'update-rc.d oar-node enable' to update the init script links):

--- /etc/init.d/oar-node 2015-01-26 21:13:18.000000000 +0100
+++ oar-node 2015-01-29 16:51:28.040187620 +0100
@@ -7,7 +7,7 @@
 ### BEGIN INIT INFO
 # Provides: oar-node
-# Required-Start: $network $local_fs $remote_fs
+# Required-Start: $network $local_fs $remote_fs sshd
 # Required-Stop: $network $local_fs $remote_fs
 # Default-Start: 2 3 4 5
 # Default-Stop: 0 1 6

oar_resources_init complains about the cpuset property that already exists

$ oar_resources_init /tmp/nodes
Did you configured the OAR SSH key on all the nodes? [yes/NO]

Checking node1 ... OK
[...]

If the content of '/tmp/oar_resources_init.cmd' is OK for you then you just need to execute:
source /tmp/oar_resources_init.cmd
$ . /tmp/oar_resources_init.cmd
Added property: cpu
Added property: core
Added property: host
DBD::Pg::db do failed: ERROR: column "cpuset" of relation "resources" already exists at /usr/local/lib/oar/oarproperty line 85.
DB error: ERROR: column "cpuset" of relation "resources" already exists
Added property: mem
new resource
node1 added in the database
Update property host with value node1 ...DONE
Update property cpu with value 1 ...DONE
Update property core with value 1 ...DONE
[...]
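
A sketch of how oarproperty could skip an already existing column instead of aborting, assuming a PostgreSQL information_schema lookup (the column type in the commented ALTER is illustrative):

    # Check whether the property column already exists before adding it, so that
    # re-running oar_resources_init on an initialized database does not abort.
    sub property_exists {
        my ($dbh, $property) = @_;
        my ($count) = $dbh->selectrow_array(
            "SELECT count(*) FROM information_schema.columns
              WHERE table_name = 'resources' AND column_name = ?",
            undef, $property);
        return $count > 0;
    }

    # unless (property_exists($dbh, 'cpuset')) {
    #     $dbh->do("ALTER TABLE resources ADD COLUMN cpuset varchar(255)");
    # }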

Drain resources vs job expected resources

When submitting a new job for 1 host while 1 core is in the draining state, it can happen that one gets the host which the draining core belongs to.

One could expect that this host wouldn't be selected since 1 of its cores is not available.

Should something be changed in order to get that behavior?

Package oar-common is broken

When I try to install OAR from the Debian packages, I get this error:

in_ctx: Setting up oar-common (2.5.2-3) ...
in_ctx: usermod: no changes
in_ctx: Error: The new file /usr/share/doc/oar-common/examples/oar.conf does not exist!
in_ctx: dpkg: error processing oar-common (--configure):
in_ctx: subprocess installed post-installation script returned error exit status 1
in_ctx: dpkg: dependency problems prevent configuration of oar-node:
in_ctx: oar-node depends on oar-common (= 2.5.2-3); however:
in_ctx: Package oar-common is not configured yet.
in_ctx:
in_ctx: dpkg: error processing oar-node (--configure):
in_ctx: dependency problems - leaving unconfigured
in_ctx: Errors were encountered while processing:

in_ctx: E: Sub-process /usr/bin/dpkg returned an error code (1)

Race condition with named containers

Should we handle the case of an inner job submitted before its container (be it named or job-id based)?

Currently, if the inner job is scheduled before the container job (e.g. in the case of FIFO scheduling), the inner job won't get scheduled until the container job is actually running (running jobs are filled into the gantt before waiting jobs). This is not harmful, but annoying, as the start time prediction is useless because it is not relevant.

This can also happen with the fair-sharing scheduler, if the container gets a lower scheduling priority.

I see 2 options:

  • make the scheduler re-schedule every inner job whose container has just been scheduled (see the sketch after this list).
  • do nothing, but advise putting inner jobs in a queue with a lower priority than that of the containers, so that the latter are scheduled first in any case (except on timeout)
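
A rough sketch of the first option; the helper names are hypothetical, not existing IO.pm functions:

    # When a container job has just been placed in the gantt, re-schedule the
    # inner jobs that reference it (by name or by job id), so their predicted
    # start times become relevant again.
    foreach my $container_id (@newly_scheduled_containers) {
        my @inner_jobs = get_inner_jobs_of_container($dbh, $container_id);  # hypothetical lookup
        foreach my $job (@inner_jobs) {
            remove_gantt_occupation($dbh, $job);   # hypothetical: drop any stale placement
            push @jobs_to_schedule, $job;          # feed them back into the scheduling loop
        }
    }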

DBD::Pg::db begin_work failed: Already in a transaction at /usr/share/perl5/OAR/IO.pm line 7671

Race condition with Judas?

[debug] [2015-07-13 16:57:33.873] [MetaSched] Starting Meta Scheduler
[debug] [2015-07-13 16:57:33.878] [MetaSched] Retrieve information for already scheduler reservations from database before flush (keep assign resources)
[debug] [2015-07-13 16:57:33.883] [MetaSched] Initialize the gantt structure
[debug] [2015-07-13 16:57:33.888] [MetaSched] Begin processing of already scheduled reservations (accepted with resources assigned)
[debug] [2015-07-13 16:57:33.888] [MetaSched] End processing of already scheduled reservations
[debug] [2015-07-13 16:57:33.889] [MetaSched] Begin processing of current jobs
[debug] [2015-07-13 16:57:33.892] [MetaSched] [128] Add job in database
[debug] [2015-07-13 16:57:33.894] [MetaSched] [128] job is (0,u:,,)
[debug] [2015-07-13 16:57:33.894] [MetaSched] [128] add job occupation in gantt (0,,,)
[debug] [2015-07-13 16:57:33.900] [MetaSched] [125] Add job in database
[debug] [2015-07-13 16:57:33.907] [MetaSched] [125] job is (0,u:,,)
[debug] [2015-07-13 16:57:33.908] [MetaSched] [125] add job occupation in gantt (0,,,)
[debug] [2015-07-13 16:57:33.910] [MetaSched] End processing of current jobs
[debug] [2015-07-13 16:57:33.910] [MetaSched] Begin processing of waiting reservations (accepted reservations which do not have assigned resources yet)
[debug] [2015-07-13 16:57:33.910] [MetaSched] End processing of waiting reservations
[debug] [2015-07-13 16:57:33.912] [MetaSched] Queue admin: No job
[debug] [2015-07-13 16:57:33.913] [MetaSched] Queue kinovis: No job
[debug] [2015-07-13 16:57:33.914] [MetaSched] Queue default: Launching scheduler oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas at time 2015-07-13 16:57:34
[debug] [2015-07-13 16:57:34.163] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] Starting scheduler for queue default at time 1436799454
[debug] [2015-07-13 16:57:34.168] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] Begin phase 1 (running jobs)
[debug] [2015-07-13 16:57:34.182] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [125] add job occupation in gantt of container 0
[debug] [2015-07-13 16:57:34.183] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [128] add job occupation in gantt of container 0
[debug] [2015-07-13 16:57:34.183] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] End phase 1 (running jobs)
[debug] [2015-07-13 16:57:34.187] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] Begin EnergySaving phase
[debug] [2015-07-13 16:57:34.189] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] End EnergySaving phase
[debug] [2015-07-13 16:57:34.189] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] Begin phase 2 (waiting jobs)
[debug] [2015-07-13 16:57:34.195] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [127] Start scheduling (Karma note = 1.33303187813989)
[debug] [2015-07-13 16:57:34.206] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [127] find_first_hole with a timeout of 7.5
[debug] [2015-07-13 16:57:34.210] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [127] No enough matching resources (no_matching_slot)
[debug] [2015-07-13 16:57:34.210] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [127] end scheduling
[debug] [2015-07-13 16:57:34.210] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [131] Start scheduling (Karma note = 1.33303187813989)
[debug] [2015-07-13 16:57:34.221] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [131] find_first_hole with a timeout of 7.5
[debug] [2015-07-13 16:57:34.224] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [131] add job occupation in gantt of container 0
[debug] [2015-07-13 16:57:34.230] [MetaSched] Read on the scheduler output:SCHEDRUN JOB_ID=131 MOLDABLE_JOB_ID=131 RESOURCES=681,678,617,660,712,658,673,652,688,630,610,670,614,685,639,629,657,699,621,675,703,680,638,615,671,662,648,623,711,694,696,674,714,659,653,672,661,641,704,634,683,716,718,665,666,636,647,613,618,640,625,690,655,713,687,663,706,692,715,624,609,656,651,650,695,645,702,708,669,654,686,611,676,720,612,717,644,700,643,689,628,719,705,649,642,620,668,632,698,664,693,637,677,667,616,635,627,626,709,697,679,701,633,707,631,622,646,684,710,619,691,682
[debug] [2015-07-13 16:57:34.230] [MetaSched] Early launch the job 131
[debug] [2015-07-13 16:57:34.232] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] [131] end scheduling
[debug] [2015-07-13 16:57:34.233] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] End phase 2 (waiting jobs)
[debug] [2015-07-13 16:57:34.240] [oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas] End of scheduler for queue default
[debug] [2015-07-13 16:57:34.250] [MetaSched] Notify almighty to launch the job 131
[debug] [2015-07-13 16:57:34.250] [Almighty] Appendice received a connection
[debug] [2015-07-13 16:57:34.250] [Almighty] Appendice has read on the socket : OARRUNJOB_131
[debug] [2015-07-13 16:57:34.251] [Almighty][bipbip_launcher] Read on pipe: OARRUNJOB_131
[debug] [2015-07-13 16:57:34.253] [Almighty][bipbip_launcher] Run process: /usr/lib/oar/bipbip 131
[debug] [2015-07-13 16:57:34.253] [Almighty][bipbip_launcher] Nb running bipbip: 1/25; Waiting processes(0): 
[debug] [2015-07-13 16:57:34.253] [Almighty][bipbip_launcher] Check bipbip process duration: job=131, pid=16104, time=1436799454, current_time=1436799454, duration=0s
[debug] [2015-07-13 16:57:34.255] [MetaSched] Queue default: begin processing of waiting reservations
[debug] [2015-07-13 16:57:34.256] [MetaSched] Queue default: end processing of waiting reservations
[debug] [2015-07-13 16:57:34.256] [MetaSched] Queue default: begin processing of new reservations
[debug] [2015-07-13 16:57:34.257] [MetaSched] Queue default: end processing of new reservations
[debug] [2015-07-13 16:57:34.258] [MetaSched] Queue besteffort: No job
[debug] [2015-07-13 16:57:34.258] [MetaSched] Begin precessing of besteffort jobs to kill
[debug] [2015-07-13 16:57:34.260] [MetaSched] End precessing of besteffort jobs to kill
[debug] [2015-07-13 16:57:34.261] [MetaSched] Begin processing of jobs to launch (start time <= 2015-07-13 16:57:34)
[debug] [2015-07-13 16:57:34.265] [MetaSched] End processing of jobs to launch
[debug] [2015-07-13 16:57:34.287] [MetaSched] End of Meta Scheduler
[debug] [2015-07-13 16:57:34.295] [Almighty] /usr/lib/oar/oar_meta_sched terminated :
[debug] [2015-07-13 16:57:34.295] [Almighty] Exit value : 0
[debug] [2015-07-13 16:57:34.295] [Almighty] Signal num : 0
[debug] [2015-07-13 16:57:34.295] [Almighty] Core dumped : 0
[debug] [2015-07-13 16:57:34.295] [Almighty] Current state [Time update]
[debug] [2015-07-13 16:57:34.295] [Almighty] Timeouts check : 2015-7-13 16:57:34
[debug] [2015-07-13 16:57:34.295] [Almighty] Current state [Qget]
[debug] [2015-07-13 16:57:34.295] [Almighty] Got command Time, 99 remaining
[debug] [2015-07-13 16:57:34.295] [Almighty] Command queue : Time
[debug] [2015-07-13 16:57:34.295] [Almighty] Qtype = [Time]
[debug] [2015-07-13 16:57:34.295] [Almighty] Current state [Time update]
[debug] [2015-07-13 16:57:34.295] [Almighty] Timeouts check : 2015-7-13 16:57:34
[debug] [2015-07-13 16:57:34.295] [Almighty] Current state [Qget]
[debug] [2015-07-13 16:57:34.298] [Hulot] Got request 'CHECK'
[debug] [2015-07-13 16:57:34.502] [bipbip 131] JOB: 131; User: neyron; Command: sleep 12h ==> hosts :[kinovis-c1.grenoble.grid5000.fr kinovis-c2.grenoble.grid5000.fr]
Possible precedence issue with control flow operator at /usr/bin/taktuk line 940.
[TAKTUK OUTPUT] digserv: kinovis-c1.grenoble.grid5000.fr (16109): connector > ssh: connect to host kinovis-c1.grenoble.grid5000.fr port 6667: No route to host
[TAKTUK OUTPUT] digserv: kinovis-c1.grenoble.grid5000.fr (16109): state > Connection failed
[TAKTUK OUTPUT] digserv: kinovis-c2.grenoble.grid5000.fr (16110): connector > ssh: connect to host kinovis-c2.grenoble.grid5000.fr port 6667: No route to host
[TAKTUK OUTPUT] digserv: kinovis-c2.grenoble.grid5000.fr (16110): state > Connection failed
[error] [2015-07-13 16:57:37.665] [bipbip 131] /!\ Some nodes are inaccessible (CPUSET_ERROR):
kinovis-c1.grenoble.grid5000.fr kinovis-c2.grenoble.grid5000.fr
[debug] [2015-07-13 16:57:37.670] [Almighty] Appendice received a connection
[debug] [2015-07-13 16:57:37.671] [Almighty] Appendice has read on the socket : ChState
[debug] [2015-07-13 16:57:37.671] [Almighty] Got command ChState, 99 remaining
[debug] [2015-07-13 16:57:37.672] [Almighty] Got command Time, 98 remaining
[debug] [2015-07-13 16:57:37.672] [Almighty] Command queue : ChState Time
[debug] [2015-07-13 16:57:37.672] [Almighty] Qtype = [ChState]
[debug] [2015-07-13 16:57:37.672] [Almighty] Current state [Change node state]
[debug] [2015-07-13 16:57:37.672] [Almighty] Launching command : [/usr/lib/oar/NodeChangeState]
[debug] [2015-07-13 16:57:37.680] [Almighty][bipbip_launcher] Process 16104 for the job 131 ends with exit_code=2, duration=3s
[debug] [2015-07-13 16:57:37.681] [Almighty][bipbip_launcher] Nb running bipbip: 0/25; Waiting processes(0): 
[debug] [2015-07-13 16:57:37.897] [NodeChangeState] Check event for the job 131 with type CPUSET_ERROR
[debug] [2015-07-13 16:57:37.920] [Almighty] Appendice received a connection
[debug] [2015-07-13 16:57:37.921] [Almighty] Appendice has read on the socket : ChState
[info] [2015-07-13 16:57:37.979] [NodeChangeState] error (CPUSET_ERROR) on the nodes:

kinovis-c1.grenoble.grid5000.fr kinovis-c2.grenoble.grid5000.fr

So we are suspecting them
[debug] [2015-07-13 16:57:37.979] [Judas] Mail is not configured
DBD::Pg::db begin_work failed: Already in a transaction at /usr/share/perl5/OAR/IO.pm line 7671.
[info] [2015-07-13 16:57:37.997] [NodeChangeState] We resubmit the job 131 (new id = 132) because the event was CPUSET_ERROR and the job is neither a reservation nor an interactive job.
[debug] [2015-07-13 16:57:38.001] [NodeChangeState] number of resources to change state = 0
commit ineffective with AutoCommit enabled at /usr/share/perl5/OAR/IO.pm line 7687.
[debug] [2015-07-13 16:57:38.006] [Almighty] /usr/lib/oar/NodeChangeState terminated :
[debug] [2015-07-13 16:57:38.006] [Almighty] Exit value : 1
[debug] [2015-07-13 16:57:38.006] [Almighty] Signal num : 0
[debug] [2015-07-13 16:57:38.006] [Almighty] Core dumped : 0
[debug] [2015-07-13 16:57:38.006] [Almighty] Current state [Scheduler]
[debug] [2015-07-13 16:57:38.007] [Almighty] Launching command : [/usr/lib/oar/NodeChangeState]
[debug] [2015-07-13 16:57:38.231] [NodeChangeState] Check event for the job 131 with type RESUBMIT_JOB_AUTOMATICALLY
[debug] [2015-07-13 16:57:38.237] [NodeChangeState] number of resources to change state = 0
[debug] [2015-07-13 16:57:38.246] [Almighty] /usr/lib/oar/NodeChangeState terminated :
[debug] [2015-07-13 16:57:38.246] [Almighty] Exit value : 0
[debug] [2015-07-13 16:57:38.246] [Almighty] Signal num : 0
[debug] [2015-07-13 16:57:38.246] [Almighty] Core dumped : 0
[debug] [2015-07-13 16:57:38.246] [Almighty] Launching command : [/usr/lib/oar/oar_meta_sched]
[debug] [2015-07-13 16:57:38.505] [MetaSched] Starting Meta Scheduler
[debug] [2015-07-13 16:57:38.510] [MetaSched] Retrieve information for already scheduler reservations from database before flush (keep assign resources)
[debug] [2015-07-13 16:57:38.515] [MetaSched] Initialize the gantt structure
[debug] [2015-07-13 16:57:38.519] [MetaSched] Begin processing of already scheduled reservations (accepted with resources assigned)
[debug] [2015-07-13 16:57:38.519] [MetaSched] End processing of already scheduled reservations
[debug] [2015-07-13 16:57:38.519] [MetaSched] Begin processing of current jobs
[debug] [2015-07-13 16:57:38.523] [MetaSched] [128] Add job in database
[debug] [2015-07-13 16:57:38.525] [MetaSched] [128] job is (0,u:,,)
[debug] [2015-07-13 16:57:38.525] [MetaSched] [128] add job occupation in gantt (0,,,)
[debug] [2015-07-13 16:57:38.531] [MetaSched] [125] Add job in database
[debug] [2015-07-13 16:57:38.538] [MetaSched] [125] job is (0,u:,,)
[debug] [2015-07-13 16:57:38.538] [MetaSched] [125] add job occupation in gantt (0,,,)
[debug] [2015-07-13 16:57:38.539] [MetaSched] End processing of current jobs
[debug] [2015-07-13 16:57:38.540] [MetaSched] Begin processing of waiting reservations (accepted reservations which do not have assigned resources yet)
[debug] [2015-07-13 16:57:38.541] [MetaSched] End processing of waiting reservations
[debug] [2015-07-13 16:57:38.542] [MetaSched] Queue admin: No job
[debug] [2015-07-13 16:57:38.543] [MetaSched] Queue kinovis: No job
[debug] [2015-07-13 16:57:38.544] [MetaSched] Queue default: Launching scheduler oar_sched_gantt_with_timesharing_and_fairsharing_and_quotas at time 2015-07-13 16:57:39
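
The "begin_work failed: Already in a transaction" and "commit ineffective with AutoCommit enabled" messages in the log above point at nested transaction handling in IO.pm. A minimal, generic DBI guard (not the actual IO.pm code) looks like this:

    # Only open a transaction when none is active, and let only the opener commit.
    my $started_here = 0;
    if ($dbh->{AutoCommit}) {     # true means no transaction is currently open
        $dbh->begin_work();       # would die with "Already in a transaction" otherwise
        $started_here = 1;
    }

    # ... database work ...

    $dbh->commit() if $started_here;   # avoids "commit ineffective with AutoCommit enabled"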

oar_resources_add long options --cpus and --cores do not work

"oar_resources_add" command returns a bad result when using with long option "--cpus" and/or "--cores".

The command works fine when it is used with short options.

Example:

with "--cpus" option:

oar_resources_add --hosts 1 --cpus 1 -c 2 --> 4 results
oarproperty -c -a host || true
oarproperty -a cpu || true
oarproperty -a core || true
oarproperty -a thread || true
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=0 -p thread=0 -p cpuset=0
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=1 -p thread=1 -p cpuset=1
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=1 -p core=2 -p thread=2 -p cpuset=2
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=1 -p core=3 -p thread=3 -p cpuset=3

oar_resources_add --hosts 1 --cpus 1 -c 4 --> 8 results
oarproperty -c -a host || true
oarproperty -a cpu || true
oarproperty -a core || true
oarproperty -a thread || true
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=0 -p thread=0 -p cpuset=0
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=1 -p thread=1 -p cpuset=1
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=2 -p thread=2 -p cpuset=2
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=3 -p thread=3 -p cpuset=3
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=1 -p core=4 -p thread=4 -p cpuset=4
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=1 -p core=5 -p thread=5 -p cpuset=5
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=1 -p core=6 -p thread=6 -p cpuset=6
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=1 -p core=7 -p thread=7 -p cpuset=7

with" -C" option:

luke:~# oar_resources_add --hosts 1 -C 1 -c 2 --> OK
oarproperty -c -a host || true
oarproperty -a cpu || true
oarproperty -a core || true
oarproperty -a thread || true
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=0 -p thread=0 -p cpuset=0
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=1 -p thread=1 -p cpuset=1

oar_resources_add --hosts 1 -C 1 -c 4 --> OK
oarproperty -c -a host || true
oarproperty -a cpu || true
oarproperty -a core || true
oarproperty -a thread || true
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=0 -p thread=0 -p cpuset=0
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=1 -p thread=1 -p cpuset=1
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=2 -p thread=2 -p cpuset=2
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=3 -p thread=3 -p cpuset=3

with "--cores" option:

oar_resources_add --hosts 1 -C 1 --cores 2 --> 4 results !
oarproperty -c -a host || true
oarproperty -a cpu || true
oarproperty -a core || true
oarproperty -a thread || true
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=0 -p thread=0 -p cpuset=0
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=1 -p thread=1 -p cpuset=1
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=2 -p thread=2 -p cpuset=2
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=3 -p thread=3 -p cpuset=3

with "--cpus" and "--cores" options:

oar_resources_add --hosts 1 --cpus 1 --cores 2 --> 8 results
oarproperty -c -a host || true
oarproperty -a cpu || true
oarproperty -a core || true
oarproperty -a thread || true
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=0 -p thread=0 -p cpuset=0
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=1 -p thread=1 -p cpuset=1
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=2 -p thread=2 -p cpuset=2
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=0 -p core=3 -p thread=3 -p cpuset=3
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=1 -p core=4 -p thread=4 -p cpuset=4
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=1 -p core=5 -p thread=5 -p cpuset=5
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=1 -p core=6 -p thread=6 -p cpuset=6
oarnodesetting -a -h 'node-1' -p host='node-1' -p cpu=1 -p core=7 -p thread=7 -p cpuset=7

Regards,
Romain C.

Mysql bug with OAR 2.5.4

Hello,

We tried to update our OAR installation on the Luxembourg site of G5K last Thursday, from 2.5.3 to 2.5.4, but we encountered several problems and finally rolled back to 2.5.3.

The OAR server is installed in a Debian Squeeze VM. We use MySQL version 5.1.63-0+squeeze1, in a separate Debian Squeeze VM. The frontend is running under Debian Wheezy.

My upgrade procedure looks like this:

Update the oar packages:

@Oar: sudo apt-get remove oar-admin
@frontend,www,oar: sudo apt-get update
@frontend,www,oar: sudo apt-get install $(dpkg -l | grep oar- | awk '{print $2}')

Update the OAR database

@Oar: sudo service oar-server stop
@Oar: oar-database --check
@Oar: oar-database --upgrade
@Oar: sudo service oar-server start

The web services (drawgantt, monika) are working correctly and the
frontend is fine.

However, I've noticed strange errors in the logs emitted by the
MetaSched component:

  • typing errors:

Argument "granduc-12.luxembourg.grid5000.fr" isn't numeric in vec at
/usr/lib/oar/oar_meta_sched line 342.

  • SQL errors:

DBD::mysql::db do failed: You have an error in your SQL syntax; check
the manual that corresponds to your MySQL server version for the right
syntax to use near 'luxembourg.grid5000.
fr),(53069,granduc-13.luxembourg.grid5000.fr),(53069,granduc' at line
2 at /usr/share/perl5/OAR/IO.pm line 6270.

  • Locked tables and duplicated primary keys:

[debug] [2014-11-20 10:24:47.133] [MetaSched] Starting Meta Scheduler
[debug] [2014-11-20 10:24:47.136] [MetaSched] Retrieve information for
already scheduler reservations from database before flush (keep assign
resources)
DBD::mysql::db do failed: Can't execute the given command because you
have active locked tables or an active transaction at
/usr/share/perl5/OAR/IO.pm line 6485.
DBD::mysql::db do failed: Can't execute the given command because you
have active locked tables or an active transaction at
/usr/share/perl5/OAR/IO.pm line 6486.
DBD::mysql::db do failed: Duplicate entry '0' for key 'PRIMARY' at
/usr/share/perl5/OAR/IO.pm line 6294.

DBD::mysql::db do failed: Duplicate entry '53069' for key 'PRIMARY' at
/usr/share/perl5/OAR/IO.pm line 6261.
DBD::mysql::db do failed: You have an error in your SQL syntax; check
the manual that corresponds to your MySQL server version for the right
syntax to use near 'luxembourg.grid5000.
fr),(53069,granduc-19.luxembourg.grid5000.fr),(53069,granduc' at line
2 at /usr/share/perl5/OAR/IO.pm line 6270.

[debug] [2014-11-20 10:27:17.210] [MetaSched] Begin processing of jobs
to launch (start time <= 2014-11-20 10:27:17)
[debug] [2014-11-20 10:27:17.213] [MetaSched] End processing of jobs to launch
DBD::mysql::db do failed: Can't execute the given command because you
have active locked tables or an active transaction at
/usr/share/perl5/OAR/IO.pm line 6372.
DBD::mysql::db do failed: Can't execute the given command because you
have active locked tables or an active transaction at
/usr/share/perl5/OAR/IO.pm line 6373.
DBD::mysql::db do failed: Duplicate entry '53085' for key 'PRIMARY' at
/usr/share/perl5/OAR/IO.pm line 6375.
DBD::mysql::db do failed: Duplicate entry '53083-721' for key
'PRIMARY' at /usr/share/perl5/OAR/IO.pm line 6380.

Also, after some stress tests, the queue was "stalled" and not rescheduled on new events (for example, jobs finishing before the end of their walltime).
At this point, I've exported the logs and a dump of the database, and
I've rolled back to 2.5.3.
I suspect that I've forgotten something, or done something wrong with
the database upgrade.

Any idea / hint?

Thanks for your help,

Best regards,

-- Hyacinthe Cartiaux

Importing ssh keys during OAR submission

I used the following command to import an SSH key for a particular reservation:
oarsub -I -l host=2/core=1 -i ~/my_OAR_jobkey

That works without any problem, but when I add the .pub suffix I get this:

$ oarsub -I -l host=2/core=1 -i ~/my_OAR_jobkey.pub
[ADMISSION RULE] Set default walltime to 3600.
[ADMISSION RULE] Modify resource description with type constraints
Import job key from file: /home/cruizsanabria/my_OAR_jobkey.pub
Enter passphrase:

It asks me for a passphrase, but I haven't set any for my key, and in the end it fails:

Error: Fail to extract the public key. Please verify that the job key to import is valid.
OAR_JOB_ID=-14
Oarsub failed: please verify your request syntax or ask for support to your admin.

My question is:

Is there a way of specifying just my public key?

This behavior also leads me to wonder how the private key is being used.

IO::get_job_current_hostnames vs IO::get_cpuset_values_for_a_moldable_job

@capitn:
I'm looking at get_job_current_hostnames and get_cpuset_values_for_a_moldable_job functions of IO.pm:
I'm wondering why the first one needs to look at the moldable_job_descriptions table, while the second does not?

Also, bipbip and other modules make use of both functions, while calling only the second one could be sufficient to get the list of nodes for a job... Could we merge the two into one?
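
A rough sketch of what a single merged helper could look like, reusing the joins that appear elsewhere in OAR's queries; the exact hostname column (network_address here) is an assumption about the schema:

    # Hypothetical merged helper: one query returning, per assigned resource of a
    # moldable job, both the hostname and the cpuset value.
    sub get_hosts_and_cpusets_for_a_moldable_job {
        my ($dbh, $moldable_id) = @_;
        my $sth = $dbh->prepare("
            SELECT resources.network_address, resources.cpuset
            FROM assigned_resources, resources
            WHERE assigned_resources.moldable_job_id = ?
              AND assigned_resources.resource_id = resources.resource_id");
        $sth->execute($moldable_id);
        my %cpusets_per_host;
        while (my ($host, $cpuset) = $sth->fetchrow_array()) {
            push @{ $cpusets_per_host{$host} }, $cpuset;
        }
        return \%cpusets_per_host;
    }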

ERROR: integer out of range

Hi,
I'm working on the migration from MySQL to PostgreSQL on Grid'5000, and during data insertion I got this error:

ERROR:  integer out of range

It happens in the accounting table because (I guess) window_start, window_stop and consumption are integer and not bigint in the PostgreSQL DB.
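
If that is indeed the cause, a possible fix is to widen those columns on the PostgreSQL side; a sketch via DBI, with the column names taken from the report above (to be checked against the actual schema):

    # Widen the accounting columns so large epoch values and consumptions fit.
    foreach my $col (qw(window_start window_stop consumption)) {
        $dbh->do("ALTER TABLE accounting ALTER COLUMN $col TYPE bigint");
    }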

Cgroup net_cls requires the `cls_cgroup` module to be loaded

The job_resource_manager_cgroups.pl script is not working on the 3.8 kernel because the net_cls cgroup subsystem is built as a kernel module (CONFIG_NET_CLS_CGROUP=m, unlike most of the other subsystems) and is not loaded by default:

# grep CGROUP /boot/config-3.8.0-27-generic                               
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_SCHED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_NET_CLS_CGROUP=m
CONFIG_NETPRIO_CGROUP=m

It gives this error in the logs:
sh: /dev/oar_cgroups//oar/myuser_14/net_cls.classid: Permission denied
because the net_cls.classid file was never created.

The script should test whether the modules are loaded before mounting the associated subsystems.
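
A sketch of such a test, reading /proc/cgroups (which lists the subsystems known to the running kernel) before building the mount option list; whether to warn or to modprobe the missing module is left open:

    # Keep only the cgroup subsystems actually available on this kernel, so the
    # mount does not silently lack net_cls when the cls_cgroup module is not loaded.
    sub available_cgroup_subsystems {
        my %avail;
        open my $fh, '<', '/proc/cgroups' or return \%avail;
        while (my $line = <$fh>) {
            next if $line =~ /^#/;
            my ($name) = split /\s+/, $line;
            $avail{$name} = 1;
        }
        close $fh;
        return \%avail;
    }

    my @wanted  = qw(cpuset cpu cpuacct devices freezer net_cls blkio);
    my $avail   = available_cgroup_subsystems();
    my @missing = grep { !$avail->{$_} } @wanted;
    warn "cgroup subsystems not available (module not loaded?): @missing\n" if @missing;
    my $mount_opts = join(",", grep { $avail->{$_} } @wanted);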

Need more parameters available in prologue/epilogue scripts

Currently, inside [pro|epi]logue[_server] scripts we have to run oarstat when we need to get the type of a job, for example. This is a frequent need (see the metroflux or kavlan job types on Grid'5000) and it may lead to a scalability problem when there are a lot of jobs, resulting in a lot of simultaneous oarstat commands.
So a good optimization would be to pass more job characteristics as parameters to those scripts, especially the job type.

Memory usage too high in Perl schedulers

On a cluster with 5000 resources and 200 waiting jobs, the memory usage of the scheduler can reach 1 GiB. This is very high!

When I check the Perl documentation
http://learn.perl.org/faq/perlfaq3.html#How-can-I-free-an-array-or-hash-so-my-program-shrinks
it seems that there is no garbage collector that will free the gantt tree structures automatically (I think the memory used comes from these structures, because we do a lot of dclone calls on them).

2 solutions:

  • find a way to force Perl to free unused data structures
  • create a new process (fork) for the scheduler part that produces a lot of data structures (see the fork sketch below)

Does anyone have a good idea?
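
A minimal sketch of the second option: run the memory-hungry phase in a forked child so all of its allocations go back to the OS when the child exits, and pass only the compact result to the parent through a pipe (heavy_scheduling_phase and the line-based result format are illustrative):

    pipe(my $reader, my $writer) or die "pipe: $!";
    my $pid = fork();
    die "fork: $!" unless defined $pid;

    if ($pid == 0) {                  # child: builds the large gantt structures
        close $reader;
        my $decisions = heavy_scheduling_phase();   # hypothetical, returns an array ref of lines
        print {$writer} "$_\n" for @$decisions;
        close $writer;
        exit 0;                       # all of the child's memory is released here
    }

    close $writer;                    # parent: only keeps the small decision list
    my @decisions = <$reader>;
    close $reader;
    waitpid($pid, 0);
    chomp @decisions;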

Make the content of OAR_FILE_NODE easily available through the API

(Initially posted on gforge by David Margery)

Using OAR's API, it is possible to use the undocumented relation /jobs//nodes to get the list of nodes of a job.
On the other hand, OAR_FILE_NODE gives the list of nodes of a job, with one entry per core (is that a config parameter?)

It would be useful to have /jobs//nodes_with_repeats, or at least /jobs//nodes with a weight given to each node, to easily get as much information as OAR_FILE_NODE provides.
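
For reference, the per-node weight is what one currently derives from the node file by hand; a small sketch (using the OAR_FILE_NODE name as given above):

    # Count how many times each node appears in the node file (one line per core);
    # this is the "weight" the API relation could expose directly.
    my %weight;
    open my $fh, '<', $ENV{OAR_FILE_NODE} or die "cannot open node file: $!";
    while (my $node = <$fh>) {
        chomp $node;
        $weight{$node}++;
    }
    close $fh;
    printf "%s %d\n", $_, $weight{$_} for sort keys %weight;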

SQL query for getting information on every job in a job_id interval

Statsg5k (a usage reporting tool for OAR clusters) makes SQL queries directly on OAR databases to get information on jobs. At each update, statsg5k queries the OAR database for jobs that have been created since its last update. The SQL query aims at getting every new job, even unfinished ones. If a job is not yet finished, statsg5k keeps it on a "watchlist" for later processing.

The SQL query used for getting all the jobs in a [job_id_min, job_id_max] range is:

(
   SELECT jobs.job_id, jobs.start_time, jobs.stop_time, jobs.job_user, jobs.queue_name, jobs.job_type, jobs.state, resources.cluster, resources.type as resource_type, count(resources.resource_id) as nb_cores
   FROM resources
   INNER JOIN jobs
     ON jobs.job_id >= job_id_min
     AND jobs.job_id <= job_id_max
   INNER JOIN assigned_resources
     ON assigned_resources.resource_id = resources.resource_id
   INNER JOIN moldable_job_descriptions
     ON assigned_resources.moldable_job_id = moldable_job_descriptions.moldable_id
     AND jobs.job_id = moldable_job_descriptions.moldable_job_id
   GROUP BY jobs.job_id, resources.cluster
  )
 UNION
  (
   SELECT  jobs.job_id,  jobs.start_time,  jobs.stop_time,  jobs.job_user,  jobs.queue_name,  jobs.job_type,  jobs.state,  resources.cluster,  resources.type as resource_type, count(resources.resource_id) as nb_cores
   FROM resources
   INNER JOIN jobs
     ON jobs.job_id >= job_id_min
     AND jobs.job_id <= job_id_max
   INNER JOIN gantt_jobs_resources
     ON gantt_jobs_resources.resource_id = resources.resource_id
   INNER JOIN moldable_job_descriptions
     ON gantt_jobs_resources.moldable_job_id = moldable_job_descriptions.moldable_id
     AND jobs.job_id = moldable_job_descriptions.moldable_job_id
   GROUP BY jobs.job_id, resources.cluster
 )

It seems that this query does not capture all the jobs, as we found out that some jobs are missing from the statsg5k database. Any idea on how to achieve what we want?

We welcome any suggestions or insights, but to fix the problem we were thinking of doing two distinct queries:

  • one query for getting the finished jobs:
SELECT jobs.job_id, jobs.start_time, jobs.stop_time, jobs.job_user, jobs.queue_name, jobs.job_type, jobs.state, resources.cluster, resources.type as resource_type, count(resources.resource_id) as nb_cores
   FROM resources
   INNER JOIN jobs
     ON jobs.job_id >= job_id_min
     AND jobs.job_id <= job_id_max
   INNER JOIN assigned_resources
     ON assigned_resources.resource_id = resources.resource_id
   INNER JOIN moldable_job_descriptions
     ON assigned_resources.moldable_job_id = moldable_job_descriptions.moldable_id
     AND jobs.job_id = moldable_job_descriptions.moldable_job_id
   GROUP BY jobs.job_id, resources.cluster
  • one query for getting all the job_ids in the range:
SELECT jobs.job_id FROM jobs WHERE jobs.job_id >= job_id_min AND jobs.job_id <= job_id_max

In this way, jobs that are not captured by the first query should go on the watchlist.
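
A sketch of how the two queries could be combined on the statsg5k side (the DBI calls are generic; $detailed_jobs_sql stands for the first query above with job_id_min/job_id_max turned into placeholders, and the store/watchlist helpers are hypothetical):

    # 1) detailed rows for jobs that already have assigned resources
    my %seen;
    my $sth = $dbh->prepare($detailed_jobs_sql);
    $sth->execute($job_id_min, $job_id_max);
    while (my $row = $sth->fetchrow_hashref()) {
        $seen{ $row->{job_id} } = 1;
        store_job_row($row);                        # hypothetical statsg5k helper
    }

    # 2) all job ids in the range; anything not seen above goes to the watchlist
    my $ids = $dbh->selectcol_arrayref(
        "SELECT jobs.job_id FROM jobs WHERE jobs.job_id >= ? AND jobs.job_id <= ?",
        undef, $job_id_min, $job_id_max);
    foreach my $job_id (@$ids) {
        add_to_watchlist($job_id) unless $seen{$job_id};   # hypothetical helper
    }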

Allow noop/cosystem/deploy job to start on resources in standby (without wake-up)

In the case of noop/cosystem/deploy jobs, OAR does almost nothing on the resources (no ping checker). Also, such jobs often perform the task of booting (rebooting) the nodes themselves for their actual purpose (e.g. on G5K). As a result, if a node is in the standby state, it might be useless to wake it up for such a job.

The idea here would be (chronologically):

1. not try to wake up requested resources whenever they are in the standby state
2. actually start the job on those resources although in standby state
3. (maybe?) set those resources to the Alive state but without any actual check (ping checker)

At the end of the job, resources should be returned (epilogue or server_epilogue of the job) in either the Alive state (if powered on) or the Standby state (if powered off).

Dead code in oarsh ?

@capitn, do you agree that this code is dead in oarsh?

-2- try connection using a job key pushed by OAR for a job using the job key mechanism.

(oarsh is run from one of the node of the job)

TMP_JOB_KEY_FILE="$OAR_RUNTIME_DIRECTORY/$OARDO_USER.jobkey"
if [ -r $TMP_JOB_KEY_FILE ]; then
    umask $OLDUMASK
    exec $OPENSSH_CMD $OARSH_OPENSSH_DEFAULT_OPTIONS -i $TMP_JOB_KEY_FILE "$@"
    echo "oarsh: Failed to connect using the cpuset job key: $TMP_JOB_KEY_FILE"
    exit 4
fi

Postgresql deadlocks with oarapi.pl since last update (2.5.4)

Since the last update, PostgreSQL frequently deadlocks on TRUNCATE queries.
Here is an excerpt from my oar.log file:

DBD::Pg::db do failed: ERROR: deadlock detected
DETAIL: Process 27321 waits for AccessExclusiveLock on relation 61813 of database 61666; blocked by process 27446.
Process 27446 waits for AccessShareLock on relation 61791 of database 61666; blocked by process 27321.
HINT: See server log for query details. at /usr/share/perl5/OAR/IO.pm line 6374.
DBD::Pg::db do failed: ERROR: current transaction is aborted, commands ignored until end of transaction block at /usr/share/perl5/OAR/IO.pm line 6380.
DBD::Pg::db do failed: ERROR: current transaction is aborted, commands ignored until end of transaction block at /usr/share/perl5/OAR/IO.pm line 6385.

I replaced the TRUNCATE PostgreSQL command with a DELETE at the given line of the OAR/IO.pm file, and the problem seems to have disappeared.

A Google search for "truncate postgresql deadlock" gives a lot of results...
