
PanDA Pilot 2

Contributions

  1. Check the TODO.md and STYLEGUIDE.md files.

  2. Fork the PanDAWMS/pilot2 repository into your private account as origin. Clone it and set the PanDAWMS/pilot2 repository as upstream.

  3. Make new code contributions only to a new branch in your repository, push to origin and make a pull request into upstream. Depending on the type of contribution, this should go to either upstream/next or upstream/hotfix. Any pull requests made directly to the master branch will be rejected, since that would trigger the automatic pilot tarball creation.

Verifying code correctness

Do not submit code that does not conform to the project standards. We use PEP8 and flake8 verification, with everything enabled, at a maximum line length of 160 characters and McCabe complexity 12, as well as Pylint:

flake8 pilot.py pilot/
pylint <path to pilot module>
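If your environment does not pick up the project's configuration, the same limits can be passed to flake8 explicitly (standard flake8 options; shown as a sketch, not the canonical project invocation):

flake8 --max-line-length=160 --max-complexity=12 pilot.py pilot/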

Running the pilot

The pilot is a dependency-less Python application and relies on /usr/bin/env python. A minimal pilot invocation looks like:

./pilot.py -q <PANDA_QUEUE>

where PANDA_QUEUE corresponds to the ATLAS PandaQueue as defined in AGIS. This will launch the default generic workflow.

Running the testcases

The test cases are implemented as standard Python unittests under directory pilot/test/. They can be discovered and executed automatically:

unit2 -v
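unit2 comes from the unittest2 backport; on a recent Python, the standard library's test discovery should be equivalent (run from the repository root):

python -m unittest discover -v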

Building and viewing docs

  1. Install sphinx into your environment by pip or other means, together with all the necessary requirements.

  2. Navigate into ./doc in your fork and run make html.

  3. Open _build/html/index.html with your browser.
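In full, the sequence is roughly (assuming pip is available and any extra requirements are installed the same way):

pip install sphinx
cd doc
make html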

Automating documentation for your module

For automatic code documentation of any new pilot module, add the following lines to the corresponding rst file in the doc area:

.. automodule:: your.module
    :members:

See existing rst files. For more info, visit http://sphinx-doc.org

Syncing your GitHub repository

Before making a pull request, make sure that you are synced to the latest version.

  1. git clone https://github.com/USERNAME/pilot2.git
  2. cd pilot2
  3. git checkout next
  4. git remote -v
  5. git remote add upstream https://github.com/PanDAWMS/pilot2.git
  6. git fetch upstream
  7. git merge upstream/next


Issues

issues with open_remote_file.py

Hi @PalNilsson
https://github.com/PanDAWMS/pilot2/blob/master/pilot/scripts/open_remote_file.py

Here is an example of a job with some problems:
https://bigpanda.cern.ch/job?pandaid=5872397412

2023-06-13 02:09:33,078 | DEBUG    | pilot.user.atlas.container       | get_root_container_script | root setup script content:

date
lsetup 'root pilot'
date
python ./open_remote_file.py --turls='root://basilisk02.westgrid.ca:1094/pnfs/westgrid.uvic.ca/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1' -w . -t 1 1>remote_open.stdout 2>remote_open.stderr
exit $?

The output files of the script are
https://bigpanda.cern.ch//media/filebrowser/a76e447a-f58d-419c-bd21-bb3077d3c005/hc_test/tarball_PandaJob_5872397412_CA-VICTORIA-K8S-T2/remote_open.stderr
https://bigpanda.cern.ch//media/filebrowser/a76e447a-f58d-419c-bd21-bb3077d3c005/hc_test/tarball_PandaJob_5872397412_CA-VICTORIA-K8S-T2/remote_open.stdout

It is not clear whether the timeout is working properly: it seems to set a timeout of 30 s but crashes after ~4 minutes.

2023-06-13 02:15:51,607 | INFO     | __main__                         | <module>                  | setting up signal handling
2023-06-13 02:15:51,608 | INFO     | __main__                         | message                   | will attempt to open 1 file(s) using 1 thread(s)
2023-06-13 02:15:52,049 | INFO     | __main__                         | message                   | internal TFile.Open() time-out set to 30000 ms
2023-06-13 02:15:52,049 | INFO     | __main__                         | message                   | opening root://basilisk02.westgrid.ca:1094/pnfs/westgrid.uvic.ca/data/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
2023-06-13 02:20:03,303 | WARNING  | __main__                         | interrupt                 | caught signal: SIGTERM in FRAME=
  File "./open_remote_file.py", line 247, in <module>
    [_thread.join() for _thread in threads]

In another case (https://bigpanda.cern.ch//media/filebrowser/8279922f-fe30-485e-a6bb-97d53402c5ce/hc_test/tarball_PandaJob_5872412969_CA-VICTORIA-K8S-T2/remote_open.stdout) it crashed after 20 s.

After that it seems to go into a loop trying to kill a process 30,000 times and producing 10 MB of log output!

Also the stderr shows

RecursionError: maximum recursion depth exceeded while calling a Python object

There must be a recursive function that calls itself more than 1000 times and is then terminated by Python to avoid infinite recursion?
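A possible explanation for the long hang (a hedged sketch, not the actual script code): the internal TFile.Open() timeout only bounds each open attempt, while the unbounded thread.join() calls at line 247 can block much longer. Joining with an overall deadline would cap the wall-clock time:

import time

def join_with_deadline(threads, deadline_s):
    # join threads, but never block past an overall wall-clock deadline
    end = time.time() + deadline_s
    for _thread in threads:
        remaining = end - time.time()
        if remaining <= 0:
            break
        _thread.join(timeout=remaining)
    # any thread still alive here has exceeded the deadline and can be handled explicitly
    return [_thread for _thread in threads if _thread.is_alive()]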

Pilot writes to home directory of executing user during verify_proxy

In jobs running in isolated / containerized environments as users with non-existing or read-only home directory, the following warnings are observed:

2021-08-10 14:33:48,037 | DEBUG    | pilot.user.atlas.proxy           | interpret_proxy_info      | stderr = /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/swConfig/asetup/createUserASetup.sh: line 44: /home/centos/.asetup: No such file or directory
WARNING: Failed to create directory /home/centos/.arc
WARNING: Unable to create /home/centos/.arc directory.
WARNING: Failed to create directory /home/centos/.arc
WARNING: Unable to create /home/centos/.arc directory.

Indeed, in case the home directory exists, files are created there and read from there, potentially breaking job isolation, and of course these errors are misleading when running with dummy users.

I traced these errors back to verify_arcproxy:

def verify_arcproxy(envsetup, limit, proxy_id="pilot", test=False):

The issue is caused in two places:

  • /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/swConfig/asetup/createUserASetup.sh is called as part of setupATLAS, and creates an .asetup file in the user home directory.
  • arcproxy itself parses /etc/passwd and extracts the home directory, then creates its config files in there.

The first could in principle be worked around by adding unset HOME to the envsetup string (even though this is not a nice workaround), or by adapting ALRB itself (but since ALRB is often used interactively, I presume other such cases may come up in the future). A sketch of the workaround follows below.
The latter cannot easily be solved, since arcproxy has no --no-config option, nor does it accept arcproxy -z /dev/null. I wonder if flipping the order (i.e. preferring voms-proxy-info) is a viable option, now that SL6 is gone?
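A minimal sketch of the HOME workaround, assuming the envsetup string is assembled in Python before being handed to the shell (function name hypothetical):

def patch_envsetup(envsetup, workdir):
    # point HOME at the job workdir so ALRB/arcproxy write there instead of the real home
    return 'export HOME=%s; %s' % (workdir, envsetup)

Redirecting HOME to a writable scratch directory would address both failure modes without unsetting the variable entirely, though it remains a workaround rather than a fix.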

Remote input

The remote input feature needs to be tested (and possibly it is not fully implemented).

  1. Currently 'allowremoteinputs' is defined as a string in FileSpec. Change it to bool; it is used as a boolean in api/data.py.
  2. Verify that the usage of allowremoteinputs is correct in api/data.py.

Starting pilot2

Hi all! Got trouble starting pilot2:

$ ./pilot.py --cacert /tmp/x509up_u500 -d -q BNL-LSST
2017-04-19 21:59:13,490 | INFO | MainThread | main | main | pilot startup - version 2017-04-04.001
2017-04-19 21:59:13,513 | DEBUG | MainThread | pilot.util.https | https_setup | User-Agent: pilot/2017-04-04.001 (Python 2.6.6; Linux x86_64)
2017-04-19 21:59:13,516 | WARNING | MainThread | pilot.util.https | https_setup | Python version <2.7.9 lacks SSL contexts -- falling back to curl
2017-04-19 21:59:13,525 | DEBUG | MainThread | pilot.util.information | retrieve_json | retrieving: http://atlas-agis-api.cern.ch/request/pandaqueue/query/list/?json
2017-04-19 21:59:18,706 | DEBUG | MainThread | pilot.util.information | retrieve_json | cached version found: http://atlas-agis-api.cern.ch/request/pandaqueue/query/list/?json
2017-04-19 21:59:18,717 | DEBUG | MainThread | pilot.util.information | retrieve_json | retrieving: http://atlas-agis-api.cern.ch/request/site/query/list/?json
2017-04-19 21:59:21,256 | DEBUG | MainThread | pilot.util.information | retrieve_json | cached version found: http://atlas-agis-api.cern.ch/request/site/query/list/?json
Traceback (most recent call last):
  File "./pilot.py", line 104, in <module>
    trace = main()
  File "./pilot.py", line 31, in main
    if not set_location(args):
  File "/home/hellcat/dev/pilot2/pilot/util/information.py", line 56, in set_location
    args.location.site_info = [tmp_site for tmp_site in all_sites if site['name'] == args.location.site][0]
TypeError: 'NoneType' object is unsubscriptable

$ voms-proxy-info
subject : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=psvirin/CN=762973/CN=Pavlo Svirin/CN=proxy
issuer : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=psvirin/CN=762973/CN=Pavlo Svirin
identity : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=psvirin/CN=762973/CN=Pavlo Svirin
type : proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 11:30:25

$ git status

On branch next

nothing to commit (working directory clean)

curl timeout

Tried to run the latest pilot2 from mlassnig/next (with the server modified in job.py) and got the following:

2017-04-27 20:50:39,902 | DEBUG | Thread-6 | pilot.control.job | retrieve | trying to fetch job
2017-04-27 20:50:39,910 | DEBUG | Thread-6 | pilot.util.https | request | request: curl -sS --compressed --connect-timeout 1 --max-time 3 --capath /etc/grid-security/certificates --cert /tmp/x509up_u500 --cacert /tmp/x509up_u500 --key /tmp/x509up_u500 -H 'User-Agent: pilot/2017-04-04.001 (Python 2.6.6; Linux x86_64)' -H 'Accept: application/json' 'https://pandawms.org:25443/server/panda/getJob?siteName=BNL-LSST&prodSourceLabel=mtest'
2017-04-27 20:50:42,279 | WARNING | Thread-6 | pilot.util.https | request | request failed (8960): curl: (35) timed out before SSL handshake
2017-04-27 20:50:42,283 | WARNING | Thread-6 | pilot.control.job | retrieve | did not get a job -- sleep 1000s and repeat

What could be wrong so far?

P.S.: I'm using a plain certificate for now, with no VOMS extensions added.

Copytools should not overwrite existing output files

We had a case recently where a job ran twice, and the second job overwrote the output file created by the first job. This led to a different checksum of the file in rucio and storage, causing file transfer and job failures. The pilot should instead fail when it encounters a file which already exists. It seems like rucio mover correctly fails but gfal does not. The fix might be as simple as removing the -f here:

https://github.com/PanDAWMS/pilot2/blob/master/pilot/copytool/gfal.py#L140

but other copytools should be checked too.
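A hedged sketch of what the gfal change could look like (the real command construction in pilot/copytool/gfal.py may differ):

def build_gfal_copy_command(source, destination):
    # no -f: gfal-copy refuses to overwrite an existing destination by default
    return 'gfal-copy %s %s' % (source, destination)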

Essential feature: Add signal handling

Currently the following signals are recognised: SIGINT, SIGTERM, SIGQUIT, SIGSEGV, SIGXCPU, SIGUSR1, SIGBUS - but they are only received to stop the job. The interrupt() function in e.g. pilot/workflow/generic.py should wait before setting the graceful_stop and (somehow) signal the job monitoring to stop the job properly (which takes some time), and at the same time (i.e. in parallel) set the corresponding error code and send a job update to the server. When this has been done, the interrupt() function should continue and kill the pilot.
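A rough sketch of the proposed flow; send_job_update and stop_job_monitoring are stand-ins for whatever the pilot actually uses:

import signal
import threading

graceful_stop = threading.Event()

def send_job_update(sig):
    # placeholder: report the corresponding error code to the server
    pass

def stop_job_monitoring():
    # placeholder: signal job monitoring to stop the job properly (takes some time)
    pass

def interrupt(signum, frame):
    # report to the server in parallel while the job is being stopped
    reporter = threading.Thread(target=send_job_update, args=(signum,))
    reporter.start()
    stop_job_monitoring()
    reporter.join()
    # only now allow the pilot to be killed
    graceful_stop.set()

for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGQUIT, signal.SIGSEGV,
            signal.SIGXCPU, signal.SIGUSR1, signal.SIGBUS):
    signal.signal(sig, interrupt)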

Correct propagation of protocols into traces.

  1. This is not protocol, but copytool:
    https://github.com/PanDAWMS/pilot2/blob/next/pilot/api/data.py#L346
    so the field should be named properly in the trace.
    module = self.copytool_modules[name]['module_name']
    self.logger.info('trying to use copytool=%s for activity=%s' % (name, activity))
    copytool = __import__('pilot.copytool.%s' % module, globals(), locals(), [module], -1)
    self.trace_report.update(protocol=name)

  2. The protocol should be filled into a separate field. For that we would need to populate the protocol list and the protocol_id into fspec. However, the protocol_id is never populated, and the list of protocols is created only for stage-out. It is not clear to me where to do that. I'll apply a dirty solution: extract the protocol from the surl directly in the copytool module and populate the trace there (see the sketch after this list).

  3. This is dirty:
    https://github.com/PanDAWMS/pilot2/blob/next/pilot/api/data.py#L426
    if replica:
        surl = self.get_preferred_replica(replicas, ['srm']) or replicas[0]  # prefer SRM protocol for surl -- to be verified
        self.logger.info("[stage-in] surl (srm replica) from Rucio: pfn=%s, ddmendpoint=%s" % (surl, ddmendpoint))
        break
    shouldn't that be if not replica?
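The dirty solution from point 2 could look roughly like this inside a copytool module (a sketch; the helper name is hypothetical):

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

def extract_protocol(surl):
    # derive the transfer protocol directly from the surl, e.g. 'root' or 'srm'
    return urlparse(surl).scheme

The copytool would then call self.trace_report.update(protocol=extract_protocol(fspec.surl)) instead of reporting the copytool name as the protocol.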

millions of file paths dumped in pilot log

For example:

/afs/cern.ch/user/f/flin/public/grid.2430635.17.out
/afs/cern.ch/user/f/flin/public/grid.2454670.12.out

Pilot log file size over 100 MB. Mainly due to:

2020-01-24 09:49:42,187 | WARNING  | job_monitor         | pilot.util.auxiliary.4618203336  | get_time_for_last_touch   | find command failed: 1, /home/
pilatl03/home_cream_358495389/CREAM358495389/atlas_C3Bf0MRY/PanDA_Pilot-4618203336
/home/pilatl03/home_cream_358495389/CREAM358495389/atlas_C3Bf0MRY/PanDA_Pilot-4618203336/PoolFileCatalog.xml
/home/pilatl03/home_cream_358495389/CREAM358495389/atlas_C3Bf0MRY/PanDA_Pilot-4618203336/payload.stdout
/home/pilatl03/home_cream_358495389/CREAM358495389/atlas_C3Bf0MRY/PanDA_Pilot-4618203336/pandawnutil/tracer
...

The find command dumped millions of paths.

Is it necessary to dump these paths, or can this be improved to make the log less bulky? Thanks.
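One hedged option (helper name hypothetical) is to truncate long command output before it reaches the logger:

def truncate_for_log(output, max_chars=1000):
    # keep only the head of very long command output in the pilot log
    if len(output) <= max_chars:
        return output
    return output[:max_chars] + '\n... [%d more characters truncated]' % (len(output) - max_chars)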

Why does is_harvester_mode require - "and not args.update_server"

Please see the lines referenced below:

if (args.harvester_workdir != '' or args.harvester_datadir != '' or args.harvester_eventstatusdump != '' or
        args.harvester_workerattributes != '') and not args.update_server:
    harvester = True
elif 'HARVESTER_ID' in environ or 'HARVESTER_WORKER_ID' in environ:
    harvester = True
else:
    harvester = False

Pilot timing and logging

[From Danila]

  1. Most important is the handling of timing.
    Right now, timing stamps are collected into a special JSON file. This works fine for the regular workflow, but not well enough for the HPC workflow, where the pilot acts like an MPI application with one instance per rank.
    Collecting timing information from all ranks into one file is not an option, due to deadlocks caused by chaotic read/write operations on the same file from different ranks.
    I made a workaround with one timing file per rank; this solves the access problem but brings another one: a lot of files, especially for a high number of ranks (we currently work on Titan at scales of up to 1000 ranks).
    We already discussed that it would be good to get rid of the file and use some structure in memory (see the sketch after this list); this will likely be needed before the HPC workflow is ready for use at high scale.

  2. Logging (minor issue).
    It would be good to have a switch to disable logging to file. Harvester nowadays takes care of collecting stdout and stderr for batch submissions, which includes the full pilot log from the console.
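A minimal sketch of such an in-memory structure (class name hypothetical), serialized once per rank at shutdown instead of on every update:

import json
from collections import defaultdict

class TimingStore(object):
    # in-memory replacement for the per-rank timing JSON files

    def __init__(self):
        self._stamps = defaultdict(dict)

    def add(self, job_id, label, timestamp):
        self._stamps[job_id][label] = timestamp

    def dump(self):
        # serialize once at the end, avoiding concurrent file access between ranks
        return json.dumps(self._stamps)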

getuser() invoked by utilities.py fails in container

Hi @PalNilsson

Just like #327, there is another spot that invokes the problematic function; this time it happens in utilities.py.

We are using the latest pilot version on CVMFS.

---- Retrieve pilot code ----
2021-01-23 00:13:11,137 [wrapper] Using piloturl: file:///cvmfs/atlas.cern.ch/repo/sw/PandaPilot/tar/pilot2.tar.gz
2021-01-23 00:13:11,166 [wrapper] File pilot2/pilot.py exists OK
2021-01-23 00:13:11,170 [wrapper] pilot2/PILOTVERSION: 2.9.4.20



---- Ready to run pilot ----

2021-01-23 00:13:21,136 [wrapper] ==== pilot stdout BEGIN ====
2021-01-23 00:13:21,139 [wrapper] pilotpid: 3371
2021-01-23 00:13:21,349 | INFO     | MainThread          | pilot.util.auxiliary             | pilot_version_banner      | ****************************************
2021-01-23 00:13:21,349 | INFO     | MainThread          | pilot.util.auxiliary             | pilot_version_banner      | ***  PanDA Pilot version 2.9.4 (20)  ***
2021-01-23 00:13:21,350 | INFO     | MainThread          | pilot.util.auxiliary             | pilot_version_banner      | ****************************************
2021-01-23 00:13:21,350 | INFO     | MainThread          | pilot.util.auxiliary             | pilot_version_banner      | 
2021-01-23 00:13:21,350 | INFO     | MainThread          | pilot.util.auxiliary             | pilot_version_banner      | pilot is running in a VM
2021-01-23 00:13:21,350 | INFO     | MainThread          | pilot.util.auxiliary             | display_architecture_info | architecture information:
2021-01-23 00:13:21,424 | INFO     | MainThread          | pilot.util.auxiliary             | display_architecture_info | 



2021-01-23 00:13:24,632 | INFO     | validate            | pilot.control.job                | validate                  | processing PanDA job 4955393196 from task NULL
  File "/pilotdir/atlas_t5eX0Gry/pilot2/pilot/common/exception.py", line 434, in run
    self._Thread__target(**self._Thread__kwargs)
  File "/pilotdir/atlas_t5eX0Gry/pilot2/pilot/control/job.py", line 770, in validate
    if _validate_job(job):
  File "/pilotdir/atlas_t5eX0Gry/pilot2/pilot/control/job.py", line 135, in _validate_job
    user = __import__('pilot.user.%s.common' % pilot_user, globals(), locals(), [pilot_user], 0)  # Python 2/3
  File "/pilotdir/atlas_t5eX0Gry/pilot2/pilot/user/atlas/common.py", line 27, in <module>
    from .utilities import get_memory_monitor_setup, get_network_monitor_setup, post_memory_monitor_action,\
  File "/pilotdir/atlas_t5eX0Gry/pilot2/pilot/user/atlas/utilities.py", line 257, in <module>
    def get_ps_info(pgrp, whoami=getuser(), options='axfo pid,user,args'):
  File "/usr/lib64/python2.7/getpass.py", line 158, in getuser
    return pwd.getpwuid(os.getuid())[0]
exception caught by thread run() function: (<type 'exceptions.KeyError'>, KeyError('getpwuid(): uid not found: 1000',), <traceback object at 0x7f99c4426290>)
Traceback (most recent call last):
  File "/pilotdir/atlas_t5eX0Gry/pilot2/pilot/common/exception.py", line 434, in run
    self._Thread__target(**self._Thread__kwargs)
  File "/pilotdir/atlas_t5eX0Gry/pilot2/pilot/control/job.py", line 770, in validate
    if _validate_job(job):
  File "/pilotdir/atlas_t5eX0Gry/pilot2/pilot/control/job.py", line 135, in _validate_job
    user = __import__('pilot.user.%s.common' % pilot_user, globals(), locals(), [pilot_user], 0)  # Python 2/3
  File "/pilotdir/atlas_t5eX0Gry/pilot2/pilot/user/atlas/common.py", line 27, in <module>
    from .utilities import get_memory_monitor_setup, get_network_monitor_setup, post_memory_monitor_action,\
  File "/pilotdir/atlas_t5eX0Gry/pilot2/pilot/user/atlas/utilities.py", line 257, in <module>
    def get_ps_info(pgrp, whoami=getuser(), options='axfo pid,user,args'):
  File "/usr/lib64/python2.7/getpass.py", line 158, in getuser
    return pwd.getpwuid(os.getuid())[0]
KeyError: 'getpwuid(): uid not found: 1000'

I grepped for getuser in the pilot code; this should be the only remaining case that causes the problem.

Thanks!
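The underlying problem is that the default argument whoami=getuser() is evaluated once at import time. A hedged sketch of a deferral, keeping the signature's intent:

from getpass import getuser

def get_ps_info(pgrp, whoami=None, options='axfo pid,user,args'):
    if whoami is None:
        whoami = getuser()  # evaluated per call, so importing the module cannot fail
    # ... original function body unchanged ...
    pass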

Create new error code

Create new error code for the following case (currently labelled as a general stage-out problem):

2018-07-27 07:04:49,168 | INFO | executing command: /usr/bin/env rucio -v upload --rse BNL-OSG2_DATADISK --scope panda --guid 0e77c88c-74a7-4343-980f-c6104e612307 --summary /home/condor/local/atlas/execute/dir_72469/condorg_nPS5EtLj/PanDA_Pilot2_76596_1532674961/PanDA_Pilot-4005751905/4c4aa220-1bb8-4fb9-871d-ede4ee3ba614_1.job.log.tgz
2018-07-27 07:05:10,328 | ERROR | Traceback (most recent call last):
  File "/home/condor/local/atlas/execute/dir_72469/condorg_nPS5EtLj/pilot2/pilot/control/data.py", line 543, in _do_stageout
    client.transfer(xdata, activity, **kwargs)
  File "/home/condor/local/atlas/execute/dir_72469/condorg_nPS5EtLj/pilot2/pilot/api/data.py", line 318, in transfer
    raise PilotException('Failed to transfer files using copytools=%s, error=%s' % (copytools, errors))
PilotException: Error code: 1301, message: An unknown pilot exception has occurred
Details: Failed to transfer files using copytools=['rucio'], error=[PilotException("no error information passed (http status code: 503 ('service_unavailable', 'unavailable'))]",)]

2018-07-27 07:05:10,328 | INFO | summary of transferred files:
2018-07-27 07:05:10,329 | INFO | -- lfn=4c4aa220-1bb8-4fb9-871d-ede4ee3ba614_1.job.log.tgz, status_code=1137, status=failed

https://aipanda115.cern.ch/media/filebrowser/0e77c88c-74a7-4343-980f-c6104e612307/panda/tarball_PandaJob_4005751905_BNL_PROD_CONTR_TEST-condor/pilotlog.txt

Pilot may kill unrelated processes in kill_orphans / kill its parent / cause resources not to be freed

The current pilot code in kill_orphans collects any process with PPID=1 belonging to the pilot user and sends kill -9 (with a few hardcoded exceptions).

Apart from the recent problem at RAL (killing runpilot2-wrapper.sh) which is now blacklisted, there are several more issues:

  1. Completely unrelated processes on the same node can be killed.
  2. The pilot can still perform parenticide, i.e. kill its own parent, thus killing itself.
  3. Sending kill -9 directly, without a SIGTERM first, may for some processes cause resources not to be freed, errors not to be logged, etc.

We are now encountering this issue again in another incarnation (running containers with a PID namespace and a condor_master as the first process inside, which itself has the pilot in its process tree; the pilot then kills condor_master as an "orphan", killing its own parent and thus going to the orphanage itself).

I have several ideas for improvements to the logic, but all require quite some development work (for which I currently don't have the resources to contribute myself).
In increasing order of complexity and robustness (sadly, these are proportional), my ideas are:

  1. Send a regular SIGTERM first, and a SIGKILL only after a grace period.
    This should allow killed processes to clean up (e.g. lock files, FUSE mounts etc. which may be left dangling otherwise) and allow them to log the death.
  2. Collect all PIDs following the chain from the pilot PID itself via its PPID up to PID=1, and put all these processes on a whitelist of processes which will not be killed, to prevent parenticide (see the sketch after this list).
  3. Let the pilot open up a PID namespace in a user namespace (if the site has these enabled) and have a dedicated PID=1 process (similar e.g. to tini or Singularity's sinit). Once the PID=1 process in the namespace dies, the kernel will clean up things automagically.
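A hedged sketch of idea 2, reading the PPID chain from /proc (Linux-specific; helper name hypothetical):

import os

def ancestor_pids():
    # collect the pilot's own ancestor chain up to PID 1; these must never be killed
    pids = []
    pid = os.getpid()
    while pid > 1:
        pids.append(pid)
        with open('/proc/%d/stat' % pid) as f:
            stat = f.read()
        # the PPID is the second field after the closing paren of the comm field
        pid = int(stat.rsplit(')', 1)[1].split()[1])
    pids.append(1)
    return pids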

Missing checksum verification in xrdcp

Checksum verification is not fully implemented in xrdcp copy tool. Can be tested on RAL-LCG2_TEST (used for container testing so it's already receiving pilot2-dev).

https://aipanda067.cern.ch/pilots/2018-07-27/RAL-LCG2_TEST-7255/4950372.0.out

2018-07-27 06:09:33,093 | INFO | Use --cksum adler32:print option to get the checksum for xrdcp command
2018-07-27 06:09:33,094 | INFO | executing command: xrdcp -np -f --cksum adler32:print root://xrootd.echo.stfc.ac.uk:1094/atlas:datadisk/rucio/mc15_13TeV/6a/54/HITS.06828093._000096.pool.root.1 /pool/condor/dir_11888/condorg_LZ7HXON5/PanDA_Pilot2_4185_1532671698/PanDA_Pilot-4005712217/HITS.06828093._000096.pool.root.1
2018-07-27 06:09:37,420 | INFO | Summary of transferred files:
2018-07-27 06:09:37,420 | INFO | -- lfn=HITS.06828093._000096.pool.root.1, status_code=0, status=transferred
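A hedged sketch of the missing step; the exact stdout format of xrdcp --cksum adler32:print is an assumption here:

import re

def verify_adler32(xrdcp_output, expected_checksum):
    # parse the 8-hex-digit adler32 printed by xrdcp and compare with the catalog value
    match = re.search(r'adler32\W*([0-9a-fA-F]{8})', xrdcp_output)
    return bool(match) and match.group(1).lower() == expected_checksum.lower()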

getuser() fails in containers

Hi @PalNilsson

In singularity containers the user name and UID should be the same inside as outside.

However, in k8s pods, the UID is defined as part of the pod YAML spec. It can be any arbitrary UID (but may need to match a certain range to comply with pod security policy on clusters.) And the user name is equally arbitrary and meaningless as the UID, so generally it is left out to avoid redundancy.

Anyway, in order to add full container support and be able to run in k8s (and to some extent Docker as well), applications should not strictly require running under a named user account; there may be only an arbitrary UID:

bash-4.2$ whoami
whoami: cannot find name for user ID 10000
bash-4.2$ id
uid=10000 gid=10000 groups=10000,1000

Currently the pilot fails in this situation:

---- Ready to run pilot ----

2020-12-16 21:35:50,110 [wrapper] ==== pilot stdout BEGIN ====
2020-12-16 21:35:50,122 [wrapper] pilotpid: 3345
2020-12-16 21:35:53,016 | INFO     | MainThread          | pilot.util.auxiliary             | pilot_version_banner      | ****************************************
2020-12-16 21:35:53,016 | INFO     | MainThread          | pilot.util.auxiliary             | pilot_version_banner      | ***  PanDA Pilot version 2.9.3 (10)  ***
2020-12-16 21:35:53,016 | INFO     | MainThread          | pilot.util.auxiliary             | pilot_version_banner      | ****************************************
2020-12-16 21:35:53,016 | INFO     | MainThread          | pilot.util.auxiliary             | pilot_version_banner      | 
2020-12-16 21:35:53,016 | INFO     | MainThread          | pilot.util.auxiliary             | pilot_version_banner      | pilot is running in a VM
2020-12-16 21:35:53,016 | INFO     | MainThread          | pilot.util.auxiliary             | display_architecture_info | architecture information:
2020-12-16 21:35:53,184 | INFO     | MainThread          | pilot.util.auxiliary             | display_architecture_info | 
LSB Version:	:core-4.1-amd64:core-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.8.2003 (Core)
Release:	7.8.2003
Codename:	Core
2020-12-16 21:35:53,185 | INFO     | MainThread          | pilot.util.auxiliary             | pilot_version_banner      | ****************************************
2020-12-16 21:35:53,216 | DEBUG    | MainThread          | pilot.util.https                 | https_setup               | User-Agent: pilot/2.9.3 (10) (Python 2.7.5; Linux x86_64)
2020-12-16 21:35:53,218 | WARNING  | MainThread          | pilot.util.https                 | https_setup               | Python version <2.7.9 lacks SSL contexts -- falling back to curl
2020-12-16 21:35:53,235 | WARNING  | MainThread          | pilot.info.dataloader            | load_url_data             | cache file=/pilotdir/PanDA_Pilot2_3345_1608154553/queuedata.json is not available: [Errno 2] No such file or directory: '/pilotdir/PanDA_Pilot2_3345_1608154553/queuedata.json' .. skipped
2020-12-16 21:35:53,235 | INFO     | MainThread          | pilot.info.dataloader            | load_url_data             | [attempt=1/3] loading data from url=http://pandaserver.cern.ch:25085/cache/schedconfig/CA-VICTORIA-K8S-TEST-T2.all.json
2020-12-16 21:35:53,605 | INFO     | MainThread          | pilot.info.dataloader            | load_url_data             | saved data from "http://pandaserver.cern.ch:25085/cache/schedconfig/CA-VICTORIA-K8S-TEST-T2.all.json" resource into file=/pilotdir/PanDA_Pilot2_3345_1608154553/queuedata.json, length=3.6Kb
2020-12-16 21:35:53,605 | WARNING  | MainThread          | pilot.info.dataloader            | load_url_data             | cache file=/pilotdir/PanDA_Pilot2_3345_1608154553/cric_pandaqueues.json is not available: [Errno 2] No such file or directory: '/pilotdir/PanDA_Pilot2_3345_1608154553/cric_pandaqueues.json' .. skipped
2020-12-16 21:35:53,606 | INFO     | MainThread          | pilot.info.dataloader            | load_url_data             | [attempt=1/1] loading data from file=/cvmfs/atlas.cern.ch/repo/sw/local/etc/cric_pandaqueues.json
2020-12-16 21:35:54,188 | INFO     | MainThread          | pilot.info.dataloader            | load_url_data             | saved data from "/cvmfs/atlas.cern.ch/repo/sw/local/etc/cric_pandaqueues.json" resource into file=/pilotdir/PanDA_Pilot2_3345_1608154553/agis_schedconf.cvmfs.json, length=1184.0Kb
2020-12-16 21:35:54,236 | INFO     | MainThread          | pilot.info.configinfo            | resolve_queuedata         | queuedata: following keys will be overwritten by config values: {'maxwdir_broken': '14336 MB', 'es_stageout_gap': 601}
2020-12-16 21:35:54,237 | DEBUG    | MainThread          | pilot.info.queuedata             | __init__                  | Final parsed QueueData content:
 acopytools={u'pr': [u'rucio', u'xrdcp'], u'write_lan': [u'rucio'], u'read_lan': [u'rucio', u'xrdcp'], u'pw': [u'rucio']}
 acopytools_schemas={}
 allow_lan=True
 allow_wan=False
 appdir=
 aprotocols=None
 astorages={u'pr': [u'CA-VICTORIA-WESTGRID-T2_DATADISK', u'CA-VICTORIA-WESTGRID-T2_LOCALGROUPDISK', u'CA-VICTORIA-WESTGRID-T2_SCRATCHDISK'], u'es_events': [u'CERN-PROD_ES'], u'read_lan': [u'CA-VICTORIA-WESTGRID-T2_DATADISK', u'CA-VICTORIA-WESTGRID-T2_LOCALGROUPDISK', u'CA-VICTORIA-WESTGRID-T2_SCRATCHDISK'], u'pw': [u'CA-VICTORIA-WESTGRID-T2_DATADISK', u'CA-VICTORIA-WESTGRID-T2_LOCALGROUPDISK', u'CA-VICTORIA-WESTGRID-T2_SCRATCHDISK'], u'write_lan': [u'CA-VICTORIA-WESTGRID-T2_DATADISK', u'CA-VICTORIA-WESTGRID-T2_LOCALGROUPDISK', u'CA-VICTORIA-WESTGRID-T2_SCRATCHDISK']}
 catchall=
 container_options=
 container_type={}
 copytools={u'xrdcp': {u'setup': u'$VO_ATLAS_SW_DIR/local/xrootdsetup.sh'}, u'rucio': {u'setup': u''}}
 corecount=8
 direct_access_lan=False
 direct_access_wan=False
 es_stageout_gap=7200
 is_cvmfs=True
 maxinputsize=20000
 maxrss=32000
 maxtime=86400
 maxwdir=150000
 name=CA-VICTORIA-K8S-TEST-T2
 platform=
 pledgedcpu=0
 resource=CA-VICTORIA-K8S-TEST-T2
 site=CA-VICTORIA-WESTGRID-T2
 state=ACTIVE
 status=test
 timefloor=3600
 type=production
 use_pcache=False

2020-12-16 21:35:54,241 | WARNING  | MainThread          | pilot.info.dataloader            | load_url_data             | cache file=/pilotdir/PanDA_Pilot2_3345_1608154553/cric_ddmendpoints.json is not available: [Errno 2] No such file or directory: '/pilotdir/PanDA_Pilot2_3345_1608154553/cric_ddmendpoints.json' .. skipped
2020-12-16 21:35:54,241 | INFO     | MainThread          | pilot.info.dataloader            | load_url_data             | [attempt=1/1] loading data from file=/cvmfs/atlas.cern.ch/repo/sw/local/etc/cric_ddmendpoints.json
2020-12-16 21:35:54,274 | INFO     | MainThread          | pilot.info.dataloader            | load_url_data             | saved data from "/cvmfs/atlas.cern.ch/repo/sw/local/etc/cric_ddmendpoints.json" resource into file=/pilotdir/PanDA_Pilot2_3345_1608154553/cric_ddmendpoints.json, length=2186.6Kb
2020-12-16 21:35:54,379 | INFO     | MainThread          | __main__                         | main                      | pilot arguments: Namespace(abort_job=<threading._Event object at 0x7f7455668510>, allow_other_country=False, allow_same_user=False, cacert=None, capath=None, cleanup=True, country_group='', debug=True, graceful_stop=<threading._Event object at 0x7f7455668450>, harvester=False, harvester_datadir='', harvester_eventstatusdump='', harvester_submitmode='PULL', harvester_workdir='', harvester_workerattributes='', hpc_mode='manytoone', hpc_resource='', input_dir='', job_aborted=<threading._Event object at 0x7f7455668550>, job_label='managed', job_status={}, jobtype='', kill_time=0, lifetime=324000, mainworkdir='/pilotdir/PanDA_Pilot2_3345_1608154553', nopilotlog=False, output_dir='', pilot_user='ATLAS', port=25443, queue='CA-VICTORIA-K8S-TEST-T2', queuedata_url='', resource=None, resource_type='SCORE_HIMEM', retrieve_next_job=True, signal=None, signal_counter=0, site=None, sourcedir='/pilotdir/atlas_6uTEAmt0', timing={'1': {'PILOT_MULTIJOB_START_TIME': 1608154553.015654}, '0': {'PILOT_START_TIME': 1608154553.015652}}, update_server=True, url='https://pandaserver.cern.ch', use_https=True, verify_proxy=False, version_tag='PR', workdir='/pilotdir', workflow='generic', working_group='')
Traceback (most recent call last):
  File "pilot2/pilot.py", line 557, in <module>
    trace = main()
  File "pilot2/pilot.py", line 85, in main
    workflow = __import__('pilot.workflow.%s' % args.workflow, globals(), locals(), [args.workflow], 0)  # Python 3, -1 -> 0
  File "/pilotdir/atlas_6uTEAmt0/pilot2/pilot/workflow/generic.py", line 32, in <module>
    from pilot.control import job, payload, data, monitor
  File "/pilotdir/atlas_6uTEAmt0/pilot2/pilot/control/monitor.py", line 131, in <module>
    def get_process_info(cmd, user=getuser(), args='aufx', pid=None):
  File "/usr/lib64/python2.7/getpass.py", line 158, in getuser
    return pwd.getpwuid(os.getuid())[0]
KeyError: 'getpwuid(): uid not found: 10000'
2020-12-16 21:35:54,662 [wrapper] ==== pilot stdout END ====
2020-12-16 21:35:54,673 [wrapper] ==== wrapper stdout RESUME ====
2020-12-16 21:35:54,685 [wrapper] Pilot exit status: 1

Even on quite old Linux systems, either the UID or the name can be used with ps -u:

       -u userlist     Select by effective user ID (EUID) or name.
                       This selects the processes whose effective user name or ID is in userlist. The effective user ID describes the user whose file
                       access permissions are used by the process (see geteuid(2)). Identical to U and --user.

so this code should work fine on both traditional systems and k8s containers if UID is used instead of user name, fixing the issue:
https://github.com/PanDAWMS/pilot2/blob/master/pilot/control/monitor.py#L131
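A hedged sketch of the suggested fallback (helper name hypothetical):

import os
from getpass import getuser

def get_user_or_uid():
    # fall back to the numeric UID when the UID has no passwd entry (e.g. in k8s pods)
    try:
        return getuser()
    except KeyError:
        return str(os.getuid())

Since ps -u accepts a numeric UID, passing the result of this helper to the existing ps command should behave identically on traditional systems.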

Thanks!

Retryable errors

The pilot should send a negative error code back to the server for jobs that can be retried, using the same mechanism as in Pilot 1. A function should be added to common/errorcodes.py that negates the error code if the error is retryable; always call this function before reporting an error to the server.
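A minimal sketch, assuming a set of retryable codes is maintained in common/errorcodes.py (the set contents are illustrative only):

RECOVERABLE_ERROR_CODES = set()  # to be filled with the codes deemed retryable

def get_error_code_for_server(error_code):
    # negative codes tell the server that the job may be retried (Pilot 1 convention)
    return -error_code if error_code in RECOVERABLE_ERROR_CODES else error_code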

in control/job.py Harvester setting assumes multiple jobs per pilot

These lines:

pilot2/pilot/control/job.py

Lines 872 to 876 in 2bd9eab

if harvester and jobnumber > 0:
    # unless it's the first job (which is preplaced in the init dir), instruct Harvester to place another job
    # in the init dir
    logger.info('asking Harvester for another job')
    request_new_jobs()

in control/job.py Harvester setting assumes multiple jobs per pilot

We need to differentiate many-to-one running of Harvester from one-to-one running, i.e. many jobs per pilot vs. one job per pilot.

[Question] AGIS PQ fields: container_type setup

Hello,

Our site now runs T1 jobs in Kubernetes containers, and we want to switch to pilot2 and disable Singularity in this queue.
What does the setting container_type="docker:wrapper" in AGIS mean?
Does it mean "the pilot will launch a container and run inside it", or "the pilot is already in a container and then runs"?
Or is there another setting I should use?

Thanks!

Wrong message format

work_attributes['jobMetrics'] = 'core_count=%s n_events=%s db_time=%s db_data=%s workdir_size=%s' % \

jobMetrics should have the format 'coreCount=%s nEvents=%s dbTime=%s dbData=%s workDirSize=%s' instead of 'core_count=%s n_events=%s db_time=%s db_data=%s workdir_size=%s'.
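A hedged sketch of the corrected assignment; the variable names and values are placeholders:

core_count, n_events, db_time, db_data, workdir_size = 8, 1000, 12.3, 4.5, 2048  # placeholders
work_attributes = {}
work_attributes['jobMetrics'] = 'coreCount=%s nEvents=%s dbTime=%s dbData=%s workDirSize=%s' % \
    (core_count, n_events, db_time, db_data, workdir_size)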

hardcoded software release

Noticed in the Titan plugin for the HPC workflow that there are several lines hardcoded for a specific software release. Should this be more general?

src_file = '/ccs/proj/csc108/AtlasReleases/21.0.15/DBRelease/current/sqlite200/ALLP200.db'
src_file_2 = '/ccs/proj/csc108/AtlasReleases/21.0.15/DBRelease/current/geomDB/geomDB_sqlite'

https://github.com/dougbenjamin/pilot2/blob/cf151e829ab8f1d6f429c0e9ca6faa1484882084/pilot/resource/titan.py#L87
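A hedged sketch of a more general version (the environment variable name is hypothetical):

import os

release = os.environ.get('ATLAS_DB_RELEASE', '21.0.15')
base = '/ccs/proj/csc108/AtlasReleases/%s/DBRelease/current' % release
src_file = os.path.join(base, 'sqlite200', 'ALLP200.db')
src_file_2 = os.path.join(base, 'geomDB', 'geomDB_sqlite')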

Renaming of Job description fields

Hi,

We may experience long-term support inconvenience if we use different names for job description fields in the pilot than in other PanDA components. Minor changes like converting to lower case will work well, but changing a field name will work against us:

'PandaID': 'job_id', # it is job id, not PanDA

Buggy calculation of used cores

On a single-core job I see the following in the log:

2019-10-03 10:15:55,351 | DEBUG    | job_monitor         | pilot.util.monitoring            | check_number_used_cores   | ps axo pgid,psr | sort | grep 600 | uniq | wc -l:
6
2019-10-03 10:15:55,351 | DEBUG    | job_monitor         | pilot.util.monitoring            | check_number_used_cores   | set number of actual cores to: 6

Grepping for the process group id is unreliable because it matches any id that contains those digits, e.g. 6001 or 26004.

Maybe psutil is a better alternative?
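A hedged sketch of the psutil alternative (cpu_num is available on Linux; psutil is a third-party dependency, which the pilot currently avoids):

import os
import psutil

def count_used_cores(pgid):
    # count the distinct CPUs currently used by processes in the given process group
    cpus = set()
    for proc in psutil.process_iter(['pid', 'cpu_num']):
        try:
            if os.getpgid(proc.info['pid']) == pgid:
                cpus.add(proc.info['cpu_num'])
        except (psutil.NoSuchProcess, OSError):
            continue
    return len(cpus)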

Improve killing orphans regexp

In Kubernetes we run a Python script to set up the environment and run the pilot wrapper. Once the pilot has executed a job, it invokes the kill-orphans method and kills everything running under the same user, including the startup script. So the startup script can't finish properly and upload the logs.

I'm trying to fall into one of the exceptions, e.g. by renaming my Python script to "pilots_starter.py" (see https://github.com/PanDAWMS/pilot2/blob/master/pilot/util/processes.py#L346). However, I don't think the regexp matches the command properly, since the command can contain whitespace but the regexp cuts off after the first whitespace.

Currently it works like this. See that from "python pilots_starter.py" only "python" is kept:

[root@ae86f0c2304b /]# ps -o pid,ppid,args
  PID  PPID COMMAND
...
   33     1 python pilots_starter.py
[root@ae86f0c2304b /]# python
Python 2.7.5 (default, Apr  9 2019, 14:30:50)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = re.compile('(\d+)\s+(\d+)\s+(\S+)')
>>> line = '33     1 python pilots_starter.py'
>>> ids = pattern.search(line)
>>> print(ids.group(3))
python

Proposal to change the pattern to accept whitespaces for the last block:

>>> pattern = re.compile('(\d+)\s+(\d+)\s+([\S\s]+)')
>>> ids = pattern.search(line)
>>> print(ids.group(3))
python pilots_starter.py

Exception when running multiple jobs per pilot

Pilot version 2.4.2 running with timefloor=60

If the first job finishes before timefloor, the pilot fetches a second job from PanDA; however, the pilot gets an exception and exits gracefully, so the second job never runs and eventually fails with jobDispatcher error 102.

Example:

https://bigpanda.cern.ch/harvesterworkerinfo/?harvesterid=CERN_central_B&workerid=134818652
https://aipanda183.cern.ch/condor_logs_2/20-02-27_23/grid.5231873.3.out

exception caught by thread run() function: (<type 'exceptions.IndexError'>, IndexError('deque index out of range',), <traceback object at 0x2b9dcc353b90>)
Traceback (most recent call last):
  File "/pool/condor/dir_2357/tmp/atlas_6G0RwFAb/pilot2/pilot/common/exception.py", line 434, in run
    self._Thread__target(**self._Thread__kwargs)
  File "/pool/condor/dir_2357/tmp/atlas_6G0RwFAb/pilot2/pilot/control/job.py", line 1921, in job_monitor
    update_time = send_heartbeat_if_time(jobs[i], args, update_time)
IndexError: deque index out of range

None
exception has been put in bucket queue belonging to thread 'job_monitor'
setting graceful stop in 10 s since there is no point in continuing

More examples:
https://bigpanda.cern.ch/jobs/?computingsite=CERN&hours=48&jobstatus=failed&jobdispatchererrorcode=102&display_limit=100

detect_client_location on IPv6 machine

You mention some rucio bug in detect_client_location

def detect_client_location(self): ## TO BE DEPRECATED ONCE RUCIO BUG IS FIXED

but to be honest, I don't know why you force the pilot to fail when this function is unable to resolve an address. When I look at the rucio master branch,

https://github.com/rucio/rucio/blob/f3784d622e7b8a07635871ec99b31f2a0c1927b0/lib/rucio/common/utils.py#L658

their code never fails and just uses some default IP addresses.

Basically, your pilot2 (and also the old pilot) always fails on a machine without an IPv4 address. Could you please update the code not to fail in that situation? Unfortunately, I really don't know which rucio bug you are trying to address by duplicating this function with respect to rucio.common.utils.detect_client_location.
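For reference, a hedged sketch of the Rucio-style behaviour (fall back to defaults instead of raising; the real Rucio implementation may differ in detail):

import socket

def detect_client_location():
    # never fail: fall back to a default address when resolution is impossible
    ip = '0.0.0.0'
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.connect(('8.8.8.8', 80))
        ip = s.getsockname()[0]
        s.close()
    except Exception:
        pass
    return {'ip': ip, 'fqdn': socket.getfqdn()}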

Does HPC workflow - always require MPI?

In the wiki documentation -

The Pilot 2 HPC workflow is a special mode where the application works without a remote connection to PanDA server or other remote facilities. All intercommunications in this case are managed by the Harvester application. Also, in this mode Pilot 2 acts like a simple MPI application, which performs execution of multiple jobs on the computing nodes of the HPC.

What about HPC jobs that do not require MPI?
