eessi / eessi-bot-software-layer
Bot to help with requests to add software installations to the EESSI software layer
License: GNU General Public License v2.0
If a job fails, it may be convenient to investigate only that single job (on a single bot instance). This could be done by updating the PR and then reconfiguring and restarting the bot (or all bots, possibly at different sites) such that when the `bot:label` event is resent, only that particular job is rerun. This is a bit cumbersome.
It would be nicer if one could resubmit the job locally (with or without local modifications, and without changing the PR). To document progress, the reprocessing could still add comments to the original job comment of the PR (adding more rows to the status table).
When the bot (re)starts, for example after a crash, it should resume monitoring any potentially existing jobs.
To implement this, the bot could perhaps do all monitoring of jobs (and processing of finished jobs) in a separate thread.
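A rough sketch of the separate-thread idea (all names here are hypothetical; the real job manager would parse `squeue` output inside `get_jobs`):

```python
import threading

def monitor_jobs(get_jobs, process_finished, interval=60, stop_event=None):
    """Poll the queue in a loop; process jobs that disappeared (= finished)."""
    stop_event = stop_event or threading.Event()
    known = set()
    while not stop_event.is_set():
        current = set(get_jobs())       # e.g. job ids parsed from squeue output
        for job_id in sorted(known - current):
            process_finished(job_id)    # job left the queue => treat as finished
        known = current
        stop_event.wait(interval)       # interruptible sleep

# the event handler could start this in a daemon thread at startup:
# threading.Thread(target=monitor_jobs, args=(get_jobs, on_finished), daemon=True).start()
```

Running this as a daemon thread would also cover the restart case: on startup the bot first rebuilds `known` from the job directories it finds on disk, then the loop takes over.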
Once the bot(s) has(have) built for all targets, an administrator may approve the PR. This event shall be handled by the bot, which then transfers all tarballs to the Stratum 0 server.
A comment on the PR could signal that the transfer has happened.
Currently some information is conveyed via environment variables which are added when executing `sbatch` via a dictionary (passed as a parameter to `subprocess.run`). In the build script these variables are checked for existence and stored in the file `_env` in the job directory. The latter is necessary to hand them to the actual build script, as this runs in the compat layer (the script `run_in_compat_layer.sh`, which clears the environment).
This looks very complicated. We should check which variables we actually need (and in which script), then try to convey the information in a better/easier way.
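A minimal sketch of the current mechanism, assuming the dictionary is passed via `subprocess.run`'s `env` parameter (the helper name is made up):

```python
import os
import subprocess

def submit_with_env(sbatch_cmd, extra_env):
    """Run a submission command with extra variables merged into its environment."""
    env = dict(os.environ)
    env.update(extra_env)  # e.g. {'EESSI_PILOT_VERSION': '2021.12'}
    return subprocess.run(sbatch_cmd, env=env, shell=True,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
```

A simpler alternative might be to write the needed variables directly into the `_env` file from Python, skipping the environment round-trip entirely.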
Failing build jobs will be frequent. With PR #85 there is already a tool to resubmit a build job. Sometimes, it may be very useful to enter an interactive debugging session. An additional tool allowing this could:
The bot's event handler consists of the code (main file `eessi_bot_software_layer.py`) and the run script `run.sh`. The bot's job manager consists of the code (main file `eessi_bot_job_manager.py`) and the run script `job_manager.sh`.
For more consistency, the former's files could be renamed to `eessi_bot_event_handler.py` and `event_handler.sh`.
`tools/args.py` defines the command line arguments. Currently it is used by two programs, neither of which uses all of the arguments. It would be good to support different sets of arguments for different programs/tools; in the future, we may add more tools with other command line arguments.
At the moment, the status table contains at most three lines per launched job:
- Submitted -- added when the job is submitted with `UserHeld` status
- Released -- added when the job manager removes the `UserHeld` status (usually a job then waits some time before it starts)
- Finished -- added when the job manager recognises that the job has ended

Released does not mean the job has started. The job manager could add another line when the job has started.
Ideally using a separate PR per Python file:
- connections/github.py
- eessi_bot_event_handler.py
- eessi_bot_job_manager.py (see #63)
- tasks/build.py (done?)
- tests/test_eessi_bot_job_manager.py
- tests/test_task_build.py
- tools/args.py
- tools/config.py
- tools/logging.py

We should also add a code style check in GitHub Actions, cf. https://github.com/easybuilders/easybuild-framework/blob/develop/.github/workflows/linting.yml
Same here, this can become significantly shorter, I think, by creating a `bot-create-tarball.sh` script that is just run here (and fed a couple of input variables).
Originally posted by @boegel in #10 (comment)
Communicating with GitHub (e.g., adding a comment to a pull request) might fail for various reasons. For example, we once received a `401` response with the message `Bad credentials`. Retrying, by restarting the bot AND redelivering the same event that had led to the comment action, then resulted in a successful operation.
Instead of just failing, the bot could wait some time and then retry the communication with GitHub. Only when it fails again, or repeatedly, should it stop trying and report the problem, e.g., via email.
Currently the bot does not handle errors or exceptions. In the future this is needed to ensure that the bot processes events reliably. Errors might be reported back so that mitigation measures can be taken.
For example, submitting a batch job could fail for various reasons. Another example could be exceeding the available disk space on a resource.
The bot reports status changes, results, etc. as comments on the PR. To update these comments it first needs to identify the comment. Currently (as of PR #24) this is based on the contents of the first line of a PR comment (using a job's ID and the app's name as defined in `app.cfg`). The code is duplicated in several places and should be collected into one function, e.g.,

    def identify_pull_request_comment(pattern: str, repository: str, pr_number: int) -> Tuple[bool, int]:

which receives a search pattern, a repository name and a pull request number as input, and returns a tuple of a boolean (`True`: comment found, `False`: no matching comment found) and a number identifying the first comment to be updated (only meaningful if the boolean is `True`).
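A sketch of such a function using PyGithub; note the extra `gh` parameter for the authenticated client and the `-1` not-found id, both of which are assumptions on our part:

```python
from typing import Tuple

def identify_pull_request_comment(gh, pattern: str, repository: str,
                                  pr_number: int) -> Tuple[bool, int]:
    """Search the first line of each PR comment for `pattern`.

    Returns (True, comment_id) for the first match, (False, -1) otherwise.
    """
    pull_request = gh.get_repo(repository).get_pull(pr_number)
    for comment in pull_request.get_issue_comments():
        first_line = comment.body.split('\n', 1)[0]
        if pattern in first_line:
            return True, comment.id
    return False, -1
```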
At the moment, it seems that the GH account which created the pull request can set a label on the pull request in the target repository. We should have a means to restrict that, e.g., only an owner of the target repository should be able to set that label.
The current structure is as follows:

    event['X-GitHub-Delivery']/run-id/EESSI_PILOT_VERSION/OS_TYPE_ARCH_TARGET/stackfile

where `run-id` is a randomised directory name generated with `mkdtemp` (e.g., `tmpqad1fuqj`), `EESSI_PILOT_VERSION` is the EESSI pilot version for which software shall be built (e.g., 2021.12), and `OS_TYPE_ARCH_TARGET` is the concatenation of `EESSI_OS_TYPE` and `EESSI_SOFTWARE_SUBDIR` with '/' replaced by '_' (e.g., `linux_x86_64_intel_haswell`).
In the future the structure could be as follows
YYYY.MM/pr<id>/jobs/<event_id>/<run_number>/<cpu_target>/<branch of pull request>
and
YYYY.MM/pr<id>/tars/<event_id>/<run_number>/<cpu_target>
(alternatively structure could be simply
YYYY.MM/pr<id>/<event_id>/<run_number>/<cpu_target>/<branch of pull request>
)
with the following intentions:
- `YYYY.MM` represents the year and month in which the event was processed (this could later be used to more easily clean up outdated event handlings)
- `<event_id>/<run_number>` replaces the randomised directory name generated with `mkdtemp` (where the ordering of directories is less obvious)
- `<cpu_target>` is, e.g., `x86_64_amd_zen2` ('/' replaced by '_'); this might not be necessary in the `tars` tree if the tarballs' names include the cpu target/software subdirectory

If the GitHub app does not have the right permissions, some actions will fail hard.
Here, the bot is trying to post a comment in a PR because a job was submitted for that PR, but that is failing because the GitHub app was only configured with read-only permissions for pull requests (should be read+write):
[20221007-T18:22:38] WARNING: A crash occurred!
Traceback (most recent call last):
File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/pyghee/lib.py", line 170, in process_event
self.handle_event(event_info, log_file=log_file)
File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/pyghee/lib.py", line 102, in handle_event
handler(event_info, log_file=log_file)
File "/mnt/shared/home/boegel/eessi-bot-software-layer/eessi_bot_software_layer.py", line 91, in handle_pull_request_event
handler(event_info, pr)
File "/mnt/shared/home/boegel/eessi-bot-software-layer/eessi_bot_software_layer.py", line 61, in handle_pull_request_labeled_event
build_easystack_from_pr(pr, event_info)
File "/mnt/shared/home/boegel/eessi-bot-software-layer/tasks/build.py", line 237, in build_easystack_from_pr
pull_request.create_issue_comment(job_comment)
File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/github/PullRequest.py", line 457, in create_issue_comment
"POST", f"{self.issue_url}/comments", input=post_parameters
File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/github/Requester.py", line 355, in requestJsonAndCheck
verb, url, parameters, headers, input, self.__customConnection(url)
File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/github/Requester.py", line 378, in __check
raise self.__createException(status, responseHeaders, output)
github.GithubException.GithubException: 403 {"message": "Resource not accessible by integration", "documentation_url": "https://docs.github.com/rest/reference/issues#create-an-issue-comment"}
We should check whether all necessary permissions are available at startup, and fail to start the bot event handler/job manager if some permissions are missing...
The job manager uses

    username = os.getlogin()

(see the job manager code below where this is used). In some cases, for example when `sudo`-ing into a user account (say `newuser`), `os.getlogin()` does not return the expected value, so it should be replaced by a different method.
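Two standard-library alternatives that derive the name from the user id or the environment, rather than from the controlling terminal:

```python
import getpass
import os
import pwd

# look up the account name for the current (real) user id; unlike
# os.getlogin(), this does not query the controlling terminal
username = pwd.getpwuid(os.getuid()).pw_name

# getpass.getuser() checks $LOGNAME/$USER/... first and falls back to pwd;
# note that under sudo the environment variables may still name the original
# user unless the environment was reset
# username = getpass.getuser()
```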
Files such as `app.cfg`, `*.log`, `venv_bot_p37` and `*.swo` are specific to the runtime environment. It's better to separate the code environment from the runtime environment.
I've been testing the bot locally for non-EESSI builds, and it works really well, except that the job manager doesn't find the generated tarball (which results in the build being reported as failed). This is because it looks for specific tarball names: `tarball_pattern = "eessi-*software-*.tar.gz"`.
It would be nice if this pattern could be changed in the config file, to make it easier to use the bot for non-EESSI stuff.
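A sketch of reading the pattern from `app.cfg` with a fallback to the current hard-coded value; the section and key names here are assumptions:

```python
import configparser

config = configparser.ConfigParser()
config.read('app.cfg')  # silently ignored if the file does not exist

# use the configured pattern if present, otherwise keep the current default;
# fallback= covers both a missing key and a missing section
tarball_pattern = config.get('buildenv', 'tarball_pattern',
                             fallback='eessi-*software-*.tar.gz')
```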
For example, the bot could act on specific labels that are set/unset/re-set ... by a human in the loop. Labels could be `bot:build`, `bot:deploy`, ...
Currently the job manager considers all jobs in the queue to be bot jobs (which may not be the case, and hence can lead to crashes as described in #33).
The job manager could try to only catch bot jobs in the first place, that is when querying for them. Currently, the query just limits jobs to be considered by the username that runs the bot.
Relevant code in `eessi_bot_job_manager.py`:

    def get_current_jobs(self):
        # who am i
        username = os.getlogin()
        squeue_cmd = '%s --long --user=%s' % (self.poll_command,username)
        log("get_current_jobs(): run squeue command: %s" % squeue_cmd, self.logfile)
Alternative queries could be for an account or a job name. However, one needs to be careful to only catch jobs which are under the control of the user running the job manager. This concerns both interactions with Slurm and access permissions on the file system. Probably, giving bot jobs a descriptive name and then querying for that name (fixed or prefix or other pattern) could be an option.
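A sketch of how the query could be narrowed by job name (the prefix and helper name are hypothetical; note that `squeue --name` matches exact names, so a prefix scheme would still need client-side filtering):

```python
# hypothetical fixed name given to all bot jobs at submission time
BOT_JOB_NAME = 'eessi-bot-build'

def build_squeue_cmd(poll_command, username, job_name=None):
    """Build an squeue command line; optionally filter by exact job name."""
    cmd = '%s --long --user=%s' % (poll_command, username)
    if job_name:
        # --name restricts the query to jobs with exactly this name, so only
        # jobs submitted by the bot (with that name) are returned
        cmd += ' --name=%s' % job_name
    return cmd
```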
In PR #10 the contents of a pull request for EESSI/software-layer are obtained in `tasks/build.py` with the code

    git_clone = 'git clone https://github.com/%s --branch %s --single-branch %s' % (repo_name, branch_name, arch_job_dir)

It was pointed out by @boegel that this may not be correct, because the build then may not include more recent upstream commits.
A better (or the only correct) way would be:

Step 1 could be a simple `git clone https://github.com/REPO_NAME`, where `REPO_NAME` is taken from the `base` of the PR (in PR #10 it is taken from the `head`).

Step 2 could obtain the changes by accessing `https://github.com/REPO_NAME/pull/PR_NUMBER.patch` and then running `patch`; `REPO_NAME` again would be taken from the `base` of the PR, whose number is `PR_NUMBER`.
One should investigate whether PyGitHub provides functionality for executing some of these steps (instead of running `git` commands). It seems PyGitHub is not meant for such operations. Hence, just running the following commands (the `git am` from within `destination_directory`) might do the trick:

    git clone https://github.com/REPO_NAME destination_directory
    curl -L https://github.com/REPO_NAME/pull/PR.patch > PR.patch
    git am PR.patch

Notes: `destination_directory` must be non-existing or empty, or `git clone` fails; `-L` instructs `curl` to follow redirects.
Update documentation to take the more generic bot into account (see #10)
Sometimes it is useful to run an (interactive) test job. Currently the job manager crashes with messages such as below when it encounters one of these jobs.
(venv_bot_p38) [trz42@mgmt eessi-bot-software-layer]$ ./job_manager.sh
job manager just started, logging to '/mnt/shared/home/trz42/pilot.nessi/eessi-bot-software-layer/eessi_bot_job_manager.log', processing job ids ''
Traceback (most recent call last):
File "/usr/lib64/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib64/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/mnt/shared/home/trz42/pilot.nessi/eessi-bot-software-layer/eessi_bot_job_manager.py", line 472, in <module>
main()
File "/mnt/shared/home/trz42/pilot.nessi/eessi-bot-software-layer/eessi_bot_job_manager.py", line 459, in main
job_manager.process_finished_job(known_jobs[fj])
File "/mnt/shared/home/trz42/pilot.nessi/eessi-bot-software-layer/eessi_bot_job_manager.py", line 262, in process_finished_job
repo_name = metadata_pr['repo'] or ''
KeyError: 'repo'
This is very likely because non-bot jobs lack the job's metadata file (named `_bot_jobJOBID.metadata`). The job manager could test whether the file exists, and if not, print a message and ignore the job.
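A sketch of that check, assuming the metadata file is INI-style and parsable with `configparser` (which matches the dict-like `metadata_pr['repo']` access in the traceback); the helper name is made up:

```python
import configparser
import os

def read_job_metadata(job_dir, job_id, log=print):
    """Return parsed metadata for a bot job, or None for non-bot jobs."""
    metadata_path = os.path.join(job_dir, '_bot_job%s.metadata' % job_id)
    if not os.path.isfile(metadata_path):
        # no metadata file => not a job submitted by the bot, skip it
        log("job %s has no metadata file '%s', ignoring it" % (job_id, metadata_path))
        return None
    metadata = configparser.ConfigParser()
    metadata.read(metadata_path)
    return metadata
```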
At the moment, updates include events such as "job has been submitted", "job has been released" and "job has finished". Issue #27 suggests adding information when a job has started running. There may be more important information such as:
We should come up with a directory structure for a repository of an EESSI stack. Assuming we would support several versions of an EESSI stack, e.g., 2021.03, 2021.06, 2021.12, etc, and a complete stack for a single version is defined by one softwarelist.yaml
easystack file, the directory structure could be as simple as
/stacks
- 2021.03
|- softwarelist.yaml
- 2021.06
|- softwarelist.yaml
.
.
.
If we want different stacks for development and production, we could add another directory level or just use a separate repository. The latter would probably be better for supporting different access rights for different stacks.
The structure of `app.cfg` and the use of more features provided by `configparser` could be revisited. For example, `configparser` allows for a `[DEFAULT]` section and template variables.
Also, more could be made configurable, such as the comments added to PRs. Some ideas may be taken from filesystem-layer/PR#90.
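The `[DEFAULT]` section and template-variable (interpolation) features could look like this; the section and key names are made up for illustration:

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[DEFAULT]
app_name = eessi-bot
jobs_base_dir = /scratch/%(app_name)s/jobs

[buildenv]
# inherits app_name and jobs_base_dir from [DEFAULT];
# %(app_name)s is expanded by configparser's default BasicInterpolation
""")
```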
The description could include the semi-automatic workflow, the servers/services involved, tools/scripts involved, repositories which hold sources, software-layer, directories, configuration capabilities, best practices, etc.
After a general introduction it could include sections about setting the bot environment up (across different machines), about updating it and about using it.
It could also cover the interplay of different repositories for development, building and testing in the workflow.
It could also cover some frequently used `git` commands/workflows/setups to help users manage a complex distributed system.
While reviewing #52, I noticed a couple more opportunities for cleaning up/improving the code in `tasks/build.py` (but I didn't want to block merging the PR over these):
- rename the `build_easystack_from_pr` function to `submit_build_jobs` (both in `tasks/build.py` and `eessi_bot_event_handler.py`);
- rename the `create_directory` function to `create_pr_dir` (to avoid confusion with the generic `mkdir` function);
- it looks like the names of the `download_pull_request` and `setup_pr_in_arch_job_dir` functions are switched:
  - the `setup_pr_in_arch_job_dir` function basically only downloads the code that corresponds to a PR into the specified directory (so it should perhaps be named `download_pr_to_dir`?);
  - the `download_pull_request` function sets up the directory for each architecture target (so it should perhaps be named `setup_pr_arch_job_dirs`?).

The interaction between the event handler (`eessi_bot_software_layer.py`) and the job manager (`eessi_bot_job_manager.py`) works as follows: the event handler submits jobs with the parameter `--hold`, then the job manager releases them with the `scontrol` command. The `--hold` parameter is currently provided via the key `slurm_params` in the bot's `app.cfg`.
To avoid accidentally removing that parameter and thereby making the bot or interaction between its components dysfunctional, the parameter should be hardcoded for now.
If a pull request is updated, the bot receives a `pull_request` event with the action `synchronize`. Handling this might be as simple as calling the same function that handles the `opened` action.
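A sketch of that fallback as a small dispatch helper (names hypothetical; the real event handler dispatches inside `handle_pull_request_event`):

```python
def select_handler(action, handlers):
    """Pick a handler for a pull_request action.

    'synchronize' falls back to the 'opened' handler when it has no
    dedicated entry, since an updated PR can be treated like a new one.
    """
    if action == 'synchronize' and 'synchronize' not in handlers:
        action = 'opened'
    return handlers.get(action)
```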
Add CI tests to make sure changes to the bot don't break its functionality.
Make sure that the same container image is used throughout the whole job. That is, it should be downloaded first, and then the downloaded container image should be (re-)used. This should mainly improve the consistency of the build environment.
A further improvement might be to download all needed container images up front (maybe even when the bot starts, or by an admin beforehand).
When testing the bot implementation with the changes in #24, I ran into a crash when handling a `bot:build` label-added event:
[20221007-T18:14:09] WARNING: [event id d67043c0-466b-11ed-96e7-ca8e283f9478] No handler found for event type 'label' (action: created) - event was received but left unhandled!
[20221007-T18:14:39] Event received (id: e8475f70-466b-11ed-9eaa-b259298657c8, type: pull_request, action: labeled), event data logged at /mnt/shared/home/boegel/eessi-bot-software-layer/events_log/pull_request/labeled/2022-10-07/2022-10-07T18-14-39_e8475f70-466b-11ed-9eaa-b259298657c8
[20221007-T18:14:39] Request verified: signature OK!
[20221007-T18:14:39] [event id e8475f70-466b-11ed-9eaa-b259298657c8] Handler found for event type 'pull_request' (action: labeled)
[20221007-T18:14:39] repository: 'boegel/software-layer'
[20221007-T18:14:40] WARNING: A crash occurred!
Traceback (most recent call last):
File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/pyghee/lib.py", line 170, in process_event
self.handle_event(event_info, log_file=log_file)
File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/pyghee/lib.py", line 102, in handle_event
handler(event_info, log_file=log_file)
File "/mnt/shared/home/boegel/eessi-bot-software-layer/eessi_bot_software_layer.py", line 80, in handle_pull_request_event
pr = gh.get_repo(event_info['raw_request_body']['repository']['full_name']).get_pull(event_info['raw_request_body']['pull_request']['number'])
File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/github/MainClass.py", line 330, in get_repo
headers, data = self.__requester.requestJsonAndCheck("GET", url)
File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/github/Requester.py", line 355, in requestJsonAndCheck
verb, url, parameters, headers, input, self.__customConnection(url)
File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/github/Requester.py", line 378, in __check
raise self.__createException(status, responseHeaders, output)
github.GithubException.GithubException: 502 {"message": "Server Error"}
It seems like the connection to GitHub failed for some reason.
We should catch exceptions like these, and handle them more gracefully, maybe by waiting for a while, and trying again a couple of times?
Worked fine on 2nd attempt without any code changes...
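A generic retry helper along those lines (a sketch; in the bot it would wrap the PyGithub calls and probably catch only `GithubException`):

```python
import time

def retry(func, attempts=3, initial_delay=1.0, backoff=2.0,
          exceptions=(Exception,)):
    """Call func(); on failure wait and retry with exponential backoff.

    Re-raises the last exception once all attempts are exhausted.
    """
    delay = initial_delay
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(delay)
            delay *= backoff

# hypothetical usage around the failing call from the traceback:
# retry(lambda: pull_request.create_issue_comment(job_comment))
```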
Despite a successful build, some job steps are considered failed, e.g., when checking with `sacct`. This is likely because some scripts do not explicitly return meaningful exit codes.
Currently (PR #4), the following (lines 103-105) is done to start the app:

    app = create_app(klass=EESSIBotSoftwareLayer)
    log("EESSI bot for software layer started!")
    waitress.serve(app, listen='*:3000')

Ideally this should be largely moved into PyGHee, which would allow us to move the dependency on `waitress` to PyGHee and then just start the app with something like

    app.start(port=3000)
Currently, the bot (PR #4) makes a number of assumptions about the pull request (updates to easystack files, where these are stored, etc.) and how it is handled (requiring essentially a new script to build & install software).
In the future, the bot should simply download the branch of the pull request (for the software-layer repository, this branch by design includes nearly everything necessary for handling the pull request: scripts, configuration, easystack files, etc.) and submit a job that only follows the instructions for manually building an EESSI software stack (https://eessi.github.io/docs/software_layer/build_nodes/). By using existing helper scripts of the software-layer, the job script (likely part of the EESSI/eessi-bot-software-layer repository) can be very generic, not requiring much knowledge of the internals of the EESSI/software-layer repository.
Parts of the bot's documentation (README.md) may be difficult to follow or even outdated. It should be updated after PR#24 has been merged and more people have started using the bot.
The bot should detect when a batch job ends / has ended.
Throughout the job script (`scripts/eessi-bot-build.slurm`) and other scripts, information needs to be logged. Currently, there is no consistent format for these log messages.
Log messages could:
- include a timestamp plus the log message
- e.g., logging "example" should print something like ">> [2022-10-07 13:56] example"
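A minimal Python version of that format (the job script itself is shell, so an equivalent shell function would be needed there):

```python
from datetime import datetime

def log(msg):
    """Print a message in the format '>> [YYYY-MM-DD HH:MM] message'."""
    timestamp = datetime.now().strftime('%Y-%m-%d %H:%M')
    print('>> [%s] %s' % (timestamp, msg))
```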
The bot should be able to add comments to a PR, e.g., when a job has ended or the tarball has been generated.
- https://github.com/settings/apps/NAME_OF_YOUR_BOT/permissions
- https://github.com/settings/installations/12345678
- remove the `bot:build` label and re-add it to test again
- use https://github.com/YOUR_GITHUB_ACCOUNT/software-layer/compare/main...EESSI:software-layer:add-CaDiCaL-9.3.0?expand=1 to make a test PR
to make a test PRCode like this (taken from tasks/build.py
):
git_clone_cmd = ' '.join([
'git clone',
'https://github.com/' + repo_name,
arch_job_dir,
])
log("Clone repo by running '%s' in directory '%s'" % (git_clone_cmd,arch_job_dir))
cloned_repo = subprocess.run(git_clone_cmd,
cwd=arch_job_dir,
shell=True,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
log("Cloned repo!\nStdout %s\nStderr: %s" % (cloned_repo.stdout,cloned_repo.stderr))
should be replaced with a call to a function to run commands, for example something like:
git_clone_cmd = ' '.join([
'git clone',
'https://github.com/' + repo_name,
arch_job_dir,
])
run_cmd(git_clone_cmd, "Clone repo")
The `run_cmd` function would then be responsible for:
- logging the command before running it;
- hiding the interaction with the `subprocess` module;
- logging stdout/stderr (and the exit code) after the command completes.

This should then become a simple call to a `bot-build.sh`
script that also sits in the software-layer
repo (something to follow up on later)
Originally posted by @boegel in #10 (comment)
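A sketch of such a `run_cmd` helper; the signature is an assumption, modelled on the duplicated code above:

```python
import subprocess

def run_cmd(cmd, description, cwd=None, log=print):
    """Run a shell command, logging it before and its output after."""
    log("%s: running '%s' in directory '%s'" % (description, cmd, cwd or '.'))
    result = subprocess.run(cmd, cwd=cwd, shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    log("%s: exit code %d\nstdout: %s\nstderr: %s"
        % (description, result.returncode,
           result.stdout.decode(), result.stderr.decode()))
    return result
```

Callers could then also inspect `result.returncode` to decide whether to abort, which the current duplicated code does not do.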
Currently, the bot's build target CVMFS repository is the default pilot.eessi-hpc.org (as configured in the build container) or any other repository that is configured via the pull request to the software layer. While this works well, it is not straightforward to build one PR for another repository (e.g., for test purposes). Instead of configuring the target repository via a pull request, the bot's configuration could define target repositories and all necessary customisations. Then, the bot could work as follows to set up a build job:
The bot uses the command curl
to fetch a patch file from GitHub. Should we remove that dependency?
Environment variables may provide a convenient way to set configuration options, to hide sensitive information in PR comments, and yet provide easy access to job information.
For example, there could be an environment variable `BOT_BASE_DIR` which defines where job directories and symlinks are placed on the filesystem. It could then be used in comments on PRs to the software-layer (thereby hiding information about the account names involved). If it is defined in `app.cfg`, there could be a simple tool or shell/python function which exports this variable or derives the actual location of a job (given just a job id).
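For example, a small helper could derive a job's location from its id; the `BOT_BASE_DIR/jobs/<job_id>` layout here is purely hypothetical:

```python
import os

def job_dir(job_id, base_dir=None):
    """Derive a job's working directory from BOT_BASE_DIR and its job id."""
    base = base_dir or os.environ.get('BOT_BASE_DIR', os.getcwd())
    return os.path.join(base, 'jobs', str(job_id))
```

A PR comment could then show `$BOT_BASE_DIR/jobs/1234` instead of the full path, keeping account names out of public comments.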
The function `build_easystack_from_pr` in `tasks/build.py` has grown quite long over time. Refactoring it into smaller functions should make the code more readable and help with reviewing future pull requests. Smaller functions might then also be reused by other components of the bot.
After #59 is merged
For example in eessi_bot_job_manager.py
(maybe also in other places)