
eessi-bot-software-layer's People

Contributors

bedroge, boegel, hafsa-naeem, hafsanaeem-2, jacobz137, jonas-lq, larappr, neves-bot, neves-p, ocaisa, poksumdo, truib, trz42


eessi-bot-software-layer's Issues

Add a tool for resubmitting jobs locally

If a job fails, it may be convenient to investigate only that single job (on a single bot instance). This could be done by updating the PR, then reconfiguring and restarting the bot (or all bots, possibly at different sites) such that, when the bot:label event is resent, only that particular job is rerun. This is a bit cumbersome.

It would be nicer if one could resubmit the job locally (possibly with local modifications, without changing the PR). To document progress, the reprocessing could still add comments to the original job comment in the PR (adding more rows to the status table).

Monitor potentially existing jobs after bot is (re)started

When the bot (re)starts, for example, after a crash, it should resume monitoring potentially existing jobs.

To implement this,

  • the bot probably needs to persist information about jobs on disk (after they have been submitted),
  • this job handling likely needs to be done in a separate thread, such that the bot can still receive and handle new events.

Maybe the bot could do all monitoring of jobs (and processing of finished jobs) in a separate thread?
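
A minimal sketch of how such persistence could look; the file name and record layout are assumptions, not the bot's actual format:

    import json
    import os

    JOBS_FILE = 'submitted_jobs.json'  # hypothetical location (e.g., on shared disk)

    def persist_job(job_id, job_dir):
        """Record a submitted job so it can be re-monitored after a restart."""
        jobs = load_jobs()
        jobs[job_id] = {'job_dir': job_dir}
        with open(JOBS_FILE, 'w') as jobs_fp:
            json.dump(jobs, jobs_fp)

    def load_jobs():
        """Return previously recorded jobs, or an empty dict on first start."""
        if os.path.isfile(JOBS_FILE):
            with open(JOBS_FILE) as jobs_fp:
                return json.load(jobs_fp)
        return {}

At startup, the job manager could then call load_jobs() and resume monitoring each entry.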

Implement a new handler for approved PR

Once the bot(s) has(have) built for all targets, an administrator may approve the PR. The event for this shall be handled by the bot, leading to the transfer of all tarballs to the Stratum 0 server.

A comment on the PR could signal that the transfer has happened.

Improve way information is handed from bot to build job

Currently, some information is conveyed via environment variables: they are added to the environment of the sbatch call via a dictionary (the env parameter of subprocess.run), and the build script then checks them for existence and stores them in the file _env in the job directory. The latter is necessary to hand them over to the actual build script, as that runs in the compat layer (the script run_in_compat_layer.sh clears the environment).

This looks very complicated. We should check which variables we actually need (and in which script), then try to convey the information in a better/easier way.
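
For reference, a minimal sketch of the current mechanism; the variable names are examples based on the directory-structure issue below, and the job script name is taken from the bot's scripts:

    import os
    import subprocess

    # pass selected variables into the submitted job via the env parameter
    job_env = os.environ.copy()
    job_env.update({
        'EESSI_PILOT_VERSION': '2021.12',
        'EESSI_SOFTWARE_SUBDIR': 'x86_64/intel/haswell',
    })
    subprocess.run(['sbatch', 'scripts/eessi-bot-build.slurm'], env=job_env)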

Tools to easily enter interactive debugging session

Failing build jobs will be frequent. With PR #85 there is already a tool to resubmit a build job. Sometimes, it may be very useful to enter an interactive debugging session. An additional tool allowing this could:

  • be given a job directory and stop at a specific stage, e.g., when the build container was launched, before EasyBuild was installed, after EasyBuild was installed, before/after installing a certain package, etc.
  • potentially pick up the status of a failing job with the changes made to the upper layer of the fuse-overlay in place (to investigate the state when a build job failed, to quickly start debugging work without rebuilding packages that worked)
  • jump to a snapshot in the build process to start debugging from there (for each package built there could be a snapshot of the changes, which could then be used to select a specific state within the build procedure)

Rename `eessi_bot_software_layer.py`

The bot's event handler consists of the code (main file eessi_bot_software_layer.py) and the run script run.sh. The bot's job manager consists of the code (main file eessi_bot_job_manager.py) and the run script job_manager.sh.

For more consistency, the former's files could be renamed to eessi_bot_event_handler.py and event_handler.sh.

Separate handling of command line arguments

tools/args.py defines the command line arguments. Currently it is used by two programs, neither of which uses all of the arguments. It would be good to support different sets of arguments for different programs/tools, since we may add more tools with other command line arguments in the future.
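
One way to do this is a shared base parser plus per-program parsers; a minimal sketch (all argument names are hypothetical):

    import argparse

    def base_parser():
        """Arguments shared by all bot components."""
        parser = argparse.ArgumentParser(add_help=False)
        parser.add_argument('--config', default='app.cfg', help='path to configuration file')
        return parser

    def event_handler_parser():
        """Arguments specific to the event handler."""
        parser = argparse.ArgumentParser(parents=[base_parser()])
        parser.add_argument('--port', type=int, default=3000, help='port to listen on')
        return parser

    def job_manager_parser():
        """Arguments specific to the job manager."""
        parser = argparse.ArgumentParser(parents=[base_parser()])
        parser.add_argument('--jobs', help='comma-separated list of job ids to process')
        return parser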

Update PR comments when job starts running

At the moment, for each launched job, the status table contains (at most) three lines:

  • Submitted -- added when the job is submitted with UserHeld status
  • Released -- added when the job manager removes the UserHeld status (usually the job then waits some time before it starts)
  • Finished -- added when the job manager recognises that the job has ended

Released does not mean the job has started. The job manager could add another line (e.g., Running) when the job has actually started.

Retry communication with GitHub if first attempt failed

Communicating with GitHub (e.g., adding a comment to a pull request) might fail for various reasons. For example, we once received a 401 response with the message Bad credentials. Then, retrying by restarting the bot AND redelivering the same event that eventually led to the comment action resulted in a successful operation.

Instead of just failing, the bot could wait some time and then retry the communication with GitHub. Only if it fails again, or repeatedly, should it stop trying and report the problem, e.g., via email.
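
A minimal sketch of such a retry wrapper (the function name, attempt count and delay are assumptions):

    import time

    from github import GithubException

    def with_retries(action, max_attempts=3, delay_secs=30):
        """Run action(); on a GitHub error, wait a while and try again."""
        for attempt in range(1, max_attempts + 1):
            try:
                return action()
            except GithubException:
                if attempt == max_attempts:
                    raise  # give up; caller reports the problem (e.g., via email)
                time.sleep(delay_secs)

    # usage, e.g.:
    #   with_retries(lambda: pull_request.create_issue_comment(job_comment))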

Improve bot by handling errors / exceptions properly

Currently the bot does not handle errors or exceptions. In the future, this is needed to ensure that the bot processes events reliably. Errors might be reported back such that mitigation measures could be taken.

For example, submitting a batch job could fail for various reasons. Another example could be running out of disk space on a resource.

Function for identifying PR comment to be updated

The bot reports status changes, results, etc. as comments on the PR. To update these comments, it first needs to identify the comment in question. Currently (as of PR #24) this is based on the contents of the first line of a PR comment (using a job's ID and the app's name, which is defined in app.cfg). The code is duplicated in several places and should be collected into one function, e.g.,

def identify_pull_request_comment(pattern: str, repository: str, pr_number: int) -> Tuple[bool, int]:

which receives a search pattern, a repository name and a pull request number as input, and returns a tuple of a return value (True - comment found, False - no matching comment found) and a number identifying the first comment to be updated (only meaningful if the return value is True).
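
A minimal sketch of such a function using PyGitHub, assuming an authenticated Github instance is available via a helper (here called get_github(), a hypothetical name):

    import re
    from typing import Tuple

    def identify_pull_request_comment(pattern: str, repository: str, pr_number: int) -> Tuple[bool, int]:
        """Find the first PR comment whose first line matches pattern."""
        gh = get_github()  # hypothetical helper returning an authenticated Github instance
        pull_request = gh.get_repo(repository).get_pull(pr_number)
        for comment in pull_request.get_issue_comments():
            first_line = comment.body.split('\n', 1)[0]
            if re.search(pattern, first_line):
                return True, comment.id
        return False, -1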

Restrict who can trigger the building of packages

At the moment, it seems that the GH account which created the pull request can set a label on the pull request in the target repository. We should have a means to restrict that, e.g., such that only an owner of the target repository is able to set that label.

  1. Determine whether this can be configured for the target repository. ... if not -> 2.
  2. Configure the bot such that it only handles a specific event (type) if it was created by a member of a specific group / list of GH accounts (see the sketch below).
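
A minimal sketch of option 2, assuming an allow list read from the bot's app.cfg; the variable and account names are hypothetical, while the payload access follows the event_info structure shown elsewhere in these issues:

    ALLOWED_ACCOUNTS = ['some_account', 'another_account']  # e.g., read from app.cfg

    def handle_pull_request_labeled_event(self, event_info, pr):
        # the webhook payload includes the account that triggered the event
        sender = event_info['raw_request_body']['sender']['login']
        if sender not in ALLOWED_ACCOUNTS:
            log("ignoring label event from '%s' (not in allow list)" % sender)
            return
        # ... proceed with handling the event ...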

Improve directory structure for processing pull requests

The current structure is as follows:

event['X-GitHub-Delivery']/run-id/EESSI_PILOT_VERSION/OS_TYPE_ARCH_TARGET/stackfile

where run-id is a randomised directory name generated with mkdtemp (e.g., tmpqad1fuqj), EESSI_PILOT_VERSION is the EESSI pilot version for which software shall be built (e.g., 2021.12), and OS_TYPE_ARCH_TARGET is a concatenation of EESSI_OS_TYPE and EESSI_SOFTWARE_SUBDIR with '/' replaced by '_' (e.g., linux_x86_64_intel_haswell).

In the future, the structure could be as follows:

YYYY.MM/pr<id>/jobs/<event_id>/<run_number>/<cpu_target>/<branch of pull request> and
YYYY.MM/pr<id>/tars/<event_id>/<run_number>/<cpu_target>

(alternatively, the structure could simply be
YYYY.MM/pr<id>/<event_id>/<run_number>/<cpu_target>/<branch of pull request>)

with the following intentions

  • YYYY.MM -- the year and month when the event was processed (could later be used to more easily clean up outdated event processing)
  • pr<id> -- just the id of the pull request, so it can be investigated more easily
  • jobs -- all jobs for this pull request
  • tars -- all artifacts that were built and possibly are to be transferred to the Stratum 0 server
  • <event_id> -- id of the event (event['X-GitHub-Delivery'] in the version represented by #4)
  • <run_number> -- a number representing the delivery of the event (particularly during development, each event could be delivered multiple times); in the version in #4 this is a string generated by mkdtemp (where the order may be less obvious)
  • <cpu_target> -- a name representing the SOFTWARE SUBDIR with '/' replaced by '_', e.g., x86_64_amd_zen2; might not be necessary for the tars tree if the tarballs' names include the cpu target/software subdirectory
  • <branch of pull request> -- the full contents of the branch of the pull request

verify at startup whether necessary permissions are available

If the GitHub app does not have the right permissions, some actions will fail hard.

Here, the bot is trying to post a comment in a PR because a job was submitted for that PR, but that's failing because the GitHub app was only configured with read-only permissions for pull requests (should be read+write):

[20221007-T18:22:38] WARNING: A crash occurred!
Traceback (most recent call last):
  File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/pyghee/lib.py", line 170, in process_event
    self.handle_event(event_info, log_file=log_file)
  File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/pyghee/lib.py", line 102, in handle_event
    handler(event_info, log_file=log_file)
  File "/mnt/shared/home/boegel/eessi-bot-software-layer/eessi_bot_software_layer.py", line 91, in handle_pull_request_event
    handler(event_info, pr)
  File "/mnt/shared/home/boegel/eessi-bot-software-layer/eessi_bot_software_layer.py", line 61, in handle_pull_request_labeled_event
    build_easystack_from_pr(pr, event_info)
  File "/mnt/shared/home/boegel/eessi-bot-software-layer/tasks/build.py", line 237, in build_easystack_from_pr
    pull_request.create_issue_comment(job_comment)
  File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/github/PullRequest.py", line 457, in create_issue_comment
    "POST", f"{self.issue_url}/comments", input=post_parameters
  File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/github/Requester.py", line 355, in requestJsonAndCheck
    verb, url, parameters, headers, input, self.__customConnection(url)
  File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/github/Requester.py", line 378, in __check
    raise self.__createException(status, responseHeaders, output)
github.GithubException.GithubException: 403 {"message": "Resource not accessible by integration", "documentation_url": "https://docs.github.com/rest/reference/issues#create-an-issue-comment"}

We should check at startup whether all necessary permissions are available, and refuse to start the bot event handler/job manager if some permissions are missing.
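
A minimal sketch of such a check against the GitHub REST API (GET /app returns the app's granted permissions when authenticated with the app's JWT); the required permission set shown is an example:

    import requests

    def missing_app_permissions(jwt_token, required):
        """Return the permissions from required (e.g., {'pull_requests': 'write'})
        that the GitHub App has not been granted."""
        response = requests.get(
            'https://api.github.com/app',
            headers={'Authorization': 'Bearer %s' % jwt_token,
                     'Accept': 'application/vnd.github+json'},
        )
        response.raise_for_status()
        granted = response.json().get('permissions', {})
        rank = {'read': 1, 'write': 2, 'admin': 3}
        return [perm for perm, level in required.items()
                if rank.get(granted.get(perm), 0) < rank[level]]

The event handler/job manager could call this at startup and refuse to start if the returned list is non-empty.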

Make pattern for tarball name configurable

I've been testing the bot locally for non-EESSI builds, and it works really well, except that the job manager doesn't find the generated tarball (which results in the build being reported as failed). This is because it's looking for specific tarball names: tarball_pattern = "eessi-*software-*.tar.gz".
It would be nice if this pattern could be changed in the config file, to make it easier to use the bot for non-EESSI stuff.
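
A minimal sketch of what that could look like in app.cfg (the section name is an assumption):

    [job_manager]
    # pattern used to locate the tarball produced by a build job;
    # defaults to the EESSI naming scheme
    tarball_pattern = eessi-*software-*.tar.gz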

Only consider bot jobs to be further processed by job manager

Currently the job manager considers all jobs in the queue to be bot jobs (which may not be the case, and hence can lead to crashes as described in #33).

The job manager could try to only catch bot jobs in the first place, i.e., when querying for them. Currently, the query merely limits the jobs considered to those of the username that runs the bot.

Relevant code in eessi_bot_job_manager.py

    def get_current_jobs(self):
        # who am i
        username = os.getlogin()

        squeue_cmd = '%s --long --user=%s' % (self.poll_command,username)
        log("get_current_jobs(): run squeue command: %s" % squeue_cmd, self.logfile)

Alternative queries could be based on an account or a job name. However, one needs to be careful to only catch jobs which are under the control of the user running the job manager; this concerns both interactions with Slurm and access permissions on the file system. Giving bot jobs a descriptive name and then querying for that name (fixed, prefix, or another pattern) could be an option, as sketched below.
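
A minimal sketch, assuming bot jobs are submitted with a fixed job name (via sbatch --job-name; the name eessi-bot-build is hypothetical):

    def get_current_jobs(self):
        # who am i
        username = os.getlogin()

        # restrict the query to this user's jobs that carry the bot's job name,
        # so interactive/test jobs are never picked up
        squeue_cmd = '%s --long --user=%s --name=%s' % (
            self.poll_command, username, 'eessi-bot-build')
        log("get_current_jobs(): run squeue command: %s" % squeue_cmd, self.logfile)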

Change method for fetching PR

In PR #10 the contents of a pull request for EESSI/software-layer are obtained in tasks/build.py with the code

        git_clone = 'git clone https://github.com/%s --branch %s --single-branch %s' % (repo_name, branch_name, arch_job_dir)

It was pointed out by @boegel that this may not be correct, because the build then may not include more recent upstream commits.

A better (or the only correct) way would be:

  1. Clone the base branch (target of the pull request), and
  2. Apply the changes of the pull request on top.

Step 1 could be a simple git clone https://github.com/REPO_NAME where REPO_NAME is taken from the base of the PR (in PR #10 it's taken from the head).

Step 2 could obtain the changes by downloading https://github.com/REPO_NAME/pull/PR_NUMBER.patch and applying that patch; REPO_NAME would again be taken from the base of the PR, and PR_NUMBER is the number of the pull request.

One should investigate whether PyGitHub provides functionality for executing some of these steps (instead of running git commands). It seems PyGitHub is not meant for such operations. Hence, resorting to just running the following commands might do the "trick":

git clone https://github.com/REPO_NAME destination_directory
curl -L https://github.com/REPO_NAME/pull/PR_NUMBER.patch > PR.patch
cd destination_directory
git am ../PR.patch

Notes: destination_directory must be non-existent or empty, or git clone fails; -L instructs curl to follow redirects; git am must be run inside the cloned repository.

Job manager crashes when it encounters a non-bot job

Sometimes it is useful to run an (interactive) test job. Currently, the job manager crashes with messages such as those below when it encounters one of these jobs.

(venv_bot_p38) [trz42@mgmt eessi-bot-software-layer]$ ./job_manager.sh
job manager just started, logging to '/mnt/shared/home/trz42/pilot.nessi/eessi-bot-software-layer/eessi_bot_job_manager.log', processing job ids ''
Traceback (most recent call last):
  File "/usr/lib64/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/shared/home/trz42/pilot.nessi/eessi-bot-software-layer/eessi_bot_job_manager.py", line 472, in <module>
    main()
  File "/mnt/shared/home/trz42/pilot.nessi/eessi-bot-software-layer/eessi_bot_job_manager.py", line 459, in main
    job_manager.process_finished_job(known_jobs[fj])
  File "/mnt/shared/home/trz42/pilot.nessi/eessi-bot-software-layer/eessi_bot_job_manager.py", line 262, in process_finished_job
    repo_name = metadata_pr['repo'] or ''
KeyError: 'repo'

This is very likely because non-bot jobs lack the job metadata file (named _bot_jobJOBID.metadata). The job manager could test whether the file exists and, if it does not, print a message and ignore the job.
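
A minimal sketch of that check; the path construction is an assumption based on the file name mentioned above:

    import os

    # e.g., at the start of process_finished_job
    metadata_file = os.path.join(job_dir, '_bot_job%s.metadata' % job_id)
    if not os.path.isfile(metadata_file):
        # not a bot job: report it and skip further processing
        log("job %s has no metadata file '%s', ignoring it"
            % (job_id, metadata_file), self.logfile)
        return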

Update PR comments with important job information

At the moment, updates include events such as: job has been submitted, job has been released, and job has finished. Issue #27 suggests adding information when a job has started running. There may be more important information, such as:

  • job state indicates it will never start
  • preparation of the job actually failed, e.g., a patch could not be applied
  • add progress information, e.g., 3 of 5 packages built and/or which packages have been built
  • add information about resource usage (CPU utilization, disk read/write, RAM use, ...)
  • add information about time left
  • add billing information (bot overview could contain information about resources left)

Directory structure for repository of an EESSI stack

We should come up with a directory structure for a repository of an EESSI stack. Assuming we would support several versions of an EESSI stack, e.g., 2021.03, 2021.06, 2021.12, etc, and a complete stack for a single version is defined by one softwarelist.yaml easystack file, the directory structure could be as simple as

/stacks
|- 2021.03
|  |- softwarelist.yaml
|- 2021.06
|  |- softwarelist.yaml
|- ...

If we want to have different stacks for development and production, we could add another directory level or just use a separate repository. The latter would probably be better for supporting different access rights for different stacks.

Describe environment for bot++

The description could include the semi-automatic workflow, the servers/services involved, tools/scripts involved, repositories which hold sources, software-layer, directories, configuration capabilities, best practices, etc.

After a general introduction it could include sections about setting the bot environment up (across different machines), about updating it and about using it.

It could also cover the interplay of different repositories for development, building and testing in the workflow.

It could also cover some frequently used git commands/workflows/setups to help users manage a complex distributed system.

more refactoring/renaming in tasks/build.py

While reviewing #52, I noticed a couple more opportunities for cleaning up/improving the code in tasks/build.py (but I didn't want to block merging the PR over these):

  • rename build_easystack_from_pr function to submit_build_jobs (both in tasks/build.py and eessi_bot_event_handler.py);
  • rename create_directory function to create_pr_dir (to avoid confusion with the generic mkdir function);
  • it seems like the names of the download_pull_request and setup_pr_in_arch_job_dir functions are switched?
    • the setup_pr_in_arch_job_dir function basically only downloads the code that corresponds to a PR into the specified directory (so that should be named download_pr_to_dir?);
    • the download_pull_request function sets up the directory for each architecture target (so that should be named setup_pr_arch_job_dirs?);

Hardcode holding of jobs

The interaction between the event handler (eessi_bot_software_layer.py) and the job manager (eessi_bot_job_manager.py) is facilitated by the event handler submitting jobs with the parameter --hold; the job manager then releases the jobs with the command scontrol. The parameter --hold is currently provided via the key slurm_params in the bot's app.cfg.

To avoid accidentally removing that parameter and thereby making the bot, or the interaction between its components, dysfunctional, the parameter should be hardcoded for now.

Make sure the same container image is used

Make sure that the same container image is used throughout the whole job. That is, it should be downloaded first, and then the downloaded container image should be (re)used.

This should mainly improve the consistency of the build environment.

A further improvement might be to download all needed container images up front (maybe even at the start of the bot, or by an admin beforehand).
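
A minimal sketch of the download-once-then-reuse pattern, assuming Singularity/Apptainer; the cache location, image URL and script name are examples:

    CONTAINER_CACHE=${HOME}/container-cache
    mkdir -p ${CONTAINER_CACHE}

    # download the image once ...
    singularity pull ${CONTAINER_CACHE}/build-node.sif docker://ghcr.io/eessi/build-node:debian10

    # ... then (re)use the local copy for every container invocation in the job
    singularity exec ${CONTAINER_CACHE}/build-node.sif ./run_build_step.sh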

handle connection error to GitHub gracefully

When testing the bot implementation with the changes in #24, I ran into a crash when handling a bot:build label added event:

[20221007-T18:14:09] WARNING: [event id d67043c0-466b-11ed-96e7-ca8e283f9478] No handler found for event type 'label' (action: created) - event was received but left unhandled!
[20221007-T18:14:39] Event received (id: e8475f70-466b-11ed-9eaa-b259298657c8, type: pull_request, action: labeled), event data logged at /mnt/shared/home/boegel/eessi-bot-software-layer/events_log/pull_request/labeled/2022-10-07/2022-10-07T18-14-39_e8475f70-466b-11ed-9eaa-b259298657c8
[20221007-T18:14:39] Request verified: signature OK!
[20221007-T18:14:39] [event id e8475f70-466b-11ed-9eaa-b259298657c8] Handler found for event type 'pull_request' (action: labeled)
[20221007-T18:14:39] repository: 'boegel/software-layer'
[20221007-T18:14:40] WARNING: A crash occurred!
Traceback (most recent call last):
  File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/pyghee/lib.py", line 170, in process_event
    self.handle_event(event_info, log_file=log_file)
  File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/pyghee/lib.py", line 102, in handle_event
    handler(event_info, log_file=log_file)
  File "/mnt/shared/home/boegel/eessi-bot-software-layer/eessi_bot_software_layer.py", line 80, in handle_pull_request_event
    pr = gh.get_repo(event_info['raw_request_body']['repository']['full_name']).get_pull(event_info['raw_request_body']['pull_request']['number'])
  File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/github/MainClass.py", line 330, in get_repo
    headers, data = self.__requester.requestJsonAndCheck("GET", url)
  File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/github/Requester.py", line 355, in requestJsonAndCheck
    verb, url, parameters, headers, input, self.__customConnection(url)
  File "/mnt/shared/home/boegel/.local/lib/python3.6/site-packages/github/Requester.py", line 378, in __check
    raise self.__createException(status, responseHeaders, output)
github.GithubException.GithubException: 502 {"message": "Server Error"}

It seems like the connection to GitHub failed for some reason.

We should catch exceptions like these and handle them more gracefully, maybe by waiting for a while and then trying again a couple of times?

It worked fine on the 2nd attempt, without any code changes...

Improve return values of Slurm jobs

Despite a successful build, some job steps are considered failed, e.g., when checking with sacct. This is likely because some scripts do not explicitly return meaningful exit codes.
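
A minimal sketch of how a job-step script could propagate a meaningful exit code (the step command is a placeholder):

    #!/bin/bash
    # run the actual work of this job step (placeholder command)
    ./run_build_step.sh
    exit_code=$?

    echo "job step finished with exit code ${exit_code}"
    # make the step's exit code visible to Slurm (and thus to sacct)
    exit ${exit_code}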

Improve start of app (in PyGHee)

Currently (PR #4), the following (lines 103-105) is done to start the app

app = create_app(klass=EESSIBotSoftwareLayer)
log("EESSI bot for software layer started!")
waitress.serve(app, listen='*:3000')

Ideally, this should largely be moved to PyGHee, which would allow us to move the dependency on waitress into PyGHee and then start the app with something like

app.start(port=3000)

Make handling of pull requests more generic

Currently, the bot (PR #4) makes a number of assumptions about the pull request (updates to easystack files, where these are stored, etc.) and about how it is handled (essentially requiring a new script to build & install software).

In the future, the bot should simply download the branch of the pull request (for the software-layer repository, this branch by design includes nearly everything necessary for handling the pull request: scripts, configuration, easystack files, etc.) and submit a job that just follows the instructions for manually building an EESSI software stack (https://eessi.github.io/docs/software_layer/build_nodes/). By using existing helper scripts of the software-layer repository, the job script (likely part of the EESSI/eessi-bot-software-layer repository) can be very generic and does not require much knowledge of the internals of the EESSI/software-layer repository.

Polish documentation for running the bot

Parts of the bot's documentation (README.md) may be difficult to follow or even outdated. It should be updated after PR #24 has been merged and more people have started using the bot.

Implement log function in job script

Throughout the job script (scripts/eessi-bot-build.slurm) and other scripts, information needs to be logged. Currently, there is no consistent format for these log messages.

Log messages could:
  • include a timestamp plus the log message
  • e.g., log "example" should print something like ">> [2022-10-07 13:56] example"

update README file with missing steps

  • clarify section 5.3
  • changes to app configuration file
    • cvmfs_customizations
    • use absolute filepath for jobs_base_dir + jobs_ids_dir
  • permissions for GitHub App have to be correct
    • read+write permission for pull requests in https://github.com/settings/apps/NAME_OF_YOUR_BOT/permissions
    • make sure that new permissions are accepted for the installation of the bot (if needed) - see https://github.com/settings/installations/12345678
  • how to test the bot
    • open a PR, add the bot:build label, and close it after testing; or delete the bot:build label and re-add it to test again
    • can use https://github.com/YOUR_GITHUB_ACCOUNT/software-layer/compare/main...EESSI:software-layer:add-CaDiCaL-9.3.0?expand=1 to make a test PR
    • clarify need for two GitHub accounts to test with (or not)

implement dedicated function to run commands

Code like this (taken from tasks/build.py):

git_clone_cmd = ' '.join([
    'git clone',
    'https://github.com/' + repo_name,
    arch_job_dir,
])
log("Clone repo by running '%s' in directory '%s'" % (git_clone_cmd,arch_job_dir))
cloned_repo = subprocess.run(git_clone_cmd,
                             cwd=arch_job_dir,
                             shell=True,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
log("Cloned repo!\nStdout %s\nStderr: %s" % (cloned_repo.stdout,cloned_repo.stderr))

should be replaced with a call to a function to run commands, for example something like:

git_clone_cmd = ' '.join([
    'git clone',
    'https://github.com/' + repo_name,
    arch_job_dir,
])
run_cmd(git_clone_cmd, "Clone repo")

The run_cmd function would then be responsible for the following (a sketch follows the list):

  • logging a message that a command is being run;
  • running the command via Python's subprocess module;
  • collecting the output (stdout + stderr separately?);
  • checking the exit code of the command, logging an error if it's non-zero;
  • return output + exit code;
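
A minimal sketch of such a run_cmd function (the exact signature and return value are up for discussion):

    import subprocess

    def run_cmd(cmd, action_desc, work_dir=None):
        """Log and run a shell command; return (stdout, stderr, exit code)."""
        log("%s: running '%s' in directory '%s'" % (action_desc, cmd, work_dir))
        result = subprocess.run(cmd, cwd=work_dir, shell=True,
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if result.returncode != 0:
            log("ERROR: %s failed with exit code %d: %s"
                % (action_desc, result.returncode, result.stderr))
        return result.stdout, result.stderr, result.returncode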

Enable the bot to build for specific CVMFS repo

Currently, the bot's build target CVMFS repository is the default pilot.eessi-hpc.org (as configured in the build container) or any other repository that is configured via the pull request to the software layer. While this works well, it is not so straightforward to build one PR for another repository (e.g., for test purposes). Instead of configuring the target repository via a pull request, the bot's configuration could define target repositories and all necessary customisations (see the sketch after the list below). Then, the bot could work as follows to set up a build job:

  1. Obtain PR content.
  2. Apply changes defined in bot's configuration (the bot could have a default target which does not apply any changes to the PR content ... this would allow that also on the side of the PR one could still built for any repository).
  3. Submit job.
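
A minimal sketch of what such target definitions could look like in app.cfg; all section and key names here are assumptions:

    [repo pilot.eessi-hpc.org]
    # default target: no changes applied to the PR contents

    [repo eessi-test.example.org]
    # customisations applied on top of the PR contents
    cvmfs_customizations = ...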

Make use of environment variables in app.cfg and PR comments

Environment variables may provide a convenient way to set configuration options, to hide sensitive information in PR comments and yet provide easy access to job information.

For example, there could be an environment variable BOT_BASE_DIR which defines where job directories and symlinks are placed on the filesystem. This could then be used in comments on PRs to the software-layer repository (thereby hiding information about the account names involved). If it is defined in app.cfg, there could be a simple tool or shell/Python function which exports this variable or derives the actual location of a job (given just a job id).
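
A minimal sketch of such a shell helper; the value of BOT_BASE_DIR and the jobs/ sub-directory layout are assumptions:

    export BOT_BASE_DIR=/path/to/bot/base/dir

    # derive the location of a job directory from a job id
    job_dir() {
        echo "${BOT_BASE_DIR}/jobs/$1"
    }

    # usage: cd "$(job_dir 2589)"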

Refactor `build_easystack_from_pr` into smaller functions

The function build_easystack_from_pr in tasks/build.py has gotten quite long over time. Refactoring it into smaller functions should make the code more readable and also help reviewing future pull requests. Smaller functions might then also be reused by other components of the bot.
