g-node / gin-proc

License: BSD 3-Clause "New" or "Revised" License

Python 58.56% Dockerfile 1.08% JavaScript 3.60% Vue 36.01% Shell 0.75%

development-paused

gin-proc's People

Watchers

Forkers

mpsonntag mrinalwahal abhinav77205 alokkumar8 abhishek0912 lgtm-migrator

gin-proc's Issues

Enable Docker Secrets to store user's Private Keys

The micro-service currently runs on my host machine, so access to SSH keys isn't an issue. Turning the service into a Docker image, however, would require us to either mount external volumes to access the user's SSH keys or use something like Docker secrets.
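
A minimal sketch of how the service could look up the key once it runs in a container. Docker mounts each secret as a file under /run/secrets/<name>; the secret name gin_proc_ssh_key and the fallback variable GIN_PROC_SSH_KEY_FILE are made up for this example and are not part of the current code.

```python
import os
from pathlib import Path

# Docker mounts each secret as a file under /run/secrets/<secret name>.
SECRETS_DIR = Path("/run/secrets")


def read_private_key(secret_name="gin_proc_ssh_key", secrets_dir=SECRETS_DIR,
                     fallback_env="GIN_PROC_SSH_KEY_FILE"):
    """Return the private key text from a Docker secret or a fallback file.

    secret_name and fallback_env are hypothetical names for this sketch.
    """
    secret_path = Path(secrets_dir) / secret_name
    if secret_path.is_file():
        return secret_path.read_text()
    # Fallback for host runs: an env variable pointing at a key file
    # (for example one on a mounted volume).
    fallback = os.environ.get(fallback_env)
    if fallback and Path(fallback).is_file():
        return Path(fallback).read_text()
    raise FileNotFoundError(
        "no private key found for secret {!r}".format(secret_name))
```

The fallback keeps the current host-machine workflow usable while the secret-based path is being tested.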

Logging

We can design a debug mode for the Flask server and enable it with an env variable, something like DEBUG=TRUE, when the user launches the gin-proc container.

Or we could:

Enable logging
Rotate those logs periodically in the user's mounted volumes.

This may not be a priority right now, but I think it's a professional principle to go by. It should make it easier for users to debug their server's problems later.
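
Both options can be combined. The sketch below uses only the standard logging module; the logger name, the log file name, and the daily rotation schedule are assumptions chosen for illustration.

```python
import logging
import os
from logging.handlers import TimedRotatingFileHandler


def configure_logging(log_dir, env=os.environ):
    """Log to a daily-rotated file; DEBUG=TRUE switches to verbose output."""
    debug = env.get("DEBUG", "").strip().lower() in ("1", "true", "yes")
    logger = logging.getLogger("gin-proc")  # logger name is an assumption
    logger.setLevel(logging.DEBUG if debug else logging.INFO)

    # Drop handlers from an earlier call so messages are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()

    os.makedirs(log_dir, exist_ok=True)
    handler = TimedRotatingFileHandler(
        os.path.join(log_dir, "gin-proc.log"),
        when="midnight",   # start a new file every day
        backupCount=7,     # keep one week of old logs
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger
```

Here log_dir would point into the volume the user mounts into the container.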

Write CI config in yaml format from python

Currently I'm using a workaround that reads the default saved template files of CI configurations and adds the required lines after detecting hash break-points, e.g. #add-annex-files.

I feel this is an unprofessional way of doing it. Although it does the job for now, it would be cleaner and more hackable to do this with something like PyYAML.

This will also allow us to read the yaml configs at a later stage and only replace the values that we want to, instead of writing and replacing the entire config - as we are doing currently.
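
As a rough illustration of the PyYAML route, the sketch below loads a config, changes one step, and writes it back. It assumes the config has a top-level steps list whose entries carry a name and an optional commands list; add_commands is a hypothetical helper, not existing gin-proc code.

```python
import yaml  # PyYAML, a third-party package


def add_commands(config_text, step_name, new_commands):
    """Append commands to one named step and return the updated YAML text."""
    config = yaml.safe_load(config_text)
    for step in config.get("steps", []):
        if step.get("name") == step_name:
            # Only this step is modified; everything else round-trips as-is.
            step.setdefault("commands", []).extend(new_commands)
    # sort_keys=False keeps the key order of the loaded config.
    return yaml.safe_dump(config, sort_keys=False)
```

Note that a plain load/dump cycle drops YAML comments, so the #add-annex-files style markers would no longer be needed or preserved.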

Default clone step: GIN-specific docker container with script

Since we need to override the default clone step to clone via SSH and download annexed data, the clone step should be handled by a container that performs all the necessary steps. We could host this on dockerhub and make it GIN specific. Ideally, the clone step in the drone.yml configuration for all repositories should be as follows:

- name: clone
  image: docker:gnode/gin-proc-clone
  environment:
    SSH_KEY:
      from_secret: DRONE_PRIVATE_SSH_KEY

The container will be built with git and git-annex and the entrypoint should be a script that uses the default drone environment to clone the repository (e.g., git clone $DRONE_REMOTE_URL for the initial clone step).

The gin-proc web service could also add extra fields for specifying which annexed file content to download (if not everything). The clone plugin would then use a predefined env variable (that the gin-proc web service always sets) to determine which annexed files to download.
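
A possible shape for the entrypoint logic, written in Python for consistency with the backend. DRONE_REMOTE_URL is the variable mentioned above; GIN_PROC_ANNEX_FILES is a hypothetical name for the predefined variable, assumed here to hold one path per line, with an empty value meaning all annexed content.

```python
import shlex

# Hypothetical variable that the gin-proc web service would always set.
ANNEX_FILES_VAR = "GIN_PROC_ANNEX_FILES"


def clone_commands(env):
    """Build the shell commands for the clone step from the environment."""
    commands = [
        "git clone {} .".format(shlex.quote(env["DRONE_REMOTE_URL"])),
    ]
    wanted = [line for line in env.get(ANNEX_FILES_VAR, "").splitlines()
              if line.strip()]
    if wanted:
        targets = " ".join(shlex.quote(name) for name in wanted)
    else:
        targets = "."  # nothing listed: fetch all annexed content
    commands.append("git annex get " + targets)
    return commands
```

Setting up the SSH key from SSH_KEY is left out of this sketch.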

Snakemake directory handling

The SnakeFile Path (location of the snakemake file) should be an optional input.
When the user specifies a directory, the drone.yml should include a line for switching to the Snakemake directory before running the build.

If no path is specified, the root of the repo should be assumed, so no directory change should be performed.
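
A small sketch of that rule, assuming the build commands are collected as a list of shell strings; snakemake_commands is a hypothetical helper and the bare snakemake invocation is a placeholder for whatever the build actually runs.

```python
import shlex


def snakemake_commands(snakefile_dir=None):
    """Return the build commands, changing directory only when one is given."""
    commands = []
    if snakefile_dir:
        # Quote the path so directories with spaces survive the shell.
        commands.append("cd {}".format(shlex.quote(snakefile_dir)))
    commands.append("snakemake")
    return commands
```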

Specific exception catching

A lot of functions are wrapped in try .. except blocks with a catch-all Exception handler. In some cases, the block is doing things that aren't likely to cause exceptions (like appending items to a list). We should have more specific exception handlers and only have them where necessary.

Overwrite drone.yml when it can't be read for update

When trying to update an existing drone.yml file, if it can't be read for some reason, just overwrite it with the new data.
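
One way this could look, using narrow exception handlers rather than a catch-all. Both function names are hypothetical, and merge stands in for whatever update logic is applied when the existing file can be read.

```python
def read_existing_config(path):
    """Return the file's text, or None if it is missing or unreadable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except (OSError, UnicodeDecodeError):
        # Only the failures that reading a file can actually produce.
        return None


def write_config(path, new_text, merge=None):
    """Update the config at path, overwriting it when it cannot be read."""
    existing = read_existing_config(path)
    if existing is not None and merge is not None:
        new_text = merge(existing, new_text)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(new_text)
    return new_text
```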

Code quality and style

I intend to add automatic checks on pull requests for these, but for now, here's a list of some of the code style issues that need to be addressed:

os.system() should never be used. All instances should be replaced by subprocess.call() or check_output().
Indent Python code with 4 spaces (no tabs).
Adhere to PEP8 strictly.
Concatenate strings that represent filesystem paths using os.path.join().
Commented out code should be removed, or if there's a good reason for it to remain, it shouldn't be a """string""" but a # comment.
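
For the first and fourth points, a short sketch of the intended style. The helper names run and repo_file are placeholders, not functions from the codebase.

```python
import os
import subprocess


def run(args, cwd=None):
    """Run a command given as an argument list and return its stdout.

    Unlike os.system(), no shell is involved, and a non-zero exit status
    raises subprocess.CalledProcessError instead of being ignored.
    """
    return subprocess.check_output(args, cwd=cwd, text=True)


def repo_file(repo_dir, *parts):
    """Build a path inside a repository without manual string concatenation."""
    return os.path.join(repo_dir, *parts)
```

A call such as run(["git", "status", "--short"], cwd=repo_dir) then replaces the equivalent os.system() string.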

Review uses of username as repository owner

In some cases the service assumes that the logged-in user is also the owner of the repository and uses that user's username to construct the repository path for API calls. This isn't always true: users can enable builds and write configurations for collaborative repositories (either through sharing or as part of an organisation).

We should review all cases where the current user's username is used to infer the repository full name.
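
A sketch of the alternative: take the full name from the repository object itself. This assumes the API returns repository objects with a full_name field and an owner object containing username, as Gogs-style APIs usually do; the field names should be checked against the actual GIN responses.

```python
def repository_full_name(repo):
    """Return 'owner/name' from a repository object, not from the session user."""
    full_name = repo.get("full_name")
    if full_name:
        return full_name
    # Fall back to the owner recorded on the repository itself.
    return "{}/{}".format(repo["owner"]["username"], repo["name"])
```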

Key description should mention gin-proc

The name/description of the key that the gin-proc service installs for the user should be called gin-proc or something similar to make its purpose clear to the user.

gin-proc/back-end/service.py

Line 32 in 9690416

PRIV_KEY = 'gin_id_rsa'

Push build output to private gin-proc repo and share with user

This is an alternative idea to the current workflow of pushing output to a gin-proc branch of the original repository.

One of the original ideas we had for serving build output to the user was having a data store that would serve archives. The output would be privately accessible, either using credentials or by secret URLs available only to the user.

This led me to the idea that we could use GIN repositories as data stores. The workflow would be:

When the user enables gin-proc builds on their repository (when the hook is created), create a private repository as the gin-proc user on GIN with the name gin-proc/<user>-<repository> (the data repository). The name is guaranteed to be unique, since repository full names of the form <user>/<repository> are unique.
The originating user is added as a collaborator on the repository.
- The user could be added as a read-only collaborator. This would prevent users from adding commits to the data repository and creating conflicts on subsequent builds, or deleting the repository, creating issues for gin-proc.
On a successful build, newly created files, or more specifically the output files specified by the user in their configuration, are moved to a local clone of the data repository and pushed.
- Subdirectories can be used to separate different builds. Alternatively, each new build can create a new commit with a message specifying which build number it was. Commits can even be tagged.

Benefits of this approach:

When a user visits gin.g-node.org/gin-proc, they will see a list of data repositories for their CI enabled repositories only.
Users have access to the data store but they can't remove any data or add extra commits that would create conflicts for the gin-proc user.
- In the current branch-based method, users can delete the branch or modify files in it, which can create issues for the gin-proc user. We could work around this. It might also be desirable to allow the user to delete or modify build outputs, so this point is not a clear benefit.
The data repositories can be used to store intermediate outputs from snakemake as well as build outputs (in an appropriately named subdirectory). This might help users with troubleshooting builds when something goes wrong, without requiring the hindsight of specifying storage of intermediate files.

Disadvantages of this approach (versus branch-based):

User has no direct control over the storage of their build outputs.
- This is the counterpart to the second point above and depends on whether we want to give that control to the users and how we handle conflicts and other issues.
If we store intermediate snakemake files in a user-accessible repository, we lose the ability to clear old build outputs if (when) storage becomes an issue, because they will always be part of the git history.
- We can get around this with a clearly written policy that old builds are deleted after a certain period of time. To do this, however, we would have to delete git history (or annexed data), potentially causing issues for users who have cloned their data repository.

For now, we should move forward with the branch-based method, since it's more straightforward. I thought this idea would require GOGS changes as well at first, since there was no API call to add collaborators, but that's available now.

Feel free to use this issue for discussions on this idea and any alternatives.

Quote filenames for git/annex operations

Files listed for git annex get should be individually quoted (like they are for commit & push) and there should be no trailing space.

gin-proc/back-end/config.py

Lines 70 to 75 in 606541f

if len(files) > 0:
    input_files = ''
    for filename in files:
        input_files += '{} '.format(filename)

    commands.append("git annex get {}".format(input_files))
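
A possible fix, sketched as a standalone helper (annex_get_command is a hypothetical name). shlex.quote handles the per-file quoting and str.join avoids the trailing space.

```python
import shlex


def annex_get_command(files):
    """Return a 'git annex get' command with each filename quoted, or None."""
    if not files:
        return None
    # join() puts a space between items only, so nothing trails the last one.
    return "git annex get " + " ".join(shlex.quote(name) for name in files)
```

The caller would append the result to commands only when it is not None.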

Disable DEBUG on Flask Before Production

Support different types of processing pipelines

A nice feature would be to support different types of pipelines that make it easier for users to set up common kinds of pipelines. For instance, we could start by offering two types of builds:

Custom: This just lets users enter any sequence of (shell) commands and they are run in order during the processing step.
Snakemake: Asks the user to specify a directory where the snakefile is located (defaults to root of the repo) and automatically:
- uses an image that has snakemake installed
- creates a processing step that changes to the specified directory and runs the snakemake pipeline.

More advanced features can then be added to the second option (snakemake) for caching intermediate steps/files, figuring out dependencies, etc.

In the first option, all the "smart" features are disabled so the user can just run any script they want without caching or dependency management. It would all be up to the user to figure out.

Exception Handling and Code Cleanup

Full service deployment compose file

As it stands the project consists of three services:

The web frontend (in vue.js):
- Login (using GIN credentials)
- Web form for user input that is sent to the web backend
The web backend (in Python Flask):
- Creates the drone.yml
- Sets up the hook
- Sets up keys
The Drone service:
- Runs the pipeline(s)

We should have a docker-compose.yml file that sets up two containers to work together, one for the frontend and backend web services and another for the Drone container.

Configurable auth backend address

The backend address (for auth and api routes) in the frontend (see lines below) should be configurable. These should be set to an externally accessible address that the user's browser, running the frontend, can access in order to log in.

gin-proc/front-end/nuxt.config.js

Line 49 in 9690416

baseURL: 'http://localhost:8000/auth'

gin-proc/front-end/pages/index.vue

Line 163 in 9690416

var API = "http://127.0.0.1:8000/api"

User defined output files push

Use drone.yml to define which files from the CI build output should be pushed after a build is done.

Rekey existing repositories when secrets change

Since we set up keys in drone configurations (see #12) and the key can change when gin-proc loses a key or a user deletes the public key in gogs (see #31), a key change should also trigger an update of the secrets for all activated drone configurations.

Misc drone.yml fixes

Custom workflow support has a bug with the processing of command lines
Frontend should complain when no repository is selected and not submit the form
Fix drone.yml rewrite removing needed lines (e.g., git annex init)

Overwrite/delete existing key on GIN

Currently if the service finds a key with the same name as the one it uses in the user's GIN configuration, it asks the user to delete it. If we use a key name specific to the service (gin-proc) the service can detect it and overwrite it instead of asking the user to do it.

Automatically set up Drone secrets via web frontend

When a user sets up a gin-proc build configuration via the web frontend, the web backend sets up a key pair for itself to push the drone.yml. It should also set up the Drone service for the user, by adding the private key into the secrets for the build.

Automatic SSH key pair setup

When setting up the configuration for the user, the SSH key pair should be generated and set up automatically. The public part should be added to the user's profile (via the GIN API /api/v1/user/keys) and the private part should be stored on the GIN Proc server, accessible by the service. All clone steps can then reference the private key via a plugin for external secrets.

The details of the secrets plugin need to be discussed further.
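
For the public-key half, a sketch of the upload request built with the standard library. It assumes GIN follows the Gogs API conventions (a JSON body with title and key, and an "Authorization: token ..." header); base_url and token are placeholders supplied by the caller.

```python
import json
import urllib.request


def build_add_key_request(base_url, token, public_key, title="gin-proc"):
    """Build (but do not send) the request that registers a public key."""
    payload = json.dumps({"title": title, "key": public_key}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/user/keys",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Token-style auth header; an assumption based on the Gogs API.
            "Authorization": "token " + token,
        },
    )
```

Sending it is then a matter of passing the request to urllib.request.urlopen and handling the HTTP errors that can result.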

Web frontend for configuring builds

A web frontend that guides users to set up their builds. It should give users the ability to configure the most important options (input files, scripts/pipelines to run, output files to be pushed back) and generate a drone.yml (or update an existing one) for the builds.

Use git annex copy to upload files

gin-proc/back-end/config.py

Line 63 in 606541f

commands.append('git annex sync --content')

This will also download old build files in the working directory, which we don't want.

Use git annex copy --to=origin <filenames> instead.

Enable shared volume between pipeline steps

Currently, the entire pipeline runs in a single step. We need to allow users to add their own intermediate pipeline steps, and for that all steps need access to a shared volume.

Don't gitignore snakemake

gin-proc/back-end/config.py

Line 86 in 606541f

commands.append('echo ".snakemake/" > .gitignore')

This is unnecessary since we're committing specific files by name for the push after the workflow is complete.

Checkout gin-proc branch if it already exists

gin-proc/back-end/config.py

Line 49 in 606541f

commands.append('git checkout --orphan gin-proc')

This will fail the second time when the branch already exists.
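
One possible replacement, sketched as a helper that returns the shell command. It assumes the clone has fetched the remote gin-proc branch when one exists, so a plain checkout can find it; checkout_output_branch is a hypothetical name.

```python
def checkout_output_branch(branch="gin-proc"):
    """Reuse the branch when it exists, otherwise create it as an orphan."""
    # '||' runs the orphan checkout only if the plain checkout fails.
    return "git checkout {0} || git checkout --orphan {0}".format(branch)
```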

Error handling

What happens with the build if a step in the drone.yml produces an error?

It should probably handle errors differently depending on the build step: setup, processing, output handling, etc. For example, if a user specifies some files to be pushed back on success but some of the files don't exist, it should probably warn that some of the specified files don't exist but it shouldn't fail completely (or at least, the files that do exist should be pushed).
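
For the output-file example, the check could be as simple as splitting the requested files into those that exist and those that don't. This is only a sketch; split_existing is a hypothetical helper.

```python
import os


def split_existing(files, workdir="."):
    """Split requested output files into (existing, missing) lists."""
    existing, missing = [], []
    for name in files:
        if os.path.exists(os.path.join(workdir, name)):
            existing.append(name)
        else:
            missing.append(name)
    return existing, missing
```

The build would then warn about the missing list and push only the existing one.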

Web service: Prototype

Full web service should meet the requirements specified in Issue #8. This issue describes what we would need from a prototype of the service:

Should ask for user input and generate a valid drone.yml file with the following information:

Input files: A list of files to git annex get ... after git clone is complete. Empty value should imply all files.
Processing step: A single script line would be enough for a prototype.
Output files: A list of files to git add ... (annex filtering could be done using a default config) followed by git push.

Filenames aren't read properly from the input form

The generated drone.yml doesn't include the user specified files, but instead just the numbers 1, 2, etc.

Allow git-annex push for CI outputs

As of now just git push is allowed for storing CI outputs.

Don't sync content if no annex files are needed

gin-proc/back-end/config.py

Line 77 in 606541f

commands.append("git annex sync --content")

This line should be removed.

Editing existing workflows

It would be nice if the guide could read any existing drone.yml and populate the fields with existing values for the user to edit.

Enable Volume caching for faster CI jobs

As of now, every pipeline runs from scratch in a separate container. Caching the intermediate build files in a volume would avoid this and speed up build jobs.

Update Slack webhook

Current webhook is expired, and a fresh webhook needs to be generated before pushing the service to production.

Documentation and API

User documentation

We need user documentation that explains how users should interact with the gin-proc interface, what it does, and what they can expect.

API and internals

Developer/maintainer documentation that explains the program flow, API endpoints for gin-proc backend and how front-end, back-end, and GIN/GOGS interact at each step.

Web frontend should only create drone.yml

The web frontend should only create (or edit) the drone.yml for the user and nothing else. If the user needs to run a snakemake pipeline they will have their own snakemake file and it's up to them to define the processing steps required to run it. The gin-proc service should not create or edit a user's snakemake configuration.

Authenticate user from GIN web app

In the testing environment, I'm authenticating the dev user with a manually created personal access token. Eventually, we'll have to authenticate users with their GIN username and password, the way Drone does.

Fetch (or accept) host key from GIN server

When the backend service clones for the first time it will be prompted to accept the SSH host key. Fetching of the key should be part of the setup process.

See ssh-keyscan.
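
A sketch of what that setup step might look like. The function names are hypothetical, and the scan parameter exists only so the command runner can be swapped out (for example in tests). Keys fetched this way are trusted on first use, so comparing them against a published fingerprint would be a sensible extra step.

```python
import subprocess


def keyscan_command(host, port=22):
    """Argument list for fetching a server's public host keys."""
    return ["ssh-keyscan", "-p", str(port), host]


def add_host_key(host, known_hosts_path, port=22, scan=subprocess.check_output):
    """Append the host's keys to known_hosts so the first clone isn't prompted."""
    keys = scan(keyscan_command(host, port), text=True)
    with open(known_hosts_path, "a", encoding="utf-8") as handle:
        handle.write(keys)
    return keys
```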
