A Cloud/Web-Based Simulation Environment

Home Page: https://kitware.github.io/HPCCloud/

License: Apache License 2.0

JavaScript 66.22% HTML 0.43% Python 33.12% Shell 0.19% CMake 0.04%

simulation-environment cloud hpc

hpccloud's Introduction

HPCCloud

Goal

A web interface to the HPCCloud infrastructure that abstracts the simulation environment and the resources on which you can run those simulations.

Installation

Follow the instructions for HPCCloud-Deploy.

Development

$ git clone https://github.com/Kitware/HPCCloud.git
$ cd HPCCloud
$ npm install
$ npm start

Troubleshooting

(With the VM running from HPCCloud-Deploy)

$ vagrant ssh
$ sudo -iu hpccloud

Fixing celery Girder URL

$ vi /opt/hpccloud/cumulus/cumulus/conf/config.json
  +-> Fix host to be localhost
  +-> baseUrl: "http://localhost:8080/api/v1",
$ sudo service celeryd restart

Documentation

See the documentation in this repository for a getting started guide, advanced documentation, and workflow descriptions.

Licensing

HPCCloud is licensed under Apache 2.

Getting Involved

Fork our repository and do great things. At Kitware, we've been contributing to open-source software for 15 years and counting, and we want to make hpc-cloud useful to as many people as possible.

hpccloud's Issues

In the simulation start dialog, cap the number of instances in the cluster

We have the following two endpoints:

GET /user/{id}/aws/profiles/{profileId}/maxinstances
GET /user/{id}/aws/profiles/{profileId}/runninginstances

The maximum number of instances that a user can create in a cluster should be the result of the first request minus the result of the second.
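
As a minimal sketch of that subtraction, assuming each endpoint's response has already been reduced to a plain number (the response shape is not described here, and the function names are hypothetical):

```javascript
// Sketch only: how many more instances a user may still request.
// Assumes both counts were already extracted from the two GET responses.
function availableInstances(maxInstances, runningInstances) {
  // Never report a negative capacity if the running count exceeds the limit.
  return Math.max(0, maxInstances - runningInstances);
}

// Clamp the value entered in the start dialog to the allowed range.
function capRequestedInstances(requested, maxInstances, runningInstances) {
  var limit = availableInstances(maxInstances, runningInstances);
  return Math.min(Math.max(0, requested), limit);
}
```

The dialog could use the first function to set the upper bound of the input and the second to correct a value that was typed in by hand.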

Ensure that when a user registers the correct directory structure is created for them

Our deployment script creates a directory in the hydra collection and gives the user access to it. @jourdain Why do we store the user data under the hydra collection? Doesn't it make more sense to store this user-specific data under the user?

The stderr and stdout file from sge should be viewable when a simulation is complete

When a simulation is complete, these files are uploaded to Girder. We should provide a mechanism for them to be viewed in the client. They will be particularly important for debugging setup issues in a traditional cluster. For example, this is the output for a job where the hydra executable was incorrectly defined:

--------------------------------------------------------------------------
mpiexec was unable to launch the specified application as it could not access
or execute an executable:

Executable: /hydra
Node: ulmus

while attempting to start process rank 0.
--------------------------------------------------------------------------

Register button is a little confusing

I keep clicking on it when trying to log in. The arrow icon should probably say 'Sign in' or something like that, and the register button should be a smaller link.

Status event broadcast seems to fail sometimes

I see the following in the console:

broadcasting SSE: task.status running

However, the simulation status is not updated

Add AWS resource usage to profile view

A simple overview of resources used, something like:

1 Running instance
2 Volumes

We can work out the endpoints to call.

Remove option to visualize results on webserver

@jourdain If we remove the ability to upload data to Girder then we will need to disable this feature, which I think we probably want to do anyway?

Clone copies over output file from previous simulation

It should only copy over the simput files.

Hide cost for simulations on traditional clusters?

They don't accrue cost, correct?

Tasks that go into failure state disappear from the UI

Consider moving the following parameters into the cluster profile

I'm not sure we want the user to have to type these in each time. We at least need to provide them with a way to set defaults.

Hydra executable path
Parallel Environment
Number of slots
Job Output Directory

Unable to register new user

I have a test instance running at:

http://52.27.36.93/

When I click register nothing happens ... no errors in console.

Task status needs to be fetched on reload if not in a completed state

Currently the simulation status is only updated when an SSE is received. If, however, the user refreshes their page or is logged out when the event is received, the simulation will not move into the correct state. So we need to query the status endpoint in these cases.
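
A rough sketch of the reload check follows. The list of terminal state names is an assumption made for illustration, as are the function names; the real status values would need to come from the backend.

```javascript
// Hypothetical names for states after which no further updates are expected.
var TERMINAL_STATES = ["complete", "terminated", "failure", "error"];

// True when a status could still change and should be re-queried on reload.
function needsStatusRefresh(status) {
  return TERMINAL_STATES.indexOf(status) === -1;
}

// Given the tasks known to the page, return the ids whose status
// should be fetched from the status endpoint after a reload or login.
function tasksToRefresh(tasks) {
  return tasks
    .filter(function (task) { return needsStatusRefresh(task.status); })
    .map(function (task) { return task._id; });
}
```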

400 (Bad Request) when accessing folder.

I am seeing the following error when entering the nuclear simulation project:

POST http://localhost:9000/api/v1/folder?parentId=hydra-th&parentType=collection&name=websimdev&description=websimdev%27s%20projects 400 (Bad Request)

The response is:

{"field": "id", "message": "Invalid ObjectId: hydra-th", "type": "validation"}

Looks like we are passing an invalid ObjectId.
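
One way to catch this on the client before the request is sent is a quick format check. A MongoDB ObjectId is written as 24 hexadecimal characters, so a collection name such as hydra-th cannot be one. The helper name below is hypothetical.

```javascript
// Sketch: reject values that cannot be an ObjectId before calling the API.
// An ObjectId string is exactly 24 hexadecimal characters.
function looksLikeObjectId(value) {
  return typeof value === "string" && /^[0-9a-fA-F]{24}$/.test(value);
}
```

This only checks the format. It does not tell us whether a collection with that id exists, so the name would still have to be resolved to its id first.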

When running visualization steps we will need to be able to configure the ParaView path

When running visualization on a traditional cluster we will need to be able to configure the ParaView path.

Add view for users to configure a "traditional" HPC cluster

We need to make it clear how a simulation is started

Every time I try to start a simulation I have trouble finding the correct button ...

Status icon should update as task status changes to failure

If a simulation moves into an error state the log is no longer available

In the simulation start dialog we need to provide an option to enable or disable upload

Enable or disable the upload of the simulation output. This should be used to determine if 'output.item.id' is passed when creating the task. If this value is not specified no upload will occur.
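
A small sketch of how the dialog option could drive the request body. It assumes 'output.item.id' maps to a nested output object, which is a guess at the payload shape, and the function name is hypothetical.

```javascript
// Sketch: only include output.item.id when upload is enabled.
// The nested { output: { item: { id } } } shape is an assumption.
function buildTaskOutput(uploadEnabled, outputItemId) {
  var body = {};
  if (uploadEnabled && outputItemId) {
    body.output = { item: { id: outputItemId } };
  }
  // With no output entry, no upload will occur.
  return body;
}
```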

The state of a simulation is not updated after it has been terminated

Still see the green rocket, even after termination.

Replace occurrences of cmb with hpcc

We have cmb in the directory structure etc. Would be nice to clean this up.

Unrendered templates are displayed on load ( see screenshot )

I know there is a way of fixing this but can't remember it ;-)

When testing a trad cluster an error is always reported

A toast appears reporting an error even if the test is successful.

Tailing of log file doesn't start if the simulation panel is already open

When annotating a simulation the simulation name is not visible

Add new parameter to trad sim run dialog

It should be called:

"Output directory"
This should be an optional full path to the directory where the output of a job will be written.

The value should be passed to /tasks/<task_id>/run as 'jobOuputDir'
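
For illustration, the optional parameter could be merged into the run parameters like this. The function name is hypothetical, and the key is spelled exactly as given above.

```javascript
// Sketch: add the optional output directory to the run parameters.
// The key name 'jobOuputDir' is copied as written in this issue.
function buildRunParams(baseParams, outputDir) {
  var params = Object.assign({}, baseParams);
  var trimmed = (outputDir || "").trim();
  if (trimmed !== "") {
    params.jobOuputDir = trimmed;
  }
  // When the field is left empty the key is not sent at all.
  return params;
}
```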

AWS profile region restrictions

AWS creates a default VPC per region. Starcluster creates its instances in this VPC by default, so the cluster instances are not accessible from the outside world. For this to work, Starcluster must be run from a machine that is in the VPC. Starcluster also has the option to assign public IPs to all nodes in the cluster, which removes this restriction, although there is potentially a greater security risk.

So what does this mean for the client? We need to expose a check box indicating whether to use public IPs on cluster nodes. This check box will behave differently depending on whether the HPCCloud stack is deployed on EC2 or not.

EC2 deployment

Check box is checked - We configure Starcluster to use public IPs and the full list of AWS regions is available to choose from.
Check box is unchecked - We have to restrict the regions that can be used to the region the stack has been deployed in, so that Starcluster will have access to the default VPC. The region will be configured into a JSON file as part of the deployment process.

Non EC2 deployment

In this case the only option is to use public IPs for every node, as Starcluster will be running outside EC2.
The check box should be checked and disabled, and all regions will be available to the user to create clusters in.
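
The rules above could be captured in one function, sketched here. The option names (onEc2, usePublicIp, allRegions, stackRegion) are made up for this example; stackRegion stands for the region read from the deployment's JSON file.

```javascript
// Sketch: derive the check box state and selectable regions.
// opts: { onEc2, usePublicIp, allRegions, stackRegion } (hypothetical names)
function regionChoices(opts) {
  if (!opts.onEc2) {
    // Outside EC2 public IPs are the only option: checked and disabled.
    return { usePublicIp: true, checkboxDisabled: true, regions: opts.allRegions.slice() };
  }
  if (opts.usePublicIp) {
    // On EC2 with public IPs: every region can be used.
    return { usePublicIp: true, checkboxDisabled: false, regions: opts.allRegions.slice() };
  }
  // On EC2 without public IPs: only the region the stack is deployed in.
  return { usePublicIp: false, checkboxDisabled: false, regions: [opts.stackRegion] };
}
```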

Title and description alignments

(side note, I also filed this in the completely wrong repository).

Add view to tail the log of a cluster or job

Both of these resources have a log endpoint. It would be nice to expose this in a similar way to how you can view a log file on Travis ...

Visualize simulation after complete

When running a sim on a trad cluster the OK button should be disabled until a cluster is selected

Look at supporting Google OAuth login

Girder has a plugin that supports this, and it might be nice to add support for it in the UI. I have done this for another project, so I have a bit of an idea of what we need to do.

Use SSE event for status updates

Once Kitware/cumulus#35 is merged, the backend will produce server-sent events for status updates. This should reduce the amount of polling that we need to do in the UI. There might be some nice Angular way of accessing these events, but here is some sample code in raw JavaScript.

        if (window.EventSource) {
            // The timeout is the number of seconds we should wait without an event before closing the connection ...
            var url = girder.apiRoot + '/notification/stream?timeout=300';

            var eventSource = new window.EventSource(url);

            eventSource.onmessage = function (e) {
                var obj;
                try {
                    obj = window.JSON.parse(e.data);
                } catch (err) {
                    console.error('Invalid JSON from SSE stream: ' + e.data + ',' + err);
                    return;
                }
                /* Do something useful with the event. See description of event object below */
            };
        } else {
            console.error('EventSource is not supported on this platform.');
        }

The event object has the following form:

{
  "type": "<profile.status|job.status|cluster.status|task.status>",
  "data": {
    "_id": "<the id of the resource>",
    "status": "the new status"
  }
  // There are other properties, but these can be ignored for our purposes
}
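
As a sketch of the "do something useful" step, a parsed event of this form could be applied to a simple id-to-status map. The map and the function name are assumptions for illustration, not part of the existing UI code.

```javascript
// The four event types listed in the form above.
var STATUS_EVENT_TYPES = ["profile.status", "job.status", "cluster.status", "task.status"];

// Sketch: record the new status for the resource named in the event.
// statusById is a hypothetical plain object keyed by resource id.
// Returns true when the event was recognised and applied.
function applyStatusEvent(statusById, event) {
  if (!event || !event.data || STATUS_EVENT_TYPES.indexOf(event.type) === -1) {
    return false; // Ignore anything that is not a status event.
  }
  statusById[event.data._id] = event.data.status;
  return true;
}
```

It would be called from the onmessage handler with the parsed object, after which the view can re-render from the map.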

Add views for EBS volumes

Currently the only configurable parameter to expose to the user is the size.

The REST endpoint can be found here: http://localhost:8080/api/v1#!/volumes

Clone and delete of simulation needs some progress indicator

These operations can take several seconds. We need some sort of progress indicator otherwise I will keep clicking buttons!

It is currently possible to delete a running simulation

It should not be possible to delete a running simulation.

Deleting a simulation in the UI does not call the tasks delete endpoint

To clean up resources associated with a task the following endpoint should be called:

DELETE /tasks/<task_id>
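
A minimal sketch of building that request, assuming the API root is a URL string such as the one used elsewhere in these issues. The function name and the returned descriptor shape are hypothetical.

```javascript
// Sketch: describe the cleanup request for a task.
// Returns a plain descriptor that whatever HTTP layer is in use can send.
function deleteTaskRequest(apiRoot, taskId) {
  var root = apiRoot.replace(/\/+$/, ""); // Drop any trailing slashes.
  return {
    method: "DELETE",
    url: root + "/tasks/" + encodeURIComponent(taskId)
  };
}
```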

Add another simulation type

It would be nice to be able to run another simulation as well as hydra. @jourdain and @patrickoleary can probably advise here.

Add view for adding and configuring an AWS profile

An AWS profile has the following properties:

Access Key ID
Secret Access Key
Availability Zone
Name
Region

The REST endpoint can be found here: http://localhost:8080/api/v1#!/user/create_profile

Update style of new views ( user preferences ) to match rest of app

We could probably overhaul the style while we are at it

gulp serve throws an error

@TristanWright Did we ever resolve this issue?

[12:18:50] 'watch' errored after 196 ms
[12:18:50] TypeError: Cannot read property 'prototype' of undefined
at Object.exports.inherits (util.js:556:43)
at Object.<anonymous> (/home/cjh/work/source/HPCCloud/node_modules/browser-sync/node_modules/http-proxy/lib/http-proxy/index.js:111:17)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Module.require (module.js:364:17)
at require (module.js:380:17)
at Object.<anonymous> (/home/cjh/work/source/HPCCloud/node_modules/browser-sync/node_modules/http-proxy/lib/http-proxy.js:4:17)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Module.require (module.js:364:17)
at require (module.js:380:17)
at Object.<anonymous> (/home/cjh/work/source/HPCCloud/node_modules/browser-sync/node_modules/http-proxy/index.js:13:18)

Do status labels show task status or item status?

Split RESTful endpoints into separate services

kw.girder.js is getting unreasonably large for one single service, much less one file. We should split it up into specific endpoints, e.g. /user, /tasks, etc., and inject them as they're needed. Further splitting them into specific girder and plugin endpoints would help a lot too.

Add logging information to simulation

The task resource has a log property that we should expose to the user. The property is of the following form:

  "log": [
    {
      "$ref": "http://ulex:9001/api/v1/jobs/56030771f657105ab7913b18/log", ...
    }

This is basically an aggregate log of all the resources created by the task, for example the cluster and the jobs that are submitted to it. So the pseudocode for tailing the task log would be as follows:

while task is running:
    fetch task
    for each log in task['log']:
        fetch log, providing the offset of lines already fetched
        add lines to log in UI
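
One pass of the inner loop could look like the sketch below. Both fetchLog(url, offset), which is expected to return only the entries after the given offset, and appendLine are hypothetical helpers passed in by the caller. fetchLog is synchronous here to keep the example short; a real one would be asynchronous.

```javascript
// Sketch: a single pass over task['log'], fetching only new entries.
// offsets maps each log URL to the number of entries already shown.
function tailStep(task, offsets, fetchLog, appendLine) {
  (task.log || []).forEach(function (ref) {
    var url = ref["$ref"];
    var offset = offsets[url] || 0;
    var entries = fetchLog(url, offset); // Entries after `offset` only.
    entries.forEach(appendLine);
    offsets[url] = offset + entries.length;
  });
  return offsets;
}
```

The outer loop would re-fetch the task and call this repeatedly while the task is running, since new log references may appear as clusters and jobs are created.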

The log entries from the resources are of the following form:

{
      "args": [],
      "created": 1442957579.78141,
      "exc_info": null,
      "exc_text": null,
      "filename": "job.py",
      "funcName": "_tail_output",
      "levelname": "INFO",
      "levelno": 20,
      "lineno": 288,
      "module": "job",
      "msecs": 781.4099788665771,
      "msg": "Skipping tail of 5601c8e4f657105a8f7ce936/output/stat.txt as file doesn't currently exist",
      "name": "starcluster",
      "pathname": "/home/cjh/work/source/cumulus/cumulus/starcluster/tasks/job.py",
      "process": 22983,
      "processName": "Worker-2",
      "relativeCreated": 1952257.1358680725,
      "thread": 139862483904320,
      "threadName": "MainThread"
}

For now I think we want to present the following information:

With created converted to a human-readable format. It might be nice to allow a user to double click on an entry, which would reveal all the log information.
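
A sketch of the conversion, assuming created is seconds since the Unix epoch, as it appears to be in the entry above. Which fields to show is not settled here, so the choice of levelname and msg is only an assumption, as is the function name.

```javascript
// Sketch: turn one log entry into a single display line.
// `created` is assumed to be seconds since the Unix epoch (with fractions).
function formatLogEntry(entry) {
  var when = new Date(entry.created * 1000).toISOString();
  // Field selection (levelname, msg) is an assumption for illustration.
  return when + " " + entry.levelname + " " + entry.msg;
}
```

An ISO timestamp is used here only because it needs no extra library. A locale-aware format may read better in the UI.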